
Journal of Statistical Software, January 2010, Volume 33, Issue 1. http://www.jstatsoft.org/

Regularization Paths for Generalized Linear Models

via Coordinate Descent

Jerome Friedman
Stanford University

Trevor Hastie
Stanford University

Rob Tibshirani
Stanford University

Abstract

We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems, while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

Keywords: lasso, elastic net, logistic regression, ℓ1 penalty, regularization path, coordinate descent.

1. Introduction

The lasso (Tibshirani 1996) is a popular method for regression that uses an ℓ1 penalty to achieve a sparse solution. In the signal processing literature, the lasso is also known as basis pursuit (Chen et al. 1998). This idea has been broadly applied, for example to generalized linear models (Tibshirani 1996) and Cox's proportional hazard models for survival data (Tibshirani 1997). In recent years there has been an enormous amount of research activity devoted to related regularization methods:

1. The grouped lasso (Yuan and Lin 2007; Meier et al. 2008), where variables are included or excluded in groups.

2. The Dantzig selector (Candes and Tao 2007, and discussion), a slightly modified version of the lasso.

3. The elastic net (Zou and Hastie 2005) for correlated variables, which uses a penalty that is part ℓ1, part ℓ2.


4. ℓ1 regularization paths for generalized linear models (Park and Hastie 2007a).

5. Methods using non-concave penalties, such as SCAD (Fan and Li 2005) and Friedman's generalized elastic net (Friedman 2008), which enforce more severe variable selection than the lasso.

6. Regularization paths for the support-vector machine (Hastie et al. 2004).

7. The graphical lasso (Friedman et al. 2008) for sparse covariance estimation and undirected graphs.

Efron et al. (2004) developed an efficient algorithm for computing the entire regularization path for the lasso for linear regression models. Their algorithm exploits the fact that the coefficient profiles are piecewise linear, which leads to an algorithm with the same computational cost as the full least-squares fit on the data (see also Osborne et al. 2000).

In some of the extensions above (items 2, 3 and 6), piecewise-linearity can be exploited as in Efron et al. (2004) to yield efficient algorithms. Rosset and Zhu (2007) characterize the class of problems where piecewise-linearity exists: both the loss function and the penalty have to be quadratic or piecewise linear.

Here we instead focus on cyclical coordinate descent methods. These methods have been proposed for the lasso a number of times, but only recently was their power fully appreciated. Early references include Fu (1998), Shevade and Keerthi (2003) and Daubechies et al. (2004). Van der Kooij (2007) independently used coordinate descent for solving elastic-net penalized regression models. Recent rediscoveries include Friedman et al. (2007) and Wu and Lange (2008). The first paper recognized the value of solving the problem along an entire path of values for the regularization parameters, using the current estimates as warm starts. This strategy turns out to be remarkably efficient for this problem. Several other researchers have also re-discovered coordinate descent, many for solving the same problems we address in this paper, notably Shevade and Keerthi (2003), Krishnapuram and Hartemink (2005), Genkin et al. (2007) and Wu et al. (2009).

In this paper we extend the work of Friedman et al. (2007) and develop fast algorithms for fitting generalized linear models with elastic-net penalties. In particular, our models include regression, two-class logistic regression, and multinomial regression problems. Our algorithms can work on very large datasets, and can take advantage of sparsity in the feature set. We provide a publicly available package glmnet (Friedman et al. 2009) implemented in the R programming system (R Development Core Team 2009). We do not revisit the well-established convergence properties of coordinate descent in convex problems (Tseng 2001) in this article.

Lasso procedures are frequently used in domains with very large datasets, such as genomics and web analysis. Consequently a focus of our research has been algorithmic efficiency and speed. We demonstrate through simulations that our procedures outperform all competitors, even those based on coordinate descent.

In Section 2 we present the algorithm for the elastic net, which includes the lasso and ridge regression as special cases. Sections 3 and 4 discuss (two-class) logistic regression and multinomial logistic regression. Comparative timings are presented in Section 5.

Although the title of this paper advertises regularization paths for GLMs, we only cover three important members of this family. However, exactly the same technology extends trivially to other members of the exponential family, such as the Poisson model. We plan to extend our software to cover these important other cases, as well as the Cox model for survival data.

Note that this article is about algorithms for fitting particular families of models, and not about the statistical properties of these models themselves. Such discussions have taken place elsewhere.

2. Algorithms for the lasso, ridge regression and elastic net

We consider the usual setup for linear regression. We have a response variable Y ∈ R and a predictor vector X ∈ R^p, and we approximate the regression function by a linear model E(Y|X = x) = β_0 + x'β. We have N observation pairs (x_i, y_i). For simplicity we assume the x_ij are standardized: $\sum_{i=1}^N x_{ij} = 0$, $\frac{1}{N}\sum_{i=1}^N x_{ij}^2 = 1$, for j = 1, ..., p. Our algorithms generalize naturally to the unstandardized case. The elastic net solves the following problem:

$$\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} R_\lambda(\beta_0,\beta) = \min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[\frac{1}{2N}\sum_{i=1}^N (y_i - \beta_0 - x_i^\top\beta)^2 + \lambda P_\alpha(\beta)\right], \qquad (1)$$

where

$$P_\alpha(\beta) = (1-\alpha)\,\tfrac{1}{2}\|\beta\|_{\ell_2}^2 + \alpha\,\|\beta\|_{\ell_1} \qquad (2)$$

$$\phantom{P_\alpha(\beta)} = \sum_{j=1}^p \left[\tfrac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j|\right]. \qquad (3)$$

P_α is the elastic-net penalty (Zou and Hastie 2005), and is a compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1). This penalty is particularly useful in the p ≫ N situation, or any situation where there are many correlated predictor variables.¹

Ridge regression is known to shrink the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. In the extreme case of k identical predictors, they each get identical coefficients with 1/k-th the size that any single one would get if fit alone. From a Bayesian point of view, the ridge penalty is ideal if there are many predictors, and all have non-zero coefficients (drawn from a Gaussian distribution).

Lasso, on the other hand, is somewhat indifferent to very correlated predictors, and will tend to pick one and ignore the rest. In the extreme case above, the lasso problem breaks down. The lasso penalty corresponds to a Laplace prior, which expects many coefficients to be close to zero, and a small subset to be larger and nonzero.

The elastic net with α = 1 − ε for some small ε > 0 performs much like the lasso, but removes any degeneracies and wild behavior caused by extreme correlations. More generally, the entire family P_α creates a useful compromise between ridge and lasso. As α increases from 0 to 1, for a given λ the sparsity of the solution to (1) (i.e. the number of coefficients equal to zero) increases monotonically from 0 to the sparsity of the lasso solution.

Figure 1 shows an example that demonstrates the effect of varying α. The dataset is from Golub et al. (1999), consisting of 72 observations on 3571 genes measured with DNA microarrays. The observations fall in two classes, so we use the penalties in conjunction with the logistic regression models of Section 3. The coefficient profiles from the first 10 steps (grid values for λ) for each of the three regularization methods are shown. The lasso penalty admits at most N = 72 genes into the model, while ridge regression gives all 3571 genes non-zero coefficients. The elastic-net penalty provides a compromise between these two, and has the effect of averaging genes that are highly correlated and then entering the averaged gene into the model. Using the algorithm described below, computation of the entire path of solutions for each method, at 100 values of the regularization parameter evenly spaced on the log-scale, took under a second in total. Because of the large number of non-zero coefficients for the ridge penalty, they are individually much smaller than the coefficients for the other methods.

Figure 1: Leukemia data: profiles of estimated coefficients for three methods, showing only the first 10 steps (values for λ) in each case. For the elastic net, α = 0.2.

¹Zou and Hastie (2005) called this penalty the naive elastic net, and preferred a rescaled version which they called elastic net. We drop this distinction here.

Consider a coordinate descent step for solving (1). That is, suppose we have estimates $\tilde\beta_0$ and $\tilde\beta_\ell$ for ℓ ≠ j, and we wish to partially optimize with respect to β_j. We would like to compute the gradient at $\beta_j = \tilde\beta_j$, which only exists if $\tilde\beta_j \neq 0$. If $\tilde\beta_j > 0$, then

$$\frac{\partial R_\lambda}{\partial\beta_j}\Big|_{\beta=\tilde\beta} = -\frac{1}{N}\sum_{i=1}^N x_{ij}(y_i - \tilde\beta_0 - x_i^\top\tilde\beta) + \lambda(1-\alpha)\tilde\beta_j + \lambda\alpha. \qquad (4)$$

A similar expression exists if $\tilde\beta_j < 0$, and $\tilde\beta_j = 0$ is treated separately. Simple calculus shows (Donoho and Johnstone 1994) that the coordinate-wise update has the form

$$\tilde\beta_j \leftarrow \frac{S\!\left(\frac{1}{N}\sum_{i=1}^N x_{ij}(y_i - \tilde y_i^{(j)}),\ \lambda\alpha\right)}{1 + \lambda(1-\alpha)}, \qquad (5)$$

where

$\tilde y_i^{(j)} = \tilde\beta_0 + \sum_{\ell\neq j} x_{i\ell}\tilde\beta_\ell$ is the fitted value excluding the contribution from x_ij, and hence $y_i - \tilde y_i^{(j)}$ the partial residual for fitting β_j. Because of the standardization, $\frac{1}{N}\sum_{i=1}^N x_{ij}(y_i - \tilde y_i^{(j)})$ is the simple least-squares coefficient when fitting this partial residual to x_ij;

S(z, γ) is the soft-thresholding operator with value

$$\operatorname{sign}(z)(|z|-\gamma)_+ = \begin{cases} z-\gamma & \text{if } z>0 \text{ and } \gamma<|z|,\\ z+\gamma & \text{if } z<0 \text{ and } \gamma<|z|,\\ 0 & \text{if } \gamma\ge|z|. \end{cases} \qquad (6)$$

The details of this derivation are spelled out in Friedman et al. (2007).

Thus we compute the simple least-squares coefficient on the partial residual, apply soft-thresholding to take care of the lasso contribution to the penalty, and then apply a proportional shrinkage for the ridge penalty. This algorithm was suggested by Van der Kooij (2007).
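For concreteness, a minimal R sketch of this update (assuming standardized predictors, unit weights, and a precomputed residual vector; the production code behind glmnet is in Fortran):

```r
## Soft-thresholding operator S(z, gamma) of equation (6)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

## Coordinate-wise update (5): x is the N x p standardized matrix, r the
## current residuals y - yhat, beta the current coefficient vector.
update_coordinate <- function(j, x, r, beta, lambda, alpha) {
  N  <- nrow(x)
  zj <- sum(x[, j] * r) / N + beta[j]   # least-squares coefficient on the partial residual, as in (8)
  soft_threshold(zj, lambda * alpha) / (1 + lambda * (1 - alpha))
}
```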

2.1. Naive updates

Looking more closely at (5), we see that

$$y_i - \tilde y_i^{(j)} = y_i - \hat y_i + x_{ij}\tilde\beta_j = r_i + x_{ij}\tilde\beta_j, \qquad (7)$$


where $\hat y_i$ is the current fit of the model for observation i, and hence r_i the current residual. Thus

$$\frac{1}{N}\sum_{i=1}^N x_{ij}(y_i - \tilde y_i^{(j)}) = \frac{1}{N}\sum_{i=1}^N x_{ij} r_i + \tilde\beta_j, \qquad (8)$$

because the x_j are standardized. The first term on the right-hand side is the gradient of the loss with respect to β_j. It is clear from (8) why coordinate descent is computationally efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs to be changed. Such a step costs O(N) operations (the sum to compute the gradient). On the other hand, if a coefficient does change after the thresholding, r_i is changed in O(N) and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations. We refer to this as the naive algorithm, since it is generally less efficient than the covariance-updating algorithm to follow. Later we use these algorithms in the context of iteratively reweighted least squares (IRLS), where the observation weights change frequently; there the naive algorithm dominates.
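Continuing the sketch above (again only an illustration of the bookkeeping, not the package's implementation), one naive cycle over the p coordinates only touches the residual vector when a coefficient actually moves:

```r
## One full cycle of the naive algorithm over all p coordinates.
naive_cycle <- function(x, r, beta, lambda, alpha) {
  for (j in seq_len(ncol(x))) {
    bj.new <- update_coordinate(j, x, r, beta, lambda, alpha)
    if (bj.new != beta[j]) {
      r <- r + x[, j] * (beta[j] - bj.new)   # O(N) residual update
      beta[j] <- bj.new
    }
  }
  list(beta = beta, r = r)
}
```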

2.2. Covariance updates

Further efficiencies can be achieved in computing the updates in (8). We can write the first term on the right (up to a factor 1/N) as

$$\sum_{i=1}^N x_{ij} r_i = \langle x_j, y\rangle - \sum_{k:\,|\tilde\beta_k|>0} \langle x_j, x_k\rangle\,\tilde\beta_k, \qquad (9)$$

where $\langle x_j, y\rangle = \sum_{i=1}^N x_{ij} y_i$. Hence we need to compute inner products of each feature with y initially, and then each time a new feature x_k enters the model (for the first time), we need to compute and store its inner product with all the rest of the features (O(Np) operations). We also store the p gradient components (9). If one of the coefficients currently in the model changes, we can update each gradient in O(p) operations. Hence with m non-zero terms in the model, a complete cycle costs O(pm) operations if no new variables become non-zero, and costs O(Np) for each new variable entered. Importantly, O(N) calculations do not have to be made at every step. This is the case for all penalized procedures with squared error loss.

2.3. Sparse updates

We are sometimes faced with problems where the N × p feature matrix X is extremely sparse. A leading example is from document classification, where the feature vector uses the so-called "bag-of-words" model. Each document is scored for the presence/absence of each of the words in the entire dictionary under consideration (sometimes counts are used, or some transformation of counts). Since most words are absent, the feature vector for each document is mostly zero, and so the entire matrix is mostly zero. We store such matrices efficiently in sparse column format, where we store only the non-zero entries and the coordinates where they occur.

Coordinate descent is ideally set up to exploit such sparsity, in an obvious way. The O(N) inner-product operations in either the naive or covariance updates can exploit the sparsity, by summing over only the non-zero entries. Note that in this case scaling of the variables will not alter the sparsity, but centering will. So scaling is performed up front, but the centering is incorporated in the algorithm in an efficient and obvious manner.

2.4. Weighted updates

Often a weight w_i (other than 1/N) is associated with each observation. This will arise naturally in later sections, where observations receive weights in the IRLS algorithm. In this case the update step (5) becomes only slightly more complicated:

$$\tilde\beta_j \leftarrow \frac{S\!\left(\sum_{i=1}^N w_i x_{ij}(y_i - \tilde y_i^{(j)}),\ \lambda\alpha\right)}{\sum_{i=1}^N w_i x_{ij}^2 + \lambda(1-\alpha)}. \qquad (10)$$

If the x_j are not standardized, there is a similar sum-of-squares term in the denominator (even without weights). The presence of weights does not change the computational costs of either algorithm much, as long as the weights remain fixed.

2.5. Pathwise coordinate descent

We compute the solutions for a decreasing sequence of values for λ, starting at the smallest value λ_max for which the entire vector β̂ = 0. Apart from giving us a path of solutions, this scheme exploits warm starts, and leads to a more stable algorithm. We have examples where it is faster to compute the path down to λ (for small λ) than the solution only at that value for λ.

When β̃ = 0, we see from (5) that β̃_j will stay zero if $\frac{1}{N}|\langle x_j, y\rangle| < \lambda\alpha$. Hence $N\alpha\lambda_{\max} = \max_\ell |\langle x_\ell, y\rangle|$. Our strategy is to select a minimum value λ_min = ελ_max, and construct a sequence of K values of λ decreasing from λ_max to λ_min on the log scale. Typical values are ε = 0.001 and K = 100.

2.6. Other details

Irrespective of whether the variables are standardized to have variance 1, we always center each predictor variable. Since the intercept is not regularized, this means that β̂_0 = ȳ, the mean of the y_i, for all values of α and λ.

It is easy to allow different penalties λ_j for each of the variables. We implement this via a penalty scaling parameter γ_j ≥ 0. If γ_j > 0, then the penalty applied to β_j is λ_j = λγ_j. If γ_j = 0, that variable does not get penalized, and always enters the model unrestricted at the first step and remains in the model. Penalty rescaling would also allow, for example, our software to be used to implement the adaptive lasso (Zou 2006).

Considerable speedup is obtained by organizing the iterations around the active set of features, those with nonzero coefficients. After a complete cycle through all the variables, we iterate on only the active set till convergence. If another complete cycle does not change the active set, we are done, otherwise the process is repeated. Active-set convergence is also mentioned in Meier et al. (2008) and Krishnapuram and Hartemink (2005).


3. Regularized logistic regression

When the response variable is binary, the linear logistic regression model is often used. Denote by G the response variable, taking values in G = {1, 2} (the labeling of the elements is arbitrary). The logistic regression model represents the class-conditional probabilities through a linear function of the predictors:

$$\Pr(G=1|x) = \frac{1}{1+e^{-(\beta_0+x^\top\beta)}}, \qquad (11)$$

$$\Pr(G=2|x) = \frac{1}{1+e^{+(\beta_0+x^\top\beta)}} = 1 - \Pr(G=1|x).$$

Alternatively, this implies that

$$\log\frac{\Pr(G=1|x)}{\Pr(G=2|x)} = \beta_0 + x^\top\beta. \qquad (12)$$

Here we fit this model by regularized maximum (binomial) likelihood. Let p(x_i) = Pr(G = 1|x_i) be the probability (11) for observation i at a particular value for the parameters (β_0, β); then we maximize the penalized log-likelihood

$$\max_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[\frac{1}{N}\sum_{i=1}^N \Big\{ I(g_i=1)\log p(x_i) + I(g_i=2)\log(1-p(x_i)) \Big\} - \lambda P_\alpha(\beta)\right]. \qquad (13)$$

Denoting y_i = I(g_i = 1), the log-likelihood part of (13) can be written in the more explicit form

$$\ell(\beta_0,\beta) = \frac{1}{N}\sum_{i=1}^N \Big[ y_i\cdot(\beta_0 + x_i^\top\beta) - \log\big(1+e^{\beta_0+x_i^\top\beta}\big)\Big], \qquad (14)$$

a concave function of the parameters. The Newton algorithm for maximizing the (unpenalized) log-likelihood (14) amounts to iteratively reweighted least squares. Hence if the current estimates of the parameters are $(\tilde\beta_0, \tilde\beta)$, we form a quadratic approximation to the log-likelihood (Taylor expansion about the current estimates), which is

$$\ell_Q(\beta_0,\beta) = -\frac{1}{2N}\sum_{i=1}^N w_i(z_i - \beta_0 - x_i^\top\beta)^2 + C(\tilde\beta_0,\tilde\beta)^2, \qquad (15)$$

where

$$z_i = \tilde\beta_0 + x_i^\top\tilde\beta + \frac{y_i - \tilde p(x_i)}{\tilde p(x_i)(1-\tilde p(x_i))} \qquad \text{(working response)} \qquad (16)$$

$$w_i = \tilde p(x_i)(1-\tilde p(x_i)) \qquad \text{(weights)} \qquad (17)$$

and p̃(x_i) is evaluated at the current parameters. The last term is constant. The Newton update is obtained by minimizing ℓ_Q.

Our approach is similar. For each value of λ, we create an outer loop which computes the quadratic approximation ℓ_Q about the current parameters $(\tilde\beta_0, \tilde\beta)$. Then we use coordinate descent to solve the penalized weighted least-squares problem

$$\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \big\{ -\ell_Q(\beta_0,\beta) + \lambda P_\alpha(\beta) \big\}. \qquad (18)$$

This amounts to a sequence of nested loops (a code sketch follows the loop description):


outer loop: Decrement λ.

middle loop: Update the quadratic approximation ℓ_Q using the current parameters $(\tilde\beta_0, \tilde\beta)$.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (18).
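A minimal R sketch of these loops for a single value of λ, using the weighted update (10) as the inner loop. This is only an illustration of the structure under simplifying assumptions; the paper's implementation is in Fortran, and the warm starts, active sets and convergence checks are omitted here:

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

logistic_elnet_one_lambda <- function(x, y, beta0, beta, lambda, alpha,
                                      outer.it = 25, inner.it = 50) {
  N <- nrow(x)
  for (m in seq_len(outer.it)) {                 # middle loop: new quadratic approximation
    eta <- beta0 + drop(x %*% beta)
    p   <- 1 / (1 + exp(-eta))
    p   <- pmin(pmax(p, 1e-5), 1 - 1e-5)         # guard fitted probabilities near 0 or 1
    w   <- p * (1 - p)                           # weights (17)
    z   <- eta + (y - p) / w                     # working response (16)
    for (it in seq_len(inner.it)) {              # inner loop: coordinate descent on (18)
      beta0 <- sum(w * (z - drop(x %*% beta))) / sum(w)
      for (j in seq_len(ncol(x))) {
        r <- z - beta0 - drop(x %*% beta) + x[, j] * beta[j]   # partial residual (clear, not fast)
        beta[j] <- soft(sum(w * x[, j] * r) / N, lambda * alpha) /
                   (sum(w * x[, j]^2) / N + lambda * (1 - alpha))
      }
    }
  }
  list(beta0 = beta0, beta = beta)
}
```

In the real algorithm the outermost loop decrements λ and warm-starts each fit from the previous, close-by solution.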

There are several important details in the implementation of this algorithm:

When p ≫ N, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to λ_min = ελ_max > 0.

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0 or 1. When a probability is within ε = 10⁻⁵ of 1, we set it to 1, and set the weights to ε. 0 is treated similarly.

Our code has an option to approximate the Hessian terms by an exact upper bound. This is obtained by setting the w_i in (17) all equal to 0.25 (Krishnapuram and Hartemink 2005).

We allow the response data to be supplied in the form of a two-column matrix of counts, sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.

The Newton algorithm is not guaranteed to converge without step-size optimization (Lee et al. 2006). Our code does not implement any checks for divergence; this would slow it down, and when used as recommended we do not feel it is necessary. We have a closed form expression for the starting solutions, and each subsequent solution is warm-started from the previous close-by solution, which generally makes the quadratic approximations very accurate. We have not encountered any divergence problems so far.

4. Regularized multinomial regression

When the categorical response variable G has K > 2 levels, the linear logistic regression model can be generalized to a multi-logit model. The traditional approach is to extend (12) to K − 1 logits

$$\log\frac{\Pr(G=\ell|x)}{\Pr(G=K|x)} = \beta_{0\ell} + x^\top\beta_\ell, \qquad \ell = 1, \ldots, K-1. \qquad (19)$$

Here β_ℓ is a p-vector of coefficients. As in Zhu and Hastie (2004), here we choose a more symmetric approach. We model

$$\Pr(G=\ell|x) = \frac{e^{\beta_{0\ell}+x^\top\beta_\ell}}{\sum_{k=1}^K e^{\beta_{0k}+x^\top\beta_k}}. \qquad (20)$$

This parametrization is not estimable without constraints, because for any values for the parameters $\{\beta_{0\ell}, \beta_\ell\}_1^K$, $\{\beta_{0\ell}-c_0, \beta_\ell-c\}_1^K$ give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.


We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let p_ℓ(x_i) = Pr(G = ℓ|x_i), and let g_i ∈ {1, 2, ..., K} be the ith response. We maximize the penalized log-likelihood

$$\max_{\{\beta_{0\ell},\beta_\ell\}_1^K\in\mathbb{R}^{K(p+1)}} \left[\frac{1}{N}\sum_{i=1}^N \log p_{g_i}(x_i) - \lambda\sum_{\ell=1}^K P_\alpha(\beta_\ell)\right]. \qquad (21)$$

Denote by Y the N × K indicator response matrix, with elements y_iℓ = I(g_i = ℓ). Then we can write the log-likelihood part of (21) in the more explicit form

$$\ell(\{\beta_{0\ell},\beta_\ell\}_1^K) = \frac{1}{N}\sum_{i=1}^N\left[\sum_{\ell=1}^K y_{i\ell}(\beta_{0\ell}+x_i^\top\beta_\ell) - \log\left(\sum_{\ell=1}^K e^{\beta_{0\ell}+x_i^\top\beta_\ell}\right)\right]. \qquad (22)$$

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights w_i as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only (β_{0ℓ}, β_ℓ) to vary for a single class at a time. It is not hard to show that this is

$$\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) = -\frac{1}{2N}\sum_{i=1}^N w_{i\ell}(z_{i\ell}-\beta_{0\ell}-x_i^\top\beta_\ell)^2 + C(\{\tilde\beta_{0k},\tilde\beta_k\}_1^K), \qquad (23)$$

where, as before,

$$z_{i\ell} = \tilde\beta_{0\ell} + x_i^\top\tilde\beta_\ell + \frac{y_{i\ell}-\tilde p_\ell(x_i)}{\tilde p_\ell(x_i)(1-\tilde p_\ell(x_i))}, \qquad (24)$$

$$w_{i\ell} = \tilde p_\ell(x_i)(1-\tilde p_\ell(x_i)). \qquad (25)$$

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ, we create an outer loop which cycles over ℓ and computes the partial quadratic approximation ℓ_{Qℓ} about the current parameters $(\tilde\beta_0, \tilde\beta)$. Then we use coordinate descent to solve the penalized weighted least-squares problem

$$\min_{(\beta_{0\ell},\beta_\ell)\in\mathbb{R}^{p+1}} \big\{-\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) + \lambda P_\alpha(\beta_\ell)\big\}. \qquad (26)$$

This amounts to the sequence of nested loops:

outer loop: Decrement λ.

middle loop (outer): Cycle over ℓ ∈ {1, 2, ..., K, 1, 2, ...}.

middle loop (inner): Update the quadratic approximation ℓ_{Qℓ} using the current parameters $\{\tilde\beta_{0k},\tilde\beta_k\}_1^K$.

inner loop: Run the co-ordinate descent algorithm on the penalized weighted-least-squares problem (26).


4.1. Regularization and parameter ambiguity

As was pointed out earlier, if $\{\beta_{0\ell}, \beta_\ell\}_1^K$ characterizes a fitted model for (20), then $\{\beta_{0\ell}-c_0, \beta_\ell-c\}_1^K$ gives an identical fit (c is a p-vector). Although this means that the log-likelihood part of (21) is insensitive to (c_0, c), the penalty is not. In particular, we can always improve an estimate $\{\beta_{0\ell}, \beta_\ell\}_1^K$ (with respect to (21)) by solving

$$\min_{c\in\mathbb{R}^p}\sum_{\ell=1}^K P_\alpha(\beta_\ell - c). \qquad (27)$$

This can be done separately for each coordinate, hence

$$c_j = \arg\min_t \sum_{\ell=1}^K \left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2 + \alpha|\beta_{j\ell}-t|\right]. \qquad (28)$$

Theorem 1. Consider problem (28) for values α ∈ [0, 1]. Let β̄_j be the mean of the β_{jℓ}, and β^M_j a median of the β_{jℓ} (and for simplicity assume β̄_j ≤ β^M_j). Then we have

$$c_j \in [\bar\beta_j,\ \beta^M_j], \qquad (29)$$

with the left endpoint achieved if α = 0 and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28). The objective is piecewise quadratic, with knots defined by the β_{jℓ}. We need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j after each inner middle loop step, using the solution c_j for each j.
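Since the objective in (28) is convex in t, a one-dimensional search restricted to the interval between the mean and a median suffices. A small illustrative sketch, using R's optimize() rather than the search used in the package:

```r
## Recentering step for one coordinate j: b holds (beta_j1, ..., beta_jK).
recenter <- function(b, alpha) {
  obj <- function(t) sum(0.5 * (1 - alpha) * (b - t)^2 + alpha * abs(b - t))
  lo <- min(mean(b), median(b))
  hi <- max(mean(b), median(b))
  if (lo == hi) return(lo)                      # mean equals median: done
  optimize(obj, interval = c(lo, hi))$minimum   # Theorem 1: c_j lies in [lo, hi]
}
```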

Not all the parameters in our model are regularized. The intercepts β_{0ℓ} are not, and with our penalty modifiers γ_j (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2. Grouped and matrix responses

As in the two-class case, the data can be presented in the form of an N × K matrix m_iℓ of non-negative numbers. For example, if the data are grouped: at each x_i we have a number of multinomial samples, with m_iℓ falling into category ℓ. In this case we divide each row by the row-sum $m_i = \sum_\ell m_{i\ell}$, and produce our response matrix y_iℓ = m_iℓ/m_i; m_i becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using the y_iℓ just defined). The weights in (25) get augmented with the observation weight m_i:

$$w_{i\ell} = m_i\,\tilde p_\ell(x_i)(1-\tilde p_\ell(x_i)). \qquad (30)$$

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.


5. Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1. Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors X_j, X_j' having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

$$Y = \sum_{j=1}^p X_j\beta_j + k\cdot Z, \qquad (31)$$

where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1) and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
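A sketch of this simulation design in R. The equicorrelated predictors are generated via a shared Gaussian factor, and the exact signal-to-noise convention is an assumption, taken here as the ratio of the signal and noise standard deviations:

```r
sim_data <- function(N, p, rho, snr = 3) {
  ## X_j = sqrt(rho) * Z0 + sqrt(1 - rho) * Z_j gives cor(X_j, X_k) = rho for j != k
  z0 <- rnorm(N)
  x  <- sqrt(rho) * z0 + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta   <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)   # alternating signs, exponential decay
  signal <- drop(x %*% beta)
  k <- sd(signal) / snr                               # noise scale for the chosen signal-to-noise ratio
  list(x = x, y = signal + k * rnorm(N), beta = beta)
}
```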

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case, lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2. Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the same 100 λ values for all.


Linear regression – Dense features

                                            Correlation
                                 0      0.1     0.2     0.5     0.9     0.95

  N = 1000, p = 100
  glmnet (type = naive)         0.05    0.06    0.06    0.09    0.08    0.07
  glmnet (type = cov)           0.02    0.02    0.02    0.02    0.02    0.02
  lars                          0.11    0.11    0.11    0.11    0.11    0.11

  N = 5000, p = 100
  glmnet (type = naive)         0.24    0.25    0.26    0.34    0.32    0.31
  glmnet (type = cov)           0.05    0.05    0.05    0.05    0.05    0.05
  lars                          0.29    0.29    0.29    0.30    0.29    0.29

  N = 100, p = 1000
  glmnet (type = naive)         0.04    0.05    0.04    0.05    0.04    0.03
  glmnet (type = cov)           0.07    0.08    0.07    0.08    0.04    0.03
  lars                          0.73    0.72    0.68    0.71    0.71    0.67

  N = 100, p = 5000
  glmnet (type = naive)         0.20    0.18    0.21    0.23    0.21    0.14
  glmnet (type = cov)           0.46    0.42    0.51    0.48    0.25    0.10
  lars                          3.73    3.53    3.59    3.47    3.90    3.52

  N = 100, p = 20000
  glmnet (type = naive)         1.00    0.99    1.06    1.29    1.17    0.97
  glmnet (type = cov)           1.86    2.26    2.34    2.59    1.24    0.79
  lars                         18.30   17.90   16.90   18.03   17.91   16.39

  N = 100, p = 50000
  glmnet (type = naive)         2.66    2.46    2.84    3.53    3.39    2.43
  glmnet (type = cov)           5.50    4.92    6.13    7.35    4.52    2.53
  lars                         58.68   64.00   64.79   58.20   66.39   79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

Table 2 shows the results; in some cases, we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

                                 Correlation
                     0        0.1       0.2       0.5       0.9       0.95

  N = 1000, p = 100
  glmnet             1.65      1.81      2.31      3.87      5.99      8.48
  l1logreg          31.475    31.86     34.35     32.21     31.85     31.81
  BBR               40.70     47.57     54.18     70.06    106.72    121.41
  LPL               24.68     31.64     47.99    170.77    741.00   1448.25

  N = 5000, p = 100
  glmnet             7.89      8.48      9.01     13.39     26.68     26.36
  l1logreg         239.88    232.00    229.62    229.49    221.9     223.09

  N = 100,000, p = 100
  glmnet            78.56    178.45    205.94    274.33    552.48    638.50

  N = 100, p = 1000
  glmnet             1.06      1.07      1.09      1.45      1.72      1.37
  l1logreg          25.99     26.40     25.67     26.49     24.34     20.16
  BBR               70.19     71.19     78.40    103.77    149.05    113.87
  LPL               11.02     10.87     10.76     16.34     41.84     70.50

  N = 100, p = 5000
  glmnet             5.24      4.43      5.12      7.05      7.87      6.05
  l1logreg         165.02    161.90    163.25    166.50    151.91    135.28

  N = 100, p = 100,000
  glmnet           137.27    139.40    146.55    197.98    219.65    201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold cross-validation over a grid of 100 λ values.

5.3. Real data

Table 4 shows some timing results for four different datasets:

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

                                 Correlation
                     0        0.1       0.2       0.5       0.9       0.95

  N = 1000, p = 100
  glmnet             0.77      0.74      0.72      0.73      0.84      0.88
  l1logreg           5.19      5.21      5.14      5.40      6.14      6.26
  BBR                2.01      1.95      1.98      2.06      2.73      2.88

  N = 100, p = 1000
  glmnet             1.81      1.73      1.55      1.70      1.63      1.55
  l1logreg           7.67      7.72      7.64      9.04      9.81      9.40
  BBR                4.66      4.58      4.68      5.15      5.78      5.53

  N = 10,000, p = 100
  glmnet             3.21      3.02      2.95      3.25      4.58      5.08
  l1logreg          45.87     46.63     44.33     43.99     45.60     43.16
  BBR               11.80     11.64     11.58     13.30     12.46     11.83

  N = 100, p = 10,000
  glmnet            10.18     10.35      9.93     10.04      9.02      8.91
  l1logreg         130.27    124.88    124.18    129.84    137.21    159.54
  BBR               45.72     47.50     47.46     48.49     56.29     60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format, using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for a smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1.2% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4. Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here.


  Name          Type        N        p         glmnet      l1logreg    BBR/BMR

  Dense
  Cancer        14 class    144      16,063    2.5 mins                2.1 hrs
  Leukemia      2 class     72       3571      2.50        55.0        450

  Sparse
  InternetAd    2 class     2359     1430      5.0         20.9        34.7
  NewsGroup     2 class     11,314   777,811   2 mins                  3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax.

               MacBook Pro    HP Linux server

  glmnet           0.34            0.13
  penalized       10.31
  OWL-QN                          314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

Two referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case:

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++ and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

6. Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.


[Figure 2 shows three cross-validation curves plotted against log(Lambda): mean squared error for the Gaussian family, and deviance and misclassification error for the binomial family, each annotated along the top with the number of nonzero coefficients.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response, the second row logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line to the largest value of lambda such that the error is within one standard error of the minimum, the so-called "one-standard-error" rule. The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way.

Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the latter is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
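With the glmnet package this is automated by cv.glmnet; a brief example of the kind of analysis shown in Figure 2, where `x` and `y` stand for the simulated data:

```r
library(glmnet)

## Ten-fold cross-validation for the binomial model, using misclassification
## error as the risk measure ("deviance" is the default).
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 10)

plot(cvfit)                    # error curve with one-standard-deviation band
cvfit$lambda.min               # lambda with minimum cross-validated error
cvfit$lambda.1se               # largest lambda within one standard error of the minimum
coef(cvfit, s = "lambda.1se")  # coefficients under the "one-standard-error" rule
```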

7. Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed both for large N and p is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33-40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.


Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313-2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33-61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413-1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583-3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425-455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407-499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348-1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302-332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432-441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs. the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397-416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291-304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531-536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391-1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519-1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957-968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175-181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331-339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software. URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53-71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389-404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659-677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149-15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012-1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246-2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267-288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385-395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475-494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714-721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224-244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49-67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427-443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418-1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301-320.


A. Proof of Theorem 1

We have

$$c_j = \arg\min_t \sum_{\ell=1}^K \left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2 + \alpha|\beta_{j\ell}-t|\right]. \qquad (32)$$

Suppose α ∈ (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

$$\sum_{\ell=1}^K \left[-(1-\alpha)(\beta_{j\ell}-t) - \alpha s_{j\ell}\right] = 0, \qquad (33)$$

where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t, and s_{jℓ} ∈ [−1, 1] otherwise. This gives

$$t = \bar\beta_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha}\sum_{\ell=1}^K s_{j\ell}. \qquad (34)$$

It follows that t cannot be larger than β^M_j, since then the second term above would be negative, and this would imply that t is less than β̄_j. Similarly t cannot be less than β̄_j, since then the second term above would have to be negative, implying that t is larger than β^M_j.

Affiliation

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software (http://www.jstatsoft.org/), published by the American Statistical Association (http://www.amstat.org/).

Volume 33, Issue 1, January 2010. Submitted: 2009-04-22; Accepted: 2009-12-15.


    2 Regularization Paths for GLMs via Coordinate Descent

    4 `1 regularization paths for generalized linear models (Park and Hastie 2007a)

    5 Methods using non-concave penalties such as SCAD (Fan and Li 2005) and Friedmanrsquosgeneralized elastic net (Friedman 2008) enforce more severe variable selection than thelasso

    6 Regularization paths for the support-vector machine (Hastie et al 2004)

    7 The graphical lasso (Friedman et al 2008) for sparse covariance estimation and undi-rected graphs

    Efron et al (2004) developed an efficient algorithm for computing the entire regularizationpath for the lasso for linear regression models Their algorithm exploits the fact that the coef-ficient profiles are piecewise linear which leads to an algorithm with the same computationalcost as the full least-squares fit on the data (see also Osborne et al 2000)

    In some of the extensions above (items 23 and 6) piecewise-linearity can be exploited as inEfron et al (2004) to yield efficient algorithms Rosset and Zhu (2007) characterize the classof problems where piecewise-linearity existsmdashboth the loss function and the penalty have tobe quadratic or piecewise linear

    Here we instead focus on cyclical coordinate descent methods These methods have beenproposed for the lasso a number of times but only recently was their power fully appreciatedEarly references include Fu (1998) Shevade and Keerthi (2003) and Daubechies et al (2004)Van der Kooij (2007) independently used coordinate descent for solving elastic-net penalizedregression models Recent rediscoveries include Friedman et al (2007) and Wu and Lange(2008) The first paper recognized the value of solving the problem along an entire path ofvalues for the regularization parameters using the current estimates as warm starts Thisstrategy turns out to be remarkably efficient for this problem Several other researchers havealso re-discovered coordinate descent many for solving the same problems we address in thispapermdashnotably Shevade and Keerthi (2003) Krishnapuram and Hartemink (2005) Genkinet al (2007) and Wu et al (2009)

    In this paper we extend the work of Friedman et al (2007) and develop fast algorithmsfor fitting generalized linear models with elastic-net penalties In particular our modelsinclude regression two-class logistic regression and multinomial regression problems Ouralgorithms can work on very large datasets and can take advantage of sparsity in the featureset We provide a publicly available package glmnet (Friedman et al 2009) implemented inthe R programming system (R Development Core Team 2009) We do not revisit the well-established convergence properties of coordinate descent in convex problems (Tseng 2001) inthis article

    Lasso procedures are frequently used in domains with very large datasets such as genomicsand web analysis Consequently a focus of our research has been algorithmic efficiency andspeed We demonstrate through simulations that our procedures outperform all competitorsmdash even those based on coordinate descent

    In Section 2 we present the algorithm for the elastic net which includes the lasso and ridgeregression as special cases Section 3 and 4 discuss (two-class) logistic regression and multi-nomial logistic regression Comparative timings are presented in Section 5

    Although the title of this paper advertises regularization paths for GLMs we only cover threeimportant members of this family However exactly the same technology extends trivially to

    Journal of Statistical Software 3

    other members of the exponential family such as the Poisson model We plan to extend oursoftware to cover these important other cases as well as the Cox model for survival data

    Note that this article is about algorithms for fitting particular families of models and notabout the statistical properties of these models themselves Such discussions have taken placeelsewhere

    2 Algorithms for the lasso ridge regression and elastic net

    We consider the usual setup for linear regression We have a response variable Y isin R anda predictor vector X isin Rp and we approximate the regression function by a linear modelE(Y |X = x) = β0 + xgtβ We have N observation pairs (xi yi) For simplicity we assumethe xij are standardized

    sumNi=1 xij = 0 1

    N

    sumNi=1 x

    2ij = 1 for j = 1 p Our algorithms

    generalize naturally to the unstandardized case The elastic net solves the following problem

    min(β0β)isinRp+1

    Rλ(β0 β) = min(β0β)isinRp+1

    [1

    2N

    Nsumi=1

    (yi minus β0 minus xgti β)2 + λPα(β)

    ] (1)

    where

    Pα(β) = (1minus α)12||β||2`2 + α||β||`1 (2)

    =psumj=1

    [12(1minus α)β2

    j + α|βj |] (3)

    Pα is the elastic-net penalty (Zou and Hastie 2005) and is a compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1) This penalty is particularly usefulin the p N situation or any situation where there are many correlated predictor variables1

    Ridge regression is known to shrink the coefficients of correlated predictors towards eachother allowing them to borrow strength from each other In the extreme case of k identicalpredictors they each get identical coefficients with 1kth the size that any single one wouldget if fit alone From a Bayesian point of view the ridge penalty is ideal if there are manypredictors and all have non-zero coefficients (drawn from a Gaussian distribution)

    Lasso on the other hand is somewhat indifferent to very correlated predictors and will tendto pick one and ignore the rest In the extreme case above the lasso problem breaks downThe lasso penalty corresponds to a Laplace prior which expects many coefficients to be closeto zero and a small subset to be larger and nonzero

    The elastic net with α = 1minusε for some small ε gt 0 performs much like the lasso but removesany degeneracies and wild behavior caused by extreme correlations More generally the entirefamily Pα creates a useful compromise between ridge and lasso As α increases from 0 to 1for a given λ the sparsity of the solution to (1) (ie the number of coefficients equal to zero)increases monotonically from 0 to the sparsity of the lasso solution

    Figure 1 shows an example that demonstrates the effect of varying α The dataset is from(Golub et al 1999) consisting of 72 observations on 3571 genes measured with DNA microar-rays The observations fall in two classes so we use the penalties in conjunction with the

    1Zou and Hastie (2005) called this penalty the naive elastic net and preferred a rescaled version which theycalled elastic net We drop this distinction here

    4 Regularization Paths for GLMs via Coordinate Descent

    Figure 1 Leukemia data profiles of estimated coefficients for three methods showing onlyfirst 10 steps (values for λ) in each case For the elastic net α = 02

    Journal of Statistical Software 5

    logistic regression models of Section 3 The coefficient profiles from the first 10 steps (gridvalues for λ) for each of the three regularization methods are shown The lasso penalty admitsat most N = 72 genes into the model while ridge regression gives all 3571 genes non-zerocoefficients The elastic-net penalty provides a compromise between these two and has theeffect of averaging genes that are highly correlated and then entering the averaged gene intothe model Using the algorithm described below computation of the entire path of solutionsfor each method at 100 values of the regularization parameter evenly spaced on the log-scaletook under a second in total Because of the large number of non-zero coefficients for theridge penalty they are individually much smaller than the coefficients for the other methods

    Consider a coordinate descent step for solving (1) That is suppose we have estimates β0 andβ` for ` 6= j and we wish to partially optimize with respect to βj We would like to computethe gradient at βj = βj which only exists if βj 6= 0 If βj gt 0 then

\[
\frac{\partial R_\lambda}{\partial \beta_j}\Big|_{\beta=\tilde\beta}
= -\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i-\tilde\beta_0-x_i^\top\tilde\beta\bigr)
+ \lambda(1-\alpha)\tilde\beta_j + \lambda\alpha. \tag{4}
\]

A similar expression exists if β̃_j < 0, and β̃_j = 0 is treated separately. Simple calculus shows (Donoho and Johnstone 1994) that the coordinate-wise update has the form

\[
\tilde\beta_j \leftarrow
\frac{S\!\Bigl(\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i-\tilde y_i^{(j)}\bigr),\ \lambda\alpha\Bigr)}
{1+\lambda(1-\alpha)}, \tag{5}
\]

where ỹ_i^{(j)} = β̃_0 + ∑_{ℓ≠j} x_{iℓ} β̃_ℓ is the fitted value excluding the contribution from x_{ij}, and hence y_i − ỹ_i^{(j)} the partial residual for fitting β_j. Because of the standardization, (1/N) ∑_{i=1}^{N} x_{ij}(y_i − ỹ_i^{(j)}) is the simple least-squares coefficient when fitting this partial residual to x_{ij}. S(z, γ) is the soft-thresholding operator with value

\[
\operatorname{sign}(z)\,(|z|-\gamma)_+ =
\begin{cases}
z-\gamma & \text{if } z>0 \text{ and } \gamma<|z|,\\
z+\gamma & \text{if } z<0 \text{ and } \gamma<|z|,\\
0 & \text{if } \gamma\ge|z|.
\end{cases} \tag{6}
\]

The details of this derivation are spelled out in Friedman et al. (2007).

Thus we compute the simple least-squares coefficient on the partial residual, apply soft-thresholding to take care of the lasso contribution to the penalty, and then apply a proportional shrinkage for the ridge penalty. This algorithm was suggested by Van der Kooij (2007).
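To make the update concrete, here is a minimal R sketch of (5)-(6); soft_threshold and cd_update_j are our illustrative names (the production code is in Fortran), and the columns of x are assumed standardized as above.

```r
# Soft-thresholding operator S(z, gamma) of equation (6)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# One coordinate-wise elastic-net update (5) for coefficient j,
# assuming the columns of x are centered with mean square 1.
cd_update_j <- function(j, x, y, beta0, beta, lambda, alpha) {
  N <- length(y)
  partial_fit <- beta0 + x[, -j, drop = FALSE] %*% beta[-j]    # ytilde^(j)
  z <- sum(x[, j] * (y - partial_fit)) / N                     # least-squares coef on partial residual
  soft_threshold(z, lambda * alpha) / (1 + lambda * (1 - alpha))
}
```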

2.1. Naive updates

Looking more closely at (5), we see that

\[
y_i - \tilde y_i^{(j)} = y_i - \hat y_i + x_{ij}\tilde\beta_j = r_i + x_{ij}\tilde\beta_j, \tag{7}
\]

where ŷ_i is the current fit of the model for observation i, and hence r_i the current residual. Thus

\[
\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i-\tilde y_i^{(j)}\bigr)
= \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + \tilde\beta_j, \tag{8}
\]

because the x_j are standardized. The first term on the right-hand side is the gradient of the loss with respect to β_j. It is clear from (8) why coordinate descent is computationally efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs to be changed. Such a step costs O(N) operations: the sum needed to compute the gradient. On the other hand, if a coefficient does change after the thresholding, r_i is changed in O(N) and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations. We refer to this as the naive algorithm, since it is generally less efficient than the covariance-updating algorithm to follow. Later we use these algorithms in the context of iteratively reweighted least squares (IRLS), where the observation weights change frequently; there the naive algorithm dominates.
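For illustration, a complete naive cycle might look as follows in R (a sketch, reusing soft_threshold from above; the actual implementation is in Fortran).

```r
# One full cycle of the naive algorithm: residuals are kept current, so
# each coordinate update costs O(N) and a full cycle costs O(pN).
naive_cycle <- function(x, y, beta, beta0, lambda, alpha) {
  N <- nrow(x)
  r <- y - beta0 - as.vector(x %*% beta)          # current residuals
  for (j in seq_len(ncol(x))) {
    z <- sum(x[, j] * r) / N + beta[j]            # gradient term of (8) plus beta_j
    new_bj <- soft_threshold(z, lambda * alpha) / (1 + lambda * (1 - alpha))
    if (new_bj != beta[j]) {
      r <- r - x[, j] * (new_bj - beta[j])        # O(N) residual update
      beta[j] <- new_bj
    }
  }
  list(beta = beta, r = r)
}
```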

2.2. Covariance updates

Further efficiencies can be achieved in computing the updates in (8). We can write the first term on the right (up to a factor 1/N) as

\[
\sum_{i=1}^{N} x_{ij} r_i
= \langle x_j, y\rangle - \sum_{k:\,|\tilde\beta_k|>0} \langle x_j, x_k\rangle\,\tilde\beta_k, \tag{9}
\]

where ⟨x_j, y⟩ = ∑_{i=1}^{N} x_{ij} y_i. Hence we need to compute inner products of each feature with y initially, and then each time a new feature x_k enters the model (for the first time), we need to compute and store its inner product with all the rest of the features (O(Np) operations). We also store the p gradient components (9). If one of the coefficients currently in the model changes, we can update each gradient in O(p) operations. Hence with m non-zero terms in the model, a complete cycle costs O(pm) operations if no new variables become non-zero, and costs O(Np) for each new variable entered. Importantly, O(N) calculations do not have to be made at every step. This is the case for all penalized procedures with squared error loss.
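The bookkeeping can be sketched in a few lines of R (ours, for illustration only); x and y are simulated here, and only the operations and their costs matter.

```r
# Covariance-updating bookkeeping behind (9)
set.seed(1)
N <- 100; p <- 20
x <- scale(matrix(rnorm(N * p), N, p)) * sqrt(N / (N - 1))  # standardized: mean 0, mean square 1
y <- rnorm(N)

xty  <- as.vector(crossprod(x, y))   # <x_j, y> for all j, computed once: O(Np)
xtx  <- matrix(NA_real_, p, p)       # <x_j, x_k>, filled only when feature k becomes active
grad <- xty                          # N times the gradient components in (9)

# When feature k enters the model for the first time: O(Np)
k <- 3
xtx[, k] <- as.vector(crossprod(x, x[, k]))

# When an active coefficient beta_k changes by delta: O(p), no O(N) work
delta <- 0.5
grad <- grad - xtx[, k] * delta
```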

2.3. Sparse updates

We are sometimes faced with problems where the N × p feature matrix X is extremely sparse. A leading example is from document classification, where the feature vector uses the so-called "bag-of-words" model. Each document is scored for the presence/absence of each of the words in the entire dictionary under consideration (sometimes counts are used, or some transformation of counts). Since most words are absent, the feature vector for each document is mostly zero, and so the entire matrix is mostly zero. We store such matrices efficiently in sparse column format, where we store only the non-zero entries and the coordinates where they occur.

Coordinate descent is ideally set up to exploit such sparsity, in an obvious way. The O(N) inner-product operations in either the naive or covariance updates can exploit the sparsity, by summing over only the non-zero entries. Note that in this case scaling of the variables will not alter the sparsity, but centering will. So scaling is performed up front, but the centering is incorporated in the algorithm in an efficient and obvious manner.

2.4. Weighted updates

Often a weight w_i (other than 1/N) is associated with each observation. This will arise naturally in later sections, where observations receive weights in the IRLS algorithm. In this case the update step (5) becomes only slightly more complicated:

\[
\tilde\beta_j \leftarrow
\frac{S\!\Bigl(\sum_{i=1}^{N} w_i x_{ij}\bigl(y_i-\tilde y_i^{(j)}\bigr),\ \lambda\alpha\Bigr)}
{\sum_{i=1}^{N} w_i x_{ij}^2 + \lambda(1-\alpha)}. \tag{10}
\]

If the x_j are not standardized, there is a similar sum-of-squares term in the denominator (even without weights). The presence of weights does not change the computational costs of either algorithm much, as long as the weights remain fixed.
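A sketch of (10) in R, reusing soft_threshold from the earlier sketch and treating the weights w as fixed:

```r
# Weighted coordinate update (10); w are observation weights,
# e.g., the IRLS weights of Section 3.
cd_update_j_weighted <- function(j, x, y, w, beta0, beta, lambda, alpha) {
  partial_fit <- beta0 + x[, -j, drop = FALSE] %*% beta[-j]
  z <- sum(w * x[, j] * (y - partial_fit))
  soft_threshold(z, lambda * alpha) / (sum(w * x[, j]^2) + lambda * (1 - alpha))
}
```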

2.5. Pathwise coordinate descent

We compute the solutions for a decreasing sequence of values for λ, starting at the smallest value λ_max for which the entire vector β̂ = 0. Apart from giving us a path of solutions, this scheme exploits warm starts, and leads to a more stable algorithm. We have examples where it is faster to compute the path down to λ (for small λ) than the solution only at that value for λ.

When β̃ = 0, we see from (5) that β̃_j will stay zero if (1/N)|⟨x_j, y⟩| < λα. Hence Nαλ_max = max_ℓ |⟨x_ℓ, y⟩|. Our strategy is to select a minimum value λ_min = ελ_max, and construct a sequence of K values of λ decreasing from λ_max to λ_min on the log scale. Typical values are ε = 0.001 and K = 100.
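In code, the λ sequence can be sketched as follows (our helper name; assumes standardized x and centered y):

```r
# Lambda sequence for the path, following Section 2.5:
# N * alpha * lambda_max = max_j |<x_j, y>|, then K values on the log scale.
lambda_path <- function(x, y, alpha, K = 100, epsilon = 0.001) {
  N <- nrow(x)
  lambda_max <- max(abs(crossprod(x, y))) / (N * alpha)
  exp(seq(log(lambda_max), log(epsilon * lambda_max), length.out = K))
}
```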

2.6. Other details

Irrespective of whether the variables are standardized to have variance 1, we always center each predictor variable. Since the intercept is not regularized, this means that β̂_0 = ȳ, the mean of the y_i, for all values of α and λ.

It is easy to allow different penalties λ_j for each of the variables. We implement this via a penalty scaling parameter γ_j ≥ 0. If γ_j > 0, then the penalty applied to β_j is λ_j = λγ_j. If γ_j = 0, that variable does not get penalized, and always enters the model unrestricted at the first step and remains in the model. Penalty rescaling would also allow, for example, our software to be used to implement the adaptive lasso (Zou 2006).

Considerable speedup is obtained by organizing the iterations around the active set of features (those with nonzero coefficients). After a complete cycle through all the variables, we iterate on only the active set till convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated. Active-set convergence is also mentioned in Meier et al. (2008) and Krishnapuram and Hartemink (2005).
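The active-set strategy can be sketched as below (ours, reusing naive_cycle and cd_update_j from the earlier sketches; the Fortran implementation in glmnet differs in detail).

```r
# Active-set iteration for a single value of lambda: a full cycle, then
# iterate over the active set until convergence, then stop once another
# full cycle leaves the active set unchanged.
fit_one_lambda <- function(x, y, beta, beta0, lambda, alpha, tol = 1e-7) {
  repeat {
    beta   <- naive_cycle(x, y, beta, beta0, lambda, alpha)$beta   # full cycle
    active <- which(beta != 0)
    repeat {                                                       # iterate on the active set
      old <- beta
      for (j in active)
        beta[j] <- cd_update_j(j, x, y, beta0, beta, lambda, alpha)
      if (max(abs(beta - old)) < tol) break
    }
    new_beta <- naive_cycle(x, y, beta, beta0, lambda, alpha)$beta # another full cycle
    beta <- new_beta
    if (identical(which(new_beta != 0), active)) break             # active set unchanged: done
  }
  beta
}
```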

3. Regularized logistic regression

When the response variable is binary, the linear logistic regression model is often used. Denote by G the response variable, taking values in G = {1, 2} (the labeling of the elements is arbitrary). The logistic regression model represents the class-conditional probabilities through a linear function of the predictors:

\[
\Pr(G=1\mid x) = \frac{1}{1+e^{-(\beta_0+x^\top\beta)}}, \tag{11}
\]
\[
\Pr(G=2\mid x) = \frac{1}{1+e^{+(\beta_0+x^\top\beta)}} = 1-\Pr(G=1\mid x).
\]

Alternatively, this implies that

\[
\log\frac{\Pr(G=1\mid x)}{\Pr(G=2\mid x)} = \beta_0 + x^\top\beta. \tag{12}
\]

Here we fit this model by regularized maximum (binomial) likelihood. Let p(x_i) = Pr(G = 1|x_i) be the probability (11) for observation i at a particular value for the parameters (β_0, β); then we maximize the penalized log-likelihood

\[
\max_{(\beta_0,\beta)\in\mathbb{R}^{p+1}}
\left[\frac{1}{N}\sum_{i=1}^{N}\Bigl\{ I(g_i=1)\log p(x_i) + I(g_i=2)\log\bigl(1-p(x_i)\bigr)\Bigr\}
- \lambda P_\alpha(\beta)\right]. \tag{13}
\]

Denoting y_i = I(g_i = 1), the log-likelihood part of (13) can be written in the more explicit form

\[
\ell(\beta_0,\beta) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[ y_i\,(\beta_0+x_i^\top\beta)
- \log\bigl(1+e^{\beta_0+x_i^\top\beta}\bigr)\Bigr], \tag{14}
\]

a concave function of the parameters. The Newton algorithm for maximizing the (unpenalized) log-likelihood (14) amounts to iteratively reweighted least squares. Hence if the current estimates of the parameters are (β̃_0, β̃), we form a quadratic approximation to the log-likelihood (Taylor expansion about the current estimates), which is

\[
\ell_Q(\beta_0,\beta) = -\frac{1}{2N}\sum_{i=1}^{N} w_i\,(z_i-\beta_0-x_i^\top\beta)^2
+ C(\tilde\beta_0,\tilde\beta)^2, \tag{15}
\]

where

\[
z_i = \tilde\beta_0 + x_i^\top\tilde\beta + \frac{y_i-\tilde p(x_i)}{\tilde p(x_i)\bigl(1-\tilde p(x_i)\bigr)}
\quad \text{(working response)}, \tag{16}
\]
\[
w_i = \tilde p(x_i)\bigl(1-\tilde p(x_i)\bigr) \quad \text{(weights)}, \tag{17}
\]

and p̃(x_i) is evaluated at the current parameters. The last term is constant. The Newton update is obtained by minimizing ℓ_Q.

Our approach is similar. For each value of λ, we create an outer loop which computes the quadratic approximation ℓ_Q about the current parameters (β̃_0, β̃). Then we use coordinate descent to solve the penalized weighted least-squares problem

\[
\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \;\bigl\{-\ell_Q(\beta_0,\beta) + \lambda P_\alpha(\beta)\bigr\}. \tag{18}
\]

This amounts to a sequence of nested loops (a minimal sketch in code follows the loop description):

outer loop: Decrement λ.

middle loop: Update the quadratic approximation ℓ_Q using the current parameters (β̃_0, β̃).

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (18).
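To make the middle loop concrete, here is a minimal R sketch (ours) of the quadratic approximation for the binomial model; it computes the working response and weights of (16)-(17), which are then handed to the weighted coordinate update of Section 2.4. The clamping of the probabilities mirrors the safeguard described below.

```r
# Middle loop of the binomial IRLS scheme: working response z and weights w
# of (16)-(17) at the current (beta0, beta), for a 0/1 response y.
irls_quadratic <- function(x, y, beta0, beta) {
  eta <- beta0 + as.vector(x %*% beta)
  p   <- 1 / (1 + exp(-eta))
  p   <- pmin(pmax(p, 1e-5), 1 - 1e-5)   # guard against fitted probabilities of 0 or 1
  w   <- p * (1 - p)                     # weights (17)
  z   <- eta + (y - p) / w               # working response (16)
  list(z = z, w = w)
}
```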

There are several important details in the implementation of this algorithm.

When p ≫ N, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to λ_min = ελ_max > 0.

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0 or 1. When a probability is within ε = 10⁻⁵ of 1, we set it to 1, and set the weights to ε; 0 is treated similarly.

Our code has an option to approximate the Hessian terms by an exact upper bound. This is obtained by setting the w_i in (17) all equal to 0.25 (Krishnapuram and Hartemink 2005).

We allow the response data to be supplied in the form of a two-column matrix of counts, sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.

The Newton algorithm is not guaranteed to converge without step-size optimization (Lee et al. 2006). Our code does not implement any checks for divergence; this would slow it down, and when used as recommended we do not feel it is necessary. We have a closed form expression for the starting solutions, and each subsequent solution is warm-started from the previous close-by solution, which generally makes the quadratic approximations very accurate. We have not encountered any divergence problems so far.

4. Regularized multinomial regression

When the categorical response variable G has K > 2 levels, the linear logistic regression model can be generalized to a multi-logit model. The traditional approach is to extend (12) to K − 1 logits

\[
\log\frac{\Pr(G=\ell\mid x)}{\Pr(G=K\mid x)} = \beta_{0\ell} + x^\top\beta_\ell,
\qquad \ell = 1,\ldots,K-1. \tag{19}
\]

Here β_ℓ is a p-vector of coefficients. As in Zhu and Hastie (2004), here we choose a more symmetric approach. We model

\[
\Pr(G=\ell\mid x) = \frac{e^{\beta_{0\ell}+x^\top\beta_\ell}}{\sum_{k=1}^{K} e^{\beta_{0k}+x^\top\beta_k}}. \tag{20}
\]

This parametrization is not estimable without constraints, because for any values for the parameters {β_{0ℓ}, β_ℓ}_1^K, {β_{0ℓ} − c_0, β_ℓ − c}_1^K give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.

We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let p_ℓ(x_i) = Pr(G = ℓ|x_i), and let g_i ∈ {1, 2, . . . , K} be the ith response. We maximize the penalized log-likelihood

\[
\max_{\{\beta_{0\ell},\beta_\ell\}_1^K \in \mathbb{R}^{K(p+1)}}
\left[\frac{1}{N}\sum_{i=1}^{N}\log p_{g_i}(x_i) - \lambda\sum_{\ell=1}^{K} P_\alpha(\beta_\ell)\right]. \tag{21}
\]

Denote by Y the N × K indicator response matrix, with elements y_{iℓ} = I(g_i = ℓ). Then we can write the log-likelihood part of (21) in the more explicit form

\[
\ell\bigl(\{\beta_{0\ell},\beta_\ell\}_1^K\bigr)
= \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{\ell=1}^{K} y_{i\ell}\bigl(\beta_{0\ell}+x_i^\top\beta_\ell\bigr)
- \log\Bigl(\sum_{\ell=1}^{K} e^{\beta_{0\ell}+x_i^\top\beta_\ell}\Bigr)\right]. \tag{22}
\]

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights w_i as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only (β_{0ℓ}, β_ℓ) to vary for a single class at a time. It is not hard to show that this is

\[
\ell_{Q\ell}(\beta_{0\ell},\beta_\ell)
= -\frac{1}{2N}\sum_{i=1}^{N} w_{i\ell}\,(z_{i\ell}-\beta_{0\ell}-x_i^\top\beta_\ell)^2
+ C\bigl(\{\tilde\beta_{0k},\tilde\beta_k\}_1^K\bigr), \tag{23}
\]

where as before

\[
z_{i\ell} = \tilde\beta_{0\ell} + x_i^\top\tilde\beta_\ell
+ \frac{y_{i\ell}-\tilde p_\ell(x_i)}{\tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr)}, \tag{24}
\]
\[
w_{i\ell} = \tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr). \tag{25}
\]

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ we create an outer loop which cycles over ℓ and computes the partial quadratic approximation ℓ_{Qℓ} about the current parameters (β̃_0, β̃). Then we use coordinate descent to solve the penalized weighted least-squares problem

\[
\min_{(\beta_{0\ell},\beta_\ell)\in\mathbb{R}^{p+1}}
\;\bigl\{-\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) + \lambda P_\alpha(\beta_\ell)\bigr\}. \tag{26}
\]

This amounts to the sequence of nested loops (a sketch of the class-wise middle loop appears after the list):

outer loop: Decrement λ.

middle loop (outer): Cycle over ℓ ∈ {1, 2, . . . , K, 1, 2, . . .}.

middle loop (inner): Update the quadratic approximation ℓ_{Qℓ} using the current parameters {β̃_{0k}, β̃_k}_1^K.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (26).
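A minimal R sketch (ours) of the middle loop (inner) step: for a given class ℓ it forms the working response and weights of (24)-(25), which are then passed to the same weighted coordinate-descent solver as in the two-class case. Y and Beta are assumed to hold the indicator response matrix and the current p × K coefficient matrix.

```r
# Partial quadratic approximation (23)-(25) for class l.
partial_quadratic <- function(x, Y, beta0, Beta, l) {
  eta <- sweep(x %*% Beta, 2, beta0, "+")     # N x K linear predictors
  P   <- exp(eta) / rowSums(exp(eta))         # class probabilities (20)
  p_l <- pmin(pmax(P[, l], 1e-5), 1 - 1e-5)   # same safeguard as in the binomial case
  w   <- p_l * (1 - p_l)                      # weights (25)
  z   <- eta[, l] + (Y[, l] - p_l) / w        # working response (24)
  list(z = z, w = w)
}
```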

4.1. Regularization and parameter ambiguity

As was pointed out earlier, if {β_{0ℓ}, β_ℓ}_1^K characterizes a fitted model for (20), then {β_{0ℓ} − c_0, β_ℓ − c}_1^K gives an identical fit (c is a p-vector). Although this means that the log-likelihood part of (21) is insensitive to (c_0, c), the penalty is not. In particular, we can always improve an estimate {β_{0ℓ}, β_ℓ}_1^K (with respect to (21)) by solving

\[
\min_{c\in\mathbb{R}^{p}} \sum_{\ell=1}^{K} P_\alpha(\beta_\ell - c). \tag{27}
\]

This can be done separately for each coordinate, hence

\[
c_j = \arg\min_t \sum_{\ell=1}^{K}\left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2
+ \alpha|\beta_{j\ell}-t|\right]. \tag{28}
\]

Theorem 1. Consider problem (28) for values α ∈ [0, 1]. Let β̄_j be the mean of the β_{jℓ}, and β^M_j a median of the β_{jℓ} (and for simplicity assume β̄_j ≤ β^M_j). Then we have

\[
c_j \in [\bar\beta_j,\ \beta^M_j], \tag{29}
\]

with the left endpoint achieved if α = 0 and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28). The objective is piecewise quadratic, with knots defined by the β_{jℓ}. We need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j after each inner middle loop step, using the solution c_j for each j.

Not all the parameters in our model are regularized. The intercepts β_{0ℓ} are not, and with our penalty modifiers γ_j (Section 2.6), others need not be as well. For these parameters we use mean centering.
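As an illustration of this recentering (a minimal sketch; unlike the exact piecewise-quadratic search just described, it simply calls R's generic optimize() over the interval guaranteed by Theorem 1), the value c_j for one coordinate could be computed as follows.

```r
# Recentering value c_j of (28) for one coordinate j.
# b holds (beta_j1, ..., beta_jK); by Theorem 1 the minimizer lies between
# the mean and a median of b, so a one-dimensional convex search suffices.
solve_cj <- function(b, alpha) {
  obj <- function(t) sum(0.5 * (1 - alpha) * (b - t)^2 + alpha * abs(b - t))
  lo <- min(mean(b), median(b))
  hi <- max(mean(b), median(b))
  if (lo == hi) return(lo)
  optimize(obj, interval = c(lo, hi))$minimum
}

solve_cj(c(0.2, -0.1, 0.4, 0.7), alpha = 0.5)
```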

4.2. Grouped and matrix responses

As in the two-class case, the data can be presented in the form of an N × K matrix m_{iℓ} of non-negative numbers. For example, if the data are grouped: at each x_i we have a number of multinomial samples, with m_{iℓ} falling into category ℓ. In this case we divide each row by the row-sum m_i = ∑_ℓ m_{iℓ}, and produce our response matrix y_{iℓ} = m_{iℓ}/m_i; m_i becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using y_{iℓ} just defined). The weights in (25) get augmented with the observation weight m_i:

\[
w_{i\ell} = m_i\,\tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr). \tag{30}
\]

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.

5. Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for the elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1. Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors X_j, X_{j′} having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

\[
Y = \sum_{j=1}^{p} X_j \beta_j + k\cdot Z, \tag{31}
\]

where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1) and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
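A sketch of this simulation design in R follows; gen_data is our own helper, and since the exact signal-to-noise convention is not spelled out above, we take it to be the ratio of signal variance to noise variance.

```r
# Equicorrelated Gaussian predictors, coefficients as in (31), and noise
# scaled to a target signal-to-noise ratio (assumed to be a variance ratio).
gen_data <- function(N, p, rho, snr = 3.0) {
  z_common <- rnorm(N)
  x <- sqrt(rho) * matrix(z_common, N, p) +
       sqrt(1 - rho) * matrix(rnorm(N * p), N, p)      # pairwise correlation rho
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)      # alternating, decaying coefficients
  signal <- as.vector(x %*% beta)
  k <- sqrt(var(signal) / snr)                         # noise scale for the target SNR
  list(x = x, y = signal + k * rnorm(N), beta = beta)
}
```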

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular, the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case, lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2. Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the same 100 λ values for all.


Linear regression – Dense features

                                       Correlation
                           0     0.1     0.2     0.5     0.9    0.95
N = 1000, p = 100
glmnet (type = naive)   0.05    0.06    0.06    0.09    0.08    0.07
glmnet (type = cov)     0.02    0.02    0.02    0.02    0.02    0.02
lars                    0.11    0.11    0.11    0.11    0.11    0.11

N = 5000, p = 100
glmnet (type = naive)   0.24    0.25    0.26    0.34    0.32    0.31
glmnet (type = cov)     0.05    0.05    0.05    0.05    0.05    0.05
lars                    0.29    0.29    0.29    0.30    0.29    0.29

N = 100, p = 1000
glmnet (type = naive)   0.04    0.05    0.04    0.05    0.04    0.03
glmnet (type = cov)     0.07    0.08    0.07    0.08    0.04    0.03
lars                    0.73    0.72    0.68    0.71    0.71    0.67

N = 100, p = 5000
glmnet (type = naive)   0.20    0.18    0.21    0.23    0.21    0.14
glmnet (type = cov)     0.46    0.42    0.51    0.48    0.25    0.10
lars                    3.73    3.53    3.59    3.47    3.90    3.52

N = 100, p = 20000
glmnet (type = naive)   1.00    0.99    1.06    1.29    1.17    0.97
glmnet (type = cov)     1.86    2.26    2.34    2.59    1.24    0.79
lars                   18.30   17.90   16.90   18.03   17.91   16.39

N = 100, p = 50000
glmnet (type = naive)   2.66    2.46    2.84    3.53    3.39    2.43
glmnet (type = cov)     5.50    4.92    6.13    7.35    4.52    2.53
lars                   58.68   64.00   64.79   58.20   66.39   79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

Table 2 shows the results; in some cases we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner; it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.

Logistic regression – Dense features

                             Correlation
               0      0.1      0.2      0.5      0.9     0.95
N = 1000, p = 100
glmnet       1.65     1.81     2.31     3.87     5.99     8.48
l1logreg    31.475   31.86    34.35    32.21    31.85    31.81
BBR         40.70    47.57    54.18    70.06   106.72   121.41
LPL         24.68    31.64    47.99   170.77   741.00  1448.25

N = 5000, p = 100
glmnet       7.89     8.48     9.01    13.39    26.68    26.36
l1logreg   239.88   232.00   229.62   229.49   221.9    223.09

N = 100,000, p = 100
glmnet      78.56   178.45   205.94   274.33   552.48   638.50

N = 100, p = 1000
glmnet       1.06     1.07     1.09     1.45     1.72     1.37
l1logreg    25.99    26.40    25.67    26.49    24.34    20.16
BBR         70.19    71.19    78.40   103.77   149.05   113.87
LPL         11.02    10.87    10.76    16.34    41.84    70.50

N = 100, p = 5000
glmnet       5.24     4.43     5.12     7.05     7.87     6.05
l1logreg   165.02   161.90   163.25   166.50   151.91   135.28

N = 100, p = 100,000
glmnet     137.27   139.40   146.55   197.98   219.65   201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for ten-fold cross-validation over a grid of 100 λ values.

5.3. Real data

Table 4 shows some timing results for four different datasets:

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

Logistic regression – Sparse features

                            Correlation
               0      0.1      0.2      0.5      0.9     0.95
N = 1000, p = 100
glmnet       0.77     0.74     0.72     0.73     0.84     0.88
l1logreg     5.19     5.21     5.14     5.40     6.14     6.26
BBR          2.01     1.95     1.98     2.06     2.73     2.88

N = 100, p = 1000
glmnet       1.81     1.73     1.55     1.70     1.63     1.55
l1logreg     7.67     7.72     7.64     9.04     9.81     9.40
BBR          4.66     4.58     4.68     5.15     5.78     5.53

N = 10,000, p = 100
glmnet       3.21     3.02     2.95     3.25     4.58     5.08
l1logreg    45.87    46.63    44.33    43.99    45.60    43.16
BBR         11.80    11.64    11.58    13.30    12.46    11.83

N = 100, p = 10,000
glmnet      10.18    10.35     9.93    10.04     9.02     8.91
l1logreg   130.27   124.88   124.18   129.84   137.21   159.54
BBR         45.72    47.50    47.46    48.49    56.29    60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4. Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here. Two referees of an earlier draft of this paper suggested two methods of which we were not aware.

Name          Type       N       p        glmnet    l1logreg   BBR/BMR

Dense
Cancer        14 class   144     16063    2.5 mins             2.1 hrs
Leukemia      2 class    72      3571     2.50      55.0       450

Sparse
InternetAd    2 class    2359    1430     5.0       20.9       34.7
NewsGroup     2 class    11314   777811   2 mins               3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λ_min = 0.05λ_max.

            MacBook Pro    HP Linux server

glmnet         0.34             0.13
penalized     10.31
OWL-QN                        314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case:

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized objectives (Andrew and Gao 2007a,b). The software is written in C++ and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

6. Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.

[Figure 2: three panels of ten-fold cross-validation curves plotted against log(λ): mean squared error (Gaussian family), binomial deviance and misclassification error (binomial family); the top axis of each panel shows the number of nonzero coefficients.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response; the second row is logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule). The top of each plot is annotated with the size of the models.

Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way. Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
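For reference, this selection can be carried out with the cross-validation function shipped with the package; the sketch below uses function and argument names as we know them from glmnet and cv.glmnet, which may differ slightly across package versions.

```r
# Ten-fold cross-validation over the lambda path for a binomial model,
# followed by model selection via the minimum and one-standard-error rules.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(1000 * 100), 1000, 100)
y <- rbinom(1000, 1, plogis(x[, 1:10] %*% rep(0.5, 10)))   # response depends on 10 predictors

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                   type.measure = "deviance", nfolds = 10)
plot(cvfit)                         # curves like those in Figure 2
coef(cvfit, s = "lambda.min")       # minimum cross-validated error
coef(cvfit, s = "lambda.1se")       # one-standard-error rule
```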

7. Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed for both large N and large p is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

    Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article. Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

    References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.

Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.

Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software, URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Bühlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.

A. Proof of Theorem 1

We have

\[
c_j = \arg\min_t \sum_{\ell=1}^{K}\left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2
+ \alpha|\beta_{j\ell}-t|\right]. \tag{32}
\]

Suppose α ∈ (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

\[
\sum_{\ell=1}^{K}\left[-(1-\alpha)(\beta_{j\ell}-t) - \alpha s_{j\ell}\right] = 0, \tag{33}
\]

where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t, and s_{jℓ} ∈ [−1, 1] otherwise. This gives

\[
t = \bar\beta_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha}\sum_{\ell=1}^{K} s_{j\ell}. \tag{34}
\]

It follows that t cannot be larger than β^M_j, since then the second term above would be negative, and this would imply that t is less than β̄_j. Similarly t cannot be less than β̄_j, since then the second term above would have to be negative, implying that t is larger than β^M_j.

Affiliation:

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software, http://www.jstatsoft.org/, published by the American Statistical Association, http://www.amstat.org/.
Volume 33, Issue 1, January 2010. Submitted 2009-04-22, Accepted 2009-12-15.

    • Introduction
    • Algorithms for the lasso ridge regression and elastic net
      • Naive updates
      • Covariance updates
      • Sparse updates
      • Weighted updates
      • Pathwise coordinate descent
      • Other details
        • Regularized logistic regression
        • Regularized multinomial regression
          • Regularization and parameter ambiguity
          • Grouped and matrix responses
            • Timings
              • Regression with the lasso
              • Lasso-logistic regression
              • Real data
              • Other comparisons
                • Selecting the tuning parameters
                • Discussion
                • Proof of Theorem 1

      Journal of Statistical Software 3

      other members of the exponential family such as the Poisson model We plan to extend oursoftware to cover these important other cases as well as the Cox model for survival data

      Note that this article is about algorithms for fitting particular families of models and notabout the statistical properties of these models themselves Such discussions have taken placeelsewhere

      2 Algorithms for the lasso ridge regression and elastic net

      We consider the usual setup for linear regression We have a response variable Y isin R anda predictor vector X isin Rp and we approximate the regression function by a linear modelE(Y |X = x) = β0 + xgtβ We have N observation pairs (xi yi) For simplicity we assumethe xij are standardized

      sumNi=1 xij = 0 1

      N

      sumNi=1 x

      2ij = 1 for j = 1 p Our algorithms

      generalize naturally to the unstandardized case The elastic net solves the following problem

      min(β0β)isinRp+1

      Rλ(β0 β) = min(β0β)isinRp+1

      [1

      2N

      Nsumi=1

      (yi minus β0 minus xgti β)2 + λPα(β)

      ] (1)

      where

      Pα(β) = (1minus α)12||β||2`2 + α||β||`1 (2)

      =psumj=1

      [12(1minus α)β2

      j + α|βj |] (3)

      Pα is the elastic-net penalty (Zou and Hastie 2005) and is a compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1) This penalty is particularly usefulin the p N situation or any situation where there are many correlated predictor variables1

      Ridge regression is known to shrink the coefficients of correlated predictors towards eachother allowing them to borrow strength from each other In the extreme case of k identicalpredictors they each get identical coefficients with 1kth the size that any single one wouldget if fit alone From a Bayesian point of view the ridge penalty is ideal if there are manypredictors and all have non-zero coefficients (drawn from a Gaussian distribution)

      Lasso on the other hand is somewhat indifferent to very correlated predictors and will tendto pick one and ignore the rest In the extreme case above the lasso problem breaks downThe lasso penalty corresponds to a Laplace prior which expects many coefficients to be closeto zero and a small subset to be larger and nonzero

      The elastic net with α = 1minusε for some small ε gt 0 performs much like the lasso but removesany degeneracies and wild behavior caused by extreme correlations More generally the entirefamily Pα creates a useful compromise between ridge and lasso As α increases from 0 to 1for a given λ the sparsity of the solution to (1) (ie the number of coefficients equal to zero)increases monotonically from 0 to the sparsity of the lasso solution

      Figure 1 shows an example that demonstrates the effect of varying α The dataset is from(Golub et al 1999) consisting of 72 observations on 3571 genes measured with DNA microar-rays The observations fall in two classes so we use the penalties in conjunction with the

      1Zou and Hastie (2005) called this penalty the naive elastic net and preferred a rescaled version which theycalled elastic net We drop this distinction here

      4 Regularization Paths for GLMs via Coordinate Descent

      Figure 1 Leukemia data profiles of estimated coefficients for three methods showing onlyfirst 10 steps (values for λ) in each case For the elastic net α = 02

      Journal of Statistical Software 5

      logistic regression models of Section 3 The coefficient profiles from the first 10 steps (gridvalues for λ) for each of the three regularization methods are shown The lasso penalty admitsat most N = 72 genes into the model while ridge regression gives all 3571 genes non-zerocoefficients The elastic-net penalty provides a compromise between these two and has theeffect of averaging genes that are highly correlated and then entering the averaged gene intothe model Using the algorithm described below computation of the entire path of solutionsfor each method at 100 values of the regularization parameter evenly spaced on the log-scaletook under a second in total Because of the large number of non-zero coefficients for theridge penalty they are individually much smaller than the coefficients for the other methods

      Consider a coordinate descent step for solving (1) That is suppose we have estimates β0 andβ` for ` 6= j and we wish to partially optimize with respect to βj We would like to computethe gradient at βj = βj which only exists if βj 6= 0 If βj gt 0 then

      partRλpartβj|β=β = minus 1

      N

      Nsumi=1

      xij(yi minus βo minus xgti β) + λ(1minus α)βj + λα (4)

      A similar expression exists if βj lt 0 and βj = 0 is treated separately Simple calculus shows(Donoho and Johnstone 1994) that the coordinate-wise update has the form

      βj larrS(

      1N

      sumNi=1 xij(yi minus y

      (j)i ) λα

      )1 + λ(1minus α)

      (5)

      where

      y(j)i = β0 +

      sum6=j xi`β` is the fitted value excluding the contribution from xij and

      hence yi minus y(j)i the partial residual for fitting βj Because of the standardization

      1N

      sumNi=1 xij(yi minus y

      (j)i ) is the simple least-squares coefficient when fitting this partial

      residual to xij

      S(z γ) is the soft-thresholding operator with value

      sign(z)(|z| minus γ)+ =

      z minus γ if z gt 0 and γ lt |z|z + γ if z lt 0 and γ lt |z|0 if γ ge |z|

      (6)

      The details of this derivation are spelled out in Friedman et al (2007)

      Thus we compute the simple least-squares coefficient on the partial residual apply soft-thresholding to take care of the lasso contribution to the penalty and then apply a pro-portional shrinkage for the ridge penalty This algorithm was suggested by Van der Kooij(2007)

      21 Naive updates

      Looking more closely at (5) we see that

      yi minus y(j)i = yi minus yi + xij βj

      = ri + xij βj (7)

      6 Regularization Paths for GLMs via Coordinate Descent

      where yi is the current fit of the model for observation i and hence ri the current residualThus

      1N

      Nsumi=1

      xij(yi minus y(j)i ) =

      1N

      Nsumi=1

      xijri + βj (8)

      because the xj are standardized The first term on the right-hand side is the gradient ofthe loss with respect to βj It is clear from (8) why coordinate descent is computationallyefficient Many coefficients are zero remain zero after the thresholding and so nothing needsto be changed Such a step costs O(N) operationsmdash the sum to compute the gradient Onthe other hand if a coefficient does change after the thresholding ri is changed in O(N) andthe step costs O(2N) Thus a complete cycle through all p variables costs O(pN) operationsWe refer to this as the naive algorithm since it is generally less efficient than the covarianceupdating algorithm to follow Later we use these algorithms in the context of iterativelyreweighted least squares (IRLS) where the observation weights change frequently there thenaive algorithm dominates

      22 Covariance updates

      Further efficiencies can be achieved in computing the updates in (8) We can write the firstterm on the right (up to a factor 1N) as

      Nsumi=1

      xijri = 〈xj y〉 minussum

      k|βk|gt0

      〈xj xk〉βk (9)

      where 〈xj y〉 =sumN

      i=1 xijyi Hence we need to compute inner products of each feature with yinitially and then each time a new feature xk enters the model (for the first time) we needto compute and store its inner product with all the rest of the features (O(Np) operations)We also store the p gradient components (9) If one of the coefficients currently in the modelchanges we can update each gradient in O(p) operations Hence with m non-zero terms inthe model a complete cycle costs O(pm) operations if no new variables become non-zero andcosts O(Np) for each new variable entered Importantly O(N) calculations do not have tobe made at every step This is the case for all penalized procedures with squared error loss

      23 Sparse updates

      We are sometimes faced with problems where the Ntimesp feature matrix X is extremely sparseA leading example is from document classification where the feature vector uses the so-called ldquobag-of-wordsrdquo model Each document is scored for the presenceabsence of each ofthe words in the entire dictionary under consideration (sometimes counts are used or sometransformation of counts) Since most words are absent the feature vector for each documentis mostly zero and so the entire matrix is mostly zero We store such matrices efficiently insparse column format where we store only the non-zero entries and the coordinates wherethey occur

      Coordinate descent is ideally set up to exploit such sparsity in an obvious way The O(N)inner-product operations in either the naive or covariance updates can exploit the sparsityby summing over only the non-zero entries Note that in this case scaling of the variables will

      Journal of Statistical Software 7

      not alter the sparsity but centering will So scaling is performed up front but the centeringis incorporated in the algorithm in an efficient and obvious manner

      24 Weighted updates

      Often a weight wi (other than 1N) is associated with each observation This will arisenaturally in later sections where observations receive weights in the IRLS algorithm In thiscase the update step (5) becomes only slightly more complicated

      βj larrS(sumN

      i=1wixij(yi minus y(j)i ) λα

      )sumN

      i=1wix2ij + λ(1minus α)

      (10)

      If the xj are not standardized there is a similar sum-of-squares term in the denominator(even without weights) The presence of weights does not change the computational costs ofeither algorithm much as long as the weights remain fixed

      25 Pathwise coordinate descent

      We compute the solutions for a decreasing sequence of values for λ starting at the smallestvalue λmax for which the entire vector β = 0 Apart from giving us a path of solutions thisscheme exploits warm starts and leads to a more stable algorithm We have examples whereit is faster to compute the path down to λ (for small λ) than the solution only at that valuefor λ

      When β = 0 we see from (5) that βj will stay zero if 1N |〈xj y〉| lt λα Hence Nαλmax =

      max` |〈x` y〉| Our strategy is to select a minimum value λmin = ελmax and construct asequence of K values of λ decreasing from λmax to λmin on the log scale Typical values areε = 0001 and K = 100

      26 Other details

      Irrespective of whether the variables are standardized to have variance 1 we always centereach predictor variable Since the intercept is not regularized this means that β0 = y themean of the yi for all values of α and λ

      It is easy to allow different penalties λj for each of the variables We implement this via apenalty scaling parameter γj ge 0 If γj gt 0 then the penalty applied to βj is λj = λγj If γj = 0 that variable does not get penalized and always enters the model unrestricted atthe first step and remains in the model Penalty rescaling would also allow for example oursoftware to be used to implement the adaptive lasso (Zou 2006)

      Considerable speedup is obtained by organizing the iterations around the active set of featuresmdashthose with nonzero coefficients After a complete cycle through all the variables we iterateon only the active set till convergence If another complete cycle does not change the activeset we are done otherwise the process is repeated Active-set convergence is also mentionedin Meier et al (2008) and Krishnapuram and Hartemink (2005)

      8 Regularization Paths for GLMs via Coordinate Descent

      3 Regularized logistic regression

      When the response variable is binary the linear logistic regression model is often used Denoteby G the response variable taking values in G = 1 2 (the labeling of the elements isarbitrary) The logistic regression model represents the class-conditional probabilities througha linear function of the predictors

      Pr(G = 1|x) =1

      1 + eminus(β0+xgtβ) (11)

      Pr(G = 2|x) =1

      1 + e+(β0+xgtβ)

      = 1minus Pr(G = 1|x)

      Alternatively this implies that

      logPr(G = 1|x)Pr(G = 2|x)

      = β0 + xgtβ (12)

      Here we fit this model by regularized maximum (binomial) likelihood Let p(xi) = Pr(G =1|xi) be the probability (11) for observation i at a particular value for the parameters (β0 β)then we maximize the penalized log likelihood

      max(β0β)isinRp+1

      [1N

      Nsumi=1

      I(gi = 1) log p(xi) + I(gi = 2) log(1minus p(xi))

      minus λPα(β)

      ] (13)

      Denoting yi = I(gi = 1) the log-likelihood part of (13) can be written in the more explicitform

      `(β0 β) =1N

      Nsumi=1

      yi middot (β0 + xgti β)minus log(1 + e(β0+xgti β)) (14)

      a concave function of the parameters The Newton algorithm for maximizing the (unpe-nalized) log-likelihood (14) amounts to iteratively reweighted least squares Hence if thecurrent estimates of the parameters are (β0 β) we form a quadratic approximation to thelog-likelihood (Taylor expansion about current estimates) which is

      `Q(β0 β) = minus 12N

      Nsumi=1

      wi(zi minus β0 minus xgti β)2 + C(β0 β)2 (15)

      where

      zi = β0 + xgti β +yi minus p(xi)

      p(xi)(1minus p(xi)) (working response) (16)

      wi = p(xi)(1minus p(xi)) (weights) (17)

      and p(xi) is evaluated at the current parameters The last term is constant The Newtonupdate is obtained by minimizing `QOur approach is similar For each value of λ we create an outer loop which computes thequadratic approximation `Q about the current parameters (β0 β) Then we use coordinatedescent to solve the penalized weighted least-squares problem

      min(β0β)isinRp+1

      minus`Q(β0 β) + λPα(β) (18)

      This amounts to a sequence of nested loops

      Journal of Statistical Software 9

      outer loop Decrement λ

      middle loop Update the quadratic approximation `Q using the current parameters (β0 β)

      inner loop Run the coordinate descent algorithm on the penalized weighted-least-squaresproblem (18)

      There are several important details in the implementation of this algorithm

      When p N one cannot run λ all the way to zero because the saturated logisticregression fit is undefined (parameters wander off to plusmninfin in order to achieve probabilitiesof 0 or 1) Hence the default λ sequence runs down to λmin = ελmax gt 0

      Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0or 1 When a probability is within ε = 10minus5 of 1 we set it to 1 and set the weights toε 0 is treated similarly

      Our code has an option to approximate the Hessian terms by an exact upper-boundThis is obtained by setting the wi in (17) all equal to 025 (Krishnapuram and Hartemink2005)

      We allow the response data to be supplied in the form of a two-column matrix of countssometimes referred to as grouped data We discuss this in more detail in Section 42

      The Newton algorithm is not guaranteed to converge without step-size optimization(Lee et al 2006) Our code does not implement any checks for divergence this wouldslow it down and when used as recommended we do not feel it is necessary We havea closed form expression for the starting solutions and each subsequent solution iswarm-started from the previous close-by solution which generally makes the quadraticapproximations very accurate We have not encountered any divergence problems sofar

      4 Regularized multinomial regression

      When the categorical response variable G has K gt 2 levels the linear logistic regressionmodel can be generalized to a multi-logit model The traditional approach is to extend (12)to K minus 1 logits

      logPr(G = `|x)Pr(G = K|x)

      = β0` + xgtβ` ` = 1 K minus 1 (19)

      Here β` is a p-vector of coefficients As in Zhu and Hastie (2004) here we choose a moresymmetric approach We model

      Pr(G = `|x) =eβ0`+x

      gtβ`sumKk=1 e

      β0k+xgtβk

      (20)

      This parametrization is not estimable without constraints because for any values for theparameters β0` β`K1 β0` minus c0 β` minus cK1 give identical probabilities (20) Regularizationdeals with this ambiguity in a natural way see Section 41 below


We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let pℓ(x_i) = Pr(G = ℓ | x_i), and let g_i ∈ {1, 2, . . . , K} be the ith response. We maximize the penalized log-likelihood

\max_{\{\beta_{0\ell}, \beta_\ell\}_1^K \in \mathbb{R}^{K(p+1)}} \left[ \frac{1}{N} \sum_{i=1}^{N} \log p_{g_i}(x_i) - \lambda \sum_{\ell=1}^{K} P_\alpha(\beta_\ell) \right].    (21)

Denote by Y the N × K indicator response matrix, with elements y_{iℓ} = I(g_i = ℓ). Then we can write the log-likelihood part of (21) in the more explicit form

\ell(\{\beta_{0\ell}, \beta_\ell\}_1^K) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{\ell=1}^{K} y_{i\ell} (\beta_{0\ell} + x_i^\top \beta_\ell) - \log\left( \sum_{\ell=1}^{K} e^{\beta_{0\ell} + x_i^\top \beta_\ell} \right) \right].    (22)

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights w_i as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only (β0ℓ, βℓ) to vary for a single class at a time. It is not hard to show that this is

\ell_{Q\ell}(\beta_{0\ell}, \beta_\ell) = -\frac{1}{2N} \sum_{i=1}^{N} w_{i\ell} (z_{i\ell} - \beta_{0\ell} - x_i^\top \beta_\ell)^2 + C(\{\tilde\beta_{0k}, \tilde\beta_k\}_1^K),    (23)

where, as before,

z_{i\ell} = \tilde\beta_{0\ell} + x_i^\top \tilde\beta_\ell + \frac{y_{i\ell} - p_\ell(x_i)}{p_\ell(x_i)(1 - p_\ell(x_i))},    (24)

w_{i\ell} = p_\ell(x_i)(1 - p_\ell(x_i)).    (25)

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ we create an outer loop which cycles over ℓ and computes the partial quadratic approximation \ell_{Qℓ} about the current parameters (\tilde\beta_0, \tilde\beta). Then we use coordinate descent to solve the penalized weighted least-squares problem

\min_{(\beta_{0\ell}, \beta_\ell) \in \mathbb{R}^{p+1}} \left\{ -\ell_{Q\ell}(\beta_{0\ell}, \beta_\ell) + \lambda P_\alpha(\beta_\ell) \right\}.    (26)

This amounts to the sequence of nested loops:

outer loop: Decrement λ.

middle loop (outer): Cycle over ℓ ∈ {1, 2, . . . , K, 1, 2, . . .}.

middle loop (inner): Update the quadratic approximation \ell_{Qℓ} using the current parameters {\tilde\beta_{0k}, \tilde\beta_k}₁ᴷ.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (26).


4.1 Regularization and parameter ambiguity

As was pointed out earlier, if {β0ℓ, βℓ}₁ᴷ characterizes a fitted model for (20), then {β0ℓ − c0, βℓ − c}₁ᴷ gives an identical fit (c is a p-vector). Although this means that the log-likelihood part of (21) is insensitive to (c0, c), the penalty is not. In particular, we can always improve an estimate {β0ℓ, βℓ}₁ᴷ (with respect to (21)) by solving

\min_{c \in \mathbb{R}^p} \sum_{\ell=1}^{K} P_\alpha(\beta_\ell - c).    (27)

This can be done separately for each coordinate, hence

c_j = \arg\min_t \sum_{\ell=1}^{K} \left[ \tfrac{1}{2}(1 - \alpha)(\beta_{j\ell} - t)^2 + \alpha |\beta_{j\ell} - t| \right].    (28)

Theorem 1. Consider problem (28) for values α ∈ [0, 1]. Let \bar\beta_j be the mean of the β_{jℓ}, and \beta_j^M a median of the β_{jℓ} (and for simplicity assume \bar\beta_j ≤ \beta_j^M). Then we have

c_j \in [\bar\beta_j, \beta_j^M],    (29)

with the left endpoint achieved if α = 0 and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28): the objective is piecewise quadratic, with knots defined by the β_{jℓ}, and we need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j after each inner middle loop step, using the solution c_j for each j.
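A minimal sketch of such a search in R (our own helper, not code from the package): for α strictly between 0 and 1 the stationary condition on each interval follows (34), and the minimum over the interval bracketed by the mean and a median is found by checking the knots and any interior stationary points.

## Simple search for c_j in (28); an illustration only.
solve_cj <- function(beta, alpha) {
  if (alpha == 0) return(mean(beta))                  # left endpoint of (29)
  if (alpha == 1) return(median(beta))                # right endpoint of (29)
  obj <- function(t) sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))
  lo <- min(mean(beta), median(beta)); hi <- max(mean(beta), median(beta))
  knots <- sort(unique(c(lo, hi, beta[beta > lo & beta < hi])))
  cand <- knots
  for (k in seq_len(length(knots) - 1)) {
    s <- sign(beta - (knots[k] + knots[k + 1]) / 2)   # signs are constant on the interval
    tstar <- mean(beta) + (alpha / (1 - alpha)) * mean(s)  # stationary point, as in (34)
    if (tstar > knots[k] && tstar < knots[k + 1]) cand <- c(cand, tstar)
  }
  cand[which.min(vapply(cand, obj, numeric(1)))]
}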

Not all the parameters in our model are regularized. The intercepts β0ℓ are not, and with our penalty modifiers γ_j (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2 Grouped and matrix responses

As in the two-class case, the data can be presented in the form of an N × K matrix m_{iℓ} of non-negative numbers. For example, if the data are grouped: at each x_i we have a number of multinomial samples, with m_{iℓ} falling into category ℓ. In this case we divide each row by the row-sum m_i = Σ_ℓ m_{iℓ}, and produce our response matrix y_{iℓ} = m_{iℓ}/m_i; m_i becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using y_{iℓ} just defined). The weights in (25) get augmented with the observation weight m_i:

w_{i\ell} = m_i \, p_\ell(x_i)(1 - p_\ell(x_i)).    (30)

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.


      5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1 Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors X_j, X_{j′} having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

Y = \sum_{j=1}^{p} X_j \beta_j + k \cdot Z,    (31)

where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1), and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
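A sketch of this simulation design in R (our own translation of (31); we take the signal-to-noise ratio to mean sd(signal)/sd(noise), which is an assumption of the illustration):

## Gaussian simulation of (31): equicorrelated predictors, decaying coefficients.
set.seed(1)
N <- 1000; p <- 100; rho <- 0.5
u <- rnorm(N)                                      # shared factor inducing correlation rho
x <- sqrt(rho) * u + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)    # alternating, exponentially decaying
signal <- drop(x %*% beta)
k <- sd(signal) / 3                                # assumes SNR = sd(signal)/sd(noise) = 3
y <- signal + k * rnorm(N)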

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular, the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2 Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.
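Continuing the previous sketch (it reuses the x and y generated above; the rbinom step and the timing wrapper are our own illustration of the setup, not the benchmarking code used for the tables):

## Two-class outcome derived from the continuous y of (31), then a lasso path.
library(glmnet)
pr <- 1 / (1 + exp(-y))                                 # Pr(z = 1)
z  <- rbinom(length(y), size = 1, prob = pr)
system.time(fit <- glmnet(x, z, family = "binomial"))   # 100 lambda values by default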

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the


Linear regression – Dense features

                                      Correlation
                           0      0.1    0.2    0.5    0.9    0.95

N = 1000, p = 100
glmnet (type = naive)     0.05   0.06   0.06   0.09   0.08   0.07
glmnet (type = cov)       0.02   0.02   0.02   0.02   0.02   0.02
lars                      0.11   0.11   0.11   0.11   0.11   0.11

N = 5000, p = 100
glmnet (type = naive)     0.24   0.25   0.26   0.34   0.32   0.31
glmnet (type = cov)       0.05   0.05   0.05   0.05   0.05   0.05
lars                      0.29   0.29   0.29   0.30   0.29   0.29

N = 100, p = 1000
glmnet (type = naive)     0.04   0.05   0.04   0.05   0.04   0.03
glmnet (type = cov)       0.07   0.08   0.07   0.08   0.04   0.03
lars                      0.73   0.72   0.68   0.71   0.71   0.67

N = 100, p = 5000
glmnet (type = naive)     0.20   0.18   0.21   0.23   0.21   0.14
glmnet (type = cov)       0.46   0.42   0.51   0.48   0.25   0.10
lars                      3.73   3.53   3.59   3.47   3.90   3.52

N = 100, p = 20000
glmnet (type = naive)     1.00   0.99   1.06   1.29   1.17   0.97
glmnet (type = cov)       1.86   2.26   2.34   2.59   1.24   0.79
lars                     18.30  17.90  16.90  18.03  17.91  16.39

N = 100, p = 50000
glmnet (type = naive)     2.66   2.46   2.84   3.53   3.39   2.43
glmnet (type = cov)       5.50   4.92   6.13   7.35   4.52   2.53
lars                     58.68  64.00  64.79  58.20  66.39  79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

same 100 λ values for all. Table 2 shows the results; in some cases we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

                          Correlation
              0       0.1     0.2     0.5     0.9     0.95

N = 1000, p = 100
glmnet       1.65    1.81    2.31    3.87    5.99    8.48
l1logreg   314.75   31.86   34.35   32.21   31.85   31.81
BBR         40.70   47.57   54.18   70.06  106.72  121.41
LPL         24.68   31.64   47.99  170.77  741.00 1448.25

N = 5000, p = 100
glmnet       7.89    8.48    9.01   13.39   26.68   26.36
l1logreg   239.88  232.00  229.62  229.49   22.19  223.09

N = 100 000, p = 100
glmnet      78.56  178.45  205.94  274.33  552.48  638.50

N = 100, p = 1000
glmnet       1.06    1.07    1.09    1.45    1.72    1.37
l1logreg    25.99   26.40   25.67   26.49   24.34   20.16
BBR         70.19   71.19   78.40  103.77  149.05  113.87
LPL         11.02   10.87   10.76   16.34   41.84   70.50

N = 100, p = 5000
glmnet       5.24    4.43    5.12    7.05    7.87    6.05
l1logreg   165.02  161.90  163.25  166.50  151.91  135.28

N = 100, p = 100 000
glmnet     137.27  139.40  146.55  197.98  219.65  201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for ten-fold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets:

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

                       Correlation
              0      0.1     0.2     0.5     0.9    0.95

N = 1000, p = 100
glmnet       0.77    0.74    0.72    0.73    0.84    0.88
l1logreg     5.19    5.21    5.14    5.40    6.14    6.26
BBR          2.01    1.95    1.98    2.06    2.73    2.88

N = 100, p = 1000
glmnet       1.81    1.73    1.55    1.70    1.63    1.55
l1logreg     7.67    7.72    7.64    9.04    9.81    9.40
BBR          4.66    4.58    4.68    5.15    5.78    5.53

N = 10 000, p = 100
glmnet       3.21    3.02    2.95    3.25    4.58    5.08
l1logreg    45.87   46.63   44.33   43.99   45.60   43.16
BBR         11.80   11.64   11.58   13.30   12.46   11.83

N = 100, p = 10 000
glmnet      10.18   10.35    9.93   10.04    9.02    8.91
l1logreg   130.27  124.88  124.18  129.84  137.21  159.54
BBR         45.72   47.50   47.46   48.49   56.29   60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for a smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1.2% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here. Two referees of


Name          Type       N       p        glmnet     l1logreg   BBR/BMR

Dense
Cancer        14 class   144     16063    2.5 mins              2.1 hrs
Leukemia      2 class    72      3571     2.50       55.0       450

Sparse
InternetAd    2 class    2359    1430     5.0        20.9       34.7
NewsGroup     2 class    11314   777811   2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax.

             MacBook Pro   HP Linux server

glmnet          0.34            0.13
penalized      10.31
OWL-QN                        314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

      6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.


[Figure 2 appears here: mean squared error versus log(λ) for the Gaussian family, and deviance and misclassification error versus log(λ) for the binomial family, with each panel annotated along the top with the model size.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response; the second row, logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule). The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way.

Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the latter is smoother.

We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter, and chosen on more subjective grounds.
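As a usage sketch of this workflow, assuming the cv.glmnet() helper that ships with the glmnet package; the simulated data below are our own stand-in for the setup of Figure 2.

## Ten-fold cross-validation and the one-standard-error rule with cv.glmnet().
library(glmnet)
set.seed(1)
x <- matrix(rnorm(1000 * 100), 1000, 100)
y <- drop(x[, 1:10] %*% rep(2, 10)) + rnorm(1000)      # only 10 active predictors
cvfit <- cv.glmnet(x, y, family = "gaussian", nfolds = 10)
plot(cvfit)                                            # curves analogous to Figure 2
c(lambda.min = cvfit$lambda.min,                       # minimum cross-validated error
  lambda.1se = cvfit$lambda.1se)                       # "one-standard-error" rule
coef(cvfit, s = "lambda.1se")                          # coefficients at the chosen lambda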

      7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and p, is quite remarkable.

An R-language package glmnet is available under the General Public Licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

      Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

      References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33-40. ACM, New York. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.

Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313-2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33-61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413-1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583-3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425-455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407-499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348-1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302-332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432-441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397-416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291-304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531-536.

Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391-1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519-1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957-968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175-181. ACM, New York. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331-339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software. URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53-71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389-404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659-677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149-15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012-1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246-2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267-288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385-395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475-494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714-721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224-244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49-67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427-443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418-1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301-320.


      A Proof of Theorem 1

      We have

c_j = \arg\min_t \sum_{\ell=1}^{K} \left[ \tfrac{1}{2}(1 - \alpha)(\beta_{j\ell} - t)^2 + \alpha |\beta_{j\ell} - t| \right].    (32)

Suppose α ∈ (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

\sum_{\ell=1}^{K} \left[ -(1 - \alpha)(\beta_{j\ell} - t) - \alpha s_{j\ell} \right] = 0,    (33)

where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t, and s_{jℓ} ∈ [−1, 1] otherwise. This gives

t = \bar\beta_j + \frac{1}{K} \cdot \frac{\alpha}{1 - \alpha} \sum_{\ell=1}^{K} s_{j\ell}.    (34)

It follows that t cannot be larger than \beta_j^M, since then the second term above would be negative, and this would imply that t is less than \bar\beta_j, a contradiction. Similarly, t cannot be less than \bar\beta_j, since then the second term above would have to be negative, implying that t is larger than \beta_j^M, again a contradiction.
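A quick numerical check of the theorem (an illustration only; the grid search below is our own and is not part of the paper's software):

## Grid-minimize (28) and confirm c_j lies between the mean and a median.
set.seed(2)
beta <- rnorm(7); alpha <- 0.3
obj  <- function(t) sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))
grid <- seq(min(beta) - 1, max(beta) + 1, length.out = 100000)
cj   <- grid[which.min(vapply(grid, obj, numeric(1)))]
c(mean = mean(beta), c_j = cj, median = median(beta))
# cj falls in the closed interval spanned by the mean and the median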

      Affiliation

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software, http://www.jstatsoft.org/, published by the American Statistical Association, http://www.amstat.org/.

Volume 33, Issue 1, January 2010. Submitted: 2009-04-22; Accepted: 2009-12-15.



        7 Discussion

        Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

        Acknowledgments

        We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

        References

        Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

        Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

        Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

        Journal of Statistical Software 19

        Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

        Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

        Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

        Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

        Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

        Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

        Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

        Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

        Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

        Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

        Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

        Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

        Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

        Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

        Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

        Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

        20 Regularization Paths for GLMs via Coordinate Descent

        Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

        Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

        Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

        Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

        Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

        Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

        Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

        Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

        Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

        Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

        Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

        Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

        Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

        Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

        Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

        Journal of Statistical Software 21

        Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

        R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

        Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

        Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

        Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

        Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

        Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

        Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

        Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

        Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

        Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

        Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

        Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

        Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

        Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

        22 Regularization Paths for GLMs via Coordinate Descent

        A Proof of Theorem 1

        We have

        cj = arg mint

        Ksum`=1

        [12(1minus α)(βj` minus t)2 + α|βj` minus t|

        ] (32)

        Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

        Ksum`=1

        [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

        where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

        t = βj +1K

        α

        1minus α

        Ksum`=1

        sj` (34)

        It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

        Affiliation

        Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

        Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

        Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

        • Introduction
        • Algorithms for the lasso ridge regression and elastic net
          • Naive updates
          • Covariance updates
          • Sparse updates
          • Weighted updates
          • Pathwise coordinate descent
          • Other details
            • Regularized logistic regression
            • Regularized multinomial regression
              • Regularization and parameter ambiguity
              • Grouped and matrix responses
                • Timings
                  • Regression with the lasso
                  • Lasso-logistic regression
                  • Real data
                  • Other comparisons
                    • Selecting the tuning parameters
                    • Discussion
                    • Proof of Theorem 1

          Journal of Statistical Software 5

logistic regression models of Section 3. The coefficient profiles from the first 10 steps (grid values for λ) for each of the three regularization methods are shown. The lasso penalty admits at most N = 72 genes into the model, while ridge regression gives all 3571 genes non-zero coefficients. The elastic-net penalty provides a compromise between these two, and has the effect of averaging genes that are highly correlated and then entering the averaged gene into the model. Using the algorithm described below, computation of the entire path of solutions for each method, at 100 values of the regularization parameter evenly spaced on the log-scale, took under a second in total. Because of the large number of non-zero coefficients for the ridge penalty, they are individually much smaller than the coefficients for the other methods.

Consider a coordinate descent step for solving (1). That is, suppose we have estimates $\tilde\beta_0$ and $\tilde\beta_\ell$ for $\ell \ne j$, and we wish to partially optimize with respect to $\beta_j$. We would like to compute the gradient at $\beta_j = \tilde\beta_j$, which only exists if $\tilde\beta_j \ne 0$. If $\tilde\beta_j > 0$, then

$$\frac{\partial R_\lambda}{\partial \beta_j}\bigg|_{\beta=\tilde\beta} = -\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i - \tilde\beta_0 - x_i^\top\tilde\beta\bigr) + \lambda(1-\alpha)\beta_j + \lambda\alpha. \qquad (4)$$

A similar expression exists if $\tilde\beta_j < 0$, and $\tilde\beta_j = 0$ is treated separately. Simple calculus shows (Donoho and Johnstone 1994) that the coordinate-wise update has the form

$$\tilde\beta_j \leftarrow \frac{S\!\left(\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i - \tilde y_i^{(j)}\bigr),\ \lambda\alpha\right)}{1 + \lambda(1-\alpha)} \qquad (5)$$

where

$\tilde y_i^{(j)} = \tilde\beta_0 + \sum_{\ell \ne j} x_{i\ell}\tilde\beta_\ell$ is the fitted value excluding the contribution from $x_{ij}$, and hence $y_i - \tilde y_i^{(j)}$ the partial residual for fitting $\beta_j$. Because of the standardization, $\frac{1}{N}\sum_{i=1}^{N} x_{ij}(y_i - \tilde y_i^{(j)})$ is the simple least-squares coefficient when fitting this partial residual to $x_{ij}$.

$S(z, \gamma)$ is the soft-thresholding operator with value

$$\operatorname{sign}(z)(|z|-\gamma)_+ = \begin{cases} z-\gamma & \text{if } z > 0 \text{ and } \gamma < |z| \\ z+\gamma & \text{if } z < 0 \text{ and } \gamma < |z| \\ 0 & \text{if } \gamma \ge |z|. \end{cases} \qquad (6)$$

The details of this derivation are spelled out in Friedman et al. (2007).

Thus we compute the simple least-squares coefficient on the partial residual, apply soft-thresholding to take care of the lasso contribution to the penalty, and then apply a proportional shrinkage for the ridge penalty. This algorithm was suggested by Van der Kooij (2007).
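To make the update concrete, here is a minimal R sketch of the soft-thresholding operator (6) and of the single-coordinate update (5); the helper names soft_threshold and update_coord are ours (they are not part of glmnet), and the predictors are assumed standardized as in the text.

soft_threshold <- function(z, gamma) {
  # S(z, gamma) of equation (6)
  sign(z) * pmax(abs(z) - gamma, 0)
}

update_coord <- function(partial, lambda, alpha) {
  # Elastic-net coordinate update (5); 'partial' is the simple least-squares
  # coefficient (1/N) * sum_i x_ij * (y_i - ytilde_i^(j))
  soft_threshold(partial, lambda * alpha) / (1 + lambda * (1 - alpha))
}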

2.1 Naive updates

Looking more closely at (5), we see that

$$y_i - \tilde y_i^{(j)} = y_i - \hat y_i + x_{ij}\tilde\beta_j = r_i + x_{ij}\tilde\beta_j, \qquad (7)$$


where $\hat y_i$ is the current fit of the model for observation $i$, and hence $r_i$ the current residual. Thus

$$\frac{1}{N}\sum_{i=1}^{N} x_{ij}\bigl(y_i - \tilde y_i^{(j)}\bigr) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + \tilde\beta_j, \qquad (8)$$

because the $x_j$ are standardized. The first term on the right-hand side is the gradient of the loss with respect to $\beta_j$. It is clear from (8) why coordinate descent is computationally efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs to be changed. Such a step costs O(N) operations: the sum to compute the gradient. On the other hand, if a coefficient does change after the thresholding, $r_i$ is changed in O(N) and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations. We refer to this as the naive algorithm, since it is generally less efficient than the covariance-updating algorithm to follow. Later we use these algorithms in the context of iteratively reweighted least squares (IRLS), where the observation weights change frequently; there the naive algorithm dominates.
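For illustration, one full cycle of this naive algorithm might look as follows in R (a sketch, not the Fortran implementation underlying glmnet; it reuses soft_threshold from the sketch above and assumes centered, standardized columns of x).

naive_cycle <- function(x, r, beta, lambda, alpha) {
  N <- nrow(x)
  for (j in seq_len(ncol(x))) {
    bj_old  <- beta[j]
    partial <- sum(x[, j] * r) / N + bj_old                 # gradient term, equation (8)
    bj_new  <- soft_threshold(partial, lambda * alpha) /
               (1 + lambda * (1 - alpha))                   # update (5)
    if (bj_new != bj_old) {
      r <- r - x[, j] * (bj_new - bj_old)                   # O(N) residual update
      beta[j] <- bj_new
    }
  }
  list(beta = beta, r = r)
}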

2.2 Covariance updates

Further efficiencies can be achieved in computing the updates in (8). We can write the first term on the right (up to a factor 1/N) as

$$\sum_{i=1}^{N} x_{ij} r_i = \langle x_j, y\rangle - \sum_{k:\,|\tilde\beta_k|>0} \langle x_j, x_k\rangle\,\tilde\beta_k, \qquad (9)$$

where $\langle x_j, y\rangle = \sum_{i=1}^{N} x_{ij} y_i$. Hence we need to compute inner products of each feature with y initially, and then each time a new feature $x_k$ enters the model (for the first time), we need to compute and store its inner product with all the rest of the features (O(Np) operations). We also store the p gradient components (9). If one of the coefficients currently in the model changes, we can update each gradient in O(p) operations. Hence with m non-zero terms in the model, a complete cycle costs O(pm) operations if no new variables become non-zero, and costs O(Np) for each new variable entered. Importantly, O(N) calculations do not have to be made at every step. This is the case for all penalized procedures with squared error loss.
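A sketch of the bookkeeping behind covariance updating (our illustration, with hypothetical names): once the inner products with y and with the active variables are cached, the gradient component (9) is assembled without any O(N) work.

grad_from_cache <- function(j, xty, xtx, beta, active) {
  # xty[j] caches <x_j, y>; the rows/columns of xtx for variables in 'active'
  # cache <x_j, x_k>, filled in once when x_k first enters the model
  xty[j] - sum(xtx[j, active] * beta[active])
}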

2.3 Sparse updates

We are sometimes faced with problems where the N × p feature matrix X is extremely sparse. A leading example is from document classification, where the feature vector uses the so-called "bag-of-words" model. Each document is scored for the presence/absence of each of the words in the entire dictionary under consideration (sometimes counts are used, or some transformation of counts). Since most words are absent, the feature vector for each document is mostly zero, and so the entire matrix is mostly zero. We store such matrices efficiently in sparse column format, where we store only the non-zero entries and the coordinates where they occur.

Coordinate descent is ideally set up to exploit such sparsity, in an obvious way. The O(N) inner-product operations in either the naive or covariance updates can exploit the sparsity, by summing over only the non-zero entries. Note that in this case scaling of the variables will not alter the sparsity, but centering will. So scaling is performed up front, but the centering is incorporated in the algorithm in an efficient and obvious manner.
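As a usage illustration (not the internal implementation), the Matrix package stores such data in compressed sparse column format, and inner products like those above then touch only the non-zero entries:

library(Matrix)
set.seed(1)
x <- rsparsematrix(100, 5000, density = 0.05)   # roughly 95% zeros
y <- rnorm(100)
xty <- as.vector(crossprod(x, y))               # <x_j, y> for all j, computed sparse-aware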

2.4 Weighted updates

Often a weight $w_i$ (other than 1/N) is associated with each observation. This will arise naturally in later sections where observations receive weights in the IRLS algorithm. In this case the update step (5) becomes only slightly more complicated:

$$\tilde\beta_j \leftarrow \frac{S\!\left(\sum_{i=1}^{N} w_i x_{ij}\bigl(y_i - \tilde y_i^{(j)}\bigr),\ \lambda\alpha\right)}{\sum_{i=1}^{N} w_i x_{ij}^2 + \lambda(1-\alpha)}. \qquad (10)$$

If the $x_j$ are not standardized, there is a similar sum-of-squares term in the denominator (even without weights). The presence of weights does not change the computational costs of either algorithm much, as long as the weights remain fixed.
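In the same spirit as the earlier sketches, the weighted update (10) can be written as a one-line helper (weighted_update is our name, not a glmnet function):

weighted_update <- function(wr, wxx, lambda, alpha) {
  # wr  = sum_i w_i x_ij (y_i - ytilde_i^(j)),  wxx = sum_i w_i x_ij^2
  soft_threshold(wr, lambda * alpha) / (wxx + lambda * (1 - alpha))
}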

2.5 Pathwise coordinate descent

We compute the solutions for a decreasing sequence of values for λ, starting at the smallest value $\lambda_{\max}$ for which the entire vector $\hat\beta = 0$. Apart from giving us a path of solutions, this scheme exploits warm starts, and leads to a more stable algorithm. We have examples where it is faster to compute the path down to λ (for small λ) than the solution only at that value for λ.

When $\tilde\beta = 0$, we see from (5) that $\tilde\beta_j$ will stay zero if $\frac{1}{N}|\langle x_j, y\rangle| < \lambda\alpha$. Hence $N\alpha\lambda_{\max} = \max_\ell |\langle x_\ell, y\rangle|$. Our strategy is to select a minimum value $\lambda_{\min} = \epsilon\lambda_{\max}$, and construct a sequence of K values of λ decreasing from $\lambda_{\max}$ to $\lambda_{\min}$ on the log scale. Typical values are ε = 0.001 and K = 100.
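For example, with y centered and the $x_j$ standardized, such a λ sequence could be built as follows (a sketch that mirrors, but is not identical to, the package internals):

lambda_sequence <- function(x, y, alpha = 1, K = 100, epsilon = 0.001) {
  N <- nrow(x)
  lambda_max <- max(abs(crossprod(x, y))) / (N * alpha)   # N * alpha * lambda_max = max_l |<x_l, y>|
  exp(seq(log(lambda_max), log(epsilon * lambda_max), length.out = K))
}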

2.6 Other details

Irrespective of whether the variables are standardized to have variance 1, we always center each predictor variable. Since the intercept is not regularized, this means that $\hat\beta_0 = \bar y$, the mean of the $y_i$, for all values of α and λ.

It is easy to allow different penalties $\lambda_j$ for each of the variables. We implement this via a penalty scaling parameter $\gamma_j \ge 0$. If $\gamma_j > 0$, then the penalty applied to $\beta_j$ is $\lambda_j = \lambda\gamma_j$. If $\gamma_j = 0$, that variable does not get penalized, and always enters the model unrestricted at the first step and remains in the model. Penalty rescaling would also allow, for example, our software to be used to implement the adaptive lasso (Zou 2006).

Considerable speedup is obtained by organizing the iterations around the active set of features, i.e., those with nonzero coefficients. After a complete cycle through all the variables, we iterate on only the active set till convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated. Active-set convergence is also mentioned in Meier et al. (2008) and Krishnapuram and Hartemink (2005).


          3 Regularized logistic regression

When the response variable is binary, the linear logistic regression model is often used. Denote by G the response variable, taking values in G = {1, 2} (the labeling of the elements is arbitrary). The logistic regression model represents the class-conditional probabilities through a linear function of the predictors:

$$\Pr(G=1\mid x) = \frac{1}{1 + e^{-(\beta_0 + x^\top\beta)}}, \qquad (11)$$

$$\Pr(G=2\mid x) = \frac{1}{1 + e^{+(\beta_0 + x^\top\beta)}} = 1 - \Pr(G=1\mid x).$$

Alternatively, this implies that

$$\log\frac{\Pr(G=1\mid x)}{\Pr(G=2\mid x)} = \beta_0 + x^\top\beta. \qquad (12)$$

Here we fit this model by regularized maximum (binomial) likelihood. Let $p(x_i) = \Pr(G=1\mid x_i)$ be the probability (11) for observation i at a particular value for the parameters $(\beta_0, \beta)$; then we maximize the penalized log likelihood

$$\max_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[\frac{1}{N}\sum_{i=1}^{N}\Bigl\{ I(g_i=1)\log p(x_i) + I(g_i=2)\log\bigl(1-p(x_i)\bigr)\Bigr\} - \lambda P_\alpha(\beta)\right]. \qquad (13)$$

Denoting $y_i = I(g_i = 1)$, the log-likelihood part of (13) can be written in the more explicit form

$$\ell(\beta_0,\beta) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[ y_i\cdot(\beta_0 + x_i^\top\beta) - \log\bigl(1 + e^{\beta_0 + x_i^\top\beta}\bigr)\Bigr], \qquad (14)$$

a concave function of the parameters. The Newton algorithm for maximizing the (unpenalized) log-likelihood (14) amounts to iteratively reweighted least squares. Hence if the current estimates of the parameters are $(\tilde\beta_0, \tilde\beta)$, we form a quadratic approximation to the log-likelihood (Taylor expansion about current estimates), which is

$$\ell_Q(\beta_0,\beta) = -\frac{1}{2N}\sum_{i=1}^{N} w_i\bigl(z_i - \beta_0 - x_i^\top\beta\bigr)^2 + C(\tilde\beta_0,\tilde\beta)^2, \qquad (15)$$

where

$$z_i = \tilde\beta_0 + x_i^\top\tilde\beta + \frac{y_i - \tilde p(x_i)}{\tilde p(x_i)\bigl(1-\tilde p(x_i)\bigr)} \quad \text{(working response)}, \qquad (16)$$

$$w_i = \tilde p(x_i)\bigl(1-\tilde p(x_i)\bigr) \quad \text{(weights)}, \qquad (17)$$

and $\tilde p(x_i)$ is evaluated at the current parameters. The last term is constant. The Newton update is obtained by minimizing $\ell_Q$.

Our approach is similar. For each value of λ, we create an outer loop which computes the quadratic approximation $\ell_Q$ about the current parameters $(\tilde\beta_0, \tilde\beta)$. Then we use coordinate descent to solve the penalized weighted least-squares problem

$$\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \bigl\{-\ell_Q(\beta_0,\beta) + \lambda P_\alpha(\beta)\bigr\}. \qquad (18)$$

This amounts to a sequence of nested loops:


outer loop: Decrement λ.

middle loop: Update the quadratic approximation $\ell_Q$ using the current parameters $(\tilde\beta_0, \tilde\beta)$.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (18).

There are several important details in the implementation of this algorithm.

When $p \gg N$, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to $\lambda_{\min} = \epsilon\lambda_{\max} > 0$.

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0 or 1. When a probability is within $\varepsilon = 10^{-5}$ of 1, we set it to 1, and set the weights to ε; 0 is treated similarly.

Our code has an option to approximate the Hessian terms by an exact upper-bound. This is obtained by setting the $w_i$ in (17) all equal to 0.25 (Krishnapuram and Hartemink 2005).

We allow the response data to be supplied in the form of a two-column matrix of counts, sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.

The Newton algorithm is not guaranteed to converge without step-size optimization (Lee et al. 2006). Our code does not implement any checks for divergence; this would slow it down, and when used as recommended we do not feel it is necessary. We have a closed form expression for the starting solutions, and each subsequent solution is warm-started from the previous close-by solution, which generally makes the quadratic approximations very accurate. We have not encountered any divergence problems so far.
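A compact sketch of one pass of the middle loop for the binomial case is given below; irls_step and wls_enet are our placeholder names (the latter stands in for the weighted coordinate-descent inner loop), and the probability clamping implements the ε = 10⁻⁵ safeguard mentioned above.

irls_step <- function(x, y, beta0, beta, lambda, alpha, wls_enet) {
  eta <- beta0 + as.vector(x %*% beta)
  p   <- 1 / (1 + exp(-eta))
  p   <- pmin(pmax(p, 1e-5), 1 - 1e-5)       # guard against fitted 0s and 1s
  w   <- p * (1 - p)                          # weights, equation (17)
  z   <- eta + (y - p) / w                    # working response, equation (16)
  wls_enet(x, z, w, lambda, alpha)            # inner loop: solve (18) by coordinate descent
}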

          4 Regularized multinomial regression

When the categorical response variable G has K > 2 levels, the linear logistic regression model can be generalized to a multi-logit model. The traditional approach is to extend (12) to K − 1 logits

$$\log\frac{\Pr(G=\ell\mid x)}{\Pr(G=K\mid x)} = \beta_{0\ell} + x^\top\beta_\ell, \qquad \ell = 1,\ldots,K-1. \qquad (19)$$

Here $\beta_\ell$ is a p-vector of coefficients. As in Zhu and Hastie (2004), here we choose a more symmetric approach. We model

$$\Pr(G=\ell\mid x) = \frac{e^{\beta_{0\ell} + x^\top\beta_\ell}}{\sum_{k=1}^{K} e^{\beta_{0k} + x^\top\beta_k}}. \qquad (20)$$

This parametrization is not estimable without constraints, because for any values for the parameters $\{\beta_{0\ell},\beta_\ell\}_1^K$, the values $\{\beta_{0\ell}-c_0,\ \beta_\ell-c\}_1^K$ give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.


We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let $p_\ell(x_i) = \Pr(G=\ell\mid x_i)$, and let $g_i \in \{1,2,\ldots,K\}$ be the ith response. We maximize the penalized log-likelihood

$$\max_{\{\beta_{0\ell},\beta_\ell\}_1^K \in \mathbb{R}^{K(p+1)}} \left[\frac{1}{N}\sum_{i=1}^{N}\log p_{g_i}(x_i) - \lambda\sum_{\ell=1}^{K} P_\alpha(\beta_\ell)\right]. \qquad (21)$$

Denote by Y the N × K indicator response matrix, with elements $y_{i\ell} = I(g_i = \ell)$. Then we can write the log-likelihood part of (21) in the more explicit form

$$\ell\bigl(\{\beta_{0\ell},\beta_\ell\}_1^K\bigr) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{\ell=1}^{K} y_{i\ell}\bigl(\beta_{0\ell} + x_i^\top\beta_\ell\bigr) - \log\left(\sum_{\ell=1}^{K} e^{\beta_{0\ell}+x_i^\top\beta_\ell}\right)\right]. \qquad (22)$$

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights $w_i$ as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only $(\beta_{0\ell}, \beta_\ell)$ to vary for a single class at a time. It is not hard to show that this is

$$\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) = -\frac{1}{2N}\sum_{i=1}^{N} w_{i\ell}\bigl(z_{i\ell} - \beta_{0\ell} - x_i^\top\beta_\ell\bigr)^2 + C\bigl(\{\tilde\beta_{0k},\tilde\beta_k\}_1^K\bigr), \qquad (23)$$

where, as before,

$$z_{i\ell} = \tilde\beta_{0\ell} + x_i^\top\tilde\beta_\ell + \frac{y_{i\ell} - \tilde p_\ell(x_i)}{\tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr)}, \qquad (24)$$

$$w_{i\ell} = \tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr). \qquad (25)$$

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ, we create an outer loop which cycles over ℓ and computes the partial quadratic approximation $\ell_{Q\ell}$ about the current parameters $(\tilde\beta_0, \tilde\beta)$. Then we use coordinate descent to solve the penalized weighted least-squares problem

$$\min_{(\beta_{0\ell},\beta_\ell)\in\mathbb{R}^{p+1}} \bigl\{-\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) + \lambda P_\alpha(\beta_\ell)\bigr\}. \qquad (26)$$

This amounts to the sequence of nested loops:

outer loop: Decrement λ.

middle loop (outer): Cycle over $\ell \in \{1, 2, \ldots, K, 1, 2, \ldots\}$.

middle loop (inner): Update the quadratic approximation $\ell_{Q\ell}$ using the current parameters $\{\tilde\beta_{0k},\tilde\beta_k\}_1^K$.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (26).


4.1 Regularization and parameter ambiguity

As was pointed out earlier, if $\{\beta_{0\ell},\beta_\ell\}_1^K$ characterizes a fitted model for (20), then $\{\beta_{0\ell}-c_0,\ \beta_\ell-c\}_1^K$ gives an identical fit (c is a p-vector). Although this means that the log-likelihood part of (21) is insensitive to $(c_0, c)$, the penalty is not. In particular, we can always improve an estimate $\{\beta_{0\ell},\beta_\ell\}_1^K$ (with respect to (21)) by solving

$$\min_{c\in\mathbb{R}^{p}} \sum_{\ell=1}^{K} P_\alpha(\beta_\ell - c). \qquad (27)$$

This can be done separately for each coordinate, hence

$$c_j = \arg\min_t \sum_{\ell=1}^{K}\Bigl[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha|\beta_{j\ell} - t|\Bigr]. \qquad (28)$$

Theorem 1 Consider problem (28) for values $\alpha \in [0, 1]$. Let $\bar\beta_j$ be the mean of the $\beta_{j\ell}$, and $\beta_j^M$ a median of the $\beta_{j\ell}$ (and for simplicity assume $\bar\beta_j \le \beta_j^M$). Then we have

$$c_j \in [\bar\beta_j, \beta_j^M], \qquad (29)$$

with the left endpoint achieved if α = 0, and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28). The objective is piecewise quadratic, with knots defined by the $\beta_{j\ell}$. We need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j, after each inner middle loop step, using the solution $c_j$ for each j.
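As an illustration of how simple this search is, the sketch below minimizes the piecewise-quadratic objective in (28) by evaluating it on a grid between the mean and the median, which by Theorem 1 brackets the solution (this is for exposition only, not the production code):

recenter_cj <- function(beta_j, alpha, ngrid = 1000) {
  lo  <- min(mean(beta_j), median(beta_j))
  hi  <- max(mean(beta_j), median(beta_j))
  ts  <- seq(lo, hi, length.out = ngrid)
  obj <- sapply(ts, function(t)
    sum(0.5 * (1 - alpha) * (beta_j - t)^2 + alpha * abs(beta_j - t)))
  ts[which.min(obj)]                           # approximate c_j of (28)
}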

Not all the parameters in our model are regularized. The intercepts $\beta_{0\ell}$ are not, and with our penalty modifiers $\gamma_j$ (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2 Grouped and matrix responses

As in the two class case, the data can be presented in the form of an N × K matrix $m_{i\ell}$ of non-negative numbers. For example, if the data are grouped: at each $x_i$ we have a number of multinomial samples, with $m_{i\ell}$ falling into category ℓ. In this case we divide each row by the row-sum $m_i = \sum_\ell m_{i\ell}$, and produce our response matrix $y_{i\ell} = m_{i\ell}/m_i$; $m_i$ becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using $y_{i\ell}$ just defined). The weights in (25) get augmented with the observation weight $m_i$:

$$w_{i\ell} = m_i\,\tilde p_\ell(x_i)\bigl(1-\tilde p_\ell(x_i)\bigr). \qquad (30)$$

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.
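For instance, a matrix of grouped counts is converted to the proportions and observation weights used above with a few lines of R (plain bookkeeping, not package code):

m  <- matrix(c(3, 1, 0,
               2, 2, 2,
               0, 0, 5), nrow = 3, byrow = TRUE)   # N x K counts m_il
mi    <- rowSums(m)                                # observation weights m_i
yprop <- m / mi                                    # proportions y_il = m_il / m_i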


          5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1 Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors $X_j, X_{j'}$ having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

$$Y = \sum_{j=1}^{p} X_j\beta_j + k\cdot Z, \qquad (31)$$

where $\beta_j = (-1)^j \exp(-2(j-1)/20)$, $Z \sim N(0,1)$ and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
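Data with this structure can be generated as follows (a sketch; we interpret the signal-to-noise ratio as sd(signal)/sd(noise), and gen_data is our own helper):

gen_data <- function(N, p, rho, snr = 3) {
  z0   <- rnorm(N)                                  # shared component gives common correlation rho
  x    <- sqrt(rho) * matrix(z0, N, p) +
          sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)   # alternating, exponentially decaying
  mu   <- drop(x %*% beta)
  k    <- sd(mu) / snr                              # noise scale for SNR = 3
  list(x = x, y = mu + k * rnorm(N), beta = beta)
}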

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case, lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2 Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the same 100 λ values for all.


Linear regression – Dense features

                         Correlation
                         0      0.1    0.2    0.5    0.9    0.95

N = 1000, p = 100
glmnet (type = naive)    0.05   0.06   0.06   0.09   0.08   0.07
glmnet (type = cov)      0.02   0.02   0.02   0.02   0.02   0.02
lars                     0.11   0.11   0.11   0.11   0.11   0.11

N = 5000, p = 100
glmnet (type = naive)    0.24   0.25   0.26   0.34   0.32   0.31
glmnet (type = cov)      0.05   0.05   0.05   0.05   0.05   0.05
lars                     0.29   0.29   0.29   0.30   0.29   0.29

N = 100, p = 1000
glmnet (type = naive)    0.04   0.05   0.04   0.05   0.04   0.03
glmnet (type = cov)      0.07   0.08   0.07   0.08   0.04   0.03
lars                     0.73   0.72   0.68   0.71   0.71   0.67

N = 100, p = 5000
glmnet (type = naive)    0.20   0.18   0.21   0.23   0.21   0.14
glmnet (type = cov)      0.46   0.42   0.51   0.48   0.25   0.10
lars                     3.73   3.53   3.59   3.47   3.90   3.52

N = 100, p = 20,000
glmnet (type = naive)    1.00   0.99   1.06   1.29   1.17   0.97
glmnet (type = cov)      1.86   2.26   2.34   2.59   1.24   0.79
lars                    18.30  17.90  16.90  18.03  17.91  16.39

N = 100, p = 50,000
glmnet (type = naive)    2.66   2.46   2.84   3.53   3.39   2.43
glmnet (type = cov)      5.50   4.92   6.13   7.35   4.52   2.53
lars                    58.68  64.00  64.79  58.20  66.39  79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

Table 2 shows the results; in some cases, we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

             Correlation
             0         0.1       0.2       0.5       0.9       0.95

N = 1000, p = 100
glmnet       1.65      1.81      2.31      3.87      5.99      8.48
l1logreg    31.475    31.86     34.35     32.21     31.85     31.81
BBR         40.70     47.57     54.18     70.06    106.72    121.41
LPL         24.68     31.64     47.99    170.77    741.00   1448.25

N = 5000, p = 100
glmnet       7.89      8.48      9.01     13.39     26.68     26.36
l1logreg   239.88    232.00    229.62    229.49    221.9     223.09

N = 100,000, p = 100
glmnet      78.56    178.45    205.94    274.33    552.48    638.50

N = 100, p = 1000
glmnet       1.06      1.07      1.09      1.45      1.72      1.37
l1logreg    25.99     26.40     25.67     26.49     24.34     20.16
BBR         70.19     71.19     78.40    103.77    149.05    113.87
LPL         11.02     10.87     10.76     16.34     41.84     70.50

N = 100, p = 5000
glmnet       5.24      4.43      5.12      7.05      7.87      6.05
l1logreg   165.02    161.90    163.25    166.50    151.91    135.28

N = 100, p = 100,000
glmnet     137.27    139.40    146.55    197.98    219.65    201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets.

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

             Correlation
             0         0.1       0.2       0.5       0.9       0.95

N = 1000, p = 100
glmnet       0.77      0.74      0.72      0.73      0.84      0.88
l1logreg     5.19      5.21      5.14      5.40      6.14      6.26
BBR          2.01      1.95      1.98      2.06      2.73      2.88

N = 100, p = 1000
glmnet       1.81      1.73      1.55      1.70      1.63      1.55
l1logreg     7.67      7.72      7.64      9.04      9.81      9.40
BBR          4.66      4.58      4.68      5.15      5.78      5.53

N = 10,000, p = 100
glmnet       3.21      3.02      2.95      3.25      4.58      5.08
l1logreg    45.87     46.63     44.33     43.99     45.60     43.16
BBR         11.80     11.64     11.58     13.30     12.46     11.83

N = 100, p = 10,000
glmnet      10.18     10.35      9.93     10.04      9.02      8.91
l1logreg   130.27    124.88    124.18    129.84    137.21    159.54
BBR         45.72     47.50     47.46     48.49     56.29     60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here.


Name          Type       N        p         glmnet     l1logreg   BBR/BMR

Dense
Cancer        14 class   144      16,063    2.5 mins              2.1 hrs
Leukemia       2 class    72       3,571    2.50       55.0       450

Sparse
InternetAd     2 class   2,359     1,430    5.0        20.9       34.7
NewsGroup      2 class  11,314   777,811    2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax.

            MacBook Pro   HP Linux server

glmnet        0.34          0.13
penalized    10.31
OWL-QN                    314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

Two referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

          6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.


[Figure 2 (plot panels omitted): cross-validation curves plotted against log(Lambda). The panels show mean squared error (Gaussian family), binomial deviance (Binomial family) and misclassification error (Binomial family); the top of each panel is annotated with the model sizes.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response, the second row logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard-error of the minimum: the so called "one-standard-error" rule. The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way.

Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter, and chosen on more subjective grounds.
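In recent releases of the glmnet package this procedure is packaged as cv.glmnet; an illustrative call on simulated data (function and field names as on CRAN) is:

library(glmnet)
set.seed(1)
x   <- matrix(rnorm(1000 * 100), 1000, 100)
eta <- drop(x[, 1:10] %*% rep(0.5, 10))           # response depends on 10 predictors
y   <- rbinom(1000, 1, 1 / (1 + exp(-eta)))
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
plot(cvfit)                       # curves like those in Figure 2
cvfit$lambda.min                  # value of lambda minimizing CV deviance
cvfit$lambda.1se                  # the "one-standard-error" choice
coef(cvfit, s = "lambda.1se")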

          7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and p, is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
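For completeness, a minimal usage example of the package (illustrative; see the package documentation for the full set of arguments):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- rbinom(100, 1, 0.5)
fit <- glmnet(x, y, family = "binomial", alpha = 0.5)    # elastic-net logistic path
plot(fit, xvar = "lambda")                               # coefficient profiles vs log(lambda)
predict(fit, newx = x[1:5, ], s = 0.01, type = "response")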

          Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

          References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33-40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.

Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313-2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33-61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413-1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583-3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425-455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407-499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348-1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302-332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432-441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs. the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397-416.

Genkin A, Lewis D, Madigan D (2007). "Large-scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291-304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531-536.

Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391-1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519-1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957-968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175-181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331-339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software. URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53-71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389-404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659-677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149-15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012-1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246-2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267-288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385-395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475-494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714-721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224-244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49-67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427-443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418-1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301-320.

          22 Regularization Paths for GLMs via Coordinate Descent

          A Proof of Theorem 1

          We have

          cj = arg mint

          Ksum`=1

          [12(1minus α)(βj` minus t)2 + α|βj` minus t|

          ] (32)

          Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

          Ksum`=1

          [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

          where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

          t = βj +1K

          α

          1minus α

          Ksum`=1

          sj` (34)

          It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

          Affiliation

          Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

          Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

          Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

          • Introduction
          • Algorithms for the lasso ridge regression and elastic net
            • Naive updates
            • Covariance updates
            • Sparse updates
            • Weighted updates
            • Pathwise coordinate descent
            • Other details
              • Regularized logistic regression
              • Regularized multinomial regression
                • Regularization and parameter ambiguity
                • Grouped and matrix responses
                  • Timings
                    • Regression with the lasso
                    • Lasso-logistic regression
                    • Real data
                    • Other comparisons
                      • Selecting the tuning parameters
                      • Discussion
                      • Proof of Theorem 1

            6 Regularization Paths for GLMs via Coordinate Descent

where ŷ_i is the current fit of the model for observation i, and hence r_i the current residual. Thus

\frac{1}{N}\sum_{i=1}^{N} x_{ij}\,(y_i - \hat{y}_i^{(j)}) \;=\; \frac{1}{N}\sum_{i=1}^{N} x_{ij}\, r_i + \beta_j \qquad (8)

because the x_j are standardized. The first term on the right-hand side is the gradient of the loss with respect to β_j. It is clear from (8) why coordinate descent is computationally efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs to be changed. Such a step costs O(N) operations: the sum to compute the gradient. On the other hand, if a coefficient does change after the thresholding, r_i is changed in O(N) and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations. We refer to this as the naive algorithm, since it is generally less efficient than the covariance-updating algorithm to follow. Later we use these algorithms in the context of iteratively reweighted least squares (IRLS), where the observation weights change frequently; there the naive algorithm dominates.
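To make the naive update concrete, here is a minimal R sketch of one full cycle, assuming centered and standardized predictors and a centered response; the function and variable names are ours, not those of the glmnet Fortran core.

# One cycle of naive coordinate descent for the elastic net, assuming the
# columns of x are centered and standardized (mean 0, variance 1) and y is centered.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

naive_cycle <- function(x, r, beta, lambda, alpha) {
  N <- nrow(x)
  for (j in seq_len(ncol(x))) {
    grad_j <- sum(x[, j] * r) / N + beta[j]             # right-hand side of (8)
    beta_j_new <- soft_threshold(grad_j, lambda * alpha) / (1 + lambda * (1 - alpha))
    if (beta_j_new != beta[j]) {                        # O(N) residual update only when needed
      r <- r - (beta_j_new - beta[j]) * x[, j]
      beta[j] <- beta_j_new
    }
  }
  list(beta = beta, r = r)
}

Repeating naive_cycle until the coefficients stabilize, for each λ in the path, is essentially the naive algorithm.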

2.2 Covariance updates

Further efficiencies can be achieved in computing the updates in (8). We can write the first term on the right (up to a factor 1/N) as

\sum_{i=1}^{N} x_{ij}\, r_i \;=\; \langle x_j, y\rangle - \sum_{k:\,|\beta_k|>0} \langle x_j, x_k\rangle\, \beta_k \qquad (9)

where \langle x_j, y\rangle = \sum_{i=1}^{N} x_{ij}\, y_i. Hence we need to compute inner products of each feature with y initially, and then each time a new feature x_k enters the model (for the first time), we need to compute and store its inner product with all the rest of the features (O(Np) operations). We also store the p gradient components (9). If one of the coefficients currently in the model changes, we can update each gradient in O(p) operations. Hence with m non-zero terms in the model, a complete cycle costs O(pm) operations if no new variables become non-zero, and costs O(Np) for each new variable entered. Importantly, O(N) calculations do not have to be made at every step. This is the case for all penalized procedures with squared error loss.
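The bookkeeping behind the covariance updates can be sketched as follows. This is an illustrative R outline under our own naming, not the actual implementation, and it assumes centered, standardized columns of x.

# Cache <x_j, y> for all j once, and <x_j, x_k> only for features k that have
# entered the model; the gradient in (9) is then assembled without touching
# the N-vector of residuals.
make_cov_state <- function(x, y) {
  list(xty = drop(crossprod(x, y)),     # p inner products with y, computed once
       xtx = list())                    # inner products with active features, filled lazily
}

add_active_feature <- function(state, x, k) {
  key <- as.character(k)
  if (is.null(state$xtx[[key]]))        # O(Np) work, only when x_k first becomes nonzero
    state$xtx[[key]] <- drop(crossprod(x, x[, k]))
  state
}

gradient_component <- function(state, j, beta, active) {
  g <- state$xty[j]
  for (k in active) g <- g - state$xtx[[as.character(k)]][j] * beta[k]
  g                                     # equals sum_i x_ij r_i, as in (9)
}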

2.3 Sparse updates

We are sometimes faced with problems where the N × p feature matrix X is extremely sparse. A leading example is from document classification, where the feature vector uses the so-called "bag-of-words" model. Each document is scored for the presence/absence of each of the words in the entire dictionary under consideration (sometimes counts are used, or some transformation of counts). Since most words are absent, the feature vector for each document is mostly zero, and so the entire matrix is mostly zero. We store such matrices efficiently in sparse column format, where we store only the non-zero entries and the coordinates where they occur.

Coordinate descent is ideally set up to exploit such sparsity, in an obvious way. The O(N) inner-product operations in either the naive or covariance updates can exploit the sparsity, by summing over only the non-zero entries. Note that in this case scaling of the variables will not alter the sparsity, but centering will. So scaling is performed up front, but the centering is incorporated in the algorithm in an efficient and obvious manner.
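As a hypothetical illustration (our own toy data, not one of the examples in this paper), a sparse bag-of-words style matrix can be passed to glmnet directly in the Matrix package's sparse column format, so the zeros are never stored or summed over; see also the Discussion. Here rsparsematrix() is just a convenient way, in recent versions of Matrix, to build a random sparse matrix.

library(Matrix)
library(glmnet)

set.seed(1)
N <- 500; p <- 10000
# 0/1 "word present" indicators, about 1% nonzero, stored in sparse column format
x <- rsparsematrix(N, p, density = 0.01, rand.x = function(n) rep(1, n))
y <- rbinom(N, 1, 0.5)

fit <- glmnet(x, y, family = "binomial")   # lasso-penalized logistic path on sparse input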

2.4 Weighted updates

Often a weight w_i (other than 1/N) is associated with each observation. This will arise naturally in later sections, where observations receive weights in the IRLS algorithm. In this case the update step (5) becomes only slightly more complicated:

\beta_j \leftarrow \frac{S\!\left(\sum_{i=1}^{N} w_i x_{ij}(y_i - \hat{y}_i^{(j)}),\; \lambda\alpha\right)}{\sum_{i=1}^{N} w_i x_{ij}^2 + \lambda(1-\alpha)} \qquad (10)

If the x_j are not standardized, there is a similar sum-of-squares term in the denominator (even without weights). The presence of weights does not change the computational costs of either algorithm much, as long as the weights remain fixed.

2.5 Pathwise coordinate descent

We compute the solutions for a decreasing sequence of values for λ, starting at the smallest value λ_max for which the entire vector β = 0. Apart from giving us a path of solutions, this scheme exploits warm starts, and leads to a more stable algorithm. We have examples where it is faster to compute the path down to λ (for small λ) than the solution only at that value for λ.

When β = 0, we see from (5) that β_j will stay zero if (1/N)|\langle x_j, y\rangle| < λα. Hence Nα λ_max = max_ℓ |\langle x_ℓ, y\rangle|. Our strategy is to select a minimum value λ_min = ελ_max, and construct a sequence of K values of λ decreasing from λ_max to λ_min on the log scale. Typical values are ε = 0.001 and K = 100.
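A small sketch of this default λ grid in R, under the same centering and standardization assumptions as above (and with α > 0, so that λ_max is finite); the function name is ours.

lambda_sequence <- function(x, y, alpha, K = 100, epsilon = 0.001) {
  N <- nrow(x)
  lambda_max <- max(abs(crossprod(x, y))) / (N * alpha)   # N * alpha * lambda_max = max_l |<x_l, y>|
  exp(seq(log(lambda_max), log(epsilon * lambda_max), length.out = K))
}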

2.6 Other details

Irrespective of whether the variables are standardized to have variance 1, we always center each predictor variable. Since the intercept is not regularized, this means that β_0 = ȳ, the mean of the y_i, for all values of α and λ.

It is easy to allow different penalties λ_j for each of the variables. We implement this via a penalty scaling parameter γ_j ≥ 0. If γ_j > 0, then the penalty applied to β_j is λ_j = λγ_j. If γ_j = 0, that variable does not get penalized, and always enters the model unrestricted at the first step and remains in the model. Penalty rescaling would also allow, for example, our software to be used to implement the adaptive lasso (Zou 2006).

Considerable speedup is obtained by organizing the iterations around the active set of features: those with nonzero coefficients. After a complete cycle through all the variables, we iterate on only the active set till convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated. Active-set convergence is also mentioned in Meier et al. (2008) and Krishnapuram and Hartemink (2005).
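In the released R package, per-variable penalty scaling is exposed (as far as we are aware) through the penalty.factor argument of glmnet(); the toy example below always keeps the first two predictors unpenalized. Treat the argument name as something to confirm against the package documentation.

library(glmnet)
set.seed(10)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- rnorm(100)

pf  <- c(0, 0, rep(1, 18))                 # gamma_j = 0 for the first two variables
fit <- glmnet(x, y, alpha = 1, penalty.factor = pf)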


            3 Regularized logistic regression

When the response variable is binary, the linear logistic regression model is often used. Denote by G the response variable, taking values in G = {1, 2} (the labeling of the elements is arbitrary). The logistic regression model represents the class-conditional probabilities through a linear function of the predictors:

\Pr(G = 1|x) = \frac{1}{1 + e^{-(\beta_0 + x^\top\beta)}}, \qquad (11)

\Pr(G = 2|x) = \frac{1}{1 + e^{+(\beta_0 + x^\top\beta)}} = 1 - \Pr(G = 1|x).

Alternatively, this implies that

\log \frac{\Pr(G = 1|x)}{\Pr(G = 2|x)} = \beta_0 + x^\top\beta. \qquad (12)

Here we fit this model by regularized maximum (binomial) likelihood. Let p(x_i) = Pr(G = 1|x_i) be the probability (11) for observation i at a particular value for the parameters (β_0, β); then we maximize the penalized log likelihood

\max_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[ \frac{1}{N}\sum_{i=1}^{N} \Big\{ I(g_i = 1)\log p(x_i) + I(g_i = 2)\log(1 - p(x_i)) \Big\} - \lambda P_\alpha(\beta) \right]. \qquad (13)

Denoting y_i = I(g_i = 1), the log-likelihood part of (13) can be written in the more explicit form

\ell(\beta_0,\beta) = \frac{1}{N}\sum_{i=1}^{N} \Big[ y_i \cdot (\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big) \Big], \qquad (14)

a concave function of the parameters. The Newton algorithm for maximizing the (unpenalized) log-likelihood (14) amounts to iteratively reweighted least squares. Hence if the current estimates of the parameters are (β_0, β), we form a quadratic approximation to the log-likelihood (Taylor expansion about current estimates), which is

\ell_Q(\beta_0,\beta) = -\frac{1}{2N}\sum_{i=1}^{N} w_i (z_i - \beta_0 - x_i^\top\beta)^2 + C(\beta_0,\beta)^2, \qquad (15)

where

z_i = \beta_0 + x_i^\top\beta + \frac{y_i - p(x_i)}{p(x_i)(1 - p(x_i))} \quad \text{(working response)} \qquad (16)

w_i = p(x_i)(1 - p(x_i)) \quad \text{(weights)} \qquad (17)

and p(x_i) is evaluated at the current parameters. The last term is constant. The Newton update is obtained by minimizing \ell_Q.

Our approach is similar. For each value of λ, we create an outer loop which computes the quadratic approximation \ell_Q about the current parameters (β_0, β). Then we use coordinate descent to solve the penalized weighted least-squares problem

\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \big\{ -\ell_Q(\beta_0,\beta) + \lambda P_\alpha(\beta) \big\}. \qquad (18)

This amounts to a sequence of nested loops:


outer loop: Decrement λ.

middle loop: Update the quadratic approximation \ell_Q using the current parameters (β_0, β).

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (18).

There are several important details in the implementation of this algorithm; a schematic R sketch of the nested loops is given after this list.

When p ≫ N, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to λ_min = ελ_max > 0.

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0 or 1. When a probability is within ε = 10^{-5} of 1, we set it to 1, and set the weights to ε. 0 is treated similarly.

Our code has an option to approximate the Hessian terms by an exact upper-bound. This is obtained by setting the w_i in (17) all equal to 0.25 (Krishnapuram and Hartemink 2005).

We allow the response data to be supplied in the form of a two-column matrix of counts, sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.

The Newton algorithm is not guaranteed to converge without step-size optimization (Lee et al. 2006). Our code does not implement any checks for divergence; this would slow it down, and when used as recommended we do not feel it is necessary. We have a closed form expression for the starting solutions, and each subsequent solution is warm-started from the previous close-by solution, which generally makes the quadratic approximations very accurate. We have not encountered any divergence problems so far.
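The nested loops above can be made concrete with a schematic R sketch for a single value of λ. This is a bare-bones illustration under our own naming, assuming standardized predictors; it is not the glmnet Fortran code, and it omits the active-set and convergence logic of Section 2.6.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# inner loop: coordinate descent on the penalized weighted least-squares problem (18)
wls_enet <- function(x, z, w, beta0, beta, lambda, alpha, n_cycles = 50) {
  for (cycle in seq_len(n_cycles)) {
    beta0 <- sum(w * (z - drop(x %*% beta))) / sum(w)            # unpenalized intercept
    for (j in seq_len(ncol(x))) {
      r_j <- z - beta0 - drop(x %*% beta) + x[, j] * beta[j]     # partial residual
      num <- soft(sum(w * x[, j] * r_j), lambda * alpha)
      beta[j] <- num / (sum(w * x[, j]^2) + lambda * (1 - alpha))  # update (10)
    }
  }
  list(beta0 = beta0, beta = beta)
}

# middle loop: refresh the quadratic approximation l_Q about the current parameters
logistic_enet <- function(x, y, lambda, alpha, n_outer = 10) {
  beta0 <- 0; beta <- rep(0, ncol(x))
  for (it in seq_len(n_outer)) {
    eta <- drop(beta0 + x %*% beta)
    p <- 1 / (1 + exp(-eta))
    p <- pmin(pmax(p, 1e-5), 1 - 1e-5)      # guard fitted probabilities near 0 or 1
    w_raw <- p * (1 - p)
    z <- eta + (y - p) / w_raw              # working response (16)
    w <- w_raw / length(y)                  # weights (17), absorbing the 1/N in (15)
    fit <- wls_enet(x, z, w, beta0, beta, lambda, alpha)
    beta0 <- fit$beta0; beta <- fit$beta
  }
  list(beta0 = beta0, beta = beta)
}

The outer loop over a decreasing λ sequence would simply call logistic_enet repeatedly, warm-starting each fit from the previous solution.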

            4 Regularized multinomial regression

When the categorical response variable G has K > 2 levels, the linear logistic regression model can be generalized to a multi-logit model. The traditional approach is to extend (12) to K − 1 logits

\log \frac{\Pr(G = \ell|x)}{\Pr(G = K|x)} = \beta_{0\ell} + x^\top\beta_\ell, \qquad \ell = 1, \ldots, K - 1. \qquad (19)

Here β_ℓ is a p-vector of coefficients. As in Zhu and Hastie (2004), here we choose a more symmetric approach. We model

\Pr(G = \ell|x) = \frac{e^{\beta_{0\ell} + x^\top\beta_\ell}}{\sum_{k=1}^{K} e^{\beta_{0k} + x^\top\beta_k}}. \qquad (20)

This parametrization is not estimable without constraints, because for any values for the parameters \{\beta_{0\ell}, \beta_\ell\}_1^K, \{\beta_{0\ell} - c_0, \beta_\ell - c\}_1^K give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.


We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let p_ℓ(x_i) = Pr(G = ℓ|x_i), and let g_i ∈ {1, 2, . . . , K} be the ith response. We maximize the penalized log-likelihood

\max_{\{\beta_{0\ell},\beta_\ell\}_1^K \in \mathbb{R}^{K(p+1)}} \left[ \frac{1}{N}\sum_{i=1}^{N} \log p_{g_i}(x_i) - \lambda \sum_{\ell=1}^{K} P_\alpha(\beta_\ell) \right]. \qquad (21)

Denote by Y the N × K indicator response matrix, with elements y_{iℓ} = I(g_i = ℓ). Then we can write the log-likelihood part of (21) in the more explicit form

\ell(\{\beta_{0\ell},\beta_\ell\}_1^K) = \frac{1}{N}\sum_{i=1}^{N} \left[ \sum_{\ell=1}^{K} y_{i\ell}(\beta_{0\ell} + x_i^\top\beta_\ell) - \log\left( \sum_{\ell=1}^{K} e^{\beta_{0\ell} + x_i^\top\beta_\ell} \right) \right]. \qquad (22)

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights w_i as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only (β_{0ℓ}, β_ℓ) to vary for a single class at a time. It is not hard to show that this is

\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) = -\frac{1}{2N}\sum_{i=1}^{N} w_{i\ell}(z_{i\ell} - \beta_{0\ell} - x_i^\top\beta_\ell)^2 + C(\{\beta_{0k},\beta_k\}_1^K), \qquad (23)

where, as before,

z_{i\ell} = \beta_{0\ell} + x_i^\top\beta_\ell + \frac{y_{i\ell} - p_\ell(x_i)}{p_\ell(x_i)(1 - p_\ell(x_i))}, \qquad (24)

w_{i\ell} = p_\ell(x_i)(1 - p_\ell(x_i)). \qquad (25)

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ, we create an outer loop which cycles over ℓ and computes the partial quadratic approximation \ell_{Q\ell} about the current parameters (β_0, β). Then we use coordinate descent to solve the penalized weighted least-squares problem

\min_{(\beta_{0\ell},\beta_\ell)\in\mathbb{R}^{p+1}} \big\{ -\ell_{Q\ell}(\beta_{0\ell},\beta_\ell) + \lambda P_\alpha(\beta_\ell) \big\}. \qquad (26)

This amounts to the sequence of nested loops:

outer loop: Decrement λ.

middle loop (outer): Cycle over ℓ ∈ {1, 2, . . . , K, 1, 2, . . .}.

middle loop (inner): Update the quadratic approximation \ell_{Q\ell} using the current parameters \{\beta_{0k}, \beta_k\}_1^K.

inner loop: Run the co-ordinate descent algorithm on the penalized weighted-least-squares problem (26).


4.1 Regularization and parameter ambiguity

As was pointed out earlier, if \{\beta_{0\ell}, \beta_\ell\}_1^K characterizes a fitted model for (20), then \{\beta_{0\ell} - c_0, \beta_\ell - c\}_1^K gives an identical fit (c is a p-vector). Although this means that the log-likelihood part of (21) is insensitive to (c_0, c), the penalty is not. In particular, we can always improve an estimate \{\beta_{0\ell}, \beta_\ell\}_1^K (with respect to (21)) by solving

\min_{c\in\mathbb{R}^p} \sum_{\ell=1}^{K} P_\alpha(\beta_\ell - c). \qquad (27)

This can be done separately for each coordinate, hence

c_j = \arg\min_t \sum_{\ell=1}^{K} \left[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha|\beta_{j\ell} - t| \right]. \qquad (28)

Theorem 1. Consider problem (28) for values α ∈ [0, 1]. Let \bar\beta_j be the mean of the β_{jℓ}, and \beta^M_j a median of the β_{jℓ} (and for simplicity assume \bar\beta_j ≤ \beta^M_j). Then we have

c_j \in [\bar\beta_j, \beta^M_j], \qquad (29)

with the left endpoint achieved if α = 0 and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28). The objective is piecewise quadratic, with knots defined by the β_{jℓ}. We need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j after each inner middle loop step, using the solution c_j for each j.
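A small R illustration of this search (our own code, not the package internals): each quadratic piece of (28) is minimized in closed form via the stationarity condition (34) of Appendix A and projected back into its interval, and the best of these candidates and the knots in [\bar\beta_j, \beta^M_j] is returned.

recenter_cj <- function(beta_jl, alpha) {
  K <- length(beta_jl)
  obj <- function(t) sum(0.5 * (1 - alpha) * (beta_jl - t)^2 + alpha * abs(beta_jl - t))
  lo <- min(mean(beta_jl), median(beta_jl))
  hi <- max(mean(beta_jl), median(beta_jl))
  knots <- sort(unique(c(lo, hi, beta_jl[beta_jl > lo & beta_jl < hi])))
  cand <- knots
  if (alpha < 1 && length(knots) > 1) {
    for (i in seq_len(length(knots) - 1)) {
      s <- sign(beta_jl - (knots[i] + knots[i + 1]) / 2)   # s_jl is constant on each interval
      t_star <- mean(beta_jl) + alpha / ((1 - alpha) * K) * sum(s)  # stationary point, cf. (34)
      cand <- c(cand, min(max(t_star, knots[i]), knots[i + 1]))     # project into the interval
    }
  }
  cand[which.min(vapply(cand, obj, numeric(1)))]
}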

Not all the parameters in our model are regularized. The intercepts β_{0ℓ} are not, and with our penalty modifiers γ_j (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2 Grouped and matrix responses

As in the two class case, the data can be presented in the form of an N × K matrix m_{iℓ} of non-negative numbers. For example, if the data are grouped: at each x_i we have a number of multinomial samples, with m_{iℓ} falling into category ℓ. In this case we divide each row by the row-sum m_i = \sum_\ell m_{i\ell}, and produce our response matrix y_{iℓ} = m_{iℓ}/m_i; m_i becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using y_{iℓ} just defined). The weights in (25) get augmented with the observation weight m_i:

w_{i\ell} = m_i\, p_\ell(x_i)(1 - p_\ell(x_i)). \qquad (30)

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.
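As a hypothetical illustration of these grouped forms with the released R package: glmnet accepts a two-column matrix of counts for the binomial family, and (in our reading of its documentation) a matrix of class counts or proportions, together with observation weights, for the multinomial family.

library(glmnet)
set.seed(2)
N <- 200; p <- 30; K <- 3
x <- matrix(rnorm(N * p), N, p)

m   <- matrix(rpois(N * K, lambda = 5), N, K) + 1   # multinomial counts m_il (kept positive)
y   <- m / rowSums(m)                               # class proportions y_il = m_il / m_i
fit_multi <- glmnet(x, y, family = "multinomial", weights = rowSums(m))

counts <- cbind(failures = rpois(N, 4) + 1, successes = rpois(N, 4) + 1)
fit_binom <- glmnet(x, counts, family = "binomial") # grouped two-class data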


            5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1 Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors X_j, X_{j'} having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

Y = \sum_{j=1}^{p} X_j \beta_j + k \cdot Z, \qquad (31)

where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1), and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
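For concreteness, a sketch of this data-generating process in R; the equicorrelation construction and the definition of signal-to-noise ratio as sd(signal)/sd(noise) are our own reading of the setup.

gen_data <- function(N, p, rho, snr = 3) {
  z0 <- rnorm(N)                                    # shared component gives corr(X_j, X_k) = rho
  x  <- sqrt(rho) * matrix(z0, N, p) + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta   <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20) # alternating, exponentially decaying
  signal <- drop(x %*% beta)
  k <- sd(signal) / snr                             # scale the noise so sd(signal)/sd(k * Z) = snr
  list(x = x, y = signal + k * rnorm(N), beta = beta)
}

dat <- gen_data(N = 1000, p = 100, rho = 0.2)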

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case lars takes min(N, p) steps, hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2 Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the same 100 λ values for all.


Linear regression – Dense features

                          Correlation
                          0       0.1     0.2     0.5     0.9     0.95
N = 1000, p = 100
glmnet (type = naive)     0.05    0.06    0.06    0.09    0.08    0.07
glmnet (type = cov)       0.02    0.02    0.02    0.02    0.02    0.02
lars                      0.11    0.11    0.11    0.11    0.11    0.11

N = 5000, p = 100
glmnet (type = naive)     0.24    0.25    0.26    0.34    0.32    0.31
glmnet (type = cov)       0.05    0.05    0.05    0.05    0.05    0.05
lars                      0.29    0.29    0.29    0.30    0.29    0.29

N = 100, p = 1000
glmnet (type = naive)     0.04    0.05    0.04    0.05    0.04    0.03
glmnet (type = cov)       0.07    0.08    0.07    0.08    0.04    0.03
lars                      0.73    0.72    0.68    0.71    0.71    0.67

N = 100, p = 5000
glmnet (type = naive)     0.20    0.18    0.21    0.23    0.21    0.14
glmnet (type = cov)       0.46    0.42    0.51    0.48    0.25    0.10
lars                      3.73    3.53    3.59    3.47    3.90    3.52

N = 100, p = 20000
glmnet (type = naive)     1.00    0.99    1.06    1.29    1.17    0.97
glmnet (type = cov)       1.86    2.26    2.34    2.59    1.24    0.79
lars                      18.30   17.90   16.90   18.03   17.91   16.39

N = 100, p = 50000
glmnet (type = naive)     2.66    2.46    2.84    3.53    3.39    2.43
glmnet (type = cov)       5.50    4.92    6.13    7.35    4.52    2.53
lars                      58.68   64.00   64.79   58.20   66.39   79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

Table 2 shows the results; in some cases, we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

             Correlation
             0        0.1      0.2      0.5      0.9      0.95
N = 1000, p = 100
glmnet       1.65     1.81     2.31     3.87     5.99     8.48
l1logreg     31.475   31.86    34.35    32.21    31.85    31.81
BBR          40.70    47.57    54.18    70.06    106.72   121.41
LPL          24.68    31.64    47.99    170.77   741.00   1448.25

N = 5000, p = 100
glmnet       7.89     8.48     9.01     13.39    26.68    26.36
l1logreg     239.88   232.00   229.62   229.49   221.9    223.09

N = 100,000, p = 100
glmnet       78.56    178.45   205.94   274.33   552.48   638.50

N = 100, p = 1000
glmnet       1.06     1.07     1.09     1.45     1.72     1.37
l1logreg     25.99    26.40    25.67    26.49    24.34    20.16
BBR          70.19    71.19    78.40    103.77   149.05   113.87
LPL          11.02    10.87    10.76    16.34    41.84    70.50

N = 100, p = 5000
glmnet       5.24     4.43     5.12     7.05     7.87     6.05
l1logreg     165.02   161.90   163.25   166.50   151.91   135.28

N = 100, p = 100,000
glmnet       137.27   139.40   146.55   197.98   219.65   201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets.

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

             Correlation
             0        0.1      0.2      0.5      0.9      0.95
N = 1000, p = 100
glmnet       0.77     0.74     0.72     0.73     0.84     0.88
l1logreg     5.19     5.21     5.14     5.40     6.14     6.26
BBR          2.01     1.95     1.98     2.06     2.73     2.88

N = 100, p = 1000
glmnet       1.81     1.73     1.55     1.70     1.63     1.55
l1logreg     7.67     7.72     7.64     9.04     9.81     9.40
BBR          4.66     4.58     4.68     5.15     5.78     5.53

N = 10,000, p = 100
glmnet       3.21     3.02     2.95     3.25     4.58     5.08
l1logreg     45.87    46.63    44.33    43.99    45.60    43.16
BBR          11.80    11.64    11.58    13.30    12.46    11.83

N = 100, p = 10,000
glmnet       10.18    10.35    9.93     10.04    9.02     8.91
l1logreg     130.27   124.88   124.18   129.84   137.21   159.54
BBR          45.72    47.50    47.46    48.49    56.29    60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1.2% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size of problems we consider here.


Name          Type       N       p         glmnet     l1logreg   BBR/BMR

Dense
Cancer        14 class   144     16,063    2.5 mins              2.1 hrs
Leukemia      2 class    72      3571      2.50       55.0       450

Sparse
InternetAd    2 class    2359    1430      5.0        20.9       34.7
NewsGroup     2 class    11,314  777,811   2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λ_min = 0.05 λ_max.

              MacBook Pro    HP Linux server
glmnet        0.34           0.13
penalized     10.31
OWL-QN                       314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

Two referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

            6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.


[Figure 2 here: cross-validation curves plotted against log(Lambda), with the number of nonzero coefficients annotated along the top of each panel. Top panel: mean squared error, Gaussian family. Bottom panels: deviance and misclassification error, binomial family.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response; the second row is logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard error of the minimum: the so-called "one-standard-error" rule. The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way.

Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the latter is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter, and chosen on more subjective grounds.
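In the released package this workflow is provided by cv.glmnet: lambda.min gives the minimizer of the cross-validated error, and lambda.1se the one-standard-error choice. The example below is a hypothetical toy dataset, roughly in the spirit of Figure 2.

library(glmnet)
set.seed(3)
x <- matrix(rnorm(1000 * 100), 1000, 100)
y <- drop(x[, 1:10] %*% rep(2, 10)) + rnorm(1000)   # response depends on 10 predictors

cvfit <- cv.glmnet(x, y, nfolds = 10)
plot(cvfit)                                         # error curve with one-standard-deviation band
c(lambda_min = cvfit$lambda.min, lambda_1se = cvfit$lambda.1se)
coef(cvfit, s = "lambda.1se")                       # coefficients under the one-standard-error rule

For a binomial response, setting type.measure = "deviance" or "class" in cv.glmnet selects deviance or misclassification error as the risk measure.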

            7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and p, is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
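A minimal usage example of the CRAN package (illustrative only; see the package documentation for the full interface):

# install.packages("glmnet")        # from CRAN
library(glmnet)

set.seed(4)
x <- matrix(rnorm(200 * 50), 200, 50)
y <- rnorm(200)

fit <- glmnet(x, y, alpha = 0.5)    # elastic net path; alpha = 1 is the lasso, alpha = 0 ridge
plot(fit, xvar = "lambda")          # coefficient profiles along the path
predict(fit, newx = x[1:5, ], s = 0.05)   # predictions at a particular lambda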

            Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

            References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.

Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1, available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software, URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Bühlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. PhD thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


            A Proof of Theorem 1

We have

c_j = \arg\min_t \sum_{\ell=1}^{K} \left[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha|\beta_{j\ell} - t| \right]. \qquad (32)

Suppose α ∈ (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

\sum_{\ell=1}^{K} \big[ -(1-\alpha)(\beta_{j\ell} - t) - \alpha s_{j\ell} \big] = 0, \qquad (33)

where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t, and s_{jℓ} ∈ [−1, 1] otherwise. This gives

t = \bar\beta_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha}\sum_{\ell=1}^{K} s_{j\ell}. \qquad (34)

It follows that t cannot be larger than \beta^M_j, since then the second term above would be negative, and this would imply that t is less than \bar\beta_j. Similarly, t cannot be less than \bar\beta_j, since then the second term above would have to be negative, implying that t is larger than \beta^M_j. Either case contradicts the assumption that \bar\beta_j ≤ \beta^M_j, so the minimizer must lie in [\bar\beta_j, \beta^M_j].

            Affiliation

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software                      http://www.jstatsoft.org/
published by the American Statistical Association    http://www.amstat.org/

Volume 33, Issue 1                                    Submitted: 2009-04-22
January 2010                                          Accepted: 2009-12-15


              Journal of Statistical Software 7

              not alter the sparsity but centering will So scaling is performed up front but the centeringis incorporated in the algorithm in an efficient and obvious manner

              24 Weighted updates

              Often a weight wi (other than 1N) is associated with each observation This will arisenaturally in later sections where observations receive weights in the IRLS algorithm In thiscase the update step (5) becomes only slightly more complicated

              βj larrS(sumN

              i=1wixij(yi minus y(j)i ) λα

              )sumN

              i=1wix2ij + λ(1minus α)

              (10)

              If the xj are not standardized there is a similar sum-of-squares term in the denominator(even without weights) The presence of weights does not change the computational costs ofeither algorithm much as long as the weights remain fixed

              25 Pathwise coordinate descent

              We compute the solutions for a decreasing sequence of values for λ starting at the smallestvalue λmax for which the entire vector β = 0 Apart from giving us a path of solutions thisscheme exploits warm starts and leads to a more stable algorithm We have examples whereit is faster to compute the path down to λ (for small λ) than the solution only at that valuefor λ

              When β = 0 we see from (5) that βj will stay zero if 1N |〈xj y〉| lt λα Hence Nαλmax =

              max` |〈x` y〉| Our strategy is to select a minimum value λmin = ελmax and construct asequence of K values of λ decreasing from λmax to λmin on the log scale Typical values areε = 0001 and K = 100

              26 Other details

              Irrespective of whether the variables are standardized to have variance 1 we always centereach predictor variable Since the intercept is not regularized this means that β0 = y themean of the yi for all values of α and λ

              It is easy to allow different penalties λj for each of the variables We implement this via apenalty scaling parameter γj ge 0 If γj gt 0 then the penalty applied to βj is λj = λγj If γj = 0 that variable does not get penalized and always enters the model unrestricted atthe first step and remains in the model Penalty rescaling would also allow for example oursoftware to be used to implement the adaptive lasso (Zou 2006)

              Considerable speedup is obtained by organizing the iterations around the active set of featuresmdashthose with nonzero coefficients After a complete cycle through all the variables we iterateon only the active set till convergence If another complete cycle does not change the activeset we are done otherwise the process is repeated Active-set convergence is also mentionedin Meier et al (2008) and Krishnapuram and Hartemink (2005)

              8 Regularization Paths for GLMs via Coordinate Descent

              3 Regularized logistic regression

              When the response variable is binary the linear logistic regression model is often used Denoteby G the response variable taking values in G = 1 2 (the labeling of the elements isarbitrary) The logistic regression model represents the class-conditional probabilities througha linear function of the predictors

              Pr(G = 1|x) =1

              1 + eminus(β0+xgtβ) (11)

              Pr(G = 2|x) =1

              1 + e+(β0+xgtβ)

              = 1minus Pr(G = 1|x)

              Alternatively this implies that

              logPr(G = 1|x)Pr(G = 2|x)

              = β0 + xgtβ (12)

              Here we fit this model by regularized maximum (binomial) likelihood Let p(xi) = Pr(G =1|xi) be the probability (11) for observation i at a particular value for the parameters (β0 β)then we maximize the penalized log likelihood

              max(β0β)isinRp+1

              [1N

              Nsumi=1

              I(gi = 1) log p(xi) + I(gi = 2) log(1minus p(xi))

              minus λPα(β)

              ] (13)

              Denoting yi = I(gi = 1) the log-likelihood part of (13) can be written in the more explicitform

              `(β0 β) =1N

              Nsumi=1

              yi middot (β0 + xgti β)minus log(1 + e(β0+xgti β)) (14)

              a concave function of the parameters The Newton algorithm for maximizing the (unpe-nalized) log-likelihood (14) amounts to iteratively reweighted least squares Hence if thecurrent estimates of the parameters are (β0 β) we form a quadratic approximation to thelog-likelihood (Taylor expansion about current estimates) which is

              `Q(β0 β) = minus 12N

              Nsumi=1

              wi(zi minus β0 minus xgti β)2 + C(β0 β)2 (15)

              where

              zi = β0 + xgti β +yi minus p(xi)

              p(xi)(1minus p(xi)) (working response) (16)

              wi = p(xi)(1minus p(xi)) (weights) (17)

              and p(xi) is evaluated at the current parameters The last term is constant The Newtonupdate is obtained by minimizing `QOur approach is similar For each value of λ we create an outer loop which computes thequadratic approximation `Q about the current parameters (β0 β) Then we use coordinatedescent to solve the penalized weighted least-squares problem

              min(β0β)isinRp+1

              minus`Q(β0 β) + λPα(β) (18)

              This amounts to a sequence of nested loops

              Journal of Statistical Software 9

              outer loop Decrement λ

              middle loop Update the quadratic approximation `Q using the current parameters (β0 β)

              inner loop Run the coordinate descent algorithm on the penalized weighted-least-squaresproblem (18)

              There are several important details in the implementation of this algorithm

              When p N one cannot run λ all the way to zero because the saturated logisticregression fit is undefined (parameters wander off to plusmninfin in order to achieve probabilitiesof 0 or 1) Hence the default λ sequence runs down to λmin = ελmax gt 0

              Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0or 1 When a probability is within ε = 10minus5 of 1 we set it to 1 and set the weights toε 0 is treated similarly

              Our code has an option to approximate the Hessian terms by an exact upper-boundThis is obtained by setting the wi in (17) all equal to 025 (Krishnapuram and Hartemink2005)

              We allow the response data to be supplied in the form of a two-column matrix of countssometimes referred to as grouped data We discuss this in more detail in Section 42

              The Newton algorithm is not guaranteed to converge without step-size optimization(Lee et al 2006) Our code does not implement any checks for divergence this wouldslow it down and when used as recommended we do not feel it is necessary We havea closed form expression for the starting solutions and each subsequent solution iswarm-started from the previous close-by solution which generally makes the quadraticapproximations very accurate We have not encountered any divergence problems sofar

              4 Regularized multinomial regression

              When the categorical response variable G has K gt 2 levels the linear logistic regressionmodel can be generalized to a multi-logit model The traditional approach is to extend (12)to K minus 1 logits

              logPr(G = `|x)Pr(G = K|x)

              = β0` + xgtβ` ` = 1 K minus 1 (19)

              Here β` is a p-vector of coefficients As in Zhu and Hastie (2004) here we choose a moresymmetric approach We model

              Pr(G = `|x) =eβ0`+x

              gtβ`sumKk=1 e

              β0k+xgtβk

              (20)

              This parametrization is not estimable without constraints because for any values for theparameters β0` β`K1 β0` minus c0 β` minus cK1 give identical probabilities (20) Regularizationdeals with this ambiguity in a natural way see Section 41 below

              10 Regularization Paths for GLMs via Coordinate Descent

              We fit the model (20) by regularized maximum (multinomial) likelihood Using a similarnotation as before let p`(xi) = Pr(G = `|xi) and let gi isin 1 2 K be the ith responseWe maximize the penalized log-likelihood

              maxβ0`β`K1 isinRK(p+1)

              [1N

              Nsumi=1

              log pgi(xi)minus λKsum`=1

              Pα(β`)

              ] (21)

              Denote by Y the N timesK indicator response matrix with elements yi` = I(gi = `) Then wecan write the log-likelihood part of (21) in the more explicit form

              `(β0` β`K1 ) =1N

              Nsumi=1

              [Ksum`=1

              yi`(β0` + xgti β`)minus log

              (Ksum`=1

              eβ0`+xgti β`

              )] (22)

              The Newton algorithm for multinomial regression can be tedious because of the vector natureof the response observations Instead of weights wi as in (17) we get weight matrices forexample However in the spirit of coordinate descent we can avoid these complexitiesWe perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22) allowing only (β0` β`) to vary for a single class at a time It is not hard toshow that this is

              `Q`(β0` β`) = minus 12N

              Nsumi=1

              wi`(zi` minus β0` minus xgti β`)2 + C(β0k βkK1 ) (23)

              where as before

              zi` = β0` + xgti β` +yi` minus p`(xi)

              p`(xi)(1minus p`(xi)) (24)

              wi` = p`(xi)(1minus p`(xi)) (25)

              Our approach is similar to the two-class case except now we have to cycle over the classesas well in the outer loop For each value of λ we create an outer loop which cycles over `and computes the partial quadratic approximation `Q` about the current parameters (β0 β)Then we use coordinate descent to solve the penalized weighted least-squares problem

              min(β0`β`)isinRp+1

              minus`Q`(β0` β`) + λPα(β`) (26)

              This amounts to the sequence of nested loops

              outer loop Decrement λ

              middle loop (outer) Cycle over ` isin 1 2 K 1 2

              middle loop (inner) Update the quadratic approximation `Q` using the current parame-ters β0k βkK1

              inner loop Run the co-ordinate descent algorithm on the penalized weighted-least-squaresproblem (26)

              Journal of Statistical Software 11

              41 Regularization and parameter ambiguity

              As was pointed out earlier if β0` β`K1 characterizes a fitted model for (20) then β0` minusc0 β`minuscK1 gives an identical fit (c is a p-vector) Although this means that the log-likelihoodpart of (21) is insensitive to (c0 c) the penalty is not In particular we can always improvean estimate β0` β`K1 (wrt (21)) by solving

              mincisinRp

              Ksum`=1

              Pα(β` minus c) (27)

              This can be done separately for each coordinate hence

              cj = arg mint

              Ksum`=1

              [12(1minus α)(βj` minus t)2 + α|βj` minus t|

              ] (28)

              Theorem 1 Consider problem (28) for values α isin [0 1] Let βj be the mean of the βj` andβMj a median of the βj` (and for simplicity assume βj le βMj Then we have

              cj isin [βj βMj ] (29)

              with the left endpoint achieved if α = 0 and the right if α = 1

              The two endpoints are obvious The proof of Theorem 1 is given in Appendix A A con-sequence of the theorem is that a very simple search algorithm can be used to solve (28)The objective is piecewise quadratic with knots defined by the βj` We need only evaluatesolutions in the intervals including the mean and median and those in between We recenterthe parameters in each index set j after each inner middle loop step using the the solutioncj for each j

Not all the parameters in our model are regularized. The intercepts β_{0ℓ} are not, and with our penalty modifiers γ_j (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2 Grouped and matrix responses

As in the two-class case, the data can be presented in the form of an N × K matrix m_{iℓ} of non-negative numbers. For example, if the data are grouped, at each x_i we have a number of multinomial samples, with m_{iℓ} falling into category ℓ. In this case we divide each row by the row-sum m_i = Σ_ℓ m_{iℓ}, and produce our response matrix y_{iℓ} = m_{iℓ}/m_i; m_i becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using y_{iℓ} just defined). The weights in (25) get augmented with the observation weight m_i:

$$ w_{i\ell} = m_i\, p_\ell(x_i)(1-p_\ell(x_i)). \qquad (30) $$

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.
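In recent versions of the glmnet package, this kind of grouped response can be passed directly: for family = "multinomial" the response may be supplied as an N × K matrix of counts or proportions, which plays the role of the m_{iℓ} above. A hedged sketch, with illustrative simulated data and sizes:

library(glmnet)
set.seed(1)
N <- 200; p <- 20; K <- 3
x <- matrix(rnorm(N * p), N, p)
counts <- t(rmultinom(N, size = 10, prob = c(0.5, 0.3, 0.2)))  # N x K matrix of counts
fit <- glmnet(x, counts, family = "multinomial", alpha = 0.5)  # elastic net, grouped response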


              5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1 Regression with the lasso

We generated Gaussian data with N observations and p predictors, with each pair of predictors X_j, X_{j′} having the same population correlation ρ. We tried a number of combinations of N and p, with ρ varying from zero to 0.95. The outcome values were generated by

$$ Y = \sum_{j=1}^{p} X_j\beta_j + k\cdot Z, \qquad (31) $$

where β_j = (−1)^j exp(−2(j − 1)/20), Z ∼ N(0, 1), and k is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
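For concreteness, a short R sketch of this data-generating process: the equicorrelated predictors are built from a shared Gaussian factor, and we take the signal-to-noise ratio to mean sd(signal)/sd(noise), which is an assumption about the definition used.

sim_lasso_data <- function(N, p, rho, snr = 3) {
  z0 <- rnorm(N)                                 # shared factor induces pairwise correlation rho
  x  <- sqrt(rho) * matrix(z0, N, p) + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)
  signal <- drop(x %*% beta)
  k <- sd(signal) / snr                          # sets the signal-to-noise ratio
  list(x = x, y = signal + k * rnorm(N), beta = beta)
}
d <- sim_lasso_data(N = 1000, p = 100, rho = 0.2)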

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case, lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2 Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.
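A brief R sketch of this two-class construction and the corresponding glmnet call; uncorrelated predictors are used here only to keep the example short.

library(glmnet)
set.seed(2)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y_cont <- drop(x %*% ((-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)))
pr <- 1 / (1 + exp(-y_cont))                 # p = 1/(1 + exp(-y))
z  <- rbinom(N, size = 1, prob = pr)         # two-class outcome
fit <- glmnet(x, z, family = "binomial")     # lasso (alpha = 1 is the default)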

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the


Linear regression – Dense features

                                    Correlation
                          0     0.1    0.2    0.5    0.9    0.95
N = 1000, p = 100
glmnet (type = naive)    0.05   0.06   0.06   0.09   0.08   0.07
glmnet (type = cov)      0.02   0.02   0.02   0.02   0.02   0.02
lars                     0.11   0.11   0.11   0.11   0.11   0.11

N = 5000, p = 100
glmnet (type = naive)    0.24   0.25   0.26   0.34   0.32   0.31
glmnet (type = cov)      0.05   0.05   0.05   0.05   0.05   0.05
lars                     0.29   0.29   0.29   0.30   0.29   0.29

N = 100, p = 1000
glmnet (type = naive)    0.04   0.05   0.04   0.05   0.04   0.03
glmnet (type = cov)      0.07   0.08   0.07   0.08   0.04   0.03
lars                     0.73   0.72   0.68   0.71   0.71   0.67

N = 100, p = 5000
glmnet (type = naive)    0.20   0.18   0.21   0.23   0.21   0.14
glmnet (type = cov)      0.46   0.42   0.51   0.48   0.25   0.10
lars                     3.73   3.53   3.59   3.47   3.90   3.52

N = 100, p = 20000
glmnet (type = naive)    1.00   0.99   1.06   1.29   1.17   0.97
glmnet (type = cov)      1.86   2.26   2.34   2.59   1.24   0.79
lars                    18.30  17.90  16.90  18.03  17.91  16.39

N = 100, p = 50000
glmnet (type = naive)    2.66   2.46   2.84   3.53   3.39   2.43
glmnet (type = cov)      5.50   4.92   6.13   7.35   4.52   2.53
lars                    58.68  64.00  64.79  58.20  66.39  79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

same 100 λ values for all. Table 2 shows the results; in some cases, we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

                       Correlation
              0       0.1      0.2      0.5      0.9      0.95
N = 1000, p = 100
glmnet        1.65     1.81     2.31     3.87     5.99     8.48
l1logreg     31.475   31.86    34.35    32.21    31.85    31.81
BBR          40.70    47.57    54.18    70.06   106.72   121.41
LPL          24.68    31.64    47.99   170.77   741.00  1448.25

N = 5000, p = 100
glmnet        7.89     8.48     9.01    13.39    26.68    26.36
l1logreg    239.88   232.00   229.62   229.49    22.19   223.09

N = 100,000, p = 100
glmnet       78.56   178.45   205.94   274.33   552.48   638.50

N = 100, p = 1000
glmnet        1.06     1.07     1.09     1.45     1.72     1.37
l1logreg     25.99    26.40    25.67    26.49    24.34    20.16
BBR          70.19    71.19    78.40   103.77   149.05   113.87
LPL          11.02    10.87    10.76    16.34    41.84    70.50

N = 100, p = 5000
glmnet        5.24     4.43     5.12     7.05     7.87     6.05
l1logreg    165.02   161.90   163.25   166.50   151.91   135.28

N = 100, p = 100,000
glmnet      137.27   139.40   146.55   197.98   219.65   201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets.

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

                     Correlation
              0       0.1     0.2     0.5     0.9     0.95
N = 1000, p = 100
glmnet        0.77    0.74    0.72    0.73    0.84    0.88
l1logreg      5.19    5.21    5.14    5.40    6.14    6.26
BBR           2.01    1.95    1.98    2.06    2.73    2.88

N = 100, p = 1000
glmnet        1.81    1.73    1.55    1.70    1.63    1.55
l1logreg      7.67    7.72    7.64    9.04    9.81    9.40
BBR           4.66    4.58    4.68    5.15    5.78    5.53

N = 10,000, p = 100
glmnet        3.21    3.02    2.95    3.25    4.58    5.08
l1logreg     45.87   46.63   44.33   43.99   45.60   43.16
BBR          11.80   11.64   11.58   13.30   12.46   11.83

N = 100, p = 10,000
glmnet       10.18   10.35    9.93   10.04    9.02    8.91
l1logreg    130.27  124.88  124.18  129.84  137.21  159.54
BBR          45.72   47.50   47.46   48.49   56.29   60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1.2% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size of problems we consider here. Two referees of


Name          Type      N       p         glmnet     l1logreg   BBR/BMR

Dense
Cancer        14 class  144     16,063    2.5 mins               2.1 hrs
Leukemia      2 class   72      3571      2.50       55.0        450

Sparse
InternetAd    2 class   2359    1430      5.0        20.9        34.7
NewsGroup     2 class   11,314  777,811   2 mins                 3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λ_min = 0.05 λ_max.

            MacBook Pro   HP Linux server
glmnet          0.34           0.13
penalized      10.31
OWL-QN                       314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

              6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.
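A sketch of this validation-set approach in R, on simulated data (the data-generating step is illustrative):

library(glmnet)
set.seed(3)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- drop(x[, 1:10] %*% rep(1, 10)) + rnorm(N)  # only 10 predictors carry signal
hold <- sample(N, N %/% 3)                      # set aside a third for validation
fit  <- glmnet(x[-hold, ], y[-hold])            # Gaussian lasso path on the rest
pred <- predict(fit, newx = x[hold, ])          # one column of predictions per lambda
mse  <- colMeans((y[hold] - pred)^2)
best_lambda <- fit$lambda[which.min(mse)]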


[Figure 2: three cross-validation curves plotted against log(Lambda): mean squared error for the Gaussian family, and deviance and misclassification error for the binomial family.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response, the second row logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule). The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way. Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so it errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter, and chosen on more subjective grounds.
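The same workflow can be carried out with cv.glmnet, which implements the cross-validation and one-standard-error rule just described; lambda.min and lambda.1se are the relevant components in current versions of the package.

library(glmnet)
set.seed(4)
x <- matrix(rnorm(1000 * 100), 1000, 100)
y <- drop(x[, 1:10] %*% rep(1, 10)) + rnorm(1000)
cvfit <- cv.glmnet(x, y, nfolds = 10)        # ten-fold cross-validation along the path
cvfit$lambda.min                             # lambda minimizing the cross-validated error
cvfit$lambda.1se                             # largest lambda within one standard error
coef(cvfit, s = "lambda.1se")                # coefficients under the one-standard-error rule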

              7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and p, is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
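For example, a sparse design built with the Matrix package can be handed to glmnet directly; the sizes and sparsity level below are illustrative.

library(glmnet)
library(Matrix)
set.seed(5)
x <- matrix(rnorm(500 * 200), 500, 200)
x[sample(length(x), 0.95 * length(x))] <- 0  # make the design 95% sparse
xs <- Matrix(x, sparse = TRUE)               # stored as a dgCMatrix
y  <- x[, 1] - x[, 2] + rnorm(500)
fit <- glmnet(xs, y)                         # same call as for a dense matrix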

              Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

              References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.


Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software. URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. PhD thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


              A Proof of Theorem 1

We have

$$ c_j = \arg\min_t \sum_{\ell=1}^{K} \left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2 + \alpha|\beta_{j\ell}-t|\right]. \qquad (32) $$

Suppose α ∈ (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

$$ \sum_{\ell=1}^{K} \left[-(1-\alpha)(\beta_{j\ell}-t) - \alpha\, s_{j\ell}\right] = 0, \qquad (33) $$

where s_{jℓ} = sign(β_{jℓ} − t) if β_{jℓ} ≠ t, and s_{jℓ} ∈ [−1, 1] otherwise. This gives

$$ t = \bar\beta_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha}\sum_{\ell=1}^{K} s_{j\ell}. \qquad (34) $$

It follows that t cannot be larger than β^M_j, since then the second term above would be negative, and this would imply that t is less than β̄_j. Similarly t cannot be less than β̄_j, since then the second term above would have to be negative, implying that t is larger than β^M_j.

Affiliation:

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software, http://www.jstatsoft.org/, published by the American Statistical Association, http://www.amstat.org/.
Volume 33, Issue 1, January 2010. Submitted: 2009-04-22; Accepted: 2009-12-15.


                8 Regularization Paths for GLMs via Coordinate Descent

                3 Regularized logistic regression

                When the response variable is binary the linear logistic regression model is often used Denoteby G the response variable taking values in G = 1 2 (the labeling of the elements isarbitrary) The logistic regression model represents the class-conditional probabilities througha linear function of the predictors

                Pr(G = 1|x) =1

                1 + eminus(β0+xgtβ) (11)

                Pr(G = 2|x) =1

                1 + e+(β0+xgtβ)

                = 1minus Pr(G = 1|x)

                Alternatively this implies that

                logPr(G = 1|x)Pr(G = 2|x)

                = β0 + xgtβ (12)

                Here we fit this model by regularized maximum (binomial) likelihood Let p(xi) = Pr(G =1|xi) be the probability (11) for observation i at a particular value for the parameters (β0 β)then we maximize the penalized log likelihood

                max(β0β)isinRp+1

                [1N

                Nsumi=1

                I(gi = 1) log p(xi) + I(gi = 2) log(1minus p(xi))

                minus λPα(β)

                ] (13)

                Denoting yi = I(gi = 1) the log-likelihood part of (13) can be written in the more explicitform

                `(β0 β) =1N

                Nsumi=1

                yi middot (β0 + xgti β)minus log(1 + e(β0+xgti β)) (14)

                a concave function of the parameters The Newton algorithm for maximizing the (unpe-nalized) log-likelihood (14) amounts to iteratively reweighted least squares Hence if thecurrent estimates of the parameters are (β0 β) we form a quadratic approximation to thelog-likelihood (Taylor expansion about current estimates) which is

                `Q(β0 β) = minus 12N

                Nsumi=1

                wi(zi minus β0 minus xgti β)2 + C(β0 β)2 (15)

                where

                zi = β0 + xgti β +yi minus p(xi)

                p(xi)(1minus p(xi)) (working response) (16)

                wi = p(xi)(1minus p(xi)) (weights) (17)

                and p(xi) is evaluated at the current parameters The last term is constant The Newtonupdate is obtained by minimizing `QOur approach is similar For each value of λ we create an outer loop which computes thequadratic approximation `Q about the current parameters (β0 β) Then we use coordinatedescent to solve the penalized weighted least-squares problem

                min(β0β)isinRp+1

                minus`Q(β0 β) + λPα(β) (18)

                This amounts to a sequence of nested loops

                Journal of Statistical Software 9

                outer loop Decrement λ

                middle loop Update the quadratic approximation `Q using the current parameters (β0 β)

                inner loop Run the coordinate descent algorithm on the penalized weighted-least-squaresproblem (18)

                There are several important details in the implementation of this algorithm

                When p N one cannot run λ all the way to zero because the saturated logisticregression fit is undefined (parameters wander off to plusmninfin in order to achieve probabilitiesof 0 or 1) Hence the default λ sequence runs down to λmin = ελmax gt 0

                Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0or 1 When a probability is within ε = 10minus5 of 1 we set it to 1 and set the weights toε 0 is treated similarly

                Our code has an option to approximate the Hessian terms by an exact upper-boundThis is obtained by setting the wi in (17) all equal to 025 (Krishnapuram and Hartemink2005)

                We allow the response data to be supplied in the form of a two-column matrix of countssometimes referred to as grouped data We discuss this in more detail in Section 42

                The Newton algorithm is not guaranteed to converge without step-size optimization(Lee et al 2006) Our code does not implement any checks for divergence this wouldslow it down and when used as recommended we do not feel it is necessary We havea closed form expression for the starting solutions and each subsequent solution iswarm-started from the previous close-by solution which generally makes the quadraticapproximations very accurate We have not encountered any divergence problems sofar

                4 Regularized multinomial regression

                When the categorical response variable G has K gt 2 levels the linear logistic regressionmodel can be generalized to a multi-logit model The traditional approach is to extend (12)to K minus 1 logits

                logPr(G = `|x)Pr(G = K|x)

                = β0` + xgtβ` ` = 1 K minus 1 (19)

                Here β` is a p-vector of coefficients As in Zhu and Hastie (2004) here we choose a moresymmetric approach We model

                Pr(G = `|x) =eβ0`+x

                gtβ`sumKk=1 e

                β0k+xgtβk

                (20)

                This parametrization is not estimable without constraints because for any values for theparameters β0` β`K1 β0` minus c0 β` minus cK1 give identical probabilities (20) Regularizationdeals with this ambiguity in a natural way see Section 41 below

                10 Regularization Paths for GLMs via Coordinate Descent

                We fit the model (20) by regularized maximum (multinomial) likelihood Using a similarnotation as before let p`(xi) = Pr(G = `|xi) and let gi isin 1 2 K be the ith responseWe maximize the penalized log-likelihood

                maxβ0`β`K1 isinRK(p+1)

                [1N

                Nsumi=1

                log pgi(xi)minus λKsum`=1

                Pα(β`)

                ] (21)

                Denote by Y the N timesK indicator response matrix with elements yi` = I(gi = `) Then wecan write the log-likelihood part of (21) in the more explicit form

                `(β0` β`K1 ) =1N

                Nsumi=1

                [Ksum`=1

                yi`(β0` + xgti β`)minus log

                (Ksum`=1

                eβ0`+xgti β`

                )] (22)

                The Newton algorithm for multinomial regression can be tedious because of the vector natureof the response observations Instead of weights wi as in (17) we get weight matrices forexample However in the spirit of coordinate descent we can avoid these complexitiesWe perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22) allowing only (β0` β`) to vary for a single class at a time It is not hard toshow that this is

                `Q`(β0` β`) = minus 12N

                Nsumi=1

                wi`(zi` minus β0` minus xgti β`)2 + C(β0k βkK1 ) (23)

                where as before

                zi` = β0` + xgti β` +yi` minus p`(xi)

                p`(xi)(1minus p`(xi)) (24)

                wi` = p`(xi)(1minus p`(xi)) (25)

                Our approach is similar to the two-class case except now we have to cycle over the classesas well in the outer loop For each value of λ we create an outer loop which cycles over `and computes the partial quadratic approximation `Q` about the current parameters (β0 β)Then we use coordinate descent to solve the penalized weighted least-squares problem

                min(β0`β`)isinRp+1

                minus`Q`(β0` β`) + λPα(β`) (26)

                This amounts to the sequence of nested loops

                outer loop Decrement λ

                middle loop (outer) Cycle over ` isin 1 2 K 1 2

                middle loop (inner) Update the quadratic approximation `Q` using the current parame-ters β0k βkK1

                inner loop Run the co-ordinate descent algorithm on the penalized weighted-least-squaresproblem (26)

                Journal of Statistical Software 11

                41 Regularization and parameter ambiguity

                As was pointed out earlier if β0` β`K1 characterizes a fitted model for (20) then β0` minusc0 β`minuscK1 gives an identical fit (c is a p-vector) Although this means that the log-likelihoodpart of (21) is insensitive to (c0 c) the penalty is not In particular we can always improvean estimate β0` β`K1 (wrt (21)) by solving

                mincisinRp

                Ksum`=1

                Pα(β` minus c) (27)

                This can be done separately for each coordinate hence

                cj = arg mint

                Ksum`=1

                [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                ] (28)

                Theorem 1 Consider problem (28) for values α isin [0 1] Let βj be the mean of the βj` andβMj a median of the βj` (and for simplicity assume βj le βMj Then we have

                cj isin [βj βMj ] (29)

                with the left endpoint achieved if α = 0 and the right if α = 1

                The two endpoints are obvious The proof of Theorem 1 is given in Appendix A A con-sequence of the theorem is that a very simple search algorithm can be used to solve (28)The objective is piecewise quadratic with knots defined by the βj` We need only evaluatesolutions in the intervals including the mean and median and those in between We recenterthe parameters in each index set j after each inner middle loop step using the the solutioncj for each j

                Not all the parameters in our model are regularized The intercepts β0` are not and with ourpenalty modifiers γj (Section 26) others need not be as well For these parameters we usemean centering

                42 Grouped and matrix responses

                As in the two class case the data can be presented in the form of a N times K matrix mi` ofnon-negative numbers For example if the data are grouped at each xi we have a numberof multinomial samples with mi` falling into category ` In this case we divide each row bythe row-sum mi =

                sum`mi` and produce our response matrix yi` = mi`mi mi becomes an

                observation weight Our penalized maximum likelihood algorithm changes in a trivial wayThe working response (24) is defined exactly the same way (using yi` just defined) Theweights in (25) get augmented with the observation weight mi

                wi` = mip`(xi)(1minus p`(xi)) (30)

                Equivalently the data can be presented directly as a matrix of class proportions along witha weight vector From the point of view of the algorithm any matrix of positive numbers andany non-negative weight vector will be treated in the same way

                12 Regularization Paths for GLMs via Coordinate Descent

                5 Timings

                In this section we compare the run times of the coordinate-wise algorithm to some competingalgorithms These use the lasso penalty (α = 1) in both the regression and logistic regressionsettings All timings were carried out on an Intel Xeon 280GH processor

                We do not perform comparisons on the elastic net versions of the penalties since there is notmuch software available for elastic net Comparisons of our glmnet code with the R packageelasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso sinceelasticnet (Zou and Hastie 2004) is built on the lars package

                51 Regression with the lasso

                We generated Gaussian data withN observations and p predictors with each pair of predictorsXj Xjprime having the same population correlation ρ We tried a number of combinations of Nand p with ρ varying from zero to 095 The outcome values were generated by

                Y =psumj=1

                Xjβj + k middot Z (31)

                where βj = (minus1)j exp(minus2(j minus 1)20) Z sim N(0 1) and k is chosen so that the signal-to-noiseratio is 30 The coefficients are constructed to have alternating signs and to be exponentiallydecreasing

                Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars proce-dure (Efron et al 2004) All algorithms are implemented as R functions The coordinate-wisealgorithm does all of its numerical work in Fortran while lars (Hastie and Efron 2007) doesmuch of its work in R calling Fortran routines for some matrix operations However com-parisons in Friedman et al (2007) showed that lars was actually faster than a version codedentirely in Fortran Comparisons between different programs are always tricky in particularthe lars procedure computes the entire path of solutions while the coordinate-wise proceduresolves the problem for a set of pre-defined points along the solution path In the orthogonalcase lars takes min(N p) steps hence to make things roughly comparable we called thelatter two algorithms to solve a total of min(N p) problems along the path Table 1 showstimings in seconds averaged over three runs We see that glmnet is considerably faster thanlars the covariance-updating version of the algorithm is a little faster than the naive versionwhen N gt p and a little slower when p gt N We had expected that high correlation betweenthe features would increase the run time of glmnet but this does not seem to be the case

                52 Lasso-logistic regression

                We used the same simulation setup as above except that we took the continuous outcome ydefined p = 1(1 + exp(minusy)) and used this to generate a two-class outcome z with Pr(z =1) = p Pr(z = 0) = 1 minus p We compared the speed of glmnet to the interior point methodl1logreg (Koh et al 2007ba) Bayesian binary regression (BBR Madigan and Lewis 2007Genkin et al 2007) and the lasso penalized logistic program LPL supplied by Ken Lange (seeWu and Lange 2008) The latter two methods also use a coordinate descent approach

                The BBR software automatically performs ten-fold cross-validation when given a set of λvalues Hence we report the total time for ten-fold cross-validation for all methods using the

                Journal of Statistical Software 13

                Linear regression ndash Dense features

                Correlation0 01 02 05 09 095

                N = 1000 p = 100glmnet (type = naive) 005 006 006 009 008 007glmnet (type = cov) 002 002 002 002 002 002lars 011 011 011 011 011 011

                N = 5000 p = 100glmnet (type = naive) 024 025 026 034 032 031glmnet (type = cov) 005 005 005 005 005 005lars 029 029 029 030 029 029

                N = 100 p = 1000glmnet (type = naive) 004 005 004 005 004 003glmnet (type = cov) 007 008 007 008 004 003lars 073 072 068 071 071 067

                N = 100 p = 5000glmnet (type = naive) 020 018 021 023 021 014glmnet (type = cov) 046 042 051 048 025 010lars 373 353 359 347 390 352

                N = 100 p = 20000glmnet (type = naive) 100 099 106 129 117 097glmnet (type = cov) 186 226 234 259 124 079lars 1830 1790 1690 1803 1791 1639

                N = 100 p = 50000glmnet (type = naive) 266 246 284 353 339 243glmnet (type = cov) 550 492 613 735 452 253lars 5868 6400 6479 5820 6639 7979

                Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

                same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

                14 Regularization Paths for GLMs via Coordinate Descent

                Logistic regression ndash Dense features

                Correlation0 01 02 05 09 095

                N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

                N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

                N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

                N = 100 p = 1000glmnet 106 107 109 145 172 137l1logreg 2599 2640 2567 2649 2434 2016BBR 7019 7119 7840 10377 14905 11387LPL 1102 1087 1076 1634 4184 7050

                N = 100 p = 5000glmnet 524 443 512 705 787 605l1logreg 16502 16190 16325 16650 15191 13528

                N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

                Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

                53 Real data

                Table 4 shows some timing results for four different datasets

                Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

                Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

                InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

                Journal of Statistical Software 15

                Logistic regression ndash Sparse features

                Correlation0 01 02 05 09 095

                N = 1000 p = 100glmnet 077 074 072 073 084 088l1logreg 519 521 514 540 614 626BBR 201 195 198 206 273 288

                N = 100 p = 1000glmnet 181 173 155 170 163 155l1logreg 767 772 764 904 981 940BBR 466 458 468 515 578 553

                N = 10 000 p = 100glmnet 321 302 295 325 458 508l1logreg 4587 4663 4433 4399 4560 4316BBR 1180 1164 1158 1330 1246 1183

                N = 100 p = 10 000glmnet 1018 1035 993 1004 902 891l1logreg 13027 12488 12418 12984 13721 15954BBR 4572 4750 4746 4849 5629 6021

                Table 3 Timings (seconds) for logistic model with lasso penalty and sparse features (95zero) Total time for ten-fold cross-validation over a grid of 100 λ values

                NewsGroup (Lang 1995) document classification problem We used the training setcultured from these data by Koh et al (2007a) The response is binary and indicatesa subclass of topics the predictors are binary and indicate the presence of particulartri-gram sequences The predictor matrix has 005 nonzero values

                All four datasets are available online with this publication as saved R data objects (the lattertwo in sparse format using the Matrix package Bates and Maechler 2009)

                For the Leukemia and InternetAd datasets the BBR program used fewer than 100 λ valuesso we estimated the total time by scaling up the time for smaller number of values TheInternetAd and NewsGroup datasets are both sparse 1 nonzero values for the former005 for the latter Again glmnet is considerably faster than the competing methods

                54 Other comparisons

                When making comparisons one invariably leaves out someones favorite method We leftout our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie2007a) since it does not scale well to the size problems we consider here Two referees of

                16 Regularization Paths for GLMs via Coordinate Descent

                Name Type N p glmnet l1logreg BBRBMR

                DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

                SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

                Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

                MacBook Pro HP Linux server

                glmnet 034 013penalized 1031OWL-QN 31435

                Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

                an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

                OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                6 Selecting the tuning parameters

                The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                Journal of Statistical Software 17

                minus6 minus5 minus4 minus3 minus2 minus1 0

                2426

                2830

                32

                log(Lambda)

                Mea

                n S

                quar

                ed E

                rror

                99 99 97 95 93 75 54 21 12 5 2 1

                Gaussian Family

                minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                08

                09

                10

                11

                12

                13

                14

                log(Lambda)

                Dev

                ianc

                e

                100 98 97 88 74 55 30 9 7 3 2

                Binomial Family

                minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                020

                030

                040

                050

                log(Lambda)

                Mis

                clas

                sific

                atio

                n E

                rror

                100 98 97 88 74 55 30 9 7 3 2

                Binomial Family

                Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

                18 Regularization Paths for GLMs via Coordinate Descent

                Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                7 Discussion

                Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                Acknowledgments

                We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                References

                Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                Journal of Statistical Software 19

Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968. Fellow-Lawrence Carin and Senior Member-Mario A. T. Figueiredo.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software, URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Bühlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. PhD thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


                A Proof of Theorem 1

We have
$$
c_j = \arg\min_t \sum_{\ell=1}^K \left[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha\,|\beta_{j\ell} - t| \right]. \qquad (32)
$$
Suppose $\alpha \in (0, 1)$. Differentiating w.r.t. $t$ (using a sub-gradient representation), we have
$$
\sum_{\ell=1}^K \left[ -(1-\alpha)(\beta_{j\ell} - t) - \alpha s_{j\ell} \right] = 0, \qquad (33)
$$
where $s_{j\ell} = \operatorname{sign}(\beta_{j\ell} - t)$ if $\beta_{j\ell} \neq t$, and $s_{j\ell} \in [-1, 1]$ otherwise. This gives
$$
t = \bar{\beta}_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha} \sum_{\ell=1}^K s_{j\ell}. \qquad (34)
$$
It follows that $t$ cannot be larger than $\beta^M_j$, since then the second term above would be negative, and this would imply that $t$ is less than $\bar{\beta}_j$. Similarly $t$ cannot be less than $\bar{\beta}_j$, since then the second term above would have to be negative, implying that $t$ is larger than $\beta^M_j$.

                Affiliation

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software, http://www.jstatsoft.org/, published by the American Statistical Association, http://www.amstat.org/.

Volume 33, Issue 1, January 2010. Submitted: 2009-04-22; Accepted: 2009-12-15.



outer loop: Decrement λ.

middle loop: Update the quadratic approximation ℓQ using the current parameters (β0, β).

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (18).
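To make the loop structure concrete, here is a minimal R sketch of the same scheme for elastic-net penalized logistic regression. It is an illustration only, not the Fortran implementation inside glmnet: the function names (soft_threshold, lognet_path), the fixed iteration counts in place of convergence checks, and the assumption of a standardized predictor matrix are ours.

# Illustrative sketch of the nested loops for penalized logistic regression.
# Not the glmnet code: no active sets, no covariance updates, no convergence tests.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lognet_path <- function(x, y, lambdas, alpha = 1, n_mid = 10, n_inner = 50) {
  n <- nrow(x); p <- ncol(x)
  b0 <- 0; b <- rep(0, p)
  path <- matrix(0, p, length(lambdas))
  for (k in seq_along(lambdas)) {                    # outer loop: decrement lambda
    lam <- lambdas[k]
    for (m in seq_len(n_mid)) {                      # middle loop: quadratic approximation
      eta  <- b0 + drop(x %*% b)
      prob <- 1 / (1 + exp(-eta))
      w    <- pmax(prob * (1 - prob), 1e-5)          # guard against weights collapsing to 0
      z    <- eta + (y - prob) / w                   # working response
      for (it in seq_len(n_inner)) {                 # inner loop: coordinate descent on the
        b0 <- sum(w * (z - drop(x %*% b))) / sum(w)  # penalized weighted least squares
        for (j in seq_len(p)) {
          r_j <- z - b0 - drop(x %*% b) + x[, j] * b[j]   # partial residual excluding j
          num <- sum(w * x[, j] * r_j) / n
          den <- sum(w * x[, j]^2) / n + lam * (1 - alpha)
          b[j] <- soft_threshold(num, lam * alpha) / den
        }
      }
    }
    path[, k] <- b                                   # warm start carries over to the next lambda
  }
  path
}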

There are several important details in the implementation of this algorithm.

When p ≫ N, one cannot run λ all the way to zero, because the saturated logistic regression fit is undefined (parameters wander off to ±∞ in order to achieve probabilities of 0 or 1). Hence the default λ sequence runs down to λmin = ελmax > 0.

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0 or 1. When a probability is within ε = 10⁻⁵ of 1, we set it to 1, and set the weights to ε. 0 is treated similarly.

Our code has an option to approximate the Hessian terms by an exact upper bound. This is obtained by setting the wi in (17) all equal to 0.25 (Krishnapuram and Hartemink 2005).

We allow the response data to be supplied in the form of a two-column matrix of counts, sometimes referred to as grouped data. We discuss this in more detail in Section 4.2.

The Newton algorithm is not guaranteed to converge without step-size optimization (Lee et al. 2006). Our code does not implement any checks for divergence; this would slow it down, and when used as recommended we do not feel it is necessary. We have a closed form expression for the starting solutions, and each subsequent solution is warm-started from the previous close-by solution, which generally makes the quadratic approximations very accurate. We have not encountered any divergence problems so far.

                  4 Regularized multinomial regression

When the categorical response variable G has K > 2 levels, the linear logistic regression model can be generalized to a multi-logit model. The traditional approach is to extend (12) to K − 1 logits

$$
\log \frac{\Pr(G = \ell \mid x)}{\Pr(G = K \mid x)} = \beta_{0\ell} + x^\top \beta_\ell, \qquad \ell = 1, \ldots, K-1. \qquad (19)
$$

Here βℓ is a p-vector of coefficients. As in Zhu and Hastie (2004), here we choose a more symmetric approach. We model

$$
\Pr(G = \ell \mid x) = \frac{e^{\beta_{0\ell} + x^\top \beta_\ell}}{\sum_{k=1}^K e^{\beta_{0k} + x^\top \beta_k}}. \qquad (20)
$$

This parametrization is not estimable without constraints, because for any values for the parameters $\{\beta_{0\ell}, \beta_\ell\}_1^K$, $\{\beta_{0\ell} - c_0, \beta_\ell - c\}_1^K$ give identical probabilities (20). Regularization deals with this ambiguity in a natural way; see Section 4.1 below.
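As a quick illustration of this ambiguity, the following R snippet (ours, for illustration only) checks numerically that subtracting a common (c0, c) from every class's parameters leaves the probabilities (20) unchanged.

# Check that the symmetric multinomial probabilities (20) are invariant to a
# common shift (c0, c) of all class parameters; illustration only.
multinom_prob <- function(x, b0, B) {            # B is p x K, b0 has length K
  eta <- sweep(x %*% B, 2, b0, "+")              # N x K linear predictors
  exp(eta) / rowSums(exp(eta))
}
set.seed(1)
x  <- matrix(rnorm(5 * 3), 5, 3)
B  <- matrix(rnorm(3 * 4), 3, 4); b0 <- rnorm(4)
c0 <- 0.7; cc <- rnorm(3)
p1 <- multinom_prob(x, b0, B)
p2 <- multinom_prob(x, b0 - c0, B - cc)          # cc is recycled down each column of B
max(abs(p1 - p2))                                # effectively zero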


We fit the model (20) by regularized maximum (multinomial) likelihood. Using a similar notation as before, let $p_\ell(x_i) = \Pr(G = \ell \mid x_i)$, and let $g_i \in \{1, 2, \ldots, K\}$ be the $i$th response. We maximize the penalized log-likelihood

$$
\max_{\{\beta_{0\ell}, \beta_\ell\}_1^K \in \mathbb{R}^{K(p+1)}} \left[ \frac{1}{N} \sum_{i=1}^N \log p_{g_i}(x_i) - \lambda \sum_{\ell=1}^K P_\alpha(\beta_\ell) \right]. \qquad (21)
$$

Denote by $Y$ the $N \times K$ indicator response matrix, with elements $y_{i\ell} = I(g_i = \ell)$. Then we can write the log-likelihood part of (21) in the more explicit form

$$
\ell(\{\beta_{0\ell}, \beta_\ell\}_1^K) = \frac{1}{N} \sum_{i=1}^N \left[ \sum_{\ell=1}^K y_{i\ell}(\beta_{0\ell} + x_i^\top \beta_\ell) - \log\left( \sum_{\ell=1}^K e^{\beta_{0\ell} + x_i^\top \beta_\ell} \right) \right]. \qquad (22)
$$

The Newton algorithm for multinomial regression can be tedious, because of the vector nature of the response observations. Instead of weights $w_i$ as in (17), we get weight matrices, for example. However, in the spirit of coordinate descent, we can avoid these complexities. We perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22), allowing only $(\beta_{0\ell}, \beta_\ell)$ to vary for a single class at a time. It is not hard to show that this is

$$
\ell_{Q\ell}(\beta_{0\ell}, \beta_\ell) = -\frac{1}{2N} \sum_{i=1}^N w_{i\ell} (z_{i\ell} - \beta_{0\ell} - x_i^\top \beta_\ell)^2 + C(\{\beta_{0k}, \beta_k\}_1^K), \qquad (23)
$$

where, as before,

$$
z_{i\ell} = \beta_{0\ell} + x_i^\top \beta_\ell + \frac{y_{i\ell} - p_\ell(x_i)}{p_\ell(x_i)(1 - p_\ell(x_i))}, \qquad (24)
$$
$$
w_{i\ell} = p_\ell(x_i)(1 - p_\ell(x_i)). \qquad (25)
$$

Our approach is similar to the two-class case, except now we have to cycle over the classes as well in the outer loop. For each value of λ, we create an outer loop which cycles over ℓ and computes the partial quadratic approximation $\ell_{Q\ell}$ about the current parameters $(\beta_0, \beta)$. Then we use coordinate descent to solve the penalized weighted least-squares problem

$$
\min_{(\beta_{0\ell}, \beta_\ell) \in \mathbb{R}^{p+1}} \; \left\{ -\ell_{Q\ell}(\beta_{0\ell}, \beta_\ell) + \lambda P_\alpha(\beta_\ell) \right\}. \qquad (26)
$$

This amounts to the sequence of nested loops:

outer loop: Decrement λ.

middle loop (outer): Cycle over ℓ ∈ {1, 2, ..., K, 1, 2, ...}.

middle loop (inner): Update the quadratic approximation $\ell_{Q\ell}$ using the current parameters $\{\beta_{0k}, \beta_k\}_1^K$.

inner loop: Run the coordinate descent algorithm on the penalized weighted-least-squares problem (26).
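A minimal R sketch of the middle-loop computation for a single class ℓ is given below. It is illustrative only (the function name and the probability clamping threshold are ours); it simply forms the working response (24) and weights (25) from the current parameters, after which the inner loop runs the same weighted coordinate descent as in the two-class sketch above.

# Per-class quadratic approximation for the middle loop: given current (b0, B),
# return the working response z_l and weights w_l of (24)-(25) for class l.
partial_newton_pieces <- function(x, Y, b0, B, l, eps = 1e-5) {
  eta  <- sweep(x %*% B, 2, b0, "+")             # N x K linear predictors
  prob <- exp(eta) / rowSums(exp(eta))           # symmetric multi-logit probabilities (20)
  p_l  <- pmin(pmax(prob[, l], eps), 1 - eps)    # clamp, analogous to the two-class case
  w_l  <- p_l * (1 - p_l)                        # weights (25)
  z_l  <- eta[, l] + (Y[, l] - p_l) / w_l        # working response (24)
  list(z = z_l, w = w_l)
}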


4.1 Regularization and parameter ambiguity

As was pointed out earlier, if $\{\beta_{0\ell}, \beta_\ell\}_1^K$ characterizes a fitted model for (20), then $\{\beta_{0\ell} - c_0, \beta_\ell - c\}_1^K$ gives an identical fit ($c$ is a $p$-vector). Although this means that the log-likelihood part of (21) is insensitive to $(c_0, c)$, the penalty is not. In particular, we can always improve an estimate $\{\beta_{0\ell}, \beta_\ell\}_1^K$ (w.r.t. (21)) by solving

$$
\min_{c \in \mathbb{R}^p} \sum_{\ell=1}^K P_\alpha(\beta_\ell - c). \qquad (27)
$$

This can be done separately for each coordinate, hence

$$
c_j = \arg\min_t \sum_{\ell=1}^K \left[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha\,|\beta_{j\ell} - t| \right]. \qquad (28)
$$

Theorem 1. Consider problem (28) for values $\alpha \in [0, 1]$. Let $\bar{\beta}_j$ be the mean of the $\beta_{j\ell}$, and $\beta^M_j$ a median of the $\beta_{j\ell}$ (and for simplicity assume $\bar{\beta}_j \le \beta^M_j$). Then we have
$$
c_j \in [\bar{\beta}_j, \beta^M_j], \qquad (29)
$$
with the left endpoint achieved if α = 0 and the right if α = 1.

The two endpoints are obvious. The proof of Theorem 1 is given in Appendix A. A consequence of the theorem is that a very simple search algorithm can be used to solve (28). The objective is piecewise quadratic, with knots defined by the βjℓ. We need only evaluate solutions in the intervals including the mean and median, and those in between. We recenter the parameters in each index set j, after each inner middle loop step, using the solution cj for each j.
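A hedged R sketch of such a search is given below: it evaluates the piecewise-quadratic objective (28) at the knots lying between the mean and median, and at the stationary point (34) of each sub-interval, and returns the best candidate. The function name and the brute-force candidate enumeration are ours; the internal implementation may differ.

# Illustrative solver for (28): the objective is piecewise quadratic in t with
# knots at the beta_jl, and Theorem 1 confines the solution to [mean, median].
center_elnet <- function(beta, alpha) {
  obj <- function(t) sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))
  if (alpha == 0) return(mean(beta))
  if (alpha == 1) return(median(beta))
  lo <- min(mean(beta), median(beta)); hi <- max(mean(beta), median(beta))
  knots <- sort(unique(beta))
  cand  <- c(lo, hi, knots[knots >= lo & knots <= hi])
  edges <- sort(unique(cand))
  for (i in seq_len(length(edges) - 1)) {        # stationary point of each interval,
    mid <- (edges[i] + edges[i + 1]) / 2         # where the signs s_l are fixed (see (34))
    s <- sign(beta - mid)
    t_star <- mean(beta) + (alpha / (1 - alpha)) * mean(s)
    if (t_star >= edges[i] && t_star <= edges[i + 1]) cand <- c(cand, t_star)
  }
  cand[which.min(vapply(cand, obj, numeric(1)))]
}

For a p × K coefficient matrix B, the recentering step could then be written as cj <- apply(B, 1, center_elnet, alpha = 0.5), followed by B - cj (the vector cj is recycled down the columns).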

Not all the parameters in our model are regularized. The intercepts β0ℓ are not, and with our penalty modifiers γj (Section 2.6) others need not be as well. For these parameters we use mean centering.

4.2 Grouped and matrix responses

As in the two class case, the data can be presented in the form of a $N \times K$ matrix $m_{i\ell}$ of non-negative numbers. For example, if the data are grouped: at each $x_i$ we have a number of multinomial samples, with $m_{i\ell}$ falling into category $\ell$. In this case we divide each row by the row-sum $m_i = \sum_\ell m_{i\ell}$, and produce our response matrix $y_{i\ell} = m_{i\ell}/m_i$. $m_i$ becomes an observation weight. Our penalized maximum likelihood algorithm changes in a trivial way. The working response (24) is defined exactly the same way (using $y_{i\ell}$ just defined). The weights in (25) get augmented with the observation weight $m_i$:

$$
w_{i\ell} = m_i\, p_\ell(x_i)(1 - p_\ell(x_i)). \qquad (30)
$$

Equivalently, the data can be presented directly as a matrix of class proportions, along with a weight vector. From the point of view of the algorithm, any matrix of positive numbers and any non-negative weight vector will be treated in the same way.
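In code, the conversion and the augmented weights (30) amount to the following sketch (function names ours):

# Grouped-data conversion: counts m_il -> proportions y_il and observation weights m_i.
counts_to_response <- function(M) {              # M: N x K matrix of counts
  m <- rowSums(M)                                # multinomial sample sizes
  list(y = M / m, w_obs = m)                     # y_il = m_il / m_i
}
# Augmented weight (30) for class l, given fitted probabilities p_l:
augmented_weights <- function(w_obs, p_l) w_obs * p_l * (1 - p_l)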


                  5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competing algorithms. These use the lasso penalty (α = 1) in both the regression and logistic regression settings. All timings were carried out on an Intel Xeon 2.80 GHz processor.

We do not perform comparisons on the elastic net versions of the penalties, since there is not much software available for elastic net. Comparisons of our glmnet code with the R package elasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso, since elasticnet (Zou and Hastie 2004) is built on the lars package.

5.1 Regression with the lasso

We generated Gaussian data with $N$ observations and $p$ predictors, with each pair of predictors $X_j, X_{j'}$ having the same population correlation $\rho$. We tried a number of combinations of $N$ and $p$, with $\rho$ varying from zero to 0.95. The outcome values were generated by

$$
Y = \sum_{j=1}^p X_j \beta_j + k \cdot Z, \qquad (31)
$$

where $\beta_j = (-1)^j \exp(-2(j-1)/20)$, $Z \sim N(0, 1)$, and $k$ is chosen so that the signal-to-noise ratio is 3.0. The coefficients are constructed to have alternating signs and to be exponentially decreasing.
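A sketch of this simulation in R is shown below. It assumes the signal-to-noise ratio is measured on the standard-deviation scale, and builds the equicorrelated predictors from a shared latent Gaussian factor; both choices are our reading of the setup, not code from the paper.

# Simulate the Section 5.1 design: equicorrelated Gaussian predictors and
# response (31) with sign-alternating, exponentially decaying coefficients.
sim_gaussian <- function(N, p, rho, snr = 3) {
  u <- rnorm(N)                                            # shared latent factor
  X <- sqrt(rho) * matrix(u, N, p) + sqrt(1 - rho) * matrix(rnorm(N * p), N, p)
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)
  signal <- drop(X %*% beta)
  k <- sd(signal) / snr                                    # noise scale for the target SNR
  list(x = X, y = signal + k * rnorm(N), beta = beta)
}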

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars procedure (Efron et al. 2004). All algorithms are implemented as R functions. The coordinate-wise algorithm does all of its numerical work in Fortran, while lars (Hastie and Efron 2007) does much of its work in R, calling Fortran routines for some matrix operations. However, comparisons in Friedman et al. (2007) showed that lars was actually faster than a version coded entirely in Fortran. Comparisons between different programs are always tricky: in particular the lars procedure computes the entire path of solutions, while the coordinate-wise procedure solves the problem for a set of pre-defined points along the solution path. In the orthogonal case lars takes min(N, p) steps; hence to make things roughly comparable, we called the latter two algorithms to solve a total of min(N, p) problems along the path. Table 1 shows timings in seconds, averaged over three runs. We see that glmnet is considerably faster than lars; the covariance-updating version of the algorithm is a little faster than the naive version when N > p, and a little slower when p > N. We had expected that high correlation between the features would increase the run time of glmnet, but this does not seem to be the case.

5.2 Lasso-logistic regression

We used the same simulation setup as above, except that we took the continuous outcome y, defined p = 1/(1 + exp(−y)), and used this to generate a two-class outcome z with Pr(z = 1) = p, Pr(z = 0) = 1 − p. We compared the speed of glmnet to the interior point method l1logreg (Koh et al. 2007b,a), Bayesian binary regression (BBR; Madigan and Lewis 2007; Genkin et al. 2007), and the lasso penalized logistic program LPL supplied by Ken Lange (see Wu and Lange 2008). The latter two methods also use a coordinate descent approach.
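Continuing the simulation sketch above (again illustrative; it reuses the sim_gaussian function defined earlier):

# Two-class outcomes for Section 5.2: pass the latent Gaussian response through
# the logistic link and draw Bernoulli labels.
sim_binomial <- function(N, p, rho, snr = 3) {
  dat  <- sim_gaussian(N, p, rho, snr)
  prob <- 1 / (1 + exp(-dat$y))
  dat$z <- rbinom(N, size = 1, prob = prob)
  dat
}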

The BBR software automatically performs ten-fold cross-validation when given a set of λ values. Hence we report the total time for ten-fold cross-validation for all methods, using the


Linear regression – Dense features

                              Correlation
                           0      0.1    0.2    0.5    0.9    0.95

N = 1000, p = 100
glmnet (type = naive)      0.05   0.06   0.06   0.09   0.08   0.07
glmnet (type = cov)        0.02   0.02   0.02   0.02   0.02   0.02
lars                       0.11   0.11   0.11   0.11   0.11   0.11

N = 5000, p = 100
glmnet (type = naive)      0.24   0.25   0.26   0.34   0.32   0.31
glmnet (type = cov)        0.05   0.05   0.05   0.05   0.05   0.05
lars                       0.29   0.29   0.29   0.30   0.29   0.29

N = 100, p = 1000
glmnet (type = naive)      0.04   0.05   0.04   0.05   0.04   0.03
glmnet (type = cov)        0.07   0.08   0.07   0.08   0.04   0.03
lars                       0.73   0.72   0.68   0.71   0.71   0.67

N = 100, p = 5000
glmnet (type = naive)      0.20   0.18   0.21   0.23   0.21   0.14
glmnet (type = cov)        0.46   0.42   0.51   0.48   0.25   0.10
lars                       3.73   3.53   3.59   3.47   3.90   3.52

N = 100, p = 20000
glmnet (type = naive)      1.00   0.99   1.06   1.29   1.17   0.97
glmnet (type = cov)        1.86   2.26   2.34   2.59   1.24   0.79
lars                      18.30  17.90  16.90  18.03  17.91  16.39

N = 100, p = 50000
glmnet (type = naive)      2.66   2.46   2.84   3.53   3.39   2.43
glmnet (type = cov)        5.50   4.92   6.13   7.35   4.52   2.53
lars                      58.68  64.00  64.79  58.20  66.39  79.79

Table 1: Timings (in seconds) for glmnet and lars algorithms for linear regression with lasso penalty. The first line is glmnet using naive updating, while the second uses covariance updating. Total time for 100 λ values, averaged over 3 runs.

same 100 λ values for all. Table 2 shows the results; in some cases, we omitted a method when it was seen to be very slow at smaller values for N or p. Again we see that glmnet is the clear winner: it slows down a little under high correlation. The computation seems to be roughly linear in N, but grows faster than linear in p. Table 3 shows some results when the feature matrix is sparse: we randomly set 95% of the feature values to zero. Again, the glmnet procedure is significantly faster than l1logreg.


Logistic regression – Dense features

               Correlation
            0        0.1      0.2      0.5      0.9      0.95

N = 1000, p = 100
glmnet      1.65     1.81     2.31     3.87     5.99     8.48
l1logreg    31.475   31.86    34.35    32.21    31.85    31.81
BBR         40.70    47.57    54.18    70.06    106.72   121.41
LPL         24.68    31.64    47.99    170.77   741.00   1448.25

N = 5000, p = 100
glmnet      7.89     8.48     9.01     13.39    26.68    26.36
l1logreg    239.88   232.00   229.62   229.49   221.9    223.09

N = 100,000, p = 100
glmnet      78.56    178.45   205.94   274.33   552.48   638.50

N = 100, p = 1000
glmnet      1.06     1.07     1.09     1.45     1.72     1.37
l1logreg    25.99    26.40    25.67    26.49    24.34    20.16
BBR         70.19    71.19    78.40    103.77   149.05   113.87
LPL         11.02    10.87    10.76    16.34    41.84    70.50

N = 100, p = 5000
glmnet      5.24     4.43     5.12     7.05     7.87     6.05
l1logreg    165.02   161.90   163.25   166.50   151.91   135.28

N = 100, p = 100,000
glmnet      137.27   139.40   146.55   197.98   219.65   201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for tenfold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets.

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% nonzero values in the predictor matrix.


Logistic regression – Sparse features

              Correlation
            0       0.1     0.2     0.5     0.9     0.95

N = 1000, p = 100
glmnet      0.77    0.74    0.72    0.73    0.84    0.88
l1logreg    5.19    5.21    5.14    5.40    6.14    6.26
BBR         2.01    1.95    1.98    2.06    2.73    2.88

N = 100, p = 1000
glmnet      1.81    1.73    1.55    1.70    1.63    1.55
l1logreg    7.67    7.72    7.64    9.04    9.81    9.40
BBR         4.66    4.58    4.68    5.15    5.78    5.53

N = 10,000, p = 100
glmnet      3.21    3.02    2.95    3.25    4.58    5.08
l1logreg    45.87   46.63   44.33   43.99   45.60   43.16
BBR         11.80   11.64   11.58   13.30   12.46   11.83

N = 100, p = 10,000
glmnet      10.18   10.35   9.93    10.04   9.02    8.91
l1logreg    130.27  124.88  124.18  129.84  137.21  159.54
BBR         45.72   47.50   47.46   48.49   56.29   60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for a smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here. Two referees of


Name          Type       N        p         glmnet     l1logreg   BBR/BMR

Dense
Cancer        14 class   144      16,063    2.5 mins              2.1 hrs
Leukemia      2 class    72       3571      2.50       55.0       450

Sparse
InternetAd    2 class    2359     1430      5.0        20.9       34.7
NewsGroup     2 class    11,314   777,811   2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax.

             MacBook Pro    HP Linux server

glmnet       0.34           0.13
penalized    10.31
OWL-QN                      314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for ℓ1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

                  6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.


[Figure 2 appears here: cross-validation curves plotted against log(Lambda). Top panel, "Gaussian Family": mean squared error. Bottom panels, "Binomial Family": deviance (left) and misclassification error (right). Each panel is annotated along the top with the model sizes, and two vertical lines mark the minimum-error λ and the one-standard-error λ.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response, the second row logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line marks the largest value of lambda such that the error is within one standard-error of the minimum: the so-called "one-standard-error" rule. The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way. Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is smoother. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
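In the glmnet package this workflow is exposed through cv.glmnet; a short example along these lines, using the simulated binomial data sketched in Section 5.2 (the simulation helper is ours), is:

# Cross-validation in the spirit of Figure 2: lambda.min is the minimizer of the
# cross-validated error, lambda.1se the "one-standard-error" choice.
library(glmnet)
set.seed(1)
dat   <- sim_binomial(1000, 100, rho = 0.2)
cvfit <- cv.glmnet(dat$x, dat$z, family = "binomial", type.measure = "deviance")
plot(cvfit)                      # error curve with one-standard-deviation band
coef(cvfit, s = "lambda.min")    # coefficients at the minimum-error lambda
coef(cvfit, s = "lambda.1se")    # coefficients at the more parsimonious choice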

                  7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with ℓ1 or ℓ2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed both for large N and p is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
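For completeness, a minimal usage sketch showing the sparse-matrix input path (it reuses the sim_gaussian helper from the Section 5.1 sketch, and the sparsification step here is purely illustrative):

# Basic glmnet usage with a sparse predictor matrix built via the Matrix package.
library(glmnet)
library(Matrix)
set.seed(1)
dat <- sim_gaussian(100, 5000, rho = 0.2)
x_sparse <- Matrix(dat$x * (matrix(runif(100 * 5000), 100, 5000) < 0.05), sparse = TRUE)
fit <- glmnet(x_sparse, dat$y, family = "gaussian", alpha = 0.5)   # elastic net path
plot(fit, xvar = "lambda")                                         # coefficient profiles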

                  Acknowledgments

                  We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                  References

                  Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                  Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                  Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                  Journal of Statistical Software 19

                  Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                  Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                  Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                  Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                  Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                  Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                  Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                  Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                  Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                  Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                  Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                  Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                  Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                  Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                  Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                  Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                  20 Regularization Paths for GLMs via Coordinate Descent

                  Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                  Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                  Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                  Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                  Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                  Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                  Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                  Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                  Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                  Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                  Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                  Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                  Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                  Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                  Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                  Journal of Statistical Software 21

                  Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                  R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                  Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                  Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                  Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                  Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                  Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                  Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                  Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                  Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                  Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                  Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                  Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                  Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                  Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                  22 Regularization Paths for GLMs via Coordinate Descent

                  A Proof of Theorem 1

                  We have

                  cj = arg mint

                  Ksum`=1

                  [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                  ] (32)

                  Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                  Ksum`=1

                  [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                  where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                  t = βj +1K

                  α

                  1minus α

                  Ksum`=1

                  sj` (34)

                  It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                  Affiliation

                  Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                  Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                  Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                  • Introduction
                  • Algorithms for the lasso ridge regression and elastic net
                    • Naive updates
                    • Covariance updates
                    • Sparse updates
                    • Weighted updates
                    • Pathwise coordinate descent
                    • Other details
                      • Regularized logistic regression
                      • Regularized multinomial regression
                        • Regularization and parameter ambiguity
                        • Grouped and matrix responses
                          • Timings
                            • Regression with the lasso
                            • Lasso-logistic regression
                            • Real data
                            • Other comparisons
                              • Selecting the tuning parameters
                              • Discussion
                              • Proof of Theorem 1

                    10 Regularization Paths for GLMs via Coordinate Descent

                    We fit the model (20) by regularized maximum (multinomial) likelihood Using a similarnotation as before let p`(xi) = Pr(G = `|xi) and let gi isin 1 2 K be the ith responseWe maximize the penalized log-likelihood

                    maxβ0`β`K1 isinRK(p+1)

                    [1N

                    Nsumi=1

                    log pgi(xi)minus λKsum`=1

                    Pα(β`)

                    ] (21)

                    Denote by Y the N timesK indicator response matrix with elements yi` = I(gi = `) Then wecan write the log-likelihood part of (21) in the more explicit form

                    `(β0` β`K1 ) =1N

                    Nsumi=1

                    [Ksum`=1

                    yi`(β0` + xgti β`)minus log

                    (Ksum`=1

                    eβ0`+xgti β`

                    )] (22)

                    The Newton algorithm for multinomial regression can be tedious because of the vector natureof the response observations Instead of weights wi as in (17) we get weight matrices forexample However in the spirit of coordinate descent we can avoid these complexitiesWe perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22) allowing only (β0` β`) to vary for a single class at a time It is not hard toshow that this is

                    `Q`(β0` β`) = minus 12N

                    Nsumi=1

                    wi`(zi` minus β0` minus xgti β`)2 + C(β0k βkK1 ) (23)

                    where as before

                    zi` = β0` + xgti β` +yi` minus p`(xi)

                    p`(xi)(1minus p`(xi)) (24)

                    wi` = p`(xi)(1minus p`(xi)) (25)

                    Our approach is similar to the two-class case except now we have to cycle over the classesas well in the outer loop For each value of λ we create an outer loop which cycles over `and computes the partial quadratic approximation `Q` about the current parameters (β0 β)Then we use coordinate descent to solve the penalized weighted least-squares problem

                    min(β0`β`)isinRp+1

                    minus`Q`(β0` β`) + λPα(β`) (26)

                    This amounts to the sequence of nested loops

                    outer loop Decrement λ

                    middle loop (outer) Cycle over ` isin 1 2 K 1 2

                    middle loop (inner) Update the quadratic approximation `Q` using the current parame-ters β0k βkK1

                    inner loop Run the co-ordinate descent algorithm on the penalized weighted-least-squaresproblem (26)

                    Journal of Statistical Software 11

                    41 Regularization and parameter ambiguity

                    As was pointed out earlier if β0` β`K1 characterizes a fitted model for (20) then β0` minusc0 β`minuscK1 gives an identical fit (c is a p-vector) Although this means that the log-likelihoodpart of (21) is insensitive to (c0 c) the penalty is not In particular we can always improvean estimate β0` β`K1 (wrt (21)) by solving

                    mincisinRp

                    Ksum`=1

                    Pα(β` minus c) (27)

                    This can be done separately for each coordinate hence

                    cj = arg mint

                    Ksum`=1

                    [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                    ] (28)

                    Theorem 1 Consider problem (28) for values α isin [0 1] Let βj be the mean of the βj` andβMj a median of the βj` (and for simplicity assume βj le βMj Then we have

                    cj isin [βj βMj ] (29)

                    with the left endpoint achieved if α = 0 and the right if α = 1

                    The two endpoints are obvious The proof of Theorem 1 is given in Appendix A A con-sequence of the theorem is that a very simple search algorithm can be used to solve (28)The objective is piecewise quadratic with knots defined by the βj` We need only evaluatesolutions in the intervals including the mean and median and those in between We recenterthe parameters in each index set j after each inner middle loop step using the the solutioncj for each j

                    Not all the parameters in our model are regularized The intercepts β0` are not and with ourpenalty modifiers γj (Section 26) others need not be as well For these parameters we usemean centering

                    42 Grouped and matrix responses

                    As in the two class case the data can be presented in the form of a N times K matrix mi` ofnon-negative numbers For example if the data are grouped at each xi we have a numberof multinomial samples with mi` falling into category ` In this case we divide each row bythe row-sum mi =

                    sum`mi` and produce our response matrix yi` = mi`mi mi becomes an

                    observation weight Our penalized maximum likelihood algorithm changes in a trivial wayThe working response (24) is defined exactly the same way (using yi` just defined) Theweights in (25) get augmented with the observation weight mi

                    wi` = mip`(xi)(1minus p`(xi)) (30)

                    Equivalently the data can be presented directly as a matrix of class proportions along witha weight vector From the point of view of the algorithm any matrix of positive numbers andany non-negative weight vector will be treated in the same way

                    12 Regularization Paths for GLMs via Coordinate Descent

                    5 Timings

                    In this section we compare the run times of the coordinate-wise algorithm to some competingalgorithms These use the lasso penalty (α = 1) in both the regression and logistic regressionsettings All timings were carried out on an Intel Xeon 280GH processor

                    We do not perform comparisons on the elastic net versions of the penalties since there is notmuch software available for elastic net Comparisons of our glmnet code with the R packageelasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso sinceelasticnet (Zou and Hastie 2004) is built on the lars package

                    51 Regression with the lasso

                    We generated Gaussian data withN observations and p predictors with each pair of predictorsXj Xjprime having the same population correlation ρ We tried a number of combinations of Nand p with ρ varying from zero to 095 The outcome values were generated by

                    Y =psumj=1

                    Xjβj + k middot Z (31)

                    where βj = (minus1)j exp(minus2(j minus 1)20) Z sim N(0 1) and k is chosen so that the signal-to-noiseratio is 30 The coefficients are constructed to have alternating signs and to be exponentiallydecreasing

                    Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars proce-dure (Efron et al 2004) All algorithms are implemented as R functions The coordinate-wisealgorithm does all of its numerical work in Fortran while lars (Hastie and Efron 2007) doesmuch of its work in R calling Fortran routines for some matrix operations However com-parisons in Friedman et al (2007) showed that lars was actually faster than a version codedentirely in Fortran Comparisons between different programs are always tricky in particularthe lars procedure computes the entire path of solutions while the coordinate-wise proceduresolves the problem for a set of pre-defined points along the solution path In the orthogonalcase lars takes min(N p) steps hence to make things roughly comparable we called thelatter two algorithms to solve a total of min(N p) problems along the path Table 1 showstimings in seconds averaged over three runs We see that glmnet is considerably faster thanlars the covariance-updating version of the algorithm is a little faster than the naive versionwhen N gt p and a little slower when p gt N We had expected that high correlation betweenthe features would increase the run time of glmnet but this does not seem to be the case

                    52 Lasso-logistic regression

                    We used the same simulation setup as above except that we took the continuous outcome ydefined p = 1(1 + exp(minusy)) and used this to generate a two-class outcome z with Pr(z =1) = p Pr(z = 0) = 1 minus p We compared the speed of glmnet to the interior point methodl1logreg (Koh et al 2007ba) Bayesian binary regression (BBR Madigan and Lewis 2007Genkin et al 2007) and the lasso penalized logistic program LPL supplied by Ken Lange (seeWu and Lange 2008) The latter two methods also use a coordinate descent approach

                    The BBR software automatically performs ten-fold cross-validation when given a set of λvalues Hence we report the total time for ten-fold cross-validation for all methods using the

                    Journal of Statistical Software 13

                    Linear regression ndash Dense features

                    Correlation0 01 02 05 09 095

                    N = 1000 p = 100glmnet (type = naive) 005 006 006 009 008 007glmnet (type = cov) 002 002 002 002 002 002lars 011 011 011 011 011 011

                    N = 5000 p = 100glmnet (type = naive) 024 025 026 034 032 031glmnet (type = cov) 005 005 005 005 005 005lars 029 029 029 030 029 029

                    N = 100 p = 1000glmnet (type = naive) 004 005 004 005 004 003glmnet (type = cov) 007 008 007 008 004 003lars 073 072 068 071 071 067

                    N = 100 p = 5000glmnet (type = naive) 020 018 021 023 021 014glmnet (type = cov) 046 042 051 048 025 010lars 373 353 359 347 390 352

                    N = 100 p = 20000glmnet (type = naive) 100 099 106 129 117 097glmnet (type = cov) 186 226 234 259 124 079lars 1830 1790 1690 1803 1791 1639

                    N = 100 p = 50000glmnet (type = naive) 266 246 284 353 339 243glmnet (type = cov) 550 492 613 735 452 253lars 5868 6400 6479 5820 6639 7979

                    Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

                    same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

                    14 Regularization Paths for GLMs via Coordinate Descent

                    Logistic regression ndash Dense features

                    Correlation0 01 02 05 09 095

                    N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

                    N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

                    N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

                    N = 100 p = 1000glmnet 106 107 109 145 172 137l1logreg 2599 2640 2567 2649 2434 2016BBR 7019 7119 7840 10377 14905 11387LPL 1102 1087 1076 1634 4184 7050

                    N = 100 p = 5000glmnet 524 443 512 705 787 605l1logreg 16502 16190 16325 16650 15191 13528

                    N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

                    Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

                    53 Real data

                    Table 4 shows some timing results for four different datasets

                    Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

                    Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

                    InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

                    Journal of Statistical Software 15

Logistic regression – Sparse features

                    Correlation
              0      0.1     0.2     0.5     0.9    0.95

N = 1000, p = 100
glmnet       0.77    0.74    0.72    0.73    0.84    0.88
l1logreg     5.19    5.21    5.14    5.40    6.14    6.26
BBR          2.01    1.95    1.98    2.06    2.73    2.88

N = 100, p = 1000
glmnet       1.81    1.73    1.55    1.70    1.63    1.55
l1logreg     7.67    7.72    7.64    9.04    9.81    9.40
BBR          4.66    4.58    4.68    5.15    5.78    5.53

N = 10,000, p = 100
glmnet       3.21    3.02    2.95    3.25    4.58    5.08
l1logreg    45.87   46.63   44.33   43.99   45.60   43.16
BBR         11.80   11.64   11.58   13.30   12.46   11.83

N = 100, p = 10,000
glmnet      10.18   10.35    9.93   10.04    9.02    8.91
l1logreg   130.27  124.88  124.18  129.84  137.21  159.54
BBR         45.72   47.50   47.46   48.49   56.29   60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).
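As a sketch of how such sparse objects are used (our own illustration; the toy matrix below merely stands in for one of the saved data objects, whose actual object names are not reproduced here), a sparse predictor matrix from the Matrix package can be passed to glmnet directly, without densifying it:

# Sketch: fitting glmnet on a sparse predictor matrix (hypothetical toy data).
library(Matrix)
library(glmnet)

set.seed(3)
N <- 2000; p <- 1500
nnz <- round(0.01 * N * p)                        # roughly 1% nonzero entries
x <- sparseMatrix(i = sample(N, nnz, replace = TRUE),
                  j = sample(p, nnz, replace = TRUE),
                  x = 1, dims = c(N, p))          # "dgCMatrix" sparse design
y <- rbinom(N, 1, 0.3)

fit <- glmnet(x, y, family = "binomial")          # x stays in sparse form throughout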

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size problems we consider here.


Name          Type        N         p      glmnet    l1logreg   BBR/BMR

Dense
Cancer        14 class    144    16,063    2.5 mins              2.1 hrs
Leukemia       2 class     72     3,571    2.50       55.0       4.50

Sparse
InternetAd     2 class  2,359     1,430    5.0        20.9       34.7
NewsGroup      2 class 11,314   777,811    2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λmin = 0.05λmax.

glmnet        0.34 (MacBook Pro)    0.13 (HP Linux server)
penalized    10.31
OWL-QN      314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

Two referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

                    6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.
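A minimal sketch of this hold-out approach in R (illustrative only, on simulated data, assuming the CRAN glmnet package):

# Hold-out validation sketch: fit the path on two thirds of the data, then
# score every lambda on the held-out third (illustrative simulation).
library(glmnet)

set.seed(4)
N <- 900; p <- 50
x <- matrix(rnorm(N * p), N, p)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(N)

holdout <- sample(N, size = N / 3)                  # set aside a third of the data
fit  <- glmnet(x[-holdout, ], y[-holdout])          # path fit on the remaining data
pred <- predict(fit, newx = x[holdout, ])           # one column of predictions per lambda
mse  <- colMeans((y[holdout] - pred)^2)             # held-out prediction error per lambda
lambda.best <- fit$lambda[which.min(mse)]           # lambda with the best performance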


[Figure 2 appears here: three panels of cross-validation curves plotted against log(Lambda). Panel titles: "Gaussian Family" (y-axis: Mean Squared Error) and "Binomial Family" (y-axes: Deviance and Misclassification Error); the numbers along the top of each panel give the model sizes.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response; the second row, logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line is the largest value of lambda such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule). The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way.

Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the deviance is a smoother measure. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so it errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter, and chosen on more subjective grounds.
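In the glmnet package this procedure is provided by the cv.glmnet function. The sketch below (an illustration on simulated data resembling the setup of Figure 2, assuming the CRAN glmnet package) shows the two λ choices marked by the vertical lines in the plots:

# Sketch of ten-fold cross-validation as in Figure 2 (illustrative simulation).
library(glmnet)

set.seed(5)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- drop(x[, 1:10] %*% rep(1, 10)) + rnorm(N)     # only 10 predictors matter

cvfit <- cv.glmnet(x, y, nfolds = 10)              # Gaussian family, mean-squared error
plot(cvfit)                                        # CV curve with one-standard-deviation band
cvfit$lambda.min                                   # lambda minimizing the CV error
cvfit$lambda.1se                                   # "one-standard-error" rule choice

# For a binary response, deviance or misclassification error can be used instead:
z <- rbinom(N, 1, 1 / (1 + exp(-drop(x[, 1:10] %*% rep(1, 10)))))
cvbin <- cv.glmnet(x, z, family = "binomial", type.measure = "class", nfolds = 10)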

                    7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with `1 or `2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed for both large N and p is quite remarkable.

An R-language package glmnet is available under the General Public License (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

                    Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

                    References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.


Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software. URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Bühlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. PhD thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


A Proof of Theorem 1

We have
$$
c_j = \arg\min_t \sum_{\ell=1}^{K}\left[\tfrac{1}{2}(1-\alpha)(\beta_{j\ell}-t)^2 + \alpha\,|\beta_{j\ell}-t|\right]. \tag{32}
$$
Suppose $\alpha \in (0,1)$. Differentiating with respect to $t$ (using a sub-gradient representation), we have
$$
\sum_{\ell=1}^{K}\left[-(1-\alpha)(\beta_{j\ell}-t) - \alpha s_{j\ell}\right] = 0, \tag{33}
$$
where $s_{j\ell} = \operatorname{sign}(\beta_{j\ell}-t)$ if $\beta_{j\ell} \neq t$ and $s_{j\ell} \in [-1,1]$ otherwise. This gives
$$
t = \bar{\beta}_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha}\sum_{\ell=1}^{K} s_{j\ell}. \tag{34}
$$
It follows that $t$ cannot be larger than $\beta^M_j$, since then the second term above would be negative, and this would imply that $t$ is less than $\bar{\beta}_j$. Similarly, $t$ cannot be less than $\bar{\beta}_j$, since then the second term above would have to be negative, implying that $t$ is larger than $\beta^M_j$.
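As a quick numerical check of Theorem 1 (a sketch of our own, not part of the original software), one can minimize (32) on a fine grid for a random set of βjℓ and confirm that the minimizer lies between their mean and median:

# Numerical check of Theorem 1: the minimizer of (32) lies between the mean
# and the median of the beta_{jl} (base R only; illustrative).
set.seed(6)
K <- 7
beta  <- rnorm(K)
alpha <- 0.4

objective <- function(t)
  sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))

tgrid <- seq(min(beta), max(beta), length.out = 10000)
c.hat <- tgrid[which.min(sapply(tgrid, objective))]

c(mean = mean(beta), solution = c.hat, median = median(beta))
# c.hat should fall in the interval between mean(beta) and median(beta),
# whichever way round they happen to be ordered for this draw.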

Affiliation:

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software          http://www.jstatsoft.org/
published by the American Statistical Association          http://www.amstat.org/

Volume 33, Issue 1          Submitted: 2009-04-22
January 2010                Accepted: 2009-12-15


                      N = 100 p = 20000glmnet (type = naive) 100 099 106 129 117 097glmnet (type = cov) 186 226 234 259 124 079lars 1830 1790 1690 1803 1791 1639

                      N = 100 p = 50000glmnet (type = naive) 266 246 284 353 339 243glmnet (type = cov) 550 492 613 735 452 253lars 5868 6400 6479 5820 6639 7979

                      Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

                      same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

                      14 Regularization Paths for GLMs via Coordinate Descent

                      Logistic regression ndash Dense features

                      Correlation0 01 02 05 09 095

                      N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

                      N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

                      N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

                      N = 100 p = 1000glmnet 106 107 109 145 172 137l1logreg 2599 2640 2567 2649 2434 2016BBR 7019 7119 7840 10377 14905 11387LPL 1102 1087 1076 1634 4184 7050

                      N = 100 p = 5000glmnet 524 443 512 705 787 605l1logreg 16502 16190 16325 16650 15191 13528

                      N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

                      Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

                      53 Real data

                      Table 4 shows some timing results for four different datasets

                      Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

                      Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

                      InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

                      Journal of Statistical Software 15

                      Logistic regression ndash Sparse features

                      Correlation0 01 02 05 09 095

                      N = 1000 p = 100glmnet 077 074 072 073 084 088l1logreg 519 521 514 540 614 626BBR 201 195 198 206 273 288

                      N = 100 p = 1000glmnet 181 173 155 170 163 155l1logreg 767 772 764 904 981 940BBR 466 458 468 515 578 553

                      N = 10 000 p = 100glmnet 321 302 295 325 458 508l1logreg 4587 4663 4433 4399 4560 4316BBR 1180 1164 1158 1330 1246 1183

                      N = 100 p = 10 000glmnet 1018 1035 993 1004 902 891l1logreg 13027 12488 12418 12984 13721 15954BBR 4572 4750 4746 4849 5629 6021

                      Table 3 Timings (seconds) for logistic model with lasso penalty and sparse features (95zero) Total time for ten-fold cross-validation over a grid of 100 λ values

                      NewsGroup (Lang 1995) document classification problem We used the training setcultured from these data by Koh et al (2007a) The response is binary and indicatesa subclass of topics the predictors are binary and indicate the presence of particulartri-gram sequences The predictor matrix has 005 nonzero values

                      All four datasets are available online with this publication as saved R data objects (the lattertwo in sparse format using the Matrix package Bates and Maechler 2009)

                      For the Leukemia and InternetAd datasets the BBR program used fewer than 100 λ valuesso we estimated the total time by scaling up the time for smaller number of values TheInternetAd and NewsGroup datasets are both sparse 1 nonzero values for the former005 for the latter Again glmnet is considerably faster than the competing methods

                      54 Other comparisons

                      When making comparisons one invariably leaves out someones favorite method We leftout our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie2007a) since it does not scale well to the size problems we consider here Two referees of

                      16 Regularization Paths for GLMs via Coordinate Descent

                      Name Type N p glmnet l1logreg BBRBMR

                      DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

                      SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

                      Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

                      MacBook Pro HP Linux server

                      glmnet 034 013penalized 1031OWL-QN 31435

                      Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

                      an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

                      OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                      The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                      Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                      6 Selecting the tuning parameters

                      The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                      Journal of Statistical Software 17

                      minus6 minus5 minus4 minus3 minus2 minus1 0

                      2426

                      2830

                      32

                      log(Lambda)

                      Mea

                      n S

                      quar

                      ed E

                      rror

                      99 99 97 95 93 75 54 21 12 5 2 1

                      Gaussian Family

                      minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                      08

                      09

                      10

                      11

                      12

                      13

                      14

                      log(Lambda)

                      Dev

                      ianc

                      e

                      100 98 97 88 74 55 30 9 7 3 2

                      Binomial Family

                      minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                      020

                      030

                      040

                      050

                      log(Lambda)

                      Mis

                      clas

                      sific

                      atio

                      n E

                      rror

                      100 98 97 88 74 55 30 9 7 3 2

                      Binomial Family

                      Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

                      18 Regularization Paths for GLMs via Coordinate Descent

                      Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                      7 Discussion

                      Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                      Acknowledgments

                      We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                      References

                      Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                      Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                      Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                      Journal of Statistical Software 19

                      Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                      Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                      Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                      Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                      Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                      Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                      Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                      Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                      Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                      Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                      Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                      Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                      Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                      Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                      Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                      Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                      20 Regularization Paths for GLMs via Coordinate Descent

                      Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                      Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                      Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                      Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                      Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                      Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                      Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                      Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                      Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                      Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                      Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                      Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                      Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                      Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                      Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                      Journal of Statistical Software 21

                      Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                      R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                      Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                      Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                      Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                      Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                      Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                      Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                      Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                      Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                      Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                      Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                      Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                      Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                      Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                      22 Regularization Paths for GLMs via Coordinate Descent

                      A Proof of Theorem 1

                      We have

                      cj = arg mint

                      Ksum`=1

                      [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                      ] (32)

                      Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                      Ksum`=1

                      [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                      where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                      t = βj +1K

                      α

                      1minus α

                      Ksum`=1

                      sj` (34)

                      It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                      Affiliation

                      Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                      Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                      Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                      • Introduction
                      • Algorithms for the lasso ridge regression and elastic net
                        • Naive updates
                        • Covariance updates
                        • Sparse updates
                        • Weighted updates
                        • Pathwise coordinate descent
                        • Other details
                          • Regularized logistic regression
                          • Regularized multinomial regression
                            • Regularization and parameter ambiguity
                            • Grouped and matrix responses
                              • Timings
                                • Regression with the lasso
                                • Lasso-logistic regression
                                • Real data
                                • Other comparisons
                                  • Selecting the tuning parameters
                                  • Discussion
                                  • Proof of Theorem 1

                        12 Regularization Paths for GLMs via Coordinate Descent

                        5 Timings

                        In this section we compare the run times of the coordinate-wise algorithm to some competingalgorithms These use the lasso penalty (α = 1) in both the regression and logistic regressionsettings All timings were carried out on an Intel Xeon 280GH processor

                        We do not perform comparisons on the elastic net versions of the penalties since there is notmuch software available for elastic net Comparisons of our glmnet code with the R packageelasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso sinceelasticnet (Zou and Hastie 2004) is built on the lars package

                        51 Regression with the lasso

                        We generated Gaussian data withN observations and p predictors with each pair of predictorsXj Xjprime having the same population correlation ρ We tried a number of combinations of Nand p with ρ varying from zero to 095 The outcome values were generated by

                        Y =psumj=1

                        Xjβj + k middot Z (31)

                        where βj = (minus1)j exp(minus2(j minus 1)20) Z sim N(0 1) and k is chosen so that the signal-to-noiseratio is 30 The coefficients are constructed to have alternating signs and to be exponentiallydecreasing

                        Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars proce-dure (Efron et al 2004) All algorithms are implemented as R functions The coordinate-wisealgorithm does all of its numerical work in Fortran while lars (Hastie and Efron 2007) doesmuch of its work in R calling Fortran routines for some matrix operations However com-parisons in Friedman et al (2007) showed that lars was actually faster than a version codedentirely in Fortran Comparisons between different programs are always tricky in particularthe lars procedure computes the entire path of solutions while the coordinate-wise proceduresolves the problem for a set of pre-defined points along the solution path In the orthogonalcase lars takes min(N p) steps hence to make things roughly comparable we called thelatter two algorithms to solve a total of min(N p) problems along the path Table 1 showstimings in seconds averaged over three runs We see that glmnet is considerably faster thanlars the covariance-updating version of the algorithm is a little faster than the naive versionwhen N gt p and a little slower when p gt N We had expected that high correlation betweenthe features would increase the run time of glmnet but this does not seem to be the case

                        52 Lasso-logistic regression

                        We used the same simulation setup as above except that we took the continuous outcome ydefined p = 1(1 + exp(minusy)) and used this to generate a two-class outcome z with Pr(z =1) = p Pr(z = 0) = 1 minus p We compared the speed of glmnet to the interior point methodl1logreg (Koh et al 2007ba) Bayesian binary regression (BBR Madigan and Lewis 2007Genkin et al 2007) and the lasso penalized logistic program LPL supplied by Ken Lange (seeWu and Lange 2008) The latter two methods also use a coordinate descent approach

                        The BBR software automatically performs ten-fold cross-validation when given a set of λvalues Hence we report the total time for ten-fold cross-validation for all methods using the

                        Journal of Statistical Software 13

                        Linear regression ndash Dense features

                        Correlation0 01 02 05 09 095

                        N = 1000 p = 100glmnet (type = naive) 005 006 006 009 008 007glmnet (type = cov) 002 002 002 002 002 002lars 011 011 011 011 011 011

                        N = 5000 p = 100glmnet (type = naive) 024 025 026 034 032 031glmnet (type = cov) 005 005 005 005 005 005lars 029 029 029 030 029 029

                        N = 100 p = 1000glmnet (type = naive) 004 005 004 005 004 003glmnet (type = cov) 007 008 007 008 004 003lars 073 072 068 071 071 067

                        N = 100 p = 5000glmnet (type = naive) 020 018 021 023 021 014glmnet (type = cov) 046 042 051 048 025 010lars 373 353 359 347 390 352

                        N = 100 p = 20000glmnet (type = naive) 100 099 106 129 117 097glmnet (type = cov) 186 226 234 259 124 079lars 1830 1790 1690 1803 1791 1639

                        N = 100 p = 50000glmnet (type = naive) 266 246 284 353 339 243glmnet (type = cov) 550 492 613 735 452 253lars 5868 6400 6479 5820 6639 7979

                        Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

                        same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

                        14 Regularization Paths for GLMs via Coordinate Descent

Logistic regression – Dense features

                         Correlation
              0       0.1     0.2     0.5     0.9     0.95
N = 1000, p = 100
glmnet        1.65    1.81    2.31    3.87    5.99    8.48
l1logreg     31.475  31.86   34.35   32.21   31.85   31.81
BBR          40.70   47.57   54.18   70.06  106.72  121.41
LPL          24.68   31.64   47.99  170.77  741.00 1448.25

N = 5000, p = 100
glmnet        7.89    8.48    9.01   13.39   26.68   26.36
l1logreg    239.88  232.00  229.62  229.49  221.9   223.09

N = 100,000, p = 100
glmnet       78.56  178.45  205.94  274.33  552.48  638.50

N = 100, p = 1000
glmnet        1.06    1.07    1.09    1.45    1.72    1.37
l1logreg     25.99   26.40   25.67   26.49   24.34   20.16
BBR          70.19   71.19   78.40  103.77  149.05  113.87
LPL          11.02   10.87   10.76   16.34   41.84   70.50

N = 100, p = 5000
glmnet        5.24    4.43    5.12    7.05    7.87    6.05
l1logreg    165.02  161.90  163.25  166.50  151.91  135.28

N = 100, p = 100,000
glmnet      137.27  139.40  146.55  197.98  219.65  201.93

Table 2: Timings (seconds) for logistic models with lasso penalty. Total time for ten-fold cross-validation over a grid of 100 λ values.

5.3 Real data

Table 4 shows some timing results for four different datasets:

Cancer (Ramaswamy et al. 2002): gene-expression data with 14 cancer classes. Here we compare glmnet with BMR (Genkin et al. 2007), a multinomial version of BBR.

Leukemia (Golub et al. 1999): gene-expression data with a binary response indicating the type of leukemia (AML vs. ALL). We used the preprocessed data of Dettling (2004).

InternetAd (Kushmerick 1999): document classification problem with mostly binary features. The response is binary, and indicates whether the document is an advertisement. Only 1.2% of the values in the predictor matrix are nonzero.


Logistic regression – Sparse features

                        Correlation
              0       0.1     0.2     0.5     0.9     0.95
N = 1000, p = 100
glmnet        0.77    0.74    0.72    0.73    0.84    0.88
l1logreg      5.19    5.21    5.14    5.40    6.14    6.26
BBR           2.01    1.95    1.98    2.06    2.73    2.88

N = 100, p = 1000
glmnet        1.81    1.73    1.55    1.70    1.63    1.55
l1logreg      7.67    7.72    7.64    9.04    9.81    9.40
BBR           4.66    4.58    4.68    5.15    5.78    5.53

N = 10,000, p = 100
glmnet        3.21    3.02    2.95    3.25    4.58    5.08
l1logreg     45.87   46.63   44.33   43.99   45.60   43.16
BBR          11.80   11.64   11.58   13.30   12.46   11.83

N = 100, p = 10,000
glmnet       10.18   10.35    9.93   10.04    9.02    8.91
l1logreg    130.27  124.88  124.18  129.84  137.21  159.54
BBR          45.72   47.50   47.46   48.49   56.29   60.21

Table 3: Timings (seconds) for logistic model with lasso penalty and sparse features (95% zero). Total time for ten-fold cross-validation over a grid of 100 λ values.

NewsGroup (Lang 1995): document classification problem. We used the training set cultured from these data by Koh et al. (2007a). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values.

All four datasets are available online with this publication as saved R data objects (the latter two in sparse format using the Matrix package; Bates and Maechler 2009).

For the Leukemia and InternetAd datasets, the BBR program used fewer than 100 λ values, so we estimated the total time by scaling up the time for the smaller number of values. The InternetAd and NewsGroup datasets are both sparse: 1.2% nonzero values for the former, 0.05% for the latter. Again glmnet is considerably faster than the competing methods.

5.4 Other comparisons

When making comparisons, one invariably leaves out someone's favorite method. We left out our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie 2007a), since it does not scale well to the size of problems we consider here.


Name         Type      N       p        glmnet     l1logreg   BBR/BMR

Dense
Cancer       14 class  144     16,063   2.5 mins              2.1 hrs
Leukemia     2 class   72      3571     2.50       55.0       450

Sparse
InternetAd   2 class   2359    1430     5.0        20.9       34.7
NewsGroup    2 class   11,314  777,811  2 mins                3.5 hrs

Table 4: Timings (seconds, unless stated otherwise) for some real datasets. For the Cancer, Leukemia and InternetAd datasets, times are for ten-fold cross-validation using 100 values of λ; for NewsGroup we performed a single run with 100 values of λ, with λ_min = 0.05 λ_max.

             MacBook Pro   HP Linux server
glmnet          0.34            0.13
penalized      10.31
OWL-QN                        314.35

Table 5: Timings (seconds) for the Leukemia dataset, using 100 λ values. These timings were performed on two different platforms, which were different again from those used in the earlier timings in this paper.

Two referees of an earlier draft of this paper suggested two methods of which we were not aware. We ran a single benchmark against each of these, using the Leukemia data, fitting models at 100 values of λ in each case.

OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularized Objectives (Andrew and Gao 2007a,b). The software is written in C++, and available from the authors upon request.

The R package penalized (Goeman 2009b,a), which fits GLMs using a fast implementation of gradient ascent.

Table 5 shows these comparisons (on two different machines); glmnet is considerably faster in both cases.

                        6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for any particular model, leaving the user to select a particular solution from the ensemble. One general approach is to use prediction error to guide this choice. If a user is data rich, they can set aside some fraction (say a third) of their data for this purpose. They would then evaluate the prediction performance at each value of λ, and pick the model with the best performance.
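The following R sketch illustrates this validation-set strategy on assumed simulated data; the names hold and best_lambda are ours for the example, not part of the glmnet package.

```r
library(glmnet)

set.seed(3)
N <- 1500; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- drop(x %*% c(rnorm(10), rep(0, p - 10)) + rnorm(N))

hold <- sample(N, N / 3)                 # hold out a third for validation
fit  <- glmnet(x[-hold, ], y[-hold])     # lasso path fit on the training part

pred <- predict(fit, newx = x[hold, ])   # predictions at every lambda in the path
mse  <- colMeans((y[hold] - pred)^2)     # validation error along the path
best_lambda <- fit$lambda[which.min(mse)]
```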


[Figure 2 appears here: three panels of cross-validation curves against log(Lambda). Top: mean squared error for the Gaussian family; bottom left: binomial deviance; bottom right: misclassification error, both for the binomial family.]

Figure 2: Ten-fold cross-validation on simulated data. The first row is for regression with a Gaussian response; the second row is logistic regression with a binomial response. In both cases we have 1000 observations and 100 predictors, but the response depends on only 10 predictors. For regression we use mean-squared prediction error as the measure of risk. For logistic regression, the left panel shows the mean deviance (minus twice the log-likelihood on the left-out data), while the right panel shows misclassification error, which is a rougher measure. In all cases we show the mean cross-validated error curve, as well as a one-standard-deviation band. In each figure the left vertical line corresponds to the minimum error, while the right vertical line corresponds to the largest value of lambda such that the error is within one standard error of the minimum: the so-called "one-standard-error" rule. The top of each plot is annotated with the size of the models.


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way. Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the deviance is a smoother measure. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so it errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
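A minimal sketch of this procedure with cv.glmnet, assuming simulated data of the same shape as in Figure 2 (1000 observations, 100 predictors, 10 of them active); lambda.min and lambda.1se correspond to the two vertical lines described in the figure caption, and plot() draws curves analogous to those shown there.

```r
library(glmnet)

set.seed(4)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- drop(x %*% c(rep(1, 10), rep(0, p - 10)) + rnorm(N))   # 10 active predictors
z <- rbinom(N, 1, 1 / (1 + exp(-y)))                        # binary version

cv_gauss <- cv.glmnet(x, y, family = "gaussian")                          # squared error
cv_dev   <- cv.glmnet(x, z, family = "binomial")                          # deviance
cv_class <- cv.glmnet(x, z, family = "binomial", type.measure = "class")  # misclassification

cv_gauss$lambda.min   # lambda achieving the minimum CV error (left vertical line)
cv_gauss$lambda.1se   # largest lambda within one standard error of it (right line)
plot(cv_gauss)        # mean CV error curve with a one-standard-deviation band
```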

                        7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with `1 or `2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and large p, is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
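As a brief illustration of the sparse-input support mentioned above, the following sketch passes a dgCMatrix from the Matrix package directly to glmnet; the dimensions, sparsity level and response are made up for the example.

```r
library(glmnet)
library(Matrix)

set.seed(5)
N <- 500; p <- 2000
x <- matrix(rnorm(N * p), N, p)
x[sample(length(x), 0.95 * length(x))] <- 0   # zero out 95% of the entries
xs <- Matrix(x, sparse = TRUE)                # stored as a dgCMatrix
z  <- rbinom(N, 1, 0.5)                       # an arbitrary binary response

fit <- glmnet(xs, z, family = "binomial")     # same call as for a dense matrix
```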

                        Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

                        References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-Memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.


Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X. doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software, URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


                        A Proof of Theorem 1

We have

    c_j = \arg\min_t \sum_{\ell=1}^{K} \left[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha\,|\beta_{j\ell} - t| \right]. \qquad (32)

Suppose \alpha \in (0, 1). Differentiating with respect to t (using a sub-gradient representation), we have

    \sum_{\ell=1}^{K} \left[ -(1-\alpha)(\beta_{j\ell} - t) - \alpha s_{j\ell} \right] = 0, \qquad (33)

where s_{j\ell} = \mathrm{sign}(\beta_{j\ell} - t) if \beta_{j\ell} \neq t, and s_{j\ell} \in [-1, 1] otherwise. This gives

    t = \bar\beta_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha} \sum_{\ell=1}^{K} s_{j\ell}. \qquad (34)

It follows that t cannot be larger than \beta^M_j, since then the second term above would be negative, and this would imply that t is less than \bar\beta_j. Similarly, t cannot be less than \bar\beta_j, since then the second term above would have to be negative, implying that t is larger than \beta^M_j.
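As a quick numerical sanity check of the first step of this argument (not part of the paper), one can minimize the criterion in (32) for random draws of the β_{jℓ} and verify that the minimizer never exceeds their maximum; the settings below (K, α, number of replications) are arbitrary.

```r
set.seed(6)
check_one <- function(K = 5, alpha = 0.7) {
  beta <- rnorm(K)
  # the one-dimensional criterion from (32)
  crit <- function(t) sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))
  t_opt <- optimize(crit, interval = range(beta) + c(-5, 5))$minimum
  t_opt <= max(beta) + 1e-4            # TRUE if the minimizer does not exceed max(beta)
}
all(replicate(1000, check_one()))      # expect TRUE
```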

Affiliation:

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software                      http://www.jstatsoft.org/
published by the American Statistical Association   http://www.amstat.org/
Volume 33, Issue 1                                   Submitted: 2009-04-22
January 2010                                         Accepted: 2009-12-15


                          Journal of Statistical Software 13

                          Linear regression ndash Dense features

                          Correlation0 01 02 05 09 095

                          N = 1000 p = 100glmnet (type = naive) 005 006 006 009 008 007glmnet (type = cov) 002 002 002 002 002 002lars 011 011 011 011 011 011

                          N = 5000 p = 100glmnet (type = naive) 024 025 026 034 032 031glmnet (type = cov) 005 005 005 005 005 005lars 029 029 029 030 029 029

                          N = 100 p = 1000glmnet (type = naive) 004 005 004 005 004 003glmnet (type = cov) 007 008 007 008 004 003lars 073 072 068 071 071 067

                          N = 100 p = 5000glmnet (type = naive) 020 018 021 023 021 014glmnet (type = cov) 046 042 051 048 025 010lars 373 353 359 347 390 352

                          N = 100 p = 20000glmnet (type = naive) 100 099 106 129 117 097glmnet (type = cov) 186 226 234 259 124 079lars 1830 1790 1690 1803 1791 1639

                          N = 100 p = 50000glmnet (type = naive) 266 246 284 353 339 243glmnet (type = cov) 550 492 613 735 452 253lars 5868 6400 6479 5820 6639 7979

                          Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

                          same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

                          14 Regularization Paths for GLMs via Coordinate Descent

                          Logistic regression ndash Dense features

                          Correlation0 01 02 05 09 095

                          N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

                          N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

                          N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

                          N = 100 p = 1000glmnet 106 107 109 145 172 137l1logreg 2599 2640 2567 2649 2434 2016BBR 7019 7119 7840 10377 14905 11387LPL 1102 1087 1076 1634 4184 7050

                          N = 100 p = 5000glmnet 524 443 512 705 787 605l1logreg 16502 16190 16325 16650 15191 13528

                          N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

                          Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

                          53 Real data

                          Table 4 shows some timing results for four different datasets

                          Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

                          Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

                          InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

                          Journal of Statistical Software 15

                          Logistic regression ndash Sparse features

                          Correlation0 01 02 05 09 095

                          N = 1000 p = 100glmnet 077 074 072 073 084 088l1logreg 519 521 514 540 614 626BBR 201 195 198 206 273 288

                          N = 100 p = 1000glmnet 181 173 155 170 163 155l1logreg 767 772 764 904 981 940BBR 466 458 468 515 578 553

                          N = 10 000 p = 100glmnet 321 302 295 325 458 508l1logreg 4587 4663 4433 4399 4560 4316BBR 1180 1164 1158 1330 1246 1183

                          N = 100 p = 10 000glmnet 1018 1035 993 1004 902 891l1logreg 13027 12488 12418 12984 13721 15954BBR 4572 4750 4746 4849 5629 6021

                          Table 3 Timings (seconds) for logistic model with lasso penalty and sparse features (95zero) Total time for ten-fold cross-validation over a grid of 100 λ values

                          NewsGroup (Lang 1995) document classification problem We used the training setcultured from these data by Koh et al (2007a) The response is binary and indicatesa subclass of topics the predictors are binary and indicate the presence of particulartri-gram sequences The predictor matrix has 005 nonzero values

                          All four datasets are available online with this publication as saved R data objects (the lattertwo in sparse format using the Matrix package Bates and Maechler 2009)

                          For the Leukemia and InternetAd datasets the BBR program used fewer than 100 λ valuesso we estimated the total time by scaling up the time for smaller number of values TheInternetAd and NewsGroup datasets are both sparse 1 nonzero values for the former005 for the latter Again glmnet is considerably faster than the competing methods

                          54 Other comparisons

                          When making comparisons one invariably leaves out someones favorite method We leftout our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie2007a) since it does not scale well to the size problems we consider here Two referees of

                          16 Regularization Paths for GLMs via Coordinate Descent

                          Name Type N p glmnet l1logreg BBRBMR

                          DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

                          SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

                          Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

                          MacBook Pro HP Linux server

                          glmnet 034 013penalized 1031OWL-QN 31435

                          Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

                          an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

                          OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                          The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                          Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                          6 Selecting the tuning parameters

                          The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                          Journal of Statistical Software 17

                          minus6 minus5 minus4 minus3 minus2 minus1 0

                          2426

                          2830

                          32

                          log(Lambda)

                          Mea

                          n S

                          quar

                          ed E

                          rror

                          99 99 97 95 93 75 54 21 12 5 2 1

                          Gaussian Family

                          minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                          08

                          09

                          10

                          11

                          12

                          13

                          14

                          log(Lambda)

                          Dev

                          ianc

                          e

                          100 98 97 88 74 55 30 9 7 3 2

                          Binomial Family

                          minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                          020

                          030

                          040

                          050

                          log(Lambda)

                          Mis

                          clas

                          sific

                          atio

                          n E

                          rror

                          100 98 97 88 74 55 30 9 7 3 2

                          Binomial Family

                          Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

                          18 Regularization Paths for GLMs via Coordinate Descent

                          Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                          7 Discussion

                          Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                          Acknowledgments

                          We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                          References

                          Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                          Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                          Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                          Journal of Statistical Software 19

                          Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                          Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                          Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                          Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                          Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                          Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                          Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                          Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                          Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                          Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                          Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                          Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                          Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                          Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                          Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                          Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                          20 Regularization Paths for GLMs via Coordinate Descent

                          Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                          Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                          Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                          Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                          Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                          Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                          Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                          Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                          Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                          Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                          Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                          Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                          Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                          Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                          Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                          Journal of Statistical Software 21

                          Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                          R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                          Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                          Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                          Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                          Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                          Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                          Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                          Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                          Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                          Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                          Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                          Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                          Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                          Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                          22 Regularization Paths for GLMs via Coordinate Descent

                          A Proof of Theorem 1

                          We have

                          cj = arg mint

                          Ksum`=1

                          [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                          ] (32)

                          Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                          Ksum`=1

                          [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                          where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                          t = βj +1K

                          α

                          1minus α

                          Ksum`=1

                          sj` (34)

                          It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                          Affiliation

                          Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                          Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                          Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                          • Introduction
                          • Algorithms for the lasso ridge regression and elastic net
                            • Naive updates
                            • Covariance updates
                            • Sparse updates
                            • Weighted updates
                            • Pathwise coordinate descent
                            • Other details
                              • Regularized logistic regression
                              • Regularized multinomial regression
                                • Regularization and parameter ambiguity
                                • Grouped and matrix responses
                                  • Timings
                                    • Regression with the lasso
                                    • Lasso-logistic regression
                                    • Real data
                                    • Other comparisons
                                      • Selecting the tuning parameters
                                      • Discussion
                                      • Proof of Theorem 1

                            14 Regularization Paths for GLMs via Coordinate Descent

                            Logistic regression ndash Dense features

                            Correlation0 01 02 05 09 095

                            N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

                            N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

                            N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

                            N = 100 p = 1000glmnet 106 107 109 145 172 137l1logreg 2599 2640 2567 2649 2434 2016BBR 7019 7119 7840 10377 14905 11387LPL 1102 1087 1076 1634 4184 7050

                            N = 100 p = 5000glmnet 524 443 512 705 787 605l1logreg 16502 16190 16325 16650 15191 13528

                            N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

                            Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

                            53 Real data

                            Table 4 shows some timing results for four different datasets

                            Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

                            Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

                            InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

                            Journal of Statistical Software 15

                            Logistic regression ndash Sparse features

                            Correlation0 01 02 05 09 095

                            N = 1000 p = 100glmnet 077 074 072 073 084 088l1logreg 519 521 514 540 614 626BBR 201 195 198 206 273 288

                            N = 100 p = 1000glmnet 181 173 155 170 163 155l1logreg 767 772 764 904 981 940BBR 466 458 468 515 578 553

                            N = 10 000 p = 100glmnet 321 302 295 325 458 508l1logreg 4587 4663 4433 4399 4560 4316BBR 1180 1164 1158 1330 1246 1183

                            N = 100 p = 10 000glmnet 1018 1035 993 1004 902 891l1logreg 13027 12488 12418 12984 13721 15954BBR 4572 4750 4746 4849 5629 6021

                            Table 3 Timings (seconds) for logistic model with lasso penalty and sparse features (95zero) Total time for ten-fold cross-validation over a grid of 100 λ values

                            NewsGroup (Lang 1995) document classification problem We used the training setcultured from these data by Koh et al (2007a) The response is binary and indicatesa subclass of topics the predictors are binary and indicate the presence of particulartri-gram sequences The predictor matrix has 005 nonzero values

                            All four datasets are available online with this publication as saved R data objects (the lattertwo in sparse format using the Matrix package Bates and Maechler 2009)

                            For the Leukemia and InternetAd datasets the BBR program used fewer than 100 λ valuesso we estimated the total time by scaling up the time for smaller number of values TheInternetAd and NewsGroup datasets are both sparse 1 nonzero values for the former005 for the latter Again glmnet is considerably faster than the competing methods

                            54 Other comparisons

                            When making comparisons one invariably leaves out someones favorite method We leftout our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie2007a) since it does not scale well to the size problems we consider here Two referees of

                            16 Regularization Paths for GLMs via Coordinate Descent

                            Name Type N p glmnet l1logreg BBRBMR

                            DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

                            SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

                            Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

                            MacBook Pro HP Linux server

                            glmnet 034 013penalized 1031OWL-QN 31435

                            Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

                            an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

                            OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                            The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                            Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                            6 Selecting the tuning parameters

                            The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                            Journal of Statistical Software 17

                            minus6 minus5 minus4 minus3 minus2 minus1 0

                            2426

                            2830

                            32

                            log(Lambda)

                            Mea

                            n S

                            quar

                            ed E

                            rror

                            99 99 97 95 93 75 54 21 12 5 2 1

                            Gaussian Family

                            minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                            08

                            09

                            10

                            11

                            12

                            13

                            14

                            log(Lambda)

                            Dev

                            ianc

                            e

                            100 98 97 88 74 55 30 9 7 3 2

                            Binomial Family

                            minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                            020

                            030

                            040

                            050

                            log(Lambda)

                            Mis

                            clas

                            sific

                            atio

                            n E

                            rror

                            100 98 97 88 74 55 30 9 7 3 2

                            Binomial Family

                            Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models


Alternatively, they can use K-fold cross-validation (Hastie et al. 2009, for example), where the training data is used both for training and testing in an unbiased way. Figure 2 illustrates cross-validation on a simulated dataset. For logistic regression, we sometimes use the binomial deviance rather than misclassification error, since the former is the smoother measure. We often use the "one-standard-error" rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so it errs on the side of parsimony (Hastie et al. 2009). Cross-validation can be used to select α as well, although it is often viewed as a higher-level parameter and chosen on more subjective grounds.
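For reference, a sketch of how Figure 2-style ten-fold cross-validation might be run with the package; it assumes the cv.glmnet() interface of recent glmnet releases (the type.measure argument and the lambda.min / lambda.1se fields):

# Ten-fold cross-validation for a binomial model, as in Figure 2 (illustrative sketch).
library(glmnet)
set.seed(3)
N <- 1000; p <- 100
x <- matrix(rnorm(N * p), N, p)
y <- rbinom(N, 1, 1 / (1 + exp(-drop(x[, 1:10] %*% rep(0.5, 10)))))
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10,
                   type.measure = "deviance")   # use "class" for misclassification error
plot(cvfit)            # mean CV error curve with one-standard-deviation band
cvfit$lambda.min       # lambda minimizing the CV error
cvfit$lambda.1se       # largest lambda within one standard error of the minimum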

                            7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problems with `1 or `2 constraints, or mixtures of the two (elastic net). Each coordinate-descent step is fast, with an explicit formula for each coordinate-wise minimization. The method also exploits the sparsity of the model, spending much of its time evaluating only inner products for variables with non-zero coefficients. Its computational speed, both for large N and p, is quite remarkable.

An R-language package glmnet is available under general public licence (GPL-2) from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=glmnet. Sparse data inputs are handled by the Matrix package. MATLAB functions (Jiang 2009) are available from http://www-stat.stanford.edu/~tibs/glmnet-matlab/.
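To make the "explicit formula" concrete, here is a bare-bones sketch of the naive cyclical coordinate-descent update for the Gaussian elastic net at a single value of lambda. It is an illustrative re-implementation under the assumption that the columns of x are standardized (mean zero, average squared value one) and y is centered; it is not the package's optimized Fortran code.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding operator

enet_cd <- function(x, y, lambda, alpha, tol = 1e-7, maxit = 1000) {
  p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                   # current residual y - x %*% beta
  for (it in 1:maxit) {
    delta <- 0
    for (j in 1:p) {
      bj <- beta[j]
      zj <- mean(x[, j] * r) + bj          # (1/N) sum_i x_ij * (partial residual excluding x_j)
      beta[j] <- soft(zj, lambda * alpha) / (1 + lambda * (1 - alpha))
      if (beta[j] != bj) {
        r <- r - x[, j] * (beta[j] - bj)   # update residual after the coefficient change
        delta <- max(delta, abs(beta[j] - bj))
      }
    }
    if (delta < tol) break                 # stop once no coordinate moves appreciably
  }
  beta
}

The package goes well beyond this naive sketch: covariance updating, warm starts along the lambda path, and active-set iterations avoid recomputing most of these inner products when coefficients stay at zero.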

                            Acknowledgments

We would like to thank Holger Hoefling for helpful discussions, and Hui Jiang for writing the MATLAB interface to our Fortran routines. We thank the associate editor, production editor and two referees who gave useful comments on an earlier draft of this article.

Friedman was partially supported by grant DMS-97-64431 from the National Science Foundation. Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

                            References

Andrew G, Gao J (2007a). OWL-QN: Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-Regularized Objectives. URL http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/.

Andrew G, Gao J (2007b). "Scalable Training of L1-Regularized Log-Linear Models." In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM, New York, NY, USA. doi:10.1145/1273496.1273501.

Bates D, Maechler M (2009). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 0.999375-30, URL http://CRAN.R-project.org/package=Matrix.


Candes E, Tao T (2007). "The Dantzig Selector: Statistical Estimation When p is much Larger than n." The Annals of Statistics, 35(6), 2313–2351.

Chen SS, Donoho D, Saunders M (1998). "Atomic Decomposition by Basis Pursuit." SIAM Journal on Scientific Computing, 20(1), 33–61.

Daubechies I, Defrise M, De Mol C (2004). "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint." Communications on Pure and Applied Mathematics, 57, 1413–1457.

Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20, 3583–3593.

Donoho DL, Johnstone IM (1994). "Ideal Spatial Adaptation by Wavelet Shrinkage." Biometrika, 81, 425–455.

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). "Least Angle Regression." The Annals of Statistics, 32(2), 407–499.

Fan J, Li R (2005). "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association, 96, 1348–1360.

Friedman J (2008). "Fast Sparse Regression and Classification." Technical report, Department of Statistics, Stanford University. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). "Pathwise Coordinate Optimization." The Annals of Applied Statistics, 2(1), 302–332.

Friedman J, Hastie T, Tibshirani R (2008). "Sparse Inverse Covariance Estimation with the Graphical Lasso." Biostatistics, 9, 432–441.

Friedman J, Hastie T, Tibshirani R (2009). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.1-4, URL http://CRAN.R-project.org/package=glmnet.

Fu W (1998). "Penalized Regressions: The Bridge vs. the Lasso." Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin A, Lewis D, Madigan D (2007). "Large-Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291–304.

Goeman J (2009a). "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal. doi:10.1002/bimj.200900028. Forthcoming.

Goeman J (2009b). penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-27, URL http://CRAN.R-project.org/package=penalized.

Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 286, 531–536.


Hastie T, Efron B (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7, URL http://CRAN.R-project.org/package=lars.

Hastie T, Rosset S, Tibshirani R, Zhu J (2004). "The Entire Regularization Path for the Support Vector Machine." Journal of Machine Learning Research, 5, 1391–1415.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer-Verlag, New York.

Jiang H (2009). "A MATLAB Implementation of glmnet." Stanford University. URL http://www-stat.stanford.edu/~tibs/glmnet-matlab/.

Koh K, Kim SJ, Boyd S (2007a). "An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression." Journal of Machine Learning Research, 8, 1519–1555.

Koh K, Kim SJ, Boyd S (2007b). l1logreg: A Solver for L1-Regularized Logistic Regression. R package version 0.1-1. Available from Kwangmoo Koh (deneb1@stanford.edu).

Krishnapuram B, Hartemink AJ (2005). "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds." IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957–968.

Kushmerick N (1999). "Learning to Remove Internet Advertisements." In AGENTS '99: Proceedings of the Third Annual Conference on Autonomous Agents, pp. 175–181. ACM, New York, NY, USA. ISBN 1-58113-066-X, doi:10.1145/301136.301186.

Lang K (1995). "NewsWeeder: Learning to Filter Netnews." In A Prieditis, S Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pp. 331–339. San Francisco.

Lee S, Lee H, Abbeel P, Ng A (2006). "Efficient L1 Logistic Regression." In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). URL https://www.aaai.org/Papers/AAAI/2006/AAAI06-064.pdf.

Madigan D, Lewis D (2007). BBR, BMR: Bayesian Logistic Regression. Open-source stand-alone software, URL http://www.bayesianregression.org/.

Meier L, van de Geer S, Buhlmann P (2008). "The Group Lasso for Logistic Regression." Journal of the Royal Statistical Society B, 70(1), 53–71.

Osborne M, Presnell B, Turlach B (2000). "A New Approach to Variable Selection in Least Squares Problems." IMA Journal of Numerical Analysis, 20, 389–404.

Park MY, Hastie T (2007a). "L1-Regularization Path Algorithm for Generalized Linear Models." Journal of the Royal Statistical Society B, 69, 659–677.

Park MY, Hastie T (2007b). glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package version 0.94, URL http://CRAN.R-project.org/package=glmpath.


Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T (2002). "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature." Proceedings of the National Academy of Sciences, 98, 15149–15154.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Rosset S, Zhu J (2007). "Piecewise Linear Regularized Solution Paths." The Annals of Statistics, 35(3), 1012–1030.

Shevade K, Keerthi S (2003). "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression." Bioinformatics, 19, 2246–2253.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani R (1997). "The Lasso Method for Variable Selection in the Cox Model." Statistics in Medicine, 16, 385–395.

Tseng P (2001). "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization." Journal of Optimization Theory and Applications, 109, 475–494.

Van der Kooij A (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Ph.D. thesis, Department of Data Theory, University of Leiden. URL https://openaccess.leidenuniv.nl/dspace/handle/1887/12096.

Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009). "Genome-Wide Association Analysis by Penalized Logistic Regression." Bioinformatics, 25(6), 714–721.

Wu T, Lange K (2008). "Coordinate Descent Procedures for Lasso Penalized Regression." The Annals of Applied Statistics, 2(1), 224–244.

Yuan M, Lin Y (2007). "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society B, 68(1), 49–67.

Zhu J, Hastie T (2004). "Classification of Expression Arrays by Penalized Logistic Regression." Biostatistics, 5(3), 427–443.

Zou H (2006). "The Adaptive Lasso and its Oracle Properties." Journal of the American Statistical Association, 101, 1418–1429.

Zou H, Hastie T (2004). elasticnet: Elastic Net Regularization and Variable Selection. R package version 1.02, URL http://CRAN.R-project.org/package=elasticnet.

Zou H, Hastie T (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society B, 67(2), 301–320.


                            A Proof of Theorem 1

We have

$$c_j = \arg\min_t \sum_{\ell=1}^{K} \Big[ \tfrac{1}{2}(1-\alpha)(\beta_{j\ell} - t)^2 + \alpha\,|\beta_{j\ell} - t| \Big]. \qquad (32)$$

Suppose $\alpha \in (0,1)$. Differentiating with respect to $t$ (using a sub-gradient representation), we have

$$\sum_{\ell=1}^{K} \big[ -(1-\alpha)(\beta_{j\ell} - t) - \alpha\, s_{j\ell} \big] = 0, \qquad (33)$$

where $s_{j\ell} = \mathrm{sign}(\beta_{j\ell} - t)$ if $\beta_{j\ell} \neq t$ and $s_{j\ell} \in [-1, 1]$ otherwise. This gives

$$t = \bar{\beta}_j + \frac{1}{K}\,\frac{\alpha}{1-\alpha} \sum_{\ell=1}^{K} s_{j\ell}. \qquad (34)$$

It follows that $t$ cannot be larger than $\beta^M_j$ (the largest of the $\beta_{j\ell}$), since then every $s_{j\ell} = -1$ and the second term above would be negative, implying that $t$ is less than $\bar{\beta}_j$. Similarly, $t$ cannot be less than $\beta^m_j$ (the smallest of the $\beta_{j\ell}$), since then every $s_{j\ell} = +1$ and the second term would have to be positive, implying that $t$ is larger than $\bar{\beta}_j$.
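As a quick numerical sanity check of this argument (an illustrative sketch with made-up values, not part of the original proof), one can minimize (32) directly and confirm that the minimizer falls between the smallest and largest of the $\beta_{j\ell}$:

# Minimize (32) numerically and check the minimizer lies in [min(beta), max(beta)].
beta  <- c(-0.5, 0.2, 1.3)   # hypothetical beta_{j1}, ..., beta_{jK}
alpha <- 0.3
obj  <- function(t) sum(0.5 * (1 - alpha) * (beta - t)^2 + alpha * abs(beta - t))
topt <- optimize(obj, interval = range(beta) + c(-1, 1))$minimum
c(min = min(beta), argmin = topt, max = max(beta))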

                            Affiliation

Trevor Hastie
Department of Statistics
Stanford University
California 94305, United States of America
E-mail: hastie@stanford.edu
URL: http://www-stat.stanford.edu/~hastie/

Journal of Statistical Software, http://www.jstatsoft.org/, published by the American Statistical Association, http://www.amstat.org/.

Volume 33, Issue 1, January 2010. Submitted: 2009-04-22; Accepted: 2009-12-15.

                              OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                              The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                              Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                              6 Selecting the tuning parameters

                              The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                              Journal of Statistical Software 17

                              minus6 minus5 minus4 minus3 minus2 minus1 0

                              2426

                              2830

                              32

                              log(Lambda)

                              Mea

                              n S

                              quar

                              ed E

                              rror

                              99 99 97 95 93 75 54 21 12 5 2 1

                              Gaussian Family

                              minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                              08

                              09

                              10

                              11

                              12

                              13

                              14

                              log(Lambda)

                              Dev

                              ianc

                              e

                              100 98 97 88 74 55 30 9 7 3 2

                              Binomial Family

                              minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                              020

                              030

                              040

                              050

                              log(Lambda)

                              Mis

                              clas

                              sific

                              atio

                              n E

                              rror

                              100 98 97 88 74 55 30 9 7 3 2

                              Binomial Family

                              Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

                              18 Regularization Paths for GLMs via Coordinate Descent

                              Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                              7 Discussion

                              Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                              Acknowledgments

                              We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                              References

                              Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                              Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                              Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                              Journal of Statistical Software 19

                              Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                              Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                              Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                              Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                              Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                              Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                              Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                              Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                              Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                              Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                              Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                              Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                              Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                              Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                              Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                              Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                              20 Regularization Paths for GLMs via Coordinate Descent

                              Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                              Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                              Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                              Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                              Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                              Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                              Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                              Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                              Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                              Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                              Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                              Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                              Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                              Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                              Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                              Journal of Statistical Software 21

                              Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                              R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                              Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                              Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                              Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                              Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                              Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                              Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                              Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                              Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                              Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                              Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                              Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                              Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                              Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                              22 Regularization Paths for GLMs via Coordinate Descent

                              A Proof of Theorem 1

                              We have

                              cj = arg mint

                              Ksum`=1

                              [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                              ] (32)

                              Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                              Ksum`=1

                              [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                              where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                              t = βj +1K

                              α

                              1minus α

                              Ksum`=1

                              sj` (34)

                              It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                              Affiliation

                              Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                              Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                              Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                              • Introduction
                              • Algorithms for the lasso ridge regression and elastic net
                                • Naive updates
                                • Covariance updates
                                • Sparse updates
                                • Weighted updates
                                • Pathwise coordinate descent
                                • Other details
                                  • Regularized logistic regression
                                  • Regularized multinomial regression
                                    • Regularization and parameter ambiguity
                                    • Grouped and matrix responses
                                      • Timings
                                        • Regression with the lasso
                                        • Lasso-logistic regression
                                        • Real data
                                        • Other comparisons
                                          • Selecting the tuning parameters
                                          • Discussion
                                          • Proof of Theorem 1

                                16 Regularization Paths for GLMs via Coordinate Descent

                                Name Type N p glmnet l1logreg BBRBMR

                                DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

                                SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

                                Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

                                MacBook Pro HP Linux server

                                glmnet 034 013penalized 1031OWL-QN 31435

                                Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

                                an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

                                OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

                                The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

                                Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

                                6 Selecting the tuning parameters

                                The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

                                Journal of Statistical Software 17

                                minus6 minus5 minus4 minus3 minus2 minus1 0

                                2426

                                2830

                                32

                                log(Lambda)

                                Mea

                                n S

                                quar

                                ed E

                                rror

                                99 99 97 95 93 75 54 21 12 5 2 1

                                Gaussian Family

                                minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                                08

                                09

                                10

                                11

                                12

                                13

                                14

                                log(Lambda)

                                Dev

                                ianc

                                e

                                100 98 97 88 74 55 30 9 7 3 2

                                Binomial Family

                                minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

                                020

                                030

                                040

                                050

                                log(Lambda)

                                Mis

                                clas

                                sific

                                atio

                                n E

                                rror

                                100 98 97 88 74 55 30 9 7 3 2

                                Binomial Family

                                Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

                                18 Regularization Paths for GLMs via Coordinate Descent

                                Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                                7 Discussion

                                Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                                Acknowledgments

                                We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                                References

                                Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                                Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                                Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                                Journal of Statistical Software 19

                                Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                                Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                                Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                                Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                                Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                                Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                                Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                                Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                                Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                                Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                                Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                                Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                                Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                                Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                                Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                                Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                                20 Regularization Paths for GLMs via Coordinate Descent

                                Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                                Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                                Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                                Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                                Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                                Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                                Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                                Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                                Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                                Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                                Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                                Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                                Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                                Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                                Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                                Journal of Statistical Software 21

                                Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                                R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                                Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                                Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                                Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                                Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                                Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                                Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                                Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                                Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                                Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                                Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                                Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                                Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                                Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                                22 Regularization Paths for GLMs via Coordinate Descent

                                A Proof of Theorem 1

                                We have

                                cj = arg mint

                                Ksum`=1

                                [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                                ] (32)

                                Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                                Ksum`=1

                                [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                                where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                                t = βj +1K

                                α

                                1minus α

                                Ksum`=1

                                sj` (34)

                                It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                                Affiliation

                                Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                                Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                                Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                                    18 Regularization Paths for GLMs via Coordinate Descent

                                    Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

                                    7 Discussion

                                    Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

                                    Acknowledgments

                                    We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

                                    References

                                    Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

                                    Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

                                    Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

                                    Journal of Statistical Software 19

                                    Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                                    Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                                    Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                                    Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                                    Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                                    Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                                    Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                                    Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                                    Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                                    Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                                    Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                                    Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                                    Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                                    Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                                    Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                                    Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                                    20 Regularization Paths for GLMs via Coordinate Descent

                                    Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                                    Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                                    Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                                    Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                                    Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                                    Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                                    Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                                    Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                                    Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                                    Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                                    Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                                    Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                                    Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                                    Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                                    Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                                    Journal of Statistical Software 21

                                    Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                                    R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                                    Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                                    Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                                    Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                                    Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                                    Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                                    Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                                    Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                                    Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                                    Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                                    Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                                    Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                                    Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                                    Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                                    22 Regularization Paths for GLMs via Coordinate Descent

                                    A Proof of Theorem 1

                                    We have

                                    cj = arg mint

                                    Ksum`=1

                                    [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                                    ] (32)

                                    Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                                    Ksum`=1

                                    [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                                    where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                                    t = βj +1K

                                    α

                                    1minus α

                                    Ksum`=1

                                    sj` (34)

                                    It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                                    Affiliation

                                    Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                                    Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                                    Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                                    • Introduction
                                    • Algorithms for the lasso ridge regression and elastic net
                                      • Naive updates
                                      • Covariance updates
                                      • Sparse updates
                                      • Weighted updates
                                      • Pathwise coordinate descent
                                      • Other details
                                        • Regularized logistic regression
                                        • Regularized multinomial regression
                                          • Regularization and parameter ambiguity
                                          • Grouped and matrix responses
                                            • Timings
                                              • Regression with the lasso
                                              • Lasso-logistic regression
                                              • Real data
                                              • Other comparisons
                                                • Selecting the tuning parameters
                                                • Discussion
                                                • Proof of Theorem 1

                                      Journal of Statistical Software 19

                                      Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

                                      Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

                                      Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

                                      Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

                                      Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

                                      Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

                                      Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

                                      Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

                                      Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

                                      Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

                                      Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

                                      Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

                                      Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

                                      Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

                                      Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

                                      Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

                                      20 Regularization Paths for GLMs via Coordinate Descent

                                      Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                                      Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                                      Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                                      Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                                      Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                                      Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                                      Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                                      Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                                      Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                                      Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                                      Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                                      Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                                      Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                                      Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                                      Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                                      Journal of Statistical Software 21

                                      Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                                      R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                                      Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                                      Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                                      Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                                      Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                                      Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                                      Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                                      Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                                      Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                                      Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                                      Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                                      Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                                      Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                                      Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                                      22 Regularization Paths for GLMs via Coordinate Descent

                                      A Proof of Theorem 1

                                      We have

                                      cj = arg mint

                                      Ksum`=1

                                      [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                                      ] (32)

                                      Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                                      Ksum`=1

                                      [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                                      where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                                      t = βj +1K

                                      α

                                      1minus α

                                      Ksum`=1

                                      sj` (34)

                                      It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                                      Affiliation

                                      Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                                      Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                                      Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

                                      • Introduction
                                      • Algorithms for the lasso ridge regression and elastic net
                                        • Naive updates
                                        • Covariance updates
                                        • Sparse updates
                                        • Weighted updates
                                        • Pathwise coordinate descent
                                        • Other details
                                          • Regularized logistic regression
                                          • Regularized multinomial regression
                                            • Regularization and parameter ambiguity
                                            • Grouped and matrix responses
                                              • Timings
                                                • Regression with the lasso
                                                • Lasso-logistic regression
                                                • Real data
                                                • Other comparisons
                                                  • Selecting the tuning parameters
                                                  • Discussion
                                                  • Proof of Theorem 1

                                        20 Regularization Paths for GLMs via Coordinate Descent

                                        Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

                                        Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

                                        Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

                                        Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

                                        Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

                                        Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

                                        Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

                                        Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

                                        Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

                                        Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

                                        Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

                                        Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

                                        Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

                                        Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

                                        Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

                                        Journal of Statistical Software 21

                                        Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

                                        R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

                                        Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

                                        Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

                                        Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

                                        Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

                                        Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

                                        Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

                                        Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

                                        Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

                                        Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

                                        Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

                                        Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

                                        Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

                                        Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

                                        22 Regularization Paths for GLMs via Coordinate Descent

                                        A Proof of Theorem 1

                                        We have

                                        cj = arg mint

                                        Ksum`=1

                                        [12(1minus α)(βj` minus t)2 + α|βj` minus t|

                                        ] (32)

                                        Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

                                        Ksum`=1

                                        [minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

                                        where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

                                        t = βj +1K

                                        α

                                        1minus α

                                        Ksum`=1

                                        sj` (34)

                                        It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

                                        Affiliation

                                        Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

                                        Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

                                        Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15
