Linear Regression Models
Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
Here the X’s might be:
•Raw predictor variables (continuous or coded-categorical)
•Transformed predictors (X4=log X3)
•Basis expansions (X4=X3^2, X5=X3^3, etc.)
•Interactions (X4=X2 X3) – all of these can be built directly in R, as sketched below
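Such derived predictors can be constructed in an R model formula via model.matrix (a small sketch; the data frame df and the variables x1, x2, x3 are made up for illustration):

# Hypothetical data frame with three raw predictors
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = runif(20, 1, 10))

# Transformed predictor, basis expansion, and interaction in one design matrix
X <- model.matrix(~ x1 + x2 + log(x3) + I(x3^2) + x2:x3, data = df)
head(X)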
Popular choice for estimation is least squares:
RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
Least Squares
RSS(\beta) = (y - X\beta)^T (y - X\beta)

\hat{\beta} = (X^T X)^{-1} X^T y

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y        (X(X^T X)^{-1}X^T is the "hat matrix")

Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals
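A quick numeric check of these formulas in R (a sketch on simulated data; lm() is used only to confirm the closed-form solution):

set.seed(1)
N <- 50
X <- cbind(1, matrix(rnorm(N * 3), N, 3))          # design matrix with intercept column
beta <- c(1, 2, -1, 0.5)
y <- X %*% beta + rnorm(N)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)          # (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)              # hat matrix
y_hat <- H %*% y                                   # fitted values

t(beta_hat)
coef(lm(y ~ X[, -1]))                              # agrees with the closed form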
Gauss-Markov Theorem

Consider any linear combination of the β's:

\theta = a^T \beta

The least squares estimate of θ is:

\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat{\theta}) = E(a^T (X^T X)^{-1} X^T y) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y (i.e., E(c^T y) = a^T \beta):

Var(a^T \hat{\beta}) \le Var(c^T y)

Of course, there might be a biased estimator with lower MSE…
Bias-Variance

For any estimator \tilde{\theta}:

MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                    = E(\tilde{\theta} - E\tilde{\theta})^2 + (E\tilde{\theta} - \theta)^2
                    = Var(\tilde{\theta}) + (E\tilde{\theta} - \theta)^2

where E\tilde{\theta} - \theta is the bias.
Note MSE closely related to prediction error:
E(Y_0 - x_0^T \tilde{\theta})^2 = E(Y_0 - x_0^T \theta)^2 + E(x_0^T \theta - x_0^T \tilde{\theta})^2 = \sigma^2 + MSE(x_0^T \tilde{\theta})
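A tiny R simulation of the decomposition, using a deliberately shrunken estimator of a normal mean (a toy setup assumed here, not taken from the slides):

set.seed(2)
theta <- 2                                              # true parameter
est <- replicate(10000, 0.7 * mean(rnorm(20, theta)))   # biased, shrunken estimator

mse    <- mean((est - theta)^2)
decomp <- var(est) + (mean(est) - theta)^2              # Var + bias^2
c(MSE = mse, Var_plus_bias2 = decomp)                   # agree up to simulation noise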
Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions:"
1. Subset selection
2. Shrinkage/Ridge Regression
3. Derived Inputs
Subset Selection
•Standard "all-subsets" finds the subset of size k, k=1,…,p, that minimizes RSS
•"Leaps and bounds" is an efficient algorithm to do all-subsets; stepwise methods are a common alternative (see the sketch below)
•Choice of subset size requires a tradeoff – score with AIC, BIC, marginal likelihood, cross-validation, etc.
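A minimal all-subsets run in R using the leaps package (a sketch with simulated data; regsubsets returns the best subset of each size, which can then be scored):

library(leaps)
set.seed(3)
X <- matrix(rnorm(100 * 8), 100, 8)
colnames(X) <- paste0("x", 1:8)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

fit <- regsubsets(x = X, y = y, nvmax = 8)   # leaps-and-bounds all-subsets
summary(fit)$bic                             # e.g. score each subset size by BIC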
Cross-Validation
•e.g. 10-fold cross-validation (see the sketch below):
  – Randomly divide the data into ten parts
  – Train the model using nine tenths and compute the prediction error on the remaining tenth
  – Do this for each tenth of the data
  – Average the 10 prediction error estimates
•"One standard error rule": pick the simplest model within one standard error of the minimum
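A bare-bones 10-fold cross-validation loop for a linear model in R (a sketch with simulated data):

set.seed(4)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

folds <- sample(rep(1:10, length.out = n))     # random assignment to ten parts
cv_err <- sapply(1:10, function(k) {
  fit <- lm(y ~ x, data = dat[folds != k, ])   # train on nine tenths
  pred <- predict(fit, newdata = dat[folds == k, ])
  mean((dat$y[folds == k] - pred)^2)           # error on the held-out tenth
})
mean(cv_err)                                   # CV estimate of prediction error
sd(cv_err) / sqrt(10)                          # its standard error ("one SE rule")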
Shrinkage Methods
•Subset selection is a discrete process – individual variables are either in or out
•This method can have high variance – a different dataset from the same source can result in a totally different model
•Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
Ridge Regression
\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2

subject to:

\sum_{j=1}^{p} \beta_j^2 \le s
Equivalently:
\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}
This leads to:
\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y        (works even when X^T X is singular)

Choose λ by cross-validation; λ controls the effective number of X's. Predictors should be centered.
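The closed-form solution is easy to compute directly in R (a sketch on simulated, centered data; in practice MASS::lm.ridge does this and also standardizes the predictors):

set.seed(5)
X <- scale(matrix(rnorm(50 * 4), 50, 4), center = TRUE, scale = FALSE)  # centered predictors
y <- X %*% c(1, -1, 0.5, 0) + rnorm(50)
yc <- y - mean(y)                                # center y so the intercept drops out
lambda <- 2

beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% yc)
t(beta_ridge)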
Ridge Regression = Bayesian Regression
y_i ~ N(\beta_0 + x_i^T \beta, \sigma^2)

\beta_j ~ N(0, \tau^2)

same as ridge with \lambda = \sigma^2 / \tau^2 (the posterior mean of β is then the ridge estimate)
The Lasso
\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2

subject to:

\sum_{j=1}^{p} |\beta_j| \le s
A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.
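In practice the lasso is usually fit with specialized algorithms rather than generic quadratic programming, e.g. the glmnet package, with the penalty chosen by cross-validation (a sketch on simulated data):

library(glmnet)
set.seed(6)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

cvfit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cvfit, s = "lambda.1se")         # coefficients at the "one standard error" choice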
More generally, penalize \sum_j |\beta_j|^q:

\tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}

q=0: variable selection; q=1: lasso; q=2: ridge. Learn q?
[Figure: coefficient profiles plotted as a function of 1/λ]
Principal Component Regression

Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X):

X^T X = V D^2 V^T

(X is N x p and is first centered.)

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 ≥ d_2 ≥ … ≥ d_p.

Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:

Var(Xv_1) = d_1^2 / N

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.
Principal Component Regression

PC Regression regresses y on the first M principal components, where M < p.
Similar to ridge regression in some respects – see HTF, p.66
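A bare-bones principal component regression in R using prcomp (a sketch; the pcr() function in the pls package does this more conveniently):

set.seed(7)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X %*% c(2, 1, 0, 0, 0, 0) + rnorm(100)

M <- 2                                      # number of components to keep (M < p)
pc <- prcomp(X, center = TRUE)              # principal components of centered X
Z <- pc$x[, 1:M]                            # scores on the first M components
coef(lm(y ~ Z))                             # regress y on them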
www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf
# Simulate two predictors and a response, and plot y against each with
# the corresponding no-intercept least squares line
x1 <- rnorm(10)
x2 <- rnorm(10)
y <- (3 * x1) + x2 + rnorm(10, 0.1)
par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))
# Incremental forward-stagewise fitting: repeatedly nudge the coefficient of
# whichever predictor is more correlated with the current residual
epsilon <- 0.1
r <- y                  # current residual
beta <- c(0, 0)
numIter <- 25

for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    # step beta[1] by epsilon in the direction of the sign of x1'r;
    # drop() turns the 1 x 1 matrix from %*% into a scalar
    delta <- epsilon * ((2 * (drop(r %*% x1) > 0)) - 1)
    beta[1] <- beta[1] + delta
    r <- r - (delta * x1)
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  }
  if (cor(x1, r) <= cor(x2, r)) {
    # step beta[2] by epsilon in the direction of the sign of x2'r
    delta <- epsilon * ((2 * (drop(r %*% x2) > 0)) - 1)
    beta[2] <- beta[2] + delta
    r <- r - (delta * x2)
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}
LARS
► Start with all coefficients bj = 0
► Find the predictor xj most correlated with y
► Increase bj in the direction of the sign of its correlation with y. Take residuals r = y - yhat along the way. Stop when some other predictor xk has as much correlation with r as xj has
► Increase (bj, bk) in their joint least squares direction until some other predictor xm has as much correlation with the residual r
► Continue until all predictors are in the model (see the lars sketch below)
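LARS is implemented in the lars package (a sketch; type = "lar" runs plain LARS, while type = "lasso" gives the closely related lasso path):

library(lars)
set.seed(8)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X[, 1] + 0.5 * X[, 3] + rnorm(100)

fit <- lars(X, y, type = "lar")
plot(fit)      # coefficient paths as predictors enter the model
coef(fit)      # coefficients at each step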
Fused Lasso (Tibshirani et al. 2005)
• If there are many correlated features, the lasso gives non-zero weight to only one of them
• Maybe correlated features (e.g. time-ordered) should have similar coefficients?
Group Lasso (Yuan and Lin 2006)
• Suppose you represent a categorical predictor with indicator variables
• Might want the set of indicators to be in or out
regular lasso penalty:  \lambda \sum_{j=1}^{p} |\beta_j|

group lasso penalty:  \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|\beta_g\|_2        (β_g is the coefficient sub-vector for group g, with p_g members)