Chapter 11: Multiple Linear Regression
Introduction | Theory | SAS | Summary
By: Airelle, Bochao, Chelsea, Menglin, Reezan, Tim, Wu, Xinyi, Yuming
Feb 25, 2016
Introduction
Regression Analysis in the Making
The earliest form of regression analysis was the method of least squares, published by Legendre in 1805 in the paper "Nouvelles méthodes pour la détermination des orbites des comètes."[1]
Legendre used least squares to study the orbits of comets around the Sun.
"Sur la Méthode des moindres quarrés" ("On the method of least squares") appeared as an appendix to the paper.
Adrien-Marie Legendre (1752-1833)
[1] A.-M. Legendre. "Nouvelles méthodes pour la détermination des orbites des comètes." Firmin Didot, Paris, 1805. "Sur la Méthode des moindres quarrés" appears as an appendix.
Regression Analysis in the Making
Gauss also developed the method of least squares, for the purpose of analyzing astronomical observations.
In 1809 he published Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (Theory of the motion of the heavenly bodies moving about the Sun in conic sections), which contained his treatment of least squares.[1]
Johann Carl Friedrich Gauss (1777-1855)
Shown here on the 10 Deutsche Mark banknote!
[1] C. F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. (1809)
Why "regression"?
The term was coined in the 19th century by Sir Francis Galton.[1]
Sir Francis Galton (1822-1911)
[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. Web. 20 Nov. 2013.
Why "regression"?
The term was used to describe how the heights of descendants of tall ancestors tend to "regress" down toward the average height of the current generation; this is also known as "regression towards the mean."[1]
[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. Web. 20 Nov. 2013.
Fun Fact
Before 1970, one run of linear regression could take up to 24 hours on an electromechanical desk calculator.[1]
[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. Web. 20 Nov. 2013.
Uses of linear regression
1. Making predictions: build the model by fitting linear regression to an observed set of data and outcomes, then predict the outcome for new, unobserved data.
2. Correlating data: quantify the strength of the relationship between two sets of data (where one is not necessarily "causal" to the other).
Theory
Multiple Linear Regression
• Review: in simple linear regression we have only one predictor variable, and the model is
  $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$,  $i = 1, \ldots, n$
• What if there is more than one predictor?
• The multiple linear regression model is the generalization of simple linear regression to more than one independent variable.
• Given data $(x_{i1}, x_{i2}, \ldots, x_{ik}; y_i)$, $i = 1, \ldots, n$, we fit a model of the form
  $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$,  $i = 1, \ldots, n$
  – $x_{i1}, x_{i2}, \ldots, x_{ik}$: the $k \ge 2$ predictor variables
  – $\beta_0, \beta_1, \ldots, \beta_k$: the $k+1$ unknown parameters
  – $\epsilon_i$: a random error
• Note: the model is called "linear" because it is linear in the $\beta$'s, not necessarily in the x's.
• For example, $y_i$ may be the salary of the ith person in the sample, $x_{i1}$ the years of experience, and $x_{i2}$ the years of education.
Graph 1. Regression plane for the model with 2 predictor variables. (Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis)
• Here, we assume that the random errors $\epsilon_i$ are independent $N(0, \sigma^2)$ r.v.'s.
• It follows that the $Y_i$ are independent r.v.'s with $Y_i \sim N(\mu_i, \sigma^2)$, where
  $\mu_i = E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$,  $i = 1, 2, \ldots, n$
Fitting the Multiple Regression Model
• Least Squares (LS) Fit: to obtain estimates of the unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ we minimize
  $Q = \sum_{i=1}^{n} \bigl[ y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) \bigr]^2$
• We set the first partial derivatives of Q with respect to $\beta_0, \beta_1, \ldots, \beta_k$ equal to zero:
  $\dfrac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \bigl[ y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \bigr] = 0$
  $\dfrac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} \bigl[ y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \bigr] x_{ij} = 0$,  $j = 1, \ldots, k$
Simplification leads to the following normal equations:
  $\beta_0\, n + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$
  $\beta_0 \sum_{i=1}^{n} x_{ij} + \beta_1 \sum_{i=1}^{n} x_{ij} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ij} x_{ik} = \sum_{i=1}^{n} x_{ij} y_i$,  $j = 1, 2, \ldots, k$
The resulting solutions are the least squares (LS) estimates of $\beta_0, \beta_1, \ldots, \beta_k$ and are denoted by $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$, respectively.
• Goodness of Fit of the Model
• We use the residuals, defined by
  $e_i = y_i - \hat y_i$,  $i = 1, 2, \ldots, n$
  where the $\hat y_i$ are the fitted values:
  $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik}$,  $i = 1, 2, \ldots, n$
• As an overall measure of the goodness of fit, we use the error sum of squares
  $SSE = \sum_{i=1}^{n} e_i^2$
  (which is the minimum value of Q). We compare this SSE to the total sum of squares
  $SST = \sum (y_i - \bar y)^2$
• Define the regression sum of squares by SSR = SST − SSE. The ratio of SSR to SST is called the coefficient of multiple determination:
  $r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
• $r^2$ ranges between 0 and 1, with values closer to 1 representing better fits. Note that adding more predictor variables to a model generally increases $r^2$. The positive square root $r$ of $r^2$ is the multiple correlation coefficient.
• Multiple Regression Model in Matrix Notation
• Let
  $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$, $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, $\boldsymbol\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)'$
  be the n×1 vectors of the r.v.'s $Y_i$, their observed values $y_i$, and the random errors $\epsilon_i$, respectively. Let
  $X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$
  be the n×(k+1) matrix of the values of the predictor variables.
• Let
  $\boldsymbol\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $\hat{\boldsymbol\beta} = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)'$
  be the (k+1)×1 vectors of the unknown parameters and their LS estimates, respectively.
• Then the model can be written as
  $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\epsilon$
• The simultaneous linear system of the normal equations can be written as
  $X'X\hat{\boldsymbol\beta} = X'\mathbf{y}$
• If the inverse of the matrix X'X exists, then the solution is given by
  $\hat{\boldsymbol\beta} = (X'X)^{-1}X'\mathbf{y}$
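As a quick numerical illustration, the LS solution can be computed directly in SAS/IML. A minimal sketch: the toy response and design matrix below reuse the first five rows of the Section 11.6 cement data (y; x1, x2), though any small dataset would do.

PROC IML;
   /* toy data: first five rows of the Section 11.6 cement data (y; x1, x2) */
   y = {78.5, 74.3, 104.3, 87.6, 95.9};
   X = {1  7 26,
        1  1 29,
        1 11 56,
        1 11 31,
        1  7 52};                     /* first column of 1s for the intercept */
   beta_hat = INV(X`*X) * X` * y;     /* LS estimates: (X'X)^{-1} X'y */
   PRINT beta_hat;
QUIT;

In practice, SOLVE(X`*X, X`*y) is numerically preferable to forming the explicit inverse.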
• Generalized linear model (GLM): the generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
• The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Source: http://en.wikipedia.org/wiki/Generalized_linear_model
Statistical Inference for Multiple Regression
• First, we assume that the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$.
• Then, in order to determine which predictor variables have statistically significant effects on the response variable, we test the hypotheses:
  $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \ne 0$
• If we reject $H_{0j}: \beta_j = 0$, then $x_j$ is a significant predictor of y.
• It can be shown that each $\hat\beta_j$ is normally distributed with mean $\beta_j$ and variance $\sigma^2 v_{jj}$, where $v_{jj}$ is the jth diagonal entry (j = 0, 1, …, k) of the matrix $V = (X'X)^{-1}$.
• But how can we get the mean and variance?
Mean: $\hat{\boldsymbol\beta}$ is unbiased, similar to $\hat\beta_0$ and $\hat\beta_1$ in simple linear regression. We can then get the following:
  $E(\hat{\boldsymbol\beta}) = \bigl( E(\hat\beta_0), E(\hat\beta_1), \ldots, E(\hat\beta_k) \bigr)' = (\beta_0, \beta_1, \ldots, \beta_k)' = \boldsymbol\beta$
Variance:
From the assumption that the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$, we have
  $Var(\mathbf{Y}) = \sigma^2 I = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$
Since $\hat{\boldsymbol\beta} = (X'X)^{-1}X'\mathbf{Y}$, we can get
  $Var(\hat{\boldsymbol\beta}) = \sigma^2 (X'X)^{-1}$
Letting $v_{jj}$ be the jth diagonal entry (j = 0, 1, …, k) of the matrix $V = (X'X)^{-1}$, we can get
  $Var(\hat\beta_j) = \sigma^2 v_{jj}$
Derivation of the PQ for the inference:
The unbiased estimator of the unknown error variance $\sigma^2$ is given by
  $s^2 = \dfrac{SSE}{n-(k+1)} = \dfrac{\sum e_i^2}{n-(k+1)} = MSE$
Here, MSE is the error mean square and n − (k+1) is its degrees of freedom.
Let
  $W = \dfrac{(n-(k+1))\, S^2}{\sigma^2} = \dfrac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$
and note that $S^2$ and $\hat\beta_j$ are statistically independent.
Recalling the definition of the t-distribution, we can obtain the pivotal quantity:
  $Z = \dfrac{\hat\beta_j - \beta_j}{\sigma \sqrt{v_{jj}}} \sim N(0, 1)$
  $T = \dfrac{Z}{\sqrt{W/(n-(k+1))}} = \dfrac{\hat\beta_j - \beta_j}{S \sqrt{v_{jj}}} \sim t_{n-(k+1)}$
(Statistically independent: the occurrence of one event does not affect the outcome of the other event.)
Confidence interval:
A 100(1−α)% confidence interval on $\beta_j$ is obtained from
  $P\bigl( -t_{n-(k+1),\alpha/2} \le T \le t_{n-(k+1),\alpha/2} \bigr) = 1 - \alpha$
  $P\bigl( \hat\beta_j - t_{n-(k+1),\alpha/2}\, S\sqrt{v_{jj}} \le \beta_j \le \hat\beta_j + t_{n-(k+1),\alpha/2}\, S\sqrt{v_{jj}} \bigr) = 1 - \alpha$
So, the confidence interval for $\beta_j$ is:
  $\hat\beta_j \pm t_{n-(k+1),\alpha/2}\, SE(\hat\beta_j)$, where $SE(\hat\beta_j) = S\sqrt{v_{jj}}$
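As a worked check, using the cement-data regression output shown in Section 11.6 below (n = 13, k = 4, so df = 8; x1 has $\hat\beta_1 = 1.5511$ with t-ratio 2.08, giving $SE(\hat\beta_1) = 1.5511/2.08 \approx 0.746$):

\[
\hat\beta_1 \pm t_{8,\,.025}\, SE(\hat\beta_1) = 1.5511 \pm 2.306 \times 0.746 \approx (-0.17,\ 3.27)
\]

The interval contains 0, consistent with the nonsignificant p-value (.071) reported for x1 there.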
Derivation of the hypothesis test for $\beta_j$ at level α:
  $H_{0j}: \beta_j = \beta_j^0$ vs. $H_{1j}: \beta_j \ne \beta_j^0$
Test statistic:
  $t_j = \dfrac{\hat\beta_j - \beta_j^0}{SE(\hat\beta_j)}$
Reject $H_{0j}$ if
  $|t_j| > t_{n-(k+1),\alpha/2}$
Another hypothesis test: is the model useful at all?
  $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ (none of the predictors $x_j$ is related to y)
  $H_a:$ at least one $\beta_j \ne 0$ (at least one predictor is related to y)
The test statistic is
  $F = \dfrac{MSR}{MSE} \sim f_{k,\, n-(k+1)}$, where $MSR = \dfrac{SSR}{k}$ and $MSE = \dfrac{SSE}{n-(k+1)}$
By using the formula $r^2 = SSR/SST$, we have
  $F = \dfrac{SSR/k}{SSE/(n-(k+1))} = \dfrac{n-(k+1)}{k} \cdot \dfrac{r^2}{1-r^2}$
We can see that F is an increasing function of the coefficient of multiple determination $r^2$, and in this form F is used to test the statistical significance of $r^2$, which is equivalent to testing $H_0$.
Reject $H_0$ if $F > f_{k,\, n-(k+1),\, \alpha}$.
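As a quick numerical check, using the cement-data ANOVA from Section 11.6 below (SSR = 2667.90, SST = 2715.76, n = 13, k = 4):

\[
r^2 = \frac{2667.90}{2715.76} = 0.982, \qquad
F = \frac{13-5}{4} \cdot \frac{0.982}{1-0.982} \approx 111.5
\]

matching the F = 111.48 printed in that ANOVA table (up to rounding).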
Extra Sum of Squares Method for Testing Subsets of Parameters
Consider the full model:
  $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$
and the partial model obtained by setting the last m coefficients to zero:
  $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$
To test whether the full model is significantly better than the partial model, we test
  $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$
Let $SSE_k$ and $SSE_{k-m}$ denote the error sums of squares of the full and partial models. Since SST is fixed regardless of the particular model, SST = SSR_k + SSE_k = SSR_{k-m} + SSE_{k-m}, so
  $SSR_k - SSR_{k-m} = SSE_{k-m} - SSE_k$
The test statistic is
  $F = \dfrac{(SSE_{k-m} - SSE_k)/m}{SSE_k/(n-(k+1))}$
Reject $H_0$ if $F > f_{m,\, n-(k+1),\, \alpha}$.
Numerator m: the number of coefficients set to zero. Denominator df, n − (k+1): the error df for the full model.
The extra sum of squares $SSE_{k-m} - SSE_k$ in the numerator represents the part of the variation in y that is accounted for by regression on the m extra predictors; it is divided by m to get an average contribution per term.
ANOVA table

Source of Variation | DF        | SS  | MS                    | F
--------------------|-----------|-----|-----------------------|------------
Regression          | k         | SSR | MSR = SSR/k           | F = MSR/MSE
Error               | n − (k+1) | SSE | MSE = SSE/(n − (k+1)) |
Total               | n − 1     | SST |                       |
Links between ANOVA and the extra sum of squares method: let m = k, so the partial model contains only the constant term $\beta_0$. Then the fitted value for every observation is $\bar y$, and
  $SSE_0 = \sum_{i=1}^{n} (y_i - \bar y)^2 = SST$
  $SSE_0 - SSE_k = SST - SSE = SSR$
so the extra sum of squares F statistic reduces to
  $F = \dfrac{SSR/k}{SSE/(n-(k+1))} = \dfrac{MSR}{MSE} \sim f_{k,\, n-(k+1)}$
Prediction of Future Observations
• Having fitted a multiple regression model, suppose that we want to predict the future value $Y^*$ of y for a specified vector of predictor variables $\mathbf{x}^* = (1, x_1^*, \ldots, x_k^*)'$. (Notice that we have included 1 as the first component of the vector to correspond to the constant term in the model.)
• One way is to estimate $\mu^* = E(Y^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^* = \mathbf{x}^{*\prime}\boldsymbol\beta$ by a confidence interval (CI). We already have the point estimate
  $\hat\mu^* = \mathbf{x}^{*\prime}\hat{\boldsymbol\beta}$
and
  $Var(\hat\mu^*) = \sigma^2\, \mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*$
• Replacing σ by its estimate s, which has n − (k+1) df, the pivotal quantity is
  $T = \dfrac{\hat\mu^* - \mu^*}{s\sqrt{\mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*}} \sim t_{n-(k+1)}$
A level 100(1−α)% CI for $\mu^*$ is given by
  $\hat\mu^* \pm t_{n-(k+1),\alpha/2}\; s\sqrt{\mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*}$
• Another way is to predict $Y^*$ by a prediction interval (PI). We know
  $Y^* \sim N(\beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^*,\ \sigma^2)$
The prediction error $Y^* - \hat Y^*$ is the difference between two independent r.v.'s, with mean 0 and variance
  $\sigma^2\bigl[1 + \mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*\bigr]$
• Replacing σ by its estimate s, which has n − (k+1) df, the pivotal quantity is
  $T = \dfrac{Y^* - \hat Y^*}{s\sqrt{1 + \mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*}} \sim t_{n-(k+1)}$
A level 100(1−α)% PI for $Y^*$ is given by
  $\hat Y^* \pm t_{n-(k+1),\alpha/2}\; s\sqrt{1 + \mathbf{x}^{*\prime}(X'X)^{-1}\mathbf{x}^*}$
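In SAS, both intervals are available from PROC REG. A minimal sketch (the dataset and variable names are placeholders, not from the slides); a common trick is to append the new x* values to the dataset as a row with a missing y, so the intervals are printed for that row as well:

PROC REG DATA=mydata;
   MODEL y = x1 x2 / CLM CLI;   /* CLM: 95% CI for E(Y); CLI: 95% PI for a new Y */
RUN;
QUIT;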
Residual Analysis
• Recall that
  $\hat{\mathbf{y}} = X\hat{\boldsymbol\beta} = X(X'X)^{-1}X'\mathbf{y} = H\mathbf{y}$
  where $H = X(X'X)^{-1}X'$ is called the hat matrix.
• Standardized residuals are given by
  $e_i^* = \dfrac{e_i}{SE(e_i)} = \dfrac{e_i}{s\sqrt{1 - h_{ii}}}$
  Here $h_{ii}$ is the ith diagonal element of the hat matrix H. Large $|e_i^*|$ values indicate outlier observations.
• Moreover,
  $\mathrm{trace}(H) = \sum_{i=1}^{n} h_{ii} = k + 1$
  so the average diagonal element is $(k+1)/n$. We conclude the ith observation is influential if
  $h_{ii} \ge \dfrac{2(k+1)}{n}$
  i.e., if its leverage exceeds twice the average.
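A sketch of how these diagnostics might be pulled out of PROC REG (the dataset and variable names, and the n = 20, k = 2 cutoff below, are illustrative assumptions):

PROC REG DATA=mydata;
   MODEL y = x1 x2 / R INFLUENCE;               /* prints residual and influence diagnostics */
   OUTPUT OUT=diag STUDENT=stdres H=leverage;   /* save e*_i and h_ii to a dataset */
RUN;
QUIT;

DATA flagged;                                   /* flag outliers and high-leverage points */
   SET diag;
   outlier = (ABS(stdres) > 2);                 /* |e*_i| > 2 */
   highlev = (leverage > 2*(2+1)/20);           /* h_ii > 2(k+1)/n, with k = 2, n = 20 assumed */
RUN;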
Data Transformation
• Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality and constant variance. Many seemingly nonlinear models can be written as the multiple linear regression model after making a suitable transformation.
Example:
  $y = \beta_0\, x_1^{\beta_1} x_2^{\beta_2}\, \epsilon$
We can do the transformation by taking ln on both sides. Then we have
  $\ln(y) = \ln(\beta_0) + \beta_1 \ln(x_1) + \beta_2 \ln(x_2) + \ln(\epsilon)$
Let
  $y^* = \ln(y),\ \beta_0^* = \ln(\beta_0),\ \beta_1^* = \beta_1,\ x_1^* = \ln(x_1),\ \beta_2^* = \beta_2,\ x_2^* = \ln(x_2),\ \epsilon^* = \ln(\epsilon)$
We now have
  $y^* = \beta_0^* + \beta_1^* x_1^* + \beta_2^* x_2^* + \epsilon^*$
which is a linear model.
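This transformation is a one-line DATA step in SAS. A minimal sketch, assuming an input dataset raw with variables y, x1, x2 (all names are placeholders):

DATA logmodel;                /* build the transformed variables */
   SET raw;
   ystar  = LOG(y);           /* LOG is the natural log in SAS */
   x1star = LOG(x1);
   x2star = LOG(x2);
RUN;

PROC REG DATA=logmodel;       /* fit ln(y) = b0* + b1 ln(x1) + b2 ln(x2) */
   MODEL ystar = x1star x2star;
RUN;
QUIT;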
SAS: Code, Tables, and Graphs
Voting Example
• Setup: data on individual state voting percentages for the winners of the last fifteen (15) U.S. presidential elections.
  y  = New York voting percentage ('ny')
  x1 = California voting percentage ('ca')
  x2 = South Carolina voting percentage ('sc')
  x3 = Wisconsin voting percentage ('wi')
• Goal: see if there is any positive correlation between NY's and California's (two traditionally Democratic states) voting patterns, or a negative correlation between NY's and South Carolina's (one Democratic, one Republican state).
• Note: Wisconsin was included as a variable although its traditional stance is (seemingly) more ambiguous.
Source: <http://www.presidency.ucsb.edu/elections.php>
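A minimal PROC REG call for this setup might look like the following (the dataset name voting is an assumption; the variable names follow the slide):

PROC REG DATA=voting;
   TITLE "NY vs. CA, SC, WI Voting Percentages";
   MODEL ny = ca sc wi;   /* y = NY percentage; predictors per the setup above */
RUN;
QUIT;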
Section 11.6
Multicollinearity
Standardized Regression Coefficients
Dummy Variables
Multicollinearity (Section 11.6.1)
• The "multicollinearity problem": the columns of X are approximately linearly dependent.
• Consequences:
  – the columns of X'X are approximately linearly dependent
  – det(X'X) ≈ 0
  – X'X is approximately singular (non-invertible)
  – the estimates $\hat{\boldsymbol\beta}$ are unstable (large variance) or impossible to calculate
Multicollinearity (Section 11.6.1)
• One cause: the predictor variables x1, x2, …, xk are highly correlated.
  – Example: x1 = income, x2 = expenditures, x3 = savings
• Exact linear relationship: x3 = x1 − x2 → we don't want this.
• Conclusion: only two (2) of the three variables should be included.
Multicollinearity (from Tamhane/Dunlop – pg. 416)
• Example 11.5: Data on the heat evolved in calories during hardening of cement (y) along with percentages of four ingredients (x1, x2, x3, x4).
– Table 11.2: Cement Data

No.  x1  x2  x3  x4    y
 1    7  26   6  60   78.5
 2    1  29  15  52   74.3
 3   11  56   8  20  104.3
 4   11  31   8  47   87.6
 5    7  52   6  33   95.9
 6   11  55   9  22  109.2
 7    3  71  17   6  102.7
 8    1  31  22  44   72.5
 9    2  54  18  22   93.1
10   21  47   4  26  115.9
11    1  40  23  34   83.8
12   11  66   9  12  113.3
13   10  68   8  12  109.4
Multicollinearity (from Tamhane/Dunlop – pg. 416 cont’d)
• Linear relationship: x2 + x4 ≈ 80 (again, we don't want this).
• Results (Tamhane/Dunlop):

– Correlations:
        x1      x2      x3      x4
x2   0.229
x3  -0.824  -0.139
x4  -0.245  -0.973   0.030
y    0.731   0.816  -0.535  -0.821

– Regression:
Predictor    Coef     t-ratio  p-value
Constant    62.41      0.89    0.399
x1           1.5511    2.08    0.071
x2           0.5102    0.70    0.501
x3           0.1019    0.14    0.896
x4          -0.1441   -0.20    0.844
Multicollinearity (from Tamhane/Dunlop – pg. 416 cont’d)
– ANOVA:
SOURCE      DF    SS       MS      F       p-value
Regression   4   2667.90  666.97  111.48   0.000
Error        8     47.86    5.98
Total       12   2715.76

• Conclusion: no individual coefficient is significant at the .05 level (x1 comes closest, p = 0.071), yet the model as a whole is a highly significant fit, a classic symptom of multicollinearity.
• Further work: a better model includes only x1 and x2.
Standardized Regression Coefficients (Section 11.6.4)
• Purpose: allows the user to compare predictors in terms of the magnitudes of their effects on the response variable y.
  – The magnitudes of the $\beta_j$'s depend on the units of the $x_j$'s and y → we need to "standardize".
Standardized Regression Coefficients (Section 11.6.4)
• Result: work with standardized variables, i.e. each $x_j$ and y centered by its sample mean and scaled by its sample standard deviation. The standardized regression coefficients then solve
  $R\,\hat{\boldsymbol\beta}^* = \mathbf{r}$
  where R is the correlation matrix of the predictors and $\mathbf{r}$ is the vector of correlations between the predictors and y.
• All entries of R and $\mathbf{r}$ are between −1 and 1 → more stable computation of $\hat{\boldsymbol\beta}^*$.
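In PROC REG, standardized coefficients can be requested directly with the STB option on the MODEL statement; a minimal sketch with placeholder names:

PROC REG DATA=mydata;
   MODEL y = x1 x2 x3 / STB;   /* STB adds a column of standardized estimates */
RUN;
QUIT;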
Dummy Predictor Variables (Section 11.6.3)
• Dummy variables allow inclusion of categorical data in the regression model.
  – Example: "gender" as a predictor variable → x = 1 for female, x = 0 for male.
• k ≥ 2 categories → k − 1 dummy variables x1, x2, …, x_{k−1}, where xi = 1 for the ith category and 0 otherwise.
• If all k dummies were used, then x1 + x2 + ⋯ + xk = 1 for every observation → we avoid this linear dependence (with the intercept) by not including xk.
Dummy Predictor Variables (Section 11.6.3)
• Example 11.6 (Tamhane/Dunlop): Winter → (x1, x2, x3) = (1, 0, 0); Spring → (x1, x2, x3) = (0, 1, 0); Summer → (x1, x2, x3) = (0, 0, 1); Fall → (x1, x2, x3) = (0, 0, 0)
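Coding Example 11.6's season dummies by hand is a short DATA step; a sketch, assuming a dataset with a character variable season (all names are placeholders):

DATA seasons;
   SET mydata;
   x1 = (season = 'Winter');   /* boolean comparisons yield 1/0 in SAS */
   x2 = (season = 'Spring');
   x3 = (season = 'Summer');   /* Fall is the baseline: (x1,x2,x3) = (0,0,0) */
RUN;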
• General Linear Model: at least one (1) categorical regressor in our model (i.e. at least one dummy variable x)
Dummy Predictor Variables in General Linear Model
• Consider a GLM representing k categories as the predictors.
  – k categories → k − 1 dummy variables (discussed earlier)
• Model:
  $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{k-1} x_{k-1} + \epsilon$
  where $x_i = 1$ if the ith category is true and $x_i = 0$ otherwise.
Dummy Predictor Variables in General Linear Model
• Interpretation of the β's:
  – $\beta_i$ is interpreted as the value added (or subtracted) to the expected value of Y if category i is true: $E(Y) = \beta_0 + \beta_i$.
  – Note: all $x_i = 0$ if category k (the baseline) is true, so $E(Y) = \beta_0$.
Dummy Predictor Variables in General Linear Model
• ANOVA (general definition):
  – Tests the null hypothesis that all groups of data come from the same population (same mean).
  – Setup: k groups with means $\mu_1, \mu_2, \ldots, \mu_k$; $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
  – Test statistic: the ratio of the between-group mean square to the within-group mean square.
  – Reject at significance level α if the statistic exceeds $f_{k-1,\, n-k,\, \alpha}$.
Dummy Predictor Variables in General Linear Model
• Our case (GLM with dummy variables): we test the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, where $\mu_i$ is the (population) mean for category i.
• But $\mu_i = \beta_0 + \beta_i$ (with $\beta_k = 0$ for the baseline), so we have simplified ANOVA hypotheses (plug in and subtract $\beta_0$):
  $H_0: \beta_1 = \beta_2 = \cdots = \beta_{k-1} = 0$
• Test statistic:
  $F = \dfrac{SSR/(k-1)}{SSE/(n-k)}$
Dummy Predictor Variables in General Linear Model
• Estimated model: $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_{k-1} x_{k-1}$
• Let $\bar y_i$ denote the sample mean of the observations in category i.
• Fact: (can be shown) $\hat\beta_0 = \bar y_k$ and $\hat\beta_0 + \hat\beta_i = \bar y_i$ for all categories.
Source: "Statistics 104: A Note on ANOVA and Dummy Variable Regression" – lecture by Sundeep Iyer (4/23/2010)
Dummy Predictor Variables in General Linear Model
• Conclusion: the F-test for this case is similar to that of a regression (letting the k − 1 dummies play the role of the predictors), with the same set of hypotheses.
Variable Selection Methods
Consider a model that includes all of the following variables:
WP     = Winning Percentage
NL     = National League Indicator (dummy variable)
AVG    = Batting Average
OBP    = On Base Percentage
HR     = Home Runs
B2     = Doubles
B3     = Triples
hitSB  = Stolen Bases (offense)
ERA    = Earned Run Average
SO     = Strike Outs
Errors = Errors*
*Note: Errors is a baseball term referring to the number of mistakes players have made.
DATA dataMLB5;
   INFILE DATALINES DSD;
   INFORMAT TEAM $21.;
   INPUT TEAM $ WP NL AVG OBP HR B2 B3 hitSB ERA SO Errors;
   LABEL WP="Winning Percentage"
         NL="National League Indicator"
         AVG="Batting Average"
         OBP="On Base Percentage"
         HR="Home Runs"
         B2="Doubles"
         B3="Triples"
         hitSB="Stolen Bases"
         ERA="Earned Run Average"
         SO="Strike Outs"
         Errors="Errors";
   DATALINES;
Boston Red Sox,0.599,0,0.277,0.349,178,363,29,123,3.79,1294,80
Tampa Bay Rays,0.564,0,0.257,0.329,165,296,23,73,3.74,1310,59
Baltimore Orioles,0.525,0,0.26,0.313,212,298,14,79,4.2,1169,54
...
San Francisco Giants,0.469,1,0.26,0.32,107,280,35,67,4,1256,107
Philadelphia Phillies,0.451,1,0.248,0.306,140,255,32,73,4.32,1199,97
Colorado Rockies,0.457,1,0.27,0.323,159,283,36,112,4.44,1064,90
;
RUN;
Source (MLB.com): <http://mlb.mlb.com/stats/sortable.jsp?c_id=mlb&tcid=mm_mlb_stats#statType=hitting&elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&%23167%3BionType=sp&page=1&ts=1385005626059&season=2013&season_type=ANY&playerType=QUALIFIER&sportCode='mlb'&league_code='MLB'&split=&team_id=&active_sw=&game_type='R'&position=&page_type=SortablePlayer&sortOrder='asc'&sortColumn=era&results=&perPage=50&timeframe=&last_x_days=&extended=0§ionType=sp>
Variable Selection Methods
Run the regression on all of the data:

PROC REG DATA=dataMLB5;
   TITLE "Regression - Whole Model";
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors / P R TOL VIF COLLIN;
RUN;
QUIT;
Variable Selection Methods
Some MODEL statement options:
P       compute predicted values
R       compute residuals
TOL     displays tolerance values for parameter estimates
VIF     computes variance-inflation factors
COLLIN  produces collinearity analysis
Variable Selection Methods
Variable Selection Methods
The output shows many very high p-values; some of these variables should not be in the regression.
Variable Selection Methods
Variable Selection Methods

Stepwise, Forward, and Backward Regression       | Best Subsets Regression
-------------------------------------------------|--------------------------------------------------
Successively adds/removes variables from the     | A subset of variables is chosen that optimizes a
model                                            | criterion
Final model not guaranteed to be optimal         | Final model is optimal (for that criterion)
Produces a single "final" model (in actuality    | Determines a specified number of best subsets of
there might be several, almost equally good      | each size; with k candidate variables there are
models)                                          | 2^k possible subsets
Not influenced by other considerations, such as  | For a large number of variables, requires
the practicality of including variables          | efficient algorithms to determine the optimum
                                                 | subset
Variable Selection Methods
Stepwise, Forward, and Backward Regression
Partial F-test: assume $x_1, \ldots, x_{p-1}$ are in the model. To determine if $x_p$ should be included, compare the following models:
(p−1)-model: $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \epsilon$
p-model:     $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \beta_p x_p + \epsilon$
Test statistic:
  $F_p = \dfrac{SSE_{p-1} - SSE_p}{SSE_p/(n-(p+1))} \sim f_{1,\, n-(p+1)}$ under $H_0: \beta_p = 0$
Since $F_p = t_p^2$, we reject $H_0$ (at significance level α) when
  $F_p > f_{1,\, n-(p+1),\, \alpha}$
Variable Selection Methods
Stepwise, Forward, and Backward Regression
Partial Correlation Coefficients: the partial correlation of $x_p$ with y, controlling for the variables $x_1, \ldots, x_{p-1}$ already in the model, satisfies
  $r^2_{yx_p \cdot x_1 \cdots x_{p-1}} = \dfrac{SSE_{p-1} - SSE_p}{SSE_{p-1}}$
and the partial F-statistic can be written as
  $F_p = \bigl(n-(p+1)\bigr) \dfrac{r^2_{yx_p \cdot x_1 \cdots x_{p-1}}}{1 - r^2_{yx_p \cdot x_1 \cdots x_{p-1}}}$
Variable Selection Methods
Stepwise Regression Algorithm (after the flowchart in Tamhane & Dunlop, p. 430):
1. Initialize p = 0 (no variables in the model).
2. Compute the partial F_i for each candidate variable x_i, i = p+1, …, k.
3. Is max F_i > F_IN? If no, STOP. If yes, enter the x_i producing the max F_i, set p ← p+1, and relabel the variables x_1, …, x_p.
4. Compute the partial F_i for each variable x_i in the model, i = 1, …, p.
5. Is min F_i < F_OUT? If yes, remove the x_i producing the min F_i, set p ← p−1, relabel the variables x_1, …, x_p, and repeat step 4.
6. Does p = k? If yes, STOP; otherwise go back to step 2.
Forward Selection

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=FORWARD SLENTRY=0.05;
RUN;
QUIT;

SELECTION= specifies the method (FORWARD, BACKWARD, or STEPWISE); SLENTRY= is the significance level for inclusion.
At each step, forward selection adds the variable producing the highest F value in the partial F-test, with the entry threshold set at 0.05 significance.
Forward Selection
Forward Selection – Final Step
F values of the variables within the final model.
Note:
- F values changed at each step as variables were added.
- For the FORWARD method, once a variable is included in the model, it always stays in the model (its relative significance is not reconsidered as additional variables are added).
Forward Selection - Summary
Results of partial F tests at each step
Backward Elimination

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=BACKWARD SLSTAY=0.05;
RUN;
QUIT;

SLSTAY= is the significance level required to remain in the model. Backward elimination removes variables, one by one, until all remaining variables have F values significant at the 0.05 level.
AVG will be removed in the next step.
Backward Elimination
AVG is no longer included. Variables will continue to be removed until the variable with the lowest F value still meets the specified significance level.
Backward Elimination - Final Step
F values of variables within the final model all meet the specified significance level.
Backward Elimination – Summary
Stepwise Selection
PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=STEPWISE SLENTRY=0.05 SLSTAY=0.05;
RUN;
QUIT;

SLENTRY= is the significance level to enter the model; SLSTAY= is the significance level to remain in it (variables are re-evaluated after each new variable is added).
Stepwise Selection - Final Model
F values of variables within the final model all meet the specified significance levels.
Stepwise Selection - Summary
If, at any step, a variable within the model no longer met the specified significance level, it would be removed.
Ultimately, none of the variables outside the model are significant, and all of the variables in the model are significant.
F values of variables at each step.
Variable Selection Methods – Best Subsets
Optimality Criteria

1) $r_p^2$-Criterion: $r_p^2 = 1 - SSE_p/SST$
• Maximized when all the variables are in the model.
• Only provides a goodness-of-fit consideration (not how well the model predicts).

2) Adjusted $r_p^2$-Criterion: $\bar r_p^2 = 1 - \dfrac{SSE_p/(n-(p+1))}{SST/(n-1)} = 1 - \dfrac{MSE_p}{SST/(n-1)}$
• Unlike $r_p^2$, adding variables can cause it to decrease.

3) $MSE_p$ Criterion: minimizing $MSE_p$ is essentially equivalent to maximizing $\bar r_p^2$.

4) $C_p$ Criterion: measures the ability of the model to predict Y-values.
• Select predictor variable vectors x (potentially representing a range for future predictions).
• Standardized mean square error of prediction:
  $\Gamma_p = \dfrac{1}{\sigma^2} \sum_{i=1}^{n} \Bigl[ Var(\hat Y_i) + \bigl(\mathrm{Bias}(\hat Y_i)\bigr)^2 \Bigr]$
• The bias term can be evaluated if we assume that the full model is the "true model."
• It should be noted that as we increase the number of variables, the prediction variance increases, since $\sum_i Var(\hat Y_i) = (p+1)\sigma^2$.
Variable Selection Methods – Best Subsets

4) $C_p$ Criterion (continued): Mallows' $C_p$-statistic,
  $C_p = \dfrac{SSE_p}{\hat\sigma^2} - \bigl(n - 2(p+1)\bigr)$,
is considered an "almost" unbiased estimate of $\Gamma_p$.
We need an estimate of $\sigma^2$; commonly the MSE for the whole (full) model is used for $\hat\sigma^2$.
Note: $C_p = p + 1$ for the full model (assumed to have zero bias), so $C_p = k + 1$ when p = k.
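As a sanity check on the full cement model of Section 11.6 (SSE = 47.86, MSE = 5.98, n = 13, p = k = 4):

\[
C_p = \frac{47.86}{5.98} - (13 - 2 \cdot 5) \approx 8.0 - 3 = 5.0 = p + 1
\]

consistent with the zero-bias convention for the full model.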
Variable Selection Methods – Best Subsets
5) PRESSp Criterion:
• Considers the impact of removing observations on predictability.
• Observations are removed one at a time, and the model is re-fit after each observation is removed.
LS estimates when the ith observation is removed: $\hat{\boldsymbol\beta}_{(i)}$
The predicted value for the observation that was just removed: $\hat y_{(i)} = \mathbf{x}_i'\hat{\boldsymbol\beta}_{(i)}$
Prediction error sum of squares (PRESS): $PRESS_p = \sum_{i=1}^{n} \bigl( y_i - \hat y_{(i)} \bigr)^2$
Select the model with the smallest $PRESS_p$.
Variable Selection Methods - Best Subsets
PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=RSQUARE BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=ADJRSQ BEST=10;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP BEST=10 ADJRSQ RSQUARE;
RUN;
QUIT;

RSQUARE finds r² for subsets of models; with BEST=1, it reports the model with the largest r² for each number of variables that could be in the model.
ADJRSQ finds adjusted r² for subsets of models; with BEST=10, it reports the best 10 models (largest adjusted r² values).
CP computes Mallows' Cp statistic; with BEST=10, it reports the best 10 models (lowest Cp values), here also including r² and adjusted r² for each model.
Variable Selection Methods - Best Subsets
R-Square Selection Method
• Each model listed is the combination of variables (from 1 variable through all variables) with the largest r² value for its size.
• r² increases as additional variables are added to the model.
Variable Selection Methods - Best Subsets
Adjusted R-Square Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of decreasing adjusted r² values.
• Since BEST=10 was specified, the top 10 models are included.
Variable Selection Methods - Best Subsets
C(p) Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of increasing Cp values.
• Since BEST=10 was specified, the top 10 models are included.
PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP ADJRSQ STOP=1 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP ADJRSQ STOP=2 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP ADJRSQ STOP=3 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP ADJRSQ STOP=4 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION=CP ADJRSQ STOP=5 BEST=1;
RUN;
QUIT;
Variable Selection Methods - Best Subsets
Each MODEL statement determines the model with the lowest C(p) value among models including no more than the STOP= number of variables (here 1 through 5).
Variable Selection Methods - Best Subsets
Model with the lowest C(p) value - Up to 1 variable in the model
Model with the lowest C(p) value - Up to 2 variables in the model
Variable Selection Methods - Best Subsets
Model with the lowest C(p) value - Up to 3 variables in the model
Model with the lowest C(p) value - Up to 4 variables in the model
Note decreasing C(p)
Variable Selection Methods - Best Subsets
Model with the lowest C(p) value - up to 5 variables in the model.
Same result as "up to 4 variables."
• The forward selection, backward elimination and stepwise selection methods produced the same model.
• We could conclude this is the best model. However, other data might produce different models using the various selection methods.
Variable Selection Methods
PROC REG DATA=dataMLB5;
   MODEL WP = OBP HR ERA Errors / P R TOL VIF COLLIN;
RUN;
QUIT;
Variable Selection Methods
Model Building Strategy
1. Decide the type of model
2. Collect the data
3. Explore the data
4. Divide the data
5. Fit several candidate models
6. Select and evaluate "good" models
7. Select the final model
Step I: Decide the Type of Model Needed
• Predictive: a model used to predict the response variable from a chosen set of predictor variables.
• Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.
• Control: a model used to control a response variable by manipulating predictor variables.
• Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.
• Data Summary: a model used primarily as a device to summarize a large set of data by a single equation.
Step II: Collect the Data
• Decide the variables (predictor and response) on which to collect data.
• Consult subject matter experts.
• Obtain pertinent, bias-free data.
Step III: Explore the Data
• Check for outliers, gross errors, missing values, etc. on a univariate basis.
• Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.
Very important!
Step IV: Divide the Data
Randomly divide the data into:
• Training set: with at least 15-20 error degrees of freedom, used to estimate the model.
• Testing set: used for cross-validation of the fitted model.
The split into the training and test sets should be done randomly. A 50:50 split can be used if the sample size is large enough; otherwise more data may be put into the training set. (A sketch of one way to do the split follows.)
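One way to do the random split in SAS is PROC SURVEYSELECT; a minimal sketch with placeholder dataset names:

PROC SURVEYSELECT DATA=mydata OUT=split SAMPRATE=0.5 OUTALL SEED=12345;
RUN;                          /* OUTALL keeps every row, adding a Selected flag */

DATA train test;              /* route rows by the Selected flag */
   SET split;
   IF Selected THEN OUTPUT train;
   ELSE OUTPUT test;
RUN;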
Step V: Fit Several Candidate Models
Candidate models can be identified using the training data set:
• Use best subsets regression.
• Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.
Step VI: Select and Evaluate "Good" Models
• Select several models based on the criteria we learned, such as the $C_p$ statistic and the number of predictors (p).
• Check for violations of the model assumptions.
• Consider further transformations of the response and/or predictor variables.
• If none of the models provides a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.
Step VII: Select the Final Model
• Compare competing models by cross-validating them against the test data.
• Choose a model with smaller cross-validation SSE.
• Final selection of the model is based on considerations such as residual plots, outliers, parsimony, relevance, etc.
Summary
• The multiple regression model extends the simple linear regression model to k ≥ 2 predictor variables:
  $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$
• The least squares estimate of the parameter vector $\boldsymbol\beta$, based on n complete sets of observations on all variables, equals
  $\hat{\boldsymbol\beta} = (X'X)^{-1}X'\mathbf{y}$
  where X is the n×(k+1) matrix of observations on the predictor variables and $\mathbf{y}$ is the n×1 vector of observations on y.
Summary
• The fitted vector is $\hat{\mathbf{y}} = X\hat{\boldsymbol\beta}$; the residuals are $e_i = y_i - \hat y_i$.
• Error sum of squares: $SSE = \sum_{i=1}^{n} e_i^2$
• Total sum of squares: $SST = \sum (y_i - \bar y)^2$
• Multiple coefficient of determination: $r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
• The positive square root $r$ of $r^2$ is called the multiple correlation coefficient.
Summary
• In the probabilistic model for multiple regression, we assume the random errors $\epsilon_i$ to be independent $N(0, \sigma^2)$.
• It follows that the $\hat\beta_j$ are $N(\beta_j, \sigma^2 v_{jj})$, where $v_{jj}$ is the jth diagonal entry of the matrix $V = (X'X)^{-1}$.
Summary
• Furthermore, $s^2 = \dfrac{SSE}{n-(k+1)} = MSE$ is an unbiased estimate of $\sigma^2$, and $\dfrac{(n-(k+1))s^2}{\sigma^2}$ has a $\chi^2_{n-(k+1)}$ distribution, independent of the $\hat\beta_j$.
• From this result, we can draw inferences on the $\beta_j$ based on the t-distribution with n − (k+1) df. A 100(1−α)% confidence interval on $\beta_j$ is
  $\hat\beta_j \pm t_{n-(k+1),\alpha/2}\, s\sqrt{v_{jj}}$
Summary
• The extra sum of squares method is useful for deriving F-tests on subsets of the $\beta_j$'s.
• To test the hypothesis that a specified subset of m ≤ k of the $\beta_j$'s equal 0, let $SSE_k$ and $SSE_{k-m}$ denote the error sums of squares for the full and partial models, respectively.
Summary
• The F statistic is given by
  $F = \dfrac{(SSE_{k-m} - SSE_k)/m}{SSE_k/(n-(k+1))}$, with m and n − (k+1) df.
• Two special cases of this method are the tests of significance on:
  – a single $\beta_j$: the t-test
  – all the $\beta_j$'s (not including $\beta_0$): the F statistic
    $F = \dfrac{SSR/k}{SSE/(n-(k+1))} = \dfrac{MSR}{MSE}$, with k and n − (k+1) df.
Summary
• The fitted vector can be written as $\hat{\mathbf{y}} = H\mathbf{y}$, where $H = X(X'X)^{-1}X'$ is called the hat matrix.
• If the ith diagonal element of H satisfies $h_{ii} \ge 2(k+1)/n$, then the ith observation is regarded as influential because it has high leverage on the model fit.
• Residuals are used to check model assumptions like normality and constant variance by making appropriate residual plots. If the standardized residual satisfies $|e_i^*| = |e_i/SE(e_i)| > 2$, then the ith observation is regarded as an outlier.
Summary
• Multicollinearity: the columns of X are approximately linearly dependent → the estimates of β are unstable.
  – Major cause: high correlation between the xi's
• Dummy variables: used to represent categorical data.
  – xi = 1 if the ith category is true; xi = 0 otherwise
Summary
• Stepwise regression: selects and deletes variables based on their marginal contributions to the model.
  – Uses partial F-statistics and partial correlation coefficients
• Best subsets regression: chooses the subset of predictor variables that optimizes a certain criterion function (e.g. adjusted $r_p^2$ or the $C_p$ statistic).
Summary
• Fitting a model:
  1) Decide on the type of model.
  2) Collect the data.
  3) Explore the data.
  4) Divide the data.
  5) Fit several candidate models.
  6) Select/evaluate "good" models.
  7) Select the final model.