Page 1: Chapter 11

Chapter 11

Multiple Linear Regression

Introduction
Theory
SAS
Summary

By: Airelle, Bochao, Chelsea, Menglin, Reezan, Tim, Wu, Xinyi, Yuming

Page 2: Chapter 11

Introduction

Page 3: Chapter 11

Regression Analysis in the Making

The earliest form of regression analysis was the method of least squares, published by Legendre in 1805 in the paper "Nouvelles méthodes pour la détermination des orbites des comètes." [1]

Legendre used least squares to study the orbits of comets around the Sun.

"Sur la Méthode des moindres quarrés" ("On the method of least squares") appeared as an appendix to the paper. [1]

Adrien-Marie Legendre (1752-1833) [2]

[1] Firmin Didot, Paris, 1805. "Nouvelles méthodes pour la détermination des orbites des comètes." "Sur la Méthode des moindres quarrés" appears as an appendix.
[2] Picture from <http://www.superstock.com/stock-photos-images/1899-40028>

Page 4: Chapter 11

Regression Analysis in the Making

Gauss also developed the method of least squares for the analysis of astronomical observations.

In 1809 he published Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum (Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections), which contains his development of the method of least squares. [1]

Johann Carl Friedrich Gauss (1777-1855)

Shown here on the 10 Deutsche Mark banknote! [2]

[1] C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum. (1809)
[2] Picture from <http://www.pictobrick.de/en/gallery_gauss.shtml>

Page 5: Chapter 11

Why “regression”?

Coined in the 19th century by Sir Francis Galton. [1]

Sir Francis Galton (1822-1911) [2]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://hu.wikipedia.org/wiki/Szineszt%C3%A9zia>

Page 6: Chapter 11

Why “regression”?

The term was used to describe how the heights of descendants of tall ancestors "regress" down toward the average height of the current generation; this phenomenon is also known as "regression towards the mean." [1]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://en.wikipedia.org/wiki/File:Miles_Park_Romney_family.jpg>

Page 7: Chapter 11

Fun Fact

Before 1970, one run of linear regression could take up to 24 hours on an electromechanical desk calculator. [1]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://www.technikum29.de/en/computer/electro-mechanical>

Page 8: Chapter 11

Uses of linear regression

1. Making predictions: build the model by running linear regression on an observed set of data and outcomes, then use it to predict the next, unknown outcome.

2. Correlating data: determine the strength of the relationship between two sets of data (without treating one as "causal" to the other).

Page 9: Chapter 11

Theory

Page 10: Chapter 11

Multiple Linear Regression

• Review: in simple linear regression we have only one predictor variable,

  Y_i = β_0 + β_1 x_i + ε_i,   i = 1, ..., n

• What if there is more than one predictor?
• Multiple linear regression model
  – Generalization of simple linear regression (considering more than one independent variable)

Page 11: Chapter 11

• We fit a model of the form

  Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i,   i = 1, ..., n

• x_i1, x_i2, ..., x_ik: values of the k ≥ 2 predictor variables for the ith observation
• β_0, β_1, ..., β_k: k + 1 unknown parameters
• ε_i: random error
• Note: the model is "linear" because it is linear in the β's, not necessarily in the x's.
• For example, the data (x_i1, x_i2, ..., x_ik; y_i) might record
  y_i: the salary of the ith person in the sample
  x_i1: years of experience
  x_i2: years of education

Graph 1. Regression plane for the model with 2 predictor variables (source of Graph 1: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis)

Page 12: Chapter 11

• Here, we assume that the random errors ε_i are independent N(0, σ²) random variables.

• The Y_i are then independent random variables with

  Y_i ~ N(μ_i, σ²),   where μ_i = E(Y_i) = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik,   i = 1, 2, ..., n

Page 13: Chapter 11

Fitting the Multiple Regression Model

• Least Squares (LS) Fit: to estimate the unknown parameters β_0, β_1, ..., β_k, we minimize

  Q = Σ_{i=1}^n [y_i - (β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik)]²

• We set the first partial derivatives of Q with respect to β_0, β_1, ..., β_k equal to zero:

  ∂Q/∂β_0 = -2 Σ_{i=1}^n [y_i - (β_0 + β_1 x_i1 + ... + β_k x_ik)] = 0

  ∂Q/∂β_j = -2 Σ_{i=1}^n [y_i - (β_0 + β_1 x_i1 + ... + β_k x_ik)] x_ij = 0,   j = 1, 2, ..., k

Page 14: Chapter 11

Simplification leads to the following normal equations:

  n β̂_0 + β̂_1 Σ x_i1 + ... + β̂_k Σ x_ik = Σ y_i

  β̂_0 Σ x_ij + β̂_1 Σ x_ij x_i1 + ... + β̂_k Σ x_ij x_ik = Σ x_ij y_i,   j = 1, 2, ..., k

(all sums run over i = 1, ..., n)

The resulting solutions are the least squares (LS) estimates of β_0, β_1, ..., β_k and are denoted by β̂_0, β̂_1, ..., β̂_k, respectively.

Page 15: Chapter 11

• Goodness of Fit of the Model
• We use the residuals, defined by

  e_i = y_i - ŷ_i,   i = 1, 2, ..., n

  where the ŷ_i are the fitted values:

  ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_k x_ik,   i = 1, 2, ..., n

• As an overall measure of the goodness of fit, we can use the error sum of squares

  SSE = Σ_{i=1}^n e_i²

  (which is the minimum value of Q). We compare this SSE to the total sum of squares

  SST = Σ (y_i - ȳ)²

Define the regression sum of squares by SSR = SST - SSE. The ratio of SSR to SST is called the coefficient of multiple determination:

  r² = SSR/SST = 1 - SSE/SST

r² ranges between 0 and 1, with values closer to 1 representing better fits. Note that adding more predictor variables to a model generally increases r². The positive square root of r² is the multiple correlation coefficient r.
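As a purely illustrative check of this formula, using the ANOVA numbers from the cement example presented in the SAS section below (SST = 2715.76, SSE = 47.86):

  SSR = 2715.76 - 47.86 = 2667.90,   r² = 2667.90/2715.76 ≈ 0.982

so about 98% of the variation in y is explained by the regression.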

Page 16: Chapter 11

• Multiple Regression Model in Matrix Notation
• Let

  Y = (Y_1, Y_2, ..., Y_n)',   y = (y_1, y_2, ..., y_n)',   ε = (ε_1, ε_2, ..., ε_n)'

be the n×1 vectors of the random variables Y_i, their observed values y_i, and the random errors ε_i, respectively. Let

  X = [ 1  x_11  x_12  ...  x_1k ]
      [ 1  x_21  x_22  ...  x_2k ]
      [ .   .     .          .   ]
      [ 1  x_n1  x_n2  ...  x_nk ]

be the n×(k+1) matrix of the values of the predictor variables.

Page 17: Chapter 11

Let

  β = (β_0, β_1, ..., β_k)'   and   β̂ = (β̂_0, β̂_1, ..., β̂_k)'

be the (k+1)×1 vectors of the unknown parameters and of their LS estimates, respectively.

Then the model can be written as:

  Y = Xβ + ε

The simultaneous linear equations of the normal equations can be written as:

  X'X β̂ = X'y

If the inverse of the matrix X'X exists, then the solution is given by:

  β̂ = (X'X)⁻¹ X'y
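To make the formula concrete, here is a minimal SAS/IML sketch (not part of the original slides, and assuming SAS/IML is available) that computes β̂ = (X'X)⁻¹X'y directly. The numbers are simply the first four rows of the cement data of Table 11.2 (shown later), with x1 and x2 as the only predictors, used purely for illustration.

PROC IML;
   /* y: 4x1 vector of observed responses                      */
   /* X: 4x3 design matrix - a column of 1s for the intercept, */
   /*    followed by the predictor columns x1 and x2           */
   y = {78.5, 74.3, 104.3, 87.6};
   X = {1  7 26,
        1  1 29,
        1 11 56,
        1 11 31};
   beta_hat = INV(X`*X) * X` * y;   /* LS estimates (b0, b1, b2)' */
   PRINT beta_hat;
QUIT;

In practice PROC REG performs this computation (more stably) behind the scenes; the sketch only mirrors the matrix formula above.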

Page 18: Chapter 11

• Generalized linear model (GLM)
The generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

source : http://en.wikipedia.org/wiki/Generalized_linear_model

Page 19: Chapter 11

Statistical Inference for Multiple Regression

First, we assume that the ε_i are i.i.d. N(0, σ²).

Then, in order to determine which predictor variables have statistically significant effects on the response variable, we need to test the hypotheses:

  H_0j: β_j = 0   vs.   H_1j: β_j ≠ 0

If we reject H_0j: β_j = 0, then x_j is a significant predictor of y.

Page 20: Chapter 11

• It can be shown that each β̂_j is normally distributed with mean β_j and variance σ² v_jj, where v_jj is the jth diagonal entry (j = 0, 1, ..., k) of the matrix

  V = (X'X)⁻¹

But how can we get the mean and variance?

Page 21: Chapter 11

Mean: Here, β̂_j is unbiased, similar to β̂_0 and β̂_1 in simple linear regression.

Then we can get the following:

  E(β̂_0) = β_0,   E(β̂_1) = β_1,   ...,   E(β̂_k) = β_k

Page 22: Chapter 11

Variance:

From the assumption ε_i ~ i.i.d. N(0, σ²), we have

  Var(Y) = σ² I = diag(σ², σ², ..., σ²)

Since β̂ = (X'X)⁻¹ X'Y, we can get

  Var(β̂) = σ² (X'X)⁻¹

Let v_jj be the jth diagonal entry (j = 0, 1, ..., k) of the matrix V = (X'X)⁻¹. Then

  Var(β̂_j) = σ² v_jj

Page 23: Chapter 11

Deriving the pivotal quantity (PQ) for inference on β_j:

The unbiased estimator of the unknown error variance σ² is given by

  S² = Σ e_i² / (n-(k+1)) = SSE / (n-(k+1)) = MSE

Here, MSE is the error mean square and n-(k+1) is its degrees of freedom.

Page 24: Chapter 11

Let

  W = (n-(k+1)) S²/σ² = SSE/σ² ~ χ²_{n-(k+1)}

and note that β̂_j and S² are statistically independent.

Recalling the definition of the t-distribution, we can obtain the pivotal quantity:

  Z_j = (β̂_j - β_j) / (σ √v_jj) ~ N(0, 1)

  T_j = Z_j / √(W/(n-(k+1))) = (β̂_j - β_j) / (S √v_jj) ~ t_{n-(k+1)}

Statistically independent: the occurrence of one event does not affect the outcome of the other event.

Page 25: Chapter 11

Confidence interval:

A 100(1-α)% confidence interval on β_j is obtained from

  P(-t_{n-(k+1), α/2} ≤ T_j ≤ t_{n-(k+1), α/2}) = 1 - α

  P(β̂_j - t_{n-(k+1), α/2} S √v_jj ≤ β_j ≤ β̂_j + t_{n-(k+1), α/2} S √v_jj) = 1 - α

So, the confidence interval for β_j is:

  β̂_j ± t_{n-(k+1), α/2} SE(β̂_j),   where SE(β̂_j) = S √v_jj
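As a purely illustrative example, using the cement-data regression output shown later in Example 11.5 (n = 13, k = 4, β̂_1 = 1.5511 with t-ratio 2.08, so SE(β̂_1) = 1.5511/2.08 ≈ 0.746) and t_{8, 0.025} = 2.306, a 95% CI for β_1 is

  1.5511 ± 2.306 × 0.746 ≈ 1.55 ± 1.72,   i.e., (-0.17, 3.27)

The interval contains 0, consistent with that coefficient's large p-value in the output.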

Page 26: Chapter 11

Derivation of the hypothesis test for β_j at a given value β_j⁰:

  H_0j: β_j = β_j⁰   vs.   H_1j: β_j ≠ β_j⁰

Test statistic:

  t_j = (β̂_j - β_j⁰) / SE(β̂_j)

Reject H_0j if

  |t_j| > t_{n-(k+1), α/2}

Page 27: Chapter 11

Another hypothesis test to determine if the model is useful:

H_0: β_1 = β_2 = ... = β_k = 0 (none of the predictors x_j is related to y)

H_a: at least one β_j ≠ 0 (at least one of them is related)

The test statistic is

  F = MSR/MSE ~ f_{k, n-(k+1)},   where MSR = SSR/k and MSE = SSE/(n-(k+1))

Page 28: Chapter 11

By using the formula r² = 1 - SSE/SST (so that SSR = r² SST and SSE = (1 - r²) SST), we have

  F = MSR/MSE = (SSR/k) / (SSE/(n-(k+1))) = (r²/k) / ((1 - r²)/(n-(k+1))) = r² (n-(k+1)) / [(1 - r²) k]

We can see that F is an increasing function of r² (the coefficient of multiple determination), and in this form F is used to test the statistical significance of r², which is equivalent to testing H_0.

Reject H_0 if F > f_{k, n-(k+1), α}.

Page 29: Chapter 11

Extra Sum of Squares Method for Testing Subsets of Parameters

Consider the full model:

  Y_i = β_0 + β_1 x_i1 + ... + β_{k-m} x_{i,k-m} + β_{k-m+1} x_{i,k-m+1} + ... + β_k x_ik + ε_i

And the partial model obtained by setting the last m coefficients to zero:

  Y_i = β_0 + β_1 x_i1 + ... + β_{k-m} x_{i,k-m} + ε_i

To test whether the full model is significantly better than the partial model, we test

  H_0: β_{k-m+1} = ... = β_k = 0   vs.   H_1: at least one of these β_j ≠ 0

Page 30: Chapter 11

Since SST is fixed regardless of the particular model, we have

  SST = SSR_k + SSE_k = SSR_{k-m} + SSE_{k-m}

So,

  SSE_{k-m} - SSE_k = SSR_k - SSR_{k-m} ≥ 0

The test statistic is

  F = [(SSE_{k-m} - SSE_k)/m] / [SSE_k/(n-(k+1))]

Reject H_0 if F > f_{m, n-(k+1), α}.

Numerator m: # of coefficients set to zero.
Denominator n-(k+1): the error df for the full model.

The extra sum of squares SSE_{k-m} - SSE_k in the numerator represents the part of the variation in y that is accounted for by regression on the m extra predictors; it is divided by m to get an average contribution per term.

Page 31: Chapter 11

ANOVA table

  Source of Variation   DF         SS    MS                     F
  Regression            k          SSR   MSR = SSR/k            F = MSR/MSE
  Error                 n-(k+1)    SSE   MSE = SSE/(n-(k+1))
  Total                 n-1        SST

Links between ANOVA and the extra sum of squares method: taking m = k (i.e., dropping all k predictors, so the partial model contains only β_0), we have

  SSE_{k-m} = SSE_0 = Σ_{i=1}^n (y_i - ȳ)² = SST,   SSE_k = SSE

  SSE_0 - SSE_k = SST - SSE = SSR

  F = (SSR/k) / (SSE/(n-(k+1))) = MSR/MSE ~ f_{k, n-(k+1)}

Page 32: Chapter 11

Prediction of Future Observation

• Having fitted a multiple regression model, suppose that we want to predict the future value Y* of y for a specified vector of predictor variables x* = (1, x_1*, x_2*, ..., x_k*)'.

(Notice that we have included 1 as the first component of the vector to correspond to the constant term in the model.)

Page 33: Chapter 11

Prediction of Future Observation

One way is to estimate μ* = E(Y*) = β_0 + β_1 x_1* + ... + β_k x_k* = x*'β by a confidence interval (CI).

We already have the point estimate

  Ŷ* = β̂_0 + β̂_1 x_1* + ... + β̂_k x_k* = x*'β̂

Page 34: Chapter 11

Prediction of Future Observation

And, since β̂ ~ N(β, σ² (X'X)⁻¹),

  Ŷ* = x*'β̂ ~ N(x*'β, σ² x*'(X'X)⁻¹x*)

Page 35: Chapter 11

Prediction of Future Observation

Replacing σ by its estimate S, which has n-(k+1) df, the pivotal quantity is

  T = (Ŷ* - μ*) / (S √(x*'(X'X)⁻¹x*)) ~ t_{n-(k+1)}

A level 100(1-α)% CI for μ* is given by

  Ŷ* ± t_{n-(k+1), α/2} S √(x*'(X'X)⁻¹x*)

Page 36: Chapter 11

Prediction of Future Observation

Another way is to predict Y* by a prediction interval (PI). We know

  Y* ~ N(β_0 + β_1 x_1* + ... + β_k x_k*, σ²)

The prediction error Y* - Ŷ* is the difference between two independent random variables, with mean 0 and variance

  Var(Y* - Ŷ*) = σ² + σ² x*'(X'X)⁻¹x* = σ² (1 + x*'(X'X)⁻¹x*)

Page 37: Chapter 11

Prediction of Future Observation

Replacing σ by its estimate S, which has n-(k+1) df, the pivotal quantity is

  T = (Y* - Ŷ*) / (S √(1 + x*'(X'X)⁻¹x*)) ~ t_{n-(k+1)}

A level 100(1-α)% PI for Y* is given by

  Ŷ* ± t_{n-(k+1), α/2} S √(1 + x*'(X'X)⁻¹x*)

Page 38: Chapter 11

Residual Analysis

Recall that the fitted vector can be written as

  ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy

where

  H = X(X'X)⁻¹X'

H is called the hat matrix.

Page 39: Chapter 11

Residual Analysis

Standardized residuals are given by

  e_i* = e_i / SE(e_i) = e_i / (S √(1 - h_ii))

Here h_ii is the ith diagonal element of the hat matrix H. Large |e_i*| values indicate outlier observations.

Page 40: Chapter 11

Residual Analysis

Moreover,

  trace(H) = Σ_{i=1}^n h_ii = k + 1

so the average of the h_ii is (k+1)/n. We conclude that the ith observation is influential if

  h_ii > 2(k+1)/n

Page 41: Chapter 11

Data Transformation

• Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant variance. Many seemingly nonlinear models can be written as the multiple linear regression model after making a suitable transformation.

Example:

  y = β_0 x_1^{β_1} x_2^{β_2}

Page 42: Chapter 11

Data Transformation

We can do the transformation by taking ln on both sides.

Then we have

  ln(y) = ln(β_0) + β_1 ln(x_1) + β_2 ln(x_2)

Let

  y* = ln(y),   β_0* = ln(β_0),   β_1* = β_1,   x_1* = ln(x_1),   β_2* = β_2,   x_2* = ln(x_2)

We now have

  y* = β_0* + β_1* x_1* + β_2* x_2*

which is now in the form of the multiple linear regression model.
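A minimal SAS sketch of this transformation (illustrative only, not from the original slides; it assumes an existing data set RAW containing variables y, x1, and x2):

DATA logfit;
   SET raw;             /* hypothetical input data set */
   ystar  = LOG(y);     /* y*  = ln(y), natural log    */
   x1star = LOG(x1);    /* x1* = ln(x1)                */
   x2star = LOG(x2);    /* x2* = ln(x2)                */
RUN;

PROC REG DATA=logfit;
   MODEL ystar = x1star x2star;   /* fits y* = b0* + b1 x1* + b2 x2* */
RUN;
QUIT;

The fitted intercept estimates ln(β_0), so β_0 itself can be recovered as exp(intercept).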

Page 43: Chapter 11

Code and table and graphs

Page 44: Chapter 11

Voting Example

Page 45: Chapter 11

Voting Example

• Setup: Data on individual state voting percentages for the winners of the last fifteen (15) U.S. presidential elections.

  y  = New York voting percentage ('ny')
  x1 = California voting percentage ('ca')
  x2 = South Carolina voting percentage ('sc')
  x3 = Wisconsin voting percentage ('wi')

• Goal: See if there is any positive correlation between NY and California's (two traditionally Democratic states) voting patterns, or a negative correlation between NY and South Carolina's (one Democratic, one Republican state).
• Note: Wisconsin was included as a variable although its traditional stance is (seemingly) more ambiguous.

Page 46: Chapter 11

Source: <http://www.presidency.ucsb.edu/elections.php>

Page 47: Chapter 11
Page 48: Chapter 11
Page 49: Chapter 11
Page 50: Chapter 11
Page 51: Chapter 11
Page 52: Chapter 11
Page 53: Chapter 11

Section 11.6
Multicollinearity
Standardized Regression Coefficients
Dummy Variables

Page 54: Chapter 11

Multicollinearity (Section 11.6.1)

• The "multicollinearity problem" = columns of X are approximately linearly dependent
  – Result:
    • → columns of X'X approximately linearly dependent
    • → det(X'X) ≈ 0
    • → X'X approximately singular (non-invertible)
    • → β̂ = (X'X)⁻¹X'y unstable (large variance) or impossible to calculate

Page 55: Chapter 11

Multicollinearity (Section 11.6.1)

• One cause: predictor variables x1, x2, …, xk highly correlated
  – Example: x1 = income, x2 = expenditures, x3 = savings
• Linear relationship: x3 = x1 – x2 → don't want this
• Conclusion: Only two (2) of the variables should be included

Page 56: Chapter 11

Multicollinearity (from Tamhane/Dunlop – pg. 416)

• Example 11.5: Data on the heat evolved in calories during hardening of cement (y) along with percentages of four ingredients (x1, x2, x3, x4)

– Table 11.2: Cement Data

  No.   x1   x2   x3   x4      y
   1     7   26    6   60    78.5
   2     1   29   15   52    74.3
   3    11   56    8   20   104.3
   4    11   31    8   47    87.6
   5     7   52    6   33    95.9
   6    11   55    9   22   109.2
   7     3   71   17    6   102.7
   8     1   31   22   44    72.5
   9     2   54   18   22    93.1
  10    21   47    4   26   115.9
  11     1   40   23   34    83.8
  12    11   66    9   12   113.3
  13    10   68    8   12   109.4

Page 57: Chapter 11

Multicollinearity (from Tamhane/Dunlop – pg. 416 cont’d)

• Linear relationship: x2 + x4 ≈ 80 (again, don't want this)
• Results (Tamhane/Dunlop):

  – Correlations
              x1       x2       x3       x4
     x2    0.229
     x3   -0.824   -0.139
     x4   -0.245   -0.973    0.030
     y     0.731    0.816   -0.535   -0.821

  – Regression
     Predictor     Coef      t-ratio   p-value
     Constant     62.41       0.89     0.399
     x1            1.5511     2.08     0.071
     x2            0.5102     0.70     0.501
     x3            0.1019     0.14     0.896
     x4           -0.1441    -0.20     0.844

Page 58: Chapter 11

Multicollinearity (from Tamhane/Dunlop – pg. 416 cont’d)

– ANOVA:
   SOURCE        DF   SS        MS       F        p-value
   Regression     4   2667.90   666.97   111.48   0.000
   Error          8     47.86     5.98
   Total         12   2715.76

• Conclusion: none of the individual coefficients (with the possible exception of x1) is significant, yet the model as a whole is a very good fit.
• Further work: a better model includes only x1 and x2.
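As a purely illustrative check of the F statistic from the theory section, using the ANOVA entries above: MSR = 2667.90/4 ≈ 666.97 and MSE = 47.86/8 ≈ 5.98, so F = MSR/MSE ≈ 111.5, matching the table and far exceeding the 5% critical value f_{4,8,0.05} ≈ 3.84. This is the typical multicollinearity pattern: a highly significant overall F together with individually nonsignificant t-ratios.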

Page 59: Chapter 11

Standardized Regression Coefficients (Section 11.6.4)

• Purpose: allows the user to compare predictors in terms of the magnitudes of their effects on the response variable y.
  – The magnitudes of the β̂_j's depend on the units of the x_j's and y → need to "standardize"

Page 60: Chapter 11

Standardized Regression Coefficients (Section 11.6.4)

• Result: each variable is standardized (centered by its mean and scaled by its standard deviation), giving new predictors x_1*, ..., x_k*; the LS estimates of the standardized regression coefficients then solve R β̂* = r, where R is the k×k correlation matrix of the predictors and r is the vector of correlations between each predictor and y.

• All entries of R and r are between -1 and 1 → more stable computation of β̂*

Page 61: Chapter 11

Dummy Predictor Variables (Section 11.6.3)

• Dummy variables: allow inclusion of categorical data into a regression model.
  – Example: "gender" as a predictor variable → x = 1 for female, x = 0 for male
• k ≥ 2 categories → k - 1 dummy variables x_1, x_2, ..., x_{k-1}, where x_i = 1 for the ith category and 0 otherwise
• x_1 + x_2 + ... + x_k = 1 → avoid linear dependence by not including x_k

Page 62: Chapter 11

Dummy Predictor Variables (Section 11.6.3)

• Example 11.6 (Tamhane/Dunlop): Winter → (x1, x2, x3) = (1, 0, 0); Spring → (x1, x2, x3) = (0, 1, 0); Summer → (x1, x2, x3) = (0, 0, 1); Fall → (x1, x2, x3) = (0, 0, 0)

• General Linear Model: at least one (1) categorical regressor in our model (i.e. at least one dummy variable x)
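A minimal SAS sketch of this coding (illustrative only, not from the original slides; it assumes a data set SEASONS containing a character variable SEASON with values 'Winter', 'Spring', 'Summer', and 'Fall'):

DATA seasonDummies;
   SET seasons;                  /* hypothetical input data set   */
   x1 = (season = 'Winter');     /* 1 if Winter, 0 otherwise      */
   x2 = (season = 'Spring');     /* 1 if Spring, 0 otherwise      */
   x3 = (season = 'Summer');     /* 1 if Summer, 0 otherwise      */
   /* Fall is the baseline category: (x1, x2, x3) = (0, 0, 0)     */
RUN;

In SAS a logical comparison evaluates to 1 or 0, so each assignment creates the corresponding dummy variable directly.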

Page 63: Chapter 11

Dummy Predictor Variables in General Linear Model

• Consider a GLM representing k categories as the predictors
  – k categories → k - 1 dummy variables (discussed earlier)
• Model:

  Y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_{k-1} x_{k-1} + ε

  where x_j = 1 if the observation belongs to category j and x_j = 0 otherwise (j = 1, ..., k - 1)

Page 64: Chapter 11

Dummy Predictor Variables in General Linear Model

• Interpretation of the β's:
  – Result: β_j is interpreted as the value added to (or subtracted from) the expected value of Y when category j is true.
  – Note: all x_j = 0 if category k is true, so category k serves as the baseline with E(Y) = β_0.

Page 65: Chapter 11

Dummy Predictor Variables in General Linear Model

• ANOVA (general definition):
  – Testing the null hypothesis that all groups of data come from the same population (same mean)
  – Setup: k groups with means μ_1, μ_2, ..., μ_k

Page 66: Chapter 11

Dummy Predictor Variables in General Linear Model

  – Test statistic: the one-way ANOVA F ratio of the between-group mean square to the within-group mean square
  – Reject H_0 at significance level α if the F ratio exceeds the corresponding upper-α F critical value

Page 67: Chapter 11

Dummy Predictor Variables in General Linear Model

• Our case (GLM with dummy variables): testing the null hypothesis μ_1 = μ_2 = ... = μ_k, where μ_j is the (population) mean for category j
• But μ_j = β_0 + β_j for j = 1, ..., k - 1 and μ_k = β_0, so we have
• Result: simplified ANOVA hypotheses (plug in and subtract β_0):

  H_0: β_1 = β_2 = ... = β_{k-1} = 0

Page 68: Chapter 11

Dummy Predictor Variables in General Linear Model

• Test statistic:

Page 69: Chapter 11

Dummy Predictor Variables in General Linear Model

• Estimated model:

  Ŷ = β̂_0 + β̂_1 x_1 + ... + β̂_{k-1} x_{k-1}

• Let ȳ_j denote the sample mean of the observations in category j.
• Fact (can be shown): the fitted value equals the category sample mean for all categories, i.e., β̂_0 + β̂_j = ȳ_j for j = 1, ..., k - 1 and β̂_0 = ȳ_k.

Source: "Statistics 104: A Note on ANOVA and Dummy Variable Regression" – lecture by Sundeep Iyer (4/23/2010)

Page 70: Chapter 11

Dummy Predictor Variables in General Linear Model

• Conclusion: the F-test for this case is similar to that of a regression (letting the k - 1 dummy variables play the role of the predictors), with the same set of hypotheses.

Page 71: Chapter 11

Variable Selection Methods

Consider a model that includes all of the following variables:

  WP     = Winning Percentage
  NL     = National League Indicator (dummy variable)
  AVG    = Batting Average
  OBP    = On Base Percentage
  HR     = Home Runs
  B2     = Doubles
  B3     = Triples
  hitSB  = Stolen Bases (offense)
  ERA    = Earned Run Average
  SO     = Strike Outs
  Errors = Errors*

*Note: Errors is a baseball term referring to the number of mistakes players have made.

Page 72: Chapter 11

DATA dataMLB5;
   INFILE DATALINES DSD;
   INFORMAT TEAM $21.;
   INPUT TEAM $ WP NL AVG OBP HR B2 B3 hitSB ERA SO Errors;
   LABEL WP="Winning Percentage"
         NL="National League Indicator"
         AVG="Batting Average"
         OBP="On Base Percentage"
         HR="Home Runs"
         B2="Doubles"
         B3="Triples"
         hitSB="Stolen Bases"
         ERA="Earned Run Average"
         SO="Strike Outs"
         Errors="Errors";
   DATALINES;
Boston Red Sox,0.599,0,0.277,0.349,178,363,29,123,3.79,1294,80
Tampa Bay Rays,0.564,0,0.257,0.329,165,296,23,73,3.74,1310,59
Baltimore Orioles,0.525,0,0.26,0.313,212,298,14,79,4.2,1169,54
...
San Francisco Giants,0.469,1,0.26,0.32,107,280,35,67,4,1256,107
Philadelphia Phillies,0.451,1,0.248,0.306,140,255,32,73,4.32,1199,97
Colorado Rockies,0.457,1,0.27,0.323,159,283,36,112,4.44,1064,90
;
RUN;

Source (MLB.com): <http://mlb.mlb.com/stats/sortable.jsp?c_id=mlb&tcid=mm_mlb_stats#statType=hitting&elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&%23167%3BionType=sp&page=1&ts=1385005626059&season=2013&season_type=ANY&playerType=QUALIFIER&sportCode='mlb'&league_code='MLB'&split=&team_id=&active_sw=&game_type='R'&position=&page_type=SortablePlayer&sortOrder='asc'&sortColumn=era&results=&perPage=50&timeframe=&last_x_days=&extended=0&sectionType=sp>

Variable Selection Methods

Page 73: Chapter 11

Run the regression on all of the data:

PROC REG DATA=dataMLB5;
   TITLE "Regression - Whole Model";
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors / P R TOL VIF COLLIN;
RUN;
QUIT;

Variable Selection Methods

Some MODEL statement options:
  P       compute predicted values
  R       compute residuals
  TOL     displays tolerance values for parameter estimates
  VIF     computes variance-inflation factors
  COLLIN  produces collinearity analysis
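For reference, the tolerance and variance-inflation factor are standard collinearity diagnostics: for predictor x_j, TOL_j = 1 - R_j² and VIF_j = 1/TOL_j, where R_j² is the r² from regressing x_j on the other predictors. Large VIF values (small tolerances) signal the multicollinearity problem of Section 11.6.1.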

Page 74: Chapter 11

Variable Selection Methods

Page 75: Chapter 11

Variable Selection Methods

Many very high p-values

Some of these variables should not be in the regression.

Page 76: Chapter 11

Variable Selection Methods

Page 77: Chapter 11

Variable Selection Methods

Stepwise, Forward, and Backward Regression:
• Successively adds/removes variables from the model
• Final model not guaranteed to be optimal
• Produces a single 'final' model (in actuality there might be several, almost equally good models)
• Not influenced by other considerations, such as the practicality of including particular variables

Best Subsets Regression:
• A subset of variables is chosen that optimizes a criterion
• Final model is optimal with respect to that criterion
• Determines a specified number of best subsets of each size; there are 2^k - 1 possible (nonempty) subsets
• When the number of variables is large, requires efficient algorithms to determine the optimum subset

Page 78: Chapter 11

Variable Selection Methods
Stepwise, Forward, and Backward Regression

Assume x_1, ..., x_{p-1} are already in the model. To determine if x_p should be included, compare the following models:

(p-1)-Model:   Y_i = β_0 + β_1 x_i1 + ... + β_{p-1} x_{i,p-1} + ε_i

p-Model:       Y_i = β_0 + β_1 x_i1 + ... + β_{p-1} x_{i,p-1} + β_p x_ip + ε_i

Page 79: Chapter 11

Partial F-test

Test statistic:

  F_p = (SSE_{p-1} - SSE_p) / [SSE_p/(n-(p+1))]

Since F_p has an f_{1, n-(p+1)} distribution when β_p = 0, we reject H_0: β_p = 0 (at significance level α) when

  F_p > f_{1, n-(p+1), α}

Variable Selection Methods
Stepwise, Forward, and Backward Regression

Page 80: Chapter 11

Partial Correlation Coefficients

Variable Selection Methods
Stepwise, Forward, and Backward Regression
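One standard way (stated here for reference) to express the partial correlation coefficient of y with x_p, adjusting for x_1, ..., x_{p-1}, is

  r²_{yx_p · x_1,...,x_{p-1}} = (SSE_{p-1} - SSE_p) / SSE_{p-1}

which relates it to the partial F statistic above through F_p = (n-(p+1)) r²/(1 - r²); the variable with the largest partial F therefore also has the largest squared partial correlation.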

Page 81: Chapter 11

Variable Selection Methods
Stepwise Regression Algorithm

1. Set p = 0 (no variables in the model).
2. Compute the partial F_i for each variable x_i not in the model (i = p+1, ..., k).
3. Is max F_i > F_IN? If no, STOP. If yes, enter the x_i producing the max F_i, set p ← p+1, and relabel the variables in the model as x_1, ..., x_p.
4. Compute the partial F_i for each variable x_i in the model (i = 1, ..., p).
5. Is min F_i < F_OUT? If yes, remove the x_i producing the min F_i, set p ← p-1, relabel the variables as x_1, ..., x_p, and return to step 4.
6. If no: does p = k? If yes, STOP; otherwise return to step 2.

(Tamhane, Dunlop, p. 430)

Page 82: Chapter 11

Forward Selection

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = FORWARD SLENTRY=0.05;
RUN;
QUIT;

Adds the variable producing the highest F value in the partial F-test at each step, with the entry threshold set at the 0.05 significance level.

  SELECTION =   method: FORWARD, BACKWARD, or STEPWISE
  SLENTRY =     significance level for a variable to enter the model

Page 83: Chapter 11

Forward Selection

Page 84: Chapter 11

Forward Selection – Final Step

F values of the variables within the final model.

Note:
- F values changed at each step as variables were added.
- For the FORWARD method, once a variable is included in the model, it will always stay in the model (its relative significance is not reconsidered as additional variables are added).

Page 85: Chapter 11

Forward Selection - Summary

Results of partial F tests at each step

Page 86: Chapter 11

Backward Elimination

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = BACKWARD SLSTAY=0.05;
RUN;
QUIT;

Removes variables, one by one, until all remaining variables have F values significant at the 0.05 level.

AVG will be removed in the next step.

  SLSTAY =   significance level required for a variable to remain in the model

Page 87: Chapter 11

Backward Elimination

AVG is no longer included.

SAS will continue to remove variables until every variable remaining in the model, including the one with the lowest F value, meets the specified significance level.

Page 88: Chapter 11

Backward Elimination - Final Step

F values of variables within the final model all meet the specified significance level.

Page 89: Chapter 11

Backward Elimination – Summary

Page 90: Chapter 11

Stepwise Selection

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = STEPWISE SLENTRY=0.05 SLSTAY=0.05;
RUN;
QUIT;

  SLSTAY =   significance level to remain in the model (variables already in the model are re-evaluated after each new variable is added)

Page 91: Chapter 11

Stepwise Selection - Final Model

F values of variables within the final model all meet the specified significance levels.

Page 92: Chapter 11

Stepwise Selection - Summary

If, at any step, a variable within the model no longer met the specified significance level, it would be removed.

Ultimately, none of the variables outside the model are significant, and all of the variables in the model are significant.

F values of variables at each step.

Page 93: Chapter 11

Variable Selection Methods – Best Subsets

Optimality Criteria

1) r²_p criterion:
   • Maximized when all the variables are in the model
   • Only provides a goodness-of-fit consideration (not how well the model predicts)

2) Adjusted r²_p criterion:
   • Unlike r²_p, adding variables can cause it to decrease

3) MSE_p criterion:
   • Minimizing MSE_p is essentially equivalent to maximizing adjusted r²_p
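For reference, the standard forms of these criteria for a submodel with p predictors (plus an intercept) are

  r²_p = 1 - SSE_p/SST
  adjusted r²_p = 1 - [SSE_p/(n-(p+1))] / [SST/(n-1)] = 1 - MSE_p/(SST/(n-1))
  MSE_p = SSE_p/(n-(p+1))

Since SST/(n-1) is the same for every submodel, maximizing adjusted r²_p and minimizing MSE_p are equivalent, as noted above.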

Page 94: Chapter 11

4) Cp Criterion: Measures the ability of the model to predict Y-values

Select predictor variable vectors x (potentially representing a range for future predictions)

Standardized mean square error of prediction:

If we assume that the full model is the “true model” :

It should be noted that as we increase the number of variables, the prediction variance increases.

Variable Selection Methods – Best Subsets

Page 95: Chapter 11

4) Cp Criterion (continued):

Mallows' Cp statistic is considered an "almost" unbiased estimate of the standardized mean square error of prediction.

We can use S² to estimate σ², and commonly the MSE for the whole model is used for this purpose.

Note: Cp = p + 1 for the full model (assumed to have zero bias), so Cp = k + 1 when p = k.

Variable Selection Methods – Best Subsets
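One common form of the statistic (stated here for reference, with σ̂² taken as the full-model MSE) is

  Cp = SSE_p/σ̂² - (n - 2(p+1))

which indeed gives Cp = k + 1 for the full model, consistent with the note above; submodels with small Cp (close to p + 1) are preferred.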

Page 96: Chapter 11

5) PRESS_p Criterion:

• Considers the impact of removing observations on predictability.
• Observations are removed one at a time, and the model is re-fit after each observation is removed.

LS estimates when the ith observation is removed: β̂_{(i)}

The predicted value for the observation that was just removed: ŷ_{(i)} = x_i' β̂_{(i)}

Prediction error sum of squares (PRESS):

  PRESS_p = Σ_{i=1}^n (y_i - ŷ_{(i)})²

Select the model with the smallest PRESS_p.

Variable Selection Methods – Best Subsets

Page 97: Chapter 11

Variable Selection Methods - Best Subsets

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = RSQUARE BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = ADJRSQ BEST=10;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP BEST=10 ADJRSQ RSQUARE;
RUN;
QUIT;

  RSQUARE  finds r² for subsets of models
  ADJRSQ   finds adjusted r² for subsets of models
  CP       computes Mallows' Cp statistic

The first MODEL statement reports the model with the largest r² for each number of variables that could be in the model; the second reports the best 10 models by adjusted r²; the third reports the best 10 models by lowest Cp, also listing the r² and adjusted r² values for each model.

Page 98: Chapter 11

Variable Selection Methods - Best Subsets

R-Square Selection Method
• Each model shown represents the combination of variables (from 1 variable through all variables) with the largest r² value.
• r² increases as additional variables are added to the model.

Page 99: Chapter 11

Variable Selection Methods - Best Subsets

Adjusted R-Square Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of decreasing adjusted r² values.
• Since "BEST=10" was specified, the top 10 models are included.

Page 100: Chapter 11

Variable Selection Methods - Best Subsets

C(p) Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of increasing Cp values.
• Since "BEST=10" was specified, the top 10 models are included.

Page 101: Chapter 11

PROC REG DATA=dataMLB5;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP ADJRSQ STOP=1 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP ADJRSQ STOP=2 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP ADJRSQ STOP=3 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP ADJRSQ STOP=4 BEST=1;
   MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
         SELECTION = CP ADJRSQ STOP=5 BEST=1;
RUN;
QUIT;

Variable Selection Methods - Best Subsets

Each MODEL statement determines the model with the lowest C(p) value among models including no more than the STOP= number of variables (here 1 through 5).

Page 102: Chapter 11

Variable Selection Methods - Best Subsets

Model with the lowest C(p) value - Up to 1 variable in the model

Model with the lowest C(p) value - Up to 2 variables in the model

Page 103: Chapter 11

Variable Selection Methods - Best Subsets

Model with the lowest C(p) value - Up to 3 variables in the model

Model with the lowest C(p) value - Up to 4 variables in the model

Note decreasing C(p)

Page 104: Chapter 11

Variable Selection Methods - Best Subsets

Model with the lowest C(p) value - up to 5 variables in the model

Same result as "up to 4 variables"

• The forward selection, backward elimination, and stepwise selection methods produced the same model.
• We could conclude this is the best model. However, other data might produce different models using the various selection methods.

Page 105: Chapter 11

Variable Selection Methods

PROC REG DATA=dataMLB5;
   MODEL WP = OBP HR ERA Errors / P R TOL VIF COLLIN;
RUN;
QUIT;

Page 106: Chapter 11

Variable Selection Methods

Page 107: Chapter 11

Model Building Strategy

1. Decide the type of model
2. Collect the data
3. Explore the data
4. Divide the data
5. Fit several candidate models
6. Select and evaluate
7. Select the final model

Page 108: Chapter 11

Step I: Decide the Type of Model Needed

• Predictive: a model used to predict the response variable from a chosen set of predictor variables.
• Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.
• Control: a model used to control a response variable by manipulating predictor variables.
• Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.
• Data Summary: a model used primarily as a device to summarize a large set of data by a single equation.

Page 109: Chapter 11

Step II: Collect the Data

• Decide variables(predictor and response) on which to collect data.

• Consult subject matter experts.• Obtain pertinent, bias-free data.

Page 110: Chapter 11

Step III: Explore the Data

• Check for outliers, gross errors, missing values, etc. on a univariate basis.
• Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.

Very Important!

Page 111: Chapter 11

Step IV: Divide the Data

Randomly divide the data into:
• Test set: used for cross-validation of the fitted model.
• Training set: with at least 15-20 error degrees of freedom, used to estimate the model.
The split into the training and test sets should be done randomly. A 50:50 split can be used if the sample size is large enough; otherwise, more data may be put into the training set.

Page 112: Chapter 11

Step V: Fit Several Candidate Models

Candidate models can generally be identified using the training data set:
• Use best subsets regression.
• Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.

Page 113: Chapter 11

Step VI: Select and Evaluate "Good" Models

• Select several models based on criteria we have learned, such as the Cp statistic and the number of predictors (p).
• Check for violations of the model assumptions.
• Consider further transformations of the response and/or predictor variables.
• If none of the models provides a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

Page 114: Chapter 11

Step VII: Select the Final Model

• Compare competing models by cross-validating them against the test data.
• Choose the model with the smaller cross-validation SSE.
• Final selection of the model is based on considerations such as residual plots, outliers, parsimony, relevance, etc.

Page 116: Chapter 11

Summary

• The multiple regression model extends the simple linear regression model to k ≥ 2 predictor variables:

  Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i

• The least squares estimate of the parameter vector β, based on n complete sets of observations on all the variables, equals

  β̂ = (X'X)⁻¹ X'Y

  where X is the n×(k+1) matrix of observations on the predictor variables and Y is the n×1 vector of observations on y.

Page 117: Chapter 11

Summary

• The fitted vector is ŷ = Xβ̂; the residual vector has components e_i = y_i - ŷ_i.
• Error sum of squares: SSE = Σ_{i=1}^n e_i²
• Total sum of squares: SST = Σ (y_i - ȳ)²
• Coefficient of multiple determination: r² = SSR/SST = 1 - SSE/SST
• The positive square root of r² is called the multiple correlation coefficient.

Page 118: Chapter 11

Summary

• In the probabilistic model for multiple regression, we assume the random errors ε_i to be independent N(0, σ²).

• It follows that the β̂_j are N(β_j, σ² v_jj), where v_jj is the jth diagonal term of the matrix V = (X'X)⁻¹.

Page 119: Chapter 11

Summary

• Furthermore, S² = SSE/(n-(k+1)) = MSE is an unbiased estimate of σ², and (n-(k+1)) S²/σ² has a χ²_{n-(k+1)} distribution, independent of the β̂_j.

• From these results, we can draw inferences on the β_j based on the t-distribution with n-(k+1) df. A 100(1-α)% confidence interval on β_j is

  β̂_j ± t_{n-(k+1), α/2} S √v_jj

Page 120: Chapter 11

Summary

• The extra sum of squares method is useful for deriving F-tests on subsets of the β_j's.

• To test the hypothesis that a specified subset of m ≤ k of the β_j's equals 0, let SSE_k and SSE_{k-m} denote the error sums of squares for the full and partial models, respectively.

Page 121: Chapter 11

Summary

• The F statistic is then given by

  F = [(SSE_{k-m} - SSE_k)/m] / [SSE_k/(n-(k+1))]   with m and n-(k+1) df

• Two special cases of this method are the tests of significance on:
  – a single β_j: the t-test
  – all the β_j's (not including β_0): the F statistic

  F = (SSR/k) / (SSE/(n-(k+1))) = MSR/MSE   with k and n-(k+1) df

Page 122: Chapter 11

Summary

• The fitted vector can be written as ŷ = Hy, where H = X(X'X)⁻¹X' is called the hat matrix. If the ith diagonal element of H satisfies

  h_ii > 2(k+1)/n

  then the ith observation is regarded as influential, because it has high leverage on the model fit.

• Residuals are used to check model assumptions such as normality and constant variance by making appropriate residual plots. If the standardized residual satisfies

  |e_i*| = |e_i / SE(e_i)| > 2

  then the ith observation is regarded as an outlier.

Page 123: Chapter 11

Summary

• Multicollinearity: columns of X are approximately linearly dependent → the estimates of β are unstable.
  – Major cause: high correlation between the x_i's
• Dummy variables: used to represent categorical data.
  – x_i = 1 if the ith category is true; x_i = 0 otherwise

Page 124: Chapter 11

Summary

• Stepwise regression: selects and deletes variables based on their marginal contributions to the model.
  – Uses partial F-statistics and partial correlation coefficients.
• Best subsets regression: chooses the subset of predictor variables that optimizes a certain criterion function (e.g., adjusted r²_p or the Cp statistic).

Page 125: Chapter 11

Summary

• Fitting a model:
  1) Decide on the type of model.
  2) Collect the data.
  3) Explore the data.
  4) Divide the data.
  5) Fit several candidate models.
  6) Select/evaluate "good" models.
  7) Select the final model.