Topic 13: Multiple Linear Regression Example
Jan 04, 2016
Outline
• Description of example
• Descriptive summaries
• Investigation of various models
• Conclusions
Study of CS students
• Too many computer science majors at
Purdue were dropping out of the program
• Wanted to find predictors of success to
be used in the admissions process
• Predictors must be available at the time
of entry into the program.
Data available
• GPA after three semesters
• Overall high school math grade
• Overall high school science grade
• Overall high school English grade
• SAT Math
• SAT Verbal
• Gender (of interest for other reasons)
Data for CS Example
• Y is the student’s grade point average (GPA) after 3 semesters
• 3 HS grades and 2 SAT scores are the explanatory variables (p=6)
• Have n=224 students
Descriptive Statistics
data a1;
  infile 'C:\...\csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
proc means data=a1 maxdec=2;
  var gpa hsm hss hse satm satv;
run;
Output from Proc Means
Variable    N     Mean   Std Dev   Minimum   Maximum
gpa       224     2.64      0.78      0.12      4.00
hsm       224     8.32      1.64      2.00     10.00
hss       224     8.09      1.70      3.00     10.00
hse       224     8.09      1.51      3.00     10.00
satm      224   595.29     86.40    300.00    800.00
satv      224   504.55     92.61    285.00    760.00
Descriptive Statistics
proc univariate data=a1;
  var gpa hsm hss hse satm satv;
  histogram gpa hsm hss hse satm satv /normal;
run;
Correlations
proc corr data=a1;
  var hsm hss hse satm satv;
proc corr data=a1;
  var hsm hss hse satm satv;
  with gpa;
run;
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

          gpa       hsm       hss       hse       satm      satv
gpa    1.00000   0.43650   0.32943   0.28900   0.25171   0.11449
                 <.0001    <.0001    <.0001    0.0001    0.0873
hsm    0.43650   1.00000   0.57569   0.44689   0.45351   0.22112
       <.0001              <.0001    <.0001    <.0001    0.0009
hss    0.32943   0.57569   1.00000   0.57937   0.24048   0.26170
       <.0001    <.0001              <.0001    0.0003    <.0001
hse    0.28900   0.44689   0.57937   1.00000   0.10828   0.24371
       <.0001    <.0001    <.0001              0.1060    0.0002
satm   0.25171   0.45351   0.24048   0.10828   1.00000   0.46394
       0.0001    <.0001    0.0003    0.1060              <.0001
satv   0.11449   0.22112   0.26170   0.24371   0.46394   1.00000
       0.0873    0.0009    <.0001    0.0002    <.0001
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

          hsm       hss       hse       satm      satv
gpa    0.43650   0.32943   0.28900   0.25171   0.11449
       <.0001    <.0001    <.0001    0.0001    0.0873

All but SATV are significantly correlated with GPA
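The p-values above come from testing H0: ρ = 0 with the statistic t = r√(n−2)/√(1−r²), compared to a t distribution with n−2 degrees of freedom. A quick Python check of two of the implied t statistics (a sketch, not part of the SAS program):

```python
import math

def corr_t(r, n):
    """t statistic for testing H0: rho = 0; compare to t(n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(round(corr_t(0.11449, 224), 2))  # satv vs gpa
print(round(corr_t(0.43650, 224), 2))  # hsm vs gpa
```

For SATV, t ≈ 1.72, consistent with its borderline p-value of 0.0873; for HSM, t ≈ 7.23, far in the tail, consistent with p < .0001.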
Scatter Plot Matrix
proc corr data=a1 plots=matrix;
var gpa hsm hss hse satm satv;
run;
• Allows a visual check of pairwise relationships
• No “strong” linear relationships
• Can see the discreteness of the high school grades
Use high school grades to predict GPA (Model #1)
proc reg data=a1;
  model gpa=hsm hss hse;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.58988          0.29424      2.00     0.0462
hsm          1              0.16857          0.03549      4.75     <.0001
hss          1              0.03432          0.03756      0.91     0.3619
hse          1              0.04510          0.03870      1.17     0.2451
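Each t value in the table is simply the parameter estimate divided by its standard error. A quick Python check against the reported values (a sketch, not part of the SAS program):

```python
# Estimates and standard errors for Model #1, taken from the output above
estimates = {
    "hsm": (0.16857, 0.03549),
    "hss": (0.03432, 0.03756),
    "hse": (0.04510, 0.03870),
}
for name, (b, se) in estimates.items():
    print(name, round(b / se, 2))  # reproduces the reported t values
```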
Root MSE 0.69984 R-Square 0.2046
Dependent Mean 2.63522 Adj R-Sq 0.1937
Coeff Var 26.55711
Results Model #1
Is an R-square of only about 20% practically meaningful?
ANOVA Table #1
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         27.71233       9.23744     18.86   <.0001
Error            220        107.75046       0.48977
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
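The summary statistics follow directly from the ANOVA table: R-Square = SS(Model)/SS(Total), MSE = SS(Error)/df(Error) with Root MSE its square root, and F = MS(Model)/MSE. A Python sketch reproducing the reported values:

```python
import math

# ANOVA quantities for Model #1 (gpa = hsm hss hse), from the table above
ss_model, ss_error, ss_total = 27.71233, 107.75046, 135.46279
df_model, df_error = 3, 220

r_sq = ss_model / ss_total          # R-Square
mse = ss_error / df_error           # mean squared error
f = (ss_model / df_model) / mse     # overall F statistic
print(round(r_sq, 4), round(math.sqrt(mse), 5), round(f, 2))
```

These match the reported R-Square 0.2046, Root MSE 0.69984, and F value 18.86.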
Remove HSS (Model #2)
proc reg data=a1;
  model gpa=hsm hse;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.62423          0.29172      2.14     0.0335
hsm          1              0.18265          0.03196      5.72     <.0001
hse          1              0.06067          0.03473      1.75     0.0820
Root MSE 0.69958 R-Square 0.2016
Dependent Mean 2.63522 Adj R-Sq 0.1943
Coeff Var 26.54718
Results Model #2
Slightly better Root MSE and adjusted R-square than Model #1
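The adjusted R-square being compared here can be computed as 1 − MSE/MST, which penalizes a model for carrying extra predictors. A Python sketch using the ANOVA quantities for Models #1 and #2 (total SS 135.46279 on 223 df is the same for every model fit to these data):

```python
# Corrected total SS and df, shared by all models for these data
sst, df_total = 135.46279, 223

def adj_r_sq(sse, df_error):
    """Adjusted R-square = 1 - MSE/MST; penalizes extra predictors."""
    return 1 - (sse / df_error) / (sst / df_total)

print(round(adj_r_sq(107.75046, 220), 4))  # Model #1: hsm hss hse
print(round(adj_r_sq(108.15930, 221), 4))  # Model #2: hsm hse
```

These reproduce the reported Adj R-Sq values, 0.1937 for Model #1 and 0.1943 for Model #2.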
ANOVA Table #2
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         27.30349      13.65175     27.89   <.0001
Error            221        108.15930       0.48941
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
Rerun with HSM only (Model #3)
proc reg data=a1;
  model gpa=hsm;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.90768          0.24355      3.73     0.0002
hsm          1              0.20760          0.02872      7.23     <.0001
Root MSE 0.70280 R-Square 0.1905
Dependent Mean 2.63522 Adj R-Sq 0.1869
Coeff Var 26.66958
Results Model #3
Slightly worse Root MSE and adjusted R-square than Model #2
ANOVA Table #3
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         25.80989      25.80989     52.25   <.0001
Error            222        109.65290       0.49393
Corrected Total  223        135.46279
Significant F test and all variable t tests significant
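With a single predictor, the overall F statistic is the square of the slope's t statistic, so the F test and the slope's t test are the same test. A Python check using the Model #3 output:

```python
# Model #3 (gpa = hsm): slope t statistic and ANOVA F statistic
t = 0.20760 / 0.02872               # estimate / standard error
f = 25.80989 / (109.65290 / 222)    # MS(Model) / MSE
print(round(t, 2), round(t * t, 2), round(f, 2))
```

Both t² and the ANOVA F round to 52.25, the F value in the table above.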
SATs (Model #4)
proc reg data=a1;
  model gpa=satm satv;
run;
Root MSE 0.75770 R-Square 0.0634
Dependent Mean 2.63522 Adj R-Sq 0.0549
Coeff Var 28.75287
Results Model #4
Much worse Root MSE and adjusted R-square than the high school models
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           1.28868          0.37604         3.43     0.0007
satm         1           0.00228          0.00066291      3.44     0.0007
satv         1       -0.00002456          0.00061847     -0.04     0.9684
ANOVA Table #4
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2          8.58384       4.29192      7.48   0.0007
Error            221        126.87895       0.57411
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
HS and SATs (Model #5)
proc reg data=a1;
  model gpa=satm satv hsm hss hse;
  * Does general linear tests;
  sat: test satm, satv;
  hs: test hsm, hss, hse;
run;
Root MSE 0.70000 R-Square 0.2115
Dependent Mean 2.63522 Adj R-Sq 0.1934
Coeff Var 26.56311
Results Model #5
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           0.32672          0.40000         0.82     0.4149
hsm          1           0.14596          0.03926         3.72     0.0003
hss          1           0.03591          0.03780         0.95     0.3432
hse          1           0.05529          0.03957         1.40     0.1637
satm         1        0.00094359          0.00068566      1.38     0.1702
satv         1       -0.00040785          0.00059189     -0.69     0.4915
Test sat
Test sat Results for Dependent Variable gpa
Source        DF   Mean Square   F Value   Pr > F
Numerator      2       0.46566      0.95   0.3882
Denominator  218       0.49000
Cannot reject the reduced model; no significant information is lost, so we don’t need the SAT variables
Test hs
Test hs Results for Dependent Variable gpa
Source        DF   Mean Square   F Value   Pr > F
Numerator      3       6.68660     13.65   <.0001
Denominator  218       0.49000
Reject the reduced model; significant information would be lost, so we can’t remove the HS variables from the model
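Each general linear test compares the full model's error SS to that of the reduced model obtained by dropping the tested variables: F = [(SSE_reduced − SSE_full) / (number of dropped parameters)] / MSE_full. A Python sketch reproducing both F values; note the full-model SSE of 106.81914 is back-computed from the output above (SAS does not print it directly), so treat it as a reconstruction:

```python
def glt_f(sse_reduced, sse_full, df_dropped, mse_full):
    """General linear test F: extra error SS per dropped parameter, over full-model MSE."""
    return ((sse_reduced - sse_full) / df_dropped) / mse_full

# Full model is Model #5 (satm satv hsm hss hse); MSE from the denominator line,
# SSE back-computed from the test output (reconstructed value, not printed by SAS)
sse_full, mse_full = 106.81914, 0.49000

# Test "sat": dropping satm, satv leaves Model #1 (SSE = 107.75046)
print(round(glt_f(107.75046, sse_full, 2, mse_full), 2))
# Test "hs": dropping hsm, hss, hse leaves Model #4 (SSE = 126.87895)
print(round(glt_f(126.87895, sse_full, 3, mse_full), 2))
```

These reproduce the reported F values: 0.95 for the sat test and 13.65 for the hs test.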
Best Model?
• Likely the one with just HSM or the one with HSE and HSM.
• We’ll discuss comparison methods in Chapters 7 and 8
Key ideas from case study
• First, look at graphical and numerical summaries one variable at a time
• Then, look at relationships between pairs of variables with graphical and numerical summaries.
• Use plots and correlations to understand relationships
Key ideas from case study
• The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model
• A variable can be a significant (P < 0.05) predictor alone and not significant (P > 0.05) when other X’s are in the model
Key ideas from case study
• Regression coefficients, standard errors and the results of significance tests depend on what other explanatory variables are in the model
Key ideas from case study
• Significance tests (P values) do not tell the whole story
• Squared multiple correlations (R2, the proportion of variation in the response variable explained by the explanatory variables) can give a different view
• We often express R2 as a percent
Key ideas from case study
• You can fully understand the theory in terms of Y = Xβ + e
• However to effectively use this methodology in practice you need to understand how the data were collected, the nature of the variables, and how they relate to each other
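The matrix form can be made concrete with a tiny least-squares fit. This Python sketch uses made-up numbers and a single predictor for brevity (not the CS data), solving the normal equations (X'X)b = X'Y by hand:

```python
# Toy illustration of Y = X beta + e via the normal equations;
# the data below are made up, not the CS data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Solve the 2x2 normal equations for intercept b0 and slope b1
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
print(round(b0, 2), round(b1, 2))  # intercept and slope estimates
```

With more predictors the same idea applies, but the normal equations become a p+1 dimensional linear system, which is why software works with the matrix form directly.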
Background Reading
• Cs2.sas contains the SAS commands used in this topic