Topic 13: Multiple Linear Regression Example
Jan 04, 2016
Outline
• Description of example
• Descriptive summaries
• Investigation of various models
• Conclusions
Study of CS students
• Too many computer science majors at
Purdue were dropping out of the program
• Wanted to find predictors of success to
be used in the admissions process
• Predictors must be available at the time
of entry into the program.
Data available
• GPA after three semesters
• Overall high school math grade
• Overall high school science grade
• Overall high school English grade
• SAT Math
• SAT Verbal
• Gender (of interest for other reasons)
Data for CS Example
• Y is the student’s grade point average (GPA) after 3 semesters
• 3 HS grades and 2 SAT scores are the explanatory variables (p=6)
• Have n=224 students
Descriptive Statistics
data a1;
  infile 'C:\...\csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
proc means data=a1 maxdec=2;
  var gpa hsm hss hse satm satv;
run;
Output from Proc Means
Variable    N     Mean   Std Dev   Minimum   Maximum
gpa       224     2.64      0.78      0.12      4.00
hsm       224     8.32      1.64      2.00     10.00
hss       224     8.09      1.70      3.00     10.00
hse       224     8.09      1.51      3.00     10.00
satm      224   595.29     86.40    300.00    800.00
satv      224   504.55     92.61    285.00    760.00
Descriptive Statistics
proc univariate data=a1;
  var gpa hsm hss hse satm satv;
  histogram gpa hsm hss hse satm satv /normal;
run;
Correlations
proc corr data=a1;
  var hsm hss hse satm satv;
proc corr data=a1;
  var hsm hss hse satm satv;
  with gpa;
run;
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

          gpa       hsm       hss       hse       satm      satv
gpa    1.00000   0.43650   0.32943   0.28900   0.25171   0.11449
                 <.0001    <.0001    <.0001    0.0001    0.0873
hsm    0.43650   1.00000   0.57569   0.44689   0.45351   0.22112
       <.0001              <.0001    <.0001    <.0001    0.0009
hss    0.32943   0.57569   1.00000   0.57937   0.24048   0.26170
       <.0001    <.0001              <.0001    0.0003    <.0001
hse    0.28900   0.44689   0.57937   1.00000   0.10828   0.24371
       <.0001    <.0001    <.0001              0.1060    0.0002
satm   0.25171   0.45351   0.24048   0.10828   1.00000   0.46394
       0.0001    <.0001    0.0003    0.1060              <.0001
satv   0.11449   0.22112   0.26170   0.24371   0.46394   1.00000
       0.0873    0.0009    <.0001    0.0002    <.0001
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

          hsm       hss       hse       satm      satv
gpa    0.43650   0.32943   0.28900   0.25171   0.11449
       <.0001    <.0001    <.0001    0.0001    0.0873

All but SATV are significantly correlated with GPA
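The p-values above come from testing H0: ρ = 0 with the statistic t = r√(n−2)/√(1−r²), compared to a t distribution with n−2 degrees of freedom. A quick Python check of two of the implied t statistics (a sketch, not part of the SAS program):

```python
import math

def corr_t(r, n):
    """t statistic for testing H0: rho = 0; compare to t(n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(round(corr_t(0.11449, 224), 2))  # satv vs gpa
print(round(corr_t(0.43650, 224), 2))  # hsm vs gpa
```

For SATV, t ≈ 1.72, consistent with its borderline p-value of 0.0873; for HSM, t ≈ 7.23, far in the tail, consistent with p < .0001.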
Scatter Plot Matrix
proc corr data=a1 plots=matrix;
var gpa hsm hss hse satm satv;
run;
• Allows a visual check of pairwise relationships
• No “strong” linear relationships
• Can see the discreteness of the high school grades
Use high school grades to predict GPA (Model #1)
proc reg data=a1;
  model gpa=hsm hss hse;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.58988          0.29424      2.00     0.0462
hsm          1              0.16857          0.03549      4.75     <.0001
hss          1              0.03432          0.03756      0.91     0.3619
hse          1              0.04510          0.03870      1.17     0.2451
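Each t value in the table is simply the parameter estimate divided by its standard error. A quick Python check against the reported values (a sketch, not part of the SAS program):

```python
# Estimates and standard errors for Model #1, taken from the output above
estimates = {
    "hsm": (0.16857, 0.03549),
    "hss": (0.03432, 0.03756),
    "hse": (0.04510, 0.03870),
}
for name, (b, se) in estimates.items():
    print(name, round(b / se, 2))  # reproduces the reported t values
```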
Root MSE 0.69984 R-Square 0.2046
Dependent Mean 2.63522 Adj R-Sq 0.1937
Coeff Var 26.55711
Results Model #1
Is an R-square of only about 20% practically meaningful?
ANOVA Table #1
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         27.71233       9.23744     18.86   <.0001
Error            220        107.75046       0.48977
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
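The summary statistics follow directly from the ANOVA table: R-Square = SS(Model)/SS(Total), MSE = SS(Error)/df(Error) with Root MSE its square root, and F = MS(Model)/MSE. A Python sketch reproducing the reported values:

```python
import math

# ANOVA quantities for Model #1 (gpa = hsm hss hse), from the table above
ss_model, ss_error, ss_total = 27.71233, 107.75046, 135.46279
df_model, df_error = 3, 220

r_sq = ss_model / ss_total          # R-Square
mse = ss_error / df_error           # mean squared error
f = (ss_model / df_model) / mse     # overall F statistic
print(round(r_sq, 4), round(math.sqrt(mse), 5), round(f, 2))
```

These match the reported R-Square 0.2046, Root MSE 0.69984, and F value 18.86.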
Remove HSS (Model #2)
proc reg data=a1;
  model gpa=hsm hse;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.62423          0.29172      2.14     0.0335
hsm          1              0.18265          0.03196      5.72     <.0001
hse          1              0.06067          0.03473      1.75     0.0820
Root MSE 0.69958 R-Square 0.2016
Dependent Mean 2.63522 Adj R-Sq 0.1943
Coeff Var 26.54718
Results Model #2
Slightly better Root MSE and adjusted R-square than Model #1
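The adjusted R-square being compared here can be computed as 1 − MSE/MST, which penalizes a model for carrying extra predictors. A Python sketch using the ANOVA quantities for Models #1 and #2 (total SS 135.46279 on 223 df is the same for every model fit to these data):

```python
# Corrected total SS and df, shared by all models for these data
sst, df_total = 135.46279, 223

def adj_r_sq(sse, df_error):
    """Adjusted R-square = 1 - MSE/MST; penalizes extra predictors."""
    return 1 - (sse / df_error) / (sst / df_total)

print(round(adj_r_sq(107.75046, 220), 4))  # Model #1: hsm hss hse
print(round(adj_r_sq(108.15930, 221), 4))  # Model #2: hsm hse
```

These reproduce the reported Adj R-Sq values, 0.1937 for Model #1 and 0.1943 for Model #2.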
ANOVA Table #2
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         27.30349      13.65175     27.89   <.0001
Error            221        108.15930       0.48941
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
Rerun with HSM only (Model #3)
proc reg data=a1;
  model gpa=hsm;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.90768          0.24355      3.73     0.0002
hsm          1              0.20760          0.02872      7.23     <.0001
Root MSE 0.70280 R-Square 0.1905
Dependent Mean 2.63522 Adj R-Sq 0.1869
Coeff Var 26.66958
Results Model #3
Slightly worse Root MSE and adjusted R-square than Model #2
ANOVA Table #3
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         25.80989      25.80989     52.25   <.0001
Error            222        109.65290       0.49393
Corrected Total  223        135.46279
Significant F test and all variable t tests significant
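With a single predictor, the overall F statistic is the square of the slope's t statistic, so the F test and the slope's t test are the same test. A Python check using the Model #3 output:

```python
# Model #3 (gpa = hsm): slope t statistic and ANOVA F statistic
t = 0.20760 / 0.02872               # estimate / standard error
f = 25.80989 / (109.65290 / 222)    # MS(Model) / MSE
print(round(t, 2), round(t * t, 2), round(f, 2))
```

Both t² and the ANOVA F round to 52.25, the F value in the table above.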
SATs (Model #4)
proc reg data=a1;
  model gpa=satm satv;
run;
Root MSE 0.75770 R-Square 0.0634
Dependent Mean 2.63522 Adj R-Sq 0.0549
Coeff Var 28.75287
Results Model #4
Much worse Root MSE and adjusted R-square than the high school models
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           1.28868          0.37604         3.43     0.0007
satm         1           0.00228          0.00066291      3.44     0.0007
satv         1       -0.00002456          0.00061847     -0.04     0.9684
ANOVA Table #4
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2          8.58384       4.29192      7.48   0.0007
Error            221        126.87895       0.57411
Corrected Total  223        135.46279
Significant F test but not all variable t tests significant
HS and SATs (Model #5)
proc reg data=a1;
  model gpa=satm satv hsm hss hse;
  * Does general linear tests;
  sat: test satm, satv;
  hs: test hsm, hss, hse;
run;
Root MSE 0.70000 R-Square 0.2115
Dependent Mean 2.63522 Adj R-Sq 0.1934
Coeff Var 26.56311
Results Model #5
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           0.32672          0.40000         0.82     0.4149
hsm          1           0.14596          0.03926         3.72     0.0003
hss          1           0.03591          0.03780         0.95     0.3432
hse          1           0.05529          0.03957         1.40     0.1637
satm         1        0.00094359          0.00068566      1.38     0.1702
satv         1       -0.00040785          0.00059189     -0.69     0.4915
Test sat
Test sat Results for Dependent Variable gpa
Source        DF   Mean Square   F Value   Pr > F
Numerator      2       0.46566      0.95   0.3882
Denominator  218       0.49000
Cannot reject the reduced model; no significant information is lost, so we don’t need the SAT variables
Test hs
Test hs Results for Dependent Variable gpa
Source        DF   Mean Square   F Value   Pr > F
Numerator      3       6.68660     13.65   <.0001
Denominator  218       0.49000
Reject the reduced model; significant information would be lost, so we can’t remove the HS variables from the model
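Each general linear test compares the full model's error SS to that of the reduced model obtained by dropping the tested variables: F = [(SSE_reduced − SSE_full) / (number of dropped parameters)] / MSE_full. A Python sketch reproducing both F values; note the full-model SSE of 106.81914 is back-computed from the output above (SAS does not print it directly), so treat it as a reconstruction:

```python
def glt_f(sse_reduced, sse_full, df_dropped, mse_full):
    """General linear test F: extra error SS per dropped parameter, over full-model MSE."""
    return ((sse_reduced - sse_full) / df_dropped) / mse_full

# Full model is Model #5 (satm satv hsm hss hse); MSE from the denominator line,
# SSE back-computed from the test output (reconstructed value, not printed by SAS)
sse_full, mse_full = 106.81914, 0.49000

# Test "sat": dropping satm, satv leaves Model #1 (SSE = 107.75046)
print(round(glt_f(107.75046, sse_full, 2, mse_full), 2))
# Test "hs": dropping hsm, hss, hse leaves Model #4 (SSE = 126.87895)
print(round(glt_f(126.87895, sse_full, 3, mse_full), 2))
```

These reproduce the reported F values: 0.95 for the sat test and 13.65 for the hs test.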
Best Model?
• Likely the one with just HSM or the one with HSE and HSM.
• We’ll discuss comparison methods in Chapters 7 and 8
Key ideas from case study
• First, look at graphical and numerical summaries one variable at a time
• Then, look at relationships between pairs of variables with graphical and numerical summaries.
• Use plots and correlations to understand relationships
Key ideas from case study
• The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model
• A variable can be a significant (P < 0.05) predictor alone and not significant (P > 0.05) when other X’s are in the model
Key ideas from case study
• Regression coefficients, standard errors and the results of significance tests depend on what other explanatory variables are in the model
Key ideas from case study
• Significance tests (P values) do not tell the whole story
• Squared multiple correlations (R2, the proportion of variation in the response variable explained by the explanatory variables) can give a different view
• We often express R2 as a percent
Key ideas from case study
• You can fully understand the theory in terms of Y = Xβ + e
• However to effectively use this methodology in practice you need to understand how the data were collected, the nature of the variables, and how they relate to each other
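The matrix form can be made concrete with a tiny least-squares fit. This Python sketch uses made-up numbers and a single predictor for brevity (not the CS data), solving the normal equations (X'X)b = X'Y by hand:

```python
# Toy illustration of Y = X beta + e via the normal equations;
# the data below are made up, not the CS data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Solve the 2x2 normal equations for intercept b0 and slope b1
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
print(round(b0, 2), round(b1, 2))  # intercept and slope estimates
```

With more predictors the same idea applies, but the normal equations become a p+1 dimensional linear system, which is why software works with the matrix form directly.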
Background Reading
• Cs2.sas contains the SAS commands used in this topic