Topic 14: Inference in Multiple Regression
Outline
• Review multiple linear regression
• Inference of regression coefficients
  – Application to book example
• Inference of mean
  – Application to book example
• Inference of future observation
• Diagnostics and remedies
Data for Multiple Regression
• Yi is the response variable
• Xi1, Xi2, … , Xi,p-1 are the p-1 explanatory variables
• Yi, Xi1, Xi2, … , Xi,p-1 are the data for case i, where i = 1 to n
Multiple Regression Model
• Yi = β0 + β1Xi1 + β2Xi2 +…+ βp-1Xi,p-1 + ei
• Yi is the value of the response variable for the ith case
• β0 is the intercept
• β1, β2, … , βp-1 are the regression coefficients for the explanatory variables
• ei are independent Normally distributed random errors with mean 0 and variance σ²
Least Squares Solutions
$$ b = (X'X)^{-1} X'Y $$
$$ s^2 = \mathrm{MSE} = \frac{Y'(I - H)Y}{n - p} $$
$$ s = \sqrt{\mathrm{MSE}} = \text{Root MSE} $$
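As a rough sketch of these matrix formulas, the PROC IML step below computes b, MSE, and Root MSE directly, with H = X(X'X)^{-1}X' as the hat matrix. It assumes SAS/IML is licensed. It uses only the five cases printed in the partial Proc Print results of the example in this topic, so its numbers will not match the full 21-city fit. The data set name a1_partial and the matrix names are made up for illustration.

```sas
/* Partial data only: five of the 21 cases, for illustration */
data a1_partial;
  input young income sales;
  datalines;
68.5 16.7 174.4
45.2 16.8 164.4
91.3 18.2 244.2
47.8 16.3 154.6
46.9 17.3 181.6
;
run;

proc iml;
  use a1_partial;
  read all var {young income} into X0;
  read all var {sales} into Y;
  close a1_partial;

  n = nrow(X0);
  X = j(n, 1, 1) || X0;                 /* prepend the intercept column */
  p = ncol(X);                          /* number of coefficients, here 3 */

  XtXinv = inv(t(X) * X);
  b   = XtXinv * t(X) * Y;              /* b = (X'X)^{-1} X'Y */
  H   = X * XtXinv * t(X);              /* hat matrix */
  MSE = (t(Y) * (I(n) - H) * Y) / (n - p);
  rootMSE = sqrt(MSE);

  print b MSE rootMSE;
quit;
```

In practice proc reg reports the same quantities. The matrix version is only meant to make the formulas concrete.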
ANOVA F-test
• H0: β1 = β2 = … = βp-1 = 0
• Ha: βk ≠ 0, for at least one k=1,2,…,p-1
• Under H0, F ~ F(p-1,n-p)
• Reject H0 if F is large; in terms of the P-value, reject H0 if the P-value ≤ 0.05
Inference for individual regression coefficients
• We can show b ~ N(β, σ²(X΄X)⁻¹)
• Define
$$ s^2\{b\}_{p \times p} = \mathrm{MSE}\,(X'X)^{-1} $$
$$ s^2\{b_k\} = \left[ s^2\{b\} \right]_{k,k} $$
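One way to inspect these matrices in SAS, sketched here with the Dwaine Studios data file from the example in this topic, is to request them as model options in proc reg. The squared standard errors of the coefficients should then appear on the diagonal of the covariance matrix.

```sas
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;

proc reg data=a1;
  /* xpx  : prints the crossproducts matrix X'X      */
  /* i    : prints (X'X)^{-1}                        */
  /* covb : prints s^2{b} = MSE (X'X)^{-1}           */
  model sales = young income / xpx i covb;
run;
```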
Significance Test for βk
• H0: βk = 0
• Same test statistic t* = bk/s(bk)
• Still use dfE which now equals n-p
• P-value computed from t(n-p) dist
• This tests the significance of a variable given the other variables are already in the model (i.e., fitted last)
Confidence interval for βk
• CI: bk ± tc·s(bk), where tc = t(.975, n-p)
• Same form as before but dfE now equals n-p
• This interval gives a range of plausible values for βk, given that the other variables are in the model
Example II (KNNL p 236)
• Dwaine Studios, Inc. operates portrait studios in 21 cities of medium size
• Yi is sales in city i
• X1 : population aged 16 and under
• X2 : per capita disposable income
$$ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + e_i $$
Read in the data
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;
proc print data=a1;
run;
Partial Proc Print Results
Obs   young   income   sales
  1    68.5     16.7   174.4
  2    45.2     16.8   164.4
  3    91.3     18.2   244.2
  4    47.8     16.3   154.6
  5    46.9     17.3   181.6
Proc Reg
proc reg data=a1;
  model sales=young income;
run;
Output: Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2            24015         12008     99.10   <.0001
Error              18        2180.9274      121.1626
Corrected Total    20            26196

Root MSE   11.00739     R-Square   0.917
At least one variable is helpful in predicting sales
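As a check on how the pieces of the ANOVA table fit together, the summary statistics can be recomputed from the displayed sums of squares and mean squares. The symbols MSR and SSR for the model mean square and sum of squares are notation assumed here, and small differences from the printed values come from rounding in the display.

$$ F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{12008}{121.1626} \approx 99.1, \qquad \text{df} = (p-1,\; n-p) = (2,\; 18) $$
$$ s = \sqrt{\mathrm{MSE}} = \sqrt{121.1626} \approx 11.007 $$
$$ R^2 = \frac{\mathrm{SSR}}{\mathrm{SSTO}} = \frac{24015}{26196} \approx 0.917 $$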
Output: Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -68.85707         60.01695     -1.15     0.2663
young        1              1.45456          0.21178      6.87     <.0001
income       1              9.36550          4.06396      2.30     0.0333
Each variable is helpful in explaining sales after the other is already in the model
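The t values in the parameter estimates table are each estimate divided by its standard error, with n - p = 21 - 3 = 18 error degrees of freedom. A quick check using the reported estimates:

$$ t^*_{\text{young}} = \frac{1.45456}{0.21178} \approx 6.87, \qquad t^*_{\text{income}} = \frac{9.36550}{4.06396} \approx 2.30, \qquad t^*_{\text{intercept}} = \frac{-68.85707}{60.01695} \approx -1.15 $$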
CLB option
• Used to get confidence intervals for each coefficient
proc reg data=a1;
  model sales=young income/clb;
run;
Output: Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
Intercept    1            -68.85707         60.01695   -194.94801    57.23387
young        1              1.45456          0.21178      1.00962     1.89950
income       1              9.36550          4.06396      0.82744    17.90356
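These limits can be reproduced by hand from the estimates and standard errors, taking the critical value for 18 error degrees of freedom as t(.975, 18) ≈ 2.101:

$$ b_1 \pm t_c\, s(b_1) = 1.45456 \pm 2.101 \times 0.21178 \approx 1.45456 \pm 0.4449 \;\Rightarrow\; (1.010,\; 1.899) $$
$$ b_2 \pm t_c\, s(b_2) = 9.36550 \pm 2.101 \times 4.06396 \approx 9.36550 \pm 8.538 \;\Rightarrow\; (0.827,\; 17.904) $$

Neither interval contains 0, which agrees with both t tests being significant at the 0.05 level.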
What if only young is fit?
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
Intercept    1             68.04536          9.46224     48.24066    87.85006
young        1              1.83588          0.14641      1.52943     2.14233
The CIs for both the intercept and young change dramatically when young is the only explanatory variable
Estimation of E(Yh)
• Xh is now a vector that looks like
(1, Xh1, Xh2, … , Xh,p-1)΄
• We want a point estimate and a confidence interval for the subpopulation mean corresponding to the set of explanatory variables Xh
Theory for E(Yh)
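$$ E(Y_h) = \mu_h = X_h'\beta $$
$$ \hat{\mu}_h = X_h' b $$
$$ s^2(\hat{\mu}_h) = X_h'\, s^2\{b\}\, X_h = s^2\, X_h'(X'X)^{-1}X_h $$
$$ \text{CI: } \hat{\mu}_h \pm s(\hat{\mu}_h)\, t(0.975,\, n-p) $$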
Using CLM option
proc reg data=a1;
  model sales=young income/clm;
  id young income;
run;
The id statement adds young and income to the output table
CLM Output
Output Statistics
Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict      95% CL Mean
  1    68.5     16.7             174.4000          187.1841                   3.8409   179.1146   195.2536
  2    45.2     16.8             164.4000          154.2294                   3.5558   146.7591   161.6998
  3    91.3     18.2             244.2000          234.3963                   4.5882   224.7569   244.0358
  4    47.8     16.3             154.6000          153.3285                   3.2331   146.5361   160.1210
  5    46.9     17.3             181.6000          161.3849                   4.4300   152.0778   170.6921
...
 21    52.3     16.0             166.5000          157.0644                   4.0792   148.4944   165.6344
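The confidence limits for the mean follow from the predicted value and its standard error. For observation 1, again using t(.975, 18) ≈ 2.101:

$$ \hat{\mu}_h \pm t_c\, s(\hat{\mu}_h) = 187.1841 \pm 2.101 \times 3.8409 \approx 187.1841 \pm 8.069 \;\Rightarrow\; (179.11,\; 195.25) $$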
Prediction of Yh
• Xh is still a vector of form
(1, Xh1, Xh2, … , Xh,p-1)΄
• We want a prediction of Yh based on a set of predictor values with an interval that expresses the uncertainty in our prediction
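The clm and cli options only report intervals for cases in the data set. A common workaround for a new Xh, sketched below, is to append a case with a missing response: proc reg leaves it out of the fit but still computes its predicted value and limits. The values young = 65.4 and income = 17.6 and the data set names xh and a2 are hypothetical, chosen only for illustration.

```sas
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;

/* Hypothetical new case: sales is missing, so it is not used in fitting */
data xh;
  young = 65.4;
  income = 17.6;
  sales = .;
run;

/* Stack the new case under the original 21 cases */
data a2;
  set a1 xh;
run;

proc reg data=a2;
  model sales = young income / clm cli;   /* limits for the mean and for a new observation */
  id young income;
run;
```

The last row of the output statistics table then corresponds to the new case.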
Theory for Yh
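$$ Y_h = X_h'\beta + e_h $$
$$ \hat{Y}_h = \hat{\mu}_h = X_h' b $$
$$ s^2(\text{pred}) = \widehat{\mathrm{Var}}(Y_h - \hat{Y}_h) = \widehat{\mathrm{Var}}(Y_h) + \widehat{\mathrm{Var}}(\hat{\mu}_h) = s^2\left(1 + X_h'(X'X)^{-1}X_h\right) $$
$$ \text{95\% prediction interval: } \hat{Y}_h \pm s(\text{pred})\, t(.975,\, n-p) $$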
Using the CLI option
proc reg data=a1;
  model sales=young income/cli;
  id young income;
run;
The id statement adds young and income to the output table
CLI Output
Output Statistics
Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict     95% CL Predict
  1    68.5     16.7             174.4000          187.1841                   3.8409   162.6910   211.6772
  2    45.2     16.8             164.4000          154.2294                   3.5558   129.9271   178.5317
  3    91.3     18.2             244.2000          234.3963                   4.5882   209.3421   259.4506
...
 21    52.3     16.0             166.5000          157.0644                   4.0792   132.4018   181.7270
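The table reports the standard error of the mean prediction, not s(pred), so the prediction limits need one extra step. Since s²(pred) equals MSE plus the squared standard error of the mean, for observation 1 with MSE = 121.1626 and t(.975, 18) ≈ 2.101:

$$ s^2(\text{pred}) = \mathrm{MSE} + s^2(\hat{\mu}_h) = 121.1626 + 3.8409^2 \approx 135.915, \qquad s(\text{pred}) \approx 11.658 $$
$$ \hat{Y}_h \pm t_c\, s(\text{pred}) = 187.1841 \pm 2.101 \times 11.658 \approx 187.1841 \pm 24.493 \;\Rightarrow\; (162.69,\; 211.68) $$

The prediction interval is about three times as wide as the interval for the mean at the same Xh, because it also carries the variance of a single new observation.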
Diagnostics
• Look at the distribution of each variable
• Look at the relationship between pairs of variables
• Plot the residuals versus
  – the predicted/fitted values
  – each explanatory variable
  – time (if available)
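One possible way to produce these plots in SAS is sketched below, assuming the Dwaine Studios data set from the example. The output data set name diag and the variable names pred and resid are made up for illustration.

```sas
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;

/* Pairwise scatterplots of all variables */
proc sgscatter data=a1;
  matrix young income sales;
run;

/* Save fitted values and residuals for plotting */
proc reg data=a1;
  model sales = young income;
  output out=diag p=pred r=resid;
run;

/* Residuals versus fitted values, with a reference line at 0 */
proc sgplot data=diag;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;

/* Residuals versus one explanatory variable; repeat with x=income */
proc sgplot data=diag;
  scatter x=young y=resid;
  refline 0 / axis=y;
run;
```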
Diagnostics
• Are the residuals approximately Normal?
  – Look at a histogram
  – Look at a Normal quantile plot
• Is the variance constant?
  – Plot the residuals vs anything that might be related to the variance (e.g. residuals vs predicted values and residuals vs each X)
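A minimal sketch of the Normality checks, assuming the same data file as the example, is to save the residuals and pass them to proc univariate. The names diag and resid are illustrative.

```sas
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;

proc reg data=a1;
  model sales = young income;
  output out=diag r=resid;    /* save the residuals */
run;

proc univariate data=diag noprint;
  histogram resid / normal;                     /* histogram with a fitted Normal curve */
  qqplot resid / normal(mu=est sigma=est);      /* Normal quantile plot with reference line */
run;
```

With only 21 cases, these plots can show gross departures from Normality but not subtle ones.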
Remedies
• Similar remedies as simple regression
• Transformations such as Box-Cox
• Analyze with/without outliers
• More detail in KNNL Ch 9 and 10
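For the Box-Cox remedy, one option in SAS is proc transreg, sketched here on the example data. This is an illustration under the assumption that the response is strictly positive, as sales is here. It is not necessarily the approach used in the course program.

```sas
data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;
run;

/* Searches a grid of lambda values for the Box-Cox transformation of sales, */
/* keeping the explanatory variables untransformed                           */
proc transreg data=a1;
  model boxcox(sales) = identity(young income);
run;
```

If the reported interval for lambda includes 1, the output gives little reason to transform the response.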
Background Reading
• We finished Chapter 6.
• The program used to generate the output for confidence intervals for means and for prediction intervals is topic14.sas