Topic 14: Inference in Multiple Regression. Outline Review multiple linear regression Inference of regression coefficients –Application to book example.

Topic 14: Inference in Multiple Regression

Outline

• Review multiple linear regression

• Inference of regression coefficients
– Application to book example

• Inference of mean
– Application to book example

• Inference of future observation

• Diagnostics and remedies

Data for Multiple Regression

• Yi is the response variable

• Xi1, Xi2, … , Xi,p-1 are the p-1 explanatory variables

• Yi, Xi1, Xi2, … , Xi,p-1 are the data for case i, where i = 1 to n

Multiple Regression Model

• Yi = β0 + β1Xi1 + β2Xi2 +…+ βp-1Xi,p-1 + ei

• Yi is the value of the response variable for the ith case

• β0 is the intercept

• β1, β2, … , βp-1 are the regression coefficients for the explanatory variables

• ei are independent Normally distributed random errors with mean 0 and variance σ²

Least Squares Solutions

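$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$

$s^2 = MSE = \mathbf{Y}'(\mathbf{I} - \mathbf{H})\mathbf{Y} / (n - p)$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is the hat matrix

$s = \sqrt{MSE}$ = Root MSE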

ANOVA F-test

• H0: β1 = β2 = … = βp-1 = 0

• Ha: βk ≠ 0, for at least one k=1,2,…,p-1

• Test statistic F = MSM/MSE; under H0, F ~ F(p-1, n-p)

• Reject H0 if F is large; in practice, reject when the P-value ≤ 0.05

Inference for individual regression coefficients

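• We can show $\mathbf{b} \sim N(\boldsymbol{\beta},\ \sigma^2(\mathbf{X}'\mathbf{X})^{-1})$

• Define

$\mathbf{s}^2\{\mathbf{b}\}_{p \times p} = MSE\,(\mathbf{X}'\mathbf{X})^{-1}$

$s^2\{b_k\} = [\mathbf{s}^2\{\mathbf{b}\}]_{k,k}$ (the diagonal element corresponding to $b_k$, with rows and columns indexed from 0)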

Significance Test for βk

• H0: βk = 0

• Same test statistic t* = bk/s(bk)

• Still use dfE which now equals n-p

• P-value computed from the t(n-p) distribution

• This tests the significance of a variable given the other variables are already in the model (i.e., fitted last)

Confidence interval for βk

• CI: bk ± tc s(bk), where tc = t(.975, n-p)

• Same form as before but dfE now equals n-p

• This interval gives a range of plausible values for βk given that the other variables are in the model

Example II (KNNL p 236)

• Dwaine Studios, Inc. operates portrait studios in 21 cities of medium size

• Yi is sales in city i

• X1 : population aged 16 and under

• X2 : per capita disposable income

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + e_i$

Read in the data

data a1;
  infile '../data/ch06fi05.txt';
  input young income sales;

proc print data=a1;
run;

Partial Proc Print Results

Obs   young   income   sales
1     68.5    16.7     174.4
2     45.2    16.8     164.4
3     91.3    18.2     244.2
4     47.8    16.3     154.6
5     46.9    17.3     181.6

Proc Reg

proc reg data=a1;
  model sales=young income;
run;

Output

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   24015            12008         99.10     <.0001
Error             18   2180.9274        121.1626
Corrected Total   20   26196

Root MSE   11.00739     R-Square   0.917

At least one variable is helpful in predicting sales
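As a quick arithmetic check, the summary quantities in the ANOVA output can be recomputed from the sums of squares and degrees of freedom. The sketch below is only a check on the printed numbers, not part of the SAS program; the variable names are illustrative, and the model and total sums of squares are used as rounded in the output, so small rounding differences are expected.

```python
import math

# Values copied from the Analysis of Variance output above.
ss_model, df_model = 24015.0, 2      # rounded as printed
ss_error, df_error = 2180.9274, 18
ss_total = 26196.0                   # rounded as printed

ms_model = ss_model / df_model       # MSM
mse = ss_error / df_error            # MSE = s^2
f_stat = ms_model / mse              # F = MSM / MSE
root_mse = math.sqrt(mse)            # s = Root MSE
r_square = 1 - ss_error / ss_total   # R-Square = 1 - SSE / SSTO

print(f"MSE = {mse:.4f}, Root MSE = {root_mse:.5f}")
print(f"F = {f_stat:.2f}, R-Square = {r_square:.3f}")
```

The degrees of freedom also match the model: p = 3 parameters and n = 21 cities give p-1 = 2 and n-p = 18.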

Output

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -68.85707            60.01695         -1.15     0.2663
young        1   1.45456              0.21178          6.87      <.0001
income       1   9.36550              4.06396          2.30      0.0333

Both variables are helpful in explaining sales after the other is already in the model
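Each t value in the Parameter Estimates output is the estimate divided by its standard error, t* = bk/s(bk). A minimal sketch that reproduces the t Value column from the printed estimates (the dictionary layout is just for illustration):

```python
# (estimate, standard error) pairs copied from the Parameter Estimates output.
estimates = {
    "Intercept": (-68.85707, 60.01695),
    "young": (1.45456, 0.21178),
    "income": (9.36550, 4.06396),
}

# t* = b_k / s(b_k); each is compared to a t distribution with n - p = 18 df.
t_values = {name: b / se for name, (b, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:10s} t* = {t:6.2f}")
```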

CLB option

• Used to get confidence intervals for each coefficient

proc reg data=a1;
  model sales=young income/clb;
run;

Output

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
Intercept    1   -68.85707            60.01695         -194.94801   57.23387
young        1   1.45456              0.21178          1.00962      1.89950
income       1   9.36550              4.06396          0.82744      17.90356
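The confidence limits can be reproduced by hand from bk ± tc s(bk). The sketch below assumes the critical value t(0.975, 18) = 2.100922, hard-coded from a t table rather than computed, so results agree with SAS only up to rounding.

```python
# Assumption: t(0.975, 18) taken from a t table, not computed here.
T_CRIT = 2.100922

# (estimate, standard error) pairs copied from the Parameter Estimates output.
estimates = {
    "Intercept": (-68.85707, 60.01695),
    "young": (1.45456, 0.21178),
    "income": (9.36550, 4.06396),
}

# 95% CI for each coefficient: b_k +/- t_c * s(b_k)
intervals = {
    name: (b - T_CRIT * se, b + T_CRIT * se)
    for name, (b, se) in estimates.items()
}

for name, (lo, hi) in intervals.items():
    print(f"{name:10s} [{lo:11.5f}, {hi:10.5f}]")
```

The interval for the intercept contains 0, which agrees with its P-value of 0.2663; the intervals for young and income do not contain 0.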

What if only young is fit?

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
Intercept    1   68.04536             9.46224          48.24066   87.85006
young        1   1.83588              0.14641          1.52943    2.14233

CIs for both the intercept and young change dramatically when young is the only explanatory variable

Estimation of E(Yh)

• Xh is now a vector that looks like

(1, Xh1, Xh2, … , Xh,p-1)΄

• We want a point estimate and a confidence interval for the subpopulation mean corresponding to the set of explanatory variables Xh

Theory for E(Yh)

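$E(Y_h) = \mu_h = \mathbf{X}_h'\boldsymbol{\beta}$

$\hat{\mu}_h = \mathbf{X}_h'\mathbf{b}$

$s^2(\hat{\mu}_h) = \mathbf{X}_h'\,\mathbf{s}^2\{\mathbf{b}\}\,\mathbf{X}_h = s^2\,\mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h$

CI: $\hat{\mu}_h \pm s(\hat{\mu}_h)\, t(0.975,\ n-p)$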

Using CLM option

proc reg data=a1;
  model sales=young income/clm;
  id young income;
run;

The id statement adds young and income to the output table

CLM Output

Output Statistics

Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean
1     68.5    16.7     174.4000             187.1841          3.8409                   179.1146   195.2536
2     45.2    16.8     164.4000             154.2294          3.5558                   146.7591   161.6998
3     91.3    18.2     244.2000             234.3963          4.5882                   224.7569   244.0358
4     47.8    16.3     154.6000             153.3285          3.2331                   146.5361   160.1210
5     46.9    17.3     181.6000             161.3849          4.4300                   152.0778   170.6921
...
21    52.3    16.0     166.5000             157.0644          4.0792                   148.4944   165.6344
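To connect the CLM output back to the theory, the first row can be rebuilt by hand: the predicted value is Xh΄b using the fitted coefficients, and the limits are the predicted value plus or minus t(0.975, n-p) times the standard error of the mean. This sketch takes the standard error as printed by SAS instead of computing it from (X΄X)⁻¹, and hard-codes the t critical value as an assumption.

```python
# Coefficients from the Parameter Estimates output (model with young and income).
b0, b_young, b_income = -68.85707, 1.45456, 9.36550

# Observation 1 from the CLM output.
young, income = 68.5, 16.7
se_mean = 3.8409          # "Std Error Mean Predict" as printed

# Assumption: t(0.975, 18) taken from a t table, not computed here.
T_CRIT = 2.100922

# Point estimate of E(Y_h): X_h' b
mu_hat = b0 + b_young * young + b_income * income

# 95% confidence interval for the mean response at X_h
ci_mean = (mu_hat - T_CRIT * se_mean, mu_hat + T_CRIT * se_mean)

print(f"Predicted value = {mu_hat:.4f}")
print(f"95% CL Mean     = [{ci_mean[0]:.4f}, {ci_mean[1]:.4f}]")
```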

Prediction of Yh

• Xh is still a vector of form

(1, Xh1, Xh2, … , Xh,p-1)΄

• We want a prediction of Yh based on a set of predictor values with an interval that expresses the uncertainty in our prediction

Theory for Yh

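$Y_h = \mathbf{X}_h'\boldsymbol{\beta} + e_h$

$\hat{Y}_h = \hat{\mu}_h = \mathbf{X}_h'\mathbf{b}$

$s^2(\text{pred}) = \mathrm{Var}(Y_h - \hat{\mu}_h) = \mathrm{Var}(Y_h) + \mathrm{Var}(\hat{\mu}_h)$, with the variances replaced by their estimates

$s^2(\text{pred}) = s^2\,(1 + \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h)$

Prediction interval: $\hat{Y}_h \pm s(\text{pred})\, t(.975,\ n-p)$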

Using the CLI option

proc reg data=a1;
  model sales=young income/cli;
  id young income;
run;

The id statement adds young and income to the output table

CLI Output

Output Statistics

Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Predict
1     68.5    16.7     174.4000             187.1841          3.8409                   162.6910   211.6772
2     45.2    16.8     164.4000             154.2294          3.5558                   129.9271   178.5317
3     91.3    18.2     244.2000             234.3963          4.5882                   209.3421   259.4506
...
21    52.3    16.0     166.5000             157.0644          4.0792                   132.4018   181.7270
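The prediction limits are wider than the limits for the mean because s²(pred) adds the error variance to the variance of the estimated mean. As a rough check on the first row, the sketch below uses s²(pred) = MSE + s²(μ̂h), with MSE and the standard error of the mean taken as printed in the earlier output, and a hard-coded t critical value as an assumption.

```python
import math

mse = 121.1626            # MSE from the Analysis of Variance output
y_hat = 187.1841          # Predicted Value for observation 1
se_mean = 3.8409          # Std Error Mean Predict for observation 1

# Assumption: t(0.975, 18) taken from a t table, not computed here.
T_CRIT = 2.100922

# s^2(pred) = MSE + s^2(mu_hat_h) = s^2 (1 + X_h'(X'X)^{-1} X_h)
s_pred = math.sqrt(mse + se_mean ** 2)

# 95% prediction interval for a new observation at X_h
pi = (y_hat - T_CRIT * s_pred, y_hat + T_CRIT * s_pred)

print(f"s(pred) = {s_pred:.4f}")
print(f"95% CL Predict = [{pi[0]:.4f}, {pi[1]:.4f}]")
```

Here s(pred) is roughly three times the standard error of the mean, so the error variance dominates and the prediction interval changes little from one city to the next.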

Diagnostics

• Look at the distribution of each variable

• Look at the relationship between pairs of variables

• Plot the residuals versus
– the predicted/fitted values
– each explanatory variable
– time (if available)

Diagnostics

• Are the residuals approximately Normal?

–Look at a histogram

–Normal quantile plot

• Is the variance constant?

–Plot the residuals vs anything that might be related to the variance (e.g. residuals vs predicted values & residuals versus each X)

Remedies

• Similar remedies as simple regression

• Transformations such as Box-Cox

• Analyze with/without outliers

• More detail in KNNL Ch 9 and 10

Background Reading

• We finished Chapter 6.

• Program used to generate output for confidence intervals for means and prediction intervals is topic14.sas
