Regression: Checking the Model Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research.

Jan 05, 2016

Transcript
Page 1:

Regression: Checking the Model

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Statistics for Health Research

Page 2:

Objectives of session

• Recognise the need to check fit of the model

• Carry out checks of assumptions in SPSS for simple linear regression

• Understand predictive model

• Understand residuals

Page 3:

How is the fitted line obtained?

Use method of least squares (LS)

Seek to minimise squared vertical differences between each point and fitted line

Results in parameter estimates or regression coefficients of slope (b) and intercept (a): y = a + bx
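The least-squares idea above can be sketched in a few lines of Python. This is only an illustration of the standard closed-form solution for a single predictor, not SPSS's own implementation; the function name and the small data set are hypothetical and are not taken from the LDL study.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises the sum of
    squared vertical differences between the points and the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def sum_squared_differences(xs, ys, a, b):
    """Sum of squared vertical differences for a candidate line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data (NOT from the LDL study)
ages = [40, 50, 60, 70, 80]
ldl = [1.9, 1.6, 1.7, 1.4, 1.5]

a, b = least_squares_line(ages, ldl)
print(f"intercept a = {a:.3f}, slope b = {b:.4f}")
```

Any other line through the same points, for example one with a slightly different slope, gives a larger value of `sum_squared_differences`.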

Page 4:

Consider fitted line of y = a + bx

[Figure: fitted line with Explanatory (x) on the horizontal axis and Dependent (y) on the vertical axis; the intercept a is marked.]

Page 5:

Consider the regression of minimum LDL cholesterol achieved on age

• Select Regression, then Linear…

• Dependent (y): Min LDL achieved

• Independent (x): Age_Base

Page 6:

Output from SPSS linear regression

Coefficients(a)

Model                 Unstandardized Coefficients   Standardized Coefficients
                      B         Std. Error          Beta       t         Sig.
1  (Constant)         2.024     .105                           19.340    .000
   Age at baseline    -.008     .002                -.121      -4.546    .000

a. Dependent Variable: Min LDL achieved

N.B. -0.008 may look very small but represents:

The DECREASE in LDL achieved for each increase in one unit of age, i.e. ONE year

Page 7:

Output from SPSS linear regression

Coefficients(a)

Model                 Unstandardized Coefficients   Standardized Coefficients
                      B         Std. Error          Beta       t         Sig.
1  (Constant)         2.024     .105                           19.340    .000
   Age at baseline    -.008     .002                -.121      -4.546    .000

a. Dependent Variable: Min LDL achieved

H0: slope b = 0

Test t = slope/se = -4.546 with p<0.001, so statistically significant. (SPSS computes t from the unrounded estimates; the rounded values shown in the table give -0.008/0.002 = -4.)

Predicted LDL = 2.024 - 0.008 x Age
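As a rough check on the test statistic, the sketch below divides the printed slope by its printed standard error and then shows the range of t values that would be consistent with three-decimal rounding. Only the printed values come from the SPSS table; the variable names and the rounding-interval reasoning are illustrative assumptions.

```python
# Values as printed (rounded to three decimals) in the SPSS Coefficients table
slope_printed = -0.008
se_printed = 0.002
t_reported = -4.546

# t from the rounded values: this is -4, not -4.546
t_from_printed = slope_printed / se_printed

# Each printed value could be off by up to half a unit in the third decimal,
# so the unrounded ratio can lie anywhere in this interval (assumption:
# ordinary rounding to three decimal places).
half_unit = 0.0005
t_most_negative = (slope_printed - half_unit) / (se_printed - half_unit)
t_least_negative = (slope_printed + half_unit) / (se_printed + half_unit)

print(f"t from printed values: {t_from_printed:.3f}")
print(f"range consistent with rounding: {t_most_negative:.3f} to {t_least_negative:.3f}")
print(f"reported t inside range: {t_most_negative < t_reported < t_least_negative}")
```

The reported t of -4.546 falls inside that interval, which is why it need not equal the ratio of the rounded figures.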

Page 8:

Prediction equation from linear regression

Predicted LDL achieved = 2.024 - 0.008 x Age

So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 x 65 = 1.504

Age    Predicted Min LDL
45     1.664
55     1.584
65     1.504
75     1.424
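The prediction table can be reproduced directly from the fitted equation. The sketch below uses the rounded coefficients from the SPSS output; the function and constant names are hypothetical.

```python
INTERCEPT = 2.024   # (Constant) from the SPSS Coefficients table
SLOPE = -0.008      # Age at baseline coefficient (per year of age)


def predict_min_ldl(age):
    """Predicted minimum LDL achieved for a given baseline age in years."""
    return INTERCEPT + SLOPE * age


for age in (45, 55, 65, 75):
    print(f"{age:>3}  {predict_min_ldl(age):.3f}")
```

Each ten-year step in age lowers the prediction by 10 x 0.008 = 0.08.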

Page 9:

Assumptions of Regression

1. Relationship is linear

2. Outcome variable and hence residuals or error terms are approx. Normally distributed

Page 10:

Use Graphs and Scatterplot to obtain the Lowess line of fit

Page 11:

Use Graphs and Scatterplot to obtain the Lowess line of fit

1. Create Scatterplot and then double-click to enter chart editor

2. Choose icon 'Add fit line at total'

3. Then select type of fit such as Lowess

Page 12:

Linear assumption: Fitted lowess smoothed line

Lowess smoothed line (red) gives a good eyeball examination of linear assumption (green)

Page 13:

Definition of a residual

A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation

r_i = y_i - E(y_i)

Or

r_i = y_i - (a + b x_i)
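A minimal sketch of the residual definition, assuming a least-squares fit to a small hypothetical data set (the values are not from the LDL study, and the helper name is illustrative). It also shows a property used later when reading residual plots: with an intercept in the model, least-squares residuals sum to zero.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


# Hypothetical illustration data (NOT from the LDL study)
ages = [40, 50, 60, 70, 80]
ldl = [1.9, 1.6, 1.7, 1.4, 1.5]

a, b = fit_line(ages, ldl)

# r_i = y_i - (a + b * x_i): actual value minus value predicted by the line
residuals = [y - (a + b * x) for x, y in zip(ages, ldl)]

for x, y, r in zip(ages, ldl, residuals):
    print(f"age {x}: observed {y:.2f}, predicted {a + b * x:.2f}, residual {r:+.2f}")

print(f"mean residual = {sum(residuals) / len(residuals):.6f}")
```

A positive residual means the observed value lies above the fitted line; a negative one means it lies below.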

Page 14:

Residuals

Page 15:

To assess the residuals in SPSS linear regression, select Plots…

• Normalised or standardised predicted value of LDL

• Normalised residual

• Select histogram of residuals and normal probability plot

Page 16:

In SPSS linear regression, select Statistics…

• Select confidence intervals for regression coefficients

• Model fit

• Select Durbin-Watson for serial correlation and identification of outliers

Page 17:

Output: Scatterplot of residuals vs. predicted

Note

1) Mean of residuals = 0

2) Most of data lie within + or - 3 SDs of mean

Page 18:

Assumptions of Regression

1. Relationship is linear

2. Outcome variable and hence residuals or error terms are approx. Normally distributed

Page 19:

Output: Histogram of standardised residuals

Plot of residuals with normal curve super-imposed

Page 20:

Output: Cumulative probability plot

Look for deviation from diagonal line to indicate non-normality
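To make the cumulative probability plot concrete, the sketch below computes the pairs of points such a plot displays: the observed cumulative proportion of each sorted residual against the cumulative probability expected under a fitted Normal distribution. The residuals are hypothetical, and the plotting position (i + 0.5)/n is an assumption chosen for simplicity; SPSS may use a different proportion-estimation formula.

```python
from statistics import NormalDist, mean, stdev


def pp_plot_points(residuals):
    """Return (observed, expected) cumulative probabilities for a P-P plot.

    observed: (i + 0.5) / n for the i-th smallest residual (assumed formula)
    expected: Normal CDF at that residual, using the sample mean and SD
    """
    n = len(residuals)
    normal = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    return [((i + 0.5) / n, normal.cdf(r)) for i, r in enumerate(ordered)]


# Hypothetical residuals (NOT from the LDL study)
residuals = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 1.1]

points = pp_plot_points(residuals)

# Largest vertical distance from the diagonal (observed == expected)
max_deviation = max(abs(obs - exp) for obs, exp in points)

for obs, exp in points:
    print(f"observed {obs:.3f}   expected {exp:.3f}")
print(f"largest deviation from diagonal: {max_deviation:.3f}")
```

Points close to the diagonal suggest approximately Normal residuals; a systematic bow or S-shape suggests skewness or heavy tails.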

Page 21:

Output: Description of residuals

Descriptive statistics for residuals

Residuals Statistics(a)

                        Minimum     Maximum      Mean        Std. Deviation   N
Predicted Value         1.314867    1.843205     1.556478    .0878548         1383
Residual                -1.65389    4.0658469    .0000000    .7181448         1383
Std. Predicted Value    -2.750      3.264        .000        1.000            1383
Std. Residual           -2.302      5.660        .000        1.000            1383

a. Dependent Variable: Min LDL achieved

Subjects with standardised residuals > 3

Worth investigation?

Casewise Diagnostics(a)

Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933

a. Dependent Variable: Min LDL achieved
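The casewise figures can be cross-checked by hand. The sketch below assumes that the standardised residual is the raw residual divided by the "Std. Error of the Estimate" reported in the Model Summary (.7184048), and that the residual is observed minus predicted. All numbers are copied from the SPSS output; the variable names and tolerances are illustrative.

```python
STD_ERROR_OF_ESTIMATE = 0.7184048  # from the SPSS Model Summary table

# (case number, std. residual, min LDL, predicted, residual) from Casewise Diagnostics
cases = [
    (164, 5.660, 5.5840, 1.518153, 4.0658471),
    (209, 4.395, 4.5260, 1.368685, 3.1573148),
    (250, 3.143, 3.7875, 1.529325, 2.2581750),
    (268, 3.064, 3.8730, 1.671664, 2.2013357),
    (274, 3.227, 4.0953, 1.777153, 2.3180975),
    (362, 4.095, 4.5350, 1.593460, 2.9415398),
    (517, 3.636, 4.3240, 1.711788, 2.6122125),
    (849, 3.968, 4.3290, 1.478113, 2.8508873),
    (1047, 4.207, 4.4360, 1.413686, 3.0223141),
    (1075, 3.885, 4.4040, 1.613219, 2.7907805),
    (1103, 3.519, 3.9905, 1.462584, 2.5279157),
    (1229, 3.016, 3.7660, 1.599254, 2.1667456),
    (1290, 3.975, 4.2345, 1.379107, 2.8553933),
]

for case, std_resid, observed, predicted, residual in cases:
    recomputed_residual = observed - predicted
    recomputed_std = residual / STD_ERROR_OF_ESTIMATE
    print(f"case {case:>4}: residual {recomputed_residual:.4f} "
          f"(table {residual:.4f}), std. residual {recomputed_std:.3f} "
          f"(table {std_resid:.3f})")

# The same rule SPSS applied here: flag |standardised residual| > 3
flagged = [case for case, std_resid, *_ in cases if abs(std_resid) > 3]
print(f"{len(flagged)} of 1383 subjects flagged")
```

All 13 flagged subjects have positive residuals, i.e. their achieved LDL was much higher than the model predicted.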

Page 22:

Output: Model fit and serial correlation

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034

a. Predictors: (Constant), Age at baseline

R: correlation between min LDL achieved and Age at baseline, here 0.121 (reported as a magnitude; the slope is negative)

R squared: % variation explained, here 1.5%, not particularly high

Durbin-Watson statistic: tests serial correlation of residuals; should be approximately 2 if no serial correlation
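Two of the Model Summary figures are easy to illustrate. The sketch below squares the reported R to recover R squared, and implements the usual Durbin-Watson formula (sum of squared successive differences of the residuals divided by the sum of squared residuals). The two short residual sequences are hypothetical and only show how the statistic moves away from 2.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in their recorded order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# R squared from the reported R (SPSS Model Summary)
r = 0.121
r_squared = r ** 2
print(f"R squared = {r_squared:.3f}, i.e. about {100 * r_squared:.1f}% of variation explained")

# Hypothetical residual sequences (NOT from the LDL study)
alternating = [1, -1, 1, -1]         # negative serial correlation
clustered = [1, 1, 1, -1, -1, -1]    # positive serial correlation

print(f"alternating residuals: DW = {durbin_watson(alternating):.3f}")  # above 2
print(f"clustered residuals:   DW = {durbin_watson(clustered):.3f}")    # below 2
```

Values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation; the reported 2.034 is close to 2.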

Page 23:

Summary

After fitting any regression model check assumptions:

• Functional form: linearity is default, often not best fit, consider quadratic…

• Check residuals for approx. normality

• Check residuals for outliers (> 3 SDs)

• All accomplished within SPSS

Page 24:

Practical on Model Checking

Read in 'LDL Data.sav'

1) Fit age squared term in min LDL model and check fit of model compared to linear fit (Hint: Use transform/compute to create age squared term and fit age and age^2)

2) Fit separate linear regressions with min Chol achieved with predictors of 1) baseline Chol 2) APOE_lin 3) adherence

Check assumptions and interpret results
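For readers working outside SPSS, the first practical task can be sketched in plain Python. The data below are hypothetical (the contents of 'LDL Data.sav' are not reproduced here), the helper names are illustrative, and age is centred at 60 before squaring purely to keep the arithmetic well conditioned; centring is an assumption of this sketch and does not change the R squared of the quadratic model.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ...] for y = c0 + c1*x + c2*x^2 + ..."""
    design = [[x ** p for p in range(degree + 1)] for x in xs]
    size = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def r_squared(xs, ys, coefs):
    """Proportion of variation in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    fitted = [sum(c * x ** p for p, c in enumerate(coefs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data with a built-in curve (NOT from 'LDL Data.sav')
ages = [40, 45, 50, 55, 60, 65, 70, 75, 80]
age_c = [age - 60 for age in ages]                       # centred age
min_ldl = [1.6 - 0.008 * x + 0.0004 * x ** 2 for x in age_c]

linear_coefs = fit_polynomial(age_c, min_ldl, 1)         # age only
quadratic_coefs = fit_polynomial(age_c, min_ldl, 2)      # age and age squared

r2_linear = r_squared(age_c, min_ldl, linear_coefs)
r2_quadratic = r_squared(age_c, min_ldl, quadratic_coefs)
print(f"R squared, linear: {r2_linear:.3f}; quadratic: {r2_quadratic:.3f}")
```

Adding the squared term can never lower R squared, so the comparison of interest is whether the improvement is large enough to matter and whether the residual checks look better for the quadratic model.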