Multiple Regression 5 Sociology 5811 Lecture 26 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission
Jan 01, 2016

Dylan Gallagher
Multiple Regression 5

Sociology 5811 Lecture 26

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission

Announcements

• Schedule:
    – Today: Multiple regression hypothesis tests, assumptions, and problems
• Reminder: Paper due on Thursday
• Questions about the paper?

Multiple Regression Assumptions

• As discussed in Knoke, p. 256
• Note: Allison refers to error (e) as disturbance (U) and uses slightly different language, but the ideas are the same!
• 1. a. Linearity: The relationship between the dependent and independent variables is linear
    • Just like bivariate regression
    • Points don’t all have to fall exactly on the line, but error (disturbance) must not have a pattern
    – Check scatterplots of X’s and error (residual)
        • Watch out for non-linear trends: error is systematically negative (or positive) for certain ranges of X
        • There are strategies to cope with non-linearity, such as including X and X-squared to model a curved relationship.
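
The X and X-squared strategy can be sketched as follows. This is a minimal illustration with made-up data (the values of x and y and the helper name fit are assumptions, not course data): a straight-line fit leaves a systematic pattern in the residuals, and adding X-squared removes it.

```python
import numpy as np

# Hypothetical data: Y rises with X and then levels off (a curved relationship).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 + 3.0 * x - 0.4 * x**2          # exact quadratic, no noise, for illustration

def fit(design, y):
    """Ordinary least squares; returns coefficients and residuals."""
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs, y - design @ coefs

ones = np.ones_like(x)
b_lin, e_lin = fit(np.column_stack([ones, x]), y)           # Y = a + b1*X
b_quad, e_quad = fit(np.column_stack([ones, x, x**2]), y)   # Y = a + b1*X + b2*X^2

# Linear-only residuals are patterned: negative at both ends, positive in the middle.
print(np.round(e_lin, 2))
# With X-squared included, the residuals are essentially zero.
print(np.round(e_quad, 2))
```

In real data the residuals will not vanish, but the systematic negative/positive ranges should disappear once the curved term is in the model.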

Multiple Regression Assumptions

• 1. b. And, the model is properly specified:
    – No extra variables are included in the model, and no important variables are omitted. This is HARD!
• Correct model specification is critical
    • If an important variable is left out of the model, results are biased (“omitted variable bias”)
    – Example: If we model job prestige as a function of family wealth, but do not include education
        • Coefficient estimate for wealth would be biased
    – Use theory and previous research to decide what critical variables must be included in your model.

Multiple Regression Assumptions

• Correct model specification is critical
    – If an important variable is left out of the model, results are biased
        • This is called “omitted variable bias”
    – Example: If we model job prestige as a function of family wealth, but do not include education
        • Coefficient estimate for wealth would be biased
    – Use theory and previous research to help you identify critical variables
        • For final paper, it is OK if model isn’t perfect.

Multiple Regression Assumptions

• 2. All variables are measured without error

• Unfortunately, error is common in measures
    – Survey questions can be biased
    – People give erroneous responses (or lie)
    – Aggregate statistics (e.g., GDP) can be inaccurate
• This assumption is often violated to some extent
    – We do the best we can:
    – Design surveys well, use best available data
    – And, there are advanced methods for dealing with measurement error.

Multiple Regression Assumptions

• 3. The error term (ei) has certain properties
    • Recall: error is a case’s deviation from the regression line
    • Not the same as measurement error!
    • After you run a regression, SPSS can tell you the error value for any or all cases (called the “residual”)
• 3. a. Error is conditionally normal
    – For bivariate, we looked to see if Y was conditionally normal… Here, we look to see if error is normal
    – Examine “residuals” (ei) for normality at different values of X variables.

Multiple Regression Assumptions

• 3. b. The error term (ei) has a mean of 0

– This affects the estimate of the constant. (Not a huge problem)

• 3. c. The error term (ei) is homoskedastic (has constant variance)
    – Note: This affects standard error estimates, hypothesis tests
    – Look at residuals, to see if they spread out with changing values of X
        • Or plot standardized residuals vs. standardized predicted values.
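
A rough numeric version of the residual check might look like the sketch below. The data are simulated (the seed, sample size, and coefficients are all assumptions chosen so that the error spread grows with X), and comparing the spread of standardized residuals for low versus high predicted values is only a crude stand-in for looking at the plot.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n = 500
x = rng.uniform(1, 10, n)
e = rng.normal(0, 0.5 * x)              # error spread grows with X (heteroskedastic)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ b
resid = y - pred

# Standardize predicted values and residuals, as in the plot described above.
z_pred = (pred - pred.mean()) / pred.std()
z_resid = resid / resid.std()

# Under homoskedasticity these two spreads should be about equal.
low = z_resid[z_pred < 0].std()
high = z_resid[z_pred >= 0].std()
print(round(low, 2), round(high, 2))
```

A clearly larger spread on one side suggests the fan shape that signals heteroskedasticity; the plot itself remains the better diagnostic.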

Multiple Regression Assumptions

• 3. d. Predictors (Xis) are uncorrelated with error

– This assumption is most often violated when we leave out an important variable that is correlated with another Xi

– Example: Predicting job prestige with family wealth, but not including education

– Omission of education will affect error term. Those with lots of education will have large positive errors.

• Since wealth is correlated with education, it will be correlated with that error!

– Result: coefficient for family wealth will be biased.
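
The wealth/education example can be illustrated with a small simulation. Everything here is hypothetical (the variable names, the true coefficients of 1.0 and 2.0, the 0.6 link between wealth and education, and the seed); it is a sketch of the mechanism, not an analysis of real data.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed so the sketch is repeatable
n = 5000
education = rng.normal(0, 1, n)
wealth = 0.6 * education + rng.normal(0, 1, n)      # wealth is correlated with education
prestige = 1.0 * wealth + 2.0 * education + rng.normal(0, 1, n)

def ols(columns, y):
    """OLS with a constant; returns [constant, slope1, slope2, ...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_full = ols([wealth, education], prestige)   # properly specified model
b_short = ols([wealth], prestige)             # education omitted

# When education is omitted, it becomes part of the error term,
# and that error is correlated with wealth.
true_error_short = prestige - 1.0 * wealth
corr = np.corrcoef(wealth, true_error_short)[0, 1]

print(round(b_full[1], 2), round(b_short[1], 2), round(corr, 2))
```

In this setup the wealth coefficient is close to its true value of 1.0 in the full model, but is pushed well above it when education is left out.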

Multiple Regression Assumptions

• 4. In systems of equations, error terms of equations are uncorrelated

• Knoke, p. 256

– This is not a concern for us in this class
    • Worry about that later!

Multiple Regression Assumptions

• 5. Sample is independent, errors are random
    • Technically, part of 3.c.
    – Not only should errors not increase with X (heteroskedasticity), there should be no pattern at all!
• Things that cause patterns in error (autocorrelation):
    – Measuring data over long periods of time (e.g., every year). Error from nearby years may be correlated.
        • Called: “Serial correlation”.

Multiple Regression Assumptions

• More things that cause patterns in error (autocorrelation):
    – Measuring data in families. All members are similar and will have correlated error
    – Measuring data in geographic space.

• Example: data on 50 US states. States in a similar region have correlated error

• Called “spatial autocorrelation”

• There are variations of regression models to address each kind of correlated error.

Multiple Regression Assumptions

• Regression assumptions and final projects:

• At a minimum, check all bivariate regression assumptions
    – Also, you should check for outliers
        • To be discussed soon!
    – If you are capable of checking the multiple regression assumptions, go ahead and do so

• It will show mastery… which can’t hurt your grade!

Regression: Outliers

• Note: Even if regression assumptions are met, slope estimates can have problems

• Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample

• More formally: “influential cases”

• Outliers can result from:

• Errors in coding or data entry

• Highly unusual cases

• Or, sometimes they reflect important “real” variation

• Even a few outliers can dramatically change estimates of the slope, especially if N is small.

Regression: Outliers

• Outlier Example:

[Figure: scatterplot with both axes running from -4 to 4. One extreme case pulls the regression line up; a second line shows the regression with the extreme case removed from the sample.]

Regression: Outliers

• Strategy for identifying outliers:

• 1. Look at scatterplots or regression partial plots for extreme values

• Easiest. A minimum for final projects

• 2. Ask SPSS to compute outlier diagnostic statistics
    – Examples: “Leverage”, Cook’s D, DFBETA, residuals, standardized residuals.

Regression: Outliers

• SPSS Outlier strategy: Go to Regression – Save
    – Choose “influence” and “distance” statistics such as Cook’s Distance, DFFIT, standardized residual
    – Result: SPSS will create new variables with values of Cook’s D, DFFIT for each case
    – High values signal potential outliers
    – Note: This is less useful if you have a VERY large dataset, because you have to look at each case value.

Scatterplots

• Example: Study time and student achievement.
    – X variable: Average # hours spent studying per day
    – Y variable: Score on reading test

Case   X      Y
1      2.6    28
2      1.4    13
3      .65    17
4      4.1    31
5      .25    8
6      1.9    16
7      3.5    6

[Figure: scatterplot of the seven cases. X axis from 0 to 4; Y axis from 0 to 30.]

Outliers

• Results with outlier:

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .466 (a)   .217       .060                9.1618
a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B         Std. Error          Beta                        t       Sig.
1  (Constant)   10.662    6.402                                           1.665   .157
   HRSTUDY      3.081     2.617               .466                        1.177   .292
a. Dependent Variable: TESTSCOR

Outlier Diagnostics

• Residuals: The numerical value of the error
    • Error = distance that a point falls from the line
    • Cases with unusually large error may be outliers
• Standardized residuals
    • Z-score of residuals… converts to a neutral unit

• Often, standardized residuals larger than 3 are considered worthy of scrutiny

• But, it isn’t the best outlier diagnostic.

Outlier Diagnostics

• Cook’s D: Identifies cases that are strongly influencing the regression line
    – SPSS calculates a value for each case
        • Go to “Save” menu, click on Cook’s D
• How large of a Cook’s D is a problem?
    – Rule of thumb: Values greater than: 4 / (n – k – 1)
    – Example: N = 7, K = 1: Cut-off = 4/5 = .80
    – Cases with higher values should be examined.

Outlier Diagnostics

• Example: Outlier/Influential Case Statistics

Hours   Score   Resid    Std Resid   Cook’s D
2.60    28      9.32     1.01        .124
1.40    13      -1.97    -.215       .006
.65     17      4.33     .473        .070
4.10    31      7.70     .841        .640
.25     8       -3.43    -.374       .082
1.90    16      -.515    -.056       .0003
3.50    6       -15.4    -1.68       .941
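
For readers who want to see where these numbers come from, the sketch below recomputes them from the seven cases using the standard textbook formulas for a one-predictor regression (leverage and Cook's D). The variable names are my own, and this is not a claim about how SPSS computes the values internally; it simply gives results consistent with the table.

```python
import math

hours = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9, 3.5]
score = [28, 13, 17, 31, 8, 16, 6]
n, k = len(hours), 1                     # k = number of predictors

# Bivariate least-squares slope and constant.
x_bar = sum(hours) / n
y_bar = sum(score) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals and standardized residuals (residual / std. error of the estimate).
resid = [y - (intercept + slope * x) for x, y in zip(hours, score)]
mse = sum(e ** 2 for e in resid) / (n - k - 1)
std_resid = [e / math.sqrt(mse) for e in resid]

# Leverage for simple regression, then Cook's D for each case.
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in hours]
cooks_d = [
    (e ** 2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)
    for e, h in zip(resid, leverage)
]

cutoff = 4 / (n - k - 1)                 # rule of thumb: 4/5 = .80
flagged = [i + 1 for i, d in enumerate(cooks_d) if d > cutoff]
print(flagged)                           # case numbers above the cut-off
```

Only the last case (3.5 hours, score of 6) exceeds the .80 cut-off. Note that its standardized residual is well under 3, which is one reason standardized residuals alone are not the best outlier diagnostic.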

Outliers

• Results with outlier removed:

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .903 (a)   .816       .770                4.2587
a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B         Std. Error          Beta                        t       Sig.
1  (Constant)   8.428     3.019                                           2.791   .049
   HRSTUDY      5.728     1.359               .903                        4.215   .014
a. Dependent Variable: TESTSCOR
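
The with/without comparison can be reproduced directly from the seven cases. The helper name simple_ols is an assumption for illustration; the data are the study-time example from the earlier slide.

```python
def simple_ols(xs, ys):
    """Bivariate least squares; returns (constant, slope, r_square)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

hours = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9, 3.5]
score = [28, 13, 17, 31, 8, 16, 6]

# All seven cases, then the same model with case 7 (3.5 hours, score 6) dropped.
const_all, slope_all, r2_all = simple_ols(hours, score)
const_trim, slope_trim, r2_trim = simple_ols(hours[:-1], score[:-1])

print(round(slope_all, 3), round(r2_all, 3))
print(round(slope_trim, 3), round(r2_trim, 3))
```

Dropping a single case nearly doubles the slope and moves R-square from about .22 to about .82, which is why one influential case in a small sample deserves scrutiny.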

Regression: Outliers

• Question: What should you do if you find outliers? Drop outlier cases from the analysis? Or leave them in?
    – Obviously, you should drop cases that are incorrectly coded or erroneous
    – But, generally speaking, you should be cautious about throwing out cases
        • If you throw out enough cases, you can produce any result that you want! So, be judicious when destroying data.

Regression: Outliers

• Circumstances where it can be good to drop outlier cases:

• 1. Coding errors

• 2. Single extreme outliers that radically change results

• Your results should reflect the dataset, not one case!

• 3. If there is a theoretical reason to drop cases

• Example: In analysis of economic activity, communist countries may be outliers

• If the study is about “capitalism”, they should be dropped.

Regression: Outliers

• Circumstances when it is good to keep outliers

• 1. If they form a meaningful cluster
    – Often suggests an important subgroup in your data

• Example: Asian-Americans in a dataset on education

• In such a case, consider adding a dummy variable for them

• Unless, of course, research design is not interested in that sub-group… then drop them!

• 2. If there are many
    • Maybe they reflect a “real” pattern in your data.

Regression: Outliers

• When in doubt: Present results both with and without outliers

• Or present one set of results, but mention how results differ depending on how outliers were handled

• For final projects: Check for outliers!
    • At least with scatterplots
    • But, a better strategy is to use partial plots and Cook’s D (or similar statistics)

– In the text: Mention if there were outliers, how you handled them, and the effect it had on results.

Multiple Regression Problems

• Another common regression problem: Multicollinearity

• Definition: collinear = highly correlated
    • Multicollinearity = inclusion of highly correlated independent variables in a single regression model
• Recall: High correlation of X variables causes problems for estimation of slopes (b’s)
    • Recall: as denominators in the slope formulas approach zero, coefficients may be wrong or too large.

Multiple Regression Problems

• Multicollinearity symptoms:
    – Addition of a new variable to the model causes the coefficients of other variables to change wildly
        • Note: occasionally a major change is expected (e.g., if a key variable is added, or for interaction terms)
    – A variable typically has a small effect
        • BUT, when paired with another highly correlated variable, BOTH have big effects in opposite directions.

Multicollinearity

• Diagnosing multicollinearity:

• 1. Look at correlations of all independent vars
    – A correlation of .7 is a concern; above .8 is often a problem
    – But, sometimes problems aren’t bivariate and don’t show up in bivariate correlations
        • Ex: If you forget to omit a dummy variable

• 2. Watch out for the “symptoms”

• 3. Compute diagnostic statistics• Tolerances, VIF (Variance Inflation Factor).

Multicollinearity

• Multicollinearity diagnostic statistics:

• “Tolerance”: Easily computed in SPSS
    – Low values indicate possible multicollinearity
        • Start to pay attention at .4; Below .2 is very likely to be a problem
    – Tolerance is computed for each independent variable by regressing it on other independent variables.

Multicollinearity

• If you have 3 independent variables: X1, X2, X3…
    – Tolerance is based on doing a regression: X1 is dependent; X2 and X3 are independent.

• Tolerance for X1 is simply 1 minus regression R-square.

• If a variable (X1) is highly correlated with all the others (X2, X3) then they will do a good job of predicting it in a regression

• Result: Regression r-square will be high… 1 minus r-square will be low… indicating a problem.

Multicollinearity

• Variance Inflation Factor (VIF) is the reciprocal of tolerance: 1/tolerance

• High VIF indicates multicollinearity

– Gives an indication of how much the variance of a variable’s coefficient estimate (and therefore its Standard Error) grows due to the presence of other variables.
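
The tolerance and VIF calculations described on the last few slides can be sketched as follows. The data are simulated (x1, x2, x3, the noise level, and the seed are assumptions), with X1 built to be nearly a combination of X2 and X3 so that its tolerance comes out low.

```python
import numpy as np

rng = np.random.default_rng(2)          # fixed seed so the sketch is repeatable
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3 + rng.normal(scale=0.3, size=n)   # X1 is nearly a combination of X2 and X3

def tolerance(target, others):
    """Regress one predictor on the other predictors; return 1 minus R-square."""
    X = np.column_stack([np.ones(len(target))] + others)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r_square = 1 - resid.var() / target.var()
    return 1 - r_square

tol_x1 = tolerance(x1, [x2, x3])        # X1 is dependent; X2 and X3 are independent
vif_x1 = 1 / tol_x1                     # VIF is the reciprocal of tolerance

print(round(tol_x1, 3), round(vif_x1, 1))
```

Here the tolerance for X1 falls well below the .2 guideline, even though no single bivariate correlation need be extreme.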

Multicollinearity

• Solutions to multicollinearity
    – It can be difficult if a fully specified model requires several collinear variables
• 1. Drop unnecessary variables
• 2. If two collinear variables are really measuring the same thing, drop one or make an index
    – Example: Attitudes toward recycling; attitude toward pollution. Perhaps they reflect “environmental views”
• 3. Advanced techniques: e.g., Ridge regression
    • Uses a more efficient estimator (but not BLUE; may introduce bias).
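
As a rough idea of what ridge regression does, the sketch below uses the usual closed-form ridge estimate on two nearly identical simulated predictors. The data, the seed, and the penalty value of 5.0 are all assumptions chosen for illustration; in practice the penalty has to be chosen carefully, and this is not a procedure covered in the course.

```python
import numpy as np

rng = np.random.default_rng(3)          # fixed seed so the sketch is repeatable
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly a copy of x1 (highly collinear)
y = x1 + x2 + rng.normal(size=n)

X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the predictors
yc = y - y.mean()                            # center Y so no constant is needed

def ridge(X, y, lam):
    """Closed-form ridge estimate; lam = 0 gives ordinary least squares."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

b_ols = ridge(X, yc, 0.0)
b_ridge = ridge(X, yc, 5.0)                  # hypothetical penalty value

print(np.round(b_ols, 2))
print(np.round(b_ridge, 2))
```

The penalty shrinks the coefficients and pulls the two collinear slopes toward each other, trading a little bias for much less erratic estimates.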