Quantitative Methods I: Regression diagnostics

Johan A. Elkink
University College Dublin

10 December 2014
Outline
1 Assumptions and errors
2 Leverage and influence
3 Multicollinearity
4 Heteroscedasticity
Assumptions: specification
- Linear in parameters (i.e. f(Xβ) = Xβ and E[y] = Xβ)
- No extraneous variables in X
- No omitted independent variables
- Parameters to be estimated are constant
- Number of parameters is less than the number of cases, k < n
Assumptions: errors
- Errors have an expected value of zero, E[ε|X] = 0
- Errors are normally distributed, ε ∼ N(0, σ²)
- Errors have a constant variance, Var(ε|X) = σ² < ∞
- Errors are not autocorrelated, Cov(εᵢ, εⱼ|X) = 0 for all i ≠ j
- Errors and X are uncorrelated, Cov(X, ε) = 0
Assumptions: regressors
- X varies
- X is of full column rank (note: requires k < n)
- No measurement error in X
- No endogenous variables in X
Assumptions for unbiasedness
If

- the population regression model is linear in its parameters;
- the sample is a random sample from the population;
- there is no perfect collinearity, r(X) = k; and
- the expected value of the error term is zero conditional on the explanatory variables, E(ε|X) = 0 and Cov(ε, X) = 0,
then the OLS estimators of β are unbiased.
(Glynn, 2007)
Non-constant error variance
Consequences:
- σ̂² is a biased estimator of σ²;
- the probability of a Type I error will not be α;
- the least squares estimator is no longer the best linear unbiased estimator,
but:
- the severity depends on the level of heteroscedasticity;
- β̂ still an unbiased estimator of β.
(King, 2007)
Non-normal errors
Consequences:
- sampling distribution of β̂ not normal;
- test statistics will not have t- and F-distributions;
- the probability of a Type I error will not be α,

but the estimates are still consistent: as n increases, the above problems disappear.
(King, 2007)
Exercise: film reviews
Open the films.dta data set. Create a new variable highrating, which is 1 for films rated 3 or higher, 0 otherwise.
Using matrix formulas,
1 regress desclength on a constant
2 regress desclength on castsize
3 regress desclength on castsize, highrating, length
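A minimal sketch in R (assuming films.dta is a Stata file with variables desclength, castsize, length and a rating variable named rating; read.dta() is from the foreign package):

library(foreign)                      # read.dta() reads Stata .dta files
films <- read.dta("films.dta")
films$highrating <- ifelse(films$rating >= 3, 1, 0)   # assumed 'rating' variable

y  <- films$desclength
X1 <- matrix(1, nrow = length(y))     # 1: constant only
X2 <- cbind(1, films$castsize)        # 2: constant and castsize
X3 <- cbind(1, films$castsize, films$highrating, films$length)   # 3: full model
b3 <- solve(t(X3) %*% X3) %*% t(X3) %*% y   # β̂ = (X′X)⁻¹X′y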
Exercise: film reviews
Based on the last regression:
1 Which observation has the largest residual?
2 Compute mean and median of residuals
3 Compute correlation between residuals and fitted values
4 Compute correlation between residuals and length
5 All other predictors held constant, what would be the difference in predicted description length between high and low rated movies?
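Continuing the sketch above, one possible approach (note that the correlations in 3 and 4 are zero by construction in OLS, since residuals are orthogonal to fitted values and to included regressors):

e <- y - X3 %*% b3            # residuals from the third regression
which.max(abs(e))             # 1: largest residual in absolute value
mean(e); median(e)            # 2: the mean is zero by construction
cor(e, X3 %*% b3)             # 3: residuals vs fitted values
cor(e, films$length)          # 4: residuals vs length
b3[3]                         # 5: coefficient on highrating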
Non-linearity
If there is non-linearity in the variables, but not in the parameters, there is no problem. E.g.

yᵢ = β0 + β1xᵢ + β2xᵢ² + εᵢ
can be estimated with OLS.
If there are other non-linearities, sometimes the equation can be transformed. E.g.

yᵢ = β0 xᵢ^β1 εᵢ

log(yᵢ) = log(β0) + β1 log(xᵢ) + log(εᵢ)

yᵢ* = β0* + β1xᵢ* + εᵢ*
Functional forms for additional non-linear transformations

- log-linear: as with the previous example
- semi-log has two forms:
  yᵢ = β0 + β1 log(xᵢ), where β1 is ∆y due to %∆x
  log(yᵢ) = β0 + β1xᵢ, where β1 is %∆y due to ∆x
- inverse or reciprocal: yᵢ = β0 + β1(1/xᵢ)
- polynomial: yᵢ = β0 + β1xᵢ + β2xᵢ²
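In R, all of these forms can be fitted with lm(); a sketch, assuming a data frame d with variables y and x (hypothetical names):

m1 <- lm(log(y) ~ log(x), data = d)   # log-linear
m2 <- lm(y ~ log(x), data = d)        # semi-log: ∆y due to %∆x
m3 <- lm(log(y) ~ x, data = d)        # semi-log: %∆y due to ∆x
m4 <- lm(y ~ I(1/x), data = d)        # inverse or reciprocal
m5 <- lm(y ~ x + I(x^2), data = d)    # polynomial; I() protects x^2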
2 Leverage and influence
Leverage
A high leverage point i is one where xᵢ is far from the mean of X. These points can be identified using the so-called hat matrix, the matrix that puts a hat on y:

ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy,

the diagonal of which is a measure of leverage.
(King, 2007)
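In R, the diagonal of H can be extracted directly from a fitted lm object m; flagging points above 2k/n is a common rule of thumb (an addition here, not from the slides):

h <- hatvalues(m)            # diagonal of the hat matrix
which(h > 2 * mean(h))       # mean(h) = k/n, so this flags hᵢ > 2k/n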
Outliers
An outlier is a point whose residual from the regression line is large.

To account for variation in the sampling variances of the residuals, we calculate externally studentized residuals (or studentized deleted residuals), where a large absolute value indicates an outlier. A test could be based on the fact that in a model without outliers, they should follow a t(n − k) distribution.
(Kutner et al., 2005, 390–398)
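In R, rstudent() returns the externally studentized residuals; a sketch of such a test, with a Bonferroni correction as one common choice (degrees-of-freedom conventions vary with how k counts the intercept):

r <- rstudent(m)                     # externally studentized residuals
n <- nobs(m); k <- length(coef(m))
which(abs(r) > qt(1 - 0.05/(2*n), df = n - k - 1))   # flagged outliers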
Influence
An influential point is one which has a strong impact on the estimation of β. An influential point is one which has high leverage and is also an outlier.

We typically look at Cook's Distance to assess the level of influence of each observation.
(King, 2007)
Leverage and influence
A point with high leverage is located far from the other points. A high leverage point that strongly influences the regression line is called an influential point.
Outlier, low leverage, low influence
[Figure: scatter plot of y against x illustrating an outlier with low leverage and low influence.]
High leverage, low influence
[Figure: scatter plot of y against x illustrating a high leverage point with low influence.]
High leverage, high influence
[Figure: scatter plot of y against x illustrating a high leverage point with high influence.]
Cook’s Distance
Dᵢ ≡ Σⱼ(ŷⱼ − ŷⱼ(−i))² / (ks²)
  = (β̂OLS(−i) − β̂OLS)′X′X(β̂OLS(−i) − β̂OLS) / (ks²)
  = (eᵢ / (s√(1 − hᵢ)))² · hᵢ / (k(1 − hᵢ))
  = (tᵢ²/k) · var(ŷᵢ)/var(eᵢ) ∼ F(k, n − k)
The F-test here refers to whether β̂OLS would be significantly different if observation i were to be removed (H0: β = β(−i)).
(Cook, 1979, 168)
Cook’s Distance
Dᵢ = (tᵢ²/k) · var(ŷᵢ)/var(eᵢ)

"tᵢ² is a measure of the degree to which the i-th observation can be considered as an outlier from the assumed model."

"The ratios var(ŷᵢ)/var(eᵢ) measure the relative sensitivity of the estimate, β̂OLS, to potential outlying values at each data point."
(Cook, 1977, 16)
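In R, Cook's distances come directly from the fitted model; the 4/n cut-off below is a common rule of thumb rather than part of Cook's derivation:

d <- cooks.distance(m)
which(d > 4/nobs(m))         # observations worth inspecting more closely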
What to do with outliers?
Options:
1 Ignore the problem
2 Investigate why the data are outliers — what makes them unusual?
3 Consider respecifying the model, either by transforming a variable or by including an additional variable (but beware of overfitting)
4 Consider a variant of "robust regression" that downweights outliers
Diagnosing problems in R
A very easy set of diagnostic plots can be accessed by plotting an lm object, using plot.lm()

This produces, in order:
1 residuals against fitted values
2 normal Q-Q plot
3 scale-location plot of √|eᵢ| against fitted values
4 Cook's distances versus row labels
5 residuals against leverages
6 Cook's distances against leverage/(1 − leverage)

Note that by default, plot.lm() only gives you 1, 2, 3 and 5.
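For example, to see all six panels for a fitted lm object m rather than the default four:

par(mfrow = c(2, 3))         # arrange the six panels in a grid
plot(m, which = 1:6)         # the default is which = c(1, 2, 3, 5)
par(mfrow = c(1, 1))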
Exercise
Open the uswages.dta data set and regress log(wage) on educ, exper and race.
Check for leverage, outliers, influential points and nonlinearities.
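A sketch of the setup (assuming uswages.dta is a Stata file with variables named wage, educ, exper and race, as in the exercise):

library(foreign)
uswages <- read.dta("uswages.dta")
m <- lm(log(wage) ~ educ + exper + race, data = uswages)
plot(m, which = 1:6)         # leverage, outliers, influence, non-linearity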
3 Multicollinearity
Collinearity
When some variables are linear combinations of others then we have exact (or perfect) collinearity, and there is no unique least squares estimate of β.

When X variables are highly correlated, we have multicollinearity.

Detecting multicollinearity:

- look at correlation matrix of predictors for pairwise correlations
- regress each independent variable on all other independent variables to produce R²ⱼ, and look for high values (close to 1.0)
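Both checks are short in R; a sketch, assuming a data frame preds holding only the predictors, one of them named x1 (hypothetical names):

cor(preds)                                    # pairwise correlations
summary(lm(x1 ~ ., data = preds))$r.squared   # R²ⱼ for x1; repeat for each predictor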
Multicollinearity
The extent to which multicollinearity is a problem is debatable.
The issue is comparable to that of sample size: if n is too small, we have difficulty picking up effects even if they really exist; the same holds for variables that are highly multicollinear, making it difficult to separate their effects on y.
Multicollinearity
However, some problems with high multicollinearity:
- Small changes in data can lead to large changes in estimates
- High standard errors but joint significance
- Coefficients may have "wrong" sign or implausible magnitudes
(Greene, 2002, 57)
Variance of β̂OLS

var(β̂ₖOLS) = σ² / [(1 − R²ₖ) Σᵢ(xᵢₖ − x̄ₖ)²]

- σ²: all else equal, the better the fit, the lower the variance
- (1 − R²ₖ): all else equal, the lower the R² from regressing the k-th independent variable on all other independent variables, the lower the variance
- Σᵢ(xᵢₖ − x̄ₖ)²: all else equal, the more variation in x, the lower the variance
(Greene, 2002, 57)
Variance Inflation Factor
var(β̂ₖOLS) = σ² / [(1 − R²ₖ) Σᵢ(xᵢₖ − x̄ₖ)²]

VIFₖ = 1 / (1 − R²ₖ),

thus VIFₖ shows the increase in var(β̂ₖOLS) due to the variable being collinear with other independent variables.
library(faraway)   # provides vif()
vif(lm(...))       # VIF for each predictor in the fitted model
Multicollinearity: solutions
- Check for coding or logical mistakes (esp. in cases of perfect multicollinearity)
- Increase n
- Remove one of the collinear variables (apparently not adding much)
- Combine multiple variables in indices or underlying dimensions
- Formalise the relationship
Exercise
Using demdev.dta data and model
polity2ᵢ = β0 + β1 cwarᵢ + β2 laggdppcᵢ + β3 propdemᵢ + β4 energy2ᵢ + εᵢ,
check whether there are any multicollinearity problems.
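A sketch of the check (assuming demdev.dta is a Stata file containing the variables named above; vif() is from the faraway package introduced earlier):

library(foreign)
library(faraway)
demdev <- read.dta("demdev.dta")
m <- lm(polity2 ~ cwar + laggdppc + propdem + energy2, data = demdev)
vif(m)      # large values signal collinearity; 10 is a common alarm level
cor(demdev[c("cwar", "laggdppc", "propdem", "energy2")])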
4 Heteroscedasticity
Homoscedasticity

[Figure: scatter plot illustrating errors with constant variance.]

Heteroscedasticity

[Figure: scatter plot illustrating errors with non-constant variance.]
Heteroscedasticity
Regression disturbances whose variances are not constant across observations are heteroscedastic.

Under heteroscedasticity, the OLS estimators remain unbiased and consistent, but are no longer BLUE or asymptotically efficient.
(Thomas, 1985, 94)
Causes of heteroscedasticity

- More variation for larger sizes (e.g. profits of firms vary more for larger firms)
- More variation across different groups in the sample
- Learning effects in time series
- Variation in data collection quality (e.g. historical data)
- Turbulence after shocks in time series (e.g. financial markets)
- Omitted variable
- Wrong functional form
- Aggregation with varying sizes of populations
- etc.
Heteroscedasticity
Since OLS is no longer BLUE or asymptotically efficient,
- other linear unbiased estimators exist which have smaller sampling variances;
- other consistent estimators exist which collapse more quickly to the true values as n increases;
- we can no longer trust hypothesis tests, because var(β̂OLS) is biased:
  - cov(Xᵢ², σᵢ²) > 0: var(β̂OLS) is underestimated
  - cov(Xᵢ², σᵢ²) = 0: no bias in var(β̂OLS)
  - cov(Xᵢ², σᵢ²) < 0: var(β̂OLS) is overestimated (inefficient)
(Thomas, 1985, 94–95)
(Judge et al., 1985, 422)
Residual plots: heteroscedasticity
To detect heteroscedasticity (unequal variances), it is useful to plot:

- Residuals against fitted values
- Residuals against dependent variable
- Residuals against independent variable(s)

Usually, the first one is sufficient to detect heteroscedasticity, and can simply be found by:
m <- lm(y ~ x)
plot(m)          # the first panel shows residuals against fitted values
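The plots on the following slides can be reproduced with simulated data along these lines (a hypothetical data-generating process; the slides do not specify one):

set.seed(1)
x <- runif(100, 0, 8)
y <- 2 + 3*x + rnorm(100, sd = x)    # error spread grows with x
m <- lm(y ~ x)
plot(residuals(m) ~ fitted(m))       # a funnel shape indicates heteroscedasticity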
Residual plots: heteroscedasticity
[Figure: scatter plot of heteroscedastic data, y against x.]
m <- lm(y ~ x)
Residual plots: heteroscedasticity
[Figure: residuals plotted against fitted values.]
plot(residuals(m) ~ fitted(m))
Residual plots: heteroscedasticity
[Figure: residuals plotted against y.]
plot(residuals(m) ~ y)
Residual plots: heteroscedasticity
[Figure: residuals plotted against x.]
plot(residuals(m) ~ x)
Residual plots: homoscedasticity
[Figure: scatter plot of homoscedastic data, y against x.]
m <- lm(y ~ x)
Residual plots: homoscedasticity
[Figure: residuals plotted against fitted values.]
plot(residuals(m) ~ fitted(m))
Residual plots: homoscedasticity
[Figure: residuals plotted against y.]
plot(residuals(m) ~ y)
Residual plots: homoscedasticity
[Figure: residuals plotted against x.]
plot(residuals(m) ~ x)
Residual plots: heteroscedasticity
[Figure: scatter plot of heteroscedastic data, y against x.]
m <- lm(y ~ x)
Residual plots: heteroscedasticity
[Figure: residuals plotted against fitted values.]
plot(residuals(m) ~ fitted(m))
Residual plots: heteroscedasticity
[Figure: residuals plotted against y.]
plot(residuals(m) ~ y)
Residual plots: heteroscedasticity
[Figure: residuals plotted against x.]
plot(residuals(m) ~ x)
Cook, R. Dennis. 1977. "Detection of influential observation in linear regression." Technometrics 19(1):15–18.

Cook, R. Dennis. 1979. "Influential observations in linear regression." Journal of the American Statistical Association 74(365):169–174.

Glynn, Adam. 2007. "GOV 2000: Quantitative Methodology for Political Science I." Lecture slides, Harvard University.

Greene, William H. 2002. Econometric Analysis. London: Prentice Hall.

Judge, George G., William E. Griffiths, R. Carter Hill, Helmut Lütkepohl and Tsoung-Chao Lee. 1985. The Theory and Practice of Econometrics. New York: John Wiley and Sons.

King, Gary. 2007. "GOV 2000: Quantitative Methodology for Political Science I." Lecture slides, Harvard University.

Kutner, Michael H., Christopher J. Nachtsheim, John Neter and William Li. 2005. Applied Linear Statistical Models. 5th ed. McGraw-Hill.

Thomas, R. Leighton. 1985. Introductory Econometrics: Theory and Applications. Harlow, Essex: Longman.