Chapter 8. R-squared, Adjusted R-squared, the F test, and Multicollinearity
This chapter discusses additional output from regression analysis, in the context of multiple
regression in the classical model. It also discusses multicollinearity, its effects, and remedies.
8.1 The R-squared Statistic
The “population” R² statistic was introduced in Chapter 6 as

population R² = 1 − E{v(X)}/Var(Y),

where v(x) is the conditional variance of Y given X = x.
This number tells you how well the “X” variable(s) predict your “Y” variable. Since the entire
focus of this book is on conditional distributions p(y | x), I’d like you to understand the
“prediction” concept in terms of separation of the distributions p(y | X = low) and p(y | X = high).
For example, suppose the true model is

Y = 6 + 0.2X + ε,

where X ~ N(20, 5²) and Var(ε) = σ². Then Var(Y) = 0.2² × 5² + σ² = 1 + σ², and v(x) = σ²,
implying population R² = 1 − σ²/(1 + σ²) = 1/(1 + σ²). Three cases I’d like you to consider are
(i) σ² = 9.0, implying a low R², (ii) σ² = 1.0, implying a medium value, and (iii) σ² = 1/9,
implying a high R². In all cases, let’s say a “low” value of X is 15.0, one standard deviation
below the mean, and a high value of X is 25.0, one standard deviation above the mean.
Now, when X = 15, the distribution p(y | X = 15) is the N(6 + 0.2(15) = 9.0, σ²) distribution; and
when X = 25, the distribution p(y | X = 25) is the N(6 + 0.2(25) = 11.0, σ²) distribution. Figure
8.1.0 displays these distributions for the three cases above, where the population R² is either 0.1,
0.5, or 0.9 (which happen in this study when σ² is either 9.0, 1.0, or 1/9). Notice that there is
greater separation of the distributions p(y | x) when the “population” R² is higher.
Figure 8.1.0. Separation of distributions p(y | X = low) (left distributions) and p(y | X = high)
(right distributions) in cases where the “population” R² is 0.1 (top panel), 0.5 (middle panel),
and 0.9 (bottom panel). In all cases X = low and X = high refer to an X that is either one standard
deviation below the mean or one standard deviation above the mean.
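If you want to reproduce the flavor of Figure 8.1.0 yourself, the following is a minimal R sketch; the means 9 and 11 and the error variances 9, 1, and 1/9 come from the example above, while the plotting details are just one possible choice.

# p(y | X = 15) is N(9, sigma^2) and p(y | X = 25) is N(11, sigma^2)
sigma2 <- c(9, 1, 1/9)              # gives population R-squared = 0.1, 0.5, 0.9
par(mfrow = c(3, 1))
for (s2 in sigma2) {
  y <- seq(0, 20, length.out = 500)
  plot(y, dnorm(y, mean = 9, sd = sqrt(s2)), type = "l", ylab = "density",
       main = paste("population R-squared =", round(1/(1 + s2), 1)))
  lines(y, dnorm(y, mean = 11, sd = sqrt(s2)), lty = 2)
}

Greater separation of the solid and dashed curves corresponds to a higher population R-squared.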
In the case of the classical regression model, which is instantiated by Figure 8.1.0, the
conditional variance Var(Y | X = x) = v(x) is a constant σ², and does not depend on X = x. Also in
the classical regression model, the maximum likelihood estimate of σ² is σ̂² = SSE/n, where

SSE = Σ_{i=1}^n (y_i − ŷ_i)²,

the sum of squared vertical deviations from the y_i values to the fitted OLS function. The
unconditional variance is Var(Y) = σ_Y², so the “population” R² statistic, in the classical
regression model, is

population R² = 1 − σ²/σ_Y².

The maximum likelihood estimate of σ_Y² is σ̂_Y² = SST/n, where

SST = Σ_{i=1}^n (y_i − ȳ)²,

the “total” sum of squared vertical deviations from the y_i values to the flat line where y = ȳ.
See Figure 8.1.1.
Figure 8.1.1. Scatterplot of n = 4 data points (indicated by X’s). The horizontal red line is the
y = ȳ line and the diagonal blue line is the least squares line. Vertical deviations from the
y = ȳ line are shown in red; SST is the sum of these squared deviations. Vertical deviations from
the least squares line are shown in blue; SSE is the sum of these squared deviations. The R²
statistic equals 1 − SSE/SST.
Using the maximum likelihood estimates of conditional and unconditional variance, you get the
estimate of the “population” R-squared statistic,
R2 = 1 – (SSE/n)/(SST/n) = 1 – SSE/SST.
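As a quick numerical check of this identity, here is a sketch using a small made-up data set (any x and y of the same length will do); it computes SSE and SST by hand, in the spirit of Figure 8.1.1, and compares the result with the value reported by summary().

# By-hand R-squared = 1 - SSE/SST, checked against summary(lm())
x <- c(1, 2, 3, 4)
y <- c(2.1, 2.9, 4.2, 4.8)
fit <- lm(y ~ x)
SSE <- sum(residuals(fit)^2)    # squared deviations from the least squares line
SST <- sum((y - mean(y))^2)     # squared deviations from the flat line y = ybar
1 - SSE/SST                     # by-hand R-squared
summary(fit)$r.squared          # same number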
In Chapter 5, you saw models with different transformations in the X variable. The model with
the highest maximized log likelihood was the one with the smallest estimated conditional
variance SSE/n, hence it was also the model with smallest SSE, since n is always the same when
considering different models for the same data set. Also, SST is always the same when
considering different models for the same data set, because SST does not involve the predicted
values from the model. Thus, among the different models having differently transformed X
variables¹, the model with the highest log likelihood corresponds precisely to the model with the
highest R² statistic.
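You can verify this correspondence directly in R. The following is only a sketch with simulated data (the data set and the log transformation are illustrative, not taken from Chapter 5): whichever X transformation gives the higher maximized log likelihood also gives the higher R².

# Among models that differ only in the X transformation,
# highest log likelihood <=> highest R-squared
set.seed(123)
x <- runif(100, 1, 10)
y <- 2 + 3*log(x) + rnorm(100)
fit.lin <- lm(y ~ x)
fit.log <- lm(y ~ log(x))
c(logLik(fit.lin), logLik(fit.log))                        # log likelihoods
c(summary(fit.lin)$r.squared, summary(fit.log)$r.squared)  # R-squared values, same ordering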
While it is mathematically factual that 0 ≤ R2 ≤ 1.0, there is no “Ugly Rule of Thumb” for how
large an R2 statistic should be to be considered “good.” Rather, it depends on norms for the given
subject area: In finance, any non-zero R2 for predicting stock returns is interesting, because the
efficient markets hypothesis states that the “population” R2 is zero in this case. In chemical
reaction modeling, the outputs are essentially deterministic functions of the inputs, so an R2
statistic that is less than 1.0, e.g. 0.99, may not be good enough because it indicates faulty
experimental procedures. With human subjects and models to predict their behavior, the R2
statistics are typically less than 0.50 because people are, well, people. We have our own minds,
and are not robots that can be pigeon-holed by some regression model.
Our advice is to rely less on R2, and more on separation of distributions as seen in Figure 8.1.0.
When we get to more complex models, the usual R2 statistic becomes less interpretable, and in
some cases it is non-existent. But you always will have conditional distributions p(y | x), and you
can always graph those distributions as shown in Figure 8.1.0 to see how well your X predicts
your Y.
8.2 The Adjusted R-Squared Statistic
Recall that, in the classical model, the population R² = 1 − σ²/σ_Y², and that the standard R²
statistic replaces the two variances with their maximum likelihood estimates. Recall also that
maximum likelihood estimates of variance are slightly biased. Replacing the variances with their
unbiased estimates gives the adjusted R² statistic:

R_a² = 1 − {SSE/(n − k − 1)}/{SST/(n − 1)}.
With a larger number of predictor variables k, the ordinary R² tends to be increasingly biased
upward; the adjusted R² statistic is less biased. You can interpret the adjusted R² statistic in the
same way as the ordinary one, but note that the adjusted R² statistic can take values less than 0.0,
which are clearly bad estimates since the estimand cannot be negative.
1 This discussion refers to X transformations only, not Y transformations.
The following R code indicates where these statistics are, as well as “by hand” calculations of
them. The F test appears on the last line of the summary() output, for example:

F-statistic: 6.974 on 2 and 97 DF, p-value: 0.001479
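The data set behind that particular output is not reproduced here, so the sketch below uses simulated data with the same dimensions (n = 100 observations and k = 2 predictors, hence 2 and 97 degrees of freedom) to show where the statistics appear in the summary() output and how to recover them by hand.

# Where R-squared, adjusted R-squared, and the F statistic live, and by-hand versions
set.seed(1)
n <- 100; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.3*x1 + 0.3*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
summary(fit)            # last lines report Multiple R-squared, Adjusted R-squared, F-statistic
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2 <- 1 - SSE/SST                              # matches summary(fit)$r.squared
R2.a <- 1 - (SSE/(n - k - 1))/(SST/(n - 1))    # matches summary(fit)$adj.r.squared
F.stat <- ((SST - SSE)/k)/(SSE/(n - k - 1))    # matches the F on k and n - k - 1 df
c(R2, R2.a, F.stat)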
4 In the case of simple regression, the R2 statistic is equal to the square of the correlation coefficient, so you get the
same R2 in both regressions. However, with more than two X variables, the R2 statistics will all be different. Some of
the X variables will be more highly related to the others; the parameter estimates for these variables suffer most from
multicollinearity.
The R² statistics relating X1 to X2 and X2 to X1 are both 0.0 in this second example. The
standard error multiplier in the first example with the high multicollinearity was 52.02973;
checking, we see that 52.02973 × 0.3894 = 20.26, reasonably close to the standard errors
18.36 and 18.33 in the original analysis with the multicollinear variables. Differences are
mostly explained by randomness in the estimates σ̂.
Summary of Multicollinearity (MC) and its effects
1. MC exists when the X’s are correlated (i.e., almost always). It does not involve the
Y’s. Existence of MC violates none of the classical model assumptions5.
2. Greater MC causes larger standard errors of the parameter estimates. This means that
your estimates of the parameters tend to be less precise with higher degrees of MC.
You will tend to have more insignificant tests and wider confidence intervals in these
cases. This happens because when X1 and X2 are closely related, the data cannot
isolate the unique effect of X1 on Y, controlling X2, as precisely as is the case when
X1 and X2 are not closely related.
3. The more the MC, the less interpretable are the parameters. In particular, β1 is the
effect of varying X1 when other X’s are held fixed. But it becomes difficult to even
imagine varying X1 while holding X2 fixed, when X1 and X2 are extremely highly
correlated.
4. MC almost always exists in observational data. The question is therefore not “is there
MC?,” but rather “how strong is the MC and what are its effects?” Generally, the
higher the correlations among the X's, the greater the degree of MC, and the greater
the effects (high parameter standard errors; tenuous parameter interpretation.)
5. The extreme case of MC is called “perfect MC,” and happens when the columns of
the X matrix are perfectly linearly dependent, in which case there are no unique least
squares estimates. The fact that there are no unique LSEs in this case does not mean
you can't proceed; you can still estimate parameters (albeit non-uniquely) and
make valid predictions resulting from such estimates. Most computer software allows
you to estimate models in this case, but provides a warning message or other unusual
output (such as R’s “NA” for some parameter estimates) that you should pay attention
to; a short demonstration follows this list.
6. Regression models that are estimated using MC data can still be useful. There is no
absolute requirement that MC be below a certain level. In fact, in some cases it is
strongly recommended that highly correlated variables be retained in the model. For
example, in most cases you should include the linear term in a quadratic model, even
though the linear and quadratic terms are highly correlated. This is called the
“Variable Inclusion Principle”; more on this in the next chapter.

5 Some books and web documents incorrectly state that there is an assumption of no MC in regression
analysis.
7. It is most important that you simply recognize the effects of multicollinearity, which
are (i) high variances of parameter estimates, (ii) tenuous parameter interpretations,
and (iii) in the extreme case of perfect multicollinearity, non-existence of unique
least squares estimates.
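The “NA” behavior mentioned in point 5 is easy to see. Here is a minimal sketch with made-up data in which one predictor is an exact linear function of another.

# Perfect multicollinearity: lm() still runs, but reports NA for a redundant coefficient
set.seed(42)
x1 <- rnorm(30)
x2 <- 2*x1 + 1                  # x2 is an exact linear function of x1
y <- 5 + x1 + rnorm(30)
coef(lm(y ~ x1 + x2))           # the coefficient of x2 is NA
fitted(lm(y ~ x1 + x2))[1:5]    # predictions are still produced

The parameter estimates are not unique (R resolves the ambiguity by dropping x2), but the fitted values are perfectly usable.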
When might MC be a Problem?
It makes no sense to “test” for MC in the usual hypothesis testing “H0 vs. H1” sense. The following are not “tests”; they are just suggestions, essentially “Ugly Rules of Thumb,” aimed at helping identify when MC might be a problem.
1. When correlations between the X variables are extremely high (e.g., many greater than
0.9) or variance inflation factors are very high (e.g., greater than 9.0, implying a standard
error inflation factor greater than 3.0); a code sketch of these checks follows this list.
2. When variables that you think are important, a priori, are found to be insignificant, you
might suspect a MC problem. But consider also whether your sample size is simply too
small.
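Both checks are easy to carry out. Here is a sketch using simulated predictors (the variable names and the data are placeholders, not from any example in this chapter); the by-hand variance inflation factor for X_j is 1/(1 − R_j²), where R_j² comes from regressing X_j on the other X’s.

# Correlation matrix and by-hand variance inflation factors
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)     # deliberately correlated with x1
x3 <- rnorm(100)
X <- data.frame(x1, x2, x3)
cor(X)                              # look for correlations near +/- 0.9 or beyond
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1/(1 - r2)
})
vif                                 # values above about 9 suggest troublesome MC
sqrt(vif)                           # standard error inflation factors

If you have the car package installed, its vif() function applied to a fitted regression of Y on these X’s returns the same numbers.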
What to do about MC?
1. Main Solution: Diagnose the problem and understand its effects. Display the correlation
matrix of the X variables and analyze the variance inflation factors. MC always exists to a
degree, and need not be “removed,” especially if MC is not severe, as it violates no
assumptions. You don't necessarily have to do anything at all about it.
2. In some cases, you can avoid using MC variables. Here are some suggestions. Evaluate them
in your particular situation to see if they make sense; every situation is different.
a. Drop less important and/or redundant X variables.
b. Combine X variables into an index. For example, if X1, X2 and X3 are all measuring
the same thing, then you might use their sum or average in the model in place of the
original three X variables.
c. Use principal components to reduce the dimensionality of the X variables (this is
discussed in courses in Multivariate Analysis).
d. Use “common factors” (or latent variables), to represent the correlated X variables,
and fit a structural equations model relating the response Y to these common factors.
This is a somewhat impractical solution because the common factors are
unobservable, and therefore cannot be used for prediction. Nevertheless, this model is
quite common in behavioral research. It is discussed in courses in Multivariate
Analysis.
e. Use ratios in “size-related” cases. For example, if you have the two firm-level
variables X1 = Total Sales and X2 = Total Assets in your model, they are bound to be
highly correlated. So you might use the two variables X1 = (Total Assets)/(Total
Sales) and X2 = (Total Sales) (perhaps in log form) in your model instead of the two
variables (Total Sales) and (Total Assets).
3. In some cases, you must leave multicollinear variables in the model. These cases include
a. Predictive Multicollinearity: Two variables can be highly correlated, but both are
essential for predicting Y. When you leave one or the other out of the model, you get
a much poorer model (much lower R2). In the data set Turtles, if you predict a turtle's
sex from its length and height, you will find that length and height are highly
correlated (R2 = 0.927). But you have to include them both in the model because
R2(length, height) = 0.61, whereas R2(length) = 0.31 and R2(height) = 0.47. The
scientific conclusion is that turtle sex is more related to turtle shape, a combination
of length and height, than it is to either length or height individually. This probably
makes sense to a biologist who studies turtle reproduction.
b. Variable Inclusion Rules: Whenever you include higher order terms in a model, you
should also include the implied lower order terms. For example, if you include X² in
the model, then you should also include X. But X and X² are highly correlated.
Nevertheless, both X and X² should be used in the model, despite the fact that they are
highly correlated, for reasons we will give in the next chapter.
c. Research Hypotheses: Your main research hypothesis is to assess the effect of X1, but
you recognize that the effect of X1 on Y might be confounded by X2. If this is the
case, you are simply stuck with including both X1 and X2 in your model.
4. Other solutions: Redesign study or collect more data.
a. Selection of levels: If you have the opportunity to select the (X1, X2) values, then you
should attempt to do so in a way that makes those variables as uncorrelated as
possible. For example, (X1, X2) might refer to two process inputs, each either “Low”
or “High,” and you should select them in the arrangement (L,L), (L,H), (H,L), (H,H),
with equal numbers of runs at each combination, to ensure that X1 and X2 are
uncorrelated.
b. Sample size: The main problem resulting from MC is that the standard errors are
large. You can always make standard errors smaller by collecting a larger sample
size: Recall that

s.e.(β̂_j) = σ̂ × {1/(1 − R_j²)}^{1/2} × 1/{s_{x_j} (n − 1)^{1/2}},

where R_j² is the R² from regressing X_j on the other X variables and s_{x_j} is the
sample standard deviation of X_j. So if you collect more data but change nothing else,
your standard errors will become smaller.
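As a check of that formula (simulated data; nothing here is specific to any example in this chapter), the by-hand standard error agrees with what summary() reports.

# Check the standard error formula against summary(lm())
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)       # correlated predictors
y <- 1 + 2*x1 - x2 + rnorm(n, sd = 3)
fit <- lm(y ~ x1 + x2)
sigma.hat <- summary(fit)$sigma
R2.1 <- summary(lm(x1 ~ x2))$r.squared        # R-squared of x1 on the other X's
se.byhand <- sigma.hat * sqrt(1/(1 - R2.1)) / (sd(x1) * sqrt(n - 1))
c(se.byhand, summary(fit)$coefficients["x1", "Std. Error"])   # should agree

Doubling the sample size, other things equal, shrinks the standard error by a factor of roughly 1/sqrt(2).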