Chapter 8. R-squared, Adjusted R-squared, the F test, and Multicollinearity
This chapter discusses additional output from regression analysis, in the context of multiple
regression in the classical model. It also discusses multicollinearity, its effects, and remedies.
8.1 The R-squared Statistic
The “population” R² statistic was introduced in Chapter 6 as

population R² = 1 − E{v(X)}/Var(Y),

where v(x) is the conditional variance of Y given X = x.
This number tells you how well the “X” variable(s) predict your “Y” variable. Since the entire
focus of this book is on conditional distributions p(y | x), I’d like you to understand the
“prediction” concept in terms of separation of the distributions p(y | X = low) and p(y | X = high).
For example, suppose the true model is

Y = 6 + 0.2X + ε,

where X ~ N(20, 5²) and Var(ε) = σ². Then Var(Y) = 0.2² × 5² + σ² = 1 + σ², and v(x) = σ²,
implying population R² = 1 − σ²/(1 + σ²) = 1/(1 + σ²). Three cases I’d like you to consider are
(i) σ² = 9.0, implying a low R², (ii) σ² = 1.0, implying a medium value, and (iii) σ² = 1/9,
implying a high R². In all cases, let’s say a “low” value of X is 15.0, one standard deviation
below the mean, and a high value of X is 25.0, one standard deviation above the mean.
Now, when X = 15, the distribution p(y | X = 15) is the N(6 + 0.2(15) = 9.0, σ²) distribution; and
when X = 25, the distribution p(y | X = 25) is the N(6 + 0.2(25) = 11.0, σ²) distribution. Figure
8.1.0 displays these distributions for the three cases above, where the population R² is either 0.1,
0.5, or 0.9 (which happen in this study when σ² is either 9.0, 1.0, or 1/9). Notice that there is
greater separation of the distributions p(y | x) when the “population” R² is higher.
Figure 8.1.0. Separation of distributions p(y | X = low) (left distributions) and p(y | X = high)
(right distributions) in cases where the “population” R² is 0.1 (top panel), 0.5 (middle panel),
and 0.9 (bottom panel). In all cases X = low and X = high refer to an X that is either one standard
deviation below the mean or one standard deviation above the mean.
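If you want to reproduce the flavor of Figure 8.1.0 yourself, the following is a minimal R sketch; the means 9 and 11 and the error variances 9, 1, and 1/9 come from the example above, while the plotting details are just one possible choice.

# p(y | X = 15) is N(9, sigma^2) and p(y | X = 25) is N(11, sigma^2)
sigma2 <- c(9, 1, 1/9)              # gives population R-squared = 0.1, 0.5, 0.9
par(mfrow = c(3, 1))
for (s2 in sigma2) {
  y <- seq(0, 20, length.out = 500)
  plot(y, dnorm(y, mean = 9, sd = sqrt(s2)), type = "l", ylab = "density",
       main = paste("population R-squared =", round(1/(1 + s2), 1)))
  lines(y, dnorm(y, mean = 11, sd = sqrt(s2)), lty = 2)
}

Greater separation of the solid and dashed curves corresponds to a higher population R-squared.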
In the case of the classical regression model, which is instantiated by Figure 8.1.0, the
conditional variance Var(Y | X = x) = v(x) is a constant σ², and does not depend on X = x. Also in
the classical regression model, the maximum likelihood estimate of σ² is σ̂² = SSE/n, where

SSE = Σ_{i=1}^n (y_i − ŷ_i)²,

the sum of squared vertical deviations from the y_i values to the fitted OLS function. The
unconditional variance is Var(Y) = σ_Y², so the “population” R² statistic, in the classical
regression model, is

population R² = 1 − σ²/σ_Y².

The maximum likelihood estimate of σ_Y² is σ̂_Y² = SST/n, where

SST = Σ_{i=1}^n (y_i − ȳ)²,

the “total” sum of squared vertical deviations from the y_i values to the flat line where y = ȳ.
See Figure 8.1.1.
Figure 8.1.1. Scatterplot of n = 4 data points (indicated by X’s). The horizontal red line is the
y = ȳ line and the diagonal blue line is the least squares line. Vertical deviations from the
y = ȳ line are shown in red; SST is the sum of these squared deviations. Vertical deviations from
the least squares line are shown in blue; SSE is the sum of these squared deviations. The R²
statistic equals 1 − SSE/SST.
Using the maximum likelihood estimates of conditional and unconditional variance, you get the
estimate of the “population” R-squared statistic,
R2 = 1 – (SSE/n)/(SST/n) = 1 – SSE/SST.
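As a quick numerical check of this identity, here is a sketch using a small made-up data set (any x and y of the same length will do); it computes SSE and SST by hand, in the spirit of Figure 8.1.1, and compares the result with the value reported by summary().

# By-hand R-squared = 1 - SSE/SST, checked against summary(lm())
x <- c(1, 2, 3, 4)
y <- c(2.1, 2.9, 4.2, 4.8)
fit <- lm(y ~ x)
SSE <- sum(residuals(fit)^2)    # squared deviations from the least squares line
SST <- sum((y - mean(y))^2)     # squared deviations from the flat line y = ybar
1 - SSE/SST                     # by-hand R-squared
summary(fit)$r.squared          # same number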
In Chapter 5, you saw models with different transformations in the X variable. The model with
the highest maximized log likelihood was the one with the smallest estimated conditional
variance SSE/n, hence it was also the model with smallest SSE, since n is always the same when
considering different models for the same data set. Also, SST is always the same when
considering different models for the same data set, because SST does not involve the predicted
values from the model. Thus, among the different models having differently transformed X
variables¹, the model with the highest log likelihood corresponds precisely to the model with the
highest R² statistic.
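You can verify this correspondence directly in R. The following is only a sketch with simulated data (the data set and the log transformation are illustrative, not taken from Chapter 5): whichever X transformation gives the higher maximized log likelihood also gives the higher R².

# Among models that differ only in the X transformation,
# highest log likelihood <=> highest R-squared
set.seed(123)
x <- runif(100, 1, 10)
y <- 2 + 3*log(x) + rnorm(100)
fit.lin <- lm(y ~ x)
fit.log <- lm(y ~ log(x))
c(logLik(fit.lin), logLik(fit.log))                        # log likelihoods
c(summary(fit.lin)$r.squared, summary(fit.log)$r.squared)  # R-squared values, same ordering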
While it is mathematically factual that 0 ≤ R2 ≤ 1.0, there is no “Ugly Rule of Thumb” for how
large an R2 statistic should be to be considered “good.” Rather, it depends on norms for the given
subject area: In finance, any non-zero R2 for predicting stock returns is interesting, because the
efficient markets hypothesis states that the “population” R2 is zero in this case. In chemical
reaction modeling, the outputs are essentially deterministic functions of the inputs, so an R2
statistic that is less than 1.0, e.g. 0.99, may not be good enough because it indicates faulty
experimental procedures. With human subjects and models to predict their behavior, the R2
statistics are typically less than 0.50 because people are, well, people. We have our own minds,
and are not robots that can be pigeon-holed by some regression model.
Our advice is to rely less on R2, and more on separation of distributions as seen in Figure 8.1.0.
When we get to more complex models, the usual R2 statistic becomes less interpretable, and in
some cases it is non-existent. But you always will have conditional distributions p(y | x), and you
can always graph those distributions as shown in Figure 8.1.0 to see how well your X predicts
your Y.
8.2 The Adjusted R-Squared Statistic
Recall that, in the classical model, the population R² = 1 − σ²/σ_Y², and that the standard R²
statistic replaces the two variances with their maximum likelihood estimates. Recall also that
maximum likelihood estimates of variance are slightly biased. Replacing the variances with their
unbiased estimates gives the adjusted R² statistic:

R_a² = 1 − {SSE/(n − k − 1)}/{SST/(n − 1)}.
With a larger number of predictor variables k, the ordinary R² tends to be increasingly biased
upward; the adjusted R² statistic is less biased. You can interpret the adjusted R² statistic in the
same way as the ordinary one, but note that the adjusted R² statistic can take values less than 0.0,
which are clearly bad estimates since the estimand cannot be negative.
1 This discussion refers to X transformations only, not Y transformations.
The following R code indicates where these statistics are, as well as “by hand” calculations of
them. The F test appears on the last line of the summary() output, for example:

F-statistic: 6.974 on 2 and 97 DF, p-value: 0.001479
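The data set behind that particular output is not reproduced here, so the sketch below uses simulated data with the same dimensions (n = 100 observations and k = 2 predictors, hence 2 and 97 degrees of freedom) to show where the statistics appear in the summary() output and how to recover them by hand.

# Where R-squared, adjusted R-squared, and the F statistic live, and by-hand versions
set.seed(1)
n <- 100; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.3*x1 + 0.3*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
summary(fit)            # last lines report Multiple R-squared, Adjusted R-squared, F-statistic
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2 <- 1 - SSE/SST                              # matches summary(fit)$r.squared
R2.a <- 1 - (SSE/(n - k - 1))/(SST/(n - 1))    # matches summary(fit)$adj.r.squared
F.stat <- ((SST - SSE)/k)/(SSE/(n - k - 1))    # matches the F on k and n - k - 1 df
c(R2, R2.a, F.stat)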
4 In the case of simple regression, the R2 statistic is equal to the square of the correlation coefficient, so you get the
same R2 in both regressions. However, with more than two X variables, the R2 statistics will all be different. Some of
the X variables will be more highly related to the others; the parameter estimates for these variables suffer most from
multicollinearity.
The R² statistics relating X1 to X2 and X2 to X1 are both 0.0 in this second example. The
standard error multiplier in the first example with the high multicollinearity was 52.02973;
checking, we see that 52.02973 × 0.3894 = 20.26, reasonably close to the standard errors
18.36 and 18.33 in the original analysis with the multicollinear variables. Differences are
mostly explained by randomness in the estimates σ̂.
Summary of Multicollinearity (MC) and its effects
1. MC exists when the X’s are correlated (i.e., almost always). It does not involve the
Y’s. Existence of MC violates none of the classical model assumptions5.
2. Greater MC causes larger standard errors of the parameter estimates. This means that
your estimates of the parameters tend to be less precise with higher degrees of MC.
You will tend to have more insignificant tests and wider confidence intervals in these
cases. This happens because when X1 and X2 are closely related, the data cannot
isolate the unique effect of X1 on Y, controlling X2, as precisely as is the case when
X1 and X2 are not closely related.
3. The more the MC, the less interpretable are the parameters. In particular, β1 is the
effect of varying X1 when other X’s are held fixed. But it becomes difficult to even
imagine varying X1 while holding X2 fixed, when X1 and X2 are extremely highly
correlated.
4. MC almost always exists in observational data. The question is therefore not “is there
MC?,” but rather “how strong is the MC and what are its effects?” Generally, the
higher the correlations among the X's, the greater the degree of MC, and the greater
the effects (high parameter standard errors; tenuous parameter interpretation.)
5. The extreme case of MC is called “perfect MC,” and happens when the columns of
the X matrix are perfectly linearly dependent, in which case there are no unique least
squares estimates. The fact that there are no unique LSEs in this case does not mean
you can't proceed; you can still estimate parameters (albeit non-uniquely) and
make valid predictions resulting from such estimates. Most computer software allows
you to estimate models in this case, but provides a warning message or other unusual
output (such as R’s “NA” for some parameter estimates) that you should pay attention
to; a short demonstration follows this list.
6. Regression models that are estimated using MC data can still be useful. There is no
absolute requirement that MC be below a certain level. In fact, in some cases it is
strongly recommended that highly correlated variables be retained in the model. For
example, in most cases you should include the linear term in a quadratic model, even
though the linear and quadratic terms are highly correlated. This is called the
“Variable Inclusion Principle”; more on this in the next chapter.

5 Some books and web documents incorrectly state that there is an assumption of no MC in regression
analysis.
7. It is most important that you simply recognize the effects of multicollinearity, which
are (i) high variances of parameter estimates, (ii) tenuous parameter interpretations,
and (iii) in the extreme case of perfect multicollinearity, non-existence of unique
least squares estimates.
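The “NA” behavior mentioned in point 5 is easy to see. Here is a minimal sketch with made-up data in which one predictor is an exact linear function of another.

# Perfect multicollinearity: lm() still runs, but reports NA for a redundant coefficient
set.seed(42)
x1 <- rnorm(30)
x2 <- 2*x1 + 1                  # x2 is an exact linear function of x1
y <- 5 + x1 + rnorm(30)
coef(lm(y ~ x1 + x2))           # the coefficient of x2 is NA
fitted(lm(y ~ x1 + x2))[1:5]    # predictions are still produced

The parameter estimates are not unique (R resolves the ambiguity by dropping x2), but the fitted values are perfectly usable.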
When might MC be a Problem?
It makes no sense to “test” for MC in the usual hypothesis testing “H0 vs. H1” sense. The following are not “tests”; they are just suggestions, essentially “Ugly Rules of Thumb,” aimed at helping identify when MC might be a problem.
1. When correlations between the X variables are extremely high (e.g., many greater than
0.9) or variance inflation factors are very high (e.g., greater than 9.0, implying a standard
error inflation factor greater than 3.0); a code sketch of these checks follows this list.
2. When variables that you think are important, a priori, are found to be insignificant, you
might suspect a MC problem. But consider also whether your sample size is simply too
small.
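Both checks are easy to carry out. Here is a sketch using simulated predictors (the variable names and the data are placeholders, not from any example in this chapter); the by-hand variance inflation factor for X_j is 1/(1 − R_j²), where R_j² comes from regressing X_j on the other X’s.

# Correlation matrix and by-hand variance inflation factors
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)     # deliberately correlated with x1
x3 <- rnorm(100)
X <- data.frame(x1, x2, x3)
cor(X)                              # look for correlations near +/- 0.9 or beyond
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1/(1 - r2)
})
vif                                 # values above about 9 suggest troublesome MC
sqrt(vif)                           # standard error inflation factors

If you have the car package installed, its vif() function applied to a fitted regression of Y on these X’s returns the same numbers.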
What to do about MC?
1. Main Solution: Diagnose the problem and understand its effects. Display the correlation
matrix of the X variables and analyze the variance inflation factors. MC always exists to a
degree, and need not be “removed,” especially if MC is not severe, as it violates no
assumptions. You don't necessarily have to do anything at all about it.
2. In some cases, you can avoid using MC variables. Here are some suggestions. Evaluate them
in your particular situation to see if they make sense; every situation is different.
a. Drop less important and/or redundant X variables.
b. Combine X variables into an index. For example, if X1, X2 and X3 are all measuring
the same thing, then you might use their sum or average in the model in place of the
original three X variables.
c. Use principal components to reduce the dimensionality of the X variables (this is
discussed in courses in Multivariate Analysis).
d. Use “common factors” (or latent variables), to represent the correlated X variables,
and fit a structural equations model relating the response Y to these common factors.
This is a somewhat impractical solution because the common factors are
unobservable, and therefore cannot be used for prediction. Nevertheless, this model is
quite common in behavioral research. It is discussed in courses in Multivariate
Analysis.
e. Use ratios in “size-related” cases. For example, if you have the two firm-level
variables X1 = Total Sales and X2 = Total Assets in your model, they are bound to be
highly correlated. So you might use the two variables X1 = (Total Assets)/(Total
Sales) and X2 = (Total Sales) (perhaps in log form) in your model instead of the two
variables (Total Sales) and (Total Assets).
3. In some cases, you must leave multicollinear variables in the model. These cases include
a. Predictive Multicollinearity: Two variables can be highly correlated, but both are
essential for predicting Y. When you leave one or the other out of the model, you get
a much poorer model (much lower R2). In the data set Turtles, if you predict a turtle's
sex from its length and height, you will find that length and height are highly
correlated (R2 = 0.927). But you have to include them both in the model because
R2(length, height) = 0.61, whereas R2(length) = 0.31 and R2(height) = 0.47. The
scientific conclusion is that turtle sex is more related to turtle shape, a combination
of length and height, than it is to either length or height individually. This probably
makes sense to a biologist who studies turtle reproduction.
b. Variable Inclusion Rules: Whenever you include higher order terms in a model, you
should also include the implied lower order terms. For example, if you include X² in
the model, then you should also include X. But X and X² are highly correlated.
Nevertheless, both X and X² should be used in the model, despite the fact that they are
highly correlated, for reasons we will give in the next chapter.
c. Research Hypotheses: Your main research hypothesis is to assess the effect of X1, but
you recognize that the effect of X1 on Y might be confounded by X2. If this is the
case, you are simply stuck with including both X1 and X2 in your model.
4. Other solutions: Redesign study or collect more data.
a. Selection of levels: If you have the opportunity to select the (X1, X2) values, then you
should attempt to do so in a way that makes those variables as uncorrelated as
possible. For example, (X1, X2) might refer to two process inputs, each either “Low”
or “High,” and you should select them in the arrangement (L,L), (L,H), (H,L), (H,H),
with equal numbers of runs at each combination, to ensure that X1 and X2 are
uncorrelated.
b. Sample size: The main problem resulting from MC is that the standard errors are
large. You can always make standard errors smaller by collecting a larger sample
size: Recall that

s.e.(β̂_j) = σ̂ × {1/(1 − R_j²)}^{1/2} × 1/{s_{x_j} (n − 1)^{1/2}},

where R_j² is the R² from regressing X_j on the other X variables and s_{x_j} is the
sample standard deviation of X_j. So if you collect more data but change nothing else,
your standard errors will become smaller.
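As a check of that formula (simulated data; nothing here is specific to any example in this chapter), the by-hand standard error agrees with what summary() reports.

# Check the standard error formula against summary(lm())
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)       # correlated predictors
y <- 1 + 2*x1 - x2 + rnorm(n, sd = 3)
fit <- lm(y ~ x1 + x2)
sigma.hat <- summary(fit)$sigma
R2.1 <- summary(lm(x1 ~ x2))$r.squared        # R-squared of x1 on the other X's
se.byhand <- sigma.hat * sqrt(1/(1 - R2.1)) / (sd(x1) * sqrt(n - 1))
c(se.byhand, summary(fit)$coefficients["x1", "Std. Error"])   # should agree

Doubling the sample size, other things equal, shrinks the standard error by a factor of roughly 1/sqrt(2).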