Multicollinearity in Cross-Sectional Regressions‡.
Jørgen Lauridsen (*), Jesús Mur (**)

(*) Corresponding author: The Econometric Group, Department of Economics, University of Southern Denmark, Odense, Denmark. e-mail: [email protected]
(**) Department of Economic Analysis, University of Zaragoza, Zaragoza, Spain. e-mail: [email protected]
Abstract
The robustness of the results coming from an econometric application depends to a
great extent on the quality of the sampling information. This statement is a general rule
that becomes especially relevant in a spatial context where data usually have lots of
irregularities.
The purpose of this paper is to examine more closely this question paying attention to
the impact of multicollinearity. It is well known that the reliability of estimators (least-
squares or maximum-likelihood) gets worse as the linear relationships between the
regressors become more acute. The main aspect of our work is that we resolve the
discussion in a spatial context, looking closely into the behaviour shown, under several
unfavourable conditions, by the most outstanding misspecification tests when collinear
variables are added to the regression. For this purpose, we plan and solve a Monte Carlo
simulation. The conclusions point to the fact that these statistics react in different ways
to the problems posed.
‡ Acknowledgements: This work has been carried out with the financial support of project SEC 2002-02350 of the Spanish Ministerio de Educación. The authors also wish to thank Ana Angulo for her invaluable and disinterested collaboration.
1- Introduction
The main purpose of this paper is to examine the relationship between quality of the
sampling information and trustworthiness of econometric results in a cross-sectional
setting. We will focus on one point in particular, namely multicollinearity.
Multicollinearity among regressors is an intriguing and common property of data. The
consequences for estimation and inference are well known: unreliable estimation
results; high standard errors; coefficients with wrong signs and implausible magnitudes,
etc. (Belsley et al., 1980). In light of these problems, it is striking to see the relatively
cursory treatment the problem receives in the econometric literature. Usually the discussion is
restricted to a few diagnostics, together with some standard suggestions for estimation,
trading an (expected) small increase in bias for an (expected) small reduction in MSE.
This slight treatment is frequently justified by judging the problem irrelevant, using
statements like that of Greene (2003): ‘Suggested “remedies” to multicollinearity
might well amount to attempts to force the theory on the data’.
Indeed, a number of serious attempts to resolve the multicollinearity problem have
appeared in the literature, but they are generally not included as part of the econometrician’s
toolbox. Such attempts include the three-stage test procedure (Farrar and Glauber, 1967,
Wichers, 1975, Kumar, 1975, and O’Hagan and McCabe, 1975); regularisation methods
(Draper and Nostrand, 1979, Hocking, 1983) and factor analysis regression (Scott,
1966, King, 1969, Scott, 1969). A recent attempt by Kosfeld and Lauridsen (2005)
integrates a common factor measurement model in an errors in variables setting and
suggests a feasible factor analysis regression (FAR) estimator which outperforms the
OLS estimator for cases of medium and strong multicollinearity. In-depth treatment of
the multicollinearity problem and tools for detection and remedying are presented by
Belsley et al. (1980) and Chatterjee and Hadi (1988).
The purpose of the present investigation is to address the specific problems caused
when performing misspecification tests in a spatial cross-sectional regression. In section
2 we go deeply into the issues related to multicollinearity by tracing the partial effect on
misspecification tests from an additional variable which is collinear with the remaining
variables, using a partial regression framework established by Chatterjee and Hadi
(1988). Although it is well known (Chatterjee and Hadi, 1988; Belsley et al., 1980) that
multicollinearity affects the least-squares estimates but not the least-squares residuals,
on which the tests are based, this property is shown to be sensitive in unpredictable
ways to the amount of spatial dependency as well as to misspecification of the
underlying spatial process. While the framework applied also serves well as a tool to
trace the impact of extremal observations (which is equivalent to omitting relevant
variables, i.e. a set of dummies each of which holds the value 1 for an extremal
observation) as well as the impact of joint presence of outliers and multicollinearity, the
present study concentrates on the multicollinearity problem. An analysis of the effects
of extremal observations in spatial cross-sectional regression appears in Mur and
Lauridsen (2005), and an integrative study combining the two aspects is planned
(Lauridsen and Mur, 2005). A simulation study is carried out in the third section
in order to analyse finite-sample size and power effects on tests for spatial dependency
of 1) omission/inclusion of an additional collinear variable, and 2) misspecification of
the underlying spatial process. The paper finishes with a section of conclusions.
2- Multicollinearity in cross-sectional econometric models
Essentially, multicollinearity refers to the successive inclusion of additional variables
that lift the collinearity of the full set of explanatory variables to a ‘harmful’ level. This
is the case if the additional variables 1) correlate closely with one or more linear
combinations of the variables already in the model and 2) contribute relatively little to
the prediction beyond what is provided by the variables already in the model.
Formally, the problem faced can be expressed as

(1) $Y = X_1\beta_1 + X_2\beta_2 + u$

where $X_1$ are the variables included in the model and $X_2$ a set of $k_2$ additional variables to be considered added to the specification. Rewriting $X_2 = X_1\gamma + \xi = \hat{X}_2 + \xi$, where $\xi$ is a set of $k_2$ error vectors, multicollinearity occurs when the variance of $\xi$ is relatively small as compared to the variance of $X_2$.
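A brief numerical sketch may help fix ideas. The following Python fragment (our own illustrative construction, assuming a single added regressor, $k_2 = 1$) generates $X_2 = X_1\gamma + \xi$ and shows the ratio $var(\xi)/var(X_2)$ falling as the collinearity becomes more severe:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 100                                 # number of cross-sectional observations

# X1: intercept plus two exogenous regressors already in the model
X1 = np.column_stack([np.ones(R), rng.normal(size=(R, 2))])

# X2 = X1 @ gamma + xi: a candidate regressor that is nearly a linear
# combination of X1; collinearity is severe when var(xi) is small
# relative to var(X2)
gamma = np.array([1.0, 2.0, -1.0])
for sd in (1.0, 0.1, 0.01):             # shrinking sd(xi): harsher collinearity
    xi = rng.normal(scale=sd, size=R)
    X2 = X1 @ gamma + xi
    print(f"sd(xi)={sd:5.2f}  var(xi)/var(X2)={xi.var() / X2.var():.5f}")
```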
To trace the impact of adding the additional regressor, two matrices are central: the prediction matrix, defined as $P = X(X'X)^{-1}X'$, which maps $y$ onto the prediction $\hat{y}$, i.e. $\hat{y} = Py$, and the residual matrix $M = I - P$, which maps $y$ onto the residuals from a regression on $X$, i.e. $\hat{u} = My$.
The prediction $\hat{y}$ can be thought of as made up of two independent predictions: 1) the prediction provided by $X_1$ and 2) the additional prediction provided by the part of $X_2$ that is independent of $X_1$, i.e. the residual from a regression of $X_2$ on $X_1$, which is equal to $\xi$. This can be formalised by partialising the prediction matrix into two prediction matrices as

(2) $P = P_1 + P_2 = X_1(X_1'X_1)^{-1}X_1' + M_1X_2(X_2'M_1X_2)^{-1}X_2'M_1$

so that the prediction is partialised as $\hat{y} = Py = P_1y + P_2y = \hat{y}_1 + \hat{y}_2$.
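The partialisation in (2) is easy to verify numerically. The sketch below (our own construction, with hypothetical data) builds $P_1$ and $P_2$ and checks that their sum reproduces the full prediction matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
R = 50
X1 = np.column_stack([np.ones(R), rng.normal(size=(R, 2))])
X2 = rng.normal(size=(R, 1))
X = np.hstack([X1, X2])
y = rng.normal(size=R)

def hat(Z):
    """Prediction matrix Z (Z'Z)^{-1} Z' of a regressor block Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

P = hat(X)                        # full prediction matrix
P1 = hat(X1)                      # prediction from X1 alone
M1 = np.eye(R) - P1               # residual-maker for X1
P2 = hat(M1 @ X2)                 # prediction from the part of X2 orthogonal to X1

# P = P1 + P2, so y_hat = y_hat_1 + y_hat_2 as in (2)
print(np.allclose(P, P1 + P2), np.allclose(P @ y, P1 @ y + P2 @ y))
```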
Essentially, spatial dependency may be incorporated in (1) in one of two ways. A dynamic, substantive dependency is included by respecifying (1) with a spatially autoregressive (SAR) term as

(3) $Y = \rho WY + X_1\beta_1 + X_2\beta_2 + u$

while a static, residual dependency is included by respecifying (1) as

(4) $Y = X_1\beta_1 + X_2\beta_2 + \nu$

which further divides into the spatially autocorrelated (SAC) specification obtained by letting

(5) $\nu = \rho W\nu + u$

or the spatial moving average (SMA) specification obtained by letting

(6) $\nu = u - \rho Wu$.

Further, static and substantive dependency may be combined by replacing the residual of (3) with a residual of the form (5) to obtain a spatially autoregressive, spatially autocorrelated (SARC) specification, or of the form (6) to obtain a spatially autoregressive, spatial moving average (SARMA) specification.
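To make the alternative specifications concrete, the following sketch simulates a SAR process, SAC errors and SMA errors on a simple line of $R$ locations with a row-standardised contiguity matrix (our own illustrative setup, not the design of the Monte Carlo in section 3):

```python
import numpy as np

rng = np.random.default_rng(2)
R, rho = 25, 0.5

# contiguity matrix for locations on a line, row-standardised
W = np.zeros((R, R))
for i in range(R):
    for j in (i - 1, i + 1):
        if 0 <= j < R:
            W[i, j] = 1.0
W /= W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(R), rng.normal(size=R)])
beta = np.array([1.0, 2.0])
u = rng.normal(size=R)
I = np.eye(R)

y_sar = np.linalg.solve(I - rho * W, X @ beta + u)  # (3) SAR: y = rho*W*y + X*b + u
nu_sac = np.linalg.solve(I - rho * W, u)            # (5) SAC errors: nu = rho*W*nu + u
y_sac = X @ beta + nu_sac                           # (4) with SAC disturbance
y_sma = X @ beta + u - rho * W @ u                  # (4) with SMA disturbance (6)
```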
For the moment we will limit ourselves to evaluate the impact on the misspecification
statistics habitually used in cross-sectional econometric models, that is, on Moran's $I$,
LM-ERR, LM-EL and KR, which address the problem of spatial dependence in
the error term, together with LM-LAG and LM-LE, whose objective is to
analyse the dynamic structure of the equation. To these we add the SARMA test, whose
alternative hypothesis is composite (dynamic structure in the equation and a moving
average error term). Appendix 1 provides a brief presentation. With respect to our work,
it is important to point out that the seven tests are constructed from the residuals of the
LS estimation. Given that these residuals react in a different way to the presence of
anomalies in the sample, this sensitivity should appear, at least in part, also in the tests.
To establish the impact of adding $X_2$ to the regression on the Moran test, use that $M = I - (P_1 + P_2) = M_1 - P_2$, where $M_1$ is the residual matrix for the regression of $y$ on $X_1$, whereby $\hat{u} = My = (M_1 - P_2)y$, so that the Moran $I$ test reads as

(7) $I = \dfrac{R}{S_0}\dfrac{\hat{u}'W\hat{u}}{\hat{u}'\hat{u}} = \dfrac{R}{S_0}\dfrac{y'MWMy}{y'My} = \dfrac{R}{S_0}\dfrac{y'M_1WM_1y + y'P_2WP_2y - 2y'M_1WP_2y}{y'M_1y - y'P_2y}$

$\qquad = \dfrac{R}{S_0}\dfrac{m_1 + m_2 - m_3}{m_4 - m_5} = \dfrac{R}{S_0}D_1 + \dfrac{R}{S_0}D_2 = I_1 + i_2$

where $D_1 = \dfrac{m_1}{m_4}$ and $D_2 = \dfrac{m_2m_4 - m_3m_4 + m_1m_5}{m_4(m_4 - m_5)}$, with $m_1 = y'M_1WM_1y$, $m_2 = y'P_2WP_2y$, $m_3 = 2y'M_1WP_2y$, $m_4 = y'M_1y$, and $m_5 = y'P_2y$. Thus, $I_1$ is the Moran test that emerges when $y$ is regressed on $X_1$ only, and $i_2$ is the additional effect on the test when adding $X_2$ to the regression. Increasing collinearity implies that all quadratic and cross-product terms involving $\hat{y}_2 = P_2y$ go towards 0, i.e. $m_2$, $m_3$ and $m_5$ go to 0, while $m_1$ and $m_4$ are left unaffected. This implies that $i_2$ goes toward 0, so that the test involving $X_1$ and $X_2$ moves toward the test involving $X_1$ only.
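Decomposition (7) can be checked numerically. The sketch below (hypothetical data; $W$ is taken symmetric and binary, for which the cross-product term $m_3 = 2y'M_1WP_2y$ holds exactly) computes $m_1, \dots, m_5$ and verifies that $I_1 + i_2$ reproduces the Moran statistic built from the residuals of the full regression:

```python
import numpy as np

rng = np.random.default_rng(3)
R = 50

# symmetric binary contiguity matrix on a line of R locations
W = np.zeros((R, R))
for i in range(R - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
S0 = W.sum()

X1 = np.column_stack([np.ones(R), rng.normal(size=(R, 2))])
y = X1 @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=R)
# X2: an added regressor strongly collinear with X1
X2 = (X1 @ np.array([1.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=R)).reshape(-1, 1)

P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
M1 = np.eye(R) - P1
Z = M1 @ X2                                   # part of X2 orthogonal to X1
P2 = Z @ np.linalg.solve(Z.T @ Z, Z.T)

m1 = y @ M1 @ W @ M1 @ y
m2 = y @ P2 @ W @ P2 @ y
m3 = 2 * y @ M1 @ W @ P2 @ y
m4 = y @ M1 @ y
m5 = y @ P2 @ y

I1 = (R / S0) * m1 / m4                       # Moran test from X1 alone
i2 = (R / S0) * (m2 * m4 - m3 * m4 + m1 * m5) / (m4 * (m4 - m5))

u_hat = (M1 - P2) @ y                         # residuals of the full regression
I_full = (R / S0) * (u_hat @ W @ u_hat) / (u_hat @ u_hat)
print(np.isclose(I1 + i2, I_full))
```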
The effect on the expected value of the $I$ test under the null is traced as follows:

(8) $E(I) = \dfrac{R}{S_0(R-k)}tr(MW) = \dfrac{R}{S_0(R-k)}tr(M_1W) - \dfrac{R}{S_0(R-k)}tr(P_2W) = E(I_1) + ei_2$

where $E(I_1)$ is the expectation when including only $X_1$ in the regression, and $ei_2$ the additional effect that goes toward 0 for increasing collinearity.
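The trace decomposition behind (8) follows from $M = M_1 - P_2$, which gives $tr(MW) = tr(M_1W) - tr(P_2W)$. A quick numerical check (our own construction, with hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(4)
R = 40

# symmetric binary contiguity matrix on a line
W = np.zeros((R, R))
for i in range(R - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
S0 = W.sum()

X1 = np.column_stack([np.ones(R), rng.normal(size=(R, 2))])
X2 = rng.normal(size=(R, 1))
X = np.hstack([X1, X2])
k = X.shape[1]

def hat(Z):
    """Prediction matrix Z (Z'Z)^{-1} Z'."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

M = np.eye(R) - hat(X)                        # residual matrix of the full regression
M1 = np.eye(R) - hat(X1)                      # residual matrix for X1 only
P2 = hat(M1 @ X2)                             # prediction from X2 orthogonal to X1

# tr(MW) splits into a term from X1 and a collinearity-related correction
lhs = np.trace(M @ W)
rhs = np.trace(M1 @ W) - np.trace(P2 @ W)
E_I = R * lhs / (S0 * (R - k))                # E(I) under the null, as in (8)
print(np.isclose(lhs, rhs))
```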
A similar, though admittedly more involved, development for the variance of $I$ provides: