Work Package No. 4
August 2011

D4.1
Manual for the tests of spatial econometric model

Authors: Vincent Linderhof, Peter Nowicki, Eveline van Leeuwen, Stijn Reinhard and Martijn Smit

Document status: Internal use / Confidential use / Draft No. 2 / Final / Submitted for internal review
Dates: 31-7-2011, 29-3-2013, 22-3-2013
Table of contents

Figures
Abbreviations
Summary
We now keep only the NUTS2 regions in the database by dropping all others, and then do the same for the coordinates database.
use nuts2db, clear
describe
rename NUTS_ID nuts_id
drop if STAT_LEVL!=2
save "nuts2db.dta", replace

use nuts2coord, clear
merge m:1 _ID using nuts2db, keep(match) keepusing( )
drop POLY_ID-_merge
save "nuts2coord.dta", replace
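The keep-and-merge logic above can be sketched outside Stata as well. A minimal Python sketch with hypothetical toy records (the region codes and coordinates are invented for illustration):

```python
# Toy stand-ins for nuts2db (region attributes) and nuts2coord (coordinates);
# the codes and coordinates below are invented for illustration.
regions = [
    {"NUTS_ID": "NL", "STAT_LEVL": 0},
    {"NUTS_ID": "NL31", "STAT_LEVL": 2},
    {"NUTS_ID": "NL32", "STAT_LEVL": 2},
]
coords = [
    {"NUTS_ID": "NL", "x": 5.3, "y": 52.2},
    {"NUTS_ID": "NL31", "x": 5.1, "y": 52.1},
    {"NUTS_ID": "NL32", "x": 4.5, "y": 52.3},
]

# drop if STAT_LEVL != 2: keep only the NUTS2 regions
nuts2 = [r for r in regions if r["STAT_LEVL"] == 2]

# merge ..., keep(match): inner join of the coordinates with the NUTS2 regions
nuts2_ids = {r["NUTS_ID"] for r in nuts2}
nuts2_coords = [c for c in coords if c["NUTS_ID"] in nuts2_ids]

print([c["NUTS_ID"] for c in nuts2_coords])  # → ['NL31', 'NL32']
```

The `keep(match)` option in Stata corresponds to the inner join above: only records present in both files survive.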
We now make a queen contiguity matrix, which we call nuts2q. Subsequently, we include the regional IDs in our data file and set it to use the queen contiguity matrix nuts2q. Finally, we create a spatially lagged variable of var using this matrix, which we choose to call var_wq. We can now use this variable like any ordinary variable in regressions. For more information, see help spmat in Stata.
use nuts2db.dta
spmat contiguity nuts2q using nuts2coord, id(_ID)
spmat save nuts2q using nuts2q.spmat

use datafile.dta
rename region NUTS_ID
merge m:1 NUTS_ID using nuts2db, keepusing(_ID)
rename NUTS_ID region
drop if _merge==1
spmat use nuts2q using nuts2q.spmat
spmat lag var_wq nuts2q var
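The spatial lag produced by spmat lag is simply the product of a (row-standardised) weight matrix W and the variable: for each region, the weighted average of var over its neighbours. A minimal Python sketch, using the document's var and var_wq names but an invented three-region contiguity structure:

```python
# Hypothetical queen-contiguity neighbour lists for three regions:
# region 0 borders regions 1 and 2; regions 1 and 2 each border only region 0.
neighbours = {0: [1, 2], 1: [0], 2: [0]}
var = [10.0, 20.0, 30.0]

# Row-standardised weight matrix W: each row sums to 1 over its neighbours.
n = len(var)
W = [[0.0] * n for _ in range(n)]
for i, nbrs in neighbours.items():
    for j in nbrs:
        W[i][j] = 1.0 / len(nbrs)

# Spatial lag: var_wq[i] = sum_j W[i][j] * var[j], i.e. the neighbours' mean.
var_wq = [sum(W[i][j] * var[j] for j in range(n)) for i in range(n)]
print(var_wq)  # → [25.0, 10.0, 10.0]
```

With row standardisation, the lag of region 0 is the mean of its two neighbours' values, which is why spatially lagged variables are often read as "the average of var in the surrounding regions".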
4.2 Model specification test (diagnostics)
A model specification error can occur when one or more relevant variables are omitted from the model, or when one or more irrelevant variables are included. If relevant variables are omitted, the common variance they share with the included variables may be wrongly attributed to those variables, and the error term is inflated. Conversely, if irrelevant variables are included, the common variance they share with the included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimates of the regression coefficients. This section presents a summary of chapter 2 of the UCLA Stata web book on regression (http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm).

More information on the endogeneity test can be found in Cameron and Trivedi (2010, p. 188).
4.2.3 Test on omitted variables: Ramsey test
The ovtest command performs a regression specification error test (RESET) for omitted
variables. It also creates new variables based on the predictors and refits the model using
those new variables to see if any of them would be significant.
. ovtest

Ramsey RESET test using powers of the fitted values of api00
       Ho: model has no omitted variables
                F(3, 393) =      4.13
                 Prob > F =      0.0067
The ovtest command in the example above rejects the null hypothesis (p = 0.0067), indicating that there are omitted variables. More information on the Ramsey test can be found in Cameron and Trivedi (2010, pp. 98-100) or in the Stata help for ovtest.
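The idea behind the RESET can be sketched in a few lines: refit the model with powers of the fitted values added and test whether the fit improves significantly. A minimal Python sketch with invented data (not the api00 example) and a small normal-equations OLS solver written for the occasion:

```python
import math

def ols_ssr(y, X):
    """Residual sum of squares (and fitted values) of an OLS fit of y on X plus intercept."""
    n, k = len(y), len(X[0]) + 1
    A = [[1.0] + list(row) for row in X]          # design matrix with intercept
    # Normal equations (A'A) b = A'y, solved by Gaussian elimination with pivoting.
    M = [[sum(A[i][p] * A[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    v = [sum(A[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(M[r][p]))
        M[p], M[piv] = M[piv], M[p]
        v[p], v[piv] = v[piv], v[p]
        for r in range(p + 1, k):
            f = M[r][p] / M[p][p]
            for q in range(p, k):
                M[r][q] -= f * M[p][q]
            v[r] -= f * v[p]
    b = [0.0] * k
    for p in range(k - 1, -1, -1):                # back substitution
        b[p] = (v[p] - sum(M[p][q] * b[q] for q in range(p + 1, k))) / M[p][p]
    fitted = [sum(A[i][q] * b[q] for q in range(k)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n)), fitted

# Invented data with a nonlinearity that a purely linear fit misses.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 3.9, 9.1, 15.8, 25.3, 35.9]

ssr_r, fitted = ols_ssr(y, [[xi] for xi in x])                       # restricted: y on x
ssr_u, _ = ols_ssr(y, [[xi, f ** 2] for xi, f in zip(x, fitted)])    # add fitted^2

# RESET F statistic with q = 1 added regressor and k = 3 parameters.
n, q, k = len(y), 1, 3
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(round(F, 2))
```

A large F (as here, where the data are close to quadratic) signals that powers of the fitted values add explanatory power, i.e. the linear specification is inadequate.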
4.2.4 Test on multicollinearity of explanatory variables
When there is a perfect linear relationship among the predictors, the estimates for a regression
model cannot be uniquely computed. The term collinearity implies that two variables are near-perfect linear combinations of one another. When more than two variables are involved, this is often called multicollinearity, although the two terms are frequently used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression model
estimates of the coefficients become unstable and the standard errors for the coefficients can
get wildly inflated. In this section, we will explore some Stata commands that help to detect
multicollinearity.
We can use the vif command after the regression to check for multicollinearity; vif stands for variance inflation factor. As a rule of thumb, a variable with a VIF value greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check the degree of collinearity; a tolerance value lower than 0.1 is comparable to a VIF of 10 and means that the variable could be considered a linear combination of the other independent variables. Let us first look at the regression from the previous section, the model predicting api00 from meals, ell and emer, and then issue the vif command.
More information on the VIF and multicollinearity can be found in Cameron and Trivedi (2010, p. 379).
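The VIF itself is simple to compute: VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two predictors this reduces to 1/(1 - r^2), with r their correlation. A minimal Python sketch with two invented, nearly collinear predictors:

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

# Two invented, nearly collinear predictors.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]

r = correlation(x1, x2)
vif = 1.0 / (1.0 - r ** 2)
print(round(vif, 2))  # well above the rule-of-thumb cut-off of 10
```

Even this mild deviation from perfect collinearity produces a VIF far above 10, which is why near-duplicate regressors make coefficient estimates unstable.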
4.2.5 Test on homogeneity of variance

One of the main assumptions of ordinary least squares regression is the homogeneity of variance of the residuals (homoscedasticity). If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values. If the variance of the residuals is non-constant, the residuals are said to be "heteroscedastic". Stata offers two tests for heteroscedasticity.
Now let us look at the two commands that test for heteroscedasticity. The first test, given by estat imtest, is White's test, and the second, given by estat hettest, is the Breusch-Pagan test. Both test the null hypothesis that the variance of the residuals is homogeneous. Therefore, if the p-value is very small, we reject this hypothesis and accept the alternative hypothesis that the variance is not homogeneous.
. estat imtest

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      18.35      9    0.0313
            Skewness |       7.78      3    0.0507
            Kurtosis |       0.27      1    0.6067
---------------------+-----------------------------
               Total |      26.40     13    0.0150
---------------------------------------------------

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of api00

         chi2(1)      =     8.75
         Prob > chi2  =   0.0031
So in this case, the evidence is against the null hypothesis that the variance is homogeneous. These tests are very sensitive to model assumptions, such as the assumption of normality. Therefore, it is common practice to combine the tests with diagnostic plots before judging the severity of the heteroscedasticity and deciding whether any correction is needed. In our case, the residual plot (to be added) does not show strong evidence of heteroscedasticity, so we will not go into the details of how to correct for it, even though methods are available.
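The mechanics of the Breusch-Pagan test can be sketched directly: regress the squared residuals on the fitted values and use LM = n·R² as a χ²(1) statistic (this is the studentised, Koenker form of the test). A minimal Python sketch with invented data whose error spread grows with x:

```python
import math

def simple_ols(x, y):
    """Closed-form OLS of y on x with intercept; returns the fitted values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [a + b * xi for xi in x]

def r_squared(y, fitted):
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Invented data: the scatter around the trend widens as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.4, 3.2, 6.5, 4.1, 9.8, 5.9]

fitted = simple_ols(x, y)
resid_sq = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]

# Auxiliary regression of squared residuals on the fitted values.
aux_fitted = simple_ols(fitted, resid_sq)
lm = len(y) * r_squared(resid_sq, aux_fitted)

# p-value from chi2(1): P(chi2 > lm) = 2 * (1 - Phi(sqrt(lm)))
p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(math.sqrt(lm) / math.sqrt(2.0))))
print(round(lm, 2), round(p, 4))
```

A small p-value leads to rejecting constant variance, exactly as with estat hettest above.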
4.2.6 Test on distribution of the residuals
Normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values of the t-tests and the F-test are valid. Normality is not required to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be identically and independently distributed (i.i.d.). Furthermore, there is no assumption or requirement that the predictor variables be normally distributed; if there were, we would not be able to use dummy-coded variables in our models.
After running a regression analysis, we can use the predict command to create residuals and then use commands such as kdensity, qnorm and pnorm to check their normality. More information on qnorm and pnorm can be found in the Stata help or at http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm.

Let us use the elemapi2 data file for an example with the kdensity command. We predict academic performance (api00) from the percentage of pupils receiving free meals (meals), the percentage of English language learners (ell), and the percentage of teachers with emergency credentials (emer).
. use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2
. regress api00 meals ell emer

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  673.00
       Model |  6749782.75     3  2249927.58           Prob > F      =  0.0000
    Residual |  1323889.25   396  3343.15467           R-squared     =  0.8360
-------------+------------------------------           Adj R-squared =  0.8348
       Total |  8073672.00   399  20234.7669           Root MSE      =   57.82

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -3.159189   .1497371  -21.098   0.000    -3.453568   -2.864809
         ell |  -.9098732   .1846442   -4.928   0.000    -1.272878   -.5468678
        emer |  -1.573496    .293112   -5.368   0.000    -2.149746   -.9972456
       _cons |   886.7033    6.25976  141.651   0.000     874.3967    899.0098
------------------------------------------------------------------------------
We then use the predict command to generate residuals.
predict r, resid
Below we use the kdensity command to produce a kernel density plot, with the normal option requesting that a normal density be overlaid on the plot. kdensity stands for kernel density estimate; it can be thought of as a histogram with narrow bins and a moving average, see the Stata output below.
kdensity r, normal
There are also numerical tests of normality. One of these is the Shapiro-Wilk W test, which is performed by the swilk command:
. swilk r

                  Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Pr > z
-------------+------------------------------------------------
           r |    400    0.99641     0.989    -0.025    0.51006
The p-value is based on the assumption that the distribution is normal. In our example, it is very large (0.51), indicating that we cannot reject the hypothesis that r is normally distributed.
(Source: section 2.2 in http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm)
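The Shapiro-Wilk coefficients are tabulated and awkward to reproduce by hand, but a related moment-based normality check, the Jarque-Bera test (not shown in the Stata output above), is easy to sketch: it combines sample skewness and excess kurtosis into a χ²(2) statistic. A minimal Python sketch with a small invented residual vector:

```python
def jarque_bera(e):
    """Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(e)
    m = sum(e) / n
    m2 = sum((x - m) ** 2 for x in e) / n
    m3 = sum((x - m) ** 3 for x in e) / n
    m4 = sum((x - m) ** 4 for x in e) / n
    skew = m3 / m2 ** 1.5          # sample skewness S
    kurt = m4 / m2 ** 2            # sample kurtosis K
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Small invented, symmetric residual vector: its skewness is exactly 0,
# so only the kurtosis term contributes.
resid = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb = jarque_bera(resid)
print(round(jb, 4))  # → 0.3521
```

Large JB values (compared with χ²(2) critical values) indicate departure from normality; here the tiny symmetric sample produces a very small statistic.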
4.3 Tests on coefficients
4.3.1 Test on linear restrictions of coefficients
Testing a single coefficient
Suppose one of the variables in our regression, VAR1, has coefficient β1. The hypothesis we would like to test is whether β1 is equal to 0. If this hypothesis is rejected, the coefficient β1 of VAR1 is significantly different from 0. Stata uses a Wald test for this hypothesis, see Cameron and Trivedi (2009, p. 406). To test H0: β1 = 0, we have

. * Testing a single coefficient equal to 0
. test VAR1

 ( 1)  VAR1 = 0

           chi2(  1) =   70.80
         Prob > chi2 =    0.000
The null hypothesis is rejected if the probability is smaller than 0.05; in that case, the coefficient of variable VAR1 is significant. If the null hypothesis is not rejected, one can consider excluding the variable from the regression equation.
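For a single coefficient the Wald statistic is simply the squared t-ratio, W = (b/se(b))², compared with a χ²(1) critical value. A minimal Python sketch for a simple regression with invented data, using the classical homoscedastic-error formulas:

```python
import math

# Invented data for a simple regression y = a + b*x + e.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual variance and standard error of the slope.
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_b = math.sqrt(s2 / sxx)

# Wald statistic for H0: b = 0; reject at 5% if W > 3.84 (chi2(1) critical value).
W = (b / se_b) ** 2
print(round(W, 2))  # → 4.5
```

Here W = 4.5 exceeds the 5% critical value of 3.84, so the slope would be judged significant.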
Testing multiple coefficients
Suppose we have the variables VAR1 to VAR3 in our regression, with coefficients β1 to β3. The hypothesis we would like to test is whether β1 is equal to 0 and whether the sum of the coefficients of VAR2 and VAR3 is equal to 1. If this joint hypothesis is rejected, then β1 is significantly different from 0 and/or the sum of the coefficients of VAR2 and VAR3 differs from 1. Stata uses a Wald test for this hypothesis, see Cameron and Trivedi (2009, p. 406). To test H0: β1 = 0 and β2 + β3 = 1, we have

. * Testing two hypotheses jointly
. xtreg y VAR1 VAR2 VAR3 VAR4, fe
. test (VAR1) (VAR2 + VAR3 = 1)

 ( 1)  VAR1 = 0
 ( 2)  VAR2 + VAR3 = 1

           chi2(  2) =  122.29
         Prob > chi2 =    0.000
If the mtest option is added to the test command, each hypothesis is also tested in isolation:

. test (VAR1) (VAR2 + VAR3 = 1), mtest
More information on testing linear restrictions can be found in Cameron and Trivedi (2009, pp. 403-409) or in the Stata help for test.
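The joint test has the familiar quadratic form W = (Rb - r)' [R V R']⁻¹ (Rb - r), distributed as χ²(q) under H0. A minimal Python sketch with an invented coefficient vector and (diagonal) covariance matrix, testing β1 = 0 and β2 + β3 = 1:

```python
# Invented estimates and covariance matrix (diagonal for simplicity).
beta = [0.5, 0.4, 0.7]
V = [[0.04, 0.0, 0.0],
     [0.0, 0.02, 0.0],
     [0.0, 0.0, 0.02]]

# Restrictions R*beta = r for H0: beta1 = 0 and beta2 + beta3 = 1.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
r = [0.0, 1.0]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

d = [sum(R[i][j] * beta[j] for j in range(3)) - r[i] for i in range(2)]  # R*beta - r
S = matmul(matmul(R, V), transpose(R))                                   # R V R'

# Invert the 2x2 matrix S and form the quadratic form d' S^-1 d.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det, S[0][0] / det]]
W = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
print(round(W, 2))  # → 6.5
```

With q = 2 restrictions, W = 6.5 exceeds the 5% χ²(2) critical value of 5.99, so the joint null would be rejected for these invented numbers.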
4.3.2 Test on structural change

A likelihood-ratio (LR) test compares two models, a restricted one and an unrestricted one (this is not the same as imposing linear restrictions).
4.3.3 Tests on linearity in variables

This is more a procedure of trial and error than a straightforward test. In linear regression models, explanatory variables enter either as continuous variables or as dummy variables, and continuous variables enter the regression equation linearly. Alternatively, one can add quadratic or third-order polynomials of continuous variables to the linear regression equation.

. * Comparing linear and quadratic specifications of VAR1
. xtreg y VAR1 VAR2 VAR3 VAR4, fe
. estimates store Regr1
. generate VAR1sq = VAR1*VAR1
. xtreg y VAR1 VAR1sq VAR2 VAR3 VAR4, fe
. estimates store Regr2
. lrtest Regr1 Regr2, force
To test whether higher-order polynomials add explanatory power to the estimation, one can perform a likelihood-ratio test (Cameron and Trivedi, 2010, p. 416). In addition, two aspects have to be checked. First, is the coefficient of VAR1sq significant? Second, if the LR test rejects the restricted model, then VAR1sq adds significant explanatory power to the regression and has to be retained.
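For OLS-type models the LR statistic has a convenient closed form, LR = n·ln(SSR_restricted / SSR_unrestricted). A minimal Python sketch with invented data, comparing an intercept-only model against a simple regression (so the single restriction is "slope = 0"):

```python
import math

# Invented data; restricted model: intercept only, unrestricted: y = a + b*x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Restricted SSR: squared deviations from the mean (intercept-only fit).
ssr_r = sum((yi - my) ** 2 for yi in y)

# Unrestricted SSR: simple OLS fit of y on x.
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx
ssr_u = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# LR statistic, compared with chi2(1): one restriction (slope = 0).
LR = n * math.log(ssr_r / ssr_u)
print(round(LR, 4))  # → 4.5815
```

Here LR = 4.58 exceeds the 5% χ²(1) critical value of 3.84, so the restricted (intercept-only) model would be rejected in favour of the richer one, mirroring the lrtest logic above.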
One should keep in mind that VAR1sq might be strongly correlated with VAR1. If this is the case, the estimates of the specification including VAR1sq might suffer from multicollinearity (see section 4.2.4).
4.4 Goodness of fit tests of the model

The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between the observed values and the values expected under the model in question. The most commonly used measure of goodness of fit is the R-squared statistic. Goodness of fit measures can also be used in statistical hypothesis testing, e.g. to test for normality of residuals, or to test whether two samples are drawn from identical distributions (e.g. with the Kolmogorov-Smirnov test).
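The one-sample Kolmogorov-Smirnov statistic mentioned above is simply the largest gap between the empirical CDF and the hypothesised CDF. A minimal Python sketch, testing an invented sample against the uniform distribution on [0, 1]:

```python
def ks_statistic(sample, cdf):
    """One-sample KS statistic: max distance between empirical and hypothesised CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - cdf(x), cdf(x) - (i - 1) / n)
    return d

# Invented sample, tested against the uniform CDF F(x) = x on [0, 1].
sample = [0.1, 0.4, 0.5, 0.9]
D = ks_statistic(sample, lambda x: x)
print(D)  # → 0.25
```

The statistic D is then compared with a critical value (depending on n) to decide whether the hypothesised distribution is rejected.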
5 Concluding remarks

In this document the general econometric tests to assess the impact of RDPs are presented. In WP4 we will focus on (modelling) a selection of indicators for Rural Development Measures that differ with respect to impact, and provide an overview of the relevant aspects of RDPs. We analyse which relations between Rural Development Indicators (and other available data) are affected by spatial interactions and thus have to be tested using spatial econometrics. In the forthcoming document 4.2 the database is analysed for spatial patterns. Exploratory Spatial Data Analysis (ESDA) is used to assess the spatial distribution of the relevant data at the relevant scale level (NUTS0-NUTS2-NUTS3). To apply ESDA, the weight matrix has to be adjusted to each relevant scale level.

Thereafter the model is specified and estimated at NUTS0 level, EU-wide, with a focus on the variation between the member states. The difference in impact of RD Measures is explained at member state level. Then the model for the in-depth case studies (NUTS2 and NUTS3 level) is specified in a generic form for WP5; the necessary information is provided by the case studies. Finally, we report on the general methodology with recommendations for use in the EU.
References

Anselin, L. (2005), Exploring Spatial Data with GeoDa: A Workbook, http://geodacenter.asu.edu/system/files/geodaworkbook.pdf
Anselin, L., J. Le Gallo and H. Jayet (2008), Spatial Panel Econometrics. In: L. Mátyás and P. Sevestre (eds.), The Econometrics of Panel Data. Springer Verlag, Heidelberg.
Cameron, A. Colin and Pravin K. Trivedi (2009), Microeconometrics Using Stata. Stata Press, College Station.
Getis, A. (2009), "Spatial Weight Matrices". Geographical Analysis 41, pp. 404-410.
Harris, R., J. Moffat and V. Kravtsova (2011), In Search of 'W'. Spatial Economic Analysis 6(3), pp. 249-270.
Uthes, S., T. Kuhlman, S. Reinhard, P. Nowicki, M. J. Smit, E. van Leeuwen, A. L. Silburn, I. Zasada and A. Piorr (2011), Report on Analytical Framework: Conceptual Model, Data Sources, and Implications for Spatial Econometric Modeling. ZALF, Müncheberg.