Solutions Manual and Supplementary Materials for Econometric Analysis of Cross Section and Panel Data
This file and several accompanying files contain the solutions to the odd-
numbered problems in the book Econometric Analysis of Cross Section and Panel
Data, by Jeffrey M. Wooldridge, MIT Press, 2002. The empirical examples are
solved using various versions of Stata, with some dating back to Stata 4.0.
Partly out of laziness, but also because it is useful for students to see
computer output, I have included Stata output in most cases rather than type
tables. In some cases, I do more hand calculations than are needed in current
versions of Stata.
Currently, there are some missing solutions. I will update the solutions
occasionally to fill in the missing solutions, and to make corrections. For
some problems I have given answers beyond what I originally asked. Please
report any mistakes or discrepancies you might come across by sending me e-mail at [email protected].
CHAPTER 2

2.1. a. $\partial E(y|x_1,x_2)/\partial x_1 = \beta_1 + \beta_4 x_2$ and $\partial E(y|x_1,x_2)/\partial x_2 = \beta_2 + 2\beta_3 x_2 + \beta_4 x_1$.

b. By definition, $E(u|x_1,x_2) = 0$. Because $x_2^2$ and $x_1x_2$ are just functions of $(x_1,x_2)$, it does not matter whether we also condition on them: $E(u|x_1,x_2,x_2^2,x_1x_2) = 0$.

c. All we can say about $Var(u|x_1,x_2)$ is that it is nonnegative for all $x_1$ and $x_2$: $E(u|x_1,x_2) = 0$ in no way restricts $Var(u|x_1,x_2)$.
2.3. a. $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + u$, where u has a zero mean given $x_1$ and $x_2$: $E(u|x_1,x_2) = 0$. We can say nothing further about u.

b. $\partial E(y|x_1,x_2)/\partial x_1 = \beta_1 + \beta_3 x_2$. Because $E(x_2) = 0$, $\beta_1 = E[\partial E(y|x_1,x_2)/\partial x_1]$, the average partial effect of $x_1$.
Even though there are 935 men in the sample, only 722 are used for the
estimation, because data are missing on meduc and feduc. What we could do is
define binary indicators for whether the corresponding variable is missing,
set the missing values to zero, and then use the binary indicators as
instruments along with meduc, feduc, and sibs. This would allow us to use all
935 observations.
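A minimal Stata sketch of the missing-indicator device (the data set and all variable names besides meduc, feduc, and sibs are assumptions based on the surrounding discussion; remaining covariates are omitted for brevity):

. gen meducmiss = missing(meduc)
. gen feducmiss = missing(feduc)
. replace meduc = 0 if meducmiss
. replace feduc = 0 if feducmiss
. * the indicators join the instrument list, so all 935 observations are kept
. ivreg lwage exper tenure educ (iq = meduc feduc sibs meducmiss feducmiss)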
The return to education is estimated to be small and insignificant whether IQ or KWW is used as the indicator. This could be because the family background variables do not satisfy the appropriate redundancy condition, or because they are correlated with $a_1$. (In both first-stage regressions, the F statistics for joint significance of meduc, feduc, and sibs have p-values below .002, so it seems the family background variables are sufficiently partially correlated with the ability indicators.)
5.9. Define $\theta_4 = \beta_4 - \beta_3$, so that $\beta_4 = \beta_3 + \theta_4$. Plugging this expression into the equation and rearranging gives

$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3(twoyr + fouryr) + \theta_4 fouryr + u$
$= \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3 totcoll + \theta_4 fouryr + u,$

where $totcoll = twoyr + fouryr$. Now, just estimate the latter equation by 2SLS using exper, $exper^2$, dist2yr and dist4yr as the full set of instruments. We can use the t statistic on $\hat{\theta}_4$ to test $H_0\colon \theta_4 = 0$ against $H_1\colon \theta_4 > 0$.
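A Stata sketch of this procedure (the data set and the availability of the named variables are assumptions):

. gen totcoll = twoyr + fouryr
. ivreg lwage exper expersq (totcoll fouryr = dist2yr dist4yr)
. * the reported t statistic on fouryr tests H0: theta4 = 0; use a one-tailed
. * critical value for the alternative H1: theta4 > 0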
5.11. Following the hint, let $y_2^0$ be the linear projection of $y_2$ on $z_2$, let $a_2$ be the projection error, and assume that $\lambda_2$ is known. (The results on generated regressors in Section 6.1.1 show that the argument carries over to the case when $\lambda_2$ is estimated.) Plugging in $y_2 = y_2^0 + a_2$ gives

$y_1 = z_1\delta_1 + \alpha_1 y_2^0 + \alpha_1 a_2 + u_1.$

Effectively, we regress $y_1$ on $z_1, y_2^0$. The key consistency condition is that each explanatory variable is orthogonal to the composite error, $\alpha_1 a_2 + u_1$. By assumption, $E(z'u_1) = 0$. Further, $E(y_2^{0\prime}a_2) = 0$ by construction. The problem is that $E(z_1'a_2) \neq 0$ necessarily, because $z_1$ was not included in the linear projection for $y_2$. Therefore, OLS will be inconsistent for all parameters in general. Contrast this with 2SLS when $y_2^*$ is the projection on $z_1$ and $z_2$: $y_2 = y_2^* + r_2 = z\pi_2 + r_2$, where $E(z'r_2) = 0$. The second step regression (assuming that $\pi_2$ is known) is essentially

$y_1 = z_1\delta_1 + \alpha_1 y_2^* + \alpha_1 r_2 + u_1.$

Now, $r_2$ is uncorrelated with z, and so $E(z_1'r_2) = 0$ and $E(y_2^*r_2) = 0$. The lesson is that one must be very careful if manually carrying out 2SLS by explicitly doing the first- and second-stage regressions.
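To illustrate the lesson, here is a hedged Stata sketch contrasting the two procedures (y1, y2, z1, z2 are generic placeholder names):

. * first stage omits z1: the second-step OLS estimates are inconsistent
. reg y2 z2
. predict y2hat0, xb
. reg y1 z1 y2hat0
. * correct first stage includes all exogenous variables
. reg y2 z1 z2
. predict y2hat, xb
. * consistent point estimates, but the OLS standard errors are invalid
. reg y1 z1 y2hat
. * letting ivreg do both stages also gets the standard errors right
. ivreg y1 z1 (y2 = z2)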
5.13. a. In a simple regression model with a single IV, the IV estimate of the slope can be written as

$\hat{\beta}_1 = \left[\sum_{i=1}^N (z_i - \bar{z})(y_i - \bar{y})\right] / \left[\sum_{i=1}^N (z_i - \bar{z})(x_i - \bar{x})\right] = \left[\sum_{i=1}^N z_i(y_i - \bar{y})\right] / \left[\sum_{i=1}^N z_i(x_i - \bar{x})\right].$

Now the numerator can be written as

$\sum_{i=1}^N z_i(y_i - \bar{y}) = \sum_{i=1}^N z_iy_i - \left(\sum_{i=1}^N z_i\right)\bar{y} = N_1\bar{y}_1 - N_1\bar{y} = N_1(\bar{y}_1 - \bar{y}),$

where $N_1 = \sum_{i=1}^N z_i$ is the number of observations in the sample with $z_i = 1$ and $\bar{y}_1$ is the average of the $y_i$ over the observations with $z_i = 1$. Next, write $\bar{y}$ as a weighted average: $\bar{y} = (N_0/N)\bar{y}_0 + (N_1/N)\bar{y}_1$, where the notation should be clear. Straightforward algebra shows that $\bar{y}_1 - \bar{y} = [(N - N_1)/N]\bar{y}_1 - (N_0/N)\bar{y}_0 = (N_0/N)(\bar{y}_1 - \bar{y}_0)$. So the numerator of the IV estimate is $(N_0N_1/N)(\bar{y}_1 - \bar{y}_0)$. The same argument shows that the denominator is $(N_0N_1/N)(\bar{x}_1 - \bar{x}_0)$. Taking the ratio proves the result.

b. If x is also binary -- representing some "treatment" -- $\bar{x}_1$ is the fraction of observations receiving treatment when $z_i = 1$ and $\bar{x}_0$ is the fraction receiving treatment when $z_i = 0$. So, suppose $x_i = 1$ if person i participates in a job training program, and let $z_i = 1$ if person i is eligible for participation in the program. Then $\bar{x}_1$ is the fraction of people participating in the program out of those made eligible, and $\bar{x}_0$ is the fraction of people participating who are not eligible. (When eligibility is necessary for participation, $\bar{x}_0 = 0$.) Generally, $\bar{x}_1 - \bar{x}_0$ is the difference in participation rates when z = 1 and z = 0. So the difference in the mean response between the z = 1 and z = 0 groups gets divided by the difference in participation rates across the two groups.
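A quick way to see the result numerically in Stata (variable names are placeholders):

. * the coefficient on z equals ybar1 - ybar0
. reg y z
. * the coefficient on z equals xbar1 - xbar0
. reg x z
. * the IV slope is the ratio of the two coefficients above
. ivreg y (x = z)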
5.15. In $L(x|z) = z\Pi$, we can write

$\Pi = \begin{pmatrix} \Pi_{11} & 0 \\ \Pi_{12} & I_{K_2} \end{pmatrix},$

where $I_{K_2}$ is the $K_2 \times K_2$ identity matrix, 0 is the $L_1 \times K_2$ zero matrix, $\Pi_{11}$ is $L_1 \times K_1$, and $\Pi_{12}$ is $K_2 \times K_1$. As in Problem 5.12, the rank condition holds if and only if rank($\Pi$) = K.

a. If for some $x_j$, the vector $z_1$ does not appear in $L(x_j|z)$, then $\Pi_{11}$ has a column which is entirely zeros. But then that column of $\Pi$ can be written as a linear combination of the last $K_2$ columns of $\Pi$, which means rank($\Pi$) < K. Therefore, a necessary condition for the rank condition is that no columns of $\Pi_{11}$ be exactly zero, which means that at least one $z_h$ must appear in the reduced form of each $x_j$, $j = 1,\ldots,K_1$.

b. Suppose $K_1 = 2$ and $L_1 = 2$, where $z_1$ appears in the reduced form for both $x_1$ and $x_2$, but $z_2$ appears in neither reduced form. Then the $2 \times 2$ matrix $\Pi_{11}$ has zeros in its second row, which means that the second row of $\Pi$ is all zeros. It cannot have rank K, in that case. Intuitively, while we began with two instruments, only one of them turned out to be partially correlated with $x_1$ and $x_2$.

c. Without loss of generality, we assume that $z_j$ appears in the reduced form for $x_j$; we can simply reorder the elements of $z_1$ to ensure this is the case. Then $\Pi_{11}$ is a $K_1 \times K_1$ diagonal matrix with nonzero diagonal elements. Looking at the partitioned form of $\Pi$, we see that if $\Pi_{11}$ is diagonal with all nonzero diagonals then $\Pi$ is lower triangular with all nonzero diagonal elements. Therefore, rank($\Pi$) = K.
CHAPTER 6
6.1. a. Here is abbreviated Stata output for testing the null hypothesis that
educ is exogenous:
. qui reg educ nearc4 nearc2 exper expersq black south smsa reg661-reg668 smsa66
. predict v2hat, resid
. reg lwage educ exper expersq black south smsa reg661-reg668 smsa66 v2hat
(output omitted)

To test the overidentifying restriction, we regress the 2SLS residuals on all of the exogenous variables. The test statistic is the sample size times the R-squared from this regression:
. di 3010*.0004
1.204
. di chiprob(1,1.2)
.27332168
The p-value, obtained from a $\chi^2_1$ distribution, is about .273, so the instruments pass the overidentification test.
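For reference, a sketch of the full N-R-squared overidentification test (the commands follow the output above; uhat is a hypothetical name for the 2SLS residuals):

. ivreg lwage exper expersq black south smsa reg661-reg668 smsa66 (educ = nearc4 nearc2)
. predict uhat, resid
. reg uhat nearc4 nearc2 exper expersq black south smsa reg661-reg668 smsa66
. di e(N)*e(r2)
. di chiprob(1, e(N)*e(r2))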
6.3. a. We need prices to satisfy two requirements. First, calories and
protein must be partially correlated with prices of food. While this is easy
to test for each by estimating the two reduced forms, the rank condition could
still be violated (although see Problem 15.5c). In addition, we must also
assume prices are exogenous in the productivity equation. Ideally, prices vary
because of things like transportation costs that are not systematically related
to regional variations in individual productivity. A potential problem is that
prices reflect food quality and that features of the food other than calories
and protein appear in the disturbance $u_1$.
b. Since there are two endogenous explanatory variables we need at least
two prices.
c. We would first estimate the two reduced forms for calories and protein by regressing each on a constant, exper, $exper^2$, educ, and the M prices, $p_1, \ldots, p_M$. We obtain the residuals, $\hat{v}_{21}$ and $\hat{v}_{22}$. Then we would run the regression log(produc) on 1, exper, $exper^2$, educ, calories, protein, $\hat{v}_{21}$, $\hat{v}_{22}$ and do a joint significance test on $\hat{v}_{21}$ and $\hat{v}_{22}$. We could use a standard F test or use a heteroskedasticity-robust test.
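A Stata sketch of the test in part c, assuming the price variables share the prefix p (all variable names here are assumptions):

. * reduced forms for the two endogenous variables
. reg calories exper expersq educ p*
. predict v21hat, resid
. reg protein exper expersq educ p*
. predict v22hat, resid
. * add both residuals to the structural equation and test them jointly
. reg lproduc calories protein exper expersq educ v21hat v22hat
. test v21hat v22hat
. * heteroskedasticity-robust version
. reg lproduc calories protein exper expersq educ v21hat v22hat, robust
. test v21hat v22hat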
6.5. a. For simplicity, absorb the intercept in x, so $y = x\beta + u$, $E(u|x) = 0$, $Var(u|x) = \sigma^2$. In these tests, $\hat{\sigma}^2$ is implicitly SSR/N -- there is no degrees of freedom adjustment. (In any case, the df adjustment makes no difference asymptotically.) So $\hat{u}_i^2 - \hat{\sigma}^2$ has a zero sample average, which means that

$N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'(\hat{u}_i^2 - \hat{\sigma}^2) = N^{-1/2}\sum_{i=1}^N h_i'(\hat{u}_i^2 - \hat{\sigma}^2).$

Next, $N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)' = O_p(1)$ by the central limit theorem and $\hat{\sigma}^2 - \sigma^2 = o_p(1)$. So $N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'(\hat{\sigma}^2 - \sigma^2) = O_p(1)\cdot o_p(1) = o_p(1)$. Therefore, so far we have

$N^{-1/2}\sum_{i=1}^N h_i'(\hat{u}_i^2 - \hat{\sigma}^2) = N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'(\hat{u}_i^2 - \sigma^2) + o_p(1).$

We are done with this part if we show

$N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'\hat{u}_i^2 = N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'u_i^2 + o_p(1).$

Now, as in Problem 4.4, we can write $\hat{u}_i^2 = u_i^2 - 2u_ix_i(\hat{\beta} - \beta) + [x_i(\hat{\beta} - \beta)]^2$, so

$N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'\hat{u}_i^2 = N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'u_i^2 - 2\left[N^{-1/2}\sum_{i=1}^N u_i(h_i - \mu_h)'x_i(\hat{\beta} - \beta)\right] + \left[N^{-1/2}\sum_{i=1}^N (h_i - \mu_h)'(x_i \otimes x_i)\right]\{vec[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']\}, \quad (6.40)$

where the expression for the third term follows from $[x_i(\hat{\beta} - \beta)]^2 = x_i(\hat{\beta} - \beta)(\hat{\beta} - \beta)'x_i' = (x_i \otimes x_i)vec[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']$. Dropping the "-2", the second term can be written as $\left[N^{-1}\sum_{i=1}^N u_i(h_i - \mu_h)'x_i\right]\sqrt{N}(\hat{\beta} - \beta) = o_p(1)\cdot O_p(1)$ because $\sqrt{N}(\hat{\beta} - \beta) = O_p(1)$ and, under $E(u_i|x_i) = 0$, $E[u_i(h_i - \mu_h)'x_i] = 0$; the law of large numbers implies that the sample average is $o_p(1)$. The third term can be written as $N^{-1/2}\left[N^{-1}\sum_{i=1}^N (h_i - \mu_h)'(x_i \otimes x_i)\right]\{vec[\sqrt{N}(\hat{\beta} - \beta)\sqrt{N}(\hat{\beta} - \beta)']\} = N^{-1/2}\cdot O_p(1)\cdot O_p(1)$, where we again use the fact that sample averages are $O_p(1)$ by the law of large numbers and $vec[\sqrt{N}(\hat{\beta} - \beta)\sqrt{N}(\hat{\beta} - \beta)'] = O_p(1)$. We have shown that the last two terms in (6.40) are $o_p(1)$, which proves part (a).

b. By part (a), the asymptotic variance of $N^{-1/2}\sum_{i=1}^N h_i'(\hat{u}_i^2 - \hat{\sigma}^2)$ is $Var[(h_i - \mu_h)'(u_i^2 - \sigma^2)] = E[(u_i^2 - \sigma^2)^2(h_i - \mu_h)'(h_i - \mu_h)]$. Now $(u_i^2 - \sigma^2)^2 = u_i^4 - 2u_i^2\sigma^2 + \sigma^4$. Under the null, $E(u_i^2|x_i) = Var(u_i|x_i) = \sigma^2$ [since $E(u_i|x_i) = 0$ is assumed] and therefore, when we add (6.27), $E[(u_i^2 - \sigma^2)^2|x_i] = \kappa_2 - \sigma^4 \equiv \eta^2$. A standard iterated expectations argument gives

$E[(u_i^2 - \sigma^2)^2(h_i - \mu_h)'(h_i - \mu_h)] = E\{E[(u_i^2 - \sigma^2)^2|x_i](h_i - \mu_h)'(h_i - \mu_h)\}$ [since $h_i = h(x_i)$] $= \eta^2E[(h_i - \mu_h)'(h_i - \mu_h)].$

This is what we wanted to show. (Whether we do the argument for a random draw i or for random variables representing the population is a matter of taste.)

c. From part (b) and Lemma 3.8, the following statistic has an asymptotic $\chi^2_Q$ distribution:

$\left[N^{-1/2}\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)h_i\right]\{\eta^2E[(h_i - \mu_h)'(h_i - \mu_h)]\}^{-1}\left[N^{-1/2}\sum_{i=1}^N h_i'(\hat{u}_i^2 - \hat{\sigma}^2)\right].$

Using again the fact that $\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2) = 0$, we can replace $h_i$ with $h_i - \bar{h}$ in the two vectors forming the quadratic form. Then, again by Lemma 3.8, we can replace the matrix in the quadratic form with a consistent estimator, which is

$\hat{\eta}^2\left[N^{-1}\sum_{i=1}^N (h_i - \bar{h})'(h_i - \bar{h})\right],$

where $\hat{\eta}^2 = N^{-1}\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)^2$. The computable statistic, after simple algebra, can be written as

$\left[\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)(h_i - \bar{h})\right]\left[\sum_{i=1}^N (h_i - \bar{h})'(h_i - \bar{h})\right]^{-1}\left[\sum_{i=1}^N (h_i - \bar{h})'(\hat{u}_i^2 - \hat{\sigma}^2)\right]/\hat{\eta}^2.$

Now $\hat{\eta}^2$ is just the total sum of squares in the $\hat{u}_i^2$, divided by N. The numerator of the statistic is simply the explained sum of squares from the regression $\hat{u}_i^2$ on 1, $h_i$, $i = 1,\ldots,N$. Therefore, the test statistic is N times the usual (centered) R-squared from the regression $\hat{u}_i^2$ on 1, $h_i$, $i = 1,\ldots,N$, or $NR_c^2$.

d. Without assumption (6.37) we need to estimate $E[(u_i^2 - \sigma^2)^2(h_i - \mu_h)'(h_i - \mu_h)]$ generally. Hopefully, the approach is by now pretty clear. We replace the population expected value with the sample average and replace any unknown parameters -- $\beta$, $\sigma^2$, and $\mu_h$ in this case -- with their consistent estimators (under $H_0$). So a generally consistent estimator of $\text{Avar}\left[N^{-1/2}\sum_{i=1}^N h_i'(\hat{u}_i^2 - \hat{\sigma}^2)\right]$ is

$N^{-1}\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)^2(h_i - \bar{h})'(h_i - \bar{h}),$

and the test statistic robust to heterokurtosis can be written as

$\left[\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)(h_i - \bar{h})\right]\left[\sum_{i=1}^N (\hat{u}_i^2 - \hat{\sigma}^2)^2(h_i - \bar{h})'(h_i - \bar{h})\right]^{-1}\left[\sum_{i=1}^N (h_i - \bar{h})'(\hat{u}_i^2 - \hat{\sigma}^2)\right],$

which is easily seen to be the explained sum of squares from the regression of 1 on $(\hat{u}_i^2 - \hat{\sigma}^2)(h_i - \bar{h})$, $i = 1,\ldots,N$ (without an intercept). Since the total sum of squares, without demeaning, is $N = (1 + 1 + \cdots + 1)$ (N times), the statistic is equivalent to $N - SSR_0$, where $SSR_0$ is the sum of squared residuals from this regression.
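A sketch of both statistics in Stata, with h1 and h2 standing in for the elements of $h_i$ (all names here are assumptions; uhat holds the OLS residuals):

. gen uhatsq = uhat^2
. * part c: nonrobust form, N times the centered R-squared
. reg uhatsq h1 h2
. di e(N)*e(r2)
. * part d: heterokurtosis-robust form, N - SSR0
. sum uhatsq
. gen udev = uhatsq - r(mean)
. sum h1
. gen g1 = udev*(h1 - r(mean))
. sum h2
. gen g2 = udev*(h2 - r(mean))
. gen one = 1
. reg one g1 g2, noconstant
. di e(N) - e(rss)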
So the growth in nominal wages for a man with educ = 12 is about .339, or
33.9%. [We could use the more accurate estimate, obtained from exp(.339) - 1.]
The 95% confidence interval goes from about 27.3 to 40.6.
CHAPTER 7
7.1. Write (with probability approaching one)

$\hat{\beta} = \beta + \left(N^{-1}\sum_{i=1}^N X_i'X_i\right)^{-1}\left(N^{-1}\sum_{i=1}^N X_i'u_i\right).$

From SOLS.2, the weak law of large numbers, and Slutsky's Theorem,

$\text{plim}\left(N^{-1}\sum_{i=1}^N X_i'X_i\right)^{-1} = A^{-1}.$

Further, under SOLS.1, the WLLN implies that $\text{plim } N^{-1}\sum_{i=1}^N X_i'u_i = 0$. Thus,

$\text{plim }\hat{\beta} = \beta + \text{plim}\left(N^{-1}\sum_{i=1}^N X_i'X_i\right)^{-1}\cdot\text{plim}\left(N^{-1}\sum_{i=1}^N X_i'u_i\right) = \beta + A^{-1}\cdot 0 = \beta.$
7.3. a. Since OLS equation-by-equation is the same as GLS when $\Omega$ is diagonal, it suffices to show that the GLS estimators for different equations are asymptotically uncorrelated. This follows if the asymptotic variance matrix is block diagonal (see Section 3.5), where the blocking is by the parameter vector for each equation. To establish block diagonality, we use the result from Theorem 7.4: under SGLS.1, SGLS.2, and SGLS.3,

$\text{Avar } \sqrt{N}(\hat{\beta} - \beta) = [E(X_i'\Omega^{-1}X_i)]^{-1}.$

Now, we can use the special form of $X_i$ for SUR (see Example 7.1), the fact that $\Omega^{-1}$ is diagonal, and SGLS.3. In the SUR model with diagonal $\Omega$, SGLS.3 implies that $E(u_{ig}^2x_{ig}'x_{ig}) = \sigma_g^2E(x_{ig}'x_{ig})$ for all $g = 1,\ldots,G$, and $E(u_{ig}u_{ih}x_{ig}'x_{ih}) = E(u_{ig}u_{ih})E(x_{ig}'x_{ih}) = 0$, all $g \neq h$. Therefore, we have

$E(X_i'\Omega^{-1}X_i) = \begin{pmatrix} \sigma_1^{-2}E(x_{i1}'x_{i1}) & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_G^{-2}E(x_{iG}'x_{iG}) \end{pmatrix}.$

When this matrix is inverted, it is also block diagonal. This shows that the asymptotic variance of $\sqrt{N}(\hat{\beta} - \beta)$ is block diagonal, which is what we wanted to show.

b. To test any linear hypothesis, we can either construct the Wald statistic or we can use the weighted sum of squared residuals form of the statistic as in (7.52) or (7.53). For the restricted SSR we must estimate the model with the restriction $\beta_1 = \beta_2$ imposed. See Problem 7.6 for one way to impose general linear restrictions.

c. When $\Omega$ is diagonal in a SUR system, system OLS and GLS are the same. Under SGLS.1 and SGLS.2, GLS and FGLS are asymptotically equivalent (regardless of the structure of $\Omega$) whether or not SGLS.3 holds. But, if $\hat{\beta}_{SOLS} = \hat{\beta}_{GLS}$ and $\sqrt{N}(\hat{\beta}_{GLS} - \hat{\beta}_{FGLS}) = o_p(1)$, then $\sqrt{N}(\hat{\beta}_{SOLS} - \hat{\beta}_{FGLS}) = o_p(1)$. Thus, when $\Omega$ is diagonal, OLS and FGLS are asymptotically equivalent, even if $\hat{\Omega}$ is estimated in an unrestricted fashion and even if the system homoskedasticity assumption SGLS.3 does not hold.
7.5. This is easy with the hint. Note that

$\left(\hat{\Omega}^{-1} \otimes \sum_{i=1}^N x_i'x_i\right)^{-1} = \hat{\Omega} \otimes \left(\sum_{i=1}^N x_i'x_i\right)^{-1}.$

Therefore,

$\hat{\beta} = \left[\hat{\Omega} \otimes \left(\sum_{i=1}^N x_i'x_i\right)^{-1}\right](\hat{\Omega}^{-1} \otimes I_K)\begin{pmatrix}\sum_{i=1}^N x_i'y_{i1}\\ \vdots\\ \sum_{i=1}^N x_i'y_{iG}\end{pmatrix} = \left[I_G \otimes \left(\sum_{i=1}^N x_i'x_i\right)^{-1}\right]\begin{pmatrix}\sum_{i=1}^N x_i'y_{i1}\\ \vdots\\ \sum_{i=1}^N x_i'y_{iG}\end{pmatrix}.$

Straightforward multiplication shows that the right hand side of the equation is just the vector of stacked $\hat{\beta}_g$, $g = 1,\ldots,G$, where $\hat{\beta}_g$ is the OLS estimator for equation g.
7.7. a. First, the diagonal elements of $\Omega$ are easily found since $E(u_{it}^2) = E[E(u_{it}^2|x_{it})] = \sigma_t^2$ by iterated expectations. Now, consider $E(u_{it}u_{is})$, and take $s < t$ without loss of generality. Under (7.80), $E(u_{it}|u_{is}) = 0$ since $u_{is}$ is a subset of the conditioning information in (7.80). Applying the law of iterated expectations (LIE) again we have $E(u_{it}u_{is}) = E[E(u_{it}u_{is}|u_{is})] = E[E(u_{it}|u_{is})u_{is}] = 0$.

b. The GLS estimator is

$\hat{\beta}^* \equiv \left(\sum_{i=1}^N X_i'\Omega^{-1}X_i\right)^{-1}\left(\sum_{i=1}^N X_i'\Omega^{-1}y_i\right) = \left(\sum_{i=1}^N\sum_{t=1}^T \sigma_t^{-2}x_{it}'x_{it}\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T \sigma_t^{-2}x_{it}'y_{it}\right).$

c. If, say, $y_{it} = \beta_0 + \beta_1y_{i,t-1} + u_{it}$, then $y_{it}$ is clearly correlated with $u_{it}$, which says that $x_{i,t+1} = y_{it}$ is correlated with $u_{it}$. Thus, SGLS.1 does not hold. Generally, SGLS.1 fails whenever there is feedback from $y_{it}$ to $x_{is}$, $s > t$. However, since $\Omega^{-1}$ is diagonal, $X_i'\Omega^{-1}u_i = \sum_{t=1}^T \sigma_t^{-2}x_{it}'u_{it}$, and so

$E(X_i'\Omega^{-1}u_i) = \sum_{t=1}^T \sigma_t^{-2}E(x_{it}'u_{it}) = 0$

since $E(x_{it}'u_{it}) = 0$ under (7.80). Thus, GLS is consistent in this case without SGLS.1.

d. First, since $\Omega^{-1}$ is diagonal, $X_i'\Omega^{-1} = (\sigma_1^{-2}x_{i1}', \sigma_2^{-2}x_{i2}', \ldots, \sigma_T^{-2}x_{iT}')$, and so

$E(X_i'\Omega^{-1}u_iu_i'\Omega^{-1}X_i) = \sum_{t=1}^T\sum_{s=1}^T \sigma_t^{-2}\sigma_s^{-2}E(u_{it}u_{is}x_{it}'x_{is}).$

First consider the terms for $s \neq t$. Under (7.80), if $s < t$, $E(u_{it}|x_{it},u_{is},x_{is}) = 0$, and so by the LIE, $E(u_{it}u_{is}x_{it}'x_{is}) = 0$, $t \neq s$. Next, for each t,

$E(u_{it}^2x_{it}'x_{it}) = E[E(u_{it}^2x_{it}'x_{it}|x_{it})] = E[E(u_{it}^2|x_{it})x_{it}'x_{it}] = E[\sigma_t^2x_{it}'x_{it}] = \sigma_t^2E(x_{it}'x_{it}),\quad t = 1,2,\ldots,T.$

It follows that

$E(X_i'\Omega^{-1}u_iu_i'\Omega^{-1}X_i) = \sum_{t=1}^T \sigma_t^{-2}E(x_{it}'x_{it}) = E(X_i'\Omega^{-1}X_i).$

e. First, run the pooled OLS regression across all i and t; let $\hat{u}_{it}$ denote the pooled OLS residuals. Then, for each t, define

$\hat{\sigma}_t^2 = N^{-1}\sum_{i=1}^N \hat{u}_{it}^2.$

(We might replace N with N - K as a degrees-of-freedom adjustment.) Then, by standard arguments, $\hat{\sigma}_t^2 \stackrel{p}{\to} \sigma_t^2$ as $N \to \infty$.

f. We have verified the assumptions under which standard FGLS statistics have nice properties (although we relaxed SGLS.1). In particular, standard errors obtained from (7.51) are asymptotically valid, and F statistics from (7.53) are valid. Now, if $\hat{\Omega}$ is taken to be the diagonal matrix with $\hat{\sigma}_t^2$ as the t-th diagonal, then the FGLS statistics are easily shown to be identical to the statistics obtained by performing pooled OLS on the equation

$(y_{it}/\hat{\sigma}_t) = (x_{it}/\hat{\sigma}_t)\beta + error_{it},\quad t = 1,2,\ldots,T,\ i = 1,\ldots,N.$

We can obtain valid standard errors, t statistics, and F statistics from this weighted least squares analysis. For F testing, note that the $\hat{\sigma}_t^2$ should be obtained from the pooled OLS residuals for the unrestricted model.

g. If $\sigma_t^2 = \sigma^2$ for all $t = 1,\ldots,T$, inference is very easy. FGLS reduces to pooled OLS. Thus, we can use the standard errors and test statistics reported by a standard OLS regression pooled across i and t.
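A sketch of parts e and f in Stata (y, x1, x2, and the time variable t are assumptions; the panel is balanced):

. reg y x1 x2
. predict uhat, resid
. gen uhatsq = uhat^2
. * period-specific variance estimates (part e)
. egen sigsqt = mean(uhatsq), by(t)
. * weighted least squares form of FGLS (part f); w is the transformed constant
. gen w = 1/sqrt(sigsqt)
. gen yw = y*w
. gen x1w = x1*w
. gen x2w = x2*w
. reg yw w x1w x2w, noconstant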
7.9. The Stata session follows. I first test for serial correlation before
computing the fully robust standard errors:
. reg lscrap d89 grant grant_1 lscrap_1 if year != 1987
. * Thus, we fail to reject the random effects assumptions even at very large
. * significance levels.
For comparison, the usual form of the Hausman test, which includes spring among the coefficients tested, gives p-value = .770, based on a $\chi^2_4$ distribution (using Stata 7.0). It would have been easy to make the regression-based test robust to any violation of RE.3: add ", robust cluster(id)" to the regression command.
10.9. a. The Stata output follows. The simplest way to compute a Hausman test is to just add the time averages of all explanatory variables, excluding the dummy variables, and estimate the equation by random effects. I should have done a better job of spelling this out in the text. In other words, write

$y_{it} = x_{it}\beta + \bar{w}_i\xi + r_{it},\quad t = 1,\ldots,T,$

where $x_{it}$ includes an overall intercept along with time dummies, as well as $w_{it}$, the covariates that change across i and t. We can estimate this equation by random effects and test $H_0\colon \xi = 0$. The actual calculation for this example is to be added.
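A sketch of the calculation in Stata (w1 and w2 stand in for the time-varying covariates, d2 and d3 for the time dummies, and id for the cross-section identifier -- all assumptions):

. egen w1bar = mean(w1), by(id)
. egen w2bar = mean(w2), by(id)
. xtreg y d2 d3 w1 w2 w1bar w2bar, re i(id)
. test w1bar w2bar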
Parts b, c, and d: To be added.
10.11. To be added.
10.13. The short answer is: Yes, we can justify this procedure with fixed T as $N \to \infty$. In particular, it produces a $\sqrt{N}$-consistent, asymptotically normal estimator of $\beta$. Therefore, "fixed effects weighted least squares," where the weights are known functions of exogenous variables (including $x_i$ and possibly other covariates that do not appear in the conditional mean), is another case where "estimating" the fixed effects leads to an estimator of $\beta$ with good properties. (As usual with fixed T, there is no sense in which we can estimate the $c_i$ consistently.) Verifying this claim takes much more work, but it is mostly just algebra.

First, in the sum of squared residuals, we can "concentrate" the $a_i$ out by finding $\hat{a}_i(b)$ as a function of $(x_i,y_i)$ and b, substituting back into the sum of squared residuals, and then minimizing with respect to b only. Straightforward algebra gives the first order conditions for each i as

$\sum_{t=1}^T (y_{it} - \hat{a}_i - x_{it}b)/h_{it} = 0,$

which gives

$\hat{a}_i(b) = w_i\left(\sum_{t=1}^T y_{it}/h_{it}\right) - w_i\left(\sum_{t=1}^T x_{it}/h_{it}\right)b \equiv \bar{y}_i^w - \bar{x}_i^wb,$

where $w_i \equiv 1/\left[\sum_{t=1}^T (1/h_{it})\right] > 0$ and $\bar{y}_i^w \equiv w_i\left(\sum_{t=1}^T y_{it}/h_{it}\right)$, and a similar definition holds for $\bar{x}_i^w$. Note that $\bar{y}_i^w$ and $\bar{x}_i^w$ are simply weighted averages. If $h_{it}$ equals the same constant for all t, $\bar{y}_i^w$ and $\bar{x}_i^w$ are the usual time averages.

Now we can plug each $\hat{a}_i(b)$ into the SSR to get the problem solved by $\hat{\beta}$:

$\min_{b \in \mathbb{R}^K} \sum_{i=1}^N\sum_{t=1}^T [(y_{it} - \bar{y}_i^w) - (x_{it} - \bar{x}_i^w)b]^2/h_{it}.$

But this is just a pooled weighted least squares regression of $(y_{it} - \bar{y}_i^w)$ on $(x_{it} - \bar{x}_i^w)$ with weights $1/h_{it}$. Equivalently, define $\tilde{y}_{it} \equiv (y_{it} - \bar{y}_i^w)/\sqrt{h_{it}}$ and $\tilde{x}_{it} \equiv (x_{it} - \bar{x}_i^w)/\sqrt{h_{it}}$, all $t = 1,\ldots,T$, $i = 1,\ldots,N$. Then $\hat{\beta}$ can be expressed in usual pooled OLS form:

$\hat{\beta} = \left(\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{x}_{it}\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{y}_{it}\right). \quad (10.82)$

Note carefully how the initial $y_{it}$ are weighted by $1/h_{it}$ to obtain $\bar{y}_i^w$, but where the usual $1/\sqrt{h_{it}}$ weighting shows up in the sum of squared residuals on the time-demeaned data (where the demeaning is a weighted average). Given (10.82), we can study the asymptotic ($N \to \infty$) properties of $\hat{\beta}$. First, it is easy to show that $\bar{y}_i^w = \bar{x}_i^w\beta + c_i + \bar{u}_i^w$, where $\bar{u}_i^w \equiv w_i\left(\sum_{t=1}^T u_{it}/h_{it}\right)$. Subtracting this equation from $y_{it} = x_{it}\beta + c_i + u_{it}$ for all t gives $\tilde{y}_{it} = \tilde{x}_{it}\beta + \tilde{u}_{it}$, where $\tilde{u}_{it} \equiv (u_{it} - \bar{u}_i^w)/\sqrt{h_{it}}$. When we plug this in for $\tilde{y}_{it}$ in (10.82) and divide by N in the appropriate places we get

$\hat{\beta} = \beta + \left(N^{-1}\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{x}_{it}\right)^{-1}\left(N^{-1}\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{u}_{it}\right).$

Straightforward algebra shows that $\sum_{t=1}^T \tilde{x}_{it}'\tilde{u}_{it} = \sum_{t=1}^T \tilde{x}_{it}'u_{it}/\sqrt{h_{it}}$, $i = 1,\ldots,N$, and so we have the convenient expression

$\hat{\beta} = \beta + \left(N^{-1}\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{x}_{it}\right)^{-1}\left(N^{-1}\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'u_{it}/\sqrt{h_{it}}\right). \quad (10.83)$

From (10.83) we can immediately read off the consistency of $\hat{\beta}$. Why? We assumed that $E(u_{it}|x_i,h_i,c_i) = 0$, which means $u_{it}$ is uncorrelated with any function of $(x_i,h_i)$, including $\tilde{x}_{it}$. So $E(\tilde{x}_{it}'u_{it}) = 0$, $t = 1,\ldots,T$. As long as we assume rank$\left[\sum_{t=1}^T E(\tilde{x}_{it}'\tilde{x}_{it})\right] = K$, we can use the usual proof to show plim$(\hat{\beta}) = \beta$. (We can even show that $E(\hat{\beta}|X,H) = \beta$.)

It is also clear from (10.83) that $\hat{\beta}$ is $\sqrt{N}$-asymptotically normal under mild assumptions. The asymptotic variance is generally

$\text{Avar } \sqrt{N}(\hat{\beta} - \beta) = A^{-1}BA^{-1},$

where

$A \equiv \sum_{t=1}^T E(\tilde{x}_{it}'\tilde{x}_{it}) \quad\text{and}\quad B \equiv \text{Var}\left(\sum_{t=1}^T \tilde{x}_{it}'u_{it}/\sqrt{h_{it}}\right).$

If we assume that $\text{Cov}(u_{it},u_{is}|x_i,h_i,c_i) = 0$, $t \neq s$, in addition to the variance assumption $\text{Var}(u_{it}|x_i,h_i,c_i) = \sigma_u^2h_{it}$, then it is easily shown that $B = \sigma_u^2A$, and so $\text{Avar } \sqrt{N}(\hat{\beta} - \beta) = \sigma_u^2A^{-1}$.

The same subtleties that arise in estimating $\sigma_u^2$ for the usual fixed effects estimator crop up here as well. Assume the zero conditional covariance assumption and correct variance specification in the previous paragraph. Then, note that the residuals from the pooled OLS regression

$\tilde{y}_{it}$ on $\tilde{x}_{it}$, $t = 1,\ldots,T$, $i = 1,\ldots,N$, (10.84)

say $\hat{r}_{it}$, are estimating $\tilde{u}_{it} = (u_{it} - \bar{u}_i^w)/\sqrt{h_{it}}$ (in the sense that we obtain $\hat{r}_{it}$ from $\tilde{u}_{it}$ by replacing $\beta$ with $\hat{\beta}$). Now

$E(\tilde{u}_{it}^2) = E[(u_{it}^2/h_{it})] - 2E[(u_{it}\bar{u}_i^w)/h_{it}] + E[(\bar{u}_i^w)^2/h_{it}] = \sigma_u^2 - 2\sigma_u^2E[(w_i/h_{it})] + \sigma_u^2E[(w_i/h_{it})],$

where the law of iterated expectations is applied several times, and $E[(\bar{u}_i^w)^2|x_i,h_i] = \sigma_u^2w_i$ has been used. Therefore, $E(\tilde{u}_{it}^2) = \sigma_u^2[1 - E(w_i/h_{it})]$, $t = 1,\ldots,T$, and so

$\sum_{t=1}^T E(\tilde{u}_{it}^2) = \sigma_u^2\left\{T - E\left[w_i\sum_{t=1}^T (1/h_{it})\right]\right\} = \sigma_u^2(T - 1).$

This contains the usual result for the within transformation as a special case. A consistent estimator of $\sigma_u^2$ is $SSR/[N(T - 1) - K]$, where SSR is the usual sum of squared residuals from (10.84), and the subtraction of K is optional. The estimator of $\text{Avar}(\hat{\beta})$ is then

$\hat{\sigma}_u^2\left(\sum_{i=1}^N\sum_{t=1}^T \tilde{x}_{it}'\tilde{x}_{it}\right)^{-1}.$

If we want to allow serial correlation in the $\{u_{it}\}$, or allow $\text{Var}(u_{it}|x_i,h_i,c_i) \neq \sigma_u^2h_{it}$, then we can just apply the robust formula for the pooled OLS regression (10.84).
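A sketch of the estimator in Stata for a single regressor (y, x, h, and id are assumptions; h holds the known weights $h_{it}$):

. gen invh = 1/h
. egen den = sum(invh), by(id)
. gen wi = 1/den
. egen ynum = sum(y*invh), by(id)
. egen xnum = sum(x*invh), by(id)
. * weighted within transformation, then divide by sqrt(h)
. gen ytil = (y - wi*ynum)/sqrt(h)
. gen xtil = (x - wi*xnum)/sqrt(h)
. reg ytil xtil, noconstant
. * note: a correct sigma_u^2 uses SSR/[N(T-1) - K], not the reported df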
CHAPTER 11
11.1. a. It is important to remember that, any time we put a variable in a
regression model (whether we are using cross section or panel data), we are
controlling for the effects of that variable on the dependent variable. The
whole point of regression analysis is that it allows the explanatory variables
to be correlated while estimating ceteris paribus effects. Thus, the
inclusion of $y_{i,t-1}$ in the equation allows $prog_{it}$ to be correlated with $y_{i,t-1}$, and also recognizes that, due to inertia, $y_{it}$ is often strongly related to $y_{i,t-1}$.
An assumption that implies pooled OLS is consistent is
$E(u_{it}|z_i,x_{it},y_{i,t-1},prog_{it}) = 0$, all t,

which is implied by but is weaker than dynamic completeness. Without additional assumptions, the pooled OLS standard errors and test statistics need to be adjusted for heteroskedasticity and serial correlation (although the latter will not be present under dynamic completeness).
b. As we discussed in Section 7.8.2, this statement is incorrect.
Provided our interest is in $E(y_{it}|z_i,x_{it},y_{i,t-1},prog_{it})$, we do not care about
serial correlation in the implied errors, nor does serial correlation cause
inconsistency in the OLS estimators.
c. Such a model is the standard unobserved effects model:
$y_{it} = x_{it}\beta + \delta_1 prog_{it} + c_i + u_{it},\quad t = 1,2,\ldots,T.$

We would probably assume that $(x_{it},prog_{it})$ is strictly exogenous; the weakest form of strict exogeneity is that $(x_{it},prog_{it})$ is uncorrelated with $u_{is}$ for all t and s. Then we could estimate the equation by fixed effects or first differencing. If the $u_{it}$ are serially uncorrelated, FE is preferred. We
could also do a GLS analysis after the fixed effects or first-differencing
transformations, but we should have a large N.
d. A model that incorporates features from parts a and c is

$y_{it} = x_{it}\beta + \delta_1 prog_{it} + \rho_1 y_{i,t-1} + c_i + u_{it},\quad t = 1,\ldots,T.$

Now, program participation can depend on unobserved city heterogeneity as well as on lagged y (we assume that $y_{i0}$ is observed). Fixed effects and first differencing are both inconsistent as $N \to \infty$ with fixed T. Assuming that $E(u_{it}|x_i,prog_i,y_{i,t-1},y_{i,t-2},\ldots,y_{i0}) = 0$, a consistent procedure is obtained by first differencing, to get

$\Delta y_{it} = \Delta x_{it}\beta + \delta_1\Delta prog_{it} + \rho_1\Delta y_{i,t-1} + \Delta u_{it},\quad t = 2,\ldots,T.$

At time t, $\Delta x_{it}$ and $\Delta prog_{it}$ can be used as their own instruments, along with $y_{i,t-j}$ for $j \geq 2$. Either pooled 2SLS or a GMM procedure can be used. Under strict exogeneity, past and future values of $x_{it}$ can also be used as instruments.
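A sketch of the pooled 2SLS procedure in part d (tsset requires a version with panel time-series operators, and all variable names are assumptions):

. tsset id year
. gen dy = y - l.y
. gen dx = x - l.x
. gen dprog = prog - l.prog
. gen dylag = l.y - l2.y
. gen ylag2 = l2.y
. ivreg dy dx dprog (dylag = ylag2)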
11.3. Writing $y_{it} = \beta x_{it} + c_i + u_{it} - \beta r_{it}$, the fixed effects estimator $\hat{\beta}_{FE}$ can be written as

$\beta + \left[N^{-1}\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)^2\right]^{-1}\left[N^{-1}\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)(u_{it} - \bar{u}_i - \beta(r_{it} - \bar{r}_i))\right].$

Now, $x_{it} - \bar{x}_i = (x_{it}^* - \bar{x}_i^*) + (r_{it} - \bar{r}_i)$. Then, because $E(r_{it}|x_i^*,c_i) = 0$ for all t, $(x_{it}^* - \bar{x}_i^*)$ and $(r_{it} - \bar{r}_i)$ are uncorrelated, and so

$\text{Var}(x_{it} - \bar{x}_i) = \text{Var}(x_{it}^* - \bar{x}_i^*) + \text{Var}(r_{it} - \bar{r}_i),\ \text{all } t.$

Similarly, under (11.30), $(x_{it} - \bar{x}_i)$ and $(u_{it} - \bar{u}_i)$ are uncorrelated for all t. Now $E[(x_{it} - \bar{x}_i)(r_{it} - \bar{r}_i)] = E[\{(x_{it}^* - \bar{x}_i^*) + (r_{it} - \bar{r}_i)\}(r_{it} - \bar{r}_i)] = \text{Var}(r_{it} - \bar{r}_i)$. By the law of large numbers and the assumption of constant variances across t,

$N^{-1}\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)^2 \stackrel{p}{\to} \sum_{t=1}^T \text{Var}(x_{it} - \bar{x}_i) = T[\text{Var}(x_{it}^* - \bar{x}_i^*) + \text{Var}(r_{it} - \bar{r}_i)]$

and

$N^{-1}\sum_{i=1}^N\sum_{t=1}^T (x_{it} - \bar{x}_i)(u_{it} - \bar{u}_i - \beta(r_{it} - \bar{r}_i)) \stackrel{p}{\to} -T\beta\text{Var}(r_{it} - \bar{r}_i).$

Therefore,

$\text{plim }\hat{\beta}_{FE} = \beta - \beta\left[\frac{\text{Var}(r_{it} - \bar{r}_i)}{\text{Var}(x_{it}^* - \bar{x}_i^*) + \text{Var}(r_{it} - \bar{r}_i)}\right] = \beta\left[1 - \frac{\text{Var}(r_{it} - \bar{r}_i)}{\text{Var}(x_{it}^* - \bar{x}_i^*) + \text{Var}(r_{it} - \bar{r}_i)}\right].$
11.5. a. $E(v_i|z_i,x_i) = Z_i[E(a_i|z_i,x_i) - \alpha] + E(u_i|z_i,x_i) = Z_i(\alpha - \alpha) + 0 = 0$. Next,

$\text{Var}(v_i|z_i,x_i) = Z_i\text{Var}(a_i|z_i,x_i)Z_i' + \text{Var}(u_i|z_i,x_i) + Z_i\text{Cov}(a_i,u_i|z_i,x_i) + \text{Cov}(u_i,a_i|z_i,x_i)Z_i' = Z_i\text{Var}(a_i|z_i,x_i)Z_i' + \text{Var}(u_i|z_i,x_i)$

because $a_i$ and $u_i$ are uncorrelated, conditional on $(z_i,x_i)$, by FE.1' and the usual iterated expectations argument. Therefore, $\text{Var}(v_i|z_i,x_i) = Z_i\Lambda Z_i' + \sigma_u^2I_T$ under the assumptions given, which shows that the conditional variance depends on $z_i$. Unlike in the standard random effects model, there is conditional heteroskedasticity.

b. If we use the usual RE analysis, we are applying FGLS to the equation $y_i = Z_i\alpha + X_i\beta + v_i$, where $v_i = Z_i(a_i - \alpha) + u_i$. From part a, we know that $E(v_i|x_i,z_i) = 0$, and so the usual RE estimator is consistent (as $N \to \infty$ for fixed T) and $\sqrt{N}$-asymptotically normal, provided the rank condition, Assumption RE.2, holds. (Remember, a feasible GLS analysis with any $\hat{\Omega}$ will be consistent provided $\hat{\Omega}$ converges in probability to a nonsingular matrix as $N \to \infty$. It need not be the case that $\text{Var}(v_i|x_i,z_i) = \text{plim}(\hat{\Omega})$, or even that $\text{Var}(v_i) = \text{plim}(\hat{\Omega})$.) From part a, we know that $\text{Var}(v_i|x_i,z_i)$ depends on $z_i$ unless we restrict almost all elements of $\Lambda$ to be zero (all but those corresponding to the constant in $z_{it}$). Therefore, the usual random effects inference -- that is, based on the usual RE variance matrix estimator -- will be invalid.

c. We can easily make the RE analysis fully robust to an arbitrary $\text{Var}(v_i|x_i,z_i)$, as in equation (7.49). Naturally, we expand the set of explanatory variables to $(z_{it},x_{it})$, and we estimate $\alpha$ along with $\beta$.
11.7. When $\lambda_t = \lambda/T$ for all t, we can rearrange (11.60) to get

$y_{it} = x_{it}\beta + \bar{x}_i\lambda + v_{it},\quad t = 1,2,\ldots,T.$

Let $\hat{\beta}$ (along with $\hat{\lambda}$) denote the pooled OLS estimator from this equation. By standard results on partitioned regression [for example, Davidson and MacKinnon (1993, Section 1.4)], $\hat{\beta}$ can be obtained by the following two-step procedure:

(i) Regress $x_{it}$ on $\bar{x}_i$ across all t and i, and save the $1 \times K$ vectors of residuals, say $\hat{r}_{it}$, $t = 1,\ldots,T$, $i = 1,\ldots,N$.

(ii) Regress $y_{it}$ on $\hat{r}_{it}$ across all t and i. The OLS vector on $\hat{r}_{it}$ is $\hat{\beta}$.

We want to show that $\hat{\beta}$ is the FE estimator. Given that the FE estimator can be obtained by pooled OLS of $y_{it}$ on $(x_{it} - \bar{x}_i)$, it suffices to show that $\hat{r}_{it} = x_{it} - \bar{x}_i$ for all t and i. But

$\hat{r}_{it} = x_{it} - \bar{x}_i\left(\sum_{i=1}^N\sum_{t=1}^T \bar{x}_i'\bar{x}_i\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T \bar{x}_i'x_{it}\right),$

and $\sum_{i=1}^N\sum_{t=1}^T \bar{x}_i'x_{it} = \sum_{i=1}^N \bar{x}_i'\sum_{t=1}^T x_{it} = \sum_{i=1}^N T\bar{x}_i'\bar{x}_i = \sum_{i=1}^N\sum_{t=1}^T \bar{x}_i'\bar{x}_i$, and so $\hat{r}_{it} = x_{it} - \bar{x}_iI_K = x_{it} - \bar{x}_i$. This completes the proof.
11.9. a. We can apply Problem 8.8b, as we are applying pooled 2SLS to the time-demeaned equation: rank$\left[\sum_{t=1}^T E(\ddot{z}_{it}'\ddot{x}_{it})\right] = K$. This clearly fails if $x_{it}$ contains any time-constant explanatory variables (across all i, as usual). The condition rank$\left[\sum_{t=1}^T E(\ddot{z}_{it}'\ddot{z}_{it})\right] = L$ is also needed, and this rules out time-constant instruments. But if the rank condition holds, we can always redefine $z_{it}$ so that $\sum_{t=1}^T E(\ddot{z}_{it}'\ddot{z}_{it})$ has full rank.

b. We can apply the results on GMM estimation in Chapter 8. In particular, in equation (8.25), take $C = E(\ddot{Z}_i'\ddot{X}_i)$, $W = [E(\ddot{Z}_i'\ddot{Z}_i)]^{-1}$, and $\Lambda = E(\ddot{Z}_i'u_iu_i'\ddot{Z}_i)$. A key point is that $\ddot{Z}_i'\ddot{u}_i = (Q_TZ_i)'(Q_Tu_i) = Z_i'Q_Tu_i = \ddot{Z}_i'u_i$, where $Q_T$ is the $T \times T$ time-demeaning matrix defined in Chapter 10. Under (11.80), $E(u_iu_i'|\ddot{Z}_i) = \sigma_u^2I_T$ (by the usual iterated expectations argument), and so $\Lambda = E(\ddot{Z}_i'u_iu_i'\ddot{Z}_i) = \sigma_u^2E(\ddot{Z}_i'\ddot{Z}_i)$. If we plug these choices of C, W, and $\Lambda$ into (8.25) and simplify, we obtain

$\text{Avar } \sqrt{N}(\hat{\beta} - \beta) = \sigma_u^2\{E(\ddot{X}_i'\ddot{Z}_i)[E(\ddot{Z}_i'\ddot{Z}_i)]^{-1}E(\ddot{Z}_i'\ddot{X}_i)\}^{-1}.$

c. The argument is very similar to the case of the fixed effects estimator. First, $\sum_{t=1}^T E(\ddot{u}_{it}^2) = (T - 1)\sigma_u^2$, just as before. If $\hat{\ddot{u}}_{it} = \ddot{y}_{it} - \ddot{x}_{it}\hat{\beta}$ are the pooled 2SLS residuals applied to the time-demeaned data, then $[N(T - 1)]^{-1}\sum_{i=1}^N\sum_{t=1}^T \hat{\ddot{u}}_{it}^2$ is a consistent estimator of $\sigma_u^2$. Typically, N(T - 1) would be replaced by N(T - 1) - K as a degrees of freedom adjustment.

d. From Problem 5.1 (which is purely algebraic, and so applies immediately to pooled 2SLS), the 2SLS estimator of all parameters in (11.81), including $\beta$, can be obtained as follows: first run the regression $x_{it}$ on $d1_i, \ldots, dN_i, z_{it}$ across all t and i, and obtain the residuals, say $\hat{r}_{it}$; second, obtain $\hat{c}_1, \ldots, \hat{c}_N, \hat{\beta}$ from the pooled regression $y_{it}$ on $d1_i, \ldots, dN_i, x_{it}, \hat{r}_{it}$. Now, by the algebra of partial regression, $\hat{\beta}$ and the coefficient on $\hat{r}_{it}$, say $\hat{\delta}$, from this last regression can be obtained by first partialling out the dummy variables, $d1_i, \ldots, dN_i$. As we know from Chapter 10, this partialling out is equivalent to time demeaning all variables. Therefore, $\hat{\beta}$ and $\hat{\delta}$ can be obtained from the pooled regression $\ddot{y}_{it}$ on $\ddot{x}_{it}, \hat{r}_{it}$, where we use the fact that the time average of $\hat{r}_{it}$ for each i is identically zero.

Now consider the 2SLS estimator of $\beta$ from (11.79). This is equivalent to first regressing $\ddot{x}_{it}$ on $\ddot{z}_{it}$ and saving the residuals, say $\hat{s}_{it}$, and then running the OLS regression $\ddot{y}_{it}$ on $\ddot{x}_{it}, \hat{s}_{it}$. But, again by partial regression and the fact that regressing on $d1_i, \ldots, dN_i$ results in time demeaning, $\hat{s}_{it} = \hat{r}_{it}$ for all i and t. This proves that the 2SLS estimates of $\beta$ from (11.79) and (11.81) are identical. (If some elements of $x_{it}$ are included in $z_{it}$, as would usually be the case, some entries in $\hat{r}_{it}$ are identically zero for all t and i. But we can simply drop those without changing any other steps in the argument.)

e. First, by writing down the first order condition for the 2SLS estimates from (11.81) (with the $dn_i$ as their own instruments, and $\hat{x}_{it}$ as the IVs for $x_{it}$), it is easy to show that $\hat{c}_i = \bar{y}_i - \bar{x}_i\hat{\beta}$, where $\hat{\beta}$ is the IV estimator from (11.81) [and also (11.79)]. Therefore, the 2SLS residuals from (11.81) are computed as $y_{it} - (\bar{y}_i - \bar{x}_i\hat{\beta}) - x_{it}\hat{\beta} = (y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i)\hat{\beta} = \ddot{y}_{it} - \ddot{x}_{it}\hat{\beta}$, which are exactly the 2SLS residuals from (11.79). Because the N dummy variables are explicitly included in (11.81), the degrees of freedom in estimating $\sigma_u^2$ from part c are properly calculated.

f. The general, messy estimator in equation (8.27) should be used, where $X_i$ and $Z_i$ are replaced with $\ddot{X}_i$ and $\ddot{Z}_i$, $\hat{W} = (\ddot{Z}'\ddot{Z}/N)^{-1}$, $\hat{u}_i = \ddot{y}_i - \ddot{X}_i\hat{\beta}$, and

$\hat{\Lambda} = N^{-1}\sum_{i=1}^N \ddot{Z}_i'\hat{u}_i\hat{u}_i'\ddot{Z}_i.$

g. The 2SLS procedure is inconsistent as $N \to \infty$ with fixed T, as is any IV method that uses time-demeaning to eliminate the unobserved effect. This is because the time-demeaned IVs will generally be correlated with some elements of $u_i$ (usually, all elements).
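A sketch of FE 2SLS done by hand in Stata (y, x, z, and id are assumptions):

. egen ybar = mean(y), by(id)
. egen xbar = mean(x), by(id)
. egen zbar = mean(z), by(id)
. gen yd = y - ybar
. gen xd = x - xbar
. gen zd = z - zbar
. ivreg yd (xd = zd), noconstant
. * per part c, rescale the variance estimate to use N(T-1) - K degrees of freedom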
11.11. Differencing twice and using the resulting cross section is easily done
in Stata. Alternatively, I can use fixed effects on the first differences:
Now, the first term does not depend on $\theta$, and the second term is clearly minimized at $\theta = \theta_o$ (although not uniquely, in general).
12.3. a. The approximate elasticity is $\partial\log[\hat{E}(y|z)]/\partial\log(z_1) = \partial[\hat{\theta}_1 + \hat{\theta}_2\log(z_1) + \hat{\theta}_3z_2]/\partial\log(z_1) = \hat{\theta}_2$.

b. This is approximated by $100\cdot\partial\log[\hat{E}(y|z)]/\partial z_2 = 100\cdot\hat{\theta}_3$.

c. Since $\partial\hat{E}(y|z)/\partial z_2 = \exp[\hat{\theta}_1 + \hat{\theta}_2\log(z_1) + \hat{\theta}_3z_2 + \hat{\theta}_4z_2^2]\cdot(\hat{\theta}_3 + 2\hat{\theta}_4z_2)$, the turning point is $z_2^* = \hat{\theta}_3/(-2\hat{\theta}_4)$.

d. Since $\nabla_\theta m(x,\theta) = \exp(x_1\theta_1 + x_2\theta_2)x$, the gradient of the mean function evaluated under the null is $\nabla_\theta\tilde{m}_i = \exp(x_{i1}\tilde{\theta}_1)x_i \equiv \tilde{m}_ix_i$, where $\tilde{\theta}_1$ is the restricted NLS estimator. From (12.72), we can compute the usual LM statistic as $NR_u^2$ from the regression $\tilde{u}_i$ on $\tilde{m}_ix_{i1}, \tilde{m}_ix_{i2}$, $i = 1,\ldots,N$, where $\tilde{u}_i = y_i - \tilde{m}_i$. For the robust test, we first regress $\tilde{m}_ix_{i2}$ on $\tilde{m}_ix_{i1}$ and obtain the $1 \times K_2$ residuals, $\tilde{r}_i$. Then we compute the statistic as in regression (12.75).
12.5. We need the gradient of $m(x_i,\theta)$ evaluated under the null hypothesis. By the chain rule,

$\nabla_\beta m(x,\theta) = g[x\beta + \delta_1(x\beta)^2 + \delta_2(x\beta)^3]\cdot[1 + 2\delta_1(x\beta) + 3\delta_2(x\beta)^2]x,$
$\nabla_\delta m(x,\theta) = g[x\beta + \delta_1(x\beta)^2 + \delta_2(x\beta)^3]\cdot[(x\beta)^2,(x\beta)^3].$

Let $\tilde{\beta}$ denote the NLS estimator with $\delta_1 = \delta_2 = 0$ imposed. Then $\nabla_\beta m(x_i,\tilde{\theta}) = g(x_i\tilde{\beta})x_i$ and $\nabla_\delta m(x_i,\tilde{\theta}) = g(x_i\tilde{\beta})[(x_i\tilde{\beta})^2,(x_i\tilde{\beta})^3]$. Therefore, the usual LM statistic can be obtained as $NR_u^2$ from the regression $\tilde{u}_i$ on $\tilde{g}_ix_i, \tilde{g}_i(x_i\tilde{\beta})^2, \tilde{g}_i(x_i\tilde{\beta})^3$, where $\tilde{g}_i \equiv g(x_i\tilde{\beta})$. If $G(\cdot)$ is the identity function, $g(\cdot) \equiv 1$, and we get RESET.
12.7. a. For each i and g, define $u_{ig} \equiv y_{ig} - m_g(x_i,\theta_o)$, so that $E(u_{ig}|x_i) = 0$, $g = 1,\ldots,G$. Further, let $u_i$ be the $G \times 1$ vector containing the $u_{ig}$. Then $E(u_iu_i'|x_i) = E(u_iu_i') = \Omega_o$. Let $\hat{u}_i$ be the vector of nonlinear least squares residuals. That is, do NLS for each g, and collect the residuals. Then, by standard arguments, a consistent estimator of $\Omega_o$ is

$\hat{\Omega} \equiv N^{-1}\sum_{i=1}^N \hat{u}_i\hat{u}_i'$

because each NLS estimator, $\hat{\theta}_g$, is consistent for $\theta_{og}$ as $N \to \infty$.

b. This part involves several steps, and I will sketch how each one goes. First, let $\gamma$ be the vector of distinct elements of $\Omega_o$ -- the nuisance parameters in the context of two-step M-estimation. Then, the score for observation i is

$s(w_i,\theta;\gamma) = -\nabla_\theta m(x_i,\theta)'\Omega^{-1}u_i(\theta),$

where, hopefully, the notation is clear. With this definition, we can verify condition (12.37), even though the actual derivatives are complicated. Each element of $s(w_i,\theta;\gamma)$ is a linear combination of $u_i(\theta)$. So $\nabla_\gamma s_j(w_i,\theta_o;\gamma)$ is a linear combination of $u_i(\theta_o) \equiv u_i$, where the linear combination is a function of $(x_i,\theta_o,\gamma)$. Since $E(u_i|x_i) = 0$, $E[\nabla_\gamma s_j(w_i,\theta_o;\gamma)|x_i] = 0$, and so its unconditional expectation is zero, too. This shows that we do not have to adjust for the first-stage estimation of $\Omega_o$. Alternatively, one can verify the hint directly, which has the same consequence.

Next, we derive $B_o \equiv E[s_i(\theta_o;\gamma_o)s_i(\theta_o;\gamma_o)']$:

$E[s_i(\theta_o;\gamma_o)s_i(\theta_o;\gamma_o)'] = E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}u_iu_i'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)]$
$= E\{E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}u_iu_i'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)|x_i]\}$
$= E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}E(u_iu_i'|x_i)\Omega_o^{-1}\nabla_\theta m_i(\theta_o)]$
$= E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\Omega_o\Omega_o^{-1}\nabla_\theta m_i(\theta_o)] = E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)].$

Next, we have to derive $A_o \equiv E[H_i(\theta_o;\gamma_o)]$ and show that $B_o = A_o$. The Hessian itself is complicated, but its expected value is not. The Jacobian of $s_i(\theta;\gamma)$ with respect to $\theta$ can be written

$H_i(\theta;\gamma) = \nabla_\theta m(x_i,\theta)'\Omega^{-1}\nabla_\theta m(x_i,\theta) + [I_P \otimes u_i(\theta)']F(x_i,\theta;\gamma),$

where $F(x_i,\theta;\gamma)$ is a $GP \times P$ matrix, where P is the total number of parameters, that involves Jacobians of the rows of $\Omega^{-1}\nabla_\theta m_i(\theta)$ with respect to $\theta$. The key is that $F(x_i,\theta;\gamma)$ depends on $x_i$, not on $y_i$. So,

$E[H_i(\theta_o;\gamma_o)|x_i] = \nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o) + [I_P \otimes E(u_i|x_i)']F(x_i,\theta_o;\gamma_o) = \nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o).$

Now iterated expectations gives $A_o = E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)]$. So, we have verified (12.37) and that $A_o = B_o$. Therefore, from Theorem 12.3,

$\text{Avar } \sqrt{N}(\hat{\theta} - \theta_o) = A_o^{-1} = \{E[\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)]\}^{-1}.$

c. As usual, we replace expectations with sample averages and unknown parameters with their estimators, and divide the result by N to get $\widehat{\text{Avar}}(\hat{\theta})$:

$\widehat{\text{Avar}}(\hat{\theta}) = \left[N^{-1}\sum_{i=1}^N \nabla_\theta m_i(\hat{\theta})'\hat{\Omega}^{-1}\nabla_\theta m_i(\hat{\theta})\right]^{-1}/N = \left[\sum_{i=1}^N \nabla_\theta m_i(\hat{\theta})'\hat{\Omega}^{-1}\nabla_\theta m_i(\hat{\theta})\right]^{-1}.$

The estimate $\hat{\Omega}$ can be based on the multivariate NLS residuals or can be updated after the nonlinear SUR estimates have been obtained.

d. First, note that $\nabla_\theta m_i(\theta_o)$ is a block-diagonal matrix with blocks $\nabla_{\theta_g}m_{ig}(\theta_{og})$, a $1 \times P_g$ matrix. (I implicitly assumed that there are no cross-equation restrictions imposed in the nonlinear SUR estimation.) If $\Omega_o$ is block-diagonal, so is its inverse. Standard matrix multiplication shows that $\nabla_\theta m_i(\theta_o)'\Omega_o^{-1}\nabla_\theta m_i(\theta_o)$ is block diagonal with g-th block $\sigma_{og}^{-2}\nabla_{\theta_g}m_{ig}^{o\prime}\nabla_{\theta_g}m_{ig}^{o}$. Taking expectations and inverting the result shows that

$\text{Avar } \sqrt{N}(\hat{\theta}_g - \theta_{og}) = \sigma_{og}^2[E(\nabla_{\theta_g}m_{ig}^{o\prime}\nabla_{\theta_g}m_{ig}^{o})]^{-1},\quad g = 1,\ldots,G.$

(Note also that the nonlinear SUR estimators are asymptotically uncorrelated across equations.) These asymptotic variances are easily seen to be the same as those for nonlinear least squares on each equation; see p. 360.

e. I cannot see a nonlinear analogue of Theorem 7.7. The first hint given in Problem 7.5 does not extend readily to nonlinear models, even when the same regressors appear in each equation. The key is that $X_i$ is replaced with $\nabla_\theta m(x_i,\theta_o)$. While this $G \times P$ matrix has a block-diagonal form, as described in part d, the blocks are not the same even when the same regressors appear in each equation. In the linear case, $\nabla_{\theta_g}m_g(x_i,\theta_{og}) = x_{ig}$ for all g. But, unless $\theta_{og}$ is the same in all equations -- a very restrictive assumption -- $\nabla_{\theta_g}m_g(x_i,\theta_{og})$ varies across g. For example, if $m_g(x_i,\theta_{og}) = \exp(x_i\theta_{og})$ then $\nabla_{\theta_g}m_g(x_i,\theta_{og}) = \exp(x_i\theta_{og})x_i$, and the gradients differ across g.
12.9. a. We cannot say anything in general about Med(y|x), since Med(y|x) = $m(x,\beta_o)$ + Med(u|x), and Med(u|x) could be a general function of x.
b. If u and x are independent, then E(u|x) and Med(u|x) are both
constants, say a and d. Then E(y|x) - Med(y|x) = a - d, which does not depend on x.
c. When u and x are independent, the partial effects of $x_j$ on the conditional mean and conditional median are the same, and there is no ambiguity about what is "the effect of $x_j$ on y," at least when only the mean
and median are in the running. Then, we could interpret large differences
between LAD and NLS as perhaps indicating an outlier problem. But it could
just be that u and x are not independent.
12.11. a. For consistency of the MNLS estimator, we need -- in addition to the regularity conditions, which I will ignore -- the identification condition. That is, $\beta_o$ must uniquely minimize

$E[q(w_i,\beta)] = E\{[y_i - m(x_i,\beta)]'[y_i - m(x_i,\beta)]\} = E(\{u_i + [m(x_i,\beta_o) - m(x_i,\beta)]\}'\{u_i + [m(x_i,\beta_o) - m(x_i,\beta)]\})$
$= E(u_i'u_i) + 2E\{[m(x_i,\beta_o) - m(x_i,\beta)]'u_i\} + E\{[m(x_i,\beta_o) - m(x_i,\beta)]'[m(x_i,\beta_o) - m(x_i,\beta)]\}$
$= E(u_i'u_i) + E\{[m(x_i,\beta_o) - m(x_i,\beta)]'[m(x_i,\beta_o) - m(x_i,\beta)]\}$

because $E(u_i|x_i) = 0$. Therefore, the identification assumption is that

$E\{[m(x_i,\beta_o) - m(x_i,\beta)]'[m(x_i,\beta_o) - m(x_i,\beta)]\} > 0,\quad \beta \neq \beta_o.$

In a linear model, where $m(x_i,\beta) = X_i\beta$ for $X_i$ a $G \times K$ matrix, the condition is

$(\beta_o - \beta)'E(X_i'X_i)(\beta_o - \beta) > 0,\quad \beta \neq \beta_o,$

and this holds provided $E(X_i'X_i)$ is positive definite.

Provided $m(x,\cdot)$ is twice continuously differentiable, there are no problems in applying Theorem 12.3. Generally, $B_o = E[\nabla_\beta m_i(\beta_o)'u_iu_i'\nabla_\beta m_i(\beta_o)]$ and $A_o = E[\nabla_\beta m_i(\beta_o)'\nabla_\beta m_i(\beta_o)]$. These can be consistently estimated in the obvious way after obtaining the MNLS estimators.

b. We can apply the results on two-step M-estimation. The key is that, under general regularity conditions,

$N^{-1}\sum_{i=1}^N [y_i - m(x_i,\beta)]'[W_i(\hat{\delta})]^{-1}[y_i - m(x_i,\beta)]/2$

converges uniformly in probability to

$E\{[y_i - m(x_i,\beta)]'[W(x_i,\delta_o)]^{-1}[y_i - m(x_i,\beta)]\}/2,$

which is just to say that the usual consistency proof can be used provided we verify identification. But we can use an argument very similar to the unweighted case to show

$E\{[y_i - m(x_i,\beta)]'[W(x_i,\delta_o)]^{-1}[y_i - m(x_i,\beta)]\} = E\{u_i'[W_i(\delta_o)]^{-1}u_i\} + E\{[m(x_i,\beta_o) - m(x_i,\beta)]'[W_i(\delta_o)]^{-1}[m(x_i,\beta_o) - m(x_i,\beta)]\},$

where $E(u_i|x_i) = 0$ is used to show the cross-product term, $2E\{[m(x_i,\beta_o) - m(x_i,\beta)]'[W_i(\delta_o)]^{-1}u_i\}$, is zero (by iterated expectations, as always). As before, the first term does not depend on $\beta$ and the second term is minimized at $\beta_o$; we would have to assume it is uniquely minimized.

To get the asymptotic variance, we proceed as in Problem 12.7. First, it can be shown that condition (12.37) holds. In particular, we can write $\nabla_\delta s_i(\beta_o;\delta_o) = (I_P \otimes u_i)'G(x_i,\beta_o;\delta_o)$ for some function $G(x_i,\beta_o;\delta_o)$. It follows easily that $E[\nabla_\delta s_i(\beta_o;\delta_o)|x_i] = 0$, which implies (12.37). This means that, under $E(y_i|x_i) = m(x_i,\beta_o)$, we can ignore preliminary estimation of $\delta_o$ provided we have a $\sqrt{N}$-consistent estimator.

To obtain the asymptotic variance when the conditional variance matrix is correctly specified, that is, when $\text{Var}(y_i|x_i) = \text{Var}(u_i|x_i) = W(x_i,\delta_o)$, we can use an argument very similar to the nonlinear SUR case in Problem 12.7:

$E[s_i(\beta_o;\delta_o)s_i(\beta_o;\delta_o)'] = E[\nabla_\beta m_i(\beta_o)'[W_i(\delta_o)]^{-1}u_iu_i'[W_i(\delta_o)]^{-1}\nabla_\beta m_i(\beta_o)]$
$= E\{E[\nabla_\beta m_i(\beta_o)'[W_i(\delta_o)]^{-1}u_iu_i'[W_i(\delta_o)]^{-1}\nabla_\beta m_i(\beta_o)|x_i]\}$
$= E[\nabla_\beta m_i(\beta_o)'[W_i(\delta_o)]^{-1}E(u_iu_i'|x_i)[W_i(\delta_o)]^{-1}\nabla_\beta m_i(\beta_o)]$
$= E\{\nabla_\beta m_i(\beta_o)'[W_i(\delta_o)]^{-1}\nabla_\beta m_i(\beta_o)\}.$

Now, the Hessian (with respect to $\beta$), evaluated at $(\beta_o,\delta_o)$, can be written as

$H_i(\beta_o;\delta_o) = \nabla_\beta m(x_i,\beta_o)'[W_i(\delta_o)]^{-1}\nabla_\beta m(x_i,\beta_o) + (I_P \otimes u_i)'F(x_i,\beta_o;\delta_o),$

for some complicated function $F(x_i,\beta_o;\delta_o)$ that depends only on $x_i$. Taking expectations gives

$A_o \equiv E[H_i(\beta_o;\delta_o)] = E\{\nabla_\beta m(x_i,\beta_o)'[W_i(\delta_o)]^{-1}\nabla_\beta m(x_i,\beta_o)\} = B_o.$

Therefore, from the usual results on M-estimation, $\text{Avar } \sqrt{N}(\hat{\beta} - \beta_o) = A_o^{-1}$, and a consistent estimator of $A_o$ is

$\hat{A} = N^{-1}\sum_{i=1}^N \nabla_\beta m(x_i,\hat{\beta})'[W_i(\hat{\delta})]^{-1}\nabla_\beta m(x_i,\hat{\beta}).$

c. The consistency argument in part b did not use the fact that W(x,$\delta$) is correctly specified for Var(y|x). Exactly the same derivation goes through. But, of course, the asymptotic variance is affected because $A_o \neq B_o$, and the expression for $B_o$ no longer holds. The estimator of $A_o$ in part b still works, of course. To consistently estimate $B_o$ we use

$\hat{B} = N^{-1}\sum_{i=1}^N \nabla_\beta m(x_i,\hat{\beta})'[W_i(\hat{\delta})]^{-1}\hat{u}_i\hat{u}_i'[W_i(\hat{\delta})]^{-1}\nabla_\beta m(x_i,\hat{\beta}).$

Now, we estimate $\text{Avar } \sqrt{N}(\hat{\beta} - \beta_o)$ in the usual way: $\hat{A}^{-1}\hat{B}\hat{A}^{-1}$.
CHAPTER 13
13.1. No. We know that $\theta_o$ solves

$\max_{\theta\in\Theta} E[\log f(y_i|x_i;\theta)],$

where the expectation is over the joint distribution of $(x_i,y_i)$. Therefore, because $\exp(\cdot)$ is an increasing function, $\theta_o$ also maximizes $\exp\{E[\log f(y_i|x_i;\theta)]\}$ over $\Theta$. The problem is that the expectation and the exponential function cannot be interchanged: $E[f(y_i|x_i;\theta)] \neq \exp\{E[\log f(y_i|x_i;\theta)]\}$. In fact, Jensen's inequality tells us that $E[f(y_i|x_i;\theta)] > \exp\{E[\log f(y_i|x_i;\theta)]\}$.
13.3. Parts a and b essentially appear in Section 15.4.
13.5. a. Since $s_i^g(\phi_o) = [G(\theta_o)']^{-1}s_i(\theta_o)$,

$E[s_i^g(\phi_o)s_i^g(\phi_o)'|x_i] = E\{[G(\theta_o)']^{-1}s_i(\theta_o)s_i(\theta_o)'[G(\theta_o)]^{-1}|x_i\}$
$= [G(\theta_o)']^{-1}E[s_i(\theta_o)s_i(\theta_o)'|x_i][G(\theta_o)]^{-1}$
$= [G(\theta_o)']^{-1}A_i(\theta_o)[G(\theta_o)]^{-1}.$

b. In part b, we just replace $\theta_o$ with $\tilde{\theta}$ and $\phi_o$ with $\tilde{\phi}$:

$\tilde{A}_i^g = [G(\tilde{\theta})']^{-1}\tilde{A}_i[G(\tilde{\theta})]^{-1} \equiv \tilde{G}'^{-1}\tilde{A}_i\tilde{G}^{-1}.$

c. The expected Hessian form of the statistic is given in the second part of equation (13.36), but where it is based initially on $\tilde{s}_i^g$ and $\tilde{A}_i^g$:

$LM_g = \left(\sum_{i=1}^N \tilde{s}_i^g\right)'\left(\sum_{i=1}^N \tilde{A}_i^g\right)^{-1}\left(\sum_{i=1}^N \tilde{s}_i^g\right)$
$= \left(\sum_{i=1}^N \tilde{G}'^{-1}\tilde{s}_i\right)'\left(\sum_{i=1}^N \tilde{G}'^{-1}\tilde{A}_i\tilde{G}^{-1}\right)^{-1}\left(\sum_{i=1}^N \tilde{G}'^{-1}\tilde{s}_i\right)$
$= \left(\sum_{i=1}^N \tilde{s}_i\right)'\tilde{G}^{-1}\tilde{G}\left(\sum_{i=1}^N \tilde{A}_i\right)^{-1}\tilde{G}'\tilde{G}'^{-1}\left(\sum_{i=1}^N \tilde{s}_i\right)$
$= \left(\sum_{i=1}^N \tilde{s}_i\right)'\left(\sum_{i=1}^N \tilde{A}_i\right)^{-1}\left(\sum_{i=1}^N \tilde{s}_i\right) = LM.$
13.7. a. The joint density is simply $g(y_1|y_2,x;\theta_o)\cdot h(y_2|x;\theta_o)$. The log-likelihood for observation i is

$\ell_i(\theta) \equiv \log g(y_{i1}|y_{i2},x_i;\theta) + \log h(y_{i2}|x_i;\theta),$

and we would use this in a standard MLE analysis (conditional on $x_i$).

b. First, we know that, for all $(y_{i2},x_i)$, $\theta_o$ maximizes $E[\ell_{i1}(\theta)|y_{i2},x_i]$. Since $r_{i2}$ is a function of $(y_{i2},x_i)$,

$E[r_{i2}\ell_{i1}(\theta)|y_{i2},x_i] = r_{i2}E[\ell_{i1}(\theta)|y_{i2},x_i];$

since $r_{i2} \geq 0$, $\theta_o$ maximizes $E[r_{i2}\ell_{i1}(\theta)|y_{i2},x_i]$ for all $(y_{i2},x_i)$, and therefore $\theta_o$ maximizes $E[r_{i2}\ell_{i1}(\theta)]$. Similarly, $\theta_o$ maximizes $E[\ell_{i2}(\theta)]$, and so it follows that $\theta_o$ maximizes $E[r_{i2}\ell_{i1}(\theta) + \ell_{i2}(\theta)]$. For identification, we have to assume or verify uniqueness.

c. The score is

$s_i(\theta) = r_{i2}s_{i1}(\theta) + s_{i2}(\theta),$

where $s_{i1}(\theta) \equiv \nabla_\theta\ell_{i1}(\theta)'$ and $s_{i2}(\theta) \equiv \nabla_\theta\ell_{i2}(\theta)'$. Therefore,

$E[s_i(\theta_o)s_i(\theta_o)'] = E[r_{i2}s_{i1}(\theta_o)s_{i1}(\theta_o)'] + E[s_{i2}(\theta_o)s_{i2}(\theta_o)'] + E[r_{i2}s_{i1}(\theta_o)s_{i2}(\theta_o)'] + E[r_{i2}s_{i2}(\theta_o)s_{i1}(\theta_o)'].$

Now by the usual conditional MLE theory, $E[s_{i1}(\theta_o)|y_{i2},x_i] = 0$ and, since $r_{i2}$ and $s_{i2}(\theta)$ are functions of $(y_{i2},x_i)$, it follows that $E[r_{i2}s_{i1}(\theta_o)s_{i2}(\theta_o)'|y_{i2},x_i] = 0$, and so its transpose also has zero conditional expectation. As usual, this implies zero unconditional expectation. We have shown

$E[s_i(\theta_o)s_i(\theta_o)'] = E[r_{i2}s_{i1}(\theta_o)s_{i1}(\theta_o)'] + E[s_{i2}(\theta_o)s_{i2}(\theta_o)'].$

Now, by the unconditional information matrix equality for the density $h(y_2|x;\theta)$, $E[s_{i2}(\theta_o)s_{i2}(\theta_o)'] = -E[H_{i2}(\theta_o)]$, where $H_{i2}(\theta) = \nabla_\theta s_{i2}(\theta)$. Further, by the conditional IM equality for the density $g(y_1|y_2,x;\theta)$,

$E[s_{i1}(\theta_o)s_{i1}(\theta_o)'|y_{i2},x_i] = -E[H_{i1}(\theta_o)|y_{i2},x_i], \quad (13.70)$

where $H_{i1}(\theta) = \nabla_\theta s_{i1}(\theta)$. Since $r_{i2}$ is a function of $(y_{i2},x_i)$, we can put $r_{i2}$ inside both expectations in (13.70). Then, by iterated expectations,

$E[r_{i2}s_{i1}(\theta_o)s_{i1}(\theta_o)'] = -E[r_{i2}H_{i1}(\theta_o)].$

Combining all the pieces, we have shown that

$E[s_i(\theta_o)s_i(\theta_o)'] = -E[r_{i2}H_{i1}(\theta_o)] - E[H_{i2}(\theta_o)] = -E[r_{i2}\nabla_\theta s_{i1}(\theta_o) + \nabla_\theta s_{i2}(\theta_o)] = -E[\nabla_\theta^2\ell_i(\theta_o)] \equiv -E[H_i(\theta_o)].$

So we have verified that an unconditional IM equality holds, which means we can estimate the asymptotic variance of $\sqrt{N}(\hat{\theta} - \theta_o)$ by $\{-E[H_i(\theta_o)]\}^{-1}$.

d. From part c, one consistent estimator of $-E[H_i(\theta_o)]$ is

$-N^{-1}\sum_{i=1}^N (r_{i2}\hat{H}_{i1} + \hat{H}_{i2}),$

where the notation should be obvious. But, as we discussed in Chapters 12 and 13, this estimator need not be positive definite. Instead, we can break the problem into finding consistent estimators of $-E[r_{i2}H_{i1}(\theta_o)]$ and $-E[H_{i2}(\theta_o)]$, for which we can use iterated expectations. Since, by definition, $A_{i2}(\theta_o) \equiv -E[H_{i2}(\theta_o)|x_i]$, $N^{-1}\sum_{i=1}^N \hat{A}_{i2}$ is consistent for $-E[H_{i2}(\theta_o)]$ by the usual iterated expectations argument. Similarly, since $A_{i1}(\theta_o) \equiv -E[H_{i1}(\theta_o)|y_{i2},x_i]$, and $r_{i2}$ is a function of $(y_{i2},x_i)$, it follows that $E[r_{i2}A_{i1}(\theta_o)] = -E[r_{i2}H_{i1}(\theta_o)]$. This implies that, under general regularity conditions, $N^{-1}\sum_{i=1}^N r_{i2}\hat{A}_{i1}$ consistently estimates $-E[r_{i2}H_{i1}(\theta_o)]$. This completes what we needed to show. Interestingly, even though we do not have a true conditional maximum likelihood problem, we can still use the conditional expectations of the Hessians -- but conditioned on different sets of variables, $(y_{i2},x_i)$ in one case and $x_i$ in the other -- to consistently estimate the asymptotic variance of the partial MLE.

e. Bonus Question: Show that if we were able to use the entire random sample, the resulting conditional MLE would be more efficient than the partial MLE based on the selected sample.

Answer: We use a basic fact about positive definite matrices: if A and B are $P \times P$ positive definite matrices, then $A - B$ is p.s.d. if and only if $B^{-1} - A^{-1}$ is p.s.d. Now, as we showed in part d, the asymptotic variance of the partial MLE is $\{E[r_{i2}A_{i1}(\theta_o) + A_{i2}(\theta_o)]\}^{-1}$. If we could use the entire random sample for both terms, the asymptotic variance would be $\{E[A_{i1}(\theta_o) + A_{i2}(\theta_o)]\}^{-1}$. But $\{E[r_{i2}A_{i1}(\theta_o) + A_{i2}(\theta_o)]\}^{-1} - \{E[A_{i1}(\theta_o) + A_{i2}(\theta_o)]\}^{-1}$ is p.s.d. because $E[A_{i1}(\theta_o) + A_{i2}(\theta_o)] - E[r_{i2}A_{i1}(\theta_o) + A_{i2}(\theta_o)] = E[(1 - r_{i2})A_{i1}(\theta_o)]$ is p.s.d. [since $A_{i1}(\theta_o)$ is p.s.d. and $1 - r_{i2} \geq 0$].
13.9. To be added.
13.11. To be added.
CHAPTER 14
14.1. a. The simplest way to estimate (14.35) is by 2SLS, using instruments $(x_1,x_2)$. Nonlinear functions of these can be added to the instrument list -- these would generally improve efficiency if $\gamma_2 \neq 1$. If $E(u_2^2|x) = \sigma_2^2$, 2SLS using the given list of instruments is the efficient, single equation GMM estimator. Otherwise, the optimal weighting matrix that allows heteroskedasticity of unknown form should be used. Finally, one could try to use the optimal instruments derived in Section 14.5.3. Even under homoskedasticity, these are difficult, if not impossible, to find analytically if $\gamma_2 \neq 1$.

b. No. If $\gamma_1 = 0$, the parameter $\gamma_2$ does not appear in the model. Of course, if we knew $\gamma_1 = 0$, we would consistently estimate $\delta_1$ by OLS.

c. We can see this by obtaining $E(y_1|x)$:

$E(y_1|x) = x_1\delta_1 + \gamma_1E(y_2^{\gamma_2}|x) + E(u_1|x) = x_1\delta_1 + \gamma_1E(y_2^{\gamma_2}|x).$

Now, when $\gamma_2 \neq 1$, $E(y_2^{\gamma_2}|x) \neq [E(y_2|x)]^{\gamma_2}$, so we cannot write

$E(y_1|x) = x_1\delta_1 + \gamma_1(x\delta_2)^{\gamma_2};$

in fact, we cannot find $E(y_1|x)$ without more assumptions. While the regression $y_2$ on x consistently estimates $\delta_2$, the two-step NLS estimator of $y_{i1}$ on $x_{i1}, (x_i\hat{\delta}_2)^{\gamma_2}$ will not be consistent for $\delta_1$ and $\gamma_2$. (This is an example of a "forbidden regression.") When $\gamma_2 = 1$, the plug-in method works: it is just the usual 2SLS estimator.
14.3. Let Z_i^* be the G \times P matrix of optimal instruments in (14.63), where
we suppress its dependence on x_i. Let Z_i be a G \times L matrix that is a
function of x_i, and let \Lambda_o be the probability limit of the weighting matrix.
Then the asymptotic variance of the GMM estimator has the form (14.10) with
G_o = E[Z_i' R_o(x_i)]. So, in (14.54), take A \equiv G_o' \Lambda_o G_o and s(w_i) \equiv
G_o' \Lambda_o Z_i' r(w_i, \theta_o). The optimal score function is s^*(w_i) \equiv
R_o(x_i)' \Omega_o(x_i)^{-1} r(w_i, \theta_o). Now we can verify (14.57) with r = 1:

    E[s(w_i) s^*(w_i)'] = G_o' \Lambda_o E[Z_i' r(w_i, \theta_o) r(w_i, \theta_o)' \Omega_o(x_i)^{-1} R_o(x_i)]
                        = G_o' \Lambda_o E[Z_i' E\{r(w_i, \theta_o) r(w_i, \theta_o)' | x_i\} \Omega_o(x_i)^{-1} R_o(x_i)]
                        = G_o' \Lambda_o E[Z_i' \Omega_o(x_i) \Omega_o(x_i)^{-1} R_o(x_i)] = G_o' \Lambda_o G_o = A.
14.5. We can write the unrestricted linear projection as

    y_{it} = \pi_{t0} + x_i \Pi_t + v_{it},   t = 1,2,3,

where \Pi_t is 3K \times 1; stacking (\pi_{t0}, \Pi_t')' for t = 1,2,3 gives the (3 + 9K) \times 1
vector \Pi. Let \theta = (\psi, \lambda_1', \lambda_2', \lambda_3', \beta')'. With the restrictions imposed on
the \Pi_t we have

    \pi_{t0} = \psi,  t = 1,2,3,
    \Pi_1 = [(\lambda_1 + \beta)', \lambda_2', \lambda_3']',
    \Pi_2 = [\lambda_1', (\lambda_2 + \beta)', \lambda_3']',
    \Pi_3 = [\lambda_1', \lambda_2', (\lambda_3 + \beta)']'.

Therefore, we can write \Pi = H\theta for the (3 + 9K) \times (1 + 4K) matrix H defined
by

    H = [ 1    0    0    0    0
          0   I_K   0    0   I_K
          0    0   I_K   0    0
          0    0    0   I_K   0
          1    0    0    0    0
          0   I_K   0    0    0
          0    0   I_K   0   I_K
          0    0    0   I_K   0
          1    0    0    0    0
          0   I_K   0    0    0
          0    0   I_K   0    0
          0    0    0   I_K  I_K ].
14.7. With h(\theta) = H\theta, the minimization problem becomes

    \min_{\theta \in R^P} (\hat{\Pi} - H\theta)' \hat{\Lambda}^{-1} (\hat{\Pi} - H\theta),

where it is assumed that no restrictions are placed on \theta. The first order
condition is easily seen to be

    -2H' \hat{\Lambda}^{-1} (\hat{\Pi} - H\hat{\theta}) = 0   or   (H' \hat{\Lambda}^{-1} H) \hat{\theta} = H' \hat{\Lambda}^{-1} \hat{\Pi}.

Therefore, assuming H' \hat{\Lambda}^{-1} H is nonsingular -- which occurs w.p.a.1 when
H' \Lambda_o^{-1} H is nonsingular -- we have \hat{\theta} = (H' \hat{\Lambda}^{-1} H)^{-1} H' \hat{\Lambda}^{-1} \hat{\Pi}.
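Given estimates of \Pi and \Lambda and the matrix H, this closed form is immediate
to compute. A minimal sketch in Stata's Mata (not part of the original
answer; Pihat, Lambdahat, and H are hypothetical names for matrices already
built):

    mata:
        // minimum distance estimator: thetahat = (H'WH)^{-1} H'W Pihat,
        // with weighting matrix W = Lambdahat^{-1}
        W = invsym(Lambdahat)
        thetahat = invsym(H'*W*H)*(H'*W*Pihat)
    end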
14.9. We have to verify equations (14.55) and (14.56) for the random effects
and fixed effects estimators. The choices of s_{i1}, s_{i2} (with added
subscripts for clarity), A_1, and A_2 are given in the hint. Now, from Chapter
10, we know that E(r_i r_i' | x_i) = \sigma_u^2 I_T under RE.1, RE.2, and RE.3, where
r_i = v_i - \lambda j_T \bar{v}_i. Therefore, E(s_{i1} s_{i1}') = E(\check{X}_i' r_i r_i' \check{X}_i) = \sigma_u^2 E(\check{X}_i' \check{X}_i) \equiv \sigma_u^2 A_1
by the usual iterated expectations argument. This means that, in (14.55),
r \equiv \sigma_u^2. Now, we just need to verify (14.56) for this choice of r. But
s_{i2} s_{i1}' = \ddot{X}_i' u_i r_i' \check{X}_i. Now, as described in the hint,

    \ddot{X}_i' r_i = \ddot{X}_i'(v_i - \lambda j_T \bar{v}_i) = \ddot{X}_i' v_i = \ddot{X}_i'(c_i j_T + u_i) = \ddot{X}_i' u_i.

So s_{i2} s_{i1}' = \ddot{X}_i' r_i r_i' \check{X}_i and therefore E(s_{i2} s_{i1}' | x_i) = \ddot{X}_i' E(r_i r_i' | x_i) \check{X}_i =
\sigma_u^2 \ddot{X}_i' \check{X}_i. It follows that E(s_{i2} s_{i1}') = \sigma_u^2 E(\ddot{X}_i' \check{X}_i). To finish off the proof,
note that

    \ddot{X}_i' \check{X}_i = \ddot{X}_i'(X_i - \lambda j_T \bar{x}_i) = \ddot{X}_i' X_i = \ddot{X}_i' \ddot{X}_i.

This verifies (14.56) with r = \sigma_u^2.
CHAPTER 15
15.1. a. Since the regressors are all orthogonal by construction -- d_{ik} \cdot d_{im} =
0 for k \neq m, and all i -- the coefficient on d_{im} is obtained from the simple
regression y_i on d_{im}, i = 1,...,N. But this coefficient is easily seen to be
the average of y_i over the observations falling into category m -- that is,
the fraction of ones in category m. Therefore, the fitted values are
just the cell frequencies, and these are necessarily in [0,1].

b. The fitted values for each category will be the same. If we drop d_1
but add an overall intercept, the overall intercept becomes the cell
frequency for the first category, and the coefficient on d_{im} becomes the
difference in cell frequencies between category m and category one,
m = 2, ..., M.
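A minimal Stata sketch of both regressions (not part of the original
answer), assuming a binary outcome y and three exhaustive, mutually
exclusive dummies d1, d2, d3:

    * coefficients equal the cell frequencies:
    regress y d1 d2 d3, noconstant
    * intercept = frequency for category 1; coefficients on d2, d3 are
    * the differences from category 1:
    regress y d2 d3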
15.3. a. If P(y = 1 | z_1, z_2) = \Phi(z_1 \delta_1 + \gamma_1 z_2 + \gamma_2 z_2^2), then

    \partial P(y = 1 | z_1, z_2) / \partial z_2 = (\gamma_1 + 2\gamma_2 z_2) \phi(z_1 \delta_1 + \gamma_1 z_2 + \gamma_2 z_2^2);

for given z, this is estimated as

    (\hat{\gamma}_1 + 2\hat{\gamma}_2 z_2) \phi(z_1 \hat{\delta}_1 + \hat{\gamma}_1 z_2 + \hat{\gamma}_2 z_2^2),

where, of course, the estimates are the probit estimates.

b. In the model

    P(y = 1 | z_1, z_2, d_1) = \Phi(z_1 \delta_1 + \gamma_1 z_2 + \gamma_2 d_1 + \gamma_3 z_2 d_1),

the partial effect of z_2 is

    \partial P(y = 1 | z_1, z_2, d_1) / \partial z_2 = (\gamma_1 + \gamma_3 d_1) \phi(z_1 \delta_1 + \gamma_1 z_2 + \gamma_2 d_1 + \gamma_3 z_2 d_1).

The effect of d_1 is measured as the difference in the probabilities at d_1 = 1
and d_1 = 0:

    P(y = 1 | z, d_1 = 1) - P(y = 1 | z, d_1 = 0)
        = \Phi[z_1 \delta_1 + (\gamma_1 + \gamma_3) z_2 + \gamma_2] - \Phi(z_1 \delta_1 + \gamma_1 z_2).

Again, to estimate these effects at given z and -- in the first case, d_1 -- we
just replace the parameters with their probit estimates, and use average or
other interesting values of z.

c. We would apply the delta method from Chapter 3. Thus, we would
require the full variance matrix of the probit estimates as well as the
gradient of the expression of interest, such as (\gamma_1 + 2\gamma_2 z_2) \phi(z_1 \delta_1 + \gamma_1 z_2 +
\gamma_2 z_2^2), with respect to all probit parameters (not with respect to the z_j).
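In current Stata, the delta-method calculation in part c is automated by
nlcom after probit. A minimal sketch (not in the original answer), treating
z1 and z2 as scalars and evaluating the part a effect at the hypothetical
values z1 = 1, z2 = 0.5:

    gen z2sq = z2^2
    probit y z1 z2 z2sq
    * delta-method SE for (g1 + 2*g2*z2)*phi(z1*d1 + g1*z2 + g2*z2^2):
    nlcom (_b[z2] + 2*_b[z2sq]*0.5) * ///
          normalden(_b[_cons] + _b[z1]*1 + _b[z2]*0.5 + _b[z2sq]*0.25)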
15.5. a. If P(y = 1 | z, q) = \Phi(z_1 \delta_1 + \gamma_1 z_2 q), then

    \partial P(y = 1 | z, q) / \partial z_2 = \gamma_1 q \phi(z_1 \delta_1 + \gamma_1 z_2 q),

assuming that z_2 is not functionally related to z_1.

b. Write y^* = z_1 \delta_1 + r, where r = \gamma_1 z_2 q + e, and e is independent of
(z, q) with a standard normal distribution. Because q is assumed independent
of z, r | z ~ Normal(0, \gamma_1^2 z_2^2 + 1); this follows because E(r | z) = \gamma_1 z_2 E(q | z) +
E(e | z) = 0 and

    Var(r | z) = \gamma_1^2 z_2^2 Var(q | z) + Var(e | z) + 2\gamma_1 z_2 Cov(q, e | z) = \gamma_1^2 z_2^2 + 1

because Cov(q, e | z) = 0 by independence between e and (z, q). Thus,
r / \sqrt{\gamma_1^2 z_2^2 + 1} has a standard normal distribution independent of z. It follows
that

    P(y = 1 | z) = \Phi(z_1 \delta_1 / \sqrt{\gamma_1^2 z_2^2 + 1}).        (15.90)

c. Because P(y = 1 | z) depends only on \gamma_1^2, this is what we can estimate
along with \delta_1. (For example, \gamma_1 = -2 and \gamma_1 = 2 give exactly the same model
for P(y = 1 | z).) This is why we define \rho_1 \equiv \gamma_1^2. Testing H_0: \rho_1 = 0 is most
easily done using the score or LM test because, under H_0, we have a standard
probit model. Let \hat{\delta}_1 denote the probit estimates under the null that \rho_1 = 0.
The gradient of the response probability with respect to \rho_1 is

    -(z_{i1}\delta_1)(z_{i2}^2/2)(\rho_1 z_{i2}^2 + 1)^{-3/2} \phi(z_{i1}\delta_1 / \sqrt{\rho_1 z_{i2}^2 + 1}).

When we evaluate this at \rho_1 = 0 and \hat{\delta}_1 we get -(z_{i1}\hat{\delta}_1)(z_{i2}^2/2)\hat{\phi}_i. Then, the
score statistic can be obtained as N R_u^2 from the regression

    \tilde{u}_i  on  \hat{\phi}_i z_{i1} / \sqrt{\hat{\Phi}_i(1 - \hat{\Phi}_i)},   (z_{i1}\hat{\delta}_1) z_{i2}^2 \hat{\phi}_i / \sqrt{\hat{\Phi}_i(1 - \hat{\Phi}_i)},

where \tilde{u}_i = (y_i - \hat{\Phi}_i) / \sqrt{\hat{\Phi}_i(1 - \hat{\Phi}_i)} are the standardized probit residuals;
under H_0, N R_u^2 ~ \chi_1^2 asymptotically.

d. The model can be estimated by MLE using the formulation with \rho_1 in
place of \gamma_1^2. But this is not a standard probit estimation.
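A minimal Stata sketch of the LM computation in part c (not part of the
original answer), treating z1 and z2 as scalars; with the noconstant
option, regress reports the uncentered R-squared that the statistic
requires:

    probit y z1
    predict double xb, xb
    gen double Phi = normal(xb)
    gen double phi = normalden(xb)
    gen double den = sqrt(Phi*(1 - Phi))
    gen double ut  = (y - Phi)/den          // standardized residuals
    gen double g0  = phi/den                // score wrt the intercept
    gen double g1  = phi*z1/den             // score wrt the slope on z1
    gen double gr  = -xb*(z2^2/2)*phi/den   // score wrt rho1, under H0
    regress ut g0 g1 gr, noconstant
    display "LM = " e(N)*e(r2) "; compare with chi2(1) critical value"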
15.7. a. The following Stata output is for part a:
We can plug in a_2 = 10 to obtain the approximate effect of increasing the cap
from 10 to 11. For a given value of x, we would compute \Phi[(x\hat{\beta} - 10)/\hat{\sigma}], where
\hat{\beta} and \hat{\sigma} are the MLEs. We might evaluate this expression at the sample average
of x or at other interesting values (such as across gender or race).

d. If y_i < 10 for i = 1,...,N, \hat{\beta} and \hat{\sigma} are just the usual Tobit estimates
with the "censoring" at zero.
16.11. No. OLS always consistently estimates the parameters of a linear
projection -- provided the second moments of y and the x_j are finite, and
Var(x) has full rank K -- regardless of the nature of y or x. That is why a
linear regression analysis is always a reasonable first step for binary
outcomes, corner solution outcomes, and count outcomes, provided there is no
true data censoring.
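For a binary outcome, this first step is just the linear probability model
with heteroskedasticity-robust inference; a one-line sketch with
hypothetical variable names:

    regress y x1 x2, vce(robust)   // LPM: robust SEs, since Var(y|x) varies with x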
16.13. This extension has no practical effect on how we estimate an unobserved
effects Tobit or probit model, or how we estimate a variety of unobserved
effects panel data models with conditional normal heterogeneity. We simply
have

    c_i = -(T^{-1} \sum_{t=1}^T \pi_t)\xi + \bar{x}_i \xi + a_i \equiv \psi + \bar{x}_i \xi + a_i,

where \psi \equiv -(T^{-1} \sum_{t=1}^T \pi_t)\xi. Of course, any aggregate time dummies explicitly get
swept out of \bar{x}_i in this case but would usually be included in x_{it}.

An interesting follow-up question would have been: What if we
standardize each x_{it} by its cross-sectional mean and variance at time t, and
assume c_i is related to the mean and variance of the standardized vectors? In
other words, let z_{it} \equiv (x_{it} - \pi_t)\Omega_t^{-1/2}, t = 1,...,T, for each random draw i
from the population. Then, we might assume c_i | x_i ~ Normal(\psi + \bar{z}_i \xi, \sigma_a^2) (where,
again, z_{it} would not contain aggregate time dummies). This is the kind of
scenario that is handled by Chamberlain's more general assumption concerning
the relationship between c_i and x_i:

    c_i = \psi + \sum_{r=1}^T x_{ir} \lambda_r + a_i,   where \lambda_r = \Omega_r^{-1/2} \xi / T, r = 1,2,...,T.

Alternatively, one could estimate \pi_t and \Omega_t for each t using the cross
section observations \{x_{it}: i = 1,2,...,N\}. The usual sample means and sample
variance matrices, say \hat{\pi}_t and \hat{\Omega}_t, are consistent and \sqrt{N}-asymptotically
normal. Then, form \hat{z}_{it} \equiv (x_{it} - \hat{\pi}_t)\hat{\Omega}_t^{-1/2}, and proceed with the usual Tobit
(or probit) unobserved effects analysis that includes the time averages
\bar{\hat{z}}_i = T^{-1} \sum_{t=1}^T \hat{z}_{it}. This is a rather simple two-step estimation method, but
accounting for the sample variation in \hat{\pi}_t and \hat{\Omega}_t would be cumbersome. It
may be possible to use a much larger sample to obtain \hat{\pi}_t and \hat{\Omega}_t, in which
case one might ignore the sampling error in the first-stage estimates.
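A minimal Stata sketch of this two-step idea (not in the original text),
assuming a scalar regressor x in a panel identified by id and t, with
corner-solution outcome y; as noted above, the sampling error in the
first-stage means and standard deviations is ignored:

    egen mu_t = mean(x), by(t)            // cross-sectional mean at each t
    egen sd_t = sd(x), by(t)              // cross-sectional s.d. at each t
    gen double z = (x - mu_t)/sd_t        // standardized regressor
    egen double zbar = mean(z), by(id)    // time average for each unit
    * pooled Tobit with the time average added (Mundlak-Chamberlain device):
    tobit y x zbar i.t, ll(0) vce(cluster id)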
16.15. To be added.
CHAPTER 17
17.1. If you are interested in the effects of things like age of the building
and neighborhood demographics on fire damage, given that a fire has occurred,
then there is no problem. We simply need a random sample of buildings that
actually caught on fire. You might want to supplement this with an analysis
of the probability that buildings catch fire, given building and neighborhood
characteristics. But then a two-stage analysis is appropriate.
17.3. This is essentially given in equation (17.14). Let y_i given x_i have
density f(y | x_i; \beta, \gamma), where \beta is the vector indexing E(y_i | x_i) and \gamma is another
set of parameters (usually a single variance parameter). Then the density of
y_i given (x_i, s_i = 1), when s_i = 1[a_1(x_i) < y_i < a_2(x_i)], is

    p(y | x_i, s_i = 1) = f(y | x_i; \beta, \gamma) / [F(a_2(x_i) | x_i; \beta, \gamma) - F(a_1(x_i) | x_i; \beta, \gamma)],
                          a_1(x_i) < y < a_2(x_i).

In the Hausman and Wise (1977) study, y_i = log(income_i), a_1(x_i) = -\infty, and
a_2(x_i) was a function of family size (which determines the official poverty
level).
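In current Stata, this truncated normal regression is available directly
through truncreg. A minimal sketch, assuming y = log(income), regressors x1
and x2, and the (possibly family-size-specific) truncation point stored in
the hypothetical variable cutoff:

    truncreg y x1 x2, ul(cutoff)   // MLE of the truncated model above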
17.5. If we replace y_2 with \hat{y}_2, we need to see what happens when y_2 = z\delta_2 + v_2
is plugged into the structural model:

    y_1 = z_1 \delta_1 + \alpha_1 (z\delta_2 + v_2) + u_1
        = z_1 \delta_1 + \alpha_1 (z\delta_2) + (u_1 + \alpha_1 v_2).        (17.81)

So, the procedure is to replace \delta_2 in (17.81) with its \sqrt{N}-consistent estimator,
\hat{\delta}_2. The key is to note how the error term in (17.81) is u_1 + \alpha_1 v_2. If the
selection correction is going to work, we need the expected value of u_1 + \alpha_1 v_2
given (z, v_3) to be linear in v_3 (in particular, it cannot depend on z). Then
we can write

    E(y_1 | z, v_3) = z_1 \delta_1 + \alpha_1 (z\delta_2) + \gamma_1 v_3,

where E[(u_1 + \alpha_1 v_2) | v_3] = \gamma_1 v_3 by normality. Conditioning on y_3 = 1 gives

    E(y_1 | z, y_3 = 1) = z_1 \delta_1 + \alpha_1 (z\delta_2) + \gamma_1 \lambda(z\delta_3).        (17.82)

A sufficient condition for (17.82) is that (u_1, v_2, v_3) is independent of z
with a trivariate normal distribution. We can get by with less than this, but
the nature of v_2 is restricted. If we use an IV approach, we need assume
nothing about v_2 except for the usual linear projection assumption.

As a practical matter, if we cannot write y_2 = z\delta_2 + v_2, where v_2 is
independent of z and approximately normal, then the OLS alternative will not
be consistent. Thus, equations where y_2 is binary, or is some other variable
that exhibits nonnormality, cannot be consistently estimated using the OLS
procedure. This is why 2SLS is generally preferred.
17.7. a. Substitute the reduced forms for y_1 and y_2 into the third equation:

    y_3 = max(0, \alpha_1 (z\delta_1) + \alpha_2 (z\delta_2) + z_3 \delta_3 + v_3)
        \equiv max(0, z\pi_3 + v_3),

where v_3 \equiv u_3 + \alpha_1 v_1 + \alpha_2 v_2. Under the assumptions given, v_3 is independent of
z and normally distributed. Thus, if we knew \delta_1 and \delta_2, we could consistently
estimate \alpha_1, \alpha_2, and \delta_3 from a Tobit of y_3 on z\delta_1, z\delta_2, and z_3. From the
usual argument, consistent estimators are obtained by using initial consistent
estimators of \delta_1 and \delta_2. Estimation of \delta_2 is simple: just use OLS on the
entire sample. Estimation of \delta_1 follows exactly as in Procedure 17.3 using
the system

    y_1 = z\delta_1 + v_1,                (17.83)
    y_3 = max(0, z\pi_3 + v_3),          (17.84)

where y_1 is observed only when y_3 > 0.

Given \hat{\delta}_1 and \hat{\delta}_2, form z_i \hat{\delta}_1 and z_i \hat{\delta}_2 for each observation i in the sample.
Then, obtain \hat{\alpha}_1, \hat{\alpha}_2, and \hat{\delta}_3 from the Tobit

    y_{i3}  on  (z_i \hat{\delta}_1), (z_i \hat{\delta}_2), z_{i3}

using all observations.

For identification, (z\delta_1, z\delta_2, z_3) can contain no exact linear
dependencies. A necessary condition is that there are at least two elements
in z not also in z_3.

Obtaining the correct asymptotic variance matrix is complicated. It is
most easily done in a generalized method of moments framework.
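A minimal Stata sketch of the two-step procedure in part a (not in the
original answer; the reported standard errors ignore the first-stage
estimation error, as just noted). Assume instruments z1, z2, z3, with z3
the exogenous variable appearing in the third equation, and s = 1[y3 > 0]:

    regress y2 z1 z2 z3                        // delta2 by OLS, full sample
    predict double zd2, xb
    probit s z1 z2 z3                          // first step of Procedure 17.3
    predict double xg, xb
    gen double lam = normalden(xg)/normal(xg)  // inverse Mills ratio
    regress y1 z1 z2 z3 lam if s == 1          // delta1 from the selected sample
    gen double zd1 = _b[_cons] + _b[z1]*z1 + _b[z2]*z2 + _b[z3]*z3
    tobit y3 zd1 zd2 z3, ll(0)                 // final Tobit, all observations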
b. This is not very different from part a. The only difference is that
\delta_2 must be estimated using Procedure 17.3. Then follow the steps from part a.

c. We need to estimate the variance of u_3, \sigma_3^2.
17.9. To be added.
17.11. a. There is no sample selection problem because, by definition, you
have specified the distribution of y given x and y > 0. We only need to
obtain a random sample from the subpopulation with y > 0.
b. Again, there is no sample selection bias because we have specified the
conditional expectation for the population of interest. If we have a random
sample from that population, NLS is generally consistent and \sqrt{N}-asymptotically
normal.

c. We would use a standard probit model. Let w = 1[y > 0]. Then w given
x follows a probit model with P(w = 1 | x) = \Phi(x\gamma).

d. E(y | x) = P(y > 0 | x) \cdot E(y | x, y > 0) = \Phi(x\gamma) \cdot exp(x\beta). So we would plug in
the NLS estimator of \beta and the probit estimator of \gamma.
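A minimal Stata sketch of parts b-d (not part of the original answer), with
hypothetical regressors x1, x2 and corner-solution outcome y assumed to
have no missing values:

    gen w = (y > 0)
    probit w x1 x2                                   // part c
    predict double phat, pr
    nl (y = exp({b0} + {b1}*x1 + {b2}*x2)) if y > 0  // part b: NLS for E(y|x, y>0)
    predict double mupos, yhat
    gen double Ey = phat*mupos                       // part d: estimate of E(y|x)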
e. Not when you specify the conditional distributions, or conditional
means, for the two parts. By definition, there is no sample selection
problem. Confusion arises, I think, when two-part models are specified with
unobservables that may be correlated. For example, we could write

    y = w \cdot exp(x\beta + u),
    w = 1[x\gamma + v > 0],

so that w = 0 \Rightarrow y = 0. Assume that (u, v) is independent of x. Then, if u and
v are independent -- so that u is independent of (x, w) -- we have