Instrumental Variables with Heterogeneous Effects
Magne Mogstad
2/126
Linear IV with heterogeneous effects
When estimating the effect of D on Y with instrument Z, the standard textbook case presents the outcome equation with homogeneous effects:
$$Y = \alpha + \beta D + U$$
But we can link the observed outcome Y to the potential outcomes (Y0, Y1):
$$Y_i = \underbrace{E[Y_0]}_{\alpha} + \underbrace{(Y_{1i} - Y_{0i})}_{\beta_i} D_i + \underbrace{Y_{0i} - E[Y_0]}_{U_i} \equiv \alpha + \beta_i D_i + U_i$$
What does linear IV identify when treatment effects are heterogeneous?
This question is the focus of much of applied micro.
Arguably this is reverse engineering, like playing Jeopardy: we are handed the answer (the estimand) and ask which question it answers.
Later we start with a question (the target parameter), and then ask how to answer it (identify and estimate the target parameter).
3/126
Heterogeneous potential outcome set-up
The instrument initiates a causal chain, whereby Z affects the variable of interest D, which in turn affects Y.
Keeping this in mind we can adopt the potential outcome set-up:
- $D_z$ is treatment status at instrument value Z = z
- $Y_{d,z}$ is the outcome of individual i if he receives treatment D = d and instrument value Z = z
We can now define various causal effects:
- $Y_{1,Z} - Y_{0,Z}$
- $Y_{D,1} - Y_{D,0}$
- $Y_{1,z} - Y_{0,z}$
- $Y_{d,1} - Y_{d,0}$
- $D_1 - D_0$
4/126
Heterogeneous potential outcome set-up
The first assumption in the heterogeneous effects set-up is random assignment
Random assignment: $Y_{d,z}, D_z \perp Z \quad \forall\, d, z$
This is sufficient to identify the average causal effect of Z on Y (and of Z on D):
$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[Y_{D_1,1} \mid Z=1] - E[Y_{D_0,0} \mid Z=0] = E[Y_{D_1,1} - Y_{D_0,0}]$$
5/126
Heterogeneous potential outcome set-up
The second assumption in the heterogeneous effects set-up is the exclusion restriction
Exclusion restriction: $Y_{d,1} = Y_{d,0}$
This states that any effect of Z on Y must be via an effect of Z on D
The exclusion restriction is often expressed by omitting Z from the equation of interest: Y = α + β · D + U
Random assignment + exclusion restriction = instrument exogeneity
These are conceptually distinct problems: argue them one at a time!
6/126
Heterogeneous potential outcome set-up
The third assumption in the heterogeneous effects set-up is the existence of a first stage
First stage: $E[D_1 - D_0] \neq 0$
This requires the instrument Z to have some effect on the average probability of treatment
Note: For the (usual) statistical inference (which relies on the standard first-order asymptotic approximation invoked in large-sample theory), the first stage should not be too close to zero (more on that later)
7/126
Heterogeneous potential outcome set-up
The fourth assumption in the heterogeneous effects set-up is monotonicity
Monotonicity: $D_1 \geq D_0 \ \forall i$ (or vice versa)
This says that all those affected by the instrument are affected in the same direction
Note: Uniformity would be better terminology.
The monotonicity assumption does not imply that treatment is a monotonic function of the instrument (which becomes relevant with multiple instruments or when the instrument takes multiple values).
8/126
Local Average Treatment Effect (LATE)
A variable Z is an instrumental variable for the causal effect of D on Y if the following assumptions hold:
1. Random assignment: $Y_{d,z}, D_z \perp Z \quad \forall\, d, z$
- gives the causal effect of Z on D (1st stage) and Y (reduced form)
2. Exclusion restriction: $Y_{d,1} = Y_{d,0} = Y_d$
- so that the causal effect of Z on Y is only due to the effect of Z on D
3. Monotonicity: $D_1 \geq D_0$, or vice versa
- to avoid offsetting effects
4. First stage: $E[D_1 - D_0] \neq 0$
- because we need treatment variation in the sample
The Wald estimand then gives the Local Average Treatment Effect:
$$\beta_{IV} = E[\beta \mid D_1 = 1, D_0 = 0]$$
the average treatment effect for those affected by the instrument
9/126
Local Average Treatment Effect (LATE)
The Wald estimand can be interpreted as the effect of treatment on outcomes for individuals who were treated because Z = 1, but who would not have been treated otherwise
To see why this is so, we can divide the population into four groups:
1. Compliers: D1 = 1 and D0 = 0;
2. Always-takers: D1 = 1 and D0 = 1;
3. Never-takers: D1 = 0 and D0 = 0;
4. Defiers: D1 = 0 and D0 = 1;
Note: The terminology is much used but a bit confusing (at least to me).
Always-takers are not always taking treatment. Never-takers are not never taking treatment. Everything is specific to the instrument at hand.
With other instruments, always-taker, never-taker and complier status may change
10/126
Local Average Treatment Effect: Proof
We saw that (by independence)
$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[Y_{D_1} - Y_{D_0}]$$
The average causal effect of Z on Y can be written as a weighted average of the causal effects for the four sub-populations:
$$\begin{aligned}
E[Y_{D_1} - Y_{D_0}] ={}& E[Y_{D_1} - Y_{D_0} \mid \text{Complier}] \times P(D_1 = 1, D_0 = 0) \\
&+ E[\underbrace{Y_{D_1} - Y_{D_0}}_{=Y_0 - Y_0 = 0} \mid \text{Never-taker}] \times P(D_1 = 0, D_0 = 0) \\
&+ E[\underbrace{Y_{D_1} - Y_{D_0}}_{=Y_1 - Y_1 = 0} \mid \text{Always-taker}] \times P(D_1 = 1, D_0 = 1) \\
&+ E[Y_{D_1} - Y_{D_0} \mid \text{Defier}] \times \underbrace{P(D_1 = 0, D_0 = 1)}_{=0}
\end{aligned}$$
12/126
Local Average Treatment Effect: Proof
By monotonicity D1 ≥ D0, which implies that there are no defiers. Hence
$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[Y_1 - Y_0 \mid \text{Complier}] \times P(D_1 = 1, D_0 = 0)$$
and by independence and monotonicity we can show that
$$E[D \mid Z=1] - E[D \mid Z=0] = E[D_1 - D_0] = P(D_1 = 1, D_0 = 0)$$
From this it follows that the Wald estimand is equal to the average treatment effect for the compliers:
$$\frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]} = \frac{E[Y_1 - Y_0 \mid \text{Complier}] \times P(D_1 = 1, D_0 = 0)}{P(D_1 = 1, D_0 = 0)} = E[Y_1 - Y_0 \mid \text{Complier}]$$
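The result can be checked numerically. The following is a minimal simulation sketch, not part of the original slides: the type shares (50% compliers, 30% always-takers, 20% never-takers), the effect sizes, and all variable names are illustrative assumptions. With no defiers, the Wald ratio should be close to the complier effect (1.0 here) rather than the population average effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Hypothetical type shares: compliers, always-takers, never-takers (no defiers).
types = rng.choice(["c", "a", "n"], size=n, p=[0.5, 0.3, 0.2])
z = rng.integers(0, 2, size=n)            # randomly assigned binary instrument

d0 = (types == "a").astype(int)           # treatment status if Z = 0
d1 = (types != "n").astype(int)           # treatment status if Z = 1
d = np.where(z == 1, d1, d0)              # observed treatment

# Assumed heterogeneous gains: compliers 1, always-takers 3, never-takers 0.
gain = np.select([types == "c", types == "a"], [1.0, 3.0], default=0.0)
y0 = rng.normal(size=n) + (types == "a")  # always-takers also differ in levels
y = y0 + gain * d                         # observed outcome

reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

print(f"first stage (complier share): {first_stage:.3f}")
print(f"Wald estimate:                {wald:.3f}")
```

The never-takers and always-takers drop out of the reduced form because their treatment status does not respond to Z, which is the step the proof above relies on.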
13/126
LATE: Interpretation and relevance
With heterogeneous effects, IV estimates the average causal effect for compliers
Different valid instruments for the same causal relation therefore estimate different things (different groups of compliers)
- An overidentifying restrictions test (Sargan test) might reject even if all instruments are valid.
- The policy relevance of the IV estimate depends on the policy relevance of the instrument
Note: We cannot identify the compliers because we can never observe both D0 and D1 (thus, we don't know who the compliers are)
- those with Z = 1 and D = 1 can be compliers or always-takers
- those with Z = 0 and D = 0 can be compliers or never-takers
14/126
Compliers: How many and what do they look like
The size of the complier group is the Wald first stage:
$$P(D_1 = 1, D_0 = 0) = E[D \mid Z=1] - E[D \mid Z=0]$$
Or among the treated:
$$P(D_1 - D_0 = 1 \mid D = 1) = \frac{P(D = 1 \mid D_1 > D_0)\, P(D_1 > D_0)}{P(D = 1)} = \frac{P(Z = 1)\,(E[D \mid Z=1] - E[D \mid Z=0])}{P(D = 1)}$$
We cannot identify compliers, but we can describe them:
$$\frac{P(X = x \mid D_1 > D_0)}{P(X = x)} = \frac{P(D_1 > D_0 \mid X = x)}{P(D_1 > D_0)} = \frac{E[D \mid Z=1, X=x] - E[D \mid Z=0, X=x]}{E[D \mid Z=1] - E[D \mid Z=0]}$$
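These formulas translate directly into ratios of first stages. Below is a small simulated sketch under assumed values (a binary covariate X, a complier share of 0.7 when X = 1 and 0.3 when X = 0, and 10% always-takers); none of these numbers come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
x = rng.integers(0, 2, size=n)                 # binary covariate
z = rng.integers(0, 2, size=n)                 # randomly assigned instrument

# Assumed complier share depends on X: 0.7 if X = 1, 0.3 if X = 0.
u = rng.random(n)
complier = u < np.where(x == 1, 0.7, 0.3)
always = (~complier) & (u > 0.9)               # 10% always-takers
d = np.where(always, 1, np.where(complier, z, 0))

def first_stage(d, z):
    """Difference in treatment rates between Z = 1 and Z = 0."""
    return d[z == 1].mean() - d[z == 0].mean()

fs_all = first_stage(d, z)
fs_x1 = first_stage(d[x == 1], z[x == 1])

# P(X = 1 | complier) / P(X = 1), estimated as a ratio of first stages.
relative_likelihood = fs_x1 / fs_all
true_ratio = complier[x == 1].mean() / complier.mean()

# Share of compliers among the treated: P(Z = 1) * first stage / P(D = 1).
share_among_treated = z.mean() * fs_all / d.mean()
true_share = (complier & (d == 1)).sum() / (d == 1).sum()

print(f"relative likelihood of X = 1 among compliers: {relative_likelihood:.3f}")
print(f"complier share among the treated:            {share_among_treated:.3f}")
```

In real data complier status is unobserved, so only the first-stage ratios are available; the simulated `true_ratio` and `true_share` serve only as a check.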
15/126
LATE extensions
Until now we considered the IV model with heterogeneity in the simple case of
- average effects (for compliers)
- binary treatment, binary instrument
- no covariates
What happens when we relax these assumptions?
Angrist and Pischke (2009, p. 173) write that "The econometric tool remains 2SLS and the interpretation remains fundamentally similar to the basic LATE result, with a few bells and whistles."
Is this really true? (spoiler: no, it's not!)
But first, let's see that even in the simple case, linear IV does not reveal all the information about potential outcomes available in the data
16/126
Extension I: Counterfactual distributions
17/126
Counterfactual distributions
Imbens & Rubin (1997) show that we can estimate more than average causal effects for compliers
They show how to recover the complete marginal distributions of the outcome
- under different treatments for the compliers
- under the treatment for the always-takers
- without the treatment for the never-takers
These results allow us to draw inference about the effect on the outcome distribution of compliers (QTE of compliers)
They can also be used to test instrument exogeneity & monotonicity
Even exactly identified models can have testable implications (unlike what is claimed in MHE).
18/126
Counterfactual distributions
First introduce some shorthand notation
Ci = n ⇐⇒ D1 = D0 = 0
Ci = a ⇐⇒ D1 = D0 = 1
Ci = c ⇐⇒ D1 = 1, D0 = 0
Ci = d ⇐⇒ D1 = 0, D0 = 1
For the different combinations of Z and D, we know the following:
        | D = 0 | D = 1
  Z = 0 | n, c  | a
  Z = 1 | n     | a, c
19/126
Counterfactual distributions
Distribution of types
Since Z is random we know that the distribution of types a, n, c is the same for each value of Z and in the population as a whole
Therefore, this...
        | D = 0 | D = 1
  Z = 0 | n, c  | a
  Z = 1 | n     | a, c
...implies the following:
$$p_a = \Pr(D = 1 \mid Z = 0)$$
$$p_n = \Pr(D = 0 \mid Z = 1)$$
$$p_c = 1 - p_a - p_n$$
20/126
Counterfactual distributions
Identifying distributions
Let's use the following notation for the observed marginal distribution of Y conditional on Z and D:
$$f_{zd}(y) \equiv f(y \mid Z = z, D = d)$$
Therefore, this...
        | D = 0 | D = 1
  Z = 0 | n, c  | a
  Z = 1 | n     | a, c
...implies the following:
$$f_{10}(y) = g_n(y)$$
$$f_{01}(y) = g_a(y)$$
$$f_{00}(y) = g_{c0}(y) \cdot \frac{p_c}{p_c + p_n} + g_n(y) \cdot \frac{p_n}{p_c + p_n}$$
$$f_{11}(y) = g_{c1}(y) \cdot \frac{p_c}{p_c + p_a} + g_a(y) \cdot \frac{p_a}{p_c + p_a}$$
21/126
Counterfactual distributions
Example
To illustrate the above, consider Dutch data (see Ketel et al., 2016, AEJ: Applied).
- Lottery outcome as instrument for medical school completion
- D = 1 if completed medical school
- Z = 1 if offered medical school after successful lottery
. ta z d

           |           d
         z |         0          1 |     Total
-----------+----------------------+----------
         0 |       269        187 |       456
         1 |        71        949 |     1,020
-----------+----------------------+----------
     Total |       340      1,136 |     1,476
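As a worked example, the type shares defined earlier (pa, pn, pc) can be computed directly from the four cell counts in this cross-tabulation. The short sketch below uses only those counts; the variable names are my own.

```python
# Cell counts from the cross-tabulation of Z (lottery offer) and D (completion).
n_z0_d0, n_z0_d1 = 269, 187
n_z1_d0, n_z1_d1 = 71, 949

p_always = n_z0_d1 / (n_z0_d0 + n_z0_d1)     # Pr(D = 1 | Z = 0)
p_never = n_z1_d0 / (n_z1_d0 + n_z1_d1)      # Pr(D = 0 | Z = 1)
p_complier = 1 - p_always - p_never

# Equivalent to the first stage E[D | Z = 1] - E[D | Z = 0].
first_stage = n_z1_d1 / (n_z1_d0 + n_z1_d1) - p_always

print(f"always-takers: {p_always:.3f}")
print(f"never-takers:  {p_never:.3f}")
print(f"compliers:     {p_complier:.3f}")
```

In these data roughly 41% are always-takers, 7% never-takers and 52% compliers.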
22/126
Counterfactual distributions
$f_{10}(y) = g_n(y)$
[Figure: distribution of Y0 for never-takers; horizontal axis: log(Wage)]
23/126
Counterfactual distributions
$f_{01}(y) = g_a(y)$
[Figure: distribution of Y1 for always-takers; horizontal axis: log(Wage)]
24/126
Counterfactual distributions
We have seen that we can estimate pa, pn, pc and also gn(y) (= f10(y)) and ga(y) (= f01(y))
By rearranging the following
$$f_{00}(y) = g_{c0}(y) \cdot \frac{p_c}{p_c + p_n} + g_n(y) \cdot \frac{p_n}{p_c + p_n}$$
$$f_{11}(y) = g_{c1}(y) \cdot \frac{p_c}{p_c + p_a} + g_a(y) \cdot \frac{p_a}{p_c + p_a}$$
we can back out the counterfactual distributions for the compliers:
$$g_{c0}(y) = f_{00}(y) \cdot \frac{p_c + p_n}{p_c} - f_{10}(y) \cdot \frac{p_n}{p_c}$$
$$g_{c1}(y) = f_{11}(y) \cdot \frac{p_c + p_a}{p_c} - f_{01}(y) \cdot \frac{p_a}{p_c}$$
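The rearrangement can be illustrated with a discrete outcome. The sketch below is an assumption-laden toy: the three-bin probability mass functions and the type shares are made up, and `complier_distributions` is a hypothetical helper name. It builds the observed mixtures from known type-specific distributions and then recovers the complier distributions with the two formulas.

```python
import numpy as np

def complier_distributions(f00, f10, f11, f01, p_a, p_n):
    """Back out complier outcome distributions from observed (Z, D) cells."""
    p_c = 1.0 - p_a - p_n
    g_c0 = f00 * (p_c + p_n) / p_c - f10 * p_n / p_c
    g_c1 = f11 * (p_c + p_a) / p_c - f01 * p_a / p_c
    return g_c0, g_c1

# Hypothetical type shares and type-specific pmfs over three outcome bins.
p_a, p_n = 0.3, 0.2
p_c = 1.0 - p_a - p_n
g_n = np.array([0.6, 0.3, 0.1])          # Y0 for never-takers
g_a = np.array([0.1, 0.3, 0.6])          # Y1 for always-takers
g_c0_true = np.array([0.5, 0.4, 0.1])    # Y0 for compliers
g_c1_true = np.array([0.2, 0.4, 0.4])    # Y1 for compliers

# Observed distributions implied by the mixture structure.
f10, f01 = g_n, g_a
f00 = (p_c * g_c0_true + p_n * g_n) / (p_c + p_n)
f11 = (p_c * g_c1_true + p_a * g_a) / (p_c + p_a)

g_c0, g_c1 = complier_distributions(f00, f10, f11, f01, p_a, p_n)
print("recovered g_c0:", np.round(g_c0, 3))
print("recovered g_c1:", np.round(g_c1, 3))
```

With estimated rather than true inputs, the recovered values need not be non-negative, which is the observation behind the validity test discussed later.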
25/126
Counterfactual distributions
$g_{c0}(y) = f_{00}(y) \cdot (p_c + p_n)/p_c - f_{10}(y) \cdot p_n/p_c$
[Figure: distribution of Y0 for compliers; horizontal axis: log(Wage)]
26/126
Counterfactual distributions
$g_{c1}(y) = f_{11}(y) \cdot (p_c + p_a)/p_c - f_{01}(y) \cdot p_a/p_c$
[Figure: distribution of Y1 for compliers; horizontal axis: log(Wage)]
27/126
Counterfactual distributions
[Figure: distributions of Y1 and Y0 for compliers; horizontal axis: log(Wage)]
28/126
Counterfactual distributions
[Figure: distributions of Y1 and Y0 for compliers, Y1 for always-takers, and Y0 for never-takers; horizontal axis: log(Wage)]
29/126
Counterfactual distributions
We can also show that
$$E[Y_1 \mid C = c] = \frac{E[Y \cdot D \mid Z=1] - E[Y \cdot D \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}$$
and
$$E[Y_0 \mid C = c] = \frac{E[Y \cdot (1-D) \mid Z=1] - E[Y \cdot (1-D) \mid Z=0]}{E[1-D \mid Z=1] - E[1-D \mid Z=0]}$$
30/126
Counterfactual distributions
. ivregress 2sls lnw (d = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
         lnw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   .1871175   .0485501     3.85   0.000     .0919609     .282274
       _cons |   3.010613   .0382073    78.80   0.000     2.935728    3.085498
------------------------------------------------------------------------------
. g y1 = lnw*d
. ivregress 2sls y1 (d = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
          y1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   3.264167   .0387887    84.15   0.000     3.188142    3.340191
       _cons |  -.0617161   .0275252    -2.24   0.025    -.1156644   -.0077678
------------------------------------------------------------------------------
. g y0 = lnw*(1-d)
. g md = 1-d
. ivregress 2sls y0 (md = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
          y0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          md |   3.077049   .0293153   104.96   0.000     3.019592    3.134506
       _cons |  -.0047203   .0047455    -0.99   0.320    -.0140213    .0045806
------------------------------------------------------------------------------
. di 3.264167 - 3.077049
.187118
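The Stata log shows that the difference between the two complier means reproduces the IV coefficient on d. This is an algebraic identity of the Wald ratios, which the following Python sketch verifies on simulated data. The data-generating process and the helper `wald` are illustrative assumptions, not the Dutch lottery data.

```python
import numpy as np

def wald(outcome, treatment, z):
    """Wald ratio: reduced-form difference over first-stage difference."""
    num = outcome[z == 1].mean() - outcome[z == 0].mean()
    den = treatment[z == 1].mean() - treatment[z == 0].mean()
    return num / den

rng = np.random.default_rng(2)
n = 10_000
z = rng.integers(0, 2, size=n)
u = rng.random(n)
d = (u < 0.3 + 0.5 * z).astype(int)          # Z weakly raises take-up for everyone
y = 3.0 + 0.2 * d + 0.3 * (u < 0.3) + rng.normal(scale=0.5, size=n)

mean_y1_compliers = wald(y * d, d, z)            # E[Y1 | complier]
mean_y0_compliers = wald(y * (1 - d), 1 - d, z)  # E[Y0 | complier]
late = wald(y, d, z)

print(f"E[Y1 | c] - E[Y0 | c] = {mean_y1_compliers - mean_y0_compliers:.6f}")
print(f"Wald / IV estimate    = {late:.6f}")
```

The two printed numbers coincide up to floating-point error, because Y·D + Y·(1−D) = Y and the first stage for 1−D is the negative of the first stage for D.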
31/126
Testing instrument validity
The above discussion points to a test for instrument validity (or, equivalently, a test for monotonicity given exogeneity)
Basic idea: Under the IV assumptions, the complier distribution should actually be a distribution
- By definition, a probability can never be negative.
- Thus, a density can never be negative
- For binary Y, it means that E(Y | C = c) needs to be between 0 and 1
Kitagawa (2015) develops a formal statistical test based on these implications
32/126
Extension II: Multiple instruments
33/126
LATE with multiple instruments
Assume we have 2 mutually exclusive (and, for simplicity, independent) binary instruments
(Without loss of generality: two non-exclusive instruments can be made mutually exclusive by working with Z1(1−Z2), Z2(1−Z1), Z1Z2)
We can then estimate two different LATEs:
$$\beta_{Z_j} = \frac{cov(Y, Z_j)}{cov(D, Z_j)} = E[Y_1 - Y_0 \mid D_{Z_j=1} - D_{Z_j=0} = 1]$$
In practice researchers often combine the instruments using 2SLS
The 2SLS estimator is
$$\beta_{2SLS} = \frac{cov(Y, \hat{D})}{cov(D, \hat{D})}$$
where $\hat{D} = \pi_1 Z_1 + \pi_2 Z_2$
34/126
LATE with multiple instruments
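Expanding $\beta_{2SLS}$ gives
$$\begin{aligned}
\beta_{2SLS} &= \pi_1 \frac{cov(Y, Z_1)}{cov(D, \hat{D})} + \pi_2 \frac{cov(Y, Z_2)}{cov(D, \hat{D})} \\
&= \pi_1 \frac{cov(D, Z_1)}{cov(D, \hat{D})} \frac{cov(Y, Z_1)}{cov(D, Z_1)} + \pi_2 \frac{cov(D, Z_2)}{cov(D, \hat{D})} \frac{cov(Y, Z_2)}{cov(D, Z_2)} \\
&= \psi \beta_{Z_1} + (1 - \psi) \beta_{Z_2}
\end{aligned}$$
where
$$\psi \equiv \frac{\pi_1 cov(D, Z_1)}{\pi_1 cov(D, Z_1) + \pi_2 cov(D, Z_2)}$$
is the relative strength of Z1 in the first stage
Under assumptions 1-4, the 2SLS estimate is an instrument-strength weighted average of the instrument-specific LATEs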
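The decomposition is an algebraic property of 2SLS and can be verified in any sample. The sketch below uses simulated data with two independent binary instruments; the data-generating process, the helper `cov`, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z1 = rng.integers(0, 2, size=n)
z2 = rng.integers(0, 2, size=n)
u = rng.random(n)
d = (u < 0.2 + 0.3 * z1 + 0.1 * z2).astype(float)   # both instruments raise take-up
y = 1.0 + (1.0 + 2.0 * u) * d + rng.normal(size=n)  # effect varies with u

def cov(a, b):
    """Sample covariance (divisor n)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# First stage by OLS: D on a constant, Z1 and Z2.
X = np.column_stack([np.ones(n), z1, z2])
pi = np.linalg.lstsq(X, d, rcond=None)[0]
d_hat = X @ pi

beta_2sls = cov(y, d_hat) / cov(d, d_hat)
beta_z1 = cov(y, z1) / cov(d, z1)                    # instrument-specific IV
beta_z2 = cov(y, z2) / cov(d, z2)
psi = pi[1] * cov(d, z1) / (pi[1] * cov(d, z1) + pi[2] * cov(d, z2))

print(f"2SLS:                      {beta_2sls:.4f}")
print(f"psi*b_Z1 + (1 - psi)*b_Z2: {psi * beta_z1 + (1 - psi) * beta_z2:.4f}")
print(f"psi:                       {psi:.3f}")
```

The weight ψ is larger for the instrument that moves treatment more, so the 2SLS estimate leans toward the LATE of the stronger instrument regardless of which complier group is of interest.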
35/126
Questions with multiple instruments?
- What question does the 2SLS weighted average of LATEs answer?
- Why not some other weighted average (e.g. use GMM or LIML)?
- Is monotonicity more restrictive with multiple instruments?
- Can one do without monotonicity?
Some papers do IV with heterogeneity without invoking monotonicity
See, for example, much of the work by Manski but also Heckman and Pinto (2018) and Mogstad, Walters and Torgovitsky (2019)
36/126
Interpreting Monotonicity with Multiple Instruments
Notation
- Binary treatment D ∈ {0,1}
- Potential treatments $D_z$ for instrument values z ∈ Z
IA monotonicity condition (IAM)
For all z, z′ ∈ Z either:
- $D_z \geq D_{z'}$ for all individuals, or
- $D_z \leq D_{z'}$ for all individuals
- IA monotonicity is uniformity, not monotonicity
- Pairwise instrument shifts push everyone to or from treatment
37/126
Choice Behavior
- Random utility model
V(d, z) is the indirect utility from choosing d when the instrument is z:
$$D_z = \arg\max_{d \in \{0,1\}} V(d, z) = 1[V(z) \geq 0]$$
where V(z) ≡ V(1, z) − V(0, z) is net indirect utility
Illustrative example:
- $D_z$ ∈ {0,1} is whether to attend college
- Z1 is a tuition subsidy
- Z2 is proximity to a college
- $D_z$ should be an increasing function of z
- This neither implies nor is implied by IA monotonicity
- What is implied by IA monotonicity? Restrictions on V(z)?
38/126
Binary Instruments
- IA monotonicity does not permit individuals to differ in responses
- All individuals must find either tuition or distance more compelling
39/126
Continuous Instruments
- z∗ is a point of indifference for j and k
- IA monotonicity fails if marginal rates of substitution are different
40/126
Homogenous Marginal Rates of Substitution
- Let z∗ be a point at which V(z) is differentiable
- Let $I(z^*) = \{i \in I : V_i(z^*) = 0\}$
- IA monotonicity implies that
$$\partial_1 V_j(z^*)\, \partial_2 V_k(z^*) = \partial_1 V_k(z^*)\, \partial_2 V_j(z^*), \quad \forall j, k \in I(z^*)$$
- Natural discrete choice specification:
$$V(z) = B_0 + B_1 Z_1 + 1 \times Z_2$$
- where (B0, B1) are unobserved
- B1 controls variation in taste for tuition relative to proximity
- IA monotonicity requires no variation over individuals: Var(B1) = 0
41/126
Extension III: Variable treatment intensity
42/126
Variable treatment intensity
Assume treatment is no longer binary but varies in its level
$$S \in \{0, 1, 2, \ldots, J\}$$
such as, for example, years of schooling.
We can then define potential outcomes indexed by the level of treatment
$$Y_S$$
Potential treatments (schooling levels) are, as before, indexed by the value of the instrument
$$S_Z$$
so that with a binary instrument the observed level of schooling is
$$S = Z S_1 + (1 - Z) S_0$$
43/126
Variable treatment intensity
The observed outcome is
$$Y = \sum_{s=0}^{J} Y_s 1[S = s] = Y_0 + \sum_{s=1}^{J} (Y_s - Y_{s-1}) 1[S \geq s]$$
The average effect of the s-th year of schooling is then
$$E[Y_s - Y_{s-1}]$$
and we now have J different treatment effects
Even so, researchers often estimate a linear-in-parameters model:
$$Y = \alpha + \beta S + u$$
One possibility is to take the linearity restriction literally
Another option is to reverse-engineer
(A third possibility is to start with a target parameter...)
44/126
Variable treatment intensity
As before we need to make an independence assumption
$$Y_{s,z}, S_z \perp Z \quad \forall\, s, z$$
and an exclusion restriction
$$Y_{s,z} = Y_s$$
We further need a monotonicity assumption
$$S_1 \geq S_0$$
and instrument relevance
$$E[S_1 - S_0] \neq 0$$
45/126
Variable treatment intensity
Example with 3 levels
Monotonicity implies
$$1[S_1 \geq s] - 1[S_0 \geq s] \in \{0, 1\}$$
so that
$$\Pr(1[S_1 \geq s] > 1[S_0 \geq s]) = \Pr(S_1 \geq s > S_0)$$
If this probability is greater than 0, then the instrument affects the incidence of treatment level s.
$$\begin{aligned}
E[S \mid Z=1] - E[S \mid Z=0] &\overset{(1)}{=} [\Pr(S_1 \geq 1 \mid Z=1) - \Pr(S_0 \geq 1 \mid Z=0)] + [\Pr(S_1 \geq 2 \mid Z=1) - \Pr(S_0 \geq 2 \mid Z=0)] \\
&\overset{(2)}{=} \Pr(S_1 \geq 1 > S_0) + \Pr(S_1 \geq 2 > S_0)
\end{aligned}$$
where (1) follows because the mean is the sum (or integral) of 1 minus the CDF, and (2) because of independence (together with monotonicity).
46/126
Variable treatment intensity
Example with 3 levels
With three treatment intensities S ∈ {0, 1, 2} we observe
$$Y = Y_0 + (Y_1 - Y_0) 1[S \geq 1] + (Y_2 - Y_1) 1[S \geq 2]$$
Using this we can expand the reduced form as follows
$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[(Y_1 - Y_0)(1[S_1 \geq 1] - 1[S_0 \geq 1])] + E[(Y_2 - Y_1)(1[S_1 \geq 2] - 1[S_0 \geq 2])]$$
47/126
Variable treatment intensity
Average Causal Response
We can now define
$$\omega_s = \frac{\Pr(S_1 \geq s > S_0)}{\sum_{j=1}^{J} \Pr(S_1 \geq j > S_0)}$$
and express the Wald estimand as follows
$$\frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[S \mid Z=1] - E[S \mid Z=0]} = \sum_{s=1}^{J} \omega_s E[Y_s - Y_{s-1} \mid S_1 \geq s > S_0]$$
which Angrist and Imbens call the average causal response (ACR).
48/126
Variable treatment intensity
Average Causal Response
We cannot estimate $E[Y_s - Y_{s-1} \mid S_1 \geq s > S_0]$ for the different local complier groups
What we can do is estimate their weights in the ACR, since
$$\begin{aligned}
\Pr(S_1 \geq s > S_0) &= \Pr(S_1 \geq s) - \Pr(S_0 \geq s) \\
&= \Pr(S_0 < s) - \Pr(S_1 < s) \\
&= \Pr(S < s \mid Z=0) - \Pr(S < s \mid Z=1)
\end{aligned}$$
which allows us to estimate $\omega_s$
Note: although the ACR is a positively weighted average, it
- averages together components that are potentially overlapping
- cannot be expressed as a positively weighted average of causal effects across mutually exclusive subgroups (unlike the LATE)
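Estimating the weights only requires the CDF of S in the two instrument groups. The following sketch simulates a three-level treatment under monotonicity (the shift probability of 0.4 and all names are illustrative assumptions) and computes the weights from CDF differences.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
z = rng.integers(0, 2, size=n)

# Potential treatment levels with S1 >= S0 (monotonicity holds by construction).
s0 = rng.integers(0, 3, size=n)                      # S0 in {0, 1, 2}
s1 = np.minimum(s0 + (rng.random(n) < 0.4), 2)       # some move up one level
s = np.where(z == 1, s1, s0)                         # observed treatment level

# Pr(S1 >= j > S0) = Pr(S < j | Z = 0) - Pr(S < j | Z = 1), for j = 1, 2.
levels = np.arange(1, 3)
cdf_gap = np.array(
    [np.mean(s[z == 0] < j) - np.mean(s[z == 1] < j) for j in levels]
)
weights = cdf_gap / cdf_gap.sum()                    # ACR weights omega_s

first_stage = s[z == 1].mean() - s[z == 0].mean()
print("CDF differences:", np.round(cdf_gap, 3))
print("ACR weights:    ", np.round(weights, 3))
print(f"sum of CDF differences = {cdf_gap.sum():.4f}, first stage = {first_stage:.4f}")
```

The CDF differences add up to the first stage, which is why they serve as the normalisation in the weights.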
49/126
Variable treatment intensity
Example
Angrist & Krueger (1991) use quarter of birth as an instrument for schooling
- D = 1 if education is at least high school
- Z = 1 if born in the 4th quarter, Z = 0 if born in the 1st quarter
How does the Wald estimator weight the average unit causal response
$$E[Y_s - Y_{s-1} \mid S_1 \geq s > S_0]$$
for the compliers at the different points s?
50/126
Variable treatment intensity
[Figure: Example, schooling CDF by quarter of birth (QoB = 1, 4)]
51/126
Variable treatment intensity
[Figure: Example, differences in schooling CDF by quarter of birth (QoB = 1, 4)]
52/126
Variable treatment intensity
[Figure: Example, for different quarters of birth: 4 vs 1, 4 vs 2, 4 vs 3]
53/126
Can the weighting matter?
Løken et al. (2012) report OLS, IV and family fixed effects estimates of how family income affects children's outcomes
54/126
Can the weighting matter?
55/126
Covariates
56/126
Extensions to Covariates - Nonparametric
- Often, one wants covariates X to help justify the exogeneity of Z
- And/or to reduce residual noise in Y
- And/or to look at observed heterogeneity in treatment effects
Adjust the assumptions to be conditional on X
- Exogeneity: $(Y_0, Y_1, D_0, D_1) \perp Z \mid X$
- Relevance: $P[D = 1 \mid X, Z = 1] \neq P[D = 1 \mid X, Z = 0]$ a.s.
- Monotonicity: $P[D_1 \geq D_0 \mid X] = 1$ a.s.
- Overlap: $P[Z = 1 \mid X] \in (0, 1)$ a.s.
57/126
Non-parametric IV with Covariates
- Suppose we can estimate stratified LATEs
$$\beta(x) = \frac{E[Y \mid Z=1, X=x] - E[Y \mid Z=0, X=x]}{E[D \mid Z=1, X=x] - E[D \mid Z=0, X=x]} = E[Y_1 - Y_0 \mid D_1 - D_0 = 1, X = x]$$
- We want to go from here to some population-averaged LATE
- Which one would we like to have? Complier weighted? Population weighted?
58/126
2SLS regression with Covariates
- What does a saturated 2SLS estimation give us?
$$Y = \beta D + \alpha_x + e$$
$$D = \pi_x Z + \gamma_x + u$$
- i.e. x-dummies in both stages, and x-specific first-stage coefficients
- Angrist & Imbens (1995) show that
$$\beta = E[\beta(x)\, \omega(x)]$$
- where β(x) is the x-specific LATE, and
$$\omega(x) = \frac{\sigma^2_{\hat{D}}(x)}{E[\sigma^2_{\hat{D}}(x)]} = \frac{\pi_x^2 \sigma^2_Z(x)}{E[\pi_x^2 \sigma^2_Z(x)]}$$
- The weighting thus depends on the square of the local (to x) complier share and on the instrument variance
59/126
Abadie’s (2003) κ
- With covariates (but D, Z binary), a more elegant approach
- The idea is to run regressions only on the compliers
- Compliers aren't directly observable, but they can be weighted
- Abadie showed that for any function G = g(Y, X, D)
$$E[G \mid T = c] = \frac{1}{P[T = c]} E[\kappa G], \qquad \kappa = 1 - \frac{D(1 - Z)}{P[Z = 0 \mid X]} - \frac{Z(1 - D)}{P[Z = 1 \mid X]}$$
Intuition
- Complier = 1 − Always Taker − Never Taker
- On average, κ only applies positive weights to compliers:
$$E[\kappa \mid T = t, X, D, Y] = 1[t = c]$$
- So on average, κG picks out only the compliers
60/126
IV with Covariates
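- Abadie (2003) showed that
$$E[\kappa_0\, g(Y, X)] = E[g(Y_0, X) \mid D_1 > D_0] \Pr(D_1 > D_0)$$
$$E[\kappa_1\, g(Y, X)] = E[g(Y_1, X) \mid D_1 > D_0] \Pr(D_1 > D_0)$$
$$E[\kappa\, g(Y, D, X)] = E[g(Y, D, X) \mid D_1 > D_0] \Pr(D_1 > D_0)$$
where:
$$\kappa_0 = (1 - D)\, \frac{(1 - Z) - \Pr(Z = 0 \mid X)}{\Pr(Z = 0 \mid X) \Pr(Z = 1 \mid X)}$$
$$\kappa_1 = D\, \frac{Z - \Pr(Z = 1 \mid X)}{\Pr(Z = 0 \mid X) \Pr(Z = 1 \mid X)}$$
$$\kappa = \kappa_0 \Pr(Z = 0 \mid X) + \kappa_1 \Pr(Z = 1 \mid X) = 1 - \frac{D(1 - Z)}{\Pr(Z = 0 \mid X)} - \frac{(1 - D) Z}{\Pr(Z = 1 \mid X)}$$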
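To see the weighting at work, the sketch below simulates data in which the instrument probability depends on a binary covariate, computes κ with the known P[Z = 1|X], and checks that E[κ] recovers the complier share and E[κX]/E[κ] the complier mean of X. The design (shares, probabilities, names) is a made-up assumption; in applications P[Z = 1|X] would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000
x = rng.integers(0, 2, size=n)
p_z = np.where(x == 1, 0.7, 0.4)               # P[Z = 1 | X], known by design here
z = (rng.random(n) < p_z).astype(int)

# Assumed types: 20% always-takers; complier share 0.6 if X = 1, 0.3 if X = 0.
u = rng.random(n)
always = u < 0.2
complier = (u >= 0.2) & (u < np.where(x == 1, 0.8, 0.5))
d = np.where(always, 1, np.where(complier, z, 0))

# Abadie's kappa weights.
kappa = 1 - d * (1 - z) / (1 - p_z) - (1 - d) * z / p_z

complier_share_hat = kappa.mean()                       # estimates P[T = c]
mean_x_compliers_hat = np.mean(kappa * x) / kappa.mean()  # estimates E[X | T = c]

print(f"complier share: kappa {complier_share_hat:.3f}, truth {complier.mean():.3f}")
print(f"E[X | complier]: kappa {mean_x_compliers_hat:.3f}, truth {x[complier].mean():.3f}")
```

Individual κ values are negative for observations with (D = 1, Z = 0) or (D = 0, Z = 1); only their average isolates the compliers.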
61/126
Using Abadie’s (2003) κ
Linear/nonlinear regression
- For example, take $g(Y, X, D) = (Y - \alpha D - X'\beta)^2$; then:
$$\min_{\alpha, \beta} E[(Y - \alpha D - X'\beta)^2 \mid T = c] = \min_{\alpha, \beta} E[\kappa (Y - \alpha D - X'\beta)^2]$$
- Estimate α, β by solving a sample analog of the second problem
- This is just a weighted regression, with estimated weights ($\hat{\kappa}$)
- The result is general enough to use for many other estimators
- Specify X however you like; it still picks out the compliers
62/126
Using Abadie’s (2003) κ
Estimating κ
- To implement the result one must estimate κ, hence P[Z = 1|X]
- If P[Z = 1|X] is linear, the κ-weighted regression equals TSLS
- Of course, Z is binary, so P[Z = 1|X] typically won't be exactly linear
- Logit/probit are often close to linear, so in practice the two may be close
63/126
Empirical Example: Angrist and Evans (1998, “AE”)
Motivation
- Relationship between fertility decisions and female labor supply?
- Strong negative correlation, but these are joint choices
- Leads to many possible endogeneity stories; here's just one:
High-earning women have fewer children due to higher opportunity cost
64/126
Empirical Example: Angrist and Evans (1998, “AE”)
Empirical strategy
- Y is a labor market outcome for the woman (or her husband)
- Restrict the sample to only women (or couples) with 2 or more children
- D is an indicator for having more than 2 children (vs. exactly 2)
- Z = 1 if the first two children had the same sex
→ Based on the idea that there is a preference to have a mix of boys and girls
- Also consider Z = 1 if the second birth was a twin
→ Twins are primarily for comparison; they were used before this paper
65/126
Assumptions in AE
Exogeneity
- Requires the assumption that sex at birth is randomly assigned
- Authors conduct balance tests to support this (next slide)
- The twins instrument is less compelling
- First, it is well known that older women have twins more often (see next slide)
→ More subtly, it impacts both the number and spacing of children
Monotonicity
- Monotonicity restricts preference heterogeneity in unattractive ways
→ Some families may want two boys or two girls (then stop)
- No discussion of this in the paper, which is unfortunately common practice
- Twins is effectively a one-sided non-compliance instrument
→ Twins compliers are the untreated, since there are no twins never-takers
66/126
Evidence in Support of Exogeneity
- Same sex is uncorrelated with a variety of observed confounders
- Twins is well known to be correlated with age (so, education) and race
67/126
Wald Estimates
- First stage (denominator of Wald) for two measures of fertility
68/126
Wald Estimates
- Reduced form (numerator of Wald) for several labor market outcomes
69/126
Wald Estimates
- IV (Wald) estimator, e.g. −0.133 ≈ −0.008/0.060; these are LATEs
70/126
Two Stage Least Squares Estimates
- OLS is quite different from IV, consistent with endogeneity (selection)
71/126
Two Stage Least Squares Estimates
- Break same-sex into two instruments: two boys vs two girls
72/126
Two Stage Least Squares Estimates
- Overid test p-values: many interpretations with heterogeneity
73/126
Comparison to Abadie’s κ (Angrist 2001)
- Illustration of Abadie's κ (and other methods) using the AE data
- Results are almost identical to TSLS; the paper uses this to promote TSLS
- The logic is strange: we know that in general this is not the case
- In fact, Abadie's (2003) paper has an application where it is not
74/126
Multiple unordered treatments
75/126
Estimating equation: Example with 3 field choices
- Individuals are often choosing between multiple unordered treatments:
Education types, occupations, locations, etc.
- MHE is completely silent about multiple unordered treatments
- What does 2SLS identify in this case?
- Kirkeboen et al. (2016, QJE) discuss this in the context of educational choices
- See also Kline and Walters (2016), Heckman and Pinto (2019) and Mountjoy (2019).
76/126
Estimating equation: Example with 3 field choices
- Students choose between three fields, D ∈ {0,1,2}
- Our interest is centered on how to interpret IV (and OLS) estimates of
$$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \varepsilon$$
- Y is observed earnings
- $D_j \equiv 1(D = j)$ is an indicator variable that equals 1 if the individual chooses field j
- ε is the residual, which is potentially correlated with $D_j$
77/126
Potential earnings and field choices
- Individuals are assigned to one of three groups, Z ∈ {0,1,2}
- Linking observed and potential earnings and field choices:
$$Y = Y^0 + (Y^1 - Y^0) D_1 + (Y^2 - Y^0) D_2$$
$$D_1 = D_1^0 + (D_1^1 - D_1^0) Z_1 + (D_1^2 - D_1^0) Z_2$$
$$D_2 = D_2^0 + (D_2^1 - D_2^0) Z_1 + (D_2^2 - D_2^0) Z_2$$
- $Y^j$ is potential earnings if the individual chooses field j
- $Z_k \equiv 1(Z = k)$ is an indicator variable that equals 1 if Z is equal to k
- $D_j^z \equiv 1(D^z = j)$ is an indicator variable that equals 1 if the individual chooses field j for a given value of Z
78/126
Standard IV assumptions
- ASSUMPTION 1 (EXCLUSION): $Y^{d,z} = Y^d$ for all d, z
- ASSUMPTION 2 (INDEPENDENCE): $Y^0, Y^1, Y^2, D^0, D^1, D^2 \perp Z$
- ASSUMPTION 3 (RANK): Rank E(Z′D) = 3
- ASSUMPTION 4 (MONOTONICITY): $D_1^1 \geq D_1^0$ and $D_2^2 \geq D_2^0$
79/126
Moment conditions
- IV uses the following moment conditions:
$$E[\varepsilon Z_1] = E[\varepsilon Z_2] = E[\varepsilon] = 0$$
- Expressing these conditions in potential earnings and choices gives:
$$E[(\Delta^1 - \beta_1)(D_1^1 - D_1^0) + (\Delta^2 - \beta_2)(D_2^1 - D_2^0)] = 0 \quad (1)$$
$$E[(\Delta^1 - \beta_1)(D_1^2 - D_1^0) + (\Delta^2 - \beta_2)(D_2^2 - D_2^0)] = 0 \quad (2)$$
where
$$\Delta^j \equiv Y^j - Y^0$$
- To understand what IV can and cannot identify, we solve these equations for β1 and β2
80/126
What IV cannot identify
PROPOSITION 1
- Suppose Assumptions 1-4 hold
- Solving equations (1)-(2) for β1 and β2, it follows that βj for j = 1, 2 is a linear combination of the following three payoffs:
1. $\Delta^1$: Payoff of field 1 compared to 0
2. $\Delta^2$: Payoff of field 2 compared to 0
3. $\Delta^2 - \Delta^1 \equiv Y^2 - Y^1$: Payoff of field 2 compared to 1
81/126
Constant effects
- Suppose Assumptions 1-4 hold.
- Solving equations (1)-(2) for β1 and β2:
- If $\Delta^1$ and $\Delta^2$ are common across all individuals (constant effects):
$$\beta_1 = \Delta^1$$
$$\beta_2 = \Delta^2$$
- Alternatively, move the goal post to estimating the effect of, say, field 1 versus the next best (a combination of the other two fields)
- Back to binary treatment, but hard to interpret and requires a strong exogeneity assumption
82/126
Data on Second Choices
- In certain circumstances, one might plausibly observe next-best options
- Kirkeboen et al. (2016) show one can point identify
$$\beta_1 = E[\Delta^1 \mid D_1^1 - D_1^0 = 1, D_2^0 = 0]$$
$$\beta_2 = E[\Delta^2 \mid D_2^2 - D_2^0 = 1, D_1^0 = 0]$$
- Kirkeboen et al. (2016) do this with Norwegian admissions data
- Students apply with a list of desired fields and universities
- Assigned based on preference and merit rankings
83/126
Data on Second Choices
- Strategy-proof mechanism, so stated preferences should be actual preferences
- Conditional exogeneity uses a local type of argument
- Compare students with similar rankings and stated preferences j, k
- One is slightly above the cutoff and gets j; the other is slightly below and gets k
- An example of a (fuzzy) RDD; we will discuss these more soon
84/126
Weak and many instruments
85/126
Weak instrumentsAn instrumental variable is weak if its correlation with the includedendogenous regressor is small.
I “small” depends on inference problem at hand, and on sample size
Why is weak instruments a problem?
I Weak instrument is a “divide by (almost) zero” problem (recall IV =reduced form/first stage)
For the usual asymptotic approximation to be “good”, we would like toeffectively treat the denominator as a constant
I In other words, we would like the mean to be much larger than thestandard deviation of the denominator
I Otherwise, the finite-sample distribution can be very different fromthe asymptotic one (even in relatively “large” samples)
I And remember that 2SLS’s justification is asymptotic!
For details, see Azeem’s lecture notes
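The “divide by (almost) zero” point can be illustrated with a small simulation. This is only a sketch under assumed values: the data-generating process, the first-stage strengths (1.0 and 0.05), the sample size and the helper names `simulate_iv` and `iqr` are all illustrative choices, not taken from the lecture.

```python
import numpy as np

def simulate_iv(pi, n=200, reps=2000, beta=1.0, seed=0):
    """Simulate IV estimates (reduced form / first stage) for first-stage strength pi."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)                 # instrument
        u = rng.normal(size=n)                 # unobserved confounder
        d = pi * z + u + rng.normal(size=n)    # endogenous regressor
        y = beta * d + u + rng.normal(size=n)  # outcome, true effect = beta
        first_stage = np.cov(z, d)[0, 1]       # proportional to the first-stage slope
        reduced_form = np.cov(z, y)[0, 1]      # proportional to the reduced-form slope
        estimates[r] = reduced_form / first_stage
    return estimates

def iqr(x):
    """Interquartile range, a spread measure that tolerates heavy tails."""
    q25, q75 = np.percentile(x, [25, 75])
    return q75 - q25

strong = simulate_iv(pi=1.0)   # denominator far from zero
weak = simulate_iv(pi=0.05)    # denominator often close to zero

print("strong: median", np.median(strong), "IQR", iqr(strong))
print("weak:   median", np.median(weak), "IQR", iqr(weak))
```

With the weak first stage the denominator is often near zero, so the estimates are far more dispersed than the usual normal approximation would suggest.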
86/126
What (not) to do about weak instruments
Large literature on (how to detect) weak instruments
I Useful summary of theory and practice in Andrews et al. (2019); see also their NBER lecture slides
Standard practice is to report the usual F-stat for instruments, and proceed as usual if F exceeds 10 (or some other arbitrary number)
Increasingly people instead report the “Effective first-stage F statistic” of Montiel Olea and Pflueger (2013)
I Robust to the worst type of heteroscedasticity, serial correlation, and clustering in the second stage
The idea behind this practice is to decide if instruments are strong (TSLS “works”) or weak (use weak-instrument robust methods)
I But screening on F-statistics induces size distortions
87/126
What to do about weak instruments (con’t)
To me, it makes more sense to
1. report and interpret the reduced form
2. think hard about why your instrument could be weak (instruments come from knowledge about treatment assignment)
3. (also) report weak-instrument robust confidence sets
Weak-instrument robust confidence sets:
I Ensure correct coverage regardless of instrument strength
I No need to screen on the first stage
I Avoid pretesting bias
I Avoid throwing away applications with valid instruments just because they are weak
I Can be informative even with weak instruments
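One standard weak-instrument robust confidence set is the Anderson-Rubin set, obtained by test inversion. The sketch below assumes a single instrument, no controls beyond a constant, homoskedastic errors and a normal critical value of 1.96. The function name `ar_confidence_set`, the grid and the simulated data are hypothetical.

```python
import numpy as np

def ar_confidence_set(y, d, z, grid, crit=1.96):
    """Keep each candidate b0 for which z does not predict (y - b0*d)."""
    n = len(y)
    zc = z - z.mean()
    kept = []
    for b0 in grid:
        e = y - b0 * d
        e = e - e.mean()
        coef = zc @ e / (zc @ zc)                  # slope of (y - b0*d) on z
        resid = e - coef * zc
        se = np.sqrt(resid @ resid / (n - 2) / (zc @ zc))
        if abs(coef / se) < crit:                  # not rejected: b0 stays in the set
            kept.append(b0)
    return np.array(kept)

# Simulated example (assumed DGP, true effect = 1.0)
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.5 * z + u + rng.normal(size=n)
y = 1.0 * d + u + rng.normal(size=n)

grid = np.linspace(-2, 4, 601)
cs = ar_confidence_set(y, d, z, grid)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]     # usual IV point estimate
print("IV estimate:", b_iv, "AR set from", cs.min(), "to", cs.max())
```

No first-stage screening is involved: the test never divides by the first stage. With a very weak instrument the set can be wide or unbounded, and a finite grid would then truncate it.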
88/126
Many instruments and overfitting
At seminars (and in referee reports), people often talk about many instruments and weak instruments as if they are the same problem
Very confusing (at least to me)
Confusion may stem from Angrist and Krueger (1991)
I They looked at how years of schooling (S) affect wages (Y), using quarter of birth as the instrument (Z)
I Problem: quarter of birth produces only very small variation in years of schooling
I Thus people worry it is a weak instrument.
To overcome this issue, they interacted the instrument with many control variables (assumed to be exogenous)
They found that the IV estimate of the coefficient on years of schooling was very similar to the OLS estimate
89/126
Many instruments and overfitting (con’t)
The re-analysis of Bound et al (1993) suggests the similarity was due to overfitting
They took the data that Angrist and Krueger (1991) used and added many randomly generated variables
I They find that running the IV regression with these variables leads to a coefficient estimate that is similar to the OLS estimate
I Intuitively, the problem is that with many instruments, S and its first-stage fitted value $\hat{S}$ are essentially the “same”
I Since the true S is endogenous, this means that $\hat{S}$ is also endogenous
I This results in IV being biased towards OLS
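The overfitting mechanism can be reproduced in a simulation in the spirit of that re-analysis. Everything below is an assumption for illustration: the DGP, the number of noise instruments (200 with n = 500) and the helper names `tsls` and `ols`. It is not the original data or code.

```python
import numpy as np

def tsls(y, s, Z):
    """2SLS slope of y on s using instruments Z (a constant is added here)."""
    n = len(y)
    Zc = np.column_stack([np.ones(n), Z])
    s_hat = Zc @ np.linalg.lstsq(Zc, s, rcond=None)[0]   # first-stage fitted values
    X_hat = np.column_stack([np.ones(n), s_hat])
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]   # second-stage slope

def ols(y, s):
    return np.cov(s, y)[0, 1] / np.var(s, ddof=1)

rng = np.random.default_rng(2)
n, reps, n_noise = 500, 200, 200
res = {"ols": [], "one": [], "many": []}
for _ in range(reps):
    z = rng.normal(size=n)                      # one valid instrument
    u = rng.normal(size=n)                      # confounder
    s = 0.3 * z + u + rng.normal(size=n)        # endogenous regressor
    y = 1.0 * s + u + rng.normal(size=n)        # true effect = 1.0
    noise = rng.normal(size=(n, n_noise))       # irrelevant random "instruments"
    res["ols"].append(ols(y, s))
    res["one"].append(tsls(y, s, z))
    res["many"].append(tsls(y, s, np.column_stack([z, noise])))

med = {k: float(np.median(v)) for k, v in res.items()}
print(med)
```

Adding the noise instruments makes the first stage fit S almost as well as S itself, so the 2SLS estimate drifts from the truth towards OLS.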
90/126
Many instruments and overfitting (con’t)
In response to the many-instruments problem and overfitting, recent work studies how to select the “optimal” instruments (e.g. using Lasso)
I Not clear what optimal means with heterogeneous effects
I In most settings, it is hard to find even one good instrument
I Thus, many instruments usually involve implicit exclusion restrictions (from interacting X and Z but not S and Z)
I Effectively solving an estimation/inference issue by violating the exclusion restriction
91/126
Taking stock
92/126
Summary
IV
I The IV estimand in the binary D, binary Z case is the LATE
I Easy to interpret as the average effect for compliers
I Could be relevant for a policy intervention that affects compliers
Extensions
I 2SLS used in general cases → interpretation is complicated
I At best, a weighted average of several different (complier) groups
I When would these weights be useful to inform a counterfactual?
Reverse engineering
I These results are motivated by a backward thought process
I Start with a common estimator, then interpret the estimand
I Why not start with a parameter of interest → create an estimator?
I More on that later!
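The first point of the summary can be checked numerically. In this sketch the type shares (40% compliers, 30% always-takers, 30% never-takers) and the treatment effects (2, 5 and 0) are made-up numbers, and there are no defiers by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Latent compliance types (assumed shares); no defiers, so monotonicity holds
types = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, size=n)                      # randomly assigned binary instrument
d = np.where(types == "always", 1,
             np.where(types == "complier", z, 0))   # treatment actually taken up

# Heterogeneous effects by type (assumed values)
effect = np.select([types == "complier", types == "always"], [2.0, 5.0], default=0.0)
y = rng.normal(size=n) + d * effect                 # observed outcome

# Wald / IV estimand: reduced form divided by first stage
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
late = effect[types == "complier"].mean()           # average effect among compliers
ate = effect.mean()                                 # population average effect

print("Wald:", wald, "LATE:", late, "ATE:", ate)
```

The Wald ratio lands near the complier effect rather than the population average, because always-takers and never-takers do not change treatment status with Z.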
93/126
Practical advice when doing IV
1. Motivate your instruments
I Motivate exclusion and independence
  I How is Z generated? What do I need to control for to make it as good as randomly assigned?
  I Why is Z not in the outcome equation? What are the distinct channels through which Z can affect Y?
I Specification: what control variables should be included?
  I Conditional exclusion restrictions can be more credible
  I Assess by regressing the instrument on other pre-determined variables
I Interpretation: what is the complier group?
  I Is the instrument policy relevant?
94/126
Practical advice when doing IV
2. Check your instruments
I Always report the first stage and
  I discuss whether the magnitude and signs are as expected
  I report the (relevant) F-statistic on instruments
    I larger is better (rule-of-thumb: F > 10... but who knows what’s large enough)
  I consider also reporting weak-instrument robust confidence intervals
I Inspect the reduced-form regression of dependent variables on instruments
  I both first stage and reduced form; sign, magnitude, etc.
  I remember that the reduced form is proportional to the causal effect of interest
  I the reduced form is unbiased (and not only consistent) because this is OLS
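A minimal version of this checklist for one instrument and no controls might look as follows. The helper `slope_and_t` and the simulated data are illustrative assumptions. With a single instrument and homoskedastic errors, the first-stage F-statistic is the squared t-statistic.

```python
import numpy as np

def slope_and_t(x, y):
    """OLS slope of y on x (with a constant) and its homoskedastic t-statistic."""
    n = len(y)
    xc = x - x.mean()
    yc = y - y.mean()
    b = xc @ yc / (xc @ xc)
    resid = yc - b * xc
    se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
    return b, b / se

# Simulated data (assumed DGP, true effect = 1.0)
rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.5 * z + u + rng.normal(size=n)
y = 1.0 * d + u + rng.normal(size=n)

fs, t_fs = slope_and_t(z, d)          # first stage: D on Z
rf, t_rf = slope_and_t(z, y)          # reduced form: Y on Z
first_stage_F = t_fs ** 2             # F = t^2 with one instrument
iv = rf / fs                          # IV = reduced form / first stage

print("first stage:", fs, "F:", first_stage_F)
print("reduced form:", rf, "IV:", iv)
```

Reporting `fs`, `rf` and the F-statistic side by side makes it easy to check that signs and magnitudes are as expected before looking at the ratio.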
95/126
How do I find instruments?
I There is no "recipe" that guarantees success
I But often necessary ingredients: Detailed knowledge of
  1. the economic mechanisms, and
  2. institutions determining the endogenous regressor
  3. restrictions from economic theory
I Examples:
  1. Naturally occurring random events (like weather, twin birth, etc)
  2. Policy reforms (which conditional on something are as good as random)
  3. Random assignment to individuals deciding treatment (e.g. judges)
  4. Cutoff rules for admission to programs (more next week on using such discontinuities)
I Randomized experiments with imperfect compliance
  I gives a LATE interpretation of RCT
96/126
Application: Judge design
97/126
Family welfare cultures: Opposing views
Two opposing views:
1. Welfare use reinforces itself through the family, because parents on welfare may
I Provide information about the program to their children
I Reduce stigma of participation
I Invest differentially in child development
2. The determinants of health and poverty are correlated across generations, so that
I Child welfare dependency is associated with (but not caused by) a parent’s use of welfare
98/126
What do we do?
1. We investigate existence and importance of family welfare cultures
I In a setting with no correlated unobservables
2. We explore breadth and nature of welfare cultures
I Spillover effects in other social networks
I Explore channels of welfare culture
3. We illustrate the policy relevance of intergenerational spillovers
I Use estimates to simulate direct and indirect effects of policy
99/126
Empirical Challenges: Statistical Model
I Characterize child’s latent demand/qualification ($P^{c*}_i$) as a function of
1. parent’s actual participation ($P^p_i$)
2. other observed traits ($x^c_i$)
3. unobserved taste/health/etc. ($\varepsilon^c_i$)
$$P^{c*}_i = \alpha^c + \beta^c P^p_i + \delta^c x^c_i + \varepsilon^c_i \quad (3)$$
I Similar equation for parents and grandparents
$$P^{p*}_i = \alpha^p + \beta^p P^g_i + \delta^p x^p_i + \varepsilon^p_i \quad (4)$$
100/126
Empirical Challenges: Sources of Bias
I Substitution of parent’s choice yields
$$P^{c*}_i = \alpha^c + \beta^c \, I(\alpha^p + \beta^p P^g_i + \delta^p x^p_i + \varepsilon^p_i > 0) + \delta^c x^c_i + \varepsilon^c_i \quad (5)$$
where child participates if $P^{c*}_i > 0$
1. This equation illustrates the problem if unobservables are correlated across generations:
$$\mathrm{cov}(\varepsilon^p_i, \varepsilon^c_i \mid x^c_i, x^p_i) \neq 0$$
2. Similarly, if unobservables are common to grandparent and child:
$$\mathrm{cov}(\varepsilon^g_i, \varepsilon^c_i \mid x^c_i, x^p_i, x^g_i) \neq 0$$
→ Family welfare culture parameter will be biased
101/126
Empirical Challenges: Correlations and Bias
Table: OLS Estimates of Intergenerational Welfare Transmission
Dependent variable: Child DI use ($P^c_i$)

                                  (1)          (2)          (3)
Parent DI use ($P^p_i$)           0.036***     0.035***     0.025***
                                  (0.001)      (0.001)      (0.001)
Grandparent DI use ($P^g_i$)                   0.005***     0.004***
                                               (0.000)      (0.000)
Additional controls?              NO           NO           YES
Obs.                              1,022,507    1,022,507    1,022,507
Dep. mean                         0.03         0.03         0.03

Notes: Data come from 2008 and are restricted to parents age 60 or below with children age 23 and above and a grandparent who is alive during the period 1967-2010. DI use in each generation is defined to be equal to 1 if the individual is currently receiving DI benefits (except for grandparents, for whom it is defined as having ever received DI benefits). Column (3) controls flexibly for child, parent and grandparent characteristics (age, gender, education, foreign born, marital status, earnings history, and region fixed effects). Standard errors clustered at the family level.
102/126
Research design and setting
I Research design
1. Exploit a policy which randomizes the probability that parents receive welfare
2. Use a unique source of population panel data, linking welfare use of members in social networks
I Setting:
Disability insurance (DI) system in Norway
103/126
Identification: Random assignment of judges
I Denied DI applicants may decide to appeal the decision:
  1. Cases are randomly assigned to judges
  2. Some appeal judges are systematically more lenient
=⇒ random variation in the probability a parent receives DI
I Exploit this exogenous variation to examine intergenerational links
I Since variation is driven by difficult-to-verify cases
  I Randomization picks out the more marginal applicants
I Policy relevant group
  1. Driving the recent rise in DI rolls
  2. Affected by policy proposals to tighten screening
104/126
Research design: Baseline Regression Model
I First and second stage of IV model:
$$P^p_i = \alpha^p + \gamma^p Z^p_i + X_i \delta^p + \varepsilon^p_i \quad (6)$$
$$P^c_i = \alpha^c + \beta^c P^p_i + X_i \delta^c + \varepsilon^c_i \quad (7)$$
I Due to randomization, $Z^p_i$ (judge leniency) ⊥ $\varepsilon^c_i$ and $\varepsilon^p_i$
  I Correlated unobservables do not bias the estimate
  I $X_i$ always includes year of appeal × department fixed effects
- First stage: $\gamma^p$ identified from a regression of $P^p_i$ on $Z^p_i$
- Reduced form: Regression of $P^c_i$ on $Z^p_i$
- Second stage: Intergenerational transmission coefficient $\beta^c$ given by ratio of reduced form and first stage
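The logic of equations (6)-(7) can be sketched on simulated data. Several things here are assumptions rather than the authors' implementation: leniency is measured as a leave-one-out mean of each judge's allowance rate, the controls $X_i$ are dropped, the true effect is homogeneous (0.1), and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_judges, beta = 20_000, 100, 0.1

judge = rng.integers(0, n_judges, size=n)            # random assignment of cases to judges
leniency = rng.uniform(-1, 1, size=n_judges)         # latent judge leniency (assumed)
u = rng.normal(size=n)                               # unobservable shared by parent and child

# Parent is allowed DI if judge leniency plus own unobservables is high enough
parent = (leniency[judge] + u + rng.normal(size=n) > 0).astype(float)
# Child outcome depends on parent DI and on the correlated unobservable
child = beta * parent + 0.5 * u + 0.5 * rng.normal(size=n)

# Leave-one-out leniency: the judge's allowance rate over all OTHER cases
allowed = np.bincount(judge, weights=parent, minlength=n_judges)
cases = np.bincount(judge, minlength=n_judges)
z = (allowed[judge] - parent) / (cases[judge] - 1)

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

first_stage = slope(z, parent)      # gamma^p in (6), without controls
reduced_form = slope(z, child)      # child outcome on judge leniency
iv = reduced_form / first_stage     # beta^c as the ratio of the two
ols = slope(parent, child)          # biased by the correlated unobservable u

print("first stage:", first_stage, "reduced form:", reduced_form)
print("IV:", iv, "OLS:", ols, "truth:", beta)
```

OLS picks up the shared unobservable, while the ratio of reduced form to first stage stays close to the assumed effect because leniency is unrelated to `u`.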
105/126
Testing Random Assignment
                               Case Allowed             Judge Leniency
Age                            0.0054*** (0.0009)       0.0003* (0.0002)
Female                         0.0109 (0.0096)          0.0002 (0.0019)
Married                        0.0041 (0.0076)          0.0013 (0.0019)
Foreign born                   -0.0271*** (0.0114)      0.0009 (0.0025)
High school degree             -0.01670*** (0.0070)     -0.0002 (0.0017)
Some college                   0.01317* (0.0070)        0.00041 (0.0014)
College graduate               0.02282 (0.0161)         -0.00073 (0.0033)
One child                      -0.1033*** (0.0199)      0.00389 (0.0094)
Two children                   -0.0052 (0.0087)         -0.00097 (0.0020)
Three or more children         -0.0159 (0.0132)         0.00103 (0.0016)
Previous earnings              -0.0355*** (0.0146)      0.00319 (0.0021)
Years of work                  0.0000*** (0.0000)       0.0000 (0.0000)
Mental disorders               0.0357*** (0.0105)       0.00005 (0.0038)
Musculoskeletal disorders      0.0026 (0.0086)          0.0018 (0.00256)
Test for joint significance    F: 9.25  p-value: .001   F: .77  p-value: .723
106/126
Graphical evidence: first stage
107/126
Graphical evidence: reduced form
108/126
Time profile in IV estimates
109/126
Why welfare cultures matter for policy
I Intergenerational links could be important for policy design
I In particular, making the disability screening more stringent would:
  1. Directly reduce DI participation among parents
  2. Further reduce DI participation in the next generation
I Policy simulation
  1. Make judges 1/5 std dev stricter (10% less likely to grant an appeal on average)
  2. Combine with estimates of how parent’s judge leniency affects parent and child participation over time
110/126
Direct and indirect effects of stringent screening
111/126
Application combining theory and instrument
112/126
The Model: Supply and Demand
I Quantity traded and price are equilibrium outcomes from a system of simultaneous equations:
$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \nu^S_i$$
$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \nu^D_i$$
I Where:
  I i indexes different markets, S indexes supply, D indexes demand
  I q is log quantity, p is log price
  I X is a vector of (pre-determined) observable determinants of demand and supply (including a constant term)
  I $\{\nu^S, \nu^D\}$ are unobservable determinants of supply and demand.
I Target parameters: $\varepsilon^S$ and $-\varepsilon^D$
113/126
We only observe the equilibrium, not supply/demand
Solid and dashed lines represent two different supply/demand systems with different elasticities $\varepsilon^D_1 \neq \varepsilon^D_2$ and $\varepsilon^S_1 \neq \varepsilon^S_2$, yet the observed equilibrium can be rationalized by both systems
114/126
Endogeneity
Endogeneity - equilibria across multiple markets i ∈ {1, 2, 3} do not trace out either supply or demand
115/126
Exclusion Restrictions - Supply shifter
I Assume that we observe a variable ($Z^S_i$) that enters the supply equation but is excluded from the demand equation:
$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \theta^S Z^S_i + \nu^S_i$$
$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \nu^D_i$$
I We further assume:
  I $\theta^S \neq 0$ so that quantity supplied is a nontrivial function of $Z^S_i$
  I $Z^S_i \perp (\nu^S_i, \nu^D_i) \mid X_i$
116/126
Exclusion Restrictions - Supply shifter
Using variation in $Z^S_i$ identifies the elasticity of demand by shifting supply along the demand curve.
117/126
Exclusion Restrictions - Supply and Demand shifters
I Assume that in addition to the supply shifter ($Z^S_i$), we observe a variable ($Z^D_i$) that enters the demand equation but is excluded from the supply equation:
$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \theta^S Z^S_i + \nu^S_i$$
$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \theta^D Z^D_i + \nu^D_i$$
I We further assume:
  I $\theta^D \neq 0$ so that quantity demanded is a nontrivial function of $Z^D_i$
  I $Z^D_i \perp (\nu^S_i, \nu^D_i) \mid X_i$
118/126
Exclusion Restrictions - Supply and Demand shifters
Variation in $Z^D_i$ (holding $Z^S_i$ constant) identifies the elasticity of supply.
Variation in $Z^S_i$ (holding $Z^D_i$ constant) identifies the elasticity of demand.
119/126
Supply and Demand Shifters - Reduced Form
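I Solving equations for the equilibrium quantity and price on each market i, we obtain:
$$q_i = \frac{\varepsilon^S \Gamma^D - \varepsilon^D \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S \theta^D Z^D_i - \varepsilon^D \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}$$
$$p_i = \frac{\Gamma^D - \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\theta^D Z^D_i - \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$
I Denote by $q^*$ and $p^*$ the residual variation in q and p after partialling out variation in $X_i$.
I Note:
$$q^*_i = \frac{\varepsilon^S \theta^D Z^D_i - \varepsilon^D \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}$$
$$p^*_i = \frac{\theta^D Z^D_i - \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$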
120/126
IV estimates
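$$\beta^{IV,D} = \frac{\mathrm{Cov}(q^*_i, Z^S_i)}{\mathrm{Cov}(p^*_i, Z^S_i)} = \frac{-\varepsilon^D \theta^S}{-\theta^S} = \varepsilon^D$$
$$\beta^{IV,S} = \frac{\mathrm{Cov}(q^*_i, Z^D_i)}{\mathrm{Cov}(p^*_i, Z^D_i)} = \frac{\varepsilon^S \theta^D}{\theta^D} = \varepsilon^S$$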
I IV recovers the elasticities. In general, we need one instrument for each elasticity.
I An interesting exception: when the tax rate is an instrument ⇒ a single instrument (tax rate) recovers both elasticities (Gavrilova, Zoutman and Hopland 2018)
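The two covariance ratios can be verified on simulated markets drawn from the reduced form. The parameter values (elasticities 1.5 and -0.8, unit shifter coefficients), the omission of $X_i$ and the mutually independent shifters are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
eps_s, eps_d = 1.5, -0.8          # assumed supply and demand elasticities
theta_s, theta_d = 1.0, 1.0       # assumed shifter coefficients

z_s, z_d = rng.normal(size=n), rng.normal(size=n)     # supply and demand shifters
nu_s, nu_d = rng.normal(size=n), rng.normal(size=n)   # unobserved shocks

# Equilibrium price and quantity from the reduced form (no X for simplicity)
denom = eps_s - eps_d
p = (theta_d * z_d - theta_s * z_s + nu_d - nu_s) / denom
q = (eps_s * theta_d * z_d - eps_d * theta_s * z_s + eps_s * nu_d - eps_d * nu_s) / denom

def iv(y, x, z):
    """Just-identified IV slope: Cov(y, z) / Cov(x, z)."""
    return np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

beta_iv_d = iv(q, p, z_s)         # supply shifter traces out demand
beta_iv_s = iv(q, p, z_d)         # demand shifter traces out supply
ols = np.cov(q, p)[0, 1] / np.var(p, ddof=1)

print("demand:", beta_iv_d, "supply:", beta_iv_s, "OLS of q on p:", ols)
```

The OLS slope of q on p mixes the two curves and recovers neither elasticity, which is the endogeneity problem from the earlier slides.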
121/126
Using tax rates as an instrument
I Assume that there is an ad valorem tax rate $t_i$ imposed on producers. We define $\tau_i = \log(1 + t_i)$.
I We also denote by $p^c_i$ the price paid by consumers and by $p^s_i = p^c_i - \tau_i$ the price received by suppliers.
I We assume $\tau_i \perp (\nu^S_i, \nu^D_i) \mid X_i$
I Because the tax is on producers, it does not enter the demand equation ⇒ $\varepsilon^D$ is identified via the standard exclusion restriction
I Economic theory generates an additional exclusion restriction: the Ramsey Exclusion Restriction (see GZH 2018)
122/126
Identification of Demand
The tax is a “supply shifter”: it allows identification of $\varepsilon^D$
123/126
Tax Rate as an Instrument
I The system of equations becomes:
$$q^D_i = \varepsilon^D p^c_i + \Gamma^D X_i + \nu^D_i$$
$$q^S_i = \underbrace{\varepsilon^S p^c_i + \underbrace{\theta^S Z^S_i}_{= -\varepsilon^S \tau_i}}_{= \varepsilon^S (p^c_i - \tau_i)} + \Gamma^S X_i + \nu^S_i$$
I Note: we impose an additional restriction, extremely common in public finance, that suppliers respond to the tax the same way they would respond to a cost shock ($\theta^S = -\varepsilon^S$). This follows directly from the assumption of profit maximization.
124/126
Tax Rate as an Instrument - Reduced Form
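I Solving the previous system of equations for the equilibrium quantity and price on each market i, we obtain:
$$q_i = \frac{\varepsilon^S \Gamma^D - \varepsilon^D \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S \varepsilon^D}{\varepsilon^S - \varepsilon^D} \tau_i + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}$$
$$p^c_i = \frac{\Gamma^D - \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S}{\varepsilon^S - \varepsilon^D} \tau_i + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$
I Denote by $q^*$ and $p^{c*}$ the residual variation in q and $p^c$ after partialling out variation in $X_i$ (and likewise $p^{s*}$ for $p^s$).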
125/126
Tax Rate as an instrument - IV estimate
$$\beta^{IV,D}_\tau = \frac{\mathrm{Cov}(q^*_i, \tau_i)}{\mathrm{Cov}(p^{c*}_i, \tau_i)} = \varepsilon^D$$
I This directly follows from slide 102 and the fact that the tax is excluded from the demand equation (Standard Exclusion Restriction)
I Can we identify more than just $\varepsilon^D$?
I Yes, this is the role of the additional restriction that suppliers respond to the tax the same way they would respond to an increase in marginal cost ($\theta^S = -\varepsilon^S$). ⇒ Key implication is that the passthrough of the tax (to consumers) is $\frac{dp^c}{d\tau} = \frac{\varepsilon^S}{\varepsilon^S - \varepsilon^D}$
126/126
Tax Rate as an instrument - Identifying εS
I Because 1) $\varepsilon^D$ is identified and 2) we can estimate the passthrough $\frac{dp^c}{d\tau}$, which is a function of the two elasticities, we can recover $\varepsilon^S$.
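The recovery step can be written out explicitly. The shorthand $\rho$ for the passthrough and the numbers in the example are introduced here only for illustration.

```latex
\rho \equiv \frac{dp^c}{d\tau} = \frac{\varepsilon^S}{\varepsilon^S - \varepsilon^D}
\;\Longrightarrow\;
\rho\,(\varepsilon^S - \varepsilon^D) = \varepsilon^S
\;\Longrightarrow\;
\varepsilon^S = \frac{\rho\,\varepsilon^D}{\rho - 1}

% Illustrative numbers: \varepsilon^D = -1 and \rho = 0.6 give
% \varepsilon^S = \frac{0.6 \times (-1)}{0.6 - 1} = 1.5,
% and indeed \frac{1.5}{1.5 - (-1)} = 0.6.
```

The same expression results from dividing the reduced-form coefficient of $q$ on $\tau$, which equals $\rho\,\varepsilon^D$, by the coefficient of $p^s = p^c - \tau$ on $\tau$, which equals $\rho - 1$.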
I GZH 2018 recommend using the following IV estimator:
$$\beta^{IV,S}_\tau = \frac{\mathrm{Cov}(q^*_i, \tau_i)}{\mathrm{Cov}(p^{s*}_i, \tau_i)} = \varepsilon^S$$
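Both tax-based estimators can be checked on simulated markets. This is a sketch under assumed values: the elasticities (1.5 and -0.8), the distribution of tax rates, the shock variances and the absence of $X_i$ are illustrative and not taken from GZH 2018.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
eps_s, eps_d = 1.5, -0.8                          # assumed elasticities

tau = np.log1p(rng.uniform(0.0, 0.3, size=n))     # tau = log(1 + t), t is the ad valorem rate
nu_s = 0.1 * rng.normal(size=n)                   # supply shock
nu_d = 0.1 * rng.normal(size=n)                   # demand shock

# Equilibrium from the reduced form with theta^S = -eps^S (no X for simplicity)
denom = eps_s - eps_d
p_c = (eps_s * tau + nu_d - nu_s) / denom                         # consumer price
q = (eps_s * eps_d * tau + eps_s * nu_d - eps_d * nu_s) / denom   # quantity
p_s = p_c - tau                                                   # producer price

def cov(a, b):
    return np.cov(a, b)[0, 1]

beta_d = cov(q, tau) / cov(p_c, tau)              # demand elasticity
beta_s = cov(q, tau) / cov(p_s, tau)              # supply elasticity
passthrough = cov(p_c, tau) / np.var(tau, ddof=1)

print("demand:", beta_d, "supply:", beta_s, "passthrough:", passthrough)
```

A single instrument delivers both elasticities only because the supply equation responds to the tax through the producer price $p^c_i - \tau_i$. Without that restriction the second ratio would have no structural interpretation.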