
Instrumental Variables with Heterogeneous Effects

Magne Mogstad

Linear IV with heterogeneous effects

When estimating the effect of D on Y with instrument Z, the standard textbook case presents the outcome equation with homogeneous effects:

Y = α + βD + U

But we can link the observed outcome Y to the potential outcomes (Y0, Y1):

$$Y_i = \underbrace{E[Y_0]}_{\alpha} + \underbrace{(Y_1 - Y_0)}_{\beta_i} D + \underbrace{Y_0 - E[Y_0]}_{U_i} \equiv \alpha + \beta_i D + U_i$$

What does linear IV identify when treatment effects are heterogeneous?

This question is the focus of much of applied micro.

It is arguably reverse engineering, like playing Jeopardy: the estimand is given, and we ask which question it answers.

Later we start with a question (a target parameter), and then ask how to answer it (identify and estimate the target parameter).

Heterogeneous potential outcome set-up

The instrument initiates a causal chain, whereby Z affects the variable of interest D, which in turn affects Y.

Keeping this in mind, we can adopt the potential outcome set-up:

- Dz is the treatment status at instrument value Z = z
- Yd,z is the outcome of individual i if he receives treatment D = d and instrument value Z = z

We can now define various causal effects:

- Y1,Z − Y0,Z
- YD,1 − YD,0
- Y1,z − Y0,z
- Yd,1 − Yd,0
- D1 − D0

Heterogeneous potential outcome set-up

The first assumption in the heterogeneous effects set-up is random assignment.

Random assignment: Yd,z, Dz ⊥ Z ∀ d, z

This is sufficient to identify the average causal effect of Z on Y (and of Z on D):

$$\begin{aligned}
E[Y \mid Z=1] - E[Y \mid Z=0] &= E[Y_{D_1,1} \mid Z=1] - E[Y_{D_0,0} \mid Z=0] \\
&= E[Y_{D_1,1} - Y_{D_0,0}]
\end{aligned}$$

The second assumption in the heterogeneous effects set-up is the exclusion restriction.

Exclusion restriction: Yd,1 = Yd,0

This states that any effect of Z on Y must be via an effect of Z on D.

The exclusion restriction is often expressed by omitting Z from the equation of interest: Y = α + β · D + U

Random assignment + exclusion restriction = instrument exogeneity

These are conceptually distinct problems: argue them one at a time!

The third assumption in the heterogeneous effects set-up is the existence of a first stage.

First stage: E[D1 − D0] ≠ 0

This requires the instrument Z to have some effect on the average probability of treatment.

Note: For the (usual) statistical inference (which relies on the standard first-order asymptotic approximation invoked in large-sample theory), the first stage should not be too close to zero (more on that later).

The fourth assumption in the heterogeneous effects set-up is monotonicity.

Monotonicity: D1 ≥ D0 ∀i (or vice versa)

This says that all those affected by the instrument are affected in the same direction.

Note: Uniformity would be better terminology.

The monotonicity assumption does not imply that treatment is a monotonic function of the instrument (which becomes relevant with multiple instruments or when the instrument takes multiple values).

Local Average Treatment Effect (LATE)

A variable Z is an instrumental variable for the causal effect of D on Y if the following assumptions hold:

1. Random assignment: Yd,z, Dz ⊥ Z ∀ d, z
   - gives the causal effect of Z on D (1st stage) and on Y (reduced form)
2. Exclusion restriction: Yd,1 = Yd,0 = Yd
   - so that the causal effect of Z on Y is only due to the effect of Z on D
3. Monotonicity: D1 ≥ D0, or vice versa
   - to avoid offsetting effects
4. First stage: E[D1 − D0] ≠ 0
   - because we need treatment variation in the sample

The Wald estimand then gives the Local Average Treatment Effect:

βIV = E[β|D1 = 1, D0 = 0]

the average treatment effect for those affected by the instrument.


Local Average Treatment Effect (LATE)

The Wald estimand can be interpreted as the effect of treatment on outcomes for individuals who were treated because Z = 1, but who would not have been treated otherwise.

To see why this is so, we can divide the population into four groups:

1. Compliers: D1 = 1 and D0 = 0
2. Always-takers: D1 = 1 and D0 = 1
3. Never-takers: D1 = 0 and D0 = 0
4. Defiers: D1 = 0 and D0 = 1

Note: The terminology is much used but a bit confusing (at least to me).

Always-takers are not always taking treatment. Never-takers are not never taking treatment. Everything is specific to the instrument at hand.

With other instruments, always-taker, never-taker and complier status may change.

Local Average Treatment Effect: Proof

We saw that (by independence)

$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[Y_{D_1} - Y_{D_0}]$$

The average causal effect of Z on Y can be written as a weighted average of the causal effects for the four sub-populations:

$$\begin{aligned}
E[Y_{D_1} - Y_{D_0}] ={}& E[Y_{D_1} - Y_{D_0} \mid \text{Complier}] \times P(D_1=1, D_0=0) \\
&+ E[\underbrace{Y_{D_1} - Y_{D_0}}_{=Y_0-Y_0=0} \mid \text{Never-taker}] \times P(D_1=0, D_0=0) \\
&+ E[\underbrace{Y_{D_1} - Y_{D_0}}_{=Y_1-Y_1=0} \mid \text{Always-taker}] \times P(D_1=1, D_0=1) \\
&+ E[Y_{D_1} - Y_{D_0} \mid \text{Defier}] \times \underbrace{P(D_1=0, D_0=1)}_{=0}
\end{aligned}$$

Local Average Treatment Effect: Proof

By monotonicity, D1 ≥ D0, which implies that there are no defiers. Hence

$$E[Y \mid Z=1] - E[Y \mid Z=0] = E[Y_1 - Y_0 \mid \text{Complier}] \times P(D_1=1, D_0=0)$$

and by independence and monotonicity we can show that

$$E[D \mid Z=1] - E[D \mid Z=0] = E[D_1 - D_0] = P(D_1=1, D_0=0)$$

From this it follows that the Wald estimand is equal to the average treatment effect for the compliers:

$$\frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]} = \frac{E[Y_1 - Y_0 \mid \text{Complier}] \times P(D_1=1, D_0=0)}{P(D_1=1, D_0=0)} = E[Y_1 - Y_0 \mid \text{Complier}]$$
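The result can be checked numerically. The following is a minimal simulation sketch, not part of the original slides: the type shares (20% always-takers, 30% never-takers, 50% compliers), the type-specific mean effects, and the function name `simulate_wald` are all hypothetical choices made for illustration. With no defiers, the Wald ratio should be close to the average effect among compliers and, because effects differ by type, away from the population average effect.

```python
import random


def simulate_wald(n=200_000, seed=0):
    """Simulate a randomly assigned binary instrument with no defiers.

    Returns (wald, late, ate): the Wald ratio, the average effect among
    compliers, and the average effect in the whole population.
    """
    rng = random.Random(seed)
    # Hypothetical mean treatment effects by type (a, n, c).
    mean_effect = {"a": 3.0, "n": -1.0, "c": 1.0}
    sum_y = [0.0, 0.0]
    sum_d = [0.0, 0.0]
    count = [0, 0]
    sum_effect_all = 0.0
    sum_effect_c = 0.0
    n_c = 0
    for _ in range(n):
        u = rng.random()
        kind = "a" if u < 0.2 else ("n" if u < 0.5 else "c")
        d0 = 1 if kind == "a" else 0          # treatment status if Z = 0
        d1 = 0 if kind == "n" else 1          # treatment status if Z = 1
        y0 = rng.gauss(0.0, 1.0)
        effect = mean_effect[kind] + rng.gauss(0.0, 0.5)  # Y1 - Y0
        z = 1 if rng.random() < 0.5 else 0    # instrument, independent of type
        d = d1 if z == 1 else d0
        y = y0 + effect * d                   # exclusion: Z enters only via D
        sum_y[z] += y
        sum_d[z] += d
        count[z] += 1
        sum_effect_all += effect
        if kind == "c":
            sum_effect_c += effect
            n_c += 1
    reduced_form = sum_y[1] / count[1] - sum_y[0] / count[0]
    first_stage = sum_d[1] / count[1] - sum_d[0] / count[0]
    return reduced_form / first_stage, sum_effect_c / n_c, sum_effect_all / n


wald, late, ate = simulate_wald()
print(f"Wald = {wald:.3f}, complier effect = {late:.3f}, ATE = {ate:.3f}")
```

Under these assumed shares the population average effect is 0.2 · 3 + 0.3 · (−1) + 0.5 · 1 = 0.8, while the complier average is 1.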

LATE: Interpretation and relevance

With heterogeneous effects, IV estimates the average causal effect for compliers.

Different valid instruments for the same causal relation therefore estimate different things (different groups of compliers):

- The overidentifying restrictions test (Sargan test) might reject even if all instruments are valid.
- The policy relevance of the IV estimate depends on the policy relevance of the instrument.

Note: We cannot identify the compliers because we can never observe both D0 and D1 (thus, we don't know who the compliers are):

- those with Z = 1 and D = 1 can be compliers or always-takers
- those with Z = 0 and D = 0 can be compliers or never-takers

Compliers: How many and what do they look like

The size of the complier group is the Wald first stage:

$$P(D_1=1, D_0=0) = E[D \mid Z=1] - E[D \mid Z=0]$$

Or, among the treated:

$$P(D_1 - D_0 = 1 \mid D=1) = \frac{P(D=1 \mid D_1 > D_0)\, P(D_1 > D_0)}{P(D=1)} = \frac{P(Z=1)\,(E[D \mid Z=1] - E[D \mid Z=0])}{P(D=1)}$$

We cannot identify the compliers, but we can describe them:

$$\frac{P(X=x \mid D_1 > D_0)}{P(X=x)} = \frac{P(D_1 > D_0 \mid X=x)}{P(D_1 > D_0)} = \frac{E[D \mid Z=1, X=x] - E[D \mid Z=0, X=x]}{E[D \mid Z=1] - E[D \mid Z=0]}$$

LATE extensions

Until now we considered the IV model with heterogeneity in the simple case of

- average effects (for compliers)
- binary treatment, binary instrument
- no covariates

What happens when we relax these assumptions?

Angrist and Pischke (2009, p. 173) write that "The econometric tool remains 2SLS and the interpretation remains fundamentally similar to the basic LATE result, with a few bells and whistles."

Is this really true? (spoiler: no, it's not!)

But first, let's see that even in the simple case, linear IV does not reveal all the information about potential outcomes available in the data.


Extension I: Counterfactual distributions

Counterfactual distributions

Imbens & Rubin (1997) show that we can estimate more than average causal effects for compliers.

They show how to recover the complete marginal distributions of the outcome

- under different treatments for the compliers
- under the treatment for the always-takers
- without the treatment for the never-takers

These results allow us to draw inference about the effect on the outcome distribution of compliers (QTE of compliers).

They can also be used to test instrument exogeneity & monotonicity.

Even exactly identified models can have testable implications (unlike what is claimed in MHE).

Counterfactual distributions

First introduce some shorthand notation:

- Ci = n ⇐⇒ D1 = D0 = 0
- Ci = a ⇐⇒ D1 = D0 = 1
- Ci = c ⇐⇒ D1 = 1, D0 = 0
- Ci = d ⇐⇒ D1 = 0, D0 = 1

For the different combinations of Z and D, we know the following:

      | D = 0 | D = 1
Z = 0 | n, c  | a
Z = 1 | n     | a, c

Counterfactual distributions: Distribution of types

Since Z is random, we know that the distribution of types a, n, c is the same for each value of Z and in the population as a whole.

Therefore, this...

      | D = 0 | D = 1
Z = 0 | n, c  | a
Z = 1 | n     | a, c

...implies the following:

pa = Pr(D = 1|Z = 0)

pn = Pr(D = 0|Z = 1)

pc = 1 − pa − pn

Counterfactual distributions: Identifying distributions

Let's use the following notation for the observed marginal distribution of Y conditional on Z and D:

fzd(y) ≡ f(y|Z = z, D = d)

Therefore, this...

      | D = 0 | D = 1
Z = 0 | n, c  | a
Z = 1 | n     | a, c

...implies the following:

f10(y) = gn(y)

f01(y) = ga(y)

f00(y) = gc0(y) · (pc/(pc + pn)) + gn(y) · (pn/(pc + pn))

f11(y) = gc1(y) · (pc/(pc + pa)) + ga(y) · (pa/(pc + pa))

Counterfactual distributions: Example

To illustrate the above, consider Dutch data (see Ketel et al., 2016, AEJ: Applied).

- Lottery outcome as instrument for medical school completion
- D = 1 if completed medical school
- Z = 1 if offered medical school after successful lottery

. ta z d

           |           d
         z |         0          1 |     Total
-----------+----------------------+----------
         0 |       269        187 |       456
         1 |        71        949 |     1,020
-----------+----------------------+----------
     Total |       340      1,136 |     1,476
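As a worked illustration, the type shares can be computed directly from the four cell counts in the table. This is a small sketch rather than anything from the original slides; the function names `type_shares` and `complier_share_among_treated` are made up here, and the calculation relies on the no-defiers assumption.

```python
# Cell counts from the z-by-d tabulation, stored as counts[z][d].
counts = {0: {0: 269, 1: 187}, 1: {0: 71, 1: 949}}


def type_shares(counts):
    """Return (p_a, p_n, p_c) from a 2x2 table of counts[z][d]."""
    n_z0 = counts[0][0] + counts[0][1]
    n_z1 = counts[1][0] + counts[1][1]
    p_a = counts[0][1] / n_z0   # treated although Z = 0: always-takers
    p_n = counts[1][0] / n_z1   # untreated although Z = 1: never-takers
    p_c = 1.0 - p_a - p_n       # the remainder are compliers (no defiers)
    return p_a, p_n, p_c


def complier_share_among_treated(counts):
    """P(complier | D = 1) = P(Z = 1) * p_c / P(D = 1)."""
    n = sum(counts[z][d] for z in (0, 1) for d in (0, 1))
    p_z1 = (counts[1][0] + counts[1][1]) / n
    p_d1 = (counts[0][1] + counts[1][1]) / n
    _, _, p_c = type_shares(counts)
    return p_z1 * p_c / p_d1


p_a, p_n, p_c = type_shares(counts)
print(f"p_a = {p_a:.3f}, p_n = {p_n:.3f}, p_c = {p_c:.3f}")
print(f"complier share among the treated = {complier_share_among_treated(counts):.3f}")
```

The complier share p_c is the same number as the first stage, 949/1020 − 187/456.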

Counterfactual distributions: f10(y) = gn(y)

[Figure: distribution of Y0 for never-takers, plotted against log(Wage)]

Counterfactual distributions: f01(y) = ga(y)

[Figure: distribution of Y1 for always-takers, plotted against log(Wage)]


We have seen that we can estimate pa, pn, pc and also gn(y) (= f10(y)) and ga(y) (= f01(y)).

By rearranging the following

f00(y) = gc0(y) · (pc/(pc + pn)) + gn(y) · (pn/(pc + pn))

f11(y) = gc1(y) · (pc/(pc + pa)) + ga(y) · (pa/(pc + pa))

we can back out the counterfactual distributions for the compliers:

gc0(y) = f00(y) · (pc + pn)/pc − f10(y) · pn/pc

gc1(y) = f11(y) · (pc + pa)/pc − f01(y) · pa/pc
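The two rearranged formulas translate directly into code once the densities are represented on a common grid (for example, as histogram heights over the same bins). The sketch below assumes that representation; the function name `complier_densities` and the toy numbers in the usage lines are hypothetical.

```python
def complier_densities(f00, f10, f01, f11, p_a, p_n):
    """Back out (g_c0, g_c1) from the observed densities f_zd.

    Each f_zd is a list of density (or probability) values on a common grid,
    where the first index is z and the second is d.
    """
    p_c = 1.0 - p_a - p_n
    g_c0 = [a * (p_c + p_n) / p_c - b * p_n / p_c for a, b in zip(f00, f10)]
    g_c1 = [a * (p_c + p_a) / p_c - b * p_a / p_c for a, b in zip(f11, f01)]
    return g_c0, g_c1


# Toy check on a three-point grid with made-up type shares and distributions.
p_a, p_n, p_c = 0.2, 0.3, 0.5
g_n = [0.5, 0.3, 0.2]    # Y0 distribution of never-takers
g_a = [0.2, 0.2, 0.6]    # Y1 distribution of always-takers
g_c0 = [0.1, 0.6, 0.3]   # Y0 distribution of compliers
g_c1 = [0.0, 0.3, 0.7]   # Y1 distribution of compliers
f10, f01 = g_n, g_a
f00 = [c * p_c / (p_c + p_n) + m * p_n / (p_c + p_n) for c, m in zip(g_c0, g_n)]
f11 = [c * p_c / (p_c + p_a) + m * p_a / (p_c + p_a) for c, m in zip(g_c1, g_a)]
print(complier_densities(f00, f10, f01, f11, p_a, p_n))
```

In applied work the inputs would be estimated densities, so the recovered complier densities inherit their sampling noise.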

Counterfactual distributions: gc0(y) = f00(y) · (pc + pn)/pc − f10(y) · pn/pc

[Figure: distribution of Y0 for compliers, plotted against log(Wage)]

Counterfactual distributions: gc1(y) = f11(y) · (pc + pa)/pc − f01(y) · pa/pc

[Figure: distribution of Y1 for compliers, plotted against log(Wage)]

[Figure: distributions of Y1 and Y0 for compliers, plotted against log(Wage)]

[Figure: distributions of Y1 and Y0 for compliers, Y1 for always-takers, and Y0 for never-takers, plotted against log(Wage)]


We can also show that

$$E[Y_1 \mid C = c] = \frac{E[Y \cdot D \mid Z=1] - E[Y \cdot D \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}$$

and

$$E[Y_0 \mid C = c] = \frac{E[Y \cdot (1-D) \mid Z=1] - E[Y \cdot (1-D) \mid Z=0]}{E[1-D \mid Z=1] - E[1-D \mid Z=0]}$$

Counterfactual distributions

. ivregress 2sls lnw (d = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
         lnw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   .1871175   .0485501     3.85   0.000     .0919609     .282274
       _cons |   3.010613   .0382073    78.80   0.000     2.935728    3.085498
------------------------------------------------------------------------------

. g y1 = lnw*d

. ivregress 2sls y1 (d = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
          y1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   3.264167   .0387887    84.15   0.000     3.188142    3.340191
       _cons |  -.0617161   .0275252    -2.24   0.025    -.1156644   -.0077678
------------------------------------------------------------------------------

. g y0 = lnw*(1-d)

. g md = 1-d

. ivregress 2sls y0 (md = z), robust noheader
------------------------------------------------------------------------------
             |               Robust
          y0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          md |   3.077049   .0293153   104.96   0.000     3.019592    3.134506
       _cons |  -.0047203   .0047455    -0.99   0.320    -.0140213    .0045806
------------------------------------------------------------------------------

. di 3.264167 - 3.077049

.187118
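The same three Wald ratios can be sketched in plain Python. This is an illustration on simulated data, not a re-analysis of the Dutch data: the data-generating process (half compliers, half never-takers, an effect of 0.2) and the helper names `wald` and `complier_means` are assumptions made here. In any sample, the difference between the two complier means equals the ordinary Wald estimate exactly, which mirrors the last line of the log.

```python
import random


def mean(values):
    return sum(values) / len(values)


def wald(outcome, treatment, z):
    """Difference in mean outcome by z divided by difference in mean treatment by z."""
    num = (mean([o for o, zi in zip(outcome, z) if zi == 1])
           - mean([o for o, zi in zip(outcome, z) if zi == 0]))
    den = (mean([t for t, zi in zip(treatment, z) if zi == 1])
           - mean([t for t, zi in zip(treatment, z) if zi == 0]))
    return num / den


def complier_means(y, d, z):
    """Return estimates of (E[Y1 | complier], E[Y0 | complier])."""
    y_times_d = [yi * di for yi, di in zip(y, d)]               # Y * D
    y_times_not_d = [yi * (1 - di) for yi, di in zip(y, d)]     # Y * (1 - D)
    not_d = [1 - di for di in d]
    return wald(y_times_d, d, z), wald(y_times_not_d, not_d, z)


# Simulated example (hypothetical numbers).
rng = random.Random(0)
y, d, z = [], [], []
for _ in range(5000):
    zi = rng.randint(0, 1)
    is_complier = rng.random() < 0.5   # everyone else is a never-taker
    di = zi if is_complier else 0
    y.append(3.0 + 0.2 * di + rng.gauss(0.0, 0.5))
    d.append(di)
    z.append(zi)

m1, m0 = complier_means(y, d, z)
print(f"E[Y1|c] = {m1:.3f}, E[Y0|c] = {m0:.3f}, difference = {m1 - m0:.3f}")
print(f"Wald estimate = {wald(y, d, z):.3f}")
```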

Testing instrument validity

The above discussion points to a test for instrument validity (or, equivalently, a test for monotonicity given exogeneity).

Basic idea: Under the IV assumptions, the complier distribution should actually be a distribution.

- By definition, a probability can never be negative.
- Thus, a density can never be negative.
- For binary Y, it means that E(Y|C = c) needs to be between 0 and 1.

Kitagawa (2015) develops a formal statistical test based on these implications.
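To make the non-negativity idea concrete, the sketch below computes, for each bin of Y, the estimated complier mass pc · P(Y in bin | complier) for Y1 and Y0. It is only a descriptive check and is not Kitagawa's test statistic: the function name `complier_mass_by_bin`, the binning, and the toy data are assumptions, and sampling noise alone can produce small negative entries.

```python
def complier_mass_by_bin(y, d, z, edges):
    """Return (mass_y1, mass_y0), lists with one entry per bin [edges[b], edges[b+1]).

    mass_y1[b] estimates P(Y in bin, D=1 | Z=1) - P(Y in bin, D=1 | Z=0)
    mass_y0[b] estimates P(Y in bin, D=0 | Z=0) - P(Y in bin, D=0 | Z=1)
    Under the IV assumptions both are non-negative in the population.
    """
    n_bins = len(edges) - 1
    n_z = [sum(1 for zi in z if zi == 0), sum(1 for zi in z if zi == 1)]
    joint = {(zv, dv): [0.0] * n_bins for zv in (0, 1) for dv in (0, 1)}
    for yi, di, zi in zip(y, d, z):
        for b in range(n_bins):
            if edges[b] <= yi < edges[b + 1]:
                joint[(zi, di)][b] += 1.0 / n_z[zi]
                break
    mass_y1 = [joint[(1, 1)][b] - joint[(0, 1)][b] for b in range(n_bins)]
    mass_y0 = [joint[(0, 0)][b] - joint[(1, 0)][b] for b in range(n_bins)]
    return mass_y1, mass_y0


# Tiny made-up data set: 4 observations with Z = 0 and 4 with Z = 1.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
y = [0.5, 1.5, 1.5, 0.5, 0.5, 0.5, 1.5, 1.5]
mass_y1, mass_y0 = complier_mass_by_bin(y, d, z, edges=[0.0, 1.0, 2.0])
print(mass_y1, mass_y0)
```

Summing either list over bins gives the first stage when the bins cover the support of Y, so the entries can be read as the complier share allocated to each bin.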


Extension II: Multiple instruments

LATE with multiple instruments

Assume we have 2 mutually exclusive (and, for simplicity, independent) binary instruments.

(Without loss of generality: make two non-exclusive instruments mutually exclusive by working with Z1(1−Z2), Z2(1−Z1), Z1Z2.)

We can then estimate two different LATEs:

$$\beta_{Z_j} = \frac{\mathrm{cov}(Y, Z_j)}{\mathrm{cov}(D, Z_j)} = E[Y_1 - Y_0 \mid D_{Z_j=1} - D_{Z_j=0} = 1]$$

In practice, researchers often combine the instruments using 2SLS.

The 2SLS estimator is

$$\beta_{2SLS} = \frac{\mathrm{cov}(Y, \hat{D})}{\mathrm{cov}(\hat{D}, \hat{D})}, \qquad \text{where } \hat{D} = \pi_1 Z_1 + \pi_2 Z_2$$

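LATE with multiple instruments

Expanding β2SLS gives

$$\begin{aligned}
\beta_{2SLS} &= \pi_1 \frac{\mathrm{cov}(Y, Z_1)}{\mathrm{cov}(\hat{D}, \hat{D})} + \pi_2 \frac{\mathrm{cov}(Y, Z_2)}{\mathrm{cov}(\hat{D}, \hat{D})} \\
&= \pi_1 \frac{\mathrm{cov}(D, Z_1)}{\mathrm{cov}(\hat{D}, \hat{D})} \frac{\mathrm{cov}(Y, Z_1)}{\mathrm{cov}(D, Z_1)} + \pi_2 \frac{\mathrm{cov}(D, Z_2)}{\mathrm{cov}(\hat{D}, \hat{D})} \frac{\mathrm{cov}(Y, Z_2)}{\mathrm{cov}(D, Z_2)} \\
&= \psi \beta_{Z_1} + (1 - \psi) \beta_{Z_2}
\end{aligned}$$

where

$$\psi \equiv \frac{\pi_1 \mathrm{cov}(D, Z_1)}{\pi_1 \mathrm{cov}(D, Z_1) + \pi_2 \mathrm{cov}(D, Z_2)}$$

is the relative strength of Z1 in the first stage.

Under assumptions 1-4, the 2SLS estimate is an instrument-strength weighted average of the instrument-specific LATEs.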
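The algebra of the decomposition can be verified on any data set, because it holds exactly in sample. The sketch below is illustrative only: the function name `two_sls_decomposition` and the simulated data are assumptions, and the first stage is assumed to include a constant, so that with mutually exclusive dummies its coefficients are differences in cell means of D.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (w - mean_b) for x, w in zip(a, b)) / len(a)


def two_sls_decomposition(y, d, z1, z2):
    """Return (beta_2sls, beta_z1, beta_z2, psi) for mutually exclusive dummies z1, z2."""
    # First stage of D on a constant, z1 and z2: coefficients are differences
    # in the mean of D relative to the cell with z1 = z2 = 0.
    base = mean([di for di, a, b in zip(d, z1, z2) if a == 0 and b == 0])
    pi1 = mean([di for di, a in zip(d, z1) if a == 1]) - base
    pi2 = mean([di for di, b in zip(d, z2) if b == 1]) - base
    d_hat = [base + pi1 * a + pi2 * b for a, b in zip(z1, z2)]
    beta_2sls = cov(y, d_hat) / cov(d_hat, d_hat)
    beta_z1 = cov(y, z1) / cov(d, z1)
    beta_z2 = cov(y, z2) / cov(d, z2)
    w1 = pi1 * cov(d, z1)
    w2 = pi2 * cov(d, z2)
    return beta_2sls, beta_z1, beta_z2, w1 / (w1 + w2)


# Simulated data with hypothetical take-up rates 0.1, 0.6 and 0.7 by instrument cell.
rng = random.Random(3)
y, d, z1, z2 = [], [], [], []
for _ in range(4000):
    u = rng.random()
    a = 1 if 0.5 <= u < 0.75 else 0
    b = 1 if u >= 0.75 else 0
    di = 1 if rng.random() < 0.1 + 0.5 * a + 0.6 * b else 0
    y.append(2.0 * di + rng.gauss(0.0, 1.0))
    d.append(di)
    z1.append(a)
    z2.append(b)

beta_2sls, beta_z1, beta_z2, psi = two_sls_decomposition(y, d, z1, z2)
print(f"2SLS = {beta_2sls:.4f}")
print(f"psi * beta_Z1 + (1 - psi) * beta_Z2 = {psi * beta_z1 + (1 - psi) * beta_z2:.4f}")
```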

Questions with multiple instruments?

- What question does the 2SLS weighted average of LATEs answer?
- Why not some other weighted average (e.g. use GMM or LIML)?
- Is monotonicity more restrictive with multiple instruments?
- Can one do without monotonicity?

Some papers do IV with heterogeneity without invoking monotonicity.

See, for example, much of the work by Manski, but also Heckman and Pinto (2018) and Mogstad, Walters and Torgovitsky (2019).

Interpreting Monotonicity with Multiple Instruments

Notation
- Binary treatment D ∈ {0,1}
- Potential treatments Dz for instrument values z ∈ Z

IA monotonicity condition (IAM)
For all z, z′ ∈ Z either:
- Dz ≥ Dz′ or
- Dz ≤ Dz′

- IA monotonicity is uniformity, not monotonicity
- Pairwise instrument shifts push everyone to or from treatment

Choice Behavior

- Random utility model

V(d, z) is the indirect utility from choosing d when the instrument is z:

$$D_z = \arg\max_{d \in \{0,1\}} V(d, z) = 1[V(z) \ge 0]$$

where V(z) ≡ V(1, z) − V(0, z) is net indirect utility.

Illustrative example:

- Dz ∈ {0,1} is whether to attend college
- Z1 is a tuition subsidy
- Z2 is proximity to a college
- Dz should be an increasing function of z
- This neither implies nor is implied by IA monotonicity
- What is implied by IA monotonicity? Restrictions on V(z)?

Binary Instruments

- IA monotonicity does not permit individuals to differ in responses
- All individuals must find either tuition or distance more compelling

Continuous Instruments

- z∗ is a point of indifference for j and k
- IA monotonicity fails if marginal rates of substitution are different

Homogeneous Marginal Rates of Substitution

- Let z∗ be a point at which V(z) is differentiable
- Let I(z∗) = {i ∈ I : Vi(z∗) = 0}
- IA monotonicity implies that

∂1Vj(z∗) ∂2Vk(z∗) = ∂1Vk(z∗) ∂2Vj(z∗), ∀ j, k ∈ I(z∗)

- Natural discrete choice specification:

V(z) = B0 + B1Z1 + 1 × Z2

- where (B0, B1) are unobserved
- B1 controls variation in taste for tuition relative to proximity
- IA monotonicity requires no variation over individuals: Var(B1) = 0


Extension III: Variable treatment intensity

Variable treatment intensity

Assume treatment is no longer binary but varies in its level

S ∈ {0, 1, 2, . . . , J}

such as, for example, years of schooling.

We can then define potential outcomes indexed by the level of treatment

YS

Potential treatments (schooling levels) are, as before, indexed by the value of the instrument

SZ

so that with a binary instrument the observed level of schooling is

S = ZS1 + (1 − Z)S0

Variable treatment intensity

The observed outcome is

$$Y = \sum_{s=0}^{J} Y_s 1[S = s] = Y_0 + \sum_{s=1}^{J} (Y_s - Y_{s-1}) 1[S \ge s]$$

The average effect of the s-th year of schooling is then

E[Ys − Ys−1]

and we now have J different treatment effects.

Even so, researchers often estimate a linear-in-parameters model:

Y = α + βS + u

One possibility is to take the linearity restriction literally.

Another option is to reverse-engineer.

(A third possibility is to start with a target parameter...)


Variable treatment intensity

As before we need to make an independence assumption

Ys,z , Sz ⊥ Z ∀s, z

and an exclusion restriction

Ys,z = Ys

We further need a monotonicity assumption

S1 ≥ S0

and instrument relevance

E[S1 − S0] ≠ 0

Variable treatment intensity: Example with 3 levels

Monotonicity implies

1[S1 ≥ s] − 1[S0 ≥ s] ∈ {0, 1}

so that

Pr(1[S1 ≥ s] > 1[S0 ≥ s]) = Pr(S1 ≥ s > S0)

If this probability is greater than 0, then the instrument affects the incidence of treatment level s.

$$\begin{aligned}
E[S \mid Z=1] - E[S \mid Z=0] &\overset{(1)}{=} [\Pr(S_1 \ge 1 \mid Z=1) - \Pr(S_0 \ge 1 \mid Z=0)] + [\Pr(S_1 \ge 2 \mid Z=1) - \Pr(S_0 \ge 2 \mid Z=0)] \\
&\overset{(2)}{=} \Pr(S_1 \ge 1 > S_0) + \Pr(S_1 \ge 2 > S_0)
\end{aligned}$$

where (1) follows because the mean is the sum (or integral) of 1 minus the CDF, and (2) follows from independence and monotonicity.

Variable treatment intensity: Example with 3 levels

With three treatment intensities S ∈ {0, 1, 2} we observe

Y = Y0 + (Y1 − Y0)1[S ≥ 1] + (Y2 − Y1)1[S ≥ 2]

Using this we can expand the reduced form as follows:

E[Y|Z = 1] − E[Y|Z = 0] = E[(Y1 − Y0)(1[S1 ≥ 1] − 1[S0 ≥ 1])] + E[(Y2 − Y1)(1[S1 ≥ 2] − 1[S0 ≥ 2])]

Variable treatment intensity: Average Causal Response

We can now define

$$\omega_s = \frac{\Pr(S_1 \ge s > S_0)}{\sum_{j=1}^{J} \Pr(S_1 \ge j > S_0)}$$

and express the Wald estimand as follows:

$$\frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[S \mid Z=1] - E[S \mid Z=0]} = \sum_{s=1}^{J} \omega_s E[Y_s - Y_{s-1} \mid S_1 \ge s > S_0]$$

which Angrist and Imbens call the average causal response (ACR).

Variable treatment intensity: Average Causal Response

We cannot estimate E[Ys − Ys−1|S1 ≥ s > S0] for the different local complier groups.

What we can do is estimate their weights in the ACR, since

Pr(S1 ≥ s > S0) = Pr(S1 ≥ s) − Pr(S0 ≥ s)

= Pr(S0 < s) − Pr(S1 < s)

= Pr(S < s|Z = 0) − Pr(S < s|Z = 1)

which allows us to estimate ωs.

Note: although the ACR is a positively weighted average, it

- averages together components that are potentially overlapping
- cannot be expressed as a positively weighted average of causal effects across mutually exclusive subgroups (unlike the LATE)
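The weights can be estimated from the two empirical CDFs of S by instrument value. The sketch below is a minimal illustration with a made-up eight-observation sample; the function name `acr_weights` is hypothetical.

```python
def acr_weights(s, z, max_level):
    """Return (shares, weights) for levels 1..max_level.

    shares[j-1] estimates Pr(S1 >= j > S0) as Pr(S < j | Z=0) - Pr(S < j | Z=1);
    weights are the shares divided by their sum.
    """
    s0 = [si for si, zi in zip(s, z) if zi == 0]
    s1 = [si for si, zi in zip(s, z) if zi == 1]
    shares = []
    for level in range(1, max_level + 1):
        cdf0 = sum(1 for v in s0 if v < level) / len(s0)   # Pr(S < level | Z = 0)
        cdf1 = sum(1 for v in s1 if v < level) / len(s1)   # Pr(S < level | Z = 1)
        shares.append(cdf0 - cdf1)
    total = sum(shares)   # equals the first stage E[S | Z=1] - E[S | Z=0]
    return shares, [share / total for share in shares]


# Made-up sample with S in {0, 1, 2}.
s = [0, 0, 1, 2, 0, 1, 2, 2]
z = [0, 0, 0, 0, 1, 1, 1, 1]
shares, weights = acr_weights(s, z, max_level=2)
print(shares, weights)
```

The shares sum to the first stage because the mean of S is the sum over levels of Pr(S ≥ s).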

Variable treatment intensity: Example

Angrist & Krueger (1991) use quarter of birth as an instrument for schooling

- D = 1 if education is at least high school
- Z = 1 if born in the 4th quarter, Z = 0 if born in the 1st quarter

How does the Wald estimator weigh the average unit causal response

E[Ys − Ys−1|S1 ≥ s > S0]

for the compliers at the different points s?

Variable treatment intensity: Example

[Figure: schooling CDF by QoB (= 1, 4)]

[Figure: differences in schooling CDF by QoB (= 1, 4)]

[Figure: example for different QoBs: 4 vs 1, 4 vs 2, 4 vs 3]

Can the weighting matter?

Loken et al. (2012) report OLS, IV and family fixed effects estimates of how family income affects children's outcomes.


Covariates

Extensions to Covariates - Nonparametric

- Often, one wants covariates X to help justify the exogeneity of Z
- And/or to reduce residual noise in Y
- And/or to look at observed heterogeneity in treatment effects

Adjust the assumptions to be conditional on X:

- Exogeneity: (Y0, Y1, D0, D1) ⊥ Z | X
- Relevance: P[D = 1|X, Z = 1] ≠ P[D = 1|X, Z = 0] a.s.
- Monotonicity: P[D1 ≥ D0|X] = 1 a.s.
- Overlap: P[Z = 1|X] ∈ (0, 1) a.s.

Non-parametric IV with Covariates

- Suppose we can estimate stratified LATEs

$$\beta(x) = \frac{E[Y \mid Z=1, X=x] - E[Y \mid Z=0, X=x]}{E[D \mid Z=1, X=x] - E[D \mid Z=0, X=x]} = E[Y_1 - Y_0 \mid D_1 - D_0 = 1, X = x]$$

- We want to go from here to some population-averaged LATE
- Which one would we like to have? Complier weighted? Population weighted?

2SLS regression with Covariates

- What does a saturated 2SLS estimation give us?

$$Y = \beta D + \alpha_x + e, \qquad D = \pi_x Z + \gamma_x + u$$

- i.e. x-dummies in both stages, and x-specific first-stage coefficients
- Angrist & Imbens (1995) show that

β = E[β(x)ω(x)]

- where β(x) is the x-specific LATE, and

$$\omega(x) = \frac{\sigma^2_{\hat{D}}(x)}{E[\sigma^2_{\hat{D}}(x)]} = \frac{\pi_x^2 \sigma_Z^2(x)}{E[\pi_x^2 \sigma_Z^2(x)]}$$

- The weighting thus depends on the square of the local (to x) complier share and on the instrument variance

Abadie's (2003) κ

- For covariates (but D, Z binary), a more elegant approach
- The idea is to run regressions only on the compliers
- Compliers aren't directly observable, but they can be weighted
- Abadie showed that for any function G = g(Y, X, D)

$$E[G \mid T = c] = \frac{1}{P[T = c]} E[\kappa G], \qquad \kappa = 1 - \frac{D(1 - Z)}{P[Z = 0 \mid X]} - \frac{Z(1 - D)}{P[Z = 1 \mid X]}$$

Intuition
- Complier = 1 − Always-taker − Never-taker
- On average, κ only applies positive weights to compliers:

E[κ|T = t, X, D, Y] = 1[t = c]

- So on average, κG is nonzero only for compliers
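A short simulation can illustrate how κ picks out compliers. Everything in the design below is a hypothetical assumption for illustration: a binary covariate X, known propensities P[Z = 1|X] of 0.3 and 0.6, complier shares of 0.3 and 0.7 by X, and the function names `kappa` and `simulate_kappa`. In practice the propensity would have to be estimated.

```python
import random


def kappa(d, z, p_z1):
    """Abadie's kappa for one observation, given p_z1 = P[Z = 1 | X]."""
    return 1.0 - d * (1 - z) / (1.0 - p_z1) - (1 - d) * z / p_z1


def simulate_kappa(n=200_000, seed=1):
    """Return (E[kappa], true complier share, kappa-weighted mean of X, E[X | complier])."""
    rng = random.Random(seed)
    sum_k = 0.0
    sum_kx = 0.0
    n_c = 0
    sum_x_c = 0
    for _ in range(n):
        x = 1 if rng.random() < 0.4 else 0
        p_z1 = 0.6 if x == 1 else 0.3        # instrument propensity depends on X
        z = 1 if rng.random() < p_z1 else 0
        p_c = 0.7 if x == 1 else 0.3         # complier share depends on X
        u = rng.random()
        if u < p_c:                          # complier
            d0, d1 = 0, 1
            n_c += 1
            sum_x_c += x
        elif u < p_c + 0.15:                 # always-taker
            d0, d1 = 1, 1
        else:                                # never-taker
            d0, d1 = 0, 0
        d = d1 if z == 1 else d0
        k = kappa(d, z, p_z1)
        sum_k += k
        sum_kx += k * x
    return sum_k / n, n_c / n, sum_kx / sum_k, sum_x_c / n_c


mean_kappa, share_c, kappa_mean_x, true_mean_x = simulate_kappa()
print(f"E[kappa] = {mean_kappa:.3f} vs complier share = {share_c:.3f}")
print(f"kappa-weighted mean of X = {kappa_mean_x:.3f} vs E[X | complier] = {true_mean_x:.3f}")
```

Individual values of κ are negative for some observations; it is only the averages that recover complier quantities.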

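IV with Covariates

- Abadie (2003) showed that

$$\begin{aligned}
E[\kappa_0\, g(Y, X)] &= E[g(Y^0, X) \mid D_1 > D_0] \Pr(D_1 > D_0) \\
E[\kappa_1\, g(Y, X)] &= E[g(Y^1, X) \mid D_1 > D_0] \Pr(D_1 > D_0) \\
E[\kappa\, g(Y, D, X)] &= E[g(Y, D, X) \mid D_1 > D_0] \Pr(D_1 > D_0)
\end{aligned}$$

where:

$$\begin{aligned}
\kappa_0 &= (1 - D)\, \frac{(1 - Z) - \Pr(Z = 0 \mid X)}{\Pr(Z = 0 \mid X) \Pr(Z = 1 \mid X)} \\
\kappa_1 &= D\, \frac{Z - \Pr(Z = 1 \mid X)}{\Pr(Z = 0 \mid X) \Pr(Z = 1 \mid X)} \\
\kappa &= \kappa_0 \Pr(Z = 0 \mid X) + \kappa_1 \Pr(Z = 1 \mid X) \\
&= 1 - \frac{D(1 - Z)}{\Pr(Z = 0 \mid X)} - \frac{(1 - D) Z}{\Pr(Z = 1 \mid X)}
\end{aligned}$$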

Using Abadie's (2003) κ

Linear/nonlinear regression

- For example, take g(Y, X, D) = (Y − αD − X′β)², then:

$$\arg\min_{\alpha, \beta} E[(Y - \alpha D - X'\beta)^2 \mid T = c] = \arg\min_{\alpha, \beta} E[\kappa (Y - \alpha D - X'\beta)^2]$$

- Estimate α, β by solving a sample analog of the second problem
- This is just a weighted regression, with estimated weights (κ̂)
- The result is general enough to use for many other estimators
- Specify X however you like: it still picks out the compliers

Using Abadie's (2003) κ

Estimating κ

- To implement the result one must estimate κ, hence P[Z = 1|X]
- If P[Z = 1|X] is linear, the κ-weighted regression equals TSLS
- Of course, Z is binary, so P[Z = 1|X] typically won't be exactly linear
- Logit/probit is often close to linear, so in practice the two may be close

Empirical Example: Angrist and Evans (1998, "AE")

Motivation
- Relationship between fertility decisions and female labor supply?
- Strong negative correlation, but these are joint choices
- Leads to many possible endogeneity stories, here's just one:

High-earning women have fewer children due to higher opportunity cost

Empirical Example: Angrist and Evans (1998, "AE")

Empirical strategy

- Y is a labor market outcome for the woman (or her husband)
- Restrict the sample to only women (or couples) with 2 or more children
- D is an indicator for having more than 2 children (vs. exactly 2)
- Z = 1 if the first two children had the same sex
  → Based on the idea that there is a preference to have a mix of boys and girls
- Also consider Z = 1 if the second birth was a twin
  → Twins are primarily for comparison; this instrument was used before this paper

Assumptions in AE

Exogeneity

- Requires the assumption that sex at birth is randomly assigned
- Authors conduct balance tests to support this (next slide)
- The twins instrument is less compelling
- First, it is well known that older women have twins more often (see next slide)
  → More subtly, it impacts both the number and spacing of children

Monotonicity

- Monotonicity restricts preference heterogeneity in unattractive ways
  → Some families may want two boys or two girls (then stop)
- No discussion of this in the paper, which is unfortunately common practice
- Twins is effectively a one-sided non-compliance instrument
  → The untreated are twins compliers, since there are no twins never-takers

Evidence in Support of Exogeneity

- Same sex is uncorrelated with a variety of observed confounders
- Twins is well known to be correlated with age (so, education) and race

Wald Estimates

- First stage (denominator of Wald) for two measures of fertility
- Reduced form (numerator of Wald) for several labor market outcomes
- IV (Wald) estimator, e.g. -.133 ≈ -.008/0.060; these are LATEs

Two Stage Least Squares Estimates

- OLS is quite different from IV, consistent with endogeneity (selection)
- Break same-sex into two instruments: two boys vs two girls
- Overid test p-values: many interpretations with heterogeneity

Comparison to Abadie's κ (Angrist 2001)

- Illustration of Abadie's κ (and other methods) using the AE data
- Results are almost identical to TSLS; Angrist uses this to promote TSLS
- The logic is strange: we know that in general this is not the case
- In fact, Abadie's (2003) paper has an application where it is not


Multiple unordered treatments

Estimating equation: Example with 3 field choices

- Individuals are often choosing between multiple unordered treatments:

Education types, occupations, locations, etc.

- MHE is completely silent about multiple unordered treatments
- What does 2SLS identify in this case?
- Kirkeboen et al. (2016, QJE) discuss this in the context of educational choices
- See also Kline and Walters (2016), Heckman and Pinto (2019) and Mountjoy (2019).

Estimating equation: Example with 3 field choices

- Students choose between three fields, D ∈ {0,1,2}
- Our interest is centered on how to interpret IV (and OLS) estimates of

Y = β0 + β1D1 + β2D2 + ε

- Y is observed earnings
- Dj ≡ 1(D = j) is an indicator variable that equals 1 if the individual chooses field j
- ε is the residual, which is potentially correlated with Dj

Potential earnings and field choices

- Individuals are assigned to one of three groups, Z ∈ {0,1,2}
- Linking observed and potential earnings and field choices:

$$\begin{aligned}
Y &= Y^0 + (Y^1 - Y^0) D_1 + (Y^2 - Y^0) D_2 \\
D_1 &= D_1^0 + (D_1^1 - D_1^0) Z_1 + (D_1^2 - D_1^0) Z_2 \\
D_2 &= D_2^0 + (D_2^1 - D_2^0) Z_1 + (D_2^2 - D_2^0) Z_2
\end{aligned}$$

- $Y^j$ is potential earnings if the individual chooses field j
- $Z_k = 1(Z = k)$ is an indicator variable that equals 1 if Z is equal to k
- $D_j^z \equiv 1(D^z = j)$ is an indicator variable that equals 1 if the individual chooses field j for a given value of Z

Standard IV assumptions

- ASSUMPTION 1 (EXCLUSION): $Y^{d,z} = Y^d$ for all d, z
- ASSUMPTION 2 (INDEPENDENCE): $Y^0, Y^1, Y^2, D^0, D^1, D^2 \perp Z$
- ASSUMPTION 3 (RANK): Rank E(Z′D) = 3
- ASSUMPTION 4 (MONOTONICITY): $D_1^1 \ge D_1^0$ and $D_2^2 \ge D_2^0$

Moment conditions

- IV uses the following moment conditions:

E[εZ1] = E[εZ2] = E[ε] = 0

- Expressing these conditions in potential earnings and choices gives:

$$E[(\Delta_1 - \beta_1)(D_1^1 - D_1^0) + (\Delta_2 - \beta_2)(D_2^1 - D_2^0)] = 0 \qquad (1)$$

$$E[(\Delta_1 - \beta_1)(D_1^2 - D_1^0) + (\Delta_2 - \beta_2)(D_2^2 - D_2^0)] = 0 \qquad (2)$$

where

$$\Delta_j \equiv Y^j - Y^0$$

- To understand what IV can and cannot identify, we solve these equations for β1 and β2

What IV cannot identify

PROPOSITION 1

- Suppose Assumptions 1-4 hold
- Solving equations (1)-(2) for β1 and β2, it follows that βj for j = 1, 2 is a linear combination of the following three payoffs:

1. ∆1: Payoff of field 1 compared to 0
2. ∆2: Payoff of field 2 compared to 0
3. ∆2 − ∆1 ≡ Y2 − Y1: Payoff of field 2 compared to 1

Constant effects

- Suppose Assumptions 1-4 hold.
- Solving equations (1)-(2) for β1 and β2:
- If ∆1 and ∆2 are common across all individuals (constant effects):

β1 = ∆1

β2 = ∆2

- Alternatively, move the goal post to estimating the effect of, say, field 1 versus the next best (a combination of the other two fields)
- This takes us back to a binary treatment, but it is hard to interpret and requires a strong exogeneity assumption

Data on Second Choices

- In certain circumstances, one might plausibly observe next-best options
- Kirkeboen et al. (2016) show one can point identify

$$\beta_1 = E[\Delta_1 \mid D_1^1 - D_1^0 = 1,\ D_2^0 = 0]$$

$$\beta_2 = E[\Delta_2 \mid D_2^2 - D_2^0 = 1,\ D_1^0 = 0]$$

- Kirkeboen et al. (2016) do this with Norwegian admissions data
- Students apply with a list of desired fields and universities
- They are assigned based on preference and merit rankings

Data on Second Choices

- Strategy-proof mechanism, so stated preferences should be actual preferences
- Conditional exogeneity uses a local type of argument
- Compare students with similar rankings and stated preferences j, k
- One is slightly above the cutoff and gets j; the other is slightly below and gets k
- An example of a (fuzzy) RDD; we will discuss these more soon


Weak and many instruments

Weak instruments

An instrumental variable is weak if its correlation with the included endogenous regressor is small.

- "small" depends on the inference problem at hand, and on the sample size

Why are weak instruments a problem?

- A weak instrument is a "divide by (almost) zero" problem (recall IV = reduced form/first stage)

For the usual asymptotic approximation to be "good", we would like to effectively treat the denominator as a constant.

- In other words, we would like the mean of the denominator to be much larger than its standard deviation
- Otherwise, the finite-sample distribution can be very different from the asymptotic one (even in relatively "large" samples)
- And remember that 2SLS's justification is asymptotic!

For details, see Azeem's lecture notes.
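The "divide by (almost) zero" problem is easy to see in a small Monte Carlo. The sketch below is illustrative only: the design (one continuous instrument, n = 200, error correlation 0.8, a true effect of zero, first-stage coefficients of 1.0 and 0.05) and the function names `iv_draws` and `iqr` are assumptions chosen for this example.

```python
import random


def iv_draws(pi, n=200, reps=500, rho=0.8, seed=0):
    """Simulate `reps` just-identified IV estimates; the true effect of D on Y is 0."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        zs, ds, ys = [], [], []
        for _ in range(n):
            z = rng.gauss(0.0, 1.0)
            v = rng.gauss(0.0, 1.0)                                    # first-stage error
            u = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)  # outcome error
            zs.append(z)
            ds.append(pi * z + v)   # first stage: D = pi * Z + V
            ys.append(u)            # outcome: Y = 0 * D + U
        z_bar = sum(zs) / n
        reduced_form = sum((zi - z_bar) * yi for zi, yi in zip(zs, ys))
        first_stage = sum((zi - z_bar) * di for zi, di in zip(zs, ds))
        draws.append(reduced_form / first_stage)
    return draws


def iqr(values):
    """Interquartile range based on order statistics."""
    ordered = sorted(values)
    return ordered[(3 * len(ordered)) // 4] - ordered[len(ordered) // 4]


strong = iv_draws(pi=1.0)
weak = iv_draws(pi=0.05)
print(f"IQR of IV estimates, strong first stage: {iqr(strong):.3f}")
print(f"IQR of IV estimates, weak first stage:   {iqr(weak):.3f}")
```

The interquartile range is used instead of the standard deviation because the just-identified IV estimator has very heavy tails when the first stage is near zero.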

What (not) to do about weak instruments

There is a large literature on (how to detect) weak instruments.

- Useful summary of theory and practice in Andrews et al. (2019); see also their NBER lecture slides

Standard practice is to report the usual F-stat for the instruments, and proceed as usual if F exceeds 10 (or some other arbitrary number).

Increasingly, people instead report the "effective first-stage F statistic" of Montiel Olea and Pflueger (2013).

- Robust to the worst type of heteroscedasticity, serial correlation, and clustering in the second stage

The idea behind this practice is to decide if instruments are strong (TSLS "works") or weak (use weak-instrument robust methods).

- But screening on F-statistics induces size distortions

What to do about weak instruments (cont'd)

To me, it makes more sense to

1. report and interpret the reduced form
2. think hard about why your instrument could be weak (instruments come from knowledge about treatment assignment)
3. (also) report weak-instrument robust confidence sets

Weak-instrument robust confidence sets:

- Ensure correct coverage regardless of instrument strength
- No need to screen on the first stage
- Avoid pretesting bias
- Avoid throwing away applications with valid instruments just because they are weak
- Can be informative even with weak instruments

Many instruments and overfitting

At seminars (and in referee reports), people often talk about many instruments and weak instruments as if they are the same problem.

Very confusing (at least to me).

Confusion may stem from Angrist and Krueger (1991):

- They looked at how years of schooling (S) affect wages (Y), using quarter of birth (Z) as the instrument
- Problem: quarter of birth only produces very small variation in years of schooling
- Thus people worry it is a weak instrument.

To overcome this issue, they interacted the instrument with many control variables (assumed to be exogenous).

They found that the estimate of the coefficient on years of schooling from the IV regression was very similar to that from OLS.

89/126

Many instruments and overfitting (con’t)

The re-analysis of Bound et al (1993) suggests the similarity was due to overfitting

They take the data that Angrist and Krueger (1991) used and add many randomly generated variables

- Find that running the IV regression with these variables leads to a coefficient estimate that is similar to that using OLS
- Intuitively, the problem here is that when we have many instruments, S and its first-stage fitted value Ŝ are essentially the “same”
- Since the true S is endogenous, this means that Ŝ is also endogenous
  - results in IV having a bias towards the OLS
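The overfitting mechanism can be reproduced in a simulation. This sketch does not use the Angrist and Krueger data; it is a made-up design (true effect of zero, one valid instrument, 200 pure-noise instruments, hypothetical helper names) meant only to show 2SLS drifting towards OLS as irrelevant instruments are added.

```python
import numpy as np

def ols_slope(y, s):
    """OLS slope of y on s (variables are mean zero by construction, so no constant)."""
    return (s @ y) / (s @ s)

def tsls_slope(y, s, Z):
    """2SLS slope: replace s by its first-stage fitted value from instruments Z."""
    coef, *_ = np.linalg.lstsq(Z, s, rcond=None)
    s_hat = Z @ coef
    return (s_hat @ y) / (s_hat @ s)

def one_draw(rng, n=500, n_junk=200, beta=0.0, first_stage=0.3):
    z = rng.normal(size=n)                        # one valid instrument
    u = rng.normal(size=n)                        # confounder
    s = first_stage * z + u + rng.normal(size=n)  # endogenous regressor
    y = beta * s + u + rng.normal(size=n)
    junk = rng.normal(size=(n, n_junk))           # randomly generated "instruments"
    Z_many = np.column_stack([z, junk])
    return (
        ols_slope(y, s),
        tsls_slope(y, s, z[:, None]),             # just-identified IV
        tsls_slope(y, s, Z_many),                 # IV with many junk instruments
    )

rng = np.random.default_rng(2)
draws = np.array([one_draw(rng) for _ in range(200)])
ols_med, iv_med, many_med = np.median(draws, axis=0)
print(f"true effect 0.00 | OLS {ols_med:.2f} | IV {iv_med:.2f} | IV + junk {many_med:.2f}")
```

With 201 instruments and 500 observations, the first stage fits the noise in S, so Ŝ inherits most of the endogeneity of S.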

90/126

Many instruments and overfitting (con’t)

In response to the many-instruments problem and overfitting, there is recent work on how to select the “optimal” instruments (e.g. using Lasso)

- Not clear what optimal means with heterogeneous effects
- In most settings, it is hard to find even one good instrument
- Thus, many instruments usually involve implicit exclusion restrictions (from interacting X and Z but not S and Z)
- Effectively solving an estimation/inference issue by violating the exclusion restriction

91/126

Taking stock

92/126

Summary

IV
- The IV estimand in the binary D, binary Z case is the LATE
- Easy to interpret as the average effect for compliers
- Could be relevant for a policy intervention that affects compliers

Extensions
- 2SLS used in general cases → interpretation is complicated
- At best, a weighted average of several different (complier) groups
- When would these weights be useful to inform a counterfactual?

Reverse engineering
- These results are motivated by a backward thought process
- Start with a common estimator, then interpret the estimand
- Why not start with a parameter of interest → create an estimator?
- More on that later!
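The first point of the summary can be checked numerically. The following sketch simulates a hypothetical population of compliers, never-takers and always-takers (the shares and effect sizes are made up) and compares the Wald ratio with the average effect among compliers and with the population average effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical population shares; monotonicity holds because there are no defiers.
types = rng.choice(["complier", "never", "always"], size=n, p=[0.5, 0.3, 0.2])
z = rng.integers(0, 2, size=n)                  # randomly assigned binary instrument
d0 = (types == "always").astype(int)            # treatment status if Z = 0
d1 = (types != "never").astype(int)             # treatment status if Z = 1
d = np.where(z == 1, d1, d0)

# Heterogeneous effects Y1 - Y0 that differ by type (made-up numbers).
mean_effect = np.select(
    [types == "complier", types == "never", types == "always"], [1.0, 3.0, 0.0]
)
effect = mean_effect + rng.normal(scale=0.5, size=n)
y0 = rng.normal(size=n) + 1.0 * (types == "always")   # types also differ in levels
y = y0 + effect * d                                    # exclusion: Z enters only via D

reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage
late = effect[types == "complier"].mean()
ate = effect.mean()
print(f"Wald = {wald:.2f}, complier effect = {late:.2f}, ATE = {ate:.2f}")
```

In this design the Wald ratio tracks the complier effect and not the ATE, because the effects for never-takers and always-takers never enter the reduced form.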

93/126

Practical advice when doing IV

1. Motivate your instruments

- Motivate exclusion and independence
  - How is Z generated? What do I need to control for to make it as good as randomly assigned?
  - Why is Z not in the outcome equation? What are the distinct channels through which Z can affect Y?
- Specification: what control variables should be included?
  - Conditional exclusion restrictions can be more credible
  - Assess by regressing the instrument on other pre-determined variables
- Interpretation: what is the complier group?
  - Is the instrument policy relevant?

94/126

Practical advice when doing IV

2. Check your instruments

- Always report the first stage and
  - discuss whether the magnitude and signs are as expected
  - report the (relevant) F-statistic on instruments
    - larger is better (rule-of-thumb: F > 10... but who knows what’s large enough)
  - consider also reporting weak instrument robust confidence intervals
- Inspect the reduced-form regression of dependent variables on instruments
  - both first stage and reduced form; sign, magnitude, etc.
  - remember that the reduced form is proportional to the causal effect of interest
  - the reduced form is unbiased (and not only consistent) because this is OLS

95/126

How do I find instruments?

- There is no "recipe" that guarantees success
- But often necessary ingredients: detailed knowledge of
  1. the economic mechanisms,
  2. institutions determining the endogenous regressor, and
  3. restrictions from economic theory
- Examples:
  1. Naturally occurring random events (like weather, twin birth, etc.)
  2. Policy reforms (which conditional on something are as good as random)
  3. Random assignment to individuals deciding treatment (e.g. judges)
  4. Cutoff rules for admission to programs (more next week on using such discontinuities)
- Randomized experiments with imperfect compliance
  - gives a LATE interpretation of the RCT

96/126

Application: Judge design

97/126

Family welfare cultures: Opposing views

Two opposing views:

1. Welfare use reinforces itself through the family, because parents on welfare may
   - Provide information about the program to their children
   - Reduce stigma of participation
   - Invest differentially in child development
2. The determinants of health and poverty are correlated across generations, so that
   - Child welfare dependency is associated with, but not caused by, a parent’s use of welfare

98/126

What do we do?

1. We investigate the existence and importance of family welfare cultures
   - In a setting with no correlated unobservables
2. We explore the breadth and nature of welfare cultures
   - Spillover effects in other social networks
   - Explore channels of welfare culture
3. We illustrate the policy relevance of intergenerational spillovers
   - Use estimates to simulate direct and indirect effects of policy

99/126

Empirical Challenges: Statistical Model

- Characterize child’s latent demand/qualification ($P^{c*}_i$) as a function of
  1. parent’s actual participation ($P^p_i$)
  2. other observed traits ($x^c_i$)
  3. unobserved taste/health/etc. ($\varepsilon^c_i$)

$$P^{c*}_i = \alpha^c + \beta^c P^p_i + \delta^c x^c_i + \varepsilon^c_i \tag{3}$$

- Similar equation for parents and grandparents

$$P^{p*}_i = \alpha^p + \beta^p P^g_i + \delta^p x^p_i + \varepsilon^p_i \tag{4}$$

100/126

Empirical Challenges: Sources of Bias

- Substitution of parent’s choice yields

$$P^{c*}_i = \alpha^c + \beta^c \, \mathbf{1}(\alpha^p + \beta^p P^g_i + \delta^p x^p_i + \varepsilon^p_i > 0) + \delta^c x^c_i + \varepsilon^c_i \tag{5}$$

where the child participates if $P^{c*}_i > 0$

1. This equation illustrates the problem that arises if unobservables are correlated across generations:

$$\mathrm{cov}(\varepsilon^p_i, \varepsilon^c_i \mid x^c_i, x^p_i) \neq 0$$

2. Similarly, if unobservables are common to grandparent and child:

$$\mathrm{cov}(\varepsilon^g_i, \varepsilon^c_i \mid x^c_i, x^p_i, x^g_i) \neq 0$$

→ Family welfare culture parameter will be biased

101/126

Empirical Challenges: Correlations and Bias

Table: OLS Estimates of Intergenerational Welfare Transmission

Dependent variable: Child DI use ($P^c_i$)

                               (1)          (2)          (3)
Parent DI use ($P^p_i$)        0.036***     0.035***     0.025***
                               (0.001)      (0.001)      (0.001)
Grandparent DI use ($P^g_i$)                0.005***     0.004***
                                            (0.000)      (0.000)
Additional controls?           NO           NO           YES
Obs.                           1,022,507    1,022,507    1,022,507
Dep. mean                      0.03         0.03         0.03

Notes: Data come from 2008 and are restricted to parents age 60 or below with children age 23 and above and a grandparent who is alive during the period 1967-2010. DI use in each generation is defined to be equal to 1 if the individual is currently receiving DI benefits (except for grandparents, for whom it is defined as having ever received DI benefits). Column (3) controls flexibly for child, parent and grandparent characteristics (age, gender, education, foreign born, marital status, earnings history, and region fixed effects). Standard errors clustered at the family level.

102/126

Research design and setting

- Research design
  1. Exploit a policy which randomizes the probability that parents receive welfare
  2. Use a unique source of population panel data, linking welfare use of members in social networks
- Setting: Disability insurance (DI) system in Norway

103/126

Identification: Random assignment of judges

- Denied DI applicants may decide to appeal the decision:
  1. Cases are randomly assigned to judges
  2. Some appeal judges are systematically more lenient
  ⇒ random variation in the probability a parent receives DI
- Exploit this exogenous variation to examine intergenerational links
- Since variation is driven by difficult-to-verify cases
  - Randomization picks out the more marginal applicants
- Policy relevant group
  1. Driving the recent rise in DI rolls
  2. Affected by policy proposals to tighten screening

104/126

Research design: Baseline Regression Model

- First and second stage of IV model:

$$P^p_i = \alpha^p + \gamma^p Z^p_i + X_i \delta^p + \varepsilon^p_i \tag{6}$$

$$P^c_i = \alpha^c + \beta^c P^p_i + X_i \delta^c + \varepsilon^c_i \tag{7}$$

- Due to randomization, $Z^p_i$ (judge leniency) $\perp$ $\varepsilon^c_i$ and $\varepsilon^p_i$
  - Correlated unobservables do not bias the estimate
  - $X_i$ always includes year of appeal × department fixed effects
- First stage: $\gamma^p$ identified from a regression of $P^p_i$ on $Z^p_i$
- Reduced form: regression of $P^c_i$ on $Z^p_i$
- Second stage: intergenerational transmission coefficient $\beta^c$ given by the ratio of reduced form and first stage
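The ratio logic of the judge design can be sketched on simulated data. Everything here is hypothetical: the data-generating process, the true effect of 0.10, the omission of the controls in X, and in particular the construction of leniency as a leave-one-out allowance rate, which is a common choice in judge designs but is an assumption here rather than something stated on the slide.

```python
import numpy as np

def slope(x, y):
    """OLS slope from a regression of y on x and a constant."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(4)
n_cases, n_judges = 200_000, 100
beta_c = 0.10                                        # assumed true effect of parent DI on child DI

judge = rng.integers(0, n_judges, size=n_cases)      # random assignment of cases to judges
strictness = rng.normal(scale=0.5, size=n_judges)    # hypothetical judge-specific strictness
eps = rng.normal(size=n_cases)                       # unobservable shared across generations

parent_di = (eps - strictness[judge] + rng.normal(size=n_cases) > 0).astype(float)
child_prob = np.clip(0.3 + beta_c * parent_di + 0.05 * eps, 0.0, 1.0)
child_di = (rng.random(n_cases) < child_prob).astype(float)

# Leave-one-out leniency: the judge's allowance rate in all *other* cases.
allowed_by_judge = np.bincount(judge, weights=parent_di, minlength=n_judges)
cases_by_judge = np.bincount(judge, minlength=n_judges)
leniency = (allowed_by_judge[judge] - parent_di) / (cases_by_judge[judge] - 1)

first_stage = slope(leniency, parent_di)             # P^p on Z^p
reduced_form = slope(leniency, child_di)             # P^c on Z^p
iv = reduced_form / first_stage                      # ratio of reduced form to first stage
ols = slope(parent_di, child_di)                     # contaminated by eps

print(f"first stage = {first_stage:.2f}, reduced form = {reduced_form:.3f}")
print(f"IV = {iv:.3f}, OLS = {ols:.3f}, truth = {beta_c:.3f}")
```

Because `eps` raises both parent and child participation, OLS overstates the transmission effect, while the leniency-based ratio stays close to the assumed value.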

105/126

Testing Random Assignment

                               Case Allowed              Judge Leniency
Age                            0.0054***   (0.0009)      0.0003*    (0.0002)
Female                         0.0109      (0.0096)      0.0002     (0.0019)
Married                        0.0041      (0.0076)      0.0013     (0.0019)
Foreign born                   -0.0271***  (0.0114)      0.0009     (0.0025)
High school degree             -0.01670*** (0.0070)      -0.0002    (0.0017)
Some college                   0.01317*    (0.0070)      0.00041    (0.0014)
College graduate               0.02282     (0.0161)      -0.00073   (0.0033)
One child                      -0.1033***  (0.0199)      0.00389    (0.0094)
Two children                   -0.0052     (0.0087)      -0.00097   (0.0020)
Three or more children         -0.0159     (0.0132)      0.00103    (0.0016)
Previous earnings              -0.0355***  (0.0146)      0.00319    (0.0021)
Years of work                  0.0000***   (0.0000)      0.0000     (0.0000)
Mental disorders               0.0357***   (0.0105)      0.00005    (0.0038)
Musculoskeletal disorders      0.0026      (0.0086)      0.0018     (0.00256)

Test for joint significance    F: 9.25, p-value: .001    F: .77, p-value: .723

106/126

Graphical evidence: first stage

107/126

Graphical evidence: reduced form

108/126

Time profile in IV estimates

109/126

Why welfare cultures matter for policy

- Intergenerational links could be important for policy design
- In particular, making the disability screening more stringent would:
  1. Directly reduce DI participation among parents
  2. Further reduce DI participation in the next generation
- Policy simulation
  1. Make judges 1/5 std dev stricter (10% less likely to grant an appeal on average)
  2. Combine with estimates of how parent’s judge leniency affects parent and child participation over time

110/126

Direct and indirect effects of stringent screening

111/126

Application combining theory and instrument

112/126

The Model: Supply and Demand

- Quantity traded and price are equilibrium outcomes from a system of simultaneous equations:

$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \nu^S_i$$

$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \nu^D_i$$

- Where:
  - $i$ indexes different markets, $S$ indexes supply, $D$ indexes demand
  - $q$ is log quantity, $p$ is log price
  - $X$ is a vector of (pre-determined) observable determinants of demand and supply (including a constant term)
  - $\{\nu^S, \nu^D\}$ are unobservable determinants of supply and demand
- Target parameters: $\varepsilon^S$ and $-\varepsilon^D$

113/126

We only observe the equilibrium, not supply/demand

Solid and dashed lines represent two different supply/demand systems with different elasticities ($\varepsilon^D_1 \neq \varepsilon^D_2$ and $\varepsilon^S_1 \neq \varepsilon^S_2$), yet the observed equilibrium can be rationalized by both systems

114/126

Endogeneity

Endogeneity - equilibria across multiple markets i ∈ {1,2,3} do not trace out either supply or demand

115/126

Exclusion Restrictions - Supply shifter

- Assume that we observe a variable ($Z^S_i$) that enters the supply equation but is excluded from the demand equation:

$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \theta^S Z^S_i + \nu^S_i$$

$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \nu^D_i$$

- We further assume:
  - $\theta^S \neq 0$, so that quantity supplied is a nontrivial function of $Z^S_i$
  - $Z^S_i \perp\!\!\!\perp \nu^S_i, \nu^D_i \mid X_i$

116/126

Exclusion Restrictions - Supply shifter

Using variation in $Z^S_i$ identifies the elasticity of demand by shifting supply along the demand curve.

117/126

Exclusion Restrictions - Supply and Demand shifters

- Assume that in addition to the supply shifter ($Z^S_i$), we observe a variable ($Z^D_i$) that enters the demand equation but is excluded from the supply equation:

$$q^S_i = \varepsilon^S p_i + \Gamma^S X_i + \theta^S Z^S_i + \nu^S_i$$

$$q^D_i = \varepsilon^D p_i + \Gamma^D X_i + \theta^D Z^D_i + \nu^D_i$$

- We further assume:
  - $\theta^D \neq 0$, so that quantity demanded is a nontrivial function of $Z^D_i$
  - $Z^D_i \perp\!\!\!\perp \nu^S_i, \nu^D_i \mid X_i$

118/126

Exclusion Restrictions - Supply and Demand shifters

Variation in $Z^D_i$ (holding $Z^S_i$ constant) identifies the elasticity of supply.
Variation in $Z^S_i$ (holding $Z^D_i$ constant) identifies the elasticity of demand.

119/126

Supply and Demand Shifters - Reduced Form

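- Solving equations for the equilibrium quantity and price on each market $i$, we obtain:

$$q_i = \frac{\varepsilon^S \Gamma^D - \varepsilon^D \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S \theta^D Z^D_i - \varepsilon^D \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}$$

$$p_i = \frac{\Gamma^D - \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\theta^D Z^D_i - \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$

- Denote by $q^*$ and $p^*$ the residual variation in $q$ and $p$ after partialling out variation in $X_i$.
- Note:

$$q^*_i = \frac{\varepsilon^S \theta^D Z^D_i - \varepsilon^D \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D} \quad \text{and} \quad p^*_i = \frac{\theta^D Z^D_i - \theta^S Z^S_i}{\varepsilon^S - \varepsilon^D} + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$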

120/126

IV estimates

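$$\beta^{IV,D} = \frac{\mathrm{Cov}(q^*_i, Z^S_i)}{\mathrm{Cov}(p^*_i, Z^S_i)} = \frac{-\varepsilon^D \theta^S}{-\theta^S} = \varepsilon^D$$

$$\beta^{IV,S} = \frac{\mathrm{Cov}(q^*_i, Z^D_i)}{\mathrm{Cov}(p^*_i, Z^D_i)} = \frac{\varepsilon^S \theta^D}{\theta^D} = \varepsilon^S$$

- IV recovers the elasticities. In general, we need one instrument for each elasticity.
- An interesting exception: when the tax rate is an instrument ⇒ a single instrument (tax rate) recovers both elasticities (Gavrilova, Zoutman and Hopland 2018)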
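The two covariance ratios can be verified on simulated markets. This is a sketch with made-up parameter values, with the controls X omitted and with the two shifters drawn independently of each other; equilibrium price and quantity are generated from the reduced form of the supply and demand system.

```python
import numpy as np

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

rng = np.random.default_rng(5)
n = 100_000
eps_s, eps_d = 1.5, -0.8          # hypothetical supply and demand elasticities
theta_s, theta_d = 1.0, 1.0       # hypothetical effects of the shifters

z_s, z_d = rng.normal(size=n), rng.normal(size=n)     # supply and demand shifters
nu_s, nu_d = rng.normal(size=n), rng.normal(size=n)   # unobserved shocks

# Equilibrium log price and log quantity (X omitted for simplicity).
p = (theta_d * z_d - theta_s * z_s + nu_d - nu_s) / (eps_s - eps_d)
q = eps_s * p + theta_s * z_s + nu_s                  # on the supply curve

beta_iv_d = cov(q, z_s) / cov(p, z_s)   # supply shifter traces out demand
beta_iv_s = cov(q, z_d) / cov(p, z_d)   # demand shifter traces out supply
beta_ols = cov(q, p) / np.var(p, ddof=1)

print(f"IV with Z^S: {beta_iv_d:.2f} (demand elasticity {eps_d})")
print(f"IV with Z^D: {beta_iv_s:.2f} (supply elasticity {eps_s})")
print(f"OLS of q on p: {beta_ols:.2f} (neither elasticity)")
```

The OLS slope mixes the two curves, which is the endogeneity problem described earlier; each shifter isolates movements along the other curve.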

121/126

Using tax rates as an instrument

- Assume that there is an ad valorem tax rate $t_i$ imposed on producers. We define $\tau_i = \log(1 + t_i)$.
- We also denote by $p^c_i$ the price paid by consumers and by $p^s_i = p^c_i - \tau_i$ the price received by suppliers.
- We assume $\tau_i \perp\!\!\!\perp \nu^S_i, \nu^D_i \mid X_i$
- Because the tax is on producers, it does not enter the demand equation ⇒ $\varepsilon^D$ is identified via the standard exclusion restriction
- Economic theory generates an additional exclusion restriction: the Ramsey Exclusion Restriction (see GZH 2018)

122/126

Identification of Demand

The tax is a “supply shifter" - it allows identification of εD

123/126

Tax Rate as an Instrument

- The system of equations becomes:

$$q^D_i = \varepsilon^D p^c_i + \Gamma^D X_i + \nu^D_i$$

$$q^S_i = \underbrace{\varepsilon^S p^c_i + \underbrace{\theta^S Z^S_i}_{= -\varepsilon^S \tau_i}}_{= \varepsilon^S (p^c_i - \tau_i)} + \Gamma^S X_i + \nu^S_i$$

- Note: we impose an additional restriction, extremely common in public finance, that suppliers respond to the tax the same way they would respond to a cost shock ($\theta^S = -\varepsilon^S$). This directly follows from the assumption of profit maximization.

124/126

Tax Rate as an Instrument - Reduced Form

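- Solving the previous system of equations for the equilibrium quantity and price on each market $i$, we obtain:

$$q_i = \frac{\varepsilon^S \Gamma^D - \varepsilon^D \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S \varepsilon^D}{\varepsilon^S - \varepsilon^D} \tau_i + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}$$

$$p^c_i = \frac{\Gamma^D - \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i + \frac{\varepsilon^S}{\varepsilon^S - \varepsilon^D} \tau_i + \frac{\nu^D_i - \nu^S_i}{\varepsilon^S - \varepsilon^D}$$

- Denote by $q^*$ and $p^{c*}$ the residual variation in $q$ and $p^c$ after partialling out variation in $X_i$ ($p^{s*}$ is defined analogously for $p^s$).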
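For readers who want the intermediate step, the reduced form with the tax can be derived by equating demand and supply under the restriction $\theta^S Z^S_i = -\varepsilon^S \tau_i$. This is a worked sketch using only the symbols defined on the preceding slides:

```latex
\begin{align*}
\varepsilon^D p^c_i + \Gamma^D X_i + \nu^D_i
  &= \varepsilon^S (p^c_i - \tau_i) + \Gamma^S X_i + \nu^S_i \\
(\varepsilon^S - \varepsilon^D)\, p^c_i
  &= (\Gamma^D - \Gamma^S) X_i + \varepsilon^S \tau_i + \nu^D_i - \nu^S_i \\
q_i = \varepsilon^D p^c_i + \Gamma^D X_i + \nu^D_i
  &= \frac{\varepsilon^S \Gamma^D - \varepsilon^D \Gamma^S}{\varepsilon^S - \varepsilon^D} X_i
   + \frac{\varepsilon^S \varepsilon^D}{\varepsilon^S - \varepsilon^D} \tau_i
   + \frac{\varepsilon^S \nu^D_i - \varepsilon^D \nu^S_i}{\varepsilon^S - \varepsilon^D}
\end{align*}
```

Dividing the second line by $\varepsilon^S - \varepsilon^D$ gives the expression for $p^c_i$, and the last line substitutes that price into the demand equation.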

125/126

Tax Rate as an instrument - IV estimate

$$\beta^{IV,D}_\tau = \frac{\mathrm{Cov}(q^*_i, \tau_i)}{\mathrm{Cov}(p^{c*}_i, \tau_i)} = \varepsilon^D$$

- This directly follows from slide 102 and the fact that the tax is excluded from the demand equation (Standard Exclusion Restriction)
- Can we identify more than just $\varepsilon^D$?
- Yes, it is the role of the additional restriction that suppliers respond to the tax the same way they would respond to an increase in marginal cost ($\theta^S = -\varepsilon^S$). ⇒ Key implication is that the passthrough of the tax (to consumers) is $\frac{dp^c}{d\tau} = \frac{\varepsilon^S}{\varepsilon^S - \varepsilon^D}$

126/126

Tax Rate as an instrument - Identifying εS

- Because 1) $\varepsilon^D$ is identified and 2) we can estimate the passthrough $\frac{dp^c}{d\tau}$, which is a function of the two elasticities, we can recover $\varepsilon^S$.
- GZH 2018 recommend using the following IV estimator:

$$\beta^{IV,S}_\tau = \frac{\mathrm{Cov}(q^*_i, \tau_i)}{\mathrm{Cov}(p^{s*}_i, \tau_i)} = \varepsilon^S$$
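A simulation can confirm that a single tax instrument recovers both elasticities. This is a sketch under made-up assumptions (elasticities, tax rates uniform between 0 and 50%, no controls X) and it imposes the restriction $\theta^S = -\varepsilon^S$ when generating the data; it is not the GZH 2018 implementation.

```python
import numpy as np

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

rng = np.random.default_rng(6)
n = 100_000
eps_s, eps_d = 1.5, -0.8                   # hypothetical true elasticities
t = rng.uniform(0.0, 0.5, size=n)          # ad valorem tax rates (made up)
tau = np.log1p(t)                          # tau = log(1 + t)
nu_s = rng.normal(scale=0.2, size=n)       # unobserved supply shocks
nu_d = rng.normal(scale=0.2, size=n)       # unobserved demand shocks

# Equilibrium with X omitted: consumer price, producer price, and quantity.
p_c = (eps_s * tau + nu_d - nu_s) / (eps_s - eps_d)
p_s = p_c - tau
q = eps_d * p_c + nu_d                     # on the demand curve

beta_iv_d = cov(q, tau) / cov(p_c, tau)    # uses the consumer price
beta_iv_s = cov(q, tau) / cov(p_s, tau)    # uses the producer price
passthrough = cov(p_c, tau) / np.var(tau, ddof=1)

print(f"demand elasticity: {beta_iv_d:.2f} (true {eps_d})")
print(f"supply elasticity: {beta_iv_s:.2f} (true {eps_s})")
print(f"passthrough: {passthrough:.2f} (true {eps_s / (eps_s - eps_d):.2f})")
```

The same reduced-form covariance of quantity with the tax appears in both numerators; only the price in the denominator changes, consumer price for demand and producer price for supply.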
