
Efficient Estimation for Staggered Rollout Designs

Jonathan Roth

Microsoft

Pedro H. C. Sant’Anna

Microsoft and Vanderbilt University

July 2021

Introduction

• In many settings, treatments are rolled out to different units at different points in time.

• Social policies may be introduced in different locations at different times.

• Companies may introduce new features or marketing campaigns to customers at different times.

• Researchers can often ensure that treatment timing is random by design.

• Alternatively, researchers might argue that treatment timing is quasi-random or as-good-as-random (which is fairly common in Economics).

• This paper studies efficient estimation of causal effects in such (quasi-)randomized

staggered rollout designs


What is the status-quo method in these cases?


Current Practice

• Two-way fixed effects (TWFE) methods are commonly used, but we now know they may not

recover interpretable treatment effect parameters (Borusyak and Jaravel, 2017; Athey and Imbens,

2021; Goodman-Bacon, 2018; de Chaisemartin and D’Haultfœuille, 2020; Sun and Abraham, 2020)

• Alternative DiD estimators that give more sensible estimands under heterogeneity have been

proposed (Callaway and Sant’Anna, 2020; Sun and Abraham, 2020; de Chaisemartin and D’Haultfœuille,

2020)

• All of these procedures exploit different parallel trends assumptions, not random timing.

• Although technically weaker, parallel trends assumptions are routinely motivated by arguing that

treatment timing is “quasi-random”.

• In the absence of random treatment timing, DiD estimators may be sensitive to functional form

restrictions (Roth and Sant’Anna, 2021).


What we do in this paper

• We introduce a design-based framework formalizing the notion of random treatment

timing (Imbens and Rubin, 2015; Athey and Imbens, 2021; Abadie, Athey, Imbens and Wooldridge, 2020;

Bojinov, Rambachan and Shephard, 2020; Rambachan and Roth, 2020; Xu, 2021)

• Consider estimation of a large class of causal parameters that aggregate average effects

across periods and cohorts

• Solve for the efficient estimator in a class of estimators that nests existing approaches
  • Sample analog plus linear adjustment for pre-treatment differences in outcome

• In our setting, pre-treatment outcomes play a similar role to fixed covariates in a

randomized experiment (Freedman, 2008b,a; Lin, 2013; Li and Ding, 2017; Imbens and Rubin, 2015)

• We provide both t-based and permutation-test based methods for randomization inference.


Does this matter in practice?


Application - Background

• Reducing police misconduct and use of force is an important policy objective.

• Wood, Tyler and Papachristos (2020a, PNAS) studied a randomized rollout of a procedural justice training program for police officers
  • Emphasized respect, neutrality, and transparency in the exercise of authority

• Original study found large & significant reductions in complaints/use of force

• But we discovered a statistical error in the original analysis!
  • They compared cohorts without normalizing by cohort size

• In Wood, Tyler, Papachristos, Roth and Sant’Anna (2020b), we re-analyzed the data using the method of Callaway and Sant’Anna (2020)
  • No significant impacts on complaints; borderline significant effects on force; but CIs for all outcomes were wide


Table 1: Estimates and 95% CIs as a Percentage of Pre-treatment Means

Note: This table shows the pre-treatment means for the three outcomes. It also displays the

estimates and 95% CIs in Figure ?? as percentages of these means, as well as the p-value from

a Fisher Randomization Test (FRT). The final column shows the ratio of the CI length using

the CS estimator relative to the plug-in efficient estimator.


Framework


• Finite population of units: $i = 1, \dots, N$

• $T$ periods: $t = 1, \dots, T$

• Unit $i$ is first treated at period $G_i \in \mathcal{G} \subset \{1, \dots, T\} \cup \{\infty\}$
  • $G_i = \infty$ denotes never treated.

• Treatment is an “absorbing state.”

• Potential outcomes: $Y_{i,t}(g)$ = unit $i$’s outcome in period $t$ if first treated at $g$

• We observe $Y_{i,t} = \sum_g 1[G_i = g] \, Y_{i,t}(g)$

• Adopt a design-based framework: $Y_{i,t}(\cdot)$ and $N_g = \sum_i 1[G_i = g]$ are treated as fixed; $G$ is stochastic.


Two Key Assumptions

Assumption 1 (Random treatment timing)

Let $G = (G_1, \dots, G_N)$. Then $P(G = g) = \left( \prod_{g \in \mathcal{G}} N_g! \right) / N!$ if $\sum_i 1[g_i = g] = N_g$ for all $g$, and zero otherwise.

• Any permutation of treatment timing that preserves group size is equally likely

Assumption 2 (No anticipation)

For all $i$, $t$ and $g, g' > t$: $Y_{i,t}(g) = Y_{i,t}(g')$.

• No Anticipation may fail if treatment timing is announced in advance (Malani and Reif, 2015)

• Note that No Anticipation allows for arbitrary treatment effect dynamics once treatment

has occurred.
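A minimal sketch of what Assumption 1 says, using only the Python standard library. The toy design (four units, two treated in period 2, two never treated) and the helper names `draw_timing` and `assignment_probability` are illustrative assumptions, not part of the paper.

```python
import itertools
import math
import random
from collections import Counter

INF = math.inf  # stands in for the "never treated" cohort

def draw_timing(cohort_sizes, rng):
    """Draw one timing vector G by shuffling a fixed multiset of cohort labels."""
    labels = [g for g, n_g in cohort_sizes.items() for _ in range(n_g)]
    rng.shuffle(labels)
    return tuple(labels)

def assignment_probability(cohort_sizes):
    """P(G = g) for any g that respects the cohort sizes: (prod_g N_g!) / N!."""
    n = sum(cohort_sizes.values())
    numerator = math.prod(math.factorial(n_g) for n_g in cohort_sizes.values())
    return numerator / math.factorial(n)

# Hypothetical toy design: N = 4, two units treated in period 2, two never treated.
sizes = {2: 2, INF: 2}
support = set(itertools.permutations([2, 2, INF, INF]))  # distinct timing vectors
prob = assignment_probability(sizes)
one_draw = draw_timing(sizes, random.Random(0))
print(len(support), prob, one_draw)
```

Every timing vector in `support` keeps the cohort sizes fixed, and the probabilities over the support sum to one.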


Special Case: 2-Period Model

• Suppose T = 2 and G = {2,∞}, so some units are treated in period 2 and some are never

treated.

• Under Randomization and No Anticipation, this is analogous to a cross-sectional random experiment with $Y_{i,t=2}$ as the outcome and $Y_{i,t=1}$ playing the role of a fixed covariate.
  • $Y_i = Y_{i,t=2}$
  • $X_i = Y_{i,t=1} \equiv Y_{i,t=1}(\infty)$
  • $D_i = 1[G_i = 2]$


Causal Parameters of Interest


Estimands

• With staggered treatment timing, there are many possible causal estimands.

Consider a flexible class of possible aggregations.

• Building block: Following Athey and Imbens (2021), let $\tau_{t,gg'}$ be the average effect on the outcome in period $t$ of switching treatment from $g'$ to $g$:

$$\tau_{t,gg'} = \frac{1}{N} \sum_i \left[ Y_{i,t}(g) - Y_{i,t}(g') \right].$$

• Consider a (scalar) estimand that aggregates these building blocks:

$$\theta = \sum_{t,g,g'} a_{t,gg'} \, \tau_{t,gg'}$$


Estimands in the Staggered Case

• In the staggered case, there are many possible ways of aggregating effects across cohorts

and time periods.

• One useful parameter is $ATE(t, g)$, the average effect at time $t$ of being treated at $g$ relative to being never treated:

$$ATE(t, g) = \frac{1}{N} \sum_i \left[ Y_{i,t}(g) - Y_{i,t}(\infty) \right].$$

• Following Callaway and Sant’Anna (2020), one might also be interested in summary

parameters that are weighted averages of ATE (t, g) along different dimensions.

• All these aggregations fit into our setup
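To make the estimands concrete, here is a small sketch that computes ATE(t, g) and one aggregate θ from a made-up table of potential outcomes. The numbers, the three-unit population, and the equal weights are hypothetical; in practice the full table is never observed.

```python
import math

INF = math.inf  # "never treated"

# Hypothetical potential outcomes Y[i][t][g] for N = 3 units, T = 2 periods,
# cohorts g in {1, 2, INF}. No Anticipation holds: at t = 1, g = 2 equals g = INF.
Y = [
    {1: {1: 3.0, 2: 1.0, INF: 1.0}, 2: {1: 4.0, 2: 3.0, INF: 2.0}},
    {1: {1: 2.0, 2: 2.0, INF: 2.0}, 2: {1: 5.0, 2: 2.0, INF: 2.0}},
    {1: {1: 4.0, 2: 3.0, INF: 3.0}, 2: {1: 6.0, 2: 5.0, INF: 3.0}},
]

def tau(Y, t, g, g_prime):
    """Average effect in period t of switching first-treatment date from g' to g."""
    return sum(unit[t][g] - unit[t][g_prime] for unit in Y) / len(Y)

def ate(Y, t, g):
    """ATE(t, g): effect at time t of being treated at g versus never treated."""
    return tau(Y, t, g, INF)

def theta(Y, weights):
    """Scalar estimand: sum of a_{t,gg'} * tau_{t,gg'} over the supplied keys."""
    return sum(a * tau(Y, t, g, gp) for (t, g, gp), a in weights.items())

# One possible aggregation: equal weights on all post-treatment ATE(t, g).
weights = {(1, 1, INF): 1 / 3, (2, 1, INF): 1 / 3, (2, 2, INF): 1 / 3}
theta_value = theta(Y, weights)
print(ate(Y, 1, 1), ate(Y, 2, 1), ate(Y, 2, 2), theta_value)
```

Other summary parameters (for example, event-study style averages) correspond to different choices of `weights`.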


Class of estimators we consider


Estimators

• Define $\hat\theta_0$ to be the sample plug-in estimator for $\theta$:

$$\hat\theta_0 = \sum_{t,g,g'} a_{t,gg'} \, \hat\tau_{t,gg'},$$

where $\hat\tau_{t,gg'} = \bar Y_{tg} - \bar Y_{tg'}$ and $\bar Y_{tg}$ is the sample mean of $Y_{i,t}$ for cohort $g$.

• We will consider the class of estimators of the form

$$\hat\theta_\beta = \hat\theta_0 - \hat X' \beta,$$

where $\hat X$ is a vector guaranteed to be mean-zero by No Anticipation.

• Formally, each element of $\hat X$ aggregates differences between groups before either is treated:

$$\hat X_j = \sum_{(t,g,g') :\, g, g' > t} b^j_{t,gg'} \, \hat\tau_{t,gg'}.$$


Example: 2-Period Model

• Set-up: Two periods (T = 2). Units treated in period 2 or never (G = {2,∞})

• Target parameter: Average treatment effect (ATE) in period 2:

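$$\theta = \tau_{2,2\infty} = \frac{1}{N} \sum_i \left[ Y_{i,t=2}(2) - Y_{i,t=2}(\infty) \right]$$

• Class of estimators:
  • $\hat\theta_0$ is the simple difference in means at $t = 2$: $\bar Y_{t=2,g=2} - \bar Y_{t=2,g=\infty}$
  • $\hat X$ is the pre-treatment difference in means: $\bar Y_{t=1,g=2} - \bar Y_{t=1,g=\infty}$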

Example: 2-Period Model (continued)

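• Our proposed estimator is of the form

$$\hat\theta_\beta = \hat\theta_0 - \hat X' \beta.$$

• The difference-in-differences estimator is

$$\hat\theta_1 = \hat\theta_0 - \hat X = (\bar Y_{t=2,g=2} - \bar Y_{t=2,g=\infty}) - (\bar Y_{t=1,g=2} - \bar Y_{t=1,g=\infty}),$$

corresponding to $\hat\theta_\beta$ with $\beta = 1$.

• In this simple 2x2 case, $\hat\theta_\beta = \hat\theta_0 - \hat X' \beta$ is isomorphic to the class of regression-adjusted estimators in Freedman (2008b,a); Lin (2013)

• They consider $\hat\tau(\beta_0, \beta_1) = \hat\theta_0 - \beta_1' X_1 + \beta_0' X_0 = \hat\theta_\beta$, for $\beta = \frac{N_1}{N} \beta_0 + \frac{N_0}{N} \beta_1$.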
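The two-period class can be sketched in a few lines. The data below are made up, and the function name `theta_beta` is an illustrative choice; the point is only that β = 0 gives the difference in means and β = 1 gives difference-in-differences.

```python
from statistics import mean

def theta_beta(y1_treated, y2_treated, y1_control, y2_control, beta):
    """theta_hat_beta = theta_hat_0 - beta * X_hat in the two-period case."""
    theta0 = mean(y2_treated) - mean(y2_control)  # post-period difference in means
    x = mean(y1_treated) - mean(y1_control)       # pre-period difference in means
    return theta0 - beta * x

# Hypothetical data: three units treated in period 2, three never treated.
y1_t, y2_t = [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]
y1_c, y2_c = [2.0, 4.0, 6.0], [2.0, 4.0, 6.0]

diff_in_means = theta_beta(y1_t, y2_t, y1_c, y2_c, beta=0.0)
diff_in_diff = theta_beta(y1_t, y2_t, y1_c, y2_c, beta=1.0)
halfway = theta_beta(y1_t, y2_t, y1_c, y2_c, beta=0.5)
print(diff_in_means, diff_in_diff, halfway)
```

Any other β gives another member of the same class, which is what makes it meaningful to ask which β minimizes the variance.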


Estimators in this class

• Several previously proposed estimators correspond to $\hat\theta_1 = \hat\theta_0 - \hat X$ for an appropriately specified $\hat\theta_0$ and $\hat X$.

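• Callaway and Sant’Anna (2020) consider estimators that aggregate 2x2 diff-in-diff estimators:

$$\hat\tau^{CS}_w = \sum_{t,g} w_{t,g} \Big[ \underbrace{(\bar Y_{t,g} - \bar Y_{t,\infty})}_{\text{Diff in period } t} - \underbrace{(\bar Y_{g-1,g} - \bar Y_{g-1,\infty})}_{\text{Diff in period } g-1} \Big].$$

• This can be viewed as an estimator of the form $\hat\theta_0 - \hat X$, where

$$\hat\theta_0 = \sum_{t,g} w_{t,g} \, \hat\tau_{t,g\infty} \quad \text{and} \quad \hat X = \sum_{t,g} w_{t,g} \, \hat\tau_{g-1,g\infty}.$$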


Related Staggered Estimators

• Several variants of the Callaway and Sant’Anna (2020) estimator have been proposed that

can likewise be cast into this class

• Callaway and Sant’Anna (2020) propose an alternative estimator using not-yet-treated

instead of never-treated as the comparison

• Sun and Abraham (2020) propose a similar estimator using last-to-be-treated as the

comparison

• de Chaisemartin and D’Haultfœuille (2020)’s estimator is equivalent to the Callaway and Sant’Anna (2020) estimator for a particular choice of weights, corresponding to the event study at lag 0


The Efficient Estimator


Unbiasedness

Proposition 1 (Unbiasedness)

The estimator $\hat\theta_\beta = \hat\theta_0 - \hat X' \beta$ is unbiased over the randomization distribution for any $\beta$: $E[\hat\theta_\beta] = \theta$ for all $\beta$.


Efficient Estimator

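Proposition 2

The variance of $\hat\theta_\beta = \hat\theta_0 - \hat X' \beta$ is uniquely minimized at

$$\beta^* = V_X^{-1} V_{X,\theta_0}, \quad \text{where } V_X = Var[\hat X] \text{ and } V_{X,\theta_0} = Cov[\hat X, \hat\theta_0],$$

if $V_X$ is positive definite.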
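Propositions 1 and 2 can be checked by brute force in a tiny two-period population, because the randomization distribution can be enumerated exactly. The potential outcomes below are invented for illustration, and the scalar formula β* = Cov[X, θ0] / Var[X] is the one-dimensional special case.

```python
import itertools
from statistics import mean

# Hypothetical finite population: pre-period outcome y1 and period-2 potential
# outcomes if treated in period 2 (y2_treat) or never treated (y2_never).
y1 = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]
y2_never = [1.5, 2.0, 4.5, 6.0, 7.0, 9.5]
y2_treat = [2.5, 4.0, 5.0, 8.0, 8.0, 12.0]
N, N_TREATED = len(y1), 3
theta_true = mean(t - c for t, c in zip(y2_treat, y2_never))

def estimates(treated):
    """Return (theta_hat_0, X_hat) for one assignment of the treated set."""
    control = [i for i in range(N) if i not in treated]
    theta0 = mean(y2_treat[i] for i in treated) - mean(y2_never[i] for i in control)
    x = mean(y1[i] for i in treated) - mean(y1[i] for i in control)
    return theta0, x

# All equally likely assignments under random treatment timing.
draws = [estimates(set(c)) for c in itertools.combinations(range(N), N_TREATED)]
theta0s, xs = zip(*draws)

def cov(a, b):
    """Covariance over the (uniform) randomization distribution."""
    ma, mb = mean(a), mean(b)
    return mean((u - ma) * (v - mb) for u, v in zip(a, b))

beta_star = cov(xs, theta0s) / cov(xs, xs)

def moments(beta):
    """Exact mean and variance of theta_hat_beta over all assignments."""
    vals = [t0 - beta * x for t0, x in draws]
    return mean(vals), cov(vals, vals)

for b in (0.0, 1.0, beta_star):
    print(round(b, 3), moments(b))
```

In this enumeration every β gives mean `theta_true`, while the variance is smallest at `beta_star`, so neither the simple difference in means (β = 0) nor DiD (β = 1) needs to be the most precise choice.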


Solving for the Variance

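Recall that $\hat\theta_0$ and $\hat X$ are both linear functions of cohort sample means $\bar Y_g$.

Can write them as:

$$\hat\theta_0 = \sum_g A_{\theta,g} \bar Y_g \quad \text{and} \quad \hat X = \sum_g A_{0,g} \bar Y_g.$$

Can apply Li and Ding (2017)’s results for experiments with multiple outcomes/treatments:

Proposition 3

$$Var\left[ \begin{pmatrix} \hat\theta_0 \\ \hat X \end{pmatrix} \right] = \begin{pmatrix} \sum_g N_g^{-1} A_{\theta,g} S_g A_{\theta,g}' - N^{-1} S_\theta & \sum_g N_g^{-1} A_{\theta,g} S_g A_{0,g}' \\ \sum_g N_g^{-1} A_{0,g} S_g A_{\theta,g}' & \sum_g N_g^{-1} A_{0,g} S_g A_{0,g}' \end{pmatrix},$$

where $S_g = Var_f[Y_i(g)]$ and $S_\theta = Var_f\left[ \sum_g A_{\theta,g} Y_i(g) \right]$.

• Depends on estimable variances of potential outcomes ($S_g$), and non-estimable variances of treatment effects ($S_\theta$).

• But $\beta^*$ depends only on estimable quantities, not on heterogeneous treatment effects.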

Properties of the Plug-In Efficient Estimator


The Plug-In Estimator

• So far we have solved for the efficient β∗, but it depends on the variances of potential

outcomes Sg , which are typically not known ex ante.

• Consider the feasible plug-in efficient estimator based on $\hat\beta^*$, which replaces $S_g$ with a sample analog $\hat S_g$ in the expression for $\beta^*$.
  • $\hat S_g = \frac{1}{N_g - 1} \sum_i 1[G_i = g] (Y_i - \bar Y_g)(Y_i - \bar Y_g)'$.

• Will show that in large populations the plug-in estimator $\hat\theta_{\hat\beta^*}$ has similar properties to the “oracle” estimator $\hat\theta_{\beta^*}$.
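A hedged sketch of the plug-in step, specialized to the two-period case with a scalar pre-period adjustment. Under that assumption the variance blocks reduce to sums of within-cohort sample (co)variances divided by cohort size; the data and the function name `plug_in_efficient` are hypothetical.

```python
from statistics import mean

def sample_cov(a, b):
    """Within-cohort sample covariance with the N_g - 1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

def plug_in_efficient(y1_t, y2_t, y1_c, y2_c):
    """Return (theta_hat, beta_hat) using sample analogs of Var[X] and Cov[X, theta_0]."""
    n_t, n_c = len(y1_t), len(y1_c)
    var_x = sample_cov(y1_t, y1_t) / n_t + sample_cov(y1_c, y1_c) / n_c
    cov_x_theta0 = sample_cov(y1_t, y2_t) / n_t + sample_cov(y1_c, y2_c) / n_c
    beta_hat = cov_x_theta0 / var_x
    theta0 = mean(y2_t) - mean(y2_c)   # post-period difference in means
    x = mean(y1_t) - mean(y1_c)        # pre-period difference in means
    return theta0 - beta_hat * x, beta_hat

# Hypothetical observed data (period 1, period 2) for treated and never-treated units.
y1_t, y2_t = [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]
y1_c, y2_c = [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]

theta_hat, beta_hat = plug_in_efficient(y1_t, y2_t, y1_c, y2_c)
print(theta_hat, beta_hat)
```

In this toy data the estimated coefficient is below one, so the plug-in estimator adjusts for the pre-period gap less aggressively than difference-in-differences would.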


Large population asymptotics

• Consider a sequence of populations in which Ng grows large for all g , satisfying certain

regularity conditions

Assumption 3

(i) Cohort shares converge to a constant:
  • For all $g \in \mathcal{G}$, $N_g / N \to p_g \in (0, 1)$.

(ii) Variances of potential outcomes converge to a constant:
  • For all $g, g'$, $S_g$ and $S_{gg'}$ have limiting values denoted $S_g^*$ and $S_{gg'}^*$, respectively, with $S_g^*$ positive definite.

(iii) No individual dominates the variance of potential outcomes (Lindeberg-type condition):
  • $\max_{i,g} \| Y_i(g) - \bar Y(g) \|^2 / N \to 0$.


Asymptotic Properties of the Plug-In Estimator

• Under the given asymptotic conditions, the plug-in efficient estimator is asymptotically

normally distributed with the same variance as the “oracle” efficient estimator.

Proposition 4

Under the given asymptotic conditions,

$$\sqrt{N} \, (\hat\theta_{\hat\beta^*} - \theta) \to_d N(0, \sigma_*^2),$$

where

$$\sigma_*^2 = \lim_{N \to \infty} N \, Var[\hat\theta_{\beta^*}].$$


Variance Estimation


Covariance Estimation

• As is common in finite-population settings, the variance of $\hat\theta_{\beta^*}$ can only be estimated conservatively.

• The issue is that the variance of $\hat\theta_{\beta^*}$ contains the term $-S_\theta = -Var_f\left[ \sum_g A_{\theta,g} Y_i(g) \right]$. This is not consistently estimable since it depends on covariances of potential outcomes that are never observed together.

• A natural conservative approach is the Neyman-style variance estimate, which ignores $S_\theta$ and replaces $S_g$ with $\hat S_g$ in the variance formula.

• In the paper, we show that a less conservative variance estimator can be obtained by estimating the part of $S_\theta$ explained by $\hat X$.

• Mirrors the use of pre-treatment covariates in Lin (2013); Abadie et al. (2020)
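A minimal sketch of the Neyman-style idea in the two-period scalar case, under the assumption that the variance formula then reduces to the within-cohort sample variance of the adjusted outcome Y2 − βY1 divided by cohort size, summed over the two cohorts. The data, the fixed β = 0.5, and the 1.96 critical value are illustrative choices; the less conservative refinement from the paper is not implemented here.

```python
import math
from statistics import mean, variance

def neyman_se(y1_t, y2_t, y1_c, y2_c, beta):
    """Conservative standard error: drops S_theta and plugs in sample variances."""
    adj_t = [b - beta * a for a, b in zip(y1_t, y2_t)]  # adjusted outcome, treated
    adj_c = [b - beta * a for a, b in zip(y1_c, y2_c)]  # adjusted outcome, controls
    return math.sqrt(variance(adj_t) / len(adj_t) + variance(adj_c) / len(adj_c))

# Hypothetical observed data for treated and never-treated cohorts.
y1_t, y2_t = [1.0, 2.0, 3.0], [3.0, 4.0, 6.0]
y1_c, y2_c = [2.0, 4.0, 6.0], [1.0, 2.0, 4.0]
beta = 0.5  # in practice this would be the plug-in estimate of beta*

estimate = (mean(y2_t) - mean(y2_c)) - beta * (mean(y1_t) - mean(y1_c))
se = neyman_se(y1_t, y2_t, y1_c, y2_c, beta)
ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # t-based interval, normal critical value
print(estimate, se, ci)
```

Because the dropped term enters the true variance with a negative sign, an interval built this way should be no shorter than one based on the true variance in large populations.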


What about Fisher Randomization Tests?


Fisher Randomization Tests

• An alternative approach to inference uses Fisher Randomization Tests (FRTs)

• We show that an FRT using a studentized version of the efficient estimator has two advantages:

1. has exact size under the sharp null of no treatment effects for all units;

2. is asymptotically valid for the weak null that $\theta = 0$.

• Studentization is key!

• In general, (unstudentized) FRTs may not have correct size for such weak null hypotheses

even asymptotically (Wu and Ding, 2020).

• We build on Wu and Ding (2020) and Zhao and Ding (2020) to show that studentization bypasses this problem: the FRT is asymptotically equivalent to testing whether 0 falls within the t-based confidence interval CI∗∗
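A sketch of a studentized FRT in the two-period case, enumerating every reassignment of the treated set. It assumes the plug-in coefficient and a Neyman-style standard error are recomputed for each reassignment; the data and function names are hypothetical, and the zero-standard-error guard is a convenience for degenerate draws rather than part of the method.

```python
import itertools
import math
from statistics import mean, variance

def sample_cov(a, b):
    """Sample covariance with the N_g - 1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

def studentized_stat(y1, y2, treated):
    """Plug-in efficient estimate divided by a Neyman-style standard error."""
    t_idx = sorted(treated)
    c_idx = [i for i in range(len(y1)) if i not in treated]
    y1_t, y2_t = [y1[i] for i in t_idx], [y2[i] for i in t_idx]
    y1_c, y2_c = [y1[i] for i in c_idx], [y2[i] for i in c_idx]
    n_t, n_c = len(t_idx), len(c_idx)
    var_x = variance(y1_t) / n_t + variance(y1_c) / n_c
    cov_x_theta0 = sample_cov(y1_t, y2_t) / n_t + sample_cov(y1_c, y2_c) / n_c
    beta = cov_x_theta0 / var_x
    est = (mean(y2_t) - mean(y2_c)) - beta * (mean(y1_t) - mean(y1_c))
    adj_t = [b - beta * a for a, b in zip(y1_t, y2_t)]
    adj_c = [b - beta * a for a, b in zip(y1_c, y2_c)]
    se = math.sqrt(variance(adj_t) / n_t + variance(adj_c) / n_c)
    return est / se if se > 0 else 0.0  # guard for degenerate draws (assumption)

def frt_p_value(y1, y2, treated):
    """Share of reassignments whose |t| is at least the observed |t|."""
    n, n_t = len(y1), len(treated)
    t_obs = abs(studentized_stat(y1, y2, set(treated)))
    draws = [abs(studentized_stat(y1, y2, set(c)))
             for c in itertools.combinations(range(n), n_t)]
    return sum(s >= t_obs - 1e-12 for s in draws) / len(draws)

# Hypothetical observed data: 8 units, 4 of them treated in period 2.
y1 = [1.0, 2.0, 3.5, 4.0, 5.5, 6.0, 7.5, 9.0]
y2 = [3.1, 3.9, 6.0, 5.8, 5.2, 6.3, 7.1, 9.4]
treated = {0, 2, 5, 7}
p_value = frt_p_value(y1, y2, treated)
n_draws = math.comb(len(y1), len(treated))
print(n_draws, p_value)
```

Full enumeration is only practical for small populations; with more units one would typically sample reassignments at random instead.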



The following regularity condition imposes that the means of the potential outcomes have

limits, and that their fourth moment is bounded.

Assumption 4

Suppose that for all $g$, $\lim_{N \to \infty} E_f[Y_i(g)] = \mu_g < \infty$, and there exists $L < \infty$ such that $N^{-1} \sum_i \| Y_i(g) - E_f[Y_i(g)] \|^4 < L$ for all $N$.



With this assumption in hand, we can make precise the sense in which the FRT is

asymptotically valid under the weak null.

Proposition 5

Suppose Assumptions 1-4 hold. Let $t_\pi = (\hat\theta^* / \widehat{se})_\pi$ be the studentized statistic under permutation $\pi$. Then $t_\pi \to_d N(0, 1)$, $P_G$-almost surely. Hence, if $p_{FRT}$ is the p-value from the FRT associated with $|t_\pi|$, then under $H_0 : \theta = 0$,

$$\lim_{N \to \infty} P(p_{FRT} \le \alpha) \le \alpha,$$

$P_G$-almost surely, with equality if and only if $S_\theta^* = 0$.


Conclusion


• We study staggered rollout designs, in which units randomly receive treatment at different

times.

• Estimation in these settings is often done using generalized difference-in-differences

procedures.

• We solve for the efficient estimator in a class that nests common procedures.

Our proofs draw on parallels to the literature on covariate adjustment in experiments.

• We provide both t-based and permutation-test based methods for randomization inference.

• The plug-in efficient estimator offers substantial precision gains relative to existing

approaches.


Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge,

“Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica,

2020, 88 (1), 265–296.

Athey, Susan and Guido Imbens, “Design-Based Analysis in Difference-In-Differences

Settings with Staggered Adoption,” Journal of Econometrics, 2021, Forthcoming.

Bojinov, Iavor, Ashesh Rambachan, and Neil Shephard, “Panel Experiments and Dynamic

Causal Effects: A Finite Population Perspective,” arXiv:2003.09915 [stat], 2020. arXiv:

2003.09915.

Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly

Paper ID 2826228, Social Science Research Network, Rochester, NY 2017.

Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with multiple

time periods,” Journal of Econometrics, December 2020.


de Chaisemartin, Clement and Xavier D’Haultfœuille, “Two-Way Fixed Effects Estimators

with Heterogeneous Treatment Effects,” American Economic Review, September 2020, 110

(9), 2964–2996.

Freedman, David A., “On Regression Adjustments in Experiments with Several Treatments,”

The Annals of Applied Statistics, 2008, 2 (1), 176–196.

Freedman, David A., “On regression adjustments to experimental data,” Advances in Applied Mathematics,

2008, 40 (2), 180–193.

Goodman-Bacon, Andrew, “Public Insurance and Mortality: Evidence from Medicaid

Implementation,” Journal of Political Economy, 2018, 126 (1), 216–262.

Imbens, Guido W. and Donald B. Rubin, Causal Inference for Statistics, Social, and

Biomedical Sciences: An Introduction, 1st ed., New York: Cambridge University Press,

April 2015.


Li, Xinran and Peng Ding, “General Forms of Finite Population Central Limit Theorems with

Applications to Causal Inference,” Journal of the American Statistical Association, October

2017, 112 (520), 1759–1769.

Lin, Winston, “Agnostic notes on regression adjustments to experimental data: Reexamining

Freedman’s critique,” Annals of Applied Statistics, March 2013, 7 (1), 295–318.

Malani, Anup and Julian Reif, “Interpreting pre-trends as anticipation: Impact on estimated

treatment effects from tort reform,” Journal of Public Economics, April 2015, 124, 1–17.

Rambachan, Ashesh and Jonathan Roth, “Design-Based Uncertainty for

Quasi-Experiments,” arXiv:2008.00602 [econ.EM], 2020. arXiv: 2008.00602.

Roth, Jonathan and Pedro H. C. Sant’Anna, “When Is Parallel Trends Sensitive to

Functional Form?,” arXiv:2010.04814 [econ, stat], January 2021. arXiv: 2010.04814.

Sun, Liyang and Sarah Abraham, “Estimating dynamic treatment effects in event studies

with heterogeneous treatment effects,” Journal of Econometrics, December 2020.


Wood, George, Tom R. Tyler, and Andrew V. Papachristos, “Procedural justice training

reduces police use of force and complaints against officers,” Proceedings of the National

Academy of Sciences, May 2020, 117 (18), 9815–9821.

Wood, George, Tom R. Tyler, Andrew V. Papachristos, Jonathan Roth, and Pedro H.C. Sant’Anna, “Revised Findings for “Procedural

justice training reduces police use of force and complaints against officers”,” Working Paper,

2020.

Wu, Jason and Peng Ding, “Randomization Tests for Weak Null Hypotheses in Randomized

Experiments,” Journal of the American Statistical Association, May 2020, pp. 1–16. arXiv:

1809.07419.

Xu, Ruonan, “Potential outcomes and finite-population inference for M-estimators,”

Econometrics Journal, 2021, (Forthcoming).

Zhao, Anqi and Peng Ding, “Covariate-adjusted Fisher randomization tests for the average

treatment effect,” arXiv:2010.14555 [math, stat], November 2020. arXiv: 2010.14555.
