NBER WORKING PAPER SERIES
WHEN SHOULD YOU ADJUST STANDARD ERRORS FOR CLUSTERING?
Alberto Abadie
Susan Athey
Guido W. Imbens
Jeffrey Wooldridge
Working Paper 24003
http://www.nber.org/papers/w24003
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
November 2017
The questions addressed in this paper partly originated in discussions with Gary Chamberlain. We are grateful for questions raised by Chris Blattman. We are grateful to seminar audiences at the 2016 NBER Labor Studies meeting, CEMMAP, Chicago, Brown University, the Harvard-MIT Econometrics seminar, Ca' Foscari University of Venice, the California Econometrics Conference, the Erasmus University Rotterdam, and Stanford University. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
At least one co-author has disclosed a financial relationship of potential relevance for this research. Further information is available online at http://www.nber.org/papers/w24003.ack
NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.
When Should You Adjust Standard Errors for Clustering?
Alberto Abadie, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge
NBER Working Paper No. 24003
November 2017
JEL No. C21
ABSTRACT
In empirical work in economics it is common to report standard errors that account for clustering of units. Typically, the motivation given for the clustering adjustments is that unobserved components in outcomes for units within clusters are correlated. However, because correlation may occur across more than one dimension, this motivation makes it difficult to justify why researchers use clustering in some dimensions, such as geographic, but not others, such as age cohorts or gender. This motivation also makes it difficult to explain why one should not cluster with data from a randomized experiment. In this paper, we argue that clustering is in essence a design problem, either a sampling design or an experimental design issue. It is a sampling design issue if sampling follows a two stage process where in the first stage, a subset of clusters were sampled randomly from a population of clusters, and in the second stage, units were sampled randomly from the sampled clusters. In this case the clustering adjustment is justified by the fact that there are clusters in the population that we do not see in the sample. Clustering is an experimental design issue if the assignment is correlated within the clusters. We take the view that this second perspective best fits the typical setting in economics where clustering adjustments are used. This perspective allows us to shed new light on three questions: (i) when should one adjust the standard errors for clustering, (ii) when is the conventional adjustment for clustering appropriate, and (iii) when does the conventional adjustment of the standard errors matter.
Alberto Abadie
Department of Economics, E52-546
MIT
77 Massachusetts Avenue
Cambridge, MA 02139
and [email protected]

Susan Athey
Graduate School of Business
Stanford University
655 Knight Way
Stanford, CA 94305
and [email protected]

Guido W. Imbens
Graduate School of Business
Stanford University
655 Knight Way
Stanford, CA 94305
and [email protected]

Jeffrey Wooldridge
Department of Economics
Michigan State University
[email protected]
1 Introduction
In empirical work in economics, it is common to report standard errors that account for cluster-
ing of units. The first issue we address in this manuscript is the motivation for this adjustment.
Typically the stated motivation is that unobserved components of outcomes for units within
clusters are correlated (Moulton [1986, 1990], Moulton and Randolph [1989], Kloek [1981],
Hansen [2007], Cameron and Miller [2015]). For example, Hansen [2007] writes: “The cluster-
ing problem is caused by the presence of a common unobserved random shock at the group
level that will lead to correlation between all observations within each group” (Hansen [2007],
p. 671). Similarly Cameron and Miller [2015] write: “The key assumption is that the errors are
uncorrelated across clusters while errors for individuals belonging to the same cluster may be
correlated” (Cameron and Miller [2015], p. 320). This motivation for clustering adjustments
in terms of within-group correlations of the errors makes it difficult to justify clustering by
some partitioning of the population, but not by others. For example, in a regression of wages
on years of education, this argument could be used to justify clustering by age cohorts just as
easily as clustering by state. Similarly, this motivation makes it difficult to explain why, in a
randomized experiment, researchers typically do not cluster by groups. It also makes it difficult
to motivate clustering if the regression function already includes fixed effects. The second issue
we address concerns the appropriate level of clustering. The typical answer is to go for the most
aggregate level feasible. For example, in a recent survey Cameron and Miller [2015] write: “The
consensus is to be conservative and avoid bias and to use bigger and more aggregate clusters
when possible, up to and including the point at which there is concern about having too few
clusters.” (Cameron and Miller [2015], p. 333). We argue in this paper that there is in fact
harm in clustering at too aggregate a level. We also make the case that the confusion regarding
both issues arises from the dominant model-based perspective on clustering.
We take the view that clustering is in essence a design problem, either a sampling design
or an experimental design issue. It is a sampling design issue when the sampling follows a
two stage process, where in the first stage, a subset of clusters is sampled randomly from a
population of clusters, and in the second stage, units are sampled randomly from the sampled
clusters. Although this clustered sampling approach is the perspective taken most often when a
formal justification is given for clustering adjustments to standard errors, it actually rarely fits
applications in economics. Angrist and Pischke [2008] write: “Most of the samples that we work
with are close enough to random that we typically worry more about the dependence due to a
group structure than clustering due to stratification.” (Angrist and Pischke [2008], footnote 10,
p. 309). Instead of a sampling issue, clustering can also be an experimental design issue, when
clusters of units, rather than units, are assigned to a treatment. In the view developed in this
manuscript, this perspective best fits the typical application in economics, but surprisingly it
is rarely explicitly presented as the motivation for cluster adjustments to the standard errors.
We argue that the design perspective on clustering, related to randomization inference
(e.g., Rosenbaum [2002], Athey and Imbens [2017]), clarifies the role of clustering adjustments
to standard errors and aids in the decision whether to, and at what level to, cluster, both
in standard clustering settings and in more general spatial correlation settings (Bester et al.
[2009], Conley [1999], Barrios et al. [2012], Cressie [2015]). For example, we show that, contrary
to common wisdom, correlations between residuals within clusters are neither necessary, nor
sufficient, for cluster adjustments to matter. Similarly, correlations between regressors within
clusters are neither necessary, nor sufficient, for cluster adjustments to matter or to justify
clustering. In fact, we show that cluster adjustments can matter, and substantially so, even
when both residuals and regressors are uncorrelated within clusters. Moreover, we show that
the question whether, and at what level, to adjust standard errors for clustering is a substantive
question that cannot be informed solely by the data. In other words, although the data are
informative about whether clustering matters for the standard errors, they are only partially
informative about whether one should adjust the standard errors for clustering. A consequence
is that in general clustering at too aggregate a level is not innocuous, and can lead to standard
errors that are unnecessarily conservative, even in large samples.
One important theme of the paper, building on Abadie et al. [2017], is that it is critical
to define estimands carefully, and to articulate precisely the relation between the sample and
the population. In this setting that means one should define the estimand in terms of a finite
population, with a finite number of clusters and a finite number of units per cluster. This is
important even if asymptotic approximations to finite sample distributions involve sequences
of experiments with an increasing number of clusters and/or an increasing number of units per
cluster. In addition, researchers need to be explicit about the way the sample is generated from
this population, addressing two issues: (i) how units in the sample were selected and, most
importantly, whether there are clusters in the population of interest that are not represented
in the sample, and (ii) how units were assigned to the various treatments, and whether this
assignment was clustered. If either the sampling or assignment varies systematically with
groups in the sample, clustering will in general be justified. We show that the conventional
adjustments, often implicitly, assume that the clusters in the sample are only a small fraction of
the clusters in the population of interest. To make the conceptual points as clear as possible, we
focus in the current manuscript on the cross-section setting. In the panel case (e.g., Bertrand
et al. [2004]), the same issues arise, but there are additional complications because of the time
series correlation of the treatment assignment. Analyzing the uncertainty from the experimental
design perspective would require modeling the time series pattern of the assignments, and we
leave that to future work.
The practical implications from the results in this paper are as follows. First, the researcher
should assess whether the sampling process is clustered or not, and whether the assignment
mechanism is clustered. If the answer to both is no, one should not adjust the standard errors
for clustering, irrespective of whether such an adjustment would change the standard errors.
Second, in general, the standard Liang-Zeger clustering adjustment is conservative unless one
of three conditions holds: (i) there is no heterogeneity in treatment effects; (ii) we observe only
a few clusters from a large population of clusters; or (iii) a vanishing fraction of units in each
cluster is sampled, e.g. at most one unit is sampled per cluster. Third, the (positive) bias from
standard clustering adjustments can be corrected if all clusters are included in the sample and
further, there is variation in treatment assignment within each cluster. For this case we propose
a new variance estimator. Fourth, if one estimates a fixed effects regression (with fixed effects
at the level of the relevant clusters), the analysis changes. Then, heterogeneity in the treatment
effects is a requirement for a clustering adjustment to be necessary.
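The first two implications can be illustrated with a small simulation. The sketch below is ours, not from the paper, and the design is purely hypothetical: treatment is assigned at the cluster level with cluster-specific treatment effects, while the unit-level noise is i.i.d. In that setting the Liang-Zeger cluster-robust standard error for the treatment coefficient is far larger than the heteroskedasticity-robust one, even though no "common shock" was built into the errors.

```python
import numpy as np

# Hypothetical design: 50 clusters of 20 units, treatment assigned by cluster.
rng = np.random.default_rng(0)
C, m = 50, 20
N = C * m
cluster = np.repeat(np.arange(C), m)
W = (cluster % 2).astype(float)                 # clustered assignment
tau = 1.0 + 2.0 * rng.standard_normal(C)        # heterogeneous cluster effects
Y = tau[cluster] * W + rng.standard_normal(N)   # i.i.d. unit-level noise

X = np.column_stack([np.ones(N), W])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ (X.T @ Y)
resid = Y - X @ beta

# Heteroskedasticity-robust sandwich (diagonal Omega-hat)
V_hc = XtX_inv @ ((X * (resid ** 2)[:, None]).T @ X) @ XtX_inv

# Liang-Zeger cluster-robust: outer products of within-cluster scores
meat = np.zeros((2, 2))
for c in range(C):
    s = X[cluster == c].T @ resid[cluster == c]
    meat += np.outer(s, s)
V_cl = XtX_inv @ meat @ XtX_inv

se_hc, se_cl = np.sqrt(V_hc[1, 1]), np.sqrt(V_cl[1, 1])
# In this design se_cl is several times larger than se_hc.
```

Switching the assignment in this sketch to the unit level (e.g., `W = rng.integers(0, 2, N)`) shrinks the gap between the two standard errors, consistent with the design-based view that it is the clustering of the assignment, not cluster membership per se, that matters.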
2 A Simple Example and Two Misconceptions
In this section we discuss two misconceptions about clustering that appear common in the
literature. The first misconception is about when clustering matters, and the second about
whether one ought to cluster. Both misconceptions are related to the common model-based
perspective of clustering which we outline briefly below. We argue that this perspective obscures
the justification for clustering that is relevant for most empirical work.
2.1 The Model-based Approach to Clustering
First let us briefly review the textbook, model-based approach to clustering (e.g., Cameron and
Miller [2015], Wooldridge [2003, 2010], Angrist and Pischke [2008]). Later, we contrast this
with the design-based approach starting from clustered randomized experiments (Donner and
Klar [2000], Murray [1998], Fisher [1937]). Consider a setting where we wish to model a scalar
outcome Yi in terms of a binary covariate Wi ∈ {0, 1}, with the units belonging to clusters,
with the cluster for unit i denoted by Ci ∈ {1, . . . , C}. We estimate the linear model
$$Y_i = \alpha + \tau W_i + \varepsilon_i = \beta^\top X_i + \varepsilon_i,$$

where $\beta^\top = (\alpha, \tau)$ and $X_i^\top = (1, W_i)$, using least squares, leading to

$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N} \left( Y_i - \beta^\top X_i \right)^2 = \left( X^\top X \right)^{-1} \left( X^\top Y \right).$$
In the model-based perspective, the N -vector ε with ith element equal to εi, is viewed as the
stochastic component. The N × 2 matrix X with ith row equal to (1, Wi) and the N -vector C
with ith element equal to Ci are viewed as non-stochastic. Thus the repeated sampling thought
experiment is redrawing the vectors ε, keeping fixed C and W.
Often the following structure is imposed on the first two moments of ε:

$$\mathbb{E}[\varepsilon \mid X, C] = 0, \qquad \mathbb{E}\left[ \varepsilon \varepsilon^\top \,\middle|\, X, C \right] = \Omega,$$
leading to the following expression for the variance of the ordinary least squares (OLS) estimator:

$$\mathbb{V}(\hat\beta) = \left( X^\top X \right)^{-1} \left( X^\top \Omega X \right) \left( X^\top X \right)^{-1}.$$
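The sandwich expression can be evaluated directly for a small design. The numpy sketch below is ours, purely illustrative: it builds an Ω with equicorrelation ρ within clusters (a hypothetical choice) and compares the resulting sandwich variance to the one obtained from the diagonal part of Ω alone.

```python
import numpy as np

# Hypothetical design: 5 clusters of 4 units, unit-level binary regressor.
rng = np.random.default_rng(1)
C, m = 5, 4
N = C * m
cluster = np.repeat(np.arange(C), m)
W = rng.integers(0, 2, N).astype(float)
X = np.column_stack([np.ones(N), W])

# Omega with equicorrelation rho within clusters, zero across clusters.
sigma2, rho = 1.0, 0.5
same = (cluster[:, None] == cluster[None, :]).astype(float)
Omega = sigma2 * (rho * same + (1.0 - rho) * np.eye(N))

XtX_inv = np.linalg.inv(X.T @ X)
V_sandwich = XtX_inv @ (X.T @ Omega @ X) @ XtX_inv               # clustered Omega
V_diag = XtX_inv @ (X.T @ (sigma2 * np.eye(N)) @ X) @ XtX_inv    # diagonal Omega
# V_diag collapses to sigma2 * (X'X)^{-1}, the homoskedastic OLS variance.
```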
In the setting without clustering, the key assumption is that Ω is diagonal. If one is also willing
to assume homoskedasticity, the variance reduces to the standard OLS variance:

$$\mathbb{V}_{\mathrm{OLS}} = \sigma^2 \left( X^\top X \right)^{-1},$$

where $\sigma^2 = \Omega_{ii} = \mathbb{V}(\varepsilon_i)$ for all i. Often researchers allow for general heteroskedasticity and use
Proof of Lemma A.2: Substituting $\varepsilon_{in} = T_{in}(\varepsilon_{in}(1) - \varepsilon_{in}(0))/2 + (\varepsilon_{in}(1) + \varepsilon_{in}(0))/2$, we have

$$\frac{2}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} R_{in} T_{in} \varepsilon_{in} = \frac{2}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} \left( R_{in} \frac{\varepsilon_{in}(1) - \varepsilon_{in}(0)}{2} + R_{in} T_{in} \frac{\varepsilon_{in}(1) + \varepsilon_{in}(0)}{2} \right).$$

Because $\sum_{i=1}^{n} \varepsilon_{in}(0) = \sum_{i=1}^{n} \varepsilon_{in}(1) = 0$, this is equal to

$$\frac{1}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} (R_{in} - P_{Cn} P_{Un})(\varepsilon_{in}(1) - \varepsilon_{in}(0)) + \frac{1}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} R_{in} T_{in} (\varepsilon_{in}(1) + \varepsilon_{in}(0)) = S_n + D_n.$$
Comment: The S here refers to sampling, because $S_n$ captures the sampling part of the clustering, and D refers to design, as $D_n$ captures the design part of the clustering. For $S_n$ only the clustering in the sampling (in $R_{in}$) matters, and the clustering in the assignment (in $T_{in}$) does not matter. For $D_n$ it is the other way around. Even if $R_{in}$ is clustered, if $T_{in}$ is not, the covariance terms in the variance of $D_n$ vanish.
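The decomposition in Lemma A.2 is an exact pathwise identity, so it can be checked numerically. The sketch below is ours, on a hypothetical small design; it uses the ±1 coding of the assignment $T_{in}$ implicit in the substitution (so that $T_{in}^2 = 1$) and residuals centered so that $\sum_i \varepsilon_{in}(w) = 0$.

```python
import numpy as np

# Hypothetical design: 10 clusters of 6 units.
rng = np.random.default_rng(2)
C, m = 10, 6
n = C * m
cluster = np.repeat(np.arange(C), m)
P_C, P_U = 0.6, 0.5

# Potential-outcome residuals, centered so that sum_i eps_in(w) = 0.
eps0 = rng.standard_normal(n); eps0 -= eps0.mean()
eps1 = rng.standard_normal(n); eps1 -= eps1.mean()

# Two-stage sampling indicator R_in and a +/-1 assignment T_in (T_in^2 = 1).
R = ((rng.random(C) < P_C)[cluster] & (rng.random(n) < P_U)).astype(float)
T = rng.choice([-1.0, 1.0], size=n)
eps = T * (eps1 - eps0) / 2 + (eps1 + eps0) / 2   # realized residual

scale = np.sqrt(n * P_C * P_U)
lhs = (2 / scale) * np.sum(R * T * eps)
S_n = (1 / scale) * np.sum((R - P_C * P_U) * (eps1 - eps0))
D_n = (1 / scale) * np.sum(R * T * (eps1 + eps0))
# lhs equals S_n + D_n exactly, up to floating-point rounding.
```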
Lemma A.3. The first two moments of $S_n$ and $D_n$ are

$$\mathbb{E}[S_n] = 0, \qquad \mathbb{E}[D_n] = 0,$$

$$\mathbb{E}[S_n^2] = \frac{1 - P_{Un}}{n} \sum_{i=1}^{n} (\varepsilon_{in}(1) - \varepsilon_{in}(0))^2 + \frac{P_{Un}(1 - P_{Cn})}{n} \sum_{c=1}^{C} n_c^2 (\varepsilon_{cn}(1) - \varepsilon_{cn}(0))^2,$$

$$\mathbb{E}[D_n^2] = \frac{1 - 4\sigma_n^2 P_{Un}}{n} \sum_{i=1}^{n} (\varepsilon_{in}(1) + \varepsilon_{in}(0))^2 + \frac{4\sigma_n^2 P_{Un}}{n} \sum_{c=1}^{C} n_c^2 (\varepsilon_{cn}(1) + \varepsilon_{cn}(0))^2,$$

and

$$\mathbb{E}[S_n D_n] = 0,$$

so that

$$\mathbb{E}\left[ \left( \frac{2}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} R_{in} T_{in} \varepsilon_{in} \right)^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \Big\{ \big(2 - P_{Un}(1 + 4\sigma_n^2)\big) \varepsilon_{in}(1)^2 + \big(2 - P_{Un}(1 + 4\sigma_n^2)\big) \varepsilon_{in}(0)^2 + P_{Un}(2 - 8\sigma_n^2) \varepsilon_{in}(1) \varepsilon_{in}(0) \Big\}$$
$$\qquad + \frac{P_{Un}}{n} \sum_{c=1}^{C} n_c^2 \Big\{ (1 - P_{Cn}) (\varepsilon_{cn}(1) - \varepsilon_{cn}(0))^2 + 4\sigma_n^2 (\varepsilon_{cn}(1) + \varepsilon_{cn}(0))^2 \Big\}.$$
Proof of Lemma A.3: Because $\mathbb{E}[R_{in}] = P_{Cn} P_{Un}$, it follows immediately that $\mathbb{E}[S_n] = 0$. Because $\mathbb{E}[R_{in} T_{in}] = 0$, it follows that $\mathbb{E}[D_n] = 0$. Because $\mathbb{E}[(R_{in} - P_{Cn} P_{Un}) R_{in} T_{in}] = \mathbb{E}[(R_{in} - P_{Cn} P_{Un}) R_{in}] \, \mathbb{E}[T_{in}] = 0$, it follows that $\mathbb{E}[S_n D_n] = 0$. Next, consider $\mathbb{E}[S_n^2]$:

$$\begin{aligned}
\mathbb{E}[S_n^2] &= \frac{1}{n P_{Cn} P_{Un}} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}\left[ (R_{in} - P_{Cn} P_{Un})(\varepsilon_{in}(1) - \varepsilon_{in}(0)) (R_{jn} - P_{Cn} P_{Un})(\varepsilon_{jn}(1) - \varepsilon_{jn}(0)) \right] \\
&= \frac{1}{n P_{Cn} P_{Un}} \sum_{i=1}^{n} \left( P_{Cn} P_{Un}(1 - P_{Cn} P_{Un}) - P_{Un}^2 P_{Cn}(1 - P_{Cn}) \right) (\varepsilon_{in}(1) - \varepsilon_{in}(0))^2 \\
&\quad + \frac{1}{n P_{Cn} P_{Un}} \sum_{c=1}^{C_n} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{1}\{C_{in} = c\} \mathbf{1}\{C_{jn} = c\} \left( P_{Un}^2 P_{Cn}(1 - P_{Cn}) \right) (\varepsilon_{in}(1) - \varepsilon_{in}(0))(\varepsilon_{jn}(1) - \varepsilon_{jn}(0)) \\
&= \frac{1 - P_{Un}}{n} \sum_{i=1}^{n} (\varepsilon_{in}(1) - \varepsilon_{in}(0))^2 \\
&\quad + \frac{P_{Un}(1 - P_{Cn})}{n} \sum_{c=1}^{C_n} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{1}\{C_{in} = c\} \mathbf{1}\{C_{jn} = c\} (\varepsilon_{in}(1) - \varepsilon_{in}(0))(\varepsilon_{jn}(1) - \varepsilon_{jn}(0)) \\
&= \frac{1 - P_{Un}}{n} \sum_{i=1}^{n} (\varepsilon_{in}(1) - \varepsilon_{in}(0))^2 + \frac{P_{Un}(1 - P_{Cn})}{n} \sum_{c=1}^{C_n} n_{cn}^2 (\varepsilon_{cn}(1) - \varepsilon_{cn}(0))^2.
\end{aligned}$$

Next, consider $\mathbb{E}[D_n^2]$:

$$\begin{aligned}
\mathbb{E}[D_n^2] &= \frac{1}{n P_{Cn} P_{Un}} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}\left[ R_{in} T_{in} (\varepsilon_{in}(1) + \varepsilon_{in}(0)) R_{jn} T_{jn} (\varepsilon_{jn}(1) + \varepsilon_{jn}(0)) \right] \\
&= \frac{1}{n P_{Cn} P_{Un}} \sum_{i=1}^{n} \left( P_{Cn} P_{Un} - 4\sigma_n^2 P_{Cn} P_{Un}^2 \right) (\varepsilon_{in}(1) + \varepsilon_{in}(0))^2 \\
&\quad + \frac{1}{n P_{Cn} P_{Un}} \sum_{c=1}^{C_n} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{1}\{C_{in} = c\} \mathbf{1}\{C_{jn} = c\} \, 4\sigma_n^2 P_{Cn} P_{Un}^2 \, (\varepsilon_{in}(1) + \varepsilon_{in}(0))(\varepsilon_{jn}(1) + \varepsilon_{jn}(0)) \\
&= \frac{1 - 4\sigma_n^2 P_{Un}}{n} \sum_{i=1}^{n} (\varepsilon_{in}(1) + \varepsilon_{in}(0))^2 + \frac{4\sigma_n^2 P_{Un}}{n} \sum_{c=1}^{C_n} n_{cn}^2 (\varepsilon_{cn}(1) + \varepsilon_{cn}(0))^2.
\end{aligned}$$
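As a sanity check (ours, not part of the paper), the $\mathbb{E}[S_n^2]$ expression can be verified exactly on a tiny design by enumerating every cluster-sampling and unit-sampling outcome. Here $\varepsilon_{cn}(1) - \varepsilon_{cn}(0)$ is read as the within-cluster average, so $n_c^2(\varepsilon_{cn}(1) - \varepsilon_{cn}(0))^2$ equals the squared within-cluster sum.

```python
import numpy as np
from itertools import product

# Tiny two-cluster design: cluster 0 holds units {0,1}, cluster 1 holds {2,3}.
cluster = np.array([0, 0, 1, 1])
n, C = 4, 2
P_C, P_U = 0.7, 0.4
a = np.array([1.0, -2.0, 0.5, 0.5])          # stands in for eps_in(1) - eps_in(0)

# Exact E[S_n^2]: enumerate all cluster and unit sampling configurations.
ES2 = 0.0
for G in product([0, 1], repeat=C):           # G[c] = 1 if cluster c is sampled
    pG = np.prod([P_C if g else 1.0 - P_C for g in G])
    for U in product([0, 1], repeat=n):       # U[i] = 1 if unit i is sampled
        pU = np.prod([P_U if u else 1.0 - P_U for u in U])
        R = np.array([G[cluster[i]] * U[i] for i in range(n)], dtype=float)
        S = np.sum((R - P_C * P_U) * a) / np.sqrt(n * P_C * P_U)
        ES2 += pG * pU * S ** 2

# Closed form from Lemma A.3, with the cluster term as squared within-cluster sums.
closed = (1.0 - P_U) / n * np.sum(a ** 2) + P_U * (1.0 - P_C) / n * sum(
    np.sum(a[cluster == c]) ** 2 for c in range(C)
)
# ES2 and closed agree exactly (up to floating-point rounding).
```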
Lemma A.4.

$$\eta_n^{fe} = \frac{2}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} R_{in} (T_{in} - q_{C_{in}}) \varepsilon_{in} = S_n^{fe} + D_n^{fe},$$

where

$$S_n^{fe} = \frac{1}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} (R_{in} - P_{Cn} P_{Un}) (1 - q_{C_{in}}^2) (\varepsilon_{in}(1) - \varepsilon_{in}(0)),$$

and

$$D_n^{fe} = \frac{1}{\sqrt{n P_{Cn} P_{Un}}} \sum_{i=1}^{n} R_{in} (T_{in} - q_{C_{in}}) \big( (\varepsilon_{in}(1) + \varepsilon_{in}(0)) - q_{C_{in}} (\varepsilon_{in}(1) - \varepsilon_{in}(0)) \big).$$
The proof follows the same argument as the proof for Lemma A.2 and is omitted.

Proof of Proposition 2: By definition