Difference-in-Differences Inference With Few Treated Clusters∗

James G. MacKinnon
[email protected]

Matthew D. Webb
[email protected]

May 15, 2015
Abstract

Inference with difference-in-differences requires care. Past research has shown that the cluster-robust variance estimator (CRVE) severely over-rejects when there are few treated clusters, that different variants of the wild cluster bootstrap can severely over-reject or severely under-reject, and that procedures based on randomization show promise. We demonstrate that randomization inference (RI) procedures based on estimated coefficients, such as the one proposed by Conley and Taber (2011), fail when the treated clusters are atypical, while RI procedures based on t statistics fail when few atypical clusters are treated. We also propose a bootstrap-based alternative to randomization inference, which mitigates the discrete nature of RI P values with a small number of clusters.

Keywords: CRVE, grouped data, clustered data, panel data, cluster wild bootstrap, difference in differences, DiD, randomization inference

Preliminary and Incomplete
Please do not cite or circulate.
∗This research was supported, in part, by grants from the Social Sciences and Humanities Research Council of Canada.
1 Introduction
Inference for difference-in-differences estimators has received considerable attention in the past decade.¹ While much progress has been made, there are still some areas in which reliable inference is a challenge. We study one such area: the case in which there are few treated groups. Past research has shown that cluster-robust inference greatly over-rejects with just one treated group (Conley and Taber, 2011). The wild cluster bootstrap will either greatly under-reject or greatly over-reject, depending on whether the null hypothesis is imposed on the bootstrap DGP (MacKinnon and Webb, 2015).
Several authors have considered randomization inference (RI) as a way to obtain accurate test size with few treated groups (Conley and Taber, 2011; Canay, Romano and Shaikh, 2014). However, existing RI procedures rely on strong assumptions about the comparability of the control groups to the treatment groups. We show that these assumptions fail to hold when the treated groups are either larger or smaller, in terms of number of observations, than the control groups. We provide evidence that existing RI procedures can severely over-reject or under-reject in that case. We also highlight a problem for inference at typical significance levels with these procedures and propose a bootstrap-based modification.
We are motivated by a conventional, linear, difference-in-differences model with individual data. Specifically, we are interested in applications in which there is variation in treatment across both groups and time periods. Such models are often expressed as follows. If i indexes individuals, g indexes groups, and t indexes time periods, then a classic difference-in-differences (or DiD) regression can be written as

\[
y_{igt} = \beta_1 + \beta_2\, GT_{igt} + \beta_3\, PT_{igt} + \beta_4\, GT_{igt} PT_{igt} + \varepsilon_{igt}, \tag{1}
\]
\[
i = 1, \ldots, N_g, \quad g = 1, \ldots, G, \quad t = 1, \ldots, T, \tag{2}
\]

in which GT_igt is a group treated dummy that equals 1 if group g is treated in any time period, and PT_igt is a period treated dummy that equals 1 if any group is treated in time period t. The coefficient of most interest is β₄, which shows the effect on treated groups in periods when there is treatment. Following the literature, we denote the G groups as either control groups G₀ or treated groups G₁. Past research, notably Conley and Taber (2011) and MacKinnon and Webb (2015), has shown that the reliability of inference of several procedures varies considerably with the fraction of clusters treated.
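As a concrete illustration, data from a regression like (1) can be simulated and estimated as follows. This is only a sketch: the parameter values, the balanced cluster sizes, and the equicorrelated error structure are our own illustrative choices, not the design used in the paper's experiments.

```python
import numpy as np

def simulate_did(G=10, Ng=50, T=20, G1=1, beta4=0.0, rho=0.05, seed=0):
    """Simulate data from a DiD regression like (1): the first G1 groups are
    treated, treatment starts in an (assumed) period, and errors are
    equicorrelated within cluster with correlation rho."""
    rng = np.random.default_rng(seed)
    treat_start = 10  # illustrative first treated period
    rows = []
    for g in range(G):
        GT = 1.0 if g < G1 else 0.0
        common = rng.standard_normal()  # cluster-level error component
        for i in range(Ng):
            for t in range(T):
                PT = 1.0 if t >= treat_start else 0.0
                e = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.standard_normal()
                y = 1.0 + 0.5 * GT + 0.5 * PT + beta4 * GT * PT + e
                rows.append((y, GT, PT, GT * PT, g))
    data = np.array(rows)
    y, cluster = data[:, 0], data[:, 4].astype(int)
    X = np.column_stack([np.ones(len(y)), data[:, 1:4]])  # intercept first
    return y, X, cluster

def ols(y, X):
    """OLS coefficient estimates via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

y, X, cluster = simulate_did()
beta_hat = ols(y, X)  # beta_hat[3] estimates beta_4
```

With β₄ = 0 in the DGP, the estimated interaction coefficient should be close to zero; the inference problem discussed below concerns its standard error, not the point estimate.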
We compare several alternative procedures for inference with few treated clusters. We discuss the theoretical advantages and limitations of the various procedures before comparing the size of various tests in a set of Monte Carlo experiments. Additionally, we introduce two new procedures for inference in this setting: an RI test based on t statistics, and a bootstrap-based modification of that procedure. We find that the bootstrap-based RI procedure can improve inference when there are few control groups, but that none of the existing procedures allows for reliable inference when groups are heterogeneous and only one group is treated.
¹See, among others, Bertrand, Duflo and Mullainathan (2004), Donald and Lang (2007), Cameron, Gelbach and Miller (2008), Conley and Taber (2011), MacKinnon and Webb (2015), Canay, Romano and Shaikh (2014), and Ibragimov and Muller (2015).
Section 2 briefly discusses the issue of clustered errors and considers various procedures for inference. Section 2.1 discusses cluster-robust variance estimation and its limitations. Section 2.2 discusses the effective number of clusters approach developed in Carter, Schnepel and Steigerwald (2013). The wild cluster bootstrap is discussed in Section 2.3. Section 2.4 describes the strengths and weaknesses of the Conley and Taber approach to randomization inference. Section 2.5 considers an alternative RI procedure based on t statistics. Section 2.6 discusses a practical problem with all RI procedures when the number of control groups is small. Section 2.7 then introduces a bootstrap-based modified RI procedure that solves this problem. Recent work by Ibragimov and Muller is discussed in Section 2.8. We compare several procedures in a set of Monte Carlo experiments. The experiments are described in Section 3, and the results are presented in Section 3.1. Section 4 concludes.
2 Inference with Few Treated Clusters
A linear regression model with clustered errors may be written as

\[
y \equiv \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}
= X\beta + \varepsilon
\equiv \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_G \end{bmatrix}\beta
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{bmatrix}, \tag{3}
\]

where each of the G clusters, indexed by g, has N_g observations. The matrix X and the vectors y and ε have N = Σ_{g=1}^G N_g rows, X has k columns, and the parameter vector β has k rows. OLS estimation of equation (3) yields estimates β̂ and residuals ε̂. Because the elements of the ε_g are generally neither independent nor identically distributed, standard OLS estimates of the standard errors of β̂ are invalid, and conventional inference can be severely unreliable. Cameron and Miller (2015) is a recent and comprehensive survey of cluster-robust inference. In the remainder of this section, we discuss several procedures for inference when there is within-cluster correlation and few treated clusters.
2.1 Cluster-Robust Variance Estimation
There are several cluster-robust variance estimators, of which the earliest may be Liang and Zeger (1986). This is the estimator that is used when the cluster command is invoked in Stata. The CRVE we investigate is defined as:

\[
\frac{G(N-1)}{(G-1)(N-k)}\,(X'X)^{-1}\left(\sum_{g=1}^{G} X_g'\,\hat\varepsilon_g \hat\varepsilon_g'\, X_g\right)(X'X)^{-1}. \tag{4}
\]

It yields reliable inferences when the number of clusters is large (Cameron, Gelbach and Miller, 2008) and the number of observations per cluster is balanced across clusters (Carter, Schnepel and Steigerwald, 2013; MacKinnon and Webb, 2015). However, Conley and Taber (2011) and MacKinnon and Webb (2015) show that it over-rejects severely when there are very few treated clusters. Rejection frequencies can be well over 50% when only one cluster is treated, even if the t statistics are assumed to follow a t(G − 1) distribution, as in Donald and Lang (2007).
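A minimal implementation of the estimator (4) might look as follows; the function and variable names, and the simulated data in the usage example, are our own illustrative choices.

```python
import numpy as np

def crve(X, resid, cluster):
    """Cluster-robust variance matrix (4), including the small-sample
    correction factor G(N-1)/((G-1)(N-k))."""
    N, k = X.shape
    labels = np.unique(cluster)
    G = len(labels)
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in labels:
        score = X[cluster == g].T @ resid[cluster == g]  # k-vector for cluster g
        meat += np.outer(score, score)
    factor = G * (N - 1) / ((G - 1) * (N - k))
    return factor * XtX_inv @ meat @ XtX_inv

# illustrative usage on simulated data (all numbers are arbitrary)
rng = np.random.default_rng(1)
N, G = 200, 10
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
cluster = np.repeat(np.arange(G), N // G)
y = X @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(N)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
V = crve(X, resid, cluster)
t_stat = beta[2] / np.sqrt(V[2, 2])  # cluster-robust t statistic
```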
2.2 The Effective Number of Clusters
Recent work by Carter, Schnepel and Steigerwald (2013) develops the asymptotic properties of the CRVE when the number of observations per group is not constant. The authors show that, when clusters are unbalanced, the dataset has an effective number of clusters, G*, which is less than G (sometimes very much less). Simulations in MacKinnon and Webb (2015) show that the procedure can work fairly well when intermediate numbers of clusters are treated. However, when few clusters are treated (or untreated), it over-rejects when inference is conducted assuming that t statistics follow a t(G*) distribution, and it under-rejects when assuming that they follow a t(G* − 1) distribution. For this reason, we do not consider this procedure in our Monte Carlo simulations.
2.3 The Wild Cluster Bootstrap
The wild cluster bootstrap was proposed as a method for reliable inference in cases with a small number of clusters by Cameron, Gelbach and Miller (2008).² It was studied in MacKinnon and Webb (2015) for the case of unbalanced clusters. Because we will be proposing a new procedure that is closely related to the wild cluster bootstrap in Section 2.7, it is important to understand how the latter works. We therefore discuss it in some detail, based on the treatment in MacKinnon and Webb (2015).

We wish to test the hypothesis that a single coefficient in equation (3) is zero. Without loss of generality, we let this be β_k, the last coefficient of β. The restricted wild cluster bootstrap works as follows:
1. Estimate equation (3) by OLS.

2. Calculate t̂_k, the t statistic for β_k = 0, using the square root of the kth diagonal element of (4) as a cluster-robust standard error.

3. Re-estimate the model (3) subject to the restriction that β_k = 0, so as to obtain the restricted residuals ε̃ and the restricted estimates β̃.

4. For each of B bootstrap replications, indexed by j, generate a new set of bootstrap dependent variables using the bootstrap DGP

\[
y_{ig}^{*j} = X_{ig}\tilde\beta + \tilde\varepsilon_{ig}\, v_g^{*j}, \tag{5}
\]

where y_ig^{*j} is an element of the vector y^{*j} of observations on the bootstrap dependent variable, X_ig is the corresponding row of X, and so on. Here v_g^{*j} is a random variable that follows the Rademacher distribution; see Davidson and Flachaire (2008). It takes the values 1 and −1 with equal probability. Note that we would not want to use the Rademacher distribution if G were smaller than about 12; see Webb (2014), which proposes an alternative for such cases.
²A different bootstrap procedure for cluster-robust inference was previously proposed by Bertrand, Duflo and Mullainathan (2004).
5. For each bootstrap replication, estimate regression (3) using y^{*j} as the regressand, and calculate t_k^{*j}, the bootstrap t statistic for β_k = 0, using the square root of the kth diagonal element of (4), with bootstrap residuals replacing the OLS residuals, as the standard error.

6. Calculate the bootstrap P value as

\[
\hat p_s^* = \frac{1}{B}\sum_{j=1}^{B} I\bigl(|t_k^{*j}| > |\hat t_k|\bigr), \tag{6}
\]

where I(·) denotes the indicator function. Equation (6) assumes that the distribution is symmetric. Alternatively, one can use a slightly more complicated formula for an equal-tail bootstrap P value.
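The six steps above can be sketched as follows. This is a simplified illustration, not the authors' code: the restriction β_k = 0 is imposed by dropping the kth column, the helper names are our own, and the CRVE helper reimplements (4) so the block is self-contained.

```python
import numpy as np

def _crve_se(X, resid, cluster, coef):
    """Cluster-robust standard error for one coefficient, from (4)."""
    N, k = X.shape
    labels = np.unique(cluster)
    G = len(labels)
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in labels:
        s = X[cluster == g].T @ resid[cluster == g]
        meat += np.outer(s, s)
    V = G * (N - 1) / ((G - 1) * (N - k)) * XtX_inv @ meat @ XtX_inv
    return np.sqrt(V[coef, coef])

def wcr_bootstrap_p(y, X, cluster, coef=-1, B=199, seed=0):
    """Restricted wild cluster (WCR) bootstrap P value for beta_k = 0,
    following steps 1-6, with Rademacher weights and the symmetric
    P value (6)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(cluster)
    coef = coef % X.shape[1]
    # steps 1-2: OLS and the actual t statistic
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    t_hat = beta[coef] / _crve_se(X, resid, cluster, coef)
    # step 3: restricted estimates (impose beta_k = 0 by dropping column k)
    Xr = np.delete(X, coef, axis=1)
    br, *_ = np.linalg.lstsq(Xr, y, rcond=None)
    fitted_r, resid_r = Xr @ br, y - Xr @ br
    # steps 4-6: bootstrap loop with one Rademacher draw per cluster
    exceed = 0
    for _ in range(B):
        v = rng.choice([-1.0, 1.0], size=len(labels))
        y_star = fitted_r + resid_r * v[np.searchsorted(labels, cluster)]
        b_s, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        r_s = y_star - X @ b_s
        t_s = b_s[coef] / _crve_se(X, r_s, cluster, coef)
        exceed += abs(t_s) > abs(t_hat)
    return exceed / B

# illustrative usage: a cluster-level "treatment" dummy with no true effect
rng = np.random.default_rng(2)
N, G = 200, 10
cl = np.repeat(np.arange(G), N // G)
treat = (np.arange(G) < G // 2).astype(float)  # first half of clusters
X = np.column_stack([np.ones(N), rng.standard_normal(N),
                     np.repeat(treat, N // G)])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(N)
p_val = wcr_bootstrap_p(y, X, cl, coef=-1, B=99, seed=3)
```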
The procedure just described is known as the restricted wild cluster bootstrap, or WCR, because the bootstrap DGP (5) uses restricted parameter estimates and restricted residuals. An alternative procedure is the unrestricted wild cluster bootstrap, or WCU. It uses unrestricted estimates and unrestricted residuals, and the bootstrap t statistics in step 5 now test the hypothesis that β_k = β̂_k.

MacKinnon and Webb (2015) explains why the wild cluster bootstrap fails when the number of treated clusters is small. The WCR bootstrap, which imposes the null hypothesis, leads to severe under-rejection. In contrast, the WCU bootstrap, which does not impose the null hypothesis, leads to severe over-rejection. When just one cluster is treated, it over-rejects at almost the same rate as using CRVE t statistics with the t(G − 1) distribution. In the Monte Carlo experiments discussed below, we obtained results for the WCR and WCU bootstraps, but we do not report them because those procedures are treated in detail in MacKinnon and Webb (2015).
2.4 Randomization Inference: Coefficients
Conley and Taber (2011) suggests two procedures for inference with few treated groups. Both of these procedures involve constructing an empirical distribution by randomizing the assignment of groups to treatment and control and using this empirical distribution to conduct inference. These procedures are quite involved, and the details can be found in their paper. Of the two procedures, we restrict attention to the one based on randomization inference, because it often had better size properties in their Monte Carlo experiments.³

For the RI procedure of Conley and Taber (2011), a coefficient equivalent to β₄ in equation (2) is estimated, and the estimate β̂ is compared to an empirical distribution of estimated β*_j, where j indexes repetitions. The β*_j are obtained by pretending that control groups are actually treated.⁴ When G₁ = 1, the number of β*_j is G₀. This procedure evidently depends on the strong assumption that β̂ and the β*_j follow the same distribution. But that cannot be the case if the coefficients for some clusters are estimated more efficiently than for others. As we will demonstrate in Section 3.1, when clusters are of different sizes, and unusually large or small clusters are treated, this type of RI procedure can have very poor size properties.

³Under suitable assumptions, randomization inference tests yield exact inferences when the data come from experiments where there was random assignment to treatment. However, this is very rarely the case for applications of the DiD regression model (2) in economics.

⁴Actually, this is not quite how the procedure of Conley and Taber (2011) works. Instead of using β̂ and the β*_j, the procedure is based on quantities that are numerically close and asymptotically equivalent to them.
2.5 Randomization Inference: t Statistics
As an alternative to the Conley and Taber procedure, we consider RI procedures based on cluster-robust t statistics. For the CT procedure, inference is based on comparing the estimated β̂ to an empirical distribution of β*_j obtained by pretending that control groups were actually treated. Instead, we propose to compare the actual t statistic t̂ to an empirical distribution of the t*_j that correspond to the β*_j. A similar procedure was also recently considered in Canay, Romano and Shaikh (2014).

When there is just one treated group, it is natural to compare t̂ to the empirical distribution of the G₀ different t*_j statistics. However, when there are two or more treated groups and G₀ is not quite small, the number of potential t*_j to compare with can be very large. In such cases, we may pick B of them at random. Note that we never include the actual t̂ among the t*_j.
Our randomization inference procedure works as follows:

1. Estimate the regression model and calculate the cluster-robust t statistic for the coefficient of interest. For the DiD model (2), this is the t statistic for β₄ = 0. Refer to this t statistic as t̂_G₁, indicating actual treatment.

2. Generate a number of t*_j statistics to compare t̂_G₁ with.

   • When G₁ = 1, assign a group from the G₀ control groups as the treated group G*₁ for each repetition, re-estimate the model using the observations from all G groups, and calculate a new t statistic, t*_j, indicating randomized treatment. Repeat this process for all G₀ control groups. Thus the empirical distribution of the t*_j will have G₀ elements.

   • When G₁ > 1, randomize treatment across all groups for each repetition, re-estimate equation (2), and calculate a new t*_j. There are potentially C^G_{G₁} − 1 sets of groups to compare with, where C^n_k denotes n choose k. When this number is not too large, obtain all of the t*_j by enumeration. When it exceeds B, choose the comparators randomly, without replacement, from the set of potential comparators. Thus the empirical distribution will have min(C^G_{G₁} − 1, B) elements.

3. Sort the vector of t*_j statistics.

4. Determine the location of t̂_G₁ within the sorted vector of the t*_j; see the next section.
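For the G₁ = 1 case, steps 1-3 can be sketched as follows; the stripped-down DiD regression and all names are our own illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def did_t(y, g_id, post, treated):
    """Cluster-robust t statistic for beta_4 in regression (2), using the
    CRVE (4), with one group coded as treated. Minimal sketch."""
    GT = (g_id == treated).astype(float)
    X = np.column_stack([np.ones(len(y)), GT, post, GT * post])
    N, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    groups = np.unique(g_id)
    G = len(groups)
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in groups:
        s = X[g_id == g].T @ e[g_id == g]
        meat += np.outer(s, s)
    V = G * (N - 1) / ((G - 1) * (N - k)) * XtX_inv @ meat @ XtX_inv
    return b[3] / np.sqrt(V[3, 3])

def ri_t_statistics(y, g_id, post, treated):
    """Steps 1-3 for G1 = 1: the actual t-hat plus one sorted t* for each
    control group pretended treated, giving G0 comparison statistics."""
    t_hat = did_t(y, g_id, post, treated)                    # step 1
    t_stars = np.sort([did_t(y, g_id, post, g)               # steps 2-3
                       for g in np.unique(g_id) if g != treated])
    return t_hat, t_stars

# illustrative data: 10 groups, 20 periods, 5 individuals per cell, no effect
rng = np.random.default_rng(1)
G, T, n = 10, 20, 5
g_id = np.repeat(np.arange(G), T * n)
post = np.tile((np.repeat(np.arange(T), n) >= 10).astype(float), G)
y = rng.standard_normal(G * T * n)
t_hat, t_stars = ri_t_statistics(y, g_id, post, treated=0)
```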
2.6 Randomization Inference and Interval P Values
The most natural way to calculate a randomization inference P value is probably to use the equivalent of equation (6). Let S denote the number of repetitions, which would be G₀ when G₁ = 1 and either C^G_{G₁} − 1 or B when G₁ > 1. Then the analog of (6) is

\[
\hat p_1^* = \frac{1}{S}\sum_{j=1}^{S} I\bigl(|t_j^*| > |\hat t|\bigr). \tag{7}
\]

However, this is not the only way to compute a P value. A widely-used alternative is

\[
\hat p_2^* = \frac{1}{S+1}\left(1 + \sum_{j=1}^{S} I\bigl(|t_j^*| > |\hat t|\bigr)\right). \tag{8}
\]

Evidently, p̂₁* and p̂₂* tend to the same value as S → ∞, but they can yield quite different results when S is small.
Figure 1 shows analytical rejection frequencies for tests at the .05 level based on equations (7) and (8) for values of S between 7 and 103. In the figure, R denotes the number of times that t*_j is more extreme than t̂, so that R/S corresponds to p̂₁* and (R + 1)/(S + 1) corresponds to p̂₂*. It is evident that p̂₁* always rejects more often than p̂₂*, except when S = 19, 39, 59, and so on. Even for fairly large values of S, the difference between the two rejection frequencies can be substantial.
This phenomenon does not cause a serious problem for bootstrap inference. We can obtain identical inferences from the two varieties of P value by choosing B so that α(B + 1) is an integer, where α is the level of the test. Or we can make B a large number, so that the difference between p̂₁* and p̂₂* is always very small.

Unfortunately, neither of these solutions works for randomization inference. As an extreme example, suppose the data come from Canada, which has just ten provinces. If one province is treated, then G₁ = 1, G₀ = 9, and the P value can lie in only one of nine intervals: 0 to 1/10, 1/9 to 2/10, 2/9 to 3/10, and so on. Even if R = 0, it would never be reasonable to reject at the .01 or .05 levels. Thus the fundamental problem is that, when S is finite, RI P values are interval identified rather than point identified.⁵

2.7 Wild Bootstrap Randomization Inference
In this section we propose a simple way to overcome the problem discussed in the previous section. We propose a procedure which we will refer to as wild bootstrap randomization inference, or WBRI. The WBRI procedure essentially nests the randomization inference process of Section 2.5 within the wild cluster bootstrap of Section 2.3. The procedure for generating the t*_j statistics is as follows:
1. Estimate equation (2) by OLS and calculate t̂_G₁ for the coefficient of interest using CRVE standard errors.

2. Construct a bootstrap sample, y*_j, using the wild cluster bootstrap procedure discussed in Section 2.3.

3. Estimate equation (2) using y*_j and calculate a bootstrap t statistic t*_j using CRVE standard errors.

4. Re-estimate equation (2) using y*_j, but changing the treated group(s) from G₁ to G*₁. When G₁ = 1, this is done by cycling the treatment group across all G₀ control groups. Calculate a t statistic t*_j for each randomization using CRVE standard errors.

5. Repeat steps 2, 3, and 4 B times, constructing a new y*_j in each step 2.

6. Perform inference by comparing t̂_G₁ to either BG or B(G − 1) bootstrap statistics t*_j. The larger number includes B t*_j statistics that correspond to the G₁ actually treated groups and are drawn from exactly the same distribution as the t statistics in the wild cluster bootstrap procedure. The smaller number omits these. It consists only of t*_j statistics similar to those in the randomization inference procedure of Section 2.5.

⁵The problem with P values not being point identified is discussed at length in Webb (2014).
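For G₁ = 1, the WBRI steps can be sketched as follows. Again, this is an illustrative simplification with our own function names and simulated data, using the same stripped-down DiD regression as the earlier sketches.

```python
import numpy as np

def did_t(y, g_id, post, treated):
    """Cluster-robust t statistic for beta_4 in regression (2) with a given
    treated group, using the CRVE (4). Minimal sketch."""
    GT = (g_id == treated).astype(float)
    X = np.column_stack([np.ones(len(y)), GT, post, GT * post])
    N, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    groups = np.unique(g_id)
    G = len(groups)
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in groups:
        s = X[g_id == g].T @ e[g_id == g]
        meat += np.outer(s, s)
    V = G * (N - 1) / ((G - 1) * (N - k)) * XtX_inv @ meat @ XtX_inv
    return b[3] / np.sqrt(V[3, 3])

def wbri(y, g_id, post, treated, B=19, seed=0):
    """Steps 1-6 for G1 = 1: nest RI over pseudo-treated groups inside
    each wild cluster bootstrap replication. Returns t-hat and B*G
    (group, t*) pairs; dropping the pairs for the actual treated group
    leaves the B*(G-1) comparison set."""
    rng = np.random.default_rng(seed)
    groups = np.unique(g_id)
    t_hat = did_t(y, g_id, post, treated)                      # step 1
    # restricted estimates: impose beta_4 = 0 by dropping the interaction
    GT = (g_id == treated).astype(float)
    Xr = np.column_stack([np.ones(len(y)), GT, post])
    br, *_ = np.linalg.lstsq(Xr, y, rcond=None)
    fitted_r, resid_r = Xr @ br, y - Xr @ br
    t_stars = []
    for _ in range(B):                                         # step 5
        v = rng.choice([-1.0, 1.0], size=len(groups))          # step 2
        y_star = fitted_r + resid_r * v[np.searchsorted(groups, g_id)]
        for g in groups:                                       # steps 3-4
            t_stars.append((g, did_t(y_star, g_id, post, g)))
    return t_hat, t_stars

# illustrative data: 10 groups, 20 periods, 5 individuals per cell, no effect
rng = np.random.default_rng(4)
G, T, n = 10, 20, 5
g_id = np.repeat(np.arange(G), T * n)
post = np.tile((np.repeat(np.arange(T), n) >= 10).astype(float), G)
y = rng.standard_normal(G * T * n)
t_hat, t_stars = wbri(y, g_id, post, treated=0, B=5)
```

Because each randomization is recomputed on every bootstrap sample, the comparison set has BG (or B(G − 1)) elements rather than G₀, which is what mitigates the interval identification problem of Section 2.6.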
We consider inference based on both the BG and B(G − 1) empirical distributions. In view of the results in MacKinnon and Webb (2015), it is highly likely that the B bootstrap statistics corresponding to the actually treated groups will tend to be too large. Including these test statistics in the empirical distribution will thus tend to reduce rejection frequencies, compared to using just B(G − 1) of them. While this may be a desirable outcome if tests based on B(G − 1) bootstrap statistics tend to over-reject, it does not currently have any theoretical justification.
2.8 Other Procedures
Building off results in Donald and Lang (2007), Ibragimov and Muller (2015) (IM, henceforth) studies the Behrens-Fisher problem of comparing means of two groups with different variances. The paper focuses on differences in means for treatment and control groups and proves that t tests for these differences in means follow asymptotic distributions with the degrees of freedom determined by the number of groups. Specifically, the degrees of freedom equals min(G₀, G₁) − 1. When G₁ = 1, this number is 0, which implies that the IM procedure is not appropriate when there is only one treatment group. Future simulations will consider the IM procedure in cases where G₁ > 1.
A procedure somewhat related to the CT procedure is proposed by Abadie, Diamond and Hainmueller (2010), which also bases inference on an empirical distribution generated by perturbing the assignment of treatment. The procedure differs substantially from the ones considered in this paper, because it constructs a synthetic control as a weighted average of potential control groups. This results in both a different estimate of the treatment effect and a different P value. For this reason, we do not study the synthetic controls approach in this paper.
3 Monte Carlo Design
We conduct a set of Monte Carlo experiments to test the performance of the previously discussed procedures for inference when the number of treated clusters is small and cluster sizes are heterogeneous. In the experiments, we vary the total number of clusters, the number of treated clusters, and which clusters are treated. However, since neither the sample size nor the total number of clusters seemed to matter very much, we report results only for N = 5000 and G = 50.

In all experiments, we assign N total observations unevenly among G clusters. See the appendix for details on the assignment of observations and the size of each cluster.
The data generating process is the DiD model, equation (2), with β₄ = 0. Each observation is assigned to one of 20 years, and the starting year of treatment is randomly assigned to years between 4 and 14. The error terms are homoskedastic and correlated within each cluster, with correlation coefficient 0.05.

We vary the number of treated clusters from 1, 2, ..., 10. We also vary which clusters are treated in three ways: the smallest first, the largest first, or at random. The more observations the treated clusters have (at least up to about half the sample size), the more efficiently β₄ should be estimated.⁶ Thus we would expect randomization inference based on coefficient estimates to perform less well than randomization inference based on t statistics when the treated clusters are unusually large or small.
For the two varieties of randomization inference, the number of randomizations was as follows: for G₁ = 1, 49; for G₁ = 2, 1128 (C^48_2); and for G₁ ≥ 3, 999. For each experiment, 100,000 Monte Carlo replications were performed, and rejection frequencies were calculated at the 1%, 5%, and 10% levels, although only the 5% rejection frequencies are discussed below. We compare rejection frequencies for the cluster-robust variance estimator, the wild cluster bootstrap, two types of randomization inference, and two types of wild bootstrap randomization inference.
3.1 Monte Carlo Results
The Monte Carlo experiments contain three main findings. The first is that none of the procedures works well when the few treated clusters are atypical. The second is that the RI procedures appear to work well when typical clusters are treated. The third is that the performance of procedures based on randomization inference for coefficients does not improve with G₁, while the performance of procedures based on randomization inference for t statistics does improve rapidly.

The most striking result is that the size of some tests depends heavily on which clusters are treated. For instance, with coefficient RI (Conley and Taber), there is severe under-rejection when the largest clusters are treated first and fairly severe over-rejection when the smallest clusters are treated first. These patterns can be seen clearly in Figure 2.
Randomization inference based on t statistics also performs poorly when G₁ = 1, with the same pattern of under-rejection and over-rejection as randomization inference based on coefficients. Results are shown in Figure 3. In contrast to the results in Figure 2, however, the rejection frequencies improve rapidly as G₁ increases. When the smallest clusters are treated, the procedure seems to work perfectly for G₁ ≥ 6. When the largest clusters are treated, it always under-rejects, but not severely.
To understand why these two procedures perform so differently, consider what happens when different size clusters are treated. Let Ḡ denote the largest cluster and G̲ denote the smallest cluster. Additionally, let G₁ represent the actual treatment assignment and G*₁ represent the randomized assignment. Suppose that the treated cluster is Ḡ. The estimates of β for G₁ = Ḡ will necessarily be less dispersed than the estimates for G*₁ ≠ Ḡ. Thus the coefficient RI procedure compares β̂_G₁ with an empirical distribution that has too large a variance. Naturally, we under-reject.

On the other hand, if G₁ = G̲, the estimates of β̂_G₁ are more dispersed than the estimates for G*₁ ≠ G̲. So we are comparing β̂_G₁ with an empirical distribution that has too small a variance. Naturally, we over-reject.

The t statistic RI procedure solves this problem, but only when we treat enough clusters, because the CRVE t statistics do not work well with very few treated clusters.

⁶It is difficult to be precise about this, because efficiency will also depend on intra-cluster correlations, the number of treated years, and the values of other regressors.
One other result that is evident in both Figures 2 and 3 is that both RI procedures work very well when the treated clusters are chosen at random. These results are actually somewhat misleading. The distribution of cluster sizes is the same for all the experiments. The good results apparent in the figures arise because we are averaging over 100,000 replications. In some of those replications, when small clusters happen to be treated, too many rejections occur. In others, when large clusters happen to be treated, too few rejections occur. Only when clusters of intermediate size are treated do all the procedures actually work well.

Some results are not (yet) reported. Results for cluster-robust t statistics are consistent with previous findings; the tests are greatly oversized regardless of which clusters are treated. Results for the wild cluster bootstrap are consistent with those in MacKinnon and Webb (2015). The unrestricted wild cluster bootstrap has size identical to the CRVE, with severe over-rejection for few treated clusters. The restricted wild bootstrap has very severe under-rejection with few treated clusters, regardless of which clusters are treated. Rejection frequencies for WBRI were only calculated for G₁ = 1, 2, and rejection frequencies for those tests were very similar to the analogous RI tests.
4 Conclusion
We compare several alternative procedures for inference with few treated clusters, discuss the limitations of some existing procedures, and propose several new procedures. We perform Monte Carlo experiments to compare the robustness of several procedures to heterogeneity in cluster sizes. We find that none of the existing or proposed tests works well when a single treated cluster is unusually large or small. Ongoing research will consider alternative procedures that show promise in this setting.
References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010) 'Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program.' Journal of the American Statistical Association 105(490), 493–505

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004) 'How much should we trust differences-in-differences estimates?' The Quarterly Journal of Economics 119(1), 249–275

Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller (2008) 'Bootstrap-based improvements for inference with clustered errors.' The Review of Economics and Statistics 90(3), 414–427

Cameron, A.C., and D.L. Miller (2015) 'A practitioner's guide to cluster-robust inference.' Journal of Human Resources 50, 317–372

Canay, Ivan A., Joseph P. Romano, and Azeem M. Shaikh (2014) 'Randomization tests under an approximate symmetry assumption.' Technical Report No. 2014-13, Stanford University

Carter, Andrew V., Kevin T. Schnepel, and Douglas G. Steigerwald (2013) 'Asymptotic behavior of a t test robust to cluster heterogeneity.' Technical Report, University of California, Santa Barbara

Conley, Timothy G., and Christopher R. Taber (2011) 'Inference with Difference in Differences with a small number of policy changes.' The Review of Economics and Statistics 93(1), 113–125

Davidson, Russell, and Emmanuel Flachaire (2008) 'The wild bootstrap, tamed at last.' Journal of Econometrics 146(1), 162–169

Donald, Stephen G., and Kevin Lang (2007) 'Inference with difference-in-differences and other panel data.' The Review of Economics and Statistics 89(2), 221–233

Ibragimov, Rustam, and Ulrich K. Muller (2015) 'Inference with few heterogeneous clusters.' Review of Economics and Statistics, forthcoming

Liang, Kung-Yee, and Scott L. Zeger (1986) 'Longitudinal data analysis using generalized linear models.' Biometrika 73(1), 13–22

MacKinnon, James G., and Matthew D. Webb (2015) 'Wild bootstrap inference for wildly different cluster sizes.' Working Paper 1314, Queen's University, Department of Economics

Webb, Matthew D. (2014) 'Reworking wild bootstrap based inference for clustered errors.' Working Paper 1315, Queen's University, Department of Economics
Figure 1: Rejection Frequencies and Number of Simulations
[Plot: rejection rate (0.00 to 0.13) against S (roughly 10 to 100) for tests based on R/S and on (R + 1)/(S + 1) at the .05 level; figure not reproducible in this text version.]
Figure 2: Rejection Frequencies for Coefficient Randomization Inference
[Plot: rejection rate (0.00 to 0.24) against G₁ (1 to 10), with separate curves for smallest treated, largest treated, and random treated; figure not reproducible in this text version.]
Figure 3: Rejection Frequencies for t Statistic Randomization Inference
[Plot: rejection rate (0.00 to 0.24) against G₁ (1 to 10), with separate curves for smallest treated, largest treated, and random treated; figure not reproducible in this text version.]
Appendix
Cluster Sizes
Observations were assigned to clusters by using the following formula:

\[
N_g = \left[\, N\,\frac{\exp(\gamma g/G)}{\sum_{j=1}^{G}\exp(\gamma j/G)}\,\right], \qquad g = 1, \ldots, G-1, \tag{9}
\]

where γ ≥ 0 is a parameter that determines how uneven the cluster sizes are, and [·] denotes the integer part. The value of N_G was chosen to ensure that Σ_{g=1}^G N_g = N. When γ = 0 and N/G is an integer, equation (9) implies that N_g = N/G for all g. As γ increases, however, cluster sizes vary more and more.
For the experiments with 5000 observations and γ = 2, the sizes of the 50 clusters are: 31, 33, 34, 36, 37, 39, 40, 42, 43, 45, 47, 49, 51, 53, 55, 58, 60, 63, 65, 68, 71, 73, 76, 80, 83, 86, 90, 94, 97, 101, 106, 110, 114, 119, 124, 129, 134, 140, 146, 151, 158, 164, 171, 178, 185, 193, 201, 209, 217, 251.
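The assignment rule (9), with the last cluster absorbing the rounding remainder, can be checked directly; the sketch below should reproduce the sizes listed above for N = 5000, G = 50, γ = 2.

```python
import numpy as np

def cluster_sizes(N=5000, G=50, gamma=2.0):
    """Cluster sizes from equation (9): the integer part of
    N*exp(gamma*g/G)/sum_j exp(gamma*j/G) for g = 1,...,G-1, with the
    last cluster absorbing the remainder so the sizes sum to N."""
    denom = np.exp(gamma * np.arange(1, G + 1) / G).sum()
    sizes = [int(N * np.exp(gamma * g / G) / denom) for g in range(1, G)]
    sizes.append(N - sum(sizes))  # N_G chosen so that the sizes sum to N
    return sizes

sizes = cluster_sizes()  # should reproduce the cluster sizes listed above
```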