Difference-in-Differences Inference With Few Treated Clusters∗

James G. MacKinnon                Matthew D. Webb
[email protected]            [email protected]

May 15, 2015

Abstract

Inference with difference-in-differences requires care. Past research has shown that the cluster-robust variance estimator (CRVE) severely over-rejects when there are few treated clusters, that different variants of the wild cluster bootstrap can severely over-reject or severely under-reject, and that procedures based on randomization show promise. We demonstrate that randomization inference (RI) procedures based on estimated coefficients, such as the one proposed by Conley and Taber (2011), fail when the treated clusters are atypical, while RI procedures based on t statistics fail when few atypical clusters are treated. We also propose a bootstrap-based alternative to randomization inference, which mitigates the discrete nature of RI P values with a small number of clusters.

Keywords: CRVE, grouped data, clustered data, panel data, cluster wild bootstrap, difference in differences, DiD, randomization inference

Preliminary and Incomplete. Please do not cite or circulate.

∗This research was supported, in part, by grants from the Social Sciences and Humanities Research Council of Canada.

1 Introduction

Inference for difference-in-differences estimators has received considerable attention in the past decade.¹ While much progress has been made, there are still some areas in which reliable inference is a challenge. We study one such area: when there are few treated groups. Past research has shown that cluster-robust inference will greatly over-reject with just one treated group (Conley and Taber, 2011). The wild cluster bootstrap will either greatly under-reject or greatly over-reject, depending on whether the null hypothesis is imposed on the bootstrap DGP (MacKinnon and Webb, 2015).
Several authors have considered randomization inference (RI) as a way to obtain accurate test size with few treated groups (Conley and Taber, 2011; Canay, Romano and Shaikh, 2014). However, existing RI procedures rely on strong assumptions about the comparability of the control groups to the treatment groups. We show that these assumptions fail to hold when the treated groups are either larger or smaller, in terms of number of observations, than the control groups. We provide evidence that existing RI procedures can severely over-reject or under-reject in that case. We also highlight a problem for inference at typical significance levels with these procedures and propose a bootstrap-based modification.

We are motivated by a conventional, linear, difference-in-differences model with individual data. Specifically, we are interested in applications in which there is variation in treatment across both groups and time periods. Such models are often expressed as follows. If i indexes individuals, g indexes groups, and t indexes time periods, then a classic difference-in-differences (or DiD) regression can be written as

    y_igt = β1 + β2 GT_igt + β3 PT_igt + β4 GT_igt PT_igt + ε_igt,      (1)
    i = 1, …, Ng,  g = 1, …, G,  t = 1, …, T,                           (2)

in which GT_igt is a "group treated" dummy that equals 1 if group g is treated in any time period, and PT_igt is a "period treated" dummy that equals 1 if any group is treated in time period t. The coefficient of most interest is β4, which shows the effect on treated groups in periods when there is treatment. Following the literature, we denote the G groups as either control groups G0 or treated groups G1.

Past research, notably Conley and Taber (2011) and MacKinnon and Webb (2015), has shown that the reliability of inference of several procedures varies considerably with the fraction of clusters treated. We compare several alternative procedures for inference with few treated clusters.
We discuss the theoretical advantages and limitations of the various procedures before comparing the size of various tests in a set of Monte Carlo experiments. Additionally, we introduce two new procedures for inference in this setting: an RI test based on t statistics, and a bootstrap-based modification of that procedure. We find that the bootstrap-based RI procedure can improve inference when there are few control groups, but that none of the existing procedures allows for reliable inference when groups are heterogeneous and only one group is treated.

¹ See, among others, Bertrand, Duflo and Mullainathan (2004), Donald and Lang (2007), Cameron, Gelbach and Miller (2008), Conley and Taber (2011), MacKinnon and Webb (2015), Canay, Romano and Shaikh (2014), and Ibragimov and Muller (2015).

Section 2 briefly discusses the issue of clustered errors and considers various procedures for inference. Section 2.1 discusses cluster-robust variance estimation and its limitations. Section 2.2 discusses the effective number of clusters approach developed in Carter, Schnepel and Steigerwald (2013). The wild cluster bootstrap is discussed in Section 2.3. Section 2.4 describes the strengths and weaknesses of the Conley and Taber approach to randomization inference. Section 2.5 considers an alternative RI procedure based on t statistics. Section 2.6 discusses a practical problem with all RI procedures when the number of control groups is small. Section 2.7 then introduces a bootstrap-based modified RI procedure that solves this problem. Recent work by Ibragimov and Muller is discussed in Section 2.8. We compare several procedures in a set of Monte Carlo experiments. The experiments are described in Section 3, and the results are presented in Section 3.1. Section 4 concludes.

2 Inference with Few Treated Clusters

A linear regression model with clustered errors may be written as

    y ≡ [y1; y2; …; yG] = Xβ + ε ≡ [X1; X2; …; XG] β + [ε1; ε2; …; εG],      (3)

where each of the G clusters, indexed by g, has Ng observations, and [·; ·] denotes vertical stacking. The matrix X and the vectors y and ε have N = Σ_{g=1}^G Ng rows, X has k columns, and the parameter vector β has k rows. OLS estimation of equation (3) yields estimates β̂ and residuals ε̂. Because the elements of the εg are generally neither independent nor identically distributed, standard OLS estimates of the standard errors of β̂ are invalid, and conventional inference can be severely unreliable. Cameron and Miller (2015) is a recent and comprehensive survey of cluster-robust inference. In the remainder of this section, we discuss several procedures for inference when there is within-cluster correlation and few treated clusters.

2.1 Cluster-Robust Variance Estimation

There are several cluster-robust variance estimators, of which the earliest may be Liang and Zeger (1986). This is the estimator that is used when the cluster command is invoked in Stata. The CRVE we investigate is defined as

    [G(N − 1) / ((G − 1)(N − k))] (X′X)⁻¹ ( Σ_{g=1}^G Xg′ ε̂g ε̂g′ Xg ) (X′X)⁻¹.      (4)

It yields reliable inferences when the number of clusters is large (Cameron, Gelbach and Miller, 2008) and the number of observations per cluster is balanced across clusters (Carter, Schnepel and Steigerwald, 2013; MacKinnon and Webb, 2015). However, Conley and Taber (2011) and MacKinnon and Webb (2015) show that it over-rejects severely when there are very few treated clusters. Rejection frequencies can be well over 50% when only one cluster is treated, even if the t statistics are assumed to follow a t(G − 1) distribution, as in Donald and Lang (2007).

2.2 The Effective Number of Clusters

Recent work by Carter, Schnepel and Steigerwald (2013) develops the asymptotic properties of the CRVE when the number of observations per group is not constant. The authors show that, when clusters are unbalanced, the dataset has an effective number of clusters, G∗, which is less than G (sometimes very much less).
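The CRVE in equation (4) is straightforward to compute directly. The sketch below, in Python with NumPy, is our own minimal illustration rather than the authors' code; the function name `crve` and its interface are hypothetical. It returns the sandwich estimate with the G(N − 1)/((G − 1)(N − k)) small-sample correction, clustering on a user-supplied vector of cluster identifiers.

```python
import numpy as np

def crve(X, y, cluster_ids):
    """Cluster-robust variance matrix in the spirit of equation (4),
    with the G(N-1)/((G-1)(N-k)) small-sample correction."""
    N, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    clusters = np.unique(cluster_ids)
    G = len(clusters)
    meat = np.zeros((k, k))
    for g in clusters:
        idx = cluster_ids == g
        score = X[idx].T @ resid[idx]      # the k-vector X_g' e_g
        meat += np.outer(score, score)     # X_g' e_g e_g' X_g
    c = G * (N - 1) / ((G - 1) * (N - k))
    return c * XtX_inv @ meat @ XtX_inv, beta_hat
```

A cluster-robust t statistic for coefficient j is then `beta_hat[j] / sqrt(V[j, j])`, which is the quantity compared to a t(G − 1) distribution in the discussion above.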
Simulations in MacKinnon and Webb (2015) show that the procedure can work fairly well when intermediate numbers of clusters are treated. However, when few clusters are treated (or untreated), it over-rejects when inference is conducted assuming that t statistics follow a t(G∗) distribution, and it under-rejects when assuming that they follow a t(G∗ − 1) distribution. For this reason, we do not consider this procedure in our Monte Carlo simulations.

2.3 The Wild Cluster Bootstrap

The wild cluster bootstrap was proposed as a method for reliable inference in cases with a small number of clusters by Cameron, Gelbach and Miller (2008). It was studied in MacKinnon and Webb (2015) for the case of unbalanced clusters.² Because we will be proposing a new procedure that is closely related to the wild cluster bootstrap in Section 2.7, it is important to understand how the latter works. We therefore discuss it in some detail, based on the treatment in MacKinnon and Webb (2015).

We wish to test the hypothesis that a single coefficient in equation (3) is zero. Without loss of generality, we let this be βk, the last coefficient of β. The restricted wild cluster bootstrap works as follows:

1. Estimate equation (3) by OLS.

2. Calculate t̂k, the t statistic for βk = 0, using the square root of the kth diagonal element of (4) as a cluster-robust standard error.

3. Re-estimate the model (3) subject to the restriction that βk = 0, so as to obtain the restricted residuals ε̃ and the restricted estimates β̃.

4. For each of B bootstrap replications, indexed by j, generate a new set of bootstrap dependent variables y*j_ig using the bootstrap DGP

    y*j_ig = X_ig β̃ + ε̃_ig v*j_g,      (5)

where y*j_ig is an element of the vector y*j of observations on the bootstrap dependent variable, X_ig is the corresponding row of X, and so on. Here v*j_g is a random variable that follows the Rademacher distribution; see Davidson and Flachaire (2008). It takes the values 1 and −1 with equal probability.
Note that we would not want to use the Rademacher distribution if G were smaller than about 12; see Webb (2014), which proposes an alternative for such cases.

² A different bootstrap procedure for cluster-robust inference was previously proposed by Bertrand, Duflo and Mullainathan (2004).

5. For each bootstrap replication, estimate regression (3) using y*j as the regressand, and calculate t*j_k, the bootstrap t statistic for βk = 0, using the square root of the kth diagonal element of (4), with bootstrap residuals replacing the OLS residuals, as the standard error.

6. Calculate the bootstrap P value as

    p̂*_s = (1/B) Σ_{j=1}^B I(|t*j_k| > |t̂k|),      (6)

where I(·) denotes the indicator function. Equation (6) assumes that the distribution is symmetric. Alternatively, one can use a slightly more complicated formula for an equal-tail bootstrap P value.

The procedure just described is known as the restricted wild cluster bootstrap, or WCR, because the bootstrap DGP (5) uses restricted parameter estimates and restricted residuals. An alternative procedure is the unrestricted wild cluster bootstrap, or WCU. It uses unrestricted estimates and unrestricted residuals, and the bootstrap t statistics in step 5 now test the hypothesis that βk = β̂k.

MacKinnon and Webb (2015) explains why the wild cluster bootstrap fails when the number of treated clusters is small. The WCR bootstrap, which imposes the null hypothesis, leads to severe under-rejection. In contrast, the WCU bootstrap, which does not impose the null hypothesis, leads to severe over-rejection. When just one cluster is treated, it over-rejects at almost the same rate as using CRVE t statistics with the t(G − 1) distribution. In the Monte Carlo experiments discussed below, we obtained results for the WCR and WCU bootstraps, but we do not report them because those procedures are treated in detail in MacKinnon and Webb (2015).
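The six steps above can be sketched in code. The following Python/NumPy fragment is our own minimal illustration under simplifying assumptions, not the authors' implementation: the restriction βk = 0 is imposed by dropping column k of X, and a helper `crve_tstat` (hypothetical name) recomputes the CRVE of equation (4) on each bootstrap sample.

```python
import numpy as np

def crve_tstat(X, y, cid, j):
    """t statistic for beta_j = 0 using the CRVE of equation (4)."""
    N, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    A = np.linalg.inv(X.T @ X)
    groups = np.unique(cid)
    G = len(groups)
    meat = sum(np.outer(X[cid == g].T @ e[cid == g],
                        X[cid == g].T @ e[cid == g]) for g in groups)
    V = G * (N - 1) / ((G - 1) * (N - k)) * A @ meat @ A
    return b[j] / np.sqrt(V[j, j])

def wcr_pvalue(X, y, cid, j, B=399, seed=12345):
    """Restricted (WCR) wild cluster bootstrap P value for beta_j = 0,
    with Rademacher weights drawn at the cluster level."""
    rng = np.random.default_rng(seed)
    t_hat = crve_tstat(X, y, cid, j)            # steps 1-2
    Xr = np.delete(X, j, axis=1)                # step 3: impose beta_j = 0
    br = np.linalg.lstsq(Xr, y, rcond=None)[0]
    e_tilde = y - Xr @ br                       # restricted residuals
    groups = np.unique(cid)
    count = 0
    for _ in range(B):
        # Step 4: one Rademacher draw (+1 or -1) per cluster.
        v = dict(zip(groups, rng.choice([-1.0, 1.0], size=len(groups))))
        y_star = Xr @ br + e_tilde * np.array([v[g] for g in cid])
        # Step 5: bootstrap t statistic on the bootstrap sample.
        if abs(crve_tstat(X, y_star, cid, j)) > abs(t_hat):
            count += 1
    return count / B                            # equation (6), symmetric
```

Replacing the restricted quantities `Xr`, `br`, and `e_tilde` with the unrestricted ones, and centering the bootstrap statistics at β̂j, would give the WCU variant.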
2.4 Randomization Inference: Coefficients

Conley and Taber (2011) suggests two procedures for inference with few treated groups. Both of these procedures involve constructing an empirical distribution by randomizing the assignment of groups to treatment and control and using this empirical distribution to conduct inference. These procedures are quite involved, and the details can be found in their paper. Of the two procedures, we restrict attention to the one based on randomization inference, because it often had better size properties in their Monte Carlo experiments.³

For the RI procedure of Conley and Taber (2011), a coefficient equivalent to β4 in equation (2) is estimated, and the estimate β̂ is compared to an empirical distribution of estimated β*j, where j indexes repetitions. The β*j are obtained by pretending that control groups are actually treated.⁴ When G1 = 1, the number of β*j is G0. This procedure evidently depends on the strong assumption that β̂ and the β*j follow the same distribution. But that cannot be the case if the coefficients for some clusters are estimated more efficiently than for others. As we will demonstrate in Section 3.1, when clusters are of different sizes, and unusually large or small clusters are treated, this type of RI procedure can have very poor size properties.

³ Under suitable assumptions, randomization inference tests yield exact inferences when the data come from experiments where there was random assignment to treatment. However, this is very rarely the case for applications of the DiD regression model (2) in economics.

⁴ Actually, this is not quite how the procedure of Conley and Taber (2011) works. Instead of using β̂ and the β*j, the procedure is based on quantities that are numerically close and asymptotically equivalent to them.

2.5 Randomization Inference: t Statistics

As an alternative to the Conley and Taber procedure, we consider RI procedures based on cluster-robust t statistics.
For the CT procedure, inference is based on comparing the estimated β̂ to an empirical distribution of β*j obtained by pretending that control groups were actually treated. Instead, we propose to compare the actual t statistic t̂ to an empirical distribution of the t*j that correspond to the β*j. A similar procedure was also recently considered in Canay, Romano and Shaikh (2014).

When there is just one treated group, it is natural to compare t̂ to the empirical distribution of G0 different t*j statistics. However, when there are two or more treated groups and G0 is not quite small, the number of potential t*j to compare with can be very large. In such cases, we may pick B of them at random. Note that we never include the actual t̂ among the t*j. Our randomization inference procedure works as follows:

1. Estimate the regression model and calculate the cluster-robust t statistic for the coefficient of interest. For the DiD model (2), this is the t statistic for β4 = 0. Refer to this t statistic as t̂G1, indicating actual treatment.

2. Generate a number of t*j statistics to compare t̂G1 with.

• When G1 = 1, assign a group from the G0 control groups as the treated group G*1 for each repetition, re-estimate the model using the observations from all G groups, and calculate a new t statistic, t*j, indicating randomized treatment. Repeat this process for all G0 control groups. Thus the empirical distribution of the t*j will have G0 elements.

• When G1 > 1, randomize treatment across all groups for each repetition, re-estimate equation (2), and calculate a new t*j. There are potentially C(G, G1) − 1 sets of groups to compare with, where C(n, k) denotes "n choose k". When this number is not too large, obtain all of the t*j by enumeration. When it exceeds B, choose the comparators randomly, without replacement, from the set of potential comparators. Thus the empirical distribution will have min(C(G, G1) − 1, B) elements.

3. Sort the vector of t*j statistics.

4.
Determine the location of t̂G1 within the sorted vector of the t*j; see the next section.

2.6 Randomization Inference and Interval P Values

The most natural way to calculate a randomization inference P value is probably to use the equivalent of equation (6). Let S denote the number of repetitions, which would be G0 when G1 = 1 and either C(G, G1) − 1 or B when G1 > 1. Then the analog of (6) is

    p̂*1 = (1/S) Σ_{j=1}^S I(|t*j| > |t̂|).      (7)

However, this is not the only way to compute a P value. A widely-used alternative is

    p̂*2 = (1/(S + 1)) (1 + Σ_{j=1}^S I(|t*j| > |t̂|)).      (8)

Evidently, p̂*1 and p̂*2 tend to the same value as S → ∞, but they can yield quite different results when S is small. Figure 1 shows analytical rejection frequencies for tests at the .05 level based on equations (7) and (8) for values of S between 7 and 103. In the figure, R denotes the number of times that t̂ is more extreme than t*j, so that R/S corresponds to p̂*1 and (R + 1)/(S + 1) corresponds to p̂*2. It is evident that p̂*1 always rejects more often than p̂*2, except when S = 19, 39, 59, and so on. Even for fairly large values of S, the difference between the two rejection frequencies can be substantial.

This phenomenon does not cause a serious problem for bootstrap inference. We can obtain identical inferences from the two varieties of P value by choosing B so that α(B + 1) is an integer, where α is the level of the test. Or we can make B a large number, so that the difference between p̂*1 and p̂*2 is always very small.

Unfortunately, neither of these solutions works for randomization inference. As an extreme example, suppose the data come from Canada, which has just ten provinces. If one province is treated, then G1 = 1, G0 = 9, and the P value can lie in only one of nine intervals: 0 to 1/10, 1/9 to 2/10, 2/9 to 3/10, and so on. Even if R = 0, it would never be reasonable to reject at the .01 or .05 levels. Thus the fundamental problem is that, when S is finite, RI P values are interval identified rather than point identified.⁵

⁵ The problem with P values not being point identified is discussed at length in Webb (2014).

2.7 Wild Bootstrap Randomization Inference

In this section we propose a simple way to overcome the problem discussed in the previous section. We propose a procedure which we will refer to as wild bootstrap randomization inference, or WBRI. The WBRI procedure essentially nests the randomization inference process of Section 2.5 within the wild cluster bootstrap of Section 2.3. The procedure for generating the t*j statistics is as follows:

1. Estimate equation (2) by OLS and calculate t̂G1 for the coefficient of interest using CRVE standard errors.

2. Construct a bootstrap sample, y*j, using the wild cluster bootstrap procedure discussed in Section 2.3.

3. Estimate equation (2) using y*j and calculate a bootstrap t statistic t*j using CRVE standard errors.

4. Re-estimate equation (2) using y*j, but changing the treated group(s) from G1 to G*1. When G1 = 1, this is done by cycling the treatment group across all G0 control groups. Calculate a t statistic t*j for each randomization using CRVE standard errors.

5. Repeat steps 2, 3, and 4 B times, constructing a new y*j in each step 2.

6. Perform inference by comparing t̂G1 to either BG or B(G − 1) bootstrap statistics t*j. The larger number includes the B t*j statistics that correspond to the G1 actually treated groups and are drawn from exactly the same distribution as the t statistics in the wild cluster bootstrap procedure. The smaller number omits these; it consists only of t*j statistics similar to those in the randomization inference procedure of Section 2.5.

We consider inference based on both the BG and B(G − 1) empirical distributions.
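The two P values in equations (7) and (8) of Section 2.6 are easy to compute, and a few lines of code make the interval-identification problem concrete. The sketch below is our own illustration (the function name `ri_pvalues` is hypothetical); it reproduces the Canada example, where S = 9 placebo statistics leave no room to reject at the .05 level even when R = 0.

```python
def ri_pvalues(t_hat, t_star):
    """Randomization-inference P values from equations (7) and (8).
    t_star is the list of S placebo statistics t*_j."""
    S = len(t_star)
    R = sum(abs(t) > abs(t_hat) for t in t_star)  # exceedance count R
    p1 = R / S                                    # equation (7)
    p2 = (R + 1) / (S + 1)                        # equation (8)
    return p1, p2

# The Canada example: G1 = 1 treated province, G0 = 9 controls, so S = 9.
# No placebo statistic exceeds t_hat, so R = 0:
p1, p2 = ri_pvalues(t_hat=3.5,
                    t_star=[0.2, -1.1, 0.8, -0.4, 1.9, -0.7, 0.3, 1.2, -2.0])
# p1 = 0.0 and p2 = 0.1: the P value is only known to lie in [0, 1/10],
# so rejection at the .05 or .01 level is never reasonable.
```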
In view of the results in MacKinnon and Webb (2015), it is highly likely that the B bootstrap statistics corresponding to the actually treated groups will tend to be too large. Including these test statistics in the empirical distribution will thus tend to reduce rejection frequencies, compared to using just B(G − 1) of them. While this may be a desirable outcome if tests based on B(G − 1) bootstrap statistics tend to over-reject, it does not currently have any theoretical justification.

2.8 Other Procedures

Building off results in Donald and Lang (2007), Ibragimov and Muller (2015) (IM, henceforth) studies the Behrens-Fisher problem of comparing means of two groups with different variances. The paper focuses on differences in means for treatment and control groups and proves that t tests for these differences in means follow asymptotic distributions with the degrees of freedom determined by the number of groups. Specifically, the degrees of freedom equals min(G0, G1) − 1. When G1 = 1, this number is 0, which implies that the IM procedure is not appropriate when there is only one treatment group. Future simulations will consider the IM procedure in cases where G1 > 1.

A procedure somewhat related to the CT procedure is proposed by Abadie, Diamond and Hainmueller (2010), which also bases inference on an empirical distribution generated by perturbing the assignment of treatment. The procedure differs substantially from the ones considered in this paper, because it constructs a synthetic control as a weighted average of potential control groups. This results in both a different estimate of the treatment effect and a different P value. For this reason, we do not study the synthetic controls approach in this paper.

3 Monte Carlo Design

We conduct a set of Monte Carlo experiments to test the performance of the previously discussed procedures for inference when the number of treated clusters is small and cluster sizes are heterogeneous.
In the experiments, we vary the total number of clusters, the number of treated clusters, and which clusters are treated. However, since neither the sample size nor the total number of clusters seemed to matter very much, we report results only for N = 5000 and G = 50. In all experiments, we assign N total observations unevenly among G clusters. See the appendix for details on the assignment of observations and the size of each cluster.

The data generating process is the DiD model, equation (2), with β4 = 0. Each observation is assigned to one of 20 years, and the starting year of treatment is randomly assigned to years between 4 and 14. The error terms are homoskedastic and correlated within each cluster, with correlation coefficient 0.05. We vary the number of treated clusters from 1, 2, …, 10. We also vary which clusters are treated in three ways: the smallest first, the largest first, or at random. The more observations the treated clusters have (at least up to about half the sample size), the more efficiently β4 should be estimated.⁶ Thus we would expect randomization inference based on coefficient estimates to perform less well than randomization inference based on t statistics when the treated clusters are unusually large or small.

For the two varieties of randomization inference, the number of randomizations was as follows: for G1 = 1, 49; for G1 = 2, 1128 (= C(48, 2)); and for G1 ≥ 3, 999. For each experiment, 100,000 Monte Carlo replications were performed, and rejection frequencies were calculated at the 1%, 5%, and 10% levels, although only the 5% rejection frequencies are discussed below. We compare rejection frequencies for the cluster-robust variance estimator, the wild cluster bootstrap, two types of randomization inference, and two types of wild bootstrap randomization inference.

3.1 Monte Carlo Results

The Monte Carlo experiments contain three main findings. The first is that none of the procedures works well when the few treated clusters are atypical.
The second is that the RI procedures appear to work well when typical clusters are treated. The third is that the performance of procedures based on randomization inference for coefficients does not improve with G1, while the performance of procedures based on randomization inference for t statistics does improve rapidly.

The most striking result is that the size of some tests depends heavily on which clusters are treated. For instance, with coefficient RI (Conley and Taber), there is severe under-rejection when the largest clusters are treated first and fairly severe over-rejection when the smallest clusters are treated first. These patterns can be seen clearly in Figure 2.

Randomization inference based on t statistics also performs poorly when G1 = 1, with the same pattern of under-rejection and over-rejection as randomization inference based on coefficients. Results are shown in Figure 3. In contrast to the results in Figure 2, however, the rejection frequencies improve rapidly as G1 increases. When the smallest clusters are treated, the procedure seems to work perfectly for G1 ≥ 6. When the largest clusters are treated, it always under-rejects, but not severely.

To understand why these two procedures perform so differently, consider what happens when different size clusters are treated. Let Ḡ denote the largest cluster and G̲ denote the smallest cluster. Additionally, let G1 represent the actual treatment assignment and G*1 represent the randomized assignment. Suppose that the treated cluster is Ḡ. The estimates of β for G1 = Ḡ will necessarily be less dispersed than the estimates for G*1 ≠ Ḡ. Thus the coefficient RI procedure compares β̂G1 with an empirical distribution that has too large a variance. Naturally, we under-reject.

⁶ It is difficult to be precise about this, because efficiency will also depend on intra-cluster correlations, the number of treated years, and the values of other regressors.
On the other hand, if G1 = G̲, the estimates of β for G1 = G̲ are more dispersed than the estimates for G*1 ≠ G̲. So we are comparing β̂G1 with an empirical distribution that has too small a variance. Naturally, we over-reject. The t statistic RI procedure solves this problem, but only when we treat enough clusters, because the CRVE t statistics do not work well with very few treated clusters.

One other result that is evident in both Figures 2 and 3 is that both RI procedures work very well when the treated clusters are chosen at random. These results are actually somewhat misleading. The distribution of cluster sizes is the same for all the experiments. The good results apparent in the figures arise because we are averaging over 100,000 replications. In some of those replications, when small clusters happen to be treated, too many rejections occur. In others, when large clusters happen to be treated, too few rejections occur. Only when clusters of intermediate size are treated do all the procedures actually work well.

Some results are not (yet) reported. Results for cluster-robust t statistics are consistent with previous findings; the tests are greatly oversized regardless of which clusters are treated. Results for the wild cluster bootstrap are consistent with those in MacKinnon and Webb (2015). The unrestricted wild cluster bootstrap has size identical to the CRVE, with severe over-rejection for few treated clusters. The restricted wild bootstrap has very severe under-rejection with few treated clusters, regardless of which clusters are treated. Rejection frequencies for WBRI were only calculated for G1 = 1, 2, and rejection frequencies for those tests were very similar to the analogous RI tests.

4 Conclusion

We compare several alternative procedures for inference with few treated clusters, discuss the limitations of some existing procedures, and propose several new procedures.
We perform Monte Carlo experiments to compare the robustness of several procedures to heterogeneity in cluster sizes. We find that none of the existing or proposed tests works well when a single treated cluster is unusually large or small. Ongoing research will consider alternative procedures that show promise in this setting.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010) 'Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program.' Journal of the American Statistical Association 105(490), 493–505

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004) 'How much should we trust differences-in-differences estimates?' The Quarterly Journal of Economics 119(1), 249–275

Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller (2008) 'Bootstrap-based improvements for inference with clustered errors.' The Review of Economics and Statistics 90(3), 414–427

Cameron, A.C., and D.L. Miller (2015) 'A practitioner's guide to cluster-robust inference.' Journal of Human Resources 50, 317–372

Canay, Ivan A., Joseph P. Romano, and Azeem M. Shaikh (2014) 'Randomization tests under an approximate symmetry assumption.' Technical Report No. 2014-13, Stanford University

Carter, Andrew V., Kevin T. Schnepel, and Douglas G. Steigerwald (2013) 'Asymptotic behavior of a t test robust to cluster heterogeneity.' Technical Report, University of California, Santa Barbara

Conley, Timothy G., and Christopher R. Taber (2011) 'Inference with "Difference in Differences" with a small number of policy changes.' The Review of Economics and Statistics 93(1), 113–125

Davidson, Russell, and Emmanuel Flachaire (2008) 'The wild bootstrap, tamed at last.' Journal of Econometrics 146(1), 162–169

Donald, Stephen G., and Kevin Lang (2007) 'Inference with difference-in-differences and other panel data.' The Review of Economics and Statistics 89(2), 221–233

Ibragimov, Rustam, and Ulrich K. Muller (2015) 'Inference with few heterogeneous clusters.' Review of Economics & Statistics, forthcoming

Liang, Kung-Yee, and Scott L. Zeger (1986) 'Longitudinal data analysis using generalized linear models.' Biometrika 73(1), 13–22

MacKinnon, James G., and Matthew D. Webb (2015) 'Wild bootstrap inference for wildly different cluster sizes.' Working Paper 1314, Queen's University, Department of Economics

Webb, Matthew D. (2014) 'Reworking wild bootstrap based inference for clustered errors.' Working Paper 1315, Queen's University, Department of Economics, August

Figure 1: Rejection Frequencies and Number of Simulations. [Plot of rejection rate (0.00 to 0.13) against S for two curves: the rejection frequency based on R/S and the rejection frequency based on (R + 1)/(S + 1).]
Figure 2: Rejection Frequencies for Coefficient Randomization Inference. [Plot of rejection rate (0.00 to 0.24) against G1 = 1, …, 10 for three treatment assignments: random treated, smallest treated, and largest treated.]

Figure 3: Rejection Frequencies for t Statistic Randomization Inference. [Plot of rejection rate (0.00 to 0.24) against G1 = 1, …, 10 for the same three treatment assignments.]
Appendix: Cluster Sizes

Observations were assigned to clusters by using the following formula:

    Ng = [ N exp(γg/G) / Σ_{j=1}^G exp(γj/G) ],      (9)

for g = 1, …, G − 1, where γ ≥ 0 is a parameter that determines how uneven the cluster sizes are, and [·] denotes the integer part. The value of NG was chosen to ensure that Σ_{g=1}^G Ng = N. When γ = 0 and N/G is an integer, equation (9) implies that Ng = N/G for all g. As γ increases, however, cluster sizes vary more and more. For the experiments with 5000 observations and γ = 2, the sizes of the 50 clusters are: 31, 33, 34, 36, 37, 39, 40, 42, 43, 45, 47, 49, 51, 53, 55, 58, 60, 63, 65, 68, 71, 73, 76, 80, 83, 86, 90, 94, 97, 101, 106, 110, 114, 119, 124, 129, 134, 140, 146, 151, 158, 164, 171, 178, 185, 193, 201, 209, 217, 251.
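Equation (9) can be reproduced with a short script. The floor (integer-part) reading of the brackets is our assumption, but it matches the cluster sizes listed above for N = 5000, G = 50, and γ = 2; the function name `cluster_sizes` is hypothetical.

```python
import math

def cluster_sizes(N, G, gamma):
    """Cluster sizes from equation (9): the first G-1 sizes are the integer
    part of N * exp(gamma*g/G) / sum_j exp(gamma*j/G); the last cluster
    absorbs the remainder so that the sizes add up to N."""
    denom = sum(math.exp(gamma * j / G) for j in range(1, G + 1))
    sizes = [int(N * math.exp(gamma * g / G) / denom) for g in range(1, G)]
    sizes.append(N - sum(sizes))  # N_G chosen so that the total is N
    return sizes

sizes = cluster_sizes(5000, 50, 2.0)
# Reproduces the appendix list: starts 31, 33, 34, ... and ends with 251.
```

With γ = 0 the sizes are all N/G, and larger γ makes the allocation increasingly uneven, as described above.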