
Limitless Regression Discontinuity: Causal Inference for
a Population Surrounding a Threshold
Adam Sales
Ben Hansen∗
Department of Statistics, University of Michigan
March 24, 2014
1 Introduction
Randomization of treatment assignment is the gold-standard of causal inference in statistics:
it is the only sure way to control all confounding. Of the options available when experiments
are impossible, the “regression discontinuity design” (RDD) (Thistlethwaite and Campbell,
1960; Cook, 2008; Imbens and Lemieux, 2008; Lee and Lemieux, 2010) is among the most
credible. In an RDD, the treatment assignment mechanism is known: each subject i has a
“running variable” Ri , and those subjects whose Ri is greater (or less) than a pre-determined
constant c are assigned to treatment. Lee (2008) argued that the RDD features “local
randomization” of treatment assignment, and is therefore “a highly credible and transparent
way of estimating program effects” (Lee and Lemieux, 2010, p. 282).
∗
The authors thank Susan Dynarski, Rocío Titiunik, Matias Cattaneo, Walter Mebane, Jeff Smith, Kerby
Shedden, Guido Imbens, and the participants in the University of Michigan Causal Inference in Education
Research Seminar for helpful suggestions.
Lee’s notion of local randomization is that, in a well-behaved RDD, one can recover the
important advantages of randomized experiments—specifically, no confounding, unbiased estimation of average treatment effects (ATE) and covariate balance—by replacing statements
about samples or populations with statements about limits. For instance, in a randomized
experiment a covariate X has the same distribution regardless of treatment assignment. In
an RDD, the distribution of X, as a function of R, is continuous at the cutoff c. That is, the
limits of X’s distribution, as R approaches c from both sides, are identical. The conventional
approach to RDDs (e.g., Berk and Rauma, 1983; Angrist and Lavy, 1999; Oreopoulos, 2006),
consistent with this “limit” understanding, uses regression to estimate the average functional
relationship between the outcomes Y , the running variable R and treatment assignment Z.
However, formulating local randomization in terms of limits fails to realize some other
benefits of experiments. In particular, randomized experiments allow researchers to estimate
treatment effects for specific samples or subpopulations. In contradistinction, conventional
RDD analysis estimates an ambiguously weighted population ATE (as in Lee, 2008), or the
limit of sample ATEs as subpopulations shrink around a cutoff (as in Hahn et al., 2001). Similarly, randomized experiments allow analysts access to distribution-free inferential methods,
equally valid for large or small samples (Fisher, 1935; Pitman, 1937; Rosenbaum, 2002b).
These assets would accompany a more literal interpretation of local randomization: that, in
a window around c, treatment assignment is random or ignorable (cf. Rubin, 1978b).
This paper will suggest a method of modeling RDDs that realizes these benefits while
also retaining some of the flavor of the conventional approach, using regression techniques
to disentangle Y and R. Doing so allows the use of covariate adjustment not only to make
estimates more precise but also to expand the notion of “local.” That is, it estimates the
treatment effect in a broader sample. Similarly, it suggests testable consequences of the
model that can be wholly separate from estimation of treatment effects. In some scenarios,
it allows researchers to use those consequences to make improvements to the model that
would be incompatible with the conventional, limit-focused approach.
Several studies have compared results from RDD analyses to results from experiments
(Black et al. 2005; Green et al. 2009; Gleason et al. 2012; Moss et al. 2013; Cook and Wong
2008 and Shadish et al. 2011 provide summaries of several such studies, as well as new results;
Hallberg et al. 2013 conceptually compares RDDs to randomized experiments). Their results
are encouraging, in that RDD estimates tend to replicate those from experiments. However,
these papers also find that RDD estimates can be sensitive to modeling choices and data
configurations. This paper aims to clarify some of these choices. It explicitly separates, and
explicates the roles of, the two modeling components of an RDD analysis: the regression of
the outcome on the running variable R, and the probability model relating Z to the outcome
and to covariates.
Our case study is the regression discontinuity design found in Lindo, Sanders, and Oreopoulos (2010) (hereafter LSO). In many colleges and universities, struggling students are
put on “academic probation” (AP); the school administration monitors these students and
devotes additional resources to them. In addition, if their grade-point-averages (GPAs) fail
to improve, they are subject to suspension. But does AP actually help these students?
What is the effect of AP on students’ subsequent GPAs? LSO realized that at a certain
large Canadian university, AP status was a function of students’ GPAs: students with GPAs
below 1.5 or 1.6, depending on the campus, were put on AP.
One implementation of Lee’s local randomization heuristic, which is simpler but more
limiting than the method developed in this paper, posits that there is a random component
to each student’s GPA each semester. Therefore, for a subset of students whose first-semester
GPAs fell close to the cutoff, AP was effectively assigned randomly. Based on this heuristic,
some researchers (e.g., Black et al., 2005; Cattaneo et al., 2014; Li et al., 2013)1 argue that
analysts may use randomization inference techniques to estimate and infer average treatment
1
Black et al. (2005) refers to its simple experimental approach as a “Wald estimator.”
effects for this small group of students.
Alternatively, the conventional approach, which Lee (2008) recommends, argues that one
would expect a semester’s GPA to rise, on average, with the previous (first) year’s GPA.
One could regress next semester’s GPA on the first year’s, and expect the estimated slope
to be positive. If AP had an effect, one would expect this relationship to jump downward
at the point where the first year’s GPA equals the AP cutoff. In other words, the regression
line should jump down at the cutoff; by measuring the size of the discontinuity, researchers
can quantify the effect of the treatment on the outcome of interest.
This paper proposes a dual approach, combining randomization inference and regression
adjustment. It also suggests a procedure for using pre-treatment covariates to choose the
region around the cutoff to study. If substantive context suggests a region, this procedure
will test its validity.
The question of selecting and verifying a region in which local randomization takes place
is related to a prominent discussion in the econometric literature surrounding RD. One flaw
of the conventional RD methodology is that a linear model might not accurately account
for the relationship between R and Y —when either R < c or R > c. The RD literature
has discussed this issue at length; see, in particular, Lee and Card (2008) and Imbens and
Lemieux (2008). One common solution is to model the relationship with a polynomial (e.g., Oreopoulos, 2006). This solution has the disadvantage of allowing observations with large R
values, which are generally far from the cutoff, to disproportionately influence the model fit.
Hahn et al. (2001) suggest a different approach: fitting a kernel-based “local linear regression”
to the data, that weights points closer to the cutoff higher than points far away. In fact,
Black et al. (2005) find that when restricting the data for RDD estimation to a small window
around the cutoff, RDD estimates become substantially more robust. Bandwidth selection is
an area of active research (DesJardins and McCall, 2008; Imbens and Kalyanaraman, 2012;
Cattaneo et al., 2014).
The following section will discuss, in more detail, two alternative approaches to RDDs:
regression analysis and randomization inference. Section three will introduce a novel approach to RDDs that combines features of the two methods in the previous section. Section
four will demonstrate the new approach on the LSO dataset, and section five will conclude.
1.1 Notation: the Rubin Causal Model
In a randomized experiment with a binary treatment, let Zi ∈ {0, 1} be a random variable
coding subject i’s treatment status (Z = 1 signifies treatment). Let Y represent the outcome
of interest. Then each subject has two (possibly random) values: YCi is subject i’s outcome
if subject i is not treated and YT i is subject i’s outcome if subject i is treated (Rubin (1974);
Splawa-Neyman et al. (1990)). For each i, only one of these two values is observed, dependent
on Zi ; subject i’s observed outcome is Yi = Zi YT i + (1 − Zi )YCi . The values YT i and YCi
are called “counterfactuals” or “potential outcomes.” This notation implicitly assumes non-interference (Cox, 1958) (part of the “stable unit treatment value assumption” in Rubin 1978a): if i ≠ j, then subject i’s treatment does not affect subject j’s response, or Yj ⊥⊥ Zi ∀ j ≠ i.
In addition, each subject i has a vector of pre-treatment covariates Xi .
2 Two Existing RDD Models, and a Third Way
2.1 Randomization-Based RDD Analysis
The simplest way to motivate permutation analysis of RDDs is to assume that in a small
window around the cutoff, c ± b, subjects’ assignment to treatment and control conditions
is effectively at random, as good as if subjects within that window had participated in
a randomized trial. This is roughly the approach discussed in Cattaneo et al. (2014); a
Bayesian perspective may be found in Li et al. (2013).
Formally, for some b > 0,
Assumption 0. (YCi, YTi, Xi) ⊥⊥ Zi for all i with Ri ∈ c ± b.
Of course, a method is required for determining the region c ± b for which this is true.
When there are pre-treatment covariates available, they may be useful in testing Assumption
0 for an interval-width b. Assumption 0 posits that any particular pre-treatment covariate
X is independent of treatment assignment Z; therefore, the sample means of the covariates
in the control and treatment groups,
X̄C = Σi (1 − Zi)Xi / Σi (1 − Zi),   X̄T = Σi Zi Xi / Σi Zi,   (1)
are equal in expectation—that is, balanced. As Cattaneo et al. (2014) suggest, one can test this implication for a particular b or for several possible choices of b, and choose a b in which Assumption 0 seems plausible. This can also be done simultaneously with several pre-treatment covariates Xk, k = 1, ..., p. We will propose a variant of this approach below.
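As a concrete illustration, such a window-specific balance check can be carried out with a short permutation routine. The sketch below is not code from this paper or from Cattaneo et al. (2014); the names r, x, c, and b are placeholders for the running variable, one covariate, the cutoff, and a candidate half-width.

```python
# Minimal sketch (placeholder names; not the authors' code): subjects with
# |R - c| <= b are kept, treatment is taken to be R > c, and the observed
# difference in covariate means is compared with its permutation distribution.
import numpy as np

def balance_pvalue(r, x, c, b, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    keep = np.abs(r - c) <= b                 # restrict to the window c +/- b
    z, xw = (r[keep] > c).astype(int), x[keep]
    obs = xw[z == 1].mean() - xw[z == 0].mean()
    draws = np.empty(n_perm)
    for j in range(n_perm):
        zp = rng.permutation(z)               # re-randomize assignment labels
        draws[j] = xw[zp == 1].mean() - xw[zp == 0].mean()
    return np.mean(np.abs(draws) >= np.abs(obs))

# One might scan several candidate widths and retain those with plausible balance:
# for b in (0.05, 0.1, 0.2, 0.3):
#     print(b, balance_pvalue(r, x, c, b))
```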
If the potential outcomes YC and YT and covariates X are correlated with the running
variable R, only data in very small windows will exhibit covariate balance. Depending on
the criterion for balance, in some cases balance might not be achievable at all. For instance,
using a multivariate permutation test (Hansen and Bowers, 2007) to check balance on eight
pre-treatment covariates available in the LSO dataset,2 the highest p-value achievable was
p = 0.38, at a width of b = 0.17; for only slightly higher b = 0.18 the p-value has decreased
to below 1/1000. If balance is achievable, it may only be at the expense of sacrificing the
vast majority of the data.
More broadly, the relationship between Y and R is a central part of the RDD story. In
2
Briefly, high-school GPA, age-at-entry, total first-year credits taken and dummy variables for campus,
sex, North-American origin and English-speaking.
some lucky scenarios, the relationship may be flat (a slope of zero) in a region around c.
Otherwise, researchers must either sacrifice power, choosing a region of analysis that does
not contain enough data to detect a relationship, or choose a larger region and incur bias by
omitting the R-Y relationship from the model.
2.2 OLS-Based RDD Analysis
The OLS approach, on the other hand, models the relationship between Y and R. The
conventional linear regression formulation of an RDD is:
Y = β0 + β1 R + β2 1[R>c] + β3 R · 1[R>c] + ε   (2)
where Y is the outcome of interest, c is the cutoff, and ε is a mean-0 error (Imbens and
Lemieux, 2008). This analytical approach is illustrated in Figure 1. The estimand, in the
conventional approach, is
lim_{R→c+} E[Y | R] − lim_{R→c−} E[Y | R]   (3)
This can be interpreted as the effect at R = c (a measure-0 event if R is continuous, as the
OLS approach requires), a limit of sample ATEs for regions of analysis collapsing around c,
or as a weighted population ATE, where the weights decrease with the distance of R from c.
None of these maps the estimand onto a treatment effect for an easily identifiable population.
Imbens and Lemieux’s (2008) analysis conditions on R, as is common in the RD literature (e.g., Van der Klaauw, 2008; Imbens and Zajonc, 2011; Mealli and Rampichini, 2012).
In modeling the outcomes, but not treatment status, as random, the conventional approach
to RDDs relies on hyperpopulation models. These, in turn, rely on large samples or distributional assumptions. Furthermore, its construal of the estimation target as a limit hinders
interpretability.
In contrast, our approach combines regression and randomization-inference, in the process
Figure 1: The RDD from LSO. The first-year GPAs were shifted so that the cutoff is at zero—
that is, each campus’ cutoff was subtracted from its students’ first-year GPAs. Subsequent
GPA was averaged according to first-year GPA (this causes residual variances to appear
inflated towards the edges of the graph; however, the appearance of heteroskedasticity is
only because there are few subjects with extreme first-semester-GPAs). The size of each
data point in the figure is proportional to the number of students at that first-year GPA.
The lines are least-squares linear fits: the red line to the treatment group and the blue line
to the control group. The distance between the lines at the cutoff, along the dotted line, is
the classic RD estimate of the “Local Average Treatment Effect.”
lifting both limitations of the conventional approach. The method involves four steps. First,
posit a causal hypothesis determining unit-level treatment effects. From it and the data
reconstruct Yc , the vector of potential responses to control. Second, regress Yc on R, leaving
residuals Y̌C . Third, test the validity of the method’s assumptions, which will be spelled out
shortly, as applied to subjects within a specified window of the cutpoint. Finally, test the
causal hypothesis from the first step using permutational methods. Testing suitable arrays
of causal hypotheses leads to confidence intervals and Hodges-Lehmann point estimates.
Technical details follow.
3 Transforming RDDs into Approximate Randomized Experiments
The first step is to specify a causal hypothesis and reconstruct YC from the data. Fisher’s
strict null hypothesis states that YCi = YT i for all subjects i, entailing that YC is simply the
observed Y ; if the posited hypothesis is that treatment adds τ0 to each treated subject, then
relative to it one has YC = Z(Y − τ0 ) + (1 − Z)Y .
The second step is to disentangle YC from R. Our case study uses simple least squares
(SLS) linear regression for this purpose, but most any form of regression is compatible with
the overall method. The results to carry forward from this regression are the residuals,
Y̌C ≡ Y − R(R′R)⁻¹R′Y,   (4)
where R is the matrix formed by joining a column of 1’s to the column of R.
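A minimal sketch of this residualization step follows; y_c and r are placeholder arrays, and the function simply implements the least-squares projection in equation (4) rather than reproducing the authors' code.

```python
# Regress the reconstructed control potential outcomes on the running variable
# by least squares and keep the residuals, as in equation (4).
import numpy as np

def residualize(y_c, r):
    design = np.column_stack([np.ones_like(r), r])         # columns [1, R]
    coef, *_ = np.linalg.lstsq(design, y_c, rcond=None)    # (R'R)^{-1} R'Y
    return y_c - design @ coef                             # residuals Y-check

# Under a constant-effect hypothesis tau0, one would first set y_c = y - tau0 * z
# and then call residualize(y_c, r).
```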
The hypothesis about treatment effects will be tested relative to
Assumption (Transformed Ignorability). For all subjects i with Ri ∈ c ± b, the distribution of Ri is not degenerate and R and Y̌C are independent:
R ⊥⊥ Y̌C   (5)
(For simplicity of notation, in the remainder of the paper we will denote hypothetical Y̌C
values as Y̌ .) Transformed ignorability allows researchers to treat RDDs, in a sense, as randomized experiments. Specifically, with R random and Z = 1[R>c] independent of Y̌ , every
combination of R and Y̌ is equally likely under the null hypothesis. This, in turn, implies
the characteristics that recommend randomized experiments: there are no confounders, and
the randomization distribution is known.
For transformed ignorability to be plausible, the relationship between YC and R must
be approximately linear in the region R ∈ c ± b. If YC is modeled as random, and E[YC |R]
is differentiable at R = c, then Taylor’s theorem implies that this is indeed true for some
region around c: the relationship is approximately linear.
Transformed ignorability is not directly testable, but one can test closely related assumptions involving pre-treatment covariates. In classical randomized experiments, treatment assignment is independent of pre-treatment covariates. This implies that the means
of pre-treatment covariates are balanced, in expectation, between the treatment and control
groups. The analogy with experiments suggests a companion assumption for transformed
ignorability:
Assumption (Transformed Ignorability for Covariates).
R ⊥⊥ X̌,   (6)
where X̌ denotes the residuals of a regression of X on R, and is analogous to Y̌ . To test
transformed ignorability for covariates, researchers can test the hypothesis that E[X̌|Z =
1] = E[X̌|Z = 0].
Cattaneo et al. (2014) also suggested using covariate balance as a criterion for selecting
or verifying a region of interest. However, their balance-testing procedure did not adjust
covariates for a relationship with R. In some applications—including, as we have seen,
LSO—their approach will yield very small windows. Adjusting covariates and outcomes
for R can, in some circumstances, allow researchers to use a much larger portion of their
datasets.
3.1 Inference and Estimation Under Transformed Ignorability
Transformed ignorability implies a test of the strict null hypothesis of no treatment effect
H0 : YCi = YT i for all i. Since every combination of Y̌ and Z is equally probable under the
null hypothesis, a permutation test of H0 would achieve the nominal level. In particular, the
difference in the means of Y̌ between the treatment (Z = 1) and control (Z = 0) groups could
be computed for every possible permutation of Z. For a given α, if the achieved difference
in means is greater in magnitude than 1 − α proportion of these, then the null hypothesis
is rejected at level α. Formally, if D(Z∗) is the difference in means for a given treatment-assignment vector Z∗, Z is the achieved treatment assignment vector, and nt = Σi Zi, the number of treated subjects, then the p-value testing H0 is
p = #{Z∗ : |D(Z∗)| > |D(Z)|} / C(n, nt),   (7)
where C(n, nt) denotes the binomial coefficient, the number of possible assignment vectors with nt treated subjects.
In practice, researchers can estimate p using Monte Carlo methods or based on a suitable
normal approximation, when n is large enough (Good, 2000).
The difference-in-means test statistic is the most straightforward, and emerges directly
from the comparison with randomized controlled trials; however, others are also possible.
One example is the F-statistic from the following regression:
Y̌ = β0 + βR R + βZ Z + ε   (8)
If model (8) is a significantly better fit than a grand-mean model Y̌ = β0 + ε, then it may be concluded that the null hypothesis is implausible. Simulation results based on the LSO data suggest that this statistic may be more powerful than the difference in means. Analysts can determine the randomization distribution of the F-statistic by enumerating the permutations of R and Y̌, with Monte Carlo simulations, or with a normal approximation (Pitman, 1937).
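A sketch of this F statistic and its permutation distribution follows; y_check, r, and z are placeholder arrays for the residuals, the running variable, and the treatment indicator within the window of analysis (an illustration, not the authors' code).

```python
# F statistic comparing model (8) to a grand-mean model, and a Monte Carlo
# approximation to its randomization distribution obtained by permuting the
# residuals against the (R, Z) design.
import numpy as np

def f_stat(y_check, r, z):
    n = len(y_check)
    design = np.column_stack([np.ones(n), r, z])
    beta, *_ = np.linalg.lstsq(design, y_check, rcond=None)
    rss1 = np.sum((y_check - design @ beta) ** 2)           # model (8)
    rss0 = np.sum((y_check - y_check.mean()) ** 2)          # grand-mean model
    return ((rss0 - rss1) / 2) / (rss1 / (n - 3))

def f_perm_pvalue(y_check, r, z, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = f_stat(y_check, r, z)
    draws = np.array([f_stat(rng.permutation(y_check), r, z) for _ in range(n_perm)])
    return np.mean(draws >= obs)
```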
One approach to estimating the magnitude of the treatment effect assumes a constant
additive treatment effect, YT i − YCi = τ for all i. Under this assumption, for a hypothetical
treatment effect τ0 , YCi is equal to Yi − τ0 Zi : τ0 is subtracted from each treated subject’s
outcome. Researchers may perform the same permutation-based hypothesis test for these
hypothetical values of YCi . They would regress YC on R, extract the residuals, and compute
the F-statistic for model (8) for the new values Y̌τ (see Rosenbaum, 2002a). Then, they
could invert this hypothesis test by testing a range of values for τ . Those values that are not
rejected at level α form a 1 − α confidence region for τ. Similarly, one can use the Hodges-Lehmann technique (Rosenbaum, 1993) to arrive at a point estimate for τ. The estimate for
τ from this procedure is identical to the coefficient of Z from a regression of Y on R and Z
(Rosenbaum, 2002a).
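The test inversion can be sketched by looping over a grid of hypothesized constant effects; this reuses the residualize and f_perm_pvalue sketches above, and y, r, z, and the grid of τ values are placeholders rather than the authors' implementation.

```python
# Invert the permutation test over a grid of constant additive effects tau0.
# Values of tau0 that are not rejected form a confidence region; the tau0 with
# the largest p-value is a Hodges-Lehmann-style point estimate.
import numpy as np

def tau_pvalues(y, r, z, taus, **kw):
    pvals = []
    for tau0 in taus:
        y_c = y - tau0 * z                  # reconstruct Y_C under the hypothesis
        pvals.append(f_perm_pvalue(residualize(y_c, r), r, z, **kw))
    return np.array(pvals)

# taus = np.linspace(-0.5, 0.8, 66)
# p = tau_pvalues(y, r, z, taus)
# ci95 = taus[p > 0.05]                     # hypotheses not rejected at the 5% level
# hl_point = taus[np.argmax(p)]             # Hodges-Lehmann-style estimate
```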
3.2 Testing Transformed Ignorability
The test of transformed ignorability for a single covariate is the same as the test for transformed ignorability in Y . That is, for a covariate X, transform X into X̌ and check that
X̌ is independent of R, using difference in means or an F-test. This is equivalent to testing
for a “treatment effect” of Z on X. Since X was measured prior to treatment, detecting a
treatment effect on X would be disconcerting.
If several pre-treatment covariates are available, testing transformed ignorability for each
covariate separately could lead to problems of multiple testing, and overly-conservative specification tests. One solution to this problem is to combine information from the covariates
into an omnibus test statistic (e.g., Hansen and Bowers, 2007). The appropriate null hypothesis to test is that there is no treatment effect on any of the pre-treatment covariates. One
omnibus test statistic is the maximum of the individual F statistics from testing transformed
ignorability for each of the covariates (see Westfall and Young, 1993). Another possibility
for the test statistic is the sum of the individual F statistics (this is somewhat similar to
Hotelling’s T² statistic (Hotelling, 1931)). Formally, if k = 1, ..., p indexes p pre-treatment covariates, and Fk is the F-statistic testing no treatment effect for covariate k, then let
Ftot = Σ_{k=1}^{p} Fk   (9)
be the test statistic for overall covariate balance.
As in the case of Y or an individual covariate, the randomization distribution for Ftot can be calculated by enumerating all possible permutations of R or approximated with Monte Carlo methods.
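A sketch of the omnibus statistic (9) and its permutation distribution, building on the earlier sketches; X is a placeholder (n × p) covariate matrix for subjects in the window, and r and z are as before.

```python
# Omnibus balance statistic: sum of per-covariate F statistics, with the
# randomization distribution approximated by permuting the rows of (R, Z)
# against the covariates.
import numpy as np

def f_tot(X, r, z):
    return sum(f_stat(residualize(X[:, k], r), r, z) for k in range(X.shape[1]))

def f_tot_pvalue(X, r, z, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = f_tot(X, r, z)
    draws = np.empty(n_perm)
    for j in range(n_perm):
        perm = rng.permutation(len(z))       # permute rows of (r, z) jointly
        draws[j] = f_tot(X, r[perm], z[perm])
    return np.mean(draws >= obs)
```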
4 RDD Analysis in the LSO Data
One of the outcomes that LSO measures is nextGPA, students’ subsequent GPAs, either for the summer or fall term after students’ first years. Their causal question of interest is whether AP causes a change, on average, in nextGPA: do students on AP tend to have
higher (or lower) subsequent GPAs? Recall that AP is determined almost exclusively3 by
first-year cumulative GPA, which in this case is the running variable R. That is, students
with GPAs below the cutoff are “treated” with AP, and students above the cutoff are in
the control group. A relevant region in which to estimate a treatment effect is within 0.3
grade-points of the cutoff c. Conventionally, 0.3 represents the difference in grade points
between a C, say, and a C-, or any other grade half-step.
First, we use a covariate, or several covariates, to test transformed ignorability. One
suggestive covariate that appears in LSO’s data is hs_gpa, each student’s high school grade-point-average.
3
There are 48 cases, out of a total of 44,362, where students were not put on AP despite GPAs below
the cutoff, and three cases in which students were put on AP despite having GPAs above the cutoff. In
this paper, as in LSO, we will follow the “intent to treat” principle and estimate the effect of treatment
assignment—that is, having a GPA below the cutoff—instead of actual treatment (AP). Since the number
of cases in which these two disagree is such a small proportion of the total, we anticipate that this will have
a minimal impact on our estimates.
Figure 2: A smoothed plot of hs_gpa as a function of dist_from_cut.
Figure 2(a) plots hs_gpa as a function of students’ first-year GPAs, the running variable, and it seems to have an approximately linear relationship in a neighborhood around c. Linear variation in hs_gpa explains about a third of the variation in students’ first-year GPAs (ρ = 0.6) and about a quarter of the variation in their subsequent GPAs (ρ = 0.48).
Since hs_gpa is coded with values that fall roughly uniformly between 0 and 100, we use a logit function to transform them, hs_gpa → log[0.01 · hs_gpa / (1 − 0.01 · hs_gpa)], to improve the performance of SLS.
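The recoding is a one-line transform; a sketch follows, where hs_gpa is a placeholder array of grades on the 0–100 scale (values of exactly 0 or 100 would need separate handling).

```python
# Logit recoding of hs_gpa, as described above.
import numpy as np

def logit_hs(hs_gpa):
    p = 0.01 * np.asarray(hs_gpa, dtype=float)   # rescale 0-100 grades to (0, 1)
    return np.log(p / (1 - p))                   # log[0.01*hs_gpa / (1 - 0.01*hs_gpa)]
```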
We begin by estimating the regression of hs_gpa on dist_from_cut, and extracting the residuals, denoted ȟs_gpa. Next, we can test the null hypothesis that ȟs_gpa is balanced between treatment and control, using the F-test described above. The permutational p-value for that null hypothesis is 0.15, which means that ȟs_gpa is more balanced than it would have been in about 15% of hypothetical randomized experiments.
We can also test balance of several covariates simultaneously. LSO exploits seven pre-treatment variables to test the RDD assumptions in the AP case: hs_gpa, the total number
of credits attempted in students’ first year and students’ age at college entry, which are
continuous, and dummy variables for campus (the university has three campuses), whether
students’ first language is English, and whether students were born in North America. Using
equation (9) to test transformed ignorability for all of these covariates yields a permutational
p-value of 0.6.
4.1 Sorting Around the Cutoff: The McCrary Density Test
If there are subjects who can control their running variable values with enough exactitude
to determine their respective treatment statuses, then treatment assignment is not, in any
meaningful way, random. For instance, suppose some students predicted that their GPAs
would fall below the cutoff and were able to drop a difficult class to avoid AP. On the
other hand, their observationally equivalent peers were unable to predict their GPAs or
surgically drop classes, and hence were put on AP. If this was the case, then AP students
were, on average, less savvy than non-AP students: a difference in unobserved pre-treatment
characteristics that invalidates the randomization assumption. For this reason, LSO restricts
its analysis to first-year students. Since they are new to the system, they are less likely to
know to manipulate their GPAs and avoid AP. However, this assumption may be false, and
hence does not entirely protect the LSO data from the threat of sorting.
McCrary (2008) suggested a test that, in some circumstances, will detect sorting around
an RDD cutoff. The intuition is that if a sufficient number of subjects are manipulating
their way into, or out of, treatment, the number of subjects who fall on one side of the cutoff
will significantly exceed the number on the other side. In the LSO case, if enough students
were manipulating their GPAs to fall just above the AP cutoff, then we would detect more
students just above the cutoff than just below.
More formally, the McCrary procedure tests an assumption of the conventional regression-discontinuity model: that the density of R is continuous at the cutoff c. This assumption is
not necessary for our approach. In fact, in the LSO dataset, R is discrete, not continuous,
since GPAs are rounded to the nearest hundredth point (and the class-specific grade-points
themselves only take discrete values, corresponding to grades F through A or A+). However,
the intuition behind the McCrary test still holds: an unexpectedly high number of students
with first-year GPAs to the right of the cutoff would raise suspicions of sorting. For the
purpose of implementing the McCrary test we may model R as a discretized version of a
continuous variable that measures students’ average performance across classes, on a scale
from 0–4. Then, a discontinuity in the density may indicate a sorting threat.
The McCrary procedure begins by discretizing R. That is, it divides the support of
R into equally-sized bins and counts the number of subjects with R values in each one of
the bins. Next, the procedure uses local linear regression to test for a discontinuity in the
distribution of R. In the LSO data, this involves picking a bin size such that the McCrary
bins line up exactly with the discrete values of R. Doing so yields a concerning result: the
resulting p-value is 6.2 × 10⁻⁵.
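A crude stand-in for this check can be sketched as follows; it is not McCrary's local-linear density estimator, only the same intuition: count subjects at each discrete R value, fit a line to the counts on each side of the cutoff, and compare the two fits' predictions at the cutoff. The array r is a placeholder for centered first-year GPAs.

```python
# Rough density-jump diagnostic (not the McCrary estimator): compare the
# left-side and right-side linear fits to the per-value counts at R = 0.
import numpy as np

def density_jump(r, window=0.3):
    vals, counts = np.unique(np.round(r, 2), return_counts=True)
    keep = np.abs(vals) <= window
    v, n = vals[keep], counts[keep]
    slope_l, at_cut_l = np.polyfit(v[v < 0], n[v < 0], 1)    # fit on the left
    slope_r, at_cut_r = np.polyfit(v[v >= 0], n[v >= 0], 1)  # fit on the right
    return at_cut_r - at_cut_l       # predicted count gap at the cutoff

# A large positive gap (many more students just at or above the cutoff than the
# left-side trend predicts) is the pattern that would raise suspicion of sorting.
```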
Figure 3 may suggest an explanation of the McCrary result. Figure 3(a) displays the
number of students at each possible value of R—the data for the McCrary test—with a dotted
line at the cutoff, R = 0. The number of students for whom R = 0 seems impressively large;
these students just barely avoided probation. Figure 3(b) shows how many students, at each
possible value of R, attempt 4, 4.5, or 5 credits. This represents 93% of students in our
region of analysis. Apparently, the excessive number of students at R = 0 is a result of
an abnormally large number of students with R = 0 attempting only four credits. Indeed,
for very small regions around the cutoff, balance tests of the average number of attempted
credits tend to reject. Considering students with first-year GPAs within b = 0.03 of the
cutoff, for instance, a test of balance on the transformed number of attempted credits yields
a p-value of 0.03. These results are consistent with a particular form of sorting: students
will drop a course in which they are performing poorly, in order to barely avoid AP.
Figure 3: Diagnostics for sorting: (a) is the standard McCrary figure: the number of students
who achieve each possible first-year GPA. The running-variable, first-year GPA, is centered
at the cutoff, which is denoted with a vertical dotted line. (b) illustrates the number of credits
students attempt in their first year. (Discrete points are replaced by lines, for clarity.) The
vast majority of students (93%) take 4, 4.5, or 5 credits, and this plot shows how many
students at each value of R—first-year GPA—attempt each of those. Note the large spike
for 4 credits at R = 0. In both plots, yellow vertical lines mark R = −0.1, −0.12, −0.3, and −0.32,
the values for R which, we hypothesize, sorting students would have achieved, had they not
sorted.
4.1.1 Fixing a Broken RDD
Can the LSO RDD be salvaged? The McCrary test failure, coupled with local imbalance of
attempted credits, suggests that a cadre of students may have manipulated their way out of
AP. One attractive solution to this problem is to simply remove this cadre—students with
R = 0—from the dataset. Doing so removes students who would have been on AP, but
manipulated their way out.
This, however, may only be a partial solution. It remains the case that were savviness measured, it would predict treatment status: being less savvy increases students’ probabilities of being put on AP. This is because the manipulators’ peers, who could have but did
not avoid AP, remain in the dataset, and on AP. We assume that the sorting students with
R = 0 attempting four credits were previously taking five credits. Then to identify the
non-sorters we envision two scenarios. First, assume the sorters dropped a class they would
have otherwise failed. Then their centered first-year GPAs would have been -0.3 or -0.32,
depending on campus.4 Second, perhaps the sorters dropped a class in which they would
have scored a D—worth one grade point. Then their centered GPAs would have been -0.1 or
-0.12, depending on campus. Therefore, the peers most similar to the sorting students who
nevertheless did not manipulate their GPAs, have R values of -0.3, -0.32, -0.1, or -0.12.
Yellow vertical lines mark these values in Figures 3(a) and (b). If the excessive number of students at R = 0 is a result of students dropping classes, and moving from these R values to R = 0, one would expect to see relatively few students at R = -0.1, -0.12, -0.3 and
-0.32, and particularly few students attempting five credits at these R values. However, the
opposite seems to be the case: these R values seem to correspond with relatively (though not
excessively) high numbers of students attempting five credits. It may be that the observed
pattern reflects sorting of a more complex nature, not entirely explained by the data.
That being the case, there are several possible solutions to the McCrary failure. One
could remove all students with R = 0, or all students who took four credits, or all students
with R = 0 who took four credits. Additionally, it may be reasonable to remove students at
R = -0.1 and -0.3, or -0.12 and -0.32, depending on campus, or all students attempting five
credits in their first year. Removing subsets of students necessarily changes the interpretation
of the resulting effect estimate: the RDD analysis estimates the effect for the students in the
sample. After removing students with R = 0, for instance, the analysis estimates the effect
for the students in the sample for whom c − b ≤ R ≤ c + b and R ≠ 0.
A relevant criterion for evaluating each of these options is whether it resolves the McCrary
4
Recall, the large university at which this study takes place has three campuses; at campuses one and
two, the GPA cutoff for AP is 1.5; at campus three it is 1.6.
test: that is, whether, after removing the possibly-sorting students, the test continues to reject the null hypothesis of a continuous running variable. Also, does removing these students
lead to better balance on the number of year-one credits students attempt? Testing several
hypotheses under different scenarios raises a concern of multiple comparisons, and threatens
the validity of the resulting p-values. One way around the multiple testing problem is to
temporarily divide the data: a random sample of 25% of the data will serve as an exploratory
sample—in which we will test however many hypotheses seem feasible—and the remaining
75% will be used to, more definitively, test the resulting configuration.
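A sketch of this split (the sample size here is the total count reported in footnote 3 and stands in for whatever subset is actually analyzed; not the authors' code):

```python
# Randomly assign 25% of students to an exploratory sample and reserve the rest
# for confirmatory tests of whatever specification the exploration settles on.
import numpy as np

rng = np.random.default_rng(0)
n_students = 44_362                       # total sample size reported in footnote 3
explore = rng.random(n_students) < 0.25   # boolean mask for the exploratory 25%
confirm = ~explore                        # remaining 75%, held out for confirmation
```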
In the exploratory 25% sample, the McCrary p-value is still quite low—p = 0.0066.
Additionally, for students close to the cutoff, the transformed number of attempted credits
is imbalanced: p = 0.02, for students with R ≤ 0.04, for instance.
Removing students with R = 0 greatly improves the picture. The McCrary p-value,
testing for discontinuity in R’s distribution around the cutoff, is 0.19. Additionally, the p-values testing transformed ignorability for the number of credits students take are consistently higher (for instance, when b = 0.04, the p-value is 0.64).
Also removing students with R = -0.1 or -0.12 and -0.3 or -0.32, depending on campus,
does not yield as impressive an improvement: the McCrary p-value for this dataset is p =
0.03. A more surgical procedure, removing only those students at the cutoff who attempted
four credits, and those at -0.3 and -0.1 taking five, also fails to yield encouraging results. The
McCrary p-value is 0.083 and p-values testing transformed ignorability for the number of
attempted first-year credits tend to be much lower. Removing students based on the number
of credits they attempted apparently induced imbalance.
Moving to the remaining 75% of the sample, after removing students with R = 0, tests
for transformed ignorability for first-year credits yield large p-values for a wide range of
bs. The McCrary density test yields a p-value of 0.339. Removing students with R =
0—including many students who may have sorted around the cutoff—vastly improves the
empirical plausibility of the RDD model. It was not possible, however, to identify students
in the remainder of the sample who could have sorted, but did not—the counterparts of the
sorters we removed. Hopefully, they are sufficiently dispersed among the AP group so as to
not significantly harm the RDD estimate.
To be sure, removing certain subsets of students is an ad hoc solution to a (potentially)
serious problem. RDD purists may be tempted to give up on the LSO dataset entirely,
given its McCrary test failure. This approach, however, errs in two ways: over-valuing the
validity of RDDs which pass all of the requisite fitness tests, and under-valuing the validity
of possible corrections. All RDD analyses assume the relationship between Y and R to be
well-estimated, the absence of sorting around the cutoff that is undetectable by the McCrary
procedure, and the absence of other interventions at the cutoff. In this sense, RDD analyses,
while exceptionally believable, do not rise quite to the level of randomized experiments,
which can yield p-values and hypothesis tests without requiring any assumptions beyond the
design itself (Fisher, 1935). Hence, RDD purism of this sort is unwarranted.
In the other direction, an attitude which will only consider perfect RDDs for causal
inference is likely to lose important information. The relevant comparison here is to a
broken randomized trial (see Rubin, 2007): a randomized experiment in which something,
such as faulty randomization or differential attrition, compromised the experiment’s validity.
Conventionally, researchers do not discard these datasets, but instead attempt to correct their
problems. Here, too, the best approach is to attempt to correct the RDD’s problems, rather
than discard the dataset. Because our RDD model does not depend on estimating the limit
of E[Y | R] as R tends towards the cutoff, and instead attempts to estimate a treatment effect
for a specific sub-sample of subjects, removing subjects for whom R = c is reasonable.
4.2 Testing Balance of Higher-Order Means
If transformed ignorability for covariates holds and every pre-treatment covariate X̌ is independent of treatment assignment Z, then X̌ should be balanced not only in its mean, or first
moment, but in all of its moments. Along these lines, some methodologists, such as Caughey
and Sekhon (2011), have suggested testing balance of higher-order moments between treatment groups. Incorporating this suggestion into the randomization-based methodology is
straightforward: raise the transformed covariate to the specified power p, producing a covariate X̌k^p, and test balance with X̌k^p as an additional covariate.
In the LSO dataset, a test incorporating 2nd-order natural splines in the balance test for
b = 0.3, and discarding students with R = 0, results in a permutational p-value of 0.91.
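One way to fold this idea into the omnibus check sketched earlier is to append squared residualized covariates as extra columns before computing Ftot. This is a crude stand-in with placeholder names; the spline-based check described above is not reproduced here.

```python
# Append second-moment terms (squared residualized covariates) to the covariate
# matrix before running the omnibus balance test.
import numpy as np

def with_second_moments(X, r):
    X_check = np.column_stack([residualize(X[:, k], r) for k in range(X.shape[1])])
    return np.column_stack([X, X_check ** 2])   # original covariates plus squared residuals

# p_second_order = f_tot_pvalue(with_second_moments(X, r), r, z)
```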
4.3 Outcome Analysis: Testing for and Estimating a Treatment Effect
The final step is to estimate a treatment effect, and to test the strict null-hypothesis of no
effect. Following the discussion above, this step will set b = 0.3, and remove subjects for
whom R = 0. The hypothesis that AP has no effect for b = 0.3 means that students
whose first-year GPAs were within 0.3 grade-points of the AP cutoff (and whose first-year
GPAs are not equal to the cutoff) would have the same subsequent GPA regardless of AP
assignment. A permutation test of this hypothesis resulted in a p-value of less than 1/1000,
so the strict null hypothesis is implausible.
But what is the treatment effect? Using this approach, a 95% confidence interval is the
set of τ values whose Hτ is not rejected—this is simply the dual of the hypothesis test for
a constant effect τ . A 95% confidence interval for the effect of AP on students for whom
R ∈ [c − b, c + b] and R 6= 0 is 0.1–0.34 grade points. Similarly, one would arrive at a point
estimate of the effect by determining which hypothetical value of τ produced the highest
p-value: this is the Hodges-Lehmann estimate. In the LSO data, the Hodges-Lehmann point
estimate was 0.21 grade points; that is, the treatment effect of academic probation is to
increase students’ subsequent GPA by 0.21.
In the original paper by LSO, the standard errors were clustered by first-semester GPA.
Since GPA is essentially discrete, and a function of how many credits a student takes—or even
which classes—it is plausible that students with the same first-semester GPAs are similar to
each other and therefore have correlated nextGPA residuals. There is a permutation test that is
robust to this clustering: instead of allowing all permutations of the treatment variable, only
allowing permutations that preserve the grouping structure. This is equivalent to assuming
that every combination of R and Y̌ that preserves the grouping structure of R is equally likely;
this is somewhat weaker than transformed ignorability. Weakening transformed ignorability
does not significantly change the p-value testing the strict null hypothesis: it is less than
1/1000.
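A sketch of such a cluster-preserving permutation test: assignment labels are shuffled across the distinct running-variable values (the clusters) rather than across students, so that students who share an R value always share an assignment. The names are placeholders and this is an illustration built on the f_stat sketch above, not the authors' implementation.

```python
# Cluster-preserving permutation test: permute treatment labels at the level of
# distinct R values and broadcast them back to the students in each cluster.
import numpy as np

def clustered_perm_pvalue(y_check, r, z, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    clusters, idx = np.unique(r, return_inverse=True)       # idx: student -> cluster
    z_cluster = np.array([z[idx == g][0] for g in range(len(clusters))])
    obs = f_stat(y_check, r, z)
    draws = np.empty(n_perm)
    for j in range(n_perm):
        z_perm = rng.permutation(z_cluster)[idx]             # shuffled cluster labels
        draws[j] = f_stat(y_check, r, z_perm)
    return np.mean(draws >= obs)
```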
5 Discussion
This paper presents a novel interpretation and modeling approach to regression discontinuity designs. The new approach has several advantages over the conventional approach.
Its assumptions align better with the heuristic motivation that RDDs leverage natural randomization in the vicinity of a cut point. This approach is easily adaptable to small-sample
inference, since it relies on permutation tests. Similarly, the conceptual approach does not
demand continuity in the running variable R, since it does not rely on taking limits of continuous functions of R. Finally, the approach presented here suggests a simple and intuitive way to choose a bandwidth for the RDD analysis: use prior substantive knowledge to choose a region in which an estimated treatment effect or ATE is desirable and interpretable, and then test this bandwidth with a balance test.
The approach presented here may have some weaknesses as well. Firstly, it requires a
set of covariates in the data whose relationships to the outcome of interest are similar to
the running variable R’s. Not only are such covariates not always available, but even in
a rich dataset it may be hard to assess the covariates’ usefulness. Secondly, there may be
some scenarios in which the conventional RDD assumptions are more plausible than those
presented here.
Though starting from a similar point, this paper’s approach differs in some important
ways from the approach in Cattaneo et al. (2014). That paper suggests assuming that
subjects in a window close to the cutoff are randomized, with equal probabilities, to various
treatment conditions. In particular, within the window of analysis, the potential outcomes
are not related to treatment assignment—and, therefore, the running variable R. In contrast,
this paper allows a relationship between potential outcomes and R in the window of analysis.
The question of which set of assumptions is more plausible will depend on the substantive
question of interest. Cattaneo et al. (2014)’s approach allows more flexibility in the choice of
estimands, whereas the approach here is limited to estimating mean-based estimands. This
paper’s approach will hopefully be attractive to researchers who are drawn to randomization-based arguments, but are reluctant to abandon the familiar trappings of the conventional
RDD approach.
Recently, the RDD methodology literature has begun to address the case of multiple
running variables (Papay et al., 2011; Reardon and Robinson, 2012). The method we present
here extends to that case in a straightforward way, using multivariate modeling techniques
to disentangle outcomes from the running variables and joint permutation tests for inference.
On the substantive side, this paper addressed a weakness of the LSO study: the McCrary
test rejected the hypothesis of a continuous running variable R. As McCrary (2008) points
out, a McCrary-test failure is neither necessary nor sufficient to invalidate an RDD, but is
suspicious. This paper found that removing a small subset of the students in the study—
those with running variable values exactly at the cutoff—corrected the empirical problem.
However, this adjustment raises some substantive questions: were there other students,
whom the McCrary test did not identify, who also manipulated their scores? Were there
students who, had their first-year GPAs threatened to fall short of the cutoff, would have been
able to manipulate their GPAs? Estimating effects without removing students with R = 0
gave a result highly similar to the result presented here. This suggests that, if there was
manipulation, its impact on treatment estimates and inference was negligible. Nevertheless,
these questions illustrate the narrow but important gulf that separates the RDD from a fully
randomized controlled experiment.
References
Angrist, J. D. and Lavy, V. (1999), “Using Maimonides’ rule to estimate the effect of class
size on scholastic achievement,” The Quarterly Journal of Economics, 114, 533–575.
Berk, R. and Rauma, D. (1983), “Capitalizing on nonrandom assignment to treatments: A
regression-discontinuity evaluation of a crime-control program,” Journal of the American
Statistical Association, 21–27.
Black, D., Galdo, J., and Smith, J. A. (2005), “Evaluating the regression discontinuity design
using experimental data,” Unpublished manuscript.
Cattaneo, M. D., Frandsen, B., and Titiunik, R. (2014), “Randomization Inference in the
Regression Discontinuity Design: An Application to Party Advantages in the U.S. Senate,”
Tech. rep., University of Michigan.
Caughey, D. and Sekhon, J. S. (2011), “Elections and the regression discontinuity design:
Lessons from close US house races, 1942–2008,” Political Analysis, 19, 385–408.
Cook, T. D. (2008), ““Waiting for life to arrive”: a history of the regression-discontinuity
design in psychology, statistics and economics,” Journal of Econometrics, 142, 636–654.
Cook, T. D. and Wong, V. C. (2008), “Empirical tests of the validity of the regression
discontinuity design,” Annales d’Economie et de Statistique, 91, 127–150.
Cox, D. (1958), The Planning of Experiments, John Wiley.
DesJardins, S. and McCall, B. (2008), “The impact of the Gates Millennium Scholars Program on the retention, college finance- and work-related choices, and future educational
aspirations of low-income minority students,” Unpublished Manuscript.
Fisher, R. A. (1935), Design of Experiments, Edinburgh: Oliver and Boyd.
Gleason, P. M., Resch, A. M., and Berk, J. A. (2012), “Replicating Experimental Impact
Estimates Using a Regression Discontinuity Approach,” Tech. rep., Mathematica Policy
Research.
Good, P. (2000), Permutation tests: a practical guide to resampling methods for testing
hypotheses, vol. 2, Springer New York.
Green, D. P., Leong, T. Y., Kern, H. L., Gerber, A. S., and Larimer, C. W. (2009), “Testing
the accuracy of regression discontinuity analysis using experimental benchmarks,” Political
Analysis, 17, 400–417.
Hahn, J., Todd, P., and Van der Klaauw, W. (2001), “Identification and estimation of
treatment effects with a regression-discontinuity design,” Econometrica, 69, 201–209.
Hallberg, K., Wing, C., Wong, V., and Cook, T. (2013), “Experimental design for causal
inference: Clinical trials and regression discontinuity designs,” in The Oxford Handbook of
Quantitative Methods, Volume 1: Foundations, ed. Little, T., Oxford: Oxford University
Press.
Hansen, B. B. and Bowers, J. (2007), “Covariate balance in simple, stratified and clustered
comparative studies,” Tech. Rep. 436, University of Michigan, Statistics Department.
Hotelling, H. (1931), “The generalization of Student’s ratio,” The Annals of Mathematical
Statistics, 360–378.
Imbens, G. and Kalyanaraman, K. (2012), “Optimal bandwidth choice for the regression
discontinuity estimator,” The Review of Economic Studies, 79, 933–959.
Imbens, G. and Lemieux, T. (2008), “Regression discontinuity designs: A guide to practice,”
Journal of Econometrics, 142, 615–635.
Imbens, G. and Zajonc, T. (2011), “Regression Discontinuity Design with Multiple Forcing
Variables,” Tech. rep., Working Paper, September.
Lee, D. and Card, D. (2008), “Regression discontinuity inference with specification error,”
Journal of Econometrics, 142, 655–674.
Lee, D. and Lemieux, T. (2010), “Regression Discontinuity Designs in Economics,” Journal
of Economic Literature, 48, 281–355.
Lee, D. S. (2008), “Randomized experiments from non-random selection in US House elections,” Journal of Econometrics, 142, 675–697.
Li, F., Mattei, A., and Mealli, F. (2013), “Bayesian inference for regression discontinuity
designs with application to the evaluation of Italian university grants.”
Lindo, J., Sanders, N., and Oreopoulos, P. (2010), “Ability, Gender, and Performance Standards: Evidence from Academic Probation,” American Economic Journal: Applied Economics, 2, 95–117.
McCrary, J. (2008), “Manipulation of the running variable in the regression discontinuity
design: A density test,” Journal of Econometrics, 142, 698–714.
Mealli, F. and Rampichini, C. (2012), “Evaluating the effects of university grants by using regression discontinuity designs,” Journal of the Royal Statistical Society: Series A
(Statistics in Society), 175, 775–798.
Moss, B. G., Yeaton, W. H., and Lloyd, J. E. (2013), “Evaluating the Effectiveness of Developmental Mathematics by Embedding a Randomized Experiment Within a Regression
Discontinuity Design,” Educational Evaluation and Policy Analysis, 0162373713504988.
Oreopoulos, P. (2006), “Estimating average and local average treatment effects of education
when compulsory schooling laws really matter,” The American Economic Review, 152–175.
Papay, J. P., Willett, J. B., and Murnane, R. J. (2011), “Extending the regression-discontinuity approach to multiple assignment variables,” Journal of Econometrics, 161,
203–207.
Pitman, E. (1937), “Significance tests which may be applied to samples from any populations,” Supplement to the Journal of the Royal Statistical Society, 4, 119–130.
Reardon, S. F. and Robinson, J. P. (2012), “Regression discontinuity designs with multiple
rating-score variables,” Journal of Research on Educational Effectiveness, 5, 83–104.
Rosenbaum, P. (2002a), “Covariance Adjustment in Randomized Experiments and Observational Studies,” Statistical Science, 17.
Rosenbaum, P. R. (1993), “Hodges-Lehmann point estimates of treatment effect in observational studies,” Journal of the American Statistical Association, 88, 1250–1253.
— (2002b), Observational Studies, Springer-Verlag, 2nd ed.
Rubin, D. (1974), “Estimating causal effects of treatments in randomized and nonrandomized
studies,” Journal of Educational Psychology, 66, 688.
— (1978a), “Bayesian inference for causal effects: The role of randomization,” The Annals
of Statistics, 34–58.
Rubin, D. B. (1978b), “Bayesian Inference for Causal Effects: The Role of Randomization,”
The Annals of Statistics, 6, 34–58.
— (2007), “The design versus the analysis of observational studies for causal effects: parallels
with the design of randomized trials,” Statistics in medicine, 26, 20–36.
Shadish, W. R., Galindo, R., Wong, V. C., Steiner, P. M., and Cook, T. D. (2011), “A
randomized experiment comparing random and cutoff-based assignment.” Psychological
methods, 16, 179.
Splawa-Neyman, J., Dabrowska, D., and Speed, T. (1990), “On the application of probability
theory to agricultural experiments. Essay on principles. Section 9,” Statistical Science, 5,
465–472.
Thistlethwaite, D. and Campbell, D. (1960), “Regression-discontinuity analysis: An alternative to the ex post facto experiment.” Journal of Educational Psychology, 51, 309.
Van der Klaauw, W. (2008), “Regression–discontinuity analysis: a survey of recent developments in economics,” Labour, 22, 219–245.
Westfall, P. and Young, S. (1993), Resampling-based multiple testing: Examples and methods
for p-value adjustment, vol. 279, Wiley-Interscience.