A Note on Inference with “Difference in Differences”
with a Small Number of Policy Changes
Timothy G. Conley1
Graduate School of Business
University of Chicago
and
Christopher R. Taber
Department of Economics
and
Institute for Policy Research
Northwestern University
July 9, 2007
1 We thank Federico Bandi, Alan Bester, Phil Cross, Chris Hansen, Rosa Matzkin, Bruce Meyer, and Jeff Russell for helpful comments and Aroop Chaterjee and Nathan Hendren for research assistance. All errors are our own. Conley gratefully acknowledges financial support from the NSF (SES 9905720) and from the IBM Corporation Faculty Research Fund at the University of Chicago Graduate School of Business. Taber gratefully acknowledges financial support from the NSF (SES 0217032).
Abstract
Difference in differences methods have become very popular in applied work, and in many applications identification of the key parameter arises from changes in policy by a small number of groups. This is at odds with typical inference via asymptotic approximations that assume the number of groups changing policy is large. We develop an alternative approach to inference in these circumstances under the assumption that there are a finite number of policy changers, using asymptotic approximations as the number of non-changing groups gets large. Treatment effect point estimators are not consistent, but we can consistently estimate their asymptotic distribution under any point null hypothesis about the treatment effect. This allows hypothesis testing and construction of confidence intervals via test statistic 'inversion.' Our method can readily accommodate heterogeneity in treatment parameters across groups, and many types of heteroskedasticity and spatial dependence. This method easily extends beyond difference in differences models and will be valuable in any treatment effect application with a large number of controls, but a small number of treatments. We illustrate our approach by analyzing the effect of college merit aid programs upon college attendance and demonstrate that inferences via our method differ substantially from those obtained with conventional methods.
1 Introduction
This paper presents a new method of inference for difference in differences type fixed effect
regression methods in circumstances where only a small number of groups provide information about treatment parameters of interest. In this approach, identification of the treatment
parameter typically arises when a group ‘changes’ some particular policy. We use N0 to denote the number of ‘treatment’ groups that change their policy in the data and N1 to denote
the number of 'control' groups that do not change their policy. The usual asymptotic approximations assume that both N0 and N1 are large. However, even when the total number of
observations is large, the number of actual policy changes observed in the data is often very
small. For example, often only a few states change a law within the time span of the data.
In such cases, we argue that the standard large-sample approximations used for inference
are not appropriate.1 We develop an alternative approach to inference under the assumption
that N0 is finite, using asymptotic approximations that let N1 grow large. Point estimators
of the treatment effect parameter(s) are not consistent since N0 is fixed. However, we can
use information from the N1 control groups to consistently estimate the distribution of these
point estimators, up to the true value(s) of the parameter. This allows us to use treatment
parameter point estimators as test statistics for any hypothesized true treatment parameter
value(s) and to construct confidence intervals by ‘inverting’ these test statistics.
The following simple model illustrates our basic problem and approach to its solution:
$$Y_{jt} = \alpha d_{jt} + \theta_j + \gamma_t + \eta_{jt}, \qquad (1)$$
where $d_{jt}$ is a policy variable whose coefficient $\alpha$ is the object of interest, $\theta_j$ is a time-invariant fixed effect for group $j$, $\gamma_t$ is a time effect that is common across all groups but varies across time $t = 1, ..., T$, and $\eta_{jt}$ is a group$\times$time random effect.
Suppose that only the $j = 1$ group experiences a treatment change and that it happens to be a permanent, one-unit change at period $t^*$. All other groups have a time-invariant policy: $d_{j1} = ... = d_{jT}$. Consider estimating model (1) by OLS, controlling for group and time effects via dummy variables. Let $\hat\alpha_{FE}$ be this regression estimate of $\alpha$. It is
straightforward to show that $\hat\alpha_{FE}$ can be written as a difference of differences:
$$\hat\alpha_{FE} = \alpha + \left[\frac{1}{T-t^*}\sum_{t=t^*+1}^{T}\eta_{1t} - \frac{1}{t^*}\sum_{t=1}^{t^*}\eta_{1t}\right] - \left(\frac{1}{N_1-1}\sum_{j=2}^{N_1}\frac{1}{T-t^*}\sum_{t=t^*+1}^{T}\eta_{jt} - \frac{1}{N_1-1}\sum_{j=2}^{N_1}\frac{1}{t^*}\sum_{t=1}^{t^*}\eta_{jt}\right).$$

1 Of course in some special cases, the classical linear model assumptions will be satisfied, enabling small sample inference (see e.g. Donald and Lang, 2001). Here our methods remain useful as specification checks but they will be most valuable when the classical model is not applicable.
While $\hat\alpha_{FE}$ is unbiased under the usual assumption that $\eta_{jt}$ has mean zero conditional on the regressors, it is not consistent. As the number of groups grows, only the term in parentheses vanishes; the term in brackets remains unchanged as $N_1$ gets large (with $T$ fixed). That is,
$$\left(\hat\alpha_{FE} - \alpha\right) \xrightarrow{p} W \equiv \frac{1}{T-t^*}\sum_{t=t^*+1}^{T}\eta_{1t} - \frac{1}{t^*}\sum_{t=1}^{t^*}\eta_{1t}.$$
This problem is rarely acknowledged in empirical work and researchers often ignore it when calculating standard errors. If the classical linear model were applicable, standard methods would yield the correct small sample inference (see e.g. Donald and Lang, 2001). However, there are many applications where the classical model does not apply, due e.g. to non-normal $\eta_{jt}$ or serial correlation in $\eta_{jt}$. In such cases, classical inference can be very misleading.
In this paper we show that even though $\hat\alpha_{FE}$ is not consistent, we can still conduct inference and construct confidence intervals for $\alpha$, with a general $\eta_{jt}$ distribution. Our key assumption regarding the $\eta_{jt}$ is that they are cross sectionally stationary: the vector $(\eta_{j1}, ..., \eta_{jT})$ is stationary and weakly dependent in the cross section. This assumption allows us to learn about the limiting distribution of $(\hat\alpha_{FE} - \alpha)$ from the residuals of the other groups. Let $\hat\eta_{jt}$ denote residuals and $\hat W_j$ denote the function of residuals that is analogous to $W$:
$$\hat W_j \equiv \frac{1}{T-t^*}\sum_{t=t^*+1}^{T}\hat\eta_{jt} - \frac{1}{t^*}\sum_{t=1}^{t^*}\hat\eta_{jt}.$$
As $N_1$ gets large, the $\hat W_j$ will have the same distribution as $W$. A test of the hypothesis that $\alpha = \alpha_0$ is easily conducted by comparing $(\hat\alpha_{FE} - \alpha_0)$ with the empirical distribution of $\{\hat W_j\}_{j=2}^{N_1}$. The null hypothesis is rejected when $(\hat\alpha_{FE} - \alpha_0)$ is a sufficiently unlikely (tail) event according to this distribution.
We illustrate this approach in Figure 1, which is based on data from our empirical example in Section 4. We present the empirical distribution of $\hat W_j$, which is a consistent estimate of the distribution of $(\hat\alpha_{FE} - \alpha_0)$ under the null hypothesis that $\alpha = \alpha_0$. An acceptance region can be constructed by finding appropriate quantiles of this empirical distribution. For example, the interval [-.11 to .09] in Figure 1 corresponds to an approximately 90% acceptance region. If $(\hat\alpha_{FE} - \alpha_0)$ does not fall within that range, the null hypothesis $\alpha = \alpha_0$ is rejected. The set of $\alpha_0$ that fail to be rejected provides an approximate 90% confidence interval for $\alpha$. In this example, $\hat\alpha$ is approximately .08, which yields a 90% confidence interval for $\alpha$ of [-.01 to .19].
Our approach is related to a large body of existing work on difference in differences
models and inference in more general group effect models.2 It is complementary to typical
approaches focusing on situations where the number of treatment and control groups, N0 and
$N_1$, are both large (e.g. Moulton, 1990) or both small (e.g. Donald and Lang, 2001). It is
also in the spirit of comparisons of changes in treatment groups to changes in control groups
often done by careful applied researchers. For example, Anderson and Meyer (2000) examine
the effect of changes in unemployment insurance payroll taxes in Washington state on a number
of outcomes using a difference in differences approach with all other states representing the
control groups. In addition to standard analysis, they compare the change in the policy in
Washington state to the distribution of changes across other states during the same period
in time in order to determine whether it is an outlier consistent with a policy effect.3
This approach is relevant for a wide range of applications. Examples include Gruber, Levine, and Staiger (1999), who compare the five treatment states that legalized abortion prior to Roe v. Wade with the remaining states. Our results apply directly with $N_0$ corresponding to the five initial movers. For expositional and motivational purposes,
we focus on the difference in differences case, but our approach is appropriate more generally
in treatment effect models in which there are a large number of controls, but a small number
of treatments.4 Hotz, Mullin, and Sanders (1997) provide a good example of this outside the
2 There are so many examples of difference-in-differences-style empirical work that we do not attempt to survey them. See Meyer (1995), Angrist and Krueger (1999), and Bertrand, Duflo, and Mullainathan (2004) for nice overviews of difference in differences methods. Wooldridge (2003) provides an excellent and concise survey of closely-related group effect models.
3 Though it does not appear in the published version, Section 4.6 of Bertrand, Duflo, and Mullainathan (2002) describes a 'placebo laws' experiment which is also related to some aspects of our approach. They use simulation experiments under specific joint hypotheses about the policy and distribution of covariates to assess size and power of typical tests (based on large-$N_0$ and large-$N_1$). Such experiments could also be used to recover the finite-sample distribution of a treatment effect parameter under a particular null hypothesis.
4 One can also find many studies which use a small number of both treatments and controls. However, if there exist group$\times$time effects, the usual approach for inference is inappropriate. An alternative sample design is to collect many control groups (with the inherent cost of a reduction of match quality). One could then use our methods for appropriate inference. For example, Card and Krueger (1994) examine the impact of the New Jersey minimum wage law change on employment in the fast food industry. Their sample design includes only one control group (eastern Pennsylvania), but they could have collected data from many "control states" to contrast with the available treatment state. We view this not as a substitute for the analysis that they perform, but rather a complement to check the robustness of the results.
difference in differences literature. They estimate the effect of teenage pregnancy on labor market outcomes of mothers. The key to their analysis is using miscarriage as an instrument for teenage motherhood. Their sample comprises 980 women who had a teenage pregnancy, but only 68 experienced miscarriages. Our basic approach could be extended to this type of application with the 68 miscarriages taken as fixed, like $N_0$, and the approximate distributions of estimators calculated treating only the non-miscarried pregnancies as a large sample.
Our final example is the study of merit aid policies, which we use in Section 4 to illustrate our methods. Merit aid programs provide college tuition assistance to students who attend college in state and maintain a sufficiently high grade point average during high school. Some of the studies in the literature estimate the effect using only a single state that changed its law (Georgia) while newer studies make use of ten different states.5 We demonstrate our methodology and show that accounting for the small number of treatment states is important, as the confidence intervals become substantially larger than those formed by the standard asymptotic approach.
The closest analog to our inference method in econometrics is work on testing for end-of-sample structural breaks, in particular work such as that by Dufour, Ghysels, and Hall (1994) and Andrews (2003) on the problem of testing for a structural break over a fixed and perhaps very short interval at the end of a sample, analogous to our $N_0$ observations on policy changers. They develop tests that are asymptotically valid as the number of observations before the potential break point grows, holding fixed the number of time periods after the break.
The remainder of this paper presents our approach in the simplest case of group$\times$time data (e.g. collected at the state$\times$year level) and a common treatment parameter in Section 2. Extensions to allow for heterogeneity in treatment parameters across groups, to individual-level data, and to cross sectional dependence and heteroskedasticity are described in Section 3. In Section 4, we present an illustrative example of our approach by studying the effect of merit scholarships, followed by a brief conclusion in Section 5.
5 Our specifications are motivated by Dynarski (2004).
2 Base Model
Our model for group$\times$time data is:
$$Y_{jt} = \alpha d_{jt} + X_{jt}'\beta + \theta_j + \gamma_t + \eta_{jt}, \qquad (2)$$
where $d_{jt}$ is the policy variable, which need not be binary, $X_{jt}$ is a vector of regressors with parameter vector $\beta$, $\theta_j$ is a time-invariant fixed effect for group $j$, $\gamma_t$ is a time effect that is common across all groups but varies across time $t = 1, ..., T$, and $\eta_{jt}$ is a group$\times$time random effect. We take $\alpha$ to be the parameter of interest.
The key problem motivating our approach is that for many groups there is no (temporal)
variation in djt . We adopt the convention of indexing the N0 groups whose value of djt
changes during the observed time span with the integers 1 to N0 . The integers from N0 + 1
to N0 + N1 then refer to the remaining groups for which djt is constant from t = 1 to T.
We treat N0 and T as fixed, taking limits as N1 grows large. We assume throughout that at
least one group changes its policy so that N0 ≥ 1.
It is convenient to partial out variation explained by indicators for groups and times and to have notation for averages across groups and time. Therefore, for a generic variable $Z_{jt}$, we define $\bar Z_j = \frac{1}{T}\sum_{t=1}^{T} Z_{jt}$ and $\bar Z_t = \frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0} Z_{jt}$, and use the notation $\bar Z$ for the average of $Z_{jt}$ across both groups and time periods. We define a variable $\tilde Z_{jt}$ that equals the residual from a projection of $Z_{jt}$ upon group and time indicators: $\tilde Z_{jt} = Z_{jt} - \bar Z_j - \bar Z_t + \bar Z$. The essence of 'difference in differences' is that we can rewrite regression model (2) as
$$\tilde Y_{jt} = \alpha\tilde d_{jt} + \tilde X_{jt}'\beta + \tilde\eta_{jt} \qquad (3)$$
and we can then estimate $\alpha$ by regressing $\tilde Y_{jt}$ on $\tilde d_{jt}$ and $\tilde X_{jt}$. Let $\hat\alpha$ and $\hat\beta$ denote the OLS estimates of $\alpha$ and $\beta$ in (3).
We assume a set of regularity conditions stated as Assumption 1, most of which are routine. The conditions need to imply that changes in $\eta_{jt}$ are uncorrelated with changes in regressors and that the usual moment and rank conditions hold. The only (slightly) unusual condition we use describes the cross sectional dependence of our data. We assume our data are cross sectionally stationary and strong mixing (see Conley, 1999). This presumes the existence of a coordinate space in which our observations can be indexed. Stationarity refers to translation invariance in this coordinate space and mixing refers to observations approaching independence as their distance grows, both direct analogs of the time series properties with the same names. We omit an explicit notation for these coordinates for ease of exposition.
Assumption 1 $\left(\left(X_{j1}, d_{j1}, \eta_{j1}\right), ..., \left(X_{jT}, d_{jT}, \eta_{jT}\right)\right)$ is stationary and strong mixing across groups; $\left(\eta_{j1}, ..., \eta_{jT}\right)$ has expectation zero conditional on $(d_{j1}, ..., d_{jT})$ and $(X_{j1}, ..., X_{jT})$; all random variables have finite second moments. Finally, the usual rank condition for equation (3) holds:
$$\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde X_{jt}' \xrightarrow{p} \Sigma_x,$$
where $\Sigma_x$ is finite and of full rank.
In Proposition 1 we state that OLS yields a consistent estimator of $\beta$ and we derive the limiting distribution of $\hat\alpha$:

Proposition 1 Under Assumption 1, as $N_1 \to \infty$: $\hat\beta \xrightarrow{p} \beta$, and $\hat\alpha$ is unbiased and converges in probability to $\alpha + W$, with:
$$W = \frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2}. \qquad (4)$$
The proposition states that while $\hat\alpha$ is unbiased, it is not consistent. Its limiting distribution is centered at $\alpha$, with deviation from $\alpha$ given by $W$, a linear combination of $\left(\eta_{jt}-\bar\eta_j\right)$ for $j = 1$ to $N_0$ and $t = 1$ to $T$. The nice aspect of this result is that inference for $\alpha$ remains feasible if we can estimate relevant aspects of the distribution of $W$.
Our approach is to estimate the conditional distribution of $W$ given the observable $d_{jt}$ for the treatment groups. Thus we need to identify the conditional distribution of $\left\{\left(\eta_{jt}-\bar\eta_j\right)\right\}$ for $j = 1$ to $N_0$ and $t = 1$ to $T$ given the corresponding set of $d_{jt}$ values. The mixing and cross-sectional stationarity of our data enable us to consistently estimate this conditional distribution using data from the control groups, $j > N_0$, subject to regularity conditions. In order to identify the distribution of $W$, we need an additional condition beyond Assumption 1 that arises from our assumed design structure for $d_{jt}$. The controls have time-invariant $d_{jt}$, so they cannot be informative about all forms of conditional $\eta_{jt}$ distributions given the treatments' time-varying $d_{jt}$ series. Thus we must assume a model for the required conditional distribution of $W$ that is estimable with time-invariant $d_{jt}$.
For ease of exposition, we first discuss estimation under perhaps the simplest such model, which is that the $\left(\eta_{j1}, ..., \eta_{jT}\right)$ are independent of regressors and identically distributed across groups, stated as Assumption 2. Note that this still allows arbitrary serial correlation in $\eta_{jt}$. An alternative assumption that allows heteroskedasticity and spatial dependence is discussed in Section 3.3.

Assumption 2 $\left(\eta_{j1}, ..., \eta_{jT}\right)$ is IID across $j$ and independent of $(d_{j1}, ..., d_{jT})$ and $(X_{j1}, ..., X_{jT})$, with a bounded density.
To see how the distribution of $\left(\eta_{jt}-\bar\eta_j\right)$ can be estimated under Assumption 2, consider the residual for a member of the control group (i.e. $j > N_0$),
$$\tilde Y_{jt} - \tilde X_{jt}'\hat\beta = \tilde X_{jt}'\left(\beta-\hat\beta\right) + \eta_{jt} - \bar\eta_j - \bar\eta_t + \bar\eta \xrightarrow{p} \left(\eta_{jt}-\bar\eta_j\right). \qquad (5)$$
The term involving $\tilde X_{jt}$ vanishes since $\hat\beta$ is consistent, and the $\eta$ term simplifies because $\bar\eta_t$ and $\bar\eta$ vanish. Thus, if $\left\{\left(\eta_{jt}-\bar\eta_j\right)\right\}_{t=1}^{T}$ is IID across groups, its distribution for the treatment groups, $j \leq N_0$, is trivially identified using residuals for control groups $j > N_0$.
We focus on estimators implied by the sample analog estimator of the distribution of $\left\{\left(\eta_{jt}-\bar\eta_j\right)\right\}_{t=1}^{T}$, i.e. the empirical distribution of residuals from control groups.6 This implies an estimator of the conditional distribution of $W$ given the $d_{jt}$ for the treatment groups. Defining this distribution as $\Gamma(w) \equiv \Pr\left(W < w \mid \{d_{jt}, j = 1, .., N_0, t = 1, ..., T\}\right)$, its sample analog estimator is:
$$\hat\Gamma(w) \equiv \left(\frac{1}{N_1}\right)^{N_0}\sum_{\ell_1=N_0+1}^{N_0+N_1}\cdots\sum_{\ell_{N_0}=N_0+1}^{N_0+N_1} 1\left(\frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\tilde Y_{\ell_j t}-\tilde X_{\ell_j t}'\hat\beta\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2} < w\right).$$
We state a consistency result for $\hat\Gamma(w)$ as Proposition 2.

Proposition 2 Under Assumptions 1-2 and assuming $\beta$ is interior to a compact parameter space, as $N_1 \to \infty$, $\hat\Gamma(w)$ converges in probability to $\Gamma(w)$ uniformly on any compact subset of the support of $W$.
Given the consistent estimator $\hat\Gamma(w)$, it is straightforward to conduct hypothesis tests regarding $\alpha$ using $\hat\alpha$ as a test statistic. Under the null hypothesis that the true value of $\alpha = \alpha_0$, the large sample ($N_1$ large) approximation following from Proposition 1 is that $\hat\alpha$ is distributed as $\alpha_0 + W$ conditional on $\{d_{jt}, j = 1, .., N_0, t = 1, ..., T\}$. Therefore, we consistently estimate the distribution function $\Pr(\hat\alpha < c)$ via $\hat\Gamma(c - \alpha_0)$ and use its appropriate quantiles to define an asymptotically valid acceptance region for this null hypothesis.7 For example, a 90% acceptance region could be estimated as $[\hat\alpha_{lower}, \hat\alpha_{upper}]$ with these endpoints being the 5th and 95th percentiles of this distribution: $\hat\Gamma(\hat\alpha_{lower} - \alpha_0) \approx .05$ and $\hat\Gamma(\hat\alpha_{upper} - \alpha_0) \approx .95$.8 A 90% confidence interval for the true value of $\alpha$ can then be constructed as the set of all values of $\alpha_0$ for which one fails to reject the null hypothesis that $\alpha_0$ is the true value of $\alpha$.

6 Of course the residuals could also be used to estimate any parametric model of their distribution. This may be a preferable practical strategy in applications with moderately large $N_1$.
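The sample analog estimator $\hat\Gamma(w)$ can be computed by enumerating all $N_1^{N_0}$ assignments of control residual series to the $N_0$ treatment policy paths. The sketch below is our illustration under the paper's notation, not the authors' code; it assumes the control residuals $\tilde Y_{jt} - \tilde X_{jt}'\hat\beta$ have already been computed.

```python
import itertools
import numpy as np

def gamma_hat(resid_controls, d_treat, w_grid):
    """Sample-analog CDF estimate of Gamma(w). resid_controls: (N1, T) control
    residuals; d_treat: (N0, T) policy paths of the treatment groups."""
    dd = d_treat - d_treat.mean(axis=1, keepdims=True)   # d_jt - dbar_j
    denom = (dd ** 2).sum()
    # contrib[m, j]: group j's term in W when control m's residuals stand in for it
    contrib = resid_controls @ dd.T / denom
    # enumerate all N1**N0 assignments (feasible only for small N0)
    draws = np.sort([sum(c) for c in
                     itertools.product(*[contrib[:, j] for j in range(dd.shape[0])])])
    return np.searchsorted(draws, w_grid, side='left') / draws.size
```

Quantiles of the resulting step function give the acceptance region; scanning $\alpha_0$ over a grid and keeping the values that are not rejected produces the confidence interval by inversion.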
An alternative, asymptotically equivalent, estimator is motivated by the literature on permutation or randomization inference (see e.g. Rosenbaum, 2002). In this approach one conditions upon everything but being a treatment or control group, and the exact, small sample distribution of $W$ is computable. We can write a 'randomization distribution' estimator of $\Gamma$ as
$$\hat\Gamma^*(w) \equiv \frac{1}{(N_1+N_0)(N_1+N_0-1)\cdots(N_1+1)}\sum_{\ell_1\in[1:N_0+N_1]}\;\sum_{\substack{\ell_2\in[1:N_0+N_1]\\ \ell_2\neq\ell_1}}\cdots\sum_{\substack{\ell_{N_0}\in[1:N_0+N_1]\\ \ell_{N_0}\notin\{\ell_1,...,\ell_{N_0-1}\}}} 1\left(\frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\tilde Y_{\ell_j t}-\alpha_0\tilde d_{\ell_j t}-\tilde X_{\ell_j t}'\hat\beta\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2} < w\right).$$
The summations are over all possible assignments of treatment status to $N_0$ of the $N_0+N_1$ total groups. In special cases, e.g. with no regressors and random assignment of treatment $\tilde d_{jt}$, $\hat\Gamma^*(w)$ provides exact small sample inference. Regressors are typically included because the researcher believes it is important to condition upon them before treatments can be viewed as random. In other words, treatment is randomly assigned only conditional upon $X$, and since $\beta$ must be estimated $\hat\Gamma^*(w)$ will not provide exact small sample inference. However, we anticipate that if $\hat\beta$ is a good estimate of $\beta$ then $\hat\Gamma^*(w)$ should provide good approximations of the small sample distribution of $W$.9
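A direct, brute-force implementation of $\hat\Gamma^*$ enumerates ordered assignments of $N_0$ distinct groups to the treatment slots. The following sketch is our illustration, not the authors' code; for more than a few treatment groups one would sample permutations at random rather than enumerate them.

```python
import itertools
import numpy as np

def gamma_star(Y_til, d_til, Xb_til, d_treat, alpha0, w):
    """Randomization-style estimate of Pr(W < w) under H0: alpha = alpha0.
    Y_til, d_til, Xb_til: (N, T) demeaned outcomes, policies, and fitted
    X~'beta_hat for all N = N0 + N1 groups; d_treat: (N0, T) policy paths."""
    adj = Y_til - alpha0 * d_til - Xb_til            # residuals imposing the null
    dd = d_treat - d_treat.mean(axis=1, keepdims=True)
    denom = (dd ** 2).sum()
    contrib = adj @ dd.T / denom                     # (N, N0)
    N, N0 = contrib.shape
    count = total = 0
    # all ordered assignments of N0 distinct groups to the treatment slots
    for perm in itertools.permutations(range(N), N0):
        total += 1
        count += sum(contrib[m, j] for j, m in enumerate(perm)) < w
    return count / total
```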
7 We note that no test in this framework can be consistent as $N_1 \to \infty$ since there are a finite number of observations that are informative regarding $\alpha$. We also make no claim that this test is optimal.
8 We cannot obtain exact equality in these expressions because $\hat\Gamma$ is a step function, but we can choose the closest point and asymptotically the coverage probability will converge to 90%.
9 We expect $\hat\Gamma^*(w)$ to outperform $\hat\Gamma(w)$ in situations where $\hat\beta$ is well estimated but $N_1$ is still small enough for the empirical distribution in $\hat\Gamma(w)$ to perform poorly. There are certainly applications where this is likely to be the case. For example, suppose that data is collected at the state level and that demographic regressors like income or population have substantial variation. With such large-variance regressors $\hat\beta$ may be well estimated with, say, $N_1 = 20$ states, while with only 20 observations the empirical distribution will do a mediocre-at-best job of estimating conventional critical values. This situation will also arise when the model is extended to individual-level data in Section 3. With only individual-level regressors, coefficients analogous to $\beta$ will be estimated extremely well regardless of $N_1$ if there are many individuals within each group. This situation is routine with repeated cross section data and arises in our empirical example of merit aid programs discussed in Section 4.
3 Extensions
This Section presents extensions of our base model to accommodate treatment parameter
heterogeneity, individual-level data, and departures from Assumption 2 that allow cross
sectional (spatial) dependence and heteroskedasticity.
3.1 Treatment Parameter Heterogeneity
It is straightforward to modify (2) to allow for heterogeneity in treatment parameters across groups. Consider the extension to allow group-specific treatment parameters:
$$Y_{jt} = \alpha_j d_{jt} + X_{jt}'\beta + \theta_j + \gamma_t + \eta_{jt}. \qquad (6)$$
Using the notation defined above we can rewrite this as
$$\tilde Y_{jt} = \alpha_j\tilde d_{jt} + \tilde X_{jt}'\beta + \tilde\eta_{jt}.$$
Note that $\tilde d_{jt}$ is zero for all of the control groups; thus we only estimate treatment parameters for $j = 1$ to $N_0$, and we stack these estimable parameters in the vector $A = [\alpha_1, ..., \alpha_{N_0}]'$. We define $D_{jt}$ to be the $N_0 \times 1$ vector of interactions between $d_{jt}$ and group indicators. That is, the $\ell$th element of the vector $D_{jt}$ equals $d_{jt}$ if $j = \ell$ and is zero otherwise. We can then write
$$\tilde Y_{jt} = \tilde D_{jt}'A + \tilde X_{jt}'\beta + \tilde\eta_{jt}.$$
We refer to the OLS estimates of $(A, \beta)$ in this regression as $\left(\hat A, \hat\beta\right)$.
Proposition 3 If Assumption 1 holds, then as $N_1 \to \infty$, $\hat\beta \xrightarrow{p} \beta$ and $\hat A$ converges in probability to $A + W$, where $W$ is an $N_0 \times 1$ random vector with generic element
$$W(j) = \frac{\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)}{\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2}.$$
Testing and inference can proceed exactly as in Section 2. A consistent sample analog estimator of the distribution of $\hat A$ under the null hypothesis that $A_0$ is the true value of $A$ can be constructed with residuals from controls. This allows testing any point null hypothesis about the heterogeneous treatment effects, and inversion of this test provides a joint confidence set for the elements of $A$. Alternatively, the distribution of any function of the elements of $A$ (e.g. their mean across groups) can also be consistently estimated, allowing analogous hypothesis testing and confidence set construction.
Note that we have restricted the form of the heterogeneity by allowing the treatment
effect parameter to vary with j but not with t. This is purely for simplicity in exposition.
There are many examples of models in which one can identify time-varying treatment effects.
We can extend our approach to all the cases known to us in which αjt would be identified
and consistently estimable under the usual asymptotics (large N0 and large N1 ).
3.2 Individual Level Data
Our approach can easily be applied with repeated cross sections or panels of individual data, the relevant data type for many situations. We restrict ourselves to repeated cross sections for ease of exposition. Let $i$ index an individual who is observed in group $j(i)$ at a single time period $t(i)$. Allowing for individual-specific regressors $Z_i$ and noise $\varepsilon_i$, we arrive at the model:
$$Y_i = \lambda_{j(i)t(i)} + Z_i'\delta + \varepsilon_i \qquad (7)$$
$$\lambda_{jt} = \alpha d_{jt} + X_{jt}'\beta + \theta_j + \gamma_t + \eta_{jt}. \qquad (8)$$
The difference between Zi and Xjt is that we assume that Zi varies within a group×time
cell while Xjt does not. Following Amemiya (1978), we can obtain estimates for α in two
steps. First, estimate λjt for all groups and time periods via a regression of Yi upon a full
set of indicators for group×time and Zi . The estimated λjt can then be used as the outcome
variable in equation (8) and the inference procedures described above can be applied directly
to this second-stage regression.
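A compact version of this two-step procedure can be written as follows. This is our sketch of the approach described above, not the authors' code, and it omits the group-level regressors $X_{jt}$ in the second stage for brevity.

```python
import numpy as np

def two_step_alpha(y, z, g, t, d):
    """y: (n,) individual outcomes; z: (n, kz) individual regressors (no
    intercept); g, t: (n,) integer group and time indices; d: (J, T) policy.
    Step 1: OLS of y on z and a full set of group-by-time dummies.
    Step 2: two-way fixed effect OLS of the estimated lambda_jt on d."""
    J, T = d.shape
    n = len(y)
    D = np.zeros((n, J * T))
    D[np.arange(n), g * T + t] = 1.0                 # group-by-time cell dummies
    coef, *_ = np.linalg.lstsq(np.hstack([D, z]), y, rcond=None)
    lam = coef[:J * T].reshape(J, T)                 # lambda_hat_jt
    demean = lambda Z: Z - Z.mean(1, keepdims=True) - Z.mean(0, keepdims=True) + Z.mean()
    lt, dt = demean(lam), demean(d)
    return (dt * lt).sum() / (dt ** 2).sum()         # alpha_hat

```

Inference then proceeds exactly as in Section 2, with $\hat\lambda_{jt}$ in place of $Y_{jt}$ as the outcome in the group$\times$time regression.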
Let M(j, t) be the set of individuals observed in group j at time t and |M(j, t)| denote
the number of individuals in this set. We continue to assume that T is fixed. A variety of
assumptions could be made about the behavior of |M(j, t)|. We focus on the case in which
|M(j, t)| is growing with N1 . However, in the Appendix, we provide a rigorous demonstration
that these test procedures remain asymptotically valid when the number of individuals per
group×time is fixed but common across group×time cells.10
10 In Conley and Taber (2005) we present a complementary strategy with fixed sample sizes that vary across group$\times$time cells. This is considerably more difficult as one needs to solve a deconvolution problem.
Let $I_i$ be a set of fully-interacted indicators for all groups$\times$time periods. Now consider a regression of $Y_i$ on $Z_i$ and $I_i$. Let $\hat\lambda_{jt}$ be the regression coefficient on the dummy variable for group $j$ at time $t$. It is straightforward to show that
$$\hat\lambda_{jt} = \lambda_{jt} + \left[\frac{1}{|M(j,t)|}\sum_{i\in M(j,t)} Z_i'\left(\delta-\hat\delta\right) + \frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}\varepsilon_i\right], \qquad (9)$$
where $\hat\delta$ is the regression coefficient obtained in the first stage. One can see that as $|M(j,t)|$ grows large, the term in brackets vanishes. The second stage is then simply to plug in $\hat\lambda_{jt}$ for $\lambda_{jt}$ in (8) and run fixed effect OLS. We recycle notation and use $\hat\beta$ and $\hat\alpha$ in this Section to refer to the Amemiya (1978) second stage OLS estimators of (8). The results of Section 2 apply to these estimators under a straightforward set of conditions. Aside from the usual orthogonality and rank conditions, we need to specify the rate at which $|M(j,t)|$ grows; these conditions are stated as Assumption 3:

Assumption 3 $\varepsilon_i$ is IID and orthogonal to $[Z_i \; I_i]$, which is full rank. For all $j$, $|M(j,t)|$ grows uniformly at the same rate as $N_1$.
Proposition 4 Under Assumptions 1, 2, and 3, and assuming $\beta$ is interior to a compact parameter space, as $N_1 \to \infty$, the conclusions of Propositions 1 and 2 apply to the Amemiya (1978) second stage OLS estimators $\hat\beta$ and $\hat\alpha$ of equation (8): $\hat\beta \xrightarrow{p} \beta$ and $\hat\alpha \xrightarrow{p} \alpha + W$, where $W$ has exactly the same form given by (4). Using the notation $\tilde Z$ to refer to the residual from a linear projection of a variable $Z$ upon a full set of time and group indicators, define $\hat\Gamma$ as:
$$\hat\Gamma(w) \equiv \left(\frac{1}{N_1}\right)^{N_0}\sum_{\ell_1=N_0+1}^{N_0+N_1}\cdots\sum_{\ell_{N_0}=N_0+1}^{N_0+N_1} 1\left(\frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\tilde{\hat\lambda}_{\ell_j t}-\tilde X_{\ell_j t}'\hat\beta\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2} < w\right).$$
$\hat\Gamma(w)$ converges in probability to $\Gamma(w)$ uniformly on any compact subset of the support of $W$.
3.3 Cross Sectional Dependence and Heteroskedasticity
Assumption 2 can be replaced by any model of cross sectionally (spatially) correlated and/or heterogeneously distributed $\eta_{jt}$ that is estimable given data from the controls. This Section presents one example model that allows for temporal and spatial dependence in $\eta_{jt}$ and heteroskedasticity depending on group population.11
Consider a model that builds up $\eta_{jt}$ from two Gaussian processes $\mu_{jt}$ and $v_{jt}$ that are independent of each other, with the first capturing dependence and the second heteroskedasticity:
$$\eta_{jt} = \mu_{jt} + v_{jt}.$$
Suppose the $\mu$ process is expectation zero and stationary across both space and time, potentially spatially and temporally correlated. The covariance between $\mu_{it}$ and $\mu_{ks}$ depends on a time-invariant observed 'economic' distance, denoted $dist_{ik}$, and the time lag $t-s$. Thus a covariance function for $\mu$ can be defined as $Cov(\mu_{it}, \mu_{ks}) = f(dist_{ik}, t-s; \theta_\mu)$, for some finite-dimensional parameter $\theta_\mu$. The function $f$ can be chosen to be a valid covariance function for any $\theta_\mu$ parameter value. See e.g. Cressie and Huang (1999) or Ma (2003) for classes of valid families for $f$. One simple example parameterization with $\theta_\mu = (\theta_1, \theta_2, \theta_3) > 0$ is:
$$Cov(\mu_{it}, \mu_{ks}) = \theta_1\exp\{-\theta_2\, dist_{ik} - \theta_3|t-s|\}.$$
Further suppose that the second component $v_{jt}$ of this process is independent across groups and time with a variance that is a function $g(pop_{jt}; \theta_v)$ of observed group population, $pop_{jt}$, so that:
$$v_{jt} \sim N[0, g(pop_{jt}; \theta_v)].$$
These specifications for the components of $\eta_{jt}$ imply that the probability limit of the residuals in equation (5), $\left(\eta_{jt}-\bar\eta_j\right)$, is a Gaussian process with a known parametric covariance structure that depends on the data, $\{dist_{ik}\}_{\text{all } ik}$ and $\{pop_{jt}\}_{\text{all } jt}$, and the parameters $\theta_\mu$ and $\theta_v$. It is straightforward to consistently estimate $\theta_\mu$ and $\theta_v$ by spatial GMM (Conley, 1999) using covariances and variances of residuals in equation (5) as moment conditions. The 'plug-in' estimator of the joint distribution of $\left(\eta_{jt}-\bar\eta_j\right)$ using GMM estimators of $\theta_\mu$ and $\theta_v$ will be consistent and can directly be used to estimate the distribution of $W$.
11 See Conley and Taber (2005) for an alternative model in this framework that allows for heteroskedasticity arising from variation in group populations along with arbitrary serial dependence but with spatial independence.
4 Empirical Example: The Effect of Merit-Aid Programs on Schooling Decisions
In the last fifteen years a number of states have adopted merit-based college aid programs. These programs provide subsidies for tuition and fees to students who meet certain merit-based criteria. The largest and probably the best known program is the Georgia HOPE (Helping Outstanding Pupils Educationally) scholarship, which started in 1993. This program provides full tuition as well as some fees to eligible students who attend in-state public colleges.12 Eligibility for the program requires maintaining a 3.0 grade point average during high school. A number of previous papers have examined the effect of HOPE and other merit-based aid programs.13 Given the large amount of previous work on this subject, we leave full discussion of the details of these programs to these other papers and focus on our methodological contribution.
Our work most closely relates to Dynarski (2004) by focusing on the effects of HOPE and
other merit aid programs on college enrollment of 18 and 19 year olds using the October CPS
from 1989-2000. Our specifications are motivated by some of hers, but we do not replicate
her entire analysis. Our goal is to illustrate use of our method, and our analysis falls well
short of a complete empirical analysis of merit scholarship effects.
During the 1989-2000 time period, ten different states initiated merit-aid programs. We use two specifications, with the first focusing on the HOPE program alone. In this case, we ignore data from the other nine treatment states and use 41 controls (40 states plus the District of Columbia). In the second case, we study the effect of merit-based programs together and use all 51 units.14 The outcome variable in all cases is an indicator variable representing whether the individual is currently enrolled in college.
In Table 1 we present results for the HOPE program with Georgia as the only treatment
state. We compare three estimators: column A corresponds to the approach described in
subsection 3.2 and columns B and C present two natural alternatives. The estimates in both
12 A subsidy for private colleges is also part of the program.
13 Examples include Dynarski (2000, 2004), Cornwell, Mustard, and Sridhar (2003), Cornwell, Lee, and Mustard (2003), Cornwell, Leidner, and Mustard (2003), Bugler, Henry, and Rubenstein (1999), Berker (2001), Bugler and Henry (1997, 1998), and Henry and Rubenstein (2002).
14 Note that these merit programs are quite heterogeneous. This exercise does not necessarily mean that we are assuming that the impact of all of these programs is the same. One could interpret this as estimation of a weighted average of the treatment effects. Alternatively, we can think of this as a test of the joint null hypothesis that all of the effects are zero. We could estimate more general confidence intervals allowing for heterogeneous treatment effects but we focus on the simplest case here.
columns A and B are obtained via the Amemiya (1978) two-step approach. The estimates reported in column A use a first stage linear probability model (OLS), and in column B the first stage is a logit; regressors in both cases include demographics and state$\times$year indicators. The second stage in both A and B estimates (8) via OLS using the estimated state$\times$year coefficients as the dependent variable. Column C presents results from a 'one-step' estimator, which is simply a linear probability model estimated via OLS using the entire sample. Thus the column C treatment effect estimates will be 'population-weighted' across states while in column A states are 'equally-weighted.' The top panel of Table 1 presents point estimates for all three estimators and the bottom panel presents interval estimates for the treatment parameter, both using our methods and using the typical approaches that allow clustering by state and by state$\times$year.
While results differ depending on the clustering used, interval estimates in column A using typical methods indicate significant treatment effects. An interval of [2.5% to 13.0%] obtains with clustering by state and year, which allows the error terms of individuals within the same state and year to be arbitrarily correlated with each other. This interval shrinks to [5.8% to 9.7%] when clustering is done by state, which allows for serial correlation in $\eta_{jt}$. Clearly one should be worried about the assumption that the number of states changing treatment status is large, which underlies these routine confidence interval estimates, since only one state (Georgia) contributes to the estimate of the treatment effect.
The estimated confidence interval using our method, reported in the last row of column A, is [-1% to 21%]. This confidence interval is formed by inverting the test statistic $(\hat\alpha - \alpha_0)$ using our $\hat\Gamma^*$ estimator. It is centered at a larger value and is much wider than the intervals obtained with conventional inference; wide enough to include zero despite its shift in centering. To better understand these discrepancies, Figure 2 displays a kernel smoothed estimate (solid line) of the distribution of $(\hat\alpha - \alpha)$ under the null hypothesis that the true value of $\alpha$ is zero. This distribution is estimated from the control states. For comparison, the dashed line plots an estimate implied by the usual asymptotic approximation with clustering by state. This curve is a Gaussian density centered at zero with a standard deviation equal to 0.0098: the standard error on $\hat\alpha$ from a fixed effect regression that clusters by state. The pronounced differences between the spread and symmetry (lack thereof) of these distributions are what drive our interval estimates of $\alpha$ to differ from those resulting from conventional methods.
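The contrast in Figure 2 is easy to reproduce from the control-based draws $\hat W_j$: a kernel smooth of their distribution against the Gaussian implied by state-clustered standard errors. The sketch below is our illustration, not the authors' plotting code; the bandwidth rule is an assumption, and the 0.0098 standard error is taken from the text above.

```python
import numpy as np

def compare_densities(W_hat, se_cluster=0.0098, grid=None):
    """Returns the grid, a Gaussian-kernel density of W_hat (solid line in
    Figure 2), and the N(0, se_cluster^2) density (dashed line)."""
    if grid is None:
        grid = np.linspace(-0.2, 0.15, 200)
    h = 1.06 * W_hat.std() * len(W_hat) ** (-0.2)       # Silverman's rule of thumb
    u = (grid[:, None] - W_hat[None, :]) / h
    kde = np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    gauss = np.exp(-0.5 * (grid / se_cluster) ** 2) / (se_cluster * np.sqrt(2 * np.pi))
    return grid, kde, gauss
```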
In constructing the confidence intervals, two issues arise due to the fact that we have only 41 control states. The first issue is that one may worry whether 41 is large enough for the asymptotics to be valid. With that in mind, we use the $\hat\Gamma^*$ estimator described in Section 2. This is motivated by its anticipated good finite sample properties and the fact that all our regressors vary at the individual level, so that we do not need $N_1$ to grow in order to get consistent estimates of $\delta$. With $N_1$ fixed, $\hat\Gamma^*$ will converge to the exact permutation distribution of $W$ as the number of observations within each state grows large. The second issue can be seen in Figure 1. The estimated CDF is of course a step function, and with a single treatment state and 41 controls its probability increments are limited to 1/41ths. To approximate intervals with conventional, say 95%, coverage probabilities we use a conservative interval so that the limiting coverage probability is at least 95%. As a practical matter this is usually only relevant for the case of a single treatment group. With two or more treatment groups the empirical CDF will have a number of steps on the order of the number of ways to choose the $N_0$ treatment groups out of the total number of groups, $\frac{(N_0+N_1)!}{N_0!N_1!}$. Thus the number of steps in the CDF is typically large for two or more treatments, with correspondingly small probability increments.
In column B we present a logit version of the model. The estimates in this column were obtained in exactly the same manner as in the previous one, except that in the first stage we utilize a logit model of the college attendance indicator, so the estimated parameter has the interpretation of a logit index coefficient. The pattern is very similar to column A. Intervals from our method are again centered higher than conventional ones, but sufficiently wider that the HOPE treatment effect becomes marginally insignificant. This contrasts with effects that are highly significant using standard inference methods. To display the magnitude of the program impact we calculate a 95% confidence interval for changes in college attendance probability for a particular individual. We consider an individual (without the treatment) whose logit index puts his probability of college attendance at the sample unconditional average attendance probability of 45% (i.e. an individual with a logit index of -.20). The bracketed intervals reported in column B are 95% confidence intervals for the change in attendance probability for our reference individual (intervals in parentheses are 95% confidence intervals for $\alpha$).15
In column C we present results from a linear probability model that estimates equations (7)-(8) via OLS using all 34,902 observations. The details for constructing the confidence intervals are formally presented in the Appendix (section A.4.4). These results are close to those presented in column A. The difference between these two estimates is that in column A the states are equally weighted while in column C they are population-weighted.

15 These confidence intervals for changes in attendance probabilities are calculated directly from the 95% CI for $\alpha$. Specifically, when the CI for $\alpha$ is $[c_1, c_2]$, we report an interval for the change in predicted probability for our reference individual of $(\Lambda(-.2+c_1) - 45\%)$ to $(\Lambda(-.2+c_2) - 45\%)$.
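The transformation in footnote 15 is a one-line computation; the snippet below checks it for the column B Conley-Taber interval (our check, not the authors' code).

```python
import numpy as np

def prob_change_interval(c1, c2, index0=-0.20, base=0.45):
    """Maps a CI [c1, c2] for the logit coefficient into a CI for the change
    in attendance probability of the footnote-15 reference individual."""
    logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
    return logistic(index0 + c1) - base, logistic(index0 + c2) - base

# column B Conley-Taber interval (-0.039, 0.909) maps to roughly (-0.010, 0.220),
# close to the bracketed [-0.010, 0.225] reported in Table 1 (rounding differs)
print(prob_change_interval(-0.039, 0.909))
```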
In Table 2 we present results estimating the effect of merit aid using all ten states that added programs during this time period. The format of the table is identical to Table 1. There are a few notable features of the table. First, the weighting matters substantially, as the effect is much smaller when we weight all the states equally as opposed to the population-weighted estimates in column C. Second, in contrast to Table 1, the confidence intervals are quite similar when we cluster by state compared to clustering by state$\times$year. Most importantly, our approach changes the confidence intervals substantially, but less dramatically than in Table 1. In particular, the treatment effect with equal weighting across states is still statistically significant at conventional levels.
5 Conclusions
This paper has presented an inference method for difference-in-differences fixed effect models when the number of policy changes observed in the data is small. This method is an alternative to typical asymptotic inference based on a large number of policy changes and to classical small sample inference based on Gaussian error terms. Our approach will be most valuable in applications where the classical model does not apply, e.g. due to non-Gaussian or serially correlated errors. We provide an estimator $\hat\Gamma^*$ that in some cases with IID data is both large-$N_1$ asymptotically valid and approximately exact with small $N_1$. However, with large $N_1$ our approach can still be applied with much weaker conditions on the data. Many forms of cross sectional dependence and heteroskedasticity, for example, can be readily accommodated. We provide an example application studying the effect of merit scholarship programs on college attendance for which our approach seems appropriate. It results in very different inference from conventional methods.
References
Adler, R.J., The Geometry of Random Fields, Wiley, New York, 1981.

Amemiya, Takeshi, "A Note on a Random Coefficients Model," International Economic Review 19(3), 1978, 793-796.

Anderson, Patricia and Bruce Meyer, "The Effects of the Unemployment Insurance Payroll Tax on Wages, Employment, Claims, and Denials," Journal of Public Economics 78(1), 2000, 81-106.

Andrews, Donald W.K., "End-of-Sample Instability Tests," Econometrica 71(6), 2003, 1661-1694.

Angrist, Joshua, and Alan Krueger, "Empirical Strategies in Labor Economics," in Handbook of Labor Economics, Ashenfelter and Card, eds., Elsevier, New York, 1999.

Berker, Ali Murat, "The Impact of Merit-Based Aid on College Enrollment: Evidence from HOPE-Like Scholarship Programs," unpublished manuscript, Michigan State University, 2001.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?," Quarterly Journal of Economics 119(1), 2004, 249-275.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?," NBER Working Paper 8841, 2002.

Bugler, Daniel and Gary Henry, "Evaluating the Georgia HOPE Scholarship Program: Impact on Students Attending Public Colleges and Universities," unpublished manuscript, Council for School Performance, Georgia State University, 1997.

Card, David, "The Impact of the Mariel Boatlift on the Miami Labor Market," Industrial and Labor Relations Review, January 1990.

Card, David, and Alan Krueger, "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, September 1994.

Conley, Timothy G., "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics 92, 1999, 1-45.

Conley, Timothy G. and Christopher R. Taber, "Inference with 'Difference in Differences' with a Small Number of Policy Changes," NBER Working Paper 0312, 2005.

Cornwell, Christopher, Kyung Hee Lee, and David Mustard, "The Effects of Merit-Based Financial Aid on Academic Choices in College," unpublished manuscript, University of Georgia, 2003.

Cornwell, Christopher, Mark Leidner, and David Mustard, "Rules, Incentives and Policy Implications of Large-Scale Merit-Aid Programs," unpublished manuscript, University of Georgia, 2003.

Cornwell, Christopher, David Mustard, and Deepa Sridhar, "The Enrollment Effects of Merit-Based Financial Aid: Evidence from Georgia's HOPE Scholarship," unpublished manuscript, University of Georgia, 2003.

Cressie, N. and C.-H. Huang, "Classes of Nonseparable Spatiotemporal Stationary Covariance Functions," Journal of the American Statistical Association 94, 1999, 1330-1340.

Donald, Stephen, and Kevin Lang, "Inference with Difference in Differences and Other Panel Data," unpublished manuscript, Boston University, 2001.

Dufour, Jean-Marie, Eric Ghysels, and Alastair Hall, "Generalized Predictive Tests and Structural Change Analysis in Econometrics," International Economic Review 35(1), 1994, 199-229.

Dynarski, Susan, "Hope for Whom? Financial Aid for the Middle Class and its Impact on College Attendance," National Tax Journal 53(3, Part 2), 2000, 629-662.

Dynarski, Susan, "The New Merit Aid," in College Choices: The Economics of Where to Go, When to Go, and How to Pay for It, Caroline Hoxby, ed., University of Chicago Press, 2004.

Gruber, Jonathan, Phillip Levine, and Douglas Staiger, "Abortion Legalization and Child Living Circumstances: Who is the 'Marginal Child'?," Quarterly Journal of Economics, February 1999.

Henry, Gary and Ross Rubenstein, "Paying for Grades: Impact of Merit-Based Financial Aid on Educational Quality," Journal of Policy Analysis and Management 21(1), 2002, 93-109.

Hotz, V.J., C. Mullin, and S. Sanders, "Bounding Causal Effects Using Data from a Contaminated Natural Experiment: Analyzing the Effect of Teenage Childbearing," Review of Economic Studies 64, 1997, 576-603.

Ma, C., "Spatio-temporal Stationary Covariance Models," Journal of Multivariate Analysis 86, 2003, 97-107.

Meyer, Bruce, "Natural and Quasi-Natural Experiments in Economics," Journal of Business and Economic Statistics, XII, 1995, 151-162.

Newey, Whitney and Daniel McFadden, "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Engle and McFadden, eds., Elsevier, 1994, 2113-2245.

Rosenbaum, Paul, "Covariance Adjustment in Randomized Experiments and Observational Studies," Statistical Science 17(3), 2002, 286-327.

Wooldridge, Jeffrey, "Cluster-Sample Methods in Applied Econometrics," unpublished manuscript, Michigan State University, 2003.
Figure 1. Example Estimate of CDF for W
[Figure: estimated Prob(W < c) plotted against c, with horizontal reference lines at .05 and .95 marking the approximate 90% acceptance region.]
Figure 2: Estimated Density of $\hat\alpha$ under $H_0: \alpha_0 = 0$
[Figure: solid line, kernel smoothed density estimate for the Conley-Taber approach; dotted line, density estimate using standard asymptotics.]
Table 1
Estimates for Effect of Georgia HOPE Program on College Attendance

                                      A                    B                    C
                               Linear Probability        Logit        Population Weighted
                                                                       Linear Probability
Hope Scholarship                     0.078                0.359               0.072
Male                                -0.076               -0.323              -0.077
Black                               -0.155               -0.673              -0.155
Asian                                0.172                0.726               0.173
State Dummies                        yes                  yes                 yes
Year Dummies                         yes                  yes                 yes

95% Confidence intervals for HOPE Effect
Standard Cluster by State×Year   (0.025, 0.130)    (0.119, 0.600) [0.030, 0.149]    (0.025, 0.119)
Standard Cluster by State        (0.058, 0.097)    (0.274, 0.444) [0.068, 0.111]    (0.050, 0.094)
Conley-Taber                     (-0.010, 0.207)   (-0.039, 0.909) [-0.010, 0.225]  (-0.015, 0.212)

Sample Size
Number States                        42                   42                  42
Number of Individuals               34902                34902               34902

Note: Confidence intervals for parameters are presented in parentheses. Brackets contain a confidence interval for the program impact upon a person whose college attendance probability in the absence of the program would be 45%.
Table 2
Estimates for Merit Aid Programs Effect on College Attendance

                                      A                    B                    C
                               Linear Probability        Logit        Population Weighted
                                                                       Linear Probability
Merit Scholarship                    0.051                0.229               0.034
Male                                -0.078               -0.331              -0.079
Black                               -0.150               -0.655              -0.150
Asian                                0.168                0.707               0.169
State Dummies                        yes                  yes                 yes
Year Dummies                         yes                  yes                 yes

95% Confidence intervals for Merit Aid Program Effect
Standard Cluster by State×Year   (0.024, 0.078)    (0.111, 0.346) [0.028, 0.086]    (0.006, 0.062)
Standard Cluster by State        (0.028, 0.074)    (0.127, 0.330) [0.032, 0.082]    (0.008, 0.059)
Conley-Taber                     (0.012, 0.093)    (0.056, 0.407) [0.014, 0.101]    (-0.003, 0.093)

Sample Size
Number States                        51                   51                  51
Number of Individuals               42161                42161               42161

Note: Confidence intervals for parameters are presented in parentheses. Brackets contain a confidence interval for the program impact upon a person whose college attendance probability in the absence of the program would be 45%.
Appendix
A.1 Proof of Proposition 1
First, a standard application of the partitioned inverse theorem makes it straightforward to show that
$$\hat\beta = \beta + \left(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde X_{jt}'}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde X_{jt}'\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2}\right)^{-1} \qquad (A\text{-}1)$$
$$\times\left(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde\eta_{jt}}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt}\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2}\right).$$
Now consider each piece in turn.
First, Assumption 1 states that
$$\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde X_{jt}' \xrightarrow{p} \Sigma_X < \infty.$$
The stationarity and mixing components of Assumption 1 imply that a strong law of large numbers (LLN) applies here, see e.g. Adler (1981) or Conley (1999). This LLN and the conditional independence component of Assumption 1 imply that:
$$\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde X_{jt}\tilde\eta_{jt} \xrightarrow{p} E\left[\sum_{t=1}^{T}\tilde X_{jt}\tilde\eta_{jt}\right] = 0.$$
For control groups $j > N_0$, the treatment does not vary over time, so $d_{jt} = \bar d_j$. Therefore,
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2 = \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j-\bar d_t+\bar d\right)^2 + \sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}\left(\bar d-\bar d_t\right)^2,$$
where
$$\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}\left(\bar d-\bar d_t\right)^2 = N_1\sum_{t=1}^{T}\left(\bar d-\bar d_t\right)^2 = N_1\sum_{t=1}^{T}\left(\frac{1}{N_0+N_1}\sum_{\ell=1}^{N_0+N_1}\left[\frac{1}{T}\sum_{\tau=1}^{T}d_{\ell\tau}-d_{\ell t}\right]\right)^2 = \frac{N_1}{(N_0+N_1)^2}\sum_{t=1}^{T}\left(\sum_{\ell=1}^{N_0}\left[\frac{1}{T}\sum_{\tau=1}^{T}d_{\ell\tau}-d_{\ell t}\right]\right)^2 \xrightarrow{p} 0.$$
Now consider the other term:
$$\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j-\bar d_t+\bar d\right)^2 \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2,$$
since $\bar d_t$ and $\bar d$ both have the same limit due to the finite number of groups with intertemporal variation in treatments. Thus
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2 \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2 > 0,$$
since $N_0 \geq 1$.
Now consider
$$\frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde X_{jt} = \frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde X_{jt} + \frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\left(\bar d-\bar d_t\right)\tilde X_{jt}$$
$$= \frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde X_{jt} + \sum_{t=1}^{T}\left(\bar d-\bar d_t\right)\frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_0+N_1}\tilde X_{jt} \xrightarrow{p} 0 \text{ as } N_1\to\infty.$$
This result follows because the first term involves a sum of a finite number of $O_p(1)$ random variables normalized by an $O(N_1^{-1/2})$ term, and the second term is identically zero due to differencing:
$$\sum_{j=1}^{N_1+N_0}\tilde X_{jt} = \sum_{j=1}^{N_0+N_1}\left(X_{jt}-\bar X_j-\bar X_t+\bar X\right) = (N_0+N_1)\left(\bar X_t-\bar X-\bar X_t+\bar X\right) = 0.$$
Likewise,
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt} = \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde\eta_{jt} + \sum_{t=1}^{T}\left(\bar d-\bar d_t\right)\sum_{j=1}^{N_0+N_1}\tilde\eta_{jt} = \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j-\bar\eta_t+\bar\eta\right),$$
which is $O_p(1)$; thus
$$\frac{1}{\sqrt{N_1+N_0}}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt} \xrightarrow{p} 0.$$
Consistency for $\hat\beta$ follows upon plugging the pieces back into (A-1) and applying Slutsky's theorem.
From the normal equation for $\hat\alpha$ it is straightforward to show that
$$\hat\alpha = \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\left(\tilde Y_{jt}-\tilde X_{jt}'\hat\beta\right)}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2} \qquad (A\text{-}2)$$
$$= \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\left(\alpha\tilde d_{jt}+\tilde\eta_{jt}+\tilde X_{jt}'(\beta-\hat\beta)\right)}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2}$$
$$= \alpha + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt}}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2} + \left[\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde X_{jt}'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2}\right]\left(\beta-\hat\beta\right).$$
From above we know that
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2 \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^2$$
and
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde X_{jt} = \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde X_{jt},$$
which is $O_p(1)$. Thus
$$\left[\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde X_{jt}'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^2}\right]\left(\beta-\hat\beta\right) \xrightarrow{p} 0.$$
We showed above that
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt} = \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j-\bar\eta_t+\bar\eta\right).$$
The variables $\bar\eta_t$ and $\bar\eta$ both converge to zero in probability as $N_1\to\infty$, therefore
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt} \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right).$$
Plugging these pieces into (A-2) gives the result.
A.2 Proof of Proposition 2
Since $\Gamma$ is defined conditional on $d_{jt}$ for $j = 1, ..., N_0$, $t = 1, ..., T$, every probability in this proof conditions on this set. To simplify the notation, we omit this explicit conditioning. Thus, every probability statement and distribution function in this proof should be interpreted as conditioning on $d_{jt}$ for $j = 1, ..., N_0$, $t = 1, ..., T$.
It is convenient to define
$$\rho_{jt} = \frac{d_{jt}-\bar d_j}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^2}.$$
For each $j = 1, ..., N_0$ define the random variable
$$W_j \equiv \frac{\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^2} = \sum_{t=1}^{T}\rho_{jt}\eta_{jt}.$$
Note that we used the fact that, since $\bar\eta_j$ is time invariant, $\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\bar\eta_j = 0$. Let $F_j$ be the distribution of $W_j$ for $j = 1, .., N_0$.
Then note that
$$\Gamma(w) = \Pr\left(\sum_{j=1}^{N_0}\sum_{t=1}^{T}\rho_{jt}\eta_{jt} < w\right) = \int\cdots\int 1\left(\sum_{j=1}^{N_0}W_j < w\right)dF_1(W_1)\cdots dF_{N_0}(W_{N_0}).$$
We can also write
$$\hat\Gamma(w) = \int\cdots\int 1\left(\sum_{j=1}^{N_0}W_j < w\right)d\hat F_1(W_1;\hat\beta)\cdots d\hat F_{N_0}(W_{N_0};\hat\beta),$$
where $\hat F_j(\cdot;\hat\beta)$ is the empirical c.d.f. one gets from the residuals using the control groups only. That is, more generally,
$$\hat F_j(w_j; b) \equiv \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde Y_{mt}-\tilde X_{mt}'b\right) < w_j\right).$$
To avoid repeating the expression we define
$$\phi_j(w_j, b) \equiv \Pr\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-X_{mt}'(\beta-b)\right) < w_j\right).$$
Note that $\phi_j(w_j, \beta) = F_j(w_j)$. The proof strategy is first to demonstrate that $\hat F_j(w_j; \hat\beta)$ converges to $\phi_j(w_j, \beta)$ uniformly over $w_j$. We will then show that $\hat\Gamma(w)$ is a consistent estimate of $\Gamma(w)$.
Define
$$\hat\omega_j = \sum_{t=1}^{T}\rho_{jt}\left(\bar\eta_t - \bar X_t'\left(\beta-\hat\beta\right)\right).$$
Let $\Omega$ be a compact parameter space for $w$, and $\Theta$ a compact subset of the parameter space for $\left(\hat\beta, \hat\omega_j\right)$ in which $(\beta, 0)$ is an interior point.
For each $j = 1, ..., N_0$, consider the difference between $\hat F_j(w_j; \hat\beta)$ and $\phi_j(w_j, \beta)$:
$$\sup_{w_j\in\Omega}\left|\hat F_j(w_j;\hat\beta)-\phi_j(w_j,\beta)\right| \qquad (A\text{-}3)$$
$$= \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde\eta_{mt}-\tilde X_{mt}'\left(\beta-\hat\beta\right)\right)<w_j\right)-\phi_j(w_j,\beta)\right|$$
$$= \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-\bar\eta_t-\left(X_{mt}-\bar X_t\right)'\left(\beta-\hat\beta\right)\right)<w_j\right)-\phi_j(w_j,\beta)\right|$$
$$= \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-X_{mt}'\left(\beta-\hat\beta\right)\right)<w_j+\hat\omega_j\right)-\phi_j(w_j,\beta)\right|$$
$$\leq \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-X_{mt}'\left(\beta-\hat\beta\right)\right)<w_j+\hat\omega_j\right)-\phi_j\left(w_j+\hat\omega_j,\hat\beta\right)\right| + \sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\hat\omega_j,\hat\beta\right)-\phi_j(w_j,\beta)\right|$$
$$\leq \sup_{\substack{w_j\in\Omega\\ (b,\omega_j)\in\Theta}}\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-X_{mt}'(\beta-b)\right)<w_j+\omega_j\right)-\phi_j(w_j+\omega_j,b)\right| + \Pr\left(\left(\hat\beta,\hat\omega_j\right)\notin\Theta\right) + \sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\hat\omega_j,\hat\beta\right)-\phi_j(w_j,\beta)\right|.$$
First consider supw ¯φj wj , β − φj (wj , β)¯ . Using a standard mean-value expansion of φ,
³
´
e ,
for some ω
ej , β
´
´
³
³
¯
¯
¯ ∂φ w + ω
¯
e
e
¯ ³
¯
´
³
´
∂φ
e
,
β
,
β
w
j
j
j
¯
¯
j
j
¯
¯
b
b
¯
sup ¯φj wj + ω
b j , β − φj (w, β)¯ = sup ¯
(b
ωj )¯¯ .
β−β +
∂β
∂wj
w ¯
wj ∈Ω
¯
5
To see that the derivative
∂φj (wj ,b)
∂b
is bounded first note that
φj (wj , b) = Pr
à T
X
Ã
t=1
´
³
0
e
ρjt e
η jt + Xjt (β − b) < wj
= Pr Wj < wj −
So
∂φj (wj , b)
=E
∂b
Ã
fj
T
X
t=1
T
X
t=1
!
!
ejt0 (β − b)
ρjt X
ejt0
ρjt X
!
where fj is the density associated with Fj . Since fj is bounded and Xjt has first mo∂φ (w ,b)
ments, this term is bounded. Clearly j∂wjj is also bounded for the same reason. Thus
¯
¯ ³
´
¯
b − φj (wj , β)¯¯ converges to zero since β
b is consistent.
supwj ∈Ω ¯φj w + ω
bj , β
´
³³
´
´
³
b ω
bj ∈
/Θ
Since β,
b j converges in probability to (β, 0) which is an interior point of Θ, Pr β̂, ω
converges to zero.
Next consider the first term on the right side of (A-3). Note that the function
$$1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde Y_{mt}-\tilde X_{mt}'b\right)<w_j+\omega_j\right)$$
is continuous at each $(b, w, \omega)$ with probability one and its absolute value is bounded by 1, so applying Lemma 2.4 of Newey and McFadden (1994), $\hat F_j(w_j; b)$ converges uniformly to $\phi_j(w_j, b)$. Thus, putting the three pieces of (A-3) together,
$$\sup_{w_j\in\Omega}\left|\hat F_j(w_j;\hat\beta)-\phi_j(w_j,\beta)\right| \xrightarrow{p} 0.$$
Now, to see that $\widehat\Gamma(w)$ converges to $\Gamma(w)$, note that we can write
$$\begin{aligned}
\left|\widehat\Gamma(w)-\Gamma(w)\right|
&= \Bigg|\int 1\left(\sum_{j=1}^{N_0}W_j<w\right)d\widehat F_1(W_1;\widehat\beta)\,d\widehat F_2(W_2;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\\
&\qquad - \int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\,dF_2(W_2)\cdots dF_{N_0}(W_{N_0})\Bigg|\\
&= \Bigg|\left\{\int 1\left(\sum_{j=1}^{N_0}W_j<w\right)d\widehat F_1(W_1;\widehat\beta)\,d\widehat F_2(W_2;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\right.\\
&\qquad\quad\left. - \int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\,d\widehat F_2(W_2;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\right\}\\
&\quad + \left\{\int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\,d\widehat F_2(W_2;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\right.\\
&\qquad\quad\left. - \int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\,dF_2(W_2)\,d\widehat F_3(W_3;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\right\}\\
&\quad + \cdots\\
&\quad + \left\{\int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\cdots dF_{N_0-1}(W_{N_0-1})\,d\widehat F_{N_0}(W_{N_0};\widehat\beta)\right.\\
&\qquad\quad\left. - \int 1\left(\sum_{j=1}^{N_0}W_j<w\right)dF_1(W_1)\cdots dF_{N_0-1}(W_{N_0-1})\,dF_{N_0}(W_{N_0})\right\}\Bigg|\\
&= \Bigg|\int\left[\widehat F_1\left(w-\sum_{j=2}^{N_0}W_j;\widehat\beta\right)-F_1\left(w-\sum_{j=2}^{N_0}W_j\right)\right]d\widehat F_2(W_2;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\\
&\quad + \int\left[\widehat F_2\left(w-\sum_{\substack{j=1\\ j\ne2}}^{N_0}W_j;\widehat\beta\right)-F_2\left(w-\sum_{\substack{j=1\\ j\ne2}}^{N_0}W_j\right)\right]dF_1(W_1)\,d\widehat F_3(W_3;\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\beta)\\
&\quad + \cdots\\
&\quad + \int\left[\widehat F_{N_0}\left(w-\sum_{j=1}^{N_0-1}W_j;\widehat\beta\right)-F_{N_0}\left(w-\sum_{j=1}^{N_0-1}W_j\right)\right]dF_1(W_1)\cdots dF_{N_0-1}(W_{N_0-1})\Bigg|.
\end{aligned}$$
Since each $\widehat F_j(w;\widehat\beta)$ converges uniformly to $F_j(w)$, the right-hand side of this expression must converge to zero, so $\widehat\Gamma(w)$ converges to $\Gamma(w)$.
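In practice $\widehat\Gamma$ is an $N_0$-fold average of the indicator over tuples of control groups; when $N_1^{N_0}$ is too large to enumerate, simulation works. A minimal sketch (ours, not the authors' code), reusing the `W_draws` values of the earlier sketch; the simple quantile-based interval below presumes the control residuals do not depend on the hypothesized treatment effect, which holds in the basic design:

```python
# Illustrative sketch of Gamma-hat and a test-inversion interval. W_mat row
# j holds the N1 values W_draws(j, ...), i.e. the support of F_j-hat.
import numpy as np

def _sum_draws(W_mat, n_draws, seed):
    """Draws of W_1 + ... + W_N0, each W_j sampled independently from its
    empirical c.d.f. (row j of W_mat)."""
    rng = np.random.default_rng(seed)
    N0, N1 = W_mat.shape
    idx = rng.integers(0, N1, size=(n_draws, N0))
    return W_mat[np.arange(N0), idx].sum(axis=1)

def gamma_hat(w, W_mat, n_draws=100_000, seed=0):
    """Monte Carlo version of the N0-fold integral defining Gamma-hat(w)."""
    return np.mean(_sum_draws(W_mat, n_draws, seed) < w)

def conf_interval(alpha_hat, W_mat, level=0.95, n_draws=100_000, seed=0):
    """Test inversion: alpha-hat minus the true alpha has asymptotic c.d.f.
    Gamma, so the interval is [alpha_hat - q_hi, alpha_hat - q_lo] with the
    q's taken as quantiles of the simulated sums."""
    sums = _sum_draws(W_mat, n_draws, seed)
    q_lo, q_hi = np.quantile(sums, [(1 - level) / 2, (1 + level) / 2])
    return alpha_hat - q_hi, alpha_hat - q_lo
```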
A.3 Proof of Proposition 3

First, a standard application of the partitioned inverse theorem, using the fact that $D_{jt} = 0$ for $j > N_0$, gives
$$\begin{aligned}
\widehat\beta = \beta &+ \left(\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde X_{jt}' - \frac{1}{N_1+N_0}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde D_{jt}'\right]\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde X_{jt}'\right]\right)^{-1} \quad\text{(A-4)}\\
&\times\left(\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde\eta_{jt} - \frac{1}{N_1+N_0}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde D_{jt}'\right]\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\tilde\eta_{jt}\right]\right).
\end{aligned}$$
Now consider each piece in turn. First, Assumption 1 states that
$$\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde X_{jt}' \xrightarrow{p} \Sigma_X < \infty.$$
The mixing and conditional independence components of Assumption 1 imply that
$$\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde\eta_{jt} \xrightarrow{p} E\left[\sum_{t=1}^{T}\widetilde X_{jt}\tilde\eta_{jt}\right] = 0.$$
Consider the $\ell$th element of the vector $\widetilde D_{jt}$,
$$\widetilde D_{jt}(\ell) = \begin{cases} d_{jt}-\bar d_j-\dfrac{d_{\ell t}}{N_0+N_1}+\dfrac{\bar d_\ell}{N_0+N_1}, & j=\ell,\\[2ex] -\dfrac{d_{\ell t}}{N_0+N_1}+\dfrac{\bar d_\ell}{N_0+N_1}, & \text{otherwise.}\end{cases}$$
The $\ell$th diagonal component of $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'$ can be written as
$$\begin{aligned}
&\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)^{2} + \sum_{\substack{j=1\\ j\ne\ell}}^{N_0+N_1}\sum_{t=1}^{T}\left(-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)^{2}\\
&= \sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)^{2} + (N_0+N_1-1)\,\frac{\sum_{t=1}^{T}\left(\bar d_\ell-d_{\ell t}\right)^{2}}{(N_0+N_1)^{2}}\\
&\xrightarrow{p} \sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}.
\end{aligned}$$
A generic $(\ell,k)$ off-diagonal term can be written as
$$\begin{aligned}
\sum_{t=1}^{T}\Bigg[&\left(d_{\ell t}-\bar d_\ell-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)\left(-\frac{d_{kt}}{N_0+N_1}+\frac{\bar d_k}{N_0+N_1}\right)\\
&+\left(d_{kt}-\bar d_k-\frac{d_{kt}}{N_0+N_1}+\frac{\bar d_k}{N_0+N_1}\right)\left(-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)\\
&+\sum_{\substack{j=1\\ j\ne\ell,k}}^{N_0+N_1}\left(-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right)\left(-\frac{d_{kt}}{N_0+N_1}+\frac{\bar d_k}{N_0+N_1}\right)\Bigg]
\xrightarrow{p} 0.
\end{aligned}$$
The $\ell$th column of $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde D_{jt}'$ can be written as
$$\sum_{t=1}^{T}\widetilde X_{\ell t}\left(d_{\ell t}-\bar d_\ell\right) + \sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\left(-\frac{d_{\ell t}}{N_0+N_1}+\frac{\bar d_\ell}{N_0+N_1}\right) = \sum_{t=1}^{T}\widetilde X_{\ell t}\left(d_{\ell t}-\bar d_\ell\right),$$
since $\sum_{j=1}^{N_0+N_1}\widetilde X_{jt} = 0$, as shown in the proof of Proposition 1.
Similarly, the $\ell$th element of $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\tilde\eta_{jt}$ is equal to $\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)\left(\eta_{\ell t}-\bar\eta_\ell\right)$.
Thus, since all of these terms are $O_p(1)$,
$$\frac{1}{N_1+N_0}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde D_{jt}'\right]\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde X_{jt}'\right] \xrightarrow{p} 0,$$
$$\frac{1}{N_1+N_0}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde D_{jt}'\right]\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\tilde\eta_{jt}\right] \xrightarrow{p} 0.$$
Consistency of $\widehat\beta$ follows upon plugging the pieces back into (A-4) and applying Slutsky's theorem.
From the normal equations for $\widehat\alpha_j$ it is straightforward to show that
$$\begin{aligned}
\widehat A &= \left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\left(\widetilde Y_{jt}-\widetilde X_{jt}'\widehat\beta\right)\\
&= \left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\left(\widetilde D_{jt}'A+\tilde\eta_{jt}+\widetilde X_{jt}'\left(\beta-\widehat\beta\right)\right)\\
&= A + \left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\tilde\eta_{jt} + \left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde X_{jt}'\left(\beta-\widehat\beta\right).
\end{aligned}$$
Since $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'$ and $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde X_{jt}'$ are $O_p(1)$ (as shown above) and $\widehat\beta$ is consistent, the last term of this expression converges to zero. We derived the limits of $\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\widetilde D_{jt}'\right]^{-1}$ and $\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\widetilde D_{jt}\tilde\eta_{jt}$ above; putting this together, one sees that for each $\ell = 1,\dots,N_0$,
$$\widehat A(\ell) \xrightarrow{p} A(\ell) + \frac{\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)\left(\eta_{\ell t}-\bar\eta_\ell\right)}{\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}},$$
where $\widehat A(\ell)$ and $A(\ell)$ are the $\ell$th components of $\widehat A$ and $A$, respectively. This gives the result.
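The group-specific estimates $\widehat A$ can be computed exactly as in the proposition: a pooled regression on the within-transformed data with one policy regressor per treatment group. A minimal sketch (ours, not the authors' code; it reuses `demean_two_way` from the earlier sketch and assumes the first $N_0$ rows of each array are the treatment groups):

```python
# Illustrative sketch: two-way fixed effects with group-specific policy
# regressors D_jt(l) = d_jt * 1(j = l), estimated by pooled OLS on the
# demeaned data (numerically equivalent to including the dummies).
import numpy as np

def group_effects(Y, X, d, N0):
    """Y: (N, T); X: (N, T, K); d: (N, T). Returns (A_hat, beta_hat)."""
    N, T = Y.shape
    D = np.zeros((N, T, N0))
    for l in range(N0):
        D[l, :, l] = d[l]                       # own-group policy regressor
    R = np.concatenate([D, X], axis=2)          # (N, T, N0 + K)
    Rt = demean_two_way(R).reshape(N * T, -1)   # within transformation
    Yt = demean_two_way(Y).reshape(-1)
    coef, *_ = np.linalg.lstsq(Rt, Yt, rcond=None)
    return coef[:N0], coef[N0:]                 # A_hat, beta_hat
```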
A.4 Individual Data
The use of individual-level data raises three complications relative to the model in Section 2. We need to worry about (a) the fact that $\widehat\delta$ is estimated, (b) the term $\frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}Z_i$, which can change with the sample size, and (c) the error term involving $\varepsilon_i$. None of these issues is particularly difficult to deal with, but they do require verifying that the proofs still hold. We consider each case in turn. However, we first define some notation analogous to our earlier notation. Define:
$$\begin{aligned}
\tilde\varepsilon_{jt} \equiv{}& \frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}\varepsilon_i - \frac{1}{T}\sum_{\tau=1}^{T}\frac{1}{|M(j,\tau)|}\sum_{i\in M(j,\tau)}\varepsilon_i\\
&- \frac{1}{N_1}\sum_{\ell=N_0+1}^{N_0+N_1}\frac{1}{|M(\ell,t)|}\sum_{i\in M(\ell,t)}\varepsilon_i + \frac{1}{TN_1}\sum_{\tau=1}^{T}\sum_{\ell=N_0+1}^{N_0+N_1}\frac{1}{|M(\ell,\tau)|}\sum_{i\in M(\ell,\tau)}\varepsilon_i,\\
\widetilde Z_{jt} \equiv{}& \frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}Z_i - \frac{1}{T}\sum_{\tau=1}^{T}\frac{1}{|M(j,\tau)|}\sum_{i\in M(j,\tau)}Z_i\\
&- \frac{1}{N_1}\sum_{\ell=N_0+1}^{N_0+N_1}\frac{1}{|M(\ell,t)|}\sum_{i\in M(\ell,t)}Z_i + \frac{1}{TN_1}\sum_{\tau=1}^{T}\sum_{\ell=N_0+1}^{N_0+N_1}\frac{1}{|M(\ell,\tau)|}\sum_{i\in M(\ell,\tau)}Z_i.
\end{aligned}$$
It is straightforward to verify that
$$\widehat{\tilde\lambda}_{jt} = \tilde\lambda_{jt} + \widetilde Z_{jt}'\left(\delta-\widehat\delta\right) + \tilde\varepsilon_{jt} = \alpha\tilde d_{jt} + \widetilde X_{jt}'\beta + \widetilde Z_{jt}'\left(\delta-\widehat\delta\right) + \tilde\eta_{jt} + \tilde\varepsilon_{jt}.$$
We will also use the notation
$$\bar Z_{jt} = \frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}Z_i, \qquad \bar\varepsilon_{jt} = \frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}\varepsilon_i, \qquad \bar\varepsilon_j = \frac{1}{T}\sum_{t=1}^{T}\bar\varepsilon_{jt}.$$
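These objects support the two-step approach with individual data: estimate $\delta$ by within-cell OLS and carry the estimated cell effects $\widehat\lambda_{jt}$ into the group-level analysis above. A minimal sketch of that first step (ours, not the authors' code; the long-format DataFrame with columns `y`, `j`, `t` and covariate list `z_cols` is an assumption for the example):

```python
# Illustrative first step: delta-hat from y and Z demeaned within (j, t)
# cells (the fixed-effects estimator with cell dummies), and the implied
# cell effects lambda-hat_jt = ybar_jt - Zbar_jt' delta-hat.
import numpy as np
import pandas as pd

def first_step(df, z_cols):
    g = df.groupby(['j', 't'])
    y_w = (df['y'] - g['y'].transform('mean')).to_numpy()
    Z_w = (df[z_cols] - g[z_cols].transform('mean')).to_numpy()
    delta, *_ = np.linalg.lstsq(Z_w, y_w, rcond=None)
    cells = g[['y'] + z_cols].mean()
    lam = cells['y'] - cells[z_cols].to_numpy() @ delta
    return delta, lam          # lam is indexed by the (j, t) cells
```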
Extending the proof of Proposition 2 to this case takes a bit more work because we cannot directly apply the lemma from Newey and McFadden (1994). Instead, to prove the result, we directly follow the proof of Lemma 1 of Tauchen (1985).
Lemma A.1 Let $\theta = (w_j, g, b)$ be contained in a compact parameter space $\Theta$. Under Assumptions 1-3,
$$\widehat F_j(\theta) = \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\frac{\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{mt}+\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}} < w_j\right)$$
converges uniformly to
$$\phi_j(\theta) \equiv \Pr\left(\frac{\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{mt}+Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)\right)}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}} < w_j\right).$$
Proof. Take $\varepsilon > 0$ as given. Define $\theta = (w_j, b, g)$. Generically we will use the notation $\theta^*$ to denote $(w_j^*, b^*, g^*)$ and $\theta_k$ to denote $(w_{jk}, b_k, g_k)$. We continue to use the notation
$$\rho_{jt} \equiv \frac{d_{jt}-\bar d_j}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}}.$$
Now define
$$\begin{aligned}
u(\eta_m,\theta,d) = \sup_{|\theta^*-\theta|\le d}\Bigg|&1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g^*)-X_{mt}'(\beta-b^*)\right)\le w_j^*\right)\\
&- 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)\right)\le w_j\right)\Bigg|,
\end{aligned}$$
where $\eta_m = (\eta_{m1},\dots,\eta_{mT})$. Since the distribution of $\eta_{mt}$ is continuous, $\lim_{d\to0} u(\eta_i,\theta,d) = 0$ with $\theta$ fixed, almost surely. Applying this, we can define $d(w_j,\varepsilon)$ so that $E\left[u(\eta_i,\theta,2d(w_j,\varepsilon))\right] < \varepsilon$.
Now notice that
$$\begin{aligned}
&\sup_{|\theta-\theta^*|\le d}\left|1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)<w_j\right) - 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g^*)-X_{mt}'(\beta-b^*)\right)\le w_j^*\right)\right|\\
&\le \sup_{|\theta-\theta^*|\le 2d}\left|1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)\right)<w_j\right) - 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g^*)-X_{mt}'(\beta-b^*)\right)\le w_j^*\right)\right|\\
&\qquad\times 1\left(\sup_{|g-g^*|\le d}\left|\left(\bar Z_{mt}-Z_{mt}\right)'(\delta-g^*)+\bar\varepsilon_{mt}\right|\le d\right) + 1\left(\sup_{|g-g^*|\le d}\left|\left(\bar Z_{mt}-Z_{mt}\right)'(\delta-g^*)+\bar\varepsilon_{mt}\right|>d\right)\\
&\le u(\eta_i,\theta,2d) + 1\left(\sup_{|g-g^*|\le d}\left|\left(\bar Z_{mt}-Z_{mt}\right)'(\delta-g^*)+\bar\varepsilon_{mt}\right|>d\right).
\end{aligned}$$
Let $B(\theta)$ denote an open interval of radius $d(\theta,\varepsilon)$ about $\theta$. By compactness we can form an open covering $B_k = B(\theta_k,\varepsilon)$ for $k = 1,\dots,K$. Let $d_k = d(\theta_k,\varepsilon)$ and $\mu_k = E\left(u(\eta_i,\theta_k,2d_k)\right)$. Now note that if $\theta\in B_k$ then $\mu_k\le\varepsilon$ by definition of $d(\theta_k,\varepsilon)$. It must also be the case that
$$\left|\phi_j(\theta)-\phi_j(\theta_k)\right| \le E\left(u(\eta_i,\theta_k,d_k)\right) \le E\left(u(\eta_i,\theta_k,2d_k)\right) = \mu_k \le \varepsilon.$$
Now let $\theta\in B_k$ and consider
$$\begin{aligned}
&\left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)<w_j\right)-\phi_j(\theta)\right|\\
&\le \left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1}\left[1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)<w_j\right) - 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g_k)-X_{mt}'(\beta-b_k)\right)\le w_{jk}\right)\right]\right|\\
&\quad + \left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g_k)-X_{mt}'(\beta-b_k)\right)\le w_{jk}\right)-\phi_j(\theta_k)\right| + \left|\phi_j(\theta_k)-\phi_j(\theta)\right|\\
&\le \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sup_{|g-g^*|\le d}\left|\left(\bar Z_{mt}-Z_{mt}\right)'(\delta-g^*)+\bar\varepsilon_{mt}\right|>d\right) + \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1}\left(u(\eta_m,\theta_k,2d_k)-\mu_k\right)\\
&\quad + \mu_k + \left|\frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}+Z_{mt}'(\delta-g_k)-X_{mt}'(\beta-b_k)\right)\le w_{jk}\right)-\phi_j(\theta_k)\right| + \varepsilon\\
&\le 5\varepsilon
\end{aligned}$$
whenever $N_1\ge N_k(\varepsilon)$, almost surely, by applying the strong law of large numbers twice and taking into account that $1\left(\sup_{|g-g^*|\le d}\left|\left(\bar Z_{mt}-Z_{mt}\right)'(\delta-g^*)+\bar\varepsilon_{mt}\right|>d\right)$ is iid and converges almost surely to zero, so the first term converges to zero. Thus
$$\sup_{\theta\in\Theta}\left|\widehat F_j(\theta)-\phi_j(\theta)\right| \le 5\varepsilon$$
whenever $N_1\ge\max_k N_k(\varepsilon)$, almost surely, which proves the result.
A.4.1 Proof of Proposition 4

The first part of this proof is virtually identical to the proof of Proposition 1, while the second is very similar to that of Proposition 2. Consistency of $\widehat\delta$ follows immediately from the standard OLS argument.
A standard application of the partitioned inverse theorem makes it straightforward to show that
$$\begin{aligned}
\widehat\beta = \beta &+ \left(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde X_{jt}'}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}'\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\right)^{-1} \quad\text{(A-5)}\\
&\times\Bigg(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde\eta_{jt}}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt}\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\\
&\qquad + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde Z_{jt}'\left(\delta-\widehat\delta\right)}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\delta-\widehat\delta\right)\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\\
&\qquad + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde\varepsilon_{jt}}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\varepsilon_{jt}\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\Bigg).
\end{aligned}$$
First consider
$$\begin{aligned}
\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde\varepsilon_{jt}}{N_1+N_0}
&= \sum_{t=1}^{T}\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\widetilde X_{jt}\,\frac{\sum_{i\in M(j,t)}\varepsilon_i}{|M(j,t)|}\\
&= \sum_{t=1}^{T}\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\widetilde X_{jt}\,\frac{\sum_{i\in M(j,t)}\varepsilon_i}{\phi_{jt}N_1}\\
&= \sum_{t=1}^{T}\left(\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}|M(j,t)|\right)\left(\frac{\sum_{j=1}^{N_1+N_0}\sum_{i\in M(j,t)}\widetilde X_{jt}\,\varepsilon_i/\phi_{jt}}{N_1\sum_{j=1}^{N_1+N_0}|M(j,t)|}\right).
\end{aligned}$$
For each $t$, the first term in parentheses converges to $\phi_t$. We can apply the law of large numbers to the second to show that it converges to zero. Thus the whole term converges to zero in probability.
Now consider the term
$$\begin{aligned}
\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\varepsilon_{jt}
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde\varepsilon_{jt} + \sum_{t=1}^{T}\left(\bar d-\bar d_t\right)\sum_{j=1}^{N_0+N_1}\tilde\varepsilon_{jt}\\
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\tilde\varepsilon_{jt} \xrightarrow{p} 0,
\end{aligned}$$
since this last expression involves a finite number of terms and $\tilde\varepsilon_{jt}\xrightarrow{p}0$.
b follows upon plugging the pieces back into (A-5) and applying Slutsky’s
Consistency for β
theorem.
14
From the normal equation for $\widehat\alpha$ it is straightforward to show that
$$\widehat\alpha = \alpha + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\delta-\widehat\delta\right)}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}} + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\varepsilon_{jt}}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}} + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt}}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}} + \left[\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\right]\left(\beta-\widehat\beta\right).$$
The only new term we do not see in the proof of Proposition 1 is $\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\varepsilon_{jt}$, but we have already shown that this converges in probability to zero, so the result for $W$ holds.
We now consider consistency of $\widehat\Gamma$. Since $\Gamma$ is defined conditional on $d_{jt}$ for $j = 1,\dots,N_0$, $t = 1,\dots,T$, every probability in the rest of this proof conditions on this set. To simplify the notation, we omit this explicit conditioning. Thus, every probability statement and distribution function in this proof should be interpreted as conditioning on $d_{jt}$ for $j = 1,\dots,N_0$, $t = 1,\dots,T$.
We continue to use the notation
$$\rho_{jt} \equiv \frac{d_{jt}-\bar d_j}{\sum_{\ell=1}^{N_0}\sum_{t=1}^{T}\left(d_{\ell t}-\bar d_\ell\right)^{2}}.$$
For each $j = 1,\dots,N_0$ define the random variable
$$W_j \equiv \sum_{t=1}^{T}\rho_{jt}\,\eta_{jt}.$$
Note that we used the fact that, since $\bar\eta_j$ is time invariant, $\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\bar\eta_j = 0$. Let $F_j$ be the distribution of $W_j$ for $j = 1,\dots,N_0$. Then note that
$$\Gamma(w) = \Pr\left(\sum_{j=1}^{N_0}\sum_{t=1}^{T}\rho_{jt}\,\eta_{jt} < w\right) = \int\cdots\int 1\left(\sum_{j=1}^{N_0}W_j < w\right)dF_1(W_1)\cdots dF_{N_0}(W_{N_0}).$$
We can also write
$$\widehat\Gamma(w) = \int\cdots\int 1\left(\sum_{j=1}^{N_0}W_j < w\right)d\widehat F_1(W_1;\widehat\delta,\widehat\beta)\cdots d\widehat F_{N_0}(W_{N_0};\widehat\delta,\widehat\beta),$$
where $\widehat F_j(\cdot;\widehat\delta,\widehat\beta)$ is the empirical c.d.f. one gets from the residuals using the control states only. That is, more generally,
$$\widehat F_j(w_j,g,b) \equiv \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde\eta_{mt}-\widetilde Z_{mt}'(\delta-g)-\widetilde X_{mt}'(\beta-b)+\tilde\varepsilon_{mt}\right) < w_j\right).$$
It is easy to verify that
$$\begin{aligned}
\widehat F_j(w_j,\widehat\delta,\widehat\beta) &\equiv \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde\eta_{mt}-\widetilde Z_{mt}'\left(\delta-\widehat\delta\right)-\widetilde X_{mt}'\left(\beta-\widehat\beta\right)+\tilde\varepsilon_{mt}\right) < w_j\right)\\
&= \frac{1}{N_1}\sum_{m=N_0+1}^{N_0+N_1} 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\widehat{\tilde\lambda}_{mt}-\widetilde X_{mt}'\widehat\beta\right) < w_j\right).
\end{aligned}$$
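The draws behind this empirical c.d.f. can be formed exactly as in the sketch after Proposition 2's definitions, swapping the group-level outcome for the first-step cell effects. A short sketch under the same naming assumptions (ours, not the authors'):

```python
# Illustrative: pseudo-W_j draws with individual data. lam_ctrl is the
# (N1, T) array of first-step cell effects lambda-hat for the control
# groups; demean_two_way and rho_weights are from the earlier sketches.
def W_draws_individual(j, rho, lam_ctrl, X_ctrl, beta_hat):
    """sum_t rho_jt (lambda-hat-tilde_mt - Xtilde_mt' beta-hat), one draw
    per control group m."""
    resid = demean_two_way(lam_ctrl) - demean_two_way(X_ctrl) @ beta_hat
    return resid @ rho[j]
```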
To avoid repeating the expression we define
$$\phi_j(w_j,g,b) \equiv \Pr\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{jt}-Z_{jt}'(\delta-g)-X_{jt}'(\beta-b)\right) < w_j\right),$$
where $Z_{jt} \equiv E\left(Z_i \mid j(i)=j,\ t(i)=t\right)$. Note that $\phi_j(w_j,\delta,\beta) = F_j(w_j)$. The proof strategy is first to demonstrate that $\widehat F_j(w_j,\widehat\delta,\widehat\beta)$ converges to $\phi_j(w_j,\delta,\beta)$ uniformly over $w_j$. The final part of the proof, that $\widehat\Gamma(w)$ is a consistent estimate of $\Gamma(w)$, is identical to that argument in the proof of Proposition 2.
Following the same line as the proof of Proposition 2, let
$$\widehat\omega_j = \sum_{t=1}^{T}\rho_{jt}\left(\bar\eta_t - \bar Z_t'\left(\delta-\widehat\delta\right) - \bar X_t'\left(\beta-\widehat\beta\right) + \bar\varepsilon_t\right).$$
Let $\Omega$ be a compact parameter space for $w_j$, and $\Theta$ a compact subset of the parameter space for $\left(\widehat\omega_j,\widehat\delta,\widehat\beta\right)$ in which $(0,\delta,\beta)$ is an interior point. We use the notation $\sum_m$ rather than $\sum_{m=N_0+1}^{N_0+N_1}$ in order to get the expressions to fit on a page.
First, for each $j = 1,\dots,N_0$, consider the difference between $\widehat F_j(w_j,\widehat\delta,\widehat\beta)$ and $\phi_j(w_j,\delta,\beta)$:
$$\begin{aligned}
&\sup_{w_j\in\Omega}\left|\widehat F_j(w_j,\widehat\delta,\widehat\beta)-\phi_j(w_j,\delta,\beta)\right|\\
&= \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_m 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\tilde\eta_{mt}-\widetilde Z_{mt}'\left(\delta-\widehat\delta\right)-\widetilde X_{mt}'\left(\beta-\widehat\beta\right)+\tilde\varepsilon_{mt}\right)<w_j\right)-\phi_j(w_j,\delta,\beta)\right|\\
&= \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_m 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-\bar Z_{mt}'\left(\delta-\widehat\delta\right)-X_{mt}'\left(\beta-\widehat\beta\right)+\bar\varepsilon_{mt}\right)<w_j+\widehat\omega_j\right)-\phi_j(w_j,\delta,\beta)\right|\\
&\le \sup_{w_j\in\Omega}\left|\frac{1}{N_1}\sum_m 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-\bar Z_{mt}'\left(\delta-\widehat\delta\right)-X_{mt}'\left(\beta-\widehat\beta\right)+\bar\varepsilon_{mt}\right)<w_j+\widehat\omega_j\right)-\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)\right|\\
&\qquad + \sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)-\phi_j(w_j,\delta,\beta)\right|\\
&\le \sup_{\substack{w_j\in\Omega\\ (\omega_j,g,b)\in\Theta}}\left|\frac{1}{N_1}\sum_m 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)<w_j+\omega_j\right)-\phi_j(w_j+\omega_j,g,b)\right|\\
&\qquad + \Pr\left(\left(\widehat\omega_j,\widehat\delta,\widehat\beta\right)\notin\Theta\right) + \sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)-\phi_j(w_j,\delta,\beta)\right|.
\end{aligned}$$
First consider $\sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)-\phi_j(w_j,\delta,\beta)\right|$. Using a standard mean-value expansion of $\phi_j$, for some $\left(\tilde\omega_j,\tilde\delta,\tilde\beta\right)$,
$$\begin{aligned}
&\sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)-\phi_j(w_j,\delta,\beta)\right|\\
&= \sup_{w_j\in\Omega}\left|\frac{\partial\phi_j\left(w_j+\tilde\omega_j,\tilde\delta,\tilde\beta\right)}{\partial w_j}\,\widehat\omega_j + \frac{\partial\phi_j\left(w_j+\tilde\omega_j,\tilde\delta,\tilde\beta\right)}{\partial g}'\left(\widehat\delta-\delta\right) + \frac{\partial\phi_j\left(w_j+\tilde\omega_j,\tilde\delta,\tilde\beta\right)}{\partial b}'\left(\widehat\beta-\beta\right)\right|.
\end{aligned}$$
The proof that $\partial\phi_j\left(w_j+\tilde\omega_j,\tilde\delta,\tilde\beta\right)/\partial b$ is bounded is analogous to that in the proof of Proposition 2. Thus $\sup_{w_j\in\Omega}\left|\phi_j\left(w_j+\widehat\omega_j,\widehat\delta,\widehat\beta\right)-\phi_j(w_j,\delta,\beta)\right|$ converges to zero since $\widehat\beta$ is consistent. The same argument holds for the other two pieces. Since $\left(\widehat\omega_j,\widehat\delta,\widehat\beta\right)$ converges in probability to $(0,\delta,\beta)$, which is an interior point of $\Theta$, $\Pr\left(\left(\widehat\omega_j,\widehat\delta,\widehat\beta\right)\notin\Theta\right)$ converges to zero.
Next consider the term
$$\sup_{\substack{w_j\in\Omega\\ (\omega_j,g,b)\in\Theta}}\left|\frac{1}{N_1}\sum_m 1\left(\sum_{t=1}^{T}\rho_{jt}\left(\eta_{mt}-\bar Z_{mt}'(\delta-g)-X_{mt}'(\beta-b)+\bar\varepsilon_{mt}\right)<w_j+\omega_j\right)-\phi_j(w_j+\omega_j,g,b)\right|.$$
Since $w_j$ and $\omega_j$ enter the expression in identical ways, we can combine them into one parameter, expand the compact parameter set appropriately, and apply Lemma A.1 to show that this term converges to zero. Then, putting the three pieces together,
$$\sup_{w_j\in\Omega}\left|\widehat F_j(w_j,\widehat\delta,\widehat\beta)-\phi_j(w_j,\delta,\beta)\right| \xrightarrow{p} 0.$$
Finally, the final step of the proof, that $\widehat\Gamma(w)$ converges to $\Gamma(w)$, is identical to that of Proposition 2.
A.4.2 Results for $|M(j,t)|$ fixed

We now consider the case in which $|M(j,t)|$ is fixed. This is a straightforward extension of the proofs of Propositions 1 and 2 in which the group-by-time error term becomes $\eta_{jt}+\bar\varepsilon_{jt}$ rather than just $\eta_{jt}$. We replace Assumption 3 with

Assumption A.1 $\varepsilon_i$ is IID and orthogonal to $[\,Z_i\ I_i\,]$, which is full rank. $|M(j,t)|$ is fixed with the sample size, and for all $j_1$, $j_2$, and $t$,
$$|M(j_1,t)| = |M(j_2,t)|.$$
Note that without loss of generality we can essentially incorporate $\tilde\varepsilon_{jt}$ into $\tilde\eta_{jt}$ and repeat what we have done.
Proposition A.1 Under Assumptions 1 and A.1,
$$\widehat\delta \xrightarrow{p} \delta, \qquad \widehat\beta \xrightarrow{p} \beta, \qquad \widehat\alpha \xrightarrow{p} \alpha + W$$
as $N_1\to\infty$, where
$$W = \frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}+\bar\varepsilon_{jt}-\bar\eta_j-\bar\varepsilon_j\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^{2}}.$$

Proof. This is virtually identical to the proof of Proposition 1. Consistency of $\widehat\delta$ follows directly from the standard argument for consistency of parameters in fixed effects models.
First, a standard application of the partitioned inverse theorem makes it straightforward to show that
$$\begin{aligned}
\widehat\beta = \beta &+ \left(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde X_{jt}'}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}'\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\right)^{-1}\\
&\times\Bigg(\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\left(\tilde\eta_{jt}+\tilde\varepsilon_{jt}\right)}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\tilde d_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\left(\tilde\eta_{jt}+\tilde\varepsilon_{jt}\right)\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\\
&\qquad + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde Z_{jt}'\left(\delta-\widehat\delta\right)}{N_1+N_0} - \frac{\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}\right]\left[\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\delta-\widehat\delta\right)\right]}{(N_1+N_0)\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\Bigg).
\end{aligned}$$
We have two new pieces to consider relative to the proof of Proposition 1. First consider
$$\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right)}{N_1+N_0}.$$
Since $\widetilde Z_{jt}$ is independent across $j$, $\frac{1}{N_1+N_0}\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde Z_{jt}'$ converges in probability to its mean, so by Slutsky,
$$\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\widetilde X_{jt}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right)}{N_1+N_0} \xrightarrow{p} 0.$$
Now consider the term
$$\begin{aligned}
\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right)
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\widetilde Z_{jt}'\left(\widehat\delta-\delta\right) + \sum_{t=1}^{T}\left(\bar d-\bar d_t\right)\sum_{j=1}^{N_0+N_1}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right)\\
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\widetilde Z_{jt}'\left(\widehat\delta-\delta\right) \xrightarrow{p} 0,
\end{aligned}$$
since this last expression involves a finite number of terms and $\left(\widehat\delta-\delta\right)\xrightarrow{p}0$. Consistency of $\widehat\beta$ follows upon plugging the pieces back into the expression for $\widehat\beta$ above and applying Slutsky's theorem.
From the normal equation for $\widehat\alpha$ it is straightforward to show that
$$\widehat\alpha = \alpha + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right)}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}} + \frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\left(\tilde\eta_{jt}+\tilde\varepsilon_{jt}\right)}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}} + \left[\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\right]\left(\beta-\widehat\beta\right).$$
We showed in the proof of Proposition 1 that
$$\left[\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde X_{jt}'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}^{2}}\right]\left(\beta-\widehat\beta\right) \xrightarrow{p} 0$$
and that
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\tilde\eta_{jt} \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right).$$
By the same logic,
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\left(\tilde\eta_{jt}+\tilde\varepsilon_{jt}\right) \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}+\bar\varepsilon_{jt}-\bar\eta_j-\bar\varepsilon_j\right).$$
Finally, we showed above that
$$\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\tilde d_{jt}\widetilde Z_{jt}'\left(\widehat\delta-\delta\right) \xrightarrow{p} 0.$$
Putting the terms together gives the result.
Proposition A.2 Under Assumptions 1, 2, and A.1, $\widehat\Gamma(w)$ converges uniformly to $\Gamma(w)$.
Proof. Note that
$$\begin{aligned}
\widehat\Gamma(w) &\equiv \left(\frac{1}{N_1}\right)^{N_0}\sum_{\ell_1=N_0+1}^{N_0+N_1}\cdots\sum_{\ell_{N_0}=N_0+1}^{N_0+N_1} 1\left(\frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\widehat{\tilde\lambda}_{\ell_j t}-\widetilde X_{\ell_j t}'\widehat\beta\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^{2}} < w\right)\\
&= \left(\frac{1}{N_1}\right)^{N_0}\sum_{\ell_1=N_0+1}^{N_0+N_1}\cdots\sum_{\ell_{N_0}=N_0+1}^{N_0+N_1} 1\left(\frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\left(\tilde\eta_{\ell_j t}-\widetilde Z_{\ell_j t}'\left(\delta-\widehat\delta\right)-\widetilde X_{\ell_j t}'\left(\beta-\widehat\beta\right)\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)^{2}} < w\right).
\end{aligned}$$
Thus one can see that this result follows directly from the proof of Proposition 2, where we just reinterpret $\left[\widetilde Z_{jt}'\ \widetilde X_{jt}'\right]'$ as "$\widetilde X_{jt}$", $\left[\delta'\ \beta'\right]'$ as "$\beta$", and $\widehat{\tilde\lambda}_{jt}$ as "$\widetilde Y_{jt}$" in that proof.
A.4.3 Estimation with Population Weighted Regression

Now we consider estimation of the model directly. That is, we imagine that one directly runs the regression
$$Y_i = \alpha d_{j(i)t(i)} + Z_i'\delta + \theta_j + \gamma_t + \eta_{jt} + \varepsilon_i$$
using state dummies and time dummies. Note that the distinction between $Z_i$ and $X_{jt}$ is no longer necessary, so we have just incorporated $X_{j(i)t(i)}'\beta$ into $Z_i'\delta$.
We first need to formally define what these objects are. For a generic variable $Z_i$ define
$$\bar Z_j = \frac{\sum_{t=1}^{T}\sum_{i\in M(j,t)}Z_i}{\sum_{t=1}^{T}|M(j,t)|}.$$
Since, in general, the number of individuals varies across $(j,t)$ cells, derivation of the difference in differences operator requires additional notation. We need to formally define the full set of indicators for groups, $\{g_{\ell i}\}_{\ell=1}^{N_0+N_1}$, and time periods, $\{p_{\tau i}\}_{\tau=1}^{T-1}$, so that
$$g_{\ell i} \equiv 1(\ell = j(i)), \qquad\text{(A-6)}$$
$$p_{\tau i} \equiv 1(\tau = t(i)). \qquad\text{(A-7)}$$
Further define $G_i$ and $P_i$ as the vectors of these dummy variables,
$$G_i \equiv \left[\,g_{1i}\ \ g_{2i}\ \ \dots\ \ g_{N_0+N_1,i}\,\right]', \qquad\text{(A-8)}$$
$$P_i \equiv \left[\,p_{1i}\ \ p_{2i}\ \ \dots\ \ p_{T-1,i}\,\right]'. \qquad\text{(A-9)}$$
Then, for any individual-specific random variable $Z_i$, let $\widetilde Z_i$ be the residual from a linear regression of $Z_i$ on $\{g_{\ell i}\}_{\ell=1}^{N_0+N_1}$ and $\{p_{\tau i}\}_{\tau=1}^{T-1}$. That is,
$$\widetilde Z_i \equiv Z_i - \begin{bmatrix}G_i\\ P_i\end{bmatrix}'\left(\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{h\in M(j,t)}\begin{bmatrix}G_h\\ P_h\end{bmatrix}\begin{bmatrix}G_h\\ P_h\end{bmatrix}'\right)^{-1}\left(\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{h\in M(j,t)}\begin{bmatrix}G_h\\ P_h\end{bmatrix}Z_h\right).$$
We need to strengthen Assumption 1 somewhat. In the two-stage approach one does not need to take a stand on the relationship between $Z_i$ and $\eta_{j(i)t(i)}$ because we can obtain consistent estimates of $\delta$ via fixed effects. This is no longer the case if we estimate the model in one step. We use the assumption

Assumption A.2 $\left(\left(\eta_{j1},\{Z_i : i\in M(j,1)\}\right),\dots,\left(\eta_{jT},\{Z_i : i\in M(j,T)\}\right)\right)$ is stationary and strong mixing across groups; $\left(\eta_{j1},\dots,\eta_{jT}\right)$ has expectation zero conditional on $(d_{j1},\dots,d_{jT})$ and $\{Z_i : i\in\cup_{t=1,\dots,T}M(j,t)\}$; and all random variables have finite second moments.
We need a regularity condition to guarantee enough degrees of freedom that regressions upon time and group indicators can be run.

Assumption A.3
$$\frac{\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{i\in M(j,t)}P_iP_i' - \left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{i\in M(j,t)}P_iG_i'\right]\left(\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{i\in M(j,t)}G_iG_i'\right)^{-1}\left[\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{i\in M(j,t)}G_iP_i'\right]}{\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}|M(j,t)|} \xrightarrow{p} \Omega,$$
where $\Omega$ is of full rank.
Under this condition, we can rewrite the model as
$$\widetilde Y_i = \alpha\tilde d_{j(i)t(i)} + \widetilde Z_i'\delta + \tilde\eta_{j(i)t(i)} + \tilde\varepsilon_i. \qquad\text{(A-10)}$$
We estimate $\alpha$ and $\delta$ in equation (A-10) by OLS, letting $\widehat\alpha$ and $\widehat\delta$ denote the corresponding estimators. This requires the usual OLS rank condition, stated as
Assumption A.4
$$\frac{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\widetilde Z_i\widetilde Z_i'}{\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}|M(j,t)|} \xrightarrow{p} \Sigma_z,$$
where $\Sigma_z$ is finite and of full rank.
We will make use of the following lemma.

Lemma A.2 Consider a regression of $d_{j(i)t(i)}$ on group dummies ($G_i$) and time dummy variables ($P_i$) as defined in equations (A-6)-(A-9). Let $\widehat a_t$ be the coefficient on the time dummy for time period $t = 1,\dots,T-1$ and $\widehat a_T\equiv 0$. Under Assumptions A.3 and 3,
$$\tilde d_{j(i)t(i)} = d_{j(i)t(i)} - \bar d_{j(i)} - \left(\widehat a_{t(i)} - \frac{\sum_{\tau=1}^{T-1}|M(j(i),\tau)|\,\widehat a_\tau}{\sum_{\tau=1}^{T}|M(j(i),\tau)|}\right)$$
and $\widehat a_\tau = O_p\!\left(\frac{1}{N_1}\right)$, $\tau = 1,\dots,T-1$.
Proof. To streamline the notation, let $\sum_i$ denote $\sum_{j=1}^{N_1+N_0}\sum_{t=1}^{T}\sum_{i\in M(j,t)}$ and let
$$m_0 \equiv \sum_{j=1}^{N_0}\sum_{t=1}^{T}|M(j,t)|, \qquad m_1 \equiv \sum_{j=N_0+1}^{N_1+N_0}\sum_{t=1}^{T}|M(j,t)|, \qquad m \equiv m_0 + m_1.$$
Note that $m_0$ is fixed but $m_1$ and $m$ get large as $N_1\to\infty$. We will use this notation in the proof of Proposition A.3.
Now consider a regression of $d_{j(i)t(i)}$ on group dummies and time dummies. We will write this regression equation as
$$d_{j(i)t(i)} = P_i'\widehat a + G_i'\widehat b + \tilde d_{j(i)t(i)},$$
where $P_i$ and $G_i$ are as defined in equations (A-6)-(A-9). The first part of our lemma is a standard regression result with dummy variables. Note that we can rewrite this regression equation as
$$d_{j(i)t(i)} - \widehat a_{t(i)} = G_i'\widehat b + \tilde d_{j(i)t(i)}.$$
Since $\tilde d_{j(i)t(i)}$ is orthogonal to $G_i$, we could construct residuals by regressing $d_{j(i)t(i)}-\widehat a_{t(i)}$ on a full set of group dummies and taking residuals. However, it is well known that this will lead to taking deviations of the left-hand-side variable from group means, so that
$$\begin{aligned}
\tilde d_{j(i)t(i)} &= d_{j(i)t(i)} - \widehat a_{t(i)} - \frac{\sum_{\tau=1}^{T}\sum_{\ell\in M(j(i),\tau)}\left(d_{j(\ell)\tau}-\widehat a_{t(\ell)}\right)}{\sum_{\tau=1}^{T}|M(j(i),\tau)|}\\
&= d_{j(i)t(i)} - \bar d_{j(i)} - \left(\widehat a_{t(i)} - \frac{\sum_{\tau=1}^{T-1}|M(j(i),\tau)|\,\widehat a_\tau}{\sum_{\tau=1}^{T}|M(j(i),\tau)|}\right).
\end{aligned}$$
Next consider the derivation of $\widehat a$. Using the partitioned inverse theorem,
$$\widehat a = \frac{1}{m}\left(\frac{1}{m}\sum_i P_iP_i' - \frac{1}{m}\sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_iP_i'\right)^{-1}\left[\sum_i P_i d_{j(i)t(i)} - \sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_i d_{j(i)t(i)}\right].$$
Assumption A.3 implies that we can rewrite this as
$$\widehat a = \left(\Omega+o_p(1)\right)^{-1}\frac{1}{m}\left[\sum_i P_i d_{j(i)t(i)} - \sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_i d_{j(i)t(i)}\right].$$
Now consider the last term, $\sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_i d_{j(i)t(i)}$. It is straightforward to show that this is a $(T-1)\times 1$ vector with generic element $t$,
$$\sum_{j=1}^{N_0+N_1}|M(j,t)|\,\frac{\sum_{\tau=1}^{T}|M(j,\tau)|\,d_{j\tau}}{\sum_{\tau=1}^{T}|M(j,\tau)|} = \sum_{j=1}^{N_0+N_1}|M(j,t)|\,\bar d_j.$$
Thus the $(T-1)\times 1$ vector $\sum_i P_i d_{j(i)t(i)} - \sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_i d_{j(i)t(i)}$ has generic element $t$,
$$\sum_{j=1}^{N_0+N_1}\left[\,|M(j,t)|\,d_{jt} - |M(j,t)|\,\bar d_j\,\right] = \sum_{j=1}^{N_0+N_1}|M(j,t)|\left(d_{jt}-\bar d_j\right) = \sum_{j=1}^{N_0}|M(j,t)|\left(d_{jt}-\bar d_j\right).$$
Under Assumption 3 we can write
$$\widehat a = \frac{1}{N_0+N_1}\left(\Omega+o_p(1)\right)^{-1}\times\left[\frac{N_0+N_1}{m}\sum_i P_i d_{j(i)t(i)} - \frac{N_0+N_1}{m}\sum_i P_iG_i'\left(\sum_i G_iG_i'\right)^{-1}\sum_i G_i d_{j(i)t(i)}\right].$$
As above, the last term in brackets is a $(T-1)\times 1$ vector with a generic element $t$ that can be written as
$$\frac{N_0+N_1}{m}\sum_{j=1}^{N_0}|M(j,t)|\left(d_{jt}-\bar d_j\right) = \sum_{j=1}^{N_0}\frac{|M(j,t)|}{\frac{1}{N_0+N_1}\sum_{\ell=1}^{N_0+N_1}\sum_{\tau=1}^{T}|M(\ell,\tau)|}\left(d_{jt}-\bar d_j\right) \xrightarrow{p} \frac{\sum_{j=1}^{N_0}\phi_{jt}\left(d_{jt}-\bar d_j\right)}{\sum_{t=1}^{T}\phi_t},$$
which is $O_p(1)$. Since $\widehat a$ is this $O_p(1)$ vector scaled by $1/(N_0+N_1)$, we have $\widehat a_\tau = O_p\!\left(\frac{1}{N_1}\right)$, which completes the proof.
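Lemma A.2's residual formula can also be checked numerically by running the dummy regression directly. A self-contained sketch with made-up cell sizes (purely illustrative; the numbers are not from the paper):

```python
# Check: the OLS residual of d_{j(i)t(i)} on group and time dummies equals
# d_jt - dbar_j - (a_t - sum_tau |M(j,tau)| a_tau / sum_tau |M(j,tau)|).
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 4
sizes = rng.integers(2, 7, size=(N, T))           # |M(j, t)|, unbalanced
d = rng.random((N, T))                            # arbitrary policy paths

# individual-level rows: each (j, t) cell repeated |M(j, t)| times
rows = [(j, t) for j in range(N) for t in range(T)
        for _ in range(sizes[j, t])]
J = np.array([r[0] for r in rows])
Tt = np.array([r[1] for r in rows])
G = np.eye(N)[J]                                  # group dummies
P = np.eye(T)[Tt][:, :-1]                         # time dummies, a_T = 0
X = np.hstack([P, G])
y = d[J, Tt]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a = np.append(coef[:T - 1], 0.0)                  # time coefficients
resid = y - X @ coef                              # the OLS residual d-tilde

# the closed form from the lemma
dbar = (d * sizes).sum(axis=1) / sizes.sum(axis=1)    # size-weighted dbar_j
adj = a[Tt] - (sizes[J] @ a) / sizes[J].sum(axis=1)   # a-tilde_{j(i)t(i)}
assert np.allclose(resid, y - dbar[J] - adj)
```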
Proposition A.3 Under Assumptions 3 and A.2-A.4,
$$\widehat\delta \xrightarrow{p} \delta, \qquad \widehat\alpha \xrightarrow{p} \alpha + \frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)^{2}}$$
as $N_1\to\infty$.
Proof. In this proof we make use of the notation defined in the proof of Lemma A.2. First, a standard application of the partitioned inverse theorem makes it straightforward to show that
$$\begin{aligned}
\widehat\delta = \delta &+ \left(\frac{1}{m}\sum_i\widetilde Z_i\widetilde Z_i' - \frac{m_0}{m}\,\frac{\left[\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i\right]\left[\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i'\right]}{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2}}\right)^{-1}\\
&\times\left(\frac{1}{m}\sum_i\widetilde Z_i\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right) - \frac{m_0}{m}\,\frac{\left[\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i\right]\left[\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right)\right]}{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2}}\right).
\end{aligned}$$
Now consider each piece in turn. Assumption A.4 states that
$$\frac{1}{m}\sum_i\widetilde Z_i\widetilde Z_i' \xrightarrow{p} \Sigma_z.$$
Using Assumptions A.2-A.3 and invoking the law of large numbers,
$$\frac{1}{m}\sum_i\widetilde Z_i\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right) \xrightarrow{p} 0.$$
Define $\widehat a_t$ as in the statement of Lemma A.2 and then define
$$\tilde a_{jt} \equiv \widehat a_t - \frac{\sum_{\tau=1}^{T-1}|M(j,\tau)|\,\widehat a_\tau}{\sum_{\tau=1}^{T}|M(j,\tau)|}.$$
Lemma A.2 states that $\tilde d_{j(i)t(i)} = d_{j(i)t(i)} - \bar d_{j(i)} - \tilde a_{j(i)t(i)}$. Note also that for $j > N_0$, $d_{jt}-\bar d_j = 0$.
Thus
$$\begin{aligned}
\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2} &= \frac{1}{m_0}\sum_{j=1}^{N_0}\sum_{t=1}^{T}|M(j,t)|\,\tilde d_{jt}^{2} + \frac{1}{m_0}\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}|M(j,t)|\,\tilde d_{jt}^{2}\\
&= \frac{1}{m_0}\sum_{j=1}^{N_0}\sum_{t=1}^{T}|M(j,t)|\left[\left(d_{jt}-\bar d_j\right)^{2} - 2\tilde a_{jt}\left(d_{jt}-\bar d_j\right) + \tilde a_{jt}^{2}\right] + \frac{1}{m_0}\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}\tilde a_{jt}^{2}\,|M(j,t)|\\
&\xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)^{2}.
\end{aligned}$$
This result follows because
$$\begin{aligned}
\frac{1}{m_0}\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}\tilde a_{jt}^{2}\,|M(j,t)|
&= \frac{1}{m_0}\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T}\left[\widehat a_t-\frac{\sum_{\tau=1}^{T-1}|M(j,\tau)|\,\widehat a_\tau}{\sum_{\tau=1}^{T}|M(j,\tau)|}\right]^{2}|M(j,t)|\\
&= \sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T-1}\widehat a_t^{2}\,\frac{|M(j,t)|}{m_0} - 2\sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T-1}\widehat a_t\,\frac{\sum_{\tau=1}^{T-1}|M(j,\tau)|\,\widehat a_\tau}{\sum_{s=1}^{T}|M(j,s)|}\,\frac{|M(j,t)|}{m_0}\\
&\qquad + \sum_{j=N_0+1}^{N_0+N_1}\sum_{t=1}^{T-1}\sum_{\tau=1}^{T-1}\widehat a_t\widehat a_\tau\,\frac{|M(j,\tau)|\,|M(j,t)|}{m_0\sum_{s=1}^{T}|M(j,s)|}\\
&\xrightarrow{p} 0,
\end{aligned}$$
since each product $\widehat a_t\widehat a_\tau$ is $O_p\!\left(1/N_1^{2}\right)$, while each of the cell-size sums above is $O_p(N_1)$ after division by $m_0$, so every term vanishes.
Next consider the object
$$\begin{aligned}
\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i
&= \frac{1}{m_0}\sum_{j=1}^{N_0}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\left(d_{jt}-\bar d_j\right)\widetilde Z_i - \frac{1}{m_0}\sum_{j=1}^{N_0+N_1}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\tilde a_{jt}\widetilde Z_i\\
&= \frac{1}{m_0}\sum_{j=1}^{N_0}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\left(d_{jt}-\bar d_j\right)\widetilde Z_i - \frac{1}{m_0}\sum_{t=1}^{T-1}\widehat a_t\sum_{j=1}^{N_0+N_1}\sum_{i\in M(j,t)}\widetilde Z_i\\
&\qquad + \frac{1}{m_0}\sum_{j=1}^{N_0+N_1}\frac{\sum_{\tau=1}^{T-1}|M(j,\tau)|\,\widehat a_\tau}{\sum_{\tau=1}^{T}|M(j,\tau)|}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\widetilde Z_i\\
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\frac{|M(j,t)|}{m_0}\left[\frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}\widetilde Z_i\right]\\
&\xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)E\left(\widetilde Z_i\mid i\in M(j,t)\right) = O_p(1).
\end{aligned}$$
We used the fact that $\widetilde Z_i$ is the residual from a regression on time and state dummies, so $\sum_{j=1}^{N_0+N_1}\sum_{i\in M(j,t)}\widetilde Z_i = 0$ and $\sum_{t=1}^{T}\sum_{i\in M(j,t)}\widetilde Z_i = 0$.
An analogous argument gives
$$\begin{aligned}
\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right)
&= \frac{1}{m_0}\sum_{j=1}^{N_0}\sum_{t=1}^{T}\sum_{i\in M(j,t)}\left(d_{jt}-\bar d_j\right)\left(\tilde\eta_{jt}+\tilde\varepsilon_i\right)\\
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\left(d_{jt}-\bar d_j\right)\frac{|M(j,t)|}{m_0}\left[\tilde\eta_{jt}+\frac{1}{|M(j,t)|}\sum_{i\in M(j,t)}\tilde\varepsilon_i\right]\\
&\xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)E\left(\tilde\eta_{jt}+\tilde\varepsilon_i\mid i\in M(j,t)\right)\\
&= \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right) = O_p(1).
\end{aligned}$$
The last step follows because, for any $\tau = 1,\dots,T$, $E\left(\eta_{j\tau}-\bar\eta_j\right) = E\left(\varepsilon_i-\bar\varepsilon_j\mid t(i)=\tau\right) = 0$. So, for a regression of either $\left(\eta_{j\tau}-\bar\eta_j\right)$ or $\left(\varepsilon_i-\bar\varepsilon_j\right)$ on time dummies, the coefficients on the dummy variables will converge to zero, so $\tilde\eta_{jt}\xrightarrow{p}\left(\eta_{j\tau}-\bar\eta_j\right)$ and $\tilde\varepsilon_i\xrightarrow{p}\left(\varepsilon_i-\bar\varepsilon_j\right)$.
Putting all the objects into the expression for $\widehat\delta$, one can see that $\widehat\delta$ is consistent. Now consider $\widehat\alpha$. It is straightforward to show that
$$\widehat\alpha-\alpha = \frac{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right)}{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2}} + \frac{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i'\left(\delta-\widehat\delta\right)}{\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2}}.$$
We have shown that
$$\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}^{2} \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)^{2},$$
$$\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\widetilde Z_i' \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)E\left(\widetilde Z_i\mid i\in M(j,t)\right)',$$
$$\left(\delta-\widehat\delta\right) \xrightarrow{p} 0,$$
$$\frac{1}{m_0}\sum_i\tilde d_{j(i)t(i)}\left(\tilde\eta_{j(i)t(i)}+\tilde\varepsilon_i\right) \xrightarrow{p} \sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right).$$
Thus we are left with
$$\widehat\alpha-\alpha = \frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)+o_p(1)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)^{2}+o_p(1)} \xrightarrow{p} \frac{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)\left(\eta_{jt}-\bar\eta_j\right)}{\sum_{j=1}^{N_0}\sum_{t=1}^{T}\phi_{jt}\left(d_{jt}-\bar d_j\right)^{2}}.$$
This gives the result.
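For intuition, the only change relative to the unweighted two-step limit is that each cell now enters with weight $\phi_{jt}$. A short sketch of the weighted analogue of the $\rho_{jt}$ weights, with $\phi_{jt}$ proxied by the observed cell sizes (an assumption for illustration; not the authors' code):

```python
# Illustrative: size-weighted analogue of rho_jt implied by Proposition
# A.3. size_treat[j, t] plays the role of |M(j, t)| and stands in for
# phi_jt; d_treat is (N0, T) as in the earlier sketches.
import numpy as np

def rho_weights_weighted(d_treat, size_treat):
    """rho[j, t] = phi_jt (d_jt - dbar_j) / sum phi (d - dbar)^2, with
    dbar_j the size-weighted group mean of d."""
    phi = size_treat / size_treat.sum()
    dbar = ((phi * d_treat).sum(axis=1, keepdims=True)
            / phi.sum(axis=1, keepdims=True))
    dev = d_treat - dbar
    return phi * dev / (phi * dev ** 2).sum()
```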