
Comparing Treatments across Labor Markets:
An Assessment of Nonexperimental Multiple-Treatment Strategies∗
Carlos A. Flores†
Oscar A. Mitnik‡
May, 2012
Abstract
We consider the problem of using data from several programs, each implemented at a different location, to compare what their effect would be if they were implemented at a specific
location. In particular, we study the effectiveness of nonexperimental strategies in adjusting for differences across comparison groups arising from two sources. First, we adjust for
differences in the distribution of individual characteristics simultaneously across all locations
by using unconfoundedness-based and conditional difference-in-differences methods for multiple
treatments. Second, we explicitly adjust for differences in local economic conditions. We stress
the importance of analyzing the overlap of, and adjusting for, local economic conditions after
program participation. Our results suggest that the strategies studied are valuable econometric
tools for the problem we consider, as long as we adjust for a rich set of individual characteristics
and have sufficient overlap across locations for both individual and local labor market characteristics. Our results show that the overlap analysis of these two sets of variables is critical
for identifying non-comparable groups, and they illustrate the difficulty of adjusting for local
economic conditions that differ greatly across locations.
∗ Detailed comments from two anonymous referees and an editor greatly improved the paper and are gratefully acknowledged. We are also grateful for comments by Chris Bollinger, Matias Cattaneo, Alfonso Flores-Lagunes, Laura
Giuliano, Guido Imbens, Kyoo il Kim, Carlos Lamarche, Phil Robins, Jeff Smith, and seminar participants at the
2009 AEA meetings, 2009 SOLE meetings, Second Joint IZA/IFAU Conference on Labor Market Policy Evaluation,
2009 Latin American Econometric Society Meeting, Third Meeting of the Impact Evaluation Network, Second UCLA
Economics Alumni Conference, Abt Associates, Federal Deposit Insurance Corporation, Institute for Defense Analyses, Interamerican Development Bank, Queen Mary University of London, Tulane University, University of Freiburg,
University of Kentucky, University of Miami, University of Michigan, and VU University Amsterdam. Bryan Mueller
and Yongmin Zang provided excellent research assistance. This is a heavily revised version of IZA Discussion Paper
No. 4451, “Evaluating Nonexperimental Estimators for Multiple Treatments: Evidence from Experimental Data”.
The usual disclaimer applies.
A supplemental Appendix is available online at http:// [to be determined].
† Department of Economics, University of Miami. Email: [email protected].
‡ Department of Economics, University of Miami and IZA. Email: [email protected].
1 Introduction
We consider the problem of using data from several programs, each implemented at a different
location, to compare what their effect would be if they were implemented at a specific location. One
example would be a local government considering the implementation of one of several possible job
training programs. This is a problem many policy makers are likely to face given the heterogeneity
of job training programs worldwide and the scope that many local governments have in selecting
their programs. In this case the policy maker may want to perform an experiment in her locality and
randomly assign individuals to all the possible programs to assess their effectiveness and choose
the best one for her locality. However, this would require implementing all the "candidate"
programs in her locality, which is rarely feasible. Thus, the most likely scenario is that the policy maker would have to
rely on nonexperimental methods and use data on implementations of the programs at various
locations and times. In this paper, we assess the likely effectiveness of nonexperimental strategies
for multiple treatments to inform a policy maker about the effects of programs implemented in
different locations if they were implemented at her location.
This paper is closely related to the literature analyzing whether observational comparison groups
from a different location than the treated observations can be used to estimate average treatment
effects by employing nonexperimental identification strategies (Friedlander and Robins, 1995; Heckman et al., 1997; Heckman et al., 1998; Michalopoulos et al., 2004). Our work is also closely related
to Hotz et al. (2005), who analyze the problem of predicting the average effect of a new training
program using data on previous program implementations at different locations. Our paper differs
from the rest of the literature, which focuses on pairwise comparisons between groups in different
locations. Instead, we focus on comparing all groups simultaneously, which requires the use of
nonexperimental methods for multiple treatments. This is an important distinction in practice.
For instance, suppose that the policy maker above was given the average effects of each of the $J \geq 2$
programs estimated from the binary-treatment versions of the nonexperimental strategies we study
in this paper. If so, each estimate would give the average effect for a particular subpopulation,
namely, those individuals in the “common support” region of the corresponding pairwise comparison (Heckman et al., 1997; Dehejia and Wahba, 1999, 2002). Thus, the policy maker would be
unable to compare the average effects across the different programs simultaneously because the
effects are likely to apply to different subpopulations.
We employ data from the National Evaluation of Welfare-to-Work Strategies (NEWWS) study,
which combines high-quality survey and administrative data. The NEWWS consisted of a set of social
experiments conducted in the United States in the 1990s in which individuals in several locations
were randomly assigned to a control group or to different training programs. These data allow us to
restrict the analysis to a relatively well-defined target population, which is welfare recipients in the
different locations. This is consistent with our example above, where we would expect the policy
maker to be interested in comparing programs that applied to relatively similar target populations.
Even in cases like this, differences across locations in the average outcomes of individuals who participated in the site-specific programs can arise because of (at least) three important factors (Hotz
et al., 2005). First, the effectiveness of the programs can differ (this is the part that the policy
maker in the example above would like to know). Second, the distribution of individual characteristics across sites may be different (e.g., individuals may be more educated, on average, in some
locations). Third, the labor market conditions may differ across locations (e.g., the unemployment
rate may be higher at some sites). In this paper, we focus on the control groups in the different
locations and use the insight that since the control individuals were excluded from all training
services, the control groups should be comparable after adjusting for differences in individual characteristics and labor market conditions. If by using nonexperimental strategies to adjust for those
differences we are able to simultaneously equalize average outcomes among the control groups in
the different locations, we would expect the average effects resulting from the implementation of
those strategies on the treated observations to reflect actual program heterogeneity across the sites
(e.g., Heckman et al., 1997; Heckman et al., 1998; Hotz et al., 2005).1
The first source of heterogeneity across sites that we consider is the distribution of individual
characteristics. We consider unconfoundedness-based and conditional difference-in-differences strategies for multiple treatments to adjust simultaneously across all sites for this source of heterogeneity.
This is related to the literature on program evaluation using nonexperimental methods, most of
which focuses on the binary-treatment case (see Heckman et al., 1999; Imbens and Wooldridge,
2009). More recently, however, there has been a growing interest in methods to evaluate multiple
treatments (Lechner, 2001, 2002a, 2002b; Frölich, 2004; Plesca and Smith, 2007) and treatments
that are multivalued or continuous (Imbens, 2000; Imai and van Dyk, 2004; Hirano and Imbens,
2004; Abadie, 2005; Cattaneo, 2010); we employ some of these methods in this paper. In addition,
we study the use of the generalized propensity score (GPS), defined as the probability of receiving a
particular treatment conditional on covariates, in assessing the comparability of individuals among
the different comparison groups, and we propose a strategy to determine the common support region that is less stringent than those previously used in the multiple treatment literature (Lechner,
2002a, 2002b; Frölich et al., 2004). We also consider several GPS specifications and GPS-based
estimators, and we analyze the performance of the nonexperimental strategies considered when
controlling for alternative sets of covariates.
The second source of heterogeneity across the comparison groups that we consider is local
economic conditions (LEC). The literature on program evaluation methods stresses the importance
of comparing treatment and nonexperimental comparison groups from the same local labor market
(Friedlander and Robins, 1995; Heckman et al., 1997; Heckman et al., 1998). In our setting,
the analyst likely has information about specific programs only when they were implemented in
different local labor markets. Thus, we consider the challenging problem of explicitly adjusting
for differences in LEC across our comparison groups. The literature on how to adjust for these
differences is scarce compared to that on how to adjust for individual characteristics, as we discuss
later. We stress the importance of adjusting for differences in LEC after program participation, as
¹ Another potential source of heterogeneity across control groups is substitution by their members into alternative training programs (Heckman and Smith, 2000; Heckman et al., 2000; Greenberg and Robins, 2011). Our analysis leads us to believe that it is unlikely that this potential source of heterogeneity drives our results, as we discuss later.
they can be a significant source of heterogeneity in average outcomes across sites. We also highlight
the importance of analyzing the overlap in LEC across locations, as well as the advantages of
separating such analysis from that of the overlap in individual characteristics.
Our results suggest that the strategies studied are valuable econometric tools when evaluating
the likely effectiveness of programs implemented at different locations if they were implemented at
another location. The keys are adjusting for a rich set of individual characteristics (including labor
market histories) and having sufficient overlap across locations in both the individual and local labor market characteristics. More generally, our results also illustrate that in the multiple treatment
setting, there may be treatment groups for which it is extremely difficult to find valid comparison groups, and that the strategies we consider perform poorly in such cases. Importantly, the
overlap analyses (both in individual characteristics and LEC) are critical for identifying those non-comparable groups. In addition, our results show that when the differences in post-randomization
LEC are large across sites, the methods we consider to adjust for them substantially reduce this
source of heterogeneity, although they are not able to eliminate it. These results highlight the
importance of adjusting for post-treatment LEC.
The paper is organized as follows. Section 2 presents our general setup and the econometric
methods we employ; Section 3 describes the data; Section 4 presents the results; and Section 5
provides our conclusions. An online Appendix provides further details on the econometric methods
and data. It also presents supplemental estimation results, including those from several robustness
checks briefly mentioned throughout the paper and those complementing the analysis in Section 4.
2 Methodology

2.1 General Framework
We use data from a series of randomized experiments conducted in several locations. In
each location, individuals were randomly assigned either to participate in a site-specific training
program or to a control group that was excluded from all training services. We study whether the
nonexperimental strategies considered in this paper can simultaneously equalize average outcomes
for control individuals across all the sites by adjusting for differences in the distribution of individual
characteristics and local economic conditions.
Each unit $i$ in our sample, $i = 1, 2, \ldots, N$, comes from one of $J$ possible sites. Let $S_i \in \{1, 2, \ldots, J\}$ denote the location of individual $i$. For our purposes, it is convenient to write the potential outcomes of interest as $Y_i(0_j)$, where $j$ denotes the site and zero only emphasizes the fact that we focus exclusively on the control groups. Thus, $Y_i(0_j)$ is the outcome (e.g., employment status) that individual $i$ would obtain if that person were a control in site $j$. The data we observe for each unit are $(Y_i, S_i, X_i)$, where $X_i$ is a set of pre-treatment covariates and $Y_i = Y_i(0_{S_i})$.

Let $\mathbb{X}^o$ be a subset of the support of $X$, $\mathbb{X}$. Our parameters of interest in this paper are

$$\beta_j \equiv \beta_j(\mathbb{X}^o) = E\left[Y(0_j) \mid X \in \mathbb{X}^o\right], \quad \text{for } j = 1, 2, \ldots, J. \tag{1}$$
The region $\mathbb{X}^o$ can include all or part of the support of $X$ and is determined by the overlap region, which we discuss below. To simplify the notation, we omit the conditioning on $X \in \mathbb{X}^o$ unless necessary for clarity. The object in (1) gives the average potential outcome under the control treatment at location $j$ for someone with $X \in \mathbb{X}^o$ randomly selected from any of the $J$ sites. This is different from $E[Y(0_j) \mid S = j, X \in \mathbb{X}^o]$, which gives the average control outcome at site $j$ for someone with $X \in \mathbb{X}^o$ randomly selected from site $j$. While this last quantity is identified from the data because of random treatment assignment within site $j$, $\beta_j$ is not. Rather than focusing on (1), we could have focused on $E[Y(0_j) \mid S = j^*, X \in \mathbb{X}^o_{j^*}]$ for $j = 1, \ldots, J$, which gives the average outcome for the individuals in a particular site $j^* \in \{1, \ldots, J\}$ within the overlap region $\mathbb{X}^o_{j^*}$ for site $j^*$. By focusing on the average over all $J$ sites in (1), we avoid having our results depend on the selection of site $j^*$ as a reference point. However, the estimation problem becomes more challenging because we need to find comparable individuals in each of the $J$ sites for every individual in the $J$ sites (as opposed to finding comparable individuals in each site for only those in site $j^*$).
We assess the performance of the strategies presented below in several ways. First, given estimates $\widehat{\beta}_1, \ldots, \widehat{\beta}_J$ of the corresponding parameters in (1), we test the joint hypothesis

$$\beta_1 = \beta_2 = \cdots = \beta_J. \tag{2}$$

One drawback of this approach is that we could fail to reject the null hypothesis in (2) only because the variance of the estimators is high, rather than because all the estimated $\beta$'s are sufficiently close to each other. Hence, we focus our analysis mostly on an overall measure of distance among the estimated means. Letting $\bar{\widehat{\beta}} = J^{-1} \sum_{j=1}^{J} \widehat{\beta}_j$, we define the root mean square distance as

$$RMSD = \frac{1}{|\bar{\widehat{\beta}}|} \sqrt{\frac{1}{J} \sum_{j=1}^{J} \left( \widehat{\beta}_j - \bar{\widehat{\beta}} \right)^2}, \tag{3}$$

where we divide by $|\bar{\widehat{\beta}}|$ to facilitate its interpretation and the comparison across outcomes. Due to pure sample variation, we would never expect the $RMSD$ to equal zero, even if $S$ were randomly assigned. In Section 4, we present benchmark values that serve as a reference point for reasonable values of the $RMSD$. These benchmarks give the value of the $RMSD$ that would be achieved by an experiment in a setting where $\beta_1 = \beta_2 = \cdots = \beta_J$ holds.²
In the following two sections, we discuss the strategies we employ to adjust for differences across
the sites in the distribution of individual characteristics and local economic conditions.
2.2 Adjusting for Individual Characteristics
Nonexperimental Strategies. We consider two identification strategies to adjust for differences in individual characteristics across sites. Let $1(A)$ be the indicator function, which equals one if event $A$ is true and zero otherwise. The first strategy is based on the following assumption:
² In addition to the $RMSD$ in (3), we considered two other measures: the mean absolute distance and the maximum pairwise distance among all the estimated means. All three measures give similar results.
Assumption 1 (Unconfounded site) $1(S = j) \perp\!\!\!\perp Y(0_j) \mid X$, for all $j \in \{1, 2, \ldots, J\}$.
This assumption is an extension to the multiple-site setting of the one in Hotz et al. (2005), and in the program evaluation context it is called weak unconfoundedness (Imbens, 2000). The problem when estimating $\beta_j$ is that we only observe $Y(0_j)$ for individuals with $1(S = j) = 1$, while it is missing for those with $1(S = j) = 0$. Assumption 1 implies that, conditional on $X$, the two groups are comparable, which means that we can use the individuals with $1(S = j) = 1$ to learn about $\beta_j$. A key feature of Assumption 1 is that all that matters is whether or not unit $i$ is in site $j$; hence, the actual site of an individual not in $j$ is not relevant for learning about $\beta_j$.

In addition to Assumption 1, we impose an overlap assumption that guarantees that, in infinite samples, we are able to find comparable individuals in terms of the covariates for all units with $S \neq j$ in the subsample with $S = j$, for every site $j$.
Assumption 2 (Simultaneous strict overlap) $0 < c < \Pr(S = j \mid X = x)$, for all $j$ and $x \in \mathbb{X}$.
This assumption is stronger than that of the binary-treatment case, as it requires that for every individual in the $J$ sites, we find comparable individuals in terms of covariates in each of the $J$ sites. Intuitively, while we want to learn about the potential outcomes in all sites for every individual, we observe only one of those $J$ potential outcomes, so the "fundamental problem of causal inference" (Holland, 1986) is worsened. When the strict overlap assumption is violated, semiparametric estimators of $\beta_j$ can fail to be $\sqrt{N}$-consistent, which can lead to large standard errors and numerical instability (Busso et al., 2009a, 2009b; Khan and Tamer, 2010). In practice, Assumption 2 may hold only for some of the sites and/or regions in the support of $X$, in which case one may have to restrict the analysis to those sites and/or regions with common support.
The second identification strategy we consider to adjust for differences in individual characteristics across locations is a multiple-treatment version of the conditional difference-in-differences (DID) strategy (Heckman et al., 1997; Heckman et al., 1998; Smith and Todd, 2005). In our setting, the main identification assumption of this strategy is $1(S = j) \perp\!\!\!\perp \{Y_t(0_j) - Y_{t_0}(0_j)\} \mid X$, for all $j \in \{1, 2, \ldots, J\}$, where $t$ and $t_0$ are time periods after and before random assignment, respectively. This assumption is similar to Assumption 1 but employs the difference of the potential outcomes over time. This identification strategy allows for systematic differences in time-invariant factors between the individuals in site $j$ and those not in site $j$, for every site $j$. Thus, DID methods can also help to adjust for time-invariant heterogeneity in local economic conditions across sites (Heckman et al., 1997; Heckman et al., 1998; Smith and Todd, 2005).
Estimators. We consider several estimators to implement the strategies discussed above. For
simplicity, we present the estimators in the context of the unconfoundedness strategy. The DID
estimators simply replace the outcome in levels with the outcome in differences.
Under Assumption 1 we can identify $\beta_j$ (omitting for brevity the conditioning on $X \in \mathbb{X}^o$) as

$$\beta_j = E\left[E\left[Y(0_j) \mid X\right]\right] = E\left[E\left[Y(0_j) \mid 1(S = j) = 1, X\right]\right] = E\left[E\left[Y \mid S = j, X\right]\right],$$
where, following similar steps to those used in a binary-treatment setting (e.g., Imbens, 2004), the first equality uses iterated expectations and the second uses Assumption 1. This result suggests estimating $\beta_j$ using a partial mean, which is an average of a regression function over some of its regressors while holding others fixed (Newey, 1994). To estimate $\beta_j$, we first estimate the expectation of $Y$ conditional on $S$ and $X$, and then we average this regression function over the covariates holding $S$ fixed at $j$. The first estimator we consider uses a linear regression of $Y$ on the site dummies and $X$ to model the conditional expectation $E[Y \mid S = j, X = x]$. We refer to it as the partial mean linear $X$ estimator. We also consider a more flexible model of that expectation by including polynomials of the continuous covariates and various interactions, and refer to the corresponding estimator as the partial mean flexible $X$ estimator.
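To illustrate the partial mean idea concretely, the sketch below implements a plain linear version: regress $Y$ on site dummies and covariates, then average the fitted values over the whole sample while holding the site fixed at $j$. All names are ours, and the specification is deliberately simpler than the flexible one the paper uses.

```python
import numpy as np

def partial_mean_linear(y, site, X, J):
    """Partial mean sketch: OLS of y on J site dummies (no intercept) and
    covariates X, then average fitted values holding the site fixed at j."""
    n = len(y)
    D = np.zeros((n, J))
    D[np.arange(n), site] = 1.0                    # site dummies
    coef, *_ = np.linalg.lstsq(np.column_stack([D, X]), y, rcond=None)
    betas = np.empty(J)
    for j in range(J):
        Dj = np.zeros((n, J))
        Dj[:, j] = 1.0                             # everyone "assigned" to site j
        betas[j] = (np.column_stack([Dj, X]) @ coef).mean()
    return betas                                   # estimates of beta_1,...,beta_J
```

Adding polynomials and interactions of the covariates to the design matrix yields the flexible variant.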
The rest of the estimators we consider are based on the generalized propensity score (GPS). Its binary-treatment version, the propensity score, is widely used in the program evaluation literature (Heckman et al., 1999; Imbens and Wooldridge, 2009). In our setting, the GPS is defined as the probability of belonging to a particular site conditional on the covariates (Imbens, 2000):

$$p_j(x) = \Pr(S = j \mid X = x). \tag{4}$$

We let $R_i = p_{S_i}(X_i)$ be the probability that individual $i$ belongs to the location she actually belongs to, and let $R_i^j = p_j(X_i)$ be the probability that $i$ belongs to a particular location $j$ conditional on her covariates. Clearly, $R_i = R_i^j$ for those units with $S_i = j$.
We consider GPS-based estimators of $\beta_j$ based on partial means and inverse probability weighting. Imbens (2000) shows that the GPS can be used to estimate $\beta_j$ within a partial mean framework by first estimating the conditional expectation of $Y$ as a function of $S$ and $R = p_S(X)$, and then by averaging this conditional expectation over $R^j = p_j(X)$ while holding $S$ fixed at $j$. We consider two estimators based on this approach. The first follows Hirano and Imbens (2004) and estimates $E[Y \mid S = j, R = r]$ by employing a linear regression of the outcome on the site dummies and the interactions of the site dummies with $R$ and $R^2$. The second estimator employs a local polynomial regression of order one to estimate that regression function. We refer to these two estimators of $\beta_j$ as the GPS-based parametric and nonparametric partial mean estimators, respectively.
Finally, we also employ the GPS in a weighting scheme. Intuitively, reweighting observations by the inverse of $R_i$ creates a sample in which the covariates are balanced across all locations. The first inverse probability weighting (IPW) estimator of $\beta_j$ that we consider equals the coefficient for $1(S = j)$ from a weighted linear regression of $Y$ on all the site indicator variables, with weights equal to $\sqrt{1/R_i}$ (equivalently, weighted least squares placing weight $1/R_i$ on observation $i$). We also consider an estimator that combines IPW with linear regression by adding covariates to this regression. We refer to it as the IPW with covariates estimator.
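A minimal sketch of the first IPW estimator. Because the weighted regression contains only the site indicators, the coefficient on $1(S = j)$ reduces to a weighted within-site mean, which the function below exploits (names are ours; `R_own` holds the estimated $R_i$):

```python
import numpy as np

def ipw_betas(y, site, R_own, J):
    """IPW sketch: with only site indicators in the weighted regression, the
    coefficient on 1(S = j) is the weighted mean of y within site j,
    using observation weights 1/R_i."""
    w = 1.0 / R_own
    return np.array([np.average(y[site == j], weights=w[site == j])
                     for j in range(J)])
```

The "IPW with covariates" variant would add the covariate columns to the weighted regression rather than use this closed form.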
Determination of the Overlap Region. A key ingredient in the implementation and
performance of the estimators previously discussed is the identification of regions of the data in
which there is overlap in the covariate distributions across the different comparison groups. The
usual approach in the binary-treatment literature is to focus on the “overlap” or “common support”
region by dropping those individuals whose propensity score does not overlap with the propensity
score of those in the other treatment arm (Heckman et al., 1997; Heckman et al., 1998; Dehejia and
Wahba, 1999, 2002). We propose a GPS-based trimming procedure for determining the overlap
region in the multiple-treatment setting that is motivated by a procedure commonly used when the
treatment is binary (Dehejia and Wahba, 1999, 2002).
We determine the overlap region in two steps. First, we define the overlap region with respect to a particular site $j$, $\mathbb{X}_j^o$, as the subsample of individuals whose conditional probability of being in site $j$ (i.e., $R^j$) is greater than a cutoff value, which is given by the highest $\alpha$-quantile of the $R^j$ distribution for two groups: those individuals who are in site $j$ and those who are not. This rule ensures that within the subsample $\mathbb{X}_j^o$, for every individual with $S \neq j$, we can find comparable individuals with $S = j$, which implies that we can estimate the mean of $Y(0_j)$ for the individuals in $\mathbb{X}_j^o$ under Assumption 1. Second, we define the overlap region as the subsample given by those individuals who are simultaneously in the overlap regions for all the different sites. This step guarantees that we estimate the mean of $Y(0_j)$ for $j = 1, 2, \ldots, J$ for the set of individuals who are simultaneously comparable in all the sites. More formally, let $\mathbb{X}_j^o = \{i : R_i^j \geq \max\{q^{\alpha}_{\{i: S_i = j\}}, q^{\alpha}_{\{i: S_i \neq j\}}\}\}$, where $q^{\alpha}_{A}$ denotes the $\alpha$-th quantile of the distribution of $R^j$ over those individuals in subsample $A$. We define the overlap region as³

$$\mathbb{X}^o = \bigcap\nolimits_{j=1}^{J} \mathbb{X}_j^o. \tag{5}$$
Lechner (2002a, 2002b) and Frölich et al. (2004) use a rule akin to (5) but define $\mathbb{X}_j^o = \{i : R_i^j \in [\max_{s=1,\ldots,J}\{\min_{\{i: S_i = s\}} R_i^j\}, \; \min_{s=1,\ldots,J}\{\max_{\{i: S_i = s\}} R_i^j\}]\}$. Our rule is less stringent in two important ways. First, as implied by Assumption 1, it does not require the comparison of $R^j$ among all $J$ locations but only between the groups of individuals with $S = j$ and $S \neq j$. Second, it identifies individuals outside the overlap region based only on the lower tail of the distributions of $R^j$. This is because we impose the overlap condition for all sites, so it is not necessary to look at the upper tail of the GPS distributions.⁴
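The two-step rule can be sketched as follows, assuming the GPS has already been estimated, with `R_all[i, j]` holding $R_i^j$ (the function name and the illustrative trimming level $\alpha$ are ours):

```python
import numpy as np

def overlap_region(R_all, site, alpha=0.05):
    """Two-step GPS trimming sketch. For each site j, keep individuals whose
    R^j is at least the larger of the alpha-quantiles of R^j computed within
    site j and outside site j; then intersect the J per-site regions as in
    eq. (5). alpha is a user-chosen trimming level."""
    n, J = R_all.shape
    keep = np.ones(n, dtype=bool)
    for j in range(J):
        Rj = R_all[:, j]
        cutoff = max(np.quantile(Rj[site == j], alpha),
                     np.quantile(Rj[site != j], alpha))
        keep &= Rj >= cutoff
    return keep        # boolean indicator of membership in the overlap region
```

Only the lower tail of each $R^j$ distribution enters the rule, mirroring the discussion above.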
2.3 Adjusting for Local Economic Conditions
Even after adjusting for differences in individual characteristics, there could still be differences
in the average outcomes of the comparison groups across sites because of differences in LEC. The
main difficulty in adjusting for differences in LEC is that these variables take on the same value
within each location, or within each cohort in a location. When each comparison group belongs to
a different location, as in our setting, the variation of the LEC variables can be limited and they
are likely to fail the overlap condition. In such cases, the literature provides little guidance on how
to adjust for differences in LEC across comparison groups.
³ Alternatively, the overlap region could be defined over the covariate space as $\mathbb{X}^o = \bigcap_{j=1}^{J} \{x \in \mathbb{X} : p_j(x) \geq \max\{q^{\alpha}_{\{i: S_i = j\}}, q^{\alpha}_{\{i: S_i \neq j\}}\}\}$.
⁴ For example, when estimating the average treatment effect in the binary-treatment setting, both tails of the distribution of the propensity score are considered. This is equivalent to considering the two lower tails of the distributions of $\Pr(S = 1 \mid X)$ and $\Pr(S = 0 \mid X)$ because $\Pr(S = 1 \mid X) + \Pr(S = 0 \mid X) = 1$.
In the context of evaluating training programs, a strategy followed in practice to control for
differences in LEC across comparison groups is to include LEC measures and/or geographic indicators in the propensity score (Roselius, 1996; Heckman et al., 1998; Dyke et al., 2006; Mueser et
al., 2007; Dolton and Smith, 2011). Using regression-adjustment methods, Hotz et al. (2006) and Flores-Lagunes et al. (2010) include as regressors measures of post-treatment LEC. Most of these papers differ from ours in that, unlike in their settings, we do not have large time-series or cross-sectional variation in LEC.
In a setting more similar to ours, Hotz et al. (2005) include two measures of pre-randomization
LEC in the regression step of the bias-corrected matching estimator in Abadie and Imbens (2011).5
We pay particular attention to adjusting for differences across sites in LEC after randomization,
which can be a significant source of heterogeneity in average outcomes. For example, Hotz et al.
(2006) and Flores-Lagunes et al. (2010) find that it is very important to adjust for post-treatment
LEC when evaluating training programs. The approach we use to adjust for post-randomization
LEC consists of modeling the effect of the pre-randomization LEC on pre-randomization values of
the outcomes, and then using the estimated parameters to adjust the outcomes of interest for the
effect of post-randomization LEC.
We implement our approach in two ways, depending on whether we use the outcome in levels or in differences. Assume that the potential outcomes $Y(0_j)$ are separable functions of the individual characteristics and the LEC, and, for the sake of presentation, ignore the presence of the covariates $X$. For the outcome in levels, we use the pre-randomization outcome and LEC to run the regression $Y_{t_0,i} = \sum_{j=1}^{J} [1(S_i = j) \cdot \delta_{t_0,j}] + Z_{t_0,i}' \gamma + \varepsilon_{t_0,i}$, where $t_0$ denotes a time period before random assignment, $Z_{t_0}$ is a vector of LEC at time $t_0$, and $\varepsilon_{t_0}$ is an error term. The availability of different cohorts within each site in our data provides us with sufficient variation in the LEC to identify the parameter vector $\gamma$. We use the estimated coefficient $\widehat{\gamma}$ to adjust the outcome of interest ($Y_t$) as $\widetilde{Y}_t = Y_t - Z_t' \widehat{\gamma}$, and apply the estimators discussed in Section 2.2 on the LEC-adjusted outcome $\widetilde{Y}_t$. For the outcome in differences, we follow a similar approach but employ in the regression the differences of the outcomes and LEC in two periods prior to randomization. In both cases (for the outcome in levels and in differences), we normalize the adjustments to have mean zero in order to allow for easier comparisons between the unadjusted and adjusted outcomes. Further details are provided in the Appendix.
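Ignoring the covariates $X$, the levels version of this adjustment can be sketched as follows. Variable names are ours; `Z_pre` and `Z_post` stack the LEC measures at the pre- and post-randomization periods, and identification of $\gamma$ relies on within-site cohort variation, as described above.

```python
import numpy as np

def lec_adjust(y_pre, site, Z_pre, y_post, Z_post, J):
    """LEC adjustment sketch (levels version): regress the pre-randomization
    outcome on site dummies and pre-randomization LEC to estimate gamma,
    then subtract the (mean-zero) post-randomization LEC effect from the
    outcome of interest."""
    n = len(y_pre)
    D = np.zeros((n, J))
    D[np.arange(n), site] = 1.0                    # site-specific intercepts
    coef, *_ = np.linalg.lstsq(np.column_stack([D, Z_pre]), y_pre, rcond=None)
    gamma_hat = coef[J:]                           # coefficients on the LEC
    adj = Z_post @ gamma_hat
    return y_post - (adj - adj.mean())             # mean-zero normalization
```

The adjusted outcome can then be passed to any of the estimators from Section 2.2, which is what makes the approach estimator-agnostic.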
Clearly, the flexibility of the specific model employed to adjust for differences across locations in
LEC depends heavily on the available data and the amount of variation in the LEC. For instance,
one may be able to estimate more flexible models if there is a large number of locations or cohorts
in each location. In this paper, we are faced with the challenging (but not uncommon) problem
that there is limited variation in the LEC measures (we have thirteen cohorts in total), so it is difficult to use a more flexible model than the one we employ.
⁵ Dehejia (2003) uses an approach based on hierarchical models to deal with site effects when evaluating multisite training programs, although he does not directly address LEC differences. Galdo (2008) considers a setting where there is lack of overlap due to site and time effects, and proposes a procedure that removes those two fixed effects before implementing a matching estimator. As pointed out by Dehejia (2003), in a framework that aims at predicting the effects of programs if implemented in different sites, it is problematic to model site effects as fixed effects.
An attractive feature of our approach is that it is not estimator-specific, meaning that once we
adjust the outcomes for differences in post-randomization LEC, we can employ any of the estimators
from Section 2.2 to adjust for differences in individual characteristics. Another potentially attractive
feature is that the effects of the LEC on the outcome variables are estimated using the outcomes
prior to random assignment, so part of the analysis (e.g., modeling the effect of the LEC on
outcomes) can be performed without resorting to the outcomes after randomization.
Finally, note that while we do not explicitly adjust for pre-randomization LEC in our preferred
specification, we include a rich set of welfare and labor market histories as covariates, which can
proxy for those variables (Hotz et al., 2005). In Section 4, we also present results from models that
include measures of both pre- and post-randomization LEC in the GPS estimation.
3 Data
The data we use come from the National Evaluation of Welfare-to-Work Strategies (NEWWS),
a multi-year study conducted in seven United States cities to compare the effects of different
approaches for helping welfare recipients (mostly single mothers) to improve their labor market
outcomes and leave public assistance. The individuals in each location were randomly assigned to
different training programs or to a control group that was denied access to the training services
offered by the program for a pre-set “embargo” period. The year of randomization differed across
sites, with the earliest randomization taking place in 1991 and the latest in 1994. We rely on
the public-use version of the NEWWS data, which is a combination of high quality survey and
administrative data. For further details on the NEWWS study, see Hamilton et al. (2001).
We restrict the analysis to female control individuals in the five sites for which two years of
pre-randomization individual labor market histories are available and in which individuals were
randomized after welfare eligibility had been determined. The five sites we use are Atlanta, GA;
Detroit, MI; Grand Rapids, MI; Portland, OR; and Riverside, CA (we exclude Columbus, OH and
Oklahoma City, OK). The final sample size in our analysis is 9,351 women.6
The outcome we use is an indicator variable equal to one if the individual was ever employed
during the two years (quarters two to nine) following randomization, and zero otherwise.7 We focus
on an outcome that is measured only up to two years after randomization because, in most sites,
we cannot identify which control individuals were embargoed from receiving program services after
year two. We define the outcome in differences as the original outcome minus an indicator equal
to one if the individual was ever employed during the two years prior to randomization.
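As a concrete illustration, the two outcome variables can be built from quarterly earnings records as follows; the array names and toy values are hypothetical.

```python
import numpy as np

# Hypothetical quarterly earnings: rows are individuals; earn_pre covers the
# eight quarters before randomization, earn_post quarters one to nine after.
earn_pre = np.array([[0, 0, 120, 0, 0, 0, 0, 0],     # employed pre
                     [0, 0,   0, 0, 0, 0, 0, 0]])    # never employed pre
earn_post = np.array([[0,   0, 0, 0,   0, 0, 0, 0, 0],
                      [0, 300, 0, 0, 450, 0, 0, 0, 0]])

# Outcome in levels: ever employed during quarters two to nine.
ever_emp_post = (earn_post[:, 1:9] > 0).any(axis=1).astype(int)

# Outcome in differences: subtract the analogous pre-randomization indicator.
ever_emp_pre = (earn_pre > 0).any(axis=1).astype(int)
y_diff = ever_emp_post - ever_emp_pre

print(ever_emp_post, y_diff)   # -> [0 1] [-1  1]
```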
Our data contain information on demographic and family characteristics, education, housing
type and stability, pre-randomization welfare and food stamps use (seven quarters), and pre-randomization
earnings and employment histories (eight quarters). The first five columns of Panel
A in Table 1 show the averages of the outcomes and selected covariates for each site. As expected,
there are important and usually large differences in the means of all the variables across sites (e.g.,
percentage of blacks).
6
Michalopoulos et al. (2004) use the NEWWS data to perform binary comparisons between control groups in
different locations. See the Appendix for a discussion of the differences between their sample and ours.
7
Using one-year employment indicators instead does not change our conclusions substantively. We focus on an
employment indicator because the public-use data available to us contain only the year of random assignment and
nominal earnings measures. Thus, we could create only rough measures of real earnings. By focusing on employment,
we also avoid having to deal with issues related to cost-of-living differences across sites.
We employ three measures of LEC for the different cohorts within each site at the metropolitan
statistical area (MSA) level: the employment-to-population ratio, average real earnings, and the
unemployment rate. The bottom of Panel A presents the average of these variables during the calendar
year of random assignment, as well as their growth rates over the two years before and the two years after
random assignment. The differences in LEC across all sites clearly suggest that we are working
with five distinct local labor markets, especially with respect to Riverside.
4 Results

4.1 GPS Estimation and Covariate Balancing
We estimate the GPS using a multinomial logit model (MNL) that includes 52 individual-level
covariates (estimation results are provided in the Appendix). In Section 4.4, we consider alternative
estimators of the GPS. An important property of the GPS is that it balances the covariates between
the individuals in a particular site and those not in that site (Imbens, 1999). In the binary treatment
setting, this property is commonly used to gauge the adequacy of the propensity score specification
(Dehejia and Wahba, 1999; Smith and Todd, 2005; Lee, 2011). We follow two strategies to examine
the balance of each covariate across the different sites after adjusting for the GPS. The first strategy
tests whether there is joint equality of means across all five sites after each observation is weighted
by the inverse of its estimated GPS. The second strategy consists of a series of pairwise comparisons of the mean in each
site versus the mean in the (pooled) remaining sites, adjusting for the GPS within a “blocking” or
“stratification” framework in the spirit of Dehejia and Wahba (1999, 2002) and Hirano and Imbens
(2004). Both strategies are implemented after imposing the overlap rule (5). An important issue
when implementing this rule is selecting the quantile α that determines the amount of trimming.
Even in the binary treatment literature, there is no consensus about how to select the amount of
trimming (e.g., Imbens and Wooldridge, 2009). Based on attaining good balancing under both
strategies, we set α = 0.0025.8
For brevity, we focus on the first strategy. The last four columns of Panel A in Table 1 show the
p-value from a joint equality test and the root mean square distance (RMSD) as given in (3) for the
means of the variables in each site, before and after weighting by the GPS. The complete
variable-by-variable results are presented in the Appendix. Weighting by the GPS brings the means of all
of the covariates much closer together (except for the percentage of individuals with a high school
degree or GED, which becomes slightly less balanced). For example, the RMSD for the percentage of
blacks is reduced from 0.64 to 0.06 after weighting by the GPS. Despite the significant improvement
8
We also examined covariate balancing properties for values of α equal to 0, 0.001, 0.002, 0.003, 0.004, and 0.005.
Our main results presented in Section 4.3 are, in general, robust to the choice of α.
in balancing, the joint equality of means test is still rejected for some of the covariates. Panel B
in Table 1 shows summary results for the covariate balancing analysis. Weighting by the GPS
greatly improves the balancing among the five sites, as the number of unbalanced covariates at
the 5 percent significance level drops from 52 to 6. The second strategy used to analyze covariate
balancing, blocking on the GPS, yields similar results. Overall, we conclude that the covariates in
the raw data are highly unbalanced across sites, and that the estimated GPS does a reasonably
good job in improving their balance across all five sites.
4.2 Overlap Analysis
We first analyze the overlap in individual characteristics across sites, and then we analyze
the overlap in LEC. Panel C in Table 1 presents the percentage of individuals dropped from each
site after imposing overlap, and the overall percentage of observations dropped (26 percent). In
Riverside, almost half of the observations are dropped, implying that many of the individuals in
Riverside are not comparable to individuals in at least one of the other sites. The sites with the
fewest observations dropped are Atlanta and Detroit (about 5 percent in each).
In Figure 1, we examine more closely the overlap quality in individual characteristics. Each
panel in the Figure shows, for a given site d, the number of observations with D ≠ d (left bars)
and D = d (right bars) whose GPS for site d falls within given percentile intervals of the support of
that GPS. The percentile intervals, for each site d, are constructed using the distribution of the GPS only
for individuals in site d before imposing overlap. The outside wide (inside thin) bars display the
number of individuals before (after) imposing overlap. As discussed in Section 2.2, we are interested
only in the lower tail of these distributions because, for estimation of the parameter of interest for site d, we need to find for every
individual with D ≠ d comparable individuals in terms of the GPS in the D = d group, and not
vice versa. However, note that the intersection in the overlap rule (5) results in individuals being
dropped from all regions of the GPS distributions in Figure 1. This illustrates a key difference from
the binary-treatment case, where individuals are usually dropped only from the tails.
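The intersection nature of the rule can be seen in a small sketch. Below, a unit is kept only if its GPS for every site clears a site-specific quantile cutoff computed among that site's own units; this is one plausible reading of rule (5), and the paper's exact cutoffs may differ. The GPS values here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic GPS matrix: gps[i, j] = probability that unit i is in site j.
n, J = 1000, 3
raw = rng.uniform(0.05, 1.0, size=(n, J))
gps = raw / raw.sum(axis=1, keepdims=True)
d = rng.integers(0, J, size=n)            # actual site of each unit

def overlap_mask(gps, d, alpha):
    # Cutoff for site j: the alpha-quantile of gps[:, j] among units in site j.
    J = gps.shape[1]
    cut = np.array([np.quantile(gps[d == j, j], alpha) for j in range(J)])
    # Keep a unit only if it clears the cutoff for EVERY site (intersection),
    # so units can be dropped from any region of a given GPS distribution.
    return (gps >= cut).all(axis=1)

keep = overlap_mask(gps, d, 0.0025)
keep_strict = overlap_mask(gps, d, 0.05)
print(keep.sum(), keep_strict.sum())
```

Because the cutoffs rise with alpha, the stricter rule keeps a subset of the units kept by the looser one.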
Figure 1 makes clear that the GPS distributions of the D ≠ d and D = d groups differ
greatly in some sites, which is consistent with the large differences in individual characteristics
documented in Table 1. Figure 1 also shows that imposing overlap significantly decreases the
number of individuals at the lower tail of the GPS distribution for the D ≠ d group in most
sites. The biggest changes, mostly arising from the large number of Riverside individuals dropped,
are in Atlanta and Detroit. For example, the bottom one percent of the individuals in Atlanta
(about 14 observations) are comparable to 3,245 individuals (about 41 percent of those not in
Atlanta) before imposing overlap, and to 938 individuals after imposing overlap (about 17 percent
of those not in Atlanta). Despite the improvement in overlap after imposing (5), the overlap quality
remains relatively poor in some regions of the distributions, as many observations not in site d are
comparable to few observations in site d in those regions. This poor overlap quality can lead to
high variance of the GPS-based semiparametric estimators, just as in the binary-treatment setting
(Black and Smith, 2004; Busso et al., 2009b; Imbens and Wooldridge, 2009; Khan and Tamer,
2010). In general, the overlap quality in individual characteristics between units with D = d and
D ≠ d in each site is similar to that from previous studies in a binary-treatment setting (e.g.,
Dehejia and Wahba, 1999; Black and Smith, 2004; Smith and Todd, 2005).
We now examine the overlap in pre- and post-randomization LEC across the different cohorts
in our sites. Figure 2 presents plots of the LEC faced by each of the cohorts in our analysis
before (top figures) and after (bottom figures) random assignment. It is clear from the Figure that
all the cohorts in Riverside experienced, especially after randomization, LEC extremely different
from those in the other sites with respect to the employment-to-population ratio and the unemployment
rate. Moreover, the comparison of the top and bottom panels shows that the behavior over time
of the LEC in Riverside is also different, with earlier cohorts experiencing a deterioration of the
LEC. As for the rest of the sites, while their LEC are not the same, they are roughly similar.
Figure 2 highlights that it is almost impossible to find individuals in other sites who faced
post-randomization LEC similar to those faced by the individuals in Riverside. Thus, the adjustment
for LEC in Riverside depends heavily on how well the parametric model used to adjust for LEC is
able to extrapolate to these regions of the data.
4.3 Main Results
Table 2 presents the results for the assessment measures of the different estimators used to
implement the unconfoundedness (first three columns) and conditional DID (last three columns)
strategies. Panel A shows the results when the outcome is not adjusted for differences in
post-randomization LEC, while Panel B shows the results when adjusting for such differences following
the procedure discussed in Section 2.3. The first and second columns of the Table show, respectively,
the p-value from the joint equality test in (2) and the RMSD in (3) with its corresponding 95 percent
confidence interval based on 1,000 bootstrap replications. To have a reference level for the reduction
in the RMSD for the different estimators, the third column displays the RMSD relative to that from
the raw mean estimator using the outcome in levels before imposing overlap and not adjusting for
LEC. In addition, to have a benchmark against which to compare our results, the last row in each
of the panels gives the value of the RMSD that would be achieved by an experiment in a setting
where the mean outcomes are equal across all five sites. In particular, we perform a “placebo” experiment in which we assign
placebo sites randomly to the individuals in the data set (keeping the number of observations in
each site the same as in our sample), and we let the average outcome in each placebo site play the
role of its estimated mean.
These placebo values are also calculated using the outcome in differences and with and without
adjusting for the LEC. We implement all the GPS-based estimators only within the overlap region,
and the non-GPS-based estimators before and after imposing overlap.9
9
For the partial mean linear X and IPW with covariates estimators, we include all 52 individual covariates used
in the GPS estimation. For the partial mean flexible X estimator, we add 50 higher order terms and interactions.
We implement the GPS-based nonparametric partial mean estimator using an Epanechnikov kernel and selecting the
bandwidth using the “plug-in” procedure proposed by Fan and Gijbels (1996). For the GPS-based parametric partial
mean estimator, we also explored using a cubic and a quartic polynomial in the GPS, which made the results very close to
those from the nonparametric partial mean estimator. Finally, the results from the regressions in the first step of the
procedure used to adjust for LEC are shown in the Appendix.
The RMSD of the raw mean in levels (0.129) is about six times what one would expect to
see in an experiment (0.021). Both strategies used to adjust for individual characteristics do
very little to reduce the RMSD, regardless of the estimator used. This illustrates that in some
instances, DID estimators are unable to adjust for differences across sites. In our case, there seem
to be important differences in time-variant characteristics across sites (LEC) that are not captured
by the individual characteristics we control for. The only adjustment that helps to reduce the
differences in outcomes across comparison groups is the one for post-randomization LEC. In general,
this adjustment reduces the RMSD by about half whether we employ the outcome in levels or in
differences; however, the RMSD still remains more than three times that of the placebo experiment.
Table 3 presents the estimates and confidence intervals for each site. This table shows that
the methods are fairly successful in equalizing the average outcomes among all sites but Riverside.
In fact, with the exception of the LEC adjustment, the strategies considered do very little to change the
average outcome of Riverside. This is not surprising given the very different LEC in this site.
Tables 2 and 3 highlight the difficulty of adjusting for LEC when they differ substantially across
sites and the number of cohorts and/or locations is small. This is consistent with Dehejia (2003),
who finds that it is difficult to predict site effects when the site being predicted differs greatly from
the other sites. The lack of overlap in post-randomization LEC in Riverside prevents us from finding
individuals in other sites who faced post-randomization LEC like those in Riverside. Moreover, the
small number of cohorts and limited LEC variation in our sample makes it extremely difficult to
model the effect of the LEC on the outcome and to extrapolate to regions with no overlap. Thus,
we are unable to compare the outcomes in Riverside to those from the other sites. More generally,
our results illustrate that when evaluating multiple treatments, there may be treatment groups for
which it is not possible to draw inferences because they are not comparable to the other ones.
Table 4 shows our assessment measures calculated using the estimated means from Table 3 (i.e.,
from the five-site model) for all sites except Riverside. The RMSD for the raw mean of the outcome
in levels starts at 0.077, which is about three times that of the placebo experiment. In this case,
regardless of the estimator employed, the adjustment for individual characteristics significantly
reduces the RMSD to levels that are much closer to those of the placebo experiment.10 As in Table
2, the DID estimators do not seem to improve on the results from the outcomes in levels. The
LEC adjustment does not affect the results nearly as much as it did in Table 2, and for most
estimators it slightly increases the RMSD. This small effect of the LEC adjustment, as compared to
the case when Riverside is included, is consistent with the other four sites having more similar LEC
than does Riverside. It is also important to compare the confidence intervals from the different
estimators to those from the placebo experiment. In general, the confidence intervals from the
linear regression-based estimators without adjusting the outcome for LEC are similar to those from
the placebo experiment. However, those from the GPS-based estimators are wider, especially those
10
For estimation of the parameter of interest for site d, it does not matter which site a unit comes from if the unit is not in site d. Thus, performing
the whole analysis using only the observations not in Riverside gives results similar to those in Table 4. In addition,
carrying out the four-site analysis keeping Riverside, while dropping any of the other sites, gives RMSD estimates as
high as those for the five-site analysis in Table 2.
from the IPW estimators, because of the poor overlap quality in some sites (e.g., Atlanta).11
In sum, the results from this section highlight the importance of the LEC when employing
comparison groups from different locations. Our results suggest that, given our set of conditioning
variables, the unconfoundedness and DID strategies work reasonably well in simultaneously equalizing average outcomes across our comparison groups when they come from sites with similar LEC.
This is consistent with some of the previous findings in the binary-treatment literature (Roselius,
1996; Heckman et al., 1997; Heckman et al., 1998; Michalopoulos et al., 2004; Smith and Todd, 2005;
Hotz et al., 2005). Our results show that, in the presence of large differences in post-randomization
LEC across comparison groups, our LEC adjustment can reduce but not eliminate these differences.
Our findings regarding the importance of adjusting for post-randomization LEC complement, and
are consistent with, those in the previous literature (Hotz et al., 2006; Flores-Lagunes et al., 2010).
We briefly relate our results to our example of a policy maker who wants to implement in her site
one of the programs from the other sites. If the policy maker were in any of the sites but Riverside,
she would clearly be able to compare all the programs, except the one in Riverside, for the people
in her site in the overlap region. For the program in Riverside, she could not separate program
heterogeneity from LEC heterogeneity. A subtler case is that of a policy maker in Riverside, who
would be able to compare the effects of the other sites’ programs for the individuals in Riverside
in the overlap region, had they participated in these programs in any of the sites but Riverside.
However, she would not be able to make the same comparison if the programs had been implemented
in Riverside because she could not assess the effects of the programs in a labor market like Riverside,
which differs greatly from the labor markets of the other sites.
A potential threat to our analysis is differential substitution into alternative training programs
by the control individuals (Heckman and Smith, 2000; Heckman et al., 2000; Greenberg and Robins,
2011). Analyzing a subsample of the individuals in the NEWWS study, Freedman et al. (2000, Table A.1) find differential participation rates in self-initiated activities among the control individuals.
For instance, the control individuals in Atlanta (19%) and Riverside (around 26%) show much lower
participation rates than do those in the other three sites (around 40%), with most of the differences
driven by participation in post-secondary education and vocational training activities. Freedman
et al. (2000) partially attribute these differences to heterogeneity in individual characteristics, which
should be adjusted for by the individual-level covariates we control for. Although we lack the data
to directly deal with differential control group substitution, our results do not seem to support
the notion that this source of heterogeneity drives our results. For example, while we do not have
trouble adjusting the average outcomes in Atlanta (the site with the lowest participation among
the control individuals), we have the most trouble in Riverside (the site with the worst LEC).
Unfortunately, we cannot completely rule out this additional source of heterogeneity, nor can we
precisely determine the extent to which it may affect our results.12
11
When overlap is poor, the large standard errors of semiparametric propensity-score-based estimators of treatment
effects may reflect uncertainty more accurately than those of regression-based estimators, which extrapolate to regions of poor
overlap using parametric assumptions (Black and Smith, 2004; Imbens and Wooldridge, 2009).
12
Another potential source of heterogeneity across control groups is welfare rules. For instance, welfare waivers
4.4 Robustness Analysis
In this section we examine how robust our previous results are to different choices regarding the
GPS estimation and the imposition of overlap. For brevity, Table 5 presents summary results only
for the IPW with covariates estimator (the complete set of results is available in the Appendix).
Unless otherwise noted, the results discussed below do not depend on the particular GPS-based
estimator used. Table 5 shows the percentage of observations dropped after imposing overlap with
α = 0.0025, and RMSD results analogous to those presented in Tables 2 and 4. To ease comparisons,
the first row of Table 5 shows the corresponding results from those tables.
We start by including pre-randomization LEC measures in the GPS estimation. We employ
only one two-year pre-randomization LEC variable because including the growth rates for all three
of our LEC measures, or the measures in levels, identifies the sites (almost) perfectly. While we use
the growth rate of the employment-to-population ratio, which particular LEC measure we employ
does not affect the results. Overall, the results in the second row of the Table are comparable to
those we obtained in the previous section. These results illustrate that in some cases, adjusting
for pre-randomization LEC (and our set of individual covariates) helps little in reducing differences
across comparison groups when there are large differences in post-randomization LEC.
In row three, we add to the GPS model shown in row two the two-year growth rate of the
employment-to-population ratio after randomization. Here, the relevant comparison is between
the outcome not adjusted for LEC (first four columns in row three) and the outcome adjusted for
post-randomization LEC under our preferred specification (last four columns in row one). The
point estimates are very close, except for the outcome in differences using four sites.
It is important to note that the overlap worsens when including LEC measures in the GPS
estimation. The percentage of observations dropped due to overlap shown in Table 5 does not
provide a complete picture. For example, the inclusion of the pre-randomization LEC measure
greatly affects Riverside, causing the overlap rule to drop about 63 percent of the individuals in
this site, compared to around 49 percent dropped in our preferred model. The inclusion of both
the pre- and post-randomization LEC measures worsens the overlap even more: almost 46 percent
of the overall sample and more than 84 percent of the individuals in Riverside are dropped.13 The
deterioration in overlap quality results in wider confidence intervals than those of our preferred
specification, although an argument could be made that the increased variability reflects more
appropriately the uncertainty of the problem at hand.
In general, we conclude from rows two and three that our main results in the previous section
are robust to the inclusion of (pre- and post-) LEC measures in the GPS. Nevertheless, it seems
to us that our approach of analyzing the overlap in individual characteristics and LEC separately
makes the differences in LEC across the sites easier to detect and to assess, as opposed to analyzing
the overlap using a GPS specification that includes the LEC measures. The latter, however, may
be a more practical alternative in cases where lack of overlap in LEC is unlikely to be an issue.
12 (continued)
for Atlanta, and for Riverside after 1994, allowed employed welfare recipients to keep more of their welfare checks
(Hamilton et al., 2001). We believe it is unlikely that welfare rule differences across sites drive our results. For
example, although Riverside made its earnings policy more generous in the middle of the NEWWS study, its average
employment rate remained well below that of the other sites before and after the change. Unfortunately, as before,
we cannot rule out nor precisely determine the extent to which this source of heterogeneity could affect our results.
13
For reference, if instead of using α = 0.0025 as the quantile for the overlap trimming rule, we use the weakest
version of it (α = 0), we drop 34 percent of the observations overall and 79 percent of the observations in Riverside.
In rows four to seven of Table 5, we analyze the sensitivity of our results to the exclusion of
some sets of covariates (listed in the Table) from the GPS estimation. This may shed light on the
types of covariates needed to make the conditional independence assumption more plausible in our
setting. The results for the five sites remain basically unchanged regardless of the set of individual
covariates, although the balancing properties of the four models (not shown in the Table) are much
worse than those of our preferred specification. When considering sites with similar LEC (i.e.,
not including Riverside), however, the exclusion of the employment and earnings histories prior to
random assignment substantially increases the RMSD. This result is consistent with the key role the
literature gives to conditioning on labor market histories when implementing the nonexperimental
strategies we consider (Heckman et al., 1997; Heckman et al., 1998; Dehejia and Wahba, 1999,
2002; Hotz et al., 2005).14
In rows eight to ten of Table 5, we consider three different models to estimate the GPS. First, we
use a multinomial probit model, which does not have the unattractive “independence from irrelevant
alternatives” property of the multinomial logit model but is much more computationally intensive.
Second, we use a multinomial version of the “Inverse Probability Tilting” (IPT) estimator proposed
by Graham et al. (2011), which estimates the GPS by solving a just-identified GMM problem
imposing moment conditions that ensure that the mean of each covariate is exactly the same
across all sites when weighting the observations by the inverse of the GPS (the “exact balancing”
property). For the sake of comparability, we impose overlap in row nine, which breaks the exact
balancing property of the estimated GPS, but below we consider GPS specifications with that
property. Third, we use a combination of the multinomial logit and IPT estimators, which estimates
the GPS by solving an overidentified GMM problem that adds to the multinomial logit moments
the exact balancing moments of the IPT. Although this GPS estimator does not share the exact
balancing property of the IPT-based estimator, its balancing properties may be better than those
of the latter when balancing is assessed within a blocking-on-the-GPS approach. However, it has
the disadvantage of being much slower to compute than any of the other models. In general, the
results under any of the alternative GPS estimators are relatively close to those from our preferred
specification.
An issue for which there is no guidance in the literature is whether the GPS should be reestimated after imposing overlap. This can be important if, for example, we want to attain exact
balancing when weighting by the GPS after imposing overlap. We show several reestimation options
in rows eleven to fourteen of Table 5, and find that the RMSD results of these models are remarkably
close to those of our preferred specification.15 However, in terms of covariate balancing (not shown
14
When using a GPS specification with interactions and higher order terms of the covariates, in general, balancing
does not improve and the results are similar to those of our preferred specification.
15
For the GPS-based nonparametric partial mean estimator, this is not always true. In many cases, reestimating
the GPS produces unusually high  values for this estimator.
in the Table), we find that the models reestimated using the IPT estimator generally perform very
poorly when blocking on the GPS, while the models reestimated using MNL and MNL with IPT
moment conditions perform well both when weighting by and blocking on the GPS.
Finally, the last three rows of Table 5 explore the effect of not imposing overlap in the individual
covariates for three GPS models. For the five-site case, the RMSD results are similar to or slightly below
those of our preferred specification. However, for the four-site case, the results for the outcome in
levels are well above those of our preferred specification, especially for the IPT estimator. This
illustrates that the results can be poor if overlap is not imposed, even if it is possible to attain
exact balancing.16
5 Conclusion
Policy makers commonly face the challenging problem of comparing the effects different
programs would have in their own locality using data from programs that were each
implemented at a different location. We deem our results encouraging regarding the performance
of the nonexperimental methods we study in informing policy makers in such situations. In our
application, these methods reduce the differences in the average outcomes of our comparison groups
due to differences across locations in individual characteristics and local economic conditions (LEC)
down to levels closer to those that would be expected in an experiment. The keys are adjusting for a
rich set of covariates (including labor market histories) and having simultaneous overlap across sites
in individual characteristics and post-randomization LEC. Nevertheless, we find that the standard
errors from the nonexperimental methods can be much larger than those we would encounter in
an experiment if the overlap quality is poor, especially for the GPS-based estimators. This raises
the possibility that large differences in outcomes across sites may still persist. When there are
large differences in post-randomization LEC across locations, the methods we consider to explicitly
adjust for them can reduce but not eliminate these differences. We also find it encouraging that, in
general, our results are not very sensitive to the particular model used to estimate the GPS, nor to
the particular estimator employed to implement the unconfoundedness or DID strategies.
Some final remarks are in order. First, our results may have implications regarding the external
validity of randomized controlled trials. For instance, they suggest that when extrapolating the
results from a social experiment to different populations and/or points in time, we need to adjust for
differences in individual characteristics and LEC after treatment. Second, we believe the literature
would benefit from more research on how to best adjust for differences in LEC when their variation is
limited. Third, a valuable exercise would be to perform a Monte Carlo analysis of the finite sample
properties of the multiple-treatment estimators we consider. Our paper points out important
aspects to consider in such an analysis, for example, the number of treatments, the degree and
quality of overlap, the GPS model, and whether the GPS should be reestimated after imposing
overlap. A similar analysis for different ways to adjust for differences in LEC would also be valuable.
16 Interestingly, the poor balancing properties under a blocking approach of the IPT-based GPS models in Table 5
do not translate into poor performance of the GPS-based partial-mean estimators.
References
[1] Abadie, Alberto. 2005. Semiparametric Difference-in-Differences Estimators. The Review of
Economic Studies 72 (1): 1-19.
[2] Abadie, Alberto, and Guido W. Imbens. 2011. Bias-Corrected Matching Estimators for Average
Treatment Effects. Journal of Business and Economic Statistics 29 (1): 1-11.
[3] Black, Dan A., and Jeffrey A. Smith. 2004. How Robust Is the Evidence on the Effects of College
Quality? Evidence from Matching. Journal of Econometrics 121 (1-2): 99-124.
[4] Busso, Matias, John DiNardo, and Justin McCrary. 2009a. New Evidence on the Finite Sample Properties of Propensity Score Matching and Reweighting Estimators. IZA Discussion
Paper no. 3998.
[5] Busso, Matias, John DiNardo, and Justin McCrary. 2009b. Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects. University of Michigan.
[6] Cattaneo, Matias D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2): 138-154.
[7] Dehejia, Rajeev H. 2003. Was There a Riverside Miracle? A Hierarchical Framework for
Evaluating Programs With Grouped Data. Journal of Business and Economic Statistics 21
(1): 1-11.
[8] Dehejia, Rajeev H., and Sadek Wahba. 1999. Causal Effects in Nonexperimental Studies:
Reevaluating the Evaluation of Training Programs. Journal of the American Statistical
Association 94 (448): 1053-1062.
[9] Dehejia, Rajeev H., and Sadek Wahba. 2002. Propensity Score-Matching Methods for Nonexperimental Causal Studies. The Review of Economics and Statistics 84 (1): 151-161.
[10] Dolton, Peter, and Jeffrey A. Smith. 2011. The Impact of the UK New Deal for Lone Parents
on Benefit Receipt. IZA Discussion Paper no. 5491.
[11] Dyke, Andrew, Carolyn J. Heinrich, Peter R Mueser, Kenneth R. Troske, and Kyung-Seong
Jeon. 2006. The Effects of Welfare-to-Work Program Activities on Labor Market Outcomes.
Journal of Labor Economics 24 (3): 567-607.
[12] Fan, Jianqing, and Irène Gijbels. 1996. Local polynomial modelling and its applications. New
York: Chapman & Hall/CRC Press.
[13] Flores-Lagunes, Alfonso, Arturo Gonzalez, and Todd Neumann. 2010. Learning but not Earning? The Impact of Job Corps Training on Hispanic Youth. Economic Inquiry 48 (3):
651-667.
[14] Freedman, Stephen, Daniel Friedlander, Gayle Hamilton, JoAnn Rock, Marisa Mitchell, Jodi
Nudelman, Amanda Schweder, and Laura Storto. 2000. Evaluating Alternative Welfare-to-Work
Approaches: Two-Year Impacts for Eleven Programs. Washington, D.C.: U.S. Dept.
of Health and Human Services, Administration for Children and Families and Office of the
Assistant Secretary for Planning and Evaluation, and U.S. Dept. of Education.
[15] Friedlander, Daniel, and Philip K. Robins. 1995. Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods. The American Economic Review 85
(4): 923-937.
[16] Frölich, Markus. 2004. Programme Evaluation with Multiple Treatments. Journal of Economic
Surveys 18 (2): 181-224.
[17] Frölich, Markus, Almas Heshmati, and Michael Lechner. 2004. A microeconometric evaluation
of rehabilitation of long-term sickness in Sweden. Journal of Applied Econometrics 19 (3):
375-396.
[18] Galdo, Jose. 2008. Treatment Effects for Profiling Unemployment Insurance Programs: Semiparametric Estimation of Matching Models with Fixed Effects. Carleton University.
[19] Graham, Bryan S., Cristine Campos de Xavier Pinto, and Daniel Egel. 2011. Inverse probability
tilting for moment condition models with missing data. The Review of Economic Studies.
Forthcoming.
[20] Greenberg, David H., and Philip K. Robins. 2011. Have Welfare-To-Work Programs Improved
Over Time In Putting Welfare Recipients To Work? Industrial & Labor Relations Review.
Forthcoming.
[21] Hamilton, Gayle, Stephen Freedman, Lisa Gennetian, Charles Michalopoulos, Johana Walter,
Diana Adams-Ciardullo, Ann Gassman-Pines, Sharon McGroder, Martha Zaslow, Jennifer
Brooks, and Surjeet Ahluwalia. 2001. How Effective are Different Welfare-to-Work Approaches? Five-Year Adult and Child Impacts for Eleven Programs. Washington, D.C.:
U.S. Dept. of Health and Human Services, Administration for Children and Families and
Office of the Assistant Secretary for Planning and Evaluation, and U.S. Dept. of Education.
[22] Heckman, James J., Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. 1998. Characterizing
Selection Bias Using Experimental Data. Econometrica 66 (5): 1017-1098.
[23] Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. Matching as an Econometric
Evaluation Estimator: Evidence from Evaluating a Job Training Programme. The Review
of Economic Studies 64 (4): 605-654.
[24] Heckman, James, Neil Hohmann, Jeffrey Smith, and Michael Khoo. 2000. Substitution and
Dropout Bias in Social Experiments: A Study of an Influential Social Experiment. The
Quarterly Journal of Economics 115 (2): 651-694.
[25] Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. Orley C.
Ashenfelter and David Card, Volume 3, Part 1:1865-2097. Elsevier.
[26] Heckman, James J., and Jeffrey Smith. 2000. The Sensitivity of Experimental Impact Estimates (Evidence from the National JTPA Study). In Youth Employment and Joblessness in
Advanced Countries, ed. David G. Blanchflower and Richard B. Freeman, 331-356. NBER
Comparative Labor Markets Series. University of Chicago Press.
[27] Hirano, Keisuke, and Guido W. Imbens. 2004. The Propensity Score with Continuous Treatments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. Andrew Gelman and Xiao-Li Meng, 73-84. Hoboken, NJ: John Wiley & Sons.
[28] Holland, Paul W. 1986. Statistics and Causal Inference (with discussion). Journal of the American Statistical Association 81 (396): 945-970.
[29] Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2006. Evaluating the Differential
Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program. Journal of Labor Economics 24 (3): 521-566.
[30] Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. Predicting the efficacy of
future training programs using past experiences at other locations. Journal of Econometrics
125 (1-2): 241-270.
[31] Imai, Kosuke, and David A. van Dyk. 2004. Causal Inference With General Treatment Regimes.
Journal of the American Statistical Association 99 (467): 854-866.
[32] Imbens, Guido W. 1999. The Role of the Propensity Score in Estimating Dose-Response Functions. NBER Technical Working Paper Series no. 237.
[33] Imbens, Guido W. 2000. The Role of the Propensity Score in Estimating Dose-Response Functions. Biometrika 87 (3): 706-710.
[34] Imbens, Guido W. 2004. Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review. The Review of Economics and Statistics 86 (1): 4-29.
[35] Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. Recent Developments in the Econometrics
of Program Evaluation. Journal of Economic Literature 47 (1): 5-86.
[36] Khan, Shakeeb, and Elie Tamer. 2010. Irregular Identification, Support Conditions, and Inverse
Weight Estimation. Econometrica 78 (6): 2021-2042.
[37] Lechner, Michael. 2001. Identification and estimation of causal effects of multiple treatments
under the conditional independence assumption. In Econometric Evaluation of Labour
Market Policies, ed. Michael Lechner and Friedhelm Pfeiffer, 43-58. ZEW Economic Studies
13. New York: Springer-Verlag.
[38] Lechner, Michael. 2002a. Some Practical Issues in the Evaluation of Heterogeneous Labour
Market Programmes by Matching Methods. Journal of the Royal Statistical Society, Series
A 165 (1): 59-82.
[39] Lechner, Michael. 2002b. Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies. The Review of Economics and
Statistics 84 (2): 205-220.
[40] Lee, Wang-Sheng. 2011. Propensity score matching and variations on the balancing test. Empirical Economics. Forthcoming.
[41] Michalopoulos, Charles, Howard S. Bloom, and Carolyn J. Hill. 2004. Can Propensity-Score
Methods Match the Findings from a Random Assignment Evaluation of Mandatory
Welfare-to-Work Programs? The Review of Economics and Statistics 86 (1): 156-179.
[42] Mueser, Peter R., Kenneth R. Troske, and Alexey Gorislavsky. 2007. Using State Administrative Data to Measure Program Performance. The Review of Economics and Statistics 89
(4): 761-783.
[43] Newey, Whitney K. 1994. Kernel Estimation of Partial Means and a General Variance Estimator. Econometric Theory 10 (2): 233-253.
[44] Plesca, Miana, and Jeffrey A. Smith. 2007. Evaluating multi-treatment programs: theory and
evidence from the U.S. Job Training Partnership Act experiment. Empirical Economics 32
(2): 491-528.
[45] Roselius, Rebecca L. 1996. New Life for Non-Experimental Methods: Three Essays that Reexamine the Evaluation of Social Programs. Ph.D. Dissertation, University of Chicago.
[46] Smith, Jeffrey A., and Petra E. Todd. 2005. Does matching overcome LaLonde’s critique of
nonexperimental estimators? Journal of Econometrics 125 (1-2): 305-353.
Table 1. Descriptive statistics, covariate balance, and overlap: NEWWS data

A. Descriptive statistics (selected variables)
[Panel A reports, for each variable, raw means by site (ATL, DET, GRP, POR, RIV), p-values of Wald tests of joint equality of means across all sites (raw and IPW \b), and root-mean-square distances (raw and IPW \b). Variables: the outcome (ever employed in the 2 years after RA, in levels and in differences \a); demographic and family characteristics (black; age 30-39; age 40+; became mother as a teenager; never married; any child 0-5 years old; any child 6-12 years old); education (10th grade; 11th grade; grade 12 or higher; highest degree = high school or GED); welfare use history (on welfare for less than 2, 2-5, and 5-10 years; received welfare in Q1 and Q7 before RA); food stamps use history (received food stamps in Q1, Q6, and Q7 before RA); employment history (employed in Q1 and Q8 before RA; employed at RA, self-reported; ever worked full time 6+ months at the same job); earnings history (earnings in Q1 and Q8 before RA; any earnings in the year before RA, self-reported); and local economic conditions (employment/population ratio, average total earnings, and unemployment rate in the year of RA, and their growth rates, in log differences, over the 2 years before and after RA).]

B. Covariate balance: number of individual covariates for which the p-value <= 0.05 (total number of covariates = 52)

Joint equality of means across all sites:
  Raw means before overlap: 52
  GPS-based inverse probability weighting \c: 6

Difference of means, each site versus the other sites pooled:
                             ATL   DET   GRP   POR   RIV
  Raw means before overlap    42    49    34    36    48
  Blocking on GPS \c           1     3     4     0     6

C. Overlap
                                                     ATL     DET     GRP     POR     RIV    Total
Number of observations                             1,372   2,037   1,374   1,740   2,828   9,351
Number of observations after imposing overlap      1,299   1,933     969   1,263   1,437   6,901
Percentage of observations dropped due to overlap   5.3%    5.1%   29.5%   27.4%   49.2%   26.2%

Notes: \a The lagged outcome was standardized to be mean zero, so the grand mean is the same for the outcome in levels and in differences.
\b The IPW joint equality of means and IPW RMSD are calculated from GPS-based IPW means (after imposing overlap).
\c Tests calculated after imposing overlap.
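The joint-equality tests used throughout the tables can be sketched as follows (a minimal Wald-type statistic of our own, assuming independent site samples and large-sample normality; the paper's exact test may differ in detail):

```python
import numpy as np

def wald_joint_equality(y, site):
    """Wald statistic for H0: mean(y) is equal across all sites.

    Uses the K-1 contrasts (mean_k - mean_0) with their large-sample
    covariance; under H0 the statistic is approximately chi-squared with
    K-1 degrees of freedom.
    """
    labels = np.unique(site)
    means = np.array([y[site == g].mean() for g in labels])
    v = np.array([y[site == g].var(ddof=1) / (site == g).sum() for g in labels])
    diffs = means[1:] - means[0]
    cov = np.diag(v[1:]) + v[0]  # Var(m_k - m_0) on diagonal, Cov(.,.) = v[0] off it
    stat = float(diffs @ np.linalg.solve(cov, diffs))
    return stat, len(labels) - 1  # compare to a chi-squared(K-1) critical value

# Two sites with identical employment rates: the statistic is zero.
y = np.tile([0.0, 1.0], 100)
site = np.repeat([0, 1], 100)
print(wald_joint_equality(y, site))  # → (0.0, 1)
```

In the paper the same logic is applied to raw and to GPS-based IPW means; balance improves when fewer covariates reject equality after weighting.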
Table 2. Assessment measures of estimators of the average employment rate in the two years after random assignment - 5 sites
For each estimator and outcome version: p-value of the test of joint equality of means across sites; estimated root-mean-square distance (RMSD), with bootstrap 95% confidence interval; and RMSD relative to the in-levels raw mean.

Outcome in levels
Estimator                          P-value   RMSD [95% CI]         Rel. to raw
A. Outcome not adjusted for local economic conditions
Raw Mean - No Ovlp                 0.000     0.129 [0.115,0.146]   1.000
Raw Mean - Ovlp                    0.000     0.141 [0.120,0.161]   1.093
Linear regression-based
Partial Mean Linear X - No Ovlp    0.000     0.117 [0.102,0.135]   0.907
Partial Mean Linear X - Ovlp       0.000     0.119 [0.098,0.143]   0.922
Partial Mean Flex X - No Ovlp      0.000     0.119 [0.104,0.138]   0.922
Partial Mean Flex X - Ovlp         0.000     0.118 [0.098,0.143]   0.915
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.000     0.118 [0.089,0.151]   0.915
Nonparametric Partial Mean         0.000     0.120 [0.088,0.155]   0.930
IPW No Covariates                  0.000     0.122 [0.098,0.180]   0.946
IPW With Covariates                0.000     0.135 [0.109,0.170]   1.047
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.307     0.021 [0.012,0.044]   0.163
B. Outcome adjusted for local economic conditions
Raw Estimator - No Ovlp            0.000     0.068 [0.049,0.105]   0.527
Raw Estimator - Ovlp               0.000     0.061 [0.040,0.103]   0.473
Linear regression-based
Partial Mean Linear X - No Ovlp    0.000     0.057 [0.036,0.098]   0.442
Partial Mean Linear X - Ovlp       0.000     0.060 [0.039,0.102]   0.465
Partial Mean Flex X - No Ovlp      0.000     0.068 [0.046,0.107]   0.527
Partial Mean Flex X - Ovlp         0.000     0.062 [0.043,0.104]   0.481
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.004     0.041 [0.023,0.104]   0.318
Nonparametric Partial Mean         0.009     0.068 [0.038,0.127]   0.527
IPW No Covariates                  0.011     0.070 [0.040,0.137]   0.543
IPW With Covariates                0.000     0.071 [0.036,0.129]   0.550
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.320     0.020 [0.011,0.047]   0.155

Outcome in differences
Estimator                          P-value   RMSD [95% CI]         Rel. to raw
A. Outcome not adjusted for local economic conditions
Raw Mean - No Ovlp                 0.000     0.121 [0.104,0.140]   0.938
Raw Mean - Ovlp                    0.000     0.116 [0.096,0.140]   0.899
Linear regression-based
Partial Mean Linear X - No Ovlp    0.000     0.107 [0.088,0.132]   0.829
Partial Mean Linear X - Ovlp       0.000     0.101 [0.080,0.133]   0.783
Partial Mean Flex X - No Ovlp      0.000     0.122 [0.105,0.142]   0.946
Partial Mean Flex X - Ovlp         0.000     0.113 [0.095,0.140]   0.876
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.000     0.099 [0.074,0.140]   0.767
Nonparametric Partial Mean         0.000     0.119 [0.087,0.157]   0.922
IPW No Covariates                  0.000     0.124 [0.098,0.176]   0.961
IPW With Covariates                0.000     0.126 [0.097,0.169]   0.977
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.545     0.020 [0.010,0.047]   0.155
B. Outcome adjusted for local economic conditions
Raw Estimator - No Ovlp            0.000     0.098 [0.082,0.118]   0.760
Raw Estimator - Ovlp               0.001     0.104 [0.082,0.128]   0.806
Linear regression-based
Partial Mean Linear X - No Ovlp    0.008     0.069 [0.050,0.093]   0.535
Partial Mean Linear X - Ovlp       0.005     0.068 [0.046,0.096]   0.527
Partial Mean Flex X - No Ovlp      0.000     0.070 [0.050,0.095]   0.543
Partial Mean Flex X - Ovlp         0.000     0.066 [0.044,0.096]   0.512
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.540     0.065 [0.039,0.101]   0.504
Nonparametric Partial Mean         0.105     0.065 [0.037,0.107]   0.504
IPW No Covariates                  0.091     0.069 [0.046,0.152]   0.535
IPW With Covariates                0.088     0.081 [0.052,0.123]   0.628
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.567     0.020 [0.012,0.044]   0.155

Note: Bootstrap 95% confidence intervals in brackets (based on 1,000 replications). The relative RMSD uses the in-levels Raw Mean - No Ovlp estimate (0.129) as the base.
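The RMSD assessment measure in these tables summarizes how dispersed the site-specific estimates are; a stylized version (our reading of the measure as dispersion around the cross-site mean, which may differ in detail from the paper's exact definition) is:

```python
import numpy as np

def rmsd(site_estimates):
    """Root-mean-square distance of site-specific estimates from their mean.

    Smaller values mean the adjusted estimates are more similar across
    sites, as they would be in expectation under random assignment.
    """
    e = np.asarray(site_estimates, dtype=float)
    return float(np.sqrt(np.mean((e - e.mean()) ** 2)))

print(rmsd([0.6, 0.6, 0.6]))        # → 0.0
print(round(rmsd([0.4, 0.6]), 3))   # → 0.1
```

Read this way, an estimator that removes cross-site differences drives the measure toward the placebo-experiment benchmark, while a relative value near 1 means no improvement over the raw means.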
Table 3. Estimated average employment rate in the two years after random assignment - 5 sites
[Estimated average employment rates, with bootstrap 95% confidence intervals, by site (ATL, DET, GRP, POR, RIV), for the outcome in levels and in differences. Panels and estimators mirror Table 2: A. outcome not adjusted for local economic conditions and B. outcome adjusted for local economic conditions, each with the raw means (with and without overlap), the linear regression-based partial means (linear and flexible X, with and without overlap), the GPS-based estimators imposing overlap (parametric and nonparametric partial means; IPW with and without covariates), and the placebo experiment benchmark.]
Note: Bootstrap 95% confidence intervals in brackets (based on 1,000 replications).
Table 4. Assessment measures of estimators of the average employment rate in the two years after random assignment -
4 sites based on 5 sites estimation, excluding Riverside
For each estimator and outcome version: p-value of the test of joint equality of means across sites; estimated root-mean-square distance (RMSD), with bootstrap 95% confidence interval; and RMSD relative to the in-levels raw mean.

Outcome in levels
Estimator                          P-value   RMSD [95% CI]         Rel. to raw
A. Outcome not adjusted for local economic conditions
Raw Mean - No Ovlp                 0.000     0.077 [0.059,0.097]   1.000
Raw Mean - Ovlp                    0.000     0.069 [0.052,0.096]   0.896
Linear regression-based
Partial Mean Linear X - No Ovlp    0.041     0.030 [0.014,0.053]   0.390
Partial Mean Linear X - Ovlp       0.280     0.023 [0.011,0.052]   0.299
Partial Mean Flex X - No Ovlp      0.104     0.025 [0.011,0.049]   0.325
Partial Mean Flex X - Ovlp         0.434     0.019 [0.009,0.049]   0.247
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.245     0.033 [0.014,0.071]   0.429
Nonparametric Partial Mean         0.901     0.013 [0.010,0.068]   0.169
IPW No Covariates                  0.923     0.018 [0.014,0.148]   0.234
IPW With Covariates                0.901     0.019 [0.010,0.080]   0.247
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.188     0.023 [0.010,0.048]   0.299
B. Outcome adjusted for local economic conditions
Raw Estimator - No Ovlp            0.000     0.093 [0.074,0.115]   1.208
Raw Estimator - Ovlp               0.000     0.087 [0.068,0.116]   1.130
Linear regression-based
Partial Mean Linear X - No Ovlp    0.000     0.046 [0.027,0.069]   0.597
Partial Mean Linear X - Ovlp       0.016     0.040 [0.023,0.069]   0.519
Partial Mean Flex X - No Ovlp      0.002     0.042 [0.023,0.066]   0.545
Partial Mean Flex X - Ovlp         0.037     0.037 [0.020,0.066]   0.481
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.115     0.040 [0.018,0.078]   0.519
Nonparametric Partial Mean         0.460     0.028 [0.015,0.079]   0.364
IPW No Covariates                  0.578     0.035 [0.020,0.164]   0.455
IPW With Covariates                0.526     0.033 [0.014,0.094]   0.429
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.200     0.023 [0.010,0.048]   0.299

Outcome in differences
Estimator                          P-value   RMSD [95% CI]         Rel. to raw
A. Outcome not adjusted for local economic conditions
Raw Mean - No Ovlp                 0.000     0.078 [0.057,0.099]   1.013
Raw Mean - Ovlp                    0.000     0.070 [0.050,0.099]   0.909
Linear regression-based
Partial Mean Linear X - No Ovlp    0.064     0.032 [0.015,0.059]   0.416
Partial Mean Linear X - Ovlp       0.083     0.035 [0.018,0.068]   0.455
Partial Mean Flex X - No Ovlp      0.043     0.030 [0.015,0.054]   0.390
Partial Mean Flex X - Ovlp         0.262     0.023 [0.012,0.053]   0.299
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.430     0.033 [0.016,0.081]   0.429
Nonparametric Partial Mean         0.671     0.025 [0.014,0.081]   0.325
IPW No Covariates                  0.498     0.035 [0.018,0.104]   0.455
IPW With Covariates                0.545     0.032 [0.016,0.091]   0.416
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.394     0.022 [0.010,0.050]   0.286
B. Outcome adjusted for local economic conditions
Raw Estimator - No Ovlp            0.000     0.063 [0.045,0.091]   0.818
Raw Estimator - Ovlp               0.003     0.054 [0.035,0.091]   0.701
Linear regression-based
Partial Mean Linear X - No Ovlp    0.113     0.041 [0.018,0.084]   0.532
Partial Mean Linear X - Ovlp       0.038     0.052 [0.025,0.092]   0.675
Partial Mean Flex X - No Ovlp      0.047     0.038 [0.017,0.074]   0.494
Partial Mean Flex X - Ovlp         0.147     0.037 [0.016,0.075]   0.481
GPS-based (imposing Ovlp)
Parametric Partial Mean            0.843     0.020 [0.012,0.077]   0.260
Nonparametric Partial Mean         0.593     0.036 [0.018,0.106]   0.468
IPW No Covariates                  0.608     0.032 [0.017,0.103]   0.416
IPW With Covariates                0.767     0.024 [0.013,0.087]   0.312
Placebo experiment (benchmark)
Raw Mean - No Ovlp                 0.413     0.022 [0.009,0.051]   0.286

Note: Bootstrap 95% confidence intervals in brackets (based on 1,000 replications). The relative RMSD uses the in-levels Raw Mean - No Ovlp estimate (0.077) as the base.
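The bracketed intervals throughout the tables are percentile bootstrap intervals from 1,000 replications; a minimal sketch (our own code, resampling individuals i.i.d., which abstracts from any stratification or clustering the actual resampling scheme may use):

```python
import numpy as np

def bootstrap_ci(data, stat, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Resamples the data with replacement `reps` times and returns the
    alpha/2 and 1 - alpha/2 quantiles of the resampled statistics.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    draws = [stat(data[rng.integers(0, n, size=n)]) for _ in range(reps)]
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Example: 95% interval for a mean employment rate on simulated 0/1 data.
emp = np.random.default_rng(1).binomial(1, 0.6, size=500).astype(float)
print(bootstrap_ci(emp, np.mean))
```

For a statistic like the RMSD of the adjusted site means, `stat` would re-run the whole estimation on each resample, which is why the intervals widen sharply when overlap is poor.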
Table 5. Comparison of alternative GPS specifications: RMSD of estimators based on the IPW with covariates estimator
Outcome: average employment rate in the two years after random assignment
[RMSD estimates, with bootstrap 95% confidence intervals, in eight columns: outcome not adjusted and adjusted for local economic conditions, each for 5 sites and for 4 sites \b, each with the outcome in levels and in differences. Rows, with the percentage of observations dropped due to overlap \a:
- Preferred specification: Multinomial Logit (MNL), from Tables 2 & 4 (26.2%).
- Including LEC measures in the GPS \c: MNL including the pre-RA LEC measure (28.0%); MNL including pre- and post-RA LEC measures (45.7%).
- Alternative sets of covariates in the GPS \d: MNL with alternative sets 1-4 (8.2%, 13.9%, 25.6%, 15.9%).
- Alternative GPS models \e: Multinomial Probit (23.1%); IPT (32.3%); MNL with IPT moment conditions (25.5%).
- GPS reestimation after imposing overlap: MNL reestimated as MNL (26.2%); MNL reestimated as IPT (26.2%); IPT reestimated as IPT (32.3%); MNL reestimated as MNL with IPT moment conditions (26.2%).
- Not imposing overlap: MNL, IPT, and MNL with IPT moment conditions (0.0% each).]
Notes: Bootstrap 95% confidence intervals in brackets (based on 1,000 replications).
\a Except in the cases where the overlap condition is not imposed, a value q = 0.025 is used to determine the individuals who satisfy the overlap condition.
\b Results for 4 sites are based on the 5-site estimation, excluding Riverside.
\c The pre- and post-RA LEC measures are the growth rates of the employment/population ratio in the two years before and after random assignment, respectively.
\d Alternative set 1: demographic, education, and family characteristics. Alternative set 2: set 1 plus long-term welfare use history and housing type/stability. Alternative set 3: set 2 plus welfare and food stamps use history in the 7 quarters prior to random assignment. Alternative set 4: set 2 plus employment and earnings history in the 8 quarters prior to random assignment. The specific variables included in each set are provided in the Appendix.
\e IPT refers to Inverse Probability Tilting and is estimated by GMM with moment conditions imposing exact balancing of the covariates. MNL with IPT moment conditions estimates a Multinomial Logit model by GMM as an overidentified model in which the exact-balancing conditions are imposed as additional moments. The Multinomial Logit (MNL) and Multinomial Probit models are estimated by maximum likelihood. See the Appendix for further details.
Figure 1. Overlap quality based on GPS distribution
[Five panels, one per site, plot the overlap quality of Pr(Treatment = site | X) for ATL, DET, GRP, POR, and RIV. Each panel shows the number of individuals in and not in the site whose GPS falls in each percentile interval ([0,1], (1,5], (5,10], (10,25], (25,50], (50,75], (75,90], (90,95], (95,99], (99,100]), before and after imposing overlap.]
Notes: Outside wide bars and inside thin bars display the number of individuals whose GPS falls in a percentile interval, before and after imposing overlap, respectively. Percentile intervals in each site are constructed using the GPS distribution, before imposing overlap, for individuals in that site only.
Figure 2. Local economic conditions overlap, by site and random assignment cohort
[Six panels plot, for each site (ATL, DET, GRP, POR, RIV) and random assignment cohort (1991-1994), the employment/population ratio, the unemployment rate, and average real earnings ($1,000), in years 1 and 2 before RA and in years 1 and 2 after RA.]