Comparing Treatments across Labor Markets: An Assessment of Nonexperimental Multiple-Treatment Strategies∗

Carlos A. Flores†   Oscar A. Mitnik‡

May, 2012

Abstract

We consider the problem of using data from several programs, each implemented at a different location, to compare what their effect would be if they were implemented at a specific location. In particular, we study the effectiveness of nonexperimental strategies in adjusting for differences across comparison groups arising from two sources. First, we adjust for differences in the distribution of individual characteristics simultaneously across all locations by using unconfoundedness-based and conditional difference-in-difference methods for multiple treatments. Second, we explicitly adjust for differences in local economic conditions. We stress the importance of analyzing the overlap of, and adjusting for, local economic conditions after program participation. Our results suggest that the strategies studied are valuable econometric tools for the problem we consider, as long as we adjust for a rich set of individual characteristics and have sufficient overlap across locations for both individual and local labor market characteristics. Our results show that the overlap analysis of these two sets of variables is critical for identifying non-comparable groups and they illustrate the difficulty of adjusting for local economic conditions that differ greatly across locations.

∗ Detailed comments from two anonymous referees and an editor greatly improved the paper and are gratefully acknowledged. We are also grateful for comments by Chris Bollinger, Matias Cattaneo, Alfonso Flores-Lagunes, Laura Giuliano, Guido Imbens, Kyoo il Kim, Carlos Lamarche, Phil Robins, Jeff Smith, and seminar participants at the 2009 AEA meetings, 2009 SOLE meetings, Second Joint IZA/IFAU Conference on Labor Market Policy Evaluation, 2009 Latin American Econometric Society Meeting, Third Meeting of the Impact Evaluation Network, Second UCLA Economics Alumni Conference, Abt Associates, Federal Deposit Insurance Corporation, Institute for Defense Analyses, Interamerican Development Bank, Queen Mary University of London, Tulane University, University of Freiburg, University of Kentucky, University of Miami, University of Michigan, and VU University Amsterdam. Bryan Mueller and Yongmin Zang provided excellent research assistance. This is a heavily revised version of IZA Discussion Paper No. 4451, “Evaluating Nonexperimental Estimators for Multiple Treatments: Evidence from Experimental Data”. The usual disclaimer applies. A supplemental Appendix is available online at http:// [to be determined].
† Department of Economics, University of Miami. Email: [email protected].
‡ Department of Economics, University of Miami and IZA. Email: [email protected].

1 Introduction

We consider the problem of using data from several programs, each implemented at a different location, to compare what their effect would be if they were implemented at a specific location. One example would be a local government considering the implementation of one of several possible job training programs. This is a problem many policy makers are likely to face given the heterogeneity of job training programs worldwide and the scope that many local governments have in selecting their programs.
In this case the policy maker may want to perform an experiment in her locality and randomly assign individuals to all the possible programs to assess the effectiveness of the programs and choose the best one for her locality. This, however, would require implementing all the “candidate” programs in her locality, which is rarely feasible. Thus, the most likely scenario is that the policy maker would have to rely on nonexperimental methods and use data on implementations of the programs at various locations and times. In this paper, we assess the likely effectiveness of nonexperimental strategies for multiple treatments to inform a policy maker about the effects of programs implemented in different locations if they were implemented at her location.

This paper is closely related to the literature analyzing whether observational comparison groups from a different location than the treated observations can be used to estimate average treatment effects by employing nonexperimental identification strategies (Friedlander and Robins, 1995; Heckman et al., 1997; Heckman et al., 1998; Michalopoulos et al., 2004). Our work is also closely related to Hotz et al. (2005), who analyze the problem of predicting the average effect of a new training program using data on previous program implementations at different locations. Our paper differs from the rest of the literature, which focuses on pairwise comparisons between groups in different locations. Instead, we focus on comparing all groups simultaneously, which requires the use of nonexperimental methods for multiple treatments. This is an important distinction in practice. For instance, suppose that the policy maker above was given the average effects of each of the programs estimated from the binary-treatment versions of the nonexperimental strategies we study in this paper. If so, each estimate would give the average effect for a particular subpopulation, namely, those individuals in the “common support” region of the corresponding pairwise comparison (Heckman et al., 1997; Dehejia and Wahba, 1999, 2002). Thus, the policy maker would be unable to compare the average effects across the different programs simultaneously because the effects are likely to apply to different subpopulations.

We employ data from the National Evaluation of Welfare-to-Work Strategies (NEWWS) study, which combines high quality survey and administrative data. The NEWWS evaluated a set of social experiments conducted in the United States in the 1990s in which individuals in several locations were randomly assigned to a control group or to different training programs. These data allow us to restrict the analysis to a relatively well-defined target population, which is welfare recipients in the different locations. This is consistent with our example above, where we would expect the policy maker to be interested in comparing programs that applied to relatively similar target populations. Even in cases like this, differences across locations in the average outcomes of individuals who participated in the site-specific programs can arise because of (at least) three important factors (Hotz et al., 2005). First, the effectiveness of the programs can differ (this is the part that the policy maker in the example above would like to know). Second, the distribution of individual characteristics across sites may be different (e.g., individuals may be more educated, on average, in some locations). Third, the labor market conditions may differ across locations (e.g., the unemployment rate may be higher at some sites).
In this paper, we focus on the control groups in the different locations and use the insight that since the control individuals were excluded from all training services, the control groups should be comparable after adjusting for differences in individual characteristics and labor market conditions. If by using nonexperimental strategies to adjust for those differences we are able to simultaneously equalize average outcomes among the control groups in the different locations, we would expect the average effects resulting from the implementation of those strategies on the treated observations to reflect actual program heterogeneity across the sites (e.g., Heckman et al., 1997; Heckman et al., 1998; Hotz et al., 2005).1

Footnote 1: Another potential source of heterogeneity across control groups is substitution by their members into alternative training programs (Heckman and Smith, 2000; Heckman et al., 2000; Greenberg and Robins, 2011). Our analysis leads us to believe that it is unlikely that this potential source of heterogeneity drives our results, as we discuss later.

The first source of heterogeneity across sites that we consider is the distribution of individual characteristics. We consider unconfoundedness-based and conditional difference-in-difference strategies for multiple treatments to adjust simultaneously across all sites for this source of heterogeneity. This is related to the literature on program evaluation using nonexperimental methods, most of which focuses on the binary-treatment case (see Heckman et al., 1999; Imbens and Wooldridge, 2009). More recently, however, there has been a growing interest in methods to evaluate multiple treatments (Lechner, 2001, 2002a, 2002b; Frölich, 2004; Plesca and Smith, 2007) and treatments that are multivalued or continuous (Imbens, 2000; Imai and van Dyk, 2004; Hirano and Imbens, 2004; Abadie, 2005; Cattaneo, 2010); we employ some of these methods in this paper. In addition, we study the use of the generalized propensity score (GPS), defined as the probability of receiving a particular treatment conditional on covariates, in assessing the comparability of individuals among the different comparison groups, and we propose a strategy to determine the common support region that is less stringent than those previously used in the multiple treatment literature (Lechner, 2002a, 2002b; Frölich et al., 2004). We also consider several GPS specifications and GPS-based estimators, and we analyze the performance of the nonexperimental strategies considered when controlling for alternative sets of covariates.

The second source of heterogeneity across the comparison groups that we consider is local economic conditions (LEC). The literature on program evaluation methods stresses the importance of comparing treatment and nonexperimental comparison groups from the same local labor market (Friedlander and Robins, 1995; Heckman et al., 1997; Heckman et al., 1998). In our setting, the analyst likely has information about specific programs only when they were implemented in different local labor markets. Thus, we consider the challenging problem of explicitly adjusting for differences in LEC across our comparison groups. The literature on how to adjust for these differences is scarce compared to that on how to adjust for individual characteristics, as we discuss later. We stress the importance of adjusting for differences in LEC after program participation, as they can be a significant source of heterogeneity in average outcomes across sites.
We also highlight the importance of analyzing the overlap in LEC across locations, as well as the advantages of separating such analysis from that of the overlap in individual characteristics.

Our results suggest that the strategies studied are valuable econometric tools when evaluating the likely effectiveness of programs implemented at different locations if they were implemented at another location. The keys are adjusting for a rich set of individual characteristics (including labor market histories) and having sufficient overlap across locations in both the individual and local labor market characteristics. More generally, our results also illustrate that in the multiple treatment setting, there may be treatment groups for which it is extremely difficult to find valid comparison groups, and that the strategies we consider perform poorly in such cases. Importantly, the overlap analyses (both in individual characteristics and LEC) are critical for identifying those noncomparable groups. In addition, our results show that when the differences in post-randomization LEC are large across sites, the methods we consider to adjust for them substantially reduce this source of heterogeneity, although they are not able to eliminate it. These results highlight the importance of adjusting for post-treatment LEC.

The paper is organized as follows. Section 2 presents our general setup and the econometric methods we employ; Section 3 describes the data; Section 4 presents the results; and Section 5 provides our conclusions. An online Appendix provides further details on the econometric methods and data. It also presents supplemental estimation results, including those from several robustness checks briefly mentioned throughout the paper and those complementing the analysis in Section 4.

2 Methodology

2.1 General Framework

We use data from a series of randomized experiments conducted in several locations. In each location, individuals were randomly assigned either to participate in a site-specific training program or to a control group that was excluded from all training services. We study whether the nonexperimental strategies considered in this paper can simultaneously equalize average outcomes for control individuals across all the sites by adjusting for differences in the distribution of individual characteristics and local economic conditions.

Each unit in our sample, i = 1, 2, …, N, comes from one of K possible sites. Let W_i ∈ {1, 2, …, K} denote the location of individual i. For our purposes, it is convenient to write the potential outcomes of interest as Y_i(0, w), where w denotes the site and the zero only emphasizes the fact that we focus exclusively on the control groups. Thus, Y_i(0, w) is the outcome (e.g., employment status) that individual i would obtain if that person were a control in site w. The data we observe for each unit are (W_i, X_i, Y_i), where X_i is a set of pre-treatment covariates and Y_i = Y_i(0, W_i). Let A be a subset of the support of X, 𝒳. Our parameters of interest in this paper are

β_w ≡ β_w(A) = E[Y(0, w) | X ∈ A], for w = 1, 2, …, K.   (1)

The region A can include all or part of the support of X and is determined by the overlap region, which we discuss below. To simplify the notation, we omit the conditioning on X ∈ A unless necessary for clarity. The object in (1) gives the average potential outcome under the control treatment at location w for someone with X ∈ A randomly selected from any of the K sites. This is different from E[Y(0, w) | W = w, X ∈ A], which gives the average control outcome at site w for someone with X ∈ A randomly selected from site w.
While this last quantity is identified from the data because of random treatment assignment within site w, β_w is not. Rather than focusing on (1), we could have focused on E[Y(0, w) | W = k, X ∈ A] for w = 1, …, K, which gives the average outcome for the individuals in a particular site k ∈ {1, …, K} within the overlap region for site k. By focusing on the average over all K sites in (1), we avoid having our results depend on the selection of site k as a reference point. However, the estimation problem becomes more challenging because we need to find comparable individuals in each of the K sites for every individual in the K sites (as opposed to finding comparable individuals in each site for only those in site k).

We assess the performance of the strategies presented below in several ways. First, given estimates β̂_1, …, β̂_K of the corresponding parameters in (1), we test the joint hypothesis

β_1 = β_2 = … = β_K.   (2)

One drawback of this approach is that we could fail to reject the null hypothesis in (2) only because the variance of the estimators is high, rather than because all the estimated β̂_w's are sufficiently close to each other. Hence, we focus our analysis mostly on an overall measure of distance among the estimated means. Letting β̄ = K⁻¹ Σ_{w=1}^{K} β̂_w, we define the root mean square distance as

RMSD = (1/|β̄|) √[(1/K) Σ_{w=1}^{K} (β̂_w − β̄)²],   (3)

where we divide by |β̄| to facilitate its interpretation and the comparison across outcomes. Due to pure sample variation, we would never expect the RMSD to equal zero, even if the sites were randomly assigned. In Section 4, we present benchmark values of the RMSD as a reference point for reasonable values of this measure. These benchmarks give the value of the RMSD that would be achieved by an experiment in a setting where β_1 = β_2 = … = β_K holds.2

Footnote 2: In addition to the RMSD in (3), we considered two other measures: the mean absolute distance and the maximum pairwise distance among all the estimated means. All three measures give similar results.

In the following two sections, we discuss the strategies we employ to adjust for differences across the sites in the distribution of individual characteristics and local economic conditions.

2.2 Adjusting for Individual Characteristics

Nonexperimental Strategies. We consider two identification strategies to adjust for differences in individual characteristics across sites. Let 1(·) be the indicator function, which equals one if the event in its argument is true and zero otherwise. The first strategy is based on the following assumption:

Assumption 1 (Unconfounded site) 1(W_i = w) ⊥⊥ Y_i(0, w) | X_i, for all w ∈ {1, 2, …, K}.

This assumption is an extension to the multiple-site setting of the one in Hotz et al. (2005), and in the program evaluation context it is called weak unconfoundedness (Imbens, 2000). The problem when estimating β_w is that we only observe Y_i(0, w) for individuals with 1(W_i = w) = 1, while it is missing for those with 1(W_i = w) = 0. Assumption 1 implies that conditional on X, the two groups are comparable, which means that we can use the individuals with 1(W_i = w) = 1 to learn about β_w. A key feature of Assumption 1 is that all that matters is whether or not unit i is in site w; hence, the actual site of an individual not in w is not relevant for learning about β_w. In addition to Assumption 1, we impose an overlap assumption that guarantees that, in infinite samples, we are able to find comparable individuals in terms of the covariates for all units with W ≠ w in the subsample with W = w, for every site w.

Assumption 2 (Simultaneous strict overlap) 0 < Pr(W = w | X = x), for all w and x ∈ 𝒳.
Assumption 2 is stronger than that of the binary treatment case, as it requires that for every individual in the K sites, we find comparable individuals in terms of covariates in each of the K sites. Intuitively, while we want to learn about the potential outcomes in all K sites for every individual, we observe only one of those potential outcomes, so the “fundamental problem of causal inference” (Holland, 1986) is worsened. When the strict overlap assumption is violated, semiparametric estimators of β_w can fail to be √N-consistent, which can lead to large standard errors and numerical instability (Busso et al., 2009a, 2009b; Khan and Tamer, 2010). In practice, Assumption 2 may hold only for some of the sites and/or regions in the support of X, in which case one may have to restrict the analysis to those sites and/or regions with common support.

The second identification strategy we consider to adjust for differences in individual characteristics across locations is a multiple-treatment version of the conditional difference-in-differences (DID) strategy (Heckman et al., 1997; Heckman et al., 1998; Smith and Todd, 2005). In our setting, the main identification assumption of this strategy is 1(W_i = w) ⊥⊥ {Y_it(0, w) − Y_it0(0, w)} | X_i, for all w ∈ {1, 2, …, K}, where t and t0 are time periods after and before random assignment, respectively. This assumption is similar to Assumption 1 but employs the difference of the potential outcomes over time. This identification strategy allows for systematic differences in time-invariant factors between the individuals in site w and those not in site w, for every site w. Thus, DID methods can also help to adjust for time-invariant heterogeneity in local economic conditions across sites (Heckman et al., 1997; Heckman et al., 1998; Smith and Todd, 2005).

Estimators. We consider several estimators to implement the strategies discussed above. For simplicity, we present the estimators in the context of the unconfoundedness strategy. The DID estimators simply replace the outcome in levels with the outcome in differences. Under Assumption 1 we can identify β_w (omitting for brevity the conditioning on X ∈ A) as

β_w = E[E[Y(0, w) | X]] = E[E[Y(0, w) | 1(W = w) = 1, X]] = E[E[Y | W = w, X]],

where, following similar steps to those used in a binary treatment setting (e.g., Imbens, 2004), the first equality uses iterated expectations and the second uses Assumption 1. This result suggests estimating β_w using a partial mean, which is an average of a regression function over some of its regressors while holding others fixed (Newey, 1994). To estimate β_w, we first estimate the expectation of Y conditional on W and X, and then we average this regression function over the covariates holding W fixed at w. The first estimator we consider uses a linear regression of Y on the site dummies and X to model the conditional expectation E[Y | W = w, X = x]. We refer to it as the partial mean linear estimator. We also consider a more flexible model of that expectation by including polynomials of the continuous covariates and various interactions, and refer to the corresponding estimator as the partial mean flexible estimator.

The rest of the estimators we consider are based on the generalized propensity score (GPS). Its binary-treatment version, the propensity score, is widely used in the program evaluation literature (Heckman et al., 1999; Imbens and Wooldridge, 2009).
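To make the estimation steps above concrete, the following is a minimal sketch, in Python with numpy, of the partial mean linear estimator and of the RMSD assessment measure in (3). The function names and implementation details are our own illustrative choices, not taken from the paper's actual code.

```python
import numpy as np

def partial_mean_linear(y, w, x, K):
    """Partial mean linear estimator of beta_w = E[E[Y | W = w, X]].

    y: (N,) outcomes; w: (N,) integer site labels in {0, ..., K-1};
    x: (N, p) covariates. Fits a linear regression of y on site dummies
    and x, then averages the fitted values over the whole covariate
    sample while holding the site fixed at w (a partial mean).
    """
    n = len(y)
    d = np.zeros((n, K))
    d[np.arange(n), w] = 1.0               # site dummies; no separate intercept
    coef, *_ = np.linalg.lstsq(np.hstack([d, x]), y, rcond=None)
    alpha, gamma = coef[:K], coef[K:]
    # Averaging alpha_w + x_i' gamma over all i gives beta_hat_w.
    return alpha + (x @ gamma).mean()

def rmsd(beta_hat):
    """Root mean square distance among the estimated site means, eq. (3)."""
    b = np.asarray(beta_hat, dtype=float)
    return np.sqrt(np.mean((b - b.mean()) ** 2)) / abs(b.mean())
```

The partial mean flexible estimator would simply augment x with polynomials of the continuous covariates and interactions before the call, and the DID versions would replace y with the outcome in differences.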
In our setting, the GPS is defined as the probability of belonging to a particular site conditional on the covariates (Imbens, 2000):

p_w(x) = Pr(W = w | X = x).   (4)

We let p_i = p_{W_i}(X_i) be the probability that individual i belongs to the location she actually belongs to, and let p_wi = p_w(X_i) be the probability that i belongs to a particular location w conditional on her covariates. Clearly, p_wi = p_i for those units with W_i = w. We consider GPS-based estimators of β_w based on partial means and inverse probability weighting.

Imbens (2000) shows that the GPS can be used to estimate β_w within a partial mean framework by first estimating the conditional expectation of Y as a function of W and p = p_W(X), and then by averaging this conditional expectation over p_w(X) while holding W fixed at w. We consider two estimators based on this approach. The first follows Hirano and Imbens (2004) and estimates E[Y | W = w, p_w(X) = p] by employing a linear regression of the outcome on the site dummies and the interactions of the site dummies with p and p². The second estimator employs a local polynomial regression of order one to estimate that regression function. We refer to these two estimators of β_w as the GPS-based parametric and nonparametric partial mean estimators, respectively.

Finally, we also employ the GPS in a weighting scheme. Intuitively, reweighting observations by the inverse of p_i creates a sample in which the covariates are balanced across all locations. The first inverse probability weighting (IPW) estimator of β_w that we consider equals the coefficient on 1(W = w) from a weighted linear regression of Y on all the site indicator variables, with weights equal to 1/p̂_i. We also consider an estimator that combines IPW with linear regression by adding covariates to this regression. We refer to it as the IPW with covariates estimator.

Determination of the Overlap Region. A key ingredient in the implementation and performance of the estimators previously discussed is the identification of regions of the data in which there is overlap in the covariate distributions across the different comparison groups. The usual approach in the binary-treatment literature is to focus on the “overlap” or “common support” region by dropping those individuals whose propensity score does not overlap with the propensity score of those in the other treatment arm (Heckman et al., 1997; Heckman et al., 1998; Dehejia and Wahba, 1999, 2002). We propose a GPS-based trimming procedure for determining the overlap region in the multiple-treatment setting that is motivated by a procedure commonly used when the treatment is binary (Dehejia and Wahba, 1999, 2002).

We determine the overlap region in two steps. First, we define the overlap region with respect to a particular site w, C_w, as the subsample of individuals whose conditional probability of being in site w (i.e., p_wi) is greater than a cutoff value, which is given by the highest of the α-quantiles of the p_w(X) distribution for two groups: those individuals who are in site w and those who are not. This rule ensures that within the subsample C_w, for every individual with W_i ≠ w we can find comparable individuals with W_i = w, which implies that we can estimate the mean of Y(0, w) for the individuals in C_w under Assumption 1. Second, we define the overlap region as the subsample given by those individuals who are simultaneously in the overlap regions for all the different sites. This step guarantees that we estimate the mean of Y(0, w), for w = 1, 2, …, K, for the set of individuals who are simultaneously comparable in all the sites.
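Before stating the trimming rule formally, here is a minimal sketch of the GPS-based pipeline just described: estimating the GPS with a multinomial logit, trimming to the overlap region, and computing IPW estimates of β_w. We assume scikit-learn for the multinomial logit and integer site labels 0, …, K−1; the paper's weighted-regression implementation of the IPW estimator is simplified here to the equivalent weighted site means.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_gps(x, w):
    """Multinomial-logit GPS: (N, K) matrix with entries p_w(X_i), eq. (4)."""
    mnl = LogisticRegression(max_iter=1000).fit(x, w)
    return mnl.predict_proba(x)

def overlap_region(p, w, alpha=0.0025):
    """Boolean mask for the overlap region: the intersection of the C_w.

    A unit stays in C_w when its p_w is at least the larger of the two
    alpha-quantiles of p_w, computed among units in site w and among
    units not in site w; rule (5) keeps units that are in every C_w.
    """
    n, K = p.shape
    keep = np.ones(n, dtype=bool)
    for j in range(K):
        in_j = (w == j)
        cutoff = max(np.quantile(p[in_j, j], alpha),
                     np.quantile(p[~in_j, j], alpha))
        keep &= p[:, j] >= cutoff
    return keep

def ipw_means(y, w, p, K):
    """IPW estimates of beta_w: weighted mean of y in site w, weights 1/p_i."""
    p_own = p[np.arange(len(w)), w]        # GPS at each unit's own site
    return np.array([np.average(y[w == j], weights=1.0 / p_own[w == j])
                     for j in range(K)])

# Typical use: p = estimate_gps(x, w); keep = overlap_region(p, w);
# betas = ipw_means(y[keep], w[keep], p[keep], K)
```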
More formally, let C_w = {i : p_wi ≥ max{q_α^{i: W_i = w}, q_α^{i: W_i ≠ w}}}, where q_α^S denotes the α-th quantile of the distribution of p_w(X) over those individuals in subsample S. We define the overlap region as3

C = ∩_{w=1}^{K} C_w.   (5)

Lechner (2002a, 2002b) and Frölich et al. (2004) use a rule akin to (5) but define C_w = {i : p_wi ∈ [max_{w′=1,…,K} {min_{i: W_i = w′} p_wi}, min_{w′=1,…,K} {max_{i: W_i = w′} p_wi}]}. Our rule is less stringent in two important ways. First, as implied by Assumption 1, it does not require the comparison of p_wi among all locations but only between the groups of individuals with W_i = w and W_i ≠ w. Second, it identifies individuals outside the overlap region based only on the lower tail of the distributions of p_w(X). This is because we impose the overlap condition for all sites, so it is not necessary to look at the upper tail of the GPS distributions.4

Footnote 3: Alternatively, the overlap region could be defined as C = ∩_{w=1}^{K} {x ∈ 𝒳 : p_w(x) ≥ max{q_α^{i: W_i = w}, q_α^{i: W_i ≠ w}}}.

Footnote 4: For example, when estimating the average treatment effect in the binary-treatment setting, both tails of the distribution of the propensity score are considered. This is equivalent to considering the two lower tails of the distributions of Pr(W = 1 | X) and Pr(W = 0 | X) because Pr(W = 1 | X) + Pr(W = 0 | X) = 1.

2.3 Adjusting for Local Economic Conditions

Even after adjusting for differences in individual characteristics, there could still be differences in the average outcomes of the comparison groups across sites because of differences in LEC. The main difficulty in adjusting for differences in LEC is that these variables take on the same value within each location, or within each cohort in a location. When each comparison group belongs to a different location, as in our setting, the variation of the LEC variables can be limited and they are likely to fail the overlap condition. In such cases, the literature provides little guidance on how to adjust for differences in LEC across comparison groups.

In the context of evaluating training programs, a strategy followed in practice to control for differences in LEC across comparison groups is to include LEC measures and/or geographic indicators in the propensity score (Roselius, 1996; Heckman et al., 1998; Dyke et al., 2006; Mueser et al., 2007; Dolton and Smith, 2011). Using regression-adjustment methods, Hotz et al. (2006) and Flores-Lagunes et al. (2010) include as regressors measures of post-treatment LEC. Our setting differs from most of these papers in that we do not have large time-series or cross-sectional variation in LEC. In a setting more similar to ours, Hotz et al. (2005) include two measures of pre-randomization LEC in the regression step of the bias-corrected matching estimator in Abadie and Imbens (2011).5

We pay particular attention to adjusting for differences across sites in LEC after randomization, which can be a significant source of heterogeneity in average outcomes. For example, Hotz et al. (2006) and Flores-Lagunes et al. (2010) find that it is very important to adjust for post-treatment LEC when evaluating training programs. The approach we use to adjust for post-randomization LEC consists of modeling the effect of the pre-randomization LEC on pre-randomization values of the outcomes, and then using the estimated parameters to adjust the outcomes of interest for the effect of post-randomization LEC. We implement our approach in two ways depending on whether we use the outcome in levels or in differences.
Assume that the potential outcomes Y(0, w) are separable functions of the individual characteristics and the LEC, and for the sake of presentation, ignore the presence of the covariates X. For the outcome in levels, we use the pre-randomization outcome and LEC to run the regression Y_it0 = Σ_{w=1}^{K} [1(W_i = w) · δ_w] + Z′_it0 γ + ε_it0, where t0 denotes a time period before random assignment, Z_it0 is a vector of LEC at time t0, and ε_it0 is an error term. The availability of different cohorts within each site in our data provides us with sufficient variation in the LEC to identify the parameter vector γ. We use the estimated coefficient γ̂ to adjust the outcome of interest Y_i as Ỹ_i = Y_i − Z′_i γ̂, where Z_i is the vector of post-randomization LEC, and apply the estimators discussed in Section 2.2 on the LEC-adjusted outcome Ỹ_i. For the outcome in differences, we follow a similar approach but employ in the regression the differences of the outcomes and LEC in two periods prior to randomization. In both cases (for the outcome in levels and in differences), we normalize the adjustments to have mean zero in order to allow for easier comparisons between the unadjusted and adjusted outcomes. Further details are provided in the Appendix.

Clearly, the flexibility of the specific model employed to adjust for differences across locations in LEC depends heavily on the available data and the amount of variation in the LEC. For instance, one may be able to estimate more flexible models if there is a large number of locations or cohorts in each location. In this paper, we are faced with the challenging (but not uncommon) problem that there is limited variation in the LEC measures (we have thirteen cohorts total), so it is difficult to use a more flexible model than the one we use.

Footnote 5: Dehejia (2003) uses an approach based on hierarchical models to deal with site effects when evaluating multisite training programs, although he does not directly address LEC differences. Galdo (2008) considers a setting where there is lack of overlap due to site and time effects, and proposes a procedure that removes those two fixed effects before implementing a matching estimator. As pointed out by Dehejia (2003), in a framework that aims at predicting the effects of programs if implemented in different sites, it is problematic to model site effects as fixed effects.

An attractive feature of our approach is that it is not estimator-specific, meaning that once we adjust the outcomes for differences in post-randomization LEC, we can employ any of the estimators from Section 2.2 to adjust for differences in individual characteristics. Another potentially attractive feature is that the effects of the LEC on the outcome variables are estimated using the outcomes prior to random assignment, so part of the analysis (e.g., modeling the effect of the LEC on outcomes) can be performed without resorting to the outcomes after randomization. Finally, note that while we do not explicitly adjust for pre-randomization LEC in our preferred specification, we include a rich set of welfare and labor market histories as covariates, which can proxy for those variables (Hotz et al., 2005). In Section 4, we also present results from models that include measures of both pre- and post-randomization LEC in the GPS estimation.
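As a rough illustration, here is a minimal sketch of the levels version of the LEC adjustment described above, under the stated separability assumption; the variable names are hypothetical and the first-step regression is the simple dummies-plus-LEC specification from the text.

```python
import numpy as np

def lec_adjust(y_pre, z_pre, y_post, z_post, w, K):
    """Adjust the post-randomization outcome for post-randomization LEC.

    Step 1: regress the pre-randomization outcome y_pre on site dummies
    and the pre-randomization LEC z_pre to estimate gamma.
    Step 2: subtract z_post @ gamma_hat from y_post, demeaning the
    adjustment so adjusted and unadjusted outcomes are comparable.
    """
    n = len(y_pre)
    d = np.zeros((n, K))
    d[np.arange(n), w] = 1.0               # site dummies absorb delta_w
    coef, *_ = np.linalg.lstsq(np.hstack([d, z_pre]), y_pre, rcond=None)
    gamma = coef[K:]
    adj = z_post @ gamma
    return y_post - (adj - adj.mean())     # normalized to a mean-zero adjustment
```

The differences version would apply the same two steps to changes in the outcome and the LEC over two pre-randomization periods.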
3 Data

The data we use come from the National Evaluation of Welfare-to-Work Strategies (NEWWS), a multi-year study conducted in seven United States cities to compare the effects of different approaches for helping welfare recipients (mostly single mothers) to improve their labor market outcomes and leave public assistance. The individuals in each location were randomly assigned to different training programs or to a control group that was denied access to the training services offered by the program for a pre-set “embargo” period. The year of randomization differed across sites, with the earliest randomization taking place in 1991 and the latest in 1994. We rely on the public-use version of the NEWWS data, which is a combination of high quality survey and administrative data. For further details on the NEWWS study, see Hamilton et al. (2001).

We restrict the analysis to female control individuals in the five sites for which two years of pre-randomization individual labor market histories are available and in which individuals were randomized after welfare eligibility had been determined. The five sites we use are Atlanta, GA; Detroit, MI; Grand Rapids, MI; Portland, OR; and Riverside, CA (we exclude Columbus, OH and Oklahoma City, OK). The final sample size in our analysis is 9,351 women.6

Footnote 6: Michalopoulos et al. (2004) use the NEWWS data to perform binary comparisons between control groups in different locations. See the Appendix for a discussion of the differences between their sample and ours.

The outcome we use is an indicator variable equal to one if the individual was ever employed during the two years (quarters two to nine) following randomization, and zero otherwise.7 We focus on an outcome that is measured only up to two years after randomization because, in most sites, we cannot identify which control individuals were embargoed from receiving program services after year two. We define the outcome in differences as the original outcome minus an indicator equal to one if the individual was ever employed during the two years prior to randomization.

Footnote 7: Using one-year employment indicators instead does not change our conclusions substantively. We focus on an employment indicator because the public-use data available to us contain only the year of random assignment and nominal earnings measures. Thus, we could create only rough measures of real earnings. By focusing on employment, we also avoid having to deal with issues related to cost-of-living differences across sites.

Our data contain information on demographic and family characteristics, education, housing type and stability, pre-randomization welfare and food stamps use (seven quarters), and pre-randomization earnings and employment histories (eight quarters). The first five columns of Panel A in Table 1 show the averages of the outcomes and selected covariates for each site. As expected, there are important and usually large differences in the means of all the variables across sites (e.g., percentage of blacks). We employ three measures of LEC for the different cohorts within each site at the metropolitan statistical area (MSA) level: employment-to-population ratio, average real earnings, and unemployment rate. The bottom of Panel A presents the average of these variables during the calendar year of random assignment, as well as their two-year growth rates in the two years before and after random assignment.
The differences in LEC across all sites clearly suggest that we are working with five distinct local labor markets, especially with respect to Riverside.

4 Results

4.1 GPS Estimation and Covariate Balancing

We estimate the GPS using a multinomial logit model (MNL) that includes 52 individual-level covariates (estimation results are provided in the Appendix). In Section 4.4, we consider alternative estimators of the GPS. An important property of the GPS is that it balances the covariates between the individuals in a particular site and those not in that site (Imbens, 2000). In the binary treatment setting, this property is commonly used to gauge the adequacy of the propensity score specification (Dehejia and Wahba, 1999; Smith and Todd, 2005; Lee, 2011). We follow two strategies to examine the balance of each covariate across the different sites after adjusting for the GPS. The first strategy tests whether there is joint equality of means across all five sites after each observation is weighted by 1/p̂_i. The second strategy consists of a series of pairwise comparisons of the mean in each site versus the mean in the (pooled) remaining sites, adjusting for the GPS within a “blocking” or “stratification” framework in the spirit of Dehejia and Wahba (1999, 2002) and Hirano and Imbens (2004). Both strategies are implemented after imposing the overlap rule (5). An important issue when implementing this rule is selecting the quantile α that determines the amount of trimming. Even in the binary treatment literature, there is no consensus about how to select the amount of trimming (e.g., Imbens and Wooldridge, 2009). Based on attaining good balancing under both strategies, we set α = 0.0025.8

Footnote 8: We also examined covariate balancing properties for values of α equal to 0, 0.001, 0.002, 0.003, 0.004, and 0.005. Our main results presented in Section 4.3 are, in general, robust to the choice of α.

For brevity, we focus on the first strategy. The last four columns of Panel A in Table 1 show the p-value from a joint equality test and the root mean square distance (RMSD) as given in (3) for the means of the variables in each site, before and after weighting by the GPS. The complete variable-by-variable results are presented in the Appendix. Weighting by the GPS brings the means of all of the covariates much closer together (except for the percentage of individuals with a high school degree or GED, which becomes slightly less balanced). For example, the RMSD for the percentage of blacks is reduced from 0.64 to 0.06 after weighting by the GPS. Despite the significant improvement in balancing, the joint equality of means test is still rejected for some of the covariates. Panel B in Table 1 shows summary results for the covariate balancing analysis. Weighting by the GPS greatly improves the balancing among the five sites, as the number of unbalanced covariates at the 5 percent significance level drops from 52 to 6. The second strategy used to analyze covariate balancing, blocking on the GPS, yields similar results. Overall, we conclude that the covariates in the raw data are highly unbalanced across sites, and that the estimated GPS does a reasonably good job in improving their balance across all five sites.

4.2 Overlap Analysis

We first analyze the overlap in individual characteristics across sites, and then we analyze the overlap in LEC. Panel C in Table 1 presents the percentage of individuals dropped from each site after imposing overlap, and the overall percentage of observations dropped (26 percent).
In Riverside, almost half of the observations are dropped, implying that many of the individuals in Riverside are not comparable to individuals in at least one of the other sites. The sites with the fewest number of observations dropped are Atlanta and Detroit (about 5 percent in each).

In Figure 1, we examine more closely the overlap quality in individual characteristics. Each panel in the Figure shows, for a given site w, the number of observations with W_i ≠ w (left bars) and W_i = w (right bars) whose GPS falls within given percentile intervals of the support of p_w(X). The percentile intervals, for each site w, are constructed using the distribution of p_w(X) only for individuals in site w before imposing overlap. The outside wide (inside thin) bars display the number of individuals before (after) imposing overlap. As discussed in Section 2.2, we are interested only in the lower tail of these distributions because, for estimation of β_w, we need to find, for every individual with W_i ≠ w, comparable individuals in terms of X in the W_i = w group, and not vice versa. However, note that the intersection in the overlap rule (5) results in individuals being dropped from all regions of the GPS distributions in Figure 1. This illustrates a key difference with the binary-treatment case, where usually individuals are only dropped from the tails.

Figure 1 makes clear that the GPS distributions of the W_i ≠ w and W_i = w groups differ greatly in some sites, which is consistent with the large differences in individual characteristics documented in Table 1. Figure 1 also shows that imposing overlap significantly decreases the number of individuals at the lower tail of the distribution of p_w(X) for the W_i ≠ w group in most sites. The biggest changes, mostly arising from the large number of Riverside individuals dropped, are in Atlanta and Detroit. For example, the bottom one percent of the individuals in Atlanta (about 14 observations) are comparable to 3,245 individuals (about 41 percent of those not in Atlanta) before imposing overlap, and to 938 individuals after imposing overlap (about 17 percent of those not in Atlanta). Despite the improvement in overlap after imposing (5), the overlap quality remains relatively poor in some regions of the distributions, as many observations not in site w are comparable to few observations in site w in those regions. This poor overlap quality can lead to high variance of the GPS-based semiparametric estimators, just as in the binary-treatment setting (Black and Smith, 2004; Busso et al., 2009b; Imbens and Wooldridge, 2009; Khan and Tamer, 2010). In general, the overlap quality in individual characteristics between units with W = w and W ≠ w in each site is similar to that from previous studies in a binary-treatment setting (e.g., Dehejia and Wahba, 1999; Black and Smith, 2004; Smith and Todd, 2005).

We now examine the overlap in pre- and post-randomization LEC across the different cohorts in our sites. Figure 2 presents plots of the LEC faced by each of the cohorts in our analysis before (top figures) and after (bottom figures) random assignment. It is clear from the Figure that all the cohorts in Riverside experienced, especially after randomization, extremely different LEC from those of the other sites with respect to the employment-to-population ratio and the unemployment rate. Moreover, the comparison of the top and bottom panels shows that the behavior over time of the LEC in Riverside is also different, with earlier cohorts experiencing a deterioration of the LEC.
As for the rest of the sites, while their LEC are not the same, they are roughly similar. Figure 2 highlights that it is almost impossible to find individuals in other sites that faced post-randomization LEC similar to those faced by the individuals in Riverside. Thus, the adjustment for LEC in Riverside heavily depends on how well the parametric model used to adjust for LEC is able to extrapolate to these regions of the data.

4.3 Main Results

Table 2 presents the results for the assessment measures of the different estimators used to implement the unconfoundedness (first three columns) and conditional DID (last three columns) strategies. Panel A shows the results when the outcome is not adjusted for differences in post-randomization LEC, while Panel B shows the results when adjusting for such differences following the procedure discussed in Section 2.3. The first and second columns of the Table show, respectively, the p-value from the joint equality test in (2) and the RMSD in (3) with its corresponding 95 percent confidence interval based on 1,000 bootstrap replications. To have a reference level for the reduction in the RMSD for the different estimators, the third column displays the RMSD relative to that from the raw mean estimator using the outcome in levels before imposing overlap and not adjusting for LEC. In addition, to have a benchmark against which to compare our results, the last row in each of the panels gives the value of the RMSD that would be achieved by an experiment in a setting where β_1 = … = β_K holds. In particular, we perform a “placebo” experiment in which we assign placebo sites randomly to the individuals in the data set (keeping the number of observations in each site the same as in our sample), and we let β̂_w be the average outcome in each placebo site. These placebo values are also calculated using the outcome in differences and with and without adjusting for the LEC. We implement all the GPS-based estimators only within the overlap region, and the non-GPS-based estimators before and after imposing overlap.9

Footnote 9: For the partial mean linear and IPW with covariates estimators, we include all 52 individual covariates used in the GPS estimation. For the partial mean flexible estimator, we add 50 higher order terms and interactions. We implement the GPS-based nonparametric partial mean estimator using an Epanechnikov kernel and selecting the bandwidth using the “plug-in” procedure proposed by Fan and Gijbels (1996). For the GPS-based parametric partial mean estimator, we also explored using a cubic and a quartic polynomial in p, which made the results very close to those from the nonparametric partial mean estimator. Finally, the results from the regressions in the first step of the procedure used to adjust for LEC are shown in the Appendix.

The RMSD of the raw mean in levels (0.129) is about six times what one would expect to see in an experiment (0.021). Both strategies used to adjust for individual characteristics do very little in reducing the RMSD, regardless of the estimator used. This illustrates that in some instances, DID estimators are unable to adjust for differences across sites. In our case, there seem to be important differences in time-varying characteristics across sites (LEC) that are not captured by the individual characteristics we control for. The only adjustment that helps to reduce the differences in outcomes across comparison groups is the one for post-randomization LEC.
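The placebo benchmark reported in the last row of each panel can be sketched as follows; averaging over many random reassignments is our own choice for stability, and the paper's exact construction may differ.

```python
import numpy as np

def placebo_rmsd(y, w, K, reps=1000, seed=0):
    """Benchmark RMSD from placebo experiments that permute site labels.

    Permuting the labels keeps each site's sample size fixed while making
    the true site means equal by construction, so the resulting RMSD
    reflects pure sampling variation.
    """
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(reps):
        wp = rng.permutation(w)
        m = np.array([y[wp == j].mean() for j in range(K)])
        out.append(np.sqrt(np.mean((m - m.mean()) ** 2)) / abs(m.mean()))
    return float(np.mean(out))
```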
In general, the post-randomization LEC adjustment reduces the RMSD by about half whether we employ the outcome in levels or in differences; however, the RMSD still remains more than three times that of the placebo experiment. Table 3 presents the estimates and confidence intervals of β_w for each site. This table shows that the methods are fairly successful in equalizing the average outcomes among all sites but Riverside. In fact, with the exception of the LEC adjustment, the strategies considered do very little to change the average outcome of Riverside. This is not surprising given the very different LEC in this site.

Tables 2 and 3 highlight the difficulty of adjusting for LEC when they differ substantially across sites and the number of cohorts and/or locations is small. This is consistent with Dehejia (2003), who finds that it is difficult to predict site effects when the site being predicted differs greatly from the other sites. The lack of overlap in post-randomization LEC in Riverside prevents us from finding individuals in other sites who faced post-randomization LEC like those in Riverside. Moreover, the small number of cohorts and limited LEC variation in our sample makes it extremely difficult to model the effect of the LEC on the outcome and to extrapolate to regions with no overlap. Thus, we are unable to compare the outcomes in Riverside to those from the other sites. More generally, our results illustrate that when evaluating multiple treatments, there may be treatment groups for which it is not possible to draw inferences because they are not comparable to the other ones.

Table 4 shows our assessment measures calculated using the estimated means from Table 3 (i.e., from the five-site model) for all sites, except Riverside. The RMSD for the raw mean of the outcome in levels starts at 0.077, which is about three times that of the placebo experiment. In this case, regardless of the estimator employed, the adjustment for individual characteristics significantly reduces the RMSD to levels that are much closer to those of the placebo experiment.10

Footnote 10: For estimation of β_w, it does not matter which site a unit comes from if the unit is not in site w. Thus, performing the whole analysis using only the observations not in Riverside gives results similar to those in Table 4. In addition, carrying out the four-site analysis keeping Riverside, while dropping any of the other sites, gives RMSDs as high as those for the five-site analysis in Table 2.

As in Table 2, the DID estimators do not seem to improve on the results from the outcomes in levels. The LEC adjustment does not affect the results nearly as much as it did in Table 2, and for most estimators it slightly increases the RMSD. This small effect of the LEC adjustment, as compared to the case when Riverside is included, is consistent with the other four sites having more similar LEC than does Riverside. It is also important to compare the confidence intervals from the different estimators to those from the placebo experiment. In general, the confidence intervals from the linear regression-based estimators without adjusting the outcome for LEC are similar to those from the placebo experiment. However, those from the GPS-based estimators are wider, especially those from the IPW estimators, because of the poor overlap quality in some sites (e.g., Atlanta).11 In sum, the results from this section highlight the importance of the LEC when employing comparison groups from different locations.

Footnote 11: When overlap is poor, the large standard errors of semiparametric propensity-score based estimators of treatment effects may more accurately reflect uncertainty than those of regression-based estimators, which extrapolate to regions of poor overlap using parametric assumptions (Black and Smith, 2004; Imbens and Wooldridge, 2009).
Our results suggest that, given our set of conditioning variables, the unconfoundedness and DID strategies work reasonably well in simultaneously equalizing average outcomes across our comparison groups when they come from sites with similar LEC. This is consistent with some of the previous findings in the binary-treatment literature (Roselius, 1996; Heckman et al., 1997; Heckman et al., 1998; Michalopoulos et al., 2004; Smith and Todd, 2005; Hotz et al., 2005). Our results show that, in the presence of large differences in post-randomization LEC across comparison groups, our LEC adjustment can reduce but not eliminate these differences. Our findings regarding the importance of adjusting for post-randomization LEC complement, and are consistent with, those in the previous literature (Hotz et al., 2006; Flores-Lagunes et al., 2010).

We briefly relate our results to our example of a policy maker who wants to implement in her site one of the programs from the other sites. If the policy maker were in any of the sites but Riverside, she would clearly be able to compare all the programs, except the one in Riverside, for the people in her site in the overlap region. For the program in Riverside, she could not separate program heterogeneity from LEC heterogeneity. A subtler case is that of a policy maker in Riverside, who would be able to compare the effects of the other sites’ programs for the individuals in Riverside in the overlap region, had they participated in these programs in any of the sites but Riverside. However, she would not be able to make the same comparison if the programs had been implemented in Riverside because she could not assess the effects of the programs in a labor market like Riverside’s, which differs greatly from the labor markets of the other sites.

A potential threat to our analysis is differential substitution into alternative training programs by the control individuals (Heckman and Smith, 2000; Heckman et al., 2000; Greenberg and Robins, 2011). Analyzing a subsample of the individuals in the NEWWS study, Freedman et al. (2000, Table A.1) find differential participation rates in self-initiated activities among the control individuals. For instance, the control individuals in Atlanta (19%) and Riverside (around 26%) show much lower participation rates than do those in the other three sites (around 40%), with most of the differences driven by participation in post-secondary education and vocational training activities. Freedman et al. (2000) partially attribute these differences to heterogeneity in individual characteristics, which should be adjusted for by the individual-level covariates we control for. Although we lack the data to directly deal with differential control group substitution, our results do not seem to support the notion that this source of heterogeneity drives them. For example, while we do not have trouble adjusting the average outcomes in Atlanta (the site with the lowest participation among the control individuals), we have the most trouble in Riverside (the site with the worst LEC).
Unfortunately, we cannot completely rule out this additional source of heterogeneity, nor can we precisely determine the extent to which it may affect our results.12

Footnote 12: Another potential source of heterogeneity across control groups is welfare rules. For instance, welfare waivers for Atlanta, and for Riverside after 1994, allowed employed welfare recipients to keep more of their welfare checks (Hamilton et al., 2001). We believe it is unlikely that welfare rule differences across sites drive our results. For example, although Riverside made its earnings policy more generous in the middle of the NEWWS study, its average employment rate remained well below that of the other sites before and after the change. Unfortunately, as before, we cannot rule out nor precisely determine the extent to which this source of heterogeneity could affect our results.

4.4 Robustness Analysis

In this section we examine how robust our previous results are to different choices regarding the GPS estimation and the imposition of overlap. For brevity, Table 5 presents summary results only for the IPW with covariates estimator (the complete set of results is available in the Appendix). Unless otherwise noted, the results discussed below do not depend on the particular GPS-based estimator used. Table 5 shows the percentage of observations dropped after imposing overlap with α = 0.0025, and results analogous to those presented in Tables 2 and 4. To ease comparisons, the first row of Table 5 shows the corresponding results from those tables.

We start by including pre-randomization LEC measures in the GPS estimation. We employ only one two-year pre-randomization LEC variable because including the growth rates for all three of our LEC measures, or the measures in levels, identifies the sites (almost) perfectly. While we use the growth rate of the employment-to-population ratio, which particular LEC measure we employ does not affect the results. Overall, the results in the second row of the Table are comparable to those we obtained in the previous section. These results illustrate that in some cases, adjusting for pre-randomization LEC (and our set of individual covariates) helps little in reducing differences across comparison groups when there are large differences in post-randomization LEC.

In row three, we add to the GPS model shown in row two the two-year growth rate of the employment-to-population ratio after randomization. Here, the relevant comparison is between the outcome not adjusted for LEC (first four columns in row three) and the outcome adjusted for post-randomization LEC under our preferred specification (last four columns in row one). The point estimates are very close, except for the outcome in differences using four sites. It is important to note that the overlap worsens when including LEC measures in the GPS estimation. The percentage of observations dropped due to overlap shown in Table 5 does not provide a complete picture. For example, the inclusion of the pre-randomization LEC measure greatly affects Riverside, causing the overlap rule to drop about 63 percent of the individuals in this site, compared to around 49 percent dropped in our preferred model. The inclusion of both the pre- and post-randomization LEC measures worsens the overlap even more: almost 46 percent of the overall sample and more than 84 percent of the individuals in Riverside are dropped.13 The deterioration in overlap quality results in wider confidence intervals than those of our preferred specification, although an argument could be made that the increased variability reflects more appropriately the uncertainty of the problem at hand.

Footnote 13: For reference, if instead of using α = 0.0025 as the quantile for the overlap trimming rule we use the weakest version of it (α = 0), we drop 34 percent of the observations overall and 79 percent of the observations in Riverside.
In general, we conclude from rows two and three that our main results in the previous section are robust to the inclusion of (pre- and post-randomization) LEC measures in the GPS. Nevertheless, it seems to us that our approach of analyzing the overlap in individual characteristics and LEC separately makes the differences in LEC across the sites easier to detect and to assess, as opposed to analyzing the overlap using a GPS specification that includes the LEC measures. The latter, however, may be a more practical alternative in cases where lack of overlap in LEC is unlikely to be an issue.

In rows four to seven of Table 5, we analyze the sensitivity of our results to the exclusion of some sets of covariates (listed in the Table) from the GPS estimation. This may shed light on the types of covariates needed to make the conditional independence assumption more plausible in our setting. The results for the five sites remain basically unchanged regardless of the set of individual covariates, although the balancing properties of the four models (not shown in the Table) are much worse than those of our preferred specification. When considering sites with similar LEC (i.e., not including Riverside), however, the exclusion of the employment and earnings histories prior to random assignment does largely increase the RMSD. This result is consistent with the key role the literature gives to conditioning on labor market histories when implementing the nonexperimental strategies we consider (Heckman et al., 1997; Heckman et al., 1998; Dehejia and Wahba, 1999, 2002; Hotz et al., 2005).14

Footnote 14: When using a GPS specification with interactions and higher order terms of the covariates, in general, balancing does not improve and the results are similar to those of our preferred specification.

In rows eight to ten of Table 5, we consider three different models to estimate the GPS. First, we use a multinomial probit model, which does not have the unattractive “independence of irrelevant alternatives” property of the multinomial logit model but is much more computationally intensive. Second, we use a multinomial version of the “Inverse Probability Tilting” (IPT) estimator proposed by Graham et al. (2011), which estimates the GPS by solving a just-identified GMM problem imposing moment conditions that ensure that the mean of each covariate is exactly the same across all sites when weighting the observations by the inverse of the GPS (the “exact balancing” property). For the sake of comparability, we impose overlap in row nine, which breaks the exact balancing property of the estimated GPS, but below we consider GPS specifications with that property. Third, we use a combination of the multinomial logit and IPT estimators, which estimates the GPS by solving an overidentified GMM problem that adds the exact balancing moments of the IPT to the multinomial logit moments.
An issue for which there is no guidance in the literature is whether the GPS should be reestimated after imposing overlap. This can matter if, for example, we want to attain exact balancing when weighting by the GPS after imposing overlap. We show several reestimation options in rows eleven to fourteen of Table 5, and find that the results of these models are remarkably close to those of our preferred specification (a schematic of this reestimation pipeline appears at the end of this section).15 In terms of covariate balancing (not shown in the Table), however, we find that the models reestimated using the IPT estimator generally perform very poorly when blocking on the GPS, while the models reestimated using MNL, and using MNL with IPT moment conditions, perform well both when weighting by and when blocking on the GPS.

15 For the GPS-based nonparametric partial mean estimator, this is not always true. In many cases, reestimating the GPS produces unusually high values for this estimator.

Finally, the last three rows of Table 5 explore the effect of not imposing overlap in the individual covariates for three GPS models. For the five-site case, the results are similar to or slightly below those of our preferred specification. For the four-site case, however, the results for the outcome in levels are well above those of our preferred specification, especially for the IPT estimator. This illustrates that the results can be poor if overlap is not imposed, even when it is possible to attain exact balancing.16

16 Interestingly, the poor balancing properties under a blocking approach of the IPT-based GPS models in Table 5 do not translate into poor performance of the GPS-based partial-mean estimators.
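As flagged above, here is a schematic of the reestimation pipeline behind rows eleven to fourteen, reusing the illustrative helpers sketched earlier (all names and defaults are ours; which reestimator to use, and whether to reestimate at all, is precisely the open question discussed in the text).

```python
# Schematic pipeline (illustrative; estimate_gps, impose_overlap, and
# ipw_covariate_means are the sketches above, not the paper's code).
def reestimate_after_overlap(X, site_codes, q=0.0025):
    """Trim to the overlap region, refit the GPS on the kept sample, and
    return the reestimated GPS together with a balancing diagnostic."""
    gps = estimate_gps(X, site_codes)              # step 1: initial GPS
    keep = impose_overlap(gps, site_codes, q)      # step 2: overlap trimming
    X_ov, site_ov = X[keep], site_codes[keep]
    gps_ov = estimate_gps(X_ov, site_ov)           # step 3: reestimate GPS
    means = ipw_covariate_means(X_ov, site_ov, gps_ov)
    gaps = means.max(axis=0) - means.min(axis=0)   # step 4: imbalance check
    return gps_ov, gaps
```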
5 Conclusion

Policy makers commonly face the challenging problem of comparing the effects different programs would have in their own locality using data in which each program was implemented at a different location. We view our results as encouraging regarding the performance of the nonexperimental methods we study in informing policy makers in such situations. In our application, these methods reduce the differences in the average outcomes of our comparison groups arising from differences across locations in individual characteristics and local economic conditions (LEC) to levels closer to those that would be expected in an experiment. The keys are adjusting for a rich set of covariates (including labor market histories) and having simultaneous overlap across sites in both individual characteristics and post-randomization LEC. Nevertheless, we find that the standard errors from the nonexperimental methods can be much larger than those we would encounter in an experiment when overlap quality is poor, especially for the GPS-based estimators. This raises the possibility that large differences in outcomes across sites may still persist. When there are large differences in post-randomization LEC across locations, the methods we consider to explicitly adjust for them can reduce but not eliminate these differences. We also find it encouraging that, in general, our results are not very sensitive to the particular model used to estimate the GPS, nor to the particular estimator employed to implement the unconfoundedness or DID strategies.

Some final remarks are in order. First, our results may have implications for the external validity of randomized controlled trials. For instance, they suggest that when extrapolating the results from a social experiment to different populations and/or points in time, we need to adjust for differences in individual characteristics and in post-treatment LEC. Second, we believe the literature would benefit from more research on how best to adjust for differences in LEC when their variation is limited. Third, a valuable exercise would be to perform a Monte Carlo analysis of the finite-sample properties of the multiple-treatment estimators we consider. Our paper points out important aspects to consider in such an analysis, for example, the number of treatments, the degree and quality of overlap, the GPS model, and whether the GPS should be reestimated after imposing overlap. A similar analysis of different ways to adjust for differences in LEC would also be valuable.

References

[1] Abadie, Alberto. 2005. Semiparametric Difference-in-Differences Estimators. The Review of Economic Studies 72 (1): 1-19.
[2] Abadie, Alberto, and Guido W. Imbens. 2011. Bias-Corrected Matching Estimators for Average Treatment Effects. Journal of Business and Economic Statistics 29 (1): 1-11.
[3] Black, Dan A., and Jeffrey A. Smith. 2004. How robust is the evidence on the effects of college quality? Evidence from matching. Journal of Econometrics 121 (1-2): 99-124.
[4] Busso, Matias, John DiNardo, and Justin McCrary. 2009a. New Evidence on the Finite Sample Properties of Propensity Score Matching and Reweighting Estimators. IZA Discussion Paper no. 3998.
[5] Busso, Matias, John DiNardo, and Justin McCrary. 2009b. Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects. University of Michigan.
[6] Cattaneo, Matias D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2): 138-154.
[7] Dehejia, Rajeev H. 2003. Was There a Riverside Miracle? A Hierarchical Framework for Evaluating Programs With Grouped Data. Journal of Business and Economic Statistics 21 (1): 1-11.
[8] Dehejia, Rajeev H., and Sadek Wahba. 1999. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94 (448): 1053-1062.
[9] Dehejia, Rajeev H., and Sadek Wahba. 2002. Propensity Score-Matching Methods for Nonexperimental Causal Studies. The Review of Economics and Statistics 84 (1): 151-161.
[10] Dolton, Peter, and Jeffrey A. Smith. 2011. The Impact of the UK New Deal for Lone Parents on Benefit Receipt. IZA Discussion Paper no. 5491.
[11] Dyke, Andrew, Carolyn J. Heinrich, Peter R. Mueser, Kenneth R. Troske, and Kyung-Seong Jeon. 2006. The Effects of Welfare-to-Work Program Activities on Labor Market Outcomes. Journal of Labor Economics 24 (3): 567-607.
[12] Fan, Jianqing, and Irène Gijbels. 1996. Local Polynomial Modelling and Its Applications. New York: Chapman & Hall/CRC Press.
[13] Flores-Lagunes, Alfonso, Arturo Gonzalez, and Todd Neumann. 2010.
Learning but not Earning? The Impact of Job Corps Training on Hispanic Youth. Economic Inquiry 48 (3): 651-667.
[14] Freedman, Stephen, Daniel Friedlander, Gayle Hamilton, JoAnn Rock, Marisa Mitchell, Jodi Nudelman, Amanda Schweder, and Laura Storto. 2000. Evaluating Alternative Welfare-to-Work Approaches: Two-Year Impacts for Eleven Programs. Washington, D.C.: U.S. Dept. of Health and Human Services, Administration for Children and Families and Office of the Assistant Secretary for Planning and Evaluation, and U.S. Dept. of Education.
[15] Friedlander, Daniel, and Philip K. Robins. 1995. Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods. The American Economic Review 85 (4): 923-937.
[16] Frölich, Markus. 2004. Programme Evaluation with Multiple Treatments. Journal of Economic Surveys 18 (2): 181-224.
[17] Frölich, Markus, Almas Heshmati, and Michael Lechner. 2004. A microeconometric evaluation of rehabilitation of long-term sickness in Sweden. Journal of Applied Econometrics 19 (3): 375-396.
[18] Galdo, Jose. 2008. Treatment Effects for Profiling Unemployment Insurance Programs: Semiparametric Estimation of Matching Models with Fixed Effects. Carleton University.
[19] Graham, Bryan S., Cristine Campos de Xavier Pinto, and Daniel Egel. 2011. Inverse probability tilting for moment condition models with missing data. The Review of Economic Studies. Forthcoming.
[20] Greenberg, David H., and Philip K. Robins. 2011. Have Welfare-To-Work Programs Improved Over Time In Putting Welfare Recipients To Work? Industrial & Labor Relations Review. Forthcoming.
[21] Hamilton, Gayle, Stephen Freedman, Lisa Gennetian, Charles Michalopoulos, Johana Walter, Diana Adams-Ciardullo, Ann Gassman-Pines, Sharon McGroder, Martha Zaslow, Jennifer Brooks, and Surjeet Ahluwalia. 2001. How Effective are Different Welfare-to-Work Approaches? Five-Year Adult and Child Impacts for Eleven Programs. Washington, D.C.: U.S. Dept. of Health and Human Services, Administration for Children and Families and Office of the Assistant Secretary for Planning and Evaluation, and U.S. Dept. of Education.
[22] Heckman, James J., Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. 1998. Characterizing Selection Bias Using Experimental Data. Econometrica 66 (5): 1017-1098.
[23] Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme. The Review of Economic Studies 64 (4): 605-654.
[24] Heckman, James, Neil Hohmann, Jeffrey Smith, and Michael Khoo. 2000. Substitution and Dropout Bias in Social Experiments: A Study of an Influential Social Experiment. The Quarterly Journal of Economics 115 (2): 651-694.
[25] Heckman, James J., Robert J. LaLonde, and Jeffrey A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. Orley C. Ashenfelter and David Card, Volume 3, Part 1: 1865-2097. Elsevier.
[26] Heckman, James J., and Jeffrey Smith. 2000. The Sensitivity of Experimental Impact Estimates (Evidence from the National JTPA Study). In Youth Employment and Joblessness in Advanced Countries, ed. David G. Blanchflower and Richard B. Freeman, 331-356. NBER Comparative Labor Markets Series. University of Chicago Press.
[27] Hirano, Keisuke, and Guido W. Imbens. 2004. The Propensity Score with Continuous Treatments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. Andrew Gelman and Xiao-Li Meng, 73-84.
Hoboken, NJ: John Wiley & Sons.
[28] Holland, Paul W. 1986. Statistics and Causal Inference (with discussion). Journal of the American Statistical Association 81 (396): 945-970.
[29] Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2006. Evaluating the Differential Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program. Journal of Labor Economics 24 (3): 521-566.
[30] Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics 125 (1-2): 241-270.
[31] Imai, Kosuke, and David A. van Dyk. 2004. Causal Inference With General Treatment Regimes. Journal of the American Statistical Association 99 (467): 854-866.
[32] Imbens, Guido W. 1999. The Role of the Propensity Score in Estimating Dose-Response Functions. NBER Technical Working Paper Series no. 237.
[33] Imbens, Guido W. 2000. The Role of the Propensity Score in Estimating Dose-Response Functions. Biometrika 87 (3): 706-710.
[34] Imbens, Guido W. 2004. Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review. The Review of Economics and Statistics 86 (1): 4-29.
[35] Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. Recent Developments in the Econometrics of Program Evaluation. Journal of Economic Literature 47 (1): 5-86.
[36] Khan, Shakeeb, and Elie Tamer. 2010. Irregular Identification, Support Conditions, and Inverse Weight Estimation. Econometrica 78 (6): 2021-2042.
[37] Lechner, Michael. 2001. Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric Evaluation of Labour Market Policies, ed. Michael Lechner and Friedhelm Pfeiffer, 43-58. ZEW Economic Studies 13. New York: Springer-Verlag.
[38] Lechner, Michael. 2002a. Some Practical Issues in the Evaluation of Heterogeneous Labour Market Programmes by Matching Methods. Journal of the Royal Statistical Society, Series A 165 (1): 59-82.
[39] Lechner, Michael. 2002b. Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies. The Review of Economics and Statistics 84 (2): 205-220.
[40] Lee, Wang-Sheng. 2011. Propensity score matching and variations on the balancing test. Empirical Economics. Forthcoming.
[41] Michalopoulos, Charles, Howard S. Bloom, and Carolyn J. Hill. 2004. Can Propensity-Score Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? The Review of Economics and Statistics 86 (1): 156-179.
[42] Mueser, Peter R., Kenneth R. Troske, and Alexey Gorislavsky. 2007. Using State Administrative Data to Measure Program Performance. The Review of Economics and Statistics 89 (4): 761-783.
[43] Newey, Whitney K. 1994. Kernel Estimation of Partial Means and a General Variance Estimator. Econometric Theory 10 (2): 233-253.
[44] Plesca, Miana, and Jeffrey A. Smith. 2007. Evaluating multi-treatment programs: theory and evidence from the U.S. Job Training Partnership Act experiment. Empirical Economics 32 (2): 491-528.
[45] Roselius, Rebecca L. 1996. New Life for Non-Experimental Methods: Three Essays that Reexamine the Evaluation of Social Programs. Ph.D. Dissertation, University of Chicago.
[46] Smith, Jeffrey A., and Petra E. Todd. 2005. Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics 125 (1-2): 305-353.
Table 1. Descriptive statistics, covariate balancing, and overlap (NEWWS data)

[Panel A reports, by site (ATL, DET, GRP, POR, RIV), raw means of selected variables: the outcomes (ever employed in the two years after random assignment, in levels and in differences); individual-level covariates (demographic and family characteristics, education, welfare-use history, food-stamps-use history, employment history, and earnings history); and local economic conditions (employment/population ratio, average total earnings, unemployment rate, and their two-year growth rates before and after random assignment). For each variable it also reports the p-value of a Wald test of joint equality of means across sites and the root mean square distance (RMSD), raw and after GPS-based inverse probability weighting (IPW, computed after imposing overlap). Panel B reports covariate balancing: the number of the 52 individual covariates with a p-value at or below 0.05, for tests of joint equality of means across all sites (raw means before overlap versus GPS-based IPW) and of differences in means of each site versus the other sites pooled (raw means before overlap versus blocking on the GPS), with the GPS-based tests computed after imposing overlap. Panel C reports overlap:

                              ATL     DET     GRP     POR     RIV     Total
Number of observations        1,372   2,037   1,374   1,740   2,828   9,351
Observations after overlap    1,299   1,933   969     1,263   1,437   6,901
Percentage dropped            5.3%    5.1%    29.5%   27.4%   49.2%   26.2%

Notes: The lagged outcome was standardized to mean zero, so the grand mean is the same for the levels and in-differences outcomes.]

Table 2. Assessment measures of estimators of the average employment rate in the two years after random assignment, 5 sites

[For the outcome in levels and in differences, the table reports the p-value of the joint equality-of-means test and the RMSD (estimated value, with bootstrap 95% confidence interval based on 1,000 replications, and value relative to the in-levels raw mean) for each estimator: raw means with and without overlap; linear regression-based partial means (linear and flexible X, with and without overlap); GPS-based estimators imposing overlap (parametric and nonparametric partial means, and IPW with and without covariates); and the placebo-experiment benchmark. Panel A uses the outcome not adjusted for local economic conditions; Panel B uses the outcome adjusted for them.]

Table 3. Estimated average employment rate in the two years after random assignment, 5 sites

[For each site (ATL, DET, GRP, POR, RIV), and for the outcome in levels and in differences, the table reports point estimates and bootstrap 95% confidence intervals (1,000 replications) for the same estimators, and the same two panels (outcome not adjusted and adjusted for local economic conditions), as Table 2.]

Table 4. Assessment measures of estimators of the average employment rate in the two years after random assignment, 4 sites based on 5-site estimation, excluding Riverside

[Same layout, estimators, and panels as Table 2, for the four-site comparison that excludes Riverside.]

Table 5. Comparison of alternative GPS specifications: RMSD of estimators based on the IPW with covariates estimator. Outcome: average employment rate in the two years after random assignment

[For the outcome not adjusted and adjusted for local economic conditions, in levels and in differences, and for the 5-site and 4-site comparisons, the table reports the RMSD (with bootstrap 95% confidence intervals based on 1,000 replications) and the percentage of observations dropped due to overlap under: the preferred multinomial logit (MNL) specification from Tables 2 and 4 (26.2 percent dropped); MNL including the pre-randomization LEC measure (28.0 percent) and both the pre- and post-randomization LEC measures (45.7 percent); MNL with four alternative sets of covariates; alternative GPS models (multinomial probit, IPT, and MNL with IPT moment conditions); several GPS reestimation options after imposing overlap; and three models not imposing overlap.

Notes: Except where the overlap condition is not imposed, the quantile q = 0.0025 determines the individuals who satisfy the overlap condition. Four-site results are based on five-site estimation, excluding Riverside. The pre- and post-RA LEC measures are the growth rates of the employment/population ratio in the two years before and after random assignment, respectively. Alternative covariate set 1: demographic, education, and family characteristics; set 2: set 1 plus long-term welfare-use history and housing type/stability; set 3: set 2 plus welfare and food stamps use history in the 7 quarters prior to random assignment; set 4: set 2 plus employment and earnings history in the 8 quarters prior to random assignment (the specific variables in each set are provided in the Appendix). IPT refers to Inverse Probability Tilting and is estimated by GMM with moment conditions imposing exact balancing of the covariates; MNL with IPT moment conditions estimates a multinomial logit by GMM as an overidentified model, adding the exact-balancing conditions as additional moments; the MNL and multinomial probit models are estimated by maximum likelihood. See the Appendix for further details.]

Figure 1. Overlap quality based on the GPS distribution

[Five panels, one per site, plot the number of individuals in and not in the site whose GPS for that site falls in each percentile interval, before (wide bars) and after (thin bars) imposing overlap. Percentile intervals in each site are constructed using the GPS distribution, before imposing overlap, for individuals in that site only.]

Figure 2. Local economic conditions overlap, by site and random assignment cohort

[Six panels plot, by site and random-assignment cohort (1991-1994), the employment/population ratio, the unemployment rate, and average real earnings ($1,000), in each of the two years before and the two years after random assignment.]