Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile∗

José R. Zubizarreta† Luke Keele‡

March 16, 2015

Abstract

A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a non-random manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But, as we discuss in this paper, the optimal strategy is the opposite: first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this paper we use dynamic and integer programming to implement this strategy and extend optimal matching methods to hierarchical and multilevel settings. In particular, our method attempts to replicate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to specifications given by the user. Our method directly balances covariates both at the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate. We illustrate our method on a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at the present.
Keywords: Causal Inference; Group Randomization; Hierarchical/Multilevel Data; Observational Study; Optimal Matching

∗ For comments and suggestions, we thank Cinar Kilcioglu, Sam Pimentel and Paul Rosenbaum, and seminar participants at Johns Hopkins University and the University of Pennsylvania.
† Assistant Professor, Division of Decision, Risk and Operations, and Statistics Department, Columbia University, 3022 Broadway, 417 Uris Hall, New York, NY 10027, Email: [email protected].
‡ Associate Professor, Department of Political Science, 211 Pond Lab, Penn State University, University Park, PA 16802, Phone: 814-863-1592, Email: [email protected].

1 Introduction

1.1 Clustered Observational Studies with Multilevel Data

Clustered observational studies are ubiquitous in the health and social sciences. Examples include patients receiving similar treatments in hospitals, employees facing a policy change inside firms, and students following a particular learning program within schools. Clustered observational studies have a nested or multilevel data structure with observed and unobserved covariates both at the cluster and unit levels. In this context, research interest typically lies in the effect of the cluster level treatment on unit level outcomes. However, this effect may be confounded by differences in the distributions of covariates across the treatment groups both at the cluster and unit levels. Therefore, an important question in clustered observational studies is how to adjust for observed covariates while taking into account the multilevel structure. In the interest of transparency, these adjustments should ideally balance covariates at both levels and facilitate sensitivity analyses to hidden biases due to unobserved covariates (Rosenbaum 2010).
Educational settings are perhaps the most well-known multilevel structure, with unit level measures such as the student's score on a standardized test and cluster level covariates such as school enrollment (Lee and Bryk 1989). Both kinds of covariates may act as confounders when evaluating, for instance, the impact of a study program or administration regime targeted at improving learning.

A conventional approach to adjust for cluster and unit level covariates is hierarchical or multilevel regression modeling. A hierarchical regression model allows the researcher to fit a model for the mean outcome using unit level covariates while accounting for unexplained variation among clusters. The cluster level predictors are often referred to as “contextual effects,” and may be interpreted as causal effects under certain assumptions (Gelman 2006; Feller and Gelman 2015). Nonparametric alternatives to multilevel regression modeling often rely on propensity scores (Hong and Raudenbush 2006; Arpino and Mealli 2011; Li et al. 2013). For example, Hong and Raudenbush (2006) stratify on a multilevel propensity score to approximate a two-stage experiment where schools and students are randomly assigned to treatment within blocks. Matching methods are extensively used in observational studies (Stuart 2010; Lu et al. 2011); however, they do not typically account for multilevel data structures.

In this paper, we develop an optimal matching method for multilevel data structures. Contrary to intuition, our method first matches pairs of units and then clusters, creating matched pairs of clusters with pairs of units within these cluster pairs. Our method is optimal in the sense that it maximizes the size of the matched sample while directly balancing the observed covariates according to the researcher's specifications. For example, our method allows the researcher to balance not only central moments of distributions but also the entire distribution of the observed covariates.
Our method does not require estimating the propensity score, although the propensity score can be used as an additional balancing covariate. Although we illustrate our method in a study with two levels (students within schools), it can readily be extended to settings with three or more levels such as students within schools within districts. In particular, we illustrate our method on a case study of the comparative effectiveness of public versus private voucher schools in standardized tests in Chile. This is a question of intense policy debate in the country at present, and we describe it subsequently.

1.2 Vouchers and School Choice

Governments often enact policy reforms to improve educational outcomes. One educational reform is the use of vouchers to create education markets. In a voucher system, parents receive a voucher to choose among a number of competing schools, both public and private. While voucher systems are relatively rare in the United States, many other countries have adopted universal voucher programs.1 Countries with universal voucher programs include Chile, Denmark, Netherlands, South Korea and Sweden (Lara et al. 2011). Chile was the first country to adopt a universal voucher system, in the 1980s. Public and private voucher schools receive a direct payment from the government on a per-student basis. This voucher system fueled the growth of a parallel private education system. By 2006, nearly 5,000 private voucher schools existed, and these schools enrolled nearly 44% of students, with 48% attending public schools (Lara et al. 2011). By 2013, there were over 6,000 private voucher schools with around 53% of students, and 38% attended public schools (MINEDUC 2013).

1 As of 2007, only 16% of U.S. students lived in areas with an active voucher system.

Are private voucher schools more effective than public schools? Current evidence on the voucher system in Chile is mixed.
A number of studies have found that private voucher schools increase test scores by at least 15% to 20% of a standard deviation (Mizala and Romaguera 2001; Anand et al. 2009), though other studies have found larger effects (Sapelli and Vial 2002, 2005). Other work has found effects that are either not statistically detectable (Hsieh and Urquiola 2006; McEwan 2001) or much smaller (Lara et al. 2011).

We conduct an observational study of whether private voucher schools produce students with higher test scores than public schools. These data have a multilevel structure in that we observe student level covariates such as gender and socio-economic status as well as school level covariates such as enrollment and whether the school is in an urban or rural area.

Our study also provides a basic template for the design of observational studies of school level interventions. We demonstrate how temporal ordering is critical to selecting covariates for the removal of overt biases. In an observational study, use of longitudinal data is necessary to avoid conditioning on a post-treatment covariate. If data are carefully collected over time as events occur, then the temporal order of events is clear, and the distinction between covariates and outcomes is clear as well. In contrast, if data are collected from subjects at a single time, as in a cross-sectional study, such temporal order is unclear and one might mistakenly condition on an outcome. In this study, we use the transition from primary school to secondary school to clearly delineate the temporal order of pre- and post-treatment covariates. The Chilean data form a full panel over time, which allows us to carefully separate both school and student level covariates from outcomes.

This article is organized as follows. Section 2 describes the Chilean school system, the longitudinal census data, and the design that we use in our study.
Section 3 reviews cardinality matching for finding the largest matched sample that is balanced, explains the multilevel matching method, and presents the method more generally. Section 4 shows the resulting matches. Section 5 analyzes the comparative effectiveness of public and private voucher schools in Chile. Section 6 concludes with a summary and a discussion.

2 Schools in Chile; Educational Census Data; Study Design

2.1 The Chilean School System and Restricted Secondary School Choice

In Chile there are three basic types of schools: (i) private non-subsidized schools, (ii) private subsidized schools, and (iii) public schools, also called municipal schools. Private non-subsidized schools are generally elite schools that do not receive any funds from the state and are funded entirely through private tuition. Approximately 7% of the students in Chile attend this type of school. Private subsidized schools receive funding from the state on a per pupil basis, and some of these schools charge an additional monthly tuition fee to parents (these are called “shared funding” private subsidized schools or “financiamiento compartido” in Spanish). These schools are a mix of for-profit and not-for-profit organizations. Approximately 53% of the students in Chile attend this type of school. Finally, approximately 40% of students attend public schools.2 In Chile, the voucher system is based on a direct payment to the schools as a function of daily attendance.

2 The exception is 60 schools (called “Delegated Administration”) that are publicly funded on a non-voucher basis, i.e. they receive a fixed amount of money from the state regardless of how many students they enroll.

We seek to test whether private voucher schools are more effective than public schools. Pretreatment differences between these two groups of students may be either measurable, and thus form overt biases, or unmeasured, in which case they are hidden biases.
In an observational study, analysts use pretreatment covariates and a statistical adjustment strategy to remove overt biases in the hopes of consistently estimating treatment effects. We face two challenges in our observational study if we simply compare the test scores of students in these two types of schools and remove overt biases via matching. First, data are not collected on students until they are in the fourth grade. Without measurements that precede the treatment, we may mistakenly adjust for an outcome rather than a covariate and bias the study (Rosenbaum 1984). Second, the choice to attend a private voucher school is most likely highly confounded by not only observed but also unobserved factors. To solve both problems, we exploit an opportunity: an unusual setting in which we believe confounding from unobserved covariates is lessened (Rosenbaum 2010, section 5.1).

The opportunity we exploit is related to the concept of differential effects, which are used to reduce bias in observational studies by the study of parallel treatments (Rosenbaum 2006). Consider the following example of a differential effect. It is thought that nonsteroidal anti-inflammatory drugs (NSAIDs) reduce the risk of Alzheimer disease. To test this theory one might compare subjects that regularly use NSAIDs to those that do not. However, there are a number of observed and unobserved reasons why these two populations may differ other than the use of NSAIDs. As an alternative design, one might instead compare regular NSAID users to regular users of other pain medications such as acetaminophen. The differential effect of acetaminophen versus NSAIDs may be less subject to bias from unmeasured confounders than the effect of NSAIDs versus no use of pain medication. As such, a differential effect is an effort to reduce sensitivity to bias through the comparison of parallel treatments. The Chilean school system provides us with an opportunity that is parallel to a differential effect.
Chilean schools are divided into primary and secondary schools. Primary schools comprise Grades 1–8 and secondary schools encompass Grades 9–12. Since 2003, students have been required to attend both primary and secondary schools. However, primary and secondary schools need not be separate institutions. It is not uncommon for primary and secondary schools to be fused into a single school, such that students can attend Grades 1–12 at the same institution. Students that attend a primary only school have to select a secondary school once they reach Grade 8. In 2004, 76.4% of public schools and 52% of private voucher schools were primary only schools, and thus about 56% of students in the eighth grade had to select a new secondary school to attend (Lara et al. 2011).

We use this opportunity in our design by restricting the analysis to students from public primary schools that had to select a secondary school because they attended a primary only school. Under this design, treated students are students that switch from a public primary school to a private voucher secondary school. We compare these treated students to students in public primary only schools that choose to attend public secondary schools.

Exploiting this aspect of the Chilean school system has two advantages. First, it allows us to clearly delineate the temporal ordering of the treatment. Since the treatment is a private voucher secondary school, we may safely condition on covariates collected while students were in primary school. In this way we avoid biases from adjusting for a concomitant outcome (Rosenbaum 1984). Second, we suspect that a student that switches from a public to a private voucher school when not required to may do so for a number of reasons, many of which are unobservable. Here, we restrict the analysis to students that attend their local public school but must switch due to school structure. This may be a more directly observable treatment selection mechanism.

2.2 Longitudinal Census of Students and Schools

In 1988, Chile introduced a national student assessment system known as the Sistema Nacional de Medición de la Calidad de la Educación or SIMCE. The SIMCE is an “educational census.” That is, in the SIMCE, the Ministry of Education collects data to evaluate all students in fourth, eighth, tenth and eleventh grades in language, mathematics and sciences, roughly every two years. SIMCE data are collected from four different sources. First, data are collected from students, which include test scores that are complemented with other student covariates such as gender. Second, both parents and teachers complete questionnaires. Finally, for schools, student test scores are aggregated, and a few additional covariates are collected. Students are given unique identifiers, which allow us to form a true panel over a two year period. Student records can also be linked to teacher, parent, and school level covariates.

In our study, we use SIMCE data from 2003, 2004, and 2006. The SIMCE from 2003 only collected data from secondary schools and students enrolled in secondary schools. For 2004 and 2006, SIMCE collected data from both primary and secondary schools and students. We use test scores on language and mathematics administered in 2006, when students are in the tenth grade, as our outcome measures.

2.3 Data Structure and Study Design

In our study, the data structure and design are intricately linked. We now outline how we constructed the match to fit the data structure. One advantage of our approach is that we can tailor the statistical adjustment to exactly fit the multilevel structure of the data, which is important since we have student, parent, teacher, and school level data. We perform two matches: one for students and one for schools. Next, we describe the covariates that form the student level match.
For each student with test scores observed in 2006, we match on student, parent, teacher, and primary school covariates from the SIMCE data collection in 2004. For the student match, we first list student level covariates. The key covariate here is student test scores from the 8th grade. In 8th grade, students are tested on four topics: language, mathematics, social sciences, and natural sciences. The student level data also measure gender. For the student match, we also include three covariates from parents: income measured in six categories, father's education, and mother's education.

We also link students to primary school level measures, and we match on primary school covariates in the student level match. At the primary school level, we match on a five category socio-economic status indicator for each school that is created by the Chilean Ministry of Education. This five category indicator is constructed from questions based on parental education, family incomes in the school and an index of school vulnerability. We also use school level measures that are aggregates of data observed at other levels. As such, we match on average test scores for each primary school, the number of teachers and the number of enrolled students. Finally, in the teacher survey, teachers are asked what level of education they expect the majority of their students to achieve. Teachers responded using a five category scale that records responses from 8th grade to a college degree. We aggregated this measure, recorded the median for each primary school, and used it in the student level match. To reiterate, while we observe covariates measured at different levels in 2004, we treat all these measures as pre-treatment student level covariates in the match.

The school match is based on secondary school data from 2003. Since the SIMCE forms a panel, we can match on characteristics of the secondary schools before any student is exposed to the treatment.
That is, we match on the schools the students will attend using data from before they attend that school. For the school match, we match on enrollment, school level math and language test score averages, the percentage of female students in the school, average student income, urban versus rural status, and the same five category socio-economic status indicator for each school that is recorded for primary schools.3 Some of these covariates are aggregates that we created from either student, teacher, or parent level data in 2003. We did not match on several other covariates that are also observed in the 2003 data. These measures include whether teachers are allowed class preparation time, the proportion of teachers with a post-graduate diploma, the average teacher experience, and the number of hours teachers worked per week. We do not match on these covariates since they are plausibly part of the school level treatment. Matching on such covariates would remove their effect on students from the final outcomes and thus could potentially attenuate the treatment effect.

3 For students in secondary schools, the SIMCE only collects test scores on language and math. At the primary school level, we have test scores for language, mathematics, social sciences, and natural sciences.

We describe the matching algorithm in greater detail in Section 3. In brief, the match is based on integer programming, which allows us to enforce different forms of balance for different covariates (Zubizarreta 2012; Zubizarreta et al. 2014). This is relevant since we tailored the constraints for each covariate. Here, we describe the different balance constraints we applied to each covariate. For the student level covariates, we applied a mean balance constraint to primary school test score measures, primary school enrollment, the number of teachers in the primary school, the average expected level of educational attainment, and the proportion of female students in the primary school.
For student level test score measures, we enforced a constraint on the entire distribution via the Kolmogorov-Smirnov test statistic, which is the maximum discrepancy between the empirical cumulative distribution functions. For the school level match, we enforced a mean balance constraint on secondary school test scores, missingness indicators for test scores, secondary school enrollment, income category, SES category, urban or rural status, and the proportion of female students in the secondary school. For discrete student and school level covariates, we used a fine balance constraint. Under fine balance, we exactly balance covariates without exactly matching. Fine balance is achieved for discrete covariates by balancing the marginal distributions of covariates exactly in aggregate but without constraining who is matched to whom. We applied fine balance to student sex, father and mother's education level, parental income categories, and primary school SES categories. See Rosenbaum et al. (2007) for a discussion of fine balance and Rosenbaum (2010, Part II) for a discussion of different forms of covariate balance. We now describe the notation and the optimal matching algorithm.

3 Dynamic and Integer Programming for Multilevel Matching

The goal of our multilevel matching method is to find the largest sample of matched pairs of treated and control units within matched pairs of treated and control clusters that is balanced on the observed covariates. For assessing the sensitivity of results to the influence of unobserved covariates we use the methods for sensitivity analysis proposed by Rosenbaum (1987, 2002) and tailored to clustered treatment assignments by Hansen et al. (2014) (see subsection 5.4 of the paper). In our case study, units are students and clusters are schools, and, importantly, because results can be confounded both by student and school level covariates, we match pairs of students and schools to balance covariates at both levels.
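
To make the three forms of balance used in Section 2.3 concrete, the following Python sketch computes a standardized difference in means, the Kolmogorov-Smirnov statistic, and a fine balance check on a toy matched sample. It is only an illustration: the function names, the use of the pooled standard deviation as the scale for mean differences, and the data are assumptions made here, not part of the paper or of any package.

```python
from collections import Counter
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means in units of the pooled standard deviation
    (the choice of pooled SD is an assumption of this sketch)."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd


def ks_statistic(treated, control):
    """Largest absolute gap between the two empirical CDFs."""
    def ecdf(sample, point):
        return sum(value <= point for value in sample) / len(sample)

    return max(abs(ecdf(treated, p) - ecdf(control, p))
               for p in set(treated) | set(control))


def is_finely_balanced(treated_labels, control_labels):
    """True when the marginal counts of a nominal covariate coincide,
    regardless of who is paired with whom."""
    return Counter(treated_labels) == Counter(control_labels)


# Hypothetical matched pairs: treated[i] is paired with control[i].
treated_scores = [250.0, 260.0, 270.0]
control_scores = [252.0, 258.0, 270.0]
treated_sex = ["F", "M", "F"]
control_sex = ["M", "F", "F"]  # first two pairs are not exact matches on sex

smd = standardized_mean_difference(treated_scores, control_scores)
ks = ks_statistic(treated_scores, control_scores)
fine = is_finely_balanced(treated_sex, control_sex)
```

In this toy sample the first two pairs differ on sex, yet the marginal counts agree, which is what fine balance requires: exact balance in aggregate without exact matching within pairs.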
The basic tool that we use in our multilevel matching method is cardinality matching, which we describe next.

3.1 Review of Cardinality Matching

Common matching methods attempt to achieve covariate balance indirectly, by finding treated and control units that are close on a summary measure of the covariates such as the Mahalanobis distance or the propensity score (see Stuart 2010 and Lu et al. 2011 for reviews). Unlike these matching methods, cardinality matching uses the original covariates to match units and directly balance their covariate distributions (Zubizarreta et al. 2014). Specifically, by solving an integer programming problem, cardinality matching finds the largest matched sample that satisfies the researcher's specifications for covariate balance. Following Zubizarreta (2012), these specifications for covariate balance may require not only mean balance, but also other forms of distributional balance such as fine balance (Rosenbaum et al. 2007), x-fine balance (Zubizarreta et al. 2011), and strength-k matching (Hsu et al. 2015). For example, cardinality matching will find the largest sample of matched pairs in which all the covariates have differences in means smaller than one tenth of a standard deviation and the marginal distributions of nominal covariates of greater prognostic importance are perfectly balanced (fine balance). In this manner, with cardinality matching, subject matter knowledge about the research question at hand enters the matching problem through the specifications for covariate balance, and the method finds the largest matched sample that satisfies them.

As we describe in the next subsection, our multilevel matching method uses cardinality matching to match treated and control students across all the possible combinations of treated and control schools, and then uses a modified version of cardinality matching to match schools with the largest number of matched students.
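
As a rough illustration of the idea reviewed in this subsection, the sketch below finds the largest equal-sized subsets of treated and control units whose means on a single covariate differ by at most one tenth of a standard deviation. It is not the integer programming formulation used in the paper: it enumerates subsets by brute force, which is feasible only for tiny samples, and the function name, the pooled standard deviation of the unmatched groups as the scale, and the data are all assumptions of this example.

```python
from itertools import combinations
from statistics import mean, variance


def cardinality_match(treated, control, tolerance=0.1):
    """Return index tuples (treated, control) of the largest equal-sized
    subsets whose means differ by at most `tolerance` pooled SDs.

    Brute-force enumeration, for illustration only; realistic problems
    need an integer programming solver.
    """
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    max_gap = tolerance * pooled_sd
    # Try the largest sample size first, so the first hit is optimal.
    for size in range(min(len(treated), len(control)), 0, -1):
        for t_idx in combinations(range(len(treated)), size):
            t_mean = mean(treated[i] for i in t_idx)
            for c_idx in combinations(range(len(control)), size):
                c_mean = mean(control[i] for i in c_idx)
                if abs(t_mean - c_mean) <= max_gap:
                    return t_idx, c_idx
    return (), ()


# Hypothetical covariate values for three treated and four control units.
treated_x = [10.0, 12.0, 20.0]
control_x = [9.0, 11.0, 13.0, 30.0]
t_idx, c_idx = cardinality_match(treated_x, control_x)
```

Here no balanced sample of size three exists, so the search settles on two treated and two control units. Because mean balance is an aggregate requirement, it constrains which units are retained but not who is paired with whom within the retained sample.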

3.2 A Multistage Decision Method for Multilevel Matching

Let $k_t \in \mathcal{K}_t = \{1, \dots, K_t\}$ index the treated clusters and $k_c \in \mathcal{K}_c = \{1, \dots, K_c\}$ index the control clusters. Let $j_{k_t}$ be treated unit $j$ in treated cluster $k_t$, with $j_{k_t} \in \mathcal{J}_{k_t} = \{1, \dots, J_{k_t}\}$, and let $j_{k_c}$ stand for control unit $j$ in control cluster $k_c$, with $j_{k_c} \in \mathcal{J}_{k_c} = \{1, \dots, J_{k_c}\}$. Put $x_{k_t}$ for the vector of observed covariates of treated cluster $k_t$, and similarly write $x_{j_{k_t}}$ for the observed covariates of treated unit $j_{k_t}$; analogous notation applies for control clusters and units. Based on the unit-level covariates, calculate a distance $\delta_{j_{k_t}, j_{k_c}}$ between treated unit $j_{k_t}$ and control unit $j_{k_c}$ (for instance, this distance may be the robust Mahalanobis distance specified in section 8.3 of Rosenbaum 2010).

Define $\mathcal{A}$ and $\mathcal{B}_a$ as the sets of feasible solutions for the cluster- and unit-level matches within matched clusters (hence the subindex $a$ in $\mathcal{B}_a$). In practice, $\mathcal{A}$ and $\mathcal{B}_a$ are implemented as linear inequality constraints in an integer program and they enforce the researcher's requirements for covariate balance and matching structures at the cluster and unit levels respectively (for instance, $\mathcal{A}$ may require the means of the cluster covariates to be balanced and the matched groups to form pairs of clusters, and $\mathcal{B}_a$ may require the marginal distributions of the unit covariates to be balanced and the matched groups to form pairs of units). Importantly, since the requirements in $\mathcal{A}$ refer to clusters and those in $\mathcal{B}_a$ refer to units, $\mathcal{A}$ and $\mathcal{B}_a$ are disjoint.

Let $\mathcal{J}_{k_t}^{(m)}$ be the set of treated units matched in treated cluster $k_t$ and $\mathcal{J}_t^{(m)} = \bigcup_{k_t \in \mathcal{K}_t^{(m)}} \mathcal{J}_{k_t}^{(m)}$ be the set of treated units matched across all treated clusters, where $\mathcal{K}_t^{(m)}$ is the set of matched treated clusters.
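Building upon the framework of Rosenbaum (2012a), an optimal cardinality matching of units within clusters can be characterized by the quadruple $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta)$ of assignments of clusters $\alpha : \mathcal{K}_t^{(m)} \to \mathcal{K}_c$ and units $\beta : \mathcal{J}_{k_t}^{(m)} \to \mathcal{J}_{k_c}$ that maximize the cardinality of the set of matched units within matched clusters subject to the constraints in $\mathcal{A}$ and $\mathcal{B}_a$, respectively. If there are two cardinality matchings that satisfy the requirements in $\mathcal{A}$ and $\mathcal{B}_a$, then we prefer one matching over the other if it has a larger cardinality, or, alternatively, if they both have the same cardinality, if it has a smaller sum of total distances between matched units. Formally, we prefer the cardinality matching $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta)$ to $(\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$, denoted by

$$(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \succ (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta}),$$

if $|\mathcal{J}_t^{(m)}| > |\tilde{\mathcal{J}}_t^{(m)}|$, or alternatively if $|\mathcal{J}_t^{(m)}| = |\tilde{\mathcal{J}}_t^{(m)}|$ and

$$\sum_{j_{k_t} \in \mathcal{J}_t^{(m)}} \delta_{j_{k_t}, \beta(j_{k_t})} < \sum_{j_{k_t} \in \tilde{\mathcal{J}}_t^{(m)}} \delta_{j_{k_t}, \tilde{\beta}(j_{k_t})}.$$

If $|\mathcal{J}_t^{(m)}| = |\tilde{\mathcal{J}}_t^{(m)}|$ and $\sum_{j_{k_t} \in \mathcal{J}_t^{(m)}} \delta_{j_{k_t}, \beta(j_{k_t})} = \sum_{j_{k_t} \in \tilde{\mathcal{J}}_t^{(m)}} \delta_{j_{k_t}, \tilde{\beta}(j_{k_t})}$, then we are indifferent between the two cardinality matchings and write $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \sim (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$. If we have either $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \succ (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$ or $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \sim (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$, we write $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \succsim (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$. Our optimal multilevel matching problem is the following.

Problem 3.1. For given sets of cluster-level constraints $\mathcal{A}$ and unit-level constraints $\mathcal{B}_a$, find a matching $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta)$ that satisfies $\mathcal{A}$ and $\mathcal{B}_a$ such that, for any other matching $(\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$ that also satisfies $\mathcal{A}$ and $\mathcal{B}_a$, $(\mathcal{K}_t^{(m)}, \alpha, \mathcal{J}_t^{(m)}, \beta) \succsim (\tilde{\mathcal{K}}_t^{(m)}, \tilde{\alpha}, \tilde{\mathcal{J}}_t^{(m)}, \tilde{\beta})$.

Intuition may suggest that the best way to solve Problem 3.1 and match with multilevel data is first to match clusters and then within matched clusters to match units.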
In our case study, this would require first pairing schools and then, within pairs of schools, pairing students. However, this strategy will not always find the largest matched sample that is balanced: two schools that are paired on their school level characteristics may have different student compositions, so that pairing their students may result in a smaller sample than is optimal. For this reason, the optimal matching strategy needs to contemplate what is optimal both at the student and school levels simultaneously. Applying Bellman's (1957) principle of optimality, the optimal matching strategy is, under the assumption that schools have been matched optimally, first to match students and then, considering these optimal student matches, to match schools. In abstract terms, the following algorithm and proposition state this: the optimal strategy is first to match units across all the possible combinations of pairs of treated and control clusters, and, once all possible combinations of matched units are known, then to match clusters.

To implement the optimal assignments $\alpha$ and $\beta$, let $a_{k_t, k_c} = 1$ if treated cluster $k_t$ is paired to control cluster $k_c$ and $a_{k_t, k_c} = 0$ otherwise; similarly, let $b_{j_{k_t}, j_{k_c}} = 1$ if treated unit $j_{k_t}$ in treated cluster $k_t$ is paired to control unit $j_{k_c}$ in control cluster $k_c$, and $b_{j_{k_t}, j_{k_c}} = 0$ otherwise.

Algorithm 3.2. For each of the possible $K_t \times K_c$ pairs of treated and control clusters, find the optimal cardinality matching of units that satisfies $\mathcal{B}_a$. That is, for each $k_t \in \mathcal{K}_t$ and each $k_c \in \mathcal{K}_c$ find

$$m_{k_t, k_c} = \max_b \sum_{j_{k_t} \in \mathcal{J}_{k_t}} \sum_{j_{k_c} \in \mathcal{J}_{k_c}} b_{j_{k_t}, j_{k_c}} \quad \text{subject to } b \in \mathcal{B}_a.$$

Then find the optimal cardinality cluster matching that solves

$$\max_a \sum_{k_t \in \mathcal{K}_t} \sum_{k_c \in \mathcal{K}_c} m_{k_t, k_c} \, a_{k_t, k_c} \quad \text{subject to } a \in \mathcal{A}.$$

Proposition 3.3. Algorithm 3.2 solves the optimal multilevel cardinality matching Problem 3.1.

Proof. Let $f(a, b)$ be the total number of pairs of treated and control units matched by $b$ within pairs of treated and control clusters matched by $a$.
In the abstract, in Problem 3.1 we want to maximize the function $f(a, b)$ subject to the constraints $\mathcal{A}$ and $\mathcal{B}_a$. That is, find $a$ and $b$ to solve

$$\max_{a, b} f(a, b) \quad \text{subject to } a \in \mathcal{A},\ b \in \mathcal{B}_a. \tag{1}$$

In a trivial way, we may solve (1) by first solving

$$g(a) = \max_b f(a, b) \quad \text{subject to } b \in \mathcal{B}_a \tag{2}$$

for each $a \in \mathcal{A}$, and then solving

$$\max_a g(a) \quad \text{subject to } a \in \mathcal{A}. \tag{3}$$

While (2) seems hard in general (because there are many possible choices of $b$), the nested structure of the units-in-clusters problem makes it easier because $f(a, b)$ separates into a sum of parts for cluster pairs, since the constraint sets $\mathcal{A}$ and $\mathcal{B}_a$ are disjoint. Algorithm 3.2 does exactly this.

In our case study, for each pairing of schools $a$, we find the best pairing of students $b$ within those schools (2), and then pick the best pairing of schools with the associated best pairing of students for that pairing of schools (3). Again, while (2) seems hard in general (because there are many possible student matches $b$), the nested structure of the students-in-schools problem makes it easier because $f(a, b)$ separates into a sum of parts for school pairs. For example, if treated school $k_t$ is paired to control school $k_c$, then schools $k_t$ and $k_c$ contribute the same number of student pairs regardless of how the other schools are paired. With Algorithm 3.2, the multilevel cardinality matching problem can be solved optimally by breaking it into simpler matching subproblems and recursively finding the optimal match. This is an application of dynamic programming to matching in observational studies that takes advantage of the multilevel structure of the data (see Bertsekas 2005 for an extensive exposition of dynamic programming).

3.3 Extensions and Computation

Note that if we had three or more levels (such as students within schools within districts), then the multilevel matching procedure would extend naturally.
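
For the two-level case, the two stages of Algorithm 3.2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the unit-level constraints are replaced by a caliper on a single unit covariate (solved with augmenting paths), the cluster-level constraints by a caliper on a single cluster covariate (solved by exhaustive search), and all function names, data structures, and values are hypothetical.

```python
def max_unit_pairs(treated_units, control_units, caliper):
    """Stage 1 for one cluster pair: the largest number of unit pairs
    whose covariates differ by at most `caliper` (a stand-in for the
    unit-level balance constraints), found with augmenting paths."""
    treated_of_control = {}

    def try_assign(t, seen):
        for c, x_c in enumerate(control_units):
            if c not in seen and abs(treated_units[t] - x_c) <= caliper:
                seen.add(c)
                if (c not in treated_of_control
                        or try_assign(treated_of_control[c], seen)):
                    treated_of_control[c] = t
                    return True
        return False

    return sum(try_assign(t, set()) for t in range(len(treated_units)))


def best_cluster_pairing(m, allowed):
    """Stage 2: exhaustive search for the one-to-one cluster pairing
    that maximizes the total of m over allowed pairs (tiny inputs only)."""
    best = (0, [])

    def search(kt, used, total, pairs):
        nonlocal best
        if kt == len(m):
            if total > best[0]:
                best = (total, pairs)
            return
        search(kt + 1, used, total, pairs)  # leave cluster kt unmatched
        for kc in range(len(m[kt])):
            if kc not in used and allowed[kt][kc]:
                search(kt + 1, used | {kc}, total + m[kt][kc],
                       pairs + [(kt, kc)])

    search(0, frozenset(), 0, [])
    return best


def multilevel_match(treated, control, unit_caliper, cluster_caliper):
    """Match units for every cluster pair first, then match clusters."""
    m = [[max_unit_pairs(t["units"], c["units"], unit_caliper)
          for c in control] for t in treated]
    allowed = [[abs(t["x"] - c["x"]) <= cluster_caliper
                for c in control] for t in treated]
    total, pairs = best_cluster_pairing(m, allowed)
    return m, total, pairs


# Hypothetical data: "x" is a cluster covariate, "units" unit covariates.
treated_clusters = [{"x": 1.0, "units": [1.0, 2.0, 3.0]},
                    {"x": 1.2, "units": [10.0, 11.0]}]
control_clusters = [{"x": 1.0, "units": [10.0, 11.0, 12.0]},
                    {"x": 1.1, "units": [1.0, 2.0, 3.0, 4.0]}]
m, total, pairs = multilevel_match(treated_clusters, control_clusters,
                                   unit_caliper=0.5, cluster_caliper=0.5)
```

In this toy example, pairing each treated cluster with its closest control cluster on the cluster covariate (the first treated with the first control, the second with the second) would leave no matched units, whereas matching units first and clusters second recovers five pairs. The stage-1 calls are independent across cluster pairs, so they could be run in parallel.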
With $l$ levels, the procedure would first match the lowest level $l$ under the assumption that levels $l-1, l-2, \ldots, 1$ have been matched optimally; then, once the matches at level $l$ are completed, it would match level $l-1$ under the assumption that levels $l-2, l-3, \ldots, 1$ have been matched optimally, and so on.

Note that our multilevel matching method maximizes the size of the matched sample, but it can also be formulated to minimize a covariate distance between students. If this were the case and if each of the student level matching problems were solved using optimal matching as in Rosenbaum (1989) and Hansen (2007), then a trivial worst-case time bound for the multilevel matching method would be of order $O(J^3 K^2 + K^3)$, where $J = \max\{J_{k_t=1}, \ldots, J_{k_t=K_t}, J_{k_c=1}, \ldots, J_{k_c=K_c}\}$ and $K = \max\{K_t, K_c\}$. In general it is possible to find worst-case time bounds based on the component problems. In our presentation above, each of the component problems is a cardinality matching problem and, while at present there is no known polynomial time algorithm for cardinality matching, in practice many instances with data sets of reasonable size run in time comparable to that of optimal matching. Furthermore, a useful feature of problem (1) is that the student level matches can be found in parallel by separating all the possible pairs of treated and control schools into smaller, mutually exclusive and exhaustive groups of pairs of treated and control schools. In practice, we found the matches using the package mipmatch for R (Zubizarreta 2012).

4 Covariate Balance in the Matched Sample

After applying basic exclusion criteria, there are 64245 students in 517 schools: 150 subsidized and 367 public schools (henceforth treated and control schools, respectively). Of the 64245 students, 15682 are from treated schools and 48563 are from control schools.
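To give a rough sense of scale for the first stage of Algorithm 3.2, the counts above imply the arithmetic below. The last line is only an upper bound, under the assumption (not made in the paper) that every treated school could be paired with every control school; restricting matches to groups of regions makes the actual number of student-level subproblems smaller.

```latex
\begin{align*}
15682 + 48563 &= 64245 && \text{(students in treated and control schools)},\\
150 + 367 &= 517 && \text{(treated and control schools)},\\
K_t \times K_c = 150 \times 367 &= 55050 && \text{(upper bound on candidate school pairs)}.
\end{align*}
```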
Using our multilevel matching method, we matched in two stages within groups of similar regions of the country (namely, regions I-III, IV-V, VI-VII, VIII, IX, X-XII and the Metropolitan region). At the student level, we used cardinality matching to find the largest balanced sample of pairs of students across all the possible combinations of pairs of schools within the groups of regions. In each of these matches we required mean balance for 19 covariates (including student test scores, school test scores, and indicators for socioeconomic status and expected educational achievement; see Table 1 for details), fine balance for 4 covariates (sex, mother and father education, and household income; see Table 2) and distributional balance for the sum of the test scores in language and mathematics at baseline. Figure 1 shows that not only the marginal distributions of the baseline test scores but also their joint distribution are very closely balanced after matching. In fact, the 95% bivariate normal density contours are almost indistinguishable after matching. At the school level, we used the modification of cardinality matching in the second stage of Algorithm 3.2 and mean balanced 16 other covariates: percentage female, total enrollment, language and math scores (plus indicators for missing values), urban area, parental income categories (1-5), and socioeconomic groups (A-D). Again, we matched exactly on the 7 region groups. We balanced all covariates with and without weighting for the size of the school; see Table 3. Note that after matching all the differences in means are smaller than 0.05 standard deviations. In this way, we matched 8130 students in 4065 pairs, and 166 schools in 83 pairs. In this match, 7 out of the 13 regions of the country are represented in both the treatment and control groups.

Table 1: Covariate balance at the student level after matching. All the covariates are measured in 2004.

| Covariate | Mean, Subsidized | Mean, Public | Std. dif. |
|---|---|---|---|
| Language score | 243.24 | 243.68 | -0.01 |
| Mathematics score | 243.43 | 243.04 | 0.01 |
| Natural science score | 246.19 | 247.00 | -0.02 |
| Social science score | 243.01 | 243.91 | -0.02 |
| School language score | 237.57 | 237.02 | 0.03 |
| School mathematics score | 238.37 | 237.89 | 0.03 |
| School female proportion | 0.51 | 0.50 | 0.05 |
| School number of students | 82.34 | 82.34 | 0.00 |
| School teacher to student ratio | 8.09 | 8.03 | 0.03 |
| Urban area | 0.83 | 0.83 | 0.00 |
| Socioeconomic status A | 0.13 | 0.12 | 0.01 |
| Socioeconomic status B | 0.61 | 0.61 | 0.01 |
| Socioeconomic status C | 0.25 | 0.26 | -0.02 |
| Socioeconomic status D | 0.01 | 0.01 | 0.01 |
| Expected education: primary | 0.01 | 0.02 | -0.04 |
| Expected education: secondary, technical-professional | 0.77 | 0.77 | -0.00 |
| Expected education: secondary, scientific-humanities | 0.13 | 0.12 | 0.02 |
| Expected education: technical-professional | 0.09 | 0.10 | -0.01 |
| Expected education: college | 0.00 | 0.00 | 0.01 |

Table 2: Balance for nominal covariates at the student level. All the covariates are measured in 2004. Fine balance constraints balanced sex, type of education of the mother and father, and household income category. The tabulated values are counts of the number of students in each category. In addition, matching was exact for groups of counties (not shown here).

| Covariate | Subsidized | Public |
|---|---|---|
| Sex: Male | 2084 | 2084 |
| Sex: Female | 1981 | 1981 |
| Mother education: Primary school | 1886 | 1886 |
| Mother education: Secondary school | 1152 | 1152 |
| Mother education: Technical | 56 | 56 |
| Mother education: College or higher | 17 | 17 |
| Mother education: Missing | 954 | 954 |
| Father education: Primary school | 1647 | 1647 |
| Father education: Secondary school | 1255 | 1255 |
| Father education: Technical | 56 | 56 |
| Father education: College or higher | 17 | 17 |
| Father education: Missing | 954 | 954 |
| Household income (in 1000 pesos): [0, 100) | 1647 | 1647 |
| Household income (in 1000 pesos): [100, 200] | 1255 | 1255 |
| Household income (in 1000 pesos): (200, 400] | 1643 | 1643 |
| Household income (in 1000 pesos): (400, 600] | 446 | 446 |
| Household income (in 1000 pesos): (600, 1400] | 124 | 124 |
| Household income (in 1000 pesos): > 1400 | 84 | 84 |
| Household income (in 1000 pesos): Missing | 183 | 183 |

Figure 1: Distribution of student test scores at baseline after matching. The baseline test scores are measured in 2004. The ellipses trace the 95% bivariate normal density contours of the joint distributions of test scores for the matched treated and control units. The contours are almost identical, showing that not only the marginal distributions of the test scores but also their joint distribution are very closely balanced.

[Figure: "Distribution of student-level test scores after matching". Horizontal axis: test scores in language at baseline. Vertical axis: test scores in mathematics at baseline. Groups: Subsidized (Sub.) and Public (Pub.).]

Table 3: Covariate balance at the school level after matching. Both means and standardized differences are weighted by the number of students in each school.

| Covariate | Mean, Subsidized | Mean, Public | Std. dif. |
|---|---|---|---|
| Female proportion | 0.49 | 0.49 | -0.00 |
| Number of students | 262.33 | 262.67 | -0.00 |
| Language score | 241.36 | 241.43 | -0.00 |
| Mathematics score | 230.20 | 229.19 | 0.04 |
| Language score missing | 0.00 | 0.00 | 0.00 |
| Math score missing | 0.00 | 0.00 | 0.00 |
| Urban area | 0.98 | 0.99 | -0.03 |
| Income category 1 | 0.08 | 0.09 | -0.02 |
| Income category 2 | 0.80 | 0.79 | 0.01 |
| Income category 3 | 0.12 | 0.12 | 0.00 |
| Income category 4 | 0.00 | 0.00 | 0.00 |
| Income category 5 | 0.00 | 0.00 | 0.00 |
| Socioeconomic status A | 0.18 | 0.18 | 0.01 |
| Socioeconomic status B | 0.71 | 0.71 | 0.01 |
| Socioeconomic status C | 0.10 | 0.11 | -0.02 |
| Socioeconomic status D | 0.00 | 0.00 | 0.00 |

Thus while our match yields highly comparable treated and control groups, geographic coverage is somewhat poor. To that end, we also implemented a second match designed to increase geographic representation.
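The balance criterion used throughout this section, a difference in means expressed in standard deviations, can be sketched as follows. The paper does not state which standard deviation is used in the denominator; the pooled form below (square root of the average of the two group variances) is one common convention and is an assumption here, as are the function names.

```python
import math
from statistics import mean, variance


def standardized_difference(treated, control):
    """Difference in means divided by a pooled standard deviation.
    The pooling rule is an assumption, not taken from the paper."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def is_balanced(treated, control, tolerance=0.05):
    """True if the absolute standardized difference is below the tolerance."""
    return abs(standardized_difference(treated, control)) < tolerance


# Hypothetical covariate values for matched treated and control units.
d = standardized_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With a tolerance of 0.05 this mirrors the stricter of the two school-level matches reported here, and with 0.15 the more permissive one.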
Starting from the same student matches, we also found a more externally valid school match in which all the differences in means are smaller than 0.15 standard deviations, and where 10776 students are matched in 5388 pairs and 210 schools are matched in 105 pairs. In this match, 12 out of the 13 regions in Chile are represented. We deem the first match a match with greater internal validity, and the second match a match with greater external validity. Table 4 compares the results of these two matches. In the next section we compare treatment effect estimates for these two different matched samples.

Table 4: Comparison of the two school level matches. In the internally valid match the largest standardized difference in means is smaller than 0.05, whereas in the externally valid match this difference is smaller than 0.15. Regions represented are the regions present in both the treatment and control groups.

| | Internal validity match | External validity match |
|---|---|---|
| Matched students | 8130 | 10776 |
| Matched schools | 166 | 210 |
| Regions represented | 4, 5, 6, 7, 9, 10, 13 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13 |

5 Outcome Analyses

Having found the optimal match, we now estimate the voucher effect, test its significance, and assess the robustness of these conclusions to hidden bias with sensitivity analysis methods. We follow the notation and methods for clustered experiments by Small et al. (2008), and use the sensitivity analysis methods for clustered observational studies by Hansen et al. (2014).

5.1 Notation: Treatment Effects for Students in Schools

There are $S$ matched pairs of clusters, $s = 1, \ldots, S$, each with two schools, $j = 1, 2$, one treated and one control, for $2S$ total units. The ordered pair $sj$ thus identifies a unique cluster. Each cluster $sj$ contains $n_{sj} > 1$ individuals, $i = 1, \ldots, n_{sj}$. Each pair is matched for observed, pretreatment covariates, so $x_{s11} = x_{s22}$ for each $j$ and $i$, where $x_{sji}$ represents the observed covariates on which we matched.
A student $i$ in school $sj$ is described by both observed covariates and an unobserved covariate $u_{sji}$. The set $(x_{sji}, u_{sji})$ may describe either the student $sji$ or the school $sj$ containing this student. In our study, treatment assignment occurs at the school level as whole schools are assigned to treatment or control. If the $j$th school in pair $s$ receives the treatment, write $Z_{sj} = 1$, whereas if this school receives the control, write $Z_{sj} = 0$, so $Z_{s1} + Z_{s2} = 1$ for each $s$, as each pair contains one treated school and one control school. If $n_{sj} = 1$ for all $sj$ then the clusters are individuals, and we have unclustered treatment assignment. Each student has two potential responses: one response that is observed under treatment, $Z_{sj} = 1$, and the other observed under control, $Z_{sj} = 0$ (Neyman 1923; Rubin 1974). We denote these responses with $(y_{Tsji}, y_{Csji})$, where $y_{Tsji}$ is observed from the $i$th subject in pair $s$ under $Z_{sj} = 1$, and $y_{Csji}$ is observed from this subject under $Z_{sj} = 0$. In our application $y_{Tsji}$ is the test score that student $sji$ would exhibit if he or she switched from a public school to a private voucher school and $y_{Csji}$ is the test score this same student would exhibit if he or she remained in a public school. Under this notation, we allow for interference among students in the same school but not across schools. In this context, $y_{Tsji}$ denotes the response of student $sji$ if all students in school $sj$ receive the treatment, while $y_{Csji}$ denotes the response of student $sji$ if all students in school $sj$ receive the control. Therefore, we do not assume that we would observe the same response from student $sji$ if the treatment were assigned to some but not all of the students in school $sj$. For each student, the unobservable effect of treatment is $y_{Tsji} - y_{Csji}$, which is the change in test scores induced by attending a private voucher school. We do not observe both potential outcomes, but we do observe responses: $Y_{sji} = Z_{sj}\, y_{Tsji} + (1 - Z_{sj})\, y_{Csji}$.
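The relation between potential and observed responses can be made concrete with a small sketch. The function name and the scores are hypothetical; the only content taken from the text is that the whole school shares one treatment indicator and that each student reveals exactly one of the two potential responses.

```python
def observed_responses(z_school, y_treated, y_control):
    """Observed responses for every student in one school.

    z_school  : 1 if the school is a private voucher school, 0 if public
    y_treated : potential responses of the students under treatment
    y_control : potential responses of the same students under control
    """
    return [z_school * yt + (1 - z_school) * yc
            for yt, yc in zip(y_treated, y_control)]


# Hypothetical potential test scores for two students in one school.
y_t = [250.0, 260.0]
y_c = [240.0, 255.0]
under_treatment = observed_responses(1, y_t, y_c)
under_control = observed_responses(0, y_t, y_c)
```

Only one of the two lists is ever observed for a given school, which is why the individual effects cannot be computed directly from the data.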
Under this framework, the observed response $Y_{sji}$ varies with $Z_{sj}$ but the potential outcomes do not vary with treatment assignment. Write $R = (Y_{111}, \ldots, Y_{S2,n_{S2}})^T$ for the $N = \sum_{s,j} n_{sj}$ dimensional vector of observed responses, with the same notation for $y_c$, the vector of potential responses under control. Below we test the sharp null hypothesis of no treatment effect on $(y_{Tsji}, y_{Csji})$, which stipulates that $H_0: y_{Tsji} = y_{Csji}$ for all $sji$ (Fisher 1935). This hypothesis asserts that changing the treatment assigned to school $sj$ would leave the response of student $sji$ unchanged.

5.2 Randomization Inference When Treatment is Assigned at the School Level

In our analysis, we initially assume that treatment is as-if randomly assigned to clusters conditional on the matches. In short, we assume that it is as if the toss of a fair coin was used to allocate private voucher status within matched school pairs. We collect in the set $\Omega$ the $2^S$ possible treatment assignments for all $2S$ clusters, $Z = (Z_{11}, Z_{12}, \ldots, Z_{S2})^T$. In a matched-pair group randomized experiment, one treatment assignment $z \in \Omega$ would be picked at random, and each assignment would therefore have probability $\Pr(Z = z) = 2^{-S}$, which yields the randomization distribution. In an observational study, if the probability of receiving treatment is equal for both schools in each pair, then the conditional distribution of $Z$ given that there is exactly one treated unit in each pair equals this randomization distribution, and $\Pr(Z_{sj} = 1) = 1/2$ for each unit $j$ in pair $s$ (see Rosenbaum 2002 for details). However, in an observational study it may not be true that $\Pr(Z_{sj} = 1) = 1/2$ for each unit $j$ in pair $s$ due to an unobserved covariate $u_{sji}$. We explore this possibility through a sensitivity analysis described below. To test Fisher’s sharp null hypothesis of no treatment effect, we define a test statistic $T = t(Z, R)$, which is a function of $Z$ and $R$. Under the sharp null hypothesis $R = y_c$, and therefore $T = t(Z, y_c)$.
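For a small number of pairs, the randomization distribution over the $2^S$ assignments can be enumerated directly. The sketch below is illustrative only: it assumes a generic statistic of the form "sum over pairs of a sign times a fixed pair-level quantity", where the sign is +1 if the first school in the pair is treated and -1 otherwise, and the pair-level quantities are hypothetical numbers that are fixed under the sharp null hypothesis.

```python
from itertools import product


def exact_randomization_pvalue(pair_scores, observed_signs):
    """One-sided exact p-value under equal-probability assignment within pairs.

    pair_scores    : fixed pair-level quantities (hypothetical values here)
    observed_signs : +1 if the first school in the pair is treated, else -1
    """
    t_obs = sum(b * q for b, q in zip(observed_signs, pair_scores))
    # Every sign pattern corresponds to one assignment with probability 2**(-S).
    assignments = list(product([1, -1], repeat=len(pair_scores)))
    at_least = sum(
        1 for signs in assignments
        if sum(b * q for b, q in zip(signs, pair_scores)) >= t_obs
    )
    return at_least / len(assignments)


# Three hypothetical matched pairs.
p = exact_randomization_pvalue([3.0, -3.0, 1.0], [1, -1, 1])
```

Enumeration becomes impractical as the number of pairs grows, which is one reason a large-sample approximation is convenient in applications with many matched pairs.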
If the model for treatment assignment above were true, then the randomization distribution for $T$ is
$$\Pr\{t(Z, R) \geq w \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega\} = \Pr\{t(Z, y_c) \geq w \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega\},$$
since $y_c$ is fixed by conditioning on $y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}$ and $\Pr(Z = z \mid y_{Tsji}, y_{Csji}, x_{sji}, u_{sji}, \Omega) = 1/|\Omega|$ for each $z \in \Omega$.

We use a test statistic from Hansen et al. (2014) that provides inferences for units in clusters. For this test statistic, $q_{sji}$ is a score or rank given to $Y_{sji}$, so that under the null hypothesis, the $q_{sji}$ are functions of the $y_{Csji}$ and $x_{sji}$, and they do not vary with $Z_{sj}$. To make $q_{sji}$ resistant to outliers, we use the ranks of the residuals when $Y_{sji}$ is regressed on the student level covariates using Huber’s method of m-estimation (Small et al. 2008). We regressed the outcome, test scores recorded in 2006, on student level test scores recorded in 2004, when the student was still in primary school. The test statistic $T$ is a weighted sum of the mean ranks in the treated school minus the mean ranks in the control school. Formally, the test statistic is
$$T = \sum_{s=1}^{S} B_s Q_s, \quad \text{where } B_s = 2Z_{s1} - 1 = \pm 1, \quad Q_s = \frac{w_s}{n_{s1}} \sum_{i=1}^{n_{s1}} q_{s1i} - \frac{w_s}{n_{s2}} \sum_{i=1}^{n_{s2}} q_{s2i}.$$
Hansen et al. (2014) show that $T$ is the sum of $S$ independent random variables each taking the value $\pm Q_s$ with probability 1/2, so $E(T) = 0$ and $\mathrm{var}(T) = \sum_{s=1}^{S} Q_s^2$. The central limit theorem implies that as $S \to \infty$, $T/\sqrt{\mathrm{var}(T)}$ converges in distribution to the standard Normal distribution.

In the above equation, $w_s$ defines the weights, which are a function of $n_{sj}$. The choice of weights $w_s$ has important implications in our application. Hansen et al. (2014) discuss three possible choices for $w_s$. One possibility is to use constant weights, $w_s \propto 1$. Another possibility is to use weights that are proportional to the total number of students in a matched cluster pair: $w_s \propto n_{s1} + n_{s2}$ or $w_s = (n_{s1} + n_{s2}) / \sum_{l=1}^{S} (n_{l1} + n_{l2})$.
These proportional weights are particularly useful if we believe that the private voucher school effect varies with cluster size. This would be true if, for example, the private school effect was larger in smaller schools. However, if we suspect that the private voucher school effect is constant, we could select the weights to minimize the variance of $T$. For example, $w_s \propto n_{s1} n_{s2}/(n_{s1} + n_{s2})$ will minimize the variance of $T$ if cross cluster variability is low, while constant weights will minimize the variance if there is little variance within schools. Hansen et al. (2014) note that for testing the null hypothesis each set of weights is valid. Given that our cluster sizes exhibit considerable variation, and it is fully possible that the treatment effect varies with cluster size, we use all three sets of weights to test the sharp null hypothesis. Below we discuss how we incorporate the different weights into the sensitivity analysis.

If we test the hypothesis of a shift effect instead of the hypothesis of no effect, we can apply the method of Hodges and Lehmann (1963) to estimate the voucher school effect. The Hodges and Lehmann (HL) estimate of $\tau$ is the value of $\tau_0$ that, when subtracted from $Y_{sji}$, makes $T$ as close as possible to its null expectation. Intuitively, the point estimate $\hat{\tau}$ is the value of $\tau_0$ such that $T_{\tau_0}$ equals 0 when $T_{\tau_0}$ is computed from $Y_{sji} - Z_{sj}\tau_0$. Using constant effects is convenient, but this assumption can be relaxed; see Rosenbaum (2003). If the treatment has an additive effect, $y_{Tsji} = y_{Csji} + \tau$, then a 95% confidence interval for the additive treatment effect is formed by testing a series of hypotheses $H_0: \tau = \tau_0$ and retaining the set of values of $\tau_0$ not rejected at the 5% level.

5.3 Comparative Effectiveness of Public Versus Private Voucher Schools

We now test the hypothesis of no effect for private voucher schools. We test this hypothesis in both matches.
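Before turning to the results, the test statistic of Section 5.2 and its Normal approximation can be sketched as follows. This is a simplified illustration, not the analysis code used for the paper: the scores $q_{sji}$ are taken as given (the ranking of Huber residuals is not reproduced), the weight labels are hypothetical names for the three choices discussed above, and the p-value is the one-sided large-sample approximation.

```python
import math


def pair_weight(n1, n2, kind):
    """Weights up to a constant of proportionality; labels are hypothetical."""
    if kind == "constant":
        return 1.0
    if kind == "proportional":
        return float(n1 + n2)
    if kind == "harmonic":
        return n1 * n2 / (n1 + n2)
    raise ValueError(f"unknown weight type: {kind}")


def cluster_test(pairs, kind="constant"):
    """pairs: list of (z1, q1, q2), where z1 is 1 if the first school in the
    pair is treated and q1, q2 are the lists of student scores in each school.
    Returns (T, var(T), approximate one-sided p-value)."""
    t_stat, var_t = 0.0, 0.0
    for z1, q1, q2 in pairs:
        w = pair_weight(len(q1), len(q2), kind)
        q_s = w * (sum(q1) / len(q1) - sum(q2) / len(q2))  # Q_s
        b_s = 2 * z1 - 1                                   # B_s = +1 or -1
        t_stat += b_s * q_s
        var_t += q_s ** 2
    z = t_stat / math.sqrt(var_t)  # assumes at least one nonzero Q_s
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail Normal probability
    return t_stat, var_t, p_upper


# Two hypothetical matched pairs of schools with two students each.
example = [(1, [3.0, 5.0], [1.0, 1.0]), (0, [2.0, 2.0], [4.0, 6.0])]
t_stat, var_t, p_value = cluster_test(example, kind="constant")
```

Because the weights enter only up to proportionality, rescaling them changes $T$ and its variance but not the standardized statistic; the three choices differ only when school sizes vary across pairs.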
For the match with greater external validity, with constant weights $w_s \propto 1$, the approximate one-sided p-value is 0.256. Thus we are unable to reject the null hypothesis that voucher schools are completely without effect. We also found that the other two sets of weights, $w_s \propto n_{s1} + n_{s2}$ and $w_s \propto n_{s1} n_{s2}/(n_{s1} + n_{s2})$, lead to identical p-values of 0.292. In the absence of bias from hidden confounders, the point estimate is $\hat{\tau} = 2.81$ with a 95% confidence interval from -5.68 to 11.34. For the match with greater internal validity with constant weights, the approximate one-sided p-value is 0.492. Using non-constant weights, we find the approximate one-sided p-value is 0.633. If there are no hidden confounders, the point estimate for the match with greater internal validity is $\hat{\tau} = 0.0743$ with a 95% confidence interval from -8.58 to 9.36. Thus for both matches, we cannot reject the hypothesis that attending a private voucher school has no effect on test scores. We next explore the possibility that bias from a hidden confounder masks a treatment effect.

5.4 Test of Equivalence and Sensitivity Analysis

In an observational study, one concern is that bias from a hidden covariate can give the impression that a treatment effect exists when in fact no effect is present. Bias from hidden confounders can also mask an actual treatment effect, leaving the analyst to conclude there is no effect when in fact such an effect exists. We explore this possibility using a test of equivalence and a sensitivity analysis (Rosenbaum 2008; Rosenbaum and Silber 2009; Rosenbaum 2010). Above we were unable to reject the null hypothesis that $\tau = 0$ for all students. Next, we apply a test of equivalence to test the hypothesis that $\tau$ is not small. Under a test of equivalence, we test the following null hypothesis $H_{\neq}^{(\delta)}: |\tau| > \delta$. Rejecting $H_{\neq}^{(\delta)}$ provides a basis for asserting with confidence that $|\tau| < \delta$.
$H_{\neq}^{(\delta)}$ is the union of two exclusive hypotheses, $\overleftarrow{H}_0^{(\delta)}: \tau \leq -\delta$ and $\overrightarrow{H}_0^{(\delta)}: \tau \geq \delta$, and $H_{\neq}^{(\delta)}$ is rejected if both $\overleftarrow{H}_0^{(\delta)}$ and $\overrightarrow{H}_0^{(\delta)}$ are rejected (Rosenbaum and Silber 2009). We can apply the two tests without correction for multiple testing since we test two mutually exclusive hypotheses. Thus we can test whether the estimate from our study is different from other possible treatment effects which are represented by $\delta$. With a test of equivalence, it is not possible to demonstrate a total absence of effect, but if this were a randomized trial we could safely test that our estimated effect is not as large as $\delta$. That is, we may be able to reject $H_{\neq}^{(\delta)}: |\tau| > \delta$. In an observational study, however, there are additional complications. Since the treatment was not randomly assigned, it may be the case that we reject the null hypothesis of equivalence due to hidden confounding. However, using a sensitivity analysis we may find evidence that the test of equivalence is insensitive to biases from nonrandom treatment assignment.

In a sensitivity analysis, we quantify the degree to which a key assumption must be violated in order for our inference to be reversed. Our model of treatment assignment assumes that within matched pairs, receipt of the treatment is effectively random conditional on the matches. We consider how sensitive our conclusions are to violations of this assumption using a model of sensitivity analysis discussed in Rosenbaum (2002, ch. 4). In our study, matching on observed covariates $x_{sji}$ made students more similar in their chances of being exposed to the treatment. However, we may have failed to match on an important unobserved covariate $u_{sji}$ such that $x_{sji} = x_{sji'}\ \forall\ s, j, i, i'$, but possibly $u_{sji} \neq u_{sji'}$. If true, the probability of being exposed to treatment may not be constant within matched pairs.
To explore this possibility, we use a sensitivity analysis that imagines that before matching, student $i$ in pair $s$ had a probability, $\pi_s$, of being exposed to the voucher school treatment. For two matched students in pair $s$, say $i$ and $i'$, because they have the same observed covariates $x_{sji} = x_{sji'}$ it may be true that $\pi_s = \pi_{s'}$. However, if these two students differ in an unobserved covariate, $u_{sji} \neq u_{sji'}$, then these two students may differ in their odds of being exposed to the voucher school treatment by at most a factor of $\Gamma \geq 1$ such that
$$\frac{1}{\Gamma} \leq \frac{\pi_s/(1 - \pi_s)}{\pi_{s'}/(1 - \pi_{s'})} \leq \Gamma, \quad \forall\ s, s', \text{ with } x_{sji} = x_{sji'}\ \forall\ j, i, i'. \tag{4}$$
If $\Gamma = 1$, then $\pi_s = \pi_{s'}$, and the randomization distribution for $T$ is valid. If $\Gamma > 1$, then quantities such as p-values and point estimates are unknown but are bounded by a known interval. In a sensitivity analysis, we use several values of $\Gamma$ to compute bounds on the p-value for the test of equivalence. We then observe at which value of $\Gamma$ the upper bound on the p-value exceeds 0.05. If the value of $\Gamma$ is large, we can be confident that it would take a large bias from a hidden confounder to reverse the conclusions of the study. The derivation for the sensitivity analysis as applied to our test statistic $T$ is in Hansen et al. (2014).

Under a test of equivalence, we may be able to reject $H_{\neq}^{(\delta)}: |\tau| > \delta$ if the p-value from the test is low. Rejecting this null allows us to infer that the estimated treatment effect is not as large as $\delta$. We then apply the sensitivity analysis to understand whether this inference is sensitive to biases from nonrandom treatment assignment. In the analysis, we observe at what value of $\Gamma$ the p-value exceeds the conventional 0.05 threshold for each test. If this $\Gamma$ value is relatively large, we can be confident that the test of equivalence is not sensitive to hidden bias from nonrandom treatment assignment. Hansen et al. (2014) note that sensitivity to hidden bias may vary with the choice of weights $w_s$.
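The way a bound on the p-value is obtained for a given $\Gamma$ can be sketched with the generic large-sample bound for paired sign-flip statistics. This is an assumption-laden illustration: the exact derivation for $T$ is in Hansen et al. (2014), and the code below simply assumes that, in the worst case, each pair contributes $+|Q_s|$ with probability $\Gamma/(1+\Gamma)$ and $-|Q_s|$ otherwise, then applies a Normal approximation. The function name and inputs are hypothetical.

```python
import math


def sensitivity_upper_pvalue(t_obs, q, gamma):
    """Approximate upper bound on the one-sided p-value of T = sum_s B_s Q_s
    when treatment odds within a pair may differ by at most gamma >= 1."""
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_plus = gamma / (1.0 + gamma)  # worst-case chance of the favorable sign
    mean = sum(abs(x) for x in q) * (2 * p_plus - 1)
    var = sum(x * x for x in q) * 4 * p_plus * (1 - p_plus)
    z = (t_obs - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical pair-level quantities Q_s and observed statistic.
q_values = [3.0, -3.0]
p_gamma_1 = sensitivity_upper_pvalue(6.0, q_values, gamma=1.0)
p_gamma_2 = sensitivity_upper_pvalue(6.0, q_values, gamma=2.0)
```

At $\Gamma = 1$ the bound reduces to the randomization p-value, and it grows as $\Gamma$ increases; a sensitivity analysis reports the value of $\Gamma$ at which it first crosses 0.05.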
To understand whether different weights lead to different sensitivities to hidden confounders, we can conduct a different sensitivity analysis for each set of weights and correct these tests using a Bonferroni correction. However, Rosenbaum (2012b) shows that the Bonferroni correction is overly conservative when applied to sensitivity analysis. He develops an alternative multiple testing correction based on correlations among the test statistics. Under this correction, for a given value of $\Gamma$ we conduct a sensitivity analysis using each set of weights. We then apply the multiple testing correction from Rosenbaum (2012b), which produces a single corrected p-value for that value of $\Gamma$.

5.5 How Much Bias Would Need to be Present to Mask a Positive Effect of Private Voucher Schools?

We now apply the test of equivalence to both matches. In this test, the null hypothesis asserts $H_{\neq}^{(\delta)}: |\tau| > \delta$ for some specified $\delta > 0$. Rejection of this null hypothesis provides evidence that the effect of attending a private voucher school on test scores is less than $\delta$. What values should we select for $\delta$? A number of studies in the literature have found that private voucher schools increase test score achievement. The smallest effect size in the extant literature is 0.15 of a standard deviation (Sapelli and Vial 2002). However, among low income students the effects may be as large as 0.5 of a standard deviation, and Sapelli and Vial (2005) find an effect size of 0.6 standard deviations. These results suggest a range of possible effects from 0.15 to 0.6 standard deviations. To that end, we use three values for $\delta$: 0.15, 0.30 and 0.6 standard deviations. This allows us to test whether the point estimates in our study are equivalent to small, medium or large voucher effects. Thus we define three values $\delta_1$, $\delta_2$, and $\delta_3$ to correspond to these three different possible effect sizes.
We first ask whether the point estimate from the match with greater external validity is large relative to effect sizes in the literature. Table 5 contains a summary of the test of equivalence and sensitivity analysis for the match with greater external validity. We first assume that there is no hidden bias such that $\Gamma = 1$. We first test $\overleftarrow{H}_0^{(\delta_1)}$ and find that the one-sided p-value from this test is 0.008. We then test $\overrightarrow{H}_0^{(\delta_1)}$ and we find that the one-sided p-value is 0.089. Therefore we are unable to reject $H_{\neq}^{(\delta_1)}$ for the match with greater external validity. Thus our point estimate from this match may be consistent with a small effect. For a larger effect size of 0.30 standard deviations, however, we can reject $H_{\neq}^{(\delta_2)}$ with a p-value of 0.003. Thus the estimated treatment effect is not consistent with a moderate effect size. Is this inference sensitive to bias from a confounder? We find that for $\Gamma = 1.85$, the p-value is 0.049. A bias of magnitude $\Gamma = 1.85$ means that two matched students would have to differ in terms of an unobserved $u_{sji}$ such that one student is almost twice as likely as the other to attend a private voucher school before our conclusions would be altered. Finally, we test whether the point estimate is equivalent to a large effect size of 0.60 standard deviations. Again, we can reject $H_{\neq}^{(\delta_3)}$ with p < .001. With a bias of magnitude $\Gamma = 5.71$ the p-value is 0.049. Therefore, it would take a very large bias for our conclusions about a large treatment effect to be altered.

Table 5: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testing for the Externally Valid Match

| $\Gamma$ | $H_0: \lvert\tau_0\rvert > \delta_1$ | $H_0: \lvert\tau_0\rvert > \delta_2$ | $H_0: \lvert\tau_0\rvert > \delta_3$ |
|---|---|---|---|
| 1 | 0.089 | 0.003 | <0.001 |
| 1.01 | 0.092 | 0.004 | <0.001 |
| 1.85 | 0.404 | 0.049 | 0.001 |
| 5.71 | 0.991 | 0.515 | 0.049 |

We now apply the test of equivalence and sensitivity analysis to the results from the match with greater internal validity.
For this match, we are able to reject $H_{\neq}^{(\delta_1)}$ with a p-value of 0.036 when $\Gamma = 1$. Therefore, we are able to reject the null that the smaller treatment effect we observe in this design is equivalent to the smallest estimated effect in the extant literature. However, we find that when $\Gamma$ is as small as 1.1 the p-value for the test of equivalence is 0.048. Thus if students differed by as much as 10 percent in the odds of being treated, that could explain our inference. For a moderate effect size of 0.30 standard deviations, we can easily reject $H_{\neq}^{(\delta_2)}$ when $\Gamma = 1$, and we find that when $\Gamma = 2.16$ the p-value is 0.049. Finally, for a large effect size we find that when $\Gamma = 4.84$, the p-value is 0.048.

In sum, for both matches, we either cannot reject that the estimated effect is as large as the smallest effects found in previous studies, or the rejection could be easily explained by unobserved confounding. Bias from an unobserved covariate would need to roughly double the odds of selecting a private voucher school to mask a moderate size effect of 0.30 standard deviations. To mask a large effect size of 0.60 standard deviations, the bias from the unobserved confounders would have to roughly quintuple the odds of differential treatment assignment in both matches.

Table 6: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testing for the Internally Valid Match

| $\Gamma$ | $H_0: \lvert\tau_0\rvert > \delta_1$ | $H_0: \lvert\tau_0\rvert > \delta_2$ | $H_0: \lvert\tau_0\rvert > \delta_3$ |
|---|---|---|---|
| 1 | 0.036 | 0.003 | 0.0015 |
| 1.1 | 0.048 | 0.005 | <0.001 |
| 2.16 | 0.392 | 0.049 | 0.007 |
| 4.84 | 0.923 | 0.243 | 0.048 |

6 Summary and Discussion

Clustered observational studies with hierarchical or multilevel data are very common in the health and social sciences. In this paper we showed that the optimal matching strategy in these settings is, under the assumption that clusters have been matched optimally, first match units and then, considering these optimal unit-level matches, match clusters.
We emphasized that this strategy explicitly uses the nested structure of the data by breaking the multilevel matching problem into simpler, smaller matching subproblems that are solved only once and that can be solved in parallel to yield an optimal global solution. We implemented this strategy using and extending cardinality matching to find the largest matched sample of pairs of treated and control units within pairs of treated and control clusters that is balanced according to specifications given by the researcher. Unlike other matching methods for multilevel data, the resulting method is optimal in the sense that it maximizes the size of the matched sample (or minimizes the covariate distances between matched units) and it does not require estimating the propensity score (because it directly balances covariates as specified by the researcher). As mentioned, these specifications for covariate balance may require not only mean balance, but also other forms of distributional balance such as fine balance, x-fine balance, and strength-k matching. In practice, this method facilitates sensitivity analyses to hidden biases due to unobserved covariates, and it readily extends to clustered observational studies with three or more levels of data. To our knowledge, this method is the first application of dynamic and integer programming to observational studies.

References

Anand, P., Mizala, A., and Repetto, A. (2009), “Using School Scholarships to Estimate the Effect of Government Subsidized Private Education on Academic Achievement in Chile,” Economics of Education Review, 28, 370–381.

Arpino, B. and Mealli, F. (2011), “The specification of the propensity score in multilevel observational studies,” Computational Statistics & Data Analysis, 55, 1770–1780.

Bellman, R. (1957), Dynamic Programming, Princeton, NJ: Princeton University Press.

Bertsekas, D. P. (2005), Dynamic Programming and Optimal Control, Vol. I, Belmont, MA: Athena Scientific.

Feller, A. and Gelman, A. (2015), “Hierarchical Models for Causal Effects,” Working Paper.

Fisher, R. A. (1935), The Design of Experiments, London: Oliver and Boyd.

Gelman, A. (2006), “Multilevel (hierarchical) modeling: what it can and cannot do,” Technometrics, 48, 432–435.

Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), “Optimal Multivariate Matching Before Randomization,” Biostatistics, 5, 263–275.

Hansen, B. B. (2007), “Flexible, Optimal Matching for Observational Studies,” R News, 7, 18–24.

Hansen, B. B., Rosenbaum, P. R., and Small, D. S. (2014), “Clustered Treatment Assignments and Sensitivity to Unmeasured Biases in Observational Studies,” Journal of the American Statistical Association, 109, 133–144.

Hodges, J. L. and Lehmann, E. (1963), “Estimates of Location Based on Ranks,” The Annals of Mathematical Statistics, 34, 598–611.

Hong, G. and Raudenbush, S. W. (2006), “Evaluating Kindergarten Retention Policy: A Case Study of Causal Inference for Multilevel Observational Data,” Journal of the American Statistical Association, 101, 901–910.

Hsieh, C.-T. and Urquiola, M. (2006), “The Effects of Generalized School Choice on Achievement and Stratification: Evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.

Hsu, J. Y., Zubizarreta, J. R., Small, D. S., and Rosenbaum, P. R. (2015), “Strong Control of the Family-Wise Error Rate in Observational Studies that Discover Effect Modification by Exploratory Methods,” Working Paper.

Lara, B., Mizala, A., and Repetto, A. (2011), “The Effectiveness of Private Voucher Education: Evidence From Structural School Switches,” Educational Evaluation and Policy Analysis, 33, 119–137.

Lee, V. E. and Bryk, A. S. (1989), “A Multilevel Model of The Social Distribution of High School Achievement,” Sociology of Education, 62, 172–192.

Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013), “Propensity score weighting with multilevel data,” Statistics in Medicine, 32, 3373–3387.
Lu, B., Greevy, R., Xu, X., and Beck, C. (2011), “Optimal Nonbipartite Matching and its Statistical Applications,” The American Statistician, 65, 21–30.
McEwan, P. J. (2001), “The Effectiveness of Public, Catholic, and Non-Religious Private Schools in Chile’s Voucher System,” Education Economics, 9, 183–219.
MINEDUC (2013), “Estadísticas de la Educación,” http://www.ministeriodesarrollosocial.gob.cl/encuestapost-terremoto/index.html.
Mizala, A. and Romaguera, P. (2001), “Factors Explaining Secondary Education Outcomes in Chile,” El Trimestre Economico, 272, 515–549.
Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science, 5, 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed (1990).
Rosenbaum, P. R. (1984), “The Consequences of Adjusting For a Concomitant Variable That Has Been Affected By The Treatment,” Journal of The Royal Statistical Society Series A, 147, 656–666.
Rosenbaum, P. R. (1987), “Sensitivity Analysis for Certain Permutation Inferences in Matched Observational Studies,” Biometrika, 74, 13–26.
Rosenbaum, P. R. (1989), “Optimal Matching for Observational Studies,” Journal of the American Statistical Association, 84, 1024–1032.
Rosenbaum, P. R. (2002), Observational Studies, New York, NY: Springer, 2nd ed.
Rosenbaum, P. R. (2003), “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank Test,” The American Statistician, 57, 132–138.
Rosenbaum, P. R. (2006), “Differential effects and generic biases in observational studies,” Biometrika, 93, 573–586.
Rosenbaum, P. R. (2008), “Testing hypotheses in order,” Biometrika, 95, 248–252.
Rosenbaum, P. R. (2010), Design of Observational Studies, New York: Springer-Verlag.
Rosenbaum, P. R. (2012a), “Optimal Matching of an Optimally Chosen Subset in Observational Studies,” Journal of Computational and Graphical Statistics, 21, 57–71.
Rosenbaum, P. R. (2012b), “Testing One Hypothesis Twice in Observational Studies,” Biometrika, 99, 763–774.
Rosenbaum, P. R., Ross, R. N., and Silber, J. H. (2007), “Minimum Distance Matched Sampling with Fine Balance in an Observational Study of Treatment for Ovarian Cancer,” Journal of the American Statistical Association, 102, 75–83.
Rosenbaum, P. R. and Silber, J. H. (2009), “Sensitivity Analysis for Equivalence and Difference in an Observational Study of Neonatal Intensive Care Units,” Journal of the American Statistical Association, 104, 501–511.
Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701.
Sapelli, C. and Vial, B. (2002), “The Performance of Private and Public Schools in the Chilean Voucher System,” Cuadernos De Economia, 39.
Sapelli, C. and Vial, B. (2005), “Private vs public voucher schools in Chile: New evidence on efficiency and peer effects,” Working Paper 289, Catholic University of Chile, Instituto de Economia.
Small, D. S., Ten Have, T. R., and Rosenbaum, P. R. (2008), “Randomization Inference in a Group-Randomized Trial of Treatments for Depression: Covariate Adjustment, Noncompliance, and Quantile Effects,” Journal of the American Statistical Association, 103, 271–279.
Stuart, E. A. (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,” Statistical Science, 25, 1–21.
Zubizarreta, J. R. (2012), “Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery,” Journal of the American Statistical Association, 107, 1360–1371.
Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), “Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile,” Annals of Applied Statistics, 8, 204–231.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), “Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery,” The American Statistician, 65, 229–238.