Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple RCTs of a Program to Support Decision-Making

Andrew P. Jaciw, Denis Newman, Val Lazarev, Boya Ma
Empirical Education Inc.

Background

Introduction: Imagine a scenario in which a school or district decision-maker is looking to results from several well-conducted randomized control trials (RCTs) of a literacy program to understand its potential to work in his or her specific context. Looking informally across the studies, there is no clear answer. In most, but not all, cases the impacts appear positive, possibly more so in biology classes. Each experiment is situated in a specific context. Some of the trials were conducted in ELA classes, others across several types of science classes. Also, the amount of information for contextualizing each trial varies. How might the decision-maker make sense of these results?

In this work we take a mixed-methods approach to understand the potential reach of causal effects from five RCTs of the Reading Apprenticeship (RA) program. A critical aspect of translating research findings is helping decision-makers to address the complexity of results. We use four types of research results to reach summative conclusions about the efficacy of the program: (1) we summarize and synthesize impact findings for the five studies and formally test for effect heterogeneity, (2) we use mediation analysis to explore a plausible mechanism for impact using results from one RCT, (3) we explore impacts on instructional strategies using results from one RCT, and (4) we posit several hypotheses concerning conditions for impact and validate them through a prediction exercise. To provide focus to the work, we narrow the scenario: we consider the problem of deciding whether the program has positive impact overall and whether it is especially effective in biology classes, and we investigate possible mechanisms by which impact for biology may be greater.
Steps (1)-(4), above, lead to conclusions about overall impact and about impact for biology classes. The work serves as an example of a systematic approach to translating research to support decision-making by consumers of research, specifically about whether to adopt the program.

Prior Research

The Institute of Education Sciences (IES) has provided us with research standards and tools for assessing the statistical conclusions and internal validity of results from experiments (e.g., Schochet, 2009; WWC, 2014). Understanding the reach of causal inferences, and especially translating results from experiments to individual contexts, also requires consideration of external validity. While there is no WWC-like guide for judging the validity of generalized inferences, there are several established approaches to build warrants for claims of external validity (Cook, 2002; Cronbach, 1982; Shadish, Cook and Campbell, 2002; Tipton, 2014). Critical to these approaches is variation in program impact and methods of accounting for differences in impact. Variation in program impact may be due to factors that we can observe and measure, including differences in the populations being treated, in how treatment components are distributed, and in economic conditions (Bloom, Hill, & Riccio, 2003; Hotz, Imbens, & Mortimer, 2005). Cronbach (1982) and Cronbach et al. (1980) were particularly concerned with this question. They recognized that a decision-maker is likely to consider not just estimates of marginal impacts, but finer-grained information about contexts and potential mechanisms when deciding whether program effects are likely to extrapolate to their own contexts. They also provided a useful framework for situating experimental results. Specifically, they considered differences in participants, treatment variants, the role of outcome measures, and interactions of treatment with setting.
Translating impact findings from several randomized trials of the same (or very similar) interventions to inform a single decision, “whether to adopt under specific conditions - yes or no”, requires weighing different kinds of evidence. In the current study, we consider three past RCTs and two current experiments of the same program (we have just completed analysis of one, and for the other, impacts will be analyzed by December 2015). We situate the results using the framework by Cronbach described above, and then, through a series of exploratory analyses, attempt to unpack the findings to inform a question that a hypothetical decision-maker would want answered: Does the program work, and can we claim that it works especially well in biology classes?

Purpose / Objective / Research Question / Focus of Study

Our research questions reflect the mixed-methods approach, whereby we consider both coarser findings across studies and more granular results within individual experiments, leading to a final determination of whether there is evidence of overall impact of RA as well as differential impact favoring biology classes. Specifically:
1. Is there an impact of RA on reading literacy across five experiments and several different content domains: ELA, history, and science (biology, physics, chemistry, Earth science)? Does the program have a positive impact generally, and in biology classes specifically? (For the final study we will use meta-analysis to synthesize results across all five experiments and run a formal test of the heterogeneity of impact; for this proposal we provide a table of impact estimates and their standard errors for the impact results so far.)
2. Is there evidence of differential impact between biology and non-biology classes in use of literacy strategies supported by RA? (Based on results from RCT4)
3. Is there evidence that impacts of RA occur through different mediating processes for biology compared to the other subject areas?
(Based on results from RCT4)
4. Answers to (1)-(3) should tell us not just whether we have observed greater impact in biology, but should also allow us to hypothesize a mechanism: the mediating and instructional processes in biology classes that potentially facilitate greater impact than in other subjects. In turn, we can use this information to predict conditions for observing greater impact among science subject domains in RCT5: specifically, for which subject(s) we expect to see most impact among biology, physics, chemistry and Earth science classes, given the kinds of literacy practices commonly used in those subjects. Question: Are the predictions we make concerning conditions for impact for RCT5 borne out, thereby validating our hypothesis about the processes leading to overall and differential impacts across science domains and between science and other subject areas?

We ask the researchers who carried out RCT4 to, independently of each other, (a) make the predictions discussed in (4), above, and, in the conclusion, (b) give their summative conclusion, with a rationale, about whether RA has impact overall and greater impact for biology given the evidence thus far.

Setting

We use Cronbach’s framework for describing study characteristics and participants, summarized in Table 1 in the Appendix. All five RCTs assess impacts of RA on student literacy achievement at the high school level. The studies vary in terms of type of class: four cover biology, two history, two ELA, and one science subjects other than biology. Four of the five studies assess impact on students in regular classrooms. One of the studies (RCT3) focuses on students below grade level in reading. The five studies span multiple states, with schools in California and Pennsylvania used in more than one study.

Intervention / Program / Practice:

The program of study, Reading Apprenticeship, is an instructional framework that helps teachers support discipline-specific literacy and learning in their varied content areas by attending to four interacting dimensions of classroom learning culture: Social, Personal, Cognitive, and Knowledge-Building. At the center of the program is an ongoing metacognitive conversation carried on both internally, through metacognitive reading and reasoning routines, and externally, as teacher and students talk about their personal relationships to reading, the social environment and resources of the classroom, their affective responses and cognitive activity, and the knowledge required to make sense of complex texts. This takes place through extensive reading, including increased in-class opportunities for students to practice reading complex academic texts in more skillful ways as they collaborate to make meaning of these texts for learning purposes. The framework targets learning dispositions as well as literacy skills and knowledge. The inquiry-based professional development (PD) is designed to transform teachers' understanding of their role in adolescent literacy development and to build enduring capacity for literacy instruction in the academic disciplines. The PD model engages teachers in: (a) learning about the complexity of literacy and learning with disciplinary texts through experiential learning that mirrors the instructional environment and practices of the framework, (b) learning how the framework supports students’ literacy and learning, (c) practicing specific pedagogies, and (d) carrying out formative assessment focused on student reading, thinking and learning.

Research Design:

The details of the research designs are summarized in the right-most column of Table 1:
RCT1 (Greenleaf et al., 2009): School randomization, 23 in treatment (T), 22 in control (C). One year of exposure. Biology classes in 9th and 10th grades.
RCT2 (Greenleaf et al., 2011): School randomization, 22 in T, 18 in C. Impacts assessed two years after initial implementation. Biology classes in 9th and 10th grades, history in 11th.
RCT3 (Kemple et al., 2008): Student randomization, two cohorts; for the second cohort, 645 in T, 470 in C. Population of low-performing readers.
RCT4 (Citation not included for anonymous review.): School randomization, 22 in T, 20 in C. Impact assessed after two years of implementation. English, history and biology in grades 9-11.
RCT5 (Citation not included for anonymous review.): Teacher randomization, 35 in T, 34 in C. One year of exposure. Biology, chemistry, physics, earth and environmental science in high school.

Data Collection and Analysis:

The five RCTs were analyzed separately to obtain impact estimates. (We are the PIs for RCT4 and RCT5 and have student-level outcomes data; RCT1-RCT3 were analyzed by other researchers, and we are using summary statistics from their reports to conduct secondary analyses.) Results from impact analyses for RCT5 will be ready in December 2015.

Analysis 1: For the final paper we will synthesize results across the five RCTs using meta-analysis (Hedges and Olkin, 1985). In this proposal we report impacts of RA on student achievement across RCT1-RCT4 by subject area.

Analysis 2: To examine impacts of RA on literacy strategies promoted by the program, as well as differences between biology and non-biology classes in these impacts for RCT4, we assessed impacts on 12 dimensions (left-most column in Table 3). Cronbach alphas for the subscales ranged between .54 and .91, with a median value of .69.

Analysis 3: To explore a possible mechanism for differential impact in RCT4, we conducted an exploratory factor analysis (proc FACTOR in SAS with oblique PROMAX rotation) on the scales for the 12 dimensions (described under Analysis 2, above) to extract four factors to serve as potential mediators of impacts on achievement.
Given limited power to conduct a formal mediation analysis (Schochet, 2011), we used an exploratory two-stage approach to examine for which of the factors we observed (a) a significant overall or differential (across biology and non-biology courses) impact on the mediator and (b) a significant overall or differential association between the factor and the achievement outcome, conditional on baseline covariates. When both conditions were met, we flagged that processes mediating impact on achievement may be different for biology and non-biology classes.

Analysis 4: We asked both program developers and the researchers involved in RCT4 and RCT5 to predict results for RCT5, based on findings from RCT1-RCT4.

Findings / Results:

Result 1 (RCT1-RCT4): Impact estimates, standard errors, and effect sizes are reported in Table 2. Significant positive impacts on literacy skills were observed for RCT1 in biology and for RCT2 in biology and history classes. RCT1 did not meet WWC evidence standards because of high attrition and non-equivalence on pretest, and RCT2 had high attrition. RCT3 demonstrated positive impact on one of two dimensions of the literacy outcome domain for one of two cohorts, and for a special population of low-performing readers. RCT4 demonstrated positive impact on reading literacy in biology classes, but not in history or ELA classes, or overall.

Result 2: Table 3 shows impacts and differential impacts across biology and non-biology classes on literacy strategies and student activities for RCT4. Based on these exploratory findings, greater impact in biology may be attributable to use of a greater variety of text types, more practice of metacognitive inquiry and use of comprehension strategies by students, and a greater increase in teachers' confidence in their literacy instruction.

Result 3: Four factors were identified as potential mediators in RCT4.
Impact on use of RA strategies was greater in biology classes, and use of these strategies trended positively in its association with student achievement. RA also led to less use of traditional instructional materials, which were negatively associated with achievement, in biology. The results are shown in Table 4.

Result 4: The developer postulated for RCT4 that RA introduces a larger pedagogical shift for biology teachers than for ELA or history teachers (resulting in a greater contrast in instructional practice between RA and controls for biology). That is, biology classes provide greater opportunity for RA practices, as a new (RA) literacy dimension is introduced to content that is normally delivered through lecture, labs, manipulatives, and “hands-on” activities and textbooks as supplements. ELA and history are more lecture-based and less amenable to RA strategies. Prediction: RCT5 will produce positive impact across all four science subjects.

Researcher 1: Chemistry and physics involve lab work, but the text component is about understanding and applying formulas and principles and remembering facts. Biology, environmental and earth sciences allow greater potential for using a richer variety of text and for adoption of RA strategies. Prediction: RCT5 will produce positive impact across all four science subjects, and a greater impact for biology, earth and environmental science combined than for chemistry and physics combined.

(Predictions by other researchers in this project and science education specialists will be included in the final paper.)

Conclusions:

The ultimate goal of this work is to consider jointly several results from different RCTs of RA to support a recommendation about impact of RA generally and especially for biology.
Based on the results, we asked the program developers and several researchers to state their conclusions (we offer one conclusion here; more will be added to the final report):

Researcher 1: The internal validity of the findings from RCT1 and RCT2 is compromised by attrition, so the impact estimates are possibly biased and the results too optimistic. RCT3 shows positive impact, but for a specific population. RCT5 should be seen as a replication of the promising impact found for biology classes in RCT4. Therefore, a conclusion of impact, in biology classes only, is pending the result from RCT5. Speculation about reasons for differential impact favoring biology based on explored mediators will be validated if we observe differential impacts favoring biology, earth and environmental sciences (compared to chemistry and physics) with similar impacts on dimensions of instructional strategies. Given the positive impact on low readers in RCT3, impacts on this subpopulation should be examined across all of the RCTs, especially RCT5, as a potential replication exercise.

In the final paper, several researchers and the developers will provide additional rationales for predicting results for RCT5 and for drawing a summative conclusion about the efficacy of RA and the conditions and possible mechanisms for achieving impact. In this example, translation of research findings is a complex process that involves weighing the integrity of the research design (e.g., levels of attrition), plausible mechanisms, the role of conditions and moderators, and, perhaps most importantly, the replication of results, especially if predicted from earlier findings.

References:

Bloom, H., Hill, C. J., & Riccio, J. A. (2003). Linking program implementation and effectiveness: Lessons from a pooled sample of welfare-to-work experiments. Journal of Policy Analysis and Management, 22(4), 551–575.
Cook, T.D.
(2002). Randomized experiments in educational policy research: A critical examination of the reasons the educational evaluation community has offered for not doing them. Educational Evaluation and Policy Analysis, 24(3), 175–199.
Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. San Francisco, CA: Jossey-Bass.
Cronbach, L. J. and Associates (1980). Toward Reform of Program Evaluation. San Francisco, CA: Jossey-Bass.
Greenleaf, C., Hanson, T., Herman, J., Litman, C., Madden, S., Rosen, R., et al. (2009). Integrating literacy and science instruction in high school biology: Impact on teacher practice, student engagement, and student achievement. Arlington, VA: National Science Foundation.
Greenleaf, C., Hanson, T., Herman, J., Litman, C., Rosen, R., Schneider, S., et al. (2011). A study of the efficacy of Reading Apprenticeship Development for high school history and science teaching and learning. Washington, DC: Institute of Education Sciences.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press.
Hotz, V. J., Imbens, G., & Mortimer, J. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125, 241–270. Previous version available at: NBER Technical Working Paper #T0238.
Kemple, J., Corrin, W., Nelson, E., Salinger, T., Herrmann, S., & Drummond, K. (2008). The Enhanced Reading Opportunities Study: Early Impact and Implementation Findings (NCEE 2008-4015). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
Schochet, P. Z. (2009). An approach for addressing the multiple testing problem in social policy impact evaluations. Evaluation Review, 33, 539–567.
Schochet, P. Z. (2011).
Do typical RCTs of education interventions have sufficient statistical power for linking impacts on teacher practice and student achievement outcomes? Journal of Educational and Behavioral Statistics, 36(4), 441–471.
Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.
Tipton, E. (2014). How generalizable is your experiment? Comparing a sample and population through a generalizability index. Journal of Educational and Behavioral Statistics, 39(6), 478–501.
What Works Clearinghouse (WWC). (2014, March). What Works Clearinghouse Procedures and Standards Handbook (Version 3.0). Retrieved December 29, 2014 from http://ies.ed.gov/ncee/wwc/documentsum.aspx?sid=19

Appendix

Table 1: Characteristics of five randomized trials of Reading Apprenticeship

RCT 1* (NSF)
Participants: Biology in 9th and 10th grades
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: Delayed treatment (2 years), BAU
Outcome Measures: CST ELA, CST Reading Comprehension (State assessment)
Setting (location and time): California, 2005/06-2006/07
Design: Randomization (school N): Biology sample: Assigned 83 schools total; Retained: 23 T, 22 C (51 total in cross-sectional sample). Baseline equivalence for analysis sample: .52 sd (teacher averages of student scores for prior cohort of students). Exposure: One year of exposure of students to teachers in their second year of exposure.

RCT 2** (IES)
Participants: Biology (9th and 10th grades) and history (11th grade)
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: BAU
Outcome Measures: CST ELA, CST Reading Comprehension (State assessments) *recognized as not a well-aligned intervention for the study (p…)
Setting (location and time): California, 2006/07-2008/09
Design: Randomization (school N): Biology sample: Assigned 39 T, 39 C; Retained 14 T, 24 C. History sample: Assigned 45 T, 37 C; Retained 22 T, 18 C. Baseline equivalence for analysis sample: .11 sd (p = .32). Exposure: Impacts assessed two years after initial professional development implementation.

RCT 3***
Participants: 9th grade students
Treatment Variant: Treatment: RA used as a supplemental program (11 hours / month on average). Control: BAU
Outcome Measures: Comprehension (reading comprehension and vocabulary development), GRADE assessment
Setting (location and time): 17 high schools from 10 school districts in the U.S. / students reading 2 or more years below grade level; 2005/06, 2006/07
Design: Randomization (student N): Cohort 1: Assigned 686 T, 454 C; Attrition: total 30%, differential 6%. Cohort 2: Assigned 645 T, 470 C; Attrition: total 36%, differential 3%. Exposure: 7.5-9 months.

RCT 4****
Participants: English, history and biology in grades 9-11
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: BAU
Outcome Measures: ETS assessment of “reading literacy”
Setting (location and time): CA and PA, 2011/12-2013/14
Design: Randomization (school N): Assigned: 22 to T, 20 to C. Retained: Biology: 22 T, 18 C; ELA: 20 T, 19 C; History: 20 T, 19 C. Baseline equivalence for analysis sample: (N/A). Exposure: Impacts assessed two years after initial professional development implementation.

RCT 5****
Participants: Biology, physics, chemistry and earth and environmental science in grades 9-12
Treatment Variant: Treatment: Online version of RA. Control: BAU
Outcome Measures: ETS assessment of “reading literacy”
Setting (location and time): MI and PA, 2014/15 (mixed population of urban and rural schools)
Design: Randomization (teacher N), of teachers within schools: Overall: Assigned 41 T, 41 C; Retained 35 T, 34 C. Biology sample: Assigned 21 T, 14 C; Retained 17 T, 11 C. Chemistry sample: Assigned 8 T, 13 C; Retained 8 T, 10 C. Physics sample: Assigned 9 T, 7 C; Retained 7 T, 7 C. Earth / Environmental science: Assigned 3 T, 7 C; Retained 3 T, 6 C. Baseline equivalence for analysis sample: (N/A). Exposure: One year of exposure.

*This study does not meet WWC evidence standards due to high attrition and non-equivalence at baseline.
**This study has high attrition and, at best, meets WWC evidence standards with reservations.
***This study meets WWC evidence standards without reservations.
****These studies have not been reviewed by WWC, but based on levels of attrition, will meet WWC 3.0 evidence standards without reservations.
Note: The WWC review of Adolescent Literacy interventions addresses student outcomes in four domains: alphabetics, reading fluency, comprehension, and general literacy achievement.

Table 2: Impacts on reading achievement in five randomized trials of Reading Apprenticeship

RCT1
Biology: point estimate for impact .23; standardized effect size .23; p value .04
RCT2
Biology: impact .18a / .09b; p value .13a / .39b
History: impact .26a / .22b; p value .02a / .04b
RCT3
Comprehension: impact .09c / .14d; p value NS / <.05
Vocabulary: impact .05c / -0.04d; p value NS / NS
RCT4
Biology: point estimate for impact 0.30; standard error 0.12; standardized effect size .24; p value .02
ELA: point estimate for impact 0.12; standard error 0.12; standardized effect size .14; p value .30
History: point estimate for impact -0.08; standard error 0.13; standardized effect size -0.09; p value .51
Overall: point estimate for impact 0.11; standard error 0.09; standardized effect size .11; p value .21
RCT5 (N/A)
Biology, Chemistry, Physics, Earth Science, Overall: TBD (Dec, 2015)

a = ELA CST; b = Reading comprehension CST; c = Cohort 1; d = Cohort 2
Note: In RCT4 we observed no impact overall (across history, biology and ELA) but a differential impact favoring biology (t=2.41, p=.02).
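
As a rough illustration of the kind of heterogeneity check planned for the final synthesis, the sketch below computes an inverse-variance (fixed-effect) pooled estimate and Cochran's Q from the three RCT4 subject-level point estimates and standard errors in Table 2. It is not the model behind the t=2.41 reported in the note above. It treats the three estimates as independent, an assumption that does not hold here because the subjects were taught in the same randomized schools, and the function names are illustrative rather than taken from the studies.

```python
import math


def fixed_effect_summary(estimates, std_errors):
    """Return (pooled estimate, pooled SE, Cochran's Q, degrees of freedom)."""
    # Inverse-variance weights: more precise estimates count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Q sums the weighted squared deviations from the pooled estimate.
    q = sum(w * (est - pooled) ** 2 for w, est in zip(weights, estimates))
    return pooled, pooled_se, q, len(estimates) - 1


def chi2_upper_tail_df2(q):
    """Upper-tail chi-square probability; this closed form holds only for df = 2."""
    return math.exp(-q / 2.0)


# RCT4 rows of Table 2 (biology, ELA, history): point estimates and standard errors.
estimates = [0.30, 0.12, -0.08]
std_errors = [0.12, 0.12, 0.13]

pooled, pooled_se, q, df = fixed_effect_summary(estimates, std_errors)
p_value = chi2_upper_tail_df2(q)  # usable here because df == 2

print(f"pooled = {pooled:.3f} (SE {pooled_se:.3f}), Q = {q:.2f}, df = {df}, p = {p_value:.3f}")
```

With only three estimates, Q has little power, so a non-significant Q would not by itself rule out the biology versus non-biology contrast that the note reports.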

Table 3: The impact of Reading Apprenticeship on potential mediating processes for the sample as a whole and across biology and non-biology classes (RCT4)

Potential Mediator Use of a variety of text types Teachers instructing using metacognitive inquiry Teachers modeling using metacognitive inquiry Students practicing metacognitive inquiry Teachers instructing using comprehension strategies Teachers modeling using comprehension techniques Students practicing comprehension strategies Student engagement Average Impact Difference between biology and non-biology classes in impact Effect t-value size 0.53 1.80, .07** Effect size 0.08 t-value 0.05 0.34 0.07 0.31 2.04** 0.70 Impact for biology classes Effect size 0.35 t-value -0.56, .57 0.14 4.36**** 0.12 Impact for non-biology classes t-value 0.78 Effect size -0.06 -0.13 -0.39 0.04 0.20 0.48, .63 0.33 0.89 0.27 1.44 0.40 1.40, .17* 0.78 2.23*** 0.59 2.95**** 0.76 -0.07 -0.23, .82 0.09 0.22 0.09 0.44 0.36 2.18*** 0.28 0.48, .63 0.36 0.92 0.32 1.39* 0.92 5.70**** 0.33 1.14, .26 1.00 2.64*** 0.81 3.86**** 0.08 0.50 -0.23 -0.74, .46 0.15 0.48 0.18 0.86 0.48 -0.34
Teacher self-confidence in literacy instruction 0.59 3.35**** 0.63 2.03, .04*** 1.34 3.53**** 0.41 1.89**
Teachers fostering student independence 0.68 3.53**** 0.25 0.87, .39 0.50 1.41* 0.63 2.90****
Teachers use of traditional reading strategies 0.27 1.42* 0.19 0.61, .54 0.05 0.13 0.31 1.29
Levels of student collaboration 0.55 2.74**** -0.12 0.39, .70 0.01 0.01 0.65 2.82****

Note: *p<=.20, **p<.10, ***p<.05, ****p<.01
Note: To identify impacts we considered more-liberal criteria, given the exploratory nature of the analysis.
We flagged differential impact estimates with p<=.10, OR with substantively important effect sizes (>=.30), OR where we observed that impact for biology (or non-biology) was p<=.05 and the other p>.05.

Table 4: Factor-based mediation analysis results

Factor: Instructing RA strategies. Effect of RA: Overall positive. Association with student achievement: Negative overall.
Factor: Engaging students in RA strategies. Effect of RA: Favors biology. Association with student achievement: Positive trend for biology.
Factor: Confidently engaging students. Effect of RA: Favors biology. Association with student achievement: No association.
Factor: Use of textbooks and science related media. Effect of RA: Negative for non-biology. Association with student achievement: Negative for non-biology.
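
The two-stage screen described under Analysis 3 can be sketched as a simple rule: a factor is flagged only when there is a significant overall or differential impact on it and a significant overall or differential association between it and achievement. The sketch below is a minimal illustration under stated assumptions: the .05 threshold, the class and function names, and all p-values are hypothetical placeholders, not results or settings from RCT4.

```python
from dataclasses import dataclass

ALPHA = 0.05  # assumed threshold; the source does not state the level used for this screen


@dataclass
class FactorScreen:
    name: str
    p_impact_overall: float       # stage 1: impact of RA on the factor, all classes
    p_impact_differential: float  # stage 1: biology vs non-biology difference in that impact
    p_assoc_overall: float        # stage 2: factor-achievement association, given baseline covariates
    p_assoc_differential: float   # stage 2: biology vs non-biology difference in that association


def passes_two_stage_screen(factor, alpha=ALPHA):
    """Flag a factor only if both stages show an overall or differential effect."""
    stage1 = min(factor.p_impact_overall, factor.p_impact_differential) <= alpha
    stage2 = min(factor.p_assoc_overall, factor.p_assoc_differential) <= alpha
    return stage1 and stage2


# Hypothetical p-values, for illustration only.
factors = [
    FactorScreen("factor_A", 0.01, 0.03, 0.20, 0.04),  # both stages met
    FactorScreen("factor_B", 0.02, 0.40, 0.60, 0.70),  # stage 1 only
    FactorScreen("factor_C", 0.30, 0.50, 0.01, 0.02),  # stage 2 only
]

flagged = [f.name for f in factors if passes_two_stage_screen(f)]
print(flagged)
```

A screen of this kind only indicates where a formal mediation analysis might be worth pursuing; it does not estimate an indirect effect.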