abstract - Society for Research on Educational Effectiveness

Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple
RCTs of a Program to Support Decision-Making
Andrew P. Jaciw
Denis Newman
Val Lazarev
Boya Ma
Empirical Education Inc.
Background
Introduction:
Imagine the scenario where a school or district decision-maker is looking to results from
several well-conducted randomized controlled trials (RCTs) of a literacy program to understand its
potential to work for his or her specific context. Looking informally across the studies, there is
no clear answer. In most, but not all, cases, the impacts appear positive, possibly more so in
biology classes. Each experiment is situated in a specific context. Some of the trials were
conducted in ELA classes, others across several types of science classes. Also, the amount of
information for contextualizing each trial varies. How might the decision-maker make sense of
these results?
In this work we take a mixed methods approach to understand the potential reach of
causal effects from five RCTs of the Reading Apprenticeship (RA) program. A critical aspect of
translating research findings is helping decision-makers to address the complexity of
results. We use four types of research results to reach summative conclusions about the efficacy
of the program: (1) we summarize and synthesize impact findings for the five studies and
formally test for effect heterogeneity, (2) we use mediation analysis to explore a plausible
mechanism for impact using results from one RCT, (3) we explore impacts on instructional
strategies using results from one RCT, and (4) we posit several hypotheses concerning conditions
for impact and validate them through a prediction exercise. To provide focus to the work, we
narrow the scenario: we consider the problem of deciding whether the program has a positive
impact overall and whether it is especially effective in biology classes, and we investigate
possible mechanisms that could explain a greater impact in biology. Steps (1)-(4), above, lead to
conclusions about impact overall and in biology classes. The work serves as an example of a
systematic approach to translating research to support decision-making by consumers of research,
specifically decisions about whether to adopt the program.
Prior Research
The Institute of Education Sciences (IES) has provided us with research standards and
tools for assessing the statistical conclusions and internal validity of results from experiments
(e.g., Schochet, 2009; WWC, 2014). Understanding the reach of causal inferences, especially
translating results from experiments to individual contexts, also requires consideration of external
validity. While there is no WWC-like guide for judging the validity of generalized inferences,
there are several established approaches to building warrants for claims of external validity (Cook,
2002; Cronbach, 1982; Shadish, Cook, & Campbell, 2002; Tipton, 2014). Critical to these
approaches is variation in program impact and methods of accounting for differences in impact.
Variation in program impact may be due to factors that we can observe and measure, including
differences in the populations being treated, in how treatment components are distributed, and in
economic conditions (Bloom, Hill, & Riccio, 2003; Hotz, Imbens, & Mortimer, 2005). Cronbach
(1982) and Cronbach et al. (1980) were particularly concerned with this question. They
recognized that a decision-maker is likely to consider not just estimates of marginal impacts, but
finer-grained information about contexts and potential mechanisms when deciding whether
program effects are likely to extrapolate to their own contexts. They also provided a useful
framework for situating experimental results. Specifically, they considered differences in
participants, treatment variants, the role of outcome measures, and interactions of treatment with
setting. Translating impact findings from several randomized trials of the same (or very similar)
interventions to inform a single decision (“whether to adopt under specific conditions - yes or no”)
requires weighing different kinds of evidence. In the current study, we consider three past RCTs
and two current experiments of the same program (we have just completed analysis of one, and
for the other, impacts will be analyzed by December 2015). We situate the results using the
framework by Cronbach described above, and then, through a series of exploratory analyses,
attempt to unpack the findings to answer a question that a hypothetical decision-maker would
ask: Does the program work, and can we claim that it works especially well in biology
classes?
Purpose / Objective / Research Question / Focus of Study
Our research questions reflect the mixed-methods approach: we consider both coarser
findings across studies and more granular results within individual experiments, leading to a
final determination of whether there is evidence of an overall impact of RA as well as a
differential impact favoring biology classes. Specifically:
1. Is there an impact of RA on reading literacy across five experiments and several content
domains: ELA, history, and science (biology, physics, chemistry, Earth science)? Does the program
have a positive impact generally, and in biology classes specifically? (For the final study we will
use meta-analysis to synthesize results across all five experiments and run a formal test of the
heterogeneity of impact. For this proposal we provide a table of impact estimates and their
standard errors for the results available so far.)
2. Is there evidence of differential impact between biology and non-biology classes in the use of
literacy strategies supported by RA? (Based on results from RCT4.)
3. Is there evidence that impacts of RA occur through different mediating processes for biology
compared to the other subject areas? (Based on results from RCT4.)
4. Answers to (1)-(3) should tell us not just whether we have observed greater impact in biology,
but should also allow us to hypothesize a mechanism: the mediating and instructional processes in
biology classes that potentially facilitate greater impact than in other subjects. In turn, we can
use this information to predict conditions for observing greater impact among science subject
domains in RCT5: specifically, for which subject(s) we expect to see the most impact among biology,
physics, chemistry, and Earth science classes, given the kinds of literacy practices commonly used
in those subjects. Question: Are the predictions we make concerning conditions for impact in RCT5
borne out, thereby validating our hypothesis about the processes leading to overall and
differential impacts across science domains and between science and other subject areas?
We ask the researchers who carried out RCT4 to, independently of each other, (a) make the
predictions discussed in (4), above, and, in the conclusion, (b) give their summative conclusion,
with a rationale, about whether RA has an impact overall and a greater impact for biology given the
evidence thus far.
Setting
We use Cronbach’s framework for describing study characteristics and participants,
summarized in Table 1 in the Appendix. All five RCTs assess impacts of RA on student literacy
achievement at the high school level. The studies vary in terms of type of class: four cover
biology, two history, two ELA, and one covers science subjects other than biology. Four of the five
studies assess impact on students in regular classrooms. One of the studies (RCT3) focuses on
students below grade level in reading. The five studies span multiple states, with schools in
California and Pennsylvania used in more than one study.
Intervention / Program / Practice:
The program of study – Reading Apprenticeship – is an instructional framework that
helps teachers support discipline-specific literacy and learning in their varied content areas by
attending to four interacting dimensions of classroom learning culture: Social, Personal,
Cognitive, and Knowledge-Building. At the center of the program is an ongoing metacognitive
conversation carried on both internally through metacognitive reading and reasoning routines
and externally, as teacher and students talk about their personal relationships to reading, the
social environment and resources of the classroom, their affective responses and cognitive activity,
and the knowledge required to make sense of complex texts. This takes place through extensive
reading including increased in-class opportunities for students to practice reading complex
academic texts in more skillful ways as they collaborate to make meaning of these texts for
learning purposes. The framework targets learning dispositions as well as literacy skills and
knowledge. The inquiry-based professional development is designed to transform teachers'
understanding of their role in adolescent literacy development and to build enduring capacity for
literacy instruction in the academic disciplines. The inquiry-based PD model engages teachers
in: (a) learning about the complexity of literacy and learning with disciplinary texts through
experiential learning that mirrors the instructional environment and practices of the framework,
(b) learning how the framework supports students’ literacy and learning, (c) practicing specific
pedagogies, and (d) carrying out formative assessment focused on student reading, thinking and
learning.
Research Design: The details of the research designs are summarized in the right-most column of
Table 1.
RCT1: (Greenleaf et al., 2009): School randomization, 23 in treatment (T), 22 in control
(C). One year of exposure. Biology classes in 9th and 10th grades.
RCT2: (Greenleaf et al., 2011): School randomization, 22 in T, 18 in C. Impacts assessed two
years after initial implementation. Biology classes in 9th and 10th grades, history in 11th.
RCT3: (Kemple et al., 2008): Student randomization, two cohorts, for the second cohort 645 in T,
470 in C. Population of low-performing readers.
RCT4: (Citation not included for anonymous review.): School randomization, 22 in T, 20 in C.
Impact assessed after two years of implementation. English, history and biology in grades 9-11.
RCT5: (Citation not included for anonymous review.): Teacher randomization, 35 in T, 34 in C. One
year of exposure. Biology, chemistry, physics, earth and environmental science in high school.
Data Collection and Analysis:
The five RCTs were analyzed separately to obtain impact estimates. (We are the PIs for
RCT4 and RCT5 and have student-level outcomes data; RCT1-RCT3 were analyzed by other
researchers and we are using summary statistics from their reports to conduct secondary
analyses.) Results from impact analyses for RCT5 will be ready in December 2015.
Analysis 1: For the final paper we will synthesize results across the five RCTs using
meta-analysis (Hedges and Olkin, 1985). In this proposal we report impacts of RA on student
achievement across RCT1-RCT4 by subject area.
Analysis 2: To examine impacts of RA on literacy strategies promoted by the program, as well as
differences between biology and non-biology classes in these impacts for RCT4, we assessed
impacts on 12 dimensions (left-most column in Table 3). Cronbach’s alphas for the subscales
ranged between .54 and .91, with a median value of .69.
Analysis 3: To explore a possible mechanism for differential impact in RCT4, we conducted an
exploratory factor analysis (PROC FACTOR in SAS with oblique PROMAX rotation) on the scales for
the 12 dimensions (described under Analysis 2, above) to extract four factors to serve as
potential mediators of impacts on achievement. Given limited power to conduct a formal mediation
analysis (Schochet, 2011), we used an exploratory two-stage approach to examine for which of the
factors we observed (a) a significant overall or differential (across biology and non-biology
courses) impact on the mediator and (b) a significant overall or differential association between
the factor and the achievement outcome conditional on baseline covariates. When both conditions
were met, we flagged that processes mediating impact on achievement may be different for biology
and non-biology classes.
Analysis 4: We asked both program developers and the researchers involved in RCT4 and RCT5 to
predict results for RCT5, based on findings from RCT1-RCT4.
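The subscale reliabilities reported under Analysis 2 are Cronbach’s alpha coefficients. As a minimal sketch of how such a coefficient is computed (this is not the authors' code; the function name and the response data are hypothetical and chosen only for illustration), alpha compares the sum of the item variances with the variance of the total score:

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a subscale.

    items: list of item-score lists, one inner list per item, with
    respondents in the same order in every inner list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent (sum across items)
    totals = [sum(scores) for scores in zip(*items)]
    # Sum of the sample variances of the individual items
    item_var_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))


# Hypothetical data: 3 survey items answered by 5 teachers
responses = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values nearer 1 indicate that the items of a subscale move together; the reported range of .54 to .91 therefore spans subscales of quite different internal consistency.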
Findings / Results:
Result 1 (RCT1-RCT4): Impact estimates, standard errors, and effect sizes are reported in
Table 2. Significant positive impacts on literacy skills were observed for RCT1 in biology and
for RCT2 in biology and history classes. RCT1 did not meet WWC evidence standards because of
high attrition and non-equivalence on the pretest, and RCT2 had high attrition. RCT3 demonstrated
a positive impact on one of two dimensions of the literacy outcome domain for one of two cohorts,
and for a special population of low-performing readers. RCT4 demonstrated a positive impact on
reading literacy in biology classes, but not in history or ELA classes, or overall.
Result 2: Table 3 shows impacts and differential impacts across biology and non-biology classes on
literacy strategies and student activities for RCT4. Based on these exploratory findings, greater
impact in biology may be attributable to use of a greater variety of text types, more practice of
metacognitive inquiry and use of comprehension strategies by students, and a greater increase in
teachers’ confidence in their literacy instruction.
Result 3: Four factors were identified as potential mediators in RCT4. Impact on use of RA
strategies was greater in biology classes, and use of these strategies trended positively in its
association with student achievement. In addition, RA led to less use of traditional instructional
materials, which were negatively associated with achievement, in biology. The results are shown in
Table 4.
Result 4: The developer postulated for RCT4 that RA introduces a larger pedagogical shift for
biology teachers than for ELA or history teachers (resulting in a greater contrast in instructional
practice between RA and controls for biology). That is, biology classes provide greater opportunity
for RA practices because a new (RA) literacy dimension is introduced to content that is normally
delivered through lecture, labs, manipulatives, and “hands-on” activities, with textbooks as
supplements. ELA and history are more lecture-based and less amenable to RA strategies. Prediction:
RCT5 will produce positive impact across all four science subjects.
Researcher 1: Chemistry and physics involve lab work, but the text component is about
understanding and applying formulas and principles and remembering facts. Biology, environmental,
and earth sciences allow greater potential for using a richer variety of text and for adopting RA
strategies. Prediction: RCT5 will produce positive impact across all four science subjects, and a
greater impact for biology, earth and environmental science combined than for chemistry and physics
combined. (Predictions by other researchers in this project and science education specialists will
be included in the final paper.)
Conclusions:
The ultimate goal of this work is to consider jointly several results from different RCTs
of RA to support a recommendation about the impact of RA generally and especially for biology.
Based on the results, we asked the program developers and several researchers to state their
conclusions (we offer one conclusion here; more will be added to the final report):
Researcher 1: The internal validity of the findings from RCT1 and RCT2 is compromised by
attrition, with impact estimates possibly biased and the results too optimistic. RCT3 shows
positive impact, but for a specific population. RCT5 should be seen as a replication of the
promising impact found for biology classes in RCT4. Therefore, a conclusion of impact, in biology
classes only, is pending the results from RCT5. Speculation about reasons for differential impact
favoring biology, based on the explored mediators, will be validated if we observe differential
impacts favoring biology, earth and environmental sciences (compared to chemistry and physics)
along with similar impacts on dimensions of instructional strategies. Given the positive impact on
low-performing readers in RCT3, impacts on this subpopulation should be examined across all of the
RCTs, especially RCT5, as a potential replication exercise.
In the final paper, several researchers and the developers will provide additional
rationales for predicting results for RCT5 and for drawing a summative conclusion about the
efficacy of RA and the conditions and possible mechanisms for achieving impact. As this example
shows, translation of research findings is a complex process that involves weighing the integrity
of the research design (e.g., levels of attrition), plausible mechanisms, the role of conditions
and moderators, and, perhaps most importantly, the replication of results, especially if predicted
from earlier findings.
References:
Bloom, H., Hill, C. J., & Riccio, J. A. (2003). Linking program implementation and
effectiveness: Lessons from a pooled sample of welfare-to-work experiments. Journal of
Policy Analysis and Management, 22(4), 551–575.
Cook, T. D. (2002). Randomized experiments in educational policy research: A critical
examination of the reasons the educational evaluation community has offered for not
doing them. Educational Evaluation and Policy Analysis, 24(3), 175–199.
Cronbach, L.J. (1982). Designing Evaluations of Educational and Social Programs. San
Francisco, CA: Jossey-Bass.
Cronbach, L.J. and Associates (1980). Toward Reform of Program Evaluation. San Francisco,
CA: Jossey-Bass.
Greenleaf, C., Hanson, T., Herman, J., Litman, C., Madden, S., Rosen, R., et al. (2009).
Integrating literacy and science instruction in high school biology: Impact on teacher
practice, student engagement, and student achievement. Arlington, VA: National Science
Foundation.
Greenleaf, C., Hanson, T., Herman, J., Litman, C., Rosen, R., Schneider, S., et al. (2011). A
study of the efficacy of Reading Apprenticeship Development for high school history and
science teaching and learning. Washington, DC: Institute of Education Sciences.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press.
Hotz, V. J., Imbens, G., & Mortimer, J. (2005). Predicting the efficacy of future training
programs using past experiences at other locations. Journal of Econometrics, 125 241 –
270. Previous version available at: NBER Technical Working Paper #T0238.
Kemple, J., Corrin, W., Nelson, E., Salinger, T., Herrmann, S., and Drummond, K.
(2008). The Enhanced Reading Opportunities Study: Early Impact and Implementation
Findings (NCEE 2008-4015). Washington, DC: National Center for Education
Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of
Education.
Schochet, P. Z. (2009). An approach for addressing the multiple testing problem in social policy
impact evaluations. Evaluation Review, 33, 539–567.
Schochet, P. Z. (2011). Do typical RCT’s of education interventions have sufficient statistical
power for linking impacts on teacher practice and student achievement outcomes?
Journal of Educational and Behavioral Statistics, 36 (4), 441-471.
Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Boston, MA: Houghton Mifflin.
Tipton, E. (2014). How generalizable is your experiment? Comparing a sample and population
through a generalizability index. Journal of Educational and Behavioral Statistics,
39(6): 478 – 501.
What Works Clearinghouse (WWC). (2014, March). What Works Clearinghouse Procedures
and Standards Handbook (Version 3.0). Retrieved December 29, 2014 from
http://ies.ed.gov/ncee/wwc/documentsum.aspx?sid=19
Appendix
Table 1: Characteristics of five randomized trials of Reading Apprenticeship
RCT 1* (NSF)
Participants: Biology in 9th and 10th grades
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: Delayed treatment (2 years), BAU
Outcome Measures: CST ELA, CST Reading Comprehension (State assessment)
Setting (location and time): California, 2005/06-2006/07
Design:
Randomization (school N): Biology sample. Assigned 83 schools total; Retained: 23 T, 22 C (51 total in cross-sectional sample)
Baseline equivalence for analysis sample: .52 sd (teacher averages of student scores for prior cohort of students)
Exposure: One year of exposure of students to teachers in their second year of exposure.

RCT 2** (IES)
Participants: Biology (9th and 10th grades) and history (11th grade)
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: BAU
Outcome Measures: CST ELA, CST Reading Comprehension (State assessments) *recognized as not a well-aligned intervention for the study (p…)
Setting (location and time): California, 2006/07 – 2008/09
Design:
Randomization (school N): Biology sample: Assigned 39 T, 39 C; Retained 14 T, 24 C. History sample: Assigned 45 T, 37 C; Retained 22 T, 18 C
Baseline equivalence for analysis sample: .11 sd (p = .32)
Exposure: Impacts assessed two years after initial professional development implementation.

RCT 3***
Participants: 9th grade students
Treatment Variant: Treatment: RA used as a supplemental program (11 hours / month on average). Control: BAU
Outcome Measures: Comprehension (reading comprehension and vocabulary development), GRADE assessment
Setting (location and time): 17 high schools from 10 school districts in the U.S. / students reading 2 or more years below grade level; 2005/06, 2006/07
Design:
Randomization (student N): Cohort 1: Assigned: 686 T, 454 C; Attrition: total 30%, differential 6%. Cohort 2: Assigned: 645 T, 470 C; Attrition: total 36%, differential 3%
Exposure: 7.5-9 months

RCT 4****
Participants: English, history and biology in grades 9-11
Treatment Variant: Treatment: Traditional RA implementation (10 days of training spread over an academic year). Control: BAU
Outcome Measures: ETS assessment of “reading literacy”
Setting (location and time): CA and PA, 2011/12-2013/14
Design:
Randomization (school N): Assigned: 22 to T, 20 to C. Retained: Biology: 22 T, 18 C; ELA: 20 T, 19 C; History: 20 T, 19 C
Baseline equivalence for analysis sample: (N/A)
Exposure: Impacts assessed two years after initial professional development implementation.

RCT 5****
Participants: Biology, physics, chemistry and earth and environmental science in grades 9-12
Treatment Variant: Treatment: Online version of RA. Control: BAU
Outcome Measures: ETS assessment of “reading literacy”
Setting (location and time): MI and PA, 2014/15 (mixed population of urban and rural schools)
Design:
Randomization (teacher N), of teachers within schools. Overall: Assigned 41 T, 41 C / Retained: 35 T, 34 C. Biology sample: Assigned 21 T, 14 C / Retained 17 T, 11 C. Chemistry sample: Assigned 8 T, 13 C / Retained: 8 T, 10 C. Physics sample: Assigned 9 T, 7 C / Retained: 7 T, 7 C. Earth / Environmental science: Assigned 3 T, 7 C / Retained: 3 T, 6 C
Baseline equivalence for analysis sample: (N/A)
Exposure: One year of exposure.

*This study does not meet WWC evidence standards due to high attrition and non-equivalence at baseline
** This study has high attrition and, at best, meets WWC evidence standards with reservations
***This study meets WWC evidence standards without reservations
****These studies have not been reviewed by WWC, but based on levels of attrition, will meet WWC 3.0 evidence standards without reservations
Note: The WWC review of Adolescent Literacy interventions addresses student outcomes in four domains: alphabetics, reading fluency, comprehension, and
general literacy achievement.
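The attrition figures behind the footnotes above can be illustrated with the assigned and retained counts listed for RCT5 (41 T and 41 C assigned; 35 T and 34 C retained). The sketch below is an illustration only: the function name is hypothetical, it works at the level of randomized teachers, and it does not attempt to reproduce the WWC attrition standard.

```python
def attrition_rates(assigned_t, retained_t, assigned_c, retained_c):
    """Return (overall attrition, differential attrition) as proportions."""
    rate_t = 1 - retained_t / assigned_t  # attrition in the treatment group
    rate_c = 1 - retained_c / assigned_c  # attrition in the control group
    overall = 1 - (retained_t + retained_c) / (assigned_t + assigned_c)
    differential = abs(rate_t - rate_c)
    return overall, differential


# RCT5 overall teacher counts from Table 1
overall, differential = attrition_rates(41, 35, 41, 34)
print(f"overall={overall:.1%} differential={differential:.1%}")
```

For these counts, overall attrition is 13 of 82 teachers and the two arms differ by one teacher out of 41.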
Table 2: Impacts on reading achievement in five randomized trials of Reading Apprenticeship
Study / Subject or Outcome    Point Estimate    Standard    Standardized     p value
                              for Impact        Error       Effect Size
RCT 1
  Biology                     .23               N/A         .23              .04 /
RCT 2
  Biology                                                   .18a / .09b      .13a / .39b
  History                                                   .26a / .22b      .02a / .04b
RCT 3
  Comprehension                                             .09c / .14d      NS / <.05
  Vocabulary                                                .05c / -0.04d    NS / NS
RCT4
  Biology                     0.30              0.12        .24              .02
  ELA                         0.12              0.12        .14              .30
  History                     -0.08             0.13        -0.09            .51
  Overall                     0.11              0.09        .11              .21
RCT5
  Biology, Chemistry, Physics, Earth Science, Overall: TBD (Dec, 2015)
a=ELA CST
b=Reading comprehension CST
c=Cohort 1
d=Cohort 2
Note: In RCT4 we observed no impact overall (across history, biology and ELA) but a differential impact favoring biology (t=2.41, p=.02).
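As a rough illustration of the kind of synthesis and heterogeneity test planned for the final paper, the sketch below pools estimates with inverse-variance (fixed-effect) weights and computes Cochran's Q and I². It is an example under stated assumptions, not the authors' analysis: the function name is hypothetical, the inputs are the three RCT4 subject-level point estimates and standard errors from Table 2, and those estimates come from the same schools, so they are not independent as the method assumes.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance pooled estimate with Cochran's Q and I^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    # Q: weighted squared deviations of each estimate from the pooled value
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of variation beyond what sampling error alone would produce
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, df, i2


# RCT4 point estimates and standard errors (biology, ELA, history), Table 2
estimates = [0.30, 0.12, -0.08]
std_errors = [0.12, 0.12, 0.13]
pooled, pooled_se, q, df, i2 = pool_fixed_effect(estimates, std_errors)
print(f"pooled={pooled:.3f} se={pooled_se:.3f} Q={q:.2f} df={df} I2={i2:.2f}")
```

Under homogeneity, Q is referred to a chi-square distribution with k - 1 degrees of freedom, where k is the number of estimates; a large Q relative to that distribution signals effect heterogeneity.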
Table 3: The impact of Reading Apprenticeship on potential mediating processes for the sample as a whole and across biology and
non-biology classes (RCT4)
Potential Mediator
Use of a variety of text types
Teachers instructing using
metacognitive inquiry
Teachers modeling using
metacognitive inquiry
Students practicing metacognitive
inquiry
Teachers instructing using
comprehension strategies
Teachers modeling using
comprehension techniques
Students practicing comprehension
strategies
Student engagement
Average Impact
Difference between
biology and non-biology
classes in impact
Effect
t-value
size
0.53
1.80, .07**
Effect
size
0.08
t-value
0.05
0.34
0.07
0.31
2.04**
0.70
Impact for biology
classes
Effect
size
0.35
t-value
-0.56, .57
0.14
4.36****
0.12
Impact for non-biology
classes
t-value
0.78
Effect
size
-0.06
-0.13
-0.39
0.04
0.20
0.48, .63
0.33
0.89
0.27
1.44
0.40
1.40, .17*
0.78
2.23***
0.59
2.95****
0.76
-0.07
-0.23, .82
0.09
0.22
0.09
0.44
0.36
2.18***
0.28
0.48, .63
0.36
0.92
0.32
1.39*
0.92
5.70****
0.33
1.14, .26
1.00
2.64***
0.81
3.86****
0.08
0.50
-0.23
-0.74, .46
0.15
0.48
0.18
0.86
0.48
-0.34
Teacher self-confidence in literacy instruction    0.59    3.35****    0.63    2.03, .04***    1.34    3.53****    0.41    1.89**
Teachers fostering student independence    0.68    3.53****    0.25    0.87, .39    0.50    1.41*    0.63    2.90****
Teachers use of traditional reading strategies    0.27    1.42*    0.19    0.61, .54    0.05    0.13    0.31    1.29
Levels of student collaboration    0.55    2.74****    -0.12    0.39, .70    0.01    0.01    0.65    2.82****
Note: *p<=.20, **p<.10, ***p<.05, ****p<.01
Note: To identify impacts we considered more-liberal criteria, given the exploratory nature of the analysis. We flagged differential impact
estimates with p<=.10, OR with substantively important effect sizes (>=.30), OR where we observed that impact for biology (or non-biology) was
p<=.05 and the other p>.05.
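The flagging rule in the note above can be written as a small predicate. This is a sketch under assumptions rather than the authors' procedure: the function and parameter names are hypothetical, and the effect-size criterion is read as a magnitude of at least .30. The differential effect sizes and p values in the usage lines are taken from Table 3 (teacher self-confidence; traditional reading strategies), while the subgroup p values are hypothetical because Table 3 reports only t-values and significance stars for the subgroups.

```python
def flag_differential_impact(diff_p, diff_es, bio_p, nonbio_p, es_cutoff=0.30):
    """Return True if any of the three exploratory criteria is met."""
    # Criterion 1: differential impact estimate with p <= .10
    meets_p = diff_p <= 0.10
    # Criterion 2 (assumption): differential effect size of at least .30 in magnitude
    meets_es = abs(diff_es) >= es_cutoff
    # Criterion 3: exactly one of the two subgroup impacts has p <= .05
    one_group_only = (bio_p <= 0.05) != (nonbio_p <= 0.05)
    return meets_p or meets_es or one_group_only


# Teacher self-confidence: differential ES 0.63, p = .04 (subgroup p values hypothetical)
confidence_flag = flag_differential_impact(0.04, 0.63, bio_p=0.001, nonbio_p=0.07)
# Traditional reading strategies: differential ES 0.19, p = .54 (subgroup p values hypothetical)
traditional_flag = flag_differential_impact(0.54, 0.19, bio_p=0.90, nonbio_p=0.21)
print(confidence_flag, traditional_flag)
```

Because the criteria are joined by OR, a row can be flagged on effect size alone even when its p value is well above .10, which is what makes the rule liberal.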
Table 4: Factor based mediation analysis results
Factors                                       Effect of RA                Association with student achievement
Instructing RA strategies                     Overall positive            Negative overall
Engaging students in RA strategies            Favors Biology              Positive trend for Biology
Confidently engaging students                 Favors Biology              No association
Use of textbooks and science related media    Negative for Non-biology    Negative for Non-biology