“The private history of a campaign that failed”:
Possible lessons from a Three-Year School Effectiveness
Randomized Control Trial
Eugene Schaffer
University of Maryland—Baltimore County
[email protected]
Sam Stringfield
University of Louisville
[email protected]
A presentation at the International Congress for School Effectiveness and Improvement,
Limassol, Cyprus, January 6, 2011
Objective
Our objective is to present the design, implementation, outcomes, and possible policy
implications of a three-year randomized control trial (RCT) of “effective schools correlates” in
elementary schools serving relatively high concentrations of high-poverty students1.
Background
For over 40 years, school effectiveness researchers have sought to identify “correlates”
of school effectiveness. Their methods have ranged from case studies to hierarchical linear
modeling (HLM). These efforts
have produced laudable progress (Teddlie & Reynolds, 2000; Townsend, 2007; Weber 1971).
For over 70 years in the U.S.—and perhaps longer elsewhere—well regarded educational
scholars and researchers have attempted to demonstrate that they could produce measurable
gains in student achievement (Nunnery, 1998; Stringfield & Teddlie, in press). This enterprise
has faced a consistently daunting set of challenges, ranging from process and product definitions
and instrumentation to The Implementation Gap (Supovitz & Weinbaum, 2008). Fortunately,
the urge toward documented educational reform is strong, and a broad range of persons has
periodically documented success. Perhaps the best documented of these is Success for All (SFA,
Slavin & Madden, 2000; Borman et al., 2007). However, Borman, Hewes, Overman, and Brown
(2003) conducted an extensive meta-analysis of over two dozen whole school reform designs and
hundreds of studies in the United States, and found that most lacked empirically strong evidence
of their effectiveness.
The juxtaposition of stable correlates with often-frustrated change efforts—nowhere
more clearly discussed than at ICSEI—raises the potentially troubling specter taught to first-year
graduate students, that “correlation is not causation.” Perhaps the “correlates” are real but not
useful as levers, or perhaps they can be among the levers yet are necessary but not
sufficient. What has been needed is a series of methodologically rigorous studies testing
whether school effectiveness correlates can actually change schools’ levels of effectiveness.
The Effective Schools for the 21st Century (ES-21) project
The purpose of ES-21 was to test the ability of seven school effectiveness correlates
(Taylor & Bullard, 1995) to improve the academic achievement of students attending relatively
1
Mark Twain is considered by some to have been both America’s finest humorist and finest
novelist. He was a young man when America’s Civil War broke out, and for less than a week he was a
member of a Southern militia. Twenty years after the war he was asked by a magazine to write an
almost certainly humorous “private history” of his military “service.” In 1885 he authored “The private
history of a campaign that failed.” It remains a classic short story in the American store of literature,
and its title serves as a “jumping off point” for a reflection on the study to be presented at ICSEI.
high poverty schools in the United States. The remainder of this paper will describe the research
methods, the intervention, the study’s results, and possible conclusions.
Design
The overarching design for ES-21 was a “gold standard” randomized control trial (RCT)
effectiveness study. As described by Shadish, Cook & Campbell (2002), proactive random
assignment “reduced the plausibility of alternative explanations for observed effects” (p. 247).
For several years in the early evolution of the federal “What Works Clearinghouse” (WWC,
http://ies.ed.gov/ncee/wwc/) a substantial priority was placed on randomization of units of
analysis2.
Having seen the challenges faced by Robert Slavin and others in attempting to create a
truly RCT design for a federally funded study of Success for All (Borman, Slavin, Cheung,
Chamberlain, Madden, & Chambers, 2007), and Cook’s RCT and near-RCT studies of the Comer
School Development Program (Cook, Habib, Phillips, Settersten, Shagle, & Degirmencioglu,
1999; Cook, Murphy, & Hunt, 2000), the team was curious and guardedly optimistic about
conducting a school effectiveness RCT.
The unit of analysis was the school. Power analyses indicated that the project would
need at least 30 schools (15 experimental and 15 control) to have a reasonable probability of
finding significant differences where significant differences existed, so the design called for
random assignment of four sets of four experimental and four controls in a total of four states
(with apologies to T.S. Eliot’s “Four Quartets”).
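The paper does not report the details of that power analysis, but its basic logic can be sketched with a standard normal-approximation formula for a two-group comparison. This is an illustrative sketch only; the project's actual analysis may have accounted for clustering, covariates, and other design features.

```python
from statistics import NormalDist

def min_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size (Cohen's d) for a
    two-group comparison with n units (here, schools) per group,
    using the standard normal approximation."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * (2.0 / n_per_group) ** 0.5

# With 15 schools per condition, only quite large school-level
# effects are detectable at 80% power:
print(round(min_detectable_effect(15), 2))  # → 1.02
```

The calculation illustrates why a minimum of roughly 30 schools was needed: with fewer units per condition, only implausibly large school-level effects would be statistically detectable.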
Though the federal Institute of Education Sciences (IES) had not made the distinction at
the time of the proposal to conduct this study (i.e., early 2004), today IES describes such an
ambitious project as a “Goal Four” or “effectiveness” trial. IES effectiveness trials assume a fully
developed and previously implemented intervention, in which the majority of funds go to the
research effort, not to development. IES funds effectiveness trials at up to $6 million over
five years (IES, 2009), more than quadruple the level of funding of ES-21.
Sampling Frame and Final Sample
The primary conditions for the sampling frame were as follows:
2
While the WWC retains an understandable preference for RCTs, it has evolved an increased tolerance for
carefully designed and conducted quasi-experiments. In a keynote speech at the Society for Research on
Educational Effectiveness, Tony Bryk declared that he never again wanted to hear the phrases “Randomized
Control Trial” and “Gold Standard” uttered in the same paragraph. It would appear that the primacy of the RCT is
waning.
a. A key issue in creating a “gold standard” true experiment is the random assignment of
units (in ES-21, schools) to experimental and control conditions (see Shadish, Cook, &
Campbell, 2002, chapters 8 and 9). In the proposal to the Olin Foundation, we proposed
conducting such a true experiment. Hence, the first sampling requirement was that the
LEAs and schools potentially involved had to agree to forward twice as many schools as
would eventually be provided services, and to let the research team either flip a coin or
consult a table of random numbers to choose experimental and control schools.
b. Given the power analysis’ identification of a minimum school sample size of 30, the team
proposed obtaining participation from 32 schools set as four groups of four experimental
and four controls in each of our districts (the experimental sample would thereby be
“Four Quartets”).
c. The schools were to come from four states. This was to minimize the likelihood that
changes in policy in any one state would dramatically affect results in the overall ES-21
study. (For an example of such disruption, see Datnow, et al., 2003).
d. To the extent practical, all schools in any one state were to come from the same LEA.
The goal was to ensure that all schools, experimental and control, within each state
operated in the same local policy environment. When the team was unable to fill the
frame with eight relatively high poverty schools per district, an exception was made for
two small LEAs in South Carolina. The two LEAs offered four relatively high poverty
schools each that met the criteria. In that case, the matched controls for each
experimental school did come from their same LEA.
e. The effective schools movement and research base have both had a strong equity focus,
and the sample was drawn with a conscious effort to represent that perspective. To the
extent possible, all schools were to serve student populations that were above their LEA’s
average for poverty (measured by percentages of students qualifying for the federal free
and reduced-price meals program, FARM).
f. At the student level, the ES-21 sample consisted of all of the students in the second and
third grades of all experimental and control schools at the beginning of the project. These
samples were followed for three years. Students who left the experimental or control
schools were dropped from the sample, and the sample was not refreshed with entering
students. (Data on initial and final sample characteristics are presented later in this
report.)
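The sampling rule in item (f), following the initial cohorts and dropping leavers without refreshing the sample, amounts to keeping only students with scores in every testing wave. A minimal sketch (the record layout and field names are hypothetical, not taken from the project's data files):

```python
def intact_cohort(records, waves=("pre", "year1", "year2", "year3")):
    """Keep only students tested in every wave; students who left the
    school are dropped, and the sample is never refreshed."""
    return [r for r in records if all(r.get(w) is not None for w in waves)]

students = [
    {"id": "A", "pre": 410, "year1": 420, "year2": 433, "year3": 441},
    {"id": "B", "pre": 395, "year1": 401, "year2": None, "year3": None},  # left the school
]
print([r["id"] for r in intact_cohort(students)])  # → ['A']
```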
The team was aware of the challenges that had been faced both by the Success For All
(SFA) RCT (Borman, et al., 2007) and others. For the SFA project, identification of a large
enough sample had required repeated adjustments to the offer made to schools and LEAs
(including, at one point, offering all schools that “lost” the coin toss and could not implement
SFA for three years a cash consolation prize of over $10,000 each). However, we believed that
because the Effective Schools procedures would be less prescriptive than those in SFA, and
because many districts had already implemented some components of the original School
Effectiveness principles, we would face fewer challenges in identifying a full sample of 32
schools.
We were wrong.
Over the first 18 months of the ES-21 project, the team invited 17 LEAs in six states to
nominate schools for participation. All expressed initial interest, but 12 declined to participate.
In each case, the stated cause of their declining to participate was the random assignment
requirement. The 12 LEAs were unwilling to have either coin flips between demographically
matched schools or use of a table of random numbers to determine which schools would or
would not participate.
In the end, all five of the LEAs that agreed to participate did so in part as a result of long-standing relationships with one or more of the research team members. In Kentucky and South
Carolina, the prior connection was with Dr. Stringfield and his associates. Dr. Schaffer’s prior
relationships were critical in the North Carolina LEA, and Dr. Chrispeels’ prior work with a
specific district in California resulted in entrée into that district. In every case, the stumbling
block that had to be overcome through careful negotiation was the issue of random assignment of
schools to treatment or control conditions. (In one case, even after the LEA agreed to
participate, a central office person wanted to “pick” which schools were “randomly assigned” to
which condition. The ES-21 team insisted on controlling the random assignment process.)
Once districts and subsets of their schools had agreed to the randomization process,
schools were demographically matched, and involvement in the experimental vs. “business-as-usual” comparison sites was determined through coin flips.
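The assignment procedure described above, a coin flip within each demographically matched pair, controlled by the research team, can be sketched as follows. The school names are hypothetical placeholders:

```python
import random

def assign_matched_pairs(pairs, seed=None):
    """For each matched pair of schools, flip a coin to decide which
    school is experimental and which is the business-as-usual control."""
    rng = random.Random(seed)  # a fixed seed makes the assignment auditable
    assignment = {}
    for school_a, school_b in pairs:
        exp, ctrl = (school_a, school_b) if rng.random() < 0.5 else (school_b, school_a)
        assignment[exp] = "experimental"
        assignment[ctrl] = "control"
    return assignment

pairs = [("Oak Elementary", "Pine Elementary"), ("Elm Elementary", "Ash Elementary")]
print(assign_matched_pairs(pairs, seed=2011))
```

Keeping the randomization in a single script run by the research team, rather than letting district staff "pick" assignments, is what preserves the design's claim to being a true experiment.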
Two further sampling facts merit noting. First, the California district was sufficiently
enthused about the project that they requested that the study admit 10 schools to the assignment
process and five to the experimental condition (as contrasted to the “8 and 4” specified in the
design). This request was granted, so that the number of schools in the study rose to 17
experimental schools and 34 in total.
Second, one school’s principal apparently agreed to participate in the project on the
presumption that her school would be a control. In fact, the school was randomly assigned to the
experimental condition, and that school withdrew from participating in the training at the end of
year one. That school has been dropped from third year analyses.
Students. Two cohorts of students, second- and third-graders (in the U.S., typically ages 7 and
8), were followed for three years in each school.
Outcome Measure. None of the districts agreed to allow the research team to administer a
study-wide standardized achievement measure, so the four states’ state-wide tests were
accepted as outcome measures. States’ and Local Education Authorities’ (LEAs’) norm-referenced achievement tests in Reading and Mathematics were gathered and used in all outcome
analyses. This essentially changed the outcome-focused portion of the study from one
study with 34 schools to four studies with 8-10 schools each.
Four cycles of achievement data (pre-, and end of years 1, 2, and 3) were gathered on all
students who remained in their cohorts over the three years.
Process measures. In addition to the quantitative data, extensive observations of professional
development activities, interviews with principals, and focus groups with teachers were
gathered and analyzed, following Miles and Huberman (1994). These were gathered by teams
of observers and interviewers over the three years of the project. Additional data were gathered
from, and by, the developers during this period. Process data were analyzed by a
team of qualitative specialists whose major findings are reported independently of this report and
in a dissertation (Pickup, 2010) on the implementation of the program in the Kentucky schools.
The ES-21 Intervention: Content
The content of the intervention was to be a state-of-the-art compilation of school
effectiveness information (e.g., Datnow, Lasky, Stringfield, & Teddlie, 2006; Purkey & Smith,
1983; Teddlie & Reynolds, 2000; Taylor & Bullard, 1995). The core content of the three years
of professional development was derived from Taylor’s review of correlates of effective schools,
“OHCFISH” (pronounced “oh see fish”):
Opportunity to Learn/Time on Task
High Expectations
Clear School Mission
Frequent Monitoring
Instructional Leadership
Safe and Orderly Environment, and
Home-school Relations.
The initial development team, primarily Drs. Stringfield, Charles Teddlie, Debbie
McDonald, and Janet Chrispeels, concluded that, given the changing contexts of the 21st century
(especially the addition of the federal No Child Left Behind act, NCLB) and the ongoing
evolution of the school effects/school improvement research bases (Teddlie & Reynolds, 2000;
and others), three additional dimensions needed to be added to the most fundamental aspects
of the OHCFISH definition of school effectiveness. The first was an increased focus on
standardized test scores. Under NCLB all schools are required to gather and report scores on
state-chosen tests for over 95% of all students in all grades, both in total and disaggregated
several ways. The first result was an increase in testing in most states, and the second was
greatly increased focus on raising the scores on each state’s tests for individual students and
various aggregations of students in each school. This has led to a greater national focus on the
use of test scores as part of the “Frequent Monitoring” component of effective schools correlates.
It has also led to a remarkable level of “teaching test-taking skills” and “teaching to the test.”
For the ES-21 team, the important point was to increase our focus on creating maximally
thoughtful, student-relevant use of data as a major component of virtually every phase of the
project. To have done otherwise would have been to be seen as irrelevant in many contexts.
Second was explicit involvement of the school district. If the district administration
doesn’t signal the importance of a reform and “buy in” to the steps necessary to initiate and
sustain a reform, then the reform’s chances of deep implementation and long-term survival are
minimal (Datnow & Stringfield, 2000; Datnow, Lasky, Stringfield, & Teddlie, 2005).
Similarly, Chrispeels, Castillo, and Brown (2000) had pointed out the value of working
through “leadership teams” consisting of principals and groups of key teachers, when working to
improve elementary schools. Hence, we explicitly added leadership teams as the prime movers
of change within individual schools.
Finally, Nunnery (1997) had reviewed over a half century of research on school change
efforts and had concluded that in virtually every study, variance in levels of implementation had
been much larger than developers had anticipated, and had been an excellent predictor of the
levels of success achieved by individual schools (See also Stringfield, Millsap & Herman, 1997;
Supovitz & Weinbaum, 2008). Hence, the team worked to build higher levels of reliability into
the implementation process3.
Initially, the team believed that adequate or nearly adequate training materials could be
used “off the shelf.” For example, the Olin Foundation had paid Phi Delta Kappa International
to develop a several-inch-thick set of materials for trainers to use in running school effectiveness
workshop series. Similarly, a group had been funded at the University of Wisconsin to operate a
School Effectiveness (SE) center. Dr. Larry Lezotte and others had offered SE training for years,
and had a great many materials that Dr. McDonald believed we could use. This proved to be an
overly-optimistic assessment of each aspect of the situation.
An early, expensive (in time and resources) lesson was learned. The PDK materials were
inadequate, and existing SE teams would not share their materials (although Dr. Lezotte did sell
dozens of volumes of one of his books to the ES-21 project). ES-21 would have to develop its
own materials.
3
For an alternative example of efforts to build High Reliability Organization processes into a school improvement
design, see Stringfield, Schaffer, & Reynolds (2008)
The ES-21 project hired a nationally-known trainer to pull together materials for the three
years of workshops. After examining the Phi Delta Kappa materials, the intended-to-be-full-time
trainer and Dr. McDonald determined that those materials were not sufficiently developed and
refined for practical usage. (This judgment was later shared by Dr. Chrispeels and
subsequent developer/presenters.) We then examined a range of other materials. After six
months of minimal progress and a range of other issues, the services of the first developer/trainer
were discontinued, and the team hired Dr. Eugene Schaffer of the University of Maryland-Baltimore County to work with Drs. McDonald and Chrispeels on further development. Dr.
Schaffer offered decades of experience working on a range of teacher- and school-improvement
projects, including a nine-nation school effects study (Reynolds, Creemers, Stringfield, Teddlie,
& Schaffer, 2002) and the High Reliability Schools reform effort in England and Wales
(Stringfield, Reynolds, & Schaffer, 2008). Drs. McDonald and Schaffer, aided by Dr.
Stringfield and, over time, Drs. Janet Chrispeels, Peggy Burke and others, began turning research
on School Effectiveness into a three-year series of workshops. While a rough outline of three
years was completed during the first 12 months of the grant, substantial refinements of the
process continued through the full three years of implementation. The project was hampered by
the fact that several of the processes necessarily were still being developed and/or significantly
modified over the course of the project.
The ES-21 Delivery Process
The ES-21 content was delivered through a multi-layered, multi-school, three-year
training of leadership teams. The leadership teams included members of each Local Education
Authority (LEA), each school’s principal, and a teacher leadership team from each school. The
teacher leadership team was to include one teacher from each grade. Training was offered at
each of the five LEAs. The choice of engaging multiple layers of the local educational hierarchy
was based on relatively recent research (e.g., Datnow, Borman, Stringfield, Rachuba, &
Castellano, 2003; Datnow, Lasky, Stringfield, & Teddlie, 2006) that each of the levels needed to
be actively engaged in the reform, and that their efforts needed a coordinated focus. These
multi-layered teams received training from a team of two well prepared, highly experienced
trainers on multiple occasions throughout each school year. Typically, these within-district
professional development experiences were offered in six full-day workshops spread over each
of the three school years.
An additional layer of training was provided for the sets of local leaders via annual cross-district trainings. This final level of design-team-led, cross-LEA training copied training
provided by virtually every comprehensive school reform design group in the U.S. (e.g., Success
for All (Slavin & Madden, 2000), the Comer School Development Program (Comer, 2004), and the
New American Schools designs (Stringfield, Ross, & Smith, 1996)) and in the U.K. (Stringfield,
Reynolds, & Schaffer, 2009). The implementation model called for the trained local leaders
who had participated in the six days of training each year to then provide professional
development and coordinated ES-21 leadership for the remaining teachers in the ES-21
experimental schools.
The control schools were to receive no intervention, and continued in a “business as
usual” fashion. This topic will be re-visited in the findings section.
Results
Qualitative findings.
Implementation was uneven across the sites, with changes in administration and teaching staff
exerting a powerful influence on schools’ ability to make significant progress in implementing
the elements of ES-21. Many changes were incorporated in classrooms, particularly the use of
data in school decision-making. The translation of data into instructional change was less
obvious, and appears to be a complex outcome involving coordinated efforts at the classroom,
grade, and school levels. The use of horizontal and vertical “slices,” or integration of curriculum
based on student learning, proved to be an eye-opener for most teachers and schools, and
building effective instruction that crosses grades and classrooms takes significant effort. At the
grade and classroom levels, the impact of ES-21 was diffuse, or not understood by all teachers
as a reform. This may not be a negative if it reflects the integration of the process and direction
of the reform into the very fabric of the school.
Quantitative (Achievement Test Analyses)
Because each state used a different testing program, and because some states either
changed testing series across grades or changed tests during the years of the ES-21 project,
all achievement scores for all students (and aggregated to schools and experimental conditions)
have been normalized and are presented as Z-scores. The effect is that, in each case, the students
from a given district will have scores with a mean of 0.0 and a standard deviation of 1.0. This
follows the logic of “true experiments,” in that the comparisons are exclusively between
experimental and control students, not against a state or national norm.
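The within-district normalization can be sketched in plain Python. This is an illustration of the Z-score transformation described above, not the project's actual analysis code:

```python
def to_z_scores(raw_scores):
    """Rescale one district's raw test scores to mean 0.0 and standard
    deviation 1.0, so contrasts are experimental-vs-control only,
    not against a state or national norm."""
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    sd = (sum((x - mean) ** 2 for x in raw_scores) / n) ** 0.5
    return [(x - mean) / sd for x in raw_scores]

z = to_z_scores([410, 425, 433, 441, 456])
print(round(sum(z) / len(z), 10))  # mean is 0.0 after rescaling
```

Because each district is rescaled to its own mean and standard deviation, effect estimates remain comparable across the four states' different testing programs.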
The study design called for following two cohorts of elementary students over three
school years. The cohorts were in second and third grades as the study began. Given that no
district would allow the research team to conduct fall-of-second-grade testing, none of the second
grade cohorts has a pretest. Whereas analyses of students in the third grade cohorts include free
and reduced-price meals status, race, and second-grade pretest scores as covariates, the second grade
cohorts have no first-grade pretest. This necessarily puts greater emphasis on the third grade
cohort, as the only cohort to have a full set of achievement measures from pre- through post-test (i.e.,
from second through fifth grade). The research team gathered data on both cohorts, and regards
the presence of two cohorts’ analyses as more valuable than one cohort’s. Nonetheless, in all
cases greater weight should be placed on results from the third grade cohort data.
A full set of detailed quantitative analyses are available from the authors. A summary
table of achievement contrasts (ES-21 vs. controls) by district and state is provided in Table 1.
Table 1:
Effective Schools for the 21st Century Three-Year Achievement Summary: Significant Differences
Favoring Experimental or Control Groups by State, Grade Cohort, and Content Area (Reading
and Mathematics)
______________________________________________________________________________
                                      Reading                   Mathematics
State                       Grade 2-4    Grade 3-5    Grade 2-4    Grade 3-5
Kentucky                    n.s.         n.s.         n.s.         Controls
North Carolina              Controls     n.s.         n.s.         n.s.
South Carolina, District 1  n.s.         n.s.         n.s.         n.s.
South Carolina, District 2  Controls     n.s.         Controls     n.s.
California                  Controls     Controls     n.s.         Controls
______________________________________________________________________________
Overall, thirteen (13) of the twenty comparisons produced non-significant results, and seven
(7) produced results favoring the control schools. In this “true experiment,” none of the analyses
of achievement test scores favored the ES-21 groups of schools.
Discussion
The three-year student achievement patterns summarized in Table 1 are counter-intuitive.
Literally decades of research suggest that an effective schools intervention, delivered by strong,
highly-experienced, practically-grounded trainers, should produce achievement results
favoring the experimental schools. But that did not happen. In this paper we consider possible
explanations, in reverse order of what we believe to be their validity in explaining the findings
summarized in Table 1. The explanations derive from a combination of formal and informal
qualitative observations of the three years of ES-21 development and implementation. In the
pages that follow, the explanations we regard as least likely are presented first.
1. “School-effectiveness”-based whole-school reform efforts are not valid. They do not
“push the right levers.” We reject this explanation. There is far too much prior
research indicating the value of the seven “OHCFISH” correlates to justify this
conclusion. Further, teachers and several principals reported valuing the workshops, and
the California district has contracted with the ES-21 trainers to continue their work. The
ES-21 program appears to have “face validity.”
2. The schools and teachers didn’t “buy in.” Members of the research team have studied
literally dozens of reform efforts, and a consistent finding across those efforts was that if
teachers and principals harbor deep reservations about a reform, it is doomed. Beginning
in year one, we conducted teacher focus groups and informal data gathering activities.
Our team probed consistently, and yet heard very few reservations about ES-21. The
substantial majority of teachers and principals in the ES-21 study reported buying into
both the correlates and the leadership-team implementation approach. Further, the fact
that 16 of 17 experimental schools continued in the program through three years of
intervention suggests that most principals and teachers saw value in ES-21.
3. The intervention was not sufficiently intense to produce desired results. From the
beginning, Dr. Chrispeels expressed the belief that the study lacked sufficient intensity to
produce positive effects on state tests. She cautioned that six whole-day workshops per
year with only telephone follow-up were simply not sufficient to produce the desired
impact on student outcomes. Stringfield’s observation on the other side was that
Stringfield, Reynolds, and Schaffer (2008) had delivered a somewhat similar, low-intensity
intervention in two British local authorities and had gotten very substantial gains in
student outcomes. It seemed plausible that this effort could have similarly positive
effects after a total of 18 days of workshops over a three-year period. On the other hand,
there were few competing efforts in the United Kingdom to diffuse the efforts of the reform.
4. Inadequate funding to provide a sufficiently intense intervention. Research teams
that win Institute of Education Sciences “Goal 4” grants (large-scale, random assignment
effectiveness trials) are funded at up to $6.0 million, just for the research
components, and Goal 4 assumes the presence of a well-developed program and supporting
materials. That is over four times the funding for ES-21. The “New American Schools”
designs (see Stringfield, Ross, & Smith, 1996) each received millions of dollars in
external support before being asked to serve more than a handful of schools. Certainly
additional funding would have helped; the ES-21 team applied for IES funding, but
did not receive the additional, external support.
5. The efforts need more time to show results on state test scores. This is plausible, and
the research team intends to follow the schools over several more years to see if they
produce a “sleeper effect” not unlike the one found in the English district in the High
Reliability Schools project (Schaffer, Stringfield, & Reynolds, 2008).
6. Knowledge from ES-21 training migrated into control schools as well. In the
California schools, there was extensive documentation of the extent to which ES-21
concepts and processes were being picked up by all the schools in the district. Less
formal data suggested a similar “bleeding over” of information and skill development
into some of the control schools in the other districts. Obviously, if controls
were doing more than “business as usual,” their gains might be evidence of ES-21 effects.
However, it is not intuitively obvious why control schools, receiving any ES-21
information in less depth, would achieve greater gains in several areas than the
experimental schools.
7. Frequent lack of systemic supports. Four of the five LEAs changed superintendents
during the project, and all of the new superintendents had priorities that did not focus on
ES-21. Schools often faced conflicting demands and worse. In one LEA, a year’s
professional development days were carefully worked out in advance by the ES-21 team
and the principals. At every step as the schedule was being developed, the LEA’s ES-21
liaison was included in the emails. Only after the principals, teachers, and ES-21 staff
had agreed on what they believed was a fixed, final schedule did the LEA person inform
the group that all workshops that had been planned on Mondays or Fridays (i.e., all of
the workshops) would have to be rescheduled, because the LEA would not allow Monday
or Friday professional development release time. While probably unintended, the
message that principals and teachers inferred was that ES-21 wasn’t important to the
LEA.
Contradicting evidence was provided, however, by the California district. Their
Superintendent came to virtually every workshop, repeatedly took extra steps to make the
reform happen, and has now hired the California ES-21 team to train three more schools.
Yet the California achievement data were, if anything, less encouraging than data from
other, less-supported LEAs.
8. Attempting to conduct an effectiveness trial while having to develop the intervention
materials. Over the past five years, the Institute of Education Sciences has developed a
multi-layered set of options for seeking funding
(http://ies.ed.gov/funding/ncer_progs.asp). A first level of funding involves examination
of data sets to examine relationships among variables, and to potentially identify relevant,
malleable variables. The second level, “Goal 2,” involves “development” studies, in
which programs that are theoretically promising but not well developed are given three
years of funding to work out the specifics of a reform design in a few schools. “Goal 3”
is concerned with efficacy trials, in which fully developed reforms are tried out in a
moderate number of schools. Goal 3 studies are funded for four years and up to $3
million. “Goal 4” studies are “gold standard” RCTs. They are funded for five years and
up to $6 million.
“The Four Quartets” was proposed under the impression that multiple sets of fully
developed, high-quality school effectiveness training materials existed and would be
made available to the ES-21 team. This proved to be inaccurate. Members of
consecutive development/implementation teams found the Phi Delta Kappa materials
inadequately developed to be of practical use. Other “effective schools” implementation
teams that offer for-profit training declined to make their materials available for the
project. As a result, the ES-21 team was left doing “Goal 2” development work
while attempting to conduct a “Goal 4” RCT effectiveness trial.
ES-21 was conducted on a “Goal 1” budget. The ES-21 team made substantial sacrifices
throughout the project to avoid having to shut the project down before the completion of
three years of intervention and data gathering. As one example, zero percent of the P.I.’s
salary has been charged to this grant in over two years. Were it not for the generous
understanding of two consecutive College of Education deans, the project would have run
out of funds. In retrospect, the Olin Foundation grant, for which we have all been very
appreciative, was stretched far too thin. When additional development work, on-site
intervention work, and data-gathering and analytic work were needed, the efforts came as
non-funded contributions of time.
9. The sheer instability in the ES-21 schools, and perhaps in much of American
education today, made establishing and sustaining reform daunting at best, and
often more challenging than that. The ES-21 schools faced instability at
every level. During the three years of intervention, four of the five LEAs changed
superintendents. At least 12 of the 17 schools changed principals, and one school had five
principals in a single year. The number of principals who served the 17 schools totaled
nearly 40. In three years, several of the schools had over 100% turnover in their teacher
leadership teams. A few had near-100% turnover annually. When one North Carolina
ES-21 principal moved to the principalship of a newly opening school in the same
district, she hand-picked over a half-dozen fine teachers from her original school to go
with her. The excellent principal and half of the leadership team were gone in a single
stroke.
In the majority of cases, the persons replacing the original principals and teachers were
competent, caring professionals. But in every case they either needed to be tutored in the
specifics of ES-21, or, understandably, they were less than fully supportive of the
program.
When attempting to provide professional development to middle school science teachers
in a high-poverty LEA, Ruby (2002) found that teacher turnover made entry-level
training a near-constant part of the task. In ES-21, the task was more complex,
as professional re-development could be needed at the superintendent, principal, or
teacher level, or at all three at once.
10. The No Child Left Behind legislation has so altered the landscape that interventions
must also be substantially altered. Throughout ES-21, developers, trainers, and
evaluation team members were struck by the extent to which schools’ and districts’
perceived sets of options had been altered by NCLB.
a. In general, schools and LEAs now have a much greater focus on test scores and
methods for raising test scores. Teaching “test-taking skills” appears to be a
near-universal component of schools’ curricula. This was true in every state. It is
possible that the ES-21 schools spent substantially less time “teaching to the
test” than most controls. While this more balanced focus may have increased
student achievement, it would do so without necessarily raising test scores in the
experimental schools.
b. Curricula in all districts were more focused on reading and mathematics.
c. School days were altered to focus on areas measured under NCLB.
d. “Data” of all sorts, but particularly data from each state’s annual, NCLB-mandated
testing program, are now ubiquitous in schools. Interestingly, districts’
professional development for teachers and administrators on the uses of data
appeared to be of indifferent quality, and praise for the ES-21 data-use workshops
was nearly universal. Aspects of those workshops were clearly being shared
across several of the districts. Still, in follow-up interviews after the three years
of intervention, the “data use workshops” were among the most remembered and
most praised-by-teachers-and-principals parts of the program.
e. Whereas in many previous studies conducted by the P.I., district officials had
been content to let interventions “run their course” and then evaluate the results;
in the post-NCLB world, district staff were often eager to observe trainings and,
to the extent they could, replicate aspects of the training in other schools,
including control schools.
Implications for Future Research and Practice
Implications for Future Research
The first implication of the ES-21 project for research is an endorsement of the current
IES Goal sequence. ES-21 was to be, in IES’s current language, a “Goal 4 effectiveness trial.”
Such trials should be undertaken only when the reform, in all of its specifics, is well
developed and previously tested.
The second, cautionary implication is that the full costs of an effectiveness trial can
easily be underestimated. A shortage of resources can hobble both the intervention and the research
sides of a reform.
Third, substantial efforts were made in ES-21 to protect both the internal and the external
validity of the design. The random assignment of over 30 schools to experimental or control
conditions was one of several steps taken to guard internal validity. The involvement of schools
in four states was a deliberate effort to maximize external validity. With 20-20 hindsight,
the team would have been well advised to pick one or the other and settle, as Cronbach et al.
(1981) advised, for making an imperfect but well-crafted contribution to the field, rather than
trying to produce a single, utterly compelling trial.
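As a concrete illustration of the randomization step discussed in this section, school-level assignment can be sketched in a few lines of Python. This is a minimal sketch only: the school names, the seed, and the simple shuffle-and-split procedure are illustrative assumptions, not the actual ES-21 protocol, which may have paired schools or stratified by district before assignment.

```python
import random


def assign_schools(schools, seed=None):
    """Randomly assign a list of schools to experimental or control conditions.

    Illustrative sketch only: a seeded shuffle followed by an even split.
    The real ES-21 assignment procedure is not documented here.
    """
    rng = random.Random(seed)          # seeded for a reproducible assignment
    shuffled = list(schools)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "experimental": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical identifiers standing in for the 30+ participating schools.
schools = [f"School-{i:02d}" for i in range(1, 31)]
groups = assign_schools(schools, seed=2011)
print(len(groups["experimental"]), len(groups["control"]))  # prints "15 15"
```

Even this toy version makes the section's point visible: once a district agrees to the lottery, which schools land in the control group is entirely out of its hands.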
The fourth implication is a speculation. We wonder about the viability of school-level
random assignment to experimental and control conditions. In the Borman et al. (2007) random
assignment study of Success For All, the research team required two full years to identify a
group of schools that would hold a teacher vote and get an 80% commitment to the program, and
then agree to a coin flip: as to participation in the first year of recruitment, and as to
implementation in K-2 or 3-5, but not both, in the second year. The resulting study
produced a smaller effect than several previous SFA studies. It is possible that the
smaller effect resulted from involving only schools that were willing to risk not being
involved. Logically, such schools could be presumed to be less committed to the reform effort.
Similarly, in the British High Reliability Schools project, the two districts that involved
all the secondary schools in their districts in a team effort achieved laudable results (Stringfield,
Schaffer, & Reynolds, 2008; Schaffer, Stringfield, & Reynolds, 2008), but the district in which
half of the secondary schools participated got no benefit in terms of student outcomes. It is
possible that a significant part of the value in a reform is achieved by all principals talking
together about—and learning from—their shared experiences.
Thirteen districts that wanted to participate in ES-21, and expressed a willingness to be
highly supportive, withdrew rather than have their schools go through a lottery to see whether
they would be allowed to participate in the reform. While one cannot know “what would have
been if…”, one can know that the late Matt Miles’s most fundamental advice when considering
involvement in a school reform was, “Pick a reform and go at it hard!” If a group wants to go at
something hard, why would it risk a lottery?
While we fully endorse the logic of “true experiments” at the individual student and
classroom levels, we are inclined to believe that in the SFA case, and even more in this case,
forcing a lottery system on potentially willing samples of schools may bias the selection process
generally, and specifically reduce the probability that schools that want to “go at it hard” will
agree to participate.
Implications for Future Reform Efforts
First, picking up on the last research implication: if a school is not committed to a
vigorous, long-term improvement process, it should not bother to start. Several of the ES-21
schools were curious and somewhat interested throughout, but never displayed a widely shared
passion for academic improvement. Perhaps the most driven, committed principal in the study
used ES-21 to produce dramatic two-year gains at her school. She was then promoted to the
central office, and the hard-won gains collapsed.
Second, groups seeking achievement gains should assume that they are making a
commitment for several years. One, two, or even three years of work are more than a prelude
but less than an institutionalized reform.
Third, as Fullan and Miles (1992) observed, change is “resource hungry.” It is not
practical to assume that all of the responsibility for creating change will be borne by an outside
group. Some of the ES-21 principals found ways to leverage the training ES-21 provided.
Others did not. The external group cannot do it alone.
Fourth, NCLB is changing the educational landscape and the imperatives facing schools.
ES-21’s P.I. edits a journal focused largely on Title I/NCLB, and yet he and the full team were
struck by the extent to which a focus on raising scores on single state tests has come to
dominate the decision processes in districts and schools. Obviously, to the extent that this forces
a focus on the academic achievements of all students, this change is a very good thing. We saw
some of that. However, to the extent that NCLB testing is producing a greater focus on test
scores than on larger issues of deep learning for all students, the focus risks becoming a harmful
monomania. While the balance is getting harder for educational leaders to find and maintain,
surely it is worth the effort. We believe that, considered in total, the effective schools correlates
can assist leaders in finding that balance.
Conclusions
1. Schools that achieved an initial focus on achievement and set clear directions early on
seemed to do better on state tests, so long as the school benefitted from stable leadership
at both the principal and teacher leadership team levels.
2. Across the 17 ES-21 schools and five LEAs, leadership teams consistently reported that
they had gained in their ability to work and solve problems as teams. In a variety of
ways, teachers reported being more tightly networked for the purposes of solving
practical, work-related problems. This is a likely correlate of the institutionalization of
continuous progress in any organization. This observation bodes well for the long-term
impact of ES-21.
3. The substantial majority of participating teachers and principals reported thinking ES-21
was valuable for their schools and for them personally.
4. Teachers and principals found the training and experiences in teaming very helpful. They
reported spending more time in much more extended and thoughtful conversations
about content and about the acts of teaching. Many teachers reported that they had never
before observed in other schools and other classrooms, and that doing so informed their own teaching.
5. Similarly, both teachers and principals reported that the intensive training in data analysis,
and in folding quantitative analysis into their own school and classroom practices, was
very helpful.
6. In several schools, both teachers and principals reported that they thought the ES-21
experiences had put them several years ahead of their within-district colleagues in terms of
upcoming change efforts. Many noted that their districts were in the process of
moving to teacher leadership teams and to increased data use, reforms in which they
believed they were already years ahead.
7. More than one teacher and principal in the eastern LEAs reported feeling that the first-year
implementation processes, while valuable, occasionally felt disjointed. (In fact,
some of the workshops were being developed in something very close to “real time.”)
Reassuringly, this comment was not heard in the California schools, which were
implementing a year later.
8. Goals may change with changes in leadership.
9. Reform efforts were dependent on funding.
10. LEA support was generally lacking and, even when present, was not always effective.
11. A new leadership model was used in the implementation of the program.
12. Data were either not in place or, when available, were used ineffectively for instruction.
13. Personnel were either treated as interchangeable or were recruited away from schools.
14. Formal leaders were often replaced, or essential positions were left unfilled.
15. Leaders were often neophytes hired into very challenging positions.
References
Aikin, W. M. (1942). The story of The Eight Year Study. New York: Harper.
Borman, G., Hewes, G., Overman, L., & Brown, S. (2003). Comprehensive school reform and
achievement: A meta-analysis. Review of Educational Research, 73(2), 125-230.
Borman, G., Slavin, R., Cheung, A., Chamberlain, A., Madden, N., & Chambers, B. (2007). Final
reading outcomes of the national randomized field trial of Success for All. American
Educational Research Journal, 44(3), 701-731.
Chrispeels, J., Castillo, S., & Brown, J. (2000). School leadership teams: A process model of
team development. School Effectiveness and School Improvement, 11(1), 20-56.
Datnow, A., Borman, G., Stringfield, S., Rachuba, L., & Castellano, M. (2003). Comprehensive
school reform in culturally and linguistically diverse contexts: Implementation and
outcomes from a four-year study. Educational Evaluation and Policy Analysis, 25(2),
143-170.
Datnow, A., Lasky, S., Stringfield, S., & Teddlie, C. (2005). Systemic integration for
educational reform in racially and linguistically diverse contexts: A summary of
evidence. Journal of Education for Students Placed At Risk, 10(4), 441-453.
Datnow, A., & Stringfield, S. (2000). Working together for reliable school reform. Journal of
Education for Students Placed At Risk, 5(1&2), 183-204.
Edmonds, R. (1979). Effective schools for the urban poor. Educational Leadership, 37, 15-27.
Fullan, M. G., & Miles, M. B. (1992). Getting reform right: What works and what doesn’t. Phi
Delta Kappan, 73(10), 744-752.
Hargreaves, A., & Fink, D. (2006). Sustainable leadership. San Francisco: Jossey-Bass/Wiley.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis (2nd ed.). Thousand
Oaks, CA: Sage.
Nunnery, J. A. (1998). Reform ideology and the locus of development problem in educational
restructuring. Education and Urban Society, 30(3), 277-295.
Slavin, R., & Madden, N. (2000). One million children: Success for All. Thousand Oaks, CA:
Sage.
Stringfield, S., Reynolds, D., & Schaffer, E. (2008). Improving secondary students’ academic
achievement through a focus on reform reliability: 4- and 9-year findings from the High
Reliability Schools project. School Effectiveness and School Improvement, 19(4), 409-428.
Stringfield, S., Ross, S., & Smith, L. (Eds.). (1996). Bold plans for school restructuring: The
New American Schools designs. Mahwah, NJ: Lawrence Erlbaum Associates.
Supovitz, J., & Weinbaum, E. (2008). The implementation gap. New York, NY: Teachers
College Press.
Taylor, B., & Bullard (1995). The revolution revisited. Bloomington, IN: Phi Delta Kappa.
Teddlie, C. & Reynolds, D. (2000). Handbook of research on school effectiveness and
improvement. London: Falmer.
What Works Clearinghouse. Retrieved July 1, 2010, from http://www.whatworks.ed.gov/.