Educational Evaluation and Policy Analysis
December 2012, Vol. 34, No. 4, pp. 367–390
DOI: 10.3102/0162373712439094
© 2012 AERA. http://eepa.aera.net
Team Pay for Performance: Experimental Evidence
From the Round Rock Pilot Project on Team Incentives
Matthew G. Springer
Vanderbilt University
John F. Pane
Vi-Nhuan Le
Daniel F. McCaffrey
RAND Corporation
Susan Freeman Burns
Vanderbilt University
Laura S. Hamilton
Brian Stecher
RAND Corporation
Education policymakers have shown increased interest in incentive programs for teachers based on
the outcomes of their students. This article examines a program in which bonuses were awarded to
teams of middle school teachers based on their collective contribution to student test score gains.
The study employs a randomized controlled trial to examine effects of the bonus program over the
course of an academic year, with the experiment repeated a second year, and finds no significant
effects on the achievement of students or the attitudes and practices of teachers. The lack of effects
of team-level pay for performance in this study is consistent with other recent experiments studying
the short-term effects of bonus awards for individual performance or whole-school performance.
Keywords: teacher pay for performance, education performance incentives, group incentives, team incentives

Author Note: We would like to acknowledge the contributions of Ann Haas, who provided analytic support to the project; Dale Ballou, who provided input into the study design and feedback on a draft of this article; and three anonymous reviewers who provided thoughtful feedback that helped us improve the article. We are grateful for the support of officials in the Round Rock Independent School District, particularly Jesus Chavez, Ph.D., Superintendent; Toni Garcia, Assistant Superintendent for Instruction; Rosena Malone, Assistant Superintendent for Secondary Education; and Debbie Lewis, Director of Research and Assessment. We offer our special thanks to the teachers and principals in the Round Rock middle schools, without whose participation this research would not have been possible. Teacher bonuses were made possible through the generous financial support of an anonymous foundation. This research was supported by the National Center on Performance Incentives, which is funded by the United States Department of Education’s Institute of Education Sciences (R305A06034).
1. Introduction
A variety of factors have led education policymakers to increase their interest in providing
incentives to teachers based on the outcomes of
their students. First, there is ongoing frustration
that U.S. public schools have not made sufficient progress in recent decades in addressing
the achievement gap between advantaged and
disadvantaged students, nor in how the United
States fares in international comparisons.
Meanwhile, federal policies have led to increased
use of standardized testing and more widespread
use of test results to evaluate the performance of
schools, administrators, and teachers. Finally,
there has been greater recognition that increasing teacher quality may be the most direct and
effective pathway to improving student achievement (see, e.g., Rice, 2003). Performance
incentives are viewed as having the potential to
increase teacher quality in two ways: by incentivizing existing teachers to improve their practice and by attracting better teachers into the
profession and retaining them there.
However, in spite of the intuitive appeal
incentive pay has to some stakeholders, an influential base of individuals and organizations
fundamentally opposes its use in education.
Opponents contend that such pay renders schools
less effective by reducing the motivating effects
of intrinsic rewards; that is, teachers will achieve
less satisfaction from professional performance
as they are rewarded financially for measured
student achievement (Thomas, 2009). Other
concerns include the risk that test-score-focused
measures could lead to excessive emphasis on
test preparation at the expense of other valued
activities, which would lessen the validity of
inferences from the scores themselves (Kaufman,
2008) and that the education system lacks appropriate measures for evaluating teacher performance directly (Gratz, 2009; Milanowski, 1999).
There is also concern that incentive pay could
negatively affect teachers’ morale and the collegial environment of the school, which is considered essential for school improvement (Hoerr,
1998; Odden, 2000).
The historical evidence on the impact of pay-for-performance programs is inconclusive
(Springer et al., 2009). Recent international
studies have found positive effects of pay-for-performance systems on student achievement
(Glewwe, Ilias, & Kremer, 2010; Muralidharan
& Sundararaman, 2011); however, recent studies of such systems in the United States did not
find effects on student outcomes (Fryer, 2011;
Glazerman & Seifullah, 2010; Marsh et al.,
2011; Springer et al., 2010). One of these studies, the POINT study conducted in Nashville,
Tennessee, randomly assigned individual middle school math teachers to be eligible for large
bonuses based on their students’ performance
(Springer et al., 2010). Two other studies independently evaluated the School-Wide Performance
Bonus Program (SPBP) in New York City
(Fryer, 2011; Marsh et al., 2011). In that program, teachers and other union-represented staff
in low-performing, at-risk schools could earn
bonuses if the school exceeded performance
standards based on the city’s school accountability measures. Both studies found null or negative differences between the outcomes of students attending schools randomly assigned to the program and those of students in the control group (Fryer,
2011; Marsh et al., 2011).
Given the importance of collaboration among
teachers for effective teaching (Rosenholtz,
1989), the school-wide program in New York,
which may have done more to encourage teachers to work together than an individual bonus
system, such as POINT, might have been
expected to have effects even if POINT did not.
However, some researchers have argued that
school-level bonuses offer only weak incentives
since teachers may not feel that they have much
influence on whether their school as a whole
qualifies for a bonus (Sager, 2009). Moreover,
the group-based award program was potentially
susceptible to the “free-rider problem.” With
group-based performance structures, individuals on a team may become less likely to shoulder their fair share of the workload. They know
the capabilities of teammates can make up for
their subpar performance, resulting in all team
members receiving a bonus award. Thus, group
incentive systems are likely to result in the inefficient allocation of some bonus resources.
Milanowski (2007) found that teachers are concerned about free riders. In that study, teachers
expressed preference for individual awards over
group awards, citing a lack of control over others’ performance and concerns that their peers
would not contribute equally to earning the
bonus.
However, Kandel and Lazear (1992) and
others have argued that as long as the size of a
within-organization team is not too large, the
free-rider problem can be solved through peer
pressure. For instance, peer monitoring and the
enforcement of social penalties in the form of
shame, guilt, empathy, and mutual monitoring
can lead to individual team members being
accountable for their performance to the other
members of the group. If a worker has both
monetary and social incentives to not shirk,
Kandel and Lazear (1992) contend the motivational forces that would have been “choked off”
by the free-rider problem are recovered.
Moreover, organizational theory suggests
group incentives can promote social cohesion,
productivity norms, and feelings of fairness
(Lazear, 1998; Pfeffer, 1995; Rosen, 1986).
Improved social cohesion among workers can
foster knowledge transfer and mutual learning
that result in increased productivity in the long
run (Che & Yoo, 2001). For example, as
reported in Berg et al.’s (1996) and Hamilton,
Nickerson, and Owan’s (2003) case studies of
garment plants, the formation of teams with
workers of varying abilities facilitated interactions among high- and low-ability workers so
that more able workers taught less effective
workers how to better execute tasks and become
more productive.
For this reason, some policymakers and
practitioners gravitate toward team-level awards
(i.e., awards based on the performance of a
group of teachers within a school). While teams
can be construed in a variety of ways, including
teachers of the same grade level, in the same
department, or part of instructional teams, the
rationale for team-level awards is usually based
on the collaborative nature of teaching within
these smaller-than-whole-school groupings
(Azordegan, Byrnett, Campbell, Greenman, &
Coulter, 2005). In theory, team-level awards can
provide a close coupling between student performance and the teachers who instruct them,
while also encouraging teacher collaboration
and reaping the benefits of productivity gains
from knowledge transfer and mutual learning.
From outside of the K–12 education sector,
there is limited evidence suggesting that team
incentives may be more effective than individual
incentives. Condly, Clark, and Stolovitch (2003)
conducted a meta-analysis of 64 studies in the private sector, government, and higher
education. Nine of the studies examined team
incentive programs, reporting an average effect
size of 1.40 SD units, versus 0.55 for the 55 studies that examined individual incentive programs.
Here, the relative effects of team versus individual incentives may be more informative than
the absolute magnitudes of these effects, because
there are reasons to doubt that such large effect
sizes would be obtained in fielded teacher incentive programs. First, only about half of the studies in the meta-analysis were field studies, and
those achieved smaller effects than the laboratory experiments. Second, 41% of the systems
studied incentivized manual work, which produced larger effects than those incentivizing
cognitive work. Although the comparability of
teaching and the cognitive work of the studied
systems cannot be determined from the meta-analysis, the smaller effects for cognitive work
would seem more applicable to the complex
cognitive work of teaching, which is measured
indirectly through student achievement outcomes, than the results for incentivizing manual
work, which typically has directly observable
and measurable outputs.
In terms of studies of team incentives in education (excluding school-wide programs), we know
of only one study that compares team- and individual-level incentive programs in pre-college
education. The Andhra Pradesh Randomized
Evaluation Study (AP RESt) compared the impact
of two output-based incentive systems (an individual teacher incentive program and a group-level teacher incentive program) and two input-based resource interventions (one provided an
extra paraprofessional teacher and another provided block grants). Muralidharan and Sundararaman (2008) found that students enrolled in a
classroom instructed by a teacher selected for
the group incentive intervention outperformed
students in control condition classrooms on
both the mathematics and language exams
(0.28 and 0.16 SDs, respectively); however,
students enrolled in schools assigned to the
individual incentive condition outperformed
students in both the group incentive condition
and the control condition after the second year
of implementation. However, the context of
schooling in India compared with the United
States makes the applicability of this study
unclear.
Thus, theory and research from outside of
education suggest that rewarding team-level performance may be a productive way of implementing performance pay that overcomes the
pitfalls of individual and school-wide awards.
However, there has been little scientific study,
and none in the United States, of the effect of
teacher incentives on student achievement where
the distribution of awards is based on the performance of a teacher team.
Although the POINT and New York SPBP
programs used different levels of performance
for determining awards, both programs took
place in large urban school systems in which all
the schools in the study were facing strong
accountability pressures because of low performance. Some schools in the SPBP evaluation
were targeted for closure due to failing performance, and during the POINT study the
Metropolitan Nashville School District was
threatened with state takeover due to the failure
of its schools to make Adequate Yearly Progress
under the federal No Child Left Behind regulations. Authors of evaluations of both programs
speculate that these pressures might have limited the ability of the bonus programs to change
teacher behavior because all teachers in participating schools already faced strong external
pressure to improve student outcomes (Marsh
et al., 2011; Springer et al., 2010). Testing the
effects of bonus programs in other contexts may
yield different results.
To help fill this gap in our knowledge, this
article examines a pay-for-performance program in which performance awards were distributed to teams of teachers based on their collective contribution to student test score gains in
a suburban district with above average levels of
student achievement for its state. Starting in
August 2008, the Round Rock Independent
School District (RRISD) (Texas) and the
National Center on Performance Incentives
(NCPI) designed and implemented two 1-year
randomized controlled trials to examine the
impact of a team-level teacher pay for performance intervention on middle school student
achievement in core subject areas (i.e., mathematics, reading, social studies and science) as
well as the impact of team-level awards on
teacher attitudes and behaviors and on team and
institutional dynamics. In the district, most
middle school students are taught the core subjects by interdisciplinary teams. All Grade 6, 7,
and 8 teachers on these teams were part of the
study if they taught one of the core subject
areas. The first year of the study, 78 middle
school teams of teachers were randomly
assigned to the treatment (eligible for an award)
or control condition (not eligible for an award).
If a treatment team’s value-added score was
in the top one third of treatment teams in its grade level, teachers on the team
were awarded about $5,400 each as long as their
individual value-added score was not statistically below average for their grade level. The
second year of the study, the same procedure
was repeated. That year, 81 teams were randomized, and teachers on bonus-winning teams were
awarded about $5,900 each.
As noted above, there are two high-level
pathways through which performance incentives might result in increased student achievement: by incentivizing existing teachers to
improve their practice, either individually or in
collaboration, or by attracting better teachers
into the profession and retaining them there.
This experiment is designed to measure effects
of the first of these pathways over the relatively
short term of an academic year.
2. Research Questions
The study addresses the following research
questions regarding the opportunity for teachers
to earn a bonus on the basis of student achievement in core subjects taught by them and their
teammates:
1. Does the bonus opportunity affect the achievement of students taught by the team?
2. Does the bonus opportunity affect teachers’ attitudes about compensation and teaching or their teaching practices?
3. Are there differences in the attitudes or practices of teachers who earned a bonus and teachers who did not?
3. Sample
RRISD serves students from the cities of
Round Rock, Cedar Park, and portions of Austin
in Texas. The district enrolls approximately
43,000 students, with the number of students
increasing by approximately 1,500 per year
between 2003–04 and 2008–09. The district has
a diverse ethnic base with a student population
that is approximately 8.7% African American,
10.7% Asian, 30% Hispanic, and 46.2% White.
More than 73 languages are spoken throughout
the district.
The district has 50 schools (6 high schools,
10 middle schools, 32 elementary schools, and
2 alternative education centers) and approximately 5,950 employees, of whom 2,795 are
teachers. The district’s student-teacher ratio is
14.7. In the 2008–09 school year, beginning
salaries for teachers were $41,000 with a bachelor’s degree, $42,000 with a master’s degree, and $43,000 with a doctorate. Approximately 25%
of the teachers have master’s or doctoral
degrees, well below the national average of
57%. The average years of teaching experience is 10.4, compared to a national average
of 15. The study’s middle school sample is
described below.
The 5-year graduation rate from Grades 9 to
12 for the class of 2008 was 87.6% and more
than 77% of the district’s graduating seniors
took the SAT or ACT college entrance exams.
Graduating seniors score approximately
160 points above the state average and 110 points
above the national average on the SAT.
Performance on the ACT is similar.
RRISD’s teaching assignment structure provides an opportunity to investigate the efficacy
of team-level incentives. The district organizes
middle school teachers into grade-level interdisciplinary teams that oversee the learning experiences of a group of approximately 100 to
140 students associated with the team. Team
composition changes from year to year due to
teacher turnover and administrative discretion.
Each team has at least one teacher for each core
subject of mathematics, reading/English language arts, science, and social studies. Some
teams also contain additional teachers who
focus on students with limited English proficiency and/or special education students. Team
members share a common planning time, during
which they plan future lessons, discuss their
students’ performance, confer with parents, conduct data conversations, and plan interventions
or extensions for specific students. The interdisciplinary structure of the RRISD teams may
make them more conducive to certain kinds of
knowledge transfer, such as information about
individual students, and less conducive to other
kinds, such as the sharing of subject-specific
pedagogical practices.
Over the 2 years, the study included 159
teams of teachers teaching core subjects to students in Grades 6 to 8 in nine middle schools.
These were all of the teams in those schools
during the study. There were 665 teachers on
these teams.1 Teams include language arts/reading, mathematics, science, and social studies
teachers, and some teams also include special
education teachers and specialists for students
with limited English proficiency. Not all teachers in the school are members of a team, but
most students receive instruction in core subjects from teachers who are members of a team.
Some teachers taught off-team for a small proportion of their students. For instance, mathematics teachers may have taught a section with students from two teams or a section of students who
received core instruction in other courses from
another team. Across all subjects, off-team students constituted 3% of the students taught by
teachers participating in the study. Teaching of
off-team students was most prevalent among
mathematics teachers. Sixty-nine percent of mathematics teachers taught at least one off-team student, with a median of 8, and these students
accounted for 11% of all the students their teachers taught. Off-team teaching was less common in
subjects other than mathematics. In each of those
subjects, off-team students accounted for about
1% of the students their teachers taught. In ELA,
46% of teachers taught at least one off-team student, with a median of 2. In science, these numbers were 38% and 2; and in social studies, they
were 39% and 2, respectively. These students
were not included in the calculation of team performance measures or in outcomes analyses.
There were roughly 17,383 students taught by
participating teachers. Of these, 17,307 were
determined to be part of a team, defined as receiving instruction from team teachers in at least two
of the four core subjects. In rare cases, students
received instruction from teachers of one team for
two core subjects and teachers of another team for
the other two core subjects. These students were
considered members of both teams and assigned
to their teacher’s team in each core subject.
4. The Pilot Incentive Program
The incentive program offered teachers on
selected teams the opportunity to earn a bonus
on the basis of their students’ achievement growth in the four core subjects of mathematics, reading-English language arts, science, and social studies. Teachers on teams assigned to the intervention group were notified that their team would be eligible for a bonus. The notification included a Frequently Asked Questions (FAQ) document that described the requirements for winning a bonus, including that it would depend on team students’ achievement in core subjects relative to predictions based on prior-year scores. Neither the particular tests to be used nor the exact methods for calculating teams’ performance measures were detailed.

TABLE 1
Achievement Test Used in the Performance Measures

Grade Level | Mathematics | Reading/ELA | Science | Social Studies
Six | TAKS(a) Mathematics | TAKS(a) Reading | Benchmark(b) Cycle 6 (Year 1) or Stanford 10 Science (Year 2) | Benchmark(b) Cycle 6 (Year 1) or Stanford 10 Social Studies (Year 2)
Seven | TAKS(a) Mathematics | TAKS(a) Reading | Benchmark(b) Cycle 6 (Year 1) or Stanford 10 Science (Year 2) | Benchmark(b) Cycle 6 (Year 1) or Stanford 10 Social Studies (Year 2)
Eight | TAKS(a,c) Mathematics | TAKS(a,c) Reading | TAKS(a) Science | TAKS(a) Social Studies

TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts.
a. Scale scores.
b. Total correct.
c. Scores from the first test students completed, not retest scores.
Team performance was based on a value-added measure of student performance on standardized achievement tests and district benchmark assessments. Students were tested in the
four core subjects: reading/language arts, mathematics, science, and social studies. The goal of
the performance measure was to provide an
evaluation of each team’s contribution to student learning in the core subjects.
In essence, the value-added method compared
students’ test scores in each subject area with how
they would be expected to score on these tests had
they been taught by the average performing team
for the subject area and grade level (six, seven, or
eight).2 The difference between a student’s actual
performance and his or her expected performance
was the measure of the team’s contribution to that
student’s learning for a given subject area. In each
subject area, the average of the contributions to all
the students instructed by a team teacher was the
measure of the team’s contribution for that subject
area. The overall performance measure for the
team was the average of its contributions to each
of the four subject areas. Students were associated
with a team if they received continuous instruction in two or more of the core subject areas by
team teachers during the period from the fall
snapshot date until test administration. For a more
complete description of student linkages, see
Appendix 1.
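To make the preceding description concrete, the sketch below illustrates the logic of a team value-added calculation of this general kind. It is not the district's actual formula (the complete procedure is described in Appendix 1), and the column names (grade, subject, team, score, prior_math, prior_reading) are hypothetical.

```python
import numpy as np
import pandas as pd

def team_value_added(students: pd.DataFrame) -> pd.Series:
    """Illustrative team value-added measure: within each grade and subject,
    predict each student's score from prior-year math and reading scores,
    treat the residual (actual minus expected) as the team's contribution to
    that student, average contributions within team and subject, and then
    average a team's subject-level contributions into one overall measure."""
    subject_contributions = {}
    for (grade, subject), grp in students.groupby(["grade", "subject"]):
        X = np.column_stack([np.ones(len(grp)),
                             grp["prior_math"].to_numpy(),
                             grp["prior_reading"].to_numpy()])
        beta, *_ = np.linalg.lstsq(X, grp["score"].to_numpy(), rcond=None)
        residuals = grp["score"].to_numpy() - X @ beta   # actual minus expected
        contrib = pd.Series(residuals, index=grp.index).groupby(grp["team"]).mean()
        subject_contributions[(grade, subject)] = contrib
    # Overall team measure: mean of the team's subject-level contributions.
    return pd.concat(subject_contributions, axis=1).mean(axis=1)
```

Because each regression's residuals average to zero within a grade and subject, a team's measure in this sketch is expressed relative to the average-performing team, consistent with the description above.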
Table 1 lists the assessments used to measure
student achievement for team performance calculations in each subject area. Because a common test was not available in all four subject
areas for each grade level, the performance measures used combinations of Texas Assessment of
Knowledge and Skills (TAKS) tests and either
district benchmark tests from the final testing
cycle in Grades 6 and 7, in Year 1 (2008–09), or
the Stanford Achievement Test Series, Tenth
Edition (Stanford 10) in Year 2 (2009–10). For
the TAKS and Stanford 10 tests, we used the
scale scores; however, there are no scale scores
for the benchmark tests, so we used the proportion of correct responses out of 30 items.
Measures used for testing the program’s effects
on student outcomes are discussed below.
To receive a bonus, a team’s score had to
rank in the top third of treatment group teams in
the same grade level. The rationale for designing the incentive program as a fixed tournament
was pragmatic—the cost of the incentive program
would be known in advance. If the incentive
plan had required teams to exceed a fixed threshold to receive an award, then the number of
teams winning awards could not be contained
and potential financial exposure would be
much greater.3 An incentive structure that
awards large bonuses relative to base salary and
does not fix the number of units eligible to
receive an award poses serious problems for
policy. District officials, legislators, or others
funding bonuses are reluctant to make open-ended commitments to reward all teachers
exceeding a benchmark, given the uncertain and
potentially substantial costs. A prime consideration in designing this experiment was to test a
policy with modest-size bonuses that would be
feasible for educators and school officials in
RRISD and other school districts. The fixed
tournament design offered this virtue.
A fixed tournament incentive structure, however, suffers from one well-recognized defect:
promotion of competition among teams, which
could lead to a breakdown of interteam cooperation. The consequences of within-school competition are of particular concern if teachers in
the same school are no longer willing to help
other teachers who are not on the same team. To
ensure that the structure of the incentive program would not promote competition among
teams within schools, we modified the criteria
for earning a bonus. Specifically, if a team
would have earned a bonus if another team in
the same school had not outperformed it, we
designated the nonqualifying team an additional
winner.
This solution ensured that no team close to
earning a bonus would be denied a bonus
because some other team in the same school
outperformed it. Thus, in practice, no teacher
would have reason to withhold help or cooperation from another teacher on a different team
in the same school, assuming teachers understand the procedure. While this modification
introduced a small amount of uncertainty about
the total bonuses to be paid out, this uncertainty
seemed a reasonable price to pay to promote
harmonious working relationships within schools.4
The research team also found that this provision
helped to promote buy-in among building and
district educators to permit the experiment to
take place. In short, the incentive pay plan was
designed to capitalize on the advantages of
a tournament while avoiding its potential
defect of competition that leads to reduced
collaboration.
Finally, the award criteria also included an
individual performance requirement for all
teachers on eligible teams: In order for teachers
to receive an award for their team’s performance, their individual value-added scores must
not have been statistically below average for
their grade level. This criterion was included at
the request of the funder to help reduce the
potential for the free-rider problem to occur and
to avoid rewarding individual teachers whose
students were not making progress. This provision, while individual- rather than team-based,
was aligned with the team incentive structure.
Specifically, it was designed not to induce competition within teams or incentivize teachers to
place more importance on their own performance at the expense of the team’s performance. Moreover, although this provision was
explained to teachers in the FAQ document,
the vast majority of discussion revolved around
the team aspects of the incentive program.
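The award criteria just described can be summarized in a short sketch. The data structures, the rounding of the number of winners per grade, and the exact interpretation of the same-school adjustment are assumptions on our part; the sketch is illustrative, not the program's actual code.

```python
import math
from collections import defaultdict

def select_winning_teams(treatment_teams):
    """treatment_teams: list of dicts with hypothetical keys 'id', 'school',
    'grade', and 'score' (the team value-added measure). Returns the ids of
    bonus-winning teams under the rules described above."""
    winners = set()
    by_grade = defaultdict(list)
    for team in treatment_teams:
        by_grade[team["grade"]].append(team)
    for grade_teams in by_grade.values():
        ranked = sorted(grade_teams, key=lambda t: t["score"], reverse=True)
        n_awards = math.ceil(len(ranked) / 3)   # top third of treatment teams
        winners.update(t["id"] for t in ranked[:n_awards])
        # Same-school adjustment: a team denied a bonus only because a team in
        # its own school outperformed it is designated an additional winner.
        for rank, team in enumerate(ranked[n_awards:], start=n_awards):
            outperformed_by_other_schools = sum(
                1 for u in ranked[:rank] if u["school"] != team["school"])
            if outperformed_by_other_schools < n_awards:
                winners.add(team["id"])
    # Individual teachers on winning teams receive a bonus only if their own
    # value-added score is not statistically below average for the grade.
    return winners
```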
Table 2 summarizes the bonus awards. In
2008–09, there were a total of 78 teams, half
assigned to the treatment condition. Of the 39
teams eligible for the bonuses, 14 teams received
a bonus. Awards were given to 67 individual
teachers, with 63 teachers earning the maximum
award amount and 4 receiving a prorated bonus
because they taught team students for only a fraction of their instructional workload. The full award
amount was nearly $5,500, and prorated awards
ranged as low as $3,800, with an average award
size of $5,373. Overall, nearly $360,000 was
awarded to teachers in 2008–09. Similarly, in
2009–10, there were 81 teams, 40 of which were
assigned to the treatment condition. Of those 40
teams eligible for bonus, 12 teams received one,
with 46 teachers receiving the full award and six
teachers receiving a prorated share. The full award
amount was $6,000, and prorated awards ranged
as low as $4,200, with an average award of
$5,862. The total amount awarded to teachers in
2009–10 was $304,800. Across the 2 years, there
was only one instance of a teacher on a bonus-winning team who did not receive a bonus because
the teacher’s value-added score was too low.
TABLE 2
Summary of Awards

 | Year 1 | Year 2
Number of teams, both treatment and control | 78 | 81
Number of treatment teams | 39 | 40
Number of teams receiving an award | 14 | 12
Number of individual award recipients | 67 | 52
Number of individual award recipients receiving full award | 63 | 46
Number of individuals on award-winning teams with value-added scores below average (no bonus) | 0 | 1
Amount of full award | $5,446 | $6,000
Average award | $5,373 | $5,862
Amount of smallest prorated award | $3,812 | $4,200
Total amount of awards | $359,981 | $304,800

5. Methods

In each year of the study, teams were randomized to either the bonus intervention or
control condition using a block-randomized
design. Blocks were defined by grades within
school. Within each block, there were multiple
teams. When there was an even number of
teams, half the teams in each block were randomized to treatment and half to control. In
blocks with three teams (no blocks had more
than four teams), two teams were randomly
assigned to treatment or control and the
remaining team was assigned to the other condition. The randomizations were constrained
so that the number of treatment and control
teams was balanced at each grade level. The
district and schools provided rosters of teachers on each team and the teachers were notified of their team’s treatment assignment in
early October.
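A minimal sketch of the block randomization is shown below, assuming a mapping from (school, grade) blocks to team identifiers. The additional constraint that treatment and control counts balance within each grade level across schools is noted but, for brevity, not enforced here.

```python
import random

def randomize_teams(blocks, seed=None):
    """blocks: dict mapping (school, grade) -> list of team ids.
    Within each block, half of the teams are assigned to treatment and half
    to control; in an odd-sized block the extra team's condition is chosen at
    random. (The study additionally constrained the draws so that treatment
    and control team counts were balanced within each grade level.)"""
    rng = random.Random(seed)
    assignment = {}
    for team_ids in blocks.values():
        teams = list(team_ids)
        rng.shuffle(teams)
        n_treatment = len(teams) // 2
        if len(teams) % 2 == 1 and rng.random() < 0.5:
            n_treatment += 1
        for i, team in enumerate(teams):
            assignment[team] = "treatment" if i < n_treatment else "control"
    return assignment
```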
Students were assigned to classes by the
schools and student outcomes are analyzed
according to the team’s assignment. Because
assignment to experimental conditions occurred
after the late-August start of school, nearly all
students had their class assignments prior to
teams knowing their experimental condition.
Table 3 shows group means for treatment and
control group students on the available demographic and achievement measures, pooled
across years. The randomization produced good
balance between groups on most of these characteristics; however, there are significant differences in the percentage of limited English proficiency students, which is relatively low in both
groups but higher in the treatment group (4.7%
versus 3.1% of students), and the percentage of
talented and gifted students, which is also
somewhat low in both groups but lower in the
treatment group (8.8% versus 13.7%).5 The
table is based on team assignments for ELA.
Students assigned to teams for other subjects
can differ somewhat from those on the teams
for ELA instruction; however, balance for math,
science, and social studies is similar, with the
same two characteristics showing significant
group differences. Despite this relatively good
balance overall, when looking at individual
years or grade levels, more of the covariates
show significant group differences. All outcomes models adjust for all of the covariates
listed in Table 3 to remove variability created
by the potential imbalances occurring in spite
of randomization.
Measures
The study tested the effects of the bonus program on student achievement and teacher attitudes, perceptions, and practices about the school
environment, so as to capture a broad spectrum
of potential pathways through which the pilot
incentive program might influence student
achievement.
Student achievement. For evaluating the effect of
the bonus program on student outcomes, the
study used both the TAKS and Stanford 10
measures. The TAKS is the state’s high-stakes
accountability test, administered in the spring.6
As a supplemental measure, the district
administered the Stanford 10 specifically for this
project in late May.

TABLE 3
Assessment of Student Balance Between Treatment and Control Groups on ELA Teams

Covariate | Control (n = 8,592) | Treatment (n = 8,744) | Group Difference | Standard Error | p Value
Prior-year TAKS math score | 0.073 | 0.039 | –0.035 | 0.037 | 0.348
Prior-year TAKS reading score | 0.081 | 0.023 | –0.058 | 0.035 | 0.100
Percent female | 49.7% | 50.1% | 0.003 | 0.007 | 0.678
Percent racial/ethnic minority | 46.7% | 48.1% | 0.015 | 0.010 | 0.120
Percent limited English proficiency | 3.1% | 4.7% | 0.016 | 0.007 | 0.026
Percent economically disadvantaged | 25.6% | 26.5% | 0.009 | 0.010 | 0.392
Percent designated at-risk by district | 23.2% | 24.6% | 0.014 | 0.011 | 0.184
Percent special education | 5.3% | 5.1% | –0.002 | 0.007 | 0.785
Percent talented and gifted | 13.7% | 8.8% | –0.049 | 0.024 | 0.046

A joint test of significance of the group differences on these covariates produced a p value of 0.076 (calculated with a permutation test). TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts.

As discussed above and
shown in Table 1, the TAKS math and reading
exams are administered in Grades 6 through 8,
but the TAKS science and social studies tests are
administered only in eighth grade. The Stanford
10 provides measures in all of the core subject
areas in all three grades. For the outcomes
analysis, we examined the available TAKS and
Stanford 10 scores at each grade level and
subject area. For outcomes analysis, student
TAKS and Stanford 10 normal curve equivalent
scores were standardized to have mean 0 and
standard deviation 1 within our sample.
Survey measures. NCPI administered two surveys
each year to all of the teacher participants in the
study. The survey addressed attitudes about pay
for performance, attitudes about the study, self-efficacy, collegiality, academic press (the extent
to which teachers have high expectations for
their students), team dynamics, parent engagement,
and practices, including emphasis on standards,
hands-on learning, use of tests, homework, test
preparation, hours of work, and professional
development. Each item requested the
respondent to select from a four-choice or six-choice scale or to provide a numeric response.
Most of the items were the same in the two
surveys each year, and, for the most part, the
surveys were the same both years.
Survey administration. The first study year,
teachers were notified about the two surveys via
email in February and May of 2009. The email
contained information about survey content, an
explanation of how teacher confidentiality
would be protected, and a link to the survey. The
email also described the $150 stipend that was
offered upon completion of each survey.
Teachers were then able to take the survey
online and at their convenience. Email reminders
were sent to teachers who had not participated
in the survey, with a few teachers receiving as
many as five reminders. We repeated these
procedures in the second year of the study,
administering the surveys in November 2009
and April 2010.
For the purposes of this article, we focus on
the responses obtained during the spring survey
administration (May 2009 and April 2010)
because they provide a measure of teacher attitudes, practices, and perceptions near the end of
each study year. Participation in these surveys
was strong. During Year 1, 93% of control
group teachers and 99% of treatment teachers
participated. Similar response rates were
observed in Year 2, where 93% of control group
teachers and 96% of treatment teachers participated.
Across the 2 years of the study, a total of 441
teachers responded to the surveys. Of those, 83
teachers were in the treatment group both years,
79 teachers were in the control group both
years, 76 teachers were in the experiment for
only 1 year and assigned to the treatment group,
79 teachers were in the experiment for only 1
year and assigned to the control group, and the
375
Downloaded from http://eepa.aera.net at VANDERBILT UNIVERSITY LIBRARY on February 6, 2015
Springer et al.
remaining 123 teachers were in the experiment
for both years and were assigned to the treatment group one year and the control group the
other year.
Among survey respondents, the distribution
of teacher qualifications was similar each year.
Responding teachers averaged 10 years of
teaching experience, and 28% had a master’s
degree or higher. Twenty-one percent taught
English/language arts, 24% taught math, 22%
taught science, 20% taught social studies, and
the remaining 13% taught another subject.
Development of survey scales. The surveys
assessed various practices that teachers may be
likely to change as a result of being eligible
to receive a bonus. These include instructional
practices, engagement in professional devel­
opment, efforts to involve parents, and teachers’
collaboration with their team and same-subject
teachers. We also measured contextual factors
that may affect their team’s dynamics. In
addition, the surveys included several items
about teachers’ perceptions and understanding
of the intervention.
We created scales from the survey responses
to measure key constructs related to teachers’
attitudes, team dynamics, instructional practices, self-improvement efforts, parent engagement activities, professional development, and
perceptions of principal leadership. In Year 1,
we created 13 composite scales by combining
responses across multiple items. To create the
composite scales, we reviewed each of the survey questions, computed descriptive statistics
for all item-level responses (including examining full distributions), and conducted exploratory factor analyses where appropriate. The
scales were constructed by calculating the average of the responses on the items on the 4- to
6-point scale. We also administered additional
items on the Year 2 spring survey, enabling us
to create an additional scale on principal leadership. The 14 composite scales are shown in
Table 4 along with their internal consistency
reliability coefficients.
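The scale construction and the reliability coefficients reported in Table 4 follow standard formulas; a brief sketch is given below (the respondents-by-items matrix and its layout are hypothetical).

```python
import numpy as np

def composite_scale(items: np.ndarray) -> np.ndarray:
    """Composite scale score for each respondent: the mean of that
    respondent's responses across the items in the scale."""
    return np.asarray(items, dtype=float).mean(axis=1)

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha (internal-consistency reliability) for a
    respondents-by-items matrix of Likert responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)
```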
In addition to the composite scales, we created 17 item-level scales. Four items asked
teachers to report the number of hours they
engaged in an activity, and the scales represent
the average number of hours reported. The other
13 items pertain to perceptions and understanding
of the intervention, 7 of which were administered to treatment teachers only. For these 13
items, we created scales by dichotomizing the
4-point Likert-type responses (strongly agree/
agree vs. disagree/strongly disagree) and calculating the percentage of teachers who endorsed
the item. The 17 item-level scales are also shown
in Table 4.

TABLE 4
Survey Scales and Reliability Coefficients for Composite Scales

Scale | Alpha
Group dynamics
  Collaboration among same-subject teachers | 0.83
  Hours spent meeting with same-subject teachers | —
  Collaboration among team teachers | 0.78
  Hours spent meeting with team teachers | —
  Quality of team dynamics | 0.83
Instructional practices
  Change from prior year in classroom emphasis on state standards and tests | 0.83
  Change from prior year in emphasis on hands-on activities and having students work in groups | 0.89
  Importance teachers place on test-preparation activities | 0.82
  Importance of student scores on state tests and benchmark assessments to guide instruction | 0.74
  Importance placed on student performance on classroom work and homework to guide instruction | 0.52
  Use of test scores for making instructional decisions | 0.86
  Frequency with which teachers incorporate Texas state standards into instructional planning | 0.45
  Number of hours worked outside of formal school hours | —
Professional development and self-improvement efforts
  Teachers’ use of student test scores to help improve their own practice | 0.73
  Frequency of professional development activities related to collaborative aspects of teaching | 0.69
  Total amount of time spent in professional development | —
Parent engagement
  Efforts to engage parents | 0.68
Principal leadership
  Principal leadership (Year 2 only) | 0.78
Perceptions of the intervention
  The intervention provides feedback about team’s effectiveness | —
  The intervention should include non-core subjects | —
  The intervention has caused resentment among teachers | —
  The intervention distinguishes effective from ineffective teams | —
  The intervention has had negative effects on my school | —
  The intervention forced teachers to teach in a certain way | —
  The intervention energized me to improve my teaching (treatment only) | —
  The intervention will not affect my teaching (treatment only) | —
  I have a clear understanding of performance criteria (treatment only) | —
  The bonus is too small (treatment only) | —
  The Frequently Asked Questions document answered my questions (treatment only) | —
  Not winning a bonus will have a negative effect on my team’s teaching evaluations (treatment only) | —
  The intervention uses a fair method of awarding bonus (treatment only) | —

Coefficient alpha is shown for the composite survey scales; it was calculated using responses across treatment and control groups.
Student Outcomes Analysis
We fit a hierarchical linear model to estimate
and test the intervention effect on student
achievement. To improve the precision of the
estimates, the model includes individual student
and team aggregate pre-treatment variables. As
shown in Equation (1), Level 1 models individual student outcomes as a function of a team
component and pre-intervention student variables including prior TAKS mathematics and
reading/ELA achievement scores and the following student demographic indicators provided by the district: gender, race/ethnicity,
limited English proficiency, economically disadvantaged, at risk for academic failure,7 special
education, and talented and gifted. About 9% of
students were missing both the prior year TAKS
scores, about 4% had a prior year reading score
but not a mathematics score, and about 2% had
a prior year mathematics score but not a reading
score. For students with incomplete prior test
scores, we set the missing value to zero and
included in the model indicators for the four patterns of observed prior year scores (both scores
observed, no scores observed, reading but not
mathematics observed, and mathematics but not
reading observed). We also included interactions between the pattern indicators and the
prior scores and demographic variables. All
students had complete demographic data.
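The handling of missing prior-year scores described above amounts to a dummy-variable adjustment. A sketch follows, with hypothetical column names; it omits the interactions between the pattern indicators and the other covariates.

```python
import pandas as pd

def add_prior_score_patterns(df: pd.DataFrame) -> pd.DataFrame:
    """Set missing prior TAKS scores to zero and add indicators for the four
    patterns of observed prior-year scores (both, neither, reading only,
    mathematics only)."""
    df = df.copy()
    has_math = df["prior_math"].notna()
    has_reading = df["prior_reading"].notna()
    df["obs_both"] = (has_math & has_reading).astype(int)
    df["obs_none"] = (~has_math & ~has_reading).astype(int)
    df["obs_reading_only"] = (~has_math & has_reading).astype(int)
    df["obs_math_only"] = (has_math & ~has_reading).astype(int)
    df[["prior_math", "prior_reading"]] = (
        df[["prior_math", "prior_reading"]].fillna(0.0))
    return df
```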
Level 2 models the team component as a
function of an indicator for the intervention
group, the randomization block, aggregate student pre-intervention TAKS mathematics and
reading scores, and a random component for
team. Let $y_{ij}$ equal the standardized score on a given test for student $j = 1, \ldots, n_i$ of team $i = 1, \ldots, m$, where $m = 78$ (Year 1) or 80 (Year 2); $x_{ij}$ equals a $K$-vector of student pre-intervention variables including prior year TAKS reading and mathematics scores and student demographic characteristics; then Level 1 of our model is:
$$y_{ij} = \mu + \theta_i + x_{ij}'\beta_i + \varepsilon_{ij}, \quad (1)$$

where $\theta_i$ is the team component and $\varepsilon_{ij}$ are independent normally distributed residual errors with variance that depends on the pattern of observed scores. Level 2 of the model is:

$$\theta_i = T_i\delta + z_{ij}'\eta + \sum_{g=6}^{8} \gamma_g u_{ig} + \sum_{b=1}^{B} \lambda_b v_{ib} + \zeta_i, \quad (2)$$

$$\beta_{ik} = \beta_k, \quad k = 1, \ldots, K,$$
where $z_{ij}$ equals a vector of team average prior reading and mathematics test scores, $u_{ig}$ equals 1 if the team's students are enrolled in grade $g$ and 0 otherwise, and $v_{ib}$ equals 1 if the team is in randomization block $b$ and 0 otherwise. $\zeta_i$ is a team-specific random effect to allow for correlation among the outcomes of students on the same team. $T_i$ is a treatment indicator, and the coefficient $\delta$ is a measure of the effect of the bonus intervention on student achievement. Tests of the null hypothesis that $\delta$ equals 0 will test for the intervention effect.
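For readers who want to fit a model of this form, the sketch below estimates a simplified version of Equations (1) and (2) with the statsmodels package. It uses hypothetical column names, treats the block fixed effects as subsuming the grade indicators, and, unlike the model described above, does not allow the residual variance to depend on the pattern of observed prior scores.

```python
import statsmodels.formula.api as smf

def fit_student_outcome_model(df):
    """Two-level model: student scores regressed on treatment status, block
    fixed effects, team-average prior scores, and student covariates, with a
    random intercept for team (the team component)."""
    formula = (
        "score ~ treatment + C(block)"
        " + team_prior_math + team_prior_reading"
        " + prior_math + prior_reading"
        " + female + minority + lep + econ_dis + at_risk + special_ed + gifted"
    )
    result = smf.mixedlm(formula, data=df, groups=df["team"]).fit()
    # The coefficient on `treatment` estimates delta, the bonus effect.
    return result.params["treatment"], result.bse["treatment"]
```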
The primary model includes a single overall
intervention effect for all three grades. In secondary models, we examine separate effects by
grade. Results from secondary models are consistent with the primary models and not reported
in this article.
We used model-based Wald tests to test the
null hypothesis of no treatment effects. We also
use permutation tests (Efron & Tibshirani, 1993)
to test the null hypothesis. The permutation test
is an alternative approach for calculating the
probability of obtaining the observed effects by
chance if the null hypothesis of no effect is true.
The permutation test does not rely on the model assumptions required by the Wald test, and it ensures that our conclusions are not sensitive to those
assumptions. To conduct the permutation test,
we randomly reassign the treatment assignment
indicators to teams following the randomization
design, to simulate the outcomes of the experiment with alternative realizations of the randomization under the null hypothesis. We repeat
this process 2,000 times, and for each resampled
data set, we estimate the treatment effect using
the same model as used for model-based estimates. The p value for the test of the null
hypothesis is the proportion of times the result
from the resampled data equals or exceeds the
observed result in absolute value.
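A schematic version of this permutation procedure is sketched below; the functions that fit the outcomes model and that re-draw the blocked random assignment are assumed to be supplied by the caller.

```python
import numpy as np

def permutation_p_value(data, estimate_effect, rerandomize, n_reps=2000, seed=0):
    """estimate_effect(data) returns the model-based treatment-effect
    estimate; rerandomize(data, rng) returns a copy of the data with team
    treatment labels re-drawn according to the block-randomization design.
    The p value is the share of re-randomizations whose estimated effect is
    at least as large in absolute value as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(estimate_effect(data))
    null_draws = [abs(estimate_effect(rerandomize(data, rng)))
                  for _ in range(n_reps)]
    return float(np.mean(np.asarray(null_draws) >= observed))
```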
We conducted separate analyses for each
student achievement outcome measure. For
each subject, the analysis includes only students
who were members of the participating teams
and taught by the team teacher in that subject.
For instance, if a student was taught by the
English, science, and social studies teachers
from the same team but not the mathematics
teacher on that team, the student would be on
the team and included in the analyses for
English, science, and social studies, but not
mathematics. As discussed above, the criterion for inclusion of a student on a team is that the student was taught at least two core subjects by team
teachers. So, for example, if the student was
only taught by a team teacher for mathematics
but not for any other subject, the student would
not be included in any of the analyses.
Analyses were conducted separately for each
year. A precision-weighted combination of the
2 years was used to estimate the overall effect of
treatment, with p values calculated using permutation tests as described above. An additional
analysis examines effects on students across
both years of the study according to their pattern
of treatment. This restricts the sample to students who were in the study schools both years
(sixth or seventh graders in the first year of the
study). Depending on each student’s assignment
to a team and the team’s random assignment to
treatment condition, each student experienced
one of four patterns of treatment: control group
both years, control group in Year 1 and treatment group in Year 2, treatment group in Year 1
and control group in Year 2, or treatment group
in both years. This analysis compares the three
groups of students who were on treatment teams
at least once to those who were in the control
group both years. This analysis uses a joint test
of significance using a three-degree-of-freedom
chi-square test.
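The precision-weighted (inverse-variance) combination of the two years can be written in a few lines. The sketch below, applied to the Year 1 and Year 2 TAKS mathematics estimates in Tables 6 and 7, reproduces (to rounding) the pooled estimate of 0.007 reported in Table 5.

```python
import numpy as np

def precision_weighted_combination(effects, standard_errors):
    """Combine year-specific effect estimates using inverse-variance weights.
    Returns the pooled estimate and its standard error."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(standard_errors, dtype=float) ** 2
    pooled = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

# Example: TAKS math, Year 1 effect 0.016 (SE 0.025) and Year 2 effect -0.007
# (SE 0.031) pool to approximately 0.007.
# precision_weighted_combination([0.016, -0.007], [0.025, 0.031])
```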
Because we had multiple outcomes and multiple grade levels to test effects, the possibility of
one or more significant effects by chance (i.e., a
significant effect when there is no real intervention effect) is greater than 5%. We estimated
adjusted p values to account for the multiple testing in a group of tests, such as tests for three grade
levels on a given test or across subject areas for
the pooled sample. The adjusted p value equals
the proportion of simulated experiments in which any
one of multiple test statistics exceeded each of the
observed statistics from the true outcomes data.
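In code, this adjustment operates on the same re-randomized draws used for the single-test permutation p values; a sketch, with hypothetical array shapes, follows.

```python
import numpy as np

def adjusted_p_values(observed_stats, null_stats):
    """observed_stats: absolute test statistics for a family of related tests
    (for example, the three grade levels for one outcome). null_stats: an
    n_reps-by-n_tests array of the same statistics from the simulated
    experiments. Each adjusted p value is the share of simulated experiments
    in which ANY statistic in the family equals or exceeds the observed one."""
    observed = np.abs(np.asarray(observed_stats, dtype=float))
    family_max = np.abs(np.asarray(null_stats, dtype=float)).max(axis=1)
    return np.array([np.mean(family_max >= stat) for stat in observed])
```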
Teacher Outcomes Analyses
We pooled responses across the two spring
survey administrations and then conducted two
sets of analyses of the survey data. The first set
of analyses examined the differences between
the treatment and control groups in their opinions and practices. We used a hierarchical linear
model similar to the student achievement model
discussed above to compare results between the
two groups. Level 1 models individual teacher
survey responses as a function of a team component and several covariates, including years of teaching experience, an indicator for whether the teacher had a master’s degree or higher, indicators for whether the teacher was an English language arts, math, science, or social studies teacher, and a teacher-specific residual
error. Level 2 models the team components as a
function of the team’s intervention status, fixed
effects for the blocks within which teams were
randomly assigned to interventions, and random
effects for team, which accommodates the fact
that responses from teachers on the same team
may be correlated and provides accurate inferences given the cluster-randomized block design
used in the experiment. For teachers’ responses
on the dichotomously scored scales (relating
to perceptions and understanding of the intervention), we used an analogous approach but
employed logistic regression.
Given the relatively large number of statistical tests conducted, we again used permutation
tests (described above) to adjust for multiple
comparisons. Statistical significance is determined on the basis of the adjusted p values from
these permutation tests.
Our second set of analyses examined differences in the attitudes and practices between
treatment teachers who earned and did not earn
a bonus. We used a regression analysis that
included random effects for grades within
schools and controlled for the clustering of
teachers within the same teams. We also used
this approach to examine how treatment teachers’ responses changed from Year 1 to Year 2 in
relation to earning or not earning a bonus.
To adjust for multiple comparisons in this
second set of analyses, which examines the
treatment group only (thus, for which the permutation test is not applicable), we adjusted
p values using a false discovery rate (FDR) procedure (Benjamini & Hochberg, 1995). An FDR
is the expected proportion of statistical tests that
report significant relationships when no such
relationship actually exists. Applying the procedure with a FDR of 0.05 led to rejecting the null
hypothesis of zero effects only if p values were
less than .0016.
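For reference, a sketch of the Benjamini-Hochberg step-up rule that generates a data-dependent cutoff of this kind is given below.

```python
import numpy as np

def benjamini_hochberg_cutoff(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure: sort the p values, find the
    largest index i with p_(i) <= (i / m) * fdr, and reject every hypothesis
    with a p value at or below that cutoff. Returns the cutoff (0.0 if no
    hypothesis is rejected)."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    m = len(p_sorted)
    below = p_sorted <= fdr * np.arange(1, m + 1) / m
    if not below.any():
        return 0.0
    return float(p_sorted[np.nonzero(below)[0].max()])
```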
6. Results
Student Outcomes
Analysis of student achievement outcomes
reveals no overall intervention effect in any
subject area across the 2 years of the experiment. The effects, displayed in Table 5, were
estimated through a precision-weighted combination of Year 1 and Year 2 analyses (shown
below) and tested with permutation tests. The
effect size estimates are very small in each subject area. Results are very similar when looking
at individual years—Table 6 shows Year 1
results and Table 7 shows Year 2 results. Again,
the effect sizes are typically very small with
small standard errors. Finally, Table 8 shows the
results of the 2-year analysis of student outcomes based on their pattern of being taught by
treatment or control teams. The table shows
effect estimates for the three groups of students
who were taught by treatment teams at least
once during the experiment, comparing their
outcomes to students who were taught by control teams both years. Like the other results, the
treatment effect estimates are very small and not
significant. In particular, there is no evidence of
the emergence of an effect on the scores of
students who were taught by treatment teams
both years.

TABLE 5
Student Outcome Results, Weighted Combination of Years 1 and 2

Subject Area | Exam | N Students | N Teams | Standardized Effect Size | p Value
Math | TAKS | 13,359 | 159 | 0.007 | 0.699
Math | SAT10 | 12,339 | 159 | –0.008 | 0.704
Reading/ELA | TAKS | 16,594 | 159 | –0.002 | 0.884
Reading/ELA | SAT10 reading | 15,460 | 159 | 0.009 | 0.554
Reading/ELA | SAT10 language | 15,026 | 159 | –0.001 | 0.962
Science | SAT10 | 15,531 | 159 | 0.026 | 0.192
Social studies | SAT10 | 15,648 | 159 | 0.038 | 0.112

ELA = English language arts; SAT10 = Stanford Achievement Test Series, Tenth Edition; TAKS = Texas Assessment of Knowledge and Skills.
Comparisons Between Treatment
and Control Teachers
In this section, we first compare treatment
and control teachers’ attitudes and perceptions
about the intervention and their practices.
TABLE 6
Student Outcome Results, Year 1

Subject Area | Exam | N Students | N Teams | Standardized Effect Size | Standard Error | p Value
Math | TAKS | 6,850 | 78 | 0.016 | 0.025 | 0.486
Math | SAT10 | 6,299 | 78 | –0.012 | 0.024 | 0.593
Reading/ELA | TAKS | 8,051 | 78 | 0.006 | 0.021 | 0.754
Reading/ELA | SAT10 reading | 7,507 | 78 | –0.006 | 0.021 | 0.787
Reading/ELA | SAT10 language | 7,507 | 78 | –0.007 | 0.056 | 0.888
Science | SAT10 | 7,454 | 78 | 0.019 | 0.032 | 0.556
Social studies | SAT10 | 7,565 | 78 | 0.031 | 0.034 | 0.361

SAT10 = Stanford Achievement Test Series, Tenth Edition; TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts.
TABLE 7
Student Outcome Results, Year 2

Subject Area | Exam | N Students | N Teams | Standardized Effect Size | Standard Error | p Value
Math | TAKS | 6,509 | 81 | –0.007 | 0.031 | 0.808
Math | SAT10 | 6,040 | 81 | 0.002 | 0.038 | 0.951
Reading/ELA | TAKS | 8,543 | 81 | –0.010 | 0.020 | 0.583
Reading/ELA | SAT10 reading | 7,953 | 81 | 0.025 | 0.022 | 0.229
Reading/ELA | SAT10 language | 7,519 | 81 | 0.000 | 0.024 | 0.993
Science | SAT10 | 8,077 | 81 | 0.032 | 0.027 | 0.219
Social studies | SAT10 | 8,083 | 81 | 0.045 | 0.034 | 0.164

ELA = English language arts; TAKS = Texas Assessment of Knowledge and Skills; SAT10 = Stanford Achievement Test Series, Tenth Edition.
TABLE 8
Student Outcome Results, Two-Year Effects By Pattern of Treatment

Subject Area | Exam | N Students | N Teams | Treatment Teams in Year 1 Only: Effect Size (SE) | Treatment Teams in Year 2 Only: Effect Size (SE) | Treatment Teams Both Years: Effect Size (SE) | p Value
Math | TAKS | 3,274 | 54 | 0.049 (0.034) | –0.019 (0.048) | 0.020 (0.048) | 0.330
Math | SAT10 | 3,050 | 54 | 0.015 (0.039) | –0.011 (0.057) | –0.004 (0.057) | 0.968
Reading and ELA | TAKS | 4,917 | 54 | –0.029 (0.028) | –0.040 (0.036) | –0.022 (0.030) | 0.035*
Reading and ELA | SAT10 reading | 4,604 | 54 | –0.008 (0.028) | –0.040 (0.036) | –0.002 (0.036) | 0.500
Reading and ELA | SAT10 language | 4,428 | 54 | 0.011 (0.033) | –0.040 (0.035) | 0.000 (0.036) | 0.488
Science | SAT10 | 4,644 | 54 | 0.003 (0.032) | –0.018 (0.047) | 0.001 (0.047) | 0.925
Social studies | SAT10 | 4,754 | 54 | 0.065 (0.034) | 0.025 (0.038) | 0.036 (0.038) | 0.332

ELA = English language arts; TAKS = Texas Assessment of Knowledge and Skills; SAT10 = Stanford Achievement Test Series, Tenth Edition.
* Significance does not survive adjustment for multiple comparisons.
TABLE 9
Comparison of Control and Treatment Teachers’ Attitudes, Perceptions, and Behaviors

Dependent Variable | Control N | Control M | Control SD | Treatment N | Treatment M | Treatment SD | Standardized Effect Size | Standard Error
Group dynamics
  Collaboration among same-subject teachers | 346 | 2.50 | 0.60 | 353 | 2.54 | 0.58 | 0.08 | 0.07
  Hours spent meeting with same-subject teachers | 344 | 4.19 | 3.54 | 353 | 4.20 | 4.36 | –0.02 | 0.07
  Collaboration among team teachers | 345 | 2.60 | 0.53 | 355 | 2.65 | 0.56 | 0.09 | 0.08
  Hours spent meeting with team teachers | 346 | 4.92 | 4.24 | 355 | 4.85 | 4.84 | 0.02 | 0.07
  Quality of team dynamics | 345 | 4.35 | 0.65 | 355 | 4.32 | 0.68 | –0.05 | 0.11
Instructional practices
  Change in classroom emphasis on state standards and tests | 339 | 3.48 | 0.55 | 349 | 3.50 | 0.57 | 0.00 | 0.08
  Change in emphasis on hands-on activities and having students work in groups | 339 | 3.64 | 0.69 | 349 | 3.60 | 0.72 | –0.02 | 0.08
  Importance teachers place on test-preparation activities | 346 | 3.27 | 0.58 | 355 | 3.31 | 0.57 | 0.05 | 0.07
  Importance of student scores on state tests and benchmark assessments to guide instruction | 346 | 3.04 | 0.74 | 355 | 3.10 | 0.72 | 0.10 | 0.08
  Importance placed on student performance on classroom work and homework to guide instruction | 346 | 3.71 | 0.48 | 355 | 3.70 | 0.45 | –0.01 | 0.08
  Use of test scores for making instructional decisions | 345 | 3.32 | 0.52 | 353 | 3.29 | 0.53 | –0.08 | 0.08
  Frequency with which teachers incorporate Texas state standards into instructional planning | 345 | 5.09 | 0.73 | 353 | 4.97 | 0.81 | –0.22 | 0.07
  Number of hours worked outside of formal school hours | 346 | 11.80 | 7.68 | 352 | 12.19 | 8.27 | 0.08 | 0.08
Professional development and self-improvement efforts
  Teachers’ use of student test scores to help improve their own practice | 343 | 2.12 | 0.79 | 348 | 2.10 | 0.83 | –0.02 | 0.08
  Frequency of professional development activities related to collaborative aspects of teaching | 346 | 3.21 | 0.90 | 355 | 3.24 | 0.86 | 0.06 | 0.07
  Total amount of time spent in professional development | 341 | 42.17 | 36.72 | 350 | 43.63 | 42.06 | 0.04 | 0.08
Parent engagement
  Efforts to engage parents | 346 | 2.42 | 0.52 | 354 | 2.51 | 0.50 | 0.22 | 0.07
Principal leadership
  Principal leadership (Year 2 only) | 172 | 3.17 | 0.58 | 174 | 3.22 | 0.59 | 0.05 | 0.18
Perceptions of the intervention
  The intervention provides feedback about team’s effectiveness | 346 | 0.38 | 0.49 | 352 | 0.41 | 0.49 | 0.19 | 0.21
  The intervention should include noncore subjects | 344 | 0.60 | 0.49 | 349 | 0.55 | 0.50 | –0.25 | 0.21
  The intervention has caused resentment among teachers | 342 | 0.34 | 0.47 | 351 | 0.29 | 0.45 | –0.28 | 0.27
  The intervention distinguishes effective from ineffective teams | 343 | 0.12 | 0.33 | 354 | 0.09 | 0.28 | –0.33 | 0.29
  The intervention has had negative effects on my school | 343 | 0.26 | 0.44 | 351 | 0.22 | 0.42 | –0.44 | 0.18
  The intervention forced teachers to teach in a certain way | 342 | 0.13 | 0.34 | 351 | 0.07 | 0.25 | –0.83 | 0.18

None of the treatment effect estimates is significant after adjustment for multiple comparisons.
Table 9 presents descriptive statistics for the treatment and control groups, along with estimates of the differences between groups on these measures.
To summarize the results in the table, treatment
and control group teachers responded similarly
on all of the survey scales, and no statistically
significant differences were detected between
groups after adjustment for multiple hypothesis
testing.
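The comparisons in this section rest on two procedures referenced in the text and notes: permutation tests for group differences and an adjustment for multiple hypothesis testing (the reference list includes Benjamini and Hochberg, 1995). The sketch below is not the authors' code; it assumes a simplified, unblocked design, a difference-in-means test statistic, and hypothetical function names, but it illustrates how the two procedures fit together.

```python
# A minimal sketch (not the authors' code) of a team-level permutation test combined
# with Benjamini-Hochberg false discovery rate control across survey scales.
# Assumptions: treatment is randomized at the team level without blocking, and the
# test statistic is a simple difference in teacher-level means; the study itself
# permuted within randomization blocks and used model-based estimates.
import numpy as np

def permutation_p_value(scores, team_ids, treated_teams, n_perm=10000, seed=0):
    """Two-sided permutation p value for a treatment-control difference in means."""
    scores = np.asarray(scores, dtype=float)
    team_ids = np.asarray(team_ids)
    teams = np.unique(team_ids)
    rng = np.random.default_rng(seed)

    def mean_diff(treated):
        is_treated = np.isin(team_ids, list(treated))
        return scores[is_treated].mean() - scores[~is_treated].mean()

    observed = mean_diff(set(treated_teams))
    n_treated = len(set(treated_teams))
    null = np.array([
        mean_diff(set(rng.choice(teams, size=n_treated, replace=False)))
        for _ in range(n_perm)
    ])
    return float(np.mean(np.abs(null) >= abs(observed)))

def benjamini_hochberg(p_values, q=0.05):
    """Boolean array marking which of the m hypotheses are rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * (np.arange(1, m + 1) / m)
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])  # largest k with p_(k) <= q * k / m
        reject[order[: cutoff + 1]] = True
    return reject
```

The second step corresponds to the "adjustment for multiple hypothesis testing" mentioned above; whether the adjustment used in the study was this exact procedure or another correction, the logic of screening the full set of scale-level p values jointly is the same.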
The bonus intervention had no effects on
perceptions of team dynamics. Groups reported
similar levels of collegiality and hours spent
collaborating with teachers on and off their
teams. Both groups reported similarly high
scores with respect to the perceived quality of
team dynamics.
Similarly, the perceptions of the intervention
did not differ between the experimental groups.
Regardless of treatment condition, less than one
quarter of teachers believed that the intervention
had negative effects on their school and about
one tenth believed it forced teachers to teach in a
certain way. Furthermore, less than one third of
teachers believed the intervention caused resentment among teachers. There was also evidence
that teachers were not entirely supportive of the
intervention. Only 38% of control teachers and
41% of treatment teachers believed that the
intervention could provide feedback about their
team’s effectiveness. The percentage of teachers
who believed that the intervention could distinguish between effective and ineffective teachers
was particularly low, with 12% of control teachers and 9% of treatment teachers endorsing that
item. Furthermore, the majority of teachers in
both the treatment and control groups believed
that the intervention should include other noncore subject teachers (e.g., music).
Taken together, the majority of teachers did
not believe the intervention had negative effects
on their school or on their attitudes, but nonetheless they were skeptical that the intervention
could provide useful information about teaching
effectiveness. In addition, teachers felt that the
intervention could be improved by including
other noncore subject teachers in the calculation
of team bonuses.
Comparisons of Treatment Teachers Who
Earned a Bonus to Those Who Did Not
Another important aspect of the study is
exploring whether there were differences in
attitudes, perceptions, and practices among the
treatment teachers who would ultimately win
or not win a bonus. We also examine how those
measures changed after teachers were informed
of Year 1 bonus results. These analyses are
nonexperimental and should be interpreted
with caution.
Mean Differences in Attitudes,
Perceptions, and Practices
In this section, we compare the survey
responses of teachers who would go on to win
a bonus with the survey responses of teachers
who would not go on to win a bonus. At the time
teachers responded to the survey, they did not
know whether or not they would win a bonus for
that academic year. Table 10 shows descriptive
statistics for the bonus winners and nonwinners
along with estimates of the differences between
groups. A positive coefficient indicates that the
treatment teachers who earned a bonus had higher
scores on the scale, while a negative coefficient
indicates the teachers who did not earn a bonus
had higher scores.
None of these differences were significant
after adjustment for multiple tests. One trend
worth noting was that teachers who would go on
to win a bonus tended to be less likely than
teachers who did not win a bonus to emphasize
standardized tests. Namely, relative to teachers
who won a bonus, teachers who did not win a
bonus reported that they put more emphasis on
TAKS test preparation (including practicing
test-taking skills and using TAKS preparation
materials) and on using scores from the TAKS
and district benchmark tests to guide their
instruction.
Change in Responses Over
Time by Bonus Status
Another important analysis is to examine
how treatment teachers may respond to receiving or not receiving a bonus. Specifically, we
examined how teachers’ responses in Year 1,
which were obtained before teachers were
informed of their bonus status, compared to
their responses in Year 2, which were obtained
after they were informed of their Year 1 bonus
status.8 Table 11 shows changes in survey
responses from Year 1 to Year 2 for teachers
TABLE 10
Treatment Teachers' Attitudes, Perceptions, and Practices by Whether or Not They Would Ultimately Be Awarded a Bonus at the End of the Year

                                                                                  Would Ultimately        Would Not Ultimately
                                                                                  Earn Bonus              Earn Bonus             Standardized   Standard
Dependent Variable                                                                N     M      SD         N     M      SD        Effect Size    Error
Group dynamics
  Collaboration among same-subject teachers                                       120   2.45   0.56       231   2.58   0.58      –0.09          0.08
  Hours spent meeting with same-subject teachers                                  120   4.07   4.70       231   4.29   4.20      –0.48          0.55
  Collaboration among team teachers                                               120   2.70   0.55       233   2.62   0.57      –0.01          0.08
  Hours spent meeting with team teachers                                          120   5.53   5.91       233   4.52   4.16       0.78          0.51
  Quality of team dynamics                                                        120   4.47   0.52       233   4.25   0.74       0.22          0.11
Instructional practices
  Change in classroom emphasis on state standards and tests                       116   3.38   0.46       231   3.55   0.61      –0.12          0.08
  Change in emphasis on hands-on activities and having students work in groups    117   3.49   0.67       230   3.67   0.74      –0.15          0.10
  Importance teachers place on test-preparation activities                        119   3.18   0.57       234   3.37   0.56      –0.18          0.07
  Importance of student scores on state tests and benchmark assessments to guide instruction    119   2.90   0.77       234   3.21   0.67      –0.23          0.10
  Importance placed on student performance on classroom work and homework to guide instruction  119   3.71   0.48       234   3.69   0.44       0.02          0.06
  Use of test scores for making instructional decisions                           119   3.33   0.51       232   3.28   0.55       0.08          0.07
  Frequency with which teachers incorporate Texas state standards into instructional planning   119   5.01   0.73       232   4.95   0.85       0.09          0.10
  Number of hours worked outside of formal school hours                           118  10.75   6.88       232  12.92   8.85      –1.53          1.05
Professional development and self-improvement efforts
  Teachers' use of student test scores to help improve their own practice         118   2.00   0.88       228   2.15   0.81      –0.15          0.11
  Frequency of professional development activities related to collaborative aspects of teaching 120   3.18   0.80       233   3.28   0.88      –0.12          0.11
  Total amount of time spent in professional development                          117  42.62  44.91       231  44.47  40.66      –1.81          4.83
Parent engagement
  Efforts to engage parents                                                       119   2.55   0.50       233   2.49   0.50       0.09          0.07
Principal leadership
  Principal leadership (Year 2 only)                                               54   3.25   0.52       120   3.21   0.63       0.11          0.13
Perceptions of the intervention
  The intervention provides feedback about team's effectiveness                   119   0.37   0.48       231   0.43   0.50      –0.18          0.35
  The intervention should include noncore subjects                                117   0.47   0.50       230   0.58   0.49      –0.57          0.29
  The intervention has caused resentment among teachers                           118   0.30   0.46       231   0.29   0.45      –0.23          0.36
  The intervention distinguishes effective from ineffective teams                 119   0.03   0.18       233   0.12   0.32      –1.16          0.58
  The intervention has had negative effects on my school                          118   0.26   0.44       231   0.20   0.40       0.19          0.36
  The intervention forced teachers to teach in a certain way                      119   0.03   0.18       230   0.09   0.28      –0.92          0.57
  The intervention will not affect my teaching                                    120   0.83   0.38       234   0.76   0.43       0.38          0.34
  I have a clear understanding of performance criteria                            119   0.44   0.50       232   0.48   0.50      –0.20          0.28
  The bonus is too small                                                          113   0.18   0.38       228   0.17   0.38       0.05          0.35
  The Frequently Asked Questions document answered my questions                   116   0.59   0.49       227   0.63   0.48      –0.22          0.29
  Not winning a bonus will have a negative effect on my team's teaching evaluations 119  0.06   0.24       230   0.12   0.32      –0.65          0.47
  The intervention uses a fair method of awarding bonus                           116   0.34   0.48       228   0.44   0.50      –0.33          0.30

A positive effect size indicates higher scores among the teachers who would earn a bonus. None of these differences are significant after adjustment for multiple comparisons.
who did and did not earn a bonus in Year 1.
Similar data are not available for teachers who
did or did not win awards in Year 2 because we
did not administer a follow-up survey.
Again, none of the differences were significant after adjustment for multiple tests. For the
most part, teachers who earned a Year 1 bonus
showed similar changes in attitudes and practices over time as teachers who did not win a
bonus. There were no differences on the group
dynamics, professional development, and parent
engagement measures. There were also no differences in changes on the instructional practices scales. A notable trend, though not significant, concerned the extent to which teachers
used scores from the TAKS and district benchmark tests to guide their instruction. Teachers
who had not won a bonus reported decreased
emphasis on standardized test scores in Year 2
relative to Year 1. In contrast, teachers who had
won a bonus reported increased emphasis on
standardized test scores in Year 2. Though not
shown in the table, it is important to note that
despite the increase over time, teachers who had
won a Year 1 bonus continued in Year 2 to report
less emphasis on standardized test scores than
teachers who had not won a bonus.
There were also no differences between
teachers who had won a bonus and teachers who
had not won a bonus with respect to the perceptions of the intervention over time, although
small sample sizes may have limited our ability
to detect differences. One interesting result in
the data is that teachers’ attitudes toward the
size of the bonus award differed between those
who had or had not won bonuses. Teachers who
did not win the bonus were more likely to
endorse that the bonus was too small to motivate
them to work harder, while teachers who had
won a bonus showed the opposite pattern. The
difference was large although not significant
after adjustment for multiple tests.
Overall, winning a bonus did not appear to
materially change teachers’ attitudes and practices. There were generally no differences with
respect to changes in collaboration, professional
development, parent engagement, instructional
practices, and perceptions of the intervention for
teachers who won a Year 1 bonus compared to
teachers who did not.
7. Discussion
There are a variety of possible explanations
for why we did not observe any differences in
the achievement of students taught by treatment
and control group teams or in the attitudes and
practices of the treatment and control group
teachers. First, the single-academic-year randomized trials may have been too brief for meaningful treatment effects to emerge. The Year 1
spring survey was administered 8 months after
the experiment started, and the Year 2 survey
was administered 20 months after the experiment
TABLE 11
Changes in Attitudes, Perceptions, and Practices of Teachers From Year 1 to Year 2 by Whether They Earned a Bonus in Year 1

                                                                                  Earned Bonus            Did Not Earn Bonus
                                                                                  in Year 1               in Year 1              Standardized   Standard
Dependent Variable                                                                N     Δ      SD         N     Δ      SD        Effect Size    Error
Group dynamics
  Collaboration among same-subject teachers                                        54   0.02   0.48        83  –0.17   0.63       0.31          0.18
  Hours spent meeting with same-subject teachers                                   54   1.20   5.80        83   0.86   6.10      –0.25          0.31
  Collaboration among team teachers                                                54   0.05   0.56        84  –0.11   0.53       0.24          0.19
  Hours spent meeting with team teachers                                           54   0.69   2.85        84   0.60   6.43      –0.01          0.19
  Quality of team dynamics                                                         54  –0.01   0.53        84  –0.11   0.83       0.16          0.17
Instructional practices
  Change in classroom emphasis on state standards and tests                        53  –0.19   0.55        84  –0.14   0.72      –0.02          0.14
  Change in emphasis on hands-on activities and having students work in groups     53  –0.30   0.77        83  –0.07   0.90      –0.14          0.16
  Importance teachers place on test-preparation activities                         54  –0.01   0.52        85  –0.01   0.51      –0.00          0.18
  Importance of student scores on state tests and benchmark assessments to guide instruction    54   0.24   0.71        85  –0.16   0.78       0.49          0.18
  Importance placed on student performance on classroom work and homework to guide instruction  54  –0.09   0.61        85   0.06   0.52      –0.25          0.19
  Use of test scores for making instructional decisions                            54  –0.01   0.64        84   0.06   0.54      –0.16          0.20
  Frequency with which teachers incorporate Texas state standards into instructional planning   53   0.08   0.71        82   0.16   0.84      –0.14          0.18
  Number of hours worked outside of formal school hours                            53   0.91   5.34        84   0.83   7.76       0.01          0.20
Professional development and self-improvement efforts
  Teachers' use of student test scores to help improve their own practice          53  –0.01   0.82        80  –0.03   0.90       0.13          0.19
  Frequency of professional development activities related to collaborative aspects of teaching 54  –0.05   0.97        84  –0.13   0.90       0.04          0.18
  Total amount of time spent in professional development                           53  –5.83  60.23        83  –7.88  53.92       0.03          0.20
Parent engagement
  Efforts to engage parents                                                         54   0.01   0.41        54   0.04   0.52      –0.09          0.19
Perceptions of the intervention
  The intervention provides feedback about team's effectiveness                     52   0.06   0.60        82  –0.08   0.52       0.16          0.12
  The intervention should include noncore subjects                                  51   0.06   0.50        80  –0.04   0.48       0.09          0.10
  The intervention has caused resentment among teachers                             52   0.26   0.56        80   0.06   0.50       0.20          0.10
  The intervention distinguishes effective from ineffective teams                   53   0.04   0.27        82  –0.05   0.43       0.12          0.09
  The intervention has had negative effects on my school                            52   0.16   0.58        79   0.02   0.42       0.12          0.11
  The intervention forced teachers to teach in a certain way                        53  –0.02   0.23        78   0.05   0.38      –0.07          0.06
  The intervention energized me to improve my teaching                              28  –0.04   0.50        51  –0.19   0.45       0.15          0.15
  The intervention will not affect my teaching                                      28   0.07   0.53        51   0.02   0.55       0.08          0.30
  I have a clear understanding of performance criteria                              28   0.18   0.60        50  –0.08   0.49       0.59          0.14
  The bonus is too small                                                            26  –0.11   0.51        49   0.14   0.41      –0.27          0.15
  The Frequently Asked Questions document answered my questions                     28   0.07   0.65        49  –0.02   0.48       0.11          0.13
  Not winning a bonus will have a negative effect on my team's teaching evaluations 28   0.03   0.33        50   0.02   0.32      –0.02          0.10
  The intervention uses a fair method of awarding bonus                             28  –0.04   0.42        48  –0.21   0.46       0.16          0.11

Δ is the group mean for Year 2 minus the group mean for Year 1; none of these differences are significant after adjustment for multiple comparisons.
started but again only 8 months after randomization in that year. Student assessments were
administered on similar schedules. This may not
have given treatment teachers sufficient time to
change their practices, to experience conditions
that would change their perceptions or attitudes,
or for changes in practices to affect student outcomes. Previous research suggests that it can
take considerable time for teachers to change
their practices (Mayer, 1999).
Second, treatment teachers may not have
fully understood the pilot bonus program. The
majority (54%) of teachers did not understand
the criteria for earning a bonus. This was a problem even among the treatment teachers who
indicated that they had read the FAQ document,
which explained the methods used to calculate
which teams would be awarded a bonus. Among these teachers, 39% indicated they lacked clarity regarding the criteria for earning a bonus.
Moreover, a majority of teachers (59%) did not
think that the method used to award the bonus
was fair to teachers, possibly because teachers
may not have understood how the system was
designed to reduce competition. Overall, these
results suggest that teachers did not fully understand how their performance would translate into a bonus.
Third, the opportunity to win a bonus
appeared to be a weak incentive, as only one quarter of treatment teachers endorsed the statement "the chance to earn a bonus award has
energized me to improve my teaching.”
Furthermore, the vast majority of treatment
teachers (78%) indicated that they would not
change their practice in order to win the bonus.
Finally, the majority of teachers in this study
did not believe the intervention forced them to
teach in a certain manner, and the majority did
not believe that the experiment had negative
effects on their schools or caused resentment
among teachers, contrary to concerns some
teachers raise about pay for performance
(Solmon & Podgursky, 2001). However, these
misgivings were present among a substantial
minority of teachers. Nearly one third reported
that it created resentment among teachers and a
quarter reported negative effects on their schools.
Moreover, many teachers were skeptical about
other aspects of the intervention. The majority
of teachers believed the intervention to be
incomplete, in that it did not account for the
teaching effects of teachers from noncore
subjects. Teachers also questioned whether the
intervention could distinguish effective teachers
from ineffective teachers, could provide feedback about their team’s effectiveness, and could
fairly assign bonuses to teachers. For the latter
result, it is important to keep in mind that many
treatment teachers indicated they did not fully
understand the criteria for awarding bonuses, so
they may not have had adequate knowledge to
fully evaluate the fairness of the method.9
Nonetheless, it is important that teachers’ uncertainty and misgivings about a pay-for-performance
system be alleviated to the extent possible in
order to ensure optimal outcomes.
Taken together, these factors—the relatively
short duration of the experiment, treatment
teachers’ lack of understanding of the intervention, teachers’ reports that the potential to win a
bonus did not induce change in their practice,
and misgivings about the intervention among a
substantial minority of teachers—may help to
explain the lack of differences in perceptions
and practices between the treatment and control
groups. Marsh et al. (2011) summarize prior
research that suggests many of these factors are
important in the success of pay-for-performance
programs: Participants must understand the program, buy into the program and the criteria used
for selecting winners, believe the system is fair
and that they are capable of achieving an award,
and find the award valuable enough to inspire
efforts to achieve it.
Alternatively, the intervention might have
caused changes to student outcomes, but the
effect could have been masked by similar
changes in the control group. In a phenomenon
sometimes referred to as a John Henry Effect
(Saretsky, 1975), the control group exerts extra
effort in response to participating in the study.
Teacher misgivings about the intervention could
potentially exacerbate such an effect. Available
data do not enable us to test this speculation.
The study did not find evidence that the free-rider problem was an issue in this program.
Three survey items in the quality of team
dynamics scale are particularly relevant to this
issue. On average, treatment teachers reported
that team members demonstrated commitment
to the team by putting in extra time and effort to
help it succeed (4.46 on a scale of 1 to 5, where
4 indicates the statement is somewhat accurate
and 5 indicates it is very accurate), and that
everyone on the team is motivated to have its
students succeed (4.64). They also tended to
disagree that some members of the team do not
carry their fair share of the overall workload
(2.01, where 2 indicates the statement is somewhat inaccurate). Moreover, control group
teachers responded very similarly on these
items (4.49, 4.57, and 2.03, respectively), suggesting that participation in the bonus program
did not affect teacher reports related to free riding.
Overall, while these results further our understanding of how group-based compensation
plans affect teachers, more research is needed.
First, the study does not capture effects of the
intervention that might occur through changes
to the composition of the teaching workforce.
Second, future research could improve our
understanding of which particular features of
the intervention were accepted by teachers and
which particular features were in need of
improvement, so that compensation plans could
be better designed to promote teacher buy-in.
For example, interviews with teachers can identify what aspects of the bonus criteria were
unclear, and the results of this analysis can help
inform future designs of pay-for-performance
plans. In a related manner, teachers can shed
insight as to why they believe the current
method used to calculate bonuses gives only
limited information about teachers’ effectiveness and might suggest ways the system could
provide better feedback about a team’s effectiveness. Given that teacher buy-in has been
shown to depend on program design (Lavy,
2007), teachers can also discuss the pros and
cons of designing pay-for-performance plans
that provide additional compensation on an individual, team, or whole-school basis.
Some studies have suggested that teachers
may not be motivated by financial incentives
and may instead prefer non-monetary rewards
(Firestone & Pennell, 1993). Future studies
should examine how the opportunity to win a
monetary bonus compares to the opportunity to
obtain other types of incentives, such as choice
of team members, access to instructional materials,
or greater opportunity for professional development. Finally, in-depth case studies of teams of
teachers who win a bonus can help identify the
practices and conditions that led to their success, including the role of teamwork and other specific practices.
The lack of an effect of the pay-for-performance system in this study is consistent
with other recent experiments studying pay-for-performance systems in education, including studies of bonus awards for individual performance (Springer et al., 2010) or whole-school performance (Fryer, 2011; Marsh et al.,
2011). These studies shared several features that may provide additional explanation
for the lack of effects: The financial awards
were an add-on to standard pay, performance
was measured separately from the districts’
standard evaluations of teachers (except in one
of the programs evaluated), and there was no
professional development specifically connected to these programs.
Appendix
Student Linkages
Students were associated with a team if they
received continuous instruction in two or more of
the core subject areas of mathematics, reading/
English language arts, science, and social studies by team teachers during the period from
the fall snapshot date until test administration.
According to district records, the fall snapshot
date was October 31, 2008, and because of differing test administration dates, we used May
12, 2009, as the end of the enrollment window
in spring. A student was continuously enrolled if
he or she was enrolled in the same school during
that interval and enrolled in courses taught by
team teachers for a subject area for every day
of the interval.
Using data provided to the project by the
district, we determined all students continuously enrolled in each of the district’s nine
middle schools. We identified each student’s
primary instruction teachers for each core subject during the enrollment window and determined if each student received primary instruction from a single team teacher during the
enrollment window.10 Students were assigned to
a team if they received continuous instruction
for two or more core subject areas by teachers
on that team. Not all students were assigned to a
team since some students were not continuously
enrolled or did not receive instruction in two
core subjects from teachers on the same team. A
small number of students received instruction in
two core subjects from team teachers on one team
and instruction in the other two subjects from
teachers on a second team; in these cases, students were assigned to both teams.
For each student assigned to a team, we determined if a team teacher provided instruction for
each core subject area. A student could be part of
a team but not receive continuous instruction
from team teachers in every subject area. This
scenario is most common for mathematics where
a student might be instructed by teachers in Team
A for English, science, and social studies but
switch to a teacher from Team B for mathematics
for a course that is more appropriate for his or her
achievement level. For a student’s achievement
to contribute to the team’s performance measures
of a given subject area, the student must be part
of the team and receive continuous instruction
from a team teacher. Students who are part of a
team but who received continuous instruction by
a teacher from a different team or did not receive
continuous instruction in a subject area did not
contribute to estimates of performance measures
for this subject area on that team.
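As a concrete illustration of the linkage rules above, the following sketch (hypothetical data layout and function names, not the project's code) links a student to a team when team teachers provided continuous instruction in at least two core subjects and records which subjects may contribute to that team's performance measures.

```python
# Illustrative sketch of the student-team linkage rule (hypothetical data structures,
# not the project's code). A student joins a team when teachers from that team provided
# continuous instruction in two or more core subjects; a subject contributes to the
# team's performance measure only if a team teacher provided continuous instruction in it.
from collections import defaultdict

CORE_SUBJECTS = {"math", "reading_ela", "science", "social_studies"}

def link_students_to_teams(enrollments, teacher_to_team):
    """enrollments: iterable of (student_id, subject, teacher_id, continuous) records
    covering the fall-snapshot-to-test window; teacher_to_team maps teacher_id -> team_id.
    Returns {(student_id, team_id): subjects contributing to that team's measures}."""
    # Collect, per student and team, the core subjects with continuous team instruction.
    by_student_team = defaultdict(set)
    for student_id, subject, teacher_id, continuous in enrollments:
        if subject in CORE_SUBJECTS and continuous and teacher_id in teacher_to_team:
            by_student_team[(student_id, teacher_to_team[teacher_id])].add(subject)
    # Keep only pairs meeting the two-subject threshold; a student may link to two teams.
    return {pair: subjects for pair, subjects in by_student_team.items() if len(subjects) >= 2}

# Example: a student taught math by a Team B teacher but the other cores by Team A teachers
# links only to Team A, and only the Team A subjects contribute to Team A's measures.
teacher_to_team = {"t1": "A", "t2": "A", "t3": "A", "t4": "B"}
enrollments = [
    ("s1", "reading_ela", "t1", True),
    ("s1", "science", "t2", True),
    ("s1", "social_studies", "t3", True),
    ("s1", "math", "t4", True),
]
print(link_students_to_teams(enrollments, teacher_to_team))
# e.g. {('s1', 'A'): {'reading_ela', 'science', 'social_studies'}}
```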
Notes
1. Teachers assigned to teams in both years of the
study are counted twice. The same is true for student
counts below.
2. We used a statistical model to predict each student’s expected achievement on each subject area test
in school year 2008–09. The model used achievement
on 3 prior years of TAKS math and reading tests as
predictors. A separate model was used for each subject. The statistical model is essentially the same as the model described in Wright, Sanders, and
Rivers (2006) and the multivariate ANCOVA method
described in McCaffrey, Han, and Lockwood (2009).
In particular, a student’s expected current achievement in a given subject area if he or she were taught
by the average performing team is assumed to be a
linear function of all his or her prior three math and
reading TAKS scores (where both the current and
prior scores have been appropriately rescaled to normal curve equivalents to place all tests on the same
point range, see note below for more on the scaling).
The model uses data from all students, even those
with incomplete prior test records, by using the pattern of prior tests completed as part of the model. See
Wright, Sanders, and Rivers (2006) or McCaffrey, Han, and Lockwood (2009) for more details on modeling with incomplete records. A schematic version of this prediction model is sketched after these notes.
3. For instance, in the POINT study conducted in
Nashville, Tennessee, which used fixed benchmarks
for determining awards, the proportion of teachers
who earned a bonus increased from 29% to 52% over
the 3 years of the study. Similarly, the SPBP in New
York City used a fixed benchmark for determining
schools to receive awards for their staff, and in the
second year 80% of schools earned a full bonus, up
from 47% in the first year.
4. For example, suppose the Top 10 teams at each
grade level qualify for bonuses. If there are three teams
for each grade in each school and the 11th place team
is in the same school as one of the 10 winners, that
11th place team is also designated a winner under this
policy. If the 12th place team is also in the same
school, that team is designated a winner too. However,
a 12th place team located in any other school would
not be designated a winner under this rule, as it has
not been denied a place in the Top 10 by a higher-ranking team in its school. (This rule is sketched in code after these notes.)
5. Tests for significance in group differences were
conducted using permutation tests (discussed below),
without adjustment for multiple comparisons.
6. Most students in Grades 6 and 7 were tested in
math and reading in late April. Most students in Grade
8 were tested in reading in early March, math in early
April, and social studies and science in late April or
early May. Grade 8 students can take the TAKS multiple times if they are not proficient on the first attempt.
We included only scores from students’ first attempt.
7. The district’s precise definition of this indicator
is not known.
8. We conducted additional sensitivity analysis
in which we controlled for treatment condition
in Year 2, and the results were similar to those
reported here.
9. This item was not asked of control teachers.
10. We identified four cases in which a team
teacher left the district during the school year and was
replaced by another teacher or long-term substitute.
These replacement teachers were also considered
team teachers, so that students instructed by the original and the replacement teacher were considered to have received continuous instruction, provided they were otherwise continuously enrolled.
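The prediction model summarized in note 2 can be written schematically as follows. This is a sketch in our own notation of the general form described there (a linear function of the three prior years of TAKS mathematics and reading scores, rescaled to normal curve equivalents), not the exact specification, which also conditions on each student's pattern of completed prior tests:

\[
\widehat{y}^{(s)}_{i} \;=\; \beta^{(s)}_{0} \;+\; \sum_{k=1}^{3}\left(\beta^{(s)}_{M,k}\, m_{i,\,t-k} \;+\; \beta^{(s)}_{R,k}\, r_{i,\,t-k}\right),
\]

where \(\widehat{y}^{(s)}_{i}\) is student \(i\)'s expected score in subject \(s\) if taught by an average-performing team, \(m_{i,\,t-k}\) and \(r_{i,\,t-k}\) are the student's TAKS mathematics and reading scores from \(k\) years earlier, and the coefficients are estimated separately for each subject.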
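Similarly, the winner-designation policy illustrated in note 4 amounts to a simple rule that can be sketched in code (hypothetical function and data layout, not the program's implementation): a team outside the top K also wins when it was displaced from the top K only by higher-ranked teams from its own school.

```python
# Illustrative sketch (hypothetical structure, not the program's implementation) of the
# winner-designation policy in note 4: the top K teams in a grade win, and a lower-ranked
# team also wins if it would sit inside the top K once higher-ranked teams from its own
# school are set aside, i.e., it was displaced only by teams from its own school.
def designate_winners(ranked_teams, k):
    """ranked_teams: list of (team_id, school_id) ordered from best to worst rank."""
    winners = []
    for rank, (team, school) in enumerate(ranked_teams):
        higher_from_other_schools = sum(
            1 for _, other_school in ranked_teams[:rank] if other_school != school
        )
        if rank < k or higher_from_other_schools < k:
            winners.append(team)
    return winners

# Example mirroring note 4 with K = 2: school S1 holds ranks 1-3, so its third-place team
# also wins, while the fourth-place team from another school does not.
print(designate_winners([("A", "S1"), ("B", "S1"), ("C", "S1"), ("D", "S2")], k=2))
# ['A', 'B', 'C']
```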
References
Azordegan, J., Byrnett, P., Campbell, K., Greenman, J.,
& Coulter, T. (2005). Diversifying teacher compensation. Denver, CO: Education Commission of the
States. Retrieved November 3, 2010, from http://
www.ecs.org/clearinghouse/65/83/6583.pdf
Benjamini, Y., & Hochberg, Y. (1995). Controlling the
false discovery rate: A practical and powerful
approach to multiple testing. Journal of the Royal
Statistical Society. Series B (Methodological), 57(1),
289–300.
Berg, P., Appelbaum, E., Bailey, T., & Kalleberg, A.
(1996). The performance effects of modular production in the apparel industry. Industrial Relations,
35(3), 356–373.
Che, Y., & Yoo, S. (2001). Optimal incentives for
teams. American Economic Review, 91(3), 525–541.
Condly, S. J., Clark, R. E., & Stolovitch, H. D.
(2003). The effects of incentives on workplace
performance: A meta-analytic review of research
studies. Performance Improvement Quarterly,
16(3), 46–63.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman
& Hall/CRC.
Firestone, W. A., & Pennell, J. R. (1993). Teacher
commitment, working conditions, and differential
incentive policies. Review of Educational Research,
63(4), 489–525.
Fryer, R. G. (2011). Teacher incentives and student
achievement: Evidence from New York City public schools. Working Paper Series. Cambridge,
MA: National Bureau of Economic Research.
Glazerman, S., & Seifullah, A. (2010). An evaluation
of the Teacher Advancement Program (TAP) in
Chicago: Year two impact report (Reference
Number 6319-520). Washington, DC: Mathematica
Policy Research.
Glewwe, P., Ilias, N., & Kremer, M. (2010). Teacher
incentives. American Economic Journal, 2(3),
205–227.
Gratz, D. B. (2009). The problem with performance
pay. Educational Leadership, 67(3), 76–79.
Hamilton, B. H., Nickerson, J. A., & Owan, H.
(2003). Team incentives and worker heterogeneity:
An empirical analysis of the impact of teams on
productivity and participation. Journal of Political
Economy, 111(3), 465–497.
Hoerr, T. (1998). A case for merit pay. Phi Delta
Kappan, 80(4), 326–327.
Kandel, E., & Lazear, E. P. (1992). Peer pressure and
partnerships. Journal of Political Economy, 100(4),
801–817.
Kaufman, B. E. (2008). Work motivation: Insights from
economics. In R. Kanfer & G. Chen (Eds.), Work
motivation: Past, present, and future (pp. 588–600).
New York: Routledge/Taylor & Francis.
Lavy, V. (2007). Using performance-based pay to
improve the quality of teachers. The Future of
Children, 17(1), 87–109.
Lazear, E. P. (1998). Personnel economics for
managers. New York: Wiley.
Marsh, J. A., Springer, M. G., McCaffrey, D. F., Yuan, K.,
Epstein, S., Koppich, J., Kalra, N., DiMartino, C., &
Peng, A. (2011). A big apple for educators: New York
City’s experiment with schoolwide performance
bonuses (MG-1114-FPS). Santa Monica, CA: RAND
Corporation.
Mayer, D. P. (1999). Measuring instructional practice:
Can policymakers trust survey data? Educational
Evaluation and Policy Analysis, 21(1), 29–45.
McCaffrey, D. F., Han, B., & Lockwood, J. R. (2009).
Incentive system design and measurement. In M. G.
Springer (Ed.), Performance incentives: Their growing impact on American K-12 education. Washington,
DC: Brookings Institution Press.
Milanowski, A. (1999). Measurement error or meaningful change? The consistency of school achievement in two school-based performance award
programs. Journal of Personnel Evaluation in
Education, 12(4), 343–363.
Milanowski, A. T. (2007, Spring). Performance pay
system preferences of students preparing to be
teachers. Education Finance and Policy, 2(2),
111–132.
Muralidharan, K., & Sundararaman, V. (2011).
Teacher performance pay: Experimental evidence
from India. The Journal of Political Economy,
119(1), 39–77.
Odden, A. (2000). New and better forms of teacher
compensation are possible. Phi Delta Kappan,
81(5), 361–366.
Pfeffer, J. (1995). Competitive advantage through
people: Unleashing the power of the work force.
Boston, MA: Harvard Business School Press.
Rice, J. K. (2003). Teacher quality: Understanding
the effectiveness of teacher attributes. Washington,
DC: Economic Policy Institute.
Rosen, S. (1986). The theory of equalizing differences. In O. C. Ashenfelter & R. Layard (Eds.),
Handbook of labor economics (Vol. 1). Oxford:
North-Holland.
Rosenholtz, S. (1989). Teacher’s workplace: The
social organization of schools. New York:
Longman.
Sager, R. (2009, March 2). Prez’s challenge to NYC
teachers. New York Post. Retrieved March 31,
2009, from http://www.nypost.com/p/news/opinion/opedcolumnists/item_eS7bvzIPkWPVEjJsbHwsoJ
Saretsky, G. (1975). The John Henry effect: Potential
confounder of experimental vs. control group
approaches to the evaluation of educational innovations. Paper presented at the Annual Meeting of
the American Educational Research Association,
Washington, D.C.
Solmon, L., & Podgursky, M. (2001). The pros and
cons of performance-based compensation.
Pasadena, CA: Milken Family Foundation.
Springer, M. G., Ballou, D., Hamilton, L., Le, V.,
Lockwood, J. R., McCaffrey, D., Pepper, M., &
Stecher, B. (2010). Teacher pay for performance:
Experimental evidence from the project on incentives in teaching. Nashville, TN: National Center
on Performance Incentives at Vanderbilt University.
Springer, M. G., Lewis, J. L., Podgursky, M. J.,
Ehlert, M. W., Taylor, L. L., Lopez, O. S., &
Peng, A. (2009). Governor’s Educator Excellence
Grant (GEEG) Program: Year three evaluation
report. Nashville, TN: National Center on Performance Incentives.
Thomas, K. W. (2009). Intrinsic motivation at work:
What really drives employee management (2nd ed.).
San Francisco: Berrett-Koehler.
Wright, S. P., Sanders, W. L., & Rivers, J. C. (2006).
Measurement of academic growth of individual students toward variable and meaningful academic
standards. In R. Lissitz (Ed.), Longitudinal and value
added models of student performance. Maple Grove,
Minnesota: JAM Press.
Authors
MATTHEW G. SPRINGER is assistant professor
of public policy and education, director of the federally-funded National Center on Performance
Incentives, and director of the Tennessee Consortium
on Research, Evaluation, and Development, a
research consortium funded through Tennessee’s
Race to the Top grant. Professor Springer’s research
interests involve educational policy issues, with a
particular focus on the impact of policy on resource
allocation decisions and student outcomes.
JOHN F. PANE is a Senior Scientist at RAND. He
uses experimental and rigorous quasi-experimental
methods to study the implementation and effectiveness of innovations in education, particularly those
involving technology.
VI-NHUAN LE is a Behavioral Scientist at RAND.
Her research and expertise lie in mathematics and
science reform, educational assessment, and early
childhood education.
DANIEL F. MCCAFFREY is a Senior Statistician
and holds the PNC Chair in Policy Analysis at the
RAND Corporation. His current research interests
include value-added modeling and the measurement
of teaching.
SUSAN FREEMAN BURNS is program manager
at the National Center on Performance Incentives.
Her research interests include school leadership and
teacher effectiveness, particularly in K-12 public
education.

LAURA S. HAMILTON is a Senior Behavioral
Scientist at RAND. Her areas of specialization are
assessment, accountability, and the measurement of
instruction and leadership practices.
BRIAN STECHER is a Senior Social Scientist
and the Associate Director of RAND Education.
Dr. Stecher’s research focuses on measuring educational quality and evaluating education reforms, with
a particular emphasis on assessment and accountability
systems.
Manuscript received August 3, 2011
Revision received November 29, 2011
Accepted January 21, 2012