Educational Evaluation and Policy Analysis
December 2012, Vol. 34, No. 4, pp. 367–390
DOI: 10.3102/0162373712439094
© 2012 AERA. http://eepa.aera.net

Team Pay for Performance: Experimental Evidence From the Round Rock Pilot Project on Team Incentives

Matthew G. Springer, Vanderbilt University
John F. Pane, Vi-Nhuan Le, Daniel F. McCaffrey, RAND Corporation
Susan Freeman Burns, Vanderbilt University
Laura S. Hamilton, Brian Stecher, RAND Corporation

Education policymakers have shown increased interest in incentive programs for teachers based on the outcomes of their students. This article examines a program in which bonuses were awarded to teams of middle school teachers based on their collective contribution to student test score gains. The study employs a randomized controlled trial to examine effects of the bonus program over the course of an academic year, with the experiment repeated a second year, and finds no significant effects on the achievement of students or the attitudes and practices of teachers. The lack of effects of team-level pay for performance in this study is consistent with other recent experiments studying the short-term effects of bonus awards for individual performance or whole-school performance.

Keywords: teacher pay for performance, education performance incentives, group incentives, team incentives

Author note: We would like to acknowledge the contributions of Ann Haas, who provided analytic support to the project; Dale Ballou, who provided input into the study design and feedback on a draft of this article; and three anonymous reviewers who provided thoughtful feedback that helped us improve the article. We are grateful for the support of officials in the Round Rock Independent School District, particularly Jesus Chavez, Ph.D., Superintendent; Toni Garcia, Assistant Superintendent for Instruction; Rosena Malone, Assistant Superintendent for Secondary Education; and Debbie Lewis, Director of Research and Assessment. We offer our special thanks to the teachers and principals in the Round Rock middle schools, without whose participation this research would not have been possible. Teacher bonuses were made possible through the generous financial support of an anonymous foundation. This research was supported by the National Center on Performance Incentives, which is funded by the United States Department of Education's Institute of Education Sciences (R305A06034).

1. Introduction

A variety of factors have led education policymakers to increase their interest in providing incentives to teachers based on the outcomes of their students. First, there is ongoing frustration that U.S. public schools have not made sufficient progress in recent decades in addressing the achievement gap between advantaged and disadvantaged students, nor in how the United States fares in international comparisons. Meanwhile, federal policies have led to increased use of standardized testing and more widespread use of test results to evaluate the performance of schools, administrators, and teachers. Finally, there has been greater recognition that increasing teacher quality may be the most direct and effective pathway to improving student achievement (see, e.g., Rice, 2003).
Performance incentives are viewed as having the potential to increase teacher quality in two ways: by incentivizing existing teachers to improve their practice and by attracting better teachers into the profession and retaining them there. However, in spite of the intuitive appeal incentive pay has to some stakeholders, an influential base of individuals and organizations fundamentally opposes its use in education. Opponents contend that such pay renders schools less effective by reducing the motivating effects of intrinsic rewards; that is, teachers will derive less satisfaction from professional performance as they are rewarded financially for measured student achievement (Thomas, 2009). Other concerns include the risk that test-score-focused measures could lead to excessive emphasis on test preparation at the expense of other valued activities, which would lessen the validity of inferences from the scores themselves (Kaufman, 2008), and that the education system lacks appropriate measures for evaluating teacher performance directly (Gratz, 2009; Milanowski, 1999). There is also concern that incentive pay could negatively affect teachers' morale and the collegial environment of the school, which is considered essential for school improvement (Hoerr, 1998; Odden, 2000).

The historic evidence on the impact of pay-for-performance programs is inconclusive (Springer et al., 2009). Recent international studies have found positive effects of pay-for-performance systems on student achievement (Glewwe, Ilias, & Kremer, 2010; Muralidharan & Sundararaman, 2011); however, recent studies of such systems in the United States did not find effects on student outcomes (Fryer, 2011; Glazerman & Seifullah, 2010; Marsh et al., 2011; Springer et al., 2010). One of these studies, the POINT study conducted in Nashville, Tennessee, randomly assigned individual middle school math teachers to be eligible for large bonuses based on their students' performance (Springer et al., 2010). Two other studies independently evaluated the School-Wide Performance Bonus Program (SPBP) in New York City (Fryer, 2011; Marsh et al., 2011). In that program, teachers and other union-represented staff in low-performing, at-risk schools could earn bonuses if the school exceeded performance standards based on the city's school accountability measures. Both studies found no or negative differences between the outcomes of students attending schools randomly assigned to the program and the outcomes of students in control group schools (Fryer, 2011; Marsh et al., 2011).

Given the importance of collaboration among teachers for effective teaching (Rosenholtz, 1989), the school-wide program in New York, which may have done more to encourage teachers to work together than an individual bonus system such as POINT, might have been expected to have effects even if POINT did not. However, some researchers have argued that school-level bonuses offer only weak incentives, since teachers may not feel that they have much influence on whether their school as a whole qualifies for a bonus (Sager, 2009). Moreover, the group-based award program was potentially susceptible to the "free-rider problem." With group-based performance structures, individuals on a team may become less likely to shoulder their fair share of the workload, knowing that the capabilities of teammates can make up for their subpar performance and still result in all team members receiving a bonus award. Thus, group incentive systems are likely to result in the inefficient allocation of some bonus resources.
Milanowski (2007) found that teachers are concerned about free riders. In that study, teachers expressed a preference for individual awards over group awards, citing a lack of control over others' performance and concerns that their peers would not contribute equally to earning the bonus. However, Kandel and Lazear (1992) and others have argued that as long as the size of a within-organization team is not too large, the free-rider problem can be solved through peer pressure. For instance, peer monitoring and the enforcement of social penalties in the form of shame, guilt, empathy, and mutual monitoring can lead to individual team members being accountable for their performance to the other members of the group. If a worker has both monetary and social incentives not to shirk, Kandel and Lazear (1992) contend that the motivational forces that would have been "choked off" by the free-rider problem are recovered.

Moreover, organizational theory suggests group incentives can promote social cohesion, productivity norms, and feelings of fairness (Lazear, 1998; Pfeffer, 1995; Rosen, 1986). Improved social cohesion among workers can foster knowledge transfer and mutual learning that result in increased productivity in the long run (Che & Yoo, 2001). For example, as reported in Berg et al.'s (1996) and Hamilton, Nickerson, and Owan's (2003) case studies of garment plants, the formation of teams with workers of varying abilities facilitated interactions among high- and low-ability workers, so that more able workers taught less effective workers how to better execute tasks and become more productive.

For this reason, some policymakers and practitioners gravitate toward team-level awards (i.e., awards based on the performance of a group of teachers within a school). While teams can be construed in a variety of ways, including teachers of the same grade level, teachers in the same department, or members of instructional teams, the rationale for team-level awards is usually based on the collaborative nature of teaching within these smaller-than-whole-school groupings (Azordegan, Byrnett, Campbell, Greenman, & Coulter, 2005). In theory, team-level awards can provide a close coupling between student performance and the teachers who instruct them, while also encouraging teacher collaboration and reaping the benefits of productivity gains from knowledge transfer and mutual learning.

From outside of the K–12 education sector, there is limited evidence suggesting that team incentives may be more effective than individual incentives. Condly, Clark, and Stolovitch (2003) conducted a meta-analysis of 64 studies from private sector, government, and higher education settings. Nine of the studies examined team incentive programs, reporting an average effect size of 1.40 SD units, versus 0.55 for the 55 studies that examined individual incentive programs. Here, the relative effects of team versus individual incentives may be more informative than the absolute magnitudes of these effects, because there are reasons to doubt that such large effect sizes would be obtained in fielded teacher incentive programs. First, only about half of the studies in the meta-analysis were field studies, and those achieved smaller effects than the laboratory experiments. Second, 41% of the systems studied incentivized manual work, which produced larger effects than those incentivizing cognitive work.
Although the comparability of teaching and the cognitive work of the studied systems cannot be determined from the meta-analysis, the smaller effects for cognitive work would seem more applicable to the complex cognitive work of teaching, which is measured indirectly through student achievement outcomes, than the results for incentivizing manual work, which typically has directly observable and measurable outputs.

In terms of studies of team incentives in education (excluding school-wide programs), we know of only one study that compares team- and individual-level incentive programs in pre-college education. The Andhra Pradesh Randomized Evaluation Study (AP RESt) compared the impact of two output-based incentive systems (an individual teacher incentive program and a group-level teacher incentive program) and two input-based resource interventions (one provided an extra paraprofessional teacher and another provided block grants). Muralidharan and Sundararaman (2008) found that students enrolled in a classroom instructed by a teacher selected for the group incentive intervention outperformed students in control condition classrooms on both the mathematics and language exams (0.28 and 0.16 SDs, respectively); however, students enrolled in schools assigned to the individual incentive condition outperformed students in both the group incentive condition and the control condition after the second year of implementation. The context of schooling in India compared with the United States, however, makes the applicability of this study unclear.

Thus, theory and research from outside of education suggest that basing awards on team-level performance may be a productive way of implementing performance pay that overcomes the pitfalls of individual and school-wide awards. However, there has been little scientific study, and none in the United States, of the effect of teacher incentives on student achievement when the distribution of awards is based on the performance of a teacher team.

Although the POINT and New York SPBP programs used different levels of performance for determining awards, both programs took place in large urban school systems in which all the schools in the study were facing strong accountability pressures because of low performance. Some schools in the SPBP evaluation were targeted for closure due to failing performance, and during the POINT study the Metropolitan Nashville School District was threatened with state takeover due to the failure of its schools to make Adequate Yearly Progress under the federal No Child Left Behind regulations. Authors of evaluations of both programs speculate that these pressures might have limited the ability of the bonus programs to change teacher behavior because all teachers in participating schools already faced strong external pressure to improve student outcomes (Marsh et al., 2011; Springer et al., 2010). Testing the effects of bonus programs in other contexts may yield different results.

To help fill this gap in our knowledge, this article examines a pay-for-performance program in which performance awards were distributed to teams of teachers based on their collective contribution to student test score gains, in a suburban district with above-average levels of student achievement for its state.
Starting in August 2008, the Round Rock Independent School District (RRISD) in Texas and the National Center on Performance Incentives (NCPI) designed and implemented two 1-year randomized controlled trials to examine the impact of a team-level teacher pay-for-performance intervention on middle school student achievement in the core subject areas (i.e., mathematics, reading, social studies, and science), as well as the impact of team-level awards on teacher attitudes and behaviors and on team and institutional dynamics. In the district, most middle school students are taught the core subjects by interdisciplinary teams. All Grade 6, 7, and 8 teachers on these teams were part of the study if they taught one of the core subject areas. In the first year of the study, 78 middle school teams of teachers were randomly assigned to the treatment condition (eligible for an award) or the control condition (not eligible for an award). If a treatment team's value-added score ranked in the top third of treatment teams at its grade level, teachers on the team were awarded about $5,400 each, as long as their individual value-added scores were not statistically below average for their grade level. The same procedure was repeated in the second year of the study. That year, 81 teams were randomized, and teachers on bonus-winning teams were awarded about $5,900 each.

As noted above, there are two high-level pathways through which performance incentives might result in increased student achievement: by incentivizing existing teachers to improve their practice, either individually or in collaboration, or by attracting better teachers into the profession and retaining them there. This experiment is designed to measure effects of the first of these pathways over the relatively short term of an academic year.

2. Research Questions

The study addresses the following research questions regarding the opportunity for teachers to earn a bonus on the basis of student achievement in core subjects taught by them and their teammates:

1. Does the bonus opportunity affect the achievement of students taught by the team?
2. Does the bonus opportunity affect teachers' attitudes about compensation and teaching or their teaching practices?
3. Are there differences in the attitudes or practices of teachers who earned a bonus and teachers who did not?

3. Sample

RRISD serves students from the cities of Round Rock, Cedar Park, and portions of Austin in Texas. The district enrolls approximately 43,000 students, with enrollment increasing by approximately 1,500 students per year between 2003–04 and 2008–09. The district has a diverse ethnic base, with a student population that is approximately 8.7% African American, 10.7% Asian, 30% Hispanic, and 46.2% White. More than 73 languages are spoken throughout the district. The district has 50 schools (6 high schools, 10 middle schools, 32 elementary schools, and 2 alternative education centers) and approximately 5,950 employees, of whom 2,795 are teachers. The district's student-teacher ratio is 14.7. In the 2008–09 school year, the beginning salary was $41,000 for teachers with bachelor's degrees, $42,000 for those with master's degrees, and $43,000 for those with doctoral degrees. Approximately 25% of the teachers have master's or doctoral degrees, well below the national average of 57%. The average years of teaching experience is 10.4, compared to a national average of 15.
The study's middle school sample is described below. The 5-year graduation rate from Grades 9 to 12 for the class of 2008 was 87.6%, and more than 77% of the district's graduating seniors took the SAT or ACT college entrance exams. Graduating seniors score approximately 160 points above the state average and 110 points above the national average on the SAT. Performance on the ACT is similar.

RRISD's teaching assignment structure provides an opportunity to investigate the efficacy of team-level incentives. The district organizes middle school teachers into grade-level interdisciplinary teams that oversee the learning experiences of a group of approximately 100 to 140 students associated with the team. Team composition changes from year to year due to teacher turnover and administrative discretion. Each team has at least one teacher for each core subject of mathematics, reading/English language arts, science, and social studies. Some teams also contain additional teachers who focus on students with limited English proficiency and/or special education students. Team members share a common planning time, during which they plan future lessons, discuss their students' performance, confer with parents, conduct data conversations, and plan interventions or extensions for specific students. The interdisciplinary structure of the RRISD teams may make them more conducive to certain kinds of knowledge transfer, such as information about individual students, and less conducive to other kinds, such as the sharing of subject-specific pedagogical practices.

Over the 2 years, the study included 159 teams of teachers teaching core subjects to students in Grades 6 to 8 in nine middle schools. These were all of the teams in those schools during the study. There were 665 teachers on these teams.1 Teams include language arts/reading, mathematics, science, and social studies teachers, and some teams also include special education teachers and specialists for students with limited English proficiency. Not all teachers in the school are members of a team, but most students receive instruction in core subjects from teachers who are members of a team.

Some teachers taught off-team for a small proportion of their students. For instance, mathematics teachers may have taught a section with students from two teams or a section of students who received core instruction in other courses from another team. Across all subjects, off-team students constituted 3% of the students taught by teachers participating in the study. Teaching of off-team students was most prevalent among mathematics teachers: 69% of mathematics teachers taught at least one off-team student, with a median of 8 such students, and these students accounted for 11% of all the students their teachers taught. Off-team teaching was less common in the other subjects, in each of which off-team students accounted for about 1% of the students their teachers taught. In ELA, 46% of teachers taught at least one off-team student, with a median of 2; in science, these figures were 38% and 2; and in social studies, they were 39% and 2, respectively. Off-team students were not included in the calculation of team performance measures or in the outcomes analyses.

A total of 17,383 students were taught by participating teachers. Of these, 17,307 were determined to be part of a team, defined as receiving instruction from team teachers in at least two of the four core subjects.
In rare cases, students received instruction from teachers of one team for two core subjects and from teachers of another team for the other two core subjects. These students were considered members of both teams and were assigned to their teacher's team in each core subject.

4. The Pilot Incentive Program

The incentive program offered teachers on selected teams the opportunity to earn a bonus on the basis of their students' achievement growth in the four core subjects of mathematics, reading/English language arts, science, and social studies. Teachers on teams assigned to the intervention group were notified that their team would be eligible for a bonus. The notification included a Frequently Asked Questions (FAQ) document that described the requirements for winning a bonus, including that it would depend on team students' achievement in core subjects relative to predictions based on prior-year scores. Neither the particular tests to be used nor the exact methods for calculating teams' performance measures were detailed.

Team performance was based on a value-added measure of student performance on standardized achievement tests and district benchmark assessments. Students were tested in the four core subjects: reading/language arts, mathematics, science, and social studies. The goal of the performance measure was to provide an evaluation of each team's contribution to student learning in the core subjects. In essence, the value-added method compared students' test scores in each subject area with how they would be expected to score on these tests had they been taught by the average performing team for the subject area and grade level (six, seven, or eight).2 The difference between a student's actual performance and his or her expected performance was the measure of the team's contribution to that student's learning for a given subject area. In each subject area, the average of the contributions to all the students instructed by a team teacher was the measure of the team's contribution for that subject area. The overall performance measure for the team was the average of its contributions to each of the four subject areas. Students were associated with a team if they received continuous instruction in two or more of the core subject areas by team teachers during the period from the fall snapshot date until test administration. For a more complete description of student linkages, see Appendix 1. Table 1 lists the assessments used to measure student achievement for team performance calculations in each subject area.

TABLE 1
Achievement Tests Used in the Performance Measures

Grade level | Mathematics | Reading/ELA | Science | Social studies
Six | TAKS Mathematics (a) | TAKS Reading (a) | Benchmark Cycle 6 (b) (Year 1) or Stanford 10 Science (Year 2) | Benchmark Cycle 6 (b) (Year 1) or Stanford 10 Social Studies (Year 2)
Seven | TAKS Mathematics (a) | TAKS Reading (a) | Benchmark Cycle 6 (b) (Year 1) or Stanford 10 Science (Year 2) | Benchmark Cycle 6 (b) (Year 1) or Stanford 10 Social Studies (Year 2)
Eight | TAKS Mathematics (a, c) | TAKS Reading (a, c) | TAKS Science (a) | TAKS Social Studies (a)

Note. TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts. (a) Scale scores. (b) Total correct. (c) Scores from the first test students completed, not retest scores.
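To make the arithmetic of the team performance measure concrete, the sketch below walks through the steps just described using hypothetical data. The article does not publish the exact value-added model (see note 2), so the expected-score step here is a stand-in: a simple within-grade, within-subject regression on prior-year scores. Column names such as team, subject, score, prior_math, and prior_read are invented for illustration.

```python
# Hypothetical sketch of the team value-added measure described above.
# The article's actual model is more elaborate; here "expected" scores are
# simple within-grade, within-subject OLS predictions from prior-year scores.
import numpy as np
import pandas as pd

def team_performance(df: pd.DataFrame) -> pd.Series:
    """df columns (hypothetical): team, grade, subject, score, prior_math, prior_read."""
    df = df.copy()
    df["expected"] = np.nan
    for (grade, subject), g in df.groupby(["grade", "subject"]):
        # Predict each student's score from prior-year scores within the
        # grade/subject cell; the cell average stands in for the "average team."
        X = np.column_stack([np.ones(len(g)), g["prior_math"], g["prior_read"]])
        beta, *_ = np.linalg.lstsq(X, g["score"].to_numpy(), rcond=None)
        df.loc[g.index, "expected"] = X @ beta
    # Contribution to each student = actual score minus expected score.
    df["contribution"] = df["score"] - df["expected"]
    # Team's subject-area measure = mean contribution over its students;
    # overall team measure = mean of its subject-area measures.
    by_subject = df.groupby(["team", "subject"])["contribution"].mean()
    return by_subject.groupby("team").mean()
```

As in the program, a team's overall measure in this sketch is simply the mean of its subject-area means, so a strong showing in one subject can offset a weaker one.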
Because a common test was not available in all four subject areas at every grade level, the performance measures used combinations of Texas Assessment of Knowledge and Skills (TAKS) tests and either district benchmark tests from the final testing cycle in Grades 6 and 7 in Year 1 (2008–09) or the Stanford Achievement Test Series, Tenth Edition (Stanford 10) in Year 2 (2009–10). For the TAKS and Stanford 10 tests, we used the scale scores; however, there are no scale scores for the benchmark tests, so we used the proportion of correct responses out of 30 items. Measures used for testing the program's effects on student outcomes are discussed below.

To receive a bonus, a team's score had to rank in the top third of treatment group teams in the same grade level. The rationale for designing the incentive program as a fixed tournament was pragmatic: the cost of the incentive program would be known in advance. If the incentive plan had required that teams exceed a fixed threshold to receive an award, then the number of teams winning awards could not be contained and the potential financial exposure would be much greater.3 An incentive structure that awards large bonuses relative to base salary and does not fix the number of units eligible to receive an award poses serious problems for policy. District officials, legislators, or others funding bonuses are reluctant to make open-ended commitments to reward all teachers exceeding a benchmark, given the uncertain and potentially substantial costs. A prime consideration in designing this experiment was to test a policy with modest-size bonuses that would be feasible for educators and school officials in RRISD and other school districts. The fixed tournament design offered this virtue.

A fixed tournament incentive structure, however, suffers from one well-recognized defect: it promotes competition among teams, which could lead to a breakdown of interteam cooperation. The consequences of within-school competition are of particular concern if teachers in the same school are no longer willing to help other teachers who are not on the same team. To ensure that the structure of the incentive program would not promote competition among teams within schools, we modified the criteria for earning a bonus. Specifically, if a team would have earned a bonus had another team in the same school not outperformed it, we designated the nonqualifying team an additional winner. This solution ensured that no team close to earning a bonus would be denied a bonus because some other team in the same school outperformed it. Thus, in practice, no teacher would have reason to withhold help or cooperation from another teacher on a different team in the same school, assuming teachers understood the procedure. While this modification introduced a small amount of uncertainty about the total bonuses to be paid out, this uncertainty seemed a reasonable price to pay to promote harmonious working relationships within schools.4 The research team also found that this provision helped to promote buy-in among building and district educators to permit the experiment to take place. In short, the incentive pay plan was designed to capitalize on the advantages of a tournament while avoiding its potential defect of competition that leads to reduced collaboration.
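One plausible reading of this award rule is sketched below for a single grade level: treatment teams are ranked on their value-added scores, the top third win, and a lower-ranked team also wins if the only teams standing between it and the cutoff are higher-scoring teams from its own school. The rounding of "top third" and any tie-breaking rules are not specified in the article, so both are assumptions here.

```python
# Illustrative sketch of the fixed-tournament award rule with the same-school
# adjustment, as interpreted from the text; not the program's actual code.
import math

def award_winners(scores, school, top_fraction=1/3):
    """scores: dict team -> value-added score for treatment teams in one grade.
    school: dict team -> school. Returns the set of bonus-winning teams."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_win = math.ceil(top_fraction * len(ranked))   # rounding rule assumed
    winners = set(ranked[:n_win])
    # Same-school adjustment: a team that misses the cut only because
    # higher-scoring teams from its own school are ahead of it also wins.
    for team in ranked[n_win:]:
        rivals = [t for t in ranked
                  if scores[t] > scores[team] and school[t] != school[team]]
        if len(rivals) < n_win:   # would have placed in the top third otherwise
            winners.add(team)
    return winners
```

Under this reading, a team cannot lose a bonus because another team in its own building did well, which is what removes the incentive to withhold help from colleagues on other teams in the same school.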
Finally, the award criteria also included an individual performance requirement for all teachers on eligible teams: In order for teachers to receive an award for their team's performance, their individual value-added scores must not have been statistically below average for their grade level. This criterion was included at the request of the funder to help reduce the potential for the free-rider problem to occur and to avoid rewarding individual teachers whose students were not making progress. This provision, while individual- rather than team-based, was aligned with the team incentive structure. Specifically, it was designed not to induce competition within teams or incentivize teachers to place more importance on their own performance at the expense of the team's performance. Moreover, although this provision was explained to teachers in the FAQ document, the vast majority of discussion revolved around the team aspects of the incentive program.

Table 2 summarizes the bonus awards. In 2008–09, there were a total of 78 teams, half assigned to the treatment condition. Of the 39 teams eligible for bonuses, 14 teams received one. Awards were given to 67 individual teachers, with 63 teachers earning the maximum award amount and 4 receiving a prorated bonus because they taught team students for only a fraction of their instructional workload. The full award amount was nearly $5,500, prorated awards ranged as low as $3,800, and the average award was $5,373. Overall, nearly $360,000 was awarded to teachers in 2008–09. Similarly, in 2009–10, there were 81 teams, 40 of which were assigned to the treatment condition. Of those 40 eligible teams, 12 received a bonus, with 46 teachers receiving the full award and 6 teachers receiving a prorated share. The full award amount was $6,000, prorated awards ranged as low as $4,200, and the average award was $5,862. The total amount awarded to teachers in 2009–10 was $304,800. Across the 2 years, there was only one instance of a teacher on a bonus-winning team who did not receive a bonus because the teacher's value-added score was too low.

TABLE 2
Summary of Awards

 | Year 1 | Year 2
Number of teams, both treatment and control | 78 | 81
Number of treatment teams | 39 | 40
Number of teams receiving an award | 14 | 12
Number of individual award recipients | 67 | 52
Number of individual award recipients receiving full award | 63 | 46
Number of individuals on award-winning teams with value-added scores below average (no bonus) | 0 | 1
Amount of full award | $5,446 | $6,000
Average award | $5,373 | $5,862
Amount of smallest prorated award | $3,812 | $4,200
Total amount of awards | $359,981 | $304,800

5. Methods

In each year of the study, teams were randomized to either the bonus intervention or the control condition using a block-randomized design. Blocks were defined by grades within school. Within each block, there were multiple teams. When there was an even number of teams, half the teams in each block were randomized to treatment and half to control. In blocks with three teams (no blocks had more than four teams), two teams were randomly assigned to one condition and the remaining team was assigned to the other condition. The randomizations were constrained so that the number of treatment and control teams was balanced at each grade level. The district and schools provided rosters of teachers on each team, and teachers were notified of their team's treatment assignment in early October.
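A minimal sketch of this block randomization follows, under stated assumptions: blocks are grade-within-school cells, even blocks are split evenly, odd blocks get a 2–1 split whose direction is chosen at random, and the additional constraint balancing treatment and control counts within each grade level district-wide is omitted for brevity. The input format is hypothetical.

```python
# Simplified sketch of the block-randomized assignment described above.
import random
from collections import defaultdict

def randomize_teams(teams, seed=0):
    """teams: list of (team_id, school, grade). Returns dict team_id -> 'T' or 'C'."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for team_id, school, grade in teams:
        blocks[(school, grade)].append(team_id)   # block = grade within school
    assignment = {}
    for block in blocks.values():
        rng.shuffle(block)
        n_treat = len(block) // 2
        if len(block) % 2 == 1:                   # odd block: 2-1 split, direction random
            n_treat += rng.randint(0, 1)
        for i, team_id in enumerate(block):
            assignment[team_id] = "T" if i < n_treat else "C"
    return assignment
```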
Students were assigned to classes by the schools, and student outcomes are analyzed according to the team's assignment. Because assignment to experimental conditions occurred after the late-August start of school, nearly all students had their class assignments prior to teams knowing their experimental condition.

Table 3 shows group means for treatment and control group students on the available demographic and achievement measures, pooled across years. The randomization produced good balance between groups on most of these characteristics; however, there are significant differences in the percentage of limited English proficiency students, which is relatively low in both groups but higher in the treatment group (4.7% versus 3.1% of students), and the percentage of talented and gifted students, which is also somewhat low in both groups but lower in the treatment group (8.8% versus 13.7%).5 The table is based on team assignments for ELA. Students assigned to teams for other subjects can differ somewhat from those on the teams for ELA instruction; however, balance for math, science, and social studies is similar, with the same two characteristics showing significant group differences. Despite this relatively good balance overall, when looking at individual years or grade levels, more of the covariates show significant group differences. All outcomes models adjust for all of the covariates listed in Table 3 to remove variability created by the potential imbalances occurring in spite of randomization.

TABLE 3
Assessment of Student Balance Between Treatment and Control Groups on ELA Teams

Covariate | Control (n = 8,592) | Treatment (n = 8,744) | Group difference | Standard error | p value
Prior-year TAKS math score | 0.073 | 0.039 | –0.035 | 0.037 | 0.348
Prior-year TAKS reading score | 0.081 | 0.023 | –0.058 | 0.035 | 0.100
Percent female | 49.7% | 50.1% | 0.003 | 0.007 | 0.678
Percent racial/ethnic minority | 46.7% | 48.1% | 0.015 | 0.010 | 0.120
Percent limited English proficiency | 3.1% | 4.7% | 0.016 | 0.007 | 0.026
Percent economically disadvantaged | 25.6% | 26.5% | 0.009 | 0.010 | 0.392
Percent designated at-risk by district | 23.2% | 24.6% | 0.014 | 0.011 | 0.184
Percent special education | 5.3% | 5.1% | –0.002 | 0.007 | 0.785
Percent talented and gifted | 13.7% | 8.8% | –0.049 | 0.024 | 0.046

Note. A joint test of significance of the group differences on these covariates produced a p value of 0.076 (calculated with a permutation test). TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts.

Measures

The study tested the effects of the bonus program on student achievement and on teacher attitudes, perceptions, and practices about the school environment, so as to capture a broad spectrum of potential pathways through which the pilot incentive program might influence student achievement.

Student achievement. For evaluating the effect of the bonus program on student outcomes, the study used both the TAKS and Stanford 10 measures. The TAKS is the state's high-stakes accountability test, administered in the spring.6 As a supplemental measure, the district administered the Stanford 10 specifically for this project in late May. As discussed above and shown in Table 1, the TAKS math and reading exams are administered in Grades 6 through 8, but the TAKS science and social studies tests are administered only in eighth grade.
The Stanford 10 provides measures in all of the core subject areas in all three grades. For the outcomes analysis, we examined the available TAKS and Stanford 10 scores at each grade level and subject area. Student TAKS and Stanford 10 normal curve equivalent scores were standardized to have mean 0 and standard deviation 1 within our sample.

Survey measures. NCPI administered two surveys each year to all of the teacher participants in the study. The surveys addressed attitudes about pay for performance, attitudes about the study, self-efficacy, collegiality, academic press (the extent to which teachers have high expectations for their students), team dynamics, parent engagement, and practices, including emphasis on standards, hands-on learning, use of tests, homework, test preparation, hours of work, and professional development. Each item asked the respondent to select from a four-choice or six-choice scale or to provide a numeric response. Most of the items were the same in the two surveys each year, and, for the most part, the surveys were the same both years.

Survey administration. In the first study year, teachers were notified about the two surveys via email in February and May of 2009. The email contained information about survey content, an explanation of how teacher confidentiality would be protected, and a link to the survey. The email also described the $150 stipend that was offered upon completion of each survey. Teachers were then able to take the survey online at their convenience. Email reminders were sent to teachers who had not participated in the survey, with a few teachers receiving as many as five reminders. We repeated these procedures in the second year of the study, administering the surveys in November 2009 and April 2010. For the purposes of this article, we focus on the responses obtained during the spring survey administrations (May 2009 and April 2010) because they provide a measure of teacher attitudes, practices, and perceptions near the end of each study year.

Participation in these surveys was strong. During Year 1, 93% of control group teachers and 99% of treatment teachers participated. Similar response rates were observed in Year 2, when 93% of control group teachers and 96% of treatment teachers participated. Across the 2 years of the study, a total of 441 teachers responded to the surveys. Of those, 83 teachers were in the treatment group both years, 79 teachers were in the control group both years, 76 teachers were in the experiment for only 1 year and assigned to the treatment group, 79 teachers were in the experiment for only 1 year and assigned to the control group, and the remaining 123 teachers were in the experiment for both years and were assigned to the treatment group one year and the control group the other year. Among survey respondents, the distribution of teacher qualifications was similar each year. Responding teachers averaged 10 years of teaching experience, and 28% had a master's degree or higher. Twenty-one percent taught English/language arts, 24% taught math, 22% taught science, 20% taught social studies, and the remaining 13% taught another subject.

Development of survey scales. The surveys assessed various practices that teachers may be likely to change as a result of being eligible to receive a bonus.
These include instructional practices, engagement in professional development, efforts to involve parents, and teachers' collaboration with their team and with same-subject teachers. We also measured contextual factors that may affect their team's dynamics. In addition, the surveys included several items about teachers' perceptions and understanding of the intervention.

We created scales from the survey responses to measure key constructs related to teachers' attitudes, team dynamics, instructional practices, self-improvement efforts, parent engagement activities, professional development, and perceptions of principal leadership. In Year 1, we created 13 composite scales by combining responses across multiple items. To create the composite scales, we reviewed each of the survey questions, computed descriptive statistics for all item-level responses (including examining full distributions), and conducted exploratory factor analyses where appropriate. The scales were constructed by calculating the average of the responses to the component items on their 4- to 6-point scales. We also administered additional items on the Year 2 spring survey, enabling us to create an additional scale on principal leadership. The 14 composite scales are shown in Table 4 along with their internal consistency reliability coefficients.

In addition to the composite scales, we created 17 item-level scales. Four items asked teachers to report the number of hours they engaged in an activity, and those scales represent the average number of hours reported. The other 13 items pertain to perceptions and understanding of the intervention, 7 of which were administered to treatment teachers only. For these 13 items, we created scales by dichotomizing the 4-point Likert-type responses (strongly agree/agree vs. disagree/strongly disagree) and calculating the percentage of teachers who endorsed the item. The 17 item-level scales are also shown in Table 4.

TABLE 4
Survey Scales and Reliability Coefficients for Composite Scales

Scale | Alpha
Group dynamics
  Collaboration among same-subject teachers | 0.83
  Hours spent meeting with same-subject teachers | —
  Collaboration among team teachers | 0.78
  Hours spent meeting with team teachers | —
  Quality of team dynamics | 0.83
Instructional practices
  Change from prior year in classroom emphasis on state standards and tests | 0.83
  Change from prior year in emphasis on hands-on activities and having students work in groups | 0.89
  Importance teachers place on test-preparation activities | 0.82
  Importance of student scores on state tests and benchmark assessments to guide instruction | 0.74
  Importance placed on student performance on classroom work and homework to guide instruction | 0.52
  Use of test scores for making instructional decisions | 0.86
  Frequency with which teachers incorporate Texas state standards into instructional planning | 0.45
  Number of hours worked outside of formal school hours | —
Professional development and self-improvement efforts
  Teachers' use of student test scores to help improve their own practice | 0.73
  Frequency of professional development activities related to collaborative aspects of teaching | 0.69
  Total amount of time spent in professional development | —
Parent engagement
  Efforts to engage parents | 0.68
Principal leadership
  Principal leadership (Year 2 only) | 0.78
Perceptions of the intervention
  The intervention provides feedback about team's effectiveness | —
  The intervention should include non-core subjects | —
  The intervention has caused resentment among teachers | —
  The intervention distinguishes effective from ineffective teams | —
  The intervention has had negative effects on my school | —
  The intervention forced teachers to teach in a certain way | —
  The intervention energized me to improve my teaching (treatment only) | —
  The intervention will not affect my teaching (treatment only) | —
  I have a clear understanding of performance criteria (treatment only) | —
  The bonus is too small (treatment only) | —
  The Frequently Asked Questions document answered my questions (treatment only) | —
  Not winning a bonus will have a negative effect on my team's teaching evaluations (treatment only) | —
  The intervention uses a fair method of awarding bonuses (treatment only) | —

Note. Coefficient alpha is shown for the composite survey scales; it was calculated using responses across treatment and control groups. Dashes indicate item-level scales for which coefficient alpha is not applicable.

Student Outcomes Analysis

We fit a hierarchical linear model to estimate and test the intervention effect on student achievement. To improve the precision of the estimates, the model includes individual student and team aggregate pre-treatment variables. As shown in Equation (1), Level 1 models individual student outcomes as a function of a team component and pre-intervention student variables, including prior TAKS mathematics and reading/ELA achievement scores and the following student demographic indicators provided by the district: gender, race/ethnicity, limited English proficiency, economically disadvantaged, at risk for academic failure,7 special education, and talented and gifted. About 9% of students were missing both prior-year TAKS scores, about 4% had a prior-year reading score but not a mathematics score, and about 2% had a prior-year mathematics score but not a reading score. For students with incomplete prior test scores, we set the missing value to zero and included in the model indicators for the four patterns of observed prior-year scores (both scores observed, no scores observed, reading but not mathematics observed, and mathematics but not reading observed). We also included interactions between the pattern indicators and the prior scores and demographic variables. All students had complete demographic data. Level 2 models the team component as a function of an indicator for the intervention group, the randomization block, aggregate student pre-intervention TAKS mathematics and reading scores, and a random component for team.
Let y_ij equal the standardized score on a given test for student j = 1, …, n_i of team i = 1, …, m, where m = 78 (Year 1) or 80 (Year 2), and let x_ij equal a K-vector of student pre-intervention variables, including prior-year TAKS reading and mathematics scores and student demographic characteristics. Level 1 of our model is

y_{ij} = \mu + \theta_i + \mathbf{x}_{ij}'\boldsymbol{\beta}_i + \varepsilon_{ij}, \qquad (1)

where θ_i is the team component and the ε_ij are independent normally distributed residual errors with variance that depends on the pattern of observed prior scores. Level 2 of the model is

\theta_i = T_i\,\delta + \mathbf{z}_i'\boldsymbol{\eta} + \sum_{g=6}^{8}\gamma_g u_{ig} + \sum_{b=1}^{B}\lambda_b v_{ib} + \zeta_i, \qquad \beta_{ik} = \beta_k,\; k = 1, \ldots, K, \qquad (2)

where z_i equals a vector of team-average prior reading and mathematics test scores, u_ig equals 1 if the team's students are enrolled in grade g and 0 otherwise, v_ib equals 1 if the team is in randomization block b and 0 otherwise, and ζ_i is a team-specific random effect that allows for correlation among the outcomes of students on the same team.
T_i is a treatment indicator, and the coefficient δ is a measure of the effect of the bonus intervention on student achievement. Tests of the null hypothesis that δ equals 0 test for the intervention effect. The primary model includes a single overall intervention effect for all three grades. In secondary models, we examine separate effects by grade. Results from the secondary models are consistent with the primary models and are not reported in this article.

We used model-based Wald tests to test the null hypothesis of no treatment effects. We also used permutation tests (Efron & Tibshirani, 1993) to test the null hypothesis. The permutation test is an alternative approach for calculating the probability of obtaining the observed effects by chance if the null hypothesis of no effect is true. The permutation test does not rely on the model assumptions, unlike the Wald test, and it ensures that our conclusions are not sensitive to those assumptions. To conduct the permutation test, we randomly reassigned the treatment indicators to teams following the randomization design, to simulate the outcomes of the experiment under alternative realizations of the randomization when the null hypothesis holds. We repeated this process 2,000 times, and for each resampled data set we estimated the treatment effect using the same model as used for the model-based estimates. The p value for the test of the null hypothesis is the proportion of times the result from the resampled data equals or exceeds the observed result in absolute value.

We conducted separate analyses for each student achievement outcome measure. For each subject, the analysis includes only students who were members of the participating teams and taught by the team teacher in that subject. For instance, if a student was taught by the English, science, and social studies teachers from the same team but not the mathematics teacher on that team, the student would be on the team and included in the analyses for English, science, and social studies, but not mathematics. As discussed above, the criterion for inclusion of a student on a team was that the student was taught at least two core subjects by team teachers. So, for example, if a student was taught by a team teacher for mathematics but not for any other subject, the student would not be included in any of the analyses.

Analyses were conducted separately for each year. A precision-weighted combination of the 2 years was used to estimate the overall effect of treatment, with p values calculated using permutation tests as described above.

An additional analysis examines effects on students across both years of the study according to their pattern of treatment. This analysis restricts the sample to students who were in the study schools both years (sixth or seventh graders in the first year of the study). Depending on each student's assignment to a team and the team's random assignment to treatment condition, each student experienced one of four patterns of treatment: control group both years, control group in Year 1 and treatment group in Year 2, treatment group in Year 1 and control group in Year 2, or treatment group in both years. This analysis compares the three groups of students who were on treatment teams at least once to those who were in the control group both years, using a joint three-degree-of-freedom chi-square test of significance.
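A compact sketch of the estimation and permutation procedure described above appears below. It is a simplified stand-in rather than the authors' code: the random intercept for teams approximates Equations (1) and (2), the missing-score indicators and their interactions are omitted, the fixed-effect specification is abbreviated, and all column names (score, treat, team, block, and so on) are hypothetical.

```python
# Illustrative sketch of the team-level random-intercept model and the
# within-block permutation test, assuming hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

FORMULA = ("score ~ treat + prior_math + prior_read + female + minority + lep "
           "+ econ_dis + at_risk + sped + gifted + team_prior_math "
           "+ team_prior_read + C(grade) + C(block)")

def estimate_effect(df: pd.DataFrame) -> float:
    """Random intercept for team approximates the team effect in Equations 1-2."""
    fit = smf.mixedlm(FORMULA, df, groups=df["team"]).fit()
    return fit.fe_params["treat"]

def permutation_p_value(df: pd.DataFrame, n_rep=2000, seed=0) -> float:
    rng = np.random.default_rng(seed)
    observed = estimate_effect(df)
    teams = df[["team", "block", "treat"]].drop_duplicates("team")
    count = 0
    for _ in range(n_rep):
        # Re-randomize treatment labels to teams within each block,
        # mirroring the blocked design used in the experiment.
        permuted = (teams.groupby("block")["treat"]
                    .transform(lambda s: rng.permutation(s.to_numpy())))
        relabeled = dict(zip(teams["team"], permuted))
        sim = df.assign(treat=df["team"].map(relabeled))
        if abs(estimate_effect(sim)) >= abs(observed):
            count += 1
    return count / n_rep
```

Because the relabeled treatment is permuted within randomization blocks, each simulated data set is a draw from the same design that generated the real assignment, which is what lets the resulting p value stand without the model's distributional assumptions.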
Because we had multiple outcomes and multiple grade levels at which to test effects, the probability of one or more significant effects arising by chance (i.e., a significant effect when there is no real intervention effect) is greater than 5%. We therefore estimated adjusted p values to account for the multiple testing within a group of tests, such as the tests for the three grade levels on a given test or the tests across subject areas for the pooled sample. For each observed test statistic, the adjusted p value equals the proportion of simulated experiments in which any one of the group's test statistics exceeded that observed statistic from the true outcomes data.

Teacher Outcomes Analyses

We pooled responses across the two spring survey administrations and then conducted two sets of analyses of the survey data. The first set of analyses examined the differences between the treatment and control groups in their opinions and practices. We used a hierarchical linear model similar to the student achievement model discussed above to compare results between the two groups. Level 1 models individual teacher survey responses as a function of a team component, several covariates (years of teaching experience, an indicator for whether the teacher had a master's degree or higher, and indicators for whether the teacher was an English language arts, math, science, or social studies teacher), and teacher-specific residual error. Level 2 models the team components as a function of the team's intervention status, fixed effects for the blocks within which teams were randomly assigned to interventions, and random effects for team, which accommodate the fact that responses from teachers on the same team may be correlated and provide accurate inferences given the cluster-randomized block design used in the experiment. For teachers' responses on the dichotomously scored scales (relating to perceptions and understanding of the intervention), we used an analogous approach but employed logistic regression. Given the relatively large number of statistical tests conducted, we again used permutation tests (described above) to adjust for multiple comparisons. Statistical significance is determined on the basis of the adjusted p values from these permutation tests.

Our second set of analyses examined differences in attitudes and practices between treatment teachers who earned and did not earn a bonus. We used a regression analysis that included random effects for grades within schools and controlled for the clustering of teachers within the same teams. We also used this approach to examine how treatment teachers' responses changed from Year 1 to Year 2 in relation to earning or not earning a bonus. To adjust for multiple comparisons in this second set of analyses, which examines the treatment group only (and for which the permutation test is therefore not applicable), we adjusted p values using a false discovery rate (FDR) procedure (Benjamini & Hochberg, 1995). An FDR is the expected proportion of statistical tests that report significant relationships when no such relationship actually exists. Applying the procedure with an FDR of 0.05 led to rejecting the null hypothesis of zero effects only if p values were less than .0016.
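The Benjamini-Hochberg step-up rule referenced above can be sketched in a few lines. The .0016 cutoff reported in the text depends on the full set of p values from these analyses, which is not reproduced here, so the function below simply returns the data-dependent cutoff for whatever p values it is given; the example values are hypothetical.

```python
def bh_cutoff(p_values, q=0.05):
    """Benjamini-Hochberg: find the largest p(i) with p(i) <= (i/m) * q; tests
    with p values at or below that cutoff are declared significant."""
    p_sorted = sorted(p_values)
    m = len(p_sorted)
    cutoff = 0.0
    for i, p in enumerate(p_sorted, start=1):
        if p <= q * i / m:
            cutoff = p
    return cutoff

# Example with hypothetical p values: only the first two fall at or below the cutoff.
print(bh_cutoff([0.001, 0.012, 0.031, 0.20, 0.44], q=0.05))  # -> 0.012
```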
6. Results

Student Outcomes

Analysis of student achievement outcomes reveals no overall intervention effect in any subject area across the 2 years of the experiment. The effects, displayed in Table 5, were estimated through a precision-weighted combination of the Year 1 and Year 2 analyses (shown below) and tested with permutation tests. The effect size estimates are very small in each subject area.

TABLE 5
Student Outcome Results, Weighted Combination of Years 1 and 2

Subject area | Exam | N students | N teams | Standardized effect size | p value
Math | TAKS | 13,359 | 159 | 0.007 | 0.699
Math | SAT10 | 12,339 | 159 | –0.008 | 0.704
Reading/ELA | TAKS | 16,594 | 159 | –0.002 | 0.884
Reading/ELA | SAT10 reading | 15,460 | 159 | 0.009 | 0.554
Reading/ELA | SAT10 language | 15,026 | 159 | –0.001 | 0.962
Science | SAT10 | 15,531 | 159 | 0.026 | 0.192
Social studies | SAT10 | 15,648 | 159 | 0.038 | 0.112

Note. ELA = English language arts; SAT10 = Stanford Achievement Test Series, Tenth Edition; TAKS = Texas Assessment of Knowledge and Skills.

Results are very similar when looking at individual years: Table 6 shows Year 1 results and Table 7 shows Year 2 results. Again, the effect sizes are typically very small, with small standard errors.

TABLE 6
Student Outcome Results, Year 1

Subject area | Exam | N students | N teams | Standardized effect size | Standard error | p value
Math | TAKS | 6,850 | 78 | 0.016 | 0.025 | 0.486
Math | SAT10 | 6,299 | 78 | –0.012 | 0.024 | 0.593
Reading/ELA | TAKS | 8,051 | 78 | 0.006 | 0.021 | 0.754
Reading/ELA | SAT10 reading | 7,507 | 78 | –0.006 | 0.021 | 0.787
Reading/ELA | SAT10 language | 7,507 | 78 | –0.007 | 0.056 | 0.888
Science | SAT10 | 7,454 | 78 | 0.019 | 0.032 | 0.556
Social studies | SAT10 | 7,565 | 78 | 0.031 | 0.034 | 0.361

Note. SAT10 = Stanford Achievement Test Series, Tenth Edition; TAKS = Texas Assessment of Knowledge and Skills; ELA = English language arts.

TABLE 7
Student Outcome Results, Year 2

Subject area | Exam | N students | N teams | Standardized effect size | Standard error | p value
Math | TAKS | 6,509 | 81 | –0.007 | 0.031 | 0.808
Math | SAT10 | 6,040 | 81 | 0.002 | 0.038 | 0.951
Reading/ELA | TAKS | 8,543 | 81 | –0.010 | 0.020 | 0.583
Reading/ELA | SAT10 reading | 7,953 | 81 | 0.025 | 0.022 | 0.229
Reading/ELA | SAT10 language | 7,519 | 81 | 0.000 | 0.024 | 0.993
Science | SAT10 | 8,077 | 81 | 0.032 | 0.027 | 0.219
Social studies | SAT10 | 8,083 | 81 | 0.045 | 0.034 | 0.164

Note. ELA = English language arts; TAKS = Texas Assessment of Knowledge and Skills; SAT10 = Stanford Achievement Test Series, Tenth Edition.

Finally, Table 8 shows the results of the 2-year analysis of student outcomes based on their pattern of being taught by treatment or control teams. The table shows effect estimates for the three groups of students who were taught by treatment teams at least once during the experiment, comparing their outcomes to those of students who were taught by control teams both years. Like the other results, the treatment effect estimates are very small and not significant. In particular, there is no evidence of the emergence of an effect on the scores of students who were taught by treatment teams both years.

TABLE 8
Student Outcome Results, Two-Year Effects by Pattern of Treatment

Subject area | Exam | N students | N teams | Treatment teams Year 1 only: effect size (SE) | Treatment teams Year 2 only: effect size (SE) | Treatment teams both years: effect size (SE) | p value
Math | TAKS | 3,274 | 54 | 0.049 (0.034) | –0.019 (0.048) | 0.020 (0.048) | 0.330
Math | SAT10 | 3,050 | 54 | 0.015 (0.039) | –0.011 (0.057) | –0.004 (0.057) | 0.968
Reading/ELA | TAKS | 4,917 | 54 | –0.029 (0.028) | –0.040 (0.036) | –0.022 (0.030) | 0.035*
Reading/ELA | SAT10 reading | 4,604 | 54 | –0.008 (0.028) | –0.040 (0.036) | –0.002 (0.036) | 0.500
Reading/ELA | SAT10 language | 4,428 | 54 | 0.011 (0.033) | –0.040 (0.035) | 0.000 (0.036) | 0.488
Science | SAT10 | 4,644 | 54 | 0.003 (0.032) | –0.018 (0.047) | 0.001 (0.047) | 0.925
Social studies | SAT10 | 4,754 | 54 | 0.065 (0.034) | 0.025 (0.038) | 0.036 (0.038) | 0.332

Note. ELA = English language arts; TAKS = Texas Assessment of Knowledge and Skills; SAT10 = Stanford Achievement Test Series, Tenth Edition.
* Significance does not survive adjustment for multiple comparisons.

Comparisons Between Treatment and Control Teachers

In this section, we first compare treatment and control teachers' attitudes and perceptions about the intervention and their practices. Table 9 presents descriptive statistics for the treatment and control groups, along with estimates of the differences between groups on these measures.
TABLE 9
Comparison of Control and Treatment Teachers' Attitudes, Perceptions, and Behaviors

Dependent variable | Control n | Control M (SD) | Treatment n | Treatment M (SD) | Standardized effect size | Standard error
Group dynamics
  Collaboration among same-subject teachers | 346 | 2.50 (0.60) | 353 | 2.54 (0.58) | 0.08 | 0.07
  Hours spent meeting with same-subject teachers | 344 | 4.19 (3.54) | 353 | 4.20 (4.36) | –0.02 | 0.07
  Collaboration among team teachers | 345 | 2.60 (0.53) | 355 | 2.65 (0.56) | 0.09 | 0.08
  Hours spent meeting with team teachers | 346 | 4.92 (4.24) | 355 | 4.85 (4.84) | 0.02 | 0.07
  Quality of team dynamics | 345 | 4.35 (0.65) | 355 | 4.32 (0.68) | –0.05 | 0.11
Instructional practices
  Change in classroom emphasis on state standards and tests | 339 | 3.48 (0.55) | 349 | 3.50 (0.57) | 0.00 | 0.08
  Change in emphasis on hands-on activities and having students work in groups | 339 | 3.64 (0.69) | 349 | 3.60 (0.72) | –0.02 | 0.08
  Importance teachers place on test-preparation activities | 346 | 3.27 (0.58) | 355 | 3.31 (0.57) | 0.05 | 0.07
  Importance of student scores on state tests and benchmark assessments to guide instruction | 346 | 3.04 (0.74) | 355 | 3.10 (0.72) | 0.10 | 0.08
  Importance placed on student performance on classroom work and homework to guide instruction | 346 | 3.71 (0.48) | 355 | 3.70 (0.45) | –0.01 | 0.08
  Use of test scores for making instructional decisions | 345 | 3.32 (0.52) | 353 | 3.29 (0.53) | –0.08 | 0.08
  Frequency with which teachers incorporate Texas state standards into instructional planning | 345 | 5.09 (0.73) | 353 | 4.97 (0.81) | –0.22 | 0.07
  Number of hours worked outside of formal school hours | 346 | 11.80 (7.68) | 352 | 12.19 (8.27) | 0.08 | 0.08
Professional development and self-improvement efforts
  Teachers' use of student test scores to help improve their own practice | 343 | 2.12 (0.79) | 348 | 2.10 (0.83) | –0.02 | 0.08
  Frequency of professional development activities related to collaborative aspects of teaching | 346 | 3.21 (0.90) | 355 | 3.24 (0.86) | 0.06 | 0.07
  Total amount of time spent in professional development | 341 | 42.17 (36.72) | 350 | 43.63 (42.06) | 0.04 | 0.08
Parent engagement
  Efforts to engage parents | 346 | 2.42 (0.52) | 354 | 2.51 (0.50) | 0.22 | 0.07
Principal leadership
  Principal leadership (Year 2 only) | 172 | 3.17 (0.58) | 174 | 3.22 (0.59) | 0.05 | 0.18
Perceptions of the intervention
  The intervention provides feedback about team's effectiveness | 346 | 0.38 (0.49) | 352 | 0.41 (0.49) | 0.19 | 0.21
  The intervention should include noncore subjects | 344 | 0.60 (0.49) | 349 | 0.55 (0.50) | –0.25 | 0.21
  The intervention has caused resentment among teachers | 342 | 0.34 (0.47) | 351 | 0.29 (0.45) | –0.28 | 0.27
  The intervention distinguishes effective from ineffective teams | 343 | 0.12 (0.33) | 354 | 0.09 (0.28) | –0.33 | 0.29
  The intervention has had negative effects on my school | 343 | 0.26 (0.44) | 351 | 0.22 (0.42) | –0.44 | 0.18
  The intervention forced teachers to teach in a certain way | 342 | 0.13 (0.34) | 351 | 0.07 (0.25) | –0.83 | 0.18

Note. None of the treatment effect estimates is significant after adjustment for multiple comparisons.
Comparisons Between Treatment and Control Teachers

In this section, we first compare treatment and control teachers' attitudes and perceptions about the intervention and their practices. Table 9 presents descriptive statistics for the treatment and control groups, along with estimates of the differences between groups on these measures. To summarize the results in the table, treatment and control group teachers responded similarly on all of the survey scales, and no statistically significant differences were detected between groups after adjustment for multiple hypothesis testing. The bonus intervention had no effect on perceptions of team dynamics: the two groups reported similar levels of collegiality and similar hours spent collaborating with teachers on and off their teams, and both reported similarly high perceived quality of team dynamics. Perceptions of the intervention likewise did not differ between the experimental groups. Regardless of treatment condition, fewer than one quarter of teachers believed that the intervention had negative effects on their school, and about one tenth believed it forced teachers to teach in a certain way. Furthermore, fewer than one third of teachers believed the intervention caused resentment among teachers. There was also evidence that teachers were not entirely supportive of the intervention. Only 38% of control teachers and 41% of treatment teachers believed that the intervention could provide feedback about their team's effectiveness. The percentage of teachers who believed that the intervention could distinguish effective from ineffective teams was particularly low, with 12% of control teachers and 9% of treatment teachers endorsing that item. Furthermore, the majority of teachers in both the treatment and control groups believed that the intervention should include noncore subject teachers (e.g., music). Taken together, these results indicate that most teachers did not believe the intervention had negative effects on their school or their attitudes, but they were nonetheless skeptical that it could provide useful information about teaching effectiveness, and they felt the intervention could be improved by including noncore subject teachers in the calculation of team bonuses.

Comparisons of Treatment Teachers Who Earned a Bonus to Those Who Did Not

Another important aspect of the study is exploring whether there were differences in attitudes, perceptions, and practices between the treatment teachers who would ultimately win a bonus and those who would not. We also examine how those measures changed after teachers were informed of Year 1 bonus results. These analyses are nonexperimental and should be interpreted with caution.

Mean Differences in Attitudes, Perceptions, and Practices

In this section, we compare the survey responses of teachers who would go on to win a bonus with those of teachers who would not. At the time teachers responded to the survey, they did not know whether they would win a bonus for that academic year. Table 10 shows descriptive statistics for the bonus winners and nonwinners, along with estimates of the differences between groups.
A positive coefficient indicates that the treatment teachers who earned a bonus had higher scores on the scale, while a negative coefficient indicates that the teachers who did not earn a bonus had higher scores. None of these differences were significant after adjustment for multiple tests. One trend worth noting is that teachers who would go on to win a bonus tended to place less emphasis on standardized tests than teachers who would not: relative to eventual bonus winners, teachers who did not win a bonus reported putting more emphasis on TAKS test preparation (including practicing test-taking skills and using TAKS preparation materials) and on using scores from the TAKS and district benchmark tests to guide their instruction.

TABLE 10
Treatment Teachers' Attitudes, Perceptions, and Practices by Whether or Not They Would Ultimately Be Awarded a Bonus at the End of the Year

Dependent Variable | Would Earn Bonus N | M (SD) | Would Not Earn Bonus N | M (SD) | Standardized Effect Size (Standard Error)
Group dynamics
Collaboration among same-subject teachers | 231 | 2.58 (0.58) | 120 | 2.45 (0.56) | –0.09 (0.08)
Hours spent meeting with same-subject teachers | 231 | 4.29 (4.20) | 120 | 4.07 (4.70) | –0.48 (0.55)
Collaboration among team teachers | 233 | 2.62 (0.57) | 120 | 2.70 (0.55) | –0.01 (0.08)
Hours spent meeting with team teachers | 233 | 4.52 (4.16) | 120 | 5.53 (5.91) | 0.78 (0.51)
Quality of team dynamics | 233 | 4.25 (0.74) | 120 | 4.47 (0.52) | 0.22 (0.11)
Instructional practices
Change in classroom emphasis on state standards and tests | 231 | 3.55 (0.61) | 116 | 3.38 (0.46) | –0.12 (0.08)
Change in emphasis on hands-on activities and having students work in groups | 230 | 3.67 (0.74) | 117 | 3.49 (0.67) | –0.15 (0.10)
Importance teachers place on test-preparation activities | 234 | 3.37 (0.56) | 119 | 3.18 (0.57) | –0.18 (0.07)
Importance of student scores on state tests and benchmark assessments to guide instruction | 234 | 3.21 (0.67) | 119 | 2.90 (0.77) | –0.23 (0.10)
Importance placed on student performance on classroom work and homework to guide instruction | 234 | 3.69 (0.44) | 119 | 3.71 (0.48) | 0.02 (0.06)
Use of test scores for making instructional decisions | 232 | 3.28 (0.55) | 119 | 3.33 (0.51) | 0.08 (0.07)
Frequency with which teachers incorporate Texas state standards into instructional planning | 232 | 4.95 (0.85) | 119 | 5.01 (0.73) | 0.09 (0.10)
Number of hours worked outside of formal school hours | 232 | 12.92 (8.85) | 118 | 10.75 (6.88) | –1.53 (1.05)
Professional development and self-improvement efforts
Teachers' use of student test scores to help improve their own practice | 228 | 2.15 (0.81) | 118 | 2.00 (0.88) | –0.15 (0.11)
Frequency of professional development activities related to collaborative aspects of teaching | 233 | 3.28 (0.88) | 120 | 3.18 (0.80) | –0.12 (0.11)
Total amount of time spent in professional development | 231 | 44.47 (40.66) | 117 | 42.62 (44.91) | –1.81 (4.83)
Parent engagement
Efforts to engage parents | 233 | 2.49 (0.50) | 119 | 2.55 (0.50) | 0.09 (0.07)
Principal leadership
Principal leadership (Year 2 only) | 120 | 3.21 (0.63) | 54 | 3.25 (0.52) | 0.11 (0.13)
Perceptions of the intervention
The intervention provides feedback about team's effectiveness | 231 | 0.43 (0.50) | 119 | 0.37 (0.48) | –0.18 (0.35)
The intervention should include noncore subjects | 230 | 0.58 (0.49) | 117 | 0.47 (0.50) | –0.57 (0.29)
The intervention has caused resentment among teachers | 231 | 0.29 (0.45) | 118 | 0.30 (0.46) | –0.23 (0.36)
The intervention distinguishes effective from ineffective teams | 233 | 0.12 (0.32) | 119 | 0.03 (0.18) | –1.16 (0.58)
The intervention has had negative effects on my school | 231 | 0.20 (0.40) | 118 | 0.26 (0.44) | 0.19 (0.36)
The intervention forced teachers to teach in a certain way | 230 | 0.09 (0.28) | 119 | 0.03 (0.18) | –0.92 (0.57)
The intervention will not affect my teaching | 234 | 0.76 (0.43) | 120 | 0.83 (0.38) | 0.38 (0.34)
I have a clear understanding of performance criteria | 232 | 0.48 (0.50) | 119 | 0.44 (0.50) | –0.20 (0.28)
The bonus is too small | 228 | 0.17 (0.38) | 113 | 0.18 (0.38) | 0.05 (0.35)
The Frequently Asked Questions document answered my questions | 227 | 0.63 (0.48) | 116 | 0.59 (0.49) | –0.22 (0.29)
Not winning a bonus will have a negative effect on my team's teaching evaluations | 230 | 0.12 (0.32) | 119 | 0.06 (0.24) | –0.65 (0.47)
The intervention uses a fair method of awarding bonus | 228 | 0.44 (0.50) | 116 | 0.34 (0.48) | –0.33 (0.30)

None of these differences are significant after adjustment for multiple comparisons.

Change in Responses Over Time by Bonus Status

Another important analysis examines how treatment teachers responded to receiving or not receiving a bonus. Specifically, we examined how teachers' responses in Year 1, which were obtained before teachers were informed of their bonus status, compared with their responses in Year 2, which were obtained after they were informed of their Year 1 bonus status.8 Table 11 shows changes in survey responses from Year 1 to Year 2 for teachers who did and did not earn a bonus in Year 1. Similar data are not available for teachers who did or did not win awards in Year 2 because we did not administer a follow-up survey. Again, none of the differences were significant after adjustment for multiple tests. For the most part, teachers who earned a Year 1 bonus showed changes in attitudes and practices over time similar to those of teachers who did not win a bonus. There were no differences on the group dynamics, professional development, and parent engagement measures, and there were no differences in changes on the instructional practices scales. A notable though nonsignificant trend concerned the extent to which teachers used scores from the TAKS and district benchmark tests to guide their instruction. Teachers who had not won a bonus reported decreased emphasis on standardized test scores in Year 2 relative to Year 1, whereas teachers who had won a bonus reported increased emphasis on standardized test scores in Year 2. Though not shown in the table, it is important to note that despite the increase over time, teachers who had won a Year 1 bonus continued in Year 2 to report less emphasis on standardized test scores than teachers who had not won a bonus. There were also no differences between bonus winners and nonwinners with respect to changes in their perceptions of the intervention over time, although small sample sizes may have limited our ability to detect differences. One interesting result is that teachers' attitudes toward the size of the bonus award differed by bonus status: teachers who did not win a bonus became more likely to endorse the statement that the bonus was too small to motivate them to work harder, while teachers who had won a bonus showed the opposite pattern. The difference was large, although not significant after adjustment for multiple tests. Overall, winning a bonus did not appear to materially change teachers' attitudes and practices.
There were generally no differences with respect to changes in collaboration, professional development, parent engagement, instructional practices, and perceptions of the intervention for teachers who won a Year 1 bonus compared with teachers who did not.

7. Discussion

There are a variety of possible explanations for why we did not observe any differences in the achievement of students taught by treatment and control teams or in the attitudes and practices of the treatment and control group teachers. First, the single-academic-year randomized trials may have been too brief to detect meaningful treatment effects. The Year 1 spring survey was administered 8 months after the experiment started, and the Year 2 survey was administered 20 months after the experiment started but, again, only 8 months after randomization in that year.

TABLE 11
Changes in Attitudes, Perceptions, and Practices of Teachers From Year 1 to Year 2 by Whether They Earned a Bonus in Year 1
Columns: Dependent Variable; Earned Bonus in Year 1 (N, Δ1, SD); Did Not Earn Bonus in Year 1 (N, Δ1, SD); Standardized Effect Size; Standard Error

Group dynamics
Collaboration among same-subject teachers 54 0.02
Hours spent meeting with same-subject teachers 54 1.20
Collaboration among team teachers 54 0.05
Hours spent meeting with team teachers 54 0.69
Quality of team dynamics 54 –0.01
Instructional practices
Change in classroom emphasis on state standards and tests 53 –0.19
Change in emphasis on hands-on activities and having students work in groups 53 –0.30
Importance teachers place on test-preparation activities 54 –0.01
Importance of student scores on state tests and benchmark assessments to guide instruction 54 0.24
Importance placed on student performance on classroom work and homework to guide instruction 54 –0.09
Use of test scores for making instructional decisions 54 –0.01
Frequency with which teachers incorporate Texas state standards into instructional planning 53 0.08
Number of hours worked outside of formal school hours 53 0.91
Professional development and self-improvement efforts
Teachers' use of student test scores to help improve their own practice 53 –0.01
Frequency of professional development activities related to collaborative aspects of teaching 54 –0.05
Total amount of time spent in professional development 53 –5.83
Parent engagement
Efforts to engage parents 54 0.01
Perceptions of the intervention
The intervention provides feedback about team's effectiveness 52 0.06
The intervention should include noncore subjects 51 0.06
The intervention has caused resentment among teachers 52 0.26
The intervention distinguishes effective from ineffective teams 53 0.04
The intervention has had negative effects on my school 52 0.16
The intervention forced teachers to teach in a certain way 53 –0.02
The intervention energized me to improve my teaching 28 –0.04
The intervention will not affect my teaching 28 0.07
I have a clear understanding of performance criteria 28 0.18
The bonus is too small 26 –0.11
The Frequently Asked Questions document answered my questions 28 0.07
Not winning a bonus will have a negative effect on my team's teaching evaluations 28 0.03
The intervention uses a fair method of awarding bonus 28 –0.04
0.48 5.80 0.56 2.85 0.53 83 –0.17 0.63 83 0.86 6.10 84 –0.11 0.53 84 0.60 6.43 84 –0.11 0.83 0.31 –0.25 0.24 –0.01 0.16 0.18 0.31 0.19 0.19 0.17 0.55 84 –0.14 0.72 –0.02 0.14 0.77 83 –0.07 0.90 –0.14 0.16 0.52 85 –0.01 0.51 0.71 85 –0.16 0.78 –0.00 0.49 0.18 0.18 0.61 85 0.06 0.52 –0.25 0.19 0.64 84
0.71 82 0.06 0.54 0.16 0.84 –0.16 –0.14 0.20 0.18 5.34 84 0.83 7.76 0.01 0.20 0.82 80 –0.03 0.90 0.13 0.19 0.97 84 –0.13 0.90 0.04 0.18 60.23 83 –7.88 53.92 0.03 0.20 0.04 0.52 –0.09 0.19 0.60 82 –0.08 0.52 0.16 0.12 0.50 80 –0.04 0.48 0.56 80 0.06 0.50 0.27 82 –0.05 0.43 0.09 0.20 0.12 0.10 0.10 0.09 0.02 0.42 0.05 0.38 0.12 –0.07 0.11 0.06 0.50 51 –0.19 0.45 0.15 0.15 0.08 0.30 –0.27 0.15 0.59 0.14 0.11 0.13 0.02 0.32 –0.02 0.10 0.42 48 –0.21 0.46 0.16 0.11 0.41 54 0.58 79 0.23 78 0.53 0.60 0.51 0.65 51 0.02 0.55 50 –0.08 0.49 49 0.14 0.41 49 –0.02 0.48 0.33 50

1 Δ is the group mean for Year 2 minus the group mean for Year 1; none of these differences are significant after adjustment for multiple comparisons.

Student assessments were administered on similar schedules. This may not have given treatment teachers sufficient time to change their practices or to experience conditions that would change their perceptions or attitudes, and it may not have allowed enough time for changes in practices to affect student outcomes. Previous research suggests that it can take considerable time for teachers to change their practices (Mayer, 1999).

Second, treatment teachers may not have fully understood the pilot bonus program. The majority (54%) of teachers did not understand the criteria for earning a bonus. This was a problem even among the treatment teachers who indicated that they had read the FAQ document, which explained the methods used to calculate which teams would be awarded a bonus; among these teachers, 39% indicated they lacked clarity regarding the criteria for earning a bonus. Moreover, a majority of teachers (59%) did not think that the method used to award the bonus was fair to teachers, possibly because they did not understand how the system was designed to reduce competition. Overall, these results suggest that teachers did not fully understand how their performance would translate into a bonus.

Third, the opportunity to win a bonus appeared to be a weak incentive, as only one quarter of treatment teachers endorsed the statement "the chance to earn a bonus award has energized me to improve my teaching." Furthermore, the vast majority of treatment teachers (78%) indicated that they would not change their practice in order to win the bonus.

Finally, the majority of teachers in this study did not believe the intervention forced them to teach in a certain manner, and the majority did not believe that the experiment had negative effects on their schools or caused resentment among teachers, contrary to concerns some teachers raise about pay for performance (Solmon & Podgursky, 2001). However, these misgivings were present among a substantial minority of teachers: nearly one third reported that the intervention created resentment among teachers, and a quarter reported negative effects on their schools. Moreover, many teachers were skeptical about other aspects of the intervention. The majority of teachers believed the intervention to be incomplete, in that it did not account for the teaching effects of teachers in noncore subjects. Teachers also questioned whether the intervention could distinguish effective teachers from ineffective teachers, could provide feedback about their team's effectiveness, and could fairly assign bonuses to teachers.
For the latter result, it is important to keep in mind that many treatment teachers indicated they did not fully understand the criteria for awarding bonuses, so they may not have had adequate knowledge to fully evaluate the fairness of the method.9 Nonetheless, it is important that teachers' uncertainty and misgivings about a pay-for-performance system be alleviated to the extent possible in order to ensure optimal outcomes.

Taken together, these factors (the relatively short duration of the experiment, treatment teachers' lack of understanding of the intervention, teachers' reports that the potential to win a bonus did not induce change in their practice, and misgivings about the intervention among a substantial minority of teachers) may help to explain the lack of differences in perceptions and practices between the treatment and control groups. Marsh et al. (2011) summarize prior research suggesting that many of these factors are important to the success of pay-for-performance programs: participants must understand the program, buy into the program and the criteria used for selecting winners, believe the system is fair and that they are capable of achieving an award, and find the award valuable enough to inspire efforts to achieve it.

Alternatively, the intervention might have caused changes in student outcomes that were masked by similar changes in the control group. In a phenomenon sometimes referred to as a John Henry effect (Saretsky, 1975), the control group exerts extra effort in response to participating in a study, and teacher misgivings about the intervention could potentially exacerbate such an effect. Available data do not enable us to test this speculation.

The study did not find evidence that the free-rider problem was an issue in this program. Three survey items in the quality of team dynamics scale are particularly relevant to this issue. On average, treatment teachers reported that team members demonstrated commitment to the team by putting in extra time and effort to help it succeed (4.46 on a scale of 1 to 5, where 4 indicates the statement is somewhat accurate and 5 indicates it is very accurate) and that everyone on the team is motivated to have its students succeed (4.64). They also tended to disagree that some members of the team do not carry their fair share of the overall workload (2.01, where 2 indicates the statement is somewhat inaccurate). Moreover, control group teachers responded very similarly on these items (4.49, 4.57, and 2.03, respectively), suggesting that participation in the bonus program did not affect teacher reports related to free riding.

Overall, while these results further our understanding of how group-based compensation plans affect teachers, more research is needed. First, the study does not capture effects of the intervention that might occur through changes to the composition of the teaching workforce. Second, future research could improve our understanding of which particular features of the intervention were accepted by teachers and which were in need of improvement, so that compensation plans could be better designed to promote teacher buy-in. For example, interviews with teachers could identify which aspects of the bonus criteria were unclear, and the results of such an analysis could help inform future designs of pay-for-performance plans.
Relatedly, teachers could shed light on why they believe the current method used to calculate bonuses gives only limited information about teachers' effectiveness, and they might suggest ways the system could provide better feedback about a team's effectiveness. Given that teacher buy-in has been shown to depend on program design (Lavy, 2007), teachers could also discuss the pros and cons of designing pay-for-performance plans that provide additional compensation on an individual teacher, team, or whole-school basis. Some studies have suggested that teachers may not be motivated by financial incentives and may prefer other types of nonmonetary rewards (Firestone & Pennell, 1993). Future studies should examine how the opportunity to win a monetary bonus compares with the opportunity to obtain other types of incentives, such as choice of team members, access to instructional materials, or greater opportunity for professional development. Finally, in-depth case studies of teams of teachers who win a bonus could help identify the practices and conditions that led to their success, in particular the importance of teamwork and other specific practices.

The lack of an effect of the pay-for-performance system in this study is consistent with other recent experiments studying pay-for-performance systems in education, including studies of bonus awards for individual performance (Springer et al., 2010) or whole-school performance (Fryer, 2011; Marsh et al., 2011). These studies shared several features that may provide additional explanation for the lack of effects: the financial awards were an add-on to standard pay, performance was measured separately from the districts' standard evaluations of teachers (except in one of the programs evaluated), and there was no professional development specifically connected to these programs.

Appendix

Student Linkages

Students were associated with a team if they received continuous instruction in two or more of the core subject areas of mathematics, reading/English language arts, science, and social studies from team teachers during the period from the fall snapshot date until test administration. According to district records, the fall snapshot date was October 31, 2008, and because of differing test administration dates, we used May 12, 2009, as the end of the enrollment window in spring. A student was continuously enrolled if he or she was enrolled in the same school during that interval and enrolled in courses taught by team teachers for a subject area for every day of the interval. Using data provided to the project by the district, we determined all students continuously enrolled in each of the district's nine middle schools. We identified each student's primary instruction teachers for each core subject during the enrollment window and determined whether each student received primary instruction from a single team teacher during the enrollment window.10 Students were assigned to a team if they received continuous instruction in two or more core subject areas from teachers on that team. Not all students were assigned to a team, because some students were not continuously enrolled or did not receive instruction in two core subjects from teachers on the same team. A small number of students received instruction in two core subjects from teachers on one team and instruction in the other two subjects from teachers on a second team; in these cases, students were assigned to both teams.

For each student assigned to a team, we determined whether a team teacher provided instruction in each core subject area. A student could be part of a team but not receive continuous instruction from team teachers in every subject area. This scenario is most common for mathematics, where a student might be instructed by teachers on Team A for English, science, and social studies but switch to a teacher from Team B for a mathematics course that is more appropriate for his or her achievement level. For a student's achievement to contribute to a team's performance measure in a given subject area, the student must be part of the team and receive continuous instruction from a team teacher in that subject. Students who were part of a team but who received continuous instruction from a teacher on a different team, or who did not receive continuous instruction in a subject area, did not contribute to estimates of the performance measures for that subject area on that team.
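The linkage rules above can be summarized in a short sketch. This is a hypothetical illustration of the logic rather than the project's actual code: the record layout (a student identifier, a core subject, the team whose teacher taught it, and a flag indicating continuous instruction by a single team teacher over the snapshot-to-test window) is assumed for the example.

```python
from collections import defaultdict

CORE_SUBJECTS = {"math", "reading_ela", "science", "social_studies"}

def link_students_to_teams(records):
    # For each student, collect the core subjects continuously taught by each team.
    per_student = defaultdict(lambda: defaultdict(set))
    for r in records:
        if r["subject"] in CORE_SUBJECTS and r["continuous"]:
            per_student[r["student_id"]][r["team"]].add(r["subject"])

    team_assignments = defaultdict(set)       # student_id -> teams (may be two)
    subject_contributions = defaultdict(set)  # (team, subject) -> student_ids
    for student, by_team in per_student.items():
        for team, subjects in by_team.items():
            # A student joins a team only with continuous instruction in 2+ core subjects.
            if len(subjects) >= 2:
                team_assignments[student].add(team)
    for student, teams in team_assignments.items():
        for team in teams:
            # The student contributes to a team's measure in a subject only if a
            # teacher on that team provided the continuous instruction in it.
            for subject in per_student[student][team]:
                subject_contributions[(team, subject)].add(student)
    return team_assignments, subject_contributions
```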
Notes

1. Teachers assigned to teams in both years of the study are counted twice. The same is true for the student counts below.

2. We used a statistical model to predict each student's expected achievement on each subject area test in school year 2008–09. The model used achievement on 3 prior years of TAKS math and reading tests as predictors, and a separate model was used for each subject. The statistical model is essentially the same as the model described in Wright, Sanders, and Rivers (2006) and the multivariate ANCOVA method described in McCaffrey, Han, and Lockwood (2009). In particular, a student's expected current achievement in a given subject area if he or she were taught by the average-performing team is assumed to be a linear function of all of his or her prior three math and reading TAKS scores (where both the current and prior scores have been rescaled to normal curve equivalents to place all tests on the same point range; see the note below for more on the scaling). The model uses data from all students, even those with incomplete prior test records, by using the pattern of prior tests completed as part of the model. See Wright, Sanders, and Rivers (2006) or McCaffrey, Han, and Lockwood (2009) for more details on modeling with incomplete records.

3. For instance, in the POINT study conducted in Nashville, Tennessee, which used fixed benchmarks for determining awards, the proportion of teachers who earned a bonus increased from 29% to 52% over the 3 years of the study. Similarly, the SPBP in New York City used a fixed benchmark for determining which schools would receive awards for their staff, and in the second year 80% of schools earned a full bonus, up from 47% in the first year.

4. For example, suppose the Top 10 teams at each grade level qualify for bonuses. If there are three teams for each grade in each school and the 11th place team is in the same school as one of the 10 winners, that 11th place team is also designated a winner under this policy. If the 12th place team is also in the same school, that team is designated a winner too. However, a 12th place team located in any other school would not be designated a winner under this rule, as it has not been denied a place in the Top 10 by a higher ranking team in its school. (A short sketch formalizing this rule follows these notes.)
5. Tests for significance of group differences were conducted using permutation tests (discussed below), without adjustment for multiple comparisons.

6. Most students in Grades 6 and 7 were tested in math and reading in late April. Most students in Grade 8 were tested in reading in early March, math in early April, and social studies and science in late April or early May. Grade 8 students can take the TAKS multiple times if they are not proficient on the first attempt; we included only scores from students' first attempt.

7. The district's precise definition of this indicator is not known.

8. We conducted additional sensitivity analyses in which we controlled for treatment condition in Year 2, and the results were similar to those reported here.

9. This item was not asked of control teachers.

10. We identified four cases in which a team teacher left the district during the school year and was replaced by another teacher or long-term substitute. These replacement teachers were also considered team teachers, so the students instructed by the original and the replacement teacher were considered as receiving continuous instruction provided they were otherwise continuously enrolled.
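The winner-designation rule illustrated in Note 4 can also be expressed compactly. The sketch below is our generalization of the worked example in that note (a team is designated a winner when its rank, after discounting higher-ranked teams from its own school, falls within the cutoff); the program's actual implementation may have differed.

```python
def designate_winners(teams, cutoff=10):
    # teams: list of (team_id, school_id, performance_score) for one grade level.
    ranked = sorted(teams, key=lambda t: t[2], reverse=True)  # best score first
    winners = []
    for i, (team_id, school_id, _) in enumerate(ranked):
        # Count higher-ranked teams from the same school at this grade level.
        higher_same_school = sum(1 for other in ranked[:i] if other[1] == school_id)
        # E.g., an 11th place team with one same-school team above it still qualifies.
        if (i + 1) - higher_same_school <= cutoff:
            winners.append(team_id)
    return winners
```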
References

Azordegan, J., Byrnett, P., Campbell, K., Greenman, J., & Coulter, T. (2005). Diversifying teacher compensation. Denver, CO: Education Commission of the States. Retrieved November 3, 2010, from http://www.ecs.org/clearinghouse/65/83/6583.pdf

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1), 289–300.

Berg, P., Appelbaum, E., Bailey, T., & Kalleberg, A. (1996). The performance effects of modular production in the apparel industry. Industrial Relations, 35(3), 356–373.

Che, Y., & Yoo, S. (2001). Optimal incentives for teams. American Economic Review, 91(3), 525–541.

Condly, S. J., Clark, R. E., & Stolovitch, H. D. (2003). The effects of incentives on workplace performance: A meta-analytic review of research studies. Performance Improvement Quarterly, 16(3), 46–63.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman & Hall/CRC.

Firestone, W. A., & Pennell, J. R. (1993). Teacher commitment, working conditions, and differential incentive policies. Review of Educational Research, 63(4), 489–525.

Fryer, R. G. (2011). Teacher incentives and student achievement: Evidence from New York City public schools (Working Paper Series). Cambridge, MA: National Bureau of Economic Research.

Glazerman, S., & Seifullah, A. (2010). An evaluation of the Teacher Advancement Program (TAP) in Chicago: Year two impact report (Reference Number 6319-520). Washington, DC: Mathematica Policy Research.

Glewwe, P., Ilias, N., & Kremer, M. (2010). Teacher incentives. American Economic Journal, 2(3), 205–227.

Gratz, D. B. (2009). The problem with performance pay. Educational Leadership, 67(3), 76–79.

Hamilton, B. H., Nickerson, J. A., & Owan, H. (2003). Team incentives and worker heterogeneity: An empirical analysis of the impact of teams on productivity and participation. Journal of Political Economy, 111(3), 465–497.

Hoerr, T. (1998). A case for merit pay. Phi Delta Kappan, 80(4), 326–327.

Kandel, E., & Lazear, E. P. (1992). Peer pressure and partnerships. Journal of Political Economy, 100(4), 801–817.

Kaufman, B. E. (2008). Work motivation: Insights from economics. In R. Kanfer & G. Chen (Eds.), Work motivation: Past, present, and future (pp. 588–600). New York: Routledge/Taylor & Francis.

Lavy, V. (2007). Using performance-based pay to improve the quality of teachers. The Future of Children, 17(1), 87–109.

Lazear, E. P. (1998). Personnel economics for managers. New York: Wiley.

Marsh, J. A., Springer, M. G., McCaffrey, D. F., Yuan, K., Epstein, S., Koppich, J., Kalra, N., DiMartino, C., & Peng, A. (2011). A big apple for educators: New York City's experiment with schoolwide performance bonuses (MG-1114-FPS). Santa Monica, CA: RAND Corporation.

Mayer, D. P. (1999). Measuring instructional practice: Can policymakers trust survey data? Educational Evaluation and Policy Analysis, 21(1), 29–45.

McCaffrey, D. F., Han, B., & Lockwood, J. R. (2009). Incentive system design and measurement. In M. G. Springer (Ed.), Performance incentives: Their growing impact on American K-12 education. Washington, DC: Brookings Institution Press.

Milanowski, A. (1999). Measurement error or meaningful change? The consistency of school achievement in two school-based performance award programs. Journal of Personnel Evaluation in Education, 12(4), 343–363.

Milanowski, A. T. (2007, Spring). Performance pay system preferences of students preparing to be teachers. Education Finance and Policy, 2(2), 111–132.

Muralidharan, K., & Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, 119(1), 39–77.

Odden, A. (2000). New and better forms of teacher compensation are possible. Phi Delta Kappan, 81(5), 361–366.

Pfeffer, J. (1995). Competitive advantage through people: Unleashing the power of the work force. Boston, MA: Harvard Business School Press.

Rice, J. K. (2003). Teacher quality: Understanding the effectiveness of teacher attributes. Washington, DC: Economic Policy Institute.

Rosen, S. (1986). The theory of equalizing differences. In O. C. Ashenfelter & R. Layard (Eds.), Handbook of labor economics (Vol. 1). Oxford: North-Holland.

Rosenholtz, S. (1989). Teacher's workplace: The social organization of schools. New York: Longman.

Sager, R. (2009, March 2). Prez's challenge to NYC teachers. New York Post. Retrieved March 31, 2009, from http://www.nypost.com/p/news/opinion/opedcolumnists/item_eS7bvzIPkWPVEjJsbHwsoJ

Saretsky, G. (1975). The John Henry effect: Potential confounder of experimental vs. control group approaches to the evaluation of educational innovations. Paper presented at the Annual Meeting of the American Educational Research Association, Washington, DC.

Solmon, L., & Podgursky, M. (2001). The pros and cons of performance-based compensation. Pasadena, CA: Milken Family Foundation.

Springer, M. G., Ballou, D., Hamilton, L., Le, V., Lockwood, J. R., McCaffrey, D., Pepper, M., & Stecher, B. (2010). Teacher pay for performance: Experimental evidence from the Project on Incentives in Teaching. Nashville, TN: National Center on Performance Incentives at Vanderbilt University.

Springer, M. G., Lewis, J. L., Podgursky, M. J., Ehlert, M. W., Taylor, L. L., Lopez, O. S., & Peng, A. (2009). Governor's Educator Excellence Grant (GEEG) Program: Year three evaluation report. Nashville, TN: National Center on Performance Incentives.

Thomas, K. W. (2009). Intrinsic motivation at work: What really drives employee management (2nd ed.). San Francisco: Berrett-Koehler.
Wright, S. P., Sanders, W. L., & Rivers, J. C. (2006). Measurement of academic growth of individual students toward variable and meaningful academic standards. In R. Lissitz (Ed.), Longitudinal and value added models of student performance. Maple Grove, MN: JAM Press.

Authors

MATTHEW G. SPRINGER is assistant professor of public policy and education, director of the federally funded National Center on Performance Incentives, and director of the Tennessee Consortium on Research, Evaluation, and Development, a research consortium funded through Tennessee's Race to the Top grant. Professor Springer's research interests involve educational policy issues, with a particular focus on the impact of policy on resource allocation decisions and student outcomes.

JOHN F. PANE is a Senior Scientist at RAND. He uses experimental and rigorous quasi-experimental methods to study the implementation and effectiveness of innovations in education, particularly those involving technology.

VI-NHUAN LE is a Behavioral Scientist at RAND. Her research and expertise lie in mathematics and science reform, educational assessment, and early childhood education.

DANIEL F. MCCAFFREY is a Senior Statistician and holds the PNC Chair in Policy Analysis at the RAND Corporation. His current research interests include value-added modeling and the measurement of teaching.

SUSAN FREEMAN BURNS is program manager at the National Center on Performance Incentives. Her research interests include school leadership and teacher effectiveness, particularly in K-12 public education.

LAURA S. HAMILTON is a Senior Behavioral Scientist at RAND. Her areas of specialization are assessment, accountability, and the measurement of instruction and leadership practices.

BRIAN STECHER is a Senior Social Scientist and the Associate Director of RAND Education. Dr. Stecher's research focuses on measuring educational quality and evaluating education reforms, with a particular emphasis on assessment and accountability systems.

Manuscript received August 3, 2011
Revision received November 29, 2011
Accepted January 21, 2012