
VALUE ADDED AND ITS USES:
WHERE YOU STAND DEPENDS
ON WHERE YOU SIT
Sean Corcoran
Institute for Education and
Social Policy
New York University
New York, NY 10012
[email protected]
Dan Goldhaber
(corresponding author)
Center for Education Data
& Research
University of Washington
Seattle, WA 98103
[email protected]
Abstract
In this policy brief we argue that there is little debate
about the statistical properties of value-added model
(VAM) estimates of teacher performance; despite this,
there is little consensus about what the evidence implies
for their practical utility as part of high-stakes performance
evaluation systems. A review of the evidence base that
underlies the debate over VAM measures, followed by our
subjective opinions about the value of using VAMs,
illustrates how different policy conclusions can easily arise
even given broad agreement about an existing body of evidence. We
conclude the brief by offering a few thoughts about the
limits of our knowledge and what that means for those
who do wish to integrate VAMs into their own teacher-evaluation strategy.
doi:10.1162/EDFP_a_00104
INTRODUCTION
We contend that there is little academic disagreement about the statistical properties of value-added model (VAM) estimates currently being implemented to measure teacher job performance, or about the current state of in-service teacher evaluations.1 In particular, we believe that most researchers could sign on to
five stylized facts: (1) there is important variation in teacher effectiveness that
has educationally significant consequences for student achievement, as measured on standardized tests (and new evidence suggests later life outcomes as
well); (2) most formal teacher evaluations are far from rigorous in that they
largely ignore performance differences between teachers; (3) VAM measures
are likely to contain real information about teacher effectiveness that could be
used to inform personnel decisions and policies, but they are subject to potential biases; (4) VAM measures are noisy, varying from year to year or class
section to class section for reasons other than variation in true effectiveness;
and (5) little is known about how current and potential teachers might respond
to the use of VAM measures in evaluating their job performance.
As we see it, the real debate about VAMs is less about the known properties
of these measures, and more about what the evidence about VAMs implies for
their practical utility as part of high-stakes performance evaluation systems.2
This policy brief offers a case in point—we agree about what research has to say
about teacher evaluation, the properties of value-added measures, and the ways
in which VAMs could conceivably be used for workforce improvement, but we
do not reach the same conclusions about the value of VAMs in making high-stakes personnel decisions. Our views are not diametrically opposed but stem
from subjective assessments about the risks associated with VAMs, judgments
about the alternatives, and skepticism about the degree to which education
systems will change without their use. As we will show, given the same body
of evidence, the VAM glass can easily be viewed as half empty or half full.
In the next section, we briefly review the evidence base that underlies the
debate over VAM measures, organized loosely around the five stylized facts
listed earlier. We then describe theories of action connecting VAM measures
to educational improvement and the limited available evidence on many of
the theories put into practice. Successful use of VAMs in the real world requires a well-specified link between teacher measurement and human resource
management actions; unfortunately, their implementation in practice is often carried out without much attention to this link. The remainder of the brief is devoted to our differences in opinion about the use of VAMs, illustrating how different conclusions can easily arise from the same body of evidence, followed by a conclusion about the limits of our knowledge and what that means for those who do wish to integrate VAMs into their own teacher-evaluation strategy.

1. Henceforth we use the terms “teacher effectiveness” and “teacher job performance” interchangeably.
2. In fact, we speculate that any means of evaluating teachers that leads to a spread of summative teacher performance ratings, which in turn have consequential implications for teachers’ jobs or compensation, would be controversial. As we briefly touch on subsequently, this may be because of teacher concerns about disruptive effects on teacher collegiality or a general resistance to move away from the current system, where there are few job consequences for poor teaching (because poor teaching is rarely identified through a formal evaluation process).
A BRIEF REVIEW OF WHAT WE KNOW ABOUT VAMs
A near-universal point of agreement among education researchers is that
teachers matter, or, put another way, that there is important variation in teacher
effectiveness that has educationally significant consequences for students. This
conclusion is an outgrowth of value-added studies that find an effect size of
individual teachers in the neighborhood of 0.10 to 0.25 standard deviations
(Hanushek and Rivkin 2010; Goldhaber and Hansen 2012).3 To put this effect
in perspective, a one standard deviation difference in teacher effectiveness
amounts to about 10 to 25 percent of a grade-level’s worth of growth in the
elementary grades.4 The magnitude of these effects has led to the oft-used
refrain that teacher quality is the most important in-school factor affecting
student achievement.
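The arithmetic behind the 10 to 25 percent benchmark is worth making explicit. A worked example, using the stylized numbers above and taking the roughly one standard deviation annual gain from footnote 4 as given:

$$
\frac{1\ \text{SD of teacher effectiveness}}{\text{typical annual gain}}
\;\approx\; \frac{0.10\ \text{to}\ 0.25\ \text{SD}}{1.0\ \text{SD per grade}}
\;=\; 10\ \text{to}\ 25\ \text{percent of a grade level's growth.}
$$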
Unfortunately for policy makers and school leaders, variation in achievement gains between teachers appears to be only weakly related to licensure, experience, and degree attainment, the credentials now used by most states and school districts
to determine employment eligibility and compensation (Goldhaber, Brewer,
and Anderson 1999; Aaronson, Barrow, and Sander 2007; McCaffrey et al.
2009).5 Moreover, existing in-service teacher evaluations fail to recognize important differences in learning gains across classrooms. In their well-known
report The Widget Effect, Weisberg et al. (2009) found little to no documented
variation in teacher job performance ratings in surveyed school districts.6 In
the absence of variation, these ratings played little to no role in key human
capital decisions such as tenure, promotion, and compensation. Together,
these findings have led researchers and policy makers to consider alternative measures of job performance, such as VAMs, that aim to directly and objectively estimate teachers’ impact on student achievement.

3. The estimates are typically in the neighborhood of 0.10 to 0.15 for within-school estimates and 0.15 to 0.25 for estimates that also include between-school differences in teacher effectiveness.
4. In the lower grades a typical student gains about one standard deviation in math and reading achievement per year (Bloom et al. 2008). Thus a one standard deviation difference in teacher effectiveness—about the difference between having a teacher at the sixteenth percentile of the performance distribution and one at the fiftieth percentile—amounts to about 10 to 25 percent of a grade-level’s worth of achievement in the elementary grades. The impact of teacher effectiveness is even larger in middle and high schools, since students gain less (in standard deviation units) from grade to grade.
5. For more evidence on the relationship between these credentials and student achievement, see Rockoff (2004), Rivkin, Hanushek, and Kain (2005), Goldhaber and Brewer (1997, 2000), Boyd et al. (2009), Clotfelter, Ladd, and Vigdor (2010), Kane et al. (2010), and Goldhaber and Hansen (2012).
6. See also Tucker (1997) and Toch and Rothman (2008).
Put simply, VAMs use historical data on student achievement to quantify
how a teacher’s students performed relative to similar students taught by other
teachers (e.g., students with the same baseline performance, socioeconomic
background, educational needs, and so on). Value-added models are designed
to statistically control for differences across classrooms in student inputs,
such that any remaining differences can be used to infer teachers’ impacts
on achievement. As is true for any statistical model, users of VAMs must be
attentive to issues of attribution (bias), imprecision (noise), and model specification. No model is perfect, and thus the relevant question for the VAM debate
is whether these measures as currently constructed contain enough useful
information that their integration into performance evaluation systems, and
the use of these performance evaluations as part of human resource management systems, is likely to lead to improvements in the quality of the teacher
workforce.
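The logic of the preceding paragraph can be summarized in a single regression. Below is a minimal sketch of a covariate-adjustment VAM of the sort described, written in Python on simulated data; the variable names, covariates, and parameter values are illustrative assumptions, not any particular state’s model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, class_size = 40, 25

# Simulated "truth": teacher effects with SD 0.15 in student test-score units.
true_effect = rng.normal(0, 0.15, n_teachers)
teacher = np.repeat(np.arange(n_teachers), class_size)
prior = rng.normal(0, 1, teacher.size)    # baseline (lagged) score
frl = rng.binomial(1, 0.4, teacher.size)  # an illustrative student covariate
score = (0.7 * prior - 0.1 * frl + true_effect[teacher]
         + rng.normal(0, 0.5, teacher.size))

df = pd.DataFrame({"score": score, "prior": prior, "frl": frl, "teacher": teacher})

# Regress current scores on prior scores, student covariates, and teacher
# indicators; the teacher coefficients are the value-added estimates.
fit = smf.ols("score ~ prior + frl + C(teacher)", data=df).fit()
vam = np.array([0.0] + [fit.params[f"C(teacher)[T.{j}]"] for j in range(1, n_teachers)])
vam -= vam.mean()  # express each estimate relative to the average teacher

print(np.corrcoef(vam, true_effect)[0, 1])  # recovery of the simulated effects
```

In practice the estimates are typically shrunk toward the mean and may pool multiple years of data, but the basic logic is the same: whatever classroom-level differences remain after the statistical controls are attributed to the teacher.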
Questions of model specification, bias, and precision have been extensively studied. With respect to model specification, there appears to be general
agreement that VAM estimates are not terribly sensitive to common specification choices, such as the inclusion or exclusion of student or classroom level
covariates, but that there is greater sensitivity to whether models include school
or student fixed effects.7 VAM estimates also show some sensitivity to the student outcomes (or domains on a test) employed as the metric to judge teachers
(e.g., Lockwood et al. 2007; Corcoran et al. 2011; Papay 2011).
The question of whether and to what extent VAMs are biased—that is,
whether test-score gains attributed to a teacher are due to student sorting or
some other factor that varies with teacher assignment—is less settled. Rothstein (2010), for example, shows that in standard VAMs, teachers assigned to
students in the future have statistically significant power in predicting past student achievement. This finding clearly cannot be causal, and suggests that typical models do not adequately account for the processes leading to the matching of students and teachers in classrooms.8 On the other hand, Goldhaber
and Chaplin (2012) and Kinsler (2012b) find that “the Rothstein test” may not work as intended and may suggest model misspecification where none exists. In a study based on random assignment of teachers to classrooms, Kane and Staiger (2008) showed that VAMs do not appear to be biased. Although promising, this experiment has some limitations, most notably a small sample size and the fact that the experiment only includes teachers who were deemed eligible by their principals to teach each other’s classes (i.e., teachers who are seen as exchangeable).9 Consequently, the results may not be fully generalizable. Finally, Chetty, Friedman, and Rockoff (2011) use VAM falsification exercises different from Rothstein’s and find little evidence of bias.

7. Correlations across different models are generally high (over 0.9) for models that vary in terms of how they handle covariate adjustments for student background, classroom, and school characteristics (Ballou, Sanders, and Wright 2004; Harris and Sass 2006; Lockwood et al. 2007), but the correlations between these models and those that include student or school fixed effects are far lower, in the range of 0.3 to 0.5 (Harris and Sass 2006; Goldhaber, Gabele, and Walch 2012), with the range dependent on the subject as well as the number of years of teacher data informing the model. Importantly, models with student fixed effects are much more commonly used in research than in practice, due to their high data and computational demands.
8. Rothstein (2009) suggests that bias is likely to be quite small when VAMs include several years of prior test scores as control variables.
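To make the logic of the Rothstein-style falsification exercise concrete, here is a hedged sketch in the same spirit: if classroom assignment were as good as random conditional on the controls, a student’s future teacher should have no power to “predict” past achievement gains. The data are simulated with deliberate sorting so the test fires; the names and parameters are illustrative, not Rothstein’s exact specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 1000
gain_last_year = rng.normal(0, 1, n)  # achievement gain in grade g-1
prior_score = rng.normal(0, 1, n)
# Deliberate sorting: next year's classroom tracks past gains (deciles),
# the kind of student-teacher matching a standard VAM may not capture.
teacher_next = pd.qcut(gain_last_year + rng.normal(0, 1, n), 10, labels=False)

df = pd.DataFrame({"gain_last_year": gain_last_year,
                   "prior_score": prior_score,
                   "teacher_next": teacher_next})

# Jointly test whether future-teacher indicators "predict" past gains.
restricted = smf.ols("gain_last_year ~ prior_score", data=df).fit()
full = smf.ols("gain_last_year ~ prior_score + C(teacher_next)", data=df).fit()
print(anova_lm(restricted, full))  # a significant F statistic signals sorting
```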
The extent of imprecision, or noise, in VAM measures is another issue
that has been extensively studied. Several recent studies have found teacher
value-added estimates to be only moderately reliable or stable from year to
year, with adjacent year correlations in the neighborhood of 0.3 to 0.5 (e.g.,
McCaffrey et al. 2009; Goldhaber and Hansen 2012). Whether or not these
correlations are sufficiently strong is in the eye of the beholder. On the one
hand, correlations of this magnitude result in a nontrivial proportion of teachers being rated as “ineffective” or “highly effective” in one school year (or class
section) while being rated as average in the next year (or section). On the other
hand, as Glazerman et al. (2011) point out, the magnitude of these stability
estimates is not much different from what is observed in other occupations
that have been quantitatively measured. It is true that the magnitude of these reliability estimates likely does not support the exclusive use of VAM measures
to classify teachers into performance categories (Schochet and Chiang 2010;
Goldhaber and Loeb 2013).10 Although some evaluation systems put a large
weight on VAM results, to our knowledge no proposed or adopted policy relies
exclusively on VAMs.
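The classification stakes of this noise can be illustrated with a back-of-the-envelope simulation in the spirit of Schochet and Chiang (2010) (see footnote 10); the parameter values below are illustrative assumptions chosen only to land near the stability range cited above, not the authors’ calibration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers, n_years = 100_000, 3
sd_true, sd_noise = 1.0, 1.6  # implies single-year stability of roughly 0.3

true = rng.normal(0, sd_true, n_teachers)
# Averaging several years of data shrinks, but does not eliminate, the noise.
est = true + rng.normal(0, sd_noise / np.sqrt(n_years), n_teachers)

average = np.abs(true) < 1  # truly within 1 SD of the mean
flagged = np.abs(est) > 1   # estimate looks "exceptional" or "ineffective"
print("share of average teachers flagged (Type I):", flagged[average].mean())
print("share of extreme teachers missed (Type II):", (~flagged)[~average].mean())
```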
The validation of value-added measures against long-run student
outcomes—as opposed to year-to-year test score gains—has been less extensively studied. A number of studies, however, have documented “fade out”
of teacher effects—that is, the gains induced by teachers in one grade dissipate in later grades.11 One possible explanation for this finding is that the heightened focus on tests may encourage teachers to emphasize strategies that lead to short-run test score gains at the expense of long-run learning. On the other hand, there are some less pernicious explanations for fade out, including variation in test content across grades and test scaling effects (Cascio and Staiger 2012). Additionally, recent evidence by Chetty, Friedman, and Rockoff (2011) finds that, despite effects that fade out in the short run, VAM measures still predict key later-life outcomes—e.g., teen pregnancy, college-going behavior, and labor market earnings—long after students have left a teacher’s classroom.

9. A recently released report from the well-known Measures of Effective Teaching Project (Kane et al. 2013) is based on a similar experiment with a much larger teacher and student sample and also concludes that VAMs provide unbiased estimates of teacher effectiveness. The study is still limited, however, in that the findings might not be generalizable beyond the group of teachers who were deemed eligible by principals to be randomized within schools.
10. Schochet and Chiang (2010), for instance, used simulated data that rely on plausible estimates of the signal-to-noise ratio in teacher effect estimates and conclude that, if three years of data are used for estimating teacher effectiveness, the probability of identifying an average teacher as being “exceptional” or “ineffective” (roughly one standard deviation above or below the mean, respectively), a Type I error, is about 25 percent. Conversely, the probability that a truly exceptional teacher is not identified, a Type II error, is also about 25 percent.
11. For instance, estimates suggest that only one third to one half of teacher value added can be identified in subsequent grades. See Jacob, Lefgren, and Sims (2010), Chetty, Friedman, and Rockoff (2011), Corcoran, Jennings, and Beveridge (2011), and Kinsler (2012a).
Although we know a good deal about the properties of VAMs, existing
research has little to say regarding how current and potential teachers are likely
to respond to the use of VAM measures in high-stakes personnel decisions,
other than with regard to whether there are short-run productivity effects of pay
for performance (discussed below). Much of the available evidence on teacher
value added comes from low-stakes settings, in which teacher evaluations were
not linked to student performance or high-stakes personnel decisions.
Taken together, the balance of the evidence indicates that VAM measures
are imperfect, and questions about bias are far from settled. We argue, however,
that the more important question for policy is whether VAMs can improve
upon existing or alternative measures of teacher job performance. As The
Widget Effect pointed out, existing evaluation systems generally fail to recognize
variation in teacher performance, and even imperfect measures may improve
upon the status quo. Moreover, non-test-based measures of performance, such
as classroom observations, student surveys, and portfolio assessment, suffer
from their own imperfections and are as untested as VAMs (Harris 2012). We
believe that policy makers ought not to focus solely on whether or not a VAM
measure is biased and noisy but rather on the extent of bias and imprecision,
and whether VAMs provide either a more accurate picture of teacher quality
than other means of assessing teachers and/or additional information that can
be used to improve instruction.
THEORIES OF ACTION: CONNECTING VAMs
TO EDUCATIONAL IMPROVEMENT
Much is known about the properties of value-added measures, but successful implementation of VAMs in practice requires a sound theory of action
connecting VAM measures to educational improvement. Unfortunately, the
implementation of VAMs is often done without a well-specified link between
these two, and there is little evidence available to inform such links. By their nature, VAMs are norm-referenced measures, indicators of relative performance
that use historical data to rank teachers based on classroom test performance.
Thus the question is how these measures can be integrated into a performance
evaluation system to raise overall teacher effectiveness.
There are two primary means by which VAMs might be used to improve
educational outcomes. The first focuses on improving the effectiveness of
incumbent teachers, whereas the second emphasizes changes in the composition of the teacher workforce. With respect to the former, VAMs could be used
by school and district leaders as a professional development tool, identifying
areas of weakness and providing models of successful instruction. School systems have historically invested vast sums in professional development, but
generally little of this is targeted to the specific needs of individual teachers
(Rice 2009; Toch and Rothman 2008). VAM measures could aid in better targeting these investments to the teachers who need them most. Furthermore,
the availability of rewards for high performance may incentivize existing teachers to increase effort levels or improve their skills by learning from successful
colleagues.12
The second mechanism operates through changes in the composition of
the teacher workforce. Some researchers have argued that a policy of denying
tenure or dismissal of the lowest-performing teachers would substantially raise
the quality of the teacher workforce and have large effects on the life outcomes
of students (Staiger and Rockoff 2010; Chetty, Friedman, and Rockoff 2011;
Hanushek 2011). A practice of recognizing and rewarding effective teachers, either through higher pay or promotion, could also improve quality by attracting
potential teachers who might otherwise be drawn to fields that better reward
their skills.13 Finally, the availability of rewards for high performance may help
with the retention of effective teachers.
Unfortunately, with practical implementation of VAMs in its infancy, we
have almost no direct evidence on these proposed mechanisms. A notable exception is a series of experimental studies of teacher performance incentives
tied to VAMs that suggest they have little impact on the productivity of practicing teachers.14 These studies remain small in number, however.

12. Although teachers may learn skills from their colleagues now (Jackson and Bruegmann 2009), recognition and reward systems could facilitate additional informal learning by broadly identifying teacher successes and hence making it clearer which teachers might have the most to offer in terms of informal training.
13. The impacts on the teacher applicant pool would be even more beneficial if VAM-based evaluations helped teacher training institutions learn what kinds of skills they should be developing in future teachers (Boyd et al. 2009), or helped school districts identify which of their hires tend to be more effective, leading to better future hiring decisions (Rockoff et al. 2010).
In the absence of much evidence on the mechanisms by which VAMs
could improve educational outcomes, proponents and detractors of VAMs
have largely relied on reasoned speculation over whether they believe these
measures are likely to improve educational outcomes or not. These beliefs can
be grounded in the existing evidence on VAM measures but hinge on whether
one views the VAM glass as half empty or half full. On the one hand, each of
these theories is compelling. The teaching profession has a poor track record
of rewarding performance, whether through higher pay, increased responsibility, or career advancement (Johnson and Papay 2009). Teaching’s reward
structure may discourage talented graduates from the profession (Hoxby and
Leigh 2004), fail to retain the best teachers (Chingos and West 2012; TNTP
2012), and provide weak incentives for effort and improvement.15 VAM measures may be used as part of a system to address some of these issues, and to do
so it is not necessary for VAMs to be perfect indicators of teacher effectiveness
(Glazerman et al. 2011).
On the other hand, VAMs may not be well suited to support these theories
of action. If VAM measures are sufficiently biased and/or unreliable, they may
lead to incorrect personnel decisions and misallocated resources. Bias and
instability can also undermine trust in the system, and the risk associated with
employment or compensation instability could dissuade potential teachers
from the profession rather than attract them (Rothstein 2012).
Along the same lines, even if VAM measures have acceptable properties
from a statistician’s point of view, their complex calculation and inherent variability can limit their face validity among practitioners. For evaluation systems
built on VAMs to improve the instruction techniques of existing teachers,
practitioners will need to see direct connections between their day-to-day practice and their performance evaluations. Today, most VAM-based evaluation
systems only provide information about teachers’ effectiveness categories or
relative ranking, not direct information that can be used to improve particular
aspects of practice. This is one of the reasons it makes sense to use VAM only
as a part of a well-rounded evaluation system.
14. For example, see Springer et al. (2010), Fryer (2011), and Fryer et al. (2012). An interesting exception is Fryer et al. (2012), who find that teachers respond to loss aversion (i.e., the threat that
compensation that is in hand will be taken away if student performance does not meet certain
goals); it is not clear how such an incentive system could be implemented in practice, however.
15. By some measures, research finds that more academically talented teachers are more likely to
leave the teaching profession than their lower achieving counterparts (e.g., as measured using
SAT scores, licensure exam scores, or college selectivity measures; see Stinebrickner 2001, 2002;
Podgursky, Monroe, and Watson 2004; Goldhaber 2007). More recent research, however, finds
that higher value-added teachers are retained at a slightly higher rate than lower value-added
teachers (Hanushek et al. 2005; Boyd et al. 2008; Goldhaber, Gross, and Player 2011).
WHERE YOU STAND DEPENDS ON WHERE YOU SIT
We began the brief with the contention that most researchers would agree with
our stylized facts about VAM measures of teacher effectiveness and the current
state of performance evaluation in the teaching profession. But agreeing on
the stylized facts does not necessarily mean agreement on the practical utility
of using VAMs or their potential to improve teacher effectiveness. In fact, we
do not entirely see eye to eye on these issues.
Glass Half Empty: Corcoran
I (Corcoran) tend to view the VAM glass as half empty. Although I would
agree that performance evaluation in teaching is lacking and sorely in need
of reform, I believe the potential for VAM measures to dramatically improve
teaching effectiveness and the quality of the profession tends to be overblown.
Student achievement should be an important part of a new and improved
system of evaluation. As statistical estimates based on historical and limited
data, however, VAM measures lack transparency and are inherently limited by
imprecision. If teacher quality is to be improved, teachers and school leaders
need instructive, actionable information they can use to make meaningful
changes to their practice sooner rather than later. A statistical prediction that
relies on annual tests and multiple years of data to produce reasonably reliable
value-added estimates does not, in my view, meet this requirement.
VAM measures may turn out to be useful indicators of relative
performance—separating the very high- and very low-performing teachers
from the rest of the pack. This information could be fruitfully used by principals as an early warning signal or (in extreme cases) as grounds for dismissal.
Their utility as a job performance indicator for a significant number of teachers
is another matter, however. Given the inherent instability of VAM estimates,
a high-stakes system tied to VAMs would need to be conservatively designed,
reserving punishment and reward only for those with demonstrably very low or
very high performance, and an acceptably low level of statistical uncertainty.16
But a VAM system that meets these conservative standards would ultimately
only apply to the most extreme cases, and would provide little feedback to the
bulk of teachers. This raises the question of what VAMs would add beyond
the subjective evaluation of principals or other educators who are presumably
capable of identifying the very worst (or best) teachers (e.g., see Jacob and Lefgren 2008). Finally, it is important to keep in mind that “value added” is not
a unidimensional concept. There are as many value-added measures as there
are tests and subject areas, and VAMs have been found to be only moderately
correlated across these. There is room for combining information across tests (Lefgren and Sims 2012), but policy makers may find that creating decision rules to objectively identify the “best” and “worst” teachers is easier said than done.

16. Some states, such as New York, have sought to accomplish this by assigning teachers to performance categories using both their VAM estimate and their level of certainty about this estimate.
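A minimal sketch of the conservative decision rule described above (and of the New York approach mentioned in footnote 16) is given below; the cutoff, the confidence level, and the function name are illustrative assumptions, not any state’s actual rule.

```python
def classify(estimate, std_err, cutoff=0.15, z=1.96):
    """Label a teacher only when the 95% confidence interval clears the cutoff."""
    lo, hi = estimate - z * std_err, estimate + z * std_err
    if lo > cutoff:
        return "very high"
    if hi < -cutoff:
        return "very low"
    return "no determination"

print(classify(0.30, 0.05))  # precise and clearly high -> "very high"
print(classify(0.30, 0.20))  # same point estimate, too noisy -> "no determination"
```

Under a rule like this, most teachers fall into “no determination,” which is precisely the point of the argument above: a defensibly conservative system identifies only the extremes.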
I am encouraged by the prospect for reform in teacher evaluation and the
renewed focus on student achievement. Our understanding of teacher labor
markets and teacher effectiveness would not be where it is today without advances in value-added measurement and the careful linking of student-level
achievement data to teachers over time. That VAM measures have proven invaluable to research, however, does not imply they will be useful as on-the-job
performance measures. Inferences about teacher effects on average are quite
different from inferences about individual teachers, and I tend to be more
pessimistic about the latter. The attachment of high stakes to measures that
are meant to be informative about the progress of students has the potential to
undermine the validity of the measures themselves, to encourage teaching to
the test, and (at worst) cheating. With professional careers and the education
of our nation’s children on the line, our educators need a new evaluation system that is transparent, informative, and responsive to their needs. Although
VAMs have a role to play in this system, policy makers should temper their
expectations and limit their high-stakes use.
Glass Half Full: Goldhaber
I (Goldhaber) view the VAM glass as half full. I do not disagree with the
technical points or potential negative incentives associated with connecting
VAM measures to high-stakes decisions described above. This is one of the
reasons why I think that VAM measures ought only to be a component of
a well-rounded evaluation system that also, for instance, includes classroom
observations.17 The concern that focusing teacher accountability on student
test achievement might lead to pernicious behaviors that distort the learning
process in schools—Campbell’s Law—is entirely valid, which is why evaluation
reform probably ought to happen in conjunction with measures designed to
guard against these outcomes.18 Yet despite some wariness about VAMs and
their use, I believe we should experiment with incorporating VAM as a factor
in evaluations that are used for such high-stakes purposes as pay, tenure, and
promotion determination.
I would argue that VAMs ought to be used for three primary reasons. First,
evaluation is to some extent about determining which teachers should stay
in the profession (and perhaps in which positions and at what compensation levels). Even those who believe the overwhelming majority of teachers are successful enough to merit being in the profession likely recognize that some, perhaps quite small, proportion of teachers are not very effective and ought to be dismissed. And, as we have recently witnessed, economic circumstances occasionally necessitate that some teachers will lose their jobs. We cannot use performance evaluations to make personnel decisions if there is no variation in the evaluation ratings in the workforce. Unless the act of evaluating teachers itself makes them more effective,19 school systems are investing time and effort in the evaluation endeavor for little usable information.

17. Another reason is that VAMs alone typically do not provide much information that teachers can use to improve their practice.
18. This issue is just beginning to be addressed in a comprehensive way. See, for instance, Samuels (2012).
Second, I see value added as a catalyst for broader changes to teacher evaluation. I am quite skeptical that we would be engaged in what is now almost
a national experiment with new, and hopefully more rigorous, teacher evaluation were it not for the specter of VAM usage. It has been over two decades
since research (Hanushek 1992) showed just how important the variation in
teacher effectiveness is for determining student outcomes.20 Yet policy makers and practitioners have, by and large, been unable or unwilling to develop
credible teacher evaluation systems that recognize the important differences
that exist between teachers. As I noted earlier, I understand the need to be
careful in making changes to the system but surely it is not unreasonable to
have expected more movement on this issue over such a long period of time.
Third, and perhaps most importantly, evidence suggests that VAM measures are better at predicting future student test achievement than the other
credentials—licensure, degree, and experience levels—that are now used for
high-stakes purposes (e.g., Goldhaber and Hansen 2010), and better than
other means of assessing teachers (Glazerman et al. 2011; Harris 2010; Tyler
et al. 2010). To the extent that evaluations are in fact used for any purpose, we
would want them to be good predictors of student achievement and, although
imperfect, VAMs look pretty good compared to the other options currently out
there. There is a laser focus on the known flaws of VAMs while other methods
of teacher evaluation have basically been given a pass.
There is no doubt that VAM-informed decisions about teachers will sometimes, because of bias or imprecision, lead to teacher classification errors (Goldhaber and Loeb 2013). We clearly want to be careful to limit these errors,
but we also have to recognize that it is almost certainly not optimal for students
to entirely eliminate the downside risk to teachers of classification errors. The
reason is obvious but still merits repeating: there is a trade-off inherent in reducing the number of teachers who are unfairly classified as ineffective; namely, the number of ineffective teachers classified as effective rises. In other words, to some extent what is best for teachers may not be best for students. Again, this is an issue that exists for all evaluation systems, not just those that utilize VAM-based information.

19. There is in fact some new evidence (Taylor and Tyler 2012) suggesting that a comprehensive evaluation does increase teacher effectiveness.
20. The results of this study suggest that the difference between having an effective versus an ineffective teacher can be equivalent to more than a year’s worth of typical student achievement growth.
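The classification trade-off Goldhaber describes can be made concrete with a small simulation; the noise level and thresholds below are illustrative assumptions, not estimates from any evaluation system.

```python
import numpy as np

rng = np.random.default_rng(3)
true = rng.normal(0, 1, 200_000)            # true teacher effectiveness
est = true + rng.normal(0, 0.9, true.size)  # noisy VAM estimate

ineffective = true < -1                      # truly low performers
for cutoff in (-1.5, -1.0, -0.5):            # increasingly aggressive thresholds
    flagged = est < cutoff
    unfair = flagged[~ineffective].mean()    # effective teachers flagged anyway
    missed = (~flagged)[ineffective].mean()  # ineffective teachers who pass
    print(f"cutoff {cutoff:+.1f}: unfairly flagged {unfair:.1%}, missed {missed:.1%}")
```

Raising the bar for dismissal protects effective teachers from misclassification but necessarily lets more truly ineffective teachers pass, and vice versa.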
GOING FORWARD
As is clear from the preceding section, we do not entirely agree with one
another in terms of the extent to which policy makers ought to use VAMs for
high-stakes purposes, but there is significant agreement about how we ought
to move forward with some use of VAMs. First, implementation matters.
It matters not only because of the support systems we discussed here that
are likely to be crucial for the integrity of an evaluation system emphasizing
value-added measures, but also because we are talking about a system that
will affect human beings whose responses to the system will go a long way in
determining its effectiveness. Specifically, the behavioral response to the use
of VAMs depends not only on the design of the system but also on teachers’
(and other stakeholders’) reactions to it. This means that clear and constant
communication about why particular modeling decisions were made, how
the system works, and what the consequences are for teachers who receive different evaluation ratings is essential.
Second, as we have hopefully stressed herein, our current understanding
of the impact of VAMs in practice is limited because this impact depends on
human beings’ responses to how VAMs are used.21 This means there is a good
deal of room for debate about the extent to which VAM-usage, particularly
for high-stakes personnel decisions, would lead to good (e.g., more feedback
on effective teaching practices) or bad (e.g., a narrowing of the curriculum)
outcomes. Given this, we recommend that policy makers roll out plans with
an eye toward evaluation and modification.22 This is decidedly not how new
education policies tend to be implemented; policy makers or practitioners who
say “we are pretty sure this is a good plan, but we might have to change it soon”
probably reduce the chances they will hold key positions of influence in the
future. Fortunately, solving this political problem is outside the scope of our
brief (and perhaps not possible).
Third, policy makers should, to the extent possible, anticipate indirect and
unintended consequences (not all of which are necessarily negative). For
instance, attaching stakes for teachers to student test achievement will encourage cheating, so school systems should clearly be considering auditing systems to discourage cheating and to detect it when it happens. As we noted earlier, increased job or compensation risk may make teaching a less desirable occupation. These are obvious examples; it is worth some time and effort planning for the obvious issues that will arise and anticipating some issues that may not be immediately obvious.

21. For instance, some of the academic debate is based on simulations (e.g., Hanushek 2009; Rothstein 2012).
22. This might, for instance, entail small-scale pilots or implementation features that permit research designs allowing for strong causal inferences.
The bottom line is that we are entering a new world of teacher evaluation,
a world in which recent policies dictate that VAMs will play a role. But, in
implementing VAM-based reforms, we believe one should view the current policy push not as the end product but as a step in the evolution toward a better system.
We thank two anonymous reviewers for helpful comments on an earlier draft of this
paper.
REFERENCES
Aaronson, Daniel, Lisa Barrow, and William Sander. 2007. Teachers and student
achievement in the Chicago Public High Schools. Journal of Labor Economics 25(1):
95–135. doi:10.1086/508733
Ballou, Dale, William Sanders, and Paul Wright. 2004. Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral
Statistics 29(1): 37–65. doi:10.3102/10769986029001037
Bloom, Howard, Carolyn Hill, Alison Black, and Mark Lipsey. 2008. Performance trajectories and performance gaps as achievement effect-size benchmarks for educational
interventions. MDRC Working Paper.
Boyd, Donald, Pam Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff.
2008. Who leaves? Teacher attrition and student achievement. NBER Working Paper
No. 14022.
Boyd, Donald J., Pamela L. Grossman, Hamilton Lankford, Susanna Loeb, and James
Wyckoff. 2009. Teacher preparation and student achievement. Educational Evaluation
and Policy Analysis 31(4): 416–40. doi:10.3102/0162373709353129
Cascio, Elizabeth U., and Douglas O. Staiger. 2012. Knowledge, tests, and fadeout in
educational interventions. NBER Working Paper No. 18038.
Chetty, Raj, John N. Friedman, and Jonah E. Rockoff. 2011. The long-term impacts
of teachers: Teacher value-added and student outcomes in adulthood. NBER Working
Paper No. 17699.
Chingos, Matthew, and Martin West. 2012. Do more effective teachers earn more
outside the classroom? Education Finance and Policy 7(1): 8–43. doi:10.1162/EDFP_a
_00052
Clotfelter, Charles T., Helen F. Ladd, and Jacob L. Vigdor. 2010. Teacher credentials and student achievement in high school. Journal of Human Resources 45(3): 655–81.
Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2011. Teacher effectiveness on high- and low-stakes tests. Mimeo, New York University, Institute for Education
and Social Policy.
Fryer, Roland. 2011. Teacher incentives and student achievement: Evidence from New
York City public schools. NBER Working Paper No. 16850.
Fryer, Roland, Steven Levitt, John List, and Sally Sadoff. 2012. Enhancing the efficacy
of teacher incentives through loss aversion: A field experiment. NBER Working Paper
No. 18237.
Glazerman, Steven, Susanna Loeb, Dan Goldhaber, Douglas Staiger, Steve Raudenbush, and Grover Whitehurst. 2011. Evaluating teachers: The important role of value-added.
Washington, DC: Brown Center on Education Policy at Brookings.
Goldhaber, Dan. 2007. Everyone’s doing it, but what does teacher testing tell us about
teacher effectiveness? Journal of Human Resources 42(4): 765–94.
Goldhaber, Dan, and Dominic J. Brewer. 1997. Why don’t schools and teachers seem
to matter? Assessing the impact of unobservables on educational productivity. Journal
of Human Resources 32(3): 505–23. doi:10.2307/146181
Goldhaber, Dan, and Dominic J. Brewer. 2000. Does teacher certification matter? High
school teacher certification status and student achievement. Educational Evaluation and
Policy Analysis 22(2): 129–45.
Goldhaber, Dan, Dominic J. Brewer, and Deborah J. Anderson. 1999. A three-way error
components analysis of educational productivity. Education Economics 7(3): 199–208.
doi:10.1080/09645299900000018
Goldhaber, Dan, and Duncan Chaplin. 2012. Assessing the “Rothstein falsification
test.” Does it really show teacher value-added models are biased? CEDR Working Paper
No. 2012–1.3, University of Washington.
Goldhaber, Dan, Brian Gabele, and Joe Walch. 2012. Does the model matter? Exploring the relationship between different achievement-based teacher assessments. CEDR
Working Paper No. 2012–6, University of Washington.
Goldhaber, Dan, Betheny Gross, and Daniel Player. 2011. Teacher career paths,
teacher quality, and persistence in the classroom: Are public schools keeping
their best? Journal of Policy Analysis and Management 30(1): 57–87. doi:10.1002/pam
.20549
Goldhaber, Dan, and Michael L. Hansen. 2010. Using performance on the job to inform
teacher tenure decisions. American Economic Review 100(2): 250–55. doi:10.1257/aer
.100.2.250
Goldhaber, Dan, and Michael L. Hansen. 2012. Is it just a bad class? Assessing the stability of measured teacher performance. Economica, forthcoming. doi:10.1111/ecca.12002
Goldhaber, Dan, and Susanna Loeb. 2013. What do we know about the tradeoffs with
teacher misclassification in high stakes personnel decisions? Carnegie Knowledge
Network. April 15, 2013.
Hanushek, Eric A. 1992. The trade-off between child quantity and quality. Journal of
Political Economy 100(1): 84–117. doi:10.1086/261808
Hanushek, Eric A. 2009. Teacher deselection. In Creating a new teaching profession,
edited by Dan Goldhaber and Jane Hannaway, pp. 165–80. Washington, DC: Urban
Institute Press.
Hanushek, Eric A. 2011. The economic value of higher teacher quality. Economics of
Education Review 30(3): 466–79. doi:10.1016/j.econedurev.2010.12.006
Hanushek, Eric A., John F. Kain, Daniel M. O’Brien, and Steven G. Rivkin. 2005. The
market for teacher quality. NBER Working Paper No. 11154.
Hanushek, Eric A., and Steven G. Rivkin. 2010. Generalizations about using
value-added measures of teacher quality. American Economic Review 100(2): 267–71.
doi:10.1257/aer.100.2.267
Harris, Douglas N. 2010. Clear away the smoke and mirrors of value-added. Phi Delta
Kappan 91(8): 66–69.
Harris, Douglas N. 2012. How do teacher value-added indicators compare to other
measures of teacher effectiveness? Carnegie Knowledge Network Brief. Available at www.carnegieknowledgenetwork.org/briefs/value-added/value-added-other-measures/. Accessed 11 March 2013.
Harris, Douglas N., and Tim Sass. 2006. Value-added models and the measurement
of teacher quality. Unpublished paper, Florida State University.
Hoxby, Caroline M., and Andrew Leigh. 2004. Pulled away or pushed out? Explaining
the decline of teacher aptitude in the United States. American Economic Review 94(2):
236–40. doi:10.1257/0002828041302073
Jackson, C. Kirabo, and Elias Bruegmann. 2009. Teaching students and teaching each
other: The importance of peer learning for teachers. American Economic Journal: Applied
Economics 1(4): 85–108. doi:10.1257/app.1.4.85
Jacob, Brian, and Lars Lefgren. 2008. Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics
26(1): 101–36. doi:10.1086/522974
Jacob, Brian, Lars Lefgren, and David Sims. 2010. The persistence of teacherinduced learning gains. Journal of Human Resources 45(4): 915–43. doi:10.1353/jhr.2010
.0029
Johnson, Susan Moore, and John P. Papay. 2009. Redesigning teacher pay. A system for
the next generation of educators. Washington, DC: Economic Policy Institute.
Kane, Thomas J., and Douglas O. Staiger. 2008. Estimating teacher impacts on student
achievement: An experimental evaluation. NBER Working Paper No. 14607.
Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. 2013.
Have we identified effective teachers? Validating measures of effective teaching using
random assignment. MET Project Research Paper. Seattle, WA: The Bill & Melinda
Gates Foundation.
Kane, Thomas J., Eric S. Taylor, John H. Tyler, and Amy L. Wooten. 2010. Identifying
effective classroom practices using student achievement data. NBER Working Paper
No. 15803.
Kinsler, Joshua. 2012a. Beyond levels and growth: Estimating teacher value-added and
its persistence. Journal of Human Resources 47(3): 722–53. doi:10.1353/jhr.2012.0023
Kinsler, Joshua. 2012b. Assessing Rothstein’s critique of teacher value-added models.
Quantitative Economics 3(2): 333–62. doi:10.3982/QE132
Lefgren, Lars, and David Sims. 2012. Using subject test scores efficiently to predict teacher value-added. Educational Evaluation and Policy Analysis 34(1): 109–21. doi:10.3102/0162373711422377
Lockwood, J. R., Daniel F. McCaffrey, Laura S. Hamilton, Brian M. Stecher, Vi Nhuan
Le, and Jose F. Martinez. 2007. The sensitivity of value-added teacher effect estimates
to different mathematics achievement measures. Journal of Educational Measurement
44(1): 47–67. doi:10.1111/j.1745-3984.2007.00026.x
McCaffrey, Daniel F., Tim R. Sass, J. R. Lockwood, and Kata Mihaly. 2009. The intertemporal variability of teacher effect estimates. Education Finance and Policy 4(4):
572–606. doi:10.1162/edfp.2009.4.4.572
The New Teacher Project (TNTP). 2012. The irreplaceables: Understanding the real retention crisis in America’s schools. New York: TNTP.
Papay, John P. 2011. Different tests, different answers: The stability of teacher valueadded estimates across outcome measures. American Educational Research Journal 48(1):
163–93. doi:10.3102/0002831210362589
Podgursky, Michael, Ryan Monroe, and Donald Watson. 2004. The academic quality
of public school teachers: An analysis of entry and exit behavior. Economics of Education
Review 23(5): 507–18. doi:10.1016/j.econedurev.2004.01.005
Rice, Jennifer K. 2009. Investing in human capital through teacher professional development. In Creating a new teaching profession, edited by Dan Goldhaber and Jane
Hannaway, pp. 227–50. Washington, DC: Urban Institute Press.
Rivkin, Steven G., Eric A. Hanushek, and John F. Kain. 2005. Teachers, schools,
and academic achievement. Econometrica 73(2): 417–58. doi:10.1111/j.1468-0262.2005
.00584.x
Rockoff, Jonah. 2004. The impact of individual teachers on student achievement:
Evidence from panel data. American Economic Review 94(2): 247–52. doi:10.1257/
0002828041302244
Rockoff, Jonah E., Brian A. Jacob, Thomas J. Kane, and Douglas O. Staiger. 2010. Can
you recognize an effective teacher when you recruit one? Education Finance and Policy
6(1): 43–74. doi:10.1162/EDFP_a_00022
Rothstein, Jesse. 2009. Student sorting and bias in value-added estimation: Selection on
observables and unobservables. Education Finance and Policy 4(4): 537–71. doi:10.1162/
edfp.2009.4.4.537
Rothstein, Jesse. 2010. Teacher quality in educational production: Tracking, decay,
and student achievement. Quarterly Journal of Economics 125(1): 175–214. doi:10.1162/qjec
.2010.125.1.175
Rothstein, Jesse. 2012. Teacher quality when supply matters. NBER Working Paper
No. 18419.
Samuels, Christina A. 2012. Experts outline steps to guard against cheating. Education
Week, 6 March.
Schochet, Peter Z., and Hanley S. Chiang. 2010. Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
Springer, Matthew G., Dale Ballou, Laura S. Hamilton, Vi-Nhuan Le, J. R. Lockwood,
Daniel F. McCaffrey, Matthew Pepper, and Brian M. Stecher. 2010. Teacher pay for
performance: Experimental evidence from the project on incentives in teaching. Santa Monica,
CA: RAND Corporation.
Staiger, Douglas O., and Jonah E. Rockoff. 2010. Searching for effective teachers with
imperfect information. Journal of Economic Perspectives 24(3): 97–118. doi:10.1257/jep
.24.3.97
Stinebrickner, Todd R. 2001. A dynamic model of teacher labor supply. Journal of Labor
Economics 19(1): 196–230. doi:10.1086/209984
Stinebrickner, Todd R. 2002. An analysis of occupational change and departure from
the labor force: Evidence of the reasons that teachers leave. Journal of Human Resources
37(1): 192–216. doi:10.2307/3069608
Taylor, Eric S., and John H. Tyler. 2012. The effect of evaluation on teacher performance:
Evidence from longitudinal student achievement data of mid-career teachers. American
Economic Review 102(7): 3628–51.
Toch, Thomas, and Robert Rothman. 2008. Rush to judgment: Teacher evaluation in
public education. Washington, DC: Education Sector Reports.
Tucker, Pamela. 1997. Lake Wobegon: Where all teachers are competent (or, have
we come to terms with the problem of incompetent teachers?). Journal of Personnel
Evaluation in Education 11(2): 103–26. doi:10.1023/A:1007962302463
Tyler, John H., Eric S. Taylor, Thomas J. Kane, and Amy L. Wooten. 2010. Using
student performance data to identify effective classroom practices. American Economic
Review 100(2): 256–60. doi:10.1257/aer.100.2.256
Weisberg, Daniel, Susan Sexton, Jennifer Mulhern, and David Keeling. 2009. The
widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness.
New York: The New Teacher Project.