
Journal of Educational Psychology
2012, Vol. 104, No. 3, 743–762
© 2012 American Psychological Association
0022-0663/12/$12.00 DOI: 10.1037/a0027627
Accuracy of Teachers’ Judgments of Students’ Academic Achievement:
A Meta-Analysis
Anna Südkamp
University of Bamberg
Johanna Kaiser and Jens Möller
University of Kiel
This meta-analysis summarizes empirical results on the correspondence between teachers’ judgments of
students’ academic achievement and students’ actual academic achievement. The article further investigates theoretically and methodologically relevant moderators of the correlation between the two
measures. Overall, 75 studies reporting correlational data on the relationship between teachers’ judgments of students’ academic achievement and students’ performance on a standardized achievement test
were analyzed, including studies focusing on different school types, grade levels, and subject areas. The
overall mean effect size was found to be .63. The effect sizes were moderated by use of informed versus
uninformed teacher judgments, with use of informed judgments leading to a higher correspondence
between teachers’ judgments and students’ academic achievement. A comprehensive model of teacher-based judgments of students’ academic achievement is provided in the Discussion.
Keywords: teacher judgment, academic achievement, judgment accuracy
Supplemental materials: http://dx.doi.org/10.1037/a0027627.supp
Academic achievement is a major issue in educational psychology (Winne & Nesbit, 2010). Often, teachers’ judgments are the primary source of information on students’ academic achievement. The ability to accurately assess students’ achievement therefore is considered to be an important aspect of teachers’ professional competence (Ready & Wright, 2011). In acknowledgment of the importance of teachers’ judgments for student learning, the American Federation of Teachers, the National Council on Measurement in Education, and the National Education Association (1990) have developed standards for teacher competence in the educational assessment of students. Likewise, the core propositions of the National Board for Professional Teaching Standards state that teachers should “know how to assess the progress of individual students as well as the class as a whole” (Proposition 3.3; National Board for Professional Teaching Standards, 2010).
Teachers’ judgments can have consequences for their instructional practice, for the further evaluation of students’ performances, and for placement decisions—and can crucially influence individual students’ academic careers and self-concepts. First, teachers use their judgments of students’ academic achievement as a basis for various instructional decisions (Alvidrez & Weinstein, 1999; Clark & Peterson, 1986; Hoge, 1983; Hoge & Coladarci, 1989). These judgments influence teachers’ selection of classroom activities and materials; they determine the difficulties of the tasks selected, the choice of questioning strategies, and the organization of student learning groups; and they may prompt teachers to revise their teaching techniques (Shavelson & Stern, 1981). Elliott, Lee, and Tollefson (2001) consider good assessment to be an integral part of good instruction. In their empirical study, Helmke and Schrader (1987) found that high judgment accuracy in combination with a high-frequency use of instructional techniques such as providing structuring cues or individual support was particularly favorable for student learning.
Teachers have various assessment tools at their disposal, including “oral questioning of students, observation, written work products, oral presentations, interviews, projects, portfolios, tests, and quizzes” (Shepard, Hammerness, Darling-Hammond, & Rust, 2005, p. 294). Although objective measures of students’ academic achievement are now being more widely applied, there are still good reasons to care about the accuracy of teachers’ judgments. For example, in response to intervention (RTI) models of data-based decision making, curriculum-based measures are commonly used to assess students’ academic achievement (VanDerHeyden, Witt, & Gilbertson, 2007). Curriculum-based measures (CBM) are defined as any set of measurement procedures involving “direct observation and recording of student performance in response to selected curriculum materials [which] are emphasized as a basis for collecting information” (Deno, 2003, p. 4) to make instructional decisions. In the context of reading, for instance, the reading skills of elementary students are assessed by CBM every 3–4 months; in the case of students receiving intervention services, objective measures are applied even more regularly (every 1–2 weeks; Begeny, Krouse, Brown, & Mann, 2011). However, teachers make judgments about instruction more often than can be facilitated by objective data. Even in the context of RTI practices, where objective measurement of students’ academic achievement is implemented by default, teachers must still make ongoing instructional decisions that are informed by their judgments.
This article was published Online First March 26, 2012.
Anna Südkamp, National Educational Panel Study, University of Bamberg, Bamberg, Germany; Johanna Kaiser and Jens Möller, Department of
Educational Psychology, University of Kiel, Kiel, Germany.
Correspondence concerning this article should be addressed to Anna Südkamp, National Educational Panel Study, University of Bamberg, Wilhelmsplatz 3, 96047 Bamberg, Germany. E-mail: [email protected]
Second, various authors have noted that accurate teacher judgments can help to identify children who show early signs of
difficulties in school (Bailey & Drummond, 2006; Beswick,
Willms, & Sloat, 2005; Teisl, Mazzocco, & Myers, 2001) and that
accurate information on students’ academic achievement is crucial
for meaningful placement decisions (Helwig, Anderson, & Tindal,
2001). In practice, teachers’ judgments tend to be given heavy
weight in decisions about intervention (Hoge, 1983). In the case of
students requiring intensive intervention, it is the teacher who is
able to employ early, less intensive forms of intervention in the
classroom and who takes steps to arrange more intensive intervention (Begeny et al., 2011).
Third, research has shown that teacher judgments of students’
academic achievement influence teacher expectations about students’ ability (Brophy & Good, 1986). A large body of research in
the form of experimental and naturalistic studies has provided
empirical insights into the formation, transmission, and impact of
teacher expectations on students’ performance (de Boer, Bosker, &
van der Werf, 2010; Jussim & Eccles, 1992).
Fourth, in the context of formal assessment, teacher judgments
of students’ performance are commonly expressed in the form of
grades, which not only provide feedback to students and parents
(Hoge & Coladarci, 1989) but also contribute to exit qualifications
in many countries (Harlen, 2005). As Begeny, Eckert, Montarello,
and Storie (2008) and Feinberg and Shapiro (2003) have pointed
out, grades thus have far-reaching consequences for students’
academic careers.
Fifth, research on academic self-concepts (see Marsh, 1990a,
and Möller, Pohlmann, Köller, & Marsh, 2009, for an overview)
has shown that teacher judgments influence students’ self-related
cognitions of ability. For example, Trautwein, Lüdtke, Köller, and
Baumert (2006) found that the effect of students’ individual
achievement on their academic self-concept is mediated by
teacher-assigned grades. In turn, academic self-concept has a considerable effect on student learning (Marsh, 1990b).
Given the important implications of teacher judgments, the
question of their accuracy is critical. Accurate assessment of
students’ performance is a necessary condition for teachers to be
able to adapt their instructional practices, to make fair placement
decisions, and to support the development of an appropriate academic self-concept.
Teacher Judgment Accuracy
Most research explicitly focusing on teacher judgment accuracy examines the relationship between teachers’ judgments of students’ achievement and students’ actual performance on measures of achievement in various subject areas. However, studies not explicitly intending to assess teacher judgment accuracy (e.g., studies validating student test scores by reference to teacher ratings) can also provide empirical insights into teacher judgment accuracy. Both types of studies (with and without an explicit focus on teacher judgment accuracy) are included in this meta-analysis. Nevertheless, our review of the literature is restricted to studies focusing on teacher judgment accuracy. We carefully distinguish between the two study types wherever necessary throughout the article.
The most commonly reported measure quantifying the correspondence between teachers’ judgments and students’ actual achievement is the correlation between the two. Overall, moderate to high correlations are reported (Begeny et al., 2008; Demaray & Elliott, 1998; Feinberg & Shapiro, 2003). For example, Feinberg and Shapiro (2009) reported correlations of .59 and .60 between teachers’ judgments and students’ decoding skills and reading comprehension, as measured by subtests of the Woodcock–Johnson III Test of Achievement. In the same study, a correlation of .64 was found between students’ oral reading fluency as measured by a CBM procedure and teachers’ predictions of oral reading fluency. In a review of 16 studies, Hoge and Coladarci (1989) found a median correlation of .66 between teachers’ judgments and students’ achievement on a standardized test. On the one hand, these results may be interpreted as indicating that teachers’ judgments are quite accurate; on the other hand, their judgments are evidently far from perfect: with a median correlation of .66, the shared variance is .66² ≈ .44, so more than half of the variance in teachers’ judgments cannot be explained by student performance. Additionally, the correlations found varied substantially across studies, ranging from .28 to .92 (Hoge & Coladarci, 1989).
It is important to note that accuracy is rarely defined in concrete terms in studies explicitly focusing on teacher judgment accuracy (see Ready & Wright, 2011, for an exception). Which outcomes are considered to be accurate or inaccurate therefore remains questionable. To date, no consistent criteria have been established. Moreover, the different methods used to measure teachers’ judgments and students’ academic achievement make a substantial contribution to the degree of accuracy observed. For example, the outcome (degree of accuracy) may differ depending on whether teachers are informed about the standard of comparison for their judgment. Accordingly, the inaccuracy of teachers’ judgments may be grounded in the studies’ methodologies rather than in the teachers’ diagnostic competence. Another limitation broadly shared by studies on teacher judgment accuracy is that they do not account for the dependency of teachers’ judgments on the academic achievement of the students in their class. A multilevel approach to the analysis of teacher judgment accuracy as applied by Ready and Wright (2011) is the most appropriate option, but it is seldom used in studies on teacher judgment accuracy. The reader should keep this limitation in mind when interpreting the results of this meta-analysis.
Factors Influencing Teacher Judgment Accuracy
Against this background, it is clear that methodological differences between studies need to be taken into account when one is considering the differences in the studies’ findings. For example, studies with and without an explicit focus on teacher judgment accuracy clearly differ in terms of the methods used, and these differences warrant particular consideration when all results are interpreted in terms of teacher judgment accuracy.
Judgment Characteristics
As mentioned previously, it can be assumed that various judgment characteristics affect the correspondence between teachers’ judgments and students’ academic achievement. We therefore distinguished various aspects of teacher judgments: informed versus uninformed judgments, number of points on the rating scale used, judgment specificity, norm-referenced versus peer-independent judgments, and domain specificity of teachers’ judgments.
Informed versus uninformed teacher judgments. Hoge
and Coladarci (1989) distinguished between direct and indirect
teacher judgments. In this meta-analysis, we used a slightly different categorization. The main difference between direct and
indirect judgments is that teachers are either informed or uninformed about the test or the standard of comparison on which their
judgment is based. In some studies, teachers are asked to assess
students’ academic achievement on a standardized achievement
test by estimating the number of items each student will solve
correctly (Helmke & Schrader, 1987). This approach can be considered an informed rating. In other studies, teachers are asked to
rate students’ performance in a certain subject on a Likert-type
rating scale (e.g., a 5-point rating scale; DuPaul, Rapport, &
Perriello, 1991). Hoge and Coladarci (1989) called this type of
approach an indirect rating. Here, teachers are usually (but not
always) left uninformed about the standard of comparison to be
applied in their judgment. As this can make an important contribution to their judgments, we chose to distinguish between “informed” and “uninformed” judgments. In line with the results of
Hoge and Coladarci, both Feinberg and Shapiro (2003, 2009) and
Demaray and Elliott (1998) found higher correlations for direct
(usually informed) teacher judgments than for indirect (usually
uninformed) teacher judgments. For example, Feinberg and Shapiro (2003) found a correlation of .70 between students’ test
performance and direct teacher judgments, whereas the correlation
with indirect teacher judgments was .62.
Points on the rating scale. Studies using rating scales to
obtain teacher judgments differ in terms of the number of points on
the rating scales implemented. Rating scales with many categories
permit a sophisticated judgment, whereas scales with fewer categories allow a more global judgment. Generally, slightly higher
correlations with students’ actual performance are obtained for
more sophisticated judgments than for more global judgments. To
date, this variable has been neglected in empirical research on
teacher judgment accuracy. We therefore considered the number of
points on the rating scales used in this meta-analysis, expecting to
find higher correlations between teachers’ judgments and students’
academic achievement when a sophisticated rating scale was used.
Judgment specificity. According to the approach used by
Hoge and Coladarci, teachers’ judgments can be allocated to one
of five categories, ranging from low to high specificity. First, a
judgment that requires teachers to rate students’ academic achievement on a rating scale (e.g., poor–excellent) is considered to be of
low specificity. Second, in a ranking, the teacher’s task is to put the
students of his or her class into rank order according to their
achievement. Third, tasks requiring teachers to find grade equivalents for students’ performance on a standardized achievement
test are considered to be of average specificity. Fourth, tasks
requiring teachers to estimate the number of correct responses
achieved by a student on a standardized achievement test are
slightly less specific than the fifth and most specific category, in
which teachers indicate students’ item responses on each item of
an achievement test. In their review, Hoge and Coladarci found a
median correlation of .61 for studies using ratings, which was the
predominant approach. The median correlations for studies using
rank ordering (median r = .76), grade equivalents (median r = .70), number of correct responses (r = .67, for a single study), and item-based judgments (median r = .70) were indeed higher.
Norm-referenced vs. peer-independent judgments. In addition, teacher judgments may differ in whether they are norm-referenced or peer-independent. For example, Helwig et al. (2001)
asked teachers to rate students’ academic achievement on an
absolute scale (very low proficiency–very high proficiency),
whereas Hecht and Greenfield (2002) asked teachers to estimate
students’ academic achievement in relation to other members of
the class (in the bottom 10% of the class–in the top 10% of the
class). Hoge and Coladarci (1989) considered this aspect in their
meta-analysis but found no substantial difference between correlations. The median correlation for norm-referenced judgments
was .68; that for peer-independent judgments was .64. We also
considered norm-referenced versus peer-independent teacher judgments in the present meta-analysis. However, we did not formulate
a hypothesis about the direction of the effect on teacher judgment
accuracy. It is possible that the use of peer-independent teacher
rating scales leads to higher correlations between teacher judgments and students’ academic achievement, because this approach
allows teachers to focus on each student individually, preventing
judgment biases due to the achievement of other students in the
class (see also the literature on the big-fish-little-pond effect;
Marsh, 1989). On the other hand, it is equally possible that the use
of norm-referenced teacher rating scales produces higher accuracy
scores (correlations), as these correlations reflect teachers’ ability
to establish a rank order within a class based on the students’
achievement.
Domain specificity. Finally, teacher judgments differ in
terms of their domain specificity. Whereas some studies ask teachers to judge students on a very specific ability (e.g., arithmetic
skills; Karing, 2009), others ask them to judge students’ overall
academic achievement (e.g., Li, Pfeiffer, Petscher, Kumtepe, &
Mo, 2008). To our knowledge, no studies to date have examined
the influence of the domain specificity of teachers’ judgments on
teacher judgment accuracy. However, it seems reasonable to hypothesize that it is easier to make a focused judgment on a
domain-specific ability than to judge a student’s overall academic
ability. Therefore, we expect to find higher teacher judgment
accuracy for domain-specific judgments than for global judgments.
Test Characteristics
Like the judgment characteristics we have summarized, test
characteristics in turn depend on methodological decisions made
by the author(s) of the studies. In studies explicitly focusing on
teacher judgment accuracy, various instruments are used to measure students’ academic achievement, ranging from highly specific
tests measuring, for example, receptive vocabulary (e.g., the Peabody Picture Vocabulary Test used by Fletcher, Tannock, &
Bishop, 2001) to broader tests measuring students’ performance in
different subject areas (e.g., the Kaufman Test of Academic
Achievement measuring achievement in mathematics, reading, and
spelling used by Demaray & Elliott, 1998). Such differences
between tests are summarized under the label test characteristics
here. Various test characteristics can be assumed to influence the
correspondence between teachers’ judgments and students’ performance. In this meta-analysis, we considered the subject matter
assessed, accounted for the use of CBM procedures or standardized achievement tests, and distinguished domain-specific tests
from tests covering different domains.
Subject matter. Comparing correlations between teachers’
judgments and students’ academic achievement in different subjects, Hopkins, George, and Williams (1985) found that correlations were significantly lower for social studies and science than
for language arts, reading, and mathematics. Using CBM procedures to gauge students’ academic achievement, Eckert, Dunn,
Codding, Begeny, and Kleinmann (2006) found higher correlations for reading than for mathematics. In turn, Coladarci (1986)
reported teachers’ judgments to be more accurate for students’
performance in mathematics computations than for mathematics
concept items. Demaray and Elliott (1998) found no difference
between correlations in language arts and in mathematics. Hinnant,
O’Brien, and Ghazarian (2009) found that teachers’ ratings of
academic ability as measured by an academic skills questionnaire
were highly correlated with standardized measures of achievement
in reading (.53–.67) and mathematics (.54–.57). Evidently, the
empirical findings on the influence of subject matter on teacher
judgment accuracy are inconsistent.
CBM procedures vs. standardized achievement tests. Some studies of teacher judgment accuracy have used CBM procedures as indicators of students’ achievement (Eckert et al., 2006;
Feinberg & Shapiro, 2003; Hamilton & Shinn, 2003). According to
Feinberg and Shapiro (2003), CBM is closely linked to actual
in-class student performance, as methods derived from curriculum
materials provide a closer overlap with the content of instruction
than do published norm-referenced tests. Feinberg and Shapiro
(2009) found that correlations between a CBM procedure measuring oral reading fluency and teachers’ predictions of oral reading
fluency were slightly higher (.64) than correlations between a
global teacher rating of students’ performance and two subtests of
a standardized achievement test (.59 and .60). In the present
meta-analysis, we therefore consider the use of CBM procedures
versus standardized achievement tests.
Domain specificity. Like teacher judgments, academic
achievement tests also differ in terms of their domain specificity.
Whereas some tests are designed to measure a very specific
academic ability (e.g., phonological awareness; Bailey & Drummond, 2006), others measure different aspects of academic ability
(e.g., the Woodcock–Johnson Achievement Battery; Benner &
Mistry, 2007). We therefore took this test characteristic into consideration in this meta-analysis.
Correspondence Between Judgment and
Test Characteristics
In the present meta-analysis, we also considered the time gap
between teachers’ judgments and the administration of the
achievement test and the congruence in the domain specificity of
the judgment characteristics and test characteristics.
Time Gap
In their review, Hoge and Coladarci (1989) included only studies in which the achievement test was administered at the same
time as the teacher rating task. There are studies, however, in
which these two measures are not implemented concurrently (for
example, the study by Pomplun (2004), which focused on the
validation of a reading test). Due to temporal proximity, we
expected to find higher correlations between teachers’ judgments
and students’ academic achievement when both measures are
administered concurrently than when the test is administered either
before or after the rating task.
Congruence in Domain Specificity
Finally, we considered the congruence in the domain specificity
of the teacher rating task and the achievement test. Theoretically,
the achievement test may measure a specific academic ability,
whereas the teacher judgment task may be less specific—or vice
versa. For example, Hecht and Greenfield (2001) found teachers’
judgments of students’ overall academic competence to be correlated with the students’ performance on the Letter–Word Identification subtest of the Woodcock–Johnson Test of Achievement–
Revised. Here, a general judgment was set in relation to a very
specific ability. We expected to find higher correlations between
teachers’ judgments and students’ achievement in studies in which
the domain specificity of the teacher rating task and the achievement test was congruent (e.g., teachers rated students’ reading
comprehension; students were administered a test of reading comprehension) and lower correlations in studies in which the domain
specificity was incongruent (e.g., teachers rated students’ overall
academic achievement; students were administered a test of reading comprehension).
Teacher and Student Characteristics
Besides judgment and test characteristics, characteristics of
the teachers judging students’ performance and of the students
being judged also warrant consideration. Studies explicitly focusing on teacher judgment accuracy have found large interindividual differences in teachers’ ability to judge student performance (Helmke & Schrader, 1987). For example, Lorenz and
Artelt (2009) reported moderate average correlations between
teacher judgments and student performance in reading and
mathematics for a sample of 127 teachers. The standard deviation for the mean of the correlations was .30 for reading and
.39 for mathematics. Some teachers showed very high judgment
accuracy; others, very low judgment accuracy. These findings
raise the question of which characteristics of teachers predict
their judgment accuracy. A teacher’s characteristics are thought
to influence his or her judgment at various stages of the judgment process (e.g., reception, perception, interpretation), and
characteristics such as job experience (Impara & Plake, 1998),
beliefs (Shavelson & Stern, 1981), professional goals (Schrader
& Helmke, 2001), and teaching philosophy (Hoge & Coladarci,
1989) have previously been associated with teachers’ judgment
processes in the literature. Although the variability in the accuracy of teachers’ judgments is well documented (Helmke &
Schrader, 1987; Hoge & Coladarci, 1989), empirical research
has not yet pinpointed individual teacher characteristics that
influence judgment accuracy. As teacher characteristics have
only been examined in a small number of studies to date,
moreover, we were not able to study their effects in the present
meta-analysis.
At the same time, several student characteristics have been
identified as influencing the accuracy of teachers’ judgments. For
example, Bennett, Gottesman, Rock, and Cerullo (1993) found that
teachers who perceived their students as exhibiting bad behavior
also perceived these students to be low academic performers,
regardless of the students’ academic skills. In a study by Hurwitz,
Elliott, and Braden (2007), the accuracy of teachers’ judgments
was related to students’ disability status: Teachers predicted the
mathematics test performance of students without disabilities more
accurately than that of students with disabilities. As is the case for
teacher characteristics, however, few studies to date have reported
information on the student sample. Moreover, any data available
are not readily comparable across studies (e.g., only the percentage
of female/male students was reported). Therefore, we decided not
to conduct moderator analyses on student characteristics in this
meta-analysis.
Meta-Analytic Approach
A review of the literature frequently cited in studies on teacher judgment accuracy (Begeny et al., 2008; Feinberg & Shapiro, 2003; Hinnant et al., 2009) is that by Hoge and Coladarci (1989). As mentioned previously, this review summarized the results of 16 studies presenting data on the relationship between teachers’ judgments of
previously, this review summarized the results of 16 studies presenting data on the relationship between teachers’ judgments of
students’ academic achievement and the students’ actual performance on an independent criterion of achievement. Hoge and
Coladarci reported a range of correlations from .28 to .92 and a
median correlation of .66.
Hoge and Coladarci (1989) also examined how different methodological study characteristics (direct vs. indirect judgments, instruction specificity, norm-referenced vs. peer-dependent judgments) were related to the correspondence between teachers’
judgments and students’ achievement. They also sought to identify
moderator variables (student gender, subject matter, student ability) influencing the size of the correlation between the two measures. Because only 16 studies were included in the review, the
sample sizes for studying the different effects were small. As such,
only descriptive analyses could be presented. For example, three
studies distinguished between male and female students and found
no effect of gender on teacher judgment accuracy. Similarly, two
studies explored the influence of student achievement on teacher
judgment accuracy, revealing higher levels of teacher accuracy in
judging appropriateness of instruction for higher achieving than
for lower achieving students (Leinhardt, 1983) and lower levels of
accuracy in judging the performance of lower achieving students
(Coladarci, 1986). In the present meta-analysis, we did not evaluate the primary studies separately and descriptively, as was done
by Hoge and Coladarci. As such, we were unable to control for
student ability as a moderating variable, as the different testing
procedures used meant that data on students’ average achievement
(means and standard deviations) were not comparable across studies.
Since the publication of the Hoge and Coladarci review in 1989,
numerous further studies have reported data on teachers’ judgment
accuracy. In order to overcome the limitations of their narrative
review and to draw a clear picture of current findings on teacher
judgment accuracy, we therefore present a comprehensive meta-analysis. Beyond the statistical synthesis of study results, we
evaluated whether potential moderators influence the size of the
correlation between teacher judgments and students’ actual academic achievement.
Method
Information Retrieval Process
We identified relevant studies by applying a multimodal search
strategy involving both electronic and manual searches.1 The literature search process consisted of two phases. First, we conducted
preliminary searches to refine our research questions and to define
the key concepts. In this phase, we also refined and modified our
search terms by using database thesauri to ensure that the universe
of appropriate synonyms was included. The main searches were
conducted in the second phase (March–July 2009).
Electronic searches were conducted using the four main search
engines in the fields of psychology and education, which cover a
wide variety of bibliographic databases: the Education Resources
Information Center (ERIC), PSYNDEXplus in Journals@Ovid,
the EBSCOhost (including PsycARTICLES, PsycINFO, and the
Psychology and Behavioral Sciences Collection), and the Web of
Science (including the Science Citation Index Expanded, the Social Sciences Citation Index, the Arts & Humanities Citation
Index, the Conference Proceedings Citation Index–Science, and
the Conference Proceedings Citation Index–Social Science & Humanities). The search terms entered in these databases include
“teacher judgment,” “teacher expectations,” and “classroom assessment.” A full list of search terms is given in the Appendix.
Inclusion Criteria and Exclusion Criteria
General criteria. In order to identify studies reporting data
on the accuracy of teachers’ judgments of students’ academic
achievement, we searched for studies analyzing the relationship
between teacher judgments of students’ academic achievement and
students’ actual performance on an achievement test. We excluded
studies (e.g., Pohlmann, Möller, & Streblow, 2004; Spinath, 2005)
analyzing the accuracy of teachers’ judgments of student characteristics other than achievement (e.g., motivation, attention, anxiety).
First, we included studies conducted to validate teachers’ judgments by reference to students’ performance on a (standardized)
achievement test. Second, we included studies conducted to validate a standardized achievement test by reference to teachers’
judgments. Third, we sought to include any study reporting on the
relationship between teachers’ judgments and students’ academic
achievement.
English abstract. As we used English search terms in the
literature search, we included all studies retrieved with an English
title and abstract, including studies in languages other than English. For example, the study by Eshel and Benski (1995) was
written in Hebrew.2
School system. We included only those studies that reported
teachers’ judgments of students enrolled in the regular school
system (e.g., from kindergarten through Grade 12 in the United
States). We excluded studies focusing on college students, vocational training students, or prekindergarten children.
1 We would like to thank the following people for their contribution to this meta-analysis: Yvonne Anders, Susannah Goss, Friederike Helm, Annette Heberlein, Nils Machts, Maria Rauch, Angelika Ribak, and Camilla Rjosk.
2 Our thanks go to the first author, Yohanan Eshel, for translating the relevant information for us.
Quantitative data. We included only studies reporting quantitative data. Qualitative studies were excluded.
Field research. We also excluded studies that were not
conducted in the field but that used computer simulations (Südkamp & Möller, 2009) or case descriptions (Chang & Sue, 2003)
to analyze the accuracy of teachers’ judgments.
Publication year. As Hoge and Coladarci published their
meta-analysis on the accuracy of teachers’ judgments in 1989, we
limited our search to studies published between January 1989 and
December 2009. The only exception is the study by Anders,
Kunter, Brunner, Krauss, and Baumert (2010), which was in press
in 2009. As we used rather broad keywords in the literature search
to identify all studies reporting a correlation between teachers’
judgments and students’ academic achievement, our searches produced high numbers of potentially relevant studies. Including
studies published before 1989 would have considerably increased
the number of studies identified and thus have been prohibitively
costly. The procedure of defining a certain cutoff point for the
inclusion or exclusion of studies also has been applied in other
recent meta-analyses (see, e.g., Cafri, Komrey, & Brannick, 2010;
Fischer & Boer, 2011; Tillman, 2011).
Statistics. Most studies on the accuracy of teachers’ judgments report correlations between teachers’ judgments and students’ performance on an achievement test. However, the relationship can also be presented by means of other statistics (e.g., t test
results or regression coefficients). In the information retrieval
process, we searched for studies reporting any values representing
the correspondence between teachers’ judgments and students’
academic achievement. In addition, some studies (for example, Demaray & Elliott, 1998) report additional measures (here, the results of t tests) in order to answer a specific research question. Nevertheless, the relationship between teachers’ judgments and students’ academic achievement is already presented adequately by a correlation coefficient. In these cases, we decided to
only include the correlation coefficients in the meta-analysis. During the information retrieval process, no study was identified in
which only t test results were reported.
Simultaneity of judgments and tasks. In contrast to Hoge
and Coladarci (1989), we did not limit our analysis to studies in
which judgment and test data were collected concurrently but
included studies in which judgments were made prior to or after
testing.
Publication source. In meta-analyses, the problem of publication bias (i.e., the selective publication of studies with a particular outcome, usually those whose results are statistically significant, at the expense of null studies; Ferguson & Brannick, 2011;
Sutton, 2009) is often addressed by the inclusion of “unpublished”
studies (dissertations, conference papers, and the like). In the
present meta-analytic review, however, we decided to focus our
attention on articles published in scientific journals for three main
reasons. First, the issue of publication bias in studies on teacher
judgment accuracy is a rather minor one, as findings of low
correlations between teachers’ judgments and students’ academic
achievement do not usually prevent findings from being published.
As is evident from the wide range of correlations presented in the
Results section, the range of effect sizes reported is large. Second,
there were methodological reasons for the decision to exclude gray
literature. As Ferguson and Brannick (2011) have pointed out,
including gray literature in an attempt to overcome the problem of
publication bias in fact often exacerbates the problem. For example, whereas the proportion of published articles exceeds that of
unpublished articles in any meta-analysis (see Balliet, Mulder, &
van Lange, 2011, and Kleingeld, van Mierlo, & Arends, 2011, for
recent examples), the ratio of published to unpublished studies in
the field may be the reverse. In our case, a search of the ProQuest
dissertation database identified many studies conducted in the
United States, but other international studies clearly were underrepresented. We therefore decided to include only published studies in our meta-analysis to ensure that our selection was clear and
transparent. Third, the decision to exclude unpublished studies was
the result of limited resources. As described earlier, the use of
rather broad keywords led to the identification of high numbers of
potentially relevant references. Including dissertations, conference
papers, and so on in the study selection and coding process would
have been prohibitively time-consuming. Interested readers are
referred to Lench, Flores, and Bench (2011) for an example of a
meta-analysis in which gray literature was excluded for reasons of
limited resources. The decision to exclude gray literature has also
been made in other recent meta-analytic reviews (e.g., Fischer &
Boer, 2011; Tillman, 2011).
Explicit judgments. We included only those studies in which
teachers were asked explicitly to judge students’ academic
achievement. Although grading may also be considered a form of
teacher judgment, we excluded studies in which grades were used
as teacher judgments.
Study Selection Procedure
In a first step, all search terms were entered in each database,
resulting in a total of 20,456 potentially relevant references. The
title and abstract of each reference were read by one researcher,
who decided whether to include the reference on the basis of the
inclusion/exclusion criteria. With this selection process, a total of
1,083 references were identified as potentially including information on the relationship between teachers’ judgments and students’
academic achievement, which were retrieved for further review. In
a next step, the selected studies were carefully read, and the
inclusion/exclusion criteria were applied. Among the 1,083 studies
ascertained to be potentially relevant, we identified 37 studies
including data on teachers’ judgments and students’ academic
achievement but not reporting the correlation between the two
measures or other statistical indices that would have allowed
transformation to correlation coefficients. These studies were excluded from the analyses. For example, Jones and Gerig (1994)
obtained teachers’ rankings of students’ achievement and students’
test scores, but they did not report the correlation between the two
measures. Instead, means and standard deviations of the teacher
ranking were reported by “achievement level” (1–4) for silent and
nonsilent readers separately. It was therefore not possible to transform these data into correlation coefficients. Likewise, Smith,
Jussim, and Eccles (1999) collected data on students’ academic
achievement and teachers’ ratings of students’ academic achievement. Here, relationships between the two measures were reported
only in complex multivariate models, making calculation of the
single correlation between the two measures impossible. There
were similar problems with the statistics reported in the other 35
studies that were excluded. As our final selection of 94 studies was
limited to studies reporting the correlation between teachers’ judgments and students’ academic achievement, there was no need to
transform other measures into correlation coefficients.
The 94 studies identified were then screened for further references not found through the electronic searches. This manual
search produced nine relevant references. A total of 103 relevant
studies were thus identified and included in the coding process.
Some of these studies were closely related. As we wanted to avoid
including duplicate data, we excluded articles that seemed to report
the same data as another article. For example, Bennett, Gottesman,
Cerullo, and Rock (1991) and Gottesman, Cerullo, Bennett, and
Rock (1991) both reported data on a sample of 796 students; one
table of descriptive data is the same in both articles. We decided to
include only the Gottesman et al. (1991) article, which includes
more information on the subsamples analyzed. In addition, 15
studies were excluded because they were not published in a regular
journal (most were reports and conference papers found in the
ERIC database). Moreover, studies focusing on academic achievement in subjects other than language arts and mathematics were
excluded, as these subjects were clearly underrepresented. Specifically, we identified one study by Trouilloud, Sarrazin, Martinek,
and Guillet (2002) on sports and one study by Klinedinst (1991) on
music. A further three studies were excluded because they were
found to not meet the inclusion criteria during the coding process
(e.g., teachers rated students’ learning behavior rather than academic achievement). As a result of this study selection procedure,
75 studies were included in the present review.
Data Coding
Two of the authors independently coded all studies. Before
analyzing the data, we calculated the level of interrater agreement
on the coding of key variables. For the categorical variables, we
used Cohen’s kappa to assess agreement (Cohen, 1960). The
resulting kappa coefficients were as follows: country: .99; aim of
the study: .96; informed versus uninformed judgments: .93; judgment specificity: .97; norm-referenced versus peer-independent
judgments: 1.00; domain specificity of the achievement test: .93,
domain specificity of the judgment task: .93; time gap: .97. For the
remaining variables, we determined the percentage of times that
the two raters recorded the same value for each independent
sample. The levels of interrater agreement were as follows: teacher
judgment accuracy (mean intercoder agreement for all coded correlations): 96%; year of publication: 100%; sample size: 99%;
points on rating scale: 100%. Instances of disagreement were
resolved by discussion.
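For readers unfamiliar with the agreement index, the following minimal Python sketch (ours, not the authors' coding software) shows how Cohen's kappa is computed from two coders' categorical codes; the study labels in the example are hypothetical.

```python
# Minimal sketch of Cohen's kappa (Cohen, 1960) for two coders' categorical codes.
# Illustrative only; the example codes below are hypothetical.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: sum over categories of the product of the marginal proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two coders classifying eight studies as informed (i) or uninformed (u) judgments:
print(round(cohens_kappa(list("uuiuuuiu"), list("uuiuuuuu")), 2))  # 0.6
```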
If information on the variables under investigation was not
available from a study, it was coded as missing. Information on the
following variables was coded:
Teacher judgment accuracy. All reported correlations between teachers’ judgments and students’ actual test performance
were extracted from the selected studies. Negative correlations were multiplied by −1 whenever lower values represented more favorable judgments and higher values represented less favorable judgments on the teacher rating scale (e.g., Tindal & Marston, 1996; Wilson, Schendel, & Ulman, 1992).
Study-specific characteristics. The following study-specific
characteristics were coded for primarily methodological reasons:
Year of publication. Publication year was coded as a continuous variable.
Country. We coded the country in which the study was
conducted and allocated each study to one of five groups: United
States, Australia, Canada, Europe, and other countries.
Aim of the study. According to the inclusion/exclusion criteria, we coded whether the main aim of the study was to validate
teachers’ judgments by tests, or to validate achievement tests by
teachers’ judgments, or whether studies simply reported the correlations between the two measures.
Sample size. For each study, the number of students rated and
the number of teachers rating student performance were coded. As
many studies reported correlations for different subsamples, we
coded the exact size of the student and teacher samples for each
correlation.
Judgment characteristics. The following aspects of teacher
judgments were taken into account:
Informed versus uninformed judgments. We coded whether
teachers were informed about the achievement test on which their
judgment of student achievement would be based—that is, about
the standard of comparison to be applied in their judgment.
Points on rating scale. For the later analysis, we coded the
number of categories given on the rating scales used.
Judgment specificity. Teachers’ judgments were classified as
ratings (e.g., rating of students’ performance in mathematics),
rankings (e.g., ranking of students from lowest to highest in
reading ability), or estimations of the number of correct responses
(e.g., estimation of the number of items solved correctly). Unlike
Hoge and Coladarci (1989), we did not include grading as a type
of teacher judgment. That approach would have increased the
number of relevant studies enormously. None of the studies in our
sample asked teachers to indicate students’ responses on each item
of an achievement test.
Norm-referenced versus peer-independent judgments. On
peer-dependent rating scales (e.g., near the bottom of the class–one
of the best in the class), teachers are asked to rate students’
performance in relation to a reference group (usually the other
students in the class). Peer-independent rating scales do not elicit
an explicit comparison with a reference group (e.g., very low
ability–very high ability).
Domain specificity. We coded the domain specificity of the
judgment task using the following three categories: judgment of
overall academic ability (0), judgment of academic ability in one
subject (1), and judgment of a specific academic ability within a
subject (2).
Test characteristics. On the basis of our theoretical considerations, we coded information on the following test characteristics:
Subject matter. We coded the domain (language arts or
mathematics) in which academic ability was measured. Some
studies administered tests measuring achievement in different subjects. In these cases, the subject was coded as mixed.
CBM procedures versus standardized achievement tests. We
differentiated between the use of standardized achievement tests
and CBM procedures.
Domain specificity. We coded the domain specificity of the
achievement test using the following three categories: covered
different subjects (0; e.g., mathematics and language arts), covered
a single subject (1; e.g., mathematics), and covered a specific
ability within a subject (2; e.g., oral comprehension).
Correspondence between judgment and test characteristics.
With regard to the correspondence between judgment and test
characteristics, the following information was coded:
Time gap. Our meta-analysis includes studies that report
measures of teacher judgments and students’ academic achievement obtained at different points of time. Therefore, we also coded
when teachers’ judgments were made: same time (achievement
test and rating task administered within a 1-month period), test
before rating (achievement test administered at least 1 month
before rating task), or test after rating (achievement test administered at least 1 month after rating task).
Congruence in domain specificity. We coded the domain
specificity of the achievement test and the judgment task separately (as described previously). In a second step, we calculated the
difference between the domain specificity of the two measures in
order to gauge the congruence between the achievement test and
the judgment task. In the subsequent analysis, we coded the studies
as using either a congruent achievement test and rating task (0,
achievement test and rating are equally specific) or an incongruent
achievement test and rating task (1, one measure is more specific
than the other).
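As a minimal illustration of this two-step coding (our own sketch with hypothetical names, not taken from the article), the congruence code is simply a function of the two domain-specificity codes:

```python
# Congruence coding sketch: 0 = congruent (test and judgment task equally specific),
# 1 = incongruent (one measure more specific than the other).
def congruence_code(test_specificity: int, judgment_specificity: int) -> int:
    # Both inputs use the 0/1/2 domain-specificity codes described above.
    return 0 if test_specificity == judgment_specificity else 1

# Example: a single-subject test (1) paired with a judgment of overall ability (0)
print(congruence_code(1, 0))  # 1 (incongruent)
```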
Analytical Issues
For this meta-analysis, we coded not only study outcomes but
also several study characteristics as variables with the potential to
explain differences in study outcomes. Some studies reported
separate correlations for different methodological approaches (e.g.,
a focus on language arts or mathematics). For studies in which
more than one correlation coefficient was reported, we calculated
the mean correlation coefficient (Lipsey & Wilson, 2001; Möller et
al., 2009; O’Mara, Marsh, Craven, & Debus, 2006). As some
studies reported correlation coefficients for different subsamples
or different methodological approaches, we calculated the mean
correlations for these subsamples or differing approaches separately (Kalaian & Kasim, 2008). In these cases, we included more
than one effect size from the same study in the meta-analytic
calculations and thus had to deal with the problem that those effect
sizes were not independent. The number of participants in each
study (N) refers to the number of students who were rated. For
studies reporting correlations from more than one sample, we
calculated the mean number of participants across all samples in a
study (Möller et al., 2009). To account for the hierarchical structure of the meta-analytic data (subjects within studies at the first
level and studies at the second level), we applied a multilevel
approach (Hox, 2002; Kalaian & Kasim, 2008). This approach
assumes that the primary studies under review are samples from
the population of studies. Accordingly, an estimate of a study’s
effect size is regarded as a function of a true population effect size,
within-study sampling error, and random between-studies error.
The variation in the between-studies error is estimated via the
multilevel approach and can be modeled and explained using study
and sample characteristics. The multilevel approach combines
features of the traditional fixed effects approach and the random
effects approach. It assumes differences in effect sizes beyond
those due to sampling error. Additionally, unlike the fixed effects
and the random effects approach, the multilevel approach does not
assume the independence of effect sizes (Marsh, Bornmann, Mutz,
Daniel, & O’Mara, 2009). As we did not have access to the
original raw data but had to draw on the published descriptive
results, we assumed the sampling error to be known (variance-known model) and calculated the sampling variances of the effect
sizes from the summary statistics of the primary studies (Kalaian
& Kasim, 2008). The analyses were performed with hierarchical
linear modeling (HLM Version 6; Raudenbush, Bryk, Cheong, &
Congdon, 2004) using the HLM2 option, in which restricted maximum likelihood estimation is applied.
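The following display gives a schematic statement of the variance-known two-level model as we read Kalaian and Kasim (2008); the notation is ours and is meant only to summarize the preceding paragraph.

```latex
% Level 1 (within studies): observed effect size = true study effect + known sampling error
Z_{r_j} = \delta_j + e_j, \qquad e_j \sim N(0, V_j), \qquad V_j \approx \frac{1}{n_j - 3}
% Level 2 (between studies): true effects vary randomly around a grand mean
\delta_j = \gamma_0 + u_j, \qquad u_j \sim N(0, \tau^2)
% Conditional (moderator) models extend Level 2 with coded study characteristics W_j:
\delta_j = \gamma_0 + \gamma_1 W_j + u_j
```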
Results
The 75 studies included in the analysis of effect sizes are
documented in Table 1. The correlation between teacher judgments and students’ test performance (r) and the size of the student sample (N) are reported for each study. For studies reporting more than one correlation, we calculated the mean correlation and the mean size of the student sample (see previous text). The mean correlation was calculated with Fisher’s z transformation of the single correlations (Hedges & Olkin, 1985; Lipsey & Wilson, 2001). Then the coefficients were re-transformed into correlation coefficients (Borenstein, 2009). The table also includes an effect size for each study (Zr), which was again calculated using Fisher’s z transformation, and the asymptotic variance of the effect sizes (VarZr; Rosenberg, Adams, & Gurevitch, 2000). In the following
analyses, the effect size Zr serves as the dependent variable.
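A minimal Python sketch of these computations (ours, not the authors' analysis code; the input numbers are illustrative) may help make the transformations concrete:

```python
# Fisher's z transformation, back-transformation, mean correlation across several
# coefficients from one study, and the asymptotic variance of Zr. Illustrative only.
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))   # Zr = arctanh(r)

def inverse_fisher_z(z):
    return math.tanh(z)                         # back to the r metric

def mean_correlation(rs):
    # Average in the z metric, then back-transform (Hedges & Olkin, 1985)
    return inverse_fisher_z(sum(fisher_z(r) for r in rs) / len(rs))

def var_zr(n):
    return 1.0 / (n - 3)                        # asymptotic variance of Fisher's z

rs = [0.59, 0.60, 0.64]                         # illustrative within-study correlations
print(round(mean_correlation(rs), 2))           # 0.61
print(round(fisher_z(0.63), 2), round(var_zr(148), 3))  # 0.74 0.007
```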
Summary of Effect Sizes
In a first step, we applied an unconditional multilevel model to
the data to estimate the overall mean effect size and to examine
heterogeneity in the primary study effects. No explanatory variables are included at either level in an unconditional multilevel
model. The results of the baseline model are presented in Tables 2
and 3.3 The overall mean effect size of the 73 effect sizes included
in the analysis was .63 and significantly different from zero. As the
large and highly significant chi-square test indicates, the effect
sizes were heterogeneous, indicating a need to include explanatory
variables in the model to explain the variance in the effect sizes. As
presented in Table 1, the Fisher’s z-transformed correlations
ranged between −0.03 and 1.18.
Next, we computed several conditional multilevel models in
which the explanatory predictor variables were entered separately.
For each model, only those studies reporting data on the predictor
variable of interest were included in the analysis; all others were
excluded. For studies reporting correlations for different categories
of a predictor variable (e.g., informed vs. uninformed teacher
judgments), the mean correlation for each category was calculated;
a weighted mean effect size was then calculated for each category,
and all categories were included in the analysis. Due to this
procedure, the sample size varied across the models. Additionally,
some studies were excluded by the multilevel software whenever
the variance of the effect sizes was zero.
3 Study Numbers 51 and 75 were excluded from the analysis because the variance of the effect size in these studies was zero.
Table 1
Summary of Studies Included in the Meta-Analysis

Study no.  First author      Year   Country             Judgment type (i/u)  Subject     Congruence  r      N     Zr      Var(Zr)
1          Graue             1989   United States       u                    mixed       ic          .28      63  0.290   .017
2          Webster           1989   United States       u                    l/m         ic          .45     134  0.490   .008
3          Gullo             1990   United States       u                    l/m/mixed   c/ic        .42      65  0.480   .016
4          Meyer             1990   Canada              u                    l           c/ic        .56     171  0.640   .006
5          Schrader          1990   Germany             i                    m           c           .51     690  0.560   .002
6          DuPaul            1991   United States       u                    l/m         ic          .48      50  0.520   .021
7          Gottesman         1991   United States       u                    mixed       c           .45      93  0.490   .011
8          Kenealy           1991   Great Britain       u                    mixed       c           .58     426  0.660   .002
9          Miller            1992   United States       i                    l/m         c           .60      60  0.700   .018
10         Wilson            1992   United States       u                    l           c/ic        .54    1265  0.610   .001
11         Freeman           1993   Not reported        i                    l           c           .72     214  0.910   .005
12         Jenkins           1993   United States       u                    l           c           .62     210  0.730   .005
13         Jorgenson         1993   United States       u                    l/m/mixed   c           .65      63  0.780   .017
14         Kenny             1993   Australia           u                    l/mixed     c/ic        .39      99  0.410   .010
15         Sink              1993   United States       u                    l/m         ic          .59      59  0.680   .018
16         Wilson            1993   United States       u                    l/m         ic          .52      60  0.580   .018
17         Eaves             1994a  United States       u                    l/m/mixed   c           .66      89  0.790   .012
18         Eaves             1994b  United States       u                    l/m         c/ic        .42      45  0.450   .024
19         Salvesen          1994   Norway              u                    l           c           .70     603  0.870   .002
20         Eshel             1995   Israel              u                    l/m         c/ic        .47     201  0.510   .005
21         Wright            1995   United States       u                    l/m         ic          .48      74  0.520   .014
22         Kwok              1996   Canada, Hong Kong   u                    m           c/ic        .49     126  0.530   .008
23         Maguin            1996   United States       u                    l/m         c           .73     368  0.930   .003
24         Tindal            1996   United States       u                    l           c           .60     130  0.700   .008
25         Gresham           1997   United States       u                    l/m         c/ic        .32     150  0.330   .007
26         Hartman           1997   United States       u                    l           c/ic        .79      34  1.070   .032
27         Hodges            1997   United States       u                    l           c           .69     121  0.840   .009
28         Saint-Laurent     1997   Canada              u                    l/m         c           .49     606  0.530   .002
29         Demaray           1998   United States       u                    l/m/mixed   c/ic        .73      47  0.930   .023
30         Flynn             1998   United States       u                    l           c           .46    1634  0.500   .001
31         DiPerna           1999   United States       u                    l/m/mixed   c/ic        .69      32  0.840   .035
32         van Kraayenoord   1999   Germany             u                    l           c           .52      75  0.570   .014
33         Espin             2000   United States       u                    l           c           .51      80  0.560   .013
34         Bates             2001   Australia           i                    l           c           .70     108  0.870   .010
35         Elliott           2001   United States       u                    l           c           .59      75  0.680   .014
36         Fletcher          2001   Australia           u                    l           c           .42      47  0.450   .023
37         Helwig            2001   United States       u                    l/m         c/ic        .56     206  0.640   .005
38         Kuklinski         2001   United States       u                    l           c           .58      62  0.660   .017
39         Limbos            2001   Canada              u                    l/mixed     c           .58     178  0.660   .006
40         Madon             2001   United States       u                    m           c           .66    1692  0.790   .001
41         Meisels           2001   United States       u                    l/m         c/ic        .60      70  0.700   .015
42         Teisl             2001   United States       u                    l/m         c/ic        .41     234  0.440   .004
43         Hecht             2002   United States       u                    l           c           .57     170  0.650   .006
44         Sofie             2002   United States       u                    l           c           .58      40  0.660   .027
45         Burns             2003   United States       u                    l           c           .40     147  0.420   .007
46         Feinberg          2003   United States       i                    l           c           .66      30  0.800   .037
47         Hauser-Cram       2003   United States       u                    l/m         c           .49     105  0.530   .010
48         Pomplun           2004   United States       u                    l           c           .56     208  0.640   .005
49         Triga             2004   Greece              i                    l           c           .84     125  1.220   .008
50         Beswick           2005   United States       u                    l           c           .67     205  0.810   .005
51         Dale              2005   Great Britain       u                    l           c           .54    5542  0.610   .000
52         Herbert           2005   United States       u                    l/m         ic          .52     359  0.580   .003
53         Hughes            2005   United States       u                    l/m         ic          .44     607  0.470   .002
54         Madelaine         2005   Australia           u                    l           c           .73     396  0.930   .003
55         Montague          2005   United States       u                    l/m         ic          .48      55  0.520   .019
56         Bailey            2006   United States       u                    l           c           .25      16  0.260   .077
57         Dompnier          2006   France              i                    l/m         ic          .71     663  0.880   .002
58         Eckert            2006   United States       u                    l/m         c/ic        .46      33  0.500   .033
59         Benner            2007   United States       u                    l           c           .47     314  0.510   .003
60         Trautwein         2007   Switzerland         i                    m           c           .76     741  1.000   .001
61         Begeny            2008   United States       u                    l           c           .72      87  0.900   .012
62         Graney            2008   United States       u                    l           c          -.03      93  -0.030  .011
63         Lembke            2008   United States       u                    m           ic          .51      45  0.560   .024
64         Li                2008   China               u                    l/m         c           .54     499  0.600   .002
65         Maunganidze       2008   Zimbabwe            u                    l           c           .36      60  0.380   .018
66         Methe             2008   United States       u                    m           c/ic        .80      76  1.090   .014
67         Bang              2009   United States       u                    l           ic          .31     273  0.320   .004
68         Feinberg          2009   United States       i/u                  l           c           .49     148  0.540   .007
69         Gallant           2009   United States       u                    l/m         c/ic        .38    1281  0.400   .001
70         Hinnant           2009   United States       u                    l/m         c/ic        .58     964  0.670   .001
71         Karing            2009   Germany             u                    l/m         c           .54    1449  0.600   .001
72         Lorenz            2009   Germany             u                    l/m         c           .59    1786  0.680   .001
73         Martínez          2009   United States       u                    m           c           .61    9650  0.710   .000
74         McElvany          2009   Germany             i                    l           c           .34     812  0.350   .001
75         Anders            2010   Germany             u                    m           c           .35    1085  0.370   .001

Note. i = informed judgments; u = uninformed judgments; mixed = test(s) covering different subjects; l = test(s) on academic ability in language arts; m = test(s) on academic ability in mathematics; c = congruent; ic = incongruent.
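The Zr and Var(Zr) columns in Table 1 are consistent with the standard Fisher r-to-z transformation and its large-sample variance (Hedges & Olkin, 1985), Zr = 0.5 ln[(1 + r)/(1 - r)] and Var(Zr) = 1/(N - 3). A minimal sketch of this computation is given below; the function name and the printed check against Study 1 are illustrative only and are not taken from the authors' analysis scripts.

```python
import math


def fisher_z(r, n):
    """Fisher r-to-z transformation with its large-sample sampling variance."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # Zr
    var_z = 1.0 / (n - 3)                   # Var(Zr)
    return z, var_z


# Illustrative check against Study 1 in Table 1 (Graue, 1989): r = .28, N = 63.
z, var_z = fisher_z(0.28, 63)
print(round(z, 3), round(var_z, 3))  # about 0.288 and 0.017; Table 1 reports Zr = 0.290 and Var(Zr) = .017
```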
For the different categories of the moderator variables, the mean
correlation (r), the mean sample size (N), the mean effect size (Zr),
and the variance of the effect size (Var(Zr)) are only reported where
moderator effects were statistically significant and for “subject” as
a moderator.4 If a study supplied correlations for only one category
of a moderator variable (e.g., only informed judgments), the summary statistics are displayed in Table 1. If a study supplied
correlations for more than one category of a moderator variable
(e.g., informed and uninformed judgments), summary statistics are
displayed in Tables S1–S3 in the online supplemental material.
Moderator Analyses
Publication year. Model 1 (see Table 2) considered the effect
of publication year (73 effect sizes included), which did not
emerge as a statistically significant moderator. Thus, our findings did not indicate that effect sizes varied systematically according to the study's year of publication.
Country. Model 2 tested the effect of the country in which the
study was conducted (72 effect sizes). The studies were split into five
groups, with studies conducted in the United States being chosen as
the reference category. Most of the studies selected were conducted in
the United States (69.9%), followed by European countries (16.4%),
Canada (4.1%), Australia (5.5%), and other countries (4.1%). None of
the effects was statistically significant.
Aim. The effect of the main aim of the study was tested in
Model 3. Overall, 48.1% of the studies were conducted to compare
teachers’ judgments with students’ outcomes on an achievement
test and 16.9% of the studies aimed to validate an achievement test
by reference to teachers’ judgments. A further 35% of studies were
conducted for other purposes but also reported the correlation
between the two measures. The aim of the study did not emerge as
a significant moderator (73 effect sizes included).
Judgment characteristics. As information on the methods
used to obtain teachers’ judgments was presented in most studies,
the following judgment characteristics could be included in the
analysis:
Informed versus uninformed teacher judgments. In most
studies, teachers were not informed about the achievement test to
which their judgment would be related (86.8%); only 13.2% of the
74 effect sizes included were related to informed judgments.
Model 4 revealed a statistically significant effect of informed versus
uninformed judgments, indicating higher correlations between students'
academic achievement and informed teacher judgments
(mean effect size = .76) than uninformed teacher judgments (.61).
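For illustration, the comparison underlying such a moderator effect can be approximated by inverse-variance weighted means of the Fisher z effect sizes within each judgment category, back-transformed to the correlation metric. The sketch below is a simplification of the multilevel models reported in Table 2 (it ignores the random study effect), and the input values are hypothetical.

```python
import math


def weighted_mean_r(z_values, variances):
    """Inverse-variance weighted mean of Fisher z effects, back-transformed to r."""
    weights = [1.0 / v for v in variances]
    z_bar = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    return math.tanh(z_bar)  # inverse of the Fisher transformation


# Hypothetical Fisher z effect sizes and variances, split by judgment type.
informed_z, informed_var = [0.87, 1.00, 0.91], [0.010, 0.001, 0.005]
uninformed_z, uninformed_var = [0.49, 0.61, 0.66], [0.008, 0.001, 0.002]

print(weighted_mean_r(informed_z, informed_var),
      weighted_mean_r(uninformed_z, uninformed_var))
```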
Points on rating scale. Model 5 examined whether the number of points on the rating scales used had an effect (64 effect
sizes). Studies varied enormously in this respect, with
rating scales ranging between 2 and 100 points. As shown in Table
2, however, the effect of the number of points on the rating scale
was not statistically significant.
Judgment specificity. Model 6 examined the effect of the
specificity of the judgment task (70 effect sizes). As the largest
group, ratings (86.8%) were chosen as the reference category.
Ratings were followed by estimations of the number of correct
responses (9.2%) and rankings (3.9%). As shown in Table 2, none
of the effects were statistically significant.
Norm-referenced versus peer-independent judgments.
Model 7 examined whether the use of a peer-dependent or peer-independent rating scale had an effect on teacher judgment accuracy (63 effect sizes). Of the 66 effect sizes available, 61.8% relied
on peer-independent teacher judgments and 38.2% on peer-dependent teacher judgments. The effect of this factor was not
statistically significant.
Domain specificity. Model 8 assessed the influence of the
domain specificity of the judgment task on teacher judgment
accuracy. Altogether, 89 effect sizes were included in this analysis,
of which 27.4% were based on judgments of overall academic
ability, 23.2% on judgments of an academic ability in one subject,
and 49.5% on judgments of a specific academic ability within a
subject. As shown in Table 3, the effect was not statistically
significant. Therefore, there was no evidence for the hypothesis
that domain-specific teacher judgments result in higher judgment
accuracy than do global judgments.
Test characteristics. We next examined the effects of various characteristics of the tasks administered.
Subject matter. Model 9 examined the effect of subject matter. Studies reporting information relevant to this analysis are
reported in Tables 1 and S2. Again, Study Numbers 51 and 75
were excluded from the analysis, resulting in a total of 89 effect
sizes. Most studies addressed the domain of language arts (63.1%
of effect sizes), while 36.9% addressed the domain of mathematics. As Table 3 shows, the effect was not statistically significant.

4 Coding data for all other moderator variables are available from the first author upon request.

Table 2
Multilevel Meta-Analysis of Effect Sizes: Fixed Effects and Random Effects in Models 0–6

Model 0 (no moderator): intercept .63* (.03); chi-square (df = 72) = 773.44*.
Study characteristics
Model 1, publication yr (Z score): intercept .63* (.03); publication yr -.01 (.02)/.683; τ = .04; chi-square (df = 71) = 762.97*.
Model 2, country (RC: United States; Australia, Canada, Europe, Other): intercept .61* (.03); country coefficients .08 (.11)/.470, .00 (.12)/.998, .09 (.07)/.168, and -.10 (.12)/.435; τ = .04; chi-square (df = 66) = 748.80*.
Model 3, overall aim of the study (RC: validation of teachers' judgments; validation of a test of academic achievement, correlation between teachers' judgments and students' test performance): intercept .61* (.04); aim coefficients .02 (.06)/.697 and .08 (.07)/.302; τ = .07; chi-square (df = 70) = 748.04*.
Judgment characteristics
Model 4, informed vs. uninformed judgment: intercept .61* (.03); i vs. u judgment .15 (.07)/.045; τ = .03; chi-square (df = 72) = 844.28*.
Model 5, points on rating scale: intercept .62* (.03); points on rating scale .00 (.00)/.805; τ = .04; chi-square (df = 63) = 675.33*.
Model 6, judgment specificity (RC: rating; ranking, grade equivalents, no. of correct responses): intercept .63* (.03); specificity coefficients .04 (.12)/.747, -.13 (.22)/.535, and .11 (.09)/.212; τ = .04; chi-square (df = 71) = 803.67*.
Note. Unless otherwise noted, values in parentheses are standard errors. Exact p values are reported behind the slash. i = informed judgments; u = uninformed judgments; RC = reference category.
* p < .001.

Table 3
Multilevel Meta-Analysis of Effect Sizes: Fixed Effects and Random Effects in Models 7–13

Judgment characteristics
Model 7, norm-referenced vs. peer-independent judgments: intercept .66* (.03); peer dependency -.05 (.05)/.340; τ = .04; chi-square (df = 64) = 729.97*.
Model 8, domain specificity of the judgment (RC: overall academic achievement; ability in one subject, specific ability within subject): intercept .61* (.04); specificity coefficients -.05 (.06)/.425 and .06 (.05)/.307; τ = .04; chi-square (df = 90) = 957.97*.
Test characteristics
Model 9, subject (RC: language arts): intercept .63* (.03); mathematics -.03 (.04)/.429; τ = .03; chi-square (df = 99) = 1041.89*.
Model 10, CBM vs. standardized achievement tests: intercept .64* (.03); CBM vs. standardized achievement tests -.04 (.09)/.627; τ = .04; chi-square (df = 70) = 809.00*.
Model 11, domain specificity of the test (RC: overall academic achievement; ability in one subject, specific ability within subject): intercept .72* (.06); specificity coefficients -.12 (.06)/.114 and -.08 (.08)/.247; τ = .04; chi-square (df = 93) = 859.66*.
Correspondence between judgment and test characteristics
Model 12, time gap (RC: same time; test before rating, test after rating): intercept .66* (.03); time gap coefficients -.07 (.08)/.343 and -.05 (.10)/.589; τ = .04; chi-square (df = 57) = 627.20*.
Model 13, congruence: intercept .67* (.03); congruence -.13 (.05)/.009; τ = .04; chi-square (df = 89) = 921.75*.
Note. Unless otherwise noted, values in parentheses are standard errors. Exact p values are reported behind the slash. CBM = curriculum-based measures; RC = reference category.
* p < .001.
CBM procedures versus standardized achievement tests.
Next, we were interested in whether the size of the effects was
influenced by the testing procedure used (CBM vs standardized
achievement tests; Model 10). Overall, 87.7% of the 73 relevant
effect sizes relied on standardized achievement tests and 12.3% on
a CBM procedure. Four effect sizes were excluded from the
analysis as their variance was zero, resulting in a total of 69 effect
sizes. The effect of testing procedure was not statistically significant.
Domain specificity. Finally, the effect of the domain specificity of the achievement test was analyzed (Model 11). Of the 103
effect sizes included in this analysis, 12.2% were based on tests
covering different subjects, 25.5% on tests covering a single subject, and 62.3% on tests covering a specific ability within a subject.
The effect of the domain specificity of the achievement test was
not statistically significant.
Correspondence between judgment and test characteristics.
Models 12 and 13 examined two aspects of the correspondence
between judgment characteristics and test characteristics: time gap
and congruence in domain specificity.
Time gap. Model 12 tested the effect of the time interval
between the administration of the achievement test and the teacher
rating. In most studies, the two measures were administered concurrently (73.3%), so “same time” was chosen as the reference
category. The achievement test was administered before the rating
task for 18.3% of the 61 effect sizes included in the analysis and
after the rating task for 8.3%. The effects were not statistically
significant.
Congruence in domain specificity. According to our coding
procedure, the domain specificity of the achievement test and the
teacher rating was congruent for 67.6% of the effect sizes reported
in this analysis and incongruent for 32.4%. A total of 93 effect
sizes were included in this analysis (Tables 1 and S3). The effect
of congruence was tested in Model 13, revealing a significant
negative effect. As expected, larger effect sizes were observed for
studies in which the domain specificity of the achievement test and
the rating task was congruent (.67) than for studies in which it was
not (.54).
Overall, the highly significant chi-square tests for all models
indicate a substantial heterogeneity of variances. Beyond sampling
variation, there is variation in the effect sizes across studies that
could not be explained by the explanatory predictor variables used
in our models.
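The chi-square statistics reported in Tables 2 and 3 test exactly this kind of residual variation. As a rough illustration of the underlying logic, the sketch below computes Cochran's Q, the weighted sum of squared deviations of the study effects from their pooled value, which is evaluated against a chi-square distribution with k - 1 degrees of freedom. It is a simplified fixed-weight version, not the multilevel specification used in our models, and the input values are hypothetical.

```python
def heterogeneity_q(z_values, variances):
    """Cochran's Q: weighted squared deviations of study effects from the pooled mean.

    A Q value that is large relative to a chi-square distribution with
    len(z_values) - 1 degrees of freedom indicates between-study variation
    beyond what sampling error alone would produce.
    """
    weights = [1.0 / v for v in variances]
    z_bar = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, z_values))
    return q, len(z_values) - 1


# Hypothetical Fisher z effect sizes and variances for five studies.
q, df = heterogeneity_q([0.29, 0.49, 0.61, 0.79, 0.93],
                        [0.017, 0.008, 0.001, 0.001, 0.003])
print(q, df)
```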
Discussion
In this article, we statistically summarized empirical research
findings on teacher judgment accuracy in a meta-analysis. In
addition, we examined the role played by theoretically and methodologically relevant moderators in explaining the variation in
findings across studies. In this section, we discuss the main findings and introduce a heuristic model of teacher judgment that
brings together findings on teacher judgment accuracy.
The results of our meta-analysis indicate that teachers’ judgment
accuracy— defined as the correlation between teachers’ judgments
of students’ academic achievement and students’ actual test performance—is positive and fairly high (.63). Nevertheless, this
result shows that teacher judgments are far from perfect and that
there is plenty of room for improvement. This result is in line with
the findings of Hoge and Coladarci (1989), who reported a median
correlation of .66. However, the median correlation in the present
meta-analysis was .53, showing that the results produced using
Hoge and Coladarci’s rather descriptive methods varied substantially from those generated by current meta-analytical methods.
Thus, our meta-analysis helps to summarize and clarify empirical
results on teacher judgment accuracy using adequate empirical
methods.
Our meta-analysis revealed substantial variation in effect sizes
across studies. Two important moderators of teacher judgment
accuracy were identified: one judgment characteristic and one
characteristic based on the interaction of judgment and test characteristics. In the following, we discuss in detail the effects of (a)
informed versus uninformed judgments and (b) the congruence in
the domain specificity of teacher judgments and student achievement tests.
First, we found significantly higher correlations between teachers’ judgments and students’ test performance for informed than
for uninformed teacher judgments. We chose to differentiate between these two categories, rather than between direct and indirect
judgments following Hoge and Coladarci (1989), because the latter
distinction was confounded by judgment specificity. Although the
difference between direct/indirect versus informed/uninformed
judgments seems small, we think it is important to make this
distinction. Indeed, the results of this meta-analysis indicate that
informed judgments result in higher judgment accuracy than do
uninformed judgments. Considerably more studies used uninformed judgments than informed judgments. Surprisingly, the use
of informed versus uninformed teacher judgments is barely discussed in studies on teacher judgment accuracy, although it evidently can have a substantial influence on the size of the correlation between teachers’ judgments and students’ achievement. As
was to be expected, it seems easier for teachers to judge students’
performance when they are informed about the standard of comparison than when they are not. No effects of the other judgment
characteristics (i.e., number of points on rating scales, judgment
specificity, norm-referenced vs peer-independent judgments) were
found. Nevertheless, we would recommend carefully considering
these aspects when conducting studies on teacher judgment accuracy.
In terms of test characteristics, we found no evidence for a
difference in teacher judgment accuracy between language arts and
mathematics. The effects of the other test characteristics were not
significant either. Therefore, results are generalizable across several types of judgments and tests. Regarding the testing procedure,
we distinguished between CBM procedures and standardized
achievement tests, but we found no significant effect on teacher
judgment accuracy. Given the variety of tests used in the different
studies, the categorization of CBM procedures versus standardized
tests was rather broad. The achievement test category was particularly broad, including both outdated and up-to-date tests. Very
little information was available on some of the tests (e.g., psychometric properties). We therefore advocate a more thorough description of the tests used in studies on teacher judgment accuracy.
As expected, the congruence between the teachers’ rating task
and the achievement test administered to students was related to
teacher judgment accuracy, with higher congruence being associated
with higher accuracy levels. Because the match between
teachers’ judgments and students’ test performance was higher
when both measures addressed the same domain and same ability
within a domain, it is reasonable to assume that a “mismatch” leads
to lower teacher judgment accuracy. Surprisingly, this factor is
rarely discussed in studies on teacher judgment accuracy.
Unfortunately, very little information was reported on the
teacher samples. We had planned to study the effects of teachers’
years of teaching experience, years of exposure to the students
rated, age, and gender, but were unable to conduct these analyses
for lack of data. It would also be interesting to study whether other
teacher characteristics affect teacher judgment accuracy. For example, teacher judgment accuracy might be associated with teachers’ cognitive abilities or memory capacity or—in terms of teaching skills—with their instructional quality or expert knowledge.
Hauser-Cram, Sirin, and Stipek (2003) used a classroom observation procedure to assess teachers’ teaching styles. Teachers were
identified as student-centered if they adapted well to students’
individual needs (e.g., encouraged children to communicate and
elaborate on their thoughts). In contrast, teachers were identified as
curriculum-centered if they applied a uniform approach dictated by
the curriculum (e.g., gave children few opportunities to take responsibility or to choose activities). Additionally, teachers’ perceived differences with parents regarding education-related values
were measured. As expected, perceived teacher–parent differences
had greater effects on teacher ratings of students’ academic
achievement in more curriculum-centered classrooms. In another
study, Kuklinski and Weinstein (2001) assessed teachers’ differential treatment of low- and high-achieving students in the classroom. Their results showed that teachers’ differential treatment as
perceived by their students was a significant moderator of teacher
expectations. Clearly, there is a need for studies assessing how
other teacher characteristics relate to teacher judgment accuracy.
The student characteristics we would have liked to consider in
our meta-analysis included gender, age, and grade level. Unfortunately, information on these characteristics was scarce and, for the
most part, not comparable across studies. For example, only the
percentage distribution for gender and the mean age of the student
sample were reported. More consistency in the information reported and more specific information would facilitate comparison
across studies. Most of the studies included in the meta-analysis
involved samples of kindergarten and elementary school children.
More studies focusing on older children and higher grade levels
would therefore be desirable.
The lack of data on teacher and student characteristics makes it
almost impossible to study the effects of the correspondence
between teacher and student characteristics. None of the studies
included in this meta-analysis reported comparable information on
teacher and student characteristics (e.g., on teachers’ and students’
gender). Nevertheless, it seems reasonable to consider the correspondence between the two variables when studying teacher judgment accuracy. For example, it might be hypothesized that female
teachers provide more accurate predictions of girls’ than of boys’
performance.
In the study characteristics category, we further distinguished
between studies investigating teacher judgment accuracy and those
focusing on other research questions but providing measures of
teachers’ judgments and students’ academic achievement. Our
analysis did not reveal any differences in teacher judgment accuracy between studies of these two types. Nevertheless, it is important to bear the distinction between these study types in mind,
especially when interpreting results with regard to teacher judgment accuracy.
A Model of Teacher Judgment Accuracy
In order to systematize the moderators of teacher judgment
accuracy in a more structured form, we provide a model of teacher
judgment accuracy based on our theoretical considerations and
empirical findings. Teacher judgment accuracy is at the core of this
model, which is shown in Figure 1. It represents the correspondence between teachers’ judgments of students’ academic achievement and students’ actual achievement as measured by a standardized test. In most studies, the correlation between the two is used
as a measure of this correspondence. However, other indicators,
such as the average difference between teacher judgments and
students’ actual performance, can also be used.
Figure 1. A model of teacher-based judgments of students' academic achievement.
A student's test performance is the result he or she achieves on
an academic achievement test. On the one hand, this result may
depend on student characteristics such as prior knowledge, motivation,
and intelligence. On the other hand, it may depend on test
characteristics such as subject area, the specific task set, or task
difficulty. In this meta-analysis, the moderating effects of three test
characteristics were studied, but none of them showed a significant
effect on teacher judgment accuracy. As mentioned previously, the
influence of student characteristics could not be studied because
the relevant data were not consistently reported.
A teacher’s judgment may depend on teacher characteristics
such as professional expertise or stereotypes about students or on
judgment characteristics (e.g., whether the teacher is asked to
judge a specific student competency, such as oral reading fluency,
or to provide a global judgment of academic ability). As described
earlier, our analyses revealed that whether teacher judgments were
informed or uninformed significantly influenced their judgment
accuracy.
According to our model, teacher judgment accuracy is also
influenced by the correspondence between judgment characteristics and test characteristics (dashed line, Figure 1). For example, an
achievement test may measure a very specific academic ability
(e.g., arithmetic skills), whereas the focus of the teachers’ judgment task is broader (e.g., rating students’ overall ability in mathematics), making it more difficult for teachers to provide accurate
judgments. Indeed, in this meta-analysis, we found evidence that a
high level of congruence between the specificity of teachers’
judgments and the specificity of the achievement measure used
leads to high accuracy of teacher judgments. Another relationship
that may influence teacher judgment accuracy is the correspondence between teacher characteristics and student characteristics
(e.g., gender, ethnicity).
As depicted in our model, the correspondence between judgment characteristics and test characteristics is assumed to influence
teacher judgment accuracy. However, as data on some elements of
the model are scarce, the model is in parts highly speculative.
Strengths, Limitations, and Directions for Future
Research
This meta-analysis used a sophisticated multilevel approach to
provide a comprehensive overview of research on teacher judgment accuracy published in the past 20 years. In addition, it
informed a heuristic model of teacher judgment accuracy that can
be used to describe and analyze moderators of teacher judgment
accuracy.
Although some moderators proved to significantly influence the
correlation between teachers’ judgments and students’ test performance, the chi-square test was significant across all models, indicating that there remains variation in the effect sizes across studies
that could not be explained by the moderators under investigation.
One reason for this may be that little information was available on
some potential moderators, especially on the characteristics of the
teacher samples. Future studies should therefore report more information on the teacher sample investigated, making it possible to
analyze the relationship between teacher characteristics and
teacher judgment accuracy in much more detail.
As mentioned previously, most studies under analysis used the
correlation between teachers’ judgments and students’ test performance as a measure of teacher judgment accuracy. We therefore
chose correlation coefficients as the unit of analysis. However, the
correlation coefficient basically indicates whether teachers are able
to put their students in a rank order. Accordingly, high correlations
can also be attained if teachers systematically over- or underestimate their students’ performance (Eckert et al., 2006; Feinberg &
Shapiro, 2003; Graney, 2008).
Indeed, the findings of studies in which indicators other than
correlations were used as measures of teacher judgment accuracy
suggest that teacher judgments are rather inaccurate. In a study by
Eckert et al. (2006), CBM material was used as an indicator of
students’ mathematics and reading skills. Teachers were asked to
estimate students’ reading and mathematics level (mastery, instructional, or frustrational). This judgment was compared with
students’ actual reading and mathematics level as measured by the
CBM material via percentage agreement. The results indicated that
teachers overestimated students’ performance across most mathematics skills and on reading material that was at or below grade
level. Bates and Nettelbeck (2001) subtracted students’ reading
accuracy and reading comprehension scores on a standardized
achievement test from teachers’ predictions of these scores. Teachers generally overestimated the performance of the 6- to 8-year-old
students; inspection of the difference scores revealed that this held
to a greater extent for low-achieving readers than for average- and
high-achieving students. In line with this result, Begeny et al.
(2008) found that teachers’ judgments of students with average to
low oral reading fluency scores were rather inaccurate, and Feinberg and Shapiro (2003) reported that teachers generally overestimated the performance of low-achieving readers.
Cronbach (1955) was able to disentangle some of the effects of
human judgments by splitting a general accuracy score (squared
errors in judgments over all items) into three distinct components.
Building on his work, Helmke and Schrader (1987) proposed three
components of teacher judgment accuracy: a rank component
(rank correlation), a level component indicating over- or underestimation of students’ performance, and a differentiation component
indicating whether the variance of students’ performances was
accurately assessed by teachers. Although the correlation coefficient remains popular as an indicator of teacher judgment accuracy
(e.g., Anders et al., 2010; Ready & Wright, 2011), some studies—
especially in the literature on RTI models (Begeny et al., 2011;
Eckert et al., 2006)— have used percentage agreement between
teachers’ judgments and students’ level of academic ability (e.g., at
risk, some risk, low risk) as an indicator of teacher judgment
accuracy. The problem here is that the categories need to be clearly
defined and familiar to the teacher. Additionally, some information
is lost by categorizing students’ academic achievement into different subgroups. Moreover, it is not as easy to compare percentage agreement across studies as it is to compare correlation coefficients, as the measure depends on the number of categories used.
Nevertheless, percentage agreement offers valuable information
about teachers’ ability to detect children with a need for additional
support, which is a goal of RTI models. Which measures of teacher
judgments can or should be used also heavily depends on the
original data available. For example, Karing, Matthäi, and Artelt
(2011) asked teachers to predict students’ individual responses to
each item of a test. This approach allowed the authors to calculate
a “hit rate” delivering very detailed information on the teachers’
judgment accuracy. In contrast, Bates and Nettelbeck (2001) calculated the difference between teachers’ judgments and students’
academic achievement in order to identify over- and underestimations of students’ academic achievement. In our opinion, different
measures should be applied in the analysis of teacher judgment
accuracy, depending on the focus of the study. Although the
potential of correlations as a measure of teacher judgment accuracy is limited in the ways previously described, they nevertheless
offer valuable information and are easily interpretable.
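To make these alternative indicators concrete, the sketch below computes a rank component (rank correlation), a level component (mean signed difference, with positive values indicating overestimation), and a differentiation component (the ratio of the judged to the actual spread) from paired teacher judgments and student test scores, in the spirit of the decomposition described above. The data and function names are hypothetical, and ties among ranks are ignored for simplicity.

```python
from statistics import mean, pstdev


def _ranks(values):
    """Rank positions (1 = lowest); ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks


def rank_component(judgments, scores):
    """Rank correlation between teacher judgments and students' test scores."""
    rj, rs = _ranks(judgments), _ranks(scores)
    mj, ms = mean(rj), mean(rs)
    cov = mean([(a - mj) * (b - ms) for a, b in zip(rj, rs)])
    return cov / (pstdev(rj) * pstdev(rs))


def level_component(judgments, scores):
    """Mean signed difference; positive values indicate overestimation."""
    return mean(j - s for j, s in zip(judgments, scores))


def differentiation_component(judgments, scores):
    """Ratio of judged to actual spread; 1.0 means the dispersion is reproduced."""
    return pstdev(judgments) / pstdev(scores)


# Hypothetical judgments and test scores for five students on the same scale.
judged = [3, 5, 2, 1, 4]
actual = [2, 5, 1, 3, 4]
print(rank_component(judged, actual),
      level_component(judged, actual),
      differentiation_component(judged, actual))
```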
In summary, this meta-analysis has important theoretical and
methodological implications for research on teacher judgment
accuracy. It highlights the various methodological aspects that
need to be considered in studies examining the accuracy of teacher
judgments. The differentiation among teacher characteristics, student characteristics, judgment characteristics, and test characteristics was fruitful in this analysis, as these factors proved relevant to judgment accuracy. In particular, our results showed that
judgment and task characteristics influenced the correlation between teacher judgments and students’ academic achievement.
Additionally, we found empirical evidence that the level of congruence in the domain specificity of the teachers’ rating task, on
the one hand, and the achievement tests administered, on the other,
influenced teacher judgment accuracy. Our meta-analysis also
showed where further research is necessary. From the theoretical
perspective, we proposed a model of teacher-based judgments of
students’ academic achievement, bringing together teacher characteristics, judgment characteristics, student characteristics, and
task characteristics as factors with theoretical relevance for
teacher judgment accuracy.
References
References marked with an asterisk indicate studies included in the
meta-analysis.
Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later
student academic achievement. Journal of Educational Psychology, 91,
731–746. doi:10.1037/0022-0663.91.4.731
American Federation of Teachers, the National Council on Measurement in
Education, and the National Education Association. (1990). Standards
for teacher competence in educational assessment of students. Retrieved
from http://www.unl.edu/buros/bimm/html/article3.html
*Anders, Y., Kunter, M., Brunner, M., Krauss, S., & Baumert, J. (2010).
Diagnostische Fähigkeiten von Mathematiklehrkräften und ihre Auswirkungen auf die Leistungen ihrer Schülerinnen und Schüler [Mathematics teachers’ diagnostic skills and their impact on students’ achievements]. Psychologie in Erziehung und Unterricht, 57, 175–193.
*Bailey, A. L., & Drummond, K. V. (2006). Who is at risk and why?
Teachers’ reasons for concern and their understanding and assessment of
early literacy. Educational Assessment, 11, 149 –178. doi:10.1207/
s15326977ea1103&4_2
Balliet, D., Mulder, L. D., & van Lange, P. A. M. (2011). Reward,
punishment, and cooperation: A meta-analysis. Psychological Bulletin,
137, 594 – 615. doi:10.1037/a0023489
*Bang, H. J., Suarez-Orozco, C., Pakes, J., & O’Connor, E. (2009). The
importance of homework in determining immigrant students’ grades in
schools in the USA context. Educational Research, 51, 1–25. doi:
10.1080/00131880802704624
*Bates, C., & Nettelbeck, T. (2001). Primary school teachers’ judgements
of reading achievement. Educational Psychology, 21, 177–187. doi:
10.1080/01443410020043878
*Begeny, J. C., Eckert, T. L., Montarello, S. A., & Storie, M. S. (2008).
Teachers’ perceptions of students’ reading abilities: An examination of
the relationship between teachers’ judgments and students’ performance
across a continuum of rating methods. School Psychology Quarterly, 23,
43–55. doi:10.1037/1045-3830.23.1.43
Begeny, J. C., Krouse, H. E., Brown, K. G., & Mann, C. M. (2011).
Teacher judgments of students’ reading abilities across a continuum of
rating methods and achievement measures. School Psychology Review,
40, 23–38. doi:10.1037/1045-3830.23.1.43
*Benner, A. D., & Mistry, R. S. (2007). Congruence of mother and teacher
educational expectations and low-income youth’s academic competence.
Journal of Educational Psychology, 99, 140–153. doi:10.1037/0022-0663.99.1.140
Bennett, R. E., Gottesman, R. L., Cerullo, F. M., & Rock, D. A. (1991).
The validity of Einstein assessment subtest scores as predictors of early
school achievement. Journal of Psychoeducational Assessment, 9, 67–
79. doi:10.1177/073428299100900107
Bennett, R. E., Gottesman, R. L., Rock, D. A., & Cerullo, F. (1993).
Influence of behavior perceptions and gender on teachers’ judgments of
students’ academic skill. Journal of Educational Psychology, 85, 347–
356. doi:10.1037/0022-0663.85.2.347
*Beswick, J. F., Willms, J. D., & Sloat, E. A. (2005). A comparative study
of teacher ratings of emergent literacy skills and student performance on
a standardized measure. Education, 126, 116 –137.
Borenstein, M. (2009). Effect sizes for continuous data. In H. Cooper, L. V.
Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis
and meta-analysis (pp. 221–235). New York, NY: Russell Sage Foundation.
Brophy, J., & Good, T. (1986). Teacher behavior and student achievement.
In M. C. Wittrock (Ed.), Third handbook of research on teaching (pp.
328 –375). New York, NY: McMillan.
*Burns, M. K., & Symington, T. (2003). A comparison of the spontaneous
writing quotient of the Test of Written Language (3rd ed.) and teacher
ratings of writing progress. Assessment for Effective Intervention, 28,
29 –34. doi:10.1177/073724770302800203
Cafri, G., Komrey, J. D., & Brannick, M. T. (2010). A meta-meta-analysis:
Empirical review of statistical power, type I error rates, effect sizes, and
model selection of meta-analyses published in psychology. Multivariate
Behavioral Research, 45, 239 –270. doi:10.1080/00273171003680187
Chang, D. F., & Sue, S. (2003). The effects of race and problem type on
teachers’ assessments of student behavior. Journal of Consulting and
Clinical Psychology, 71, 235–242. doi:10.1037/0022-006X.71.2.235
Clark, C. M., & Peterson, P. L. (1986). Teachers’ thought processes. In
M. C. Wittrock (Ed.), Third handbook of research on teaching (pp.
255–296). New York, NY: Macmillan.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37– 46. doi:10.1177/
001316446002000104
Coladarci, T. (1986). Accuracy of teacher judgments of student responses
to standardized test items. Journal of Educational Psychology, 78,
141–146. doi:10.1037/0022-0663.78.2.141
Cronbach, L. J. (1955). Processes affecting scores on “understanding of
others” and “assumed similarity.” Psychological Bulletin, 52, 177–193.
doi:10.1037/h0044919
*Dale, P. S., Harlaar, N., & Plomin, R. (2005). Telephone testing and
teacher assessment of reading skills in 7-year-olds: I. Substantial correspondence for a sample of 5,544 children and for extremes. Reading and
Writing, 18, 385– 400. doi:10.1007/s11145-004-8130-z
de Boer, H., Bosker, R. J., & van der Werf, M. P. C. (2010). Sustainability
of teacher expectation bias effects on long-term student performance.
Journal of Educational Psychology, 102, 168 –179. doi:10.1037/
a0017289
*Demaray, M. K., & Elliott, S. N. (1998). Teachers’ judgments of students’
academic functioning: A comparison of actual and predicted performances. School Psychology Quarterly, 13, 8 –24. doi:10.1037/h0088969
Deno, S. L. (2003). Curriculum-based measures: Development and perspectives. Assessment for Effective Intervention, 28, 3–12.
*DiPerna, J. C., & Elliott, S. N. (1999). Development and validation of the
Academic Competence Evaluation Scales. Journal of Psychoeducational Assessment, 17, 207–225. doi:10.1177/073428299901700302
ACCURACY OF TEACHERS’ JUDGMENTS
*Dompnier, B., Pansu, P., & Bressoux, P. (2006). An integrative model of
scholastic judgments: Pupils’ characteristics, class context, halo effect
and internal attributions. European Journal of Psychology of Education,
21, 119 –133. doi:10.1007/BF03173572
*DuPaul, G. J., Rapport, M. D., & Perriello, L. M. (1991). Teacher ratings
of academic skills: The development of the Academic Performance
Rating Scale. School Psychology Review, 20, 284 –300.
*Eaves, R. C., Campbell-Whatley, G., Dunn, C., Reilly, A. S., & TateBraxton, C. (1994). Comparison of the Slosson Full-Range Intelligence
Test and teacher judgments as predictors of students’ academic achievement. Journal of Psychoeducational Assessment, 12, 381–392. doi:
10.1177/073428299401200408
*Eaves, R. C., Williams, P., Winchester, K., & Darch, C. (1994). Using
teacher judgment and IQ to estimate reading and mathematics achievement in a remedial-reading program. Psychology in the Schools, 31,
261–272.
doi:10.1002/1520-6807(199410)31:4<261::AID-PITS2310310403>3.0.CO;2-K
*Eckert, T. L., Dunn, E. K., Codding, R. S., Begeny, J. C., & Kleinmann,
A. E. (2006). Assessment of mathematics and reading performance: An
examination of the correspondence between direct assessment of student
performance and teacher report. Psychology in the Schools, 43, 247–265.
doi:10.1002/pits.20147
*Elliott, J., Lee, S. W., & Tollefson, N. (2001). A reliability and validity
study of the Dynamic Indicators of Basic Early Literacy Skills–
Modified. School Psychology Review, 30, 33– 49.
*Eshel, Y., & Benski, M. (1995). Group-administered school readiness test
and kindergarten teacher ratings as predictors of academic success in the
first grade. Megamot, 36, 451– 464.
*Espin, C., Shin, J., Deno, S. L., Skare, S., Robinson, S., & Benner, B.
(2000). Identifying indicators of written expression proficiency for middle school students. The Journal of Special Education, 34, 140 –153.
doi:10.1177/002246690003400303
*Feinberg, A. B., & Shapiro, E. S. (2003). Accuracy of teacher judgments
in predicting oral reading fluency. School Psychology Quarterly, 18,
52– 65. doi:10.1521/scpq.18.1.52.20876
*Feinberg, A. B., & Shapiro, E. S. (2009). Teacher accuracy: An examination of teacher-based judgments of students’ reading with differing
achievement levels. Journal of Educational Research, 102, 453– 462.
doi:10.3200/JOER.102.6.453-462
Ferguson, C. J., & Brannick, M. T. (2011). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and
implications for the use of meta-analyses. Psychological Methods. Advance online publication. doi:10.1037/a0024445
Fischer, R., & Boer, D. (2011). What is more important for national
well-being: Money or autonomy? A meta-analysis of well-being, burnout, and anxiety across 63 societies. Journal of Personality and Social
Psychology, 101, 164 –184. doi:10.1037/a0023663
*Fletcher, J., Tannock, R., & Bishop, D. V. M. (2001). Utility of brief
teacher rating scales to identify children with educational problems:
Experience with an Australian sample. Australian Journal of Psychology, 53, 63–71. doi:10.1080/00049530108255125
*Flynn, J. M., & Rahbar, M. H. (1998). Improving teacher prediction of
children at risk for reading failure. Psychology in the Schools, 35,
163–172. doi:10.1002/(SICI)1520-6807(199804)35:2<163::AID-PITS8>3.0.CO;2-Q
*Freeman, J. G. (1993). Two factors contributing to elementary school
teachers’ predictions of students’ scores on the Gates–MacGinitie Reading Test, Level D. Perceptual and Motor Skills, 76, 536 –538. doi:
10.2466/pms.1993.76.2.536
*Gallant, D. J. (2009). Predictive validity evidence for an assessment
program based on the Work Sampling System in mathematics and
language and literacy. Early Childhood Research Quarterly, 24, 133–
141. doi:10.1016/j.ecresq.2009.03.003
*Gottesman, R. L., Cerullo, F. M., Bennett, R. E., & Rock, D. A. (1991).
759
Predictive validity of a screening test for mild school learning difficulties. Journal of School Psychology, 29, 191–205. doi:10.1016/0022-4405(91)90001-8
*Graney, S. B. (2008). General education teacher judgments of their
low-performing students’ short-term reading progress. Psychology in the
Schools, 45, 537–549. doi:10.1002/pits.20322
*Graue, M. E., & Shepard, L. A. (1989). Predictive validity of the Gesell
School Readiness Tests. Early Childhood Research Quarterly, 4, 303–
315. doi:10.1016/0885-2006(89)90016-1
*Gresham, F. M., MacMillan, D. L., & Bocian, K. M. (1997). Teachers as
“tests”: Differential validity of teacher judgments in identifying students
at-risk for learning difficulties. School Psychology Review, 26, 47– 60.
*Gullo, D. F. (1990). Kindergarten schedules: Effects on teachers’ ability
to assess academic achievement. Early Childhood Research Quarterly,
5, 43–51. doi:10.1016/0885-2006(90)90005-L
Hamilton, C., & Shinn, M. R. (2003). Characteristics of word callers: An
investigation of the accuracy of teachers’ judgments of reading comprehension and oral reading skills. School Psychology Review, 32, 228 –
240.
Harlen, W. (2005). Trusting teachers’ judgment: Research evidence of the
reliability and validity of teachers’ assessment used for summative
purposes. Research Papers in Education, 20, 245–270. doi:10.1080/
02671520500193744
*Hartman, J. M., & Fuller, M. L. (1997). The development of curriculum-based measurement norms in literature-based classrooms. Journal of
School Psychology, 35, 377–389. doi:10.1016/S0022-4405(97)00013-7
*Hauser-Cram, P., Sirin, S. R., & Stipek, D. (2003). When teachers’ and
parents’ values differ: Teachers’ ratings of academic competence in
children from low-income families. Journal of Educational Psychology,
95, 813– 820. doi:10.1037/0022-0663.95.4.813
Hecht, S. A., & Greenfield, D. B. (2001). Comparing the predictive validity
of first grade teacher ratings and reading-related tests on third grade
levels of reading skills in young children exposed to poverty. School
Psychology Review, 30, 50 – 69.
*Hecht, S. A., & Greenfield, D. B. (2002). Explaining the predictive
accuracy of teacher judgments of their students’ reading achievement:
The role of gender, classroom behavior, and emergent literacy skills in
a longitudinal sample of children exposed to poverty. Reading and
Writing, 15, 789 – 809. doi:10.1023/A:1020985701556
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis.
Orlando, FL: Academic Press.
Helmke, A., & Schrader, F.-W. (1987). Interactional effects of instructional
quality and teacher judgment accuracy on achievement. Teaching and
Teacher Education, 3, 91–98. doi:10.1016/0742-051X(87)90010-2
*Helwig, R., Anderson, L., & Tindal, G. (2001). Influence of elementary
student gender on teachers’ perceptions of mathematics achievement.
Journal of Educational Research, 95, 93–102. doi:10.1080/
00220670109596577
*Herbert, J., & Stipek, D. (2005). The emergence of gender differences in
children’s perceptions of their academic competence. Journal of Applied
Developmental Psychology, 26, 276 –295. doi:10.1016/j.appdev
.2005.02.007
*Hinnant, J. B., O’Brien, M., & Ghazarian, S. R. (2009). The longitudinal
relations of teacher expectations to achievement in the early school year.
Journal of Educational Psychology, 101, 662– 670. doi:10.1037/
a0014306
*Hodges, C. A. (1997). How valid and useful are alternative assessments
for decision-making in primary grade classrooms? Reading Research
and Instruction, 36, 157–173. doi:10.1080/19388079709558235
Hoge, R. D. (1983). Psychometric properties of teacher-judgment measures
of pupil aptitudes, classroom behaviors, and achievement levels. The
Journal of Special Education, 17, 401– 429. doi:10.1177/
002246698301700404
Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of aca-
760
SÜDKAMP, KAISER, AND MÖLLER
demic achievement: A review of literature. Review of Educational Research, 59, 297–313. doi:10.2307/1170184
Hopkins, K. D., George, C. A., & Williams, D. D. (1985). The concurrent
validity of standardized achievement tests by content area using teachers’ ratings as criteria. Journal of Educational Measurement, 22, 177–
182. doi:10.1111/j.1745-3984.1985.tb01056.x
Hox, J. (2002). Multilevel analysis. Mahwah, NJ: Erlbaum.
*Hughes, J. N., Gleason, K. A., & Zhang, D. A. (2005). Relationship
influences on teachers’ perceptions of academic competence in academically at-risk minority and majority first grade students. Journal of
School Psychology, 43, 303–320. doi:10.1016/j.jsp.2005.07.001
Hurwitz, J. T., Elliott, S. N., & Braden, J. P. (2007). The influence of test
familiarity and student disability status upon teachers’ judgments of
students’ test performance. School Psychology Quarterly, 22, 115–144.
doi:10.1037/1045-3830.22.2.115
Impara, J. C., & Plake, B. S. (1998). Teachers’ ability to estimate item
difficulty: A test of assumptions in the Angoff standard setting method.
Journal of Educational Measurement, 35, 69–81. doi:10.1111/j.1745-3984.1998.tb00528.x
*Jenkins, J. R., & Jewell, M. (1993). Examining the validity of two
measures for formative teaching: Reading aloud and maze. Exceptional
Children, 59, 421– 432.
Jones, M. G., & Gerig, T. M. (1994). Silent sixth-grade students: Characteristics, achievement, and teacher expectations. The Elementary School
Journal, 95, 169 –182. doi:10.1086/461797
*Jorgenson, C. B., Jorgenson, D. E., Gillis, M. K., & McCall, C. M. (1993).
Validation of a screening instrument for young children with teacher
assessment of school performance. School Psychology Quarterly, 8,
125–139. doi:10.1037/h0088834
Jussim, L., & Eccles, J. S. (1992). Teacher expectations II: Construction
and reflection of student achievement. Journal of Personality and Social
Psychology, 63, 947–961. doi:10.1037/0022-3514.63.6.947
Kalaian, S. A., & Kasim, R. M. (2008). Multilevel methods for meta-analysis. In A. A. O'Connell & D. B. McCoach (Eds.), Multilevel
modeling of educational data (pp. 315–343). Charlotte, NC: Information
Age Publishing.
*Karing, C. (2009). Diagnostische Kompetenz von Grundschul- und Gymnasiallehrkräften im Leistungsbereich und im Bereich Interessen [Diagnostic competence of elementary and secondary school teachers in the
domains of competence and interests]. Zeitschrift für Pädagogische
Psychologie/German Journal of Educational Psychology, 23, 197–209.
doi:10.1024/1010-0652.23.34.197
Karing, C., Matthäi, J., & Artelt, C. (2011). Genauigkeit von Lehrerurteilen
über die Lesekompetenz ihrer Schülerinnen und Schüler in der
Sekundarstufe I: Eine Frage der Spezifität? [Lower secondary school
teacher judgment accuracy of students’ reading competence: A matter of
specificity?]. Zeitschrift für Pädagogische Psychologie, 25, 159 –172.
doi:10.1024/1010-0652/a000041
*Kenealy, P., Frude, N., & Shaw, W. (1991). Teacher expectations as
predictors of academic success. Journal of Social Psychology, 131,
305–306. doi:10.1080/00224545.1991.9713856
*Kenny, D. T., & Chekaluk, E. (1993). Early reading performance: A
comparison of teacher-based and test-based assessments. Journal of
Learning Disabilities, 26, 227–236. doi:10.1177/002221949302600403
Kleingeld, A., van Mierlo, H., & Arends, L. (2011). The effect of goal
setting on group performance: A meta-analysis. Journal of Applied
Psychology. Advance online publication. doi:10.1037/a0024315
*Klinedinst, R. E. (1991). Predicting performance achievement and retention of fifth-grade instrumental students. Journal of Research in Music
Education, 39, 225–238. doi:10.2307/3344722
*Kuklinski, M. R., & Weinstein, R. S. (2001). Classroom and developmental differences in a path model of teacher expectancy effects. Child
Development, 72, 1554 –1578. doi:10.1111/1467-8624.00365
*Kwok, D. C., & Lytton, H. (1996). Perceptions of mathematics ability
versus actual mathematics performance: Canadian and Hong Kong Chinese children. British Journal of Educational Psychology, 66, 209 –222.
doi:10.1111/j.2044-8279.1996.tb01190.x
Leinhardt, G. (1983). Novice and expert knowledge of individual student’s
achievement. Educational Psychologist, 18, 165–179. doi:10.1080/
00461528309529272
*Lembke, E. S., Foegen, A., Whittaker, T. A., & Hampton, D. (2008).
Establishing technically adequate measures of progress in early numeracy. Assessment for Effective Intervention, 33, 206 –214. doi:
10.1177/1534508407313479
Lench, H. C., Flores, S. A., & Bench, S. W. (2011). Discrete emotions
predict changes in cognition, judgment, experience, behavior, and physiology: A meta-analysis of experimental emotion elicitations. Psychological Bulletin, 137, 834 – 855. doi:10.1037/a0024244
*Li, H., Pfeiffer, S. I., Petscher, Y., Kumtepe, A. T., & Mo, G. (2008).
Validation of the Gifted Rating Scales–School Form in China. Gifted
Child Quarterly, 52, 160 –169. doi:10.1177/0016986208315802
*Limbos, M. M., & Geva, E. (2001). Accuracy of teacher assessments of
second-language students at risk for reading disability. Journal of Learning Disabilities, 34, 136 –151. doi:10.1177/002221940103400204
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand
Oaks, CA: Sage.
*Lorenz, C., & Artelt, C. (2009). Fachspezifität und Stabilität diagnostischer Kompetenz von Grundschullehrkräften in den Fächern Deutsch
und Mathematik [Domain specificity and stability of diagnostic competence among primary school teachers in the school subjects of German
and mathematics]. Zeitschrift für Pädagogische Psychologie/German
Journal of Educational Psychology, 23, 211–222. doi:10.1024/1010-0652.23.34.211
*Madelaine, A., & Wheldall, K. (2005). Identifying low-progress readers:
Comparing teacher judgment with a curriculum-based measurement
procedure. International Journal of Disability, Development, and Education, 52, 33– 42. doi:10.1080/10349120500071886
*Madon, S., Smith, A., Jussim, L., Russell, D. W., Eccles, J., Palumbo, P.,
& Walkiewicz, M. (2001). Am I as you see me or do you see me as I am?
Self-fulfilling prophecies and self-verification. Personality and Social
Psychology Bulletin, 27, 1214 –1224. doi:10.1177/0146167201279013
*Maguin, E., & Loeber, R. (1996). How well do ratings of academic
performance by mothers and their sons correspond to grades, achievement test scores, and teachers’ ratings? Journal of Behavioral Education, 6, 405– 425. doi:10.1007/BF02110514
Marsh, H. W. (1989). The effects of attending single-sex and coeducational
high schools on achievement, attitudes, and behaviors and on sex differences. Journal of Educational Psychology, 81, 70 – 85. doi:10.1037/
0022-0663.81.1.70
Marsh, H. W. (1990a). A multidimensional, hierarchical model of self-concept: Theoretical and empirical justification. Educational Psychology Review, 2, 77–172. doi:10.1007/BF01322177
Marsh, H. W. (1990b). Causal ordering of academic self-concept on
academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82, 646–656. doi:10.1037/0022-0663.82.4.646
Marsh, H. W., Bornmann, L., Mutz, R., Daniel, H-D., & O’Mara, A.
(2009). Gender effects in the peer reviews of grant proposals: A comprehensive meta-analysis comparing traditional and multilevel approaches. Review of Educational Research, 79, 1290 –1326. doi:
10.3102/0034654309334143
*Martínez, J. F., Stecher, B., & Borko, H. (2009). Classroom assessment
practices, teacher judgments, and student achievement in mathematics:
Evidence from the ECLS. Educational Assessment, 14, 78 –102. doi:
10.1080/10627190903039429
*Maunganidze, L., Ruhode, N., Shoniwa, L., Kasayira, J. M., Sodi, T., &
Nyanhongo, S. (2008). Teacher ratings and standardized test scores:
ACCURACY OF TEACHERS’ JUDGMENTS
How good for predicting achievement in students with learning support
placement? Journal of Psychology in Africa, 18, 255–258.
*McElvany, N., Schroeder, S., Hachfeld, A., Baumert, J., Richter, T.,
Schnotz, W., & Ullrich, M. (2009). Diagnostische Fähigkeiten von
Lehrkräften bei der Einschätzung von Schülerleistungen und Aufgabenschwierigkeiten bei Lernmedien mit instruktionalen Bildern [Teachers’
diagnostic skills to judge student performance and task difficulty when
learning materials include instructional pictures]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23,
223–235. doi:10.1024/1010-0652.23.34.223
*Meisels, S. J., Bickel, D. D., Nicholson, J., Xue, Y., & Atkins-Burnett, S.
(2001). Trusting teachers' judgments: A validity study of a curriculum-embedded performance assessment in kindergarten to grade 3. American
Educational Research Journal, 38, 73–95. doi:10.3102/
00028312038001073
*Methe, S. A., Hintze, J. M., & Floyd, R. G. (2008). Validation and
decision accuracy of early numeracy skill indicators. School Psychology
Review, 37, 359 –373.
*Meyer, M., Wilgosh, L., & Mueller, H. (1990). Effectiveness of teacheradministered tests and rating scales in predicting subsequent academic
performance. Alberta Journal of Educational Research, 36, 257–264.
*Miller, S. A., & Davis, T. L. (1992). Beliefs about children: A comparative study of mothers, teachers, peers, and self. Child Development, 63,
1251–1265. doi:10.2307/1131531
Möller, J., Pohlmann, B., Köller, O., & Marsh, H. W. (2009). A meta-analytic path analysis of the internal/external frame of reference model
of academic achievement and academic self-concept. Review of Educational Research, 79, 1129 –1167. doi:10.3102/0034654309337522
*Montague, M., Enders, C., & Castro, M. (2005). Academic and behavioral
outcomes for students at risk for emotional and behavioral disorders.
Behavioral Disorders, 31, 84 –94.
National Board for Professional Teaching Standards. (2010). The five core
propositions. Retrieved from http://www.nbpts.org/the_standards/
the_five_core_proposition
O’Mara, A. J., Marsh, H. W., Craven, R. G., & Debus, R. (2006). Do
self-concept interventions make a difference? A synergistic blend of
construct validation and meta-analysis. Educational Psychologist, 41,
181–206. doi:10.1207/s15326985ep4103_4
Pohlmann, B., Möller, J., & Streblow, L. (2004). Zur Fremdeinschätzung
von Schülerselbstkonzepten durch Lehrer und Mitschüler [On students’
self-concepts inferred by teachers and classmates]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 18,
157–169. doi:10.1024/1010-0652.18.34.157
*Pomplun, M. (2004). The differential predictive validity of the initial
skills analysis: Reading screening tests for K-3. Educational and Psychological Measurement, 64, 813– 827. doi:10.1177/0013164404263879
Raudenbush, S. W., Bryk, A. S., Cheong, Y., & Congdon, R. T. (2004).
HLM 6: Hierarchical linear modeling. Chicago, IL: Scientific Software
International.
Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in
teachers’ perceptions of young children’s cognitive abilities. American
Educational Research Journal, 48, 335–360. doi:10.3102/
0002831210374874
Rosenberg, M. S., Adams, D. C., & Gurevitch, J. (2000). MetaWin:
Statistical software for meta-analysis. Sunderland, MA: Sinauer.
*Saint-Laurent, L., Hébert, M., Royer, É., & Piérard, B. (1997). Identification of students with academic difficulties: Implications for research
and practice. Canadian Journal of School Psychology, 12, 143–154.
doi:10.1177/082957359701200211
*Salvesen, K. A., & Undheim, J. O. (1994). Screening for learning disabilities with teacher rating scales. Journal of Learning Disabilities, 27,
60 – 66. doi:10.1177/002221949402700109
*Schrader, F.-W., & Helmke, A. (1990). Lassen sich Lehrer bei der
Leistungsbeurteilung von sachfremden Gesichtspunkten leiten? Eine
761
Untersuchung zu Determinanten diagnostischer Lehrerurteile [Are
teachers influenced by extrinsic factors when evaluating scholastic performance? A study on the determinants of teachers’ judgments].
Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie,
22, 312–324.
Schrader, F.-W., & Helmke, A. (2001). Alltägliche Leistungsbeurteilung
durch Lehrer [Day-to-day performance evaluation by teachers]. In F. E.
Weinert (Ed.), Leistungsmessung in Schulen [Performance measurement
in schools] (pp. 45–58). Weinheim, Germany: Beltz.
Shavelson, R. J., & Stern, P. (1981). Research on teachers’ pedagogical
thoughts, judgments, decisions, and behavior. Review of Educational
Research, 51, 455– 498. doi:10.2307/1170362
Shepard, L., Hammerness, K., Darling-Hammond, L., & Rust, F. (2005).
Assessment. In L. Darling-Hammond & J. Bransford (Eds.), Preparing
teachers for a changing world (pp. 275–326). San Francisco, CA: Wiley.
*Sink, C. A., Barnett, J. E., & Pool, B. A. (1993). Perceptions of scholastic
competence in relation to middle-school achievement. Perceptual and
Motor Skills, 76, 471– 478. doi:10.2466/pms.1993.76.2.471
Smith, A. E., Jussim, L., & Eccles, J. (1999). Do self-fulfilling prophecies
accumulate, dissipate, or remain stable over time? Journal of Personality
and Social Psychology, 77, 548 –565. doi:10.1037/0022-3514.77.3.548
*Sofie, C. A., & Riccio, C. A. (2002). A comparison of multiple methods
for the identification of children with reading disabilities. Journal of
Learning Disabilities, 35, 234 –244. doi:10.1177/002221940203500305
Spinath, B. (2005). Akkuratheit der Einschätzung von Schülermerkmalen
durch Lehrer und das Konstrukt der diagnostischen Kompetenz [Accuracy of teacher judgments of student characteristics and the construct of
diagnostic competence]. Zeitschrift für Pädagogische Psychologie/
German Journal of Educational Psychology, 19, 85–95. doi:10.1024/
1010-0652.19.12.85
Südkamp, A., & Möller, J. (2009). Referenzgruppeneffekte im Simulierten Klassenraum: Direkte und indirekte Einschätzungen von Schülerleistungen [Reference-group effects in a simulated classroom: Direct and indirect judgments]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23, 161–174. doi:10.1024/1010-0652.23.34.161
Sutton, A. J. (2009). Publication bias. In H. Cooper, L. V. Hedges, & J. C.
Valentine (Eds.), The handbook of research synthesis and meta-analysis
(pp. 435–452). New York, NY: Russell Sage Foundation.
*Teisl, J. T., Mazzocco, M. M. M., & Myers, G. F. (2001). The utility of
kindergarten teacher ratings for predicting low academic achievement in
first grade. Journal of Learning Disabilities, 34, 286–293. doi:10.1177/
002221940103400308
Tillman, C. M. (2011). Developmental change in the relation between
simple and complex spans: A meta-analysis. Developmental Psychology,
47, 1012–1025. doi:10.1037/a0021794
*Tindal, G., & Marston, D. (1996). Technical adequacy of alternative
reading measures as performance assessments. Exceptionality, 6, 201–
230. doi:10.1207/s15327035ex0604_1
*Trautwein, U., & Baeriswyl, F. (2007). Wenn leistungsstarke Klassenkameraden ein Nachteil sind: Referenzgruppeneffekte bei Übertrittsentscheidungen [When high-achieving classmates put students at a disadvantage: Reference group effects at the transition to secondary schooling].
Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 21, 119–133. doi:10.1024/1010-0652.21.2.119
Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2006). Self-esteem,
academic self-concept, and achievement: How the learning environment
moderates the dynamics of self-concept. Journal of Personality and
Social Psychology, 90, 334–349. doi:10.1037/0022-3514.90.2.334
*Triga, A. (2004). An analysis of teachers’ rating scales as sources of
evidence for a standardised Greek reading test. Journal of Research in
Reading, 27, 311–320. doi:10.1111/j.1467-9817.2004.00234.x
*Trouilloud, D. O., Sarrazin, P. G., Martinek, T. J., & Guillet, E. (2002). The influence of teacher expectations on student achievement in physical education classes: Pygmalion revisited. European Journal of Social Psychology, 32, 591–607. doi:10.1002/ejsp.109
VanDerHeyden, A. M., Witt, J. C., & Gilbertson, D. (2007). A multi-year
evaluation of the effects of a Response to Intervention (RTI) model on
identification of children for special education. Journal of School Psychology, 45, 225–256. doi:10.1016/j.jsp.2006.11.004
*van Kraayenoord, C. E., & Schneider, W. E. (1999). Reading achievement, metacognition, reading self-concept and interest: A study of
German students in Grades 3 and 4. European Journal of Psychology of
Education, 14, 305–324. doi:10.1007/BF03173117
*Webster, R. E., Hewett, B., & Crumbacker, H. M. (1989). Criterion-related validity of the WRAT–R and K-TEA with teacher estimates of actual classroom academic performance. Psychology in the Schools, 26, 243–248. doi:10.1002/1520-6807(198907)26:3<243::AID-PITS2310260304>3.0.CO;2-M
*Wilson, J., & Wright, C. R. (1993). The predictive validity of student
self-evaluations, teachers’ assessments, and grades for performance on
the Verbal Reasoning and Numerical Ability Scales of the Differential
Aptitude Test for a sample of secondary school students attending rural
Appalachia schools. Educational and Psychological Measurement, 53,
259–270. doi:10.1177/0013164493053001029
*Wilson, M. S., Schendel, J. M., & Ulman, J. E. (1992). Curriculum-based
measures, teachers’ ratings, and group achievement scores: Alternative
screening measures. Journal of School Psychology, 30, 59–76. doi:
10.1016/0022-4405(92)90020-6
Winne, P. H., & Nesbit, J. C. (2010). The psychology of academic
achievement. Annual Review of Psychology, 61, 653–678. doi:10.1146/
annurev.psych.093008.100348
*Wright, C. R., & Houck, J. W. (1995). Gender differences among self-assessments, teacher ratings, grades, and aptitude test scores for a sample of students attending rural secondary schools. Educational and Psychological Measurement, 55, 743–752. doi:10.1177/0013164495055005005
Appendix
Search Terms
The following search terms were entered in electronic databases:
teacher*⁵ diagnostic* accuracy OR⁶ sensitivity OR competence,
teacher* diagnostic* skill* OR teacher* assessment skill*,
teacher* judgment* OR judgement*,
teacher* “classroom* assessment,”
teacher* “academic achievement” prediction,
grading accuracy,
teacher* judg* student* academic* achievement* OR outcome*
OR performance* OR abilit*,
teacher* assess* student* academic* achievement* OR outcome*
OR performance* OR abilit*,
teacher* evaluat* student* academic* achievement* OR outcome*
OR performance* OR abilit*,
teacher* rating* student* academic* achievement* OR outcome*
OR performance* OR abilit*,
teacher* rate student* academic* achievement* OR outcome*
OR performance* OR abilit*,
“teacher* expectation*.”
⁵ The truncation symbol allows for unknown characters, multiple spellings, or various endings.
⁶ The OR operator combines search terms so that each search result contains at least one of the terms.
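For readers who wish to reproduce or extend the literature search, the following sketch shows one way the five verb-stem queries listed above could be assembled programmatically. It is an illustration only, not part of the original search procedure; the names STEMS, OUTCOMES, and build_queries are hypothetical.

```python
# Illustrative sketch: generating the Boolean query strings from the appendix.
# The trailing * is the truncation symbol (any word ending); terms joined with
# OR require at least one match. Names below are hypothetical, not from the study.

STEMS = ["judg*", "assess*", "evaluat*", "rating*", "rate"]
OUTCOMES = ["achievement*", "outcome*", "performance*", "abilit*"]


def build_queries(stems, outcomes):
    """Return one query string per verb stem, mirroring the appendix entries."""
    or_clause = " OR ".join(outcomes)  # "achievement* OR outcome* OR ..."
    return [f"teacher* {stem} student* academic* {or_clause}" for stem in stems]


if __name__ == "__main__":
    for query in build_queries(STEMS, OUTCOMES):
        print(query)
```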
Received January 6, 2011
Revision received December 22, 2011
Accepted January 26, 2012