Journal of Educational Psychology, 2012, Vol. 104, No. 3, 743–762 © 2012 American Psychological Association 0022-0663/12/$12.00 DOI: 10.1037/a0027627

Accuracy of Teachers' Judgments of Students' Academic Achievement: A Meta-Analysis

Anna Südkamp (University of Bamberg), Johanna Kaiser and Jens Möller (University of Kiel)

This meta-analysis summarizes empirical results on the correspondence between teachers' judgments of students' academic achievement and students' actual academic achievement. The article further investigates theoretically and methodologically relevant moderators of the correlation between the two measures. Overall, 75 studies reporting correlational data on the relationship between teachers' judgments of students' academic achievement and students' performance on a standardized achievement test were analyzed, including studies focusing on different school types, grade levels, and subject areas. The overall mean effect size was found to be .63. The effect sizes were moderated by use of informed versus uninformed teacher judgments, with use of informed judgments leading to a higher correspondence between teachers' judgments and students' academic achievement. A comprehensive model of teacher-based judgments of students' academic achievement is provided in the Discussion.

Keywords: teacher judgment, academic achievement, judgment accuracy

Supplemental materials: http://dx.doi.org/10.1037/a0027627.supp

This article was published Online First March 26, 2012. Anna Südkamp, National Educational Panel Study, University of Bamberg, Bamberg, Germany; Johanna Kaiser and Jens Möller, Department of Educational Psychology, University of Kiel, Kiel, Germany. Correspondence concerning this article should be addressed to Anna Südkamp, National Educational Panel Study, University of Bamberg, Wilhelmsplatz 3, 96047 Bamberg, Germany. E-mail: [email protected]

Academic achievement is a major issue in educational psychology (Winne & Nesbit, 2010). Often, teachers' judgments are the primary source of information on students' academic achievement. The ability to accurately assess students' achievement therefore is considered to be an important aspect of teachers' professional competence (Ready & Wright, 2011). In acknowledgment of the importance of teachers' judgments for student learning, the American Federation of Teachers, the National Council on Measurement in Education, and the National Education Association (1990) have developed standards for teacher competence in the educational assessment of students. Likewise, the core propositions of the National Board for Professional Teaching Standards state that teachers should "know how to assess the progress of individual students as well as the class as a whole" (Proposition 3.3; National Board for Professional Teaching Standards, 2010).

Teachers' judgments can have consequences for their instructional practice, for the further evaluation of students' performances, and for placement decisions—and can crucially influence individual students' academic careers and self-concepts. First, teachers use their judgments of students' academic achievement as a basis for various instructional decisions (Alvidrez & Weinstein, 1999; Clark & Peterson, 1986; Hoge, 1983; Hoge & Coladarci, 1989). These judgments influence teachers' selection of classroom activities and materials; they determine the difficulties of the tasks selected, the choice of questioning strategies, and the organization of student learning groups; and they may prompt teachers to revise their teaching techniques (Shavelson & Stern, 1981). Elliott, Lee, and Tollefson (2001) consider good assessment to be an integral part of good instruction. In their empirical study, Helmke and Schrader (1987) found that high judgment accuracy in combination with a high-frequency use of instructional techniques such as providing structuring cues or individual support was particularly favorable for student learning.

Teachers have various assessment tools at their disposal, including "oral questioning of students, observation, written work products, oral presentations, interviews, projects, portfolios, tests, and quizzes" (Shepard, Hammerness, Darling-Hammond, & Rust, 2005, p. 294). Although objective measures of students' academic achievement are now being more widely applied, there are still good reasons to care about the accuracy of teachers' judgments. For example, in response to intervention (RTI) models of data-based decision making, curriculum-based measures are commonly used to assess students' academic achievement (VanDerHeyden, Witt, & Gilbertson, 2007). Curriculum-based measures (CBM) are defined as any set of measurement procedures involving "direct observation and recording of student performance in response to selected curriculum materials [which] are emphasized as a basis for collecting information" (Deno, 2003, p. 4) to make instructional decisions. In the context of reading, for instance, the reading skills of elementary students are assessed by CBM every 3–4 months; in the case of students receiving intervention services, objective measures are applied even more regularly (every 1–2 weeks; Begeny, Krouse, Brown, & Mann, 2011). However, teachers make judgments about instruction more often than can be facilitated by objective data. Even in the context of RTI practices, where objective measurement of students' academic achievement is implemented by default, teachers must still make ongoing instructional decisions that are informed by their judgments.

Second, various authors have noted that accurate teacher judgments can help to identify children who show early signs of difficulties in school (Bailey & Drummond, 2006; Beswick, Willms, & Sloat, 2005; Teisl, Mazzocco, & Myers, 2001) and that accurate information on students' academic achievement is crucial for meaningful placement decisions (Helwig, Anderson, & Tindal, 2001). In practice, teachers' judgments tend to be given heavy weight in decisions about intervention (Hoge, 1983). In the case of students requiring intensive intervention, it is the teacher who is able to employ early, less intensive forms of intervention in the classroom and who takes steps to arrange more intensive intervention (Begeny et al., 2011). Third, research has shown that teacher judgments of students' academic achievement influence teacher expectations about students' ability (Brophy & Good, 1986). A large body of research in the form of experimental and naturalistic studies has provided empirical insights into the formation, transmission, and impact of teacher expectations on students' performance (de Boer, Bosker, & van der Werf, 2010; Jussim & Eccles, 1992). Fourth, in the context of formal assessment, teacher judgments of students' performance are commonly expressed in the form of grades, which not only provide feedback to students and parents (Hoge & Coladarci, 1989) but also contribute to exit qualifications in many countries (Harlen, 2005).
As Begeny, Eckert, Montarello, and Storie (2008) and Feinberg and Shapiro (2003) have pointed out, grades thus have far-reaching consequences for students' academic careers. Fifth, research on academic self-concepts (see Marsh, 1990a, and Möller, Pohlmann, Köller, & Marsh, 2009, for an overview) has shown that teacher judgments influence students' self-related cognitions of ability. For example, Trautwein, Lüdtke, Köller, and Baumert (2006) found that the effect of students' individual achievement on their academic self-concept is mediated by teacher-assigned grades. In turn, academic self-concept has a considerable effect on student learning (Marsh, 1990b). Given the important implications of teacher judgments, the question of their accuracy is critical. Accurate assessment of students' performance is a necessary condition for teachers to be able to adapt their instructional practices, to make fair placement decisions, and to support the development of an appropriate academic self-concept.

Teacher Judgment Accuracy

Most research explicitly focusing on teacher judgment accuracy examines the relationship between teachers' judgments of students' achievement and students' actual performance on measures of achievement in various subject areas. However, studies not explicitly intending to assess teacher judgment accuracy (e.g., studies validating student test scores by reference to teacher ratings) can also provide empirical insights into teacher judgment accuracy. Both types of studies (with and without an explicit focus on teacher judgment accuracy) are included in this meta-analysis. Nevertheless, our review of the literature is restricted to studies focusing on teacher judgment accuracy. We carefully distinguish between the two study types wherever necessary throughout the article.

The most commonly reported measure quantifying the correspondence between teachers' judgments and students' actual achievement is the correlation between the two. Overall, moderate to high correlations are reported (Begeny et al., 2008; Demaray & Elliott, 1998; Feinberg & Shapiro, 2003). For example, Feinberg and Shapiro (2009) reported correlations of .59 and .60 between teachers' judgments and students' decoding skills and reading comprehension, as measured by subtests of the Woodcock–Johnson III Test of Achievement. In the same study, a correlation of .64 was found between students' oral reading fluency as measured by a CBM procedure and teachers' predictions of oral reading fluency. In a review of 16 studies, Hoge and Coladarci (1989) found a median correlation of .66 between teachers' judgments and students' achievement on a standardized test. On the one hand, these results may be interpreted as indicating that teachers' judgments are quite accurate; on the other hand, their judgments are evidently far from perfect, and more than half of the variance in teachers' judgments cannot be explained by student performance. Additionally, the correlations varied substantially across studies, ranging from .28 to .92 (Hoge & Coladarci, 1989).

It is important to note that accuracy is rarely defined in concrete terms in studies explicitly focusing on teacher judgment accuracy (see Ready & Wright, 2011, for an exception). Which outcomes are considered to be accurate or inaccurate therefore remains questionable. To date, no consistent criteria have been established. Moreover, the different methods used to measure teachers' judgments and students' academic achievement make a substantial contribution to the degree of accuracy observed. For example, the outcome (degree of accuracy) may differ depending on whether teachers are informed about the standard of comparison for their judgment. Accordingly, the inaccuracy of teachers' judgments may be grounded in the studies' methodologies rather than in the teachers' diagnostic competence. Another limitation broadly shared by studies on teacher judgment accuracy is that they do not account for the dependency of teachers' judgments on the academic achievement of the students in their class. A multilevel approach to the analysis of teacher judgment accuracy as applied by Ready and Wright (2011) is the most appropriate option, but it is seldom used in studies on teacher judgment accuracy. The reader should keep this limitation in mind when interpreting the results of this meta-analysis.

Factors Influencing Teacher Judgment Accuracy

Against this background, it is clear that methodological differences between studies need to be taken into account when one is considering the differences in the studies' findings. For example, studies with and without an explicit focus on teacher judgment accuracy clearly differ in terms of the methods used, and these differences warrant particular consideration when all results are interpreted in terms of teacher judgment accuracy.

Judgment Characteristics

As mentioned previously, it can be assumed that various judgment characteristics affect the correspondence between teachers' judgments and students' academic achievement. We therefore distinguished various aspects of teacher judgments: informed versus uninformed judgments, number of points on the rating scale used, instruction specificity, norm-referenced versus peer-dependent judgments, and domain specificity of teachers' judgments.

Informed versus uninformed teacher judgments. Hoge and Coladarci (1989) distinguished between direct and indirect teacher judgments. In this meta-analysis, we used a slightly different categorization. The main difference between direct and indirect judgments is that teachers are either informed or uninformed about the test or the standard of comparison on which their judgment is based. In some studies, teachers are asked to assess students' academic achievement on a standardized achievement test by estimating the number of items each student will solve correctly (Helmke & Schrader, 1987). This approach can be considered an informed rating. In other studies, teachers are asked to rate students' performance in a certain subject on a Likert-type rating scale (e.g., a 5-point rating scale; DuPaul, Rapport, & Perriello, 1991). Hoge and Coladarci (1989) called this type of approach an indirect rating. Here, teachers are usually (but not always) left uninformed about the standard of comparison to be applied in their judgment. As this can make an important contribution to their judgments, we chose to distinguish between "informed" and "uninformed" judgments. In line with the results of Hoge and Coladarci, both Feinberg and Shapiro (2003, 2009) and Demaray and Elliott (1998) found higher correlations for direct (usually informed) teacher judgments than for indirect (usually uninformed) teacher judgments. For example, Feinberg and Shapiro (2003) found a correlation of .70 between students' test performance and direct teacher judgments, whereas the correlation with indirect teacher judgments was .62.

Points on the rating scale.
Studies using rating scales to obtain teacher judgments differ in terms of the number of points on the rating scales implemented. Rating scales with many categories permit a sophisticated judgment, whereas scales with fewer categories allow a more global judgment. Generally, slightly higher correlations with students' actual performance are obtained for more sophisticated judgments than for more global judgments. To date, this variable has been neglected in empirical research on teacher judgment accuracy. We therefore considered the number of points on the rating scales used in this meta-analysis, expecting to find higher correlations between teachers' judgments and students' academic achievement when a sophisticated rating scale was used.

Judgment specificity. According to the approach used by Hoge and Coladarci, teachers' judgments can be allocated to one of five categories, ranging from low to high specificity. First, a judgment that requires teachers to rate students' academic achievement on a rating scale (e.g., poor–excellent) is considered to be of low specificity. Second, in a ranking, the teacher's task is to put the students of his or her class into rank order according to their achievement. Third, tasks requiring teachers to find grade equivalents for students' performance on a standardized achievement test are considered to be of average specificity. Fourth, tasks requiring teachers to estimate the number of correct responses achieved by a student on a standardized achievement test are slightly less specific than the fifth and most specific category, in which teachers indicate students' item responses on each item of an achievement test. In their review, Hoge and Coladarci found a median correlation of .61 for studies using ratings, which was the predominant approach. The median correlations for studies using rank ordering (median r = .76), grade equivalents (median r = .70), number of correct responses (r = .67, for a single study), and item-based judgments (median r = .70) were indeed higher.

Norm-referenced vs. peer-independent judgments. In addition, teacher judgments may differ in whether they are norm-referenced or peer-independent. For example, Helwig et al. (2001) asked teachers to rate students' academic achievement on an absolute scale (very low proficiency–very high proficiency), whereas Hecht and Greenfield (2002) asked teachers to estimate students' academic achievement in relation to other members of the class (in the bottom 10% of the class–in the top 10% of the class). Hoge and Coladarci (1989) considered this aspect in their meta-analysis but found no substantial difference between correlations. The median correlation for norm-referenced judgments was .68; that for peer-independent judgments was .64. We also considered norm-referenced versus peer-independent teacher judgments in the present meta-analysis. However, we did not formulate a hypothesis about the direction of the effect on teacher judgment accuracy. It is possible that the use of peer-independent teacher rating scales leads to higher correlations between teacher judgments and students' academic achievement, because this approach allows teachers to focus on each student individually, preventing judgment biases due to the achievement of other students in the class (see also the literature on the big-fish-little-pond effect; Marsh, 1989).
On the other hand, it is equally possible that the use of norm-referenced teacher rating scales produces higher accuracy scores (correlations), as these correlations reflect teachers’ ability to establish a rank order within a class based on the students’ achievement. Domain specificity. Finally, teacher judgments differ in terms of their domain specificity. Whereas some studies ask teachers to judge students on a very specific ability (e.g., arithmetic skills; Karing, 2009), others ask them to judge students’ overall academic achievement (e.g., Li, Pfeiffer, Petscher, Kumtepe, & Mo, 2008). To our knowledge, no studies to date have examined the influence of the domain specificity of teachers’ judgments on teacher judgment accuracy. However, it seems reasonable to hypothesize that it is easier to make a focused judgment on a domain-specific ability than to judge a student’s overall academic ability. Therefore, we expect to find higher teacher judgment accuracy for domain-specific judgments than for global judgments. Test Characteristics Like the judgment characteristics we have summarized, test characteristics in turn depend on methodological decisions made by the author(s) of the studies. In studies explicitly focusing on teacher judgment accuracy, various instruments are used to measure students’ academic achievement, ranging from highly specific tests measuring, for example, receptive vocabulary (e.g., the Peabody Picture Vocabulary Test used by Fletcher, Tannock, & Bishop, 2001) to broader tests measuring students’ performance in different subject areas (e.g., the Kaufman Test of Academic Achievement measuring achievement in mathematics, reading, and spelling used by Demaray & Elliott, 1998). Such differences between tests are summarized under the label test characteristics here. Various test characteristics can be assumed to influence the correspondence between teachers’ judgments and students’ performance. In this meta-analysis, we considered the subject matter assessed, accounted for the use of CBM procedures or standard- SÜDKAMP, KAISER, AND MÖLLER 746 ized achievement tests, and distinguished domain-specific tests from tests covering different domains. Subject matter. Comparing correlations between teachers’ judgments and students’ academic achievement in different subjects, Hopkins, George, and Williams (1985) found that correlations were significantly lower for social studies and science than for language arts, reading, and mathematics. Using CBM procedures to gauge students’ academic achievement, Eckert, Dunn, Codding, Begeny, and Kleinmann (2006) found higher correlations for reading than for mathematics. In turn, Coladarci (1986) reported teachers’ judgments to be more accurate for students’ performance in mathematics computations than for mathematics concept items. Demaray and Elliott (1998) found no difference between correlations in language arts and in mathematics. Hinnant, O’Brien, and Ghazarian (2009) found that teachers’ ratings of academic ability as measured by an academic skills questionnaire were highly correlated with standardized measures of achievement in reading (.53–.67) and mathematics (.54 –.57). Evidently, the empirical findings on the influence of subject matter on teacher judgment accuracy are inconsistent. CBM procedures vs. standardized achievement tests. Some studies of teacher judgment accuracy have used CBM procedures as indicators of students’ achievement (Eckert et al., 2006; Feinberg & Shapiro, 2003; Hamilton & Shinn, 2003). 
According to Feinberg and Shapiro (2003), CBM is closely linked to actual in-class student performance, as methods derived from curriculum materials provide a closer overlap with the content of instruction than do published norm-referenced tests. Feinberg and Shapiro (2009) found that correlations between a CBM procedure measuring oral reading fluency and teachers’ predictions of oral reading fluency were slightly higher (.64) than correlations between a global teacher rating of students’ performance and two subtests of a standardized achievement test (.59 and .60). In the present meta-analysis, we therefore consider the use of CBM procedures versus standardized achievement tests. Domain specificity. Like teacher judgments, academic achievement tests also differ in terms of their domain specificity. Whereas some tests are designed to measure a very specific academic ability (e.g., phonological awareness; Bailey & Drummond, 2006), others measure different aspects of academic ability (e.g., the Woodcock–Johnson Achievement Battery; Benner & Mistry, 2007). We therefore took this test characteristic into consideration in this meta-analysis. Correspondence Between Judgment and Test Characteristics In the present meta-analysis, we also considered the time gap between teachers’ judgments and the administration of the achievement test and the congruence in the domain specificity of the judgment characteristics and test characteristics. Time Gap In their review, Hoge and Coladarci (1989) included only studies in which the achievement test was administered at the same time as the teacher rating task. There are studies, however, in which these two measures are not implemented concurrently (for example, the study by Pomplun (2004), which focused on the validation of a reading test). Due to temporal proximity, we expected to find higher correlations between teachers’ judgments and students’ academic achievement when both measures are administered concurrently than when the test is administered either before or after the rating task. Congruence in Domain Specificity Finally, we considered the congruence in the domain specificity of the teacher rating task and the achievement test. Theoretically, the achievement test may measure a specific academic ability, whereas the teacher judgment task may be less specific— or vice versa. For example, Hecht and Greenfield (2001) found teachers’ judgments of students’ overall academic competence to be correlated with the students’ performance on the Letter–Word Identification subtest of Woodcock–Johnson Test of Achievement– Revised. Here, a general judgment was set in relation to a very specific ability. We expected to find higher correlations between teachers’ judgments and students’ achievement in studies in which the domain specificity of the teacher rating task and the achievement test was congruent (e.g., teachers rated students’ reading comprehension; students were administered a test of reading comprehension) and lower correlations in studies in which the domain specificity was incongruent (e.g., teachers rated students’ overall academic achievement; students were administered a test of reading comprehension). Teacher and Student Characteristics Besides judgment and test characteristics, characteristics of the teachers judging students’ performance and of the students being judged also warrant consideration. 
Studies explicitly focusing on teacher judgment accuracy have found large interindividual differences in teachers’ ability to judge student performance (Helmke & Schrader, 1987). For example, Lorenz and Artelt (2009) reported moderate average correlations between teacher judgments and student performance in reading and mathematics for a sample of 127 teachers. The standard deviation for the mean of the correlations was .30 for reading and .39 for mathematics. Some teachers showed very high judgment accuracy; others, very low judgment accuracy. These findings raise the question of which characteristics of teachers predict their judgment accuracy. A teacher’s characteristics are thought to influence his or her judgment at various stages of the judgment process (e.g., reception, perception, interpretation), and characteristics such as job experience (Impara & Plake, 1998), beliefs (Shavelson & Stern, 1981), professional goals (Schrader & Helmke, 2001), and teaching philosophy (Hoge & Coladarci, 1989) have previously been associated with teachers’ judgment processes in the literature. Although the variability in the accuracy of teachers’ judgments is well documented (Helmke & Schrader, 1987; Hoge & Coladarci, 1989), empirical research has not yet pinpointed individual teacher characteristics that influence judgment accuracy. As teacher characteristics have only been examined in a small number of studies to date, moreover, we were not able to study their effects in the present meta-analysis. At the same time, several student characteristics have been identified as influencing the accuracy of teachers’ judgments. For ACCURACY OF TEACHERS’ JUDGMENTS example, Bennett, Gottesman, Rock, and Cerullo (1993) found that teachers who perceived their students as exhibiting bad behavior also perceived these students to be low academic performers, regardless of the students’ academic skills. In a study by Hurwitz, Elliott, and Braden (2007), the accuracy of teachers’ judgments was related to students’ disability status: Teachers predicted the mathematics test performance of students without disabilities more accurately than that of students with disabilities. As is the case for teacher characteristics, however, few studies to date have reported information on the student sample. Moreover, any data available are not readily comparable across studies (e.g., only the percentage of female/male students was reported). Therefore, we decided not to conduct moderator analyses on student characteristics in this meta-analysis. Meta-Analytic Approach A review of literature cited in many studies on teacher judgment accuracy (Begeny et al, 2008; Feinberg & Shapiro, 2003; Hinnant et al., 2009) is that by Hoge and Coladarci (1989). As mentioned previously, this review summarized the results of 16 studies presenting data on the relationship between teachers’ judgments of students’ academic achievement and the students’ actual performance on an independent criterion of achievement. Hoge and Coladarci reported a range of correlations from .28 to .92 and a median correlation of .66. Hoge and Coladarci (1989) also examined how different methodological study characteristics (direct vs indirect judgments, instruction specificity, norm-referenced vs peer-dependent judgments) were related to the correspondence between teachers’ judgments and students’ achievement. They also sought to identify moderator variables (student gender, subject matter, student ability) influencing the size of the correlation between the two measures. 
Because only 16 studies were included in the review, the sample sizes for studying the different effects were small. As such, only descriptive analyses could be presented. For example, three studies distinguished between male and female students and found no effect of gender on teacher judgment accuracy. Similarly, two studies explored the influence of student achievement on teacher judgment accuracy, revealing higher levels of teacher accuracy in judging appropriateness of instruction for higher achieving than for lower achieving students (Leinhardt, 1983) and lower levels of accuracy in judging the performance of lower achieving students (Coladarci, 1986). In the present meta-analysis, we did not evaluate the primary studies separately and descriptively, as was done by Hoge and Coladarci. As such, we were unable to control for student ability as a moderating variable, as the different testing procedures used meant that data on students’ average achievement (means and standard deviations) were not comparable across studies. Since the publication of the Hoge and Coladarci review in 1989, numerous further studies have reported data on teachers’ judgment accuracy. In order to overcome the limitations of their narrative review and to draw a clear picture of current findings on teacher judgment accuracy, we therefore present a comprehensive metaanalysis. Beyond the statistical synthesis of study results, we evaluated whether potential moderators influence the size of the correlation between teacher judgments and students’ actual academic achievement. 747 Method Information Retrieval Process We identified relevant studies by applying a multimodal search strategy involving both electronic and manual searches.1 The literature search process consisted of two phases. First, we conducted preliminary searches to refine our research questions and to define the key concepts. In this phase, we also refined and modified our search terms by using database thesauri to ensure that the universe of appropriate synonyms were included. The main searches were conducted in the second phase (March–July 2009). Electronic searches were conducted using the four main search engines in the fields of psychology and education, which cover a wide variety of bibliographic databases: the Education Resources Information Center (ERIC), PSYNDEXplus in Journals@Ovid, the EBSCOhost (including PsycARTICLES, PsycINFO, and the Psychology and Behavioral Sciences Collection), and the Web of Science (including the Science Citation Index Expanded, the Social Sciences Citation Index, the Arts & Humanities Citation Index, the Conference Proceedings Citation Index–Science, and the Conference Proceedings Citation Index–Social Science & Humanities). The search terms entered in these databases include “teacher judgment,” “teacher expectations,” and “classroom assessment.” A full list of search terms is given in the Appendix. Inclusion Criteria and Exclusion Criteria General criteria. In order to identify studies reporting data on the accuracy of teachers’ judgments of students’ academic achievement, we searched for studies analyzing the relationship between teacher judgments of students’ academic achievement and students’ actual performance on an achievement test. We excluded studies (e.g., Pohlmann, Möller, & Streblow, 2004; Spinath, 2005) analyzing the accuracy of teachers’ judgments of student characteristics other than achievement (e.g., motivation, attention, anxiety). 
First, we included studies conducted to validate teachers' judgments by reference to students' performance on a (standardized) achievement test. Second, we included studies conducted to validate a standardized achievement test by reference to teachers' judgments. Third, we sought to include any study reporting on the relationship between teachers' judgments and students' academic achievement.

English abstract. As we used English search terms in the literature search, we included all studies retrieved with an English title and abstract, including studies in languages other than English. For example, the study by Eshel and Benski (1995) was written in Hebrew.2

1 We would like to thank the following people for their contribution to this meta-analysis: Yvonne Anders, Susannah Goss, Friederike Helm, Annette Heberlein, Nils Machts, Maria Rauch, Angelika Ribak, and Camilla Rjosk.
2 Our thanks go to the first author, Yohanan Eshel, for translating the relevant information for us.

School system. We included only those studies that reported teacher judgments of students enrolled in the regular school system (e.g., from kindergarten through Grade 12 in the United States). We excluded studies focusing on college students, vocational training students, or prekindergarten children.

Quantitative data. We included only studies reporting quantitative data. Qualitative studies were excluded.

Field research. We also excluded studies that were not conducted in the field but that used computer simulations (Südkamp & Möller, 2009) or case descriptions (Chang & Sue, 2003) to analyze the accuracy of teachers' judgments.

Publication year. As Hoge and Coladarci published their meta-analysis on the accuracy of teachers' judgments in 1989, we limited our search to studies published between January 1989 and December 2009. The only exception is the study by Anders, Kunter, Brunner, Krauss, and Baumert (2010), which was in press in 2009. As we used rather broad keywords in the literature search to identify all studies reporting a correlation between teachers' judgments and students' academic achievement, our searches produced high numbers of potentially relevant studies. Including studies published before 1989 would have considerably increased the number of studies identified and thus have been prohibitively costly. The procedure of defining a certain cutoff point for the inclusion or exclusion of studies also has been applied in other recent meta-analyses (see, e.g., Cafri, Komrey, & Brannick, 2010; Fischer & Boer, 2011; Tillman, 2011).

Statistics. Most studies on the accuracy of teachers' judgments report correlations between teachers' judgments and students' performance on an achievement test. However, the relationship can also be presented by means of other statistics (e.g., t test results or regression coefficients). In the information retrieval process, we searched for studies reporting any values representing the correspondence between teachers' judgments and students' academic achievement. In addition, some studies (for example, Demaray & Elliott, 1998) report additional measures (here, the results of t tests) in order to answer a specific research question. Nevertheless, the relationship between teachers' judgments and students' academic achievement is already presented adequately through a correlation coefficient. In these cases, we decided to include only the correlation coefficients in the meta-analysis.
During the information retrieval process, no study was identified in which only t test results were reported. Simultaneity of judgments and tasks. In contrast to Hoge and Coladarci (1989), we did not limit our analysis to studies in which judgment and test data were collected concurrently but included studies in which judgments were made prior to or after testing. Publication source. In meta-analyses, the problem of publication bias (i.e., the selective publication of studies with a particular outcome, usually those whose results are statistically significant, at the expense of null studies; Ferguson & Brannick, 2011; Sutton, 2009) is often addressed by the inclusion of “unpublished” studies (dissertations, conference papers, and the like). In the present meta-analytic review, however, we decided to focus our attention on articles published in scientific journals for three main reasons. First, the issue of publication bias in studies on teacher judgment accuracy is a rather minor one, as findings of low correlations between teachers’ judgments and students’ academic achievement do not usually prevent findings from being published. As is evident from the wide range of correlations presented in the Results section, the range of effect sizes reported is large. Second, there were methodological reasons for the decision to exclude gray literature. As Ferguson and Brannick (2011) have pointed out, including gray literature in an attempt to overcome the problem of publication bias in fact often exacerbates the problem. For example, whereas the proportion of published articles exceeds that of unpublished articles in any meta-analysis (see Balliet, Mulder, & van Lange, 2001, and Kleingeld, van Mierlo, & Arends, 2011, for recent examples), the ratio of published to unpublished studies in the field may be the reverse. In our case, a search of the ProQuest dissertation database identified many studies conducted in the United States, but other international studies clearly were underrepresented. We therefore decided to include only published studies in our meta-analysis to ensure that our selection was clear and transparent. Third, the decision to exclude unpublished studies was the result of limited resources. As described earlier, the use of rather broad keywords led to the identification of high numbers of potentially relevant references. Including dissertations, conference papers, and so on in the study selection and coding process would have been prohibitively time-consuming. Interested readers are referred to Lench, Flores, and Bench (2011) for an example of a meta-analysis in which gray literature was excluded for reasons of limited resources. The decision to exclude gray literature has also been made in other recent meta-analytic reviews (e.g., Fischer & Boer, 2011; Tillman, 2011). Explicit judgments. We included only those studies in which teachers were asked explicitly to judge students’ academic achievement. Although grading may also be considered a form of teacher judgment, we excluded studies in which grades were used as teacher judgments. Study Selection Procedure In a first step, all search terms were entered in each database, resulting in a total of 20,456 potentially relevant references. The title and abstract of each reference were read by one researcher, who decided whether to include the reference on the basis of the inclusion/exclusion criteria. 
With this selection process, a total of 1,083 references were identified as potentially including information on the relationship between teachers’ judgments and students’ academic achievement, which were retrieved for further review. In a next step, the selected studies were carefully read, and the inclusion/exclusion criteria were applied. Among the 1,083 studies ascertained to be potentially relevant, we identified 37 studies including data on teachers’ judgments and students’ academic achievement but not reporting the correlation between the two measures or other statistical indices that would have allowed transformation to correlation coefficients. These studies were excluded from the analyses. For example, Jones and Gerig (1994) obtained teachers’ rankings of students’ achievement and students’ test scores, but they did not report the correlation between the two measures. Instead, means and standard deviations of the teacher ranking were reported by “achievement level” (1– 4) for silent and nonsilent readers separately. It was therefore not possible to transform these data into correlation coefficients. Likewise, Smith, Jussim, and Eccles (1999) collected data on students’ academic achievement and teachers’ ratings of students’ academic achievement. Here, relationships between the two measures were reported only in complex multivariate models, making calculation of the single correlation between the two measures impossible. There were similar problems with the statistics reported in the other 35 ACCURACY OF TEACHERS’ JUDGMENTS studies that were excluded. As our final selection of 94 studies was limited to studies reporting the correlation between teachers’ judgments and students’ academic achievement, there was no need to transform other measures into correlation coefficients. The 94 studies identified were then screened for further references not found through the electronic searches. This manual search produced nine relevant references. A total of 103 relevant studies were thus identified and included in the coding process. Some of these studies were closely related. As we wanted to avoid including duplicate data, we excluded articles that seemed to report the same data as another article. For example, Bennett, Gottesman, Cerullo, and Rock (1991) and Gottesman, Cerullo, Bennett, and Rock (1991) both reported data on a sample of 796 students; one table of descriptive data is the same in both articles. We decided to include only the Gottesman et al. (1991) article, which includes more information on the subsamples analyzed. In addition, 15 studies were excluded because they were not published in a regular journal (most were reports and conference papers found in the ERIC database). Moreover, studies focusing on academic achievement in subjects other than language arts and mathematics were excluded, as these subjects were clearly underrepresented. Specifically, we identified one study by Trouilloud, Sarrazin, Martinek, and Guillet (2002) on sports and one study by Klinedinst (1991) on music. A further three studies were excluded because they were found to not meet the inclusion criteria during the coding process (e.g., teachers rated students’ learning behavior rather than academic achievement). As a result of this study selection procedure, 75 studies were included in the present review. Data Coding Two of the authors independently coded all studies. Before analyzing the data, we calculated the level of interrater agreement on the coding of key variables. 
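As a brief reminder of the chance-corrected agreement index reported for the categorical variables below, Cohen's kappa relates the observed proportion of coder agreement to the agreement expected if the two coders assigned categories independently at their observed base rates (this restatement is added here for orientation; the symbols are standard and are not used in the original article):

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance. A kappa of 1 indicates perfect agreement, and a kappa of 0 indicates agreement no better than chance.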
For the categorical variables, we used Cohen’s kappa to assess agreement (Cohen, 1960). The resulting kappa coefficients were as follows: country: .99; aim of the study: .96; informed versus uninformed judgments: .93; judgment specificity: .97; norm-referenced versus peer-independent judgments: 1.00; domain specificity of the achievement test: .93, domain specificity of the judgment task: .93; time gap: .97. For the remaining variables, we determined the percentage of times that the two raters recorded the same value for each independent sample. The levels of interrater agreement were as follows: teacher judgment accuracy (mean intercoder agreement for all coded correlations): 96%; year of publication: 100%; sample size: 99%; points on rating scale: 100%. Instances of disagreement were resolved by discussion. If information on the variables under investigation was not available from a study, it was coded as missing. Information on the following variables was coded: Teacher judgment accuracy. All reported correlations between teachers’ judgments and students’ actual test performance were extracted from the selected studies. Negative correlations were multiplied by ⫺1 whenever one of the following two conditions was satisfied: lower values representing more favorable judgments and higher values representing less favorable judgments on the teacher rating scale (e.g., Tindal & Marston, 1996; Wilson, Schendel, & Ulman, 1992). 749 Study-specific characteristics. The following study-specific characteristics were coded for primarily methodological reasons: Year of publication. Publication year was coded as a continuous variable. Country. We coded the country in which the study was conducted and allocated each study to one of five groups: United States, Australia, Canada, Europe, and other countries. Aim of the study. According to the inclusion/exclusion criteria, we coded whether the main aim of the study was to validate teachers’ judgments by tests, or to validate achievement tests by teachers’ judgments, or whether studies simply reported the correlations between the two measures. Sample size. For each study, the number of students rated and the number of teachers rating student performance were coded. As many studies reported correlations for different subsamples, we coded the exact size of the student and teacher samples for each correlation. Judgment characteristics. The following aspects of teacher judgments were taken into account: Informed versus uninformed judgments. We coded whether teachers were informed about the achievement test on which their judgment of student achievement would be based—that is, about the standard of comparison to be applied in their judgment. Points on rating scale. For the later analysis, we coded the number of categories given on the rating scales used. Judgment specificity. Teachers’ judgments were classified as ratings (e.g., rating of students’ performance in mathematics), rankings (e.g., ranking of students from lowest to highest in reading ability), or estimations of the number of correct responses (e.g., estimation of the number of items solved correctly). Unlike Hoge and Coladarci (1989), we did not include grading as a type of teacher judgment. That approach would have increased the number of relevant studies enormously. None of the studies in our sample asked teachers to indicate students’ responses on each item of an achievement test. Norm-referenced versus peer-independent judgments. 
On peer-dependent rating scales (e.g., near the bottom of the class–one of the best in the class), teachers are asked to rate students’ performance in relation to a reference group (usually the other students in the class). Peer-independent rating scales do not elicit an explicit comparison with a reference group (e.g., very low ability–very high ability). Domain specificity. We coded the domain specificity of the judgment task using the following three categories: judgment of overall academic ability (0), judgment of academic ability in one subject (1), and judgment of a specific academic ability within a subject (2). Test characteristics. On the basis of our theoretical considerations, we coded information on the following test characteristics: Subject matter. We coded the domain (language arts or mathematics) in which academic ability was measured. Some studies administered tests measuring achievement in different subjects. In these cases, the subject was coded as mixed. CBM procedures versus standardized achievement tests. We differentiated between the use of standardized achievement tests and CBM procedures. Domain specificity. We coded the domain specificity of the achievement test using the following three categories: covered 750 SÜDKAMP, KAISER, AND MÖLLER different subjects (0; e.g., mathematics and language arts), covered a single subject (1; e.g., mathematics), and covered a specific ability within a subject (2; e.g., oral comprehension). Correspondence between judgment and test characteristics. With regard to the correspondence between judgment and test characteristics, the following information was coded: Time gap. Our meta-analysis includes studies that report measures of teacher judgments and students’ academic achievement obtained at different points of time. Therefore, we also coded when teachers’ judgments were made: same time (achievement test and rating task administered within a 1-month period), test before rating (achievement test administered at least 1 month before rating task), or test after rating (achievement test administered at least 1 month after rating task). Congruence in domain specificity. We coded the domain specificity of the achievement test and the judgment task separately (as described previously). In a second step, we calculated the difference between the domain specificity of the two measures in order to gauge the congruence between the achievement test and the judgment task. In the subsequent analysis, we coded the studies as using either a congruent achievement test and rating task (0, achievement test and rating are equally specific) or an incongruent achievement test and rating task (1, one measure is more specific than the other). Analytical Issues For this meta-analysis, we coded not only study outcomes but also several study characteristics as variables with the potential to explain differences in study outcomes. Some studies reported separate correlations for different methodological approaches (e.g., a focus on language arts or mathematics). For studies in which more than one correlation coefficient was reported, we calculated the mean correlation coefficient (Lipsey & Wilson, 2001; Möller et al., 2009; O’Mara, Marsh, Craven, & Debus, 2006). As some studies reported correlation coefficients for different subsamples or different methodological approaches, we calculated the mean correlations for these subsamples or differing approaches separately (Kalaian & Kasim, 2008). 
In these cases, we included more than one effect size from the same study in the meta-analytic calculations and thus had to deal with the problem that those effect sizes were not independent. The number of participants in each study (N) refers to the number of students who were rated. For studies reporting correlations from more than one sample, we calculated the mean number of participants across all samples in a study (Möller et al., 2009).

To account for the hierarchical structure of the meta-analytic data (subjects within studies at the first level and studies at the second level), we applied a multilevel approach (Hox, 2002; Kalaian & Kasim, 2008). This approach assumes that the primary studies under review are samples from the population of studies. Accordingly, an estimate of a study's effect size is regarded as a function of a true population effect size, within-study sampling error, and random between-studies error. The variation in the between-studies error is estimated via the multilevel approach and can be modeled and explained using study and sample characteristics. The multilevel approach combines features of the traditional fixed effects approach and the random effects approach. It assumes differences in effect sizes beyond those due to sampling error. Additionally, unlike the fixed effects and the random effects approach, the multilevel approach does not assume the independence of effect sizes (Marsh, Bornmann, Mutz, Daniel, & O'Mara, 2009). As we did not have access to the original raw data but had to draw on the published descriptive results, we assumed the sampling error to be known (variance-known model) and calculated the sampling variances of the effect sizes from the summary statistics of the primary studies (Kalaian & Kasim, 2008). The analyses were performed with hierarchical linear modeling (HLM Version 6; Raudenbush, Bryk, Cheong, & Congdon, 2004) using the HLM2 option, in which restricted maximum likelihood estimation is applied.

Results

The 75 studies included in the analysis of effect sizes are documented in Table 1. The correlation between teacher judgments and students' test performance (r) and the size of the student sample (N) are reported for each study. For studies reporting more than one correlation, we calculated the mean correlation and the mean size of the student sample (see previous text). The mean correlation was calculated with Fisher's z transformation of the single correlations (Hedges & Olkin, 1985; Lipsey & Wilson, 2001). Then the coefficients were re-transformed into correlation coefficients (Borenstein, 2009). The table also includes an effect size for each study (Zr), which was again calculated using Fisher's z transformation, and the asymptotic variance of the effect sizes (VarZr; Rosenberg, Adams, & Gurevitch, 2000). In the following analyses, the effect size Zr serves as the dependent variable.

Summary of Effect Sizes

In a first step, we applied an unconditional multilevel model to the data to estimate the overall mean effect size and to examine heterogeneity in the primary study effects. No explanatory variables are included at either level in an unconditional multilevel model. The results of the baseline model are presented in Tables 2 and 3.3 The overall mean effect size of the 73 effect sizes included in the analysis was .63 and significantly different from zero.
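For reference, the effect size metric underlying these analyses can be stated compactly. The Fisher z transformation of a correlation $r_j$ from study $j$, its approximate sampling variance given $N_j$ rated students, and the back-transformation to the correlation metric are (a condensed restatement of the procedures described above, using standard notation not spelled out in the article):

$$Z_{r_j} = \tfrac{1}{2}\,\ln\!\left(\frac{1 + r_j}{1 - r_j}\right), \qquad \operatorname{Var}(Z_{r_j}) \approx \frac{1}{N_j - 3}, \qquad r_j = \frac{e^{2 Z_{r_j}} - 1}{e^{2 Z_{r_j}} + 1}.$$

In the variance-known multilevel model, each transformed effect size is decomposed into a grand mean, a between-studies deviation, and known sampling error:

$$\text{Level 1: } Z_{r_j} = \delta_j + e_j, \; e_j \sim N\bigl(0, \operatorname{Var}(Z_{r_j})\bigr); \qquad \text{Level 2: } \delta_j = \gamma_0 + u_j, \; u_j \sim N(0, \tau^2).$$

As a check against Table 1, Study 1 (Graue, 1989), with r = .28 and N = 63, yields Zr = 0.29 and Var(Zr) = 1/60 ≈ .017, matching the tabled values.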
As the large and highly significant chi-square test indicates, the effect sizes were heterogeneous, indicating a need to include explanatory variables in the model to explain the variance in the effect sizes. As presented in Table 1, the Fisher’s z-transformed correlations ranged between ⫺0.03 and 1.18. Next, we computed several conditional multilevel models in which the explanatory predictor variables were entered separately. For each model, only those studies reporting data on the predictor variable of interest were included in the analysis; all others were excluded. For studies reporting correlations for different categories of a predictor variable (e.g., informed vs uninformed teacher judgments), the mean correlation for each category was calculated; a weighted mean effect size was then calculated for each category, and all categories were included in the analysis. Due to this procedure, the sample size varied across the models. Additionally, some studies were excluded by the multilevel software whenever the variance of the effect sizes was zero. 3 Study Numbers 51 and 75 were excluded from the analysis because the variance of the effect size in these studies was zero. ACCURACY OF TEACHERS’ JUDGMENTS 751 Table 1 Summary of Studies Included in the Meta-Analysis Study no. First author Year Country Judgment type (i/u) Subject Congruence r 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 Graue Webster Gullo Meyer Schrader DuPaul Gottesman Kenealy Miller Wilson Freeman Jenkins Jorgenson Kenny Sink Wilson Eaves Eaves Salvesen Eshel Wright Kwok Maguin Tindal Gresham Hartman Hodges Saint-Laurent Demaray Flynn DiPerna van Kraayenoord Espin Bates Elliott Fletcher Helwig Kuklinski Limbos Madon Meisels Teisl Hecht Sofie Burns Feinberg Hauser-Cram Pomplun Triga Beswick Dale Herbert Hughes Madelaine Montague Bailey Dompnier Eckert Benner Trautwein Begeny Graney Lembke Li Maunganidze 1989 1989 1990 1990 1990 1991 1991 1991 1992 1992 1993 1993 1993 1993 1993 1993 1994a 1994b 1994 1995 1995 1996 1996 1996 1997 1997 1997 1997 1998 1998 1999 1999 2000 2001 2001 2001 2001 2001 2001 2001 2001 2001 2002 2002 2003 2003 2003 2004 2004 2005 2005 2005 2005 2005 2005 2006 2006 2006 2007 2007 2008 2008 2008 2008 2008 United States United States United States Canada Germany United States United States Great Britain United States United States Not reported United States United States Australia United States United States United States United States Norway Israel United States Canada, Hong Kong United States United States United States United States United States Canada United States United States United States Germany United States Australia United States Australia United States United States Canada United States United States United States United States United States United States United States United States United States Greece United States Great Britain United States United States Australia United States United States France United States United States Switzerland United States United States United States China Zimbabwe u u u u i u u u i u i u u u u u u u u u u u u u u u u u u u u u u i u u u u u u u u u u u i u u i u u u u u u u u i u u i u u u u mixed l/m l/m/mixed l m l/m mixed mixed l/m l l l l/m/mixed l/mixed l/m l/m l/m/mixed l/m l l/m l/m m l/m l l/m l l l/m l/m/mixed l l/m/mixed l l l l l l/m l l/mixed m l/m l/m l l l l l/m l l l l l/m l/m l l/m l l/m l/m l m l l m l/m 
l ic ic c/ic c/ic c ic c c c c/ic c c c c/ic ic ic c c/ic c c/ic ic c/ic c c c/ic c/ic c c c/ic c c/ic c c c c c c/ic c c c c/ic c/ic c c c c c c c c c ic ic c ic c ic c/ic c c c c ic c c .28 .45 .42 .56 .51 .48 .45 .58 .60 .54 .72 .62 .65 .39 .59 .52 .66 .42 .70 .47 .48 .49 .73 .60 .32 .79 .69 .49 .73 .46 .69 .52 .51 .70 .59 .42 .56 .58 .58 .66 .60 .41 .57 .58 .40 .66 .49 .56 .84 .67 .54 .52 .44 .73 .48 .25 .71 .46 .47 .76 .72 ⫺.03 .51 .54 .36 N Zr Var(Zr) 63 0.290 .017 134 0.490 .008 65 0.480 .016 171 0.640 .006 690 0.560 .002 50 0.520 .021 93 0.490 .011 426 0.660 .002 60 0.700 .018 1265 0.610 .001 214 0.910 .005 210 0.730 .005 63 0.780 .017 99 0.410 .010 59 0.680 .018 60 0.580 .018 89 0.790 .012 45 0.450 .024 603 0.870 .002 201 0.510 .005 74 0.520 .014 126 0.530 .008 368 0.930 .003 130 0.700 .008 150 0.330 .007 34 1.070 .032 121 0.840 .009 606 0.530 .002 47 0.930 .023 1634 0.500 .001 32 0.840 .035 75 0.570 .014 80 0.560 .013 108 0.870 .010 75 0.680 .014 47 0.450 .023 206 0.640 .005 62 0.660 .017 178 0.660 .006 1692 0.790 .001 70 0.700 .015 234 0.440 .004 170 0.650 .006 40 0.660 .027 147 0.420 .007 30 0.800 .037 105 0.530 .010 208 0.640 .005 125 1.220 .008 205 0.810 .005 5542 0.610 .000 359 0.580 .003 607 0.470 .002 396 0.930 .003 55 0.520 .019 16 0.260 .077 663 0.880 .002 33 0.500 .033 314 0.510 .003 741 1.000 .001 87 0.900 .012 93 ⫺0.030 .011 45 0.560 .024 499 0.600 .002 60 0.380 .018 (table continues) SÜDKAMP, KAISER, AND MÖLLER 752 Table 1 (continued) Study no. 66 67 68 69 70 71 72 73 74 75 First author Methe Bang Feinberg Gallant Hinnant Karing Lorenz Martı́nez McElvany Anders Year 2008 2009 2009 2009 2009 2009 2009 2009 2009 2010 Country Judgment type (i/u) United States United States United States United States United States Germany Germany United States Germany Germany u u i/u u u u u u i u Subject m l l l/m l/m l/m l/m m l m Congruence r N Zr Var(Zr) c/ic ic c c/ic c/ic c c c c c .80 .31 .49 .38 .58 .54 .59 .61 .34 .35 76 273 148 1281 964 1449 1786 9650 812 1085 1.090 0.320 0.540 0.400 0.670 0.600 0.680 0.710 0.350 0.370 .014 .004 .007 .001 .001 .001 .001 .000 .001 .001 Note. i ⫽ informed judgments; u ⫽ uninformed judgments; mixed ⫽ test(s) covering different subjects; l ⫽ test(s) on academic ability in language arts; m ⫽ test(s) on academic ability in mathematics; c ⫽ congruent, ic ⫽ incongruent. For the different categories of moderator variable, the mean correlation (r), the mean sample size (N), the mean effect size (Zr), and the variance of the effect size (VarZr)) are only reported where moderator effects were statistically significant and for “subject” as a moderator.4 If a study supplied correlations for only one category of a moderator variable (e.g., only informed judgments), the summary statistics are displayed in Table 1. If a study supplied correlations for more than one category of a moderator variable (e.g., informed and uninformed judgments), summary statistics are displayed in Tables S1–S3 in the online supplemental material. Moderator Analyses Publication year. Model 1 (see Table 2) considered the effect of publication year (73 effect sizes included), which did not emerge to be a statistically significant moderator. Thus, our findings did not indicate that effect sizes varied systematically according to the study’s year of publication. Country. Model 2 tested the effect of the country in which the study was conducted (72 effect sizes). The studies were split into five groups, with studies conducted in the United States being chosen as the reference category. 
Most of the studies selected were conducted in the United States (69.9%), followed by European countries (16.4%), Canada (4.1%), Australia (5.5%), and other countries (4.1%). None of the effects was statistically significant.

Aim. The effect of the main aim of the study was tested in Model 3. Overall, 48.1% of the studies were conducted to compare teachers' judgments with students' outcomes on an achievement test, and 16.9% of the studies aimed to validate an achievement test by reference to teachers' judgments. A further 35% of studies were conducted for other purposes but also reported the correlation between the two measures. The aim of the study did not emerge to be a significant moderator (73 effect sizes included).

Judgment characteristics. As information on the methods used to obtain teachers' judgments was presented in most studies, the following judgment characteristics could be included in the analysis:

Informed versus uninformed teacher judgments. In most studies, teachers were not informed about the achievement test to which their judgment would be related (86.8%); only 13.2% of the 74 effect sizes included were related to informed judgments. Model 4 revealed a significant negative effect of informed versus uninformed judgments, indicating higher correlations between students' academic achievement and informed teacher judgments (mean effect size = .76) than uninformed teacher judgments (.61).

Points on rating scale. Model 5 examined whether the number of points on the rating scales used had an effect (64 effect sizes). Studies proved to vary enormously in this aspect, with rating scales ranging between 2 and 100 points. As shown in Table 2, however, the effect of the number of points on the rating scale was not statistically significant.

Judgment specificity. Model 6 examined the effect of the specificity of the judgment task (70 effect sizes). As the largest group, ratings (86.8%) were chosen as the reference category. Ratings were followed by estimations of the number of correct responses (9.2%) and rankings (3.9%). As shown in Table 2, none of the effects were statistically significant.

Norm-referenced versus peer-independent judgments. Model 7 examined whether the use of a peer-dependent or peer-independent rating scale had an effect on teacher judgment accuracy (63 effect sizes). Of the 66 effect sizes available, 61.8% relied on peer-independent teacher judgments and 38.2% on peer-dependent teacher judgments. The effect of this factor was not statistically significant.

Domain specificity. Model 8 assessed the influence of the domain specificity of the judgment task on teacher judgment accuracy. Altogether, 89 effect sizes were included in this analysis, of which 27.4% were based on judgments of overall academic ability, 23.2% on judgments of an academic ability in one subject, and 49.5% on judgments of a specific academic ability within a subject. As shown in Table 3, the effect was not statistically significant. Therefore, there was no evidence for the hypothesis that domain-specific teacher judgments result in higher judgment accuracy than do global judgments.

Test characteristics. We next examined the effects of various characteristics of the tests administered.

Subject matter. Model 9 examined the effect of subject matter. Studies reporting information relevant to this analysis are reported in Tables 1 and S2. Again, Study Numbers 51 and 75 were excluded from the analysis, resulting in a total of 89 effect sizes.
Most studies addressed the domain of language arts (63.1% of effect sizes), while 36.9% addressed the domain of mathematics. As Table 3 shows, the effect was not statistically significant.

Table 2
Multilevel Meta-Analysis of Effect Sizes: Fixed Effects and Random Effects in Models 0–6

Model 0 (no predictor): intercept = .63* (.03); chi-square (df) = 773.44* (72).
Model 1 (study characteristics: publication yr, Z score): intercept = .63* (.03); publication yr = −.01 (.02), p = .683; random-effects variance = .04; chi-square (df) = 762.97* (71).
Model 2 (study characteristics: country; RC: United States): intercept = .61* (.03); Australia = .08 (.11), p = .470; Canada = .00 (.12), p = .998; Europe = .09 (.07), p = .168; other = −.10 (.12), p = .435; random-effects variance = .04; chi-square (df) = 748.80* (66).
Model 3 (study characteristics: overall aim of the study; RC: validation of teachers' judgments): intercept = .61* (.04); validation of a test of academic achievement = .02 (.06), p = .697; correlation between teachers' judgments and students' test performance = .08 (.07), p = .302; random-effects variance = .07; chi-square (df) = 748.04* (70).
Model 4 (judgment characteristics: informed vs. uninformed judgment): intercept = .61* (.03); informed vs. uninformed judgment = .15 (.07), p = .045; random-effects variance = .03; chi-square (df) = 844.28* (72).
Model 5 (judgment characteristics: points on rating scale): intercept = .62* (.03); points on rating scale = .00 (.00), p = .805; random-effects variance = .04; chi-square (df) = 675.33* (63).
Model 6 (judgment characteristics: judgment specificity; RC: rating): intercept = .63* (.03); ranking = .04 (.12), p = .747; grade equivalents = −.13 (.22), p = .535; no. of correct responses = .11 (.09), p = .212; random-effects variance = .04; chi-square (df) = 803.67* (71).

Note. Unless otherwise noted, values in parentheses are standard errors. RC = reference category. * p < .001.

Table 3
Multilevel Meta-Analysis of Effect Sizes: Fixed Effects and Random Effects in Models 7–13

Model 7 (judgment characteristics: norm-referenced vs. peer-independent judgments): intercept = .66* (.03); peer dependency = −.05 (.05), p = .340; random-effects variance = .04; chi-square (df) = 729.97* (64).
Model 8 (judgment characteristics: domain specificity; RC: overall academic achievement): intercept = .61* (.04); ability in one subject = −.05 (.06), p = .425; specific ability within subject = .06 (.05), p = .307; random-effects variance = .04; chi-square (df) = 957.97* (90).
Model 9 (test characteristics: subject; RC: language arts): intercept = .63* (.03); mathematics = −.03 (.04), p = .429; random-effects variance = .03; chi-square (df) = 1041.89* (99).
Model 10 (test characteristics: CBM vs. standardized achievement tests): intercept = .64* (.03); CBM vs. standardized achievement tests = −.04 (.09), p = .627; random-effects variance = .04; chi-square (df) = 809.00* (70).
Model 11 (test characteristics: domain specificity of the test; RC: overall academic achievement): intercept = .72* (.06); ability in one subject = −.12 (.06), p = .114; specific ability within subject = −.08 (.08), p = .247; random-effects variance = .04; chi-square (df) = 859.66* (93).
Model 12 (correspondence between judgment and test characteristics: time gap; RC: same time): intercept = .66* (.03); test before rating = −.07 (.08), p = .343; test after rating = −.05 (.10), p = .589; random-effects variance = .04; chi-square (df) = 627.20* (57).
Model 13 (correspondence between judgment and test characteristics: congruence): intercept = .67* (.03); congruence = −.13 (.05), p = .009; random-effects variance = .04; chi-square (df) = 921.75* (89).

Note. Unless otherwise noted, values in parentheses are standard errors. CBM = curriculum-based measures; RC = reference category. * p < .001.

CBM procedures versus standardized achievement tests. Next, we were interested in whether the size of the effects was influenced by the testing procedure used (CBM vs. standardized achievement tests; Model 10). Overall, 87.7% of the 73 relevant effect sizes relied on standardized achievement tests and 12.3% on a CBM procedure.
Four effect sizes were excluded from the analysis as their variance was zero, resulting in a total of 69 effect sizes. The effect of testing procedure was not statistically significant.

Domain specificity. Finally, the effect of the domain specificity of the achievement test was analyzed (Model 11). Of the 103 effect sizes included in this analysis, 12.2% were based on tests covering different subjects, 25.5% on tests covering a single subject, and 62.3% on tests covering a specific ability within a subject. The effect of the domain specificity of the achievement test was not statistically significant.

Correspondence between judgment and test characteristics. Models 12 and 13 examined two aspects of the correspondence between judgment characteristics and test characteristics: time gap and congruence in domain specificity.

Time gap. Model 12 tested the effect of the time interval between the administration of the achievement test and the teacher rating. In most studies, the two measures were administered concurrently (73.3%), so "same time" was chosen as the reference category. The achievement test was administered before the rating task for 18.3% of the 61 effect sizes included in the analysis and after the rating task for 8.3%. The effects were not statistically significant.

Congruence in domain specificity. According to our coding procedure, the domain specificity of the achievement test and the teacher rating was congruent for 67.6% of the effect sizes reported in this analysis and incongruent for 32.4%. A total of 93 effect sizes were included in this analysis (Tables 1 and S3). The effect of congruence was tested in Model 13, revealing a significant negative effect. As expected, larger effect sizes were observed for studies in which the domain specificity of the achievement test and the rating task was congruent (.67) than for studies in which it was not (.54).

Overall, the highly significant chi-square tests for all models indicate a substantial heterogeneity of variances. Beyond sampling variation, there is variation in the effect sizes across studies that could not be explained by the explanatory predictor variables used in our models.

Discussion

In this article, we statistically summarized empirical research findings on teacher judgment accuracy in a meta-analysis. In addition, we examined the role played by theoretically and methodologically relevant moderators in explaining the variation in findings across studies. In this section, we discuss the main findings and introduce a heuristic model of teacher judgment that brings together findings on teacher judgment accuracy.

The results of our meta-analysis indicate that teachers' judgment accuracy—defined as the correlation between teachers' judgments of students' academic achievement and students' actual test performance—is positive and fairly high (.63). Nevertheless, this result shows that teacher judgments are far from perfect and that there is plenty of room for improvement. This result is in line with the findings of Hoge and Coladarci (1989), who reported a median correlation of .66. However, the median correlation in the present meta-analysis was .53, showing that the results produced using Hoge and Coladarci's rather descriptive methods varied substantially from those generated by current meta-analytical methods. Thus, our meta-analysis helps to summarize and clarify empirical results on teacher judgment accuracy using adequate empirical methods.
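The difference between the two kinds of summary is easy to reproduce in a toy calculation: a plain median ignores study size, whereas the weighted meta-analytic mean gives large samples more influence. The sketch below uses hypothetical correlations and sample sizes (not data from this meta-analysis) to show how the two summaries can diverge.

```python
import math
from statistics import median

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def weighted_mean_r(rs, ns):
    """Inverse-variance weighted mean of Fisher-z effect sizes, back-transformed to r."""
    ws = [n - 3 for n in ns]  # weight = 1 / Var(Zr) = n - 3
    z_bar = sum(w * fisher_z(r) for w, r in zip(ws, rs)) / sum(ws)
    return math.tanh(z_bar)

# Hypothetical values: two small studies with lower correlations,
# one large study with a higher correlation.
rs = [0.40, 0.45, 0.70]
ns = [40, 50, 2000]

print(median(rs))                         # descriptive summary: 0.45
print(round(weighted_mean_r(rs, ns), 2))  # weighted summary, pulled toward the large study
```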
Our meta-analysis revealed substantial variation in effect sizes across studies. Two important moderators of teacher judgment accuracy were identified: one judgment characteristic and one characteristic based on the interaction of judgment and test characteristics. In the following, we discuss in detail the effects of (a) informed versus uninformed judgments and (b) the congruence in the domain specificity of teacher judgments and student achievement tests.

First, we found significantly higher correlations between teachers' judgments and students' test performance for informed than for uninformed teacher judgments. We chose to differentiate between these two categories, rather than between direct and indirect judgments following Hoge and Coladarci (1989), because the latter distinction was confounded by judgment specificity. Although the difference between direct/indirect versus informed/uninformed judgments seems small, we think it is important to make this distinction. Indeed, the results of this meta-analysis indicate that informed judgments result in higher judgment accuracy than do uninformed judgments. Considerably more studies used uninformed judgments than informed judgments. Surprisingly, the use of informed versus uninformed teacher judgments is barely discussed in studies on teacher judgment accuracy, although it evidently can have a substantial influence on the size of the correlation between teachers' judgments and students' achievement. As was to be expected, it seems easier for teachers to judge students' performance when they are informed about the standard of comparison than when they are not.

No effects of the other judgment characteristics (i.e., number of points on rating scales, judgment specificity, norm-referenced vs. peer-independent judgments) were found. Nevertheless, we would recommend carefully considering these aspects when conducting studies on teacher judgment accuracy.

In terms of test characteristics, we found no evidence for a difference in teacher judgment accuracy between language arts and mathematics. The effects of the other test characteristics were not significant either. Therefore, results are generalizable across several types of judgments and tests. Regarding the testing procedure, we distinguished between CBM procedures and standardized achievement tests, but we found no significant effect on teacher judgment accuracy. Given the variety of tests used in the different studies, the categorization of CBM procedures versus standardized tests was rather broad. The achievement test category was particularly broad, including both outdated and up-to-date tests. Very little information was available on some of the tests (e.g., psychometric properties). We therefore advocate a more thorough description of the tests used in studies on teacher judgment accuracy.

As expected, the congruence between the teachers' rating task and the achievement test administered to students was related to teacher judgment accuracy, with higher congruence being associated with higher accuracy levels. Because the match between teachers' judgments and students' test performance was higher when both measures addressed the same domain and same ability within a domain, it is reasonable to assume that a "mismatch" leads to lower teacher judgment accuracy. Surprisingly, this factor is rarely discussed in studies on teacher judgment accuracy.

Unfortunately, very little information was reported on the teacher samples.
We had planned to study the effects of teachers' years of teaching experience, years of exposure to the students rated, age, and gender, but were unable to conduct these analyses for lack of data. It would also be interesting to study whether other teacher characteristics affect teacher judgment accuracy. For example, teacher judgment accuracy might be associated with teachers' cognitive abilities or memory capacity or—in terms of teaching skills—with their instructional quality or expert knowledge. Hauser-Cram, Sirin, and Stipek (2003) used a classroom observation procedure to assess teachers' teaching styles. Teachers were identified as student-centered if they adapted well to students' individual needs (e.g., encouraged children to communicate and elaborate on their thoughts). In contrast, teachers were identified as curriculum-centered if they applied a uniform approach dictated by the curriculum (e.g., gave children few opportunities to take responsibility or to choose activities). Additionally, teachers' perceived differences with parents regarding education-related values were measured. As expected, perceived teacher–parent differences had greater effects on teacher ratings of students' academic achievement in more curriculum-centered classrooms. In another study, Kuklinski and Weinstein (2001) assessed teachers' differential treatment of low- and high-achieving students in the classroom. Their results showed that teachers' differential treatment as perceived by their students was a significant moderator of teacher expectations. Clearly, there is a need for studies assessing how other teacher characteristics relate to teacher judgment accuracy.

The student characteristics we would have liked to consider in our meta-analysis included gender, age, and grade level. Unfortunately, information on these characteristics was scarce and, for the most part, not comparable across studies. For example, only the percentage distribution for gender and the mean age of the student sample were reported. More consistency in the information reported and more specific information would facilitate comparison across studies. Most of the studies included in the meta-analysis involved samples of kindergarten and elementary school children. More studies focusing on older children and higher grade levels would therefore be desirable.

The lack of data on teacher and student characteristics makes it almost impossible to study the effects of the correspondence between teacher and student characteristics. None of the studies included in this meta-analysis reported comparable information on teacher and student characteristics (e.g., on teachers' and students' gender). Nevertheless, it seems reasonable to consider the correspondence between the two variables when studying teacher judgment accuracy. For example, it might be hypothesized that female teachers provide more accurate predictions of girls' than of boys' performance.

In the study characteristics category, we further distinguished between studies investigating teacher judgment accuracy and those focusing on other research questions but providing measures of teachers' judgments and students' academic achievement. Our analysis did not reveal any differences in teacher judgment accuracy between studies of these two types. Nevertheless, it is important to bear the distinction between these study types in mind, especially when interpreting results with regard to teacher judgment accuracy.
A Model of Teacher Judgment Accuracy

In order to systematize the moderators of teacher judgment accuracy in a more structured form, we provide a model of teacher judgment accuracy based on our theoretical considerations and empirical findings. Teacher judgment accuracy is at the core of this model, which is shown in Figure 1. It represents the correspondence between teachers' judgments of students' academic achievement and students' actual achievement as measured by a standardized test. In most studies, the correlation between the two is used as a measure of this correspondence. However, other indicators, such as the average difference between teacher judgments and students' actual performance, can also be used.

Figure 1. A model of teacher-based judgments of students' academic achievement.

A student's test performance is the result he or she achieves on an academic achievement test. On the one hand, this result may depend on student characteristics such as prior knowledge, motivation, and intelligence. On the other hand, it may depend on test characteristics such as subject area, the specific task set, or task difficulty. In this meta-analysis, the moderating effects of three test characteristics were studied, but none of them showed a significant effect on teacher judgment accuracy. As mentioned previously, the influence of student characteristics could not be studied because the relevant data were not consistently reported.

A teacher's judgment may depend on teacher characteristics such as professional expertise or stereotypes about students or on judgment characteristics (e.g., whether the teacher is asked to judge a specific student competency, such as oral reading fluency, or to provide a global judgment of academic ability). As described earlier, our analyses revealed that whether teacher judgments were informed or uninformed significantly influenced their judgment accuracy.

According to our model, teacher judgment accuracy is also influenced by the correspondence between judgment characteristics and test characteristics (dashed line, Figure 1). For example, an achievement test may measure a very specific academic ability (e.g., arithmetic skills), whereas the focus of the teachers' judgment task is broader (e.g., rating students' overall ability in mathematics), making it more difficult for teachers to provide accurate judgments. Indeed, in this meta-analysis, we found evidence that a high level of congruence between the specificity of teachers' judgments and the specificity of the achievement measure used leads to high accuracy of teacher judgments. Another relationship that may influence teacher judgment accuracy is the correspondence between teacher characteristics and student characteristics (e.g., gender, ethnicity). As depicted in our model, the correspondence between judgment characteristics and test characteristics is assumed to influence teacher judgment accuracy. However, as data on some elements of the model are scarce, the model is in parts highly speculative.

Strengths, Limitations, and Directions for Future Research

This meta-analysis used a sophisticated multilevel approach to provide a comprehensive overview of research on teacher judgment accuracy published in the past 20 years. In addition, it informed a heuristic model of teacher judgment accuracy that can be used to describe and analyze moderators of teacher judgment accuracy.
Although some moderators proved to significantly influence the correlation between teachers' judgments and students' test performance, the chi-square test was significant across all models, indicating that there remains variation in the effect sizes across studies that could not be explained by the moderators under investigation. One reason for this may be that little information was available on some potential moderators, especially on the characteristics of the teacher samples. Future studies should therefore report more information on the teacher sample investigated, making it possible to analyze the relationship between teacher characteristics and teacher judgment accuracy in much more detail.

As mentioned previously, most studies under analysis used the correlation between teachers' judgments and students' test performance as a measure of teacher judgment accuracy. We therefore chose correlation coefficients as the unit of analysis. However, the correlation coefficient basically indicates whether teachers are able to put their students in a rank order. Accordingly, high correlations can also be attained if teachers systematically over- or underestimate their students' performance (Eckert et al., 2006; Feinberg & Shapiro, 2003; Graney, 2008). Indeed, the findings of studies in which indicators other than correlations were used as measures of teacher judgment accuracy suggest that teacher judgments are rather inaccurate. In a study by Eckert et al. (2006), CBM material was used as an indicator of students' mathematics and reading skills. Teachers were asked to estimate students' reading and mathematics level (mastery, instructional, or frustrational). This judgment was compared with students' actual reading and mathematics level as measured by the CBM material via percentage agreement. The results indicated that teachers overestimated students' performance across most mathematics skills and on reading material that was at or below grade level. Bates and Nettelbeck (2001) subtracted students' reading accuracy and reading comprehension scores on a standardized achievement test from teachers' predictions of these scores. Teachers generally overestimated the performance of the 6- to 8-year-old students; inspection of the difference scores revealed that this held to a greater extent for low-achieving readers than for average- and high-achieving students. In line with this result, Begeny et al. (2008) found that teachers' judgments of students with average to low oral reading fluency scores were rather inaccurate, and Feinberg and Shapiro (2003) reported that teachers generally overestimated the performance of low-achieving readers.

Cronbach (1955) was able to disentangle some of the effects of human judgments by splitting a general accuracy score (squared errors in judgments over all items) into three distinct components. Building on his work, Helmke and Schrader (1987) proposed three components of teacher judgment accuracy: a rank component (rank correlation), a level component indicating over- or underestimation of students' performance, and a differentiation component indicating whether the variance of students' performances was accurately assessed by teachers.
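A minimal sketch of how these three components might be computed for a single class is given below, assuming that teacher judgments and test scores are expressed on the same scale. The operationalizations used here (a rank correlation for the rank component, the mean signed difference for the level component, and the ratio of standard deviations for the differentiation component) are one common reading of the three components described above, not necessarily the exact formulas used in the cited studies, and the example values are invented.

```python
from statistics import mean, pstdev

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def judgment_accuracy_components(judgments, scores):
    """Rank, level, and differentiation components for one class."""
    rank_component = pearson(ranks(judgments), ranks(scores))         # rank order
    level_component = mean(j - s for j, s in zip(judgments, scores))  # over-/underestimation
    differentiation = pstdev(judgments) / pstdev(scores)              # spread of judgments vs. spread of scores
    return rank_component, level_component, differentiation

# Invented judgments and test scores for five students (0-100 scale):
teacher_judgments = [55, 65, 70, 80, 90]
test_scores = [40, 52, 61, 70, 75]
print(judgment_accuracy_components(teacher_judgments, test_scores))
```

In this invented example the rank component is perfect, yet the positive level component shows a systematic overestimation, which is exactly the pattern a correlation alone would not reveal.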
Although the correlation coefficient remains popular as an indicator of teacher judgment accuracy (e.g., Anders et al., 2010; Ready & Wright, 2011), some studies—especially in the literature on RTI models (Begeny et al., 2011; Eckert et al., 2006)—have used percentage agreement between teachers' judgments and students' level of academic ability (e.g., at risk, some risk, low risk) as an indicator of teacher judgment accuracy. The problem here is that the categories need to be clearly defined and familiar to the teacher. Additionally, some information is lost by categorizing students' academic achievement into different subgroups. Moreover, it is not as easy to compare percentage agreement across studies as it is to compare correlation coefficients, as the measure depends on the number of categories used. Nevertheless, percentage agreement offers valuable information about teachers' ability to detect children with a need for additional support, which is a goal of RTI models.

Which measures of teacher judgments can or should be used also heavily depends on the original data available. For example, Karing, Matthäi, and Artelt (2011) asked teachers to predict students' individual responses to each item of a test. This approach allowed the authors to calculate a "hit rate" delivering very detailed information on the teachers' judgment accuracy. In contrast, Bates and Nettelbeck (2001) calculated the difference between teachers' judgments and students' academic achievement in order to identify over- and underestimations of students' academic achievement. In our opinion, different measures should be applied in the analysis of teacher judgment accuracy, depending on the focus of the study. Although the potential of correlations as a measure of teacher judgment accuracy is limited in the ways previously described, they nevertheless offer valuable information and are easily interpretable.

In summary, this meta-analysis has important theoretical and methodological implications for research on teacher judgment accuracy. It highlights the various methodological aspects that need to be considered in studies examining the accuracy of teacher judgments. The differentiation among teacher characteristics, student characteristics, judgment characteristics, and test characteristics was fruitful in this analysis, as these factors proved to constitute judgment accuracy. In particular, our results showed that judgment and task characteristics influenced the correlation between teacher judgments and students' academic achievement. Additionally, we found empirical evidence that the level of congruence in the domain specificity of the teachers' rating task, on the one hand, and the achievement tests administered, on the other, influenced teacher judgment accuracy. Our meta-analysis also showed where further research is necessary. From the theoretical perspective, we proposed a model of teacher-based judgments of students' academic achievement, bringing together teacher characteristics, judgment characteristics, student characteristics, and task characteristics as factors with theoretical relevance for teacher judgment accuracy.

References

References marked with an asterisk indicate studies included in the meta-analysis.

Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later student academic achievement. Journal of Educational Psychology, 91, 731–746.
doi:10.1037/0022-0663.91.4.731 American Federation of Teachers, the National Council on Measurement in Education, and the National Education Association. (1990). Standards for teacher competence in educational assessment of students. Retrieved from http://www.unl.edu/buros/bimm/html/article3.html *Anders, Y., Kunter, M., Brunner, M., Krauss, S., & Baumert, J. (2010). Diagnostische Fähigkeiten von Mathematiklehrkräften und ihre Auswirkungen auf die Leistungen ihrer Schülerinnen und Schüler [Mathematics teachers’ diagnostic skills and their impact on students’ achievements]. Psychologie in Erziehung und Unterricht, 57, 175–193. *Bailey, A. L., & Drummond, K. V. (2006). Who is at risk and why? Teachers’ reasons for concern and their understanding and assessment of early literacy. Educational Assessment, 11, 149 –178. doi:10.1207/ s15326977ea1103&4_2 Balliet, D., Mulder, L. D., & van Lange, P. A. M. (2011). Reward, punishment, and cooperation: A meta-analysis. Psychological Bulletin, 137, 594 – 615. doi:10.1037/a0023489 *Bang, H. J., Suarez-Orozco, C., Pakes, J., & O’Connor, E. (2009). The importance of homework in determining immigrant students’ grades in schools in the USA context. Educational Research, 51, 1–25. doi: 10.1080/00131880802704624 *Bates, C., & Nettelbeck, T. (2001). Primary school teachers’ judgements of reading achievement. Educational Psychology, 21, 177–187. doi: 10.1080/01443410020043878 *Begeny, J. C., Eckert, T. L., Montarello, S. A., & Storie, M. S. (2008). Teachers’ perceptions of students’ reading abilities: An examination of the relationship between teachers’ judgments and students’ performance across a continuum of rating methods. School Psychology Quarterly, 23, 43–55. doi:10.1037/1045-3830.23.1.43 Begeny, J. C., Krouse, H. E., Brown, K. G., & Mann, C. M. (2011). Teacher judgments of students’ reading abilities across a continuum of rating methods and achievement measures. School Psychology Review, 40, 23–38. doi:10.1037/1045-3830.23.1.43 *Benner, A. D., & Mistry, R. S. (2007). Congruence of mother and teacher educational expectations and low-income youth’s academic competence. Journal of Educational Psychology, 99, 140 –153. doi:10.1037/00220663.99.1.140 Bennett, R. E., Gottesman, R. L., Cerullo, F. M., & Rock, D. A. (1991). The validity of Einstein assessment subtest scores as predictors of early school achievement. Journal of Psychoeducational Assessment, 9, 67– 79. doi:10.1177/073428299100900107 Bennett, R. E., Gottesman, R. L., Rock, D. A., & Cerullo, F. (1993). Influence of behavior perceptions and gender on teachers’ judgments of students’ academic skill. Journal of Educational Psychology, 85, 347– 356. doi:10.1037/0022-0663.85.2.347 *Beswick, J. F., Willms, J. D., & Sloat, E. A. (2005). A comparative study of teacher ratings of emergent literacy skills and student performance on a standardized measure. Education, 126, 116 –137. Borenstein, M. (2009). Effect sizes for continuous data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (pp. 221–235). New York, NY: Russell Sage Foundation. Brophy, J., & Good, T. (1986). Teacher behavior and student achievement. In M. C. Wittrock (Ed.), Third handbook of research on teaching (pp. 328 –375). New York, NY: McMillan. *Burns, M. K., & Symington, T. (2003). A comparison of the spontaneous writing quotient of the Test of Written Language (3rd ed.) and teacher ratings of writing progress. Assessment for Effective Intervention, 28, 29 –34. 
doi:10.1177/073724770302800203 Cafri, G., Komrey, J. D., & Brannick, M. T. (2010). A meta-meta-analysis: Empirical review of statistical power, type I error rates, effect sizes, and model selection of meta-analyses published in psychology. Multivariate Behavioral Research, 45, 239 –270. doi:10.1080/00273171003680187 Chang, D. F., & Sue, S. (2003). The effects of race and problem type on teachers’ assessments of student behavior. Journal of Consulting and Clinical Psychology, 71, 235–242. doi:10.1037/0022-006X.71.2.235 Clark, C. M., & Peterson, P. L. (1986). Teachers’ thought processes. In M. C. Wittrock (Ed.), Third handbook of research on teaching (pp. 255–296). New York, NY: Macmillan. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37– 46. doi:10.1177/ 001316446002000104 Coladarci, T. (1986). Accuracy of teacher judgments of student responses to standardized test items. Journal of Educational Psychology, 78, 141–146. doi:10.1037/0022-0663.78.2.141 Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “assumed similarity.” Psychological Bulletin, 52, 177–193. doi:10.1037/h0044919 *Dale, P. S., Harlaar, N., & Plomin, R. (2005). Telephone testing and teacher assessment of reading skills in 7-year-olds: I. Substantial correspondence for a sample of 5,544 children and for extremes. Reading and Writing, 18, 385– 400. doi:10.1007/s11145-004-8130-z de Boer, H., Bosker, R. J., & van der Werf, M. P. C. (2010). Sustainability of teacher expectation bias effects on long-term student performance. Journal of Educational Psychology, 102, 168 –179. doi:10.1037/ a0017289 *Demaray, M. K., & Elliott, S. N. (1998). Teachers’ judgments of students’ academic functioning: A comparison of actual and predicted performances. School Psychology Quarterly, 13, 8 –24. doi:10.1037/h0088969 Deno, S. L. (2003). Curriculum-based measures: Development and perspectives. Assessment of Effective Intervention, 28, 3–12. *DiPerna, J. C., & Elliott, S. N. (1999). Development and validation of the Academic Competence Evaluation Scales. Journal of Psychoeducational Assessment, 17, 207–225. doi:10.1177/073428299901700302 ACCURACY OF TEACHERS’ JUDGMENTS *Dompnier, B., Pansu, P., & Bressoux, P. (2006). An integrative model of scholastic judgments: Pupils’ characteristics, class context, halo effect and internal attributions. European Journal of Psychology of Education, 21, 119 –133. doi:10.1007/BF03173572 *DuPaul, G. J., Rapport, M. D., & Perriello, L. M. (1991). Teacher ratings of academic skills: The development of the Academic Performance Rating Scale. School Psychology Review, 20, 284 –300. *Eaves, R. C., Campbell-Whatley, G., Dunn, C., Reilly, A. S., & TateBraxton, C. (1994). Comparison of the Slosson Full-Range Intelligence Test and teacher judgments as predictors of students’ academic achievement. Journal of Psychoeducational Assessment, 12, 381–392. doi: 10.1177/073428299401200408 *Eaves, R. C., Williams, P., Winchester, K., & Darch, C. (1994). Using teacher judgment and IQ to estimate reading and mathematics achievement in a remedial-reading program. Psychology in the Schools, 31, 261–272. doi:10.1002/1520-6807(199410)31:4⬍261::AIDPITS2310310403⬎3.0.CO;2-K *Eckert, T. L., Dunn, E. K., Codding, R. S., Begeny, J. C., & Kleinmann, A. E. (2006). Assessment of mathematics and reading performance: An examination of the correspondence between direct assessment of student performance and teacher report. 
Psychology in the Schools, 43, 247–265. doi:10.1002/pits.20147 *Elliott, J., Lee, S. W., & Tollefson, N. (2001). A reliability and validity study of the Dynamic Indicators of Basic Early Literacy Skills– Modified. School Psychology Review, 30, 33– 49. *Eshel, Y., & Benski, M. (1995). Group-administered school readiness test and kindergarten teacher ratings as predictors of academic success in the first grade. Megamot, 36, 451– 464. *Espin, C., Shin, J., Deno, S. L., Skare, S., Robinson, S., & Benner, B. (2000). Identifying indicators of written expression proficiency for middle school students. The Journal of Special Education, 34, 140 –153. doi:10.1177/002246690003400303 *Feinberg, A. B., & Shapiro, E. S. (2003). Accuracy of teacher judgments in predicting oral reading fluency. School Psychology Quarterly, 18, 52– 65. doi:10.1521/scpq.18.1.52.20876 *Feinberg, A. B., & Shapiro, E. S. (2009). Teacher accuracy: An examination of teacher-based judgments of students’ reading with differing achievement levels. Journal of Educational Research, 102, 453– 462. doi:10.3200/JOER.102.6.453-462 Ferguson, C. J., & Brannick, M. T. (2011). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods. Advance online publication. doi:10.1037/a0024445 Fischer, R., & Boer, D. (2011). What is more important for national well-being: Money or autonomy? A meta-analysis of well-being, burnout, and anxiety across 63 societies. Journal of Personality and Social Psychology, 101, 164 –184. doi:10.1037/a0023663 *Fletcher, J., Tannock, R., & Bishop, D. V. M. (2001). Utility of brief teacher rating scales to identify children with educational problems: Experience with an Australian sample. Australian Journal of Psychology, 53, 63–71. doi:10.1080/00049530108255125 *Flynn, J. M., & Rahbar, M. H. (1998). Improving teacher prediction of children at risk for reading failure. Psychology in the Schools, 35, 163–172. doi:10.1002/(SICI)1520-6807(199804)35:2⬍163::AIDPITS8⬎3.0.CO;2-Q *Freeman, J. G. (1993). Two factors contributing to elementary school teachers’ predictions of students’ scores on the Gates–MacGinitie Reading Test, Level D. Perceptual and Motor Skills, 76, 536 –538. doi: 10.2466/pms.1993.76.2.536 *Gallant, D. J. (2009). Predictive validity evidence for an assessment program based on the Work Sampling System in mathematics and language and literacy. Early Childhood Research Quarterly, 24, 133– 141. doi:10.1016/j.ecresq.2009.03.003 *Gottesman, R. L., Cerullo, F. M., Bennett, R. E., & Rock, D. A. (1991). 759 Predictive validity of a screening test for mild school learning difficulties. Journal of School Psychology, 29, 191–205. doi:10.1016/00224405(91)90001-8 *Graney, S. B. (2008). General education teacher judgments of their low-performing students’ short-term reading progress. Psychology in the Schools, 45, 537–549. doi:10.1002/pits.20322 *Graue, M. E., & Shepard, L. A. (1989). Predictive validity of the Gesell School Readiness Tests. Early Childhood Research Quarterly, 4, 303– 315. doi:10.1016/0885-2006(89)90016-1 *Gresham, F. M., MacMillan, D. L., & Bocian, K. M. (1997). Teachers as “tests”: Differential validity of teacher judgments in identifying students at-risk for learning difficulties. School Psychology Review, 26, 47– 60. *Gullo, D. F. (1990). Kindergarten schedules: Effects on teachers’ ability to assess academic achievement. Early Childhood Research Quarterly, 5, 43–51. 
doi:10.1016/0885-2006(90)90005-L Hamilton, C., & Shinn, M. R. (2003). Characteristics of word callers: An investigation of the accuracy of teachers’ judgments of reading comprehension and oral reading skills. School Psychology Review, 32, 228 – 240. Harlen, W. (2005). Trusting teachers’ judgment: Research evidence of the reliability and validity of teachers’ assessment used for summative purposes. Research Papers in Education, 20, 245–270. doi:10.1080/ 02671520500193744 *Hartman, J. M., & Fuller, M. L. (1997). The development of curriculumbased measurement norms in literature-based classrooms. Journal of School Psychology, 35, 377–389. doi:10.1016/S0022-4405(97)00013-7 *Hauser-Cram, P., Sirin, S. R., & Stipek, D. (2003). When teachers’ and parents’ values differ: Teachers’ ratings of academic competence in children from low-income families. Journal of Educational Psychology, 95, 813– 820. doi:10.1037/0022-0663.95.4.813 Hecht, S. A., & Greenfield, D. B. (2001). Comparing the predictive validity of first grade teacher ratings and reading-related tests on third grade levels of reading skills in young children exposed to poverty. School Psychology Review, 30, 50 – 69. *Hecht, S. A., & Greenfield, D. B. (2002). Explaining the predictive accuracy of teacher judgments of their students’ reading achievement: The role of gender, classroom behavior, and emergent literacy skills in a longitudinal sample of children exposed to poverty. Reading and Writing, 15, 789 – 809. doi:10.1023/A:1020985701556 Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press. Helmke, A., & Schrader, F.-W. (1987). Interactional effects of instructional quality and teacher judgment accuracy on achievement. Teaching and Teacher Education, 3, 91–98. doi:10.1016/0742-051X(87)90010-2 *Helwig, R., Anderson, L., & Tindal, G. (2001). Influence of elementary student gender on teachers’ perceptions of mathematics achievement. Journal of Educational Research, 95, 93–102. doi:10.1080/ 00220670109596577 *Herbert, J., & Stipek, D. (2005). The emergence of gender differences in children’s perceptions of their academic competence. Journal of Applied Developmental Psychology, 26, 276 –295. doi:10.1016/j.appdev .2005.02.007 *Hinnant, J. B., O’Brien, M., & Ghazarian, S. R. (2009). The longitudinal relations of teacher expectations to achievement in the early school year. Journal of Educational Psychology, 101, 662– 670. doi:10.1037/ a0014306 *Hodges, C. A. (1997). How valid and useful are alternative assessments for decision-making in primary grade classrooms? Reading Research and Instruction, 36, 157–173. doi:10.1080/19388079709558235 Hoge, R. D. (1983). Psychometric properties of teacher-judgment measures of pupil aptitudes, classroom behaviors, and achievement levels. The Journal of Special Education, 17, 401– 429. doi:10.1177/ 002246698301700404 Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of aca- 760 SÜDKAMP, KAISER, AND MÖLLER demic achievement: A review of literature. Review of Educational Research, 59, 297–313. doi:10.2307/1170184 Hopkins, K. D., George, C. A., & Williams, D. D. (1985). The concurrent validity of standardized achievement tests by content area using teachers’ ratings as criteria. Journal of Educational Measurement, 22, 177– 182. doi:10.1111/j.1745-3984.1985.tb01056.x Hox, J. (2002). Multilevel analysis. Mahwah, NJ: Erlbaum. *Hughes, J. N., Gleason, K. A., & Zhang, D. A. (2005). 
Relationship influences on teachers’ perceptions of academic competence in academically at-risk minority and majority first grade students. Journal of School Psychology, 43, 303–320. doi:10.1016/j.jsp.2005.07.001 Hurwitz, J. T., Elliott, S. N., & Braden, J. P. (2007). The influence of test familiarity and student disability status upon teachers’ judgments of students’ test performance. School Psychology Quarterly, 22, 115–144. doi:10.1037/1045-3830.22.2.115 Impara, J. C., & Plake, B. S. (1998). Teachers’ ability to estimate item difficulty: A test of assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35, 69 – 81. doi:10.1111/j.17453984.1998.tb00528.x *Jenkins, J. R., & Jewell, M. (1993). Examining the validity of two measures for formative teaching: Reading aloud and maze. Exceptional Children, 59, 421– 432. Jones, M. G., & Gerig, T. M. (1994). Silent sixth-grade students: Characteristics, achievement, and teacher expectations. The Elementary School Journal, 95, 169 –182. doi:10.1086/461797 *Jorgenson, C. B., Jorgenson, D. E., Gillis, M. K., & McCall, C. M. (1993). Validation of a screening instrument for young children with teacher assessment of school performance. School Psychology Quarterly, 8, 125–139. doi:10.1037/h0088834 Jussim, L., & Eccles, J. S. (1992). Teacher expectations II: Construction and reflection of student achievement. Journal of Personality and Social Psychology, 63, 947–961. doi:10.1037/0022-3514.63.6.947 Kalaian, S. A., & Kasim, R. M. (2008). Multilevel methods for metaanalysis. In A. A. O’Connell & D. B. McCoach (Eds.), Multilevel modeling of educational data (pp. 315–343). Charlotte, NC: Information Age Publishing. *Karing, C. (2009). Diagnostische Kompetenz von Grundschul- und Gymnasiallehrkräften im Leistungsbereich und im Bereich Interessen [Diagnostic competence of elementary and secondary school teachers in the domains of competence and interests]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23, 197–209. doi:10.1024/1010-0652.23.34.197 Karing, C., Matthäi, J., & Artelt, C. (2011). Genauigkeit von Lehrerurteilen über die Lesekompetenz ihrer Schülerinnen und Schüler in der Sekundarstufe I: Eine Frage der Spezifität? [Lower secondary school teacher judgment accuracy of students’ reading competence: A matter of specificity?]. Zeitschrift für Pädagogische Psychologie, 25, 159 –172. doi:10.1024/1010 – 0652/a000041. *Kenealy, P., Frude, N., & Shaw, W. (1991). Teacher expectations as predictors of academic success. Journal of Social Psychology, 131, 305–306. doi:10.1080/00224545.1991.9713856 *Kenny, D. T., & Chekaluk, E. (1993). Early reading performance: A comparison of teacher-based and test-based assessments. Journal of Learning Disabilities, 26, 227–236. doi:10.1177/002221949302600403 Kleingeld, A., van Mierlo, H., & Arends, L. (2011). The effect of goal setting on group performance: A meta-analysis. Journal of Applied Psychology. Advance online publication. doi:10.1037/a0024315 *Klinedinst, R. E. (1991). Predicting performance achievement and retention of fifth-grade instrumental students. Journal of Research in Music Education, 39, 225–238. doi:10.2307/3344722 *Kuklinski, M. R., & Weinstein, R. S. (2001). Classroom and developmental differences in a path model of teacher expectancy effects. Child Development, 72, 1554 –1578. doi:10.1111/1467-8624.00365 *Kwok, D. C., & Lytton, H. (1996). 
Perceptions of mathematics ability versus actual mathematics performance: Canadian and Hong Kong Chinese children. British Journal of Educational Psychology, 66, 209 –222. doi:10.1111/j.2044-8279.1996.tb01190.x Leinhardt, G. (1983). Novice and expert knowledge of individual student’s achievement. Educational Psychologist, 18, 165–179. doi:10.1080/ 00461528309529272 *Lembke, E. S., Foegen, A., Whittaker, T. A., & Hampton, D. (2008). Establishing technically adequate measures of progress in early numeracy. Assessment for Effective Intervention, 33, 206 –214. doi: 10.1177/1534508407313479 Lench, H. C., Flores, S. A., & Bench, S. W. (2011). Discrete emotions predict changes in cognition, judgment, experience, behavior, and physiology: A meta-analysis of experimental emotion elicitations. Psychological Bulletin, 137, 834 – 855. doi:10.1037/a0024244 *Li, H., Pfeiffer, S. I., Petscher, Y., Kumtepe, A. T., & Mo, G. (2008). Validation of the Gifted Rating Scales–School Form in China. Gifted Child Quarterly, 52, 160 –169. doi:10.1177/0016986208315802 *Limbos, M. M., & Geva, E. (2001). Accuracy of teacher assessments of second-language students at risk for reading disability. Journal of Learning Disabilities, 34, 136 –151. doi:10.1177/002221940103400204 Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage. *Lorenz, C., & Artelt, C. (2009). Fachspezifität und Stabilität diagnostischer Kompetenz von Grundschullehrkräften in den Fächern Deutsch und Mathematik [Domain specificity and stability of diagnostic competence among primary school teachers in the school subjects of German and mathematics]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23, 211–222. doi:10.1024/10100652.23.34.211 *Madelaine, A., & Wheldall, K. (2005). Identifying low-progress readers: Comparing teacher judgment with a curriculum-based measurement procedure. International Journal of Disability, Development, and Education, 52, 33– 42. doi:10.1080/10349120500071886 *Madon, S., Smith, A., Jussim, L., Russell, D. W., Eccles, J., Palumbo, P., & Walkiewicz, M. (2001). Am I as you see me or do you see me as I am? Self-fulfilling prophecies and self-verification. Personality and Social Psychology Bulletin, 27, 1214 –1224. doi:10.1177/0146167201279013 *Maguin, E., & Loeber, R. (1996). How well do ratings of academic performance by mothers and their sons correspond to grades, achievement test scores, and teachers’ ratings? Journal of Behavioral Education, 6, 405– 425. doi:10.1007/BF02110514 Marsh, H. W. (1989). The effects of attending single-sex and coeducational high schools on achievement, attitudes, and behaviors and on sex differences. Journal of Educational Psychology, 81, 70 – 85. doi:10.1037/ 0022-0663.81.1.70 Marsh, H. W. (1990a). A multidimensional, hierarchical model of selfconcept: Theoretical and empirical justification. Educational Psychology Review, 2, 77–172. doi:10.1007/BF01322177 Marsh, H. W. (1990b). Causal ordering of academic self-concept on academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82, 646 – 656. doi:10.1037/00220663.82.4.646 Marsh, H. W., Bornmann, L., Mutz, R., Daniel, H-D., & O’Mara, A. (2009). Gender effects in the peer reviews of grant proposals: A comprehensive meta-analysis comparing traditional and multilevel approaches. Review of Educational Research, 79, 1290 –1326. doi: 10.3102/0034654309334143 *Martı́nez, J. F., Stecher, B., & Borko, H. (2009). 
Classroom assessment practices, teacher judgments, and student achievement in mathematics: Evidence from the ECLS. Educational Assessment, 14, 78 –102. doi: 10.1080/10627190903039429 *Maunganidze, L., Ruhode, N., Shoniwa, L., Kasayira, J. M., Sodi, T., & Nyanhongo, S. (2008). Teacher ratings and standardized test scores: ACCURACY OF TEACHERS’ JUDGMENTS How good for predicting achievement in students with learning support placement? Journal of Psychology in Africa, 18, 255–258. *McElvany, N., Schroeder, S., Hachfeld, A., Baumert, J., Richter, T., Schnotz, W., & Ullrich, M. (2009). Diagnostische Fähigkeiten von Lehrkräften bei der Einschätzung von Schülerleistungen und Aufgabenschwierigkeiten bei Lernmedien mit instruktionalen Bildern [Teachers’ diagnostic skills to judge student performance and task difficulty when learning materials include instructional pictures]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23, 223–235. doi:10.1024/1010-0652.23.34.223 *Meisels, S. J., Bickel, D. D., Nicholson, J., Xue, Y., & Atkins-Burnett, S. (2001). Trusting teachers’ judgments: A validity study of a curriculumembedded performance assessment in kindergarten to grade 3. American Educational Research Journal, 38, 73–95. doi:10.3102/ 00028312038001073 *Methe, S. A., Hintze, J. M., & Floyd, R. G. (2008). Validation and decision accuracy of early numeracy skill indicators. School Psychology Review, 37, 359 –373. *Meyer, M., Wilgosh, L., & Mueller, H. (1990). Effectiveness of teacheradministered tests and rating scales in predicting subsequent academic performance. Alberta Journal of Educational Research, 36, 257–264. *Miller, S. A., & Davis, T. L. (1992). Beliefs about children: A comparative study of mothers, teachers, peers, and self. Child Development, 63, 1251–1265. doi:10.2307/1131531 Möller, J., Pohlmann, B., Köller, O., & Marsh, H. W. (2009). A metaanalytic path analysis of the internal/external frame of reference model of academic achievement and academic self-concept. Review of Educational Research, 79, 1129 –1167. doi:10.3102/0034654309337522 *Montague, M., Enders, C., & Castro, M. (2005). Academic and behavioral outcomes for students at risk for emotional and behavioral disorders. Behavioral Disorders, 31, 84 –94. National Board for Professional Teaching Standards. (2010). The five core propositions. Retrieved from http://www.nbpts.org/the_standards/ the_five_core_proposition O’Mara, A. J., Marsh, H. W., Craven, R. G., & Debus, R. (2006). Do self-concept interventions make a difference? A synergistic blend of construct validation and meta-analysis. Educational Psychologist, 41, 181–206. doi:10.1207/s15326985ep4103_4 Pohlmann, B., Möller, J., & Streblow, L. (2004). Zur Fremdeinschätzung von Schülerselbstkonzepten durch Lehrer und Mitschüler [On students’ self-concepts inferred by teachers and classmates]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 18, 157–169. doi:10.1024/1010-0652.18.34.157 *Pomplun, M. (2004). The differential predictive validity of the initial skills analysis: Reading screening tests for K-3. Educational and Psychological Measurement, 64, 813– 827. doi:10.1177/0013164404263879 Raudenbush, S. W., Byrk, A. S., Cheong, Y., & Congdon, R. T. (2004). HLM 6: Hierarchical linear modeling. Chicago, IL: Scientific Software International. Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in teachers’ perceptions of young children’s cognitive abilities. 
American Educational Research Journal, 48, 335–360. doi:10.3102/ 0002831210374874 Rosenberg, M. S., Adams, D. C., & Gurevitch, J. (2000). MetaWin: Statistical software for meta-analysis. Sunderland, MA: Sinauer. *Saint-Laurent, L., Hébert, M., Royer, É., & Piérard, B. (1997). Identification of students with academic difficulties: Implications for research and practice. Canadian Journal of School Psychology, 12, 143–154. doi:10.1177/082957359701200211 *Salvesen, K. A., & Undheim, J. O. (1994). Screening for learning disabilities with teacher rating scales. Journal of Learning Disabilities, 27, 60 – 66. doi:10.1177/002221949402700109 *Schrader, F.-W., & Helmke, A. (1990). Lassen sich Lehrer bei der Leistungsbeurteilung von sachfremden Gesichtspunkten leiten? Eine 761 Untersuchung zu Determinanten diagnostischer Lehrerurteile [Are teachers influenced by extrinsic factors when evaluating scholastic performance? A study on the determinants of teachers’ judgments]. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 22, 312–324. Schrader, F.-W., & Helmke, A. (2001). Alltägliche Leistungsbeurteilung durch Lehrer [Day-to-day performance evaluation by teachers]. In F. E. Weinert (Ed.), Leistungsmessung in Schulen [Performance measurement in schools] (pp. 45–58). Weinheim, Germany: Beltz. Shavelson, R. J., & Stern, P. (1981). Research on teachers’ pedagogical thoughts, judgments, decisions, and behavior. Review of Educational Research, 51, 455– 498. doi:10.2307/1170362 Shepard, L., Hammerness, K., Darling-Hammond, L., & Rust, F. (2005). Assessment. In L. Darling-Hammond & J. Bransford (Eds.), Preparing teachers for a changing world (pp. 275–326). San Francisco, CA: Wiley. *Sink, C. A., Barnett, J. E., & Pool, B. A. (1993). Perceptions of scholastic competence in relation to middle-school achievement. Perceptual and Motor Skills, 76, 471– 478. doi:10.2466/pms.1993.76.2.471 Smith, A. E., Jussim, L., & Eccles, J. (1999). Do self-fulfilling prophecies accumulate, dissipate, or remain stable over time? Journal of Personality and Social Psychology, 77, 548 –565. doi:10.1037/0022-3514.77.3.548 *Sofie, C. A., & Riccio, C. A. (2002). A comparison of multiple methods for the identification of children with reading disabilities. Journal of Learning Disabilities, 35, 234 –244. doi:10.1177/002221940203500305 Spinath, B. (2005). Akkuratheit der Einschätzung von Schülermerkmalen durch Lehrer und das Konstrukt der diagnostischen Kompetenz [Accuracy of teacher judgments of student characteristics and the construct of diagnostic competence]. Zeitschrift für Pädagogische Psychologie/ German Journal of Educational Psychology, 19, 85–95. doi:10.1024/ 1010-0652.19.12.85 Südkamp, A., & Möller, J. (2009). Referenzgruppeneffekte im Simulierten Klassenraum: Direkte und indirekte Einschätzungen von Schülerleistungen. [Reference-group effects in a simulated classroom: Direct and indirect judgments]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 23, 161–174. doi:10.1024/10100652.23.34.161 Sutton, A. J. (2009). Publication bias. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (pp. 435– 452). New York, NY: Russell Sage Foundation. *Teisl, J. T., Mazzocco, M. M. M., & Myers, G. F. (2001). The utility of kindergarten teacher ratings for predicting low academic achievement in first grade. Journal of Learning Disabilities, 34, 286 –293. doi:10.1177/ 002221940103400308 Tillman, C. M. (2011). 
Developmental change in the relation between simple and complex spans: A meta-analysis. Developmental Psychology, 47, 1012–1025. doi:10.1037/a0021794 *Tindal, G., & Marston, D. (1996). Technical adequacy of alternative reading measures as performance assessments. Exceptionality, 6, 201– 230. doi:10.1207/s15327035ex0604_1 *Trautwein, U., & Baeriswyl, F. (2007). Wenn leistungsstarke Klassenkameraden ein Nachteil sind: Referenzgruppeneffekte bei Übertrittsentscheidungen [When high-achieving classmates put students at a disadvantage: Reference group effects at the transition to secondary schooling]. Zeitschrift für Pädagogische Psychologie/German Journal of Educational Psychology, 21, 119 –133. doi:10.1024/1010-0652.21.2.119 Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2006). Self-esteem, academic self-concept, and achievement: How the learning environment moderates the dynamics of self-concept. Journal of Personality and Social Psychology, 90, 334 –349. doi:10.1037/0022-3514.90.2.334 *Triga, A. (2004). An analysis of teachers’ rating scales as sources of evidence for a standardised Greek reading test. Journal of Research in Reading, 27, 311–320. doi:10.1111/j.1467-9817.2004.00234.x *Trouilloud, D. O., Sarrazin, P. G., Martinek, T. J., & Guillet, E. (2002). The influence of teacher expectations on student achievement in phys- 762 SÜDKAMP, KAISER, AND MÖLLER ical education classes: Pygmalion revisited. European Journal of Social Psychology, 32, 591– 607. doi:10.1002/ejsp.109 VanDerHeyden, A. M., Witt, J. C., & Gilbertson, D. (2007). A multi-year evaluation of the effects of a Response to Intervention (RTI) model on identification of children for special education. Journal of School Psychology, 45, 225–256. doi:10.1016/j.jsp.2006.11.004 *van Kraayenoord, C. E., & Schneider, W. E. (1999). Reading achievement, metacognition, reading self-concept and interest: A study of German students in Grades 3 and 4. European Journal of Psychology of Education, 14, 305–324. doi:10.1007/BF03173117 *Webster, R. E., Hewett, B., & Crumbacker, H. M. (1989). Criterionrelated validity of the WRAT–R and K-TEA with teacher estimates of actual classroom academic performance. Psychology in the Schools, 26, 243–248. doi:10.1002/1520-6807(198907)26:3⬍243::AIDPITS2310260304⬎3.0.CO;2-M *Wilson, J., & Wright, C. R. (1993). The predictive validity of student self-evaluations, teachers’ assessments, and grades for performance on the Verbal Reasoning and Numerical Ability Scales of the Differential Aptitude Test for a sample of secondary school students attending rural Appalachia schools. Educational and Psychological Measurement, 53, 259 –270. doi:10.1177/0013164493053001029 *Wilson, M. S., Schendel, J. M., & Ulman, J. E. (1992). Curriculum-based measures, teachers’ ratings, and group achievement scores: Alternative screening measures. Journal of School Psychology, 30, 59 –76. doi: 10.1016/0022-4405(92)90020-6 Winne, P. H., & Nesbit, J. C. (2010). The psychology of academic achievement. Annual Review of Psychology, 61, 653– 678. doi:10.1146/ annurev.psych.093008.100348 *Wright, C. R., & Houck, J. W. (1995). Gender differences among selfassessments, teacher ratings, grades, and aptitude test scores for a sample of students attending rural secondary schools. Educational and Psychological Measurement, 55, 743–752. 
doi:10.1177/0013164495055005005

Appendix

Search Terms

The following search terms were entered in electronic databases: teacher* diagnostic* accuracy OR sensitivity OR competence, teacher* diagnostic* skill* OR teacher* assessment skill*, teacher* judgment* OR judgement*, teacher* "classroom* assessment," teacher* "academic achievement" prediction, grading accuracy, teacher* judg* student* academic* achievement* OR outcome* OR performance* OR abilit*, teacher* assess* student* academic* achievement* OR outcome* OR performance* OR abilit*, teacher* evaluat* student* academic* achievement* OR outcome* OR performance* OR abilit*, teacher* rating* student* academic* achievement* OR outcome* OR performance* OR abilit*, teacher* rate student* academic* achievement* OR outcome* OR performance* OR abilit*, "teacher* expectation*."

Note. The truncation symbol (*) allows for unknown characters, multiple spellings, or various endings. The OR operator combines search terms so that each search result contains at least one of the terms.

Received January 6, 2011
Revision received December 22, 2011
Accepted January 26, 2012