Validity Studies

Published in Educational and Psychological Measurement, 2010, Volume 70(2), Pages 340-352.
DOI: 10.1177/0013164409344508, http://doi.org/10.1177/0013164409344508
© 2010 SAGE Publications, http://epm.sagepub.com

The Validity of the Graduate Record Examination for Master's and Doctoral Programs: A Meta-analytic Investigation

Nathan R. Kuncel,1 Serena Wee,2 Lauren Serafin,2 and Sarah A. Hezlett3

1 University of Minnesota, Minneapolis, MN, USA
2 University of Illinois at Urbana-Champaign, IL, USA
3 Personnel Decisions Research Institutes, Minneapolis, MN, USA

Corresponding Author: Nathan R. Kuncel, Department of Psychology, University of Minnesota, 75 East River Rd, Minneapolis, MN 55455, USA. Email: [email protected]

Abstract

Extensive research has examined the effectiveness of admissions tests for use in higher education. What has gone unexamined is the extent to which tests are similarly effective for predicting performance at both the master's and doctoral levels. This study empirically synthesizes previous studies to investigate whether or not the Graduate Record Examination (GRE) predicts the performance of students in master's programs as well as the performance of doctoral students. Across nearly 100 studies and 10,000 students, this study found that GRE scores predict first year grade point average (GPA), graduate GPA, and faculty ratings well for both master's and doctoral students, with differences that ranged from small to zero.

Keywords: graduate school, admissions tests, validity, Graduate Record Examination, GRE, meta-analysis, standardized tests

The consistent level of validity of scores on standardized cognitive tests for predicting academic performance in graduate programs is remarkable. Corrected meta-analytic estimates for the validity of Graduate Record Examination (GRE), Law School Admission Test, Miller Analogies Test, Graduate Management Admission Test, Medical
College Admission Test, and Pharmacy College Admission Test total scores for predicting first year student grades range from .41 to .59 (for a comprehensive review, see Kuncel & Hezlett, 2007; see also Julian, 2005; Kuncel, Crede, & Thomas, 2007; Kuncel, Crede, Thomas, Klieger, Seiler, & Woo, 2005; Kuncel, Hezlett, & Ones, 2001, 2004; Linn & Hastings, 1984). Despite clear differences in grading practices, course content, and pedagogy across law, pharmacy, business, medicine, and other academic disciplines, the predictive validities of standardized test scores are highly similar. Correlation differences across fields demonstrate degrees of utility rather than the dichotomous presence or absence of utility.

Extensive research has specifically examined the validity of scores on the GRE, including several large-scale summaries or meta-analyses (Kuncel et al., 2001; Powers, 2004; Schneider & Briel, 1990). This work suggests that GRE scores predict a variety of important aspects of graduate student performance across different disciplines and situations. Yet the evidence also shows that there are variations in the generally high levels of predictive validity, leaving open questions about the validity of the GRE for specific populations and situations. Somewhat surprisingly, direct comparisons of predictive power for different degree levels have not been conducted with any frequency.

Differences in validity by degree level would have important practical implications for how GRE scores are used in making admissions decisions. Should the GRE be substantially less effective for one degree level, then other aspects of the applicant's file should receive more weight. More generally, examination of degree level as a moderator of the validity of GRE scores for predicting academic performance in graduate school offers scientists the opportunity to gain insight into the situational factors that may influence how well scores on cognitive tests predict performance.

To investigate the moderating role of graduate program degree level on the validity of the GRE, it is important to use large samples or a research synthesis. Any observed differences in predictive validity in a single sample or a small set of samples may be because of artifacts, such as sampling error or uncontrolled substantive factors. By examining a variety of different samples reflecting different disciplines and situations, it is hoped that some of these uncontrolled factors will be averaged out. Therefore, the present study used meta-analysis to provide separate estimates of the validity of GRE scores for predicting the performance of students enrolled in master's and doctoral programs. The results contribute important practical guidance to those responsible for making admissions policies and decisions. By examining the degree to which the predictive validity of GRE scores is moderated by degree level, this study also provides insight into factors that may moderate the degree of validity of standardized cognitive tests for predicting performance.

Moderators of Predictive Validity

Moderators of predictive validity influence the magnitude of the correlation between individuals' scores on the predictor and one or more outcomes of interest. The types of moderators that might affect the correlation between the GRE and a performance outcome can be organized into two general categories: substantive and artifactual.

Substantive Moderators

The major applied concern is with substantive moderators.
Substantive moderators include meaningful conditions, situations, or populations consistently associated with higher or lower levels of predictive power. In some cases, moderators will directly cause differences in observed validities. In other instances, a moderator may be a well-recognized situation associated with or reflecting a cluster of conditions that may influence predictive power. That is, the moderator is a proxy or indicator of underlying variables that affect predictive validity.

Graduate program degree level most likely falls into the latter category. Degree level meaningfully differentiates individuals' educational goals and experiences. However, if degree level does moderate the predictive validity of GRE scores, the moderation is unlikely to be a result of the letters appearing on students' diplomas. The moderation is likely to occur as a result of factors that distinguish individuals' experiences in master's and doctoral programs. Degree level is a starting point for understanding more fundamental differences about how individual characteristics interact with the educational environment. Several variables may create differences between master's- and doctoral-level degree programs.

Course complexity. The greater demands placed on cognitive resources in high-complexity settings should result in a stronger correlation between ability measures and performance than in low-complexity settings. Empirical support for the moderating effects of complexity can be found in research in diverse settings. Hunter (1980) reported that, on average, ability measures had stronger correlations with job performance for complex jobs (r = .58; e.g., retail food manager, game warden) than for low-complexity jobs (r = .23; e.g., shrimp picker, cannery worker). Research in the educational domain provides data that are less direct but compelling. A number of studies have found that the average test score for colleges and law schools tends to be associated with the strength of predictive validity (Bridgeman, Jenkins, & Ervin, 1999; Linn & Hastings, 1984; Ramist, Lewis, & McCamley-Jenkins, 1994). There is even evidence that for simple tasks such as choice reaction time (e.g., pressing a number key when that number flashes on the screen), performance shows an increasing association with cognitive abilities as the number of choices increases, making the task more complex (e.g., Baumeister, 1998; see also Deary, 1996).

It is reasonable to propose that, on average, doctoral-level work is of higher complexity than master's-level work. Earning a PhD typically requires obtaining a deep level of expertise in an area and producing an original scientific contribution to the field. Higher complexity would lead to somewhat stronger correlations between GRE scores and performance in doctoral programs than in master's programs.

Independence and ill-structured learning environments. Unlike the work completed in traditional classroom settings, much of the work of doctoral students is ill structured. Building on a core of required courses, doctoral students often must develop their own program of study, work independently to gain expertise, and contribute to unique research projects. Activities in master's programs tend to be more structured. Many master's students complete a well-defined set of required courses, augmented by a few classes selected from a limited set of electives. Only some master's students write a thesis, and these students tend to receive more guidance than doctoral students completing dissertations.
On the surface, it might appear that standardized admissions tests were developed to predict performance in structured instructional settings, such as the college lecture hall. Thus, one might initially conjecture that scores on tests would be more strongly related to performance in master's programs, where the bulk of students' work is structured. However, research indicates that standardized test scores predict learning and performance in situations that require the processing of information or the acquisition of new knowledge (for reviews, see Cattell, 1971; Kuncel et al., 2004), even when the outcomes are not grades in courses (Kuncel & Hezlett, 2007). Indeed, research suggests that students with higher scores on standardized tests may actually profit more from low-structured environments (such as doctoral programs) than those with lower scores because of the interaction between training structure and ability (Edgerton, 1958; Goldstein, 1993; Snow & Lohman, 1984).

Discipline area. Discipline of study has been found to have a small effect on the predictive power of GRE scores. Although scores on both the Verbal (GRE-V) and Quantitative (GRE-Q) sections of the GRE demonstrate strong prediction across fields, GRE-V scores are more strongly related to grades in verbal disciplines (humanities), and GRE-Q scores are more strongly correlated with grades in quantitative disciplines (Kuncel et al., 2001). The implication of these findings for the current study is that differential representation of area of study is unlikely to produce large moderating effects by degree level. Validity coefficients might differ to a small extent by degree level if particular disciplines are more likely to have master's or doctoral programs.

Artifactual Moderators

Artifactual moderators can result in an observed difference in predictive validity because of the statistical properties of the samples or measures. Two common examples include differential restriction of range because of direct and indirect selection effects and differences in criterion measurement error. If one graduate program degree level typically admits a wider range of students, we would expect to observe larger correlations between GRE scores and academic performance for that degree level. Predictor variability will be affected by application requirements and the nature of the students who apply to the program. Highly selective programs are likely to get a narrow range of applicants and then further restrict the group by admitting the top applicants. Fortunately, this artifactual moderator can be reasonably addressed through corrections for restriction of range.

Differences in criterion measurement error can occur because of grading policy or systematic instructor differences. If grades have poorer measurement properties, on average, for one degree level, then validity coefficients will vary by degree level not because of the predictor but because of the measure of performance. Formulas also exist that permit validity coefficients to be corrected for unreliability in the criteria.

Current Study Hypotheses

Given the results of the large literature on the validity of standardized test scores for predicting academic performance, the GRE is likely to be a valid predictor for both master's- and doctoral-level programs. Both situations require considerable acquisition of new knowledge. Both situations require the student to process information and make decisions.
Finally, both situations, regardless of program, require verbal skills, and most require quantitative skills as well. Therefore, it was our expectation that the GRE should be a valid predictor of performance in both master's and doctoral programs. However, given that program complexity and structure may vary by degree level, there may be small differences in the degree to which GRE scores predict students' performance in master's and doctoral programs. Differences in disciplines' reliance on master's and doctoral programs also may lead to small variations in the magnitude of GRE validity coefficients by degree level. Directional hypotheses are not possible because the distribution of degree level by discipline is unknown and may operate in a different direction from the potential moderating effects of complexity and structure.

Methods

The database for this study was assembled from two sources. First, the meta-analytic database used in Kuncel et al. (2001) was used as a foundation for the research. All articles were reviewed for information about program level, and data were coded accordingly. To supplement and update this database, a new literature search was conducted using the ERIC (1999-2005), PsycINFO (1999-2005), and Dissertation Abstracts (1999-2005) databases. The search was set back to 2 years preceding the Kuncel et al. (2001) study to account for publication lag and ensure that any updates to the bibliographic databases were included. We did not conduct a meta-analysis of the GRE Analytical Writing measure; it is sufficiently new that relatively few studies have examined its validity. The results of these literature searches were imported into a bibliographic database, and each record was examined to determine whether it might contain relevant data. Relevant articles and dissertations were retrieved, and three yielded useable data (Edwards & Schleicher, 2004; Fenster, Markus, Weidemann, Brackett, & Fernandez, 2001; Pearson, 2003). The final set of coded data was entirely composed of publicly available journal articles, research reports, and dissertations.

Because of the complexity of the measures, situations, and outcomes represented in the literature, precise coding of all statistics and moderator variables is critical. Although previous research has found that coding accuracy is generally very high (e.g., Kuncel et al., 2001; Whetzel & McDaniel, 1988; Zakzanis, 1998), to ensure the reliability of the coded information we used a three-step process. It is important to note that the coding reliability of the Kuncel et al. (2001) database was found to be very good, with more than 99% agreement, and that database constitutes the majority of the data presented here. All studies were coded by two coders, and the results were compared for disagreements. Inconsistencies were resolved in meetings with the first author. Finally, the double-coded data were examined by the first author, who checked a random sample of 20% of the articles, including those judged to contain no useable data. This third step helped ensure that all articles with useable data were included and that all coded data were accurate.

In some articles, correlational data were not reported, but other relevant information was included that allowed the magnitude of the effect to be estimated. These results were converted from their presented form (e.g., t, d, χ2, p value, frequencies) into correlations using standard conversion formulae (Hunter & Schmidt, 2004; Lipsey & Wilson, 2001).
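For readers unfamiliar with these conversions, the sketch below illustrates two of the standard formulas (an independent-samples t statistic to r, and a standardized mean difference d to r) of the kind described by Lipsey and Wilson (2001). It is a minimal illustration rather than the authors' coding scripts, and the input values are hypothetical.

```python
# Illustrative sketch (not the study's actual coding scripts): converting
# commonly reported statistics into correlations using standard formulas.
# The numeric inputs below are hypothetical.
import math

def r_from_t(t: float, df: int) -> float:
    """Convert an independent-samples t statistic to a correlation."""
    return t / math.sqrt(t**2 + df)

def r_from_d(d: float, n1: int, n2: int) -> float:
    """Convert a standardized mean difference (d) to a point-biserial r."""
    a = (n1 + n2) ** 2 / (n1 * n2)  # correction term for group sizes
    return d / math.sqrt(d**2 + a)

if __name__ == "__main__":
    print(round(r_from_t(t=2.5, df=48), 3))         # t(48) = 2.5  ->  r ~ .34
    print(round(r_from_d(d=0.6, n1=30, n2=20), 3))  # d = 0.60     ->  r ~ .28
```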
Studies were only included if program level was specifically discussed for the sample. In addition, some doctoral programs require a master's degree as a part of the program. Outcomes for these master's students were not included in this study to more clearly separate the two degree program levels.

Sufficient information was available to examine three measures of student performance: first year grade point average (GPA), overall graduate GPA, and faculty ratings. The first year GPA criterion included either first year or first semester grades. The overall graduate GPA criterion consisted of studies that contained 2 or more years of grades (i.e., second year cumulative GPA or more). Faculty ratings consisted of a mixture of different rating types, including overall evaluations of performance, ratings of professionalism, research accomplishment, and dissertation or thesis quality. Studies were excluded if graduate grades were self-reported, because of concerns about their accuracy (see Kuncel, Crede, & Thomas, 2005).

The Hunter and Schmidt (2004) psychometric meta-analytic method was used for all analyses. In addition to statistically aggregating results across studies, psychometric meta-analysis allows for the correction of statistical artifacts that bias the average observed validity estimate and allows us to estimate the amount of variance attributable to sampling error, range restriction, and unreliability. These meta-analytic methods have been tested via Monte Carlo simulations on several occasions, with results in all cases indicating that these methods yield accurate results even in the presence of minor violations of key assumptions (e.g., Law, Schmidt, & Hunter, 1994; Oswald & Johnson, 1998; Schulze, 2004).

The artifact distribution method was used to address statistical artifact magnitude and variability (Hunter & Schmidt, 2004). In this approach, available information is used to make the corrections. The underlying assumption of this approach is that the available information reasonably represents the distribution of artifacts in the literature. For this assumption to be violated, the reporting of artifact-relevant information in a study would need to covary with the artifact in question. Because this seems unlikely, artifact distributions are likely to provide reasonable corrections and almost certainly result in less biased estimates than no corrections.
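For readers less familiar with this approach, the following minimal sketch illustrates the bare-bones aggregation step that underlies the Hunter and Schmidt method described above: weighting each observed validity by its sample size and asking how much of the observed variability would be expected from sampling error alone. The correlations and sample sizes are hypothetical, and the sketch deliberately omits the artifact distribution corrections used in the actual analyses.

```python
# Minimal "bare-bones" aggregation sketch in the spirit of Hunter-Schmidt
# psychometric meta-analysis. Study correlations and Ns are hypothetical.
studies = [  # (observed r, sample size N)
    (0.25, 120), (0.31, 80), (0.18, 200), (0.36, 60), (0.27, 150),
]

total_n = sum(n for _, n in studies)
r_bar = sum(r * n for r, n in studies) / total_n                    # N-weighted mean r
var_obs = sum(n * (r - r_bar) ** 2 for r, n in studies) / total_n   # observed variance of r
k = len(studies)
n_bar = total_n / k
# One common approximation for the variance expected from sampling error alone.
var_error = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
var_residual = max(var_obs - var_error, 0.0)                        # variance left for moderators/artifacts

print(f"weighted mean r = {r_bar:.3f}")
print(f"observed variance = {var_obs:.4f}, sampling-error variance = {var_error:.4f}")
print(f"residual variance = {var_residual:.4f}")
```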
In estimating the predictive power of GRE scores, we are most interested in the relationship with performance for the full applicant population and not just for those students who are admitted. To create such an estimate, we need information about the variability of the incumbent (admitted) group and the group of students who applied to a program (i.e., the applicant group). The former is obtained directly from studies. The latter is almost never reported in primary studies. To obtain good estimates, technical manuals and reports are often excellent sources of information. One approach would be to correct back to the total sample of all test takers. Although some arguments could be made for this approach, we used a more conservative method. We linked samples back to the standard deviation for test takers who indicated the intent to study in the same academic area. For example, for a study with a sample of mathematics students, the standard deviation used was for all students who indicated that they intended to go to graduate school in mathematics.

Given that students tend to sort themselves into graduate school partially based on their verbal and quantitative abilities, the applicant groups are likely to be more restricted than the sample of all test takers and tend to reflect a matching of ability level to program (e.g., Kuncel & Klieger, 2007). Given the applied research question raised by this study, this approach results in a more accurate but far smaller correction than what would be obtained if the standard deviations for all test takers were used.

To further refine the corrections, we matched samples by time in addition to area. The standard deviation of the GRE appears to have increased over time. To better reflect the applicant groups at a given point in time, we used area estimates matched as closely as possible to the same point in time. These were available for the following years: 1952, 1967-1968, 1974-1976, 1988-1991, 1992-1995, 1995-1996 (Briel, O'Neill, & Scheuneman, 1993; Conrad, Trisman, & Miller, 1977; Educational Testing Service, 1996, 1997). Data were then sorted by degree level, and range restriction artifact distributions were then created separately for each meta-analysis for the GRE-V and GRE-Q. This approach created artifact distributions tailored by time, discipline area, and degree level to the variability in incumbent test scores. It is unknown, and a question for further research, whether or not the standard deviations for test takers who indicated the intent to study in the same academic area differ for master's and doctoral programs.

These corrections result in less biased estimates of the relationship between test scores and outcomes; however, admission decisions are often based on other variables that are not presented in validation studies. These variables can lead to indirect range restriction, and the evidence to date suggests that direct corrections (such as those used here) tend to be conservative underestimates of predictive power (Hunter, Schmidt, & Le, 2006). The ideal approach to addressing restriction of range in graduate admissions would be to conduct multivariate corrections (Aitken, 1934; Lawley, 1943; for a review, see Sackett & Yang, 2000). However, such corrections require considerable information that is almost never available in a meta-analysis.

In addition, we are generally interested in the predictive validity of the GRE for outcomes unclouded by measurement error. Grades are not assigned consistently, and ratings of performance are subject to rating errors. This results in weaker relationships than what would be obtained with more reliable evaluations of student performance. Therefore, the correlations were corrected for criterion unreliability. Estimates of college grade unreliability were obtained from three studies (Barritt, 1966; Bendig, 1953; Reilly & Warech, 1993). Estimates of faculty rating accuracy were taken from Kuncel et al. (2004), as these estimates are more precise and conservative than those used in Kuncel et al. (2001). The more conservative estimates yield smaller corrections and, thus, smaller estimates of the correlation between the GRE and performance.
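To make the two corrections concrete, the sketch below applies a direct (Thorndike Case II) range restriction correction using an area-matched applicant standard deviation, followed by the classical correction for criterion unreliability. This is a simplified illustration of the type of corrections described above, not the artifact distribution procedure actually used (which corrects at the level of distributions rather than individual coefficients); the observed correlation, standard deviations, and reliability value are hypothetical.

```python
# Simplified illustration of the two corrections discussed above, applied to a
# single hypothetical coefficient. All input values are hypothetical.
import math

def correct_range_restriction(r: float, sd_applicant: float, sd_incumbent: float) -> float:
    """Direct (Thorndike Case II) correction for range restriction on the predictor."""
    u = sd_applicant / sd_incumbent  # ratio of unrestricted to restricted SD
    return (r * u) / math.sqrt(1 + (u**2 - 1) * r**2)

def correct_criterion_unreliability(r: float, criterion_reliability: float) -> float:
    """Classical correction for attenuation due to criterion measurement error."""
    return r / math.sqrt(criterion_reliability)

# Hypothetical inputs: observed r = .23, area-matched applicant SD = 110,
# admitted-group SD = 85, criterion (GPA) reliability = .83.
r_obs = 0.23
r_rr = correct_range_restriction(r_obs, sd_applicant=110, sd_incumbent=85)
r_operational = correct_criterion_unreliability(r_rr, criterion_reliability=0.83)
print(f"observed r = {r_obs:.2f}, after range restriction = {r_rr:.2f}, "
      f"operational validity = {r_operational:.2f}")
```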
Given that discipline has a small effect on the predictive validity of the GRE (Kuncel et al., 2001), frequency counts of discipline area were conducted for both the overall graduate GPA and first year GPA analyses. Disciplines were classified into humanities, social science, life science, and math/physical science categories. Although a wide range of fields is represented, the social sciences in general, and psychology and education in particular, are most frequently represented. Therefore, the aggregated results (being weighted means) will more strongly, but not exclusively, reflect the effect of program level on social science disciplines than on other disciplines. A chi-square test was calculated to compare the distribution of disciplines by degree level. The result was statistically significant at p < .10. Much of this appears to be driven by the proportionately somewhat larger representation of the life sciences in the master's programs than in the doctoral programs. Given the desirability of stable estimates, the overall small moderating effects of discipline, and the relatively balanced proportions of disciplines across the two degree levels, we proceeded to analyze the data across master's and doctoral programs.
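The sketch below shows the kind of chi-square comparison described above: testing whether the discipline frequencies differ across master's and doctoral samples. The contingency counts are hypothetical placeholders, since the underlying table is not reported here.

```python
# Illustration of a chi-square test comparing discipline distributions across
# degree levels. The counts are hypothetical; the paper does not report the
# underlying contingency table.
from scipy.stats import chi2_contingency

#          humanities  social sci  life sci  math/phys sci
counts = [
    [8,         30,         12,       6],   # master's samples (hypothetical)
    [5,         22,          4,       5],   # doctoral samples (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square({dof}) = {chi2:.2f}, p = {p_value:.3f}")
```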
Results and Discussion

The results of the meta-analysis are shown in Tables 1 through 3. As expected, both GRE-V and GRE-Q were found to be valid predictors of graduate GPA and first year graduate GPA in both master's and doctoral programs. In addition, GRE-V and GRE-Q scores were found to predict faculty ratings for both master's and doctoral programs. These results indicate that the GRE is effective for admission decision making for both master's- and doctoral-level work and should be incorporated in the application process for both degree levels. Specifically, the GRE has comparable and useful predictive validity for master's- and doctoral-level programs. The primary practical implication of these findings is that both doctoral and master's programs can continue to use the GRE and expect that it will provide useful predictive information about their students. One of the two smallest discrepancies between degree levels was for GRE-Q predicting graduate GPA, with a corrected, operational validity of .28 for doctoral students and .30 for master's students. In predicting faculty ratings, GRE-V had an operational validity of .32 for both master's and doctoral programs.

Table 1. Meta-analysis of GRE Predictive Validity by Degree Level for Graduate GPA

                        N      k   robs   SDobs    ρ     SDρ
Master's students
  GRE-Verbal          7,214   56   .29    .14     .38    .14
  GRE-Quantitative    6,864   55   .23    .14     .30    .12
Doctoral students
  GRE-Verbal          1,216   12   .21    .13     .27    .10
  GRE-Quantitative    3,757   21   .20    .14     .28    .16

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; ρ = operational validity; SDρ = standard deviation of true score correlations; GRE = Graduate Record Examination; GPA = grade point average.

Table 2. Meta-analyses of GRE Predictive Validity by Degree Level for First Year GPA

                        N      k   robs   SDobs    ρ     SDρ
Master's students
  GRE-Verbal          2,204   47   .27    .18     .35    .14
  GRE-Quantitative    2,204   47   .22    .16     .28    .06
Doctoral students
  GRE-Verbal          1,323   25   .22    .18     .29    .14
  GRE-Quantitative    1,250   24   .24    .16     .33    .10

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; ρ = operational validity; SDρ = standard deviation of true score correlations; GRE = Graduate Record Examination; GPA = grade point average.

Table 3. Meta-analyses of GRE Predictive Validity by Degree Level for Faculty Ratings

                        N      k   robs   SDobs    ρ     SDρ
Master's students
  GRE-Verbal            759    8   .23    .13     .32    .12
  GRE-Quantitative      759    8   .15    .13     .21    .11
Doctoral students
  GRE-Verbal          1,360   13   .23    .14     .32    .12
  GRE-Quantitative    1,199   11   .20    .11     .30    .05

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; ρ = operational validity; SDρ = standard deviation of true score correlations; GRE = Graduate Record Examination.

One of the largest discrepancies in predictive validity across master's and doctoral programs involved the use of GRE-V scores for predicting graduate GPA. The corrected operational validity of GRE-V scores for predicting graduate GPA was .38 for master's students and .27 for doctoral students. That one of the largest differences involved graduate GPA, rather than first year GPA, is not surprising, as it seems likely that master's and doctoral work would diverge later in the programs. However, the direction of the difference is the opposite of what would be predicted from past research on complexity and structure as moderators of the cognitive ability–performance relationship. One possible explanation of the observed difference would be the mixture of discipline areas, although this does not appear to be a likely explanation, as the distribution is fairly comparable across program level and discipline-specific effects tend to be small. A second possibility would be differences in grading standards and variability by degree level. To test this, we examined GPA standard deviations for those studies in the GRE-V analyses for graduate GPA. Not all studies reported this information, but the results are suggestive. On average, the standard deviation of GPAs for master's students was .40, whereas for doctoral programs it was only .21. The smaller range of grades for doctoral students could cause the observed variation in validity coefficients. However, studies reporting standard deviations were rare. Future research needs to more carefully examine the nature of grades in graduate education to see if systematic trends, such as greater grade inflation in doctoral programs, can be identified. Although we corrected operational validities for unreliability in grades, separate data on grade unreliability for each degree level were not available. Differences in grade reliability by degree level also are a trend worth investigating. In addition, the type of courses taken by master's or doctoral students may be the source of the small observed differences. If quantitative courses are the major source of variability in doctoral student GPA but not for master's students, then course taking patterns could partially account for the results.

The other large discrepancy involved the correlation of GRE-Q with faculty ratings, with an operational validity of .21 for master's students and .30 for doctoral students. The direction of this difference is consistent with complexity and structure being higher in doctoral programs. However, it is not clear why complexity and structure might moderate the validity of GRE-Q scores for predicting faculty ratings but not affect the magnitude of GRE-V validities.

More work is needed to examine other aspects of student performance beyond those covered here. Grades and faculty ratings are important measures of student performance.
When grades are based on good assessment of student learning, they are valuable pieces of information. Faculty have considerable contact with students, and their evaluations of nonclassroom aspects of student performance enhance our confidence that the GRE predicts a range of important outcomes for both degree levels. Of course, students engage in many additional important activities in graduate school. This is especially so for doctoral students, for whom GPA is often considered a less important outcome of their training. It is important to note that research has already found that standardized tests predict criteria other than grades (Kuncel & Hezlett, 2007). The results of the present investigation, combined with the general literature, create a strong case for the relevance of the GRE as a predictor of multiple important aspects of student performance and success, but additional primary studies are needed to fully examine the criterion space. Only then will a complete picture of admissions and assessment be possible for doctoral and master's programs separately.

Meta-analytic results depend on prior academic literature. In this study, a varied but not random representation of graduate programs was obtained. Therefore, results may not fully generalize to all graduate programs. In addition, some analyses were based on a relatively small number of studies. Additional primary research must be completed to make possible more robust meta-analytic estimates based on an even larger number of samples and students.

This study, based on thousands of individuals and nearly 100 independent samples, found considerable evidence for the validity of the GRE for both master's- and doctoral-level programs. Averaging across the two tests and grade measures, the validity of the GRE varied only .03 between master's (.30) and doctoral (.27) programs. Based on the data currently available, the GRE is a useful decision-making tool for both master's- and doctoral-level programs. This investigation has elucidated diverse variables that may potentially moderate the validity of the GRE for predicting academic performance but has not revealed systematic patterns of differences in validity coefficients by degree level.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the authorship and/or publication of this article.

Funding

The authors received no financial support for the research and/or authorship of this article. The lead author received funding from ETS to support graduate students to conduct the study but did not receive any personal funding for the project.

References

Aitken, A. C. (1934). Note on selection from a multivariate normal population. Proceedings of the Edinburgh Mathematical Society, 4, 106-110.
Barritt, L. S. (1966). The consistency of first-semester college grade point average. Journal of Educational Measurement, 3, 261-262.
Baumeister, A. A. (1998). Intelligence and the "personal equation." Intelligence, 26, 255-265.
Bendig, A. W. (1953). The reliability of letter grades. Educational and Psychological Measurement, 13, 311-321.
Bridgeman, B., Jenkins, L., & Ervin, N. (1999, April). Variation in the prediction of college grades across gender within ethnic groups at different selectivity levels. Paper presented at the meeting of the American Educational Research Association, Montreal, Quebec, Canada.
Briel, J. B., O'Neill, K., & Scheuneman, J. D. (Eds.). (1993). GRE technical manual. Princeton, NJ: Educational Testing Service.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Oxford, UK: Houghton Mifflin.
Conrad, L., Trisman, D., & Miller, R. (1977). GRE Graduate Record Examinations technical manual. Princeton, NJ: Educational Testing Service.
Deary, I. J. (1996). Reductionism and intelligence: The case of inspection time. Journal of Biosocial Science, 28, 405-423.
Edgerton, H. A. (1958). The relationship of method of instruction to trainee aptitude pattern (Tech. Rep., Contract ONR 1042). New York: Richardson, Bellows, & Henry.
Educational Testing Service. (1996). Interpreting your GRE General Test and Subject Test scores: 1996-1997. Princeton, NJ: Author.
Educational Testing Service. (1997). Sex, race, ethnicity, and performance on the GRE General Test: A technical report. Princeton, NJ: Author.
Edwards, W. R., & Schleicher, D. J. (2004). On selecting psychology graduate students: Validity evidence for a test of tacit knowledge. Journal of Educational Psychology, 96, 592-602.
Fenster, A., Markus, K. A., Weidemann, C. F., Brackett, M. A., & Fernandez, J. (2001). Selecting tomorrow's forensic psychologists: A fresh look at some familiar predictors. Educational and Psychological Measurement, 61, 336-348.
Goldstein, I. L. (1993). Training in organizations (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Department of Labor.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594-612.
Julian, E. R. (2005). Validity of the Medical College Admission Test for predicting medical school performance. Academic Medicine, 80, 910.
Kuncel, N. R., Crede, M., & Thomas, L. L. (2005). The validity of self-reported grade point averages, class ranks, and test scores: A meta-analysis. Review of Educational Research, 75, 63-82.
Kuncel, N. R., Crede, M., & Thomas, L. L. (2007). A comprehensive meta-analysis of the predictive validity of the Graduate Management Admission Test (GMAT) and undergraduate grade point average (UGPA). Academy of Management Learning and Education, 6, 53-68.
Kuncel, N. R., Crede, M., Thomas, L. L., Klieger, D. M., Seiler, S. N., & Woo, S. E. (2005). A meta-analysis of the Pharmacy College Admission Test (PCAT) and grade predictors of pharmacy student success. American Journal of Pharmaceutical Education, 69, 339-347.
Kuncel, N. R., & Hezlett, S. A. (2007). Standardized tests predict graduate students' success. Science, 315, 1080-1081.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127, 162-181.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2004). Academic performance, career potential, creativity, and job performance: Can one construct predict them all? [Special section: Cognitive abilities: 100 years after Spearman (1904)]. Journal of Personality and Social Psychology, 86, 148-161.
Kuncel, N. R., & Klieger, D. M. (2007). Application patterns when applicants know the odds: Implications for selection research and practice. Journal of Applied Psychology, 92, 586-593.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994). A test of two refinements in procedures for meta-analysis. Journal of Applied Psychology, 79, 978-986.
Lawley, D. N. (1943). A note on Karl Pearson's selection formulae. Proceedings of the Royal Society of Edinburgh, Section A, 62(Pt. 1), 28-30.
Linn, R. L., & Hastings, C. N. (1984). A meta-analysis of the validity of predictors of performance in law school. Journal of Educational Measurement, 21, 245-259.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Oswald, F. L., & Johnson, J. W. (1998). On the robustness, bias, and stability of results from meta-analysis of correlation coefficients: Some initial Monte Carlo findings. Journal of Applied Psychology, 83, 164-178.
Pearson, L. L. (2003). Predictors of success in a graduate program in radiologic sciences. Unpublished doctoral dissertation, Texas Woman's University, Denton, TX.
Powers, D. E. (2004). Validity of Graduate Record Examinations (GRE) general test scores for admissions to colleges of veterinary medicine. Journal of Applied Psychology, 89, 209-219.
Ramist, L., Lewis, C., & McCamley-Jenkins, L. (1994). Student group differences in predicting college grades: Sex, language, and ethnic group (College Board Rep. No. 93-1). New York: College Board.
Reilly, R. R., & Warech, M. A. (1993). The validity and fairness of alternatives to cognitive tests. In L. C. Wing & B. R. Gifford (Eds.), Policy issues in employment testing (pp. 131-224). Boston: Kluwer.
Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology. Journal of Applied Psychology, 85, 112-118.
Schneider, L. M., & Briel, J. B. (1990). Validity of the GRE: 1988-1989 summary report. Princeton, NJ: Educational Testing Service.
Schulze, R. (2004). Meta-analysis: A comparison of procedures. Cambridge, MA: Hogrefe & Huber.
Snow, R. E., & Lohman, D. F. (1984). Toward a theory of cognitive aptitude for learning from instruction. Journal of Educational Psychology, 76, 347-376.
Whetzel, D. L., & McDaniel, M. A. (1988). Reliability of validity generalization data bases. Psychological Reports, 63, 131-134.
Zakzanis, K. K. (1998). The reliability of meta-analytic review. Psychological Reports, 83, 215-222.