Validity Studies

Published in Educational and Psychological Measurement, 2010, Volume 70(2), Pages 340-352.
© 2010 SAGE Publications
http://doi.org/10.1177/0013164409344508
http://epm.sagepub.com

The Validity of the Graduate Record Examination for Master's and Doctoral Programs: A Meta-analytic Investigation

Nathan R. Kuncel,1 Serena Wee,2 Lauren Serafin,2 and Sarah A. Hezlett3
Abstract
Extensive research has examined the effectiveness of admissions tests for use
in higher education. What has gone unexamined is the extent to which tests are
similarly effective for predicting performance at both the master’s and doctoral levels.
This study empirically synthesizes previous studies to investigate whether or not
the Graduate Record Examination (GRE) predicts the performance of students in
master’s programs as well as the performance of doctoral students. Across nearly
100 studies and 10,000 students, this study found that GRE scores predict first year
grade point average (GPA), graduate GPA, and faculty ratings well for both master’s
and doctoral students, with differences that ranged from small to zero.
Keywords
graduate school, admissions tests, validity, Graduate Record Examination, GRE, meta-analysis, standardized tests
1 University of Minnesota, Minneapolis, MN, USA
2 University of Illinois at Urbana-Champaign, IL, USA
3 Personnel Decisions Research Institutes, Minneapolis, MN, USA

Corresponding Author:
Nathan R. Kuncel, Department of Psychology, University of Minnesota, 75 East River Rd, Minneapolis, MN 55455, USA
Email: [email protected]

The consistent level of validity of scores on standardized cognitive tests for predicting academic performance in graduate programs is remarkable. Corrected meta-analytic estimates for the validity of Graduate Record Examination (GRE), Law School Admission Test, Management Aptitude Test, Graduate Management Admission Test, Medical
College Admission Test, and Pharmacy College Admission Test total scores for predicting first year student grades range from .41 to .59 (for a comprehensive review, see
Kuncel & Hezlett, 2007; see also Julian, 2005; Kuncel, Crede, & Thomas, 2007;
Kuncel, Crede, Thomas, Klieger, Seiler, & Woo, 2005; Kuncel, Hezlett, & Ones, 2001,
2004; Linn & Hastings, 1984). Despite clear differences in grading practices, course
content, and pedagogy across law, pharmacy, business, medicine, and other academic
disciplines, the predictive validities of standardized test scores are highly similar.
Correlation differences across fields demonstrate degrees of utility rather than the
dichotomous presence or absence of utility.
Extensive research has specifically examined the validity of scores on the GRE,
including several large-scale summaries or meta-analyses (Kuncel et al., 2001;
Powers, 2004; Schneider & Briel, 1990). This work suggests that GRE scores predict
a variety of important aspects of graduate student performance across different disciplines and situations. Yet the evidence also shows that there are variations in the
generally high levels of predictive validity, leaving open questions about the validity
of the GRE for specific populations and situations. Somewhat surprisingly, direct
comparisons of predictive power for different degree levels have not been conducted
with any frequency. Differences in validity by degree level would have important
practical implications for how GRE scores are used in making admissions decisions.
Should the GRE be substantially less effective for one degree level, then other aspects
of the applicant’s file should receive more weight. More generally, examination of
degree level as a moderator of the validity of GRE scores for predicting academic
performance in graduate school offers scientists the opportunity to gain insight into
the situational factors that may influence how well scores on cognitive tests predict
performance.
To investigate the moderating role of graduate program degree level on the validity
of the GRE, it is important to use large samples or a research synthesis. Any observed
differences in the predictive validity in a single or small sample may be because of
artifacts, such as sampling error or uncontrolled substantive factors. By examining a
variety of different samples reflecting different disciplines and situations, it is hoped
that some of these uncontrolled factors will be averaged out. Therefore, the present
study used meta-analysis to provide separate estimates of the validity of GRE scores
for predicting the performance of students enrolled in master’s and doctoral programs.
The results contribute important practical guidance to those responsible for making
admissions policies and decisions. By examining the degree to which the predictive
validity of GRE scores is moderated by degree level, this study also provides insight
into factors that may moderate the degree of validity of standardized cognitive tests
for predicting performance.
Moderators of Predictive Validity
Moderators of predictive validity influence the magnitude of the correlation between
individuals’ scores on the predictor and one or more outcomes of interest. The types of
moderators that might affect the correlation between the GRE and a performance outcome can be organized into two general categories: substantive and artifactual.
Substantive Moderators
The major applied concern is with substantive moderators. Substantive moderators
include meaningful conditions, situations, or populations consistently associated with
higher or lower levels of predictive power. In some cases, moderators will directly
cause differences in observed validities. In other instances, a moderator may be a well-recognized situation associated with or reflecting a cluster of conditions that may
influence predictive power. That is, the moderator is a proxy or indicator of underlying
variables that affect predictive validity.
Graduate program degree level most likely falls into the latter category. Degree
level meaningfully differentiates individuals’ educational goals and experiences.
However, if degree level does moderate the predictive validity of GRE scores, the
moderation is unlikely to be a result of the letters appearing on students’ diplomas. The
moderation is likely to occur as a result of factors that distinguish individuals’ experiences in master’s and doctoral programs. Degree level is a starting point for
understanding more fundamental differences about how individual characteristics
interact with the educational environment. Several variables may create differences
between master’s- and doctoral-level degree programs.
Course complexity. The greater demands placed on cognitive resources in high-complexity settings should result in a stronger correlation between ability measures
and performance than in low-complexity settings. Empirical support for the moderating effects of complexity can be found in research in diverse settings. Hunter (1980)
reported that, on average, ability measures had stronger correlations with job performance for complex jobs (r = .58; retail food manager, game warden) than for
low-complexity jobs (r = .23; shrimp picker, cannery worker). Research in the educational domain provides data that are less direct but compelling. A number of studies
have found that higher average test scores at colleges and law schools tend to be associated with stronger predictive validity (Bridgeman, Jenkins, & Ervin, 1999; Linn &
Hastings, 1984; Ramist, Lewis, & McCamley-Jenkins, 1994). There is even evidence
that for simple tasks such as choice reaction time (e.g., pressing a number key when
that number flashes on the screen) performance shows an increasing association with
cognitive abilities as the number of choices increases, making the task more complex
(e.g., Baumeister, 1998; see also Deary, 1996).
It is reasonable to propose that, on average, doctoral-level work is of higher complexity than master’s-level work. Earning a PhD typically requires obtaining a deep
level of expertise in an area and producing an original scientific contribution to the
field. Higher complexity would lead to somewhat stronger correlations between GRE
scores and performance in doctoral programs than in master’s programs.
Independence and ill-structured learning environments. Unlike the work completed in
traditional classroom settings, much of the work of doctoral students is ill structured.
Building on a core of required courses, doctoral students often must develop their own
program of study, work independently to gain expertise, and contribute to unique
research projects. Activities in master’s programs tend to be more structured. Many
master’s students complete a well-defined set of required courses, augmented by a few
classes selected from a limited set of electives. Only some master’s students write a
thesis, and these students tend to receive more guidance than doctoral students completing dissertations.
On the surface, it might appear that standardized admissions tests were developed
to predict performance in structured instructional settings, such as the college lecture
hall. Thus, one might initially conjecture that scores on tests would be more strongly
related to performance in master’s programs, where the bulk of students’ work is
structured. However, research indicates that standardized test scores predict learning
and performance in situations that require the processing of information or the acquisition of
new knowledge (for reviews, see Cattell, 1971; Kuncel et al., 2004), even when the
outcomes are not grades in courses (Kuncel & Hezlett, 2007). Indeed, research suggests that students with higher scores on standardized tests may actually profit more
from low-structured environments (such as doctoral programs) than those with lower
scores because of the interaction between training structure and ability (Edgerton,
1958; Goldstein, 1993; Snow & Lohman, 1984).
Discipline area. Discipline of study has been found to have a small effect on the
predictive power of GRE scores. Although scores on both the Verbal (GRE-V) and
Quantitative (GRE-Q) sections of the GRE demonstrate strong prediction across
fields, GRE-V scores are more strongly related to grades in verbal disciplines (humanities). GRE-Q scores are more strongly correlated with grades for those in quantitative
disciplines (Kuncel et al., 2001). The implication of these findings for the current
study is that differential representation of area of study is unlikely to produce large
moderating effects by degree level. Validity coefficients might differ to a small extent
by degree level if particular disciplines are more likely to have master’s or doctoral
programs.
Artifactual Moderators
Artifactual moderators can result in an observed difference in predictive validity
because of the statistical properties of the samples or measures. Two common examples include differential restriction of range because of direct and indirect selection
effects and criterion measurement error differences. If one graduate program degree
level typically admits a wider range of students, we would expect to observe larger
correlations between GRE scores and academic performance for that degree level. Predictor variability will be affected by application requirements and the nature of the
students who apply to the program. Highly selective programs are likely to get a
narrow range of applicants and then further restrict the group by admitting the top
applicants. Fortunately, this artifactual moderator can be reasonably addressed through
corrections for restriction of range.
Differences in criterion measurement error can occur because of grading policy or
systematic instructor differences. If grades have poorer measurement properties, on
average, for one degree level, then validity coefficients will vary by degree level not
because of the predictor but because of the measure of performance. Formulas also
exist that permit validity coefficients to be corrected for unreliability in the criteria.
Current Study Hypotheses
Given the results of the large literature on the validity of standardized tests scores for
predicting academic performance, the GRE is likely to be a valid predictor for both
master’s- and doctoral-level programs. Both situations require considerable acquisition
of new knowledge. Both situations require the student to process information and make
decisions. Finally, both situations, regardless of program, require verbal skills and most
require quantitative skills as well. Therefore, it was our expectation that the GRE should
be a valid predictor of performance in both master’s and doctoral programs.
However, given that program complexity and structure may vary by degree level,
there may be small differences in the degree to which GRE scores predict students’
performance in master’s and doctoral programs. Differences in disciplines’ reliance on
master’s and doctoral programs also may lead to small variations in the magnitude of
GRE validity coefficients by degree level. Directional hypotheses are not possible
because the distribution of degree level by discipline is unknown and may operate in a
different direction from the potential moderating effects of complexity and structure.
Methods
The database for this study was assembled from two sources. First, the meta-analytic
database used in Kuncel et al. (2001) was used as a foundation for the research. All
articles were reviewed for information about program level, and data were coded
accordingly. To supplement and update this database, a new literature search was conducted using the ERIC (1999-2005), PsycINFO (1999-2005), and Dissertation Abstracts (1999-2005) databases. The search extended back to 2 years preceding the
Kuncel et al. (2001) study to account for publication lag and ensure that any updates
to the bibliographic databases were included. We did not conduct a meta-analysis of
the GRE-Analytical Writing exam. It is sufficiently new that relatively few studies
have examined its validity. The results of these literature searches were imported into a bibliographic database, and each record was examined to determine whether it might contain relevant data.
Relevant articles and dissertations were retrieved, and three yielded useable data
(Edwards & Schleicher, 2004; Fenster, Markus, Weidemann, Brackett, & Fernandez,
2001; Pearson, 2003). The final set of coded data was entirely composed of publicly
available journal articles, research reports, and dissertations.
Because of the complexity of the measures, situations, and outcomes represented in
the literature, precise coding of all statistics and moderator variables is critical.
Although previous research has found that coding accuracy is generally very high
(e.g., Kuncel et al., 2001; Whetzel & McDaniel, 1988; Zakzanis, 1998), to ensure the
reliability of the coded information we used a three-step process. It is important to
note that the coding reliability of the Kuncel et al. (2001) database was found to be
very good with more than 99% agreement and that database constitutes the majority
of the data presented here.
All studies were coded by two coders, and the results were compared for disagreements. Inconsistencies were resolved in meetings with the first author. Finally, the
double-coded data were examined by the first author who checked a random sample
of 20% of the articles, including those judged to contain no useable data. This third
step helped ensure that all articles with useable data are included and that all coded
data are accurate.
In some articles, correlational data were not reported, but other relevant information was included that allowed estimation of the magnitude of the effect. These results were converted from their presented form (e.g., t, d, χ2, p value, frequencies)
into correlations using standard conversion formulae (Hunter & Schmidt, 2004;
Lipsey & Wilson, 2001). Studies were only included if program level was specifically
discussed for the sample. In addition, some doctoral programs require a master’s
degree as a part of the program. Outcomes for these master’s students were not
included in this study to more clearly separate the two degree program levels.
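These conversions follow the standard formulas given in Hunter and Schmidt (2004) and Lipsey and Wilson (2001). As a minimal illustrative sketch only (the statistic values in the example are hypothetical, not drawn from the coded studies), the usual transformations look like this in Python:

```python
import math

def r_from_t(t, df):
    """Convert an independent-samples t statistic to a point-biserial r."""
    return t / math.sqrt(t**2 + df)

def r_from_d(d):
    """Convert a standardized mean difference d to r (equal group sizes assumed)."""
    return d / math.sqrt(d**2 + 4)

def r_from_chi2(chi2, n):
    """Convert a 1-df chi-square from a 2 x 2 table to a phi coefficient."""
    return math.sqrt(chi2 / n)

# Hypothetical example: t(58) = 2.10 from a study comparing two groups
print(round(r_from_t(2.10, 58), 2))  # about .27
```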
Sufficient information was available to examine three measures of student performance: first year grade point average (GPA), overall graduate GPA, and faculty
ratings. The first year GPA criterion included either first year or first semester grades.
The overall graduate GPA criterion consisted of studies that contained 2 or more years
of grades (i.e., second year cumulative GPA or more). Faculty ratings consisted of a
mixture of different rating types, including overall evaluations of performance, ratings
of professionalism, research accomplishment, and dissertation or thesis quality. Studies were excluded if graduate grades were self-reported because of concerns about
their accuracy (see Kuncel, Crede, & Thomas, 2005).
The Hunter and Schmidt (2004) psychometric meta-analytic method was used for
all analyses. In addition to statistically aggregating results across studies, psychometric meta-analysis allows for the correction of statistical artifacts that bias the average
observed validity estimate and allows us to estimate the amount of variance attributable to sampling error, range restriction, and unreliability. These meta-analytic
methods have been tested via Monte Carlo simulations on several occasions with
results in all cases indicating that these methods yield accurate results even in the presence of minor violations of key assumptions (e.g., Law, Schmidt, & Hunter, 1994;
Oswald & Johnson, 1998; Schulze, 2004).
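As a minimal sketch of the bare-bones portion of this approach (the sample-size-weighted mean correlation, the observed variance of correlations, and the variance expected from sampling error alone; the artifact-distribution corrections described below would then be applied on top of these quantities), and assuming the hypothetical study values shown:

```python
from typing import List, Tuple

def bare_bones_meta(studies: List[Tuple[float, int]]):
    """studies: list of (observed correlation, sample size N) pairs."""
    total_n = sum(n for _, n in studies)
    k = len(studies)
    # Sample-size-weighted mean observed correlation
    r_bar = sum(r * n for r, n in studies) / total_n
    # Sample-size-weighted observed variance of correlations
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in studies) / total_n
    # Variance expected from sampling error alone (Hunter & Schmidt, 2004)
    n_bar = total_n / k
    var_e = (1 - r_bar**2) ** 2 / (n_bar - 1)
    # Residual variance: real variation plus uncorrected artifacts
    var_res = max(var_obs - var_e, 0.0)
    return r_bar, var_obs, var_e, var_res

# Hypothetical studies: (observed validity, N)
print(bare_bones_meta([(.25, 120), (.31, 80), (.18, 200), (.29, 60)]))
```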
The artifact distribution method is used to address statistical artifact magnitude and
variability (Hunter & Schmidt, 2004). In this approach, available information is used
to make the corrections. The underlying assumption of this approach is that the available information reasonably represents the distribution of artifacts in the literature. For
this assumption to be violated, the reporting of artifact-relevant information in a study
would need to covary with the artifact in question. Because this seems unlikely,
artifact distributions are likely to provide reasonable corrections and almost certainly
result in less biased estimates than no corrections.
In estimating the predictive power of GRE scores, we are most interested in its
relationship for the full applicant population and not just for those students who are
admitted. To create such an estimate, we need information about the variability of the
incumbent (admitted) group and the group of students who applied to a program (i.e.,
the applicant group). The former is obtained directly from studies. The latter is almost
never reported in primary studies. To obtain good estimates, technical manuals and
reports are often excellent sources of information. One approach would be to correct
back to the total sample of all test takers. Although some arguments could be made for
this approach, we used a more conservative method. We linked samples back to the
standard deviation for test takers who indicated the intent to study in the same academic area. For example, for a study with a sample of mathematics students, the
standard deviation used was for all students who indicated that they intended to go to
graduate school in mathematics. Given that students tend to sort themselves in graduate school partially based on their verbal and quantitative abilities, the applicant
groups are likely to be more restricted than the sample of all test takers and tend to
reflect a matching of ability level to program (e.g., Kuncel & Klieger, 2007). Given
the applied research question raised by this study, this approach results in a more accurate but far smaller correction than what would be obtained if the standard deviations
for all test takers were used.
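A minimal sketch of a direct (Thorndike Case II) range restriction correction of the kind implied here, using the incumbent standard deviation and the applicant-group standard deviation for the matched intended area of study; the actual analyses applied such corrections through artifact distributions rather than study by study, and the numbers in the example are hypothetical:

```python
import math

def correct_range_restriction(r_obs: float, sd_restricted: float,
                              sd_applicant: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    r_obs: observed validity in the admitted (incumbent) sample
    sd_restricted: GRE standard deviation among admitted students
    sd_applicant: GRE standard deviation among test takers intending the same area
    """
    u = sd_applicant / sd_restricted  # ratio of unrestricted to restricted SD
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

# Hypothetical example: r = .25 observed, incumbent SD = 80, applicant-pool SD = 110
print(round(correct_range_restriction(.25, 80, 110), 2))  # about .33
```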
To further refine the corrections, we matched samples by time in addition to area.
The standard deviation of the GRE appears to have increased over time. To better
reflect the applicant groups at a given point in time, we used area estimates matched
as closely as possible to the same point in time. These were available for the following years: 1952, 1967-1968, 1974-1976, 1988-1991, 1992-1995, 1995-1996 (Briel,
O’Neill, & Scheuneman, 1993; Conrad, Trisman, & Miller, 1977; Educational Testing Service, 1996, 1997). Data were then sorted by degree level, and range restriction
artifact distributions were then created separately for each meta-analysis for the
GRE-V and GRE-Q. This approach created artifact distributions tailored by time,
discipline area, and the variability in incumbent test scores by degree areas. It is
unknown and a question for further research whether or not the standard deviations
for test takers who indicated the intent to study in the same academic area differ for
master’s and doctoral programs. These corrections result in less biased estimates of
the relationship between test scores and outcomes; however, admission decisions are
often based on other variables that are not presented in validation studies. These variables can lead to indirect range restriction, and the evidence to date suggests that
direct corrections (such as those used here) tend to yield conservative underestimates of
predictive power (Hunter, Schmidt, & Le, 2006). The ideal approach to addressing
restriction of range in graduate admission would be to conduct multivariate corrections (Aitken, 1934; Lawley, 1943; for a review, see Sackett & Yang, 2000). However,
such corrections require considerable information that is almost never available in a
meta-analysis.
In addition, we are generally interested in the predictive validity of the GRE for
outcomes unclouded by measurement error. Grades are not assigned consistently, and
ratings of performance are subject to rating errors. This results in weaker relationships
than what would be obtained with more reliable evaluations of student performance.
Therefore, the correlations were corrected for criterion unreliability. Estimates of college grade unreliability were obtained from three studies (Barritt, 1966; Bendig, 1953;
Reilly & Warech, 1993). Estimates of faculty rating reliability were taken from Kuncel
et al. (2004) as these estimates are more precise and conservative than those used in
Kuncel et al. (2001). The more conservative estimates yield smaller corrections and,
thus, smaller estimates of the correlation between the GRE and performance.
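A minimal sketch of the criterion unreliability correction, which divides the validity coefficient by the square root of the criterion reliability while leaving the predictor uncorrected (yielding an operational validity rather than a true-score correlation); the reliability value in the example is hypothetical rather than one of the estimates used in the analyses:

```python
import math

def correct_criterion_unreliability(r: float, criterion_reliability: float) -> float:
    """Disattenuate a validity coefficient for measurement error in the criterion only."""
    return r / math.sqrt(criterion_reliability)

# Hypothetical: r = .30 after range restriction correction, GPA reliability = .83
print(round(correct_criterion_unreliability(.30, .83), 2))  # about .33
```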
Given that discipline has a small effect on the predictive validity of the GRE
(Kuncel et al., 2001), frequency counts of discipline area were conducted for both the
overall graduate GPA and first year GPA analyses. Disciplines were classified into
humanities, social science, life science, and math/physical science categories.
Although a wide range of fields is represented, the social sciences in general and psychology and education in particular are most frequently represented. Therefore, the
aggregated results (being weighted means) will reflect the effect of program level in the social sciences more strongly, though not exclusively, than in other disciplines.
A chi-square was calculated to compare the distribution of disciplines by degree level.
The result was statistically significant at p < .10. Much of this appears to be driven by
the proportionately somewhat larger representation of the life sciences in the master’s
programs than in the doctoral programs. Given the desirability of stable estimates, the
overall small moderating effects of discipline, and the relatively balanced proportions
of disciplines across the two degree levels, we proceeded to analyze the data across
master’s and doctoral programs.
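For readers wishing to run the same kind of check on their own coding, the comparison is a standard chi-square test of independence on the discipline-by-degree-level frequency table; the counts below are hypothetical placeholders, not the frequencies coded in this study:

```python
from scipy.stats import chi2_contingency

# Rows: master's samples, doctoral samples; columns: humanities, social science,
# life science, math/physical science (hypothetical counts for illustration)
counts = [
    [6, 30, 12, 8],   # master's
    [4, 22, 5, 7],    # doctoral
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```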
Results and Discussion
The results of the meta-analysis are shown in Tables 1 through 3. As expected, both
GRE-V and GRE-Q were found to be valid predictors of graduate GPA and first year
graduate GPA in both master’s and doctoral programs. In addition, GRE-V and GRE-Q
scores were found to predict faculty ratings for both master’s and doctoral programs.
These results indicate that the GRE is effective for admission decision making for
both master’s- and doctoral-level work and should be incorporated in the application
process for both degree levels. Specifically, the GRE has both comparable and useful predictive validity for master's- and doctoral-level programs. The primary practical implication of these findings is that both doctoral and master's programs can continue to use the GRE and expect that it will provide useful
predictive information about their students.
One of the two smallest discrepancies between degree levels was for GRE-Q predicting graduate GPA with a corrected, operational validity of .28 for doctoral students
and .30 for master’s students. In predicting faculty ratings, GRE-V had an operational
validity of .32 for both master’s and doctoral programs.
Table 1. Meta-analysis of GRE Predictive Validity by Degree Level for Graduate GPA

                        N     k    robs   SDobs    r     SDr
Master's students
  GRE-Verbal         7,214   56    .29     .14    .38    .14
  GRE-Quantitative   6,864   55    .23     .14    .30    .12
Doctoral students
  GRE-Verbal         1,216   12    .21     .13    .27    .10
  GRE-Quantitative   3,757   21    .20     .14    .28    .16

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; r = operational validity; SDr = standard deviation of true score correlations; GRE = Graduate Record Examination; GPA = grade point average.
Table 2. Meta-analyses of GRE Predictive Validity by Degree Level for First Year GPA

                        N     k    robs   SDobs    r     SDr
Master's students
  GRE-Verbal         2,204   47    .27     .18    .35    .14
  GRE-Quantitative   2,204   47    .22     .16    .28    .06
Doctoral students
  GRE-Verbal         1,323   25    .22     .18    .29    .14
  GRE-Quantitative   1,250   24    .24     .16    .33    .10

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; r = operational validity; SDr = standard deviation of true score correlations; GRE = Graduate Record Examination; GPA = grade point average.
Table 3. Meta-analyses of GRE Predictive Validity by Degree Level for Faculty Ratings

                        N     k    robs   SDobs    r     SDr
Master's students
  GRE-Verbal           759    8    .23     .13    .32    .12
  GRE-Quantitative     759    8    .15     .13    .21    .11
Doctoral students
  GRE-Verbal         1,360   13    .23     .14    .32    .12
  GRE-Quantitative   1,199   11    .20     .11    .30    .05

Note: N = sample size; k = number of studies; robs = sample size weighted mean observed correlation; SDobs = observed standard deviation of correlations; r = operational validity; SDr = standard deviation of true score correlations; GRE = Graduate Record Examination.
One of the largest discrepancies in predictive validity across master’s and doctoral
programs involved the use of the GRE-V scores for predicting graduate GPA. The corrected operational validity of GRE-V scores for predicting graduate GPA was .38 for
master’s students and .27 for doctoral students. That one of the largest differences
involved graduate GPA, rather than first year GPA, is not surprising as it seems likely
that master’s and doctoral work would diverge later in the programs. However, the
direction of the difference is the opposite of what would be predicted from past
research on complexity and structure as moderators of the cognitive ability–
performance relationship. One possible explanation of the observed difference
would be the mixture of discipline areas, although this does not appear to be a likely
explanation as the distribution is fairly comparable across program level and discipline-specific effects tend to be small.
A second possibility would be differences in grading standards and variability by
degree level. To test this, we examined GPA standard deviations for those studies in the
GRE-V analyses for graduate GPA. Not all studies reported this information, but the
results are suggestive. On average, the standard deviation of GPAs for master’s students
was .40, whereas for doctoral programs it was only .21. The smaller range of grades for
doctoral students could cause the observed variation in validity coefficients. However,
studies reporting standard deviations were rare. Future research needs to more carefully
examine the nature of grades in graduate education to see if systematic trends, such as
greater grade inflation in doctoral programs, can be identified. Although we corrected
operational validities for unreliability in grades, separate data on grade unreliability for
each degree level were not available. Differences in grade reliability by degree level
also are a trend worth investigating. In addition, the type of courses taken by master’s
or doctoral students may be the source of the small observed differences. If quantitative
courses are the major source of variability in doctoral student GPA but not for master’s
students, then course taking patterns could partially account for the results.
The other large discrepancy involved the correlation of GRE-Q scores with faculty ratings, with an operational validity of .21 for master's students and .30 for doctoral students.
The direction of this difference is consistent with the complexity and structure being
higher in doctoral programs. However, it is not clear why complexity and structure
might moderate the validity of GRE-Q scores for predicting faculty ratings but not
affect the magnitude of GRE-V validities.
More work is needed to examine other aspects of student performance beyond
those covered here. Grades and faculty ratings are important measures of student performance. When grades are based on good assessment of student learning, they are
valuable pieces of information. Faculty have considerable contact with students, and
their evaluations of students' performance on nonclassroom aspects of their work enhance our confidence that the GRE predicts a range of important outcomes for both
degree levels. Of course, students engage in many additional important activities in
graduate school. This is especially so for doctoral students for whom GPA is often
considered a less important outcome of their training. It is important to note that
research has already found that standardized tests predict criteria other than grades
(Kuncel & Hezlett, 2007). The results of the present investigation, combined with the
general literature, create a strong case for the relevance of the GRE as a predictor for
multiple important aspects of student performance and success, but additional primary
studies are needed to fully examine the criterion space. Only then will a complete
picture of admissions and assessment be possible for doctoral and master’s programs
separately.
Meta-analytic results depend on prior academic literature. In this study, a varied but
not random representation of graduate programs was obtained. Therefore, results may
not fully generalize to all graduate programs. In addition, some analyses were based
on a relatively small number of studies. Additional primary research must be completed to make possible more robust meta-analytic estimates based on an even larger
number of samples and students.
This study, based on thousands of individuals and nearly 100 independent samples,
found considerable evidence for the validity of the GRE for both master's- and doctoral-level programs. Averaging across the two tests and grade measures, the validity of the GRE varied only .03 between master's (.30) and doctoral (.27) programs.
Based on the data currently available, the GRE is a useful decision-making tool for
both master’s- and doctoral-level programs. This investigation has elucidated diverse
variables that may potentially moderate the validity of the GRE for predicting academic performance but has not revealed systematic patterns of differences in validity
coefficients by degree level.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the authorship and/or
publication of this article.
Funding
The authors received no financial support for the research and/or authorship of this article. The
lead author received funding from ETS to support graduate students to conduct the study but did
not receive any personal funding for the project.
References
Aitken, A. C. (1934). Note on selection from a multivariate normal population. Proceedings of
the Edinburgh Mathematical Society, 4, 106-110.
Barritt, L. S. (1966). The consistency of first-semester college grade point average. Journal of
Educational Measurement, 3, 261-262.
Baumeister, A. A. (1998). Intelligence and the “personal equation.” Intelligence, 26, 255-265.
Bendig, A. W. (1953). The reliability of letter grades. Educational and Psychological Measurement, 13, 311-321.
Bridgeman, B., Jenkins, L., & Ervin, N. (1999, April). Variation in the prediction of college
grades across gender within ethnic groups at different selectivity levels. Paper presented at
the American Educational Research Association, Montreal, Quebec, Canada.
Briel, J. B., O’Neill, K., & Scheuneman, J. D. (Eds.). (1993). GRE technical manual. Princeton,
NJ: Educational Testing Service.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Oxford, UK: Houghton
Mifflin.
Conrad, L., Trisman, D., & Miller, R. (1977). GRE Graduate Record Examinations technical
manual. Princeton, NJ: Educational Testing Service.
Deary, I. J. (1996). Reductionism and intelligence: The case of inspection time. Journal of Biosocial Science, 28, 405-423.
Edgerton, H. A. (1958). The relationship of method of instruction to trainee aptitude pattern
(Tech. Rep., Contract ONR 1042). New York: Richardson, Bellows, & Henry.
Educational Testing Service. (1996). Interpreting your GRE General Test and Subject Test
scores: 1996-1997. Princeton, NJ: Author.
Educational Testing Service. (1997). Sex, race, ethnicity, and performance on the GRE General
Test: A technical report. Princeton, NJ: Author.
Edwards, W. R., & Schleicher, D. J. (2004). On selecting psychology graduate students: Validity
evidence for a test of tacit knowledge. Journal of Educational Psychology, 96, 592-602.
Fenster, A., Markus, K. A., Weidemann, C. F., Brackett, M. A., & Fernandez, J. (2001). Selecting tomorrow’s forensic psychologists: A fresh look at some familiar predictors. Educational
and Psychological Measurement, 61, 336-348.
Goldstein, I. L. (1993). Training in organizations (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity
and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC:
U.S. Department of Labor.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in
research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594-612.
Julian, E. R. (2005). Validity of the Medical College Admission Test for predicting medical
school performance. Academic Medicine, 80, 910.
Kuncel, N. R., Crede, M., & Thomas, L. L. (2005). The validity of self-reported grade point
averages, class ranks, and test scores: A meta-analysis. Review of Educational Research, 75,
63-82.
Kuncel, N. R., Crede, M., & Thomas, L. L. (2007). A comprehensive meta-analysis of the predictive validity of the Graduate Management Admission Test (GMAT) and undergraduate
grade point average (UGPA). Academy of Management Learning and Education, 6, 53-68.
Kuncel, N. R., Crede, M., Thomas, L. L., Klieger, D. M., Seiler, S. N., & Woo, S. E. (2005).
A meta-analysis of the Pharmacy College Admission Test (PCAT) and grade predictors of
pharmacy student success. American Journal of Pharmaceutical Education, 69, 339-347.
Kuncel, N. R., & Hezlett, S. A. (2007). Standardized tests predict graduate students’ success.
Science, 315, 1080-1081.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the
predictive validity of the Graduate Record Examinations: Implications for graduate student
selection and performance. Psychological Bulletin, 127, 162-181.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2004). Academic performance, career potential,
creativity, and job performance: Can one construct predict them all? [Special section: Cognitive abilities: 100 years after Spearman (1904)]. Journal of Personality and Social Psychology, 86, 148-161.
Kuncel, N. R., & Klieger, D. M. (2007). Application patterns when applicants know the
odds: Implications for selection research and practice. Journal of Applied Psychology, 92,
586-593.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994). A test of two refinements in procedures for
meta-analysis. Journal of Applied Psychology, 79, 978-986.
Lawley, D. N. (1943). A note on Karl Pearson's selection formulae. Proceedings of the Royal
Society of Edinburgh, Section A, 62(Pt. 1), 28-30.
Linn, R. L., & Hastings, C. N. (1984). A meta-analysis of the validity of predictors of performance in law school. Journal of Educational Measurement, 21, 245-259.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Oswald, F. L., & Johnson, J. W. (1998). On the robustness, bias, and stability of results from
meta-analysis of correlation coefficients: Some initial Monte Carlo findings. Journal of
Applied Psychology, 83, 164-178.
Pearson, L. L. (2003). Predictors of success in a graduate program in radiologic sciences.
Unpublished doctoral dissertation, Texas Woman’s University, Denton, TX.
Powers, D. E. (2004). Validity of Graduate Record Examinations (GRE) general test scores for
admissions to colleges of veterinary medicine. Journal of Applied Psychology, 89, 209-219.
Ramist, L., Lewis, C., & McCamley-Jenkins, L. (1994). Student group differences in predicting
college grades: Sex, language, and ethnic group (College Board Rep. No. 93-1). New York:
College Board.
Reilly, R. R., & Warech, M. A. (1993). The validity and fairness of alternatives to cognitive tests.
In L. C. Wing & B. R. Gifford (Eds.), Policy issues in employment testing (pp. 131-224).
Boston: Kluwer.
Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology.
Journal of Applied Psychology, 85, 112-118.
Schneider, L. M., & Briel, J. B. (1990). Validity of the GRE: 1988-1989 summary report.
Princeton, NJ: Educational Testing Service.
Schulze, R. (2004). Meta-analysis: A comparison of procedures. Cambridge, MA: Hogrefe &
Huber.
Snow, R. E., & Lohman, D. F. (1984). Toward a theory of cognitive aptitude for learning from
instruction. Journal of Educational Psychology, 76, 347-376.
Whetzel, D. L., & McDaniel, M. A. (1988). Reliability of validity generalization data bases.
Psychological Reports, 63, 131-134.
Zakzanis, K. K. (1998). The reliability of meta-analytic review. Psychological Reports, 83,
215-222.