High-Stakes Testing in Employment, Credentialing,
and Higher Education
Prospects in a Post-Affirmative-Action World
Paul R. Sackett, University of Minnesota, Twin Cities Campus
Neal Schmitt, Michigan State University
Jill E. Ellingson, The Ohio State University
Melissa B. Kabin, Michigan State University
Cognitively loaded tests of knowledge, skill, and ability
often contribute to decisions regarding education, jobs,
licensure, or certification. Users of such tests often face
difficult choices when trying to optimize both the performance and ethnic diversity of chosen individuals. The
authors describe the nature of this quandary, review research on different strategies to address it, and recommend
using selection materials that assess the full range of
relevant attributes using a format that minimizes verbal
content as much as is consistent with the outcome one is
trying to achieve. They also recommend the use of test
preparation, face-valid assessments, and the consideration
of relevant job or life experiences. Regardless of the strategy adopted, it is unreasonable to expect that one can
maximize both the performance and ethnic diversity of
selected individuals.
Editor's note. Sheldon Zedeck served as action editor for this article.
Author's note. Paul R. Sackett, Department of Psychology, University of Minnesota, Twin Cities Campus; Neal Schmitt and Melissa B. Kabin, Department of Psychology, Michigan State University; Jill E. Ellingson, Department of Management and Human Resources, The Ohio State University. Authorship order for Paul R. Sackett and Neal Schmitt was determined by coin toss.
Correspondence concerning this article should be addressed to Paul R. Sackett, Department of Psychology, University of Minnesota, N475 Elliott Hall, 75 East River Road, Minneapolis, MN 55455. Electronic mail may be sent to [email protected].

Cognitively loaded tests of knowledge, skill, and
ability are commonly used to help make employment, academic admission, licensure, and certification decisions (D'Costa, 1993; Dwyer & Ramsey, 1995;
Frierson, 1986; Mehrens, 1989). Law school applicants
submit scores on the Law School Admission Test (LSAT)
for consideration when making admission decisions. Upon
graduation, the same individuals must pass a state-administered bar exam to receive licensure to practice. Organizations commonly rely on cognitive ability tests when
making entry-level selection decisions and tests of knowledge and skill when conducting advanced-level selection.
High-school seniors take the Scholastic Assessment Test
(SAT) for use when determining college admissions and
the distribution of scholarship funds. Testing in these settings is termed high stakes, given the central role played by
such tests in determining who will and who will not gain
access to employment, education, and licensure or certification (jointly referred to as credentialing) opportunities.
The use of standardized tests in the knowledge, skill,
ability, and achievement domains for the purpose of facilitating high-stakes decision making has a history characterized by three dominant features. First, extensive research
has demonstrated that well-developed tests in these domains are valid for their intended purpose. They are useful,
albeit imperfect, descriptors of the current level of knowledge, skill, ability, or achievement. Thus, they are meaningful contributors to credentialing decisions and useful
predictors of future performance in employment and academic settings (Mehrens, 1999; Neisser et al., 1996;
Schmidt & Hunter, 1998; Wightman, 1997; Wilson, 1981).
Second, racial group differences are repeatedly observed in scores on standardized knowledge, skill, ability,
and achievement tests. In education, employment, and credentialing contexts, test score distributions consistently reveal significant mean differences by race (e.g., Bobko,
Roth, & Potosky, 1999; Hartigan & Wigdor, 1989; Jensen,
1980; Lynn, 1996; Neisser et al., 1996; Scarr, 1981;
Schmidt, 1988; N. Schmitt, Clause, & Pulakos, 1996;
Wightman, 1997; Wilson, 1981). Blacks tend to score
approximately one standard deviation lower than Whites,
and Hispanics score approximately two thirds of a standard
deviation lower than Whites. Asians typically score higher
than Whites on measures of mathematical-quantitative
ability and lower than Whites on measures of verbal ability
and comprehension. These mean differences in test scores
can translate into large adverse impact against protected
groups when test scores are used in selection and credentialing decision making. As subgroup mean differences in
test scores increase, it becomes more likely that a smaller
proportion of the lower scoring subgroup will be selected
or granted a credential (Sackett & Wilk, 1994).
Third, the presence of subgroup differences leads to
questions regarding whether the differences observed bias
resulting decisions. An extensive body of research in both
the employment and education literatures has demonstrated
that these tests generally do not exhibit predictive bias. In
other words, standardized tests do not underpredict the
performance of minority group members (e.g., American
Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Cole, 1981; Jensen, 1980; Neisser et al., 1996; O'Connor, 1989; Sackett & Wilk, 1994; Wightman,
1997; Wilson, 1981).
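The translation from subgroup mean differences to selection rates noted in the second point above can be made concrete with a short sketch. The sketch below is our illustration only: it assumes normally distributed scores with equal variances in both groups and a single common cut score, and the selection rate and d values used are hypothetical, chosen to show the direction and rough size of the effect.

```python
from scipy.stats import norm

def minority_selection_rate(d, majority_rate):
    """Selection rates under one common cut score, assuming normal scores with
    equal variances and a minority mean d SDs below the majority mean."""
    cut = norm.ppf(1 - majority_rate)        # cut score in majority-group z units
    minority_rate = 1 - norm.cdf(cut + d)    # same cut applied to the shifted distribution
    return minority_rate, minority_rate / majority_rate

# Hypothetical illustration: the cut score admits the top 20% of majority applicants.
for d in (0.3, 0.7, 1.0):
    rate, ratio = minority_selection_rate(d, 0.20)
    # Impact ratios below .80 fall short of the familiar four-fifths rule of thumb.
    print(f"d = {d:.1f}: minority selection rate = {rate:.2f}, impact ratio = {ratio:.2f}")
```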
These features of traditional tests cause considerable
tension for many organizations and institutions of higher
learning. Most value that which is gained through the use of tests valid for their intended purpose (e.g., a higher performing workforce, a higher achieving student body, a cadre of credentialed teachers who meet knowledge, skill, and achievement standards). Yet, most also value racial and ethnic diversity in the workforce or student body, with rationales ranging from a desire to mirror the composition of the community to a belief that academic experiences or
workplace effectiveness are enhanced by exposure to diverse perspectives. What quickly becomes clear is that
these two values—performance and diversity—come into
conflict. Increasing emphasis on the use of tests in the
interest of gaining enhanced performance has predictable
negative consequences for the selection of Blacks and
Hispanics. Conversely, decreasing emphasis on the use of
tests in the interest of achieving a diverse group of selectees
often results in a substantial reduction in the performance
gains that can be recognized through test use (e.g., Schmidt, Mack, & Hunter, 1984; N. Schmitt et al., 1996).
This dilemma is well-known, and a variety of resolution strategies have been proposed. One class of strategies
involves some form of minority group preference; these
strategies were the subject of an American Psychologist
article by Sackett and Wilk (1994) that detailed the history,
rationale, consequences, and legal status of such strategies.
However, a variety of recent developments indicate a
growing trend toward bans on preference-based forms of
affirmative action. The passage of the Civil Rights Act of
1991 made it unlawful for employers to adjust test scores as
a function of applicants' membership in a protected group.
The U.S. Supreme Court's decisions in City of Richmond v. J. A. Croson Co. (1989) and Adarand Constructors, Inc. v. Peña (1995) to overturn set-aside programs that reserved a percentage of contract work for minority-owned businesses signaled the Court's stance toward preference-based affirmative action (Mishkin, 1996). The U.S. Fifth Circuit Court of Appeals ruled in Hopwood v. State of Texas (1996) that race could not be used as a factor in university admissions decisions (Kier & Davenport, 1997; Mishkin,
1996). In 1996, the state of California passed Proposition
209 prohibiting the use of group membership as a basis for
any selection decisions made by the state, thus affecting
public sector employment and California college admissions (Pear, 1996). Similarly, voters in the state of Washington approved Initiative 200, which bars the use of race in state
hiring, contracting, and college admissions (Verhovek &
Ayres, 1998).
Strategies for Achieving Diversity
Without Minority Preference
In light of this legal trend toward restrictions on preference-based routes to diversity, a key question emerges: What are
the prospects for achieving diversity without minority preference and without sacrificing the predictive accuracy and
content relevancy present in knowledge, skill, ability, and
achievement tests? Implicit in this question is the premise
that one values both diversity and the performance outcomes that an organization or educational institution may
realize through the use of tests. If one is willing to sacrifice
quality of measurement and predictive accuracy, there are
many routes to achieving diversity including random selection, setting a low cut score, or the use of a low-impact
predictor even though it may possess little to no predictive
power. On the other hand, if one values performance outcomes but does not value diversity, maximizing predictive
accuracy can be the sole focus. We suggest that most
organizations and educational institutions espouse neither
of these extreme views and instead seek a balance between
diversity concerns and performance outcomes. Clearly, the
use of traditional tests without race-based score adjustment
fails to achieve such a balance. However, what alternatives
are available for use in high-stakes, large-scale assessment
contexts? In this article, we review various alternative
strategies that have been put forth in the employment,
education, and credentialing literatures. We note that some
strategies have been examined more carefully in some
domains than in others, and thus the attention we devote to
employment, education, and credentialing varies across
these alternatives.
The first strategy involves the measurement of constructs with little or no adverse impact along with traditional cognitively loaded knowledge, skill, ability, and
achievement measures. The notion is that if we consider
other relevant constructs along with knowledge, skill, ability, and achievement measures when making high-stakes
decisions, subgroup differences should be lessened because
alternatives such as measures of interpersonal skills or
personality usually exhibit smaller differences between ethnic and racial subgroups. A second strategy investigates
test items in an effort to identify and remove those items that are culturally laden. It is generally believed that because those items likely reflect irrelevant, culture-bound
factors, their removal will improve minority passing rates.
The use of computer or video technology to present test
stimuli and collect examinee responses constitutes a third
strategy. Using these technologies usually serves to minimize the reading and writing requirements of a test. Reduction of adverse impact may be possible when the reading or writing requirements are inappropriately high. Also,
video technology may permit the presentation of stimulus
materials in a fashion that more closely matches the performance situation of interest. Attempts to modify how
examinees approach the test-taking experience constitute a
fourth strategy. To the extent that individuals of varying
ethnic and racial groups exhibit different levels of test-taking motivation, attempts to enhance examinee motivation levels may reduce subgroup differences. Furthermore,
changing the way in which the test and its questions are
presented may impact how examinees respond, a result that
could also facilitate minority test performance. A fifth
strategy has been to document relevant knowledge, accomplishments, or achievements via portfolios, performance
assessments, or accomplishment records. Proponents of
this strategy maintain that this approach is directly relevant
to desired outcomes and hence should constitute a more fair
assessment of the knowledge, skill, ability, or achievement
domain of interest for members of all subgroups. Finally, we also review the use of coaching or orientation programs that provide examinees with information about the test and study materials or aids to facilitate optimal performance. In addition, we consider whether modifying the time limits prescribed for testing helps reduce subgroup differences. In the following sections, we review the literature relevant to each of these efforts in order to understand the nature of subgroup differences on knowledge, skill, ability, and achievement tests and to ascertain the degree to which these efforts have been effective in reducing these differences.

Use of Measures of Additional Relevant Constructs
Cognitively loaded knowledge, skill, ability, and achievement tests are among the most valid predictors available when selecting individuals across a wide variety of educational and employment situations (Schmidt & Hunter, 1981, 1998). Therefore, a strategy for resolving the dilemma that allows for the use of such tests is readily
appealing. To that end, previous research has identified a
number of noncognitive predictors that are also valid when
making selection decisions in most educational and employment contexts. Measures of personality and interpersonal skills generally exhibit smaller mean differences by
ethnicity and race and also are related to performance on
the job or in school (e.g., Barrick & Mount, 1991; Bobko et al., 1999; Mount & Barrick, 1995; Sackett & Wilk, 1994; Salgado, 1997; N. Schmitt et al., 1996; Wolfe & Johnson,
1995). The use of valid, noncognitive predictors, in combination with cognitive predictors, serves as a very desirable strategy in that it offers the possibility of simultaneously meeting multiple objectives. If additional constructs, beyond those measured by the traditional test, are
relevant for the job or educational outcomes of interest,
supplementing cognitive tests offers the prospect of increased validity when predicting those outcomes. If those
additional constructs are ones on which subgroup differences are smaller, a composite of the traditional test and the
additional measures will often exhibit smaller subgroup
differences than the traditional test alone. The prospect of
simultaneously increasing validity and reducing subgroup
differences makes this a strategy worthy of careful study.
Several different approaches have been followed
when examining this strategy. On the basis of the psychometric theory of composites, Sackett and Ellingson (1997)
developed a set of implications helpful in estimating the
effect of a supplemental strategy on adverse impact. First,
consider a composite of two uncorrelated measures, where
d1 = 1.0 and d2 = 0.0. Although intuition may suggest that a composite of the two will split the difference (i.e., result in a d of 0.5), the computed value is 0.71. Thus, whereas
supplementing a cognitively loaded test with an uncorrelated measure exhibiting no subgroup differences will reduce the composite subgroup difference, this reduction will
be less than some might expect. Second, a composite may
result in a d larger than either of the components making up
the composite if the two measures are moderately correlated; in essence, the composite reflects a more reliable measure of the underlying characteristic reflected in both variables. Third, adding additional supplemental measures has diminishing returns. For example, when d1 = 1.0 and each additional measure is uncorrelated with the original measure and has d = 0.0, the composite ds for adding a second, third, fourth, and fifth measure are 0.71, 0.58, 0.50, and 0.45, respectively. (Note that adding additional predictors exhibiting lower d values would not be done unless those additional predictors are themselves related to the outcome one hopes to predict; see N. Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997, for additional analytic work on the effects of combining predictors.)
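The composite arithmetic described above can be reproduced in a few lines. The sketch below is our illustration, not a published implementation; it assumes standardized components, equal within-group variances, and the unit (or supplied) weights shown.

```python
import numpy as np

def composite_d(component_ds, R, weights=None):
    """d for a weighted sum of standardized components, given each
    component's d and the pooled within-group correlation matrix R."""
    ds = np.asarray(component_ds, dtype=float)
    R = np.asarray(R, dtype=float)
    w = np.ones_like(ds) if weights is None else np.asarray(weights, dtype=float)
    return float((w @ ds) / np.sqrt(w @ R @ w))

# Two uncorrelated measures with d1 = 1.0 and d2 = 0.0: the composite d is 0.71, not 0.5.
print(round(composite_d([1.0, 0.0], np.eye(2)), 2))
# Diminishing returns: one d = 1.0 measure plus one to four uncorrelated zero-d measures.
for k in range(2, 6):
    print(k, "components:", round(composite_d([1.0] + [0.0] * (k - 1), np.eye(k)), 2))
# Two moderately correlated components can yield a composite d larger than either alone.
print(round(composite_d([0.5, 0.5], [[1.0, 0.2], [0.2, 1.0]]), 2))  # 0.65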
A second approach followed when examining this
strategy has been the empirical tryout of different composite alternatives. For example, in an employment context,
Pulakos and Schmitt (1996) compared a traditional verbal ability measure with three alternative predictors: a biographical data measure (Stokes, Mumford, & Owens, 1994), a situational judgment test (Motowidlo, Dunnette, & Carter, 1990), and a structured interview. Whereas the
Black-White d for the verbal ability measure was 1.03, a
composite of all four predictors produced a d of 0.63. A
composite of the three alternative predictors produced a d
of only 0.23. The inclusion of the verbal ability measure in
the composite did increase the multiple correlation between
the predictors and a performance criterion from .41 to .43.
However, it did so at the cost of a considerable increase in
subgroup differences (0.23 vs. 0.63). A similar pattern of
findings was observed when comparing Hispanics and
Whites. Similar results were found by Ryan, Ployhart, and
Friedel (1998).
A third approach for evaluating the legitimacy of the composite strategy has relied on meta-analytic findings. Building on estimates reported in Ford, Kraiger, and Schechtman (1986) and N. Schmitt et al. (1997), Bobko et al. (1999) refined previously cumulated validities, intercorrelations, meta-analytically derived estimates of subgroup differences in performance, and ds associated with four predictors commonly used in the employment domain (cognitive ability, biodata, interviews, and conscientiousness). Using these refined estimates, Bobko et al. compared the validity implications of various composites with the effect on subgroup differences. The four-predictor composite yielded a multiple correlation of .43 when predicting performance, with d estimated at 0.76. The validity of the cognitive ability predictor alone was estimated at .30, with d estimated at 1.00. Thus, using all four predictors reduced subgroup differences by 0.24 standard deviation units and increased the multiple correlation by .13. These results mirror earlier research conducted by Ones, Viswesvaran, and Schmidt (1993), in which integrity test validities were estimated and discussed with respect to their influence on adverse impact via composite formation.
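The kind of comparison reported by Bobko et al. (1999) rests on straightforward matrix calculations. The sketch below shows the form of those calculations with two predictors; the correlations and ds are hypothetical placeholders, not the meta-analytic estimates those authors used.

```python
import numpy as np

# Hypothetical inputs: a cognitive test plus one lower-impact alternative predictor.
Rxx = np.array([[1.0, 0.3],      # predictor intercorrelations
                [0.3, 1.0]])
rxy = np.array([0.30, 0.25])     # validities against the performance criterion
ds = np.array([1.00, 0.20])      # Black-White ds for each predictor

beta = np.linalg.solve(Rxx, rxy)                 # standardized regression weights
multiple_R = float(np.sqrt(rxy @ beta))          # validity of the regression-weighted composite
d_composite = float((beta @ ds) / np.sqrt(beta @ Rxx @ beta))

# With these placeholder values, validity rises from .30 to about .34
# while the composite d falls from 1.00 to about 0.82.
print(round(multiple_R, 2), round(d_composite, 2))
```

The pattern, higher validity coupled with a smaller but still substantial d, is the tradeoff the meta-analytic work documents.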
Weighting of criterion components. Each
of these studies is predicated on the assumption that the
outcome measure against which the predictors are being
validated is appropriately constituted. Educational performance, job performance, and credentialing criteria, however, may be defined in terms of a single dimension or
multiple dimensions. For example, if the outcome of interest involves only the design of a new product or performance on an academic test, performance may be unidimensional. If, however, the outcome of interest requires
citizenship behavior or effective coordination as well, then
performance will be more broadly defined. When performance is multidimensional, institutions may choose to assign different weights to those various dimensions in an
effort to reflect that they vary in importance.
De Corte (1999) and Hattrup, Rock, and Scalia (1997)
have shown how the weighting of different elements of the
criterion space can affect the regression weights assigned
to different predictors, the level of adverse impact, and
predicted performance. Hattrup et al. used cumulated
correlations from three different studies to estimate the
relationship between contextual performance (behaviors
supporting an organization's climate and culture), task performance (behaviors that support the delivery of goods or
services), cognitive ability, and work orientation. Using
regression analyses, cognitive ability and work orientation
were used to predict criterion composites in which contextual performance and task performance were assigned varying weights. The regression weight for cognitive ability
was highest when task performance was weighted heavily
in the criterion composite, whereas work orientation received the largest regression weight when contextual performance was the more important part of the criterion. As
expected, adverse impact was the greatest when task performance was weighted more heavily and it decreased as
contextual performance received more weight. Relative to
a composite wherein task and contextual performance were
weighted equally, the percentage of minorities that would
be selected varied considerably depending on the relative
weight of task versus contextual performance. De Corte reached similar conclusions.
The Hattrup et al. (1997) and De Corte (1999) analyses were formulated in employment terms, yet the same
principles hold for educational and credentialing tests as
well. For example, when licensing lawyers, the licensing
body is concerned about both technical competence and
professional ethics. The general principle relevant across
settings is that when multiple criterion dimensions are of
interest, the weights given to the criterion dimensions can
have important effects on the relationship between the
predictors and the overall criterion. The higher the weight given to cognitively loaded criterion dimensions, the higher the resulting weight given to cognitively loaded predictors.
The higher the weight given to cognitively loaded predictors, the greater the resulting subgroup difference. In response, one may be tempted to simply choose criterion
dimension weights on the basis of their potential for reducing subgroup differences. Such a strategy would be errant,
however, as criterion weights should be determined primarily on the basis of an analysis of the performance domain
of interest and the values that an institution places on the
various criterion dimensions.
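A small numerical sketch in the spirit of these analyses may help. All correlations and ds below are hypothetical values chosen only to show the direction of the effect; they are not the estimates used by Hattrup et al. (1997) or De Corte (1999).

```python
import numpy as np

Rxx = np.array([[1.0, 0.2], [0.2, 1.0]])  # cognitive ability, work orientation
r_task = np.array([0.50, 0.15])           # hypothetical validities vs. task performance
r_ctx = np.array([0.15, 0.35])            # hypothetical validities vs. contextual performance
r_tc = 0.30                               # correlation between the two criterion dimensions
d_pred = np.array([1.00, 0.20])           # hypothetical predictor Black-White ds

def composite_for_criterion(w_task):
    """Regression weights and predictor-composite d when the criterion weights
    task performance w_task and contextual performance 1 - w_task."""
    w_ctx = 1.0 - w_task
    crit_sd = np.sqrt(w_task**2 + w_ctx**2 + 2 * w_task * w_ctx * r_tc)
    rxy = (w_task * r_task + w_ctx * r_ctx) / crit_sd   # predictor-criterion correlations
    beta = np.linalg.solve(Rxx, rxy)                    # regression weights on the predictors
    d = float((beta @ d_pred) / np.sqrt(beta @ Rxx @ beta))
    return beta, d

for w in (1.0, 0.5, 0.0):
    beta, d = composite_for_criterion(w)
    print(f"task weight {w:.1f}: betas = {np.round(beta, 2)}, composite d = {d:.2f}")
```

As the task weight falls, the cognitive predictor's regression weight falls and the composite d falls with it, which is the pattern the studies above report.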
Summary. Research on the strategy of measuring
different relevant constructs illustrates that it does matter
which individual differences are assessed in high-stakes
decision making if one is concerned about maximizing
minority subgroup opportunity. Minority pass rates can be
improved by including noncognitive predictors in a test
battery. However, adding predictors with little or no impact
will not eliminate adverse impact from a battery of tests
that includes cognitively loaded knowledge, skill, ability,
and achievement measures. Reduction in adverse impact
results from a complex interaction between the validity of
the individual predictors, their intercorrelation, the size of subgroup differences on the combination of tests used, the selection ratio, and the manner in which the tests are used. In fact, in most situations wherein a variety of knowledge, skills, and abilities are considered when making selection decisions, adverse impact will remain at legally unacceptable levels and subgroup mean differences on the predictor battery will not be a great deal lower than the differences observed for cognitive ability alone. The composition of the test battery should reflect the individual differences required to perform in the domain of interest. If institutions focus mainly or solely on task performance, then cognitive ability will likely be the most important predictor and adverse impact will be great. If, however, they focus on a broader domain that involves motivational, interpersonal, or personality dimensions as well as cognitive ability, then adverse impact may be reduced.

Identification and Removal of Culturally Biased Test Items
A second strategy pursued in an attempt to resolve the
performance versus diversity dilemma involves investigating the possibility that certain types of test items are biased.
The traditional focus of studies examining differential item functioning (DIF; Berk, 1982) has been on the identification of items that function differently for minority versus
majority test takers. Conceivably, such items would contribute to misleading test scores for members of a particular
subgroup. Statistically, DIF seeks items that vary in difficulty for members of subgroups that are actually evenly matched on the measured construct. That is, an attempt is
made to identify characteristics of items that lead to poorer
performance for minority-group test takers than for equally
able majority-group test takers. Assuming such a subset of
items or item characteristics exists, it must define a race-related construct that is distinguishable from the construct the test is intended to measure (McCauley & Mendoza, 1985). Perhaps because of the availability of larger
sample sizes in large-scale testing programs, much of the
DIF research conducted to date has been done using educational and credentialing tests.
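To make the matching logic concrete, the sketch below computes one widely used DIF index, the Mantel-Haenszel common odds ratio, for a single dichotomously scored item after stratifying examinees on total score. It is offered only as an illustration of the general approach; it is not the specific procedure used in the studies reviewed below, and the data layout (binary item scores, arrays of group codes and matching scores) is our assumption.

```python
import numpy as np

def mh_odds_ratio(item, group, total):
    """Mantel-Haenszel common odds ratio for one 0/1-scored item.
    item, group, total are equal-length numpy arrays; group uses
    1 = reference group, 0 = focal group; total is the matching score."""
    num = den = 0.0
    for k in np.unique(total):
        stratum = total == k
        a = np.sum(stratum & (group == 1) & (item == 1))  # reference, correct
        b = np.sum(stratum & (group == 1) & (item == 0))  # reference, incorrect
        c = np.sum(stratum & (group == 0) & (item == 1))  # focal, correct
        d = np.sum(stratum & (group == 0) & (item == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")

# Values near 1.0 suggest no DIF; ETS reports the same quantity on a delta
# scale as MH D-DIF = -2.35 * ln(odds ratio), with larger absolute values
# flagging items for review.
```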
Initial evidence for DIF was provided by Medley and
Quirk (1974), who found relatively large group by item
interactions in a study of the performance of Black and
White examinees on the National Teacher Examination
items reflecting African American art, music, and literature. One should note, however, that these results are based
on examinee performance on a test constructed using culturally specific content. Ironson and Subkoviak (1979) also found evidence for DIF when they evaluated five cognitive subtests administered as part of the National Longitudinal
Study of 1972. The verbal subtest, measuring vocabulary
and reading comprehension, contained the largest number
of items flagged as biased. Items at the end of each of the
subtests also tended to be biased against Black examinees,
presumably because of lower levels of reading ability or
speed in completing these tests. More recently, R. Freedle and Kostin (1990; R. Freedle & Kostin, 1997) showed that
Black examinees were more likely to get difficult verbal
items on the Graduate Record Examination (GRE) and the
SAT correct when compared with equally able White examinees. However, the Black examinees were less likely to
get the easy items right. In explanation, they suggested that
the easier items possessed multiple meanings more familiar
to White examinees, whose culture was most dominant in
the test items and the educational system.
Whitney and Schmitt (1997) investigated the extent to
which DIF may be present in biographical data measures
developed for use in an employment context. In an effort to
further DIF research, Whitney and Schmitt focused not
only on identifying whether biodata items may exhibit DIF,
but also whether the presence of DIF can be traced to
differences in cultural values between racial subgroups.
More than one fourth of the biodata items exhibited DIF
between Black and White examinees. Moreover, the Black
and White examinees differentially endorsed item response
options designed to reflect differing cultural notions of
human nature, the environment, time orientation, and interpersonal relations. However, there was only limited evidence that the differences observed between subgroups in
cultural values were actually associated with DIF. After
removal of all DIF items, the observed disparity in test
scores between Blacks and Whites was eliminated, a disparity that, incidentally, favored Blacks over Whites.
Other studies evaluating the presence of DIF have
proved less interpretable. Scheuneman (1987) developed
item pairs that reflected seven major hypotheses about
potential sources of item bias in the experimental portion of
the GRE. The results were inconsistent in showing that
these manipulations produced a more difficult test for minority versus majority examinees. In some instances, the
manipulations produced a greater impact on Whites than
Blacks. In other instances, a three-way interaction between
group, test version, and items indicated that some uncontrolled factor (e.g., content of a passage or item) was
responsible for the subgroup difference. Scheuneman and
Gerritz (1990) examined verbal items from the GRE and
the SAT that consisted of short passages followed by
questions. Although they did identify several item features
possibly linked to subgroup differences (e.g., content dealing with science, requiring that examinees identify the
major thesis in a paragraph), the results, as a whole, yielded
no clear-cut explanations. Scheuneman and Gerritz concluded that DIF may result from a combination of item
features, the most important of which seems to be the
content of the items. Similarly inconclusive results were
reported by A. P. Schmitt and Dorans (1990) in a series of
studies on SAT-Verbal performance. Items that involved
the use of homographs (i.e., words that are spelled like
other words with a different meaning) were more difficult
for otherwise equally able racial and ethnic group members. Yet, when nonnative English speakers were removed
from the analyses, there were few remaining DIF items.
Schmeiser and Ferguson (1978) examined the English
usage and social studies reading tests of the American
College Test (ACT) and found little support for DIF. Two
English tests and three social studies tests were developed
to contain different content, while targeting the same cognitive skills. None of the interactions between test content
and racial and ethnic group were statistically significant.
Similarly, Scheuneman and Grima (1997) reported that the
verbal characteristics of word problems (e.g., readability
indexes, the nature of the arguments, and propositions) in
the quantitative section of the GRE were not related to DIF
indexes.
These results indicate that although DIF may be detected for a variety of test items, it is often the case that the
magnitude of the DIF effect is very small. Furthermore,
there does not appear to be a consistent pattern of items
favoring one group versus another. Results do not indicate
that removing these items would have a large impact on
overall test scores. In addition, we know little about how
DIF item removal will affect test validity. However, certain
themes across these studies suggest the potential for some
DIF considerations. Familiarity with the content of items
appears to be important. The verbal complexity of the items
is also implicated, yet it is not clear what constitutes verbal
complexity. Differences in culture are often cited as important determinants of DIF, but beyond the influence of
having English as one's primary language, we know little
about how cultural differences play a role in test item
performance.
Use of Alternate Modes of Presenting Test
Stimuli
A third strategy to reduce subgroup differences in tests of
knowledge, skill, ability, and achievement has been to
change the mode in which test items or stimulus materials
are presented. Most often this involves using video or
auditory presentation of test items, as opposed to presenting test items in the normal paper-and-pencil mode. Implicit in this strategy is the assumption that reducing irrelevant written or verbal requirements will reduce subgroup
differences. In support of this premise, Sacco et al. (2000)
demonstrated the relationship between reading level and
subgroup differences by assessing the degree to which the
readability level of situational judgment tests (SJTs) was
correlated with the size of subgroup differences in SJT
performance. They estimated that 10th-, 12th-, and 14th-grade reading levels would be associated with Black-White
ds of 0.51, 0.62, and 0.74, respectively; reading levels at
the 8th, 10th, and 13th grades would be associated with
Hispanic-White ds of 0.38, 0.48, and 0.58, respectively.
This would suggest that reducing the readability level of a
test should in turn reduce subgroup differences. One must
be careful, however, to remove only verbal requirements
irrelevant to the criterion of interest, as Sacco et al. also
demonstrated that verbal ability may partially account for
SJT validities. In reviewing the research on this strategy,
we focused on three key studies investigating video as an
alternative format to a traditional test. The three selected
studies illustrate issues central in examining the effects of
format changes on subgroup differences. For other useful
studies that also touch on this issue, see Weekley and Jones
(1997) and N. Schmitt and Mills (in press).
Pulakos and Schmitt (1996) examined three measures
of verbal ability. A paper-and-pencil measure testing verbal
analogies, vocabulary, and reading comprehension had a
validity of .19 when predicting job performance and a
Black-White d = 1.03. A measure that required examinees
to evaluate written material and write a persuasive essay on
that material had a validity of .22 and a d = 0.91. The third
measure of verbal ability required examinees to draft a
description of what transpired in a short video. The validity
of this measure was .19 with d = 0.45. All three measures
had comparable reliabilities (i.e., .85 to .92). Comparing
the three measures, there was some reduction in d when
written materials involving a realistic reproduction of tasks
required on the job were used rather than a multiple-choice
paper-and-pencil test. When the stimulus material was visual rather than written, there was a much greater reduction
in subgroup differences (i.e., Black-White d values
dropped from 1.03 to 0.45, with parallel findings reported
for Hispanics). In terms of validity, the traditional verbal
ability test was the most predictive (r = .39), whereas the
video test was less so (r = .29, corrected for range restriction and criterion unreliability).
Chan and Schmitt (1997) evaluated subgroup differences in performance on a video-based SJT and on a
written SJT identical to the script used to produce the video
enactment. The written version of the test displayed a
subgroup difference of 0.95 favoring Whites over Blacks,
whereas the video-based version of the test produced a
subgroup difference of only 0.21. Corrections for unreliability produced differences of 1.19 and 0.28. These differences in d were matched by subgroup differences in
perceptions of the two tests. Both groups were more favorably disposed (as indicated by perceptions of the tests' face
validity) to the video version of the test, but Blacks significantly more so than Whites.
Sackett (1998) summarized research on the use of the
Multistate Bar Examination (MBE), a multiple-choice test
of legal knowledge and reasoning, and research conducted
by Klein (1983) examining a video-based alternative to the
MBE. The video test presented vignettes of lawyers taking
action in various settings. After each vignette, examinees
were asked to evaluate the actions taken. Millman, Mehrens,
and Sackett (1993) reported a Black-White d of 0.89 for
the MBE; an identical value (d = 0.89) was estimated by
Sackett for the video test.
Taken at face value, the Sackett (1998), Chan and
Schmitt (1997), and Pulakos and Schmitt (1996) results
appear contradictory regarding the impact of changing
from a paper-and-pencil format to a video format. These
contradictions are more apparent than real. First, it is important to consider the nature of the focal construct. In the
Pulakos and Schmitt study of verbal ability and the Sackett
study of legal knowledge and reasoning, the focal construct
was clearly cognitive in nature, falling squarely into the
domain of traditional knowledge, skill, ability and achievement tests. In Chan and Schmitt, however, the focal construct (i.e., interpersonal skill) was, as the name implies,
not heavily cognitively loaded. In fact, the SJT used was of
the type often suggested as a potential additional measure
that might supplement a traditional test. Chan and Schmitt
provided data supporting the hypothesis that the relatively
large drop in d was due in part to the removal of an implicit
reading comprehension component present in the written
version of the SJT. When that component was removed
through video presentation, the measure became less cognitively loaded and d was reduced.
But what of the differences between the verbal skill
(Pulakos & Schmitt, 1996) and legal knowledge and reasoning studies (Sackett, 1998)? We use this comparison to
highlight the importance of examining the correlation between the traditional written test and the alternative video
test. In the legal knowledge and reasoning study, in which
the video did not result in reduced subgroup differences,
the correlation between the traditional test and the video
test was .89, corrected for unreliability. This suggests that
the change from paper-and-pencil to video was essentially
a format change only, with the two tests measuring the
same constructs. In the verbal skills study, in which the
video resulted in markedly smaller subgroup differences,
the correlation between the two, corrected for unreliability
in the two measures, was .31. This suggests that the video-based test in the verbal skills study reflected not only a
change in format, but a change in the constructs measured
as well. An examination of the scoring procedures for the
verbal skills video test supports this conclusion. Examinee
essays describing video content were rated on features of
verbal ability (e.g., sentence structure, spelling) and on
completeness of details reported. We suggest that scoring
for completeness introduced into the construct domain personality characteristics such as conscientiousness and detail
orientation, both traits exhibiting smaller subgroup differences. Consistent with the arguments made above with
regard to the ability of composites to reduce subgroup
differences, the reduction in d observed for the verbal skills
video test was due not to the change in format, but rather to
the introduction of additional constructs that ameliorated
the influence of verbal ability when determining d. Thus,
the research to date indicates that changing to a video
format does not per se lead to a reduction in subgroup
differences. Future research into the influence of alternative
modes of testing should take steps to control for the unintended introduction of additional constructs beyond those
being evaluated. Failure to separate test content from test
mode will confound results, blurring our ability to understand the actual mechanism responsible for reducing subgroup differences.
Last, we wish to highlight the role of reliability when
comparing traditional tests with alternatives. Focusing on
the legal knowledge and reasoning study (Sackett, 1998),
recall that the Black-White difference was identical (d =
0.89) for both tests. We now add reliability data to our
discussion. Internal consistency reliability for the MBE
was .91; correcting the subgroup difference for unreliability results in a corrected value of d = 0.93. Internal
consistency reliability for the video test was .64; correcting
the subgroup difference for unreliability results in a corrected value of d = 1.11. In other words, after taking
differences in reliability into account, the alternative video
test results in a larger subgroup difference than the traditional paper-and-pencil test. Such an increase is possible in
that a traditional test may tap declarative knowledge,
whereas a video alternative may require the application of
that knowledge, an activity that will likely draw on higher
order cognitive skills resulting in higher levels of d.
Clearly, any conclusion about the effects of changing test
format on subgroup differences must take into account
reliability of measurement. What appears to be a format
effect may simply be a reliability effect. Because different
reliability estimation methods focus on different sources of
measurement error (e.g., inconsistency across scorers, content sampling), taking reliability into account will require
considering the most likely sources of measurement error
in a particular setting.
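The correction used in these comparisons is the standard correction of d for attenuation due to unreliability of the measure: the observed d divided by the square root of the reliability. A minimal sketch, reproducing the values quoted above (the function name is ours):

```python
import math

def d_corrected_for_unreliability(d_observed, reliability):
    """Disattenuate a standardized mean difference for measurement error."""
    return d_observed / math.sqrt(reliability)

print(round(d_corrected_for_unreliability(0.89, 0.91), 2))  # MBE: 0.93
print(round(d_corrected_for_unreliability(0.89, 0.64), 2))  # video alternative: 1.11
```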
The results reported in this section document the large
differences in d that are typically observed when comparisons are made between different test formats. Examinations of reasons for these differences are less conclusive. It
is not clear that the lower values of d are a function of test
format. Cognitive ability requirements and the reading requirements of these tests seem to be a major explanation.
Clearly more research would be helpful. Separating the
method of assessment from the content measured is a major
challenge when designing studies to evaluate the impact of
test format. However, it is a challenge that must be surmounted in order to understand whether group differences
can be reduced by altering test format. Future research
should also consider investigating how a change in format
influences validity. Only one of the three studies reviewed
here evaluated validity effects; clearly, that issue warrants
additional attention.
Use of Motivation and Instructional Sets
A fourth strategy places the focus not on the test itself, but
on the mental state adopted by examinees and its role in
determining test performance. An individual's motivation
to complete a test has the potential to influence test performance. To the extent that there are racial group differences
in test-taking motivation, energizing an individual to persevere when completing a test may reduce subgroup differences. At a more general level, a test taker's attitude and
approach toward testing may also influence test performance. Such test-taking impressions could be partially
determined by the instructional sets provided when completing tests and the context in which actual test questions
are derived. Manipulating instructional sets and item contexts has the potential to alter observed group differences if
these aspects of the testing experience allow individuals of
varying racial groups to draw on culture-specific cognitive
processes.
A number of studies have demonstrated racial group
differences in test-taking motivation. O'Neil and Brown
(1997) found that eighth-grade Black students reported
exerting the least amount of effort when completing a math
exam. Hispanic students reported exerting slightly more,
whereas White students reported exerting the most effort
when completing the exam. Chan, Schmitt, DeShon,
Clause, and Delbridge (1997) demonstrated that the relationship between race and test performance was partially
mediated by test-taking motivation, although the mediating
effect accounted for a very small portion of the variance in
test performance. In a follow-up study, Chan, Schmitt,
Sacco, and DeShon (1998) found that pretest reactions
affected test performance and mediated the relationship
between belief in tests and test performance. Their subgroup samples were not large, but motivational effects
operated similarly for Whites and Blacks.
As a test of the influence of item context, a unique
study by DeShon, Smith, Chan, and Schmitt (1998) investigated whether presenting test questions in a certain way
would reduce subgroup differences. They tested the hypothesis proposed by Helms (1992) that cognitive ability
tests fail to adequately assess Black intelligence because
they do not account for the emphasis in Black culture on
social relations and social context, an observation offered
at a more general level by others as well (e.g., Miller-Jones,
1989; O'Connor, 1989). Contrary to Helms's argument,
racial subgroup performance differences on a set of Wason
conditional reasoning problems were not reduced by presenting the problems in a social relationship form.
Another working hypothesis is that the mere knowledge of cultural stereotypes may affect test performance. In
other words, making salient to test takers their ethnic and
racial or their gender identity may alter both women's and
minorities' test-taking motivation, self concept, effort
level, and expectation of successful performance. Steele
and colleagues (Steele, 1997; Steele & Aronson, 1995)
proposed a provocative theory of stereotype threat that
suggests that the way in which a test is presented to
examinees can affect examinee performance. The theory
hypothesizes that when a person enters a situation wherein
a stereotype of the group to which that person belongs
becomes salient, concerns about being judged according to
that stereotype arise and inhibit performance. When members of racial minority groups encounter high-stakes tests,
their awareness of commonly reported group differences
leads to concerns that they may do poorly on the test and
thus confirm the stereotype. This concern detracts from
their ability to focus all of their attention on the test,
resulting in poorer test performance. Steele hypothesized a
similar effect for gender in the domain of mathematics. A
boundary condition for the theory is that individuals must
identify with the domain in question. If the domain is not
relevant to the individual's self-image, the testing situation
will not elicit stereotype threat.
Steele and Aronson (1995) found support for the theory in a series of laboratory experiments. The basic paradigm used was to induce stereotype threat in a sample of
high-achieving majority and minority students statistically
equated in terms of their prior performance on the SAT.
One mechanism for inducing threat is via instructional set.
In the stereotype threat condition, participants were told
that they would be given a test of intelligence; in the
nonthreat condition, they were told they would be given a
problem-solving task. In fact, all of the participants received the same test. Steele and Aronson found a larger
majority-minority difference in the threat condition than in
the nonthreat condition, a finding supportive of the idea
that the presence of stereotype threat inhibits minority
group performance.
These findings are well replicated (Steele, 1997) but
commonly misinterpreted. For example, in the fall of 1999,
the PBS show "Frontline" broadcast a one-hour special
entitled "Secrets of the SAT," in which Steele's research
was featured. The program's narrator noted the large
Black-White gap on standardized tests, described the stereotype threat manipulation, and concluded, "Blacks who
believed the test was merely a research tool did the same as
Whites. But Blacks who believed the test measured their
abilities did half as well." The critical fact excluded was
that whereas a large score gap exists in the population in
general, Steele studied samples of Black and White students who had been statistically equated on the basis of
SAT scores. Thus, rather than eliminating the large score
gap, the research actually showed something very different.
Absent stereotype threat, the Black-White difference was
just what one would expect (i.e., zero), as the two groups
had been equated on the basis of SAT scores. However, in
the presence of stereotype threat, the Black-White difference was larger than would be expected, given that the two
groups were equated.
There are a variety of additional issues that cloud
interpretation and application of Steele's (1997) findings.
One critical issue is whether the SAT scores used to equate
the Black and White students are themselves influenced by
stereotype threat, thus confounding interpretation of study
findings. A second issue involves questions as to the populations to which these findings generalize (e.g., Whaley,
1998). The work of Steele and coworkers focused on
high-ability college students; Steele (1999) noted that the
effect is not replicable in the broader population. A third
issue is the conflict between a stereotype threat effect and
the large literature cited earlier indicating a lack of predictive bias in test use. If stereotype threat results in observed
scores for minority group members that are systematically
lower than true scores, one would expect underprediction
of minority group performance, an expectation not supported in the predictive bias literature. An additional pragmatic issue is the question of how one might reduce stereotype threat in high-stakes testing settings when the purpose of testing is clear.
These issues aside, Steele's (1997, 1999) research is
important in that it clearly demonstrates that the instructional set under which examinees approach a test can affect
test results. However, research has yet to demonstrate
whether and to what degree this effect generalizes beyond
the laboratory. Thus, we caution against overinterpreting
the findings to date, as they do not warrant the conclusion
that subgroup differences can be explained in whole or in
large part by stereotype threat.
The research on test-taker motivation and instructional
sets has been conducted primarily in laboratory settings.
The effects observed on subgroup differences are not large.
Future research should attempt to replicate these findings in
a field context so we may better understand the extent to
which group differences can be reduced using this alternative. Given the relatively small effects obtained in controlled environments, it seems doubtful that motivational
and social effects will account for much of the subgroup
differences observed. Nonetheless, it may make sense for
test users to institute mechanisms for enhancing motivation, such as the use of more realistic test stimuli clearly
applicable to school or job requirements, for the purpose of
motivating all examinees.
Use of Portfolios, Accomplishment Records,
and Performance Assessments
Researchers have experimented with methods that directly
measure an individual's ability to perform aspects of the
job or educational domain of interest as a fifth alternative to
using paper-and-pencil measures of knowledge, skill, ability, and achievement. Portfolios, accomplishment records,
and performance assessments have each been investigated
as potential alternatives to traditional tests. Performance
assessments (sometimes referred to in the employment
domain as job or work samples) require an examinee to
complete a set of tasks that sample the performance domain
of interest. The intent is to obtain and then evaluate a
realistic behavior sample in an environment that closely
simulates the work or educational setting in question. Performance assessments may be comprehensive and broad-based, designed to obtain a wide-ranging behavior sample reflecting many aspects of the performance domain in question, or narrow, with the intent of sampling a single
aspect of the domain in question. Accomplishment records
and portfolios differ from performance assessments in that
they require examinees to recount past endeavors or produce work products illustrative of an examinee's ability to
perform across a variety of contexts. Often examinees
provide examples demonstrative of their progress toward
skill mastery and knowledge acquisition.
Performance assessments, as a potential solution for
resolving subgroup differences, were examined in the employment domain by Schmidt, Greenthal, Hunter, Berner,
and Seaton (1977), who reported that performance assessments corresponded to substantially smaller Black-White
subgroup differences when compared with a written trades
test (d = 0.81 vs. 1.44). N. Schmitt et al. (1996) updated
this estimate to d = 0.38 on the basis of a meta-analytic
review of the literature, although they combined tests of job
knowledge and job samples in their review. The use of
performance assessments in the context of reducing subgroup differences has been extended to multiple highstakes situations in the credentialing, educational, and employment arena. We outline three such efforts here.
Legal skills assessment center. Klein and
Bolus (1982; described in Sackett, 1998) examined an
assessment center developed as a potential alternative to
the traditional bar examination. Each day of the two-day
center involved a separate trial, with a candidate representing the plaintiff on one day and the defendant on the
second. The center consisted of 11 exercises, such as conducting a client interview, delivering an opening argument,
conducting a cross-examination, and preparing a settlement
plan. Exercises were scored by trained attorneys. Sackett
reported a Black-White d = 0.76 and an internal consistency reliability estimate of .67, resulting in a d corrected
for unreliability of 0.93.
Accomplished teacher assessment. Jaeger
(1996a, 1996b) examined a complex performance assessment process developed to identify and certify highly accomplished teachers under the auspices of the National
Board for Professional Teaching Standards. Different assessment packages are developed for different teaching
specialty areas. Jaeger examined assessments for Early
Childhood Generalists and Middle Childhood Generalists.
The assessment process required candidates to complete an
assessment center and prepare in advance a portfolio that
included videotaped samples of their performance. From a
frequency count of scores, we computed Black-White ds of
1.06 and 0.97 for preoperational field trials, and 0.88 and
1.24 for operational use of the Early and Middle Childhood
assessments, respectively.
Management assessment center. Goldstein,
Yusko, Braverman, Smith, and Chung (1998) examined an
assessment center designed for management development
purposes. Candidate performance on multiple dimensions
was evaluated by trained assessors who observed performance on seven exercises, including an in-basket, leaderless group discussions, and a one-on-one role play. Scores
on each exercise corresponded to Black-White ds ranging
from 0.03 to 0.40 for the individual exercises, and a d of
0.40 for a composite across all exercises.
One clear message emerging from these studies is that
it is not the case that an assessment involving a complex
realistic performance sample can always be expected to
result in a smaller subgroup difference than differences
commonly observed with traditional tests. The legal skills
assessment center and the accomplished teacher assessment
produced subgroup differences comparable in magnitude
with those commonly reported for traditional knowledge,
ability, and achievement tests. The management assessment center did result in smaller subgroup differences, a
finding commonly reported for this type of performance
assessment (Thornton & Byham, 1982). We offer several
observations as to what might account for the differences in
findings across the three performance assessments. First, in
the legal skills and accomplished teacher assessments, the
assessment exercises required performances that build on
an existing declarative knowledge base developed in part
through formal instruction. Effectiveness in delivering an
opening argument, for example, builds on a foundation of
legal knowledge and reasoning. In contrast, typical practice
in management assessment centers is to design exercises
that do not build on a formal knowledge base, but rather are
designed to tap characteristics such as communication
skills, organizing and planning skills, initiative, effectiveness under stress, and personal adjustment. Not only do
these not build on a formal knowledge base, but they also
in many cases reflect dimensions outside the cognitive
domain. As a reflection of this difference, we note that the
legal skills assessment center correlates .72 with the MBE,
whereas the seven exercises in the management assessment
center produce correlations corrected for unreliability ranging from .00 to .31 with a written cognitive ability measure.
Thus, although each of these performance assessments
reflected highly realistic samples or simulations of the
behavior domain in question, the reductions in subgroup
differences were, we posit, a function of the degree to
which the assessment broadened the set of constructs assessed to include characteristics relevant to the job or
educational setting in question, but having little or no
relationship with the focal constructs tapped by the traditional test.
Researchers and practitioners in the educational arena
have been particularly vocal about the need to evaluate
student ability and teacher ability using more than standardized, multiple-choice tests (e.g., D'Costa, 1993; Dwyer
& Ramsey, 1995; Harmon, 1991; Lee, 1999; Neil, 1995).
Whether termed authentic assessments or constructed-response tests, these performance assessments have many
advocates in and out of the educational community. Until
recently, very few attempts had been made to determine
whether their use decreases the difference between the
measured progress of minority and majority students.
Given the high degree of fidelity to actual performance,
many consider this alternative a worthwhile approach.
Under particular focus in the education literature is the
use of traditional, multiple-choice tests to measure writing
skill. It is commonly argued that a constructed-response
format that requires examinees to write an essay would
serve as a more appropriate measure of writing ability. To
that end, a number of studies have compared subgroup
differences on multiple-choice tests of writing ability with
differences observed on essay tests. Welch, Doolittle, and
McLarty (1989) reported that Black-White differences on
the traditional ACT writing skills test and an ACT essay
test were the same, with d = 1.41 in both cases. Bond
(1995) reported that differences between Blacks and
Whites on the extended-essay portion of the National Assessment of Educational Progress (NAEP) were actually
greater after correcting for unreliability than those found on
the multiple-choice reading portion. In contrast, White and
Thomas (1981) reported a decrease in subgroup differences
when writing ability was assessed using an essay test
versus a multiple-choice test. Using the descriptive statistics reported, we established that the Black-White d decreased from 1.39 for the traditional test to 0.81 for the
essay test, with similar decreases observed for Hispanics
and Asians. White and Thomas did not present reliability
information, making it impossible to correct these differences for unreliability. However, other studies permitting
this correction have reported means and standard deviations that, when translated into ds corrected for unreliability, suggest similar subgroup differences on essay tests for
Blacks and Hispanics (Applebee, Langer, Jenkins, Mullis,
& Foertsch, 1990; Koenig & Mitchell, 1988).
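For readers who wish to reproduce computations of the kind referred to above, the following sketch shows how a d can be derived from reported group means and standard deviations and then corrected for unreliability of the score (d divided by the square root of the reliability). All values are hypothetical rather than taken from the studies just cited.

import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    # Standardized mean difference based on the pooled within-group standard deviation.
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def disattenuate(d_observed, reliability):
    # Correct an observed d for unreliability of the measure.
    return d_observed / math.sqrt(reliability)

d_obs = cohens_d(mean1=52.0, sd1=10.0, n1=400, mean2=44.0, sd2=11.0, n2=250)
print(round(d_obs, 2), round(disattenuate(d_obs, reliability=0.70), 2))  # 0.77 and 0.92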
Klein et al. (1997) compared subgroup mean differences between racial and ethnic groups on science performance assessments and on the Iowa Tests of Basic Skills,
a traditional multiple-choice test. The examinees were fifth,
sixth, and ninth graders participating in a statewide performance assessment effort in California. A high level of
score reliability was achieved with the use of multiple
raters. The differences between subgroups were almost
identical across the two types of tests. Mean scores for
White students were about one standard deviation higher
than those of Hispanic and Black students. Furthermore,
changing test type or question type had no effect on the
score differences between the groups.
Reviews conducted in the employment domain suggest that performance assessments are among the most
valid predictors of performance (Asher & Sciarrino, 1974;
Hunter & Hunter, 1984; Robertson & Kandola, 1982;
Schmidt & Hunter, 1998; N. Schmitt, Gooding, Noe, &
Kirsch, 1984; Smith, 1991). In addition, examinees, particularly minority individuals, have reported more favorable
impressions of performance assessments than of more traditional cognitive ability or achievement tests (Schmidt et al.,
1977). Given the motivational implications associated with
positive applicant reactions, the use of performance assessments wherein test content and format replicate the performance domain as closely as possible may be advantageous
regardless of the extent to which subgroup differences are
reduced. However, performance assessments tend to be
costly to develop, administer, and score reliably. Work
stations can be expensive to design and assessment center
exercises can be expensive to deliver. Stecher and Klein
(1997) indicated that it is often difficult and expensive to
achieve reliable scoring of performance assessments in a
large-scale testing context. Furthermore, obtaining a performance assessment that is both reliable and generalizable requires that examinees complete a number of tasks, a
requirement that can triple the amount of testing time
necessary compared with traditional tests (Dunbar, Koretz,
& Hoover, 1991; Linn, 1993).
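The testing-time problem follows from classical reliability theory. As a rough sketch, assuming parallel tasks and a hypothetical single-task reliability of .40, the Spearman-Brown formula shows how quickly the number of tasks, and hence testing time, must grow before scores become dependable.

def spearman_brown(single_task_reliability, n_tasks):
    # Projected reliability of a composite of n parallel tasks.
    r1 = single_task_reliability
    return n_tasks * r1 / (1 + (n_tasks - 1) * r1)

for k in (1, 2, 3, 6):
    print(k, round(spearman_brown(0.40, k), 2))  # .40, .57, .67, .80

Under these assumptions, reaching a composite reliability near .80 requires roughly six tasks, which illustrates why performance assessments can demand several times the testing time of a traditional test.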
The accomplishment record (Hough, Keyes, & Dunnette, 1983) was developed in part to surmount some of the
development and administration cost issues characteristic
of performance assessments. Accomplishment records ask
examinees to describe major past accomplishments that are
illustrative of competence on multiple performance dimensions. These accomplishments are then scored using behaviorally defined scales. Accomplishment records can be used
to assess competence in a variety of work and nonwork
contexts. We do note that a very similar approach was
developed by Schmidt et al. (1979) under the label of the
"behavioral consistency method"; a meta-analysis by McDaniel, Schmidt, and Hunter (1988) reported useful levels
of predictive validity across 15 studies using this approach.
Hough et al. (1983) used accomplishment records to
evaluate attorneys and validated these instruments against
performance ratings. The accomplishment records were
scored with a high degree of interrater reliability (.75 to .85
across the different performance dimensions and the total
score). These scores were then correlated with attorney
experience (average r = .24), in order to partial out experience from the relationship between accomplishment
record scores and performance ratings. These partialed
validity coefficients ranged from .17 to .25 across the
dimensions. Validities for a small group of minority attorneys were larger than those for the majority group. Hough
(1984), describing the same data, reported Black-White
subgroup differences of d = 0.33 for the accomplishment
records. The performance ratings exhibited almost exactly
the same difference (i.e., d = 0.35). It is interesting that the
accomplishment records correlated near zero with the
LSAT, scores on the bar exam, and grades in law school.
These more traditional measures of ability would most
likely have exhibited greater d when compared with the
accomplishment records, although Hough did not present
the relevant subgroup means and standard deviations. Because the accomplishment records likely measured constructs in addition to ability (e.g., motivation and personality), it is perhaps not surprising that d was lower than that
found for more traditional cognitively oriented tests.
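One standard way to carry out the partialing step described above is the first-order partial correlation; the original analysis may have used a different variant. In the sketch below, x is the accomplishment record score, y is the performance rating, and z is experience; the .24 experience correlation comes from the text, whereas the other two correlations are hypothetical.

import math

def partial_correlation(r_xy, r_xz, r_yz):
    # Correlation between x and y with z partialed out of both variables.
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(round(partial_correlation(r_xy=0.30, r_xz=0.24, r_yz=0.30), 2))  # about 0.25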
Similar to accomplishment records, portfolios represent examinees' past achievements through a collection of
work samples indicative of one's progress and ability.
Although portfolios are applicable to adult assessment, much of the
research involving portfolios has taken place in schools.
LeMahieu, Gitomer, and Eresh (1995) reported on a project
in the Pittsburgh schools in which portfolios were used to
assess students' writing ability in Grades 6-12. Their experience indicated that portfolios could be rated with substantial interrater agreement. They also reported that Black
examinees' scores were significantly lower than those of
White examinees, but did not report subgroup means, precluding the estimation of an effect size. Supovitz and
Brennan (1997) reported on an analysis of writing portfolios assembled by first and second graders in the Rochester,
New York, schools. Scores on two standardized tests were
compared with scores based on their portfolios. Interrater
reliability of the scoring of the language arts portfolios was
.73 and .78 for first and second graders, respectively,
whereas the reliability of the two standardized tests was .92
and .91. Differences between Black and White students
were about twice as large on the standardized tests as they
were on the writing samples. On both tests, the differences
between subgroups were smaller (0.25 to 0.50 in standard
deviation units depending on the test type) than is usually
reported.
Although accomplishment records and portfolios
likely have lower development costs when compared with
performance assessments, the cost of scoring, especially if
multiple raters are used, may be high. Another issue present
with accomplishment records specifically is the reliance on
self-report, although an attempt to secure verification of the
role of the examinee in each accomplishment may diminish
the tendency to overreport or enhance one's role. There
may also be differing levels of opportunity to engage in
activities appropriate for portfolios or accomplishment
records. To the extent that examinees feel that they do not
have the resources available to assemble these extensive
documents, they may find the experience demotivating and
frustrating. This concern is important inasmuch as there is
some evidence (Ryan, Ployhart, Greguras, & Schmit, 1998;
Schmit & Ryan, 1997) that a greater proportion of minority
than majority individuals withdraw during the various hurdles in a selection system.
Summary.
Use of more realistic or authentic assessments does not eliminate or even diminish subgroup
differences in many of the educational studies. Also, all of
the studies report that the reliable scoring of these tests is
difficult and expensive to achieve in any large-scale testing
application. Problems with the standardization of the material placed in portfolios and the directions and opportunities afforded students are also cited in studies of student
reactions to the use of these tests (Dutt-Doner & Gilman,
1998) as well as by professionals. Accomplishment records
and job samples used in employment contexts show smaller
subgroup differences in some studies than do cognitively
loaded tests. The attribution of these smaller subgroup differences to test type is probably unwarranted,
as scores on most job samples and accomplishment records
most likely reflect a mix of constructs that go beyond those
measured by traditional knowledge, skill, ability, and
achievement tests.
Use of Coaching or Orientation Programs
Another strategy for reducing subgroup differences is the
use of coaching or orientation programs. The purpose of
these programs is to inform examinees about test content,
provide study materials, and recommend test-taking strategies, with the ultimate goal of enabling optimal examinee
performance. The term coaching is at times used to refer to
both orientation programs that focus on general test-taking
strategies and programs featuring intensive drill on sample
test items. We use the term orientation programs to refer to
short duration programs, dealing with broad test-taking
strategies, that introduce examinees to the types of items
they will encounter. We use the term coaching to refer to
more extensive programs, commonly involving practice
and feedback, in addition to the material included in orientation programs. A review by Sackett, Burris, and Ryan
(1989) indicated that coaching programs involving drill and
practice do show evidence of modest score gains above
those expected due simply to retesting. Although there is
little literature on the differential effectiveness of coaching
and orientation programs by subgroup, a plausible hypothesis is that subgroups differ in their familiarity with test
content and test-taking skills. This difference in familiarity
may contribute to observed subgroup differences in test
scores. Conceivably, coaching or orientation programs
would reduce error variance in test scores due to test
anxiety, unfamiliar test formats, and poor test-taking skills
(Frierson, 1986; Ryan et al., 1998), which would in turn
reduce the extent of subgroup differences. However, there
is evidence suggesting the presence of a larger coaching
effect for individuals with higher precoaching test scores, a
finding that argues against the likelihood that coaching will
narrow the gap between a lower scoring subgroup and a
higher scoring subgroup. With that caveat, we discuss
below the coaching and orientation literature investigating
the influence of this strategy on subgroup differences.
Ryan et al. (1998) studied an optional orientation program that familiarized firefighter job applicants with test
format and types of test questions. The findings indicated
that Blacks, women, and more anxious examinees were
more likely to attend the orientation sessions, but attending
the orientation program was unrelated to test performance
or motivation. Ryer, Schmidt, and Schmitt (1999) studied a
mandatory orientation program for entry-level jobs in a
manufacturing organization at two locations, with each
location having a control group and a test orientation
group. The results showed a small positive impact of orientation on the test scores of minority examinees, approximately 0.15 in standard deviation units, and the applicants
indicated that they viewed organizations that provide these
programs favorably. However, the orientation program had
greater benefits for nonminority members than minority
members at one of the two locations. Schmit (1994) studied
a voluntary orientation program for police officers that
consisted of a review of test items and content, recommendations for test-taking strategies, practice on sample test
items, and suggestions on material to study. Attendance at
the program was unrelated to race, and whereas everyone
who attended the program scored higher on the examination than did nonattenders, Black gains were twice as large
as were those of Whites. No standard deviation was provided for the test performance variable, so d could not be
estimated.
The educational literature includes a relatively large
number of efforts to evaluate coaching initiatives. At least
three reviews of this literature have been conducted. Messick and Jungeblut (1981) reported that the average difference between coached examinees and noncoached examinees taking the SAT was about 0.15 standard deviation
units. The length of the coaching program and the amount
of score gains realized were positively correlated. Messick
and Jungeblut estimated that a gain of close to 0.25 standard deviation units could be achieved with a program approaching the length of regular schooling. DerSimonian and Laird
(1983) reported an average effect size of 0.10 standard
deviation units for coaching programs directed at the SAT, an aptitude test. In an analysis of coaching programs
directed at achievement tests, Bangert-Downs, Kulik, and
Kulik (1983) reported gains of about 0.25 standard deviation units as a function of coaching. Thus, the effects of
coaching on performance on traditional paper-and-pencil
tests of aptitude and achievement appear to be small, but
replicable.
Frierson (1986) outlined results from a series of four
studies investigating the effects of test-taking interventions
designed to enhance minority examinee test performance
on various standardized medical examinations (e.g., Medical College Admissions Test [MCAT], Nursing State Board Examination). The programs taught examinees test-taking strategies and facilitated the formation of learning-support groups. Those minorities who experienced the interventions showed increased test scores. However, the
samples used in these studies included very few White
examinees, making it difficult to discern whether coaching
produced a differential effect on test scores in favor of
minorities. Powers (1987) reexamined data from a study on
the effects of test preparation involving practice, feedback
on results, and test-taking strategies using the initial version of the GRE analytical ability test (Powers & Swinton,
1982, 1984). The findings indicated that when supplied
with the same test preparation materials, no particular
subgroup appeared to gain more or less than any other
subgroup, although all of the examinees showed statistically significant score gains. Koenig and Leger (1997)
evaluated the impact of test preparation activities undertaken by individuals who had previously taken the MCAT
but had not passed. Black examinee scores improved across
two administrations of the MCAT less than did those of
White examinees.
Overall, the majority of studies on coaching and orientation programs indicate that these programs have little impact on the size of subgroup differences. These
programs do benefit minority and nonminority examinees
slightly, but they do not appear to reduce subgroup differences. Note that these programs would affect passing rates
in settings such as licensure, in which performance relative
to a standard is at issue, rather than performance relative to
other examinees. On the positive side, these programs are
well received by examinees; they report favorable impressions of those institutions that offer these types of programs. Additional questions of interest concern examinee reactions to these programs, the effectiveness of
various types of programs, and ways to make these programs readily available to all examinees.
Use of More Generous Time Limits
A final option available for addressing group differences in
test scores is the strategy of increasing the amount of time
allotted to complete a test. Unless speed of work is part of
the construct in question, it can be argued that time limits
may bias test scores. Tests that limit administration time
may be biased against minority groups, in that certain
groups may be provided too little time to complete the test.
It has been reported that attitudes toward speededness are
culture-bound, with observed differences by race and ethnicity (O'Connor, 1989). This suggests that providing examinees with more time to complete a test may facilitate
minority test performance.
Research to date, however, is not supportive of the
notion that relaxed time limits reduce subgroup differences. Evans and Reilly (1973) increased the time that
examinees were allotted to complete the Admission Test
for Graduate Study in Business from 69 seconds/item to 80
seconds/item and 96 seconds/item. The result was a corresponding increase in subgroup differences from a Black-White d of 0.83 to Black-White ds of 1.12 and 1.38,
respectively. Wild, Durso, and Rubin (1982) investigated
whether reducing the speededness of the GRE verbal and
quantitative tests from 20 minutes to 30 minutes reduced
subgroup differences. Results using both operational and
experimental versions of the tests indicated that whereas
increasing the time allotted benefited all examinees, it did
not produce differential score gains favoring minorities and
often exacerbated the extent of subgroup differences already present. Applebee et al. (1990) reported that doubling
the time that 4th, 8th, and 12th graders were given when
completing NAEP essay tests led to inconsistent changes in
the subgroup differences observed for Blacks and Hispanics. Through the use of odds ratios computed on the basis
of the percentage of examinees that received an adequate or
better passing score, we ascertained that the Black-White d
increased on six of the nine essay tests and the Hispanic-White d increased on five of the nine essay tests when the
tests were administered with extended time. Thus, the time
extension increased the differences between Black and
White essay scores and Hispanic and White essay scores in
the majority of test administrations.
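One common way to move from pass rates of this kind to a d-type effect size (not necessarily the exact procedure used in the analysis above) is the logit conversion d = ln(odds ratio) * sqrt(3) / pi. The pass rates in the sketch below are hypothetical rather than the NAEP figures.

import math

def d_from_pass_rates(p_focal, p_reference):
    # Convert two pass rates to a d via the log odds ratio (logit method).
    odds_ratio = (p_reference / (1 - p_reference)) / (p_focal / (1 - p_focal))
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

print(round(d_from_pass_rates(p_focal=0.35, p_reference=0.60), 2))  # about 0.56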
Although relaxed time limits will likely result in
higher test scores for all examinees, there does not appear
to be a differential benefit favoring minority subgroups. In
fact, it is more common that extending the time provided to
examinees when completing a test increases subgroup differences, sometimes substantially. Time limits add error to test scores by confounding pace of work with quality of work; removing them enhances the reliability of the test, and the observed increases in subgroup differences are likely a function of that enhanced reliability. Thus, time
extensions are unlikely to serve as an effective strategy for
reducing subgroup differences.
Discussion and Conclusions
Our goal in this article was to consider a variety of strategies that have been proposed as potential methods for
reducing the subgroup differences regularly observed on
traditional tests of knowledge, skill, ability, and achievement. Such tests are commonly used in the contexts of
selection for employment, educational admissions, and licensure and certification. We posit that there is extensive
evidence supporting the validity of well-developed traditional tests for their intended purposes, and that institutions
relying on traditional tests value the positive outcomes
resulting from test use. Thus, one set of critical constraints
is that any proposed alternative must include the constructs
underlying traditional tests and cannot result in an appreciable decrement in validity. Put
another way, we focus here on the situation in which the
institution is not willing to sacrifice validity in the interest
of reducing subgroup differences, as it is in that situation
that there is tension between pursuing a validity-maximization strategy and a diversity-maximization strategy. A
second set of critical constraints is that, in the face of
growing legal obstacles to preference-based forms of affirmative action, any proposed alternative cannot include any
form of preferential treatment by subgroup. Within these
constraints, we considered two general types of strategies:
modifications to procedural aspects of testing and the creation of alternative testing instruments.
In evaluating the various strategies that have been
promulgated as potential solutions to the performance versus diversity dilemma, our review echoes the sentiments of
the 1982 National Academy of Sciences panel that concluded, "the Committee has seen no evidence of alternatives to testing that are equally informative, equally adequate technically, and also economically and politically
viable" (p. 144). Although our observations are offered
almost 20 years later, the story remains much the same.
Alternatives to traditional tests tend to produce equivalent
subgroup differences in test scores when the alternative test
measures cognitively loaded constructs. If such differences
are not observed, the reduction can often be traced to an
alternative that exhibits low levels of reliability or introduces noncognitive constructs. In fact, the most
definitive conclusion one can reach from this review is that
adverse impact is unlikely to be eliminated as long as one
assesses domain-relevant constructs that are cognitively
loaded. This conclusion is no surprise to anyone who has
read the literature in this area over the past three or more
decades. Subgroup differences on cognitively loaded tests
of knowledge, skill, ability, and achievement simply document persistent inequities. Complicating matters further,
attempts to overcome issues associated with reliable measurement often result in a testing procedure that is cost-prohibitive when conducted on a large scale. In spite of
these statements, there are a number of actions that can be
taken by employers, academic admissions officers, or other
decision makers who are faced with the conflict between
diversity goals and a demand that only those who are most
able should be given desirable educational and employment
opportunities. Although elimination of subgroup differences via methods reviewed in this article is not feasible,
reduction in subgroup differences, if it can be achieved
without loss of validity, would be of considerable value.
First, in constructing test batteries, the full range of
performance goals and organizational interests should be
considered. In the employment arena, researchers have
tended to focus on measures of maximum performance
(i.e., ability), rather than on measures of typical performance (perhaps most related to motivation factors), when
considering what knowledge, skills, and abilities to measure. These maximum performance constructs were easy to
measure using highly reliable and valid instruments. With
the recent literature espousing the value of personality,
improvements in interviews, and better methods for documenting job-related experience, valid methods for measuring less cognitively oriented constructs are becoming available. When these constructs are included in test batteries,
there is often less adverse impact. We must also emphasize
the importance of clearly identifying the performance construct one is hoping to predict. The weighting of different
aspects of performance and organizational goals should
determine the nature of the constructs measured in a highstakes testing situation. It is important to measure what is
relevant, not what is convenient, easy, or cheap.
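To make the weighting point concrete, the following sketch, with made-up values rather than estimates from the literature reviewed here, shows how shifting weight from a cognitively loaded predictor to a less cognitively loaded one changes both the expected validity and the expected subgroup difference of a two-predictor composite, assuming standardized predictors and equal within-group variances.

import math

def composite_stats(w1, w2, validity1, validity2, d1, d2, r12):
    # Validity and subgroup d of the weighted composite w1*x1 + w2*x2.
    scale = math.sqrt(w1**2 + w2**2 + 2 * w1 * w2 * r12)
    validity = (w1 * validity1 + w2 * validity2) / scale
    d = (w1 * d1 + w2 * d2) / scale
    return round(validity, 2), round(d, 2)

# Hypothetical: cognitive test (validity .50, d = 1.0), noncognitive measure (validity .30, d = 0.2).
print(composite_stats(1.0, 1.0, 0.50, 0.30, 1.0, 0.2, r12=0.10))  # (0.54, 0.81)
print(composite_stats(0.5, 1.0, 0.50, 0.30, 1.0, 0.2, r12=0.10))  # (0.47, 0.60)

Downweighting the cognitive component lowers the expected composite d, but it also lowers expected validity, which is precisely the performance-diversity tension discussed throughout this article.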
Second, research on the identification and removal of
items that may be unfairly biased against one group or
another does not indicate that any practically significant
reductions in d can be achieved in this fashion. Studies of
DIF are characterized by small effects, with items not
consistently favoring one group versus another. The effects
of removing biased items on overall test characteristics are
usually minimal. It does seem apparent that one should
write items as simply as possible, consistent with the construct one is hoping to measure, and that obviously culture-bound content should be removed.
Research on the mode of presenting test stimuli suggests that using video-based procedures, which broaden the range of constructs assessed, or reducing the verbal component
(or reading level) of tests may have a positive effect on
subgroup differences, although d is often still large enough
to produce adverse impact, particularly when the selection
ratio is low. Results are not consistent across studies, and
clearly more research would be helpful. Such studies are
particularly difficult to conduct. Separation of the mode of
testing and the construct tested is a challenge; conflicting
results across studies may be due to an inability to differentiate between constructs and methods. With improvements in technology, alternatives to traditional paper-and-pencil tests are clearly feasible and worthy of exploration.
It is also important to note that verbal ability may be a skill
related to important outcomes and hence be considered a
desirable component of test performance. In these cases, it
would be best to include a measure that specifically assesses verbal ability so that one may remove its influence
when measuring other job-related constructs. The entire
test battery can then be constructed to reflect an appropriate
weighting and combination of relevant attributes given the
relative importance of the various performance outcomes.
Whenever possible, it seems desirable to measure
experiences that reflect the knowledge, skills, and abilities required in the target situation. Accomplishment records yielded differences that were smaller than the usual d obtained with cognitively loaded tests, and the reduction was practically important as well. This is very likely because
additional constructs are targeted and assessed in the accomplishment record. Results for portfolio and performance assessments in the educational arena have been
mixed. Some studies indicate lower levels of d, whereas
other studies indicate no difference or even greater differences on portfolio or performance assessments when compared with the typical multiple-choice measure of achievement. Differences in the level of d across studies may be
due partly to the degree to which test scores are a function
of ability and motivation. If partly a function of motivation,
we would expect d to be smaller. Again, if relatively
complex and realistic performance assessments involve
cognitive skills as opposed to interpersonal skills, the level
of d will likely be the same as that of a traditional cognitively loaded measure. In addition, problems in attaining reliable scores at reasonable expense call the feasibility of this strategy into question.
It seems reasonable to recommend that some form of
test preparation or orientation course be provided to examinees. The effects of coaching appear to be small but positive across all groups, even though coaching does not
seem to reduce d. Reactions to test preparation and coaching efforts among job applicants have been universally
positive. Insofar as some candidates do not have access to
informal networks that provide information on the nature of
exams, these programs could serve to place all examinees
on a level playing field. At the very least, it would seem that such positive reactions would lead to fewer complaints about the test and probably less litigation, although we
have little research documenting the relationship between
reactions and organizational outcomes.
Finally, we recommend that test constructors pay attention to face validity. When tests look appropriate for the
performance situation in which examinees will be expected
to perform, examinees tend to react positively. Such positive
reactions seem to produce a small reduction in the size of
d. Equally important, perhaps, may be the perception that
one is fairly treated. This is the same rationale underlying
our recommendation that test preparation programs be
used.
In sum, subgroup differences can be expected on
cognitively loaded tests of knowledge, skill, ability, and
achievement. We can, however, take some useful actions to
reduce such differences and to create the perception that
one's attributes are being fairly and appropriately assessed.
We note that in this article we have focused on describing
subgroup differences resulting from different measurement
approaches. We cannot in the space available here address
crucial questions of interventions to remedy subgroup differences in the life opportunities that affect the development of the knowledge, skill, ability, and achievement
domains that are the focus of this article. The research
discussed in this article, suggesting that subgroup differences are not simply artifacts of paper-and-pencil testing
technologies, highlights the need to consider those larger
questions.
REFERENCES
Adarand Constructors, Inc. v. Pena. 115 S. Ct. 2097, 2113 (1995).
American Educational Research Association, American Psychological
Association, & National Council on Measurement in Education. (1999).
Standards for educational and psychological testing. Washington, DC:
American Psychological Association.
Applebee, A. N., Langer, J. A., Jenkins, L. B., Mullis, I. V. S., & Foertsch,
M. A. (1990). Learning to write in our nation's schools: Instruction and
achievement in 1988 at grades 4, 8, and 12 (NAEP Rep. No. 19-W-02).
Princeton, NJ: Educational Testing Service.
Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A
review. Personnel Psychology, 27, 519-533.
Bangert-Downs, R. L., Kulik, J. A., & Kulik, C.-L. C. (1983). Effects of
coaching programs on achievement test scores. Review of Educational
Research, 53, 571-585.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44,
1-26.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications
of a meta-analytic matrix incorporating cognitive ability, alternative
predictors, and job performance. Personnel Psychology, 52, 561-590.
Bond, L. (1995). Unintended consequences of performance assessment:
Issues of bias and fairness. Educational Measurement: Issues and
Practice, 14, 21-24.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil
method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of
Applied Psychology, 82, 143-159.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. C., & Delbridge, K.
(1997). Reactions to cognitive ability tests: The relationships between
race, test performance, face validity perceptions, and test-taking motivation. Journal of Applied Psychology, 82, 300-310.
Chan, D., Schmitt, N., Sacco, J. M., & DeShon, R. P. (1998). Understanding pretest and posttest reactions to cognitive ability and personality
measures. Journal of Applied Psychology, 83, 471-485.
City of Richmond v. J. A. Croson Co., 488 U.S. 469 (1989).
Cole, N. S. (1981). Bias in testing. American Psychologist, 36, 1067-1077.
D'Costa, A. G. (1993). The impact of courts on teacher competence
testing. Theory into Practice: Assessing Tomorrow's Teachers, 32,
104-112.
De Corte, W. (1999). Weighing job performance predictors to both
maximize the quality of the selected workforce and control the level of
adverse impact. Journal of Applied Psychology, 84, 695-702.
DerSimonian, R., & Laird, N. (1983). Evaluating the effect of coaching on
SAT scores: A meta-analysis. Harvard Educational Review, 18, 694-734.
DeShon, R. P., Smith, M., Chan, D., & Schmitt, N. (1998). Can adverse
impact on cognitive ability and personality tests be reduced by presenting problems in a social context? Journal of Applied Psychology, 83,
438-451.
Dunbar, S. B., Koretz, D. M., & Hoover. H. D. (1991). Quality control in
the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.
Dutt-Doner. K., & Gilman, D. A. (1998). Students react to portfolio
assessment. Contemporary Education, 69, 159-165.
Dwyer, C. A., & Ramsey, P. A. (1995). Equity issues in teacher assessment. In M. T. Nettles & A. L. Nettles (Eds.), Equity and excellence in
educational testing and assessment (pp. 327-342). Boston: Kluwer
Academic.
Evans, F. R., & Reilly, R. R. (1973). A study of test speededness as a
potential source of bias in the quantitative score of the admission test
for graduate study in business. Research in Higher Education, 1,
173-183.
Ford, J. K., Kraiger, K., & Schechtman, S. L. (1986). Study of race effects
in objective indices and subjective evaluations of performance: A
meta-analysis of performance criteria. Psychological Bulletin, 99, 330-337.
Freedle, R., & Kostin, I. (1990). Item difficulty of four verbal item types
and an index of differential item functioning for Black and White
examinees. Journal of Educational Measurement, 27, 329-343.
Freedle, R., & Kostin, I. (1997). Predicting Black and White differential
item functioning in verbal analogy performance. Intelligence, 24, 417-444.
Frierson, H. T. (1986). Enhancing minority college students' performance
on educational tests. Journal of Negro Education, 55, 38-45.
Goldstein, H. W., Yusko, K. P., Braverman, E. P., Smith, D. B., & Chung,
B. (1998). The role of cognitive ability in the subgroup differences and
incremental validity of assessment center exercises. Personnel Psychology, 51, 357-374.
Harmon, M. (1991). Fairness in testing: Are science education assessments biased? In G. Kulm & S. M. Malcom (Eds.), Science assessment
in the service of reform (pp. 31-54). Washington, DC: American
Association for the Advancement of Science.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing.
Washington, DC: National Academy Press.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring,
and predicted performance. Journal of Applied Psychology, 82, 656-664.
Helms, J. E. (1992). Why is there no study of cultural equivalence in
standardized cognitive ability testing? American Psychologist, 47,
1083-1101.
Hopwood v. State of Texas, 78 F. 3d 932, 948 (5th Cir. 1996).
Hough, L. M. (1984). Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal
of Applied Psychology, 69, 135-146.
Hough, L. M., Keyes, M. A., & Dunnette, M. D. (1983). An evaluation of
three "alternative" selection procedures. Personnel Psychology, 36,
261-276.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative
predictors of job performance. Psychological Bulletin, 96, 72-88.
Ironson, G. H., & Subkoviak, M. J. (1979). A comparison of several
methods of assessing item bias. Journal of Educational Measurement,
16, 209-225.
Jaeger, R. M. (1996a). Conclusions on the technical measurement quality
of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Early Childhood Generalist Assessment.
Center for Educational Research and Evaluation, University of North
Carolina at Greensboro.
Jaeger, R. M. (1996b). Conclusions on the technical measurement quality
of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Middle Childhood Generalist Assessment. Center for Educational Research and Evaluation, University of
North Carolina at Greensboro.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Kier, F. J., & Davenport, D. S. (1997). Ramifications of Hopwood v. Texas
on the process of applicant selection in APA-accredited professional
psychology programs. Professional Psychology: Research and Practice, 28, 486-491.
Klein, S. P. (1983). An analysis of the relationship between trial practice
skills and bar examination results. Unpublished manuscript.
Klein, S. P., & Bolus, R. E. (1982). An analysis of the relationship
between clinical legal skills and bar examination results. Unpublished
manuscript.
Klein, S. P., Jovanovic, J., Stecher, B. M., McCaffrey, D., Shavelson,
R. J., Haertel, E., Solano-Flores, G., & Comfort, K. (1997). Gender and
racial/ethnic differences on performance assessments in science. Educational Evaluation and Policy Analysis, 19, 83-97.
Koenig, J. A., & Leger, K. F. (1997). A comparison of retest performances
and test-preparation methods for MCAT examinees grouped by gender
and race-ethnicity. Academic Medicine, 72, S100-S102.
Koenig, J. A., & Mitchell, K. J. (1988). An interim report on the MCAT
essay pilot project. Journal of Medical Education, 63, 21-29.
Lee, O. (1999). Equity implications based on the conceptions of science
achievement in major reform documents. Review of Educational Research, 69, 83-115.
LeMahieu, P. G., Gitomer, D. H., & Eresh, J. T. (1995). Portfolios in
large-scale assessment: Difficult but not impossible. Educational Measurement: Issues and Practice, 14, 11-16, 25-28.
Linn, R. L. (1993). Educational assessment: Expanded expectations and
challenges. Educational Evaluation and Policy Analysis, 15, 1-16.
Lynn, R. (1996). Racial and ethnic differences in intelligence in the U.S.
on the Differential Ability Scale. Personality and Individual Differences, 20, 271-273.
McCauley, C. D., & Mendoza, J. (1985). A simulation study of item bias
using a two-parameter item response model. Applied Psychological
Measurement, 9, 389-400.
McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-analysis
of the validity of methods for rating training and experience in personnel selection. Personnel Psychology, 41, 283-309.
Medley, D. M., & Quirk, T. J. (1974). The application of a factorial design
to the study of cultural bias in general culture items on the National
Teacher Examination. Journal of Educational Measurement, 11, 235-245.
Mehrens, W. A. (1989). Using test scores for decision making. In B. R.
Gifford (Ed.), Test policy and test performance: Education, language,
and culture (pp. 93-99). Boston: Kluwer Academic.
Mehrens, W. A. (1999). The CBEST saga: Implications for licensure and
employment testing. The Bar Examiner, 68, 23-32.
Messick, S. M., & Jungeblut, A. (1981). Time and method in coaching for
the SAT. Psychological Bulletin, 89, 191-216.
Miller-Jones, D. (1989). Culture and testing. American Psychologist, 44,
360-366.
Millman, J., Mehrens, W. A., & Sackett, P. R. (1993). An evaluation of the
New York State Bar Examination. Unpublished manuscript.
Mishkin, P. J. (1996). Foreword: The making of a turning point—Metro
and Adarand. California Law Review, 84, 875-886.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative
selection procedure: The low-fidelity simulation. Journal of Applied
Psychology, 75, 640-647.
Mount, M. K., & Barrick, M. R. (1995). The Big Five personality
dimensions: Implications for research and practice in human resources
management. In G. Ferris (Ed.), Research in personnel and human
resources management (Vol. 13, pp. 153-200). Greenwich, CT: JAI
Press.
National Academy of Sciences. (1982). Ability testing: Uses, consequences, and controversies (Vol. 1). Washington, DC: National Academy Press.
Neill, M. (1995). Some prerequisites for the establishment of equitable,
inclusive multicultural assessment systems. In M. T. Nettles & A. L.
Nettles (Eds.), Equity and excellence in educational testing and assessment (pp. 115-157). Boston: Kluwer Academic.
Neisser, U., Boodoo, G., Bouchard, T. J., Jr., Boykin, A. W., Brody, N.,
Ceci, S. J., Halpern, D. F., Loehlin, J. C., Perloff, R., Sternberg, R. J.,
& Urbina, S. (1996). Intelligence: Knowns and unknowns. American
Psychologist, 51, 77-101.
O'Connor, M. C. (1989). Aspects of differential performance by minorities on standardized tests: Linguistic and sociocultural factors. In B. R.
Gifford (Ed.), Test policy and test performance: Education, language,
and culture (pp. 129-181). Boston: Kluwer Academic.
O'Neil, H. F., & Brown, R. S. (1997). Differential effects of question
formats in math assessment on metacognition and affect (Tech. Rep.
No. 449). Los Angeles: University of California, National Center for
Research on Evaluation, Standards, and Student Testing.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive
meta-analysis of integrity test validities: Findings and implications for
personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 656-664.
Pear, R. (1996, November 6). The 1996 elections: The nation—the states.
The New York Times, p. B7.
Powers, D. E. (1987). Who benefits most from preparing for a "coachable"
admissions test? Journal of Educational Measurement, 24, 247-262.
Powers, D. E., & Swinton, S. S. (1982). The effects of self-study of test
familiarization materials for the analytical section of the GRE Aptitude
Test (GRE Board Research Report GREB No. 79-9). Princeton, NJ:
Educational Testing Service.
Powers, D. E., & Swinton, S. S. (1984). Effects of self-study for coachable
test item types. Journal of Educational Psychology, 76, 266-278.
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for
reducing adverse impact and their effects on criterion-related validity.
Human Performance, 9, 241-258.
Robertson, I. T., & Kandola, R. S. (1982). Work sample tests: Validity,
adverse impact, and applicant reaction. Journal of Occupational Psychology, 55, 171-183.
Ryan, A. M., Ployhart, R. E., & Friedel, L. A. (1998). Using personality
testing to reduce adverse impact: A cautionary note. Journal of Applied
Psychology, 83, 298-307.
Ryan, A. M., Ployhart, R. E., Greguras, G. J., & Schmit, M. J. (1998). Test
preparation programs in selection contexts: Self-selection and program
effectiveness. Personnel Psychology, 51, 599-622.
Ryer, J. A., Schmidt, D. B., & Schmitt, N. (1999, April). Candidate
orientation programs: Effects on test scores and adverse impact. Paper
presented at the annual conference of the Society for Industrial and
Organizational Psychology, Atlanta, GA.
Sacco, J. M., Scheu, C. R., Ryan, A. M., Schmitt, N., Schmidt, D. B., &
Rogg, K. L. (2000). Reading level and verbal test scores as predictors
of subgroup differences and validities of situational judgment tests.
Unpublished manuscript.
Sackett, P. R. (1998). Performance assessment in education and professional certification: Lessons for personnel selection. In M. D. Hakel
(Ed.), Beyond multiple-choice: Evaluating alternatives to traditional
testing for selection (pp. 113-129). Mahwah, NJ: Erlbaum.
Sackett, P. R., Burris, L. R., & Ryan, A. M. (1989). Coaching and practice
effects in personnel selection. In C. L. Cooper & I. Robertson (Eds.),
International review of industrial and organizational psychology 1989.
London: Wiley.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multipredictor composites on group differences and adverse impact. Personnel Psychology, 50, 707-722.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other
forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Salgado, J. F. (1997). The five factor model of personality and job
performance in the European community. Journal of Applied Psychology, 82, 30-43.
Scarr, S. (1981). Race, social class, and individual differences in I.Q.
Hillsdale, NJ: Erlbaum.
Scheuneman, J. (1987). An experimental, exploratory study of causes of
bias in test items. Journal of Educational Measurement, 24, 97-118.
Scheuneman, J., & Gerritz, K. (1990). Using differential item functioning
procedures to explore sources of item difficulty and group performance
characteristics. Journal of Educational Measurement, 27, 109-131.
Scheuneman, J., & Grima, A. (1997). Characteristics of quantitative word
items associated with differential performance for female and Black
examinees. Applied Measurement in Education, 10, 299-319.
Schmeiser, C. B., & Ferguson, R. L. (1978). Performance of Black and
White students on test materials containing content based on Black and
White cultures. Journal of Educational Measurement, 15, 193-200.
Schmidt, F. L. (1988). The problem of group differences in ability test
scores in employment selection. Journal of Vocational Behavior, 33,
272-292.
Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton,
F. W. (1977). Job sample vs. paper-and-pencil trades and technical
tests: Adverse impact and examinee attitudes. Personnel Psychology,
30, 187-196.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories
and new research findings. American Psychologist, 36, 1128-1137.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection
methods in personnel psychology: Practical and theoretical implications
of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmidt, F. L., Kaplan, J. R., Bemis, S. E., Decuir, R., Dunn, L., &
Antone, L. (1979). The behavioral consistency method of unassembled
examining (TM-79-21). Washington, DC: U.S. Office of Personnel
Management, Personnel Research and Development Center.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the
occupation of U.S. park ranger for three modes of test use. Journal of
Applied Psychology, 69, 490-497.
Schmit, M. J. (1994). Pre-employment processes and outcomes, applicant
belief systems, and minority-majority group differences. Unpublished
doctoral dissertation, Bowling Green State University.
Schmit, M. J., & Ryan, A. M. (1997). Applicant withdrawal: The role of
test-taking attitudes and racial differences. Personnel Psychology, 50,
855-876.
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for
minority examinees on the SAT. Journal of Educational Measurement,
27, 67-81.
Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences
associated with different measures of some job-relevant constructs. In
C. R. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 11, pp. 115-140). New York:
Wiley.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. P. (1984).
Meta-analyses of validity studies published between 1964 and 1982 and
the investigation of study characteristics. Personnel Psychology, 37,
407-422.
Schmitt, N., & Mills, A. E. (in press). Traditional tests and simulations:
Minority and majority performance and test validities. Journal of Applied Psychology.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997).
Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719-730.
Smith, F. D. (1991). Work samples as measures of performance. In A. K.
Wigdor & B. G. Green Jr. (Eds.), Performance assessment for the
workplace (pp. 27-52). Washington, DC: National Academy Press.
Stecher, B. M., & Klein, S. P. (1997). The cost of science performance
assessments in large-scale testing programs. Educational Evaluation
and Policy Analysis, 19, 1-14.
Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613-629.
Steele, C. M. (1999, August). Thin ice: "Stereotype threat" and Black
college students. Atlantic Monthly, 284(3), 44-54.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual
test performance of African Americans. Journal of Personality and
Social Psychology, 69, 797-811.
Stokes, G. S., Mumford, M. D., & Owens, W. A. (1994). Biodata
handbook. Palo Alto, CA: Consulting Psychologists Press.
Supovitz, J. A., & Brennan, R. T. (1997). Mirror, mirror on the wall,
which is the fairest test of all? An examination of the equitability of
portfolio assessment relative to standardized tests. Harvard Educational
Review, 67, 472-506.
Thornton, G. C., & Byham, W. C. (1982). Assessment centers and
managerial performance. New York: Academic Press.
Verhovek, S. H., & Ayres, B. D., Jr. (1998, November 4). The 1998
elections: The nation—referendums. The New York Times, p. B2.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25-49.
Welch, C., Doolittle, A., & McLarty, J. (1989). Differential performance
on a direct measure of writing skills for Black and White college
freshmen (ACT Research Reports, Tech. Rep. No. 89-8). Iowa City, IA:
ACT.
Whaley, A. L. (1998). Issues of validity in empirical tests of stereotype
threat theory. American Psychologist, 53, 679-680.
White, E. M., & Thomas, L. L. (1981). Racial minorities and writing skills
assessment in the California State University and colleges. College
English, 43, 276-283.
Whitney, D. J., & Schmitt, N. (1997). Relationship between culture and
responses to biodata employment items. Journal of Applied Psychology, 82, 113-129.
Wightman, L. F. (1997). The threat to diversity in legal education: An
empirical analysis of the consequences of abandoning race as a factor
in law school admission decisions. New York University Law Review,
72, 1-53.
Wild, C. L., Durso, R., & Rubin, D. B. (1982). Effects of increased
test-taking time on test scores by ethnic group, years out of school, and
sex. Journal of Educational Measurement, 19, 19-28.
Wilson, K. M. (1981). Analyzing the long-term performance of minority
and nonminority students: A tale of two studies. Research in Higher
Education, 15, 351-375.
Wolfe, R. N., & Johnson, S. D. (1995). Personality as a predictor of
college performance. Educational and Psychological Measurement, 55,
177-185.