EDUCATING PHYSICIANS
RESEARCH REPORT

A Comparative Study of Measures to Evaluate Medical Students' Performances

Ronald A. Edelstein, EdD, Helen M. Reid, MA, Richard Usatine, MD, and Michael S. Wilkes, MD, PhD
ABSTRACT

Purpose. To assess how new National Board of Medical Examiners (NBME) performance examinations—computer-based case simulations (CBX) and standardized-patient examinations (SPX)—compare with each other and with traditional internal and external measures of medical students' performances. A secondary objective was to examine students' attitudes toward new and traditional evaluation modalities.

Method. Fourth-year students (n = 155) at the University of California, Los Angeles, School of Medicine (including joint programs at Charles R. Drew University of Medicine and Science and the University of California, Riverside) were assigned two days of performance examinations (eight SPXs, ten CBXs, and a self-administered attitudinal survey). The CBX was scored by the NBME and the SPX by an NBME/Macy consortium. Scores were linked to the survey and correlated with archival student data, including traditional performance indicators (licensing board scores, grade-point averages, etc.).
Results. Of the 155 students, 95% completed the testing. The CBX and the SPX had low to moderate statistically significant correlations with each other and with traditional measures of performance. Traditional measures were intercorrelated at higher levels than with the CBX or SPX. Students' perceptions of the various evaluation methods varied with the assessment task. These findings are consistent with the theoretical construct for development of performance examinations. For example, to assess clinical decision making, students rated the CBX best, while they rated multiple-choice examinations best to assess knowledge.

Conclusion. Examination results and student perception studies provide converging evidence that performance examinations measure different physician competency domains and support using multipronged assessment approaches.

Acad. Med. 2000;75:825–833.

Dr. Edelstein is senior associate dean of academic affairs and associate professor, Department of Family Medicine, Charles R. Drew University of Medicine and Science. Ms. Reid was senior statistician, Department of Psychiatry and Biobehavioral Sciences, UCLA School of Medicine. Dr. Usatine is associate professor, Department of Family Medicine, and assistant dean of student affairs, UCLA School of Medicine. Dr. Wilkes is associate professor, UCLA Office of the Dean, Center for Educational Development and Research, and the Division of General Internal Medicine and Health Services Research.

Correspondence and requests for reprints should be addressed to Dr. Edelstein, Charles R. Drew University of Medicine and Science, College of Medicine—Academic Affairs, 1731 E. 120th Street, Los Angeles, CA 90059; e-mail: <[email protected]>.
Accurately measuring physicians' competencies is a crucial step toward improving medical education and, in turn, improving health care.1,2 The need to rigorously evaluate physicians' performances (by measuring, among other things, process, outcome, and effectiveness) pervades the economic, social, and scientific debates surrounding health care reform.3–5 The growing concern for accountability is demonstrated in current proposals in both the public and private sectors to develop and implement practice guidelines, outcomes assessment, evidence-based medicine, and measures such as consumer satisfaction surveys. Professional oversight bodies at the national, state, and local levels are challenging medical schools to undertake curricular reforms, including assessment procedures that will assure that graduates will be able to practice successfully and contribute to improved patient care in the newly restructured health care environment in the United States.4,6,7
Given the pronounced scrutiny of physicians' performances and educational reform, it is easy to understand why new evaluation methods are receiving increased attention in medical schools.8–10 There is concern that traditional measures of performance (multiple-choice examinations, medical licensing examinations, grades, and narrative rating forms) are constrained, subjective, or test recall and memorization rather than application and decision making. Schools are increasingly turning to such simulated performance-evaluation methods as standardized-patient examinations (SPXs; in which students interact with patients who are
portrayed by actors or specially trained
patients) and computer-based case-simulation examinations (CBXs; in which
students interactively manage medical
scenarios presented on computers),
which assess complex behavioral and
cognitive dimensions. Both SPXs and
CBXs allow students to experience realistic problems and demonstrate the
ability to make clinical judgments without the risk of harm to actual patients.11–13 The student’s decision-making process is captured for later analysis
either in the computer’s memory or on
videotape. Each method of simulation
has its own theoretical strengths and
limitations. CBXs can avoid cueing students’ responses by presenting cases that
have multiple pathways and unfold over
time, revealing pertinent information in
stages. After the examination, educators
can explore in detail the subcomponents of the student’s decision-making
process—such as sequencing, timing,
and pathway—to reveal and classify
management errors.14–21 SPXs evaluate students’ questioning patterns,
their communication and interpersonal
skills, and their abilities to conduct a
patient history and physical examination.
Confidence in cognitive–behavioral
(performance) examination methods
was evidenced by the 1995 decision of
the governing bodies of the National
Board of Medical Examiners (NBME)
and the Federation of State Medical Boards to implement a computer-based case-simulation examination (named Primum) as part of the
United States Medical Licensing Examination (USMLE) Step 3 in the fall
of 1999.12 Primum is the new Windows-based revision of the DOS version of
the CBX. They also endorsed the eventual inclusion of an SPX in the licensure process.22 Other professions (such
as airline pilots, nuclear reactor engineers, lawyers, and architects) have already incorporated such performance
measures in their training, professional
licensing, and assessment.23–25
However, enthusiasm for these newer
examinations is not universal. Some education and measurement experts cite
the need to further study the validity
of performance examinations and are
quick to remind us that simulations are,
by definition, abstractions of reality and
thus can miss significant aspects of the
complex real-practice environment.26,27
A growing body of research suggests
that a combination of evaluation methods is necessary to properly assess the
complex skills that make up the practice of medicine.2,16,23,26,28
Research on evaluation methods has
made educators and test developers
more sensitive to the concept of "consequential validity,"29 which addresses
the correspondence between the purpose of an examination, the administrative decision making based on the examination results, and the intended and
unintended changes in the educational
environment based on the examination
results. Students’ perceptions of examinations are related to the measurement
of consequential validity. The perceptions of students help identify whether
the test takers share the test makers’
conception of an examination’s purpose
and help anticipate the consequences of
performance examinations when used
for high-stakes decisions. Little research
has addressed medical students’ perceptions of the myriad evaluation tools.
Test-taker validation may be of added
importance in performance examinations, for a number of reasons: (1) students may be less familiar with test formats and react in unexpected ways; (2)
the tests require a longer sustained time
with each test case and are potentially
less frustrating if students perceive the cases to be credible; (3) prompting toward correct answers is reduced because a "correct" answer is not written on the page;
and (4) students are instructed to
behave during the simulations as if they
were conducting a real-life task. However, the formats may lend themselves
to test-taking strategies or "gaming"
plans that were not intended or perceived by the developers. While many
studies have examined gender and ethnic differences in performances on multiple-choice examinations and other
traditional evaluation methods, little
comparative research has been conducted on the interactions of ethnicity and gender as variables modifying performance on medical performance-based examinations.
To date, no one has systematically
studied an entire medical school class to
compare the SPXs and CBXs developed
by the NBME with the more traditional
evaluation measures. Such a comparison is important in that performance
examinations are increasingly used as
teaching and learning devices,30 and
medical schools bear large costs, both
financial—in terms of developing and
administering the examinations22—and
emotional—in terms of the burden
and distress experienced by the students.
The object of our study was to evaluate the experiences of an entire senior
medical school class as they took both
traditional standardized examinations
and the new performance examinations
in order to answer the following questions: (1) How does performance of
medical students vary across traditional
measures of assessment and the new examinations (CBX and SPX)? (2) How
does student performance on traditional
methods of evaluation (USMLE) correlate with performances on the CBX
and SPX? (3) How do the two new performance examinations of the NBME
correlate with each other? (4) What are
the attitudes and perceptions of students toward these new evaluation
tools compared with traditional measures of performance? (5) Are there subgroup differences that appear on performance examinations that differ from those found with traditional examinations?
METHOD
Participants
In 1995, there were 155 full-time
fourth-year students who had been enrolled in one of the three programs affiliated with the UCLA School of Medicine for the entirety of their medical
education. (We did not include other
students—transfer students, research
trainees on extended programs, or fifth-year students—in the analyses.) The
students were primarily white (40%)
and Asian (31%); however, Latinos
(16%), African Americans (12%), and
students of other ethnicities (1%) were
also members of the class (Table 1). The
majority of the students (72%) were enrolled at UCLA, 18% were enrolled at
the University of California, Riverside
(UCR), a seven-year affiliated program,
and 10% were enrolled at a joint program with Charles R. Drew University
of Medicine and Science’s College of
Medicine. Fifty-five percent of the sample was male.
Data
These fourth-year students took two senior performance examinations: a standardized-patient examination (SPX) and a computer-based case-simulation examination (CBX); 95% of the students completed both examinations.
Table 1
Demographics and Academic Scores for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995

Demographic characteristic
Gender
  Women                                        45%
  Men                                          55%
Ethnicity
  Latino                                       16%
  African American                             12%
  Asian                                        31%
  Caucasian                                    40%
  Other                                         1%
Intended clinical specialty
  Primary care*                                52%
  All others                                   48%
Academic scores (mean)
  MCAT (total)                                 30.01
  Undergraduate GPA                            3.52
  Medical school GPA (years one and two)       3.46
  Medical school GPA (years three and four)    3.41
  USMLE Step 1                                 208
  USMLE Step 2                                 206
  Computer-based examination†                  5.0
  Standardized-patient examination†            67

*For this study, "primary care" was defined to include internal medicine, family medicine, and pediatrics, without regard to students' later subspecialties.
†These two examinations were taken by the students within the same week during their fourth year.
To eliminate sequencing bias, we had half the students take the SPX on the first day of the study and the CBX on the second day; the other half took the examinations in the reverse order. After
completing both, the students filled out
a paper-and-pencil questionnaire on
clinical skills (clinical skills survey, or
CSS) that had been created at UCLA.
In addition to the data gathered from
the two examinations and the survey,
we also reviewed records from the central deans' offices regarding the students' demographics, past performances,
and specialty choices. We explain each
of these data sources in greater detail
below. We created the merged data set
in strictest confidentiality and deleted
all individual identifiers once it was
complete.
Records in the deans’ offices. In addition to demographic information, the
central dean’s office held information
about the students’ choices of medical
specialties (non–primary care versus
primary care, which was defined as family medicine, pediatrics, and internal
medicine regardless of future subspecialty plans) and the students’ past performances as measured by premedical
school grades, MCAT scores, medical
school grades, and scores on USMLE
Steps 1 and 2. Separate estimates of reliability for the new performance-evaluation methods were calculated within
this sample. Comparisons were made
with published norms and published
correlations for each of the traditional
standardized tests.
Computer-based examination. The
students’ clinical decision-making and
doctor-centered skills were assessed using a CBX (a DOS-version computer-based case-simulation examination) developed by the National Board of
Medical Examiners (NBME).31 Examinees manage ten interactive and dynamic case simulations that change in
response to the "treatment." As the
NBME describes these examinations:
Each CBX case presents a simulated
patient in an uncued, patient-management environment. The student or
physician taking a CBX case is presented with a brief description concerning the condition, circumstances,
and chief complaint(s) of the simulated patient. Subsequent information
about the patient depends on the student’s uncued request for test, therapies, procedures, or physical examination. The student is expected to
diagnose, treat, and monitor the patient’s condition as it changes over
time and in response to treatment.31
Each of our students spent up to one
day (seven working hours with a break
for lunch) in the school of medicine’s
computer lab at an IBM-compatible
computer completing the ten clinical
cases. The estimate of internal consistency, as measured by Cronbach's alpha, was .54.
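For reference, Cronbach's alpha for a test of k parts (here, the k = 10 cases) is the standard internal-consistency estimate, computed from the case-score variances and the total-score variance:

$$ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_i}{\sigma^2_{\text{total}}}\right), $$

where $\sigma^2_i$ is the variance of scores on case $i$ and $\sigma^2_{\text{total}}$ is the variance of total scores.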
Standardized-patient examination.
On another day of the same week, the
students took an SPX, also developed
by the NBME and the Macy Consortium (a group of medical schools in
Southern California that collaborated
with the NBME to construct standardized patients using NBME format and
protocols). The SPX consisted of eight
15-min SP encounters representing a
broad range of clinical areas. The students were evaluated by the SPs, who
had been trained to perform reliable assessments. The SPX was conducted at
UCLA’s Center for Clinical Education,
where, to assure quality control, the students were videotaped as they interacted with the SPs. UCLA did not
score the SPX; rather, the NBME
scored the four cases it had contributed
and the Macy Consortium scored its
four cases, both using a standardized
scoring protocol. Reliability was estimated with Cronbach’s alpha at .68,
consistent with the .69 reported by the
NBME/New York/Macy Consortium using a comparable test protocol.22
Clinical skills survey. Immediately after finishing the two performance tests, the students completed a survey
that asked them to compare the different clinical evaluation tools in terms of
how well they assessed their competencies. They were asked to select the
best of the six tools (CBX, SPX, attending evaluation, resident evaluation, multiple-choice examinations, and
oral examinations) for assessing: (1)
knowledge of medicine, (2) clinical decision-making skills, and (3) selection
of a potential caregiver for a family
member. They also separately rated
each evaluation method’s ability to accurately assess overall functioning as a
doctor using a five-point Likert scale
ranging from 1 = "not very accurate" to 5 = "extremely accurate."
Analysis
Bivariate associations between demographic information, academic records,
examination scores, perceptions of evaluation methods, and primary care interest were examined using correlations,
t tests, F tests (ANOVA), and chi-square (cross-tabulation) tests of statistical significance at probability levels of
.05 or lower. The correlation between
the results of the performance examinations (CBX and SPX) was corrected
for attenuation due to the experimental
nature of the examinations and is reported corrected and uncorrected. Correlations among performances on the
traditional examinations are reported
and compared with published correlations, which have been reported corrected and/or uncorrected for attenuation.
RESULTS

Overall Performance

The students' demographics and standardized test results are reported in Table 1. Total undergraduate GPAs ranged from 2.28 to 4.00 (mean = 3.52, SD = 0.37); combined new MCAT scores ranged from 18 to 39 (mean = 30.01, SD = 4.9 with essay ignored); USMLE Step 1 scores ranged from 155 to 246 (mean = 208.5, SD = 20.1); and USMLE Step 2 scores ranged from 141 to 250 (mean = 205.7, SD = 22.0). When we grouped pediatrics, family practice, and internal medicine together as constituting "primary care," approximately half (52%) of the students chose primary care residencies.

Correlations between examination results are listed in Table 2. The standardized tests (MCAT, USMLE, etc.) tended to correlate with one another. The CBX correlated best with the USMLE Step 2 (r = .40; p < .001), which is also given early in the fourth year of medical school. The SPX correlated at low levels with all other measures of performance and with the CBX (r = .24 uncorrected and r = .40 corrected; p < .007).

Gender differences are reported in Table 3. A significant gender difference was found on the MCAT (p = .023), with men scoring higher. Men also tended to score higher on the USMLE Step 1, but this difference was not significant (p = .10). Conversely, women scored higher on the SPX (p < .05). There was no gender difference on the CBX.

Significant ethnic differences were found on all standardized examinations and in undergraduate and medical school GPAs. The F-test values for comparisons between ethnic groups ranged from 4.1 to 29.3 (p < .003 to p < .001) for all examinations other than the SPX. For that examination, there was no ethnic difference.
Table 2
Correlations of Academic Measurements, Based on Scores for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995 (entries are r, with p in parentheses)

Measure [mean (SD); no.]                               1            2            3            4            5            6            7                8
1. Mean total undergraduate GPA                       1.00
   [3.52 (0.37); 149]
2. Total MCAT                                         .48 (.000)   1.00
   [30.0 (4.9); 139]
3. Medical school GPA, years one and two              .26 (.001)   .32 (.000)   1.00
   [3.46 (0.59); 151]
4. Medical school GPA, years three and four           .16 (.049)   .40 (.000)   .67 (.000)   1.00
   [3.51 (0.27); 147]
5. USMLE Step 1                                       .45 (.000)   .62 (.000)   .87 (.000)   .66 (.000)   1.00
   [208.5 (20.1); 148]
6. USMLE Step 2                                       .38 (.000)   .53 (.000)   .81 (.000)   .68 (.000)   .84 (.000)   1.00
   [205.7 (22.0); 131]
7. Computer-based examination (total)                 .20 (.020)   .15 (.083)   .36 (.000)   .26 (.003)   .32 (.000)   .40 (.000)   1.00
   [5.01 (0.70); 137]
8. Standardized-patient examination (total)           −.02 (.864)  .11 (.208)   .33 (.000)   .28 (.001)   .25 (.003)   .30 (.001)   .24 (.40)* (.007)  1.00
   [66.7 (7.2); 139]

*Corrected for attenuation.
Table 3
Analysis by Gender of Academic Scores and Questionnaire Ratings for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995

                                                   Women           Men
                                                   Mean (SD)       Mean (SD)       t/df/p
Academic scores
  Mean total undergraduate GPA                     3.51 (0.36)     3.52 (0.37)     −0.19/147/.849
  Total MCAT                                       29.03 (4.5)     30.90 (5.0)     −2.30/137/.023
  Medical school GPA, years one and two            3.42 (0.67)     3.49 (0.52)     −0.82/149/.411
  Medical school GPA, years three and four         3.52 (0.24)     3.51 (0.30)     0.17/145/.87
  USMLE Step 1                                     205.5 (18.3)    210.9 (21.4)    −1.66/146/.100
  USMLE Step 2                                     207.3 (18.6)    204.3 (24.7)    0.78/125/.436
  Standardized-patient examination                 68.17 (7.5)     65.69 (6.9)     2.03/136/.045
  Computer-based case-simulation examination       5.05 (0.65)     4.98 (0.76)     0.59/135/.557
Questionnaire ratings*
  Computer-based examination                       3.14 (1.01)     3.12 (0.98)     0.14/137/.886
  Standardized-patient examination                 2.81 (1.11)     3.05 (1.04)     −1.33/136/.184
  Attending physicians' reports                    3.38 (0.97)     3.43 (0.94)     −0.33/137/.745
  Residents' reports                               3.65 (1.11)     3.64 (1.04)     0.03/137/.974
  Shelf examinations                               2.43 (0.89)     2.63 (0.96)     −1.28/137/.203
  Oral examinations                                2.95 (0.97)     3.12 (1.19)     −0.89/137/.376

*Students were asked to rate the accuracy with which each of these evaluation tools is able to assess students' overall abilities to function as a doctor (1 = not very accurate to 5 = extremely accurate).
Table 4
Rankings of Six Evaluation Tools in Measuring Competencies Required to Be Physicians, Based on Questionnaire Ratings of 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995*

                                                Standardized-   Computer-    Attending     Residents'   Shelf          Oral
System Feature                                  patient Exam.   based Exam.  Physicians'   Reports      Examinations   Examinations
                                                                             Reports
To assess knowledge base                              6              4            3             2             1              5
To assess clinical decision-making skills             5              1            3             2             6              4
To assess overall ability as a doctor                 5              3            2             1             6              4
To assess a doctor as a potential caregiver
  for a family member                                 3              6            1             2             5              4

*Ranking: 1 = best tool to 6 = worst tool.
The Computer-based Examination

The CBX was scored by the NBME using a mathematical model based on how expert panels judge performance. The model counts correct actions taken at appropriate times, as well as harmful actions. It then weights and sums raw scores to produce case scores. Specifications of the scoring method are published by the NBME.13,16 When the scores for the ten cases were averaged, our students' overall CBX scores ranged from 2.9 to 6.7 (mean = 5.0, SD = .7). We found no difference by gender or specialty. There were ethnic differences comparable to those found with the traditional standardized tests.
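To make the flavor of such a weighted-sum scoring model concrete, the following is a minimal, purely illustrative Python sketch; the action names, weights, deadlines, and scale are hypothetical, and this is not the NBME's actual algorithm:

```python
# Illustrative only: a toy weighted-sum case score in the spirit of the model
# described above (credits timely correct actions, penalizes harmful ones,
# then weights and sums into a raw case score). All values are hypothetical.

def case_score(actions, key):
    """actions: list of (action, hour) tuples from one examinee's transcript;
    key: dict mapping action -> (weight, deadline_hour, harmful)."""
    raw = 0.0
    for action, hour in actions:
        if action not in key:
            continue  # actions outside the scoring key are ignored in this sketch
        weight, deadline, harmful = key[action]
        if harmful:
            raw -= weight           # harmful actions subtract credit
        elif hour <= deadline:
            raw += weight           # timely correct actions add credit
    return raw

# Hypothetical scoring key for one simulated case.
key = {
    "order_ecg":         (2.0, 1, False),
    "give_aspirin":      (3.0, 2, False),
    "order_thoracotomy": (5.0, 0, True),
}
print(case_score([("order_ecg", 0), ("give_aspirin", 1)], key))  # -> 5.0
```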
The Standardized-patient Examination
Specifications of the scoring method
and related validity studies for the SPX
used in our study are published by the
NBME.11,12 The mean score for each
student over all SPX stations was 66.7
(standard deviation = 7.2, range =
46.9 to 84.0). Women scored higher
than did men (p = .045), and those
with interests in primary care scored
higher than did those going into other
specialties. No ethnic difference was evident.
Clinical Skills Survey
The results from the clinical skills survey are reported in Table 4.
Knowledge of medicine. The students believed that the best instruments
for measuring their knowledge of
medicine were standardized shelf examinations (standard multiple-choice
examinations) (28%), evaluations by residents (22%), and evaluations by clinical attending physicians (19%). Fewer
students chose performance examinations: CBXs (14%), oral examinations
(12%), and SPXs (6%).
Clinical decision-making skills.
CBXs and evaluations by residents were
chosen by 30% and 27% of the students, respectively, as the best tools to
assess clinical decision-making skills.
Fewer students chose clinical attending physicians (17%) and oral examinations (16%), while fewer than 9% chose
SPXs. Only one student chose standardized multiple-choice examinations as an
effective assessment tool for these skills.
Overall competency. The students
rated the accuracies of the various
methods to assess overall functioning as
a physician in the following order: evaluations by residents (mean = 3.7, SD =
1.1), evaluations by clinical attending
physicians (mean = 3.4, SD = 1.0),
CBXs (mean = 3.1, SD = 1.0), oral examinations (mean = 3.0, SD = 1.1),
SPXs (mean = 2.9, SD = 1.1), and standardized multiple-choice-question examinations such as the USMLE (mean
= 2.5, SD = .9).
Selection of a personal doctor. The
students were asked, "If a new doctor was going to be caring for one of your family members, what assessment tool results would you want to know about the doctor to assess his/her overall competency?" Forty-eight percent of students favored the attending physician
evaluation. This was well above the
17% of the students who chose the resident evaluation. For this parameter,
each of the other evaluation tools was
rated best by 7% to 12% of the students.
Correlations. Analysis of variance
was performed to assess whether the students’ rankings of the accuracies of the
various evaluation methods matched
their performances on the SPX, CBX,
and multiple-choice examinations (as
represented by test scores from the
USMLE Step 1 and Step 2 examinations). We found no significant correlation. The students appear to have
evaluated the accuracies of the tests independently of their performances on
them.
Few differences were found in the students' ratings based on their specialty plans. However, students with career
plans in primary care rated the accuracy
of SPXs higher than did those with
other career plans. Conversely, students
wishing to enter non–primary care
fields gave higher accuracy ratings to
the standardized shelf examinations (p
< .03).
DISCUSSION
This study found differences between
the performance examinations and the
traditional measures of student achievement (board examinations and grades,
which in the clinical years are a summative measure of multiple-choice examination scores, oral examination
scores, and subjective ratings from attending physicians and residents). The
differences appeared when we examined
(1) correlations among various assessment devices for the entire medical student class, (2) gender and ethnic differences in performances on the SPX
compared with other examinations, (3)
students’ perceptions of the accuracies
of the examinations, and (4) students’
rankings of the relative merits of the examinations in assessing different physician attributes.
The traditional measures of student
performance (basic science grades, clinical science grades, and performances
on USMLE Steps 1 and 2) had high
inter-test correlations, and this result is
consistent with previous findings.32–36
However, the SPX and CBX correlated
only modestly with the traditional measures. The correlation of .40 between
the NBME CBX and USMLE Step 2
was within the range of previous studies
of .35–.55 corrected for attenuation
(our finding was .40 uncorrected for attenuation).16 The correlation of .30 between the SPX and USMLE Step 2 was
almost identical to findings reported by
the NBME (comparisons are reported
uncorrected for attenuation).36 The
finding of a low but statistically significant relationship of .26 uncorrected
(.40 corrected) between the SPX and
the CBX is the first report comparing
the two new NBME performance measures directly with each other. The results demonstrate differences between
the new performance examinations in
addition to their differences from the
traditional measures.
A key question when testing an examination’s validity is whether different
subgroups perform differently on it.
Overall, the ethnic and gender differences in performances on the traditional standardized tests within our
class of students matched previously
published findings. In contrast, the
new performance examinations revealed
two exceptions from the traditional
standardized examination profile: (1)
women performed better than men on
the SPXs, and (2) no overall ethnic difference was found among the students
on the SPXs. The gender differences are
consistent with findings from a range of
professional disciplines. Examination
data for the legal profession, for example, demonstrate that women score less
well on multiple-choice examinations,
but outperform men on essay examinations.37 The results in this study conform with the growing body of literature
that suggests that gender and ethnic
performance differences on standardized
examinations may be influenced by culture, communication styles, task demand, identity, students’ attitudes toward testing,38,39 and stereotype threat
(defined by Steele as "the threat that others' judgments or their own actions will negatively stereotype them . . ."38).
Differences among assessment measures emerged when we measured the
students’ attitudes. The results indicated that the students’ perceptions of
the relative strengths and weaknesses of
the various evaluation methods varied
significantly depending on the purpose
of the examination (such as whether
the purpose was to measure knowledge,
to measure clinical decision making, or
to select a physician to care for patients). The students’ perceptions were
in tune with the test makers’ published
rationale for creating the examinations.13,31
The students also rated their impressions of the accuracy with which the
examination/evaluation methods used
by the medical school measured their
overall abilities. We found only one difference in these perceptions by subgroup: minority students varied significantly (p < .05) from the other students
in their assessments of the evaluations
done by clinical attending physicians
and residents (clinical attendings received higher ratings by minority students). Their perceptions for the CBX,
SPX, multiple-choice-question, and oral
examinations did not differ from those
of the other students. This suggests that
some underrepresented-minority students may perceive the accuracies of
subjective evaluations differently.
The results of the examination comparisons and perception studies provide
separate but converging evidence that a
multipronged approach to evaluation
is the most prudent. Researchers caution against overinterpretation of correlational studies and comparison of
examinations with different response
characteristics.26 However, the correlations that we found among the traditional tests and the new examinations
are consistent with earlier studies and,
when combined with the student-perception studies, lend support to the
theoretical construct that the new
examinations measure different, albeit interrelated, domains of competency.
Students individually and in subgroups
do not perform the same on all tests,
and they express sensitivity to the need
for different formats for different purposes. The use of multiple evaluation
tools thus allows finer gradations in individual assessment. Multiple modes of
assessment offer the benefit of reduced
differences or perceptions of advantages
that can be attributed to preference for
any one method or style of test taking.29,31 The relatively low correlations
of premedical-school predictors (grades
and MCAT scores) with ultimate
achievement in senior years of medical
school, as described in the medical education literature and in this study, especially as measured by the new performance examinations, argue against
overreliance on any one set of selection
factors and supports composite evaluations based on multiple criteria.21,28,40
Our study is not without limitations.
Although 95% of the students in the
class completed the SPX and CBX (illness and away electives were the only
exceptions), these were low-stakes examinations—the students were not penalized or rewarded based on their performances—and this may have affected
our results. The students were required
to participate in practice examinations
for the CBX and to complete at least
three practice cases per protocol guidelines, but feedback from proctors indicated different levels of familiarity with
the format. The NBME scored the CBX
using its experimental scoring method.
The students completed ten multispecialty, computerized cases during the
eight-hour examination. Other studies
have suggested that more than ten cases
are needed to demonstrate sufficient reliability when the cases are multidisciplinary.13,26 The SPX consisted of four
examinations developed by the NBME
and four locally developed by the
NBME-affiliated Macy consortium of
medical schools using similar criteria
but non-identical test developers as an
ongoing test of examination reliability
and test-construction rules. Lengthening the examination might increase reliability.22
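As a rough projection (ours, not the authors'; it assumes any added stations would be statistically comparable to the current ones), the standard Spearman–Brown formula suggests what doubling the eight-station SPX could yield, starting from the observed alpha of .68:

$$ r_{k} = \frac{k\,r}{1+(k-1)\,r} = \frac{2(.68)}{1+(2-1)(.68)} \approx .81 $$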
While this study was conducted at
only one medical school, and the results
should not necessarily be extrapolated
to all medical schools, UCLA School of
Medicine and its affiliated programs,
University of California, Riverside, and
Charles R. Drew University of Medicine and Science’s College of Medicine,
constitute one of the most diverse student bodies in the United States. However, the absolute number of underrepresented-minority students in each
subgroup is relatively small for statistical
purposes and cautions against over-generalizations.
This study supports the finding that physician competency may be a multidimensional trait that requires a variety of evaluation methods to assess accurately.
The next steps would be to replicate
this examination under high-stakes
conditions at multiple sites and to control for case specificity and case familiarity. Long-term studies are needed to
assess the predictive validity of the new
performance evaluations. One approach
would include a follow-up survey to determine whether performance examinations predict later success in clinical
training.
The authors thank Stuart Slavin, MD, LuAnn
Wilkerson, EdD, and Ava Yajima for their support
and helpful comments and suggestions. They also
thank Evie Kumpart, Marjorie Covey, Elizabeth
O’Gara, Pat Anaya, and Ligia Perez for their assistance and spirit of teamwork.
REFERENCES

1. Blumenthal D. Quality of health care: Part 1. Quality of care: what is it? N Engl J Med. 1996;335:891–3.
2. Jones R. Academic Medicine: Institutions, Programs, and Issues. Washington, DC: Association of American Medical Colleges, 1997.
3. Shugars D, O'Neil EH, Bader JD (eds). Healthy America: Practitioners for 2005; An Agenda for US Health Professional Schools. Durham, NC: The Pew Health Commissions, 1991.
4. Association of American Medical Colleges. Taking Charge of the Future: The Strategic Plan for the Association of American Medical Colleges. Washington, DC: Association of American Medical Colleges, 1995.
5. Brook RH, Kamberg CJ, McGlynn EA. Health system reform and quality. JAMA. 1996;276:476–80.
6. Pew Health Professions Commission. Critical Challenges: Revitalizing the Health Professions for the Twenty First Century. San Francisco, CA: UCSF Center for the Health Professions, 1995.
7. Rivo R. Managed health care: implications for the physician workforce and medical education. JAMA. 1995;274:712–5.
8. Mast TA, Anderson B. Special Section: Annex to the Proceedings of the AAMC Consensus Conference on the Use of Standardized Patients in the Teaching and Evaluation of Clinical Skills. Teach Learn Med. 1994;6.
9. Dauphinee WD. Assessing clinical performance: where do we stand and what might we expect? JAMA. 1995;274:741–2.
10. Anderson MK, Kassebaum DG. Special Issue: Proceedings of the AAMC's Consensus Conference on the Use of Standardized Patients in the Teaching and Evaluation of Clinical Skills. Acad Med. 1993;68.
11. National Board of Medical Examiners. 1994 Annual Report. Philadelphia, PA: National Board of Medical Examiners, 1995.
12. Federation of State Medical Boards of the United States (FSMB) and National Board of Medical Examiners (NBME). Plans for Administering the Medical Licensing Examination on Computer. Special Bulletin on Computer-based Testing for the United States Medical Licensing Examination. 1998.
13. Clyman SG, Melnick DE, Clauser BE. In: Mancall EL, Bashook PG (eds). Assessing Clinical Reasoning: The Oral Examination and Alternative Methods. Evanston, IL: American Board of Medical Specialties, 1995:139–49.
14. Edelstein RA, Clyman SG. Computer-based simulations as adjuncts for teaching and evaluating complex clinical skills. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW (eds). Advances in Medical Education. Dordrecht, The Netherlands: Kluwer Academic Publishers, 1997:327–9.
15. McGuire C. Reflections of a maverick measurement maven. JAMA. 1995;274:735–40.
16. Mancall EL, Bashook PG, Dockery JL. Computer-Based Examinations for Board Certification. Evanston, IL: American Board of Medical Specialties, 1996.
17. Leape L. Error in medicine. JAMA. 1994;272:1851–7.
18. Druckman D, Bjork R. Learning, Remembering, Believing: Enhancing Human Performance. Washington, DC: National Academy Press, 1994.
19. Durlach N. Virtual Reality: Scientific and Technological Challenges. Washington, DC: National Academy Press, 1995.
20. Regehr GN, Norman GR. Issues in cognitive psychology: implications for professional education. Acad Med. 1996;71:988–1001.
21. Elstein A. Beyond multiple-choice questions and essays: the need for a new way to assess clinical competence. Acad Med. 1993;68:244–9.
22. Colliver JA, Swartz MD. Assessing clinical performance with standardized patients. JAMA. 1997;278:790–1.
23. Mancall EL, Bashook PG. Assessing Clinical Reasoning: The Oral Examination and Alternative Methods. Evanston, IL: American Board of Medical Specialties, 1995.
24. Smith JP. Issues in Development of a Performance Test for Lawyers. Paper presented at the annual meeting of the National Council on Measurement in Education, Symposium P3, Performance Assessment: Evaluating Competing Perspectives on Use for High Stakes Testing. New Orleans, LA, 1994.
25. Jones ER, Hennessy RT, Deutsch S. Human Factors Aspects of Simulation. Washington, DC: National Academy Press, 1985.
26. Swanson DB, Norman GR, Linn RL. Performance-based assessment: lessons from the health professions. Educ Res. 1995;24:5–11.
27. Messick S. Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741–9.
28. Anderson N. The mismeasure of medical education. Acad Med. 1990;65:159–60.
29. Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Res. 1994;23:13–23.
30. Barzansky B, Jonas HS, Etzel SI. Educational programs in US medical schools, 1995–96. JAMA. 1996;276:714–9.
31. National Board of Medical Examiners. The National Board of Medical Examiners' Computer-based Examination–Clinical Simulation. Philadelphia, PA: NBME, 1991.
32. O'Donnell SO, Obenshain SC, Erdman J. Background essential to the proper use of results of Step 1 and Step 2 of the USMLE. Acad Med. 1993;68:734–9.
33. Hoffman K. The USMLE, the NBME Subject Examinations, and the assessment of individual academic achievements. Acad Med. 1993;68:740–7.
34. Williams R. Use of NBME and USMLE examinations to evaluate medical education programs. Acad Med. 1993;68:748–52.
35. Berner ES, Brooks CM, Erdman JB. Use of the USMLE to select residents. Acad Med. 1993;68:753–9.
36. Clauser BR, Ripkey D, Fletcher B, King A, Klass D, Orr N. A comparison of pass/fail classifications made with scores from the NBME standardized-patient examination and Part II examination. Acad Med. 1993;68(10 suppl).
37. Klein SP. The costs and benefits of performance testing on the bar examination. Bar Examiner. 1996;13–20.
38. Steele CM. A threat in the air: how stereotypes shape intellectual identity and performance. Am Psychol. 1997;52:613–29.
39. Sternberg RJ, Williams WM. Does the Graduate Record Examination predict meaningful success in the graduate training of psychologists? A case study. Am Psychol. 1997;52:630–41.
40. Bowles L. Use of NBME and USMLE scores. Acad Med. 1993;68:778.