EDUCATING PHYSICIANS: RESEARCH REPORT

A Comparative Study of Measures to Evaluate Medical Students' Performances

Ronald A. Edelstein, EdD, Helen M. Reid, MA, Richard Usatine, MD, and Michael S. Wilkes, MD, PhD

ABSTRACT

Purpose. To assess how new National Board of Medical Examiners (NBME) performance examinations—computer-based case simulations (CBX) and standardized patient exams (SPX)—compare with each other and with traditional internal and external measures of medical students' performances. Secondary objectives examined attitudes of students toward new and traditional evaluation modalities.

Method. Fourth-year students (n = 155) at the University of California, Los Angeles, School of Medicine (including joint programs at Charles R. Drew University of Medicine and Science and University of California, Riverside) were assigned two days of performance examinations (eight SPXs, ten CBXs, and a self-administered attitudinal survey). The CBX was scored by the NBME and the SPX by a NBME/Macy consortium. Scores were linked to the survey and correlated with archival student data, including traditional performance indicators (licensing board scores, grade-point averages, etc.).

Results. Of the 155 students, 95% completed the testing. The CBX and the SPX had low to moderate statistically significant correlations with each other and with traditional measures of performance. Traditional measures were intercorrelated at higher levels than with the CBX or SPX. Students' perceptions of the various evaluation methods varied based on the assessment. These findings are consistent with the theoretical construct for development of performance examinations. For example, to assess clinical decision making, students rated the CBX best, while they rated multiple-choice examinations best to assess knowledge.

Conclusion. Examination results and student perception studies provide converging evidence that performance examinations measure different physician competency domains and support using multipronged assessment approaches.

Acad Med. 2000;75:825–833.

Dr. Edelstein is senior associate dean of academic affairs and associate professor, Department of Family Medicine, Charles R. Drew University of Medicine and Science. Ms. Reid was senior statistician, Department of Psychiatry and Biobehavioral Sciences, UCLA School of Medicine. Dr. Usatine is associate professor, Department of Family Medicine, and assistant dean of student affairs, UCLA School of Medicine. Dr. Wilkes is associate professor, UCLA Office of the Dean, Center for Educational Development and Research, and the Division of General Internal Medicine and Health Services Research.

Correspondence and requests for reprints should be addressed to Dr. Edelstein, Charles R. Drew University of Medicine and Science, College of Medicine, Academic Affairs, 1731 E. 120th Street, Los Angeles, CA 90059; e-mail: [email protected].

Accurately measuring physicians' competencies is a crucial step toward improving medical education and, in turn, improving health care.1,2 The need to rigorously evaluate physicians' performances (by measuring, among other things, process, outcome, and effectiveness) pervades the economic, social, and scientific debates surrounding health care reform.3–5 The growing concern for accountability is demonstrated in current proposals in both the public and private sectors to develop and implement practice guidelines, outcomes assessment, evidence-based medicine, and measures such as consumer satisfaction surveys.
Professional oversight bodies at the national, state, and local levels are challenging medical schools to undertake curricular reforms, including assessment procedures that will assure that graduates will be able to practice successfully and contribute to improved patient care in the newly restructured health care environment in the United States.4,6,7 Given the pronounced scrutiny of physicians' performances and educational reform, it is easy to understand why new evaluation methods are receiving increased attention in medical schools.8–10 There is concern that traditional measures of performance (multiple-choice examinations, medical licensing examinations, grades, and narrative rating forms) are constrained, subjective, or test recall and memorization rather than application and decision making. Schools are increasingly turning to such simulated performance-evaluation methods as standardized-patient examinations (SPXs; in which students interact with patients who are portrayed by actors or specially trained patients) and computer-based case-simulation examinations (CBXs; in which students interactively manage medical scenarios presented on computers), which assess complex behavioral and cognitive dimensions. Both SPXs and CBXs allow students to experience realistic problems and demonstrate the ability to make clinical judgments without the risk of harm to actual patients.11–13 The student's decision-making process is captured for later analysis either in the computer's memory or on videotape.

Each method of simulation has its own theoretical strengths and limitations. CBXs can avoid cueing students' responses by presenting cases that have multiple pathways and unfold over time, revealing pertinent information in stages. After the examination, educators can explore in detail the subcomponents of the student's decision-making process—such as sequencing, timing, and pathway—to reveal and classify management errors.14–21 SPXs evaluate students' questioning patterns, their communication and interpersonal skills, and their abilities to conduct a patient history and physical examination.

Confidence in cognitive–behavioral (performance) examination methods was evidenced by the 1995 decision of the governing bodies of the National Board of Medical Examiners (NBME) and the Federation of State Medical Boards to implement a computer-based case-simulation examination (named Primum) as part of the United States Medical Licensing Examination (USMLE) Step 3 in the fall of 1999.12 Primum is the new Windows-based revision of the DOS version of the CBX. The two bodies also endorsed the eventual inclusion of an SPX in the licensure process.22 Other professions (such as airline pilots, nuclear reactor engineers, lawyers, and architects) have already incorporated such performance measures in their training, professional licensing, and assessment.23–25 However, enthusiasm for these newer examinations is not universal.
Some education and measurement experts cite the need to further study the validity of performance examinations and are quick to remind us that simulations are, by definition, abstractions of reality and thus can miss significant aspects of the complex real-practice environment.26,27 A growing body of research suggests that a combination of evaluation methods is necessary to properly assess the complex skills that make up the practice of medicine.2,16,23,26,28

Research on evaluation methods has made educators and test developers more sensitive to the concept of "consequential validity,"29 which addresses the correspondence between the purpose of an examination, the administrative decision making based on the examination results, and the intended and unintended changes in the educational environment based on the examination results. Students' perceptions of examinations are related to the measurement of consequential validity. The perceptions of students help identify whether the test takers share the test makers' conception of an examination's purpose and help anticipate the consequences of performance examinations when used for high-stakes decisions. Little research has addressed medical students' perceptions of the myriad evaluation tools. Test-taker validation may be of added importance in performance examinations, for a number of reasons: (1) students may be less familiar with test formats and react in unexpected ways; (2) the tests require a longer sustained time with each test case and are potentially less frustrating if students perceive they are credible; (3) prompting toward correct answers is reduced because a "correct" answer is not written on the page; and (4) students are instructed to behave during the simulations as if they were conducting a real-life task. However, the formats may lend themselves to test-taking strategies or "gaming" plans that were not intended or perceived by the developers.

While many studies have examined gender and ethnic differences in performances on multiple-choice examinations and other traditional evaluation methods, little comparative research has been conducted on the interactions of ethnicity and gender as variables modifying performance on medical performance-based examinations. To date, no one has systematically studied an entire medical school class to compare the SPXs and CBXs developed by the NBME with the more traditional evaluation measures. Such a comparison is important in that performance examinations are increasingly used as teaching and learning devices,30 and medical schools bear large costs, both financial—in terms of developing and administering the examinations22—and emotional—in terms of the burden and distress experienced by the students.

The object of our study was to evaluate the experiences of an entire senior medical school class as its students took both traditional standardized examinations and the new performance examinations in order to answer the following questions: (1) How does performance of medical students vary across traditional measures of assessment and the new examinations (CBX and SPX)? (2) How does student performance on traditional methods of evaluation (USMLE) correlate with performances on the CBX and SPX? (3) How do the two new performance examinations of the NBME correlate with each other? (4) What are the attitudes and perceptions of students toward these new evaluation tools compared with traditional measures of performance?
(5) Are there subgroup differences that appear on performance examinations that differ from those found with traditional examinations?

METHOD

Participants

In 1995, there were 155 full-time fourth-year students who had been enrolled in one of the three programs affiliated with the UCLA School of Medicine for the entirety of their medical education. (We did not include other students—transfer students, research trainees on extended programs, or fifth-year students—in the analyses.) The students were primarily white (40%) and Asian (31%); however, Latinos (16%), African Americans (12%), and students of other ethnicities (1%) were also members of the class (Table 1). The majority of the students (72%) were enrolled at UCLA, 18% were enrolled at the University of California, Riverside (UCR), a seven-year affiliated program, and 10% were enrolled at a joint program with Charles R. Drew University of Medicine and Science's College of Medicine. Fifty-five percent of the sample was male.

Table 1
Demographics and Academic Scores for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995

Gender
  Women                                          45%
  Men                                            55%
Ethnicity
  Latino                                         16%
  African American                               12%
  Asian                                          31%
  Caucasian                                      40%
  Other                                           1%
Intended clinical specialty
  Primary care*                                  52%
  All others                                     48%
Academic scores (mean)
  MCAT (total)                                   30.01
  Undergraduate GPA                              3.52
  Medical school GPA (years one and two)         3.46
  Medical school GPA (years three and four)      3.41
  USMLE Step 1                                   208
  USMLE Step 2                                   206
  Computer-based examination†                    5.0
  Standardized-patient examination†              67

*For this study, "primary care" was defined to include internal medicine, family medicine, and pediatrics, without regard to students' later subspecialties.
†These two examinations were taken by the students within the same week during their fourth year.

Data

These fourth-year students took two senior performance examinations: a standardized-patient examination (SPX) and a computer-based case-simulation examination (CBX); 95% of the students completed both examinations. To eliminate sequencing bias, we had half the students take the SPX on the first day of the study and the CBX on the second day; the other half took the examinations in the reverse order. After completing both, the students filled out a paper-and-pencil questionnaire on clinical skills (clinical skills survey, or CSS) that had been created at UCLA. In addition to the data gathered from the two examinations and the survey, we also researched records from the central dean's offices regarding the students' demographics, past performances, and specialty choices. We explain each of these data sources in greater detail below. We created the merged data set in strictest confidentiality and deleted all individual identifiers once it was complete.

Records in the deans' offices. In addition to demographic information, the central dean's office held information about the students' choices of medical specialties (non–primary care versus primary care, which was defined as family medicine, pediatrics, and internal medicine regardless of future subspecialty plans) and the students' past performances as measured by premedical school grades, MCAT scores, medical school grades, and scores on USMLE Steps 1 and 2. Separate estimates of reliability for the new performance-evaluation methods were calculated within this sample.
Comparisons were made with published norms and published correlations for each of the traditional standardized tests.

Computer-based examination. The students' clinical decision-making and doctor-centered skills were assessed using a CBX (a DOS-version computer-based case-simulation examination) developed by the National Board of Medical Examiners (NBME).31 Examinees manage ten interactive and dynamic case simulations that change in response to the "treatment." As the NBME describes these examinations:

  Each CBX case presents a simulated patient in an uncued, patient-management environment. The student or physician taking a CBX case is presented with a brief description concerning the condition, circumstances, and chief complaint(s) of the simulated patient. Subsequent information about the patient depends on the student's uncued request for tests, therapies, procedures, or physical examination. The student is expected to diagnose, treat, and monitor the patient's condition as it changes over time and in response to treatment.31

Each of our students spent up to one day (seven working hours with a break for lunch) in the school of medicine's computer lab at an IBM-compatible computer completing the ten clinical cases. The estimate of internal consistency, measured by Cronbach's alpha, was .54.

Standardized-patient examination. On another day of the same week, the students took an SPX, also developed by the NBME and the Macy Consortium (a group of medical schools in Southern California that collaborated with the NBME to construct standardized patients using NBME format and protocols). The SPX consisted of eight 15-minute SP encounters representing a broad range of clinical areas. The students were evaluated by the SPs, who had been trained to perform reliable assessments. The SPX was conducted at UCLA's Center for Clinical Education, where, to assure quality control, the students were videotaped as they interacted with the SPs. UCLA did not score the SPX; rather, the NBME scored the four cases it had contributed and the Macy Consortium scored its four cases, both using a standardized scoring protocol. Reliability was estimated with Cronbach's alpha at .68, consistent with the .69 reported by the NBME/New York/Macy Consortium using a comparable test protocol.22

Clinical skills survey. Immediately after finishing the two performance tests, the students completed a survey that asked them to compare the different clinical evaluation tools in terms of how well they assessed their competencies. They were asked to select the best of the six tools (CBX, SPX, attending evaluation, resident evaluation, multiple-choice examinations, and oral examinations) for assessing: (1) knowledge of medicine, (2) clinical decision-making skills, and (3) selection of a potential caregiver for a family member. They also separately rated each evaluation method's ability to accurately assess overall functioning as a doctor, using a five-point Likert scale ranging from 1 = "not very accurate" to 5 = "extremely accurate."

Analysis

Bivariate associations between demographic information, academic records, examination scores, perceptions of evaluation methods, and primary care interest were examined using correlations, t tests, F tests (ANOVA), and chi-square (cross-tabulation) tests of statistical significance at probability levels of .05 or lower.
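The reliability estimates and bivariate tests described above could be reproduced along the following lines. This is only an illustrative sketch, not the authors' code: the data file and column names (for example, cbx_total, usmle_step2, spx_total, and cbx_case_1 through cbx_case_10) are hypothetical, and Cronbach's alpha is computed directly from its standard definition.

```python
# Illustrative sketch (not the authors' code): pairwise correlations with
# p-values, a t test by gender, and Cronbach's alpha for a set of case scores.
# The data file and all column names are hypothetical.
import pandas as pd
from scipy import stats

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha from item-level scores (rows = examinees, columns = cases)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

df = pd.read_csv("student_scores.csv")  # hypothetical merged data set

# Bivariate correlation with significance test (e.g., CBX total vs. USMLE Step 2)
pairs = df[["cbx_total", "usmle_step2"]].dropna()
r, p = stats.pearsonr(pairs["cbx_total"], pairs["usmle_step2"])
print(f"r = {r:.2f}, p = {p:.3f}")

# Gender comparison of SPX scores (two-sample t test)
women = df.loc[df["gender"] == "F", "spx_total"].dropna()
men = df.loc[df["gender"] == "M", "spx_total"].dropna()
t, p = stats.ttest_ind(women, men)
print(f"t = {t:.2f}, p = {p:.3f}")

# Internal consistency of the ten CBX case scores
case_cols = [f"cbx_case_{i}" for i in range(1, 11)]
print(f"alpha = {cronbach_alpha(df[case_cols].dropna()):.2f}")
```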
The correlation between the results of the performance examinations (CBX and SPX) was corrected for attenuation due to the experimental nature of the examinations and is reported both corrected and uncorrected. Correlations among performances on the traditional examinations are reported and compared with published correlations, which have been reported corrected and/or uncorrected for attenuation.

RESULTS

Overall Performance

The students' demographics and standardized test results are reported in Table 1. Total undergraduate GPAs ranged from 2.28 to 4.00 (mean = 3.52, SD = 0.37); combined new MCAT scores ranged from 18 to 39 (mean = 30.01, SD = 4.9 with essay ignored); USMLE Step 1 scores ranged from 155 to 246 (mean = 208.5, SD = 20.1); and USMLE Step 2 scores ranged from 141 to 250 (mean = 205.7, SD = 22.0). When we grouped pediatrics, family practice, and internal medicine together as constituting "primary care," approximately half (52%) of the students chose primary care residencies.

Correlations between examination results are listed in Table 2. Standardized tests (MCAT, USMLE, etc.) tended to correlate together. The CBX correlated best with the USMLE Step 2 (r = 0.4; p < .000), which is also given early in the fourth year of medical school. The SPX correlated at low levels with all other measures of performance and with the CBX (r = .24 uncorrected and r = .40 corrected; p < .007).
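The article does not spell out which attenuation formula was applied. Assuming the standard Spearman correction based on the two examinations' internal-consistency estimates reported in the Method section (alpha = .54 for the CBX and .68 for the SPX), the corrected value can be checked:

\[
r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx}\,r_{yy}}} = \frac{.24}{\sqrt{(.54)(.68)}} \approx .40,
\]

which agrees with the corrected SPX-CBX correlation reported above.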
Gender differences are reported in Table 3. Significant differences were found for gender on the MCAT (p = .023), with men scoring higher. Men also tended to score higher on the USMLE Step 1, but this difference was not significant (p = .10). Conversely, women scored higher on the SPX (p < .05). There was no difference based on gender for the CBX.

Significant ethnic differences were found on all standardized examinations and in undergraduate and medical school GPAs. The F-test values for comparisons between ethnic groups ranged from 4.1 to 29.3 (p < .003 to p < .000) for all examinations other than the SPX. For that examination, there was no ethnic difference.

Table 2
Correlations of Academic Measurements, Based on Scores for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995 (probabilities in parentheses)

Measure [mean (SD), no.]                                        1           2           3           4           5           6           7               8
1. Mean total undergraduate GPA [3.52 (0.37), 149]              1.00
2. Total MCAT [30.0 (4.9), 139]                                 .48 (.000)  1.00
3. Medical school GPA, years one and two [3.46 (0.59), 151]     .26 (.001)  .32 (.000)  1.00
4. Medical school GPA, years three and four [3.51 (0.27), 147]  .16 (.049)  .40 (.000)  .67 (.000)  1.00
5. USMLE Step 1 [208.5 (20.1), 148]                             .45 (.000)  .62 (.000)  .87 (.000)  .66 (.000)  1.00
6. USMLE Step 2 [205.7 (22.0), 131]                             .38 (.000)  .53 (.000)  .81 (.000)  .68 (.000)  .84 (.000)  1.00
7. Computer-based examination, total [5.01 (0.70), 137]         .20 (.020)  .15 (.083)  .36 (.000)  .26 (.003)  .32 (.000)  .40 (.000)  1.00
8. Standardized-patient examination, total [66.7 (7.2), 139]    -.02 (.864) .11 (.208)  .33 (.000)  .28 (.001)  .25 (.003)  .30 (.001)  .24/.40* (.007)  1.00

*Corrected for attenuation.

Table 3
Analysis by Gender of Academic Scores and Questionnaire Ratings for 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995

                                                 Women, mean (SD)   Men, mean (SD)   t/df/p
Academic scores
  Mean total undergraduate GPA                   3.51 (0.36)        3.52 (0.37)      -0.19/147/.849
  Total MCAT                                     29.03 (4.5)        30.90 (5.0)      -2.30/137/.023
  Medical school GPA, years one and two          3.42 (0.67)        3.49 (0.52)      -0.82/149/.411
  Medical school GPA, years three and four       3.52 (0.24)        3.51 (0.30)      0.17/145/.87
  USMLE Step 1                                   205.5 (18.3)       210.9 (21.4)     -1.66/146/.100
  USMLE Step 2                                   207.3 (18.6)       204.3 (24.7)     0.78/125/.436
  Standardized-patient examination               68.17 (7.5)        65.69 (6.9)      2.03/136/.045
  Computer-based case-simulation examination     5.05 (0.65)        4.98 (0.76)      0.59/135/.557
Questionnaire ratings*
  Computer-based examination                     3.14 (1.01)        3.12 (0.98)      0.14/137/.886
  Standardized-patient examination               2.81 (1.11)        3.05 (1.04)      -1.33/136/.184
  Attending physicians' reports                  3.38 (0.97)        3.43 (0.94)      -0.33/137/.745
  Residents' reports                             3.65 (1.11)        3.64 (1.04)      0.03/137/.974
  Shelf examinations                             2.43 (0.89)        2.63 (0.96)      -1.28/137/.203
  Oral examinations                              2.95 (0.97)        3.12 (1.19)      -0.89/137/.376

*Students were asked to rate the accuracy with which each of these evaluation tools are able to assess students' overall abilities to function as a doctor (1 = not very accurate to 5 = extremely accurate).

Table 4
Rankings of Six Evaluation Tools in Measuring Competencies Required to Be Physicians, Based on Questionnaire Ratings of 155 Fourth-year Medical Students at the University of California, Los Angeles, the University of California, Riverside, and Charles R. Drew University's College of Medicine, 1995*

System feature                                      SPX   CBX   Attending   Residents   Shelf   Oral
To assess knowledge base                             6     4       3           2          1      5
To assess clinical decision-making skills            5     1       3           2          6      4
To assess overall ability as a doctor                5     3       2           1          6      4
To assess a doctor as a potential caregiver
  for a family member                                3     6       1           2          5      4

*Ranking: 1 = best tool to 6 = worst tool. SPX = standardized-patient examination; CBX = computer-based examination; Attending = attending physicians' reports; Residents = residents' reports; Shelf = shelf examinations; Oral = oral examinations.

The Computer-based Examination

The CBX was scored by the NBME using a mathematical model based on how expert panels judge performance. The model counts correct actions taken at appropriate times, as well as harmful actions. It then weights and sums raw scores to produce case scores. Specifications of the scoring method are published by the NBME.13,16 When the scores for the ten cases were averaged, our students' overall CBX scores ranged from 2.9 to 6.7 (mean = 5.0, SD = .7). We found no difference by gender or specialty. There were ethnic differences comparable to those found with the traditional standardized tests.
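The scoring description above (credit for correct actions taken at appropriate times, penalties for harmful actions, with raw scores weighted and summed into case scores) can be illustrated with a toy sketch. This is emphatically not the NBME's algorithm, whose specifications are given in the cited reports; the actions, weights, and time windows below are invented for the example.

```python
# Illustrative sketch only: a toy version of weighted case scoring in the spirit
# of the NBME description (credit timely correct actions, penalize harmful ones).
# This is NOT the NBME algorithm; actions, weights, and time windows are invented.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    hour: float  # simulated case time at which the examinee ordered the action

def score_case(actions, beneficial, harmful):
    """Weight and sum an examinee's actions into a raw case score.

    beneficial: dict mapping action name -> (weight, latest acceptable hour)
    harmful:    dict mapping action name -> penalty weight
    """
    score = 0.0
    for a in actions:
        if a.name in beneficial:
            weight, deadline = beneficial[a.name]
            score += weight if a.hour <= deadline else weight * 0.5  # partial credit if late
        elif a.name in harmful:
            score -= harmful[a.name]
    return score

# Invented example: a chest-pain case
beneficial = {"ecg": (2.0, 1.0), "aspirin": (2.0, 1.0), "troponin": (1.0, 2.0)}
harmful = {"discharge_home": 3.0}
examinee = [Action("ecg", 0.5), Action("troponin", 3.0), Action("discharge_home", 4.0)]
print(score_case(examinee, beneficial, harmful))  # 2.0 + 0.5 - 3.0 = -0.5
```

In the study itself, the resulting case scores were averaged over the ten cases to yield each student's overall CBX score.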
The Standardized-patient Examination

Specifications of the scoring method and related validity studies for the SPX used in our study are published by the NBME.11,12 The mean score for each student over all SPX stations was 66.7 (standard deviation = 7.2, range = 46.9 to 84.0). Women scored higher than did men (p = .045), and those with interests in primary care scored higher than did those going into other specialties. No ethnic difference was evident.

Clinical Skills Survey

The results from the clinical skills survey are reported in Table 4.

Knowledge of medicine. The students believed that the best instruments for measuring their knowledge of medicine were standardized shelf examinations (standard multiple-choice examinations) (28%), evaluations by residents (22%), and evaluations by clinical attending physicians (19%). Fewer students chose performance examinations: CBXs (14%), oral examinations (12%), and SPXs (6%).

Clinical decision-making skills. CBXs and evaluations by residents were chosen by 30% and 27% of the students, respectively, as the best tools to assess clinical decision-making skills. Fewer students chose clinical attending physicians (17%) and oral examinations (16%), while less than 9% chose SPXs. Only one student chose standardized multiple-choice examinations as an effective assessment tool for these skills.

Overall competency. The students rated the accuracies of the various methods to assess overall functioning as a physician in the following order: evaluations by residents (mean = 3.7, SD = 1.1), evaluations by clinical attending physicians (mean = 3.4, SD = 1.0), CBXs (mean = 3.1, SD = 1.0), oral examinations (mean = 3.0, SD = 1.1), SPXs (mean = 2.9, SD = 1.1), and standardized multiple-choice-question examinations such as the USMLE (mean = 2.5, SD = .9).

Selection of a personal doctor. The students were asked, "If a new doctor was going to be caring for one of your family members, what assessment tool results would you want to know about the doctor to assess his/her overall competency?" Forty-eight percent of the students favored the attending physician evaluation. This was well above the 17% of the students who chose the resident evaluation. For this parameter, each of the other evaluation tools was rated best by 7% to 12% of the students.

Correlations. Analysis of variance was performed to assess whether the students' rankings of the accuracies of the various evaluation methods matched their performances on the SPX, CBX, and multiple-choice examinations (as represented by test scores from the USMLE Step 1 and Step 2 examinations). We found no significant correlation. The students appear to have evaluated the accuracies of the tests independent of their performances on them. No difference was found in the students' ratings based on their specialty plans. However, students with career plans in primary care rated the accuracy of SPXs higher than did those with other career plans. Conversely, students wishing to enter non–primary care fields gave higher accuracy ratings to the standardized shelf examinations (p < .03).
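The article does not give the exact model behind the analysis of variance described in the preceding paragraph. One plausible version of such a check, sketched here with hypothetical column names (spx_rating for the 1-5 accuracy rating and spx_total for the SPX score) in the merged data set, is a one-way ANOVA of examination scores across rating groups.

```python
# Sketch of a one-way ANOVA relating students' accuracy ratings to their scores.
# This is one plausible analysis, not the authors' exact model; the data file
# and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("student_scores.csv")  # hypothetical merged data set
groups = [g["spx_total"].dropna() for _, g in df.groupby("spx_rating")]
f, p = stats.f_oneway(*groups)
print(f"F = {f:.2f}, p = {p:.3f}")
```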
DISCUSSION

This study found differences between the performance examinations and the traditional measures of student achievement (board examinations and grades, which in the clinical years are a summative measure of multiple-choice examination scores, oral examination scores, and subjective ratings from attending physicians and residents). The differences appeared when we examined (1) correlations among various assessment devices for the entire medical student class, (2) gender and ethnic differences in performances on the SPX compared with other examinations, (3) students' perceptions of the accuracies of the examinations, and (4) students' rankings of the relative merits of the examinations in assessing different physician attributes.

The traditional measures of student performance (basic science grades, clinical science grades, and performances on USMLE Steps 1 and 2) had high inter-test correlations, and this result is consistent with previous findings.32–36 However, the SPX and CBX correlated only modestly with the traditional measures. The correlation of .40 between the NBME CBX and USMLE Step 2 was within the range of previous studies of .35–.55 corrected for attenuation (our finding was .40 uncorrected for attenuation).16 The correlation of .30 between the SPX and USMLE Step 2 was almost identical to findings reported by the NBME (comparisons are reported uncorrected for attenuation).36 The finding of a low but statistically significant relationship of .26 uncorrected (.40 corrected) between the SPX and the CBX is the first report comparing the two new NBME performance measures directly with each other. The results demonstrate differences between the new performance examinations in addition to their differences from the traditional measures.

A key question when testing an examination's validity is whether different subgroups perform differently on it. Overall, the ethnic and gender differences in performances on the traditional standardized tests within our class of students matched previously published findings. In contrast, the new performance examinations revealed two exceptions from the traditional standardized examination profile: (1) women performed better than men on the SPXs, and (2) no overall ethnic difference was found among the students on the SPXs. The gender differences are consistent with findings from a range of professional disciplines. Examination data for the legal profession, for example, demonstrate that women score less well on multiple-choice examinations, but outperform men on essay examinations.37 The results in this study conform with the growing body of literature that suggests that gender and ethnic performance differences on standardized examinations may be influenced by culture, communication styles, task demand, identity, students' attitudes toward testing,38,39 and stereotype threat (defined by Steele as "the threat that others' judgments or their own actions will negatively stereotype them . . ."38).

Differences among assessment measures emerged when we measured the students' attitudes. The results indicated that the students' perceptions of the relative strengths and weaknesses of the various evaluation methods varied significantly depending on the purpose of the examination (such as whether the purpose was to measure knowledge, to measure clinical decision making, or to select a physician to care for patients). The students' perceptions were in tune with the test makers' published rationale for creating the examinations.13,31 The students also rated their impressions of the accuracy with which the examination/evaluation methods used by the medical school measured their overall abilities.
We found only one difference in these perceptions by subgroup: minority students varied significantly (p < .05) from the other students in their assessments of the evaluations done by clinical attending physicians and residents (clinical attendings received higher ratings by minority students). Their perceptions of the CBX, SPX, multiple-choice-question, and oral examinations did not differ from those of the other students. This suggests that some underrepresented-minority students may perceive the accuracies of subjective evaluations differently.

The results of the examination comparisons and perception studies provide separate but converging evidence that a multipronged approach to evaluation is the most prudent. Researchers caution against overinterpretation of correlational studies and comparison of examinations with different response characteristics.26 However, the correlations that we found among the traditional tests and the new examinations are consistent with earlier studies and, when combined with the student-perception studies, lend support to the theoretical construct that the new examinations measure different, albeit interrelated, domains of competency. Students individually and in subgroups do not perform the same on all tests, and they express sensitivity to the need for different formats for different purposes. The use of multiple evaluation tools thus allows finer gradations in individual assessment. Multiple modes of assessment offer the benefit of reduced differences or perceptions of advantages that can be attributed to preference for any one method or style of test taking.29,31 The relatively low correlations of premedical-school predictors (grades and MCAT scores) with ultimate achievement in the senior years of medical school, as described in the medical education literature and in this study, especially as measured by the new performance examinations, argue against overreliance on any one set of selection factors and support composite evaluations based on multiple criteria.21,28,40

Our study is not without limitations. Although 95% of the students in the class completed the SPX and CBX (illness and away electives were the only exceptions), these were low-stakes examinations—the students were not penalized or rewarded based on their performances—and this may have affected our results. The students were required to participate in practice examinations for the CBX and to complete at least three practice cases per protocol guidelines, but feedback from proctors indicated different levels of familiarity with the format. The NBME scored the CBX using its experimental scoring method. The students completed ten multispecialty, computerized cases during the eight-hour examination. Other studies have suggested that more than ten cases are needed to demonstrate sufficient reliability when the cases are multidisciplinary.13,26 The SPX consisted of four examinations developed by the NBME and four developed locally by the NBME-affiliated Macy Consortium of medical schools, which used similar criteria but different test developers as an ongoing test of examination reliability and test-construction rules. Lengthening the examination might increase reliability.22 While this study was conducted at only one medical school, and the results should not necessarily be extrapolated to all medical schools, the UCLA School of Medicine and its affiliated programs at the University of California, Riverside, and Charles R.
Drew University of Medicine and Science's College of Medicine constitute one of the most diverse student bodies in the United States. However, the absolute number of underrepresented-minority students in each subgroup is relatively small for statistical purposes and cautions against over-generalizations.

This study supports the finding that physician competency may be a multidimensional trait that requires a variety of evaluation methods to capture accurately. The next steps would be to replicate this examination under high-stakes conditions at multiple sites and to control for case specificity and case familiarity. Long-term studies are needed to assess the predictive validity of the new performance evaluations. One approach would include a follow-up survey to determine whether performance examinations predict later success in clinical training.

The authors thank Stuart Slavin, MD, LuAnn Wilkerson, EdD, and Ava Yajima for their support and helpful comments and suggestions. They also thank Evie Kumpart, Marjorie Covey, Elizabeth O'Gara, Pat Anaya, and Ligia Perez for their assistance and spirit of teamwork.

REFERENCES

1. Blumenthal D. Quality of health care: Part 1. Quality of care: what is it? N Engl J Med. 1996;335:891–3.
2. Jones R. Academic Medicine: Institutions, Programs, and Issues. Washington, DC: Association of American Medical Colleges, 1997.
3. Shugars D, O'Neil EH, Bader JD (eds). Healthy America: Practitioners for 2005; An Agenda for US Health Professional Schools. Durham, NC: The Pew Health Commissions, 1991.
4. Association of American Medical Colleges. Taking Charge of the Future: The Strategic Plan for the Association of American Medical Colleges. Washington, DC: Association of American Medical Colleges, 1995.
5. Brook RH, Kamberg CJ, McGlynn EA. Health system reform and quality. JAMA. 1996;276:476–80.
6. Pew Health Professions Commission. Critical Challenges: Revitalizing the Health Professions for the Twenty First Century. San Francisco, CA: UCSF Center for the Health Professions, 1995.
7. Rivo R. Managed health care: implications for the physician workforce and medical education. JAMA. 1995;274:712–5.
8. Mast TA, Anderson B. Special Section: Annex to the Proceedings of the AAMC Consensus Conference on the Use of Standardized Patients in the Teaching and Evaluation of Clinical Skills. Teach Learn Med. 1994;6.
9. Dauphinee WD. Assessing clinical performance: where do we stand and what might we expect? JAMA. 1995;274:741–2.
10. Anderson MK, Kassebaum DG. Special Issue: Proceedings of the AAMC's Consensus Conference on the Use of Standardized Patients in the Teaching and Evaluation of Clinical Skills. Acad Med. 1993;68.
11. National Board of Medical Examiners. 1994 Annual Report. Philadelphia, PA: National Board of Medical Examiners, 1995.
12. Federation of State Medical Boards of the United States (FSMB) and National Board of Medical Examiners (NBME). Plans for Administering the Medical Licensing Examination on Computer. Special Bulletin on Computer-based Testing for the United States Medical Licensing Examination. 1998.
13. Clyman SG, Melnick DE, Clauser BE. In: Mancall EL, Bashook PG (eds). Assessing Clinical Reasoning: The Oral Examination and Alternative Methods. Evanston, IL: American Board of Medical Specialties, 1995:139–49.
14. Edelstein RA, Clyman SG. Computer-based simulations as adjuncts for teaching and evaluating complex clinical skills. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW (eds). Advances in Medical Education. Dordrecht, The Netherlands: Kluwer Academic Publishers, 1997:327–9.
15. McGuire C. Reflections of a maverick measurement maven. JAMA. 1995;274:735–40.
16. Mancall EL, Bashook PG, Dockery JL. Computer-Based Examinations for Board Certification. Evanston, IL: American Board of Medical Specialties, 1996.
17. Leape L. Error in medicine. JAMA. 1994;272:1851–7.
18. Druckman D, Bjork R. Learning, Remembering, Believing: Enhancing Human Performance. Washington, DC: National Academy Press, 1994.
19. Durlach N. Virtual Reality: Scientific and Technological Challenges. Washington, DC: National Academy Press, 1995.
20. Regehr GN, Norman GR. Issues in cognitive psychology: implications for professional education. Acad Med. 1996;71:988–1001.
21. Elstein A. Beyond multiple-choice questions and essays: the need for a new way to assess clinical competence. Acad Med. 1993;68:244–9.
22. Colliver JA, Swartz MD. Assessing clinical performance with standardized patients. JAMA. 1997;278:790–1.
23. Mancall EL, Bashook PG. Assessing Clinical Reasoning: The Oral Examination and Alternative Methods. Evanston, IL: American Board of Medical Specialties, 1995.
24. Smith JP. Issues in Development of a Performance Test for Lawyers. Paper presented at the annual meeting of the National Council on Measurement in Education, Symposium P3, Performance Assessment: Evaluating Competing Perspectives on Use for High Stakes Testing. New Orleans, LA, 1994.
25. Jones ER, Hennessy RT, Deutsch S. Human Factors Aspects of Simulation. Washington, DC: National Academy Press, 1985.
26. Swanson DB, Norman GR, Linn RL. Performance-based assessment: lessons from the health professions. Educ Res. 1995;24:5–11.
27. Messick S. Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741–9.
28. Anderson N. The mismeasure of medical education. Acad Med. 1990;65:159–60.
29. Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Res. 1994;23:13–23.
30. Barzansky B, Jonas HS, Etzel SI. Educational programs in US medical schools, 1995–96. JAMA. 1996;276:714–9.
31. National Board of Medical Examiners. The National Board of Medical Examiners' Computer-based Examination–Clinical Simulation. Philadelphia, PA: NBME, 1991.
32. O'Donnell SO, Obenshain SC, Erdman J. Background essential to the proper use of results of Step 1 and Step 2 of the USMLE. Acad Med. 1993;68:734–9.
33. Hoffman K. The USMLE, the NBME Subject Examinations, and the assessment of individual academic achievements. Acad Med. 1993;68:740–7.
34. Williams R. Use of NBME and USMLE examinations to evaluate medical education programs. Acad Med. 1993;68:748–52.
35. Berner ES, Brooks CM, Erdman JB. Use of the USMLE to select residents. Acad Med. 1993;68:753–9.
36. Clauser BR, Ripkey D, Fletcher B, King A, Klass D, Orr N. A comparison of pass/fail classifications made with scores from the NBME standardized-patient examination and Part II examination. Acad Med. 1993;68(10 suppl).
37. Klein SP. The costs and benefits of performance testing on the bar examination. Bar Examiner. 1996;13–20.
38. Steele CM. A threat in the air: how stereotypes shape intellectual identity and performance. Am Psychol. 1997;52:613–29.
39. Sternberg RJ, Williams WM. Does the Graduate Record Examination predict meaningful success in the graduate training of psychologists? A case study. Am Psychol. 1997;52:630–41.
40. Bowles L. Use of NBME and USMLE scores. Acad Med. 1993;68:778.