Validating a multiple mini-interview question bank assessing entry-level reasoning skills in candidates for graduate-entry medicine and dentistry programmes

Chris Roberts,1 Nathan Zoanetti2 & Imogene Rothnie3

CONTEXT The multiple mini-interview (MMI) was initially designed to test non-cognitive characteristics related to professionalism in entry-level students. However, it may be testing cognitive reasoning skills. Candidates to medical and dental schools come from diverse backgrounds and it is important for the validity and fairness of the MMI that these background factors do not impact on their scores.

METHODS A suite of advanced psychometric techniques drawn from item response theory (IRT) was used to validate an MMI question bank in order to establish the conceptual equivalence of the questions. Bias against candidate subgroups of equal ability was investigated using differential item functioning (DIF) analysis.

RESULTS All 39 questions had a good fit to the IRT model. Of the 195 checklist items, none were found to have significant DIF after visual inspection of expected score curves, consideration of the number of applicants per category, and evaluation of the magnitude of the DIF parameter estimates.

CONCLUSIONS The question bank contains items that have been studied carefully in terms of model fit and DIF. Questions appear to measure a cognitive unidimensional construct, ‘entry-level reasoning skills in professionalism’, as suggested by goodness-of-fit statistics. The lack of items exhibiting DIF is encouraging in a contemporary high-stakes admission setting where candidates of diverse personal, cultural and academic backgrounds are assessed by common means. This IRT approach has the potential to provide assessment designers with a quality control procedure that extends to the level of checklist items.

Medical Education 2009: 43: 350–359
doi:10.1111/j.1365-2923.2009.03292.x

1 Office of Postgraduate Medical Education, University of Sydney, Sydney, New South Wales, Australia
2 Assessment Research Centre, Faculty of Education, University of Melbourne, Melbourne, Victoria, Australia
3 Office of Medical Education, Faculty of Medicine, University of Sydney, Sydney, New South Wales, Australia

Correspondence: Chris Roberts, Office of Postgraduate Medical Education (OPME), Faculty of Medicine, Mackie Building (K01), University of Sydney, Sydney, New South Wales 2006, Australia. Tel: 00 61 2 9036 9453; Fax: 00 61 2 9351 6646; E-mail: [email protected]

INTRODUCTION

Selection procedures are arguably the most high-stakes, stressful, contentious and resource-intensive of all medical and dental school assessments.
In combination with measures of previous academic achievement, many schools have used structured interviews, claiming to assess important non-cognitive characteristics of candidates, such as values and commitment.1 However, such interviews are biased, of limited value in predicting future performance, and therefore unfair when used as an important part of admission procedures.2 The multiple mini-interview (MMI) is a relatively new assessment which avoids the issues of the traditional interview, where much of the observed mark of the candidate relates to biases arising from limited interview content, the interviewer panel and contextual factors.3,4 Because the MMI tests a larger sample, in terms of both content and independent interviewers, than a single interview can, more reliable generalisations about a candidate’s ability can be made.

The developers of the MMI have described the validation of assessment blueprints to establish the preferred non-cognitive characteristics, such as integrity, teamwork and lifelong learning, of entry-level students.3–6 These map to notions of entry-level professionalism.3 Predictive validity of the MMI has been claimed with both clerkship assessment7 and with performance on specific parts of licensure.8 However, qualitative validity research9 suggests that interviewers and candidates perceive that MMI questions may test a candidate’s reasoning skills in areas of professionalism, rather than assess the amount of a particular non-cognitive characteristic he or she possesses. Some initial supporting evidence for this observation has come from a modest correlation3 of the MMI (r = 0.27) with ‘Reasoning in Humanities and Social Sciences’, Section 1 of the Graduate Australian Medical Schools Admission Test (GAMSAT).10 The focus of this paper concerns the extent to which a bank of MMI questions and the checklist items within them contribute in a meaningful way, without bias, to the measurement of the construct of interest, which is hypothesised as entry-level reasoning skills in professionalism.

Investigating the performance of items

Item response theory (IRT)11 provides a useful way of exploring these concerns. It refers to a suite of advanced psychometric techniques that have previously been used in the medical education literature to investigate both written and clinical assessment. Its main use in relation to clinical rating has been to establish the consistency of judgements within and between judges and candidates.12–16 Little research has analysed the properties of the items themselves and any biases to which they are subject. IRT makes some strong assumptions about the way a student responds to clinical assessment items and assumes that the student’s probability of getting a satisfactory mark depends on:

1 his or her general ability in the area being assessed;
2 the leniency or stringency of the judges of the assessment, and
3 the difficulty of the items.

Providing that the empirical data fit the IRT model, the three factors can all be placed on the same scale, called a Wright map, which is measured in a unit called a logit. This allows the non-technical observer to consider large amounts of assessment data visually. An additional strength of IRT is that student performance is estimated independently of the specific set of test items used,17 unlike in classical test theory. Furthermore, the item parameters, such as item difficulty, are independent of the student population in which the test was used.
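To make the logit scale concrete, the dichotomous form of the Rasch ‘facets’ model sketched above can be written as follows. This is a standard textbook formulation rather than an equation reproduced from the paper, with B_n the ability of candidate n, D_i the difficulty of item i and C_j the stringency of interviewer j:

```latex
% Standard many-facet Rasch formulation (illustrative; not taken from the paper)
\log \frac{P_{nij}(x_{nij} = 1)}{P_{nij}(x_{nij} = 0)} \;=\; B_n - D_i - C_j
```

In words, the log-odds of a satisfactory mark rise with candidate ability and fall with item difficulty and interviewer stringency, which is why all three facets can be placed on the single logit scale of a Wright map.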
Validating item banks

For large-scale assessment, items are generally organised into an item bank to facilitate the authoring of new items and the maintenance of existing items. This is a long-term process in which common items are anchored over repeated administrations.18 To satisfy the expectations of IRT, MMI questions and their checklist items should be conceptually equivalent: that is, they should measure the same unidimensional construct.19 A second standard is that there should be invariance of estimates of item difficulty and person ability20 across different appropriate administrations of the test. A third standard is that items should not favour candidates with particular personal or cultural characteristics or educational experience, an interaction which can be investigated with differential item functioning (DIF) analysis.21

This paper investigates the application of a suite of IRT techniques to a high-stakes selection procedure for entry into graduate medicine and dentistry programmes. Firstly, we aimed to establish conceptual equivalence, the concept in this case being entry-level reasoning skills in professionalism. Secondly, we aimed to investigate whether there were any systematic differences in outcome among equally able candidates from different subgroups (e.g. by gender or type of programme applied for) caused by DIF.

METHODS

The setting

Details of this MMI for a graduate-entry medical programme in 2006 are given elsewhere.3 In 2007, for reasons of efficiency, the Faculties of Medicine and Dentistry at the University of Sydney ran an integrated admission procedure so that MMI scores were available to both medicine and dentistry admission units. Both the medical and dental degrees are 4-year, graduate-entry, problem-based learning programmes. The faculties had previously moved to a shared curriculum model in 2001 in order to reduce duplication of content and resources in the teaching of basic and clinical sciences in the first 2 years of both curricula.22

Dental and medical applicants were offered an interview if their grade point average (GPA) and GAMSAT (or Medical College Admission Test [MCAT] for North American candidates) score were above individual faculty-determined levels. In 2006, the rank order of MMI scores determined the merit list for medicine, from which offers were made until the available places (n = 273) were exhausted. In 2007, the year of this research, ranking for the medicine and dentistry merit lists was based on MMI score and total GAMSAT (or MCAT) score at a ratio of 50% each. This reflected faculty concerns at the time about the validity of the MMI as the sole determinant of ranking.

MMI organisation

Each MMI question consisted of a short, authentic, non-clinical scenario, which contained a dilemma involving conflicting values and required the candidate to demonstrate his or her reasoning strategies (Appendix 1). The interviewer used a list of five standardised prompt questions to elicit further responses. The marking checklist consisted of five criteria to be marked on a 4-point Likert scale (4 = excellent, 3 = good, 2 = satisfactory, 1 = unsatisfactory), giving a maximum total of 20 raw marks for each MMI question. A pre-existing bank of 21 questions, which had been used for medical selection in 2006,3 was adapted so that questions were applicable to both medical and dental candidates, as were all newly developed questions.
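As a simple illustration of the scoring and ranking rules just described, the sketch below computes a raw MMI question score and one plausible 50:50 combination of MMI and GAMSAT totals. The paper does not specify how the two components were combined, so the z-score approach, the function names and all numbers here are assumptions.

```python
from statistics import mean, pstdev

def mmi_question_score(criterion_marks):
    """Raw score for one MMI question: five criteria, each marked 1-4 (max 20)."""
    assert len(criterion_marks) == 5 and all(1 <= m <= 4 for m in criterion_marks)
    return sum(criterion_marks)

def combined_rank_score(mmi_total, gamsat_total, mmi_cohort, gamsat_cohort):
    """One plausible 50:50 weighting for the 2007 merit list: standardise each
    component against the applicant cohort, then average (assumed rule)."""
    z_mmi = (mmi_total - mean(mmi_cohort)) / pstdev(mmi_cohort)
    z_gamsat = (gamsat_total - mean(gamsat_cohort)) / pstdev(gamsat_cohort)
    return 0.5 * z_mmi + 0.5 * z_gamsat

# Hypothetical candidate: eight questions, each scored out of 20 (raw maximum 160)
mmi_total = sum(mmi_question_score(marks) for marks in [[3, 3, 4, 2, 3]] * 8)
print(mmi_total)  # 120
print(round(combined_rank_score(mmi_total, 62,
                                mmi_cohort=[95, 110, 118, 124, 130],
                                gamsat_cohort=[55, 58, 62, 66, 70]), 2))
```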
The MMI circuit had eight stations, each lasting 7 minutes. Candidates were mixed as far as possible in relation to gender, whether they were applying for medicine, dentistry or both, and whether they were local or international candidates. They rotated through the MMI circuit, meeting a different single interviewer at each station. Questions from the bank were rotated between different MMI circuits. Candidates and interviewers did not know one another’s status in terms of whether their interest or background was in medicine or dentistry, and candidates did not know whether individual interviewers were faculty members or came from the community. All interviewers received written instructions and were offered a 1-hour training session to familiarise themselves with the format and the marking schema, and to practise on two simulated interview scenarios shown on DVD. Interviews were held for 3 days in Vancouver and 1 day in Singapore for international candidates, and on three parallel circuits over 5 days in Australia.

Multi-facet Rasch model

A multi-facet Rasch model (MFRM) was used in FACETS Version 3.63 (Winsteps.com, Chicago, IL, USA) to independently estimate several first-order facets and their associated error variances by simultaneously using all of the empirical data. In the interviewing plan, candidates were partially crossed with questions, with each candidate attempting eight of the total number of questions (n = 39) in the bank. Interviewers were partially crossed with questions (most administered two or more of the 39 possible questions). Candidates were also partially crossed with interviewers, with each candidate seeing eight of the total number of interviewers used.

The analysis used a logistic transformation of the observed scores of candidates into logit scores, which were adjusted for all the parameters specified in the model. The probability of a candidate receiving a particular mark in the MMI depended on his or her ability, the difficulty of the MMI questions, and the stringency or leniency of the interviewer. Maximum likelihood methods calculated an ability measure for each candidate, an estimate of judge stringency or leniency, and a difficulty measure for each MMI question. A partial credit model was used in order to treat interviewers as ‘independent experts’ free to apply ‘part marks’ from the full Likert rating scale to each checklist item.11 Standard errors were also calculated for each of these measures.

The estimated measures for facets of interest were represented graphically, using a Wright map displaying a range of −3.0 to +3.0 logits. This allows visual inspection of the functioning of the questions, particularly in terms of their range of difficulty, and shows whether they had a meaningful substantive hierarchy ranging from easy to hard. The analysis also provided statistics in the form of two chi-square ratios, infit and outfit mean squares, which indicate how well the empirical data fit the IRT model. Both fit statistics are weighted mean squared residuals (the difference between the actual and the expected value of the observation) divided by their degrees of freedom. The infit differs from the outfit in that it is weighted by the variance of the observation around its IRT expected score, making it less sensitive to outlier residual values.
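A minimal sketch of how these two statistics are computed from model residuals is shown below; the function and the example numbers are ours, for illustration only, and the values reported in this paper come from FACETS.

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Infit and outfit mean squares for one element (e.g. one MMI question).
    `observed` are the ratings, `expected` the model-expected ratings and
    `variance` the model variance of each observation about its expectation."""
    residual = np.asarray(observed, dtype=float) - np.asarray(expected)
    variance = np.asarray(variance)
    outfit = np.mean(residual ** 2 / variance)        # unweighted; sensitive to outliers
    infit = np.sum(residual ** 2) / np.sum(variance)  # information (variance) weighted
    return float(infit), float(outfit)

# Invented ratings, expectations and variances for a handful of observations.
obs = [3, 2, 4, 1, 3, 2]
exp = [2.2, 2.9, 3.1, 1.8, 2.3, 2.8]
var = [0.8, 0.7, 0.6, 0.7, 0.8, 0.7]
print(infit_outfit(obs, exp, var))  # roughly (0.94, 0.96): close to the ideal value of 1
```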
Both of these fit statistics give an indication of how well the empirical data meet the expectations of the IRT model and thus how well the set of questions measures the construct. They have an expected value of +1 when the data fit the model. Standardised forms of both fit statistics (z) take account of the sample size.23 Both types of statistics have been reported in a medical education context13,14,16 with a focus on infit. A mean-square value < 1 indicates too little variation in the ratings (i.e. the scoring is too predictable), whereas a value > 1 indicates too much variation in the ratings. This was a high-stakes assessment and thus was similar to a clinical rating situation, so we set lower and upper control limits at 0.5 and 1.7 to indicate model fit.23

Differential item functioning

Assessment items exhibit DIF if the item scores of equally able interviewees from different groups (e.g. groups defined by gender, prior degree or type of programme applied for) are systematically different.24 These population subgroups can be compared simultaneously using ConQuest, Version 2.0,25 by producing expected score curves (ESCs) with a partial credit model.26 Each theoretical ESC can be used as a benchmark against which the groups of interest are compared.25 Differential item functioning is said to exist if the empirical ESCs show a systematic difference between groups across the candidate ability scale, suggesting a secondary effect is operating.27,28 Small variations in ESCs are tolerable because of measurement error, but in general ESCs should not be separated. A criterion that each score point on the empirical ESCs should represent the results from at least 20 interviews was set for this cohort to avoid premature classification of DIF based upon unstable ESC loci.

For sufficiently represented MMI questions and candidate characteristic combinations, ESCs were produced and visually inspected. Differential item functioning parameters were then estimated and the initially flagged items were evaluated for both substantive and statistical significance. One rule of thumb for declaring a difference between groups as being substantive is to use half a logit. Unsystematic parameter shifts of < 0.5 logit would have minimal impact on the accuracy of test scores.20 However, systematic parameter shifts across a number of items could impact test scores and would therefore be cause for concern.29 Statistical significance is demonstrated by showing that the DIF parameter for a particular item is not equal to zero. This can be done by dividing the DIF parameter estimate by twice its measurement error. Results exceeding a magnitude of 1 indicate statistical significance at the 5% level.29 Items exhibiting substantively and statistically significant DIF were flagged for inspection by subject matter experts to evaluate whether the DIF might be related to the construct, with irrelevance being interpreted as bias. Experts were also asked if they could identify specific characteristics of the item content that might cause the bias, and whether the item could be suitably amended rather than discarded.30

RESULTS

Each candidate answered eight MMI questions, each with five checklist items. A four-facet Rasch model using a partial credit model established item characteristics on all of the 2007 data, consisting of 207 interviewers, 686 candidates, a bank of 39 MMI questions each with five checklist items (giving 195 items), and 27 440 candidate marks.
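Before turning to the estimates, note that the DIF criteria set out above reduce to two simple numerical checks. The helper below is ours and uses invented numbers, not ConQuest output.

```python
def dif_flags(dif_estimate, standard_error, substantive_cutoff=0.5):
    """DIF decision rules described above: statistically significant (roughly the
    5% level) if |estimate| exceeds twice its measurement error; substantively
    important if the parameter shift reaches half a logit. In addition, each
    empirical score point must represent at least 20 interviews before any
    conclusion is drawn."""
    statistically_significant = abs(dif_estimate) > 2 * standard_error
    substantively_significant = abs(dif_estimate) >= substantive_cutoff
    return statistically_significant, substantively_significant

# A hypothetical DIF parameter of 0.30 logit with a 0.20 logit measurement error
# fails both tests, so the item would not be treated as exhibiting meaningful DIF.
print(dif_flags(0.30, 0.20))  # (False, False)
```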
FACETS confirmed that there was sufficient linkage within the interviewing plan, so that all of the data could be used simultaneously. Consequently, all facets were placed onto a Wright map, where mean interviewer stringency and question difficulty were anchored by the measurement model at 0.00 logit, and candidate ability was allowed to float. The first column (Fig. 1) indicates the logit scale. The second column shows interviewers’ level of stringency, the third column the candidates’ level of ability on the MMI, and the fourth column the difficulty of the MMI questions. Reading the ruler from bottom to top shows increasing interviewer stringency, increasing candidate ability and increasing question difficulty.

Figure 1 A Wright map showing ratings of candidates in multiple mini-interviews (MMIs) in 2007, combined with interviewers and MMI questions, transformed onto a logit scale.

An overall summary of the interviewer, candidate and question statistics is given in Table 1. Standard deviations (SDs) by facet were: interviewer stringency or leniency, 0.52; candidate ability, 0.75, and MMI question difficulty, 0.27. The spread of candidates was 1.44 times that of the interviewers and therefore the variance was 2.08 times that of the interviewers (variance equals the square of the SD).

Questions

The questions ranged in difficulty from −0.55 to +0.54 logit (Fig. 1). A reliability of separation index, which shows how well they were separated, was 0.92 (Table 1). This was statistically significant (χ2 = 842.3, d.f. = 38, P < 0.001), indicating that the questions were meaningfully separated according to level of difficulty from easy to hard with a high degree of confidence. Question statistics are summarised in Table 2. Question 27 (−0.55 logit) was the easiest question and question 29 was the hardest (0.54 logit). Standard errors of question measures (Table 2) ranged from 0.03 to 0.21 logit. Question 27, with the largest error, was the least used (11 times). However, 39.2% (n = 269) of candidates had ability > 0.54 logit and thus were not tested effectively by any of the MMI questions (Fig. 1).

The overall infit mean for questions (Table 2) was 1.03 (SD = 0.19, range 0.63–1.27). The overall outfit mean was 1.03 (SD = 0.12, range 0.67–1.26). All the questions were well within the predetermined range of 0.5–1.7. The good fit of questions to the unidimensional measurement model provided some empirical support that the questions could be considered conceptually equivalent.23

Differential item functioning

Differential item functioning was used to investigate whether checklist items within each MMI question discriminated against students of equal ability according to the categorical variables described in Table 3. Figure 2 shows the type of ESC plot typical of an item (item 91) with indications of uniform DIF. The horizontal axis represents the overall performance of a candidate on the MMI. The vertical axis represents the score that a candidate of a given ability would (on average) be expected to attain. The scoring rubric of 1–4 used by interviewers is recoded to a 0–3 set of scores here, by convention, for IRT. The solid curve represents the modelled ESC.
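In code, the kind of comparison plotted in Figure 2 can be approximated as follows: the model curve is the probability-weighted mean score at each ability level, and each subgroup’s empirical curve is its mean observed score within ability bins. This is a schematic sketch with invented thresholds, not the ConQuest procedure itself.

```python
import numpy as np

def model_expected_score(theta, thresholds):
    """Model expected score (0-3 recoded scale) for one partial credit item
    at candidate ability `theta` (logits)."""
    steps = theta - np.asarray(thresholds)
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(steps))))
    probs = numerators / numerators.sum()
    return float(np.dot(np.arange(probs.size), probs))

def empirical_curve(abilities, scores, bin_edges):
    """Mean observed item score per ability bin for one candidate subgroup."""
    abilities = np.asarray(abilities)
    scores = np.asarray(scores, dtype=float)
    bins = np.digitize(abilities, bin_edges)
    return [scores[bins == b].mean() if np.any(bins == b) else np.nan
            for b in range(1, len(bin_edges))]

# Invented thresholds for one checklist item and a grid of abilities in logits.
thresholds = [-1.0, 0.1, 1.1]
grid = np.linspace(-3.0, 3.0, 25)
model_curve = [model_expected_score(t, thresholds) for t in grid]

# For DIF screening, model_curve would be plotted against empirical_curve()
# for each subgroup (e.g. male vs female), with every plotted point required
# to represent at least 20 interviews; systematic separation of the subgroup
# curves across the ability range flags the item for closer evaluation.
```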
The empirical score curves for the two comparison groups (male and female) are not coincident along the theoretical ESC. This kind of result is indicative of an item that discriminates between applicants on the basis of gender, and this in general is undesirable. Of the 195 checklist items to which partial credit scores were applied, only item 91 exhibited visually detectable DIF for candidate variables (location and gender) and represented results from at least 20 interviews. This warranted an evaluation of the DIF parameters for item 91, using a DIF by location item response model and a DIF by gender item response model. Application of the DIF by location item response model was postponed owing to missing data across a number of the item score categories and the candidate MMI location categories.

Figure 2 Uniform differential item functioning trend for item 91 by gender.

Table 1 Means, standard deviations (SDs) and standard errors (SEs), separation reliability, significances and degrees of freedom of parameters

                         Infit mean square   SE of the mean   SD     Separation reliability   Chi-square   Chi-square significance   Degrees of freedom
Interviewer stringency   1.05                0.05             0.52   0.91                     3392.5       < 0.001                   206
Candidate ability        1.05                0.02             0.75   0.91                     6959.2       < 0.001                   685
Question difficulty      1.03                0.04             0.27   0.92                     842.1        < 0.001                   38

Table 2 Item statistics for ratings of multiple mini-interview (MMI) candidates for medicine and dentistry programmes using an MMI bank of questions (n = 39), in order of question difficulty (hardest to easiest)

Question   Checklist    MMI question          Standard   Infit mean   Infit z   Outfit mean   Outfit z
number     item count   difficulty (logits)   error      squares      score     squares       score
29         310          0.54                  0.08       1.15         1.99      1.11          1.42
40         915          0.45                  0.05       1.04         0.87      1.03          0.63
26         785          0.44                  0.05       1.27         5.43      1.24          4.70
16         490          0.42                  0.06       1.08         1.34      1.20          2.89
14         420          0.39                  0.06       1.16         2.48      1.14          2.07
30         725          0.39                  0.05       0.97         −0.53     1.00          0.11
21         105          0.27                  0.13       1.00         0.04      1.00          0.01
5          2190         0.26                  0.03       1.06         2.23      1.07          2.35
1          215          0.23                  0.10       1.11         1.23      1.06          0.74
22         105          0.22                  0.15       0.89         −0.86     0.94          −0.38
31         780          0.17                  0.05       0.96         −0.94     0.96          −0.89
17         2075         0.13                  0.03       0.99         −0.18     1.00          −0.08
25         265          0.12                  0.09       0.92         −0.90     0.92          −0.87
47         435          0.12                  0.07       1.08         1.24      1.10          1.54
2          815          0.10                  0.05       1.06         1.22      1.09          1.78
24         1985         0.10                  0.03       1.01         0.41      1.04          1.19
12         255          0.09                  0.09       1.08         1.06      1.11          1.33
11         960          0.08                  0.05       0.98         −0.45     1.01          0.22
51         460          0.07                  0.07       1.01         0.14      1.01          0.24
10         500          −0.04                 0.07       1.00         0.08      1.02          0.30
15         105          −0.04                 0.13       0.91         −0.69     0.95          −0.37
23         915          −0.08                 0.05       1.05         1.01      1.08          1.56
6          210          −0.10                 0.10       1.06         0.59      0.96          −0.26
19         3320         −0.10                 0.03       1.16         6.42      1.15          5.99
28         255          −0.10                 0.09       0.90         −1.22     0.88          −1.30
43         220          −0.10                 0.10       1.24         2.71      1.26          2.80
32         200          −0.12                 0.11       1.09         1.03      1.11          1.16
13         730          −0.14                 0.05       1.12         2.42      1.08          1.52
39         765          −0.14                 0.05       1.06         1.28      1.01          0.14
18         210          −0.16                 0.11       0.90         −1.07     0.85          −1.46
8          1200         −0.20                 0.04       1.04         1.03      1.04          0.85
20         540          −0.23                 0.06       1.02         0.40      0.98          −0.32
3          1240         −0.31                 0.04       1.01         0.18      1.02          0.58
9          355          −0.36                 0.07       1.02         0.27      1.00          0.03
52         245          −0.38                 0.09       0.85         −1.89     0.87          −1.60
41         495          −0.47                 0.07       1.19         2.88      1.26          3.79
4          1365         −0.48                 0.04       0.90         −2.60     0.92          −2.07
7          225          −0.50                 0.10       1.09         1.01      1.19          1.97
27         55           −0.55                 0.21       0.63         −2.22     0.67          −1.68

Table 3 Characteristic variable categories and numbers for multiple mini-interview (MMI) candidates

Candidate characteristic   Categories           Number of candidates   Percentage
Candidate gender           Male                 352                    51.3
                           Female               334                    48.7
Candidate degree level     Bachelor             578                    84.3
                           Honours or Grad Dip  101                    14.7
                           Masters              1                      0.1
                           PhD                  6                      0.9
Candidate type             Local                513                    74.8
                           International        173                    25.2
Candidate programme type   Medicine             448                    65.3
                           Dentistry            173                    25.2
                           Both                 65                     9.5
Candidate MMI location     Sydney               553                    80.6
                           Vancouver            116                    16.9
                           Singapore            17                     2.5
Despite the visually apparent DIF and a relatively large amount of interview data, statistical and substantive significance of the DIF findings were not upheld for the item by gender interaction. The DIF parameter magnitude was 0.168 logit, with a measurement error of 0.096 logit. This DIF parameter value did not exceed twice the magnitude of the measurement error, so the DIF effect was not statistically significant at the 5% level. Furthermore, this figure was considerably lower than the acceptable 0.5 logit criterion.23 Item 91 was identified for qualitative evaluation despite these results, for the purpose of carrying through the final important steps of DIF evaluation.

Inspection of the items by content experts revealed that item 91 asked a question about how the candidate might care for an elderly relative with a terminal illness. The knowledge and skill underpinning this item were considered to be about ethical understanding and demonstrating integrity. Although actual responses, such as those pertaining to personal involvement in the care of elderly relatives at home, might be gender- and culture-specific, reasoning processes should not be.

DISCUSSION

These results provide further validation evidence for the use of a structured MMI question bank in the context of a high-stakes, international selection procedure, where the purpose was to determine whether candidates with a good academic record should enter a graduate-entry medical or dental programme. Some of the conditions for a quality-assured question bank have been established.

Firstly, conceptual equivalence has been supported, as infit and outfit statistics have indicated how well the MMI questions fit the assumptions of the IRT model23 and contribute to the measurement of the underlying unidimensional construct. We have named this construct ‘entry-level reasoning skills in professionalism’, meaning that candidates were reasoning on issues connected with professionalism at the level expected of beginning students, with a focus on problem solving and critical thinking. Questions displayed a meaningful hierarchy from easy to hard. However, the most difficult MMI question (question 29) did not effectively test just over a quarter of candidates, those with the greatest ability.

Secondly, we demonstrated a lack of significant DIF. Very few items discriminated against candidates by virtue of a range of background factors. This is an important point to establish in contemporary admissions settings where candidates of various personal, cultural and academic backgrounds are assessed by common means.

Implications of findings

This research provides further evidence3,9 that the MMI does assess cognitive reasoning skills.
This stance represents a significant conceptual change in our understanding of the validity of the MMI, as it was previously thought to measure non-cognitive skills.4 A cognitive measurement construct would sit comfortably with other findings from the MMI literature. For example, context-specificity of MMI questions3,4 would mean that candidates’ reasoning skills on one question do not predict reasoning skills on another question. The shared variance between MMI candidate scores3 and scores on the unidimensional construct ‘Reasoning in Humanities and Social Sciences’ suggests that this subsection of GAMSAT and the MMI do, in part, measure a similar reasoning construct. Correlations of the MMI with measures of clinical competence4,7,8 might be partly explained by shared variance between reasoning in aspects of professionalism and clinical reasoning. An MMI that was found to assess cognitive skills would have a major impact on the way that admission tools should be combined in the future. This is an area that demands more research.

The real benefit of using IRT lies in its presentation of item statistics, which yield enormous potential for further development of the question bank. For example, MMI questions will need to be made more difficult if they are to test the ability of the most able candidates. Only item 91 was considered to show some indications of DIF. On evaluation of the relationship between the measurement construct and candidate integrity as perceived from responses to an item on caring for one’s own elderly relatives, it was decided that the differential performance may well be a result of interviewer bias,31 suggesting that interviewers should undertake further training around cultural and gender-associated issues.

Strengths and weaknesses

This is the first reported application of a suite of IRT methods in a high-stakes MMI involving a large and diverse group of candidates for both medical and dental programmes in three different continents. There are limitations to the study. The interviewing plan in this naturalistic setting was constrained by what was logistically possible. As is often the case in such settings, there was no fully formalised design that assigned specific interviewers or a specific set of questions to each MMI circuit. Residual-based fit statistics are not always sensitive to multi-dimensionality32 and future research will need to investigate our findings with additional advanced psychometric techniques. For many items it was difficult to detect DIF for candidate background variables with more than two categories as a result of small numbers of candidates per ability grouping per item. This difficulty will subside with the accretion of data as more MMI administrations are carried out. The use of DIF in this study is therefore limited to an initial evaluation of item validity rather than an evaluation of test score validity (across subgroups). Given that high-stakes decisions about individuals are based on test scores, this will be investigated as more data become available. The format of MMIs varies among medical schools and these findings are generalisable only to those MMIs which use the standardised structured approach.3

CONCLUSIONS

The MMI question bank contains items that have been studied carefully in terms of conceptual equivalence and DIF. This research adds to the evidence that MMI questions do measure a cognitive unidimensional construct, namely ‘entry-level reasoning skills in professionalism’.
The lack of items exhibiting DIF is encouraging in contemporary, high-stakes admission settings where applicants of various personal, cultural and academic backgrounds are assessed by common means. The potential of this approach is to provide MMI question developers with a quality control procedure that extends down to the checklist item level.

Contributors: all authors made substantial contributions to the study conception and design, and to the acquisition, analysis or interpretation of data. All authors contributed to the writing of the article and approved the final manuscript.
Acknowledgements: we thank Mike Linacre for his invaluable comments on the application of the multi-facet Rasch model with the partial credit model to the data. We thank the members of the Faculty of Medicine, and Evelyn Howe and her team from the Faculty of Dentistry, University of Sydney, for their hard work.
Funding: none.
Conflicts of interest: none.
Ethical approval: this study was approved by the University of Sydney Human Research Ethics Committee (ref. 08-2006/9392).

REFERENCES

1 Norman GR. Editorial: the morality of medical school admissions. Adv Health Sci Educ 2004;9:79–82.
2 Kreiter CD, Yin P, Solow C, Brennan RL. Investigating the reliability of the medical school admissions interview. Adv Health Sci Educ 2004;9:147–59.
3 Roberts C, Walton M, Rothnie I, Crossley J, Kumar K, Lyon P, Tiller D. Factors affecting the utility of the multiple mini-interview in selecting candidates for graduate-entry medical school. Med Educ 2008;42:396–404.
4 Eva K, Rosenfeld J, Reiter HI, Norman GR. An admissions OSCE: the multiple mini-interview. Med Educ 2004;38:314–26.
5 Harris S, Owen C. Discerning quality: using the multiple mini-interview in student selection for the Australian National University Medical School. Med Educ 2007;41:234–41.
6 Brownell K, Collin T, Lemay JF. Introduction of the multiple mini-interview into the admissions process at the University of Calgary: acceptability and feasibility. Med Teach 2007;29:395–6.
7 Eva KW, Reiter H, Rosenfeld J, Norman GR. The ability of the multiple mini-interview to predict preclerkship performance in medical school. Acad Med 2004;79 (Suppl):40–2.
8 Reiter HI, Eva KW, Rosenfeld J, Norman GR. Multiple mini-interviews predict clerkship and licensing examination performance. Med Educ 2007;41:378–84.
9 Kumar K, Roberts C, Rothnie I, du Fresne C, Walton M. Experiences of the multiple mini-interview: a qualitative analysis. Med Educ 2009;43:360–7.
10 Australian Council for Educational Research. Graduate Australian Medical School Admission Guide 2006. Melbourne, Vic: ACER 2006.
11 Linacre JM. Many-Facet Rasch Measurement. Chicago, IL: MESA Press 1994;1–15.
12 Downing S. Threats to the validity of clinical teaching assessments: what about rater error? Med Educ 2005;39:353–5.
13 McManus I, Thompson M, Mollon J. Assessment of examiner leniency and stringency (‘hawk-dove effect’) in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ 2006;6:1272–94.
14 Harasym P, Woloschuk W, Cunning L. Undesired variance due to examiner stringency/leniency effect in communication skill scores assessed in OSCEs. Adv Health Sci Educ 2008;13:617–32.
15 Griffin P, Zoanetti N. Moderation of Interviewer Judgement in Tests of Spoken English. 40th Annual TESOL (Teachers of English to Speakers of Other Languages) Convention and Exhibit, Tampa, FL, 15–18 March 2006;1–15.
16 Iramaneerat C, Myford C. Rater Effects in Clinical Performance Ratings of Surgery Residents. Proceedings of the American Educational Research Association Conference, San Francisco, CA, 8–12 April 2006.
17 Lord FM. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates 1980.
18 Vale CD. Computerized item banking. In: Downing SM, Haladyna TM, eds. Handbook of Test Development. New York, NY: Routledge 2006;261–85.
19 Teresi JA. Overview of quantitative measurement methods. Equivalence, invariance, and differential item functioning in health applications. Med Care 2006;44 (Suppl):39–49.
20 Wright BD, Douglas GA. Best Test Design and Self-tailored Testing. Research Memorandum No. 19, Statistical Laboratory. Chicago, IL: University of Chicago, Department of Education 1975.
21 Holland PW, Thayer DT. Differential Item Performance and the Mantel–Haenszel Procedure (Technical Report No. 86–69). Princeton, NJ: Educational Testing Service 1986.
22 Klineberg I, Massey W, Thomas M, Cockrell D. A new era of dental education at the University of Sydney, Australia. Aust Dent J 2002;47:194–201.
23 Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Mahwah, NJ: Lawrence Erlbaum Associates 2001;239–49.
24 Kelderman H, Macready GB. The use of loglinear models for assessing differential item functioning across manifest and latent examinee groups. J Educ Meas 1990;27:307–27.
25 Australian Council for Educational Research. ConQuest Version 2: Generalised Item Response Modelling Software. Camberwell, Vic: ACER 2007.
26 Masters GN. A Rasch model for partial credit scoring. Psychometrika 1982;47:149–74.
27 Camilli G, Shepard LA. Methods for Identifying Biased Test Items. Thousand Oaks, CA: Sage Publications 1994;7–21.
28 Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates 2000;249–72.
29 Draba RE. The Identification and Interpretation of Item Bias. MESA Memorandum No. 25. Chicago, IL: MESA Press 1977. Available at: http://www.rasch.org/memo25.htm.
30 Langer MM, Hill CD, Thissen D, Burwinkle TM, Varni JW, DeWalt DA. Item response theory detected differential item functioning between healthy and ill children in quality-of-life measures. J Clin Epidemiol 2008;61:268–76.
31 Mehrens WA. Validating licensing and certification test score interpretations and decisions: a response. Appl Meas Educ 1997;10:97–104.
32 Tennant A, Pallant JF. Unidimensionality matters! (A tale of two Smiths?). Rasch Meas Trans 2006;20 (1):1048–51.

Received 4 June 2008; editorial comments to authors 24 October 2008, 10 November 2008; accepted for publication 4 December 2008

APPENDIX 1

Sample multiple mini-interview question

Imagine you are the principal of a large, respected school. There has been an allegation that a humiliating film of a young disabled person has been circulating on the Internet. Two final year students are up before you to explain their actions in the creation of the video. The video appears to show a young person with intellectual impairment being verbally abused by one of the students whilst a group of senior students look on laughing. What are the issues that you, as the principal, are likely to consider both before and at a disciplinary hearing?

Aim of the question

It aims to get candidates thinking about discrimination issues.
It challenges them to think about the importance of privacy and the ways in which privacy can be violated. It aims to explore the consequences of one’s actions on others. However, there is also the very real possibility that this was a genuine film made collaboratively to educate young people about discrimination against disabled people, which has been taken out of context.

Prompt questions

1 How would you go about establishing the facts of the case?
2 What might be the impact of this case for the disabled young person?
3 What underlying reasons would you give for recommending the two final year students are suspended from school?
4 What might your reaction be if the incident turned out to be a misunderstanding, such as if the senior students and the disabled young person had made a film about discrimination against disabled people?
5 How could you use this incident to raise awareness around disability discrimination in the school?

List of skills and behaviours

(Marked on a scale of 1–4, where 4 = excellent, 3 = good, 2 = satisfactory, 1 = unsatisfactory.)

1 Has a sense of establishing the facts to ensure fairness
2 Demonstrates an awareness of the situation from a range of perspectives
3 Able to justify how he or she would balance conflicting interests
4 Appreciates the need for students to consider the consequences of personal behaviours
5 Is able to draw lessons from the experience to inform future learning

Total mark: out of 20