multiple mini interviews

Validating a multiple mini-interview question bank assessing entry-level reasoning skills in candidates for graduate-entry medicine and dentistry programmes

Chris Roberts,1 Nathan Zoanetti2 & Imogene Rothnie3
CONTEXT The multiple mini-interview (MMI)
was initially designed to test non-cognitive
characteristics related to professionalism in
entry-level students. However, it may be testing
cognitive reasoning skills. Candidates to medical and dental schools come from diverse
backgrounds and it is important for the validity
and fairness of the MMI that these background
factors do not impact on their scores.
METHODS A suite of advanced psychometric
techniques drawn from item response theory
(IRT) was used to validate an MMI question
bank in order to establish the conceptual
equivalence of the questions. Bias against candidate subgroups of equal ability was investigated using differential item functioning (DIF)
analysis.
RESULTS All 39 questions had a good fit to the
IRT model. Of the 195 checklist items, none
were found to have significant DIF after visual
inspection of expected score curves, consideration of the number of applicants per category,
and evaluation of the magnitude of the DIF
parameter estimates.
CONCLUSIONS The question bank contains
items that have been studied carefully in terms
of model fit and DIF. Questions appear to
measure a cognitive unidimensional construct,
‘entry-level reasoning skills in professionalism’,
as suggested by goodness-of-fit statistics. The
lack of items exhibiting DIF is encouraging in a
contemporary high-stakes admission setting
where candidates of diverse personal, cultural
and academic backgrounds are assessed by
common means. This IRT approach has
potential to provide assessment designers
with a quality control procedure that extends
to the level of checklist items.
Medical Education 2009: 43: 350–359
doi:10.1111/j.1365-2923.2009.03292.x
1 Office of Postgraduate Medical Education, University of Sydney, Sydney, New South Wales, Australia
2 Assessment Research Centre, Faculty of Education, University of Melbourne, Melbourne, Victoria, Australia
3 Office of Medical Education, Faculty of Medicine, University of Sydney, Sydney, New South Wales, Australia
Correspondence: Chris Roberts, Office of Postgraduate Medical
Education (OPME), Faculty of Medicine, Mackie Building (K01),
University of Sydney, Sydney, New South Wales 2006, Australia.
Tel: 00 61 2 9036 9453; Fax: 00 61 2 9351 6646;
E-mail: [email protected]
INTRODUCTION
Selection procedures are arguably the most high-stakes, stressful, contentious and resource-intensive of
all medical and dental school assessments. In
combination with measures of previous academic
achievement, many schools have used structured
interviews, claiming to assess important non-cognitive
characteristics of candidates, such as values and
commitment.1 However, such interviews are biased,
of limited value in predicting future performance,
and therefore unfair when used as an important part
of admission procedures.2 The multiple mini-interview (MMI) is a relatively new assessment which
avoids the issues of the traditional interview where
much of the observed mark of the candidate relates
to biases arising from limited interview content, the
interviewer panel and contextual factors.3,4 Because
the MMI tests a larger sample in terms of both
content and independent interviewers than a single
interview can, more reliable generalisations about a
candidate’s ability can be made. The developers of
the MMI have described the validation of assessment
blueprints to establish the preferred non-cognitive
characteristics, such as integrity, teamwork and lifelong learning, of entry-level students.3–6 These map
to notions of entry-level professionalism.3 Predictive
validity of the MMI has been claimed with both
clerkship assessment7 and with performance on
specific parts of licensure.8 However, qualitative
validity research9 suggests that interviewers and
candidates perceive that MMI questions may test a
candidate’s reasoning skills in areas of professionalism, rather than assess the amount of a particular
non-cognitive characteristic he or she possesses.
Some initial supporting evidence for this observation
has come from a modest correlation3 of the MMI
(r = 0.27) with ‘Reasoning in Humanities and Social
Sciences’, Section 1 of the Graduate Australian
Medical Schools Admission Test (GAMSAT).10 The
focus of this paper concerns the extent to which a
bank of MMI questions and the checklist items within
them contribute in a meaningful way, without bias, to
the measurement of the construct of interest, which is
hypothesised as entry-level reasoning skills in
professionalism.
Investigating the performance of items
Item response theory (IRT)11 provides a useful way of
exploring these concerns. It refers to a suite of
advanced psychometric techniques that have previously been used in the medical education literature to
investigate both written and clinical assessment. Its
main use in relation to clinical rating has been to
establish the consistency of judgements within and
between judges and candidates.12–16 Little research
has analysed the properties of the items themselves
and any biases to which they are subject. IRT makes
some strong assumptions about the way a student
responds to clinical assessment items and assumes
that the student’s probability of getting a satisfactory
mark depends on:
1 his or her general ability in the area being assessed;
2 the leniency or stringency of the judges of the assessment, and
3 the difficulty of the items.
Providing that the empirical data fit the IRT model,
the three factors can all be placed on the same scale,
called a Wright map, which is measured in a unit
called a logit. This allows the non-technical observer
to consider large amounts of assessment data visually.
An additional strength of IRT is that student performance is estimated independently of the specific set
of test items used,17 unlike in classical test theory.
Furthermore, the item parameters, such as item
difficulty, are independent of the student population
in which the test was used.
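For illustration, the simplest (dichotomous) member of this family of models can be written as follows; the notation is ours and is intended only as a sketch of the general idea, whereas the model actually fitted in this study is the partial credit extension described in the Methods.

```latex
% Schematic dichotomous form of the relationship described above (notation ours):
% B_n = ability of candidate n, C_j = stringency of judge j, D_i = difficulty of item i.
% All three parameters are expressed in logits on a common scale.
\log\frac{P(X_{nij}=1)}{P(X_{nij}=0)} = B_n - C_j - D_i
```

On this scale, a gap of 1 logit between a candidate's ability and the combined stringency and difficulty terms corresponds to odds of success of roughly 2.7 to 1, since exp(1) ≈ 2.72.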
Validating item banks
For large-scale assessment, items are generally
organised into an item bank to facilitate the authoring of new items and the maintenance of existing
items. This is a long-term process in which common
items are anchored over repeated administrations.18
To satisfy the expectations of IRT, MMI questions and
their checklist items should be conceptually equivalent: that is, they should measure the same unidimensional construct.19 A second standard is that
there should be invariance of estimates of item
difficulty and person ability20 across different appropriate administrations of the test. A third standard is
that items should not favour candidates with
particular personal or cultural characteristics or
educational experience, an interaction which can
be investigated with differential item functioning
(DIF) analysis.21
This paper investigates the application of a suite of
IRT techniques to a high-stakes selection procedure
for entry into graduate medicine and dentistry
programmes. Firstly, we aimed to establish conceptual equivalence, the construct in this case being entry-level reasoning skills in professionalism.
Secondly, we aimed to investigate whether there were any systematic differences in outcome among equally able candidates from different subgroups (e.g. by gender or type of programme applied for), that is, whether any items exhibited DIF.
METHODS
The setting
Details of this MMI for a graduate-entry medical
programme in 2006 are given elsewhere.3 In 2007, for
reasons of efficiency, the Faculties of Medicine and
Dentistry at the University of Sydney ran an integrated admission procedure so that MMI scores were
available to both medicine and dentistry admission
units. Both the medical and dental degrees are 4-year,
graduate-entry, problem-based learning programmes.
The faculties had previously moved to a shared
curriculum model in 2001 in order to reduce duplication of content and resources in the teaching of
basic and clinical sciences in the first 2 years of both
curricula.22 Dental and medical applicants were
offered an interview if their grade point average
(GPA) and GAMSAT (or Medical College Admission
Test [MCAT] for North American candidates) score
were above individual faculty-determined levels. In
2006, the rank order of MMI scores determined the
merit list for medicine, from which offers were made
until the available places (n = 273) were exhausted.
In 2007, the year of this research, ranking for the
medicine and dentistry merit lists was based on MMI
score and total GAMSAT (or MCAT) score at a ratio
of 50% each. This reflected faculty concerns at the
time about the validity of the MMI as the sole
determinant of ranking.
MMI organisation
Each MMI question consisted of a short, authentic,
non-clinical scenario, which contained a dilemma
involving conflicting values and required the candidate to demonstrate his or her reasoning strategies
(Appendix 1). The interviewer used a list of five
standardised prompt questions to elicit further
responses. The marking checklist consisted of five
criteria to be marked on a 4-point Likert scale
(4 = excellent, 3 = good, 2 = satisfactory, 1 = unsatisfactory), giving a maximum total of 20 raw marks for
each MMI question.
A pre-existing bank of 21 questions, which had been
used for medical selection in 2006,3 was adapted so
that questions were applicable to both medical and
dental candidates, as were all newly developed
questions.
The MMI circuit had eight stations, each lasting
7 minutes. Candidates were mixed as far as possible
in relation to gender, whether they were applying
for medicine, dentistry or both, and whether they
were local or international candidates. They rotated
through the MMI circuit, meeting a different single
interviewer at each station. Questions from the
bank were rotated between different MMI circuits.
Candidates and interviewers did not know one
another’s status in terms of whether their interest
or background was in medicine or dentistry, and
candidates did not know whether individual
interviewers were faculty members or came from
the community.
All interviewers received written instructions and
were offered a 1-hour training session to familiarise
themselves with the format and the marking schema,
and to practise on two simulated interview scenarios
shown on DVD. Interviews were held for 3 days in
Vancouver and 1 day in Singapore for international
candidates and on three parallel circuits over 5 days
in Australia.
Multi-facet Rasch model
A multi-facet Rasch model (MFRM) was used in
FACETS Version 3.63 (Winsteps.com, Chicago, IL,
USA) to independently estimate several first order
facets and their associated error variances by simultaneously using all of the empirical data. In the
interviewing plan, candidates were partially crossed
with questions, with each candidate attempting eight
of the total number of questions (n = 39) in the bank.
Interviewers were partially crossed with questions
(most manned two or more of 39 possible questions).
Candidates were also partially crossed with interviewers, with each candidate seeing eight of the total
number of interviewers used.
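Because no candidate meets every interviewer or attempts every question, such a partially crossed plan must be 'linked': the observations have to join all candidates, interviewers and questions into a single connected network so that every parameter can be placed in one frame of reference. The following minimal sketch, using hypothetical identifiers rather than study data, illustrates the idea of a linkage check; it is not the procedure implemented in FACETS.

```python
# Minimal sketch (not the FACETS algorithm): checking whether a partially crossed
# interviewing plan is 'linked', i.e. whether all candidates, interviewers and
# questions are joined into one connected network by shared observations.
def is_linked(observations):
    """observations: iterable of (candidate_id, interviewer_id, question_id) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for candidate, interviewer, question in observations:
        union(("cand", candidate), ("int", interviewer))
        union(("cand", candidate), ("q", question))

    roots = {find(node) for node in parent}
    return len(roots) <= 1  # a single connected component means sufficient linkage

# Hypothetical toy plan: the two candidates share interviewer 'I1', linking the network.
plan = [("C1", "I1", "Q1"), ("C1", "I2", "Q2"), ("C2", "I1", "Q3")]
print(is_linked(plan))  # True
```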
The analysis used a logistic transformation of the
observed scores of candidates into logit scores, which
were adjusted for all the parameters specified in the
model. The probability of a candidate receiving a
particular mark in the MMI depended on his or her
ability, the difficulty of the MMI questions, and the
stringency or leniency of the interviewer. Maximum
likelihood methods calculated an ability measure for
each candidate, an estimate of judge stringency or
leniency, and a difficulty measure for each MMI
question. A partial credit model was used in order to
treat interviewers as ‘independent experts’ free to
apply ‘part marks’ from the full Likert rating scale to
each checklist item.11
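Written out, this combines Linacre's many-facet formulation with Masters' partial credit structure; the notation below is ours.

```latex
% Many-facet Rasch model in its rating-scale form (notation ours):
% P_nijk = probability that interviewer j awards candidate n category k
%          (rather than k-1) on checklist item i.
% B_n = candidate ability, D_i = item difficulty, C_j = interviewer stringency,
% F_k  = difficulty of the step from category k-1 to k.
% In a partial credit parameterisation the F_k are not shared but are estimated
% separately for each element of a chosen facet (e.g. each item or, to treat
% interviewers as independent experts, each interviewer).
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```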
Standard errors were also calculated for each of these
measures. The estimated measures for facets of
interest were represented graphically, using a Wright
map displaying a range of -3.0 to +3.0 logits. This
allows visual inspection of the functioning of the
questions, particularly in terms of their range of
difficulty, and shows whether they had a meaningful
substantive hierarchy ranging from easy to hard. The
analysis also provided statistics in the form of two
chi-square ratios, infit and outfit mean squares, which
indicate how well the empirical data fit the IRT
model. Both fit statistics are weighted mean squared
residuals (the difference between the actual and the
expected value of the observation) divided by their
degrees of freedom. The infit differs from the outfit
in that it is weighted by the variance of the observation around its IRT expected score, making it less
sensitive to outlier residual values. Both of these fit
statistics give an indication of how well the empirical
data meet the expectations of the IRT model and
thus how well the set of questions measures the
construct. They have an expected value of + 1 when
the data fit the model. Standardised forms of both fit
statistics (z) take account of the sample size.23 Both
types of statistics have been reported in a medical
education context13,14,16 with a focus on infit.
A mean-square value < 1 indicates too little variation
in the ratings (i.e. the scoring is too predictable),
whereas a value > 1 indicates too much variation in
the ratings. This was a high-stakes assessment and
thus was similar to a clinical rating situation, so we set
lower and upper control limits at 0.5 and 1.7 to
indicate model fit.23
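As a minimal sketch of how these statistics are conventionally computed (this is not the FACETS implementation, and the response data below are invented for illustration):

```python
# Minimal sketch of the conventional infit and outfit mean-square calculations for
# one item. 'observed' and 'expected' are the actual and model-expected scores for
# each response to the item; 'variance' is the model variance of each observation
# around its expected score.
import numpy as np

def fit_mean_squares(observed, expected, variance):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    squared_residuals = (observed - expected) ** 2
    outfit = np.mean(squared_residuals / variance)      # unweighted mean: sensitive to outliers
    infit = squared_residuals.sum() / variance.sum()    # information-weighted, less outlier-sensitive
    return infit, outfit

# Invented ratings on a four-category item; both statistics fall within the
# 0.5-1.7 control limits used in this study.
infit, outfit = fit_mean_squares(
    observed=[3, 2, 4, 1, 3],
    expected=[2.2, 2.8, 3.1, 1.9, 2.6],
    variance=[0.7, 0.8, 0.6, 0.7, 0.8],
)
print(round(infit, 2), round(outfit, 2))  # approximately 0.85 and 0.88
```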
Differential item functioning
Assessment items exhibit DIF if the item scores of
equally able interviewees from different groups (e.g.
groups defined by gender, prior degree or type of
programme applied for) are systematically different.24 These population subgroups can be compared
simultaneously using ConQuest, Version 2.0,25 by
producing expected score curves (ESCs) with a
partial credit model.26 Each theoretical ESC can be
used as a benchmark against which the groups of
interest are compared.25
Differential item functioning is said to exist if the
empirical ESCs show a systematic difference between
groups across the candidate ability scale, suggesting a
secondary effect is operating.27,28 Small variations in
ESCs are tolerable because of measurement error,
but in general ESCs should not be separated. A
criterion that each score point on the empirical ESCs
should represent the results from at least 20 interviews was set for this cohort to avoid premature
classification of DIF based upon unstable ESC loci.
For sufficiently represented MMI questions and
candidate characteristic combinations, ESCs were
produced and visually inspected. Differential item
functioning parameters were then estimated and the
initially flagged items were evaluated for both
substantive and statistical significance. One rule of
thumb for declaring a difference between groups as
being substantive is to use half a logit. Unsystematic
parameter shifts of < 0.5 logit would have minimal
impact on the accuracy of test scores.20 However,
systematic parameter shifts across a number of items
could impact test scores and would therefore be cause
for concern.29 Statistical significance is demonstrated
by showing that the DIF parameter for a particular
item is not equal to zero. This can be done by
dividing the DIF parameter estimate by twice its
measurement error. Results exceeding a magnitude
of 1 indicate statistical significance at the 5% level.29
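These two rules can be summarised in a short sketch; the DIF parameters themselves were estimated in ConQuest, and the estimates and standard errors below are hypothetical values for illustration only.

```python
# Minimal sketch of the two flagging rules described above, applied to
# hypothetical DIF parameter estimates (in logits) and their standard errors.
def flag_dif(dif_estimate, standard_error, substantive_cutoff=0.5):
    # Statistical significance at roughly the 5% level: the estimate divided by
    # twice its measurement error exceeds 1 in magnitude.
    statistically_significant = abs(dif_estimate / (2 * standard_error)) > 1
    # Substantive significance: a shift of at least half a logit.
    substantively_significant = abs(dif_estimate) >= substantive_cutoff
    return statistically_significant and substantively_significant

print(flag_dif(0.17, 0.10))  # False: 0.17 / 0.20 = 0.85, and the shift is < 0.5 logit
print(flag_dif(0.60, 0.10))  # True: 0.60 / 0.20 = 3.0, and the shift is >= 0.5 logit
```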
Items exhibiting substantively and statistically
significant DIF were flagged for inspection by
subject matter experts to evaluate whether the DIF
might be related to the construct, with irrelevance
being interpreted as bias. Experts were also asked if
they could identify specific characteristics of the item
content that might cause the bias, and whether the
item could be suitably amended rather than
discarded.30
RESULTS
Each candidate answered eight MMI questions, each with
five checklist items. A 4-facet Rasch model using a
partial credit model established item characteristics
on all of the 2007 data consisting of 207 interviewers,
686 candidates, a bank of 39 MMI questions, each
with 5 checklist items giving 195 items, and 27 440
candidate marks. FACETS confirmed that there was
sufficient linkage within the interviewing plan, so that
all of the data could be simultaneously used. Consequently, all facets were placed onto a Wright map,
where mean interviewer stringency and question
difficulty were anchored by the measurement model
at 0.00 logit, and candidate ability was allowed to
float. The first column (Fig. 1) indicates the logit
scale. The second column shows interviewers’ level of
stringency, the third column the candidates’ level of
ability on the MMI, and the fourth column the
difficulty of the MMI questions. Reading the ruler
from bottom to top shows increasing interviewer stringency, increasing candidate ability and increasing question difficulty.

Figure 1 A Wright map showing ratings of candidates in multiple mini-interviews (MMIs) in 2007, combined with interviewers and MMI questions, transformed onto a logit scale.

An overall summary of the interviewer, candidate and question statistics is given in Table 1. Standard deviations (SDs) by facet were: interviewer stringency or leniency, 0.52; candidate ability, 0.75; and MMI question difficulty, 0.27. The spread of candidates was 1.44 times that of the interviewers, and therefore the candidate variance was 2.08 times the interviewer variance (variance equals the square of the SD).

Questions ranged in difficulty from -0.55 to +0.54 logit (Fig. 1). The reliability of separation index, which shows how well the questions were separated, was 0.92 (Table 1). This was statistically significant (χ² = 842.3, d.f. = 38, P < 0.001), indicating that the questions were meaningfully separated according to level of difficulty, from easy to hard, with a high degree of confidence.

Question statistics are summarised in Table 2. Question 27 (-0.55 logit) was the easiest question and question 29 was the hardest (+0.54 logit). Standard errors of the question measures (Table 2) ranged from 0.03 to 0.21 logit. Question 27, which had the largest error, was used least often (11 times). However, 39.2% (n = 269) of candidates had ability > 0.54 logit and thus were not tested effectively by any of the MMI questions (Fig. 1).
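The reliability of separation index reported in Table 1 can be read, in its conventional Rasch form, as the proportion of the observed variance in the question difficulty estimates that is not attributable to measurement error; the notation below is ours.

```latex
% Reliability of separation in its conventional Rasch form (notation ours):
% SD_obs^2 = observed variance of the question difficulty estimates,
% MSE      = mean square measurement error of those estimates.
R = \frac{SD_{\mathrm{obs}}^{2} - \mathrm{MSE}}{SD_{\mathrm{obs}}^{2}}
```

On this reading, a value of 0.92 indicates that most of the observed spread in question difficulty reflects real differences between questions rather than estimation error.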
The overall infit mean for questions (Table 2) was
1.03 (SD = 0.19, range 0.63–1.27). Overall outfit
mean was 1.03 (SD = 0.12, range 0.67–1.26). All the
questions were well within the predetermined range
of 0.5–1.7. The good fit of questions to the unidimensional measurement model provided some
empirical support that the questions could be
considered conceptually equivalent.23
Differential item functioning
Differential item functioning was used to investigate whether checklist items within each MMI question discriminated against students of equal ability according to the categorical variables described in Table 3. Figure 2 shows the type of ESC plot typical of an item (item 91) with indications of uniform DIF. The horizontal axis represents the overall performance of a candidate on the MMI. The vertical axis represents the score that a candidate of a given ability would, on average, be expected to attain. The scoring rubric of 1–4 used by interviewers is recoded to scores of 0–3 here, by convention for IRT. The solid curve represents the modelled ESC. The empirical score curves for the two comparison groups (male and female) are not coincident along the theoretical ESC. This kind of result is indicative of an item that discriminates between applicants on the basis of gender, which is in general undesirable.

Figure 2 Uniform differential item functioning trend for item 91 by gender.
Table 1 Means, standard deviations (SDs) and standard errors (SEs), separation reliability, significances and degrees of freedom of parameters

Facet                    Infit mean square   SE of the mean   SD     Separation reliability   Chi-square   Chi-square significance   Degrees of freedom
Interviewer stringency   1.05                0.05             0.52   0.91                     3392.5       < 0.001                   206
Candidate ability        1.05                0.02             0.75   0.91                     6959.2       < 0.001                   685
Question difficulty      1.03                0.04             0.27   0.92                     842.1        < 0.001                   38
Table 2 Item statistics for ratings of multiple mini-interview (MMI) candidates for medicine and dentistry programmes using an MMI bank of questions (n = 39), in order of question difficulty (hardest to easiest)

MMI question   Checklist    Question difficulty   Standard   Infit mean   Infit z   Outfit mean   Outfit z
number         item count   (logits)              error      squares      score     squares       score
29             310          0.54                  0.08       1.15         1.99      1.11          1.42
40             915          0.45                  0.05       1.04         0.87      1.03          0.63
26             785          0.44                  0.05       1.27         5.43      1.24          4.70
16             490          0.42                  0.06       1.08         1.34      1.20          2.89
14             420          0.39                  0.06       1.16         2.48      1.14          2.07
30             725          0.39                  0.05       0.97         -0.53     1.00          0.11
21             105          0.27                  0.13       1.00         0.04      1.00          0.01
5              2190         0.26                  0.03       1.06         2.23      1.07          2.35
1              215          0.23                  0.10       1.11         1.23      1.06          0.74
22             105          0.22                  0.15       0.89         -0.86     0.94          -0.38
31             780          0.17                  0.05       0.96         -0.94     0.96          -0.89
17             2075         0.13                  0.03       0.99         -0.18     1.00          -0.08
25             265          0.12                  0.09       0.92         -0.90     0.92          -0.87
47             435          0.12                  0.07       1.08         1.24      1.10          1.54
2              815          0.10                  0.05       1.06         1.22      1.09          1.78
24             1985         0.10                  0.03       1.01         0.41      1.04          1.19
12             255          0.09                  0.09       1.08         1.06      1.11          1.33
11             960          0.08                  0.05       0.98         -0.45     1.01          0.22
51             460          0.07                  0.07       1.01         0.14      1.01          0.24
10             500          -0.04                 0.07       1.00         0.08      1.02          0.30
15             105          -0.04                 0.13       0.91         -0.69     0.95          -0.37
23             915          -0.08                 0.05       1.05         1.01      1.08          1.56
6              210          -0.10                 0.10       1.06         0.59      0.96          -0.26
19             3320         -0.10                 0.03       1.16         6.42      1.15          5.99
28             255          -0.10                 0.09       0.90         -1.22     0.88          -1.30
43             220          -0.10                 0.10       1.24         2.71      1.26          2.80
32             200          -0.12                 0.11       1.09         1.03      1.11          1.16
13             730          -0.14                 0.05       1.12         2.42      1.08          1.52
39             765          -0.14                 0.05       1.06         1.28      1.01          0.14
18             210          -0.16                 0.11       0.90         -1.07     0.85          -1.46
8              1200         -0.20                 0.04       1.04         1.03      1.04          0.85
20             540          -0.23                 0.06       1.02         0.40      0.98          -0.32
3              1240         -0.31                 0.04       1.01         0.18      1.02          0.58
9              355          -0.36                 0.07       1.02         0.27      1.00          0.03
52             245          -0.38                 0.09       0.85         -1.89     0.87          -1.60
41             495          -0.47                 0.07       1.19         2.88      1.26          3.79
4              1365         -0.48                 0.04       0.90         -2.60     0.92          -2.07
7              225          -0.50                 0.10       1.09         1.01      1.19          1.97
27             55           -0.55                 0.21       0.63         -2.22     0.67          -1.68
Table 3 Characteristic variable categories and numbers for multiple mini-interview (MMI) candidates

Candidate characteristic   Categories            Number of candidates   Percentage
Candidate gender           Male                  352                    51.3
                           Female                334                    48.7
Candidate degree level     Bachelor              578                    84.3
                           Honours or Grad Dip   101                    14.7
                           Masters               1                      0.1
                           PhD                   6                      0.9
Candidate type             Local                 513                    74.8
                           International         173                    25.2
Candidate programme type   Medicine              448                    65.3
                           Dentistry             173                    25.2
                           Both                  65                     9.5
Candidate MMI location     Sydney                553                    80.6
                           Vancouver             116                    16.9
                           Singapore             17                     2.5
Of the 195 checklist items to which partial credit scores were applied, only item 91 exhibited visually detectable DIF for candidate variables (location and gender) and represented results from at least 20 interviews. This warranted an evaluation of the DIF parameters for item 91 using a DIF by location item response model and a DIF by gender item response model. Application of the DIF by location item response model was postponed owing to missing data across a number of the item score categories and the candidate MMI location categories. Despite the visually apparent DIF and a relatively large amount of interview data, the statistical and substantive significance of the DIF findings was not upheld for the item by gender interaction. The DIF parameter magnitude was 0.168 logit, with a measurement error of 0.096 logit. This estimate did not exceed twice the magnitude of the measurement error (0.168 / [2 × 0.096] ≈ 0.88 < 1), so the DIF effect was not statistically significant at the 5% level. Furthermore, the estimate was considerably lower than the 0.5 logit criterion for a substantive difference.23 Item 91 was nevertheless identified for qualitative evaluation, in order to carry through the final important steps of DIF evaluation.

Inspection of the item by content experts revealed that item 91 asked a question about how the candidate might care for an elderly relative with a terminal illness. The knowledge and skill underpinning this item were considered to be about ethical understanding and demonstrating integrity. Although actual responses, such as those pertaining to personal involvement in the care of elderly relatives at home, might be gender- and culture-specific, the reasoning processes elicited should not be.
DISCUSSION
These results provide further validation evidence for
the use of a structured MMI question bank in the
context of a high-stakes, international selection
procedure, where the purpose was to determine
whether candidates with a good academic record
should enter a graduate-entry medical or dental
programme. Some of the conditions for a quality
assured question bank have been established. Firstly,
conceptual equivalence has been supported as infit
and outfit statistics have indicated how well the
MMI questions fit the assumptions of the IRT
model23 and contribute to the measurement of the
underlying unidimensional construct. We have
named this ‘entry-level reasoning skills in professionalism’, meaning that candidates were reasoning
on issues connected with professionalism at the
level expected of beginning students, with a focus
on problem solving and critical thinking. Questions
displayed a meaningful hierarchy from easy to
hard. However, the most difficult MMI question
(question 29) did not test just over a quarter of
candidates with the greatest ability.
Secondly, we demonstrated a lack of significant DIF.
Very few items discriminated against candidates by
virtue of a range of background factors. This is an
important point to establish in contemporary admissions settings where candidates of various personal,
cultural and academic backgrounds are assessed by
common means.
Implications of findings
This research provides further evidence3,9 that the
MMI does assess cognitive reasoning skills. This stance
represents a significant conceptual change in our
understanding of the validity of the MMI, as it was
previously thought to measure non-cognitive skills.4 A
cognitive measurement construct would sit comfortably with other findings from the MMI literature. For
example, context-specificity of MMI questions3,4
would mean that candidates’ reasoning skills on one
question do not predict reasoning skills on another
question. The shared variance between MMI candidate scores3 and scores on the unidimensional construct ‘Reasoning in Humanities and Social Sciences’
suggests that this subsection of GAMSAT and the MMI
do, in part, measure a similar reasoning construct.
Correlations of the MMI with measures of clinical
competence4,7,8 might be partly explained by shared
variance between reasoning in aspects of professionalism and clinical reasoning. An MMI that was found to
assess cognitive skills would have a major impact on the
way that admission tools should be combined in the
future. This is an area that demands more research.
The real benefit of using IRT lies in its presentation of
item statistics, which yield enormous potential for
further development of the question bank. For
example, MMI questions will need to be made more
difficult if they are to test the ability of the most
able candidates. Only item 91 was considered to show some indication of DIF. On evaluating the relationship between the measurement construct and candidate integrity, as perceived from responses to an item on caring for one's own elderly relatives, it was decided that the differential performance may well have been a result of interviewer bias,31
suggesting that interviewers should undertake further
training around cultural and gender-associated issues.
Strengths and weaknesses
This is the first reported application of a suite of IRT
methods in a high-stakes MMI involving a large and
diverse group of candidates for both medical and
dental programmes in three different continents.
There are limitations to the study. The interviewing
plan in this naturalistic setting was constrained by
what was logistically possible. As is often the case in
such settings, there was no fully formalised design
that assigned specific interviewers or a specific set of
questions to each MMI circuit. Residual-based fit
statistics are not always sensitive to multi-dimensionality32 and future research will need to investigate our
findings with additional advanced psychometric
techniques.
For many items it was difficult to detect DIF for
candidate background variables with more than two
categories as a result of small numbers of candidates
per ability grouping per item. This difficulty will
subside with the accretion of data as more MMI
administrations are carried out. The use of DIF in this
study is therefore limited to an initial evaluation of
item validity rather than an evaluation of test score
validity (across subgroups). Given that high-stakes
decisions about individuals are based on test scores,
this will be investigated as more data become available. The format of MMIs varies among medical
schools and these findings are generalisable only to
those MMIs which use the standardised structured
approach.3
CONCLUSIONS
The MMI question bank contains items that have
been studied carefully in terms of conceptual equivalence and DIF. This research adds to the evidence
that MMI questions do measure a cognitive unidimensional construct, namely, ‘entry-level reasoning
skills in professionalism’. The lack of items exhibiting
DIF is encouraging in contemporary, high-stakes
admission settings where applicants of various
personal, cultural and academic backgrounds are
assessed by common means. The potential of this
approach is to provide MMI question developers
with a quality control procedure that extends down
to the checklist item level.
Contributors: all authors made substantial contributions to
the study conception and design, and to the acquisition,
analysis or interpretation of data. All authors contributed to
the writing of the article and approved the final manuscript.
Acknowledgements: we thank Mike Linacre for his
invaluable comments on the application of the multi-facet
Rasch model with the partial credit model to the data.
We thank the members of the Faculty of Medicine, and
Evelyn Howe and her team from the Faculty of Dentistry,
University of Sydney, for their hard work.
Funding: none.
Conflicts of interest: none.
Ethical approval: this study was approved by the University
of Sydney Human Research Ethics Committee
(ref. 08-2006/9392).
REFERENCES
1 Norman GR. Editorial: the morality of medical school
admissions. Adv Health Sci Educ 2004;9:79–82.
2 Kreiter CD, Yin P, Solow C, Brennan RL. Investigating
the reliability of the medical school admissions interview. Adv Health Sci Educ 2004;9:147–59.
3 Roberts C, Walton M, Rothnie I, Crossley J, Kumar K,
Lyon P, Tiller D. Factors affecting the utility of the multiple mini-interview in selecting candidates for graduate-entry medical school. Med Educ 2008;42:396–404.
4 Eva K, Rosenfeld J, Reiter HI, Norman GR. An admissions OSCE: the multiple-mini-interview. Med Educ
2004;38:314–26.
5 Harris S, Owen C. Discerning quality: using the multiple mini-interview in student selection for the Australian National University Medical School. Med Educ
2007;41:234–41.
6 Brownell K, Collin T, Lemay JF. Introduction of the
multiple mini-interview into the admissions process at
the University of Calgary: acceptability and feasibility.
Med Teach 2007;29:395–6.
7 Eva KW, Reiter H, Rosenfeld J, Norman GR. The ability
of the multiple mini-interview to predict preclerkship
performance in medical school. Acad Med 2004;79
(Suppl):40–2.
8 Reiter HI, Eva KW, Rosenfeld J, Norman GR.
Multiple mini-interviews predict clerkship and licensing examination performance. Med Educ 2007;41:
378–84.
9 Kumar K, Roberts C, Rothnie I, du Fresne C, Walton M.
Experiences of the multiple mini-interview: a qualitative analysis. Med Educ 2009;43:360–7.
10 Australian Council for Educational Research. Graduate
Australian Medical School Admission Guide 2006. Melbourne, Vic: ACER 2006.
11 Linacre JM. Many-Facet Rasch Measurement. Chicago, IL:
MESA Press 1994;1–15.
12 Downing S. Threats to the validity of clinical teaching
assessments: what about rater error? Med Educ
2005;39:353–5.
13 McManus I, Thompson M, Mollon J. Assessment of
examiner leniency and stringency (‘hawk-dove effect’)
in the MRCP (UK) clinical examination (PACES) using
multi-facet Rasch modelling. BMC Med Educ
2006;6:1272–94.
14 Harasym P, Woloschuk W, Cunning L. Undesired variance due to examiner stringency/leniency effect in
communication skill scores assessed in OSCEs. Adv
Health Sci Educ 2008;13:617–32.
15 Griffin P, Zoanetti N. Moderation of Interviewer Judgement
in Tests of Spoken English. 40th Annual TESOL
(Teachers of English to Speakers of Other Languages)
Convention and Exhibit, Tampa, FL, 15–18 March
2006;1–15.
16 Iramaneerat C, Myford C. Rater Effects in Clinical Performance Ratings of Surgery Residents. Proceedings of the
American Educational Research Association Conference, San Francisco, CA, 8–12 April 2006.
17 Lord FM. Applications of Item Response Theory to Practical
Testing Problems. Hillsdale, NJ: Lawrence Erlbaum
Associates 1980.
18 Vale CD. Computerized item banking. In: Downing
SM, Haladyna TM, eds. Handbook of Test Development.
New York, NY: Routledge 2006;261–85.
19 Teresi JA. Overview of quantitative measurement
methods. Equivalence, invariance, and differential item
functioning in health applications. Med Care 2006;44
(Suppl):39–49.
20 Wright BD, Douglas GA. Best Test Design and Self-tailored
Testing. Research Memorandum No. 19, Statistical Laboratory. Chicago, IL: University of Chicago, Department of
Education 1975.
21 Holland PW, Thayer DT. Differential Item Performance and
the Mantel–Haenszel Procedure (Technical Report No. 86–69).
Princeton, NJ: Educational Testing Service 1986.
22 Klineberg I, Massey W, Thomas M, Cockrell D. A new
era of dental education at the University of Sydney,
Australia. Aust Dent J 2002;47:194–201.
23 Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Mahwah, NJ:
Lawrence Erlbaum Associates 2001;239–249.
24 Kelderman H, Macready GB. The use of loglinear
models for assessing differential item functioning
across manifest and latent examinee groups. J Educ
Meas 1990;27:307–27.
25 Australian Council for Educational Research. ConQuest
Version 2: Generalised Item Response Modelling Software.
Camberwell, Vic: ACER 2007.
26 Masters GN. A Rasch model for partial credit scoring.
Psychometrika 1982;47:149–74.
27 Camilli G, Shepard LA. Methods for Identifying Biased Test
Items. Thousand Oaks, CA: Sage Publications 1994;
7–21.
28 Embretson SE, Reise SP. Item Response Theory for
Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates
2000;249–72.
29 Draba RE. The identification and interpretation of
item bias. MESA Memorandum No. 25. Chicago, IL:
MESA Press 1977. Available at: http://www.rasch.org/memo25.htm.
30 Langer MM, Hill CD, Thissen D, Burwinkle TM, Varni
JW, DeWalt DA. Item response theory detected
differential item functioning between healthy and ill
children in quality-of-life measures. J Clin Epidemiol
2008;61:268–76.
31 Mehrens WA. Validating licensing and certification test
score interpretations and decisions: a response. Appl
Meas Educ 1997;10:97–104.
32 Tennant A, Pallant JF. Unidimensionality matters!
(A tale of two Smiths?). Rasch Meas Trans 2006;20
(1):1048–51.
Received 4 June 2008; editorial comments to authors 24 October
2008, 10 November 2008; accepted for publication 4 December
2008
APPENDIX 1

Sample multiple mini-interview question

Imagine you are the principal of a large, respected school. There has been an allegation that a humiliating film of a young disabled person has been circulating on the Internet. Two final year students are up before you to explain their actions in the creation of the video. The video appears to show a young person with intellectual impairment being verbally abused by one of the students whilst a group of senior students look on laughing. What are the issues that you, as the principal, are likely to consider both before and at a disciplinary hearing?

Aim of the question

It aims to get candidates thinking about discrimination issues. It challenges them to think about the importance of privacy and the ways in which privacy can be violated. It aims to explore the consequences of one's actions on others. However, there is also the very real possibility that this was a genuine film made collaboratively to educate young people about discrimination against disabled people, which has been taken out of context.

Prompt questions

1 How would you go about establishing the facts of the case?
2 What might be the impact of this case for the disabled young person?
3 What underlying reasons would you give for recommending the two final year students are suspended from school?
4 What might your reaction be if the incident turned out to be a misunderstanding, such as if the senior students and the disabled young person had made a film about discrimination against disabled people?
5 How could you use this incident to raise awareness around disability discrimination in the school?

List of skills and behaviours

(Marked on a scale of 1–4, where 4 = excellent, 3 = good, 2 = satisfactory, 1 = unsatisfactory.)

1 Has a sense of establishing the facts to ensure fairness
2 Demonstrates an awareness of the situation from a range of perspectives
3 Able to justify how he or she would balance conflicting interests
4 Appreciates the need for students to consider the consequences of personal behaviours
5 Is able to draw lessons from the experience to inform future learning

Total mark: out of 20