Hip Psychometrics
Peter Baldwin
Joseph Bernstein
Howard Wainer
Models vary in strength.
When you have a lot of data, your need for a model decreases, and so you can manage with a weak one.
When your data are very limited, you need a strong model to lean on in order to draw inferences.
A very strong model
0-PL
P(x = 1 | θ) = exp(θ) / [1 + exp(θ)]
This is a strong model that requires few data to
estimate its single parameter (person ability), but
in return makes rigid assumptions about the data
(all items must be of equal difficulty). Such a
model is justified only when you don’t have enough
data to reject its assumptions.
1-PL
P(x = 1 | θ) = exp(θ - b) / [1 + exp(θ - b)]
This model is a little weaker and so makes fewer assumptions about the data: now items can have differential difficulty, but it assumes that all ICCs have equal slopes. If there are enough data to reject this assumption, a weaker model is usually preferred.
2-PL
P(x = 1 | θ) = exp[a(θ - b)] / {1 + exp[a(θ - b)]}
This model is weaker still, allowing items to have both differential difficulty and differential discrimination. But it assumes that examinees cannot get the item correct by chance.
3-PL
P(x = 1 | θ) = c + (1 - c) exp[a(θ - b)] / {1 + exp[a(θ - b)]}
This model is weaker still, allowing guessing,
but it assumes that items are conditionally
independent.
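To make the nesting concrete, here is a minimal numerical sketch (Python, with illustrative parameter values of our own choosing, not estimates from any real data) showing how each stronger model is just the weaker one with parameters fixed:

```python
import numpy as np

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Item characteristic curve for the 3-PL family.
    c = 0 gives the 2-PL; additionally a = 1 gives the 1-PL;
    additionally b = 0 gives the 0-PL."""
    logistic = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

theta = np.linspace(-3, 3, 7)          # a grid of abilities
print(icc(theta))                      # 0-PL
print(icc(theta, b=0.5))               # 1-PL
print(icc(theta, a=1.7, b=0.5))        # 2-PL
print(icc(theta, a=1.7, b=0.5, c=0.2)) # 3-PL
```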
Turtles all the way down!
As the amount of data we have increases, we can test
the assumptions of a model and are no longer forced
to use one that is unrealistic.
In general, we prefer the weakest model that our data will allow.
Thus we often fit a sequence of models and choose the
one whose fit no longer improves with more generality
(further weakening).
We usually have three models
In order of increasing complexity they are:
1. The one we fit to the data,
2. The one we use to think about the
data, and
3. The one that would actually generate
the data.
When data are abundant relative to
the number of questions asked of
them, answers can be formulated
using little more than those data.
We could fit the test response data with Samejima's polytomous IRT model:

P(x_j = k | θ) = exp(a_jk θ + c_jk) / Σ_{h=0}^{m_j} exp(a_jh θ + c_jh),   k = 0, 1, ..., m_j

where {a_k, c_k}_j, k = 0, 1, ..., m_j are the item category parameters that characterize the shape of the individual response functions. The a_k's are analogous to discriminations; the c_k's analogous to intercepts.
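A minimal sketch of these category response functions (Python; the parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def category_probs(theta, a, c):
    """P(x = k | theta) for one polytomous item with category
    slopes a[k] and intercepts c[k], k = 0..m_j."""
    z = np.asarray(a) * theta + np.asarray(c)
    z -= z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical 4-category item evaluated at theta = 0.5
print(category_probs(0.5, a=[0.0, 1.0, 2.0, 3.0], c=[0.0, 0.5, 0.0, -0.5]))
```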
And get a useful result. But with 830,000 data points, why bother?
Score (x)        A        B        C       D       E     Omit    TOTAL
0                0        1        1       3       0      296      301
1               14        9        5       3       0      176      207
2               27       21       12       9       9      252      330
3               52       38       20      25      23      343      501
4              117      108       47      66      44      480      862
5              220      226       86     119      88      590    1,329
6              421      464      176     235     150      837    2,283
7              711      804      298     427     266    1,066    3,572
...            ...      ...      ...     ...     ...      ...      ...
42           2,001      984    4,007     510      74      422    7,998
43           1,586      825    3,504     418      58      286    6,677
44           1,165      557    3,017     300      28      195    5,262
45             770      387    2,488     226      24      107    4,002
46             516      249    1,929     139       7       57    2,897
47             306      144    1,354      68      19       36    1,927
48             153       66      696      37       7        8      967
49              28       22      323      12       0        1      386
SUM        254,339  174,725  169,364  79,701  30,005  120,981  829,115
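With data this abundant, the curves in the plots that follow are just row proportions from the table above; a minimal sketch (Python, using the first three rows of the table):

```python
import numpy as np

# counts by score for columns A, B, C, D, E, Omit (rows: scores 0-2)
counts = np.array([
    [ 0,  1,  1, 3, 0, 296],
    [14,  9,  5, 3, 0, 176],
    [27, 21, 12, 9, 9, 252],
])
totals = counts.sum(axis=1, keepdims=True)   # row totals: 301, 207, 330
p_choice_given_score = counts / totals       # empirical P(choice | score)
print(p_choice_given_score.round(3))
```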
[Figure: empirical trace lines P(choice | score) plotted against Number Right (0-50), one panel per option: Choice A = 5 Faces, Choice B = 6 Faces, Choice C = 7 Faces, Choice D = 8 Faces, Choice E = 9 Faces.]
Proof that the correct
answer is (A) Five
[Figure: the five empirical trace lines overlaid.]
But when data are sparse, we must
lean on strong models to help us
draw inferences.
A study of the diagnoses of hip
fractures provides a compelling
illustration of the power of
psychometric models to yield
insights when data are sparse.
Hip fractures are common injuries; more than 250,000 are treated annually in the US alone.
These fractures can be located in the shaft
of the bone or in the neck of the bone
connecting the shaft to the head of the
femur.
Femoral neck fractures vary in their severity. Garden (1961) provided a four-category classification scheme for hip fractures.
At the heart of this study are two clinical questions
of interest in the diagnosis of hip fractures.
1. Is Garden’s approach of classifying femoral neck fractures into four categories, which is considered the de facto standard, too finely variegated to provide meaningful information, given that there are only two clinical treatment choices?
2. How consistent are orthopedic surgeons in their diagnoses?
Should we expect consistent judgments from individual
surgeons? Are Garden’s classifications applied
consistently by different surgeons?
Raw data of hip fracture diagnosis

                 Doctor
Case    A  B  C  D  E  F  G  H  I  J  K  L
1       1  1  2  2  3  1  1  1  4  1  3  4
2       4  4  4  4  4  4  3  4  3  4  4  4
3       4  4  4  3  4  3  4  4  3  4  2  4
4       3  3  2  2  2  2  2  3  2  2  3  2
5       4  4  3  4  4  3  2  3  4  3  4  4
6       3  2  2  1  3  2  1  3  2  2  2  2
7       4  4  4  3  4  4  4  4  4  3  4  4
8       4  4  4  3  4  4  4  4  4  4  4  4
9       4  3  3  2  3  3  4  3  4  4  2  3
10      4  4  4  4  4  4  4  4  4  4  4  4
11      1  2  1  1  2  1  1  2  1  1  2  1
12      4  4  4  3  4  4  4  4  4  4  4  4
13      4  4  3  3  4  3  3  4  3  3  4  4
14      4  4  4  4  4  3  3  4  4  4  4  4
15      2  2  2  2  2  2  2  1  2  1  1  3
1*      2  1  1  2  2  2  2  1  2  1  3  2
2*      4  4  3  3  4  4  3  4  4  3  4  4
3*      4  4  4  4  4  3  4  4  4  4  4  4
4*      3  3  3  2  3  3  2  3  2  4  2  2
5*      4  4  4  3  4  3  4  4  3  4  4  4
The * indicates the 2nd administration of a previously viewed radiograph
Diagnoses tended toward the more serious end
With 20 radiographs and only 12 judges, how weak a model could we get away with?
We wanted to use a Bayesian
version of Samejima’s
polytomous model, but could we
fit it with such sparse data?
We decided to ask the experts.
We surveyed 42 of the world’s
greatest experts in IRT, asking
what would be the minimum ‘n’
required to obtain usefully
accurate results.
To summarize their advice
[Figure: distribution of the experts' answers for "Minimal Acceptable Sample Size," on a log scale from 10 to 100,000.]
They were almost right.
Actually 12 surgeons worked just fine, so long
as a few small precautions were followed.
1. We treated the surgeons as the items,
and the radiographs as the repetitions.
2. We needed 165,000 iterations to get convergence.
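A minimal sketch of the kind of sampler involved (Python; this is not the analysis code, and the parameters below are hypothetical placeholders — the real fit sampled all item and case parameters jointly, not just one severity):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(theta, a, c, responses):
    """Log-likelihood of one radiograph's latent severity theta,
    with surgeons as polytomous 'items' (slopes a[j], intercepts c[j])
    and responses[j] the category each surgeon assigned."""
    ll = 0.0
    for a_j, c_j, k in zip(a, c, responses):
        z = np.asarray(a_j) * theta + np.asarray(c_j)
        z -= z.max()                      # numerical stability
        ll += z[k] - np.log(np.exp(z).sum())
    return ll

def metropolis(responses, a, c, n_iter=165_000, step=0.5):
    """Random-walk Metropolis draws of theta under a N(0, 1) prior."""
    theta = 0.0
    lp = log_lik(theta, a, c, responses) - 0.5 * theta**2
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_lik(prop, a, c, responses) - 0.5 * prop**2
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws

# hypothetical example: 3 surgeons, 4 ordered categories each
a = [[0.0, 1.0, 2.0, 3.0]] * 3
c = [[0.0, 0.5, 0.0, -0.5]] * 3
draws = metropolis(responses=[2, 1, 2], a=a, c=c, n_iter=5_000)
print(draws[1_000:].mean())   # posterior mean severity after burn-in
```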
Ethical Caveat
We feel obligated to offer the
warnings that:
1. These analyses were performed by
professionals; inexperienced persons
should not attempt to duplicate them.
2. Keep MCMC software out of the hands of amateurs and small children.
What did we find?
The model yields a stochastic
description of what happens when an
orthopedic surgeon meets a
radiograph.
As an example, consider:
Most of these results we could have
gotten without the model. What does
fitting a psychometric model buy us?
1. Standard errors: without a model, all we can say is that different surgeons agree x% of the time on this radiograph. With a model we get more usable precision.
2. Automatic adjustment for differential propensity to judge a fracture serious.
The severity scores of 12 orthopedists and 15
radiographs
[Figure: severity scores of the orthopedists and the radiographs on a common scale (-2.0 to 2.0); the radiographs spread much more widely than the orthopedists.]
This is good news!
On essay scoring (and the scoring of
most constructed response items) the
variance due to judges is usually
about the same as the variance due to
examinees.
Surgeons do much better than ‘expert
judges.’
The variance of the radiographs is 19 times the variance of the surgeons. We can construct an analog of reliability from this (if we treat σ²x-rays as true-score variance and σ²Doctors as error variance):

Reliability = σ²x-rays / (σ²x-rays + σ²Doctors)

These data yield an estimate of the reliability of judgment equal to 19/20 = 0.95.
Suggesting that in aggregate,
on this sample of x-rays,
there is almost no need for a second opinion.
We shall discuss the ominous ‘almost’ shortly.
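As a quick arithmetic check of that reliability figure (a minimal Python sketch; the 19:1 variance ratio is expressed in arbitrary relative units):

```python
# radiograph variance is 19 times the surgeon variance (relative units)
var_xrays, var_doctors = 19.0, 1.0
reliability = var_xrays / (var_xrays + var_doctors)
print(reliability)   # 0.95
```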
The model gives us robustness of judgment by adjusting the judgments for differences among the orthopedists in their propensities toward severity.
For example, consider case 6.
Although there are three doctors who judged it a
III, the other nine all placed it as a I or a II.
The model yields the probability of this case falling in
each of the four categories as:
I: .18   II: .59   III: .21   IV: .02
Overall, it fell solidly in the II category, and so if we
had 12 different opinions on this case we would feel
reasonably secure deciding to pin the fracture, for
the probability of it being a I or a II was 0.77
(.18+.59).
But let’s try an experiment.
Suppose we omit, for this case, all nine surgeons who scored it as anything other than a III.
We are left with three surgeons who all rated it a category III fracture, and if we went no further the patient would have a hip replacement in his immediate future.
But if we use the model, it automatically adjusts for the severity
of those three judges and yields the probabilities of case 6
falling in each of the four categories as:
I: .03   II: .38   III: .48   IV: .11
This case’s location on the severity scale has shifted to the right, but not completely.
Case 6 is not a clear hip replacement; rather, it falls on the boundary between II and III, with the probability that pinning would be sufficient adding up to 0.41 (.03 + .38).
Prudence would suggest that when we find a boundary
case like this, we seek additional opinions.
In this case those opinions are likely to be the Is and
IIs that we had elided previously.
Note that this yields a deeper meaning for the phrase ‘second opinion’. It could mean getting more opinions from the same doctor on other cases so that we can adjust his or her ratings for any unusual severity.
This automatic adjustment is not easily
available without an explicit model.
Last, the title of this talk could just as easily
have been “Hearty Psychometrics” had the
data we used been from 12 cardiac surgeons
judging blood vessel blockage.
Peter and I are grateful to Joe for making us
hip psychometricians and more grateful still
that Joe didn’t specialize in gastroenterology.