
Michael J. Walk
Modern Measurement Theories
Homework #3
17 February 2008
Question #1
(a) [created using PLOTIRT—a package for R. An option to extend the x-axis was not
available.]
(b) SEE APPENDIX (ATTACHED SCAN OF EQUATIONS)
(c)
Probability of Scoring Incorrect

Theta    Item 1    Item 2    Item 3
-1.5     0.422     0.953     0.899
 0       0.283     0.500     0.884
 1.5     0.167     0.047     0.658
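
As an illustration of how the probabilities in (c) are obtained, the short R sketch below computes the probability of an incorrect response, 1 - P(θ), under a 2PL item response function. The item parameters shown are hypothetical placeholders (the actual parameters are in the appendix), and the logistic scaling constant is taken as D = 1.

    # 2PL probability of a correct response (scaling constant D = 1 assumed)
    p_correct <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
    theta <- c(-1.5, 0, 1.5)
    a <- c(0.7, 2.0, 0.9)    # hypothetical discrimination parameters for items 1-3
    b <- c(-2.0, 0.0, 2.5)   # hypothetical difficulty parameters for items 1-3
    p_incorrect <- sapply(seq_along(a), function(i) 1 - p_correct(theta, a[i], b[i]))
    round(p_incorrect, 3)    # rows: theta = -1.5, 0, 1.5; columns: items 1-3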
(d)
Item Information (Standard Error in Parentheses)

Theta    Item 1           Item 2           Item 3
-1.5     0.035 (5.35)     0.181 (2.35)     0.000 (183.77)
 0       0.038 (5.11)     1.000 (1.00)     0.010 (10.08)
 1.5     0.030 (5.75)     0.181 (2.35)     0.557 (1.34)

(e)
Test Information and Standard Error of Estimation (SEE)

Theta    Test Information    SEE
-1.5     0.216               2.153
 0       1.048               0.977
 1.5     0.768               1.141
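
The values in (e) follow directly from the values in (d): test information at each theta is the sum of the item information values, and each standard error is the reciprocal of the square root of the corresponding information. A minimal R check, using the rounded item information values from the table in (d):

    item_info <- rbind(c(0.035, 0.181, 0.000),   # theta = -1.5
                       c(0.038, 1.000, 0.010),   # theta =  0
                       c(0.030, 0.181, 0.557))   # theta =  1.5
    item_se <- 1 / sqrt(item_info)    # per-item SEs; item 3 at theta = -1.5 comes out Inf here
                                      # only because its information is rounded to 0.000
    test_info <- rowSums(item_info)   # test information = sum of item information
    see <- 1 / sqrt(test_info)        # SEE = 1 / sqrt(test information)
    round(cbind(theta = c(-1.5, 0, 1.5), test_info, see), 3)   # reproduces (e) up to rounding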
Question 2
Estimated normal-ogive (ENO) difficulty and discrimination parameters were calculated using the provided CTT statistics, such that ENO discrimination (α′) = rbs / (1 - rbs²)^½ and ENO difficulty (β′) = z / rbs. These estimated values are presented in the table below in the columns headed “ENO.” Reported (i.e., given) estimates are in the adjacent columns to the right of the ENO estimates, and the absolute differences between the two types of estimates are presented to the right of the reported values.

Table: ENO and reported discrimination (α) and difficulty (β) estimates and their absolute differences.

Item   ENO (α′)   Reported (α)   |α′ - α|   ENO (β′)   Reported (β)   |β′ - β|
  1     1.083        0.860         0.222      0.156       0.150         0.006
  2     0.799        1.161         0.362     -3.326      -2.441         0.884
  3     1.594        1.257         0.337      1.091       1.085         0.006
  4     1.510        1.165         0.345      0.268       0.262         0.006
  5     1.235        1.047         0.188     -0.544      -0.414         0.130
  6     1.033        0.928         0.105     -0.861      -0.800         0.061
  7     0.748        1.141         0.393     -3.776      -2.698         1.079
  8     0.520        1.040         0.520     -0.162      -0.096         0.066
  9     1.470        1.129         0.342      0.503       0.501         0.003
 10     1.024        0.829         0.194     -0.004      -0.008         0.004
 11     1.054        0.840         0.214      0.217       0.210         0.007
 12     1.461        1.160         0.301      1.110       1.101         0.009
 13     1.361        1.173         0.188      1.696       1.635         0.061
 14     1.310        1.110         0.200     -0.387      -0.374         0.013
 15     1.045        0.833         0.212      0.856       0.844         0.012
 16     1.033        0.819         0.214      0.325       0.319         0.006
 17     1.045        0.832         0.213      0.271       0.262         0.008
 18     1.432        1.103         0.329      0.670       0.667         0.003
 19     1.177        1.059         0.118     -0.758      -0.709         0.049
 20     1.400        1.120         0.280      1.179       1.165         0.014
 21     1.042        0.846         0.196      1.401       1.379         0.022
 22     1.348        1.050         0.298      0.867       0.862         0.005
 23     1.423        1.126         0.297      1.075       1.067         0.008
 24     0.924        0.779         0.145     -0.490      -0.472         0.019
 25     0.921        0.960         0.039     -1.728      -1.493         0.234
 26     1.118        0.927         0.191     -0.281      -0.276         0.004
 27     0.997        0.796         0.201      0.845       0.833         0.012
 28     1.391        1.111         0.280     -0.045      -0.048         0.003
 29     1.039        0.824         0.215      0.284       0.279         0.005
 30     1.475        1.135         0.340      0.326       0.323         0.003
  M                                0.249                                0.091
 SD                                0.098                                0.248

The average absolute difference for discrimination parameters was M = .25 (SD = .098). A histogram of the absolute differences (see figure below) reveals that the disparities appear to be normally distributed and centered symmetrically around the mean. Most (21) of the differences were below .30, indicating that, generally speaking, the estimation methods used produced relatively similar results; however, an average absolute difference of .25 is considerable in contexts where precision is of the utmost concern. Also, I investigated the distribution of directional difference values (i.e., directional difference = α′ - α) and found that 26 of the 30 items’ estimated normal-ogive discrimination parameters are greater than their corresponding reported parameters, indicating that the normal-ogive estimates (for these particular data) tended to be larger than those obtained through traditional IRT analysis.

[Figure: Histogram of absolute discrimination differences (adiff); Mean = 0.2494, Std. Dev. = 0.09784, N = 30.]
The average absolute difference for difficulty parameters was M = .09 (SD = .25). A histogram of the absolute differences (see figure below) reveals a highly skewed distribution (one that we might expect for a distribution of absolute-value scores). Overall, the normal-ogive estimates seemed to be quite close to their reported counterparts. A histogram of the directional differences between the normal-ogive estimates and the reported estimates (β′ - β) provided very little new information. Difference values were mostly (20 out of 30) positive or zero (i.e., the normal-ogive estimates were larger than the reported values). Only items 2 and 7 had a difference value with a magnitude larger than .2. In addition, there was a substantial difference between the reported and ENO values for item seven (reported β = -2.70, ENO β′ = -3.78), indicating that item seven was poorly represented by the normal-ogive estimates.

[Figure: Histogram of absolute difficulty differences (bdiff); Mean = 0.0915, Std. Dev. = 0.24794, N = 30.]

Examination of the differences and the corresponding item statistics suggests that the normal-ogive approximation only holds for items that perform well in the current sample. For example, in estimating difficulty parameters the largest divergence occurred for items 2 and 7. It is no coincidence that items 2 and 7 are also the two items with the largest p-values (.96 and .98, respectively—in fact, they possess the only p-values greater than .90), suggesting that these items are far too easy to provide useful information about examinee ability. Since the p-value is used in calculating the appropriate z-score in the formula ENO β′ = z / rbs, it is logical that the ENO difficulty parameters will be affected by problematic p-values.

Finally, the ENO discrimination parameter for item 8 (ENO α′ = .52) was quite different from the provided value (α = 1.04). The absolute difference between these two values (.52) was higher than that for any other item. Since the ENO α′ is a function of the item’s biserial correlation (α′ = rbs / (1 - rbs²)^½), it follows that low biserial correlations may produce unstable ENO parameters. An item that fails to discriminate between examinees (as indicated by rbs; item eight’s rbs = .40) will also be poorly represented by an ENO discrimination parameter.
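
For reference, the ENO calculations described above can be carried out directly from an item’s p-value and biserial correlation. The R sketch below uses hypothetical p and rbs values (not the assignment data) and takes z as the normal deviate that cuts off the upper proportion p of the standard normal distribution:

    p    <- c(0.50, 0.85, 0.96)   # hypothetical item p-values (proportion correct)
    r_bs <- c(0.55, 0.60, 0.62)   # hypothetical item biserial correlations
    eno_alpha <- r_bs / sqrt(1 - r_bs^2)   # ENO discrimination: alpha' = rbs / sqrt(1 - rbs^2)
    z <- qnorm(1 - p)                      # normal deviate corresponding to each p-value
    eno_beta <- z / r_bs                   # ENO difficulty: beta' = z / rbs
    round(data.frame(p, r_bs, eno_alpha, eno_beta), 3)

Under this convention, a very easy item (large p) yields a strongly negative z and hence a strongly negative β′, which is consistent with the pattern seen for items 2 and 7.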
(c-d)
Sorting the items by p-values and then by IRT difficulty estimates (β) simply inverts the order of items. That is, items with higher p-values have smaller βs. There are three exceptions to this rule: the β ordering of item pairs 24 and 25, 14 and 15, and 4 and 5 is reversed relative to their p-value ordering. This suggests that β values are strongly (inversely) related to p-values; however, because the IRT model controls for ability levels when estimating item parameters, the relationship is not a perfect one-to-one.
Sorting the items by biserial correlations (rbs) and then by IRT discrimination estimates (α) creates two somewhat different orderings. While some items are in almost exactly the same rank in both orderings, other items are quite differently ranked. In fact, a scatterplot of rankings (see the figure below) reveals that, for most items, their rankings in both orderings are quite similar, if not exactly the same. However, there are some items which appear to be “outliers” in that their rankings are quite different in the two orderings. The items that are the farthest outliers were 2, 7, and 8 (which were identified earlier as poorly performing items). The rankings of these items (ranking by rbs, ranking by α) for items 2, 7, and 8 were (3, 27), (2, 25), and (1, 14), respectively. Items 2 and 7 had very high p-values (.96 and .98). Item 8 had a low biserial correlation (.40). These extreme values affect item discrimination as estimated by the IRT model because the 2PL IRT model does not treat discrimination and difficulty as unrelated values, but estimates them simultaneously. In addition, the IRT analysis controls for examinee ability level while the CTT statistics do not. (There were other items that appeared to have quite different rankings; however, their distances from the rest of the points were not as extreme as the mentioned items.)

In general, by examining the scatterplot, it appears that there is more variation in ranking for items with the lowest and highest biserial correlations. Items with low biserial correlations may actually be discriminating, but only for lower ability levels (e.g., item 2 and 7 difficulty estimates were -2.44 and -2.70, respectively).

[Figure: Scatterplot of item rankings, with ranking by biserial correlation on the x-axis and ranking by IRT discrimination estimate on the y-axis.]
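
A sketch of how this ranking comparison can be produced in R; the rbs and alpha vectors below are hypothetical stand-ins for the assignment data:

    rbs   <- c(0.40, 0.62, 0.55, 0.71, 0.48)   # hypothetical biserial correlations
    alpha <- c(1.04, 1.16, 1.05, 1.26, 0.92)   # hypothetical IRT discrimination estimates
    rank_rbs   <- rank(rbs)     # ordering by the CTT statistic
    rank_alpha <- rank(alpha)   # ordering by the IRT estimate
    plot(rank_rbs, rank_alpha, xlab = "rank by biserial", ylab = "rank by IRT discrimination")
    cor(rank_rbs, rank_alpha, method = "spearman")   # agreement between the two orderings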
Question 3
(a)
Differences
The first, and I think most important, difference between IRT and CTT is that parameter
estimates from CTT are sample-dependent, while estimates from IRT are sample-independent.
To explain further, when item parameter estimates (e.g., discrimination in the form of biserial
correlations and difficulty in the form of p-values) are obtained in CTT, they will change from
sample to sample depending upon the trait-level present in the sample. Therefore, two samples
that differ greatly in their average level of ability (or trait) and take the same test will produce
very different item parameter estimates. However, IRT takes into account the ability levels of
examinees when item parameters are estimated. Since IRT essentially controls for ability level
when estimating item parameters, these parameters are relatively invariant (I say relatively,
because miniscule fluctuations are expected and acceptable) across samples of different ability
levels. While CTT can produce quite accurate parameter estimates when the sample used is large and heterogeneous with regard to ability level, determining whether or not the “best” sample has
been obtained is theoretically impossible—a large disadvantage for CTT if you want accurate
and invariant item parameter estimates.
This brings us to a second difference between IRT and CTT: CTT focuses on the test as
the unit of measurement (hence, “Classical Test Theory”); IRT focuses on the test items (hence,
“Item Response Theory”). While item parameters are important in the design and construction of
a reliable test, items found to be acceptable via CTT analysis cannot stand alone. In other words,
CTT does not produce a bank of items with known characteristics from which a test designer
could build a test by combining items containing the desired characteristics. CTT can produce a
test (with a relatively stable set of parameter estimates; e.g., reliability coefficients, score norms,
etc.) that can be used as a whole, but it cannot produce items (with a relatively stable set of
parameter estimates; e.g., difficulty, discrimination, etc.) that can be used interchangeably with
other items to generate unique tests without harming the inferences that can be drawn from test
scores.
A third difference between CTT and IRT is that CTT is unable to incorporate guessing
into its measurement models. For example, when examinees are presented with multiple choice
items, the probability of answering correctly by chance alone is one divided by the total number
of response options—for an item with five options: 1 / 5 = .2. And this ignores the fact that, in
many cases, respondents can accurately eliminate at least one incorrect answer, thereby
increasing the probability of guessing correctly. IRT can provide test designers with estimates of
guessing parameters (i.e., the probability of getting an item correct regardless of trait level) in the
3PL model. While an item’s property of being guessable would influence the p-value of the item,
a large p-value could also indicate that the item is easy because its difficulty is below the
average trait level present in the sample of examinees. CTT cannot discriminate between these
two cases; IRT can.
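
To make the guessing parameter concrete, the sketch below (with hypothetical parameter values) shows that under the 3PL the probability of a correct response never falls below the lower asymptote c, no matter how low θ is:

    # 3PL item response function: c is the lower asymptote (guessing parameter); D = 1 assumed
    p_3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))
    theta <- seq(-4, 4, by = 1)
    round(p_3pl(theta, a = 1.2, b = 0.5, c = 0.2), 3)   # approaches 0.2, not 0, at low theta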
A last important difference between CTT and IRT is the concept of information. In IRT,
a test designer can calculate how much information an item (or test) provides for a given level of
ability. This enables designers to tailor tests to specific trait levels. For example, if I were
designing a test to qualify students for a prestigious scholarship, I would want that test to provide
the most information about examinees with above average ability. So, thanks to IRT item
information functions, I could select items which provide the highest amount of information for
high trait levels and put these items together to create a highly informative test. CTT does not
provide estimates of information as a function of trait level.
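
As a rough illustration of this kind of targeting, the R sketch below computes 2PL item information, I(θ) = a²P(θ)[1 - P(θ)], for three hypothetical items; the hardest item peaks at high θ and would be preferred for a test aimed at above-average examinees:

    theta <- seq(-3, 3, by = 0.1)
    a <- c(1.2, 0.8, 1.5)    # hypothetical discrimination parameters
    b <- c(-1.0, 0.0, 1.5)   # hypothetical difficulty parameters
    info <- sapply(seq_along(a), function(i) {
      p <- 1 / (1 + exp(-a[i] * (theta - b[i])))   # 2PL probability (D = 1 assumed)
      a[i]^2 * p * (1 - p)                         # item information at each theta
    })
    matplot(theta, info, type = "l", lty = 1, ylab = "item information")
    theta[apply(info, 2, which.max)]   # each item is most informative near its own difficulty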
Similarities
Despite all of the previously listed differences between CTT and IRT, there remain
several similarities. For example, both CTT and IRT have, in their proverbial “heart and soul,”
the goal of accurately measuring an individual’s true score on a given construct of interest. The
two methods do this in different ways, but, ultimately, they strive to accomplish the same feat.
Both methods are concerned with providing reliable and valid estimates of examinee ability, and
both methods by and large use some of the same basic mathematical constructs as a basis for
doing this.
Namely, both CTT and IRT models incorporate some form of item characteristic into the
design of a test. Specifically, item discrimination and item difficulty form an important skeleton
upon which test scores are fleshed out in both methods. How these two statistics are used varies
across methods and models—discrimination and difficulty provide important information for
CTT test design; for IRT, they are used both in design and in the estimation of true scores.
Lastly, CTT and IRT are quite similar in that both CTT and IRT use measurement models
to try to reproduce the response patterns found in observed data. Both methods incorporate several different types of measurement models, more or less restrictive, that can be tested
against data. These models tend to be used for different purposes—CTT models tend to be more
confirmatory in nature, while IRT models are both confirmatory and prescriptive. That is, CTT models can provide the degree to which the observed data structure fits certain assumptions about the test items or forms of tests (e.g., parallel, tau-equivalent, etc.), while IRT models are
designed to both fit the data and to provide information about items that can be used in future
item administrations.
(b)
In deciding whether to use CTT or IRT analyses, I would consider several factors—the
most important of which is the purpose of the analysis. If the goal is to construct a test or several
tests by creating fixed sets of items to be contained in these tests, and there is no intent to create
adaptive testing, then I would use CTT. In addition, if I were interested in assessing the
reliability (i.e., internal consistency, test-retest, or parallel forms) of the test, I could simply use CTT. I would have to be careful in my sampling procedure since all CTT estimates are sample-dependent, but a CTT analysis would be sufficient to answer these types of questions.
If I wanted to create a bank or pool of items that could be used to build future tests, or if I
wanted to create an adaptive test, I would use IRT analysis. IRT can provide the necessary
information to meet these analytical goals—CTT could not.
Other considerations include time, money, and utility. The information provided by a
CTT analysis may suffice for the purposes of the testing agency/agent, thereby allowing a
decrease in the time/money that is often required to correctly calibrate items in IRT. This is not
to say that designing a solid test with CTT is not demanding of resources; however, it is often the
case for small-scale assessments that CTT analyses will provide adequate test-level information.
And one can avoid the extra effort of designing an item pool since one is not needed.
(c)
Both logistic regression and IRT are concerned with modeling the probability of a given
categorical outcome as a function of one or more predictor variables. For example, logistic
regression could be used to calculate the probability of getting lung cancer as a function of
number of years spent smoking and average daily cigarette consumption. The dependent variable
in this case is dichotomous (do or do not have lung cancer), and the independent or predictor
variables are continuous. IRT could be used to calculate the probability of getting a particular
item correct as a function of examinee ability and item difficulty (this is the Rasch or 1PL model
in IRT). In both the lung cancer and item correct examples, the key interest in the analysis is the
probability of a particular outcome.
In both logistic regression and IRT, the outcome variable is therefore not normally distributed; the appropriate distribution is the Bernoulli distribution. Since the outcome variable is not normally distributed, it cannot be modeled directly as a linear function of the predictors; a link function of some sort must connect the predictors to the outcome so that the predictors can be linearly related to a transformation of the expected outcome (e.g., the logit link).
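
A small R sketch of this shared logit form; the lung-cancer variables below are simulated, and the names (years, cancer) are hypothetical, used only to echo the example above:

    set.seed(1)
    years  <- rnorm(200, mean = 20, sd = 8)                # simulated smoking histories
    cancer <- rbinom(200, 1, plogis(-4 + 0.15 * years))    # simulated 0/1 outcome
    fit <- glm(cancer ~ years, family = binomial(link = "logit"))   # logistic regression
    coef(fit)
    # The Rasch (1PL) model uses the same inverse-logit form, but with a latent predictor theta:
    rasch_p <- function(theta, b) plogis(theta - b)        # P(correct | theta, item difficulty b)
    rasch_p(c(-1, 0, 1), b = 0.5)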
A last and important similarity, one that I think is pivotal to an accurate appreciation of IRT,
is that logistic regression and IRT are not magical statistical models. This sounds superfluous,
but I am trying to say that both analyses only attempt to model the observed patterns in data.
However, in any model, there is error. In both analyses, estimates of how far off the model is
from the actual data are obtained using -2 times the log-likelihood (-2LL), and the -2LL can be used to assess comparative model fit, allowing for the testing of alternative models. In summary,
it is helpful to remember how similar logistic regression and IRT actually are in both their
usefulness and their limitations.
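
For instance, nested logistic regression models can be compared through the change in -2LL (a likelihood-ratio test); the sketch below continues with simulated, hypothetical data:

    set.seed(2)
    years  <- rnorm(200, 20, 8)
    cigs   <- rpois(200, 15)                                  # hypothetical second predictor
    cancer <- rbinom(200, 1, plogis(-4 + 0.12 * years + 0.05 * cigs))
    fit0 <- glm(cancer ~ years,        family = binomial)
    fit1 <- glm(cancer ~ years + cigs, family = binomial)
    c(fit0 = -2 * as.numeric(logLik(fit0)),                   # -2LL for each model
      fit1 = -2 * as.numeric(logLik(fit1)))
    anova(fit0, fit1, test = "Chisq")                         # likelihood-ratio comparison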
Of course, although these two techniques are very similar, there are some important
differences. For instance, while both logistic regression and IRT predict categorical outcomes as
a function of predictor variables, only IRT uses a latent predictor variable (i.e., theta or
trait/ability level). Logistic regression only uses manifest predictor variables.
Another important difference between logistic regression and IRT is that, while the terms
in a logistic regression equation are additive (unless interactions are sought), terms in IRT item
response functions can be multiplicative (e.g., the 2PL) or even more complex (e.g., the 3PL).
This takes many IRT models out of the family of generalized linear models, to which logistic regression belongs.
A final difference between IRT and logistic regression is based on IRT’s use of latent
predictors. That is, because the levels of the latent variable must be established before item
parameters can be calculated, much larger samples are needed in IRT than in logistic regression
(generally speaking). While large samples are almost always desired in any statistical analysis,
they are especially important in IRT due to the complexity of the parameter estimation process
and the need to estimate the latent predictor. This is not a necessary step in logistic regression.