Michael J. Walk
Modern Measurement Theories
Homework #3
17 February 2008

Question #1

(a) [Created using PLOTIRT, a package for R. An option to extend the x-axis was not available.]

(b) SEE APPENDIX (ATTACHED SCAN OF EQUATIONS)

(c) Probability of Scoring Incorrect

    THETA    ITEM 1    ITEM 2    ITEM 3
    -1.5     0.422     0.953     0.899
     0.0     0.283     0.500     0.884
     1.5     0.167     0.047     0.658

(d) Item Information (Standard Error)

    THETA    ITEM 1          ITEM 2          ITEM 3
    -1.5     0.035 (5.35)    0.181 (2.35)    0.000 (183.77)
     0.0     0.038 (5.11)    1.000 (1.00)    0.010 (10.08)
     1.5     0.030 (5.75)    0.181 (2.35)    0.557 (1.34)

(e) Test Information

    THETA    Test Information    SEE
    -1.5     0.216               2.153
     0.0     1.048               0.977
     1.5     0.768               1.141

Question 2

Estimated normal-ogive (ENO) difficulty and discrimination parameters were calculated from the provided CTT statistics, such that ENO discrimination α′ = rbs / (1 - rbs²)^(1/2) and ENO difficulty β′ = z / rbs. These estimates are presented in the table below in the columns headed "ENO." Reported (i.e., given) estimates appear in the adjacent columns to the right of the ENO estimates, and the absolute differences between the two types of estimates appear to the right of the reported values.

    Item   ENO α′   Reported α   |α′ - α|   ENO β′   Reported β   |β′ - β|
      1    1.083    0.860        0.222       0.156    0.150       0.006
      2    0.799    1.161        0.362      -3.326   -2.441       0.884
      3    1.594    1.257        0.337       1.091    1.085       0.006
      4    1.510    1.165        0.345       0.268    0.262       0.006
      5    1.235    1.047        0.188      -0.544   -0.414       0.130
      6    1.033    0.928        0.105      -0.861   -0.800       0.061
      7    0.748    1.141        0.393      -3.776   -2.698       1.079
      8    0.520    1.040        0.520      -0.162   -0.096       0.066
      9    1.470    1.129        0.342       0.503    0.501       0.003
     10    1.024    0.829        0.194      -0.004   -0.008       0.004
     11    1.054    0.840        0.214       0.217    0.210       0.007
     12    1.461    1.160        0.301       1.110    1.101       0.009
     13    1.361    1.173        0.188       1.696    1.635       0.061
     14    1.310    1.110        0.200      -0.387   -0.374       0.013
     15    1.045    0.833        0.212       0.856    0.844       0.012
     16    1.033    0.819        0.214       0.325    0.319       0.006
     17    1.045    0.832        0.213       0.271    0.262       0.008
     18    1.432    1.103        0.329       0.670    0.667       0.003
     19    1.177    1.059        0.118      -0.758   -0.709       0.049
     20    1.400    1.120        0.280       1.179    1.165       0.014
     21    1.042    0.846        0.196       1.401    1.379       0.022
     22    1.348    1.050        0.298       0.867    0.862       0.005
     23    1.423    1.126        0.297       1.075    1.067       0.008
     24    0.924    0.779        0.145      -0.490   -0.472       0.019
     25    0.921    0.960        0.039      -1.728   -1.493       0.234
     26    1.118    0.927        0.191      -0.281   -0.276       0.004
     27    0.997    0.796        0.201       0.845    0.833       0.012
     28    1.391    1.111        0.280      -0.045   -0.048       0.003
     29    1.039    0.824        0.215       0.284    0.279       0.005
     30    1.475    1.135        0.340       0.326    0.323       0.003
                         M  = 0.249                     M  = 0.091
                         SD = 0.098                     SD = 0.248
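As a rough illustration of these conversions, the following R sketch computes ENO parameters from hypothetical CTT statistics. The p-values and biserial correlations shown are made up, and the normal deviate z is taken here as qnorm(1 - p); the sign convention should match whichever z the provided statistics assume.

# Minimal sketch of the ENO approximations, using made-up CTT statistics.
# Assumes z is the normal deviate for the item p-value, taken as qnorm(1 - p).
eno_params <- function(p, r_bs) {
  z     <- qnorm(1 - p)               # normal deviate corresponding to the p-value
  alpha <- r_bs / sqrt(1 - r_bs^2)    # ENO discrimination
  beta  <- z / r_bs                   # ENO difficulty
  data.frame(alpha = round(alpha, 3), beta = round(beta, 3))
}

# Hypothetical p-values and biserial correlations for three items
eno_params(p = c(.96, .50, .30), r_bs = c(.60, .55, .45))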
The average absolute difference for the discrimination parameters was M = .25 (SD = .098). A histogram of the absolute differences (see figure below) reveals that the disparities appear to be approximately normally distributed and centered symmetrically around the mean. Most (21) of the differences were below .30, indicating that, generally speaking, the two estimation methods produced relatively similar results; however, an average absolute difference of .25 is considerable in contexts where precision is of the utmost concern. I also investigated the distribution of the directional differences (i.e., directional difference = α′ - α) and found that 26 of the 30 items' estimated normal-ogive discrimination parameters were greater than their corresponding reported parameters, indicating that, for these data, the normal-ogive estimates tended to be larger than those obtained through traditional IRT analysis.

[Figure: Histogram of absolute discrimination differences (adiff). Mean = 0.2494, Std. Dev. = 0.09784, N = 30.]

The average absolute difference for the difficulty parameters was M = .09 (SD = .25). A histogram of these absolute differences (see figure below) reveals a highly skewed distribution (one we might expect for a distribution of absolute values). Overall, the normal-ogive estimates seemed to be quite close to their reported counterparts. A histogram of the directional differences between normal-ogive and reported estimates (β′ - β) provided very little new information: the differences were mostly (20 out of 30) positive or zero (i.e., the normal-ogive estimates were larger than the reported values), and only items 2 and 7 had a difference with a magnitude larger than .2. In particular, there was a substantial difference between the reported and ENO values for item 7 (reported β = -2.70, ENO β = -3.78), indicating that item 7 was poorly represented by the normal-ogive estimates.

[Figure: Histogram of absolute difficulty differences (bdiff). Mean = 0.0915, Std. Dev. = 0.24794, N = 30.]

Examination of the differences and the corresponding item statistics suggests that the normal-ogive approximation only holds for items that perform well in the current sample. For example, in estimating difficulty parameters the largest divergence occurred for items 2 and 7. It is no coincidence that items 2 and 7 are also the two items with the largest p-values (.96 and .98, respectively; in fact, they possess the only p-values greater than .90), suggesting that these items are far too easy to provide useful information about examinee ability. Since the p-value is used to calculate the z-score in the formula ENO β′ = z / rbs, it is logical that the ENO difficulty parameters will be affected by problematic p-values. Finally, the ENO discrimination parameter for item 8 (ENO α′ = .52) was quite different from the provided value (α = 1.04); the absolute difference between these two values (.52) was higher than that for any other item. Since the ENO α′ is a function of the item's biserial correlation (α′ = rbs / (1 - rbs²)^(1/2)), it follows that low biserial correlations may produce unstable ENO parameters. An item that fails to discriminate between examinees (as indicated by rbs; item 8's rbs = .40) will also be poorly represented by an ENO discrimination parameter.

(c-d) Sorting the items by p-values and then by IRT difficulty estimates (β) simply inverts the order of the items; that is, items with higher p-values have smaller βs. There are three exceptions to this rule: the β ordering of item pairs 24 and 25, 14 and 15, and 4 and 5 is reversed relative to their p-value ordering. This suggests that β values are directly related to p-values; however, because the IRT model controls for ability levels when estimating item parameters, the relationship is not a perfect one-to-one correspondence.

Sorting the items by biserial correlations (rbs) and then by IRT discrimination estimates (α) creates two somewhat different orderings. While some items hold almost exactly the same rank in both orderings, other items are ranked quite differently. In fact, a scatterplot of the rankings (see figure below) reveals that, for most items, the rankings in the two orderings are quite similar, if not identical. However, some items appear to be "outliers" in that their rankings differ considerably between the two orderings. The farthest outliers were items 2, 7, and 8 (which were identified earlier as poorly performing items).

[Figure: Scatterplot of Rankings. Ranking by biserial correlation (x-axis) plotted against ranking by IRT discrimination (y-axis).]
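As a rough sketch of how such a ranking comparison might be computed, the following R code uses made-up p-values, biserial correlations, and IRT estimates in place of the provided statistics.

# Sketch: comparing CTT-based and IRT-based item orderings (hypothetical values).
items <- data.frame(
  p     = c(.98, .75, .60, .45, .30),        # CTT difficulty (p-values)
  r_bs  = c(.40, .55, .62, .58, .50),        # CTT discrimination (biserial correlations)
  beta  = c(-2.7, -0.9, 0.1, 0.8, 1.5),      # IRT difficulty estimates
  alpha = c(1.10, 0.95, 1.20, 1.05, 0.90)    # IRT discrimination estimates
)

# Higher p-values should correspond to lower (more negative) betas
cor(items$p, items$beta, method = "spearman")

# Compare the two discrimination orderings item by item and plot the rankings
rank_rbs   <- rank(items$r_bs)
rank_alpha <- rank(items$alpha)
cbind(rank_rbs, rank_alpha)
plot(rank_rbs, rank_alpha, xlab = "Ranking by biserial", ylab = "Ranking by alpha")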
The rankings (ranking by rbs, ranking by α) for items 2, 7, and 8 were (3, 27), (2, 25), and (1, 14), respectively. Items 2 and 7 had very high p-values (.96 and .98). Item 8 had a low biserial correlation (.40). These extreme values affect item discrimination as estimated by the IRT model because the 2PL model does not treat discrimination and difficulty as unrelated values but estimates them simultaneously. In addition, the IRT analysis controls for examinee ability level, while the CTT statistics do not. (There were other items that appeared to have quite different rankings; however, their distances from the rest of the points were not as extreme as those of the items mentioned.) In general, the scatterplot suggests that there is more variation in ranking for items with the lowest and highest biserial correlations. Items with low biserial correlations may actually be discriminating, but only at lower ability levels (e.g., the difficulty estimates for items 2 and 7 were -2.44 and -2.70, respectively).

Question 3

(a) Differences

The first, and I think most important, difference between IRT and CTT is that parameter estimates from CTT are sample-dependent, while estimates from IRT are sample-independent. To explain further, when item parameter estimates (e.g., discrimination in the form of biserial correlations and difficulty in the form of p-values) are obtained in CTT, they change from sample to sample depending upon the trait level present in the sample. Therefore, two samples that differ greatly in their average level of ability (or trait) will produce very different item parameter estimates for the same test. IRT, however, takes the ability levels of examinees into account when item parameters are estimated. Since IRT essentially controls for ability level when estimating item parameters, these parameters are relatively invariant (I say relatively because minuscule fluctuations are expected and acceptable) across samples with different ability levels. While CTT can produce quite accurate parameter estimates when the sample used is large and heterogeneous with regard to ability level, determining whether or not the "best" sample has been obtained is theoretically impossible, which is a large disadvantage for CTT if accurate and invariant item parameter estimates are desired.

This brings us to a second difference between IRT and CTT: CTT focuses on the test as the unit of measurement (hence, "Classical Test Theory"), while IRT focuses on the test items (hence, "Item Response Theory"). While item parameters are important in the design and construction of a reliable test, items found to be acceptable via CTT analysis cannot stand alone. In other words, CTT does not produce a bank of items with known characteristics from which a test designer could build a test by combining items with the desired characteristics. CTT can produce a test (with a relatively stable set of parameter estimates, e.g., reliability coefficients, score norms, etc.) that can be used as a whole, but it cannot produce items (with relatively stable parameter estimates, e.g., difficulty, discrimination, etc.) that can be used interchangeably with other items to generate unique tests without harming the inferences that can be drawn from test scores.

A third difference between CTT and IRT is that CTT is unable to incorporate guessing into its measurement models.
For example, when examinees are presented with multiple-choice items, the probability of answering correctly by chance alone is one divided by the total number of response options; for an item with five options, 1/5 = .2. And this ignores the fact that, in many cases, respondents can accurately eliminate at least one incorrect answer, thereby increasing the probability of guessing correctly. IRT can provide test designers with estimates of guessing parameters (i.e., the probability of getting an item correct regardless of trait level) in the 3PL model. While an item's guessability would influence its p-value, a large p-value could also indicate that the item is easy because its difficulty is below the average trait level present in the sample of examinees. CTT cannot discriminate between these two cases; IRT can.

A last important difference between CTT and IRT is the concept of information. In IRT, a test designer can calculate how much information an item (or test) provides at a given level of ability. This enables designers to tailor tests to specific trait levels. For example, if I were designing a test to qualify students for a prestigious scholarship, I would want that test to provide the most information about examinees with above-average ability. So, thanks to IRT item information functions, I could select the items that provide the greatest amount of information at high trait levels and combine them to create a highly informative test. CTT does not provide estimates of information as a function of trait level. (A brief sketch illustrating the guessing parameter and the information function appears at the end of part (a).)

Similarities

Despite all of the previously listed differences between CTT and IRT, there remain several similarities. For example, both CTT and IRT have, as their proverbial "heart and soul," the goal of accurately measuring an individual's true score on a given construct of interest. The two methods do this in different ways, but, ultimately, they strive to accomplish the same feat. Both methods are concerned with providing reliable and valid estimates of examinee ability, and both methods by and large use some of the same basic mathematical constructs as a basis for doing so. Namely, both CTT and IRT models incorporate some form of item characteristic into the design of a test. Specifically, item discrimination and item difficulty form an important skeleton upon which test scores are fleshed out in both methods. How these two statistics are used varies across methods and models: discrimination and difficulty provide important information for CTT test design, while in IRT they are used both in design and in estimating true scores.

Lastly, CTT and IRT are quite similar in that both use measurement models to try to reproduce the response patterns found in the observed data. Both methods incorporate several different types of measurement models, more or less restrictive, that can be tested against data. These models tend to be used for different purposes: CTT models tend to be more confirmatory in nature, while IRT models are both confirmatory and prescriptive. That is, CTT models can indicate the degree to which the observed data structure fits certain assumptions about the test items or test forms (e.g., parallel, tau-equivalent, etc.), whereas IRT models are designed both to fit the data and to provide information about items that can be used in future item administrations.
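As promised above, here is a minimal sketch, in R, of the 3PL item response function and of item and test information, using made-up item parameters; with guessing g = 0, the 3PL reduces to the 2PL used in Question 1.

# Sketch: 3PL response function plus item and test information (hypothetical items).
p_3pl <- function(theta, a, b, g) g + (1 - g) / (1 + exp(-a * (theta - b)))

info_3pl <- function(theta, a, b, g) {
  P <- p_3pl(theta, a, b, g)
  a^2 * ((P - g)^2 / (1 - g)^2) * ((1 - P) / P)   # reduces to a^2 * P * (1 - P) when g = 0
}

theta <- c(-1.5, 0, 1.5)                                            # ability levels of interest
a <- c(0.8, 1.5, 1.2); b <- c(-0.5, 0, 1.0); g <- c(0.2, 0, 0.25)   # made-up item parameters

item_info <- sapply(seq_along(a), function(i) info_3pl(theta, a[i], b[i], g[i]))
test_info <- rowSums(item_info)    # test information at each theta
see       <- 1 / sqrt(test_info)   # standard error of estimation
round(cbind(theta, test_info, see), 3)

Items contribute the most information at the trait levels where they discriminate best, which is what allows a designer to target a test at, say, above-average ability.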
(b) In deciding whether to use CTT or IRT analyses, I would consider several factors, the most important of which is the purpose of the analysis. If the goal is to construct a test (or several tests) from fixed sets of items, and there is no intent to create an adaptive test, then I would use CTT. In addition, if I were interested in assessing the reliability (i.e., internal consistency, test-retest, or parallel forms) of the test, I could simply use CTT. I would have to be careful in my sampling procedure, since all CTT estimates are sample-dependent, but a CTT analysis would be sufficient to answer these types of questions. If I wanted to create a bank or pool of items that could be used to build future tests, or if I wanted to create an adaptive test, I would use IRT analysis; IRT can provide the necessary information to meet these analytical goals, and CTT cannot. Other considerations include time, money, and utility. The information provided by a CTT analysis may suffice for the purposes of the testing agency or agent, thereby allowing a decrease in the time and money often required to calibrate items correctly in IRT. This is not to say that designing a solid test with CTT is not demanding of resources; however, for small-scale assessments it is often the case that CTT analyses provide adequate test-level information, and one can avoid the extra effort of designing an item pool when one is not needed.

(c) Both logistic regression and IRT are concerned with modeling the probability of a given categorical outcome as a function of one or more predictor variables. For example, logistic regression could be used to calculate the probability of developing lung cancer as a function of the number of years spent smoking and average daily cigarette consumption. The dependent variable in this case is dichotomous (does or does not have lung cancer), and the independent or predictor variables are continuous. IRT could be used to calculate the probability of answering a particular item correctly as a function of examinee ability and item difficulty (this is the Rasch, or 1PL, model). In both the lung-cancer and item-response examples, the key interest in the analysis is the probability of a particular outcome. In both logistic regression and IRT, the outcome variable is therefore not normally distributed; the distribution used is the Bernoulli distribution. Since the outcome variable is not normally distributed, it cannot be modeled directly as a linear function of the predictors; a link function of some sort (e.g., the logit link) must connect the predictor variables to the outcome variable so that the relationship can be expressed linearly.

A last and important similarity, one that I think is pivotal to an accurate appreciation of IRT, is that logistic regression and IRT are not magical statistical models. This sounds superfluous, but I am trying to say that both analyses only attempt to model the observed patterns in the data, and in any model there is error. In both analyses, estimates of how far the model is from the actual data are obtained using -2 times the log-likelihood (-2LL), and -2LL can be used to assess comparative model fit, allowing alternative models to be tested against one another. In summary, it is helpful to remember how similar logistic regression and IRT actually are in both their usefulness and their limitations.
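As a minimal illustration of this shared machinery (Bernoulli outcome, logit link, -2LL), the sketch below fits an ordinary logistic regression in R to simulated data; the smoking variables and coefficients are entirely made up.

# Sketch: logistic regression on simulated data, showing the logit link, the
# Bernoulli outcome, and the deviance (-2 log-likelihood) used to compare models.
set.seed(1)
years  <- runif(200, 0, 40)                 # hypothetical years spent smoking
cigs   <- runif(200, 0, 30)                 # hypothetical cigarettes per day
eta    <- -4 + 0.06 * years + 0.05 * cigs   # made-up linear predictor (logit scale)
cancer <- rbinom(200, 1, plogis(eta))       # Bernoulli outcome

fit <- glm(cancer ~ years + cigs, family = binomial(link = "logit"))
summary(fit)
deviance(fit)   # -2LL of the fitted model; deviance differences compare nested models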
Of course, although these two techniques are very similar, there are some important differences. For instance, while both logistic regression and IRT predict categorical outcomes as a function of predictor variables, only IRT uses a latent predictor variable (i.e., theta, the trait or ability level); logistic regression uses only manifest predictor variables. Another important difference between logistic regression and IRT is that, while the terms in a logistic regression equation are additive (unless interactions are sought), the terms in IRT item response functions can be multiplicative (e.g., the 2PL) or even more complex (e.g., the 3PL). This takes many IRT models out of the family of generalized linear models, of which logistic regression is a member. A final difference between IRT and logistic regression follows from IRT's use of a latent predictor: because the levels of the latent variable must be established before the item parameters can be calculated, much larger samples are generally needed in IRT than in logistic regression. While large samples are almost always desirable in any statistical analysis, they are especially important in IRT because of the complexity of the parameter estimation process and the need to estimate the latent predictor, a step that is not necessary in logistic regression.