Research Report ETS RR–12-13 A Note on Explaining Away and Paradoxical Results in Multidimensional Item Response Theory Peter W. van Rijn Frank Rijmen August 2012 ETS Research Report Series EIGNOR EXECUTIVE EDITOR James Carlson Principal Psychometrician ASSOCIATE EDITORS Brent Bridgeman Distinguished Presidential Appointee Joel Tetreault Managing Research Scientist Marna Golub-Smith Principal Psychometrician Matthias von Davier Director, Research Shelby Haberman Distinguished Presidential Appointee Xiaoming Xi Director, Research Donald Powers Managing Principal Research Scientist Rebecca Zwick Distinguished Presidential Appointee John Sabatini Managing Principal Research Scientist PRODUCTION EDITORS Kim Fryer Manager, Editing Services Ruth Greenwood Editor Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS. A Note on Explaining Away and Paradoxical Results in Multidimensional Item Response Theory Peter W. van Rijn and Frank Rijmen ETS, Princeton, New Jersey August 2012 As part of its nonprofit mission, ETS conducts and disseminates the results of research to advance quality and equity in education and assessment for the benefit of ETS’s constituents and the field. To obtain a PDF or a print copy of a report, please visit: http://www.ets.org/research/contact.html Associate Editor: Matthias von Davier Reviewers: Sandip Sinharay and Frederic Robin Copyright © 2012 by Educational Testing Service. All rights reserved. ETS, the ETS logo, and LISTENING. LEARNING. LEADING., are registered trademarks of Educational Testing Service (ETS). Abstract Hooker and colleagues addressed a paradoxical situation that can arise in the application of multidimensional item response theory (MIRT) models to educational test data. We demonstrate that this MIRT paradox is an instance of the explaining-away phenomenon in Bayesian networks, and we attempt to enhance the understanding of MIRT models by placing the paradox in a broader statistical modeling perspective. Key words: multidimensional IRT, paradoxical results, explaining away, Bayesian networks i Acknowledgments The authors would like to thank Matthias von Davier, Shelby Haberman, and Bob Mislevy for helpful comments. ii Hooker, Finkelman, and Schwartzman (2009) addressed a paradoxical situation that can arise in the application of multidimensional item response theory (MIRT) models to educational test data. The paradox boils down to the fact that a correct response on an additional item can lead to a lower estimate for one of the latent ability variables, whereas an incorrect response can lead to a higher estimate (Van der Linden, 2012). Hooker et al. (2009) argued that this is unfair to test takers. Various different appearances, generalizations, and implications of the paradox have been studied by numerous authors over the past few years (Finkelman, Hooker, & Wang, 2010; Hooker, 2010; Hooker & Finkelman, 2010; Jordan & Spiess, 2012; Van der Linden, 2012). The stated paradoxical situation is related to the explaining-away phenomenon in Bayesian networks (Pearl, 2009; Wellman & Henrion, 1993), which in statistics is known as Berkson’s paradox (Berkson, 1946). In this report, we demonstrate that the MIRT paradox is an instance of this phenomenon, and we attempt to enhance the understanding of MIRT models by placing the paradox in a broader statistical modeling perspective, namely, that of graphical models and Bayesian networks (Mislevy, 1994; Pearl, 2009; Williamson, 2005). These frameworks provide a shorthand for the probabilistic relationships of interest and can help understand the properties of these relationships. We discuss a small number of MIRT modeling examples in these frameworks, illustrating the relation between the MIRT paradox and the explaining-away phenomenon, and we end with some concluding remarks. 1 Examples In the following examples, we will adhere to parametric IRT in the framework of generalized nonlinear mixed models (Mellenbergh, 1994; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003), and we will make additional assumptions as needed; that is, we do not make assumptions about the types of items (continuous or discrete; dichotomous or polytomous), the types of latent variables (continuous or discrete), and the response functions (linear, normal, or logistic). We assume that both item response variables and latent variables are random and that item response variables can be observed, whereas latent variables cannot. (Because we make as few assumptions as possible, standard linear 1 factor models are included here as well.) An important assumption in both unidimensional and multidimensional IRT models is monotonicity. Monotonicity requires the probabilities for the item variables to be strictly increasing or decreasing in each latent variable, and MIRT models are monotone if and only if the latent variables are compensatory (Holland & Rosenbaum, 1986; Van der Linden, 2012). Strictly speaking, we do not need to make the monotonicity assumption, but then a unidimensional IRT model for which local independence holds can always be specified for a set of item variables (Suppes & Zanotti, 1981). Therefore we need to keep the assumption of monotonicity and will illustrate other assumptions, such as local independence, through the examples. In all our examples, we have chosen to use six items to keep things simple yet nontrivial. Furthermore, we assume that the first five items are already observed so that the sixth item is always the focal additional item that possibly creates the paradoxical situation. Figure 1 displays a partially directed acyclic graph (DAG) of a MIRT model with two latent variables θ1 and θ2 and six item response variables X1 , X2 , . . . , X6 . (It is called partially directed because not all the lines in the graph have arrowheads. A partial DAG is also referred to as a chain graph.) This model is said to be of simple structure, also referred to as a between-item two-dimensional IRT model, because every item response variable is linked to a single latent variable only. In the graph, the nodes correspond to random variables, and the directed edges represent conditional dependency relations. An advantage of using graphical models is that there is a correspondence between the property of separation of the nodes in the graph and conditional independence of the random variables in the statistical model. For example, the path X1 ← θ1 → X2 in Figure 1 illustrates an instance of so-called d-separation (Pearl, 2009, pp. 16–17); that is, the only path from X1 to X2 runs through θ1 , and the arrows do not meet head to head at θ1 . The fact that X1 and X2 are d-separated in the graph implies that they are conditionally independent given θ1 . We can generalize this to all six items in the example, and obtain the familiar IRT assumption of local independence: the joint probability of X1 , X2 , . . . , X6 is conditional on θ1 and θ2 can � �6 be written as a simple product: Pr(X1 , X2 , . . . , X6 |θ1 , θ2 ) = 3j=1 Pr(Xj |θ1 ) j=4 Pr(Xj |θ2 ). Because of the correspondence between d-separation and conditional independence, it is 2 possible to determine all conditional independence relations that are entailed solely by working with the graph. Now, the MIRT paradox revolves around the beliefs about θ1 and θ2 in different situations. In describing the paradox, Hooker et al. (2009) always seemed to condition implicitly on X1 , X2 , . . . , X5 . Keeping this in mind, the MIRT paradox cannot arise for the model in Figure 1 because the only path between θ1 and θ2 is the undirected edge; that is, conditional on X1 , X2 , . . . , X5 , the additional observation of X6 does not affect the belief about θ1 in an unexpected manner. 3 θ1 X1 X2 θ2 X3 X4 X5 X6 Figure 1. Partially directed acyclic graph of two-dimensional IRT model with between-item multidimensionality Figure 1. Partially directed acyclic graph of two-dimensional item response latent variables (continuous or discrete), and the response functions (linear, normal, or logistic). theory model with between-item multidimensionality. We assume that both item response variables and latent variables are random, and that item Figure 2 shows the DAG of a two-dimensional IRT model for six items with so-called response variables can be observed whereas latent variables cannot. (Because we make as few within-item multidimensionality for items 3 and 4. In this figure, the paths θ1 → X3 ← θ2 assumptions as possible, standard linear factor models are included here as well.) An important and θ1 → X4 ← θ2 are so-called inverted forks and contain the first and foremost step of assumption in both unidimensional and multidimensional IRT models is that of monotonicity. explaining what happens in the MIRT paradox. These paths between θ1 and θ2 are not Monotonicity requires the probabilities for the item variables to be strictly increasing or decreas- blocked by X3 and X4 because the edges on these paths meet head to head. Therefore θ1 ing in each latent variable, and MIRT models are monotone if and only if the latent variables and θ2 are not d-separated by X3 and X4 , and conditional independence between θ1 and compensatory derimplied. Linden, We 2012;note Holland Rosenbaum, 1986). Strictly speaking, we given X3 and X4(Van is not that & this kind of conditional independence is θ2 are do not from need to make the monotonicity assumption, butcondition then a unidimensional IRT variables model for different that typically used in IRT because we here on observed a priori, they& instead on independence unobserved variables. if θ1 andforθ2aare which of local holds can Now, alwayseven be specified set independent of item variables (Suppes become dependent when wewecondition on Xthe . , X5 . Furthermore, the observation of X6 1 , . .assumption Zanotti, 1981). Therefore, need to keep of monotonicity and will illustrate can affect the belief about in anindependence unanticipated fashion. at first counterintuitive other assumptions such asθ1local through the This examples. Insight all our examples, we phenomenon is called the explaining-away effect. We refrain from giving substantive have chosen to use six items in order to keep things simple, yet nontrivial. Furthermore, we assume that the first five items are already observed, so that the sixth item is always the focal 3 additional item that possibly creates the paradoxical situation. Figure 1 displays a partially directed acyclic graph (DAG) of a MIRT model with two latent variables θ1 and θ2 , and six item response variables X1 , X2 , . . . , X6 . (It is called partially directed, examples to be concise and because intuitive examples of this phenomenon are described by many authors (e.g., Berkson, 1946; Bishop, 2006, p. 378; Hooker & Finkelman, 2010, p. 251; Pearl, 2009, p. 17). 5 θ1 X1 X2 θ2 X3 X4 X5 X6 Figure 2. Directed acyclic graph of two-dimensional IRT model with within-item structure Figure 2. Directed acyclic graph of two-dimensional item response theory model with within-item not implied. We notestructure. that this kind of conditional independence is different from that typically used in IRT, because we condition here on observed variables instead of on unobserved variables. We emphasize that this explaining-away phenomenon can arise as long as there is θ1 and θ2 are independent a priori, they become dependent when we condition on atNow, leasteven oneifinverted fork on the paths between θ1 and θ2 through X1 , X2 , . . . , X5 that does X1 ,depend . . . , X5 . on Furthermore, the observation affect θ1 in this an unanticipated 6 canθ2 withthe X6belief . Weabout illustrate by two other not the particular relation ofofθ1Xand fashion. This at first sight counterintuitive phenomenon is called the explaining away effect. We instances of the phenomenon. The first case is illustrated in Figure 3, in which the focal sixth variable is notsubstantive an item response the variable gender, gender is related to refrain from giving examplesbut in order to be concise, andwhere because intuitive examples Obviously, observing gender changes beliefe.g., about θ2 , but the belief θ2of. this 1 can phenomenon are described by many the authors, Berkson (1946), Bishopabout (2006,θp. 378),be affected in an unexpected manner owing to the inverted forks. Again, this dependency can Pearl (2009, p. 17), and Hooker and Finkelman (2009, p. 251). arise when θ1 and θ2 are a priori independent and when θ1 is unrelated to gender (as is We emphasize that this explaining way phenomenon can arise as long as there is at least the case in Figure 3). This example is particularly interesting because many applications one inverted fork on the paths between θ1 and θ2 through X1 , X2 , . . . , X5 , and does not depend of multidimensional IRT models with background variables are found in large-scale on the particular relation of θ1 and θ2 with X6 . We illustrate this by two other instances of assessments such as the Programme for International Student Assessment (PISA; Adams, the phenomenon. The first case is illustrated in Figure 3, in which the focal sixth variable Wilson, & Wang, 1997) and the National Assessment of Educational Progress (NAEP; is not an item (However, response, but gender is related θ2 . NAEP Obviously, Mislevy, 1985). we the notevariable that thegender, currentwhere MIRT models in PISAtoand have gender changes the about but theinstance belief about θ1 can be affected in an 2 , second aobserving between-item structure, as inbelief Figure 1.) θA can be constructed when we unexpected due to the inverted forks. Again,ofthis can arise θ1 andis relate gendermanner to an item response variable instead to adependency latent variable. Thiswhen situation given 4, where gender-related functioning appears on Figure the fifth θ2 areinaFigure priori independent and when θ1 differential is unrelateditem to gender (as is the case in 3). X5 as well as the belief about item. affects the beliefbecause about θmany 2 through This Observing example is gender particularly interesting, applications of multidimensional IRTθ1 models with background variables are found in 4 large scale assessments such as the Programme for International Student Assessment (PISA; Adams, Wilson, & Wu, 1997) and the National because of the inverted forks. To reiterate, paradoxical results in all these instances are not to be attributed to the focal sixth variable but to the inverted forks in other parts of the model. 6 θ1 θ2 6 X1 X θ12 X3 X4 Xθ25 Gender Figure 3. Gender and relation between Directed acyclic graph X of1 two-dimensional IRT X2 X3 model with X4 within-item X5 structure gender and θ2 Figure 3. Directed acyclic graph of two-dimensional item response theory model Figure 3. with within-item structure and relation between genderstructure and θ2 .and relation between Directed acyclic graph of two-dimensional IRT model with within-item θ θ 1 2 gender and θ2 X1 X θ12 X3 X4 Xθ25 Gender Figure 4. Gender Directed acyclic graph of IRT with and gender-related DIF X1two-dimensional X2 Xmodel X4within-item X5 structure 3 for X5 . Figure 4. Directed acyclic graph of two-dimensional IRT model with1985). within-item structure and gender-related DIF Assessment of Education Progress (NAEP; Mislevy, (However, we note that the current for X5 . MIRT models in PISA and NAEP have a between-item structure in Figuretheory 1.) A second Figure 4. Directed acyclic graph of two-dimensional item as response model with within-item structure gender-related differential item functioning for instance can of be Education constructed whenand we(NAEP; relate gender to an item response variable of acurrent latent Assessment Progress Mislevy, 1985). (However, we noteinstead that the Xvariable. 5. This situation given in Figure where gender-related item1.) functioning MIRT models in PISA isand NAEP have a4, between-item structuredifferential as in Figure A second (DIF) appears on the fifth item. Observing gendertoaffects the belief about θ2 instead throughofXa5latent , and instance can beand constructed when we relate gender an item response variable Hooker Finkelman (2010) considered the MIRT paradox in models for item the belief This aboutsituation θ1 as well because the inverted To reiterate, paradoxical in all variable. is given in of Figure 4, bifactor whereforks. gender-related differential itemresults functioning bundles. They focused on two models: the model and the testlet model. In the these arethe notitem to be attributed to the focal sixth variable, but to the inverted forks in (DIF)instances appears on fifth item. Observing gender affects the about θbundle X bifactor model, every loads on a general dimension andbelief on an item dimension. 2 through 5 , and other parts of theθ model. Hooker andabout Finkelman discussed cases, one inforks. which latent variables areresults assumed to the belief becausetwo of the inverted Toall reiterate, paradoxical in all 1 as well Hooker andand Finkelman (2010) considered the MIRT inbut models item bundles. bethese independent onetoinbe which the item bundle dimensions are correlated. Independent instances are not attributed to the focal sixthparadox variable, to thefor inverted forks in latent variables are model. typically to identify thethe bifactor which is the situation They models: assumed the bi-factor model and testlet model, model. In the bi-factor model, other focused parts ofon thetwo everyHooker item loads a general dimension and on an bundle dimension. Hooker Finkelman 5 theitem and on Finkelman (2010) considered MIRT paradox in models forand item bundles. (2010) discussed cases, one which all latent aremodel. assumed to be independent, They focused on two two models: theinbi-factor model andvariables the testlet In the bi-factor model, every item loads on a general dimension and on an item bundle dimension. Hooker and Finkelman that we consider. An example of the bifactor model is represented in a DAG in Figure 5. Hooker and Finkelman consider a result to be paradoxical if answering an additional item (X6 ) correctly results in a lower estimate for the general ability (θ1 ) than when the additional item is answered incorrectly. From Figure 5, it follows that θ1 and θ3 are not d-separated, that is, there are paths between θ1 and θ3 that contain an inverted fork (in fact, all paths do). Hence the explaining-away phenomenon can occur, and paradoxical results are possible for this bifactor model. Hooker and Finkelman (2010) derived mathematically the specific conditions under which paradoxical results occur for the more general bifactor model. From their mathematical derivations, it follows that paradoxical results are not possible when the loadings of the bifactor model are restricted according to the so-called testlet model (a testlet model is a restricted bifactor model; see Rijmen, 2010). The fact that paradoxical results cannot occur for the testlet model (with independent nuisance dimensions) can be shown directly by looking at the corresponding DAG, alleviating the need for mathematical derivations. First, one should realize that the testlet model is a Schmid–Leiman transformed second-order model (see, e.g., Yung, Thissen, & McLeod, 1999). Then, the conditional independence relations can be observed from the DAG of the equivalent second-order model, which is presented in Figure 6. In this figure, it is easily seen that θ1 and θ3 are always dependent because the path from θ1 to θ3 has a directed edge. However, θ1 is independent from X4 , X5 , and X6 is conditional on θ3 ; that is, conditional on θ3 , the observation of X6 does not change the belief about θ1 in an unexpected manner. Therefore, as long as monotonicity holds, paradoxical results cannot occur in this case. 2 Concluding Remarks We have shown that the MIRT paradox utilized by Hooker et al. (2009) is an instance of the explaining-away phenomenon. Specifically, the so-called inverted fork in the path between latent variables is the main cause of the phenomenon. In many of the MIRT paradox papers, intuitions are built up from an educational measurement perspective, which causes the result to be surprising. However, we made use of the frameworks of graphical models and Bayesian networks in which this phenomenon is well established. We chose 6 8 θ1 X1 X2 X3 X4 θ2 X5 X6 θ3 Figure 5. Directed acyclic graph of bi-factor three-dimensional IRT model. Figure 5. Directed acyclic graph of bifactor three-dimensional item response θ1 theory model. these frameworks because the conditional dependencies between the variables in a specific θ2 θ3 model can be derived directly from its graph, independent of different parameterizations and link functions. X1 X2 X3 X4 X5 X6 The work of Hooker et al. (2009) is nevertheless to be lauded because they described the exact mechanics of the paradox in MIRT in great Figure 6. detail. We disagree, however, with the Directed acyclic graph ofof second-order (or Spiess testlet) (2012) three-dimensional somewhat pessimistic conclusions Jordan and and Van IRT der model. Linden (2012) on the usefulness of MIRT models. The MIRT paradox is a general statistical paradox that Concluding remarks holds for many models with multiple competing explanatory variables and is accepted in many contexts other than psychometrics such as biostatistics and artificial intelligence. We We have shown that the MIRT paradox utilized by Hooker, Finkelman, and Schwartzman find that the issue of test fairness raised by Hooker et al. (2009) and Jordan and Spiess (2009) is an instance of the explaining away phenomenon. Specifically, the so-called inverted fork (2012) results from confounding different views on the purpose of tests. For example, in the path between latent variables is the main cause of the phenomenon. In many of the MIRT Holland (1994) distinguished between tests as contests and tests as measurement. The paradox papers, intuitions are built up from an educational measurement perspective, which contest view can result in a firm belief that more items correct should result in a higher cause the result to be surprising. However, we made use of the frameworks of graphical models score, a feature that nevertheless pertains to relatively few IRT models (Van der Linden, and Bayes in which thismodel phenomenon is well established. We chose these frameworks, 2012). In thenetworks measurement view, selection is perhaps the most important issue so the conditional between the variables in raised a specific can(1994), be derived thatbecause test-based inferencesdependencies are sound. A third view on tests, by model Mislevy suggests that tests can beindependent used as sources of information for evidentiary directly from its graph, of different parameterizations and link reasoning functions. about students, for example, as in models for cognitive diagnosis. Preventing paradoxical results 7 θ2 θ3 Figure 5. Directed acyclic graph of bi-factor three-dimensional IRT model. θ1 θ2 X1 X2 θ3 X3 X4 X5 X6 Figure 6. Directed acyclic graph of second-order (or testlet) three-dimensional IRT model. Figure 6. Directed acyclic graph of second-order (or testlet) three-dimensional Concluding remarks item response theory model. We have shown that the MIRT paradox utilized by Hooker, Finkelman, and Schwartzman might be relevant in the contest perspective on tests, but we argue that it is less relevant in (2009) is an instance of the explaining away phenomenon. Specifically, the so-called inverted fork the latter two perspectives on the purposes of educational tests. in the path between latent variables is the main cause of the phenomenon. In many of the MIRT paradox papers, intuitions are built up from an educational measurement perspective, which cause the result to be surprising. However, we made use of the frameworks of graphical models and Bayes networks in which this phenomenon is well established. We chose these frameworks, because the conditional dependencies between the variables in a specific model can be derived directly from its graph, independent of different parameterizations and link functions. 8 References Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23. Berkson, J. (1946). Limitations of the application of fourfold tables to hospital data. Biometrics Bulletin, 2, 47–53. Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer. Finkelman, M., Hooker, G., & Wang, J. (2010). Prevalence and magnitude of paradoxical results in multidimensional item response theory. Journal of Educational and Behavioral Statistics, 35, 744–761. Holland, P. W. (1994). Measurements or contests? Comments on Zwick, Bond and Allen/Donoghue. In Proceedings of the Social Statistics Section of the American Statistical Association (pp. 27–29). Alexandria, VA: American Statistical Association. Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics, 14, 1523–1543. Hooker, G. (2010). On separable tests, correlated priors, and paradoxical results in multidimensional item response theory. Psychometrika, 75, 694-707. Hooker, G., & Finkelman, M. (2010). Paradoxical results and item bundles. Psychometrika, 75, 249–271. Hooker, G., Finkelman, M., & Schwartzman, A. (2009). Paradoxical results in multidimensional item response theory. Psychometrika, 74, 419–442. Jordan, P., & Spiess, M. (2012). Generalizations of paradoxical results in multidimensional item response theory. Psychometrika, 77, 127–152. Lord, F. M. (1962). Cutting scores and errors of measurement. Psychometrika, 27, 19–30. Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115, 300–307. Mislevy, R. J. (1985). Estimating latent group effects. Journal of the American Statistical Association, 80, 993–997. Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 9 59, 439–483. Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). New York: Cambridge University Press. Rijmen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47, 361–372. Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185–205. Suppes, P., & Zanotti, M. (1981). When are probabilistic explanations possible? Synthese, 48, 191-199. Van der Linden, W. J. (2012). On compensation in multidimensional response modeling. Psychometrika, 77, 21–30. Wellman, M. P., & Henrion, M. (1993). Explaining “explaining away.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 287–292. Williamson, J. (2005). Bayesian nets and causality. Oxford: Oxford University Press. Yung, Y.-F., Thissen, D., & McLeod, L. D. (1999). On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64, 113–128. 10
© Copyright 2026 Paperzz