Figure 9.3 Evidence of local dependence and item bias

Item analysis
Some notes on the use of DIGRAM for item analysis
by graphical and loglinear Rasch models
Svend Kreiner
Dept. of Biostatistics
June 2001
Provisional incomplete version
1
Table of contents
1. Introduction
1.1.
SCD and DIGRAM
1.2.
Item analysis
1.3.
Two Examples
1.4.
Conditional independence
2. Item response models fitted by DIGRAM
2.1.
RM2: The Rasch model for Dichotomous items
2.2.
RMP: The Rasch model for polytomous items
2.3.
GLLRM: Graphical and loglinear Rasch models
3. Analysis of differential item functioning and local independence
3.1.
Differential item functioning and item bias
3.2.
Local dependence
4. Definition of items, score groups and exogenous variables
4.1.
Selecting items
4.2.
Defining score groups
4.3.
Selecting exogenous variables
4.4.
Displaying items, exogenous variables, scores and score groups
4.5.
Disposing items and exogenous variables
4.6.
An example
5. Interaction graphs for item response models
6. Tests for item bias and local dependency
6.1.
Generalized Mantel-Haenszel analysis of item bias
6.2.
Generalized Rosenbaum and Tjur tests of local independence
6.3.
Initial screening of marginal item relationships
7. Item analysis by Rasch’s model for dichotomous items
8. Item analysis by Rasch models for polytomous items
9. Item analysis by graphical and loglinear Rasch models
9.1.
Building an initial model by item screening
9.2.
Estimating and fitting GLLRM models
10. Item analysis by Rating Scale models
11. Criterion validity
12. Export to other IRT programs
References
2
1 Introduction
1.1 SCD and DIGRAM
DIGRAM is part of a larger statistical package, SCD, which when finished will contain
parts of two other programs in addition to DIGRAM: CATANA by Andersen (1993) and
STATPACK by Holst (1994). These notes only describe certain parts of DIGRAM.
The original version of DIGRAM (Kreiner, 1989) was a program dedicated to analysis of
high-dimensional contingency tables by block recursive graphical models. While graphical modeling is still important for DIGRAM focus has to some degree changed towards a
range of problems where conditional independence plays important roles, but where a
graphical model is not regarded as a full-fledged model, but rather as a non-parametric
skeleton on which proper models may be build. In addition to graphical modeling
DIGRAM now supports:
1) Analysis of collapsibility across categories in multidimensional contingency
tables.
2) MCA analysis of marginal and conditional homogeneity in multidimensional
contingency tables.
3) Non-parametric loglinear modeling of ordinal categorical data.
4) Analysis of multidimensional Markov Chains.
5) Item analysis by graphical and loglinear Rasch models (Kreiner and Christensen
(2001)
These notes will only be concerned with the item analysis. The user interface of SCD and
DIGRAM and the remaining types of analyses will be described elsewhere.
Sections 2 and 3 provides a little bit of background information on item analysis by
Rasch models while the remaining sections of these notes describe what you have to do to
make DIGRAM perform the analysis: Section 4 tells you how to set things up for an item
analysis. Section 5 deals with interaction graphs for IRT models. Section 6 discusses nonparametric analyses of differential item functioning and local dependence. Section 7 is on
the classical Rasch model for dichotomous items. Sections 8 and 9 describe the general
case of polytomous items and models of DIF and local dependence. Section 10 is on the
so-called Rating scale. Section 11 describes analysis of criterion validity – a somewhat
overlooked topic among Rasch modelers. Finally Section 12 tells you how to escape to
other programs where you may find some techniques and methods not provided by
DIGRAM.
1.2 Item analysis
The purpose of item analysis is to examine whether or not an index scale summarizing
responses to a finite set of items is a valid measure of an unobserved (latent) variable.
DIGRAM supports two different but related approaches to item analysis:
3
1) Analysis of item bias and differential item functioning (DIF)
2) Estimation and fit of Rasch type IRT models
In addition DIGRAM may be used for analysis of criterion validity by analysis of partial
and marginal relationships between the suggested index scale and and one or more
criterion variables.
1.3 Two examples
Two examples are used throughout these notes. The first originated in a study of Health
in Copenhagen County. We will here be concerned with the validity of one of the SF36
scales measuring physical functioning summarizing responses to the following ten items:
Does your health now limit you in these activities? If so, how much?
PF1) Vigorous activities
PF2) Moderate activities
PF3) Lifting or carrying groceries
PF4) Climbing several fligths of stairs
PF5) Climbing one flight of stairs
PF6) Bending, kneeling, or stooping
PF7) Walking more than a mile
PF8) Walking several blocks
PF9) Walking one block
PF10) Bathing or dressing yourself
To illustrate different types of IRT models we will look at two different version of the
index scale. The first summarizes responses from dichotomized items with response
categories
0 : Limited
1: Not limited
while the second uses items with three response categories
0 : Limited a lot
1: Limited a little
2: Not limited
The data for the second example was collected for a study of health behavior in Denmark.
The scale considered here counts the number of symptoms experienced within the past
4
six month. Only twelve items are considered here. Kreiner(1993) found that the twelve
items did not fit the Rasch model, but that the items belonged to two different symptom
dimensions as shown in Table 1.1. A later analysis (Kreiner and Christensen, 2001)
suggested that item responses might be fitted by a formally more complicated, but
conceptually simpler GLLRM. The examples here will illustrate some of the analyses
leading to the two different results.
Table 1.1 Symptoms included in two symptom indices
Items that were positively related
to age
Cough
Muscle Pain
Eyes
Dizziness, fainting
Back pain
Chest pain
Items that were negative related
to age
Nose and throat
Fever
Stomach
Indigestion
Rash
Mouth and gums
1.4 Conditional independence
Conditional independence plays an important role for Rasch models.
Two variables, X and Y, are conditionally independent given a set of conditioning variables, U,V and W, if the conditional distribution of X and Y given (U,V,W) factorizes as a
product of the conditional distributions of each of the variables:
P( X ,Y|U ,V ,W )  P( X |U ,V ,W )  P(Y|U ,V ,W )
(1.1)
For brevity we write X╨Y to indicate that variables X and Y are marginally independent
and X╨Y|U,V,W whenever conditional independence (1.1) holds. We refer to Lauritzen
(1996) for an exhaustive discussion of conditional independence and the role this concept
plays for graphical models.
For certain procedures testing assumptions of Rasch models we need to distinguish not
only between conditional dependence and independence, but also between positive and
negative conditional association. For this purpose we write X<+>Y |U,V,W to indicate
positive conditional association between X and Y given U,V,W and X<->Y|U,V,W to
indicate negative conditional association.
5
2 Item response models fitted by DIGRAM
DIGRAM estimates and tests a range of Rasch type IRT models:
1)
2)
3)
4)
Rasch’s model for dichotomous items (RM2)
Generalized Rasch models for polytomous items (RMP).
Graphical and loglinear Rasch models (GLLRM)
Rating scale models (RSM)
The frame of reference for these models are defined by the following types of variables:
Items:
A univariate latent variable:
A summary index scale:
A set of exogenous variables:
(Y1,..,Yk)

S = iYi
(X1,..,Xm)
Some but not necessarily all of the exogenous variables may be so-called criterion variables known to be correlated to the latent variable.
The exogenous variables are formally integrated together with items and the latent variable into the GLLRM framework but only treated informally in connection with RM2,
RMP and RSM models.
2.1 RM2: The Rasch model for dichotomous items.
Items are assumed to be coded as 0 or 1, where 1 indicates a response of a specific type,
such that the raw score, S, is equal to the total number of responses of the given type.
Responses to different items are assumed to
1) be conditional independent given 
2) depend on unknown item parameters, i,
such that the probability of the complete set of items is equal to
P( y1 ,.., yk | ) 
k
(  i ) yi
 1  
i 1
(2.1)
i
Replacing i by exp(i) and  by exp() (2.1) may be rewritten as
6
P( y1 ,.., yk |  ) 
e(  i   ) yi

i  
i 1 1  e
k
(2.2)
(2.1) and (2.2) are two different parametrizations of one and the same model1. We will
refer to the parameters of (2.1) as the multiplicative parameters and the parameters of
(2.2) as the logarithmic parameters. During DIGRAM’s item analysis both types of parameters will be calculated and reported. In the majority of situations, the parameters reported will be the multiplicative parameters.
The parameters of the model are not identified unless certain restrictions are imposed.
DIGRAM defines this restriction in the same way by setting
k

i 1
i
1
i
0
(2.3)
or equivalently
k

i 1
(2.4)
The raw score S = iYi is a minimal sufficient statistic for  in (2.1). The score
distribution, P(S|) is a so-called power series distribution,
 s s (  1 ,..., k )
P( S  s| ) 
 (  1 ,..., k )
(2.5)
where the -parameters are the elementary symmetric polynomials
 s (  1 ,..., k ) 


y1 ,.., yk :


   iyi 
 i

y s
(2.6)
i
1
(2.1) was the original parameterization suggested by Rasch (1960). (2.2) may be preferable because it
connects the Rasch models to random effect logistic regression models. A third and very popular
parameterization of the Rasch model replaces i with exp(-i) such that the simultaneous distribution of
item responses may be rewritten as
e(   i ) yi
P( y1 ,.., yk |  )  
  i
i 1 1  e
k
7
and
 (  1 ,.., k ) 
k

s 0
s
( 1 ,.., k )
(2.7)
(2.5) to (2.7) may of course be rewritten in terms of the logarithmic parameters, but
appear a bit less ungainly when they put down in terms of multiplicative parameters.
The probability of a positive response on a single item may be calculated given either the
latent variable or the total score as shown in (2.8) and (2.9):
P(Yi  1| ) 
P(Yi  1| S ) 
 i
1   i
(2.8)
 i s1(  1 ,.., i 1, i 1 ,.., k )
 s (  1 ,.., k )
(2.9)
We will refer to (2.8) as the values on the item characteristic curves (ICC) and to (2.9) as
the values on the conditional item characteristic curves (CICC). The difference between
ICC and CICC values will be fairly small when the number of items is large. For small
number of items there may however be considerable difference between the two sets of
item characteristic values. For model testing purposes observed frequencies of positive
responses to specific items in different score groups should of course be compared to the
CICC values end not to ICC values.
Following Andersen (1970, 73, 80, 90) DIGRAM calculates conditional maximum likelihood (CML) estimates based on the conditional distribution of item responses given
observed scores and conditional likelihood ratio tests (CLR) to test for homogeneity of
item parameters in different groups defined either by scores or by values of exogenous
variables.
Inserting the CML estimates into (2.5) as known item parameters we may obtain estimates of the person parameters for different scores by solving the following equations:
k
^
E ( S| s ) 
^t
 t  s  t (  1 ,.., k )
^
^
t 0
^
^
 (  1 ,.., k )
s
(2.10)
The estimates obtained by solving (2.10) for s=1,..,k-1 would have been maximum likelihood estimates had the item parameters been known.
8
2.2 RMP: The Rasch model for polytomous items.
The current version of DIGRAM assumes that all items have the same number of response categories. Responses are coded as 0, 1, ..., m-1 where m is the number of response
categories.
The polytomous Rasch models considered by DIGRAM generalizes the dichotomous by
retaining the assumption of a univariate latent variable a locally independent set of items
and a sufficient raw score. The conditional distribution of item response to items given
the latent variable is for this model given by
P( y1 ,.., yk | ) 
k

i 1

 iy  y
i
i

y
 iy y
k
s

i 1

iyi
(2.11)

where iy are multidimensional item parameters. To identify the parameters we need constraints corresponding to the constraints for the dichotomous model, setting i0 = 1 for all
items and i  im = 1.
(2.11) may of course be rewritten with logarithmic parameters as
P( y1 ,.., yk |  ) 
k

i 1
 iyi  yi
e

 iy   ) y
y
e

s  
e  i iyi

(2.12)
We refer to the model defined by (2.11) and (2.12) as a generalized Rasch model. The
model may be reparameterized as a so-called partial credit model (Masters, 19??). We
stress however that calling (2.11) and (2.12) a partial credit model is an interpretation of
the model and that generalized Rasch models may appear as a result of other response
processes than the one suggested by Masters (19??).
The generalized Rasch model has the same sufficiency property as the dichotomous
Rasch model. The raw score, S = iyi, is a minimal sufficient statistic for  with a power
series distribution corresponding to (2.5). The symmetric  and  functions of (2.5) for
the generalized Rasch model are computationally a little more difficult to handle but in
principle of the same kind as the functions defined by the dichotomous Rasch model. All
results pertaining to the dichotomous Rasch model concerning the conditional distribution of item responses given the total score, conditional estimation of item parameters,
estimation of person parameters, and conditional likelihood ratio tests of hypotheses for
9
item parameters consequently apply for the generalized Rasch model in exactly the same
way as for the dichotomous Rasch model.
2.3 GLLRM: Graphical and loglinear Rasch models
Graphical and loglinear Rasch models are discussed by Kreiner and Christensen (2001).
A loglinear Rasch model (Kelderman, 1994) is a Rasch model for which the assumptions
of local independence and no DIF has been somewhat relaxed. The model assumes that
the effect of exogenous variables on certain items and that associations among items are
uniform in the sense that strengths of associations do not depend on the latent variable.
The conditional distribution of items given the latent variable and the exogenous variables may under this assumption be seen as a loglinear model, where main effect depend
on the latent variables, but where interaction parameters are homogenous across different
levels of .
A loglinear Rasch model is a loglinear model of the conditional distribution of item responses, Y1,..,Yk given both a latent variable and a set of exogenous variables, X1,..,Xm.
The model assumes first that main effects depend on the latent variable in the same way
as for the generalized Rasch models and second that loglinear interaction parameters are
independent of the latent variable.
The assumption that interaction between items and exogenous variables are independent
of the latent variable implies that the conditional distribution of item responses can be
written as a loglinear model in the following way:
Let Ti be a subset of items and exogenous variables, e.g. Ti = {Ya, Yb, Yc, Xe, Xf, Xg}for i
= 1,...,r and let T = {T1,...,Tr).
A loglinear Rasch model with generating set S defines the conditional distributions of
item responses in the following way:

P( y1 ,.., yk | , x1 ,.., xm ) 
k
s
r
 
i 1
iyi

i 1
iti
(2.13)
where ti = (ti1,..,tir) is the observed responses to the items and exogenous variables included in Ti.
The parameters of (2.13) are multiplicative parameters. The model may of course be
redefined in terms of logistic parameters as
P( y1 ,.., yk |  ) 
 s
 

e  i yii  i1..r iti

10
(2.14)
where iy = ln(iy).
DIGRAM will only fit a limited set of loglinear Rasch models assuming nothing but twofactor interactions. A generator of one of the models addressed by DIGRAM will therefore either be:
1) generators of uniform local dependence defined by item pairs, Si = (Ya,Yb),
or
2) generators of uniform item bias defined by one item and one exogenous variable,
Si = (Ya,Xb).
If a wider range of models is required you will have to use the program LOGIMO by
Kelderman and ???? (19??).
Restraints are needed to identify the parameters of a loglinear Rasch model. We use the
same kind of restraints on the main effect item parameters as for the generalized Rasch
model and fix the LD- and DIF parameters by selecting one value of from the range of
each of the items and exogenous values as the reference value such that z,ref = ref,z = 1
for all z.
The raw score is a sufficient statistic for  in models defined by (2.13) and (2.14) as in
the generalized Rasch model. If the model only contains generators of local dependence
the conditional distribution of the score given  will be a power series distribution like
(2.5) where the -parameters are symmetric functions of both item parameters and
parameters for local dependence.
If the model also includes generators of item bias, the conditional distribution of the score
given  and the exogenous variables are conditional power series distributions
P( S  s |  , X  x) 
 s  s ( x)
( x )
(2.15)
where the symmetric -functions depend on X through the item bias parameters.
Finally a GLLRM is a loglinear Rasch model embedded in a larger multivariate graphical
model where the association between the latent and exogenous variables is a chain graph
model as defined by Lauritzen (1996) while the measurement component of the model is
a loglinear Rasch model as defined above. The arguments for embedding the measurement model in a graphical framework and the consequences of doing so are discussed by
Kreiner and Christensen (2001).
11
3 Analysis of differential item functioning and local dependence
3.1 Differential item functioning and item bias
Two assumptions has to be met if the requirement of no differential item functioning is
satisfied:
1)
2)
Items must be conditional independent of exogenous variables given the
latent variable measured by the index scale: (Y1,..,Yk) ╨ (X1,..,Xm)|
Items has to be conditional independent of exogenous variables given
the index scale: (Y1,..,Yk) ╨ (X1,..,Xm)|S
Note that both requirements must fulfilled for an index scale to be regarded as completely
free of differential item functioning. Notice also that the requirements are stated in terms
of the complete set of items and the complete set of exogenous variables. 1) and 2) however implies that separate items must be conditionally independent of separate exogenous
variables given both the latent variable and the index scale.
3)
4)
Items must be conditional independent of exogenous variables given the
latent variable measured by the index scale: Yi╨Xj|
Items has to be conditional independent of exogenous variables given
the index scale: Yi╨Xj|S
3) and 4) will be referred to as requirements of no item bias. Absence of DIF implies that
all items will be unbiased, but the fact that all items are unbiased according to definitions
3) and 4) does not however imply that there is no DIF. The definition of unbiased items
may nevertheless be used for very simple non-parametric tests for no DIF.
In the case where evidence suggest that some items are biased one of the purposes of the
item analysis will be to decide how to deal with the DIF. For this purpose we need additional concepts related to DIF and item bias all of which are applied by DIGRAM’s item
analysis.
3.2 Local dependence
It is a fundamental assumption underlying the majority of standard IRT models including
dichotomous and polytomous Rasch models that items are conditionally independent
given the latent variable:
P(Y1 ,..,Yk | ) 
k
 P(Y | )
i
i 1
(3.1)
If it is further assumed that item characteristic curves are monotonously increasing it
follows in general that
12
1) all item pairs are marginally positively correlated.
2) all item pairs are partially positively correlated conditionally given any
function of the remaining items.
For Rasch models sufficiency of subscores imply however that
3) items are pairwise conditionally independent given subscores including one
but not both items.
1) to 3) are used as the foundation of tests of local independence of items. If the assumptions of the Rasch models hold then it follows for all pairs of items that
Yi<+>Yi
(3.2)
Yi<+>Yi|Sij
(3.3)
Yi╨Yi|Si
(3.4)
Yi╨Yi|Sj
(3.5)
Tests based on (3.2) and (3.3) are discussed by Rosenbaum (1988). As there is no requirement of a particular strong positive association between items, only evidence of negative
association between items should be regarded as evidence against the Rasch model. We
notice in particular that negative partial association between two items, Yi<->Yi|Sij, can
be regarded as evidence of item bundles as discussed by Rosenbaum (1988).
Tests of (3.4) and (3.5) are generalizations of tests for three dichotomous items suggested
by Tjur (1982) and described by Kreiner and Christensen (2001).
The techniques used to test hypotheses (3.2) – (3.5) are the same as for tests of no item
bias. Withdrawing one item from the scale and treating it as an exogenous variable will
lead to hypotheses of no item bias equivalent to (3.4) and (3.5). While the reasons for local dependence may be very different from the reasons for item bias, there seems to be no
technical difference between analysis of DIF and analysis of local dependence.
4 Definition of items, score groups and exogenous variables
You need but three commands to define a complete framework for item analysis by
DIGRAM:
13
ITEMS variables
is used to select items and define the index scale
CUT <min cut points max>
is used to define score groups
EXOGENOUS variables is used to select exogenous variables
Each of these commands is described below
4.1 Selecting items
The ITEMS command designates some of the DIGRAM project variables as items of a
summary index scale and calculates the total score for all persons where none of the item
responses are missing. The items are generally taken as defined by the DIGRAM project,
but recoded such that the lowest level of the variables are assigned an item score of zero,
followed by scores increasing with 1 for each level of the variables.
All items must be coded such that the mean expected item score increases with increasing
values of the latent variable. If some items are inversely coded you may flip them by setting a minus before the variable label in the list of items. The coding is shown in the table
below for a hypothetical item a with three response categories:
Categories of variable A Item scores if A is selected
First category
0
Second category
1
Third category
2
Item scores if –A is selected
2
1
0
Having selected items you may for substantive or computational reasons want to flip all
items changing for instance an index scale counting numbers of correct responses to an
educational test into a scale counting the number of errors. One command takes care of
this little problem:
FLIPITEMS
Flips all items and recalculate the score. Score groups will have
to be redefined after items have been flipped as described in the
next section.
ITEMS ABCD followed by FLIPITEMS in this way defines the same index scale as
ITEMS –A-B-C-D
14
After items have been selected DIGRAM prints a list of the items and the distribution of
the index scale for all persons for which no item responses was missing. You may at any
time print the same list and the same distribution again using the SHOW IS command
described below.
4.2 Defining score groups
Some of the item analysis procedure requires that score groups have been defined. These
score groups may be defined by the CUT command in the following way.
The general setup for CUT requires three sets of parameters: a minimum score, one or
more cut points and a maximum score defining score groups as describe below.
Let n be the number of cut points. The minimum, maximum and cut points define the
following n+1 score groups.
Score group no. 1
Score group no. 2
…
Score group no. I
…
Score group no. n+1
= [minimum, cut point no. 1]
= [(cut point no. 1) + 1, cut point no.2]
…
= [(cut point no. i-1) + 1, cut point no. i]
…
= [(cut point no. n) + 1, maximum score]
4.3 Selecting exogenous variables
The EXOGEOUS command simply selects a set of exogenous variable as they are defined in the DIGRAM project. You can not flip them or change them in any other way.
Criterion variables may be included among the exogenous variables but there is no way
to tell the program to distinguish between ordinary exogenous variable and special
criterion variables.
4.4 Displaying items, exogenous variables, scores and score groups
You can use DIGRAM’s SHOW command to obtain information on the current setup for
item analysis:
SHOW I
prints a list of all items and exogenous variables together with a table
of items and mean item scores where items are sorted according to
increasing mean scores.
15
SHOW S
prints tables with the score the score group distributions
distributions.
The two commands may be combined into one, SHOW IS,
4.5 Disposing items and exogenous variables
If you for some reason want to get rid of the items and or exogenous variables you must
use the DISPOSE command:
DISPOSE I
disposes items
DISPOSE E disposes exogenous variables
The GLLRM model associated with the items and exogenous variables will be disposed
when items are disposed and reinitialized if only exogenous variables are disposed.
4.6 An example
This section illustrates what happens when items and exogenous variables are selected.
The scale considered here is the symptom subscale defined by the first six items (labeled
A to F) of Table 1.1.
ITEMS ABCDEF selects the items and produces the following output:
16
6 items: ABCDEF
Items belong to recursive block no. 1
Maximum item score =
Max score =
Min score =
1
6
0
Scoregroups with complete item patterns = 1 - 5
Test of internal homogeneity possible
*****
Mean item scores
*****
450 cases with
complete items
items
n
mean
mean
---------------------------------------B:MUSCLE P 450
0.24
0.24
E:BACK PAI 450
0.16
0.16
C:EYE PROB 450
0.09
0.09
D:DIZZINES 450
0.06
0.06
A:
COUGH 450
0.04
0.04
F:CHEST PA 450
0.04
0.04
Score distribution: 450 Cases
--------------------------------Score Count Percent Cumulative
------------------------------0
251
55.8
55.8
1
141
31.3
87.1
2
41
9.1
96.2
3
11
2.4
98.7
4
4
0.9
99.6
5
1
0.2
99.8
6
1
0.2
100.0
------------------------------Total
450
100.0
Mean
=
0.63
Missing =
0
ScoreGrp:
Figure 4.1 450
ItemsCases
and score distribution
-----------------------------Score Count Percent Cumulative
------------------------------0- 1
392
87.1
87.1
2- 6
58
12.9
100.0
------------------------------Total
450
100.0
Missing = 0
2 Exogeneous variables:
-----------------------X:
GENDER - 2 nominal categories.
level: 2
Y:
AGE - 3 ordinal categories.
level: 2
17
after selection of items
Recursive
Recursive
Score groups may be needed for some of the analyses of this scale. In this case we will
distinguish between persons with a low score (0-1 points) and persons with higher scores
(2-6 points). To define these variables we enter
CUT 1 defining score groups with score = 1 as the cut point. Figure 1.2 shows that the
definition was successful:
ScoreGrp:
450 Cases
-----------------------------Score Count Percent Cumulative
------------------------------0- 1
392
87.1
87.1
2- 6
58
12.9
100.0
------------------------------Total
450
100.0
Missing = 0
Figure 4.2 Score groups defined by CUT 1
Finally we want to use sex and age (labeled X and Y in the DIGRAM project) as
exogenous variables:
EXO XY
selects the variables and produces the output show in Figure 1.3.
2 Exogenous variables:
-----------------------X:
GENDER - 2 nominal categories.
Y:
AGE - 3 ordinal categories.
Recursive level: 2
Recursive level: 2
Figure 4.3 Exogenous variables
If you at any point need to recapitulate the definition of items, score groups and
exogenous variables you must use the SHOW IS command as described above. The
output produced by this command is the same as the output following definition of the
variables.
18
5 Interaction graphs for item response models
A graphical loglinear Rasch model is created and/or initialized when items and exogenous variables are selected. The initial model always assumes that items are locally independent and unbiased. The model may however be revised during the analysis by graphical
and loglinear models described in section 9 below.
The interaction graphs characterizing the GLLRM graphs may be shown in DIGRAM’s
graph module.
Click on the graph button on DIGRAM’s main window to open the graph dialog. The
default graph shown here is the project graph for the graphical model of the complete set
of variables. If items have been selected you will however be able to select one of the
IRT graphs showing the interaction graphs of the current GLLRM, that is either the graph
with or the graph without the raw score included.
Figure 5.1 shows the Graph window with the interaction graph without the score for a
scale defined by the first six symptom items with sex and age included as exogenous
variables.
Figure 5.1 The initial IRT graph for a scale defined by the first six symptom items.
19
The graph may be manipulated, printed and saved as described in the notes on managing
graphs in DIGRAM. Notice that the “Latent” button shows the graph without the score. If
you want to see the graph with both the latent variable and the score you must click on
the “score” button instead.
The labels used by DIGRAM for the latent variable and the score are
¤ : The latent variable
# : The score
DIGRAM’s graph module may also be used to draw ad hoc IRT graphs illustrating
different properties of this type of models. All the graphs shown in Kreiner and
Christensen (2001) were drawn by DIGRAM.
20
6 Tests for item bias and local dependency
DIGRAM supports two different ways of testing for item bias. The first is a generalized
Mantel-Haenszel analysis where partial gamma coefficients are used to test for conditional independence of items and exogenous variables given the total raw score. The second
uses conditional likelihood ratio tests of equality of item parameters estimated in different
subpopulations. This section only describes the generalized Mantel- Haenszel approach.
Testing for local dependency may also be done in two different ways. The first corresponds to the Mantel-Haenszel test for item bias in that hypotheses of conditional independence of items given scores on certain subscales are considered, while the second test parameters representing local dependence in specific item response models. Only the first
approach will be described in this section.
6.1 Generalized Mantel-Haenszel analysis of item bias
The item bias analysis of DIGRAM generalizes standard Mantel-Haenszel techniques for
22k tables using partial gamma coefficients rather than estimates of odds ratios to allow for ordinal items and exogenous variables with more than two categories.
Hypotheses of conditional independence may be tested by exact Monte Carlo tests as in
all other tests of conditional independence in DIGRAM2.
The BIAS command requires parameter determining whether or not tables should be
printed:
BIAS
calculate test statistics for conditional independence
BIAS T
prints all tables in addition to the test statistics
BIAS T pct
prints the table if test statistics are significant at a pct level
All pairwise relationships between items and exogenous variables are analyzed by the
BIAS procedure. Results are printed first for each separate variable and finally summarized for each exogenous variable. If score groups have been defined you will be asked
whether you want to test for item bias by stratifying according to the score groups in
addition to stratification to separate score values. Unless the score distribution is very
sparse one should only condition with score values.
Figures 6.1 to 6.3 present some of the item bias analysis for the items and exogenous
variables shown in Figure 5.1.
2
Use the EXA command to select Monte Carlo tests instead of tests relying on asymptotic p-values.
21
Tables will be printed if test
results are significant at 0.01
Analysis of item bias for A: COUGH
Scale : # - RawScore
Exogenous
X²
df asymp exact gamma asymp exact
---------------------------------------------------X: GENDER
3.0 4 0.557 0.749 -0.01 0.482 0.535 1000
Y:
AGE
2.7 8 0.952 1.000
0.11 0.336 0.328 1000
Figure 6.1 Results of tests for item bias for item A (coughing). Tests are performed as
exact Monte Carlo tests with 1000 randomly generated tables.
There is no evidence of item bias for item A as both 2-statistics and -coefficients are
insignificant. Remember that p-values for the -coefficients are one-sided even though
two-sided p-values would have been more appropriate because we are testing a model
rather than searching for evidence supporting a hypothesis of an expected association
between two variables. We are confident however that the user is able to evaluate
significance of both one-sided and two-sided p-values.
Figure 6.2 reports the results concerning item bias of item E, back pain. In this case evidence is disclosed of item bias relative to age ( = -0.39, p = 0.003). The negative -coefficient implies that the frequency of back pain decreases with increasing age even when
the total score is held constant. That this is actually the case can be seen in the table in
Figure 6.2, which was printed because the test result was significant. The table shows not
only the observed frequencies in strata defined by the score, but also the mean item
scores corresponding to the relative frequencies of a positive response when there are
only two response categories.
At the end of the analysis of item bias a summary for each of the exogenous variables is
printed pointed at the evidence of item bias disclosed relative to this variables. This
summary is shown in Figure 6.3. Strong evidence of item bias of item E was disclosed for
Age, while somewhat weaker evidence of item bias was found for B against gender.
22
Variables:
E: BACK PAI
Y:
AGE
#: RawScore
-
2 ORDINAL CATEGORIES
3 ORDINAL CATEGORIES
5 ORDINAL CATEGORIES
|
E
# Y |
0
1
Total Mean score
-------------------------------------1 1 | 64.3 35.7
56
0.36
1 2 | 78.0 22.0
59
0.22
1 3 | 88.5 11.5
26
0.12
-------------------------------------1 TOT | 74.5 25.5
141
0.26
-------------------------------------2 1 | 44.4 55.6
9
0.56
2 2 | 38.1 61.9
21
0.62
2 3 | 63.6 36.4
11
0.36
-------------------------------------2 TOT | 46.3 53.7
41
0.54
-------------------------------------3 1 |
100.0
1
1.00
3 2 | 20.0 80.0
5
0.80
3 3 | 60.0 40.0
5
0.40
-------------------------------------3 TOT | 36.4 63.6
11
0.64
-------------------------------------4 1 |100.0
1
0.00
4 2 |
100.0
2
1.00
4 3 |
100.0
1
1.00
-------------------------------------4 TOT | 25.0 75.0
4
0.75
-------------------------------------5 1 |
100.0
1
1.00
-------------------------------------5 TOT |
100.0
1
1.00
-------------------------------------Analysis of item bias for E: BACK PAI
Scale : # - RawScore
Exogenous
X²
df asymp exact gamma asymp exact
---------------------------------------------------X: GENDER
2.4 4 0.667 0.828
0.12 0.243 0.247 1000
Y:
AGE 14.4 8 0.072 0.085 -0.39 0.003 0.003 1000 ---
Figure 6.2 Tests for item bias for item E. The table was printed
because the test results are significant for Age.
23
Summary of bias with respect to X: GENDER
B: MUSCLE P Summary of bias with respect to Y: AGE
E: BACK PAI ---
Figure 6.3 Summary of significance of evidence suggesting item bias for each exogenous
variable. A minus indicates evidence of a negative partial association between an item
and an exogenous variable, while a plus indicates positive partial association. The
number of signs indicate the strength of the evidence (degree of significance)
6.2 Generalized Rosenbaum and Tjur test for local dependence
Tests for local dependence (LD) in Rasch models may be performed using exactly the
same generalized Mantel-Haenszel techniques used to test for no item bias. To invoke
this analysis enter
ROSENBAUM
Calculates test statistic for hypotheses (3.3) to (3.5) for
all pairs of items.
Figure 6.4 shows the results of tests of Rosenbaum and Tjur conditions for items D and F
and items E and F.
Positive local dependence is suggested for Dizziness (D) and Chest pain (F). Notice the
somewhat unusual role played here by exact conditional tests for the analysis of items D
and E. If we had relied on asymptotic p-values for the evaluation of significance we
would have been led to the conclusion that the partial correlation between D and F given
subscales from which one of these item had been removed were not significant.
In the second part of Figure 6.4 both Rosenbaum and Tjur conditions are met. The lack of
significance of the marginal association between back pain (E) and chest pain (F) does
not in itself suggest any problems for the Rasch model, as the Rosenbaum conditions only
state that marginal associations between items must not be negative.
24
Analysis of the DF association
HYPOTHESIS
X²
df asymp exact gamma asymp exact
----------------------------------------------------D&F|
27.6
1 0.000
0.83 0.012
D&F|Score\DF 18.6
5 0.002 0.026
0.80 0.058 0.001 1000
D&F|Score\D
12.9
4 0.012 0.043
0.75 0.085 0.018 1000
D&F|Score\F
7.5
4 0.113 0.099
0.72 0.092 0.013 1000
Analysis of the EF association
HYPOTHESIS
X²
df asymp exact gamma asymp exact
----------------------------------------------------E&F|
2.1
1 0.144
0.37 0.125
E&F|Score\EF
1.4
4 0.843 0.914
0.21 0.306 0.256 1000
E&F|Score\E
2.1
3 0.549 0.616 -0.03 0.473 0.464 1000
E&F|Score\F
2.3
3 0.505 0.678 -0.28 0.235 0.272 1000
Figure 6.4 Tests of Rosenbaum and Tjur conditions.
6.3 Initial screening of marginal item relationships.
The tests for differential item functioning and local dependence has been combined into a
simpler procedure for screening of marginal item relationships as an initial step towards a
GLLRM. The procedure starts by looking at marginal unconditional relationships, continues to test for DIF and violations of the Tjur conditions and finally generates an initial
GLLRM with the smallest number of generators of DIF and LD to explain the significant
test results disclosed during the screening.
Item screening is invoked by the SCREEN I command. The details of the screening procedure and the output will be described in Section 9.1.
6.4 Concluding remarks concerning tests for DIF and local dependence.
A large number of statistical tests will be calculated and evaluated during analyses of
item bias and local dependence. You should therefore be aware of the risk of false significant test results typically of all kinds of multiple testing procedures. Only strong evidence of either DIF or LD with p-values close to zero should therefore be taken as evidence of failure of the simple Rasch models. If other tests suggest that the model does not
fit the data, weak evidence will be useful as indicators of what is wrong with the model.
25
7 Item analysis by Rasch’s model for dichotomous items
Item analysis by Rasch’s model for dichotomous items may be invoked using the
RASCH command. The command may be given with or without parameters:
RASCH
estimates and compares item parameters in different score
groups and permits calculation of additional information by
which you may evaluate the adequacy of the Rasch model.
Score groups must be defined for this command to work.
RASCH variable
estimates and compares item parameters in different groups
defined by the values of the variable. The variable must be
defined as an exogenous variable.
RASCH *
estimates and compares item parameters first for score
groups and second for groups defined by all the exogenous
variables. The output generated for each separate analysis
is exactly the same as for the separate analyses of score
groups and exogenous variable. An additional table summarizing the results of the tests for homogenous item responses
in different groups is printed at the end of the analysis.
The output generated by Rasch’s item analysis is presented and discussed below. Notice
that persons with minimum and maximum scores are completely disregarded during the
item analysis by Rasch’s model for dichotomous items as these persons provide no
information whatsoever about the item parameters.
Figures 7.1 – 7.9 show output presented during an analysis of homogeneity of item parameters in different score groups following the RASCH command without parameters.
The first to be reported (Figure 7.1) is the CML estimates of the multiplicative item
parameters and corresponding elementary symmetric functions used to calculate the score
distribution and the conditional distributions of items given observed scores (Formulas
(2.5) – (2.8)).
After item parameters have been calculated and before results for specific score groups
are presented you will asked whether or not you want to
-
see tables of item characteristic values and item fit statistics
see idealized test-retest tables that may give some idea about the reliability of the
scores
test for multidimensionality
test for local homogeneity
Figures 7.2 – 7.5 show the output generated by each of these procedures.
26
+----------------+
|
|
| All scoregoups |
|
|
+----------------+
Parameter estimates
Item param. obs
----------------A
C
D
B
E
F
0.454
1.107
0.591
3.924
2.135
0.402
19
41
24
107
69
17
Score
Gamma
-------------------0
1.00
1
8.61
2
26.14
3
36.14
4
24.57
5
8.01
6
1.00
Figure 7.1 CML estimates of multiplicative item parameters
and values of the symmetrical gamma function.
Two types of item characteristic curves may be calculated based on the estimates of the
item parameters in Figure 7.1:
-
Ordinary item characteristic curves (ICC) show the probabilities of positive responses to specific items for given values of the person parameters.
Conditional item characteristic curves (CICC) show the conditional probability of
positive responses to specific items given the total score on the complete set of
items.
Both sets of item characteristic values are calculated when item characteristic curves are
requested. CICC values are calculated for scores from 1 to k-1 while ICC values are calculated for 1,..., k-1, such that E(S|t) = t for t = 1,...,k-1. In addition to ICC and CICC
values based on the estimated item parameters, the observe frequency of positive responses for each item is counted and reported for each score group together with standardized
residuals comparing the observed counts with the ‘expected’ CICC values.
Figure 7.2 shows the results for the first of the six items.
27
Item A
Parameter =
0.454 (logistic =
-0.789)
THETA LNTHETA SCORE
ICC
CICC
OBS
N
RES
------------------------------------------------------0.159 -1.83959
1 0.0673 0.0528 0.0567 141
0.21
0.452 -0.79304
2 0.1705 0.1418 0.1463 41
0.08
1.021 0.02054
3 0.3169 0.2821 0.2727 11 -0.07
2.268 0.81902
4 0.5075 0.4798 0.2500
4 -0.92
6.198 1.82428
5 0.7380 0.7252 1.0000
1
0.62
------------------------------------------------------Molenaar´s U =
CICC gamma
=
0.224
p =
0.8226 cut points: 1 & 2
-0.065
p =
0.6699
(two-sided)
Figure 7.2. Item characteristic values and item fit statistics for item A. Both
multiplicative and logistic parameters are presented.
Figure 7.2 include two item fit statistics comparing the observed and expected CICC
values. Molenaar’s U is defined as
a
u
k 1
 ress   ress
s1
s b
(7.1)
a kb
where a and b are two cut points partitioning the score range into low and high scores
each with approximately one third of the sample with item-informative scores. A positive
u-value indicates that the observed CICC is more horizontal than one would expect if the
Rasch model fitted the data.
The CICC gamma coefficient is an alternative coefficient based on Goodman and
Kruskall’s gamma coefficient. A negative CICC gamma corresponds to a flat CICC
curve while a positive CICC gamma indicates that the observed CICC is steeper that one
would expect from a Rasch model.
In this case the residuals and the item fit statistics did not indicate any problems for the
Rasch model.
Having presented the item characteristic values DIGRAM asks whether you wantto see
estimated test-retest score distributions assuming that repeated independent measurements were feasible. The results are summarized in three tables included in Figure 7.3.
Assume that measurements have been repeated in such a way that the two scores are
conditional independent given the latent variable. Let S1 and S2 be the two scores and
28
define a total score as S = S1 + S2. The tables in Figure 7.3 describe different features of
the conditional distribution of (S1, S2) given S:
P( S1 , S2| S ) 
 s (  1 ,.., k ) s2 (  1 ,.., k )
 s (  1 ,.., k , 1 ,.., k )
1
(7.2)
We refer to the set of (S1, S2) values corresponding to a given value of S as an orbit and
to the set of conditional probabilities (7.2) for the given s values as the set of orbit
probabilities.
The first table of Figure 7.3 shows the estimates of the orbit probabilities based on the
CML estimates of Figure 7.1.
The diagonal of the tables in Figure 7.3 is referred to as the ridge. Cells lying close to the
ridge indicate measurements with good agreement between the first and the second measurement whereas cells far from the ridge indicate mismatch. In order to evaluate whether
the mismatch should be regarded as significant, cumulative probabilities calculated along
orbits starting at the edge of the table and ending on the ridge has been calculated and
printed in the second table.
The final table of Figure 7.3 shows the same cumulative probabilities as the second table
except that probabilities greater than 0.10 have been suppressed to emphasize areas where
the difference between the first and the second measurement may be taken as evidence
that the underlying latent variables are different.
That tables of Figure 7.3 and in particular the last table may serve as a qualitative description of the reliability of the test delineating how close test-retest results can be expected to be. The last table may also be used as means for evaluating whether or not some
kind of change has happened between two times where the questionnaire containing the
items was administered. A score of 5 on the first test and 1 on the second suggest that the
value of the latent variable has significantly changed from the first to the second measurement, because the cumulative orbit probability is small (p = 0.026). Finally the problem
of comparing two different persons is formally the same as the problem of comparing two
repeated measurements. If one person has had zero symptoms while another has had four
the conclusion will be that the first person has a significantly better health than the other
(p = 0.016).
Test-retest tables are something that you would probably only need when you have decided that the Rasch model fits the item responses. Orbit probabilities appear however
again in connection with the analysis of homogeneity of subscales described below.
29
Test-retest probabilities
Second test
6 | 0.000 0.004 0.021 0.078 0.217 0.500 1.000
5 | 0.003 0.025 0.095 0.234 0.422 0.566 0.500
4 | 0.018 0.091 0.235 0.401 0.489 0.422 0.217
3 | 0.069 0.230 0.406 0.478 0.401 0.234 0.078
2 | 0.207 0.431 0.504 0.406 0.235 0.095 0.021
1 | 0.500 0.587 0.431 0.230 0.091 0.025 0.004
0 | 1.000 0.500 0.207 0.069 0.018 0.003 0.000
------------------------------------------- First
test
0
1
2
3
4
5
6
Cumulative Test-retest probabilities
Second test
6 | 0.000 0.004 0.021 0.078 0.217 0.500 1.000
5 | 0.003 0.026 0.099 0.256 0.922 0.783 0.500
4 | 0.018 0.094 0.261 0.901 0.744 0.922 0.217
3 | 0.069 0.248 0.906 0.739 0.901 0.256 0.078
2 | 0.207 0.931 0.752 0.906 0.261 0.099 0.021
1 | 0.500 0.793 0.931 0.248 0.094 0.026 0.004
0 | 1.000 0.500 0.207 0.069 0.018 0.003 0.000
------------------------------------------- First
test
0
1
2
3
4
5
6
Extreme test-retest probabilities
Second test
6 | 0.000 0.004 0.021 0.078
5 | 0.003 0.026 0.099
4 | 0.018 0.094
3 | 0.069
0.078
2 |
0.099 0.021
1 |
0.094 0.026 0.004
0 |
0.069 0.018 0.003 0.000
------------------------------------------- First
test
0
1
2
3
4
5
6
Figure 7.3 Test retest tables based on CML estimates of item parameters
30
The next question asked by DIGRAM is whether or not you want an analysis of local
homogeneity.
Item responses are heterogeneous if estimates of item parameters are significantly different when compared across all score groups. In certain situations item responses may
appear to be locally homogenous in the sense that item parameters seems to be the same
when compared across a smaller subset of score groups. This situation may for instance
turn up if items fit so-called mixed Rasch models (Rost, 1988) where item parameters are
different in two or more latent classes where persons from one group typically have low
scores while persons from other classes have higher scores.
The purpose of an analysis of local homogeneity is to determine a set of cut points partitioning the score range into a small subset of score groups such that item responses are
homogenous within score groups, but heterogeneous across score groups.
Analysis of local homogeneity may be regarded as a stepwise model search procedure. In
each step two things happen:
1) All adjacent score groups are compared using conditional likelihood ratio
test.
2) If differences between item parameters seems to be insignificant for at least
two score groups, score groups may be collapsed following which the analysis proceeds to the next step. If item parameters are significantly different
the analysis stops.
Note that the decision to collapse score groups or to stop the analysis is completely up to
the user.
Figure 7.4 shows the results for the six symptoms.
DIGRAM first displays the distribution across scores and asks whether some of these
scores should be collapsed before the analysis starts. In this case scores 3-5 are collapsed
because the number of cases with high scores is too small to make estimation meaningful.
The first step compares score groups 1 and 2 and score groups 2 and 3-5. It turns out that
hypotheses of identical item parameters can be accepted in both cases. Because of the
small number of cases with scores from 3 to 5 we collapse score groups 2 and 3-5 and
proceed to the next step where score group 1 is compared to score group 2-5. Item responses still seem to be homogenous. The result of this analysis therefore suggest that
item responses are not only locally, but also globally homogenous.
31
Search for local homogeneity
5 scoregroups will be collapsed
1:
2:
3:
4:
5:
12345-
1
2
3
4
5
- 141 cases
- 41 cases
- 11 cases
4 cases
1 cases
Search for local homogeneity
3 scoregroups will be collapsed
1:
2:
3:
123-
1 - 141 cases
2 - 41 cases
5 - 16 cases
Scoregroups ready
Testresults
Scoregroups
Scoregroups
1 & 2
2 & 3
0.7
3.0
5
5
0.985
0.708
Scoregroups 2 - 3 will be collapsed
Search for local homogeneity
2 scoregroups will be collapsed
1:
2:
12-
1 - 141 cases
5 - 57 cases
Testresults
Scoregroups
1 & 2
1.5
5
0.910
Figure 7.4 Stepwise analysis of local homogeneity
32
Following the analysis of local homogeneity you will be whether or not you want an
analysis of homogeneity across subscales.
DIGRAM addresses only the case of two subscales, assuming that items are defined in
the correct order with all the items belonging to one subscale appearing first and all the
items of the other subscale last.
DIGRAM consequently starts by asking how many items you want to include in the first
subscale. In the example here (Figures 7.5 and 7.6) the items have been split into two
subscales, each with three items.
Output from the analysis of homogeneity of subscales consists of tables (Figure 7.5)
showing
1) first the observed joint distribution of subscale,
2) second the expected orbit distributions, nsP(S1, S2| S=s),
3) third standardized residuals,
and estimates and test results of Per Martin Löfs test of homogeneity.
Evidence against homogeneity of subscales are usually expected to be particular strong, if
item responses depend on more than one latent variable. Other types of departures from
the assumptions of the Rasch model may however also generate evidence against homogeneity of subscores. For a significant result of Löf’s test to be taken as an indication of
multidimensionality we would also require that residuals are (significantly) large at the
boundary of the joint distribution and small at the ridge.
In this example the analysis of homogeneity of subscales S1 = A+B+C and S2 = D+E+F
do not uncover any evidence against the Rasch model
33
Analysis of homogeneity across subscales
Score1 : ACD
Score2 : BEF
Observed
Score2
^
3 |
1
1
|
2 | 16
8
4
|
1 | 104 25
2
|
0 |
37
|
-----------------> Score1
0
1
2
3
Expected
Score2
^
3 | 1.02 1.18 0.60
|
2 | 16.96 7.08 2.51 0.40
|
1 |105.77 21.81 2.80 0.31
|
0 |
35.23 2.24 0.09
|
-------------------------> Score1
0
1
2
3
Standardized residuals
Score2
^
3 | -0.03 -1.29 0.82
|
2 | -0.30 0.58 1.54 -0.82
|
1 | -0.34 1.00 -0.56 -0.58
|
0 |
0.34 -1.54 -0.30
|
-------------------------> Score1
0
1
2
3
Table 7.5 Distribution across subscales.
34
Subscale 1 parameters
Item Before After
-----------------A
0.68
0.69
C
1.66
1.62
D
0.89
0.89
Subscale 2 parameters
Item Before After
-----------------B
2.62
2.58
E
1.42
1.41
F
0.27
0.28
Test for homogeneous subscales
Z =
DF =
P =
10.4
6
0.107
Figure 7.6 Estimates of parameters and Per Martin Löfs test
for homogeneity across sub scales
35
+------------------+
|
|
| Scoregroup no. 1 |
|
|
+------------------+
Item residuals
OBS
EXP RESIDUAL
--------------------A
8
7.4
0.211
C 20 18.1
0.473
D
9
9.7
-0.224
B 63 64.2
-0.210
E 36 34.9
0.205
F
5
6.6
-0.630
+------------------+
|
|
| Scoregroup no. 2 |
|
|
+------------------+
Item residuals
OBS
EXP RESIDUAL
--------------------A 11 11.6
-0.194
C 21 22.9
-0.532
D 15 14.3
0.215
B 44 42.8
0.388
E 33 34.1
-0.293
F 12 10.4
0.565
Figure 7.7 Analysis of residuals within score groups
36
+-----------------------------------------+
|
|
| EBA`s conditional likelihood ratio test |
|
|
+-----------------------------------------+
CLR
DF
P
=
=
=
1.5
5
0.910
Logarithmically scaled parameters
#
Item
Total
1
2
------------------------------------A:
COUGH
-0.79
-0.69
-0.86
C:EYE PROB
0.10
0.23
-0.04
D:DIZZINES
-0.53
-0.57
-0.48
B:MUSCLE P
1.37
1.38
1.46
E:BACK PAI
0.76
0.82
0.68
F:CHEST PA
-0.91
-1.16
-0.76
#:2
A F
D
C
E
B
#:1 F
AD
C
E
B
-1.16---------------------------------------Total
F A
D
C
E
B
1.46
Figure 7.8 Conditional likelihood ratio test of item homogeneity across score groups and
presentation of estimates of logarithmic item parameters
37
Parameter estimates - homogenous item
parametres
Item param. obs
----------------A
C
D
B
E
F
1.000
1.000
1.000
1.000
1.000
1.000
Score
Gamma
-------------------0
1.00
1
6.00
2
15.00
3
20.00
4
15.00
5
6.00
6
1.00
19
41
24
107
69
17
+-----------------------------------------+
|
|
| EBA`s conditional likelihood ratio test |
|
|
+-----------------------------------------+
CLR
DF
P
=
=
=
154.2
5
0.000
Figure 7.9 Conditional likelihood ratio test of equal item parameters
+----------------------------------+
|
|
| Summary of Rasch´s item analysis |
|
|
+----------------------------------+
z
df
p
--------------------------------Score groups
1.5
5
0.910
X:
GENDER
4.2
5
0.518
Y:
AGE
12.3
10
0.265
7.10 Summary of tests for homogeneous item parameters across score groups
and groups defined by gender and age
38
8 Item analysis by generalized Rasch models for polytomous items
DIGRAM does not distinguish between generalized Rasch models for polytomous items
and the more general family of graphical and log linear Rasch models. A Generalized
Rasch model is simply a GLLRM with an empty set of DIF and LD generators. To estimate and fit generalized Rasch models you have to use the same procedure for fitting and
testing graphical and loglinear Rasch models. This procedure and the dialog needed for
setting up the PRM and GLLRM analysis is described in Section 9.2 below.
9 Item analysis by graphical and loglinear Rasch models
9.1 Building an initial model by item screening
Item screening is an initial analysis of marginal and conditional item relationships that
you may use to generate an initial GLLRM.
Item screening proceeds in the following way:
1) Analysis of marginal relationships among
a. Pairs of items
b. Separate items and rest scores
c. Items and exogenous variables
d. Exogenous variables and total scores
2) Test of Tjur conditions and no item bias
3) Stepwise model building of a minimal model explaining all the evidence of local
dependence disclosed by tests of Tjur conditions.
4) Stepwise model search for origins of item bias.
The end result of the item screening is an initial model that may serve as the starting point
for a proper search for a GLLRM based on conditional maximum likelihood estimates of
item, DIF and LD parameters and conditional likelihood ratio tests of hypotheses according to which some of the DIF and LD parameters are assumed to vanish.
This section describes the initial item screening, while Section 9.2 describes the analysis
by GLLRM.
We illustrate the item screening procedure with an analysis of a subscale consisting of the
following five3 SF36 items:
3
Note that the subscales evades some of the obviously locally dependent items of the complete SF36 scale
in order to define a scale supported by a hopefully simple GLLRM.
39
B: PF2 Moderate activities
C: PF3 Lifting or carrying groceries
D: PF4 Climbing several fligths of stairs
G: PF7 Walking more than a mile
J: PF10 Bathing or dressing yourself
Sex and Age are included in the analysis as exogenous variables.
The total score is meant to be taken as a measure of physical function. The way items are
coded here implies that persons with higher scores have better physical function than
persons with low scores.
Figure 9.1 shows the marginal  coefficients for the items, the rest scores, Si = S – Yi, and
the exogenous variables.
*** Screening of marginal item relationships ***
B
C
D
G
J
score
-----------------------------------------------------------B Mod.act Gamma
0.971 0.906 0.919 0.903 0.941
p
0.000 0.000 0.000 0.000 0.000
C Liftgroc Gamma
p
0.971
0.000
0.886
0.000
D
Stair2+ Gamma
p
0.906
0.000
0.886
0.000
G
Walk 1m Gamma
p
0.919
0.000
0.887
0.000
0.907
0.000
J
Bathing Gamma
p
0.903
0.000
0.928
0.000
0.851
0.000
0.887
0.000
0.928
0.000
0.927
0.000
0.907
0.000
0.851
0.000
0.876
0.000
0.860
0.000
0.887
0.000
0.860
0.000
0.864
0.000
Exogeneous variables
K
SEX Gamma
p
-0.366 -0.440 -0.155 -0.170 -0.033 -0.260
0.000 0.000 0.014 0.021 0.385 0.000
L
AGE Gamma
p
-0.597 -0.571 -0.541 -0.546 -0.578 -0.481
0.000 0.000 0.000 0.000 0.000 0.000
Figure 9.1 Marginal -coefficients calculated during item screening.
40
Remember that p-values for -coefficients are always one-sided. There are but few surprises in Figure 9.1. All items are strongly positively associated to each other and to the
rest score. Mobility as measured by SF36 seems to be negatively correlated to age. Sex is
coded 1 for men and 2 for women. The negative -coefficients for sex implies that physical function is better among men than among women. The only noteworthy finding in
Figure 9.1 is that the association between items and sex appear to be somewhat weaker
for the last three items than for the first two and that walking 1 mile (item G) does not
seem to be related to gender at all.
Figure 9.2 shows the partial -coefficients coefficient testing Tjur conditions and item
bias.
*** Screening of partial item relationships ***
Generalized Tjur conditions - the row item has been
subtracted from the score
B
C
D
G
J
----------------------------------------------------B Mod.act Gamma
0.692 0.253 0.069 -0.722
p
0.000 0.265 0.365 0.000
C Liftgroc Gamma
p
0.880
0.000
0.220 -0.696
0.270 0.005
D
Stair2+ Gamma
p
-0.348 -0.406
0.150 0.090
G
Walk 1m Gamma
p
-0.110 -0.563
0.370 0.025
0.741
0.000
J
Bathing Gamma
p
-0.553 -0.016
0.105 0.420
0.174
0.275
0.213
0.240
0.030
0.425
0.703 -0.339
0.000 0.060
-0.347
0.055
Test for item bias
K
SEX Gamma
p
-0.460 -0.585
0.005 0.000
0.250
0.040
0.345
0.020
0.350
0.030
L
AGE Gamma
p
-0.098
0.220
0.005 -0.022
0.490 0.410
0.080
0.245
0.062
0.305
Figure 9.2 Partial -coefficients for tests of Tjur LD and DIF conditions.
P-values for the -coefficients are still one-sided although two-sided p-values would have
been more appropriate here. Even when this is taken into account some evidence of both
local dependence and item bias has been disclosed.
41
This is more obvious in Figure 9.3 showing the same results as Figure 9.2 except that
insignificant test results has been suppressed4.
*** Screening of partial item relationships –
evidence of scale problems ***
Generalized Tjur conditions - the row item has been
subtracted from the score
B
C
D
G
J
----------------------------------------------------B Mod.act Gamma
0.692
-0.722
p
0.000
0.000
C Liftgroc Gamma
p
D
Stair2+ Gamma
p
G
Walk 1m Gamma
p
J
Bathing Gamma
p
0.880
0.000
-0.696
0.005
0.703
0.000
-0.563
0.025
0.741
0.000
-0.460 -0.585
0.005 0.000
0.250
0.040
Test for item bias
K
SEX Gamma
p
L
AGE Gamma
p
0.345
0.020
0.350
0.030
Figure 9.3 Evidence of local dependence and item bias
Even though the evidence against the generalized Rasch model in Figure 9.3 surfaced in a
multiple testing situation the evidence against the model is too strong to be dismissed. We
therefore have to proceed to the final step of the item screening leading to a loglinear
Rasch model that might explain the significant test results in Figure 9.3.
This is done in two steps:
First suggestions of local dependence between items are eliminated when the evidence
could have been created local dependence between other items. Positive local dependence
4
The same type of table where results that do not indicate the assumptions of the generalized Rasch model
have been violated are printed for the tests of marginal association, Figure 9.1, but not shown here.
42
between items B and C and between D and G might for instance explain the negative
correlation between items B and J and between items C and G. For this reason the suggested model for these items includes only local dependence between B and C and
between D and G.
Notice also that two tests of local independence are performed for each pair of items. The
definition of a model for local dependence is created sequantially according to the
following priorities:




Pairs of items where both tests indicate positive local dependence is included first. If
more than one pair satisfying these criterion exists, the pair with the highest mean 
will be included first.
Pairs of items where both tests indicate negative local dependence will be included
next.
Pairs of items where only one test suggest positive local dependence will then be
included.
Pairs of items where one test indicate negative local dependence will be included last.
Pairs of items will not be included if there is a chance that the evidence of local independence might be caused by the local dependence already introduced in the model.
Second a stepwise search for the origins of item bias is performed if items seems to be
biased against more than one exogenous variable. This is not the case here and therefore
not shown in this example. Items seem to be biased against gender, but not against age.
Figure 9.4 below summarize the evidence of LD and DIF and present the model
suggested by DIGRAM.
The model suggested by DIGRAM has the following LD and DIF generators:
BC DG BK CK DK GK JK
with two locally dependent item pairs and global DIF relative to sex.
This model has some unsolved problems.
The first is that the DIF parameters are not identifiable when all items are biased. DIF is a
relative phenomenon meaning that at least one item must be assumed to be unbiased even
though the statistical tests suggest otherwise. DIGRAM cannot see however which item
should be unbiased and therefore have to leave it to the user to decide what the model
should look like.
The second is that item bias may be the cause of the evidence of local dependence found
during item screening and that some of the evidence of item bias may be created by a
combination of both local dependence and item bias. It will be up to the analysis by the
43
loglinear Rasch model modeling both DIF and LD at one and the same time to distinguish
find out what is really happening here. The model therefore is only an initial model that
has to be checked and refined in the subsequent analysis.
--------- Summary of Tjur problems -------Two positive significant partial correlations:
B & C
D & G
One positive significant partial correlation:
One negative significant partial correlation:
B & J
C & G
Two negative significant partial correlations:
Stepwise inclusion of local dependence:
B & C
D & G
Mean Gamma =
Mean Gamma =
0.786
0.722
Evidence of item bias:
B:
C:
D:
G:
J:
K
K
K
K
K
The current IRT model is BC DG BK CK DK GK JK
Local dependent items. 3 item components: BC DG J
Figure 9.4 Summary of LD and DIF evidence and the suggested GLLRM5.
5
The final bit of information on connected item components shown in Figure 9.4 is only meant to serve as
a warning for the user. The larger connected components, the slower the iterative procedures used for
estimation of item partameters.
44
9.2 Estimating and fitting RLLRM models
To open the dialog for item analysis by gLLRM you must use the following command:
GRM
Opens the dialog shown in Figure 9.1
Figure 9.5 The GRLLM dialog
The GRLLM dialog shows the current items, exogenous variables and score groups. The
model you want to estimate and fit has to be defined in terms of DIF and LD generators.
An empty model field as in Figure 9.5 defines a generalized Rasch model without local
dependence and item bias.
The items in the checklist box at the lower left corner of the dialog tell you what you can
do here. Select
-
reduce model for conditional likelihood ratio tests of all the LD and DIF
parameters in the current model,
45
-
-
check local independence for conditional likelihood ratio tests of all LD
parameters not in the model,
check missing item bias for conditional likelihood ratio tests of all DIF parameters
not in the model,
test global homogeneity for a conditional likelihood ratio test of equal item, LD
and DIF parameters in the different score groups,
test local homogeneity for a stepwise analysis of local homogeneity across
different scoregroups,
test global DIF for a conditional likelihood ratio test comparing item parameters
in groups defined by values of exogenous variables,
test local DIF for a stepwise analysis of local homogeneity across different
groups defined by exogenous variables,
test unidimensionality test of homogeneity of item responses across two or more
subscales (not ready yet),
fit rating scale to fit a model assuming unidimensional item parameters (not ready
yet),
calculate ICC curves to calculate ICC, CICC and residuals for mean item scores,
estimate person parameters for calculation of symmetric gamma functions and
estimates of person parameters,
show margins of main model if you want to see the sufficient margins of the
current model,
show margins of all models if you want to see the sufficient margins of all
models,
show parameters of all models if you want to see the parameters of all models.
Note that some of these options are not ready yet and that “greyed” items cannot be
selected given the current setup for item analysis.
Output form the GLLRM analysis will appear in the large window at the right side of the
dialog. You can save and/or erase the output at any time during the analysis. Saving the
output will copy all lines of the GLLRM output to the output window of DIGRAM’s
main form. You can also return to DIGRAM’s main form at any time by pushing the Exit
button and return later to continue the item analysis. The GLLRM output and the model
will be the same as you left it unless you have defined a new set of items or a new set of
exogenous variables.
If no selection is made DIGRAM simply estimates the parameters of the current model.
Notice that tests of global homogeneity and DIF has been selected in Figure 9.5. Pushing
the Start button produces the output shown in Figures 9.6 – 9.?.
46
=============================================
Estimation of Rating scale models
=============================================
The model is estimated under the following conditions:
1 <= K <= 2
1 <= L <= 4
The assumed model:
All items are assumed to be locally independent
All items are assumed to be unbiased
985 valid cases included
Estimated item parameters
0
1
2
B
C
D
G
J
1.00
3.23
0.70
1.00
4.06
1.24
1.00
2.16
0.28
1.00
1.57
0.63
1.00
3.03
6.49
Nstep = 47
Delta =
-ln(likelihood) =
AIC =
0.0000861
9 estimated parameters
632.091
1282.182
Figure 9.6 Initialization of GLLRM analysis and estimates of parameters.
The model estimated in Figure 9.6 is a simple PRM with locally independent and unbiased items. The iterative IPF algorithm had to use 47 steps to reach the solution of the conditional maximum likelihood equations. Delta is equal to the largest observed difference
between an observed and expected item margin.
The test for global homogeneity includes tables comparing observed and expected item
mean scores in the different score groups. The results are shown in Figure 9.7.
47
****
Score = 0 - 6
****
Observed and expected item mean scores
mean
item
n
obs
exp
res
-----------------------------------B -
Mod.act
103
0.72
0.80 -1.45
C - Liftgroc
103
0.82
0.89 -1.32
D -
Stair2+
103
0.68
0.67
0.25
G -
Walk 1m
103
0.80
0.71
1.22
J -
Bathing
103
1.38
1.31
0.87
****
Score = 7 - 10
****
Observed and expected item mean scores
mean
item
n
obs
exp
res
-----------------------------------B -
Mod.act
197
1.63
1.59
1.20
C - Liftgroc
197
1.72
1.68
1.21
D -
Stair2+
197
1.41
1.42 -0.19
G -
Walk 1m
197
1.69
1.73 -1.27
J -
Bathing
197
1.91
1.94 -2.07 -
Test of homogeneity of 2 score groups
lr =
39.20
df =
9
p = 0.0000
Figure 9.7 Test for global homogeneity
The test for global homogeneity clearly rejects the model, even though there seems to be
an adequate fit between observed and expected item mean scores.
The results of the tests for global DIF present themselves in exactly the same way as the
results of tests for global homogeneity and will therefore not be shown here. Evidence of
DIF is found for both sex (lr = 45.28, df = 9, p = 0.0000) and age (lr = 62.22, df = 27,
p = 0.0001).
48
The item screening suggested BC,DG,BK,CK,DK,GK,JK as a possible model accounting
for LD and DIF. In this model we have to ignore at least one of the DIF generators in
order to identify the parameters. Due to the trends disclosed by the partial -coefficients
for DIF we only include BK and CK as DIF generators in the initial model for the
GLLRM analysis: BC,DG,BK,CK.
Subsequent analyses do not support the differential item functioning of B but suggests
instead that D and J are locally dependent and that J is also biased relative to age (L).
Figures (9.8) – (9.9) show part of the analysis testing this model.
Multiplicative item, LD and DIF parameters are shown in Figure 9.8. There is strong
positive local dependency between B and C and between D and G respectively, but
negative local dependency between D and J.
CLR tests for LD and DIF are shown in Figure 9.9. The first two tables with highly
significant test results refer to the generators in the model. The last two tables testing LD
and DIF not included in the model show no evidence against the model.
49
Estimated item parameters
0
1
2
B
C
D
G
J
1.00
0.69
0.09
1.00
9.14
1.91
1.00
1.72
1.75
1.00
1.07
0.28
1.00
6.09
11.53
Local dependency between B & C
B
C
0
1
2
0
1
2
1.00 1.00 1.00
1.00 12.86 11.06
1.00 28.95169.17
Local dependency between D & G
D
G
0
1
2
0
1
2
1.00
1.00
1.00
1.00
2.94
8.19
1.00
2.43
8.67
Local dependency between D & J
D
J
0
1
2
0
1
2
1.00
1.00
1.00
1.00
0.25
0.78
Item bias
1.00
0.11
0.19
item: C Exogenous variable: K
C
K
1
2
0
1
1.00
1.00
1.00
0.07
Item bias
2
1.00
0.02
item: J Exogenous variable: L
J
L
1
2
3
4
0
1.00
1.00
1.00
1.00
Nstep = 103
1
1.00
0.05
7.17
1.24
2
1.00
0.18
6.01
1.32
Delta =
-ln(likelihood) =
AIC =
0.0000925
29 estimated parameters
553.068
1164.137
Figure 9.8 Estimated parameters of the BC,DG,DJ, BK,CK,JL model
50
Test of local dependence
B & C:
D & G:
D & J:
lr =
lr =
lr =
66.80
13.44
14.62
df =
df =
df =
4
4
4
p = 0.0000
p = 0.0093
p = 0.0056
df =
df =
2
6
p = 0.0000
p = 0.0019
--Test of item bias
C & K:
J & L:
lr =
lr =
26.30
20.89
---
Check assumptions of local independence
B
B
B
C
C
C
G
&
&
&
&
&
&
&
D:
G:
J:
D:
G:
J:
J:
lr
lr
lr
lr
lr
lr
lr
=
=
=
=
=
=
=
8.99
5.45
0.78
6.86
5.97
4.23
7.39
df
df
df
df
df
df
df
=
=
=
=
=
=
=
4
4
4
4
4
4
4
p
p
p
p
p
p
p
=
=
=
=
=
=
=
0.0614
0.2444
0.9408
0.1437
0.2015
0.3752
0.1168
p
p
p
p
p
p
p
p
=
=
=
=
=
=
=
=
0.0354
0.0651
0.1676
0.1408
0.5230
0.0108
0.8229
0.1241
--Check assumptions of no item bias
B
D
G
J
B
C
D
G
&
&
&
&
&
&
&
&
K:
K:
K:
K:
L:
L:
L:
L:
lr
lr
lr
lr
lr
lr
lr
lr
=
=
=
=
=
=
=
=
6.68
5.46
3.57
3.92
5.16
16.62
2.89
10.01
df
df
df
df
df
df
df
df
=
=
=
=
=
=
=
=
2
2
2
2
6
6
6
6
Figure 9.9 Test of local LD and DIF of BC,DG,DJ,BK,CK,JL model
51
10 Item analysis by Rating scale models
will not be described in this version of the notes.
11 Criterion validity
will not be described in this version of the notes.
12 Export to other IRT programs
will not be described in this version of the notes.
References
To be included
52