
Measurement and Research Department Reports
96-3
Computerized Adaptive Testing for Classifying
Examinees Into Three Categories
T.J.H.M. Eggen
G.J.J.M. Straetmans
Cito
Arnhem, 1996
96-3
This manuscript has been submitted for publication. No part of this manuscript
may be copied or reproduced without permission.
Abstract
In this paper the possibilities for computerized adaptive testing in applications where
examinees are to be classified into one of three categories are explored. Testing
algorithms with two different statistical computation procedures are described and
evaluated. The first computation procedure is based on statistical testing (the
sequential probability ratio test) and the other on statistical estimation (weighted
maximum likelihood). Combined with these computation procedures, item selection
methods based on maximum information, with possibilities for content
and exposure control, are considered. The measurement quality of the proposed
testing algorithms is reported on the basis of the results of simulation studies
using an item response theory calibrated item bank for mathematics which was
developed to replace an existing paper-and-pencil placement test by a
computerized adaptive test (CAT). The main results of the study are that a gain
of at least 25% in the mean number of items is to be expected in a CAT.
Furthermore, it is concluded that for the three-way classification problem
statistical testing is the most promising computation procedure. Finally, it is
concluded that in this case imposing constraints on the item selection in the
form of content and/or exposure control hardly impairs the quality of the testing
algorithm.
Computerized Adaptive Testing for Classifying
Examinees into Three Categories
The number of applications of computerized adaptive testing based on item
response theory (IRT) is growing quickly and psychometric research on adaptive
testing is receiving widespread attention. Traditionally a computerized adaptive test
(CAT) aims at the efficient estimation of an examinee's ability. However, it has also
been shown to be a useful approach to classification problems. Weiss and Kingsbury
(1984) and more recently Spray and Reckase (1994) describe CATs for situations
where the main interest is not in estimating the ability of an examinee, but in
classifying the examinee into one of two categories, e.g., pass/fail or master/non-master. The
purpose of this article is to explore the possibilities for computerized adaptive
testing in an application where examinees are to be classified into one of three
categories.
The core of a CAT is the testing algorithm. Using an IRT calibrated item bank
it controls the start, the continuation and the termination of a CAT. The
algorithm consists of two main parts. The first is a statistical computation
procedure which infers the ability of the examinee on the basis of responses to
items. The second is an item selection method: during the CAT, after every item
and for each examinee, the composition of the test is adapted to the ability
demonstrated thus far. A CAT is continued until this ability, or a decision to be
taken on it, can be reported with specified accuracy.
In this article two possible statistical computation procedures for a CAT to be
used for the classification of examinees into one of three categories are described
and evaluated. The first computation procedure is based on statistical testing and
the other on statistical estimation.
When CATs are used for the estimation of the ability of an examinee, the items
are selected using the maximum information criterion: the next item to be
administered in a CAT is the one which has maximum information at the current
ability estimate of the examinee. Spray and Reckase (1994) show that in
classification problems with two categories it is better to select items which have
maximum information at the cutting point of the classification. In this article the
benefits of these two item selection methods will be evaluated for the three-way
classification problem. Furthermore, attention will be paid to negative implications
of item selection methods based on maximum information (see, e.g., Wainer,
1990). When items are selected on the basis of maximum information both the content
of the test and the exposure rates of items from the item bank are out of control.
Recent psychometric research has suggested solutions to these problems.
Kingsbury and Zara (1989, 1991) have proposed a procedure in which each CAT
is in accordance with certain content specifications. Exposure control, which has
been researched in particular by Sympson and Hetter (1985) and Stocking and
Swanson (1993), deals with two problems in maximum information selection
methods: items from the bank may be used either too often (overexposure) or too
infrequently (underutilization) in adaptive testing. Overexposure may jeopardize
the confidentiality of items; underutilization is a waste of the time and energy
spent on the development of an item bank. The effects of adding content control
and exposure control to the maximum information item selection methods for the
three-way classification problem will be reported.
Context
Adult basic education in the Netherlands aims to provide educationally
disadvantaged adults with knowledge and abilities that are indispensable for
satisfactory functioning as an individual and as a member of society. One of the
courses provided in this context is a mathematics course which is offered at three
different levels of difficulty. Prospective students are allocated to one of these
three course levels by means of a placement test. As there is a great variation in
the ability of the students, the placement test currently used is a two-stage test
(Lord, 1971). At the first stage all examinees take a routing test of 15 items, the
difficulty of which is targeted at the average ability of the prospective students. After the
routing test the examinees take one of three measurement tests, of 10 items each,
differing in difficulty. The performance on the routing test determines the
difficulty of the measurement test to be administered in the second testing stage.
There are certain drawbacks to the current paper-and-pencil placement test
which, in short, concern the complicated test administration procedure, the
confidentiality of the items and the limited measurement accuracy for large groups
of examinees. Replacing the paper-and-pencil test by a CAT is considered to help
in overcoming these problems.
The Mathematics Item Bank
For the adaptive test an item bank consisting of 250 items that can be scored
dichotomously is available. The basic equation of the IRT model used in the item
calibration is

$$P(X_i = 1 \mid \theta) = \frac{\exp(a_i(\theta - b_i))}{1 + \exp(a_i(\theta - b_i))}. \qquad (1)$$

The response $X_i$ to an item $i$ is either correct (1) or incorrect (0). The probability
of scoring an item correctly is an increasing function of the latent ability $\theta$ and
depends on two item characteristics: the difficulty parameter, $b_i$, and the
discrimination index, $a_i$. In this One-Parameter Logistic Model (OPLM)
(Verhelst, Glas, & Verstralen, 1995) only the difficulty parameters are estimated,
while the discrimination indices are imputed as hypotheses in the calibration.
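For readers who want to experiment with the model, a minimal Python sketch of the success probability in equation (1) is given below; the function name `oplm_prob` and the example parameter values are illustrative only and are not taken from the item bank.

```python
import math

def oplm_prob(theta: float, a: int, b: float) -> float:
    """Success probability of equation (1): exp(a(theta - b)) / (1 + exp(a(theta - b)))."""
    z = a * (theta - b)
    return 1.0 / (1.0 + math.exp(-z))

# Example: a hypothetical item with discrimination index 2 and difficulty -0.13,
# evaluated at ability 0.33, gives a success probability of roughly .71.
print(oplm_prob(0.33, a=2, b=-0.13))
```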
The items in the item bank belong to one of three content subdomains of
mathematics: mental arithmetic/estimating (A), measuring/geometry (B) and the
other elements of the curriculum (C). From the calibration sample the distribution
of the ability $\theta$ in the population is estimated to be normal with a mean of .294
and a standard deviation of .522. The distribution of the estimated item difficulties
over the three subdomains is given in Table 1.
Table 1
Distribution of Items with Regard to Difficulty and Content

                               Content
Difficulty class        A      B      C     Total
1                       2      1      7      10
2                      13      5     19      37
3                      18     11     37      66
4                      13     23     66     102
5                       2      9     24      35
Total                  48     49    153     250
The cutting points for classification of the examinees into one of the three levels on
the latent ability scale were defined by content specialists. They identified subsets
of the items of the item bank which should have a probability of success of at
least .7 at the cutting points. The resulting cutting point between level 1 and 2 is
$\theta_{c_1}$ = -.13 and between level 2 and 3 $\theta_{c_2}$ = .33. A rough inspection of the item bank
shows that there is a satisfactory spread of the difficulties of the items on the
latent ability scale: a fair amount of items is concentrated near the cutting points.
Research Questions
The overall research question addressed in this paper is: which testing
algorithm is most suitable for the computerized adaptive placement test for
mathematics in adult basic education, given a number of practical requirements?
From the measurement point of view evidence is sought for justifying the
replacement of the current paper-and-pencil placement test by a CAT. Practical
requirements are that a CAT may not exceed 25 items and that there should be
possibilities for controlling the content of a CAT and the exposure rates of the
items. More specific questions are: which statistical computation procedures
are suitable for classifying examinees into one of three different levels?
And a related question: which item selection methods should be considered? How
do the testing algorithms operate in terms of measurement accuracy, the number
of misclassifications, measurement efficiency, adherence to content specifications
and the distribution of exposure rates over the item bank?
Statistical Computation Procedures
The statistical computation procedure in the testing algorithm leads to the
decision on the examinee on the basis of item responses. The inference is made
by considering the likelihood function of the examinee's ability $\theta$. Given the
scores on $k$ items, $x_1, \ldots, x_k$, and the parameters of the items, $a_i$ and $b_i$, this function is:

$$L(\theta; x_1, \ldots, x_k) = \prod_{i=1}^{k} p_i(\theta)^{x_i} (1 - p_i(\theta))^{1 - x_i}, \qquad (2)$$

in which $p_i(\theta) = P(X_i = 1 \mid \theta)$ is substituted by the OPLM model formula (1).
After each item it is determined whether another item should be administered or
whether testing is stopped and a decision on the examinee is taken. There are roughly
two statistical approaches to this classification problem: statistical
estimation and statistical testing. Both approaches will be described briefly.
Statistical Estimation in the Testing Algorithm
For statistical estimation a traditional approach in adaptive testing is chosen
(Weiss & Kingsbury, 1984). Given the scores on $k$ items, $x_1, \ldots, x_k$, and the
parameters of the items, an estimate $\hat{\theta}_k$ of the ability and its standard error
$SE(\hat{\theta}_k)$ are computed. Next, a confidence interval for the examinee's true ability $\theta$ is constructed:

$$\bigl(\hat{\theta}_k - c \cdot SE(\hat{\theta}_k),\ \hat{\theta}_k + c \cdot SE(\hat{\theta}_k)\bigr),$$

in which $c$ is a constant, determined by the required
accuracy. The algorithm decides to deliver another item as long as there is a
cutting point, $\theta_{c_1}$ or $\theta_{c_2}$, within the interval; if not, a decision is taken according to
the decision rules set out in Table 2.
Table 2
Decision Rules of Adaptive Test with Statistical Estimation

If                                                                                               Decision
$\hat{\theta}_k + c \cdot SE(\hat{\theta}_k) < \theta_{c_1}$                                     Level 1
$\hat{\theta}_k - c \cdot SE(\hat{\theta}_k) > \theta_{c_1}$ and $\hat{\theta}_k + c \cdot SE(\hat{\theta}_k) < \theta_{c_2}$    Level 2
$\hat{\theta}_k - c \cdot SE(\hat{\theta}_k) > \theta_{c_2}$                                     Level 3
Else                                                                                             Continue testing
Because of its good statistical properties the Weighted Maximum Likelihood
(WML) method of Warm (1989) is used for estimating the ability. After $k$ items
this estimate and its standard error follow from an iterative maximization
procedure:

$$\hat{\theta}_k = \arg\max_{\theta}\ \Bigl(\sum_{i=1}^{k} I_i(\theta)\Bigr)^{1/2} L(\theta; x_1, \ldots, x_k). \qquad (3)$$

The second part of this formula is the likelihood of the ability (2), given the item
scores and the item parameters; the first part is the weight attributed to this
likelihood function. This expression contains the item information function, $I_i(\theta)$:
the information in an item as a function of ability. The contribution of an item to
the accuracy of the estimate of an examinee's ability has a positive relationship
to this item information. In the OPLM model (1) the item information function
is given by

$$I_i(\theta) = a_i^2\, p_i(\theta)\,(1 - p_i(\theta)). \qquad (4)$$
For further details on the background of this estimate and the way it is computed
we refer to Warm (1989) and Verhelst and Kamphuis (1989). The accuracy of this
estimation procedure in the adaptive testing algorithm is determined by the level
of the confidence interval.
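A minimal sketch of how this estimation-based algorithm could be implemented is given below, assuming a simple grid search for the weighted maximum likelihood estimate of equation (3), a standard error approximated by the inverse square root of the test information, and the decision rules of Table 2. The function names and the representation of items as (a, b) pairs are assumptions for illustration, not the implementation used in the study.

```python
import numpy as np

THETA_C1, THETA_C2 = -0.13, 0.33   # cutting points from the item bank section

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))          # equation (1)

def item_info(theta, a, b):
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)                             # equation (4)

def wml_estimate(items, scores, grid=np.linspace(-4, 4, 801)):
    """Warm's weighted maximum likelihood by grid search: maximize sqrt(I(theta)) * L(theta)."""
    log_wl = np.zeros_like(grid)
    for (a, b), x in zip(items, scores):
        p = p_correct(grid, a, b)
        log_wl += x * np.log(p) + (1 - x) * np.log(1 - p)   # log-likelihood, equation (2)
    info = sum(item_info(grid, a, b) for a, b in items)     # test information on the grid
    log_wl += 0.5 * np.log(info)                            # log of the weight sqrt(I(theta))
    theta_hat = grid[np.argmax(log_wl)]
    se = 1.0 / np.sqrt(sum(item_info(theta_hat, a, b) for a, b in items))
    return theta_hat, se

def decide(theta_hat, se, c=1.644):
    """Decision rules of Table 2; c = 1.644 corresponds to a 90% confidence interval."""
    lo, hi = theta_hat - c * se, theta_hat + c * se
    if hi < THETA_C1:
        return "level 1"
    if lo > THETA_C1 and hi < THETA_C2:
        return "level 2"
    if lo > THETA_C2:
        return "level 3"
    return "continue testing"
```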
Statistical Testing in the Testing Algorithm
As an alternative to the traditional estimation procedure the classification
problem can also be solved by a statistical testing procedure. Proposed is a
generalization of a procedure, used earlier by Reckase (1983), based on the
Sequential Probability Ratio Test (SPRT) (Wald, 1947).
First, so-called indifference zones, $(\theta_{c_1} - \delta_1, \theta_{c_1} + \delta_1)$ and $(\theta_{c_2} - \delta_2, \theta_{c_2} + \delta_2)$, around
the cutting points $\theta_{c_1}$ and $\theta_{c_2}$ are defined. These are small areas in which we can never be sure to take the right
decision, due to measurement fallibility. After formulating the statistical
hypotheses the acceptable probabilities of incorrect decisions or decision error
rates must be specified. Figure 1 represents the problem schematically.
[Figure: the latent ability scale, divided by the cutting points $\theta_{c_1}$ and $\theta_{c_2}$ into the decision regions Level 1, Level 2 and Level 3.]
Figure 1
Schematic Representation of Statistical Testing Problem
The hypotheses are:

H0_1: $\theta \le \theta_{c_1} - \delta_1$ (level 1)        against        H1_1: $\theta \ge \theta_{c_1} + \delta_1$ (higher than 1);
H0_2: $\theta \le \theta_{c_2} - \delta_2$ (lower than 3)   against        H1_2: $\theta \ge \theta_{c_2} + \delta_2$ (level 3).

The acceptable decision error rates are, with $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ small constants,
specified as follows:

P(accept H0_1 | H0_1 is true) $\ge 1 - \alpha_1$,
P(accept H0_2 | H0_2 is true) $\ge 1 - \alpha_2$,
P(accept H0_1 | H1_1 is true) $\le \beta_1$,
P(accept H0_2 | H1_2 is true) $\le \beta_2$.
For each pair of hypotheses (H0_1 against H1_1; H0_2 against H1_2) the test can
be carried out using the SPRT (Wald, 1947), meeting the accuracy requirements.
As test statistic the ratio between the values of the likelihood function (2) under
the null hypothesis and the alternative hypothesis is used. The test for H0_1
against H1_1 operates as follows:
Continue sampling if (this is also called the critical inequality of the statistical
test):

$$\frac{\alpha_1}{1 - \beta_1} \;<\; \frac{L(\theta_{c_1} - \delta_1; x_1, \ldots, x_k)}{L(\theta_{c_1} + \delta_1; x_1, \ldots, x_k)} \;<\; \frac{1 - \alpha_1}{\beta_1}, \qquad (5)$$

accept H0_1 (level 1) if:

$$\frac{L(\theta_{c_1} - \delta_1; x_1, \ldots, x_k)}{L(\theta_{c_1} + \delta_1; x_1, \ldots, x_k)} \;\ge\; \frac{1 - \alpha_1}{\beta_1},$$

reject H0_1 (level 2 or 3) if:

$$\frac{L(\theta_{c_1} - \delta_1; x_1, \ldots, x_k)}{L(\theta_{c_1} + \delta_1; x_1, \ldots, x_k)} \;\le\; \frac{\alpha_1}{1 - \beta_1}.$$
By combining the two SPRT's, as represented in Table 3, unequivocal decisions
can be taken in the classification problem dealt with.

Table 3
Decisions Based on Combination of Two SPRT's

                              Decision test 2
Decision test 1          1 or 2          3
1                        1               Impossible
2 or 3                   2               3
This generalization of the SPRT is known in the literature as the combination
procedure of Sobel and Wald (1949). It can easily be shown that by using the
OPLM model the impossible decision indeed never occurs and that the critical
inequality can be written as follows:

$$\frac{C_k(\theta_{c_1}) + \ln\frac{\beta_1}{1-\alpha_1}}{2\delta_1} \;<\; \sum_{i=1}^{k} a_i x_i \;<\; \frac{C_k(\theta_{c_1}) + \ln\frac{1-\beta_1}{\alpha_1}}{2\delta_1},$$

in which

$$C_k(\theta_{c_1}) = \sum_{i=1}^{k} \ln\frac{1 + \exp(a_i(\theta_{c_1} + \delta_1 - b_i))}{1 + \exp(a_i(\theta_{c_1} - \delta_1 - b_i))},$$

so that both bounds are expressions which only depend on the item parameters and
on the constants in the statistical testing procedure, $\alpha_1$, $\beta_1$ and $\delta_1$, that are chosen
beforehand. The test for H0_2 against H1_2 is treated analogously, with $\theta_{c_2}$, $\alpha_2$, $\beta_2$ and $\delta_2$.
It becomes clear that the evaluation of the critical inequality can be carried out
quite easily because it involves only a comparison of the observed weighted score
$\sum_{i=1}^{k} a_i x_i$ with constants that are set in advance.
Table 4 represents the decision rules based on the double SPRT in which, for the
sake of convenience, it is assumed that $\alpha_1 = \alpha_2 = \beta_1 = \beta_2 = \alpha$ and that
$\delta_1 = \delta_2 = \delta$. Writing $S_k = \sum_{i=1}^{k} a_i x_i$ for the observed weighted score and

$$L_{mk} = \frac{C_k(\theta_{c_m}) + \ln\frac{\alpha}{1-\alpha}}{2\delta}, \qquad U_{mk} = \frac{C_k(\theta_{c_m}) + \ln\frac{1-\alpha}{\alpha}}{2\delta}, \qquad m = 1, 2,$$

for the lower and upper critical values of the two tests, the rules are as follows.

Table 4
Decision Rules of Adaptive Test Using Statistical Testing

If                                       Decision
$S_k \le L_{1k}$                         Level 1
$U_{1k} \le S_k \le L_{2k}$              Level 2
$S_k \ge U_{2k}$                         Level 3
Else                                     Continue testing
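Because the critical inequality involves only the observed weighted score and constants that can be computed in advance, the testing procedure requires very little work during test administration. The sketch below illustrates this with the double SPRT of Table 4; the default error rate (.075) and indifference-zone half-width (.1333) are the values reported later in this paper as most efficient, and all function and variable names are illustrative, not the operational code.

```python
import math

THETA_C1, THETA_C2 = -0.13, 0.33   # cutting points on the latent ability scale

def sprt_bounds(items, theta_c, alpha=0.075, delta=0.1333):
    """Lower/upper critical values for the weighted score sum(a_i * x_i) at one cutting point."""
    # C_k: term that depends only on the item parameters and the chosen constant delta.
    c_k = sum(math.log((1 + math.exp(a * (theta_c + delta - b))) /
                       (1 + math.exp(a * (theta_c - delta - b)))) for a, b in items)
    lower = (c_k + math.log(alpha / (1 - alpha))) / (2 * delta)
    upper = (c_k + math.log((1 - alpha) / alpha)) / (2 * delta)
    return lower, upper

def sprt_decision(items, scores, **kw):
    """Decision rules of Table 4, based on the two combined SPRTs."""
    s = sum(a * x for (a, _), x in zip(items, scores))   # observed weighted score S_k
    l1, u1 = sprt_bounds(items, THETA_C1, **kw)
    l2, u2 = sprt_bounds(items, THETA_C2, **kw)
    if s <= l1:
        return "level 1"
    if u1 <= s <= l2:
        return "level 2"
    if s >= u2:
        return "level 3"
    return "continue testing"
```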
Item Selection Methods
In the testing algorithm the item selection method chooses items from the item
bank that are adapted to the examinee's current ability, as determined by the
computation procedure. A special position is taken by the starting procedure,
because it is assumed that before the first test administration the examinee’s
ability is completely unknown. Next the starting procedure and the series of
selection methods that are implemented in the mathematics placement test are
described.
Starting Procedure
The starting procedure for the mathematics placement test operates as follows.
From the item bank of 250 items a selection is made of 54 relatively easy items.
An examinee is presented a randomly chosen, relatively easy item from each of
the three content subdomains. There are three reasons for deciding on this
starting procedure. The most important one is that the target population of
examinees partly consists of poorly educated people who do not feel confident
working with a computer. Easy items at the beginning will help them overcome
their fear of the test and the computer. The second reason is that it is hardly
possible to make an optimal choice of items in accordance with the examinee’s
ability after one or two items, because the first estimates of the ability will
unavoidably be very inaccurate. Thirdly, the chosen starting procedure, drawing
upon the three different subdomains, will contribute to the content validity of the
test for the tested domain of mathematics.
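As an illustration only, the starting procedure amounts to drawing one relatively easy item at random from each of the three subdomains; `easy_items_by_domain` is a hypothetical data structure mapping each subdomain to its pool of easy items from the selection of 54.

```python
import random

def starting_items(easy_items_by_domain):
    """Return one randomly chosen, relatively easy item per subdomain ('A', 'B', 'C'),
    as described for the start of the placement test."""
    return [random.choice(items) for items in easy_items_by_domain.values()]
```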
Five Item Selection Methods
In connection with the computation procedures five item selection methods have
been investigated. These are indicated as follows:
1. random (R)
2. maximum information (MI)
3. maximum information with content control (MI+C)
4. maximum information with exposure control (MI+E)
5. maximum information with content control and exposure control (MI+C+E)
The first method randomly selects the next item from the available item bank,
excluding items used before. The other methods select an item for an examinee
in such a way that the information of an item, see (4), is maximal for that
particular examinee. 'Maximum information' in this context can have one of the
three following meanings.
In the case of statistical estimation as computation procedure:
a The next item selected is the item for which the information at the current
ability estimate is maximal. This is from now on indicated as CE (current
estimate). Select the item $j$ for which:

$$I_j(\hat{\theta}_k) = \max_{i \in V_k} I_i(\hat{\theta}_k),$$

in which $V_k$ denotes the set of items in the bank that have not yet been administered.
b Spray and Reckase (1994) demonstrate that with regard to classification
problems involving one cutting point it is, in the case of adaptive testing,
more efficient (resulting in a shorter average test length) to select items
that have maximum information at that cutting point, rather than at the
current ability estimate. The corresponding selection method is as follows:
select an item with maximum information at the cutting point nearest to
the current ability estimate; the minimum of $|\hat{\theta}_k - \theta_{c_1}|$ and $|\hat{\theta}_k - \theta_{c_2}|$
is determined. This option is indicated as NC (nearest cutting point).
In case of statistical testing as computation procedure no ability estimates are
made and a variation of b is used instead.
c Consider the critical values of the statistical test in Table 4. Observe that
$L_{1k} < U_{1k}$ and $L_{2k} < U_{2k}$. The first two critical values correspond to
testing around the cutting point $\theta_{c_1}$ and the second pair to testing around
the cutting point $\theta_{c_2}$. It is determined to which of the critical values the
current weighted score $S_k$ of an examinee is closest: the minimum of
$|S_k - L_{1k}|$, $|S_k - U_{1k}|$, $|S_k - L_{2k}|$ and $|S_k - U_{2k}|$ is determined.
Selected is the item which has maximum information at the cutting point
corresponding to the critical value that was found to be closest to the
examinee's score.
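All three maximum-information variants share the same mechanic: evaluate the item information of equation (4) at a target point and administer the best not-yet-used item; they differ only in the target (the current estimate for CE, the nearest cutting point for NC, and, for statistical testing, the cutting point whose critical value lies closest to the weighted score). The sketch below illustrates the first two variants under these assumptions, with illustrative names (`bank` as a list of (a, b) pairs, `used` as the set of administered item indices).

```python
import math

THETA_C1, THETA_C2 = -0.13, 0.33

def item_info(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)                       # equation (4)

def select_item(bank, used, theta_hat, rule="CE"):
    """Select the unused item with maximum information at the evaluation point.
    rule='CE': at the current ability estimate; rule='NC': at the nearest cutting point."""
    if rule == "NC":
        target = min((THETA_C1, THETA_C2), key=lambda c: abs(theta_hat - c))
    else:
        target = theta_hat
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_info(target, *bank[i]))
```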
Maximum information is a psychometric criterion for item selection. However, the
practical requirements of test composition can often only be met by constraining
this psychometric criterion. Two such constraints are investigated.
The first is related to content control: the adaptive test should be in agreement
with certain content specifications. In the case of the adaptive placement test of
mathematics the content control takes the following form: the preliminary
specification was that a test would preferably consist of 16% items from
subdomain arithmetic (A), 20% items from subdomain measuring/geometry (B)
and 64% items dealing with other subjects (C). In order to achieve this the
Kingsbury and Zara (1989, 1991) approach was followed. After each administered
item the implemented algorithm determines the difference between the desired
and achieved percentage of items selected from each subdomain. The next step
is that from the domain for which this difference is largest, the item with
maximum information is selected.
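The content-control step described above can be sketched as follows: after each administered item, compute for every subdomain the difference between the desired and achieved percentage and restrict maximum-information selection to the subdomain with the largest shortfall. The target percentages are those given in the text; the function and variable names are hypothetical.

```python
TARGETS = {"A": 0.16, "B": 0.20, "C": 0.64}   # desired proportions per subdomain

def next_subdomain(administered_domains):
    """Pick the subdomain whose achieved proportion lags most behind its target
    (Kingsbury & Zara style content balancing)."""
    n = max(len(administered_domains), 1)
    achieved = {d: sum(1 for x in administered_domains if x == d) / n for d in TARGETS}
    return max(TARGETS, key=lambda d: TARGETS[d] - achieved[d])

# The maximum-information item is then chosen from this subdomain only, e.g. by
# filtering the candidate items before applying a select_item routine.
```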
The second investigated constraint has to do with exposure control. The
rationale behind exposure control is that in the daily practice of adaptive testing
it often occurs that - although each examinee usually gets a different test - some
items from the available item bank are used more frequently than others while
some may hardly be used at all. A simple form of this control, used in the
placement test of mathematics, sees to it that the available item bank is used more
efficiently by actually administering an item that was selected according to the
maximum information criterion in 50 percent of the cases. When an item has been
selected the algorithm draws a random number $u$ from the interval (0,1). If $u \le .5$
the item is administered; if not, the procedure is continued by selecting the
next most informative item. Items that have been rejected once by this control
cannot be selected again for a particular examinee. The simultaneous use of both
exposure and content control has also been investigated for the application of the
placement test.
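A sketch of this simple exposure-control filter, with illustrative names: it walks down the items in order of information and administers each candidate with probability .5, permanently skipping rejected items for the examinee at hand.

```python
import random

def apply_exposure_control(ranked_items, rejected, p_admin=0.5):
    """ranked_items: item indices ordered by decreasing information for this examinee.
    rejected: set of items already rejected by this control for this examinee."""
    for item in ranked_items:
        if item in rejected:
            continue
        if random.random() <= p_admin:
            return item                 # administer the selected item
        rejected.add(item)              # once rejected, never selected again for this examinee
    return None                         # item bank exhausted (should not occur in practice)
```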
Design of Simulation Studies
The performance of the computation procedures and item selection methods in
the mathematics placement test was investigated by means of simulation studies.
The general design was as follows. From the population distribution, estimated
from the calibration study as N(.294, .522), a random examinee $j$ was drawn, in
other words: his ability $\theta_j$. The three starting items were selected according to the
starting procedure discussed earlier and the next items were selected using one of
the item selection methods. The simulee's response to an item was generated
according to the OPLM model. To be more specific: at each exposure a random
number $u$ was drawn from the interval (0,1). For simulee $j$ and item $i$ formula
(1) was evaluated and if $u \le p_i(\theta_j)$ the item was scored 'correct': $x_i = 1$; if not,
it was scored 'incorrect': $x_i = 0$. This procedure was repeated for $N$ = 5000 (or
1000) simulees.
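The response-generation step of this design can be written down directly; the sketch below assumes the estimated population distribution N(.294, .522) and the OPLM probability of formula (1), with illustrative function names.

```python
import math
import random

def simulate_examinee():
    """Draw a simulee's ability from the estimated population distribution N(.294, .522)."""
    return random.gauss(0.294, 0.522)

def simulate_response(theta, a, b):
    """Score an item 'correct' (1) if a uniform(0,1) draw does not exceed the OPLM probability."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return 1 if random.random() <= p else 0
```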
In the simulation studies the described procedures have been evaluated and
compared with regard to the following aspects. The testing algorithm with
statistical estimation as computation procedure was investigated with respect to
the attainable accuracy of the placement test. Instead of stopping testing at a
preset accuracy, the administration of 50 items was simulated, in order to find out
what differences there were in attainable accuracy between the algorithms as a
function of the number of items. The measurement inaccuracy after $k$ items, that
is, the mean absolute difference between true and estimated ability, will be reported:

$$\overline{|\hat{\theta}_k - \theta|} = \frac{1}{N} \sum_{j=1}^{N} |\hat{\theta}_{jk} - \theta_j|. \qquad (6)$$
In order to evaluate the performance of the adaptive test using different testing
algorithms in the conditions of the placement test, the stopping rule used in these
simulations was the required accuracy to be attained, constrained by a maximum
test length: $k_{\max} = 25$. If this maximum number of items was needed, a deviation
from the decision rules in Tables 2 and 3 was used, insofar that the most obvious
decision was taken. The decision rules at $k = k_{\max} = 25$ are given in Table 5.
Table 5
Decision Rules in the Adaptive Test with Statistical Estimation and Testing at $k = 25$

Stat. estimation                                      Stat. testing                                                  Decision
If $\hat{\theta}_{25} < \theta_{c_1}$                 If $S_{25} \le (L_{1,25} + U_{1,25})/2$                        Level 1
If $\theta_{c_1} \le \hat{\theta}_{25} < \theta_{c_2}$    If $(L_{1,25} + U_{1,25})/2 < S_{25} \le (L_{2,25} + U_{2,25})/2$    Level 2
If $\hat{\theta}_{25} \ge \theta_{c_2}$               If $S_{25} > (L_{2,25} + U_{2,25})/2$                          Level 3
The following results are reported: the classification accuracy, the mean number
of required items, the frequency of using items from the item bank (exposure
rates) and the distribution of the used items over the various subdomains.
In the statistical estimation computation procedure two different levels of
accuracy are reported: $c$ is 1.034 and 1.644 (see Table 2), corresponding to
confidence intervals of 70% and 90% respectively. In the statistical testing
computation procedure the acceptable decision error rates varied: $\alpha$ is .05, .075
and .1. Apart from that the indifference zone was varied: $\delta$ is .1 and .1333 at
$\alpha$ = .075.
Besides the general simulation design also the following variation was used in
order to find out which abilities might show differences between the statistical
testing and the statistical estimation procedure. Instead of taking a random sample
from the population distribution, 500 test administrations were simulated at 67
equidistant points on the ability scale.
Results of the Simulation Studies
The Measuring Accuracy with Statistical Estimation
Table 6 gives the measuring inaccuracy of the adaptive test using the various item
selection methods after 10, 20, 30, 40 and 50 items. Reported is the inaccuracy (6)
multiplied by 10,000. As expected, the inaccuracy decreases as the number of
items increases.
Table 6
Inaccuracy (6), Multiplied by 10,000, for Item Selection MI at Current Ability Estimate (CE)
and MI at Nearest Cutting Point (NC)

                                    Number of items (k)
                 10            20            30            40            50
Selection     CE    NC      CE    NC      CE    NC      CE    NC      CE    NC
MI          1466  1639     994  1146     834   963     758   847     700   776
MI+C        1519  1674    1037  1166     865   990     771   869     702   804
MI+E        1522  1657    1062  1171     893   983     811   890     754   827
MI+C+E      1541  1652    1082  1162     932   991     835   890     785   822
R           2266           1619           1314          1141          1034
Selecting items at the current ability estimate leads to less inaccuracy than
selecting at the nearest cutting point. From Table 6 and Figure 2 it is clear that
random item selection leads to the largest inaccuracy.
Figure 2
Inaccuracy of Item Selection Methods as a Function
of the Number of Items (on large scale)
The decrease of the inaccuracy becomes very small for all the other item selection
methods after 20 or more items, which justifies the conclusion that the practical
requirement of a maximum test length of 25 items is a realistic one. The
differences between the various item selection methods are small. Figure 3, which
reproduces Figure 2 on a smaller scale, clearly shows, however, that in the case
of CE selection selecting without constraints leads to the most accurate estimates.
Figure 3
Inaccuracy of Item Selection Methods as a Function
of the Number of Items (on small scale)
The exposure control has a slightly more negative effect on accuracy than the
content control. If both constraints are operative, the loss of accuracy is greatest.
In the case of NC selection the differences between the selection methods are
even smaller. This is true in particular between 20 and 25 items: differences in
accuracy between the selection methods can hardly be detected here.
The Algorithms in the Conditions of the Placement Test
Table 7 summarizes the results of the simulations with the four maximum
information item selection methods. It shows the mean number of items required
for taking a decision (k) and the percentage of correct decisions (%). To facilitate
the interpretation of these results also the administration of the paper-and-pencil
version of the placement test was simulated. The result of this was that the mean
number of required items was, of course, 25 and the percentage of correct
decisions 87.0%. Furthermore, it can be noted that with the sample sizes used
(1000), differences of .6 in the mean number of required items are
significant at the 99% level, whereas differences between percentages of correct
decisions are not significant at this level until they are at least 3.3 (2.7 at the 95%
level).
Statistical Estimation
First consider the effect of varying the levels of the preset accuracy in the
estimation computation procedure. Increasing the level of accuracy, from a 70%
up to a 90% confidence interval (see Table 2), results in a significant increase of the mean
number of required items, both in CE selection and in NC selection. In CE
selection the increase in the mean number of items varies between 1.9 and 2.8,
whereas this effect is about twice as large (between 3.9 and 4.6) in NC selection.
The percentages of correct decisions also tend to increase significantly when the
level of accuracy is increased.
Table 7
Mean Number of Required Items (k) and Percentage of Correct Decisions (%)

                                         Selection method
                             MI            MI+C          MI+E          MI+C+E
Computation procedure     k     %       k     %       k     %       k     %
Stat. estimation
  70%-CE                 13.8  85.4    13.8  85.7    14.5  85.5    14.2  83.5
  90%-CE                 16.3  89.1    16.6  88.8    16.4  87.8    16.4  87.7
  70%-NC                 14.4  88.4    14.8  88.2    14.3  87.4    14.6  85.2
  90%-NC                 18.7  89.9    19.4  89.8    18.7  87.2    18.5  89.2
Stat. testing
  5%-δ=.1                17.9  90.9    18.1  88.7    18.6  87.9    19.2  88.7
  10%-δ=.1               15.5  89.0    15.4  89.4    15.8  87.1    16.5  86.5
  7.5%-δ=.1              16.2  88.5    16.7  89.0    17.1  87.8    17.5  90.4
  7.5%-δ=.13             14.2  88.3    13.9  88.5    14.4  89.1    14.9  87.4
Compared to the paper-and-pencil version of the placement test, the percentages
of correct decisions only increase if 90% confidence intervals are used in the
stopping rule.
If CE selection and NC selection are compared, it appears that, if 70%
confidence intervals are used, there is a small difference in the mean number of
required items to the disadvantage of NC selection, whereas the percentage of
correct decisions is higher, sometimes significantly, for this selection method. At
90% however, there is a significant advantage for CE selection (a reduction
between 2.1 and 2.8) in the mean number of required items, whereas differences
in the percentages of correct decisions are not significant.
Comparing item selection methods, using varying constraints, hardly shows any
differences. Significant differences only occur in comparison to the random item
selection method (not included in Table 7) which, using confidence intervals of
70% and 90%, led to the following simulation results: k = 16.2 and % = 81.7, k
= 20.7 and % = 83.8.
Statistical Testing
If statistical testing is applied as computation procedure, with a fixed indifference
zone of $\delta = .1$, the mean number of required items increases significantly when
the preset acceptable decision error rates are lowered. The differences between error
rates of 5% and 10% vary between 2.4 and 2.8 items.
Unexpectedly, there are hardly any differences between the percentages of correct
decisions. This can be explained by the fact that in a relatively high number of
simulated test administrations no decision had been taken after 25 items on the
basis of the set error rates; the procedure was then stopped by taking the
most reasonable decision (see Table 5).
If the indifference zone is extended in the case of an acceptable decision error rate
of .075, a significant decrease is seen in the mean number of required items
(varying between 2.0 and 2.8) without any effect on the percentage of correct
decisions. If, however, the indifference zone is extended still further (not in Table
7), then this does result in a significant decrease of the percentage of correct
decisions.
Compared to the paper-and-pencil version of the placement test the reported
simulations in which statistical testing is used as a computation procedure show
a (sometimes significant) increase in the percentage of correct decisions.
Just as in the case of statistical estimation, the differences between the item
selection methods are small: only a small increase can be observed in the mean
number of required items as the constraints on item selection are tightened. If in
the statistical testing procedure the items are selected randomly (not included in
Table 7), this does have an effect on the quality of the testing algorithm: a mean
number of about 5 additional items is required and the percentage of correct
decisions decreases as well.
Comparison of Statistical Estimation and Statistical Testing
The efficiency of the testing algorithms using statistical estimation and statistical
testing can be evaluated by comparing the second, the fourth and the eighth row
of Table 7. These three algorithms lead to roughly equal percentages of correct
decisions which, by the way, all exceed that of the paper-and-pencil version of the
placement test (87.0%). The global conclusion that can be drawn from this is that
the mean number of required items is smallest in the case of statistical testing as
computation procedure with the following characteristics: the acceptable decision
error rate is .075 and the indifference zone is $\delta = .1333$. This algorithm
requires a mean number of just over 2 items less than the algorithm that combines
statistical estimation as computation procedure with selection of items with
maximum information at the current ability estimate (CE); another two additional
items are required if estimation and selecting at the nearest cutting point (NC) is
used. If content control and/or exposure control are added as extra constraints to
the item selection method the results are the same.
For the three testing algorithms discussed, an attempt was made to find out at what points
in the ability distribution the largest gain in terms of the mean number of required
items is to be expected. For that purpose 500 test administrations were simulated
at 67 equidistant points on the ability scale. Because constraints
on the item selection methods had no effect on the outcomes, only the results for
the item selection method with both exposure and content control are shown
(Figure 4).
All three algorithms show that the abilities centering around the mean of the
population distribution require the highest number of items. Abilities above the
mean require more items than lower abilities. This may have to do with the actual
content of the mathematics item bank which - as seen before - contains a
relatively high number of easy items. A more obvious cause, however, could be
the starting procedure used in the adaptive test, which begins with three easy
items.
Figure 4
Mean Number of Required Items for Selection Method MI+C+E
A comparison of the three algorithms shows that for all abilities selecting items
with maximum information at the current ability estimate (CE) is to be preferred
to selection at the nearest cutting point (NC). Comparing the more efficient
estimation procedure to the testing procedure shows a gain for the testing
procedure in the mean number of items required for classification especially for
the abilities between the cutting points (-.13 and .33). It is only for the lower
abilities that the estimation procedure performs slightly better.
Exposure Data
Earlier the global conclusion was drawn that imposing constraints on the item
selection method has no serious consequences for the quality of the testing
algorithms in the conditions of the placement test. The question to deal with now
is: do the imposed constraints indeed have the desired effects?
First of all Table 8 shows the exposure rates of the items from the item bank
based on the testing algorithm that has a statistical estimation computation
procedure and confidence intervals of 90%. For each selection method the
number of items is reported (from a total number of 250) with their frequency of
use in percentages in 1000 simulated test administrations.
Table 8
Exposure Rates of Items with Estimation (90%) as Computation Procedure
(number of items, out of 250, per exposure-rate class in 1000 simulated test administrations;
classes ordered from low to high exposure)

                              Selection method
                        R     MI          MI+C        MI+E        MI+C+E
Frequency %                   CE    NC    CE    NC    CE    NC    CE    NC
0 (never used)          0    132   156   126   156    75   103    53    75
Used in less than 20%:
                        0      5     1    11     2    47    21    56    40
                        0     28    28    31    29    26    27    42    36
                      130     26    14    16    14    20    21    17    20
                       68      8     4    18     4    13    11    18    12
                       43     17    11    11     6    26    11    19    18
                        9     13     3     5     3    24    26    27    22
Used in 20% or more:
                        0      2     8    13     7     8    11     9    10
                        0      2     5     4     6     7     4     5     4
                        0      7     7     7    10     4    10     4     7
                        0      4     1     4     2     0     5     0     5
                        0      6     9     3     5     0     0     0     1
                        0      0     3     1     6     0     0     0     0
The item bank is most efficiently used by the random item selection method: all
items are used in between 5% and 20% of the test administrations. The algorithms that are
better from a measurement point of view suffer both from underutilization
and from overutilization of parts of the item bank. In selecting items with
maximum information at the current ability estimate, for instance, 132 items are
never used at all, whereas 21 items are used in over 20% of the administrations,
which could become problematic with regard to the confidentiality of the items.
Comparing the NC selection methods with the CE selection methods shows that
all variants of NC selection make less efficient use of the item bank: both the
number of items that are never used and the number of items that are used
frequently (in 20% or more of the administrations) are larger.
From now on, consider only the exposure rates where CE item selection is used.
Applying content control has a notable effect: the number of items that are never
used decreases slightly (from 132 to 126), but on the other hand there is an
increase in the number of items used frequently (from 21 to 32). Applying
exposure control has the expected positive effect on the number of items not used
(75); the number of items frequently used also decreases, to 19. Moreover there
are no items that are used in more than 40% of the test administrations.
Combining content and exposure control clearly has the most positive effect on
the exposure rates of the items.
Table 9 reports the same data as Table 8 but with respect to statistical testing
as computation procedure with an acceptable decision error rate of .075 and the
extended indifference zone $\delta = .1333$.
Table 9
Frequency of Used Items with Statistical Testing as Computation Procedure
(error rate .075; δ = .1333)

                              Selection method
Frequency %        R      MI     MI+C    MI+E    MI+C+E
0 (never used)     0     156     156     104      77
Used in less than 20%:
                   0       1       2      27      43
                   1      28      28      32      38
                 143      17      16      18      25
                  62      10      13      13      11
                  36      12      12      27      23
                   8       7       6      11      16
Used in 20% or more:
                   0       4       1       6       7
                   0       2       4       2       2
                   0       2       4       7       5
                   0       4       2       3       3
                   0       5       4       0       0
                   0       2       2       0       0
As can be seen, comparing the item selection methods leads to a similar result as
in the case of statistical estimation as computation procedure. Comparing the
results of Table 8 and Table 9 shows that the exposure rates are better in the case
of statistical testing as computation procedure than in the case of statistical
estimation combined with the NC selection method. The number of items used
frequently (20% or more) is 1.5 times larger in the NC procedure (e.g., 17 against
27 items with option MI+C+E). However, compared to statistical estimation with
CE selection, the exposure rates with statistical testing as computation procedure
are less favorable.
Finally, Table 10 shows the distributions of the items used over the three
content subdomains for the selection methods reported in Tables 8 and 9.
Table 10
Distribution of Items Used over Subdomains

                              Subdomains (desired %)
Selection method        A (16%)    B (20%)    C (64%)
Statistical estimation
  R                       21.2       21.7       57.2
  CE-MI                   21.3       20.0       58.7
  NC-MI                   20.9       15.6       63.6
  CE-MI+C                 16.2       20.8       62.9
  NC-MI+C                 16.2       20.4       63.4
  CE-MI+E                 22.6       21.0       56.4
  NC-MI+E                 22.7       18.7       58.7
  CE-MI+C+E               16.3       20.9       62.8
  NC-MI+C+E               16.5       20.4       63.1
Statistical testing
  R                       21.2       21.2       57.1
  MI                      23.5       14.5       62.0
  MI+C                    16.7       21.5       61.8
  MI+E                    23.2       18.6       58.1
  MI+C+E                  16.6       21.0       62.3
It appears that the desired distribution - from the point of view of content - over
subdomains A, B and C (16:20:64) can be achieved only through explicit content
control in selecting items. All other selection methods over-represent subdomain
A and under-represent subdomain C by about 4% in the average test.
Conclusion
The studies carried out lead to the following conclusions with regard to the
development of the adaptive placement test for mathematics.
1. The quality of the item bank is satisfactory for the purpose of adaptive
testing.
2. The absolute maximum of 25 items for each test administration is realistic.
3. The gain in the number of required items can be expected to amount to
between 25% and 45% of the number of items in the paper-and-pencil
version of the placement test.
4. Applying the double SPRT is the most promising computation procedure
in the testing algorithm.
5. Additional constraints on item selection methods, in the form of content
control or a mild form of exposure control, can be imposed without
impairing the quality of the procedures.
6. Before deciding on a final implementation of a CAT in the placement test
for mathematics, it has to be found out experimentally whether the way the
algorithms operate in real testing situations does not conflict with the
results of the simulations.
With regard to the testing algorithms used for the classification of examinees
into three categories the general conclusion is drawn that statistical testing as a
computation procedure is a promising alternative to the more traditional statistical
estimation procedure. Apart from the gain in the mean number of required items,
while attaining equal accuracy, this procedure has the added advantage of little
computational work during the test administration. In statistical estimation an
iterative maximization procedure has to be followed; in statistical testing - at
least in the OPLM model - a simple comparison of the observed weighted score
with constants suffices. The effect of the acceptable decision error rates, in relation to the width
of the indifference zones, on the quality of the testing algorithms calls for further
research.
It is interesting to see that with the item selection methods that were studied it is,
with regard to statistical estimation as computation procedure, generally speaking
advisable to select items that have maximum information at the current ability
estimate, rather than items that are maximally informative at the nearest cutting
point. Whether this is partly due to the characteristics of the item bank used and
the cutting points chosen is still a question to be answered. With regard to the
item selection methods used in combination with statistical testing as computation
procedure it can be stated that these can probably be improved. This study has
deliberately not used estimates of the examinees’ ability in statistical testing as a
computation procedure. It is expected that in a follow-up study in which we do
resort to estimates, it will appear that the testing procedure can still be improved,
analogous to the results of Spray and Reckase (1994) in their study of a
classification problem into two categories.
Research is presently being continued into refinements of exposure control in the
item selection methods and into the extended application of the sequential testing
procedure to more than three categories.
References
Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive
mastery testing and a sequential mastery testing procedure. In D.J. Weiss (Ed.).
New horizons in testing (pp. 257-286). New York: Academic Press.
Kingsbury, G.G., & Zara, A.R. (1989). Procedures for selecting items for
computerized adaptive testing. Applied Measurement in Education, 2, 359-375.
Kingsbury, G.G., & Zara, A.R. (1991). A comparison of procedures for content-sensitive
item selection in computerized adaptive tests. Applied Measurement
in Education, 4, 241-261.
Lord, F.M. (1971). A theoretical study of two-stage testing. Psychometrika, 36,
227-242.
Reckase, M.D. (1983). A procedure for decision making using tailored testing. In
D.J. Weiss (Ed.). New horizons in testing (pp. 237-255). New York: Academic
Press.
Sobel, M., & Wald, A. (1949). A sequential decision procedure for choosing one of
three hypotheses concerning the unknown mean of a normal distribution.
Annals of Mathematical Statistics, 20, 502-522.
Spray, J.A., & Reckase, M.D. (1994). The selection of test items for decision making
with a computer adaptive test. Paper presented at the national meeting of the
National Council on Measurement in Education, New Orleans.
Stocking, M.L., & Swanson, L. (1993). A method for severely constrained item
selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.
Sympson, J.B., & Hetter, R.D. (1985). Controlling item-exposure rates in
computerized adaptive testing. Paper presented at the annual conference of
the Military Testing Association, San Diego.
Verhelst, N.D., & Kamphuis, F.H. (1989). Statistiek met $\theta$ [Statistics with $\theta$].
Bulletinreeks nr. 77. Arnhem: Cito.
Verhelst, N.D., Glas, C.A.W., & Verstralen, H.H.F.M. (1995). One-parameter
logistic model (OPLM). Arnhem: Cito.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L.,
& Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale (NJ):
Lawrence Erlbaum.
Wald, A. (1947). Sequential analysis. New York: Wiley.
Warm, T.A. (1989). Weighted maximum likelihood estimation of ability in item
response theory. Psychometrika, 54, 427-450.
Weiss, D.J., & Kingsbury, G.G. (1984). Application of computerized adaptive testing
to educational problems. Journal of Educational Measurement, 21, 361-375.
Recent Measurement and Research Department Reports:
96-1  H.H.F.M. Verstralen. Estimating Integer Parameters in Two IRT Models for Polytomous Items.
96-2  H.H.F.M. Verstralen. Evaluating Ability in a Korfball Game.