Running head: Item Pocket Method to Allow Response Changing

Item Pocket Method to Allow Response Review and Change
in Computerized Adaptive Testing1
Kyung T. Han
Graduate Management Admission Council®
Correspondence may be sent to:
Kyung T. Han
Graduate Management Admission Council®
11921 Freedom Dr. Suite 300, Reston, VA 20190 USA
[email protected]
(Phone) +1-703-668-9753
(Fax) +1-703-668-9601
The views and opinions expressed in this article are those of the author and do not necessarily
reflect those of the Graduate Management Admission Council®.
1
Acknowledgements: The author wishes to thank Lawrence M. Rudner, Fanmin Guo, and Eileen Talento-Miller of the
Graduate Management Admission Council® (GMAC®) for their feedback and support. The author is also grateful to
Ernie Anastasio of GMAC®, Hua-Hua Chang of the University of Illinois, Wim J. van der Linden of CTB/McGraw-Hill,
and Paula Bruggeman of GMAC® for their reviews and valuable comments.
This is a revision of the original study that was presented at the 2011 annual meeting of the National Council on
Measurement in Education (NCME). The paper received the Alicia Cascallar Award for Outstanding Paper by an Early
Career Scholar from NCME.
Abstract
Most computerized adaptive testing (CAT) programs do not allow test takers to review and
change their responses because doing so could seriously degrade measurement efficiency and
make tests vulnerable to manipulative test-taking strategies. Several modified testing methods
have been developed that provide restricted review options while limiting the trade-off in CAT
efficiency. The extent to which these methods provided test takers with options to review test
items, however, was still quite limited. This study proposes the item pocket (IP) method, a new
testing approach that allows greater flexibility in changing responses, eliminating unnecessary
restrictions that prevent test takers from moving across test sections to review their answers. A
series of simulations were conducted to evaluate the robustness of the IP method against various
manipulative test-taking strategies. Findings and implications of the study suggest that the IP
method could be an effective solution for many CAT programs when the IP size and test time
limit are properly set.

Keywords: computerized adaptive testing, test construction, test administration, response change
Item Pocket Method to Allow Response Review and Change
in Computerized Adaptive Testing
Computerized adaptive testing (CAT) is rapidly becoming a popular choice for test
administration, often favored by test developers and test takers alike (Poggio, 2010; Martineau &
Dean, 2010) because it delivers more accurate score estimates and requires less testing
time than conventional paper-and-pencil-based tests (PBTs). Because the item selection algorithm
in CAT relies on an interim score estimate that is updated after each item administration, however,
CAT usually does not allow test takers to review test items and/or change their responses as they
can on PBTs. Previous research has shown that having
an opportunity to change their responses can help test takers reduce their anxiety and stress
levels during test administration (Wise, 1996; Stocking, 1997; Lunz, Bergstrom, & Wright,
1992; Papanastasiou, 2002), which can result in fewer mistakes made during testing and test
scores that may more accurately reflect test takers’ true proficiency (Papanastasiou, 2002). This
especially can be the case when the test is high stakes (Stocking, 1997) and/or tightly timed. Test
takers almost always prefer to have the option to change responses even if they do not always
exercise that option (Vispoel, Hendrickson, & Bleiler, 2000). More importantly, research shows
that only a fraction of test takers actually benefit from response changing in terms of score
improvement (Benjamin, Cavell, & Shallenberger, 1984; Waddell & Blankenship, 1995; Wise,
1996).
Eliminating unnecessary causes of a test taker’s anxiety and stress during CAT
administration is an important step toward creating an ideal test environment. Reducing test
takers’ anxiety and stress by giving them sufficient control over the test administration is
critically important because a high level of test anxiety can contribute to increased measurement
errors. Moreover, providing test takers with a better testing experience is, educationally and morally, the
right thing to do.
Manipulative Test-Taking Strategies
One of the main practical objections to allowing for response changing in CAT programs
is that imprudently allowing test takers to change responses could open the door to systematic
test-taking strategies that could render CAT administration less efficient (Wainer, 1993; Wise,
1996) and/or result in biased score estimates (Vispoel, Rocklin, Wang, & Bleiler, 1999).
Wainer Strategy
For the manipulative test-taking strategy that Wainer (1993) introduced, a test taker could
intentionally answer all items incorrectly on the first round to make the CAT system administer
only easy items. After that, the test taker could go back to each item to review and change his or
her responses to get a perfect score on the second round. Simulation studies confirmed that this
so-called “Wainer strategy” could result in extremely large measurement errors with a
considerable risk for positively biased score estimates, especially for high-proficiency test takers
who are capable of implementing the strategy successfully (Wang & Wingersky, 1992; Gershon
& Bergstrom, 1995; Stocking, 1997; Vispoel et al., 1999; Bowles & Pommerich, 2001). In
practice, the Wainer strategy is difficult to implement successfully—test takers run the risk of
getting significantly underestimated scores if they fail to respond to all items correctly on the
second round (Wise, 1996). Notwithstanding the risks, studies with live data showed that high-proficiency test takers had a good chance to profit from the Wainer strategy (Vispoel et al., 1999;
Wise, 1996).
Kingsbury Strategy
Another manipulative test-taking strategy for CAT involves examinee judgment on item
difficulty. As pointed out by Green, Bock, Humphreys, Linn, and Reckase (1984), test takers,
knowing that the current item’s difficulty depends heavily on the response made to the previous
item, could use the difficulty of the current item as a clue to the correctness of their response for
the earlier item. If a test taker thinks the current item’s difficulty is slightly higher than the
previous item, it would be reasonable for the test taker to think his or her answer for the previous
item was correct. On the other hand, if the current item seems easier than the previous item, it
may indicate that the previous item was answered incorrectly, and thus the test taker can go back
to the previous item and change the answer. In his simulation study, Kingsbury (1996) modeled
such a test-taking strategy based on two strong assumptions: (a) test takers were assumed to
make guesses if they saw an item whose difficulty was higher than their true proficiency by 1.0
theta unit or more, and (b) test takers also were assumed to go back to the previous item and change
their responses if the next item’s difficulty was lower than that of the previous item by 0.5 theta unit or
more. The simulation results showed that examinees could benefit from this test-taking strategy
depending on their proficiency level—the lower the examinees’ proficiency levels, the greater
the possible benefits they could see.
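As a compact restatement of the two assumptions Kingsbury (1996) modeled, the sketch below encodes only the decision rules described above (guess when an item appears at least 1.0 theta unit above true proficiency; revisit the previous item when the next item's difficulty drops by at least 0.5 theta unit). It is an illustration of the cited rules rather than Kingsbury's actual code, and the function names are hypothetical.

```python
# Decision rules attributed to Kingsbury's (1996) simulated test takers (illustrative only).
def kingsbury_guesses(true_theta, item_b, guess_threshold=1.0):
    """Assumption (a): the simulee guesses when the item sits far above proficiency."""
    return item_b - true_theta >= guess_threshold

def kingsbury_revisits_previous(prev_item_b, next_item_b, drop_threshold=0.5):
    """Assumption (b): a drop in difficulty signals that the previous answer was wrong,
    so the simulee goes back and changes that response."""
    return prev_item_b - next_item_b >= drop_threshold
```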
Generalized Kingsbury Strategy
Wise, Finney, Enders, Freeman, and Severance (1999) expanded Kingsbury’s test-taking
strategy so that test takers were assumed to have speculated on the difficulty level of the next
item not only for items with guessed responses, but for all previous items. Wise et al. (1999)
called this the “generalized Kingsbury” (GK) strategy and simulated it in more realistic
conditions where probabilities for correctly judging item difficulties derived from real data were
less than 1.0. The simulation results suggested small possible benefits to test takers from using
the Kingsbury strategy, but the strategy offered no meaningful improvement in score estimates.
Vispoel, Clough, Bleiler, Hendrickson, and Ihrig (2002) used live testing data to examine the
effect of the Kingsbury and GK strategies on score estimates under various conditions and found
that for a majority of test takers in the study the Kingsbury and GK strategies were ineffective.
This was mainly because the accuracy of test takers’ item-difficulty rating was much lower than
what Kingsbury (1996) and Wise et al. (1999) assumed: Test takers were only 61% successful in
distinguishing the difficulty difference within each item pair. The study did find, however, that
some test takers were still able to improve their scores using either one of these strategies.
CAT With Restricted Revision Options
To minimize the effect of the manipulative test-taking strategies while still providing test
takers with reasonable options to review and change their responses, Stocking (1997) proposed
three different models for limiting item review and response change by test takers. In
the first model, test takers were allowed to change their responses at the end of the test, but were
limited to a maximum number of revisions. According to her simulation results, Stocking’s
model effectively reduced the impact of the Wainer strategy, and the conditional standard error
of measurement (CSEM) and bias were close to those under the zero-revision condition when the
number of revisions was limited to 2 out of 28 test items. When the number of allowable
revisions was greater than two, however (the studied conditions in Stocking, 1997, were 0, 2, 7,
14, and 28 revisions for the 28-item test), the model failed to control the effect of the
Wainer strategy.
In Stocking’s second model, the test consisted of multiple separately timed sections, and
test takers were allowed to revise their responses freely within each section. The simulation
results showed that administering the test in two separate sections substantially reduced the
effect of the Wainer strategy on the CSEM and bias. When the item revision option was
available with four or more separated sections, the effect of the Wainer strategy was almost
completely negated. Stocking’s third model, in which test takers were allowed to revise
responses only within each item set associated with a common stimulus, showed robustness
against the Wainer strategy as well. The main disadvantage of Stocking’s third model, however,
was that test takers were not allowed to revise responses for discrete (i.e., non-set) items
(Stocking, 1997).
In their study using live test data, Vispoel et al. (2000) confirmed that Stocking’s second
model (the restricted review within each block) successfully reduced the possible effect of the
Wainer strategy. In addition, the majority of examinees (98.4%) in this study felt they had
adequate opportunity to review and change their responses. In later studies by Vispoel and
colleagues, which also used live test data, the Kingsbury and GK strategies proved to be
ineffective when the restricted review was permitted (Vispoel et al., 2002; Vispoel, Clough, &
Bleiler, 2005).
Reviewing the overall results from Stocking (1997) and Vispoel et al. (2000, 2002,
2005), the restricted review approach (especially Stocking’s second model) seems to be one of
several solutions that could allow test takers to change their responses during CAT to a certain
degree without incurring unacceptable levels of sacrifice in measurement precision and
efficiency. A serious limitation of the restricted review approach, however, is that allowing test
takers to review and change their responses only within each section eventually causes them to
involuntarily surrender access to items in the current section in order to proceed to the next
section. If test sections are timed strictly and separately from each other as Stocking (1997)
originally proposed, test takers would not necessarily feel pressure to move on to the next section
before the section time expired. In a majority of operational CAT programs, however, small test
sections (or item sets) usually are not timed separately. As a result, every time test takers finish
one section (unless it is the last section), they must decide whether it is better to spend time
revisiting test items in the current section and improving their initial responses or to use their time
to complete the remaining test sections. Being forced repeatedly to make such a decision, not
knowing exactly how much time will be needed to complete the remaining test sections, very
likely causes test takers the same kind of test anxiety as that observed among test takers in a
regular CAT administration with no response revision option.
Another downside of the restricted review approach is that it still does not allow test
takers to skip items (unless the items within each section are administered nonadaptively). If test
takers want to proceed further, even within one section, they must answer every item because the
CAT program selects the following item based on the test taker’s initial response to the current
item. The inability to skip items may not be terribly bothersome for test takers because they can
simply answer randomly and move on to the next item, knowing they can return to the item
before advancing to the next section. In terms of measurement efficiency, however, an item
selection process that is based heavily on initial responses that do not necessarily reflect test
takers’ best effort could seriously erode CAT’s level of adaptiveness. Moreover, some test takers
might try different initial responses to find clues to the correct response based on the difference in
item difficulty between item pairs, which was the source of concern for Kingsbury (1996) and
Wise et al. (1999). Vispoel et al. (2002, 2005) observed no meaningful gain from practicing
Kingsbury and GK strategies in their CATs with the restricted review option, but their finding
was based on results from low-stakes exams. In high-stakes CAT programs, test takers might be
tempted to follow the Kingsbury or GK strategies with the intent of improving their scores, and
CAT with the restricted review approach technically is vulnerable to successful implementations
of either of these strategies.
Item Pocket Method
To address the shortcomings of the restricted review approach, this study proposes a new
approach for allowing response change. This method, which hereafter will be referred to as the
“item pocket” (IP) method, provides test takers with item pockets into which they can place
items for later review and response change. Test takers can skip answering items by putting them
in the item pocket. Once an item is placed in the item pocket, a test taker can go back to it
anytime during the test until the test taker submits his or her final answer for the item. For
example, in the CAT interface shown in Figure 1, a test taker is reviewing Item 4 among the
three items in the item pocket. If a test taker wants to take Item 4 out of the item pocket, she or
he ‘confirms’ the final answer for it. Once removed from the item pocket, Item 4 cannot be
placed back in. Test takers must confirm final answers for any items in the item pocket in order
to empty the item pocket before the test time expires; otherwise, any remaining
items in the item pocket will be counted as incorrect responses. CAT developers can determine
the item pocket size based on the test length and time limit (more discussion on the item pocket
size follows later in this paper). The item pocket size was five in the example shown in Figure 1.
As for CAT item selection, only items outside the pocket (in other words, items with final
responses) are included in the interim score estimation procedure.
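To make the bookkeeping concrete, the sketch below illustrates one way the pocket logic just described could be implemented. It is a minimal illustration under the stated rules (fixed pocket size, confirmed answers are irreversible, unconfirmed items are scored as incorrect at the time limit, and only confirmed responses feed interim scoring); the class and method names are hypothetical and do not come from the paper.

```python
# Minimal sketch of the item pocket bookkeeping described above (hypothetical names).
class ItemPocket:
    def __init__(self, size):
        self.size = size          # maximum number of items held in the pocket at once
        self.pocketed = {}        # item_id -> provisional response (not used for scoring)
        self.confirmed = {}       # item_id -> final response (used for interim scoring)

    def add(self, item_id, provisional_response=None):
        """Skip or defer an item by placing it in the pocket, if room remains."""
        if item_id in self.confirmed:
            raise ValueError("a confirmed item cannot be returned to the pocket")
        if len(self.pocketed) >= self.size:
            return False          # pocket full; some pocketed item must be confirmed first
        self.pocketed[item_id] = provisional_response
        return True

    def confirm(self, item_id, final_response):
        """Submit a final answer; the item leaves the pocket permanently."""
        self.pocketed.pop(item_id, None)
        self.confirmed[item_id] = final_response

    def scoreable_responses(self):
        """Only confirmed (final) responses enter interim scoring and item selection."""
        return dict(self.confirmed)

    def expire(self):
        """At the time limit, any unconfirmed pocketed item is scored as incorrect."""
        for item_id in list(self.pocketed):
            self.confirm(item_id, 0)
```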
Insert Figure 1 about here
The IP method has several advantages over the restricted review approach. First, there is
no restriction on the number and range of items that can be revisited for changing responses. In
contrast to the restricted review approach, the IP method eliminates the need to break the test into
smaller, separately timed sections, and test takers can return to any item as long as the
item is in the item pocket. Even when a test taker must remove an item from the item pocket because it is
full and a new item needs to be added, she or he still has control over which item
to remove. From a psychological point of view, this is a significant
improvement. Test takers’ feeling of loss of control over the test, a major source of their anxiety
and stress (Stocking, 1997; Olea, Revuelta, Ximénez, & Abad, 2000), potentially may be
reduced with the IP method.
Another merit of the IP method is that test takers are not forced to provide an answer just
to move forward but instead can skip items simply by adding them to the item pocket (as many
as the item pocket size allows). The reduction in anxiety from not being forced to answer each item
before proceeding to the next would be one of the IP method’s possible direct psychological
benefits, but the method’s psychometric benefits are also worth mentioning. In the IP method,
test takers’ initial responses for the items in the item pocket, including skipped items, have no
impact on item selection. The CAT system excludes items in the item pocket when computing
interim score estimates. Therefore, CAT item selection is always based on test takers’ final
responses, and as a result, CAT’s level of adaptiveness can be retained effectively. Since test
takers cannot change their answers once they are finalized—in other words, once items are
removed from the item pocket with final answers—any attempt to apply the Kingsbury or GK
strategies becomes ineffective.
Compared with the restricted review approach or traditional CAT, the new IP method
offers clear potential psychological benefits to test takers, given the
greater degree to which it allows them to revisit items and change responses. The fact that
Kingsbury and GK strategies naturally become ineffective with the IP method also makes this
method appealing to test developers. As yet unknown, however, is whether the new IP method is
robust enough to immobilize other manipulative test-taking strategies, such as the Wainer (1993)
strategy. A series of simulation studies was conducted to examine the robustness of the IP
method against worst-case test-taking scenarios. The simulation studies also evaluated
the effect of item pocket size.
Simulation Study
Research Design
For the simulation, 500 items were chosen from a real operational item bank built for a
CAT program used for admissions to graduate-level educational programs. As shown in Table 1,
a minor correlation was observed between the a- and b-parameter values among the
items in the item pool. All items were in multiple-choice format with five answer options. The
items were calibrated using the three-parameter logistic model (3PLM); the summary statistics
for the item pool are reported in Table 1. Ten thousand simulated test takers were sampled from
a normal distribution with a mean of 0 and standard deviation of 1. Each test taker was
administered a fixed-length CAT with 40 items.
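For reference, the 3PLM response function underlying the calibration can be written as below; the scaling constant D (often 1.7 or 1.0) is not stated in the paper and is left general here.

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\{-D a_i (\theta - b_i)\}},$$

where $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing parameters of item $i$.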
Insert Table 1 about here
For the CAT administration, the maximized Fisher information method was used as an
item selection criterion, and the interim and final scores were estimated using the maximum
likelihood estimation (MLE) method. The score estimates were truncated to the range of
-3 to 3. The initial score estimate was randomly drawn from a uniform distribution that ranged
from -0.5 to 0.5. During the first five item administrations, the absolute value of the change in the
interim score estimates from one item to another was limited so as not to exceed 1.0. This
prevented fluctuations in item selection in CAT’s early stage. This restriction was particularly
important here because the use of the MLE method for score estimation could result in extreme
values when all responses are the same (i.e., all 0’s or all 1’s), as often occurs in the early stage of
CAT. In terms of item exposure control, the simulation was conducted under two different
conditions: (a) no exposure control, and (b) the Sympson and Hetter (1985) method. For the
exposure control using the Sympson and Hetter method, the target exposure rate was set to 0.20
and the exposure control parameter for each item was derived after 40 iterative simulations. Content
balancing was ignored to eliminate other extraneous factors and to make the implications from
the study as generalizable as possible.
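A compact sketch of how these administration rules could fit together is given below. This is an illustrative reconstruction under the stated settings (3PLM responses, MLE scoring truncated to [-3, 3], a 1.0 cap on interim-estimate changes over the first five items, maximum Fisher information selection, and a probabilistic Sympson-Hetter administration check); it is not the SimulCAT source code, the function names are hypothetical, and the scaling constant D = 1.7 is an assumption.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant

def p3pl(theta, a, b, c):
    """3PLM probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PLM item at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def mle_theta(responses, items, grid=np.linspace(-3.0, 3.0, 601)):
    """MLE of theta over a fine grid, which also enforces truncation to [-3, 3]."""
    loglik = np.zeros_like(grid)
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(grid, a, b, c)
        loglik += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return grid[np.argmax(loglik)]

def damp_early_change(prev_est, new_est, n_answered, max_change=1.0, first_n=5):
    """Cap the jump in interim estimates during the first few item administrations."""
    if n_answered <= first_n:
        return prev_est + np.clip(new_est - prev_est, -max_change, max_change)
    return new_est

def select_item(theta_hat, pool, administered, exposure_k=None, rng=None):
    """Pick the most informative unused item, optionally applying a
    Sympson-Hetter style probabilistic administration check."""
    order = sorted(
        (i for i in range(len(pool)) if i not in administered),
        key=lambda i: fisher_info(theta_hat, *pool[i]),
        reverse=True,
    )
    for i in order:
        if exposure_k is None or rng.random() < exposure_k[i]:
            return i
    return order[0] if order else None
```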
The IP method was implemented under three different conditions in terms of the item
pocket size (i.e., the maximum number of items that can be held in the item pocket at the same
time). The studied item pocket sizes were 2, 4, and 6. To serve as a baseline, the IP method also
was implemented under a no-item-pocket condition, essentially the same as a conventional CAT that
does not allow review and response change.
An unlikely worst-case scenario using the Wainer-like manipulative test-taking strategy,
as well as a more realistic scenario reflecting observations from the literature in a probabilistic
model, was simulated with the IP method in order to evaluate possible impacts of test-taker
review and response change on measurement precision. Under each scenario, the conditional
standard error of measurement (CSEM), computed as the mean absolute error of θ estimation, and the bias
across score levels at 0.5-unit intervals on the θ scale were evaluated along with IP usage.
The simulation was replicated 25 times and averaged. The CAT administration and simulation
were conducted using a modified version of SimulCAT, a comprehensive computer software
package for CAT simulation, written by Han (2012).
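Read literally, the two evaluation criteria can be written as follows (the binning of the θ scale into 0.5-unit intervals is as described above; the notation is a plausible restatement rather than a formula taken from the paper):

$$\mathrm{CSEM}(\theta_k) = \frac{1}{N_k} \sum_{j:\,\theta_j \in \theta_k} \bigl| \hat{\theta}_j - \theta_j \bigr|,
\qquad
\mathrm{Bias}(\theta_k) = \frac{1}{N_k} \sum_{j:\,\theta_j \in \theta_k} \bigl( \hat{\theta}_j - \theta_j \bigr),$$

where $\theta_k$ denotes a 0.5-unit interval on the θ scale, $N_k$ is the number of simulees whose true θ falls in that interval, and the results are averaged over the 25 replications.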
Simulation Study 1: Wainer-Like Test-Taking Strategy
To be comparable to earlier research, this study mimicked the ‘unrealistic worst case’
scenario used in Stocking (1997). The study simulated test takers following the Wainer strategy
to the extent possible within the IP system. To recap, the Wainer strategy assumes that test takers
intentionally keep their interim score estimates low by providing incorrect initial answers in
order to see more items that are easier than the test takers’ proficiency level. Supposedly this
gives them a better chance of answering those items correctly when they go back to change their
initial responses. Within the IP system, one expects the effect of the Wainer strategy to be
minimized because the item pocket size is limited and because the intentionally incorrect initial
responses for items in the item pocket do not influence item selection. Test takers might still be
able to make their interim score estimates negatively biased by postponing their answers to as
many easy items as the item pocket size allows, so the strategy’s impact on the final score
estimates should be examined. The study simulated this situation assuming that all test takers
would mechanically implement such a strategy throughout the test administration. This
manipulative test-taking strategy will be referred to in this paper as Test Taking Strategy 1
(TTS1).
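The exact decision rules used to mechanize TTS1 are not spelled out above, so the snippet below should be read only as one plausible operationalization: a simulee who defers (pockets) the easiest-looking items while space remains, so that their likely correct responses are withheld from the interim estimate, and who answers them at the very end. The function name and the rule for choosing which items to pocket are assumptions.

```python
# One plausible (assumed) operationalization of TTS1 within the IP system.
def tts1_decide(true_theta, item, pocket, pocket_size):
    """Pocket the item (withholding its likely-correct response from interim scoring)
    whenever it looks easy relative to the test taker and space remains; otherwise
    confirm an answer immediately."""
    a, b, c = item
    looks_easy = b <= true_theta            # assumed judgment rule, error-free here
    if looks_easy and len(pocket) < pocket_size:
        pocket.append(item)
        return "pocket"
    return "confirm_now"
```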
Simulation Study 2: More Typical Test-Taking Scenario With Item Pocket
Simulation Study 1 was designed to provide us with knowledge about the robustness of
the new IP method against test takers’ systematic attempts to employ the Wainer-like
manipulative test taking strategy (TTS1). Such a gaming strategy, however, is not the way the IP
method was intended to be used, nor was it expected to happen often in practice. According to
Vispoel et al. (2000), when examinees were allowed to review their responses, their most
frequently observed test-taking strategy could be described as follows: "I mark some of my
answers that I review later" (p. 34). This test-taking strategy varied little from the ways
examinees would respond to items on other tests (Vispoel et al., 2000). In fact, marking some
answers (or items) for later review involves essentially the same process as putting items in the
item pocket within the IP system. Although Vispoel et al. (2000) did not report which items were
frequently marked for later review, it would be reasonable to assume that, within the IP system,
examinees were likely to use the item pocket to set aside the items they found challenging (i.e.,
they were not confident about their answers). This allowed them a later opportunity to answer the
items instead of giving up on the items by locking in their initial answers. This test-taking
strategy, referred to in this paper as Test Taking Strategy 2 (TTS2), is a legitimate use of the IP
system (as opposed to the manipulative Wainer and Kingsbury gaming strategies) and, more
importantly, represents what it was designed for: providing examinees with a less restrictive
reviewing option. The second simulation study, therefore, was conducted to evaluate the
performance of the IP method under a more realistic situation with TTS2.
Simulating test takers’ reviewing behavior involves several strong assumptions, and so it
is important to make those assumptions as realistic as possible. Under TTS2, it was assumed that
test takers first evaluated the relative difficulty of each item against their proficiency level. If the
item difficulty was challenging given the test takers’ proficiency, it was assumed they would put
the item in the item pocket. For items that were not challenging according to this definition, test
takers were assumed to give their final answer and move on, not necessarily using the item
pocket. For purposes of the simulation, an item was viewed as challenging if its difficulty (b-parameter value) exceeded the test taker’s true proficiency (θ) by 0.5. In practice,
researchers often find large errors associated with a test taker’s ratings (or judgment) of item
difficulty, so it is important to incorporate such errors in the simulation algorithm in order to
ensure realistic results. Vispoel et al. (2005) observed that test takers showed success rates
between 46 percent and 61 percent when asked to compare the difficulty of a pair of items when
the difference in difficulty was less than 0.50. When the difference was larger than 0.50, the
average rating success was between 63 percent and 82 percent across the studied conditions
(Vispoel et al., 2005). In this study, test takers do not compare a pair of items but instead
evaluate the difficulty of an item using their proficiency as a baseline to decide whether or not
they want to put the item in the item pocket. If the difference between test takers’ proficiency
and the item difficulty was less than 0.50, test takers were simulated to find the item challenging
50 percent of the time. If the difference between test-taker proficiency and the item difficulty
were greater than or equal to 0.50, test takers were assumed to find the item challenging 70
percent of the time. If all item pockets were full and a test taker found the current item
challenging and was considering putting that item in the pocket, it was assumed that the test
taker compared the current item to the easiest items in the item pocket. If the test taker found an
item in the item pocket that was easier than the current item, the test taker was assumed to
finalize his or her answer to the easiest item and remove it from the item pocket to make room
for the current item. During the comparison between the easiest item in the item pocket and the
current item, test-taker error associated with difficulty ratings was also simulated in the same
way as the errors associated with the test taker’s decision on item pocket use (i.e., rating success
at a rate of 0.50 if the difference in difficulty between the easiest and the current items was less
than 0.50, and rating success at a rate of 0.70 if the difference was greater than or equal to 0.50).
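The TTS2 decision rules described in this paragraph can be summarized in the following sketch. It follows the stated probabilities (a 0.50 rating-success rate when the difficulty difference is below 0.50 and 0.70 otherwise) and the swap-with-the-easiest-pocketed-item rule; the helper names, the symmetric treatment of clearly easy items, and the handling of ties are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2011)  # arbitrary seed for reproducibility

def judged_challenging(theta, b, threshold=0.5):
    """Probabilistic judgment of whether an item is 'challenging' (b - theta >= 0.5).
    The judgment matches the truth with probability .70 when the true difference is
    at least 0.5 in absolute value and .50 otherwise (an interpretation of the text)."""
    truly_challenging = (b - theta) >= threshold
    success = 0.70 if abs(b - theta) >= threshold else 0.50
    return truly_challenging if rng.random() < success else not truly_challenging

def judged_easier(b_candidate, b_current):
    """Noisy comparison of the easiest pocketed item against the current item,
    using the same .50/.70 rating-success rates."""
    truly_easier = b_candidate < b_current
    success = 0.70 if abs(b_candidate - b_current) >= 0.5 else 0.50
    return truly_easier if rng.random() < success else not truly_easier

def tts2_decide(theta, item, pocket, pocket_size):
    """TTS2: pocket items judged challenging; if the pocket is full, confirm and
    remove the easiest pocketed item when it is judged easier than the current one."""
    if not judged_challenging(theta, item["b"]):
        return {"action": "confirm_current"}
    if len(pocket) < pocket_size:
        return {"action": "pocket_current"}
    easiest = min(pocket, key=lambda it: it["b"])   # 100% accurate identification assumed
    if judged_easier(easiest["b"], item["b"]):
        return {"action": "confirm_easiest_then_pocket_current", "easiest": easiest}
    return {"action": "confirm_current"}
```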
Simulation Study 2 was designed using research findings based on real empirical test data
(Vispoel et al., 2005; Vispoel et al., 2000) to account for test-taker errors in item difficulty rating.
Despite these adjustments, test-taker behavior observed in this simulation of the IP system was
still considerably exaggerated because the simulation study imposed no test time limit.
In other words, the simulation did not take into account any speededness or test-taker fatigue. In
addition, test takers in the simulation were modeled to take as much time as needed to review the
items using the IP system. In most operational real-world CAT programs, however, tests are
strictly timed, often slightly speeded for some test takers, and severe penalties are applied to final
scores (according to the number of omitted items) if test takers do not complete all items within
the time limit.
A major benefit of this study’s research design using “worst case” scenarios—essentially
an extreme stress test of the new IP method—was the implication that results would be even
more useful in real CAT situations because the findings would indicate the bottom-line
performance of the new method. Therefore, this paper will present a comprehensive discussion
that will include both the "worst-case" results and what would likely happen in real-world
settings. It is important to keep in mind that the simulation study cannot measure various
psychological effects of the IP method on real human test takers, and the main purpose of this
simulation study was to understand and determine if negative impacts of the IP method on the
measurement efficiency could be controlled to an acceptable level for real CAT programs even
under the worst conditions.
Results and Discussion
As mentioned, the IP method simulation was conducted under two different exposure
control conditions (no exposure control vs. Sympson and Hetter method), and no meaningful
difference was observed in terms of CSEM, bias, and IP usage between the two exposure control
conditions. Therefore, this paper presents only the cases with the Sympson and Hetter exposure
control.
Under TTS1, which essentially was a modification of the Wainer strategy for the IP
system, the CSEM and bias displayed in Figure 2 showed either no change or minimal changes
with the IP system compared to those from the CAT with no response change condition (the
condition with IP size = 0). The increase in CSEM due to the IP system was less than 0.10
throughout most of the θ range (-2.5 ≤ θ < 2.5) even when the IP size was 6, which was the
largest IP size in this study. This was similar to findings observed with Stocking’s (1997)
restricted model 1 (with the limit of two response changes) and model 2 (with four or more
separately timed sections). In terms of the average conditional bias in the score estimates (seen in
the middle of Figure 2), test takers did not achieve any meaningful positive gain by
implementing this Wainer-like strategy. In fact, higher proficiency test takers (θ > 0.5) tended to
have slightly underestimated scores with this manipulative test-taking strategy because, as Wise
(1996) pointed out, final score estimates could drop substantially if a test taker failed to respond
to all items correctly when attempting the Wainer strategy. Therefore, it seems safe to conclude
that the IP method was very robust against test takers’ attempts to implement the Wainer strategy
even under a worst-case scenario. The total number of items that test takers of each proficiency
level placed in the item pocket (i.e., IP usage) is shown in Figure 2.
Insert Figure 2 about here
In Simulation Study 2, where it was assumed test takers would use the IP system
systematically for challenging items (i.e., implementing TTS2), the change in the mean absolute
error (MAE) for θ estimates due to the IP system was significant but the magnitude was minor—
the MAE increased by about 0.069, 0.083, and 0.087 when the IP size was 2, 4, and 6, respectively.
Considering a typical standard error of θ estimation, which is often between 0.30 and 0.40 for
many CAT programs, the increase in the MAE due to the use of the IP system, which was well
below 0.10, seemed acceptable. Increases in the average bias in θ estimates due to the use of the
IP system were 0.057, 0.075, and 0.080 when the IP size was 2, 4, and 6, respectively. Looking at the
observed patterns of the MAE and bias together across the different IP sizes, it was apparent that
the increase in MAE was due mainly to the systematic bias in the θ estimates. Although both
MAE and bias statistics showed the impact of the IP system was negligible on average, it is also
important to evaluate CSEM and bias across θ levels.
Figure 3 shows the details of the CSEM and conditional bias. The pattern of CSEM was
similar to the pattern of conditional bias because, as noted earlier, the increase in CSEM was
mainly due to the systematic bias. For higher proficiency test takers (θ > 0.5), TTS2 was
ineffective in gaining any meaningful score improvement. On the other hand, for test takers at
lower proficiency levels, for example, in the θ < -0.5 range, a saturated, excessive use of the IP
system with TTS2 could result in a positive score bias. Since the result suggested a possible
vulnerability of the IP system against an unrealistic case of TTS2, it was important to understand
its implications in real-world situations.
Insert Figures 3 and 4 about here
Because item pocket usage (i.e., the average number of items put in the item pocket),
shown at the bottom of Figure 3, did not necessarily indicate the time and effort test takers would
expend taking advantage of the IP system under TTS2, this study analyzed the average
conditional frequency of item review processes that included comparing the difficulty of each
new challenging item to the easiest preexisting item in the item pocket. As shown in Figure 4,
the higher the test-taker proficiency level, the less time was likely needed to revisit items in the
item pocket. There were two main reasons for this. First, given the initial θ value around zero,
test takers with below average proficiency were likely to see more items of challenging difficulty
and, hence, use the IP system more frequently. Second, test takers with extremely high
proficiency (for example, θ > 1.5) saw fewer items of challenging difficulty, and hence used the
IP system less frequently than test takers of below average proficiency. They had no need to
compare new challenging items with preexisting items in the item pocket because there often
was no item in the item pocket that needed to be removed to make room for a new, harder item.
As a result, higher proficiency test takers reviewed items less frequently under TTS2. For
example, for highly proficient individuals (θ > 1.5), the total item review load amounted to fewer
than two occasions even when the item pocket size was six. For real CAT programs, test time
limits often are set at a level at which about 80 percent to 90 percent of average test takers can
finish the last item (Talento-Miller, Guo, & Han, 2010). This reduces the amount of wasted time
for both the CAT system and test takers alike while minimizing possible speededness. In such a
case, it is not unusual for highly proficient test takers to have a decent amount of time left to
review the small number of items in the item pocket as needed.
Although TTS2 could be a feasible strategy for high-proficiency groups in real, timed
CAT administrations, use of the IP system would not necessarily result in positively biased
scores. As seen in the middle of Figure 3, the bias in final θ estimation was negligible for
highly proficient test takers. On the other hand, for groups at lower proficiency levels (for
example, θ < -1.0), use of the TTS2 strategy theoretically could result in slightly biased scores.
Based on the average loads of review tasks that these test takers would need to process, however,
it appears TTS2 would be an extremely unrealistic strategy for most CAT programs with test
time limits (Figure 4). Although it was assumed in the simulation for TTS2 that each new
challenging item would be compared only to the easiest item in the item pocket, in reality test
takers likely will need to revisit not only the easiest but several (if not all) items in the item
pocket. This is because items in the item pocket are not ordered by item difficulty and test takers
would need to determine which was the easiest among them. Determining the easiest item, which
was not addressed in this simulation, could add a significant load to the item review process in
practice, making the actual item review load much heavier than what was shown in Figure 4.
Essentially, it means test takers at lower proficiency levels might spend more time and effort
analyzing item difficulties than solving the problems in order to obtain a final score with a
meaningfully positive bias.
Theoretically, it is possible for test takers to review items many times if test time is
unlimited. On the other hand, if time is unlimited, an option to review and change responses later
would not be necessary (or desired) for CAT in the first place. In such a situation, test takers
could spend as much time on each item as needed, thus minimizing any test anxiety resulting
from having to rush to the next item. Therefore, it is reasonable to assume that the IP system is
most likely to be employed and useful for CAT programs that are strictly and tightly timed, as is
the case for most operational real-world CAT administrations. Under this assumption, typical test
takers would have only a fraction of test time left for reviewing items within the IP system.
Hence, a strategy such as TTS2 that test takers must mechanically implement would be
impossible to complete in real time. Knowing that many CAT programs apply severe penalties
on final scores if test takers fail to complete all items within a set time limit, test takers would be
discouraged from pushing TTS2 to the extreme—it would do more harm than good to their final
scores.
Aside from the test time limit issue, it is important to understand that analyzing and
comparing item difficulties to game a CAT system is a complicated and difficult process, one
with a fairly poor success rate observed even for highly proficient test takers (Vispoel, 1998;
Vispoel et al., 2002; Olea et al., 2000; Wise et al., 1999). Ironically, test takers who truly needed
to review items most often under TTS2 were those at the lowest proficiency level, and the
quality of their performance in analyzing item difficulties in the real world is expected to be very
poor, unlike the simulated study conditions in which all test takers performed the item difficulty
analyses with 50% to 70% accuracy. Thus, it is extremely unlikely that the magnitude of score
bias observed with low proficiency test takers in this simulation of the IP system could be
replicated in the real world, even with unlimited test time.
Most test takers would not benefit from practicing TTS2 in typical, timed CAT programs,
yet some extreme scenarios are worth considering. As shown at the bottom of Figure 3,
test takers with extremely low proficiency (for example, θ = -2.0) tended to put fewer than three
items into the item pocket during the test when the IP size was two. The items they mostly
placed in the item pocket were those administered at the beginning of CAT testing. For test
takers of extremely low proficiency, the first few items, selected based on the initial (randomly
chosen) θ around 0, were the most difficult among all administered items. Because the interim
score estimates for these low-proficiency test takers quickly declined from the initial θ, the first
few items placed in the item pocket mostly remained the same throughout the CAT
administration under TTS2. Therefore, test takers with very low proficiency levels may benefit
by being coached (for example, by test prep institutions) to simply put the first few items into the item
pocket, answer all remaining items directly, and skip comparing the new items with those in the
item pocket. Once the test takers reach the last item, they can revisit the first few items in the
item pocket and submit their final answers. Such a manipulative modification of TTS2, however,
would only yield a negligible positive bias (< 0.2 when IP size = 2) in the extremely low θ region,
which is far from the main population of consideration for most CAT programs. If test takers at
the bottom proficiency level are important for a CAT program, then one could suggest lowering
the initial θ value for item selection to reduce possible bias even further.
Conclusion
Because of its efficiency, CAT has become widely accepted in the field of educational
measurement; however, test takers’ dissatisfaction over not being allowed to review and change
their responses (Baghi, Ferrara, & Gabrys, 1991; Legg & Buhr, 1992; Wise, 1996; Vispoel,
1998) has not yet been satisfactorily addressed. On the one hand, reducing unnecessary test
anxiety for test takers by allowing them to review and change their responses during a CAT
administration is believed by some to have a positive effect on test validity (Papanastasiou, 2002;
Stocking, 1997; Olea et al., 2000). On the other hand, the trade-off in CAT efficiency is simply
unacceptable for most operational CAT programs, especially due to the possibility that test takers
would attempt to game the CAT system (Wainer, 1993; Wise, 1996). Stocking’s (1997) restricted
review models provided effective means of controlling the impact of some
manipulative test-taking strategies such as the Wainer strategy while offering test takers a limited
ability to review.
The new IP method presented in this study aimed to reduce the restrictions in reviewing
even further but at the same time improve the robustness of the CAT system against
manipulative test-taking strategies. With the IP system, test takers can go back and forth across
sections to review items in the item pocket, and test developers do not need to time each section
separately. The simulation result (with TTS1) showed that the IP method was as robust against
the Wainer strategy as Stocking’s restricted review models. Also, unlike the restricted review
models, the IP method is inherently immune to the Kingsbury and GK strategies since item
selection is not influenced by items in the item pocket. The simulation study under TTS2
revealed that test takers with above-average proficiency would not benefit from saturated,
excessive use of the IP system. The simulation did reveal the possibility of slight score biases (<
0.2 when IP size = 2) for test takers with very low proficiency (θ = -2) when abnormally
excessive use of the IP system (TTS2) occurred. The unrealistic nature of the simulation
condition (no time limit, no fatigue, and 100% accuracy in determining the easiest item),
however, combined with the huge item review loads test takers would have to process to
achieve such gains, leaves little practical chance of their being realized in most operational CAT programs
that are tightly timed. Given the evidence of the IP method’s robustness against the Wainer,
Kingsbury, and GK strategies as well as excessive uses of the IP system seen in TTS2, it is
highly unlikely that test takers will be tempted to waste their time and effort on such
manipulative test-taking strategies. Even in the worst case under the studied conditions, the
sacrifice in measurement efficiency was equivalent to about one or two items at the extremely low
proficiency levels, and some CAT programs may be willing to accept such a tradeoff to provide
test takers (or clients) with a better testing experience, especially when the test market is customer
(i.e., test taker or client) driven.
The process for determining a proper IP size is not straightforward but is critical
because it ultimately decides the IP system’s flexibility and affects measurement efficiency. If
the IP size is too small, test takers’ feelings of a loss of control over the test may persist because
their ability to review items is too limited. On the other hand, if the IP size is too large, the CAT
optimality would decrease because items in the item pocket would not contribute information
that is used for item selection. The size of the item pocket, however, is not the only factor that
determines IP usage. For example, if a CAT is tightly timed, test takers may not necessarily use
the item pocket up to its size limit because they know they will not have enough time near the
end of the test to review items in the item pocket if there are too many. So test length and test
time limit should be considered together when determining the IP size.
Although this study mainly focused on evaluating the possible negative impacts of the IP
method on measurement accuracy under worst-case scenarios, it is noteworthy that there could
be several possible positive effects of the IP method in terms of the measurement accuracy. First,
providing an option to review items may help test takers reduce their test anxiety levels, even if
they never exercise that option (Vispoel et al., 2000; Olea et al., 2000). With reduced anxiety,
test takers are expected to perform as they otherwise would, with fewer mistakes, and, as a
result, an item review option could eventually reduce measurement errors (Papanastasiou, 2002).
Second, with a well-chosen IP size, the IP system may help test takers manage and use time
more wisely during CAT administration. One mistake test takers often make during a test is
spending too much time on a few items and then rushing through the rest of the test. This can
contribute to a large (often unobservable) measurement error because the speededness could
influence their whole performance on the rest of the items. An IP system, however, would allow
test takers to skip, right at the outset, items they think will take a long time to answer. Even if a
test taker finds, in the middle of problem solving, that the current item is taking too long to finish, he or
she can put the item into the item pocket and move on to the next item. Since test takers always
can revisit the items in the item pocket and restart from where they left off, they would not need
to gamble on whether to give up and guess or spend more time on the current item in which they
have already invested a sizable chunk of time. The IP method can enable flexibility in test time
management to help test takers minimize unintended speededness during CAT, and, as a result,
may reduce related measurement errors. These possible positive impacts of the IP method could
not be investigated in this study because of the limitations of simulation, but it is strongly
suggested that future studies conduct an in-depth examination of the psychological and
psychometric effects of the IP method using real empirical data.
The simulation conditions in this study mainly reflected a fixed-length CAT program
for a high-stakes exam. Because even minor changes in the item selection algorithm, item
bank, estimation method, and test length can make huge differences, the findings of this study
should not be imprudently generalized. Examining the impact of the use of item pockets on the
test length when CAT administration is terminated based on estimation precision would be an
interesting topic for examination in future studies.
The primary goal of the IP method was not necessarily to give test takers a better chance
at improving their scores but to create a less restrictive testing environment, allowing them more
control over a CAT administration so that they can perform undistracted without unnecessary
test anxiety. Based on the findings of this study, the new IP method appears to offer a
promising solution for doing just that in many CAT programs.
References
Baghi, H., Gabrys, R., & Ferrara, S. (1991, April). Applications of computer-adaptive testing in
Maryland. Paper presented at the annual meeting of the American Educational Research
Association, Chicago, IL.
Benjamin, L. T., Cavell, T. A., & Shallenberger III, W. R. (1984). Staying with initial answers
on objective tests: Is it a myth? Teaching of Psychology, 11, 133–141.
Bowles, R., & Pommerich, M. (2001, April). An examination of item review on a CAT using the
specific information item selection algorithm. Paper presented at the annual meeting of
the National Council on Measurement in Education, Seattle, WA.
Gershon, R., & Bergstrom, B. (1995, April). Does cheating on CAT pay: Not! Paper presented at
the annual meeting of the American Educational Research Association, San Francisco,
CA.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical
guidelines for assessing computerized adaptive tests. Journal of Educational
Measurement, 21, 347–360.
Han, K. T. (2012). SimulCAT: Windows software for simulating computerized adaptive test
administration. Applied Psychological Measurement, 36(1), 64–66.
Kingsbury, G. G. (1996, April). Item review and adaptive testing. Paper presented at the annual
meeting of the National Council on Measurement in Education, New York.
Legg, S. M., & Buhr, D. C. (1992). Computerized adaptive testing with different groups.
Educational Measurement: Issues and Practice, 11, 23–27.
Lunz, M. E., Bergstrom, B. A., & Wright, B. D. (1992). The effect of review on student ability
and test efficiency for computerized adaptive tests. Applied Psychological Measurement,
16(1), 41–51.
Martineau, J., & Dean, V. (2010). How a state might benefit from computer-based assessment
and how to solve problems with its implementation from the view point of the State. Paper
presented at the annual meeting of the Maryland Assessment Conference, College Park,
MD.
Olea, J., Revuelta, J., Ximénez, M. C., & Abad, F. J. (2000). Psychometric and psychological
effects of review on computerized fixed and adaptive tests. Psicológica, 21, 157–173.
Papanastasiou, E. C. (2002, April). A ‘rearrangement procedure’ for scoring adaptive test with
review options. Paper presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA.
Poggio, J. (2010). History, current practice, and predictions for the future of computer based
assessment in K-12 education. Paper presented at the annual meeting of the Maryland
Assessment Conference, College Park, MD.
Stocking, M. L. (1997). Revising item responses in computerized adaptive tests: A comparison
of three models. Applied Psychological Measurement, 21(2), 129–142.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized
adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp.
973–977). San Diego, CA: Navy Personnel Research and Development Center.
Talento-Miller, E., Guo, F., & Han, K. T. (2010, July). Examining test speededness by native
language. Paper presented at the annual meeting of the International Test Commission,
Hong Kong, China.
Vispoel, W. P. (1998). Reviewing and changing answers on computer-adaptive and self-adaptive
vocabulary tests. Journal of Educational Measurement, 35(4), 328–345.
Vispoel, W. P., Clough, S. J., & Bleiler, T. (2005). A closer look at using judgments of item
difficulty to change answers on computerized adaptive tests. Journal of Educational
Measurement, 42(4), 331–350.
Vispoel, W. P., Clough, S. J., Bleiler, T., Hendrickson, A. B., & Ihrig, D. (2002). Can examinees
use judgments of item difficulty to improve proficiency estimates on computerized
adaptive vocabulary tests? Journal of Educational Measurement, 39(4), 311–330.
Vispoel, W. P., Hendrickson, A. B., & Bleiler, T. (2000). Limiting answer review and change
on computerized adaptive vocabulary tests: Psychometric and attitudinal results. Journal
of Educational Measurement, 37(1), 21–38.
Vispoel, W. P., Rocklin, T. R., Wang, R., & Bleiler, T. (1999). Can examinees use a review
option to obtain positively biased ability estimates on a computerized adaptive test?
Journal of Educational Measurement, 36(2), 141–157.
Waddell, D. L., & Blankenship, J. C. (1995). Answer changing: A meta-analysis of the
prevalence and patterns. The Journal of Continuing Education in Nursing, 25, 155–158.
Wainer, H. (1993). Some practical considerations when converting a linearly administered test to
an adaptive format. Educational Measurement: Issues and Practice, 12, 15–20.
Wang, M., & Wingersky, M. (1992, April). Incorporating post-administration item response
revision into a CAT. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco, CA.
Wise, S. L. (1996). A critical analysis of the arguments for and against item review in
computerized adaptive testing. Paper presented at the annual meeting of the National
Council on Measurement in Education, New York.
Wise, S. L., Finney, S. J., Enders, C. K., Freeman, S. A., & Severance, D. D. (1999). Examinee
judgments of changes in item difficulty: Implications for item review in computerized
adaptive testing. Applied Measurement in Education, 12, 185–198.
Table 1
Descriptive Statistics for the Item Pool (500 Items)
Item parameter   Mean    SD      Pearson correlation
                                 a       b       c
a                0.795   0.272   1       0.260   0.085
b                0.404   1.132   0.260   1       0.057
c                0.166   0.065   0.085   0.057   1
Figure 1. Example of a Test Interface for a CAT With the Item Pocket Method
Is m divisible by 12?
(1) m is divisible by 3
(2) m is divisible by 4
[] Statement (1) ALONE is sufficient, but statement (2) alone is not sufficient to answer the question asked.
[] Statement (2) ALONE is sufficient, but statement (1) alone is not sufficient to answer the question asked.
[] BOTH statements (1) and (2) TOGETHER are sufficient to answer the question asked, but NEITHER
statement ALONE is sufficient to answer the question asked.
[] EACH statement ALONE is sufficient to answer the question asked.
[] Statements (1) and (2) TOGETHER are NOT sufficient to answer the question asked, and additional data
specific to the problem are needed.
Figure 2. Conditional Bias, CSEM, and IP Usage Under TTS1
Figure 3. Conditional Bias, CSEM, and IP Usage Under TTS2
Figure 4. Frequency of Test Taker Revising (Easiest) Item Under TTS2