Running head: Item Pocket Method to Allow Response Changing

Item Pocket Method to Allow Response Review and Change in Computerized Adaptive Testing

Kyung T. Han
Graduate Management Admission Council®

Correspondence may be sent to: Kyung T. Han, Graduate Management Admission Council®, 11921 Freedom Dr., Suite 300, Reston, VA 20190, USA; [email protected]; (Phone) +1-703-668-9753; (Fax) +1-703-668-9601

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of the Graduate Management Admission Council®.

Acknowledgements: The author wishes to thank Lawrence M. Rudner, Fanmin Guo, and Eileen Talento-Miller of the Graduate Management Admission Council® (GMAC®) for their feedback and support. The author also is grateful to Ernie Anastasio of GMAC®, Hua-Hua Chang of the University of Illinois, Wim J. van der Linden of CTB/McGraw-Hill, and Paula Bruggeman of GMAC® for their review and valuable comments. This is a revision of the original study that was presented at the 2011 annual meeting of the National Council on Measurement in Education (NCME). The paper received the Alicia Cascallar Award for an Outstanding Paper by an Early Career Scholar from NCME.

Abstract

Most computerized adaptive testing (CAT) programs do not allow test takers to review and change their responses because doing so could seriously deteriorate the efficiency of measurement and make tests vulnerable to manipulative test-taking strategies. Several modified testing methods have been developed that provide restricted review options while limiting the trade-off in CAT efficiency. The extent to which these methods provided test takers with options to review test items, however, was still quite limited. This study proposes the item pocket (IP) method, a new testing approach that allows greater flexibility in changing responses by eliminating unnecessary restrictions that prevent test takers from moving across test sections to review their answers. A series of simulations was conducted to evaluate the robustness of the IP method against various manipulative test-taking strategies. Findings and implications of the study suggest that the IP method could be an effective solution for many CAT programs when the IP size and test time limit are properly set.

Keywords: computerized adaptive testing, test construction, test administration, response change

Item Pocket Method to Allow Response Review and Change in Computerized Adaptive Testing

Computerized adaptive testing (CAT) is rapidly becoming a popular choice for test administration, often favored by test developers and test takers alike (Poggio, 2010; Martineau & Dean, 2010) because it delivers more accurate score estimates and requires relatively less testing time than conventional paper-and-pencil-based tests (PBTs). Because of the item selection algorithm that CAT programs employ, however, CAT usually does not allow test takers to review test items and/or change their responses as they can on PBTs, since it relies on an interim score estimate that is updated after each item administration.
Previous research has shown that having an opportunity to change their responses can help test takers reduce their anxiety and stress levels during test administration (Wise, 1996; Stocking, 1997; Lunz, Bergstrom, & Wright, 1992; Papanastasiou, 2002), which can result in fewer mistakes made during testing and test scores that may more accurately reflect test takers’ true proficiency (Papanastasiou, 2002). This especially can be the case when the test is high stakes (Stocking, 1997) and/or tightly timed. Test takers almost always prefer to have the option to change responses even if they do not always exercise that option (Vispoel, Henderickson, & Bleiler, 2000). More importantly, research shows that only a fraction of test takers actually benefit from response changing in terms of score improvement (Benjamin, Cavell, & Shallenberger, 1984; Waddell & Blankenship, 1995; Wise, 1996). Eliminating unnecessary causes of a test taker’s anxiety and stress during CAT administration is an important step toward creating an ideal test environment. Reducing test takers’ anxiety and stress by giving them sufficient control over the test administration is Item Pocket Method to Allow Response Changing 4 critically important because a high level of test anxiety can contribute to increased measurement errors: Providing test takers with a better testing experience is, educationally and morally, the right thing to do. Manipulative Test-Taking Strategies One of the main practical objections to allowing for response changing in CAT programs is that imprudently allowing test takers to change responses could open the door to systematic test-taking strategies that could render CAT administration less efficient (Wainer, 1993; Wise, 1996) and/or result in biased score estimates (Vispoel, Rocklin, Wang, & Bleiler, 1999). Wainer Strategy For the manipulative test-taking strategy that Wainer (1993) introduced, a test taker could intentionally answer all items incorrectly on the first round to make the CAT system administer only easy items. After that, the test taker could go back to each item to review and change his or her responses to get a perfect score on the second round. Simulation studies confirmed that this so-called “Wainer strategy” could result in extremely large measurement errors with a considerable risk for positively biased score estimates, especially for high-proficiency test takers who are capable of implementing the strategy successfully (Wang & Wingersky, 1992; Gershon & Bergstrom, 1995; Stocking, 1997; Vispoel et al., 1999; Bowles & Pommerich, 2001). In practice, the Wainer strategy is difficult to implement successfully—test takers run the risk of getting significantly underestimated scores if they fail to respond to all items correctly on the second round (Wise, 1996). Notwithstanding the risks, studies with live data showed that highproficiency test takers had a good chance to profit from the Wainer strategy (Vispoel et al., 1999; Wise, 1996). Item Pocket Method to Allow Response Changing 5 Kingsbury Strategy Another manipulative test-taking strategy for CAT involves examinee judgment on item difficulty. As pointed out by Green, Bock, Humphreys, Linn, and Reckase (1984), test takers, knowing that the current item’s difficulty depends heavily on the response made to the previous item, could use the difficulty of the current item as a clue to the correctness of their response for the earlier item. 
If a test taker thinks the current item’s difficulty is slightly higher than the previous item, it would be reasonable for the test taker to think his or her answer for the previous item was correct. On the other hand, if the current item seems easier than the previous item, it may indicate that the previous item was answered incorrectly, and thus the test taker can go back to the previous item and change the answer. In his simulation study, Kingsbury (1996) modeled such a test-taking strategy based on two strong assumptions: (a) test takers were assumed to make guesses if they saw an item whose difficulty was higher than their true proficiency by 1.0 theta unit or more, and (b) test takers also were assumed to go back to previous item and change their responses if the next item’s difficulty was lower than the previous item by 0.5 theta unit or more. The simulation result found that examinees could benefit from this test-taking strategy depending on their proficiency level—the lower the examinees’ proficiency levels, the greater the possible benefits they could see. Generalized Kingsbury Strategy Wise, Finney, Enders, Freeman, and Severance (1999) expanded Kingsbury’s test-taking strategy so that test takers were assumed to have speculated on the difficulty level of the next item not only for items with guessed responses, but for all previous items. Wise et al. (1999) called this the “generalized Kingsbury” (GK) strategy and simulated it in more realistic conditions where probabilities for correctly judging item difficulties derived from real data were Item Pocket Method to Allow Response Changing 6 less than 1.0. The simulation results suggested small possible benefits to test takers from using the Kingsbury strategy but the strategy offered no meaningful improvement in score estimates. Vispoel, Clough, Bleiler, Hendrickson, and Ihrig (2002) used live testing data to examine the effect of the Kingsbury and GK strategies on score estimates under various conditions and found that for a majority of test takers in the study the Kingsbury and GK strategies were ineffective. This was mainly because the accuracy of test takers’ item-difficulty rating was much lower than what Kingsbury (1996) and Wise et al. (1999) assumed: Test takers were only 61% successful in distinguishing the difficulty difference within each item pair. The study did find, however, that some test takers were still able to improve their scores using either one of these strategies. CAT With Restricted Revision Options To minimize the effect of the manipulative test-taking strategies while still providing test takers with reasonable options to review and change their responses, Stocking (1997) proposed three different models for limiting test takers’ practices in item review and response change. In the first model, test takers were allowed to change their responses at the end of the test, but were limited to a maximum number of revisions. According to her simulation results, Stocking’s model effectively reduced the impact of the Wainer strategy, and the conditional standard error of measurement (CSEM) and bias were close to those with a zero-revision condition when the number of revisions was limited to 2 out of 28 test items. When the number of allowable revisions was greater than two, however, (in Stocking, 1997, the studied conditions were 0, 2, 7, 14, and 28 revisions for the 28-item test) Stocking’s model failed to control the effect of the Wainer strategy. 
Item Pocket Method to Allow Response Changing 7 In Stocking’s second model, the test consisted of multiple separately timed sections, and test takers were allowed to revise their responses freely within each section. The simulation results showed that administering the test in two separate sections substantially reduced the effect of the Wainer strategy on the CSEM and bias. When the item revision option was available with four or more separated sections, the effect of the Wainer strategy was almost completely negated. Stocking’s third model, in which test takers were allowed to revise responses only within each item set associated with the common stimulus, showed its robustness against the Wainer strategy as well. The main disadvantage of Stocking’s third model, however, was that test takers were not allowed to revise responses for discrete (i.e., non-set) items (Stocking, 1997). In their study using live test data, Vispoel et al. (2000) confirmed that Stocking’s second model (the restricted review within each block) successfully reduced the possible effect of the Wainer strategy. In addition, the majority of examinees (98.4%) in this study felt they had adequate opportunity to review and change their responses. In the latter studies by Vispoel and his colleagues, which also used live test data, the Kingsbury and GK strategies proved to be ineffective when the restricted review was permitted (Vispoel et al., 2002; Vispoel, Clough, & Bleiler, 2005). Reviewing the overall results from Stocking (1997) and Vispoel et al. (2000, 2002, & 2005), the restricted review approach (especially Stocking’s second model) seems to be one of several solutions that could allow test takers to change their response during CAT to a certain degree without incurring unacceptable levels of sacrifice in measurement precision and efficiency. A serious limitation of the restricted review approach, however, is that allowing test takers to review and change their responses only within each section eventually causes them to Item Pocket Method to Allow Response Changing 8 involuntarily surrender access to items in the current section in order to proceed to the next section. If test sections are timed strictly and separately from each other as Stocking (1997) originally proposed, test takers would not necessarily feel pressure to move on to the next section before the section time expired. In a majority of operational CAT programs, however, small test sections (or item sets) usually are not timed separately. As a result, every time test takers finish one section (unless it is the last section), they must decide whether it is better to spend time revisiting test items in the current section and improving their initial responses or use their time to complete the remaining test sections. Being forced repeatedly to make such a decision, not knowing exactly how much time will be needed to complete the remaining test sections, very likely causes test takers the same kind of test anxiety as that observed among test takers in a regular CAT administration with no response revision option. Another downside of the restricted review approach is that it still does not allow test takers to skip items (unless the items within each section are not adaptively administered). If test takers want to proceed further, even within one section, they must answer every item because the CAT program selects the following item based on the test taker’s initial response to the current item. 
The inability to skip items may not be terribly bothersome for test takers because they can simply answer randomly and move on to the next item, knowing they can return to the item before advancing to the next section. In terms of measurement efficiency, however, an item selection process that is based heavily on initial responses that do not necessarily reflect test takers’ best effort could seriously erode CAT’s level of adaptiveness. Moreover, some test takers might try different initial responses to find clues on correct response based on the difference in item difficulty between item pairs, which was the source of concern for Kingsbury (1996) and Wise et al. (1999). Vispoel et al. (2002, 2005) observed no meaningful gain from practicing Item Pocket Method to Allow Response Changing 9 Kingsbury and GK strategies in their CATs with the restricted review option, but their finding was based on results from low-stakes exams. In high-stakes CAT programs, test takers might be tempted to follow the Kingsbury or GK strategies with the intent of improving their scores, and CAT with the restricted review approach technically is vulnerable to successful implementations of either of these strategies. Item Pocket Method To address the shortcomings of the restricted review approach, this study proposes a new approach for allowing response change. This method, which hereafter will be referred to as the “item pocket” (IP) method, provides test takers with item pockets into which they can place items for later review and response change. Test takers can skip answering items by putting them in the item pocket. Once an item is placed in the item pocket, a test taker can go back to it anytime during the test until the test taker submits his or her final answer for the item. For example, in the CAT interface shown in Figure 1, a test taker is reviewing Item 4 among the three items in the item pocket. If a test taker wants to take Item 4 out of the item pocket, she or he ‘confirms’ the final answer for it. Once removed from the item pocket, Item 4 cannot be placed back in. Test takers must confirm final answers for any items in the item pocket in order to empty the item pockets before the test time expires or face the prospect that any remaining items in the item pocket will be counted as incorrect responses. CAT developers can determine the item pocket size based on the test length and time limit (more discussion on the item pocket size follows later in this paper). The item pocket size was five in the example shown in Figure 1. As for CAT item selection, only items outside the pocket (in other words, items with final responses) are included in the interim score estimation procedure. Item Pocket Method to Allow Response Changing 10 Insert Figure 1 about here The IP method has several advantages over the restricted review approach. First, there is no restriction on the number and range of items that can be revisited for changing responses. In contrast to the restricted review approach, the IP method eliminates the need for CAT sections to break into smaller separately timed sections, and test takers can return to any item as long as the item is in the item pocket. Even if a test taker removes an item from the item pocket because it is full and the test taker wants to add a new item, she or he would still have control over which item to remove from the item pocket. From a psychological point of view, this is a significant improvement. 
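Before turning to these psychological and psychometric considerations in more detail, the pocket rules just described can be summarized in a short sketch (Python; the class and method names are illustrative and not part of the original proposal): a capacity limit, free revision of any pocketed item, one-way confirmation of final answers, and incorrect scoring of any items left in the pocket when time expires.

```python
class ItemPocket:
    """Minimal sketch of the IP bookkeeping described in the text (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity          # e.g., 5 in the Figure 1 example
        self.pocketed = {}                # item_id -> provisional response (None if skipped)
        self.confirmed = {}               # item_id -> final response; these feed interim scoring

    def add(self, item_id, provisional_response=None):
        """Skip or defer an item by placing it in the pocket (if room remains)."""
        if item_id in self.confirmed:
            raise ValueError("A confirmed item cannot be returned to the pocket.")
        if len(self.pocketed) >= self.capacity:
            return False                  # pocket full; something must be confirmed first
        self.pocketed[item_id] = provisional_response
        return True

    def revise(self, item_id, response):
        """Change the provisional answer of any item still in the pocket."""
        if item_id in self.pocketed:
            self.pocketed[item_id] = response

    def confirm(self, item_id, response):
        """Lock in a final answer; the item leaves the pocket and cannot re-enter."""
        self.pocketed.pop(item_id, None)
        self.confirmed[item_id] = response

    def unresolved(self):
        """Items still in the pocket when time expires are scored as incorrect."""
        return list(self.pocketed)
```

In this sketch, only the responses stored in `confirmed` would be passed to the interim score estimator, mirroring the rule that items in the pocket do not affect item selection.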
Test takers’ feeling of loss of control over the test, a major source of their anxiety and stress (Stocking, 1997; Olea, Revuelta, Ximénez, & Abad, 2000), potentially may be reduced with the IP method. Another merit of the IP method is that test takers are not forced to provide an answer just to move forward but instead can skip items simply by adding them to the item pocket (as many as the item pocket size allows). The reduced anxiety by not being forced to answer each item before proceeding to the next would be one of the IP method’s possible direct psychological benefits, but the IP method’s psychometric benefits are also worth mentioning. In the IP method, test takers’ initial responses for the items in the item pocket, including skipped items, have no impact on item selection. The CAT system excludes items in the item pocket when computing interim score estimates. Therefore, CAT item selection is always based on test takers’ final responses, and as a result, CAT’s level of adaptiveness can be retained effectively. Since test takers cannot change their answers once they are finalized—in other words, once items are Item Pocket Method to Allow Response Changing 11 removed from the item pocket with final answers—any attempt to apply the Kingsbury or GK strategies becomes ineffective. Compared to the restricted review approach or traditional CAT, it is easy to appreciate the possible psychological benefits that accrue to test takers using the new IP method given the greater degree to which it allows test takers to revisit items and change responses. The fact that Kingsbury and GK strategies naturally become ineffective with the IP method also makes this method appealing to test developers. As yet unknown, however, is whether the new IP method is robust enough to immobilize other manipulative test-taking strategies, such as the Wainer strategy (1993). A series of simulation studies were followed to examine the robustness of the IP method against worst-case scenarios of test-taking strategy. The simulation study also evaluated the effect of item pocket size. Simulation Study Research Design For the simulation, 500 items were chosen from a real operational item bank built for a CAT program used for admissions to graduate-level educational programs. As shown in Table 1, a minor correlational relationship was observed between a- and b- parameter values among the items in the item pool. All items were multiple-choice formats with five answer options. The items were calibrated using the three-parameter logistic model (3PLM); the summary statistics for the item pool are reported in Table 1. Ten thousand simulated test takers were sampled from a normal distribution with a mean of 0 and standard deviation of 1. Each test taker was administered a fixed-length CAT with 40 items. Insert Table 1 about here Item Pocket Method to Allow Response Changing 12 For the CAT administration, the maximized Fisher information method was used as an item selection criterion, and the interim and final scores were estimated using the maximum likelihood estimation (MLE) method. The score estimates were truncated to be within a range of - 3 and 3. The initial score estimate was randomly drawn from a uniform distribution that ranged from - 0.5 to 0.5. During the first five item administrations, the absolute value of change in the interim score estimates from one item to another was limited so as not to exceed 1.0. This prevented fluctuations in item selection in CAT’s early stage. 
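For reference, the 3PLM response function and the item information on which the maximized Fisher information criterion operates take the standard forms below; these formulas are not restated in the original text, and D denotes the usual scaling constant (e.g., 1.7):

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D a_i (\theta - b_i)\right]}, \qquad I_i(\theta) = D^2 a_i^2 \,\frac{1 - P_i(\theta)}{P_i(\theta)} \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2 .$$

At each step, the unadministered item with the largest $I_i(\hat{\theta})$ at the current interim estimate $\hat{\theta}$ (computed from confirmed responses only under the IP method) is the preferred candidate for administration.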
Limiting the change in the early interim estimates was particularly important here because the MLE method can produce extreme score estimates when all responses are the same (i.e., all 0s or all 1s), as often occurs in the early stage of CAT. In terms of item exposure control, the simulation was conducted under two different conditions: (a) no exposure control, and (b) the Sympson and Hetter (1985) method. For the exposure control using the Sympson and Hetter method, the target exposure rate was set to 0.20 and the exposure parameter for each item was derived after 40 iterative simulations. Content balancing was ignored to eliminate extraneous factors and to make the implications of the study as generalizable as possible. The IP method was implemented under three different conditions in terms of the item pocket size (i.e., the maximum number of items that can be held in the item pocket at the same time). The studied item pocket sizes were 2, 4, and 6. To serve as a baseline, the IP method also was implemented with a no-item-pocket condition, essentially the same as a conventional CAT that does not allow reviewing and changing. An unlikely worst-case scenario using a Wainer-like manipulative test-taking strategy, as well as a more realistic scenario reflecting observations from the literature in a probabilistic model, was simulated with the IP method in order to evaluate the possible impacts of test-taker review and response change on measurement precision. Under each scenario, the conditional standard error of measurement (CSEM), computed as the mean absolute error of the θ estimates, and the bias across score levels (in intervals of 0.5 on the θ scale) were evaluated along with IP usage. The simulation was replicated 25 times and the results were averaged. The CAT administration and simulation were conducted using a modified version of SimulCAT, a comprehensive computer software package for CAT simulation, written by Han (2012).

Simulation Study 1: Wainer-Like Test-Taking Strategy

To be comparable to earlier research, this study mimicked the "unrealistic worst case" scenario used in Stocking (1997). The study simulated test takers following the Wainer strategy to the extent possible within the IP system. To recap, the Wainer strategy assumes that test takers intentionally keep their interim score estimates low by providing incorrect initial answers in order to see more items that are easier than the test takers' proficiency level. Supposedly this gives them a better chance of answering those items correctly when they go back to change their initial responses. Within the IP system, one expects the effect of the Wainer strategy to be minimized because the item pocket size is limited and because intentionally incorrect initial responses for items in the item pocket do not influence item selection. Test takers might still be able to make their interim score estimates negatively biased by postponing their answers to as many easy items as the item pocket size allows, so the strategy's impact on the final score estimates should be examined. The study simulated this situation assuming that all test takers would mechanically implement such a strategy throughout the test administration. This manipulative test-taking strategy will be referred to in this paper as Test Taking Strategy 1 (TTS1).
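As a rough illustration of the simulated administration components just described (maximum Fisher information selection, MLE interim scoring based only on confirmed, non-pocketed responses, truncation of estimates to the range of -3 to 3, and a Sympson and Hetter style exposure check), the sketch below uses hypothetical function names and is a simplification; the study itself used a modified version of SimulCAT rather than this code.

```python
import math
import random

def prob_3pl(theta, a, b, c, D=1.7):
    """3PLM probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PLM item at theta."""
    p = prob_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def mle_theta(confirmed_items, confirmed_responses, lo=-3.0, hi=3.0, step=0.01):
    """Crude grid-search MLE using confirmed responses only (pocketed items are excluded)."""
    best_theta, best_loglik = lo, -float("inf")
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps + 1):
        theta = lo + k * step
        loglik = 0.0
        for (a, b, c), u in zip(confirmed_items, confirmed_responses):
            p = min(max(prob_3pl(theta, a, b, c), 1e-6), 1.0 - 1e-6)
            loglik += math.log(p) if u == 1 else math.log(1.0 - p)
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta          # estimates stay within [-3, 3] by construction of the grid

def select_next_item(theta_hat, pool, administered, exposure_params):
    """Maximum Fisher information selection with a Sympson-Hetter style exposure lottery."""
    candidates = sorted(
        (item_id for item_id in pool if item_id not in administered),
        key=lambda item_id: info_3pl(theta_hat, *pool[item_id]),
        reverse=True,
    )
    for item_id in candidates:
        if random.random() <= exposure_params.get(item_id, 1.0):
            return item_id     # item passes the exposure check and is administered
    return candidates[0] if candidates else None
```

The early-stage restriction (limiting the change in interim estimates to 1.0 during the first five items) and the no-item-pocket baseline are omitted here for brevity.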
Item Pocket Method to Allow Response Changing 14 Simulation Study 2: More Typical Test-Taking Scenario With Item Pocket Simulation Study 1 was designed to provide us with knowledge about the robustness of the new IP method against test takers’ systematic attempts to employ the Wainer-like manipulative test taking strategy (TTS1). Such a gaming strategy, however, is not the way the IP method was intended to be used, nor was it expected to happen often in practice. According to Vispoel et al. (2000), when examinees were allowed to review their responses, their most frequently observed test-taking strategy could be described as follows: “I mark some of my answers that I review later (p. 34).” This test-taking strategy varied little from the ways examinees would respond to items on other tests (Vispoel et al., 2000). In fact, marking some answers (or items) for later review involves essentially the same process as putting items in the item pocket within the IP system. Although Vispoel et al. (2000) did not report which items were frequently marked for later review, it would be reasonable to assume that, within the IP system, examinees were likely to use the item pocket to set aside the items they found challenging (i.e., they were not confident about their answers). This allowed them a later opportunity to answer the items instead of giving up on the items by locking in their initial answers. This test-taking strategy referred to in this paper as Test Taking Strategy 2 (TTS2), is a legitimate use of the IP system (as opposed to the manipulative Wainer and Kingsbury gaming strategies) and, more importantly, represents what it was designed for —providing examinees with a less restrictive reviewing option. The second simulation study, therefore, was conducted to evaluate the performance of the IP method under a more realistic situation with TTS2. Simulating test takers’ reviewing behavior involves several strong assumptions, and so it is important to make those assumptions as realistic as possible. Under TTS2, it was assumed that test takers first evaluated the relative difficulty of each item against their proficiency level. If the Item Pocket Method to Allow Response Changing 15 item difficulty was challenging given the test takers’ proficiency, it was assumed they would put the item in item pocket. For items that were not challenging according to this definition, test takers were assumed to give their final answer and move on, not necessarily using the item pocket. For purposes of the simulation, the item was viewed as challenging if the difficulty (b parameter value) of an item exceeded a test taker’s true proficiency (θ) by 0.5. In practice, researchers often find large errors associated with a test taker’s ratings (or judgment) of item difficulty, so it is important to incorporate that assumption in the simulation algorithm in order to ensure realistic results. Vispoel et al. (2005) observed that test takers showed success rates between 46 percent and 61 percent when asked to compare the difficulty of a pair of items when the difference in difficulty was less than 0.50. When the difference was larger than 0.50, the average rating success was between 63 percent and 82 percent across the studied conditions (Vispoel et al., 2005). In this study, test takers do not compare a pair of items but instead evaluate the difficulty of an item using their proficiency as a baseline to decide whether or not they want to put the item in the item pocket. 
If the difference between a test taker's proficiency and the item difficulty was less than 0.50, the test taker was simulated to find the item challenging 50 percent of the time. If the difference between the test taker's proficiency and the item difficulty was greater than or equal to 0.50, the test taker was assumed to find the item challenging 70 percent of the time. If the item pocket was full and a test taker found the current item challenging and was considering putting that item in the pocket, it was assumed that the test taker compared the current item to the easiest item in the item pocket. If the test taker found an item in the item pocket that was easier than the current item, the test taker was assumed to finalize his or her answer to that easiest item and remove it from the item pocket to make room for the current item. During the comparison between the easiest item in the item pocket and the current item, test-taker error associated with difficulty ratings was also simulated in the same way as the errors associated with the test taker's decision on item pocket use (i.e., rating success at a rate of 0.50 if the difference in difficulty between the easiest and the current items was less than 0.50, and rating success at a rate of 0.70 if the difference was greater than or equal to 0.50). Simulation Study 2 was designed using research findings based on real empirical test data (Vispoel et al., 2005; Vispoel et al., 2000) to account for test-taker errors in item difficulty rating. Even with this error component incorporated, the test-taking patterns observed in this simulation of the IP system were still considerably exaggerated because the simulation imposed no test time limit. In other words, the simulation did not take into account any speededness or test-taker fatigue. In addition, test takers in the simulation were modeled to take as much time as needed to review the items using the IP system. For most operational real-world CAT programs, however, tests are strictly timed, often slightly speeded for some test takers, and subject to severe penalties on final scores (according to the number of omitted items) if test takers do not complete all items within the time limit. A major benefit of this study's research design using worst-case scenarios (essentially an extreme stress test of the new IP method) is that the results would be even more informative for real CAT situations because the findings indicate the bottom-line performance of the new method. Therefore, this paper presents a comprehensive discussion that includes both the worst-case results and what would likely happen in real-world settings. It is important to keep in mind that the simulation study cannot measure the various psychological effects of the IP method on real human test takers; the main purpose of this simulation study was to determine whether the negative impacts of the IP method on measurement efficiency could be kept at an acceptable level for real CAT programs even under the worst conditions.

Results and Discussion

As mentioned, the IP method simulation was conducted under two different exposure control conditions (no exposure control vs. the Sympson and Hetter method), and no meaningful difference was observed in terms of CSEM, bias, and IP usage between the two exposure control conditions.
Therefore, this paper presents only the cases with the Sympson and Hetter exposure control. Under TTS1, which essentially was a modification of the Wainer strategy for the IP system, the CSEM and bias displayed in Figure 2 showed either no change or minimal changes with the IP system compared to those from the CAT with no response change condition (the condition with IP size = 0). The increase in CSEM due to the IP system was less than 0.10 throughout most of the θ range (- 2.5 ≤ θ < 2.5) even when the IP size was 6, which was the largest IP size in this study. This was similar to findings observed with Stocking’s (1997) restricted model 1 (with the limit of two response changes) and model 2 (with four or more separately timed sections). In terms of the average conditional bias in the score estimates (seen in the middle of Figure 2), test takers did not achieve any meaningful positive gain by implementing this Wainer-like strategy. In fact, higher proficiency test takers (θ > 0.5) tended to have slightly underestimated scores with this manipulative test-taking strategy because, as Wise (1996) pointed out, final score estimates could drop substantially if a test taker failed to respond to all items correctly when attempting the Wainer strategy. Therefore, it seems safe to conclude that the IP method was very robust against test takers’ attempts to implement the Wainer strategy Item Pocket Method to Allow Response Changing 18 even under a worst-case scenario. The total number of items that test takers of each proficiency level placed in the item pocket (i.e., IP usage) is shown in Figure 2. Insert Figure 2 about here In Simulation Study 2, where it was assumed test takers would use the IP system systematically for challenging items (i.e., implementing TTS2), the change in the mean absolute error (MAE) for θ estimates due to the IP system was significant but the magnitude was minor— the MAE increased by about 0.069, 0.083, and 0.087 when the IP was 2, 4, and 6, respectively. Considering a typical standard error of θ estimation, which is often between 0.30 and 0.40 for many CAT programs, the increase in the MAE due to the use of IP system, which was well below 0.10, seemed acceptable. Increases seen in the average bias in θ estimates due to the use of IP system were 0.057, 0.075, and 0.080, when the IP was 2, 4, and 6, respectively. Looking at the observed patterns of the MAE and bias together across the different IP sizes, it was apparent that the increase in MAE was due mainly to the systematic bias in the θ estimates. Although both MAE and bias statistics showed the impact of the IP system was negligible in average, it is also important to evaluate CSEM and bias across θ levels. Figure 3 shows the details of the CSEM and conditional bias. The pattern of CSEM was similar to the pattern of conditional bias because, as noted earlier, the increase in CSEM was mainly due to the systematic bias. For higher proficiency test takers (θ > 0.5), TTS2 was ineffective in gaining any meaningful score improvement. On the other hand, for test takers at lower proficiency levels, for example, in the θ < -0.5 range, a saturated, excessive use of the IP system with TTS2 could result in a positive score bias. Since the result suggested a possible Item Pocket Method to Allow Response Changing 19 vulnerability of the IP system against an unrealistic case of TTS2, it was important to understand its implication in real-world situations. 
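For reference before turning to Figures 3 and 4, the TTS2 behavior assumed in this simulation can be sketched as below (Python; function names are illustrative). The sketch interprets the 50 percent and 70 percent figures as rating-success rates, in line with the description of the pocket-swap comparison, and follows the rule of comparing a new challenging item against the easiest item already in the pocket when the pocket is full.

```python
import random

def judged_challenging(theta_true, b, threshold=0.5, acc_small=0.50, acc_large=0.70):
    """Error-prone judgment of whether an item is 'challenging' (b exceeds theta by 0.5 or more)."""
    truly_challenging = (b - theta_true) >= threshold
    accuracy = acc_large if abs(b - theta_true) >= threshold else acc_small
    correct = random.random() < accuracy
    return truly_challenging if correct else not truly_challenging

def judged_easier(b_pocketed, b_current, acc_small=0.50, acc_large=0.70):
    """Error-prone judgment that the easiest pocketed item is easier than the current item."""
    truly_easier = b_pocketed < b_current
    accuracy = acc_large if abs(b_current - b_pocketed) >= 0.5 else acc_small
    correct = random.random() < accuracy
    return truly_easier if correct else not truly_easier

def tts2_handle_item(theta_true, item_id, pocket, item_b, capacity):
    """Decide what a TTS2 test taker does with a newly administered item."""
    if not judged_challenging(theta_true, item_b[item_id]):
        return "answer and confirm"                      # not challenging: finalize and move on
    if len(pocket) < capacity:
        pocket.append(item_id)                           # room in the pocket: defer the item
        return "pocketed"
    easiest = min(pocket, key=lambda i: item_b[i])       # perfect identification of the easiest item assumed
    if judged_easier(item_b[easiest], item_b[item_id]):
        pocket.remove(easiest)                           # finalize the easiest pocketed item ...
        pocket.append(item_id)                           # ... and pocket the current item instead
        return "swapped"
    return "answer and confirm"                          # pocket unchanged: answer the current item
```

Assuming perfect identification of the easiest pocketed item keeps the simulated review load lighter than it would likely be in practice, a point discussed further below.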
Insert Figures 3 and 4 about here Because item pocket usage (i.e., the average number of items put in the item pocket), shown in the bottom of Figure 3 did not necessarily indicate the time and effort test takers would expend taking advantage of the IP system under TTS2, this study analyzed the average conditional frequency of item review processes that included comparing the difficulty of each new challenging item to the easiest preexisting items in the item pocket. As shown in Figure 4, the higher the test-taker proficiency level, the less time likely was needed to revisit items in the item pocket. There were two main reasons for this. First, given the initial θ value around zero, test takers with below average proficiency were likely to see more items of challenging difficulty and, hence, use the IP system more frequently. Second, test takers with extremely high proficiency (for example, θ > 1.5) saw fewer items of challenging difficulty, and hence used the IP system less frequently than test takers of below average proficiency. They had no need to compare new challenging items with preexisting items in the item pocket because there often was no item in the item pocket that needed to be removed to make room for a new, harder item. As a result, higher proficiency test takers reviewed items less frequently under TTS2. For example, for highly proficient individuals (θ > 1.5), the total load of item review tasks was fewer than two occasions even when the item pocket size was six. For real CAT programs, test time limits often are set at a level at which about 80 percent to 90 percent of average test takers can finish the last item (Talento-Miller, Guo, & Han, 2010). This reduces the amount of wasted time for both the CAT system and test takers alike while minimizing possible speededness. In such a Item Pocket Method to Allow Response Changing 20 case, it is not unusual for highly proficient test takers to have a decent amount of time left to review the small number of items in the item pocket as needed. Although TTS2 could be a feasible strategy for high proficiency groups in real, timed CAT administrations, use of the IP system would not necessarily result in positively biased scores.. As seen in the middle of Figure 3, the bias in final θ estimation was next to nothing for highly proficient test takers. On the other hand, for groups at lower proficiency levels (for example, θ < -1.0), use of the TTS2 strategy theoretically could result in slightly biased scores. Based on the average loads of review tasks that these test takers would need to process, however, it appears TTS2 would be an extremely unrealistic strategy for most CAT programs with test time limits (Figure 4). Although it was assumed in the simulation for TTS2 that each new challenging item would be compared only to the easiest item in the item pocket, in reality test takers likely will need to revisit not only the easiest but several (if not all) items in the item pocket. This is because items in the item pocket are not ordered by item difficulty and test takers would need to determine which was the easiest among them. Determining the easiest item, which was not addressed in this simulation, could add a significant load to the item review process in practice, making the actual item review load much heavier than what was shown in Figure 4. 
Essentially, it means test takers at lower proficiency levels might spend more time and effort analyzing the item difficulties than on solving the problems in order to result in final score with a meaningfully positive bias. Theoretically, it is possible for test takers to review items many times if test time is unlimited. On the other hand, if time is unlimited, an option to review and change responses later would not be necessary (or desired) for CAT in the first place. In such a situation, test takers could spend as much time on each item as needed, thus minimizing any test anxiety resulting Item Pocket Method to Allow Response Changing 21 from having to rush to the next item. Therefore, it is reasonable to assume that the IP system is most likely to be employed and useful for CAT programs that are strictly and tightly timed as is the case for most operational real-world CAT administrations. Under this assumption, typical test takers would have only a fraction of test time left for reviewing items within the IP system. Hence, a strategy such as TTS2 that test takers must mechanically implement would be impossible to complete in real time. Knowing that many CAT programs apply severe penalties on final scores if test takers fail to complete all items within a set time limit, test takers would be discouraged from pushing TTS2 to the extreme—it would do more harm than good on their final scores. Aside from the test time limit issue, it is important to understand that analyzing and comparing item difficulties to game a CAT system is a complicated and difficult process, one with a fairly poor success rate observed even for highly proficient test takers (Vispoel, 1998; Vispoel et al., 2002; Olea et al., 2000, Wise et al., 1999). Ironically, test takers who truly needed to review items most often under TTS2 were those at the lowest proficiency level, and the quality of their performance in analyzing item difficulties in the real world is expected to be very poor, unlike the simulated study conditions in which all test takers performed the item difficulty analyses with 50% to 70% accuracy. Thus, it is extremely unlikely that the magnitude of score bias observed with low proficiency test takers in this simulation of the IP system could be replicated in the real world, even with unlimited test time. Most test takers would not benefit from practicing TTS2 in typical, timed CAT programs, yet it presents some extreme scenarios worth thinking about. As shown at the bottom of Figure 3, test takers with extremely low proficiency (for example, θ = -2.0) tended to put fewer than three items into the item pocket during the test when the IP size was two. The items they mostly Item Pocket Method to Allow Response Changing 22 placed in the item pocket were those administered at the beginning of CAT testing. For test takers of extremely low-proficiency, the first few items, selected based on initial (randomly chosen) θ around 0, were the most difficult among all administered items. Because the interim score estimates for these low-proficiency test takers quickly declined from the initial θ, the first few items placed in the item pocket mostly remained the same throughout the CAT administration under TTS2. Therefore, test takers with very low proficiency levels may benefit by being coached (for example, by test prep institutions) just to put the first items into the item pocket and then solve all other items at once and skip comparing the new items with those in the item pocket. 
Once the test takers reach the last item, they can revisit the first few items in the item pocket and submit their final answers. Such a manipulative modification of TTS2, however, would only yield a negligible positive bias (< 0.2 when IP size = 2) at the extremely low θ area, which is far from the main population of consideration for most CAT programs. If test takers at the bottom proficiency level are important for a CAT program, then one could suggest lowering the initial θ value for item selection to reduce possible bias even further. Conclusion Because of its efficiency, CAT has become widely accepted in the field of educational measurement; however, test takers’ dissatisfaction over not being allowed to review and change their responses (Baghi, Ferrara, & Gabrys, 1991; Legg & Buhr, 1992; Wise, 1996; Vispoel, 1998) has not yet been satisfactorily addressed. On the one hand, reducing unnecessary test anxiety for test takers by allowing them to review and change their responses during a CAT administration is believed by some to have a positive effect on test validity (Papanastasiou, 2002; Stocking, 1997; Olea et al., 2000). On the other hand, the trade-off in CAT efficiency is simply Item Pocket Method to Allow Response Changing 23 unacceptable for most operational CAT programs, especially due to the possibility that test takers would attempt to game the CAT system (Wainer, 1993; Wise, 1996). Stocking’s restricted review models (1997) brought forward effective means of controlling the impact of some manipulative test-taking strategies such as the Wainer strategy while offering test takers a limited ability to review. The new IP method presented in this study aimed to reduce the restrictions in reviewing even further but at the same time improve the robustness of the CAT system against manipulative test-taking strategies. With the IP system, test takers can go back and forth across sections to review items in the item pocket, and test developers do not need to time each section separately. The simulation result (with TTS1) showed that the IP method was as robust against the Wainer strategy as Stocking’s restricted review models. Also, unlike the restricted review models, the IP method systematically is immune to the Kingsbury and GK strategies since item selection is not influenced by items in the item pocket. The simulation study under TTS2 revealed that test takers with above average proficiency would not benefit from saturatedly excessive use of the IP system. The simulation did reveal the possibility of slight score biases (< 0.2 when IP size = 2) for test takers with very low proficiency (θ = -2) when abnormally excessive use of the IP system (TTS2) occurred. The unrealistic nature of the simulation condition (no time limit, no fatigue, and 100% accuracy in determining the easiest item), however, combined with the huge loads of item review for test takers to process in order to achieve such gains, leaves little practical chance to realize it in most operational CAT programs that are tightly timed. Given the evidence of the IP method’s robustness against the Wainer, Kingsbury, and GK strategies as well as excessive uses of the IP system seen in TTS2, it is highly unlikely that test takers will be tempted to waste their time and effort on such Item Pocket Method to Allow Response Changing 24 manipulative test-taking strategies. 
Even in the worst case under the studied conditions, the sacrifice in the measurement efficiency was about one or two items in the extremely low proficiency level, and some CAT programs may be willing to accept such a tradeoff to provide test takers (or clients) with better testing experience, especially when the test market is customer (i.e., test taker or client) driven. The process for determining a proper IP size is somewhat ambiguous but is critical because it ultimately decides the IP system’s flexibility and affects measurement efficiency. If the IP size is too small, test takers’ feelings of a loss of control over the test may persist because their ability to review items is too limited. On the other hand, if the IP size is too large, the CAT optimality would decrease because items in the item pocket would not contribute information that is used for item selection. The size of item pocket, however, is not the only factor that determines IP usage. For example, if a CAT is tightly timed, test takers may not necessarily use the item pocket up to its size limit because they know they will not have enough time near the end of the test to review items in the item pocket if there are too many. So test length and test time limit should be considered together when determining the IP size. Although this study mainly focused on evaluating the possible negative impacts of the IP method on measurement accuracy under worst-case scenarios, it is noteworthy that there could be several possible positive effects of the IP method in terms of the measurement accuracy. First, providing an option to review items may help test takers reduce their test anxiety levels, even if they never exercise that option (Vispoel et al., 2000; Olea et al., 2000). With reduced anxiety, test takers are expected to perform in a way they are supposed to with fewer mistakes, and, as a result, an item review option could eventually reduce measurement errors (Papanastasiou, 2002). Second, with a well-chosen IP size, the IP system may help test takers manage and use time Item Pocket Method to Allow Response Changing 25 more wisely during CAT administration. One mistake test takers often make during a test is spending too much time on a few items and then rushing through the rest of the test. This can contribute to a large (often unobservable) measurement error because the speededness could influence their whole performance on the rest of the items. An IP system, however, would allow test takers to skip items they think will take a long time to answer right at the outset. Even if a test taker finds a current item taking too long to finish in the middle of problem solving, he or she can put the item into the item pocket and move on to the next item. Since test takers always can revisit the items in the item pocket and restart from where they left off, they would not need to gamble on whether to give up and guess or spend more time on the current item in which they have already invested a sizable chunk of time. The IP method can enable flexibility in test time management to help test takers minimize unintended speededness during CAT, and, as a result, may reduce related measurement errors. These possible positive impacts of the IP method could not be investigated in this study because of the limitation of simulation, but it is strongly suggested that future studies conduct an in-depth examination of the psychological and psychometric effects of the IP method using real empirical data. 
The simulation conditions in this study mainly reflected a fixed-length CAT program for a high-stakes exam. Because even minor changes in the item selection algorithm, item bank, estimation method, and test length can make huge differences, the findings of this study should not be imprudently generalized. Examining the impact of the use of item pockets on test length when CAT administration is terminated based on estimation precision would be an interesting topic for future studies. The primary goal of the IP method was not necessarily to give test takers a better chance at improving their scores but to create a less restrictive testing environment, allowing them more control over a CAT administration so that they can perform undistracted and without unnecessary test anxiety. Based on the findings of this study, the new IP method appears to offer a promising solution for doing just that in many CAT programs.

References

Baghi, H., Gabrys, R., & Ferrara, S. (1991, April). Applications of computer-adaptive testing in Maryland. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Benjamin, L. T., Cavell, T. A., & Shallenberger III, W. R. (1984). Staying with initial answers on objective tests: Is it a myth? Teaching of Psychology, 11, 133–141.

Bowles, R., & Pommerich, M. (2001, April). An examination of item review on a CAT using the specific information item selection algorithm. Paper presented at the annual meeting of the National Council of Measurement in Education, Seattle, WA.

Gershon, R., & Bergstrom, B. (1995, April). Does cheating on CAT pay: Not! Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347–360.

Han, K. T. (2012). SimulCAT: Windows software for simulating computerized adaptive test administration. Applied Psychological Measurement, 36(1), 64–66.

Kingsbury, G. G. (1996, April). Item review and adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.

Legg, S. M., & Buhr, D. C. (1992). Computerized adaptive testing with different groups. Educational Measurement: Issues and Practice, 11, 23–27.

Lunz, M. E., Bergstrom, B. A., & Wright, B. D. (1992). The effect of review on student ability and test efficiency for computerized adaptive tests. Applied Psychological Measurement, 16(1), 41–51.

Martineau, J., & Dean, V. (2010). How a state might benefit from computer-based assessment and how to solve problems with its implementation from the view point of the State. Paper presented at the annual meeting of the Maryland Assessment Conference, College Park, MD.

Olea, J., Revuelta, J., Ximénez, M. C., & Abad, F. J. (2000). Psychometric and psychological effects of review on computerized fixed and adaptive tests. Psicológica, 21, 157–173.

Papanastasiou, E. C. (2002, April). A 'rearrangement procedure' for scoring adaptive test with review options. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Poggio, J. (2010). History, current practice, and predictions for the future of computer-based assessment in K-12 education. Paper presented at the annual meeting of the Maryland Assessment Conference, College Park, MD.

Stocking, M. L. (1997). Revising item responses in computerized adaptive tests: A comparison of three models. Applied Psychological Measurement, 21(2), 129–142.

Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

Talento-Miller, E., Guo, F., & Han, K. T. (2010, July). Examining test speededness by native language. Paper presented at the annual meeting of the International Test Commission, Hong Kong, China.

Vispoel, W. P. (1998). Reviewing and changing answers on computer-adaptive and self-adaptive vocabulary tests. Journal of Educational Measurement, 35(4), 328–345.

Vispoel, W. P., Clough, S. J., & Bleiler, T. (2005). A closer look at using judgments of item difficulty to change answers on computerized adaptive tests. Journal of Educational Measurement, 42(4), 331–350.

Vispoel, W. P., Clough, S. J., Bleiler, T., Henderickson, A. B., & Ihrig, D. (2002). Can examinees use judgments of item difficulty to improve proficiency estimates on computerized adaptive vocabulary tests? Journal of Educational Measurement, 39(4), 311–330.

Vispoel, W. P., Henderickson, A. B., & Bleiler, T. (2000). Limiting answer review and change on computerized adaptive vocabulary test: Psychometric and attitudinal results. Journal of Educational Measurement, 37(1), 21–38.

Vispoel, W. P., Rocklin, T. R., Wang, R., & Bleiler, T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36(2), 141–157.

Waddell, D. L., & Blankenship, J. C. (1995). Answer changing: A meta-analysis of the prevalence and patterns. The Journal of Continuing Education in Nursing, 25, 155–158.

Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12, 15–20.

Wang, M., & Wingersky, M. (1992, April). Incorporating post-administration item response revision into a CAT. Paper presented at the annual meeting of the National Council of Measurement in Education, San Francisco, CA.

Wise, S. L. (1996). A critical analysis of the arguments for and against item review in computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.

Wise, S. L., Finney, S. J., Enders, C. K., Freeman, S. A., & Severance, D. D. (1999). Examinee judgments of changes in item difficulty: Implications for item review in computerized adaptive testing. Applied Measurement in Education, 12, 185–198.

Table 1
Descriptive Statistics for the Item Pool (500 Items)

Item parameter    Mean     SD       Pearson correlation (a, b, c)
a                 0.795    0.272    1, 0.260, 0.085
b                 0.404    1.132    0.260, 1, 0.057
c                 0.166    0.065    0.085, 0.057, 1

Figure 1. Example of a Test Interface for a CAT With the Item Pocket Method

Example item shown in the figure:
Is m divisible by 12?
(1) m is divisible by 3
(2) m is divisible by 4

[] Statement (1) ALONE is sufficient, but statement (2) alone is not sufficient to answer the question asked.
[] Statement (2) ALONE is sufficient, but statement (1) alone is not sufficient to answer the question asked.
[] BOTH statements (1) and (2) TOGETHER are sufficient to answer the question asked, but NEITHER statement ALONE is sufficient to answer the question asked.
[] EACH statement ALONE is sufficient to answer the question asked.
[] Statements (1) and (2) TOGETHER are NOT sufficient to answer the question asked, and additional data specific to the problem are needed.

Figure 2. Conditional Bias, CSEM, and IP Usage Under TTS1 (panels: Conditional Standard Error of Measurement; Conditional Bias; Item Pocket Usage)

Figure 3. Conditional Bias, CSEM, and IP Usage Under TTS2 (panels: Conditional Standard Error of Measurement; Conditional Bias; Item Pocket Usage)

Figure 4. Frequency of Test Taker Revising (Easiest) Item Under TTS2