Fairness Issues in Computer Adaptive Tests with Strict Time Limits

Brent Bridgeman and Frederick Cline

Copyright © 2002 by Educational Testing Service. All Rights Reserved.

Paper presented at the annual meeting of the American Educational Research Association, New Orleans, April, 2002

Abstract

Time limits on the GRE analytical test (GRE-A) are such that many examinees have difficulty finishing. Results from over 100,000 examinees suggest that about half of the examinees must guess on the final six questions if they are to finish before time expires. At the higher ability levels even more guessing is required because the questions administered to higher ability examinees are typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced. Examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered shorter tests, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.

Because the Graduate Record Examination General Test (GRE) is a computer-adaptive test (CAT), different examinees receive different sets of questions. The three-parameter logistic scoring model takes account of the difficulty differences in these questions so that examinees who get difficult questions are not disadvantaged relative to examinees who get easier questions. However, the scoring model does not take into account differences in the amount of time it takes to respond to different questions. In a previous report (Bridgeman & Cline, 2000), we presented evidence that some questions on the quantitative and analytical sections of the GRE CAT could be answered more quickly than others. Much of this time difference was related to the difficulty of the questions: more difficult questions typically take longer to answer than less difficult questions. However, there was also substantial variation in the time it took to answer questions that met roughly the same content specifications and were at the same difficulty level. Examinees who got long tests (i.e., a disproportionate number of items that take longer than average to answer) could be at a disadvantage relative to examinees who got short tests. Despite the logical conclusion that examinees with long tests should get lower scores on a speeded test, the previous report could find no evidence of any impact on total test scores.

The previous study used only a single item pool for the analysis of the GRE-analytical (GRE-A) measure. Because the current study combines data across 12 pools, finer-grained analyses are possible. First, the extent of the speed problem on GRE-A was documented by identifying the number of students who run out of time. If virtually no examinees ran out of time, then it would not matter if some students were administered longer tests than others (except for a possible fatigue effect). If a non-trivial number of students run out of time, then it does matter if some students get longer tests. Second, we attempted to show the impact of long tests on scores by identifying examinees who had exactly the same score (estimated theta) at item 29 (out of 35) but who had gotten there by taking tests of different lengths, as identified by the average time taken for the particular set of items that were administered.
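For reference, the item response function underlying this scoring model is the standard three-parameter logistic (3PL) form; the probability that an examinee of ability theta answers item i correctly is

    P_i(theta) = c_i + (1 - c_i) / (1 + exp[-D a_i (theta - b_i)]),

where a_i is the item's discrimination, b_i its difficulty, c_i its lower asymptote (the pseudo-guessing parameter), and D is the usual 1.7 scaling constant. Ability estimates are derived from the likelihood of the observed right/wrong responses under this function (possibly combined with a prior); response time appears nowhere in the model.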
Method

Data for the current set of analyses came from regular examinees who had taken tests from one of 18 GRE-A question pools that were administered from January through June of 1999. We dropped six pools that were taken by relatively few examinees, leaving examinees from 12 pools. Examinees with certain disabilities that required non-standard timing were dropped; 406 examinees were in this category. The item selection algorithm does not strictly specify item order, and minor variations, such as the first 5-item set beginning anywhere from position 1 through position 5, are quite common. Ultimately, twelve common patterns were identified. Other variations, however, such as no 5-item set starting in positions 1-10, are extremely rare, and we dropped 18 examinees with such non-standard orders. Finally, we dropped 1,522 examinees who spent less than 20 minutes on the entire test because such rapid responding seems to indicate that the examinees were not taking the test seriously. (Because some institutions do not use GRE-A scores for admissions decisions, some examinees do not make a full effort to do well on this section.) The final sample consisted of 109,990 examinees who had taken tests from one of twelve item pools.

Results and Discussion

Evidence of Speededness

If the GRE-A were not a speeded test, and assuming fatigue is not a factor, it would not matter if some examinees received a disproportionate number of items that took a long time to answer. Thus, our first task was to demonstrate the degree of speededness of GRE-A and the possible consequences of random guessing behavior if time is short near the end. Measures of speededness that count the number of unanswered questions at the end of a test are virtually worthless if examinees are merely randomly guessing at the end in order to avoid the penalty for incomplete tests that is part of the GRE-A scoring design. One approach is to determine how long it takes examinees to answer questions near the beginning of the test, before they have begun any random guessing or extreme rushing to finish within the time limit. We looked at the time taken to answer questions of various types when they were administered near the beginning of the test (questions 1-20 in the 35-item test). The mean time to complete a discrete logical reasoning (LR) question was 1.8 minutes. The mean time taken for a 4-item analytical reasoning (AR) set was 8.5 minutes, and the time for a 5-item AR set was 10.3 minutes. The total test contains nine LR items, four 4-item sets, and two 5-item sets, so if the examinee worked at the average pace for the first 20 items, and did not rush at the end, the test would take 71 minutes to complete, or 11 minutes longer than the 60 minutes allowed.

Higher ability students may generally work faster, but on an adaptive test they are also taking harder (and therefore longer) items. In order to identify high ability students, we wanted an estimate of GRE-A scores for individuals that would be less contaminated by running out of time in the analytical section. Rather than using the possibly contaminated actual GRE-A scores, we predicted GRE-A scores with a regression equation that used GRE-V and GRE-Q as predictors. The multiple correlation was .68. Using the regression results, we identified a sample with predicted GRE-A scores in the 650-740 range. For these high ability students, the mean time in positions 1-20 was 1.9 minutes for LR items, 9.4 minutes for a 4-item AR set, and 11.3 minutes for a 5-item set.
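The arithmetic behind the 71-minute figure above, and the corresponding total for this high ability group, follows directly from the test composition (nine LR items, four 4-item sets, and two 5-item sets):

    Full sample:         9(1.8) + 4(8.5) + 2(10.3) = 16.2 + 34.0 + 20.6 = 70.8, or about 71 minutes
    High ability group:  9(1.9) + 4(9.4) + 2(11.3) = 17.1 + 37.6 + 22.6 = 77.3, or about 77 minutes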
Working at this rate, the entire exam would take 77 minutes, or 17 minutes more than allowed.

We defined three time pressure groups by determining the number of minutes examinees had to respond to the last six items in the test, which included an AR set in positions 31-34 and LR items in positions 30 and 35. As suggested above, if examinees answered these items at the same rate that they were working early in the test, then these items should require about 12 minutes for average ability examinees ([1.8 x 2] + 8.5), and about 13 minutes for a higher ability examinee. We defined extreme time pressure as less than 2 minutes remaining to answer all six questions; severe time pressure was defined as between 2 and 5.5 minutes, and moderate time pressure was defined as more than 5.5 minutes. As shown in Figure 1, about a quarter of the examinees were in the extreme group, another quarter in the severe group, and half in the moderate group. Although only 32% of the total sample failed to finish, over half were clearly in time trouble. Table 1 shows the gender and ethnic breakdown in each group. Asians and males were disproportionately represented in the extreme time pressure group, probably because greater time pressure is associated with harder, and therefore longer, items. Table 2 shows that while the examinees under extreme time pressure on the GRE-A scored the lowest on the Analytic section, those same test takers scored highest on the Quantitative section. The causal direction associated with these lower scores is unclear; lower analytical ability could cause students to have extreme time pressure, or extreme time pressure could cause estimates of analytical ability to be too low. As shown in Figure 2, there was significantly more time pressure for students with predicted GRE-A scores in the 650-740 range than for lower ability examinees, suggesting that time pressure is the cause of lower scores rather than vice versa. Only 37% of these high-ability examinees finished the test and had more than 5.5 minutes for the last 6 questions.

Because the GRE-A is an adaptive test, the distribution of scores on the last six items should be approximately normal with 3 as the mean and mode. As shown by the solid bars in Figure 3, this is exactly what happened for the examinees with only moderate time pressure. If examinees were randomly guessing on five-choice items, then a mean and mode of about 1, and very few high scores, should be expected. This is the pattern shown by the severe and extreme time pressure groups in Figure 3. As shown in Table 3, the estimated ability can drop substantially for examinees who are randomly guessing at the end. For examinees in the extreme time pressure group, the estimated ability dropped by at least one theta unit over the last six items for 6% of the group, and by at least half a theta unit for 34% of the group. In the severe time pressure group, a drop of half a theta unit or more was found for 21% of the group, while in the moderate time pressure group a drop of this size was seen in fewer than 4% of the group.
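The mechanism behind these drops can be illustrated with a minimal simulation. The sketch below uses straightforward maximum-likelihood estimation under the 3PL function given earlier; the operational GRE scoring procedure differs in its details (including the adjustment for incomplete tests), and all item parameters here are hypothetical, but the qualitative effect of a string of unlucky guesses is the same, as in the example discussed next (Table 4).

    import numpy as np

    def p_correct(theta, a, b, c, D=1.7):
        """Probability of a correct response under the three-parameter logistic model."""
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def ml_theta(responses, a, b, c, grid=np.linspace(-4, 4, 801)):
        """Maximum-likelihood theta estimate via a simple grid search (illustration only)."""
        r = np.asarray(responses, dtype=float)
        p = p_correct(grid[:, None], np.asarray(a, float), np.asarray(b, float), np.asarray(c, float))
        loglik = (r * np.log(p) + (1 - r) * np.log(1 - p)).sum(axis=1)
        return grid[int(np.argmax(loglik))]

    rng = np.random.default_rng(0)

    # Hypothetical parameters for the first 29 items, roughly targeted at a true theta of 1.0.
    a1, b1, c1 = np.full(29, 1.0), rng.normal(1.0, 0.4, 29), np.full(29, 0.2)
    resp1 = (rng.random(29) < p_correct(1.0, a1, b1, c1)).astype(int)

    # Six final items answered by blind guessing; here all six guesses are unlucky (wrong).
    # The final set includes some very easy items (low b), as in Person B's test below.
    a2, c2 = np.full(6, 1.0), np.full(6, 0.2)
    b2 = np.array([-2.5, -1.5, -1.0, 0.0, 0.5, 1.0])
    guesses = np.zeros(6, dtype=int)

    theta_29 = ml_theta(resp1, a1, b1, c1)
    theta_35 = ml_theta(np.concatenate([resp1, guesses]),
                        np.concatenate([a1, a2]),
                        np.concatenate([b1, b2]),
                        np.concatenate([c1, c2]))
    print(round(float(theta_29), 2), round(float(theta_35), 2))  # the estimate typically drops sharply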
Table 4 (adapted from an unpublished table created by Marilyn Wingersky) shows the impact of unlucky random guessing on two examinees. Person A guesses wrong on items 28-34 and then guesses right on the last item. The estimated ability of this person, without any correction for an incomplete test, is 670 at item 28 and falls 60 points to 610 at the end of the test. Had this person stopped after item 28, the penalty for an incomplete test would have been enforced, and the score would have been a 570. Because of the severity of the penalty for incomplete tests, even with unlucky guessing, this person's final adjusted score is higher after the string of unlucky guesses than it would have been had the examinee simply stopped responding at any prior point. In general, guessing, even unlucky guessing, is preferable to simply quitting if time is getting short.

However, this is not always the case, as is clearly demonstrated by Person B in Table 4. This person's estimated ability at item 28 is only slightly lower than that of Person A, and Person B has exactly the same guessing pattern on these final items (all wrong except the last one). However, the final outcome is dramatically different. While Person A ends up 70 points higher at the end than if they had quit after question 28, Person B ends up 220 points lower (and would have ended up 320 points lower had they quit one item sooner). The dramatic fall seems to be related, at least in part, to getting some very easy and discriminating items wrong. Although the pseudo-guessing parameter in the three-parameter logistic model provides for the possibility that a very low ability examinee may sometimes get a difficult item correct, there is no provision for a high ability examinee to get a series of very easy items wrong, which is what can happen when examinees are randomly guessing at the end to avoid the penalty for incomplete tests. The problem can be especially severe in a set-based test when the examinee enters a set with the most difficult item and all of the other items in the set are very easy. Items 31-34 are an analytical reasoning set, and for Person A this entire set was at a fairly high difficulty level. As the modest drop in score (and actual increase in adjusted score) shows, getting difficult items wrong has a relatively minor impact on test score. But getting items wrong in the easy set administered to Person B had much more serious consequences; before taking the first item in this set, the unadjusted score was 550, and after taking the last item in the set the score was 220. Using simulated data, Steffen and Way (1999) showed that a string of guesses at the end is likely to have more severe consequences for higher ability than for lower ability examinees, because incorrect answers are more at odds with the model's expectations for the higher ability examinees.
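In likelihood terms (the standard formulation for dichotomous item response models, not the operational scoring code), this asymmetry is easy to see. With u_i = 1 for a correct and u_i = 0 for an incorrect response, the log likelihood is

    log L(theta) = sum over i of [ u_i ln P_i(theta) + (1 - u_i) ln(1 - P_i(theta)) ].

An incorrect response contributes ln(1 - P_i(theta)). For a hard item, P_i(theta) remains moderate even at high theta, so the term costs little and varies slowly with theta. For a very easy, highly discriminating item, P_i(theta) is close to 1 at any high theta, so the same incorrect response contributes a large negative term that shrinks only as theta falls; a run of such responses therefore drags the ability estimate sharply downward. The pseudo-guessing parameter places a floor under P_i(theta), but there is no analogous "slip" parameter keeping 1 - P_i(theta) away from zero for high-ability examinees.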
Effects of Items 30-35 on Test Intercorrelations

If additional items make a test more reliable, which is typically the case, then the correlation of GRE-A with GRE-V or GRE-Q might be expected to be higher if assessed after 35 items than if assessed after 29 items. However, this expectation rests on the assumption that the additional items are adding more signal than noise. The extent of random responding observed at the end of GRE-A suggests that these additional items might add more noise than signal. We correlated the GRE-A theta score at item 29 and at item 35 with the GRE-V and GRE-Q total scores. These analyses were run separately for the 56,518 examinees in the domestic sample and the 17,911 examinees in the foreign sample. In both samples, the correlations with GRE-V remained constant as the additional six items were considered (r = .57 in the domestic sample and .51 in the foreign sample). Correlations with GRE-Q became lower with the additional items: in the domestic sample, the correlation dropped from .67 to .65, and in the foreign sample it dropped from .62 to .59. These results suggest that the items at the end of the test contribute nothing, actually less than nothing, to the reliability of the analytical score.

Differences in Expected Testing Time

Because a substantial number of examinees run out of time on GRE-A, and because random responding can have a significant impact on test scores, the issue of whether some examinees are administered tests that take longer than others becomes more important. We determined the length of a given examinee's test by computing an expected item time for each item and then summing these expected times over all of the items taken by that individual. Expected item time was the average time taken to answer a question, adjusted for the position of the item in the test. The expected item time was the average time taken to answer an item across all examinees within a specific pool (items administered in more than one pool were treated as unique items for this analysis). However, because items that are administered later in the test are typically answered more quickly, the average time taken on an item when it was delivered in positions 16-35 was adjusted based on the average reductions in time spent compared to the time taken in positions 1-15. For example, on average, LR items delivered in position 10 are answered in 110 seconds, while LR items delivered in position 30 are answered in 65 seconds. Based on these average differences, the observed average time taken for an item when delivered in position 30 was adjusted by averaging two different adjustment methods: adding a constant 45 seconds and multiplying by 1.7. These adjusted times for the item from positions 16-35 were averaged with the actual completion times for the item in positions 1-15 to compute the expected item time. Similarly, adjustments were made for items that were the first item in an AR set. Because the time recorded for the first item in a set includes the time needed to read and set up the problem (typically about 2 minutes), separate estimates were computed for an item when it was delivered first in the set and when it was delivered later in the set, so that items that were often delivered first in a set did not get overly long expected item times.

We computed mean expected times over the first 29 items, and then examined these expected times for examinees at different ability levels, as indexed by their estimated thetas after 29 items. Table 5 shows these expected times, as well as actual times, at three theta levels. (Adjacent pairs are shown so that the consistency among adjacent levels is apparent.) Once again, the results show the strong relationship of solution time and ability level. The expected time at a theta of 1.5 was ten minutes longer than the expected time at a theta of -1.5, and the actual time was about seven minutes longer. Of primary interest was the size of the standard deviations of expected time. Ideally, these standard deviations would approach zero so that, holding ability constant, there would be little difference in the expected times for the shortest and longest tests. In fact, however, the standard deviations were about 3 minutes, so an examinee who took a test whose expected time was one standard deviation below the mean would have six more minutes than an examinee whose test had an expected time one standard deviation above the mean.
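A compact sketch of this expected-time computation, and of the short/medium/long classification described in the next paragraph, is given below. It assumes a hypothetical table of per-administration response times with column names chosen here for illustration, applies only the single position-30 adjustment from the example above (the actual procedure used position-specific differences and a separate first-in-set adjustment), and is meant to convey the logic rather than reproduce the operational analysis.

    import numpy as np
    import pandas as pd

    def expected_item_times(times: pd.DataFrame) -> pd.Series:
        """Expected time (seconds) per item from a hypothetical log of administrations.

        `times` is assumed to have columns: 'item', 'position' (1-35), 'seconds'.
        Late-position times are adjusted upward by averaging two corrections
        (add 45 seconds; multiply by 1.7), then averaged with early-position times.
        """
        early = times[times["position"] <= 15].groupby("item")["seconds"].mean()
        late = times[times["position"] >= 16].copy()
        late["adj"] = ((late["seconds"] + 45) + (late["seconds"] * 1.7)) / 2
        late_mean = late.groupby("item")["adj"].mean()
        # Items seen only early or only late simply keep the one mean they have.
        return pd.concat([early, late_mean], axis=1).mean(axis=1)

    def classify_test_lengths(first29: pd.DataFrame, item_time: pd.Series) -> pd.DataFrame:
        """Label each examinee's test short/medium/long within a theta bin.

        `first29` is assumed to have one row per item administered in positions
        1-29, with columns: 'examinee', 'item', 'theta29'.
        """
        per_person = (first29.assign(exp_sec=first29["item"].map(item_time))
                              .groupby("examinee")
                              .agg(exp_min=("exp_sec", lambda s: s.sum() / 60.0),
                                   theta29=("theta29", "first"))
                              .reset_index())
        per_person["theta_bin"] = per_person["theta29"].round(1)
        grp = per_person.groupby("theta_bin")["exp_min"]
        z = (per_person["exp_min"] - grp.transform("mean")) / grp.transform("std")
        per_person["length"] = np.where(z <= -1, "short",
                                np.where(z >= 1, "long", "medium"))
        return per_person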
We defined a short test as one with a mean expected time one standard deviation below the mean for a given theta level (to the nearest tenth), a long test as one with an expected time one standard deviation above the mean, and everything else as a medium-length test. Table 6 shows the final test scores for examinees who were at the same ability level (to the nearest tenth on the theta scale) after item 29 but who had gotten to this level by taking tests whose expected lengths were short, medium, or long. Finally, we were able to show a measurable disadvantage associated with taking a test with a disproportionate number of time-consuming questions. Examinees with short tests scored about 25 points higher than examinees with long tests even though they were at the same level after 29 items.

Case closed: examinees get lower GRE-A scores because, through an unlucky item draw, they have been given more time-consuming tests. But wait, it also appears that examinees get higher verbal scores because they take long analytical tests. Indeed, the difference in the verbal scores is even larger than the difference in the analytical scores, but in the opposite direction. How could getting a long analytical test have a positive impact on verbal scores, especially when the verbal test may have been taken before the analytical section? Clearly, it could not, but the causal arrow could run in the opposite direction. That is, examinees with high verbal skills might get long analytical tests. If expected test length were related only to the luck of the draw, such systematic effects could not be found. So, more than luck appears to be involved in examinees with high verbal abilities getting longer analytical tests. The discrete LR items are highly dependent on verbal abilities, so a person with high verbal ability should do well with the initial LR items in the analytical section. (The analytical section typically starts with several LR items before the first set begins, and by position 10 all examinees have been given 5 LR items and a 5-item set.) Doing well on these initial LR items would place an examinee into a relatively hard (and therefore long) analytical set. The data support this speculation. The most common pattern is to begin the test with 3 LR items and then go into the first 5-item set. At the end of the third LR item, the mean estimated theta for examinees who were classified as having received long tests was 1.98 (SD = 1.5), while the mean theta for examinees who received short tests was -0.63 (SD = 1.3). The examinees who did well on these LR items were then placed into a long set. Thus, the problem of some students getting longer tests than others may be more related to the multidimensionality of the test (with high verbal students getting longer tests) than merely to the luck of the draw. Even if it were possible to make all sets at a given difficulty level take exactly the same amount of time, the differential timing related to performance on the LR items, as described above, would remain.

Given the practical impossibility of assuring that all examinees are administered tests that take exactly (or even approximately) the same time to complete, and given the inability of the current version of the three-parameter model to deal with a string of guesses at the end of an examination, some modification of current procedures is needed. One possibility is to simply make the test less speeded.
If examinees have time to fully, or at least partially, consider all questions, then minor variations in test length are of no practical consequence. Making the test less speeded, however, should not be confused with removing all time constraints. Time limits may be needed for practical reasons related to paying for seat time in a testing center, and may also be needed to assure that the test is continuing to assess a reasoning skill. Excessive time may make the test an evaluation of persistence in blindly trying alternatives until one works, rather than an assessment of a reasoning construct. However, a modest reduction in the number of items (say, 29 items in an hour rather than 35) would have no impact on seat time charges, would still not permit blind guessing strategies, but should allow substantially more examinees to at least make some attempt at all of the items. If a higher degree of speededness were essential to measure the desired reasoning construct, then a scoring and delivery model would need to be developed that adequately deals with the speed dimension.

References

Bridgeman, B., & Cline, F. (2000). Variations in mean response times for questions on the computer-adaptive GRE General Test: Implications for fair assessment (GRE Board Professional Report No. 96-20P; ETS RR-00-7). Princeton, NJ: Educational Testing Service.

Steffen, M., & Way, W. D. (1999, April). Test-taking strategies in computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

Figure 1. Time Pressure After Item 29 - Full Sample. (Pie chart: Moderate Finished 45%, Moderate Did Not Finish 5%, Severe Finished 15%, Severe Did Not Finish 10%, Extreme Finished 8%, Extreme Did Not Finish 17%.)

Figure 2. Time Pressure After Item 29, by Predicted GRE-A. (Two pie charts. Predicted GRE-A 450 to 540: Moderate Finished 46%, Moderate Did Not Finish 5%, Severe Finished 15%, Severe Did Not Finish 10%, Extreme Finished 7%, Extreme Did Not Finish 17%. Predicted GRE-A 650 to 740: Moderate Finished 37%, Moderate Did Not Finish 6%, Severe Finished 17%, Severe Did Not Finish 12%, Extreme Finished 10%, Extreme Did Not Finish 18%.)

Figure 3. Time Pressure After Item 29 - Finished Exam. (Bar chart: percent of each time pressure group, moderate, severe, and extreme, answering 0 through 6 of the items in positions 30 to 35 correctly.)

Table 1
Percent of Each Group in Each Time Pressure Category

              Moderate (>5.5 min.)     Severe (2-5.5 min.)     Extreme (<2 min.)
Group         n = 54,804 (49.9%)       n = 27,577 (25.1%)      n = 27,519 (25.0%)
Male               45.6                     26.8                    27.6
Female             53.5                     23.5                    23.0
White              53.8                     23.4                    22.8
Black              56.1                     21.7                    23.2
Asian              42.2                     27.7                    30.1
Hispanic           50.5                     23.2                    26.3
Other              48.1                     26.8                    25.1

Table 2
Mean Test Scores of Examinees for Analytic, Quantitative, and Verbal Sections, by Time Pressure on the GRE Analytic Section

Score     Moderate    Severe    Extreme
GRE-A       539        540        504
GRE-Q       526        574        574
GRE-V       445        455        450

Table 3
Percent of Each Time Pressure Group with Indicated Theta Change (Theta at 35 Minus Theta at 29)

Change in Theta    Moderate    Severe    Extreme
< -1.0                0.2        2.5        6.0
-1.0 to -0.5          3.2       18.8       28.0
-0.5 to 0.0          72.3       75.5       64.8
0.0 to 0.5           23.8        3.0        1.1
> 0.5                 0.5        0.1        0.1
Table 4
Impact of Unlucky Guessing on GRE-A Scores

Person A
Item           28      29      30      31      32      33      34      35
Score         670     660     650     640     630     620     610     610
Adj. Score    570     580     590     600     610     620     620     640
IRT a         .79     .90     .78    1.70    1.00    1.30    1.28     .80
IRT b         .76     .92     .74     .79     .89     .93     .50    -.13

Person B
Item           28      29      30      31      32      33      34      35
Score         640     610     550     480     380     220     220     310
Adj. Score    550     540     500     450     370     230     230     330
IRT a         .66     .60     .81    1.02    1.28     .59    1.18     .63
IRT b        -.50    -.87     .17     .24    -.55   -3.11    -.26   -1.10

Note: Both Person A and Person B make incorrect answers to items 28-34 and a correct answer to item 35. (Adapted from an unpublished table by Marilyn Wingersky.)

Table 5
Expected and Actual Time (in Minutes) After 29 Items, by Selected Examinee Thetas After Item 29

                        Expected Time          Actual Time
Theta        n          M        SD            M        SD
-1.5       1364        54.7      2.6          47.0      9.8
-1.4       1478        54.6      2.7          47.6      9.7
 0.0       3168        58.6      3.0          51.8      6.6
 0.1       3192        59.0      3.0          52.0      6.6
 1.5       1802        64.2      3.1          53.7      5.2
 1.6       1470        64.3      2.8          53.9      5.1

Table 6
Final GRE-A Scores of Examinees Whose Estimated Theta After 29 Items Was 1.0 or 1.5 and Who Had Taken Short or Long Tests

Expected Test Length               GRE-A             GRE-Q             GRE-V
Through Item 29          n         M      SD         M      SD         M      SD

Theta at 29 = 1.0
Short                   400       664     32        632    108        479    100
Medium                 1725       651     35        621    103        491     96
Long                    419       639     43        631    100        520     97

Theta at 29 = 1.5
Short                   309       713     32        670     91        504    102
Medium                 1178       700     36        656     94        529     96
Long                     59       689     38        662     92        535     98