Effects of Long Tests on GRE-Analytical Scores

Fairness Issues in Computer Adaptive Tests with Strict Time Limits
Brent Bridgeman
and
Frederick Cline
Copyright © 2002 by Educational Testing Service. All Rights Reserved
Paper presented at the annual meeting of the American Educational Research Association, New
Orleans, April, 2002
Abstract
Time limits on the GRE analytical test (GRE-A) are such that many examinees have
difficulty finishing. Results from over 100,000 examinees suggest that about half of the
examinees must guess on the final six questions if they are to finish before time expires. At the
higher ability levels even more guessing is required because the questions administered to higher
ability examinees are typically more time consuming. Because the scoring model is not designed
to cope with extended strings of guesses, substantial errors in ability estimates can be introduced.
Examinees who are administered tests with a disproportionate number of time-consuming items
appear to get lower scores than examinees of comparable ability who are administered shorter
tests, though the issue is very complex because of the relationship of time and difficulty, and the
multidimensionality of the test.
Because the Graduate Record Examination General Test (GRE) is a computer-adaptive test
(CAT), different examinees receive different sets of questions. The three parameter logistic
scoring model takes account of the difficulty differences in these questions so that examinees
who get difficult questions are not disadvantaged relative to examinees who get easier questions.
However, the scoring model does not take into account differences in the amount of time it takes
to respond to different questions. In a previous report (Bridgeman & Cline, 2000), we presented
evidence that some questions on the quantitative and analytical sections of the GRE CAT could
be answered more quickly than others. Much of this time difference was related to the difficulty
of the questions—more difficult questions typically take longer to answer than less difficult
questions. However, there was also substantial variation in the time it took to answer questions
that met roughly the same content specifications and were at the same difficulty level.
Examinees who got long tests (i.e., a disproportionate number of items that take longer than
average to answer) could be at a disadvantage relative to examinees who got short tests. Despite
the logical expectation that examinees with long tests should get lower scores on a speeded test,
the previous report found no evidence of any impact on total test scores.
The previous study used only a single item pool for the analysis of the GRE-analytical (GRE-A)
measure. Because the current study combines data across 12 pools, finer-grained analyses are
possible. First, the extent of the speed problem on GRE-A was documented by identifying the
number of students who run out of time. If virtually no examinees run out of time, then it would
not matter if some students were administered longer tests than others (except for a possible
fatigue effect). If a non-trivial number of students run out of time, then it would matter if some
students got longer tests. Second, we attempted to show the impact of long tests on scores by
identifying examinees who had exactly the same score (estimated theta) at item 29 (out of 35)
but who had gotten there by taking tests of different lengths, as identified by the average time
taken for the particular set of items that were administered.
Method
Data for the current set of analyses came from regular examinees who had taken tests from one
of 18 GRE-A question pools that were administered from January through June of 1999. We
dropped six pools that were taken by relatively few examinees, leaving examinees from 12 pools.
Examinees with certain disabilities that required non-standard timing were dropped; 406
examinees were in this category.
The item selection algorithm does not strictly specify item order, and minor variations, such as
the first 5-item set beginning anywhere from position 1 through position 5, are quite common.
Ultimately, twelve common patterns were identified. Other variations, however, such as no 5-item
set starting in positions 1-10, are extremely rare, and we dropped 18 examinees with such
non-standard orders. Finally, we dropped 1,522 examinees who spent less than 20 minutes on the
entire test because such rapid responding seems to indicate that the examinees were not taking
the test seriously. (Because some institutions do not use GRE-A scores for admissions decisions,
some examinees do not make a full effort to do well on this section.) The final sample consisted
of 109,990 examinees who had taken tests from one of twelve item pools.
Results and Discussion
Evidence of Speededness
If the GRE-A were not a speeded test, and assuming fatigue is not a factor, it would not matter if
some examinees received a disproportionate number of items that took a long time to answer.
Thus, our first task was to demonstrate the degree of speededness of GRE-A and possible
consequences of random guessing behavior if time is short near the end. Measures of
speededness that count the number of unanswered questions at the end of a test are virtually
worthless if examinees are merely randomly guessing at the end in order to avoid the penalty for
incomplete tests that is part of the GRE-A scoring design.
One approach is to determine how long it takes examinees to answer questions near the
beginning of the test, before they have begun any random guessing or extreme rushing to finish
within the time limit. We looked at the time taken to answer questions of various types when
they were administered near the beginning of the test (questions 1-20 in the 35-item test). The
mean time to complete a discrete logical reasoning (LR) question was 1.8 minutes. The mean
time taken for a 4-item analytical reasoning (AR) set was 8.5 minutes, and the time for a 5-item
AR set was 10.3 minutes. The total test contains nine LR items, four 4-item sets, and two 5-item
sets, so if the examinee worked at the average pace for the first 20 items, and did not rush
at the end, the test would take 71 minutes to complete, or 11 minutes longer than the 60 minutes
allowed.
Higher ability students may generally work faster, but on an adaptive test they are also taking
harder (and therefore longer) items. In order to identify high ability students, we wanted an
estimate of GRE-A scores for individuals that would be less contaminated by running out of time
in the analytical section. Rather than using the possibly contaminated actual GRE-A scores, we
predicted GRE-A scores with a regression equation that used GRE-V and GRE-Q as predictors.
The multiple correlation was .68. Using the regression results, we identified a sample with
predicted GRE-A scores in the 650-740 range. For these high ability students, the mean time in
positions 1-20 was 1.9 minutes for LR items, 9.4 minutes for a 4-item AR set, and 11.3 minutes
for a 5-item set. Working at this rate, the entire exam would take 77 minutes, or 17 minutes
more than allowed.
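As a check on this arithmetic, the short sketch below (our own illustration, using only the mean times and item counts quoted above) reproduces the approximately 71- and 77-minute totals.

    # Rough pacing check: expected minutes to complete a 35-item GRE-A form if an
    # examinee worked at the mean rates observed in positions 1-20.
    def expected_total_minutes(lr_min, set4_min, set5_min,
                               n_lr=9, n_set4=4, n_set5=2):
        """Nine discrete LR items, four 4-item AR sets, and two 5-item AR sets,
        as described in the text."""
        return n_lr * lr_min + n_set4 * set4_min + n_set5 * set5_min

    # Full sample: 1.8 min per LR item, 8.5 min per 4-item set, 10.3 min per 5-item set.
    print(expected_total_minutes(1.8, 8.5, 10.3))   # 70.8, about 71 minutes
    # High-ability sample (predicted GRE-A 650-740): 1.9, 9.4, and 11.3 minutes.
    print(expected_total_minutes(1.9, 9.4, 11.3))   # 77.3, about 77 minutes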
We defined three time pressure groups by determining the number of minutes examinees had to
respond to the last six items in the test, which included an AR set in positions 31-34 and LR
items in positions 30 and 35. As suggested above, if examinees answered these items at the
same rate that they were working early in the test, then these items should require about 12
minutes for average ability examinees ([1.8x2]+8.5), and about 13 minutes for a higher ability
examinee. We defined extreme time pressure as less than 2 minutes remaining to answer all six
questions; severe time pressure was defined as between 2 and 5.5 minutes, and moderate time
pressure was defined as more than 5.5 minutes.
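A minimal sketch of this grouping, assuming the time remaining when an examinee reaches item 30 is known; the function name and cutoffs below simply restate the definitions above.

    def time_pressure_group(minutes_left_for_last_six):
        """Classify an examinee by the minutes remaining for items 30-35,
        using the cutoffs defined in the text."""
        if minutes_left_for_last_six < 2.0:
            return "extreme"    # less than 2 minutes for all six items
        elif minutes_left_for_last_six <= 5.5:
            return "severe"     # between 2 and 5.5 minutes
        else:
            return "moderate"   # more than 5.5 minutes

    print(time_pressure_group(1.5))   # extreme
    print(time_pressure_group(4.0))   # severe
    print(time_pressure_group(9.0))   # moderate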
As shown in Figure 1, about a quarter of the examinees were in the extreme group, another
quarter in the severe group, and half in the moderate group. Although only 32% of the total
sample failed to finish, over half were clearly in time trouble. Table 1 shows the gender and
ethnic breakdown in each group. Asians and males were disproportionately represented in the
extreme time pressure group, probably because greater time pressure is associated with harder,
and therefore longer, items. Table 2 shows that while the examinees under extreme time
pressure on the GRE-A scored the lowest on the Analytic section, those same test takers scored
highest on the Quantitative section. The causal direction associated with these lower scores is
unclear; lower analytical ability could cause students to have extreme time pressure, or extreme
time pressure could cause estimates of analytical ability to be too low. As shown in Figure 2,
there was significantly more time pressure for students with predicted GRE-A scores in the
650-740 range than for lower ability examinees, suggesting that time pressure is the cause of lower
scores rather than vice versa. Only 37% of these high-ability examinees finished the test and
had more than 5.5 minutes for the last 6 questions.
Because the GRE-A is an adaptive test, the distribution of scores on the last six items should be
approximately normal with 3 as the mean and mode. As shown by the solid bars in Figure 3, this
is exactly what happened for the examinees with only moderate time pressure. If examinees
were randomly guessing on five-choice items, then a mean and mode of about 1, and very few
high scores, should be expected. This pattern is shown by the severe and extreme time pressure
groups in Figure 3.
As shown in Table 3, the estimated ability can drop substantially for examinees who are
randomly guessing at the end. For examinees in the extreme time pressure group, the estimated
ability dropped by at least one theta unit over the last six items for 6% of the group, and by at
least half a theta unit for 34% of the group. In the severe time pressure group, a drop of half a
theta unit or more was found for 21% of the group, while in the moderate time pressure group a
drop of this size was seen in fewer than 4% of the group.
Table 4 (adapted from an unpublished table created by Marilyn Wingersky) shows the impact of
unlucky random guessing on two examinees. Person A guesses wrong on items 28-34 and then
guesses right on the last item. The estimated ability of this person, without any correction for an
incomplete test, is 670 at item 28 and falls 60 points to 610 at the end of the test. Had this person
stopped after item 28 the penalty for an incomplete test would have been enforced, and the score
would have been a 570. Because of the severity of the penalty for incomplete tests, even with
unlucky guessing, this person’s final adjusted score is higher after the string of unlucky guesses
than it would have been had the examinee simply stopped responding at any prior point. In
general, guessing, even unlucky guessing, is preferable to simply quitting if time is getting short.
However, this is not always the case, as is clearly demonstrated by Person B in Table 4. This
person’s estimated ability at item 28 is only slightly lower than that of Person A, and Person B
has exactly the same guessing pattern on these final items (all wrong except the last one).
However, the final outcome is dramatically different. While Person A ends up 70 points higher
at the end than if they had quit after question 28, Person B ends up 220 points lower (and they
would have ended up 320 points lower had they quit one item sooner). The dramatic fall seems
to be related, at least in part, to getting some very easy and discriminating items wrong.
Although the pseudo-guessing parameter in the three parameter logistic model provides for the
possibility that a very low ability examinee may sometimes get a difficult item correct, there is
no provision for a high ability examinee to get a series of very easy items wrong, which is what
can happen when examinees are randomly guessing at the end to avoid the penalty for
incomplete tests. The problem can be especially severe in a set-based test when the examinee
enters a set with the most difficult item, and all of the other items in the set are very easy. Items
31-34 are an analytical set, and for Person A this entire set was at a fairly high difficulty level.
As the modest drop in score (and actual increase in adjusted score) shows, getting difficult items
wrong has a relatively minor impact on test score. But getting items wrong in the easy set
administered to Person B had much more serious consequences; before taking the first item in
this set, the unadjusted score was 550 and after taking the last item in the set the score was 220.
Using simulated data, Steffen and Way (1999) showed that a string of guesses at the end is likely
to have more severe consequences for higher ability than for lower ability examinees, because
incorrect answers are more at odds with the model’s expectations for the higher ability
examinees.
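For readers who want the model made explicit: in its usual form the three parameter logistic function gives the probability of a correct response as P(theta) = c + (1 - c) / (1 + exp(-1.7a(theta - b))). The sketch below, with illustrative parameter values of our own choosing (not operational item parameters), shows the asymmetry at issue: the pseudo-guessing parameter c keeps the probability of a correct answer on a hard item well above zero, but the model assigns almost no probability to a high-ability examinee missing a very easy, discriminating item, so each such miss pulls the ability estimate down sharply.

    import math

    def p_correct_3pl(theta, a, b, c=0.2, D=1.7):
        """Three parameter logistic probability of a correct response."""
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    theta = 1.5   # a fairly able examinee

    # Hard item (b close to theta): a wrong answer is unsurprising under the model.
    print(1 - p_correct_3pl(theta, a=1.3, b=0.9))    # about 0.17

    # Very easy item (compare Person B's b = -3.11 item): a wrong answer is nearly
    # impossible under the model, so observing one sharply lowers the theta estimate.
    print(1 - p_correct_3pl(theta, a=1.2, b=-3.0))   # about 0.0001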
Effects of Items 30-35 on Test Intercorrelations
If additional items make a test more reliable, which is typically the case, then the correlation of
GRE-A with GRE-V or GRE-Q might be expected to be higher if assessed after 35 items than if
assessed after 29 items. However, this expectation rests on the assumption that the additional
items are adding more signal than noise. The extent of random responding at the end of GRE-A
that was observed suggests that these additional items might add more noise than signal. We
correlated the GRE-A theta score at item 29 and at item 35 with the GRE-V and GRE-Q total
scores. These analyses were run separately for the 56,518 examinees in the domestic sample and
the 17,911 examinees in the foreign sample. In both samples, the correlations with GRE-V
remained constant as the additional six items were considered (r = .57 in the domestic sample
and .51 in the foreign sample). Correlations with GRE-Q became lower with the additional
items. In the domestic sample, the correlation dropped from .67 to .65 and in the foreign sample
it dropped from .62 to .59. These results suggest that the items at the end of the test contribute
nothing, actually less than nothing, to the reliability of the analytical score.
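A sketch of this comparison, assuming a file with one row per examinee containing the theta estimate after item 29, the theta estimate after item 35, and the GRE-V and GRE-Q scaled scores; the file and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("gre_a_thetas.csv")   # columns: theta_29, theta_35, gre_v, gre_q

    for other in ["gre_v", "gre_q"]:
        r29 = df["theta_29"].corr(df[other])   # using the 29-item theta estimate
        r35 = df["theta_35"].corr(df[other])   # using the full 35-item estimate
        print(f"{other}: r at item 29 = {r29:.2f}, r at item 35 = {r35:.2f}")
    # If the last six items add more noise than signal, r at item 35 should be no
    # higher than r at item 29 (and for GRE-Q it was slightly lower).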
Differences in Expected Testing Time
Because a substantial number of examinees run out of time on GRE-A, and because random
responding can have a significant impact on test scores, the issue of whether some examinees are
administered tests that take longer than others becomes more important. We determined the
length of a given examinee’s test by computing an expected item time for each item and then
summing these expected times for all of the items taken by that individual. Expected item time
was the average time taken to answer a question, adjusted for the position of the item in the test.
The expected item time was the average time taken to answer an item across all examinees
within a specific pool (items administered in more than one pool were treated as unique items for
this analysis). However, because items that are administered later in the test are typically
answered more quickly, the average time taken on an item when it was delivered in positions
16-35 was adjusted based on the average reductions in time spent compared to the time taken in
positions 1-15. For example, on average, LR items delivered in position 10 are answered in 110
seconds while LR items delivered in position 30 are answered in 65 seconds. Based on these
average differences, the observed average time taken for an item when delivered in position 30
was adjusted by averaging two different adjustment methods: adding a constant 45 seconds and
multiplying by 1.7. These adjusted times for the item from positions 16-35 were averaged with
the actual completion times for the item in positions 1-15 to compute the expected item time.
Similarly, adjustments were made for items that were the first item in an AR set. Because the
time recorded for the first item in a set includes the time needed to read and set up the problem
(typically about 2 minutes), separate estimates were computed for an item when it was delivered
first in the set and when it was delivered later in the set, so that items that were often delivered
first in a set did not get overly long expected item times.
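In simplified form, the expected-time computation described above might look like the sketch below. The 45-second and 1.7 adjustments are the LR example quoted in the text (in practice such corrections would be estimated by item type and position), the first-item-in-set adjustment is omitted, and all function and variable names are ours.

    from statistics import mean

    def adjusted_time(seconds, position, add_const=45.0, mult=1.7):
        """Correct an observed response time for late-position speeding by
        averaging two adjustments: add a constant and multiply by a factor."""
        if position <= 15:
            return seconds                                   # early positions: use as observed
        return ((seconds + add_const) + seconds * mult) / 2.0

    def expected_item_time(observations):
        """observations: (seconds, position) pairs for one item within one pool."""
        return mean(adjusted_time(s, p) for s, p in observations)

    def expected_test_time(administered_items, expected_times):
        """Sum expected item times over the items an examinee actually received."""
        return sum(expected_times[item] for item in administered_items)

    # An LR item answered in 110 s at position 10 and 65 s at position 30 gets an
    # expected time close to its early-position average.
    print(expected_item_time([(110, 10), (65, 30)]))   # 110.125 seconds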
We computed mean expected times over the first 29 items, and then examined these expected
times for examinees at different ability levels, as indexed by their estimated thetas after 29 items.
Table 5 shows these expected times, as well as actual times, at three theta levels. (Adjacent pairs
are shown so that the consistency among adjacent levels is apparent.) Once again, the results
show the strong relationship of solution time and ability level. The expected time at a theta of
1.5 was ten minutes longer than the expected time at a theta of –1.5, and the actual time was
about seven minutes longer. Of primary interest was the size of the standard deviations of
expected time. Ideally, these standard deviations would approach zero so that, holding ability
constant, there would be little difference in the expected times for the shortest and longest tests.
In fact, however, the standard deviations were about 3 minutes so that an examinee who took a
test whose expected time was one standard deviation below the mean would have six more
minutes than an examinee whose test had an expected time one standard deviation above the
mean.
We defined a short test as a test with a mean expected time one standard deviation below the
mean for a given theta level (to the nearest tenth), a long test as a test with an expected time one
standard deviation above the mean, and everything else as a medium length test. Table 6 shows
the final test scores for examinees who were at the same ability level (to the nearest tenth on the
theta scale) after item 29 but who had gotten to this level by taking tests whose expected lengths
were short, medium, or long. Finally, we were able to show a measurable disadvantage
associated with taking a test with a disproportionate number of time-consuming questions.
Examinees with short tests scored about 25 points higher than examinees with long tests even
though they were at the same level after 29 items. Case closed: examinees get lower GRE-A
scores because, through an unlucky item draw, they were given more time-consuming tests.
But wait, it also appears that examinees get higher verbal scores because they take long
analytical tests. Indeed, the difference in the verbal scores is even larger than the difference in
the analytical scores, but in the opposite direction. How could getting a long analytical test have
a positive impact on verbal scores, especially when the verbal test may have been taken before
the analytical section? Clearly, it could not, but the causal arrow could run in the opposite
direction. That is, examinees with high verbal skills might get long analytical tests. If expected
test length was related only to the luck of the draw, such systematic effects could not be found.
So, more than luck appears to be involved in examinees with high verbal abilities getting longer
analytical tests. The discrete LR items are highly dependent on verbal abilities, so a person with
high verbal ability should do well with the initial LR items in the analytical section. (The
analytical section typically starts with several LR items before the first set begins, and by
position 10 all examinees have been given 5 LR items and a 5-item set.) Doing well on these
initial LR items would place an examinee into a relatively hard (and therefore long) analytical
set. The data support this speculation. The most common pattern is to begin the test with 3 LR
items and then go into the first 5-item set. At the end of the third LR item, the mean estimated
theta for examinees who were classified as having received long tests was 1.98 (SD=1.5) while
the mean theta for examinees who received short tests was –0.63 (SD=1.3). The examinees who
did well on these LR items were then placed into a long set. Thus, the problem of some students
getting longer tests than others may be more related to the multidimensionality of the test (with
high verbal students getting longer tests) than merely to the luck of the draw. Even if it were
possible to make all sets at a given difficulty level take exactly the same amount of time, the type
of differential timing differences related to performance on LR items, as described above, would
remain.
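A sketch of the short/medium/long classification described above, assuming a file with each examinee's estimated theta after item 29 and the summed expected time of their first 29 items; file and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("expected_times.csv")   # columns: theta_29, expected_time_29

    # Hold ability constant by rounding theta to the nearest tenth, then find the
    # mean and SD of expected time within each theta level.
    df["theta_bin"] = df["theta_29"].round(1)
    stats = df.groupby("theta_bin")["expected_time_29"].agg(["mean", "std"])
    df = df.join(stats, on="theta_bin")

    def length_group(row):
        """Short/long = more than one SD below/above the mean expected time for
        examinees at the same (rounded) theta; everything else is medium."""
        if row["expected_time_29"] < row["mean"] - row["std"]:
            return "short"
        if row["expected_time_29"] > row["mean"] + row["std"]:
            return "long"
        return "medium"

    df["test_length"] = df.apply(length_group, axis=1)
    print(df.groupby("test_length").size())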
Given the practical impossibility of assuring that all examinees are administered tests that take
exactly (or even approximately) the same time to complete, and given the inability of the current
version of the three parameter model to deal with a string of guesses at the end of an
examination, some modification of current procedures is needed. One possibility is to simply
make the test less speeded. If examinees have time to fully, or at least partially, consider all
questions, then minor variations in test length are of no practical consequence. Making the test
less speeded, however, should not be confused with removing all time constraints. Time limits
may be needed for practical reasons related to paying for seat time in a testing center, and may
also be needed to assure that the test is continuing to assess a reasoning skill. Excessive time
may make the test an evaluation of persistence in blindly trying alternatives until one works,
rather than assessing a reasoning construct. However, a modest reduction in the number of items
(say 29 items in an hour rather than 35) would have no impact on seat time charges, would still
not permit blind guessing strategies, but should allow substantially more examinees to at least
make some attempt at all of the items. If a higher degree of speededness were essential to
measure the desired reasoning construct, then a scoring and delivery model would need to be
developed that adequately deals with the speed dimension.
References
Bridgeman, B., & Cline, F. (2000). Variations in mean response times for questions on the
computer-adaptive GRE General Test: Implications for fair assessment (GRE Board Professional
Report No. 96-20P; ETS RR-00-7). Princeton, NJ: Educational Testing Service.
Steffen, M., & Way, W. D. (1999, April). Test-taking strategies in computerized adaptive testing.
Paper presented at the annual meeting of the National Council on Measurement in Education,
Montreal.
[Pie chart of the full sample: Moderate Finished 45%; Severe Finished 15%; Extreme Finished 8%; Moderate Did Not Finish 5%; Severe Did Not Finish 10%; Extreme Did Not Finish 17%]
Figure 1. Time Pressure After Item 29 - Full Sample
[Two pie charts. Predicted GRE-A 450 to 540: Moderate Finished 46%; Severe Finished 15%; Extreme Finished 7%; Moderate Did Not Finish 5%; Severe Did Not Finish 10%; Extreme Did Not Finish 17%. Predicted GRE-A 650 to 740: Moderate Finished 37%; Severe Finished 17%; Extreme Finished 10%; Moderate Did Not Finish 6%; Severe Did Not Finish 12%; Extreme Did Not Finish 18%]
Figure 2. Time Pressure After Item 29
[Bar chart: percent of each time pressure group (Moderate, Severe, Extreme) by number of items answered correctly in positions 30 to 35, from 0 through 6, for examinees who finished the exam]
Figure 3. Time Pressure After Item 29 - Finished Exam
Table 1
Percent of Each Group in Each Time Pressure Category

Group       Moderate (>5.5 minutes)    Severe (2-5.5 minutes)    Extreme (<2 min.)
            n = 54,804 (49.9%)         n = 27,577 (25.1%)        n = 27,519 (25.0%)
Male        45.6                       26.8                      27.6
Female      53.5                       23.5                      23.0
White       53.8                       23.4                      22.8
Black       56.1                       21.7                      23.2
Asian       42.2                       27.7                      30.1
Hispanic    50.5                       23.2                      26.3
Other       48.1                       26.8                      25.1
Table 2
Mean Test Scores of Examinees for Analytic, Quantitative and Verbal Sections,
By Time Pressure on the GRE Analytic Section

Score     Moderate    Severe    Extreme
GRE-A     539         540       504
GRE-Q     526         574       574
GRE-V     445         455       450
Table 3
Percent of Each Time Pressure Group with Indicated
Theta Change (Theta at 35 Minus Theta at 29)

Change in Theta    Moderate    Severe    Extreme
< -1.0             0.2         2.5       6.0
-1.0 to -0.5       3.2         18.8      28.0
-0.5 to 0.0        72.3        75.5      64.8
0.0 to 0.5         23.8        3.0       1.1
> 0.5              0.5         0.1       0.1
Table 4
Impact of Unlucky Guessing on GRE-A Scores

             Person A                              Person B
Item   Score   Adj. Score   IRT a   IRT b    Score   Adj. Score   IRT a   IRT b
28     670     570          .79     .76      640     550          .66     -.50
29     660     580          .90     .92      610     540          .60     -.87
30     650     590          .78     .74      550     500          .81     .17
31     640     600          1.70    .79      480     450          1.02    .24
32     630     610          1.00    .89      380     370          1.28    -.55
33     620     620          1.30    .93      220     230          .59     -3.11
34     610     620          1.28    .50      220     230          1.18    -.26
35     610     640          .80     -.13     310     330          .63     -1.10

Note: Both Person A and Person B make incorrect answers to items 28-34 and a correct answer to item 35.
(Adapted from an unpublished table by Marilyn Wingersky.)
Table 5
Expected and Actual Time After 29 Items
By Selected Examinee Thetas After Item 29

                    Expected Time (minutes)    Actual Time (minutes)
Theta      n        M         SD               M         SD
-1.5       1364     54.7      2.6              47.0      9.8
-1.4       1478     54.6      2.7              47.6      9.7
 0.0       3168     58.6      3.0              51.8      6.6
 0.1       3192     59.0      3.0              52.0      6.6
 1.5       1802     64.2      3.1              53.7      5.2
 1.6       1470     64.3      2.8              53.9      5.1
Table 6
Final GRE-A Scores of Examinees Whose
Estimated Theta After 29 Items was 1.0 or 1.5
And Who Had Taken Short or Long Tests

Expected Test Length               GRE-A           GRE-Q           GRE-V
Through Item 29        n           M      SD       M      SD       M      SD

Theta at 29 = 1.0
Short                  400         664    32       632    108      479    100
Medium                 1725        651    35       621    103      491    96
Long                   419         639    43       631    100      520    97

Theta at 29 = 1.5
Short                  309         713    32       670    91       504    102
Medium                 1178        700    36       656    94       529    96
Long                   59          689    38       662    92       535    98