
LSAC RESEARCH REPORT SERIES
Combinatorial Search Algorithm for Detection of Test Collusion
Dmitry I. Belov
 Law School Admission Council
Research Report 13-01
March 2013
A Publication of the Law School Admission Council
The Law School Admission Council (LSAC) is a nonprofit corporation whose members are more than 200
law schools in the United States, Canada, and Australia. Headquartered in Newtown, PA, USA, the
Council was founded in 1947 to facilitate the law school admission process. The Council has grown to
provide numerous products and services to law schools and to more than 85,000 law school applicants
each year.
All law schools approved by the American Bar Association (ABA) are LSAC members. Canadian law
schools recognized by a provincial or territorial law society or government agency are also members.
Accredited law schools outside of the United States and Canada are eligible for membership at the
discretion of the LSAC Board of Trustees.
© 2013 by Law School Admission Council, Inc.
All rights reserved. No part of this work, including information, data, or other portions of the work
published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or
mechanical, including photocopying and recording, or by any information storage and retrieval system,
without permission of the publisher. For information, write: Communications, Law School Admission
Council, 662 Penn Street, PO Box 40, Newtown, PA, 18940-0040.
This study is published and distributed by LSAC. The opinions and conclusions contained in this report
are those of the author(s) and do not necessarily reflect the position or policy of LSAC.
Table of Contents
Executive Summary
Introduction
Existing Applications of Kullback–Leibler Divergence for Test Security
Problem Statement
Analysis of the Problem
Detection Algorithm
Computer Simulations
Summary
References
Acknowledgment
Executive Summary
This report presents a new algorithm for detecting groups of test takers (aberrant
groups) who had access to subsets of test questions (aberrant subsets) prior to an
exam. This method is in line with the development of statistical methods for detecting
test collusion, a new research direction in test security. Test collusion may be described
as the large-scale sharing of test materials, including answers to test questions. The
algorithm employs several new statistics to perform a sequence of statistical tests to
identify aberrant groups. The algorithm is flexible and can be easily modified to detect
other types of test collusion. It can also be applied within all major modes of testing:
paper-and-pencil testing, computer-based testing, multiple-stage testing, and
computerized adaptive testing. A simulation study demonstrates the advantages of
using the algorithm in computerized adaptive testing.
Introduction
Testing organizations have a great interest in recognizing whether or not the
performances of a test taker across multiple tests or across subject areas within a test
are homogeneous. Heterogeneous performance of a test taker may indicate a number
of phenomena including answer copying (Karabatsos, 2003), preknowledge of certain
items (Karabatsos, 2003), memorizing items instead of answering, item response
instructions that the test taker finds confusing (Karabatsos, 2003), non-unidimensional
testing, influential observations (Bradlow & Zaslavsky, 1997), test faking (LeBreton,
Barksdale, Robin, & James, 2007), and fatigue. Existing methods can be partitioned into
two categories:
1. Detecting person misfit: Person-fit or appropriateness measurement refers to
statistical methods used to evaluate the fit of a response pattern to a particular
test model. Methods from this category can be used to make decisions about
each group of test takers of size one, where a test taker is aberrant or not
(Armstrong & Shi, 2009; Belov, Pashley, Lewis, & Armstrong, 2007; Guttman,
1944; Harnisch & Linn, 1981; Karabatsos, 2003; Meijer, 1996; Meijer & Sijtsma,
2001; van der Flier, 1977; van Krimpen-Stoop & Meijer, 2001).
2. Detecting answer copying between two test takers: Answer-copying behavior
often results in an unusual agreement between the incorrect answers of a pair of
test takers, where one member of the pair is the source and the other member is
the copier who copies answers from the source. Methods from this category can
be used to make decisions about each group of test takers of size two, where a
pair is aberrant (source, copier) or not (Angoff, 1974; Belov, 2011; Frary, 1993;
Harpp & Hogan, 1993; Holland, 1996; Sotaridona, van der Linden, & Meijer,
2006; Wesolowsky, 2000; Wollack, 1997).
Current research on test security is focusing on methods for detecting larger groups
involved in test collusion (Jacob & Levitt, 2003; Wollack & Maynes, 2011; Zhang,
Searcy, & Horn, 2011). Test collusion may be described as large-scale sharing of test
materials or answers to test questions. The source of the shared information could be a
teacher, a test-preparation company, the Internet, or test takers communicating on the
day of the exam.
In order to identify classrooms where teachers have changed answers, Jacob and
Levitt (2003) analyzed the joint distribution of two summary statistics: an answer strings
summary and an unexpected score fluctuations summary. Both cluster analysis
(Wollack & Maynes, 2011) and factor analysis (Zhang et al., 2011) were demonstrated
to be applicable for detecting various types of test collusion. However, all three methods
rely on statistics also used in detecting answer copying; therefore they lose power when
there is a lack of matching between incorrect responses within a group of test takers
involved in test collusion. Furthermore, statistics based on matching responses will have
low power in multiple-stage testing (MST) and computerized adaptive testing (CAT)
because the actual test varies across test takers. This report presents a new approach
that does not rely on response matching statistics, but instead utilizes a difference
between distributions measured by the Kullback–Leibler divergence (Kullback & Leibler,
1951).
Throughout the report the following notation is used:
• Lowercase letters a, b, c, ...; α, β, γ, ... denote scalars (including random variables).
• Capital letters A, B, C, ... denote sets; |S| denotes the number of elements in a set S.
• Bold lowercase letters a, b, c, ... denote vectors.
• Bold capital letters A, B, C, ... denote functions (including discrete distributions defined by probability mass functions).
Existing Applications of Kullback–Leibler Divergence for Test Security

Given two discrete distributions G and H, both defined on a finite set {θ_1, θ_2, ..., θ_k}, the following measure allows one to estimate how dissimilar these two distributions are (Cover & Thomas, 1991; Kullback & Leibler, 1951):

    D(G || H) = \sum_{i=1}^{k} G(\theta_i) \ln \frac{G(\theta_i)}{H(\theta_i)}.    (1)
The definition of Kullback–Leibler divergence is valid for both discrete distributions, like Equation (1) used in this report, and continuous distributions: the larger the divergence, the higher the dissimilarity between the distributions. The value of D(G || H) is always nonnegative and equals zero only if the two distributions are identical. Kullback–Leibler divergence is asymmetric; that is, in general, D(G || H) ≠ D(H || G).
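To make Equation (1) concrete, here is a minimal C++ sketch (C++ being the language the author reports using for the implementation; this fragment is illustrative, not the author's code). It assumes both distributions are given as normalized probability vectors over the same support:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Equation (1) for two discrete distributions given as probability vectors
// over the same support. Assumes both vectors are normalized and h[i] > 0
// wherever g[i] > 0; terms with g[i] = 0 contribute nothing.
double kl_divergence(const std::vector<double>& g, const std::vector<double>& h) {
    double d = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i)
        if (g[i] > 0.0) d += g[i] * std::log(g[i] / h[i]);
    return d;
}
```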
The use of Kullback–Leibler divergence to detect person misfit was first proposed by
Belov et al. (2007). Its application for detecting the copier (see description of answer
copying in the Introduction) was developed by Belov and Armstrong (2010). The
asymptotic distribution of Kullback–Leibler divergence with applications beyond test
security was studied by Belov and Armstrong (2011).
Consider a test T partitioned into two subtests T_g and T_h. These subtests may intersect each other. For a linear test, examples include operational and variable items, hard and easy items, quantitative and verbal items, or sections before and after a break. For CAT, an example might be administered items that were stolen (T_g) and the other administered items (T_h).

Consider a test taker with latent trait (ability) θ taking two subtests T_g and T_h with m and n items, respectively. Let r_g = (r_{g1}, r_{g2}, ..., r_{gm}) and r_h = (r_{h1}, r_{h2}, ..., r_{hn}) represent the response vectors of the test taker to T_g and T_h, respectively. Bayes' theorem with a uniform prior is used to compute the posterior distribution of θ on each subtest based on the responses. The posterior probabilities for θ based on responses r_g to the subtest T_g are
    G(\theta_i \mid r_g) = \frac{\prod_{j=1}^{m} P(r_{gj} \mid \theta_i)}{\sum_{l=1}^{k} \prod_{j=1}^{m} P(r_{gj} \mid \theta_l)},    i = 1, ..., k,    (2)

where P(r_{gj} | θ_i) is the probability of response r_{gj} given ability level θ_i. Similarly, the posterior probabilities for θ based on responses r_h to the subtest T_h are

    H(\theta_i \mid r_h) = \frac{\prod_{j=1}^{n} P(r_{hj} \mid \theta_i)}{\sum_{l=1}^{k} \prod_{j=1}^{n} P(r_{hj} \mid \theta_l)},    i = 1, ..., k.    (3)
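A minimal sketch of how the posteriors in Equations (2) and (3) might be computed, under the assumption that the response probabilities P(r_j | θ_i) have already been tabulated; the table layout is hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Posterior over a discrete ability grid theta_1, ..., theta_k with a
// uniform prior (Equations (2) and (3)): the normalized product of response
// probabilities. p[j][i] is a hypothetical table holding P(r_j | theta_i)
// for the j-th administered item. (For long tests, accumulating in log
// space would be numerically safer.)
std::vector<double> ability_posterior(const std::vector<std::vector<double>>& p) {
    if (p.empty()) return {};
    std::vector<double> post(p.front().size(), 1.0);
    double norm = 0.0;
    for (std::size_t i = 0; i < post.size(); ++i) {
        for (const auto& item : p) post[i] *= item[i];  // likelihood product
        norm += post[i];
    }
    for (double& v : post) v /= norm;  // normalize so probabilities sum to 1
    return post;
}
```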
The Kullback–Leibler divergence D(G || H) between posteriors G and H is computed by Equation (1). Relatively large values of D(G || H) indicate a significant difference in
the test taker’s performance between the two subtests (Belov & Armstrong, 2010,
2011). There could be various causes for this difference, but the cause of interest in this
particular investigation is test collusion. Therefore, each type of test collusion has to be
formalized.
Problem Statement
It is assumed that there is a fixed relation between test takers that partitions test
takers into nonintersecting groups. For example, the relation of taking the test at the same geographic location (e.g., room, college, state, region, country) partitions test takers into test centers (Figure 1).
[Figure body omitted: Test Center 1 contains Test Takers 2, 3, and 8; Test Center 2 contains Test Takers 1, 4, 5, and 6; Test Center 3 contains Test Takers 7 and 9.]
FIGURE 1. Partitioning of test takers by test center
Each test taker can be represented as a vertex of a graph, and two test takers form an edge if and only if they are in a relation (e.g., taking the exam at the same test center). Then the graph is partitioned into complete subgraphs. A complete graph, or clique, is a graph in which every pair of distinct vertices is connected by a unique edge. Therefore, the above nonintersecting groups of test takers will be called cliques (Figure 2).
[Figure body omitted: the cliques are {2, 3, 8} for test center 1, {1, 4, 5, 6} for test center 2, and {7, 9} for test center 3.]
FIGURE 2. Three cliques corresponding to the partitioning from Figure 1
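As an illustrative sketch (not from the report), a clique partition can be built by grouping test-taker IDs under a shared relation key; the test-center key here is a hypothetical stand-in for any relation:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Group test-taker IDs under a shared relation key; each map entry is one
// clique. The test-center key is a hypothetical stand-in for any relation
// (high school, test-prep center, online social network group, ...).
std::map<std::string, std::vector<int>> build_cliques(
        const std::vector<std::pair<int, std::string>>& takers) {
    std::map<std::string, std::vector<int>> cliques;
    for (const auto& t : takers)
        cliques[t.second].push_back(t.first);
    return cliques;
}
```

Swapping the key for a different relation changes the partition without changing the structure of the computation.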
The same geographic location is the most common relation. However, there are
other relations highly practical for test security: same high school, same undergraduate
college, same test-prep center, or same group in an online social network. These
relations allow the detection of test takers involved in test collusion, even if they take the
actual exam at different geographic locations. A choice of relation is up to the user;
however, the relation should be chosen such that the resulting number of cliques is
much less than the total number of test takers; otherwise small cliques should be
merged.
This report studies the following type of test collusion. In some clique, a group of test
takers had access to a subset of items from a given pool prior to the exam. This type of
test collusion is known as item preknowledge. The detection of different types of test
collusion can be reduced to the detection of item preknowledge. For example, when a
teacher corrects answers for a group of test takers, it can be detected as item
preknowledge. It is assumed that the probability of the correct response to an item from
the subset is 1 for each test taker from the group. Practical examples of the subset
include stolen items or items memorized during a previous test administration; that is,
highly exposed items in CAT and MST, or pretest items in paper-and-pencil testing (P&P) and computer-based testing (CBT).
Test takers involved in test collusion will be called aberrant test takers. They form
aberrant groups. Items pre-accessed by an aberrant group form an aberrant subset.
Cliques with aberrant groups will be called aberrant cliques.
Thus, each aberrant group has access to a unique aberrant subset. Assuming that
each aberrant group is completely contained in some clique, one has to identify all
aberrant groups without having any knowledge about their corresponding aberrant
subsets.
Analysis of the Problem
Let us suppose that an aberrant subset S is known, and consider a corresponding aberrant test taker j who was administered a set of items T_j. Equations similar to (2) and (3) compute two posterior distributions of ability: P_{S ∩ T_j}, from responses to the administered items in T_j that belong to S; and P_{T_j \ S}, from responses to the administered items in T_j that do not belong to S. Note that the case where S ∩ T_j = ∅ for each aberrant test taker j implies no test security risk or collusion. Therefore, it is assumed that S ∩ T_j ≠ ∅ for each aberrant test taker j. Due to item preknowledge, the distribution P_{S ∩ T_j} will be shifted toward higher ability more than the distribution P_{T_j \ S} will. Clearly, this shift is even larger for lower-ability test takers. Therefore, any measure of dissimilarity between distributions can be used for detecting low-ability aberrant groups. Since low-ability test takers involved in test collusion have the largest impact on the scoring, the use of such a measure is highly practical. This dissimilarity can be measured by the Kullback–Leibler divergence D(P_{T_j \ S} || P_{S ∩ T_j}) computed by Equation (1).

If a given pool contains n items, then about 2^n possible subsets may be aberrant. For each subset S, one can apply a hypothesis test for the statistic D(P_{T_j \ S} || P_{S ∩ T_j}), j ∈ J, where J is the set of all test takers. The enormous number of subsets and the problem of multiple comparisons (Abdi, 2007) make this approach totally unrealistic. However, it is possible to make three realistic assumptions that would make this problem tractable.
Assumption 1:
Each aberrant subset belongs to a fixed subset Q (Figure 3). In CAT or MST, the
subset Q may contain items with exposure (computed from a previous test
administration) higher than a certain threshold. This is a realistic assumption,
because the greater the exposure of an item, the higher the probability that the item
will be used later in test collusion. In P&P and CBT, the subset Q may contain items
that were previously pretested.
[Figure body omitted: the item pool contains a subset Q, and three aberrant subsets S_1, S_2, and S_3 lie inside Q.]
FIGURE 3. An example of three aberrant subsets S_1, S_2, and S_3. Each aberrant subset is unknown to us. The aberrant subsets belong to a subset Q (see Assumption 1), their size is between l and u (see Assumption 2), and if they are known to aberrant groups from the same clique, then they should have a large intersection (see Assumption 3).
Assumption 2:
Lower and upper bounds on the size of each aberrant subset are known as l and u, respectively. In other words, for each aberrant subset S the following inequality holds: l ≤ |S| ≤ u. Under Assumption 1, the bounds can be set with respect to |Q| (the size of Q).
Assumption 3:
Within each aberrant clique, all its aberrant groups had access to items from similar aberrant subsets. This assumption is realistic because of Assumption 1 as well as the empirical fact that the more an item is exposed, the higher the probability that it will be used in test collusion.
In practice, Assumption 3 can be tightened such that each aberrant clique has no more than one aberrant group. This is realistic because a clique can be defined in various ways (see above). For example, a smaller subset of test takers (room, class, etc.) can be considered a clique.
Since test collusion potentially involves a large number of test takers, if a clique X ⊆ J is aberrant, then the corresponding aberrant subset S will cause the empirical distribution H_{S,X} of the statistic D(P_{T_j \ S} || P_{S ∩ T_j}), j ∈ X, to be dissimilar from the empirical distribution H_{S,Y} of the statistic D(P_{T_i \ S} || P_{S ∩ T_i}), i ∈ Y, where Y ⊆ J is a nonaberrant clique. This is the central idea of the approach taken in this report, resulting in the following statistic:

    g_{S,X} = \sum_{Y \subseteq J} \left[ D(H_{S,X} || H_{S,Y}) + D(H_{S,Y} || H_{S,X}) \right],    (4)

where the summation is taken over all cliques Y ⊆ J, and D(H_{S,X} || H_{S,Y}) and D(H_{S,Y} || H_{S,X}) are computed according to Equation (1). The sum D(H_{S,X} || H_{S,Y}) + D(H_{S,Y} || H_{S,X}) is used to balance the asymmetry of the Kullback–Leibler divergence. The empirical distribution of the statistic g_{S,X} across all cliques is computed as follows:

    G_S(X) = \frac{g_{S,X}}{\sum_{Y \subseteq J} g_{S,Y}},    X ⊆ J.    (5)

Assuming that clique X ⊆ J is aberrant and S is the corresponding aberrant subset (Assumption 3), the value G_S(X) should reach its maximum; that is, G_S(X) = max_V G_V(X), where V enumerates all subsets of Q with V ⊆ Q (Assumption 1) and l ≤ |V| ≤ u (Assumption 2). Thus, the value G_S(X) can be used as an optimization criterion in a combinatorial search for the aberrant subset S if X is already identified as an aberrant clique.
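A sketch of Equations (4) and (5), assuming the empirical (binned) distributions H_{S,X} have already been computed for every clique; the binning itself is outside this fragment, and kl_divergence is the Equation (1) sketch given earlier:

```cpp
#include <cstddef>
#include <vector>

// kl_divergence is the Equation (1) sketch; declared here so this
// fragment stands alone.
double kl_divergence(const std::vector<double>& g, const std::vector<double>& h);

// Equation (4): symmetrized divergence between clique x's empirical
// distribution and every other clique's. hS[x] holds the hypothetical
// binned distribution of D(P_{T_j \ S} || P_{S & T_j}) over members of
// clique x for a candidate subset S.
double g_stat(std::size_t x, const std::vector<std::vector<double>>& hS) {
    double g = 0.0;
    for (std::size_t y = 0; y < hS.size(); ++y)
        g += kl_divergence(hS[x], hS[y]) + kl_divergence(hS[y], hS[x]);
    return g;
}

// Equation (5): normalize g_{S,X} across all cliques. (A real implementation
// would cache the g_stat values instead of recomputing them.)
double G_stat(std::size_t x, const std::vector<std::vector<double>>& hS) {
    double total = 0.0;
    for (std::size_t y = 0; y < hS.size(); ++y) total += g_stat(y, hS);
    return g_stat(x, hS) / total;
}
```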
To identify aberrant cliques, the following statistic is introduced:

    c_X = \sum_{i=1}^{m} G_{S_i}(X),    (6)

where the subsets S_1, S_2, ..., S_m are randomly generated such that S_i ⊆ Q and l ≤ |S_i| ≤ u. Since each aberrant subset may intersect with multiple S_i, this statistic should have large values for the corresponding aberrant cliques. The critical value for the statistic c_X can be computed from simulated data given a fixed significance level.
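Given the Equation (5) values, Equation (6) reduces to a sum over the m random subsets; in this sketch the table GS of precomputed G_{S_i}(X) values is hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Equation (6): c_X summed over the m random subsets S_1, ..., S_m.
// GS[i][x] is a hypothetical table of G_{S_i}(X) values (see the
// Equation (5) sketch above) for subset S_i and clique index x.
double c_stat(std::size_t x, const std::vector<std::vector<double>>& GS) {
    double c = 0.0;
    for (const auto& row : GS) c += row[x];  // sum of G_{S_i}(X) over i
    return c;
}
```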
Detection Algorithm

For each clique X* identified as aberrant, the algorithm runs simulated annealing¹ (SA) in order to identify the corresponding aberrant subset S*. For each X*, SA starts with a random subset; then each iteration of SA tries to improve the subset by adding a selected item from Q, swapping a random item from the subset with a selected item from Q, or removing a random item from the subset. Thus, two procedures are critical for the performance of SA for the problem in this study:
Procedure 1 (selection of an item to modify the subset)
Step 1: Compute discrete distribution F for items from Q (see Assumption 1), where
F is normalized item exposure.
Step 2: An item is drawn from the discrete distribution F. Using F is more realistic than applying the uniform distribution, because the greater the exposure of an item, the higher the probability that the item will be used in test collusion.
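A sketch of Procedure 1 using the standard library: std::discrete_distribution normalizes the supplied weights itself, matching the definition of F as normalized item exposure; the exposure vector is assumed given:

```cpp
#include <random>
#include <vector>

// Procedure 1: draw one item index from Q in proportion to exposure.
// `exposure` is a hypothetical vector of exposure rates for the items in Q.
int draw_item(const std::vector<double>& exposure, std::mt19937& rng) {
    std::discrete_distribution<int> F(exposure.begin(), exposure.end());
    return F(rng);
}
```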
¹ Simulated annealing (SA) is a generic heuristic for locating a good approximation to the global optimum of a given function in a large search space. The term annealing is borrowed from metallurgy, where annealing involves heating and controlled cooling of a material to reduce its defects. For more details see, for example, van Laarhoven and Aarts (1987).
Procedure 2 (modification of the subset)
Step 1: Simulate a random variable κ distributed as follows:

    Operation:   0    1    2
    Probability: 1/3  1/3  1/3    (7)

where
Operation 0: The selected item will be added to the subset.
Operation 1: The selected item will be swapped with a random item from the subset.
Operation 2: Before Operation 1, a random item will be removed from the subset.
SA is parameterized as follows. The initial value of the temperature parameter t is set to 1. It is held constant for a fixed number h of trials to improve the subset. After h trials, t is reduced as t ← t × d, where 0 < d < 1, and the next h trials begin. The SA terminates if no modification to the subset was accepted during the previous h trials. The following describes the detailed steps of the detection algorithm.
Algorithm 1
Step 1. Simulate response data without test collusion, where test-taker abilities are drawn from the N(0, 1) distribution. The resultant simulated test takers are denoted as E, where test takers are randomly assigned to cliques; the number of test takers and the number of cliques are the same as in the original data. Generate random subsets of items S_1, S_2, ..., S_m, where items are drawn from F (see Procedure 1 above), S_i ⊆ Q and l ≤ |S_i| ≤ u. For each clique Z ⊆ E compute the statistic c_Z according to Equation (6). Compute the empirical distribution C of the statistic c_Z. Given significance level α_C, compute the critical value v_C for C.
Step 2. For each clique X ⊆ J, compute the statistic c_X according to Equation (6). Select the first clique X* (a potentially aberrant clique) such that c_{X*} > v_C.
Step 3. Let i* = arg max_{i=1,...,m} G_{S_i}(X*), S* = S_{i*}, S_0 = S*, f = 0, t = 1, d = 0.5, h = 5, z = 1.
Step 4. Set S = S_0. Select a random item i ∈ Q \ S drawn from the discrete distribution F (see Procedure 1 above). Simulate the random variable κ according to the discrete distribution given by Equation (7), and then perform the chosen operation with S and the selected item i (see Procedure 2 above). Compute G_S(X*) by Equation (5), where J = {X*} ∪ E.
Step 5. If G_S(X*) > G_{S*}(X*), then go to Step 6; otherwise go to Step 7.
Step 6. Set f = 1, S_0 = S, and S* = S. Go to Step 8.
Step 7. Simulate a uniformly distributed ξ ∈ [0, 1). If ξ < exp((G_S(X*) − G_{S*}(X*)) / t), then set f = 1 and S_0 = S.
Step 8. If z < h, then set z = z + 1 and go to Step 4.
Step 9. If f = 1, then set f = 0, t = t × d, z = 1, and go to Step 4.
Step 10. Simulate nonaberrant test takers R with abilities drawn from the N(0, 1) distribution. Compute the empirical distribution H_{S*,R} of the statistic D(P_{T_r \ S*} || P_{S* ∩ T_r}), r ∈ R. Given significance level α_H, compute the critical value v_H for H_{S*,R}.
Step 11. Report each test taker j ∈ X* with D(P_{T_j \ S*} || P_{S* ∩ T_j}) > v_H as aberrant.
Step 12. Select the next clique X* (a potentially aberrant clique) such that c_{X*} > v_C and go to Step 3; otherwise, if there are no more such cliques left, STOP.
Steps 2 and 12 select each clique with a value of the statistic c_X from a critical region identified at Step 1 via computer simulation. For each selected clique X*, Steps 3–9 implement SA (van Laarhoven & Aarts, 1987) for the combinatorial search of an aberrant subset S*, where the initial subset is S_{i*}, i* = arg max_{i=1,...,m} G_{S_i}(X*). Step 11 reports each test taker j ∈ X* with D(P_{T_j \ S*} || P_{S* ∩ T_j}) from a critical region identified at Step 10 via computer simulation.
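The acceptance logic of Steps 5–7 is the classical Metropolis rule; a minimal sketch, with delta standing for G_S(X*) − G_{S*}(X*):

```cpp
#include <cmath>
#include <random>

// Metropolis-style acceptance behind Steps 5-7: an improvement (delta > 0)
// is always accepted; a worse subset is accepted with probability
// exp(delta / t), which shrinks as the temperature t is cooled (t <- t * d).
bool accept_move(double delta, double t, std::mt19937& rng) {
    if (delta > 0.0) return true;                         // Steps 5-6
    std::uniform_real_distribution<double> xi(0.0, 1.0);  // xi in [0, 1)
    return xi(rng) < std::exp(delta / t);                 // Step 7
}
```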
Informally, Algorithm 1 is a sequence of two statistical tests: (a) identification of aberrant cliques; and (b) identification of aberrant test takers within each aberrant clique. For example, if the number of test takers is 10,000 and the number of cliques is 100 with 100 test takers per clique, then the number of incorrectly reported test takers can be closely approximated by 100α_C × 100α_H.

Common methods for detecting person misfit contain just one statistical test applied to all test takers; that is, to the entire set J. Assume that an existing method operates under significance level α_H. Then, for this example, the number of incorrectly reported test takers can be closely approximated by 10,000α_H, which is 1/α_C times larger than for Algorithm 1.
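For instance, with α_C = 0.25 and α_H = 0.01 (values within the ranges used in the simulations below), Algorithm 1 would incorrectly report about 100 × 0.25 × 100 × 0.01 = 25 test takers, whereas a single-test method would report about 10,000 × 0.01 = 100, a fourfold (1/α_C) difference.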
Computer Simulations

The objective of this section is to compare the performance of Algorithm 1 in a simulated CAT environment with that of two classical person-fit statistics: Guttman (1944) and van der Flier (1977). They were chosen for their easy implementation and because their performance is close to that of existing person-fit statistics. In particular, according to Karabatsos (2003), the statistic G by Guttman (1944) provided an ROC area of about 0.56, whereas the mean ROC area (computed from 36 person-fit statistics on simulated cheating) was about 0.61.
Multiple simulation studies were conducted using disclosed Logical Reasoning (LR)
items of the Law School Admission Test (LSAT). The response probability for each item
was modeled by the three-parameter logistic (3PL) model (Lord, 1980). The CAT pool
contained 500 LR items. The distributions of the (a) discrimination, (b) difficulty, and (c) guessing parameters of the items in the CAT pool have the following minimums, maximums, means, and variances, respectively: (a) minimum 0.28, maximum 1.67, mean 0.75, variance 0.06; (b) minimum −2.47, maximum 2.92, mean 0.49, variance 1.27; and (c) minimum 0.00, maximum 0.52, mean 0.17, variance 0.01.
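For reference, a sketch of the 3PL response probability; the scaling constant D = 1.7 is a common convention and an assumption here, since the report does not state the scaling it used:

```cpp
#include <cmath>

// 3PL response probability (Lord, 1980): a = discrimination, b = difficulty,
// c = guessing. D = 1.7 is assumed, not stated in the report.
double p_correct_3pl(double theta, double a, double b, double c) {
    return c + (1.0 - c) / (1.0 + std::exp(-1.7 * a * (theta - b)));
}
```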
There were 100 cliques with 100 test takers per clique. Nonaberrant test takers were
simulated with abilities drawn from N(0, 1) distribution; aberrant test takers were
simulated with abilities drawn from N(0, 1) or U(−3, 0) distributions. Since low-ability
aberrant groups have the largest impact on scoring, it is practical to consider U(−3, 0)
as one of the possible distributions of aberrant test takers. Another argument for using
uniform distribution is that there is no assumption about which test taker has a
higher/lower probability of being included in an aberrant group.
The item selection criterion for CAT was the maximization of Fisher information at the current estimate of ability θ̂. The test length was fixed at 50 items for each test taker. The estimator of θ was the expected a posteriori (EAP) estimator with a uniform prior. The ability estimate was initialized at θ̂ = 0. There was no item exposure control.
Each aberrant clique could have no more than one aberrant group (see Assumption 3 above). The number of aberrant cliques was n_c ∈ {5, 10, 15, 20, 30}. The number of aberrant test takers in each aberrant clique was n_e ∈ {5, 10, 15, 20, 30}. Each group was randomly assigned to a clique such that the total number of test takers in each clique was 100. All algorithms were implemented in C++ by the author.
Precision (van Rijsbergen, 1979) was chosen as a measure of algorithm performance:

    \text{Precision} = \frac{\text{number of aberrant test takers detected}}{\text{number of test takers detected}}.    (8)
The following simulation study was performed:
[1] Simulate CAT without test collusion.
[2] Compute item exposure.
[3] Form the subset Q of potentially stolen items with an exposure value higher than 0.4 (see the sketch after this list). This step resulted in Q with 51 items.
[4] Compute discrete distribution F for items from Q, where F is normalized item exposure.
[5] To each aberrant group assign a unique random subset S ⊆ Q such that 25 ≤ |S| ≤ 35, with items drawn from F. Using F is more realistic than applying the uniform distribution, because the greater the exposure of an item, the higher the probability that the item will be used in test collusion.
[6] Simulate CAT with test collusion, where n_c ∈ {5, 10, 15, 20, 30} and n_e ∈ {5, 10, 15, 20, 30}.
[7] Run the algorithms to identify aberrant test takers and compute their precision.
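A sketch of step [3], under the assumption that the item exposure rates from step [2] are available as a vector indexed by pool position:

```cpp
#include <cstddef>
#include <vector>

// Step [3]: keep the indices of pool items whose exposure rate exceeds a
// threshold (0.4 in this study, which yielded |Q| = 51).
std::vector<int> form_Q(const std::vector<double>& exposure, double threshold) {
    std::vector<int> Q;
    for (std::size_t i = 0; i < exposure.size(); ++i)
        if (exposure[i] > threshold) Q.push_back(static_cast<int>(i));
    return Q;
}
```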
Algorithm 1 was run with m = 100. The results for different sets of parameters are presented in Figures 4–7, where α_H ∈ {0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05} and α_C = 0.25. Critical values for Guttman (1944) and van der Flier (1977) were computed from simulated data (without test collusion) for the significance levels given by α_H. Because Algorithm 1 performs a sequence of two statistical tests, it was able to simultaneously reduce the Type I and Type II error rates in comparison with the classical methods of Guttman (1944) and van der Flier (1977).
[Figure body omitted: nine precision-versus-nominal-Type-I-error-rate panels for n_c ∈ {5, 10, 15} and n_e ∈ {5, 10, 15}.]
FIGURE 4. Precision values, where aberrant test takers (test collusion on a smaller scale) are drawn from the N(0, 1) distribution and the abscissa is α_H. The solid lines correspond to Guttman (1944); the dashed lines correspond to van der Flier (1977); the dotted lines correspond to Algorithm 1.
[Figure body omitted: nine precision-versus-nominal-Type-I-error-rate panels for n_c ∈ {10, 20, 30} and n_e ∈ {10, 20, 30}.]
FIGURE 5. Precision values, where aberrant test takers (test collusion on a larger scale) are drawn from the N(0, 1) distribution and the abscissa is α_H. The solid lines correspond to Guttman (1944); the dashed lines correspond to van der Flier (1977); the dotted lines correspond to Algorithm 1.
[Figure body omitted: nine precision-versus-nominal-Type-I-error-rate panels for n_c ∈ {5, 10, 15} and n_e ∈ {5, 10, 15}.]
FIGURE 6. Precision values, where aberrant test takers (test collusion on a smaller scale) are drawn from the U(−3, 0) distribution and the abscissa is α_H. The solid lines correspond to Guttman (1944); the dashed lines correspond to van der Flier (1977); the dotted lines correspond to Algorithm 1.
[Figure body omitted: nine precision-versus-nominal-Type-I-error-rate panels for n_c ∈ {10, 20, 30} and n_e ∈ {10, 20, 30}.]
FIGURE 7. Precision values, where aberrant test takers (test collusion on a larger scale) are drawn from the U(−3, 0) distribution and the abscissa is α_H. The solid lines correspond to Guttman (1944); the dashed lines correspond to van der Flier (1977); the dotted lines correspond to Algorithm 1.
Summary
This report formalizes the problem of detecting item preknowledge as follows:
Assuming that each aberrant group is completely contained in some clique, one has to
identify all aberrant groups without any knowledge about their corresponding aberrant
subsets. Such formalization is general, which makes it interesting both from a research
standpoint and from a practical standpoint. It was demonstrated that this problem
belongs to the intersection of two different fields: statistical hypothesis testing and
combinatorial optimization. To make this problem tractable, three assumptions about
the structure of an aberrant subset were made (Assumptions 1–3). This allowed the
development of a detection algorithm (Algorithm 1) based on Kullback–Leibler
divergence and simulated annealing.
Algorithm 1 can be easily modified to support other types of test collusion. Then, instead of D(P_{T_j \ S} || P_{S ∩ T_j}), one can use another person-fit statistic appropriate for the type of test collusion being studied.
Algorithm 1 is applicable for all major modes of testing: P&P, CBT, MST, and CAT. A simulation study demonstrated the advantages of using Algorithm 1, particularly in CAT, for the detection of test collusion on a large scale (see Figures 5 and 7). Note that for P&P and some versions of CBT, the test T_j = T is fixed for all test takers, which, due to the definition of the statistic D(P_{T_j \ S} || P_{S ∩ T_j}), should increase the power of Algorithm 1.
Algorithm 1 can be extended by using conditioning on ability regions. For example, the statistic D_{w,w+1}(P_{T_j \ S} || P_{S ∩ T_j}) is computed for each test taker j with estimated θ̂_j ∈ [θ_w, θ_{w+1}). Such a modification should be able to detect aberrant groups from various ability regions with higher precision than the original algorithm. However, a poor choice of ability regions may cause data sparseness and/or a multiple comparison problem.
Questions for further research include the following:
• How will results change if the lower and upper bounds (l and u, respectively) on the size of unknown aberrant subsets are violated (see Assumption 2)?
• How will results change if several aberrant groups are present within an aberrant clique (see Assumption 3)?
• How will results change if Procedures 1 and 2 are modified? Selection of an item from Q \ S may be deterministic, driven by a heuristic; the probabilities in Equation (7) can be changed; Operation 2 may mean just the removal of a random item from the subset S.
• Taking into account that Algorithm 1 can be immediately applied for CAT, MST, and CBT with posteriors of speed (see, e.g., van der Linden [2011] for details on response time modeling), how would this benefit the detection of test collusion?
• How will results change for CAT and MST if item exposure control is applied?
References
Abdi, H. (2007). Bonferroni and Šidák corrections for multiple comparisons. In N. J.
Salkind (Ed.), Encyclopedia of measurement and statistics. Thousand Oaks, CA:
Sage.
Angoff, W. (1974). The development of statistical indices for detecting cheaters. Journal
of the American Statistical Association, 69(345), 44–49.
Armstrong, R. D., & Shi, M. (2009). A parametric cumulative sum statistic for person fit.
Applied Psychological Measurement, 33, 391–410.
Belov, D. I. (2011). Detection of answer copying based on the structure of a high-stakes
test. Applied Psychological Measurement, 35, 495–517.
Belov, D. I., & Armstrong, R. D. (2010). Automatic detection of answer copying via
Kullback–Leibler divergence and K-index. Applied Psychological Measurement, 34,
379–392.
Belov, D. I., & Armstrong, R. D. (2011). Distributions of the Kullback–Leibler divergence
with applications. British Journal of Mathematical and Statistical Psychology, 64,
291–309.
Belov, D. I., Pashley, P. J., Lewis, C., & Armstrong, R. D. (2007). Detecting aberrant
responses with Kullback–Leibler distance. In K. Shigemasu, A. Okada, T. Imaizumi, &
T. Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Tokyo: Universal
Academy Press.
Bradlow, E. T., & Zaslavsky, A. M. (1997). Case influence analysis in Bayesian
inference. Journal of Computational and Graphical Statistics, 6, 314–331.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: John
Wiley & Sons, Inc.
Frary, R. B. (1993). Statistical detection of multiple-choice answer copying: Review and
commentary. Applied Measurement in Education, 6, 153–165.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review,
9, 139–150.
Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable
test data and dissimilar curriculum practices. Journal of Educational Measurement,
18, 133–146.
Harpp, D. N., & Hogan, J. J. (1993). Crime in the classroom: Detection and prevention
of cheating on multiple-choice exams. Journal of Chemical Education, 70, 306–311.
Holland, P. W. (1996). Assessing unusual agreement between the incorrect answers of
two examinees using the K-Index: Statistical theory and empirical support (ETS
Technical Report 96-4). Princeton, NJ: Educational Testing Service.
Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence
and predictors of teacher cheating. The Quarterly Journal of Economics, 118,
843–877.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of
thirty-six person-fit statistics. Applied Measurement in Education, 16, 277–298.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of
Mathematical Statistics, 22, 79–86.
LeBreton, J. M., Barksdale, C. D., Robin, J., & James, L. R. (2007). Measurement
issues associated with conditional reasoning tests: Indirect measurement and test
faking. Journal of Applied Psychology, 92, 1–16.
Lord, F. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Erlbaum.
Meijer, R. R. (1996). Person-fit research: An introduction. Applied Measurement in
Education, 9, 3–8.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied
Psychological Measurement, 25, 107–135.
Sotaridona, L. S., van der Linden, W. J., & Meijer, R. R. (2006). Detecting answer
copying using the kappa statistic. Applied Psychological Measurement, 30, 412–431.
van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y. P.
Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets &
Zeitlinger.
van der Linden, W. J. (2011). Modeling response times with latent variables: Principles
and applications. Psychological Test and Assessment Modeling, 53, 334–358.
van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). Cusum-based person-fit
statistics for adaptive testing. Journal of Educational and Behavioral Statistics, 26,
199–217.
van Laarhoven, P. J. M., & Aarts, E. H. L. (1987). Simulated annealing: Theory and applications. Norwell, MA: Kluwer Academic Publishers.
van Rijsbergen, C. J. (1979). Information retrieval. Newton, MA: Butterworth-Heinemann.
Wesolowsky, G. O. (2000). Detecting excessive similarity in answers on multiple choice
exams. Journal of Applied Statistics, 27, 909–921.
Wollack, J. A. (1997). A nominal response model approach for detecting answer
copying. Applied Psychological Measurement, 21, 307–320.
Wollack, J. A., & Maynes, D. (2011). Detection of test collusion using item response
data. Paper presented at the annual meeting of the National Council on Measurement
in Education, New Orleans, LA.
Zhang, Y., Searcy, C. A., & Horn, L. (2011). Mapping clusters of aberrant patterns in
item responses. Paper presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA.
Acknowledgment
I would like to thank Alex Weissman for valuable comments and suggestions on
previous versions of the report.