Caveon Data Forensics™ Process

Overview of Caveon Data Forensics
Steve Addicott
Dennis Maynes
March 2011
Outline of Presentation
 Overview of Caveon Data Forensics (DF) Process
 FDOE DF Goals
 DF Tools & Methods
 Spring 2011 FCAT DF Program
   Conservative thresholds
   Students & Schools
 Summary
 Q&A
Caveon Test Security… “The 3Ps”
 Proven
   Best in Class
   Serving the largest test programs in the world
 Practical
   Identify unusual test taking and respond appropriately
 Protection
   Integrity of exams, DOE reputation, students’ opportunities
Caveon Data Forensics™ Process
 Analyses of test data
   First, build a “model” of typical question responses
   Then, identify unusual behaviors with the potential for unfair advantage
Caveon Data Forensics Process (cont.)
 Examples of “Unusual” Behavior
   Very high agreement among pairs or groups of test takers
   Very unusual numbers of erasures, particularly wrong-to-right
   Very substantial gains or losses from one occasion to another
Overview of the Use of Data Forensics
 Many high-stakes testing programs now use Data Forensics
 Referenced in standards for testing, e.g., “CCSSO’s Operational Best Practices for State Assessment Programs”
 Essential to act on the results
FDOE Data Forensics Goals
 Uphold fairness and validity of test results
 Identify risks and irregularities
 Take action based on data and analysis
 “Measure and Manage”
 Communicate Zero Tolerance for Cheating
Testing Examiner’s Role
 Ensure (and then certify) that the test administration is fair and proper
 Declare scores invalid when fairness and validity are negatively impacted
 The decision depends upon fairness and validity, not on whether an individual cheated
Forensic Tools and Methods
 Similarity: answer-copying, collusion
 Erasures: tampering
 Gains: pre-knowledge, coaching
 Aberrance: tampering, exposure
 Identical tests: collusion
 Perfect tests: answer key loss
Similarity
 Our Most Powerful & “Credible” Statistic
 Measures the degree of similarity between 2 or more test instances
 Analyzes each test instance against all other instances of the same test (simplified sketch below)
 Probable causes of extremely high similarity:
   Answer Copying
   Test Coaching
   Proxy Test Taking
   Collusion
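The M4 similarity statistic itself is proprietary; what follows is a minimal sketch of the underlying idea, assuming a simple binomial model of chance agreement between independent test takers (the answer strings and the 0.3 chance-agreement rate are illustrative only, not Caveon's method):

```python
from math import comb

def match_probability(answers_a: str, answers_b: str, p_chance: float = 0.3) -> float:
    """P(at least the observed number of identical responses) for two answer
    strings, assuming independent test takers who agree on any single item
    with probability p_chance (a deliberately crude, illustrative model)."""
    n = len(answers_a)
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    # Binomial upper tail: P(X >= matches) under independence
    return sum(comb(n, k) * p_chance**k * (1 - p_chance)**(n - k)
               for k in range(matches, n + 1))

# Two 60-item answer strings that agree on 55 of 60 items
a = "BCADA" * 12
b = "BCADA" * 11 + "ADBCB"
print(f"chance of this much agreement under independence: {match_probability(a, b):.2e}")
```

A real similarity analysis would condition on each student's ability and each item's difficulty rather than using a single fixed agreement rate, but the flagging logic is the same: the smaller this probability, the harder it is to explain the agreement without copying, coaching, proxy testing, or collusion.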
Erasures
 Based on estimated answer-changing rates from:
   Wrong-to-Right (WtR)
   Anything-to-Wrong
 Finds answer sheets with unusually many WtR answer changes (sketch below)
 Extreme statistical outliers could involve tampering, “panic cheating”, etc.
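A minimal sketch of how unusually heavy wrong-to-right erasing might be flagged, assuming a Poisson model with a made-up statewide baseline rate (neither the model nor the numbers come from the slides):

```python
from math import exp, factorial

def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for a Poisson(lam)-distributed count."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

BASELINE_WTR = 1.2   # illustrative: average WtR erasures per answer sheet

def flag_sheet(wtr_count: int, alpha: float = 1e-6) -> tuple[bool, float]:
    """Flag an answer sheet whose WtR erasure count is an extreme outlier."""
    p = poisson_upper_tail(wtr_count, BASELINE_WTR)
    return p < alpha, p

for count in (2, 8, 15):
    flagged, p = flag_sheet(count)
    print(f"{count:>2} WtR erasures: p = {p:.2e}, flagged = {flagged}")
```

Tampering tends to concentrate many wrong-to-right changes on a single document, which is why the upper tail of the WtR count, rather than the total erasure count, carries most of the signal.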
Unusual Gains/Losses
 Predict each student’s score using prior-year information
 Measure large score increases/decreases against the predicted score (sketch below)
 Which score truly reflects the student’s actual ability or competence?
 Extreme gains/losses may result from:
   Pre-knowledge, i.e., “Drill It and Kill It”
   Coaching
   Student development, e.g., improved visual acuity
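A minimal sketch of the gains idea, assuming current scores are predicted from prior-year scores with a simple least-squares fit and extreme residuals are flagged (the scores and the 2.5-SD cutoff are invented for illustration):

```python
from statistics import mean, pstdev

# Illustrative (prior-year score, current score) pairs for a group of students
pairs = [(290, 305), (310, 322), (275, 281), (330, 341), (300, 312),
         (285, 296), (320, 333), (295, 302), (305, 398), (315, 324)]

xs = [prior for prior, _ in pairs]
x_bar = mean(xs)
y_bar = mean(current for _, current in pairs)

# Ordinary least squares: current ~ a + b * prior
b = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residual = actual gain/loss relative to the predicted score
residuals = [y - (a + b * x) for x, y in pairs]
sd = pstdev(residuals)
for (x, y), r in zip(pairs, residuals):
    if abs(r) / sd > 2.5:   # illustrative cutoff only
        print(f"prior {x} -> current {y}: {r:+.1f} points vs. prediction ({r / sd:+.1f} SD)")
```

Only the student who jumped from 305 to 398 is flagged here; whether that jump reflects genuine growth or pre-knowledge and coaching is exactly the question the slide poses.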
Spring FCAT Data Forensics
 Focus on two groups
   Student-level
   School-level
 Utilize VERY conservative thresholds
A quick discussion of conservative thresholds…
 Redskins winning 2011 Super Bowl = 1 in 50
 Chance of being hit by lightning = 1 in a million
 Chance of winning the lottery = 1 in 10 million
 Chance of DNA false-positive = 1 in 30 million
 Chance of two independently taken tests being flagged = 1 in a TRILLION
Student-level Analysis
 Similarity Analysis only
   Most credible
 Chance of tests being so similar, yet taken independently = 1 in a trillion
 Invalidate test scores flagged beyond the 1 in 10¹² threshold (worked note below)
 Fairness and validity of the test instance must be questioned
 Appeals process to be implemented
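A brief worked note on what the threshold means, assuming (the slides do not state this explicitly) that the collusion index is reported as the negative base-10 logarithm of the probability that the observed agreement arose between independently working students:

```python
import math

TRILLION = 1e-12                          # "1 in a trillion"
threshold_index = -math.log10(TRILLION)   # = 12.0: flag at index >= 12 under this convention
print(threshold_index)

# Under the same convention, a collusion index of 22.0 (as in the flagged-examinee
# example that follows) would correspond to roughly 1-in-10**22 odds that the
# agreement arose between students working independently.
print(10.0 ** -22.0)
```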
Example of Flagged Examinees

Examinee   Test Name   Collusion Index   Collusion Cluster   District   School
xxxx1      R           22.0              343                 6          452
xxxx2      R           22.0              343                 6          452
xxxx3      R           21.5              409                 13         7461
xxxx4      R           21.5              409                 13         7461
xxxx5      R           21.4              833                 42         351
xxxx6      R           21.4              833                 42         351
xxxx7      R           21.1              834                 64         3436
xxxx8      R           21.1              834                 64         3436
xxxx9      R           21.1              482                 13         7741
xxx10      R           21.1              482                 13         7741
Example: 9th Grade Math Cluster
 Identifies apparent student collusion
 Definitions (sketch below)
   “Dominant” = the answer selected by the majority of group members on an item
   “Non-dominant” = an answer different from the one selected by the majority of group members
 Example of 2 students that passed, but not independently
   i.e., they didn’t do their own work
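A minimal sketch of the dominant/non-dominant idea: for each item, find the answer chosen by most of a suspected cluster; agreement on non-dominant or incorrect choices is what makes shared work conspicuous. The cluster and answer key below are invented:

```python
from collections import Counter

# Invented answer strings for a suspected cluster (one row per student)
cluster = ["BCADACBDAB",
           "BCADACBDAB",
           "BCADACBDCB",
           "BAADACBDAB"]
key = "BCBDACADAB"        # invented scoring key

# Dominant answer per item = the answer chosen by most of the group
dominant = [Counter(item_answers).most_common(1)[0][0] for item_answers in zip(*cluster)]

for i, (dom, correct) in enumerate(zip(dominant, key), start=1):
    n_match = sum(answers[i - 1] == dom for answers in cluster)
    status = "correct" if dom == correct else "WRONG"
    print(f"item {i:>2}: dominant answer {dom} ({status}), chosen by {n_match}/{len(cluster)}")
```

In this toy cluster the whole group converges on the same wrong answer for items 3 and 7; that kind of shared error pattern, taken to an extreme, is what the similarity analysis flags.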
Impact of “1 in a Trillion” Threshold, Math & Reading 2010

Grade     N           Flagged Students
3rd       408,317     144
4th       394,039     103
5th       390,714     92
6th       387,502     224
7th       393,401     245
8th       387,190     69
9th       401,046     622
10th      360,176     57
Totals    3,122,385   1,556
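As a quick check on the table above, the flag rates implied by those counts can be computed directly (the grade sizes and flag counts are exactly the figures from the slide):

```python
flagged = {"3rd": (408_317, 144), "4th": (394_039, 103), "5th": (390_714, 92),
           "6th": (387_502, 224), "7th": (393_401, 245), "8th": (387_190, 69),
           "9th": (401_046, 622), "10th": (360_176, 57)}

total_n = sum(n for n, _ in flagged.values())
total_f = sum(f for _, f in flagged.values())

for grade, (n, f) in flagged.items():
    print(f"{grade:>6}: {f:>4} of {n:,} flagged ({100 * f / n:.3f}%)")
print(f"Totals: {total_f:,} of {total_n:,} flagged ({100 * total_f / total_n:.3f}%)")
```

Even at the very conservative 1-in-a-trillion threshold, roughly 0.05% of students statewide were flagged, with 9th grade noticeably above the other grades.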
School-Level Analysis
 Similarity, Gains, and Erasures (combination sketch below)
 Flagged schools conduct an internal review
 Extreme instances may prompt formal investigations and sanctions
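How the school-level statistics feed a review decision is not spelled out on these slides; a minimal sketch, assuming each school gets a flag rate per statistic (flags per 100 tests) and is referred for internal review when any rate exceeds an illustrative cutoff (all names, rates, and cutoffs below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SchoolRates:
    """Per-school flag rates (flags per 100 tests); field names are hypothetical."""
    school_id: str
    similarity_rate: float
    erasure_rate: float
    gain_rate: float

# Illustrative cutoffs, not the operational thresholds
CUTOFFS = {"similarity_rate": 10.0, "erasure_rate": 5.0, "gain_rate": 5.0}

def review_needed(school: SchoolRates) -> list[str]:
    """Return the statistics on which this school exceeds its cutoff."""
    return [name for name, cutoff in CUTOFFS.items()
            if getattr(school, name) > cutoff]

schools = [SchoolRates("xxxx", 32.7, 0.6, 1.2),
           SchoolRates("yyyy", 2.1, 7.4, 0.3),
           SchoolRates("zzzz", 1.0, 0.2, 0.4)]

for school in schools:
    hits = review_needed(school)
    if hits:
        print(f"school {school.school_id}: internal review on {', '.join(hits)}")
```

At the student level the slides rely on the similarity statistic alone; the erasure and gains statistics are brought in only at this aggregated, school-rate level.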
[School-level report example: 11 flagged schools, with columns M4 Similarity Rate Index, Erasures Rate Index, Gains Rate Index, Incident Rate, Overall Index, Mean Score, Pass Index, Pass Rate, Subject, Number of Tests, and District-School (school identifiers masked)]
Benefits of Conservative Thresholds
 Focus on the most egregious instances
 Provides results that are
   Explainable
   Defensible
 Can move to different thresholds later
 Easier to manage
 Walk before we run
Program Results
 Monitored behavior improves
 Invalidations deter cheating
[Chart: proportion of detected tests, Spring 2006 through Spring 2010; y-axis 0 to 0.07]
Summary
 Goal: fair and valid testing for all students
 DOE to conduct Data Forensics on FCAT test data
 Focus on
   Individual students: extremely similar tests
   Schools: Similarity, Gains, and Erasures
Follow-Up Questions?
[email protected]