Overview of Caveon Data Forensics
Steve Addicott
Dennis Maynes
March 2011

Outline of Presentation
- Overview of the Caveon Data Forensics (DF) process
- FDOE DF goals
- DF Tools & Methods
- Spring 2011 FCAT DF program
  - Conservative thresholds
  - Students & Schools
- Summary
- Q&A

Caveon Test Security… "The 3 Ps"
- Proven: best in class, serving the largest test programs in the world
- Practical: identify unusual test taking and respond appropriately
- Protection: integrity of exams, DOE reputation, students' opportunities

Caveon Data Forensics™ Process
- Analyzes test data
- First builds a "model" of typical question responses
- Then identifies unusual behaviors with the potential of unfair advantage

Caveon Data Forensics Process (cont.): Examples of "Unusual" Behavior
- Very high agreement among pairs or groups of test takers
- A very unusual number of erasures, particularly wrong to right
- Very substantial gains or losses from one test occasion to another

Overview of the Use of Data Forensics
- Many high-stakes testing programs now use Data Forensics
- Reflected in testing standards, e.g., CCSSO's "Operational Best Practices for State Assessment Programs"
- Essential to act on the results

FDOE Data Forensics Goals
- Uphold the fairness and validity of test results
- Identify risks and irregularities
- Take action based on data and analysis ("measure and manage")
- Communicate zero tolerance for cheating

Testing Examiner's Role
- Ensure (and then certify) that the test administration is fair and proper
- Declare scores invalid when fairness and validity are negatively impacted
- The decision depends on fairness and validity, not on whether an individual cheated

Forensic Tools and Methods
- Similarity: answer copying, collusion
- Erasures: tampering
- Gains: pre-knowledge, coaching
- Aberrance: tampering, exposure
- Identical tests: collusion
- Perfect tests: answer-key loss

Similarity
- Our most powerful and "credible" statistic
- Measures the degree of similarity between two or more test instances
- Analyzes each test instance against all other test instances in the testing program
- Probable causes of extremely high similarity: answer copying, test coaching, proxy test taking, collusion

Erasures
- Based on estimated answer-changing rates: wrong-to-right and anything-to-wrong
- Finds answer sheets with an unusual number of wrong-to-right (WtR) changes
- Extreme statistical outliers could involve tampering, "panic cheating," etc.

Unusual Gains/Losses
- Predict each score using prior-year information
- Measure large score increases or decreases against the predicted score
- Which score truly reflects the student's actual ability or competence?
- Extreme gains or losses may result from pre-knowledge (i.e., "drill it and kill it"), coaching, or student development (e.g., visual acuity)

Spring FCAT Data Forensics
- Focus on two groups: student-level and school-level
- Utilize VERY conservative thresholds
- A quick discussion of conservative thresholds follows the similarity sketch below
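Before turning to the threshold comparisons, the sketch below makes the similarity idea concrete: it scores a pair of answer strings by the probability, under an independence assumption, of matching on at least the observed number of items, using per-item option frequencies estimated from the whole group. This is only an illustration of the general pair-matching concept; it is not Caveon's proprietary similarity statistic, and the function names, input format, independence model, and reuse of the 1e-12 cutoff are assumptions made for the example.

```python
from itertools import combinations

def item_match_probs(responses):
    """Estimate, for each item, the probability that two independently
    working examinees select the same option, using the observed option
    frequencies across all answer strings in `responses`."""
    n_items = len(responses[0])
    probs = []
    for i in range(n_items):
        options = [r[i] for r in responses]
        freqs = {opt: options.count(opt) / len(options) for opt in set(options)}
        probs.append(sum(p * p for p in freqs.values()))
    return probs

def match_tail_probability(a, b, p_match):
    """P(two independent examinees match on at least as many items as the
    pair (a, b) did), via a Poisson-binomial dynamic program."""
    observed = sum(x == y for x, y in zip(a, b))
    dist = [1.0] + [0.0] * len(p_match)      # dist[k] = P(exactly k matches so far)
    for p in p_match:
        for k in range(len(p_match), 0, -1):
            dist[k] = dist[k] * (1 - p) + dist[k - 1] * p
        dist[0] *= 1 - p
    return sum(dist[observed:])

def flag_similar_pairs(responses, threshold=1e-12):
    """Return (i, j, probability) for pairs whose tail probability falls
    below the 'one in a trillion'-style cutoff (cutoff chosen for illustration)."""
    p_match = item_match_probs(responses)
    flagged = []
    for i, j in combinations(range(len(responses)), 2):
        p = match_tail_probability(responses[i], responses[j], p_match)
        if p < threshold:
            flagged.append((i, j, p))
    return flagged
```

Operational similarity statistics typically also condition on examinee ability and give extra weight to identical wrong answers; the sketch above deliberately uses only the raw count of matching responses.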
Conservative Thresholds in Perspective
- Redskins winning the 2011 Super Bowl: 1 in 50
- Chance of being hit by lightning: 1 in a million
- Chance of winning the lottery: 1 in 10 million
- Chance of a DNA false positive: 1 in 30 million
- Chance of flagged tests being this similar yet taken independently: 1 in a TRILLION

Student-Level Analysis
- Similarity analysis only (our most credible statistic)
- Chance of tests being so similar yet taken independently: 1 in a trillion
- Invalidate test scores beyond the 1-in-10^12 threshold
- The fairness and validity of such a test instance must be questioned
- Appeals process to be implemented

Example of Flagged Examinees

  Examinee  Test Name  Collusion Index  Collusion Cluster  District  School
  xxxx1     R          22.0             343                6         452
  xxxx2     R          22.0             343                6         452
  xxxx3     R          21.5             409                13        7461
  xxxx4     R          21.5             409                13        7461
  xxxx5     R          21.4             833                42        351
  xxxx6     R          21.4             833                42        351
  xxxx7     R          21.1             834                64        3436
  xxxx8     R          21.1             834                64        3436
  xxxx9     R          21.1             482                13        7741
  xxx10     R          21.1             482                13        7741

Example: 9th Grade Math Cluster
- Identifies apparent student collusion
- Definitions:
  - "Dominant" = the answer selected by the majority of the group's members
  - "Non-dominant" = an answer other than the one selected by the majority
- Example of two students who passed, but not independently, i.e., they did not do their own work

Impact of the "1 in a Trillion" Threshold (Math & Reading, 2010)

  Grade   N          Flagged Students
  3rd     408,317    144
  4th     394,039    103
  5th     390,714    92
  6th     387,502    224
  7th     393,401    245
  8th     387,190    69
  9th     401,046    622
  10th    360,176    57
  Total   3,122,385  1,556

School-Level Analysis
- Similarity, Gains, and Erasures
- Flagged schools conduct an internal review
- Extreme instances may prompt formal investigations and sanctions
- (A simplified erasure-outlier sketch appears at the end of this outline)

[Example table of flagged schools showing, for each district-school and subject, the number of tests, mean score, pass rate, pass index, M4 similarity rate index, erasures rate index, gains rate index, incident rate, and overall index.]

Benefits of Conservative Thresholds
- Focus on the most egregious instances
- Provides results that are explainable and defensible
- Can move to different thresholds later
- Easier to manage ("walk before we run")

Program Results
- [Chart: proportion of detected tests by administration, Spring 2006 through Spring 2010]
- Monitored behavior improves
- Invalidations deter cheating

Summary
- Goal: fair and valid testing for all students
- DOE to conduct Data Forensics on FCAT test data
- Focus on:
  - Individual students: extremely similar tests
  - Schools: Similarity, Gains, and Erasures
- Follow up

Questions?
[email protected]
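The school-level screens above combine similarity, gains, and erasure indices. As a rough illustration of the erasure component only, the sketch below flags schools whose mean wrong-to-right (WtR) erasure count is an extreme high outlier relative to the statewide distribution. The input format, the plain z-score model, and the cutoff of six standard deviations are assumptions made for this example; this is not the operational rate-index calculation.

```python
from statistics import mean, pstdev

def flag_schools_by_wtr(wtr_by_school, z_cutoff=6.0):
    """wtr_by_school: dict of school id -> list of per-answer-sheet
    wrong-to-right erasure counts. Returns (school, school_mean, z) tuples
    for schools whose mean WtR count is an extreme high outlier, sorted by z."""
    school_means = {s: mean(counts) for s, counts in wtr_by_school.items() if counts}
    mu = mean(school_means.values())
    sigma = pstdev(school_means.values()) or 1.0   # guard against zero spread
    flagged = []
    for school, m in school_means.items():
        z = (m - mu) / sigma
        if z >= z_cutoff:                          # illustrative cutoff only
            flagged.append((school, m, z))
    return sorted(flagged, key=lambda t: -t[2])
```

In practice a screen like this would be one input among several (alongside similarity and gains indices) rather than a stand-alone trigger for sanctions.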