
EEF Evaluators’ Conference, 25th June 2015
Session 1: Interpretation / impact

Rethinking the EEF Padlocks
Calum Davey
Education Endowment Foundation
Overview
→ Background
→ Problems
   – Attrition
   – Power/chance
   – Testing
→ Proposal
→ Discussion
Background
• Summary of the security of evaluation findings
• ‘Padlocks’ developed in consultation with evaluators

Group                 | Number of pupils | Effect size       | Estimated months’ progress | Evidence strength
Literacy intervention | 550              | 0.10 (0.03, 0.18) | +2                         | [padlock rating]
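For readers less familiar with the summary format, the sketch below shows one common way a figure like “0.10 (0.03, 0.18)” could be produced: a standardised mean difference with a large-sample 95% confidence interval. The group summaries are invented, and real EEF analyses typically adjust for baseline attainment, so this is an illustration of the quantity rather than the analysis behind the table.

```python
# Illustrative only: a standardised mean difference ("effect size") with an
# approximate 95% CI, of the kind summarised in the table above. The group
# means, SDs and sample sizes here are invented.
from math import sqrt

def effect_size_ci(m_t, sd_t, n_t, m_c, sd_c, n_c):
    """Return (d, lower, upper): standardised mean difference and 95% CI."""
    sd_pooled = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (m_t - m_c) / sd_pooled
    se = sqrt((n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c)))  # large-sample SE
    return d, d - 1.96 * se, d + 1.96 * se

d, lo, hi = effect_size_ci(m_t=101.0, sd_t=10.0, n_t=275, m_c=100.0, sd_c=10.0, n_c=275)
print(f"{d:.2f} ({lo:.2f}, {hi:.2f})")
```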
• Five categories – combined to create an overall rating (see the sketch below the table):

Rating | 1. Design                                     | 2. Power (MDES) | 3. Attrition | 4. Balance                   | 5. Threats to validity
5      | Fair and clear experimental design (RCT)      | < 0.2           | < 10%        | Well-balanced on observables | No threats to validity
4      | Fair and clear experimental design (RCT, RDD) | < 0.3           | < 20%        |                              |
3      | Well-matched comparison (quasi-experiment)    | < 0.4           | < 30%        |                              |
2      | Matched comparison (quasi-experiment)         | < 0.5           | < 40%        |                              |
1      | Comparison group with poor or no matching     | < 0.6           | < 50%        |                              |
0      | No comparator                                 | > 0.6           | > 50%        | Imbalanced on observables    | Significant threats
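The slide does not say exactly how the five category scores are folded into the headline padlock figure, so the sketch below simply assumes a ‘weakest link’ rule in which the overall rating is the lowest of the five category scores. Both the rule and the function are illustrative assumptions, not the EEF procedure.

```python
# Illustrative assumption only: the slide does not state the combination rule,
# so this sketch treats the overall padlock rating as the lowest ("weakest
# link") of the five 0-5 category scores.

def overall_padlocks(design, power, attrition, balance, threats):
    """Combine five 0-5 category scores into a single 0-5 padlock rating."""
    scores = [design, power, attrition, balance, threats]
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each category score must be between 0 and 5")
    return min(scores)

# Example: an RCT (5) with MDES < 0.3 (4), attrition < 20% (4), good balance (5)
# and no validity threats (5) would receive 4 padlocks under this assumed rule.
print(overall_padlocks(design=5, power=4, attrition=4, balance=5, threats=5))  # 4
```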
Background
[Bar chart: distribution of overall padlock ratings (0–5) across n=37 evaluations; counts of roughly 13, 5, 7, 6, 4 and 2 evaluations at 0 to 5 padlocks respectively]
Note: count does not include pilots, which often don’t get a security rating
Oxford Improving Numeracy and Literacy
[Padlock criteria table repeated for this example evaluation]
Act, Sing, Play
[Padlock criteria table repeated for this example evaluation]
Team Alphie
[Padlock criteria table repeated for this example evaluation]
Problems: power
• MDES is calculated at baseline
• MDES changes over the course of the trial
• Confusion with p-values and CIs (see the sketch below):
   – Effect bigger than MDES!
     e.g. Calderdale: ES = 0.74, MDES < 0.5
   – p-value < 0.05!
     e.g. Butterfly Phonics: ES = 0.43, p < 0.05, MDES > 0.5

Rating | 2. Power (MDES)
5      | < 0.2
4      | < 0.3
3      | < 0.4
2      | < 0.5
1      | < 0.6
0      | > 0.6
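One way to see why a baseline MDES and an analysis-stage p-value or confidence interval can appear to disagree is that the MDES is fixed by the planned design, whereas the interval depends on the standard error actually realised. A minimal sketch follows, using the usual normal approximation (MDES ≈ 2.8 × SE for 80% power at a two-sided 5% level); the numbers are invented and are not taken from the Calderdale or Butterfly Phonics reports.

```python
# Sketch of the MDES vs p-value/CI confusion described above. Uses the normal
# approximation MDES ~= (1.96 + 0.84) * SE for 80% power, two-sided alpha=0.05.
# All numbers below are invented for illustration.

def mdes(se, z_alpha=1.96, z_power=0.84):
    """Minimum detectable effect size implied by a given standard error."""
    return (z_alpha + z_power) * se

def ci95(effect, se):
    """Approximate 95% confidence interval for an estimated effect size."""
    return effect - 1.96 * se, effect + 1.96 * se

planned_se = 0.20                                  # implied by the planned sample at baseline
print(f"baseline MDES ~ {mdes(planned_se):.2f}")   # ~0.56

# At analysis the realised SE may differ, so an effect smaller than the
# baseline MDES can still have p < 0.05 ...
print(ci95(effect=0.40, se=0.17))                  # CI excludes zero
# ... while an effect larger than the MDES can still be imprecise.
print(ci95(effect=0.70, se=0.40))                  # CI includes zero
```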
Problems: attrition
• Calculated overall at the level of randomisation
• 10% of pupils are off school each day
• This disadvantages individually-randomised trials (see the sketch below):
   – Act, Sing, Play (pupil-randomised): 0% attrition at school or class level, 10% at pupil level
   – Oxford Science (school-randomised): 3% attrition at school level, 16% at pupil level
• Are the levels right?

Rating | 3. Attrition
5      | < 10%
4      | < 20%
3      | < 30%
2      | < 40%
1      | < 50%
0      | > 50%
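The point about levels can be made concrete with a small sketch: the same trial can report very different headline attrition depending on whether the calculation uses schools, classes or pupils as the unit. The counts below are invented; only the resulting percentages mirror the pattern quoted on the slide.

```python
# Sketch: the headline attrition figure depends on the level at which it is
# calculated. Counts are invented; the slide quotes only the percentages.

def attrition(n_randomised, n_analysed):
    """Proportion of randomised units missing from the analysis."""
    return 1 - n_analysed / n_randomised

# A pupil-randomised trial can lose no schools or classes but still lose pupils:
print(f"school-level attrition: {attrition(20, 20):.0%}")      # 0%
print(f"pupil-level attrition:  {attrition(1000, 900):.0%}")   # 10%

# A school-randomised trial can look better at the level of randomisation
# than at pupil level:
print(f"school-level attrition: {attrition(100, 97):.0%}")     # 3%
print(f"pupil-level attrition:  {attrition(2500, 2100):.0%}")  # 16%
```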
Problems: testing
• Lots of testing administered by teachers
• Teachers rarely blinded to intervention status
• What is the threat to validity when effect sizes are small?

Rating | 5. Threats to validity
5      | No threats to validity
4      |
3      |
2      |
1      |
0      | Significant threats
Potential solution?
• Assess ‘chance’ as well as MDES in the padlock?
• Assess attrition at pupil level for all trials?
• Randomise invigilation of testing to assess bias?
• Number of pupils (number with intervention)
• Confidence interval for months’ progress? (see the sketch below)
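The last bullet could work as sketched below: carry the confidence interval for the effect size through whatever conversion gives the months-progress figure. The linear conversion used here (one month per 0.05 of an effect size) is a hypothetical placeholder chosen only so that the 0.10 estimate from the background slide maps to +2 months; it is not the EEF conversion.

```python
# Sketch: reporting a confidence interval on the months-progress scale.
# The conversion below is a HYPOTHETICAL linear rule (1 month per 0.05 of an
# effect size), used only to show the mechanics; it is not the EEF conversion.

MONTHS_PER_EFFECT_SIZE_UNIT = 1 / 0.05   # hypothetical conversion factor

def months_progress(effect_size):
    return effect_size * MONTHS_PER_EFFECT_SIZE_UNIT

# Effect size and 95% CI from the literacy example on the background slide.
estimate, lower, upper = 0.10, 0.03, 0.18
print(f"+{months_progress(estimate):.0f} months "
      f"(95% CI +{months_progress(lower):.1f} to +{months_progress(upper):.1f})")
# -> +2 months (95% CI +0.6 to +3.6)
```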
Discussion
• Could p-values, confidence intervals, power, sample size, etc. be combined into a measure of ‘chance’?
• What are the advantages and disadvantages of reporting confidence intervals alongside the security rating?
• Is it right to include all attrition in the security rating? What potential disadvantages are there?
• What is the most appropriate way to ensure unbiased testing? Would it be possible to conduct a trial across evaluations?