EEF Evaluators’ Conference, 25th June 2015
Session 1: Interpretation / impact

Rethinking the EEF Padlocks
Calum Davey, Education Endowment Foundation

Overview
→ Background
→ Problems
→ Attrition
→ Power/chance
→ Testing
→ Proposal
→ Discussion

Background
• Summary of the security of evaluation findings
• ‘Padlocks’ developed in consultation with evaluators

Example report summary:
Group: Literacy intervention; Number of pupils: 550; Effect size: 0.10 (0.03, 0.18); Estimated months’ progress: +2; Evidence strength: [padlock rating]

• Five categories – combined to create the overall rating: 1. Design, 2. Power (MDES), 3. Attrition, 4. Balance, 5. Threats to validity

Rating thresholds:
5 padlocks – Design: fair and clear experimental design (RCT); MDES < 0.2; attrition < 10%; well-balanced on observables; no threats to validity
4 padlocks – Design: fair and clear experimental design (RCT, RDD); MDES < 0.3; attrition < 20%
3 padlocks – Design: well-matched comparison (quasi-experiment); MDES < 0.4; attrition < 30%
2 padlocks – Design: matched comparison (quasi-experiment); MDES < 0.5; attrition < 40%
1 padlock – Design: comparison group with poor or no matching; MDES < 0.6; attrition < 50%
0 padlocks – Design: no comparator; MDES > 0.6; attrition > 50%; imbalanced on observables; significant threats to validity

[Bar chart: number of evaluations (n = 37) by number of padlocks awarded, 0–5. Note: count does not include pilots, which often don’t get a security rating.]

Worked example slides (each trial mapped onto the rating table above): Oxford Improving Numeracy and Literacy; Act, Sing, Play; Team Alphie.

Problems: power
• MDES at baseline
• MDES changes
• Confusion with p-values and CIs:
– Effect bigger than MDES!
• E.g. Calderdale: ES = 0.74, MDES < 0.5
– P-value < 0.05!
• E.g. Butterfly Phonics: ES = 0.43, p < 0.05, MDES > 0.5

Rating – 2. Power (MDES): 5 padlocks < 0.2; 4 < 0.3; 3 < 0.4; 2 < 0.5; 1 < 0.6; 0 > 0.6
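The following is a minimal sketch of the two quantities this slide contrasts. It assumes the standard normal-approximation MDES formula for a simple two-arm, individually randomised trial (alpha = 0.05, 80% power) and ignores clustering and covariate adjustment, both of which matter in real EEF trials; the threshold bands are copied from the “2. Power (MDES)” column above, while the function names and the 550-pupil example are illustrative and not taken from the slides.

```python
# Minimal sketch, not the EEF's calculation: a textbook MDES approximation for a
# two-arm individually randomised trial, plus the padlock bands from the
# "2. Power (MDES)" column of the rating table above.
from math import sqrt
from scipy.stats import norm

def naive_mdes(n_treatment, n_control, alpha=0.05, power=0.80):
    """Minimum detectable effect size (in SD units), ignoring clustering/covariates."""
    z_alpha = norm.ppf(1 - alpha / 2)           # ~1.96 for alpha = 0.05
    z_power = norm.ppf(power)                   # ~0.84 for 80% power
    se = sqrt(1 / n_treatment + 1 / n_control)  # SE of a standardised mean difference
    return (z_alpha + z_power) * se

def power_padlocks(mdes):
    """Map an MDES to the '2. Power (MDES)' padlock band."""
    for padlocks, threshold in [(5, 0.2), (4, 0.3), (3, 0.4), (2, 0.5), (1, 0.6)]:
        if mdes < threshold:
            return padlocks
    return 0

if __name__ == "__main__":
    mdes = naive_mdes(n_treatment=275, n_control=275)  # hypothetical 550-pupil trial
    print(f"Naive MDES: {mdes:.2f} -> {power_padlocks(mdes)} padlocks on the power category")
```

The separation of the two functions mirrors the slide’s point: the observed effect size and its p-value are outputs of the analysis, whereas the MDES (and hence the power padlock band) is fixed by the design, so a result can be statistically significant, or larger than the MDES, while the power category still scores poorly, as in the Butterfly Phonics example.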
Problems: attrition
• Calculated overall at the level of randomisation
• 10% of pupils are off school each day
• Disadvantages individually-randomised trials:
– Act, Sing, Play (pupil-randomised): 0% attrition at school or class level, 10% at pupil level
– Oxford Science (school-randomised): 3% attrition at school level, 16% at pupil level
• Are the levels right? (a sketch of level-specific attrition follows the Discussion questions below)

Rating – 3. Attrition: 5 padlocks < 10%; 4 < 20%; 3 < 30%; 2 < 40%; 1 < 50%; 0 > 50%

Problems: testing
• Lots of testing administered by teachers
• Teachers are rarely blinded to intervention status
• What is the threat to validity when effect sizes are small?

Rating – 5. Threats to validity: ranges from 5 padlocks (no threats to validity) down to 0 (significant threats)

Potential solution?
• Assess ‘chance’ as well as MDES in the padlock?
• Assess attrition at pupil level for all trials?
• Randomise invigilation of testing to assess bias?
• Number of pupils (number with intervention)
• Confidence interval for months’ progress?

Discussion
• Can p-values, confidence intervals, power, sample size, etc. be combined into a measure of ‘chance’?
• What are the advantages and disadvantages of reporting confidence intervals alongside the security rating?
• Is it right to include all attrition in the security rating? What potential disadvantages are there?
• What is the most appropriate way to ensure unbiased testing? Would it be possible to conduct a trial across evaluations?
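This is the sketch referred to under ‘Problems: attrition’: a minimal illustration, with invented counts, of how the same trial can fall into different “3. Attrition” bands depending on whether loss is counted at the unit of randomisation or at pupil level. Only the threshold bands come from the rating table; the function names and the school/pupil numbers are hypothetical (chosen to reproduce the 3% and 16% figures quoted for Oxford Science).

```python
# Minimal sketch with invented counts; only the threshold bands are taken from
# the "3. Attrition" column of the rating table above.
def attrition(randomised, analysed):
    """Proportion of randomised units lost by the point of analysis."""
    return 1 - analysed / randomised

def attrition_padlocks(rate):
    """Map an attrition proportion to the '3. Attrition' padlock band."""
    for padlocks, threshold in [(5, 0.10), (4, 0.20), (3, 0.30), (2, 0.40), (1, 0.50)]:
        if rate < threshold:
            return padlocks
    return 0

if __name__ == "__main__":
    # Hypothetical school-randomised trial: few whole schools drop out,
    # but more pupils are missing from testing within schools.
    school_level = attrition(randomised=100, analysed=97)     # 3% at school level
    pupil_level = attrition(randomised=3000, analysed=2520)   # 16% at pupil level
    print(f"School level: {school_level:.0%} -> {attrition_padlocks(school_level)} padlocks")
    print(f"Pupil level: {pupil_level:.0%} -> {attrition_padlocks(pupil_level)} padlocks")
```

Scoring everything at pupil level, as the ‘Potential solution?’ slide suggests, would put pupil-randomised trials such as Act, Sing, Play and cluster-randomised trials such as Oxford Science on the same footing.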