Sampling and Sample Size
Cally Ardington
Part 2

Lecture Outline
• Standard deviation and standard error
• Detecting impact: background, hypothesis testing, power, the ingredients of power

Case 2: Remedial Education in India
• We implement the Balsakhi Program
• Evaluating the Balsakhi Program: incorporating random assignment into the program
• Post-test: control & treatment

Is this impact statistically significant?
[Figure: distributions of post-test scores (0 to 100) for the control and treatment groups, with the control and treatment means marked]
A. Yes
B. No
C. Don't know

Lecture Outline
• Standard deviation and standard error
• Detecting impact: background, hypothesis testing, power, the ingredients of power

Hypothesis Testing
• The Law of Large Numbers and the Central Limit Theorem allow us to do hypothesis testing to determine whether our findings are statistically significant

Hypothesis Testing
• In criminal law, most institutions follow the rule: "innocent until proven guilty"
• The presumption is that the accused is innocent and the burden is on the prosecutor to show guilt
• The jury or judge starts with the "null hypothesis" that the accused person is innocent
• The prosecutor has a hypothesis that the accused person is guilty

Hypothesis Testing
• In program evaluation, instead of "presumption of innocence," the rule is "presumption of insignificance"
• The "null hypothesis" (H0) is that the program had no (zero) impact
• The burden of proof is on the evaluator to show a significant difference

Hypothesis Testing: Conclusions
• If it is very unlikely (less than a 5% probability) that the difference is solely due to chance, we "reject our null hypothesis"
• We may now say: "our program has a statistically significant impact"

Type I and II errors
• The truth is that the program is effective, but you conclude it has no effect: Type II error
• The truth is that the program has no effect, but you conclude it is effective: Type I error

What is the significance level?
• Type I error: rejecting the null hypothesis even though it is true (false positive)
• Significance level: the probability that we will reject the null hypothesis even though it is true

Theoretical Sampling Distribution
[Figure: sampling distribution of the control-group estimate under H0, centred at zero]

95% Confidence Interval
[Figure: the same distribution, with the interval from 1.96 SD below zero to 1.96 SD above zero marked]

Impose Significance Level of 5%
[Figure: estimates falling more than 1.96 SD from zero lead us to reject H0]
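To make the hypothesis test concrete, here is a minimal sketch in Python. All numbers (group means, standard deviation, sample sizes) are made up for illustration and are not the Balsakhi data; the test shown is a simple two-sample z-test at the 5% level.

    import numpy as np

    rng = np.random.default_rng(0)
    control = rng.normal(loc=50, scale=20, size=500)    # hypothetical control post-test scores
    treatment = rng.normal(loc=55, scale=20, size=500)  # hypothetical treatment post-test scores

    # Estimated impact and its standard error
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
    z = diff / se

    # Reject the null hypothesis of zero impact if the estimate is more than
    # 1.96 standard errors away from zero (the 5% significance level).
    print(f"difference = {diff:.2f}, SE = {se:.2f}, z = {z:.2f}")
    print("statistically significant at 5%" if abs(z) > 1.96 else "cannot reject the null hypothesis")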
Lecture Outline
• Standard deviation and standard error
• Detecting impact: background, hypothesis testing, power, the ingredients of power

What is Power?
• Type II error: failing to reject the null hypothesis (concluding there is no difference) when in fact the null hypothesis is false
• Power: if there is a measurable effect of our intervention (the null hypothesis is false), the probability that we will detect the effect (reject the null hypothesis)
• Power = 1 − probability of a Type II error

[Figure: sampling distributions under the null hypothesis H0 (control) and under the alternative Hβ (treatment)]

Impose significance level of 5%
[Figure: the two distributions with the 5% significance threshold marked]
• Anything between the lines cannot be distinguished from 0

Can we distinguish Hβ from H0?
[Figure: the shaded area under Hβ beyond the significance threshold shows the percentage of the time we would detect an effect if Hβ were true, i.e. the power]

Type I and II errors
• The truth is that the program is effective, but you conclude it has no effect: Type II error (probability = 1 − power)
• The truth is that the program has no effect, but you conclude it is effective: Type I error (probability = significance level)

Before the experiment
• Assume two effects: no effect (H0) and treatment effect β (Hβ)
[Figure: sampling distributions under H0 and Hβ]

Impose significance level of 5%
[Figure: the two distributions with the 5% significance threshold marked]
• Anything between the lines cannot be distinguished from 0

Can we distinguish Hβ from H0?
[Figure: the shaded area shows the power, the percentage of the time we would detect an effect if Hβ were true]

What influences power?
• What are the factors that change the proportion of the research-hypothesis distribution that is shaded, i.e. the proportion that falls to the right (or left) of the null hypothesis curve?
• Understanding this helps us design more powerful experiments

Lecture Outline
• Standard deviation and standard error
• Detecting impact: background, hypothesis testing, power, the ingredients of power

Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering

Effect Size: 1*SD
• The hypothesised effect size determines the distance between the means
[Figure: H0 and Hβ distributions 1 standard deviation apart]

Effect size = 1*SD: Power = 26%
[Figure: with the 5% significance threshold imposed, the shaded share of the Hβ distribution is 26%]

If the true impact was 1*SD…
• The null hypothesis would be rejected only 26% of the time

Effect Size: 3*SD
[Figure: H0 and Hβ distributions 3 standard deviations apart]
• A bigger hypothesised effect size puts the distributions farther apart

Effect size = 3*SD: Power = 91%
[Figure: the shaded share of the Hβ distribution is 91%]
• A bigger effect size means more power
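A small sketch of the effect-size and power relationship. It assumes equal-sized treatment and control groups, a two-sided test at the 5% level, and the usual normal approximation; the 26% and 91% figures on the slides are stylised illustrations, while the numbers below use a hypothetical 50 observations per group.

    from scipy.stats import norm

    def approx_power(delta, n_per_group, alpha=0.05):
        """Approximate power for detecting a standardized effect size delta."""
        se = (2 / n_per_group) ** 0.5              # SE of the difference in means, in SD units
        return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

    for delta in (0.2, 0.5, 1.0):
        print(delta, round(approx_power(delta, n_per_group=50), 2))   # power rises with effect size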
What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost effective
B. Largest effect size you expect your program to produce
C. Both
D. Neither

Picking an effect size
• What is the smallest effect that would justify adopting the program?
• The cost of this program vs. the benefits it brings
• The cost of this program vs. the alternative use of the money
• If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero
• In contrast, any effect larger than that would justify adopting this program: we want to be able to distinguish it from zero

Effect size and take-up
• Let's say we believe the impact on our participants is "3"
• What happens if take-up is 1/3?
• Let's show this graphically

Effect Size: 3*SD
[Figure: H0 and Hβ distributions 3 standard deviations apart]
• Let's say we believe the impact on our participants is "3"

Take-up is 33%, so the effect size is one third
[Figure: the average effect across the whole treatment group is only 1 standard deviation, so the distributions are 1 SD apart]

Back to: Power = 26%
[Figure: only 26% of the Hβ distribution lies beyond the significance threshold]
• Take-up is reflected in the effect size

Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering

By increasing sample size you increase…
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don't know

Power: Effect size = 1 SD, Sample size = 4
[Figure: the sampling distributions narrow as the sample grows]
• Power: 64%

Power: Effect size = 1 SD, Sample size = 9
[Figure: the sampling distributions narrow further]
• Power: 91%

Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering

Variance
• How large an effect you can detect with a given sample depends on how variable the outcome is
• Example: if all children have very similar learning levels without the program, a very small impact will be easy to detect
• We can try to "absorb" variance: using a baseline, controlling for other variables (a sketch below shows how much a baseline can help)
• In practice, controlling for other variables (besides the baseline outcome) buys you very little

[Figure: three histograms with means 50 and 60: low standard deviation, medium standard deviation (less precision), high standard deviation (even less precise)]
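As referenced above, a sketch of how much "absorbing" variance with a baseline can help. It assumes the baseline outcome has correlation r with the endline outcome, so controlling for it leaves residual variance (1 − r²)·σ², and the required sample size scales with the residual variance; the correlations used are hypothetical.

    def sample_size_multiplier(r):
        """Required-sample multiplier after controlling for a baseline with correlation r."""
        return 1 - r ** 2

    for r in (0.0, 0.5, 0.8):
        print(r, sample_size_multiplier(r))   # 1.0, 0.75, 0.36: a strong baseline buys a lot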
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering

Sample split: 50% C, 50% T
[Figure: H0 and Hβ sampling distributions with the significance threshold and the shaded power region]
• Power: 91%

If it's not a 50-50 split?
• What happens to the relative width ("fatness") of the two sampling distributions if the split is not 50-50, say 25-75?

Sample split: 25% C, 75% T
[Figure: the control-group sampling distribution becomes wider and the shaded power region shrinks]
• Power: 83%

Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering

Clustered design: definition
• In sampling: when clusters of individuals (e.g. schools, communities) are randomly selected from the population, before selecting individuals for observation
• In randomized evaluation: when clusters of individuals are randomly assigned to different treatment groups

Reasons for adopting cluster randomization
• Need to minimize or remove contamination. Example: in the deworming program, the school was chosen as the unit because worms are contagious
• Basic feasibility considerations. Example: the PROGRESA program would not have been politically feasible if some families had been included and not others
• It is the only natural choice. Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)

Clustered design: intuition
• You want to know how close the upcoming national elections will be
• Method 1: randomly select 50 people from the entire Indian population
• Method 2: randomly select 5 families, and ask ten members of each family their opinion

[Figure: illustrations of low and high intra-cluster correlation (ICC), also written ρ (rho)]

All uneducated people live in one village. People with only primary education live in another. College graduates live in a third, etc. The ICC (ρ) on education will be…
A. High
B. Low
C. No effect on rho
D. Don't know

Clustered Design: Intuition
• The outcomes within a family are likely correlated; similarly for children within a school, families within a village, etc.
• Each additional individual does not bring entirely new information
• At the limit, imagine all outcomes within a cluster are exactly the same: the effective sample size is the number of clusters, not the number of individuals
• Precision will depend on the number of clusters, the sample size within clusters, and the within-cluster correlation

If the ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample
B. Include more people in clusters
C. Both
D. Don't know

Standardized Effect Sizes
• The standardized effect size is the effect size divided by the standard deviation of the outcome
• δ = effect size / standard deviation

Effect size   Considered   Meaning                                                         Required N under 50% treatment
0.2           Modest       Average treated person outperforms the 58th percentile of the control group   786
0.5           Large        Average treated person outperforms the 69th percentile of the control group   126
0.8           VERY large   Average treated person outperforms the 79th percentile of the control group   50
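A sketch reproducing the "Required N" column in the table above. It assumes a 50/50 split, a two-sided 5% significance level, and 80% power; these are conventional values that the slide does not state explicitly.

    from math import ceil
    from scipy.stats import norm

    def required_total_n(delta, alpha=0.05, power=0.80):
        """Total sample size (treatment + control) to detect a standardized effect delta."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        n_per_group = 2 * (z_alpha + z_power) ** 2 / delta ** 2
        return 2 * ceil(n_per_group)

    for delta in (0.2, 0.5, 0.8):
        print(delta, required_total_n(delta))   # roughly 786, 126, 50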
Conclusion
• Even with a perfectly valid experiment, the ability to make inferences depends on the SIZE OF THE SAMPLE
• In designing an evaluation, you need to balance trade-offs to ensure that your sample is large enough, given:
• Desired power and significance levels
• Anticipated effect size
• The amount of "noise" (underlying variance in the outcome variable)
• Treatment-control size ratio (feasibility and cost)
• Take-up of the treatment
• Clustering

The Important Stuff
How confident are we of our results? We have a sample, not the population. The Central Limit Theorem and the Law of Large Numbers tell us important things about the sampling distribution that allow for HYPOTHESIS TESTING.
Hypothesis testing enables us to establish whether our results are statistically significant.
There are two kinds of errors we can make in hypothesis testing:
> Type 1: the intervention is not effective and we find it to be effective. We FIX this at 5%.
> Type 2: the intervention is effective and we find it to have no impact. The smaller the probability of this occurring, the higher our power.
Power can be increased by five things (pulled together in the sketch below):
> Sample size
> The size of the effect
> The proportion of your sample in the treatment group versus the control group
> The variance
> Clustering
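Pulling these ingredients together, here is one approximate power formula as a sketch. It assumes a two-sided 5% test, a normal approximation, equal cluster sizes, and the standard design-effect adjustment 1 + (m − 1)ρ for clustering; none of these formulas appear on the slides, and all example numbers are hypothetical.

    from scipy.stats import norm

    def design_power(total_n, effect_sd, take_up=1.0, prop_treated=0.5,
                     icc=0.0, cluster_size=1, alpha=0.05):
        """Approximate power of a two-sided test for a difference in means."""
        delta = effect_sd * take_up                      # imperfect take-up dilutes the effect size
        deff = 1 + (cluster_size - 1) * icc              # design effect from clustering
        se = (deff / (total_n * prop_treated * (1 - prop_treated))) ** 0.5
        return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

    # 786 children, a 0.2 SD effect, individual randomization: about 80% power
    print(round(design_power(786, 0.2), 2))
    # The same design randomized by classrooms of 30 with ICC = 0.1: power drops sharply
    print(round(design_power(786, 0.2, icc=0.1, cluster_size=30), 2))
    # The same design with only 50% take-up: also a large loss of power
    print(round(design_power(786, 0.2, take_up=0.5), 2))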