Sampling and Sample Size Part 2 Muthoni Ngatia

Sampling and Sample Size
Cally Ardington
Part 2
Lecture Outline
• Standard deviation and standard error
• Detecting impact
   Background
   Hypothesis testing
   Power
   The ingredients of power
We implement the Balsakhi Program
Case 2: Remedial Education in India
Evaluating the Balsakhi Program
Incorporating random assignment into the program
Post-test: control & treatment
Is this impact statistically significant?
[Figure: histogram of post-test scores (0–100) for the control and treatment groups, with each group’s mean (μ) marked.]
A. Yes
B. No
C. Don’t know
(Audience response: roughly 33% each.)
Lecture Outline
• Standard deviation and standard error
• Detecting impact
   Background
   Hypothesis testing
   Power
   The ingredients of power
Hypothesis Testing
• The Law of Large Numbers and the Central Limit Theorem allow us to do hypothesis testing to determine whether our findings are statistically significant
Hypothesis Testing
• In criminal law, most institutions follow the rule: “innocent until
proven guilty”
• The presumption is that the accused is innocent and the burden is on
the prosecutor to show guilt
• The jury or judge starts with the “null hypothesis” that the accused person is
innocent
• The prosecutor has a hypothesis that the accused person is guilty
Hypothesis Testing
• In program evaluation, instead of “presumption of innocence,” the
rule is: “presumption of insignificance”
• The “Null hypothesis” (H0) is that there was no (zero) impact of the
program
• The burden of proof is on the evaluator to show a significant
difference
Hypothesis Testing: Conclusions
• If it is very unlikely (less than a 5% probability) that the difference is
solely due to chance:
• We “reject our null hypothesis”
• We may now say:
• “our program has a statistically significant impact”
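The decision rule above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the effect estimate and standard error below are hypothetical numbers, not results from the Balsakhi evaluation.

```python
from statistics import NormalDist

def two_sided_z_test(diff_in_means, standard_error, alpha=0.05):
    """Return (z statistic, two-sided p-value, reject H0?) for H0: impact = 0."""
    z = diff_in_means / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical: a 5-point test-score gain measured with a standard error of 2.
z, p, reject = two_sided_z_test(5.0, 2.0)
print(round(z, 2), round(p, 4), reject)  # z = 2.5, p ≈ 0.0124 -> reject H0
```

If the p-value is below 5%, we reject the null hypothesis of zero impact; otherwise we cannot distinguish the difference from chance.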
Type I and II errors

                          YOU CONCLUDE
                     Effective        No Effect
THE       Effective  ✓                Type II Error
TRUTH     No Effect  Type I Error     ✓
What is the significance level?
• Type I error: rejecting the null hypothesis even though it is true (false
positive)
• Significance level: The probability that we will reject the null
hypothesis even though it is true
Theoretical Sampling Distribution
[Figure: theoretical sampling distribution of the control-group mean under H0, a normal curve centered at zero.]
95% Confidence Interval
[Figure: the H0 sampling distribution with the 95% confidence interval marked at 1.96 standard deviations either side of zero.]
Impose Significance Level of 5%
[Figure: the H0 sampling distribution with the rejection region shaded beyond 1.96 standard deviations; 5% of the area lies in the tails.]
Lecture Outline
• Standard deviation and standard error
• Detecting impact
   Background
   Hypothesis testing
   Power
   The ingredients of power
What is Power?
• Type II Error: failing to reject the null hypothesis (concluding there is no difference) when in fact the null hypothesis is false
• Power: if there is a measurable effect of our intervention (the null hypothesis is false), the probability that we will detect it (reject the null hypothesis)
• Power = 1 − Probability of Type II Error
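This definition translates directly into a small function. A minimal sketch: the one-sided 5% critical value (≈ 1.645) is an assumption on my part, chosen because it reproduces the power percentages quoted on the later slides; a two-sided test would use 1.96 instead.

```python
from statistics import NormalDist

def power(effect, standard_error, alpha=0.05):
    """P(reject H0 | the true effect), for a one-sided test at level alpha.

    Assumes a normal sampling distribution; with alpha = 0.05 the critical
    value is about 1.645, which matches the slide figures."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect / standard_error)

# An effect of 1 SD measured with a standard error of 1 SD:
print(round(power(1.0, 1.0), 2))  # ≈ 0.26, the 26% power shown later
```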
[Figure: two sampling distributions, the control curve centered at H0 = 0 and the treatment curve centered at Hβ, the hypothesized effect.]
Impose significance level of 5%
[Figure: the H0 and Hβ curves with the 5% significance threshold drawn between them.]
Anything between lines cannot be distinguished from 0
Can we distinguish Hβ from H0 ?
[Figure: the H0 and Hβ curves with the area of the Hβ curve beyond the significance threshold shaded; this shaded area is the power.]
The shaded area shows the percentage of the time we would detect an effect if Hβ were true
Type I and II errors

                          YOU CONCLUDE
                     Effective        No Effect
THE       Effective  ✓                Type II Error
TRUTH     No Effect  Type I Error     ✓

• Type I error: probability = significance level
• Type II error: probability = 1 − power (power is the probability of correctly concluding "Effective" when the program truly has an effect)
Before the experiment
[Figure: the control (H0) and treatment (Hβ) sampling distributions side by side.]
• Assume two effects: no effect and treatment effect β
Impose significance level of 5%
[Figure: the H0 and Hβ curves with the 5% significance threshold marked.]
Anything between lines cannot be distinguished from 0
Can we distinguish Hβ from H0 ?
[Figure: the area of the Hβ curve beyond the significance threshold is shaded; this shaded area is the power.]
The shaded area shows the percentage of the time we would detect an effect if Hβ were true
What influences power?
• What factors change the proportion of the research hypothesis curve that is shaded, i.e. the proportion that falls to the right (or left) of the null hypothesis curve?
• Understanding this helps us design more powerful experiments
Lecture Outline
• Standard deviation and standard error
• Detecting impact
   Background
   Hypothesis testing
   Power
   The ingredients of power
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Effect Size: 1*SD
• Hypothesized effect size determines distance between means
[Figure: H0 and Hβ curves whose means are 1 standard deviation apart.]
Effect Size = 1*SD
[Figure: the 1 SD effect with the 5% significance threshold marked between the H0 and Hβ curves.]
Power: 26%
If the true impact was 1*SD…
[Figure: with a true effect of 1 SD, the shaded power region covers 26% of the Hβ curve.]
The Null Hypothesis would be rejected only 26% of the time
Effect Size: 3*SD
[Figure: control and treatment curves whose means are 3 standard deviations apart.]
A bigger hypothesized effect size puts the distributions farther apart
Effect size 3*SD: Power= 91%
[Figure: with a 3 SD effect, the shaded power region covers 91% of the Hβ curve.]
A bigger effect size means more power
What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost effective
B. Largest effect size you expect your program to produce
C. Both
D. Neither
(Audience response: 25% each.)
Picking an effect size
• What is the smallest effect that would justify adopting the program?
   Cost of this program vs. the benefits it brings
   Cost of this program vs. the alternative use of the money
• If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero
• In contrast, any larger effect would justify adopting this program: we want to be able to distinguish it from zero
Effect size and take-up
• Let’s say we believe the impact on our participants is “3”
• What happens if take-up is 1/3?
• Let’s show this graphically
Effect Size: 3*SD
[Figure: control and treatment curves 3 SD apart.]
Let’s say we believe the impact on our participants is “3”
Take-up is 33%, so the effect size is one third: 1 SD
• The hypothesized effect size determines the distance between means
[Figure: the H0 and Hβ curves are now only 1 standard deviation apart.]
Back to: Power = 26%
[Figure: the shaded power region again covers only 26% of the Hβ curve.]
Take-up is reflected in the effect size
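The dilution shown above is just multiplication: the average (intention-to-treat) effect across everyone assigned to treatment is the take-up rate times the effect on those actually treated. A one-line sketch with the slide’s numbers:

```python
# With partial take-up, the measured (intention-to-treat) effect is diluted:
#   ITT effect = take-up rate x effect on those actually treated
effect_on_treated = 3.0   # in SD units, as on the slide
take_up = 1 / 3           # one third of the assigned group takes up the program
itt_effect = take_up * effect_on_treated
print(itt_effect)  # 1.0 SD, so power falls from 91% back to 26%
```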
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
By increasing sample size you increase…
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don’t know
(Audience response: 20% each.)
Power: Effect size = 1 SD, Sample size = 4
[Figure: control and treatment sampling distributions for n = 4, with the significance threshold marked.]
Power: 64%
[Figure: the shaded power region covers 64% of the treatment curve.]
Power: Effect size = 1 SD, Sample size = 9
[Figure: narrower sampling distributions for n = 9, with the significance threshold marked.]
Power: 91%
[Figure: the shaded power region now covers 91% of the treatment curve.]
Sample Size
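The sequence of slides above (power 26% → 64% → 91% as n goes from 1 to 4 to 9) can be reproduced numerically. A sketch under two assumptions inferred from the figures: the standard error shrinks as 1/√n, and the test is one-sided at 5%.

```python
from statistics import NormalDist

def power_at_n(effect_sd, n, alpha=0.05):
    """Power to detect a hypothesized effect (in SD units) with n observations.

    Assumes SE = 1/sqrt(n) and a one-sided test at level alpha -- the setup
    that appears to generate the slide figures."""
    se = 1 / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect_sd / se)

for n in (4, 9):
    print(n, round(power_at_n(1.0, n), 2))  # n=4 -> 0.64, n=9 -> 0.91
```

Quadrupling the sample halves the standard error, which is why power rises so quickly at first.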
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Imperfect compliance
• Clustering
Variance
• How large an effect you can detect with a given sample depends on how variable the outcome is
   Example: if all children have very similar learning levels without the program, a very small impact will be easy to detect
• We can try to “absorb” variance:
   Using a baseline
   Controlling for other variables
• In practice, controlling for other variables (besides the baseline outcome) buys you very little
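To see what “absorbing” variance with a baseline buys, recall the standard ANCOVA result: if the baseline correlates with the endline outcome at r, controlling for it leaves a residual SD of σ·√(1 − r²). A small sketch; the SD of 10 and the correlation of 0.7 are hypothetical numbers for illustration.

```python
def residual_sd(outcome_sd, baseline_correlation):
    """SD remaining after 'absorbing' the baseline: sd * sqrt(1 - r^2)."""
    return outcome_sd * (1 - baseline_correlation ** 2) ** 0.5

# Hypothetical: test scores with SD 10 and a 0.7 baseline-endline correlation.
print(round(residual_sd(10.0, 0.7), 2))  # 7.14
```

Since the required sample size scales with the variance, a 0.7 correlation cuts the needed sample by about half (1 − 0.7² = 0.51); a weakly correlated covariate cuts almost nothing, which is why extra controls buy so little.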
Variance
Low Standard Deviation
[Figure: two tightly clustered frequency histograms with means 50 and 60; the distributions barely overlap.]
Less Precision
Medium Standard Deviation
[Figure: the same two means (50 and 60) with wider histograms that overlap substantially.]
Even less precise
High Standard Deviation
[Figure: very wide histograms around means 50 and 60 that overlap almost completely.]
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Sample split: 50% C, 50% T
[Figure: H0 and Hβ sampling distributions of equal width, with the significance threshold marked.]
Power: 91%
[Figure: the shaded power region covers 91% of the treatment curve.]
What if it’s not a 50-50 split?
• What happens to the relative fatness of the two sampling distributions if the split is not 50-50?
• Say 25-75?
Sample split: 25% C, 75% T
[Figure: with only 25% of the sample in control, the control curve is fatter (less precise) than the treatment curve; the significance threshold is marked.]
Power: 83%
[Figure: the shaded power region now covers only 83% of the treatment curve.]
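The drop from 91% to 83% comes from the standard error of a difference in means, which is proportional to √(1/n_T + 1/n_C) and is minimized by an even split of a fixed total sample. A sketch: the total sample of 36 and effect of 1 are hypothetical values scaled so that the 50-50 split gives the 91% power shown above, again assuming a one-sided 5% test.

```python
from statistics import NormalDist

def split_power(effect, n_total, share_treated, alpha=0.05):
    """Power when n_total units are split between treatment and control.

    One-sided test; numbers below are scaled so a 50-50 split gives ~91%."""
    n_t = n_total * share_treated
    n_c = n_total * (1 - share_treated)
    se = (1 / n_t + 1 / n_c) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect / se)

print(round(split_power(1.0, 36, 0.50), 2))  # 0.91
print(round(split_power(1.0, 36, 0.75), 2))  # 0.83
```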
Power: Main Ingredients
• Effect Size
• Sample Size
• Variance
• Proportion of sample in Treatment vs. Control
• Clustering
Clustered design: definition
• In sampling:
• When clusters of individuals (e.g. schools, communities, etc) are randomly
selected from the population, before selecting individuals for observation
• In randomized evaluation:
• When clusters of individuals are randomly assigned to different treatment
groups
Reasons for adopting cluster randomization
• Need to minimize or remove contamination
   Example: in the deworming program, the school was chosen as the unit because worms are contagious
• Basic feasibility considerations
   Example: the PROGRESA program would not have been politically feasible if some families in a community were enrolled and not others
• Only natural choice
   Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)
Clustered design: intuition
• You want to know how close the upcoming national elections will be
• Method 1: Randomly select 50 people from entire Indian population
• Method 2: Randomly select 5 families, and ask ten members of each
family their opinion
Low intra-cluster correlation (ICC), aka ρ (rho)
HIGH intra-cluster correlation (ρ)
All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. ICC (ρ) on education will be…
A. High
B. Low
C. No effect on rho
D. Don’t know
(Audience response: 25% each.)
Clustered Design: Intuition
• The outcomes within a family are likely correlated. Similarly with
children within a school, families within a village etc.
• Each additional individual does not bring entirely new information
• At the limit, imagine all outcomes within a cluster are exactly the
same: effective sample size is number of clusters, not number of
individuals
• Precision will depend on the number of clusters, the sample size within clusters, and the within-cluster correlation
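The intuition above has a standard formula: a clustered sample of n clusters of m people carries as much information as n·m / [1 + (m − 1)ρ] independent observations, where 1 + (m − 1)ρ is the design effect. A sketch using the election example’s 5 families of 10; the ρ of 0.9 is a hypothetical “opinions run in families” value.

```python
def effective_sample_size(n_clusters, m_per_cluster, rho):
    """Shrink a clustered sample by the design effect 1 + (m-1)*rho."""
    deff = 1 + (m_per_cluster - 1) * rho
    return n_clusters * m_per_cluster / deff

# 5 families x 10 members, with highly correlated opinions within a family:
print(round(effective_sample_size(5, 10, 0.9), 1))  # ~5.5 effective observations
```

With ρ near 1 the 50 interviews are worth little more than the 5 clusters, which is why adding clusters usually beats adding people within clusters when ρ is high.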
If ICC (ρ) is high, what is a more efficient way
of increasing power?
A. Include more clusters in the sample
B. Include more people in clusters
C. Both
D. Don’t know
(Audience response: 25% each.)
Standardized Effect Sizes
• The Standardized effect size is the effect size divided
by the standard deviation of the outcome
• δ = effect size/Standard deviation
Standardized Effect Sizes

An effect size of…   Is considered…   …and it means that…                       Required N under 50% treatment
0.2                  Modest           The average member of the treatment        786
                                      group had a better outcome than the
                                      58th percentile of the control group
0.5                  Large            The average member of the treatment        126
                                      group had a better outcome than the
                                      69th percentile of the control group
0.8                  VERY large       The average member of the treatment        50
                                      group had a better outcome than the
                                      79th percentile of the control group
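The sample sizes in the table follow from the textbook two-sample formula n per arm = 2·((z_α + z_β)/δ)². A sketch under the conventions that appear to reproduce the table: a two-sided 5% test and 80% power.

```python
from math import ceil
from statistics import NormalDist

def required_total_n(delta, power=0.80, alpha=0.05):
    """Total sample (50% treated) to detect standardized effect size delta.

    Two-sided test at level alpha; these conventions reproduce the table."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_b = NormalDist().inv_cdf(power)          # ~0.84
    n_per_arm = 2 * ((z_a + z_b) / delta) ** 2
    return 2 * ceil(n_per_arm)

for d in (0.2, 0.5, 0.8):
    print(d, required_total_n(d))  # 786, 126, 50
```

Note how the required sample grows with the inverse square of the effect size: halving the detectable effect quadruples the sample.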
Conclusion
• Even with a perfectly valid experiment, the ability to make inferences depends on the SIZE OF THE SAMPLE.
• In designing an evaluation, you need to balance tradeoffs to ensure that your sample is large enough, given:
   Desired power and significance levels
   Anticipated effect size
   The amount of “noise” (underlying variance in the outcome variable)
   Treatment-control size ratio (feasibility and cost)
   Take-up of treatment
   Clustering
The Important Stuff
How confident are we of our results?
 We have a sample, not the population.
 The Central Limit Theorem and the Law of Large Numbers tell us important things about the sampling distribution that allow for HYPOTHESIS TESTING.
 Hypothesis testing enables us to establish whether our results are statistically significant.
 There are two kinds of errors we can make in hypothesis testing:
   > Type I: the intervention is not effective, but we find it to be effective. We FIX this at 5%.
   > Type II: the intervention is effective, but we fail to find an impact. The smaller the probability of this occurring, the higher our power.
 Power can be increased through five things:
   > Sample size
   > The size of the effect
   > The proportions of your sample in the treatment and control groups
   > The variance
   > Clustering