Wednesday Day 3
• Sampling theory
• Example dataset 1
Clusters and Strata
Tor Strand, 2006
When observations are not
independent
Analysis of cluster sampling, cluster
randomized trials, trials with repeated
measurements and entries.
Basic statistical rule for most
common statistical tests
• Individual observations should be independent of
each other
• If they are not, we are likely to “overestimate” the
precision of our effect measures
• Or in other words, by disregarding dependencies
we may “underestimate” the required sample size
• i.e. variance estimates will be too low, confidence
intervals too narrow, t-values too high, and P-values
too low.
How can two observations be
interdependent?
• Cluster sampling
• Cluster randomized trial
• Individuals participate in a trial more than once
• People from the same household participate in the
same trial or survey
• An outcome is measured several times in an
individual
How can we deal with clustering?
You need to calculate the
INFLATION FACTOR = DESIGN EFFECT
For this you need to know the:
1) Intracluster correlation coefficient (ICC)
2) Cluster size
How can we deal with
clustering?
Design effect = Deff = Inflation factor (IF)
IF = 1 + (m - 1) × ICC
ICC = intracluster correlation coefficient
ICC is a measure of how similar the individuals within a
cluster are compared to all in the sample. Theoretical values:
0 to 1. If ICC is 0, the individuals within each cluster are no
more similar to each other than to the rest of the sample.
m = average number of individuals in each cluster
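The inflation factor formula above can be sketched in a few lines of Python. The function name and the example inputs (20 individuals per cluster, ICC = 0.05) are assumptions for illustration only; they are not from the slides.

```python
def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Inflation factor IF = 1 + (m - 1) * ICC.

    mean_cluster_size is m, the average number of individuals per cluster;
    icc is the intracluster correlation coefficient.
    """
    if mean_cluster_size < 1:
        raise ValueError("mean cluster size must be at least 1")
    return 1.0 + (mean_cluster_size - 1.0) * icc


# Hypothetical inputs: 20 individuals per cluster, ICC = 0.05.
print(design_effect(20, 0.05))

# With ICC = 0 the clustering has no effect and the factor is exactly 1.
print(design_effect(20, 0.0))
```

Note that even a small ICC gives a noticeable inflation factor when the clusters are large, because the ICC is multiplied by (m - 1).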
Design effect clusters
• Deft
– The ratio of the standard error achieved by a
study using this design to the standard error that
would be achieved by a simple random sample (SRS)
with the same sample size.
– Adjusted SE = Deft × Unadjusted SE
• Deff
– Deff = Deft²
IF = 1 + (m - 1) × ICC
• “Typical” ICC = 0.04*
• “Usual values”: 0.01 - 0.1
• Obtain these values from
– Published studies
– Pilot studies
• Applicable for means, rates and proportions
* Eldridge SM (2003). Lessons for cluster randomised trials
in the 21st century: a systematic review of trials in primary care.
Clinical Trials.
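When the ICC has to come from a pilot study, one common choice is the one-way ANOVA estimator. The slides do not name an estimator, so the sketch below is an assumption: it handles only clusters of equal size, and the function name is hypothetical.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equal-sized clusters.

    clusters is a list of lists, one inner list of measurements per cluster.
    ICC = (MSB - MSW) / (MSB + (m - 1) * MSW), where MSB and MSW are the
    between-cluster and within-cluster mean squares.
    """
    k = len(clusters)
    m = len(clusters[0]) if k else 0
    if k < 2 or m < 2 or any(len(c) != m for c in clusters):
        raise ValueError("need at least 2 clusters of equal size >= 2")

    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]

    msb = m * sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1)
    msw = sum(
        (y - cm) ** 2 for c, cm in zip(clusters, cluster_means) for y in c
    ) / (k * (m - 1))

    denominator = msb + (m - 1) * msw
    if denominator == 0:
        raise ValueError("no variation in the data; ICC is undefined")
    return (msb - msw) / denominator


# Hypothetical pilot data: 3 clusters of 4 measurements each.
pilot = [[4, 5, 5, 6], [9, 10, 11, 10], [1, 2, 2, 3]]
print(anova_icc(pilot))
```

In small samples this estimate can fall below 0 even though the theoretical range is 0 to 1; such values are usually reported as 0.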
Examples of published
intracluster correlation coefficients
Outcome                                                     Intracluster corr
EPDS (Q1)                                                   0.029
EPDS 13 (Q1)                                                0.0336
EPDS (Q2)                                                   0.0239
EPDS 13 (Q2)                                                0.0108
Mother smokes (Q1)                                          0.0267
Mother smokes (Q2)                                          0.088
Breast feeding if child under 6 months age                  0.0355
Breast fed at birth                                         0.0137
Child has sleeping problem (Q1)                             0.0 (0.0049)
Child has sleeping problem (Q2)                             0.0 (0.003)
Child's general health very healthy* (WCHMP)                0.0287
Minor illness more than other children (WCHMP)              0.0438
> 2 minor illnesses in past 3 months (WCHMP)                0.0479
Accident in past year (WCHMP)                               0.0 (0.0135)
Behaviour problem (WCHMP)                                   0.0206
Hospitalised in past 6 months (WCHMP)                       0.0 (0.0126)
Child attended hospital in last two months                  0.014
Child seen GP in past 2 weeks (Q1)                          0.0 (0.0196)
Child seen GP in past 2 weeks (Q2)                          0.0 (0.0161)
Non-routine visits to GP by mother standardised to 1 year   0.0277
Non-routine visits to GP for child standardised to 1 year   0.0274
EPDS, Edinburgh postnatal depression score
WCHMP, Warwick child health and morbidity profile
Reading et al. Arch Dis Child 2000;82:79-83
How do we adjust?
adjusted variance = crude variance × IF
adjusted sample size (SS) = crude SS × IF
adjusted SE = crude SE × √IF
SE = standard error
IF = inflation factor
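The three adjustments can be applied together as in the sketch below. The function name and the example numbers (crude SE 0.5, crude sample size 300, IF = 4) are hypothetical. Rounding the sample size up to a whole person is an assumption of this sketch.

```python
import math


def adjust_for_clustering(crude_se, crude_sample_size, inflation_factor):
    """Apply the inflation factor (IF) to SE, variance and sample size.

    Returns (adjusted SE, adjusted variance, adjusted sample size).
    """
    adjusted_se = crude_se * math.sqrt(inflation_factor)
    # The variance of the estimate is SE squared, so it scales by IF itself.
    adjusted_variance = crude_se ** 2 * inflation_factor
    # Round up: a sample size must be a whole number of individuals.
    adjusted_sample_size = math.ceil(crude_sample_size * inflation_factor)
    return adjusted_se, adjusted_variance, adjusted_sample_size


# Hypothetical study: crude SE 0.5, crude sample size 300, IF = 4.
se, variance, sample_size = adjust_for_clustering(0.5, 300, 4.0)
print(se, variance, sample_size)
```

The SE grows with the square root of IF while the variance and the required sample size grow with IF itself, which is why Deft and Deff are kept apart.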
[Figures: 20 clusters in a population, sampled three ways: simple random sample, one-stage cluster sample, two-stage cluster sample]
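The three designs (simple random sample, one-stage cluster sample, two-stage cluster sample) can be sketched on a toy population of 20 clusters. The cluster size of 10, the sample sizes and all function names are assumptions for illustration. Each individual is represented as a (cluster, person) pair.

```python
import random


def build_population(n_clusters=20, cluster_size=10):
    """Map each cluster id to a list of (cluster id, person id) pairs."""
    return {c: [(c, i) for i in range(cluster_size)] for c in range(n_clusters)}


def simple_random_sample(population, n, rng):
    """Ignore clusters and draw n individuals from the whole population."""
    everyone = [person for members in population.values() for person in members]
    return rng.sample(everyone, n)


def one_stage_cluster_sample(population, n_clusters, rng):
    """Draw clusters at random and include everyone in each chosen cluster."""
    chosen = rng.sample(sorted(population), n_clusters)
    return [person for c in chosen for person in population[c]]


def two_stage_cluster_sample(population, n_clusters, n_per_cluster, rng):
    """Draw clusters at random, then a random subsample within each one."""
    chosen = rng.sample(sorted(population), n_clusters)
    return [
        person
        for c in chosen
        for person in rng.sample(population[c], n_per_cluster)
    ]


rng = random.Random(2006)  # fixed seed so the sketch is reproducible
population = build_population()
srs = simple_random_sample(population, 40, rng)
one_stage = one_stage_cluster_sample(population, 4, rng)
two_stage = two_stage_cluster_sample(population, 8, 5, rng)
print(len(srs), len(one_stage), len(two_stage))
```

All three samples contain 40 individuals, but they are spread over a different number of clusters, which is what drives the cluster effect.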
Cluster effect
• Individuals within a cluster might be more similar
to each other than to the rest of the
population.
• In cluster sampling the variance of the estimates
might accordingly be underestimated.
• Thus, the standard errors and the 95% confidence
intervals of the estimates need to be adjusted for
the cluster effect.
Many ways to adjust for cluster
effect
• svy commands in Stata and similar
commands in other programs
• robust variance estimates
• generalized estimating equations
(GEEs)
• multilevel modeling
• others
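As one illustration of a robust variance estimate, the sketch below computes a cluster-robust (sandwich) standard error for a simple mean and compares it with the naive standard error that treats all observations as independent. This is a minimal sketch without any small-sample correction, not the method behind any particular program's commands. The function names and the data are hypothetical.

```python
import math


def naive_se_of_mean(clusters):
    """SE of the mean, treating every observation as independent."""
    values = [y for cluster in clusters for y in cluster]
    n = len(values)
    mean = sum(values) / n
    variance = sum((y - mean) ** 2 for y in values) / (n - 1)
    return math.sqrt(variance / n)


def cluster_robust_se_of_mean(clusters):
    """Cluster-robust SE of the mean (no small-sample correction).

    Residuals are summed within each cluster before squaring, so that
    similarity within a cluster is reflected in the variance.
    """
    values = [y for cluster in clusters for y in cluster]
    n = len(values)
    mean = sum(values) / n
    cluster_totals = [sum(y - mean for y in cluster) for cluster in clusters]
    return math.sqrt(sum(t ** 2 for t in cluster_totals)) / n


# Hypothetical data: 4 clusters whose members resemble each other.
data = [[4, 5, 5], [9, 10, 11], [1, 2, 2], [7, 7, 8]]
print(naive_se_of_mean(data), cluster_robust_se_of_mean(data))
```

For these data the cluster-robust standard error is larger than the naive one, in line with the point that ignoring clustering overstates precision.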
Stratification
• Stratify “to make layers”
• Reasons to stratify:
– Representative (ensure no undersampling of a
stratum)
– Subgroup analyses
– Convenience (different sampling procedures in
different strata)
– Increase the overall precision
[Figures: 20 clusters in a population, divided into Stratum 1 and Stratum 2, sampled three ways: stratified simple random sample, stratified one-stage cluster sample, stratified two-stage cluster sample]
Stratification increases precision
Stratum 1: 7, 8, 10, 11 (mean 9)
Stratum 2: 9, 11, 13, 15 (mean 12)
Overall mean 10.5
• The variance is based on the sum of the squared
deviations from the mean of the particular
stratum rather than from the mean of the total sample,
i.e. the variance DECREASES (if there is a
change due to stratification).
• Reduces the required sample size
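The effect can be checked with the eight observations from this example (7, 8, 10, 11 in stratum 1 and 9, 11, 13, 15 in stratum 2). The sketch below compares the sum of squared deviations taken from the overall mean with the sum taken from each stratum's own mean. The helper names are hypothetical.

```python
strata = {
    "Stratum 1": [7, 8, 10, 11],
    "Stratum 2": [9, 11, 13, 15],
}


def mean(values):
    return sum(values) / len(values)


def sum_squared_deviations(values, centre):
    """Sum of squared differences between each value and centre."""
    return sum((v - centre) ** 2 for v in values)


all_values = [v for values in strata.values() for v in values]
overall_mean = mean(all_values)

# Deviations measured from the overall mean (stratification ignored).
ss_overall = sum_squared_deviations(all_values, overall_mean)

# Deviations measured from each stratum's own mean.
ss_within = sum(
    sum_squared_deviations(values, mean(values)) for values in strata.values()
)

print(overall_mean, ss_overall, ss_within)  # prints: 10.5 48.0 30.0
```

The sum of squared deviations drops from 48.0 to 30.0 because the difference between the two stratum means no longer counts as variation.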
Calculating variance
                          Compared to the overall mean    Compared to the stratum mean
Stratum                   Observed  Mean   Diff  Diff²    Observed  Mean   Diff  Diff²
1                                7  10.5    3.5   12.3           7   9.0    2.0    4.0
1                                8  10.5    2.5    6.3           8   9.0    1.0    1.0
1                               10  10.5    0.5    0.3          10   9.0   -1.0    1.0
1                               11  10.5   -0.5    0.3          11   9.0   -2.0    4.0
Sum of squared diff                               19.0                            10.0
2                                9  10.5    1.5    2.3           9  12.0    3.0    9.0
2                               11  10.5   -0.5    0.3          11  12.0    1.0    1.0
2                               13  10.5   -2.5    6.3          13  12.0   -1.0    1.0
2                               15  10.5   -4.5   20.3          15  12.0   -3.0    9.0
Sum of squared diff                               29.0                            20.0
Sum of squared diff (both strata)                 48.0                            30.0
Summary - design effect
• Cluster sampling might
– reduce the precision
– increase the required sample size
• Stratification might
– increase the precision
– reduce the required sample size
• Many ways to adjust for the design effect,
all are technically complicated
Self weighting sampling
• Probability proportional to size
[Figure sequence: 4 clusters in a population, with the probability of selection shown at each sampling stage]
• Equal probability of PSUs (primary sampling units)
[Figure sequence: 4 clusters in a population, with the probability of selection shown at each sampling stage]
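The two self weighting designs can be compared by multiplying the selection probabilities of the two stages. The sketch below uses exact fractions. The four cluster sizes, the number of clusters drawn, the within-cluster take and the function names are all hypothetical, and the code only computes probabilities; it does not draw a sample.

```python
from fractions import Fraction

# Hypothetical population: 4 clusters (PSUs) of unequal size.
cluster_sizes = {"A": 100, "B": 200, "C": 300, "D": 400}


def pps_with_fixed_take(sizes, n_clusters, take):
    """Stage 1: clusters chosen with probability proportional to size.
    Stage 2: the same number of individuals (take) from each chosen cluster.
    Returns each individual's overall selection probability, by cluster.
    """
    total = sum(sizes.values())
    overall = {}
    for cluster, size in sizes.items():
        stage1 = Fraction(n_clusters * size, total)
        if stage1 > 1 or take > size:
            raise ValueError("selection probability would exceed 1")
        overall[cluster] = stage1 * Fraction(take, size)
    return overall


def equal_psu_with_fixed_fraction(sizes, n_clusters, fraction):
    """Stage 1: every cluster has the same chance of being chosen.
    Stage 2: the same sampling fraction within each chosen cluster.
    """
    stage1 = Fraction(n_clusters, len(sizes))
    return {cluster: stage1 * fraction for cluster in sizes}


pps = pps_with_fixed_take(cluster_sizes, n_clusters=2, take=20)
equal = equal_psu_with_fixed_fraction(cluster_sizes, 2, Fraction(1, 10))
print(pps)
print(equal)
```

In both designs every individual ends up with the same overall probability of selection regardless of cluster, which is what makes the sample self weighting. The practical difference is that the first design yields the same number of individuals per chosen cluster, while the second yields more individuals from larger clusters.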