Sample Size

SAMPLE SIZE ESTIMATION
Dr Ishita Guha
Moderator- Mr. M.S Bharambe
Framework
• Introduction
• When sample size is required
• Importance of sample size
• Approaches for calculating sample size
• Practical issues in determining sample sizes
• Description of some commonly used terms
• Procedures for calculating sample size
• Formulae for sample size estimation in different studies
• References
Introduction
• Sample: a finite, representative subset of the statistical individuals in a population
• Sample size: the number of individuals in that subset, studied in order to make an inference about a reference population
• If a study does not have an adequate sample size, differences that truly exist in the population may not be detected as statistically significant
• In other words, the study lacks the power to detect those differences
Why is sample size required?
• If the sample size is too small, the study may fail to detect important effects or associations, or may estimate them too imprecisely
• If the sample size is too large, the study will be more time-consuming and costly, and may even lose accuracy
• Optimum sample size determination is required:
1. To allow for appropriate analysis
2. To provide the desired level of accuracy
3. To allow validity of the significance test result
4. To minimize resources
Why is sample size calculation important?
• Economic reasons
- Undersized study: may be under-powered to detect the effect; resources are wasted because the study cannot yield useful results
- Oversized study: wastes resources unnecessarily; may produce statistically significant results of little practical importance
• Ethical reasons
- A badly sized study can expose subjects to unnecessary or potentially harmful treatments
An appropriate sample size supports the validity, accuracy, reliability, and scientific and ethical integrity of the study
When is sample size estimation done?
1. When estimating a population parameter, such as a prevalence
2. When estimating a difference between parameters/statistics
Approaches for calculating sample size
1. Estimation (or confidence interval approach):
Concerned with estimating the value of a specific population parameter with a specified precision
• A random sample is drawn from the population concerned
• A point estimate of the unknown parameter is computed from the sample, but it may not equal the true value of the parameter
• So an interval estimate for the parameter is obtained
Formula:
100(1-α)% confidence interval for the unknown parameter = point estimate ± $Z_{1-\alpha/2}$ × SE of the point estimate
Example: 95% CI for the true mean in the population = sample mean ± 1.96 × SE of the sample mean
Interpretation:
• We are 95% confident that the single interval computed from a random sample of the population contains the true value of the parameter.
• The narrower the required interval, the larger the sample size must be
[Figure: a point estimate from the sample (45%) shown against the true population prevalence (50%)]
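The link between interval width and sample size can be sketched numerically. The snippet below is a minimal illustration, assuming a simple random sample and the normal approximation for a proportion. The 45% point estimate is taken from the figure; the sample sizes of 100 and 400 and the function name `ci_half_width` are made up for illustration.

```python
import math

def ci_half_width(p_hat, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.45                      # point estimate from the sample
for n in (100, 400):              # hypothetical sample sizes
    hw = ci_half_width(p_hat, n)
    print(f"n={n}: 95% CI = {p_hat - hw:.3f} to {p_hat + hw:.3f}")
```

Because the half-width shrinks with the square root of n, quadrupling the sample size only halves the width of the interval.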
2. Hypothesis testing (test of significance approach):
• Area of acceptance and rejection
[Figure: two normal curves showing the area of acceptance between the critical values. At the 5% level the critical values are -1.96 and 1.96, with an area of rejection of 2.5% in each tail; at the 1% level they are -2.58 and 2.58, with an area of rejection of 0.5% in each tail]
Hypothesis testing: researcher's decision
Hypothesis      Accepted                      Rejected
True            Correct                       Type I error (or) α error
False           Type II error (or) β error    Correct
Description of some commonly used terms
1. Random error: describes the role of chance
• Sources of random error include:
- sampling variability
- subject-to-subject differences
- measurement errors
• It can be controlled and reduced to acceptably low levels by:
- increasing the sample size
- repeating the experiment (averaging)
2. Systematic error (Bias)
• Not a consequence of chance alone
• Types- Measurement bias, Personal bias, instrumental bias.
• These factors usually can be removed/reduced by good design
and conduct of the experiment
• A strong bias can yield an estimate very far from the true
value, even in the wrong direction
3. Precision (Reliability)
• Degree to which a variable has the same value when measured several times
• It is a function of
-- random error (the greater the error, the less precise the measurement)
-- sample size (larger sample size would give precise estimates)
-- the confidence interval
-- the variance of the outcome variable
Precision = √n / SD
Precision is directly proportional to the square root of the sample size (√n), so quadrupling n doubles the precision
4. Accuracy (Validity)
• The degree to which a variable actually represents the true value
• It is a function of random error, bias, or both
• The greater the error, the less accurate the variable
Effect size
• It should represent the smallest difference that would be of clinical or biological significance
• A large sample size is needed to detect a minute difference
• Sample size is inversely related to the effect size (in the formulae used here, to the square of the effect size)
Design effect
• Ratio of the actual variance, under the sampling method actually used, to
the variance computed under the assumption of Simple Random sampling
• Cluster sampling is commonly used
• Sample is not as varied in cluster sampling as it would be in a random
sample, so the effective sample size is reduced
• The loss of effectiveness by the use of cluster sampling, instead of simple
random sampling, is the design effect.
• The sample sizes for simple random samples are multiplied by the design
effect to obtain the sample size for the clustered sample.
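As a minimal sketch of the last point, the adjustment is a single multiplication followed by rounding up. Both input values below (a simple random sample size of 246 and a design effect of 2.0) are hypothetical numbers chosen for illustration, as is the function name.

```python
import math

def clustered_sample_size(n_srs, design_effect):
    """Inflate a simple-random-sample size by the design effect."""
    return math.ceil(n_srs * design_effect)

n_srs = 246     # hypothetical requirement under simple random sampling
deff = 2.0      # hypothetical design effect for a cluster survey
print(clustered_sample_size(n_srs, deff))
```

A design effect of 1.0 leaves the sample size unchanged; in practice the value is taken from earlier surveys that used a similar design.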
Practical issues in Determining Sample Sizes
• Multiple outcomes: calculate the sample size for each outcome and then use the largest
• Dropout: subjects who are enrolled but whose outcome status cannot be ascertained do not count towards the sample size, so the calculated sample size should be inflated by the anticipated dropout rate
• Number of sub-groups to analyze: if multiple sub-groups are to be analyzed, the sample size should be increased to ensure that adequate numbers are obtained for each sub-group
• Importance of the research issue: if the results of the survey research are very critical, the sample size should be increased so that the width of the confidence interval decreases
• Heterogeneity of the population: if wide variation is likely in the results obtained from different respondents, the sample size should be increased
• Funding: budgetary constraints limit the sample size for the study
Procedure for calculating sample size
• Use of formulae
• Ready-made tables
• Nomograms
• Computer software
Sample size Calculation
Required information:
• Anticipated population proportion: p
• Confidence level: 100(1-α)%
• Precision required on either side of p (allowable error): D
• Standard deviation: σ
Continuous variables:
$n = \dfrac{Z^2 \sigma^2}{D^2}$
Categorical variables:
$n = \dfrac{Z^2\, p(1-p)}{D^2}$
where $Z = Z_{1-\alpha/2}$
Example 1
To estimate the Hb level (g/dl) among pregnant women of a community, the mean and standard deviation of Hb in pregnant women of that community are expected to be about 10.3 and 2.3 g/dl respectively. At α = 0.05 and a relative allowable error of 10%, how many pregnant women of the community should be studied?
Mean = 10.3
SD = 2.3
$Z_{1-\alpha/2}$ = 1.96
D = (10/100) × 10.3 = 1.03
• Sample size: $n = \dfrac{Z^2 \sigma^2}{D^2} = \dfrac{(1.96)^2 (2.3)^2}{(1.03)^2} = 19.16$
Rounding up, at least 20 pregnant women need to be studied.
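The Hb calculation can be checked with a few lines of Python. This is a minimal sketch, assuming the relative allowable error is converted to an absolute margin D as a fraction of the expected mean; the function name `n_for_mean` is illustrative.

```python
import math

def n_for_mean(sd, margin, z=1.96):
    """Unrounded n = z^2 * sd^2 / margin^2."""
    return (z * sd) ** 2 / margin ** 2

mean_hb, sd_hb = 10.3, 2.3
margin = 0.10 * mean_hb            # 10% of the mean, about 1.03 g/dl
raw_n = n_for_mean(sd_hb, margin)
print(f"unrounded n = {raw_n:.2f}, study at least {math.ceil(raw_n)}")
```

The unrounded value is about 19.16. Rounding up, which is the usual convention because a fraction of a subject cannot be recruited, gives 20.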
Example 2
The current prevalence of malaria in a community is expected to be around 40%. To estimate the current prevalence of malaria in that community, how many people should be included in the study at the 5% level of significance with an absolute allowable error of 10%?
P = 0.40
$Z_{1-\alpha/2}$ = 1.96
D = 0.10
$n = \dfrac{Z^2\, p(1-p)}{D^2} = \dfrac{(1.96)^2 \times 0.40 \times (1-0.40)}{(0.10)^2} = 92.2$
Rounding up, a minimum of 93 people should be included in the study.
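A short sketch of the same calculation for a categorical outcome follows. It assumes the inputs of the malaria example (P = 0.40, D = 0.10, Z = 1.96); the function name `n_for_proportion` is illustrative.

```python
import math

def n_for_proportion(p, d, z=1.96):
    """Unrounded n = z^2 * p * (1 - p) / d^2 for absolute precision d."""
    return z ** 2 * p * (1 - p) / d ** 2

raw_n = n_for_proportion(p=0.40, d=0.10)
n = math.ceil(raw_n)
print(f"unrounded n = {raw_n:.2f}, include at least {n}")
```

The unrounded value is about 92.2, so rounding up gives 93 people.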
Sample size calculation in cross-sectional studies
Estimating the population proportion with specified absolute precision:
Required information:
• Population proportion: p
• Confidence level: 100(1-α)%
• Absolute precision required on either side of the proportion: d
• Formula:
$n = \dfrac{Z^2_{1-\alpha/2}\, p(1-p)}{d^2}$
Example 1 (Absolute Precision)
A local health department wishes to estimate the prevalence of tonsillitis among children under five years of age in its locality. It is known that the true rate is unlikely to exceed 20%. The department wants to estimate the prevalence to within 5 percentage points of the true value, with 95% confidence. How many children should be included in the sample?
Solution:
(a) Anticipated population proportion (P): 20%
(b) Confidence level: 95%
(c) Absolute precision (d): 5 percentage points (i.e. 15%-25%)
For P = 0.20 and d = 0.05
$n = \dfrac{(1.96)^2 \times 0.2 \times (1-0.2)}{(0.05)^2} = 245.86$
Rounding up, 246 children should be included in the sample.
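The choice of P = 20% matters because the required sample size grows as the anticipated proportion approaches 50%. The sketch below tabulates n for several anticipated proportions at d = 0.05 and 95% confidence. Only P = 0.20 comes from the example; the other values of P are illustrative.

```python
import math

def n_for_proportion(p, d=0.05, z=1.96):
    """Sample size for absolute precision d, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

sizes = {p: n_for_proportion(p) for p in (0.05, 0.10, 0.20, 0.30, 0.50)}
for p, n in sizes.items():
    print(f"P = {p:.2f} -> n = {n}")
```

Since p(1-p) peaks at P = 0.5, planning with the anticipated value closest to 50% (here the upper bound of 20%) gives a sample size that is sufficient for any lower prevalence.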
Estimating a population proportion with specified relative precision
Required information:
• Anticipated population proportion = P
• Confidence level = 100(1-α)%
• Relative precision = ε
• Formula:
$n = Z^2_{1-\alpha/2}\, \dfrac{1-P}{\varepsilon^2 P}$
Example 2 (Relative Precision)
An investigator working for a national programme of immunization seeks to estimate the proportion of children in the country who are receiving measles vaccinations. The vaccination coverage is not expected to be below 50%. How many children must be studied if the resulting estimate is to fall within 10% (not 10 percentage points) of the true proportion with 95% confidence?
Solution:
(a) Anticipated population proportion: 50%
(b) Confidence level: 95%
(c) Relative precision (ε): 10% (of 50%)
$n = Z^2_{1-\alpha/2}\, \dfrac{1-P}{\varepsilon^2 P} = \dfrac{(1.96)^2 \times (1-0.5)}{(0.10)^2 \times 0.5} = 384.16$
Rounding up, 385 children must be studied.
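A relative precision ε on a proportion P is the same requirement as an absolute precision d = εP, so the two formulae should agree. The sketch below checks this for the vaccination example; the function names are illustrative.

```python
import math

def n_relative(p, eps, z=1.96):
    """Unrounded n = z^2 (1 - p) / (eps^2 p), eps being a fraction of p."""
    return z ** 2 * (1 - p) / (eps ** 2 * p)

def n_absolute(p, d, z=1.96):
    """Unrounded n = z^2 p (1 - p) / d^2 for absolute precision d."""
    return z ** 2 * p * (1 - p) / d ** 2

p, eps = 0.50, 0.10
raw_n = n_relative(p, eps)
same_n = n_absolute(p, d=eps * p)   # 10% of 50% is 5 percentage points
print(math.ceil(raw_n), math.isclose(raw_n, same_n))
```

Both routes give about 384.16 before rounding, that is 385 children.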
Sample size estimation for testing a population proportion
Required information:
• Test value of the population proportion under the null hypothesis = Po
• Anticipated value of the population proportion = Pa
• Level of significance = 100α%
• Power of the test = 100(1-β)%
• Alternative hypothesis: Pa > Po or Pa < Po for a one-sided test; Pa ≠ Po for a two-sided test
Formula (one-sided test; for a two-sided test use $Z_{1-\alpha/2}$ in place of $Z_{1-\alpha}$):
$n = \dfrac{\left\{ Z_{1-\alpha}\sqrt{P_o(1-P_o)} + Z_{1-\beta}\sqrt{P_a(1-P_a)} \right\}^2}{(P_o - P_a)^2}$
Example 3: Previous surveys have demonstrated that the usual prevalence of dental caries among school children in a particular community is about 25%. How many children should be included in a survey designed to test for a decrease in the prevalence of dental caries, if it is desired to be 90% sure of detecting a rate of 20% at the 5% level of significance?
Solution
Test caries rate = 25% (Po = 0.25)
Anticipated caries rate = 20% (Pa = 0.20)
Level of significance = 5% ($Z_{1-\alpha}$ = 1.645)
Power of test = 90% ($Z_{1-\beta}$ = 1.282)
Alternative hypothesis (one-sided test): caries rate < 25%
$n = \dfrac{\left\{ 1.645\sqrt{0.25 \times 0.75} + 1.282\sqrt{0.20 \times 0.80} \right\}^2}{(0.25 - 0.20)^2} \approx 600.4$
Rounding up, n = 601
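The dental caries example can be reproduced with the standard library alone. In this sketch the Z values are derived from α and the power with `statistics.NormalDist` rather than read from a table; the function name and the `two_sided` switch are illustrative additions.

```python
import math
from statistics import NormalDist

def n_one_sample_proportion(p0, pa, alpha=0.05, power=0.90, two_sided=False):
    """Sample size to test H0: P = p0 against an anticipated value pa."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(pa * (1 - pa))) ** 2
    return math.ceil(numerator / (p0 - pa) ** 2)

one_sided = n_one_sample_proportion(0.25, 0.20)
two_sided = n_one_sample_proportion(0.25, 0.20, two_sided=True)
print(one_sided, two_sided)
```

The one-sided calculation gives 601. A two-sided test at the same significance level and power needs a larger sample, because the critical value is larger.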
Sample size calculation in case-control studies
Requirements
• Power (1-β): probability of detecting a real effect
• α: probability of detecting a false effect
• p1: probability of exposure in cases
• p2: probability of exposure in controls
• $\bar{p}$: average proportion exposed, (p1 + p2)/2
• OR: hypothesized minimum odds ratio of exposure between cases and controls to be detected
• r: ratio of controls to cases
FORMULA FOR DIFFERENCE IN PROPORTIONS
$n = \left(\dfrac{r+1}{r}\right) \dfrac{\bar{p}(1-\bar{p})\,(Z_\beta + Z_{\alpha/2})^2}{(p_1 - p_2)^2}$
where
• n: sample size in the case group
• r: ratio of controls to cases
• $\bar{p}(1-\bar{p})$: a measure of variability (similar to standard deviation)
• $Z_\beta$: represents the desired power (typically 0.84 for 80% power)
• $Z_{\alpha/2}$: represents the desired level of statistical significance (typically 1.96)
• $p_1 - p_2$: effect size (difference in proportions)
Example 1
Estimate the cases and controls required, given:
• 80% power
• An odds ratio of 2.0 or greater is to be detected
• An equal number of cases and controls
• The proportion exposed in the control group is 20%
Solution:
• For 80% power, $Z_\beta$ = 0.84
• For the 0.05 significance level, $Z_{\alpha/2}$ = 1.96
• r = 1 (equal number of cases and controls)
• The proportion exposed in the control group is p2 = 0.20
• The proportion of cases exposed:
$p_1 = \dfrac{OR \times p_2}{p_2(OR - 1) + 1} = \dfrac{2.0 \times 0.20}{0.20 \times (2.0 - 1) + 1} = \dfrac{0.40}{1.20} = 0.33$
• Average proportion exposed: $\bar{p} = (p_1 + p_2)/2 = (0.33 + 0.20)/2 = 0.265$
$n = \left(\dfrac{1+1}{1}\right) \dfrac{0.265 \times 0.735 \times (0.84 + 1.96)^2}{(0.33 - 0.20)^2} = 180.7$
Rounding up, n = 181
Therefore, the total sample size is 362 (181 cases, 181 controls)
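The case-control calculation is sensitive to rounding of p1, which a short script makes visible. This sketch assumes the formula and Z values of the worked example; the function names are illustrative.

```python
import math

def exposure_in_cases(odds_ratio, p2):
    """Exposure proportion in cases implied by the odds ratio and p2."""
    return odds_ratio * p2 / (p2 * (odds_ratio - 1) + 1)

def cases_needed(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Number of cases for a difference in exposure proportions."""
    p_bar = (p1 + p2) / 2
    n = ((r + 1) / r) * p_bar * (1 - p_bar) * (z_beta + z_alpha) ** 2 / (p1 - p2) ** 2
    return math.ceil(n)

p2 = 0.20
p1 = exposure_in_cases(2.0, p2)              # 0.40 / 1.20 = 0.333...
n_rounded = cases_needed(round(p1, 2), p2)   # p1 rounded to 0.33, as in the worked example
n_exact = cases_needed(p1, p2)               # p1 kept at full precision
print(n_rounded, n_exact)
```

With p1 rounded to 0.33 the result is 181 cases, matching the worked example. Carrying p1 at full precision gives 173, so the rounded hand calculation is slightly conservative.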
FORMULA FOR DIFFERENCE IN MEANS
$n = \left(\dfrac{r+1}{r}\right) \dfrac{\sigma^2\,(Z_\beta + Z_{\alpha/2})^2}{(\text{difference})^2}$
where
• n: sample size in the case group
• r: ratio of controls to cases
• σ: standard deviation of the outcome variable
• $Z_\beta$: represents the desired power (typically 0.84 for 80% power)
• $Z_{\alpha/2}$: represents the desired level of statistical significance (typically 1.96)
• difference: effect size (difference in means)
Example 2
How many cases and controls are needed, assuming:
• 80% power
• The SD of the characteristic is 10.0
• An equal number of cases and controls
• A difference of 5.0 in the characteristic (one half standard deviation) is to be detected
Solution:
• For 80% power, $Z_\beta$ = 0.84
• For the 0.05 significance level, $Z_{\alpha/2}$ = 1.96
• r = 1
• σ = 10.0
• Difference = 5.0
$n = \left(\dfrac{1+1}{1}\right) \dfrac{(10.0)^2 \times (0.84 + 1.96)^2}{(5.0)^2} = 62.72$
Rounding up, n = 63
Therefore, the total sample size is 126 (63 cases, 63 controls)
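The role of r, the ratio of controls to cases, is easier to see by computing the formula for more than one ratio. The sketch below reproduces the equal-allocation example and adds a two-controls-per-case scenario, which is a hypothetical variation not taken from the example.

```python
import math

def cases_needed_means(sd, difference, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Number of cases for a difference in means, with r controls per case."""
    n = ((r + 1) / r) * sd ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2
    return math.ceil(n)

equal = cases_needed_means(sd=10.0, difference=5.0)              # r = 1
two_to_one = cases_needed_means(sd=10.0, difference=5.0, r=2)    # hypothetical r = 2
print(equal, two_to_one, 2 * two_to_one)
```

With r = 1 the study needs 63 cases and 63 controls. With r = 2 it needs 48 cases and 96 controls: fewer cases, but more subjects overall.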
Cohort studies
Estimating relative risk with specified precision
Two of the following should be known:
• Anticipated probability of disease in people exposed to the factor of interest: Pe
• Anticipated probability of disease in people not exposed to the factor of interest: Pc
• Anticipated relative risk: RR
• For a cohort study when RR > 1, the values of both Pc and RR are needed
RR = Pe/Pc
Pe = RR × Pc
• If RR < 1, the values of Pe and 1/RR should be used
$n = Z^2_{1-\alpha/2} \left[ \dfrac{1-P_e}{P_e} + \dfrac{1-P_c}{P_c} \right] \Big/ \left[ \log_e(1-\varepsilon) \right]^2$
• ε = specified relative precision
Example
An epidemiologist is planning a study to investigate the possibility that a certain lung disease is linked with exposure to a recently identified air pollutant. What sample size would be needed in each of the two groups, exposed and not exposed, if the epidemiologist wishes to estimate the relative risk to within 50% of the true value (which is believed to be approximately equal to 2) with 95% confidence? The disease is present in 20% of people who are not exposed to the air pollutant.
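Pe = ?
Pc = 20% (0.20)
RR = 2
Confidence level = 95%
ε = 0.50
RR > 1
Anticipated probability of disease given exposure, Pe = 0.20 × 2 = 0.40
Required sample size in each group:
$n = Z^2_{1-\alpha/2} \left[ \dfrac{1-P_e}{P_e} + \dfrac{1-P_c}{P_c} \right] \Big/ \left[ \log_e(1-\varepsilon) \right]^2$
$= (1.96)^2 \left[ \dfrac{1-0.40}{0.40} + \dfrac{1-0.20}{0.20} \right] \Big/ \left[ \log_e(1-0.50) \right]^2$
= 3.8416 × 5.5 / 0.4805
≈ 44
So about 44 subjects are needed in each group.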
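Because this formula mixes a natural logarithm with two odds-like terms, it is easy to slip in the arithmetic, so a direct evaluation is a useful check. The sketch below assumes the formula as stated for RR > 1; the function name is illustrative. Note that `math.log` is the natural logarithm, matching log_e in the formula.

```python
import math

def n_relative_risk(p_unexposed, rr, eps, z=1.96):
    """Unrounded per-group n to estimate RR to within eps (relative)."""
    p_exposed = rr * p_unexposed
    variance_term = (1 - p_exposed) / p_exposed + (1 - p_unexposed) / p_unexposed
    return z ** 2 * variance_term / math.log(1 - eps) ** 2

raw_n = n_relative_risk(p_unexposed=0.20, rr=2.0, eps=0.50)
print(f"{raw_n:.2f} -> {math.ceil(raw_n)} per group")
```

The bracketed term evaluates to 1.5 + 4.0 = 5.5 and the squared logarithm to about 0.48, which gives roughly 44 subjects per group.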
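Clinical trials
Sample size calculation for Means (per group)
$n = \dfrac{2\sigma^2\,(Z_{\alpha/2} + Z_\beta)^2}{\delta^2}$
Sample size calculation for Proportions (per group)
$n = \dfrac{(Z_{\alpha/2} + Z_\beta)^2 \left[ \pi_1(1-\pi_1) + \pi_2(1-\pi_2) \right]}{(\pi_1 - \pi_2)^2}$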
Example 1
Comparison of two proportions
A placebo-controlled randomized trial proposes to assess the effectiveness of colony stimulating factor in reducing sepsis in premature babies. A previous study has shown the underlying rate of sepsis to be about 50% in such infants around 2 weeks after birth, and a reduction of this rate to 34% would be of clinical importance. Estimate the sample size.
n = the sample size required in each group (double this for the total sample)
π1 = first proportion = 0.50
π2 = second proportion = 0.34
π1 - π2 = size of difference of clinical importance = 0.16
$Z_{\alpha/2}$ depends on the desired significance level = 1.96
$Z_\beta$ depends on the desired power = 0.84
$n = \dfrac{(1.96 + 0.84)^2 \left[ 0.50 \times 0.50 + 0.34 \times 0.66 \right]}{(0.16)^2} = 145.3$
Rounding up, n = 146
Therefore the total sample size is double this, i.e. 292.
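The figure of 146 per group is consistent with the common two-proportion formula that sums the two binomial variances. The sketch below assumes that formula, since it reproduces the result from the listed inputs; the function name is illustrative.

```python
import math

def n_per_group_proportions(pi1, pi2, z_alpha=1.96, z_beta=0.84):
    """Unrounded per-group n for comparing two proportions."""
    variance_sum = pi1 * (1 - pi1) + pi2 * (1 - pi2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (pi1 - pi2) ** 2

raw_n = n_per_group_proportions(0.50, 0.34)
per_group = math.ceil(raw_n)
print(per_group, 2 * per_group)
```

The unrounded value is about 145.3, so 146 babies are needed per group and 292 in total.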
Example 2
Comparison of two means
An RCT has been planned to evaluate a brief psychological intervention in comparison to usual treatment in the reduction of suicidal ideation in hospital patients with deliberate self-poisoning. Suicidal ideation will be measured on the Beck scale; the SD of the scale was 7.7 and a difference of 5 points is considered to be of clinical importance.
where
σ = standard deviation of the primary outcome variable = 7.7
δ = size of difference of clinical importance = 5.0
$Z_{\alpha/2}$ = 1.96
$Z_\beta$ = 0.84
$n = \dfrac{2 \times (7.7)^2 \times (1.96 + 0.84)^2}{(5.0)^2} = 37.2$
Rounding up, n = 38
Therefore the total sample size is double this, i.e. 76.
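The constants 1.96 and 0.84 are rounded quantiles of the standard normal distribution for a two-sided α of 0.05 and 80% power. The sketch below derives them with `statistics.NormalDist` instead of hard-coding them, and assumes the common per-group formula n = 2σ²(Zα/2 + Zβ)²/δ², which reproduces the 38 per group stated above. The function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group_means(sd, delta, alpha=0.05, power=0.80):
    """Per-group n for comparing two means, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    return math.ceil(2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

per_group = n_per_group_means(sd=7.7, delta=5.0)
print(per_group, 2 * per_group)
```

Raising the power to 90% with the same inputs increases the requirement to 50 per group, which shows how strongly the choice of power drives the sample size.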
Sample size adjustment
1. Allowing for response rates and other losses to the sample
n = total number of subjects in each group, not accounting for loss to follow-up
L = loss to follow-up rate
So the adjusted sample size is
$n_{new} = \dfrac{n}{1 - L}$
Example:
If 135 is the estimated total sample size required and a maximum of 20% of recruited subjects are expected to drop out before the study ends, the recruitment target would be
$n_{new}$ = 135/(1 - 0.2) = 168.75
Thus, 169 eligible subjects would need to be recruited in order that at least 135 subjects complete the study
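The adjustment divides by (1 - L) rather than multiplying by (1 + L), and the difference is not trivial. A minimal sketch follows, with an illustrative function name and a simple guard on the loss rate.

```python
import math

def recruitment_target(n_required, loss_rate):
    """Number to recruit so that n_required remain after losses."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    return math.ceil(n_required / (1 - loss_rate))

target = recruitment_target(135, 0.20)
naive = math.ceil(135 * 1.20)     # adding 20% instead of dividing by 0.8
print(target, naive, naive * 0.8)
```

Recruiting 169 subjects leaves at least 135 completers after a 20% loss, whereas simply adding 20% gives 162, of whom only about 130 would be expected to complete.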
2. Adjustment for unequal group size
First calculate n, the per-group sample size assuming an equal number per group
If n1/n2 = k, then
n2 = ½ n(1 + 1/k)
n1 = ½ n(1 + k)
Example:
A placebo-controlled trial requires a total sample size of 120 if the two groups are of equal size. It is decided that twice as many subjects will be randomized to the active treatment group as to the placebo group. What will be the sample size of the active treatment group?
n1/n2 = k = 2
n = 120/2 = 60
n2 = ½ × 60 × (1 + ½) = 45
n1 = ½ × 60 × (1 + 2) = 90 (= 2 × n2)
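The two adjustment formulae can be wrapped in one helper to show the cost of unequal allocation. This is a minimal sketch with an illustrative function name; n_equal is the per-group size under 1:1 allocation and k = n1/n2.

```python
import math

def unequal_groups(n_equal, k):
    """Return (n1, n2) for allocation ratio k = n1 / n2."""
    n2 = 0.5 * n_equal * (1 + 1 / k)
    n1 = 0.5 * n_equal * (1 + k)
    return math.ceil(n1), math.ceil(n2)

n1, n2 = unequal_groups(n_equal=120 // 2, k=2)
print(n1, n2, n1 + n2)
```

With 2:1 allocation the trial needs 90 and 45 subjects, a total of 135 instead of 120, so unequal allocation increases the overall sample size.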
Nomogram
• Can be used to estimate the sample size requirements for statistical analyses.
• It uses 4 parameters (one of which is fixed):
-- effect size (rho or delta)
-- statistical power
-- alpha (fixed)
-- number of cases
• The sample size or number of cases required is reported for two fixed levels of statistical significance (alpha = .01 or .05)
Various Software for calculating sample size
• Epi Info: http://www.cdc.gov/Epiinfo/
• WinPepi: http://www.brixtonhealth.com/pepi4windows.html
• Power and Sample Size: http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize
• nQuery Advisor: https://www.statcon.de/shop/en/software/design-ofexperiments/nquery-advisor-nterim
• Stata: http://www.stata.com/
• SPSS: http://www.spss.co.in/
Steps for sample size estimation
• Formulation of a research question
• Selection of an appropriate study design, primary outcome measure and level of statistical significance
• Use of the appropriate formula to calculate the sample size
References
• Lwanga SK, Lemeshow S. Sample size determination in health studies: A practical manual. 1st ed. Geneva: World Health Organization; 1991.
• Zodpey SP, Ughade SN. Workshop manual: Workshop on Sample Size Considerations in Medical Research. Nagpur: MCIAPSM; 1999.
• Zodpey SP. Sample size and power analysis in medical research. Indian J Dermatol Venereol Leprol 2004;70(2):123-28.
• Rao Vishweswara K. Biostatistics: A manual of statistical methods for use in health, nutrition and anthropology. 2nd ed. New Delhi: Jaypee Brothers; 2007.
• Thabane L, East MS, Level P. Sample Size Determination in Clinical Trials. HRM-733 Class Notes; 2004.
• Rao U. Concepts in sample size determination. Indian Journal of Dental Research 2012;23(5):660. http://doi.org/10.4103/0970-9290.107385
• Hazra A, Gogtay N. Biostatistics series module 1: Basics of biostatistics. Indian Journal of Dermatology 2016;61(1):10-20. http://doi.org/10.4103/0019-5154.173988
• Sundaram KR, Dwivedi SN, Sreenivas V. Medical Statistics: Principles and Methods. New Delhi: BI Publications Pvt. Ltd; 2010.