Samples and their Size: The Bane of Researchers

Journal of The Association of Physicians of India ■ Vol. 64 ■ September 2016
Statistics for Researchers
Samples and their Size: The Bane of Researchers
(Part I)
NJ Gogtay, UM Thatte
Introduction
How often have you gone
frantically scouring for a
statistician who will “calculate”
that magical number? The number
that will answer all research
questions - the sample size!
Before we decide its size, let us
understand what we mean by a
“sample” and how to “construct”
it. Ideally, to answer a research
question, like the prevalence of
say diabetes in Mumbai or Delhi or
even India, or whether lung cancer
is associated with smoking, we
should study the entire population
to get the perfect answer. This
is obviously not practical. So we
choose a sample, study it, and then
extrapolate the results to predict
the results in the population. A
common enough example that
we see in our day-to-day lives
is the exit polls run by agencies
during elections. Not all the voters
are polled – instead a well-planned “representative” sample
is chosen to predict the outcomes
of the elections. Apart from being
impractical, the non-statistical
challenges of working with
whole populations include cost,
manpower and time considerations.
It is also unethical to study an entire
population when a representative
subset of the population (sample)
can easily and accurately answer
the research question.
Characteristics of a Good
Sample
Before we discuss how to calculate the sample size (and actually
do this in real life), it is mandatory to understand and apply the
characteristics of a “good” or “appropriate” sample. Thus, a sample
should have all the following characteristics to answer the research
question, namely
i. Be representative of the population to which the results will
eventually be generalized.
ii. Give reliable, valid and accurate results.
iii. Be of an appropriate size – neither so small that you miss the
results nor so large as to be unethical, impractical and unwieldy.
In other words, the sample should be sufficiently large to provide
statistical stability or reliability.
Types of Sampling
Techniques
How an individual patient is chosen to be in the sample [or not]
must be defined when the research study is planned. Broadly, there
are two types of sampling methods (Probability and Non-probability
techniques), each having multiple subtypes with their own
advantages and disadvantages. The choice of sampling technique
will again depend on the research question, the population you
want to study and the feasibility of doing a particular type of
sampling.
1. Non-Probability Sampling: When members of a population do not
have an equal chance of being selected in a study, this is called
non-probability sampling. It is useful for pilot studies, proof of
concept studies, qualitative research, and hypothesis-generating
studies. The advantages of this technique are that it is relatively
quick, easy and inexpensive to use. Also, it is possibly the only
sampling technique that can be used for a particular research
question or population (as when studying IV drug use in
adolescents or sexual behavior in school going children, among
others), as the sensitive nature of the research question would
mean that these participants would necessarily need to be located
through methods that are “non random” in nature.
The main disadvantage of this technique is that results obtained
from this type of sample may not be amenable to generalization,
as the sample is less representative of the population. Most
importantly, it is prone to sampling bias, as the researcher may
deliberately choose the individuals who will participate in the
study and thus shift the results toward what he/she desires. When
non-probability sampling has been used, it is necessary to describe
in the final manuscript how the sample differed from an ideal
sample that should have been randomly selected. It is also useful
to describe the characteristics of the individuals who might have
been left out during the selection process or the individuals who
are overrepresented in the sample. These are limitations of a study
and must be stated.
Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai- 400012
Received: 08.08.2016; Accepted: 10.08.2016
Non-probability sampling
methods include:
i. Convenience sampling
ii. Sequential sampling
iii. Quota sampling
iv. Judgmental sampling and
v. Snowball sampling
Quota sampling is a technique
where the sample chosen
has the same proportions of
individuals with characteristics
under study as the entire
population. Thus, for example,
if we want to study the level of
autonomy amongst men and
women in the city of Mumbai,
we first identify the proportion
of men and women in the
general population in the city
(say from the census data) and
then choose a sample that has
the same proportions to answer
our research question.
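The quota idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the census shares (54% men, 46% women), the list of respondents and the function name `quota_sample` are all hypothetical and are not taken from the article. Respondents are accepted in the order they turn up until each quota is full, which is what keeps this a non-probability method.

```python
def quota_sample(people, group_of, population_shares, n):
    """Fill per-group quotas that mirror the population proportions."""
    # Quota for each group = its (assumed) population share times n, rounded.
    quotas = {g: round(n * share) for g, share in population_shares.items()}
    sample = []
    for person in people:  # taken in order of arrival, not at random
        g = group_of(person)
        if quotas.get(g, 0) > 0:
            sample.append(person)
            quotas[g] -= 1
    return sample


# Hypothetical census shares and a hypothetical stream of respondents.
census_shares = {"men": 0.54, "women": 0.46}
respondents = [
    {"id": i, "sex": "men" if i % 2 == 0 else "women"} for i in range(200)
]

sample = quota_sample(respondents, lambda p: p["sex"], census_shares, n=50)
counts = {g: sum(p["sex"] == g for p in sample) for g in census_shares}
print(counts)  # the sample reproduces the assumed 54:46 split
```

Note that rounded quotas need not add up exactly to n for other shares; a real protocol would state how such remainders are handled.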
When the researcher selects
individuals to be studied
based on their knowledge and
professional judgment, it is
called Judgmental sampling. This
type of sampling technique
is also known as purposive
or authoritative sampling.
The process simply involves
purposively handpicking
individuals from the population
based on the researcher’s
understanding, experience
and judgment. For example, if
a researcher wants to identify
challenges in conduct of clinical
research in academic set-ups in
Mumbai, he would handpick
seasoned researchers from
reputed institutes based on his
knowledge of the researchers.
This method can be used
when one knows that only a
limited number of individuals
possess the trait of interest (in
our example, this would be
knowledge and experience in
clinical research in academic
institutes). Naturally, this technique is limited by the extent
and accuracy of the judgmental capacity of the researcher, and
there is no standardized way to evaluate the reliability of the
expert or the authority.
In convenience sampling, individuals are selected because of their
accessibility to the researcher. It is the most common of all
sampling techniques and, among the non-probability sampling
techniques, researchers prefer it because participants are readily
available, making recruitment faster. A common example of
convenience sampling is using students or patients attending your
clinic as participants.
Sequential sampling occurs when the researcher selects a single
individual or a group of individuals (in the latter case, this is
called group sequential sampling) in a given time frame, performs
the study, and after analysis of the results may select another
group of individuals in a different time frame, if needed. In this
method, the researcher is able to fine tune his research methods
because he can continue to select participants and gain vital
insights into the study. The sample size, n, is not fixed in
advance, nor is the timeframe of data collection. For example, if
the researcher wishes to study coping mechanisms to the social and
curricular stresses of newly admitted medical students, he may
conduct the study in one batch over one year, and then repeat it
in the next year with the next batch of students to gain better
insights with improved tools of investigation.
Snowball sampling, also called
chain referral sampling, is used
when identification of potential
participants is difficult. After
the researcher identifies the
initial participant or informant,
he/she then asks for assistance
from this person to help identify
people with a similar trait of
interest. For example, if you want
to study the HIV prevalence
among men who have sex with men (MSMs), you may opt
to use snowball sampling as
it would be difficult to locate
these participants due to the
sensitive nature of the research
question that may preclude
people from readily agreeing
to make themselves available
for participation. This method
is especially useful to reach
populations, such as MSMs,
or commercial sex workers
(CSWs) who are otherwise
difficult to reach with
traditional sampling methods.
However, the researcher has
limited control on sampling
and has to rely mainly on
the previous participant
and their referral patterns.
Here, more than with other non-probability sampling methods,
representativeness is even less
certain as the researcher has
no idea of the true distribution
of the population and of the
sample. In fact, because we
rely on referrals, it is highly
possible that we end up with
participants all sharing similar
traits and characteristics.
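The referral chain, and the reason it tends to stay within one circle of similar people, can be illustrated with a small sketch. Everything here is hypothetical: the referral network, the participant labels and the function name `snowball_sample` are invented for illustration only. The sketch simply follows referrals outward from one initial informant.

```python
from collections import deque


def snowball_sample(seed, referrals, max_size):
    """Recruit outward from one informant by following referrals."""
    sample, queue, seen = [], deque([seed]), {seed}
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        # Each recruited participant refers further contacts.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample


# Hypothetical referral network: who would refer whom.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "X": ["Y"],  # a separate network that the initial informant never reaches
}

sample = snowball_sample("A", referrals, max_size=10)
print(sample)
```

In this toy network, participants X and Y can never enter the sample because nobody in the first informant's circle refers them, which mirrors the limited control over sampling noted above.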
2. Probability Sampling: When
every individual in a population
has a known, non-zero chance (in the simplest case, an equal chance) of being
selected as a participant, it is
called probability sampling
and is probably the default
method for analytical studies
(like randomized controlled
trials). It is also the appropriate
technique when the study
is designed to estimate the
population parameters,
since it is representative of
the population. This method
ensures randomness or
unpredictability in the choice
of participant selection and
helps minimize selection
bias. This sampling technique
also ensures the validity of
the statistical methods after the
research is completed.
The disadvantages are mainly non-statistical in nature and include
the need for time, resources, finances and manpower.
Probability sampling methods include:
i. Simple Random Sampling
ii. Systematic Sampling
iii. Stratified Sampling
iv. Cluster Sampling
v. Disproportional Sampling
When each member of the population has an equal chance of being
selected as a participant, it is called Simple Random Sampling.
The process of sampling occurs in one step, with each participant
being selected independently of the other potential participants
in the population. The process itself can be done in different
ways, for example, by the lottery method. Alternatively, computer
software can make a random selection of participants from your
population.
One of the most important advantages of this method is that it is
considered to be a “fair” way to select a sample that is closest
to (or most representative of) the population in characteristics,
since every member is given an equal opportunity to be selected.
However, one must have a list of the entire population (N) to be
able to select the sample “randomly” from it. If the population is
very large, this technique becomes cumbersome and inaccurate. For
example, if we want to study the sale of antibiotics at chemists’
shops without prescription in the E ward of Mumbai city, we would
first need to list all the chemists’ shops that exist in the ward
and then choose a random sample, either by the lottery method or
by using a computer generated randomization list.
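A computer generated random selection from a complete list can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of 200 chemists' shops and a sample of 20; the shop labels, the seed and the function name `simple_random_sample` are made up for the example. It uses `random.sample` from the Python standard library, which draws without replacement so that every shop on the list has the same chance of selection.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement from the full list (sampling frame)."""
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: every chemist's shop listed in the ward.
shops = [f"shop_{i:03d}" for i in range(1, 201)]

chosen = simple_random_sample(shops, 20, seed=2016)
print(sorted(chosen))
```

Recording the seed in the protocol is one way to let others regenerate the same randomization list.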
In the systematic sampling method, which is also a random sampling
technique, the researcher first randomly picks the first shop from
the population and then goes on to select every n’th (say every
10th) chemist shop in the E ward so as to cover the desired sample.
This is a relatively easy process and can be done manually. The
n’th number has to be fixed and decided a priori. This process of
obtaining the systematic sample is much like an arithmetic
progression and gives us an adequate representation of the
population unless, in very rare instances, certain traits of the
population are repeated for every n’th unit. In our example,
suppose in addition we are also studying the most expensive drugs
sold at the chemists’ shops. If every 10th shop is located near a
hospital which is an oncology or critical care hospital, we would
get skewed results. However, this is highly unlikely to happen in
real life.
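The “arithmetic progression” nature of a systematic sample can be seen in a short sketch. The list of 200 shops, the interval of 10 and the function name `systematic_sample` are assumptions made only for illustration. A random start is picked among the first 10 shops and every 10th shop is taken from there.

```python
import random


def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position 0 .. k-1
    return frame[start::k]


# Hypothetical ordered list of chemists' shops in the ward.
shops = [f"shop_{i:03d}" for i in range(1, 201)]

chosen = systematic_sample(shops, k=10, seed=1)
positions = [shops.index(s) for s in chosen]
print(positions)  # consecutive positions differ by exactly k
```

Because the selected positions are start, start + 10, start + 20 and so on, any trait that repeats every 10th shop would be either entirely included or entirely missed, which is the periodicity problem described above.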
When the entire population
is divided into different
subgroups or strata, and the final
subjects are then randomly selected
proportionally from
the different strata, this is called
the stratified sampling technique.
For example, suppose you want to study
the sleeping patterns among
resident medical officers
of an academic institute. A
simple random sample may be
representative of residents, but
it is possible that we do not get
adequate numbers of residents
of each of the specialities
represented. Hence, if we
want to study them speciality
wise, which is so important
in this case (i.e. anatomy
residents would be quite
different from gynaecology
and obstetrics residents or
community medicine residents
would differ from surgical
residents) we would need an
adequate number, randomly
selected from each speciality.
It is important that the strata
are non-overlapping and that
simple random sampling is used
within each stratum.
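A proportional stratified sample can be sketched as below. The numbers of residents per speciality (10, 40 and 50), the 20% sampling fraction and the function name `stratified_sample` are hypothetical and chosen only to keep the arithmetic simple. Residents are first grouped into non-overlapping strata and a simple random sample is then drawn within each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Simple random sample of the same fraction within every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:  # each unit falls into exactly one stratum
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample[name] = rng.sample(members, n)
    return sample


# Hypothetical resident list as (speciality, resident number) pairs.
residents = (
    [("anatomy", i) for i in range(10)]
    + [("obgyn", i) for i in range(40)]
    + [("surgery", i) for i in range(50)]
)

sample = stratified_sample(residents, lambda r: r[0], fraction=0.2, seed=7)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)
```

Every speciality is guaranteed a place in the sample, which a single simple random draw of 20 residents from all 100 would not ensure.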
In cluster sampling, the population is divided into groups (called
clusters) of participants that are already clustered in certain
geographical areas (e.g. living in areas of poor hygiene and
sanitation) or time frames (e.g. born during the world war), with
each cluster taken to represent the population, and a sample is
taken from the selected groups. For example, if we wanted to study
the herd immunity patterns for typhoid after vaccination, it would
be appropriate to make geographic clusters of the city and then
sample the selected clusters for the outcome. It is important here
that we make the distinction between stratified sampling and
cluster sampling, which appear similar. The former involves
division of the population into strata according to one or more
variables that are in turn related to the variables that we are
interested in. Each stratum is then sampled. In cluster sampling,
the population is divided into clusters, but only a few clusters
are actually studied. The error thus is much lower with stratified
sampling, as variation within the cluster/s chosen will be
apparent only after the sample is chosen. Often, both sampling
techniques are employed together.
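The contrast with stratified sampling can be made concrete with a small sketch in which only a few clusters are chosen at random. The six geographic clusters, their residents and the function name `cluster_sample` are hypothetical. As a simplifying assumption, everyone in a chosen cluster is studied (a one-stage design); a further random sample within each chosen cluster would be an equally valid variant.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters; units in other clusters are not studied."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical geographic clusters of a city, five residents in each.
wards = {
    f"ward_{c}": [f"{c}_resident_{i}" for i in range(5)] for c in "ABCDEF"
}

picked = cluster_sample(wards, n_clusters=2, seed=3)
print(sorted(picked))
```

Here the random step acts on clusters rather than on individuals, whereas in the stratified design every stratum contributes participants.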
Table 1: Probability and non-probability sampling techniques

Probability sampling                                 Non-probability sampling
Information about the entire population available    Information about the entire population not available
Random                                               Non random
Generalizable                                        Not generalizable
Expensive, time-consuming                            Less expensive, more convenient
Less prone to bias and sampling errors               Prone to bias and sampling errors

Disproportional sampling is
a type of stratified random
sampling wherein the size of
the sample from each stratum
is “disproportional” to the
size of that stratum within
the population. This usually
stems from the fact that the
sizes of the strata are unequal
to begin with. With this type
of sampling, some strata are
then “oversampled” relative to
other strata to achieve balance.
Let us say, for example, that in
a medical college, a 1st MBBS
class has 140 girls and 60 boys.
A researcher needs to study
20% of the class, i.e. 40 students,
to answer a particular research
question. If 20% of girls and
20% of boys were to be chosen
by him, this would lead to 28
girls and 12 boys respectively
being chosen, and thus an
under-representation of boys.
Instead, if he were to choose 20
girls and 20 boys, this becomes
disproportional sampling.
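The arithmetic of this example can be checked with a few lines of Python. The class sizes and the sample of 40 come from the example above; the variable names are illustrative. The final step, computing how many classmates each sampled student stands for, is an addition not discussed in the article: it is the usual weighting applied when unequal sampling fractions are used and class-wide estimates are needed.

```python
# Class composition and sample size from the worked example.
strata = {"girls": 140, "boys": 60}
total = sum(strata.values())  # 200 students
sample_size = 40              # 20% of the class

# Proportional allocation: the same 20% taken from each stratum.
proportional = {g: sample_size * n // total for g, n in strata.items()}

# Disproportional allocation: equal numbers from each stratum.
equal = {g: sample_size // len(strata) for g in strata}

# Fraction of each stratum actually sampled under the equal allocation.
sampling_fraction = {g: equal[g] / strata[g] for g in strata}

# Assumption beyond the article: weight = students represented per sampled student.
weight = {g: strata[g] / equal[g] for g in strata}

print(proportional, equal, sampling_fraction, weight)
```

With 20 students from each group, boys are sampled at a higher fraction than girls, which is exactly the “oversampling” described above.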
The choice of the sampling
method or strategy to be used
for a particular study should
be defined in advance in
the research protocol. Since
multiple factors impact the
choice of the sampling method,
it is best to list potential
sampling strategies in advance
and then use the one method
that is the most likely to answer
the study objectives. Table 1
summarises the differences
between the two methods.