An Analysis of Equally Weighted and Inverse
Probability Weighted Observations in the
Expanded Program on Immunization (EPI)
Sampling Method
BY
MARIA REYES, H.B.Sc.
A thesis
submitted to the Department of Mathematics & Statistics
and the School of Graduate Studies
of McMaster University
in partial fulfilment of the requirements
for the degree of
Master of Science
© Copyright by Maria Reyes, September 2016
All Rights Reserved
Master of Science (2016)
(Mathematics & Statistics)
McMaster University
Hamilton, Ontario, Canada
TITLE: An Analysis of Equally Weighted and Inverse Probability Weighted Observations in the Expanded Program on Immunization (EPI) Sampling Method
AUTHOR: Maria Reyes, H.B.Sc., Statistics & Economics, University of Toronto, Canada
SUPERVISORS: Dr. Román Viveros-Aguilera, Dr. Harry Shannon
NUMBER OF PAGES: xx, 144
To my parents, Alberto and Elizabeth, for all the sacrifices they have made to get
me through school, from kindergarten to now my master's, and to my brother,
Francis, for our conversations, which provided a nice relief from work.
Abstract
Performing health surveys in developing countries and humanitarian emergencies can
be challenging work because the resources in these settings are often quite limited and
information needs to be gathered quickly. The Expanded Program on Immunization
(EPI) sampling method provides one way of selecting subjects for a survey. It involves
having field workers proceed on a random walk guided by a path of nearest household
neighbours until they have met their quota for interviews. Due to its simplicity, the
EPI sampling method has been utilized by many surveys. However, some concerns
have been raised over the quality of estimates resulting from such samples because
of possible selection bias inherent to the sampling procedure.
We present an algorithm for obtaining the probability of selecting a household
from a cluster under several variations of the EPI sampling plan. These probabilities
are used to assess the sampling plans and compute estimator properties. In addition
to the typical estimator for a proportion, we also investigate the Horvitz-Thompson
(HT) estimator, an estimator that assigns weights to individual responses. We conduct our study on computer-generated populations having different settlement types,
different prevalence rates for the characteristic of interest and different spatial distributions of the characteristic of interest.
Our results indicate that within a cluster, selection probabilities can vary widely
from household to household. The largest probability was over 10 times greater than
the smallest probability in 78% of the scenarios that were tested. Despite this, the
properties of the estimator with equally weighted observations (EQW) were similar to
what would be expected from simple random sampling (SRS) given that cases of the
characteristic of interest were evenly distributed throughout the cluster area. When
this was not true, we found absolute biases as large as 0.20. While the HT estimator
was always unbiased, the trade-off was a substantial increase in the variability of the
estimator, with the design effect relative to SRS reaching a high of 92.
Overall, the HT estimator did not perform better than the EQW estimator under EPI sampling, and it involves calculations that may be difficult to do for actual
surveys. Although we recommend continuing to use the EQW estimator, caution
should be taken when cases of the characteristic of interest are potentially concentrated in certain regions of the cluster. In these situations, alternative sampling
methods should be sought.
Keywords: Expanded Program on Immunization, household surveys, spatial sampling, selection probabilities, Horvitz-Thompson estimator
Acknowledgments
I would like to extend my sincerest gratitude to my supervisors, Dr. Román Viveros-Aguilera and Dr. Harry Shannon, who were so generous with their time and always
willing to share their wealth of knowledge with me. I could not have asked for better
mentors and role models to guide me as I worked on this thesis. I am indebted to
them for getting me to the finish line.
A special thanks to Dr. Gregory Pond who took part in my defense committee.
I thank him for his valuable feedback and questions which made me realize what
it means to be a statistical consultant. I also wish to thank Dr. Narayanaswamy
Balakrishnan for his encouragement, Dr. Ben Bolker for his research suggestions
and R coding advice, Dr. Patrick Emond, Dr. Ick Huh, and Atinder Bharaj for
the inspiring discussions at our group meetings, and Kenneth Moyle for his help
with using the department computer servers. This thesis was funded in part by
the Canadian Institutes of Health Research (CIHR) and by the Ashbaugh Graduate
Scholarship.
I am grateful to Dr. Alison Weir, Dr. Gordon Anderson, Dr. Jerry Brunner, Dr.
Christine Lim, and Asal Aslemand for instilling in me an appreciation for statistics.
They introduced me to a new world, and I would not have pursued graduate studies
if it were not for them.
I would also like to recognize my parents, Alberto and Elizabeth, my brother
Francis, and the following individuals who offered me their support in various ways:
E. Alamer, E. Anthonipillai, J. Begum, T. Bekiri, A. Bhatti, S. Birchall, K. Biswas,
J. Buckley, S. Caetano, F. Choi, S. Dionyssiou, J. Francis, Q. Gao, C. Glanville, P.
Gonyeau, E. Gretchko, S. Hogan, T. Jacques, S. Jana, P. Jevtic, L. Jin, R. Kampo,
P. Keown, K. Kim, J. La Rosa, C. Lambeck, M. Li, M. Mendes, J. Pancratius, J.
Posada, S. Reiter, S. Sexton, J. Shiels, T. Tan, D. Venditti, and Y. Yang.
Most importantly, I thank God for blessing me with this incredible opportunity
to grow and learn, and I thank Our Lady of Fatima, St. Dymphna, St. Joseph of
Cupertino, St. Joseph the Worker, St. Anthony, and the many other saints whose
intercession gave me the strength to persevere through the difficult times.
Abbreviations
The number following the entries refers to the section in which the term was introduced.
General Abbreviations
DE     Design effect, 3.1
EPI    Expanded Program on Immunization, 1.1
EQW    Equally weighted estimator, 7.1.3
HT     Horvitz-Thompson estimator, 7.1.3
HTR    Horvitz-Thompson estimator with restriction, 7.1.3
MSE    Mean square error, 4.1
PPS    Probability proportional to size, 2.1.2
PSU    Primary sampling unit, 2.1.2
ROH    Rate of homogeneity, 3.1
SRS    Simple random sample, 2.1.1
SSU    Secondary sampling unit, 2.1.2
StRS   Stratified random sample, 4.1
SyRS   Systematic random sample, 8.2
WHO    World Health Organization, 1.1
Spatial Distribution of Households
loc reg    A population with regularly spaced households on a grid, 6.1.1
loc sqr    A population with randomly placed households over a square area, 6.1.1
loc rec    A population with randomly placed households over a rectangular area, 6.1.1
loc agg    A population where households aggregate around several randomly placed focal points, 6.1.1
loc cgr    A population where household density increases towards the centre of the population area, 6.1.1
Spatial Distribution of Target Variable
val rdm    A population where the characteristic of interest is assigned to households with equal probability, 7.1.1
val spk    A population where the characteristic of interest is assigned to small pockets of households, 7.1.1
val lpk    A population where the characteristic of interest is assigned to large pockets of households, 7.1.1
val cgr    A population where the characteristic of interest is more likely to be assigned to households close to the centre of the population area, 7.1.1
val dgr    A population where the characteristic of interest is more likely to be assigned to households close to the southwest corner of the population area, 7.1.1
val hgr    A population where the characteristic of interest is more likely to be assigned to households close to the west edge of the population area, 7.1.1
Sampling Method
nosec k1    An EPI procedure that uses a sector with angle span 2π rad and does not skip neighbours, 6.1.2
api08 k1    An EPI procedure that uses a sector with angle span π/8 rad and does not skip neighbours, 6.1.2
api32 k1    An EPI procedure that uses a sector with angle span π/32 rad and does not skip neighbours, 6.1.2
nosec k3    An EPI procedure that uses a sector with angle span 2π rad and selects every third neighbour for the sample, 6.1.2
api08 k3    An EPI procedure that uses a sector with angle span π/8 rad and selects every third neighbour for the sample, 6.1.2
api32 k3    An EPI procedure that uses a sector with angle span π/32 rad and selects every third neighbour for the sample, 6.1.2
Other
rad    radians, 5.1
Notation
The number following the entries refers to the section in which the notation was
introduced. The notation stated here only applies to Chapters 5 and beyond.
Single Cluster Population
$i$         Index used to label households, 5.1
$N$         Number of households in a population, 5.1
$H$         Set of all households in a population, 5.1
$x_i$       x-coordinate of the location of household $i$, 5.1
$y_i$       y-coordinate of the location of household $i$, 5.1
$r_i$       Distance of household $i$ from the centre of the population area, 5.1
$\gamma_i$  Angle made when moving counterclockwise from the positive x-axis to household $i$ (measured in radians), 5.1
$\Gamma$    Set of all household angular coordinates, $\gamma_i$ (see above), 5.1
$d_{ij}$    Distance between household $i$ and household $j$, 5.2
$D$         Matrix of distances where entry $d_{ij}$ represents the distance between households $i$ and $j$ (entries along the main diagonal are 0), 5.2
$Z_i$       A random variable equal to 1 if household $i$ has the characteristic of interest and 0 otherwise, 7.1.1
$z_i$       A realized value of the random variable $Z_i$ (see above), 7.1.1
$p$         Proportion of households in the population with the characteristic of interest, 7.1.1
Sampling
$\theta$             Angle made by moving counterclockwise from the positive x-axis to the centre of a sector; also referred to as the direction of the sector (measured in radians), 5.1
$\Theta$             Set of all sector directions such that the sector associated with $\theta$ (see above) has at least one household, 5.1
$L_\Theta$           Length of interval unions in the set $\Theta$, 5.1
$\alpha$             Angle span of a sector (measured in radians); also referred to as the size of a sector, 5.1
$n_{sec}$            Number of households in a sector (depends on the location of households in the population as well as the direction and angle span of the sector), 5.1
$l$                  Index used to label ordered selections, 5.2
$H_l$                Set of households from the population that were not chosen in the first $l - 1$ selections, 5.2
$U_l$                A random variable representing the household chosen at the $l$th selection, 5.2
$u_l$                A realized value of the random variable $U_l$ (see above), 5.2
$\mathbf{u}_l$       A vector listing all the households chosen up to the $l$th selection in the order that they were selected, 5.2
$n_{nn}$             Number of nearest neighbours (depends on the last chosen household and the households from the population that are not already in the sample), 5.2
$k$                  Every $k$th neighbour along an EPI path is added to the sample, 5.4
$S$                  Set of sampled households, 5.2
$n$                  Number of households sampled from the population, 5.2
$\pi_i$              Probability that household $i$ is included in the selected sample; also referred to as the inclusion probability or probability of selection for household $i$, 5.3
$\pi_{ij}$           Probability that households $i$ and $j$ are both included in the selected sample, 5.3
$\boldsymbol{\pi}$   Matrix of inclusion probabilities where entry $\pi_{ij}$ represents the probability that households $i$ and $j$ are both included in the selected sample (entries along the main diagonal represent the inclusion probabilities for individual households), 5.3
$w_i$                Sampling weight for household $i$ equal to the inverse of its inclusion probability (see $\pi_i$), 5.3
$H_i$                A random variable equal to 1 if household $i$ appears in the sample and 0 otherwise, 7.1.3
Estimation
$\hat{p}_{EQW}$    Equally weighted estimator for a proportion, 7.1.3
$\hat{p}_{HT}$     Horvitz-Thompson estimator for a proportion, 7.1.3
$\hat{p}_{HTR}$    Restricted Horvitz-Thompson estimator for a proportion, 7.1.3
Contents
Abstract
Acknowledgments
Abbreviations
Notation
1 Data Collection for Health Surveys
    1.1 Census vs. Sample
    1.2 Scope of Thesis
2 The Expanded Program on Immunization (EPI) Sampling Method
    2.1 Preliminary Sampling Theory
        2.1.1 Simple Random Sampling
        2.1.2 Two-Stage Cluster Sampling
    2.2 Development and Use of the EPI Sampling Method
        2.2.1 Other Applications
3 Procedures for Estimating a Population Proportion
    3.1 Sample Size Determination
    3.2 Point Estimator
        3.2.1 Expected Value
        3.2.2 Variance
4 Past Simulations of the EPI Method
    4.1 Simulation Design
    4.2 Simulation Results
5 Computation of Household Inclusion Probabilities
    5.1 Probability of Selecting the First Household
    5.2 Probability of a Sample of Households
    5.3 Inclusion Probabilities for Individual Households and Pairs of Households
    5.4 Other Versions of EPI Sampling
    5.5 Additional Notes: Permutations of Household Selections
6 Household Inclusion Probabilities in Simulated Populations
    6.1 Simulation Design
        6.1.1 Generation of Populations
        6.1.2 Sampling Plans
    6.2 Simulation Results
    6.3 Additional Notes: Relations between Inclusion Probabilities
7 Estimator Properties in Simulated Populations
    7.1 Simulation Design
        7.1.1 Generation of Populations
        7.1.2 Sampling Plans
        7.1.3 Estimation of Population Proportion
        7.1.4 Evaluation of Estimators
    7.2 Simulation Results
    7.3 Additional Notes: Variance of the Horvitz-Thompson Estimator
8 Summary, Discussion and Future Directions
    8.1 Summary and Discussion
    8.2 Future Directions
A Tables of Simulation Results
B Partial R Code
    B.1 Packages Used
    B.2 General Parameters
    B.3 Main Functions
        B.3.1 Simulation of EPI Sampling
        B.3.2 Computation of Inclusion Probabilities
        B.3.3 Estimation of Population Proportion
    B.4 Other Functions Created
List of Tables
2.1 Selection of villages using systematic PPS sampling.
4.1 Comparison of simulation designs from past EPI studies.
6.1 Minimum and maximum number of households in non-empty sectors and proportion of values in [0, 2π) rad that correspond to the directions of empty sectors for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.1 Cases of the characteristic of interest added or removed from populations after the initial population generation procedure to attain a certain proportion of households with the characteristic of interest.
7.2 Range of properties for EQW, HT and HTR estimators across 13 500 simulated scenarios.
A.1 Properties of household inclusion probabilities for multiple realizations of a household spatial distribution type.
A.2 Bias of EQW, HT and HTR estimators; n = 7.
A.3 Bias of EQW, HT and HTR estimators; n = 30.
A.4 Variance of EQW, HT and HTR estimators; n = 7.
A.5 Variance of EQW, HT and HTR estimators; n = 30.
A.6 Mean square error of EQW, HT and HTR estimators; n = 7.
A.7 Mean square error of EQW, HT and HTR estimators; n = 30.
A.8 Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators; n = 7.
A.9 Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators; n = 30.
List of Figures
2.1 Illustration of household selection using the EPI method.
5.1 Population of N = 25 households.
5.2 Illustration of a sector.
5.3 Directions corresponding to non-empty sectors.
5.4 Computation of the probability of the first sampled unit.
5.5 Computation of path probabilities conditional on the first sampled unit.
5.6 Illustration of an EPI path with skipped neighbours.
6.1 Spatial distributions of households in the simulation study; N = 150.
6.2 Number of possible household samples that can be drawn from loc reg, loc sqr, loc rec, loc agg, and loc cgr.
6.3 Household inclusion probabilities for loc reg.
6.4 Household inclusion probabilities for loc sqr.
6.5 Household inclusion probabilities for loc rec.
6.6 Household inclusion probabilities for loc agg.
6.7 Household inclusion probabilities for loc cgr.
6.8 Boxplot of household inclusion probabilities for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
6.9 Correlation between household inclusion probability and household distance from the centre of the population area for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.1 Spatial distributions of the target variable in the simulation study; p = 0.50.
7.2 Boxplot of household sampling weights for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.3 Histogram of estimator bias across 13 500 simulated scenarios.
7.4 Histogram of estimator variance and design effect across 13 500 simulated scenarios.
7.5 Bias of EQW, HT and HTR estimators.
7.6 Variance of EQW, HT and HTR estimators.
7.7 Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators.
8.1 Illustration of household selection using alternative sampling methods.
Chapter 1
Data Collection for Health Surveys
Surveys play an important role in the management of public health. As Bostoen
et al. (2007) aptly expressed, “health surveys are the stethoscope, thermometer and
pressure gauge of global health.” Governments, disease control programs, humanitarian agencies, and local health administrators need to know relevant characteristics
of the population they are serving to do an effective job. Without this information,
there is no objective basis on which to judge what initiatives should take priority
or how much work needs to be done. The absence of reliable information makes it
difficult to adequately assess the impact of programs and policies or plan for the
future.
1.1 Census vs. Sample
Cross-sectional studies are one way to get a picture of a population’s overall health
status. In this type of study, investigators examine the characteristics of a population
M.Sc. Thesis - Maria Reyes
McMaster - Mathematics & Statistics
at a single point in time (Levy and Lemeshow, 2008). Data from individuals or
households may be gathered using a descriptive survey. Examples of health data
include body measurements, disease status, living conditions, and feeding practices.
These surveys also ask about demographic information. The primary aim here is
to measure variables of interest and to summarize the data as means, proportions,
or totals, but results may also be used to test hypotheses or check for relationships
between variables (Levy and Lemeshow, 2008). A census is clearly preferred for
informational purposes; however, it may be unrealistic to interview all members of
the population due to time and budget constraints.
When a subset of the population is selected and carefully surveyed, data obtained
from this sample can provide useful insights about the population at a fraction of
the cost associated with a census. Nevertheless, conducting such a survey is still a
large undertaking.
The work that goes into a survey is more than just creating a questionnaire and
recruiting participants. The planning stage involves making sure that the population
to be sampled matches the target population and deciding what degree of precision
is needed for the results. It also involves determining how the sample will be taken
and how estimates will be computed afterwards. These decisions are not always
straightforward. There is also a host of other tasks to consider, from training
field workers and performing a pre-test of the survey to preparing the data for analysis
(Cochran, 1977).
Serious thought must be put into every step of the process to maintain the integrity of the results. After all, the results may be used to guide decisions about
where to set up health care facilities, how to combat the spread of disease, and what
supplies to distribute during an emergency.
The Expanded Program on Immunization (EPI) sampling method is a technique
that the World Health Organization (WHO) has traditionally used for many of its
surveys (World Health Organization, 2008). It is particularly well suited for surveys
that make contact with respondents by visiting households because sampling is accomplished by having the field staff go from neighbour to neighbour (or the nearest
household that has not yet been selected) until the required sample size is reached.
Initially, the method was implemented to estimate the immunization levels in a population, but since then, it has been adapted for a variety of uses and implemented
for surveys outside of WHO.
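
The neighbour-to-neighbour rule described above can be sketched in a few lines of Python. This is only an illustration and not the thesis code, which is written in R. The function name `nearest_neighbour_path`, the household coordinates, and the rule of breaking distance ties by lowest index are assumptions made for the example. The choice of the starting household is taken as given here.

```python
import math


def nearest_neighbour_path(coords, start, n):
    """Return the indices of n households, beginning at `start` and moving
    each time to the nearest household that has not yet been selected."""
    if not 1 <= n <= len(coords):
        raise ValueError("n must be between 1 and the number of households")
    path = [start]
    remaining = set(range(len(coords))) - {start}
    while len(path) < n:
        cx, cy = coords[path[-1]]
        # Nearest unselected household; ties go to the lowest index (assumption).
        nxt = min(
            remaining,
            key=lambda j: (math.hypot(coords[j][0] - cx, coords[j][1] - cy), j),
        )
        path.append(nxt)
        remaining.remove(nxt)
    return path


# Hypothetical household locations (x, y) within one cluster.
households = [(0, 0), (1, 0), (3, 0), (3, 1), (10, 10)]
print(nearest_neighbour_path(households, start=0, n=4))  # [0, 1, 2, 3]
```

Once the first household is fixed, the rest of the path is determined by the household locations alone, apart from ties. This is why the way the first household is chosen matters so much for the selection probabilities.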
1.2 Scope of Thesis
This thesis takes a critical look at the EPI sampling procedure and the associated
formulas for estimating a population proportion. Therefore, we focus on variables
that have a binary outcome. Our objective is two-fold: (1) to compile results from
past studies about EPI sampling and (2) to advance the current body of research by
conducting our own simulations and statistical analyses.
Chapter 2 introduces basic sampling concepts and terminology, which then leads
into a detailed description of the EPI method. We discuss the motivations behind
its development and how the method is performed in the field.
Chapter 3 contains the technical underpinnings of the EPI method. We begin by
giving the rationale for the 30 × 7 sample size. This is followed by the formulas that
have been suggested for estimating the population proportion and estimating the
variance of this estimator when data is collected through EPI sampling. We describe
how the EPI method is different from classical two-stage cluster sampling, and we
use this to explain why the formulas currently used for EPI surveys may be biased.
Chapter 4 is a review of the existing literature relating to computer simulations
of EPI sampling. We compare the simulation methods used in these studies as well
as the findings about the quality of estimation from EPI samples.
In Chapters 5 and 6, we examine the EPI selection procedure from the perspective
of inclusion probabilities. We provide an algorithm to compute the exact probability
that a household is included in a sample that is drawn from a single cluster. This
algorithm is applied to computer-generated populations, and we investigate to what
extent units from the same cluster are selected with equal probability. With this
information, we try to uncover the factors that cause certain households to have a
high or low chance of being selected.
In Chapter 7, we shift the focus back to the estimation of a population proportion. The algorithm presented in Chapter 5 allows us to construct an estimator
that weights observations by the inverse of the probability of selection, otherwise
known as a Horvitz-Thompson (HT) estimator. Chapter 7 compares the traditional
EPI estimator to a weighted estimator for samples at the cluster level. We compute
properties such as bias and variance by obtaining the exact distribution of these
estimators for a given population.
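
To make the contrast between the two estimators concrete, the sketch below computes their exact expected values for a toy sampling design. Everything in it is hypothetical: the four-household population, the sample probabilities, and the function names. The Horvitz-Thompson estimator is written in its standard textbook form for a mean with known population size, which may differ in detail from the definitions given in Section 7.1.3.

```python
from fractions import Fraction as F

# Hypothetical population of N = 4 households; z[i] = 1 marks the characteristic.
z = [1, 1, 0, 0]
N = len(z)

# Hypothetical design: each possible sample of size 2 and its probability.
design = {(0, 1): F(4, 10), (0, 2): F(2, 10), (1, 3): F(1, 10), (2, 3): F(3, 10)}

# Inclusion probability of household i: total probability of samples containing i.
pi = [sum(p for s, p in design.items() if i in s) for i in range(N)]


def eqw(s):
    """Equally weighted estimator: the plain sample proportion."""
    return F(sum(z[i] for i in s), len(s))


def ht(s):
    """Horvitz-Thompson estimator: observations weighted by 1 / pi_i."""
    return sum(F(z[i]) / pi[i] for i in s) / N


def expectation(estimator):
    """Exact expected value of an estimator over the whole design."""
    return sum(p * estimator(s) for s, p in design.items())


print(expectation(eqw))  # 11/20, so the bias is 1/20 against the true 1/2
print(expectation(ht))   # 1/2, unbiased
```

Because the full distribution of samples is enumerated, bias is computed exactly and not estimated by repeated random draws.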
Finally, we end with Chapter 8, where we summarize the results, address limitations of the study, and give a brief discussion of how the study may be extended.
Chapter 2
The Expanded Program on Immunization (EPI) Sampling Method
2.1 Preliminary Sampling Theory
A proper sample design is central to executing a successful survey. The sample
design encompasses the way elements are drawn from the population (known as the
sampling plan), as well as the formulas to estimate population parameters (Hansen
et al., 1953). The elements of a population may be persons or entire household units.
They represent the most basic level at which measurements are recorded. In other
words, it is their characteristics that are being analyzed in the survey. Assuming
that the measurements recorded are correct, the only uncertainty in the observations
comes from the sampling itself. Naturally, the amount of sampling error will depend
on how the units are selected and the sample size. The estimation procedures used
after the sample has been extracted are equally important. They must be built
around the sampling plan. The ultimate goal is to obtain results from the sample
that are representative of the greater population.
2.1.1 Simple Random Sampling
Simple random sampling (SRS) is a classic sampling plan. It involves enumerating all
the elements in the population and then choosing the elements according to random
numbers. The elements could be selected with or without replacement. In practice,
health surveys typically sample without replacement so that once someone is picked,
they cannot be picked again. A more defining aspect of SRS is that the sample
obtained could be any possible combination of elements from the population and all
of these combinations are equally likely to be observed (Lohr, 2009). As a result,
each element has the same chance of being sampled as any other element in the
population. Because this probability is non-zero and can be calculated, SRS is
a probability sample (UNICEF, 2010). Sampling plans that are based on a
probability sample are desirable because they allow for a proper statistical analysis
of the properties of an estimator.
When SRS is performed, the mean of the observed values in the sample is taken
as an estimate of the population mean. It can be shown that this design yields
an unbiased estimator (Hansen et al., 1953). This is a favourable property because
although the mean of a sample may be different from the population mean, there is
at least the assurance that if sampling were repeated on the same population
an infinite number of times, then the average of the sample means would be equal to
the population mean. Therefore, we would not consistently overestimate the mean
or underestimate it. A proportion is just a special case of a mean. Derivation of
the expected value and variance of this estimator is possible because the inclusion
probability of the sampling units and the inclusion probability of pairs of sampling
units are known in this case.
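
The unbiasedness property can be checked directly on a small example by listing every possible sample. The sketch below is an illustration with a hypothetical population of five binary values. It enumerates all samples of size 2 drawn without replacement, treats them as equally likely, and compares the average of the sample means with the population mean.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 0, 0, 1, 1]  # hypothetical binary values; true proportion is 3/5
n = 2                         # sample size

# Under SRS without replacement, every combination of n elements is equally likely.
samples = list(combinations(range(len(population)), n))
sample_means = [Fraction(sum(population[i] for i in s), n) for s in samples]

expected_value = sum(sample_means) / len(samples)
population_mean = Fraction(sum(population), len(population))

# Share of samples that contain each element, i.e. its inclusion probability.
inclusion = [
    Fraction(sum(i in s for s in samples), len(samples))
    for i in range(len(population))
]

print(expected_value, population_mean)  # 3/5 3/5
print(inclusion[0])                     # 2/5, which equals n/N for every element
```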
SRS is conceptually easy to understand, but logistical factors make it challenging
to apply in the field. The population being studied by most health surveys is spread
across a large geographic region. If interviews must be conducted in person, this
means that the field worker may have to travel great distances to reach a place
where only one or two people have been selected for the survey.
2.1.2 Two-Stage Cluster Sampling
A less costly option is to use cluster sampling. In two-stage cluster sampling, members of the population are split into convenient groupings such as districts, towns
or city blocks. The clusters could take any form as long as they cover the whole
population and do not overlap (that is, they are exhaustive and mutually exclusive). The procedure begins by taking a sample of clusters. For this reason, the clusters in this
design are also referred to as primary sampling units (PSUs). The sampling could be
done with probability proportional to size (PPS), which means that larger clusters
(as measured in terms of the cluster population) are more likely to be included in
the sample than smaller clusters (Lemeshow and Robinson, 1985). The advantage of
this is that if SRS (or some other equal probability sampling technique) is used to
select the same number of elements from each of the sampled clusters, the formula
for an unbiased estimator remains simple. The estimator in cluster sampling is typically more variable than the same estimator in regular SRS (Levy and Lemeshow,
2008). To achieve the same precision, additional population elements may have to
be surveyed. However, despite the increase in sample size, cluster sampling may still
be more economical than SRS because individuals are being sampled in groups.
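
One way to see why the estimator stays simple under PPS with an equal take per cluster is to compute each element's overall chance of selection. The sketch below rests on a stated assumption: the probability that a cluster of size M_i is sampled is taken to be m * M_i / M, which holds for systematic PPS selection of m clusters when no cluster is larger than the sampling interval. The cluster sizes are those of the villages in Table 2.1, and the take of 7 elements per cluster is a hypothetical choice.

```python
from fractions import Fraction

sizes = [99, 212, 127, 136, 124, 91, 87, 106, 82, 108]  # village populations, Table 2.1
m = 3      # number of clusters sampled
take = 7   # elements drawn by SRS within each sampled cluster (hypothetical)
total = sum(sizes)

# Overall probability = P(cluster sampled) * P(element sampled | cluster sampled).
overall = [Fraction(m * size, total) * Fraction(take, size) for size in sizes]

print(set(overall))  # a single value: the cluster size cancels out
```

The cluster size cancels, so every element in the population has the same overall probability, m * take / M. The sample is then self-weighting and an unweighted sample proportion can be used.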
The problem with performing SRS in the final stage of cluster sampling is that,
before it can be done, all eligible subjects in the selected clusters, known as secondary sampling units (SSUs), must be enumerated to construct a sampling frame.
Sometimes, records may be unavailable or they may be inaccurate. While they could
be updated, there might not be the time and resources to do this. This situation
is frequently encountered in developing countries and in communities affected by
natural disaster or war. Yet, as Bostoen et al. (2007) emphasized, it is exactly in
these settings where there is a great need to do surveys and obtain reliable information. Therefore, there is strong motivation to find alternative within-cluster sampling
methods that are quick, affordable, and capable of producing an estimator with
a high degree of accuracy and precision.
2.2 Development and Use of the EPI Sampling Method
In place of SRS, numerous health surveys have opted to use the Expanded Program
on Immunization (EPI) sampling method, especially when difficult field conditions
prevail. A key feature of the EPI method is that its selection rule is based on the
physical distance between the population elements to be sampled. This form of
spatial sampling emerged in the public health literature in the 1960s. Henderson
et al. (1973) introduced the procedure as a way to collect data in West Africa, where
it was hard to obtain complete, up-to-date population registration records. Their aim
was to evaluate the impact of a mass vaccination campaign. At the time, the spread
of smallpox was a serious concern, so gathering information about the population
was an urgent matter. They sampled 67 sites per region and interviewed at least 16
persons per site.

Table 2.1: Selection of villages using systematic probability proportional to size
(PPS) sampling.

Village   Population   Cumulative population
1         99           1-99
2         212          100-311                 ⇒ selected
3         127          312-438
4         136          439-574                 ⇒ selected
5         124          575-698
6         91           699-789
7         87           790-876
8         106          877-982                 ⇒ selected
9         82           983-1064
10        108          1065-1172

This illustration represents a smaller version of the example
given by Henderson et al. (1973). Here, three out of ten villages are sampled. Since the total population across all the
villages is 1172, the sampling interval is 1172/3 ≈ 391. The
number 148 was randomly picked among the numbers between 1 and 391. Hence, the villages containing the 148th,
148 + 391 = 539th, and 539 + 391 = 930th person
are selected for the sample.
The sites were selected according to PPS. A systematic approach was used. This
involved constructing a table with a column identifying the sites, a column of their
respective population sizes, and a column for the cumulative population size. A
sampling interval was calculated by dividing the total population by the number of
sites to be sampled. The first site was determined by generating a random number
between 1 and the sampling interval, then checking where the number fell in the
cumulative population. The other sites were found by adding the sampling interval
successively. See Table 2.1 for an example.
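
The systematic PPS steps above can be sketched in a few lines of Python. This is only an illustration using the village sizes from Table 2.1; the function name `systematic_pps` is hypothetical, and the wrap-around for a hit that overshoots the total (possible because the interval is rounded) is an assumption rather than part of the published procedure.

```python
from bisect import bisect_left
from itertools import accumulate


def systematic_pps(sizes, m, start):
    """Return the 1-based positions of the clusters chosen by systematic PPS."""
    total = sum(sizes)
    interval = round(total / m)              # 1172 / 3 is rounded to 391 in Table 2.1
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the sampling interval")
    upper = list(accumulate(sizes))          # upper end of each cumulative range
    chosen = []
    for k in range(m):
        hit = start + k * interval           # person number to locate
        hit = (hit - 1) % total + 1          # assumption: wrap if rounding overshoots
        # first cluster whose cumulative upper bound reaches the hit
        chosen.append(bisect_left(upper, hit) + 1)
    return chosen


village_sizes = [99, 212, 127, 136, 124, 91, 87, 106, 82, 108]
selected = systematic_pps(village_sizes, m=3, start=148)
print(selected)  # [2, 4, 8]
```

With the random start of 148 used in Table 2.1, the sketch returns villages 2, 4, and 8, which contain the 148th, 539th, and 930th persons.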
The procedure for sampling individuals was the same in each of the selected
sites. It consisted of the following steps. First, field workers went to the centre of
the selected village or town. They picked a random number between 0 and 359,
which represented a direction in degrees (by convention, 0° pointed east, and the
direction increased by moving counterclockwise). The team then enumerated all the
households in this direction. A starting point among these households was established
by picking another random number. After the entire household was surveyed, the
team traveled along the original path, moving away from the centre of the site, until
they came upon another household. They continued in this manner until they had
met their quota of interviewing 16 individuals. If they had reached the edge of the
site before completing the quota, they turned clockwise and selected households by
moving inward. Everyone in the household containing the 16th person was examined
even though the quota was surpassed.
In 1978, the World Health Organization (WHO) adopted a version of this sampling method (Hoshaw-Woodard, 2001). Instructions are provided in the official EPI
coverage survey manual (World Health Organization, 2008). Much like the method
proposed by Henderson et al. (1973), the WHO version uses systematic PPS to sample clusters of households. It identifies the starting household by either choosing
it randomly from a list of households in the cluster or having the field interviewer
walk in a random direction away from the centre of the cluster and randomly choose
one of the households in their path; the direction can be determined by spinning
a pen or a bottle (MacIntyre, 1999). Then neighbouring households are visited to
find additional subjects. However, rather than selecting all subsequent households
by following the path prescribed by the initial direction, the WHO method takes the
next household to be whichever household, not already selected, has a front door
closest to the door of the household that was just left (World Health Organization,
2008). See Figure 2.1 for an illustration.

[Figure 2.1: diagram of a village showing the village centre, a compass (0° = E, 90° = N, 180° = W, 270° = S), the random direction, and households numbered 1 to 7. Legend: EPI path; selected household with eligible child.]

Figure 2.1: Illustration of household selection using the EPI method. The survey
interviewer begins by standing at the centre of the village and choosing a random
direction. Among the households in this direction, one is randomly picked. The
survey interviewer proceeds to visit this household then goes door-to-door until they
have collected data on enough subjects to achieve the desired sample size.
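
The nearest-neighbour rule of the WHO version can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each household is reduced to a single point standing in for its front door, distances are straight-line, ties are broken by household id, and the village data and the name `epi_walk` are hypothetical.

```python
import math


def epi_walk(households, start, quota):
    """Follow the nearest-unvisited-neighbour rule until the quota is met.

    households maps an id to (x, y, eligible), where (x, y) stands in for the
    position of the front door and eligible counts eligible subjects there.
    """
    remaining = set(households)
    visited, found, current = [], 0, start
    while True:
        remaining.discard(current)
        visited.append(current)
        found += households[current][2]      # everyone eligible in the household counts
        if found >= quota or not remaining:
            return visited, found
        here = households[current][:2]
        # nearest household not already selected; ties broken by id (an assumption)
        current = min(remaining, key=lambda h: (math.dist(here, households[h][:2]), h))


# Hypothetical village: id -> (x, y, eligible children)
village = {1: (0, 0, 1), 2: (1, 0, 0), 3: (2, 0, 2), 4: (0, 5, 1), 5: (2.5, 0, 1)}
path, children = epi_walk(village, start=1, quota=2)
print(path, children)  # [1, 2, 3] 3
```

In this example the quota of two children is exceeded at household 3 because all eligible children in the last household are included, which mirrors the handling of the final household in the procedure of Henderson et al. (1973).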
WHO began using this sampling method primarily for surveys relating to the
Expanded Program on Immunization—an action plan to make vaccines available
to all children throughout the world (Lemeshow and Robinson, 1985). Due to this
association, it became known as the EPI sampling method. It has also been referred
to as the 30 × 7 cluster sampling method because of the standard practice to sample
30 clusters and a minimum of seven children per cluster.
By 1982, the EPI method had been used in at least 441 surveys worldwide
(Lemeshow and Robinson, 1985). Ten years later, that number had reached 4502
(Brogan et al., 1994). Examples of surveys which have used the EPI method include those done in The Philippines (Zimicki et al., 1994), The Gambia (Milligan
et al., 2004), Ethiopia (Luman et al., 2007), and Niger (Grais et al., 2007). These
surveys assessed the proportion of children who received vaccines for diseases, such
as diphtheria, hepatitis B, measles, meningitis, pertussis, poliomyelitis, and tetanus.
2.2.1 Other Applications
Although the EPI method was developed so that health managers could monitor vaccination coverage levels, it started being used for other purposes. A natural extension
was to use EPI sampling to measure the proportion of the population affected by a
disease. One of the earliest documented reports looked at diphtheria, measles, pertussis, poliomyelitis, and tetanus in Nepal (Rothenberg et al., 1985). Disease surveys
are often concerned with events that happen with lower frequency in the population
compared to immunization coverage surveys, and a narrower confidence interval may
be required. Cases of disease may also be distributed in pockets especially if the disease is contagious such as measles. Because of the way that households are chosen in
the EPI method, there is the question of whether the EPI method has the tendency
to underestimate or overestimate disease prevalence rates.
Additionally, EPI sampling has been applied in the context of community emergencies. Its use for assessing needs in the aftermath of natural disasters goes back
more than two decades. When Hurricane Andrew hit South Florida in 1992, households were sampled
according to the EPI method (Hlady et al., 1994). In these types of surveys the focus
is on determining what relief operations are required in the affected areas. Some important statistics are the proportion of households without electricity, the proportion
without running water, the proportion without enough food, and the proportion with
residents requiring medical attention (Hlady et al., 1994). Unlike surveys where subjects may not be found in every visited household, the subject here is the household
unit itself. Therefore, the sample obtained from a site may be more geographically
concentrated when the EPI procedure is strictly followed. When some parts of the
site are hit harder by the disaster than others, resulting estimates could be poor.
Besides rapid-onset natural disasters, community emergencies may be triggered
by famine or conflict. There are several accounts of EPI sampling used in these
situations. Nutrition surveys which took body measurements of children in Ethiopia
were based on the EPI approach (Salama et al., 2001), and so were surveys from
The Democratic Republic of Congo (Coghlan et al., 2006) and Iraq (Burnham et al.,
2006) which estimated mortality rates due to violent conflict. It was also used in the
former Yugoslavia and in Chechnya to assess health care delivery and demands in the
mid-1990s when wars ensued in these countries (Legetic et al., 1996; Drysdale et al.,
2000). The EPI approach is detailed in the training manuals of various humanitarian
organizations, but it is only recommended when SRS is deemed unfeasible (Médecins
Sans Frontières, 2006; Centers for Disease Control and Prevention and World Food
Programme, 2007; UNICEF, 2010).
The EPI method has the essential attributes of a rapid survey including low cost
and quick feedback of results (MacIntyre, 1999). It has been suggested that the
survey could be completed in around five days if four to six teams are employed
(Lemeshow and Robinson, 1985). Moreover, the procedure is simple enough that it
can be carried out by those with little technical background (Bennett et al., 1994).
All of these reasons have led to the popularity of the EPI method.
Chapter 3
Procedures for Estimating a Population Proportion
While EPI sampling is relatively simple to perform, its statistical properties are
not easily derived. Complications arise because of the multiple stages of sampling
involved and the procedure used to select units at the final stage. Therefore, analysis is typically carried out by borrowing formulas from SRS and two-stage cluster
sampling.
3.1 Sample Size Determination
The original goal of the EPI method was to construct a 95% confidence interval for
the proportion of vaccinated children such that the margin of error was no more
than 10 percentage points (Hoshaw-Woodard, 2001). To achieve this, it has been
suggested that 210 children should be surveyed, a number which comes from doubling
the sample size required by SRS to allow for clustering (Henderson and Sundaresan,
1982; Lemeshow and Robinson, 1985).
Let p be a population proportion and p̂ be a sample proportion. Under SRS, a
100(1 − α)% confidence interval for p is given by
\[
\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n_{SRS}}}, \tag{3.1}
\]
where $z_{1-\alpha/2}$ represents the $100(1-\alpha/2)\%$ quantile of the standard normal distribution,
and $n_{SRS}$ is the number of children sampled using SRS (Miller and Miller, 2003).¹
By fixing the confidence level 1 − α and setting the margin of error equal to b, we
may solve for $n_{SRS}$ as
\[
n_{SRS} = \frac{z_{1-\alpha/2}^{2}\, p(1-p)}{b^{2}}. \tag{3.2}
\]
Since the population proportion is unknown, 0.50 is used for p in the calculations.
This results in the largest possible value for $n_{SRS}$. If cluster sampling is to be done,
and it is expected to have a doubling effect on the variance of p̂, then the number
of individuals to be sampled ($n_{CLU}$) in order to obtain a 95% confidence interval
(α = 0.05) for p with a margin of error of b = 0.10 is
\[
n_{CLU} = n_{SRS} \times 2 = \frac{1.96^{2}(0.50)(0.50)}{0.10^{2}} \times 2 = 192.08,
\]
which is rounded up to 193.

¹ The confidence interval in Equation 3.1 is estimated by replacing p with p̂.
According to Henderson and Sundaresan (1982), the individuals should come from
a sample of at least 30 clusters. They stated this condition so that the normal
distribution theory could be reasonably applied. Since 193/30 ≈ 6.4, which rounds up to 7, EPI surveys typically
interview seven children per cluster for an overall sample size of 210 children.
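
The arithmetic behind the 30 × 7 design can be reproduced with a short Python sketch. The function name `epi_sample_size` and its default arguments are illustrative assumptions; the calculation follows Equation (3.2), multiplies by the assumed design effect of 2, and rounds up at each step.

```python
import math
from statistics import NormalDist


def epi_sample_size(p=0.50, margin=0.10, alpha=0.05, design_effect=2.0, clusters=30):
    """Return (n_SRS, n_CLU, children per cluster, total children)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    n_srs = z ** 2 * p * (1 - p) / margin ** 2       # Equation (3.2)
    n_clu = math.ceil(design_effect * n_srs)         # inflate for clustering, round up
    per_cluster = math.ceil(n_clu / clusters)        # round up within each cluster
    return n_srs, n_clu, per_cluster, per_cluster * clusters


n_srs, n_clu, per_cluster, total = epi_sample_size()
print(round(n_srs, 2), n_clu, per_cluster, total)  # 96.04 193 7 210
```

Rounding up twice is what turns the requirement of about 192 children into the familiar total of 210.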
To determine the required sample size for EPI, the SRS sample size was multiplied
by 2. This factor is called the design effect (DE). It is defined as the ratio of the
variance of an estimator under the chosen sampling plan to the variance of the
estimator if SRS had been performed instead (Lumley, 2010). A cluster survey of
the vaccination status of children in the United States was found to have a design
effect of around 2 (Serfling and Sherman, 1965). Since early uses of EPI sampling
were concerned with vaccination rates, a design effect of 2 was used for the purpose
of setting the sample size.
If n units are sampled per cluster, the variance of an estimator under cluster
sampling is related to the variance of the estimator under SRS in the following way:
\[
\mathrm{Var}_{CLU}(\hat{p}) = \mathrm{Var}_{SRS}(\hat{p}) \times [1 + (n-1)\rho]. \tag{3.3}
\]
Here, ρ represents the rate of homogeneity (ROH) and can be interpreted as a measure of the similarity between units within the same cluster (Kish, 1965). We can
then express the design effect as
\[
DE = \frac{\mathrm{Var}_{CLU}(\hat{p})}{\mathrm{Var}_{SRS}(\hat{p})} = 1 + (n-1)\rho. \tag{3.4}
\]
ROH takes on values between −1/(n − 1) and 1. Positive values indicate that we are
more likely to get the same response from two units sampled from the same cluster
compared to two units sampled from different clusters (Bennett et al., 1991). From
Equation (3.4) we see that the closer ROH is to 1, the greater the design effect. We
also see that the design effect depends on the sample size per cluster. Therefore,
the overall sample size requirement for a cluster survey should be computed using a
design effect based on the number of units that can be sampled per cluster and past
estimates of ROH for the target variable (Bennett et al., 1991).
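
As a worked illustration of Equation (3.4), the following Python sketch computes the design effect for a given ROH and inverts the relationship. The function names are hypothetical, and the numbers are chosen only to show the dependence on the per-cluster sample size: a design effect of 2 with seven units per cluster implies an ROH of 1/6, and the same ROH with 15 units per cluster gives a design effect of about 3.3.

```python
def design_effect(n, rho):
    """Equation (3.4): DE = 1 + (n - 1) * rho for n units per cluster."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if not -1.0 / (n - 1) <= rho <= 1.0:
        raise ValueError("rho must lie between -1/(n - 1) and 1")
    return 1.0 + (n - 1) * rho


def implied_roh(de, n):
    """Invert Equation (3.4) to get the ROH implied by a design effect."""
    return (de - 1.0) / (n - 1)


rho_7 = implied_roh(2.0, 7)          # 1/6, about 0.167
de_15 = design_effect(15, rho_7)     # 1 + 14/6, about 3.33
print(round(rho_7, 3), round(de_15, 2))
```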
3.2 Point Estimator
To demonstrate how estimates are calculated from the data from EPI samples, we will
look at a simple example. Suppose we are interested in the proportion of households
with a certain characteristic.² We will denote this proportion by p. Let
\[
y_{ij} =
\begin{cases}
1 & \text{if household has target characteristic;} \\
0 & \text{otherwise}
\end{cases}
\tag{3.5}
\]
for the jth household in the ith cluster. If the population is divided into M clusters
and there are $N_i$ households in the ith cluster, then
\[
p = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}}{\sum_{i=1}^{M} N_i}. \tag{3.6}
\]

² One definition of households is “groups of persons sharing meals and residence” (Hlady et al.,
1994).
Let $S'$ be a subset of m PSUs and $S_i''$ be a subset of $n_i$ SSUs from the ith PSU. For
samples obtained through the EPI method, the recommended formula for estimating
the population proportion is
\[
\hat{p} = \frac{\sum_{i \in S'} \sum_{j \in S_i''} y_{ij}}{\sum_{i \in S'} n_i} \tag{3.7}
\]
(Lemeshow and Robinson, 1985; Bennett et al., 1991). This estimator corresponds
precisely to the proportion of households in the sample with the characteristic of
interest. If instead the survey had been about the proportion of vaccinated children
in a country, p̂ would be given by the ratio of vaccinated children in the sample to the
total children in the sample (World Health Organization, 2008). No matter which
elements or characteristics are being studied, the sample proportion is taken as the
estimate of the population proportion. Note that when samples from the clusters
are all exactly the same size, n, the estimator for p can be expressed as
\begin{align}
\hat{p} &= \frac{\sum_{i \in S'} \sum_{j \in S_i''} y_{ij}}{mn} \notag \\
&= \frac{1}{m} \sum_{i \in S'} \left( \frac{1}{n} \sum_{j \in S_i''} y_{ij} \right) \tag{3.8} \\
&= \frac{1}{m} \sum_{i \in S'} \hat{p}_i. \tag{3.9}
\end{align}
Here, the estimator for the overall population proportion may be interpreted as the
average of the sample proportions obtained for individual clusters.
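
The equivalence between Equations (3.7) and (3.9) holds only when every cluster contributes the same number of units. A minimal Python sketch with made-up 0/1 responses shows both cases; the function names and the data are hypothetical.

```python
def pooled_proportion(cluster_samples):
    """Equation (3.7): total successes over total sampled units."""
    successes = sum(sum(ys) for ys in cluster_samples)
    units = sum(len(ys) for ys in cluster_samples)
    return successes / units


def mean_of_cluster_proportions(cluster_samples):
    """Equation (3.9): average of the per-cluster sample proportions."""
    props = [sum(ys) / len(ys) for ys in cluster_samples]
    return sum(props) / len(props)


# Hypothetical responses y_ij; each inner list is one sampled cluster.
equal_sizes = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
unequal_sizes = [[1, 1], [0, 0, 0, 0, 0, 1]]

same = (pooled_proportion(equal_sizes), mean_of_cluster_proportions(equal_sizes))
differ = (pooled_proportion(unequal_sizes), mean_of_cluster_proportions(unequal_sizes))
print(same, differ)
```

With equal cluster sample sizes both functions give 8/12; with unequal sizes the pooled proportion is 3/8 while the average of cluster proportions is 7/12.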
3.2.1 Expected Value
Under an appropriate sampling plan, the estimator in Equation (3.8) is an unbiased
estimator of the population proportion.
To prove that p̂ is unbiased, we will show that E(p̂) = p, but first we must
compute the probability that a household is among the selected households for the
sample.
Let $G_i = 1$ if cluster i is in the sample and 0 otherwise. There are many approaches to selecting clusters with PPS.³ For the procedure described in Table 2.1,
which is generally the one suggested for EPI (Henderson et al., 1973; Lemeshow and
Robinson, 1985; Bennett et al., 1991), the inclusion probability of a cluster is
\[
P(G_i = 1) = \frac{m N_i}{\sum_{i=1}^{M} N_i}, \tag{3.10}
\]
for i = 1, 2, ..., M (Cochran, 1977). This holds as long as $\sum_{i=1}^{M} N_i$ is a multiple
of m and $\sum_{i=1}^{M} N_i / m$ is greater than or equal to the size of the largest cluster in the
population.⁴,⁵

³ Hanif and Brewer (1980) cite 50 ways to perform PPS sampling.
⁴ When $\sum_{i=1}^{M} N_i / m$ is not a whole number, an alternate systematic PPS sampling procedure can
be used where $\sum_{i=1}^{M} N_i$ is taken as the sampling interval and the number of units in each cluster is
multiplied by m before computing the cumulative population sizes (Cochran, 1977).
⁵ When there is a cluster such that $\sum_{i=1}^{M} N_i / m \le N_i$, this cluster will appear in all samples so its
inclusion probability is actually $P(G_i = 1) = 1$. Furthermore, in some samples it can be selected
more than once. If, for example, a cluster is selected twice, the EPI manual instructs to take two
samples from that cluster (World Health Organization, 2008).

Similarly, let $H_{ij} = 1$ if household j from cluster i is included in the sample and
0 otherwise. Assuming that all households from a cluster have the same chance of
being observed, then
\[
P(H_{ij} = 1 \mid G_i = 1) = \frac{n}{N_i}, \tag{3.11}
\]
for j = 1, 2, ..., $N_i$ (Lohr, 2009). Using the results from Equations (3.10) and (3.11),
we determine that the unconditional inclusion probability of household ij is
\begin{align}
P(H_{ij} = 1) &= P(H_{ij} = 1 \mid G_i = 1)\, P(G_i = 1) \notag \\
&= \left( \frac{n}{N_i} \right) \left( \frac{m N_i}{\sum_{i=1}^{M} N_i} \right) \notag \\
&= \frac{mn}{\sum_{i=1}^{M} N_i}, \tag{3.12}
\end{align}
for i = 1, 2, ..., M and j = 1, 2, ..., $N_i$. Since all units have the same chance of
being selected, the sample is said to be self-weighting (Bennett et al., 1991).
Next, we can re-write the estimator in Equation (3.8) in terms of Gi and Hij so
that
\[
\hat{p} = \frac{\sum_{i \in S'} \sum_{j \in S_i''} y_{ij}}{mn}
= \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij} H_{ij}}{mn}. \tag{3.13}
\]
This uses a randomization theory approach where response values are viewed as fixed
and all uncertainty comes from which units are picked (Lohr, 2009). Therefore,
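\begin{align}
E(\hat{p}) &= E\left( \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij} H_{ij}}{mn} \right) \notag \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}\, E(H_{ij}) \notag \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}\, P(H_{ij} = 1) \notag \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij} \left( \frac{mn}{\sum_{i=1}^{M} N_i} \right) \notag \\
&= \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}}{\sum_{i=1}^{M} N_i} \notag \\
&= p \tag{3.14}
\end{align}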
as required. However, it is questionable whether this is actually true for EPI since
the way it performs selections at the last stage of sampling may not give an equal
opportunity for all households to be picked.
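
The self-weighting property in Equation (3.12) can be checked exactly for a small case. The Python sketch below enumerates every possible random start of the systematic PPS procedure and uses exact fractions; the cluster sizes are hypothetical and were chosen so that the conditions stated after Equation (3.10) hold (the total is a multiple of m and no cluster exceeds the sampling interval). The second-stage probability n/N_i is simply assumed, as in Equation (3.11); the sketch says nothing about whether EPI achieves it.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate


def cluster_inclusion_probs(sizes, m):
    """Exact P(G_i = 1) under systematic PPS, found by trying every start."""
    total = sum(sizes)
    if total % m != 0 or max(sizes) > total // m:
        raise ValueError("conditions stated after Equation (3.10) are not met")
    interval = total // m
    upper = list(accumulate(sizes))               # cumulative population sizes
    counts = [0] * len(sizes)
    for start in range(1, interval + 1):          # each start is equally likely
        for k in range(m):
            counts[bisect_left(upper, start + k * interval)] += 1
    return [Fraction(c, interval) for c in counts]


sizes = [10, 20, 30, 40, 50, 30]                  # hypothetical N_i, total 180
m, n = 3, 5                                       # clusters sampled, units per cluster
pi_cluster = cluster_inclusion_probs(sizes, m)
# Equation (3.12): multiply by the assumed within-cluster probability n / N_i
pi_household = [Fraction(n, N) * pc for N, pc in zip(sizes, pi_cluster)]
print(pi_cluster, pi_household)
```

Every household ends up with probability mn/ΣN_i = 15/180 = 1/12, regardless of the size of its cluster.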
3.2.2 Variance
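The EPI manual recommends estimating the variance of p̂ as
\[
s_{\hat{p}}^{2} = \left( \frac{m}{\sum_{i \in S'} n_i} \right)^{2}
\frac{\sum_{i \in S'} y_i^{2} - 2\hat{p} \sum_{i \in S'} n_i y_i + \hat{p}^{2} \sum_{i \in S'} n_i^{2}}{m(m-1)}, \tag{3.15}
\]
where $y_i = \sum_{j \in S_i''} y_{ij}$ and provided that m > 1 (World Health Organization, 2008).
When $n_i = n$ for all $i \in S'$, then
\begin{align}
s_{\hat{p}}^{2} &= \left( \frac{m}{\sum_{i \in S'} n} \right)^{2}
\frac{\sum_{i \in S'} y_i^{2} - 2\hat{p} \sum_{i \in S'} n y_i + \hat{p}^{2} \sum_{i \in S'} n^{2}}{m(m-1)} \notag \\
&= \frac{m^{2}}{m^{2} n^{2}} \cdot
\frac{\sum_{i \in S'} \left( y_i^{2} - 2n\hat{p} y_i + n^{2} \hat{p}^{2} \right)}{m(m-1)} \notag \\
&= \frac{1}{m(m-1)} \sum_{i \in S'} \left[ \left( \frac{y_i}{n} \right)^{2} - 2\hat{p} \left( \frac{y_i}{n} \right) + \hat{p}^{2} \right] \notag \\
&= \frac{\sum_{i \in S'} \left( \hat{p}_i^{2} - 2\hat{p}\hat{p}_i + \hat{p}^{2} \right)}{m(m-1)} \notag \\
&= \frac{\sum_{i \in S'} (\hat{p}_i - \hat{p})^{2}}{m(m-1)} \tag{3.16}
\end{align}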
(Milligan et al., 2004). From Equation (3.16), we see that the variability of p̂ is
measured in terms of variability between PSUs. This form of variance estimation
may be used even when there are more than two stages of sampling. As long as p̂i is
an unbiased estimator of p and PSUs are sampled independently of each other, then
$s_{\hat{p}}^{2}$ is an unbiased estimator of the true variance of p̂. The proof is outlined in Hansen
et al. (1953). From Section 3.2.1, we know that the unbiasedness of p̂i depends on all
elements from the cluster having an equal probability of being selected. With EPI,
this is not guaranteed. EPI also does not select PSUs independently of each other
since the probability of selecting a PSU depends on the PSUs that were selected
beforehand. Hence, there may be issues with the performance of the estimator in
Equation (3.16) for EPI samples.
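
To make the relationship between Equations (3.15) and (3.16) concrete, the following Python sketch evaluates both on the same made-up sample with equal cluster sample sizes. The function names and the data are hypothetical; the sketch only confirms the algebra and does not address the bias concerns raised above.

```python
def who_variance(cluster_samples):
    """Equation (3.15): variance estimate for the pooled sample proportion."""
    m = len(cluster_samples)
    if m < 2:
        raise ValueError("at least two clusters are required")
    n_i = [len(ys) for ys in cluster_samples]
    y_i = [sum(ys) for ys in cluster_samples]
    p_hat = sum(y_i) / sum(n_i)
    bracket = (sum(y * y for y in y_i)
               - 2 * p_hat * sum(n * y for n, y in zip(n_i, y_i))
               + p_hat ** 2 * sum(n * n for n in n_i))
    return (m / sum(n_i)) ** 2 * bracket / (m * (m - 1))


def between_cluster_variance(cluster_samples):
    """Equation (3.16): valid when every cluster has the same sample size."""
    m = len(cluster_samples)
    props = [sum(ys) / len(ys) for ys in cluster_samples]
    p_hat = sum(props) / m
    return sum((p - p_hat) ** 2 for p in props) / (m * (m - 1))


# Hypothetical sample: three clusters with four households each.
sample = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
v_who = who_variance(sample)
v_between = between_cluster_variance(sample)
print(v_who, v_between)
```

Both functions give 7/144 (about 0.0486) for this sample, as the derivation leading to Equation (3.16) predicts.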
Chapter 4
Past Simulations of the EPI Method
Computer simulations are useful because they allow researchers to test a multitude
of sampling plans without incurring the costs and large scale operations involved in
performing an actual survey. More importantly, true parameter values are known in
a controlled environment. Several authors have taken this approach to studying the
EPI method.
This chapter reviews key papers published over the last 30 years. We describe how
the studies were carried out and highlight their findings. In keeping with the original
purpose of EPI sampling, early simulations focused on estimating the immunization
coverage level in a population (Henderson and Sundaresan, 1982; Lemeshow et al.,
1985). The simulations that came after dealt with other variables such as those
relating to morbidity, nutrition, child care, and socioeconomic variables which may
have different spatial patterns compared to immunization status (Bennett et al.,
1994; Katz et al., 1997; Yoon et al., 1997).
4.1 Simulation Design
Henderson and Sundaresan (1982) and Lemeshow et al. (1985) performed their simulations on artificial data sets. This allowed them to experiment with different
population types. Henderson and Sundaresan (1982) analyzed ten populations and
Lemeshow et al. (1985) analyzed five populations. One of the parameters that they
varied was the proportion of vaccinated children in the population. This was also
varied at the individual cluster level. Cluster vaccination rates ranged from 10% to
99%, while overall population vaccination rates ranged from 17% to 87%.
Lemeshow et al. (1985) created virtual towns on top of a grid. They programmed
an algorithm to go through each cell in the grid and either leave it empty or place
a household. The decision was based on a random number from a uniform [0, 1)
distribution. A household was placed in a cell if the number generated was less than
the specified probability of the cell containing a household. The rules were similar
for placing a child in a household and assigning a vaccination status to a child.
To reflect conditions seen in urban and rural areas, Lemeshow et al. (1985) used
various population density levels and spatial patterns for the placement of households
and vaccinated children. Pockets of vaccination were established by picking a random
household with a child, then assigning that child and all other nearby children as
vaccinated.
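
The grid-based construction described above can be sketched as follows. This is not the program of Lemeshow et al. (1985); it is a minimal Python illustration that assumes at most one child per household, a Euclidean radius to define "nearby", and hypothetical function names and parameter values.

```python
import math
import random


def build_town(rows, cols, p_household, p_child, p_vaccinated, rng):
    """Fill a rows x cols grid cell by cell using uniform [0, 1) draws."""
    town = {}
    for r in range(rows):
        for c in range(cols):
            if rng.random() < p_household:            # place a household in this cell
                child = rng.random() < p_child        # assumption: at most one child
                vaccinated = child and rng.random() < p_vaccinated
                town[(r, c)] = {"child": child, "vaccinated": vaccinated}
    return town


def add_pocket(town, radius, rng):
    """Mark one random child and all children within `radius` cells as vaccinated."""
    with_child = sorted(cell for cell, h in town.items() if h["child"])
    if not with_child:
        return None
    centre = rng.choice(with_child)
    for r, c in with_child:
        if math.hypot(r - centre[0], c - centre[1]) <= radius:
            town[(r, c)]["vaccinated"] = True
    return centre


rng = random.Random(2016)                             # fixed seed for repeatability
town = build_town(20, 20, p_household=0.5, p_child=0.6, p_vaccinated=0.3, rng=rng)
centre = add_pocket(town, radius=3, rng=rng)
children = sum(h["child"] for h in town.values())
coverage = sum(h["vaccinated"] for h in town.values()) / max(1, children)
```

Varying `p_household` across the grid, or calling `add_pocket` several times, would give the different density levels and spatial patterns mentioned in the text.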
In contrast, the studies by Bennett et al. (1994), Katz et al. (1997), and Yoon
et al. (1997) used real populations for their simulations. The data came from a
census survey of 30 randomly selected communities in Uganda (Bennett et al., 1994)
and 40 randomly selected communities in Nepal (Katz et al., 1997; Yoon et al.,
1997). Therefore, the communities only represented a portion of the real population,
but for simulation purposes they were treated as a population of their own. The
communities were digitally mapped with the origin centred at the mean or median
of the household coordinates.
All of the simulations mentioned above involved repeated sampling from a fixed
population. This embodies the Monte Carlo simulation method (Marasinghe, 2009).
Every time a sample was taken, the data from the sample were used to estimate the
prevalence of an outcome. The mean and variance of the resulting values across the
independent samples then served as an estimate of the expected value and variance of
the estimator in question. Since the actual prevalence in the population was known,
the bias and mean square error (MSE) could also be calculated. Let p be the overall
population prevalence and let p̂r be the corresponding estimate based on the rth
sample. If a total of R samples were generated, then
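\begin{align}
\hat{E}(\hat{p}) &= \frac{1}{R} \sum_{r=1}^{R} \hat{p}_r; \tag{4.1} \\
\widehat{\mathrm{Var}}(\hat{p}) &= \frac{1}{R} \sum_{r=1}^{R} \left( \hat{p}_r - \hat{E}(\hat{p}) \right)^{2}; \tag{4.2} \\
\widehat{\mathrm{Bias}}(\hat{p}) &= \hat{E}(\hat{p}) - p; \tag{4.3} \\
\widehat{\mathrm{MSE}}(\hat{p}) &= \widehat{\mathrm{Var}}(\hat{p}) + \left[ \widehat{\mathrm{Bias}}(\hat{p}) \right]^{2}. \tag{4.4}
\end{align}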
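
Equations (4.1) to (4.4) translate directly into code. The Python sketch below uses a hypothetical function name and five made-up estimates; it keeps the divisor R in the variance, as in Equation (4.2), so that the MSE equals the average squared distance between the estimates and the true prevalence.

```python
def monte_carlo_summary(estimates, p):
    """Summarize R simulated estimates of a known prevalence p."""
    R = len(estimates)
    mean = sum(estimates) / R                               # Equation (4.1)
    var = sum((e - mean) ** 2 for e in estimates) / R       # Equation (4.2)
    bias = mean - p                                         # Equation (4.3)
    mse = var + bias ** 2                                   # Equation (4.4)
    return {"mean": mean, "var": var, "bias": bias, "mse": mse}


# Hypothetical estimates from R = 5 simulated samples; true prevalence 0.45.
p_true = 0.45
p_hats = [0.42, 0.50, 0.46, 0.54, 0.48]
summary = monte_carlo_summary(p_hats, p_true)
direct_mse = sum((e - p_true) ** 2 for e in p_hats) / len(p_hats)
print(summary, direct_mse)
```

Here the mean is 0.48, the variance 0.0016, the bias 0.03, and the MSE 0.0025, which matches the directly computed mean squared error.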
Table 4.1: Comparison of simulation designs from past EPI studies.

Simulation study                 Total        Households     Children       Children sampled     Sampling method                           Simulated
                                 clusters     per cluster    per cluster    per cluster or                                                 samples
                                 or strata    or stratum     or stratum     stratum
Henderson and Sundaresan (1982)  Unspecified  Unspecified    Unspecified    7                    EPI                                       150
Lemeshow et al. (1985)           30           600            86 (average)   7                    StRS, EPI                                 500
Bennett et al. (1994)            30           51-153         86-238         7, 15, 30            StRS, EPI, EPI3, EPI5, QTR, PERI          1000
Katz et al. (1997)               40           31-315         13-284         7, 10, 15, 20, 25    StRS, EPI, EPI2, EPI3, EPI4, EPI5         1000
Yoon et al. (1997)               40           31-315         13-284         7, 10, 15, 20, 25    SRS, StRS, EPI, EPI2, EPI3, EPI4, EPI5    1000

All studies besides Henderson and Sundaresan (1982) sampled from every community in the populations that they were investigating. It is unclear
how many clusters were in the populations analyzed by Henderson and Sundaresan (1982), but the authors indicated that samples in their study
consisted of 30 clusters. A description of the sampling methods presented in column 6 of the table is given towards the end of Section 4.1.
With the exception of Henderson and Sundaresan (1982), all authors indicated
that a single sample in their simulation consisted of a subsample from every community in the population. Since all communities appeared in the final sample, they
used a stratified sampling design rather than a cluster sampling design, and the
communities should be formally called strata rather than clusters (Lohr, 2009). If
the strata have different population sizes and the same number of units are sampled
from each stratum, then the sample is not self-weighting. Units from smaller strata
have a greater probability of selection compared to units from larger strata. This
led to Katz et al. (1997) and Yoon et al. (1997) using an estimator which weighted
the sample data from a stratum by the size of the stratum. Bennett et al. (1994)
did not do a weighted calculation, and they used the same estimator as the one from
Equation (3.7). Because of this, the unweighted average of the prevalences per stratum, p̄, was substituted in place of the overall population prevalence, p, when bias
was calculated.
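
The difference between the weighted and unweighted calculations can be seen in a small example. The Python sketch below uses the standard stratified estimator, in which each stratum's sample proportion is weighted by the stratum size; the source does not give the exact formula used by Katz et al. (1997) and Yoon et al. (1997), so this form, the function names, and the data are assumptions.

```python
def weighted_stratified_proportion(samples, stratum_sizes):
    """Weight each stratum's sample proportion by its population size."""
    total = sum(stratum_sizes)
    weighted = sum(N * sum(ys) / len(ys) for ys, N in zip(samples, stratum_sizes))
    return weighted / total


def pooled_proportion(samples):
    """Unweighted estimator of Equation (3.7): successes over sampled units."""
    return sum(sum(ys) for ys in samples) / sum(len(ys) for ys in samples)


# Hypothetical strata of 300 and 100 households, four sampled from each.
samples = [[1, 1, 0, 1], [0, 0, 1, 0]]
stratum_sizes = [300, 100]
p_weighted = weighted_stratified_proportion(samples, stratum_sizes)
p_pooled = pooled_proportion(samples)
print(p_weighted, p_pooled)  # 0.625 0.5
```

The larger stratum has the higher sample proportion, so the weighted estimate (0.625) exceeds the pooled estimate (0.5), which treats the two strata as if they were the same size.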
When the EPI method was simulated in a community, either the starting household was randomly selected from all the households (Lemeshow et al., 1985) or it
was randomly selected from the households along a random direction (Bennett et al.,
1994; Katz et al., 1997; Yoon et al., 1997). Subsequent selections involved finding
the nearest neighbour of the household that was last visited, but there were slight
differences in what constituted the nearest neighbour. Katz et al. (1997) and Yoon
et al. (1997) noted that they took the closest household to the right unless they were
at the edge of the community. The effect of increasing the sample size per community
was examined as well as modifications of the EPI method:
• EPIk selects the kth nearest neighbour (Bennett et al., 1994; Katz et al., 1997;
Yoon et al., 1997);
• QTR divides the community into four quadrants and takes a quarter of the
sample from each of the quadrants (Bennett et al., 1994);
• PERI takes half of the sample starting near the centre of the community and
the other half near the periphery (Bennett et al., 1994).
Whichever procedure was done in one stratum was done in the rest of the communities in the same sample. The simulations also featured stratified random sampling
(StRS) and simple random sampling (SRS). In StRS, simple random sampling is
done within each community, while in SRS, simple random sampling is done at the
population level. These methods would serve as important benchmarks in assessing
the performance of EPI sampling and estimation. Table 4.1 summarizes the various
simulation designs.
4.2 Simulation Results
The EPI method appears to produce estimates within 10 percentage points of
the true prevalence with 95% confidence, and it does this regardless of
the prevalence and variability across communities, as long as the instances of the
outcome are evenly distributed throughout the area of a community (Henderson and
Sundaresan, 1982). For populations with other configurations, it did not work as
well. When household density was high at the centre of the community, instances
of an outcome occurred in a single pocket in the community, vaccination coverage
was low, and every community in the population had this property, the EPI method
performed very poorly (Lemeshow et al., 1985). Only 31% of the estimates fell in
the p ± 0.1 range; the rest of the estimates fell below p − 0.1. This represents an
extreme scenario. If pocketing is only present in a few communities, simulations have
demonstrated that it may be possible to still achieve the stated estimation goals of
the EPI method (Lemeshow et al., 1985).
As expected, the EPI method proved to be more biased than StRS. In the extreme
scenario described earlier, there was a tendency to underestimate the true proportion.
An analysis of variance (ANOVA) for absolute relative bias (ARB),
\[
\widehat{\mathrm{ARB}}(\hat{p}) = \frac{\left| \widehat{\mathrm{Bias}}(\hat{p}) \right|}{p}, \tag{4.5}
\]
revealed that the interaction between the sampling method (StRS, EPI) and presence
of pocketing (yes, no) was significant, where EPI and pocketing were associated with
larger biases (Lemeshow et al., 1985). In a different study, the true disease prevalence
was consistently overestimated (Katz et al., 1997; Yoon et al., 1997). The diseases
studied were diarrhoea and xerophthalmia. Positive bias remained even when sample
size was increased and when the distance between selected households was increased.
The authors did not have a conclusive explanation for this positive bias. Another
study showed that bias was greatest for socioeconomic variables such as possession
of cattle and education level of parents, and that this bias became more pronounced
when the PERI scheme was used (Bennett et al., 1994). Implementing QTR almost
always led to smaller biases compared to EPI, but results were mixed when EPI3
and EPI5 were compared to EPI. Nevertheless, the magnitude of bias seen in these
simulations, which was generally less than 2 percentage points, may be considered
tolerable for most practical cases.
To analyze the variability of the estimator from EPI (and related methods), the
authors looked at DE defined as:
\[
DE(\hat{p}) = \frac{\mathrm{Var}(\hat{p}_1)}{\mathrm{Var}(\hat{p}_2)}. \tag{4.6}
\]
When the denominator was set as the variance of the StRS estimator, DEs were close
to 1 with most ranging between 0.8 and 1.2. EPI typically generated larger variances
than StRS for socioeconomic variables (indicating a within-community clustering
effect); the opposite held for nutrition variables (Bennett et al., 1994). DEs did not
always decrease when sample size or the distance between households increased, but
they did when PERI was used instead of EPI (Bennett et al., 1994). DEs calculated
relative to SRS were larger than those calculated relative to StRS (Yoon et al., 1997).
MSE was also widely analyzed in the simulations. It is a useful measure because it
combines variance and bias. MSE was found to be significantly higher when the EPI
method was used in a community that had a pocketing variable pattern as opposed
to when StRS was used or when pocketing was absent (Lemeshow et al., 1985).
Although MSE decreased when more individuals were sampled from each community
(this has the effect of increasing the overall sample size since all communities were
included in the sample), the ratio $\mathrm{MSE}_{\mathrm{EPI}} / \mathrm{MSE}_{\mathrm{StRS}}$ generally increased (Bennett
et al., 1994; Katz et al., 1997). Besides EPI, PERI was the only sampling scheme
(within the study of Bennett et al. (1994)) that frequently yielded ratios either under
0.8 or over 1.2. This result was attributed to the combination of low variance and
high bias associated with PERI.
Finally, since the EPI method promotes itself as a rapid, inexpensive sampling
solution, it is meaningful to see how it compares in this respect against other schemes.
Katz et al. (1997) and Yoon et al. (1997) added up the straight-line distances between
visited households as a proxy for time and cost. In the case of StRS, households were
pre-selected for the sample and then visited in an order that would minimize travel.¹
Naturally, when the sample size per community was increased or when the number
of households skipped increased, the savings from EPI disappeared, and eventually,
it was more costly to implement than StRS. When the distance ratio was multiplied
by the MSE ratio, EPI was only better than StRS when the sample size was 7 or
10, and when one or no household was skipped. The ratio for most of the other
combinations was around one to two.
¹ Distance between communities was ignored.
Chapter 5
Computation of Household
Inclusion Probabilities
Data obtained from EPI samples are analyzed as though each unit in a given cluster
has the same chance of being selected as any other unit in the same cluster. However, as seen in simulations of the EPI method, using an estimator that makes this
assumption produced biased results (Bennett et al., 1994; Katz et al., 1997; Yoon
et al., 1997). When SRS is performed at the second stage of sampling, all unselected
households from a town or village have an equal probability of being picked next. In
contrast, choice of the next household in EPI sampling is restricted to the nearest
neighbours of the last selected household. On this basis, some authors have speculated that there may be a tendency to move inwards where the density of households
is higher (Kok, 1986; Brogan et al., 1994; Luman et al., 2007). Moreover, the manner
in which the first household is chosen gives an advantage to households that lie in a
direction containing few households. Hence, despite sampling the clusters according
to PPS, the overall sample may not actually be self-weighting.
While there have been discussions about implementing weighted calculations, the
examples that have been given in past papers apply the weights to the PSU as a whole
rather than to the SSUs (Bennett et al., 1991; Brogan et al., 1994). The purpose
of weighting in these examples was to adjust for demographics and non-response
levels. There have been no further attempts to account for differences in selection
probabilities among households that arise from the sampling procedure used by the
EPI method.
In the work that follows, we focus on the sampling of households that takes place
within a cluster. For convenience, we use a population consisting of a single cluster.
An algorithm is established for computing inclusion probabilities assuming that a
response is obtained from every sampled household.
5.1 Probability of Selecting the First Household
Let $H = \{1, 2, \ldots, N\}$ be the set of all households in the population. These households may be represented by their Cartesian coordinates $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$, for $i = 1, 2, \ldots, N$, or by their polar coordinates $(r_i, \gamma_i) \in \mathbb{R}^+ \times [0, 2\pi)$, for $i = 1, 2, \ldots, N$, where $r_i = \sqrt{x_i^2 + y_i^2}$ denotes household $i$'s distance from the origin, and $\gamma_i = \tan^{-1}(y_i / x_i)$, adjusted to the appropriate quadrant so that $\gamma_i \in [0, 2\pi)$, denotes the angle of its position in radians (rad) when moving counterclockwise from the positive x-axis (Figure 5.1).¹ We will use $\Gamma$ to refer to the set of $\gamma_i$ for $i = 1, 2, \ldots, N$. No households are located at the origin.
To select the first household for the sample, we adopt the method proposed by Henderson et al. (1973) whereby the household is chosen among the households contained in a random sector. This sector will be identified by the parameters $(\theta, \alpha)$, which specify the central angle (direction) of the sector and the angle span (size) of the sector. Therefore, the sector is bounded by the angles $\theta \pm \frac{\alpha}{2}$ (Figure 5.2).

¹ R denotes the set of real numbers; R+ denotes the set of positive real numbers.

[Figure 5.1: scatter plot of households 1 to 25 on x and y axes running from −6 to 6. Household 1 is annotated with its Cartesian coordinates (−4.0, 6.0) and polar coordinates (7.2, 2.2).]
Figure 5.1: A population of N = 25 households. Household locations were randomly generated such that the x-coordinate takes an integer value between −6 and 6 and similarly for the y-coordinate.

[Figure 5.2: the population of Figure 5.1 with one shaded sector; the direction θ and the half-spans α/2 are marked.]
Figure 5.2: Illustration of a sector with direction $\theta = \frac{3\pi}{4}$ rad and angle span $\alpha = \frac{\pi}{8}$ rad (shaded region).

[Figure 5.3: the population of Figure 5.1 with the directions of non-empty sectors shaded.]
Figure 5.3: In the population above, when the sector size is set to $\alpha = \frac{\pi}{8}$ rad and a random direction is picked between 0 rad and 2π rad, then there is a 0.21 chance that it will land in an area where there are no households (unshaded regions). Only directions in the set Θ = {(0.3, 1.8] ∪ (1.8, 2.4] ∪ (2.6, 3.3] ∪ (3.4, 4.4] ∪ (4.9, 6.0]} correspond to non-empty sectors (shaded regions). The length of interval unions in the set Θ is $L_\Theta = 4.9$.
Let Θ represent the set of all θ’s such that the sector associated with θ is nonempty. Although at each spin, any direction between 0 rad and 2π rad may be
observed, the result is ignored if the sector contains no households and another spin
is made. By ignoring these values of θ, we restrict the sample space to Θ. Only
θ ∈ Θ are considered valid outcomes. Since the direction is picked at random, θ is uniformly distributed and its probability density function is
\[
f(\theta) =
\begin{cases}
\dfrac{1}{L_\Theta}, & \theta \in \Theta; \\
0, & \text{otherwise.}
\end{cases}
\tag{5.1}
\]
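Here, $L_\Theta$ is the length of interval unions in the set Θ. It is calculated as
\[
L_\Theta = 2\pi - \sum_{i=1}^{N} \delta_i \, I(\delta_i > 0),
\tag{5.2}
\]
where
\[
\delta_i =
\begin{cases}
\gamma_{(i+1)} - \gamma_{(i)} - \alpha, & \text{if } i \in \{1, 2, \ldots, N - 1\}; \\
(\gamma_{(1)} + 2\pi) - \gamma_{(i)} - \alpha, & \text{if } i = N;
\end{cases}
\tag{5.3}
\]
$I(\cdot)$ is an indicator function, and $\gamma_{(1)} \le \gamma_{(2)} \le \ldots \le \gamma_{(N)}$ are the ordered household angles. In most instances where α is not too small, the population is sizable, and the households are evenly dispersed in all directions, $L_\Theta = 2\pi$. When the difference between any two consecutive ordered angles (including the wrap-around from $\gamma_{(N)}$ back to $\gamma_{(1)}$) is more than α, then $L_\Theta < 2\pi$ (Figure 5.3).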
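Equations (5.2) and (5.3) translate almost line for line into code. The following is a minimal Python sketch, not the program used in this thesis; the function name `length_of_nonempty_directions` and the representation of Γ as a plain list of angles in radians are assumptions made for illustration.

```python
import math

def length_of_nonempty_directions(gammas, alpha):
    """Return L_Theta following Equations (5.2)-(5.3).

    gammas: household angles in [0, 2*pi) (hypothetical input format).
    alpha:  angle span of the sector, in radians.
    """
    g = sorted(gammas)                      # ordered angles gamma_(1..N)
    n = len(g)
    empty_length = 0.0
    for i in range(n):
        # The successor of the last angle wraps around past 2*pi.
        nxt = g[i + 1] if i < n - 1 else g[0] + 2 * math.pi
        delta = nxt - g[i] - alpha          # delta_i in Equation (5.3)
        if delta > 0:                       # indicator I(delta_i > 0)
            empty_length += delta
    return 2 * math.pi - empty_length

# Four households spread a quarter turn apart, sector span pi/8.
angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
print(length_of_nonempty_directions(angles, math.pi / 8))
```

In this toy layout each household accounts for a window of directions of width α that does not overlap its neighbours' windows, so the result equals 4α = π/2.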
Let U1 ∈ H denote a random variable for the first household added to the sample, and u1 a fixed but arbitrary value of U1 . Then, according to the law of total
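probability and the definition of conditional probability,
\[
\begin{aligned}
P(U_1 = u_1) &= \int_\Theta f(u_1, \theta)\, d\theta \\
&= \int_\Theta f(u_1 \mid \theta) f(\theta)\, d\theta \\
&= \int_\Theta f(u_1 \mid \theta) \frac{1}{L_\Theta}\, d\theta \\
&= \frac{1}{L_\Theta} \int_\Theta f(u_1 \mid \theta)\, d\theta,
\end{aligned}
\tag{5.4}
\]
where
\[
f(u_1 \mid \theta) =
\begin{cases}
\dfrac{1}{n_{\mathrm{sec}}(\Gamma, \theta, \alpha)}, & \text{if household } u_1 \text{ lies in the sector } (\theta, \alpha); \\
0, & \text{otherwise;}
\end{cases}
\tag{5.5}
\]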
and nsec (Γ, θ, α) is the number of households in the sector defined by (θ, α). The
parameter Γ refers to the set of household angle coordinates {γ1 , γ2 , . . . , γN }. We
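will follow the convention that if a household falls on the initial side of the sector, then it is included in the sector, but if it falls on the terminal side of the sector, then it is not included in the sector. Thus, household i is included in the sector $(\theta, \alpha)$ when:
(a) $\theta \in [0, \frac{\alpha}{2})$ and $\gamma_i \in [\theta - \frac{\alpha}{2} + 2\pi, 2\pi) \cup [0, \theta + \frac{\alpha}{2})$, or
(b) $\theta \in [\frac{\alpha}{2}, 2\pi - \frac{\alpha}{2})$ and $\gamma_i \in [\theta - \frac{\alpha}{2}, \theta + \frac{\alpha}{2})$, or
(c) $\theta \in [2\pi - \frac{\alpha}{2}, 2\pi)$ and $\gamma_i \in [\theta - \frac{\alpha}{2}, 2\pi) \cup [0, \theta + \frac{\alpha}{2} - 2\pi)$.
Equivalently, a sector $(\theta, \alpha)$ contains household i when:
(a) $\gamma_i \in [0, \frac{\alpha}{2})$ and $\theta \in (\gamma_i - \frac{\alpha}{2} + 2\pi, 2\pi) \cup [0, \gamma_i + \frac{\alpha}{2}]$, or
(b) $\gamma_i \in [\frac{\alpha}{2}, 2\pi - \frac{\alpha}{2})$ and $\theta \in (\gamma_i - \frac{\alpha}{2}, \gamma_i + \frac{\alpha}{2}]$, or
(c) $\gamma_i \in [2\pi - \frac{\alpha}{2}, 2\pi)$ and $\theta \in (\gamma_i - \frac{\alpha}{2}, 2\pi) \cup [0, \gamma_i + \frac{\alpha}{2} - 2\pi]$.
Since for all other values of θ, $f(u_1 \mid \theta) = 0$, we only need to integrate $1/n_{\mathrm{sec}}(\Gamma, \theta, \alpha)$ over a subset of Θ.
To avoid evaluating integrals over ranges of θ which may cross the 0 rad (or 2π rad) boundary, the population may be rotated so that we are always integrating between $\theta = \frac{\alpha}{2}$ and $\theta = \frac{3\alpha}{2}$. This is described in the algorithm below.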
Algorithm:
1. Transform the coordinates of households in sector $(\gamma_{u_1}, 2\alpha)$.
   (a) For $0 \le \gamma_{u_1} < \alpha$,
       i. compute $a = \gamma_{u_1} - \alpha + 2\pi$ and $b = \gamma_{u_1} + \alpha$;
       ii. identify $\gamma_i \in [a, 2\pi) \cup [0, b)$;
       iii. compute
\[
\gamma_i' =
\begin{cases}
\gamma_i - a, & \gamma_i \in [a, 2\pi); \\
\gamma_i + (2\pi - a), & \gamma_i \in [0, b).
\end{cases}
\]
   (b) For $\alpha \le \gamma_{u_1} < 2\pi - \alpha$,
       i. compute $a = \gamma_{u_1} - \alpha$ and $b = \gamma_{u_1} + \alpha$;
       ii. identify $\gamma_i \in [a, b)$ (Figure 5.4a);
       iii. compute $\gamma_i' = \gamma_i - a$ (Figure 5.4b).
   (c) For $2\pi - \alpha \le \gamma_{u_1} < 2\pi$,
       i. compute $a = \gamma_{u_1} - \alpha$ and $b = \gamma_{u_1} + \alpha - 2\pi$;
       ii. same as case (a);
       iii. same as case (a).
2. Using the results at the end of the last step, determine when there is a change in the number of households in a sector (Figure 5.4c). Compute
\[
\theta^* =
\begin{cases}
\gamma' + \dfrac{\alpha}{2}, & \gamma' < \alpha; \\
\gamma' - \dfrac{\alpha}{2}, & \gamma' > \alpha.
\end{cases}
\]
   Construct the set $\Theta^* = \{\theta^*_{(1)}, \theta^*_{(2)}, \ldots, \theta^*_{(D)}\}$ where $\theta^*_{(1)} = \frac{\alpha}{2}$, $\theta^*_{(2)}$ to $\theta^*_{(D-1)}$ are the values obtained from Step 2 (with duplicates omitted), and $\theta^*_{(D)} = \frac{3\alpha}{2}$. The values in this set are arranged in strictly increasing order so that $\theta^*_{(1)} < \theta^*_{(2)} < \ldots < \theta^*_{(D)}$.
3. Compute the number of households in sector $(\theta^*_{(d)}, \alpha)$ for $d = 1, 2, \ldots, D$,
\[
n_{\mathrm{sec}}(\Gamma', \theta^*_{(d)}, \alpha) = \sum_{\gamma' \in \Gamma'} I\!\left(\theta^*_{(d)} - \frac{\alpha}{2} \le \gamma' < \theta^*_{(d)} + \frac{\alpha}{2}\right),
\]
   then take the inverse (Figure 5.4d).
4. Compute the probability that $u_1$ is the first household sampled.
\[
\begin{aligned}
P(U_1 = u_1) &= \frac{1}{L_\Theta} \int_{\alpha/2}^{3\alpha/2} \frac{1}{n_{\mathrm{sec}}(\Gamma', \theta, \alpha)}\, d\theta \\
&= \frac{1}{L_\Theta} \left( \int_{\theta^*_{(1)}}^{\theta^*_{(2)}} \frac{1}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(2)}, \alpha)}\, d\theta + \int_{\theta^*_{(2)}}^{\theta^*_{(3)}} \frac{1}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(3)}, \alpha)}\, d\theta + \ldots + \int_{\theta^*_{(D-1)}}^{\theta^*_{(D)}} \frac{1}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(D)}, \alpha)}\, d\theta \right) \\
&= \frac{1}{L_\Theta} \left( \frac{\theta^*_{(2)} - \theta^*_{(1)}}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(2)}, \alpha)} + \frac{\theta^*_{(3)} - \theta^*_{(2)}}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(3)}, \alpha)} + \ldots + \frac{\theta^*_{(D)} - \theta^*_{(D-1)}}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(D)}, \alpha)} \right) \\
&= \frac{1}{L_\Theta} \sum_{d=2}^{D} \frac{\theta^*_{(d)} - \theta^*_{(d-1)}}{n_{\mathrm{sec}}(\Gamma', \theta^*_{(d)}, \alpha)}
\end{aligned}
\tag{5.6}
\]

[Figure 5.4: four panels. (a) Identify households near household 6. (b) Transform the household coordinates. (c) Find the directions when the number of households in the sector changes. (d) Compute the probability of selecting household 6 for boundary values of θ; the vertical axis shows $1/n_{\mathrm{sec}}(\Gamma', \theta, \alpha)$.]
Figure 5.4: Steps involved when computing the probability that household 6 is the first sampled unit ($u_1 = 6$). In this example, the angle span of the sector is set to $\alpha = \frac{\pi}{8}$ rad. From Figure 5.3, we know that the length of interval unions in the set Θ is $L_\Theta = 4.9$. Therefore, using the formula in Equation (5.6), we obtain $P(U_1 = 6) = 0.05$. γ denotes the angular coordinate of a household; θ denotes a sector direction; $n_{\mathrm{sec}}$ denotes the number of households in a sector.
The idea is to partition the directions for which household $u_1$ has a chance of being selected into intervals where every sector that can be constructed from these directions covers the same households. One can imagine beginning with a sector centered at the angle $\theta = \frac{\alpha}{2}$, then rotating this sector counterclockwise to $\theta = \frac{3\alpha}{2}$, and noting when a household point is added or removed. The algorithm applies for any unit in the population.
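The four steps can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the program developed for this thesis, and it rests on several assumptions: households are indexed from 0, angles are given in radians in [0, 2π), and 0 < α ≤ π. The names `l_theta` and `prob_first` are hypothetical. Two small departures from the text are made for numerical convenience: the rotation is done with a single modulo operation instead of the three cases in Step 1, and the sector count on each interval is taken at the interval midpoint instead of its right endpoint (the count is constant on the interval, so the result is the same, but floating-point ties at the boundary are avoided).

```python
import math

def l_theta(gammas, alpha):
    """L_Theta from Equations (5.2)-(5.3)."""
    g = sorted(gammas)
    gaps = 0.0
    for i, gi in enumerate(g):
        nxt = g[i + 1] if i + 1 < len(g) else g[0] + 2 * math.pi
        gaps += max(nxt - gi - alpha, 0.0)
    return 2 * math.pi - gaps

def prob_first(u, gammas, alpha):
    """P(U1 = u) from Equation (5.6); assumes 0 < alpha <= pi."""
    two_pi = 2 * math.pi
    # Step 1: rotate so that household u sits at angle alpha, and keep
    # only the other households inside the window [0, 2*alpha).
    others = []
    for i, g in enumerate(gammas):
        if i == u:
            continue
        rotated = (g - gammas[u] + alpha) % two_pi
        if rotated < 2 * alpha:
            others.append(rotated)
    # Step 2: directions at which a household enters or leaves the sector.
    cuts = {alpha / 2, 3 * alpha / 2}
    for g in others:
        if g < alpha:
            cuts.add(g + alpha / 2)   # household leaves the sector here
        elif g > alpha:
            cuts.add(g - alpha / 2)   # household enters the sector here
    cuts = sorted(cuts)
    # Steps 3-4: piecewise-constant integral of 1 / n_sec.
    total = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2
        # Household u is always inside the sector on this range, hence "1 +".
        n_sec = 1 + sum(1 for g in others
                        if mid - alpha / 2 <= g < mid + alpha / 2)
        total += (hi - lo) / n_sec
    return total / l_theta(gammas, alpha)

angles = [0.1, 0.2, 0.25, 1.0, 3.0, 6.2]
probs = [prob_first(u, angles, math.pi / 8) for u in range(len(angles))]
print(probs, sum(probs))
```

A useful sanity check on any implementation is that the probabilities over all households sum to one, since exactly one household is selected from every non-empty sector.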
5.2 Probability of a Sample of Households
In EPI sampling, the second household sampled is the household that is physically
closest to the first household sampled, the third household sampled is the household physically closest to the second household sampled (excluding the first sampled
household), and so on. Once the starting household has been chosen, every household sampled afterwards is determined by finding the nearest neighbour of the last
household visited. Since sampling is done without replacement, the next household
added to the sample is identified among the group of households in the population
that has not yet been selected. When there are multiple households tied for the
nearest neighbour, one of the households is chosen randomly. This process of going
from neighbour to neighbour ends when the desired sample size is reached.
Let the vector u = (u1 , u2 , . . . , un ) specify the units picked for a sample of size n,
and the order in which they are picked. We make the distinction between a sample
that selects household 1 then 3 then 4 and a sample that selects 4 then 3 then 1
because we will compute their probabilities separately. In general, the probability of
an EPI path u with n units is given by
\[
\begin{aligned}
P(u) &= P((u_1, u_2, \ldots, u_n)) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P((u_1, u_2, \ldots, u_{n-1})) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P(u_{n-1} \mid (u_1, u_2, \ldots, u_{n-2})) \times P((u_1, u_2, \ldots, u_{n-2})) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P(u_{n-1} \mid (u_1, u_2, \ldots, u_{n-2})) \times \ldots \times P(u_2 \mid u_1) \times P(u_1) \\
&= \left[ \prod_{l=2}^{n} P(u_l \mid (u_1, u_2, \ldots, u_{l-1})) \right] \times P(u_1) \\
&= P((u_1, u_2, \ldots, u_n) \mid u_1) \times P(u_1).
\end{aligned}
\tag{5.7}
\]
It is computed as the product of conditional probabilities because each new selection depends on all the households selected beforehand.

[Figure 5.5: tree diagrams of EPI paths, with branch probabilities, for three starting households.]
(a) All paths starting at household 1: P((1, 3, 4) | u1 = 1) = 1.
(b) All paths starting at household 13: P((13, 10, 9) | u1 = 13) = 1/5; P((13, 11, 16) | u1 = 13) = 1/5; P((13, 16, 11) | u1 = 13) = 1/10; P((13, 16, 15) | u1 = 13) = 1/10; P((13, 17, 18) | u1 = 13) = 1/5; P((13, 19, 21) | u1 = 13) = 1/5.
(c) All paths starting at household 18: P((18, 17, 19) | u1 = 18) = 1/4; P((18, 17, 20) | u1 = 18) = 1/4; P((18, 20, 22) | u1 = 18) = 1/2.
Figure 5.5: Examples of EPI paths of length n = 3 for the population in Figure 5.1.
Let $H_l$ be the set of households from the population that were not chosen in the first $l - 1$ selections. Symbolically, $H_l = \{i \in H : i \notin \{u_1, u_2, \ldots, u_{l-1}\}\}$. Because of the way EPI sampling is carried out, only households in $H_l$ with the shortest distance to $u_{l-1}$ have a non-zero chance of being selected. These households are the solution to
\[
\arg\min_{i \in H_l} d_{i, u_{l-1}} = \arg\min_{i \in H_l} \sqrt{(x_{u_{l-1}} - x_i)^2 + (y_{u_{l-1}} - y_i)^2}.
\tag{5.8}
\]
If $n_{\mathrm{nn}}(u_{l-1}, H_l)$ represents the number of equidistant households closest to the household $u_{l-1}$ which have not already been visited, then
\[
P(u_l \mid (u_1, u_2, \ldots, u_{l-1})) =
\begin{cases}
\dfrac{1}{n_{\mathrm{nn}}(u_{l-1}, H_l)}, & \text{if } u_l \in \arg\min\limits_{i \in H_l} d_{i, u_{l-1}}; \\
0, & \text{otherwise;}
\end{cases}
\tag{5.9}
\]
for $l = 2, 3, \ldots, n$ (Figure 5.5). Note that if, at every selection after the first unit has been selected, there is only one household that can be chosen, then $P(u_l \mid (u_1, u_2, \ldots, u_{l-1})) = 1$ for $l = 2, 3, \ldots, n$, and $P((u_1, u_2, \ldots, u_n))$ reduces to $P(u_1)$. The probability $P(u_1)$ is computed separately since the first unit of the sample is selected according to a different process, as indicated in Section 5.1.
The program that we developed simultaneously returns all possible EPI paths
from a given population as well as the exact probability that a particular path is
realized.
Algorithm:
1. Use the data on the household locations to compute an N × N matrix D, where the element $d_{ij}$ in row i, column j represents the distance from household i to household j,
\[
D =
\begin{bmatrix}
0 & d_{12} & \cdots & d_{1N} \\
d_{21} & 0 & \cdots & d_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
d_{N1} & d_{N2} & \cdots & 0
\end{bmatrix}.
\tag{5.10}
\]
Hence, D is a distance matrix containing the distance between every pair of households.
2. Apply the algorithm in Section 5.1, to get the probability that household 1 is
the first selected household. Do the same for the rest of the households in the
population so that P (U1 = i) is known for i = 1, 2, . . . , N .
3. Construct EPI paths and compute their probabilities.
i. EPI paths of length 1. Produce a list of N vectors where each vector
has the identification number (id) of one household from the population.
Therefore, the first vector contains the id of household 1, the second vector
contains the id of household 2, and so forth. The probabilities associated
with these paths are the probabilities from Step 2.
ii. EPI paths of length 2. Add a second household to each of the vectors in step i. The first vector in the list has household 1 and only household 1. To determine which household should be added to this vector, go to the first row of D and search where the minimum value occurs in columns 2 to N, as this indicates the nearest neighbour of household 1. Create separate vectors if the minimum appears multiple times. The probability of the resulting path(s) is obtained by dividing the old path probability by the number of paths that were generated. Repeat for the remaining vectors in the original list. For vector i, this means searching for the minimum value in row i when column i is skipped.
iii. EPI paths of length 3. Take each vector in step ii, identify the last unit in
the vector, go to the matching row in D, and search for the minimum value.
Ignore values in columns that correspond to units which are already in the
sample. Update the vectors to include the new addition, and compute the
path probabilities in the same way as in step ii.
iv. Continue adding one household at a time to every vector until all vectors
have n elements. Update path probabilities at each iteration.
At the end of this procedure, an optional step that we performed was finding the
paths that yield the same sample of households. Samples, denoted by S, represent
an unordered subset of households from the population. The probability of a sample
is the sum of path probabilities for paths that include the same units as the sample,
regardless of order.
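The path-building procedure above can be sketched compactly in Python. This is a hedged illustration and not the thesis program: the function names `epi_paths` and `sample_probs`, the zero-based household ids, and the tolerance `tol` used to decide when two distances count as tied are all assumptions. The starting probabilities `p_first` are taken as given (for example from the algorithm in Section 5.1).

```python
import math

def epi_paths(coords, p_first, n, tol=1e-9):
    """Enumerate every EPI path of length n together with its probability.

    coords:  list of (x, y) household locations (ids are list positions).
    p_first: P(U1 = i) for each household i, computed beforehand.
    tol:     assumed tolerance for treating two distances as a tie.
    """
    N = len(coords)
    # Step 1: distance matrix D.
    D = [[math.dist(coords[i], coords[j]) for j in range(N)] for i in range(N)]
    # Step 3.i: paths of length 1.
    paths = {(i,): p for i, p in enumerate(p_first) if p > 0}
    # Steps 3.ii-3.iv: extend every path by one nearest neighbour at a time.
    for _ in range(n - 1):
        extended = {}
        for path, prob in paths.items():
            last = path[-1]
            unvisited = [j for j in range(N) if j not in path]
            d_min = min(D[last][j] for j in unvisited)
            nearest = [j for j in unvisited if D[last][j] - d_min <= tol]
            for j in nearest:  # one new path per tied neighbour
                extended[path + (j,)] = prob / len(nearest)
        paths = extended
    return paths

def sample_probs(paths):
    """Optional step: add up paths that visit the same set of households."""
    samples = {}
    for path, prob in paths.items():
        key = frozenset(path)
        samples[key] = samples.get(key, 0.0) + prob
    return samples

# Four households on a line; the start is chosen with equal probability.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
paths = epi_paths(coords, [0.25] * 4, n=2)
print(paths)
print(sample_probs(paths))
```

In the toy example, household 1 has two equidistant neighbours, so its starting probability is split evenly between the paths (1, 0) and (1, 2).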
5.3 Inclusion Probabilities for Individual Households and Pairs of Households
The inclusion probability for household i is defined as the probability that household
i is among the units drawn in a sample. When all possible samples are known along
with their probabilities, the inclusion probability for household i may be calculated
as
\[
\pi_i = \sum_{S:\, i \in S} P(S).
\tag{5.11}
\]
The joint inclusion probability for households i and j (i ≠ j) refers to the probability that households i and j are sampled together. Therefore,
\[
\pi_{ij} = \sum_{S:\, i, j \in S} P(S).
\tag{5.12}
\]
To facilitate the calculation of inclusion probabilities, we can use matrices. Let
$I_{qi} = 1$ if household i is in sample $S_q$ and 0 otherwise. Suppose there are a total of Q possible samples from a population of N households. Then,
\[
\begin{aligned}
\pi &=
\begin{bmatrix}
I_{11} & I_{12} & \cdots & I_{1N} \\
I_{21} & I_{22} & \cdots & I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
I_{Q1} & I_{Q2} & \cdots & I_{QN}
\end{bmatrix}^{T}
\begin{bmatrix}
P(S_1) & 0 & \cdots & 0 \\
0 & P(S_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P(S_Q)
\end{bmatrix}
\begin{bmatrix}
I_{11} & I_{12} & \cdots & I_{1N} \\
I_{21} & I_{22} & \cdots & I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
I_{Q1} & I_{Q2} & \cdots & I_{QN}
\end{bmatrix} \\
&=
\begin{bmatrix}
I_{11} & I_{21} & \cdots & I_{Q1} \\
I_{12} & I_{22} & \cdots & I_{Q2} \\
\vdots & \vdots & \ddots & \vdots \\
I_{1N} & I_{2N} & \cdots & I_{QN}
\end{bmatrix}
\begin{bmatrix}
P(S_1) I_{11} & P(S_1) I_{12} & \cdots & P(S_1) I_{1N} \\
P(S_2) I_{21} & P(S_2) I_{22} & \cdots & P(S_2) I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
P(S_Q) I_{Q1} & P(S_Q) I_{Q2} & \cdots & P(S_Q) I_{QN}
\end{bmatrix} \\
&=
\begin{bmatrix}
\sum_{q=1}^{Q} P(S_q) I_{q1}^2 & \sum_{q=1}^{Q} P(S_q) I_{q1} I_{q2} & \cdots & \sum_{q=1}^{Q} P(S_q) I_{q1} I_{qN} \\
\sum_{q=1}^{Q} P(S_q) I_{q2} I_{q1} & \sum_{q=1}^{Q} P(S_q) I_{q2}^2 & \cdots & \sum_{q=1}^{Q} P(S_q) I_{q2} I_{qN} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{q=1}^{Q} P(S_q) I_{qN} I_{q1} & \sum_{q=1}^{Q} P(S_q) I_{qN} I_{q2} & \cdots & \sum_{q=1}^{Q} P(S_q) I_{qN}^2
\end{bmatrix} \\
&=
\begin{bmatrix}
\pi_1 & \pi_{12} & \pi_{13} & \cdots & \pi_{1N} \\
\pi_{21} & \pi_2 & \pi_{23} & \cdots & \pi_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\pi_{N1} & \pi_{N2} & \pi_{N3} & \cdots & \pi_N
\end{bmatrix}
\end{aligned}
\tag{5.13}
\]
where the elements on the diagonal are inclusion probabilities for individual units,
and the elements off the diagonal are inclusion probabilities for pairs of units.
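The matrix product in Equation (5.13) amounts to adding P(S) to every cell (i, j) for which both i and j belong to S. A minimal Python sketch of that accumulation follows; the function name `inclusion_matrix`, the zero-based household ids, and the dictionary format for the samples are assumptions for illustration, and the three-sample population is invented.

```python
def inclusion_matrix(samples, N):
    """Build the N x N matrix of Equation (5.13).

    samples: dict mapping a frozenset of household ids to P(S).
    Diagonal cells hold pi_i; off-diagonal cells hold pi_ij.
    """
    pi = [[0.0] * N for _ in range(N)]
    for members, prob in samples.items():
        for i in members:
            for j in members:
                pi[i][j] += prob   # I_qi * P(S_q) * I_qj
    return pi

# Invented example: three possible samples of size 2 from 4 households.
samples = {
    frozenset({0, 1}): 0.375,
    frozenset({1, 2}): 0.375,
    frozenset({2, 3}): 0.25,
}
pi = inclusion_matrix(samples, N=4)
for row in pi:
    print(row)
```

With a fixed sample size n, the diagonal of the matrix sums to n, which gives a quick consistency check on the enumerated samples.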
The inclusion probabilities in π can also be estimated through a Monte Carlo
simulation approach where the EPI selection procedure is simulated repeatedly to
produce independent samples. This is done on a fixed population and for a fixed
sample size. The inclusion probability of household i is the proportion of samples
in which household i is observed, and the joint inclusion probability of households i
and j is the proportion of samples in which both households i and j are observed.² The same steps apply for other sampling plans.

² Simulation runs where no households are selected because the initial sector is empty are discarded prior to the computation of inclusion probabilities so that they are not counted.
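The Monte Carlo estimate can be sketched generically in Python. The function `estimate_inclusion` and its argument `draw_sample` (a callable returning one sample as a collection of zero-based household ids, or an empty collection when the run should be discarded) are hypothetical names. To keep the example self-contained, the sampler plugged in below is SRS without replacement, for which the true values are known; an EPI sampler would be passed in the same way.

```python
import random

def estimate_inclusion(draw_sample, N, reps):
    """Estimate pi_i (diagonal) and pi_ij (off-diagonal) by simulation."""
    counts = [[0.0] * N for _ in range(N)]
    kept = 0
    for _ in range(reps):
        sample = draw_sample()
        if not sample:          # discarded run, e.g. an empty initial sector
            continue
        kept += 1
        for i in sample:
            for j in sample:
                counts[i][j] += 1.0
    # Proportion of retained samples containing i (or both i and j).
    return [[c / kept for c in row] for row in counts]

rng = random.Random(1)
N, n = 6, 2
pi_hat = estimate_inclusion(lambda: rng.sample(range(N), n), N, reps=20000)
print([round(pi_hat[i][i], 3) for i in range(N)])
```

Under SRS the estimates should settle near n/N for single households and n(n − 1)/(N(N − 1)) for pairs, which makes this a convenient check before substituting a more involved sampling plan.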
5.4 Other Versions of EPI Sampling
To compute household inclusion probabilities for an EPI-based sampling procedure where the first unit of the sample is selected by SRS, and another EPI-based procedure where an interviewer surveys every kth household in their path (Figure 5.6), only minor adjustments are required to the framework we established in the previous sections. In the first case, we set $P(U_1 = u_1) = \frac{1}{N}$ for $u_1 = 1, 2, \ldots, N$ since all households in the population have an equal probability of being chosen as the starting household. In the second case, we compute path probabilities for a sample of size $(n - 1)k + 1$ then drop the households that would be skipped in the actual survey process before aggregating the probabilities as in Section 5.3.
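The second adjustment can be illustrated with a short Python sketch. The helper names `thin_path` and `thinned_sample_probs` are hypothetical, and the two full paths used in the example are only patterned on Figure 5.6 (the skipped households are our reading of that figure and should be treated as illustrative, as are the conditional probabilities of 0.5).

```python
def thin_path(path, k):
    """Keep the 1st, (k+1)th, (2k+1)th, ... households of a full path."""
    return tuple(path[::k])

def thinned_sample_probs(path_probs, k):
    """Drop skipped households, then add up paths giving the same sample."""
    samples = {}
    for path, prob in path_probs.items():
        key = frozenset(thin_path(path, k))
        samples[key] = samples.get(key, 0.0) + prob
    return samples

# n = 3 sampled households with k = 3 requires full paths of length
# (n - 1) * k + 1 = 7.
full_paths = {
    (1, 3, 4, 2, 5, 6, 9): 0.5,
    (1, 3, 4, 2, 7, 5, 6): 0.5,
}
print(thin_path((1, 3, 4, 2, 5, 6, 9), 3))
print(thinned_sample_probs(full_paths, 3))
```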
5.5 Additional Notes: Permutations of Household Selections
Under SRS, there are
\[
N \times (N - 1) \times \cdots \times (N - n + 1) = \frac{N!}{(N - n)!}
\tag{5.14}
\]
sequences (permutations) for the order in which households are visited (the counterpart of EPI paths). It is clear that as the population size N increases, the number of sequences grows quickly. On the other hand, under EPI sampling, far fewer sequences are generated. For instance, when there is always only one neighbour to move to from each selection, the entire path is determined by the choice of the first household, and the number of possible paths that the survey interviewer can take is exactly equal to N. As long as there are not too many ties in the selection process, the task of enumerating all EPI paths remains manageable, and inclusion probabilities can be calculated from basic sampling principles.

[Figure 5.6: two panels. (a) The population of Figure 5.1 with one EPI path drawn on it. (b) Tree diagram of the paths starting at household 1, with skipped households crossed out and P((1, 2, 9) | u1 = 1) = 1/2, P((1, 2, 6) | u1 = 1) = 1/2.]
(a) In this example, every third neighbour is added to the sample (k = 3) and a sample of three households is desired (n = 3). One possible sample consists of households 1, 2, and 9.
(b) For the path shown in (a) there is only one household to move to at every selection other than the third selection. At the third selection, the path diverges in two. Choosing to move to household 5 instead of household 7 from household 2 means that the probability of this path conditional on starting at household 1 is equal to 0.5. This also represents the conditional probability of selecting households 1, 2, and 9 when every third neighbour is added to the sample.
Figure 5.6: Illustration of an EPI path with skipped neighbours.
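To give a sense of scale for Equation (5.14), the counts under SRS can be evaluated directly with Python's standard library. The helper name `srs_counts` is hypothetical, and N = 150 with n = 7, 15, 30 are the values used in the simulations of Chapter 6.

```python
import math

def srs_counts(N, n):
    """Number of ordered sequences, N!/(N-n)!, and of unordered samples."""
    ordered = math.perm(N, n)     # Equation (5.14)
    unordered = math.comb(N, n)   # N! / ((N - n)! n!)
    return ordered, unordered

for n in (7, 15, 30):
    ordered, unordered = srs_counts(150, n)
    print(f"n={n}: {ordered:.2e} ordered sequences, "
          f"{unordered:.2e} unordered samples")
```

By contrast, when nearest neighbours are never tied, EPI sampling from the same population produces at most 150 paths, one per starting household.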
Chapter 6
Household Inclusion Probabilities
in Simulated Populations
One of the reasons for developing the algorithms in Chapter 5 is so that we can use the
probabilities that households are included in a sample to objectively assess the way
selections are made under the EPI method. In this chapter, we generate populations
and compute inclusion probabilities for elements in these populations when various
EPI sampling plans are simulated. As before, we focus on what happens in the
stage of sampling where households are picked. From the results, we compare how
the sampling plans are different from each other and from one based on SRS. We
also explore the sensitivity of inclusion probabilities to the spatial distribution of the
households in a cluster.
6.1 Simulation Design
6.1.1 Generation of Populations
In our study, we considered five types of household spatial distributions (settlement
layouts). Each settlement had a population size of N = 150 households. The decision
to create populations containing 150 households was based on reports of the average enumeration area size in recent surveys for countries such as South Africa and
Bangladesh (Housing Development Agency, 2012; National Institute of Population
Research and Training et al., 2013).¹
Households in the populations that we generated were dispersed over an area that
was 1000 × 1000 units wide. The centre of the cluster was designated as the origin
(0, 0). The procedure for placing the households in each population was tailored to
achieve a certain spatial pattern:
1. Regular pattern (loc reg). Households were placed in 6 regularly spaced rows and 25 regularly spaced columns. The distance between a household and an adjacent household to the left or right was 41 units, and the distance between a household and an adjacent household above or below was 166 units.
2. Random pattern over a square area (loc sqr). Households were randomly distributed over the entire area. This was done by independently generating two uniform random variables between −500 and 500 for a household's x- and y-coordinates.
3. Random pattern over a rectangular area (loc rec). Same as loc sqr except the y-coordinate was a uniform random variable between −250 and 250.
4. Aggregated pattern (loc agg). Households were organized into 15 groups of 10. Those belonging to the same group were dispersed around a common focal point $(x_g^*, y_g^*)$. Household locations were determined by
\[
x_i = x_g^* + r_i \times \cos(\phi_i), \tag{6.1}
\]
\[
y_i = y_g^* + r_i \times \sin(\phi_i), \tag{6.2}
\]
where $x_g^*$ and $y_g^*$ are uniform random variables between −500 and 500, $r_i$ is an exponential random variable with rate parameter λ = 75, and $\phi_i$ is a uniform random variable between 0 and 2π (Bolker, 2008). All variables were generated independently of each other.
5. Circular gradient pattern (loc cgr). Concentration of households gradually decreased from the centre to the edge of the cluster. Household coordinates were generated using Equations (6.1) and (6.2) with the focal point set to the origin $(x_g^*, y_g^*) = (0, 0)$ and the rate parameter for determining the distance of a household from the origin set to λ = 175.

¹ The term enumeration area refers to the smallest geographic unit into which a country is divided by the census office (Statistics Canada, 2015).

Figure 6.1: Five spatial distributions of N = 150 households. These populations are used for the simulation studies discussed in Chapters 6 and 7. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households.
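The five placement procedures can be sketched in Python with the standard `random` module. This is an illustration under stated assumptions, not the code that produced Figure 6.1. In particular: (i) the regular grid is centred on the origin; (ii) λ is treated as the mean distance from the focal point (passed to `expovariate` as rate 1/λ), because a literal rate of 75 or 175 would place every household within a small fraction of a unit of its focal point on a 1000 × 1000 area; (iii) households that land outside the cluster are simply redrawn from the same procedure; and (iv) all function names, the underscore spelling of the population labels, and the seed are hypothetical.

```python
import math
import random

HALF = 500.0  # the cluster spans [-500, 500] in both x and y

def loc_reg():
    """6 rows x 25 columns, 41 units apart across and 166 units apart down."""
    xs = [(c - 12) * 41.0 for c in range(25)]
    ys = [(r - 2.5) * 166.0 for r in range(6)]
    return [(x, y) for y in ys for x in xs]

def loc_uniform(n, half_width, half_height, rng):
    """Independent uniform x- and y-coordinates over a rectangle."""
    return [(rng.uniform(-half_width, half_width),
             rng.uniform(-half_height, half_height)) for _ in range(n)]

def around_focal_point(n, focal, mean_dist, rng):
    """Equations (6.1)-(6.2); points outside the cluster are redrawn."""
    points = []
    while len(points) < n:
        r = rng.expovariate(1.0 / mean_dist)    # assumed: mean = lambda
        phi = rng.uniform(0.0, 2.0 * math.pi)
        x = focal[0] + r * math.cos(phi)
        y = focal[1] + r * math.sin(phi)
        if abs(x) <= HALF and abs(y) <= HALF:
            points.append((x, y))
    return points

def loc_agg(rng, groups=15, per_group=10, mean_dist=75.0):
    """15 groups of 10 households, each around a uniform focal point."""
    points = []
    for _ in range(groups):
        focal = (rng.uniform(-HALF, HALF), rng.uniform(-HALF, HALF))
        points.extend(around_focal_point(per_group, focal, mean_dist, rng))
    return points

def loc_cgr(rng, n=150, mean_dist=175.0):
    """Household density decreasing away from the origin."""
    return around_focal_point(n, (0.0, 0.0), mean_dist, rng)

rng = random.Random(2016)
populations = {
    "loc_reg": loc_reg(),
    "loc_sqr": loc_uniform(150, 500.0, 500.0, rng),
    "loc_rec": loc_uniform(150, 500.0, 250.0, rng),
    "loc_agg": loc_agg(rng),
    "loc_cgr": loc_cgr(rng),
}
print({name: len(points) for name, points in populations.items()})
```

Each call produces one realization of a pattern, so different seeds give different populations with the same overall layout.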
The regular pattern may be thought of as a planned neighbourhood development.
On the other hand, the random pattern could represent low density areas such as
rural communities or high density areas such as shanty towns since dwellings may
be scattered in an unorganized manner. For EPI, it is the relative distance between
households that is important rather than the absolute distance. Clusters with a
circular gradient pattern are meant to resemble a small village or town built around
a core area. Finally, the aggregated pattern depicts a situation where households are
grouped into compounds, as in some of the places where EPI has been performed
(Rose et al., 2006; Grais et al., 2007), or a mixed-use zone containing residential
properties as well as schools, medical facilities, places of worship, offices, retail stores,
etc. (City of Ottawa, 2012).
The populations loc reg, loc sqr, loc rec, loc agg, and loc cgr represent one realization of the spatial patterns specified above, and are depicted in Figure 6.1. In
loc agg, four households fell outside the boundaries of the cluster area, and in loc cgr,
two households fell outside the boundaries of the cluster area. These households were
removed and then given new random coordinates inside the cluster.
6.1.2 Sampling Plans
We tested six variations of the EPI sampling procedure. They were characterized by the size of the sector used to select the first household ($\alpha = 2\pi, \frac{\pi}{8}, \frac{\pi}{32}$ rad) and the number of neighbours skipped (k − 1 = 0, 2, or equivalently, k = 1, 3) for finding subsequent units. This led to the following combinations:
1. α = 2π rad, k = 1 (nosec k1)
2. α = π/8 rad, k = 1 (api8 k1)
3. α = π/32 rad, k = 1 (api32 k1)
4. α = 2π rad, k = 3 (nosec k3)
5. α = π/8 rad, k = 3 (api8 k3)
6. α = π/32 rad, k = 3 (api32 k3)
Each method was performed with a sample size of n = 7, 15, 30, giving a total of 6 × 3 = 18 sampling configurations.
Table 6.1: Minimum and maximum number of households in non-empty sectors (min(nsec), max(nsec)) and proportion of values in [0, 2π) rad that correspond to the directions of empty sectors (1 − LΘ/2π) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1). Results are computed for a sector with angle span α = π/8 rad and α = π/32 rad.

Spatial distribution   α = π/8 rad                                α = π/32 rad
of households          min(nsec)  max(nsec)  LΘ    1 − LΘ/2π      min(nsec)  max(nsec)  LΘ    1 − LΘ/2π
loc reg                4          14         6.28  0.00           1          6          5.81  0.08
loc sqr                4          19         6.28  0.00           1          9          5.82  0.07
loc rec                1          23         6.28  0.00           1          9          5.48  0.13
loc agg                1          31         6.14  0.02           1          17         4.61  0.27
loc cgr                3          15         6.28  0.00           1          9          5.77  0.08

See pages viii-x for the meaning of abbreviations used for the spatial distribution of households. See Equation (5.2) for definition of LΘ.
[Figure 6.2: line plots of the number of distinct EPI samples (vertical axis, 0 to 800) against sample size n (horizontal axis, 0 to 30), in two panels (k = 1 and k = 3), with one line per spatial distribution of households (loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr).]
Figure 6.2: Number of possible EPI samples that can be drawn from loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when no neighbours are skipped (k = 1) and when every third neighbour is selected (k = 3). The term sample refers to an unordered set of n households selected from a population of size N. In comparison, if SRS had been used, the number of possible samples is always equal to $\frac{N!}{(N-n)!\,n!}$. For a population of 150 households, this means that there are $2.9 \times 10^{11}$ SRS samples of size 7, $1.6 \times 10^{20}$ SRS samples of size 15 and $3.2 \times 10^{31}$ SRS samples of size 30. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households.
6.2 Simulation Results
For the populations in the study, using a sector size of α = π/8 rad meant it was rare, if not impossible, to end up in a sector where there were no households (Table 6.1). For loc reg, loc sqr, and loc cgr, a sector could point in any direction, and it would have at least 3-4 households. Naturally, when sector size was reduced to α = π/32 rad, sectors did not contain as many households. The number of households in a sector varied the most for loc agg. It also required a greater restriction on the values of θ to avoid getting an empty sector when performing the procedure to select the first household.
Running the calculations for the inclusion probabilities took considerably longer
in loc reg compared to the other populations. Due to the regular spacing of households within the population, a household would often have multiple nearest neighbours. There were more samples to list and thus more samples to search through
during the step where the probabilities of samples containing a certain household
were added to get the overall probability of selecting that household. The plot in
Figure 6.2 shows the number of household combinations that can be observed in
each population when EPI is applied. Besides illustrating that more combinations
of households are possible in loc reg, it also shows that more combinations of households are possible when every third neighbour is sampled as opposed to when no
neighbours are skipped. However, that number is still nowhere near the number of
samples that can be seen when SRS is performed.
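To give a sense of scale for the SRS comparison, the binomial coefficients quoted in the caption of Figure 6.2 can be reproduced directly. The short Python sketch below is only an illustration; the variable names are ours and are not part of the original simulation code.

```python
from math import comb

N = 150  # population size used throughout the study

# Number of unordered SRS samples of size n: N! / ((N - n)! n!)
srs_counts = {n: comb(N, n) for n in (7, 15, 30)}

for n, count in srs_counts.items():
    print(f"n = {n:2d}: {count:.1e} possible SRS samples")
```

The printed magnitudes can be compared with the 10^11, 10^20 and 10^31 orders of magnitude stated in the caption, against at most a few hundred distinct EPI samples in Figure 6.2.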
To visualize where high and low inclusion probabilities occurred in the settlements, we constructed bubble plots. A selection of these plots is shown in Figures 6.3-6.7. The size (area) of a point at location (x, y) reflects the inclusion probability of the household at those coordinates. We joined pairs of households that have a chance of being selected together (π_ij > 0) to get a sense of the geographic scope of the possible samples.

[Figure 6.3 panels: (a) α = 2π rad, k = 1, n = 7; (b) α = 2π rad, k = 1, n = 30; (c) α = π/32 rad, k = 1, n = 7; (d) α = 2π rad, k = 3, n = 7.]
Figure 6.3: Household inclusion probabilities for a population with regularly spaced households (loc reg). The size (area) of a point is proportional to the inclusion probability (π_i) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.
Symmetric spatial patterns of the selection probabilities emerged for loc reg where
households were arranged in a grid formation. When the first household was chosen
randomly among all households (α = 2π rad), no neighbours were skipped in later
selections (k = 1), and sample size was set to n = 7, then within each row of
households, inclusion probabilities increased from the unit at the end to the seventh
unit from the end (Figure 6.3a). Inclusion probabilities for the middle 11 households
were constant and were smaller than the inclusion probability of the seventh unit
from the end.
For samples that started between the seventh unit from each end of a row, movements made to obtain the full sample were restricted to that row. As sample size
increased to n = 15 and n = 30, the position of a household in a row became less
significant. What was important was the row that a household belonged to. From
Figure 6.3b, we see that inclusion probabilities are smallest for households in the
first row and last row while inclusion probabilities are largest for households in the
second row and second last row.
Selecting the first unit of the sample from a random sector altered the distribution of selection probabilities. The scenarios depicted in Figures 6.3a and 6.3c are identical except that, in the latter, a random sector with angle span α = π/32 rad initiated the sampling procedure. While we still observe a general increase in the inclusion probabilities as we move from the end of a row to the seventh unit from the end, the change does not happen at the same rate as in the previous case. Another difference is that there is now variability in the inclusion probabilities of the middle 11 households. Similar remarks apply when α = π/8 rad.

[Figure 6.4 panels: (a) α = 2π rad, k = 1, n = 7; (b) α = 2π rad, k = 1, n = 30.]
Figure 6.4: Household inclusion probabilities for a population with a random settlement pattern over a square (loc sqr). The size (area) of a point is proportional to the inclusion probability (π_i) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.
Whether households were skipped in taking the sample had a greater impact on inclusion probabilities than whether a sector was used to identify the starting point. In loc reg, adding every third neighbour to the sample led to a periodic behaviour in the selection probabilities of households. When the sample size was n = 7, the probability would spike at the fourth household from the end of a row and every third household after it (Figure 6.3d). This held regardless of how the first unit of the sample was picked. While the periodic behaviour remained even when larger samples were taken, the spikes in the probabilities were diminished.

[Figure 6.5 panels: (a) α = 2π rad, k = 1, n = 7; (b) α = π/8 rad, k = 1, n = 7.]
Figure 6.5: Household inclusion probabilities for a population with a random settlement pattern over a rectangular area (loc rec). The size (area) of a point is proportional to the inclusion probability (π_i) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.
Trends in household selection probabilities in the other populations were not as
predictable. This was especially true for loc sqr. When n = 7, households with the
greatest inclusion probabilities were not found in one particular area (Figure 6.4a).
Furthermore, as the sample size was increased, households which were picked most
often at n = 7 were not necessarily the ones picked most often at n = 15 and n = 30
(Figure 6.4b).
As with loc sqr, there was nothing that stood out about the inclusion probabilities in loc rec when α = 2π rad, k = 1, n = 7. However, switching from a sampling scheme that did not use a sector to one that did resulted in an increase in the inclusion probabilities of households close to the y-axis and a decrease in the inclusion probabilities of households in the far west and east sides of the cluster (Figures 6.5a and 6.5b).

[Figure 6.6 panels: (a) α = 2π rad, k = 1, n = 7; (b) α = 2π rad, k = 3, n = 7.]
Figure 6.6: Household inclusion probabilities for a population with an aggregated settlement pattern (loc agg). The size (area) of a point is proportional to the inclusion probability (π_i) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.
A distinct feature of loc agg was that certain groups of households were far enough
from other groups such that when n = 7, households in these groups could only be
sampled with households in the same group (Figure 6.6a). When more population
units were sampled or when neighbours were skipped in the selection process, the
samples were less localized.
In loc cgr, patterns in the inclusion probabilities were somewhat clearer. If a household had a relatively high probability of selection, it was almost certainly from the centre of the cluster. However, this did not mean that all households in the centre of the cluster had high probabilities of selection. For instance, two households near the origin could be side by side, yet one would have two or three times the inclusion probability of the other.

[Figure 6.7 panels: (a) α = 2π rad, k = 1, n = 7; (b) α = 2π rad, k = 3, n = 7.]
Figure 6.7: Household inclusion probabilities for a population with a circular gradient settlement pattern (loc cgr). The size (area) of a point is proportional to the inclusion probability (π_i) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.
A boxplot of the inclusion probabilities (Figure 6.8) highlights other important properties. For all combinations of household spatial patterns and sampling procedures, the average probability of selection was 0.047, 0.100, and 0.200 when sample sizes were 7, 15, and 30 respectively. These averages correspond to the sampling fraction n/N. In 78% of the scenarios that we tested, the largest probability was over 10 times greater than the smallest probability. The variability of the inclusion probabilities increased as sample size increased, but it was considerably lower in loc reg compared to the other populations.
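Finally, we calculated and plotted the correlation (Cor) between a household's inclusion probability (π_i) and its distance to the origin (r_i) (Figure 6.9). We used the following formula:

\[
\begin{aligned}
\mathrm{Cor}(\pi_i, r_i)
&= \frac{\mathrm{Cov}(\pi_i, r_i)}{\sqrt{\mathrm{Var}(\pi_i)}\,\sqrt{\mathrm{Var}(r_i)}} \\
&= \frac{\frac{1}{N}\sum_{i=1}^{N} (\pi_i - \bar{\pi})(r_i - \bar{r})}
        {\sqrt{\frac{1}{N}\sum_{i=1}^{N} (\pi_i - \bar{\pi})^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N} (r_i - \bar{r})^2}} \\
&= \frac{\sum_{i=1}^{N} (\pi_i - \bar{\pi})(r_i - \bar{r})}
        {\sqrt{\sum_{i=1}^{N} (\pi_i - \bar{\pi})^2}\,\sqrt{\sum_{i=1}^{N} (r_i - \bar{r})^2}}
\end{aligned}
\tag{6.3}
\]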
where π̄ is the mean of the inclusion probabilities and r̄ is the mean of the distances.2 In SRS, the location of a household has no bearing on the probability that it is selected. Our simulation indicates that this is not true for EPI. Nearly all correlations were negative, with many of them falling between −0.6 and −0.4. A negative correlation means that higher inclusion probabilities tend to be associated with households closer to the centre of the population area. However, the plots may not be representative of what happens in other populations even if the populations are of a similar type. For instance, when we generated 30 independent realizations of the loc sqr pattern, we found that on average, the relationship between inclusion probability and distance from the origin was much weaker (Table A.1). An analysis was also done for 30 realizations of loc rec, loc agg, and loc cgr. The main observation we made is that the magnitude of correlation was generally higher in the loc cgr populations, and it increased further when larger samples were taken or when neighbours were skipped in between successive selections.

2 Cov stands for covariance.

[Figure 6.8 plot: boxplots of π_i (y-axis) by sampling method (x-axis: nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3); panel columns loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr; panel rows n = 7, n = 15, n = 30.]
Figure 6.8: Boxplot of household inclusion probabilities (π_i) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. The horizontal line that appears across the plots indicates the average inclusion probability (π̄) for a given sample size (n). See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.
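As a small illustration of Equation (6.3), the following Python sketch computes the correlation between inclusion probabilities and distances from the origin. The coordinates and probabilities are hypothetical toy values chosen for the example, not results from the thesis, and the function name is ours.

```python
from math import sqrt

def population_correlation(pi, r):
    """Correlation as in Equation (6.3): the 1/N factors cancel."""
    N = len(pi)
    pi_bar = sum(pi) / N
    r_bar = sum(r) / N
    num = sum((p - pi_bar) * (d - r_bar) for p, d in zip(pi, r))
    den = (sqrt(sum((p - pi_bar) ** 2 for p in pi))
           * sqrt(sum((d - r_bar) ** 2 for d in r)))
    return num / den

# Hypothetical toy data: five households and made-up inclusion probabilities
# that decrease as we move away from the origin.
coords = [(0, 0), (100, 0), (0, 200), (300, 300), (-400, 100)]
pi = [0.30, 0.25, 0.20, 0.15, 0.10]
r = [sqrt(x ** 2 + y ** 2) for x, y in coords]

cor = population_correlation(pi, r)
print(f"Cor(pi, r) = {cor:.3f}")  # negative for this toy data
```

A negative value here reflects the same qualitative pattern described above: households nearer the centre carry larger inclusion probabilities.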
[Figure 6.9 plot: Cor(π_i, r_i) (y-axis) against sample size n (x-axis); panels loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr; one series per sampling method: nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3.]
Figure 6.9: Correlation between household inclusion probability (π_i) and household distance from the centre of the population area (r_i) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.
6.3 Additional Notes: Relations between Inclusion Probabilities

The following properties hold for any sampling plan that selects a fixed number n of units without replacement:

Property 1: \( \sum_{i=1}^{N} \pi_i = n \)

Property 2: \( \sum_{j \neq i}^{N} \pi_{ij} = (n - 1)\pi_i \)

Property 3: \( \sum_{i=1}^{N} \sum_{j > i}^{N} \pi_{ij} = \tfrac{1}{2} n(n - 1) \)
(Cochran, 1977). Recall that the sum of the probabilities for all possible samples is 1 because one of these samples must be observed. To get π_i, we add up the probabilities of samples containing unit i. Since there are n units in each sample, all of which are unique, this means that when the sum of inclusion probabilities for individual units is taken across the entire population, the probability of each sample is counted n times, which gives the first property. To get the second property, we use a similar argument. If all samples containing unit i were listed, we would know π_ij for every value of j ≠ i. We also know that the sum of probabilities for these samples is equal to π_i. Since unit i appears with n − 1 other units in each of these samples, the probability of each such sample is counted n − 1 times in Σ_{j≠i} π_ij, so the sum equals (n − 1)π_i. Finally, the third property is a consequence of the first two properties and the symmetry of inclusion probabilities for pairs of units (π_ij = π_ji).
To check the inclusion probabilities computed in our simulation study, we made
sure that all the relations above were true. For n = 7, we also compared the inclusion
probabilities to ones estimated by resampling from the population 25 000 times. The
difference between exact and estimated inclusion probabilities was no greater than
0.006.
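The kind of check described above can be sketched in a few lines of Python. The design below is a hypothetical toy example (N = 5, n = 3, with arbitrary unequal sample probabilities), not one of the EPI designs from the study, and the function name is ours; it only illustrates that Properties 1-3 hold once every possible sample and its probability are listed.

```python
from itertools import combinations
from math import isclose

def inclusion_probabilities(design, N):
    """First- and second-order inclusion probabilities from a listed design.

    `design` maps each sample (a frozenset of unit indices) to its probability.
    """
    pi = [0.0] * N
    pij = [[0.0] * N for _ in range(N)]
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob
        for i, j in combinations(sorted(sample), 2):
            pij[i][j] += prob
            pij[j][i] += prob
    return pi, pij

# Hypothetical design: all samples of size n = 3 from N = 5 units,
# with unequal probabilities proportional to 1, 2, ..., 10.
N, n = 5, 3
samples = [frozenset(s) for s in combinations(range(N), n)]
weights = range(1, len(samples) + 1)
total = sum(weights)
design = {s: w / total for s, w in zip(samples, weights)}

pi, pij = inclusion_probabilities(design, N)

prop1 = isclose(sum(pi), n)
prop2 = all(
    isclose(sum(pij[i][j] for j in range(N) if j != i), (n - 1) * pi[i])
    for i in range(N)
)
prop3 = isclose(
    sum(pij[i][j] for i in range(N) for j in range(i + 1, N)),
    n * (n - 1) / 2,
)
print(prop1, prop2, prop3)
```

Because the properties follow from counting alone, a failure of any of the three checks would point to an error in the sample listing or in the sample probabilities rather than to a feature of the design.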
Chapter 7
Estimator Properties in Simulated Populations
As shown in Chapter 6, the EPI method does not sample households with equal
probability. When EPI sampling is used, there can be large variation in the inclusion
probabilities of households. Furthermore, there was some correlation between the
inclusion probability of a household and the distance of the household from the cluster
origin, indicating that households with the largest probabilities are not randomly
located throughout the area.
This provides motivation for investigating how an estimator that assumes constant probability of selection for all households performs when this assumption is not
met, especially when the target variable exhibits non-random spatial trends. Therefore, it is of interest to compare this estimator to one that accounts for differences
in inclusion probabilities.
7.1 Simulation Design

7.1.1 Generation of Populations
Populations were created by assigning a binary outcome to each of the N = 150
households in loc reg, loc sqr, loc rec, loc agg, and loc cgr (Figure 6.1). There were
two components to household data: location, as specified by x- and y-coordinates,
and a response value z for the variable being measured. If household i had the
characteristic of interest, zi = 1; if it did not, zi = 0.
A Bernoulli random variable was generated to set the response value of an individual household. If the result was 1, then the characteristic of interest would be present
for that household. The probability of this occurrence depended on p, the proportion of households in the population targeted to have the characteristic of interest.
Five levels of p were considered: p = 0.10, 0.30, 0.50, 0.70, 0.90. The probability of
assigning the characteristic of interest to household i, denoted by P (Zi = 1), was
set as a function of the household’s geographic position in the cluster to allow for
finer control over how the characteristic of interest was spatially distributed in the
population. Six types of patterns were produced:
1. Random (val rdm). It was equally likely to observe the characteristic of interest anywhere in the population. The probabilities were P(Z_i = 1) = p for i = 1, 2, . . . , N.
2. Small pockets (val spk). The characteristic of interest was concentrated in certain neighbourhoods. A total of Np/5 households was chosen from the population using SRS. Their locations served as a reference for where a pocket of cases would occur. The pockets were created sequentially. For a given focal point, the five closest households that did not already have the characteristic of interest were identified and given the characteristic of interest.
3. Large pockets (val lpk). This is the same as the previous pocketing pattern except with Np/15 pocket focal points so that the characteristic of interest appeared in groups of 15 households.
4. Circular gradient (val cgr). Cases of the characteristic of interest decreased when moving away from the origin. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}}
\tag{7.1}
\]

for i = 1, 2, . . . , N where β0 ∈ (−∞, ∞) and β1 ∈ (0, ∞). Consequently, households that fall on the same circle centred at the origin have the same probability of having the characteristic of interest, and since β1 is positive, this probability is lower for households that are further away from the origin.1

In order to calculate the probabilities, values for β0 and β1 had to be specified. We fixed β0 so that had there been a household exactly at the origin, the resulting probability would be 0.95 if p = 0.10 or p = 0.30, it would be 0.975 if p = 0.50, and it would be 0.99 if p = 0.70 or p = 0.90. Given β0, we solved for β1 in the equation

\[
\begin{aligned}
E\left(\sum_{i=1}^{N} Z_i\right) &= Np \\
\sum_{i=1}^{N} E(Z_i) &= Np \\
\sum_{i=1}^{N} P(Z_i = 1) &= Np \\
\sum_{i=1}^{N} \frac{1}{1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}} &= Np.
\end{aligned}
\tag{7.2}
\]

This ensured that the expected number of cases was equal to the target number. It is not trivial to solve Equation (7.2) as there is no closed-form solution for β1. However, an approximation was obtained using the uniroot function in base R (stats package) (R Core Team, 2015).2 The maximum number of iterations was set to 10^5 and the tolerance limit was set to 2.2 × 10^−16.

5. Diagonal gradient (val dgr). Cases of the characteristic of interest decreased when moving from the southwest corner to the northeast corner of the study area. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + (y_i - y_{\min})]}},
\tag{7.3}
\]

for i = 1, 2, . . . , N where β0 ∈ (−∞, ∞), x_min = min{x_1, x_2, . . . , x_N} and y_min = min{y_1, y_2, . . . , y_N}. Let c be some fixed but arbitrary positive constant. Probability contour levels are lines of the form

\[
\begin{aligned}
c &= (x - x_{\min}) + (y - y_{\min}) \\
y &= -x + (x_{\min} + y_{\min} + c).
\end{aligned}
\tag{7.4}
\]

The same values for β0 were used as in the circular gradient pattern, and values for β1 were approximated to satisfy

\[
\sum_{i=1}^{N} \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + (y_i - y_{\min})]}} = Np.
\tag{7.5}
\]

Calculations were carried out using the uniroot function in base R (stats package) (R Core Team, 2015). The maximum number of iterations was set to 10^5 and the tolerance limit was set to 2.2 × 10^−16.

6. Horizontal gradient (val hgr). This is the same as val dgr except probability contour levels are represented by steeper lines, making the gradient almost horizontal. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + \frac{1}{250}(y_i - y_{\min})]}},
\tag{7.6}
\]

for i = 1, 2, . . . , N where β0 ∈ (−∞, ∞), x_min = min{x_1, x_2, . . . , x_N} and y_min = min{y_1, y_2, . . . , y_N}.

1 If β1 were negative, we would see the opposite effect; households further away from the origin would be more likely to have the characteristic of interest.

2 A rudimentary approach is to use Newton's method. Let \( g(\beta_1) = \sum_{i=1}^{N} 1/\left(1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}\right) - Np \). Since g(β1) is differentiable, we may compute \( \beta_1^{(1)} = \beta_1^{(0)} - g(\beta_1^{(0)})/g'(\beta_1^{(0)}) \), where \( \beta_1^{(0)} \) is an initial guess at the root. The solution is updated by substituting \( \beta_1^{(1)} \) for \( \beta_1^{(0)} \) in the equation above. This iterative process continues until the maximum number of iterations is reached or until the difference between \( \beta_1^{(n+1)} \) and \( \beta_1^{(n)} \) is smaller than the tolerance limit.
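The root-finding step for the circular gradient pattern (Equation (7.2)) can be illustrated with a short Python sketch. This is not the thesis code: it uses a plain bisection search as a stand-in for R's uniroot, the household coordinates are randomly generated placeholders, and the function names, bracket [0, 0.1] and tolerance are assumptions made for the example.

```python
import random
from math import exp, log, sqrt

def expected_cases(beta0, beta1, coords):
    """Left-hand side of Equation (7.2): sum of P(Z_i = 1) over households."""
    return sum(1.0 / (1.0 + exp(beta0 + beta1 * sqrt(x * x + y * y)))
               for x, y in coords)

def solve_beta1(beta0, coords, p, lo=0.0, hi=0.1, tol=1e-12, max_iter=10**5):
    """Bisection for beta1 (assumed bracket [lo, hi]); not R's uniroot."""
    target = len(coords) * p

    def g(beta1):
        return expected_cases(beta0, beta1, coords) - target

    # g decreases as beta1 grows, so the root is bracketed when g(lo) > 0 > g(hi).
    if g(lo) <= 0 or g(hi) >= 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Placeholder population: 150 households scattered over [-500, 500]^2.
rng = random.Random(1)
coords = [(rng.uniform(-500, 500), rng.uniform(-500, 500)) for _ in range(150)]

p = 0.50
beta0 = log(0.025 / 0.975)  # gives P(Z = 1) = 0.975 at the origin, as for p = 0.50
beta1 = solve_beta1(beta0, coords, p)
print(beta1, expected_cases(beta0, beta1, coords))  # second value is close to Np = 75
```

Bisection is slower than Newton's method but needs no derivative and cannot overshoot the bracket, which makes it a convenient choice for a sketch; the diagonal and horizontal gradient patterns would only require swapping the distance term inside `expected_cases`.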
[Figure 7.1 plot: ten scatter plots of household locations (x and y axes from −500 to 500), with points marked by characteristic of interest: Not Present (0) or Present (1). Panels: loc_reg; val_dgr, loc_sqr; val_spk, loc_rec; val_rdm, loc_agg; val_hgr, loc_cgr; val_cgr, loc_reg; val_rdm, loc_sqr; val_hgr, loc_rec; val_cgr, loc_agg; val_spk, loc_cgr; val_lpk.]
Figure 7.1: Spatial distributions of the target variable when the proportion of households with the characteristic of interest is p = 0.50. The first item in the population label refers to the spatial distribution of households and the second item refers to the spatial distribution of the target variable. See pages viii-x for the meaning of abbreviations used.
While demographics may differ from one district to another, it may be reasonable
to assume that within a small community of 150 households there is spatial homogeneity. Therefore, the presence of children or seniors could be examples of household
characteristics that are evenly distributed throughout a cluster. Conversely, a pocketing pattern is likely to be seen for a disease transmitted through close contact. It
could also represent the geographic distribution of households that have lost power
in an emergency or households whose members have experienced violence related to
group conflict. Another variable that is tied to location is whether a household gets
its drinking water from the municipal grid or some other source. Households near
the centre of the cluster may be connected to the municipal grid while those on the
outskirts may use a spring or pond as their main supply of water. This is best captured by the circular gradient pattern. Lastly, we can consider a scenario where we
are interested in estimating the proportion of households that did not evacuate their
residence during a landslide or flood. One side or corner of the cluster may be at a
lower risk, so as we move in that direction, we may find more and more homes that
are occupied. Hence, cases of this outcome may be distributed similarly to a diagonal
gradient or horizontal gradient pattern.
Five realizations of each of the patterns were generated for every level of p and
set of household locations. Thus, 750 populations were analyzed in the study, a few
of which are illustrated in Figure 7.1. However, there are only five arrangements
of households underlying these populations since household locations remained fixed
for a given type of household spatial distribution.
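For concreteness, the sequential pocketing procedure described for val spk in this section can be sketched as follows. This is a hypothetical Python illustration, not the thesis code: the coordinates are random placeholders, the function name is ours, and we assume that the focal household itself counts among the five closest households when it does not yet have the characteristic.

```python
import random
from math import hypot

def small_pockets(coords, p, pocket_size=5, seed=None):
    """Assign a 0/1 outcome in pockets around focal households chosen by SRS."""
    rng = random.Random(seed)
    N = len(coords)
    n_focal = round(N * p / pocket_size)   # Np/5 focal points for val spk
    focal = rng.sample(range(N), n_focal)  # SRS without replacement
    z = [0] * N
    for f in focal:                        # pockets are created sequentially
        fx, fy = coords[f]
        # Households without the characteristic, ordered by distance to the focal point.
        candidates = sorted(
            (i for i in range(N) if z[i] == 0),
            key=lambda i: hypot(coords[i][0] - fx, coords[i][1] - fy),
        )
        for i in candidates[:pocket_size]:
            z[i] = 1
    return z

# Placeholder population: 150 households scattered over [-500, 500]^2.
rng = random.Random(0)
coords = [(rng.uniform(-500, 500), rng.uniform(-500, 500)) for _ in range(150)]
z = small_pockets(coords, p=0.50, seed=1)
print(sum(z))  # each focal point adds exactly five new cases
```

Because every focal point contributes exactly `pocket_size` new cases, the target count Np is met without any later adjustment; calling the function with `pocket_size=15` would mimic the large-pocket pattern.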
For val rdm, val cgr, val dgr, and val hgr patterns, cases of the characteristic of interest were randomly added or removed after the initial procedure to attain the desired proportion. In most instances (87% of the populations created), the adjustment that was made to the number of cases was less than 10% of the target number of cases (Table 7.1).3

Table 7.1: Cases of the characteristic of interest added or removed from populations after the initial population generation procedure to attain a certain proportion (p) of households with the characteristic of interest.

Spatial distribution   Average number of cases added (+) or removed (−)       Maximum change (abs) in number of cases
of target variable     p = 0.10  p = 0.30  p = 0.50  p = 0.70  p = 0.90       p = 0.10  p = 0.30  p = 0.50  p = 0.70  p = 0.90
val rdm                -0.32     -0.72     -0.68     -1.64     -1.04          8         11        11        13        6
val spk                 0.00      0.00      0.00      0.00      0.00          0         0         0         0         0
val lpk                 0.00      0.00      0.00      0.00      0.00          0         0         0         0         0
val cgr                 0.08      0.36     -0.52     -0.28     -1.28          4         10        10        13        8
val dgr                -0.20     -0.32      1.28     -1.40      0.24          6         8         16        11        9
val hgr                 0.16      0.04     -0.88     -0.56     -0.20          5         8         12        11        7

abs refers to absolute value. See pages viii-x for the meaning of abbreviations used for the spatial distribution of the target variable.
7.1.2 Sampling Plans

Households were sampled in the same manner as they were in Chapter 6. The sampling plans were based on the EPI method, but used varying sample sizes (n = 7, 15, 30), sector sizes (α = 2π, π/8, π/32 rad), and skipping rules (k = 1, 3) for a total of 18 combinations.
7.1.3 Estimation of Population Proportion

The proportion of households with the characteristic of interest was estimated in three ways.

3 We may intend for the characteristic of interest to be present in half of the population, but because we are using a random procedure to generate outcomes, we could end up with only 70 out of the 150 households with the characteristic of interest rather than 75 households. Therefore, we need to create five more instances of the characteristic of interest, which represents 5/75 ≈ 0.07 of the target number of cases.
The first estimator uses the standard formula for a mean. If S is the set of units in a sample, then the population proportion is estimated by

\[
\hat{p}_{EQW} = \frac{1}{n} \sum_{i \in S} z_i.
\tag{7.7}
\]

Under this procedure, all observations receive equal weight (EQW).
Another estimator that we considered was the Horvitz-Thompson (HT) estimator. It is a weighted average given by

\[
\hat{p}_{HT} = \frac{1}{N} \sum_{i \in S} w_i z_i.
\tag{7.8}
\]

The response from household i is weighted at w_i = 1/π_i, where π_i is the probability that household i is in the sample. Alternatively, p̂_HT can be expressed as

\[
\hat{p}_{HT} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i H_i,
\tag{7.9}
\]

where H_i = 1 if unit i is in the sample S and 0 otherwise. It follows that P(H_i = 1) = π_i. Since

\[
\begin{aligned}
E(\hat{p}_{HT})
&= E\left(\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i H_i\right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i E(H_i) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i P(H_i = 1) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i \pi_i \\
&= \frac{1}{N} \sum_{i=1}^{N} z_i \\
&= p
\end{aligned}
\tag{7.10}
\]

the Horvitz-Thompson estimator has the advantage of being unbiased regardless of how the sample is taken, provided that every household has a positive inclusion probability (Cochran, 1977).
Lastly, we looked at an estimator which is essentially the Horvitz-Thompson estimator, but truncated at 1 (HTR). It is calculated as

\[
\hat{p}_{HTR} = \min\{\hat{p}_{HT}, 1\}.
\tag{7.11}
\]
For the populations and sampling plans in the investigation, several households end up with a selection probability so small that when it is converted to a sampling weight, the sampling weight is greater than the population size N = 150 (Figure 7.2). If even one of these households appears in the resulting sample and it has the characteristic of interest, then the estimate of the population proportion as calculated using the HT estimator will exceed 1. This situation arises more generally. As long as the combined weights of the sampled units with the characteristic of interest are greater than N, then p̂_HT > 1. Proportions are only meaningful when they are between 0 and 1. Therefore, it was necessary to devise a restricted version of the Horvitz-Thompson estimator.

[Figure 7.2 plot: boxplots of w_i = 1/π_i (y-axis, 0 to 400) by sampling method (x-axis: nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3); panel columns loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr; panel rows n = 7, n = 15, n = 30.]
Figure 7.2: Boxplot of household sampling weights (w_i) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. π_i denotes the inclusion probability for household i; n denotes the number of households sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.
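The three estimators in Equations (7.7), (7.8) and (7.11) can be written compactly in Python. The sketch below uses a hypothetical population of N = 6 households with made-up inclusion probabilities (summing to n = 3); the function names and values are ours and serve only to show how a single household with a very small π_i can push the HT estimate above 1.

```python
def p_eqw(sample, z):
    """Equally weighted estimator, Equation (7.7)."""
    return sum(z[i] for i in sample) / len(sample)

def p_ht(sample, z, pi):
    """Horvitz-Thompson estimator, Equation (7.8), with weights w_i = 1 / pi_i."""
    N = len(z)
    return sum(z[i] / pi[i] for i in sample) / N

def p_htr(sample, z, pi):
    """Restricted Horvitz-Thompson estimator, Equation (7.11)."""
    return min(p_ht(sample, z, pi), 1.0)

# Hypothetical toy population (not from the thesis).
z = [1, 0, 1, 1, 0, 0]                      # true proportion p = 0.5
pi = [0.05, 0.60, 0.50, 0.70, 0.65, 0.50]   # made-up inclusion probabilities
sample = [0, 2, 4]                          # household 0 has weight 1/0.05 = 20 > N

est_eqw = p_eqw(sample, z)
est_ht = p_ht(sample, z, pi)
est_htr = p_htr(sample, z, pi)
print(est_eqw, est_ht, est_htr)
```

In this toy case the weight of household 0 alone exceeds N = 6, so the HT estimate is larger than 1 and the restricted version returns 1, mirroring the situation described above for N = 150.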
7.1.4 Evaluation of Estimators
Estimators were assessed on their accuracy and precision. For each of the stated estimators, we computed the expected value as

\[
E(\hat{p}) = \sum_{S} \hat{p}_S P(S),
\tag{7.12}
\]

and the variance as

\[
Var(\hat{p}) = \sum_{S} (\hat{p}_S - E(\hat{p}))^2 P(S),
\tag{7.13}
\]

where p̂_S is the proportion estimated from the data in sample S, and P(S) is the probability that sample S is observed. These computations are possible because all samples from a population can be listed along with their corresponding probability (see algorithm in Chapter 5 and Appendix B.3.3). Since the true population proportion was known, we were also able to measure the bias of an estimator

\[
Bias(\hat{p}) = E(\hat{p}) - p,
\tag{7.14}
\]

as well as the mean square error

\[
MSE(\hat{p}) = Var(\hat{p}) + (Bias(\hat{p}))^2.
\tag{7.15}
\]
Table 7.2: Range of properties for the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator across 13 500 simulated scenarios. The scenarios were based on 750 populations and 18 sampling plans as described in Sections 7.1.1 and 7.1.2.

Estimator         Bias     Variance   MSE      DE
EQW     Min       -0.17    0.0005     0.0005   0.20
        Mean       0.01    0.0329     0.0347   2.66
        Max        0.20    0.2257     0.2275   25.39
HT      Min        0.00    0.0004     0.0004   0.16
        Mean       0.00    0.0800     0.0800   9.34
        Max        0.00    0.4590     0.4590   92.38
HTR     Min       -0.19    0.0004     0.0004   0.16
        Mean      -0.03    0.0449     0.0471   4.56
        Max        0.00    0.1930     0.1998   37.67

Min refers to minimum value; Max refers to maximum value; MSE refers to mean square error; DE refers to design effect.
Finally, to compare the variability of an estimator to the variability of the standard
estimator under SRS, we checked the design effect,
\[
\mathrm{DE}(\hat{p}) = \frac{\mathrm{Var}(\hat{p})}{\mathrm{Var}(\hat{p}_{SRS})}
= \frac{\mathrm{Var}(\hat{p})}{\dfrac{N-n}{N-1}\,\dfrac{p(1-p)}{n}}. \tag{7.16}
\]
Together, these properties could help provide a picture of the quality of estimation from using a certain estimator and what impact population and sampling plan
decisions have on estimation.
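The calculations in Equations (7.12) to (7.16) can be sketched in a few lines once every possible sample has been listed with its estimate and probability. The function below is only an illustration, not the code used for the simulations; the name `evaluate` and the toy distribution (three possible estimates with made-up probabilities for N = 4, n = 2, p = 0.5) are assumptions.

```python
def evaluate(estimates, probs, p_true, N, n):
    """Exact estimator properties from the full sampling distribution.

    estimates[s] is the estimate from sample s and probs[s] is P(S = s).
    """
    expected = sum(e * q for e, q in zip(estimates, probs))                     # Eq. (7.12)
    variance = sum((e - expected) ** 2 * q for e, q in zip(estimates, probs))   # Eq. (7.13)
    bias = expected - p_true                                                    # Eq. (7.14)
    mse = variance + bias ** 2                                                  # Eq. (7.15)
    var_srs = (N - n) / (N - 1) * p_true * (1 - p_true) / n   # variance under SRS
    return {"E": expected, "Var": variance, "Bias": bias,
            "MSE": mse, "DE": variance / var_srs}                               # Eq. (7.16)

# Toy sampling distribution (hypothetical): the estimator takes the values
# 1.0, 0.5 and 0.0 with probabilities 0.25, 0.50 and 0.25.
res = evaluate([1.0, 0.5, 0.0], [0.25, 0.5, 0.25], p_true=0.5, N=4, n=2)
print(res)
```

In this toy case the estimator is unbiased, so the MSE equals the variance, and the design effect is above 1 because the variance (0.125) is larger than the SRS variance (about 0.083).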
[Figure 7.3: two histogram panels (EQW and HTR); x-axis: Bias(p̂), ranging from −0.2 to 0.2; y-axis: relative frequency.]
Figure 7.3: Histogram of estimator bias across 13 500 simulated scenarios. EQW
refers to the equally weighted estimator; HTR refers to the restricted Horvitz-Thompson estimator. The scenarios are based on 750 populations and 18 sampling
plans as described in Sections 7.1.1 and 7.1.2.
7.2 Simulation Results
The distributions of EQW, HT, and HTR estimators were obtained under 750×18 =
13 500 population and sampling settings. Estimator properties were calculated from
these distributions accordingly. The minimum, mean, and maximum values that were
observed for the bias, variance, mean square error, and design effect associated with
each type of estimator are given in Table 7.2. The histograms in Figures 7.3 and 7.4
provide a graphical summary of the individual results across the different simulation
scenarios. The plots in Figures 7.5-7.7 represent marginal means and show how an
estimator property varies over the levels of a factor while controlling for sample size
and the proportion of households in the population with the characteristic of interest.
The factors examined in the plots are the spatial distribution of the target variable and
the sampling method. Data corresponding to these plots can be found in Appendix A
(Tables A.2-A.9), which also includes results relating to the spatial distribution of
households.

[Figure 7.4: two rows of histogram panels (EQW, HT and HTR in each row); top row x-axis: Var(p̂); bottom row x-axis: DE(p̂); y-axis: relative frequency.]
Figure 7.4: Histogram of estimator variance and design effect across 13 500 simulated
scenarios. EQW refers to the equally weighted estimator; HT refers to the Horvitz-Thompson estimator; HTR refers to the restricted Horvitz-Thompson estimator.
The scenarios are based on 750 populations and 18 sampling plans as described in
Sections 7.1.1 and 7.1.2.
Within our simulation study, the bias of HT was always equal to 0, agreeing
with what is indicated by the theory. However, we found that some HT estimates
reached as high as 4.9, particularly when p > 0.50. This never happened for the
EQW or HTR estimators, which were always bounded by 0 and 1. However, ensuring
that estimates were less than or equal to 1 for HTR came at the cost of losing
the property of unbiasedness. While EQW and HTR exhibited bias, values peaked
around 0, demonstrating that bias was not too large in most of the scenarios that
we looked at in our study. The magnitude of the bias was less than 0.05 in 82% of
the results for EQW and 77% of the results for HTR.
For EQW, the most negative biases (representing 2.5% of results) were between
-0.17 and -0.060, and the most positive biases (representing 2.5% of results) were
between 0.11 and 0.20. Large negative biases typically occurred when estimation
was performed on populations where households were randomly placed over a square
(loc sqr) and cases of the characteristic of interest were concentrated in the west end
of the population area (val hgr). Yet, this might not hold in general because for
loc sqr the probability of sampling households from the west end of the population
area happened to be lower compared to other realizations of the same household
spatial distribution.
In contrast, large positive biases for EQW typically occurred when estimation
was performed on populations where the concentration of households and cases of
the characteristic of interest increased towards the centre of the cluster (loc cgr;
val cgr). Even when results are averaged across household spatial patterns and sampling methods, the bias associated with val cgr remains positive and significantly
higher than what is observed for other spatial distributions of the target variable.
The marginal plots for bias also suggest that a pocketing pattern leads to moderate
positive biases. However, when individual results were analyzed, there were several
instances where a large bias was associated with a population that had a pocketing
pattern, but the bias was negative. In these populations, pockets were located near the
periphery of the cluster.

[Figure 7.5, panel (a): Bias(p̂) plotted against the spatial distribution of the target variable, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30), with separate points for EQW, HT and HTR. Results averaged across spatial distributions of households and sampling methods.]
[Figure 7.5, panel (b): Bias(p̂) plotted against the sampling method, with the same grid and legend. Results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.5: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT)
estimator and the restricted Horvitz-Thompson (HTR) estimator. See Tables A.2 and
A.3 for values of the bias when n = 7, 30. p denotes the proportion of households in the
population with the characteristic of interest; n denotes the number of households that were
sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution
of the target variable and the sampling method.
Overall, extreme EQW biases often coincided with the population proportion
being around a half and with large sample sizes. Choice of sampling plan had less of
an impact on bias compared to the spatial distribution of the target variable.
Whenever HTR was biased, the bias was negative. This is to be expected, since
by definition, the HTR estimate must be less than or equal to the HT estimate for
each and every sample. Unlike EQW, the magnitude of the bias for HTR continued to
grow as the proportion increased beyond p = 0.50, and bias tended to be greater for
n = 7 compared to n = 30. The most negative biases (representing 2.5% of results)
ranged from -0.19 to -0.13. These results were observed for all household spatial
distributions except for when households were regularly spaced (loc reg). Results
were mixed for the spatial distribution of the target variable where the bias of HTR
was large. Populations where the bias of HTR was close to 0 (magnitude less than
0.01) were also of varying types, but there were twice as many populations with the
loc reg pattern than those with other household spatial distributions.
Overall, the HT estimator was more variable than the other estimators. When
variances were compared case by case, the variance of the HT estimator exceeded
the variance of the EQW estimator 87% of the time. When the variance of the
HT estimator was less than the variance of the EQW estimator, the associated
population usually had positive outcomes of the target variable distributed as a
circular gradient (val cgr) and positive outcomes were present in less than half of the
population (p < 0.50). As for the variance of HTR, it was equal to the variance of
the HT estimator in 25% of the scenarios, and fell between the variances of the HT
and EQW estimators in 62% of the scenarios.

[Figure 7.6, panel (a): Var(p̂) plotted against the spatial distribution of the target variable, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30), with separate points for EQW, HT and HTR. Results averaged across spatial distributions of households and sampling methods.]
[Figure 7.6, panel (b): Var(p̂) plotted against the sampling method, with the same grid and legend. Results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.6: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson
(HT) estimator and the restricted Horvitz-Thompson (HTR) estimator. See Tables A.4
and A.5 for values of the variance when n = 7, 30. p denotes the proportion of households
in the population with the characteristic of interest; n denotes the number of households
that were sampled. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of the target variable and the sampling method.
While the EQW and HTR estimators had variances that rose then dropped
around the p = 0.50 mark, the variance of the HT estimator kept rising. Therefore, as the proportion of households with the characteristic of interest increased, so
too did the difference between the variance of the HT estimator and the variance of
the other estimators. When the highest variances were identified (those in the top
2.5%), the range was 0.22 to 0.46 for HT, and 0.12 to 0.19 for EQW. Many of these
results came from populations that had an aggregated settlement pattern (loc agg)
and had cases of the characteristic of interest distributed in large pockets (val lpk).
Regardless of population type, variances tended to be higher when no neighbours
were skipped during the sampling process than when neighbours were skipped. For the HT estimator,
variances reached even higher levels when a random sector was used to pick the first
unit of the sample. For all estimators, variance decreased as sample size increased.
Plots of MSE have been omitted as they are almost identical to the plots for
variance. For the estimators and scenarios investigated in this study, the major
source of MSE was variance. Hence, the values computed for MSE were only slightly
greater than the values computed for variance (refer to Tables A.6 and A.7), and the
comments previously made about variance generally apply to MSE.
[Figure 7.7, panel (a): DE(p̂) plotted against the spatial distribution of the target variable, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30), with separate points for EQW, HT and HTR. Results averaged across spatial distributions of households and sampling methods.]
[Figure 7.7, panel (b): DE(p̂) plotted against the sampling method, with the same grid and legend. Results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.7: Design effect of EPI sampling relative to SRS for the equally weighted (EQW)
estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson
(HTR) estimator. See Tables A.8 and A.9 for values of the design effect when n = 7, 30. p
denotes the proportion of households in the population with the characteristic of interest;
n denotes the number of households that were sampled. See pages viii-x for the meaning
of abbreviations used for the spatial distribution of the target variable and the sampling
method.
While DE followed a similar trend as variance when results were compared across
the various spatial distributions of the target variable and sampling methods, DE
did not decrease as sample size increased. Despite a larger sample, the variance of
the estimators did not shrink as much as it would have if SRS had been performed.
For HT, 87% of the observed DEs were over 2; for HTR, 81% were over 2; for
EQW, 45% were over 2. However, the maximum DE for HT was 92, well above
the maximum DE for HTR, which was 38. These occurred in separate populations,
but both populations had p = 0.90 in common. The DE for EQW was generally
lower than HT and HTR, though it still reached a high of 25. This was seen in a
population where the characteristic of interest had a pocketing pattern and p = 0.50.
Regardless of the estimator, large DEs (top 2.5% of results) were associated more
with sampling methods that did not skip households as opposed to those that did.
7.3 Additional Notes: Variance of the Horvitz-Thompson Estimator
Due to the particular structure of the Horvitz-Thompson estimator, its variance
can be calculated using other formulas besides the one in Equation (7.13). These
formulas make use of the inclusion probability of individual units (πi ) and pairs of
units (πij ) rather than the probability of a sample. The variance can be calculated
as
\[
\mathrm{Var}(\hat{p}_{HT}) = \frac{1}{N^2}\left( \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i}\, z_i^2
+ 2\sum_{i=1}^{N}\sum_{j>i} \frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\, z_i z_j \right) \tag{7.17}
\]
or equivalently,
\[
\mathrm{Var}(\hat{p}_{HT}) = \frac{1}{N^2}\left( \sum_{i=1}^{N}\sum_{j>i} (\pi_i\pi_j-\pi_{ij})
\left(\frac{z_i}{\pi_i}-\frac{z_j}{\pi_j}\right)^2 \right), \tag{7.18}
\]
which we used to verify the results of our simulation. Details of the derivation of
Equation (7.18) from Equation (7.17) are given in Cochran (1977).
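The kind of check described above can be sketched on a toy design. The example below is not taken from our simulations: the four possible samples, their probabilities and the outcomes z are made up, and the helper names `pij` and `ht` are ours. It computes the variance of the HT estimator directly from the sampling distribution, as in Equation (7.13), and compares it with Equations (7.17) and (7.18). The design has a fixed sample size of 2, which Equation (7.18) relies on.

```python
from itertools import combinations

# Hypothetical fixed-size design on N = 4 units: sample -> probability.
design = {(0, 1): 0.4, (0, 2): 0.1, (1, 3): 0.2, (2, 3): 0.3}
z = [1, 1, 0, 0]   # made-up indicator of the characteristic of interest
N = len(z)

# First-order inclusion probabilities pi_i.
pi = [sum(q for s, q in design.items() if i in s) for i in range(N)]

def pij(i, j):
    """Joint inclusion probability of units i and j (0 if never sampled together)."""
    return sum(q for s, q in design.items() if i in s and j in s)

def ht(s):
    """HT estimate of the proportion from sample s."""
    return sum(z[i] / pi[i] for i in s) / N

# Direct calculation from the sampling distribution, as in Eq. (7.13).
mean = sum(ht(s) * q for s, q in design.items())
var_direct = sum((ht(s) - mean) ** 2 * q for s, q in design.items())

pairs = list(combinations(range(N), 2))   # all pairs with i < j

# Eq. (7.17)
var_717 = (sum((1 - pi[i]) / pi[i] * z[i] ** 2 for i in range(N))
           + 2 * sum((pij(i, j) - pi[i] * pi[j]) / (pi[i] * pi[j]) * z[i] * z[j]
                     for i, j in pairs)) / N ** 2

# Eq. (7.18)
var_718 = sum((pi[i] * pi[j] - pij(i, j)) * (z[i] / pi[i] - z[j] / pi[j]) ** 2
              for i, j in pairs) / N ** 2

print(var_direct, var_717, var_718)
```

All three values agree up to floating-point error, even though two pairs of units in this toy design have a joint inclusion probability of 0.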
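The variance of p̂HT can be estimated by
\[
\widehat{\mathrm{Var}}_1(\hat{p}_{HT}) = \frac{1}{N^2}\left[ \sum_{i\in S} \frac{1-\pi_i}{\pi_i^2}\, z_i^2
+ 2\sum_{i\in S}\sum_{\substack{j\in S\\ j>i}} \frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}\, z_i z_j \right] \tag{7.19}
\]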
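or
\[
\widehat{\mathrm{Var}}_2(\hat{p}_{HT}) = \frac{1}{N^2}\left[ \sum_{i\in S}\sum_{\substack{j\in S\\ j>i}}
\frac{\pi_i\pi_j-\pi_{ij}}{\pi_{ij}} \left(\frac{z_i}{\pi_i}-\frac{z_j}{\pi_j}\right)^2 \right]. \tag{7.20}
\]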
While the estimators in Equations (7.19) and (7.20) may not give the same value
for a given sample, both are unbiased estimators of Var(p̂HT) if πij > 0 for all i
and j (Cochran, 1977). However, in EPI sampling, many pairs of households cannot
appear together in a sample, so their joint inclusion probability is 0.
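The effect of zero joint inclusion probabilities can be illustrated on a toy design. The sketch below is hypothetical and is not one of our simulated scenarios: the design, the outcomes z and the function names are made up. The two estimators are written in the forms given by Cochran (1977), with each sampled pair weighted by 1/πij, which is our reading of Equations (7.19) and (7.20).

```python
from itertools import combinations

# Hypothetical fixed-size design on N = 4 units; pairs (0, 3) and (1, 2)
# never appear together, so their joint inclusion probability is 0.
design = {(0, 1): 0.4, (0, 2): 0.1, (1, 3): 0.2, (2, 3): 0.3}
z = [1, 1, 0, 1]   # made-up indicator of the characteristic of interest
N = len(z)

pi = [sum(q for s, q in design.items() if i in s) for i in range(N)]

def pij(i, j):
    """Joint inclusion probability of units i and j."""
    return sum(q for s, q in design.items() if i in s and j in s)

def ht(s):
    """HT estimate of the proportion from sample s."""
    return sum(z[i] / pi[i] for i in s) / N

def var_hat_1(s):
    """First variance estimator, evaluated on sample s."""
    single = sum((1 - pi[i]) / pi[i] ** 2 * z[i] ** 2 for i in s)
    cross = sum((pij(i, j) - pi[i] * pi[j]) / (pij(i, j) * pi[i] * pi[j]) * z[i] * z[j]
                for i, j in combinations(s, 2))
    return (single + 2 * cross) / N ** 2

def var_hat_2(s):
    """Second variance estimator, evaluated on sample s."""
    return sum((pi[i] * pi[j] - pij(i, j)) / pij(i, j) * (z[i] / pi[i] - z[j] / pi[j]) ** 2
               for i, j in combinations(s, 2)) / N ** 2

# True variance of the HT estimator from its exact sampling distribution.
mean = sum(ht(s) * q for s, q in design.items())
true_var = sum((ht(s) - mean) ** 2 * q for s, q in design.items())

v1 = {s: var_hat_1(s) for s in design}
v2 = {s: var_hat_2(s) for s in design}
exp_v1 = sum(v1[s] * q for s, q in design.items())
exp_v2 = sum(v2[s] * q for s, q in design.items())

print(true_var, exp_v1, exp_v2)
print(min(v1.values()), min(v2.values()))
```

In this toy design neither estimator has expectation equal to the true variance, and each takes a negative value for at least one sample.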
Although we did not compute the estimated variance for every scenario in our
simulation, we found that for the population with randomly placed households over
a square area (loc reg) where the characteristic of interest was randomly assigned
(val rdm) to 75 out of the 150 households (p = 0.50), if a sample of 7 households was
selected using the EPI method where the first household was randomly selected from
the entire population and no neighbours were skipped, then Var(p̂HT) = 0.049 but
E(V̂ar1(p̂HT)) = 0.054 and E(V̂ar2(p̂HT)) = 0.118. Furthermore, the first variance
estimator yielded negative values in 19% of the samples (the probability of observing
any of these samples is 0.17) and the second estimator yielded negative values in
74% of the samples (the probability of observing any of these samples is 0.77).4,5

4 In this example, there are 91 different combinations of households that have a non-zero probability of being drawn. However, these combinations do not all have the same probability of being
drawn.
5 Cochran (1977) noted that negative variance estimates are possible when the estimators in
Equations (7.19) and (7.20) are used.
Chapter 8
Summary, Discussion and Future Directions
The EPI method certainly has several operational benefits, the main one being that
it can be implemented even without having a list of every household at a site. This
is especially helpful in settings where there are just not enough time and resources
to properly construct a sampling frame. The only information that really needs to
be known beforehand is the relative sizes of the clusters making up the population.
At the same time, there are obvious concerns about the data coming from an
EPI sample. While the EPI method follows the usual procedure of selecting clusters
with PPS, the way that it selects elements at the very last stage of sampling remains
fundamentally different from SRS or any other sampling scheme for which there is a
well-established theory on estimation.
8.1 Summary and Discussion
The goal throughout this thesis has been to gain a fuller understanding of how
units are sampled when the EPI method of selecting nearest neighbours is used.
Our analysis takes a probabilistic approach, and we ultimately use the results to
investigate properties of the estimator for a population proportion.
Previous assessments of the estimation in the EPI sampling design were done in
the context of a multi-cluster population. In these studies, an estimate was based
on a sample that combined subsamples drawn from several subpopulations. Our
research focused on the sampling of units that happens within a cluster since this
is where EPI diverges from typical implementations of two-stage cluster sampling.
By doing the analysis at this level, we aimed to provide some insight as to how the
statistical properties of the sample proportion differ when SSUs are picked according
to EPI versus random selection.
Besides working with single-cluster populations, there are other aspects of our
simulation study that make it different from those that have already been published.
First, we set sample size in terms of number of households so that our analysis
would be relevant for a general household survey. Even for surveys focused on a
specific demographic group, it has been suggested that the number of households to
be visited should be fixed (Brogan et al., 1994). Often, these surveys still include
questions about household-level characteristics (SMART, 2012).
Second, we obtained inclusion probabilities for the households in the populations that we generated. We developed an algorithm to compute the probability of
sampling a household when the starting point is chosen from the households that
lie in a random sector and when the starting point is chosen randomly among all
households in the population. Additionally, we have an algorithm for when sampling is
performed by taking every kth neighbour encountered along a path of nearest neighbours. Bennett et al. (1994) proposed several adaptations of the EPI method, but
we only investigated the one that uses the skipping rule since this was the one more
frequently applied in real surveys (Fonn et al., 2006; Grais et al., 2007; Vinck and
Bell, 2011).
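To make the idea concrete, the following is a minimal sketch of a nearest-neighbour path with a skipping rule and of inclusion probabilities obtained by enumerating every possible starting household. It is not the algorithm of Chapter 5 and Appendix B.3.3, and it rests on several assumptions: the next household is the nearest one not yet visited, the starting household is the first one sampled, ties are broken by household index, and the start is chosen uniformly from all households. The function names and the five coordinates are hypothetical.

```python
import math

def epi_sample(coords, start, n, k=1):
    """Follow a chain of nearest unvisited neighbours from `start`,
    keeping every k-th household encountered until n households are sampled."""
    unvisited = set(range(len(coords))) - {start}
    current, sample, step = start, [start], 0
    while len(sample) < n and unvisited:
        # Nearest household not yet visited (ties broken by index).
        current = min(unvisited,
                      key=lambda h: (math.dist(coords[current], coords[h]), h))
        unvisited.remove(current)
        step += 1
        if step % k == 0:          # skipping rule: keep every k-th neighbour
            sample.append(current)
    return sample

def inclusion_probabilities(coords, n, k=1):
    """Inclusion probability of each household when the start is uniform
    over all households, found by enumerating every start."""
    N = len(coords)
    counts = [0] * N
    for start in range(N):
        for h in epi_sample(coords, start, n, k):
            counts[h] += 1
    return [c / N for c in counts]

# Five hypothetical households placed along a line.
coords = [(0, 0), (1, 0), (3, 0), (6, 0), (10, 0)]
print(epi_sample(coords, start=0, n=3))        # path without skipping
print(epi_sample(coords, start=0, n=2, k=2))   # every second neighbour
print(inclusion_probabilities(coords, n=2))
```

Even in this tiny example the inclusion probabilities are unequal: the household at the end of the line is reached only when the path starts there, while a household with close neighbours on both sides is picked up from several starting points.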
Third, for a given population and sampling plan, we computed estimator properties from the exact distribution of the estimator rather than from the results of a
Monte Carlo simulation. Furthermore, since we could calculate the inclusion probabilities of households, we introduced the Horvitz-Thompson estimator and a restricted version of the Horvitz-Thompson estimator for estimating the population
proportion. We compared their performance to the standard estimator that uses the
sample proportion, a comparison which to our knowledge has not been previously
studied in an EPI setting.
Our investigation of the EPI method covers results for a variety of populations.
One of the ways in which the populations differed from each other is the spatial
distribution of households. This affected the number of possible samples:
more combinations of households could be observed when the households were regularly spaced in the population. However, because of the restriction
that the EPI method imposes on the selection of households, an EPI sample cannot
consist of just any combination of households in the population, as it can in SRS.
We also demonstrated that inclusion probabilities can vary widely from unit
to unit. More notably, our analysis showed an increase in the range of inclusion
probabilities when going from a sample of n = 7 households (sampling fraction was
0.05) to a sample of n = 30 households (sampling fraction was 0.20). At n = 30, we
found instances where the probability of getting picked was as low as 0.02 for some
households and as high as 0.50 for others in the same population (Figure 6.8). Each
household can be linked to every other household in the population by constructing a
chain of nearest neighbours. Certain households appear early in more chains because
of how they are positioned relative to the other households in the population. Hence,
these households are more likely to be added to the sample wherever the sample
begins.
We observed a negative correlation between the probability of selecting a household and the distance of the household from the centre of the population area. This
correlation tended to become stronger the larger the sample size, indicating that the
EPI paths were heading towards the centre of the population. This was especially
true for populations that were denser in the centre since the nearest neighbour of a
household would likely be in the direction of the centre. However, as more households are sampled, the chain keeps going and eventually moves out to the periphery.
If every third household encountered were included for the survey, then a chain of
88 households would be required in order to reach 30 households. Here, the chain
includes over half of the population. This may explain why skipping neighbours did
not always result in a more negative correlation.
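The chain length quoted above follows from simple arithmetic: if the starting household is sampled and every kth household after it is added, the chain must contain 1 + k(n − 1) households. The short sketch below reproduces the figure for n = 30 and k = 3 in a population of N = 150 households; the function name `chain_length` is ours.

```python
def chain_length(n, k):
    """Households visited when the start is sampled and then every k-th neighbour is kept."""
    return 1 + k * (n - 1)

N = 150                       # households in each simulated population
length = chain_length(30, 3)  # n = 30 households, every third neighbour kept
share = length / N            # fraction of the population on the chain

print(length, round(share, 2))
```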
The size of the sector used for selecting the first household had little impact on
the inclusion probabilities compared to the other factors varied in the simulation.
Although using a sector meant that not all households had an equal chance of being
chosen as the starting point, the more important determinant in the probability of
sampling a household was the number of EPI paths that went through it. Unless
there was a substantial imbalance in the way households are laid out around the
origin, initiating the sample from a random sector was similar to initiating it from a
household that was chosen randomly across the entire population.1
While the Horvitz-Thompson estimator (HT) seemed promising because it is
unbiased, it has some significant drawbacks. First, it requires knowing the probability
of selection for every population element. For EPI, this means that households in
the survey area need to be mapped. While this could be accomplished by obtaining
high-resolution satellite images from Google Earth or flying a drone to capture aerial
shots of the community (Escamilla et al., 2014; The Swiss Foundation for Mine Action
(FSD), 2016), there is still the task of computing inclusion probabilities, which could
prove to be challenging in the real world. Second, the Horvitz-Thompson estimator
under EPI sampling turns out to be very unstable. There is a lot of variability in how
the observations are weighted, and estimates of a proportion can surpass 1. Even
when we restricted the Horvitz-Thompson estimator (HTR) so that it would not go
beyond 1, design effects remained high relative to the equally weighted estimator.
Moreover, the amount by which it underestimated the population proportion would
also grow as the population proportion reached higher levels. At p = 0.90, we saw
many instances where the magnitude of bias exceeded 0.10. Therefore, weighting the
observations did not lead to an overall improvement in estimation.
1 It may be interesting to check how choosing the first unit from a sector compares to choosing
it from a strip when dwellings are given positive dimensions. According to Henderson et al. (1973),
if a strip were used, the probability of being picked first would be proportional to the ratio of
the width of the dwelling to its distance from the origin (this concern was also raised in Brogan
et al. (1994) and in Centers for Disease Control and Prevention and World Food Programme
(2007)).

Our analysis of the regular estimator (EQW) for a proportion confirms many of
our suspicions about EPI. Among the populations that we examined, the properties
of this estimator did not deviate excessively under EPI from what it would be under
SRS as long as outcomes were randomly assigned to the households. In the absence
of any aggregation in the geographical distribution of the characteristic of interest,
biases were close to 0, and design effects were close to 1.
When cases of the characteristic of interest were isolated to certain neighbourhoods or when it appeared more frequently for households near the centre of the
population area, there was a greater departure between the sample proportion and
the population proportion it was supposed to estimate, agreeing with reports from
previous simulations (Lemeshow et al., 1985). Both positive and negative biases occurred for the pocketing pattern, while biases were overwhelmingly positive for the
circular gradient pattern. The variance of the estimator was also considerably lower
in the latter case, thus supporting the claim that EPI samples have a tendency to
include centrally located households (Luman et al., 2007).
It does not come as a surprise that regional disparities in a population would make
the EQW estimator under EPI more biased and more variable compared to its SRS
counterpart. However, we also found that design effects increased with sample size.
Taking every third neighbour for the sample was effective in reducing the variance
of the estimator. It made the greatest difference when n = 30 as evidenced by
comparing the design effects for k = 1 and k = 3 (Tables A.8 and A.9). It should be
noted though that when there was a high concentration of cases at the centre of the
population area and only seven households were selected for the sample, skipping
neighbours increased the bias.
Based on our results and those from past investigations, using the nearest neighbour sampling procedure may be a close substitute for random selection at the final
stage of sampling, provided that the characteristic of interest is evenly distributed
throughout the cluster. One can never be sure of the exact spatial pattern of the
target variable prior to conducting the survey, so we suggest looking at how it is
distributed in other similar populations, if that information is available.
In constructing the populations and simulating the EPI method, we did not
capture all of the complexities of the real world such as multi-household dwellings,
misjudgement in determining which household to go to next, or roads and barriers
that would alter the survey path. However, we do not believe these details would
have material consequences in a population where the characteristic of interest occurs randomly. It is harder to predict what kind of impact they
would have when the characteristic of interest does not occur randomly. The spatial
patterns that we generated are by no means a comprehensive account of all the possibilities, but they did help us to identify what conditions can lead to extreme bias
or variance for the equally weighted estimator of the prevalence in a cluster.
Finally, throughout our simulations we assumed that there were no non-response
or data collection errors. It is doubtful that a survey will be executed this perfectly
in the field. We did not incorporate these human factors into our simulations as we
saw them as separate issues and we wanted to assess EPI sampling and estimation in
their purest form. Nevertheless, non-response and data collection errors are serious
problems: provisions should be made to minimize them, and they should be kept in
mind when interpreting data.
8.2 Future Directions
We concluded from our study that if EPI sampling is used, then the EQW estimator
is preferable to the HT estimator given its lower variability and ease of calculation.
That said, the EQW estimator still reached alarming levels of bias and variance when
sampling was conducted in a community that had areas of high and low concentrations of the characteristic of interest. This reinforces the warning that actual surveys
should not report estimates for individual clusters. Averaging results across several
clusters should help to moderate the effect of any cluster associated with potential
estimation issues.
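As a minimal sketch of the two estimators compared above, assuming the usual definitions (the EQW estimator is the unweighted sample proportion; the HT estimator weights each sampled household by the reciprocal of its inclusion probability and divides by N), the following R code computes both for a single sample. The vectors y and pi_i and the cluster size N are invented for illustration and do not come from our simulations.

```r
# Hypothetical sample of n = 5 households from a cluster of N = 20.
# y: indicator of the characteristic of interest (made-up values).
# pi_i: inclusion probabilities of the sampled households (assumed known).
N    <- 20
y    <- c(1, 0, 1, 1, 0)
pi_i <- c(0.40, 0.25, 0.20, 0.25, 0.10)

# Equally weighted (EQW) estimator: the plain sample proportion.
p_eqw <- mean(y)

# Horvitz-Thompson (HT) estimator of the population proportion.
p_ht <- sum(y / pi_i) / N

c(EQW = p_eqw, HT = p_ht)
```

Unlike the sample proportion, the HT estimate is not constrained to lie in [0, 1], and a single household with a very small inclusion probability can dominate it.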
In Bennett et al. (1994), Katz et al. (1997) and Yoon et al. (1997), where EPI was
analyzed in the context of a stratified sampling design for a real population, the magnitude of the bias rarely went beyond 2 percentage points and design effects hovered
around 1. It would be informative to run simulations of the EPI method in which
clusters are selected according to PPS (possibly from a hierarchy, such as when an
enumeration area is sampled from a town that is sampled from a region of a country)
and for populations with different rates of homogeneity. We recommend checking
the properties of the confidence intervals that are produced when the formula in
Equation (3.15) is used to estimate the variance of the estimator for a population
proportion. As far as we know, no study of EPI has done this.
Research may also be extended to estimates of means of continuous variables and relative frequencies of categorical variables with more than two outcomes.
Another issue that needs to be addressed is how to convert sample size requirements
into a number of households when the target population is not proportional to the
total number of households in a cluster or when the questions in a survey concern
separate populations.
[Figure 8.1: two scatter plots, titled "Compact Segment Sampling (CSS)" and "Systematic Random Sampling (SyRS)"; both axes (x and y) run from −500 to 500; the legend distinguishes the characteristic of interest (Not Present (0), Present (1)) and selection status (Not Selected, Selected).]
Figure 8.1: Illustration of household selection using alternative sampling methods. Both
plots depict the same population of N = 150 households. Household locations were randomly generated such that the x-coordinate takes a real number between −500 and 500 and
similarly for the y-coordinate. The characteristic of interest is present in 75 households
(p = 0.50) and is distributed in a pocketing pattern. To take a sample of 30 households,
CSS involves partitioning the cluster into five segments, each containing 30 households,
then randomly picking one of these segments as the sample, while SyRS involves randomly
picking one of the first five households in the northwest corner of the cluster then moving in a
serpentine manner and sampling every fifth household.
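The two procedures described in the caption of Figure 8.1 can be sketched in R as follows. This is an illustration under stated assumptions rather than the code used to draw the figure: the segments are taken to be vertical strips obtained by sorting on x, and the serpentine path is approximated by five horizontal bands traversed in alternating directions starting from the northwest corner. The band count n_bands and all object names are hypothetical.

```r
set.seed(1)
N <- 150; n <- 30
loc <- data.frame(x = runif(N, -500, 500), y = runif(N, -500, 500))

# CSS: cut the cluster into N/n segments of n households each (here,
# vertical strips obtained by sorting on x; an assumption), then take
# every household in one randomly chosen segment.
n_seg   <- N / n
segment <- integer(N)
segment[order(loc$x)] <- rep(seq_len(n_seg), each = n)
css_sample <- which(segment == sample(n_seg, 1))

# SyRS: order the households along a serpentine path. Band 1 is the
# northernmost; odd bands run west to east, even bands east to west.
n_bands <- 5                                   # hypothetical band count
band <- pmin(floor((500 - loc$y) / (1000 / n_bands)) + 1, n_bands)
path <- order(band, ifelse(band %% 2 == 1, loc$x, -loc$x))

# Random start among the first N/n households, then every (N/n)th one.
step  <- N / n
start <- sample(step, 1)
syrs_sample <- path[seq(from = start, by = step, length.out = n)]
```

Both css_sample and syrs_sample hold the row indices of 30 selected households. Other ways of forming segments or tracing the serpentine path would be equally consistent with the verbal description.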
We restricted our literature review to papers about computer simulations of the
EPI method, but there have also been a handful of studies that tested EPI in the
field. These studies involved conducting a survey where households are sampled
once using EPI then again using an alternative procedure to examine whether they
would yield similar estimates. Compact segment sampling (CSS) (Milligan et al.,
2004) and systematic random sampling (SyRS) (Rose et al., 2006; Luman et al.,
2007) are just two spatial sampling methods that EPI has been evaluated against.
In CSS, the cluster is partitioned into segments containing roughly the same number
of households, then one of these segments is picked at random and all households in
the segment are interviewed. In SyRS, the interviewer selects households passed
at regular intervals while moving through the cluster in a serpentine manner
(Figure 8.1). A modification to EPI sampling has also been proposed where the
first unit of the sample is taken as the closest household to random coordinates in
the cluster then the process of surveying a chain of nearest neighbours is carried
out as usual (Grais et al., 2007). A quick search of surveys which were performed in
difficult field conditions reveals that there is a growing shift towards using Geographic
Information Systems (GIS) and Global Positioning Systems (GPS) to facilitate the
selection of households (Roberts et al., 2004; Vanden Eng et al., 2007; Siri et al.,
2008; Galway et al., 2012; Shannon et al., 2012; Kondo et al., 2014). Although
these sampling plans employ various techniques, the common idea behind them is to
generate random spatial points and locate nearby households. It may be useful to
do a simulation (perhaps using real census data) in which they are compared to EPI
and even to SyRS and CSS.
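The common idea of generating a random spatial point and locating a nearby household can be sketched as follows. This is a minimal illustration, assuming a square population area with the same bounds as in our simulated populations; it is not taken from any of the cited surveys, and the object names are hypothetical.

```r
set.seed(2)
loc <- data.frame(x = runif(150, -500, 500), y = runif(150, -500, 500))

# Draw one random point in the square population area (assumed bounds).
pt <- c(x = runif(1, -500, 500), y = runif(1, -500, 500))

# The household nearest to the random point becomes the first sampled unit.
d_to_pt <- sqrt((loc$x - pt["x"])^2 + (loc$y - pt["y"])^2)
u1 <- which.min(d_to_pt)
```

Under this rule a household's chance of being chosen is proportional to the area of the region that lies closer to it than to any other household, so isolated households are still favoured over households in dense areas.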
Appendix A
Tables of Simulation Results
Table A.1: Variance of household inclusion probabilities and correlation between
household inclusion probability and household distance from the centre of the population area. For all household spatial distributions other than loc reg, results were
averaged across 30 populations that were generated using the same random procedure, but with different seed numbers. All populations consisted of N = 150
households from which samples of size n = 7, 15, 30 were drawn.

Spatial                            Var(πi)                      Cor(πi, ri)
distribution  Sampling     n = 7    n = 15    n = 30      n = 7   n = 15   n = 30
of households method     π̄=0.047   π̄=0.100   π̄=0.200
loc reg       nosec k1    0.0001    0.0001    0.0014      -0.36    -0.65    -0.41
loc reg       api08 k1    0.0001    0.0002    0.0016      -0.44    -0.81    -0.56
loc reg       api32 k1    0.0001    0.0001    0.0018      -0.14     0.05    -0.29
loc reg       nosec k3    0.0002    0.0006    0.0045      -0.21    -0.42    -0.68
loc reg       api08 k3    0.0002    0.0007    0.0048      -0.32    -0.49    -0.67
loc reg       api32 k3    0.0001    0.0006    0.0046       0.03    -0.28    -0.67
loc sqr       nosec k1    0.0004    0.0026    0.0131      -0.11    -0.20    -0.26
loc sqr       api08 k1    0.0005    0.0027    0.0130      -0.20    -0.26    -0.29
loc sqr       api32 k1    0.0005    0.0028    0.0132      -0.19    -0.25    -0.28
loc sqr       nosec k3    0.0008    0.0042    0.0116      -0.19    -0.24    -0.24
loc sqr       api08 k3    0.0009    0.0042    0.0114      -0.23    -0.26    -0.24
loc sqr       api32 k3    0.0009    0.0042    0.0114      -0.22    -0.25    -0.24
loc rec       nosec k1    0.0004    0.0025    0.0130      -0.08    -0.17    -0.27
loc rec       api08 k1    0.0009    0.0038    0.0165      -0.47    -0.48    -0.38
loc rec       api32 k1    0.0007    0.0034    0.0157      -0.42    -0.42    -0.36
loc rec       nosec k3    0.0008    0.0039    0.0094      -0.17    -0.29    -0.33
loc rec       api08 k3    0.0012    0.0045    0.0095      -0.38    -0.33    -0.35
loc rec       api32 k3    0.0010    0.0043    0.0093      -0.34    -0.32    -0.35
loc agg       nosec k1    0.0005    0.0026    0.0116      -0.07    -0.20    -0.39
loc agg       api08 k1    0.0010    0.0035    0.0133      -0.24    -0.36    -0.47
loc agg       api32 k1    0.0008    0.0032    0.0132      -0.28    -0.39    -0.48
loc agg       nosec k3    0.0009    0.0041    0.0112      -0.21    -0.38    -0.40
loc agg       api08 k3    0.0011    0.0043    0.0114      -0.33    -0.43    -0.42
loc agg       api32 k3    0.0010    0.0043    0.0111      -0.35    -0.44    -0.43
loc cgr       nosec k1    0.0006    0.0044    0.0315      -0.38    -0.57    -0.62
loc cgr       api08 k1    0.0006    0.0045    0.0312      -0.38    -0.57    -0.63
loc cgr       api32 k1    0.0006    0.0044    0.0313      -0.35    -0.56    -0.62
loc cgr       nosec k3    0.0014    0.0084    0.0152      -0.54    -0.63    -0.72
loc cgr       api08 k3    0.0014    0.0084    0.0152      -0.54    -0.63    -0.72
loc cgr       api32 k3    0.0014    0.0084    0.0153      -0.53    -0.63    -0.72

πi denotes the probability of selecting a household; ri denotes the distance of a household from the centre
of the population area; π̄ is the average probability of selection and is equal to the sampling fraction n/N.
See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the
sampling method.
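The quantities summarized in Table A.1 can be approximated by Monte Carlo for a simple nearest-neighbour path. The sketch below is an illustration only: it starts each path at a household chosen at random from the whole population (no sector), takes every neighbour (k = 1), estimates inclusion probabilities by selection frequencies rather than computing them exactly, and uses a freshly generated population, so its output is not expected to reproduce the tabulated values. The function nn_path and the other names are hypothetical.

```r
set.seed(3)
N <- 150; n <- 7; R <- 2000
loc  <- data.frame(x = runif(N, -500, 500), y = runif(N, -500, 500))
dmat <- as.matrix(dist(loc))

# One nearest-neighbour path of length n from a random starting household.
nn_path <- function(dmat, n) {
  visited <- sample(nrow(dmat), 1)
  while (length(visited) < n) {
    d <- dmat[visited[length(visited)], ]
    d[visited] <- Inf                 # never revisit a household
    visited <- c(visited, which.min(d))
  }
  visited
}

# Estimate each household's inclusion probability by its selection frequency.
paths  <- replicate(R, nn_path(dmat, n), simplify = FALSE)
counts <- tabulate(unlist(paths), nbins = N)
pi_hat <- counts / R

r_i <- sqrt(loc$x^2 + loc$y^2)        # distance from the centre (0, 0)
c(mean_pi = mean(pi_hat), var_pi = var(pi_hat), cor_pi_r = cor(pi_hat, r_i))
```

Because every simulated sample contains exactly n distinct households, the mean of pi_hat equals the sampling fraction n/N regardless of the sampling design; only the spread of the probabilities and their relationship with r_i depend on the design.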
Table A.2: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the
restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7.
Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling
method) are averaged across the other two factors.

Bias(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg  -0.003  0.000 -0.000    0.001  0.000 -0.002    0.005  0.000 -0.006    0.005  0.000 -0.015    0.002  0.000 -0.036
  loc sqr  -0.001  0.000 -0.002    0.005  0.000 -0.013    0.012  0.000 -0.029    0.008  0.000 -0.054    0.005  0.000 -0.098
  loc rec  -0.001  0.000 -0.003    0.014  0.000 -0.012    0.023  0.000 -0.027    0.024  0.000 -0.052    0.007  0.000 -0.107
  loc agg   0.008  0.000 -0.003    0.020  0.000 -0.013    0.023  0.000 -0.031    0.025  0.000 -0.061    0.012  0.000 -0.111
  loc cgr   0.005  0.000 -0.003    0.011  0.000 -0.012    0.027  0.000 -0.030    0.036  0.000 -0.056    0.017  0.000 -0.108
Spatial distribution of target variable
  val rdm   0.000  0.000 -0.001    0.001  0.000 -0.005   -0.004  0.000 -0.019    0.001  0.000 -0.044    0.002  0.000 -0.090
  val spk   0.010  0.000 -0.002    0.017  0.000 -0.008    0.026  0.000 -0.027    0.023  0.000 -0.051    0.015  0.000 -0.094
  val lpk   0.004  0.000 -0.004    0.006  0.000 -0.016    0.029  0.000 -0.032    0.021  0.000 -0.059    0.011  0.000 -0.099
  val cgr   0.023  0.000 -0.001    0.043  0.000 -0.005    0.050  0.000 -0.015    0.046  0.000 -0.034    0.013  0.000 -0.086
  val dgr  -0.011  0.000 -0.003    0.001  0.000 -0.012    0.004  0.000 -0.023    0.016  0.000 -0.047    0.005  0.000 -0.090
  val hgr  -0.018  0.000 -0.004   -0.007  0.000 -0.018    0.004  0.000 -0.032    0.011  0.000 -0.051    0.005  0.000 -0.092
Sampling method
  nosec k1  0.000  0.000 -0.002    0.006  0.000 -0.008    0.010  0.000 -0.018    0.012  0.000 -0.034    0.006  0.000 -0.062
  api08 k1  0.002  0.000 -0.003    0.010  0.000 -0.013    0.017  0.000 -0.028    0.017  0.000 -0.052    0.006  0.000 -0.098
  api32 k1  0.002  0.000 -0.003    0.008  0.000 -0.014    0.017  0.000 -0.029    0.017  0.000 -0.054    0.006 -0.000 -0.100
  nosec k3  0.000  0.000 -0.001    0.010  0.000 -0.008    0.017  0.000 -0.023    0.020  0.000 -0.045    0.011  0.000 -0.088
  api08 k3  0.003  0.000 -0.002    0.016  0.000 -0.010    0.024  0.000 -0.025    0.026  0.000 -0.051    0.012  0.000 -0.103
  api32 k3  0.002  0.000 -0.002    0.013  0.000 -0.010    0.022  0.000 -0.025    0.025  0.000 -0.050    0.012  0.000 -0.100

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution
of households, the spatial distribution of the target variable and the sampling method.
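The averaging described in the captions of Tables A.2 to A.9 (results for each level of one factor are averaged over all levels of the other two) can be illustrated with a small made-up example. The data frame res, its column names and the bias values below are hypothetical; the sketch only shows the marginal-mean operation, not our actual results.

```r
# Hypothetical full-factorial results: one bias value per factor combination.
res <- expand.grid(
  hh_dist  = c("loc reg", "loc sqr"),
  val_dist = c("val rdm", "val cgr"),
  method   = c("nosec k1", "nosec k3"),
  stringsAsFactors = FALSE
)
res$bias <- c(0.00, 0.01, 0.02, 0.03, 0.00, 0.02, 0.04, 0.06)  # made-up values

# Marginal mean for each level of one factor, averaging over the other two.
by_hh <- aggregate(bias ~ hh_dist, data = res, FUN = mean)
by_hh
```

Each row of by_hh corresponds to one row of the "Spatial distribution of households" block in the tables; the same call with val_dist or method on the right-hand side gives the other two blocks.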
Table A.3: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the
restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter
7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling
method) are averaged across the other two factors.

Bias(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg   0.001 -0.000 -0.000    0.008  0.000 -0.000    0.006  0.000 -0.000    0.007  0.000 -0.002    0.004 -0.000 -0.012
  loc sqr  -0.010  0.000 -0.001    0.000  0.000 -0.008   -0.005  0.000 -0.022   -0.008  0.000 -0.051    0.003  0.000 -0.094
  loc rec   0.007  0.000  0.000    0.026  0.000 -0.001    0.024  0.000 -0.006    0.032  0.000 -0.015    0.007  0.000 -0.058
  loc agg   0.009  0.000  0.000    0.021  0.000 -0.001    0.024  0.000 -0.009    0.023  0.000 -0.031    0.013  0.000 -0.091
  loc cgr   0.009  0.000 -0.001    0.018  0.000 -0.006    0.039  0.000 -0.015    0.056  0.000 -0.035    0.026  0.000 -0.100
Spatial distribution of target variable
  val rdm  -0.002  0.000 -0.000    0.002  0.000 -0.001   -0.009  0.000 -0.008   -0.000  0.000 -0.027    0.002  0.000 -0.073
  val spk   0.006  0.000 -0.000    0.017  0.000 -0.002    0.014  0.000 -0.010    0.021  0.000 -0.028    0.024  0.000 -0.068
  val lpk   0.008  0.000 -0.000    0.024  0.000 -0.002    0.033  0.000 -0.010    0.028  0.000 -0.031    0.017  0.000 -0.075
  val cgr   0.046  0.000  0.000    0.078  0.000 -0.000    0.077  0.000 -0.001    0.061  0.000 -0.013    0.017  0.000 -0.065
  val dgr  -0.016  0.000 -0.001   -0.002  0.000 -0.004    0.006  0.000 -0.012    0.021  0.000 -0.028    0.006  0.000 -0.071
  val hgr  -0.024  0.000 -0.001   -0.029  0.000 -0.010   -0.015  0.000 -0.020    0.001  0.000 -0.034   -0.002  0.000 -0.074
Sampling method
  nosec k1  0.002  0.000 -0.000    0.012  0.000 -0.004    0.018  0.000 -0.012    0.021  0.000 -0.030    0.012  0.000 -0.072
  api08 k1  0.005  0.000 -0.000    0.017  0.000 -0.004    0.024  0.000 -0.014    0.027  0.000 -0.034    0.012  0.000 -0.087
  api32 k1  0.004  0.000 -0.000    0.015  0.000 -0.004    0.022  0.000 -0.013    0.025  0.000 -0.033    0.012  0.000 -0.082
  nosec k3  0.003  0.000 -0.000    0.015  0.000 -0.002    0.013  0.000 -0.007    0.019  0.000 -0.019    0.009  0.000 -0.057
  api08 k3  0.003  0.000 -0.000    0.016  0.000 -0.002    0.014  0.000 -0.008    0.020  0.000 -0.022    0.009  0.000 -0.064
  api32 k3  0.003  0.000 -0.001    0.016  0.000 -0.003    0.014  0.000 -0.009    0.020  0.000 -0.023    0.009  0.000 -0.064

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of
households, the spatial distribution of the target variable and the sampling method.
Table A.4: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the
restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7.
Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling
method) are averaged across the other two factors.

Var(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg   0.027  0.031  0.031    0.073  0.079  0.075    0.081  0.091  0.084    0.065  0.080  0.068    0.021  0.042  0.026
  loc sqr   0.029  0.041  0.036    0.075  0.112  0.084    0.088  0.143  0.094    0.063  0.143  0.072    0.021  0.129  0.038
  loc rec   0.033  0.044  0.038    0.074  0.106  0.081    0.084  0.134  0.089    0.062  0.141  0.072    0.018  0.158  0.044
  loc agg   0.036  0.042  0.035    0.083  0.108  0.082    0.094  0.147  0.093    0.070  0.163  0.074    0.019  0.169  0.038
  loc cgr   0.034  0.043  0.035    0.066  0.097  0.073    0.078  0.135  0.085    0.054  0.138  0.066    0.014  0.148  0.037
Spatial distribution of target variable
  val rdm   0.011  0.019  0.018    0.029  0.056  0.045    0.034  0.090  0.058    0.030  0.114  0.053    0.012  0.126  0.033
  val spk   0.036  0.040  0.036    0.078  0.098  0.080    0.099  0.144  0.098    0.075  0.149  0.081    0.026  0.136  0.041
  val lpk   0.051  0.056  0.048    0.117  0.144  0.112    0.135  0.176  0.121    0.101  0.179  0.098    0.034  0.149  0.048
  val cgr   0.041  0.031  0.029    0.076  0.076  0.067    0.075  0.096  0.072    0.050  0.096  0.055    0.012  0.116  0.031
  val dgr   0.026  0.046  0.039    0.067  0.102  0.079    0.071  0.122  0.084    0.057  0.126  0.066    0.014  0.122  0.034
  val hgr   0.024  0.050  0.041    0.079  0.128  0.092    0.097  0.153  0.101    0.063  0.134  0.070    0.014  0.126  0.033
Sampling method
  nosec k1  0.041  0.045  0.041    0.091  0.106  0.091    0.104  0.127  0.101    0.077  0.114  0.077    0.023  0.077  0.030
  api08 k1  0.041  0.048  0.040    0.090  0.116  0.088    0.101  0.145  0.098    0.075  0.147  0.078    0.023  0.144  0.040
  api32 k1  0.041  0.048  0.041    0.090  0.118  0.089    0.102  0.148  0.098    0.075  0.152  0.077    0.023  0.152  0.039
  nosec k3  0.022  0.033  0.029    0.059  0.085  0.069    0.068  0.115  0.079    0.051  0.118  0.063    0.014  0.111  0.033
  api08 k3  0.023  0.034  0.029    0.058  0.088  0.069    0.068  0.121  0.079    0.049  0.131  0.064    0.014  0.144  0.040
  api32 k3  0.022  0.035  0.029    0.058  0.091  0.069    0.068  0.125  0.078    0.050  0.135  0.063    0.014  0.147  0.037

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
Table A.5: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the
restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter
7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling
method) are averaged across the other two factors.

Var(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg   0.006  0.007  0.007    0.019  0.022  0.022    0.023  0.027  0.027    0.016  0.022  0.021    0.005  0.012  0.008
  loc sqr   0.006  0.021  0.019    0.020  0.058  0.041    0.024  0.087  0.051    0.016  0.110  0.040    0.004  0.125  0.022
  loc rec   0.008  0.012  0.012    0.022  0.031  0.030    0.026  0.047  0.040    0.018  0.047  0.033    0.004  0.060  0.028
  loc agg   0.009  0.012  0.012    0.024  0.038  0.036    0.028  0.062  0.047    0.021  0.081  0.042    0.004  0.112  0.026
  loc cgr   0.008  0.020  0.017    0.015  0.045  0.034    0.015  0.067  0.041    0.010  0.082  0.038    0.002  0.123  0.029
Spatial distribution of target variable
  val rdm   0.002  0.005  0.005    0.006  0.020  0.019    0.007  0.045  0.033    0.006  0.065  0.031    0.002  0.089  0.022
  val spk   0.006  0.011  0.011    0.014  0.030  0.027    0.024  0.058  0.043    0.017  0.072  0.038    0.005  0.082  0.022
  val lpk   0.014  0.018  0.018    0.033  0.047  0.043    0.036  0.067  0.051    0.026  0.081  0.042    0.007  0.094  0.025
  val cgr   0.010  0.005  0.005    0.019  0.015  0.014    0.016  0.021  0.019    0.011  0.036  0.023    0.002  0.076  0.021
  val dgr   0.007  0.022  0.020    0.021  0.048  0.040    0.023  0.063  0.044    0.019  0.072  0.036    0.003  0.085  0.022
  val hgr   0.006  0.025  0.022    0.026  0.074  0.053    0.033  0.093  0.057    0.020  0.085  0.040    0.004  0.094  0.023
Sampling method
  nosec k1  0.012  0.019  0.018    0.033  0.052  0.045    0.038  0.073  0.055    0.026  0.079  0.045    0.006  0.084  0.026
  api08 k1  0.012  0.019  0.018    0.033  0.053  0.045    0.037  0.076  0.054    0.025  0.087  0.044    0.006  0.111  0.030
  api32 k1  0.012  0.019  0.018    0.033  0.053  0.045    0.038  0.076  0.055    0.026  0.086  0.044    0.006  0.107  0.029
  nosec k3  0.002  0.010  0.009    0.007  0.024  0.020    0.009  0.038  0.028    0.007  0.048  0.025    0.002  0.063  0.017
  api08 k3  0.002  0.010  0.009    0.007  0.025  0.020    0.009  0.042  0.028    0.007  0.055  0.026    0.002  0.076  0.017
  api32 k3  0.002  0.011  0.009    0.007  0.026  0.020    0.009  0.043  0.028    0.007  0.057  0.025    0.002  0.080  0.017

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
Table A.6: Mean square error (MSE) of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT)
estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation
study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target
variable, and sampling method) are averaged across the other two factors.

MSE(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg   0.027  0.031  0.031    0.073  0.079  0.076    0.082  0.091  0.084    0.065  0.080  0.068    0.021  0.042  0.028
  loc sqr   0.029  0.041  0.036    0.076  0.112  0.084    0.090  0.143  0.095    0.063  0.143  0.075    0.021  0.129  0.048
  loc rec   0.033  0.044  0.038    0.076  0.106  0.081    0.086  0.134  0.090    0.064  0.141  0.075    0.019  0.158  0.057
  loc agg   0.036  0.042  0.035    0.084  0.108  0.082    0.096  0.147  0.094    0.072  0.163  0.078    0.019  0.169  0.050
  loc cgr   0.035  0.043  0.035    0.068  0.097  0.073    0.081  0.135  0.086    0.057  0.138  0.069    0.015  0.148  0.049
Spatial distribution of target variable
  val rdm   0.012  0.019  0.018    0.030  0.056  0.045    0.035  0.090  0.058    0.031  0.114  0.055    0.012  0.126  0.043
  val spk   0.037  0.040  0.036    0.079  0.098  0.080    0.101  0.144  0.099    0.077  0.149  0.084    0.027  0.136  0.051
  val lpk   0.052  0.056  0.048    0.119  0.144  0.112    0.137  0.176  0.122    0.104  0.179  0.103    0.035  0.149  0.059
  val cgr   0.042  0.031  0.029    0.079  0.076  0.067    0.078  0.096  0.073    0.053  0.096  0.056    0.012  0.116  0.039
  val dgr   0.026  0.046  0.039    0.068  0.102  0.079    0.072  0.122  0.085    0.058  0.126  0.069    0.015  0.122  0.043
  val hgr   0.025  0.050  0.041    0.080  0.128  0.093    0.098  0.153  0.102    0.063  0.134  0.073    0.014  0.126  0.043
Sampling method
  nosec k1  0.041  0.045  0.041    0.092  0.106  0.091    0.105  0.127  0.101    0.078  0.114  0.079    0.023  0.077  0.035
  api08 k1  0.041  0.048  0.041    0.091  0.116  0.089    0.103  0.145  0.100    0.076  0.147  0.081    0.023  0.144  0.052
  api32 k1  0.041  0.048  0.041    0.091  0.118  0.089    0.103  0.148  0.099    0.076  0.152  0.080    0.024  0.152  0.050
  nosec k3  0.023  0.033  0.029    0.060  0.085  0.070    0.070  0.115  0.080    0.052  0.118  0.066    0.015  0.111  0.041
  api08 k3  0.023  0.034  0.029    0.060  0.088  0.069    0.070  0.121  0.080    0.051  0.131  0.067    0.015  0.144  0.051
  api32 k3  0.023  0.035  0.029    0.060  0.091  0.069    0.071  0.125  0.079    0.052  0.135  0.066    0.015  0.147  0.049

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
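The relationship between the bias, variance and MSE tables can be checked on a toy example. The sketch below uses five made-up replicate estimates (p_hat) and a made-up true proportion (p_true); it assumes the variance is computed with divisor R (the number of replicates), in which case MSE = Var + Bias² holds exactly.

```r
p_true <- 0.30
p_hat  <- c(0.2, 0.4, 0.4, 0.2, 0.6)   # hypothetical estimates from 5 replicate samples

bias <- mean(p_hat) - p_true           # 0.06
vari <- mean((p_hat - mean(p_hat))^2)  # divisor R, not R - 1
mse  <- mean((p_hat - p_true)^2)

all.equal(mse, vari + bias^2)          # TRUE
```

Because the tables report averages over several scenarios, a tabulated MSE need not equal the tabulated variance plus the square of the tabulated (averaged) bias: the identity holds within each scenario, and the average of squared biases is at least the square of the average bias.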
Table A.7: Mean square error (MSE) of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT)
estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation
study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target
variable, and sampling method) are averaged across the other two factors.

MSE(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg   0.007  0.007  0.007    0.020  0.022  0.022    0.023  0.027  0.027    0.017  0.022  0.021    0.005  0.012  0.009
  loc sqr   0.008  0.021  0.019    0.025  0.058  0.042    0.029  0.087  0.052    0.018  0.110  0.043    0.005  0.125  0.031
  loc rec   0.009  0.012  0.012    0.024  0.031  0.030    0.028  0.047  0.040    0.020  0.047  0.034    0.004  0.060  0.032
  loc agg   0.009  0.012  0.012    0.027  0.038  0.036    0.031  0.062  0.048    0.023  0.081  0.044    0.005  0.112  0.035
  loc cgr   0.010  0.020  0.017    0.019  0.045  0.034    0.021  0.067  0.042    0.016  0.082  0.039    0.004  0.123  0.040
Spatial distribution of target variable
  val rdm   0.002  0.005  0.005    0.007  0.020  0.019    0.007  0.045  0.033    0.006  0.065  0.032    0.002  0.089  0.029
  val spk   0.007  0.011  0.011    0.016  0.030  0.027    0.026  0.058  0.043    0.019  0.072  0.039    0.007  0.082  0.029
  val lpk   0.015  0.018  0.018    0.037  0.047  0.043    0.042  0.067  0.052    0.030  0.081  0.044    0.008  0.094  0.032
  val cgr   0.013  0.005  0.005    0.027  0.015  0.014    0.024  0.021  0.019    0.016  0.036  0.023    0.003  0.076  0.027
  val dgr   0.007  0.022  0.020    0.022  0.048  0.040    0.025  0.063  0.045    0.021  0.072  0.037    0.004  0.085  0.028
  val hgr   0.007  0.025  0.022    0.029  0.074  0.054    0.036  0.093  0.058    0.021  0.085  0.042    0.004  0.094  0.031
Sampling method
  nosec k1  0.013  0.019  0.018    0.036  0.052  0.045    0.041  0.073  0.055    0.028  0.079  0.046    0.006  0.084  0.033
  api08 k1  0.014  0.019  0.018    0.036  0.053  0.045    0.041  0.076  0.055    0.028  0.087  0.046    0.006  0.111  0.039
  api32 k1  0.014  0.019  0.018    0.036  0.053  0.045    0.041  0.076  0.055    0.028  0.086  0.046    0.006  0.107  0.037
  nosec k3  0.003  0.010  0.009    0.010  0.024  0.020    0.012  0.038  0.028    0.009  0.048  0.026    0.003  0.063  0.021
  api08 k3  0.003  0.010  0.009    0.010  0.025  0.020    0.012  0.042  0.029    0.009  0.055  0.027    0.003  0.076  0.023
  api32 k3  0.003  0.011  0.009    0.010  0.026  0.020    0.012  0.043  0.028    0.009  0.057  0.026    0.003  0.080  0.023

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
Table A.8: Design effect (DE) of EPI sampling relative to SRS for the equally weighted (EQW) estimator, the
Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is
n = 7 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial
distribution of target variable, and sampling method) are averaged across the other two factors.

DE(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg    2.22   2.55   2.51     2.54   2.76   2.62     2.38   2.65   2.45     2.25   2.78   2.36     1.72   3.38   2.11
  loc sqr    2.32   3.34   2.88     2.61   3.88   2.92     2.58   4.18   2.74     2.17   4.96   2.50     1.72  10.49   3.08
  loc rec    2.66   3.60   3.07     2.58   3.67   2.80     2.46   3.92   2.61     2.17   4.90   2.50     1.48  12.77   3.57
  loc agg    2.92   3.41   2.86     2.89   3.75   2.85     2.74   4.28   2.71     2.43   5.65   2.58     1.53  13.70   3.05
  loc cgr    2.73   3.48   2.86     2.28   3.39   2.54     2.26   3.95   2.47     1.87   4.81   2.29     1.15  12.00   3.01
Spatial distribution of target variable
  val rdm    0.93   1.54   1.42     1.02   1.93   1.57     1.00   2.64   1.68     1.06   3.98   1.83     0.95  10.20   2.67
  val spk    2.95   3.23   2.90     2.70   3.39   2.79     2.88   4.20   2.87     2.62   5.16   2.82     2.12  10.98   3.34
  val lpk    4.16   4.56   3.88     4.08   4.99   3.87     3.93   5.13   3.52     3.52   6.23   3.41     2.77  12.08   3.88
  val cgr    3.33   2.49   2.38     2.63   2.63   2.32     2.18   2.80   2.11     1.73   3.32   1.90     0.98   9.38   2.49
  val dgr    2.10   3.77   3.17     2.33   3.55   2.73     2.07   3.56   2.46     1.97   4.37   2.29     1.16   9.92   2.72
  val hgr    1.95   4.09   3.28     2.73   4.44   3.21     2.84   4.45   2.94     2.18   4.65   2.42     1.15  10.25   2.69
Sampling method
  nosec k1   3.30   3.64   3.33     3.17   3.67   3.15     3.03   3.70   2.94     2.68   3.98   2.69     1.89   6.25   2.47
  api08 k1   3.31   3.89   3.28     3.11   4.01   3.07     2.96   4.24   2.87     2.61   5.10   2.71     1.87  11.70   3.27
  api32 k1   3.33   3.90   3.29     3.11   4.09   3.09     2.97   4.30   2.86     2.60   5.27   2.67     1.88  12.29   3.14
  nosec k3   1.82   2.66   2.38     2.04   2.96   2.41     1.99   3.36   2.31     1.76   4.11   2.20     1.17   8.96   2.66
  api08 k3   1.83   2.76   2.37     2.03   3.05   2.38     1.97   3.53   2.31     1.70   4.57   2.22     1.16  11.70   3.22
  api32 k3   1.82   2.82   2.38     2.03   3.15   2.40     1.98   3.64   2.28     1.72   4.69   2.18     1.16  11.90   3.03

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
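A design effect relative to SRS is the ratio of the estimator's variance under the design of interest to the variance of the sample proportion under simple random sampling. The sketch below assumes the standard without-replacement SRS variance, p(1 − p)/n × (N − n)/(N − 1); the function names srs_var and design_effect are hypothetical, and the input variance of 0.031 is the rounded HT value for loc reg at p = 0.10 in Table A.4.

```r
# Variance of the sample proportion under SRS without replacement.
srs_var <- function(p, n, N) p * (1 - p) / n * (N - n) / (N - 1)

# Design effect: variance under the design of interest over the SRS variance.
design_effect <- function(var_epi, p, n, N) var_epi / srs_var(p, n, N)

de <- design_effect(var_epi = 0.031, p = 0.10, n = 7, N = 150)
round(de, 2)
```

This gives a value of about 2.5, close to the corresponding entry of 2.55 in Table A.8; the small difference is consistent with the input variance having been rounded to three decimals.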
Table A.9: Design effect (DE) of EPI sampling relative to SRS for the equally weighted (EQW) estimator, the
Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is
n = 30 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial
distribution of target variable, and sampling method) are averaged across the other two factors.

DE(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
  loc reg    2.65   3.06   3.06     3.38   3.93   3.93     3.42   4.09   4.05     2.88   3.95   3.70     2.16   5.14   3.50
  loc sqr    2.42   8.60   8.05     3.51  10.35   7.36     3.57  13.01   7.58     2.84  19.60   7.17     1.74  51.93   8.94
  loc rec    3.25   4.77   4.77     3.86   5.54   5.25     3.86   7.05   5.97     3.15   8.41   5.93     1.59  25.03  11.57
  loc agg    3.64   4.88   4.88     4.28   6.69   6.40     4.18   9.22   7.03     3.64  14.29   7.43     1.71  46.49  10.80
  loc cgr    3.11   8.43   7.06     2.63   8.04   6.03     2.23   9.92   6.14     1.83  14.58   6.67     1.02  50.94  12.07
Spatial distribution of target variable
  val rdm    0.87   2.26   2.18     1.08   3.54   3.31     1.03   6.77   4.86     0.98  11.47   5.47     0.88  37.01   9.25
  val spk    2.40   4.75   4.66     2.57   5.38   4.73     3.52   8.70   6.36     2.95  12.82   6.66     2.20  34.11   9.30
  val lpk    5.66   7.41   7.31     5.88   8.29   7.61     5.37  10.04   7.67     4.54  14.39   7.45     2.94  38.78  10.28
  val cgr    4.06   1.92   1.92     3.38   2.58   2.54     2.42   3.16   2.87     1.87   6.44   4.06     0.97  31.62   8.72
  val dgr    2.77   8.91   8.27     3.73   8.60   7.14     3.42   9.46   6.63     3.37  12.79   6.34     1.39  35.18   9.01
  val hgr    2.33  10.45   9.03     4.56  13.09   9.45     4.95  13.82   8.55     3.48  15.09   7.12     1.46  38.72   9.70
Sampling method
  nosec k1   5.01   7.78   7.47     5.86   9.31   8.05     5.61  10.87   8.21     4.62  13.95   7.92     2.37  34.79  10.84
  api08 k1   5.08   7.76   7.37     5.85   9.38   7.96     5.57  11.37   8.10     4.50  15.42   7.82     2.33  45.98  12.49
  api32 k1   5.06   7.77   7.39     5.87   9.46   8.04     5.64  11.33   8.15     4.56  15.26   7.84     2.36  44.08  11.91
  nosec k3   0.99   3.94   3.69     1.22   4.23   3.53     1.31   5.69   4.13     1.18   8.56   4.48     0.94  25.98   6.86
  api08 k3   0.98   4.10   3.70     1.21   4.43   3.60     1.31   6.21   4.22     1.17   9.68   4.55     0.93  31.57   7.10
  api32 k3   0.97   4.36   3.77     1.19   4.67   3.57     1.28   6.48   4.13     1.16  10.13   4.48     0.93  33.03   7.06

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial
distribution of households, the spatial distribution of the target variable and the sampling method.
Appendix B
Partial R Code
B.1 Packages Used
• dplyr (Wickham and Francois, 2015)
• ggplot2 (Wickham, 2009)
• Matrix (Bates and Maechler, 2015)
• mefa (Sólymos, 2009)
• parallel (R Core Team, 2015)
• plyr (Wickham, 2011)
• reshape2 (Wickham, 2007)
B.2 General Parameters
alpha_0      The angle span of a sector (measured in radians); a value in [0, 2π]
             or NULL. If the latter, the first household is randomly chosen from
             the entire population.
dmat         A matrix of distances between each pair of households in the
             population.
gam          A vector of angular coordinates corresponding to the households
             in the population.
k            Every kth neighbour along an EPI path is added to the sample.
loc          A data frame of household locations. It may also include a column
             of values for the response variable associated with each household,
             if this is required by the function.
n            Number of households to be sampled from the population.
pmat         A matrix of inclusion probabilities for individual households and
             pairs of households.
R            Number of samples to be generated.
samp         A data frame where each row represents a sample. Include a
             column for the probability of a sample labeled prob; otherwise,
             samples will be assigned a probability of 1/R, where R is the total
             number of rows in the data frame.
samp_frame   A vector that lists all the sampling units.
theta_d      A sector direction (measured in radians); a value in [0, 2π).
u1           Identification number of the first selected household.
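As an illustration of the inputs loc, dmat and gam, the sketch below builds them for four made-up households. It assumes that loc has coordinate columns named x and y and that angular coordinates are measured about the centre (0, 0); the helper that produces gam in our own code is not listed in this appendix, so the atan2 expression here is a stand-in rather than that function.

```r
# Four hypothetical households; row names act as identification numbers.
loc <- data.frame(x = c(100, -200, 0, 300), y = c(0, 150, -250, 300))
rownames(loc) <- seq_len(nrow(loc))

# dmat: Euclidean distances between every pair of households.
dmat <- as.matrix(dist(loc[, c("x", "y")]))

# gam: angular coordinate of each household about (0, 0), in [0, 2*pi).
gam <- atan2(loc$y, loc$x) %% (2 * pi)
```

The row and column names of dmat are the household identification numbers, which matters for functions that look households up by name.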
B.3 Main Functions
B.3.1 Simulation of EPI Sampling
Households in a Random Sector
Generates a sample of households by identifying all the households in a randomly
selected geographic sector of the population.
epi_samp_sec <- function(loc, gam, alpha_0, theta_d="random"){
  # Draw a random sector direction unless one is supplied
  if(theta_d=="random"){theta_d <- rdm_direc()}
  # Code passed to in_sector() for the boundary cases alpha_0 = 0 and 2*pi
  if(alpha_0==0){ac_code <- 0}
  else if(alpha_0>0 & alpha_0<(2*pi)){ac_code <- NULL}
  else if(alpha_0==(2*pi)){ac_code <- 1}
  # Sector angles and the households that fall inside the sector
  sec <- cbind(theta_d, sec_angles(alpha_0,theta_d))
  hh <- in_sector(loc, gam, a1=sec[,"a1"], a2=sec[,"a2"],
                  ambig_case=ac_code)
  return(list(sec=sec, hh=hh))
}
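The helper functions called above (rdm_direc, sec_angles and in_sector) are not listed in this appendix. Purely as an illustration of the kind of test involved, and not as a reproduction of in_sector, the sketch below flags households whose angular coordinate lies within a sector that is assumed to be centred on theta_d with total span alpha_0. The function name in_sector_sketch and the centring convention are assumptions.

```r
# Hypothetical stand-in: TRUE for households inside a sector centred on
# theta_d with total angular span alpha_0 (all angles in radians).
in_sector_sketch <- function(gam, theta_d, alpha_0) {
  # Smallest absolute angular difference between each household and theta_d,
  # handling wrap-around at 0 / 2*pi.
  ang_diff <- abs((gam - theta_d + pi) %% (2 * pi) - pi)
  ang_diff <= alpha_0 / 2
}

gam    <- c(0, pi / 2, pi, 3 * pi / 2)
inside <- in_sector_sketch(gam, theta_d = 0.1, alpha_0 = pi / 4)
inside
```

With these inputs only the first household, whose direction is within π/8 of theta_d, is flagged as inside the sector.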
Sample of Nearest Neighbours
Finds the nearest neighbour of the current household and adds it to the sample
until the sample reaches the required size. When a household has multiple nearest
neighbours, a random selection is made among the nearest neighbours.
epi_samp_hh <- function(loc, dmat, n, u1){
  # Setup
  new_hh <- u1
  selected <- c(new_hh, rep(NA,n-1))
  # Construct path of nearest neighbours
  if(n>=2){
    for(i in 2:n){
      current_hh <- new_hh
      dmat[,current_hh] <- NA
      current_row <- dmat[current_hh,]
      min_dist <- min(current_row, na.rm=TRUE)
      closest_hh <- names(current_row)[(which(current_row==min_dist))]
      new_hh <- sample(closest_hh,1)
      selected[i] <- new_hh
    }
  }
  # Return locations of selected households
  loc[selected,]
}
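A small usage example may help to show how the nearest-neighbour path is traced. The block restates epi_samp_hh from the listing above so that it runs on its own; the toy population of five households on a line is made up for illustration.

```r
# Restated from the listing above so that this block is self-contained.
epi_samp_hh <- function(loc, dmat, n, u1){
  new_hh <- u1
  selected <- c(new_hh, rep(NA, n - 1))
  if(n >= 2){
    for(i in 2:n){
      current_hh <- new_hh
      dmat[, current_hh] <- NA            # a visited household cannot be revisited
      current_row <- dmat[current_hh, ]
      min_dist <- min(current_row, na.rm = TRUE)
      closest_hh <- names(current_row)[which(current_row == min_dist)]
      new_hh <- sample(closest_hh, 1)     # random choice among tied neighbours
      selected[i] <- new_hh
    }
  }
  loc[selected, ]
}

# Toy population: five households on a line (coordinates are made up).
loc  <- data.frame(x = c(0, 10, 25, 45, 70), y = 0)
dmat <- as.matrix(dist(loc))   # dimnames "1".."5" are used to look up neighbours
path <- epi_samp_hh(loc, dmat, n = 4, u1 = 3)
rownames(path)                 # "3" "2" "1" "4"
```

Starting from household 3, the path moves to household 2 (15 units away), then to household 1, and then has to jump across to household 4 because the closer households have already been visited. Note that dmat must carry household identifiers as dimnames for the lookup by name to work.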
Generation of an EPI Sample
Simulates the EPI sampling procedure for selecting households in a single cluster.
sim_epi_v0 <- function(loc, gam, dmat, n, k, R, alpha_0=NULL,
theta_d="random", skipped.rm=TRUE){
# Setup
N <- nrow(loc)
ui_seq <- seq(from=1, by=k, length.out=n)
n_exp <- ui_seq[n]
n_lim <- tail(ui_seq[which(ui_seq<=N)],1)
padding <- n_exp-n_lim
if(!is.null(alpha_0)){
# Helper function to generate a single EPI sample.
# First household is selected from a random sector.
119
M.Sc. Thesis - Maria Reyes
McMaster - Mathematics & Statistics
sec_epi_samp <- function(loc, gam, dmat, n_lim, n_exp,
alpha_0, theta_d, padding){
s1 <- epi_samp_sec(loc, gam, alpha_0, theta_d)
if(nrow(s1[["hh"]])>0){
u1 <- rdm_hh(s1[["hh"]])
s2 <- epi_samp_hh(loc, dmat, n_lim, u1)
samp_hh <- c(as.numeric(rownames(s2)), rep(NA,padding))
}
else{
samp_hh <- rep(NA,n_exp)
}
c(s1[["sec"]][,"theta_d"], s1[["sec"]][,"a1"],
s1[["sec"]][,"a2"], nrow(s1[["hh"]]),
samp_hh)
}
output <- t(replicate(R, sec_epi_samp(loc, gam, dmat, n_lim, n_exp,
alpha_0, theta_d, padding)))
}
else{
# is.null(alpha_0)
# Helper function to generate a single EPI sample.
# First household is randomly selected from the entire
# population.
nosec_epi_samp <- function(loc, dmat, N, n_lim, padding){
u1 <- sample(1:N,1)
s2 <- epi_samp_hh(loc, dmat, n_lim, u1)
c(rep(NA,4),as.numeric(rownames(s2)),rep(NA,padding))
}
output <- t(replicate(R, nosec_epi_samp(loc, dmat, N, n_lim,
padding)))
}
if(skipped.rm==TRUE){
cols <- c(1:4,(4+ui_seq))
output <- matrix(output[,cols], nrow=R)
colnames(output) <- c("theta_d", "a1","a2", "sec_pop",
paste0('u', 1:n))
}
else{
# skipped.rm==FALSE
colnames(output) <- c("theta_d", "a1","a2", "sec_pop",
paste0('u', 1:n_exp))
}
data.frame(output)
}
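The position bookkeeping in the setup of sim_epi_v0 can be checked in isolation. The sketch below copies those four assignments into a small helper; the name path_positions and the example values of N, n and k are assumptions introduced here for illustration. It shows which path positions are kept when every k-th household is used, and how much NA padding is needed when the population is too small for the full path.

```r
# path_positions is an illustrative helper, not part of the thesis code
path_positions <- function(N, n, k) {
  ui_seq <- seq(from = 1, by = k, length.out = n)  # path positions that are kept
  n_exp  <- ui_seq[n]                              # path length needed for n units
  n_lim  <- tail(ui_seq[ui_seq <= N], 1)           # last kept position that exists
  list(ui_seq = ui_seq, n_exp = n_exp, n_lim = n_lim,
       padding = n_exp - n_lim)                    # NA columns appended to the path
}

# Population large enough: positions 1, 4, 7, 10 and no padding
path_positions(N = 10, n = 4, k = 3)

# Population too small: the walk stops at position 7 and 3 NAs are appended
path_positions(N = 8, n = 4, k = 3)
```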
sim_epi <- function(loc, dmat, n, R, alpha_0=NULL,
theta_d="random", k=1, skipped.rm=TRUE,
na.rm=TRUE, maxit=1e+6){
# Setup
gam <- hh_direc(loc)
# Initial simulation results
output <- sim_epi_v0(loc,gam,dmat,n,k,R,alpha_0,theta_d,skipped.rm)
if(!is.null(alpha_0)){
clean_output <- output[output[,"sec_pop"]>0,]}
else{clean_output <- output; na.rm=FALSE}
out_samp <- nrow(clean_output)
tot_samp <- out_samp
tot_iter <- R
# Achieve the required number of samples
if(na.rm==TRUE){
req_samp <- R
while(tot_samp<req_samp & tot_iter<maxit){
R <- min((req_samp-tot_samp),(maxit-tot_iter))
new_output <- sim_epi_v0(loc,gam,dmat,n,k,R,alpha_0,theta_d,
skipped.rm)
clean_new_output <- new_output[new_output[,"sec_pop"]>0,]
new_samp <- nrow(clean_new_output)
clean_output <- rbind(clean_output, clean_new_output)
tot_samp <- tot_samp + new_samp
tot_iter <- tot_iter + R
}
if(tot_samp>0){rownames(clean_output) <- NULL}
output <- clean_output
out_samp <- nrow(output)
}
sim_info <- cbind(out_samp,tot_samp,tot_iter)
return(list(samples=output, details=sim_info))
}
B.3.2 Computation of Inclusion Probabilities
Probability of First Sampled Unit
Returns the exact probability that a household is selected as the first unit in a sample
when the first unit is selected from a random sector. Probabilities are computed for
every household in the population.
epi_u1_prob_exact <- function(loc, alpha_0, rescale=TRUE){
# Setup
N <- nrow(loc)
gam <- hh_direc(loc)
gam_ord <- sort(unique(gam))
L_Theta <- 2*pi
# probability density function of theta
if(rescale==TRUE){
delta <- c(diff(gam_ord), gam_ord[1]+2*pi-gam_ord[length(gam_ord)]) - alpha_0
delta_pos <- which(delta>0)
L_Theta <- 2*pi-sum(delta[delta_pos])
}
# Values of theta when the households in the resulting sector
# change
sec_direc <- c(gam_ord-alpha_0/2,gam_ord+alpha_0/2)
sec_direc <- sapply(1:length(sec_direc), function(i) {
current_val <- sec_direc[i]
if(current_val<0){current_val <- current_val+2*pi}
else if(current_val>=2*pi){current_val <- current_val-2*pi}
# ...do nothing if sec_direc is in [0,2*pi)
current_val
})
sec_direc_ord <- sort(sec_direc)
# Number of households in each sector specified by sec_direc_ord
sec_theta0 <- in_sector(loc,gam,alpha_0,sec_direc_ord[1])
sec_theta0_hh <- sec_theta0[,"hh_id"]
n_theta0 <- length(sec_theta0_hh)
sec <- sec_angles(alpha_0=alpha_0,theta_d=sec_direc_ord[1])
tol <- .Machine$double.eps^0.5
excl <- which(abs(gam-sec[,"a2"])<=tol)
incl <- which(abs(gam-sec[,"a1"])<=tol)
if(n_theta0==0){n_theta0=length(incl)}
else{
if(length(excl)!=0){
if(any(excl %in% sec_theta0_hh)){n_theta0=n_theta0-
sum(excl %in% sec_theta0_hh)}
}
if(length(incl)!=0){
if(!all(incl %in% sec_theta0_hh)){n_theta0=n_theta0+
sum(!(incl %in% sec_theta0_hh))}
}
}
n_change <- sapply(1:(length(sec_direc_ord)-1), function(i){
sec <- sec_angles(alpha_0,sec_direc_ord[i])
sum(abs(gam-sec[,"a2"])<=tol)-sum(abs(gam-sec[,"a1"])<=tol)
})
sec_pop <- cumsum(c(n_theta0,n_change))
# Data frame of theta values and number of households in the
# corresponding sector
sec_hh_dat <- data.frame(sec_direc_ord,sec_pop)
# Conditional probability of the first sampled unit given a
# sector
u1_cond_sec <- sapply(1:N, function(i){
sec <- sec_angles(alpha_0,gam[i])
a1 <- sec[,"a1"]; a2 <- sec[,"a2"]
if(a1<a2){
sec_subset <- sec_hh_dat[which(((sec_direc_ord>a1)|(
abs(sec_direc_ord-a1)<=tol)) &
((sec_direc_ord<a2)|(
abs(sec_direc_ord-a2)<=tol))),]
}
else{
# Since alpha_0>0, a1 and a2 will never be equal.
# Thus, if a1<a2 is FALSE then a1>a2 is TRUE
sec_subset1 <- sec_hh_dat[which((sec_direc_ord>a1)|(
abs(sec_direc_ord-a1)<=tol)),]
sec_subset2 <- sec_hh_dat[which((sec_direc_ord<a2)|(
abs(sec_direc_ord-a2)<=tol)),]
sec_subset1[,"sec_direc_ord"] <- sec_subset1[,"sec_direc_ord"]-a1
sec_subset2[,"sec_direc_ord"] <- sec_subset2[,"sec_direc_ord"]+(2*pi-a1)
sec_subset <- rbind(sec_subset1,sec_subset2)
}
sum(diff(sec_subset[,"sec_direc_ord"]) /
sec_subset[2:nrow(sec_subset),"sec_pop"])
})
# Output
data.frame(u1=1:N, prob=u1_cond_sec/L_Theta)
}
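The quantity returned by epi_u1_prob_exact can be approximated numerically as a rough cross-check. The sketch below is not the exact algorithm above. It assumes that a sector is centred on the direction theta with total angle alpha_0, treats both sector edges as inclusive, and replaces the exact integration over theta by a fine midpoint grid. The household directions in gam are made-up values. For each direction with a non-empty sector, every household in the sector is equally likely to be chosen, and the result is averaged over those directions (the rescale=TRUE case).

```r
# Hypothetical household directions in [0, 2*pi) and sector angle
gam <- c(0.2, 0.5, 2.0, 4.0)
alpha_0 <- pi / 3

# Midpoint grid of candidate directions theta
M <- 20000
theta <- (seq_len(M) - 0.5) * 2 * pi / M

# Smallest angular distance between two directions, allowing for wrap-around
ang_dist <- function(a, b) {
  d <- abs(a - b) %% (2 * pi)
  pmin(d, 2 * pi - d)
}

# in_sec[t, i] is TRUE when household i lies in the sector centred on theta[t]
in_sec <- outer(theta, gam, function(t, g) ang_dist(t, g) <= alpha_0 / 2)
sec_pop <- rowSums(in_sec)
nonempty <- sec_pop > 0

# Average of 1/sec_pop over the directions whose sector contains household i,
# taken over directions with a non-empty sector only
u1_prob <- colSums(in_sec[nonempty, , drop = FALSE] / sec_pop[nonempty]) /
  sum(nonempty)
round(u1_prob, 3)
```

In this made-up layout the two isolated households (directions 2.0 and 4.0) receive a larger first-unit probability than the two that share sectors, which illustrates why equal weighting of sampled households may be questionable under this selection scheme.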
Path Probability Conditional on the Starting Household
Returns all possible EPI paths for a fixed sample size and the conditional probability
of each of these paths given the first household in the sample.
epi_path_cond_prob <- function(loc, dmat, n){
# Setup
samp_frame <- as.numeric(rownames(loc))
N <- length(samp_frame)
path_units <- data.frame(u1=samp_frame)
path_probs <- rep(1,N)
# Helper function to add a household to each of the given paths
# and update the path probability
get_nn_prob <- function(i,old_paths,old_probs,dmat){
current_path <- old_paths[i,]
current_prob <- old_probs[i]
current_hh <- tail(current_path,1)
dmat_subset <- dmat[current_hh,]
dmat_subset[current_path] <- NA
min_dist <- min(dmat_subset, na.rm=TRUE)
closest_hh <- as.numeric(names(which(dmat_subset==min_dist)))
new_paths <- matrix(rep(current_path, length(closest_hh)),
ncol=length(current_path), byrow=TRUE)
new_paths <- cbind(new_paths, closest_hh)
new_probs <- current_prob * (1/length(closest_hh))
cbind(new_paths, new_probs)
}
# Enumerate paths
if(n>=2){
for(j in 2:n){
old_paths <- path_units
old_probs <- path_probs
temp_results <- do.call("rbind",
llply(1:nrow(old_paths), function(i)
get_nn_prob(i,old_paths,old_probs,dmat)))
path_units <- temp_results[,1:j]
path_probs <- temp_results[,j+1]
}
}
# Format results
output <- data.frame(path_units, path_probs)
colnames(output) <- c(paste0('u', 1:n),"path_cond_u1")
# Return results
output
}
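The effect of ties on the conditional path probabilities is easiest to see on a very small population. The sketch below is a simplified base R re-implementation written for illustration only: the helper name enum_paths and the three household positions are assumptions, and it requires n to be no larger than the number of households. The middle household has two equally distant neighbours, so every path that starts there has conditional probability 1/2.

```r
# Three hypothetical households on a line at x = 0, 1, 2
dmat <- as.matrix(dist(c(0, 1, 2)))

# enum_paths is an illustrative helper: enumerate all nearest-neighbour paths
# of length n together with their probability given the starting household
enum_paths <- function(dmat, n) {
  paths <- lapply(seq_len(nrow(dmat)), function(i) list(units = i, prob = 1))
  for (j in seq_len(n)[-1]) {
    paths <- unlist(lapply(paths, function(p) {
      d <- dmat[tail(p$units, 1), ]
      d[p$units] <- NA                     # exclude households already on the path
      nearest <- unname(which(d == min(d, na.rm = TRUE)))
      # One extended path per tied neighbour, each with an equal share
      lapply(nearest, function(h)
        list(units = c(p$units, h), prob = p$prob / length(nearest)))
    }), recursive = FALSE)
  }
  units <- do.call(rbind, lapply(paths, `[[`, "units"))
  colnames(units) <- paste0("u", seq_len(n))
  data.frame(units, path_cond_u1 = sapply(paths, `[[`, "prob"))
}

enum_paths(dmat, n = 3)
```

The four rows are the paths (1,2,3), (2,1,3), (2,3,1) and (3,2,1), with conditional probabilities 1, 0.5, 0.5 and 1; within each starting household the probabilities sum to one.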
Path Probability
Returns all possible EPI paths for a fixed sample size and the probability of each of
these paths.
epi_path_prob <- function(loc, dmat, n, alpha_0=NULL, D=NULL,
k=1, skipped.rm=TRUE,
rescale=TRUE, zeros.rm=FALSE){
# Setup
samp_frame <- as.numeric(rownames(loc))
N <- length(samp_frame)
ui_seq <- seq(from=1, by=k, length.out=n)
n_exp <- ui_seq[n]
n_lim <- tail(ui_seq[which(ui_seq<=N)],1)
padding <- n_exp-n_lim
q <- n_exp+(1:3) # Probability columns
# Probability of selecting first household
u1prob <- epi_u1_prob(loc, alpha_0, D, rescale, zeros.rm)
# Probability of path
if(n_lim==1){
output <- data.frame(u1=u1prob[,"u1"],
u1_uncond=u1prob[,"prob"],
path_cond_u1=rep(1,nrow(u1prob)),
prob=u1prob[,"prob"])
}
else{
pathcp <- epi_path_cond_prob(loc, dmat, n_lim)
path_info <- function(i, u1prob, pathcp, n_lim, padding){
u1 <- u1prob[i,"u1"]
paths <- pathcp[which(pathcp[,"u1"]==u1),]
u1_uncond <- rep(u1prob[i,"prob"],nrow(paths))
path_cond_u1 <- paths[,"path_cond_u1"]
data.frame(paths[,1:n_lim], matrix(nrow=1,ncol=padding),
u1_uncond, path_cond_u1,prob=u1_uncond*path_cond_u1)
}
output <- ldply(1:nrow(u1prob), function(i)
path_info(i, u1prob, pathcp, n_lim, padding))
}
if(skipped.rm==TRUE){
cols <- c(ui_seq,q)
output <- output[,cols]
colnames(output) <- c(paste0('u', 1:n),
"u1_uncond","path_cond_u1","prob")
}
else{
# skipped.rm==FALSE
colnames(output) <- c(paste0('u', 1:n_exp),
"u1_uncond","path_cond_u1","prob")
}
output
}
Matrix of Inclusion Probabilities
Returns a matrix of inclusion probabilities for individual households and pairs of
households. Adapted from a post on Stack Overflow (see footnote 1).
incl_prob <- function(samp_frame, samp){
N <- length(samp_frame)
prob <- NULL
if("prob" %in% colnames(samp)){prob <- samp[,"prob"]}
samp <- samp_units(samp)
R <- nrow(samp)
is_present <- function(i, samp_frame, samp){
# Retrieved from http://stackoverflow.com/a/27178390/3808364
apply(samp, 1, function(row)
as.integer(any(row==samp_frame[i])))
}
ncores <- detectCores()
clus <- makeCluster(ncores)
clusterExport(clus, c("is_present","N","samp_frame","samp"),
envir=environment())
m <- Matrix(parSapply(clus, 1:N, function(h)
is_present(h,samp_frame,samp)),
sparse=TRUE)
stopCluster(clus)
if(is.null(prob)){
# Footnote 1: r - creating a matrix of bivariate frequencies. (2014). Retrieved from
# http://stackoverflow.com/a/27178390/3808364
# Estimated inclusion probabilities
output <- (1/R) * Matrix::t(m) %*% m
}
else{
# Exact or approximated inclusion probabilities
row_weight <- Matrix(Diagonal(R,prob),sparse=TRUE)
output <- Matrix::t(m) %*% row_weight %*% m
}
output
}
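The matrix product used for exact inclusion probabilities can be illustrated with dense base R matrices. This is a minimal sketch under stated assumptions: the three samples and their probabilities are invented, and diag() stands in for the sparse Diagonal() call used above. Each row of m flags the households present in one sample, so weighting the rows by the sample probabilities and multiplying gives the first-order probabilities on the diagonal and the pairwise probabilities off the diagonal.

```r
# Hypothetical design: three possible samples of size 2 from N = 3 households
samp <- rbind(c(1, 2), c(1, 3), c(2, 3))
prob <- c(0.5, 0.3, 0.2)   # probability of each sample (assumed values)
N <- 3

# Indicator matrix: m[r, h] = 1 when household h is in sample r
m <- t(apply(samp, 1, function(row) as.integer(seq_len(N) %in% row)))

# Diagonal holds pi_i, off-diagonal entries hold pi_ij
pmat <- t(m) %*% diag(prob) %*% m
pmat

# Consistency checks for a fixed sample size n = 2
sum(diag(pmat))              # equals n
rowSums(pmat) - diag(pmat)   # equals (n - 1) * pi_i for each household
```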
B.3.3 Estimation of Population Proportion
Distribution of Estimator for Population Proportion
Returns the distribution and properties of an estimator (expected value, bias, variance, and mean square error). The Horvitz-Thompson estimator and the restricted
Horvitz-Thompson estimator are computed in addition to the usual proportion estimator when a matrix of inclusion probabilities is provided.
phat_distrib <- function(loc, samp, pmat=NULL){
# Setup
z <- as.numeric(as.character(loc[,"z"]))
ppop <- mean(z)
N <- nrow(loc)
samp_hh <- as.matrix(samp_units(samp))
if("prob" %in% colnames(samp)){samp_prob <- samp[,"prob"]}
else{samp_prob <- rep(NA,nrow(samp))}
# phat_eqw
ncores <- detectCores()
clus <- makeCluster(ncores)
clusterExport(clus, c("samp_hh","z","N"),
envir=environment())
phat_eqw <- parSapply(clus, 1:nrow(samp_hh), function(i)
mean(z[samp_hh[i,]],na.rm=TRUE))
stopCluster(clus)
# phat_ht, phat_ht_res
if(!is.null(pmat)){
hh_prob <- get_prob(pmat, prob_type="indv", zeros.rm=FALSE)[,"prob"]
wtval <- z/hh_prob
ncores <- detectCores()
clus <- makeCluster(ncores)
clusterExport(clus, c("samp_hh","wtval","N"),
envir=environment())
phat_ht <- parSapply(clus, 1:nrow(samp_hh), function(i)
(1/N)*sum(wtval[samp_hh[i,]],na.rm=TRUE))
stopCluster(clus)
phat_ht_res <- apply(data.frame(phat_ht, 1),1,min)
}
else{
phat_ht <- NULL
phat_ht_res <- NULL
}
# ------------------------------------------------------------
# Helper function
get_distrib <- function(phat, prob, ppop){
if(is.numeric(prob)){
distrib <- data.frame(phat, prob=prob)
distrib <- aggregate(prob~phat, data=distrib, sum)
expval <- sum(distrib[,"prob"]*distrib[,"phat"])
bias <- expval-ppop
variance <- sum(distrib[,"prob"]*(distrib[,"phat"]-expval)^2)
stdev <- sqrt(variance)
mse <- sum(distrib[,"prob"]*(distrib[,"phat"]-ppop)^2)
properties <- data.frame(expval,bias,variance,stdev,mse)
}
else{
# prob==NA
freq <- table(phat)
distrib <- data.frame(phat=as.numeric(rownames(freq)),
prob=data.frame(freq)[,2]/length(phat))
est_expval <- mean(phat)
est_bias <- est_expval-ppop
est_variance <- var(phat)
est_stdev <- sd(phat)
est_mse <- (1/(length(phat)-1))*sum((phat-ppop)^2)
properties <- data.frame(est_expval,est_bias,est_variance,
est_stdev,est_mse)
}
list(distrib=distrib, properties=properties)
}
# ------------------------------------------------------------
# Compute distribution and properties
unweighted <- get_distrib(phat_eqw,samp_prob,ppop)
if(!is.null(phat_ht)){
ht_unres <- get_distrib(phat_ht,samp_prob,ppop)
ht_res <- get_distrib(phat_ht_res,samp_prob,ppop)
# Aggregate probability of restricted values
agg_prob_res <- sum(ht_unres[[1]]
[which(ht_unres[[1]]
[,"phat"]>1),"prob"])
# Compile results
stat_summary <- data.frame(est_formula=c("eqw","ht","ht_res"),
rbind(unweighted[["properties"]],
ht_unres[["properties"]],
ht_res[["properties"]]))
}
else{
ht_unres <- NA
ht_res <- NA
agg_prob_res <- NA
stat_summary <- data.frame(est_formula=c("eqw","ht","ht_res"),
rbind(unweighted[["properties"]],
rep(NA,5), rep(NA,5)))
}
# ------------------------------------------------------------
# Return output
list(phat_eqw=unweighted, phat_ht=ht_unres,
phat_ht_res=ht_res, stat_summary=stat_summary,
agg_prob_res=agg_prob_res)
}
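The three estimators compared by phat_distrib can be worked through by hand on a tiny design. The sketch below is a serial base R illustration, not the parallel implementation above: the outcome vector z, the three samples, and their probabilities are invented, and the helper name summarise_est is introduced here. It computes the equally weighted estimator, the Horvitz-Thompson estimator, and the restricted Horvitz-Thompson estimator (capped at 1), then their exact expectation, bias, variance and mean square error.

```r
# Hypothetical population and design (assumed values)
z <- c(1, 0, 1)                              # binary outcome per household
samp <- rbind(c(1, 2), c(1, 3), c(2, 3))     # all possible samples
prob <- c(0.5, 0.3, 0.2)                     # probability of each sample
N <- length(z)
ppop <- mean(z)

# First-order inclusion probabilities: total probability of samples containing i
pi_i <- sapply(seq_len(N), function(i)
  sum(prob[apply(samp, 1, function(s) i %in% s)]))

phat_eqw    <- apply(samp, 1, function(s) mean(z[s]))
phat_ht     <- apply(samp, 1, function(s) sum(z[s] / pi_i[s]) / N)
phat_ht_res <- pmin(phat_ht, 1)              # proportions cannot exceed 1

# summarise_est is an illustrative helper for the exact properties
summarise_est <- function(phat, prob, ppop) {
  expval <- sum(prob * phat)
  c(expval = expval, bias = expval - ppop,
    variance = sum(prob * (phat - expval)^2),
    mse = sum(prob * (phat - ppop)^2))
}

stat_summary <- rbind(eqw    = summarise_est(phat_eqw, prob, ppop),
                      ht     = summarise_est(phat_ht, prob, ppop),
                      ht_res = summarise_est(phat_ht_res, prob, ppop))
stat_summary
```

In this example the Horvitz-Thompson estimator is unbiased, while the equally weighted and restricted estimators are both biased downward, because capping the one estimate that exceeds 1 pulls the expectation below the population proportion.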
B.4 Other Functions Created
check_prob
Checks relations between computed inclusion probabilities:
• $\sum_{i=1}^{N} \pi_i = n$;
• $\sum_{j \neq i}^{N} \pi_{ij} = (n-1)\pi_i$;
• $\sum_{i=1}^{N} \sum_{j>i}^{N} \pi_{ij} = \frac{1}{2} n(n-1)$;
(Cochran, 1977).
dist_mat
Computes the distance between each pair of households.
gen_hh
Generates household locations.
gen_val
Assigns a binary outcome to households in a population.
get_prob
Extracts inclusion probabilities for individual households or inclusion probabilities for pairs of households from a matrix of probabilities.
hh_dist
Computes the distance of a household from the origin.
hh_direc
Returns the angle made when moving counterclockwise from the
positive x-axis to a given household.
in_sector
Identifies households which lie in the specified sector.
phat_var
Computes the population variance of the proportion estimator
using the following formulas:
• $Var(\hat{p}_{SRS}) = \frac{N-n}{N-1} \cdot \frac{p(1-p)}{n}$ (Lohr, 2009);
• $Var(\hat{p}_{HT}) = \frac{1}{N^2} \left[ \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i} y_i^2 + 2 \sum_{i=1}^{N} \sum_{j>i}^{N} \frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j} y_i y_j \right]$
(Cochran, 1977).
Used to check variances computed by phat_distrib.
rdm_direc
Returns a random value in [0, 2π).
rdm_hh
Returns a random household number.
samp_units
Extracts sampled units from simulation output.
samp_val
Returns values associated with sampled units.
sec_analysis
Returns intervals of θ associated with non-empty sectors.
sec_angles
Returns the angles where a sector starts and ends.
sim_srs
Generates samples according to SRS.
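The relations checked by check_prob and the two variance formulas used by phat_var can be verified together in a case where the inclusion probabilities are known in closed form. The source of these two functions is not listed in this appendix, so the following is only a sketch under assumptions: it uses simple random sampling without replacement, for which $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$, and a made-up outcome vector y. Under this design the Horvitz-Thompson variance should coincide with the SRS variance.

```r
# Hypothetical population: N = 5 households, samples of size n = 2
N <- 5; n <- 2
y <- c(1, 1, 0, 0, 0)
p <- mean(y)

# Inclusion probabilities under SRS without replacement
pi_i  <- rep(n / N, N)
pi_ij <- matrix(n * (n - 1) / (N * (N - 1)), N, N)
diag(pi_ij) <- pi_i

# Relations between inclusion probabilities
off <- pi_ij; diag(off) <- 0
rel1 <- sum(pi_i)                      # should equal n
rel2 <- rowSums(off)                   # should equal (n - 1) * pi_i
rel3 <- sum(off[upper.tri(off)])       # should equal n * (n - 1) / 2

# Variance of the proportion estimator under SRS
var_srs <- (N - n) / (N - 1) * p * (1 - p) / n

# Horvitz-Thompson variance from first- and second-order probabilities
pp <- outer(pi_i, pi_i)
cross <- ((pi_ij - pp) / pp * outer(y, y))[upper.tri(pi_ij)]
var_ht <- (sum((1 - pi_i) / pi_i * y^2) + 2 * sum(cross)) / N^2

c(var_srs = var_srs, var_ht = var_ht)  # both equal 0.09
```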
Bibliography
1. Bates, D. and Maechler, M. (2015). Matrix: Sparse and Dense Matrix Classes and
Methods. R package version 1.2-3.
2. Bennett, S., Radalowicz, A., Vella, V., and Tomkins, A. (1994). A computer simulation of household sampling schemes for health surveys in developing countries.
International Journal of Epidemiology, 23(6):1282–1291.
3. Bennett, S., Woods, T., Liyanage, W. M., and Smith, D. L. (1991). A simplified
general method for cluster-sample surveys of health in developing countries. World
Health Statistics Quarterly, 44(3):98–106.
4. Bolker, B. (2008). Ecological Models and Data in R. Princeton University Press.
5. Bostoen, K., Bilukha, O. O., Fenn, B., Morgan, O. W., Tam, C. C., ter Veen, A.,
and Checchi, F. (2007). Methods for health surveys in difficult settings: charting
progress, moving forward. Emerging Themes in Epidemiology, 4:13.
6. Brogan, D., Flagg, E. W., Deming, M., and Waldman, R. (1994). Increasing the
accuracy of the Expanded Programme on Immunization’s cluster survey design.
Annals of Epidemiology, 4(4):302–311.
7. Burnham, G., Lafta, R., Doocy, S., and Roberts, L. (2006). Mortality after the 2003
invasion of Iraq: a cross-sectional cluster sample survey. The Lancet, 368(9545):1421–1428.
8. Centers for Disease Control and Prevention and World Food Programme (2007). A Manual:
Measuring and Interpreting Malnutrition and Mortality. Retrieved from
http://www.unscn.org/en/resource_portal/index.php?&themes=199&resource=602
9. City of Ottawa (2012). GM - General Mixed Use Zone (Sec. 187-188). Retrieved from
http://ottawa.ca/en/residents/laws-licenses-and-permits/laws/city-ottawa-zoning-law/gm-general-mixed-use-zone-sec-187
10. Cochran, W. G. (1977). Sampling Techniques. Wiley, New York, 3rd edition.
11. Coghlan, B., Brennan, R. J., Ngoy, P., Dofara, D., Otto, B., Clements, M., and
Stewart, T. (2006). Mortality in the Democratic Republic of Congo: a nationwide
survey. The Lancet, 367(9504):44–51.
12. Drysdale, S., Howarth, J., Powell, V., and Healing, T. (2000). The use of cluster sampling to determine aid needs in Grozny, Chechnya in 1995. Disasters,
24(3):217–227.
13. Escamilla, V., Emch, M., Dandalo, L., Miller, W. C., Martinson, F., and Hoffman, I.
(2014). Sampling at community level by using satellite imagery and geographical
analysis. Bulletin of the World Health Organization, 92(9):690–694.
14. Fonn, S., Sartorius, B., Levin, J., and Likibi, M. L. (2006). Immunisation coverage estimates by cluster sampling survey of children (aged 12–23 months) in
Gauteng province, 2003. Southern African Journal of Epidemiology and Infection, 21(4):164–169.
15. Galway, L., Bell, N., SAE, A. S., Hagopian, A., Burnham, G., Flaxman, A., Weiss,
W. M., Rajaratnam, J., and Takaro, T. K. (2012). A two-stage cluster sampling
method using gridded population data, a GIS, and Google Earth™ imagery in a
population-based mortality survey in Iraq. International Journal of Health Geographics, 11:12.
16. Grais, R. F., Rose, A. M. C., and Guthmann, J.-P. (2007). Don’t spin the pen: two
alternative methods for second-stage sampling in urban cluster surveys. Emerging
Themes in Epidemiology, 4:8.
17. Hanif, M. and Brewer, K. R. W. (1980). Sampling with Unequal Probabilities without
Replacement: A Review. International Statistical Review / Revue Internationale
de Statistique, 48(3):317–335.
18. Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods
and Theory: Methods and Applications, Volume 2. Wiley & Sons, New York.
19. Henderson, R. H., Davis, H., Eddins, D. L., and Foege, W. H. (1973). Assessment
of vaccination coverage, vaccination scar rates, and smallpox scarring in five areas
of West Africa. Bulletin of the World Health Organization, 48(2):183–194.
20. Henderson, R. H. and Sundaresan, T. (1982). Cluster sampling to assess immunization coverage: a review of experience with a simplified sampling method. Bulletin
of the World Health Organization, 60(2):253–260.
21. Hlady, W. G., Quenemoen, L. E., Armenia-Cope, R. R., Hurt, K. J., Malilay, J.,
Noji, E. K., and Wurm, G. (1994). Use of a modified cluster sampling method
to perform rapid needs assessment after Hurricane Andrew. Annals of Emergency
Medicine, 23(4):719–725.
22. Hoshaw-Woodard, S. (2001). Description and Comparison of the Methods of Cluster
Sampling and Lot Quality Assurance Sampling to Assess Immunization Coverage.
Department of Vaccines and Biologicals, World Health Organization, Geneva.
23. Housing Development Agency (2012). South Africa: Informal Settlements Status.
Retrieved from http://www.thehda.co.za/information/research/category/research
24. Katz, J., Yoon, S. S., Brendel, K., and West, K. P. (1997). Sampling designs
for xerophthalmia prevalence surveys. International Journal of Epidemiology,
26(5):1041–1048.
25. Kish, L. (1965). Survey Sampling. Wiley-Interscience, New York, 1st edition.
26. Kok, P. W. (1986). Cluster sampling for immunization coverage. Social Science &
Medicine, 22(7):781–783.
27. Kondo, M. C., Bream, K. D., Barg, F. K., and Branas, C. C. (2014). A random
spatial sampling method in a rural developing nation. BMC Public Health, 14:338.
28. Legetic, B., Jakovljevic, D., Marinkovic, J., Niciforovic, O., and Stanisavljevic, D.
(1996). Health care delivery and the status of the population’s health in the current
crises in former Yugoslavia using EPI-design methodology. International Journal
of Epidemiology, 25(2):341–348.
29. Lemeshow, S. and Robinson, D. (1985). Surveys to measure programme coverage
and impact: a review of the methodology used by the expanded programme on
immunization. World Health Statistics Quarterly, 38(1):65–75.
30. Lemeshow, S., Tserkovnyi, A. G., Tulloch, J. L., Dowd, J. E., Lwanga, S. K., and
Keja, J. (1985). A computer simulation of the EPI survey strategy. International
Journal of Epidemiology, 14(3):473–481.
31. Levy, P. S. and Lemeshow, S. (2008). Sampling of Populations: Methods and Applications. Wiley, 4th edition.
32. Lohr, S. L. (2009). Sampling: Design and Analysis. Duxbury Press, Boston, 2nd
edition.
33. Luman, E. T., Worku, A., Berhane, Y., Martin, R., and Cairns, L. (2007). Comparison of two survey methodologies to assess vaccination coverage. International
Journal of Epidemiology, 36(3):633–641.
34. Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley, Hoboken,
NJ.
35. MacIntyre, K. (1999). Rapid assessment and sample surveys: trade-offs in precision
and cost. Health Policy and Planning, 14(4):363–373.
36. Marasinghe, M. (2009). Monte Carlo Methods. Retrieved from
http://www.public.iastate.edu/~mervyn/stat580/Notes
37. Médecins Sans Frontières (2006). Rapid Health Assessment of Refugee or Displaced
Populations. Retrieved from http://www.refbooks.msf.org/msf_docs/en/MSFdocMenu_en.htm
38. Miller, I. and Miller, M. (2003). John E. Freund’s Mathematical Statistics with
Applications. Pearson, Upper Saddle River, NJ, 7th edition.
39. Milligan, P., Njie, A., and Bennett, S. (2004). Comparison of two cluster sampling
methods for health surveys in developing countries. International Journal of Epidemiology, 33(3):469–476.
40. National Institute of Population Research and Training, Mitra and Associates, and
ICF International (2013). Bangladesh Demographic and Health Survey 2011. Retrieved from
http://dhsprogram.com/publications/publication-fr265-dhs-final-reports.cfm
41. R Core Team (2015). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria.
42. Roberts, L., Lafta, R., Garfield, R., Khudhairi, J., and Burnham, G. (2004). Mortality before and after the 2003 invasion of Iraq: cluster sample survey. The Lancet,
364(9448):1857–1864.
43. Rose, A. M., Grais, R. F., Coulombier, D., and Ritter, H. (2006). A comparison of
cluster and systematic sampling methods for measuring crude mortality. Bulletin
of the World Health Organization, 84(4):290–296.
44. Rothenberg, R. B., Lobanov, A., Singh, K. B., and Stroh, G. (1985). Observations
on the application of EPI cluster survey methods for estimating disease incidence.
Bulletin of the World Health Organization, 63(1):93–99.
45. Salama, P., Assefa, F., Talley, L., Spiegel, P., van Der Veen, A., and Gotway, C. A.
(2001). Malnutrition, measles, mortality, and the humanitarian response during a
famine in Ethiopia. Journal of the American Medical Association, 286(5):563–571.
46. Serfling, R. E. and Sherman, I. L. (1965). Attribute Sampling Methods for Local
Health Departments. U. S. Dept. of Health, Education, and Welfare, Public Health
Service, Communicable Disease Center, Epidemiology Branch, Atlanta.
47. Shannon, H. S., Hutson, R., Kolbe, A., Stringer, B., and Haines, T. (2012). Choosing
a survey sample when data on the population are limited: a method using Global
Positioning Systems and aerial and satellite photographs. Emerging Themes in
Epidemiology, 9(1):5.
48. Siri, J. G., Lindblade, K. A., Rosen, D. H., Onyango, B., Vulule, J. M., Slutsker,
L., and Wilson, M. L. (2008). A census-weighted, spatially-stratified household
sampling strategy for urban malaria epidemiology. Malaria Journal, 7:39.
49. SMART (2012). Sampling Methods and Sample Size Calculation for the SMART Methodology.
Retrieved from http://smartmethodology.org/survey-planning-tools/smart-methodology
50. Sólymos, P. (2009). Processing ecological data in R with the mefa package. Journal
of Statistical Software, 29(8):1–28.
51. Statistics Canada (2015). Dissemination area (DA) - Census Dictionary. Retrieved from
https://www12.statcan.gc.ca/census-recensement/2011/ref/dict/geo021-eng.cfm
52. The Swiss Foundation for Mine Action (FSD) (2016). Using High-resolution Imagery to
Support the Post-earthquake Census in Port-au-Prince, Haiti. Retrieved from
http://drones.fsd.ch/2016/04/26/case-study-no-7-using-high-resolution-imagery-to-support-the-post-earthquake-census-in-port-au-prince-haiti
53. UNICEF (2010). Rapid Assessment Sampling in Emergency Situations. Retrieved
from http://www.unicef.org/eapro/12205_3619.html
54. Vanden Eng, J. L., Wolkon, A., Frolov, A. S., Terlouw, D. J., Eliades, M. J., Morgah, K., Takpa, V., Dare, A., Sodahlon, Y. K., Doumanou, Y., Hawley, W. A.,
and Hightower, A. W. (2007). Use of handheld computers with global positioning systems for probability sampling and data entry in household surveys. The
American Journal of Tropical Medicine and Hygiene, 77(2):393–399.
55. Vinck, P. and Bell, E. (2011). Violent Conflicts And Displacement In Central Mindanao,
Challenges For Recovery And Development. Retrieved from
http://documents.worldbank.org/curated/en/2011/12/16234429/violent-conflicts-displacement-central-mindanao-challenges-recovery-development-vol-1-2-annexes
56. Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical
Software, 21(12):1–20.
57. Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer New
York.
58. Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of
Statistical Software, 40(1):1–29.
59. Wickham, H. and Francois, R. (2015). dplyr: A Grammar of Data Manipulation. R
package version 0.4.2.
60. World Health Organization (2008). Training for Mid-Level Managers (MLM) Module 7:
The EPI Coverage Survey. Retrieved from http://www.who.int/immunization/documents/mlm/en
61. Yoon, S. S., Katz, J., Brendel, K., and West, K. P. (1997). Efficiency of EPI cluster
sampling for assessing diarrhoea and dysentery prevalence. Bulletin of the World
Health Organization, 75(5):417–426.
62. Zimicki, S., Hornik, R. C., Verzosa, C. C., Hernandez, J. R., de Guzman, E., Dayrit,
M., Fausto, A., Lee, M. B., and Abad, M. (1994). Improving vaccination coverage
in urban areas through a health communication campaign: the 1990 Philippine
experience. Bulletin of the World Health Organization, 72(3):409–422.