An Analysis of Equally Weighted and Inverse Probability Weighted Observations in the Expanded Program on Immunization (EPI) Sampling Method

By Maria Reyes, H.B.Sc.

A thesis submitted to the Department of Mathematics & Statistics and the School of Graduate Studies of McMaster University in partial fulfilment of the requirements for the degree of Master of Science

© Copyright by Maria Reyes, September 2016. All Rights Reserved.

Master of Science (2016), Mathematics & Statistics
McMaster University, Hamilton, Ontario, Canada

TITLE: An Analysis of Equally Weighted and Inverse Probability Weighted Observations in the Expanded Program on Immunization (EPI) Sampling Method
AUTHOR: Maria Reyes, H.B.Sc., Statistics & Economics, University of Toronto, Canada
SUPERVISORS: Dr. Román Viveros-Aguilera, Dr. Harry Shannon
NUMBER OF PAGES: xx, 144

To my parents, Alberto and Elizabeth, for all the sacrifices they have made to get me through school, from kindergarten to now my masters, and to my brother, Francis, for our conversations, which provided a nice relief from work.

Abstract

Performing health surveys in developing countries and humanitarian emergencies can be challenging work because the resources in these settings are often quite limited and information needs to be gathered quickly. The Expanded Program on Immunization (EPI) sampling method provides one way of selecting subjects for a survey. It involves having field workers proceed on a random walk guided by a path of nearest household neighbours until they have met their quota for interviews. Due to its simplicity, the EPI sampling method has been utilized by many surveys. However, some concerns have been raised over the quality of estimates resulting from such samples because of possible selection bias inherent to the sampling procedure.

We present an algorithm for obtaining the probability of selecting a household from a cluster under several variations of the EPI sampling plan. These probabilities are used to assess the sampling plans and compute estimator properties. In addition to the typical estimator for a proportion, we also investigate the Horvitz-Thompson (HT) estimator, an estimator that assigns weights to individual responses. We conduct our study on computer-generated populations having different settlement types, different prevalence rates for the characteristic of interest and different spatial distributions of the characteristic of interest.

Our results indicate that within a cluster, selection probabilities can vary widely from household to household. The largest probability was over 10 times greater than the smallest probability in 78% of the scenarios that were tested. Despite this, the properties of the estimator with equally weighted observations (EQW) were similar to what would be expected from simple random sampling (SRS), provided that cases of the characteristic of interest were evenly distributed throughout the cluster area. When this was not true, we found absolute biases as large as 0.20. While the HT estimator was always unbiased, the trade-off was a substantial increase in the variability of the estimator: the design effect relative to SRS reached a high of 92. Overall, the HT estimator did not perform better than the EQW estimator under EPI sampling, and it involves calculations that may be difficult to do for actual surveys. Although we recommend continuing to use the EQW estimator, caution should be taken when cases of the characteristic of interest are potentially concentrated in certain regions of the cluster. In these situations, alternative sampling methods should be sought.

Keywords: Expanded Program on Immunization, household surveys, spatial sampling, selection probabilities, Horvitz-Thompson estimator

Acknowledgments

I would like to extend my sincerest gratitude to my supervisors, Dr. Román Viveros-Aguilera and Dr. Harry Shannon who were so generous with their time and always willing to share their wealth of knowledge with me. I could not have asked for better mentors and role models to guide me as I worked on this thesis. I am indebted to them for getting me to the finish line.

A special thanks to Dr. Gregory Pond who took part in my defense committee. I thank him for his valuable feedback and questions which made me realize what it means to be a statistical consultant. I also wish to thank Dr. Narayanaswamy Balakrishnan for his encouragement, Dr. Ben Bolker for his research suggestions and R coding advice, Dr. Patrick Emond, Dr. Ick Huh, and Atinder Bharaj for the inspiring discussions at our group meetings, and Kenneth Moyle for his help with using the department computer servers. This thesis was funded in part by the Canadian Institutes of Health Research (CIHR) and by the Ashbaugh Graduate Scholarship.

I am grateful to Dr. Alison Weir, Dr. Gordon Anderson, Dr. Jerry Brunner, Dr. Christine Lim, and Asal Aslemand for instilling in me an appreciation for statistics. They introduced me to a new world, and I would not have pursued graduate studies if it were not for them.

I would also like to recognize my parents, Alberto and Elizabeth, my brother Francis, and the following individuals who offered me their support in various ways: E. Alamer, E. Anthonipillai, J. Begum, T. Bekiri, A. Bhatti, S. Birchall, K. Biswas, J. Buckley, S. Caetano, F. Choi, S. Dionyssiou, J. Francis, Q. Gao, C. Glanville, P. Gonyeau, E. Gretchko, S. Hogan, T. Jacques, S. Jana, P. Jevtic, L. Jin, R. Kampo, P. Keown, K. Kim, J. La Rosa, C. Lambeck, M. Li, M. Mendes, J. Pancratius, J. Posada, S. Reiter, S. Sexton, J. Shiels, T. Tan, D. Venditti, and Y. Yang.

Most importantly, I thank God for blessing me with this incredible opportunity to grow and learn, and I thank Our Lady of Fatima, St. Dymphna, St. Joseph of Cupertino, St. Joseph the Worker, St. Anthony, and the many other saints whose intercession gave me the strength to persevere through the difficult times.

Abbreviations

The number following the entries refers to the section in which the term was introduced.

General Abbreviations

DE      Design effect, 3.1
EPI     Expanded Program on Immunization, 1.1
EQW     Equally weighted estimator, 7.1.3
HT      Horvitz-Thompson estimator, 7.1.3
HTR     Horvitz-Thompson estimator with restriction, 7.1.3
MSE     Mean square error, 4.1
PPS     Probability proportional to size, 2.1.2
PSU     Primary sampling unit, 2.1.2
ROH     Rate of homogeneity, 3.1
SRS     Simple random sample, 2.1.1
SSU     Secondary sampling unit, 2.1.2
StRS    Stratified random sample, 4.1
SyRS    Systematic random sample, 8.2
WHO     World Health Organization, 1.1

Spatial Distribution of Households

loc reg    A population with regularly spaced households on a grid, 6.1.1
loc sqr    A population with randomly placed households over a square area, 6.1.1
loc rec    A population with randomly placed households over a rectangular area, 6.1.1
loc agg    A population where households aggregate around several randomly placed focal points, 6.1.1
loc cgr    A population where household density increases towards the centre of the population area, 6.1.1

Spatial Distribution of Target Variable

val rdm    A population where the characteristic of interest is assigned to households with equal probability, 7.1.1
val spk    A population where the characteristic of interest is assigned to small pockets of households, 7.1.1
val lpk    A population where the characteristic of interest is assigned to large pockets of households, 7.1.1
val cgr    A population where the characteristic of interest is more likely to be assigned to households close to the centre of the population area, 7.1.1
val dgr    A population where the characteristic of interest is more likely to be assigned to households close to the southwest corner of the population area, 7.1.1
val hgr    A population where the characteristic of interest is more likely to be assigned to households close to the west edge of the population area, 7.1.1

Sampling Method

nosec k1    An EPI procedure that uses a sector with angle span 2π rad and does not skip neighbours, 6.1.2
api08 k1    An EPI procedure that uses a sector with angle span π/8 rad and does not skip neighbours, 6.1.2
api32 k1    An EPI procedure that uses a sector with angle span π/32 rad and does not skip neighbours, 6.1.2
nosec k3    An EPI procedure that uses a sector with angle span 2π rad and selects every third neighbour for the sample, 6.1.2
api08 k3    An EPI procedure that uses a sector with angle span π/8 rad and selects every third neighbour for the sample, 6.1.2
api32 k3    An EPI procedure that uses a sector with angle span π/32 rad and selects every third neighbour for the sample, 6.1.2

Other

rad    radians, 5.1

Notation

The number following the entries refers to the section in which the notation was introduced. The notation stated here only applies to Chapters 5 and beyond.

Single Cluster Population

i       Index used to label households, 5.1
N       Number of households in a population, 5.1
H       Set of all households in a population, 5.1
x_i     x-coordinate of the location of household i, 5.1
y_i     y-coordinate of the location of household i, 5.1
r_i     Distance of household i from the centre of the population area, 5.1
γ_i     Angle made when moving counterclockwise from the positive x-axis to household i (measured in radians), 5.1
Γ       Set of all household angular coordinates, γ_i (see above), 5.1
d_ij    Distance between household i and household j, 5.2
D       Matrix of distances where entry d_ij represents the distance between households i and j (entries along the main diagonal are 0), 5.2
Z_i     A random variable equal to 1 if household i has the characteristic of interest and 0 otherwise, 7.1.1
z_i     A realized value of the random variable Z_i (see above), 7.1.1
p       Proportion of households in the population with the characteristic of interest, 7.1.1

Sampling

θ             Angle made by moving counterclockwise from the positive x-axis to the centre of a sector; also referred to as the direction of the sector (measured in radians), 5.1
Θ             Set of all sector directions such that the sector associated with θ (see above) has at least one household, 5.1
L_Θ           Length of interval unions in the set Θ, 5.1
α             Angle span of a sector (measured in radians); also referred to as the size of a sector, 5.1
n_sec         Number of households in a sector (depends on the location of households in the population as well as the direction and angle span of the sector), 5.1
l             Index used to label ordered selections, 5.2
H_l           Set of households from the population that were not chosen in the first l − 1 selections, 5.2
U_l           A random variable representing the household chosen at the lth selection, 5.2
u_l           A realized value of the random variable U_l (see above), 5.2
u_l (bold)    A vector listing all the households chosen up to the lth selection in the order that they were selected, 5.2
n_nn          Number of nearest neighbours (depends on the last chosen household and the households from the population that are not already in the sample), 5.2
k             Every kth neighbour along an EPI path is added to the sample, 5.4
S             Set of sampled households, 5.2
n             Number of households sampled from the population, 5.2
π_i           Probability that household i is included in the selected sample; also referred to as the inclusion probability or probability of selection for household i, 5.3
π_ij          Probability that households i and j are both included in the selected sample, 5.3
π (bold)      Matrix of inclusion probabilities where entry π_ij represents the probability that households i and j are both included in the selected sample (entries along the main diagonal represent the inclusion probabilities for individual households), 5.3
w_i           Sampling weight for household i equal to the inverse of its inclusion probability (see π_i), 5.3
H_i           A random variable equal to 1 if household i appears in the sample and 0 otherwise, 7.1.3

Estimation

p̂_EQW    Equally weighted estimator for a proportion, 7.1.3
p̂_HT     Horvitz-Thompson estimator for a proportion, 7.1.3
p̂_HTR    Restricted Horvitz-Thompson estimator for a proportion, 7.1.3

Contents

Abstract
Acknowledgments
Abbreviations
Notation
1 Data Collection for Health Surveys
    1.1 Census vs. Sample
    1.2 Scope of Thesis
2 The Expanded Program on Immunization (EPI) Sampling Method
    2.1 Preliminary Sampling Theory
        2.1.1 Simple Random Sampling
        2.1.2 Two-Stage Cluster Sampling
    2.2 Development and Use of the EPI Sampling Method
        2.2.1 Other Applications
3 Procedures for Estimating a Population Proportion
    3.1 Sample Size Determination
    3.2 Point Estimator
        3.2.1 Expected Value
        3.2.2 Variance
4 Past Simulations of the EPI Method
    4.1 Simulation Design
    4.2 Simulation Results
5 Computation of Household Inclusion Probabilities
    5.1 Probability of Selecting the First Household
    5.2 Probability of a Sample of Households
    5.3 Inclusion Probabilities for Individual Households and Pairs of Households
    5.4 Other Versions of EPI Sampling
    5.5 Additional Notes: Permutations of Household Selections
6 Household Inclusion Probabilities in Simulated Populations
    6.1 Simulation Design
        6.1.1 Generation of Populations
        6.1.2 Sampling Plans
    6.2 Simulation Results
    6.3 Additional Notes: Relations between Inclusion Probabilities
7 Estimator Properties in Simulated Populations
    7.1 Simulation Design
        7.1.1 Generation of Populations
        7.1.2 Sampling Plans
        7.1.3 Estimation of Population Proportion
        7.1.4 Evaluation of Estimators
    7.2 Simulation Results
    7.3 Additional Notes: Variance of the Horvitz-Thompson Estimator
8 Summary, Discussion and Future Directions
    8.1 Summary and Discussion
    8.2 Future Directions
A Tables of Simulation Results
B Partial R Code
    B.1 Packages Used
    B.2 General Parameters
    B.3 Main Functions
        B.3.1 Simulation of EPI Sampling
        B.3.2 Computation of Inclusion Probabilities
        B.3.3 Estimation of Population Proportion
    B.4 Other Functions Created

List of Tables

2.1  Selection of villages using systematic PPS sampling.
4.1  Comparison of simulation designs from past EPI studies.
6.1  Minimum and maximum number of households in non-empty sectors and proportion of values in [0, 2π) rad that correspond to the directions of empty sectors for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.1  Cases of the characteristic of interest added or removed from populations after the initial population generation procedure to attain a certain proportion of households with the characteristic of interest.
7.2  Range of properties for EQW, HT and HTR estimators across 13 500 simulated scenarios.
A.1  Properties of household inclusion probabilities for multiple realizations of a household spatial distribution type.
A.2  Bias of EQW, HT and HTR estimators; n = 7.
A.3  Bias of EQW, HT and HTR estimators; n = 30.
A.4  Variance of EQW, HT and HTR estimators; n = 7.
A.5  Variance of EQW, HT and HTR estimators; n = 30.
A.6  Mean square error of EQW, HT and HTR estimators; n = 7.
A.7  Mean square error of EQW, HT and HTR estimators; n = 30.
A.8  Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators; n = 7.
A.9  Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators; n = 30.

List of Figures

2.1  Illustration of household selection using the EPI method.
5.1  Population of N = 25 households.
5.2  Illustration of a sector.
5.3  Directions corresponding to non-empty sectors.
5.4  Computation of the probability of the first sampled unit.
5.5  Computation of path probabilities conditional on the first sampled unit.
5.6  Illustration of an EPI path with skipped neighbours.
6.1  Spatial distributions of households in the simulation study; N = 150.
6.2  Number of possible household samples that can be drawn from loc reg, loc sqr, loc rec, loc agg, and loc cgr.
6.3  Household inclusion probabilities for loc reg.
6.4  Household inclusion probabilities for loc sqr.
6.5  Household inclusion probabilities for loc rec.
6.6  Household inclusion probabilities for loc agg.
6.7  Household inclusion probabilities for loc cgr.
6.8  Boxplot of household inclusion probabilities for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
6.9  Correlation between household inclusion probability and household distance from the centre of the population area for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.1  Spatial distributions of the target variable in the simulation study; p = 0.50.
7.2  Boxplot of household sampling weights for loc reg, loc sqr, loc rec, loc agg, and loc cgr.
7.3  Histogram of estimator bias across 13 500 simulated scenarios.
7.4  Histogram of estimator variance and design effect across 13 500 simulated scenarios.
7.5  Bias of EQW, HT and HTR estimators.
7.6  Variance of EQW, HT and HTR estimators.
7.7  Design effect of EPI sampling relative to SRS for EQW, HT and HTR estimators.
8.1  Illustration of household selection using alternative sampling methods.

Chapter 1
Data Collection for Health Surveys

Surveys play an important role in the management of public health. As Bostoen et al. (2007) aptly expressed, “health surveys are the stethoscope, thermometer and pressure gauge of global health.” Governments, disease control programs, humanitarian agencies, and local health administrators need to know relevant characteristics of the population they are serving to do an effective job. Without this information, there is no objective basis on which to judge what initiatives should take priority or how much work needs to be done. The absence of reliable information makes it difficult to adequately assess the impact of programs and policies or plan for the future.

1.1 Census vs. Sample

Cross-sectional studies are one way to get a picture of a population’s overall health status.
In this type of study, investigators examine the characteristics of a population at a single point in time (Levy and Lemeshow, 2008). Data from individuals or households may be gathered using a descriptive survey. Examples of health data include body measurements, disease status, living conditions, and feeding practices. These surveys also ask about demographic information. The primary aim here is to measure variables of interest and to summarize the data as means, proportions, or totals, but results may also be used to test hypotheses or check for relationships between variables (Levy and Lemeshow, 2008).

A census is clearly preferred for informational purposes; however, it may be unrealistic to interview all members of the population due to time and budget constraints. When a subset of the population is selected and carefully surveyed, data obtained from this sample can provide useful insights about the population at a fraction of the cost associated with a census.

Nevertheless, conducting such a survey is still a large undertaking. The work that goes into a survey is more than just creating a questionnaire and recruiting participants. The planning stage involves making sure that the population to be sampled matches the target population and deciding what degree of precision is needed for the results. It also involves determining how the sample will be taken and how estimates will be computed afterwards. These choices are not always straightforward. There is also a host of other tasks to consider, from training field workers and performing a pre-test of the survey to preparing the data for analysis (Cochran, 1977). Serious thought must be put into every step of the process to maintain the integrity of the results. After all, the results may be used to guide decisions about where to set up health care facilities, how to combat the spread of disease, and what supplies to distribute during an emergency.

The Expanded Program on Immunization (EPI) sampling method is a technique that the World Health Organization (WHO) has traditionally used for many of its surveys (World Health Organization, 2008). It is particularly well suited for surveys that make contact with respondents by visiting households because sampling is accomplished by having the field staff go from neighbour to neighbour (or the nearest household that has not yet been selected) until the required sample size is reached. Initially, the method was implemented to estimate the immunization levels in a population, but since then, it has been adapted for a variety of uses and implemented for surveys outside of WHO.

1.2 Scope of Thesis

This thesis takes a critical look at the EPI sampling procedure and the associated formulas for estimating a population proportion. Accordingly, we focus on variables that have a binary outcome. Our objective is two-fold: (1) to compile results from past studies about EPI sampling and (2) to advance the current body of research by conducting our own simulations and statistical analyses.

Chapter 2 introduces basic sampling concepts and terminology, which then leads into a detailed description of the EPI method. We discuss the motivations behind its development and how the method is performed in the field.

Chapter 3 contains the technical underpinnings of the EPI method. We begin by giving the rationale for the 30 × 7 sample size. This is followed by the formulas that have been suggested for estimating the population proportion and estimating the variance of this estimator when data is collected through EPI sampling. We describe how the EPI method is different from classical two-stage cluster sampling, and we use this to explain why the formulas currently used for EPI surveys may be biased.

Chapter 4 is a review of the existing literature relating to computer simulations of EPI sampling. We compare the simulation methods used in these studies as well as the findings about the quality of estimation from EPI samples.

In Chapters 5 and 6, we examine the EPI selection procedure from the perspective of inclusion probabilities. We provide an algorithm to compute the exact probability that a household is included in a sample that is drawn from a single cluster. This algorithm is applied to computer-generated populations, and we investigate to what extent units from the same cluster are selected with equal probability. With this information, we try to uncover the factors that cause certain households to have a high or low chance of being selected.

In Chapter 7, we shift the focus back to the estimation of a population proportion. The algorithm presented in Chapter 5 allows us to construct an estimator that weights observations by the inverse of the probability of selection, otherwise known as a Horvitz-Thompson (HT) estimator. Chapter 7 compares the traditional EPI estimator to a weighted estimator for samples at the cluster level. We compute properties such as bias and variance by obtaining the exact distribution of these estimators for a given population.

Finally, we end with Chapter 8, where we summarize the results, address limitations of the study, and give a brief discussion of how the study may be extended.

Chapter 2
The Expanded Program on Immunization (EPI) Sampling Method

2.1 Preliminary Sampling Theory

A proper sample design is central to executing a successful survey. The sample design encompasses the way elements are drawn from the population (known as the sampling plan), as well as the formulas to estimate population parameters (Hansen et al., 1953). The elements of a population may be persons or entire household units. They represent the most basic level at which measurements are recorded.
In other words, it is their characteristics that are being analyzed in the survey. Assuming that the measurements recorded are correct, the only uncertainty in the observations comes from the sampling itself. Naturally, the amount of sampling error will depend on how the units are selected and the sample size. The estimation procedures used after the sample has been extracted are equally important. They must be built around the sampling plan. The ultimate goal is to obtain results from the sample that are representative of the greater population.

2.1.1 Simple Random Sampling

Simple random sampling (SRS) is a classic sampling plan. It involves enumerating all the elements in the population and then choosing the elements according to random numbers. The elements could be selected with or without replacement. In practice, health surveys typically sample without replacement so that once someone is picked, they cannot be picked again. A more defining aspect of SRS is that the sample obtained could be any possible combination of elements from the population and all of these combinations are equally likely to be observed (Lohr, 2009). As a result, each element has the same chance of being sampled as any other element in the population. Because this probability is non-zero and can be calculated, SRS is a probability sample (UNICEF, 2010). Sampling plans that are based on a probability sample are desirable because they allow for a proper statistical analysis of the properties of an estimator.

When SRS is performed, the mean of the observed values in the sample is taken as an estimate of the population mean. It can be shown that this design yields an unbiased estimator (Hansen et al., 1953).
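
For a small population, the unbiasedness of the sample mean under SRS can be checked directly by listing every possible sample. The Python sketch below is only an illustration and is not taken from the thesis: the six-household population and the function name `srs_expected_proportion` are hypothetical. It averages the sample proportion over all equally likely samples drawn without replacement and compares the result with the population proportion.

```python
from fractions import Fraction
from itertools import combinations


def srs_expected_proportion(values, n):
    """Average the sample proportion over every equally likely SRS sample of size n."""
    # Under SRS without replacement, each of the C(N, n) combinations is equally likely.
    samples = list(combinations(values, n))
    total = sum(Fraction(sum(sample), n) for sample in samples)
    return total / len(samples)


# Hypothetical population of N = 6 households; 1 marks the characteristic of interest.
population = [1, 0, 0, 1, 1, 0]
true_p = Fraction(sum(population), len(population))
expected_p = srs_expected_proportion(population, n=3)
print(true_p, expected_p)  # the two values coincide
```

Exact fractions are used so that the comparison is not affected by floating-point rounding. Full enumeration is only practical for very small N and n, because the number of possible samples grows combinatorially.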
This is a favourable property because although the mean of a sample may be different from the population mean, there is at least the assurance that if sampling were repeated on the same population an infinite number of times, the average of the sample means would equal the population mean. Therefore, we would not consistently overestimate the mean or underestimate it. A proportion is just a special case of a mean. Derivation of the expected value and variance of this estimator is possible because the inclusion probability of the sampling units and the inclusion probability of pairs of sampling units are known in this case.

SRS is conceptually easy to understand, but logistical factors make it challenging to apply in the field. The population being studied by most health surveys is spread across a large geographic region. If interviews must be conducted in person, this means that the field worker may have to travel great distances to reach a place where only one or two people have been selected for the survey.

2.1.2 Two-Stage Cluster Sampling

A less costly option is to use cluster sampling. In two-stage cluster sampling, members of the population are split into convenient groupings such as districts, towns or city blocks. The clusters could take any form as long as they cover the whole population and do not overlap (that is, they are mutually exclusive and exhaustive). The procedure begins by taking a sample of clusters. For this reason, the clusters in this design are also referred to as primary sampling units (PSUs). The sampling could be done with probability proportional to size (PPS), which means that larger clusters (as measured in terms of the cluster population) are more likely to be included in the sample than smaller clusters (Lemeshow and Robinson, 1985).
The advantage of this is that if SRS (or some other equal probability sampling technique) is used to select the same number of elements from each of the sampled clusters, the formula for an unbiased estimator remains simple.

The estimator in cluster sampling is typically more variable than the same estimator in regular SRS (Levy and Lemeshow, 2008). To achieve the same precision, additional population elements may have to be surveyed. However, despite the increase in sample size, cluster sampling may still be more economical than SRS because individuals are being sampled in groups.

The problem with performing SRS in the final stage of cluster sampling is that, before it can be done, all eligible subjects in the selected clusters, known as secondary sampling units (SSUs), must be enumerated to construct a sampling frame. Sometimes, records may be unavailable or they may be inaccurate. While they could be updated, there might not be the time and resources to do this. This situation is frequently encountered in developing countries and in communities affected by natural disaster or war. Yet, as Bostoen et al. (2007) emphasized, it is exactly in these settings where there is a great need to do surveys and obtain reliable information. Therefore, there is strong motivation to find alternative within-cluster sampling methods that are quick, affordable, and are capable of producing an estimator with a high degree of accuracy and precision.

2.2 Development and Use of the EPI Sampling Method

In place of SRS, numerous health surveys have opted to use the Expanded Program on Immunization (EPI) sampling method, especially when difficult field conditions prevail. A key feature of the EPI method is that its selection rule is based on the physical distance between the population elements to be sampled. This form of spatial sampling emerged in the public health literature in the 1960s. Henderson et al. (1973) introduced the procedure as a way to collect data in West Africa, where it was hard to obtain complete, up-to-date population registration records. Their aim was to evaluate the impact of a mass vaccination campaign. At the time, the spread of smallpox was a serious concern, so gathering information about the population was an urgent matter. They sampled 67 sites per region and interviewed at least 16 persons per site.

The sites were selected according to PPS. A systematic approach was used. This involved constructing a table with a column identifying the sites, a column of their respective population sizes, and a column for the cumulative population size. A sampling interval was calculated by dividing the total population by the number of sites to be sampled. The first site was determined by generating a random number between 1 and the sampling interval, then checking where the number fell in the cumulative population. The other sites were found by adding the sampling interval successively.

Table 2.1: Selection of villages using systematic probability proportional to size (PPS) sampling.

Village    Population    Cumulative population
1              99        1-99
2             212        100-311        ⇒ selected
3             127        312-438
4             136        439-574        ⇒ selected
5             124        575-698
6              91        699-789
7              87        790-876
8             106        877-982        ⇒ selected
9              82        983-1064
10            108        1065-1172

This illustration represents a smaller version of the example given by Henderson et al. (1973). Here, three out of ten villages are sampled. Since the total population across all the villages is 1172, the sampling interval is 1172/3 ≈ 391. The number 148 was randomly picked among the numbers between 1 and 391. Hence, the villages containing the 148th person, the 148 + 391 = 539th person, and the 539 + 391 = 930th person are selected for the sample.
See Table 2.1 for an example.

The procedure for sampling individuals was the same in each of the selected sites. It consisted of the following steps. First, field workers went to the centre of the selected village or town. They picked a random number between 0 and 359, which represented a direction in degrees (by convention, 0° pointed east, and the direction increased by moving counterclockwise). The team then enumerated all the households in this direction. A starting point among these households was established by picking another random number. After the entire household was surveyed, the team traveled along the original path, moving away from the centre of the site, until they came upon another household. They continued in this manner until they had met their quota of interviewing 16 individuals. If they had reached the edge of the site before completing the quota, they turned clockwise and selected households by moving inward. Everyone in the household containing the 16th person was examined even though the quota was surpassed.

In 1978, the World Health Organization (WHO) adopted a version of this sampling method (Hoshaw-Woodard, 2001). Instructions are provided in the official EPI coverage survey manual (World Health Organization, 2008). Much like the method proposed by Henderson et al. (1973), the WHO version uses systematic PPS to sample clusters of households. It identifies the starting household by either choosing it randomly from a list of households in the cluster or having the field interviewer walk in a random direction away from the centre of the cluster and randomly choose one of the households in their path; the direction can be determined by spinning a pen or a bottle (MacIntyre, 1999). Then neighbouring households are visited to find additional subjects. However, rather than selecting all subsequent households by following the path prescribed by the initial direction, the WHO method takes the next household to be whichever household, not already selected, has a front door closest to the door of the household that was just left (World Health Organization, 2008). See Figure 2.1 for an illustration.

[Figure 2.1: Illustration of household selection using the EPI method. The survey interviewer begins by standing at the centre of the village and choosing a random direction. Among the households in this direction, one is randomly picked. The survey interviewer proceeds to visit this household then goes door-to-door until they have collected data on enough subjects to achieve the desired sample size. The figure marks the village centre with compass directions (E = 0°, N = 90°, W = 180°, S = 270°), the random direction, the EPI path, and the selected households with an eligible child, numbered 1 to 7.]

WHO began using this sampling method primarily for surveys relating to the Expanded Program on Immunization, an action plan to make vaccines available to all children throughout the world (Lemeshow and Robinson, 1985). Due to this association, it became known as the EPI sampling method. It has also been referred to as the 30 × 7 cluster sampling method because of the standard practice to sample 30 clusters and a minimum of seven children per cluster.

By 1982, the EPI method had been performed in at least 441 surveys worldwide (Lemeshow and Robinson, 1985). Ten years later, that number had reached 4502 (Brogan et al., 1994). Examples of surveys which have used the EPI method include those done in The Philippines (Zimicki et al., 1994), The Gambia (Milligan et al., 2004), Ethiopia (Luman et al., 2007), and Niger (Grais et al., 2007). These surveys assessed the proportion of children who received vaccines for diseases such as diphtheria, hepatitis B, measles, meningitis, pertussis, poliomyelitis, and tetanus.

2.2.1 Other Applications

Although the EPI method was developed so that health managers could monitor vaccination coverage levels, it started being used for other purposes. A natural extension was to use EPI sampling to measure the proportion of the population affected by a disease. One of the earliest documented reports looked at diphtheria, measles, pertussis, poliomyelitis, and tetanus in Nepal (Rothenberg et al., 1985). Disease surveys are often concerned with events that happen with lower frequency in the population compared to immunization coverage surveys, and a narrower confidence interval may be required. Cases of disease may also be distributed in pockets, especially if the disease is contagious, such as measles. Because of the way that households are chosen in the EPI method, there is the question of whether the EPI method has a tendency to underestimate or overestimate disease prevalence rates.

Additionally, EPI sampling has been applied in the context of community emergencies. Its use for assessing needs in the aftermath of natural disasters goes back two decades. When Hurricane Andrew hit South Florida, households were sampled according to the EPI method (Hlady et al., 1994). In these types of surveys the focus is on determining what relief operations are required in the affected areas. Some important statistics are the proportion of households without electricity, the proportion without running water, the proportion without enough food, and the proportion with residents requiring medical attention (Hlady et al., 1994). Unlike surveys where subjects may not be found in every visited household, the subject here is the household unit itself. Therefore, the sample obtained from a site may be more geographically concentrated when the EPI procedure is strictly followed. When some parts of the site are hit harder by the disaster than others, resulting estimates could be poor.

Besides rapid-onset natural disasters, community emergencies may be triggered by famine or conflict. There are several accounts of EPI sampling used in these situations. Nutrition surveys which took body measurements of children in Ethiopia were based on the EPI approach (Salama et al., 2001), and so were surveys from The Democratic Republic of Congo (Burnham et al., 2006) and Iraq (Coghlan et al., 2006) which estimated mortality rates due to violent conflict. It was also used in the former Yugoslavia and in Chechnya to assess health care delivery and demands in the mid 1990s when wars ensued in these countries (Legetic et al., 1996; Drysdale et al., 2000). The EPI approach is detailed in the training manuals of various humanitarian organizations, but it is only recommended when SRS is deemed unfeasible (Médecins Sans Frontières, 2006; Centers for Disease Control and Prevention and World Food Programme, 2007; UNICEF, 2010).

The EPI method has the essential attributes of a rapid survey, including low cost and quick feedback of results (MacIntyre, 1999). It has been suggested that the survey could be completed in around five days if four to six teams are employed (Lemeshow and Robinson, 1985). Moreover, the procedure is simple enough that it can be carried out by those with little technical background (Bennett et al., 1994). All of these reasons have led to the popularity of the EPI method.

Chapter 3

Procedures for Estimating a Population Proportion

While EPI sampling is relatively simple to perform, its statistical properties are not easily derived. Complications arise because of the multiple stages of sampling involved and the procedure used to select units at the final stage. Therefore, analysis is typically carried out by borrowing formulas from SRS and two-stage cluster sampling.
3.1 Sample Size Determination

The original goal of the EPI method was to construct a 95% confidence interval for the proportion of vaccinated children such that the margin of error was no more than 10 percentage points (Hoshaw-Woodard, 2001). To achieve this, it has been suggested that 210 children should be surveyed, a number which comes from doubling the sample size required by SRS to allow for clustering (Henderson and Sundaresan, 1982; Lemeshow and Robinson, 1985).

Let $p$ be a population proportion and $\hat{p}$ be a sample proportion. Under SRS, a $100(1-\alpha)\%$ confidence interval for $p$ is given by

\[
\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n_{SRS}}}, \tag{3.1}
\]

where $z_{1-\alpha/2}$ represents the $100(1-\alpha/2)\%$ quantile of the standard normal distribution, and $n_{SRS}$ is the number of children sampled using SRS (Miller and Miller, 2003).¹ By fixing the confidence level $1-\alpha$ and setting the margin of error equal to $b$, we may solve for $n_{SRS}$ as

\[
n_{SRS} = \frac{z_{1-\alpha/2}^{2}\, p(1-p)}{b^{2}}. \tag{3.2}
\]

¹ The confidence interval in Equation (3.1) is estimated by replacing $p$ with $\hat{p}$.

Since the population proportion is unknown, 0.50 is used for $p$ in the calculations. This results in the largest possible value for $n_{SRS}$. If cluster sampling is to be done, and it is expected to have a doubling effect on the variance of $\hat{p}$, then the number of individuals to be sampled ($n_{CLU}$) in order to obtain a 95% confidence interval ($\alpha = 0.05$) for $p$ with a margin of error of $b = 0.10$ is

\[
n_{CLU} = n_{SRS} \times 2 = \frac{1.96^{2}\,(0.50)(0.50)}{0.10^{2}} \times 2 \approx 193.
\]

According to Henderson and Sundaresan (1982), the individuals should come from a sample of at least 30 clusters. They stated this condition so that the normal distribution theory could be reasonably applied. Since $193/30 \approx 6.4$ rounds up to 7, EPI surveys typically interview seven children per cluster for an overall sample size of 210 children.
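The arithmetic behind the 30 × 7 design can be checked with a short Python sketch. The function name `srs_sample_size` and its defaults are illustrative assumptions, and the normal quantile is taken from the standard library rather than rounded to 1.96, which does not change the rounded results.

```python
import math
from statistics import NormalDist


def srs_sample_size(margin, p=0.5, alpha=0.05):
    """Sample size under SRS from Equation (3.2); returned unrounded."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return z**2 * p * (1 - p) / margin**2


n_srs = srs_sample_size(margin=0.10)         # roughly 96 children under SRS
design_effect = 2                            # doubling assumed for clustering
n_clu = math.ceil(design_effect * n_srs)     # rounded up to whole children
per_cluster = math.ceil(n_clu / 30)          # spread over 30 clusters, rounded up
total = 30 * per_cluster

print(n_clu, per_cluster, total)  # 193 7 210
```

Rounding up at each step is what turns the requirement of about 193 children into the conventional total of 210.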
To determine the required sample size for EPI, the SRS sample size was multiplied by 2. This factor is called the design effect (DE). It is defined as the ratio of the variance of an estimator under the chosen sampling plan to the variance of the estimator if SRS had been performed instead (Lumley, 2010). A cluster survey of the vaccination status of children in the United States was found to have a design effect of around 2 (Serfling and Sherman, 1965). Since early uses of EPI sampling were concerned with vaccination rates, a design effect of 2 was used for the purpose of setting the sample size.

If $n$ units are sampled per cluster, the variance of an estimator under cluster sampling is related to the variance of the estimator under SRS in the following way:

\[
\mathrm{Var}_{CLU}(\hat{p}) = \mathrm{Var}_{SRS}(\hat{p}) \times [1 + (n-1)\rho]. \tag{3.3}
\]

Here, $\rho$ represents the rate of homogeneity (ROH) and can be interpreted as a measure of the similarity between units within the same cluster (Kish, 1965). We can then express the design effect as

\[
DE = \frac{\mathrm{Var}_{CLU}(\hat{p})}{\mathrm{Var}_{SRS}(\hat{p})} = 1 + (n-1)\rho. \tag{3.4}
\]

ROH takes on values between $-1/(n-1)$ and 1. Positive values indicate that we are more likely to get the same response from two units sampled from the same cluster compared to two units sampled from different clusters (Bennett et al., 1991). From Equation (3.4) we see that the closer ROH is to 1, the greater the design effect. We also see that the design effect depends on the sample size per cluster. Therefore, the overall sample size requirement for a cluster survey should be computed using a design effect based on the number of units that can be sampled per cluster and past estimates of ROH for the target variable (Bennett et al., 1991).

3.2 Point Estimator

To demonstrate how estimates are calculated from the data from EPI samples, we will look at a simple example.
Suppose we are interested in the proportion of households with a certain characteristic.² We will denote this proportion by $p$. Let

\[
y_{ij} =
\begin{cases}
1 & \text{if the household has the target characteristic;} \\
0 & \text{otherwise}
\end{cases}
\tag{3.5}
\]

for the $j$th household in the $i$th cluster. If the population is divided into $M$ clusters and there are $N_i$ households in the $i$th cluster, then

\[
p = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}}{\sum_{i=1}^{M} N_i}. \tag{3.6}
\]

² One definition of households is "groups of persons sharing meals and residence" (Hlady et al., 1994).

Let $S'$ be a subset of $m$ PSUs and $S''_i$ be a subset of $n_i$ SSUs from the $i$th PSU. For samples obtained through the EPI method, the recommended formula for estimating the population proportion is

\[
\hat{p} = \frac{\sum_{i \in S'} \sum_{j \in S''_i} y_{ij}}{\sum_{i \in S'} n_i} \tag{3.7}
\]

(Lemeshow and Robinson, 1985; Bennett et al., 1991). This estimator corresponds precisely to the proportion of households in the sample with the characteristic of interest. If instead the survey had been about the proportion of vaccinated children in a country, $\hat{p}$ would be given by the ratio of vaccinated children in the sample to the total children in the sample (World Health Organization, 2008). No matter which elements or characteristics are being studied, the sample proportion is taken as the estimate of the population proportion.

Note that when samples from the clusters are all exactly the same size, $n$, the estimator for $p$ can be expressed as

\[
\hat{p} = \frac{\sum_{i \in S'} \sum_{j \in S''_i} y_{ij}}{mn}
= \frac{1}{m} \sum_{i \in S'} \frac{1}{n} \sum_{j \in S''_i} y_{ij} \tag{3.8}
\]

\[
= \frac{1}{m} \sum_{i \in S'} \hat{p}_i. \tag{3.9}
\]

Here, the estimator for the overall population proportion may be interpreted as the average of the sample proportions obtained for individual clusters.

3.2.1 Expected Value

Under an appropriate sampling plan, the estimator in Equation (3.8) is an unbiased estimator of the population proportion.
To prove that $\hat{p}$ is unbiased, we will show that $E(\hat{p}) = p$, but first we must compute the probability that a household is among the selected households for the sample. Let $G_i = 1$ if cluster $i$ is in the sample and 0 otherwise. There are many approaches to selecting clusters with PPS.³ For the procedure described in Table 2.1, which is generally the one suggested for EPI (Henderson et al., 1973; Lemeshow and Robinson, 1985; Bennett et al., 1991), the inclusion probability of a cluster is

\[
P(G_i = 1) = \frac{m N_i}{\sum_{i=1}^{M} N_i}, \tag{3.10}
\]

for $i = 1, 2, \ldots, M$ (Cochran, 1977). This holds as long as $\sum_{i=1}^{M} N_i$ is a multiple of $m$ and $\frac{1}{m}\sum_{i=1}^{M} N_i$ is greater than or equal to the size of the largest cluster in the population.⁴ ⁵

³ Hanif and Brewer (1980) cite 50 ways to perform PPS sampling.

⁴ When $\frac{1}{m}\sum_{i=1}^{M} N_i$ is not a whole number, an alternate systematic PPS sampling procedure can be used where $\sum_{i=1}^{M} N_i$ is taken as the sampling interval and the number of units in each cluster is multiplied by $m$ before computing the cumulative population sizes (Cochran, 1977).

⁵ When there is a cluster such that $\frac{1}{m}\sum_{i=1}^{M} N_i \le N_i$, this cluster will appear in all samples so its inclusion probability is actually $P(G_i = 1) = 1$. Furthermore, in some samples it can be selected more than once. If, for example, a cluster is selected twice, the EPI manual instructs to take two samples from that cluster (World Health Organization, 2008).

Similarly, let $H_{ij} = 1$ if household $j$ from cluster $i$ is included in the sample and 0 otherwise. Assuming that all households from a cluster have the same chance of being observed, then

\[
P(H_{ij} = 1 \mid G_i = 1) = \frac{n}{N_i}, \tag{3.11}
\]

for $j = 1, 2, \ldots, N_i$ (Lohr, 2009). Using the results from Equations (3.10) and (3.11), we determine that the unconditional inclusion probability of household $ij$ is

\[
P(H_{ij} = 1) = P(H_{ij} = 1 \mid G_i = 1)\, P(G_i = 1)
= \frac{n}{N_i} \cdot \frac{m N_i}{\sum_{i=1}^{M} N_i}
= \frac{mn}{\sum_{i=1}^{M} N_i}, \tag{3.12}
\]

for $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, N_i$. Since all units have the same chance of being selected, the sample is said to be self-weighting (Bennett et al., 1991).

Next, we can re-write the estimator in Equation (3.8) in terms of $G_i$ and $H_{ij}$ so that

\[
\hat{p} = \frac{\sum_{i \in S'} \sum_{j \in S''_i} y_{ij}}{mn}
= \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij} H_{ij}}{mn}. \tag{3.13}
\]

This uses a randomization theory approach where response values are viewed as fixed and all uncertainty comes from which units are picked (Lohr, 2009). Therefore,

\[
\begin{aligned}
E(\hat{p}) &= E\!\left( \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij} H_{ij}}{mn} \right) \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}\, E(H_{ij}) \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}\, P(H_{ij} = 1) \\
&= \frac{1}{mn} \sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}\, \frac{mn}{\sum_{i=1}^{M} N_i} \\
&= \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} y_{ij}}{\sum_{i=1}^{M} N_i} \\
&= p
\end{aligned}
\tag{3.14}
\]

as required. However, it is questionable whether this is actually true for EPI since the way it performs selections at the last stage of sampling may not give an equal opportunity for all households to be picked.

3.2.2 Variance

The EPI manual recommends estimating the variance of $\hat{p}$ as

\[
s^2_{\hat{p}} = \left( \frac{m}{\sum_{i \in S'} n_i} \right)^{2}
\frac{\sum_{i \in S'} y_i^2 - 2\hat{p} \sum_{i \in S'} n_i y_i + \hat{p}^2 \sum_{i \in S'} n_i^2}{m(m-1)}, \tag{3.15}
\]

where $y_i = \sum_{j \in S''_i} y_{ij}$ and provided that $m > 1$ (World Health Organization, 2008). When $n_i = n$ for all $i \in S'$, then

\[
\begin{aligned}
s^2_{\hat{p}} &= \left( \frac{m}{\sum_{i \in S'} n} \right)^{2}
\frac{\sum_{i \in S'} y_i^2 - 2\hat{p} \sum_{i \in S'} n y_i + \hat{p}^2 \sum_{i \in S'} n^2}{m(m-1)} \\
&= \frac{m^2}{m^2 n^2} \cdot \frac{\sum_{i \in S'} \left( y_i^2 - 2n\hat{p} y_i + n^2 \hat{p}^2 \right)}{m(m-1)} \\
&= \frac{1}{m(m-1)} \sum_{i \in S'} \left[ \left( \frac{y_i}{n} \right)^{2} - 2\hat{p} \left( \frac{y_i}{n} \right) + \hat{p}^2 \right] \\
&= \frac{\sum_{i \in S'} \left( \hat{p}_i^2 - 2\hat{p}\hat{p}_i + \hat{p}^2 \right)}{m(m-1)} \\
&= \frac{\sum_{i \in S'} (\hat{p}_i - \hat{p})^2}{m(m-1)}
\end{aligned}
\tag{3.16}
\]

(Milligan et al., 2004). From Equation (3.16), we see that the variability of $\hat{p}$ is measured in terms of variability between PSUs. This form of variance estimation may be used even when there are more than two stages of sampling. As long as $\hat{p}_i$ is an unbiased estimator of $p$ and PSUs are sampled independently of each other, then $s^2_{\hat{p}}$ is an unbiased estimator of the true variance of $\hat{p}$. The proof is outlined in Hansen et al. (1953). From Section 3.2.1, we know that the unbiasedness of $\hat{p}_i$ depends on all elements from the cluster having an equal probability of being selected. With EPI, this is not guaranteed. EPI also does not select PSUs independently of each other since the probability of selecting a PSU depends on the PSUs that were selected beforehand. Hence, there may be issues with the performance of the estimator in Equation (3.16) for EPI samples.

Chapter 4

Past Simulations of the EPI Method

Computer simulations are useful because they allow researchers to test a multitude of sampling plans without incurring the costs and large scale operations involved in performing an actual survey. More importantly, true parameter values are known in a controlled environment. Several authors have taken this approach to studying the EPI method. This chapter reviews key papers published over the last 30 years. We describe how the studies were carried out and highlight their findings.

In keeping with the original purpose of EPI sampling, early simulations focused on estimating the immunization coverage level in a population (Henderson and Sundaresan, 1982; Lemeshow et al., 1985). The simulations that came after dealt with other variables such as those relating to morbidity, nutrition, child care, and socioeconomic variables which may have different spatial patterns compared to immunization status (Bennett et al., 1991; Katz et al., 1997; Yoon et al., 1997).

4.1 Simulation Design

Henderson and Sundaresan (1982) and Lemeshow et al. (1985) performed their simulations on artificial data sets. This allowed them to experiment with different population types. Henderson and Sundaresan (1982) analyzed ten populations and Lemeshow et al. (1985) analyzed five populations. One of the parameters that they varied was the proportion of vaccinated children in the population.
This was also varied at the individual cluster level. Cluster vaccination rates ranged from 10% to 99%, while overall population vaccination rates ranged from 17% to 87%.

Lemeshow et al. (1985) created virtual towns on top of a grid. They programmed an algorithm to go through each cell in the grid and either leave it empty or place a household. The decision was based on a random number from a uniform [0, 1) distribution. A household was placed in a cell if the number generated was less than the specified probability of the cell containing a household. The rules were similar for placing a child in a household and assigning a vaccination status to a child. To reflect conditions seen in urban and rural areas, Lemeshow et al. (1985) used various population density levels and spatial patterns for the placement of households and vaccinated children. Pockets of vaccination were established by picking a random household with a child, then assigning that child and all other nearby children as vaccinated.

In contrast, the studies by Bennett et al. (1994), Katz et al. (1997), and Yoon et al. (1997) used real populations for their simulations. The data came from a census survey of 30 randomly selected communities in Uganda (Bennett et al., 1994) and 40 randomly selected communities in Nepal (Katz et al., 1997; Yoon et al., 1997). Therefore, the communities only represented a portion of the real population, but for simulation purposes they were treated as a population of their own. The communities were digitally mapped with the origin centred at the mean or median of the household coordinates.

All of the simulations mentioned above involved repeated sampling from a fixed population. This embodies the Monte Carlo simulation method (Marasinghe, 2009). Every time a sample was taken, the data from the sample were used to estimate the prevalence of an outcome.
The mean and variance of the resulting values across the independent samples then served as an estimate of the expected value and variance of the estimator in question. Since the actual prevalence in the population was known, the bias and mean square error (MSE) could also be calculated. Let $p$ be the overall population prevalence and let $\hat{p}_r$ be the corresponding estimate based on the $r$th sample. If a total of $R$ samples were generated, then

\[
\hat{E}(\hat{p}) = \frac{1}{R} \sum_{r=1}^{R} \hat{p}_r; \tag{4.1}
\]

\[
\widehat{\mathrm{Var}}(\hat{p}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{p}_r - \hat{E}(\hat{p}) \right)^{2}; \tag{4.2}
\]

\[
\widehat{\mathrm{Bias}}(\hat{p}) = \hat{E}(\hat{p}) - p; \tag{4.3}
\]

\[
\widehat{\mathrm{MSE}}(\hat{p}) = \widehat{\mathrm{Var}}(\hat{p}) + \left[ \widehat{\mathrm{Bias}}(\hat{p}) \right]^{2}. \tag{4.4}
\]

Table 4.1: Comparison of simulation designs from past EPI studies.

Simulation study                  Total clusters   Households per       Children per         Children sampled per   Sampling method                          Simulated
                                  or strata        cluster or stratum   cluster or stratum   cluster or stratum                                              samples
Henderson and Sundaresan (1982)   Unspecified      Unspecified          Unspecified          7                      EPI                                      150
Lemeshow et al. (1985)            30               600                  86 (average)         7                      StRS, EPI                                500
Bennett et al. (1994)             30               51-153               86-238               7, 15, 30              StRS, EPI, EPI3, EPI5, QTR, PERI         1000
Katz et al. (1997)                40               31-315               13-284               7, 10, 15, 20, 25      StRS, EPI, EPI2, EPI3, EPI4, EPI5        1000
Yoon et al. (1997)                40               31-315               13-284               7, 10, 15, 20, 25      SRS, StRS, EPI, EPI2, EPI3, EPI4, EPI5   1000

All studies besides Henderson and Sundaresan (1982) sampled from every community in the populations that they were investigating. It is unclear how many clusters were in the populations analyzed by Henderson and Sundaresan (1982), but the authors indicated that samples in their study consisted of 30 clusters. A description of the sampling methods presented in column 6 of the table is given towards the end of Section 4.1.

With the exception of Henderson and Sundaresan (1982), all authors indicated that a single sample in their simulation consisted of a subsample from every community in the population. Since all communities appeared in the final sample, they used a stratified sampling design rather than a cluster sampling design, and the communities should be formally called strata rather than clusters (Lohr, 2009). If the strata have different population sizes and the same number of units are sampled from each stratum, then the sample is not self-weighting. Units from smaller strata have a greater probability of selection compared to units from larger strata. This led to Katz et al. (1997) and Yoon et al. (1997) using an estimator which weighted the sample data from a stratum by the size of the stratum. Bennett et al. (1994) did not do a weighted calculation, and they used the same estimator as the one from Equation (3.7). Because of this, the unweighted average of the prevalences per stratum, $\bar{p}$, was substituted in place of the overall population prevalence, $p$, when bias was calculated.

When the EPI method was simulated in a community, either the starting household was randomly selected from all the households (Lemeshow et al., 1985) or it was randomly selected from the households along a random direction (Bennett et al., 1994; Katz et al., 1997; Yoon et al., 1997). Subsequent selections involved finding the nearest neighbour of the household that was last visited, but there were slight differences in what constituted the nearest neighbour. Katz et al. (1997) and Yoon et al. (1997) noted that they took the closest household to the right unless they were at the edge of the community. The effect of increasing the sample size per community was examined as well as modifications of the EPI method:

• EPIk selects the $k$th nearest neighbour (Bennett et al., 1994; Katz et al., 1997; Yoon et al., 1997);
• QTR divides the community into four quadrants and takes a quarter of the sample from each of the quadrants (Bennett et al., 1994);
• PERI takes half of the sample starting near the centre of the community and the other half near the periphery (Bennett et al., 1994).

Whichever procedure was done in one stratum was done in the rest of the communities in the same sample. The simulations also featured stratified random sampling (StRS) and simple random sampling (SRS). In StRS, simple random sampling is done within each community, while in SRS, simple random sampling is done at the population level. These methods would serve as important benchmarks in assessing the performance of EPI sampling and estimation. Table 4.1 summarizes the various simulation designs.

4.2 Simulation Results

The EPI method appears to produce estimates within 10 percentage points of the prevalence to be estimated with 95% confidence, and it does this regardless of the prevalence and variability across communities, as long as the instances of the outcome are evenly distributed throughout the area of a community (Henderson and Sundaresan, 1982). For populations with other configurations, it did not work as well. When household density was high at the centre of the community, instances of an outcome occurred in a single pocket in the community, vaccination coverage was low, and every community in the population had this property, the EPI method performed very poorly (Lemeshow et al., 1985). Only 31% of the estimates fell in the $p \pm 0.1$ range; the rest of the estimates fell below $p - 0.1$. This represents an extreme scenario. If pocketing is only present in a few communities, simulations have demonstrated that it may be possible to still achieve the stated estimation goals of the EPI method (Lemeshow et al., 1985).
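The Monte Carlo summaries in Equations (4.1) to (4.4), together with the share of estimates falling within $p \pm 0.1$ that the studies above report, can be sketched in Python as follows. This is a hedged illustration only: the function name `monte_carlo_summary` is hypothetical, and the toy population and the use of SRS to generate estimates are assumptions standing in for the EPI simulations of the cited authors.

```python
import random


def monte_carlo_summary(estimates, p, tol=0.10):
    """Summarize R simulated estimates of a known prevalence p."""
    R = len(estimates)
    mean = sum(estimates) / R                          # Equation (4.1)
    var = sum((e - mean) ** 2 for e in estimates) / R  # Equation (4.2), divisor R
    bias = mean - p                                    # Equation (4.3)
    mse = var + bias ** 2                              # Equation (4.4)
    # Share of estimates within tol of the true prevalence.
    within = sum(abs(e - p) <= tol for e in estimates) / R
    return {"mean": mean, "var": var, "bias": bias, "mse": mse, "within": within}


# Toy population (an assumption): 1000 binary outcomes with prevalence 0.30.
rng = random.Random(2016)
population = [1] * 300 + [0] * 700
p_true = sum(population) / len(population)

# Draw R = 500 simple random samples of 210 units and estimate p from each.
estimates = [sum(rng.sample(population, 210)) / 210 for _ in range(500)]
print(monte_carlo_summary(estimates, p_true))
```

Replacing the SRS draw with a simulated EPI walk would give the corresponding summaries for the EPI estimator; that step is not shown because it depends on household coordinates.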
As expected, the EPI method proved to be more biased than StRS. In the extreme scenario described earlier, there was a tendency to underestimate the true proportion. An analysis of variance (ANOVA) for absolute relative bias (ARB),

\[
\widehat{\mathrm{ARB}}(\hat{p}) = \frac{\left| \widehat{\mathrm{Bias}}(\hat{p}) \right|}{p}, \tag{4.5}
\]

revealed that the interaction between the sampling method (StRS, EPI) and presence of pocketing (yes, no) was significant, where EPI and pocketing were associated with larger biases (Lemeshow et al., 1985).

In a different study, the true disease prevalence was consistently overestimated (Katz et al., 1997; Yoon et al., 1997). The diseases studied were diarrhoea and xerophthalmia. Positive bias remained even when sample size was increased and when the distance between selected households was increased. The authors did not have a conclusive explanation for this positive bias. Another study showed that bias was greatest for socioeconomic variables such as possession of cattle and education level of parents, and that this bias became more pronounced when the PERI scheme was used (Bennett et al., 1994). Implementing QTR almost always led to smaller biases compared to EPI, but results were mixed when EPI3 and EPI5 were compared to EPI. Nevertheless, the magnitude of bias seen in these simulations, which was generally less than 2 percentage points, may be considered tolerable for most practical cases.

To analyze the variability of the estimator from EPI (and related methods), the authors looked at DE defined as:

\[
DE(\hat{p}) = \frac{\mathrm{Var}(\hat{p}_1)}{\mathrm{Var}(\hat{p}_2)}. \tag{4.6}
\]

When the denominator was set as the variance of the StRS estimator, DEs were close to 1 with most ranging between 0.8 and 1.2. EPI typically generated larger variances than StRS for socioeconomic variables (indicating a within-community clustering effect); the opposite held for nutrition variables (Bennett et al., 1994).
DEs did not always decrease when sample size or the distance between households increased, but they did when PERI was used instead of EPI (Bennett et al., 1994). DEs calculated relative to SRS were larger than those calculated relative to StRS (Yoon et al., 1997).

MSE was also widely analyzed in the simulations. It is a useful measure because it combines variance and bias. MSE was found to be significantly higher when the EPI method was used in a community that had a pocketing variable pattern as opposed to when StRS was used or when pocketing was absent (Lemeshow et al., 1985). Although MSE decreased when more individuals were sampled from each community (this has the effect of increasing the overall sample size since all communities were included in the sample), the ratio $MSE_{EPI}/MSE_{StRS}$ generally increased (Bennett et al., 1994; Katz et al., 1997). Besides EPI, PERI was the only sampling scheme (within the study of Bennett et al. (1994)) that frequently yielded ratios either under 0.8 or over 1.2. This result was attributed to the combination of low variance and high bias associated with PERI.

Finally, since the EPI method promotes itself as a rapid, inexpensive sampling solution, it is meaningful to see how it compares in this respect against other schemes. Katz et al. (1997) and Yoon et al. (1997) added up the straight-line distances between visited households as a proxy for time and cost. In the case of StRS, households were pre-selected for the sample then visited in an order that would minimize travel.¹ Naturally, when the sample size per community was increased or when the number of households skipped increased, the savings from EPI disappeared, and eventually, it was more costly to implement than StRS. When the distance ratio was multiplied by the MSE ratio, EPI was only better than StRS when the sample size was 7 or 10, and when one or no household was skipped.
The ratio for most of the other combinations was around one to two.

¹ Distance between communities was ignored.

Chapter 5

Computation of Household Inclusion Probabilities

Data obtained from EPI samples are analyzed as though each unit in a given cluster has the same chance of being selected as any other unit in the same cluster. However, as seen in simulations of the EPI method, using an estimator that makes this assumption produced biased results (Bennett et al., 1994; Katz et al., 1997; Yoon et al., 1997). When SRS is performed at the second stage of sampling, all unselected households from a town or village have an equal probability of being picked next. In contrast, choice of the next household in EPI sampling is restricted to the nearest neighbours of the last selected household. On this basis, some authors have speculated that there may be a tendency to move inwards where the density of households is higher (Kok, 1986; Brogan et al., 1994; Luman et al., 2007). Moreover, the manner in which the first household is chosen gives an advantage to households that lie in a direction containing few households. Hence, despite sampling the clusters according to PPS, the overall sample may not actually be self-weighting.

While there have been discussions about implementing weighted calculations, the examples that have been given in past papers apply the weights to the PSU as a whole rather than to the SSUs (Bennett et al., 1991; Brogan et al., 1994). The purpose of weighting in these examples was to adjust for demographics and non-response levels. There have been no further attempts to account for differences in selection probabilities among households that arise from the sampling procedure used by the EPI method.

In the work that follows, we focus on the sampling of households that takes place within a cluster. For convenience, we use a population consisting of a single cluster.
An algorithm is established for computing inclusion probabilities assuming that a response is obtained from every sampled household.

5.1 Probability of Selecting the First Household

Let H = {1, 2, ..., N} be the set of all households in the population. These households may be represented by their Cartesian coordinates (x_i, y_i) ∈ R × R, for i = 1, 2, ..., N, or their polar coordinates (r_i, γ_i) ∈ R+ × [0, 2π), for i = 1, 2, ..., N, where

\[
r_i = \sqrt{x_i^2 + y_i^2}
\]

denotes household i's distance from the origin, and

\[
\gamma_i = \tan^{-1}\!\left(\frac{y_i}{x_i}\right)
\]

denotes the angle of its position in radians (rad) when moving counterclockwise from the positive x-axis (Figure 5.1).¹ We will use Γ to refer to the set of γ_i for i = 1, 2, ..., N. No households are located at the origin.

¹ R denotes the set of real numbers; R+ denotes the set of positive real numbers.

Figure 5.1: A population of N = 25 households. Household locations were randomly generated such that the x-coordinate takes an integer value between -6 and 6 and similarly for the y-coordinate. Household 1 is annotated with its Cartesian coordinates (−4.0, 6.0) and polar coordinates (7.2, 2.2).

Figure 5.2: Illustration of a sector with direction θ = 3π/4 rad and angle span α = π/8 rad (shaded region).

Figure 5.3: In the population above, when the sector size is set to α = π/8 rad and a random direction is picked between 0 rad and 2π rad, there is a 0.21 chance that it will land in an area where there are no households (unshaded regions). Only directions in the set Θ = {(0.3, 1.8] ∪ (1.8, 2.4] ∪ (2.6, 3.3] ∪ (3.4, 4.4] ∪ (4.9, 6.0]} correspond to non-empty sectors (shaded regions). The length of interval unions in the set Θ is L_Θ = 4.9.

To select the first household for the sample, we adopt the method proposed by Henderson et al. (1973) whereby the household is chosen among the households contained in a random sector. This sector will be identified by the parameters (θ, α), which specify the central angle (direction) of the sector and the angle span (size) of the sector. Therefore, the sector is bounded by the angles θ ± α/2 (Figure 5.2).

Let Θ represent the set of all θ's such that the sector associated with θ is nonempty. Although at each spin, any direction between 0 rad and 2π rad may be observed, the result is ignored if the sector contains no households and another spin is made. By ignoring these values of θ, we restrict the sample space to Θ. Only θ ∈ Θ are considered valid outcomes. Since the direction is picked at random, θ is uniformly distributed and its probability density function is

\[
f(\theta) =
\begin{cases}
\dfrac{1}{L_\Theta}, & \theta \in \Theta; \\[4pt]
0, & \text{otherwise.}
\end{cases}
\tag{5.1}
\]

Here, L_Θ is the length of interval unions in the set Θ. It is calculated as

\[
L_\Theta = 2\pi - \sum_{i=1}^{N} \delta_i \, I(\delta_i > 0), \tag{5.2}
\]

where

\[
\delta_i =
\begin{cases}
\gamma_{(i+1)} - \gamma_{(i)} - \alpha, & \text{if } i \in \{1, 2, \ldots, N-1\}; \\
(\gamma_{(1)} + 2\pi) - \gamma_{(N)} - \alpha, & \text{if } i = N;
\end{cases}
\tag{5.3}
\]

I(·) is an indicator function, and γ_(1) ≤ γ_(2) ≤ ... ≤ γ_(N) are the ordered household angles. In most instances where α is not too small, the population is sizable, and the households are evenly dispersed in all directions, L_Θ = 2π.
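Equations (5.2) and (5.3) translate into a few lines of code. The following Python sketch is illustrative only: the function name `valid_direction_length` is hypothetical, and it assumes the household angles are supplied in radians.

```python
import math

def valid_direction_length(angles, alpha):
    """Length L_Theta of the set of directions with a non-empty sector (Eq. 5.2-5.3)."""
    two_pi = 2 * math.pi
    g = sorted(a % two_pi for a in angles)       # ordered angles gamma_(1) <= ... <= gamma_(N)
    n = len(g)
    empty = 0.0
    for i in range(n):
        # The gap after the last household wraps around to the first one.
        nxt = g[i + 1] if i < n - 1 else g[0] + two_pi
        delta = nxt - g[i] - alpha               # delta_i in Eq. (5.3)
        if delta > 0:                            # indicator I(delta_i > 0)
            empty += delta
    return two_pi - empty
```

With four households at 0, π/2, π and 3π/2 rad and α = π/8 rad, each household admits a band of directions of width α, so the function returns 4 × π/8 = π/2.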
When the difference between any two consecutive ordered angles is more than α, then L_Θ < 2π (Figure 5.3).

Let U_1 ∈ H denote a random variable for the first household added to the sample, and u_1 a fixed but arbitrary value of U_1. Then, according to the law of total probability and the definition of conditional probability,

\[
\begin{aligned}
P(U_1 = u_1) &= \int_{\Theta} f(u_1, \theta)\, d\theta \\
&= \int_{\Theta} f(u_1 \mid \theta)\, f(\theta)\, d\theta \\
&= \int_{\Theta} f(u_1 \mid \theta)\, \frac{1}{L_\Theta}\, d\theta \\
&= \frac{1}{L_\Theta} \int_{\Theta} f(u_1 \mid \theta)\, d\theta,
\end{aligned}
\tag{5.4}
\]

where

\[
f(u_1 \mid \theta) =
\begin{cases}
\dfrac{1}{n_{sec}(\Gamma, \theta, \alpha)}, & \text{if household } u_1 \text{ lies in the sector } (\theta, \alpha); \\[4pt]
0, & \text{otherwise;}
\end{cases}
\tag{5.5}
\]

and n_sec(Γ, θ, α) is the number of households in the sector defined by (θ, α). The parameter Γ refers to the set of household angle coordinates {γ_1, γ_2, ..., γ_N}.

We will follow the convention that if a household falls on the initial side of the sector, then it is included in the sector, but if it falls on the terminal side of the sector, then it is not included in the sector. Thus, household i is included in the sector (θ, α) when:

(a) θ ∈ [0, α/2) and γ_i ∈ [θ − α/2 + 2π, 2π) ∪ [0, θ + α/2), or
(b) θ ∈ [α/2, 2π − α/2) and γ_i ∈ [θ − α/2, θ + α/2), or
(c) θ ∈ [2π − α/2, 2π) and γ_i ∈ [θ − α/2, 2π) ∪ [0, θ + α/2 − 2π).

Equivalently, a sector (θ, α) contains household i when:

(a) γ_i ∈ [0, α/2) and θ ∈ (γ_i − α/2 + 2π, 2π) ∪ [0, γ_i + α/2], or
(b) γ_i ∈ [α/2, 2π − α/2) and θ ∈ (γ_i − α/2, γ_i + α/2], or
(c) γ_i ∈ [2π − α/2, 2π) and θ ∈ (γ_i − α/2, 2π) ∪ [0, γ_i + α/2 − 2π].

Since f(u_1 | θ) = 0 for all other values of θ, we only need to integrate 1/n_sec(Γ, θ, α) over a subset of Θ. To avoid evaluating integrals over ranges of θ which may cross the 0 rad (or 2π rad) boundary, the population may be rotated so that we are always integrating between θ = α/2 and θ = 3α/2. This is described in the algorithm below.

Algorithm:

1. Transform the coordinates of households in sector (γ_u1, 2α).
   (a) For 0 ≤ γ_u1 < α,
       i. compute a = γ_u1 − α + 2π and b = γ_u1 + α;
       ii. identify γ_i ∈ [a, 2π) ∪ [0, b);
       iii. compute γ'_i = γ_i − a for γ_i ∈ [a, 2π), and γ'_i = γ_i + (2π − a) for γ_i ∈ [0, b).
   (b) For α ≤ γ_u1 < 2π − α,
       i. compute a = γ_u1 − α and b = γ_u1 + α;
       ii. identify γ_i ∈ [a, b) (Figure 5.4a);
       iii. compute γ'_i = γ_i − a (Figure 5.4b).
   (c) For 2π − α ≤ γ_u1 < 2π,
       i. compute a = γ_u1 − α and b = γ_u1 + α − 2π;
       ii. same as case (a);
       iii. same as case (a).

2. Using the results at the end of the last step, determine when there is a change in the number of households in a sector (Figure 5.4c). Compute

\[
\theta^{*} =
\begin{cases}
\gamma' + \dfrac{\alpha}{2}, & \gamma' < \alpha; \\[4pt]
\gamma' - \dfrac{\alpha}{2}, & \gamma' > \alpha.
\end{cases}
\]

   Construct the set Θ* = {θ*_(1), θ*_(2), ..., θ*_(D)}, where θ*_(1) = α/2, θ*_(2) to θ*_(D−1) are the values obtained from Step 2 (with duplicates omitted), and θ*_(D) = 3α/2. The values in this set are arranged in strictly increasing order so that θ*_(1) < θ*_(2) < ... < θ*_(D).

3. Compute the number of households in sector (θ*_(d), α) for d = 1, 2, ..., D,

\[
n_{sec}(\Gamma', \theta^{*}_{(d)}, \alpha) = \sum_{\gamma' \in \Gamma'} I\!\left(\theta^{*}_{(d)} - \frac{\alpha}{2} \le \gamma' < \theta^{*}_{(d)} + \frac{\alpha}{2}\right),
\]

   then take the inverse (Figure 5.4d).

4. Compute the probability that u_1 is the first household sampled.

\[
\begin{aligned}
P(U_1 = u_1) &= \frac{1}{L_\Theta} \int_{\alpha/2}^{3\alpha/2} \frac{1}{n_{sec}(\Gamma', \theta, \alpha)}\, d\theta \\
&= \frac{1}{L_\Theta} \left( \int_{\theta^{*}_{(1)}}^{\theta^{*}_{(2)}} \frac{1}{n_{sec}(\Gamma', \theta^{*}_{(2)}, \alpha)}\, d\theta + \int_{\theta^{*}_{(2)}}^{\theta^{*}_{(3)}} \frac{1}{n_{sec}(\Gamma', \theta^{*}_{(3)}, \alpha)}\, d\theta + \ldots + \int_{\theta^{*}_{(D-1)}}^{\theta^{*}_{(D)}} \frac{1}{n_{sec}(\Gamma', \theta^{*}_{(D)}, \alpha)}\, d\theta \right) \\
&= \frac{1}{L_\Theta} \left( \frac{\theta^{*}_{(2)} - \theta^{*}_{(1)}}{n_{sec}(\Gamma', \theta^{*}_{(2)}, \alpha)} + \frac{\theta^{*}_{(3)} - \theta^{*}_{(2)}}{n_{sec}(\Gamma', \theta^{*}_{(3)}, \alpha)} + \ldots + \frac{\theta^{*}_{(D)} - \theta^{*}_{(D-1)}}{n_{sec}(\Gamma', \theta^{*}_{(D)}, \alpha)} \right) \\
&= \frac{1}{L_\Theta} \sum_{d=2}^{D} \frac{\theta^{*}_{(d)} - \theta^{*}_{(d-1)}}{n_{sec}(\Gamma', \theta^{*}_{(d)}, \alpha)}.
\end{aligned}
\tag{5.6}
\]

Figure 5.4: Steps involved when computing the probability that household 6 is the first sampled unit (u_1 = 6). Panels: (a) identify households near household 6; (b) transform the household coordinates; (c) find the directions when the number of households in the sector changes; (d) compute the probability of selecting household 6 for boundary values of θ. In this example, the angle span of the sector is set to α = π/8 rad. From Figure 5.3, we know that the length of interval unions in the set Θ is L_Θ = 4.9. Therefore, using the formula in Equation (5.6), we obtain P(U_1 = 6) = 0.05. γ denotes the angular coordinate of a household; θ denotes a sector direction; n_sec denotes the number of households in a sector.

The idea is to partition the directions for which household u_1 has a chance of being selected into intervals where every sector that can be constructed from these directions covers the same households. One can imagine beginning with a sector centered at the angle θ = α/2, then rotating this sector counterclockwise to θ = 3α/2, and noting when a household point is added or removed. The algorithm applies for any unit in the population.

5.2 Probability of a Sample of Households

In EPI sampling, the second household sampled is the household that is physically closest to the first household sampled, the third household sampled is the household physically closest to the second household sampled (excluding the first sampled household), and so on. Once the starting household has been chosen, every household sampled afterwards is determined by finding the nearest neighbour of the last household visited. Since sampling is done without replacement, the next household added to the sample is identified among the group of households in the population that has not yet been selected.
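For reference, the first-household probability that starts every such path (Equation (5.6) in Section 5.1) could be computed along the following lines. This is a minimal Python sketch under stated assumptions, not the program used in this work: the function names are hypothetical, L_Θ is passed in as a known value, and instead of rotating the population the code works with angles measured relative to γ_u1 and evaluates n_sec at the midpoint of each interval between breakpoints. Because n_sec is constant between breakpoints, this gives the same piecewise sum.

```python
import math

TWO_PI = 2 * math.pi

def count_in_sector(angles, theta, alpha):
    """Households with angle in [theta - alpha/2, theta + alpha/2), wrapping at 2*pi."""
    lo = theta - alpha / 2
    return sum(1 for g in angles if (g - lo) % TWO_PI < alpha)

def first_household_probability(angles, u, alpha, l_theta):
    """P(U_1 = u): integrate 1/n_sec over the directions whose sector contains u."""
    if alpha >= TWO_PI:                    # sector covers everything: every start equally likely
        return 1.0 / len(angles)
    start = angles[u] - alpha / 2          # directions start .. start + alpha contain u
    cuts = {0.0, alpha}
    for g in angles:
        d = (g - angles[u]) % TWO_PI       # angle of the household relative to u
        # A household enters the sector at offset d and leaves at d - 2*pi + alpha.
        for t in (d, d - TWO_PI + alpha):
            if 0.0 < t < alpha:
                cuts.add(t)
    cuts = sorted(cuts)
    total = 0.0
    for t0, t1 in zip(cuts, cuts[1:]):
        mid = start + (t0 + t1) / 2        # n_sec is constant on (t0, t1)
        total += (t1 - t0) / count_in_sector(angles, mid, alpha)
    return total / l_theta
```

As a small check, for households at angles 0, 0.1 and 3.0 rad with α = 0.4 rad, the valid directions have total length L_Θ = 0.9, and the three probabilities 0.25/0.9, 0.25/0.9 and 0.4/0.9 sum to one.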
When there are multiple households tied for the nearest neighbour, one of the households is chosen randomly. This process of going from neighbour to neighbour ends when the desired sample size is reached.

Let the vector u = (u_1, u_2, ..., u_n) specify the units picked for a sample of size n, and the order in which they are picked. We make the distinction between a sample that selects household 1 then 3 then 4 and a sample that selects 4 then 3 then 1 because we will compute their probabilities separately. In general, the probability of an EPI path u with n units is given by

\[
\begin{aligned}
P(\mathbf{u}) &= P((u_1, u_2, \ldots, u_n)) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P((u_1, u_2, \ldots, u_{n-1})) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P(u_{n-1} \mid (u_1, u_2, \ldots, u_{n-2})) \times P((u_1, u_2, \ldots, u_{n-2})) \\
&= P(u_n \mid (u_1, u_2, \ldots, u_{n-1})) \times P(u_{n-1} \mid (u_1, u_2, \ldots, u_{n-2})) \times \ldots \times P(u_2 \mid u_1) \times P(u_1) \\
&= \left[ \prod_{l=2}^{n} P(u_l \mid (u_1, u_2, \ldots, u_{l-1})) \right] \times P(u_1) \\
&= P((u_1, u_2, \ldots, u_n) \mid u_1) \times P(u_1).
\end{aligned}
\tag{5.7}
\]

Figure 5.5: Examples of EPI paths of length n = 3 for the population in Figure 5.1. Panels: (a) all paths starting at household 1, with P((1, 3, 4) | u_1 = 1) = 1; (b) all paths starting at household 13, with P((13, 10, 9) | u_1 = 13) = 1/5, P((13, 11, 16) | u_1 = 13) = 1/5, P((13, 16, 11) | u_1 = 13) = 1/10, P((13, 16, 15) | u_1 = 13) = 1/10, P((13, 17, 18) | u_1 = 13) = 1/5 and P((13, 19, 21) | u_1 = 13) = 1/5; (c) all paths starting at household 18, with P((18, 17, 19) | u_1 = 18) = 1/4, P((18, 17, 20) | u_1 = 18) = 1/4 and P((18, 20, 22) | u_1 = 18) = 1/2.

It is computed as the product of conditional probabilities because each new selection depends on all the households selected beforehand.
Let H_l be the set of households from the population which were not chosen in the first l − 1 selections. Symbolically, H_l = {i ∈ H : i ∉ {u_1, u_2, ..., u_{l−1}}}. Because of the way EPI sampling is carried out, only households in H_l with the shortest distance to u_{l−1} have a non-zero chance of being selected. These households are the solution to

\[
\arg\min_{i \in H_l} d_{i, u_{l-1}} = \arg\min_{i \in H_l} \sqrt{(x_{u_{l-1}} - x_i)^2 + (y_{u_{l-1}} - y_i)^2}. \tag{5.8}
\]

If n_nn(u_{l−1}, H_l) represents the number of equidistant households closest to the household u_{l−1} which have not already been visited, then

\[
P(u_l \mid (u_1, u_2, \ldots, u_{l-1})) =
\begin{cases}
\dfrac{1}{n_{nn}(u_{l-1}, H_l)}, & \text{if } u_l \in \arg\min\limits_{i \in H_l} d_{i, u_{l-1}}; \\[4pt]
0, & \text{otherwise;}
\end{cases}
\tag{5.9}
\]

for l = 2, 3, ..., n (Figure 5.5). Note that if, at every selection after the first unit has been selected, there is only one household that can be chosen, then P(u_l | (u_1, u_2, ..., u_{l−1})) = 1 for l = 2, 3, ..., n, and P((u_1, u_2, ..., u_n)) reduces to P(u_1). The probability P(u_1) is computed separately since the first unit of the sample is selected according to a different process, as indicated in Section 5.1.

The program that we developed simultaneously returns all possible EPI paths from a given population as well as the exact probability that a particular path is realized.

Algorithm:

1. Use the data on the household locations to compute an N × N matrix D, where the element d_ij in row i, column j represents the distance from household i to household j,

\[
D =
\begin{pmatrix}
0 & d_{12} & \cdots & d_{1N} \\
d_{21} & 0 & \cdots & d_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
d_{N1} & d_{N2} & \cdots & 0
\end{pmatrix}.
\tag{5.10}
\]

   Hence, D is a distance matrix containing the distance between every pair of households.

2. Apply the algorithm in Section 5.1 to get the probability that household 1 is the first selected household. Do the same for the rest of the households in the population so that P(U_1 = i) is known for i = 1, 2, ..., N.

3. Construct EPI paths and compute their probabilities.
   i. EPI paths of length 1. Produce a list of N vectors where each vector has the identification number (id) of one household from the population. Therefore, the first vector contains the id of household 1, the second vector contains the id of household 2, and so forth. The probabilities associated with these paths are the probabilities from Step 2.
   ii. EPI paths of length 2. Add a second household to each of the vectors in step i. The first vector in the list has household 1 and only household 1. To determine which household should be added to this vector, go to the first row of D and search where the minimum value occurs in columns 2 to N, as this indicates the nearest neighbour of household 1. Create separate vectors if the minimum appears multiple times. The probability of the resulting path(s) is obtained by dividing the old path probability by the number of paths that were generated. Repeat for the remaining vectors in the original list. For vector i, this means searching for the minimum value in row i when column i is skipped.
   iii. EPI paths of length 3. Take each vector in step ii, identify the last unit in the vector, go to the matching row in D, and search for the minimum value. Ignore values in columns that correspond to units which are already in the sample. Update the vectors to include the new addition, and compute the path probabilities in the same way as in step ii.
   iv. Continue adding one household at a time to every vector until all vectors have n elements. Update path probabilities at each iteration.

At the end of this procedure, an optional step that we performed was finding the paths that yield the same sample of households. Samples, denoted by S, represent an unordered subset of households from the population. The probability of a sample is the sum of path probabilities for paths that include the same units as the sample, regardless of order.

5.3 Inclusion Probabilities for Individual Households and Pairs of Households

The inclusion probability for household i is defined as the probability that household i is among the units drawn in a sample. When all possible samples are known along with their probabilities, the inclusion probability for household i may be calculated as

\[
\pi_i = \sum_{S:\, i \in S} P(S). \tag{5.11}
\]

The joint inclusion probability for households i and j (i ≠ j) refers to the probability that households i and j are sampled together. Therefore,

\[
\pi_{ij} = \sum_{S:\, i, j \in S} P(S). \tag{5.12}
\]

To facilitate the calculation of inclusion probabilities, we can use matrices. Let I_qi = 1 if household i is in sample S_q and 0 otherwise. Suppose there are a total of Q possible samples from a population of N households. Then,

\[
\begin{aligned}
\boldsymbol{\pi} &=
\begin{pmatrix}
I_{11} & I_{12} & \cdots & I_{1N} \\
I_{21} & I_{22} & \cdots & I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
I_{Q1} & I_{Q2} & \cdots & I_{QN}
\end{pmatrix}^{T}
\begin{pmatrix}
P(S_1) & 0 & \cdots & 0 \\
0 & P(S_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P(S_Q)
\end{pmatrix}
\begin{pmatrix}
I_{11} & I_{12} & \cdots & I_{1N} \\
I_{21} & I_{22} & \cdots & I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
I_{Q1} & I_{Q2} & \cdots & I_{QN}
\end{pmatrix} \\[6pt]
&=
\begin{pmatrix}
I_{11} & I_{21} & \cdots & I_{Q1} \\
I_{12} & I_{22} & \cdots & I_{Q2} \\
\vdots & \vdots & \ddots & \vdots \\
I_{1N} & I_{2N} & \cdots & I_{QN}
\end{pmatrix}
\begin{pmatrix}
P(S_1) I_{11} & P(S_1) I_{12} & \cdots & P(S_1) I_{1N} \\
P(S_2) I_{21} & P(S_2) I_{22} & \cdots & P(S_2) I_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
P(S_Q) I_{Q1} & P(S_Q) I_{Q2} & \cdots & P(S_Q) I_{QN}
\end{pmatrix} \\[6pt]
&=
\begin{pmatrix}
\sum_{q=1}^{Q} P(S_q) I_{q1}^2 & \sum_{q=1}^{Q} P(S_q) I_{q1} I_{q2} & \cdots & \sum_{q=1}^{Q} P(S_q) I_{q1} I_{qN} \\
\sum_{q=1}^{Q} P(S_q) I_{q2} I_{q1} & \sum_{q=1}^{Q} P(S_q) I_{q2}^2 & \cdots & \sum_{q=1}^{Q} P(S_q) I_{q2} I_{qN} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{q=1}^{Q} P(S_q) I_{qN} I_{q1} & \sum_{q=1}^{Q} P(S_q) I_{qN} I_{q2} & \cdots & \sum_{q=1}^{Q} P(S_q) I_{qN}^2
\end{pmatrix} \\[6pt]
&=
\begin{pmatrix}
\pi_1 & \pi_{12} & \pi_{13} & \cdots & \pi_{1N} \\
\pi_{21} & \pi_2 & \pi_{23} & \cdots & \pi_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\pi_{N1} & \pi_{N2} & \pi_{N3} & \cdots & \pi_N
\end{pmatrix},
\end{aligned}
\tag{5.13}
\]

where the elements on the diagonal are inclusion probabilities for individual units, and the elements off the diagonal are inclusion probabilities for pairs of units.
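The path enumeration of Section 5.2 and the aggregation in Equations (5.11) to (5.13) can be sketched together. The Python below is an illustration under assumptions, not the program developed for this work: the function names are hypothetical, ties for nearest neighbour are detected with a small numerical tolerance `tol`, households are indexed from 0, and the starting probabilities P(U_1 = i) are taken as given.

```python
import math

def enumerate_epi_paths(coords, start_probs, n, tol=1e-9):
    """Return every EPI path of length n with its probability (Eq. 5.7 and 5.9)."""
    paths = [((i,), p) for i, p in enumerate(start_probs) if p > 0]
    for _ in range(n - 1):
        extended = []
        for path, prob in paths:
            last = coords[path[-1]]
            unvisited = [j for j in range(len(coords)) if j not in path]
            dists = {j: math.dist(last, coords[j]) for j in unvisited}
            d_min = min(dists.values())
            # All unvisited households tied for nearest neighbour (Eq. 5.8).
            nearest = [j for j in unvisited if dists[j] - d_min <= tol]
            for j in nearest:
                extended.append((path + (j,), prob / len(nearest)))
        paths = extended
    return paths

def inclusion_probabilities(paths, n_households):
    """Matrix with pi_i on the diagonal and pi_ij off the diagonal (Eq. 5.13)."""
    samples = {}
    for path, prob in paths:               # paths with the same units form one sample
        key = frozenset(path)
        samples[key] = samples.get(key, 0.0) + prob
    pi = [[0.0] * n_households for _ in range(n_households)]
    for members, prob in samples.items():
        for i in members:
            for j in members:
                pi[i][j] += prob           # sum of P(S_q) * I_qi * I_qj over samples
    return pi
```

For four households on a line at x = 0, 1, 3 and 6, equal starting probabilities and n = 2, the possible samples are {0, 1} with probability 0.5, {1, 2} with probability 0.25 and {2, 3} with probability 0.25, so the diagonal of the returned matrix is (0.5, 0.75, 0.5, 0.25) and sums to n.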
The inclusion probabilities in π can also be estimated through a Monte Carlo simulation approach where the EPI selection procedure is simulated repeatedly to produce independent samples. This is done on a fixed population and for a fixed sample size. The inclusion probability of household i is the proportion of samples in which household i is observed, and the joint inclusion probability of households i and j is the proportion of samples in which both households i and j are observed.² The same steps apply for other sampling plans.

² Simulation runs where no households are selected because the initial sector is empty are discarded prior to the computation of inclusion probabilities so that they are not counted.

5.4 Other Versions of EPI Sampling

To compute household inclusion probabilities for an EPI based sampling procedure where the first unit of the sample is selected by SRS, and another EPI based procedure where an interviewer surveys every kth household in their path (Figure 5.6), only minor adjustments are required to the framework we established in the previous sections. In the first case, we set P(U_1 = u_1) = 1/N for u_1 = 1, 2, ..., N since all households in the population have an equal probability of being chosen as the starting household. In the second case, we compute path probabilities for a sample of size (n − 1)k + 1, then drop the households that would be skipped in the actual survey process before aggregating the probabilities as in Section 5.3.

Figure 5.6: Illustration of an EPI path with skipped neighbours. (a) In this example, every third neighbour is added to the sample (k = 3) and a sample of three households is desired (n = 3). One possible sample consists of households 1, 2, and 9. (b) For the path shown in (a) there is only one household to move to at every selection other than the third selection. At the third selection, the path diverges in two, leading to either (1, 2, 9) or (1, 2, 6), with P((1, 2, 9) | u_1 = 1) = 1/2 and P((1, 2, 6) | u_1 = 1) = 1/2. Choosing to move to household 5 instead of household 7 from household 2 means that the probability of this path conditional on starting at household 1 is equal to 0.5. This also represents the conditional probability of selecting households 1, 2, and 9 when every third neighbour is added to the sample.

5.5 Additional Notes: Permutations of Household Selections

Under SRS, there are

\[
N \times (N-1) \times \ldots \times (N-n+1) = \frac{N!}{(N-n)!} \tag{5.14}
\]

sequences (permutations) for the order in which households are visited (EPI paths). It is clear that as the population size N increases, the number of sequences grows quickly. On the other hand, under EPI sampling, far fewer sequences are generated. For instance, when there is always only one neighbour to move to from each selection, the entire path is determined by the choice of the first household, and the number of possible paths that the survey interviewer can take is exactly equal to N. As long as there are not too many ties in the selection process, the task of enumerating all EPI paths remains manageable, and inclusion probabilities can be calculated from basic sampling principles.

Chapter 6

Household Inclusion Probabilities in Simulated Populations

One of the reasons for developing the algorithms in Chapter 5 is so that we can use the probabilities that households are included in a sample to objectively assess the way selections are made under the EPI method. In this chapter, we generate populations and compute inclusion probabilities for elements in these populations when various EPI sampling plans are simulated.
As before, we focus on what happens in the stage of sampling where households are picked. From the results, we compare how the sampling plans are different from each other and from one based on SRS. We also explore the sensitivity of inclusion probabilities to the spatial distribution of the households in a cluster.

6.1 Simulation Design

6.1.1 Generation of Populations

In our study, we considered five types of household spatial distributions (settlement layouts). Each settlement had a population size of N = 150 households. The decision to create populations containing 150 households was based on reports of the average enumeration area size in recent surveys for countries such as South Africa and Bangladesh (Housing Development Agency, 2012; National Institute of Population Research and Training et al., 2013).¹ Households in the populations that we generated were dispersed over an area of 1000 × 1000 units. The centre of the cluster was designated as the origin (0, 0). The procedure for placing the households in each population was tailored to achieve a certain spatial pattern:

1. Regular pattern (loc_reg). Households were placed in 6 regularly spaced rows and 25 regularly spaced columns. The distance between a household and an adjacent household to the left or right was 41 units, and the distance between a household and an adjacent household directly above or below was 166 units.

2. Random pattern over a square area (loc_sqr). Households were randomly distributed over the entire area. This was done by independently generating two uniform random variables between −500 and 500 for a household's x- and y-coordinates.

3. Random pattern over a rectangular area (loc_rec). Same as loc_sqr except the y-coordinate was a uniform random variable between −250 and 250.

4. Aggregated pattern (loc_agg). Households were organized into 15 groups of 10. Those belonging to the same group were dispersed around a common focal point (x*_g, y*_g). Household locations were determined by

\[
x_i = x^{*}_g + r_i \times \cos(\phi_i), \tag{6.1}
\]
\[
y_i = y^{*}_g + r_i \times \sin(\phi_i), \tag{6.2}
\]

   where x*_g and y*_g are uniform random variables between −500 and 500, r_i is an exponential random variable with rate parameter λ = 75, and φ_i is a uniform random variable between 0 and 2π (Bolker, 2008). All variables were generated independently of each other.

5. Circular gradient pattern (loc_cgr). Concentration of households gradually decreased from the centre to the edge of the cluster. Household coordinates were generated using Equations (6.1) and (6.2) with the focal point set to the origin (x*_g, y*_g) = (0, 0) and the rate parameter for determining the distance of a household from the origin set to λ = 175.

¹ The term enumeration area refers to the smallest geographic unit into which a country is divided by the census office (Statistics Canada, 2015).

Figure 6.1: Five spatial distributions of N = 150 households. These populations are used for the simulation studies discussed in Chapters 6 and 7. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households.

The regular pattern may be thought of as a planned neighbourhood development. On the other hand, the random pattern could represent low density areas such as rural communities or high density areas such as shanty towns since dwellings may be scattered in an unorganized manner. For EPI, it is the relative distance between households that is important rather than the absolute distance. Clusters with a circular gradient pattern are meant to resemble a small village or town built around a core area.
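The random, aggregated and circular gradient layouts described above could be generated along the following lines. This Python sketch is illustrative only and rests on assumptions that the text does not confirm: the function names are hypothetical; λ is read as the mean distance from the focal point (so the rate passed to the generator is 1/λ), because a literal rate of 75 would place every household within a small fraction of a unit of its focal point; and a household that lands outside the cluster area is simply redrawn from the same mechanism.

```python
import math
import random

HALF = 500.0  # cluster area spans [-500, 500] on each axis

def uniform_layout(n, rng, half_x=HALF, half_y=HALF):
    """loc_sqr by default; loc_rec with half_y=250. Independent uniform coordinates."""
    return [(rng.uniform(-half_x, half_x), rng.uniform(-half_y, half_y))
            for _ in range(n)]

def scatter(focus, n, mean_dist, rng):
    """Place n households around a focal point using Equations (6.1) and (6.2)."""
    points = []
    while len(points) < n:
        r = rng.expovariate(1.0 / mean_dist)    # ASSUMPTION: lambda read as the mean distance
        phi = rng.uniform(0.0, 2.0 * math.pi)
        x = focus[0] + r * math.cos(phi)
        y = focus[1] + r * math.sin(phi)
        if abs(x) <= HALF and abs(y) <= HALF:   # ASSUMPTION: redraw points outside the area
            points.append((x, y))
    return points

def aggregated_layout(rng, groups=15, per_group=10, mean_dist=75.0):
    """loc_agg: groups of households around uniformly placed focal points."""
    points = []
    for _ in range(groups):
        focus = (rng.uniform(-HALF, HALF), rng.uniform(-HALF, HALF))
        points.extend(scatter(focus, per_group, mean_dist, rng))
    return points

def circular_gradient_layout(rng, n=150, mean_dist=175.0):
    """loc_cgr: a single focal point at the origin."""
    return scatter((0.0, 0.0), n, mean_dist, rng)
```

Passing a seeded generator, for example `aggregated_layout(random.Random(1))`, makes a realization reproducible; each call returns a list of 150 (x, y) pairs inside the cluster area.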
Finally, the aggregated pattern depicts a situation where households are grouped into compounds, as in some of the places where EPI has been performed (Rose et al., 2006; Grais et al., 2007), or a mixed-use zone containing residential properties as well as schools, medical facilities, places of worship, offices, retail stores, etc. (City of Ottawa, 2012).

The populations loc_reg, loc_sqr, loc_rec, loc_agg, and loc_cgr represent one realization of the spatial patterns specified above, and are depicted in Figure 6.1. In loc_agg, four households fell outside the boundaries of the cluster area, and in loc_cgr, two households fell outside the boundaries of the cluster area. These households were removed and then given new random coordinates inside the cluster.

6.1.2 Sampling Plans

We tested six variations of the EPI sampling procedure. They were characterized by the size of the sector used to select the first household (α = 2π, π/8, π/32 rad) and the number of neighbours skipped (k − 1 = 0, 2, or equivalently, k = 1, 3) for finding subsequent units. This led to the following combinations:

1. α = 2π rad, k = 1 (nosec k1)
2. α = π/8 rad, k = 1 (api8 k1)
3. α = π/32 rad, k = 1 (api32 k1)
4. α = 2π rad, k = 3 (nosec k3)
5. α = π/8 rad, k = 3 (api8 k3)
6. α = π/32 rad, k = 3 (api32 k3)

Each method was performed with a sample size of n = 7, 15, 30, giving a total of 6 × 3 = 18 sampling configurations.

Table 6.1: Minimum and maximum number of households in non-empty sectors (min(n_sec), max(n_sec)) and proportion of values in [0, 2π) rad that correspond to the directions of empty sectors (1 − L_Θ/2π) for loc_reg, loc_sqr, loc_rec, loc_agg, and loc_cgr (populations in Figure 6.1). Results are computed for a sector with angle span α = π/8 rad and α = π/32 rad.

                       α = π/8 rad                                   α = π/32 rad
Spatial distribution   min(n_sec)  max(n_sec)  L_Θ    1 − L_Θ/2π    min(n_sec)  max(n_sec)  L_Θ    1 − L_Θ/2π
of households
loc_reg                4           14          6.28   0.00          1           6           5.81   0.08
loc_sqr                4           19          6.28   0.00          1           9           5.82   0.07
loc_rec                1           23          6.28   0.00          1           9           5.48   0.13
loc_agg                1           31          6.14   0.02          1           17          4.61   0.27
loc_cgr                3           15          6.28   0.00          1           9           5.77   0.08

See pages viii-x for the meaning of abbreviations used for the spatial distribution of households. See Equation (5.2) for the definition of L_Θ.

Figure 6.2: Number of possible EPI samples that can be drawn from loc_reg, loc_sqr, loc_rec, loc_agg, and loc_cgr (populations in Figure 6.1) when no neighbours are skipped (k = 1) and when every third neighbour is selected (k = 3). The vertical axis shows the number of distinct EPI samples and the horizontal axis the sample size (n). The term sample refers to an unordered set of n households selected from a population of size N. In comparison, if SRS had been used, the number of possible samples is always equal to N!/((N − n)! n!). For a population of 150 households, this means that there are 2.9 × 10^11 SRS samples of size 7, 1.6 × 10^20 SRS samples of size 15 and 3.2 × 10^31 SRS samples of size 30. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households.

6.2 Simulation Results

For the populations in the study, using a sector size of α = π/8 rad meant it was rare, if not impossible, to end up in a sector where there were no households (Table 6.1). For loc_reg, loc_sqr, and loc_cgr, a sector could point in any direction, and it would have at least 3-4 households. Naturally, when sector size was reduced to α = π/32 rad, sectors did not contain as many households. The number of households in a sector varied the most for loc_agg.
It also required a greater restriction on the values of θ to avoid getting an empty sector when performing the procedure to select the first household.

Running the calculations for the inclusion probabilities took considerably longer in loc reg compared to the other populations. Due to the regular spacing of households within the population, a household would often have multiple nearest neighbours. There were more samples to list and thus more samples to search through during the step where the probabilities of samples containing a certain household were added to get the overall probability of selecting that household. The plot in Figure 6.2 shows the number of household combinations that can be observed in each population when EPI is applied. Besides illustrating that more combinations of households are possible in loc reg, it also shows that more combinations of households are possible when every third neighbour is sampled as opposed to when no neighbours are skipped. However, that number is still nowhere near the number of samples that can be seen when SRS is performed.

To visualize where high and low inclusion probabilities occurred in the settlements, we constructed bubble plots. A selection of these plots is shown in Figures 6.3-6.7. The size (area) of a point at location (x, y) reflects the inclusion probability of the household at those coordinates. We joined pairs of households that have a chance of being selected together (πij > 0) to get a sense of the geographic scope of the possible samples.

(a) α = 2π rad, k = 1, n = 7. (b) α = 2π rad, k = 1, n = 30. (c) α = π/32 rad, k = 1, n = 7. (d) α = 2π rad, k = 3, n = 7.
Figure 6.3: Household inclusion probabilities for a population with regularly spaced households (loc reg). The size (area) of a point is proportional to the inclusion probability (πi) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.

Symmetric spatial patterns of the selection probabilities emerged for loc reg, where households were arranged in a grid formation. When the first household was chosen randomly among all households (α = 2π rad), no neighbours were skipped in later selections (k = 1), and sample size was set to n = 7, then within each row of households, inclusion probabilities increased from the unit at the end to the seventh unit from the end (Figure 6.3a). Inclusion probabilities for the middle 11 households were constant and were smaller than the inclusion probability of the seventh unit from the end. For samples that started between the seventh unit from each end of a row, movements made to obtain the full sample were restricted to that row. As sample size increased to n = 15 and n = 30, the position of a household in a row became less significant. What was important was the row that a household belonged to. From Figure 6.3b, we see that inclusion probabilities are smallest for households in the first row and last row while inclusion probabilities are largest for households in the second row and second last row.

Selecting the first unit of the sample from a random sector altered the distribution of selection probabilities. The scenarios depicted in Figures 6.3a and 6.3c are identical except that, in the latter, a random sector with angle span α = π/32 rad initiated the sampling procedure. While we still observe a general increase in the inclusion probabilities as we move from the end of a row to the seventh last unit, the change does not happen at the same rate as in the previous case. Another difference is that there is now variability in the inclusion probabilities of the middle 11 households. Similar remarks apply when α = π/8 rad.

(a) α = 2π rad, k = 1, n = 7. (b) α = 2π rad, k = 1, n = 30.
Figure 6.4: Household inclusion probabilities for a population with a random settlement pattern over a square (loc sqr). The size (area) of a point is proportional to the inclusion probability (πi) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.

Whether households were skipped in taking the sample had a greater impact on inclusion probabilities than whether a sector was used to identify the starting point. In loc reg, adding every third neighbour to the sample led to a periodic behaviour in the selection probabilities of households. When the sample size was n = 7, the probability would spike at the fourth household from the end of a row and every third household after it (Figure 6.3d). This held regardless of how the first unit of the sample was picked. While the periodic behaviour remained even when larger samples were taken, the spikes in the probabilities were diminished.

(a) α = 2π rad, k = 1, n = 7. (b) α = π/8 rad, k = 1, n = 7.
Figure 6.5: Household inclusion probabilities for a population with a random settlement pattern over a rectangular area (loc rec). The size (area) of a point is proportional to the inclusion probability (πi) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.

Trends in household selection probabilities in the other populations were not as predictable. This was especially true for loc sqr. When n = 7, households with the greatest inclusion probabilities were not found in one particular area (Figure 6.4a). Furthermore, as the sample size was increased, households which were picked most often at n = 7 were not necessarily the ones picked most often at n = 15 and n = 30 (Figure 6.4b).

As with loc sqr, there was nothing that stood out about the inclusion probabilities in loc rec when α = 2π rad, k = 1, n = 7. However, switching from a sampling scheme that did not use a sector to one that did resulted in an increase in the inclusion probabilities of households close to the y-axis and a decrease in the inclusion probabilities of households in the far west and east sides of the cluster (Figures 6.5a and 6.5b).

(a) α = 2π rad, k = 1, n = 7. (b) α = 2π rad, k = 3, n = 7.
Figure 6.6: Household inclusion probabilities for a population with an aggregated settlement pattern (loc agg). The size (area) of a point is proportional to the inclusion probability (πi) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.

A distinct feature of loc agg was that certain groups of households were far enough from other groups that, when n = 7, households in these groups could only be sampled with households in the same group (Figure 6.6a). When more population units were sampled or when neighbours were skipped in the selection process, the samples were less localized.

In loc cgr, patterns in the inclusion probabilities were somewhat clearer.
If a household had a relatively high probability of selection, it was almost certainly from the centre of the cluster. However, this did not mean that all households in the centre of the cluster had high probabilities of selection. For instance, two households near the origin could be side by side, yet one would have two or three times the inclusion probability of the other.

(a) α = 2π rad, k = 1, n = 7. (b) α = 2π rad, k = 3, n = 7.
Figure 6.7: Household inclusion probabilities for a population with a circular gradient settlement pattern (loc cgr). The size (area) of a point is proportional to the inclusion probability (πi) of the household at that location. Connected points indicate pairs of households which can appear together in the same sample. α denotes the angle span of the sector used for selecting the first household; k denotes that every kth neighbour was added to the sample; n denotes the number of households in the sample.

A boxplot of the inclusion probabilities (Figure 6.8) highlights other important properties. For all combinations of household spatial patterns and sampling procedures, the average probability of selection was 0.047, 0.100, and 0.200 when sample sizes were 7, 15, and 30 respectively. These averages correspond to the sampling fraction n/N. In 78% of the scenarios that we tested, the largest probability was over 10 times greater than the smallest probability. The variability of the inclusion probabilities increased as sample size increased, but it was considerably lower in loc reg compared to the other populations.

Finally, we calculated and plotted the correlation (Cor) between a household's inclusion probability (πi) and its distance to the origin (ri) (Figure 6.9).
We used the following formula:

\[
\begin{aligned}
\operatorname{Cor}(\pi_i, r_i)
&= \frac{\operatorname{Cov}(\pi_i, r_i)}{\sqrt{\operatorname{Var}(\pi_i)}\,\sqrt{\operatorname{Var}(r_i)}} \\
&= \frac{\frac{1}{N}\sum_{i=1}^{N}(\pi_i - \bar{\pi})(r_i - \bar{r})}
        {\sqrt{\frac{1}{N}\sum_{i=1}^{N}(\pi_i - \bar{\pi})^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N}(r_i - \bar{r})^2}} \\
&= \frac{\sum_{i=1}^{N}(\pi_i - \bar{\pi})(r_i - \bar{r})}
        {\sqrt{\sum_{i=1}^{N}(\pi_i - \bar{\pi})^2}\,\sqrt{\sum_{i=1}^{N}(r_i - \bar{r})^2}}
\end{aligned}
\tag{6.3}
\]

where π̄ is the mean of the inclusion probabilities and r̄ is the mean of the distances.²

² Cov stands for covariance.

In SRS, the location of a household has no bearing on the probability that it is selected. Our simulation indicates that this is not true for EPI. Nearly all correlations were negative, with many of them falling between -0.6 and -0.4. A negative correlation means that higher inclusion probabilities tend to be associated with households closer to the centre of the population area. However, the plots may not be representative of what happens in other populations even if the populations are of a similar type. For instance, when we generated 30 independent realizations of the loc sqr pattern, we found that on average, the relationship between inclusion probability and distance from the origin was much weaker (Table A.1). An analysis was also done for 30 realizations of loc rec, loc agg, and loc cgr. The main observation we made is that the magnitude of correlation was generally higher in the loc cgr populations, and it increased further when larger samples were taken or when neighbours were skipped in between successive selections.

[Figure 6.8 plot: boxplots in a grid with columns loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr and rows n = 7, n = 15, n = 30; x-axis: sampling method (nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3); y-axis: πi.]

Figure 6.8: Boxplot of household inclusion probabilities (πi) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. The horizontal line that appears across the plots indicates the average inclusion probability (π̄) for a given sample size (n). See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.

[Figure 6.9 plot: panels loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr; x-axis: sample size (n); y-axis: Cor(πi, ri); one series per sampling method (nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3).]

Figure 6.9: Correlation between household inclusion probability (πi) and household distance from the centre of the population area (ri) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.

6.3 Additional Notes: Relations between Inclusion Probabilities

The following properties hold for any sampling plan that selects units without replacement:

Property 1: \( \sum_{i=1}^{N} \pi_i = n \)

Property 2: \( \sum_{j \neq i} \pi_{ij} = (n - 1)\,\pi_i \)

Property 3: \( \sum_{i=1}^{N} \sum_{j > i} \pi_{ij} = \tfrac{1}{2}\, n(n - 1) \)

(Cochran, 1977).

Recall that the sum of the probabilities for all possible samples is 1 because one of these samples must be observed. To get πi, we add up the probabilities of samples containing unit i.
Since there are n units in each sample, all of which are unique, this means that when the sum of inclusion probabilities for individual units is taken across the entire population, the probability of each sample is counted n times, which gives the first property.

To get the second property, we use a similar argument. If all samples containing unit i were listed, we would know πij for every value of j ≠ i. We also know that the sum of probabilities for these samples is equal to πi. Since unit i appears with n − 1 other units in a sample, πi is counted n − 1 times in \( \sum_{j \neq i} \pi_{ij} \). Finally, the third property is a consequence of the first two properties and the symmetry of inclusion probabilities for pairs of units (πij = πji).

To check the inclusion probabilities computed in our simulation study, we made sure that all the relations above were true. For n = 7, we also compared the inclusion probabilities to ones estimated by resampling from the population 25 000 times. The difference between exact and estimated inclusion probabilities was no greater than 0.006.

Chapter 7

Estimator Properties in Simulated Populations

As shown in Chapter 6, the EPI method does not sample households with equal probability. When EPI sampling is used, there can be large variation in the inclusion probabilities of households. Furthermore, there was some correlation between the inclusion probability of a household and the distance of the household from the cluster origin, indicating that households with the largest probabilities are not randomly located throughout the area. This provides motivation for investigating how an estimator that assumes constant probability of selection for all households performs when this assumption is not met, especially when the target variable exhibits non-random spatial trends.
Therefore, it is of interest to compare this estimator to one that accounts for differences in inclusion probabilities.

7.1 Simulation Design

7.1.1 Generation of Populations

Populations were created by assigning a binary outcome to each of the N = 150 households in loc reg, loc sqr, loc rec, loc agg, and loc cgr (Figure 6.1). There were two components to household data: location, as specified by x- and y-coordinates, and a response value z for the variable being measured. If household i had the characteristic of interest, zi = 1; if it did not, zi = 0.

A Bernoulli random variable was generated to set the response value of an individual household. If the result was 1, then the characteristic of interest would be present for that household. The probability of this occurrence depended on p, the proportion of households in the population targeted to have the characteristic of interest. Five levels of p were considered: p = 0.10, 0.30, 0.50, 0.70, 0.90.

The probability of assigning the characteristic of interest to household i, denoted by P(Zi = 1), was set as a function of the household's geographic position in the cluster to allow for finer control over how the characteristic of interest was spatially distributed in the population. Six types of patterns were produced:

1. Random (val rdm). It was equally likely to observe the characteristic of interest anywhere in the population. The probabilities were P(Zi = 1) = p for i = 1, 2, ..., N.

2. Small pockets (val spk). The characteristic of interest was concentrated in certain neighbourhoods. A total of Np/5 households was chosen from the population using SRS. Their locations served as a reference for where a pocket of cases would occur. The pockets were created sequentially.
For a given focal point, the five closest households that did not already have the characteristic of interest were identified and given the characteristic of interest.

3. Large pockets (val lpk). This is the same as the previous pocketing pattern except with Np/15 pocket focal points so that the characteristic of interest appeared in groups of 15 households.

4. Circular gradient (val cgr). Cases of the characteristic of interest decreased when moving away from the origin. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}}
\tag{7.1}
\]

for i = 1, 2, ..., N where β0 ∈ (−∞, ∞) and β1 ∈ (0, ∞). Consequently, households that fall on the same circle centred at the origin have the same probability of having the characteristic of interest, and since β1 is positive, this probability is lower for households that are further away from the origin.¹

In order to calculate the probabilities, values for β0 and β1 had to be specified. We fixed β0 so that, had there been a household exactly at the origin, the resulting probability would be 0.95 if p = 0.10 or p = 0.30, 0.975 if p = 0.50, and 0.99 if p = 0.70 or p = 0.90. Given β0, we solved for β1 in the equation

\[
\begin{aligned}
E\!\left(\sum_{i=1}^{N} Z_i\right) &= Np \\
\sum_{i=1}^{N} E(Z_i) &= Np \\
\sum_{i=1}^{N} P(Z_i = 1) &= Np \\
\sum_{i=1}^{N} \frac{1}{1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}} &= Np.
\end{aligned}
\tag{7.2}
\]

This ensured that the expected number of cases was equal to the target number. It is not trivial to solve Equation (7.2) as there is no closed-form solution for β1. However, an approximation was obtained using the uniroot function in the stats package of R (R Core Team, 2015).² The maximum number of iterations was set to 10^5 and the tolerance limit was set to 2.2 × 10^−16.

¹ If β1 were negative, we would see the opposite effect; households further away from the origin would be more likely to have the characteristic of interest.

5. Diagonal gradient (val dgr).
Cases of the characteristic of interest decreased when moving from the southwest corner to the northeast corner of the study area. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + (y_i - y_{\min})]}},
\tag{7.3}
\]

for i = 1, 2, ..., N where β0 ∈ (−∞, ∞), xmin = min{x1, x2, ..., xN} and ymin = min{y1, y2, ..., yN}. Let c be some fixed but arbitrary positive constant. Probability contour levels are lines of the form

\[
\begin{aligned}
c &= (x - x_{\min}) + (y - y_{\min}) \\
y &= -x + (x_{\min} + y_{\min} + c).
\end{aligned}
\tag{7.4}
\]

The same values for β0 were used as in the circular gradient pattern, and values for β1 were approximated to satisfy

\[
\sum_{i=1}^{N} \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + (y_i - y_{\min})]}} = Np.
\tag{7.5}
\]

Calculations were carried out using the uniroot function in the stats package of R (R Core Team, 2015). The maximum number of iterations was set to 10^5 and the tolerance limit was set to 2.2 × 10^−16.

² A rudimentary approach is to use Newton's method. Let \( g(\beta_1) = \sum_{i=1}^{N} 1 / \bigl(1 + e^{\beta_0 + \beta_1 \sqrt{x_i^2 + y_i^2}}\bigr) - Np \). Since g(β1) is differentiable, we may compute \( \beta_1^{(1)} = \beta_1^{(0)} - g(\beta_1^{(0)}) / g'(\beta_1^{(0)}) \), where \( \beta_1^{(0)} \) is an initial guess at the root. The solution is updated by substituting \( \beta_1^{(1)} \) for \( \beta_1^{(0)} \) in the equation above. This iterative process continues until the maximum number of iterations is reached or until the difference between \( \beta_1^{(n+1)} \) and \( \beta_1^{(n)} \) is smaller than the tolerance limit.

6. Horizontal gradient (val hgr). This is the same as val dgr except probability contour levels are represented by steeper lines, making the gradient almost horizontal. The probabilities were

\[
P(Z_i = 1) = \frac{1}{1 + e^{\beta_0 + \beta_1 [(x_i - x_{\min}) + \frac{1}{250}(y_i - y_{\min})]}},
\tag{7.6}
\]

for i = 1, 2, ..., N where β0 ∈ (−∞, ∞), xmin = min{x1, x2, ..., xN} and ymin = min{y1, y2, ..., yN}.
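As an illustration of the circular gradient calculation in Equations (7.1) and (7.2), the following Python sketch solves for β1 by bisection and then draws Bernoulli outcomes. It is not the thesis code, which used uniroot in R; the function names, the random seed, and the uniformly scattered household coordinates are assumptions made only for this example.

```python
import math
import random


def inv_logit_neg(t):
    """Return 1 / (1 + e^t) without overflowing for large |t|."""
    if t > 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def case_probs(coords, beta0, beta1):
    """P(Z_i = 1) from Equation (7.1) for households at (x_i, y_i)."""
    return [inv_logit_neg(beta0 + beta1 * math.hypot(x, y)) for x, y in coords]


def solve_beta1(coords, beta0, p, tol=1e-12, max_iter=200):
    """Bisection on beta1 > 0 so that sum_i P(Z_i = 1) = N * p (Equation 7.2)."""
    target = len(coords) * p

    def g(beta1):
        return sum(case_probs(coords, beta0, beta1)) - target

    lo, hi = 0.0, 1e-3
    if g(lo) <= 0:
        raise ValueError("no positive root: probability at the origin is too low")
    while g(hi) > 0:  # widen the bracket until g changes sign
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical cluster: 150 households scattered uniformly over a square.
rng = random.Random(2016)
coords = [(rng.uniform(-500, 500), rng.uniform(-500, 500)) for _ in range(150)]

p, origin_prob = 0.50, 0.975  # target proportion and P(Z = 1) at the origin
beta0 = math.log((1.0 - origin_prob) / origin_prob)
beta1 = solve_beta1(coords, beta0, p)
probs = case_probs(coords, beta0, beta1)
z = [1 if rng.random() < q else 0 for q in probs]  # Bernoulli outcomes
```

Bisection is used here because the left-hand side of Equation (7.2) decreases monotonically in β1 whenever all households lie away from the origin, so a sign change brackets a single root. The realized number of cases in z will generally differ from Np, since only the expected count is fixed.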
[Figure 7.1 plot: ten panels titled loc_reg; val_dgr, loc_sqr; val_spk, loc_rec; val_rdm, loc_agg; val_hgr, loc_cgr; val_cgr, loc_reg; val_rdm, loc_sqr; val_hgr, loc_rec; val_cgr, loc_agg; val_spk, and loc_cgr; val_lpk; x- and y-axes run from −500 to 500; legend: Characteristic of interest, Not Present (0) or Present (1).]

Figure 7.1: Spatial distributions of the target variable when the proportion of households with the characteristic of interest is p = 0.50. The first item in the population label refers to the spatial distribution of households and the second item refers to the spatial distribution of the target variable. See pages viii-x for the meaning of abbreviations used.

While demographics may differ from one district to another, it may be reasonable to assume that within a small community of 150 households there is spatial homogeneity. Therefore, the presence of children or seniors could be examples of household characteristics that are evenly distributed throughout a cluster. Conversely, a pocketing pattern is likely to be seen for a disease transmitted through close contact. It could also represent the geographic distribution of households that have lost power in an emergency or households whose members have experienced violence related to group conflict. Another variable that is tied to location is whether a household gets its drinking water from the municipal grid or some other source.
Households near the centre of the cluster may be connected to the municipal grid while those on the outskirts may use a spring or pond as their main supply of water. This is best captured by the circular gradient pattern. Lastly, we can consider a scenario where we are interested in estimating the proportion of households that did not evacuate their residence during a landslide or flood. One side or corner of the cluster may be at a lower risk, so as we move in that direction, we may find more and more homes that are occupied. Hence, cases of this outcome may be distributed similarly to a diagonal gradient or horizontal gradient pattern.

Five realizations of each of the patterns were generated for every level of p and set of household locations. Thus, 750 populations were analyzed in the study, a few of which are illustrated in Figure 7.1. However, there are only five arrangements of households underlying these populations since household locations remained fixed for a given type of household spatial distribution.

For val rdm, val cgr, val dgr, and val hgr patterns, cases of the characteristic of interest were randomly added or removed after the initial procedure to attain the desired proportion. In most instances (87% of the populations created), the adjustment that was made to the number of cases was less than 10% of the target number of cases (Table 7.1).³

³ We may intend for the characteristic of interest to be present in half of the population, but because we are using a random procedure to generate outcomes, we could end up with only 70 out of the 150 households with the characteristic of interest rather than 75 households. Therefore, we need to create five more instances of the characteristic of interest, which represents 5/75 ≈ 0.067 of the target number of cases.

Table 7.1: Cases of the characteristic of interest added or removed from populations after the initial population generation procedure to attain a certain proportion (p) of households with the characteristic of interest.

Average number of cases added (+) or removed (−)
Spatial distribution
of target variable     p = 0.10  p = 0.30  p = 0.50  p = 0.70  p = 0.90
val rdm                -0.32     -0.72     -0.68     -1.64     -1.04
val spk                 0.00      0.00      0.00      0.00      0.00
val lpk                 0.00      0.00      0.00      0.00      0.00
val cgr                 0.08      0.36     -0.52     -0.28     -1.28
val dgr                -0.20     -0.32      1.28     -1.40      0.24
val hgr                 0.16      0.04     -0.88     -0.56     -0.20

Maximum change (abs) in number of cases
Spatial distribution
of target variable     p = 0.10  p = 0.30  p = 0.50  p = 0.70  p = 0.90
val rdm                 8         11        11        13        6
val spk                 0         0         0         0         0
val lpk                 0         0         0         0         0
val cgr                 4         10        10        13        8
val dgr                 6         8         16        11        9
val hgr                 5         8         12        11        7

abs refers to absolute value. See pages viii-x for the meaning of abbreviations used for the spatial distribution of the target variable.

7.1.2 Sampling Plans

Households were sampled in the same manner as they were in Chapter 6. The sampling plans were based on the EPI method, but used varying sample sizes (n = 7, 15, 30), sector sizes (α = 2π, π/8, π/32 rad), and skipping rules (k = 1, 3) for a total of 18 combinations.

7.1.3 Estimation of Population Proportion

The proportion of households with the characteristic of interest was estimated in three ways.

The first estimator uses the standard formula for a mean. If S is the set of units in a sample, then the population proportion is estimated by

\[
\hat{p}_{EQW} = \frac{1}{n} \sum_{i \in S} z_i.
\tag{7.7}
\]

Under this procedure, all observations receive equal weight (EQW).
Another estimator that we considered was the Horvitz-Thompson (HT) estimator. It is a weighted average given by

\[
\hat{p}_{HT} = \frac{1}{N} \sum_{i \in S} w_i z_i.
\tag{7.8}
\]

The response from household i is weighted at wi = 1/πi, where πi is the probability that household i is in the sample. Alternatively, p̂HT can be expressed as

\[
\hat{p}_{HT} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i H_i,
\tag{7.9}
\]

where Hi = 1 if unit i is in the sample S and 0 otherwise. It follows that P(Hi = 1) = πi. Since

\[
\begin{aligned}
E(\hat{p}_{HT}) &= E\!\left(\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i H_i\right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i E(H_i) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i P(H_i = 1) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\pi_i} z_i \pi_i \\
&= \frac{1}{N} \sum_{i=1}^{N} z_i \\
&= p,
\end{aligned}
\tag{7.10}
\]

the Horvitz-Thompson estimator has the advantage of being unbiased regardless of how the sample is taken (Cochran, 1977).

Lastly, we looked at an estimator which is essentially the Horvitz-Thompson estimator, but truncated at 1 (HTR). It is calculated as

\[
\hat{p}_{HTR} = \min\{\hat{p}_{HT}, 1\}.
\tag{7.11}
\]

[Figure 7.2 plot: boxplots in a grid with columns loc_reg, loc_sqr, loc_rec, loc_agg, loc_cgr and rows n = 7, n = 15, n = 30; x-axis: sampling method (nosec_k1, api08_k1, api32_k1, nosec_k3, api08_k3, api32_k3); y-axis: wi = 1/πi.]

Figure 7.2: Boxplot of household sampling weights (wi) for loc reg, loc sqr, loc rec, loc agg, and loc cgr (populations in Figure 6.1) when various sampling plans are applied. πi denotes the inclusion probability for household i; n denotes the number of households sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.

For the populations and sampling plans in the investigation, several households end up with a selection probability so small that when it is converted to a sampling weight, the sampling weight is greater than the population size N = 150 (Figure 7.2). If even one of these households appears in the resulting sample and it has the characteristic of interest, then the estimate of the population proportion as calculated using the HT estimator will exceed 1. This situation arises more generally.
Whenever the combined weights of the sampled units with the characteristic of interest exceed N, p̂HT > 1. Proportions are only meaningful when they are between 0 and 1. Therefore, it was necessary to devise a restricted version of the Horvitz-Thompson estimator.

7.1.4 Evaluation of Estimators

Estimators were assessed on their accuracy and precision. For each of the stated estimators, we computed their expected value as

\[
E(\hat{p}) = \sum_{S} \hat{p}_S P(S),
\tag{7.12}
\]

and their variance as

\[
Var(\hat{p}) = \sum_{S} \bigl(\hat{p}_S - E(\hat{p})\bigr)^2 P(S),
\tag{7.13}
\]

where p̂S is the proportion estimated from the data in sample S, and P(S) is the probability that sample S is observed. These computations are possible because all samples from a population can be listed along with their corresponding probability (see algorithm in Chapter 5 and Appendix B.3.3). Since the true population proportion was known, we were also able to measure the bias of an estimator

\[
Bias(\hat{p}) = E(\hat{p}) - p,
\tag{7.14}
\]

as well as the mean square error

\[
MSE(\hat{p}) = Var(\hat{p}) + \bigl(Bias(\hat{p})\bigr)^2.
\tag{7.15}
\]

Table 7.2: Range of properties for the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator across 13 500 simulated scenarios. The scenarios were based on 750 populations and 18 sampling plans as described in Sections 7.1.1 and 7.1.2.

Estimator  Statistic  Bias    Variance  MSE     DE
EQW        Min        -0.17   0.0005    0.0005   0.20
EQW        Mean        0.01   0.0329    0.0347   2.66
EQW        Max         0.20   0.2257    0.2275  25.39
HT         Min         0.00   0.0004    0.0004   0.16
HT         Mean        0.00   0.0800    0.0800   9.34
HT         Max         0.00   0.4590    0.4590  92.38
HTR        Min        -0.19   0.0004    0.0004   0.16
HTR        Mean       -0.03   0.0449    0.0471   4.56
HTR        Max         0.00   0.1930    0.1998  37.67

Min refers to minimum value; Max refers to maximum value; MSE refers to mean square error; DE refers to design effect.
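The expected value, variance, bias, and mean square error in Equations (7.12)-(7.15) can be illustrated with a small Python sketch that lists every sample with its probability. The four-household population and its sampling design below are hypothetical and much simpler than EPI sampling; the function names are likewise assumptions for this example only.

```python
def inclusion_probs(samples, N):
    """pi_i = sum of P(S) over all listed samples S that contain unit i."""
    pi = [0.0] * N
    for units, prob in samples:
        for i in units:
            pi[i] += prob
    return pi


def evaluate(samples, z, estimate):
    """Exact E, Var, Bias and MSE of an estimator (Equations 7.12-7.15)."""
    p = sum(z) / len(z)  # true population proportion
    mean = sum(estimate(units) * prob for units, prob in samples)
    var = sum((estimate(units) - mean) ** 2 * prob for units, prob in samples)
    bias = mean - p
    return {"mean": mean, "var": var, "bias": bias, "mse": var + bias ** 2}


# Hypothetical design: N = 4 households, samples of size n = 2,
# each listed as (units in the sample, probability of that sample).
z = [1, 0, 0, 1]
samples = [((0, 1), 0.4), ((1, 2), 0.3), ((2, 3), 0.2), ((0, 3), 0.1)]
N = len(z)
pi = inclusion_probs(samples, N)


def eqw(units):
    """Equally weighted estimator, Equation (7.7)."""
    return sum(z[i] for i in units) / len(units)


def ht(units):
    """Horvitz-Thompson estimator, Equation (7.8), with w_i = 1 / pi_i."""
    return sum(z[i] / pi[i] for i in units) / N


def htr(units):
    """Restricted Horvitz-Thompson estimator, Equation (7.11)."""
    return min(ht(units), 1.0)


results = {name: evaluate(samples, z, f)
           for name, f in [("EQW", eqw), ("HT", ht), ("HTR", htr)]}
```

In this toy design the HT estimate for the sample containing units 0 and 3 exceeds 1, the HT bias is zero up to rounding error, and truncating at 1 gives HTR a negative bias, which mirrors the trade-off described above.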
Finally, to compare the variability of an estimator to the variability of the standard estimator under SRS, we checked the design effect,

\[
\begin{aligned}
DE(\hat{p}) &= \frac{Var(\hat{p})}{Var(\hat{p}_{SRS})} \\
&= \frac{Var(\hat{p})}{\dfrac{N - n}{N - 1} \cdot \dfrac{p(1 - p)}{n}}.
\end{aligned}
\tag{7.16}
\]

Together, these properties could help provide a picture of the quality of estimation from using a certain estimator and what impact population and sampling plan decisions have on estimation.

[Figure 7.3 plot: two histogram panels (EQW, HTR); x-axis: Bias(p̂); y-axis: relative frequency.]

Figure 7.3: Histogram of estimator bias across 13 500 simulated scenarios. EQW refers to the equally weighted estimator; HTR refers to the restricted Horvitz-Thompson estimator. The scenarios are based on 750 populations and 18 sampling plans as described in Sections 7.1.1 and 7.1.2.

[Figure 7.4 plot: two rows of histogram panels (EQW, HT, HTR); x-axis: Var(p̂) in the top row and DE(p̂) in the bottom row; y-axis: relative frequency.]

Figure 7.4: Histogram of estimator variance and design effect across 13 500 simulated scenarios. EQW refers to the equally weighted estimator; HT refers to the Horvitz-Thompson estimator; HTR refers to the restricted Horvitz-Thompson estimator. The scenarios are based on 750 populations and 18 sampling plans as described in Sections 7.1.1 and 7.1.2.

7.2 Simulation Results

The distributions of EQW, HT, and HTR estimators were obtained under 750 × 18 = 13 500 population and sampling settings. Estimator properties were calculated from these distributions accordingly. The minimum, mean, and maximum values that were observed for the bias, variance, mean square error, and design effect associated with each type of estimator are given in Table 7.2. The histograms in Figures 7.3 and 7.4 provide a graphical summary of the individual results across the different simulation scenarios.

The plots in Figures 7.5-7.7 represent marginal means and show how an estimator property varies over the levels of a factor while controlling for sample size and the proportion of households in the population with the characteristic of interest. The factors examined in the plots are spatial distribution of the target variable and sampling method. Data corresponding to these plots can be found in Appendix A (Tables A.2-A.9), which also includes results relating to the spatial distribution of households.

Within our simulation study, the bias of HT was always equal to 0, in agreement with theory. However, we found that some HT estimates reached as high as 4.9, particularly when p > 0.50. This never happened for the EQW or HTR estimators, which were always bounded by 0 and 1. However, ensuring that estimates were less than or equal to 1 for HTR came at the cost of losing the property of unbiasedness.

While EQW and HTR exhibited bias, values peaked around 0, demonstrating that bias was not too large in most of the scenarios that we looked at in our study. The magnitude of the bias was less than 0.05 in 82% of the results for EQW and 77% of the results for HTR.

For EQW, the most negative biases (representing 2.5% of results) were between -0.17 and -0.060, and the most positive biases (representing 2.5% of results) were between 0.11 and 0.20. Large negative biases typically occurred when estimation was performed on populations where households were randomly placed over a square (loc sqr) and cases of the characteristic of interest were concentrated in the west end of the population area (val hgr).
Yet, this might not hold in general because for loc_sqr the probability of sampling households from the west end of the population area happened to be lower compared to other realizations of the same household spatial distribution. In contrast, large positive biases for EQW typically occurred when estimation was performed on populations where the concentration of households and cases of the characteristic of interest increased towards the centre of the cluster (loc_cgr; val_cgr). Even when results are averaged across household spatial patterns and sampling methods, the bias associated with val_cgr remains positive and noticeably higher than what is observed for other spatial distributions of the target variable.

[Figure 7.5 image: marginal means of Bias(p̂) for the EQW, HT and HTR estimators, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30). Panel (a): bias against the spatial distribution of the target variable, with results averaged across spatial distributions of households and sampling methods. Panel (b): bias against the sampling method, with results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.5: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator. See Tables A.2 and A.3 for values of the bias when n = 7, 30. p denotes the proportion of households in the population with the characteristic of interest; n denotes the number of households that were sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution of the target variable and the sampling method.
The marginal plots for bias also suggest that a pocketing pattern leads to moderate positive biases. However, when individual results were analyzed, there were several instances where a large bias was associated with a population that had a pocketing pattern, but the bias was negative. In these populations, pockets were located near the periphery of the cluster. Overall, extreme EQW biases often coincided with the population proportion being around a half and with large sample sizes. Choice of sampling plan had less of an impact on bias compared to the spatial distribution of the target variable.

Whenever HTR was biased, the bias was negative. This is to be expected since, by definition, the HTR estimate must be less than or equal to the HT estimate for each and every sample. Unlike EQW, the magnitude of the bias for HTR continued to grow as the proportion increased beyond p = 0.50, and bias tended to be greater for n = 7 compared to n = 30. The most negative biases (representing 2.5% of results) ranged from -0.19 to -0.13. These results were observed for all household spatial distributions except for when households were regularly spaced (loc_reg). Results were mixed for the spatial distribution of the target variable where the bias of HTR was large. Populations where the bias of HTR was close to 0 (magnitude less than 0.01) were also of varying types, but there were twice as many populations with the loc_reg pattern than those with other household spatial distributions.

Overall, the HT estimator was more variable than the other estimators. When variances were compared case by case, the variance of the HT estimator exceeded the variance of the EQW estimator 87% of the time.

[Figure 7.6 image: marginal means of Var(p̂) for the EQW, HT and HTR estimators, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30). Panel (a): variance against the spatial distribution of the target variable, with results averaged across spatial distributions of households and sampling methods. Panel (b): variance against the sampling method, with results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.6: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator. See Tables A.4 and A.5 for values of the variance when n = 7, 30. p denotes the proportion of households in the population with the characteristic of interest; n denotes the number of households that were sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution of the target variable and the sampling method.
When the variance of the HT estimator was less than the variance of the EQW estimator, the associated population usually had positive outcomes of the target variable distributed as a circular gradient (val_cgr) and positive outcomes were present in less than half of the population (p < 0.50). As for the variance of HTR, it was equal to the variance of the HT estimator in 25% of the scenarios, and fell between the variances of the HT and EQW estimators in 62% of the scenarios. While the EQW and HTR estimators had variances that rose then dropped around the p = 0.50 mark, the variance of the HT estimator kept rising. Therefore, as the proportion of households with the characteristic of interest increased, so too did the difference between the variance of the HT estimator and the variance of the other estimators.

When the highest variances were identified (those in the top 2.5%), the range was 0.22 to 0.46 for HT, and 0.12 to 0.19 for EQW. Many of these results came from populations that had an aggregated settlement pattern (loc_agg) and had cases of the characteristic of interest distributed in large pockets (val_lpk). Ignoring population type, variances tended to be higher when no neighbours were skipped during the sampling process than when neighbours were skipped. For the HT estimator, variances reached even higher levels when a random sector was used to pick the first unit of the sample. For all estimators, variance decreased as sample size increased.

Plots of MSE have been omitted as they are almost identical to the plots for variance. For the estimators and scenarios investigated in this study, the major source of MSE was variance. Hence, the values computed for MSE were only slightly greater than the values computed for variance (refer to Tables A.6 and A.7), and the comments previously made about variance generally apply to MSE.

[Figure 7.7 image: marginal means of DE(p̂) for the EQW, HT and HTR estimators, in a grid of panels by p (0.10, 0.30, 0.50, 0.70, 0.90) and n (7, 15, 30). Panel (a): design effect against the spatial distribution of the target variable, with results averaged across spatial distributions of households and sampling methods. Panel (b): design effect against the sampling method, with results averaged across spatial distributions of households and spatial distributions of the target variable.]
Figure 7.7: Design effect of EPI sampling relative to SRS for the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator. See Tables A.8 and A.9 for values of the design effect when n = 7, 30. p denotes the proportion of households in the population with the characteristic of interest; n denotes the number of households that were sampled. See pages viii-x for the meaning of abbreviations used for the spatial distribution of the target variable and the sampling method.

While DE followed a trend similar to that of variance when results were compared across the various spatial distributions of the target variable and sampling methods, DE did not decrease as sample size increased.
Despite a larger sample, the variance of the estimators did not shrink as much as it would have if SRS had been performed. For HT, 87% of the observed DEs were over 2; for HTR, 81% were over 2; for EQW, 45% were over 2. However, the maximum DE for HT was 92, well above the maximum DE for HTR, which was 38. These occurred in separate populations, but both populations had p = 0.90 in common. The DE for EQW was generally lower than for HT and HTR, though it still reached a high of 25. This was seen in a population where the characteristic of interest had a pocketing pattern and p = 0.50. Regardless of the estimator, large DEs (top 2.5% of results) were associated more with sampling methods that did not skip households than with those that did.

7.3 Additional Notes: Variance of the Horvitz-Thompson Estimator

Due to the particular structure of the Horvitz-Thompson estimator, its variance can be calculated using other formulas besides the one in Equation (7.13). These formulas make use of the inclusion probability of individual units ($\pi_i$) and pairs of units ($\pi_{ij}$) rather than the probability of a sample. The variance can be calculated as

\[
Var(\hat{p}_{HT}) = \frac{1}{N^2}\left(\sum_{i=1}^{N}\frac{1-\pi_i}{\pi_i}\,z_i^2 + 2\sum_{i=1}^{N}\sum_{j>i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\,z_i z_j\right) \tag{7.17}
\]

or equivalently,

\[
Var(\hat{p}_{HT}) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j>i}(\pi_i\pi_j-\pi_{ij})\left(\frac{z_i}{\pi_i}-\frac{z_j}{\pi_j}\right)^2, \tag{7.18}
\]

which we used to verify the results of our simulation. Details of the derivation of Equation (7.18) from Equation (7.17) are given in Cochran (1977). The variance of $\hat{p}_{HT}$ can be estimated by

\[
\widehat{Var}_1(\hat{p}_{HT}) = \frac{1}{N^2}\left(\sum_{i\in S}\frac{1-\pi_i}{\pi_i^2}\,z_i^2 + 2\sum_{i\in S}\sum_{\substack{j\in S\\ j>i}}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}\,z_i z_j\right) \tag{7.19}
\]

or

\[
\widehat{Var}_2(\hat{p}_{HT}) = \frac{1}{N^2}\sum_{i\in S}\sum_{\substack{j\in S\\ j>i}}\frac{\pi_i\pi_j-\pi_{ij}}{\pi_{ij}}\left(\frac{z_i}{\pi_i}-\frac{z_j}{\pi_j}\right)^2. \tag{7.20}
\]

While the estimators in Equations (7.19) and (7.20) may not take the same value for a given sample, both are unbiased estimators of $Var(\hat{p}_{HT})$, provided that $\pi_{ij} > 0$ for all i and j (Cochran, 1977).
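As an illustration of how Equations (7.17) and (7.18) can be checked against a sampling distribution, the Python sketch below builds a small fixed-size design and computes the variance of the HT estimator three ways. The design, the values of z and all function names are invented for this example and are not taken from the thesis. The agreement of the pairwise form with the other two relies on every sample having the same size, as is the case when a fixed quota of households is sampled.

```python
from itertools import combinations


def inclusion_probabilities(design, N):
    """First- and second-order inclusion probabilities from {sample: prob}."""
    pi = [0.0] * N
    pij = [[0.0] * N for _ in range(N)]
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob
        for i, j in combinations(sample, 2):
            pij[i][j] += prob
            pij[j][i] += prob
    return pi, pij


def var_direct(design, z, pi, N):
    """Variance of the HT estimator taken over the sampling distribution."""
    est = {s: sum(z[i] / pi[i] for i in s) / N for s in design}
    mean = sum(q * est[s] for s, q in design.items())
    return sum(q * (est[s] - mean) ** 2 for s, q in design.items())


def var_eq_7_17(z, pi, pij, N):
    """Variance from single and pairwise terms, as in Equation (7.17)."""
    single = sum((1 - pi[i]) / pi[i] * z[i] ** 2 for i in range(N))
    cross = sum((pij[i][j] - pi[i] * pi[j]) / (pi[i] * pi[j]) * z[i] * z[j]
                for i, j in combinations(range(N), 2))
    return (single + 2 * cross) / N ** 2


def var_eq_7_18(z, pi, pij, N):
    """Pairwise-difference form, as in Equation (7.18)."""
    return sum((pi[i] * pi[j] - pij[i][j]) * (z[i] / pi[i] - z[j] / pi[j]) ** 2
               for i, j in combinations(range(N), 2)) / N ** 2


# Made-up design: samples of size 2 from N = 4 units, unequal probabilities.
N = 4
design = {(0, 1): 0.3, (0, 2): 0.2, (0, 3): 0.1,
          (1, 2): 0.1, (1, 3): 0.1, (2, 3): 0.2}
z = [1, 0, 1, 1]
pi, pij = inclusion_probabilities(design, N)
v_direct = var_direct(design, z, pi, N)
v_17 = var_eq_7_17(z, pi, pij, N)
v_18 = var_eq_7_18(z, pi, pij, N)
print(v_direct, v_17, v_18)
```

The three values coincide up to floating-point rounding for this design. In a design where some pairs have zero joint inclusion probability, the population-level formulas still apply, but the sample-based estimators that divide by the joint inclusion probability lose their unbiasedness.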
However, in EPI sampling, many pairs of households cannot appear together and have a joint inclusion probability of 0. Although we did not compute the estimated variance for every scenario in our simulation, we examined one case: the population with randomly placed households over a square area (loc_reg) where the characteristic of interest was randomly assigned (val_rdm) to 75 out of the 150 households (p = 0.50). When a sample of 7 households was selected using the EPI method, with the first household randomly selected from the entire population and no neighbours skipped, we found $Var(\hat{p}_{HT}) = 0.049$ but $E(\widehat{Var}_1(\hat{p}_{HT})) = 0.054$ and $E(\widehat{Var}_2(\hat{p}_{HT})) = 0.118$. Furthermore, the first variance estimator yielded negative values in 19% of the samples (the probability of observing any of these samples is 0.17) and the second estimator yielded negative values in 74% of the samples (the probability of observing any of these samples is 0.77).[4][5]

[4] In this example, there are 91 different combinations of households that have a non-zero probability of being drawn. However, these combinations do not all have the same probability of being drawn.
[5] Cochran (1977) noted that negative variance estimates are possible when the estimators in Equations (7.19) and (7.20) are used.

Chapter 8

Summary, Discussion and Future Directions

The EPI method certainly has several operational benefits, the main one being that it can be implemented even without having a list of every household at a site. This is especially helpful in settings where there is simply not enough time or resources to properly construct a sampling frame. The only information that really needs to be known beforehand is the relative sizes of the clusters making up the population. At the same time, there are obvious concerns about the data coming from an EPI sample.
While the EPI method follows the usual procedure of selecting clusters with PPS, the way that it selects elements at the very last stage of sampling remains fundamentally different from SRS or any other sampling scheme for which there is a well-established theory on estimation.

8.1 Summary and Discussion

The goal throughout this thesis has been to gain a fuller understanding of how units are sampled when the EPI method of selecting nearest neighbours is used. Our analysis takes a probabilistic approach, and we ultimately use the results to investigate properties of the estimator for a population proportion.

Previous assessments of estimation under the EPI sampling design were done in the context of a multi-cluster population. In these studies, an estimate was based on a sample that combined subsamples drawn from several subpopulations. Our research focused on the sampling of units that happens within a cluster since this is where EPI diverges from typical implementations of two-stage cluster sampling. By doing the analysis at this level, we aimed to provide some insight as to how the statistical properties of the sample proportion differ when SSUs are picked according to EPI versus random selection.

Besides working with single-cluster populations, there are other aspects of our simulation study that make it different from those that have already been published. First, we set sample size in terms of number of households so that our analysis would be relevant for a general household survey. Even for surveys focused on a specific demographic group, it has been suggested that the number of households to be visited should be fixed (Brogan et al., 1994). Often, these surveys still include questions about household-level characteristics (SMART, 2012). Second, we obtained inclusion probabilities for the households in the populations that we generated.
We developed an algorithm to compute the probability of sampling a household when the starting point is chosen from the households that lie in a random sector and when the starting point is chosen randomly among all households in the population. Additionally, we have an algorithm for when sampling is performed by taking every kth neighbour encountered along a path of nearest neighbours. Bennett et al. (1994) proposed several adaptations of the EPI method, but we only investigated the one that uses the skipping rule since this was the one most frequently applied in real surveys (Fonn et al., 2006; Grais et al., 2007; Vinck and Bell, 2011). Third, for a given population and sampling plan, we computed estimator properties from the exact distribution of the estimator rather than from the results of a Monte Carlo simulation. Furthermore, since we could calculate the inclusion probabilities of households, we introduced the Horvitz-Thompson estimator and a restricted version of the Horvitz-Thompson estimator for estimating the population proportion. We compared their performance to the standard estimator that uses the sample proportion, a comparison which to our knowledge has not been previously studied in an EPI setting.

Our investigation of the EPI method covers results for a variety of populations. One of the ways in which the populations differed from each other is the spatial distribution of households. This had an effect on the number of samples that were possible: more combinations of households could be observed when the households were regularly spaced in the population. However, because of the restriction that the EPI method imposes on the selection of households, an EPI sample cannot consist of just any combination of households in the population, as it can in SRS. We also demonstrated that inclusion probabilities can vary largely from unit to unit.
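To make the idea of household inclusion probabilities concrete, the Python sketch below enumerates every possible starting household for a deliberately simplified version of the nearest-neighbour walk. It is not the algorithm developed in this thesis. It assumes that the starting household is chosen uniformly from the whole population (no sector), that the walk always moves to the nearest unvisited household, that ties are broken by household index, and that the starting household counts as the first interview. All names and coordinates are invented for illustration.

```python
import math


def chain_length(n, k=1):
    """Households visited to obtain n interviews when every k-th is kept."""
    return 1 + k * (n - 1)


def epi_sample(coords, start, n, k=1):
    """Follow nearest unvisited neighbours from `start`, keeping every k-th."""
    if chain_length(n, k) > len(coords):
        raise ValueError("chain would be longer than the population")
    chain = [start]
    unvisited = set(range(len(coords))) - {start}
    while len(chain) < chain_length(n, k):
        here = coords[chain[-1]]
        # Assumption: ties in distance are broken by the lower index.
        nxt = min(unvisited, key=lambda j: (math.dist(here, coords[j]), j))
        chain.append(nxt)
        unvisited.remove(nxt)
    return chain[::k]


def inclusion_probabilities(coords, n, k=1):
    """pi_i when the starting household is uniform over the population."""
    N = len(coords)
    pi = [0.0] * N
    for start in range(N):
        for i in epi_sample(coords, start, n, k):
            pi[i] += 1 / N
    return pi


# Made-up layout: five households on a line with widening gaps.
coords = [(0, 0), (1, 0), (3, 0), (6, 0), (10, 0)]
pi = inclusion_probabilities(coords, n=2)
print([round(p, 2) for p in pi])
print(chain_length(30, k=3))
```

In this toy layout the second household is reached from three of the five starting points, so its inclusion probability is 0.6, whereas the isolated household at the end of the line is sampled only when the walk starts there (0.2). The helper `chain_length` also shows how quickly the walk grows under the skipping rule.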
More notably, our analysis showed an increase in the range of inclusion probabilities when going from a sample of n = 7 households (sampling fraction was 0.05) to a sample of n = 30 households (sampling fraction was 0.20). At n = 30, we found instances where the probability of getting picked was as low as 0.02 for some households and as high as 0.50 for others in the same population (Figure 6.8). Each household can be linked to every other household in the population by constructing a chain of nearest neighbours. Certain households appear early in more chains because of how they are positioned relative to the other households in the population. Hence, these households are more likely to be added to the sample wherever the sample begins.

We observed a negative correlation between the probability of selecting a household and the distance of the household from the centre of the population area. This correlation tended to become stronger the larger the sample size, indicating that the EPI paths were heading towards the centre of the population. This was especially true for populations that were denser in the centre since the nearest neighbour of a household would likely be in the direction of the centre. However, as more households are sampled, the chain keeps going and eventually moves out to the periphery. If every third household encountered were included in the survey, then a chain of 88 households would be required in order to reach 30 households. Here, the chain includes over half of the population. This may explain why skipping neighbours did not always result in a more negative correlation.

The size of the sector used for selecting the first household had little impact on the inclusion probabilities compared to the other factors varied in the simulation.
Although using a sector meant that not all households had an equal chance of being chosen as the starting point, the more important determinant in the probability of sampling a household was the number of EPI paths that went through it. Unless there was a substantial imbalance in the way households are laid out around the origin, initiating the sample from a random sector was similar to initiating it from a household that was chosen randomly across the entire population.[1]

While the Horvitz-Thompson estimator (HT) seemed promising because it is unbiased, it has some significant drawbacks. First, it requires knowing the probability of selection for every population element. For EPI, this means that households in the survey area need to be mapped. While this could be accomplished by obtaining high-resolution satellite images from Google Earth or flying a drone to capture aerial shots of the community (Escamilla et al., 2014; The Swiss Foundation for Mine Action (FSD), 2016), there is still the task of computing inclusion probabilities, which could prove to be challenging in the real world. Second, the Horvitz-Thompson estimator under EPI sampling turns out to be very unstable. There is a lot of variability in how the observations are weighted, and estimates of a proportion can surpass 1. Even when we restricted the Horvitz-Thompson estimator (HTR) so that it would not go beyond 1, design effects remained high relative to the equally weighted estimator. Moreover, the amount by which it underestimated the population proportion would also grow as the population proportion reached higher levels. At p = 0.90, we saw many instances where the magnitude of bias exceeded 0.10. Therefore, weighting the observations did not lead to an overall improvement in estimation.
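The way unequal weights can push an estimate of a proportion past 1 is easy to see in a small numerical sketch. The Python code below is illustrative only. The HT estimator is written in the form implied by the variance formulas in Section 7.3 (a sum of z_i/π_i over the sample, divided by N), while the restricted version is assumed here to simply cap the HT estimate at 1, which may differ in detail from the definition given earlier in the thesis. The sample values and inclusion probabilities are invented.

```python
def eqw_estimate(z_sample):
    """Equally weighted estimator: the ordinary sample proportion."""
    return sum(z_sample) / len(z_sample)


def ht_estimate(z_sample, pi_sample, N):
    """Horvitz-Thompson estimator: each response weighted by 1 / pi_i."""
    return sum(z / p for z, p in zip(z_sample, pi_sample)) / N


def htr_estimate(z_sample, pi_sample, N):
    """Restricted HT estimator, assumed here to be HT capped at 1."""
    return min(1.0, ht_estimate(z_sample, pi_sample, N))


# Hypothetical sample of n = 7 households from N = 150, several of which
# had a very small chance of being selected.
N = 150
z_sample = [1, 1, 1, 0, 1, 1, 1]
pi_sample = [0.02, 0.02, 0.05, 0.05, 0.10, 0.10, 0.02]

eqw = eqw_estimate(z_sample)
ht = ht_estimate(z_sample, pi_sample, N)
htr = htr_estimate(z_sample, pi_sample, N)
print(eqw, ht, htr)
```

Here three of the sampled households with the characteristic each carry a weight of 50, so the HT estimate exceeds 1 even though one sampled household does not have the characteristic. Capping removes the impossible value but can only move estimates downwards, which is consistent with the negative bias reported for HTR.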
[1] It may be interesting to check how choosing the first unit from a sector compares to choosing it from a strip when dwellings are given positive dimensions. According to Henderson et al. (1973), if a strip were used, the probability of being picked first would be proportional to the ratio of the width of the dwelling to its distance from the origin (this concern was also brought up in Brogan et al. (1994) as well as in Centers for Disease Control and Prevention and World Food Programme (2007)).

Our analysis of the regular estimator (EQW) for a proportion confirms many of our suspicions about EPI. Among the populations that we examined, the properties of this estimator did not deviate excessively under EPI from what they would be under SRS as long as outcomes were randomly assigned to the households. In the absence of any aggregation in the geographical distribution of the characteristic of interest, biases were close to 0, and design effects were close to 1. When cases of the characteristic of interest were isolated to certain neighbourhoods or appeared more frequently among households near the centre of the population area, there was a greater departure between the sample proportion and the population proportion it was supposed to estimate, agreeing with reports from previous simulations (Lemeshow et al., 1985). Both positive and negative biases occurred for the pocketing pattern, while biases were overwhelmingly positive for the circular gradient pattern. The variance of the estimator was also considerably lower in the latter case, thus supporting the claim that EPI samples have a tendency to include centrally located households (Luman et al., 2007). It does not come as a surprise that regional disparities in a population would make the EQW estimator under EPI more biased and more variable compared to its SRS counterpart.
However, we also found that design effects increased with sample size. Taking every third neighbour for the sample was effective in reducing the variance of the estimator. It made the greatest difference when n = 30, as evidenced by comparing the design effects for k = 1 and k = 3 (Tables A.8 and A.9). It should be noted though that when there was a high concentration of cases at the centre of the population area and only seven households were selected for the sample, skipping neighbours increased the bias.

Based on our results and those from past investigations, using the nearest neighbour sampling procedure may be a close substitute for random selection at the final stage of sampling, provided that the characteristic of interest is evenly distributed throughout the cluster. One can never be sure of the exact spatial pattern of the target variable prior to conducting the survey, so we suggest looking at how it is distributed in other similar populations, if that information is available.

In constructing the populations and simulating the EPI method, we did not capture all of the complexities of the real world such as multi-household dwellings, misjudgement in determining which household to go to next, or roads and barriers that would alter the survey path. However, we do not believe these details would have material consequences in a population where the characteristic of interest occurs randomly. It is harder to predict what kind of impact they would have when the characteristic of interest does not occur randomly. The spatial patterns that we generated are by no means a comprehensive account of all the possibilities, but they did help us to identify what conditions can lead to extreme bias or variance for the equally weighted estimator of the prevalence in a cluster. Finally, throughout our simulations we assumed that there was no non-response and that there were no data collection errors.
It is doubtful that a survey will be executed this perfectly in the field. We did not incorporate these human factors into our simulations as we saw them as separate issues and we wanted to assess EPI sampling and estimation in their purest form. Nevertheless, non-response and data collection errors are serious issues; provisions should be made to minimize them, and they should be kept in mind when interpreting data.

8.2 Future Directions

We concluded from our study that if EPI sampling is used, then the EQW estimator is preferable to the HT estimator given its lower variability and ease of calculation. That said, the EQW estimator still reached alarming levels of bias and variance when sampling was conducted in a community that had areas of high and low concentrations of the characteristic of interest. This reinforces the warning that actual surveys should not report estimates for individual clusters. Averaging results across several clusters should help to moderate the effect of any cluster associated with potential estimation issues. In Bennett et al. (1994), Katz et al. (1997) and Yoon et al. (1997), where EPI was analyzed in the context of a stratified sampling design for a real population, the magnitude of the bias rarely went beyond 2 percentage points and design effects hovered around 1. It would be informative to run simulations of the EPI method in which clusters are selected according to PPS (possibly from a hierarchy, such as when an enumeration area is sampled from a town that is sampled from a region of a country) and for populations with different rates of homogeneity.

We recommend checking the properties of the confidence intervals that are produced when the formula in Equation (3.15) is used to estimate the variance of the estimator for a population proportion. As far as we know, there have not been any studies on EPI to do this.
Research may be extended by looking at the estimates for means of continuous variables and relative frequencies of categorical variables with more than two outcomes. Another issue that needs to be addressed is how to convert sample size requirements into number of households when the target population is not proportional to the total households in a cluster or when the questions in a survey concern separate populations.

[Figure 8.1 image: two scatter plots of the same household locations (x and y each running from -500 to 500), one for Compact Segment Sampling (CSS) and one for Systematic Random Sampling (SyRS). Markers show whether the characteristic of interest is present (1) or not present (0) and whether each household is selected or not selected.]
Figure 8.1: Illustration of household selection using alternative sampling methods. Both plots depict the same population of N = 150 households. Household locations were randomly generated such that the x-coordinate takes a real number between -500 and 500 and similarly for the y-coordinate. The characteristic of interest is present in 75 households (p = 0.50) and is distributed in a pocketing pattern. To take a sample of 30 households, CSS involves partitioning the cluster into five segments, each containing 30 households, then randomly picking one of these segments as the sample, while SyRS involves randomly picking one of the first five households in the northwest corner of the cluster then moving in a serpentine manner and sampling every fifth household.

We restricted our literature review to papers about computer simulations of the EPI method, but there have also been a handful of studies that tested EPI in the field. These studies involved conducting a survey where households are sampled once using EPI then again using an alternative procedure to examine whether they would yield similar estimates.
Compact segment sampling (CSS) (Milligan et al., 2004) and systematic random sampling (SyRS) (Rose et al., 2006; Luman et al., 2007) are just two spatial sampling methods that EPI has been evaluated against. In CSS, the cluster is partitioned into segments containing roughly the same number of households, then one of these segments is picked at random and all households in the segment are interviewed, while in SyRS the interviewer selects households passed at regular intervals as they go through the cluster in a serpentine manner (Figure 8.1). A modification to EPI sampling has also been proposed where the first unit of the sample is taken as the closest household to random coordinates in the cluster, then the process of surveying a chain of nearest neighbours is carried out as usual (Grais et al., 2007).

A quick search of surveys which were performed in difficult field conditions reveals that there is a growing shift towards using Geographic Information Systems (GIS) and Global Positioning Systems (GPS) to facilitate the selection of households (Roberts et al., 2004; Vanden Eng et al., 2007; Siri et al., 2008; Galway et al., 2012; Shannon et al., 2012; Kondo et al., 2014). Although these sampling plans employ various techniques, the common idea behind them is to generate random spatial points and locate nearby households. It may be useful to do a simulation (perhaps using real census data) in which they are compared to EPI and even to SyRS and CSS.

Appendix A

Tables of Simulation Results

Table A.1: Variance of household inclusion probabilities and correlation between household inclusion probability and household distance from the centre of the population area.
For all household spatial distributions other than loc reg, results were averaged across 30 populations that were generated using the same random procedure, but with different seed numbers. All populations consisted of N = 150 households from which samples of size n = 7, 15, 30 were drawn.

Spatial   Sampling            Var(πi)                  Cor(πi, ri)
distrib.  method      n = 7   n = 15   n = 30    n = 7  n = 15  n = 30
                   (π̄=0.047) (π̄=0.100) (π̄=0.200)
loc reg   nosec k1    0.0001   0.0001   0.0014   -0.36   -0.65   -0.41
          api08 k1    0.0001   0.0002   0.0016   -0.44   -0.81   -0.56
          api32 k1    0.0001   0.0001   0.0018   -0.14    0.05   -0.29
          nosec k3    0.0002   0.0006   0.0045   -0.21   -0.42   -0.68
          api08 k3    0.0002   0.0007   0.0048   -0.32   -0.49   -0.67
          api32 k3    0.0001   0.0006   0.0046    0.03   -0.28   -0.67
loc sqr   nosec k1    0.0004   0.0026   0.0131   -0.11   -0.20   -0.26
          api08 k1    0.0005   0.0027   0.0130   -0.20   -0.26   -0.29
          api32 k1    0.0005   0.0028   0.0132   -0.19   -0.25   -0.28
          nosec k3    0.0008   0.0042   0.0116   -0.19   -0.24   -0.24
          api08 k3    0.0009   0.0042   0.0114   -0.23   -0.26   -0.24
          api32 k3    0.0009   0.0042   0.0114   -0.22   -0.25   -0.24
loc rec   nosec k1    0.0004   0.0025   0.0130   -0.08   -0.17   -0.27
          api08 k1    0.0009   0.0038   0.0165   -0.47   -0.48   -0.38
          api32 k1    0.0007   0.0034   0.0157   -0.42   -0.42   -0.36
          nosec k3    0.0008   0.0039   0.0094   -0.17   -0.29   -0.33
          api08 k3    0.0012   0.0045   0.0095   -0.38   -0.33   -0.35
          api32 k3    0.0010   0.0043   0.0093   -0.34   -0.32   -0.35
loc agg   nosec k1    0.0005   0.0026   0.0116   -0.07   -0.20   -0.39
          api08 k1    0.0010   0.0035   0.0133   -0.24   -0.36   -0.47
          api32 k1    0.0008   0.0032   0.0132   -0.28   -0.39   -0.48
          nosec k3    0.0009   0.0041   0.0112   -0.21   -0.38   -0.40
          api08 k3    0.0011   0.0043   0.0114   -0.33   -0.43   -0.42
          api32 k3    0.0010   0.0043   0.0111   -0.35   -0.44   -0.43
loc cgr   nosec k1    0.0006   0.0044   0.0315   -0.38   -0.57   -0.62
          api08 k1    0.0006   0.0045   0.0312   -0.38   -0.57   -0.63
          api32 k1    0.0006   0.0044   0.0313   -0.35   -0.56   -0.62
          nosec k3    0.0014   0.0084   0.0152   -0.54   -0.63   -0.72
          api08 k3    0.0014   0.0084   0.0152   -0.54   -0.63   -0.72
          api32 k3    0.0014   0.0084   0.0153   -0.53   -0.63   -0.72

πi denotes the probability of selecting a household; ri denotes the distance of a household from the centre of the population area; π̄ is the average probability of selection and is equal to the sampling fraction n/N. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households and the sampling method.

Table A.2: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

Bias(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg    -0.003  0.000 -0.000    0.001  0.000 -0.002    0.005  0.000 -0.006    0.005  0.000 -0.015    0.002  0.000 -0.036
loc sqr    -0.001  0.000 -0.002    0.005  0.000 -0.013    0.012  0.000 -0.029    0.008  0.000 -0.054    0.005  0.000 -0.098
loc rec    -0.001  0.000 -0.003    0.014  0.000 -0.012    0.023  0.000 -0.027    0.024  0.000 -0.052    0.007  0.000 -0.107
loc agg     0.008  0.000 -0.003    0.020  0.000 -0.013    0.023  0.000 -0.031    0.025  0.000 -0.061    0.012  0.000 -0.111
loc cgr     0.005  0.000 -0.003    0.011  0.000 -0.012    0.027  0.000 -0.030    0.036  0.000 -0.056    0.017  0.000 -0.108
Spatial distribution of target variable
val rdm     0.000  0.000 -0.001    0.001  0.000 -0.005   -0.004  0.000 -0.019    0.001  0.000 -0.044    0.002  0.000 -0.090
val spk     0.010  0.000 -0.002    0.017  0.000 -0.008    0.026  0.000 -0.027    0.023  0.000 -0.051    0.015  0.000 -0.094
val lpk     0.004  0.000 -0.004    0.006  0.000 -0.016    0.029  0.000 -0.032    0.021  0.000 -0.059    0.011  0.000 -0.099
val cgr     0.023  0.000 -0.001    0.043  0.000 -0.005    0.050  0.000 -0.015    0.046  0.000 -0.034    0.013  0.000 -0.086
val dgr    -0.011  0.000 -0.003    0.001  0.000 -0.012    0.004  0.000 -0.023    0.016  0.000 -0.047    0.005  0.000 -0.090
val hgr    -0.018  0.000 -0.004   -0.007  0.000 -0.018    0.004  0.000 -0.032    0.011  0.000 -0.051    0.005  0.000 -0.092
Sampling method
nosec k1    0.000  0.000 -0.002    0.006  0.000 -0.008    0.010  0.000 -0.018    0.012  0.000 -0.034    0.006  0.000 -0.062
api08 k1    0.002  0.000 -0.003    0.010  0.000 -0.013    0.017  0.000 -0.028    0.017  0.000 -0.052    0.006  0.000 -0.098
api32 k1    0.002  0.000 -0.003    0.008  0.000 -0.014    0.017  0.000 -0.029    0.017  0.000 -0.054    0.006 -0.000 -0.100
nosec k3    0.000  0.000 -0.001    0.010  0.000 -0.008    0.017  0.000 -0.023    0.020  0.000 -0.045    0.011  0.000 -0.088
api08 k3    0.003  0.000 -0.002    0.016  0.000 -0.010    0.024  0.000 -0.025    0.026  0.000 -0.051    0.012  0.000 -0.103
api32 k3    0.002  0.000 -0.002    0.013  0.000 -0.010    0.022  0.000 -0.025    0.025  0.000 -0.050    0.012  0.000 -0.100

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.3: Bias of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

Bias(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg     0.001 -0.000 -0.000    0.008  0.000 -0.000    0.006  0.000 -0.000    0.007  0.000 -0.002    0.004 -0.000 -0.012
loc sqr    -0.010  0.000 -0.001    0.000  0.000 -0.008   -0.005  0.000 -0.022   -0.008  0.000 -0.051    0.003  0.000 -0.094
loc rec     0.007  0.000  0.000    0.026  0.000 -0.001    0.024  0.000 -0.006    0.032  0.000 -0.015    0.007  0.000 -0.058
loc agg     0.009  0.000  0.000    0.021  0.000 -0.001    0.024  0.000 -0.009    0.023  0.000 -0.031    0.013  0.000 -0.091
loc cgr     0.009  0.000 -0.001    0.018  0.000 -0.006    0.039  0.000 -0.015    0.056  0.000 -0.035    0.026  0.000 -0.100
Spatial distribution of target variable
val rdm    -0.002  0.000 -0.000    0.002  0.000 -0.001   -0.009  0.000 -0.008   -0.000  0.000 -0.027    0.002  0.000 -0.073
val spk     0.006  0.000 -0.000    0.017  0.000 -0.002    0.014  0.000 -0.010    0.021  0.000 -0.028    0.024  0.000 -0.068
val lpk     0.008  0.000 -0.000    0.024  0.000 -0.002    0.033  0.000 -0.010    0.028  0.000 -0.031    0.017  0.000 -0.075
val cgr     0.046  0.000  0.000    0.078  0.000 -0.000    0.077  0.000 -0.001    0.061  0.000 -0.013    0.017  0.000 -0.065
val dgr    -0.016  0.000 -0.001   -0.002  0.000 -0.004    0.006  0.000 -0.012    0.021  0.000 -0.028    0.006  0.000 -0.071
val hgr    -0.024  0.000 -0.001   -0.029  0.000 -0.010   -0.015  0.000 -0.020    0.001  0.000 -0.034   -0.002  0.000 -0.074
Sampling method
nosec k1    0.002  0.000 -0.000    0.012  0.000 -0.004    0.018  0.000 -0.012    0.021  0.000 -0.030    0.012  0.000 -0.072
api08 k1    0.005  0.000 -0.000    0.017  0.000 -0.004    0.024  0.000 -0.014    0.027  0.000 -0.034    0.012  0.000 -0.087
api32 k1    0.004  0.000 -0.000    0.015  0.000 -0.004    0.022  0.000 -0.013    0.025  0.000 -0.033    0.012  0.000 -0.082
nosec k3    0.003  0.000 -0.000    0.015  0.000 -0.002    0.013  0.000 -0.007    0.019  0.000 -0.019    0.009  0.000 -0.057
api08 k3    0.003  0.000 -0.000    0.016  0.000 -0.002    0.014  0.000 -0.008    0.020  0.000 -0.022    0.009  0.000 -0.064
api32 k3    0.003  0.000 -0.001    0.016  0.000 -0.003    0.014  0.000 -0.009    0.020  0.000 -0.023    0.009  0.000 -0.064

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.4: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

Var(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg     0.027  0.031  0.031    0.073  0.079  0.075    0.081  0.091  0.084    0.065  0.080  0.068    0.021  0.042  0.026
loc sqr     0.029  0.041  0.036    0.075  0.112  0.084    0.088  0.143  0.094    0.063  0.143  0.072    0.021  0.129  0.038
loc rec     0.033  0.044  0.038    0.074  0.106  0.081    0.084  0.134  0.089    0.062  0.141  0.072    0.018  0.158  0.044
loc agg     0.036  0.042  0.035    0.083  0.108  0.082    0.094  0.147  0.093    0.070  0.163  0.074    0.019  0.169  0.038
loc cgr     0.034  0.043  0.035    0.066  0.097  0.073    0.078  0.135  0.085    0.054  0.138  0.066    0.014  0.148  0.037
Spatial distribution of target variable
val rdm     0.011  0.019  0.018    0.029  0.056  0.045    0.034  0.090  0.058    0.030  0.114  0.053    0.012  0.126  0.033
val spk     0.036  0.040  0.036    0.078  0.098  0.080    0.099  0.144  0.098    0.075  0.149  0.081    0.026  0.136  0.041
val lpk     0.051  0.056  0.048    0.117  0.144  0.112    0.135  0.176  0.121    0.101  0.179  0.098    0.034  0.149  0.048
val cgr     0.041  0.031  0.029    0.076  0.076  0.067    0.075  0.096  0.072    0.050  0.096  0.055    0.012  0.116  0.031
val dgr     0.026  0.046  0.039    0.067  0.102  0.079    0.071  0.122  0.084    0.057  0.126  0.066    0.014  0.122  0.034
val hgr     0.024  0.050  0.041    0.079  0.128  0.092    0.097  0.153  0.101    0.063  0.134  0.070    0.014  0.126  0.033
Sampling method
nosec k1    0.041  0.045  0.041    0.091  0.106  0.091    0.104  0.127  0.101    0.077  0.114  0.077    0.023  0.077  0.030
api08 k1    0.041  0.048  0.040    0.090  0.116  0.088    0.101  0.145  0.098    0.075  0.147  0.078    0.023  0.144  0.040
api32 k1    0.041  0.048  0.041    0.090  0.118  0.089    0.102  0.148  0.098    0.075  0.152  0.077    0.023  0.152  0.039
nosec k3    0.022  0.033  0.029    0.059  0.085  0.069    0.068  0.115  0.079    0.051  0.118  0.063    0.014  0.111  0.033
api08 k3    0.023  0.034  0.029    0.058  0.088  0.069    0.068  0.121  0.079    0.049  0.131  0.064    0.014  0.144  0.040
api32 k3    0.022  0.035  0.029    0.058  0.091  0.069    0.068  0.125  0.078    0.050  0.135  0.063    0.014  0.147  0.037

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.5: Variance of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

Var(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg     0.006  0.007  0.007    0.019  0.022  0.022    0.023  0.027  0.027    0.016  0.022  0.021    0.005  0.012  0.008
loc sqr     0.006  0.021  0.019    0.020  0.058  0.041    0.024  0.087  0.051    0.016  0.110  0.040    0.004  0.125  0.022
loc rec     0.008  0.012  0.012    0.022  0.031  0.030    0.026  0.047  0.040    0.018  0.047  0.033    0.004  0.060  0.028
loc agg     0.009  0.012  0.012    0.024  0.038  0.036    0.028  0.062  0.047    0.021  0.081  0.042    0.004  0.112  0.026
loc cgr     0.008  0.020  0.017    0.015  0.045  0.034    0.015  0.067  0.041    0.010  0.082  0.038    0.002  0.123  0.029
Spatial distribution of target variable
val rdm     0.002  0.005  0.005    0.006  0.020  0.019    0.007  0.045  0.033    0.006  0.065  0.031    0.002  0.089  0.022
val spk     0.006  0.011  0.011    0.014  0.030  0.027    0.024  0.058  0.043    0.017  0.072  0.038    0.005  0.082  0.022
val lpk     0.014  0.018  0.018    0.033  0.047  0.043    0.036  0.067  0.051    0.026  0.081  0.042    0.007  0.094  0.025
val cgr     0.010  0.005  0.005    0.019  0.015  0.014    0.016  0.021  0.019    0.011  0.036  0.023    0.002  0.076  0.021
val dgr     0.007  0.022  0.020    0.021  0.048  0.040    0.023  0.063  0.044    0.019  0.072  0.036    0.003  0.085  0.022
val hgr     0.006  0.025  0.022    0.026  0.074  0.053    0.033  0.093  0.057    0.020  0.085  0.040    0.004  0.094  0.023
Sampling method
nosec k1    0.012  0.019  0.018    0.033  0.052  0.045    0.038  0.073  0.055    0.026  0.079  0.045    0.006  0.084  0.026
api08 k1    0.012  0.019  0.018    0.033  0.053  0.045    0.037  0.076  0.054    0.025  0.087  0.044    0.006  0.111  0.030
api32 k1    0.012  0.019  0.018    0.033  0.053  0.045    0.038  0.076  0.055    0.026  0.086  0.044    0.006  0.107  0.029
nosec k3    0.002  0.010  0.009    0.007  0.024  0.020    0.009  0.038  0.028    0.007  0.048  0.025    0.002  0.063  0.017
api08 k3    0.002  0.010  0.009    0.007  0.025  0.020    0.009  0.042  0.028    0.007  0.055  0.026    0.002  0.076  0.017
api32 k3    0.002  0.011  0.009    0.007  0.026  0.020    0.009  0.043  0.028    0.007  0.057  0.025    0.002  0.080  0.017

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.6: Mean square error (MSE) of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

MSE(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg     0.027  0.031  0.031    0.073  0.079  0.076    0.082  0.091  0.084    0.065  0.080  0.068    0.021  0.042  0.028
loc sqr     0.029  0.041  0.036    0.076  0.112  0.084    0.090  0.143  0.095    0.063  0.143  0.075    0.021  0.129  0.048
loc rec     0.033  0.044  0.038    0.076  0.106  0.081    0.086  0.134  0.090    0.064  0.141  0.075    0.019  0.158  0.057
loc agg     0.036  0.042  0.035    0.084  0.108  0.082    0.096  0.147  0.094    0.072  0.163  0.078    0.019  0.169  0.050
loc cgr     0.035  0.043  0.035    0.068  0.097  0.073    0.081  0.135  0.086    0.057  0.138  0.069    0.015  0.148  0.049
Spatial distribution of target variable
val rdm     0.012  0.019  0.018    0.030  0.056  0.045    0.035  0.090  0.058    0.031  0.114  0.055    0.012  0.126  0.043
val spk     0.037  0.040  0.036    0.079  0.098  0.080    0.101  0.144  0.099    0.077  0.149  0.084    0.027  0.136  0.051
val lpk     0.052  0.056  0.048    0.119  0.144  0.112    0.137  0.176  0.122    0.104  0.179  0.103    0.035  0.149  0.059
val cgr     0.042  0.031  0.029    0.079  0.076  0.067    0.078  0.096  0.073    0.053  0.096  0.056    0.012  0.116  0.039
val dgr     0.026  0.046  0.039    0.068  0.102  0.079    0.072  0.122  0.085    0.058  0.126  0.069    0.015  0.122  0.043
val hgr     0.025  0.050  0.041    0.080  0.128  0.093    0.098  0.153  0.102    0.063  0.134  0.073    0.014  0.126  0.043
Sampling method
nosec k1    0.041  0.045  0.041    0.092  0.106  0.091    0.105  0.127  0.101    0.078  0.114  0.079    0.023  0.077  0.035
api08 k1    0.041  0.048  0.041    0.091  0.116  0.089    0.103  0.145  0.100    0.076  0.147  0.081    0.023  0.144  0.052
api32 k1    0.041  0.048  0.041    0.091  0.118  0.089    0.103  0.148  0.099    0.076  0.152  0.080    0.024  0.152  0.050
nosec k3    0.023  0.033  0.029    0.060  0.085  0.070    0.070  0.115  0.080    0.052  0.118  0.066    0.015  0.111  0.041
api08 k3    0.023  0.034  0.029    0.060  0.088  0.069    0.070  0.121  0.080    0.051  0.131  0.067    0.015  0.144  0.051
api32 k3    0.023  0.035  0.029    0.060  0.091  0.069    0.071  0.125  0.079    0.052  0.135  0.066    0.015  0.147  0.049

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.7: Mean square error (MSE) of the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

MSE(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg     0.007  0.007  0.007    0.020  0.022  0.022    0.023  0.027  0.027    0.017  0.022  0.021    0.005  0.012  0.009
loc sqr     0.008  0.021  0.019    0.025  0.058  0.042    0.029  0.087  0.052    0.018  0.110  0.043    0.005  0.125  0.031
loc rec     0.009  0.012  0.012    0.024  0.031  0.030    0.028  0.047  0.040    0.020  0.047  0.034    0.004  0.060  0.032
loc agg     0.009  0.012  0.012    0.027  0.038  0.036    0.031  0.062  0.048    0.023  0.081  0.044    0.005  0.112  0.035
loc cgr     0.010  0.020  0.017    0.019  0.045  0.034    0.021  0.067  0.042    0.016  0.082  0.039    0.004  0.123  0.040
Spatial distribution of target variable
val rdm     0.002  0.005  0.005    0.007  0.020  0.019    0.007  0.045  0.033    0.006  0.065  0.032    0.002  0.089  0.029
val spk     0.007  0.011  0.011    0.016  0.030  0.027    0.026  0.058  0.043    0.019  0.072  0.039    0.007  0.082  0.029
val lpk     0.015  0.018  0.018    0.037  0.047  0.043    0.042  0.067  0.052    0.030  0.081  0.044    0.008  0.094  0.032
val cgr     0.013  0.005  0.005    0.027  0.015  0.014    0.024  0.021  0.019    0.016  0.036  0.023    0.003  0.076  0.027
val dgr     0.007  0.022  0.020    0.022  0.048  0.040    0.025  0.063  0.045    0.021  0.072  0.037    0.004  0.085  0.028
val hgr     0.007  0.025  0.022    0.029  0.074  0.054    0.036  0.093  0.058    0.021  0.085  0.042    0.004  0.094  0.031
Sampling method
nosec k1    0.013  0.019  0.018    0.036  0.052  0.045    0.041  0.073  0.055    0.028  0.079  0.046    0.006  0.084  0.033
api08 k1    0.014  0.019  0.018    0.036  0.053  0.045    0.041  0.076  0.055    0.028  0.087  0.046    0.006  0.111  0.039
api32 k1    0.014  0.019  0.018    0.036  0.053  0.045    0.041  0.076  0.055    0.028  0.086  0.046    0.006  0.107  0.037
nosec k3    0.003  0.010  0.009    0.010  0.024  0.020    0.012  0.038  0.028    0.009  0.048  0.026    0.003  0.063  0.021
api08 k3    0.003  0.010  0.009    0.010  0.025  0.020    0.012  0.042  0.029    0.009  0.055  0.027    0.003  0.076  0.023
api32 k3    0.003  0.011  0.009    0.010  0.026  0.020    0.012  0.043  0.028    0.009  0.057  0.026    0.003  0.080  0.023

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.8: Design effect (DE) of EPI sampling relative to SRS for the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 7 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

DE(p̂); n = 7
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg      2.22   2.55   2.51     2.54   2.76   2.62     2.38   2.65   2.45     2.25   2.78   2.36     1.72   3.38   2.11
loc sqr      2.32   3.34   2.88     2.61   3.88   2.92     2.58   4.18   2.74     2.17   4.96   2.50     1.72  10.49   3.08
loc rec      2.66   3.60   3.07     2.58   3.67   2.80     2.46   3.92   2.61     2.17   4.90   2.50     1.48  12.77   3.57
loc agg      2.92   3.41   2.86     2.89   3.75   2.85     2.74   4.28   2.71     2.43   5.65   2.58     1.53  13.70   3.05
loc cgr      2.73   3.48   2.86     2.28   3.39   2.54     2.26   3.95   2.47     1.87   4.81   2.29     1.15  12.00   3.01
Spatial distribution of target variable
val rdm      0.93   1.54   1.42     1.02   1.93   1.57     1.00   2.64   1.68     1.06   3.98   1.83     0.95  10.20   2.67
val spk      2.95   3.23   2.90     2.70   3.39   2.79     2.88   4.20   2.87     2.62   5.16   2.82     2.12  10.98   3.34
val lpk      4.16   4.56   3.88     4.08   4.99   3.87     3.93   5.13   3.52     3.52   6.23   3.41     2.77  12.08   3.88
val cgr      3.33   2.49   2.38     2.63   2.63   2.32     2.18   2.80   2.11     1.73   3.32   1.90     0.98   9.38   2.49
val dgr      2.10   3.77   3.17     2.33   3.55   2.73     2.07   3.56   2.46     1.97   4.37   2.29     1.16   9.92   2.72
val hgr      1.95   4.09   3.28     2.73   4.44   3.21     2.84   4.45   2.94     2.18   4.65   2.42     1.15  10.25   2.69
Sampling method
nosec k1     3.30   3.64   3.33     3.17   3.67   3.15     3.03   3.70   2.94     2.68   3.98   2.69     1.89   6.25   2.47
api08 k1     3.31   3.89   3.28     3.11   4.01   3.07     2.96   4.24   2.87     2.61   5.10   2.71     1.87  11.70   3.27
api32 k1     3.33   3.90   3.29     3.11   4.09   3.09     2.97   4.30   2.86     2.60   5.27   2.67     1.88  12.29   3.14
nosec k3     1.82   2.66   2.38     2.04   2.96   2.41     1.99   3.36   2.31     1.76   4.11   2.20     1.17   8.96   2.66
api08 k3     1.83   2.76   2.37     2.03   3.05   2.38     1.97   3.53   2.31     1.70   4.57   2.22     1.16  11.70   3.22
api32 k3     1.82   2.82   2.38     2.03   3.15   2.40     1.98   3.64   2.28     1.72   4.69   2.18     1.16  11.90   3.03

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Table A.9: Design effect (DE) of EPI sampling relative to SRS for the equally weighted (EQW) estimator, the Horvitz-Thompson (HT) estimator and the restricted Horvitz-Thompson (HTR) estimator when sample size is n = 30 for the simulation study in Chapter 7. Results for each factor (spatial distribution of households, spatial distribution of target variable, and sampling method) are averaged across the other two factors.

DE(p̂); n = 30
                 p = 0.10               p = 0.30               p = 0.50               p = 0.70               p = 0.90
              EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR      EQW     HT    HTR
Spatial distribution of households
loc reg      2.65   3.06   3.06     3.38   3.93   3.93     3.42   4.09   4.05     2.88   3.95   3.70     2.16   5.14   3.50
loc sqr      2.42   8.60   8.05     3.51  10.35   7.36     3.57  13.01   7.58     2.84  19.60   7.17     1.74  51.93   8.94
loc rec      3.25   4.77   4.77     3.86   5.54   5.25     3.86   7.05   5.97     3.15   8.41   5.93     1.59  25.03  11.57
loc agg      3.64   4.88   4.88     4.28   6.69   6.40     4.18   9.22   7.03     3.64  14.29   7.43     1.71  46.49  10.80
loc cgr      3.11   8.43   7.06     2.63   8.04   6.03     2.23   9.92   6.14     1.83  14.58   6.67     1.02  50.94  12.07
Spatial distribution of target variable
val rdm      0.87   2.26   2.18     1.08   3.54   3.31     1.03   6.77   4.86     0.98  11.47   5.47     0.88  37.01   9.25
val spk      2.40   4.75   4.66     2.57   5.38   4.73     3.52   8.70   6.36     2.95  12.82   6.66     2.20  34.11   9.30
val lpk      5.66   7.41   7.31     5.88   8.29   7.61     5.37  10.04   7.67     4.54  14.39   7.45     2.94  38.78  10.28
val cgr      4.06   1.92   1.92     3.38   2.58   2.54     2.42   3.16   2.87     1.87   6.44   4.06     0.97  31.62   8.72
val dgr      2.77   8.91   8.27     3.73   8.60   7.14     3.42   9.46   6.63     3.37  12.79   6.34     1.39  35.18   9.01
val hgr      2.33  10.45   9.03     4.56  13.09   9.45     4.95  13.82   8.55     3.48  15.09   7.12     1.46  38.72   9.70
Sampling method
nosec k1     5.01   7.78   7.47     5.86   9.31   8.05     5.61  10.87   8.21     4.62  13.95   7.92     2.37  34.79  10.84
api08 k1     5.08   7.76   7.37     5.85   9.38   7.96     5.57  11.37   8.10     4.50  15.42   7.82     2.33  45.98  12.49
api32 k1     5.06   7.77   7.39     5.87   9.46   8.04     5.64  11.33   8.15     4.56  15.26   7.84     2.36  44.08  11.91
nosec k3     0.99   3.94   3.69     1.22   4.23   3.53     1.31   5.69   4.13     1.18   8.56   4.48     0.94  25.98   6.86
api08 k3     0.98   4.10   3.70     1.21   4.43   3.60     1.31   6.21   4.22     1.17   9.68   4.55     0.93  31.57   7.10
api32 k3     0.97   4.36   3.77     1.19   4.67   3.57     1.28   6.48   4.13     1.16  10.13   4.48     0.93  33.03   7.06

p denotes the proportion of households in the population with the characteristic of interest. See pages viii-x for the meaning of abbreviations used for the spatial distribution of households, the spatial distribution of the target variable and the sampling method.

Appendix B

Partial R Code

B.1 Packages Used

• dplyr (Wickham and Francois, 2015)
• ggplot2 (Wickham, 2009)
• Matrix (Bates and Maechler, 2015)
• mefa (Sólymos, 2009)
• parallel (R Core Team, 2015)
• plyr (Wickham, 2011)
• reshape2 (Wickham, 2007)

B.2 General Parameters

alpha_0     The angle span of a sector (measured in radians); a value in [0, 2π] or NULL. If the latter, the first household is randomly chosen from the entire population.
dmat        A matrix of distances between each pair of households in the population.
gam         A vector of angular coordinates corresponding to the households in the population.
k           Every kth neighbour along an EPI path is added to the sample.
loc         A data frame of household locations. It may also include a column of values for the response variable associated with each household, if this is required by the function.
n           Number of households to be sampled from the population.
pmat        A matrix of inclusion probabilities for individual households and pairs of households.
R Number of samples to be generated. samp A data frame where each row represents a sample. Include a column for the probability of a sample labeled prob, otherwise, samples will be assigned a probability of R1 where R is the total number of rows in the data frame. samp frame A vector that lists all the sampling units. theta d A sector direction (measured in radians); a value in [0, 2π). u1 Identification number of the first selected household. 117 M.Sc. Thesis - Maria Reyes B.3 B.3.1 McMaster - Mathematics & Statistics Main Functions Simulation of EPI Sampling Households in a Random Sector Generates a sample of households by identifying all the households in a randomly selected geographic sector of the population. epi_samp_sec <- function(loc, gam, alpha_0, theta_d="random"){ if(theta_d=="random"){theta_d <- rdm_direc()} if(alpha_0==0){ac_code <- 0} else if(alpha_0>0 & alpha_0<(2*pi)){ac_code <- NULL} else if(alpha_0==(2*pi)){ac_code <- 1} sec <- cbind(theta_d, sec_angles(alpha_0,theta_d)) hh <- in_sector(loc, gam, a1=sec[,"a1"], a2=sec[,"a2"], ambig_case=ac_code) return(list(sec=sec, hh=hh)) } Sample of Nearest Neighbours Finds the nearest neighbour of the current household and adds it to the sample until the sample reaches the required size. When a household has multiple nearest neighbours, a random selection is made among the nearest neighbours. epi_samp_hh <- function(loc, dmat, n, u1){ # Setup new_hh <- u1 selected <- c(new_hh, rep(NA,n-1)) # Construct path of nearest neighbours 118 M.Sc. 
Thesis - Maria Reyes McMaster - Mathematics & Statistics if(n>=2){ for(i in 2:n){ current_hh <- new_hh dmat[,current_hh] <- NA current_row <- dmat[current_hh,] min_dist <- min(current_row, na.rm=TRUE) closest_hh <- names(current_row)[(which(current_row==min_dist))] new_hh <- sample(closest_hh,1) selected[i] <- new_hh } } # Return locations of selected household loc[selected,] } Generation of an EPI Sample Simulates the EPI sampling procedure for selecting households in a single cluster. sim_epi_v0 <- function(loc, gam, dmat, n, k, R, alpha_0=NULL, theta_d="random", skipped.rm=TRUE){ # Setup N <- nrow(loc) ui_seq <- seq(from=1, by=k, length.out=n) n_exp <- ui_seq[n] n_lim <- tail(ui_seq[which(ui_seq<=N)],1) padding <- n_exp-n_lim if(!is.null(alpha_0)){ # Helper function to generate a single EPI sample. # First household is selected from a random sector. 119 M.Sc. Thesis - Maria Reyes McMaster - Mathematics & Statistics sec_epi_samp <- function(loc, gam, dmat, n_lim, n_exp, alpha_0, theta_d, padding){ s1 <- epi_samp_sec(loc, gam, alpha_0, theta_d) if(nrow(s1[["hh"]])>0){ u1 <- rdm_hh(s1[["hh"]]) s2 <- epi_samp_hh(loc, dmat, n_lim, u1) samp_hh <- c(as.numeric(rownames(s2)), rep(NA,padding)) } else{ samp_hh <- rep(NA,n_exp) } c(s1[["sec"]][,"theta_d"], s1[["sec"]][,"a1"], s1[["sec"]][,"a2"], nrow(s1[["hh"]]), samp_hh) } output <t(replicate(R, sec_epi_samp(loc, gam, dmat, n_lim, n_exp, alpha_0, theta_d, padding))) } else{ # is.null(alpha_0) # Helper function to generate a single EPI sample. # First household is randomly selected from the entire # population. nosec_epi_samp <- function(loc, dmat, N, n_lim, padding){ u1 <- sample(1:N,1) s2 <- epi_samp_hh(loc, dmat, n_lim, u1) c(rep(NA,4),as.numeric(rownames(s2)),rep(NA,padding)) } output <- t(replicate(R, nosec_epi_samp(loc, dmat, N, n_lim, padding))) } 120 M.Sc. 
Thesis - Maria Reyes McMaster - Mathematics & Statistics if(skipped.rm==TRUE){ cols <- c(1:4,(4+ui_seq)) output <- matrix(output[,cols], nrow=R) colnames(output) <- c("theta_d", "a1","a2", "sec_pop", paste0('u', 1:n)) } else{ # skipped.rm==FALSE colnames(output) <- c("theta_d", "a1","a2", "sec_pop", paste0('u', 1:n_exp)) } data.frame(output) } sim_epi <- function(loc, dmat, n, R, alpha_0=NULL, theta_d="random", k=1, skipped.rm=TRUE, na.rm=TRUE, maxit=1e+6){ # Setup gam <- hh_direc(loc) # Initial simulation results output <- sim_epi_v0(loc,gam,dmat,n,k,R,alpha_0,theta_d,skipped.rm) if(!is.null(alpha_0)){ clean_output <- output[output[,"sec_pop"]>0,]} else{clean_output <- output; na.rm=FALSE} out_samp <- nrow(clean_output) tot_samp <- out_samp tot_iter <- R # Achieve the required number of samples 121 M.Sc. Thesis - Maria Reyes McMaster - Mathematics & Statistics if(na.rm==TRUE){ req_samp <- R while(tot_samp<req_samp & tot_iter<maxit){ R <- min((req_samp-tot_samp),(maxit-tot_iter)) new_output <- sim_epi_v0(loc,gam,dmat,n,k,R,alpha_0,theta_d, skipped.rm) clean_new_output <- new_output[new_output[,"sec_pop"]>0,] new_samp <- nrow(clean_new_output) clean_output <- rbind(clean_output, clean_new_output) tot_samp <- tot_samp + new_samp tot_iter <- tot_iter + R } if(tot_samp>0){rownames(clean_output) <- NULL} output <- clean_output out_samp <- nrow(output) } sim_info <- cbind(out_samp,tot_samp,tot_iter) return(list(samples=output, details=sim_info)) } B.3.2 Computation of Inclusion Probabilities Probability of First Sampled Unit Returns the exact probability that a household is selected as the first unit in a sample when the first unit is selected from a random sector. Probabilities are computed for every household in the population. 122 M.Sc. 
epi_u1_prob_exact <- function(loc, alpha_0, rescale=TRUE){
  # Setup
  N <- nrow(loc)
  gam <- hh_direc(loc)
  gam_ord <- sort(unique(gam))
  L_Theta <- 2*pi

  # probability density function of theta
  if(rescale==TRUE){
    delta <- c(diff(gam_ord),
               gam_ord[1]+2*pi-gam_ord[length(gam_ord)]) - alpha_0
    delta_pos <- which(delta>0)
    L_Theta <- 2*pi-sum(delta[delta_pos])
  }

  # Values of theta when the households in the resulting sector
  # change
  sec_direc <- c(gam_ord-alpha_0/2,gam_ord+alpha_0/2)
  sec_direc <- sapply(1:length(sec_direc), function(i) {
    current_val <- sec_direc[i]
    if(current_val<0){
      current_val <- current_val+2*pi
    } else if(current_val>=2*pi){
      current_val <- current_val-2*pi
    }
    # ...do nothing if sec_direc is in [0,2*pi)
    current_val
  })
  sec_direc_ord <- sort(sec_direc)

  # Number of households in each sector specified by sec_direc_ord
  sec_theta0 <- in_sector(loc,gam,alpha_0,sec_direc_ord[1])
  sec_theta0_hh <- sec_theta0[,"hh_id"]
  n_theta0 <- length(sec_theta0_hh)
  sec <- sec_angles(alpha_0=alpha_0,theta_d=sec_direc_ord[1])
  tol <- .Machine$double.eps^0.5
  excl <- which(abs(gam-sec[,"a2"])<=tol)
  incl <- which(abs(gam-sec[,"a1"])<=tol)
  if(n_theta0==0){
    n_theta0=length(incl)
  } else{
    if(length(excl)!=0){
      if(any(excl %in% sec_theta0_hh)){
        n_theta0=n_theta0-sum(excl %in% sec_theta0_hh)
      }
    }
    if(length(incl)!=0){
      if(!all(incl %in% sec_theta0_hh)){
        n_theta0=n_theta0+sum(!(incl %in% sec_theta0_hh))
      }
    }
  }
  n_change <- sapply(1:(length(sec_direc_ord)-1), function(i){
    sec <- sec_angles(alpha_0,sec_direc_ord[i])
    sum(abs(gam-sec[,"a2"])<=tol)-sum(abs(gam-sec[,"a1"])<=tol)
  })
  sec_pop <- cumsum(c(n_theta0,n_change))

  # Data frame of theta values and number of households in the
  # corresponding sector
  sec_hh_dat <- data.frame(sec_direc_ord,sec_pop)

  # Conditional probability of the first sampled unit given a
  # sector
  u1_cond_sec <- sapply(1:N, function(i){
    sec <- sec_angles(alpha_0,gam[i])
    a1 <- sec[,"a1"]; a2 <- sec[,"a2"]
    if(a1<a2){
      sec_subset <- sec_hh_dat[which(((sec_direc_ord>a1)|(
        abs(sec_direc_ord-a1)<=tol)) & ((sec_direc_ord<a2)|(
        abs(sec_direc_ord-a2)<=tol))),]
    } else{
      # Since alpha_0>0, a1 and a2 will never be equal.
      # Thus, if a1<a2 is FALSE then a1>a2 is TRUE
      sec_subset1 <- sec_hh_dat[which((sec_direc_ord>a1)|(
        abs(sec_direc_ord-a1)<=tol)),]
      sec_subset2 <- sec_hh_dat[which((sec_direc_ord<a2)|(
        abs(sec_direc_ord-a2)<=tol)),]
      sec_subset1[,"sec_direc_ord"] <-
        sec_subset1[,"sec_direc_ord"]-a1
      sec_subset2[,"sec_direc_ord"] <-
        sec_subset2[,"sec_direc_ord"]+(2*pi-a1)
      sec_subset <- rbind(sec_subset1,sec_subset2)
    }
    sum(diff(sec_subset[,"sec_direc_ord"]) /
          sec_subset[2:nrow(sec_subset),"sec_pop"])
  })

  # Output
  data.frame(u1=1:N, prob=u1_cond_sec/L_Theta)
}

Path Probability Conditional on the Starting Household

Returns all possible EPI paths for a fixed sample size and the conditional probability of each of these paths given the first household in the sample.
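The conditional path probability described above can be summarised, under our reading of the code that follows, as a product of tie-breaking probabilities. The symbols C_j and d(·,·) are our own notation: d is the distance stored in dmat, and C_j is the set of not-yet-visited households tied for nearest neighbour of the (j−1)-th sampled household. The uniform split among tied neighbours is inferred from the factor 1/length(closest_hh).

```latex
\[
P(u_2,\dots,u_n \mid u_1) \;=\; \prod_{j=2}^{n} \frac{1}{|C_j|},
\qquad
C_j = \Bigl\{\, h \notin \{u_1,\dots,u_{j-1}\} :
      d(u_{j-1},h) = \min_{h' \notin \{u_1,\dots,u_{j-1}\}} d(u_{j-1},h') \Bigr\}.
\]
% No ties anywhere along the path: every |C_j| = 1, so the probability is 1.
% Exactly one two-way tie along the path: the probability is 1/2.
\[
P(u_1,\dots,u_n) \;=\; P(u_1)\,P(u_2,\dots,u_n \mid u_1).
\]
```

When all pairwise distances are distinct, every path is fully determined by its starting household, so the path probability equals the first-unit probability.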
epi_path_cond_prob <- function(loc, dmat, n){
  # Setup
  samp_frame <- as.numeric(rownames(loc))
  N <- length(samp_frame)
  path_units <- data.frame(u1=samp_frame)
  path_probs <- rep(1,N)

  # Helper function to add a household to each of the given paths
  # and update the path probability
  get_nn_prob <- function(i,old_paths,old_probs,dmat){
    current_path <- old_paths[i,]
    current_prob <- old_probs[i]
    current_hh <- tail(current_path,1)
    dmat_subset <- dmat[current_hh,]
    dmat_subset[current_path] <- NA
    min_dist <- min(dmat_subset, na.rm=TRUE)
    closest_hh <- as.numeric(names(which(dmat_subset==min_dist)))
    new_paths <- matrix(rep(current_path, length(closest_hh)),
                        ncol=length(current_path), byrow=TRUE)
    new_paths <- cbind(new_paths, closest_hh)
    new_probs <- current_prob * (1/length(closest_hh))
    cbind(new_paths, new_probs)
  }

  # Enumerate paths
  if(n>=2){
    for(j in 2:n){
      old_paths <- path_units
      old_probs <- path_probs
      temp_results <- do.call("rbind",
        llply(1:nrow(old_paths), function(i)
          get_nn_prob(i,old_paths,old_probs,dmat)))
      path_units <- temp_results[,1:j]
      path_probs <- temp_results[,j+1]
    }
  }

  # Format results
  output <- data.frame(path_units, path_probs)
  colnames(output) <- c(paste0('u', 1:n),"path_cond_u1")

  # Return results
  output
}

Path Probability

Returns all possible EPI paths for a fixed sample size and the probability of each of these paths.

epi_path_prob <- function(loc, dmat, n, alpha_0=NULL, D=NULL, k=1,
                          skipped.rm=TRUE, rescale=TRUE,
                          zeros.rm=FALSE){
  # Setup
  samp_frame <- as.numeric(rownames(loc))
  N <- length(samp_frame)
  ui_seq <- seq(from=1, by=k, length.out=n)
  n_exp <- ui_seq[n]
  n_lim <- tail(ui_seq[which(ui_seq<=N)],1)
  padding <- n_exp-n_lim
  q <- n_exp+(1:3) # Probability columns

  # Probability of selecting first household
  u1prob <- epi_u1_prob(loc, alpha_0, D, rescale, zeros.rm)

  # Probability of path
  if(n_lim==1){
    output <- data.frame(u1=u1prob[,"u1"],
                         u1_uncond=u1prob[,"prob"],
                         path_cond_u1=rep(1,nrow(u1prob)),
                         prob=u1prob[,"prob"])
  } else{
    pathcp <- epi_path_cond_prob(loc, dmat, n_lim)
    path_info <- function(i, u1prob, pathcp, n_lim, padding){
      u1 <- u1prob[i,"u1"]
      paths <- pathcp[which(pathcp[,"u1"]==u1),]
      u1_uncond <- rep(u1prob[i,"prob"],nrow(paths))
      path_cond_u1 <- paths[,"path_cond_u1"]
      data.frame(paths[,1:n_lim], matrix(nrow=1,ncol=padding),
                 u1_uncond, path_cond_u1,
                 prob=u1_uncond*path_cond_u1)
    }
    output <- ldply(1:nrow(u1prob), function(i)
      path_info(i, u1prob, pathcp, n_lim, padding))
  }
  if(skipped.rm==TRUE){
    cols <- c(ui_seq,q)
    output <- output[,cols]
    colnames(output) <- c(paste0('u', 1:n),
                          "u1_uncond","path_cond_u1","prob")
  } else{ # skipped.rm==FALSE
    colnames(output) <- c(paste0('u', 1:n_exp),
                          "u1_uncond","path_cond_u1","prob")
  }
  output
}

Matrix of Inclusion Probabilities

Returns a matrix of inclusion probabilities for individual households and pairs of households. Adapted from a post on Stack Overflow. [1]

incl_prob <- function(samp_frame, samp){
  N <- length(samp_frame)
  prob <- NULL
  if("prob" %in% colnames(samp)){prob <- samp[,"prob"]}
  samp <- samp_units(samp)
  R <- nrow(samp)
  is_present <- function(i, samp_frame, samp){
    # Retrieved from http://stackoverflow.com/a/27178390/3808364
    apply(samp, 1, function(row)
      as.integer(any(row==samp_frame[i])))
  }
  ncores <- detectCores()
  clus <- makeCluster(ncores)
  clusterExport(clus, c("is_present","N","samp_frame","samp"),
                envir=environment())
  m <- Matrix(parSapply(clus, 1:N, function(h)
    is_present(h,samp_frame,samp)), sparse=TRUE)
  stopCluster(clus)
  if(is.null(prob)){
    # Estimated inclusion probabilities
    output <- (1/R) * Matrix::t(m) %*% m
  } else{
    # Exact or approximated inclusion probabilities
    row_weight <- Matrix(Diagonal(R,prob),sparse=TRUE)
    output <- Matrix::t(m) %*% row_weight %*% m
  }
  output
}

[1] r - creating a matrix of bivariate frequencies. (2014). Retrieved from http://stackoverflow.com/a/27178390/3808364

B.3.3 Estimation of Population Proportion

Distribution of Estimator for Population Proportion

Returns the distribution and properties of an estimator (expected value, bias, variance, and mean square error). The Horvitz-Thompson estimator and the restricted Horvitz-Thompson estimator are computed in addition to the usual proportion estimator when a matrix of inclusion probabilities is provided.

phat_distrib <- function(loc, samp, pmat=NULL){
  # Setup
  z <- as.numeric(as.character(loc[,"z"]))
  ppop <- mean(z)
  N <- nrow(loc)
  samp_hh <- as.matrix(samp_units(samp))
  if("prob" %in% colnames(samp)){
    samp_prob <- samp[,"prob"]
  } else{
    samp_prob <- rep(NA,nrow(samp))
  }

  # phat_eqw
  ncores <- detectCores()
  clus <- makeCluster(ncores)
  clusterExport(clus, c("samp_hh","z","N"),
                envir=environment())
  phat_eqw <- parSapply(clus, 1:nrow(samp_hh), function(i)
    mean(z[samp_hh[i,]],na.rm=TRUE))
  stopCluster(clus)

  # phat_ht, phat_ht_res
  if(!is.null(pmat)){
    hh_prob <- get_prob(pmat, prob_type="indv",
                        zeros.rm=FALSE)[,"prob"]
    wtval <- z/hh_prob
    ncores <- detectCores()
    clus <- makeCluster(ncores)
    clusterExport(clus, c("samp_hh","wtval","N"),
                  envir=environment())
    phat_ht <- parSapply(clus, 1:nrow(samp_hh), function(i)
      (1/N)*sum(wtval[samp_hh[i,]],na.rm=TRUE))
    stopCluster(clus)
    phat_ht_res <- apply(data.frame(phat_ht, 1),1,min)
  } else{
    phat_ht <- NULL
    phat_ht_res <- NULL
  }

  # ------------------------------------------------------------
  # Helper function
  get_distrib <- function(phat, prob, ppop){
    if(is.numeric(prob)){
      distrib <- data.frame(phat, prob=prob)
      distrib <- aggregate(prob~phat, data=distrib, sum)
      expval <- sum(distrib[,"prob"]*distrib[,"phat"])
      bias <- expval-ppop
      variance <- sum(distrib[,"prob"]*(distrib[,"phat"]-expval)^2)
      stdev <- sqrt(variance)
      mse <- sum(distrib[,"prob"]*(distrib[,"phat"]-ppop)^2)
      properties <- data.frame(expval,bias,variance,stdev,mse)
    } else{ # prob==NA
      freq <- table(phat)
      distrib <- data.frame(phat=as.numeric(rownames(freq)),
                            prob=data.frame(freq)[,2]/length(phat))
      est_expval <- mean(phat)
      est_bias <- est_expval-ppop
      est_variance <- var(phat)
      est_stdev <- sd(phat)
      est_mse <- (1/(length(phat)-1))*sum((phat-ppop)^2)
      properties <- data.frame(est_expval,est_bias,est_variance,
                               est_stdev,est_mse)
    }
    list(distrib=distrib, properties=properties)
  }

  # ------------------------------------------------------------
  # Compute distribution and properties
  unweighted <- get_distrib(phat_eqw,samp_prob,ppop)
  if(!is.null(phat_ht)){
    ht_unres <- get_distrib(phat_ht,samp_prob,ppop)
    ht_res <- get_distrib(phat_ht_res,samp_prob,ppop)

    # Aggregate probability of restricted values
    agg_prob_res <- sum(
      ht_unres[[1]][which(ht_unres[[1]][,"phat"]>1),"prob"])

    # Compile results
    stat_summary <- data.frame(est_formula=c("eqw","ht","ht_res"),
                               rbind(unweighted[["properties"]],
                                     ht_unres[["properties"]],
                                     ht_res[["properties"]]))
  } else{
    ht_unres <- NA
    ht_res <- NA
    agg_prob_res <- NA
    stat_summary <- data.frame(est_formula=c("eqw","ht","ht_res"),
                               rbind(unweighted[["properties"]],
                                     rep(NA,5), rep(NA,5)))
  }

  # ------------------------------------------------------------
  # Return output
  list(phat_eqw=unweighted, phat_ht=ht_unres, phat_ht_res=ht_res,
       stat_summary=stat_summary, agg_prob_res=agg_prob_res)
}

B.4 Other Functions Created

check_prob  Checks relations between computed inclusion probabilities (Cochran, 1977):
  • $\sum_{i=1}^{N} \pi_i = n$;
  • $\sum_{j \neq i}^{N} \pi_{ij} = (n-1)\pi_i$;
  • $\sum_{i=1}^{N} \sum_{j>i}^{N} \pi_{ij} = \frac{1}{2} n(n-1)$.

dist_mat  Computes the distance between each pair of households.

gen_hh  Generates household locations.

gen_val  Assigns a binary outcome to households in a population.
get_prob  Extracts inclusion probabilities for individual households or inclusion probabilities for pairs of households from a matrix of probabilities.

hh_dist  Computes the distance of a household from the origin.

hh_direc  Returns the angle made when moving counterclockwise from the positive x-axis to a given household.

in_sector  Identifies households which lie in the specified sector.

phat_var  Computes the population variance of the proportion estimator using the following formulas:
  • $Var(\hat{p}_{SRS}) = \frac{N-n}{N-1} \cdot \frac{p(1-p)}{n}$ (Lohr, 2009);
  • $Var(\hat{p}_{HT}) = \frac{1}{N^2} \left[ \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i} y_i^2 + 2 \sum_{i=1}^{N} \sum_{j>i} \frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j} y_i y_j \right]$ (Cochran, 1977).
  Used to check variances computed by phat_distrib.

rdm_direc  Returns a random value in [0, 2π).

rdm_hh  Returns a random household number.

samp_units  Extracts sampled units from simulation output.

samp_val  Returns values associated with sampled units.

sec_analysis  Returns intervals of θ associated with non-empty sectors.

sec_angles  Returns the angles where a sector starts and ends.

sim_srs  Generates samples according to SRS.

Bibliography

1. Bates, D. and Maechler, M. (2015). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-3.
2. Bennett, S., Radalowicz, A., Vella, V., and Tomkins, A. (1994). A computer simulation of household sampling schemes for health surveys in developing countries. International Journal of Epidemiology, 23(6):1282–1291.
3. Bennett, S., Woods, T., Liyanage, W. M., and Smith, D. L. (1991). A simplified general method for cluster-sample surveys of health in developing countries. World Health Statistics Quarterly, 44(3):98–106.
4. Bolker, B. (2008). Ecological Models and Data in R. Princeton University Press.
5. Bostoen, K., Bilukha, O. O., Fenn, B., Morgan, O. W., Tam, C. C., ter Veen, A., and Checchi, F. (2007). Methods for health surveys in difficult settings: charting progress, moving forward. Emerging Themes in Epidemiology, 4:13.
6. Brogan, D., Flagg, E. W., Deming, M., and Waldman, R. (1994). Increasing the accuracy of the Expanded Programme on Immunization's cluster survey design. Annals of Epidemiology, 4(4):302–311.
7. Burnham, G., Lafta, R., Doocy, S., and Roberts, L. (2006). Mortality after the 2003 invasion of Iraq: a cross-sectional cluster sample survey. The Lancet, 368(9545):1421–1428.
8. Centers for Disease Control and Prevention and World Food Programme (2007). A Manual: Measuring and Interpreting Malnutrition and Mortality. Retrieved from http://www.unscn.org/en/resource_portal/index.php?&themes=199&resource=602
9. City of Ottawa (2012). GM - General Mixed Use Zone (Sec. 187-188). Retrieved from http://ottawa.ca/en/residents/laws-licenses-and-permits/laws/city-ottawa-zoning-law/gm-general-mixed-use-zone-sec-187
10. Cochran, W. G. (1977). Sampling Techniques. Wiley, New York, 3rd edition.
11. Coghlan, B., Brennan, R. J., Ngoy, P., Dofara, D., Otto, B., Clements, M., and Stewart, T. (2006). Mortality in the Democratic Republic of Congo: a nationwide survey. The Lancet, 367(9504):44–51.
12. Drysdale, S., Howarth, J., Powell, V., and Healing, T. (2000). The use of cluster sampling to determine aid needs in Grozny, Chechnya in 1995. Disasters, 24(3):217–227.
13. Escamilla, V., Emch, M., Dandalo, L., Miller, W. C., Martinson, F., and Hoffman, I. (2014). Sampling at community level by using satellite imagery and geographical analysis. Bulletin of the World Health Organization, 92(9):690–694.
14. Fonn, S., Sartorius, B., Levin, J., and Likibi, M. L. (2006). Immunisation coverage estimates by cluster sampling survey of children (aged 12–23 months) in Gauteng province, 2003. Southern African Journal of Epidemiology and Infection, 21(4):164–169.
15. Galway, L., Bell, N., SAE, A. S., Hagopian, A., Burnham, G., Flaxman, A., Weiss, W. M., Rajaratnam, J., and Takaro, T. K. (2012). A two-stage cluster sampling method using gridded population data, a GIS, and Google Earth™ imagery in a population-based mortality survey in Iraq. International Journal of Health Geographics, 11:12.
16. Grais, R. F., Rose, A. M. C., and Guthmann, J.-P. (2007). Don't spin the pen: two alternative methods for second-stage sampling in urban cluster surveys. Emerging Themes in Epidemiology, 4:8.
17. Hanif, M. and Brewer, K. R. W. (1980). Sampling with Unequal Probabilities without Replacement: A Review. International Statistical Review / Revue Internationale de Statistique, 48(3):317–335.
18. Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods and Theory: Methods and Applications, Volume 2. Wiley & Sons, New York.
19. Henderson, R. H., Davis, H., Eddins, D. L., and Foege, W. H. (1973). Assessment of vaccination coverage, vaccination scar rates, and smallpox scarring in five areas of West Africa. Bulletin of the World Health Organization, 48(2):183–194.
20. Henderson, R. H. and Sundaresan, T. (1982). Cluster sampling to assess immunization coverage: a review of experience with a simplified sampling method. Bulletin of the World Health Organization, 60(2):253–260.
21. Hlady, W. G., Quenemoen, L. E., Armenia-Cope, R. R., Hurt, K. J., Malilay, J., Noji, E. K., and Wurm, G. (1994). Use of a modified cluster sampling method to perform rapid needs assessment after Hurricane Andrew. Annals of Emergency Medicine, 23(4):719–725.
22. Hoshaw-Woodard, S. (2001). Description and Comparison of the Methods of Cluster Sampling and Lot Quality Assurance Sampling to Assess Immunization Coverage. Department of Vaccines and Biologicals, World Health Organization, Geneva.
23. Housing Development Agency (2012). South Africa: Informal Settlements Status. Retrieved from http://www.thehda.co.za/information/research/category/research
24. Katz, J., Yoon, S. S., Brendel, K., and West, K. P. (1997). Sampling designs for xerophthalmia prevalence surveys. International Journal of Epidemiology, 26(5):1041–1048.
25. Kish, L. (1965). Survey Sampling. Wiley-Interscience, New York, 1st edition.
26. Kok, P. W. (1986). Cluster sampling for immunization coverage. Social Science & Medicine, 22(7):781–783.
27. Kondo, M. C., Bream, K. D., Barg, F. K., and Branas, C. C. (2014). A random spatial sampling method in a rural developing nation. BMC Public Health, 14:338.
28. Legetic, B., Jakovljevic, D., Marinkovic, J., Niciforovic, O., and Stanisavljevic, D. (1996). Health care delivery and the status of the population's health in the current crises in former Yugoslavia using EPI-design methodology. International Journal of Epidemiology, 25(2):341–348.
29. Lemeshow, S. and Robinson, D. (1985). Surveys to measure programme coverage and impact: a review of the methodology used by the expanded programme on immunization. World Health Statistics Quarterly, 38(1):65–75.
30. Lemeshow, S., Tserkovnyi, A. G., Tulloch, J. L., Dowd, J. E., Lwanga, S. K., and Keja, J. (1985). A computer simulation of the EPI survey strategy. International Journal of Epidemiology, 14(3):473–481.
31. Levy, P. S. and Lemeshow, S. (2008). Sampling of Populations: Methods and Applications. Wiley, 4th edition.
32. Lohr, S. L. (2009). Sampling: Design and Analysis. Duxbury Press, Boston, 2nd edition.
33. Luman, E. T., Worku, A., Berhane, Y., Martin, R., and Cairns, L. (2007). Comparison of two survey methodologies to assess vaccination coverage. International Journal of Epidemiology, 36(3):633–641.
34. Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley, Hoboken, NJ.
35. MacIntyre, K. (1999). Rapid assessment and sample surveys: trade-offs in precision and cost. Health Policy and Planning, 14(4):363–373.
36. Marasinghe, M. (2009). Monte Carlo Methods. Retrieved from http://www.public.iastate.edu/~mervyn/stat580/Notes
37. Médecins Sans Frontières (2006). Rapid Health Assessment of Refugee or Displaced Populations. Retrieved from http://www.refbooks.msf.org/msf_docs/en/MSFdocMenu_en.htm
38. Miller, I. and Miller, M. (2003). John E. Freund's Mathematical Statistics with Applications. Pearson, Upper Saddle River, NJ, 7th edition.
39. Milligan, P., Njie, A., and Bennett, S. (2004). Comparison of two cluster sampling methods for health surveys in developing countries. International Journal of Epidemiology, 33(3):469–476.
40. National Institute of Population Research and Training, Mitra and Associates, and ICF International (2013). Bangladesh Demographic and Health Survey 2011. Retrieved from http://dhsprogram.com/publications/publication-fr265-dhs-final-reports.cfm
41. R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
42. Roberts, L., Lafta, R., Garfield, R., Khudhairi, J., and Burnham, G. (2004). Mortality before and after the 2003 invasion of Iraq: cluster sample survey. The Lancet, 364(9448):1857–1864.
43. Rose, A. M., Grais, R. F., Coulombier, D., and Ritter, H. (2006). A comparison of cluster and systematic sampling methods for measuring crude mortality. Bulletin of the World Health Organization, 84(4):290–296.
44. Rothenberg, R. B., Lobanov, A., Singh, K. B., and Stroh, G. (1985). Observations on the application of EPI cluster survey methods for estimating disease incidence. Bulletin of the World Health Organization, 63(1):93–99.
45. Salama, P., Assefa, F., Talley, L., Spiegel, P., van Der Veen, A., and Gotway, C. A. (2001). Malnutrition, measles, mortality, and the humanitarian response during a famine in Ethiopia. Journal of the American Medical Association, 286(5):563–571.
46. Serfling, R. E. and Sherman, I. L. (1965). Attribute Sampling Methods for Local Health Departments. U. S. Dept. of Health, Education, and Welfare, Public Health Service, Communicable Disease Center, Epidemiology Branch, Atlanta.
47. Shannon, H. S., Hutson, R., Kolbe, A., Stringer, B., and Haines, T. (2012). Choosing a survey sample when data on the population are limited: a method using Global Positioning Systems and aerial and satellite photographs. Emerging Themes in Epidemiology, 9(1):5.
48. Siri, J. G., Lindblade, K. A., Rosen, D. H., Onyango, B., Vulule, J. M., Slutsker, L., and Wilson, M. L. (2008). A census-weighted, spatially-stratified household sampling strategy for urban malaria epidemiology. Malaria Journal, 7:39.
49. SMART (2012). Sampling Methods and Sample Size Calculation for the SMART Methodology. Retrieved from http://smartmethodology.org/survey-planning-tools/smart-methodology
50. Sólymos, P. (2009). Processing ecological data in R with the mefa package. Journal of Statistical Software, 29(8):1–28.
51. Statistics Canada (2015). Dissemination area (DA) - Census Dictionary. Retrieved from https://www12.statcan.gc.ca/census-recensement/2011/ref/dict/geo021-eng.cfm
52. The Swiss Foundation for Mine Action (FSD) (2016). Using High-resolution Imagery to Support the Post-earthquake Census in Port-au-Prince, Haiti. Retrieved from http://drones.fsd.ch/2016/04/26/case-study-no-7-using-high-resolution-imagery-to-support-the-post-earthquake-census-in-port-au-prince-haiti
53. UNICEF (2010). Rapid Assessment Sampling in Emergency Situations. Retrieved from http://www.unicef.org/eapro/12205_3619.html
54. Vanden Eng, J. L., Wolkon, A., Frolov, A. S., Terlouw, D. J., Eliades, M. J., Morgah, K., Takpa, V., Dare, A., Sodahlon, Y. K., Doumanou, Y., Hawley, W. A., and Hightower, A. W. (2007). Use of handheld computers with global positioning systems for probability sampling and data entry in household surveys. The American Journal of Tropical Medicine and Hygiene, 77(2):393–399.
55. Vinck, P. and Bell, E. (2011). Violent Conflicts And Displacement In Central Mindanao, Challenges For Recovery And Development. Retrieved from http://documents.worldbank.org/curated/en/2011/12/16234429/violent-conflicts-displacement-central-mindanao-challenges-recovery-development-vol-1-2-annexes
56. Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1–20.
57. Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer New York.
58. Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40(1):1–29.
59. Wickham, H. and Francois, R. (2015). dplyr: A Grammar of Data Manipulation. R package version 0.4.2.
60. World Health Organization (2008). Training for Mid-Level Managers (MLM) Module 7: The EPI Coverage Survey. Retrieved from http://www.who.int/immunization/documents/mlm/en
61. Yoon, S. S., Katz, J., Brendel, K., and West, K. P. (1997). Efficiency of EPI cluster sampling for assessing diarrhoea and dysentery prevalence. Bulletin of the World Health Organization, 75(5):417–426.
62. Zimicki, S., Hornik, R. C., Verzosa, C. C., Hernandez, J. R., de Guzman, E., Dayrit, M., Fausto, A., Lee, M. B., and Abad, M. (1994). Improving vaccination coverage in urban areas through a health communication campaign: the 1990 Philippine experience. Bulletin of the World Health Organization, 72(3):409–422.