THE CUBE METHOD: BALANCED SAMPLING
APPLICATIONS IN THE BASQUE STATISTICS ORGANISATION

Aritz Adin Urtasun

EUSKAL ESTATISTIKA ERAKUNDEA
BASQUE STATISTICS INSTITUTE
Donostia-San Sebastián, 1
01010 VITORIA-GASTEIZ
Tel.: 945 01 75 00
Fax.: 945 01 75 01
E-mail: [email protected]
www.eustat.es

Introduction

Eustat, aware of the growing demand for increasingly disaggregated quality statistics, organised the 23rd International Statistics Seminar in 2010, with the title "Balanced and Efficient Sampling: The Cube Method". Eustat aims to redefine current designs to obtain samples that provide quality estimators for more disaggregated areas or domains at the same or a similar cost.

For the same purpose, Eustat awarded two research and training grants in the field of statistical-mathematical methodologies, focused specifically on sample optimisation. The outcomes of the research have been implemented in several statistical operations in the 2010-2012 Basque Statistics Plan: a Study on Bullying among students in Primary Education and Compulsory Secondary Education schools, a Survey on the Information Society for Families, a Survey on Technological Innovation, a Survey on Poverty and Social Inequality, and a Study on Women in Basque Rural Areas.

The purpose of this publication is to disseminate the research conducted during the grant period and to provide useful material for users interested in efficient and balanced sampling.

The document is divided into two parts. Part One covers the concepts and definitions of sampling theory, as well as simple and complex probability-based sampling plans. Part Two describes the Cube Method and its implementation in several of the Basque Statistics Organisation's standard surveys.

Vitoria-Gasteiz, December 2012
Javier Forcada Sainz
General Director of EUSTAT

Contents

INTRODUCTION
CONTENTS
1. INTRODUCTION
2. INTRODUCTION TO SAMPLING THEORY
    DEFINITIONS AND BASIC NOTATION
    SAMPLING PROPORTIONS
    THE HORVITZ-THOMPSON ESTIMATOR
3. PROBABILITY SAMPLING PLANS
    SIMPLE RANDOM SAMPLING
    STRATIFIED SAMPLING
    CLUSTER SAMPLING
    SUMMARY OF THE METHODS PRESENTED
4. COMPLEX SAMPLING PLANS
    TWO-STAGE SAMPLING
    SELECTION OF PRIMARY UNITS WITH EQUAL PROBABILITIES
    SELF-WEIGHTING TWO-STAGE PLAN
5. THE CUBE METHOD: BALANCED SAMPLING
    CUBE REPRESENTATION
    BALANCED SAMPLES
    DESCRIPTION OF THE METHOD
6. SAS MACROS FOR SELECTING BALANCED SAMPLES
    EXE_CUBE MACRO
    ECHANT_STRAT MACRO
    DISJUNCTIVE AUXILIARY MACRO
    CREAR_ESTRATO AUXILIARY MACRO
    EXAMPLE OF MACRO USE
7. BALANCED SAMPLES IN EUSTAT USING THE CUBE METHOD
    SAMPLE OF ESO (COMPULSORY SECONDARY EDUCATION) CENTRES FOR A STUDY ON BULLYING IN THE BASQUE COUNTRY
    SAMPLE FOR THE INFORMATION SOCIETY SURVEY (ESI-COMPANIES)
    SAMPLE FOR THE SOCIAL CAPITAL SURVEY (ECS)
    SAMPLE FOR THE TECHNOLOGICAL INNOVATION SURVEY (EIT)
    SAMPLE FOR THE POVERTY AND SOCIAL INEQUALITY SURVEY (EPDS)
    SAMPLE FOR THE STUDY OF WOMEN IN BASQUE RURAL AREAS
    SAMPLE FOR BASQUE COUNTRY AND DRUGS SURVEY
8. CONCLUSIONS
    BALANCING AND STRATIFICATION
    CHOICE OF BALANCING VARIABLES
    BALANCING AND CALIBRATION
        Analysis of results
        1. Calibration on the 2012 Basque Country and Drugs Survey
        2. Calibration on the 2012 Social Capital Survey
    INTEREST OF BALANCED SAMPLING
9. BIBLIOGRAPHY

1. Introduction

This Technical Handbook is the result of the work carried out under the training and research grants in statistical-mathematical methodologies for sampling optimisation awarded by the Basque Statistics Institute / Euskal Estatistika Erakundea in 2010.

The Handbook is divided into the following chapters: Chapter One offers an introduction and states the objectives that led to the preparation of this technical Handbook. Chapter Two gives an introduction to sampling theory, with definitions and basic notation in sampling design, sampling proportions and a definition of the Horvitz-Thompson estimator and its variance.
The next two chapters develop the concepts of probability sampling plans and complex sampling plans, with a description of most of the methods used in official statistics. Chapter Five introduces the concept of balanced sampling and the Cube Method for selecting balanced samples. Chapter Six lists the SAS macros for selecting balanced samples. Chapter Seven presents the samples that Eustat has balanced using the Cube Method. The last chapter gives some conclusions on balancing, stratification and calibration.

My thanks to the members of the Methodology, Innovation and R&D Department for their support and to Eustat staff in general for their kindness.

KEYWORDS: Sampling design, inclusion probabilities, Horvitz-Thompson estimator, balanced samples, Cube Method, balance, stratification and calibration variables

2. Introduction to sampling theory

Before we introduce the Cube Method for selecting balanced samples and demonstrate its usefulness, we start with an overview of sampling theory.

Definitions and basic notation

The purpose is to study a finite population $U = \{1, \dots, N\}$ of size $N$. We define the variable of interest $y$, which takes the values $y_k$, $k \in U$, and whose total and mean are:

$$Y = \sum_{k \in U} y_k \qquad \text{and} \qquad \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k$$

A sample $s$ is a subset of the population, $s \subset U$. A sampling design or sampling plan $p(s)$ is a probability distribution on the set of all possible samples, so that

$$\sum_{s \subset U} p(s) = 1.$$

The random sample $S$ takes the value $s$ with probability $\Pr(S = s) = p(s)$.

We define the inclusion probability as the probability that unit $k$ is selected in the random sample $S$:

$$\pi_k = E(I_k) = \Pr(k \in S) = \sum_{s \ni k} p(s), \qquad \text{where} \quad I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if } k \notin S \end{cases}$$

Similarly, the second-order inclusion probability is defined as:

$$\pi_{kl} = E(I_k I_l) = \Pr(k \text{ and } l \in S) = \sum_{s \ni k,l} p(s)$$

If the sample design is of fixed size $n$, then

$$\sum_{k \in U} \pi_k = n.$$
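The definitions above can be checked numerically on a very small population. The following is a minimal Python sketch, not part of the original handbook: the three-unit design and the names `design` and `inclusion_probability` are hypothetical and chosen only for illustration. It computes the first-order and second-order inclusion probabilities by summing $p(s)$ over the samples that contain the units in question, and confirms that the first-order probabilities add up to the fixed sample size.

```python
from fractions import Fraction

# Hypothetical fixed-size design (n = 2) on U = {1, 2, 3}:
# each possible sample s is listed with its probability p(s).
design = {
    frozenset({1, 2}): Fraction(1, 2),
    frozenset({1, 3}): Fraction(1, 4),
    frozenset({2, 3}): Fraction(1, 4),
}
population = [1, 2, 3]


def inclusion_probability(design, *units):
    """Sum p(s) over the samples s that contain every unit in `units`."""
    wanted = set(units)
    return sum(p for s, p in design.items() if wanted <= s)


# First-order inclusion probabilities pi_k
pi = {k: inclusion_probability(design, k) for k in population}
# Second-order inclusion probability pi_12
pi_12 = inclusion_probability(design, 1, 2)

print(pi)
print(pi_12)
print(sum(pi.values()))  # equals n for a fixed-size design
```

Exact fractions are used instead of floating-point numbers so that the identity $\sum_{k \in U} \pi_k = n$ holds without rounding error.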
Sampling proportions

Suppose that the variable of interest defined on population $U$ is a qualitative variable. In this case, the variable of interest gives information on a quality of the population units and on their membership or non-membership of a certain class. Suppose that the variable of interest divides the population units into two classes $C$ and $C'$. The characteristic $y_k$ for each population unit is defined as:

$$y_k = \begin{cases} 1 & \text{if } k \in C \\ 0 & \text{if } k \notin C \end{cases} \qquad \forall k \in U$$

The total number of population elements that belong to $C$ (class total) and the proportion of population elements that belong to $C$ (class proportion) are defined as:

$$Y = \sum_{k \in U} y_k = A \qquad \text{and} \qquad \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k = \frac{A}{N} = P$$

We can therefore treat the problem of estimating $A$ and $P$ as that of estimating a population total and a population mean, where each $y_k$ takes the value 0 or 1. If we write the quasi-variance $S^2$ in terms of $P$ and $Q = 1 - P$:

$$S^2 = \frac{\sum_{k \in U} (y_k - \bar{Y})^2}{N - 1} = \frac{\sum_{k \in U} y_k^2 - N \bar{Y}^2}{N - 1} = \frac{1}{N - 1} (NP - NP^2) = \frac{N}{N - 1} PQ$$

Its unbiased estimator is:

$$s^2 = \frac{n}{n - 1} pq, \qquad \text{where} \quad p = \frac{\sum_{k \in S} y_k}{n} = \frac{a}{n} \quad \text{and} \quad q = 1 - p$$

The Horvitz-Thompson estimator

The Horvitz-Thompson estimators of the population total and the population mean of the variable of interest $y$ are defined as:

$$\hat{Y}_\pi = \sum_{k \in S} \frac{y_k}{\pi_k} \qquad \text{and} \qquad \hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k}$$

The Horvitz-Thompson estimator is unbiased if $\pi_k > 0$ for all $k \in U$. For fixed-size designs, the variance can be estimated by:

$$\widehat{\mathrm{Var}}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in S} \sum_{\substack{l \in S \\ l \neq k}} \left( \frac{y_k}{\pi_k} - \frac{y_l}{\pi_l} \right)^2 \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}}$$

3. Probability sampling plans

A probability sample is one in which every unit in the population has a chance of being selected, and this probability can be accurately determined.

As explained later, the Cube Method relies on the inclusion probabilities defined by the design to select a balanced sample; in effect, the Cube Method optimises probability sampling methods.

Three main types of probability sampling are defined below.
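Before turning to the individual plans, the unbiasedness of the Horvitz-Thompson estimator can be illustrated on a toy case. The sketch below is only an illustration under stated assumptions: the population values `y`, the three-sample `design` and the helper `ht_total` are hypothetical and do not come from the handbook. It computes the estimator of the total for every possible sample and averages the results with weights $p(s)$; the expectation coincides with the true total because every $\pi_k$ is positive.

```python
from fractions import Fraction

# Hypothetical values of the variable of interest on U = {1, 2, 3}
y = {1: 10, 2: 20, 3: 30}

# Hypothetical sampling design: possible samples with their probabilities p(s)
design = {
    frozenset({1, 2}): Fraction(1, 2),
    frozenset({1, 3}): Fraction(1, 4),
    frozenset({2, 3}): Fraction(1, 4),
}

# First-order inclusion probabilities implied by the design
pi = {k: sum(p for s, p in design.items() if k in s) for k in y}


def ht_total(sample):
    """Horvitz-Thompson estimator of the total: sum of y_k / pi_k over the sample."""
    return sum(Fraction(y[k]) / pi[k] for k in sample)


# Expectation of the estimator over the design versus the true total
expected = sum(p * ht_total(s) for s, p in design.items())
true_total = sum(y.values())

print(expected, true_total)
```

The estimate from any single sample generally differs from the true total; only the average over the design matches it.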
Simple random sampling

Simple random sampling (SRS) is a sampling method in which a sample of size $n$ is selected from a population of size $N$ in such a way that all samples of the same size have the same probability of being chosen. The sample design for an SRS of fixed size $n$ is:

$$p(s) = \begin{cases} \binom{N}{n}^{-1} & \text{if } \mathrm{card}(s) = n \\ 0 & \text{otherwise} \end{cases}$$

Therefore, the inclusion probability of unit $k$ is:

$$\pi_k = \sum_{s \ni k} p(s) = \sum_{s \ni k} \binom{N}{n}^{-1} = \binom{N - 1}{n - 1} \binom{N}{n}^{-1} = \frac{n}{N}, \qquad \forall k \in U$$

In other words, all the individuals of $U$ have the same probability of being chosen.

The Horvitz-Thompson estimator of the population mean in an SRS is

$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{k \in S} y_k \frac{N}{n} = \frac{1}{n} \sum_{k \in S} y_k$$

The unbiased variance estimator of $\hat{\bar{Y}}_\pi$ is:

$$\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = (1 - f) \frac{s_y^2}{n}, \qquad \text{where} \quad s_y^2 = \frac{1}{n - 1} \sum_{k \in S} (y_k - \hat{\bar{Y}}_\pi)^2$$

and $f = \dfrac{n}{N}$ is the sampling fraction.

Stratified sampling

Suppose that the population $U$ is divided into subpopulations or strata $U_h$, $h = 1, \dots, H$, which meet the following properties:

(i) $\bigcup_{h=1}^{H} U_h = U$
(ii) $U_h \cap U_i = \emptyset$, $h \neq i$
(iii) If $N_h$ is the size of $U_h$, then $\sum_{h=1}^{H} N_h = N$

A sample design is stratified when a simple random sample of fixed size $n_h$ is selected in each stratum, where $\sum_{h=1}^{H} n_h = n$ is the sample size.

This sampling technique is used when the study population is very heterogeneous and can be divided into internally homogeneous strata. Thus, we can achieve more precise estimators in each stratum and combine them to obtain a more accurate estimator of the population total.

SRS is used to select the units in each stratum, so the inclusion probability of unit $k$ is:

$$\pi_k = \frac{n_h}{N_h}, \qquad \forall k \in U_h.$$

The Horvitz-Thompson estimator of the mean for stratified sampling is:

$$\hat{\bar{Y}}_{st} = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in S_h} y_k = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h$$

The variance of the estimator can be estimated without bias by:

$$\widehat{\mathrm{Var}}(\hat{\bar{Y}}_{st}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{s_{yh}^2}{n_h}$$

where $f_h = n_h / N_h$ is the sampling fraction in stratum $h$ and

$$s_{yh}^2 = \frac{1}{n_h - 1} \sum_{k \in S_h} (y_k - \hat{\bar{Y}}_h)^2$$

is the sample quasi-variance of stratum $h$.

Allocation in stratified sampling

The sample size can be distributed among the strata according to several criteria. The most frequently used criteria are described below.

1. Proportional allocation

In proportional allocation, the number of sample units allocated to each stratum is proportional to the size of the stratum. Thus, a stratified plan is said to have a proportional allocation if:

$$\frac{n_h}{N_h} = \frac{n}{N}, \qquad \text{for } h = 1, \dots, H$$

Supposing that $n_h = \dfrac{n N_h}{N}$ is an integer, the estimator of the population mean is:

$$\hat{\bar{Y}}_{prop} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h = \frac{1}{n} \sum_{k \in S} y_k$$

Allocations proportional to the square root, the cube root or any other power of $N_h$ lower than 1 can be made in the same manner.

2. Minimum variance allocation

Minimum variance allocation, or Neyman allocation, consists in determining the values of $n_h$ in such a way that the variance of the estimator is minimal for a fixed sample size $n$. Lagrange multipliers are used to obtain the necessary values of $n_h$:

$$n_h = n \frac{N_h S_{yh}}{\sum_{h=1}^{H} N_h S_{yh}}, \qquad \text{for } h = 1, \dots, H$$

where $S_{yh}$ is the population quasi-standard deviation of $y$ in stratum $h$.

3. Minimum sample size allocation

In this case, the problem consists in finding the allocation that gives the minimum sample size $n^*$ for a fixed variance $V$ (in this expression, the variance of the estimator of the population total). Again, using Lagrange multipliers, we have:

$$n^* = \frac{\left( \sum_{h=1}^{H} N_h S_{yh} \right)^2}{V + \sum_{h=1}^{H} N_h S_{yh}^2}$$

Cluster sampling

Suppose that population $U$ is divided into $M$ subsets $U_i$, $i = 1, \dots, M$, called clusters, which meet the following properties:

(i) $\bigcup_{i=1}^{M} U_i = U$
(ii) $U_i \cap U_j = \emptyset$, $i \neq j$
(iii) $\sum_{i=1}^{M} N_i = N$, where $N_i$ is the number of elements in cluster $U_i$.
A sample design is a cluster design when we select a sample of $m$ clusters, denoted $s_I$, with a plan $p_I(s_I)$, and all the units of the chosen clusters are observed.

The full random sample is given by $S = \bigcup_{i \in S_I} U_i$, the size of which is $n = \sum_{i \in S_I} N_i$. Normally, the size of the sample is random.

This sampling technique is used when the population is naturally divided into groups that are each supposed to contain all the variability in the population; i.e., each cluster faithfully represents the characteristics of the study population (which simplifies the gathering of sample information).

Selection of clusters with equal probabilities

Supposing that all the clusters have the same probability of being chosen, the sample plan consists in selecting the clusters by an SRS of size $m$. In this case, the probability of selecting a cluster is $\pi_{Ii} = \dfrac{m}{M}$. The following simplified expression of the Horvitz-Thompson estimator of the mean is obtained:

$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{i \in S_I} \frac{N_i \bar{Y}_i}{\pi_{Ii}} = \frac{M}{Nm} \sum_{i \in S_I} N_i \bar{Y}_i$$

where $\bar{Y}_i = \dfrac{1}{N_i} \sum_{k \in U_i} y_k$ is the mean of cluster $U_i$, $i = 1, \dots, M$.

The variance of the estimator can be estimated without bias by:

$$\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = \frac{M - m}{m - 1} \frac{M}{N^2 m} \sum_{i \in S_I} \left( Y_i - \frac{\hat{Y}_\pi}{M} \right)^2$$

where $Y_i = N_i \bar{Y}_i$ is the total of cluster $U_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$ is the estimator of the population total.

Systematic sampling with equal probability

Suppose that the $N$ units of population $U$ are numbered 1 to $N$ in a certain order (random, or according to some ordering criterion). If $n$ is the number of units to be selected in the sample, we define $k = \dfrac{N}{n}$ as the sampling interval. We select a random number $r \in \{1, \dots, k\}$ as the starting unit. After $r$, the units that are at a distance $lk$ from it, for $l = 1, 2, \dots, n - 1$, are selected in the sample.

Systematic sampling can be viewed as cluster sampling where the problem consists in choosing a single cluster from the $k$ potential clusters.
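The selection rule just described can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the handbook: the function name `systematic_sample` is hypothetical, and it assumes that $N$ is an exact multiple of $n$, so that the interval $k = N/n$ is an integer. Listing the sample for every possible start $r$ shows the $k$ "clusters" among which a single one is drawn.

```python
import random


def systematic_sample(N, n, r):
    """Return the units r, r + k, ..., r + (n - 1)k, with sampling interval k = N / n.

    Assumes N is an exact multiple of n and that the start r lies in {1, ..., k}.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n
    if not 1 <= r <= k:
        raise ValueError("the random start r must lie between 1 and k")
    return [r + l * k for l in range(n)]


N, n = 12, 3
k = N // n

# The k possible systematic samples: together they partition the population.
all_samples = [systematic_sample(N, n, r) for r in range(1, k + 1)]

# Drawing the random start r selects exactly one of them.
r = random.randint(1, k)
chosen = systematic_sample(N, n, r)

print(all_samples)
print(r, chosen)
```

Because only the start is random, each unit has inclusion probability $1/k = n/N$, while most pairs of units can never appear together in the same sample.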
Composition of the $k$ possible systematic samples

    Sample 1:   $y_1,\ y_{k+1},\ \dots,\ y_{(n-1)k+1}$
    Sample 2:   $y_2,\ y_{k+2},\ \dots,\ y_{(n-1)k+2}$
    ...
    Sample i:   $y_i,\ y_{k+i},\ \dots,\ y_{(n-1)k+i}$
    ...
    Sample k:   $y_k,\ y_{2k},\ \dots,\ y_{nk}$

Summary of the methods presented

The coefficient of variation of an estimator $\hat{\theta}$ is defined as the quotient between its standard deviation and the real value $\theta$:

$$CV(\hat{\theta}) = \frac{\sqrt{\mathrm{Var}(\hat{\theta})}}{\theta}$$

Therefore, the estimator of the coefficient of variation of $\hat{\theta}$ is

$$cv(\hat{\theta}) = \frac{\sqrt{\widehat{\mathrm{Var}}(\hat{\theta})}}{\hat{\theta}}$$

The formulas for the estimator, the variance and the coefficient of variation of the population mean $\bar{Y}$ and of the proportion $P$ are summarised below for the methods presented.

Simple random sampling

    Population mean
        Estimator:          $\hat{\bar{Y}}_\pi = \dfrac{1}{n} \sum_{k \in S} y_k$
        Variance:           $\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = (1 - f) \dfrac{s_y^2}{n}$
        Coeff. of variation: $cv(\hat{\bar{Y}}_\pi) = \sqrt{\dfrac{1 - f}{n}}\ \dfrac{s_y}{\hat{\bar{Y}}_\pi}$
    Proportion
        Estimator:          $\hat{P} = \dfrac{1}{n} \sum_{k \in S} y_k$
        Variance:           $\widehat{\mathrm{Var}}(\hat{P}) = (1 - f) \dfrac{p(1 - p)}{n - 1}$
        Coeff. of variation: $cv(\hat{P}) = \sqrt{(1 - f) \dfrac{1 - p}{(n - 1) p}}$

Stratified sampling

    Population mean
        Estimator:          $\hat{\bar{Y}}_{st} = \dfrac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h$
        Variance:           $\widehat{\mathrm{Var}}(\hat{\bar{Y}}_{st}) = \dfrac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \dfrac{s_{yh}^2}{n_h}$
        Coeff. of variation: $cv(\hat{\bar{Y}}_{st}) = \dfrac{\sqrt{\sum_{h=1}^{H} N_h^2 (1 - f_h) \dfrac{s_{yh}^2}{n_h}}}{\sum_{h=1}^{H} N_h \hat{\bar{Y}}_h}$
    Proportion
        Estimator:          $\hat{P}_{st} = \dfrac{1}{N} \sum_{h=1}^{H} N_h p_h$
        Variance:           $\widehat{\mathrm{Var}}(\hat{P}_{st}) = \dfrac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \dfrac{p_h q_h}{n_h - 1}$
        Coeff. of variation: $cv(\hat{P}_{st}) = \dfrac{\sqrt{\sum_{h=1}^{H} N_h^2 (1 - f_h) \dfrac{p_h (1 - p_h)}{n_h - 1}}}{\sum_{h=1}^{H} N_h p_h}$

Cluster sampling

    Population mean
        Estimator:          $\hat{\bar{Y}}_\pi = \dfrac{M}{Nm} \sum_{i \in S_I} N_i \bar{Y}_i$
        Variance:           $\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = \dfrac{M - m}{m - 1} \dfrac{M}{N^2 m} \sum_{i \in S_I} \left( Y_i - \dfrac{\hat{Y}_\pi}{M} \right)^2$
        Coeff. of variation: $cv(\hat{\bar{Y}}_\pi) = \dfrac{\sqrt{\left( 1 - \dfrac{m}{M} \right) \dfrac{m}{m - 1} \sum_{i \in S_I} \left( Y_i - \dfrac{\hat{Y}_\pi}{M} \right)^2}}{\sum_{i \in S_I} N_i \bar{Y}_i}$
    Proportion
        Estimator:          $\hat{P} = \dfrac{\sum_{i \in S_I} a_i}{\sum_{i \in S_I} N_i}$, where $a_i = p_i N_i$
        Variance:           $\widehat{\mathrm{Var}}(\hat{P}) = \dfrac{M - m}{M} \dfrac{m}{m - 1} \dfrac{\sum_{i \in S_I} a_i^2 - 2p \sum_{i \in S_I} a_i N_i + p^2 \sum_{i \in S_I} N_i^2}{\left( \sum_{i \in S_I} N_i \right)^2}$
        Coeff. of variation: $cv(\hat{P}) = \sqrt{\dfrac{M - m}{M} \dfrac{m}{m - 1} \dfrac{\sum_{i \in S_I} a_i^2 - 2p \sum_{i \in S_I} a_i N_i + p^2 \sum_{i \in S_I} N_i^2}{\left( \sum_{i \in S_I} a_i \right)^2}}$

In the cluster sampling formulas, $Y_i = N_i \bar{Y}_i$ denotes the total of cluster $U_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$ the estimator of the population total.

4. Complex sampling plans

Although the methods presented are the three main types of probability sampling, the designs actually used in the surveys conducted by EUSTAT and by other statistics bodies tend to be more complex.

Two-stage sampling

Suppose that population $U = \{1, \dots, k, \dots, N\}$ is composed of $M$ subpopulations $U_i$, $i = 1, \dots, M$, called primary units. In turn, each primary unit $U_i$ is composed of $N_i$ secondary units, where $\sum_{i=1}^{M} N_i = N$.

In general, a two-stage sample is defined as follows:

- A sample $S_I$ of primary units of size $m$ is selected.
- If a primary unit $U_i$ is selected in the first stage, a sample $S_i$ of $n_i$ secondary units is selected within it.
- Two-stage plans must meet the properties of invariance and independence.

The full random sample is given by $S = \bigcup_{i \in S_I} S_i$, the size of which is $n = \sum_{i \in S_I} n_i$.

We can define:

• $\pi_{I,i}$ as the probability of selecting the primary unit $U_i$
• $\pi_{k|i}$ as the probability of selecting unit $k$, given that $U_i$ has been selected.

Therefore, the probability of inclusion for unit $k$ is:

$$\pi_k = \pi_{I,i}\, \pi_{k|i}, \qquad k \in U_i$$

The Horvitz-Thompson estimator of the mean in a two-stage sample is:

$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{i \in S_I} \sum_{k \in S_i} \frac{y_k}{\pi_{I,i}\, \pi_{k|i}} = \frac{1}{N} \sum_{i \in S_I} \frac{N_i \hat{\bar{Y}}_i}{\pi_{I,i}}$$

where $\hat{\bar{Y}}_i = \dfrac{1}{N_i} \sum_{k \in S_i} \dfrac{y_k}{\pi_{k|i}}$ is the Horvitz-Thompson estimator of the mean of the primary unit $U_i$.

Moreover, in a two-stage plan the variance estimator decomposes as

$$\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = \widehat{\mathrm{Var}}_{UP} + \widehat{\mathrm{Var}}_{US},$$

where $\widehat{\mathrm{Var}}_{UP}$ is the part of the variance that refers to primary units and $\widehat{\mathrm{Var}}_{US}$ the part that refers to secondary units.

Therefore, in two-stage sampling we can combine the main probability sampling plans presented (simple random sampling, stratified sampling and cluster sampling) to select primary units as well as secondary units.

Selection of primary units with equal probabilities

Suppose that simple random sampling is used in the two sampling stages.
Then, the probabilities defined above would be:

$$\pi_{I,i} = \frac{m}{M}, \quad i = 1, \dots, M \qquad \text{and} \qquad \pi_{k|i} = \frac{n_i}{N_i}, \quad i = 1, \dots, M, \ k \in U_i$$

In this case, the probability of inclusion for unit $k$ is:

$$\pi_k = \frac{m\, n_i}{M N_i}, \qquad k \in U_i$$

Substituting into the Horvitz-Thompson estimator formula for two-stage sampling, we have:

$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{M}{Nm} \sum_{i \in S_I} \sum_{k \in S_i} \frac{N_i\, y_k}{n_i}$$

and its variance estimator simplifies to

$$\widehat{\mathrm{Var}}(\hat{\bar{Y}}_\pi) = \frac{M - m}{N^2 m} M s_I^2 + \frac{M}{N^2 m} \sum_{i \in S_I} N_i \frac{N_i - n_i}{n_i} s_i^2$$

where

$$s_I^2 = \frac{1}{m - 1} \sum_{i \in S_I} \left( \hat{Y}_i - \frac{\hat{Y}_\pi}{M} \right)^2 \qquad \text{and} \qquad s_i^2 = \frac{1}{n_i - 1} \sum_{k \in S_i} \left( y_k - \frac{\hat{Y}_i}{N_i} \right)^2$$

with $\hat{Y}_i = N_i \hat{\bar{Y}}_i$ the estimated total of primary unit $U_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$ the estimated population total.

Self-weighting two-stage plan

Suppose that the primary units in the first stage are selected with inclusion probabilities proportional to size (PPS); in other words,

$$\pi_{I,i} = \frac{N_i\, m}{N}$$

In the second stage, the secondary units are selected by simple random sampling of fixed size $n_i = n_0$ (in each primary unit); in other words,

$$\pi_{k|i} = \frac{n_0}{N_i}$$

Thus, the probabilities of inclusion are the same for every unit $k$ in population $U$:

$$\pi_k = \pi_{I,i}\, \pi_{k|i} = \frac{N_i\, m}{N} \frac{n_0}{N_i} = \frac{m\, n_0}{N}$$

5. The Cube Method: Balanced Sampling

The Cube Method (Deville and Tillé, 2004) is a method for selecting balanced samples with equal or unequal inclusion probabilities, optimising probability sampling methods.

Intuitively, the method allows the sample to preserve the proportions of the original population on certain balancing variables (for example, qualitative variables), always respecting the design's inclusion probabilities. To improve the precision of the estimators, the balancing variables should be strongly correlated with the variables of interest.

Cube representation

Let us consider a finite population $U = \{1, \dots, N\}$ of size $N$, where the aim is to estimate the total (or mean) of certain variables of interest.

To understand how the Cube Method works, suppose that a sample is denoted by a vector $s = (s_1 \ \dots \ s_k \ \dots \ s_N)^t$, where $s_k$ takes the value 1 if unit $k$ is in the sample and 0 otherwise.
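For a population of size $N = 3$, this vector notation can be enumerated directly. The following Python sketch is an illustration only; the choice of a simple random sample of size 2 as the design and the names `vertices` and `fixed_size` are assumptions made for the example. It lists every 0/1 vector of length 3 and then computes the inclusion probability vector as the $p(s)$-weighted average of the sample vectors.

```python
from fractions import Fraction
from itertools import product

N = 3

# Every possible sample as a 0/1 vector of length N (8 vectors for N = 3).
vertices = list(product((0, 1), repeat=N))

# Hypothetical design: simple random sampling of fixed size n = 2,
# i.e. equal probability on the vectors whose coordinates sum to 2.
n = 2
fixed_size = [s for s in vertices if sum(s) == n]
p = Fraction(1, len(fixed_size))

# Inclusion probability vector: expectation of the sample vector under the design.
pi = [sum(p * s[k] for s in fixed_size) for k in range(N)]

print(vertices)
print(fixed_size)
print(pi)
```

The vector `pi` has coordinates strictly between 0 and 1, whereas every sample vector has only coordinates equal to 0 or 1.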
Geometrically, each vector $s$ is a vertex of an $N$-cube.

[Figure: possible samples in a population of size N = 3]

Therefore, a sampling design $p(\cdot)$ is a probability distribution on the set $\mathcal{S} = \{0, 1\}^N$ of all the possible samples. The inclusion probability of unit $k$ is defined as $\pi_k = \Pr(S_k = 1)$.

Balanced samples

Suppose that we have certain auxiliary variables with known values for all the units of the population, $k \in U$. The auxiliary variables could be used as stratification variables (qualitative) or as balancing variables (qualitative or quantitative).

A sampling design is said to be balanced on the variables $x_1, x_2, \dots, x_p$ if the balancing equations are satisfied:

$$\hat{X}_\pi = X \iff \sum_{k \in s} \frac{x_{kj}}{\pi_k} = \sum_{k \in U} x_{kj}, \qquad \forall s \in \mathcal{S} \text{ with } p(s) > 0, \quad j = 1, \dots, p$$

In other words, the Horvitz-Thompson estimators of the totals of the variables $x_1, x_2, \dots, x_p$ computed from the sample are equal to the totals of said variables in the population.

The inclusion probability vector $\pi$ is always predetermined by the sampling design.

The equations that derive from the balance constraints define a subspace $Q$ of dimension $N - p$ in $\mathbb{R}^N$. Therefore, the problem consists in choosing a vertex (a sample) of the $N$-cube that lies in the subspace $Q$.

Since an exactly balanced sample does not always exist, the Cube Method implements a procedure for selecting approximately balanced samples.

Description of the method

The Cube Method proposed by Deville and Tillé (2004) is composed of two phases:

1. Flight phase

The flight phase is a generalisation of the splitting procedure (see "Sampling Theory"). It is a random walk that begins at the inclusion probability vector $\pi$ and remains in the intersection of the cube and the subspace $Q$ defined by the balancing equations.

2. Landing phase

If a sample (a vertex) has not been reached at the end of the flight phase, the landing phase should be applied.
There are three potential solutions for this phase:

- Progressively eliminate the balancing variables and apply the flight phase again (the variables need to be deleted in ascending order of importance).
- Use linear programming to calculate the best approximately balanced sample (minimising the difference in balance).
- Choose the vertex closest to the probability vector obtained in the flight phase, rounding the inclusion probabilities that are still not equal to 0 or 1.

Chauvet and Tillé programmed a much quicker implementation of the flight phase (see "Fast SAS Macros for balancing samples: user's guide"), which is the phase that takes up most of the execution time. The advantages obtained were:

o There are no constraints on the size of the population.
o The execution time depends linearly on the size of the population.

6. SAS macros for selecting balanced samples

This chapter presents the SAS macros that allow balanced samples to be selected. The two main macros (exe_cube and echant_strat) were developed by Guillaume Chauvet and Yves Tillé. The auxiliary macros disjunctive and crear_estrato were written by Eustat to simplify the use of the former.

Although Eustat has opted to work with the SAS macros that implement the Cube Method, functions that select balanced samples are also available in R (see the sampling package: http://cran.r-project.org/web/packages/sampling/index.html).

exe_cube macro

The SAS macro exe_cube allows the Cube Method (Fast Cube Method) to be used to select balanced samples.

Input data

The input data are a SAS table with all the population units from which the sample will be selected. It should contain at least:

- An identification variable
- A variable with the inclusion probabilities
- The variables on which the sample will be balanced

The table must not have missing values in said variables.
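The requirements on the input table can be checked before the macro is called. The sketch below is a hypothetical pre-check written in Python, not part of the SAS macros or of Eustat's tooling: the function `validate_frame`, the row-as-dictionary layout and the column names are assumptions for illustration. Besides the no-missing-values rule stated above, it also flags duplicate identifiers and inclusion probabilities outside $(0, 1]$; those two checks are additional assumptions of this sketch rather than documented requirements of exe_cube.

```python
def validate_frame(rows, id_var, pi_var, contr_vars):
    """Return a list of problems found in a sampling frame given as a list of dicts.

    Hypothetical helper: checks for missing values in the required variables,
    duplicate identifiers, and inclusion probabilities outside (0, 1].
    """
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Rule stated in the text: no missing values in the required variables.
        for var in [id_var, pi_var, *contr_vars]:
            if row.get(var) is None:
                problems.append(f"row {i}: missing value in {var}")
        # Assumption of this sketch: identifiers should be unique.
        ident = row.get(id_var)
        if ident is not None:
            if ident in seen_ids:
                problems.append(f"row {i}: duplicate id {ident}")
            seen_ids.add(ident)
        # Assumption of this sketch: probabilities must lie in (0, 1].
        prob = row.get(pi_var)
        if prob is not None and not 0 < prob <= 1:
            problems.append(f"row {i}: inclusion probability {prob} outside (0, 1]")
    return problems


good_frame = [
    {"id": 1, "pik": 0.5, "employ": 12},
    {"id": 2, "pik": 0.25, "employ": 40},
]
bad_frame = [
    {"id": 1, "pik": 0.5, "employ": None},
    {"id": 2, "pik": 1.5, "employ": 40},
]

print(validate_frame(good_frame, "id", "pik", ["employ"]))
print(validate_frame(bad_frame, "id", "pik", ["employ"]))
```

Returning a list of messages rather than raising on the first problem lets all issues in a frame be reported in a single pass.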
Macro syntax

A brief description of the necessary arguments follows:

BASE = Name of the SAS library that contains the table with the input data.
DATA = Name of the SAS table with the input data.
ID = Identification variable of the population units.
PI = Variable with the inclusion probabilities.
CONTR = Variables on which the sample will be balanced.
ATTER = Option selected for the landing phase:
    1. The balancing variables are gradually eliminated.
    2. All the possible samples for the remaining units (those whose values are other than 0 or 1) are considered. The one that gives the smallest difference in balance is selected.
    3. The same procedure as for option 2, but only considering samples with a size equal to the sum of the inclusion probabilities (fixed sample size).
    4. The inclusion probabilities are rounded for the remaining units, while keeping the fixed sample size.
    To use options 3 or 4, enter the inclusion probabilities variable in the contr parameter.
COMPEQ = Equal to 1 to balance the complement of the sample as well.
SORT = Name of the SAS table with the output data, which is saved in the library specified in the base parameter. It contains all the population units, as well as the variable ech, which is equal to 1 if the unit has been selected and 0 otherwise.

echant_strat macro

The SAS macro echant_strat allows stratified samples to be selected using the Cube Method (Fast Cube Method), exactly balanced in the total population and approximately balanced in each stratum.

The steps followed by the macro to select a balanced sample are:

1. Independent flight phase in each of the strata.
2. Joint flight phase with the remaining units that were not decided in the strata.
3. Landing phase with the units that are still undecided.

Input data

There has to be a SAS table with the population units for each of the strata defined for the stratified sample. Each table must contain at least the same variables that were defined for the exe_cube macro.
Macro syntax

A brief description of the necessary arguments follows:

DATA = Names of the SAS tables with the input data for each stratum.
ID = Identification variable of the population units.
PI = Variable with the inclusion probabilities.
CONTR = Variables on which the sample will be balanced.
SORT = Name of the SAS table with the output data.

disjunctive auxiliary macro

The SAS macro disjunctive allows one or more variables of interest to be split into disaggregated variables according to certain categories. The macro also allows the names of said categories to be entered.

Description

Suppose that, in a population of size $N$, we have a variable of interest $Y$ and a qualitative variable $X$ that takes the values $1, 2, \dots, L$. The disjunctive macro gives the disjunctive variables $Y^1, Y^2, \dots, Y^L$, where:

$$y_i^l = \begin{cases} y_i & \text{if } x_i = l \\ 0 & \text{if } x_i \neq l \end{cases} \qquad \text{for } i = 1, \dots, N, \quad l = 1, \dots, L$$

Macro syntax

A brief description of the necessary arguments follows:

DATA = Name of the SAS table that contains the population data.
VAR = Variable(s) of interest.
CATEG = Qualitative variable that contains the categories for creating the disjunctive variables.
NOMBRES_CATEG (optional) = Names of the categories of the variable categ. By default categ1, categ2, …, categL.

Results and outputs

The disjunctive macro adds to the input table the disjunctive variables created from the variable of interest var. The names of the new variables are the concatenation of the name of the variable var and the names defined in nombres_categ (separated by the symbol "_"). The names are saved in the macro variable contr_categ.

crear_estrato auxiliary macro

The SAS macro crear_estrato allows a SAS table to be divided into several tables according to a stratification variable.

Macro syntax

A brief description of the necessary arguments follows:

DATA = Name of the SAS table that contains the population data.
ID = Identification variable.
VAR_ESTRAT = Variable on which the stratification is to be performed.

Results and outputs

The crear_estrato macro returns a SAS table for each of the values of the variable var_estrat. The default names of the output tables are of the type estrato_{var_estrat}_j, where {var_estrat}_j is the j-th value of the variable var_estrat. The names are saved in the macro variable datos_estrat.

Example of macro use

Suppose that we want to select a stratified sample of establishments, balancing the sample on the number of employees per Province. The initial SAS table with the population data would look like this:

    data
    id   estrata   pik   employ   TH
    1    A         π1    e1       48
    2    A         π2    e2       20
    3    B         π3    e3       20
    4    B         π4    e4       01
    5    B         π5    e5       48
    6    C         π6    e6       01
    7    C         π7    e7       20

where TH is the Province code (01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia), πk is the inclusion probability of establishment k, and ek is the number of employees in establishment k.

• First, we apply the disjunctive macro to calculate the disjunctive balancing variables for the number of employees by Province.

    %global contr_categ;
    %disjunctive(
        DATA = data,
        VAR = employ,
        CATEG = TH,
        NOMBRES_CATEG = Araba Gipuzkoa Bizkaia
    );

    data
    id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
    1    A         π1    e1       48   0              0                 e1
    2    A         π2    e2       20   0              e2                0
    3    B         π3    e3       20   0              e3                0
    4    B         π4    e4       01   e4             0                 0
    5    B         π5    e5       48   0              0                 e5
    6    C         π6    e6       01   e6             0                 0
    7    C         π7    e7       20   0              e7                0

As mentioned above, the aim is to select a sample balanced on the number of employees per Province; in other words, on the totals:

$$\sum_{k \in U} \text{employ\_Araba}_k, \qquad \sum_{k \in U} \text{employ\_Gipuzkoa}_k \qquad \text{and} \qquad \sum_{k \in U} \text{employ\_Bizkaia}_k$$

In this case, the macro variable contr_categ holds the values:

    &contr_categ. = employ_Araba employ_Gipuzkoa employ_Bizkaia

• Next, we would apply the crear_estrato macro to obtain a dataset with the data for each of the strata.
    %global datos_estrat;
    %crear_estrato(
        DATA = data,
        ID = id,
        VAR_ESTRAT = estrata
    );

estrato_A

id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
1    A         π1    e1       48   0              0                 e1
2    A         π2    e2       20   0              e2                0

estrato_B

id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
3    B         π3    e3       20   0              e3                0
4    B         π4    e4       01   e4             0                 0
5    B         π5    e5       48   0              0                 e5

estrato_C

id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
6    C         π6    e6       01   e6             0                 0
7    C         π7    e7       20   0              e7                0

In this case, the macro variable datos_estrat holds the values:

    &datos_estrat. = estrato_A estrato_B estrato_C

• Finally, we call the echant_strat macro, which uses the Cube Method to select the balanced sample in stratified designs.

    %echant_strat(
        DATA = &datos_estrat.,
        ID = id,
        PI = pik,
        CONTR = pik &contr_categ.,
        SORT = sample
    );

The macro output would look like this:

sample

id   ech
1    ech1
2    ech2
3    ech3
4    ech4
5    ech5
6    ech6
7    ech7

where

\[
ech_k =
\begin{cases}
1 & \text{if unit } k \text{ has been selected} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{for all } k \in \{1, \dots, 7\}.
\]

* Comment: On some occasions, the aim may be to balance the sample on totals that refer to the sample units themselves. For instance, suppose that in the preceding case we wanted to balance the sample on the number of establishments per Province. In that case, we need to create a variable that takes the value 1 for all the units, and pass it to the %disjunctive macro to create the desired balancing variables.
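As a rough illustration of what the disjunctive step produces, the following Python sketch builds disjunctive variables for a small table, both for a variable of interest and for a constant variable equal to 1. It is not the macro's code, which is not reproduced in this document. The function name `disjunctive`, the dictionary-based rows and the employee counts are assumptions made up for the example; only the rule "keep the value when the category matches, 0 otherwise" comes from the text.

```python
def disjunctive(rows, var, categ, names):
    """Return copies of `rows` with one new column per category of `categ`.

    `names` maps each category value to the suffix used in the new
    column name, e.g. {"01": "Araba"} gives the column "<var>_Araba".
    """
    out = []
    for row in rows:
        new_row = dict(row)
        for value, suffix in names.items():
            # Keep the value when the unit belongs to the category, else 0.
            new_row[f"{var}_{suffix}"] = row[var] if row[categ] == value else 0
        out.append(new_row)
    return out


# Hypothetical population: the employee counts are invented for illustration.
population = [
    {"id": 1, "TH": "48", "employ": 12, "ONE": 1},
    {"id": 2, "TH": "20", "employ": 5, "ONE": 1},
    {"id": 3, "TH": "01", "employ": 30, "ONE": 1},
]
provinces = {"01": "Araba", "20": "Gipuzkoa", "48": "Bizkaia"}

by_employees = disjunctive(population, "employ", "TH", provinces)
by_units = disjunctive(population, "ONE", "TH", provinces)

# Column totals: employees per Province and establishments per Province.
employees_bizkaia = sum(r["employ_Bizkaia"] for r in by_employees)
units_araba = sum(r["ONE_Araba"] for r in by_units)
```

Summing a disjunctive column gives the corresponding population total per Province, which is the kind of quantity the balancing constraints refer to.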
data

id   estrata   pik   employ   TH   ONE
1    A         π1    e1       48   1
2    A         π2    e2       20   1
3    B         π3    e3       20   1
4    B         π4    e4       01   1
5    B         π5    e5       48   1
6    C         π6    e6       01   1
7    C         π7    e7       20   1

    %global contr_categ;
    %disjunctive(
        DATA = data,
        VAR = ONE,
        CATEG = TH,
        NOMBRES_CATEG = Araba Gipuzkoa Bizkaia
    );

data

id   estrata   pik   employ   TH   ONE   ONE_Araba   ONE_Gipuzkoa   ONE_Bizkaia
1    A         π1    e1       48   1     0           0              1
2    A         π2    e2       20   1     0           1              0
3    B         π3    e3       20   1     0           1              0
4    B         π4    e4       01   1     1           0              0
5    B         π5    e5       48   1     0           0              1
6    C         π6    e6       01   1     1           0              0
7    C         π7    e7       20   1     0           1              0

7. Balanced samples in Eustat using the cube method

Some of the sampling designs that have been balanced in Eustat using the Cube Method are presented next. For each case the design is described: the technical datasheet, the stratification variables, the allocations and inclusion probabilities, and the variables on which the sample was balanced. Some of the results obtained are also presented.

Sample of ESO (Compulsory Secondary Education) centres for a study on bullying in the Basque Country

The Department of Education, Universities and Research, through the Basque Country Institute for Educational Assessment and Research (ISEI-IVEI), conducted a survey on bullying in Basque Country schools. To that end, a cluster sample of schools had to be drawn, with a maximum of 40 students assessed per selected centre.

Technical Datasheet

• Framework

The sampling frame comprised the secondary schools in the Basque Country that had at least one group in the 1st, 2nd, 3rd and 4th years of ESO.

• Sample design

The sample had unequal allocations, with subsampling in the second stage.

1st stage

Sampling units
Secondary schools in the Basque Country.

Stratification
The centres were selected by stratified sampling, with strata defined by Province and system (public and private school systems).

Allocation
Proportional to the number of centres in each stratum.
Draw
The draw was made with probability proportional to size (PPS), with size measured by the number of students per centre.

2nd stage

Sampling units
Secondary school students in the Basque Country.

Stratification
40 students per selected centre (10 from the 1st year, 10 from the 2nd, 10 from the 3rd and 10 from the 4th), whenever possible. There was no minimum number of students per centre.

Draw
Simple random sampling.

The final sample was self-weighted within strata (Province and system).

• Sample size

The optimal sample size, in number of centres, was calculated according to the following formula:

\[
n_{centers} = \frac{n_a \, [1 + \delta (M - 1)]}{M}
\]

where $n_a$ is the sample size for a simple random sample and the factor $1 + \delta(M - 1)$ is the so-called design effect in cluster sampling, with

M = average number of students per centre
δ = intra-centre correlation

\[
n_a = \frac{N z_{\alpha/2}^2 S^2}{N e^2 + z_{\alpha/2}^2 S^2}
    = \frac{N}{1 + (N - 1)\left[\dfrac{e^2}{z_{\alpha/2}^2 \, pq}\right]}
\]

N = total number of students (elementary units)
e = maximum acceptable error
$z_{\alpha/2}$ = critical value for the significance level α

The second expression is the form for a proportion p, with q = 1 − p and $S^2 = Npq/(N - 1)$.

• Balancing variables

The sample was balanced on the following variables:

- Number of students by year and number of groups by year. Thus, the estimates of the average number of students per centre and per group come as close as possible to the data provided by Education Statistics.
- Number of centres in each size class. Centre size was coded into 5 groups that minimize intra-class inertia with respect to the number of students: [0-143], [144-243], [244-361], [362-506] and [507-708].

Results

The results obtained for the balancing variables using the Cube Method are shown below. Each table compares the population distribution with the one obtained with sample weighting. The percentages are given by columns.
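The "Sample (weighted)" columns in tables of this kind are weighted totals of the balancing variables. A minimal sketch of the calculation, assuming that the weights are the inverse inclusion probabilities of the Horvitz-Thompson estimator (the document does not say whether any further adjustment is applied in this survey); the sample data and the function names are hypothetical:

```python
def weighted_totals(sample, categories):
    """Horvitz-Thompson totals of category counts: sum of 1/pi_k per category."""
    totals = {c: 0.0 for c in categories}
    for unit in sample:
        totals[unit["category"]] += 1.0 / unit["pik"]
    return totals


def column_percentages(totals):
    """Express each total as a percentage of the column total."""
    grand_total = sum(totals.values())
    return {c: 100.0 * t / grand_total for c, t in totals.items()}


# Hypothetical sample of four units with their inclusion probabilities.
sample = [
    {"category": "Size 1", "pik": 0.25},
    {"category": "Size 1", "pik": 0.5},
    {"category": "Size 2", "pik": 0.5},
    {"category": "Size 2", "pik": 0.5},
]
totals = weighted_totals(sample, ["Size 1", "Size 2"])
shares = column_percentages(totals)
```

In a balanced sample these weighted totals should be close to the known population totals, which is what each pair of columns allows the reader to check.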
Distribution of the number of students by year

                 Population          Sample (weighted)
1st year ESO     19,664 (27.21%)     19,617 (27.14%)
2nd year ESO     18,633 (25.78%)     18,649 (25.80%)
3rd year ESO     17,669 (24.45%)     17,764 (24.58%)
4th year ESO     16,306 (22.56%)     16,243 (22.47%)
TOTAL            72,272              72,272

Distribution of the number of groups by year

                 Population          Sample (weighted)
1st year ESO     870 (25.02%)        869 (25.04%)
2nd year ESO     852 (24.50%)        849 (24.47%)
3rd year ESO     896 (25.77%)        896 (25.82%)
4th year ESO     859 (24.71%)        856 (24.67%)
TOTAL            3,477               3,470

Distribution of the number of centres by type of size

                 Population          Sample (weighted)
Size 1           100 (30.12%)        95 (28.79%)
Size 2           128 (38.55%)        129 (39.09%)
Size 3           61 (18.37%)         63 (19.09%)
Size 4           31 (9.34%)          31 (9.39%)
Size 5           12 (3.61%)          12 (3.64%)
TOTAL            332                 330

Because of the variables on which the sample was balanced, very good estimates of the average number of students per centre and per group were also obtained for each year.

ACADEMIC YEAR 2011-12

                 Student average by centres           Student average by groups
                 Population   Sample (weighted)       Population   Sample (weighted)
1st year ESO     59.23        59.44                   22.60        22.57
2nd year ESO     56.21        56.51                   21.90        21.97
3rd year ESO     53.22        53.83                   19.72        19.83
4th year ESO     49.11        49.22                   18.98        18.98
TOTAL            217.69       219.00                  20.79        20.33

Sample for the Information Society Survey (ESI-Companies)

The general aim of the ESI carried out by EUSTAT is to provide politicians, economic and social stakeholders, universities, private researchers and the general public with periodic information on the penetration of the new information technologies and ICTs in Basque Country companies. The ESI-Companies sample is a panel that every year includes the companies that answered previous editions of the survey. Owing to various incidents (deregistrations, substitutions, non-response, etc.) the original sample distribution breaks down.
Therefore, it was decided to update the sample with a new sample distribution that would preserve the original design and reflect the new distribution of the population in the strata. In 2012, it was decided to renew nearly 15% of the panel. Moreover, the Cube Method was introduced to select balanced samples, with the aim of obtaining a balanced distribution across the Basque Country regions.

Technical Datasheet

• Framework

The sampling frame comprised establishments in any sector of activity operating in the Basque Country, except the primary sector and domestic services.

• Sample design

It was a one-stage stratified sample.

Sampling units
The establishments in the aforementioned framework.

Stratification
The strata were formed by crossing the following variables:
- Province: 1 = Araba; 2 = Bizkaia; 3 = Gipuzkoa
- Employment stratum: 1 = 0-5 employees; 2 = 6-9 employees; 3 = 10-19 employees; 4 = 20-49 employees; 5 = 50-99 employees; 6 = 100 and more employees
- Sector of activity (CNAE09, 2-digit level)

Allocation
Self-represented elements: establishments with 100 or more employees (employment stratum 6).

Two different allocations were made for the rest of the establishments:

1. A distribution proportional to the square root of the number of establishments per province and directly proportional to the number of establishments per stratum (province, activity and employment) was made on the basis of the sample size preset in the original design, n = 7000. The sample size in each stratum was calculated according to the following formula:

\[
n_{PROV_i\,Act_j\,Emp_k} = n_{PROV_i} \,
\frac{estab_{PROV_i\,Act_j\,Emp_k}}{\sum_{j \in Act} \sum_{k=1}^{5} estab_{PROV_i\,Act_j\,Emp_k}}
\]

where

\[
n_{PROV_i} = (7000 - census) \,
\frac{\sqrt{estab_{PROV_i}}}{\sum_{i=1}^{3} \sqrt{estab_{PROV_i}}},
\qquad i = 1, 2, 3.
\]

Finally, establishments were added until a minimum size of 5 establishments was reached in each grouped employment stratum (fewer than 10 employees, and 10 or more employees).

2. Distribution according to a 10% maximum sampling error in each sector of activity (without taking the census strata into consideration). The sample size in each stratum was calculated according to the following formula:

\[
n_h = \frac{N_h z_{\alpha/2}^2 S_h^2}{N_h e^2 + z_{\alpha/2}^2 S_h^2}
    = \frac{N_h}{1 + (N_h - 1)\left[\dfrac{e^2}{z_{\alpha/2}^2 \, pq}\right]}
\]

where

$N_h$ = number of establishments in stratum h
e = maximum acceptable error
$z_{\alpha/2}$ = critical value for the significance level α

After the two allocations were made, the missing units were distributed until the sample size needed for non-census units was reached. This distribution was made in proportion to the size of the strata in the sectors that were underrepresented compared to the first allocation. Finally, the allocations per sector of activity were distributed in proportion to the square root in each province and grouped employment stratum.

Drawing
Simple random sampling is conducted in each stratum, giving priority to the establishments flagged as new registrations in the framework.

• Balancing variables

The sample has been balanced on the number of establishments in each region (20 regions) in order to obtain better estimates at regional level.

• Substitutes

A pool of around 3,500 substitute establishments is needed to complete the sample. The number of substitutes per stratum is proportional to the theoretical sample in the employment and province strata. As in the main sample, the substitutes sample is balanced with the Cube Method on the number of establishments in each region.

Results

The results obtained with the Cube Method when balancing on the number of establishments per region are given below.
Distribution of the number of establishments by region

                            Population          Sample (weighted)
Valles Alaveses             405 (0.22%)         523 (0.29%)
Llanada Alavesa             18,903 (10.49%)     19,063 (10.58%)
Montaña Alavesa             248 (0.14%)         257 (0.14%)
Rioja Alavesa               1,311 (0.73%)       1,135 (0.63%)
Estribaciones del Gorbea    780 (0.43%)         749 (0.42%)
Cantábrica Alavesa          2,180 (1.21%)       2,099 (1.16%)
Arratia-Nervión             1,787 (0.99%)       1,399 (0.78%)
Gran Bilbao                 73,572 (40.82%)     72,517 (40.24%)
Durangaldea                 7,517 (4.17%)       7,795 (4.33%)
Encartaciones               2,356 (1.31%)       2,364 (1.31%)
Gernika-Bermeo              3,425 (1.90%)       3,364 (1.87%)
Markina-Ondarroa            1,828 (1.01%)       2,446 (1.36%)
Plentzia-Mungia             4,008 (2.22%)       4,609 (2.56%)
Bajo Bidasoa                7,169 (3.98%)       8,343 (4.63%)
Bajo Deba                   4,191 (2.33%)       4,989 (2.77%)
Alto Deba                   4,197 (2.33%)       4,742 (2.63%)
Donostialdea                31,422 (17.44%)     28,724 (15.94%)
Goierri                     4,929 (2.73%)       5,192 (2.88%)
Tolosaldea                  4,029 (2.24%)       4,105 (2.28%)
Urola Costa                 5,966 (3.31%)       5,809 (3.22%)
TOTAL                       180,223             180,223

The percentages are given by columns.

Sample for the Social Capital Survey (ECS)

Social capital is understood as a resource that becomes available to people who have broad personal networks and take an active part in various economic and social spheres, in an environment of trust that can facilitate personal and social development as well as the economic development of society. Specifically, in the Social Capital Survey carried out by Eustat, social capital is defined as a set of dimensions of social participation and relationships that include: social networks of friends and family; trust in people and institutions; social participation and cooperation; information and communication; social cohesion and integration; and health and happiness.

In 2012, it was decided to use the Cube Method to select the sample for the Social Capital Survey.
Thus, we have obtained a sample balanced by sex and age in each Province, which also helps to improve estimates at the regional level.

Technical Datasheet

• Framework

The sampling frame of the Social Capital Survey comprises the population aged 15 and over that resides in houses and collective establishments in the Basque Country.

• Sample design

It was a one-stage stratified sample.

Sampling units
Population aged 15 and over that resides in houses and collective establishments in the Basque Country.

Sample size
n = 7000 individuals were selected.

Stratification
The strata were formed by crossing the following variables:
- Province: 01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia
- Size of the municipality: Capital cities, Medium-sized (20,000-100,000) and Small (20,000 or less)
- Nationality: 0 = National; 1 = Foreigners

Allocation
A criterion has been established for each level of stratification:
1. Distribution proportional to the square root of the number of individuals per Province.
2. Distribution proportional to the number of individuals by size of the municipality.
3. Distribution proportional to the 2/3 power of the number of individuals per nationality.

The non-response rates in the previous survey (ECS 2007) were taken into consideration when choosing the best allocation at the third level. Similar response rates can be expected for this survey, since the methods used to gather the information are the same. Therefore, an allocation was sought that, given those response rates, would reach the minimum sample size needed (around 400 units) to provide estimates for the capital cities and for the foreign population.
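The three criteria can be read as successive power allocations: at each level a total is shared among groups in proportion to N raised to an exponent (1/2, 1 and 2/3 respectively). The sketch below illustrates that reading only; the function name `power_allocation` and the population counts are hypothetical, and rounding and minimum-size adjustments are ignored.

```python
def power_allocation(n_total, sizes, exponent):
    """Share n_total among groups in proportion to size ** exponent."""
    weights = {g: size ** exponent for g, size in sizes.items()}
    total_weight = sum(weights.values())
    return {g: n_total * w / total_weight for g, w in weights.items()}


# Hypothetical population counts, chosen as perfect squares and cubes
# so that the results are easy to check by hand.
provinces = {"A": 10_000, "B": 40_000}  # square roots: 100 and 200
by_province = power_allocation(300, provinces, 0.5)

nationalities = {"National": 27_000, "Foreign": 1_000}  # N^(2/3): 900 and 100
by_nationality = power_allocation(by_province["A"], nationalities, 2 / 3)
```

An exponent below 1 shifts sample towards the smaller groups: here the foreign population is about 3.6% of the hypothetical total but receives about 10% of the sample, which is the purpose of using the 2/3 power for nationality.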
The sample size in each stratum was specified by the following formula:

\[
n_{PROV_i\,SIZE_j\,NAT_k} = n_{PROV_i\,SIZE_j} \,
\frac{\left(N_{PROV_i\,SIZE_j\,NAT_k}\right)^{2/3}}{\sum_{k} \left(N_{PROV_i\,SIZE_j\,NAT_k}\right)^{2/3}}
\]

where

\[
n_{PROV_i\,SIZE_j} = 7000 \,
\frac{\sqrt{N_{PROV_i}}}{\sum_{i} \sqrt{N_{PROV_i}}} \,
\frac{N_{PROV_i\,SIZE_j}}{\sum_{j} N_{PROV_i\,SIZE_j}}
\]

for i ∈ {Araba, Gipuzkoa, Bizkaia}, j ∈ {Capital, Medium, Small} and k ∈ {National, Foreigners}.

Drawing
Simple random sampling was carried out in each of the strata.

• Balancing variables

The sample was balanced on the following variables:

- Number of individuals in the cross of Province (Araba, Gipuzkoa, Bizkaia), Sex (Men and Women) and Age (15-24, 25-34, 35-44, 45-54, 55-64 and over 65).
- Number of individuals in each of the 20 regions in the Basque Country.

• Substitutes

A pool of another 7,000 substitute individuals is needed to complete the sample. The substitutes have been drawn preserving the same sample distribution by strata as in the original sample, and balancing on the same variables as for the main sample.

Results

The results obtained for the balancing variables using the Cube Method are shown below. Each table compares the population distribution with the one obtained with sample weighting. The percentages are given by columns.
Distribution by Province, Sex and Age

Province = ARABA (01)

                           Men                                  Women                                TOTAL
               Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
15-24 years    13,818 (10.06%)   13,729 (10.02%)     12,831 (9.24%)    12,762 (9.17%)      26,649 (9.65%)    26,491 (9.59%)
25-34 years    23,028 (16.77%)   22,923 (16.73%)     21,541 (15.51%)   21,725 (15.60%)     44,569 (16.13%)   44,648 (16.16%)
35-44 years    28,954 (21.08%)   28,948 (21.13%)     26,298 (18.93%)   26,278 (18.87%)     55,252 (20.0%)    55,226 (19.99%)
45-54 years    24,889 (18.12%)   24,895 (18.17%)     24,891 (17.92%)   25,039 (17.98%)     49,780 (18.02%)   49,934 (18.08%)
55-64 years    20,051 (14.60%)   19,942 (14.55%)     20,355 (14.65%)   20,332 (14.60%)     40,406 (14.63%)   40,274 (14.58%)
Over 65        26,584 (19.36%)   26,590 (19.40%)     33,009 (23.76%)   33,086 (23.76%)     59,593 (21.57%)   59,676 (21.60%)
TOTAL          137,324 (100%)    137,027 (100%)      138,925 (100%)    139,222 (100%)      276,249 (100%)    276,249 (100%)

Province = GIPUZKOA (20)

                           Men                                  Women                                TOTAL
               Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
15-24 years    30,206 (10.18%)   30,273 (10.22%)     28,416 (9.09%)    28,371 (9.07%)      58,622 (9.62%)    58,644 (9.63%)
25-34 years    45,461 (15.32%)   45,452 (15.34%)     43,313 (13.86%)   43,517 (13.91%)     88,774 (14.57%)   88,968 (14.60%)
35-44 years    60,481 (20.39%)   60,491 (20.41%)     56,318 (18.02%)   56,361 (18.01%)     116,799 (19.17%)  116,852 (19.18%)
45-54 years    54,351 (18.32%)   54,228 (18.30%)     54,409 (17.41%)   54,480 (17.41%)     108,760 (17.85%)  108,707 (17.84%)
55-64 years    45,126 (15.21%)   44,881 (15.14%)     46,428 (14.85%)   46,525 (14.87%)     91,554 (15.03%)   91,406 (15.0%)
Over 65        61,051 (20.58%)   61,021 (20.59%)     83,677 (26.77%)   83,638 (26.73%)     144,728 (23.76%)  144,659 (23.74%)
TOTAL          296,676 (100%)    296,346 (100%)      312,561 (100%)    312,891 (100%)      609,237 (100%)    609,237 (100%)

Province = BIZKAIA (48)

                           Men                                  Women                                TOTAL
               Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
15-24 years    47,497 (9.80%)    47,673 (9.83%)      45,007 (8.59%)    45,152 (8.62%)      92,504 (9.17%)    92,825 (9.20%)
25-34 years    76,941 (15.87%)   76,969 (15.88%)     73,755 (14.07%)   73,658 (14.06%)     150,696 (14.94%)  150,627 (14.93%)
35-44 years    97,104 (20.03%)   97,136 (20.04%)     93,542 (17.85%)   93,318 (17.81%)     190,646 (18.90%)  190,454 (18.88%)
45-54 years    90,348 (18.64%)   90,178 (18.60%)     93,048 (17.75%)   92,807 (17.71%)     183,396 (18.18%)  182,985 (18.14%)
55-64 years    72,330 (14.92%)   72,308 (14.91%)     77,119 (14.71%)   77,329 (14.76%)     149,449 (14.81%)  149,637 (14.83%)
Over 65        100,487 (20.73%)  100,558 (20.74%)    141,669 (27.03%)  141,762 (27.05%)    242,156 (24.0%)   242,320 (24.02%)
TOTAL          484,707 (100%)    484,821 (100%)      524,140 (100%)    524,026 (100%)      1,008,847 (100%)  1,008,847 (100%)

Distribution of the number of individuals by region

                            Population           Sample (weighted)
Valles Alaveses             5,107 (0.27%)        5,051 (0.27%)
Llanada Alavesa             221,595 (11.69%)     221,680 (11.69%)
Montaña Alavesa             2,855 (0.15%)        2,886 (0.15%)
Rioja Alavesa               9,852 (0.52%)        9,835 (0.52%)
Estribaciones del Gorbea    7,296 (0.38%)        7,292 (0.38%)
Cantábrica Alavesa          30,043 (1.58%)       30,004 (1.58%)
Arratia-Nervión             20,289 (1.07%)       20,386 (1.08%)
Gran Bilbao                 768,311 (40.53%)     767,962 (40.51%)
Durangaldea                 83,470 (4.40%)       83,513 (4.41%)
Encartaciones               27,787 (1.47%)       27,742 (1.46%)
Gernika-Bermeo              40,183 (2.12%)       40,331 (2.13%)
Markina-Ondarroa            23,128 (1.22%)       23,333 (1.23%)
Plentzia-Mungia             46,104 (2.43%)       46,202 (2.44%)
Bajo Bidasoa                66,418 (3.50%)       66,403 (3.50%)
Bajo Deba                   47,748 (2.52%)       47,664 (2.51%)
Alto Deba                   53,540 (2.82%)       53,584 (2.83%)
Donostialdea                282,424 (14.90%)     282,508 (14.90%)
Goierri                     57,859 (3.05%)       57,781 (3.05%)
Tolosaldea                  40,147 (2.12%)       40,193 (2.12%)
Urola Costa                 61,490 (3.24%)       61,462 (3.24%)
TOTAL                       1,895,729            1,895,729

Sample for the Technological Innovation Survey (EIT)

The principal aim of the EIT carried out by EUSTAT is to learn more about the effort made to innovate in several sectors of the
economy, and to obtain a series of indicators that allow the level reached in the Basque Country to be compared with that of surrounding countries. The EIT sample is a panel that every year includes the companies that answered previous editions of the survey. As in the case of the ESI-Companies survey, the original distribution of the sample deteriorated owing to several incidents (registrations, cancellations, modifications, etc.). Therefore the sample is updated according to a new sample distribution that follows the new distribution of the population in the strata while preserving the original design. In 2012, it was decided to renew nearly 7% of the panel. Moreover, the Cube Method was introduced to select balanced samples, with the aim of obtaining a balanced distribution across the Basque Country regions and their capitals.

Technical Datasheet

• Framework

The sampling frame comprises the establishments in any sector of activity that operate in the Basque Country, except the primary sector, public administration, association activities, household activities, and extraterritorial organisations and bodies.

• Sample design

It was a one-stage stratified sample.

Sampling units
The establishments in the aforementioned framework.

Stratification
The strata were formed by crossing the following variables:
- Province: 1 = Araba; 2 = Bizkaia; 3 = Gipuzkoa
- Employment stratum: 1 = 0-9 employees; 2 = 10-49 employees; 3 = 50-249 employees; 4 = 250 or more employees
- Sector of activity (CNAE09, 2-digit level)

Allocation
Self-represented elements: establishments with 250 employees or more (employment stratum 4), or establishments that correspond to activity 46 in employment strata 2 and 3.

For the other establishments, the following theoretical allocation is set:
- 2400 establishments are distributed among the strata of 10 or more employees and 750 establishments among the strata with fewer than 10 employees.
- The distribution is carried out in proportion to the square root of the number of establishments by province and employment stratum. Subsequently, another allocation is made in proportion to the square root of the number of establishments by activity stratum. In other words, the sample size in each stratum is specified by the following formula:

\[
n_{PROV_i\,Emp_j\,Act_k} = n_{PROV_i\,Emp_j} \,
\frac{\sqrt{estab_{PROV_i\,Emp_j\,Act_k}}}{\sum_{k \in Act} \sqrt{estab_{PROV_i\,Emp_j\,Act_k}}}
\]

where

\[
n_{PROV_i\,Emp_j} =
\begin{cases}
750 \, \dfrac{\sqrt{estab_{PROV_i\,Emp_j}}}{\sum_{j = 1} \sqrt{estab_{PROV_i\,Emp_j}}} & \text{for employ} < 10 \\[2ex]
2400 \, \dfrac{\sqrt{estab_{PROV_i\,Emp_j}}}{\sum_{j \in \{2, 3\}} \sqrt{estab_{PROV_i\,Emp_j}}} & \text{for employ} \geq 10
\end{cases}
\qquad i \in \{01, 20, 48\}, \quad j \in \{1, 2, 3\}.
\]

Finally, establishments are added until a minimum size of 5 establishments is reached in each stratum.

After the theoretical sizes needed for each stratum have been calculated, the units that the panel already has are subtracted to obtain the number of units to take from each stratum. Specifically, 771 establishments had to be taken in 2012.

Draw
Simple random sampling is conducted in each stratum, giving priority to the establishments flagged as new registrations in the framework.

• Balancing variables

The sample for employment strata 2 and 3 (10 or more employees) has been balanced on the number of establishments in each region (20 regions) and in the capital cities, to obtain better regional estimates.

• Substitutes

A pool of substitutes is needed to complete the sample. Therefore, 5 establishments are taken from each stratum that is not complete. In 2012, 1,950 reserve establishments were drawn. As in the main sample, the substitutes sample is balanced with the Cube Method on the number of establishments in each region and in the capitals.

Results

The results obtained with the Cube Method when balancing on the number of establishments per region and capital city are given below.
Distribution of the number of establishments by region and capital cities (10 or more employees)

                                       Population         Sample (weighted)
Valles Alaveses                        50 (0.40%)         64 (0.51%)
Llanada Alavesa (no capital city)      102 (0.81%)        69 (0.54%)
Montaña Alavesa                        14 (0.11%)         19 (0.15%)
Rioja Alavesa                          105 (0.83%)        93 (0.74%)
Estribaciones del Gorbea               97 (0.77%)         156 (1.23%)
Cantábrica Alavesa                     185 (1.47%)        234 (1.86%)
Arratia-Nervión                        135 (1.07%)        114 (0.91%)
Gran Bilbao (without the capital)      2,931 (23.26%)     2,597 (20.61%)
Durangaldea                            648 (5.14%)        556 (4.41%)
Encartaciones                          111 (0.88%)        217 (1.72%)
Gernika-Bermeo                         162 (1.29%)        271 (2.15%)
Markina-Ondarroa                       103 (0.82%)        192 (1.52%)
Plentzia-Mungia                        200 (1.59%)        333 (2.64%)
Bajo Bidasoa                           373 (2.96%)        385 (3.06%)
Bajo Deba                              359 (2.85%)        290 (2.30%)
Alto Deba                              366 (2.90%)        490 (3.88%)
Donostialdea (without the capital)     910 (7.22%)        841 (6.67%)
Goierri                                334 (2.65%)        387 (3.07%)
Tolosaldea                             311 (2.47%)        419 (3.32%)
Urola Costa                            390 (3.09%)        263 (2.09%)
Vitoria-Gasteiz                        1,548 (12.28%)     1,467 (11.64%)
Bilbao                                 1,979 (15.70%)     1,988 (15.78%)
Donostia-San Sebastián                 1,190 (9.44%)      1,158 (9.19%)
TOTAL                                  12,603             12,603

The percentages are given by columns.

• Notes:

1. A post-stratification was carried out to calculate the weights for the number of establishments per region. The activity strata were grouped according to sector aggregation A38 (CNAE09), since it is the aggregation used in dissemination.

2. Very good estimates of the number of establishments in the three capitals were obtained.

3. In the other regions, although most were properly estimated, several show a high relative error: Estribaciones del Gorbea, Encartaciones, Gernika-Bermeo, Markina-Ondarroa, Plentzia-Mungia, Tolosaldea and Urola Costa.

4. In these seven regions the Cube Method did not reach a sampling solution with better results because of the constraints imposed by the design:
- Despite a sample size of 2,900 establishments, only 410 were in the draw, because the rest came from the panel or from the census strata.
- Moreover, establishments were selected in only 173 of the 401 strata defined by the cross of Province, activity and employment.
- Finally, in 21 of the 173 strata in which the draw actually took place, the establishment to be selected was pre-determined (because priority had to be given to registrations).

Sample for the Poverty and Social Inequality Survey (EPDS)

The Poverty and Social Inequality Survey (EPDS) is highly important to the Department of Justice, Employment and Social Security because it is connected to the evaluation and programming of its economic benefits. That is why it is particularly important to consolidate a sampling design that approaches the target group as accurately as possible.

In general, the main goal of the EPDS is to know, study and assess the various lines of poverty, their incidence in the Basque Country, and the indicators associated with social inequality.

In 2012, it was decided to use the Cube Method to select the sample for the EPDS. This has enabled us to obtain a sample balanced by sex, age, nationality and family size in each Province.

Technical Datasheet

• Framework

The sampling frame of the Poverty and Social Inequality Survey comprises the occupied family dwellings in the Basque Country and its provinces.

• Sample design

A two-stage sample with stratification in the first stage and a fixed sample size in the second.

Sampling units
Occupied family dwellings in the Basque Country.

Sample size
Around 4,000 survey units were selected, together with around 8,000 substitute units (two per sampling unit).
First stage: Sections sample

In the first stage, census sections in the Basque Country are drawn.

o Stratification

The units in the first stage are stratified by crossing the following variables:

- Regions and areas
01 = Añana; 02 = Ayala/Aiara; 03 = Campezo-Montaña Alavesa; 04 = Laguardia-Rioja Alavesa; 05 = Salvatierra/Agurain; 06 = Vitoria-Gasteiz; 07 = Zuia; 08 = Donostialdea; 09 = Tolosaldea-Goierri; 10 = Alto-Deba; 11 = Bajo-Deba; 12 = Margen Derecha; 13 = Bilbao; 14 = Margen Izquierda; 15 = Bizkaia Costa; 16 = Duranguesado

- Typologies
An EPDS-specific analysis of the types of census sections is carried out in Eustat. To this end, the following basic variables are taken into consideration: age, sex, nationality, relationship with economic activity, number of residents in the dwelling, and average personal and family income. After a Principal Component Analysis is carried out, the sections are classified into 7 types.

- Predominance of young people
With the aim of over-representing in the sample the areas characterised by a strong relative presence of people under age 45, the sections are classified into two groups:
1 = Sections with a predominance of young people
0 = Other sections
In the second stage, 24 dwellings are drawn in each "youth" section and 16 dwellings in each of the other sections.

o Allocation

The 4,000 dwellings are distributed according to the following allocations:
1. Distribution proportional to the square root of the number of dwellings per Province.
2. Distribution proportional to the square root of the number of dwellings per region/area.
3. Distribution proportional to the number of dwellings by type and section type ("youth"/"non-youth").

A minimum size of 160 dwellings per region and 112 dwellings in the Álava region is required.

o Draw

The sections were drawn with probability proportional to size (PPS), with size measured by the number of occupied dwellings.
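For a draw proportional to size, first-order inclusion probabilities are commonly set to n times the unit's share of the total size, with any unit whose value would exceed 1 taken with certainty and the remaining probabilities recomputed. The sketch below shows that standard construction; whether the EPDS draw follows it exactly is an assumption, and the function name and section sizes are invented for the example.

```python
def pps_inclusion_probabilities(sizes, n):
    """Inclusion probabilities proportional to size, capped at 1.

    Units whose probability would reach 1 are taken with certainty and
    the remaining sample size is shared among the other units.
    """
    pik = {k: 0.0 for k in sizes}
    remaining = dict(sizes)
    n_left = n
    while remaining:
        total = sum(remaining.values())
        certain = [k for k, x in remaining.items() if n_left * x / total >= 1.0]
        if not certain:
            for k, x in remaining.items():
                pik[k] = n_left * x / total
            break
        for k in certain:
            pik[k] = 1.0
            del remaining[k]
            n_left -= 1
    return pik


# Hypothetical census sections with their number of occupied dwellings.
sections = {"s1": 900, "s2": 50, "s3": 30, "s4": 20}
pik = pps_inclusion_probabilities(sections, n=2)
```

The resulting probabilities sum to the number of sections to be drawn, so the same variable can be passed both as the inclusion probability and as a balancing variable to keep the sample size fixed.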
Second stage: Dwellings sample

o Allocation

16 or 24 dwellings, depending on the type of section, were selected in each section drawn in the first stage.

o Draw

A simple random draw was made in each section selected in the first stage.

• Balancing variables

The sample was balanced on the same variables in the first and second stages. This guarantees that the final sample is balanced with respect to the complete dwellings framework. The balancing variables are as follows:

- Family size: Number of dwellings with 1 resident, 2 residents, 3-4 residents and more than 5 residents, by Province.
- Sex: Number of men and women by Province.
- Age: Number of individuals age 34 or less, age 35-44, 45-54 and over 65, by Province.
- Nationality: Number of Spanish and foreign individuals by Province.
- Number of individuals in each region/area.

• Substitutes

To complete the sample, a substitute and a reserve are drawn for each dwelling. The substitutes have been taken from each of the census sections selected in the first stage, balancing on the same variables as for the main-sample dwellings.

Results

The results obtained for the balancing variables using the Cube Method are shown below. Each table compares the population distribution with the one obtained with sample weighting. The percentages are given by columns.
Distribution of dwellings by Family Size and Province

                                  Araba                                Gipuzkoa                              Bizkaia
                        Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
1 resident              35,528 (27.77%)   35,440 (27.70%)     68,232 (24.97%)   68,553 (25.09%)     109,535 (24.44%)  112,675 (25.14%)
2 residents             37,537 (29.34%)   38,174 (29.84%)     78,075 (28.57%)   78,039 (28.56%)     130,825 (29.18%)  130,322 (29.07%)
3-4 residents           47,391 (37.04%)   47,735 (37.31%)     108,714 (39.78%)  108,381 (39.66%)    180,827 (40.34%)  178,194 (39.75%)
More than 5 residents   7,485 (5.85%)     6,592 (5.15%)       18,248 (6.68%)    18,295 (6.69%)      27,079 (6.04%)    27,075 (6.04%)
TOTAL                   127,941           127,941             273,269           273,269             448,266           448,266

Distribution by Sex and Province

                                  Araba                                Gipuzkoa                              Bizkaia
                        Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
Men                     157,836 (49.91%)  155,759 (49.63%)    344,561 (49.02%)  347,363 (49.48%)    553,674 (48.49%)  551,028 (48.53%)
Women                   158,392 (50.09%)  158,111 (50.37%)    358,350 (50.98%)  354,687 (50.52%)    588,197 (51.51%)  584,492 (51.47%)
TOTAL                   316,228           313,870             702,911           702,050             1,141,871         1,135,521

Distribution by Age and Province

                                  Araba                                Gipuzkoa                              Bizkaia
                        Population        Sample (weighted)   Population        Sample (weighted)   Population        Sample (weighted)
Less than 34 years      108,383 (34.27%)  109,676 (34.94%)    233,423 (33.21%)  234,644 (33.42%)    366,085 (32.06%)  363,674 (32.03%)
35-44 years             55,227 (17.46%)   49,691 (15.83%)     116,445 (16.57%)  116,922 (16.65%)    188,762 (16.53%)  194,045 (17.09%)
45-54 years             49,799 (15.75%)   49,939 (15.91%)     109,078 (15.52%)  107,384 (15.30%)    182,531 (15.99%)  179,632 (15.82%)
55-64 years             40,810 (12.91%)   43,836 (13.97%)     92,261 (13.13%)   91,599 (13.05%)     151,434 (13.26%)  146,342 (12.89%)
Over 65                 62,009 (19.61%)   60,729 (19.35%)     151,704 (21.58%)  151,501 (21.58%)    253,059 (22.16%)  251,828 (22.18%)
TOTAL                   316,228           313,870             702,911           702,050             1,141,871         1,135,521

Distribution by Nationality and Province

                                  Araba                                Gipuzkoa                              Bizkaia
                        Population        Sample (weighted)   Population        Sample (weighted)   Population          Sample (weighted)
National                286,633 (90.64%)  289,847 (92.35%)    658,599 (93.70%)  659,521 (93.94%)    1,067,272 (93.47%)  1,059,925 (93.34%)
Foreign                 29,595 (9.36%)    24,023 (7.65%)      44,312 (6.30%)    42,529 (6.06%)      74,599 (6.53%)      75,595 (6.66%)
TOTAL                   316,228           313,870             702,911           702,050             1,141,871           1,135,521

Distribution of the number of individuals by region/area

                            Population           Sample (weighted)
Añana                       8,617 (0.40%)        8,350 (0.39%)
Ayala/Aiara                 34,208 (1.58%)       33,894 (1.58%)
Campezo-Montaña Alavesa     3,156 (0.15%)        3,118 (0.14%)
Laguardia-Rioja Alavesa     11,414 (0.53%)       11,181 (0.52%)
Salvatierra/Agurain         12,255 (0.57%)       12,384 (0.58%)
Vitoria-Gasteiz             237,059 (10.97%)     235,576 (10.95%)
Zuia                        9,519 (0.44%)        9,368 (0.44%)
Donostialdea                472,708 (21.87%)     472,950 (21.98%)
Tolosaldea-Goierri          114,584 (5.30%)      113,420 (5.27%)
Alto Deba                   60,919 (2.82%)       60,945 (2.83%)
Bajo Deba                   54,700 (2.53%)       54,734 (2.54%)
Margen Derecha              161,425 (7.47%)      157,625 (7.33%)
Bilbao                      349,132 (16.16%)     348,884 (16.22%)
Margen Izquierda            386,068 (17.87%)     379,912 (17.66%)
Bizkaia Costa               126,504 (5.85%)      127,321 (5.92%)
Duranguesado                118,742 (5.49%)      121,778 (5.66%)
TOTAL                       2,161,010            2,151,441

Sample for the study of women in Basque rural areas

The Department of Agriculture, Fishing and Food wants to update the study that has been conducted since 1998 on "Women in Basque Rural Areas. Needs, Demands and Social Needs". In 2012, unlike in previous designs, a sample of women and a separate sample of men are to be taken, both aged 15 or over and living in the towns that the Department has designated as rural according to the criteria of size, population density and agricultural GDP ratio. The sample should comprise 250 men and 250 women in each of the Basque Country provinces.
Moreover, it was decided to use the Cube Method to select the sample, so as to obtain samples of men and women balanced by age, nationality, level of studies and type of dwelling (urban nucleus or scattered) in each Province.

Technical Datasheet
• Framework
The sample framework comprises the population aged 15 and older residing in family dwellings in the 128 municipalities classified as rural by the Department of Agriculture, Fishing and Food.

• Sample design
A two-stage design with stratification in the first stage was chosen, since the aim is to obtain a sample of women and a sample of men of the same size in rural municipalities. The allocations in the first and second stages are calculated so that the final sample of individuals is self-weighted by Province. Thus, after the rural municipalities have been drawn, the same number of men and women is drawn in each municipality.

Sample size
Around 250 men and 250 women are chosen in each Basque Country Province. Substitutes will not be selected because a booster sample will be drawn instead, taking into account the estimated non-response rate (46% in each Province).

First stage: Municipalities sample
In the first stage, a stratified draw is made from the 128 rural municipalities in the Basque Country.
o Sampling units
Rural municipalities in the Basque Country. These are clusters of individuals of different sizes.
o Stratification
In the first stage the units are stratified by:
- Province: 01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia;
- Size of the municipalities: the stratification of the municipalities by size is optimal in the sense that it minimizes the intra-class inertia (the internal variance of each stratum), taking the total inertia or variance as a benchmark. 1 = [0-569]; 2 = [570-1154]; 3 = [1155-1884]; 4 = [1885-3400]
o Allocation
The final aim is to draw 250 men and 250 women in each Province.
Substitutes will not be selected because a booster sample will be drawn instead, taking into account the estimated non-response rate (46% in each Province). The number of municipalities to be drawn is calculated as follows:
1. The 500 individuals for each Province are distributed proportionally to the size (population) of the strata.
2. The number of municipalities to be drawn in each Province is calculated as a multiple of the population sampling fraction.
3. These municipalities are distributed proportionally to the number of municipalities per stratum.
4. The municipalities sample is extended to include those that belong to size stratum 4.
o Draw
Once the theoretical distribution has been obtained, the rural municipalities are drawn by simple random sampling.

Second stage: Sample of men and women
In the second stage, the men and women to be surveyed are selected.
o Sampling units
Men and women aged 15 and older who belong to the rural municipalities selected in the first stage.
o Allocation
For each rural municipality selected in the first stage, the number of men and women to be drawn is proportional to the size of the municipality within its stratum. In other words:

$$ n_{MUN_i} = n_h \, \frac{Pop_{MUN_i}}{Pop_h} $$

where MUN_i are the rural municipalities selected in the first stage and h is the stratum of that municipality.
o Draw
Two independent simple random samples are taken from the subpopulations of men and women in each municipality. The final sample is approximately self-weighted by Province.

• Balancing variables
The sample was balanced on the same variables in the first and second stages. This guarantees that the final sample is balanced on the complete individuals framework. The balancing variables are as follows:
Sex: Number of men and women by Province.
Age: Number of individuals aged 15-25, 26-39, 40-54, 55-64 and over 65, by Province.
Nationality: Number of Spanish and foreign individuals by Province.
Studies: Number of individuals with primary, secondary or higher studies, by Province.
Type of dwelling: Number of individuals residing in dwellings of the nucleus or scattered type.

Results
The results obtained for the balancing variables using the Cube Method are shown below. Each table compares the population distribution with the one obtained from the weighted sample. The percentages are given by columns.

Distribution by Age and Province

SEX = MEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
15 - 25 years             1,705 (9.70%)       1,676 (9.53%)       1,231 (10.41%)      1,236 (10.45%)      1,769 (8.90%)       1,807 (9.09%)
26 - 39 years             3,706 (21.08%)      3,634 (20.67%)      2,958 (25.01%)      2,988 (25.26%)      4,354 (21.91%)      4,383 (22.06%)
40 - 54 years             5,746 (32.68%)      5,807 (33.03%)      3,396 (28.71%)      3,320 (28.07%)      6,169 (31.05%)      6,260 (31.51%)
55 - 64 years             2,698 (15.35%)      2,730 (15.53%)      1,802 (15.23%)      1,809 (15.29%)      3,191 (16.06%)      3,050 (15.35%)
Over 65                   3,727 (21.20%)      3,734 (21.24%)      2,442 (20.64%)      2,476 (20.93%)      4,386 (22.07%)      4,369 (21.99%)
TOTAL                     17,582              17,852              11,829              11,829              19,869              19,869

SEX = WOMEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
15 - 25 years             1,552 (9.91%)       1,624 (10.37%)      1,164 (10.73%)      1,133 (10.45%)      1,716 (8.99%)       1,655 (8.67%)
26 - 39 years             3,351 (21.39%)      3,309 (21.12%)      2,709 (24.98%)      2,658 (24.51%)      3,970 (20.81%)      4,058 (21.27%)
40 - 54 years             4,694 (29.96%)      4,749 (30.31%)      2,880 (26.56%)      2,870 (26.47%)      5,398 (28.29%)      5,403 (28.32%)
55 - 64 years             2,133 (13.61%)      2,067 (13.19%)      1,416 (13.06%)      1,481 (13.66%)      2,714 (14.23%)      2,708 (14.19%)
Over 65                   3,938 (25.13%)      3,918 (25.01%)      2,675 (24.67%)      2,703 (24.93%)      5,281 (27.68%)      5,255 (27.54%)
TOTAL                     15,668              15,668              10,844              10,844              19,079              19,079

Distribution by Nationality and Province

SEX = MEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
National                  16,410 (93.33%)     16,403 (93.29%)     11,182 (94.53%)     11,218 (94.83%)     19,037 (95.81%)     19,000 (95.63%)
Foreign                   1,172 (6.67%)       1,179 (6.71%)       647 (5.47%)         611 (5.17%)         832 (4.19%)         869 (4.37%)
TOTAL                     17,582              17,852              11,829              11,829              19,869              19,869

SEX = WOMEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
National                  14,694 (93.78%)     14,673 (93.65%)     10,300 (94.98%)     10,278 (94.78%)     18,270 (95.76%)     18,251 (95.66%)
Foreign                   974 (6.22%)         995 (6.35%)         544 (5.02%)         566 (5.22%)         809 (4.24%)         828 (4.34%)
TOTAL                     15,668              15,668              10,844              10,844              19,079              19,079

Distribution by Level of Studies and Province

SEX = MEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
Primary Studies           7,304 (41.54%)      7,225 (41.09%)      5,287 (44.70%)      5,144 (43.49%)      6,873 (34.59%)      6,813 (34.29%)
Secondary Studies         7,616 (43.32%)      7,630 (43.40%)      4,957 (41.91%)      5,123 (43.41%)      8,798 (44.28%)      8,915 (44.87%)
Higher Studies            2,662 (15.14%)      2,727 (15.51%)      1,585 (13.40%)      1,562 (13.20%)      4,198 (21.13%)      4,141 (20.84%)
TOTAL                     17,582              17,852              11,829              11,829              19,869              19,869

SEX = WOMEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
Primary Studies           6,774 (43.23%)      6,665 (42.54%)      4,928 (45.44%)      4,922 (45.39%)      7,587 (39.77%)      7,586 (39.76%)
Secondary Studies         5,459 (34.84%)      5,557 (35.47%)      3,451 (31.82%)      3,441 (31.73%)      6,148 (32.22%)      6,160 (32.29%)
Higher Studies            3,435 (21.92%)      3,446 (21.99%)      2,465 (22.73%)      2,482 (22.89%)      5,344 (28.01%)      5,333 (27.95%)
TOTAL                     15,668              15,668              10,844              10,844              19,079              19,079

Distribution by Type of Dwelling and Province

SEX = MEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
Nucleus                   16,555 (94.16%)     16,743 (95.23%)     7,530 (63.66%)      7,891 (66.71%)      11,750 (59.14%)     12,245 (61.63%)
Scattered                 1,027 (5.84%)       839 (4.77%)         4,299 (36.34%)      3,938 (33.29%)      8,119 (40.86%)      7,624 (38.37%)
TOTAL                     17,582              17,852              11,829              11,829              19,869              19,869

SEX = WOMEN
                          Araba                                   Gipuzkoa                                Bizkaia
                          Population          Sample (weighted)   Population          Sample (weighted)   Population          Sample (weighted)
Nucleus                   14,781 (94.34%)     14,977 (95.59%)     7,223 (66.61%)      7,687 (70.89%)      11,555 (60.56%)     12,072 (63.27%)
Scattered                 887 (5.66%)         691 (4.41%)         3,621 (33.39%)      3,157 (29.11%)      7,524 (39.44%)      7,007 (36.73%)
TOTAL                     15,668              15,668              10,844              10,844              19,079              19,079

Sample for Basque Country and Drugs Survey

Basque Country and Drugs is a biennial survey aimed at ascertaining the consumption of various substances by the Basque population aged between 15 and 74 years, and their perception of various issues related to drugs and drug addiction. In 2012, it was decided to use the Cube Method to select the sample. This has enabled us to obtain a sample balanced by number of individuals in each sanitary region, size of municipality, sex and nationality.

Technical Datasheet
• Framework
The sample framework comprises the population aged between 15 and 74 years residing in family dwellings in the Basque Country and its provinces.

• Sample design
A one-stage stratified sample was used.

Sampling units
Population aged between 15 and 74 years (reference date: July 15, 2012) residing in family dwellings in the Basque Country.

Sample size
According to the specifications of the operation, n = 2007 individuals were selected, together with the same number of substitutes and reserves.

Stratification
A stratified sample was drawn by crossing the following variables:
- Province: 01 = Araba; 20 = Gipuzkoa; 48 = Bizkaia
- Age groups: 6 ten-year age groups (15-24, 25-34, 35-44, 45-54, 55-64 and 65-74 years)

Allocation
A criterion for each level of stratification has been established:
1.
Distribution proportional to the square root of the number of individuals per Province.
2. For each Province, double-size allocation for the youngest age groups (15-24, 25-34 and 35-44 years).

Drawing
Simple random sampling was carried out in each of the strata.

• Balancing variables
The sample was balanced on the following variables:
- Number of individuals aged between 15 and 74 years in each of the 11 sanitary regions of the Basque Country: Alava, West Gipuzkoa, East Gipuzkoa, (Biz) Interior, (Biz) Ezkerraldea-Enkarterri, (Biz) Uribe and (Biz) Bilbao.
- Number of individuals aged between 15 and 74 years in the municipalities, according to population size: Capitals, 50,000-100,000 inhabitants, 25,000-50,000 inhabitants, 10,000-25,000 inhabitants and fewer than 10,000 inhabitants.
- Number of individuals by sex.
- Number of Spanish and foreign individuals.

• Substitutes
To complete the sample, a substitute and a reserve are drawn for each individual. The substitutes were drawn keeping the same distribution as the original stratum sample and balancing on the same variables.

Results
The results obtained for the balancing variables using the Cube Method are shown below. Each table compares the population distribution with the one obtained from the weighted sample.
The percentages are given by columns:

Distribution of the number of individuals by sanitary region

                                Population          Sample (weighted)
Alava                           219,042 (13.28%)    218,966 (13.28%)
West Gipuzkoa                   218,155 (13.23%)    218,335 (13.24%)
Gipuzkoa                        328,814 (19.94%)    329,009 (19.95%)
(Biz) Interior                  227,787 (13.81%)    228,032 (13.83%)
(Biz) Ezkerraldea-Enkarterri    225,829 (13.70%)    224,429 (13.61%)
(Biz) Uribe                     166,287 (10.08%)    166,029 (10.07%)
(Biz) Bilbao                    263,028 (15.95%)    264,141 (16.02%)
TOTAL                           1,648,942           1,648,942

Distribution of the number of individuals by municipality size

                                Population          Sample (weighted)
Capitals                        587,948 (35.66%)    589,033 (35.72%)
50,000 - 100,000                184,970 (11.22%)    184,638 (11.20%)
25,000 - 50,000                 239,465 (14.52%)    239,354 (14.52%)
10,000 - 25,000                 300,173 (18.20%)    300,088 (18.20%)
Less than 10,000                336,386 (20.40%)    335,829 (20.37%)
TOTAL                           1,648,942           1,648,942

Distribution of the number of individuals by sex

                                Population          Sample (weighted)
Men                             823,310 (49.93%)    823,742 (49.96%)
Women                           825,632 (50.07%)    825,200 (50.04%)
TOTAL                           1,648,942           1,648,942

Distribution of the number of individuals by nationality

                                Population          Sample (weighted)
National                        1,519,906 (92.17%)  1,518,872 (92.11%)
Foreign                         129,036 (7.83%)     130,070 (7.89%)
TOTAL                           1,648,942           1,648,942

8. Conclusions

Finally, we draw some conclusions about the value of balanced sampling, the choice of balancing variables, and the relationship between balancing, stratification and calibration.

Balancing and stratification
For stratification and balancing purposes, we need to know the value of the auxiliary variables for all the population units. The greatest advantage of stratification is that it allows us to divide a population into more homogeneous subpopulations and so obtain more precise estimators, that is, estimators with a smaller sampling variance.
The more stratification variables are correlated with the variables of interest, the better the stratification. Even so, using too many stratification variables may produce very small strata in which the sample size is insufficient, not to mention the problems that non-response may cause in such strata. However, the latter can be fixed by collapsing strata (poststratification). Variables that cannot be entered in a multiple stratification can be added as balancing variables, which retains the variance-reduction benefits of stratification and adds the advantage of balancing. Balancing variables also allow us to work in domains defined by crossing several strata, or in small areas. Balancing variables can be quantitative, whereas stratification variables always need to be qualitative or categorical.

Choice of balancing variables
The auxiliary variables chosen to balance a sample should be strongly correlated with the variables of interest and weakly correlated with each other. When a sample is balanced on a large number of qualitative auxiliary variables, the estimated totals (or estimated means) have distributions that are practically identical to those of the original population. The Cube Method provides a very interesting way of selecting primary units in a multistage sample. If a balanced sample is to be selected in the second stage, the same variables should already have been balanced in the first stage.

Balancing and calibration
Unlike balancing and stratification, calibration only requires the values of the auxiliary variables for the sample elements, together with the population totals of those variables. The best strategy is to use balancing and calibration together (see the simulation in Deville and Tillé, 2004). This is because, in general, better results are obtained if we calibrate a sample on the same auxiliary variables that were used in the balancing.
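To make the idea of calibrating on known population totals concrete, here is a minimal sketch of one common calibration technique, raking (iterative proportional fitting), applied to two categorical variables. It only needs the sample values and the population margins, as stated above. The function name, the toy sample and the margins are assumptions chosen for illustration, not survey data.

```python
def rake(weights, rows, cols, row_totals, col_totals, iterations=50):
    """Adjust weights so weighted counts match the row and column totals."""
    w = list(weights)
    for _ in range(iterations):
        # Alternate between the two margins, rescaling within each category.
        for labels, targets in ((rows, row_totals), (cols, col_totals)):
            current = {}
            for label, wi in zip(labels, w):
                current[label] = current.get(label, 0.0) + wi
            w = [wi * targets[label] / current[label]
                 for label, wi in zip(labels, w)]
    return w


# Hypothetical sample of 8 respondents with equal initial weights.
sex = ["M", "M", "M", "F", "F", "F", "F", "M"]
province = ["A", "G", "B", "A", "G", "B", "B", "B"]
initial = [100.0] * 8

# Assumed known population totals; both margins add up to 800.
sex_totals = {"M": 390.0, "F": 410.0}
province_totals = {"A": 180.0, "G": 220.0, "B": 400.0}

final = rake(initial, sex, province, sex_totals, province_totals)

# Ratio of final to initial weights: how far calibration moved each weight.
ratios = [wf / wi for wf, wi in zip(final, initial)]
print("min ratio:", round(min(ratios), 4), "max ratio:", round(max(ratios), 4))
```

The spread of the ratios is a simple diagnostic: when the sample already reproduces the margins closely, as a balanced sample is designed to do, the ratios stay close to 1.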
There is one case in which calibration can be used on variables that are not balancing variables: when it is the same variable measured at different times.

Analysis of results
We now show the results obtained when calibrating two samples previously balanced with the Cube Method (the 2012 Basque Country and Drugs Survey and the 2012 Social Capital Survey). In both cases, the calibration was done with the CALMAR macro (calage sur marges), readjusting the sample weights of the individuals to the marginal totals of the calibration auxiliary variables.

1. Calibration on the 2012 Basque Country and Drugs Survey
For the 2012 Basque Country and Drugs Survey (n = 2007 individuals), it was decided to calibrate the sample on the following variables:
- Cross of the Province and Sex variables (stratification variables)
- Sanitary region, municipality size and sex (balancing variables)
Starting from the initial weights $w_{hi} = w_h \ \forall i$ (the same weight inside each stratum), the final weights $w_{hi}^{*}$ were obtained using the CALMAR macro with the raking ratio method, which adjusts the estimates to the marginal totals of the calibration variables. The variable $f = w_{hi}^{*} / w_{hi}$ is defined as the ratio of the final weights to the initial weights. Analysing the distribution of this variable shows how much the initial weights are distorted in order to match the marginal totals of the calibration variables. The distribution of f is summarised below:

Mean                        1
Median                      0.9987
Mode                        0.9978
Standard deviation          0.0875
Coefficient of variation    8.75%
Minimum                     0.8365
Maximum                     1.2484

As we can see, the final weights are not far from the initial weights (maximum increase of 24% and maximum decrease of 16%), so the sampling weights associated with the stratification are largely maintained.

2.
Calibration on the 2012 Social Capital Survey
For the 2012 Social Capital Survey (n = 4000 individuals), it was decided to calibrate the sample on the following variables:
- Province (Araba, Gipuzkoa and Bizkaia)
- Sex (men and women)
- Age (15-24, 25-34, 35-44, 45-54, 55-64 and more than 65 years)
So, the sample has been calibrated to 36 marginal totals. As in the previous sample, the variable $f = w_{hi}^{*} / w_{hi}$ is defined as the ratio of the final weights to the initial weights. The final weights $w_{hi}^{*}$ were obtained using the CALMAR macro with the raking ratio method, which adjusts the estimates to the marginal totals of the calibration variables. In this case, we not only analyse the distribution of the variable f but also compare it with the values obtained in the 2007 Social Capital Survey. Recall that both surveys have the same sample design, but the 2012 Social Capital Survey sample was selected by balancing with the Cube Method. The balancing variables used are the same as the calibration variables. The next table shows the results obtained for 2007 and 2012:

                            2007      2012
Mean                        1.1139    1.0074
Median                      0.9685    0.9944
Mode                        2.0076    1.0287
Standard deviation          0.5306    0.1125
Coefficient of variation    47.63%    11.17%
Minimum                     0.4223    0.7965
Maximum                     2.3236    1.2915

As the 2012 SCS was balanced on the calibration variables, we obtained better results: the final weights are much closer to the initial weights than in the 2007 SCS (maximum increase of 29% versus 132% and maximum decrease of 20% versus 58%).

Interest of balanced sampling
In a model-assisted framework, a balanced sampling design combined with the Horvitz-Thompson estimator is often an optimal strategy (see Nedyalkova and Tillé, 2009). In fact, when a sample is perfectly balanced, the variance of the Horvitz-Thompson estimator of the auxiliary variable totals equals zero.
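The zero-variance statement follows directly from the balancing equations. The short derivation below uses standard notation that is assumed here rather than taken from this section: U is the population, s the selected sample, \pi_k the inclusion probability of unit k, x_k the vector of balancing variables and y_k a variable of interest with a linear working model y_k = x_k^T \beta + e_k.

```latex
% Balancing equations: they hold for every sample s that can be selected,
% so the Horvitz-Thompson estimator of X is a constant.
\hat{X}_{HT} = \sum_{k \in s} \frac{x_k}{\pi_k} = \sum_{k \in U} x_k = X
\quad\Longrightarrow\quad
\operatorname{Var}\bigl(\hat{X}_{HT}\bigr) = 0 .

% With y_k = x_k^{\top}\beta + e_k, the part of y explained by x_k
% contributes no sampling error under exact balance.
\hat{Y}_{HT} - Y
  = \Bigl(\sum_{k \in s} \frac{x_k}{\pi_k} - \sum_{k \in U} x_k\Bigr)^{\top}\beta
    + \sum_{k \in s} \frac{e_k}{\pi_k} - \sum_{k \in U} e_k
  = \sum_{k \in s} \frac{e_k}{\pi_k} - \sum_{k \in U} e_k .
```

Under exact balance, then, the sampling error of the estimated total of y comes only from the residuals e_k. In practice the balancing equations are usually satisfied only approximately, so the result should be read as an idealised limit.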
The advantages of balanced sampling are the following:
- It is an optimisation of probability sampling designs, whether single-stage or multi-stage, in which the inclusion probabilities defined by the design are the key to selecting random samples.
- It increases the accuracy of the Horvitz-Thompson estimator. Moreover, the variance of the estimator depends only on the residuals of the regression of the variables of interest on the balancing variables.
- The probability of selecting unfavourable or extreme samples, far from the mean, is almost nil.
- Balanced sampling guarantees that the sample size in specific geographical areas or domains will not be too small.

9. Bibliography

ADIN, A.; ARAMENDI, J.; GALBETE, E. AND IZTUETA, A. (2012) El Método del Cubo: Un Método para seleccionar muestras equilibradas. Congreso Vasco de Sociología y Ciencia Política.
ARDILLY, P. (1994) Les Techniques de Sondage. Technip, Paris.
ARDILLY, P. AND TILLÉ, Y. (2006) Sampling Methods: Exercises and Solutions. Springer, New York.
AZORÍN, F. AND SANCHEZ-CRESPO, J. L. (1986) Métodos y Aplicaciones del Muestreo. Alianza Editorial, Madrid.
CHAUVET, G. AND TILLÉ, Y. (2005) Fast SAS Macros for balancing Samples: user's guide. Software Manual, University of Neuchâtel.
CHAUVET, G. AND TILLÉ, Y. (2007) Application of fast SAS macros for balanced samples to the selection of addresses. Case Studies in Business, Industry and Government Statistics, 1:173-182.
COCHRAN, W. (1977) Sampling Techniques. Wiley, New York.
DEVILLE, J.-C. AND TILLÉ, Y. (2004) Efficient balanced sampling: the cube method. Biometrika, 91:893-912.
DEVILLE, J.-C. AND TILLÉ, Y. (2005) Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128:569-591.
KISH, L. (1965) Survey Sampling. Wiley, New York.
NEDYALKOVA, D. AND TILLÉ, Y. (2009) Optimal sampling and estimation strategies under linear model. Biometrika, 95:521-537.
SÄRNDAL, C.-E.; SWENSSON, B. AND WRETMAN, J. (1992) Model Assisted Survey Sampling. Springer Verlag, New York.
TILLÉ, Y. (2000) Ten years of balanced sampling with the cube method: an appraisal. Demographic Statistical Methods Division Seminar of the U.S. Census Bureau.
TILLÉ, Y. (2005) Teoría de Muestreo. Groupe de Statistique, Université de Neuchâtel, Suisse. http://www2.unine.ch/files/content/sites/statistics/files/shared/documents/curso_teoria_de_muestreo.pdf
TILLÉ, Y. AND MATEI, A. (2007) The R Package Sampling. The Comprehensive R Archive Network, Manual of the Contributed Packages. http://cran.r-project.org/web/packages/sampling/sampling.pdf
TILLÉ, Y. (2010) Muestreo Equilibrado y Eficiente: el Método del Cubo. Instituto Vasco de Estadística, Vitoria-Gasteiz. http://www.eustat.es/productosServicios/datos/Seminario_52.pdf