THE CUBE METHOD: BALANCED SAMPLING

THE CUBE METHOD:
BALANCED SAMPLING APPLICATIONS
IN THE BASQUE STATISTICS ORGANISATION
Aritz Adin Urtasun
EUSKAL ESTATISTIKA ERAKUNDEA
BASQUE STATISTICS INSTITUTE
Donostia-San Sebastián, 1
01010 VITORIA-GASTEIZ
Tel.: 945 01 75 00
Fax.: 945 01 75 01
E-mail: [email protected]
www.eustat.es
Introduction
Eustat, aware of the growing demand for increasingly disaggregated quality statistics,
organised the 23rd International Statistics Seminar in 2010, with the title "Balanced and
Efficient Sampling: The Cube Method".
Eustat aims to redefine current designs to obtain samples that provide quality estimators
for more disaggregated areas or domains at the same or a similar cost. Eustat convened
two research and training grants in the field of statistical-mathematical methodologies,
and more specifically, focused on sample optimisation, for the same purpose.
The outcomes of the research have been implemented in several statistical operations in
the 2010-2012 Basque Statistics Plan: a Study on Bullying among Students in Primary
Education and Compulsory Secondary Education schools, a Survey on the Information
Society for Families, a Survey on Technological Innovation, a Survey on Poverty and
Social Inequality, and a Study on Women in Basque Rural Areas.
The purpose of this publication is to disseminate the research conducted during the
grant period and to provide useful material for users interested in efficient and balanced
sampling.
The document is divided into two separate parts. Part One approaches the concepts and
definitions of sampling theory, as well as simple and complex probability-based sampling
plans. Part Two describes the Cube Method and its implementation in several of the
Basque Statistics Organisation's standard surveys.
Vitoria-Gasteiz, December 2012
Javier Forcada Sainz
General Director of EUSTAT
Contents
INTRODUCTION
CONTENTS
1. INTRODUCTION
2. INTRODUCTION TO SAMPLING THEORY
   DEFINITIONS AND BASIC NOTATION
   SAMPLING PROPORTIONS
   THE HORVITZ-THOMPSON ESTIMATOR
3. PROBABILITY SAMPLING PLANS
   SIMPLE RANDOM SAMPLING
   STRATIFIED SAMPLING
   CLUSTER SAMPLING
   SUMMARY OF THE METHODS PRESENTED
4. COMPLEX SAMPLING PLANS
   TWO-STAGE SAMPLING
   SELECTION OF PRIMARY UNITS WITH EQUAL PROBABILITIES
   SELF-WEIGHTING TWO-STAGE PLAN
5. THE CUBE METHOD: BALANCED SAMPLING
   CUBE REPRESENTATION
   BALANCED SAMPLES
   DESCRIPTION OF THE METHOD
6. SAS MACROS FOR SELECTING BALANCED SAMPLES
   EXE_CUBE MACRO
   ECHANT_STRAT MACRO
   DISJUNCTIVE AUXILIARY MACRO
   CREAR_ESTRATO AUXILIARY MACRO
   EXAMPLE OF MACRO USE
7. BALANCED SAMPLES IN EUSTAT USING THE CUBE METHOD
   SAMPLE OF ESO (COMPULSORY SECONDARY EDUCATION) CENTRES FOR A STUDY ON BULLYING IN THE BASQUE COUNTRY
   SAMPLE FOR THE INFORMATION SOCIETY SURVEY (ESI-COMPANIES)
   SAMPLE FOR THE SOCIAL CAPITAL SURVEY (ECS)
   SAMPLE FOR THE TECHNOLOGICAL INNOVATION SURVEY (EIT)
   SAMPLE FOR THE POVERTY AND SOCIAL INEQUALITY SURVEY (EPDS)
   SAMPLE FOR THE STUDY OF WOMEN IN BASQUE RURAL AREAS
   SAMPLE FOR BASQUE COUNTRY AND DRUGS SURVEY
8. CONCLUSIONS
   BALANCING AND STRATIFICATION
   CHOICE OF BALANCING VARIABLES
   BALANCING AND CALIBRATION
      Analysis of results
         1. Calibration on the 2012 Basque Country and Drugs Survey
         2. Calibration on the 2012 Social Capital Survey
   INTEREST OF BALANCED SAMPLING
9. BIBLIOGRAPHY
1. Introduction
This Technical Handbook is the fruit of the work carried out in the course of the training
and research grants in the field of statistical-mathematical methodologies for sampling
optimisation given by the Basque Statistics Institute / Euskal Estatistika Erakundea in
2010.
The Handbook is divided into the following chapters:
Chapter One offers an introduction and mentions the objectives that led to the
preparation of this technical Handbook.
Chapter Two gives an introduction to sampling theory, with definitions and basic
notation in sampling design, sampling proportions and a definition of the Horvitz-Thompson estimator and its variance.
The next two chapters develop the concepts of probability sampling plans and complex
sampling plans, with a description of most of the methods used in official statistics.
Chapter Five approaches the concept of balanced sampling and introduces the Cube
Method for selecting balanced samples.
Chapter Six describes the SAS macros for selecting balanced samples.
Chapter Seven presents the samples balanced at Eustat using the Cube Method.
The last chapter gives some conclusions on balancing, stratification and calibration.
My thanks to the members of the Methodology, Innovation and R&D Department for
their support and to Eustat staff in general for their kindness.
KEYWORDS: Sampling design, inclusion probabilities, Horvitz-Thompson estimator,
balanced samples, Cube Method, balance, stratification and calibration variables
2. Introduction to sampling theory
Before we introduce the Cube Method for selecting balanced samples and
demonstrate the method's usefulness, we should start with an overview of sampling
theory.
Definitions and basic notation
The purpose is to study a finite population U = {1,…, N} of size N.
We define the variable of interest y, which takes the values $y_k$, $k \in U$, and whose total
and mean are:
$$Y = \sum_{k \in U} y_k \quad \text{and} \quad \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k$$
A sample s is a subset of the population s ⊂ U .
A sampling design or sampling plan p(s) is a probability distribution on all the possible
samples, so that
$$\sum_{s \subset U} p(s) = 1.$$
The random sample S takes the value s with probability Pr( S = s ) = p ( s ) .
We define the inclusion probability of unit k as the probability that k is selected in the random
sample S:
$$\pi_k = E(I_k) = \Pr(k \in S) = \sum_{s \ni k} p(s), \quad \text{where} \quad I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if } k \notin S \end{cases}$$
Similarly, the second-order inclusion probability is defined as:
$$\pi_{kl} = E(I_k I_l) = \Pr(k \in S \text{ and } l \in S) = \sum_{s \ni k,l} p(s)$$
If the sampling design is of fixed size n, then
$$\sum_{k \in U} \pi_k = n.$$
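The definitions above can be checked by brute force on a very small population. The following is a minimal sketch, assuming a hypothetical population of four units and a design that gives the same probability to every sample of size 2; the names `inclusion_prob` and `joint_inclusion_prob` are illustrative only.

```python
from itertools import combinations

# Hypothetical toy population U = {1, 2, 3, 4}. The design p(s) below gives
# equal probability to every sample of size 2 (an assumption for illustration).
U = [1, 2, 3, 4]
samples = [frozenset(c) for c in combinations(U, 2)]
p = {s: 1 / len(samples) for s in samples}

def inclusion_prob(k):
    """pi_k: sum of p(s) over the samples s that contain unit k."""
    return sum(prob for s, prob in p.items() if k in s)

def joint_inclusion_prob(k, l):
    """pi_kl: sum of p(s) over the samples s that contain both k and l."""
    return sum(prob for s, prob in p.items() if k in s and l in s)

pi = {k: inclusion_prob(k) for k in U}
print(pi)                          # first-order inclusion probabilities
print(sum(pi.values()))            # equals the fixed sample size n = 2
print(joint_inclusion_prob(1, 2))  # second-order inclusion probability
```

Because the design has fixed size 2, the first-order inclusion probabilities add up to 2, as stated above.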
Sampling proportions
Suppose that the variable of interest defined on population U is a qualitative variable. In
this case, the variable of interest gives information on a quality of the population units
and the membership or non-membership in a certain class.
Suppose that the variable of interest divides the population units into two classes, C and C′.
The characteristic $y_k$ for each population unit is defined as:
$$y_k = \begin{cases} 1 & \text{if } k \in C \\ 0 & \text{if } k \notin C \end{cases} \quad \forall k \in U$$
The total number of population elements (class total) and the proportion of population elements
(class proportion) that belong to C are defined as:
$$Y = \sum_{k \in U} y_k = A \quad \text{and} \quad \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k = \frac{A}{N} = P$$
We can consider the problem of estimating A and P as if we were estimating the
population total and population mean, where each y k takes the value 0 or 1.
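If we write the quasi-variance $S^2$ in terms of P and Q = 1 − P:
$$S^2 = \frac{\sum_{k \in U} (y_k - \bar{Y})^2}{N-1} = \frac{\sum_{k \in U} y_k^2 - N\bar{Y}^2}{N-1} = \frac{1}{N-1}(NP - NP^2) = \frac{N}{N-1} PQ$$
Its unbiased estimator under simple random sampling is:
$$s^2 = \frac{n}{n-1} pq, \quad \text{where} \quad p = \frac{\sum_{k \in S} y_k}{n} = \frac{a}{n} \quad \text{and} \quad q = 1 - p.$$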
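The identity between the quasi-variance of a 0/1 variable and its expression in terms of P and Q can be verified numerically. This is a small sketch with made-up values; the population `y` and the sample `y_sample` are assumptions chosen only for illustration.

```python
# Hypothetical 0/1 variable on a population of N = 8 units.
y = [1, 0, 1, 1, 0, 0, 1, 0]
N = len(y)
A = sum(y)          # class total
P = A / N           # class proportion
Q = 1 - P

# Quasi-variance computed directly and through the formula N/(N-1) * P * Q
S2_direct = sum((yk - P) ** 2 for yk in y) / (N - 1)
S2_formula = N / (N - 1) * P * Q

# The same identity on a (hypothetical) sample of n = 4 units
y_sample = [1, 0, 1, 1]
n = len(y_sample)
p = sum(y_sample) / n
s2_direct = sum((yk - p) ** 2 for yk in y_sample) / (n - 1)
s2_formula = n / (n - 1) * p * (1 - p)

print(S2_direct, S2_formula)
print(s2_direct, s2_formula)
```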
The Horvitz-Thompson estimator
The Horvitz-Thompson estimators of the population total and the population mean of the
variable of interest y are defined as:
$$\hat{Y}_\pi = \sum_{k \in S} \frac{y_k}{\pi_k} \quad \text{and} \quad \hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k}$$
The Horvitz-Thompson estimator is unbiased if $\pi_k > 0$ for all $k \in U$.
For fixed-size designs, the variance can be estimated by:
$$\widehat{Var}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in S} \sum_{\substack{l \in S \\ l \neq k}} \left( \frac{y_k}{\pi_k} - \frac{y_l}{\pi_l} \right)^2 \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}}.$$
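As an illustration, the unbiasedness of the Horvitz-Thompson estimator and of the variance estimator above can be checked by enumerating every possible sample of a tiny population. The sketch below assumes a hypothetical population of four units and an equal-probability design of fixed size 2; the values in `y` are invented.

```python
from itertools import combinations

# Hypothetical population values; equal-probability design of fixed size n = 2.
y = {1: 10.0, 2: 4.0, 3: 7.0, 4: 3.0}
U = list(y)
N, n = len(U), 2
samples = list(combinations(U, n))
p = 1 / len(samples)                      # p(s) is the same for every sample

pi = {k: n / N for k in U}                # first-order inclusion probabilities
pi_kl = n * (n - 1) / (N * (N - 1))       # second-order inclusion probability

def ht_total(s):
    """Horvitz-Thompson estimator of the total for sample s."""
    return sum(y[k] / pi[k] for k in s)

def ht_var_est(s):
    """Variance estimator for fixed-size designs, as given in the text."""
    acc = 0.0
    for k in s:
        for l in s:
            if l != k:
                d = y[k] / pi[k] - y[l] / pi[l]
                acc += d * d * (pi_kl - pi[k] * pi[l]) / pi_kl
    return -0.5 * acc

true_total = sum(y.values())
expected_total = sum(p * ht_total(s) for s in samples)
true_var = sum(p * (ht_total(s) - true_total) ** 2 for s in samples)
expected_var_est = sum(p * ht_var_est(s) for s in samples)

print(expected_total, true_total)     # the estimator is unbiased
print(expected_var_est, true_var)     # so is the variance estimator
```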
3. Probability sampling plans
A probability sample is one in which every unit in the population has a non-zero chance of being
selected, and this probability can be accurately determined.
As explained later, the Cube Method takes the inclusion probabilities defined by
the design as given and selects a balanced sample that respects them; in this sense, the Cube Method optimises
probability sampling methods.
Three main types of probability sampling are defined below.
Simple random sampling
Simple random sampling (SRS) is a sampling method in which a sample of size n is selected from a
population of size N in such a way that all samples of the same size have the
same probability of being chosen.
The sampling design for an SRS of fixed size n is:
$$p(s) = \begin{cases} \binom{N}{n}^{-1} & \text{if } card(s) = n \\ 0 & \text{otherwise} \end{cases}$$
Therefore, the inclusion probability of unit k is:
$$\pi_k = \sum_{s \ni k} p(s) = \sum_{s \ni k} \binom{N}{n}^{-1} = \binom{N-1}{n-1} \binom{N}{n}^{-1} = \frac{n}{N}, \quad \forall k \in U$$
In other words, all the individuals of U have the same probability of being chosen.
The H-T estimator of the population mean in an SRS is
$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{k \in S} y_k \frac{N}{n} = \frac{1}{n} \sum_{k \in S} y_k$$
The unbiased variance estimator of $\hat{\bar{Y}}_\pi$ is:
$$\widehat{Var}(\hat{\bar{Y}}_\pi) = (1 - f) \frac{s_y^2}{n}$$
where
$$s_y^2 = \frac{1}{n-1} \sum_{k \in S} (y_k - \hat{\bar{Y}}_\pi)^2$$
and $f = n/N$ is the sampling fraction.
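A minimal sketch of these formulas follows, assuming a synthetic population generated only for the example; the function name `srs_mean_estimate` and the seed are arbitrary choices, not part of the handbook.

```python
import random

def srs_mean_estimate(population, n, rng):
    """Draw an SRS without replacement and return (mean estimate, variance estimate)."""
    N = len(population)
    sample = rng.sample(population, n)
    mean_hat = sum(sample) / n
    s2 = sum((v - mean_hat) ** 2 for v in sample) / (n - 1)
    f = n / N                         # sampling fraction
    var_hat = (1 - f) * s2 / n
    return mean_hat, var_hat

rng = random.Random(2012)             # arbitrary seed, for reproducibility only
population = [rng.gauss(50, 10) for _ in range(1000)]   # synthetic values
mean_hat, var_hat = srs_mean_estimate(population, 100, rng)
print(mean_hat, var_hat)

# With n = N (a census) the finite population correction makes the variance zero
census_mean, census_var = srs_mean_estimate(population, len(population), rng)
```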
Stratified sampling
Suppose that the population U is divided into subpopulations or strata $U_h$,
h = 1,..., H, where the strata meet the following properties:
(i) $\bigcup_{h=1}^{H} U_h = U$
(ii) $U_h \cap U_i = \emptyset$, $h \neq i$
(iii) If $N_h$ is the size of $U_h$, then $\sum_{h=1}^{H} N_h = N$
A sampling design is stratified when a simple random sample of fixed size $n_h$ is
selected from each stratum, where $\sum_{h=1}^{H} n_h = n$ is the sample size.
This sampling technique is used when the study population is very heterogeneous but
can be divided into internally homogeneous strata. Thus, we can achieve more precise
estimators in each stratum and combine them to obtain a more accurate estimator of the
population total.
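SRS is used to select the units in each stratum, so the inclusion probability of unit k is:
$$\pi_k = \frac{n_h}{N_h}, \quad \forall k \in U_h.$$
The Horvitz-Thompson estimator of the mean for stratified sampling is:
$$\hat{\bar{Y}}_{st} = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in S_h} y_k = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h$$
where $S_h$ is the sample selected in stratum h and $\hat{\bar{Y}}_h$ is its sample mean.
The variance of the estimator can be estimated without bias by:
$$\widehat{Var}(\hat{\bar{Y}}_{st}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{s_{yh}^2}{n_h}$$
where $f_h = n_h/N_h$ and
$$s_{yh}^2 = \frac{1}{n_h - 1} \sum_{k \in S_h} (y_k - \hat{\bar{Y}}_h)^2$$
is the sample quasi-variance of stratum h.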
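The stratified estimator and its variance estimator can be sketched as follows. The stratum sizes and sample values are invented for the example, and `stratified_mean` is an illustrative helper rather than an Eustat routine.

```python
def stratified_mean(strata):
    """strata: list of (N_h, sample values of stratum h).
    Returns the stratified mean estimate and its variance estimate."""
    N = sum(N_h for N_h, _ in strata)
    mean_hat = 0.0
    var_hat = 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        ybar_h = sum(ys) / n_h
        s2_h = sum((v - ybar_h) ** 2 for v in ys) / (n_h - 1)
        f_h = n_h / N_h
        mean_hat += N_h * ybar_h / N
        var_hat += N_h ** 2 * (1 - f_h) * s2_h / n_h / N ** 2
    return mean_hat, var_hat

# Hypothetical example: two strata of sizes 100 and 300
strata = [(100, [2.0, 4.0, 6.0]), (300, [10.0, 12.0])]
mean_hat, var_hat = stratified_mean(strata)
print(mean_hat, var_hat)
```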
Allocation in stratified sampling
The sample size can be divided among the strata according to several
criteria. The most frequently used criteria are described below.
1. Proportional allocation
Under proportional allocation, the number of sample units allocated to each stratum is
proportional to the size of the stratum.
Thus, a stratified plan is said to have a proportional allocation if:
$$\frac{n_h}{N_h} = \frac{n}{N}, \quad \text{for } h = 1,..., H$$
Supposing that $n_h = nN_h/N$ is an integer, the estimator of the population mean is:
$$\hat{\bar{Y}}_{prop} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h = \frac{1}{n} \sum_{k \in S} y_k$$
Allocations proportional to the square root, the cube root or any other power lower than 1
of the stratum size can be made in the same manner.
2. Minimum variance allocation
Minimum variance allocation, or Neyman allocation, consists in determining the values of
$n_h$ in such a way that the variance of the estimator is minimal for a fixed sample size n.
Lagrange multipliers are used to obtain the necessary values of $n_h$:
$$n_h = n \frac{N_h S_{yh}}{\sum_{h=1}^{H} N_h S_{yh}}, \quad \text{for } h = 1,..., H$$
where $S_{yh}$ is the square root of the quasi-variance of y in stratum h.
3. Minimum sample size allocation
In this case, the problem consists in finding the allocation that gives the minimum sample
size n* for a fixed variance V of the estimator of the population total.
Again, thanks to Lagrange multipliers, we have that:
$$n^* = \frac{\left( \sum_{h=1}^{H} N_h S_{yh} \right)^2}{V + \sum_{h=1}^{H} N_h S_{yh}^2}$$
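The three allocation rules can be compared on a small example. The sketch below uses invented stratum sizes and standard deviations; the function names are illustrative, and in practice the resulting $n_h$ would still have to be rounded to integers.

```python
def proportional_allocation(n, N_h):
    """n_h proportional to the stratum sizes N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def neyman_allocation(n, N_h, S_h):
    """n_h proportional to N_h * S_h (minimum variance for fixed n)."""
    denom = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / denom for Nh, Sh in zip(N_h, S_h)]

def min_sample_size(V, N_h, S_h):
    """Smallest n giving variance V for the estimator of the total."""
    num = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2
    return num / (V + sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)))

# Hypothetical strata
N_h = [500, 300, 200]
S_h = [2.0, 5.0, 10.0]
n = 100

prop = proportional_allocation(n, N_h)
neyman = neyman_allocation(n, N_h, S_h)

# Variance of the total estimator reached by the Neyman allocation with n = 100;
# feeding it back into min_sample_size should return n again.
V = sum(Nh ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
        for Nh, Sh, nh in zip(N_h, S_h, neyman))
n_star = min_sample_size(V, N_h, S_h)
print(prop, neyman, n_star)
```

The last lines are a consistency check: the variance attained by the Neyman allocation with n = 100 is turned back into the minimum sample size, which recovers 100.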
Cluster sampling
Suppose that population U is divided into M subsets $U_i$, i = 1,..., M, called clusters,
which meet the following properties:
(i) $\bigcup_{i=1}^{M} U_i = U$
(ii) $U_i \cap U_j = \emptyset$, $i \neq j$
(iii) $\sum_{i=1}^{M} N_i = N$, where $N_i$ is the number of elements in cluster $U_i$.
A sampling design is a cluster design when we select a sample of m clusters, denoted
as $s_I$, with a plan $p_I(s_I)$, and all the units of the chosen clusters are observed.
The full random sample is given by $S = \bigcup_{i \in S_I} U_i$, the size of which is $n = \sum_{i \in S_I} N_i$.
Normally, the size of the sample is random.
This sampling technique is used when the population is naturally divided into groups that
are supposed to contain all the variability in the population; i.e., each cluster faithfully
represents the characteristics of the study population (thus simplifying the gathering of
sample information).
Selection of clusters with equal probabilities
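Supposing that all the clusters have the same probability of being chosen, the sampling
plan consists in selecting the clusters by SRS of size m.
In this case, the probability of selecting a cluster is $\pi_{Ii} = m/M$. The following simplified
expression of the Horvitz-Thompson estimator of the mean is obtained:
$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{i \in S_I} \frac{N_i \bar{Y}_i}{\pi_{Ii}} = \frac{M}{Nm} \sum_{i \in S_I} N_i \bar{Y}_i$$
where $\bar{Y}_i = \frac{1}{N_i} \sum_{k \in U_i} y_k$ is the mean of cluster $U_i$, i = 1,..., M.
The variance of the estimator can be estimated without bias by:
$$\widehat{Var}(\hat{\bar{Y}}_\pi) = \frac{M-m}{m-1} \frac{M}{N^2 m} \sum_{i \in S_I} \left( Y_i - \frac{\hat{Y}_\pi}{M} \right)^2$$
where $Y_i = N_i \bar{Y}_i$ is the total of cluster $U_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$ is the Horvitz-Thompson estimator of the population total.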
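A short sketch of the cluster estimator with equal selection probabilities follows. The clusters and their values are hypothetical, and `cluster_mean` is a name chosen for the example.

```python
def cluster_mean(M, N, selected_clusters):
    """selected_clusters: y-values of all the units of each of the m clusters
    selected by SRS among the M clusters of a population of N units."""
    m = len(selected_clusters)
    totals = [sum(c) for c in selected_clusters]       # cluster totals Y_i
    total_hat = M / m * sum(totals)                    # HT estimator of the total
    mean_hat = total_hat / N
    ss = sum((t - total_hat / M) ** 2 for t in totals)
    var_hat = (M - m) / (m - 1) * M / (N ** 2 * m) * ss
    return mean_hat, var_hat

# Hypothetical population: M = 4 clusters, N = 10 units, m = 2 clusters selected
mean_hat, var_hat = cluster_mean(4, 10, [[1.0, 2.0, 3.0], [4.0, 5.0]])
print(mean_hat, var_hat)
```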
Systematic sampling with equal probability
Suppose that the N units of population U are numbered 1 to N in a certain order
(random, or according to some ordering criterion).
If n is the number of units to be selected in the sample, we define k = N/n as the
sampling interval.
We select a random number r ∈ {1,..., k} as the starting unit. After r, the units that are at
a distance lk, for l = 1, 2,..., n − 1, are selected in the sample.
Systematic sampling can be viewed as cluster sampling where the problem consists in
choosing a single cluster from the k potential clusters.
Composition of the k possible systematic samples

Sample    Units
1         $y_1, y_{k+1}, \dots, y_{(n-1)k+1}$
2         $y_2, y_{k+2}, \dots, y_{(n-1)k+2}$
···
i         $y_i, y_{k+i}, \dots, y_{(n-1)k+i}$
···
k         $y_k, y_{2k}, \dots, y_{nk}$
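The selection rule can be sketched in a few lines. The example assumes that N is an exact multiple of n, as in the table above; the helper `systematic_sample` and the seed are illustrative.

```python
import random

def systematic_sample(N, n, r):
    """Labels (1-based) of the systematic sample with start r; assumes N = n * k."""
    k = N // n
    return [r + l * k for l in range(n)]

N, n = 20, 5
k = N // n                          # sampling interval
rng = random.Random(7)              # arbitrary seed, for reproducibility only
r = rng.randint(1, k)               # random start in {1, ..., k}
sample = systematic_sample(N, n, r)
print(r, sample)

# The k possible systematic samples form a partition of the population,
# which is why the method can be seen as selecting one cluster out of k.
all_samples = [systematic_sample(N, n, start) for start in range(1, k + 1)]
```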
Summary of the methods presented
The coefficient of variation of an estimator $\hat{\theta}$ is defined as the quotient between its
standard deviation and the true value θ:
$$CV(\hat{\theta}) = \frac{\sqrt{Var(\hat{\theta})}}{\theta}.$$
Therefore, the estimator of the coefficient of variation of $\hat{\theta}$ is
$$cv(\hat{\theta}) = \frac{\sqrt{\widehat{Var}(\hat{\theta})}}{\hat{\theta}}.$$
The formulas for the estimator, variance and coefficient of variation of the
population mean and of proportions for the methods presented are summarised below.
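Simple random sampling

   Population mean $\bar{Y}$
      Estimator:            $\hat{\bar{Y}}_\pi = \frac{1}{n} \sum_{k \in S} y_k$
      Variance:             $\widehat{Var}(\hat{\bar{Y}}_\pi) = (1 - f) \frac{s_y^2}{n}$
      Coeff. of variation:  $cv(\hat{\bar{Y}}_\pi) = \sqrt{\frac{1 - f}{n}} \, \frac{s_y}{\hat{\bar{Y}}_\pi}$

   Proportion P
      Estimator:            $\hat{P} = p = \frac{1}{n} \sum_{k \in S} y_k$
      Variance:             $\widehat{Var}(\hat{P}) = (1 - f) \frac{p(1 - p)}{n - 1}$
      Coeff. of variation:  $cv(\hat{P}) = \sqrt{(1 - f) \frac{1 - p}{p(n - 1)}}$

Stratified sampling

   Population mean $\bar{Y}$
      Estimator:            $\hat{\bar{Y}}_{st} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{Y}}_h$
      Variance:             $\widehat{Var}(\hat{\bar{Y}}_{st}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{s_{yh}^2}{n_h}$
      Coeff. of variation:  $cv(\hat{\bar{Y}}_{st}) = \frac{\sqrt{\sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{s_{yh}^2}{n_h}}}{\sum_{h=1}^{H} N_h \hat{\bar{Y}}_h}$

   Proportion P
      Estimator:            $\hat{P}_{st} = \frac{1}{N} \sum_{h=1}^{H} N_h p_h$
      Variance:             $\widehat{Var}(\hat{P}_{st}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{p_h q_h}{n_h - 1}$
      Coeff. of variation:  $cv(\hat{P}_{st}) = \frac{\sqrt{\sum_{h=1}^{H} N_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h - 1}}}{\sum_{h=1}^{H} N_h p_h}$

Cluster sampling

   Population mean $\bar{Y}$
      Estimator:            $\hat{\bar{Y}}_\pi = \frac{M}{Nm} \sum_{i \in S_I} N_i \bar{Y}_i$
      Variance:             $\widehat{Var}(\hat{\bar{Y}}_\pi) = \frac{M-m}{m-1} \frac{M}{N^2 m} \sum_{i \in S_I} \left( Y_i - \frac{\hat{Y}_\pi}{M} \right)^2$
      Coeff. of variation:  $cv(\hat{\bar{Y}}_\pi) = \frac{\sqrt{\left(1 - \frac{m}{M}\right) \frac{m}{m-1} \sum_{i \in S_I} \left( Y_i - \frac{\hat{Y}_\pi}{M} \right)^2}}{\sum_{i \in S_I} N_i \bar{Y}_i}$
      where $Y_i = N_i \bar{Y}_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$.

   Proportion P
      Estimator:            $\hat{P} = p = \frac{\sum_{i \in S_I} a_i}{\sum_{i \in S_I} N_i}$, where $a_i = p_i N_i$
      Variance:             $\widehat{Var}(\hat{P}) = \frac{M-m}{M} \frac{m}{m-1} \frac{\sum_{i \in S_I} a_i^2 - 2p \sum_{i \in S_I} a_i N_i + p^2 \sum_{i \in S_I} N_i^2}{\left( \sum_{i \in S_I} N_i \right)^2}$
      Coeff. of variation:  $cv(\hat{P}) = \sqrt{\frac{M-m}{M} \frac{m}{m-1} \frac{\sum_{i \in S_I} a_i^2 - 2p \sum_{i \in S_I} a_i N_i + p^2 \sum_{i \in S_I} N_i^2}{\left( \sum_{i \in S_I} a_i \right)^2}}$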
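For simple random sampling, the coefficient of variation of a mean and of a proportion can be computed as sketched below. The 0/1 sample and the population size N = 200 are invented; since a proportion is the mean of a 0/1 variable, both functions should return the same value for such a sample.

```python
from math import sqrt

def cv_srs_mean(sample, N):
    """Estimated coefficient of variation of the sample mean under SRS."""
    n = len(sample)
    mean_hat = sum(sample) / n
    s2 = sum((v - mean_hat) ** 2 for v in sample) / (n - 1)
    var_hat = (1 - n / N) * s2 / n
    return sqrt(var_hat) / mean_hat

def cv_srs_proportion(a, n, N):
    """Estimated coefficient of variation of a proportion p = a / n under SRS."""
    p = a / n
    return sqrt((1 - n / N) * (1 - p) / (p * (n - 1)))

sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # hypothetical 0/1 sample, a = 7
cv_mean = cv_srs_mean(sample, N=200)
cv_prop = cv_srs_proportion(sum(sample), len(sample), N=200)
print(cv_mean, cv_prop)
```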
4. Complex sampling plans
Although the methods presented are the three main types of probability sampling,
the designs of the surveys conducted by EUSTAT and by other statistical bodies
tend to be more complex.
Two-stage sampling
Suppose that the population U = {1,..., k,..., N} is composed of M subpopulations $U_i$,
i = 1,..., M, called primary units.
In turn, each primary unit $U_i$ is composed of $N_i$ secondary units, where $\sum_{i=1}^{M} N_i = N$.
In general, a two-stage sample is defined as follows:
- A sample $S_I$ of primary units of size m is selected.
- If a primary unit $U_i$ is selected in the first stage, a sample $S_i$ of size $n_i$ of
  secondary units is selected from it.
Two-stage plans must meet the properties of invariance (the second-stage design within a
primary unit does not depend on which other primary units were selected) and independence
(the second-stage samples are drawn independently from one primary unit to another).
The full random sample is given by $S = \bigcup_{i \in S_I} S_i$, the size of which is $n = \sum_{i \in S_I} n_i$.
We can define:
• $\pi_{I,i}$ as the probability of selecting the primary unit $U_i$
• $\pi_{k|i}$ as the probability of selecting the unit k, given that $U_i$ has been selected.
Therefore, the probability of inclusion for unit k is:
$$\pi_k = \pi_{I,i} \, \pi_{k|i}, \quad k \in U_i$$
The Horvitz-Thompson estimator of the mean in a two-stage sample is:
$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{1}{N} \sum_{i \in S_I} \sum_{k \in S_i} \frac{y_k}{\pi_{I,i} \, \pi_{k|i}} = \frac{1}{N} \sum_{i \in S_I} \frac{N_i \hat{\bar{Y}}_i}{\pi_{I,i}}$$
where $\hat{\bar{Y}}_i = \frac{1}{N_i} \sum_{k \in S_i} \frac{y_k}{\pi_{k|i}}$ is the Horvitz-Thompson estimator of the mean of the primary
unit $U_i$.
Moreover, in a two-stage plan it is given that $\widehat{Var}(\hat{\bar{Y}}_\pi) = \widehat{Var}_{UP} + \widehat{Var}_{US}$,
where $\widehat{Var}_{UP}$ is the part of the variance that refers to primary units and $\widehat{Var}_{US}$ to
secondary units.
Therefore, in two-stage sampling we can combine the main probability sampling plans
presented (simple random sampling, stratified sampling and cluster sampling) to select
primary units as well as secondary units.
Selection of primary units with equal probabilities
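Suppose that simple random sampling is used in the two sampling stages.
Then, the probabilities defined above would be:
$$\pi_{I,i} = \frac{m}{M}, \quad i = 1,..., M$$
$$\pi_{k|i} = \frac{n_i}{N_i}, \quad i = 1,..., M, \; k \in U_i$$
In this case, the probability of inclusion for unit k is:
$$\pi_k = \frac{m n_i}{M N_i}, \quad k \in U_i$$
Substituting these probabilities into the Horvitz-Thompson estimator for two-stage sampling, we have
that:
$$\hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k} = \frac{M}{Nm} \sum_{i \in S_I} \sum_{k \in S_i} \frac{N_i y_k}{n_i}$$
And its variance estimator simplifies to
$$\widehat{Var}(\hat{\bar{Y}}_\pi) = \frac{M-m}{N^2 m} M s_I^2 + \frac{M}{N^2 m} \sum_{i \in S_I} N_i \frac{N_i - n_i}{n_i} s_i^2$$
where
$$s_I^2 = \frac{1}{m-1} \sum_{i \in S_I} \left( \hat{Y}_i - \frac{\hat{Y}_\pi}{M} \right)^2 \quad \text{and} \quad s_i^2 = \frac{1}{n_i - 1} \sum_{k \in S_i} \left( y_k - \frac{\hat{Y}_i}{N_i} \right)^2,$$
with $\hat{Y}_i = N_i \hat{\bar{Y}}_i$ the estimated total of primary unit $U_i$ and $\hat{Y}_\pi = N \hat{\bar{Y}}_\pi$ the estimated population total.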
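The two-stage formulas with simple random sampling at both stages can be sketched as follows. The primary units and their sampled values are hypothetical, and `two_stage_mean` is an illustrative name.

```python
def two_stage_mean(M, N, psu_samples):
    """psu_samples: list of (N_i, y-values of the n_i sampled secondary units)
    for the m primary units selected by SRS among M; SRS inside each of them."""
    m = len(psu_samples)
    # Estimated totals of the selected primary units: N_i * sample mean
    totals = [N_i * sum(ys) / len(ys) for N_i, ys in psu_samples]
    total_hat = M / m * sum(totals)
    mean_hat = total_hat / N
    # Between-primary-unit component
    s2_I = sum((t - total_hat / M) ** 2 for t in totals) / (m - 1)
    # Within-primary-unit component
    within = 0.0
    for (N_i, ys), t in zip(psu_samples, totals):
        n_i = len(ys)
        ybar_i = t / N_i
        s2_i = sum((v - ybar_i) ** 2 for v in ys) / (n_i - 1)
        within += N_i * (N_i - n_i) / n_i * s2_i
    var_hat = (M - m) / (N ** 2 * m) * M * s2_I + M / (N ** 2 * m) * within
    return mean_hat, var_hat

# Hypothetical example: M = 5 primary units, N = 50 secondary units in total
psu_samples = [(10, [2.0, 4.0]), (12, [5.0, 7.0, 9.0])]
mean_hat, var_hat = two_stage_mean(5, 50, psu_samples)
print(mean_hat, var_hat)
```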
Self-weighting two-stage plan
Suppose that the primary units in the first stage are selected with inclusion
probabilities proportional to size (PPS); in other words,
$$\pi_{I,i} = \frac{N_i}{N} m$$
In the second stage, the secondary units are selected by simple
random sampling of fixed size $n_i = n_0$ (in each primary unit); in other words,
$$\pi_{k|i} = \frac{n_0}{N_i}$$
Thus, the probabilities of inclusion of unit k are the same for every unit in population U:
$$\pi_k = \pi_{I,i} \, \pi_{k|i} = \frac{N_i}{N} m \frac{n_0}{N_i} = \frac{m n_0}{N}$$
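The self-weighting property can be verified numerically. In the sketch below the primary unit sizes, m and n_0 are invented; the design only makes sense when $m N_i / N \le 1$ and $n_0 \le N_i$ for every primary unit, which holds in this example.

```python
def two_stage_pps_probs(N_i, m, n0):
    """Inclusion probability of a secondary unit in each primary unit when the
    first stage is PPS of size m and the second stage is SRS of fixed size n0."""
    N = sum(N_i)
    probs = []
    for Ni in N_i:
        pi_first = m * Ni / N        # first stage: proportional to size
        pi_second = n0 / Ni          # second stage: SRS of n0 units
        probs.append(pi_first * pi_second)
    return probs

# Hypothetical primary unit sizes
N_i = [40, 60, 100, 200]
probs = two_stage_pps_probs(N_i, m=2, n0=10)
print(probs)                         # every unit has probability m * n0 / N
```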
5. The Cube Method: Balanced Sampling
The Cube Method (Deville and Tillé, 2004) is a method for selecting balanced samples
with equal or unequal inclusion probabilities, optimising probability sampling methods.
Intuitively, the method allows the proportions of the original population to be
maintained in the sample for certain balancing variables (in the case of qualitative variables), always taking the
design's inclusion probabilities into consideration. For the method to be effective, the balancing variables should be
strongly correlated with the variables of interest.
Cube representation
Let us consider a finite population U = {1,…, N} of size N, where the aim is to estimate
the total (or mean) of certain variables of interest.
To understand how the Cube Method works, suppose that a sample is denoted by a
vector $s = (s_1 \dots s_k \dots s_N)^t$, where $s_k$ takes the value 1 if unit k is in the sample and
0 otherwise.
Geometrically, each vector s is a vertex of an N-cube.
[Figure: possible samples in a population of size N = 3]
Therefore, a sampling design p(.) is a probability distribution on the set $\mathcal{S} = \{0,1\}^N$ of all
the possible samples. The inclusion probability of unit k is defined as $\pi_k = \Pr(S_k = 1)$.
Balanced samples
Suppose that we have certain auxiliary variables with known values for all the units of the
population, k ∈ U .
The auxiliary variables can be used as stratification variables (qualitative) or balancing
variables (qualitative or quantitative).
A sample is said to be balanced on the variables $x_1, x_2, ..., x_p$ if the balancing
equations are satisfied:
$$\hat{X}_\pi = X \iff \sum_{k \in s} \frac{x_{kj}}{\pi_k} = \sum_{k \in U} x_{kj}, \quad \forall s \in \mathcal{S} \text{ with } p(s) > 0, \quad j = 1,..., p$$
In other words, the Horvitz-Thompson estimators of the totals of the variables $x_1, x_2, ..., x_p$ in the
sample are equal to the totals of said variables in the population.
The inclusion probability vector π will always be predetermined by the sampling design.
The equations that derive from the balance constraints define a subspace Q of
dimension N − p in $\mathbb{R}^N$. Therefore, the problem consists in choosing a vertex (a
sample) of the N-cube that lies within the subspace Q.
Given that it is generally not possible to select an exactly balanced sample, the Cube Method
implements a procedure for selecting approximately balanced samples.
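The balancing equations can be checked for any candidate sample. The sketch below assumes a hypothetical population of four units with equal inclusion probabilities and two balancing variables: the inclusion probability itself (which fixes the sample size) and an invented size variable. Units are indexed from 0, and `is_balanced` is an illustrative helper.

```python
def ht_totals(sample, x, pi):
    """Horvitz-Thompson estimators of the totals of each balancing variable."""
    p = len(x[0])
    return [sum(x[k][j] / pi[k] for k in sample) for j in range(p)]

def is_balanced(sample, x, pi, tol=1e-9):
    """True if the sample satisfies every balancing equation up to tol."""
    true_totals = [sum(row[j] for row in x) for j in range(len(x[0]))]
    estimates = ht_totals(sample, x, pi)
    return all(abs(e - t) <= tol for e, t in zip(estimates, true_totals))

# Hypothetical population of N = 4 units (indices 0..3)
pi = [0.5, 0.5, 0.5, 0.5]
x = [[0.5, 1.0],      # first column: pi_k, second column: a size variable
     [0.5, 3.0],
     [0.5, 2.0],
     [0.5, 2.0]]

print(is_balanced([0, 1], x, pi))   # size-variable estimate 2*(1+3) = 8 = total
print(is_balanced([2, 3], x, pi))   # size-variable estimate 2*(2+2) = 8 = total
print(is_balanced([0, 2], x, pi))   # size-variable estimate 2*(1+2) = 6, not balanced
```

Only some vertices of the cube satisfy the constraints, which is the geometric picture described above.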
Description of the method
The Cube Method proposed by Deville and Tillé (2004) is composed of two phases:
1. Flight phase
The flight phase is a generalisation of the splitting procedure (see "Sampling
Theory").
It is a random walk that begins at the inclusion probability vector π and remains in
the intersection of the cube and the subspace defined by the balancing equations
(Q).
2. Landing phase
If a sample (a vertex) has not been reached at the end of the flight phase, the
landing phase is applied.
There are three potential solutions for this phase:
- Progressively eliminate the balancing variables and apply the flight phase
  again (the variables need to be deleted in ascending order of importance).
- Use linear programming to calculate the best approximately balanced
  sample (minimising the difference in balance).
- Choose the vertex closest to the probability vector obtained in the flight phase,
  rounding the inclusion probabilities that are still not equal to 0 or 1.
Chauvet and Tillé programmed a much quicker implementation of the flight phase (see
“Fast SAS Macros for balancing samples: user's guide”), the phase that takes up most of the
execution time. The advantages obtained were:
- There are no constraints on the size of the population.
- The execution time is linearly dependent on the size of the population.
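To make the flight phase more concrete, the following is a simplified sketch of the general idea. It is not the fast algorithm implemented in the SAS macros, and all names are hypothetical. At each step it looks for a direction u in the kernel of the balancing constraints restricted to the undecided units, and moves the probability vector along u or −u until it hits a face of the cube, with probabilities chosen so that the expected position does not change.

```python
import random

def null_vector(A, tol=1e-9):
    """Non-zero u with A u = 0 (A is a list of rows), or None if there is none.
    Uses Gauss-Jordan elimination with partial pivoting."""
    rows = [r[:] for r in A]
    ncols = len(rows[0])
    pivots = []                                  # (row, column) of each pivot
    r = 0
    for c in range(ncols):
        if r == len(rows):
            break
        best = max(range(r, len(rows)), key=lambda i: abs(rows[i][c]))
        if abs(rows[best][c]) < tol:
            continue                             # no pivot in this column
        rows[r], rows[best] = rows[best], rows[r]
        piv = rows[r][c]
        rows[r] = [v / piv for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0.0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append((r, c))
        r += 1
    pivot_cols = {c for _, c in pivots}
    free_cols = [c for c in range(ncols) if c not in pivot_cols]
    if not free_cols:
        return None
    u = [0.0] * ncols
    u[free_cols[0]] = 1.0
    for i, c in pivots:
        u[c] = -rows[i][free_cols[0]]
    return u

def flight_phase(pik, x, rng, eps=1e-9):
    """pik: inclusion probabilities; x[k][j]: balancing variable j of unit k."""
    N, p = len(pik), len(x[0])
    pi = list(pik)
    while True:
        free = [k for k in range(N) if eps < pi[k] < 1 - eps]
        if not free:
            break
        # Balancing constraints restricted to the undecided units
        A = [[x[k][j] / pik[k] for k in free] for j in range(p)]
        u = null_vector(A)
        if u is None:
            break                                # landing phase would start here
        # Largest steps keeping pi + l1*u and pi - l2*u inside the cube
        l1 = min((1 - pi[k]) / d if d > 0 else -pi[k] / d
                 for k, d in zip(free, u) if d != 0)
        l2 = min(pi[k] / d if d > 0 else (pi[k] - 1) / d
                 for k, d in zip(free, u) if d != 0)
        step = l1 if rng.random() < l2 / (l1 + l2) else -l2
        for k, d in zip(free, u):
            pi[k] += step * d
    return [0.0 if v <= eps else 1.0 if v >= 1 - eps else v for v in pi]

# Hypothetical example: N = 12, balancing on pik (fixed size) and a size variable
rng = random.Random(1)                           # arbitrary seed
pik = [0.25] * 12
x = [[pik[k], float(k + 1)] for k in range(12)]
result = flight_phase(pik, x, rng)
print(result)
```

With p balancing variables, at most p components remain strictly between 0 and 1 at the end of this sketch; those are the units that the landing phase would have to resolve.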
6. SAS macros for selecting balanced samples
Next, the SAS macros that allow balanced samples to be selected are presented.
The two main macros (exe_cube and echant_strat) were developed by Guillaume
Chauvet and Yves Tillé. The auxiliary macros disjunctive and crear_estrato were written
by Eustat to simplify the use of the former.
Although Eustat has opted to work with the SAS macros that implement the Cube
Method, the functions that select balanced samples in R are also available (see
sampling package: http://cran.r-project.org/web/packages/sampling/index.html).
exe_cube macro
The SAS macro exe_cube allows the Cube Method (Fast Cube Method) to be used to
select balanced samples.
Input data
The input data are a SAS table with all the population units from which the sample will be
selected.
It should contain at least:
- An identification variable
- A variable with inclusion probabilities
- The variables on which the sample will be balanced
The table must not have missing values in said variables.
Macro syntax
A brief description of the necessary arguments follows:
- BASE = Name of the SAS library that contains the table with the input data.
- DATA = Name of the SAS table with the input data.
- ID = Identification variable of the population units.
- PI = Variable with inclusion probabilities.
- CONTR = Variables on which the sample will be balanced.
- ATTER = Option selected for the landing phase:
  1. The balancing variables are gradually eliminated.
  2. All the possible samples for the remaining units (inclusion probabilities other than 0 or 1) are
     considered. The sample that gives the smallest difference in balance is selected.
  3. The same procedure as for option 2, but only considering samples with a size
     equal to the sum of inclusion probabilities (fixed sample size).
  4. The inclusion probabilities are rounded for the remaining units, keeping the
     sample size fixed.
  To use options 3 or 4, enter the inclusion probabilities variable in the contr
  parameter.
- COMPEQ = Equal to 1 to balance the complement of the sample as well.
- SORT = Name of the SAS table with the output data, which is saved in the library
  specified in the base parameter. It contains all the population units, as well as the
  variable ech, equal to 1 if the unit has been selected and 0 otherwise.
echant_strat macro
The SAS echant_strat macro allows stratified samples to be selected using the Cube
Method (Fast Cube Method), globally balanced in the total population and approximately
balanced in each stratum.
The steps followed by the macro to select a balanced sample are:
1. Independent flight phase in each of the strata
2. Joint flight phase with the remaining units that were not selected in the strata
3. Landing phase with the still unselected units.
Input data
There must be one SAS table with the population units for each of the strata defined for
the stratified sample.
Each table must contain at least the same variables as those required by the exe_cube
macro.
Macro syntax
A brief description of the necessary arguments follows:
- DATA = Names of the SAS tables with the input data, one per stratum.
- ID = Identification variable of the population units.
- PI = Variable with inclusion probabilities.
- CONTR = Variables on which the sample will be balanced.
- SORT = Name of the SAS table with the output data.
disjunctive auxiliary macro
The disjunctive SAS macro allows one or more variables of interest to be divided into
disaggregated variables according to certain categories. The macro also allows the
names of said categories to be entered.
Description
Suppose that, in a population of size N, we are given a variable of interest Y and a qualitative
variable X that takes the values 1, 2, …, L. The disjunctive macro creates the disjunctive
variables $Y^1, Y^2, ..., Y^L$, where:
$$y_i^l = \begin{cases} y_i & \text{if } x_i = l \\ 0 & \text{if } x_i \neq l \end{cases} \quad \text{for } i = 1,..., N, \; l = 1,..., L$$
Macro syntax
A brief description of the necessary arguments follows:
- DATA = Name of the SAS table that contains the population data.
- VAR = Variable(s) of interest.
- CATEG = Qualitative variable that contains the categories for creating disjunctive
  variables.
- NOMBRES_CATEG (optional) = Names of the categories of the variable categ.
  By default categ1, categ2,…, categL.
Results and outputs
The disjunctive macro adds the disjunctive variables created from the variable of interest
var to the input table.
The names of the new variables are the union of the name of the variable var and the
names defined by the variable nombres_categ (separated by the symbol “_”).
The names are saved in the macro variable contr_categ.
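The logic of the disjunctive macro can be illustrated outside SAS. The following Python sketch is not the Eustat macro; it mimics the described behaviour on a list of dictionaries, and it assumes that the category names are matched to the sorted category values. The employee counts are invented.

```python
def disjunctive(rows, var, categ, names=None):
    """Add one column per category of `categ`, equal to `var` when the row
    belongs to that category and 0 otherwise. Returns the new column names."""
    categories = sorted({row[categ] for row in rows})
    if names is None:
        names = [f"categ{i + 1}" for i in range(len(categories))]
    new_cols = [f"{var}_{name}" for name in names]
    for row in rows:
        for cat, col in zip(categories, new_cols):
            row[col] = row[var] if row[categ] == cat else 0
    return new_cols

# Hypothetical establishments with their number of employees and Province code
rows = [
    {"id": 1, "employ": 12, "TH": "48"},
    {"id": 2, "employ": 5, "TH": "20"},
    {"id": 3, "employ": 30, "TH": "01"},
]
contr_categ = disjunctive(rows, "employ", "TH", ["Araba", "Gipuzkoa", "Bizkaia"])
print(contr_categ)
print(rows[0])
```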
crear_estrato auxiliary macro
The SAS crear_estrato macro allows a SAS table to be divided into several tables
according to a stratification variable.
Macro syntax
A brief description of the necessary arguments follows:
- DATA = Name of the SAS table that contains the population data.
- ID = An identification variable.
- VAR_ESTRAT = Variable on which the stratification is to be performed.
Results and outputs
The crear_estrato macro returns a SAS table for each of the values of the variable
var_estrat.
The default names of the output tables are of the type estrato_{var_estrat_j}, where
var_estrat_j is the j-th value of the variable var_estrat (for example, estrato_A).
The names are saved in the macro variable datos_estrat.
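The behaviour described for crear_estrato amounts to splitting a table by the values of the stratification variable. A minimal Python sketch of that idea, with hypothetical data and not the SAS macro itself, could look like this:

```python
def crear_estrato(rows, var_estrat):
    """Split a table (list of dicts) into one table per stratum value.
    The keys follow the naming pattern estrato_<value>."""
    tables = {}
    for row in rows:
        tables.setdefault(f"estrato_{row[var_estrat]}", []).append(row)
    return tables

# Hypothetical population table
rows = [
    {"id": 1, "estrata": "A"},
    {"id": 2, "estrata": "A"},
    {"id": 3, "estrata": "B"},
    {"id": 4, "estrata": "C"},
]
tables = crear_estrato(rows, "estrata")
datos_estrat = list(tables)          # names of the tables that were created
print(datos_estrat)
```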
Example of macro use
Suppose that we want to select a stratified sample of establishments, balancing the
sample on the number of employees per Province.
The initial SAS table with the population data would look like this:
data
id   estrata   pik   employ   TH
1    A         π1    e1       48
2    A         π2    e2       20
3    B         π3    e3       20
4    B         π4    e4       01
5    B         π5    e5       48
6    C         π6    e6       01
7    C         π7    e7       20
where
01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia;
πk is the inclusion probability of establishment k;
ek is the number of employees in establishment k.
• First, we apply the disjunctive macro to calculate the disjunctive balancing variables
for the number of employees by Province.
%global contr_categ;
%disjunctive(
DATA = data,
VAR = employ,
CATEG = TH,
NOMBRES_CATEG = Araba Gipuzkoa Bizkaia
);
data
id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
1    A         π1    e1       48   0              0                 e1
2    A         π2    e2       20   0              e2                0
3    B         π3    e3       20   0              e3                0
4    B         π4    e4       01   e4             0                 0
5    B         π5    e5       48   0              0                 e5
6    C         π6    e6       01   e6             0                 0
7    C         π7    e7       20   0              e7                0
As mentioned above, the aim is to select a sample balanced on the number of employees per Province; in other words, on the totals:
$$\sum_{k \in N} \text{employ\_Araba}_k, \qquad \sum_{k \in N} \text{employ\_Gipuzkoa}_k \qquad \text{and} \qquad \sum_{k \in N} \text{employ\_Bizkaia}_k$$
In this case, the macro variable contr_categ holds the value:
&contr_categ. = employ_Araba employ_Gipuzkoa employ_Bizkaia
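The effect of this step can be illustrated outside SAS. The following Python sketch is not the Eustat macro: the function name make_disjunctive is hypothetical, and the employee counts are made-up numbers standing in for e1, ..., e7. It only shows how one column per Province is derived and how the balancing totals are obtained.

```python
def make_disjunctive(rows, var, categ, names):
    """Return copies of rows with one new column per category.

    The column `<var>_<label>` holds the value of `var` when the row
    belongs to that category and 0 otherwise.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        for code, label in names.items():
            new_row[f"{var}_{label}"] = row[var] if row[categ] == code else 0
        result.append(new_row)
    return result


# Hypothetical employee counts standing in for e1, ..., e7.
population = [
    {"id": 1, "estrata": "A", "employ": 12, "TH": "48"},
    {"id": 2, "estrata": "A", "employ": 7, "TH": "20"},
    {"id": 3, "estrata": "B", "employ": 30, "TH": "20"},
    {"id": 4, "estrata": "B", "employ": 5, "TH": "01"},
    {"id": 5, "estrata": "B", "employ": 18, "TH": "48"},
    {"id": 6, "estrata": "C", "employ": 9, "TH": "01"},
    {"id": 7, "estrata": "C", "employ": 22, "TH": "20"},
]
province_names = {"01": "Araba", "20": "Gipuzkoa", "48": "Bizkaia"}

with_dummies = make_disjunctive(population, "employ", "TH", province_names)

# Names of the new balancing variables, analogous to &contr_categ.
contr_categ = [f"employ_{label}" for label in province_names.values()]

# Population totals on which the sample would be balanced.
totals = {name: sum(row[name] for row in with_dummies) for name in contr_categ}
print(totals)
```

Because every unit belongs to exactly one Province, the three totals add up to the overall number of employees.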
• Next, we would apply the crear_estrato macro to obtain a dataset with the data for
each of the strata.
%global datos_estrat;
%crear_estrato(
DATA = data,
ID = id,
VAR_ESTRAT = estrata
);
estrato_A
id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
1    A         π1    e1       48   0              0                 e1
2    A         π2    e2       20   0              e2                0

estrato_B
id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
3    B         π3    e3       20   0              e3                0
4    B         π4    e4       01   e4             0                 0
5    B         π5    e5       48   0              0                 e5

estrato_C
id   estrata   pik   employ   TH   employ_Araba   employ_Gipuzkoa   employ_Bizkaia
6    C         π6    e6       01   e6             0                 0
7    C         π7    e7       20   0              e7                0
In this case, the macro variable datos_estrat holds the value:
&datos_estrat. = estrato_A estrato_B estrato_C
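The splitting step is simple enough to sketch in a few lines of Python. This is an illustration only, not the crear_estrato macro: the helper split_by_stratum is hypothetical, and the prefix "estrato_" is assumed from the default naming described above.

```python
def split_by_stratum(rows, var_estrat, prefix="estrato_"):
    """Group rows into one table per value of the stratification variable."""
    tables = {}
    for row in rows:
        name = f"{prefix}{row[var_estrat]}"
        tables.setdefault(name, []).append(row)
    return tables


# The seven units of the example, with only the columns needed here.
population = [
    {"id": 1, "estrata": "A"},
    {"id": 2, "estrata": "A"},
    {"id": 3, "estrata": "B"},
    {"id": 4, "estrata": "B"},
    {"id": 5, "estrata": "B"},
    {"id": 6, "estrata": "C"},
    {"id": 7, "estrata": "C"},
]

tables = split_by_stratum(population, "estrata")

# Space-separated list of table names, analogous to &datos_estrat.
datos_estrat = " ".join(tables)
print(datos_estrat)
```

Each unit ends up in exactly one table, so the table sizes (2, 3 and 2 here) add up to the population size.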
• Finally, we call the echant_strat macro, which selects a balanced stratified sample using the Cube Method.
%echant_strat(
DATA = &datos_estrat.,
ID = id,
PI = pik,
CONTR = pik &contr_categ.,
SORT = sample
);
The macro output would look like this:
sample
id   ech
1    ech1
2    ech2
3    ech3
4    ech4
5    ech5
6    ech6
7    ech7
where, for all k ∈ {1,...,7}, echk = 1 if unit k has been selected and echk = 0 otherwise.
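Once the indicator ech is available, the quality of the balancing can be checked by comparing the Horvitz-Thompson estimate of each balancing variable, the sum of x_k/π_k over the selected units, with its population total. The Python sketch below uses made-up inclusion probabilities, employee counts and selection indicators; it is not output of the echant_strat macro, and the helper ht_total is hypothetical.

```python
def ht_total(values, pik, ech):
    """Horvitz-Thompson estimate of a total: sum of x_k / pi_k over selected units."""
    return sum(x / p for x, p, s in zip(values, pik, ech) if s == 1)


# Hypothetical inputs for the seven units of the example.
pik = [0.5, 0.5, 1 / 3, 1 / 3, 1 / 3, 0.5, 0.5]   # one unit expected per stratum
employ = [12, 7, 30, 5, 18, 9, 22]                # stand-ins for e1, ..., e7
ech = [1, 0, 0, 0, 1, 1, 0]                       # a hypothetical selected sample

# Balancing on pik itself fixes the sample size: the estimate equals sum(pik).
size_estimate = ht_total(pik, pik, ech)
size_target = sum(pik)

# For any other variable, the closer the estimate is to the true total,
# the better the sample is balanced on that variable.
employ_estimate = ht_total(employ, pik, ech)
employ_target = sum(employ)

print(size_estimate, size_target)
print(employ_estimate, employ_target)
```

With only seven units an exact match on the number of employees is generally impossible; the balancing equations are then satisfied only approximately.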
* Comment:
On some occasions, the aim may be to balance the sample on totals that count the sample units themselves.
For instance, suppose that in the preceding case we wanted to balance the sample on the number of establishments per Province.
In such a case, we need to create a variable that takes the value 1 for all the units, and pass it to the %disjunctive macro to create the desired balancing variables.
data
id   estrata   pik   employ   TH   ONE
1    A         π1    e1       48   1
2    A         π2    e2       20   1
3    B         π3    e3       20   1
4    B         π4    e4       01   1
5    B         π5    e5       48   1
6    C         π6    e6       01   1
7    C         π7    e7       20   1
%global contr_categ;
%disjunctive(
DATA = data,
VAR = ONE,
CATEG = TH,
NOMBRES_CATEG = Araba Gipuzkoa Bizkaia
);
data
id   estrata   pik   employ   TH   ONE   ONE_Araba   ONE_Gipuzkoa   ONE_Bizkaia
1    A         π1    e1       48   1     0           0              1
2    A         π2    e2       20   1     0           1              0
3    B         π3    e3       20   1     0           1              0
4    B         π4    e4       01   1     1           0              0
5    B         π5    e5       48   1     0           0              1
6    C         π6    e6       01   1     1           0              0
7    C         π7    e7       20   1     0           1              0
7. Balanced samples in Eustat using the cube method
Some of the sampling designs that have been balanced using the Cube Method in
Eustat are presented next.
The method design for each case is described: the technical datasheet, stratification
variables, allocations and inclusion probabilities, as well as the variables on which the
sample was balanced. Some of the outcomes obtained are also presented.
Sample of ESO (Compulsory Secondary Education) centres
for a study on bullying in the Basque Country.
The Department of Education, Universities and Research, via the Basque Country
Institute for Educational Assessment and Research (ISEI-IVEI), conducted a survey on
bullying in Basque Country schools.
To that end, a cluster sample (schools) had to be taken to assess a maximum number of
40 students per selected centre.
Technical Datasheet
• Sampling frame
The frame comprised secondary schools in the Basque Country that had at
least one group in the 1st, 2nd, 3rd and 4th years of ESO.
• Sample design.
The sample was one of unequal allocations with subsampling in the second
stage.
1st stage
Sampling units
Secondary schools in the Basque Country
Stratification
Stratified sampling by Province and system (public and private school
systems) was used to select the centres.
Allocation
Proportional to the number of centres in each stratum.
Draw
Centres were drawn with probability proportional to size (PPS), with size
measured as the number of students per centre.
2nd stage
Sampling units
Secondary school students in the Basque Country
Stratification
40 students (10 from the 1st year, 10 from the 2nd, 10 from the 3rd and 10
from the 4th) per selected centre, whenever possible. There was no
minimum number of students per centre.
Draw
Simple random sampling.
The end sample was self-weighted by strata (Province and System).
• Sample size
The optimal sample size for cluster sampling was calculated according to the
following formula:
$$n_{centers} = n_a \, \frac{1 + \delta (M - 1)}{M}$$
where $n_a$ is the sample size for a simple random sample of students, $1 + \delta (M - 1)$ is the so-called design effect in cluster sampling, and dividing by $M$ converts a number of students into a number of centres.
With
M = Average number of students per centre
δ = Intracentre correlation
$$n_a = \frac{N z_{\alpha/2}^2 S^2}{N e^2 + z_{\alpha/2}^2 S^2} = \frac{N}{1 + (N - 1) \dfrac{e^2}{z_{\alpha/2}^2 \, pq}}$$
N = Total number of students (elementary units)
e = Maximum acceptable error
$z_{\alpha/2}$ = Critical value for the significance level α
$S^2 = \frac{N}{N - 1} pq$ = Population variance of a proportion p, with q = 1 − p
• Balancing variables
The sample was balanced on the following variables:
- Number of students by year and number of groups by year. Thus, estimations of the average number of students by centre and group came as close as possible to the data provided by Education Statistics.
- Number of centres in each size class. Centre size was coded into 5 groups, minimising intraclass inertia according to the size in students: [0-143], [144-243], [244-361], [362-506] and [507-708].
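The sample-size formula given in this technical datasheet can be evaluated numerically. The Python sketch below is only an illustration: N is the total number of students reported in the results of this survey, while the error e, the proportion p, the significance level α, the cluster size M and the intracentre correlation δ are hypothetical values, since the source does not state the ones actually used. The function names are also hypothetical.

```python
from statistics import NormalDist


def srs_sample_size(N, e, p=0.5, alpha=0.05):
    """n_a = N / (1 + (N - 1) e^2 / (z^2 p q)) for a proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    q = 1 - p
    return N / (1 + (N - 1) * e**2 / (z**2 * p * q))


def srs_sample_size_variance_form(N, e, p=0.5, alpha=0.05):
    """Same quantity written with S^2 = N p q / (N - 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    S2 = N * p * (1 - p) / (N - 1)
    return N * z**2 * S2 / (N * e**2 + z**2 * S2)


def number_of_centres(n_a, M, delta):
    """Inflate n_a by the design effect 1 + delta (M - 1), then divide by M."""
    return n_a * (1 + delta * (M - 1)) / M


N = 72272                        # total number of students (from the results)
e, p, alpha = 0.03, 0.5, 0.05    # hypothetical precision settings
M, delta = 40, 0.1               # hypothetical cluster size and correlation

n_a = srs_sample_size(N, e, p, alpha)
n_centres = number_of_centres(n_a, M, delta)
print(round(n_a), round(n_centres))
```

The two functions for n_a return the same value, which is a quick way to confirm that both forms of the formula agree.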
Results
The results obtained for the balancing variables using the Cube Method are shown
below.
Each table compares the population distribution with the one obtained with sample
weighting. The percentages are given by columns.
Distribution of the number of students by year
                Population         Sample (weighted)
1st year ESO    19,664 (27.21%)    19,617 (27.14%)
2nd year ESO    18,633 (25.78%)    18,649 (25.80%)
3rd year ESO    17,669 (24.45%)    17,764 (24.58%)
4th year ESO    16,306 (22.56%)    16,243 (22.47%)
TOTAL           72,272             72,272
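A simple way to read such a table is to compute the relative deviation between the weighted sample total and the population total for each category. The Python sketch below does this for the distribution of students by year, using the figures reported above; the helper relative_deviation is a hypothetical name introduced for illustration.

```python
# Figures taken from the distribution of the number of students by year.
population = {
    "1st year ESO": 19664,
    "2nd year ESO": 18633,
    "3rd year ESO": 17669,
    "4th year ESO": 16306,
}
sample_weighted = {
    "1st year ESO": 19617,
    "2nd year ESO": 18649,
    "3rd year ESO": 17764,
    "4th year ESO": 16243,
}


def relative_deviation(pop, est):
    """Relative difference (estimate - population) / population per category."""
    return {key: (est[key] - pop[key]) / pop[key] for key in pop}


deviations = relative_deviation(population, sample_weighted)

# Category with the largest absolute relative deviation.
worst = max(deviations, key=lambda key: abs(deviations[key]))

for year, dev in deviations.items():
    print(f"{year}: {100 * dev:+.2f}%")
```

For this table all four deviations are below 1% in absolute value.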
Distribution of the number of groups by year
                Population       Sample (weighted)
1st year ESO    870 (25.02%)     869 (25.04%)
2nd year ESO    852 (24.50%)     849 (24.47%)
3rd year ESO    896 (25.77%)     896 (25.82%)
4th year ESO    859 (24.71%)     856 (24.67%)
TOTAL           3,477            3,470
Distribution of the number of centres by type of size
          Population       Sample (weighted)
Size 1    100 (30.12%)     95 (28.79%)
Size 2    128 (38.55%)     129 (39.09%)
Size 3    61 (18.37%)      63 (19.09%)
Size 4    31 (9.34%)       31 (9.39%)
Size 5    12 (3.61%)       12 (3.64%)
TOTAL     332              330
Very good estimators of the student average per centre and group for each year were
also obtained by taking into account the variables on which the sample was balanced.
ACADEMIC YEAR 2011-12
                Student average by centres          Student average by groups
                Population    Sample (weighted)     Population    Sample (weighted)
1st year ESO    59.23         59.44                 22.60         22.57
2nd year ESO    56.21         56.51                 21.90         21.97
3rd year ESO    53.22         53.83                 19.72         19.83
4th year ESO    49.11         49.22                 18.98         18.98
TOTAL           217.69        219.00                20.79         20.33
Sample for the Information Society Survey (ESI-Companies)
The general aim of the ESI carried out by EUSTAT is to provide politicians, economic
and social stakeholders, universities, private researchers and the general public with
periodic information on the penetration of the new information technologies and ICTs in
Basque Country companies.
The ESI-Companies sample is a panel that every year includes the companies that
answered previous waves of the survey. Owing to various incidents (deregistrations,
substitutions, non-response, etc.), the original sample distribution deteriorates. Therefore,
it was decided to update the sample with a new sample distribution that would preserve
the original design and reflect the new distribution of the population in the strata.
In 2012, it was decided to renew nearly 15% of the panel. Moreover, the Cube Method
was introduced to select balanced samples, with the aim of obtaining a balanced
distribution in the Basque Country regions.
Technical Datasheet
• Sampling frame
The frame comprises establishments of any business sector that carry out their
activity in the Basque Country, except in the primary sector and domestic
services.
• Sample design.
It was a one-stage stratified sample.
Sampling units
The establishments were part of the aforementioned framework.
Stratification
A stratified sample was made by crossing the following variables:
- Province
1 = Araba; 2 = Bizkaia; 3 = Gipuzkoa
- Employment stratum
1 = 0-5 employees; 2 = 6-9 employees; 3 = 10-19 employees;
4 = 20-49 employees; 5 = 50-99 employees; 6 = 100 and more
employees;
- Sector of activity (CNAE09 to 2 digits)
Allocation
Self-represented elements: establishments with 100 or more employees
(employment stratum 6)
Two different allocations were made for the rest of the establishments:
1. A distribution proportional to the square root of the number of establishments per
province and directly proportional to the number of establishments per
stratum (province, activity and employment) was made on the basis of a
sample size preset in the original design of n=700.
The sample size in each stratum was calculated according to the following
formula:
$$n_{PROV_i Act_j Emp_k} = n_{PROV_i} \, \frac{estab_{Act_j Emp_k}}{\sum_{j \in Act} \sum_{k=1}^{5} estab_{Act_j Emp_k}}$$
where
$$n_{PROV_i} = (7000 - census) \, \frac{\sqrt{estab_{PROV_i}}}{\sum_{i=1}^{3} \sqrt{estab_{PROV_i}}}, \qquad i = 1, 2, 3$$
Finally, establishments were added until a minimum size of 5
establishments was reached in each grouped employment stratum (fewer than
10 employees, and 10 or more employees).
2. Distribution according to a 10% maximum sample error in each sector of
activity (without taking the census strata into consideration).
The sample size in each stratum was calculated according to the following
formula:
$$n_h = \frac{N_h z_{\alpha/2}^2 S_h^2}{N_h e^2 + z_{\alpha/2}^2 S_h^2} = \frac{N_h}{1 + (N_h - 1) \dfrac{e^2}{z_{\alpha/2}^2 \, pq}}$$
where
$N_h$ = Number of establishments in stratum h
e = Maximum acceptable error
$z_{\alpha/2}$ = Critical value for the significance level α
After the two allocations were made, the missing units were distributed until the
sample size needed for non-census units was reached. This distribution was
made in proportion to the size of the strata in the sectors that were underrepresented compared to the first allocation.
Finally, the allocations per sector of activity were distributed in proportion to the
root in each province and grouped employment.
Drawing
Simple random sampling is conducted in each stratum, giving priority to the
establishments recorded as new registrations in the frame.
• Balancing variables
The sample has been balanced on the number of establishments in each region
(20 regions) in order to obtain better estimations at regional level.
• Substitutes
A pool of substitutes for around 3,500 establishments is needed to complete the
sample. The number of substitutes per stratum is proportional to the theoretical
sample in the employment and province strata.
As in the main sample, the substitutes sample will be balanced with the Cube
Method on the number of establishments in each region.
Results
The results obtained with the Cube Method when balancing on the number of
establishments per region are given below.
Distribution of the number of establishments by region
                            Population         Sample (weighted)
Valles Alaveses             405 (0.22%)        523 (0.29%)
Llanada Alavesa             18,903 (10.49%)    19,063 (10.58%)
Montaña Alavesa             248 (0.14%)        257 (0.14%)
Rioja Alavesa               1,311 (0.73%)      1,135 (0.63%)
Estribaciones del Gorbea    780 (0.43%)        749 (0.42%)
Cantábrica Alavesa          2,180 (1.21%)      2,099 (1.16%)
Arratia - Nervión           1,787 (0.99%)      1,399 (0.78%)
Gran Bilbao                 73,572 (40.82%)    72,517 (40.24%)
Durangaldea                 7,517 (4.17%)      7,795 (4.33%)
Encartaciones               2,356 (1.31%)      2,364 (1.31%)
Gernika – Bermeo            3,425 (1.90%)      3,364 (1.87%)
Markina – Ondarroa          1,828 (1.01%)      2,446 (1.36%)
Plentzia – Mungia           4,008 (2.22%)      4,609 (2.56%)
Bajo Bidasoa                7,169 (3.98%)      8,343 (4.63%)
Bajo Deba                   4,191 (2.33%)      4,989 (2.77%)
Alto Deba                   4,197 (2.33%)      4,742 (2.63%)
Donostialdea                31,422 (17.44%)    28,724 (15.94%)
Goierri                     4,929 (2.73%)      5,192 (2.88%)
Tolosaldea                  4,029 (2.24%)      4,105 (2.28%)
Urola Costa                 5,966 (3.31%)      5,809 (3.22%)
TOTAL                       180,223            180,223
The percentages are given by columns.
Sample for the Social Capital Survey (ECS)
Social capital is construed as a resource to which one has access when one has broad
personal networks in which one takes active part in several economic and social
spheres, in an environment of trust that can facilitate personal and social development,
as well as the economic development of society.
Specifically, in the Social Capital Survey carried out by Eustat, social capital is defined
as a set of social participation and relationship dimensions that include: networks of
friends and family; trust in people and institutions; social participation and cooperation;
information and communication; social cohesion and integration; and health and
happiness.
In 2012, it was decided to use the Cube Method to select the sample for the Social
Capital Survey. This yielded a sample balanced by sex and age in each
Province, and helped to improve estimations at the regional level.
Technical Datasheet
• Sampling frame
The frame of the Social Capital Survey sample comprises the population aged
15 and over that resides in dwellings and collective establishments in the Basque
Country.
• Sample design.
It was a one-stage stratified sample.
Sampling units
Population age 15 and over, that resides in houses and collective establishments
in the Basque Country.
Sample size
n = 7000 individuals were selected.
Stratification
A stratified sample was made by crossing the following variables:
- Province
01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia;
- Size of the municipality
Capital cities, Medium-sized (20,000-100,000) and Small (20,000 or
less)
- Nationality
0 = National; 1 = Foreigners
Allocation
A criterion for each level of stratification has been established:
1. Distribution proportional to the square root of the number of individuals per
Province.
2. Distribution proportional to the number of individuals by size of the
municipality.
3. Distribution proportional to the 2/3 power of the number of individuals per
nationality.
The non-response rates in the previous survey (ECS 2007) were taken into
consideration when choosing the best allocation at the third level. Similar
response rates can be expected for this survey, since the methods
used to gather the information are the same.
Therefore, an allocation was sought that would achieve the minimum sample size
needed (around 400 units) to provide estimations at the level of capital cities and
of the foreign population, taking the response rates into consideration.
The sample size in each stratum was specified by the following formula:
$$n_{PROV_i SIZE_j NAT_k} = n_{PROV_i SIZE_j} \, \frac{\left(N_{PROV_i SIZE_j NAT_k}\right)^{2/3}}{\sum_{k} \left(N_{PROV_i SIZE_j NAT_k}\right)^{2/3}}$$
where
$$n_{PROV_i SIZE_j} = 7000 \, \frac{\sqrt{N_{PROV_i}}}{\sum_{i} \sqrt{N_{PROV_i}}} \, \frac{N_{PROV_i SIZE_j}}{\sum_{j} N_{PROV_i SIZE_j}}$$
for i ∈ {Araba, Gipuzkoa, Bizkaia}, j ∈ {Capital, Medium, Small} and k ∈ {National, Foreigners}.
Drawing
Simple random sampling was carried out in each of the strata.
• Balancing variables
The sample was balanced on the following variables:
- Number of individuals in the cross of Province (Araba, Gipuzkoa, Bizkaia), Sex (Men and Women) and Age (15-24, 25-34, 35-44, 45-54, 55-64 and over 65).
- Number of individuals in each of the 20 regions in the Basque Country.
• Substitutes
A pool of substitutes of another 7,000 individuals is needed to complete the
sample. The substitutes have been drawn while preserving the same sample
distribution by strata as in the original sample, and balancing on the same
variables as for the main sample.
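The three allocation criteria of this datasheet (square root, proportional and 2/3 power) are all instances of a power allocation, in which each cell receives a share proportional to its size raised to a given exponent. The Python sketch below illustrates the idea; power_allocation is a hypothetical helper, the province sizes are the population totals reported in the results of this survey, and the nationality counts and the cell sample size of 500 are made-up values.

```python
def power_allocation(n, sizes, power):
    """Split n across cells in proportion to size ** power."""
    weights = {key: size ** power for key, size in sizes.items()}
    total = sum(weights.values())
    return {key: n * w / total for key, w in weights.items()}


# Population aged 15 and over by Province (totals from the results tables).
province_pop = {"Araba": 276249, "Gipuzkoa": 609237, "Bizkaia": 1008847}

# First level: proportional to the square root of the Province size.
by_province = power_allocation(7000, province_pop, 0.5)

# For comparison: a purely proportional allocation (power = 1).
proportional = power_allocation(7000, province_pop, 1.0)

# Third level: 2/3 power by nationality within one Province x Size cell.
# The counts and the cell sample size of 500 are hypothetical.
cell_pop = {"National": 90000, "Foreigners": 10000}
by_nationality = power_allocation(500, cell_pop, 2 / 3)

for name in province_pop:
    print(name, round(by_province[name]), round(proportional[name]))
print({k: round(v) for k, v in by_nationality.items()})
```

Exponents below 1 shift sample from the largest cells to the smallest ones, which is why the smallest Province and the foreign population receive more units than under a proportional allocation.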
Results
The results obtained for the balancing variables using the Cube Method are shown
below.
Each table compares the population distribution with the one obtained with sample
weighting. The percentages are given by columns.
Distribution by Province, Sex and Age
Province = ARABA (01)
               Men                                      Women                                    TOTAL
               Population         Sample (weighted)     Population         Sample (weighted)     Population         Sample (weighted)
15-24 years    13,818 (10.06%)    13,729 (10.02%)       12,831 (9.24%)     12,762 (9.17%)        26,649 (9.65%)     26,491 (9.59%)
25-34 years    23,028 (16.77%)    22,923 (16.73%)       21,541 (15.51%)    21,725 (15.60%)       44,569 (16.13%)    44,648 (16.16%)
35-44 years    28,954 (21.08%)    28,948 (21.13%)       26,298 (18.93%)    26,278 (18.87%)       55,252 (20.0%)     55,226 (19.99%)
45-54 years    24,889 (18.12%)    24,895 (18.17%)       24,891 (17.92%)    25,039 (17.98%)       49,780 (18.02%)    49,934 (18.08%)
55-64 years    20,051 (14.60%)    19,942 (14.55%)       20,355 (14.65%)    20,332 (14.60%)       40,406 (14.63%)    40,274 (14.58%)
Over 65        26,584 (19.36%)    26,590 (19.40%)       33,009 (23.76%)    33,086 (23.76%)       59,593 (21.57%)    59,676 (21.60%)
TOTAL          137,324 (100%)     137,027 (100%)        138,925 (100%)     139,222 (100%)        276,249 (100%)     276,249 (100%)
Province = GIPUZKOA (20)
               Men                                      Women                                    TOTAL
               Population         Sample (weighted)     Population         Sample (weighted)     Population          Sample (weighted)
15-24 years    30,206 (10.18%)    30,273 (10.22%)       28,416 (9.09%)     28,371 (9.07%)        58,622 (9.62%)      58,644 (9.63%)
25-34 years    45,461 (15.32%)    45,452 (15.34%)       43,313 (13.86%)    43,517 (13.91%)       88,774 (14.57%)     88,968 (14.60%)
35-44 years    60,481 (20.39%)    60,491 (20.41%)       56,318 (18.02%)    56,361 (18.01%)       116,799 (19.17%)    116,852 (19.18%)
45-54 years    54,351 (18.32%)    54,228 (18.30%)       54,409 (17.41%)    54,480 (17.41%)       108,760 (17.85%)    108,707 (17.84%)
55-64 years    45,126 (15.21%)    44,881 (15.14%)       46,428 (14.85%)    46,525 (14.87%)       91,554 (15.03%)     91,406 (15.0%)
Over 65        61,051 (20.58%)    61,021 (20.59%)       83,677 (26.77%)    83,638 (26.73%)       144,728 (23.76%)    144,659 (23.74%)
TOTAL          296,676 (100%)     296,346 (100%)        312,561 (100%)     312,891 (100%)        609,237 (100%)      609,237 (100%)
Province = BIZKAIA (48)
               Men                                       Women                                     TOTAL
               Population          Sample (weighted)     Population          Sample (weighted)     Population          Sample (weighted)
15-24 years    47,497 (9.80%)      47,673 (9.83%)        45,007 (8.59%)      45,152 (8.62%)        92,504 (9.17%)      92,825 (9.20%)
25-34 years    76,941 (15.87%)     76,969 (15.88%)       73,755 (14.07%)     73,658 (14.06%)       150,696 (14.94%)    150,627 (14.93%)
35-44 years    97,104 (20.03%)     97,136 (20.04%)       93,542 (17.85%)     93,318 (17.81%)       190,646 (18.90%)    190,454 (18.88%)
45-54 years    90,348 (18.64%)     90,178 (18.60%)       93,048 (17.75%)     92,807 (17.71%)       183,396 (18.18%)    182,985 (18.14%)
55-64 years    72,330 (14.92%)     72,308 (14.91%)       77,119 (14.71%)     77,329 (14.76%)       149,449 (14.81%)    149,637 (14.83%)
Over 65        100,487 (20.73%)    100,558 (20.74%)      141,669 (27.03%)    141,762 (27.05%)      242,156 (24.0%)     242,320 (24.02%)
TOTAL          484,707 (100%)      484,821 (100%)        524,140 (100%)      524,026 (100%)        1,008,847 (100%)    1,008,847 (100%)
Distribution of the number of individuals by region
                            Population          Sample (weighted)
Valles Alaveses             5,107 (0.27%)       5,051 (0.27%)
Llanada Alavesa             221,595 (11.69%)    221,680 (11.69%)
Montaña Alavesa             2,855 (0.15%)       2,886 (0.15%)
Rioja Alavesa               9,852 (0.52%)       9,835 (0.52%)
Estribaciones del Gorbea    7,296 (0.38%)       7,292 (0.38%)
Cantábrica Alavesa          30,043 (1.58%)      30,004 (1.58%)
Arratia-Nervión             20,289 (1.07%)      20,386 (1.08%)
Gran Bilbao                 768,311 (40.53%)    767,962 (40.51%)
Durangaldea                 83,470 (4.40%)      83,513 (4.41%)
Encartaciones               27,787 (1.47%)      27,742 (1.46%)
Gernika-Bermeo              40,183 (2.12%)      40,331 (2.13%)
Markina-Ondarroa            23,128 (1.22%)      23,333 (1.23%)
Plentzia-Mungia             46,104 (2.43%)      46,202 (2.44%)
Bajo Bidasoa                66,418 (3.50%)      66,403 (3.50%)
Bajo Deba                   47,748 (2.52%)      47,664 (2.51%)
Alto Deba                   53,540 (2.82%)      53,584 (2.83%)
Donostialdea                282,424 (14.90%)    282,508 (14.90%)
Goierri                     57,859 (3.05%)      57,781 (3.05%)
Tolosaldea                  40,147 (2.12%)      40,193 (2.12%)
Urola Costa                 61,490 (3.24%)      61,462 (3.24%)
TOTAL                       1,895,729           1,895,729
Sample for the Technological Innovation Survey (EIT)
The principal aim of the EIT carried out by EUSTAT is to learn more about the effort
made to innovate in several sectors of the economy, and to obtain a series of
indicators that allow the level reached in the Basque Country to be compared with that
of surrounding countries.
The EIT sample is a panel that every year includes the companies that answered
previous waves of the survey. As in the case of ESI-Companies, the original distribution of
the sample deteriorated owing to several incidents (registrations, cancellations,
modifications, etc.). Therefore the sample is updated according to a new sample
distribution that follows the new distribution of the population in the strata while
preserving the original design.
In 2012, it was decided to renew nearly 7% of the panel. Moreover, the Cube Method
was introduced to select balanced samples, with the aim of obtaining a balanced
distribution in the Basque Country regions and their capitals.
Technical Datasheet
• Sampling frame
It comprises the establishments in any sector of activity that carry out their
business in the Basque Country, except the primary sector, public administration,
association activities, household activities, and extraterritorial organisations and
bodies.
• Sample design.
It was a one-stage stratified sample.
Sampling units
The establishments were part of the aforementioned framework.
Stratification
A stratified sample was made by crossing the following variables:
- Province
1 = Araba; 2 = Bizkaia; 3 = Gipuzkoa
- Employment stratum
1 = 0-9 employees; 2 = 10-49 employees;
3 = 50-249 employees; 4 = 250 or more employees;
- Sector of activity (CNAE09 to 2 digits)
Allocation
Self-represented elements: establishments with 250 employees or more
(employment stratum 4) or establishments that correspond to activity 46 in
employment strata 2 and 3.
For the other establishments, the following theoretical allocation is set:
- 2400 establishments are distributed among the strata with 10 or more employees and 750 establishments among the strata with fewer than 10 employees.
- The distribution is carried out in proportion to the square root of the number of establishments by province and employment stratum. Subsequently, another allocation proportional to the square root of the number of establishments by activity stratum is made.
In other words, the sample size in each stratum is specified by the following
formula:
$$n_{PROV_i Emp_j Act_k} = n_{PROV_i Emp_j} \, \frac{\sqrt{estab_{PROV_i Emp_j Act_k}}}{\sum_{k \in Act} \sqrt{estab_{PROV_i Emp_j Act_k}}}$$
where
$$n_{PROV_i Emp_j} = \begin{cases} 750 \, \dfrac{\sqrt{estab_{PROV_i Emp_j}}}{\sum_{j = 1} \sqrt{estab_{PROV_i Emp_j}}} & \text{for employ} < 10 \\[2ex] 2400 \, \dfrac{\sqrt{estab_{PROV_i Emp_j}}}{\sum_{j \in \{2, 3\}} \sqrt{estab_{PROV_i Emp_j}}} & \text{for employ} \geq 10 \end{cases}$$
for i ∈ {01, 20, 48} and j ∈ {1, 2, 3}.
- Finally, establishments are added until the minimum size of 5 establishments in each stratum is reached.
After the theoretical sizes needed for each stratum have been calculated, we
subtract the units that the panel already has to obtain the number of units to take
from each stratum. Specifically, 771 establishments had to be taken out in 2012.
Draw
Simple random sampling is conducted in each stratum, giving priority to the
establishments recorded as new registrations in the frame.
• Balancing variables
The sample for employment strata 2 and 3 (10 or more employees) has been
balanced on the number of establishments in each region (20 regions) and in the
capital cities to obtain better regional estimations.
• Substitutes
A pool of substitutes is needed to complete the sample. Therefore, 5
establishments will be taken from the strata that are not complete. 1,950 reserve
establishments have been extracted in 2012.
As in the main sample, the substitutes sample will be balanced with the Cube
Method on the number of establishments in each region and the capitals.
Results
The results obtained with the Cube Method when balancing on the number of
establishments per region and capital city are given below.
Distribution of the number of establishments by region and capital
cities (more than 10 employees)
                                      Population        Sample (weighted)
Valles Alaveses                       50 (0.40%)        64 (0.51%)
Llanada Alavesa (no capital city)     102 (0.81%)       69 (0.54%)
Montaña Alavesa                       14 (0.11%)        19 (0.15%)
Rioja Alavesa                         105 (0.83%)       93 (0.74%)
Estribaciones del Gorbea              97 (0.77%)        156 (1.23%)
Cantábrica Alavesa                    185 (1.47%)       234 (1.86%)
Arratia - Nervión                     135 (1.07%)       114 (0.91%)
Gran Bilbao (without the capital)     2,931 (23.26%)    2,597 (20.61%)
Durangaldea                           648 (5.14%)       556 (4.41%)
Encartaciones                         111 (0.88%)       217 (1.72%)
Gernika – Bermeo                      162 (1.29%)       271 (2.15%)
Markina – Ondarroa                    103 (0.82%)       192 (1.52%)
Plentzia – Mungia                     200 (1.59%)       333 (2.64%)
Bajo Bidasoa                          373 (2.96%)       385 (3.06%)
Bajo Deba                             359 (2.85%)       290 (2.30%)
Alto Deba                             366 (2.90%)       490 (3.88%)
Donostialdea (without the capital)    910 (7.22%)       841 (6.67%)
Goierri                               334 (2.65%)       387 (3.07%)
Tolosaldea                            311 (2.47%)       419 (3.32%)
Urola Costa                           390 (3.09%)       263 (2.09%)
Vitoria-Gasteiz                       1,548 (12.28%)    1,467 (11.64%)
Bilbao                                1,979 (15.70%)    1,988 (15.78%)
Donostia-San Sebastián                1,190 (9.44%)     1,158 (9.19%)
TOTAL                                 12,603            12,603
The percentages are given by columns.
• Notes:
1. A post-stratification was carried out to calculate the weightings of the number of
establishments per region. The activity strata were grouped according to sector
aggregation A38 (CNAE09), since it is the sector used in dissemination.
2. Very good estimates of the number of establishments in the three capitals were
made.
3. In the other regions, although most of them were properly estimated, several show a high relative error: Estribaciones del Gorbea, Encartaciones, Gernika-Bermeo, Markina-Ondarroa, Plentzia-Mungia, Tolosaldea and Urola-Costa.
4. In these seven regions the Cube Method did not attain a sampling solution with better results due to the constraints imposed by the design:
- Despite a sample size of 2,900 establishments, only 410 were in the draw, because the rest came from the panel and from the census strata.
- Moreover, establishments were selected in only 173 of the 401 strata defined by the cross of Province, activity and employment.
- Finally, in 21 of the 173 strata in which the draw actually took place, the establishment to be selected was pre-determined (because priority had to be given to registrations).
Sample for the Poverty and Social Inequality Survey (EPDS)
The Poverty and Inequality Survey (EPDS) is highly important to the Department of
Justice, Employment and Social Security because it is connected to the evaluation and
programming of its economic benefits. That is why it is particularly important to
consolidate a sampling design that will permit the most appropriate approach possible to
the survey group.
In general, the main goal of the EPDS is to identify, study and assess the various lines of
poverty, their incidence in the Basque Country, and the indicators associated with social
inequality.
In 2012, it was decided to use the Cube Method to select the sample for the EPDS. This
has enabled us to obtain a sample balanced by sex, age and nationality, as well as the
family size in each Province.
Technical Datasheet
• Sampling frame
The frame of the Poverty and Social Inequality Survey comprises the occupied family
dwellings in the Basque Country and its provinces.
• Sample design.
A two-stage sample with stratification in the first stage and fixed sample size in
the second.
Sampling units
Occupied family dwellings in the Basque Country
Sample size
Around 4,000 survey units were selected, together with around 8,000
substitution units (two per sampling unit).
First stage: Sections sample
In the first stage a draw of the census sections in the Basque Country takes
place.
o Stratification
The units in the first stage are stratified by crossing the following variables:
- Regions and areas
01 = Añana; 02 = Ayala/Aiara; 03 = Campezo-Montaña Alavesa;
04 = Laguardia-Rioja Alavesa; 05 = Salvatierra/Agurain;
06 = Vitoria-Gasteiz; 07 = Zuia; 08 = Donostialdea;
09 = Tolosaldea-Goierri; 10 = Alto-Deba; 11 = Bajo-Deba;
12 = Margen Derecha; 13 = Bilbao; 14 = Margen Izquierda;
15 = Bizkaia Costa; 16 = Duranguesado
- Typologies
An analysis of the types of census sections, specific to the EPDS, is carried out
in Eustat. To this end, the following basic variables are taken into consideration: age,
sex, nationality, activity status, number of residents in the
dwelling, and average personal and family income.
After a Principal Component Analysis is carried out, the sections are
classified into 7 types.
- Predominance of young people:
With the aim of over-representing the sample in areas characterised by a
strong relative presence of people under age 45, the sections are
classified into two groups:
1 = Sections with a predominance of young people
0 = Other sections
In the second stage, 24 dwellings are drawn in each "youth"
section and 16 dwellings in each of the other sections.
o Allocation
The 4,000 dwellings are distributed according to the following
allocations:
1. Distribution proportional to the square root of the number of dwellings
per Province.
2. Distribution proportional to the square root of the number of dwellings
per regions/areas.
3. Distribution proportional to the number of dwellings by type and
section type ("youth"/"non-youth")
A minimum size of 160 dwellings per region, and of 112 dwellings in the Álava
region, is required.
o Draw
The sections were drawn with probability proportional to size
(PPS), with size measured as the number of occupied dwellings.
Second stage: Dwellings sample
o Allocation
From 16 to 24 dwellings, depending on the type of section concerned, were
selected for each section selected in the first stage of the sample.
o Draw
A simple random draw was made in each section selected in the first stage.
• Balancing variables
The sample was balanced on the same variables in the first and second stages.
This guarantees that the final sample will be balanced on the complete dwellings
framework.
The balanced variables are as follows:
- Family size: Number of dwellings with 1 resident, 2 residents, 3-4 residents and more than 5 residents, by Province.
- Sex: Number of men and women by Province.
- Age: Number of individuals aged 34 or less, 35-44, 45-54, 55-64 and over 65, by Province.
- Nationality: Number of Spanish and foreign individuals by Province.
- Number of individuals in each region/area.
• Substitutes
To complete the sample, a substitute and a reserve are drawn for each
dwelling. The substitutes have been taken from each of the census sections
selected in the first stage, balancing the sample on the same variables as for the
main-sample dwellings.
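The first-stage draw described in this datasheet uses probabilities proportional to size. A common way to compute such inclusion probabilities is π_k = n x_k / Σ x, with any unit whose value would reach 1 taken with certainty and the rest recomputed. The Python sketch below illustrates that general technique; the section sizes are made-up numbers, the function name is hypothetical, and the source does not say whether Eustat handles large sections in this way.

```python
def pps_inclusion_probabilities(sizes, n):
    """Inclusion probabilities proportional to size for a sample of n units.

    Units whose probability would reach 1 are taken with certainty and the
    remaining probabilities are recomputed over the other units.
    """
    certain = set()
    while True:
        rest = [i for i in range(len(sizes)) if i not in certain]
        remaining_n = n - len(certain)
        total = sum(sizes[i] for i in rest)
        over = {i for i in rest if remaining_n * sizes[i] >= total}
        if not over:
            break
        certain |= over
    return [
        1.0 if i in certain else remaining_n * sizes[i] / total
        for i in range(len(sizes))
    ]


# Hypothetical numbers of occupied dwellings in four census sections.
sizes = [50, 120, 200, 630]
pik = pps_inclusion_probabilities(sizes, n=2)
print(pik)
```

The probabilities add up to the requested sample size, so they can be used directly as the PI argument of a balanced draw that also balances on the inclusion probabilities.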
Results
The results obtained for the balancing variables using the Cube Method are shown
below.
Each table compares the population distribution with the one obtained with sample
weighting. The percentages are given by columns.
Distribution of dwellings by Family Size and Province
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
1 resident              35,528      35,440      68,232      68,553      109,535     112,675
                        (27.77%)    (27.70%)    (24.97%)    (25.09%)    (24.44%)    (25.14%)
2 residents             37,537      38,174      78,075      78,039      130,825     130,322
                        (29.34%)    (29.84%)    (28.57%)    (28.56%)    (29.18%)    (29.07%)
3-4 residents           47,391      47,735      108,714     108,381     180,827     178,194
                        (37.04%)    (37.31%)    (39.78%)    (39.66%)    (40.34%)    (39.75%)
More than 5 residents   7,485       6,592       18,248      18,295      27,079      27,075
                        (5.85%)     (5.15%)     (6.68%)     (6.69%)     (6.04%)     (6.04%)
TOTAL                   127,941     127,941     273,269     273,269     448,266     448,266
Distribution by Sex and Province
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Men                     157,836     155,759     344,561     347,363     553,674     551,028
                        (49.91%)    (49.63%)    (49.02%)    (49.48%)    (48.49%)    (48.53%)
Women                   158,392     158,111     358,350     354,687     588,197     584,492
                        (50.09%)    (50.37%)    (50.98%)    (50.52%)    (51.51%)    (51.47%)
TOTAL                   316,228     313,870     702,911     702,050     1,141,871   1,135,521
Distribution by Age and Province
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Less than 34 years      108,383     109,676     233,423     234,644     366,085     363,674
                        (34.27%)    (34.94%)    (33.21%)    (33.42%)    (32.06%)    (32.03%)
35 - 44 years           55,227      49,691      116,445     116,922     188,762     194,045
                        (17.46%)    (15.83%)    (16.57%)    (16.65%)    (16.53%)    (17.09%)
45 - 54 years           49,799      49,939      109,078     107,384     182,531     179,632
                        (15.75%)    (15.91%)    (15.52%)    (15.30%)    (15.99%)    (15.82%)
55 - 64 years           40,810      43,836      92,261      91,599      151,434     146,342
                        (12.91%)    (13.97%)    (13.13%)    (13.05%)    (13.26%)    (12.89%)
Over 65                 62,009      60,729      151,704     151,501     253,059     251,828
                        (19.61%)    (19.35%)    (21.58%)    (21.58%)    (22.16%)    (22.18%)
TOTAL                   316,228     313,870     702,911     702,050     1,141,871   1,135,521
Distribution by Nationality and Province
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
National                286,633     289,847     658,599     659,521     1,067,272   1,059,925
                        (90.64%)    (92.35%)    (93.70%)    (93.94%)    (93.47%)    (93.34%)
Foreign                 29,595      24,023      44,312      42,529      74,599      75,595
                        (9.36%)     (7.65%)     (6.30%)     (6.06%)     (6.53%)     (6.66%)
TOTAL                   316,228     313,870     702,911     702,050     1,141,871   1,135,521
Distribution of the number of individuals by region/area
                              Population           Sample (weighted)
Añana                         8,617 (0.40%)        8,350 (0.39%)
Ayala / Aiara                 34,208 (1.58%)       33,894 (1.58%)
Campezo - Montaña Alavesa     3,156 (0.15%)        3,118 (0.14%)
Laguardia - Rioja Alavesa     11,414 (0.53%)       11,181 (0.52%)
Salvatierra/Agurain           12,255 (0.57%)       12,384 (0.58%)
Vitoria - Gasteiz             237,059 (10.97%)     235,576 (10.95%)
Zuia                          9,519 (0.44%)        9,368 (0.44%)
Donostialdea                  472,708 (21.87%)     472,950 (21.98%)
Tolosaldea - Goierri          114,584 (5.30%)      113,420 (5.27%)
Alto Deba                     60,919 (2.82%)       60,945 (2.83%)
Bajo Deba                     54,700 (2.53%)       54,734 (2.54%)
Margen Derecha                161,425 (7.47%)      157,625 (7.33%)
Bilbao                        349,132 (16.16%)     348,884 (16.22%)
Margen Izquierda              386,068 (17.87%)     379,912 (17.66%)
Bizkaia Costa                 126,504 (5.85%)      127,321 (5.92%)
Duranguesado                  118,742 (5.49%)      121,778 (5.66%)
TOTAL                         2,161,010            2,151,441
Sample for the study of women in Basque rural areas
The Department of Agriculture, Fishing and Food wants to update the study that has
been conducted since 1998 on "Women in Basque Rural Areas. Needs, Demands and
Social Needs".
In 2012, unlike in previous designs, two samples are taken, one of women and one of
men, aged 15 or over and living in the municipalities that the Department has
classified as rural according to the criteria of size, population density and
agricultural GDP ratio. The sample should comprise 250 men and 250 women in each of
the Basque Country provinces.
Moreover, it was decided to use the Cube Method to select the sample and obtain
a sample of men and women balanced by age, nationality, level of studies and type of
dwelling (urban nucleus or scattered) in each Province.
Technical Datasheet
• Framework
The sample framework comprises the population age 15 and older that resides in
family dwellings in the 128 municipalities indicated as rural by the Department of
Agriculture, Fishing and Food.
• Sample design.
It was decided to conduct a two-stage study with stratification in the first stage,
since the aim is to obtain a sample of women and a sample of the same size of
men in rural municipalities. The allocations in the first and second stages are
calculated so the final sample of individuals is self-weighted by Province.
Thus, after lots are drawn for the rural municipalities, there will be a draw for the
same number of men and women in each municipality.
Sample size
Around 250 men and 250 women are chosen in each Basque Country Province.
Substitutes will not be selected; instead, the sample will be boosted to
allow for the estimated non-response rate (46% in each Province).
First stage: Municipalities sample
In the first stage a stratified draw of the 128 rural municipalities in the Basque
Country takes place.
o Sampling units
Rural municipalities in the Basque Country. These are clusters of individuals
of different sizes.
o Stratification
In the first stage the units are stratified by:
- Province
01 = Araba, 20 = Gipuzkoa and 48 = Bizkaia;
- Size of the municipalities
The stratification of the municipalities by size is optimal. In other words, it
minimizes intra-class inertia or internal variance of each stratum, taking
the total inertia or variance as a benchmark.
1 = [0-569]; 2 = [570-1154]; 3 = [1155-1884]; 4 = [1885-3400]
o Allocation
The final aim is to draw 250 men and 250 women in each Province.
Substitutes will not be selected; instead, the sample will be boosted to
allow for the estimated non-response rate (46% in each Province).
The following procedure has been followed to calculate the number of
municipalities that will be included in the draw:
1. Distribution proportional to the size of the strata (population) of 500
individuals for each Province.
2. The number of municipalities for the draw in each Province is
calculated on the basis of a multiple of the sample population
fraction.
3. Distribution proportional to the number of municipalities per stratum.
4. The municipalities sample is extended to select those that belong to
a stratum size equal to 4.
o Draw
Once the theoretical distribution has been obtained, the draw for rural
municipalities is done by simple random sampling.
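The size stratification used in the first stage is described as minimising the intra-class inertia. The following is a minimal Python sketch of one way to obtain such a partition for a single variable, using dynamic programming over the sorted values. It is an illustration only: the text does not say which algorithm was actually used, and the function name `optimal_strata` and the example sizes are hypothetical.

```python
def optimal_strata(sizes, k):
    """Split `sizes` into k contiguous classes (after sorting) so that the
    total within-class sum of squared deviations is minimal."""
    x = sorted(sizes)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of units")

    # Prefix sums give the within-class sum of squares of x[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def within_ss(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimal inertia when x[:j] is split into c classes.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cost = best[c - 1][i] + within_ss(i, j)
                if cost < best[c][j]:
                    best[c][j] = cost
                    cut[c][j] = i

    # Walk back through the stored cut points to recover the classes.
    classes, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        classes.append(x[i:j])
        j = i
    return classes[::-1]


# Hypothetical municipality sizes, split into 3 classes.
classes = optimal_strata([1, 2, 3, 100, 101, 102, 500, 501], 3)
print(classes)
```

The class boundaries are read off as the smallest and largest value in each returned class, in the same way as the intervals listed for the size strata.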
Second stage: Sample of men and women
In the second stage, we must select the men and women who will be surveyed.
o Sampling units
Men and women age 15 and older who belong to the rural municipalities
selected in the first stage.
o Allocation
For each rural municipality selected in the first stage, the number of men and
women to be drawn is calculated in proportion to the size of the
municipality within its stratum. In other words:
$$ n_{MUN_i} = n_h \cdot \frac{Pop_{MUN_i}}{Pop_h} $$
where MUN_i are the rural municipalities selected in the first stage and h is the
stratum of that municipality.
o Draw
Two simple, independent random samples are taken from the subpopulations
of men and women in each municipality.
The final sample is approximately self-weighted by Provinces.
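As an illustration of the second-stage allocation formula, the following minimal Python sketch computes the number of individuals to draw in one municipality. The function name `second_stage_size` and the figures in the example are hypothetical; the stratum population is passed explicitly because the text does not specify whether Pop_h covers all municipalities in the stratum or only those selected.

```python
def second_stage_size(n_h, pop_mun, pop_h):
    """Number of individuals to draw in one municipality:
    n_MUNi = n_h * Pop_MUNi / Pop_h."""
    if pop_h <= 0:
        raise ValueError("the stratum population must be positive")
    return n_h * pop_mun / pop_h


# Hypothetical figures: 60 interviews allocated to the stratum, a
# municipality of 500 inhabitants in a stratum of 3,000 inhabitants.
size = second_stage_size(60, 500, 3000)
print(size)  # 10.0
```

In practice the result has to be rounded to an integer, which is one reason why the final sample is only approximately self-weighted.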
• Balancing variables
The sample was balanced on the same variables in the first and second stages.
This guarantees that the final sample will be balanced on the complete
individuals framework.
The balanced variables are as follows:
- Sex: Number of men and women, by Province.
- Age: Number of individuals aged 15-25, 26-39, 40-54, 55-64 and over 65, by
  Province.
- Nationality: Number of Spanish and foreign individuals, by Province.
- Studies: Number of individuals with primary, secondary or higher studies, by
  Province.
- Type of dwelling: Number of individuals residing in dwellings of the nucleus
  or scattered type.
Results
The results obtained for the balancing variables using the Cube Method are shown
below.
Each table compares the population distribution with the one obtained with sample
weighting. The percentages are given by columns.
Distribution by Age and Province
SEX = MEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
15 - 25 years           1,705       1,676       1,231       1,236       1,769       1,807
                        (9.70%)     (9.53%)     (10.41%)    (10.45%)    (8.90%)     (9.09%)
26 - 39 years           3,706       3,634       2,958       2,988       4,354       4,383
                        (21.08%)    (20.67%)    (25.01%)    (25.26%)    (21.91%)    (22.06%)
40 - 54 years           5,746       5,807       3,396       3,320       6,169       6,260
                        (32.68%)    (33.03%)    (28.71%)    (28.07%)    (31.05%)    (31.51%)
55 - 64 years           2,698       2,730       1,802       1,809       3,191       3,050
                        (15.35%)    (15.53%)    (15.23%)    (15.29%)    (16.06%)    (15.35%)
Over 65                 3,727       3,734       2,442       2,476       4,386       4,369
                        (21.20%)    (21.24%)    (20.64%)    (20.93%)    (22.07%)    (21.99%)
TOTAL                   17,582      17,582      11,829      11,829      19,869      19,869
SEX = WOMEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
15 - 25 years           1,552       1,624       1,164       1,133       1,716       1,655
                        (9.91%)     (10.37%)    (10.73%)    (10.45%)    (8.99%)     (8.67%)
26 - 39 years           3,351       3,309       2,709       2,658       3,970       4,058
                        (21.39%)    (21.12%)    (24.98%)    (24.51%)    (20.81%)    (21.27%)
40 - 54 years           4,694       4,749       2,880       2,870       5,398       5,403
                        (29.96%)    (30.31%)    (26.56%)    (26.47%)    (28.29%)    (28.32%)
55 - 64 years           2,133       2,067       1,416       1,481       2,714       2,708
                        (13.61%)    (13.19%)    (13.06%)    (13.66%)    (14.23%)    (14.19%)
Over 65                 3,938       3,918       2,675       2,703       5,281       5,255
                        (25.13%)    (25.01%)    (24.67%)    (24.93%)    (27.68%)    (27.54%)
TOTAL                   15,668      15,668      10,844      10,844      19,079      19,079
Distribution by Nationality and Province
SEX = MEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
National                16,410      16,403      11,182      11,218      19,037      19,000
                        (93.33%)    (93.29%)    (94.53%)    (94.83%)    (95.81%)    (95.63%)
Foreign                 1,172       1,179       647         611         832         869
                        (6.67%)     (6.71%)     (5.47%)     (5.17%)     (4.19%)     (4.37%)
TOTAL                   17,582      17,582      11,829      11,829      19,869      19,869
SEX = WOMEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
National                14,694      14,673      10,300      10,278      18,270      18,251
                        (93.78%)    (93.65%)    (94.98%)    (94.78%)    (95.76%)    (95.66%)
Foreign                 974         995         544         566         809         828
                        (6.22%)     (6.35%)     (5.02%)     (5.22%)     (4.24%)     (4.34%)
TOTAL                   15,668      15,668      10,844      10,844      19,079      19,079
Distribution by Level of Studies and Province
SEX = MEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Primary Studies         7,304       7,225       5,287       5,144       6,873       6,813
                        (41.54%)    (41.09%)    (44.70%)    (43.49%)    (34.59%)    (34.29%)
Secondary Studies       7,616       7,630       4,957       5,123       8,798       8,915
                        (43.32%)    (43.40%)    (41.91%)    (43.41%)    (44.28%)    (44.87%)
Higher Studies          2,662       2,727       1,585       1,562       4,198       4,141
                        (15.14%)    (15.51%)    (13.40%)    (13.20%)    (21.13%)    (20.84%)
TOTAL                   17,582      17,582      11,829      11,829      19,869      19,869
SEX = WOMEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Primary Studies         6,774       6,665       4,928       4,922       7,587       7,586
                        (43.23%)    (42.54%)    (45.44%)    (45.39%)    (39.77%)    (39.76%)
Secondary Studies       5,459       5,557       3,451       3,441       6,148       6,160
                        (34.84%)    (35.47%)    (31.82%)    (31.73%)    (32.22%)    (32.29%)
Higher Studies          3,435       3,446       2,465       2,482       5,344       5,333
                        (21.92%)    (21.99%)    (22.73%)    (22.89%)    (28.01%)    (27.95%)
TOTAL                   15,668      15,668      10,844      10,844      19,079      19,079
Distribution by Type of Dwelling and Province
SEX = MEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Nucleus                 16,555      16,743      7,530       7,891       11,750      12,245
                        (94.16%)    (95.23%)    (63.66%)    (66.71%)    (59.14%)    (61.63%)
Scattered               1,027       839         4,299       3,938       8,119       7,624
                        (5.84%)     (4.77%)     (36.34%)    (33.29%)    (40.86%)    (38.37%)
TOTAL                   17,582      17,582      11,829      11,829      19,869      19,869
SEX = WOMEN
                        Araba                   Gipuzkoa                Bizkaia
                        Population  Sample      Population  Sample      Population  Sample
                                    (weighted)              (weighted)              (weighted)
Nucleus                 14,781      14,977      7,223       7,687       11,555      12,072
                        (94.34%)    (95.59%)    (66.61%)    (70.89%)    (60.56%)    (63.27%)
Scattered               887         691         3,621       3,157       7,524       7,007
                        (5.66%)     (4.41%)     (33.39%)    (29.11%)    (39.44%)    (36.73%)
TOTAL                   15,668      15,668      10,844      10,844      19,079      19,079
Sample for Basque Country and Drugs Survey
Basque Country and Drugs is a biennial survey aimed at measuring the consumption of
various substances by the Basque population aged between 15 and 74 years, and their
perception of various issues related to drugs and drug addiction.
In 2012, it was decided to use the Cube Method to select the sample. This has enabled
us to obtain a sample balanced by number of individuals in each sanitary region, size of
municipalities, sex and nationality.
Technical Datasheet
• Framework
The framework of the sample comprises population aged between 15 and 74
years old that resides in family dwellings in the Basque Country and its provinces.
• Sample design
It was a one-stage stratified sample.
Sampling units
Population aged between 15 and 74 years (reference date: July 15, 2012), that
resides in family dwellings in the Basque Country.
Sample size
According to the specifications of the operation, n = 2007 individuals were
selected, providing the same number of substitutes and reserves.
Stratification
The sample was stratified by crossing the following variables:
- Province
01 = Araba; 20 = Gipuzkoa; 48 = Bizkaia
- Age groups:
6 decadal age groups
(15-24, 25-34, 35-44, 45-54, 55-64 and 65-74 years)
Allocation
A criterion for each level of stratification has been established:
1. Distribution proportional to the square root of the number of individuals
per Province
2. For each Province, double size allocation for the youngest age groups
(15-24 years, 25-34 years and 35-44 years).
Drawing
Simple random sampling was carried out in each of the strata.
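The two allocation criteria can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: "double size allocation" is read here as giving each of the three youngest age groups twice the sample size of each older group within a Province, the province populations in the example are invented round numbers, and the function name `allocate_sample` is hypothetical.

```python
import math

AGE_GROUPS = ["15-24", "25-34", "35-44", "45-54", "55-64", "65-74"]
YOUNG_GROUPS = {"15-24", "25-34", "35-44"}


def allocate_sample(n, population_by_province):
    """Return {(province, age_group): theoretical sample size}.

    Step 1: split n across provinces in proportion to the square root of
            their population.
    Step 2: inside each province, give every young age group a relative
            size of 2 and every other age group a relative size of 1.
    """
    roots = {p: math.sqrt(pop) for p, pop in population_by_province.items()}
    total_root = sum(roots.values())
    relative = {g: 2 if g in YOUNG_GROUPS else 1 for g in AGE_GROUPS}
    total_relative = sum(relative.values())

    allocation = {}
    for province, root in roots.items():
        n_province = n * root / total_root
        for group in AGE_GROUPS:
            allocation[(province, group)] = (
                n_province * relative[group] / total_relative
            )
    return allocation


# Hypothetical province populations (not the real 15-74 population counts).
alloc = allocate_sample(
    2007, {"Araba": 10_000, "Gipuzkoa": 40_000, "Bizkaia": 90_000}
)
for stratum, size in alloc.items():
    print(stratum, round(size, 1))
```

The square-root rule gives the smaller provinces a larger share of the sample than a strictly proportional allocation would, and the theoretical sizes still need to be rounded to integers before the draw.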
• Balancing variables
The sample was balanced on the following variables:
- Number of individuals aged between 15 and 74 years in each of the seven
  sanitary regions of the Basque Country: Alava, West Gipuzkoa, Gipuzkoa
  East, (Biz) Interior, (Biz) Ezkerraldea-Enkarterri, (Biz) Uribe and (Biz) Bilbao.
- Number of individuals aged between 15 and 74 years in the municipalities,
  according to their population size: Capitals, 50,000-100,000 inhabitants,
  25,000-50,000 inhabitants, 10,000-25,000 inhabitants and fewer than 10,000
  inhabitants.
- Number of individuals by sex.
- Number of Spanish and foreign individuals.
• Substitutes
To complete the sample, lots are drawn for a substitute and reserve for each
individual.
The substitutes have been taken keeping the same distribution of the original
stratum sample, balancing the sample on the same variables.
Results
The results obtained for the balancing variables using the Cube Method are shown
below.
Each table compares the population distribution with the one obtained with sample
weighting. The percentages are given by columns:
Distribution of the number of individuals by sanitary region
                                Population           Sample (weighted)
Alava                           219,042 (13.28%)     218,966 (13.28%)
West Gipuzkoa                   218,155 (13.23%)     218,335 (13.24%)
Gipuzkoa East                   328,814 (19.94%)     329,009 (19.95%)
(Biz) Interior                  227,787 (13.81%)     228,032 (13.83%)
(Biz) Ezkerraldea-Enkarterri    225,829 (13.70%)     224,429 (13.61%)
(Biz) Uribe                     166,287 (10.08%)     166,029 (10.07%)
(Biz) Bilbao                    263,028 (15.95%)     264,141 (16.02%)
TOTAL                           1,648,942            1,648,942
Distribution of the number of individuals by municipality size
                                Population           Sample (weighted)
Capitals                        587,948 (35.66%)     589,033 (35.72%)
50,000 - 100,000                184,970 (11.22%)     184,638 (11.20%)
25,000 - 50,000                 239,465 (14.52%)     239,354 (14.52%)
10,000 - 25,000                 300,173 (18.20%)     300,088 (18.20%)
Less than 10,000                336,386 (20.40%)     335,829 (20.37%)
TOTAL                           1,648,942            1,648,942
Distribution of the number of individuals by sex
                                Population           Sample (weighted)
Men                             823,310 (49.93%)     823,742 (49.96%)
Women                           825,632 (50.07%)     825,200 (50.04%)
TOTAL                           1,648,942            1,648,942
Distribution of the number of individuals by nationality
                                Population           Sample (weighted)
National                        1,519,906 (92.17%)   1,518,872 (92.11%)
Foreign                         129,036 (7.83%)      130,070 (7.89%)
TOTAL                           1,648,942            1,648,942
8. Conclusions
Finally, we draw some conclusions on the value of balanced sampling, the choice of
balancing variables, and the relationship between balancing, stratification and
calibration.
Balancing and stratification
For stratification and balancing purposes, we need to know the value of the auxiliary
variables for all the population units.
The greatest advantage of stratification is that it allows us to divide a population into
more homogeneous subpopulations to obtain more precise estimators, which reduces
sampling variance. The greater the number of variables correlated with the variables of
interest used, the better the stratification.
Even so, using too many stratification variables may produce very small strata in which
the sample size is insufficient, not to mention the problems that may arise from
non-response in such strata. However, the latter can be fixed by collapsing strata
(post-stratification).
Variables that cannot be included in a multiple stratification can be added as
balancing variables, which retains the variance-reduction benefits of stratification
and adds the advantages of balancing.
Balancing also allows us to work in domains defined by crossing several strata, or in
small areas.
Balancing variables can be quantitative, whereas stratification variables always need to
be qualitative or categorical.
Choice of balancing variables
The auxiliary variables chosen to balance a sample should be very well correlated with
the variables of interest and not correlated with each other.
When balancing a sample on a large number of qualitative auxiliary variables, estimated
totals (or estimated means) are obtained with distributions that are practically identical to
the original population.
The Cube Method provides a very interesting way of selecting primary units in a
multi-stage sample. If a balanced sample is to be selected in the second stage, the
variables to be balanced should already have been balanced in the first stage.
Balancing and calibration
Contrary to balancing and stratification, for calibration we only need to know the value of
the auxiliary variables for the sample elements, as well as the totals of those variables in
the population.
The best strategy to use is balancing and calibration together (see the simulation in
Deville and Tillé, 2004). This is because, in general, better results are obtained if we
calibrate a sample on the same auxiliary variables that were used in the balancing.
There is one case in which the calibration can be used on variables that are not
balancing variables, and that is when it is the same variable measured at different times.
Analysis of results
We now show the results obtained when calibrating two samples previously
balanced with the Cube Method (the 2012 Basque Country and Drugs Survey and the
2012 Social Capital Survey).
In both cases, the calibration was done with the CALMAR macro (calage sur
marges), adjusting the sample weights of the individuals to the marginal totals of
the calibration auxiliary variables.
1. Calibration on the 2012 Basque Country and Drugs Survey
For the Basque Country and Drugs 2012 survey (n = 2007 individuals), it was decided
to calibrate the sample on the following variables:
- Cross of the Province and Sex variables (stratification variables)
- Sanitary region, municipality sizes and sex (balancing variables)
Starting from the initial weights $w_{hi} = w_h \;\; \forall i$ (the same weight within each
stratum), the final weights $w_{hi}^{*}$ were obtained using the CALMAR macro with the
raking ratio method, which adjusts the estimates to the marginal totals of the
calibration variables.
The variable $f = w_{hi}^{*} / w_{hi}$ is defined as the ratio of the final weights to the
initial weights.
Analysing the distribution of this variable, we can determine how much the initial
weights are distorted in order to match the marginal totals of the calibration
variables.
The distribution of the f variable is summarised below:
Mean                        1
Median                      0.9987
Mode                        0.9978
Standard deviation          0.0875
Coefficient of variation    8.75%
Minimum                     0.8365
Maximum                     1.2484
As we can see, the final weights are not far from the initial weights (maximum
increase of 24% and maximum decrease of 16%), so the sampling weights associated
with the stratification are largely maintained.
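To make the raking ratio adjustment and the ratio f more concrete, the following is a minimal Python sketch of raking on two margins. It is not the CALMAR macro, which is a SAS program with further options; the function name `rake`, the toy sample and the margin totals are all hypothetical.

```python
def rake(weights, units, margins, iterations=100):
    """Adjust weights so that weighted totals match the known margins.

    weights : initial weights, one per sampled unit
    units   : one dict per unit, mapping variable name -> category
    margins : {variable: {category: population total}}
    """
    w = [float(x) for x in weights]
    for _ in range(iterations):
        for variable, targets in margins.items():
            # Current weighted total of every category of this variable.
            current = dict.fromkeys(targets, 0.0)
            for w_i, unit in zip(w, units):
                current[unit[variable]] += w_i
            # Scale each weight by target total / current total.
            w = [
                w_i * targets[unit[variable]] / current[unit[variable]]
                for w_i, unit in zip(w, units)
            ]
    return w


# Hypothetical toy sample of four individuals with equal initial weights.
units = [
    {"sex": "M", "size": "small"},
    {"sex": "M", "size": "large"},
    {"sex": "F", "size": "small"},
    {"sex": "F", "size": "large"},
]
initial = [10.0, 10.0, 10.0, 10.0]
margins = {
    "sex": {"M": 22.0, "F": 18.0},
    "size": {"small": 19.0, "large": 21.0},
}
final = rake(initial, units, margins)

# Ratio of final to initial weights, the variable f analysed in the text.
f = [w_star / w_0 for w_star, w_0 in zip(final, initial)]
print(min(f), max(f))
```

The closer the minimum and maximum of f are to 1, the less the calibration has had to distort the design weights, which is the criterion applied in the summary table.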
2. Calibration on the 2012 Social Capital Survey
For the Social Capital Survey 2012 (n = 4000 individuals), it was decided to calibrate
the sample on the following variables:
- Province (Araba, Gipuzkoa and Bizkaia)
- Sex (men and women)
- Age (15-24, 25-34, 35-44, 45-54, 55-64 and more than 65 years)
So, the sample has been calibrated to 36 marginal totals.
As in the previous sample, the variable $f = w_{hi}^{*} / w_{hi}$ has been defined as the
ratio of the final weights to the initial weights. The final weights $w_{hi}^{*}$ were
obtained using the CALMAR macro with the raking ratio method, which adjusts the
estimates to the marginal totals of the calibration variables.
In this case, we not only analyse the distribution of the variable f, but also
compare it with the values obtained in the 2007 Social Capital Survey.
Both surveys have the same sample design, but the 2012 Social Capital Survey
sample was selected by balancing with the Cube Method. The
balancing variables used are the same as the calibration variables.
The next table shows the results obtained for the years 2007 and 2012:
                            2007        2012
Mean                        1.1139      1.0074
Median                      0.9685      0.9944
Mode                        2.0076      1.0287
Standard deviation          0.5306      0.1125
Coefficient of variation    47.63%      11.17%
Minimum                     0.4223      0.7965
Maximum                     2.3236      1.2915
As the 2012 SCS was balanced on the calibration variables, we obtained better
results: the final weights are much closer to the initial weights than in the 2007 SCS
(maximum increase of 29% versus 132% and maximum decrease of 20% versus 58%).
Interest of balanced sampling
In a model-assisted framework, a balanced sampling design combined with the
Horvitz-Thompson estimator is often an optimal strategy (see Nedyalkova and Tillé,
2009). In fact, when a sample is perfectly balanced, the variance of the
Horvitz-Thompson estimator of the totals of the auxiliary variables equals zero.
The advantages of balanced sampling are the following:
- It is an optimisation of probability sampling designs, whether single-stage or
  multi-stage, in which the inclusion probabilities defined by the design are the key to
  selecting random samples.
- It increases the accuracy of the Horvitz-Thompson estimator. Moreover, the variance
  of the estimator depends only on the residuals of the regression of the variables of
  interest on the balancing variables.
- The probability of selecting samples that are unfavourable, extreme or distant from
  the mean is almost nil.
- Balanced sampling guarantees that sample size in specific geographical areas or
  domains will not be too small.
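The balancing property referred to above can be checked numerically. The following minimal Python sketch computes the Horvitz-Thompson estimate of a total and its relative deviation from the known population total. The function names and the six-unit population are hypothetical, and the samples are chosen by hand rather than by the Cube Method.

```python
def ht_total(values, inclusion_probs):
    """Horvitz-Thompson estimate of a total: sum of y_k / pi_k over the sample."""
    return sum(y / pi for y, pi in zip(values, inclusion_probs))


def balance_deviation(sample_x, sample_pi, population_total):
    """Relative gap between the HT estimate of an auxiliary total and its
    known population value; 0 means the sample is balanced on that variable."""
    return (ht_total(sample_x, sample_pi) - population_total) / population_total


# Hypothetical population of six units with auxiliary values 1, 2, 3, 4, 5, 7
# (total 22); each unit has inclusion probability 0.5, so samples have 3 units.
population_total = 22
balanced = balance_deviation([1, 3, 7], [0.5, 0.5, 0.5], population_total)
unbalanced = balance_deviation([4, 5, 7], [0.5, 0.5, 0.5], population_total)
print(balanced, unbalanced)
```

The first sample reproduces the population total exactly, whereas the second overestimates it by almost half; a balanced design aims to select only samples of the first kind, exactly or approximately.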
9. Bibliography
ADIN, A.; ARAMENDI, J.; GALBETE, E. AND IZTUETA, A. (2012)
El Método del Cubo: Un Método para seleccionar muestras
equilibradas. Congreso Vasco de Sociología y Ciencia Política
ARDILLY, P. (1994)
Les Techniques de Sondage. Technip, Paris.
ARDILLY, P. AND TILLÉ, Y. (2006)
Sampling Methods: Exercises and Solutions. Springer, New York.
AZORÍN, F. AND SANCHEZ-CRESPO, J. L. (1986)
Métodos y Aplicaciones del Muestreo. Alianza Editorial, Madrid.
CHAUVET, G. AND TILLÉ, Y. (2005)
Fast SAS Macros for balancing Samples: user's guide. Software
Manual, University of Neuchâtel.
CHAUVET, G. AND TILLÉ, Y. (2007)
Application of fast SAS macros for balanced samples to the selection
of addresses. Case Studies in Business, Industry and Government
Statistics, 1:173-182.
COCHRAN, W. (1977)
Sampling Techniques. Wiley, New York.
DEVILLE, J.-C. AND TILLÉ, Y. (2004)
Efficient balanced sampling: the cube method. Biometrika, 91:893-912.
DEVILLE, J.-C. AND TILLÉ, Y. (2005)
Variance approximation under balanced sampling. Journal of
Statistical Planning and Inference, 128:569-591.
KISH, L. (1965)
Survey Sampling. Wiley, New York.
NEDYALKOVA, D. AND TILLÉ, Y. (2009)
Optimal sampling and estimation strategies under linear model.
Biometrika, 95:521-537.
SÄRNDAL, C.-E.; SWENSSON, B. AND WRETMAN, J. (1992)
Model Assisted Survey Sampling. Springer Verlag, New York.
TILLÉ, Y. (2000)
Ten years of balanced sampling with the cube method: an appraisal.
Demographic Statistical Methods Division Seminar of the U.S.
Census Bureau.
TILLÉ, Y. (2005)
Teoría de Muestreo. Groupe de Statistique, Université de Neuchâtel,
Suisse.
http://www2.unine.ch/files/content/sites/statistics/files/shared/documents/curso_teoria_de_muestreo.pdf
TILLÉ, Y. AND MATEI, A. (2007)
The R Package Sampling. The Comprehensive R Archive Network,
Manual of the Contributed Packages.
http://cran.r-project.org/web/packages/sampling/sampling.pdf
TILLÉ, Y. (2010)
Muestreo Equilibrado y Eficiente: el Método del Cubo. Instituto Vasco
de Estadística, Vitoria-Gasteiz.
http://www.eustat.es/productosServicios/datos/Seminario_52.pdf