
Optimal Allocation in the Multi-way
Stratification Design for
Business Surveys (*)
Paolo Righi, Piero Demetrio Falorsi
Italian National Statistical Institute
(*) Research of National Interest n. 2007RHFBB3 (PRIN) “Efficient use of auxiliary information at the design and at the estimation stage of complex surveys: methodological aspects and applications for producing official statistics”
Outline
• Statement of the problem
• Multi-way Sampling Design
• Multi-way optimal allocation algorithm
• Monte Carlo simulation
Statement of the problem
• Large-scale surveys in Official Statistics usually produce estimates for a set of parameters over a very large number of highly detailed estimation domains
• These domains generally define non-nested partitions of the target population
• When the domain indicator variables are available at the sampling frame level, we may plan a sample covering each domain
• Fixing the sample sizes:
  - helps to control the sampling errors of the main estimates;
  - when direct estimators are not reliable (small area problem), having sampled units in the domains makes it possible to:
    - bound the bias of small area indirect estimators;
    - use models with specific small area effects.
Statement of the problem
• The standard solution for fixing the sample sizes stratifies the sample with strata given by the cross-classification of the variables defining the different partitions (cross-classified or one-way stratified design)
• Main drawback: too detailed a stratification, which brings:
  - risk of sample size explosion;
  - inefficient sample allocation (constraint of 2 units per stratum);
  - risk of statistical burden (e.g. repeated business surveys).
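Statement of the problem
• Domain of interest: $U_d$
• Parameter of interest and (Horvitz-Thompson) estimator, in a multivariate ($r = 1, \dots, R$) and multidomain ($d = 1, \dots, D$) context:
  $t_{(dr)} = \sum_{k \in U} y_{rk}\,\gamma_{dk}, \qquad \hat{t}_{(dr)} = \sum_{k \in s} y_{rk}\,\gamma_{dk} / \pi_k$
  being $\gamma_{dk} = 1$ if $k \in U_d$ and $\gamma_{dk} = 0$ otherwise.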
Statement of the problem
The sampling strategy proposed here bases each domain estimate on a planned sample size.
We consider a general random sampling design where the $U_h$ ($h = 1, \dots, H$), of size $N_h$, define the minimal planned subpopulations.
We assume two cases: $U_d = U_h$ or $U_d = \bigcup_{h \in H_d} U_h$, where $H_d$ is a subset of $\{1, \dots, H\}$.
Statement of the problem
Example:
Three domain types $T_l$ ($l = 1, \dots, 3$): NACE four-digit; NACE three-digit by size; NACE two-digit by geography.
Each domain type defines a partition of the population of cardinality $D_l$, being $D = D_1 + D_2 + D_3$.
Different sampling designs allow planning the sample size of the domains of interest:
• the standard approach defines the $U_h$'s by combining the populations of the three domain types. Then $H = D_1 \times D_2 \times D_3$ and the $\boldsymbol{\delta}_k$ are defined as $(0, \dots, 1, \dots, 0)$ vectors. We denote this design as the cross-classified or one-way stratified design;
Statement of the problem
Example (continued):
• the $U_h$'s are defined by combining all pairs of domain types. Then $H = (D_1 \times D_2) + (D_1 \times D_3) + (D_2 \times D_3)$;
• some $U_h$'s coincide with the domains of one population partition (for instance $T_1$) and the other $U_h$'s are defined by combining pairs of the remaining domain types ($T_2$ and $T_3$). Then $H = D_1 + (D_2 \times D_3)$;
• the $U_h$'s coincide with the domains of interest. Then $H = D_1 + D_2 + D_3$.
Sampling designs defining the $U_h$'s as in the last three points are denoted as multi-way (or incomplete) stratification.
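The four ways of defining the planned subpopulations in the example can be compared numerically. The sketch below is only illustrative: the partition sizes (10, 6, 4) are hypothetical, and the counts assume that every cross-classification cell is non-empty.

```python
from itertools import combinations


def strata_counts(sizes):
    """Return the number H of planned subpopulations for each design in the example.

    `sizes` holds the cardinalities (D1, D2, D3) of the three partitions.
    """
    d1, d2, d3 = sizes
    return {
        # cross-classified (one-way) design: H = D1 x D2 x D3
        "cross_classified": d1 * d2 * d3,
        # all pairs of domain types: H = D1 x D2 + D1 x D3 + D2 x D3
        "all_pairs": sum(a * b for a, b in combinations(sizes, 2)),
        # T1 alone plus the pair (T2, T3): H = D1 + D2 x D3
        "one_plus_pair": d1 + d2 * d3,
        # planned subpopulations equal to the domains: H = D1 + D2 + D3
        "domains_only": d1 + d2 + d3,
    }


# Hypothetical partition sizes, chosen only for illustration.
counts = strata_counts((10, 6, 4))
for design, h in counts.items():
    print(f"{design}: H = {h}")
```

Even with these small sizes, the cross-classified design needs an order of magnitude more planned subpopulations than the design based on the domains alone.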
Multi-way Sampling Design
• Main problem of the multi-way design (MWD): defining a procedure for random selection
• We propose to use the Cube method (Deville and Tillé, 2004), which:
  - selects random samples under a multi-way stratified design;
  - works for a large population and a large number of domains.
Multi-way Sampling Design
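The cube algorithm selects a sample $s$ respecting the following general balancing equations:
$\sum_{k \in s} \dfrac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k$
Setting $\mathbf{x}_k = \pi_k \boldsymbol{\delta}_k$ with $\boldsymbol{\delta}_k = (\delta_{1k}, \dots, \delta_{hk}, \dots, \delta_{Hk})'$, being $\delta_{hk} = 1$ if $k \in U_h$ and $\delta_{hk} = 0$ otherwise, we have
$\sum_{k \in s} \delta_{hk} = n_h$, fixed for each sample selection.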
ITACOSM 2011 - 27-29 June 2011, Pisa, Italy - 6
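The effect of the balancing equations with x_k = π_k δ_k can be checked on a toy population. The sketch below does not implement the cube method; it only verifies, for given candidate samples, whether the balancing equations hold. The population (8 units, two partitions of 2 domains each, π_k = 0.5) and the function name `balancing_gap` are hypothetical.

```python
def balancing_gap(sample, pi, delta):
    """Return sum_{k in s} x_k / pi_k - sum_{k in U} x_k, with x_k = pi_k * delta_k."""
    n_sub = len(delta[0])
    lhs = [sum(pi[k] * delta[k][h] / pi[k] for k in sample) for h in range(n_sub)]
    rhs = [sum(pi[k] * delta[k][h] for k in range(len(pi))) for h in range(n_sub)]
    return [left - right for left, right in zip(lhs, rhs)]


# Toy population: two units in each cell (a, b) of a 2 x 2 cross-classification.
units = [(a, b) for a in (0, 1) for b in (0, 1) for _ in range(2)]
# Planned subpopulations are the four marginal domains: A1, A2, B1, B2.
delta = [[int(a == 0), int(a == 1), int(b == 0), int(b == 1)] for a, b in units]
pi = [0.5] * len(units)

one_per_cell = [0, 2, 4, 6]    # one unit from every cell
diagonal_only = [0, 1, 6, 7]   # both units of cells (A1, B1) and (A2, B2)
first_half = [0, 1, 2, 3]      # all four units from domain A1

gap_one_per_cell = balancing_gap(one_per_cell, pi, delta)
gap_diagonal_only = balancing_gap(diagonal_only, pi, delta)
gap_first_half = balancing_gap(first_half, pi, delta)
print(gap_one_per_cell, gap_diagonal_only, gap_first_half)
```

The first two samples are both balanced although they fill the cross-classification cells differently: only the marginal sample sizes n_h are fixed, which is the point of the multi-way design.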
Optimal allocation algorithm
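Deville and Tillé (2005) proposed an approximated expression of the variance:
$V_p(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \cong f \sum_{k \in U} (1/\pi_k - 1)\, \eta_{(dr)k}^2$
where $f = N/(N-H)$, $\eta_{(dr)k} = y_{rk}\,\gamma_{dk} - \pi_k\, g_{(dr)k}$ and $g_{(dr)k} = \boldsymbol{\delta}_k' \mathbf{B}_{(dr)}$, being
$\mathbf{B}_{(dr)} = \left[ \sum_{k \in U} \pi_k^2\, \boldsymbol{\delta}_k \boldsymbol{\delta}_k' (1/\pi_k - 1) \right]^{-1} \sum_{k \in U} \pi_k\, \gamma_{dk}\, \boldsymbol{\delta}_k\, y_{rk} (1/\pi_k - 1)$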
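A minimal NumPy sketch of the variance approximation, for one domain and one variable, may help to read the formula. It takes f = N/(N-H), η = yγ - πg, g = δ'B and B as the weighted regression coefficient described on the slide. The function name and the toy data are hypothetical. The pseudo-inverse is an assumption of this sketch: with several partitions the indicator columns are linearly dependent, so the matrix to invert is singular, while the fitted values g are still uniquely defined.

```python
import numpy as np


def approx_variance(y, gamma, pi, delta):
    """Approximate variance of the estimated domain total under balanced sampling."""
    y, gamma, pi = (np.asarray(v, dtype=float) for v in (y, gamma, pi))
    delta = np.asarray(delta, dtype=float)
    n_units, n_sub = delta.shape
    f = n_units / (n_units - n_sub)
    w = 1.0 / pi - 1.0
    # sum_k pi_k^2 delta_k delta_k' (1/pi_k - 1)
    a_mat = delta.T @ (delta * (pi**2 * w)[:, None])
    # sum_k pi_k gamma_dk delta_k y_rk (1/pi_k - 1)
    b_vec = delta.T @ (pi * gamma * y * w)
    # Pseudo-inverse (assumption): handles the singular multi-way case.
    b_coef = np.linalg.pinv(a_mat) @ b_vec
    g = delta @ b_coef
    eta = y * gamma - pi * g
    return f * float(np.sum(w * eta**2))


# Check 1: a single planned subpopulation with equal probabilities gives back
# the simple random sampling variance N^2 (1 - n/N) S^2 / n.
y = np.arange(1.0, 7.0)
srs_variance = approx_variance(y, np.ones(6), np.full(6, 0.5), np.ones((6, 1)))

# Check 2: a two-way design on a hypothetical 2 x 2 toy population.
cells = [(a, b) for a in (0, 1) for b in (0, 1) for _ in range(2)]
delta = [[int(a == 0), int(a == 1), int(b == 0), int(b == 1)] for a, b in cells]
gamma = [int(a == 0) for a, _ in cells]   # domain of interest: A1
two_way_variance = approx_variance(np.arange(1.0, 9.0), gamma, np.full(8, 0.5), delta)
print(srs_variance, two_way_variance)
```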
Optimal allocation algorithm
We propose an algorithm for the definition of an optimal inclusion probability vector $\boldsymbol{\pi}^*$ according to the following optimality criterion:
$\min \left( \sum_{k \in U} \pi_k^* \right)$
subject to
(a) $V_p(\hat{t}_{(dr)} \mid \boldsymbol{\pi}^*) \le \bar{V}_{(dr)}$ for all $(dr)$
(b) $0 < \pi_k^* \le 1$
being $\bar{V}_{(dr)}$ a fixed variance threshold for the domain $U_d$ on the $r$-th variable of interest.
Optimal allocation algorithm
We note that the constraints (a) depend on the unknown variables of interest. In practice, only model-predicted values can be used. In the paper we sketch the main phases of the algorithm in this operative context.
We consider a general prediction model $M$:
$y_{rk} = \tilde{y}_{rk} + u_{rk}$
$E_M(u_{rk}) = 0 \;\; \forall k; \qquad E_M(u_{rk}^2) = \sigma_{rk}^2; \qquad E_M(u_{rk}, u_{rl}) = 0 \;\; \forall k \ne l$
We suppose the $\sigma_{rk}^2$ values are known or can be predicted.
Optimal allocation algorithm
To take into account the model uncertainty, the sampling variance is replaced by the Anticipated Variance (Isaki and Fuller, 1982).
An upward approximation of the anticipated variances for the proposed strategy is
$AV(\hat{t}_{(dr)}) = E_M E_p\left[ (\hat{t}_{(dr)} - t_{(dr)})^2 \mid \boldsymbol{\pi}^* \right] \cong f \sum_{k \in U} (1/\pi_k^* - 1)\, \tilde{\eta}_{(dr)k}^2 + \sum_{k \in U} (1/\pi_k^* - 1)\, \gamma_{dk}\, \sigma_{rk}^2$
where $\tilde{\eta}_{(dr)k}^2$ is computed by means of the model-predicted value $\tilde{y}_{rk}$.
The approximation neglects a residual term that we do not show for the sake of brevity. However, the optimization procedure does not change if the corrected anticipated variance is taken into account.
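A small numeric sketch can make the two components of the anticipated variance visible. It reads the approximation as f Σ (1/π_k - 1) η̃² + Σ (1/π_k - 1) γ_dk σ²_rk. The use of the domain indicator γ_dk in the second sum is an assumption, chosen so that the expression agrees with the constraints (5) of the optimization problem. The function name and the input values are hypothetical.

```python
def anticipated_variance(eta_tilde, gamma, sigma2, pi, n_sub):
    """Upward approximation of the anticipated variance for one (d, r) pair.

    eta_tilde : residuals computed from the model-predicted values
    gamma     : 0/1 domain membership indicators (assumed in the model term)
    sigma2    : model variances sigma^2_rk
    pi        : inclusion probabilities
    n_sub     : number H of planned subpopulations
    """
    n_units = len(pi)
    f = n_units / (n_units - n_sub)
    design_term = sum((1.0 / p - 1.0) * e**2 for p, e in zip(pi, eta_tilde))
    model_term = sum((1.0 / p - 1.0) * g * s2 for p, g, s2 in zip(pi, gamma, sigma2))
    return f * design_term + model_term


# Hypothetical inputs: four units, two planned subpopulations.
pi = [0.5, 0.5, 0.25, 0.25]
av = anticipated_variance(
    eta_tilde=[1.0, -1.0, 2.0, -2.0],
    gamma=[1, 1, 0, 0],
    sigma2=[4.0, 4.0, 4.0, 4.0],
    pi=pi,
    n_sub=2,
)
# With a census (all pi_k = 1) both terms vanish.
av_census = anticipated_variance([1.0, -1.0, 2.0, -2.0], [1, 1, 0, 0], [4.0] * 4, [1.0] * 4, 2)
print(av, av_census)
```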
Optimal allocation algorithm
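The optimization problem is defined as
$\min \left( \sum_{k \in U} \pi_k^* \right)$
subject to
(5) $AV(\hat{t}_{(dr)}) \le \bar{V}_{(dr)}$ for all $(dr)$
$0 < \pi_k^* \le 1$
To obtain a solution of the optimization problem we formulate constraints (5) as
$\sum_{k \in U} \gamma_{dk} (\tilde{y}_{rk}^2 + \sigma_{rk}^2) / \pi_k^* \le \bar{V}_{(dr)} + \sum_{k \in U} \gamma_{dk} (\tilde{y}_{rk}^2 + \sigma_{rk}^2) + C_{(dr)}(\boldsymbol{\pi}^*, \tilde{\mathbf{g}}_d)$
being:
$C_{(dr)} = f \left[ \sum_{k \in U} 2 (1 - \pi_k^*)\, \gamma_{dk}\, \tilde{y}_{rk}\, \tilde{g}_{dk} - \sum_{k \in U} \pi_k^* (1 - \pi_k^*)\, \tilde{g}_{dk}^2 \right]$,
$\tilde{\mathbf{g}}_d = (\tilde{g}_{1k}, \dots, \tilde{g}_{dk}, \dots, \tilde{g}_{Dk})'$,
$\tilde{g}_{dk} = \boldsymbol{\delta}_k' \tilde{\mathbf{B}}_{(dr)}$, with $\tilde{\mathbf{B}}_{(dr)}$ given by (2) replacing $y_{rk}$ with $\tilde{y}_{rk}$.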
Optimal allocation algorithm
The algorithm consists of two nested calculation loops.
Let ${}^{(\alpha)}G$ and ${}^{(\alpha,\tau)}G$ respectively denote the generic quantity $G$ as calculated at iteration $\alpha$ ($\alpha = 0, 1, 2, \dots$) of the first loop (outer process) and at iteration $\tau$ ($\tau = 0, 1, 2, \dots$) of the second one (inner process).
Optimal allocation algorithm
For a given value of the vector of the inclusion probabilities ${}^{(\alpha)}\boldsymbol{\pi} = ({}^{(\alpha)}\pi_1, \dots, {}^{(\alpha)}\pi_k, \dots, {}^{(\alpha)}\pi_N)'$, the first calculation loop calculates the $(D \times R)$ terms ${}^{(\alpha)}C_{(dr)}(\boldsymbol{\pi}^*, \tilde{\mathbf{g}}_d)$.
For given values of ${}^{(\alpha)}C_{(dr)}(\boldsymbol{\pi}^*, \tilde{\mathbf{g}}_d)$, the second calculation loop finds the inclusion probabilities ${}^{(\alpha,\tau)}\boldsymbol{\pi} = ({}^{(\alpha,\tau)}\pi_1, \dots, {}^{(\alpha,\tau)}\pi_k, \dots, {}^{(\alpha,\tau)}\pi_N)'$ that solve the problem, by means of a slight modification of the Chromy algorithm.
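To give an intuition of the inner step: once the terms C are held fixed, each constraint has the form Σ a_k/π_k ≤ b. The sketch below is not the Chromy algorithm and does not handle several constraints at once. It only solves the special case of a single constraint, where the minimum of Σ π_k is reached with π_k proportional to the square root of a_k, capping at 1 the units that would exceed it. The function name and the numbers are hypothetical, and all a_k are assumed strictly positive.

```python
from math import sqrt


def min_sum_pi(a, b, max_iter=100):
    """Minimise sum(pi_k) subject to sum(a_k / pi_k) <= b and 0 < pi_k <= 1.

    Single-constraint special case only; assumes a_k > 0 for every unit.
    """
    if sum(a) > b:
        raise ValueError("infeasible: even a census (pi_k = 1) violates the bound")
    capped = set()  # units whose inclusion probability is fixed at 1
    for _ in range(max_iter):
        free = [k for k in range(len(a)) if k not in capped]
        # Part of the bound still available for the units that are not capped.
        budget = b - sum(a[k] for k in capped)
        scale = sum(sqrt(a[k]) for k in free) / budget
        pi = [1.0 if k in capped else scale * sqrt(a[k]) for k in range(len(a))]
        over = [k for k in free if pi[k] > 1.0]
        if not over:
            return pi
        capped.update(over)
    raise RuntimeError("capping did not settle within max_iter iterations")


a = [1.0, 4.0, 9.0]
pi_loose = min_sum_pi(a, b=20.0)   # no capping needed
pi_tight = min_sum_pi(a, b=16.0)   # the third unit is capped at 1
print(pi_loose, pi_tight)
```

In both cases the constraint is met with equality, as expected when the objective is to minimise the expected sample size.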
Monte Carlo simulation
• Objectives of the simulation:
  - test the convergence of the optimization algorithm (optimization step);
  - compare the expected AV with the Monte Carlo empirical AV;
  - compare with the standard cross-classified stratified design.
Monte Carlo simulation
• Data:
  - subpopulation of the Istat Italian Graduates’ Career Survey (3,427 units);
  - driving allocation variables: employment status (yes/no); actively seeking work (yes/no).
• We generate the values of the two variables by means of a logistic additive model (prediction model):
  - explanatory variables: degree mark, sex, age class and aggregated degree subject area;
  - the parameters are estimated from the data of the previous survey.
Monte Carlo simulation
• Survey target estimates:
  - two partitions define the most disaggregated domains:
    - first partition: university by degree subject area (9 classes);
    - second partition: degree by sex;
  - domains in the real survey: 448 + 94; strata: 2,981 (university, degree, sex);
  - in the simulation: domains 20 + 15; strata 91.
• Error thresholds fixed in terms of CV(%)
Monte Carlo simulation
• Results:
  - Assuming the values of the variables of interest as known:
    - iterations (outer process): 6;
    - optimal sample size: 171 (182 after calibration).
  - Assuming predicted values:
    - iterations (outer process): 3;
    - optimal sample size: 699 (707 after calibration).
Monte Carlo simulation
• Analysis of the allocation with the predicted values:
  - the sample allocation procedure uses an approximation of the AV

Average of Expected Anticipated CV(%)
Partition    y1     y2
1            8.1    17.8
2            9.2    19.1

Average of Empirical (10,000 Monte Carlo simulations) Anticipated CV(%)
Partition    y1     y2
1            6.7    14.7
2            7.4    15.5

• The simulation confirms that the input AV is an upward approximation of the real AV
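The claim can be checked cell by cell with the figures reported on the slide. In the sketch below the pairing of each empirical value with its expected value follows the position in the two tables; the column labels y1 and y2 are an assumption about the original layout and do not affect the comparison.

```python
# Average anticipated CV(%) by (partition, variable), as reported on the slide.
expected_cv = {(1, "y1"): 8.1, (1, "y2"): 17.8, (2, "y1"): 9.2, (2, "y2"): 19.1}
empirical_cv = {(1, "y1"): 6.7, (1, "y2"): 14.7, (2, "y1"): 7.4, (2, "y2"): 15.5}

# Ratio below 1 means the approximation used in the allocation is conservative.
ratios = {cell: empirical_cv[cell] / expected_cv[cell] for cell in expected_cv}
for cell, ratio in sorted(ratios.items()):
    print(cell, round(ratio, 3))
```

Every ratio falls between 0.80 and 0.83, so the empirical CVs are roughly 17 to 20 per cent below the values used as input to the allocation.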
Monte Carlo simulation
• Comparison with the standard approach:
  - the implicit model (one-way stratification model) is similar to the model used in our approach;
  - the allocation differences depend on the constraint of a minimum number of units (2) in each stratum;
  - the sample size is 751 units (+7.4%);
  - considering only the domains with small population strata (fewer than 10 units per stratum on average), the standard approach produces a 14.4% larger sample size.
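The reported increase can be reproduced from the sample sizes given in the results. The short check below suggests that the +7.4% is measured against the optimal multi-way size of 699 units rather than against the calibrated size of 707; this reading is an inference from the arithmetic, not a statement made on the slides.

```python
multiway_optimal = 699      # multi-way design, predicted values
multiway_calibrated = 707   # same design after calibration
cross_classified = 751      # standard cross-classified design

increase_vs_optimal = (cross_classified / multiway_optimal - 1) * 100
increase_vs_calibrated = (cross_classified / multiway_calibrated - 1) * 100
print(round(increase_vs_optimal, 1), round(increase_vs_calibrated, 1))
```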
References
Bethel J. (1989) Sample Allocation in Multivariate Surveys, Survey Methodology, 15, 47-57.
Chromy J. (1987) Design Optimization with Multiple Objectives, Proceedings of the Survey Research Methods Section, American Statistical Association, 194-199.
Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika, 91, 893-912.
Deville J.-C., Tillé Y. (2005) Variance approximation under balanced sampling, Journal of Statistical Planning and Inference, 128, 569-591.
Falorsi P. D., Righi P. (2008) A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation, Survey Methodology, 34, 223-234.
Falorsi P. D., Orsini D., Righi P. (2006) Balanced and Coordinated Sampling Designs for Small Domain Estimation, Statistics in Transition, 7, 1173-1198.
Isaki C.T., Fuller W.A. (1982) Survey design under a regression superpopulation model, Journal of the American Statistical Association, 77, 89-96.