Optimal Allocation in the Multi-way Stratification Design for Business Surveys (*)

Paolo Righi, Piero Demetrio Falorsi
[email protected]; [email protected]
Italian National Statistical Institute

(*) Research of National Interest n. 2007RHFBB3 (PRIN) “Efficient use of auxiliary information at the design and at the estimation stage of complex surveys: methodological aspects and applications for producing official statistics”

Outline
- Statement of the problem
- Multi-way Sampling Design
- Multi-way optimal allocation algorithm
- Monte Carlo simulation

Statement of the problem
- Large-scale surveys in Official Statistics usually produce estimates for a set of parameters over a very large number of highly detailed estimation domains.
- These domains generally define non-nested partitions of the target population.
- When the domain indicator variables are available in the sampling frame, we may plan a sample covering each domain.
- Fixing the sample sizes:
  - helps to control the sampling errors of the main estimates;
  - when direct estimators are not reliable (small area problem), having sampled units in the domains makes it possible to bound the bias of small area indirect estimators and to use models with specific small area effects.

- The standard solution for fixing the sample sizes stratifies the sample with strata given by the cross-classification of the variables defining the different partitions (cross-classified or one-way stratified design).
- Main drawback: the stratification is too detailed, which entails:
  - risk of sample size explosion;
  - inefficient sample allocation (constraint of 2 units per stratum);
  - risk of statistical burden (e.g. in repeated business surveys).

Domain of interest
- Parameter of interest and estimator, in a multivariate ($r = 1, \dots, R$) and multidomain ($d = 1, \dots, D$) context:
$$t_{(dr)} = \sum_{k \in U} y_{rk}\,\gamma_{dk}, \qquad \hat{t}_{(dr)} = \sum_{k \in s} \frac{y_{rk}\,\gamma_{dk}}{\pi_k},$$
  with $\gamma_{dk} = 1$ if $k \in U_d$ and $\gamma_{dk} = 0$ otherwise.
- The sampling strategy proposed here bases each domain estimate on a planned sample size.
- We consider a general random sampling design where the $U_h$ ($h = 1, \dots, H$), of size $N_h$, define minimal planned subpopulations. We assume two cases: $U_d = U_h$, or $U_d = \bigcup_{h \in \mathcal{H}_d} U_h$, where $\mathcal{H}_d$ is a subset of $\{1, \dots, H\}$.

Example: three domain types $T_l$ ($l = 1, \dots, 3$):
- NACE four-digit;
- NACE three-digit by size;
- NACE two-digit by geography.

Each domain type defines a partition of the population of cardinality $D_l$, with $D = D_1 + D_2 + D_3$. Different sampling designs make it possible to plan the sample size of the domains of interest:
1. The standard approach defines the $U_h$'s by combining the partitions of the three domain types. Then $H = D_1 \times D_2 \times D_3$ and the $\delta_k$ are defined as $(0, \dots, 1, \dots, 0)$ vectors. We call this design the cross-classified or one-way stratified design.
2. The $U_h$'s are defined by combining all the pairs of domain types. Then $H = (D_1 \times D_2) + (D_1 \times D_3) + (D_2 \times D_3)$.
3. Some $U_h$'s coincide with the domains of one population partition (for instance $T_1$) and the other $U_h$'s are defined by combining pairs of the remaining domain types ($T_2$ and $T_3$). Then $H = D_1 + (D_2 \times D_3)$.
4. The $U_h$'s coincide with the domains of interest. Then $H = D_1 + D_2 + D_3$.

Sampling designs that define the $U_h$'s as in the last three points are called multi-way (or incomplete) stratification designs.

Multi-way Sampling Design
- Main problem of the multi-way design (MWD): defining a procedure for random selection.
- We propose to use the Cube method (Deville and Tillé, 2004), which:
  - selects a random sample for a multi-way stratified design;
  - works for a large population and a large number of domains.
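To make the difference in the number of planned subpopulations concrete, the four ways of defining the $U_h$'s in the example above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function names `stratum_counts` and `membership_vector`, the dictionary keys, and the sizes used in the usage note are hypothetical and do not come from the slides.

```python
def stratum_counts(d1, d2, d3):
    """Number H of planned subpopulations U_h under the four designs
    of the example, given the cardinalities D1, D2, D3 of the partitions."""
    return {
        # 1. cross-classified (one-way stratified) design
        "cross_classified": d1 * d2 * d3,
        # 2. all pairs of domain types
        "pairwise": d1 * d2 + d1 * d3 + d2 * d3,
        # 3. T1 alone plus the combination of T2 and T3
        "mixed": d1 + d2 * d3,
        # 4. U_h's coincide with the domains of interest
        "marginal": d1 + d2 + d3,
    }


def membership_vector(domains, sizes):
    """delta_k for design 4: a 0/1 vector of length D1 + D2 + D3 with
    exactly one 1 per domain type (zero-based domain labels)."""
    delta = []
    for label, size in zip(domains, sizes):
        block = [0] * size
        block[label] = 1
        delta.extend(block)
    return delta


if __name__ == "__main__":
    for design, h in stratum_counts(20, 15, 10).items():
        print(f"{design:>16}: H = {h}")
    print(membership_vector((1, 0, 2), (2, 2, 3)))
```

With the hypothetical sizes 20, 15 and 10, the cross-classified design needs 3,000 subpopulations while design 4 needs 45, which is the gap that motivates multi-way stratification. Note that in design 4 each unit belongs to several $U_h$'s at once, so its $\delta_k$ contains more than one 1.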
The cube algorithm selects a sample $s$ respecting the following general balancing equations:
$$\sum_{k \in s} \frac{x_k}{\pi_k} = \sum_{k \in U} x_k .$$
Setting $x_k = \pi_k \delta_k$ with $\delta_k = (\delta_{1k}, \dots, \delta_{hk}, \dots, \delta_{Hk})'$, where $\delta_{hk} = 1$ if $k \in U_h$ and $\delta_{hk} = 0$ otherwise, we have
$$\sum_{k \in s} \delta_{hk} = \sum_{k \in U} \pi_k \delta_{hk} = n_h ,$$
which is fixed for each sample selection.

ITACOSM 2011 - 27-29 June 2011, Pisa, Italy

Optimal allocation algorithm
Deville and Tillé (2005) proposed an approximate expression of the variance:
$$V_p(\hat{t}_{(dr)} \mid \pi) \cong f \sum_{k \in U} (1/\pi_k - 1)\, \eta_{(dr)k}^2 ,$$
where $f = N/(N-H)$, $\eta_{(dr)k} = y_{rk}\gamma_{dk} - \pi_k g_{(dr)k}$ and $g_{(dr)k} = \delta_k' B_{(dr)}$, with
$$B_{(dr)} = \Big[\sum_{k \in U} \pi_k^2\, \delta_k \delta_k'\, (1/\pi_k - 1)\Big]^{-1} \sum_{k \in U} \pi_k\, \gamma_{dk}\, \delta_k\, y_{rk}\, (1/\pi_k - 1) .$$

We propose an algorithm for the definition of an optimal inclusion probability vector $\pi^*$ according to the following optimality criterion:
$$\min \Big(\sum_{k \in U} \pi_k^*\Big) \quad \text{subject to} \quad \text{(a) } V_p(\hat{t}_{(dr)} \mid \pi^*) \le \bar{V}_{(dr)} \;\; \forall\,(d, r); \qquad \text{(b) } 0 < \pi_k^* \le 1 ,$$
where $\bar{V}_{(dr)}$ is a fixed variance threshold for the domain $U_d$ on the $r$-th variable of interest.

We note that the constraints (a) depend on the unknown variables of interest. In practice only model-predicted values can be used. In the paper we sketch the main phases of the algorithm in this operational context. We consider a general prediction model $M$:
$$y_{rk} = \tilde{y}_{rk} + u_{rk}, \qquad E_M(u_{rk}) = 0 \;\; \forall k; \qquad E_M(u_{rk}^2) = \sigma_{rk}^2; \qquad E_M(u_{rk} u_{rl}) = 0 \;\; \forall k \ne l .$$
We suppose the $\sigma_{rk}^2$ values are known or can be predicted.

To take the model uncertainty into account, the sampling variance is replaced by the Anticipated Variance (Isaki and Fuller, 1982). An upward approximation of the anticipated variance for the proposed strategy is
$$AV(\hat{t}_{(dr)}) = E_M E_p\big[(\hat{t}_{(dr)} - t_{(dr)})^2 \mid \pi^*\big] \cong f \Big[\sum_{k \in U} (1/\pi_k^* - 1)\, \tilde{\eta}_{(dr)k}^2 + \sum_{k \in U} (1/\pi_k^* - 1)\, \gamma_{dk}\, \sigma_{rk}^2\Big] ,$$
where $\tilde{\eta}_{(dr)k}^2$ is computed by means of the model-predicted value $\tilde{y}_{rk}$. The approximation neglects a residual term that we do not show for the sake of brevity.
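As an illustration of how the variance approximation above might be evaluated for a single domain $d$ and variable $r$, the following Python sketch computes $B_{(dr)}$, $g_{(dr)k}$, $\eta_{(dr)k}$ and the resulting approximate variance with NumPy. The function name `approx_variance` and its argument layout are hypothetical. The use of a pseudo-inverse in place of the plain inverse is an assumption made here so that the sketch also runs when the matrix is singular, as can happen when the $U_h$'s come from overlapping partitions; it is not stated in the slides.

```python
import numpy as np


def approx_variance(y, gamma, pi, delta):
    """Approximate variance f * sum_k (1/pi_k - 1) * eta_k**2 for one (d, r).

    y     : (N,) values y_rk of the variable of interest
    gamma : (N,) 0/1 domain membership indicators gamma_dk
    pi    : (N,) inclusion probabilities, 0 < pi_k <= 1
    delta : (N, H) 0/1 membership indicators for the subpopulations U_h
    """
    y = np.asarray(y, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    pi = np.asarray(pi, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n_units, n_sub = delta.shape

    w = 1.0 / pi - 1.0                                # (1/pi_k - 1)
    # sum_k pi_k^2 (1/pi_k - 1) delta_k delta_k'
    a_mat = (delta * (pi**2 * w)[:, None]).T @ delta
    # sum_k pi_k gamma_dk delta_k y_rk (1/pi_k - 1)
    c_vec = delta.T @ (pi * w * gamma * y)
    # Assumption: pseudo-inverse, so a singular a_mat does not break the sketch.
    b_vec = np.linalg.pinv(a_mat) @ c_vec

    g = delta @ b_vec                                 # g_(dr)k = delta_k' B_(dr)
    eta = y * gamma - pi * g                          # eta_(dr)k
    f = n_units / (n_units - n_sub)                   # f = N / (N - H)
    return float(f * np.sum(w * eta**2))


if __name__ == "__main__":
    # Toy one-way design: two strata of three units each, domain = population.
    y = [1, 2, 3, 10, 20, 30]
    delta = [[1, 0]] * 3 + [[0, 1]] * 3
    pi = [1 / 3] * 3 + [2 / 3] * 3
    print(approx_variance(y, [1] * 6, pi, delta))
```

In the toy one-way design, $\eta_{(dr)k}$ reduces to the deviation of $y_{rk}$ from its stratum mean, so the result can be checked by hand. When $y_{rk}\gamma_{dk}/\pi_k$ is an exact linear combination of the indicators in $\delta_k$, the residuals vanish and the approximate variance is zero.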
The optimization procedure does not change if the corrected anticipated variance, which includes the residual term neglected in the approximation, is used instead.

The optimization problem is defined as
$$\min \Big(\sum_{k \in U} \pi_k^*\Big) \quad \text{subject to} \quad AV(\hat{t}_{(dr)}) \le \bar{V}_{(dr)} \;\; \forall\,(d, r) \quad (5); \qquad 0 < \pi_k^* \le 1 .$$
To obtain a solution of the optimization problem we formulate constraints (5) as
$$f \sum_{k \in U} \frac{\gamma_{dk}\,(\tilde{y}_{rk}^2 + \sigma_{rk}^2)}{\pi_k^*} \le \bar{V}_{(dr)} + f \sum_{k \in U} \gamma_{dk}\,(\tilde{y}_{rk}^2 + \sigma_{rk}^2) + C_{(dr)}(\pi^*, \tilde{g}_d) ,$$
where $\tilde{g}_d = (\tilde{g}_{1k}, \dots, \tilde{g}_{dk}, \dots, \tilde{g}_{Dk})'$ and
$$C_{(dr)}(\pi^*, \tilde{g}_d) = f \Big[\sum_{k \in U} 2\,(1 - \pi_k^*)\, \gamma_{dk}\, \tilde{y}_{rk}\, \tilde{g}_{dk} - \sum_{k \in U} \pi_k^* (1 - \pi_k^*)\, \tilde{g}_{dk}^2\Big] ,$$
with $\tilde{g}_{dk} = \delta_k' \tilde{B}_{(dr)}$ and $\tilde{B}_{(dr)}$ given by the expression (2) for $B_{(dr)}$, replacing $y_{rk}$ with $\tilde{y}_{rk}$.

The algorithm consists of two nested calculation loops. Let ${}^{(\alpha)}G$ and ${}^{(\alpha,\tau)}G$ denote the generic quantity $G$ as calculated at iteration $\alpha$ ($\alpha = 0, 1, 2, \dots$) of the first loop (outer process) and at iteration $\tau$ ($\tau = 0, 1, 2, \dots$) of the second loop (inner process), respectively.
- For a given value of the vector of inclusion probabilities ${}^{(\alpha)}\pi = ({}^{(\alpha)}\pi_1, \dots, {}^{(\alpha)}\pi_k, \dots, {}^{(\alpha)}\pi_N)$, the first loop calculates the $D \times R$ terms ${}^{(\alpha)}C_{(dr)}(\pi^*, \tilde{g}_d)$.
- For given values of ${}^{(\alpha)}C_{(dr)}(\pi^*, \tilde{g}_d)$, the second loop finds the inclusion probabilities ${}^{(\alpha,\tau)}\pi = ({}^{(\alpha,\tau)}\pi_1, \dots, {}^{(\alpha,\tau)}\pi_k, \dots, {}^{(\alpha,\tau)}\pi_N)$ that solve the problem, using a slight modification of the Chromy algorithm.

Monte Carlo simulation
Objectives of the simulation:
- test the convergence of the optimization algorithm (optimization step);
- compare the expected AV with the Monte Carlo empirical AV;
- compare the results with the standard cross-classified stratified design.

Data:
- Subpopulation of the Istat Italian Graduates' Career Survey (3,427 units);
- Driving allocation variables: employment status (yes/no); actively seeking work (yes/no);
- We generate the values of the two variables by means of a logistic additive model (prediction model);
- Explanatory variables: degree mark, sex, age class and aggregated subject area of the degree;
- The parameters are estimated from the data of the previous survey.

Survey target estimates:
- Two partitions define the most disaggregated domains:
  - first partition: university by subject area of the degree (9 classes);
  - second partition: degree by sex;
- Domains in the real survey: 448 + 94; strata: 2,981 (university, degree, sex);
- In the simulation: domains 20 + 15; strata 91;
- Error thresholds are fixed in terms of CV (%).

Results:
- Assuming the values as known: iterations (outer process): 6; optimal sample size 171 (182 after calibration).
- Assuming predicted values: iterations (outer process): 3; optimal sample size 699 (707 after calibration).

Analysis of the allocation with the predicted values. The sample allocation procedure uses an approximation of the AV.

Average of expected anticipated CV (%)
Partition    y1     y2
1            8.1    17.8
2            9.2    19.1

Average of empirical anticipated CV (%) over 10,000 Monte Carlo simulations
Partition    y1     y2
1            6.7    14.7
2            7.4    15.5

The simulation confirms that the input AV is an upward approximation of the real AV.

Comparison with the standard approach:
- The implicit model (one-way stratification model) is similar to the model used in our approach;
- The allocation differences depend on the constraint of a minimum of 2 units in each stratum;
- The sample size is 751 units (+7.4%);
- Restricting attention to the domains with small population strata (fewer than 10 units per stratum on average), the standard approach produces a sample size 14.4% larger.

References
Bethel J. (1989). Sample Allocation in Multivariate Surveys. Survey Methodology, 15, 47-57.
Chromy J. (1987). Design Optimization with Multiple Objectives. Proceedings of the Survey Research Methods Section, American Statistical Association, 194-199.
Deville J.-C., Tillé Y. (2004). Efficient Balanced Sampling: the Cube Method. Biometrika, 91, 893-912.
Deville J.-C., Tillé Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128, 569-591.
Falorsi P. D., Righi P. (2008). A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation. Survey Methodology, 34, 223-234.
Falorsi P. D., Orsini D., Righi P. (2006). Balanced and Coordinated Sampling Designs for Small Domain Estimation. Statistics in Transition, 7, 1173-1198.
Isaki C. T., Fuller W. A. (1982). Survey design under a regression superpopulation model. Journal of the American Statistical Association, 77, 89-96.