Giorgio E. Montanari - UniFI

Quasi-Optimal Estimation in Survey Sampling
Stima quasi ottimale nelle inferenze campionarie
Giorgio E. Montanari
Department of Statistical Sciences-University of Perugia
Via A. Pascoli - 06100 Perugia - Italy - E-mail:[email protected]
Summary: this paper proposes an extended class of generalised regression estimators
that includes the optimal estimator as a special case. Some results are then presented
showing that the latter is equivalent to a generalised regression estimator based on a
linear regression model whose explanatory variables include the variables that are
balanced with respect to the sampling design in use, i.e. those whose means are estimated
without error by the usual unbiased estimators. Through a careful choice of the balanced
variables it is then possible to construct a quasi-optimal estimator that combines the
asymptotic efficiency of the optimal estimator with the greater stability of the
generalised regression estimator.
Keywords: generalised regression estimator, superpopulation model, optimal estimator.
1. Introduction
Regression estimation is a powerful technique for estimating finite population means or
totals of survey variables when the population means or totals of a set of auxiliary
variables are known. Two types of regression estimators have recently been studied in
the literature, namely the Generalised Regression Estimator (GRE) and the Optimal
Estimator (OPE). Until now, they have been studied separately (for a short review see
Montanari, 1998). In this paper we explore the connections between the two types of
regression estimators and establish conditions under which a GRE is exactly
or asymptotically equivalent to the OPE. Then, an estimation strategy able to merge the
large sample efficiency of the OPE with the greater stability of the GRE for samples of
moderate size is proposed.
2. The framework
Consider a finite population $U = \{u_1, u_2, \ldots, u_N\}$, where the i-th unit is represented by its
label i. Let $Y_i$ and $x_i = (X_{1i}, X_{2i}, \ldots, X_{qi})'$ be the values associated with unit i of the survey
variable y and of a q-dimensional auxiliary variable x. The population mean vector
$\bar{x} = \sum_{i=1}^{N} x_i / N$ of x is assumed known, e.g. from administrative registers or census data.
The unknown mean of y, $\bar{Y} = \sum_{i=1}^{N} Y_i / N$, has to be estimated by means of a sample
s drawn from U according to a probabilistic sampling design, taking into account the
knowledge of $\bar{x}$.
Let $\hat{\bar{Y}} = \sum_{i \in s} Y_i / (N \pi_i)$ and $\hat{\bar{x}} = \sum_{i \in s} x_i / (N \pi_i)$ be the design-unbiased
Horvitz-Thompson estimators of $\bar{Y}$ and $\bar{x}$, respectively, where $\pi_i$ (i = 1, …, N) is the first
order inclusion probability of unit i. The most common way of taking into account the
knowledge of the auxiliary variable population means is to adopt the regression
estimator $\hat{\bar{Y}}_R = \hat{\bar{Y}} + \hat{\beta}'(\bar{x} - \hat{\bar{x}})$, where $\hat{\beta}$ is a vector of regression coefficients given by
some function of the sample data $\{(Y_i, x_i); i \in s\}$. The class of regression estimators contains
well known estimators, such as ratio and product estimators, even with linearly
transformed auxiliary variables, post-stratified and regression estimators. So, the main
issue the statistician has to deal with is the definition of $\hat{\beta}$.
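As a minimal sketch of the regression estimator class, the Python code below computes the Horvitz-Thompson means and the estimator for a user-supplied coefficient vector. The function names and the toy numbers are hypothetical and serve only as an illustration. Choosing the coefficient as the ratio of the two Horvitz-Thompson means (one auxiliary variable) gives back the ratio estimator.

```python
import numpy as np

def ht_mean(values, pi, N):
    """Horvitz-Thompson estimate of a population mean: sum over s of v_i / (N * pi_i)."""
    values = np.asarray(values, dtype=float)   # shape (n,) or (n, q)
    pi = np.asarray(pi, dtype=float)           # first order inclusion probabilities, shape (n,)
    return (values.T / pi).sum(axis=-1) / N

def regression_estimator(y_s, x_s, pi_s, N, x_bar, beta_hat):
    """Y_hat + beta_hat'(x_bar - x_hat) for a user-supplied coefficient vector."""
    y_ht = ht_mean(y_s, pi_s, N)
    x_ht = ht_mean(x_s, pi_s, N)
    return y_ht + np.dot(beta_hat, np.asarray(x_bar, dtype=float) - x_ht)

# Toy sample (hypothetical numbers): n = 3 units from N = 6, one auxiliary variable.
y_s = np.array([2.0, 4.0, 6.0])
x_s = np.array([[1.0], [2.0], [3.0]])
pi_s = np.array([0.5, 0.5, 0.5])
N, x_bar = 6, np.array([2.5])

# Ratio estimator as a member of the class: beta_hat = Y_hat / x_hat.
y_ht = ht_mean(y_s, pi_s, N)
x_ht = ht_mean(x_s, pi_s, N)
beta_ratio = np.array([y_ht / x_ht[0]])
y_ratio = regression_estimator(y_s, x_s, pi_s, N, x_bar, beta_ratio)
```

With this choice of coefficient the result equals the Horvitz-Thompson mean of y multiplied by the known mean of x and divided by its Horvitz-Thompson estimate.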
The OPE is obtained from the difference estimator $\tilde{\bar{Y}}_O = \hat{\bar{Y}} + B'(\bar{x} - \hat{\bar{x}})$, where B is a
vector of constants. The latter is an unbiased estimator of $\bar{Y}$. Assembling the values
of y and x into an N-vector Y and an $N \times q$ matrix X having $x_i'$ on its i-th row, the
sampling variance of $\tilde{\bar{Y}}_O$ is minimised by taking $B = (X'WX)^{-1}X'WY$ (Montanari, 1987),
where W is the $N \times N$ matrix whose ij-th entry is $(\pi_{ij} - \pi_i \pi_j)/(N^2 \pi_i \pi_j)$, and $\pi_{ij}$ is the second
order inclusion probability of units i and j (i, j = 1, …, N; $\pi_{ii} = \pi_i$). When $X'WX$ is
singular, i.e. its rank is smaller than q, one or more entries of x, hence of $\bar{x}$,
have to be dropped so as to obtain a non-singular variance matrix $X'WX$ whose order equals that rank.
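The following Python sketch illustrates the construction of W and of the optimal B. It assumes simple random sampling without replacement (SRSWOR), which is only an illustrative choice of design, and the function names and the simulated population are hypothetical. The exact design variance is obtained by enumerating all samples, so the quadratic form in W can be compared with it directly.

```python
import itertools
import numpy as np

def design_W(pi2, N):
    """W with ij-th entry (pi_ij - pi_i pi_j) / (N^2 pi_i pi_j); pi2[i, i] holds pi_i."""
    pi = np.diag(pi2)
    outer = np.outer(pi, pi)
    return (pi2 - outer) / (N**2 * outer)

def srswor_pi2(N, n):
    """Joint inclusion probabilities under simple random sampling without replacement."""
    pi2 = np.full((N, N), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi2, n / N)
    return pi2

def optimal_B(X, Y, W):
    """B = (X'WX)^{-1} X'WY, assuming X'WX is non-singular."""
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Small simulated population (hypothetical values).
rng = np.random.default_rng(0)
N, n = 6, 3
pi = n / N
X = rng.normal(size=(N, 2))
Y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=N)
W = design_W(srswor_pi2(N, n), N)
B = optimal_B(X, Y, W)

def exact_variance(B_vec):
    """Design variance of the difference estimator, by enumerating all SRSWOR samples."""
    e = Y - X @ B_vec                      # the estimator varies only through the HT mean of e
    est = [e[list(s)].sum() / (N * pi) for s in itertools.combinations(range(N), n)]
    return np.var(est)                     # all samples are equally likely

var_opt = exact_variance(B)
var_ht = exact_variance(np.zeros(2))       # B = 0 gives the plain Horvitz-Thompson estimator
```

In this sketch the enumerated variance at B agrees with the quadratic form (Y - XB)'W(Y - XB) and cannot exceed the variance obtained with B = 0.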
The optimum value of B can be estimated in many ways. For our purposes, if we take
Horvitz-Thompson estimators of the variances and covariances of unbiased estimators, then,
under mild conditions on the sampling design, a consistent estimator of B is given by
$\hat{B} = (X_s' W_{ss} X_s)^{-1} X_s' W_{ss} Y_s$, where
$W_{ss} = \{(\pi_{ij} - \pi_i \pi_j)/(N^2 \pi_i \pi_j \pi_{ij})\}_{i,j \in s}$, $X_s = \{x_i'\}_{i \in s}$ and $Y_s = \{Y_i\}_{i \in s}$.
Then, replacing B by $\hat{B}$ in $\tilde{\bar{Y}}_O$, we get $\hat{\bar{Y}}_O = \hat{\bar{Y}} + \hat{B}'(\bar{x} - \hat{\bar{x}})$, i.e. the OPE. In large samples
this estimator shares the properties of $\tilde{\bar{Y}}_O$, the latter being the first order Taylor linear
approximation of the former. So, asymptotically, i.e. up to terms of the first order, $\hat{\bar{Y}}_O$
is design unbiased and has the minimum variance among all regression estimators
based on the same auxiliary information $\bar{x}$. The main drawback of $\hat{\bar{Y}}_O$ is that it is
complex to compute and may be unstable in finite size samples (Casady and Valliant,
1993). However, if an adequate number of degrees of freedom is available for
estimating B, the problem can be overcome.
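A minimal Python sketch of the OPE computed from sample data is given below. The function name, the simulated sample and the assumed population means are hypothetical. The design is again SRSWOR, chosen only because the result can be checked by hand: under that design the sample matrix of weights is proportional to the centering matrix, so the OPE should coincide with the classical regression estimator built on least-squares slopes from a fit with intercept.

```python
import numpy as np

def ope(y_s, X_s, pi2_ss, N, x_bar):
    """Y_hat + B_hat'(x_bar - x_hat) with B_hat = (Xs' Wss Xs)^{-1} Xs' Wss Ys."""
    pi_s = np.diag(pi2_ss)                       # first order probabilities of sampled units
    outer = np.outer(pi_s, pi_s)
    W_ss = (pi2_ss - outer) / (N**2 * outer * pi2_ss)
    B_hat = np.linalg.solve(X_s.T @ W_ss @ X_s, X_s.T @ W_ss @ y_s)
    y_ht = np.sum(y_s / pi_s) / N
    x_ht = (X_s / pi_s[:, None]).sum(axis=0) / N
    return y_ht + B_hat @ (x_bar - x_ht), B_hat

# Hypothetical SRSWOR sample of n = 5 units from N = 20.
rng = np.random.default_rng(1)
N, n = 20, 5
X_s = rng.normal(size=(n, 2))
y_s = 3.0 + X_s @ np.array([1.5, -0.5]) + rng.normal(size=n)
x_bar = np.array([0.1, -0.2])                    # assumed known population means
pi2_ss = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2_ss, n / N)

y_ope, B_hat = ope(y_s, X_s, pi2_ss, N, x_bar)

# Classical regression estimator: OLS slopes from a fit with intercept.
A = np.column_stack([np.ones(n), X_s])
b_ols = np.linalg.lstsq(A, y_s, rcond=None)[0][1:]
y_classic = y_s.mean() + b_ols @ (x_bar - X_s.mean(axis=0))
```

For designs with unequal or zero second order inclusion probabilities the same function applies only if every sampled pair has a positive joint inclusion probability, since the weights divide by it.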
A GRE is based on a "working superpopulation linear regression model" relating
the survey variable to the auxiliary variables whose population means are known.
Consider the model $E_m(Y) = X\beta$ and $V_m(Y) = \sigma^2 \Sigma$, where $\Sigma = \mathrm{diag}\{v_i\}_{i=1,\ldots,N}$ is a
known matrix. Here $E_m$, $V_m$ and $C_m$ denote the expected value, variance and
covariance with respect to the model. Let $\hat{\beta}_N = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$ be the census weighted
least-squares regression estimator of $\beta$. Then, replacing $\hat{\beta}_N$ by the consistent estimator
$\hat{\beta}_n = (X_s' \Sigma_{ss}^{-1} X_s)^{-1} X_s' \Sigma_{ss}^{-1} Y_s$, where $\Sigma_{ss}^{-1} = \mathrm{diag}\{1/(v_i \pi_i)\}_{i \in s}$, the corresponding GRE is defined
to be $\hat{\bar{Y}}_G = \hat{\bar{Y}} + \hat{\beta}_n'(\bar{x} - \hat{\bar{x}})$. The large sample properties of $\hat{\bar{Y}}_G$ can be established by means
of its first order Taylor linear approximation $\tilde{\bar{Y}}_G = \hat{\bar{Y}} + \hat{\beta}_N'(\bar{x} - \hat{\bar{x}})$ (Särndal, Swensson
and Wretman, 1992; p. 235). In particular, $\hat{\bar{Y}}_G$ is asymptotically design unbiased, and
when the working model holds true it has the minimum expected asymptotic design
variance with respect to the model, i.e. for any other design unbiased or approximately
design unbiased estimator $\hat{\bar{Y}}^*$ of $\bar{Y}$, $E_m V(\tilde{\bar{Y}}_G) \le E_m V(\hat{\bar{Y}}^*)$, for all $\beta$. On the contrary, if
the model is wrongly specified, the asymptotic variance of $\hat{\bar{Y}}_G$ may be
appreciably higher than that of $\hat{\bar{Y}}_O$ based on the same auxiliary information $\bar{x}$.
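The GRE with a diagonal working variance matrix can be sketched in Python as follows. The function name and the simulated data are hypothetical. The sketch also returns the estimator in its equivalent weighted form, a standard algebraic rewriting that is not part of the text above: the same estimate is a weighted sum of the sampled y values, and the weights reproduce the known auxiliary means exactly.

```python
import numpy as np

def gre(y_s, X_s, pi_s, v_s, N, x_bar):
    """GRE with working variances v_i: Y_hat + beta_n'(x_bar - x_hat)."""
    d = 1.0 / (v_s * pi_s)                       # diagonal of Sigma_ss^{-1}
    XtD = X_s.T * d                              # X_s' Sigma_ss^{-1}
    G = XtD @ X_s                                # X_s' Sigma_ss^{-1} X_s
    beta_n = np.linalg.solve(G, XtD @ y_s)
    y_ht = np.sum(y_s / pi_s) / N
    x_ht = (X_s / pi_s[:, None]).sum(axis=0) / N
    estimate = y_ht + beta_n @ (x_bar - x_ht)
    # Equivalent weighted form: estimate == weights @ y_s
    weights = 1.0 / (N * pi_s) + np.linalg.solve(G, x_bar - x_ht) @ XtD
    return estimate, beta_n, weights

# Hypothetical unequal-probability sample of n = 6 units from N = 30.
rng = np.random.default_rng(3)
N, n = 30, 6
pi_s = rng.uniform(0.1, 0.4, size=n)
x1 = rng.uniform(1.0, 5.0, size=n)
X_s = np.column_stack([np.ones(n), x1])          # constant plus one auxiliary variable
y_s = 2.0 + 0.8 * x1 + rng.normal(scale=0.3, size=n)
x_bar = np.array([1.0, 3.0])                     # assumed known population means

estimate, beta_n, weights = gre(y_s, X_s, pi_s, v_s=x1, N=N, x_bar=x_bar)
```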
3. An extended class of GRE's
To explore the connections between GRE's and the OPE, let us enlarge the GRE class to
allow for a non-diagonal model variance matrix. Consider the model $E_m(Y) = X\beta$ and
$V_m(Y) = \sigma^2 \Sigma$, where $\Sigma$ is now a positive definite symmetric $N \times N$ matrix. Let $\Sigma_{ss}^{-1}$ be the
symmetric matrix that has $v^{ij}/\pi_{ij}$ as ij-th entry, where $i, j \in s$ and $v^{ij}$ is the ij-th
entry of $\Sigma^{-1}$. Then, provided that the matrix $(X_s' \Sigma_{ss}^{-1} X_s)^{-1}$ exists for every sample s, the
corresponding Extended GRE (EGRE) can be written as $\hat{\bar{Y}}_{EG} = \hat{\bar{Y}} + \hat{\beta}_n'(\bar{x} - \hat{\bar{x}})$, where
$\hat{\beta}_n = (X_s' \Sigma_{ss}^{-1} X_s)^{-1} X_s' \Sigma_{ss}^{-1} Y_s$. Observe that the entries of $X_s' \Sigma_{ss}^{-1} X_s$ and $X_s' \Sigma_{ss}^{-1} Y_s$ are design
unbiased estimators of the corresponding entries of $X'\Sigma^{-1}X$ and $X'\Sigma^{-1}Y$, respectively,
provided that $v^{ij} \neq 0$ implies $\pi_{ij} > 0$. So, under mild conditions on the second order
inclusion probabilities, $\hat{\beta}_n$ converges in probability to $\hat{\beta}_N = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$, i.e. the
census weighted least-squares regression estimator of $\beta$, and the first order Taylor
approximation of $\hat{\bar{Y}}_{EG}$ is $\tilde{\bar{Y}}_{EG} = \hat{\bar{Y}} + \hat{\beta}_N'(\bar{x} - \hat{\bar{x}})$. Obviously, when $\Sigma$ is a diagonal matrix,
the EGRE reduces to a GRE. Note that when W is non-singular, the OPE belongs to the
EGRE class defined above, setting $\Sigma = W^{-1}$. However, W is generally only non-negative
definite, being singular for many sampling designs. But even in such a case there are
EGRE's that are asymptotically equivalent to the OPE, as we show next.
Let us call Design Balanced Variable (DBV) any non-null auxiliary variable z
whose mean is estimated without error by the Horvitz-Thompson estimator
$\hat{\bar{Z}} = \sum_{i \in s} Z_i / (N \pi_i)$, i.e. $\hat{\bar{Z}} - \bar{Z} = 0$ for all s. For this reason, the DBV's are normally not
inserted into a regression estimator. But, assembling the population values $Z_i$ into the
N-vector Z, the following theorems can be proved.
Theorem 1. A variable z is a DBV if and only if Z belongs to the subspace orthogonal
to that spanned by the columns of W.

It follows that the subspace spanned by the DBV's has dimension $t = N - r(W)$, where $r(\cdot)$
denotes the rank of a matrix. So, there are at most t linearly independent DBV's. Now,
let Z be an $N \times t$ matrix containing the population values of t linearly independent
DBV's and assume that X does not contain any DBV.
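Theorem 1 and the dimension count can be checked numerically on a small example. The Python sketch below assumes stratified simple random sampling without replacement, an illustrative design in which the stratum indicators are DBV's, because each indicator's Horvitz-Thompson mean always equals the stratum's population share. The function names, stratum sizes and sample sizes are hypothetical.

```python
import numpy as np

def stratified_srswor_pi2(strata, n_h):
    """Joint inclusion probabilities under stratified SRSWOR (pi2[i, i] = pi_i)."""
    strata = np.asarray(strata)
    pi = np.array([n_h[int(h)] / np.sum(strata == h) for h in strata])
    pi2 = np.outer(pi, pi)                       # independent selection across strata
    for h, n in n_h.items():
        idx = np.flatnonzero(strata == h)
        N_h = idx.size
        pi2[np.ix_(idx, idx)] = n * (n - 1) / (N_h * (N_h - 1))
    np.fill_diagonal(pi2, pi)
    return pi2

def design_W(pi2, N):
    """W with ij-th entry (pi_ij - pi_i pi_j) / (N^2 pi_i pi_j)."""
    pi = np.diag(pi2)
    outer = np.outer(pi, pi)
    return (pi2 - outer) / (N**2 * outer)

strata = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])   # hypothetical: stratum sizes 4 and 5
n_h = {0: 2, 1: 2}                               # two units sampled per stratum
N = strata.size
W = design_W(stratified_srswor_pi2(strata, n_h), N)

# One indicator column per stratum; each column is a DBV under this design.
Z = (strata[:, None] == np.array(list(n_h))[None, :]).astype(float)

max_abs_WZ = np.abs(W @ Z).max()                 # zero up to rounding: Z is orthogonal to W
t = N - np.linalg.matrix_rank(W)                 # number of linearly independent DBV's
```

In this example W annihilates both indicator columns and t equals the number of strata.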
Theorem 2. Assume the model $E_m(Y) = (Z\ X)\beta$ and $V_m(Y) = \sigma^2 \Sigma$. If $\Sigma$ is chosen so
that there exists a scalar $\lambda$ such that $\Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1} = \lambda W$, then $\tilde{\bar{Y}}_{EG} = \tilde{\bar{Y}}_O$, i.e. the
EGRE is asymptotically equal to the OPE based on the same auxiliary information $\bar{x}$.

Theorem 2 provides a sufficient condition for the asymptotic equality between an
EGRE and the OPE that uses the same auxiliary variable x: besides the auxiliary
variables that are not DBV's, the working model should include a number t of
linearly independent DBV's, and the variance matrix $\sigma^2 \Sigma$ should be set so that
$\Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1}$ is proportional to W. In this way, it is possible to get an
asymptotic minimum variance estimator, irrespective of the goodness of the working model,
given the amount of auxiliary information $\bar{x}$. Technically, the information provided by
Z and $\Sigma$ is summarised by W, hence by the sampling design. The next theorem assures
the finite size sample identity between an EGRE and the OPE.
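Theorem 3. If $\Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1} \propto W$ implies $\Sigma_{ss}^{-1} - \Sigma_{ss}^{-1}Z_s(Z_s'\Sigma_{ss}^{-1}Z_s)^{-1}Z_s'\Sigma_{ss}^{-1} \propto W_{ss}$,
then $\hat{\bar{Y}}_{EG} = \hat{\bar{Y}}_O$.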
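The conditions of Theorems 2 and 3 can be verified numerically in the simplest case. The Python sketch below assumes SRSWOR with an identity working variance matrix and the constant as the only DBV. These are illustrative choices, the names are hypothetical, and the proportionality constants `lam` and `lam_s` were worked out by hand for this design. The sample-level matrix is read as the matrix with entries $v^{ij}/\pi_{ij}$ defined in Section 3.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 8, 3
X = rng.normal(size=(N, 2))                      # auxiliary variables, none of them a DBV
Y = rng.normal(size=N)

# W under SRSWOR.
pi2 = np.full((N, N), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2, n / N)
pi = np.diag(pi2)
outer = np.outer(pi, pi)
W = (pi2 - outer) / (N**2 * outer)

def projector_complement(S_inv, Z):
    """S_inv - S_inv Z (Z' S_inv Z)^{-1} Z' S_inv for a symmetric S_inv."""
    SZ = S_inv @ Z
    return S_inv - SZ @ np.linalg.solve(Z.T @ SZ, SZ.T)

# Population level (Theorem 2): Sigma = I and Z = 1, the only DBV since N - r(W) = 1.
Z = np.ones((N, 1))
M = projector_complement(np.eye(N), Z)
lam = n * N * (N - 1) / (N - n)                  # hand-derived scalar with M = lam * W

# Census least-squares fit of Y on (Z X) versus the optimal coefficient vector B.
beta_N = np.linalg.lstsq(np.column_stack([Z, X]), Y, rcond=None)[0]
B = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Sample level (Theorem 3) for one arbitrary sample s.
s = np.array([0, 3, 5])
W_ss = W[np.ix_(s, s)] / pi2[np.ix_(s, s)]
S_ss_inv = np.diag(1.0 / pi[s])                  # entries v^{ij} / pi_ij with Sigma = I
M_s = projector_complement(S_ss_inv, Z[s])
lam_s = N**2 * (n - 1) / (N - n)                 # hand-derived scalar with M_s = lam_s * W_ss
```

Here the coefficients of X in the census fit coincide with B, which is the familiar fact that under SRSWOR the optimal estimator is the regression estimator from a model with an intercept.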

4. Conclusion
Generally speaking, the OPE is approximately equal to an EGRE based on a working
model that includes the effect of any existing DBV's. Unfortunately, Theorem 2 does not
provide guidelines for determining the structure of the matrix $\Sigma$ and the DBV's that
correspond to the sampling design in use. However, for common designs, easy solutions
can be found. So, given an amount of auxiliary information $\bar{x}$, the OPE is
approximately or exactly equal to an EGRE based on a working model which includes
the maximum number of linearly independent DBV's and assumes a variance matrix
that reflects the structure of the first and second order inclusion probabilities. Hence the
OPE allows a better fit of the data, and this explains its asymptotic superiority.
However, in finite size samples, besides incidental collinearities among regressors, the
OPE is exposed to instabilities due to a likely inadequate number of residual degrees of
freedom available for fitting the model. In particular, this concern may be relevant in
stratified designs with a few observations per stratum, or multistage sampling designs
with a few PSU's per stratum.
The above analysis suggests the following quasi-optimal estimation strategy. When
a reliable model is lacking, use a GRE based on a working model with a variance
matrix set according to Theorem 2 and with a suitably reduced number of DBV's, so as to
achieve a sufficient number of degrees of freedom to fit the model. In particular, in the
case of stratified samples, this may be accomplished by introducing into the model DBV's
corresponding to superstrata obtained by collapsing the original strata. Collapsing should be
performed so that, within each superstratum, strata effects can be considered negligible.
For each superstratum, a DBV is obtained by adding the DBV's of the collapsed strata. By
reducing the number of DBV's inserted into the model, we accept a smaller level of
asymptotic efficiency in order to better control the finite size sampling variance of the estimator.
This strategy has been implemented in a simulation study, and the resulting estimator
was on average the most efficient across populations generated under a variety of
superpopulation models. Results are available from the author.
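The collapsing step can be sketched in Python as follows. The sketch assumes stratified simple random sampling without replacement, where the stratum indicators are DBV's. The function name, the design and the grouping of strata into superstrata are hypothetical choices made for illustration, not the configuration used in the simulation study.

```python
import numpy as np

def collapse_dbvs(Z, groups):
    """Add the stratum DBV columns of Z within each superstratum (lists of column indices)."""
    return np.column_stack([Z[:, cols].sum(axis=1) for cols in groups])

# Hypothetical design: 4 strata of 3 units each, 2 units sampled per stratum.
strata = np.repeat(np.arange(4), 3)
N = strata.size
pi = np.full(N, 2 / 3)
Z = (strata[:, None] == np.arange(4)[None, :]).astype(float)   # one DBV per stratum

# Collapse strata {0, 1} and {2, 3} into two superstrata.
Z_super = collapse_dbvs(Z, [[0, 1], [2, 3]])

# A stratified sample estimates the means of the collapsed DBV's without error.
sample = np.array([0, 2, 3, 4, 7, 8, 9, 11])     # two units from each stratum
ht_means = (Z_super[sample] / pi[sample][:, None]).sum(axis=0) / N
```

Using two collapsed columns instead of four stratum columns leaves two more residual degrees of freedom for fitting the working model.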
References
Casady R.J., Valliant R. (1993) Conditional properties of post-stratified estimators
under normal theory, Survey Methodology, 19, 183-192.
Montanari G.E. (1987) Post-sampling efficient QR-prediction in large-scale surveys,
International Statistical Review, 55, 191-202.
Montanari G.E. (1998) On regression estimation of finite population mean, Survey
Methodology, 24, 69-77.
Särndal C.E., Swensson B., Wretman J.H. (1992) Model assisted survey sampling,
Springer-Verlag, New York.