Aust. N. Z. J. Stat. 42(4), 2000, 407–421
CONDITIONING ON AUXILIARY VARIABLE MEANS IN
FINITE POPULATION INFERENCE
GIORGIO E. MONTANARI1
University of Perugia
Summary
This paper reviews conditional properties of the mean and total estimators of a finite population when auxiliary information is available. An exact design-based conditional analysis
for complex sampling designs is intractable, but an asymptotic conditional framework can
be developed. Within such a framework the paper establishes sufficient conditions for
conditional unbiasedness and explores conditional properties of various types of regression
estimators. A sample statistic capable of indicating the presence of substantial conditional
biases is proposed, and illustrated by a simulation study.
Key words: optimal estimator; regression estimation; survey sampling.
1. Introduction
In sampling theory, the conditional properties of estimators have been brought to the
attention of researchers, especially because of work on the predictive approach to finite population inference, championed for example by Royall & Cumberland (1981, 1985). Holt &
Smith (1979) first considered the issue of conditional inference. Rao (1985) attempted to
perform an exact design-based conditional inference, concluding that for complex sampling
designs the problem is intractable. Robinson (1987) proposed an asymptotic conditional approach to inference for the ratio estimator in simple random sampling. Casady & Valliant
(1993) applied the Robinson approach to post-stratified estimators. Tillé (1995) proposed a
class of conditionally weighted Horvitz–Thompson estimators, which contains known estimators as well as complex estimators that are not really operational. In this paper we develop an
asymptotic theory that allows us to establish sufficient conditions for conditional unbiasedness
within a large class of estimators of the population mean. Then, we propose a sample statistic
capable of revealing when an estimator is substantially conditionally biased, and we discuss
estimation strategies that yield conditionally valid inferences. A simulation study provides
empirical evidence of the quoted theory.
2. The problem
Consider a finite population U consisting of N units labelled i = 1, 2, . . . , N. Let
$Y_i$ be the value of a survey variable Y in the ith unit. Suppose that the population mean
$\eta = \sum_{i=1}^{N} Y_i / N$ has to be estimated by means of a sample s, drawn from U according
Received September 1998; revised October 1999; accepted November 1999.
1 Dept of Statistical Sciences, University of Perugia, V.A. Pascoli, 06100 Perugia, Italy.
e-mail: [email protected]
Acknowledgments. This work was supported by a research grant from the University of Perugia, Italy. The
author thanks the referees for their careful reading and constructive comments.
© Australian Statistical Publishing Association Inc. 2000. Published by Blackwell Publishers Ltd,
108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden MA 02148, USA
to a probabilistic sampling design. For this purpose, assume that the population mean
$\xi = \sum_{i=1}^{N} x_i / N$ of a q-dimensional auxiliary variable column vector $x_i = (X_{1i}, X_{2i}, \ldots, X_{qi})^T$ is
available, for example from administrative registers or census data. In this paper, we suppose
that xi is known for units in the sample but not in the population, and that, for consistency
with external sources, the estimator of η must take the known values contained in ξ when
applied to the auxiliary variables. So, in our perspective, the vector ξ and the sampling design
are taken as given. We do not discuss the choice of subsets of elements of ξ for estimation
purposes, nor the design of the sampling scheme.
Let η̂ and ξ̂ denote estimators of η and ξ , respectively. In the traditional design-based
approach to inference, their statistical properties are obtained by averaging over the set of all
samples s that can be selected. Only recently have attempts been made to take into account the
sampling distribution of η̂ given the realized value of ξ̂ . However, exact results concerning
the conditional distribution of η̂, given ξ̂ , are quite difficult to achieve (see Tillé, 1995), but
useful results can be obtained using the joint asymptotic distribution of η̂ and ξ̂ .
3. A sufficient condition for conditional unbiasedness
Most estimators of descriptive population parameters that are design-unbiased, or approximately design-unbiased, approach normality under wide conditions. Scott & Wu (1981)
have established well-known results about the asymptotic normality of expansion, ratio and
regression estimators for simple random sampling. Such results can be extended to stratified
random sampling, when sample sizes within strata approach infinity. Isaki & Fuller (1982),
Francisco & Fuller (1991) and others (see Sen, 1988) have proved that Horvitz–Thompson estimators are asymptotically normal in certain unequal-probability sampling designs. Krewski
& Rao (1981) and Rao & Wu (1985) proved asymptotic normality of the estimators of means
and totals when the number of strata approaches infinity in multistage sampling designs with
two primary units per stratum.
Motivated by those results, and assuming mild conditions on the sampling design and on
the population moments of Y, we suppose that the joint probability sampling distribution of
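the vector $[\hat{\eta}\ \hat{\xi}^T]^T$ is normally distributed with mean vector $[\eta\ \xi^T]^T$ and variance matrix
$$V\begin{pmatrix} \hat{\eta} \\ \hat{\xi} \end{pmatrix} = \begin{pmatrix} \mathrm{var}(\hat{\eta}) & c(\hat{\eta}, \hat{\xi})^T \\ c(\hat{\eta}, \hat{\xi}) & V(\hat{\xi}) \end{pmatrix},$$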
where var(η̂) is the variance of η̂, V (ξ̂ ) is the variance matrix of ξ̂ , and c(η̂, ξ̂ ) is the
covariance vector between η̂ and ξ̂ . Assuming V (ξ̂ ) non-singular (in particular, ξ̂ must
not contain any element with a null sampling variance) and applying multivariate normal
distribution standard results (Jobson, 1992, Section 7.2.2) it follows that the distribution of $\hat{\eta}$
given $\hat{\xi}$ is still normal with expected value $E(\hat{\eta} \mid \hat{\xi}) = \eta + B^T(\hat{\xi} - \xi)$ and variance
$\mathrm{var}(\hat{\eta} \mid \hat{\xi}) = \mathrm{var}(\hat{\eta}) - B^T V(\hat{\xi}) B$, where $B = V(\hat{\xi})^{-1} c(\hat{\eta}, \hat{\xi})$. Hence, estimator $\hat{\eta}$ is conditionally biased,
unless $c(\hat{\eta}, \hat{\xi}) = 0$, or $\hat{\xi} = \xi$. In the latter case we say that the sample is balanced with respect
to the auxiliary variables. The conditional mean squared error (MSE) of $\hat{\eta}$ can be written
$$\mathrm{MSE}(\hat{\eta} \mid \hat{\xi}) = \mathrm{var}(\hat{\eta}) + B^T \bigl[ (\hat{\xi} - \xi)(\hat{\xi} - \xi)^T - V(\hat{\xi}) \bigr] B,$$
and the unconditional variance of $\hat{\eta}$ can be decomposed as follows
$$\mathrm{var}(\hat{\eta}) = B^T V(\hat{\xi}) B + E\bigl[\mathrm{var}(\hat{\eta} \mid \hat{\xi})\bigr],$$
where the first term on the right-hand side, B T V (ξ̂ )B, is the contribution to the unconditional
variance of η̂ due to its conditional bias.
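The conditional moments above can be sketched numerically for the special case of a single auxiliary variable (q = 1), where B, V and c are scalars. The function name `conditional_moments` and the input values are hypothetical, chosen only for illustration; the sketch assumes the joint normal approximation of this section.

```python
def conditional_moments(eta, xi, var_eta, cov, var_xi, xi_hat):
    """Asymptotic conditional moments of eta_hat given a scalar xi_hat.

    Assumes (eta_hat, xi_hat) is approximately bivariate normal with the
    given unconditional variances and covariance.
    """
    if var_xi <= 0:
        raise ValueError("var_xi must be positive (V(xi_hat) non-singular)")
    b = cov / var_xi                      # B = V(xi_hat)^{-1} c(eta_hat, xi_hat)
    dev = xi_hat - xi                     # imbalance of the realized sample
    cond_mean = eta + b * dev             # E(eta_hat | xi_hat)
    cond_var = var_eta - b * b * var_xi   # var(eta_hat | xi_hat)
    cond_mse = var_eta + b * b * (dev * dev - var_xi)
    return {"B": b, "mean": cond_mean, "var": cond_var, "mse": cond_mse}


# Illustrative (made-up) values: a sample whose xi_hat overshoots xi by one unit.
moments = conditional_moments(eta=10.0, xi=5.0, var_eta=4.0,
                              cov=1.5, var_xi=1.0, xi_hat=6.0)
print(moments)
```

In a balanced sample (`xi_hat == xi`) the conditional mean returned by the sketch equals `eta`, and the conditional MSE always equals the conditional variance plus the squared conditional bias.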
Since ξ is known, a conditionally unbiased estimator can be obtained from η̂ by removing
its conditional bias. In fact, if B were known, the estimator
$$\tilde{\eta}_O = \hat{\eta} + B^T(\xi - \hat{\xi})$$
would be normally distributed as well, and conditionally and unconditionally unbiased, and
would have an unconditional null covariance vector with ξ̂ . Furthermore, because of normality, η̃O would be the minimum variance estimator of η among all unbiased estimators that
use the same amount of auxiliary information, with a conditional and unconditional variance
given by var(η̃O ) = var(η̂) − B T V (ξ̂ )B. So conditional unbiasedness is a requirement for a
minimum variance estimator of η. We call η̃O the optimal estimator (OPE).
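The null covariance and the variance of the OPE can be checked in two lines from the definition $B = V(\hat{\xi})^{-1} c(\hat{\eta}, \hat{\xi})$. The derivation below is added for illustration and uses only quantities defined in this section.

```latex
\begin{aligned}
c(\tilde{\eta}_O, \hat{\xi})
  &= c(\hat{\eta}, \hat{\xi}) - V(\hat{\xi}) B
   = c(\hat{\eta}, \hat{\xi}) - V(\hat{\xi}) V(\hat{\xi})^{-1} c(\hat{\eta}, \hat{\xi}) = 0, \\
\mathrm{var}(\tilde{\eta}_O)
  &= \mathrm{var}(\hat{\eta}) - 2 B^T c(\hat{\eta}, \hat{\xi}) + B^T V(\hat{\xi}) B
   = \mathrm{var}(\hat{\eta}) - B^T V(\hat{\xi}) B ,
\end{aligned}
```

where the last equality uses $c(\hat{\eta}, \hat{\xi}) = V(\hat{\xi}) B$, so that $B^T c(\hat{\eta}, \hat{\xi}) = B^T V(\hat{\xi}) B$.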
In practice, whenever the distribution of the mean or total estimator of a survey variable
is approximately normal, the above results can be applied. As a consequence, the efficient use
of any known population mean requires that an unbiased or approximately unbiased estimator
of η be chosen. It has to be uncorrelated with the estimators of the available auxiliary variable
population means. Note that no superpopulation model has been used. The theory relies on
the approximate joint normal distribution of estimators. Instead of modelling relationships
between population variables, as is done when superpopulations are considered, we model
relationships between estimators.
In the sequel, we assume that nonlinear estimators of population means converge in
probability to their first order Taylor linear approximations when the sample size and the population size approach infinity. Furthermore, to apply results of this section, we also assume
that those linear approximations approach normality. We use the term ‘asymptotic’ to describe
any property that depends upon this convergence.
4. The empirical optimal estimator
The estimator $\tilde{\eta}_O$ cannot be computed when B is unknown, as it usually is. However,
in most cases, B can be estimated by $\hat{B} = \hat{V}(\hat{\xi})^{-1} \hat{c}(\hat{\eta}, \hat{\xi})$, where $\hat{V}(\hat{\xi})$ and $\hat{c}(\hat{\eta}, \hat{\xi})$ are
design-consistent estimators of $V(\hat{\xi})$ and $c(\hat{\eta}, \hat{\xi})$, respectively. For example, we might use
the Taylor linearization method (Wolter, 1985, Chapter 6) to obtain estimators of variances
and covariances contained in $V(\hat{\xi})$ and $c(\hat{\eta}, \hat{\xi})$. Then, replacing B by $\hat{B}$ in $\tilde{\eta}_O$, we get the
estimator
$$\hat{\eta}_O = \hat{\eta} + \hat{B}^T(\xi - \hat{\xi}).$$
In large samples, η̂O shares the properties of η̃O , because η̃O is the first order Taylor linear approximation of η̂O (Montanari, 1987). We call η̂O the empirical optimal estimator
(EOE). The main properties of η̂O are: (i) it is asymptotically design-unbiased with asymptotic variance var(η̃O ); (ii) it is asymptotically conditionally design-unbiased; (iii) the means
of the auxiliary variables estimated through this estimator equal the corresponding population
means, i.e. ξ̂O = ξ , for all s.
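As a minimal sketch of the EOE, consider the simplest setting: simple random sampling with one auxiliary variable. There the estimated variance of the sample mean of X and the estimated covariance share the factor (1 − f)/n, so $\hat{B}$ reduces to the sample regression coefficient. The function name `eoe_srs` and the data are hypothetical, and the restriction to simple random sampling is an assumption made only to keep the example short.

```python
from statistics import mean


def eoe_srs(y, x, xi):
    """Empirical optimal estimator of the mean of Y under simple random
    sampling, using one auxiliary variable with known population mean xi."""
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("y and x must have the same length, at least 2")
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(x, y)) / (n - 1)
    s_xx = sum((xv - x_bar) ** 2 for xv in x) / (n - 1)
    if s_xx == 0:
        raise ValueError("x must vary in the sample")
    # The common factor (1 - f)/n cancels between V_hat(x_bar) and c_hat.
    b_hat = s_xy / s_xx
    return y_bar + b_hat * (xi - x_bar)


x_sample = [1.0, 2.0, 4.0, 7.0]
y_sample = [5.0, 8.0, 14.0, 23.0]   # exactly 2 + 3x, for illustration only
print(eoe_srs(y_sample, x_sample, xi=3.0))   # estimate of the mean of Y
print(eoe_srs(x_sample, x_sample, xi=3.0))   # property (iii): returns xi
```

Applying the function to the auxiliary variable itself reproduces the known mean, which is property (iii) in this special case.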
Properties of the OPE and EOE have been studied when $\hat{\eta}$ and $\hat{\xi}$ are Horvitz–Thompson
estimators, i.e. $\hat{\eta}_{HT} = \sum_{i \in s} Y_i/(N\pi_i)$ and $\hat{\xi}_{HT} = \sum_{i \in s} x_i/(N\pi_i)$, where $\pi_i$ is the first
order inclusion probability of the sampling design. Montanari (1987) obtained the EOE from
the unbiased difference estimator $\hat{\eta}_D = \hat{\eta}_{HT} + \vartheta^T(\xi - \hat{\xi}_{HT})$, minimizing its variance with
respect to ϑ. Rao (1994) pointed out that the EOE gives valid conditional inference. Montanari
(1998) noted that when there is more than one survey variable, the EOE can be expressed as
a weighted sum of observations with the same weights applying to all variables of interest.
5. The generalized regression estimator
Most estimators of a finite population mean or total belong to the class of the generalized
regression estimator (GRE). Such a class is described in Särndal, Swensson & Wretman (1992
Chapter 6) and in Estevao, Hidiroglou & Särndal (1995). A GRE is defined as
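$$\hat{\eta}_G = \hat{\eta}_{HT} + \hat{\beta}^T(\xi - \hat{\xi}_{HT}), \quad \text{where} \quad \hat{\beta} = \Bigl( \sum_{i \in s} \frac{x_i x_i^T}{\nu_i \pi_i} \Bigr)^{-1} \sum_{i \in s} \frac{x_i Y_i}{\nu_i \pi_i}.$$
The estimator $\hat{\beta}$ is a consistent estimator of
$$\hat{\beta}_N = \Bigl( \sum_{i=1}^{N} \frac{x_i x_i^T}{\nu_i} \Bigr)^{-1} \sum_{i=1}^{N} \frac{x_i Y_i}{\nu_i},$$
i.e. the census-weighted least squares regression estimator of β assuming the superpopulation
linear regression model $E_m(Y_i) = x_i^T \beta$, $\mathrm{var}_m(Y_i) = \nu_i \sigma^2$ and $\mathrm{cov}_m(Y_i, Y_j) = 0$, for all
$i \neq j = 1, \ldots, N$. Note that $E_m$, $\mathrm{var}_m$ and $\mathrm{cov}_m$ denote expected value, variance, and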
covariance under the model, respectively. In the sequel, we call the model upon which the
GRE is based the ‘working model’.
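The GRE definition can be sketched for a single auxiliary variable, where the matrix inverse becomes a scalar division. The function name `gre_scalar` and the sample values are hypothetical. The example takes $\nu_i = x_i$ (a regression through the origin with variance proportional to X), for which the GRE reduces to the ratio estimator.

```python
def gre_scalar(y, x, pi, nu, xi, N):
    """GRE of the population mean with one auxiliary variable.

    y, x : sample values of Y and X
    pi   : first order inclusion probabilities of the sampled units
    nu   : working-model variance factors nu_i
    xi   : known population mean of X
    N    : population size
    """
    if not (len(y) == len(x) == len(pi) == len(nu)):
        raise ValueError("y, x, pi and nu must have the same length")
    ht_y = sum(yv / p for yv, p in zip(y, pi)) / N   # Horvitz-Thompson mean of Y
    ht_x = sum(xv / p for xv, p in zip(x, pi)) / N   # Horvitz-Thompson mean of X
    num = sum(xv * yv / (v * p) for xv, yv, v, p in zip(x, y, nu, pi))
    den = sum(xv * xv / (v * p) for xv, v, p in zip(x, nu, pi))
    beta_hat = num / den
    return ht_y + beta_hat * (xi - ht_x)


y_s = [3.0, 5.0, 8.0]
x_s = [1.0, 2.0, 4.0]
pi_s = [0.3, 0.3, 0.3]          # n = 3 units out of N = 10, equal probabilities
gre = gre_scalar(y_s, x_s, pi_s, nu=x_s, xi=2.5, N=10)
ratio = 2.5 * (sum(y_s) / 3) / (sum(x_s) / 3)   # classical ratio estimator
print(gre, ratio)
```

With these inputs the two printed values coincide up to rounding, and calling `gre_scalar(x_s, x_s, pi_s, x_s, 2.5, 10)` returns the known mean 2.5, i.e. the estimator reproduces ξ when applied to the auxiliary variable.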
The large sample properties of $\hat{\eta}_G$ can be established by means of its first order Taylor
linear approximation $\tilde{\eta}_G = \hat{\eta}_{HT} + \hat{\beta}_N^T(\xi - \hat{\xi}_{HT})$ (Särndal et al., 1992, p. 235). The main
properties are: (i) it is asymptotically design-unbiased with asymptotic variance var(η̃G );
(ii) the means of the auxiliary variables estimated through the GRE equal the corresponding population means, i.e. ξ̂G = ξ ; (iii) it has the minimum expected asymptotic design
variance with respect to the model, i.e. for any other design unbiased estimator η̂∗ of η,
Em (var(η̃G )) ≤ Em (var(η̂∗ )), for all β (Wright, 1983).
However, η̂G is conditionally biased when it has a non-null covariance vector with ξ̂HT .
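In fact, using results of Section 3, it follows that, asymptotically,
$$\begin{aligned}
E(\hat{\eta}_G \mid \hat{\xi}_{HT}) &= \eta + B^T(\hat{\xi}_{HT} - \xi), \\
\mathrm{var}(\hat{\eta}_G \mid \hat{\xi}_{HT}) &= \mathrm{var}(\hat{\eta}_G) - B^T V(\hat{\xi}_{HT}) B, \\
\mathrm{MSE}(\hat{\eta}_G \mid \hat{\xi}_{HT}) &= \mathrm{var}(\hat{\eta}_G) + B^T \bigl[ (\hat{\xi}_{HT} - \xi)(\hat{\xi}_{HT} - \xi)^T - V(\hat{\xi}_{HT}) \bigr] B, \qquad (1)
\end{aligned}$$
where $B = V(\hat{\xi}_{HT})^{-1} c(\hat{\eta}_G, \hat{\xi}_{HT})$. Then the corresponding EOE, $\hat{\eta}_O = \hat{\eta}_G + \hat{B}^T(\xi - \hat{\xi}_{HT})$,
is asymptotically conditionally unbiased with unconditional variance $\mathrm{var}(\hat{\eta}_O) = \mathrm{var}(\hat{\eta}_G) - B^T V(\hat{\xi}_{HT}) B$. It can be proved (Montanari, 1998) that $B = B_{HT} - \hat{\beta}_N$, where
$B_{HT} = V(\hat{\xi}_{HT})^{-1} c(\hat{\eta}_{HT}, \hat{\xi}_{HT})$. Hence, $\hat{\eta}_O = \hat{\eta}_{HT} + B_{HT}^T(\xi - \hat{\xi}_{HT})$ and, asymptotically,
$$\mathrm{var}(\hat{\eta}_G) - \mathrm{var}(\hat{\eta}_O) = (B_{HT} - \hat{\beta}_N)^T V(\hat{\xi}_{HT}) (B_{HT} - \hat{\beta}_N) \geq 0. \qquad (2)$$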
Montanari (1998) proved that, when the working model is correctly specified, asymptotically
BHT = β̂N . On the other hand, when the model is wrongly specified, the right-hand side of
(2) can take non-negligible values. This event is relatively common, because the specification
of the model is limited by the availability of the population mean of auxiliary variables.
The quantity
$$\lambda(\hat{\eta}_G, \hat{\eta}_O) = \frac{(B_{HT} - \hat{\beta}_N)^T V(\hat{\xi}_{HT}) (B_{HT} - \hat{\beta}_N)}{\mathrm{var}(\hat{\eta}_G)}$$
gives the asymptotic relative gain in efficiency of η̂O over η̂G , i.e. the share of unconditional
variance of η̂G due to its conditional bias. We propose using λ(η̂G , η̂O ) as an index of
superpopulation linear model inadequacy for extracting all information from the realized value
of ξ̂HT . Values of λ(η̂G , η̂O ) significantly greater than zero suggest that shifting from η̂G to
η̂O is profitable or, alternatively, the conditional MSE (1) of η̂G has to be estimated to achieve
valid conditional inferences. Given $\hat{\xi}_{HT}$, the quantity
$$\gamma(\hat{\eta}_G \mid \hat{\xi}_{HT}) = \frac{(B_{HT} - \hat{\beta}_N)^T (\hat{\xi}_{HT} - \xi)}{\sqrt{\mathrm{var}(\hat{\eta}_G)}}$$
gives the ratio between the asymptotic conditional bias and the unconditional standard error.
Note that, unconditionally, E(γ (η̂G | ξ̂HT )) = 0 and var(γ (η̂G | ξ̂HT )) = λ(η̂G , η̂O ).
Both quantities λ(η̂G , η̂O ) and γ (η̂G | ξ̂HT ) are estimable and their estimated values give
useful information for improving the efficiency of the estimation process in subsequent similar
surveys, or even in the current survey as we show in Section 8.
In survey practice, when BHT and β̂N are to be estimated, η̂G is usually preferred to η̂O
because η̂O is more complex to compute and is generally less stable in finite size samples, and
its unconditional variance can be greater than that of η̂G (Casady & Valliant, 1993). However,
if an adequate number of degrees of freedom g is available for estimating BHT , the problem
can be overcome. For example, for a standard complex sampling design with replacement
sampling at the first stage, g can be roughly taken as the number of sample clusters minus the
number of strata (Lehtonen & Pahkinen, 1995 p. 181; for more elaboration on this topic see
Eltinge & Jang, 1996). A stable B̂HT can be expected when g is large enough relative to the
dimension q of the auxiliary variable xi .
6. A new outlook on the empirical optimal estimator
In this section, we explore connections between GREs and EOEs. To this end, we enlarge
the GRE class to non-diagonal variance matrices of the working model. Assemble the values
of $Y_i$ and $x_i$ into an N-vector y and an N × q matrix X, with $x_i^T$ its ith row, and consider
the working model $E_m(y) = X\beta$ and $\mathrm{var}_m(y) = \sigma^2 \Sigma$, where $\Sigma$ can be any positive definite
symmetric N × N matrix. Denote by $y_s$ and $X_s$ the sub-matrices of y and X whose rows
correspond to the units in the sample s. Let $Q_{ss}$ be the symmetric matrix that has $\nu^{ij}/\pi_{ij}$
as ijth entry, where i, j ∈ s, $\nu^{ij}$ is the ijth entry of $\Sigma^{-1}$, and $\pi_{ij}$ is the second order
inclusion probability of the sampling design (with $\pi_{ii} = \pi_i$). Then, provided that the matrix
$(X_s^T Q_{ss} X_s)^{-1}$ exists for all samples s, the corresponding extended GRE (EGRE) is defined to
be
$$\hat{\eta}_{EG} = \hat{\eta}_{HT} + \hat{\beta}^T(\xi - \hat{\xi}_{HT}),$$
where
$$\hat{\beta} = (X_s^T Q_{ss} X_s)^{-1} X_s^T Q_{ss} y_s.$$
Observe that the entries of $X_s^T Q_{ss} X_s$ and $X_s^T Q_{ss} y_s$ are design-unbiased estimators of the
corresponding entries of $X^T \Sigma^{-1} X$ and $X^T \Sigma^{-1} y$, respectively, provided that $\nu^{ij} \neq 0$ implies $\pi_{ij} \neq 0$. So, under mild conditions on the second order inclusion probabilities, $\hat{\beta}$
converges in probability to $\hat{\beta}_N = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$, i.e. the census-weighted least
squares regression estimator of β, and the first order Taylor linear approximation of $\hat{\eta}_{EG}$ is
$\tilde{\eta}_{EG} = \hat{\eta}_{HT} + \hat{\beta}_N^T(\xi - \hat{\xi}_{HT})$. When $\Sigma$ is a diagonal matrix, we get the customary GRE.
Now, denote by W the N × N matrix whose ijth entry is $w_{ij} = (\pi_{ij} - \pi_i \pi_j)/(N^2 \pi_i \pi_j)$.
The variance of $\hat{\eta}_{HT}$ and $\hat{\xi}_{HT}$ and the covariance between $\hat{\eta}_{HT}$ and $\hat{\xi}_{HT}$ can now be written
$\mathrm{var}(\hat{\eta}_{HT}) = y^T W y$, $V(\hat{\xi}_{HT}) = X^T W X$ and $c(\hat{\eta}_{HT}, \hat{\xi}_{HT}) = X^T W y$, respectively. Then,
denoting by $W_{ss}$ the symmetric matrix whose ijth entry is $w_{ij}/\pi_{ij}$, where i, j ∈ s, the EOE
can be written
$$\hat{\eta}_O = \hat{\eta}_{HT} + \hat{B}_{HT}^T(\xi - \hat{\xi}_{HT}),$$
where
$$\hat{B}_{HT} = (X_s^T W_{ss} X_s)^{-1} X_s^T W_{ss} y_s.$$
So, when W is non-singular, the EOE belongs to the EGRE class defined above, setting $\Sigma = W^{-1}$. Generally, however, W is only non-negative definite, being singular for many sampling
designs. Even in such a case there are EGREs that are asymptotically equal to the OPE, as we
show next.
We give the name ‘design balanced variable’ (DBV) to any non-null auxiliary variable
whose mean is estimated without error by the Horvitz–Thompson estimator, that is
$\sum_{i \in s} Z_i/(N\pi_i) = \sum_{i=1}^{N} Z_i/N$, for all possible s under the sampling design. Then, assembling
the population values $Z_i$ into the N-vector z, we have the following theorem.
Theorem 1. An auxiliary variable is design-balanced if and only if the vector z belongs to
the subspace orthogonal to that spanned by the columns of W .
Proof. If z is a DBV, then $a^T W z = 0$ for any vector a (the covariance between the unbiased
estimators of a DBV mean and any other variable mean is identically zero). Hence, W z = 0.
On the other hand, if W z = 0 holds, then $z^T W z = 0$, i.e. z is a DBV.
From Theorem 1, it follows that the subspace spanned by the DBVs has dimension
N − rank(W ), where rank(· ) denotes the rank of a matrix. So, when rank(W ) = N, there
is no DBV; when rank(W ) = N − t, where t > 0, there are t linearly independent DBVs.
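Theorem 1 can be checked numerically on a toy design. The sketch below builds W for simple random sampling without replacement from the definition of $w_{ij}$ given earlier in this section, and verifies that the constant vector lies in its null space while a non-constant variable has a positive variance. The helper names (`srs_w_matrix`, `mat_vec`, `quad_form`) and the sizes N = 6, n = 3 are hypothetical.

```python
def srs_w_matrix(N, n):
    """W matrix for simple random sampling of n units out of N, with
    w_ij = (pi_ij - pi_i * pi_j) / (N^2 * pi_i * pi_j) and pi_ii = pi_i."""
    pi_i = n / N
    pi_ij = n * (n - 1) / (N * (N - 1))
    W = []
    for i in range(N):
        row = []
        for j in range(N):
            joint = pi_i if i == j else pi_ij
            row.append((joint - pi_i * pi_i) / (N ** 2 * pi_i * pi_i))
        W.append(row)
    return W


def mat_vec(W, z):
    """Matrix-vector product W z."""
    return [sum(w * zv for w, zv in zip(row, z)) for row in W]


def quad_form(W, z):
    """Quadratic form z^T W z, the design variance of the estimated mean of z."""
    return sum(zv * wz for zv, wz in zip(z, mat_vec(W, z)))


W = srs_w_matrix(N=6, n=3)
ones = [1.0] * 6
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(mat_vec(W, ones))      # numerically zero: the unit vector is a DBV
print(quad_form(W, values))  # variance of the sample mean of `values`
```

The quadratic form for `values` agrees with the textbook formula (1 − f)S²/n for the variance of a sample mean, which gives an independent check on the construction of W.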
Now, let Z be an N × t matrix containing t linearly independent DBVs and assume that
X does not contain any DBV. Then, we have the following theorem.
Theorem 2. Assume the model $E_m(y) = [Z\ X]\beta$ and $\mathrm{var}_m(y) = \sigma^2 \Sigma$. If $\Sigma$ is such that
there exists a scalar α so that $\Sigma^{-1} - \Sigma^{-1} Z (Z^T \Sigma^{-1} Z)^{-1} Z^T \Sigma^{-1} = \alpha W$, then $\tilde{\eta}_{EG} = \tilde{\eta}_O$,
i.e. the EGRE based on the above working model is asymptotically equivalent to the OPE based
on the same auxiliary information ξ.
Proof. To prove the result it is sufficient to write $[Z\ X]\beta = Z\beta_z + X\beta_x$ and to note that
the census-weighted least squares estimator of $\beta_x$ can be written $\hat{\beta}_{xN} = (X^T A X)^{-1} X^T A y$,
where $A = \Sigma^{-1} - \Sigma^{-1} Z (Z^T \Sigma^{-1} Z)^{-1} Z^T \Sigma^{-1}$. Since rank($\Sigma$) = N, by matrix algebra we
have rank(A) = N − rank(Z) = N − t. So, if there exists a scalar α such that αW = A, then
$\hat{\beta}_{xN} = B_{HT}$. Hence, since $\tilde{\eta}_{EG} - \tilde{\eta}_O = (\hat{\beta}_{xN} - B_{HT})^T(\xi - \hat{\xi}_{HT})$, it follows that $\tilde{\eta}_{EG} = \tilde{\eta}_O$.
Theorem 2 gives a sufficient condition for asymptotic equivalence between an EGRE
and the OPE that uses the same amount of auxiliary information ξ . In fact, besides auxiliary
variables that are not DBVs, the working model should include a number N − rank(W ) of
linearly independent DBVs and the variance matrix $\sigma^2 \Sigma$ should be chosen so that A is a matrix proportional to W, i.e. A ∝ W. In this manner we get an asymptotic minimum variance
and conditionally unbiased estimator that allows valid conditional inferences, given the amount
of auxiliary information ξ and irrespective of the goodness of the working model.
The next theorem ensures the finite size sample identity between an EGRE and the EOE.
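Theorem 3. If the expression $\Sigma^{-1} - \Sigma^{-1} Z (Z^T \Sigma^{-1} Z)^{-1} Z^T \Sigma^{-1} \propto W$ implies the expression
$Q_{ss} - Q_{ss} Z_s (Z_s^T Q_{ss} Z_s)^{-1} Z_s^T Q_{ss} \propto W_{ss}$, then $\hat{\eta}_{EG} = \hat{\eta}_O$.
Proof. It is sufficient to write $[Z_s\ X_s]\beta = Z_s \beta_z + X_s \beta_x$ and to note that the sample-weighted
least squares estimator of $\beta_x$ with respect to the weight matrix $Q_{ss}$ can be written
$\hat{\beta}_x = (X_s^T A_{ss} X_s)^{-1} X_s^T A_{ss} y_s$, where $A_{ss} = Q_{ss} - Q_{ss} Z_s (Z_s^T Q_{ss} Z_s)^{-1} Z_s^T Q_{ss}$. If $A_{ss} \propto W_{ss}$
then $\hat{\beta}_x = \hat{B}_{HT}$. Hence, since $\hat{\eta}_{EG} - \hat{\eta}_O = (\hat{\beta}_x - \hat{B}_{HT})^T(\xi - \hat{\xi}_{HT})$, the result follows.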
Generally speaking, the OPE is approximately equal to the EGRE based on the model
that includes the effect of any existing DBVs and assumes a matrix $\Sigma$ that depends on the
structure of the sampling design. Unfortunately, Theorem 2 does not provide guidelines for
identifying the proper $\Sigma$. In the next section, we examine a number of case studies in which
a suitable matrix $\Sigma^{-1}$ can be singled out from the structure of W.
6.1. Examples of asymptotic equivalencies between EGREs and OPEs
The structure of W is the starting point for deriving the model under which an EGRE is
asymptotically equivalent to the OPE. As we mentioned above, when rank(W) = N there
are no DBVs. So, given ξ, setting $\Sigma \propto W^{-1}$ the EGRE is asymptotically equal to the OPE.
Furthermore, because of Theorem 3, the EGRE is the EOE as well. As an example, consider
Poisson sampling with size measure $a_i$ (i = 1, 2, . . . , N). Let $\pi_i = n a_i / A$ be the inclusion
probability of the ith unit, where n is the expected sample size and $A = \sum_{i=1}^{N} a_i$. In this
case, $W = N^{-2}\,\mathrm{diag}(\pi_i^{-1} - 1)$. Setting $\Sigma = \mathrm{diag}(a_i/(A - n a_i))$, the corresponding GRE is the
EOE as well. This form of the variance matrix was also derived by Särndal (1996) by
minimizing the asymptotic variance of a GRE for Poisson sampling.
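The Poisson sampling case can be checked directly: under independent selections $\pi_{ij} = \pi_i \pi_j$ for i ≠ j, so W is diagonal with entries $(\pi_i^{-1} - 1)/N^2$, and a working-model variance matrix with diagonal entries $a_i/(A - n a_i)$ should have an inverse proportional to W. The function name `poisson_proportionality` and the size measures below are hypothetical.

```python
def poisson_proportionality(a, n):
    """For Poisson sampling with size measures a and expected sample size n,
    return the ratios (1 / sigma_ii) / w_ii for every unit.

    w_ii     = (1 / pi_i - 1) / N^2   (diagonal entry of W)
    sigma_ii = a_i / (A - n * a_i)    (diagonal entry of the working variance matrix)
    A constant list of ratios means the inverse variance matrix is proportional to W.
    """
    N = len(a)
    A = sum(a)
    ratios = []
    for a_i in a:
        pi_i = n * a_i / A
        if not 0 < pi_i < 1:
            raise ValueError("inclusion probabilities must lie strictly in (0, 1)")
        w_ii = (1 / pi_i - 1) / N ** 2
        sigma_ii = a_i / (A - n * a_i)
        ratios.append((1 / sigma_ii) / w_ii)
    return ratios


sizes = [1.0, 2.0, 3.0, 4.0]
ratios = poisson_proportionality(sizes, n=2)
print(ratios)   # all entries equal n * N^2 up to rounding
```

The common ratio is $n N^2$, so the scalar α of Theorem 2 can be read off directly in this design.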
Next we give examples where DBVs exist. As a rule of thumb, potential DBVs are those
proportional to the first order inclusion probabilities within subpopulations from which fixed
size samples are selected.
Example 1. Consider a simple random sampling of n units. In this case rank(W ) = N − 1
and any vector proportional to the unit vector 1 belongs to the orthogonal subspace of W .
Set Z = 1 and $\Sigma = I$, where I is the identity matrix; then A ∝ W. Furthermore, because
of Theorem 3, the GRE based on a homoscedastic linear regression model with an intercept
term is the EOE as well.
Now, to apply the theory of Section 5, suppose that the population mean ξ of an auxiliary
variable X is available, and that we are willing to use the ratio estimator η̂R = ξ ȳ/x̄, where
ȳ and x̄ are sample means, i.e. the Horvitz–Thompson estimators of η and ξ. That estimator
is the GRE based on the simple linear regression model through the origin Em (Yi ) = Xi β,
$\mathrm{var}_m(Y_i) = \sigma^2 X_i$ and $\mathrm{cov}_m(Y_i, Y_j) = 0$. Its linear approximation is $\tilde{\eta}_R = \bar{y} + R(\xi - \bar{x})$,
where R = η/ξ. The corresponding OPE is $\tilde{\eta}_O = \bar{y} + Q(\xi - \bar{x})$, where $Q = S_{XY}/S_X^2$ is
the population regression coefficient of X; $S_{XY}$ is the population covariance between Y and
X, and $S_X^2$ is the variance of X. So $\hat{\eta}_O = \bar{y} + \hat{Q}(\xi - \bar{x})$, where $\hat{Q}$ is the sample regression
coefficient of X. Using the linear approximations of $\hat{\eta}_R$ and $\hat{\eta}_O$, straightforward calculations give
the asymptotic results
$$\gamma(\hat{\eta}_R \mid \bar{x}) = \frac{\sqrt{n}\,(Q - R)(\bar{x} - \xi)}{\sqrt{(1 - f) S_E^2}}, \qquad \lambda(\hat{\eta}_R, \hat{\eta}_O) = \frac{(Q - R)^2 S_X^2}{S_E^2},$$
$$\mathrm{MSE}(\hat{\eta}_R \mid \bar{x}) = \frac{1 - f}{n} S_E^2 + (Q - R)^2 \Bigl[ (\bar{x} - \xi)^2 - \frac{1 - f}{n} S_X^2 \Bigr], \qquad (3)$$
where f = n/N is the sampling fraction and $S_E^2$ is the population variance of the variable
E = Y − RX. Note that (3) can be rewritten
$$\mathrm{MSE}(\hat{\eta}_R \mid \bar{x}) = \frac{1 - f}{n} (S_Y^2 - Q^2 S_X^2) + (Q - R)^2 (\bar{x} - \xi)^2,$$
and the latter is an asymptotic version of a result in Robinson (1987).
When the working model upon which η̂R is based holds true, for large populations
Q ≈ R and negligible gains in efficiency can be obtained from the OPE. If the model is
wrongly specified, the relative conditional bias γ (η̂R | x̄) can be substantial, as can the gain in
efficiency of η̂O over η̂R , measured by λ(η̂R , η̂O ). This can occur when the finite population
intercept in a linear regression is far from zero.
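The quantities of Example 1 can be computed for a toy population to see how a non-zero intercept inflates λ and γ. The sketch reads the asymptotic results as λ = (Q − R)² S_X² / S_E², γ = √n (Q − R)(x̄ − ξ) / √((1 − f) S_E²), and the conditional MSE in its two equivalent forms. The function name `ratio_diagnostics` and the population values are hypothetical and purely illustrative.

```python
from math import sqrt
from statistics import mean


def ratio_diagnostics(Y, X, n, x_bar):
    """Asymptotic lambda, gamma and conditional MSE of the ratio estimator
    under simple random sampling of n units, given the realized sample mean x_bar."""
    N = len(Y)
    eta, xi = mean(Y), mean(X)
    s_xx = sum((x - xi) ** 2 for x in X) / (N - 1)
    s_yy = sum((y - eta) ** 2 for y in Y) / (N - 1)
    s_xy = sum((x - xi) * (y - eta) for x, y in zip(X, Y)) / (N - 1)
    R = eta / xi                      # population ratio
    Q = s_xy / s_xx                   # population regression coefficient
    # E = Y - R X has population mean zero, so its variance is a raw sum of squares.
    s_ee = sum((y - R * x) ** 2 for x, y in zip(X, Y)) / (N - 1)
    f = n / N
    lam = (Q - R) ** 2 * s_xx / s_ee
    gamma = sqrt(n) * (Q - R) * (x_bar - xi) / sqrt((1 - f) * s_ee)
    mse = (1 - f) / n * s_ee + (Q - R) ** 2 * ((x_bar - xi) ** 2 - (1 - f) / n * s_xx)
    mse_alt = (1 - f) / n * (s_yy - Q ** 2 * s_xx) + (Q - R) ** 2 * (x_bar - xi) ** 2
    return {"lambda": lam, "gamma": gamma, "mse": mse, "mse_alt": mse_alt, "s_ee": s_ee}


# Toy population with an intercept far from zero, so that Q differs from R.
X_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
Y_pop = [12.0, 13.0, 15.0, 16.0, 19.0, 20.0, 21.0, 25.0]
diag = ratio_diagnostics(Y_pop, X_pop, n=3, x_bar=5.0)
print(diag)
```

The two MSE entries agree, and the conditional MSE equals the unconditional variance of the ratio estimator multiplied by (1 − λ + γ²), which makes explicit how an unbalanced sample (large |γ|) can push the conditional MSE above the unconditional variance.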
Example 2. Consider stratified random sampling. Let the ith unit within the hth stratum
(h = 1, . . . , H ; i = 1, . . . , Nh ) be labelled hi. Let nh be the sample size within stratum
h. In this case, rank(W ) = N − H and the indicator variables of stratum memberships of
population units are a set of linearly independent DBVs. Define $Z = [z_1\ z_2\ \ldots\ z_H]$, where
$z_\ell$ is a vector whose entries are $Z_{\ell hi} = 1$, when $h = \ell$, and $Z_{\ell hi} = 0$, otherwise. Then,
setting $\Sigma = \mathrm{diag}(\nu_{hi})$, where $\nu_{hi} = n_h (N_h - 1)/[N_h (N_h - n_h)]$, it follows that A ∝ W. The
corresponding GRE is asymptotically equivalent to the OPE when the working model includes
the stratum membership indicator variables of population units and the variance matrix is as
specified above. Note that this form of the variance matrix was also worked out by Särndal
(1996) by minimizing the asymptotic variance of the GRE based on a model that includes an
intercept term for each stratum. Furthermore, when nh is constant across strata, because of
Theorem 3, the GRE is the EOE as well.
Again, to apply results of Section 5, suppose that the population mean, ξ, of an auxiliary variable X is known, and that we are willing to adopt the GRE based on the model
$E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $\mathrm{var}_m(Y_{hi}) = \sigma^2$, and $\mathrm{cov}_m(Y_{hi}, Y_{h'j}) = 0$, for all $h \neq h'$ or
$i \neq j$, i.e. $\hat{\eta}_G = \hat{\eta}_{HT} + \hat{Q}(\xi - \hat{\xi}_{HT})$, where $\hat{Q}$ is a consistent estimator of $Q = S_{XY}/S_X^2$.
In this example, the EOE is known as the combined regression estimator (Cochran, 1977,
Section 7.10) and its linear approximation takes the form $\tilde{\eta}_O = \hat{\eta}_{HT} + B(\xi - \hat{\xi}_{HT})$, where
$$B = \frac{\sum_{h=1}^{H} c_h S_{Xh}^2 Q_h}{\sum_{h=1}^{H} c_h S_{Xh}^2}, \qquad c_h = (n_h^{-1} - N_h^{-1}) N_h^2 / N^2, \qquad Q_h = S_{XYh}/S_{Xh}^2,$$
and $S_{XYh}$ and $S_{Xh}^2$ are the stratum covariance between Y and X and the stratum variance of X,
respectively. Observe that B is a weighted average of stratum regression coefficients. Straightforward
calculations based on the linear approximations of $\hat{\eta}_G$ and $\hat{\eta}_O$ give the following asymptotic
results
$$\gamma(\hat{\eta}_G \mid \hat{\xi}_{HT}) = \frac{(B - Q)(\hat{\xi}_{HT} - \xi)}{\sqrt{\sum_{h=1}^{H} c_h S_{Eh}^2}}, \qquad \lambda(\hat{\eta}_G, \hat{\eta}_O) = \frac{(B - Q)^2 \sum_{h=1}^{H} c_h S_{Xh}^2}{\sum_{h=1}^{H} c_h S_{Eh}^2},$$
$$\mathrm{MSE}(\hat{\eta}_G \mid \hat{\xi}_{HT}) = \sum_{h=1}^{H} c_h S_{Eh}^2 + (B - Q)^2 \Bigl[ (\hat{\xi}_{HT} - \xi)^2 - \sum_{h=1}^{H} c_h S_{Xh}^2 \Bigr],$$
where $S_{Eh}^2$ is the stratum population variance of the variable E = Y − QX. When the working
model upon which η̂G is based holds true, for large strata Qh ≈ Q for all h, and B ≈ Q.
So negligible gains in efficiency can be achieved by the OPE. If the model does not hold true,
different stratum regression coefficients can give non-negligible values of γ (η̂G | ξ̂HT ) and
λ(η̂G , η̂O ), i.e. η̂G is conditionally biased, given ξ̂HT , and asymptotically inefficient.
Example 3. Consider stratified two-stage random sampling. Let the j th elementary unit
within the ith primary sampling unit (PSU) of the hth stratum (h = 1, . . . , H ; i = 1, . . . , Nh ;
j = 1, . . . , Mhi ) be labelled hij. Using simple random sampling without replacement in
both stages, nh PSUs are selected from each stratum and mhi elementary units are drawn
from each selected PSU. It can be verified that rank(W) = N − H and that the matrix
$Z = [z_1\ z_2\ \ldots\ z_H]$ contains H linearly independent DBVs, where $z_\ell$ is a vector whose entries
are $Z_{\ell hij} = M_h/(N_h M_{hi})$ if $h = \ell$, and $Z_{\ell hij} = 0$ otherwise, and $M_h$ is the elementary unit
stratum size. Setting $\Sigma = \mathrm{diag}(\Sigma_{hi})$, where $\Sigma_{hi} = [a_{hi} I + b_{hi} 1 1^T]^{-1}$ is an $M_{hi} \times M_{hi}$
matrix with
$$a_{hi} = \frac{N_h}{n_h} \frac{M_{hi}}{m_{hi}} \frac{M_{hi} - m_{hi}}{M_{hi} - 1} \quad \text{and} \quad b_{hi} = \frac{N_h}{n_h} \frac{M_{hi}}{m_{hi}} \frac{m_{hi} - 1}{M_{hi} - 1} - \frac{N_h}{n_h} \frac{n_h - 1}{N_h - 1},$$
the EGRE is asymptotically equivalent to the OPE. Furthermore, if $n_h$ is constant across strata,
because of Theorem 3, the EGRE is the EOE as well. Note that the structure of $\Sigma$ is that of an
equal correlation model within PSUs, and that the indicator variables of stratum membership
of elementary units are not DBVs, unless Mhi is constant within each stratum.
Now we present two examples involving unequal probability sampling, but for simplicity we assume sampling with replacement. In such a case, first and second order inclusion
probabilities must be replaced by ϕi = E(δi ) and ϕij = E(δi δj ), where δi is the random
variable defined as the number of times unit i has been selected in the sample s. As a consequence, Horvitz–Thompson estimators have to be replaced by Hansen–Hurwitz analogues and
the matrix W has to be properly redefined. Note that results of sampling with replacement
are often used to approximate results of sampling without replacement. The following two
examples can be useful for approximating OPEs in complex sampling schemes.
Example 4. Consider a with-replacement unequal-probability sampling design of fixed size
n and with selection probabilities Pi (i = 1, . . . , N ). The sampling variance of the unbiased
(Hansen–Hurwitz) estimator is given by $y^T W y$, where $W = n^{-1} N^{-2} (P^{-1} - 1 1^T)$ and
$P = \mathrm{diag}(P_i)$. In this case, rank(W) = N − 1 and the variable $Z_i = N P_i$ is a DBV. Inserting
the latter into the working model and setting $\Sigma = \mathrm{diag}(n P_i)$, the GRE is asymptotically
equivalent to the OPE. Furthermore, because of Theorem 3, the GRE is the EOE as well.
Example 5. Consider a stratified two-stage sampling as in Example 3, but now at the first
stage nh PSUs are selected from each stratum h using a with-replacement unequal-probability
scheme with selection probabilities $P_{hi}$. In this case, rank(W) = N − H, and the matrix
$Z = [z_1\ z_2\ \ldots\ z_H]$, where $z_\ell$ is the vector whose entries are $Z_{\ell hij} = P_{hi} M_h / M_{hi}$ if $h = \ell$,
and $Z_{\ell hij} = 0$ otherwise, contains H linearly independent DBVs. Inserting the latter into the
working model and setting $\Sigma = \mathrm{diag}(\Sigma_{hi})$, where $\Sigma_{hi} = [a_{hi} I + b_{hi} 1 1^T]^{-1}$ is an $M_{hi} \times M_{hi}$
matrix with
$$a_{hi} = \frac{1}{n_h P_{hi}} \frac{M_{hi}}{m_{hi}} \frac{M_{hi} - m_{hi}}{M_{hi} - 1}, \qquad b_{hi} = \frac{1}{n_h P_{hi}} \frac{M_{hi}}{m_{hi}} \frac{m_{hi} - 1}{M_{hi} - 1},$$
the EGRE is asymptotically equivalent to the OPE. Again, if nh is constant across strata,
the EGRE is the EOE as well. Note that the indicator variables of stratum membership of
elementary units are not DBVs, unless Phi is proportional to Mhi .
We conclude this section with the following observations. First, multiplying a DBV by
any constant does not modify the equivalence results shown in the previous examples. Second,
there are alternative forms of $\Sigma$ that guarantee the same equivalence results in the previous
examples. So the problem of finding the best choice of $\Sigma$ is open.
7. Estimation strategies
The examples examined above illustrate that, given some auxiliary information ξ , the
OPE is generally equivalent to the EGRE based on a working model that includes the maximum
number of linearly independent DBVs and assumes a variance matrix that reflects the structure
of first and second order inclusion probabilities. So, the OPE allows a better fit of the data,
and this explains its asymptotic superiority. However, in finite size samples, besides potential
collinearities among regressors, the EOE is exposed to instabilities because there might be an
inadequate number of residual degrees of freedom available for estimating a larger number
of parameters. In particular, this concern can be important in stratified designs with a few
observations per stratum, or in multistage sampling designs with a few PSUs per stratum.
The analysis and the examples presented in the previous sections suggest the following
options for estimating a population mean, taking into account a given amount of auxiliary
information ξ and the realized value of ξ̂HT . Above all, we recommend the conditional approach to inference. When an estimator is conditionally biased, its precision is greater or less
than that predicted by the unconditional variance according to the conditional MSE (1). An
unconditional inference would ignore this piece of information. So, the first option consists
of using directly the EOE. This choice guarantees asymptotic conditional unbiasedness and
efficiency and allows valid conditional inferences.
However, when the stability of the EOE estimator is of concern, as when the total sample
size is not big enough or the number of strata is high compared to the number of observations,
a second option consists of adopting a GRE or EGRE based on a reasonable reduced working
model and estimating the conditional MSE (1), to guarantee valid conditional inferences.
In the reduced model a suitable subset of DBVs can be inserted. For example, in the case of
stratified samples, this can be accomplished by introducing DBVs corresponding to superstrata
obtained by collapsing the original strata. Collapsing should be performed so that, within each
superstratum, strata effects can be considered negligible. In this way some conditional bias is
accepted to control for the unconditional variance of the estimator.
8. An empirical study
The theory developed in the previous sections is asymptotic in nature, being based on
first order approximations and normality. In this section we report results of an empirical
study carried out to test the theory in the presence of finite size samples.
A finite population partitioned into eight strata of size 100 was considered. Let hi be
the label of unit i = 1, 2, . . . , 100 within stratum h = 1, 2, . . . , 8. Values of an auxiliary
variable X were assigned through the monotonic function of h and i
$$X_{hi} = 4.95 + 5 \sum_{j=1}^{h-1} j + 0.05\,h \times i .$$
A finite population of Y values, given X, was generated using the model
$$Y_{hi} = 20 + 2X_{hi} + 0.06X_{hi}^{2} + X_{hi}\,\varepsilon_{hi},$$
where εhi were observations from a standardized normal random variable. The realized mean
and standard deviation of Y were 618.2 and 676.0, respectively. The asymmetry of Y, measured by the ratio between the third central moment and the cube of the standard deviation,
was 1.21. The correlation coefficient between Y and X was equal to 0.96. The unit values
Xhi were assumed unknown.
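The construction of the study population can be sketched in a few lines of Python. This is only an illustrative reconstruction from the two formulas above: the function name `build_population`, the dictionary layout and the random seed are assumptions, and a different random draw will not reproduce the realized mean and standard deviation of Y quoted in the text.

```python
import random

def build_population(n_strata=8, stratum_size=100, seed=0):
    """Return {h: [(x, y), ...]} for strata h = 1..n_strata."""
    rng = random.Random(seed)  # arbitrary seed, not taken from the paper
    population = {}
    for h in range(1, n_strata + 1):
        offset = 5 * sum(range(1, h))  # 5 * sum_{j=1}^{h-1} j
        units = []
        for i in range(1, stratum_size + 1):
            x = 4.95 + offset + 0.05 * h * i
            eps = rng.gauss(0.0, 1.0)  # standard normal error
            y = 20 + 2 * x + 0.06 * x**2 + x * eps
            units.append((x, y))
        population[h] = units
    return population

population = build_population()
# X values listed stratum by stratum, unit by unit
xs = [x for h in sorted(population) for x, _ in population[h]]
```

Listing the X values in stratum and unit order makes it easy to check that the assignment is monotonic in h and i, as stated.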
A proportional stratified random sampling-without-replacement design was used to select 5000 samples of size 80 (10 units per stratum) and 5000 samples of size 24 (three units
per stratum). For each sample the following quantities were computed:
1. ȳ and x̄, i.e. the sample means of Y and X;
2. η̂G1 , i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $\mathrm{var}_m(Y_{hi}) = X_{hi}\sigma^2$,
assuming the population mean ξ1 of X known;
3. η̂O1 and η̃O1 , i.e. the EOE and OPE based on the same auxiliary variable used by η̂G1 ;
4. η̂G2 , i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi} + \beta_3 X_{hi}^2$, $\mathrm{var}_m(Y_{hi}) = X_{hi}^2\sigma^2$
(the true model), assuming ξ1 and ξ2 known, the latter being the population mean
of the squared X unit values;
5. η̂O2 and η̃O2 , i.e. the EOE and OPE based on the same auxiliary variables used by η̂G2 ;
6. η̂SDi (i = 1, 2), i.e. the sample-dependent estimators defined to take the value of η̂Gi
when the estimated absolute value of the conditional relative bias |γ (η̂Gi | ξ̂HT )| is below
25%; and the value of η̂Oi otherwise.
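As a rough illustration of items 1 and 2, the sketch below draws a stratified sample without replacement and computes a regression estimator of the mean for the working model with mean β1 + β2 X and variance proportional to X. It assumes the usual generalized regression form (sample mean plus a slope-weighted correction for the difference between ξ1 and x̄), with equal inclusion probabilities so that design weights cancel in the slope; the paper's own definition of the GRE is given in an earlier section and may differ in detail. The names `stratified_sample` and `greg_mean` and the toy population are hypothetical.

```python
import random

def stratified_sample(population, n_per_stratum, rng):
    """Draw n_per_stratum units without replacement from every stratum."""
    sample = []
    for h in sorted(population):
        sample.extend(rng.sample(population[h], n_per_stratum))
    return sample

def greg_mean(sample, xi1):
    """Regression estimator of the mean under E(Y) = b1 + b2*X and
    var(Y) proportional to X, assuming equal inclusion probabilities."""
    n = len(sample)
    x_bar = sum(x for x, _ in sample) / n
    y_bar = sum(y for _, y in sample) / n
    # Weighted least squares with weights 1/x (inverse variance function).
    s0 = sum(1 / x for x, _ in sample)   # sum of weights
    s1 = float(n)                        # sum of x / x
    s2 = sum(x for x, _ in sample)       # sum of x^2 / x
    t0 = sum(y / x for x, y in sample)   # sum of y / x
    t1 = sum(y for _, y in sample)       # sum of x*y / x
    b2 = (s0 * t1 - s1 * t0) / (s0 * s2 - s1 * s1)
    # The intercept drops out because its sample and population means are both 1.
    return y_bar + (xi1 - x_bar) * b2

# Toy population (hypothetical): two strata, Y exactly linear in X.
toy = {h: [(float(x), 3.0 + 2.0 * x) for x in range(10 * h, 10 * h + 10)]
       for h in (1, 2)}
xi1 = sum(x for units in toy.values() for x, _ in units) / 20
sample = stratified_sample(toy, 3, random.Random(1))
estimate = greg_mean(sample, xi1)
```

When Y is an exact linear function of X, as in the toy population, the estimator reproduces the population mean of Y whatever sample is drawn, which is a convenient sanity check.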
The sample-dependent estimator η̂SDi was introduced to evaluate the selection of the
estimator according to the estimated value of |γ (η̂Gi | ξ̂HT )| after the sample had been drawn.
Here, 25% was an arbitrarily chosen threshold under which the relative conditional bias was
thought acceptable from the efficiency point of view. Variances and covariances of estimators
were computed using the Taylor linearization method (Wolter, 1985, Chapter 6) and standard
formulæ for a stratified sample with a proportional allocation. The realized finite population
was such that λ(η̂G1 , η̂O1 ) = 16.0% and λ(η̂G2 , η̂O2 ) = 0.0%. Note that in this setup, the
EOE was a GRE based on a model with an intercept term for each stratum and a homoscedastic
variance matrix.
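The selection rule behind the sample-dependent estimator is simple enough to state as code. The sketch below is a minimal illustration with hypothetical argument names; it takes the estimated relative conditional bias as given and does not show how that estimate is computed.

```python
def sample_dependent(eta_g, eta_o, gamma_hat, threshold=0.25):
    """Return the GRE value when the estimated absolute relative
    conditional bias is below the threshold, and the EOE value otherwise."""
    return eta_g if abs(gamma_hat) < threshold else eta_o
```

A value exactly at the threshold falls under "otherwise" and so selects the EOE.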
Table 1 reports unconditional MSEs of estimators of η, scaled so that the MSE of the
sample mean equals 100. Estimator biases were always negligible and are not reported.
First, consider the case of sample size n = 80. As expected, when only ξ1 was known, η̂G1
was less efficient than η̂O1 . The result was due to the inadequacy of the working model upon
which η̂G1 was based and the consequent conditional bias. The estimated |γ (η̂G1 | x̄)| took
values above the 25% threshold quite often and the sample-dependent estimator η̂SD1 was
almost as efficient as η̂O1 . When both ξ1 and ξ2 were known, η̂G2 was a bit less efficient than
η̃O2 , but slightly more efficient than η̂O2 , being based on the more parsimonious true model.
TABLE 1
Unconditional results: scaled empirical MSE of estimators and percentages
of samples for which |γ̂(η̂G | ξ̂HT)| > 25%

Auxiliary used   Estimator                                       Scaled MSE, n = 80   Scaled MSE, n = 24
None             ȳ                                               100.0                100.0
(1, X)           η̂G1                                             50.1                 50.0
                 η̂O1                                             42.4                 49.0
                 η̃O1                                             41.5                 41.9
                 η̂SD1                                            43.0                 49.4
                 freq{|γ̂(η̂G1 | x̄)| > 25%}                        50.1%                48.0%
(1, X, X²)       η̂G2                                             32.7                 33.5
                 η̂O2                                             33.7                 45.2
                 η̃O2                                             32.4                 33.2
                 η̂SD2                                            33.2                 44.8
                 freq{|γ̂(η̂G2 | x̄, $\overline{x^2}$)| > 25%}     22.5%                51.0%

Most of the time the estimated |γ(η̂G2 | x̄, $\overline{x^2}$)|, where $\overline{x^2}$ is the sample mean of the squared
X unit values, was below the 25% threshold and the sample-dependent estimator η̂SD2 was
almost as efficient as η̂G2 . The result confirms that, given a suitable amount of auxiliary
information, only a negligible gain in efficiency can be achieved through the OPE when the
working model upon which the GRE is based holds true, even with very large samples. The
efficiencies of η̂O1 and η̂O2 were very similar to those of η̃O1 and η̃O2 , which represented a
gauge of the best that could be expected from the EOEs when the sample size increases. The
loss of degrees of freedom concerning η̂O1 and η̂O2 did not seem to affect their precision for
this sample size and number of strata. For example, note that η̂O2 was equivalent to a GRE
based on a model with 10 parameters and 80 observations.
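The scaling used in Table 1 can be written out as follows. This is a minimal sketch with hypothetical names: it takes the estimates produced over the replicated samples, computes each empirical MSE around the true mean, and expresses it relative to the MSE of the sample mean set to 100.

```python
def scaled_mse(estimates, baseline_estimates, true_mean):
    """Empirical MSE of `estimates`, scaled so that the MSE of the
    baseline estimator (here, the sample mean) equals 100."""
    def mse(values):
        return sum((v - true_mean) ** 2 for v in values) / len(values)
    return 100.0 * mse(estimates) / mse(baseline_estimates)

# Toy replicates (hypothetical numbers), true mean 10:
baseline = [8.0, 12.0]   # empirical MSE 4
candidate = [9.0, 11.0]  # empirical MSE 1
ratio = scaled_mse(candidate, baseline, 10.0)
```

A scaled MSE of 50 therefore means that the estimator's empirical MSE is half that of the sample mean.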
To warn about EOE instability, a sample size n = 24 was also considered. The loss in
precision of η̂O1 and η̂O2 was dramatic compared to their linear approximations. In fact,
η̂O2 was based on a model with 10 parameters and 24 observations. The estimated value
of |γ (η̂G | ξ̂HT )| was unstable as well. In particular, in the case of the quadratic working
model, it caused the sample-dependent estimator to take the value of η̂O2 often, even if it
was unnecessary. In fact, although the absolute value of the true relative conditional bias was
always under the threshold, the estimated value was over the threshold 51% of the time. The
problem was the small number of residual degrees of freedom, i.e. 14.
8.1. Conditional results
To explore the conditional performances of estimators, the 5000 samples of size 80 were
sorted by the value of x̄ and put into 23 groups of 600 samples each, with a 67% overlap
between consecutive groups to smooth the plot. Empirical means of estimators and other
quantities were computed within each group. Figures 1–3 present some conditional results
for estimators based on the auxiliary (1, X).
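The grouping scheme can be sketched as a sliding window over the samples sorted by x̄. The step of 200 samples between consecutive groups is an inference from the stated figures (two thirds of 600 is 400 shared samples), not a value given in the text, and the function names are hypothetical.

```python
def overlapping_groups(n_items, group_size, step):
    """Return (start, stop) index pairs of sliding groups over sorted items."""
    return [(start, start + group_size)
            for start in range(0, n_items - group_size + 1, step)]

def group_means(records, key, value, group_size=600, step=200):
    """Sort records by `key` and average `value` within each sliding group."""
    ordered = sorted(records, key=key)
    return [sum(value(r) for r in ordered[a:b]) / (b - a)
            for a, b in overlapping_groups(len(ordered), group_size, step)]

groups = overlapping_groups(5000, 600, 200)
```

With 5000 samples, groups of 600 and a step of 200, the window fits exactly 23 times, which agrees with the number of groups reported.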
Figure 1 shows, for each group, the ratio of the bias of each estimator to its standard deviation, plotted against the group average of (x̄ − ξ1 ).
The GRE η̂G1 is substantially conditionally biased compared to its conditional standard deviation. On the other hand, η̂O1 and η̂SD1 are approximately conditionally unbiased.
Figure 2 reports the behaviour of the average of (η̂G1 − η)² within each group, i.e.
the empirical conditional MSE, of which an important part was due to the conditional bias.
Note that the estimator of the conditional MSE (1) tracks the true value closely. On the other
hand, the traditional variance estimator remains close to the unconditional variance value
represented by the horizontal line in the graph. So η̂G1 is, on average, more precise than
predicted by its unconditional variance estimator in approximately balanced samples with
respect to X. The opposite is true in unbalanced samples.
[Figure 1. Conditional relative biases of η̂G1 , η̂O1 and η̂SD1 (- - -) versus x̄ − ξ1 .]
[Figure 2. Conditional MSE of η̂G1 and its estimator, and the unconditional variance estimator of η̂G1 (- - -), versus x̄ − ξ1 .]
[Figure 3. 90% conditional confidence interval coverage rates obtained by the pairs (η̂G1 , V̂ (η̂G1 )), (η̂G1 , $\widehat{\mathrm{MSE}}$(η̂G1 | x̄)) and (η̂O1 , V̂ (η̂O1 )) (- - -) versus x̄ − ξ1 .]
Figure 3 reports 90% confidence interval conditional coverage rates. Confidence intervals were constructed assuming approximate normal distributions for estimators coupled with
the appropriate variance or MSE estimators. Although the unconditional coverage rates are
all very close to the nominal one (indeed they are somewhat lower), conditionally the worst
behaviour is that of η̂G1 coupled with the unconditional variance estimator. The coverage rate
is higher than 90% in approximately balanced samples and lower in extremely unbalanced
samples. Better results are achieved using η̂G1 coupled with its conditional MSE estimator,
or η̂O1 coupled with its own variance estimator. Confidence interval coverage rates are also
affected by conditional properties of variance and MSE estimators, but that matter is beyond
the scope of this paper.
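The coverage rates discussed here can be computed with a short routine such as the one below. It is a sketch under the normal approximation stated in the text, with hypothetical names; for the interval based on the conditional MSE estimator, the square root of the estimated MSE would be passed in place of the standard error.

```python
from statistics import NormalDist

def coverage_rate(estimates, std_errors, true_mean, level=0.90):
    """Share of normal-theory intervals estimate +/- z * se that
    contain the true mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    hits = sum(abs(est - true_mean) <= z * se
               for est, se in zip(estimates, std_errors))
    return hits / len(estimates)

# Toy replicates (hypothetical numbers), true mean 10, unit standard errors:
rate = coverage_rate([10.0, 10.5, 13.0], [1.0, 1.0, 1.0], 10.0)
```

Applying the routine within each group of samples sorted by x̄ gives conditional coverage rates of the kind plotted in Figure 3.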
Similar behaviour was observed for sample size 24. The estimators η̂O1 and η̂SD1
were still approximately conditionally unbiased even if unconditionally they were much less
efficient than η̃O1 .
Conditional results for estimators based on the auxiliary variables (1, X, X²) are not
reported because no conditional trend was noticeable in any case. Details are available on
request.
9. Discussion
The main conclusion we can draw from this analysis is that the population mean of any
known auxiliary variable is efficiently incorporated into the estimation strategy if the resulting
estimator of the mean of the survey variable is uncorrelated with the estimator of the mean
of the auxiliary variable. This condition also guarantees conditional unbiasedness and allows
valid conditional inferences. That requirement can be achieved through the EOE. The latter
corresponds to an extended GRE based on a working model that includes the effects of any
existing DBV. With stratified designs with few observations per stratum, the EOE can be
vulnerable to the consequent loss of degrees of freedom. Further empirical work is needed to
evaluate when a sample is large enough to overcome the problem, especially with complex
designs.
When the use of the EOE is thought unsafe, the alternative consists of using a working
model without DBVs or with a few of them capable of capturing the relationship between
the survey variables and the auxiliary variables whose population means are known. If the
model holds true, the corresponding EGRE guarantees efficiency and conditional unbiasedness. However, specification of the model is confined to regressors with known population
means, so it only rarely describes the complexity of the finite population. Therefore, we
recommend estimating the conditional MSE of the GRE or EGRE, to correctly evaluate its
precision from a conditional point of view.
To use the function γ (η̂G | ξ̂HT ) for choosing between an EGRE and the EOE, empirical
evidence is needed for determining which values of |γ (η̂G | ξ̂HT )| are to be considered negligible. Further work is needed on the distribution of the estimated |γ (η̂G | ξ̂HT )| for choosing
the threshold over which shifting from an EGRE to the corresponding EOE is profitable for a
valid conditional inference and a lower unconditional variance, especially when the true value
of λ(η̂G , η̂O ) is zero. The estimated value of |γ (η̂G | ξ̂HT )| is unstable, but in most practical
situations there is more than one survey variable of interest. To apply the same weights to
every variable, the EOE could then be chosen on the basis of the value of |γ (η̂G | ξ̂HT )|
averaged across the main survey variables; such an average is more precisely estimable.
So far we have restricted the analysis to the use of balanced design variables, because
we took the vector ξ as fixed. This is part of the wider problem of selecting a subset of the
available auxiliary variables for estimation purposes. For example, if the true model is linear,
a quadratic term inserted into the working model may give a less efficient estimator in finite
sample sizes, although asymptotically it would result in a more efficient estimator of the population mean. Generally speaking, one should condition on all the auxiliary variables whose
population means are known, to avoid any conditional bias. However, with finite size samples,
that choice may yield unacceptable levels of conditional and unconditional variance. So, the
problem of selecting the best subset of auxiliary variables for finite size sample efficiency is
still open.
References
CASADY, R.J. & VALLIANT, R. (1993). Conditional properties of post-stratified estimators under normal theory.
Survey Methodology 19, 183–192.
COCHRAN, W. (1977). Sampling Techniques. New York: Wiley.
DEVILLE, J.C. & SÄRNDAL, C.E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 87,
376–382.
ELTINGE, J.L. & JANG, D.S. (1996). Stability measures for variance component estimators under a stratified
multistage design. Survey Methodology 22, 157–165.
ESTEVAO, V., HIDIROGLOU, M.A. & SÄRNDAL, C.E. (1995). Methodological principles for a generalised estimation system at Statistics Canada. J. Official Statist. 11, 181–204.
FRANCISCO, C.A. & FULLER, W.A. (1991). Quantile estimation with a complex survey design. Ann. Statist. 19,
454–469.
HOLT, D. & SMITH, T.M.F. (1979). Post-stratification. J. Roy. Statist. Soc. Ser. A 142, 33–46.
JOBSON, J.D. (1992). Applied Multivariate Data Analysis, Vol. II. New York: Springer-Verlag.
KREWSKI, D. & RAO, J.N.K. (1981). Inference from stratified samples: properties of the linearisation, jackknife
and balanced repeated replication methods. Ann. Statist. 9, 1010–1019.
ISAKI, C.T. & FULLER, W.A. (1982). Survey design under the regression superpopulation model. J. Amer. Statist.
Assoc. 77, 89–96.
LEHTONEN, R. & PAHKINEN, E.J. (1995). Practical Methods for Survey Design. New York: Wiley.
MONTANARI, G.E. (1987). Post-sampling efficient QR-prediction in large-scale surveys. Internat. Statist. Rev.
55, 191–202.
MONTANARI, G.E. (1998). On regression estimation of finite population mean. Survey Methodology 24, 69–77.
RAO, J.N.K. (1985). Conditional inference in survey sampling. Survey Methodology 11, 15–31.
RAO, J.N.K. (1994). Estimating totals and distribution functions using auxiliary information at the estimation
stage. J. Official Statist. 10, 153–165.
RAO, J.N.K. & WU, C.F.J. (1985). Inference from stratified samples: second order analysis of three methods for
non-linear statistics. J. Amer. Statist. Assoc. 80, 620–630.
ROBINSON, J. (1987). Conditioning ratio estimates under simple random sampling. J. Amer. Statist. Assoc. 82,
826–831.
ROYALL, R.M. & CUMBERLAND, W.G. (1981). An empirical study of the ratio estimator and estimator of its
variance. J. Amer. Statist. Assoc. 76, 66–88.
ROYALL, R.M. & CUMBERLAND, W.G. (1985). Conditional coverage properties of finite population confidence
intervals. J. Amer. Statist. Assoc. 80, 830–840.
SÄRNDAL, C.E. (1996). Efficient estimators with simple variance in unequal probability sampling. J. Amer.
Statist. Assoc. 91, 1289–1300.
SÄRNDAL, C.E., SWENSSON, B. & WRETMAN, J.H. (1992). Model Assisted Survey Sampling. New York:
Springer-Verlag.
SCOTT, A.J. & WU, C.F.J. (1981). On the asymptotic distribution of ratio and regression estimators. J. Amer.
Statist. Assoc. 76, 98–112.
SEN, P.K. (1988). Asymptotics in finite population sampling. In Handbook of Statistics 6: Sampling, pp. 291–332.
Amsterdam: North-Holland.
TILLÉ, Y. (1995). Auxiliary information and conditional inference. Bull. Internat. Statist. Inst., Proc. 50th Session. Vol. 1. 303–319.
WOLTER, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
WRIGHT, R.L. (1983). Finite population sampling with multivariate auxiliary information. J. Amer. Statist.
Assoc. 78, 879–884.