Aust. N. Z. J. Stat. 42(4), 2000, 407–421

CONDITIONING ON AUXILIARY VARIABLE MEANS IN FINITE POPULATION INFERENCE

GIORGIO E. MONTANARI¹

University of Perugia

Summary

This paper reviews conditional properties of the mean and total estimators of a finite population when auxiliary information is available. An exact design-based conditional analysis for complex sampling designs is intractable, but an asymptotic conditional framework can be developed. Within such a framework the paper establishes sufficient conditions for conditional unbiasedness and explores conditional properties of various types of regression estimators. A sample statistic capable of indicating the presence of substantial conditional biases is proposed, and illustrated by a simulation study.

Key words: optimal estimator; regression estimation; survey sampling.

1. Introduction

In sampling theory, the conditional properties of estimators have been brought to the attention of researchers, especially because of work on the predictive approach to finite population inference, championed for example by Royall & Cumberland (1981, 1985). Holt & Smith (1979) first considered the issue of conditional inference. Rao (1985) attempted to perform an exact design-based conditional inference, concluding that for complex sampling designs the problem is intractable. Robinson (1987) proposed an asymptotic conditional approach to inference for the ratio estimator in simple random sampling. Casady & Valliant (1993) applied the Robinson approach to post-stratified estimators. Tillé (1995) proposed a class of conditionally weighted Horvitz–Thompson estimators, which contains known estimators as well as complex estimators that are not really operational.

In this paper we develop an asymptotic theory that allows us to establish sufficient conditions for conditional unbiasedness within a large class of estimators of the population mean.
Then, we propose a sample statistic capable of revealing when an estimator is substantially conditionally biased, and we discuss estimation strategies that yield conditionally valid inferences. A simulation study provides empirical evidence in support of the theory.

Received September 1998; revised October 1999; accepted November 1999.
¹ Dept of Statistical Sciences, University of Perugia, V.A. Pascoli, 06100 Perugia, Italy. e-mail: [email protected]
Acknowledgments. This work was supported by a research grant from the University of Perugia, Italy. The author thanks the referees for their careful reading and constructive comments.
© Australian Statistical Publishing Association Inc. 2000. Published by Blackwell Publishers Ltd, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden MA 02148, USA

2. The problem

Consider a finite population $U$ consisting of $N$ units labelled $i = 1, 2, \ldots, N$. Let $Y_i$ be the value of a survey variable $Y$ in the $i$th unit. Suppose that the population mean $\eta = \sum_{i=1}^{N} Y_i / N$ has to be estimated by means of a sample $s$, drawn from $U$ according to a probabilistic sampling design. For this purpose, assume that the population mean $\xi = \sum_{i=1}^{N} x_i / N$ of a $q$-dimensional auxiliary variable column vector $x_i = (X_{1i}, X_{2i}, \ldots, X_{qi})^T$ is available, for example from administrative registers or census data. In this paper, we suppose that $x_i$ is known for the units in the sample but not for the whole population, and that, for consistency with external sources, the estimator of $\eta$ must take the known values contained in $\xi$ when applied to the auxiliary variables. So, in our perspective, the vector $\xi$ and the sampling design are taken as given. We do not discuss the choice of subsets of elements of $\xi$ for estimation purposes, nor the design of the sampling scheme.

Let $\hat\eta$ and $\hat\xi$ denote estimators of $\eta$ and $\xi$, respectively.
In the traditional design-based approach to inference, their statistical properties are obtained by averaging over the set of all samples $s$ that can be selected. Only recently have attempts been made to take into account the sampling distribution of $\hat\eta$ given the realized value of $\hat\xi$. Exact results concerning the conditional distribution of $\hat\eta$ given $\hat\xi$ are quite difficult to achieve (see Tillé, 1995), but useful results can be obtained using the joint asymptotic distribution of $\hat\eta$ and $\hat\xi$.

3. A sufficient condition for conditional unbiasedness

Most estimators of descriptive population parameters that are design-unbiased, or approximately design-unbiased, approach normality under wide conditions. Scott & Wu (1981) have established well-known results about the asymptotic normality of expansion, ratio and regression estimators for simple random sampling. Such results can be extended to stratified random sampling, when sample sizes within strata approach infinity. Isaki & Fuller (1982), Francisco & Fuller (1991) and others (see Sen, 1988) have proved that Horvitz–Thompson estimators are asymptotically normal in certain unequal-probability sampling designs. Krewski & Rao (1981) and Rao & Wu (1985) proved asymptotic normality of the estimators of means and totals when the number of strata approaches infinity in multistage sampling designs with two primary units per stratum.

Motivated by those results, and assuming mild conditions on the sampling design and on the population moments of $Y$, we suppose that the vector $[\hat\eta \;\; \hat\xi^T]^T$ has a joint normal sampling distribution with mean vector $[\eta \;\; \xi^T]^T$ and variance matrix

$$V\begin{pmatrix} \hat\eta \\ \hat\xi \end{pmatrix} = \begin{pmatrix} \mathrm{var}(\hat\eta) & c(\hat\eta, \hat\xi)^T \\ c(\hat\eta, \hat\xi) & V(\hat\xi) \end{pmatrix},$$

where $\mathrm{var}(\hat\eta)$ is the variance of $\hat\eta$, $V(\hat\xi)$ is the variance matrix of $\hat\xi$, and $c(\hat\eta, \hat\xi)$ is the covariance vector between $\hat\eta$ and $\hat\xi$.
Assuming $V(\hat\xi)$ is non-singular (in particular, $\hat\xi$ must not contain any element with a null sampling variance) and applying standard results for the multivariate normal distribution (Jobson, 1992 Section 7.2.2), it follows that the distribution of $\hat\eta$ given $\hat\xi$ is still normal, with expected value

$$E(\hat\eta \mid \hat\xi) = \eta + B^T(\hat\xi - \xi)$$

and variance

$$\mathrm{var}(\hat\eta \mid \hat\xi) = \mathrm{var}(\hat\eta) - B^T V(\hat\xi) B,$$

where $B = V(\hat\xi)^{-1} c(\hat\eta, \hat\xi)$. Hence, the estimator $\hat\eta$ is conditionally biased, unless $c(\hat\eta, \hat\xi) = 0$ or $\hat\xi = \xi$. In the latter case we say that the sample is balanced with respect to the auxiliary variables. The conditional mean squared error (MSE) of $\hat\eta$ can be written

$$\mathrm{MSE}(\hat\eta \mid \hat\xi) = \mathrm{var}(\hat\eta) + B^T\left[(\hat\xi - \xi)(\hat\xi - \xi)^T - V(\hat\xi)\right]B,$$

and the unconditional variance of $\hat\eta$ can be decomposed as follows

$$\mathrm{var}(\hat\eta) = B^T V(\hat\xi) B + E[\mathrm{var}(\hat\eta \mid \hat\xi)],$$

where the first term on the right-hand side, $B^T V(\hat\xi) B$, is the contribution to the unconditional variance of $\hat\eta$ due to its conditional bias.

Since $\xi$ is known, a conditionally unbiased estimator can be obtained from $\hat\eta$ by removing its conditional bias. In fact, if $B$ were known, the estimator

$$\tilde\eta_O = \hat\eta + B^T(\xi - \hat\xi)$$

would be normally distributed as well, and conditionally and unconditionally unbiased, and would have an unconditional null covariance vector with $\hat\xi$. Furthermore, because of normality, $\tilde\eta_O$ would be the minimum variance estimator of $\eta$ among all unbiased estimators that use the same amount of auxiliary information, with a conditional and unconditional variance given by

$$\mathrm{var}(\tilde\eta_O) = \mathrm{var}(\hat\eta) - B^T V(\hat\xi) B.$$

So conditional unbiasedness is a requirement for a minimum variance estimator of $\eta$. We call $\tilde\eta_O$ the optimal estimator (OPE).

In practice, whenever the distribution of the mean or total estimator of a survey variable is approximately normal, the above results can be applied.
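The formulas of this section can be illustrated numerically in the scalar case q = 1. The following is a minimal sketch, not part of the original paper: the function names and the numerical inputs are hypothetical, and the variances and covariance are simply assumed known.

```python
def conditional_summary(var_eta, var_xi, cov, xi_hat, xi):
    """Scalar (q = 1) conditional moments of eta_hat given xi_hat.

    All inputs are assumed known here; in practice they would be estimated.
    """
    b = cov / var_xi                      # B = V(xi_hat)^{-1} c(eta_hat, xi_hat)
    bias = b * (xi_hat - xi)              # E(eta_hat | xi_hat) - eta
    cond_var = var_eta - b * var_xi * b   # var(eta_hat | xi_hat)
    # MSE(eta_hat | xi_hat) = var(eta_hat) + B [ (xi_hat - xi)^2 - V(xi_hat) ] B
    cond_mse = var_eta + b * ((xi_hat - xi) ** 2 - var_xi) * b
    return {"B": b, "bias": bias, "cond_var": cond_var, "cond_mse": cond_mse}


def optimal_estimate(eta_hat, xi_hat, xi, b):
    """OPE: remove the conditional bias using the known population mean xi."""
    return eta_hat + b * (xi - xi_hat)


# Hypothetical illustrative values, not taken from the paper.
summary = conditional_summary(var_eta=4.0, var_xi=1.0, cov=1.5, xi_hat=10.5, xi=10.0)
ope = optimal_estimate(eta_hat=52.0, xi_hat=10.5, xi=10.0, b=summary["B"])
print(summary, ope)
```

In this sketch the conditional MSE equals the conditional variance plus the squared conditional bias, which is a quick consistency check on the two displayed formulas.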
As a consequence, the efficient use of any known population mean requires that an unbiased or approximately unbiased estimator of $\eta$ be chosen, and that it be uncorrelated with the estimators of the available auxiliary variable population means.

Note that no superpopulation model has been used. The theory relies on the approximate joint normal distribution of estimators. Instead of modelling relationships between population variables, as is done when superpopulations are considered, we model relationships between estimators.

In the sequel, we assume that nonlinear estimators of population means converge in probability to their first order Taylor linear approximations when the sample size and the population size approach infinity. Furthermore, to apply results of this section, we also assume that those linear approximations approach normality. We use the term ‘asymptotic’ to describe any property that depends upon this convergence.

4. The empirical optimal estimator

The estimator $\tilde\eta_O$ cannot be computed when $B$ is unknown, as it usually is. However, in most cases, $B$ can be estimated by $\hat B = \hat V(\hat\xi)^{-1} \hat c(\hat\eta, \hat\xi)$, where $\hat V(\hat\xi)$ and $\hat c(\hat\eta, \hat\xi)$ are design-consistent estimators of $V(\hat\xi)$ and $c(\hat\eta, \hat\xi)$, respectively. For example, we might use the Taylor linearization method (Wolter, 1985 Chapter 6) to obtain estimators of the variances and covariances contained in $V(\hat\xi)$ and $c(\hat\eta, \hat\xi)$. Then, replacing $B$ by $\hat B$ in $\tilde\eta_O$, we get the estimator

$$\hat\eta_O = \hat\eta + \hat B^T(\xi - \hat\xi).$$

In large samples, $\hat\eta_O$ shares the properties of $\tilde\eta_O$, because $\tilde\eta_O$ is the first order Taylor linear approximation of $\hat\eta_O$ (Montanari, 1987). We call $\hat\eta_O$ the empirical optimal estimator (EOE). The main properties of $\hat\eta_O$ are:
(i) it is asymptotically design-unbiased with asymptotic variance $\mathrm{var}(\tilde\eta_O)$;
(ii) it is asymptotically conditionally design-unbiased;
(iii) the means of the auxiliary variables estimated through this estimator equal the corresponding population means, i.e. $\hat\xi_O = \xi$, for all $s$.
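As an illustration of the EOE, consider the simplest setting: simple random sampling without replacement with a single auxiliary variable, so that the base estimators are sample means. This setting is an assumption made for the sketch, and the function name and data are hypothetical. Under it the finite population correction and the factor 1/n cancel in the ratio of estimated covariance to estimated variance, so the estimated coefficient reduces to the sample regression slope.

```python
from statistics import mean


def eoe_srs(y, x, xi):
    """Empirical optimal estimate of the mean of Y under simple random sampling.

    y, x : sample values of the survey and auxiliary variables
    xi   : known population mean of the auxiliary variable
    """
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - x_bar) ** 2 for a in x) / (n - 1)
    b_hat = s_xy / s_xx                  # estimated B: (1 - f)/n cancels out
    return y_bar + b_hat * (xi - x_bar)


# Hypothetical sample, not taken from the paper.
x_sample = [1.0, 2.0, 4.0, 5.0]
y_sample = [3.1, 4.9, 9.2, 10.8]
print(eoe_srs(y_sample, x_sample, xi=3.5))

# Property (iii): applied to the auxiliary variable itself, the estimator
# returns the known population mean.
print(eoe_srs(x_sample, x_sample, xi=3.5))
```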
Properties of the OPE and EOE have been studied when $\hat\eta$ and $\hat\xi$ are Horvitz–Thompson estimators, i.e. $\hat\eta_{HT} = \sum_{i \in s} Y_i/(N\pi_i)$ and $\hat\xi_{HT} = \sum_{i \in s} x_i/(N\pi_i)$, where $\pi_i$ is the first order inclusion probability of the sampling design. Montanari (1987) obtained the EOE from the unbiased difference estimator $\hat\eta_D = \hat\eta_{HT} + \vartheta^T(\xi - \hat\xi_{HT})$, minimizing its variance with respect to $\vartheta$. Rao (1994) pointed out that the EOE gives valid conditional inference. Montanari (1998) noted that when there is more than one survey variable, the EOE can be expressed as a weighted sum of observations with the same weights applying to all variables of interest.

5. The generalized regression estimator

Most estimators of a finite population mean or total belong to the class of generalized regression estimators (GREs). This class is described in Särndal, Swensson & Wretman (1992 Chapter 6) and in Estevao, Hidiroglou & Särndal (1995). A GRE is defined as

$$\hat\eta_G = \hat\eta_{HT} + \hat\beta^T(\xi - \hat\xi_{HT}), \quad \text{where} \quad \hat\beta = \left(\sum_{i \in s} \frac{x_i x_i^T}{\nu_i \pi_i}\right)^{-1} \sum_{i \in s} \frac{x_i Y_i}{\nu_i \pi_i}.$$

The estimator $\hat\beta$ is a consistent estimator of

$$\hat\beta_N = \left(\sum_{i=1}^{N} \frac{x_i x_i^T}{\nu_i}\right)^{-1} \sum_{i=1}^{N} \frac{x_i Y_i}{\nu_i},$$

i.e. the census-weighted least squares regression estimator of $\beta$ assuming the superpopulation linear regression model $E_m(Y_i) = x_i^T\beta$, $\mathrm{var}_m(Y_i) = \nu_i\sigma^2$ and $\mathrm{cov}_m(Y_i, Y_j) = 0$, for all $i \neq j = 1, \ldots, N$. Note that $E_m$, $\mathrm{var}_m$ and $\mathrm{cov}_m$ denote expected value, variance, and covariance under the model, respectively. In the sequel, we call the model upon which the GRE is based the ‘working model’.

The large sample properties of $\hat\eta_G$ can be established by means of its first order Taylor linear approximation $\tilde\eta_G = \hat\eta_{HT} + \hat\beta_N^T(\xi - \hat\xi_{HT})$ (Särndal et al., 1992 p. 235). The main properties are:
(i) it is asymptotically design-unbiased with asymptotic variance $\mathrm{var}(\tilde\eta_G)$;
(ii) the means of the auxiliary variables estimated through the GRE equal the corresponding population means, i.e.
$\hat\xi_G = \xi$;
(iii) it has the minimum expected asymptotic design variance with respect to the model, i.e. for any other design-unbiased estimator $\hat\eta^*$ of $\eta$, $E_m(\mathrm{var}(\tilde\eta_G)) \le E_m(\mathrm{var}(\hat\eta^*))$, for all $\beta$ (Wright, 1983).

However, $\hat\eta_G$ is conditionally biased when it has a non-null covariance vector with $\hat\xi_{HT}$. In fact, using results of Section 3, it follows that, asymptotically,

$$E(\hat\eta_G \mid \hat\xi_{HT}) = \eta + B^T(\hat\xi_{HT} - \xi),$$

$$\mathrm{var}(\hat\eta_G \mid \hat\xi_{HT}) = \mathrm{var}(\hat\eta_G) - B^T V(\hat\xi_{HT}) B,$$

$$\mathrm{MSE}(\hat\eta_G \mid \hat\xi_{HT}) = \mathrm{var}(\hat\eta_G) + B^T\left[(\hat\xi_{HT} - \xi)(\hat\xi_{HT} - \xi)^T - V(\hat\xi_{HT})\right]B, \qquad (1)$$

where $B = V(\hat\xi_{HT})^{-1} c(\hat\eta_G, \hat\xi_{HT})$. Then the corresponding EOE, $\hat\eta_O = \hat\eta_G + \hat B^T(\xi - \hat\xi_{HT})$, is asymptotically conditionally unbiased with unconditional variance $\mathrm{var}(\hat\eta_O) = \mathrm{var}(\hat\eta_G) - B^T V(\hat\xi_{HT}) B$. It can be proved (Montanari, 1998) that $B = B_{HT} - \hat\beta_N$, where $B_{HT} = V(\hat\xi_{HT})^{-1} c(\hat\eta_{HT}, \hat\xi_{HT})$. Hence, $\hat\eta_O = \hat\eta_{HT} + \hat B_{HT}^T(\xi - \hat\xi_{HT})$ and, asymptotically,

$$\mathrm{var}(\hat\eta_G) - \mathrm{var}(\hat\eta_O) = (B_{HT} - \hat\beta_N)^T V(\hat\xi_{HT})(B_{HT} - \hat\beta_N) \ge 0. \qquad (2)$$

Montanari (1998) proved that, when the working model is correctly specified, asymptotically $B_{HT} = \hat\beta_N$. On the other hand, when the model is wrongly specified, the right-hand side of (2) can take non-negligible values. This event is relatively common, because the specification of the model is limited by the availability of the population means of auxiliary variables. The quantity

$$\lambda(\hat\eta_G, \hat\eta_O) = \frac{(B_{HT} - \hat\beta_N)^T V(\hat\xi_{HT})(B_{HT} - \hat\beta_N)}{\mathrm{var}(\hat\eta_G)}$$

gives the asymptotic relative gain in efficiency of $\hat\eta_O$ over $\hat\eta_G$, i.e. the share of the unconditional variance of $\hat\eta_G$ due to its conditional bias. We propose using $\lambda(\hat\eta_G, \hat\eta_O)$ as an index of the inadequacy of the superpopulation linear model for extracting all information from the realized value of $\hat\xi_{HT}$.
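To give a feel for the index, the sketch below evaluates it at population level in one special case chosen for illustration: simple random sampling with the ratio estimator playing the role of the GRE (working model through the origin, variance proportional to X). In that case the limiting GRE slope is R = η/ξ, the optimal coefficient is the population regression slope Q, and the common factor (1 − f)/n cancels, so the index reduces to (Q − R)² S_X² / S_E² with E = Y − RX. The function name and the toy population are hypothetical.

```python
from statistics import mean


def lambda_ratio_srs(y, x):
    """Index lambda for the ratio estimator under simple random sampling.

    y, x : full population values (a toy population, for illustration only).
    """
    size = len(y)
    y_bar, x_bar = mean(y), mean(x)
    s_xx = sum((a - x_bar) ** 2 for a in x) / (size - 1)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (size - 1)
    q = s_xy / s_xx                      # optimal coefficient (B_HT)
    r = y_bar / x_bar                    # limiting slope of the ratio estimator
    e = [b - r * a for a, b in zip(x, y)]
    e_bar = mean(e)
    s_ee = sum((v - e_bar) ** 2 for v in e) / (size - 1)
    return (q - r) ** 2 * s_xx / s_ee


# Y is exactly linear in X with a non-zero intercept: the optimal estimator
# has zero asymptotic variance, so all the variance of the ratio estimator
# is attributable to its conditional bias.
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pop = [10.0 + v for v in x_pop]
print(lambda_ratio_srs(y_pop, x_pop))
```

A value near zero would instead indicate that the working model already extracts essentially all the information carried by the auxiliary mean.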
Values of $\lambda(\hat\eta_G, \hat\eta_O)$ significantly greater than zero suggest that shifting from $\hat\eta_G$ to $\hat\eta_O$ is profitable or, alternatively, that the conditional MSE (1) of $\hat\eta_G$ has to be estimated to achieve valid conditional inferences. Given $\hat\xi_{HT}$, the quantity

$$\gamma(\hat\eta_G \mid \hat\xi_{HT}) = \frac{(B_{HT} - \hat\beta_N)^T(\hat\xi_{HT} - \xi)}{\sqrt{\mathrm{var}(\hat\eta_G)}}$$

gives the ratio between the asymptotic conditional bias and the unconditional standard error. Note that, unconditionally, $E(\gamma(\hat\eta_G \mid \hat\xi_{HT})) = 0$ and $\mathrm{var}(\gamma(\hat\eta_G \mid \hat\xi_{HT})) = \lambda(\hat\eta_G, \hat\eta_O)$. Both quantities $\lambda(\hat\eta_G, \hat\eta_O)$ and $\gamma(\hat\eta_G \mid \hat\xi_{HT})$ are estimable, and their estimated values give useful information for improving the efficiency of the estimation process in subsequent similar surveys, or even in the current survey, as we show in Section 8.

In survey practice, when $B_{HT}$ and $\hat\beta_N$ are to be estimated, $\hat\eta_G$ is usually preferred to $\hat\eta_O$ because $\hat\eta_O$ is more complex to compute and is generally less stable in finite size samples, and its unconditional variance can be greater than that of $\hat\eta_G$ (Casady & Valliant, 1993). However, if an adequate number of degrees of freedom $g$ is available for estimating $B_{HT}$, the problem can be overcome. For example, for a standard complex sampling design with replacement sampling at the first stage, $g$ can be roughly taken as the number of sample clusters minus the number of strata (Lehtonen & Pahkinen, 1995 p. 181; for more elaboration on this topic see Eltinge & Jang, 1996). A stable $\hat B_{HT}$ can be expected when $g$ is large enough relative to the dimension $q$ of the auxiliary variable $x_i$.

6. A new outlook on the empirical optimal estimator

In this section, we explore connections between GREs and EOEs. To this end, we enlarge the GRE class to non-diagonal variance matrices of the working model. Assemble the values of $Y_i$ and $x_i$ into an $N$-vector $y$ and an $N \times q$ matrix $X$, with $x_i^T$ its $i$th row, and consider the working model $E_m(y) = X\beta$ and $\mathrm{var}_m(y) = \sigma^2\Sigma$, where $\Sigma$ can be any positive definite symmetric $N \times N$ matrix.
Denote by $y_s$ and $X_s$ the sub-matrices of $y$ and $X$ whose rows correspond to the units in the sample $s$. Let $Q_{ss}$ be the symmetric matrix that has $\nu^{ij}/\pi_{ij}$ as $ij$th entry, where $i, j \in s$, $\nu^{ij}$ is the $ij$th entry of $\Sigma^{-1}$, and $\pi_{ij}$ is the second order inclusion probability of the sampling design (with $\pi_{ii} = \pi_i$). Then, provided that the matrix $(X_s^T Q_{ss} X_s)^{-1}$ exists for every sample $s$, the corresponding extended GRE (EGRE) is defined to be

$$\hat\eta_{EG} = \hat\eta_{HT} + \hat\beta^T(\xi - \hat\xi_{HT}), \quad \text{where} \quad \hat\beta = (X_s^T Q_{ss} X_s)^{-1} X_s^T Q_{ss} y_s.$$

Observe that the entries of $X_s^T Q_{ss} X_s$ and $X_s^T Q_{ss} y_s$ are design-unbiased estimators of the corresponding entries of $X^T \Sigma^{-1} X$ and $X^T \Sigma^{-1} y$, respectively, provided that $\nu^{ij} \neq 0$ implies $\pi_{ij} \neq 0$. So, under mild conditions on the second order inclusion probabilities, $\hat\beta$ converges in probability to $\hat\beta_N = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$, i.e. the census-weighted least squares regression estimator of $\beta$, and the first order Taylor linear approximation of $\hat\eta_{EG}$ is

$$\tilde\eta_{EG} = \hat\eta_{HT} + \hat\beta_N^T(\xi - \hat\xi_{HT}).$$

When $\Sigma$ is a diagonal matrix, we get the customary GRE.

Now, denote by $W$ the $N \times N$ matrix whose $ij$th entry is $w_{ij} = (\pi_{ij} - \pi_i\pi_j)/(N^2\pi_i\pi_j)$. The variances of $\hat\eta_{HT}$ and $\hat\xi_{HT}$ and the covariance between $\hat\eta_{HT}$ and $\hat\xi_{HT}$ can now be written $\mathrm{var}(\hat\eta_{HT}) = y^T W y$, $V(\hat\xi_{HT}) = X^T W X$ and $c(\hat\eta_{HT}, \hat\xi_{HT}) = X^T W y$, respectively. Then, denoting by $W_{ss}$ the symmetric matrix whose $ij$th entry is $w_{ij}/\pi_{ij}$, where $i, j \in s$, the EOE can be written

$$\hat\eta_O = \hat\eta_{HT} + \hat B_{HT}^T(\xi - \hat\xi_{HT}), \quad \text{where} \quad \hat B_{HT} = (X_s^T W_{ss} X_s)^{-1} X_s^T W_{ss} y_s.$$

So, when $W$ is non-singular, the EOE belongs to the EGRE class defined above, setting $\Sigma = W^{-1}$. Generally, however, $W$ is only non-negative definite, being singular for many sampling designs. Even in such a case there are EGREs that are asymptotically equal to the OPE, as we show next.
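The quadratic-form expression for the variance of the Horvitz–Thompson estimator can be checked numerically in a small case. The sketch below is illustrative only: it assumes simple random sampling without replacement, for which the inclusion probabilities are n/N and n(n − 1)/(N(N − 1)), and compares the quadratic form with the textbook variance (1 − f)S²/n of the sample mean. Function names and data are hypothetical.

```python
def srs_w_matrix(pop_size, n):
    """Matrix W with entries (pi_ij - pi_i pi_j) / (N^2 pi_i pi_j) under SRS."""
    pi_first = n / pop_size
    pi_second = n * (n - 1) / (pop_size * (pop_size - 1))
    w = [[0.0] * pop_size for _ in range(pop_size)]
    for i in range(pop_size):
        for j in range(pop_size):
            pi_ij = pi_first if i == j else pi_second   # pi_ii = pi_i
            w[i][j] = (pi_ij - pi_first ** 2) / (pop_size ** 2 * pi_first ** 2)
    return w


def quad_form(w, y):
    """Compute y' W y."""
    size = len(y)
    return sum(y[i] * w[i][j] * y[j] for i in range(size) for j in range(size))


def srs_variance_of_mean(y, n):
    """Textbook variance of the sample mean: (1 - f) S^2 / n."""
    size = len(y)
    y_bar = sum(y) / size
    s2 = sum((v - y_bar) ** 2 for v in y) / (size - 1)
    return (1 - n / size) * s2 / n


y_pop = [1.0, 2.0, 3.0, 4.0, 5.0]
print(quad_form(srs_w_matrix(5, 2), y_pop), srs_variance_of_mean(y_pop, 2))
```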
We give the name ‘design balanced variable’ (DBV) to any non-null auxiliary variable whose mean is estimated without error by the Horvitz–Thompson estimator, that is $\sum_{i \in s} Z_i/(N\pi_i) = \sum_{i=1}^{N} Z_i/N$, for all possible $s$ under the sampling design. Then, assembling the population values $Z_i$ into the $N$-vector $z$, we have the following theorem.

Theorem 1. An auxiliary variable is design-balanced if and only if the vector $z$ belongs to the subspace orthogonal to that spanned by the columns of $W$.

Proof. If $z$ is a DBV, then $a^T W z = 0$ for any vector $a$ (the covariance between the unbiased estimators of a DBV mean and any other variable mean is identically zero). Hence, $Wz = 0$. On the other hand, if $Wz = 0$ holds, then $z^T W z = 0$, so the Horvitz–Thompson estimator of the mean of $z$ has zero variance, i.e. $z$ is a DBV.

From Theorem 1, it follows that the subspace spanned by the DBVs has dimension $N - \mathrm{rank}(W)$, where $\mathrm{rank}(\cdot)$ denotes the rank of a matrix. So, when $\mathrm{rank}(W) = N$, there is no DBV; when $\mathrm{rank}(W) = N - t$, where $t > 0$, there are $t$ linearly independent DBVs. Now, let $Z$ be an $N \times t$ matrix containing $t$ linearly independent DBVs and assume that $X$ does not contain any DBV. Then, we have the following theorem.

Theorem 2. Assume the model $E_m(y) = [Z \; X]\beta$ and $\mathrm{var}_m(y) = \sigma^2\Sigma$. If $\Sigma$ is such that there exists a scalar $\alpha$ so that $\Sigma^{-1} - \Sigma^{-1}Z(Z^T\Sigma^{-1}Z)^{-1}Z^T\Sigma^{-1} = \alpha W$, then $\tilde\eta_{EG} = \tilde\eta_O$, i.e. the EGRE based on the above working model is asymptotically equivalent to the OPE based on the same auxiliary information $\xi$.

Proof. To prove the result it is sufficient to write $[Z \; X]\beta = Z\beta_z + X\beta_x$ and to note that the census-weighted least squares estimator of $\beta_x$ can be written $\hat\beta_{xN} = (X^T A X)^{-1} X^T A y$, where $A = \Sigma^{-1} - \Sigma^{-1}Z(Z^T\Sigma^{-1}Z)^{-1}Z^T\Sigma^{-1}$. Since $\mathrm{rank}(\Sigma) = N$, by matrix algebra we have $\mathrm{rank}(A) = N - \mathrm{rank}(Z) = N - t$. So, if there exists a scalar $\alpha$ such that $\alpha W = A$, then $\hat\beta_{xN} = B_{HT}$. Hence, since $\tilde\eta_{EG} - \tilde\eta_O = (\hat\beta_{xN} - B_{HT})^T(\xi - \hat\xi_{HT})$, it follows that $\tilde\eta_{EG} = \tilde\eta_O$.
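Theorem 1 can be checked by brute force on a tiny design. The sketch below is illustrative and not from the paper: it enumerates all samples of a simple random sampling design with N = 4 and n = 2 (an assumed design), derives the inclusion probabilities and W from the enumeration, and then compares the constant variable, which is a DBV under this design, with a non-constant one. All names and data are hypothetical.

```python
from itertools import combinations

POP_SIZE, N_SAMPLE = 4, 2
samples = list(combinations(range(POP_SIZE), N_SAMPLE))  # all equally likely
prob = 1 / len(samples)

# First and second order inclusion probabilities obtained by enumeration.
pi = [sum(prob for s in samples if i in s) for i in range(POP_SIZE)]
pi2 = [[sum(prob for s in samples if i in s and j in s) for j in range(POP_SIZE)]
       for i in range(POP_SIZE)]

# W has entries (pi_ij - pi_i pi_j) / (N^2 pi_i pi_j), with pi_ii = pi_i.
w = [[(pi2[i][j] - pi[i] * pi[j]) / (POP_SIZE ** 2 * pi[i] * pi[j])
      for j in range(POP_SIZE)] for i in range(POP_SIZE)]


def w_times(z):
    """Return the vector W z."""
    return [sum(w[i][j] * z[j] for j in range(POP_SIZE)) for i in range(POP_SIZE)]


def ht_means(z):
    """Horvitz-Thompson estimate of the mean of z for every possible sample."""
    return [sum(z[i] / (POP_SIZE * pi[i]) for i in s) for s in samples]


ones = [1.0] * POP_SIZE          # candidate DBV: the constant variable
other = [1.0, 2.0, 3.0, 4.0]     # a variable that is not design-balanced

print(w_times(ones), ht_means(ones))    # W z is (numerically) zero; estimates constant
print(w_times(other), ht_means(other))  # W z is non-zero; estimates vary with the sample
```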
Theorem 2 gives a sufficient condition for asymptotic equivalence between an EGRE and the OPE that uses the same amount of auxiliary information $\xi$. In fact, besides auxiliary variables that are not DBVs, the working model should include a number $N - \mathrm{rank}(W)$ of linearly independent DBVs, and the variance matrix $\sigma^2\Sigma$ should be chosen so that $A$ is a matrix proportional to $W$, i.e. $A \propto W$. In this manner we get an asymptotically minimum variance and conditionally unbiased estimator that allows valid conditional inferences, given the amount of auxiliary information $\xi$ and irrespective of the goodness of the working model. The next theorem ensures the finite size sample identity between an EGRE and the EOE.

Theorem 3. If the expression $\Sigma^{-1} - \Sigma^{-1}Z(Z^T\Sigma^{-1}Z)^{-1}Z^T\Sigma^{-1} \propto W$ implies the expression $Q_{ss} - Q_{ss}Z_s(Z_s^T Q_{ss} Z_s)^{-1}Z_s^T Q_{ss} \propto W_{ss}$, then $\hat\eta_{EG} = \hat\eta_O$.

Proof. It is sufficient to write $[Z_s \; X_s]\beta = Z_s\beta_z + X_s\beta_x$ and to note that the sample-weighted least squares estimator of $\beta_x$ with respect to the weight matrix $Q_{ss}$ can be written $\hat\beta_x = (X_s^T A_{ss} X_s)^{-1} X_s^T A_{ss} y_s$, where $A_{ss} = Q_{ss} - Q_{ss}Z_s(Z_s^T Q_{ss} Z_s)^{-1}Z_s^T Q_{ss}$. If $A_{ss} \propto W_{ss}$ then $\hat\beta_x = \hat B_{HT}$. Hence, since $\hat\eta_{EG} - \hat\eta_O = (\hat\beta_x - \hat B_{HT})^T(\xi - \hat\xi_{HT})$, the result follows.

Generally speaking, the OPE is approximately equal to the EGRE based on the model that includes the effect of any existing DBVs and assumes a matrix $\Sigma$ that depends on the structure of the sampling design. Unfortunately, Theorem 2 does not provide guidelines for identifying the proper $\Sigma$. In the next section, we examine a number of case studies in which a suitable matrix $\Sigma^{-1}$ can be singled out from the structure of $W$.

6.1. Examples of asymptotic equivalencies between EGREs and OPEs

The structure of $W$ is the starting point for deriving the model under which an EGRE is asymptotically equivalent to the OPE.
As we mentioned above, when $\mathrm{rank}(W) = N$ there are no DBVs. So, given $\xi$, setting $\Sigma \propto W^{-1}$ the EGRE is asymptotically equal to the OPE. Furthermore, because of Theorem 3, the EGRE is the EOE as well. As an example, consider Poisson sampling with size measure $a_i$ ($i = 1, 2, \ldots, N$). Let $\pi_i = na_i/A$ be the inclusion probability of the $i$th unit, where $n$ is the expected sample size and $A = \sum_{i=1}^{N} a_i$. In this case, $W = N^{-2}\,\mathrm{diag}(\pi_i^{-1} - 1)$. Setting $\Sigma = \mathrm{diag}(a_i/(A - na_i))$, the corresponding GRE is the EOE as well. This form of the variance matrix was also derived by Särndal (1996) by minimizing the asymptotic variance of a GRE for Poisson sampling.

Next we give examples where DBVs exist. As a rule of thumb, potential DBVs are those proportional to the first order inclusion probabilities within subpopulations from which fixed size samples are selected.

Example 1. Consider a simple random sampling of $n$ units. In this case $\mathrm{rank}(W) = N - 1$ and any vector proportional to the unit vector $1$ belongs to the orthogonal subspace of $W$. Set $Z = 1$ and $\Sigma = I$, where $I$ is the identity matrix; then $A \propto W$. Furthermore, because of Theorem 3, the GRE based on a homoscedastic linear regression model with an intercept term is the EOE as well.

Now, to apply the theory of Section 5, suppose that the population mean $\xi$ of an auxiliary variable $X$ is available, and that we are willing to use the ratio estimator $\hat\eta_R = \xi\bar y/\bar x$, where $\bar y$ and $\bar x$ are sample means, i.e. the Horvitz–Thompson estimators of $\eta$ and $\xi$. That estimator is the GRE based on the simple linear regression model through the origin $E_m(Y_i) = X_i\beta$, $\mathrm{var}_m(Y_i) = \sigma^2 X_i$ and $\mathrm{cov}_m(Y_i, Y_j) = 0$. Its linear approximation is $\tilde\eta_R = \bar y + R(\xi - \bar x)$, where $R = \eta/\xi$. The corresponding OPE is $\tilde\eta_O = \bar y + Q(\xi - \bar x)$, where $Q = S_{XY}/S_X^2$ is the population regression coefficient of $Y$ on $X$; $S_{XY}$ is the population covariance between $Y$ and $X$, and $S_X^2$ is the variance of $X$. So $\hat\eta_O = \bar y + \hat Q(\xi - \bar x)$, where $\hat Q$ is the sample regression coefficient of $Y$ on $X$. Using the linear approximations of $\hat\eta_R$ and $\hat\eta_O$, straightforward calculations give the asymptotic results

$$\gamma(\hat\eta_R \mid \bar x) = \frac{\sqrt{n}\,(Q - R)(\bar x - \xi)}{\sqrt{(1 - f)S_E^2}}, \qquad \lambda(\hat\eta_R, \hat\eta_O) = \frac{(Q - R)^2 S_X^2}{S_E^2},$$

$$\mathrm{MSE}(\hat\eta_R \mid \bar x) = \frac{1 - f}{n}S_E^2 + (Q - R)^2\left[(\bar x - \xi)^2 - \frac{1 - f}{n}S_X^2\right], \qquad (3)$$

where $f = n/N$ is the sampling fraction and $S_E^2$ is the population variance of the variable $E = Y - RX$. Note that (3) can be rewritten

$$\mathrm{MSE}(\hat\eta_R \mid \bar x) = \frac{1 - f}{n}\left(S_Y^2 - Q^2 S_X^2\right) + (Q - R)^2(\bar x - \xi)^2,$$

and the latter is an asymptotic version of a result in Robinson (1987). When the working model upon which $\hat\eta_R$ is based holds true, for large populations $Q \approx R$ and negligible gains in efficiency can be obtained from the OPE. If the model is wrongly specified, the relative conditional bias $\gamma(\hat\eta_R \mid \bar x)$ can be substantial, as can the gain in efficiency of $\hat\eta_O$ over $\hat\eta_R$, measured by $\lambda(\hat\eta_R, \hat\eta_O)$. This can occur when the finite population intercept in a linear regression is far from zero.

Example 2. Consider stratified random sampling. Let the $i$th unit within the $h$th stratum ($h = 1, \ldots, H$; $i = 1, \ldots, N_h$) be labelled $hi$. Let $n_h$ be the sample size within stratum $h$. In this case, $\mathrm{rank}(W) = N - H$ and the indicator variables of stratum membership of population units are a set of linearly independent DBVs. Define $Z = [z_1 \; z_2 \; \ldots \; z_H]$, where $z_\ell$ is a vector whose entries are $Z_{\ell hi} = 1$ when $h = \ell$, and $Z_{\ell hi} = 0$ otherwise. Then, setting $\Sigma = \mathrm{diag}(\nu_{hi})$, where $\nu_{hi} = n_h(N_h - 1)/[N_h(N_h - n_h)]$, it follows that $A \propto W$. The corresponding GRE is asymptotically equivalent to the OPE when the working model includes the stratum membership indicator variables of population units and the variance matrix is as specified above.
Note that this form of the variance matrix was also worked out by Särndal (1996) by minimizing the asymptotic variance of the GRE based on a model that includes an intercept term for each stratum. Furthermore, when $n_h$ is constant across strata, because of Theorem 3, the GRE is the EOE as well.

Again, to apply results of Section 5, suppose that the population mean, $\xi$, of an auxiliary variable $X$ is known, and that we are willing to adopt the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $\mathrm{var}_m(Y_{hi}) = \sigma^2$, and $\mathrm{cov}_m(Y_{hi}, Y_{h'j}) = 0$, for all $h \neq h'$ or $i \neq j$, i.e. $\hat\eta_G = \hat\eta_{HT} + \hat Q(\xi - \hat\xi_{HT})$, where $\hat Q$ is a consistent estimator of $Q = S_{XY}/S_X^2$. In this example, the EOE is known as the combined regression estimator (Cochran, 1977 Section 7.10) and its linear approximation takes the form $\tilde\eta_O = \hat\eta_{HT} + B(\xi - \hat\xi_{HT})$, where

$$B = \left(\sum_{h=1}^{H} c_h S_{Xh}^2\right)^{-1} \sum_{h=1}^{H} c_h S_{Xh}^2 Q_h, \qquad c_h = (n_h^{-1} - N_h^{-1})N_h^2/N^2, \qquad Q_h = S_{XYh}/S_{Xh}^2,$$

and $S_{XYh}$ and $S_{Xh}^2$ are the stratum covariance between $Y$ and $X$ and the stratum variance of $X$, respectively. Observe that $B$ is a weighted average of stratum regression coefficients. Straightforward calculations based on the linear approximations of $\hat\eta_G$ and $\hat\eta_O$ give the following asymptotic results

$$\gamma(\hat\eta_G \mid \hat\xi_{HT}) = \frac{(B - Q)(\hat\xi_{HT} - \xi)}{\sqrt{\sum_{h=1}^{H} c_h S_{Eh}^2}}, \qquad \lambda(\hat\eta_G, \hat\eta_O) = \frac{(B - Q)^2 \sum_{h=1}^{H} c_h S_{Xh}^2}{\sum_{h=1}^{H} c_h S_{Eh}^2},$$

$$\mathrm{MSE}(\hat\eta_G \mid \hat\xi_{HT}) = \sum_{h=1}^{H} c_h S_{Eh}^2 + (B - Q)^2\left[(\hat\xi_{HT} - \xi)^2 - \sum_{h=1}^{H} c_h S_{Xh}^2\right],$$

where $S_{Eh}^2$ is the stratum population variance of the variable $E = Y - QX$. When the working model upon which $\hat\eta_G$ is based holds true, for large strata $Q_h \approx Q$ for all $h$, and $B \approx Q$. So negligible gains in efficiency can be achieved by the OPE. If the model does not hold true, different stratum regression coefficients can give non-negligible values of $\gamma(\hat\eta_G \mid \hat\xi_{HT})$ and $\lambda(\hat\eta_G, \hat\eta_O)$, i.e. $\hat\eta_G$ is conditionally biased, given $\hat\xi_{HT}$, and asymptotically inefficient.

Example 3.
Consider stratified two-stage random sampling. Let the $j$th elementary unit within the $i$th primary sampling unit (PSU) of the $h$th stratum ($h = 1, \ldots, H$; $i = 1, \ldots, N_h$; $j = 1, \ldots, M_{hi}$) be labelled $hij$. Using simple random sampling without replacement in both stages, $n_h$ PSUs are selected from each stratum and $m_{hi}$ elementary units are drawn from each selected PSU. It can be verified that $\mathrm{rank}(W) = N - H$ and that the matrix $Z = [z_1 \; z_2 \; \ldots \; z_H]$ contains $H$ linearly independent DBVs, where $z_\ell$ is a vector whose entries are $Z_{\ell hij} = M_h/(N_h M_{hi})$ if $h = \ell$, and $Z_{\ell hij} = 0$ otherwise, and $M_h$ is the elementary unit stratum size. Setting $\Sigma = \mathrm{diag}(\Sigma_{hi})$, where $\Sigma_{hi} = [a_{hi}I + b_{hi}11^T]^{-1}$ is an $M_{hi} \times M_{hi}$ matrix with

$$a_{hi} = \frac{N_h}{n_h}\,\frac{M_{hi}}{m_{hi}}\,\frac{M_{hi} - m_{hi}}{M_{hi} - 1} \quad \text{and} \quad b_{hi} = \frac{N_h}{n_h}\,\frac{M_{hi}}{m_{hi}}\,\frac{m_{hi} - 1}{M_{hi} - 1} - \frac{N_h}{n_h}\,\frac{n_h - 1}{N_h - 1},$$

the EGRE is asymptotically equivalent to the OPE. Furthermore, if $n_h$ is constant across strata, because of Theorem 3, the EGRE is the EOE as well. Note that the structure of $\Sigma$ is that of an equal correlation model within PSUs, and that the indicator variables of stratum membership of elementary units are not DBVs, unless $M_{hi}$ is constant within each stratum.

Now we present two examples involving unequal probability sampling, but for simplicity we assume sampling with replacement. In such a case, first and second order inclusion probabilities must be replaced by $\varphi_i = E(\delta_i)$ and $\varphi_{ij} = E(\delta_i\delta_j)$, where $\delta_i$ is the random variable defined as the number of times unit $i$ has been selected in the sample $s$. As a consequence, Horvitz–Thompson estimators have to be replaced by their Hansen–Hurwitz analogues and the matrix $W$ has to be redefined accordingly. Note that results of sampling with replacement are often used to approximate results of sampling without replacement. The following two examples can be useful for approximating OPEs in complex sampling schemes.

Example 4.
Consider a with-replacement unequal-probability sampling design of fixed size $n$ and with selection probabilities $P_i$ ($i = 1, \ldots, N$). The sampling variance of the unbiased (Hansen–Hurwitz) estimator is given by $y^T W y$, where $W = n^{-1}N^{-2}(P^{-1} - 11^T)$ and $P = \mathrm{diag}(P_i)$. In this case, $\mathrm{rank}(W) = N - 1$ and the variable $Z_i = NP_i$ is a DBV. Inserting the latter into the working model and setting $\Sigma = \mathrm{diag}(nP_i)$, the GRE is asymptotically equivalent to the OPE. Furthermore, because of Theorem 3, the GRE is the EOE as well.

Example 5. Consider a stratified two-stage sampling as in Example 3, but now at the first stage $n_h$ PSUs are selected from each stratum $h$ using a with-replacement unequal-probability scheme with selection probabilities $P_{hi}$. In this case, $\mathrm{rank}(W) = N - H$, and the matrix $Z = [z_1 \; z_2 \; \ldots \; z_H]$, where $z_\ell$ is the vector whose entries are $Z_{\ell hij} = P_{hi}M_h/M_{hi}$ if $h = \ell$, and $Z_{\ell hij} = 0$ otherwise, contains $H$ linearly independent DBVs. Inserting the latter into the working model and setting $\Sigma = \mathrm{diag}(\Sigma_{hi})$, where $\Sigma_{hi} = [a_{hi}I + b_{hi}11^T]^{-1}$ is an $M_{hi} \times M_{hi}$ matrix with

$$a_{hi} = \frac{1}{n_h P_{hi}}\,\frac{M_{hi}}{m_{hi}}\,\frac{M_{hi} - m_{hi}}{M_{hi} - 1}, \qquad b_{hi} = \frac{1}{n_h P_{hi}}\,\frac{M_{hi}}{m_{hi}}\,\frac{m_{hi} - 1}{M_{hi} - 1},$$

the EGRE is asymptotically equivalent to the OPE. Again, if $n_h$ is constant across strata, the EGRE is the EOE as well. Note that the indicator variables of stratum membership of elementary units are not DBVs, unless $P_{hi}$ is proportional to $M_{hi}$.

We conclude this section with the following observations. First, multiplying a DBV by any constant does not modify the equivalence results shown in the previous examples. Second, there are alternative forms of $\Sigma$ that guarantee the same equivalence results in the previous examples. So the problem of finding the best choice of $\Sigma$ is open.

7. Estimation strategies

The examples examined above illustrate that, given some auxiliary information $\xi$, the OPE is generally equivalent to the EGRE based on a working model that includes the maximum number of linearly independent DBVs and assumes a variance matrix that reflects the structure of first and second order inclusion probabilities. So, the OPE allows a better fit of the data, and this explains its asymptotic superiority. However, in finite size samples, besides potential collinearities among regressors, the EOE is exposed to instabilities because there might be an inadequate number of residual degrees of freedom available for estimating a larger number of parameters. In particular, this concern can be important in stratified designs with a few observations per stratum, or in multistage sampling designs with a few PSUs per stratum.

The analysis and the examples presented in the previous sections suggest the following options for estimating a population mean, taking into account a given amount of auxiliary information $\xi$ and the realized value of $\hat\xi_{HT}$. Above all, we recommend the conditional approach to inference. When an estimator is conditionally biased, its precision is greater or less than that predicted by the unconditional variance, according to the conditional MSE (1). An unconditional inference would ignore this piece of information.

So, the first option consists of using the EOE directly. This choice guarantees asymptotic conditional unbiasedness and efficiency and allows valid conditional inferences. However, when the stability of the EOE is of concern, as when the total sample size is not big enough or the number of strata is high compared to the number of observations, a second option consists of adopting a GRE or EGRE based on a reasonable reduced working model and estimating the conditional MSE (1), to guarantee valid conditional inferences. In the reduced model a suitable subset of DBVs can be inserted.
For example, in the case of stratified samples, this can be accomplished by introducing DBVs corresponding to superstrata obtained by collapsing the original strata. Collapsing should be performed so that, within each superstratum, strata effects can be considered negligible. In this way some conditional bias is accepted to control for the unconditional variance of the estimator.

8. An empirical study

The theory developed in the previous sections is asymptotic in nature, being based on first order approximations and normality. In this section we report results of an empirical study carried out to test the theory in the presence of finite size samples.

A finite population partitioned into eight strata of size 100 was considered. Let hi be the label of unit i = 1, 2, . . . , 100 within stratum h = 1, 2, . . . , 8. Values of an auxiliary variable X were assigned through the monotonic function of h and i

$$X_{hi} = 4.95 + 5 \sum_{j=1}^{h-1} j + 0.05\, h \times i .$$

A finite population of Y values, given X, was generated using the model

$$Y_{hi} = 20 + 2 X_{hi} + 0.06 X_{hi}^2 + X_{hi}\, \varepsilon_{hi} ,$$

where εhi were observations from a standardized normal random variable. The realized mean and standard deviation of Y were 618.2 and 676.0, respectively. The asymmetry of Y, measured by the ratio between the third central moment and the cube of the standard deviation, was 1.21. The correlation coefficient between Y and X was equal to 0.96. The unit values Xhi were assumed unknown.

A proportional stratified random sampling-without-replacement design was used to select 5000 samples of size 80 (10 units per stratum) and 5000 samples of size 24 (three units per stratum). For each sample the following quantities were computed:
1. ȳ and x̄, i.e. the sample means of Y and X;
2. η̂G1, i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $\mathrm{var}_m(Y_{hi}) = X_{hi} \sigma^2$, assuming the population mean ξ1 of X known;
3.
η̂O1 and η̃O1, i.e. the EOE and OPE based on the same auxiliary variable used by η̂G1;
4. η̂G2, i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi} + \beta_3 X_{hi}^2$, $\mathrm{var}_m(Y_{hi}) = X_{hi}^2 \sigma^2$ (the true model), assuming ξ1 and ξ2 known, the latter being the population mean of the squared X unit values;
5. η̂O2 and η̃O2, i.e. the EOE and OPE based on the same auxiliary variables used by η̂G2;
6. η̂SDi (i = 1, 2), i.e. the sample-dependent estimators defined to take the value of η̂Gi when the estimated absolute value of the conditional relative bias |γ(η̂Gi | ξ̂HT)| is below 25%, and the value of η̂Oi otherwise.

The sample-dependent estimator η̂SDi was introduced to evaluate the selection of the estimator according to the estimated value of |γ(η̂Gi | ξ̂HT)| after the sample had been drawn. Here, 25% was an arbitrarily chosen threshold under which the relative conditional bias was thought acceptable from the efficiency point of view. Variances and covariances of estimators were computed using the Taylor linearization method (Wolter, 1985, Chapter 6) and standard formulae for a stratified sample with a proportional allocation. The realized finite population was such that λ(η̂G1, η̂O1) = 16.0% and λ(η̂G2, η̂O2) = 0.0%. Note that in this set-up, the EOE was a GRE based on a model with an intercept term for each stratum and a homoscedastic variance matrix.

Table 1 reports the unconditional MSEs of the estimators of η, scaled so that the MSE of the sample mean equals 100. Estimator biases were always negligible and are not reported. First, consider the case of sample size n = 80. As expected, when only ξ1 was known, η̂G1 was less efficient than η̂O1. The result was due to the inadequacy of the working model upon which η̂G1 was based and the consequent conditional bias. The estimated |γ(η̂G1 | x̄)| exceeded the 25% threshold in about half of the samples (50.1%; see Table 1), and the sample-dependent estimator η̂SD1 was almost as efficient as η̂O1.
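As an illustration only, the simulation design of this section can be sketched in Python as follows. The sketch is not the code used for the study: the random seed, the function names (make_population, stratified_sample, gre_linear) and the number of replicates are assumptions, so the realized population differs from the one analysed here. The GRE is written in the standard regression-estimator form ȳ + (ξ1 − x̄)β̂2, with weighted least squares weights 1/Xhi, which is assumed to agree with the definition of η̂G1 under this self-weighting design.

```python
import numpy as np

H, M = 8, 100  # eight strata of 100 units each, as in the study


def make_population(rng):
    """Generate (stratum label, X, Y) for the 800 units."""
    h = np.repeat(np.arange(1, H + 1), M)   # stratum label h = 1..8
    i = np.tile(np.arange(1, M + 1), H)     # unit label i = 1..100
    # X_hi = 4.95 + 5 * sum_{j=1}^{h-1} j + 0.05 * h * i
    x = 4.95 + 5.0 * (h - 1) * h / 2.0 + 0.05 * h * i
    eps = rng.standard_normal(H * M)        # standard normal errors
    y = 20.0 + 2.0 * x + 0.06 * x**2 + x * eps
    return h, x, y


def stratified_sample(rng, h, per_stratum):
    """Indices of a proportional stratified SRS without replacement."""
    picks = [rng.choice(np.flatnonzero(h == k), size=per_stratum, replace=False)
             for k in range(1, H + 1)]
    return np.concatenate(picks)


def gre_linear(xs, ys, xi1):
    """Assumed GRE form for the model E(Y) = b1 + b2 X, var(Y) = X sigma^2."""
    w = 1.0 / xs                                  # WLS weights 1 / X
    A = np.column_stack([np.ones_like(xs), xs])   # regressors (1, X)
    beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * ys))
    return ys.mean() + (xi1 - xs.mean()) * beta[1]


rng = np.random.default_rng(2000)   # arbitrary seed (assumption)
h, x, y = make_population(rng)
eta, xi1 = y.mean(), x.mean()       # population means of Y and X

R = 500                             # replicates; the study used 5000
ybar = np.empty(R)
g1 = np.empty(R)
for r in range(R):
    s = stratified_sample(rng, h, per_stratum=10)   # n = 80
    ybar[r] = y[s].mean()
    g1[r] = gre_linear(x[s], y[s], xi1)

# Scaled empirical MSE: the MSE of the sample mean is set to 100
scaled_mse = 100.0 * np.mean((g1 - eta) ** 2) / np.mean((ybar - eta) ** 2)
print(f"scaled MSE of the GRE sketch: {scaled_mse:.1f}")
```

Because the population and the samples are regenerated, the printed value should not be expected to reproduce the entries of Table 1; the sketch only shows how a scaled empirical MSE of this kind can be obtained.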
When both ξ1 and ξ2 were known, η̂G2 was a bit less efficient than η̃O2, but slightly more efficient than η̂O2, being based on the more parsimonious true model. Most of the time the estimated |γ(η̂G2 | x̄, $\overline{x^2}$)|, where $\overline{x^2}$ is the sample mean of the squared X unit values, was below the 25% threshold and the sample-dependent estimator η̂SD2 was almost as efficient as η̂G2. The result confirms that, given a suitable amount of auxiliary information, only a negligible gain in efficiency can be achieved through the OPE when the working model upon which the GRE is based holds true, even with very large samples.

TABLE 1
Unconditional results: scaled empirical MSE of estimators and percentages of samples for which |γ̂(η̂G | ξ̂HT)| > 25%

Auxiliary used   Estimator                               Scaled MSE, n = 80   Scaled MSE, n = 24
None             ȳ                                       100.0                100.0
(1, X)           η̂G1                                     50.1                 50.0
                 η̂O1                                     42.4                 49.0
                 η̃O1                                     41.5                 41.9
                 η̂SD1                                    43.0                 49.4
                 freq{|γ̂(η̂G1 | x̄)| > 25%}                50.1%                48.0%
(1, X, X²)       η̂G2                                     32.7                 33.5
                 η̂O2                                     33.7                 45.2
                 η̃O2                                     32.4                 33.2
                 η̂SD2                                    33.2                 44.8
                 freq{|γ̂(η̂G2 | x̄, $\overline{x^2}$)| > 25%}   22.5%                51.0%

The efficiencies of η̂O1 and η̂O2 were very similar to those of η̃O1 and η̃O2, which represented a gauge of the best that could be expected from the EOEs when the sample size increases. The loss of degrees of freedom concerning η̂O1 and η̂O2 did not seem to affect their precision for this sample size and number of strata. For example, note that η̂O2 was equivalent to a GRE based on a model with 10 parameters and 80 observations.

To warn about EOE instability, a sample size n = 24 was also considered. The loss in precision of η̂O1 and η̂O2 was dramatic compared to their linear approximations (scaled MSEs of 49.0 versus 41.9 and of 45.2 versus 33.2, respectively). In fact, η̂O2 was based on a model with 10 parameters and 24 observations. The estimated value of |γ(η̂G | ξ̂HT)| was unstable as well.
In particular, in the case of the quadratic working model, it caused the sample-dependent estimator to take the value of η̂O2 often, even if it was unnecessary. In fact, although the absolute value of the true relative conditional bias was always under the threshold, the estimated value was over the threshold 51% of the time. The problem was the small number of residual degrees of freedom, i.e. 14.

8.1. Conditional results

To explore the conditional performances of estimators, the 5000 samples of size 80 were sorted by the value of x̄ and put into 23 groups of 600 samples each, with a 67% overlap between consecutive groups to smooth the plot (i.e. each group starts 200 samples after the previous one). Empirical means of estimators and other quantities were computed within each group. Figures 1–3 present some conditional results for estimators based on the auxiliary variables (1, X).

Figure 1 shows the ratios of the biases of the estimators to the corresponding standard deviations within each group, plotted against the average values of (x̄ − ξ1). The GRE η̂G1 is substantially conditionally biased compared to its conditional standard deviation. On the other hand, η̂O1 and η̂SD1 are approximately conditionally unbiased.

Figure 2 reports the behaviour of the average of (η̂G1 − η)² within each group, i.e. the empirical conditional MSE, of which an important part was due to the conditional bias. Note that the estimator of the conditional MSE (1) tracks the true value closely. On the other hand, the traditional variance estimator remains close to the unconditional variance value represented by the horizontal line in the graph. So η̂G1 is, on average, more precise than predicted by its unconditional variance estimator in approximately balanced samples with respect to X. The opposite is true in unbalanced samples.

Figure 1. Conditional relative biases of η̂G1, η̂O1 and η̂SD1 versus x̄ − ξ1

Figure 2. Conditional MSE of η̂G1, its estimator, and the unconditional variance estimator of η̂G1 versus x̄ − ξ1

Figure 3. 90% conditional confidence interval coverage rates obtained by the pairs (η̂G1, V̂(η̂G1)), (η̂G1, $\widehat{\mathrm{MSE}}$(η̂G1 | x̄)) and (η̂O1, V̂(η̂O1)) versus x̄ − ξ1

Figure 3 reports 90% confidence interval conditional coverage rates. Confidence intervals were constructed assuming approximate normal distributions for estimators coupled with the appropriate variance or MSE estimators. Although the unconditional coverage rates are all very close to the nominal one (indeed they are somewhat lower), conditionally the worst behaviour is that of η̂G1 coupled with the unconditional variance estimator. The coverage rate is higher than 90% in approximately balanced samples and lower in extremely unbalanced samples. Better results are achieved using η̂G1 coupled with its conditional MSE estimator, or η̂O1 coupled with its own variance estimator. Confidence interval coverage rates are also affected by conditional properties of variance and MSE estimators, but that matter is beyond the scope of this paper.

Similar behaviour was observed for sample size 24. The estimators η̂O1 and η̂SD1 were still approximately conditionally unbiased even if unconditionally they were much less efficient than η̃O1. Conditional results for estimators based on the auxiliary variables (1, X, X²) are not reported because no conditional trend was noticeable in any case. Details are available on request.

9.
Discussion

The main conclusion we can draw from this analysis is that the population mean of any known auxiliary variable is efficiently incorporated into the estimation strategy if the resulting estimator of the mean of the survey variable is uncorrelated with the estimator of the mean of the auxiliary variable. This condition also guarantees conditional unbiasedness and allows valid conditional inferences. That requirement can be achieved through the EOE. The latter corresponds to an extended GRE based on a working model that includes the effects of any existing DBV. With a stratified design with a few observations per stratum, the EOE can be vulnerable to the consequent loss of degrees of freedom. Further empirical work is needed to evaluate when a sample is large enough to overcome the problem, especially with complex designs.

When the use of the EOE is thought unsafe, the alternative consists of using a working model without DBVs, or with a few of them, capable of capturing the relationship between the survey variables and the auxiliary variables whose population means are known. If the model holds true, the corresponding EGRE guarantees efficiency and conditional unbiasedness. However, specification of the model is confined to regressors with known population means, so it only rarely describes the complexity of the finite population. Therefore, we recommend estimating the conditional MSE of the GRE or EGRE, to correctly evaluate its precision from a conditional point of view.

To use the function γ(η̂G | ξ̂HT) for choosing between an EGRE and the EOE, empirical evidence is needed for determining which values of |γ(η̂G | ξ̂HT)| are to be considered negligible. Further work is needed on the distribution of the estimated |γ(η̂G | ξ̂HT)| for choosing the threshold above which shifting from an EGRE to the corresponding EOE is profitable for a valid conditional inference and a lower unconditional variance, especially when the true value of λ(η̂G, η̂O) is zero.
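The selection rule behind the sample-dependent estimator of Section 8 can be sketched as follows. The function name, the argument names and the numerical values are illustrative assumptions: gamma_hat stands for an estimate of γ(η̂G | ξ̂HT) obtained elsewhere, and the default of 25% simply repeats the arbitrary threshold used in the empirical study.

```python
import numpy as np


def sample_dependent(eta_g, eta_o, gamma_hat, threshold=0.25):
    """Pick the GRE when |gamma_hat| is below the threshold, else the EOE.

    Works elementwise, so arrays of replicates can be passed as well.
    """
    return np.where(np.abs(gamma_hat) < threshold, eta_g, eta_o)


# Hypothetical values for three samples (not taken from the study)
eta_g = np.array([610.0, 625.0, 640.0])      # GRE values
eta_o = np.array([615.0, 618.0, 622.0])      # EOE values
gamma_hat = np.array([0.10, -0.40, 0.25])    # estimated relative biases
chosen = sample_dependent(eta_g, eta_o, gamma_hat)
```

In this sketch a value exactly at the threshold is not treated as "below" it, so the EOE is chosen in the third sample; how ties and the threshold itself should be handled is a design choice that the empirical study leaves open.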
The estimated value of |γ(η̂G | ξ̂HT)| is unstable, but in most practical situations there is more than one survey variable of interest. So, to apply the same weights to every variable, the EOE could be chosen on the basis of an averaged value of |γ(η̂G | ξ̂HT)| across the main survey variables; such an average is more precisely estimable.

So far we have restricted the analysis to the use of balanced design variables, because we took the vector ξ as fixed. This is part of the wider problem of selecting a subset of the available auxiliary variables for estimation purposes. For example, if the true model is linear, a quadratic term inserted into the working model may give a less efficient estimator in finite samples, although asymptotically it would result in a more efficient estimator of the population mean. Generally speaking, one should condition on all the auxiliary variables whose population means are known, to avoid any conditional bias. However, with finite size samples, that choice may yield unacceptable levels of conditional and unconditional variance. So, the problem of selecting the best subset of auxiliary variables for finite size sample efficiency is still open.

References

CASADY, R.J. & VALLIANT, R. (1993). Conditional properties of post-stratified estimators under normal theory. Survey Methodology 19, 183–192.
COCHRAN, W. (1977). Sampling Techniques. New York: Wiley.
DEVILLE, J.C. & SÄRNDAL, C.E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 87, 376–382.
ELTINGE, J.L. & JANG, D.S. (1996). Stability measures for variance component estimators under a stratified multistage design. Survey Methodology 22, 157–165.
ESTEVAO, V., HIDIROGLOU, M.A. & SÄRNDAL, C.E. (1995). Methodological principles for a generalised estimation system at Statistics Canada. J. Official Statist. 11, 181–204.
FRANCISCO, C.A.
& FULLER, W.A. (1991). Quantile estimation with a complex survey design. Ann. Statist. 19, 454–469.
HOLT, D. & SMITH, T.M.F. (1979). Post-stratification. J. Roy. Statist. Soc. Ser. A 142, 33–46.
ISAKI, C.T. & FULLER, W.A. (1982). Survey design under the regression superpopulation model. J. Amer. Statist. Assoc. 77, 89–96.
JOBSON, J.D. (1992). Applied Multivariate Data Analysis, Vol. II. New York: Springer-Verlag.
KREWSKI, D. & RAO, J.N.K. (1981). Inference from stratified samples: properties of the linearisation, jackknife and balanced repeated replication methods. Ann. Statist. 9, 1010–1019.
LEHTONEN, R. & PAHKINEN, E.J. (1995). Practical Methods for Survey Design. New York: Wiley.
MONTANARI, G.E. (1987). Post-sampling efficient QR-prediction in large-scale surveys. Internat. Statist. Rev. 55, 191–202.
MONTANARI, G.E. (1998). On regression estimation of finite population mean. Survey Methodology 24, 69–77.
RAO, J.N.K. (1985). Conditional inference in survey sampling. Survey Methodology 11, 15–31.
RAO, J.N.K. (1994). Estimating totals and distribution functions using auxiliary information at the estimation stage. J. Official Statist. 10, 153–165.
RAO, J.N.K. & WU, C.F.J. (1985). Inference from stratified samples: second order analysis of three methods for non-linear statistics. J. Amer. Statist. Assoc. 80, 620–630.
ROBINSON, J. (1987). Conditioning ratio estimates under simple random sampling. J. Amer. Statist. Assoc. 82, 826–831.
ROYALL, R.M. & CUMBERLAND, W.G. (1981). An empirical study of the ratio estimator and estimator of its variance. J. Amer. Statist. Assoc. 76, 66–88.
ROYALL, R.M. & CUMBERLAND, W.G. (1985). Conditional coverage properties of finite population confidence intervals. J. Amer. Statist. Assoc. 80, 830–840.
SÄRNDAL, C.E. (1996). Efficient estimators with simple variance in unequal probability sampling. J. Amer. Statist. Assoc. 91, 1289–1300.
SÄRNDAL, C.E., SWENSSON, B. & WRETMAN, J.H. (1992). Model Assisted Survey Sampling.
New York: Springer-Verlag.
SCOTT, A.J. & WU, C.F.J. (1981). On the asymptotic distribution of ratio and regression estimators. J. Amer. Statist. Assoc. 76, 98–112.
SEN, P.K. (1988). Asymptotics in finite population sampling. In Handbook of Statistics 6: Sampling, pp. 291–332. Amsterdam: North-Holland.
TILLÉ, Y. (1995). Auxiliary information and conditional inference. Bull. Internat. Statist. Inst., Proc. 50th Session. Vol. 1. 303–319.
WOLTER, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
WRIGHT, R.L. (1983). Finite population sampling with multivariate auxiliary information. J. Amer. Statist. Assoc. 78, 879–884.