Environmental Modelling & Software 48 (2013) 93–97
Short communication
A B-SHADE based best linear unbiased estimation tool for biased samples
Mao-Gui Hu a, Jin-Feng Wang a,*, Yu Zhao a,b, Lin Jia c
a State Key Laboratory of Resources and Environmental Information Systems, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
b School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
c State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
Article info
Article history:
Received 11 October 2012
Received in revised form 3 April 2013
Accepted 24 June 2013
Available online
Abstract
If not handled properly, a biased sample from a population usually results in a biased estimation of the
population. The bias of a sample can be caused by selection bias or attrition bias. The B-SHADE (Biased
Sentinel Hospital based Area Disease Estimation) model provides a best linear unbiased estimation
(BLUE) solution by incorporating the ratio between the sample and the population, the autocorrelation
within the population, and support from historical data. Three extensions are proposed and implemented in the software based on the B-SHADE model. First, we extend the original population total-oriented estimation method to population mean estimation, which is another important parameter in
sampling. Second, a historical sample rather than a historical population is found to be applicable in
population mean estimation. This is particularly important in practice, where there is no integrated
historical population information but good historical samples. Finally, efficient sampling optimization
based on the simulated annealing algorithm is proposed and implemented in the software. This is useful
in evaluating the efficiency of old samples and designing new samples. A demonstration shows that
when the “vertical” relationship and “horizontal” correlation can be well represented and calculated
from historical data, the result estimated by the B-SHADE model is better than results from traditional
simple random sampling and ratio estimation. Although the B-SHADE model was originally designed for
sentinel hospitals, the software is a common tool for similar problems in different applications.
© 2013 Elsevier Ltd. All rights reserved.
Keywords:
Biased sample
BLUE estimation
Sampling design
Correlation
* Corresponding author. Tel.: +86 10 64888965; fax: +86 10 64889630.
E-mail addresses: [email protected] (M.-G. Hu), [email protected] (J.-F. Wang).
http://dx.doi.org/10.1016/j.envsoft.2013.06.011
1. Introduction
Sampling is a method for investigating and understanding a population using a sample. It has been widely applied in various disciplines including natural resources, environmental pollution, and public health. With the sample data collected, various parameters of the population (for example, the mean and the sum) can be estimated using an appropriate model (Wang et al., 2002, 2009; Haining, 2003; Fischer and Wang, 2011). Usually, a best and unbiased estimation is expected (Cochran, 1977). However, if the samples are not carefully selected to match the stochastic features of the population, the estimated result could be biased with respect to the population’s real value (Christakos, 1992), that is, systematically different from it. Two common causes of sample bias are selection bias and attrition bias (Heckman, 1979; Cuddeback et al., 2004). Sample selection bias is usually due to preferential sampling. For example, when setting sentinels to estimate a disease’s prevalence or incidence in a city, hospitals with better equipment and doctors are more likely to be selected by planners. This would result in the sentinels’ average number of visitors being much higher than the actual average number of visitors to all hospitals. Another kind of selection bias arises when missing or unattainable samples are systematically different from the population, causing the final samples to be biased. Attrition bias mainly occurs in longitudinal research (Cuddeback et al., 2004). Over time, some samples may be lost for one reason or another. If the lost samples differ systematically from the population, the remaining samples could be biased if the statistical inference model is not changed, even though all the samples were originally unbiased when they were designed. The bias in such samples must be corrected before they are used to estimate the population.
The remainder of the paper is organized as follows. In Section 2, we present the B-SHADE (Biased Sentinel Hospital based Area Disease Estimation) model used to obtain a best linear unbiased estimation of the population from biased samples. The software implementation of the model is described in Section 3. This software has been used in two applications, as explained in Section 4. Finally, our conclusions are presented in Section 5.
Software availability
Software name: B-SHADE Estimation and Sampling
Developer: Mao-Gui Hu, Yu Zhao
Programming language: R, C#
Operating system: Microsoft Windows
Available since: April 2012
Availability: free download from http://www.sssampling.org/bshade
Contact: [email protected]
2. Method
The B-SHADE method is a model-assisted, data-driven method for correcting bias when a population is estimated from biased samples (Wang et al., 2011). As with the classical ratio estimation (CRE) method used to correct bias in several applications, the ratio relationship between the samples and the population is an important connection showing the degree of bias in a sample. However, the two methods apply different strategies when considering the ratio. Taking population total estimation as an example, in the CRE method the ratio is usually between the estimated total from the biased samples and the population’s true total value. In the B-SHADE method, on the other hand, each sample has its own ratio of the sample’s value to the population’s true total value. The ratio is often calculated from historical investigations whose data represent the population well. The other distinguishing feature of B-SHADE is that it takes account of the correlations between the samples and the population. In classical sampling techniques, samples are assumed to be independent of each other (Cochran, 1977). In realistic applications, however, samples are often correlated with each other (Getis and Ord, 1992; Englund, 1994). According to the first law of geography, “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). The distance can be measured in either geographical space or attribute space. For example, child population density has a greater influence on the incidence of hand-foot-mouth disease than climate factors (Hu et al., 2012), and thus counties with similar child population densities may be more similar in terms of disease incidence. In the B-SHADE model, the correlation is considered and calculated as a covariance between the samples and the population. To guarantee a best and unbiased estimation of the population total, each sample is adaptively assigned its own weight.
By considering the correlation between the samples and the population, the B-SHADE method generates an unbiased estimation of the population using biased samples. It incorporates a “vertical” relationship between each sample and the population to correct the sampling bias, and a “horizontal” correlation between each pair of samples to increase the effectiveness of the sample size and reduce estimation variance. The weight of each sample is calculated adaptively under the best linear unbiased estimation (BLUE) constraint (Wang et al., 2011):
$$
\begin{aligned}
&\sum_{j=1}^{n} w_j C(y_i, y_j) + \mu b_i = \sum_{j=1}^{N} C(y_i, y_j), \\
&\sum_{i=1}^{n} w_i b_i = 1, \qquad b_i = E(y_i)/E(Y)
\end{aligned}
\tag{1}
$$
where $n$ is the number of samples, $N$ is the population size, $C(y_i, y_j)$ is the covariance between the $i$th and $j$th samples obtained from historical data, $b_i$ is the bias of the $i$th sample with respect to the population total $Y$, also obtained from historical data, $w_i$ is the weight of the $i$th sample, and $\mu$ is a Lagrange multiplier. The estimated population total and its variance can then be obtained from the following expressions (Wang et al., 2011):
$$
\begin{aligned}
y_w &= \sum_{i=1}^{n} w_i y_i, \\
\sigma^2_{y_w} &= (r_n - 1) \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j C(y_i, y_j) - 2\mu
\end{aligned}
\tag{2}
$$
where $r_n = \sum_{i=1}^{N} \sum_{j=1}^{N} C(y_i, y_j) \big/ \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j C(y_i, y_j)$ is estimated from historical data.
3. Software design
3.1. Main functions
Based on the B-SHADE model, our software was designed using the rule of “just enough”. A flowchart of the software is illustrated in Fig. 1. Population estimation from biased samples has been specifically designed according to the B-SHADE method. In this method, the “vertical” relationship between each sample and the population is the ratio between them, while the “horizontal” correlation between samples is represented by the covariance between them. Both of these parameters are expected to be acquired from a general historical investigation. To calculate the covariance, each sample in the population should have a time series of historical data. Although the minimum length of the time series is two, in practice it should be long enough to represent a stable value of the sample.
Fig. 1. Flowchart of B-SHADE estimation and sampling.
Population estimation from a biased sample is a common problem in applications. Besides the original total population estimation, two extended functions are also implemented in the package: mean estimation and sampling optimization, both of which are commonly used in practical sampling. Each sample’s weight is also output by the program to show its role in estimating the population. Another result is the variance of the estimated population total or mean, which can be used to assess the distance between the estimated population and its true value. Besides population estimation, another distinguishing function of the software is sampling design. For a given sample size, the program attempts to find the best sample combination for which the variance of the estimated population total or mean is minimized. A simulated annealing optimization method is adopted to select samples. If the number of combinations is small, all combinations of the population with the given sample size can be compared to find the optimum combination. The model and optimization algorithm are implemented in the R language, a very popular cross-platform statistical language (R Core Team, 2012). For users who are not familiar with R, a program with a graphical user interface (GUI) has also been designed to wrap the R code under Microsoft Windows. Although the B-SHADE model was originally designed for sentinel hospitals, it is a common method for similar problems in different applications.
3.2. Population mean estimation
The mean and total of a population are two closely related parameters in sampling. When one of these is known, the other can be easily estimated. The B-SHADE model aims to estimate the population total from a biased sample, which requires the population to be known at some previous time to measure the correlation between each pair of samples (the covariance on the right side of the first expression in Eq. (1)). However, this condition is too strict for some applications, although well designed sampling investigations may exist in the historical record. For example, in the aforementioned attrition bias sampling, the collected sample is unbiased at the beginning. If the relationship between samples is rather stable, the unbiased sample at the beginning can be used to correct the bias in later stages. In some other applications, the mean instead of the total is the parameter of most interest. Thus, a mean estimation method based on biased samples is important and necessary. Following similar deduction and calculation steps, the B-SHADE model has been extended to estimate the population mean. Under the BLUE constraint, the mathematical expectation of the estimated population mean should be equal to the true population mean, and the variance of the estimator should be minimized. The sample weights can be solved using the following linear equations:
$$
\begin{aligned}
&\sum_{j=1}^{n} w_j C(y_i, y_j) + \mu b_i = \frac{1}{N'} \sum_{j=1}^{N'} C(y_i, y_j), \\
&\sum_{i=1}^{n} w_i b_i = 1, \qquad b_i = E(y_i)/E(\bar{Y})
\end{aligned}
\tag{3}
$$
where $N'$ is the size of the historical unbiased sample ($N' < N$) or of the population ($N' = N$). Then, the estimated population mean and its variance can be obtained from Eq. (4):
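$$
\begin{aligned}
\bar{y}_w &= \sum_{i=1}^{n} w_i y_i, \\
\sigma^2_{\bar{y}_w} &= (r_n - 1) \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j C(y_i, y_j) - 2\mu
\end{aligned}
\tag{4}
$$
where $r_n = \frac{1}{N'^2} \sum_{i=1}^{N'} \sum_{j=1}^{N'} C(y_i, y_j) \big/ \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j C(y_i, y_j)$ is estimated from historical data.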
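As an illustration of how the weights in Eqs. (1) and (3) might be computed, the following minimal Python sketch assembles the linear system for the unknowns (w_1, ..., w_n, μ) and solves it by Gaussian elimination. It is not the code of the B-SHADE package, which is written in R. The function names, the toy covariance matrix and the toy unit means are all hypothetical and chosen only for demonstration. The variance uses the identity (r_n − 1) Σ Σ w_i w_j C(y_i, y_j) = (scaled sum of all covariances) − Σ Σ w_i w_j C(y_i, y_j), which follows from the definition of r_n.

```python
from typing import List, Sequence, Tuple


def solve_linear(a: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    m = [list(row) + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - known) / m[r][r]
    return x


def bshade_weights(cov: Sequence[Sequence[float]], sample_idx: Sequence[int],
                   b: Sequence[float], mean: bool = True) -> Tuple[List[float], float]:
    """Return (weights, mu) solving Eq. (3) if mean is True, else Eq. (1).

    cov is the historical covariance matrix over all N' units, sample_idx
    lists the sampled units, and b holds their bias ratios b_i.
    """
    n = len(sample_idx)
    scale = 1.0 / len(cov) if mean else 1.0
    a = [[cov[i][j] for j in sample_idx] + [b[k]] for k, i in enumerate(sample_idx)]
    rhs = [scale * sum(cov[i]) for i in sample_idx]
    a.append(list(b) + [0.0])  # unbiasedness constraint: sum_i w_i b_i = 1
    rhs.append(1.0)
    sol = solve_linear(a, rhs)
    return sol[:n], sol[n]


def bshade_variance(cov: Sequence[Sequence[float]], sample_idx: Sequence[int],
                    w: Sequence[float], mu: float, mean: bool = True) -> float:
    """Estimation variance of Eq. (4) (mean) or Eq. (2) (total)."""
    quad = sum(w[p] * w[q] * cov[i][j]
               for p, i in enumerate(sample_idx)
               for q, j in enumerate(sample_idx))
    total = sum(sum(row) for row in cov)
    if mean:
        total /= len(cov) ** 2
    return total - quad - 2.0 * mu  # equals (r_n - 1) * quad - 2 * mu


# Toy inputs (hypothetical): three units, two of them sampled.
cov = [[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 2.0]]
unit_means = [10.0, 20.0, 30.0]
pop_mean = sum(unit_means) / len(unit_means)
sample_idx = [0, 1]
b = [unit_means[i] / pop_mean for i in sample_idx]
w, mu = bshade_weights(cov, sample_idx, b)
print("weights:", w, "mu:", mu)
print("variance:", bshade_variance(cov, sample_idx, w, mu))
```

A quick sanity check on this sketch is that sampling every unit returns equal weights 1/N', a zero multiplier and zero variance, as expected for a complete census.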
Another problem that should be noted is that the parameter $\mu$ in the B-SHADE model is also calculated from historical data. When used to estimate the variance in Eqs. (2) and (4) for new sample data, it should be adjusted if the new sample data are systematically different from the historical data but the relationships between samples are not changed. In the software, we propose the following method to estimate the variance for the new estimated population mean:
$$
\sigma^2_{\bar{y}_w} = a^2 \sigma^2_{\bar{y}'_w} \tag{5}
$$
where $a = \bar{y}_w / \bar{y}'_w$ is the ratio between the population mean estimated in the new period and that estimated historically, and $\bar{y}'_w$ is the historically estimated population mean. For total population estimation, the adjustment is also applicable and adopted in the software (Fig. 2).
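The adjustment in Eq. (5) is a simple rescaling, which the following Python sketch makes explicit. The function name and the numbers in the usage line are hypothetical and serve only to show the arithmetic.

```python
def adjusted_variance(new_estimate: float, hist_estimate: float,
                      hist_variance: float) -> float:
    """Rescale the historical estimation variance as in Eq. (5)."""
    if hist_estimate == 0:
        raise ValueError("historical estimate must be non-zero")
    a = new_estimate / hist_estimate  # ratio of new to historical estimate
    return a * a * hist_variance


# Hypothetical values: the new estimate is twice the historical one,
# so the variance is scaled by a factor of four.
print(adjusted_variance(20.0, 10.0, 0.5))
```

Because the variance scales with the square of the ratio, the relative standard error stays the same between the historical and the new period under this adjustment.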
Fig. 2. Population estimation with the B-SHADE model.
3.3. Sample optimization
The estimated result from B-SHADE is a best linear unbiased estimation of the population. For a selected sample scheme, whether or not it is biased with respect to the population, the mathematical expectation of the estimated population total or mean is equal to the true value, and the variance is minimized. A good sample scheme should have a small number of samples and a small variance for the population estimation. Combined with the simulated annealing (SA) optimization algorithm, a sampling optimization method based on B-SHADE is implemented in the software. SA is a global optimization method based on probabilistic metaheuristics. It simulates the process of a metal cooling and freezing, which results in a minimum energy crystalline structure (Kirkpatrick et al., 1983), and helps to select the best samples when given the expected number of samples or estimation error. Sample optimization is an important issue in sampling as it affects not only the cost but also the precision of estimation (Bastin et al., 1984; Hu and Wang, 2011; Stein and Ettema, 2003). In Eq. (1), the covariance between each pair of samples is determined from historical data, as is $r_n$ in Eq. (2). If the correlations between samples are relatively stable, we can estimate $\sigma^2_{y_w}$ without knowing the exact sample values. On the one hand, for a given estimation variance, the smallest number of samples can theoretically be found from all sample combinations in the population. On the other hand, for a given number of samples, the samples with the smallest estimation variance can also be found.
Two functions are specially designed for sampling (Fig. 3a). One selects optimal samples for a fixed sample number, and the other selects samples for each number in a range. A variance–number (V–N) relationship plot is output to analyze how the sample number affects the estimation accuracy (Fig. 3b). This helps users make a tradeoff between accuracy and sample number. Advanced options of the SA algorithm can also be modified for detailed control of the optimization process.
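To make the fixed-size selection concrete, here is a minimal Python sketch of simulated annealing over subsets of k units, together with an exhaustive search that is feasible when the number of combinations is small. It is a generic illustration and not the implementation in the software. The objective passed in would be the B-SHADE estimation variance of a candidate subset; the additive toy objective below is only a stand-in, and the parameter names (t_start, t_end, cooling, steps_per_temp) are assumptions rather than the software's advanced options.

```python
import itertools
import math
import random
from typing import Callable, Optional, Sequence, Tuple

Objective = Callable[[Tuple[int, ...]], float]


def exhaustive_best(n_units: int, k: int, objective: Objective) -> Tuple[Tuple[int, ...], float]:
    """Compare every k-subset; only practical for a small number of combinations."""
    best = min(itertools.combinations(range(n_units), k), key=objective)
    return best, objective(best)


def anneal_subset(n_units: int, k: int, objective: Objective,
                  initial: Optional[Sequence[int]] = None,
                  t_start: float = 1.0, t_end: float = 1e-3,
                  cooling: float = 0.95, steps_per_temp: int = 50,
                  seed: int = 0) -> Tuple[Tuple[int, ...], float]:
    """Search for a k-subset with a small objective value by simulated annealing."""
    if not 0 < k <= n_units:
        raise ValueError("k must satisfy 0 < k <= n_units")
    rng = random.Random(seed)
    current = sorted(initial) if initial is not None else sorted(rng.sample(range(n_units), k))
    cur_val = objective(tuple(current))
    best, best_val = tuple(current), cur_val
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            outside = [u for u in range(n_units) if u not in current]
            if not outside:
                break
            # Neighbour move: swap one selected unit for one unselected unit.
            cand = list(current)
            cand[rng.randrange(k)] = rng.choice(outside)
            cand.sort()
            val = objective(tuple(cand))
            delta = val - cur_val
            # Always accept improvements; accept worse moves with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, cur_val = cand, val
                if cur_val < best_val:
                    best, best_val = tuple(current), cur_val
        t *= cooling  # geometric cooling schedule
    return best, best_val


# Stand-in objective (hypothetical): a per-unit cost summed over the subset.
cost = [5.0, 1.0, 4.0, 0.5, 3.0, 2.0]


def toy_objective(subset: Tuple[int, ...]) -> float:
    return sum(cost[i] for i in subset)


sa_best, sa_val = anneal_subset(6, 2, toy_objective, initial=[0, 2])
ex_best, ex_val = exhaustive_best(6, 2, toy_objective)
print("annealing:", sa_best, sa_val)
print("exhaustive:", ex_best, ex_val)
```

Repeating the annealing call for each k in a range and recording the best value found gives the data for a variance-number plot of the kind described above.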
Fig. 3. Sampling optimization (a) and the V–N plot (b).
4. Examples
Hand-foot-mouth disease (HFMD) is a common infectious disease occurring among children, with the main clinical symptoms being mouth ulcers and vesicles on the hands, feet, and mouth (Hu et al., 2012). There is currently no effective vaccine or antiviral treatment for the disease. Here we take HFMD cases as an example to demonstrate the application of the software. Daily HFMD data for two one-month periods (August 4–September 4 in both 2009 and 2010) from 19 hospitals in a district in China were used as the experimental data. The data from 2009 are treated as historical data, and are used to improve population estimation in 2010 when five of the 19 hospitals are selected as sentinel samples. First, the total number of HFMD cases in the period for 2010 is estimated using the population estimation function. Second, new sentinel samples are selected using the proposed sampling optimization method.
4.1. Population estimation
The total number of cases per hospital at the 19 hospitals for August 4–September 4, 2009 varied greatly, from 0 to 254, with mean 16.7 and standard deviation 58.0. The total numbers of cases in 2009 and 2010 are 318 and 317, respectively. The total and mean numbers of cases for the five sentinel samples in 2009 are 47 and 9.4, which differ significantly from the population parameters. The trends of the daily total number of cases for 2009 and 2010 are very similar (Fig. 4). We used the sample to estimate the total number of cases at the 19 hospitals during the period August 4–September 4, 2010.
According to the B-SHADE model, the correlation between hospitals, measured by the covariance, is calculated from the daily time series for the number of cases in 2009. The bias ratio $b_i$ ($i = 1, \ldots, 5$) for each sample is the ratio between the total number of cases in the sample hospital and the total number of cases for the 19 hospitals in 2009. Two comma separated value (CSV) files are prepared as input data for the program: one is the sample data file and the other is the historical population data file. Both have the same file format: the first column gives the time sequence, and the other columns are the time series data for each hospital. It should be noted that, as discussed earlier, when the goal is to estimate the population mean rather than the total population, the “historical population” data file can be replaced directly by a historical unbiased sample if the historical population data are not available.
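The following Python sketch shows one way the two historical quantities could be derived from a file in the layout just described (first column time, remaining columns one series per hospital). It is an assumption-laden illustration, not the software's parser: the header names, dates and counts are invented toy values, and the covariance uses the usual sample formula with a divisor of T − 1, which the paper does not specify.

```python
import csv
import io
from typing import Dict, List, Sequence

# Toy historical file (hypothetical values): one row per day, one column per hospital.
HISTORICAL_CSV = """date,H01,H02,H03
2009-08-04,2,0,1
2009-08-05,4,1,1
2009-08-06,6,2,1
"""


def read_series(text: str) -> Dict[str, List[float]]:
    """Parse the CSV text into one daily series per hospital."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[0][1:]  # skip the time column
    series: Dict[str, List[float]] = {name: [] for name in names}
    for row in rows[1:]:
        for name, value in zip(names, row[1:]):
            series[name].append(float(value))
    return series


def bias_ratios(series: Dict[str, List[float]], sample_names: Sequence[str]) -> Dict[str, float]:
    """b_i: a hospital's historical total divided by the total over all hospitals."""
    population_total = sum(sum(values) for values in series.values())
    return {name: sum(series[name]) / population_total for name in sample_names}


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample covariance of two equally long daily series (divisor T - 1)."""
    t = len(x)
    mean_x, mean_y = sum(x) / t, sum(y) / t
    return sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y)) / (t - 1)


series = read_series(HISTORICAL_CSV)
b = bias_ratios(series, ["H01", "H02"])
cov_h01_h02 = covariance(series["H01"], series["H02"])
print(b)
print(cov_h01_h02)
```

The divisor T − 1 is also why at least two time points are needed, in line with the minimum series length of two mentioned in Section 3.1.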
The estimated total number of cases during the one month period in 2010 is 285, with an absolute error of 32. The estimated total numbers of cases from two other common methods, simple random sampling estimation and ratio estimation (Heckman, 1979), are 205 and 365, respectively, with absolute errors of 112 and 48, respectively. The number of cases on each day during the sample period was also estimated by the software. The absolute errors in the daily estimation are summarized in Table 1. The results show that simple random sampling has the largest mean error because no bias factor is considered. The ratio estimation has a rather small mean error (1.51), although this is slightly larger than the absolute mean error of B-SHADE (1.00). Taking the range of estimated error into consideration, simple random sampling has the smallest range while that of the ratio method is the largest.
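For readers less familiar with the two comparison methods, the Python sketch below writes them in their textbook forms: the simple random sampling estimate expands the sample mean to the population size, and the ratio estimate scales the sample total by the historical population-to-sample ratio. The population size (19), sample size (5), 2009 totals (318 and 47), true 2010 total (317) and B-SHADE estimate (285) are taken from the text. The 2010 sentinel total of 54 is not reported in the paper; it is an assumed value used here only because it is consistent with the reported estimates of 205 and 365.

```python
def srs_total(sample_total: float, n_sample: int, n_population: int) -> float:
    """Simple random sampling: expand the sample mean to the whole population."""
    return n_population * sample_total / n_sample


def ratio_total(sample_total: float, hist_sample_total: float,
                hist_population_total: float) -> float:
    """Classical ratio estimation using the historical population/sample ratio."""
    return sample_total * hist_population_total / hist_sample_total


TRUE_TOTAL_2010 = 317       # reported in the text
BSHADE_ESTIMATE = 285       # reported in the text
ASSUMED_SAMPLE_2010 = 54    # assumption, not reported in the text

estimates = {
    "B-SHADE": float(BSHADE_ESTIMATE),
    "Simple random": srs_total(ASSUMED_SAMPLE_2010, 5, 19),
    "Ratio": ratio_total(ASSUMED_SAMPLE_2010, 47, 318),
}
for method, value in estimates.items():
    print(method, round(value), "absolute error", round(abs(value - TRUE_TOTAL_2010)))
```

Under this assumption the simple random sampling estimate falls well below the true total because the five sentinels report fewer cases on average than the population, while the ratio estimate overshoots it.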
Fig. 4. HFMD number of cases for the population and sample hospitals in 2009 and 2010.
Table 1
Summary of absolute errors in estimating the number of HFMD cases by the three methods.
Method          Min.     1st qu.   Median   Mean    3rd qu.   Max.
B-SHADE         14.00    5.54      1.05     1.00    2.20      13.52
Simple random   14.00    6.70      3.70     3.49    1.05      7.20
Ratio           14.00    3.43      0.73     1.51    5.47      19.06
4.2. Sampling optimization
Sampling optimization can be used not only to design new sampling schemes, but also to evaluate the accuracy of a current sampling scheme. For example, it is unknown whether the five sentinel samples in the above section are the best choice out of all the candidates. The best samples should estimate the population with as small an error as possible. Using the B-SHADE model and historical data, we can select those samples that give the smallest standard error of estimation.
The daily number of HFMD cases over one month for the 19 hospitals in 2009 is used as historical data. We want to select five sentinel samples by the proposed sampling method. In the “Samples selection” page of the software, the sample number is entered as a “Fixed” number (5); the estimation goal is the total population. The output shows that the theoretical variance of total estimation by the selected best samples is 0.26. The newly selected samples are H01, H02, H03, H04 and H14, which differ from the above samples (H01, H03, H09, H10 and H14) in two of the five hospitals. Furthermore, the new samples are also used to estimate the daily number of HFMD cases during the same period in 2010. The absolute error for the total monthly number of cases is 21, which is better than the error of 32 with the old samples. A summary of the absolute error for the daily total number of cases is: Min.: 2.87; Median: 0.64; Mean: 0.65; Max.: 0.78. These results indicate that the new samples should be adopted to estimate the population.
5. Discussion and conclusion
The B-SHADE software provides a new tool for obtaining the best linear unbiased total estimation from biased samples. It considers both the “vertical” relationship and “horizontal” correlation between samples and the population. Three extensions are proposed and implemented in the software based on the B-SHADE model. First, we extend the original total population-oriented estimation method to population mean estimation, which is another important parameter in sampling. Second, historical samples rather than the historical population are found to be applicable in population mean estimation. This is particularly important in practice, where there is no integrated historical population information but good historical samples. Finally, an efficient sampling optimization method based on the simulated annealing algorithm is proposed and implemented in the software. This is useful in evaluating the efficiency of old samples and designing new samples. When the “vertical” relationship and “horizontal” correlation can be well represented and calculated from historical data, the estimated result from the B-SHADE model is better than results from traditional simple random sampling and ratio estimation. We trust that the new method and software will provide a viable option for alleviating biased sample inference confusion in applications.
Acknowledgments
This study was supported by MOST (2012CB955503; 2012ZX10004-201; 2011AA120305), CAS (XDA05090102), NSFC (41023010; 41271404), and LREIS (08R8B650PA) grants.
References
Bastin, G., Lorent, B., Duque, C., Gevers, M., 1984. Optimal estimation of the average areal rainfall and optimal selection of rain gauge locations. Water Resources Research 20 (4), 463–470.
Christakos, G., 1992. Random Field Models in Earth Sciences. Dover Publ., NY.
Cochran, W.G., 1977. Sampling Techniques, third ed. John Wiley & Sons, Inc.
Cuddeback, G., Wilson, E., Orme, J.G., Orme, T.C., 2004. Detecting and statistically correcting sample selection bias. Journal of Social Service Research 30 (3), 19–33.
Englund, E.J., 1994. Spatial autocorrelation: implications for sampling and estimation. In: Proceedings of the ASA/EPA Conferences on Interpretation of Environmental Data, pp. 31–40.
Fischer, M.M., Wang, J.F., 2011. Spatial Data Analysis: Models, Methods and Techniques. Springer.
Getis, A., Ord, J.K., 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24 (3), 189–206.
Haining, R., 2003. Spatial Data Analysis: Theory and Practice. Cambridge University Press, Cambridge.
Heckman, J.J., 1979. Sample selection bias as a specification error. Econometrica 47 (1), 153–161.
Hu, M.G., Wang, J.F., 2011. A spatial sampling optimization package using MSN theory. Environmental Modelling & Software 26 (4), 546–548.
Hu, M.G., Li, Z.J., Wang, J.F., Jia, L., Liao, Y., et al., 2012. Determinants of the incidence of hand, foot and mouth disease in China using geographically weighted regression models. PLoS ONE 7 (6), e38978.
Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220 (4598), 671–680.
R Core Team, 2012. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Stein, A., Ettema, C., 2003. An overview of spatial sampling procedures and experimental design of spatial studies for ecosystem comparisons. Agriculture, Ecosystems and Environment 94 (1), 31–47.
Tobler, W., 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46 (2), 234–240.
Wang, J.F., Liu, J.Y., Zhuang, D.F., Li, L.F., Ge, Y., 2002. Spatial sampling design for monitoring the area of cultivated land. International Journal of Remote Sensing 23 (2), 263–284.
Wang, J.F., Christakos, G., Hu, M.G., 2009. Modeling spatial means of surfaces with stratified nonhomogeneity. IEEE Transactions on Geoscience and Remote Sensing 47 (12), 4167–4174.
Wang, J.F., Reis, B.Y., Hu, M.G., Christakos, G., et al., 2011. Area disease estimation based on sentinel hospital records. PLoS ONE 6 (8), e23428.