Imputing censored data with desirable spatial covariance function

Imputing censored data with desirable
spatial covariance function properties using
simulated annealing
L. Sedda, P. M. Atkinson, E. Barca &
G. Passarella
Journal of Geographical Systems
Geographical Information, Analysis,
Theory, and Decision
ISSN 1435-5930
Volume 14
Number 3
J Geogr Syst (2012) 14:265-282
DOI 10.1007/s10109-010-0145-1
1 23
Your article is protected by copyright and
all rights are held exclusively by SpringerVerlag. This e-offprint is for personal use only
and shall not be self-archived in electronic
repositories. If you wish to self-archive your
work, please use the accepted author’s
version for posting to your own website or
your institution’s repository. You may further
deposit the accepted author’s version on a
funder’s repository at a funder’s request,
provided it is not made publicly available until
12 months after publication.
1 23
Author's personal copy
J Geogr Syst (2012) 14:265–282
DOI 10.1007/s10109-010-0145-1
ORIGINAL ARTICLE
Imputing censored data with desirable spatial
covariance function properties using simulated
annealing
L. Sedda • P. M. Atkinson • E. Barca • G. Passarella
Received: 11 June 2010 / Accepted: 25 November 2010 / Published online: 12 December 2010
Ó Springer-Verlag 2010
Abstract When measurements of values that are less than the limit of detection
are reported as not detected, the data are referred to as censored. The non-recording
of values below the limit of detection is common in soil science research although
modelling data affected by censoring can be problematic. This paper develops and
tests a modified version of Spatial Simulated Annealing, called Simulated
Annealing by Variogram and Histogram form, for drawing values for censored
points given a mixed set of observed and censored data. The algorithm aims to
maximise the goodness of fitting between the experimental and theoretical variograms (by allowing variation in its parameters) while the imputed values are constrained to a target histogram form. In practice, the experimental histogram is
estimated by transforming the available data (interval and exact observations) to
quantiles and fitting a plausible distribution. The theoretical distribution of the data
is used to constrain the variogram fitting. The proposed simulated annealing method
is designed to find the optimal spatial arrangement of values, given by the lowest
errors in variogram and histogram fitting and kriging prediction. The accuracy of the
method proposed is assessed on a simulated data set in which the censored point
values are known and compared with the Spatial Simulated Annealing algorithm.
According to the results obtained, the Simulated Annealing by Variogram and
Electronic supplementary material The online version of this article (doi:10.1007/s10109-010-0145-1)
contains supplementary material, which is available to authorized users.
L. Sedda (&)
Spatial Ecology and Epidemiology Group, University of Oxford,
South Park Road, Oxford OX1 3PS, UK
e-mail: [email protected]
P. M. Atkinson
School of Geography, University of Southampton, Highfield, Southampton S017 1BJ, UK
E. Barca G. Passarella
IRSA-CNR, National Research Council, via F. De Blasio 5, 70123 Bari, Italy
123
Author's personal copy
266
L. Sedda et al.
Histogram form (SAVH) approach can be recommended as a useful tool for the
analysis of spatially distributed data with censoring.
Keywords Detection limit Annealing simulation Variogram and histogram
fitting Cross-validation Kriging
JEL Classification
C21 C24 C61 Q19
1 Introduction
In certain circumstances, the attribute of interest may not be observed fully for some
of the required sampling locations, for example, due to the limitations of the
measuring instrument. In particular, when the values to be observed are below the
detection limit of the measuring device, the resulting values are referred to as a
‘‘censored’’ or ‘‘interval’’ observations (or soft data). Censored data can be
expressed either as an interval or single value. They are distinguished from ‘‘exact
observations’’ (or hard data) that are observed above the limit of detection (De
Oliveira 2005). The given definition of a censored observation makes censored data
distinct from a missing observation where the information is unknown. Censored
observations convey information regarding their value and the general distribution
of the whole data set (Orton and Lark 2007). The major difficulties related to
censored data are that they are not missing in a random pattern, but are missing at
one end of the distribution (left and/or right censored) and, in some cases, the
censored observations can have a similar range as the variance of the data set
(De Oliveira 2005; Helsel 2005).
Much research has been undertaken in recent decades with the aim of producing
reasonable values for censored points, according to some properties of the variable
of interest (Porter et al. 1988). Methods of creating complete data by ‘‘filling in’’
censored observations can be classified into single-imputation methods and
multiple-imputation methods. Single imputation involves filling in one value for
each censored location (Caudill 1996; Gibbons 1995; Odell et al. 1992), often
independently from the distribution of the complete sample (censored and noncensored). Multiple imputation replaces each censored value with two or more
plausible values from the joint distribution of possible values, often in a Bayesian
approach (Hopke et al. 2001; Rubin 1987). Recently, maximum likelihood
estimation (ML) and the expectation–maximisation algorithm (EM) were proposed,
although sometimes the direct evaluation of the likelihood for correlated data with
censored values can involve computationally intractable integrals (Abrahamsen and
Benth 2001; Svensson et al. 2006).
Knowledge of the form of the spatial covariance function for the variable of
interest permits constraint of the censored points to values that depend on the spatial
configuration of the sample locations as well as the underlying distribution of the
values. Also, in a regression context, it avoids the potential bias that may arise if
considering the error term as being independent and identically distributed (Agarwal
and Sharma 2003; Bang and Tsiatis 2002; Tsiatis 1990). Advanced spatial
123
Author's personal copy
Imputing censored data with desirable spatial covariance
267
frameworks for the univariate or multivariate context are proposed by Stein (1992),
Knotters et al. (1995), Militino and Ugarte (1999), Abrahamsen and Benth (2001),
Lark and Papritz (2003), De Oliveira (2005), Fridley and Dixon (2007), Orton and
Lark (2007) and Goovaerts (2009). Some of these methods use the Bayesian
statistical framework. An example of a non-Bayesian approach is given by
Abrahamsen and Benth (2001). They proposed the use of kriging (a linear technique
for spatial prediction) with inequality constraints preceded by Gaussian anamorphosis modelling (Rivoirard 1994) of the raw variable. The pre-transformation of
the variable demands a final back transformation, often a complex technique that
introduces more imprecision into the final predictions (Leuangthong and Deutsch
2003). Elements of these issues are taken into account by Goovaerts (2009) who
proposed the use of indicator kriging with multiple thresholds for the values below
the detection limit. The method is applied for continuous data with a cumulative
histogram and requires the modelling of a number of variograms equal to the
number of thresholds applied.
A Bayesian approach can be found in De Oliveira (2005). De Oliveira (2005)
used Gaussian Random Fields to predict the depths of geologic horizon with
censored data that were described spatially by the random field model. This
framework was implemented via a data augmentation method that can also be found
in other papers (Fridley and Dixon 2007; Pardo-Igúzquiza 1998). Data augmentation consists of regarding latent and missing data as parameters to estimate (Meng
and Van Dyk 1999). Although this introduces many more parameters, the
conditional distributions become much easier to sample from simplifying the
design of the algorithm. The principal drawback is slow convergence due to
the large correlation between model parameters and latent data. In the presence of
large numbers of data, the computing time can be large and it can be difficult to
determine when the chain has reached convergence (Liu 2001) and to invert the
covariance matrix (De Oliveira 2005).
This paper develops a method of imputation in the context of a known covariance
function (or variogram; hereafter, the text is standardised to variogram) with
unknown parameters (e.g. sill, nugget, range), a known univariate distribution with
unknown parameters, where one tail of the distribution is censored, and in the
absence of covariates. For the unknown parameters, the true fixed values are
unknown, but can be estimated from initial (starting) values. The main assumption
is that the imputed values at the censored locations are arranged spatially such as to
produce the smallest variogram fitting error (considering the censored and exact
observations), smallest histogram fitting error and smallest kriging errors.
The algorithm, created in the R software environment,1 is an extension, for
censored data with unknown variogram and histogram parameters, of the Spatial
Simulated Annealing (SSA) of Deutsch and Journel (1998) called Simulated
Annealing by Variogram and Histogram form (SAVH). SAVH is based on a
plausible theoretical variogram function and distribution obtained from the raw
data, and it does not constrain the final variogram or distribution to a specific set of
pre-defined variogram parameters (range, nugget or sill) (as in Deutsch and Journel
1
The R project. www.r-project.org.
123
Author's personal copy
268
L. Sedda et al.
1998; Deutsch and Wen 1998). The final result is the imputation of values to the
censored points that represent an important source of information especially when
the censored interval contains values that are the level of a defined risk. The method
aims to increase the speed of computing without sacrificing simplicity and
efficiency.
2 Methods
SAVH is a geostatistical program designed to reduce the uncertainty in the prediction
of values at censored locations, when applied to incomplete data sets (the R code is
available from the corresponding author, while some of the implementation details
are given in the Supplementary Information). The information required a priori to run
SAVH is the histogram form and the variogram form. The term ‘form’ refers to the
general mathematical function describing the variogram and the histogram, without
any specification of its parameters. For example, a possible variogram form is
exponential with unknown values of sill, nugget and range; while for the histogram, a
possible form is Poisson with unknown value of mean.
2.1 SAVH general characteristics
2.1.1 Histogram
The number of censored points and the censored interval are important sources of
information about the unobserved part of the true histogram. The histogram may be
discretised such as to incorporate directly the information given by the censored
observations. Specifically, an approximate estimate of the true histogram can be
obtained by computing the histogram with bin widths equal to the length of the
censored interval, G (Gilliom and Helsel 1984; Saito and Goovaerts 2000;
Huzurbazar 2005). In this paper, the form of the distribution is determined in this
way, but then the parameters estimated are discarded and the form of the
distribution only is used as a constraint in the subsequent imputation process.
2.1.2 Variogram
Depending on various parameters, such as the censored interval and the proportion
of censored points to the total, it may be expected that the variogram total sill
variance is likely to increase when the censored points are added because the
censored points lie outside the range of the exact observations. No further
information is available. However, by drawing a random value for each censored
point, it is possible to simulate an experimental variogram for the complete data set.
This process can be repeated to produce a large number of experimental variograms,
which leads to an envelope of random draws within which the true variogram is
expected to lie. The realisations can be constrained to a specific variogram form,
from which a narrower envelope is obtained. It is within this envelope that the true
variogram (and that estimated by SAVH) is expected to lie. The variogram is thus
123
Author's personal copy
Imputing censored data with desirable spatial covariance
269
considered as a variable in the annealing process. At each iteration, the experimental
variogram is calculated from the new imputed values and the theoretical variogram
parameters (sill, range and nugget) are fitted to the new experimental variogram (by
nlm, see below). While the new fitted variogram parameters may change, the initial
values from which they are calculated are held constant.
2.1.3 Predictions
A problem with ordinary or simple kriging is that it is known to produce conditional
bias. That is, it is biased when predicting extremes: under-predicting highs and overpredicting lows. This means that kriging is not well suited to imputation of censored
values distributed at one tail. In particular, imputation under the constraint of
minimum prediction variance will tend to predict overly (biased) large values for
the censored locations at the lower tail of the distribution. SAVH reduces the biases
introduced by ordinary kriging by conditioning the predictions to the histogram
form and allowing the variogram to vary from that of the observed data, constrained
only to a selected variogram form. Another possibility is the use of disjunctive
kriging which, through indicator coding of the data set, constrains the predictions to
lie within the range of the sample data. The difficulties related to the orthogonal
transformation (often based on Hermite polynomials) of the data make this option
unsuitable for optimisation algorithms.
2.1.4 Combination
The set of three constraints given above are represented as a single objective
function, E:
E ¼ minðw0 WLS þ w1 D þ w2 H Þ
ð1Þ
where D is the departure (in absolute terms) of the Mean Squared Deviation Ratio
error (MSDR, measured using the cross-validation method via ordinary kriging, see
below) from a value of one (Kerry and Oliver 2007c):
D ¼ absð1 MSDRÞ
ð2Þ
and WLS is the weighted least square residual, as defined by McBratney and
Webster (1986), between the experimental and theoretical variograms. WLS with
respect to ordinary least squares (OLS) gives more weight at the shorter lags
(improving the fitting at shorter distances), which is desirable when there is large
variation between the paired comparisons and when the variogram is used in a
kriging context (Oliver 2010). The equations for MSDR and WLS are not recalled
because they are common measures in geostatistical research. Finally, H is the AIC
of the fitted distribution with respect to the target histogram form, and w0, w1 and
w2, are user-defined rescaling coefficients for WLS, D and H respectively. A
common way to choose the rescaling coefficients is to assign values that produce
comparable variation in WLS, D and H (i.e. around one unit—in the present case by
setting w0 = 0.001, w1 = 1 and w2 = 0.01). This is only possible after running the
algorithm and recording the range of variation for WLS, D and H.
123
Author's personal copy
270
L. Sedda et al.
Table 1 Notation
Symbol
Description
Value
n
Number of censored points
62
G
Range value for the censored points
(censored interval)
[0, 5]
T
Temperature during SA
i
Actual number of perturbations so
far in SA
c
Number of perturbations before
decreasing T
E
Value of objective function
D
Ei - Ei?1
Unit
g kg-1
The minimum of E can be solved approximately by Simulated Annealing (SA),
a well established and documented procedure (Alkhamis and Ahmed 2004;
Christakos and Killam 1993; Dueck and Scheuer 1990; Gelman et al. 1995;
Rajasekaran 2000; Triki et al. 2005) for minimisation/maximisation of a complex
objective function (the term ‘complex’ is used to express objective functions with
more than one variable), and used in the past for switching points procedures (i.e.
Macmillan 2001).
2.2 SAVH algorithm flow (following the notation in Table 1)
2.2.1 Input
Censored locations; censored interval G; exact observations; variogram form; and
histogram form.
2.2.2 Output
Values at the censored locations; optimised variogram-related parameters; histogram, variogram and prediction statistics.
Step i
Create the target histogram and variogram forms (supervised).
i1 Estimate the histogram using bin widths equal to G.
i2 Fit the histogram with an a priori distribution form.
i3 Estimate the experimental variogram of the observed data only.
i4 Fit theoretical variogram models and select a suitable general form.
Step ii
Create the initial state.
ii1 Draw n values, vi, from a uniform distribution with range equal to G.
ii2 Allocate the vi values randomly at the n censored locations.
ii3 Merge the n censored observations (new values) with the exact observations.
ii4 Calculate the objective function E (see step iv).
123
Author's personal copy
Imputing censored data with desirable spatial covariance
271
Step iii Exchange the value of one point, belonging to vi, chosen randomly, with a
value sampled from a uniform distribution with range equal to G.
Step iv
Calculate the objective function E.
iv1 Estimate the experimental variogram and then fit with the pre-defined
variogram form. New parameters for the theoretical variogram are obtained
(through the non-linear minimisation function), and a WLS value between the
experimental and theoretical variograms is produced.
iv2 Apply cross-validation via ordinary kriging to estimate the MSDR.
iv3 Fit the pre-defined distribution to the histogram by General Additive Model
for Location, Shape and Scale (GAMLSS) to calculate H.
Step v
Accept or reject the new state.
v1 If the new value of E is smaller than the previous one, accept the change. Else
go to step v2
v2 Calculate the Boltzmann probability, exp(D/T). If it is greater than a random
number generated from the uniform distribution between [0,1], accept the new
state. Otherwise reject it.
Step vi Check if the system is frozen. If the system is not frozen (see termination
criterion), after c perturbations decrease the temperature according to the cooling
schedule and go back to Step iii.
2.3 SAVH computational details
2.3.1 Step i Histogram and variogram forms
To determine the histogram form, the histogram constructed with bin widths equal
to the censored interval was fitted with different distribution families using
generalised additive models for location, scale and shape (GAMLSS) (Rigby and
Stasinopoulos 2005). The goodness of fit was investigated via three measures:
global deviance and the Akaike (AIC) and Bayesian (BIC) information criteria. AIC
and BIC, in particular, take into account the model log-likelihood and a complexity
term. The latter provides the difference between the two criteria: in the AIC, it is a
function of the number of parameters involved, while in the BIC it is a function of
both the number of parameters and the sample size. The same model is applied for
histogram fitting in SAVH step iv3.
GAMLSS is a flexible model allowing several controls on the statistical
distribution of the independent variable in the presence of large skewness and/or
kurtosis (even without explanatory variables). This is possible by relaxing the
hypothesis associated with the exponential family distribution of the response
variable, for which a general distribution family with large skew and/or kurtosis is
allowed. In practice, GAMLSS considers the distribution of the response variable
defined by several parameters related (linearly and/or additively) to the shape and
scale of the distribution and not only to the mean (as in generalised linear models).
123
Author's personal copy
272
L. Sedda et al.
The variogram form was chosen after analysis of the experimental variogram of
the exact observations produced by residual maximum likelihood, which can
provide a robust variogram in the presence of outliers and with a small data set
(Lark 2000a; Kerry and Oliver 2007a, b, c; Webster and Oliver 2001). It was fitted
using a non-linear minimisation function, nlm (Dennis and Schnabel 1983; Ribeiro
and Diggle 2001), defining initial values for the range, partial sill and nugget,
chosen according to the experimental variogram of the exact observations.
2.3.2 Step iii Value exchange
The new value exchanged at step iii is drawn from a uniform distribution (and is not
produced by swapping in values from other locations). This allows the relaxation of
the histogram constraints and avoids the potential bias produced by the approximate
estimation of the histogram form described in Sect. 2.1.1 above.
2.3.3 Step iv1 Variogram fitting and cross-validation
Step iv1 of the SAVH routine was applied because the new experimental variogram
(produced by changing the value at a censored location) is likely to lead to a new set of
variogram model parameters. Similarly to step i, constant initial values of the
variogram parameters are used in the nlm function at each iteration of the algorithm to
produce a new theoretical variogram for each new experimental variogram. A further
constraint prevents final variograms with a nugget term smaller than a pre-fixed noise
value that, in this analysis, was chosen as the instrumental error (0.1 mg kg-1).
The procedure of variogram model fitting, in the presence of distribution
asymmetry and outliers in a small data set, required a log-normal transformation
(Box and Cox 1964) to transform the original distribution to a Gaussian distribution
(Gringarten and Deutsch 2001). Kerry and Oliver (2007a) suggest transforming the
data for skewness less than -1 or greater than ?1. The log-normal transformation
applied to produce the theoretical variogram requires that the values in G are also
log-transformed. Hence, cross-validation (with ordinary kriging) was applied to the
entire log-transformed data set (censored and exact observations) and using the i-th
theoretical variogram.
2.3.4 Steps v and vi: Simulated annealing
The rules applied in SA are now explained. Initially, the temperature, T, was fixed
sufficiently high (the higher the temperature, the more likely an unfavourable
perturbation will be accepted) to keep an acceptance rate greater than 0.5 (Corana
et al. 1987; Kirkpatrick et al. 1983; Metropolis et al. 1953). This choice allows
escape from local minima solutions. The parameter T is controlled by the cooling
schedule (Aarts and Korst 1989; Bouktif et al. 2006; Kuo et al. 2001). In SAVH, the
rule proposed by Bölte and Thonemann (1996) and Ingber (1996) was implemented,
which guarantees an acceptance rule with decreasing probability. Also, it respects
the condition proposed by Geman and Geman (1984): T no larger than the ratio
between the starting temperature and the log value of i. The termination criterion
123
Author's personal copy
Imputing censored data with desirable spatial covariance
273
defined by Kirkpatrick et al. (1983) and Deutsch and Journel (1998) was adopted: if
the desired acceptance is not achieved after three successive temperature decreases,
the system is considered ‘‘frozen’’ and the annealing stops.
2.4 Comparison with SSA on the simulated data set
The accuracy of the SAVH method was evaluated using a simulated data set for
which the values of the censored points were known. In this context, the methods
proposed were compared with the classical SSA (Deutsch and Journel 1998). SSA
takes into account the mean, the variance–covariance matrix and the histogram of
the values. In SSA, there are various options for the objective function that can
include a histogram, a variogram (or indicator variograms), and when covariates are
present, a correlation coefficient and conditional distribution.
Both SAVH and SSA use WLS in the optimisation function. The main
differences between the two algorithms are that:
(a)
In SAVH the variogram and its parameters are variable and not targeted to an
a priori variogram (as in SSA);
(b) In SAVH only the distributional form (histogram) is given. In the classical
SSA a fully described histogram is required;
(c) The MSDR is absent in SSA;
SSA was applied twice: (1) with the variogram of the observed data and a fitted
histogram using the censored interval and observed data (SSAob) (this allows
constraint of the predictions to within the censored range) and (2) with the true
variogram and true histogram (SSAvh) (note that this is not possible in practice, but
is included for reference).
The simulated data set was composed of 120 points randomly distributed
spatially with values in the range 0–61 (in integers) produced from a stationary
spatial Gaussian random field. To generate the values, 1,000 simulations of the
random field were generated by conditional simulation with exponential variogram
parameters: sill = 0.74 (g kg-1)-2, range = 0.98 (km) and nugget = 0.12
(g kg-1)-2. The conditional simulation (Chilès and Delfiner 1999) is conditional
to a specified mean and variance–covariance matrix and is based on kriging
predictions on a random field. Initially, 13 points of the simulated data set were
considered as censored and hence removed from the data in order to allow their
imputation. To assess the sensitivity of the algorithms to the number of censored
locations, the algorithms were also run also for 23 and 40 censored points.
3 Field data
Data were collected during fieldwork in the years 2000 and 2001. The data belong to
the Toxicological Database of the Apulia Soils project (TDB)2 that aimed to digitise
and elaborate the main environmental characteristics of the Apulia Region (Fig. 1).
2
The Apulia Region. http://bdt.regione.puglia.it/home.html.
123
Author's personal copy
274
L. Sedda et al.
Fig. 1 Study area in southern Italy and categorised silt measures
From the TDB database, the soil attribute of total silt was considered which,
because of its statistical properties, constitutes an interesting case to analyse. The
importance of this variable is its relation with plant health and growth, agricultural
management and the quality of groundwater (Van Breemen et al. 1983). The
measure of total silt was obtained in the laboratory on soil samples following the
Italian governmental policy on the official methods for soil chemical analysis
(Ministero per le Politiche Agricole e Forestali 1999), and based on the same policy,
the value is expressed in g kg-1 without decimals. It is a rounded continuous
variable for which the original values are not available, and for this reason, it was
treated as a discrete variable. The total silt is composed of calcium carbonate or
calcite (CaCO3), magnesite (MgCO3) and sodium carbonate (Na2CO3), but is
expressed simply as calcium carbonate because the latter is generally the prevalent
component (Raimo and Napolitano 2003).
Total silt was measured once at each location and was censored because some of
the investigated locations were below the level of detection (the distribution is
called left truncated), so that these values do not belong to the range of the exact
observations. Total silt was measured at 149 locations in the Apulia Region, among
which 62 were censored and 87 were exactly observed. The censored interval has a
range between 0 and 5 g kg-1. As shown in Fig. 1, the locations are concentrated in
the middle of the north and along the coast and administrative boundaries in the rest
of the territory and always at lower altitudes (not shown) because the TDB database
was constructed to help address some of the problems affecting agricultural areas.
The minimum and maximum distances between points for the complete data set,
the exact observations and the censored locations are almost the same. This type of
spatial distribution helps in the choice of separation lag distance (Webster and
Oliver 2001).
123
Author's personal copy
Imputing censored data with desirable spatial covariance
275
4 Results and discussion
The results for the simulated data set are explored first to provide an understanding
of the capabilities of the proposed algorithm.
4.1 Simulated data
SSA provides an alternative to SAVH. SSA requires the target variogram and the
target histogram, which is required to constrain the prediction to within the censored
interval, but neither of these is available in the case of censored data. The only
information known that can be given to SSA is the theoretical (or experimental)
variogram of the observed data and the observed histogram. In fact, providing only the
statistics of the observed data (variogram and histogram) often leads to predictions
that are larger in value than the censored interval at the lower tail of the distribution
(not shown, while the results of SSA using the censored locations with values equal to
half of the detection limit are reported in the Supplementary Information).
Table 2 shows a comparison between SAVH and SSA. When SSA uses the
observed histogram and variogram (i.e. SSAob), it produces the largest error in terms
of the number of correctly predicted unknown values (i.e. the observations
considered as censored), the mean squared difference between the true values and
the predicted values at the censored points and the deviance of the MSDR (for 13
and 23 censored locations) from the optimal value of one. The latter represents the
overall algorithm performance by including the results of the predictions of the
observed locations. With 13 and 23 censored locations to be predicted, SAVH
predicts three times the number of correct values than SSAob. When the number of
censored points is increased to 33% (40 from 120) of the total number of
observations, SAVH is slightly more accurate than SSAob, in terms of MSE, but
with predictions of similar accuracy at the observed locations (in fact, the MSDR
values are identical). A comparison between the experimental variogram produced
by SAVH and the true experimental variogram and its envelopes is reported in the
Supplementary Information.
When SSA is provided with the true histogram as well as the true variogram
(a situation that is not possible in practice), SSAvh outperforms SAVH especially for
23 and 40 censored locations. Only when the proportion of censored locations
compared to the total is lower than 10% do the two methods give similar results. If
only the true variogram is provided (not shown in Table 2), the accuracy decreases
with respect to SSAvh (for SSA with true variogram, the numbers of correctly
predicted locations are 3, 9 and 12, and the MSEs are 0.85, 1.47 and 4.46,
respectively, for 13, 23 and 40 censored locations), and for 40 censored locations, it
equals SAVH. This confirms that the histogram is required to obtain a large number
of correct predictions, especially when the number of censored locations is lower
than 1/3; while with a larger number of censored locations, the histogram does not
increase accuracy.
In relation to the parameters of the original distribution (mean = 14.94 and
variance = 47.56), neither algorithm (SAVH or SSAob) can reproduce the sample
mean and variance when the number of censored locations is increased, although
123
Author's personal copy
276
L. Sedda et al.
Table 2 Number of points correctly predicted (npc), mean squared error of the prediction (MSE) and
mean squared deviation ratio (MSDR) for three different censored point sample sizes (n) by the three
methods for a simulated data set (SAVH, simulated annealing by variogram and histogram form; SSAob,
spatial simulated annealing with variogram from the observed points and histogram from the complete
data; SSAvh, spatial simulated annealing with true variogram and histogram)
n
Range
SAVH
SSAvh
SSAob
npc
MSE
MSDR
npc
MSE
MSDR
npc
MSE
MSDR
13
0–4
9
0.50
0.98
3
1.12
0.78
10
0.02
1.00
23
0–6
15
1.25
0.87
6
3.53
0.77
19
0.12
0.98
40
0–9
12
4.46
0.77
10
8.21
0.77
33
1.00
0.92
Mean
Variance
Mean
Variance
Mean
Variance
Statistics of the final distribution
13
–
14.92
47.74
15.32
47.84
14.94
47.50
23
–
14.61
49.28
17.22
54.12
14.93
47.81
40
–
14.62
53.43
19.84
60.97
14.90
48.03
Computational characteristics for 40 censored locations
UPT (min)
NoP
99
30
38
53,213
12,021
26,564
The mean and variance values refer to the final distribution obtained by the imputing methods. Finally,
the processing time (UPT) and the number of perturbations (NoP) required to reach the convergence are
reported
SAVH is substantially more accurate than SSAob. Finally, the SAVH requires a
longer computation time than SSA (Table 2).
4.2 The Apulia data
4.2.1 Obtaining the histogram form
The exploratory data analysis for total silt considering the exact observations only and
the histogram of the complete data set is summarised in Table 3. The histogram of the
complete data set was estimated by transforming the raw data range (0–587) to bin
widths of five units (0–590) in order to incorporate the censored values in one interval
(the first one). The position of the mean and the values of skewness and kurtosis indicate
a leptokurtic distribution with positive skew, especially for the discretised histogram.
A common distribution choice to fit the discretised histogram with prevalence of
values in the left tail is the Poisson (PO) distribution. In the present case, the Poisson
inverse Gaussian (PIG) distribution (Holla 1966) was another alternative. The PIG
distribution is generated as a Poisson distribution with a random mean parameter,
which has an inverse Gaussian distribution. The PIG provided the best fit (Table 4).
This was also confirmed by an analysis of the randomised quantile residuals where,
for the PIG, the mean was close to zero, the variance was smaller and the
asymmetry was almost zero.
123
Author's personal copy
Imputing censored data with desirable spatial covariance
Table 3 Summary statistics for
the exact observations and the
transformed complete histogram
from the Apulia total silt
variable
Exact
observations
Complete
discretised
histogram
Number of points
87
149
Range (g kg-1)
8–587
0–590
133.20
78.91
Raw mean (g kg-1)
-1 2
Table 4 Results of fitting to the
discretised histogram of total silt
using a Poisson (PO) and a
Poisson Inverse Gaussian (PIG)
distributions
277
Variance ((g kg ) )
13,479
12,032
Standard deviation (g kg-1)
116
109
Skewness
1.40
1.88
Kurtosis
2.15
3.81
PO
Intercept
Global deviance
PIG
4.37
4.36
19,513.30
2,048.76
Akaike information criterion
19,513.30
2,052.76
Bayesian information criterion
19,513.30
2,058.77
Mean (g kg-1)
-4.21
-0.34
Variance ((g kg-1)2)
56.77
2.11
Coefficient of skewness
0.41
-0.05
Coefficient of kurtosis
1.47
1.28
Filliben correlation coefficient
0.89
0.91
Randomised quantile residuals
4.2.2 Obtaining the variogram form
An experimental variogram was estimated, with 15 lags of 10 km each, for the 87
exact observations. The experimental variogram (not shown) did not exhibit
anisotropy and so the analysis proceeded based on an isotropic random field. The
model chosen to fit to the experimental variogram was exponential (variogram
form). The values of 0.37 (g kg-1)-2 for the partial sill, 0.74 (g kg-1)-2 for the
nugget and 12.16 km for the range were input as starting values for the optimisation
in the SAVH algorithms. The variogram initial conditions and, in general, the SA
parameters can affect the final SA results only if the global minimum is not reached.
In fact, simulated annealing should find the same optimum with high probability
repeatedly. To ensure this, the SA was tested several times (with different
parameters for annealing temperature and variogram fitting optimisation).
4.2.3 SAVH results
Table 5 shows the result of running SAVH. Three other results of variogram fitting
are shown to facilitate interpretation: the variogram fitted to (1) the exact
observations only, (2) the complete data set based on the allocation of the desired
123
Author's personal copy
278
L. Sedda et al.
Table 5 SAVH results and theoretical variograms for exact observations, complete data set with midpoint at censored locations and complete data set with random values at the censored locations
Algorithm
Partial sill
(g kg-1)2
Nugget
(g kg-1)2
Nugget-to-sill
ratio
WLS
(g kg-1)2
MSDR
SE
(g kg-1)
SAVH
2.00
1.32
0.39
1,528
0.98
-0.40
Random
2.01
0.99
0.33
1,111
–
2,146
Midpoint
3.07
0.41
0.11
Exact only
0.52
0.43
0.45
43.48
–
–
–
0.22
2.60
SQE
(g kg-1)2
0.94
–
–
13.85
For all the measures refer to the text
values (according to the fitted PIG distribution) to the censored points, but
randomised spatially and (3) the complete data set, but with the censored interval
mid-point allocated to the censored points.
The partial sill, as expected, increases from the 0.52 (g kg-1)2 of the theoretical
variogram of the exact observations to 2 (g kg-1)2 (Table 5). The total sill is more
than three times the total sill of the exact observations. Given that the censored
points are at the left tail of the distribution and with values very different from the
mean of the exact observations, it is expected that the sill increases. SAVH reduces
the nugget-to-sill ratio, with respect to the variogram of the exact observations,
facilitating a reduction in the prediction error variance (Lark 2000b; Marcotte
1995).
The variograms produced by attributing random draws and the midpoint to the
censored locations produced an even smaller nugget-to-sill ratio. This arises because
the variogram parameters are in these cases not constrained to the MSDR and AIC.
Attributing the censored interval midpoint to the censored locations produced the
largest partial sill because, according to the fitted PIG distribution, the censored
interval contains more large values than small ones.
The SAVH model produced a final theoretical variogram within the envelope
obtained from a set of variogram models fitted to experimental variograms
calculated from 100,000 sets of observed and censored data, in which the censored
values were drawn randomly from a uniform distribution (Fig. 2). Finally, the
envelopes were fitted using the same approach for variogram fitting described above
(initial variogram parameter values used in the non-linear minimisation function).
The final variogram produced by SAVH is not statistically different from a model
fitted to a complete data set based on random draws, but is significantly different
from the variogram of the exact observations only.
Regarding the cross-validation results shown in Table 5 (last two columns), the
standard error and the standard quadratic error are both much smaller in SAVHbased kriging relative to exact observations only. Hence, the SAVH method
predicted values and a variogram that increase the general accuracy of the kriging
algorithm.
To provide a visual assessment of the differences revealed in Table 5, Fig. 3
shows the predictions obtained by ordinary kriging based on the theoretical
variogram produced by SAVH (left-hand side) and based on the variogram of the
123
Author's personal copy
279
2
0
1
Semivariance
3
4
Imputing censored data with desirable spatial covariance
0
20000
40000
60000
80000
Distance
Fig. 2 Theoretical variograms for the Apulia data: exact observations (circles), SAVH (triangles) and
envelopes for the random simulation (solid lines)
Fig. 3 Kriging predictions using (1) the fitted theoretical variogram and estimated values at censored
locations from SAVH (left-hand side) and (2) the fitted theoretical variogram of exact observations only
and assuming censored values equal to the detection limit (5 g kg-1) (right-hand side)
exact observations only (right-hand side). In the former case, the censored
observations were estimated by SAVH, while in the latter case the censored
observations were set equal to the detection limit value.
The kriging predictions obtained using the exact observations with censored data
equal to the detection limit exhibit greater continuity with an over-prediction of the
silt content in the centre and South of the Apulia region relative to the kriging
123
Author's personal copy
280
L. Sedda et al.
predictions based on SAVH. In the north, the extreme smoothing under-predicts
(relatively) an area of high silt concentration. The SAVH-based kriging predicted
greater variability across space due to the increase in the variance obtained
including the overall range of the censored locations and permitted reproduction of
the true high concentration of silt in the northern area (see Fig. 1 for comparison).
An observable effect is the restriction spatially of the influence of high silt
concentrations and hence the creation of a halo appearance around such locations.
5 Conclusion
The variogram produced from observed (non-censored) locations can be misleading
if applied to predict the censored locations using classical geostatistical approaches
such as kriging and SSA. In this paper, an extension of the SSA algorithm was
developed to impute values for censored locations while taking into account the
general form of autocorrelation and distributional form implicit in the complete data
set (i.e. observed and censored data). It was demonstrated that for a data set with
less than 20% of censored locations, SAVH can estimate correctly 65-70% of the
censored values. In comparison with SSA (based on the observed data only), the
SAVH algorithm predicted more censored points correctly and resulted in a
consistently smaller prediction error. The algorithm proposed is an attractive
alternative to augmentation or geostatistical approaches for those researchers
interested in imputing censored values while taking into account spatial dependence. The only information required by SAVH is the form of the histogram and the
form of the variogram and starting values for all other parameters.
Acknowledgments This study was supported by a fellowship from the Master and Back program
financed by the Regional Sardinia Government, under agreement between the School of Geography,
University of Southampton (UK) and the Dipartimento di Economia e Sistemi Arbori, University of
Sassari (Italy). Thanks are due to the Apulia Regional Authority for Ecology and the Water Research
Institute of the National Research Council for providing the data used in this study. Finally, the authors
would like to acknowledge Dr. Edith Cheng at the University of Southampton, for inspiring the analysis
and for helpful assistance.
References
Aarts E, Korst J (1989) Simulated annealing and Boltzmann machines—a stochastic approach to
combinatorial optimization and neural computing. Wiley & Sons, New York
Abrahamsen P, Benth FE (2001) Kriging with inequality constraints. Math Geol 33(6):719–744
Agarwal R, Sharma M (2003) Parameter estimation for non-linear environmental models using belowdetection data. Ad Environ Res 7(2):249–261
Alkhamis TM, Ahmed MA (2004) Simulation-based optimization using simulated annealing with
confidence interval. In: Proceedings 2004 Winter simulation conference, IEEE, Washington D.C.,
pp 514–519
Bang H, Tsiatis AA (2002) Median regression with censored cost data. Biometrics 58(3):643–649
Bölte A, Thonemann UW (1996) Optimizing simulated annealing schedules with genetic programming.
Eur J Oper Res 92(2):402–416
123
Author's personal copy
Imputing censored data with desirable spatial covariance
281
Bouktif S, Sahraoui H, Antoniol G (2006) Simulated annealing for improving software quality prediction.
In: Proceedings genetic and evolutionary computation conference GECCO 06, ACM, New York,
pp 1893–1991
Box GEP, Cox DR (1964) An analysis of transformations. J Roy Stat Soc B 26(2):211–252
Caudill SB (1996) Maximum likelihood estimation in a model with interval data: a comment and
extension. J Appl Stat 23(1):97–104
Christakos G, Killam BR (1993) Sampling design for classifying contaminant level using annealing
search algorithms. Water Resour Res 29(12):4063–4076
Corana A, Marchesi M, Martini C, Ridella S (1987) Minimizing multimodal functions of continuous
variables with the ‘‘simulated annealing’’ algorithm. ACM T Math Software 13(3):262–280
De Oliveira V (2005) Bayesian inference and prediction of Gaussian random fields based on censored
data. J Comput Graph Stat 14(1):95–115
Dennis JE, Schnabel RB (1983) Numerical methods for unconstrained optimization and nonlinear
equations. Prentice-Hall, Englewood Cliffs
Deutsch CV, Journel AG (1998) GSLIB Geostatistical Software Library and user’s guide, 2nd edn.
Oxford University Press, New York
Deutsch CV, Wen XH (1998) An improved perturbation mechanism for simulated annealing simulation.
Math Geol 30(7):801–816
Dueck G, Scheuer T (1990) Threshold accepting: a general purpose optimization algorithm appearing
superior to simulated annealing. J Comput Phys 90(1):161–175
Fridley BL, Dixon P (2007) Data augmentation for a Bayesian spatial model involving censored
observations. Environmetrics 18(2):107–123
Gelman A, Roberts G, Gilks W (1995) Efficient Metropolis jumping rules. In: Bernardo JM, Berger JO,
Dawid AP, Smith AFM (eds) Bayesian statistics, vol 5. Oxford University Press, New York,
pp 599–608
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distribution and the Bayesian restoration of
images. IEEE T Pattern Anal 6(6):721–741
Gibbons R (1995) Some statistical and conceptual issues in the detection of low-level environmental
pollutants. Environ Ecol Stat 2(2):125–167
Gilliom RJ, Helsel DR (1984) Estimation of distributional parameters of censored trace-level water
quality data. Water Resour Res 22(2):147–155
Goovaerts P (2009) AUTO-IK: a 2D indicator kriging program for the automated non-parametric
modeling of local uncertainty in earth sciences. Comput Geosci 35(6):1255–1270
Gringarten E, Deutsch CV (2001) Theacher’s aide. Variogram interpretation and modeling. Math Geol
27(5):659–672
Helsel DR (2005) Nondetects and data analysis. Wiley & Sons, New York
Holla MS (1966) On a poisson-inverse Gaussian distribution. Metrika 11(1):115–121
Hopke PK, Liu C, Rubin DB (2001) Multiple imputation for multivariate data with missing and belowthreshold measurements: time-series concentrations of pollutants in the Arctic. Biometrics
57(1):22–33
Huzurbazar AV (2005) A censored data histogram. Commun Stat Simulat 34(1):113–120
Ingber L (1996) Adaptive simulated annealing (ASA): lessons learned. J Control Cybern 25(1):33–54
Kerry R, Oliver MA (2007a) Determining the effect of asymmetric data on the variogram. I. Underlying
asymmetry. Comput Geosci 33(10):1212–1232
Kerry R, Oliver MA (2007b) Determining the effect of asymmetric data on the variogram. II. Outliers.
Comput Geosci 33(10):1233–1260
Kerry R, Oliver MA (2007c) Comparing sampling needs for variograms of soil properties computed by
method of moments and residual maximum likelihood. Geoderma 140(10):383–396
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science
220(4598):671–680
Knotters M, Brus DJ, Oude-Voshaar JH (1995) A comparison of kriging, co-kriging and kriging
combined with regression for spatial interpolation of horizon depth with censored observations.
Geoderma 67(3–4):227–246
Kuo SF, Liu CW, Merkley GP (2001) Application of the simulated annealing method to agricultural
water resource management. J Agr Eng Res 80(1):109–124
Lark RM (2000a) A comparison of some robust estimators of the variogram for use in soil survey. Eur J
Soil Sci 51(1):137–157
123
Author's personal copy
282
L. Sedda et al.
Lark RM (2000b) Estimating variograms of soil properties by the method-of-moments and maximum
likelihood. Eur J Soil Sci 51(4):717–728
Lark RM, Papritz A (2003) Fitting a linear model of coregionalization for soil properties using simulated
annealing. Geoderma 115(3–4):245–260
Leuangthong O, Deutsch CV (2003) Stepwise conditional transformation for simulation of multiple
variables. Math Geol 35(2):155–173
Liu C (2001) The art of data augmentation: discussion. J Comput Graph Stat 10(1):75–81
Macmillan W (2001) Redistricting in a GIS environment: an optimisation algorithm using switchingpoints. J Geogr Syst 3(2):167–180
Marcotte D (1995) Generalized cross-validation for covariance model selection. Math Geol
27(5):659–672
McBratney AB, Webster R (1986) Choosing functions for semivariograms of soil properties and fitting
them to sampling estimates. J Soil Sci 37(4):617–639
Meng XL, Van Dyk DA (1999) Seeking efficient data augmentation schemes via conditional and
marginal augmentation. Biometrika 86(2):301–320
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculations by
fast computing machines. J Chem Phys 21(6):1087–1092
Militino AF, Ugarte MD (1999) Analyzing censored spatial data. Math Geol 31(5):551–561
Ministero per le Politiche Agricole e Forestali (1999) Metodi ufficiali di analisi chimica del suolo.
Gazzetta Ufficiale Supplemento Ordinario 248:1–162
Odell PM, Anderson KM, D’Agostino RB (1992) Maximum likelihood estimation for interval censored
data using a weibull-based accelerated failure time model. Biometrics 48(3):951–959
Oliver MA (2010) The variogram and kriging. In: Fischer MM, Getis A (eds) Handbook of applied spatial
analysis. Springer, Berlin, pp 319–352
Orton TG, Lark RM (2007) Estimating the local mean for Bayesian maximum entropy by generalized
least squares and maximum likelihood, and an application to the spatial analysis of a censored soil
variable. Eur J Soil Sci 58(1):60–73
Pardo-Igúzquiza E (1998) Optimal selection of number and location of rainfall gauges for areal rainfall
estimation using geostatistics and simulated annealing. J Hydrol 210(1–4):206–220
Porter PS, Ward RC, Bell HF (1988) The detection limit. Water quality monitoring data are plagued with
levels of chemicals that are too low to be measured precisely. Environ Sci Technol 22(8):856–861
Raimo F, Napolitano A (2003) Studio della distribuzione spaziale di alcuni parametri chimici. Il Tabacco
11:11–17
Rajasekaran S (2000) On simulated annealing and nested annealing. J Global Optim 16(1):43–56
Ribeiro PJ Jr, Diggle PJ (2001) geoR: a package for geostatistical analysis. R-NEWS 1(2):15–18
Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location, scale and shape, (with
discussion). Appl Stat-J Roy St C 54(3):507–554
Rivoirard J (1994) Introduction to disjunctive kriging and non-linear geostatistics. Clarendon, Oxford
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley & Sons, New York
Saito H, Goovaerts P (2000) Geostatistical interpolation of positively skewed and censored data in a
Dioxin-contaminated site. Environ Sci Technol 34(19):4228–4235
Stein ML (1992) Prediction and inference for truncated spatial data. J Comput Graph Stat 1(1):354–372
Svensson I, Sjöstedt-De Luna S, Bondesson L (2006) Estimation of wood fibre length distributions from
censored data through an EM algorithm. Scand J Stat 33(3):503–522
Triki E, Collette Y, Siarry P (2005) A theoretical study on the behaviour of simulated annealing leading to
a new cooling schedule. Eur J Oper Res 166(1):77–92
Tsiatis AA (1990) Estimating regression parameters using linear regression rank test for censored data.
Ann Stat 18(1):354–372
Van Breemen N, Mulder J, Driscoll CT (1983) Acidification and alkalinization of soils. Plant Soil
75(3):283–308
Webster R, Oliver MA (2001) Geostatistics for environmental scientists. Wiley & Sons, Chichester
123