Part 3: Estimation and Modeling of Spatial Correlations
1 Notations and Definitions
Denote by {Y(s), s ∈ R} a spatial process on an r-dimensional domain R ⊂ ℝʳ; usually, though not necessarily, r = 2. The mean process is E(Y(s)) = µ(s) with s ∈ R. We also assume that the variance of Y(s) exists for all s ∈ R.
A common assumption is that Y is a Gaussian process, which implies that for any k ≥ 1 and locations s1, . . . , sk, the joint distribution of the random vector (Y(s1), . . . , Y(sk)) is multivariate normal.
Stationarity.
• The process is said to be strictly stationary (also called strongly stationary) if, for any given
n ≥ 1, any set of n sites {s1 , . . . , sn } and any h ∈ Rr , the distribution of (Y (s1 ), . . . , Y (sn )) is the
same as that of (Y (s1 + h), . . . , Y (sn + h)).
• A less restrictive condition is given by weak stationarity (also called second-order stationarity): a process is weakly stationary if µ(s) = µ and Cov(Y(s), Y(s + h)) = C(h) for all h ∈ ℝʳ such that s and s + h both lie within R.
• Assume E[Y(s + h) − Y(s)] = 0 and define
E[Y(s + h) − Y(s)]² = Var(Y(s + h) − Y(s)) = 2γ(h).
This expression makes sense when the left-hand side depends only on h, and not the particular
choice of s. If this is the case, we say the process is intrinsically stationary.
Isotropy.
• If the function γ(h) depends upon the separation vector only through its length ||h||, then we
say that the process is isotropic; if not, we say it is anisotropic.
• If the process is intrinsically stationary and isotropic, it is also called homogeneous.
Variogram, covariogram and correlogram. If Y is a stationary spatial process, its covariance depends only on the separation vector:
C(s, s′) = C(s − s′) = C(h),
and C is called the covariance function or covariogram. The correlogram is the scaled version
ρ(h) = C(h)/σ²,
where σ² = Var(Y(s)) = C(0).
In geostatistics, second-order properties are usually described by the semivariogram. If the variance of the difference in responses, Var(Y(s′) − Y(s)), depends only on the relative locations, i.e., is a function of h = s′ − s, then this function is called the variogram:
2γ(h) = Var(Y(s + h) − Y(s)).
If, in addition, Y has a constant mean (e.g. a residual process), then Y is intrinsically stationary as defined above. Using the semivariogram γ (= half of the variogram) leads to nicer formulae, e.g., for stationary Y:
γ(h) = σ² − C(h).
Therefore, given C, we can determine γ. But what about the converse: can we, in general, recover C from γ? From the equation above, this is possible once we know C(0). However, in the relationship γ(h) = C(0) − C(h) we can add a constant to the right side, so C(h) is not identified. If the spatial process is ergodic, then C(h) → 0 as ||h|| → ∞, and hence γ(h) → C(0) as ||h|| → ∞. Therefore, C(h) is well defined under ergodicity, since
C(h) = C(0) − γ(h) = lim_{||u||→∞} γ(u) − γ(h).
The limit C(h) → 0 does not always exist. For example, if Z(s) is standard Brownian motion in one dimension, then V(Z(s1) − Z(s2)) = |s1 − s2|, which tends to ∞ as |s1 − s2| → ∞.
For much of the theory of spatial processes, the principal assumption required is intrinsic stationarity. From this point of view, the stronger forms of stationarity are not needed. On the other
hand, either strict or second-order stationarity are more natural assumptions — for example, as
with time series analysis, it is often very useful to think of an observed process as consisting of a
deterministic trend superimposed on an underlying stationary field. For this reason, it is a good
idea to be cautious when a preliminary analysis of the data indicates that the process is intrinsically
stationary but not stationary. It may well be that the process is best approached by first looking
for a trend with stationary residuals.
Since γ is called the variogram in Bailey & Gatrell (1995), as well as in the spatial and geoR R packages, we will use the same terminology throughout the course.
2 Variogram models
Sample variogram. Given the observed data {Y(si), i = 1, . . . , n} (e.g. the residuals after eliminating the trend), the classical method-of-moments (MoM) estimate of the variogram at separation h is given by
γ̂(h) = (1 / (2|N(h)|)) Σ_{(si,sj)∈N(h)} (Y(si) − Y(sj))²    (1)
where
N(h) = {(si, sj) : si − sj = h}
and |N(h)| is the number of distinct pairs in N(h). Usually, one needs to "grid up" the h-space into intervals I1 = (0, h1), . . . , IK = (hK−1, hK) for 0 < h1 < . . . < hK and represent each interval by its midpoint. Further, redefine
N(hk) = {(si, sj) : ||si − sj|| ∈ Ik},  k = 1, . . . , K.
Moreover, one may use a moving window T(h), some small neighborhood or tolerance region for h; then
N(h) = {(si, sj) : si − sj ∈ T(h)}.
Journel and Huijbregts (1978) recommended choosing T(h) large enough to contain at least 30 pairs of points, and this can still be recommended as a rule of thumb.
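As an illustration, the binned estimator (1) can be sketched in a few lines of R; the function name empirical_variogram and the bin choices below are ours, not from any package.

empirical_variogram <- function(coords, y, breaks) {
  d  <- as.matrix(dist(coords))               # pairwise distances ||s_i - s_j||
  sq <- outer(y, y, "-")^2                    # squared differences (Y(s_i) - Y(s_j))^2
  i  <- which(upper.tri(d))                   # each distinct pair once
  bin <- cut(d[i], breaks = breaks)           # assign pairs to the intervals I_1, ..., I_K
  gamma_hat <- tapply(sq[i], bin, mean) / 2   # (1 / (2|N(h_k)|)) * sum of squared differences
  npairs    <- tapply(sq[i], bin, length)     # |N(h_k)|; ideally at least 30 per bin
  mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  data.frame(h = mids, gamma = as.numeric(gamma_hat), n = as.numeric(npairs))
}

# Example: 100 random sites with independent noise (the true variogram is flat).
set.seed(1)
coords <- cbind(runif(100), runif(100))
y <- rnorm(100)
vg <- empirical_variogram(coords, y, breaks = seq(0, 1, by = 0.1))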
Some of the drawbacks of the sample variogram are that: 1. it is sensitive to outliers; 2. a sample average of squared differences can be badly behaved; 3. it uses data differences rather than the data itself; and 4. the components of the sum will be dependent within and across bins, and |N(hk)| will vary across bins.
A more subtle objection arises from the skewness of the distribution: if we assume the process to be Gaussian, then for a specific s and h, the distribution of {Y(s + h) − Y(s)}² is of the form 2γ(h)χ²₁, and the χ²₁ distribution is highly skewed. However, if X ∼ χ²₁, then X^{1/4} has a nearly symmetric distribution, so we would expect |Y(s + h) − Y(s)|^{1/2} to be much better behaved. This idea lies at the heart of the proposal made by Cressie and Hawkins (1980). They suggested
2γ̄(h) = (1 / (0.457 + 0.494/|N(h)|)) { (1/|N(h)|) Σ_{(si,sj)∈N(h)} |Y(si) − Y(sj)|^{1/2} }⁴    (2)
as an approximately unbiased estimator of 2γ(h). The first factor in this estimator is a bias-correction term. A numerical study reported by Cressie (1993, pages 80–82) showed that, under a Gaussian model with added Gaussian noise containing 5% contamination, the estimator in (2) always has smaller bias than the estimator in (1) and may also have smaller variance, the latter depending on the signal-to-noise variance ratio g (interpreting "signal" as the spatial variogram of the Gaussian model and "noise" as the added contaminated component). For large g, γ̂(h) has smaller variance than γ̄(h), as might be expected from the fact that γ̂(h) is also the maximum likelihood estimator if we neglect contamination, but for small g this comparison is reversed. Based on these comparisons, Cressie recommended that both estimates be computed and compared. A possible counter-argument is that if Y(s) has a marginal distribution that is far from normal (though not necessarily with any outlier contamination), the whole argument leading to (2) as an approximately unbiased estimator would appear to be invalid.
Another possible "robust" estimator is given by
2γ̃(h) = median{(Y(si) − Y(sj))² : (si, sj) ∈ N(h)} / 0.457    (3)
where 0.457 is a bias-correction factor. We can interpret the estimator in (3) as the squared interquartile range (modulo a multiplicative factor) of {(Y(si) − Y(sj)) : (si, sj) ∈ N(h)}, which might also be thought of as a natural robust measure of the scale of Y(s + h) − Y(s) over s ∈ R for fixed h.
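Both robust estimators are easy to sketch in R for a single bin of pairs; diffs below stands for the differences Y(si) − Y(sj) over the pairs in N(h) (an illustrative name, not a package function).

# Cressie-Hawkins estimator (2): fourth power of the mean square-rooted absolute
# difference, divided by the bias-correction factor; returns the semivariogram gamma-bar.
cressie_hawkins <- function(diffs) {
  n <- length(diffs)
  (mean(sqrt(abs(diffs)))^4 / (0.457 + 0.494 / n)) / 2
}

# Median-based estimator (3): median of squared differences with the 0.457
# bias correction; returns gamma-tilde.
median_variogram <- function(diffs) {
  (median(diffs^2) / 0.457) / 2
}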
Graphical assessment of the variogram estimate. Informally, one plots γ̂(h), and then an appropriately shaped theoretical variogram is fit by eye or by trial and error to choose the nugget, sill, and range. Commonly, one should evaluate all three variogram estimators and compare them.
Another way to present these quantities is as a "variogram cloud". This method of computing the variogram is available when there are multiple replications of the spatial field. This assumption may be satisfied for data observed over many years, which, at least for the present analysis, we treat as independent from year to year. In the variogram cloud, one point is plotted for each pair of locations (si, sj). The distance between locations (si, sj), say dij, is plotted along the x-axis, and an estimate of V(Y(si) − Y(sj)) is plotted along the y-axis. For the latter, we may use either the MoM or the robust method. An additional remark is that when plotting multiple replicates, for comparison purposes, we do not use the raw data but standardized residuals from a time-dependent fit, so the sample means and standard deviations at each location have already been adjusted to be approximately 0 and 1, respectively.
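A variogram cloud is a one-liner once the pairwise quantities are in hand; this sketch reuses coords and y from the earlier example.

d  <- as.matrix(dist(coords))
sq <- outer(y, y, "-")^2 / 2                 # semivariogram contribution of each pair
i  <- which(upper.tri(d))
plot(d[i], sq[i], pch = ".", xlab = "distance d_ij",
     ylab = "(Y(s_i) - Y(s_j))^2 / 2", main = "Variogram cloud")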
Conditions for valid covariograms. In order for Σ to be a valid covariance matrix for any configuration of sampling locations in R, the covariogram must be
• symmetric: C(h) = C(−h), implying Σij = Σji;
• positive-definite: Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ ai aj C(si, sj) ≥ 0 for all finite configurations {si ∈ R, i = 1, . . . , n} and all real numbers {ai, i = 1, . . . , n}.
A simple numerical check of the second condition is sketched below.
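This is a numerical illustration (a sketch, not a proof): build Σ from a covariogram on random sites and confirm that its smallest eigenvalue is non-negative. The exponential covariogram used here is defined in the examples further below.

expcov <- function(h, sigma2 = 1, theta = 0.2) sigma2 * exp(-h / theta)
set.seed(2)
s <- cbind(runif(50), runif(50))
Sigma <- expcov(as.matrix(dist(s)))          # Sigma_ij = C(s_i - s_j)
min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding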
One way to check whether C is valid is via the spectral/Fourier decomposition. That is, C is a valid covariance function if and only if it is the characteristic function of a random variable symmetric about 0 (Bochner's theorem), i.e.,
C(h) = ∫ · · · ∫ cos(wᵀh) G(dw).
The inversion formula can be used to check whether C(h) is valid: if
∫ · · · ∫ |C(h)| dh < ∞, then C(h) = ∫ · · · ∫ cos(wᵀh) g(w) dw.
The necessary and sufficient condition for positive definiteness is then that g(w) ≥ 0 for all w.
Using the properties of characteristic functions, we can also construct valid covariance functions.
There is a corresponding theory for the variogram. Suppose γ(·) is the semivariogram of a second-order stationary process; then, by a combination of the equations above, if {ai, i = 1, . . . , n} are constants with Σᵢ₌₁ⁿ ai = 0, we have
Σᵢ Σⱼ ai aj γ(si − sj) ≤ 0.
This is a conditional non-positive-definiteness condition. It can easily be seen that the equation above is a necessary condition for γ(·) to be a valid semivariogram in the general (intrinsically stationary) case; the converse result is described in detail by Cressie (1993).
Isotropic covariogram examples. Instead of estimating the variogram empirically, one can also use parametric functions. Below are a few examples of covariogram functions and their corresponding variograms and correlograms (see the sketch after the list).
• Exponential covariance function:
C(h; θ) = σ² exp(−|h/θ|) for h > 0, and σ² otherwise.
• Gaussian covariance function:
C(h; θ) = σ² exp(−(h/θ)²) for h > 0, and σ² otherwise.
• Powered exponential covariance function:
C(h; θ) = σ² exp(−|h/θ|ᵖ) for h > 0, and σ² otherwise.
• Spherical covariance function:
C(h; θ) = 0 for h > θ; σ²(1 − (3/2)(h/θ) + (1/2)(h/θ)³) for 0 < h ≤ θ; and σ² otherwise.
The corresponding variograms are:
• Exponential:
γ(h; θ) = σ²(1 − exp(−|h/θ|)) for h > 0, and 0 otherwise.
Simpler in functional form than the spherical case (and valid for all d), but without the finite range.
• Gaussian:
γ(h; θ) = σ²(1 − exp(−(h/θ)²)) for h > 0, and 0 otherwise.
• Powered exponential:
γ(h; θ) = σ²(1 − exp(−|h/θ|ᵖ)) for h > 0, and 0 otherwise.
Here 0 < p ≤ 2. This form generalizes both the exponential and Gaussian variograms, and forms
the basis for the families of spatial covariance functions introduced by Sacks et al. (1989), though
in generalizing the results from one dimension to higher dimensions, these authors used a product
form of covariance function in preference to constructions based on isotropic processes.
• Spherical:
γ(h; θ) = σ² for h > θ; σ²((3/2)(h/θ) − (1/2)(h/θ)³) for 0 < h ≤ θ; and 0 otherwise.
The spherical variogram is valid if d = 2 or 3, but for higher dimensions it fails the conditional non-positive-definiteness condition. It is a convenient form because it increases from 0 when h is small, levelling off at the constant σ² at h = θ. This is of the "nugget/range/sill" form which is often considered a realistic and interpretable form for a semivariogram.
The corresponding correlograms of the variograms above are obtained by dividing the covariogram functions by σ².
The Matérn class. Another example of a correlation function is provided by the Matérn family of correlograms, defined as a function of the spatial distance h and two further parameters:
ρ(h; ν, φ) = (1 / (2^{ν−1} Γ(ν))) (φh)^ν Kν(φh),  ν, φ > 0,
where Kν is a modified Bessel function (see Matérn, 1960), φ is a parameter that defines the extent of spatial dependence (the spatial scale parameter), and ν is a smoothing parameter that controls the smoothness of the dependence (the shape parameter). The value ν = 0.5 gives the exponential model. The Gaussian model is an extreme case of the Matérn family, obtained as ν → ∞.
• ν controls how many times ρ is differentiable at h = 0;
[Figure 1: Examples of correlation functions: exponential, Gaussian and spherical, each with θ = 0.8 and θ = 1.8.]
• In two dimensions, the greatest integer in ν indicates the number of times process realizations will be mean-square differentiable.
Diggle et al. (2003, sec. 2.4) suggest that an appropriate value for ν can be chosen to reflect scientific knowledge about the smoothness of the process under study. They also give more details, references and a comparison of the properties of different model families. This matters because the correlation function involves two parameters, and its estimation is simplified if ν is considered fixed.
To clarify the effect of ν, let us look at simulations of one-dimensional Gaussian processes with Matérn correlation functions, varying the value of ν and adjusting the correlation function so that the practical range (as defined later in this section) is the same for all models; see Figure 2.
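The Matérn correlogram is available through base R's modified Bessel function besselK; this sketch handles h = 0 explicitly, where the limit is 1, and checks the ν = 0.5 exponential special case.

matern_cor <- function(h, nu, phi) {
  rho <- rep(1, length(h))                        # rho(0) = 1
  pos <- h > 0
  x <- phi * h[pos]
  rho[pos] <- (2^(1 - nu) / gamma(nu)) * x^nu * besselK(x, nu)
  rho
}
# nu = 0.5 reproduces the exponential correlation exp(-phi * h):
h <- seq(0.01, 2, by = 0.01)
max(abs(matern_cor(h, nu = 0.5, phi = 1) - exp(-h)))   # ~ 1e-16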
Nugget parameter. A positive constant may be added to the variogram models to represent the nugget effect:
γ(h; θ, σ², τ²) = γ(h; θ, σ²) + τ² for h > 0, and 0 for h = 0,
and the corresponding covariogram and correlogram are:
C(h; θ, σ², τ²) = C(h; θ, σ²) for h > 0, and σ² + τ² for h = 0;
ρ(h; θ, σ², τ²) = (σ² / (σ² + τ²)) ρ(h; θ) for h > 0, and 1 for h = 0.
To simplify the last expression, models with a nugget effect are sometimes parameterized in terms of the proportion of nugget effect:
α = τ² / (τ² + σ²).
When using existing geostatistics software, we will need to pay attention to differences in terminology. For example, in the geoR package, σ² is in fact σ² + τ².
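A sketch of the nugget-augmented correlogram, reusing matern_cor from above (the choice of a Matérn base model is ours):

cor_nugget <- function(h, nu, phi, sigma2, tau2) {
  ifelse(h > 0, (sigma2 / (sigma2 + tau2)) * matern_cor(h, nu, phi), 1)
}
alpha <- function(sigma2, tau2) tau2 / (tau2 + sigma2)   # proportion of nugget effect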
[Figure 2: 1D Gaussian processes with Matérn correlation for three values of the smoothing parameter ν (ν = 0.5, 1, ∞).]
Anisotropic cases. Under anisotropy, the spatial association depends upon the separation vector between locations (i.e. direction as well as distance). There are a number of ways of dealing with anisotropic processes as more or less direct generalizations of isotropic processes. The simplest of these is geometric anisotropy. This refers to a covariance function of the form
C(s − s′) = σ² ρ((s − s′)ᵀ A (s − s′))
where ρ is a valid correlation function and A a d × d matrix assumed positive definite, representing a linear transformation of ℝʳ. Of course, if A is the identity this reduces to the isotropic case. The idea is that the process is not isotropic in the original space, but is in some linearly transformed space, which may for example correspond to stretching and/or rotating the original coordinates s to s′ by
to s0 by
µ
¶µ
¶
1 0
cos(φA ) − sin(φA )
0 T
T
(s ) = s
sin(φA ) cos(φA )
0 φ−1
A
and modelling the correlation as a function of distance between the transformed coordinates s0 .
The anisotropy angle φA can be estimated using a maximum likelihood approach.
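A sketch of geometric anisotropy in R: transform the coordinates by rotation and stretching as above, then apply any isotropic correlation to the transformed distances (the exponential choice and the parameter name ratio are illustrative).

aniso_transform <- function(s, phi_A, ratio) {
  R <- matrix(c(cos(phi_A), sin(phi_A), -sin(phi_A), cos(phi_A)), 2, 2)  # rotation
  s %*% R %*% diag(c(1, 1 / ratio))                                      # stretch 2nd axis
}
aniso_cor <- function(s, phi_A, ratio, theta) {
  d <- as.matrix(dist(aniso_transform(s, phi_A, ratio)))
  exp(-d / theta)         # isotropic exponential correlation in the new coordinates
}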
A generalization of anisotropy arises from the simple observation that if Z1 , . . . , Zp are independent intrinsically stationary processes, then
Z = Z1 + . . . + Zp
is also intrinsically stationary, with semivariogram given by
γ(h) = γ1 (h) + . . . + γp (h),
where γ1 , . . . , γp denote the semivariograms of Z1 , . . . , Zp , respectively. Thus
γ(h) = Σᵢ₌₁ᵖ γ₀(Aᵢh),
where γ₀ is an isotropic semivariogram and A₁, . . . , Aₚ are matrices, is a valid semivariogram generalizing geometric anisotropy. This is called zonal anisotropy.
[Figure 3: A graphical display of the nugget, sill and range parameters.]
Anisotropy may be assessed using directional semivariograms, and a rose diagram created from the directional semivariograms can be used to evaluate geometric anisotropy. Another diagnostic plot is the empirical semivariogram contour plot.
The nugget, the sill and the range. Several of these semivariograms have the general shape of Figure 3. We always have γ(0) = 0, but γ increases from a non-negative value near h = 0 (the nugget) to a limiting value (the sill), which is either attained at a finite value h = R (the range) or else approached asymptotically as h → ∞. In the latter case there is still a scale parameter, which we may denote by R, and which may be defined precisely as the value of h at which γ(h) comes within a specified distance of its limiting value. The cases where the nugget is strictly positive may appear paradoxical because they imply a discontinuity in the covariance function, but in fact this is a well-known feature of spatial data. There are various possible explanations, the simplest being that there is some residual white noise over and above any smooth spatial variation.
Examples for the range parameter. The range of the spherical model is clearly θ. For models of the Matérn family, however, the range is infinite. Correlation functions derived from compact kernels have finite range; these are mainly useful for modeling correlation in large data sets, as one can exploit the sparsity of the correlation matrix when inverting it.
For monotone variograms, the practical or relative range is defined as the distance, say h.95, at which
γ(h.95; θ, σ², τ²) = 0.95(σ² + τ²).
For example, for the exponential family without a nugget, solving σ²(1 − exp(−h.95/θ)) = 0.95σ² gives exp(−h.95/θ) = 0.05, so h.95 ≈ 3θ, since log(0.05) ≈ −3.
The nugget τ² is often viewed as a "non-spatial effect variance", and the partial sill σ² is viewed as a "spatial effect variance".
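The practical range can also be found numerically with uniroot; this sketch uses the Matérn correlogram from above without a nugget, solving ρ(h) = 0.05.

practical_range <- function(nu, phi, upper = 100) {
  uniroot(function(h) matern_cor(h, nu, phi) - 0.05,
          lower = 1e-6, upper = upper)$root
}
practical_range(nu = 0.5, phi = 1)   # exponential case: -log(0.05) ~ 3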
3 Fitting Variogram Models
Empirical variogram (sample variogram) estimators are useful tools for exploring spatial correlation,
but they cannot be used directly in spatial regression or prediction, because they do not necessarily imply a valid spatial process.
[Figure 4: Variograms for θ = 0.8 but equivalent practical range: exponential, spherical and Gaussian.]
If Y is a stationary process with covariogram C, then Y = (Y(s1), . . . , Y(sn))
has covariance matrix Σ where
Σij = C(si − sj ).
Nontrivial restrictions must be placed on C to ensure that Σ is indeed a valid covariance matrix
(it may fail the conditional non-positive-definiteness). Thus it is possible that spatial predictions
derived from such estimators will appear to have negative variances; in particular, we will see that
both spatial regression and prediction will need the inverse matrix Σ−1 . The most common way
of avoiding these difficulties is to replace the empirical estimator of γ(h) by some parametric form
which is known to be conditionally non-positive-definite, such as one of the families listed in the
previous section. It may well be considered desirable on general statistical modeling grounds to seek
a parametric family which adequately models the observed data, but this provides an additional
and specific motivation to do that. Note that in general there is no need to restrict ourselves to
isotropic models, though it is usually convenient to consider isotropic models first.
Rather than stopping at the sample variogram, one can fit a parametric shape for the variogram or covariogram using the procedures presented below.
Least squares estimators (LSE). Suppose we have estimated the semivariogram γ(h) at a finite set of values of h, and wish to fit a model specified by the parametric function γ(h; θ) in terms of a finite parameter vector θ. This could, for instance, be any of the isotropic forms considered in Section 2, where θ contains the nugget, sill, range and any incidental parameters. For definiteness we shall assume the MoM estimator γ̂(h) has been used; let γ̂ denote the vector of estimates and γ(θ) the vector of model values at the same h values. Common approaches to least squares estimation are
• Ordinary least squares (OLS), in which we choose the estimate of θ to minimize
(γ̂ − γ(θ))ᵀ(γ̂ − γ(θ)).
• Generalized least squares (GLS), in which we choose θ to minimize
(γ̂ − γ(θ))ᵀ V(θ)⁻¹ (γ̂ − γ(θ))
where V(θ) denotes the covariance matrix of γ̂, which, since the problem is non-linear, depends on the unknown θ.
• Weighted least squares (WLS), in which we choose θ to minimize
(γ̂ − γ(θ))ᵀ W(θ) (γ̂ − γ(θ))
where W(θ) is a diagonal matrix whose diagonal entries are the reciprocals of the variances of the entries of γ̂. Thus WLS allows for the variances of γ̂ but not the covariances, while GLS allows for both. The weights in W(θ) may be taken proportional to |N(h)| or inversely proportional to the approximate variance of γ̂(h) (Cressie, 1985).
In general, we expect the three estimators OLS, WLS and GLS to be in increasing order of efficiency but in decreasing order of convenience. Note, in particular, that OLS is immediately implementable by a nonlinear least squares procedure, whereas WLS and GLS require specification of the matrices W(θ) and V(θ).
For a Gaussian process, however, we have the expressions
V((Y(s + h) − Y(s))²) = 2(2γ(h))²,    (4)
corr((Y(s1 + h1) − Y(s1))², (Y(s2 + h2) − Y(s2))²) =
[γ(s1 − s2 + h1) + γ(s1 − s2 − h2) − γ(s1 − s2 + h1 − h2) − γ(s1 − s2)]² / (4γ(h1)γ(h2)),    (5)
which may be used to evaluate the matrices W(θ) and V(θ). Thus GLS is possible in principle, but complicated to implement. For example, there is no guarantee that the resulting minimization problem has a unique solution. As a compromise, Cressie (1985) proposed the approximate WLS criterion: if γ̂ is evaluated on a finite set {hj}, choose θ to minimize
Σⱼ |N(hj)| (γ̂(hj)/γ(hj; θ) − 1)².    (6)
Note that (6) may be derived as the WLS solution under the approximation
V(2γ̂(h)) ≈ 8γ²(h)/|N(h)|,
which follows from (4) if we assume that the individual Y(si) − Y(sj) terms are independent. This assumption is of course not exactly satisfied, but may nevertheless be a reasonable approximation if the pairs (si, sj) lying in N(h) are widely spread over the sampling space. The criterion (6) is no more difficult to implement than OLS, and may be expected to be substantially more efficient, while avoiding the complications of GLS.
Likelihood-based approach. Assume that (possibly after a transformation) Y = (Y(s1), Y(s2), . . . , Y(sn)) is multivariate Gaussian with
• mean function µ(s) = xᵀ(s)β; and
• covariance matrix defined by a covariogram C(h; θ) from a valid family specified up to a parameter vector θ.
We estimate all unknown parameters β and θ simultaneously on the basis of the likelihood function.
We can incorporate deterministic linear regression terms with no essential change in the methodology, so the model we shall consider is
Y ∼ N (Xβ, Σ)
with Y an n-dimensional vector of observations, X an n × q matrix of known regressors (q < n; X
of full rank), β a q-vector of unknown regression parameters and Σ the covariance matrix of the
observations. In many applications we may assume
Σ = αV (θ)
where α is an unknown scale parameter and V (θ) is a matrix of standardized covariances determined
by the unknown parameter vector θ. Consequently, the negative log likelihood is given by
l(β, α, θ) = (n/2) log(2π) + (n/2) log(α) + (1/2) log|V(θ)| + (1/(2α)) (Y − Xβ)ᵀV(θ)⁻¹(Y − Xβ).
If we denote
β̂(θ) = (XᵀV(θ)⁻¹X)⁻¹ XᵀV(θ)⁻¹Y,
G²(θ) = (Y − Xβ̂(θ))ᵀ V(θ)⁻¹ (Y − Xβ̂(θ)),
we can further simplify the log-likelihood function as
l(β̂(θ), α, θ) = (n/2) log(2π) + (n/2) log(α) + (1/2) log|V(θ)| + (1/(2α)) G²(θ).    (7)
It is possible to minimize (7) numerically with respect to α and θ, or alternatively to minimize it analytically with respect to α by defining
α̂(θ) = G²(θ)/n.
In this case we have to minimize, with respect to θ, the function
l*(θ) = l(β̂(θ), α̂(θ), θ) = (n/2) log(2π) + (n/2) log(G²(θ)/n) + (1/2) log|V(θ)| + n/2.    (8)
The quantity (7) or (8) is often called a profile negative log-likelihood, to reflect the fact that it is computed from the negative log-likelihood by minimizing analytically over some of the parameters. The method given here is essentially that first proposed by Kitanidis (1983) and by Mardia and Marshall (1984). To calculate (8), the key element is the Cholesky decomposition, which enables us to write V = LLᵀ, where L is a lower triangular matrix. Hence if we write the initial model equations in the form
Y = Xβ + η, η ∼ N (0, αV )
and define Y* = L⁻¹Y, X* = L⁻¹X, η* = L⁻¹η, we have
Y* = X*β + η*,  η* ∼ N(0, αI),
so that the calculation of β̂ reduces to an ordinary least squares problem for (Y*, X*). Also, the calculation of |V| is easy because |V| = |L|², and |L| is just the product of the diagonal entries of L. The author's implementation of this is based on the algorithm of Healy (1968) to calculate the Cholesky decomposition, followed by the SVDFIT algorithm of Press et al. (1986, Section 14.3) to solve the ordinary least squares problem, all within the DFPMIN algorithm of Press et al. (1986, Section 10.7) to solve the function minimization problem with respect to θ. DFPMIN is a variable metric algorithm requiring the specification of first-order derivatives of the objective function as well as the function itself, but these can be approximated numerically. In this form, the algorithm
is somewhat simpler than the original proposals of Kitanidis (1983) and Mardia and Marshall (1984).
To summarize the main steps of the algorithm:
1. For the current value of θ, compute V = V(θ) and hence the Cholesky decomposition V = LLᵀ.
2. Calculate L⁻¹. This is easy, given that L is lower triangular.
3. Calculate |L|, which is simply the product of the diagonal entries of L. Hence |V| = |L|².
4. Compute Y* = L⁻¹Y, X* = L⁻¹X.
5. Solve the ordinary least squares problem Y* = X*β + η*; the residual sum of squares is G²(θ).
6. Define g(α; θ) by (7), or g(θ) by (8), so that g is the function which we have to minimize.
7. Repeat each of steps 1 to 6 for each θ (or each (α, θ) pair) for which g has to be evaluated.
The minimum will eventually be achieved at a point θ̂ (or (α̂, θ̂)) and this defines the maximum
likelihood estimator.
8. Define H to be the Hessian matrix, i.e. the matrix of second-order derivatives of g with
respect to the unknown parameters, evaluated at the maximum likelihood estimators. This is
also known as the observed information matrix, and in the case of a quasi-Newton algorithm
such as DFPMIN, may be obtained approximately from the algorithm itself. (The algorithm does
not attempt to evaluate H directly, but maintains an approximation of it which is improved as
the algorithm continues.) In this case, in accordance with standard maximum likelihood theory,
the inverse matrix H −1 is an approximation to the sampling covariance matrix of the parameter
estimates. In particular, the square roots of the diagonal entries of H −1 are approximate standard
errors of the parameter estimates. Finally, the minimized value of g may be used for likelihood ratio
tests in comparing one model with another - we shall see numerous examples of this subsequently.
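Steps 1-6 amount to a few lines of R; in this sketch V_fun(θ) is assumed to return the standardized covariance matrix V(θ) for the sampling locations, and the minimization over θ (step 7) would be handed to optim, whose hessian = TRUE option approximates the matrix H of step 8.

profile_nll <- function(theta, Y, X, V_fun) {
  n <- length(Y)
  L  <- t(chol(V_fun(theta)))             # step 1: V = L L^T, L lower triangular
  Ys <- forwardsolve(L, Y)                # steps 2 and 4: Y* = L^{-1} Y
  Xs <- forwardsolve(L, X)                #                X* = L^{-1} X
  G2 <- sum(lm.fit(Xs, Ys)$residuals^2)   # step 5: OLS residual sum of squares G^2(theta)
  logdetV <- 2 * sum(log(diag(L)))        # step 3: log|V| = 2 log|L|
  (n / 2) * log(2 * pi) + (n / 2) * log(G2 / n) + logdetV / 2 + n / 2   # step 6: (8)
}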
Remarks.
• Effective operation of the algorithm requires reasonable starting values. One solution is
to calculate the approximate WLS estimators first, using these as starting values for the MLE
procedure. In the author’s experience, that level of care is not usually required, but it is important
to use starting values that at least represent reasonable guesses of the MLEs. One general piece
of advice is to build up gradually from simpler models towards more complicated ones, using the
estimates from simpler models to help gauge starting values for the more complicated models.
• It is also advisable to remember that efficient operation of quasi-Newton algorithms such as DFPMIN requires that the parameters be at least reasonably well scaled; e.g., the algorithm will not usually work correctly if one parameter varies on a scale 10⁶ times that of another. This could require attention, in particular, to choosing a suitable unit for distance.
• The reader may be wondering why we considered two forms of g instead of just using (8), which has fewer parameters. The reason is that in some examples the covariance matrix does not have the form Σ = αV(θ), in which case the simplification afforded by the analytic solution for α is not available.
• The interpretation of H in step 8 is not precisely in accordance with standard asymptotic theory of maximum likelihood estimates, because it is calculated from a profile log-likelihood rather than the original log-likelihood function. However, it can be shown that the H matrix in this case has the same interpretation as when it is defined directly from the log-likelihood; see Patefield (1977).
Advantages and disadvantages of maximum likelihood estimation
Although maximum likelihood estimation appears to be computationally feasible, opinion is
still divided concerning its desirability when compared with simpler methods such as the approximate WLS method. Asymptotic properties of maximum likelihood estimators were considered by
Mardia and Marshall (1984), who showed that the usual asymptotic properties of consistency and
asymptotic normality are satisfied under a form of increasing domain asymptotics (see section 2.2.3
for a parallel discussion of Cressie’s WLS method under this form of asymptotics). However, the
conditions given by Mardia and Marshall are not particularly easy to check, and more seriously,
there is no indication of how large the samples need to be for asymptotic results to be reliable
indicators of sampling properties. Another issue concerns possible multi-modality of the likelihood
surface. An example given by Warnes and Ripley (1987), and repeated by Ripley (1988), suggests
that this can be a problem even with the simplest spatial models. In fact it would appear that the
original example given by Warnes and Ripley was in error - Mardia and Watkins (1989) presented
an alternative analysis of the same data set. Nevertheless, the possibility of multi-modality is real,
arising from discontinuities in the first derivative of the log likelihood, as shown theoretically by
Mardia and Watkins in the case of the spherical variogram model and a variant of the exponential
model. They advocated plotting the profile likelihood surface, as well as or instead of finding the
MLE by optimization. In the present author's experience, multi-modality is not usually a difficulty
in low-dimensional estimation problems, but even with parameter dimensions of the order of 4 or
5, it can happen that parallel runs of the maximum likelihood estimation routine, starting from
different initial values, lead to different parameter estimates. It can also happen that with poor
initial values, the algorithm will not converge at all. Given the various difficulties that can arise,
it is advisable to check the results of the algorithm by rerunning it from different starting values,
and to be cautious about the results if these difficulties arise.
The theoretical benefit of maximum likelihood is that we can expect the estimates to be more
efficient than the alternative methods in large samples. However, it is not clear how big a benefit this
is. A simulation study by Zimmerman and Zimmerman (1991) compared the MLE, approximate
WLS and a number of alternative estimators, concluding that the MLE is only slightly superior
to the approximate WLS procedure from this point of view. It has also been pointed out that
the MLE procedure depends on the assumption of a Gaussian process and therefore may perform
poorly when the true distribution is non-Gaussian, but of course this does not mean that the WLS
procedures would necessarily perform better in this case. The simulations of Zimmerman and
Zimmerman (1991) do not address this issue since they are restricted to Gaussian processes.
Finally, the computational complexity of maximum likelihood (if there are n sampling points,
then Σ is an n × n matrix, and the process can be slow if n is large) may be outweighed by
its convenience as a very widely applicable method of estimation, by which a variety of models
can be estimated, and compared using either likelihood ratio tests or automatic model selection
criteria such as the Akaike Information Criterion (AIC). Maximum likelihood methods also link up
naturally with Bayesian procedures. The various disadvantages that have been pointed out, such as
non-robustness when there are outliers in the data, or the possible multi-modality of the likelihood
surface, are caveats to keep in mind when using the method, but they are not reasons to abandon
maximum likelihood estimation.
Restricted maximum likelihood. The idea of restricted maximum likelihood or REML estimation was originally proposed by Patterson and Thompson (1971) in connection with variance
components in linear models. However, a number of authors have pointed out that the situation
considered by Patterson and Thompson is essentially the same as arises with Gaussian models for
spatial data: in both cases there is a linear model with correlated errors, whose covariance matrix
depends on some additional parameters. Thus it is natural to try to separate the two parts of the
estimation problem, the “linear model” part and the “covariance structure” part. Cressie (1993) is
one author who has advocated this approach to spatial analysis.
The motivation behind REML estimation is perhaps best expressed in a very simple case.
Suppose Y1, . . . , Yn are independent univariate random variables, each N(µ, σ²) distributed. The MLE of σ² is a biased estimator, so the MLE appears not to be the best estimator to use in this case. Suppose, however, that instead of basing the maximum likelihood estimator on the full joint density of Y1, . . . , Yn, we base it on the joint density of the vector of contrasts Y1 − Ȳ, . . . , Yn − Ȳ, whose distribution does not depend on µ. The maximum likelihood estimator of σ², under this formulation, turns out to be an unbiased estimator. Thus, by constructing an estimate of σ² based on an (n − 1)-dimensional vector of contrasts, we appear to have done better than the usual maximum likelihood estimator based on the full n-dimensional data vector.
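The simple case is easy to verify by simulation; a sketch with σ² = 4:

set.seed(3)
n <- 20; B <- 10000
rss <- replicate(B, { y <- rnorm(n, sd = 2); sum((y - mean(y))^2) })
# MLE divides by n and is biased; the REML-type estimator divides by n - 1.
c(ml = mean(rss / n), reml = mean(rss / (n - 1)))   # ml ~ 4 * (n-1)/n = 3.8, reml ~ 4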
This idea can be extended to the general model defined by
Y ∼ N (Xβ, Σ), with Σ = αV (θ).
If we let W = AᵀY be a vector of n − q linearly independent contrasts, i.e. the n − q columns of A are linearly independent and AᵀX = 0, then we find that
W ∼ N(0, AᵀΣA)
and the joint negative log-likelihood function based on W is of the form
lW(α, θ) = ((n − q)/2) log(2π) + ((n − q)/2) log(α) + (1/2) log|AᵀV(θ)A| + (1/(2α)) Wᵀ(AᵀV(θ)A)⁻¹W.    (9)
As pointed out by Patterson and Thompson (1971), it is possible to choose A to satisfy AAᵀ = I − X(XᵀX)⁻¹Xᵀ, AᵀA = I. In this case a further calculation, first given by Harville (1974), shows that (9) may be simplified to
lW(α, θ) = ((n − q)/2) log(2π) + ((n − q)/2) log(α) − (1/2) log|XᵀX| + (1/2) log|XᵀV(θ)⁻¹X| + (1/2) log|V(θ)| + (1/(2α)) G²(θ).    (10)
This is minimized with respect to α by setting α̂ = G²(θ)/(n − q), in which case (10) becomes
l*W(θ) = ((n − q)/2) log(2π) + ((n − q)/2) log(G²(θ)/(n − q)) − (1/2) log|XᵀX| + (1/2) log|XᵀV(θ)⁻¹X| + (1/2) log|V(θ)| + (n − q)/2.    (11)
Comparing (11) with (8), we can see that there are two substantive changes: the coefficient of log G²(θ) has been changed from n/2 to (n − q)/2, and there is an additional term (1/2) log|XᵀV(θ)⁻¹X|.
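A sketch of the REML criterion (11) in the same style as the ML sketch above, again assuming V_fun(θ) returns V(θ):

reml_nll <- function(theta, Y, X, V_fun) {
  n <- length(Y); q <- ncol(X)
  L  <- t(chol(V_fun(theta)))
  Ys <- forwardsolve(L, Y); Xs <- forwardsolve(L, X)
  G2 <- sum(lm.fit(Xs, Ys)$residuals^2)
  ldXX  <- as.numeric(determinant(crossprod(X))$modulus)    # log|X^T X|
  ldXVX <- as.numeric(determinant(crossprod(Xs))$modulus)   # log|X^T V^{-1} X|, since X* = L^{-1} X
  ldV   <- 2 * sum(log(diag(L)))                            # log|V|
  ((n - q) / 2) * log(2 * pi) + ((n - q) / 2) * log(G2 / (n - q)) -
    ldXX / 2 + ldXVX / 2 + ldV / 2 + (n - q) / 2
}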
Which one to choose? Cressie (1993, sec. 2.6) argues strongly in favor of the more robust weighted least squares approach. Diggle et al. (2003) prefer likelihood-based methods, which are optimal under the model assumptions. Referring to the instability due to correlation in sample variogram ordinates, they question whether matching theoretical and empirical variograms is optimal in any sense. For small or moderate samples and/or when the trend surface model has many parameters, REML leads to less biased estimators of the variogram parameters than ML (see, again, Cressie, 1993, sec. 2.6). On the other hand, Diggle et al. (2003) say that, in their experience, REML is more sensitive than ML to mis-specification of the trend model. Generally, it is suggested to try all three estimation approaches and compare the results.
References
[1] Bailey, T. C. & Gatrell, A. C. (1995). Interactive spatial data analysis. Longman Scientific &
Technical, Harlow.
[2] Cressie, N. (1985). "Fitting variogram models by weighted least squares". Journal of the International Association for Mathematical Geology, 17, 563–586.
[3] Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley, New York, revised edn.
[4] Cressie, N. & Hawkins, D. M. (1980). "Robust estimation of the variogram". Journal of the International Association for Mathematical Geology, 12, 115–125.
[5] Diggle, P. J., Ribeiro Jr., P. J. & Christensen, O. F. (2003). "An introduction to model-based geostatistics". In J. Møller (ed.), Spatial Statistics and Computational Methods, no. 173 in Lecture Notes in Statistics, pp. 44–86. Springer-Verlag, New York.
[6] Healy, M. J. R. (1968). "The disciplining of medical data". British Medical Bulletin, 210–214.
[7] Journel, A. G. & Huijbregts, C. J. (1978). Mining Geostatistics. Academic Press, London.
[8] Kitanidis, P. K. (1983). "Statistical estimation of polynomial generalized covariance functions and hydrologic applications". Water Resources Research, 19, 909–921.
[9] Mardia, K. V. & Marshall, R. J. (1984). "Maximum likelihood estimation of models for residual covariance in spatial regression". Biometrika, 71, 135–146.
[10] Mardia, K. V. & Watkins, A. J. (1989). "On multimodality of the likelihood in the spatial linear model". Biometrika, 76, 289–295.
[11] Matérn, B. (1960). "Spatial variation". Meddelanden från Statens Skogsforskningsinstitut, 49, No. 5.
[12] Patterson, H. D. & Thompson, R. (1974). "Maximum likelihood estimation of components of variance". Proceedings of the 8th International Biometric Conference, Biometric Society, 197–207.
[13] Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.
[14] Ripley, B. D. (1988). Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge.
[15] Sacks, J., Welch, W. J., Mitchell, T. J. & Wynn, H. P. (1989). "Design and analysis of computer experiments". Statistical Science, 4, 409–423.
[16] Warnes, J. J. & Ripley, B. D. (1987). "Problems with likelihood estimation of covariance functions of spatial Gaussian processes". Biometrika, 74, 640–642.
[17] Zimmerman, D. L. & Zimmerman, M. B. (1991). "A comparison of spatial semivariogram estimators and corresponding ordinary kriging predictors". Technometrics, 33, 77–91.