Appl. Statist. (2005)
54, Part 3, pp. 645–658
Nonparametric estimation of spatial segregation in
a multivariate point process: bovine tuberculosis in
Cornwall, UK
Peter Diggle and Pingping Zheng
Lancaster University, UK
and Peter Durr
Veterinary Laboratories Agency, Weybridge, UK
[Received July 2003. Final revision August 2004]
Summary. The paper is motivated by a problem in veterinary epidemiology, in which spatially
referenced breakdowns of bovine tuberculosis are classified according to their genotype and
year of occurrence. We develop a nonparametric method for addressing spatial segregation
in the resulting multivariate spatial point process, with associated Monte Carlo tests for the
null hypotheses that different genotypes are randomly intermingled and that there are no temporal changes
in spatial segregation. Our spatial segregation estimates use a kernel regression method with
bandwidth selected by a multivariate cross-validated likelihood criterion.
Keywords: Bovine tuberculosis; Monte Carlo test; Multivariate point process; Spatial
segregation
1. Introduction
A multivariate spatial point process is a stochastic process that generates points in two-dimensional space, each point being one of two or more qualitatively distinguishable types. Spatial
segregation occurs if, within some planar region of interest, particular types of point predominate in particular subregions, rather than being randomly intermingled. In this paper, we
assume an underlying multivariate inhomogeneous Poisson point process and investigate spatial segregation via the nonparametric estimation of ratios of componentwise intensities of the
process.
The work was motivated by the following problem in veterinary epidemiology, in which the
points identify farms which experience one or more cases of bovine tuberculosis (BTB) whereas
the types refer to different strains of the disease. However, the methodology could also be useful
in other disciplines where data of this kind arise, e.g. in human epidemiology (Richardson et al.,
2002) or in ecology (Pielou 1961, 1977).
BTB is a serious disease of cattle that is caused by the bacterium Mycobacterium bovis
(M. bovis) and which is endemic in parts of the UK. As part of control measures, herds
are regularly inspected for BTB by using a comparative tuberculin skin test. When disease
is detected in a cattle herd and M. bovis is successfully cultured from at least one test positive
animal, a deoxyribonucleic acid typing technique known as spoligotyping can then be used to
Address for correspondence: Pingping Zheng, Department of Mathematics and Statistics, Fylde College,
Lancaster University, Lancaster, LA1 4YF, UK.
E-mail: [email protected]
© 2005 Royal Statistical Society
0035–9254/05/54645
determine the genotype of M. bovis that is responsible for the BTB breakdown (Durr, Hewinson
and Clifton-Hadley, 2000). If a particular genotype predominates in a given locality, a working
hypothesis is that any new breakdown within that locality which is of the locally predominant
type is likely to be a consequence of internal cross-infection, whereas a new breakdown of a
different type is more likely to be the result of importation of infected animals from a remote
location. A second possibility to explain an unexpected type in a subregion is from a mutation
event. However, the evidence to date is that spoligotypes are genetically stable, so this is unlikely
over a timespan of a few years. Characterizing the locally predominant genotypes, and the
extent to which different genotypes are spatially segregated, is therefore a potentially useful tool
in monitoring the progress of the disease within an administrative region (Durr, Clifton-Hadley
and Hewinson, 2000).
The data that were available to us consist of the spatial locations of a subset of BTB breakdowns within cattle herds in the county of Cornwall, UK, over the years 1989–2002 inclusive.
Because it is not always possible to isolate M. bovis from test positive cattle, our analyses were
necessarily restricted to those BTB breakdowns where M. bovis was successfully cultured and
spoligotyped. We shall use these data to estimate the extent to which different spoligotypes are
spatially segregated, to test the null hypothesis of no spatial segregation among the different spoligotypes
and to assess possible changes in the spatial segregation over time.
Section 2 describes the Cornwall BTB data in more detail. In Section 3 of the paper, we formally define the estimation problem and clarify the distinction between spatial segregation and
the related concepts of spatial clustering or spatial aggregation. We also propose a nonparametric method of estimation based on kernel regression within a generalized additive modelling
framework, together with associated Monte Carlo significance tests. The method is a multivariate generalization of a method proposed by Kelsall and Diggle (1998) for the estimation of
relative risk from case–control data in spatial epidemiology. Section 4 discusses the application
of the method to the Cornwall BTB data, including descriptive analyses of spatial segregation among different genotypes, and of temporal changes in the spatial segregation. Section 5
discusses possible extensions and alternative methodological approaches.
The methods are implemented in a suite of R functions which are available from the second
author’s Web page: www.maths.lancs.ac.uk/~zhengp1/tb.
2. The Cornwall bovine tuberculosis data
2.1. Description of the data
The data cover annual or biennial inspections of beef, dairy and mixed cattle farms throughout
the county of Cornwall, UK, over the years 1989–2002. Each herd that was tested has its spatial location recorded as the single-point Ordnance Survey map reference of the corresponding
farm, to a spatial resolution of 100 m. Note, however, that georeferencing of individual farms
can be problematic (Durr and Froggatt, 2002).
Within the 14-year period that is covered by the data, 2404 skin test positive animals were
slaughtered and assigned a spoligotype. When disease is confirmed in an animal, it is not possible
to determine its date of onset; hence times of onset are censored to the left and the temporal
resolution of the data is generally between 1 and 2 years, corresponding to the cycle of farm
inspections. We call a skin test positive animal a ‘reactor’. A detectably infected animal is a ‘confirmed reactor’ and a herd with one or more confirmed reactors is a ‘confirmed breakdown’.
Reactors are slaughtered and subjected to post mortem examination. Tissue samples are taken
from at least one reactor per breakdown, to attempt the isolation of M. bovis. Since 1997, following the introduction of the spoligotyping technique, standard practice has been to perform
genotyping on at least one positive culture from the majority of confirmed breakdowns. Isolates
before 1997 which had been successfully freeze dried and stored were recultured and had their
spoligotype retrospectively determined.

Table 1. Frequency distribution of confirmed cases of BTB in Cornwall, classified by spoligotype and year of occurrence

        Cases with the following spoligotypes:
Year         9     12     15     20   Others   Unresolved   Total
1989         4     10     11      2        3            2      30
1990         6      7     11      2        1            0      27
1991        23      7      7      4        0            0      41
1992        19     13      7      2        1            0      42
1993        19      1      7      2        0            0      29
1994        12      0      2      0        1            1      15
1995         9      3      5      0        0            0      17
1996        33      5      5     15        3            0      61
1997        42      5      8      6        3            1      64
1998        81     17      8     19        7            5     132
1999        79     14     17      6       14            5     130
2000        74      9     35     18       11            4     147
2001        37      5     15      9        1            1      67
2002        56     13     28     19        1            1     117
Total      494    109    166    104       26           20     919

In most breakdowns only one spoligotype was assigned, either because only one reactor animal was identified or else, in multireactor breakdowns, only one was cultured or if more than
one were cultured they all gave identical results. In a small number of multireactor breakdowns,
more than one spoligotype was identified, and the herd was classified according to the predominant spoligotype. This process resulted in 919 confirmed breakdowns, in 20 of which the
classification by spoligotype was unresolved. Henceforth, we refer to each confirmed breakdown
as a case and abbreviate spoligotype to type. Table 1 shows the frequencies of cases classified
by type and year of occurrence. The four most common types are, in order of frequency, 9, 15,
12 and 20. Together, these account for 873 out of the 919 cases. The total numbers of cases per
year reflect the intensity of the testing programme and do not directly measure the
severity of the epidemic. We shall not attempt to incorporate the rarer types explicitly into our
analysis of spatial segregation.
Fig. 1 gives the overall spatial distribution of the 919 cases. The unit of length in Fig. 1, and
thereafter in this paper, is 1 m. The map gives a strong visual impression of spatial aggregation
of cases, with areas of high incidence in the north-east and south-west of Cornwall. Spatial
aggregation could reflect either the infectious aetiology of the disease or an inhomogeneous
spatial distribution of herds at risk.
For comparison, Fig. 2 shows the locations of all 4353 currently known cattle farms in Cornwall. Their spatial distribution shows less variation in intensity than does the case distribution,
but it is nevertheless clearly non-uniform, reflecting a lack of cattle farms in some parts of Cornwall. This underlines the importance of assessing spatial segregation in a way which allows for
spatial heterogeneity in the underlying population at risk.
Fig. 1. Spatial distribution of all cases over the 14 years (horizontal axis: Eastings; vertical axis: Northings)
3. Estimation of spatial segregation
3.1. Descriptors of spatial structure
A common approach in the exploratory analysis of spatial point pattern data is to identify spatial structure through rejection of a null hypothesis which formalizes the notion of an absence
of structure. Within this paradigm, a test statistic is chosen to be sensitive to the alternative
hypothesis of interest. Often, two or more tests are applied to the same data, but using statistics
that are intended to be sensitive to different kinds of departure from the null hypothesis.
The simplest example of this approach is a test of complete spatial randomness, by which we
mean that the data form a partial realization of a homogeneous spatial Poisson process. We
then define an aggregated pattern as one which deviates significantly from this null hypothesis
in such a way that the points of the pattern tend to form local concentrations. No particular
mechanism is implied. For this reason, we prefer to reserve the term clustering to describe a point
process in which points form functional groups, e.g. a process in which parent points give rise to
collections of offspring in their vicinity. By contrast, we use the word heterogeneous to describe
a process in which the intensity, or mean number of points per unit area, varies spatially. In
general, clustering and heterogeneity are not empirically distinguishable. It is possible to formulate a point process which can equally well be interpreted as a process of independent points
in a heterogeneous environment, or as a process of clusters of related events in a homogeneous
environment (Bartlett, 1964).
Spatial segregation is a descriptor of structure in a multivariate pattern. We say that a multivariate pattern exhibits spatial segregation if for at least some $j \neq i$ the conditional intensity
of type $j$ points at $x$ given a point of type $i$ at $x$ is less than the marginal intensity of type
$j$ points at $x$. A process which generates patterns of this kind is virtually bound also to generate patterns which are marginally aggregated. However, qualitatively similar patterns could
also be generated by a process with independent, but strongly clustered, type-specific components.
Fig. 2. Spatial distribution of the known and spatially discrete herds in Cornwall (horizontal axis: Eastings; vertical axis: Northings)
We conclude firstly that tests for spatial structure are only useful if they are designed to detect
particular kinds of structure and are not based on assumptions which are so restrictive as to
render the hypothesis under test self-evidently false a priori. For example, in testing for spatial
segregation we would not want to assume that, under the null hypothesis of no segregation,
the component patterns were completely random. Secondly, the detailed scientific interpretation of spatial structure almost invariably requires subject-matter knowledge or information in
addition to the data themselves. Thirdly, tests are best constructed within an explicitly declared
modelling framework, so that underlying assumptions are transparent, and in a way which leads
naturally to an associated method for estimating quantities of interest in the event that the null
hypothesis is rejected.
3.2. The Poisson process model
To develop a method for detecting and estimating spatial segregation, we shall assume that the
data are a partial realization of a multivariate, spatially inhomogeneous Poisson point process.
The key assumption in the Poisson process model is that different cases are stochastically independent. In our motivating application, this would clearly be violated if cases were defined at
the individual animal level. Strictly, the infectious nature of the disease also renders it incorrect
at the herd level. However, in the absence of detailed data on the spatiotemporal spread of the
disease, it is reasonable to use the Poisson process model to describe the resulting spatial distribution of the disease, essentially as a consequence of the duality between spatial heterogeneity
and spatial clustering that was originally established by Bartlett (1964).
The model assumes that the component processes, each corresponding to cases of a particular
type, are independent Poisson processes with respective intensity functions $\lambda_k(x)$, $k = 1, \ldots, m$,
where $k$ denotes type. The $\lambda_k(\cdot)$ are in turn derived as the product of two functions: $\lambda_0(x)$, the
intensity function for the univariate Poisson process of herds at risk, and $\rho_k(x)$, the probability
that a herd at location $x$ will generate a case of type $k$ within the period of time that is under
consideration. Note that the data cannot identify the occurrence of multiple breakdowns within
the same herd and year.
We now define relative risk surfaces $\rho_{jk}(x) = \rho_j(x)/\rho_k(x)$, for all $j \neq k$. Note that $\rho_{jk}(x) = \lambda_j(x)/\lambda_k(x)$ because $\lambda_j(x) = \lambda_0(x)\rho_j(x)$. Similarly, if $p_k(x)$ denotes the conditional probability
that a case known to occur at location $x$ is of type $k$, then
$$p_k(x) = \lambda_k(x) \Big/ \sum_{j=1}^{m} \lambda_j(x) = \rho_k(x) \Big/ \sum_{j=1}^{m} \rho_j(x),$$
and $p_j(x)/p_k(x) = \rho_{jk}(x)$. These expressions show that relative risks can be estimated without
assuming any particular form for the spatial distribution of herds at risk.
We say that the underlying Poisson process is completely unsegregated if $\rho_k(x) = \alpha_k \rho(x)$ for
some spatial function $\rho(\cdot)$ or, equivalently, $p_k(x) = p_k$. In other words, different types may be
more or less common but show no propensity to occur in relatively greater numbers in particular subregions. It follows that, under no spatial segregation, all the relative risk surfaces
$\rho_{jk}(x)$ are spatially constant. At the opposite extreme, complete spatial segregation occurs if
only one type of event can occur at any particular location. In this case, for each $x$, $p_k(x) = 1$
for some particular $k = k(x)$ and $p_k(x) = 0$ for all other $k$. In practice, less extreme forms of
partial segregation would be expected to occur and would be expressed quantitatively through
the spatial behaviour of the set of estimated functions $\hat{p}_k(x)$, $k = 1, \ldots, m$. We call the functions
$p_k(x)$ the type-specific probabilities.
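The cancellation of the herd intensity can be checked numerically. The following is a minimal Python sketch, assuming made-up values for $\lambda_0(x)$ and $\rho_k(x)$ at a single location; the function name `type_probabilities` and all numbers are hypothetical and are not taken from the paper or its R functions.

```python
def type_probabilities(lambda0, rhos):
    """Return p_k = lambda_k / sum_j lambda_j, where lambda_k = lambda0 * rho_k."""
    lambdas = [lambda0 * rho for rho in rhos]
    total = sum(lambdas)
    return [lam / total for lam in lambdas]

# Hypothetical values of rho_k(x) for m = 3 types at one location x.
rhos = [0.02, 0.005, 0.01]

# Two hypothetical herd intensities at x: few herds at risk, then many.
p_sparse = type_probabilities(0.5, rhos)
p_dense = type_probabilities(8.0, rhos)

# lambda0 cancels, so both calls give the same type-specific probabilities,
# and p_j / p_k reproduces the relative risk rho_j / rho_k.
print(p_sparse)
print(p_dense)
print(p_sparse[0] / p_sparse[1], rhos[0] / rhos[1])
```

The two printed probability vectors agree up to rounding error, which is the sense in which the relative risks do not depend on the spatial distribution of herds at risk.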
3.3. Kernel estimation of type-specific probabilities
We propose to estimate the $p_k(x)$ through a multivariate adaptation of the kernel smoothing
methodology that was proposed by Kelsall and Diggle (1995, 1998) for case–control data in
human epidemiology. The adaptation proceeds as follows.
The data are represented as a set of multinomial outcomes $Y_i$, $i = 1, \ldots, n$, where, for each of
$k = 1, \ldots, m$, the outcome $Y_i = k$ denotes a breakdown of type $k$ at the location $x_i$, and the corresponding multinomial cell probabilities are the type-specific probabilities $p_k(x_i)$. We propose
a kernel regression estimator for the probability surfaces $p_k(x)$. This takes the form
$$\hat{p}_k(x) = \sum_{i=1}^{n} w_{ik}(x) I(Y_i = k), \qquad (1)$$
where, for each of $k = 1, 2, \ldots, m$,
$$w_{ik}(x) = w_k(x - x_i) \Big/ \sum_{j=1}^{n} w_k(x - x_j),$$
and $w_k(\cdot)$ is the kernel function with bandwidth $h_k > 0$; hence
$$w_k(x) = \frac{w_0(x/h_k)}{h_k^2},$$
where $w_0(\cdot)$ is the standardized form of the kernel function. In the results reported below, we
use the Gaussian kernel
$$w_0(x) = \exp(-\|x\|^2/2), \qquad (2)$$
where $\|\cdot\|$ denotes the Euclidean distance of the point $x$ from the origin.
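To make the estimator concrete, the following is a minimal Python sketch of equation (1) with the Gaussian kernel and a common bandwidth for all types. It is an illustration under stated assumptions, not the authors' R implementation: the function names and the toy coordinates are hypothetical, and the code assumes that the evaluation point is close enough to the data for the summed weights to be positive.

```python
import math

def gaussian_weight(dx, dy, h):
    """Kernel w(x) = w0(x / h) / h**2 with w0(x) = exp(-||x||**2 / 2)."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * h * h)) / (h * h)

def estimate_type_probabilities(x, points, labels, types, h):
    """Kernel regression estimate of p_k(x) for each type k (common bandwidth h)."""
    weights = [gaussian_weight(x[0] - px, x[1] - py, h) for px, py in points]
    total = sum(weights)  # assumed > 0: x is not extremely far from every case
    return {k: sum(w for w, y in zip(weights, labels) if y == k) / total
            for k in types}

# Hypothetical toy data: two well separated groups of cases (not the Cornwall data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
labels = [9, 9, 9, 15, 15, 15]

near_first_group = estimate_type_probabilities((0.3, 0.3), points, labels, [9, 15], h=2.0)
midway = estimate_type_probabilities((5.5, 5.5), points, labels, [9, 15], h=2.0)
print(near_first_group)
print(midway)
```

Because the same bandwidth is used for every type, the estimated probabilities at any location sum to 1; the factor $1/h^2$ cancels in the ratio and is kept only to mirror the definition of $w_k$.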
The log-likelihood function is
$$L(p_1, \ldots, p_m) = \sum_{i=1}^{n} \sum_{k=1}^{m} I(Y_i = k) \log\{p_k(x_i)\}, \qquad (3)$$
where $I(\cdot)$ is the indicator function. In a parametric model for the $p_k(x)$, a widely accepted
method of parameter estimation is to choose parameter values to maximize the right-hand side
of equation (3). In the kernel setting, to do so would lead to the unhelpful bandwidth choices
$h_k = 0$, giving $\hat{p}_k(x_i) = 1$ or $\hat{p}_k(x_i) = 0$ according to whether the corresponding $Y_i$ does or does
not equal $k$. To circumvent this, we use a cross-validated log-likelihood function. Two variants
of the cross-validated form of function (3) could be defined, according to whether we do or do
not choose the same bandwidth for all $m$ components of the p-surface. Using a common bandwidth $h$ gives the desirable property that $\sum_{k=1}^{m} \hat{p}_k(x) = 1$, for every location $x$. The cross-validated
log-likelihood function for $h$ is then defined as
$$L_c(h) = \sum_{i=1}^{n} \sum_{k=1}^{m} I(Y_i = k) \log\{\hat{p}_k^{(i)}(x_i)\}, \qquad (4)$$
where $\hat{p}_k^{(i)}(x_i)$ denotes the kernel estimator (1), based on all of the data except $(x_i, Y_i)$.
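The cross-validated criterion (4) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the grid of candidate bandwidths and the one-dimensional toy layout are assumptions made here, and maximizing over a fixed grid is a simplification (the paper does not state how the maximization is carried out).

```python
import math

def cv_log_likelihood(points, labels, h):
    """Leave-one-out log-likelihood L_c(h) for a common bandwidth h (Gaussian kernel)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(points, labels)):
        same = 0.0  # kernel mass from other cases sharing the type of case i
        mass = 0.0  # kernel mass from all other cases
        for j, (xj, yj) in enumerate(zip(points, labels)):
            if j == i:
                continue  # leave case i out of its own estimate
            d2 = (xi[0] - xj[0]) ** 2 + (xi[1] - xj[1]) ** 2
            w = math.exp(-d2 / (2.0 * h * h))  # the 1/h**2 factor cancels in the ratio
            mass += w
            if yj == yi:
                same += w
        if same == 0.0:
            return float("-inf")  # leave-one-out probability of the observed type is 0
        total += math.log(same / mass)
    return total

# Hypothetical toy data on a line: mostly segregated types with two intruders.
points = [(float(i), 0.0) for i in range(12)]
labels = list("aaabaabbabbb")

candidates = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]  # assumed grid of bandwidths
best_h = max(candidates, key=lambda h: cv_log_likelihood(points, labels, h))
print(best_h, cv_log_likelihood(points, labels, best_h))
```

Only the observed type of each case contributes to the sum, because the indicator $I(Y_i = k)$ is zero for every other $k$. Very small bandwidths are penalized through cases whose nearest neighbours are of a different type, which is what prevents the degenerate choice $h = 0$.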
3.4. Monte Carlo inferences
Kelsall and Diggle (1998) used Monte Carlo sampling to assess whether their estimated risk
surface showed significant departure from spatially constant risk. In a similar fashion, we here
use Monte Carlo methods to test the null hypothesis of no spatial variation in the relative risk
surfaces between pairs of different spoligotypes.
Recall that $\lambda_k(x)$, $k = 1, 2, \ldots, m$, denote the type-specific intensity functions and that
$$p_k(x) = \lambda_k(x) \Big/ \sum_{j=1}^{m} \lambda_j(x).$$
The null hypothesis is $H_0: \lambda_k(x) = \alpha_k \lambda_0(x)$, $k = 1, 2, \ldots, m$; hence, under hypothesis $H_0$, $p_k(x) = \alpha_k$, for all $x$. The $\alpha_k$ can be estimated by $\hat{\alpha}_k = n_k/n$, where $n_k$ is the number of cases of type $k$
and $n$ is the total number of cases. Hence, our suggested statistic to test $H_0$ is
$$T = \sum_{i=1}^{n} \sum_{k=1}^{m} \{\hat{p}_k(x_i) - \hat{\alpha}_k\}^2. \qquad (5)$$
For each Monte Carlo simulation under $H_0$, we relabel the data at random while preserving the
observed number of cases of each type. We denote the value of $T$ for the original data as $t_1$, and
values after simulated random relabelling as $t_2, t_3, \ldots, t_s$. The p-value for a Monte Carlo test of
significance is $p = (k + 1)/s$, where $k$ is the number of $t_j > t_1$.
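The random relabelling test can be sketched as follows in Python. The sketch makes two assumptions that the text does not settle: the bandwidth is held fixed across relabellings rather than reselected, and $\hat{p}_k(x_i)$ is computed from all cases including case $i$. The function names, the seed and the toy data are hypothetical.

```python
import math
import random

def fitted_probabilities(points, labels, types, h):
    """Kernel estimates p_hat_k(x_i) at every case location (common bandwidth h)."""
    fitted = []
    for xi in points:
        w = [math.exp(-((xi[0] - xj[0]) ** 2 + (xi[1] - xj[1]) ** 2) / (2.0 * h * h))
             for xj in points]
        total = sum(w)  # includes the weight 1 of case i itself, so it is positive
        fitted.append({k: sum(wj for wj, y in zip(w, labels) if y == k) / total
                       for k in types})
    return fitted

def segregation_statistic(points, labels, types, h):
    """Statistic T of equation (5): squared deviations of p_hat_k(x_i) from n_k / n."""
    n = len(labels)
    alpha = {k: labels.count(k) / n for k in types}
    fitted = fitted_probabilities(points, labels, types, h)
    return sum((p[k] - alpha[k]) ** 2 for p in fitted for k in types)

def monte_carlo_p_value(points, labels, h, s=200, seed=0):
    """p = (k + 1) / s, where k counts relabelled statistics exceeding the observed one."""
    rng = random.Random(seed)
    types = sorted(set(labels))
    t1 = segregation_statistic(points, labels, types, h)
    shuffled = list(labels)
    exceed = 0
    for _ in range(s - 1):
        rng.shuffle(shuffled)  # a permutation keeps the number of cases of each type fixed
        if segregation_statistic(points, shuffled, types, h) > t1:
            exceed += 1
    return (exceed + 1) / s

# Hypothetical toy data: two fully segregated clusters of 10 cases each.
cluster_a = [(0.5 * (i % 5), 0.5 * (i // 5)) for i in range(10)]
cluster_b = [(x + 20.0, y + 20.0) for x, y in cluster_a]
points = cluster_a + cluster_b
labels = ["a"] * 10 + ["b"] * 10

p_value = monte_carlo_p_value(points, labels, h=1.0, s=200, seed=1)
print(p_value)
```

With $s$ values in total (one observed and $s - 1$ simulated), the smallest attainable p-value is $1/s$, which is why a larger $s$ is needed to report smaller p-values.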
3.5. Case–control data
When the absolute risk of disease is of interest, rather than relative risk between two different
types, the analysis that was described above can still be used, by identifying non-cases among
the population at risk as an additional type, say type 0. The interpretation of the relative risk
surfaces $\rho_{jk}(x)$ when neither $j$ nor $k$ is 0 remains the same. The interpretation of the type-specific probabilities is slightly different, since at each location we now have $\sum_{k=1}^{m} p_k(x) = 1 - p_0(x)$,
which could be spatially varying even in the absence of type-specific spatial segregation.
3.6. Investigating temporal changes in spatial segregation
When each case is allocated to one of a discrete set of time periods, $t = 1, 2, \ldots, r$ say, we shall use
the kernel method to estimate type-specific probability surfaces $\hat{p}_k(x, t)$ for each $t$. For comparability between time periods, we use a common bandwidth $h$ for all types and all time periods,
which we choose by maximizing the sum over time periods of the cross-validated log-likelihood
criterion (4) for cases within each time period.
In this context, the null hypothesis of interest is that the type-specific probability surfaces do
not change over time: hence, $H_0: p_k(x, t) = p_k(x)$, where $k = 1, 2, \ldots, m$ and $t = 1, 2, \ldots, r$, and a
suggested test statistic for a Monte Carlo test of $H_0$ is
$$T = \sum_{k=1}^{m} \sum_{t=1}^{r} \sum_{x \in X} \{\hat{p}_k(x, t) - \bar{p}_k(x)\}^2, \qquad (6)$$
where $X$ denotes the set of all case locations irrespective of type or time period, $\hat{p}_k(x, t)$ is the
estimated type-specific probability surface for type $k$ in time period $t$ and
$$\bar{p}_k(x) = r^{-1} \sum_{t=1}^{r} \hat{p}_k(x, t).$$
Because the true type-specific probability surfaces under hypothesis $H_0$ are unknown, we propose an approximate Monte Carlo test in which we sample case labels from the estimated time
constant type-specific probability surfaces $\bar{p}_k(x)$, holding the number of cases of each type in
each time period fixed at their observed values. Simulation results suggest that the use of estimated $\bar{p}_k(x)$ renders the test slightly anticonservative, but that the effect of this is too small to
affect the results that are reported for the Cornwall BTB data in Section 4.
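The statistic (6) can be sketched in Python as follows. Only the statistic is shown; the label sampling step of the approximate Monte Carlo test is not implemented here. The function names and the toy data are hypothetical, and the code assumes that every location in $X$ is close enough to the cases of every period for the kernel weights to sum to a positive number.

```python
import math

def surface(x, points, labels, k, h):
    """Kernel estimate p_hat_k(x) computed from the cases of a single time period."""
    w = [math.exp(-((x[0] - px) ** 2 + (x[1] - py) ** 2) / (2.0 * h * h))
         for px, py in points]
    total = sum(w)  # assumed > 0 for every location and period
    return sum(wj for wj, y in zip(w, labels) if y == k) / total

def temporal_statistic(periods, types, h):
    """Statistic T of equation (6); periods is a list of (points, labels) pairs."""
    # X: all case locations irrespective of type or period (kept as a list here,
    # so a location recorded in two periods is counted twice).
    locations = [x for pts, _ in periods for x in pts]
    r = len(periods)
    total = 0.0
    for k in types:
        for x in locations:
            estimates = [surface(x, pts, labs, k, h) for pts, labs in periods]
            mean = sum(estimates) / r  # time-averaged surface p_bar_k(x)
            total += sum((e - mean) ** 2 for e in estimates)
    return total

# Hypothetical toy data: the same six locations observed in two periods.
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
period_1 = (locations, ["a", "a", "a", "b", "b", "b"])
period_2 = (locations, ["a", "b", "b", "b", "b", "b"])

t_unchanged = temporal_statistic([period_1, period_1], ["a", "b"], h=2.0)
t_changed = temporal_statistic([period_1, period_2], ["a", "b"], h=2.0)
print(t_unchanged, t_changed)
```

When the periods have identical cases the estimated surfaces coincide and the statistic is zero; any change in the labelling between periods makes it positive.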
4. Application to the Cornwall bovine tuberculosis data
4.1. Spatial segregation over the 14-year period
Fig. 3 shows the spatial distributions of cases corresponding to each of the four most common
spoligotypes. The visual impression is of strong spatial segregation, with each of the four types
predominating in particular subregions.
Fig. 4 shows the cross-validated log-likelihood $L_c(h)$ for the system consisting of the four
most common types of case, covering the 14-year period and using a Gaussian kernel. The
optimal choice of bandwidth is $h_{\mathrm{opt}} = 5015$ m. Fig. 5 shows the resulting estimated type-specific
probabilities $\hat{p}_k(x)$, which confirm that there is strong spatial segregation among the different
spoligotypes.
The maximum type-specific probabilities are 0.999, 0.914, 0.934 and 0.927 for spoligotypes 9,
12, 15 and 20 respectively whereas the minimum type-specific probabilities are 0.015 for type 9,
and zero (to three decimal places) for types 12, 15 and 20. All four types therefore show wide
spatial variation in the estimates $\hat{p}_k(x)$. Note that the corresponding marginal proportions of
the four types are 0.566, 0.125, 0.190 and 0.119. The most common type 9 has two separate foci,
a major one in the east which extends over a relatively large area, and a smaller one in the west of
Cornwall. Type 12 has a single focus towards the west. Type 15 has a single focus in the central
part of Cornwall. Type 20 occurs predominantly in the extreme west. The local maximum to the
east of the main concentration of type 20 cases arises from two near-coincident but otherwise
isolated cases.
Fig. 3. Spatial distributions of the four most common spoligotype data over the 14 years: (a) spoligotype 9; (b) spoligotype 12; (c) spoligotype 15; (d) spoligotype 20 (horizontal axes: Eastings; vertical axes: Northings)
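The marginal proportions quoted above follow from the column totals of Table 1 for the four most common types. A minimal Python check, using only those four totals (the variable names are illustrative):

```python
# Column totals from Table 1 for the four most common spoligotypes.
totals = {9: 494, 12: 109, 15: 166, 20: 104}

n = sum(totals.values())  # number of cases of the four most common types
proportions = {k: round(v / n, 3) for k, v in totals.items()}

print(n)
print(proportions)
```

These are the estimates $n_k/n$ that the type-specific probabilities would equal everywhere in the absence of spatial segregation, so the contrast with maxima above 0.9 and minima near zero summarizes the strength of the segregation.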
The Monte Carlo test for spatial segregation among different spoligotypes is perhaps redundant in view of the very strong segregation that is observed in the data and the smoothed type-specific probability maps, but it is reported for completeness. Using s = 1000, i.e. 999 simulated
random relabellings of the spoligotypes among all cases, the test rejected the null hypothesis
with a p-value of 0.001, i.e. the observed value of the test statistic (5) was greater than all 999
simulated values.
Fig. 4. Cross-validated log-likelihood for the four most common types over the 14 years (horizontal axis: h; vertical axis: Lc)
Fig. 5. Estimated type-specific probabilities for the four most common types over the 14 years: (a) spoligotype 9; (b) spoligotype 12; (c) spoligotype 15; (d) spoligotype 20 (horizontal axes: Eastings; vertical axes: Northings; probability scale from 0.0 to 1.0)
4.2. Changes in the spatial segregation over time
Before 1997 the annual number of cases is too small for the application of nonparametric
smoothing methods. To investigate temporal changes in the spatial segregation of M. bovis we
therefore consider only the years 1997–2002 and define three time periods t corresponding to
the years 1997–1998, 1999–2000 and 2001–2002.
Figs 6 and 7 show the estimated type-specific probability surfaces for the four most common spoligotypes in each time period, using a common bandwidth of 9647 m estimated by our
cross-validation criterion. The increase in h by comparison with the analysis reported in Section
4.1 is to be expected because of the smaller number of cases within individual time periods.
The Monte Carlo test for changes in the type-specific probability surfaces over time gives
a p-value of 0.015 with s = 1000. This result suggests that the relatively subtle effects which
appeared from a visual inspection of Figs 6 and 7 nevertheless represent genuine changes over
time. In general terms, the predominant effect is of a progressive increase in the degree of segregation over time. Thus spoligotype 15 becomes progressively more dominant in north central
Cornwall, whereas spoligotype 20 shows near dominance in the far west in 2001–2002. Spoligotype 9 remains dominant in the east of Cornwall in all three time periods, but its territory
is confined to an area that is closer to the eastern boundary in 2001–2002 than in 1997–1998.
Finally, the distribution of spoligotype 12 appears to be relatively stable over the three time
periods.
5. Discussion
We have demonstrated a nonparametric method for the estimation of spatial segregation
between different types of event in a multivariate point process. Application to the Cornwall
BTB data confirms the existence of strong spatial segregation between different spoligotypes
and identifies significant temporal changes in the spatial segregation between 1997 and 2002.
We have chosen to classify each breakdown according to its predominant spoligotype. When
more than one spoligotype is identified within a single breakdown, the likely explanation is that
the herd in question has experienced two or more infection events within the annual or biennial
interval between successive tests. If we could distinguish such multiple events, we would still be
able to identify type-specific probabilities by using control data on the locations of all herds at
risk, but there would no longer be a constraint that these probabilities should sum to 1 at each
location. However, the current data do not support an analysis of this kind; specifically, when
a breakdown involves more than one animal with the same associated spoligotype, the number
of independent infection events cannot be determined.
The immediate use of our methodology is to confirm (or not) visual impressions of spatial
segregation that are obtained by smoothly mapping the data. A potentially more important use
of estimated type-specific probability surfaces is that they can assist in the management of new
cases, for example when investigating bought-in animals. If, for example, an animal came from a farm in north
central Cornwall but was subsequently identified in a herd breakdown on a farm near the eastern
boundary, a spoligotype 9 would suggest post-arrival infection whereas a spoligotype 15 would
suggest an incipient breakdown on the source farm. A second potential application would be
to enable a comparison between the spatial variation in spoligotype distributions among farm
animals and in wildlife reservoirs.
Fig. 6. Estimated type-specific probabilities for (a) spoligotype 9, 1997–1998, (b) spoligotype 12, 1997–1998, (c) spoligotype 9, 1999–2000, (d) spoligotype 12, 1999–2000, (e) spoligotype 9, 2001–2002, and (f) spoligotype 12, 2001–2002 (probability scale from 0.0 to 1.0)
Fig. 7. Estimated type-specific probabilities for (a) spoligotype 15, 1997–1998, (b) spoligotype 20, 1997–1998, (c) spoligotype 15, 1999–2000, (d) spoligotype 20, 1999–2000, (e) spoligotype 15, 2001–2002, and (f) spoligotype 20, 2001–2002 (probability scale from 0.0 to 1.0)
A useful extension to the methodology that is described in the current paper would be to
allow adjustment for the effects of known risk factors at the herd level, which might themselves
be spatially structured. A generalized additive model (Hastie and Tibshirani, 1990) with a logit
link function can be applied in the form
$$\mathrm{logit}\{p_j(x, u)\} = u'\beta_j + g_j(x), \qquad (7)$$
where $u$ is the vector of herd covariates and $g_j(x)$ is a smooth function of $x$.
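For readers less familiar with the logit link, equation (7) can be mapped back to the probability scale by the standard inverse-logit identity. Writing $\eta_j(x, u) = u'\beta_j + g_j(x)$ for the linear predictor (the symbol $\eta_j$ is introduced here only for illustration),
$$p_j(x, u) = \frac{\exp\{\eta_j(x, u)\}}{1 + \exp\{\eta_j(x, u)\}},$$
so that $g_j(x)$ describes the residual spatial variation in the log-odds of type $j$ after adjustment for the herd covariates $u$.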
Our kernel smoothing method has the advantage of transparency, but it represents only one
of several different approaches which could have been used. Obvious competitors include spline
smoothing methods (Wood, 2003) and hierarchical stochastic models in which the set of underlying type-specific probability surfaces are modelled as a realization of a latent multivariate
spatial stochastic process, so extending to the multivariate setting the model-based geostatistics
framework of Diggle et al. (1998). For example, the functions $g_j(x)$ in equation (7) could be
replaced by $S_j(x)$ where $S(x) = \{S_1(x), \ldots, S_m(x)\}$ is a multivariate spatial Gaussian process.
It would be interesting to develop an overtly spatiotemporal model for the evolution of different spoligotypes. However, this would require data on the date of onset of each case, information
which is not obtainable from the current annual or biennial testing protocol except in a heavily
censored form.
Acknowledgements
We thank Roger Sainsbury of the State Veterinary Service, who helped to collect the Cornwall spoligotyping data sets, and Jackie Inwald and Si Palmer of the Department of Bacterial
Diseases, Veterinary Laboratories Agency, Weybridge, who carried out the spoligotyping.
This work was supported by the Department for Environment, Food and Rural Affairs
(‘SE3020’) and by the UK Engineering and Physical Sciences Research Council through the
award of a Senior Fellowship to Peter Diggle (grant GR/S48059/01).
References
Bartlett, M. S. (1964) The spectral analysis of two-dimensional point processes. Biometrika, 51, 299–311.
Diggle, P. J., Tawn, J. A. and Moyeed, R. A. (1998) Model-based geostatistics (with discussion). Appl. Statist.,
47, 299–350.
Durr, P. A., Clifton-Hadley, R. S. and Hewinson, R. G. (2000) Molecular epidemiology of bovine tuberculosis:
II, Applications of genotyping. Rev. Scient. Tech. Off. Int. Epizoo., 19, 689–701.
Durr, P. A. and Froggatt, A. E. A. (2002) How best to geo-reference farms?: a case study from Cornwall, England.
Prev. Veter. Med., 56, 51–62.
Durr, P. A., Hewinson, R. G. and Clifton-Hadley, R. S. (2000) Molecular epidemiology of bovine tuberculosis:
I, Mycobacterium bovis genotyping. Rev. Scient. Tech. Off. Int. Epizoo., 19, 675–688.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. London: Chapman and Hall.
Kelsall, J. E. and Diggle, P. J. (1995) Kernel estimation of relative risk. Bernoulli, 1, 3–16.
Kelsall, J. E. and Diggle, P. J. (1998) Spatial variation in risk of disease: a nonparametric binary regression
approach. Appl. Statist., 47, 559–573.
Pielou, E. C. (1961) Segregation and symmetry in two-species populations as studied by nearest-neighbour relationships. J. Ecol., 49, 255–269.
Pielou, E. C. (1977) Mathematical Ecology, 2nd edn. New York: Wiley.
Richardson, M., van Lill, S. W. P., van der Spuy, G. D., Munch, Z., Booysen, C. N., Beyers, N., van Helden,
P. D. and Warren, R. M. (2002) Historic and recent events contribute to the disease dynamics of Beijing-like
Mycobacterium tuberculosis isolates in a high incidence region. Int. J. Tubercul. Lung Dis., 6, 1001–1011.
Wood, S. N. (2003) Thin plate regression splines. J. R. Statist. Soc. B, 65, 95–114.