Biostatistics (2000), 1, 2, pp. 177–189
Printed in Great Britain
Spatial modelling of multinomial data with latent
structure: an application to geographical mapping of
human gene and haplotype frequencies
P. VOUNATSOU, T. SMITH
Swiss Tropical Institute, Socinstrasse 57, PO Box CH-4002 Basel, Switzerland
and A. E. GELFAND
Department of Statistics, University of Connecticut, 196 Auditorium Road, U-120, Storrs,
CT 0629-3120, USA
SUMMARY
We develop hierarchical models for spatial multinomial data with missing categories, to analyse a
database of HLA-A and -B gene and haplotype frequencies from Papua New Guinea, with a highly variable number of samples per spatial unit. The spatial structure of the multinomial data is incorporated
by adopting conditional autoregressive (CAR) priors for the random effects, reflecting extra-multinomial
variation. Different spatial structures are investigated, and covariate effects are evaluated using a novel
model selection criterion. Tables and maps reveal strong spatial association and the importance of altitude, a covariate anticipated to be significant in explaining genetic variation. Our approach can be used
in identifying associations with environmental factors, linguistic or epidemiological patterns and hence
potential causes of genetic diversity (population movements, natural selection, stochastic effects).
Keywords: Alleles; CAR; Genetic variation; Haplotypes; Hierarchical model; Linkage disequilibrium; Missing data;
Model choice.
1. INTRODUCTION
The study of geographical patterns of genetic variation can provide important information about human origins (see, for example, Cavalli-Sforza, 1997). In particular, with regard to the genetic structure
of contemporary populations, such investigation can help to distinguish the contributions of natural selection mediated by environmental covariates from those of genetic drift (stochastic effects affecting gene
frequencies), or population movements and mixing (exogamy). In this paper we address some of the issues involved in explaining variation in such spatially structured genetic data, and provide methods for
modelling sparse data with latent structure and covariates.
Spatial variation in human genetic data is particularly relevant for the understanding of subsistence
societies such as those in Papua New Guinea (PNG). There is a remarkable diversity of topography and
climate in PNG, together with many small, culturally, linguistically, and genetically diverse human populations (Attenborough and Alpers, 1992). Since most of these populations have no documented history,
analysis of their gene frequencies in terms of spatial pattern and covariate dependencies provides a phenomenological strategy for studying their origins and the reasons for their diversity.
Genetic information in humans is encoded linearly in 23 pairs of chromosomes. Genes which define
different characteristics are found at specific locations (loci) on the chromosome, and may exist in different
© Oxford University Press (2000)
forms (alleles). For any given locus an individual has a pair of alleles (one from each chromosome) that
may be both identical (homozygous) or may differ (heterozygous). Laboratory identification schemes,
known as serological typing, only determine the alleles at a given locus, without assigning them to the
individual chromosomes of the pair. Therefore, for each individual, we observe only the pair of alleles at
each locus; assignment to the chromosome is missing. Populations can be characterized by the proportions
of alleles for each gene (allele frequencies).
To study spatial genetic variation, we use data from the Human Leucocyte Antigen (HLA) system,
which is a set of related genes on the sixth chromosome involved in various immunological processes. Two
genes within the HLA system are analysed here: the HLA-A and HLA-B genes. Variation in these two
genes affects regulation of immune responses to many infectious diseases. HLA-B has been implicated
in resistance to severe malaria (Hill et al., 1991). An altitudinal cline in HLA-A in PNG parallels that in
malaria endemicity but may also reflect natural selection by other infectious diseases (Bhatia et al., 1991;
Smith et al., 1994). In PNG there is limited diversity at loci A and B (Bhatia et al., 1995), with only four
different alleles of HLA-A and six alleles of HLA-B occurring regularly.
An extension of geographical analysis of allele frequencies is that of haplotype frequencies. Haplotypes
are combinations of alleles at different loci. For instance, a two-locus haplotype is defined by two alleles,
one from each of the two loci. Thus an individual has a pair of haplotypes which we cannot directly
observe. We see only the set of four alleles but not the pair of haplotypes which yielded them. Therefore
haplotypes are latent data and estimation of their frequencies is a missing data problem.
Allele and haplotype data can be viewed as multinomial trials. Classical approaches for estimating
associated probabilities are based on maximum likelihood methods as in Farewell (1982) and Lonjou et
al. (1995). The EM algorithm, a standard tool for missing data problems, has been applied to estimation
of haplotype frequencies in Long et al. (1995).
Hierarchical models offer the possibility of capturing extra-multinomial variation and spatial dependencies in the multinomial frequencies of alleles/haplotypes between geographical regions. They have
been successfully applied in spatial epidemiology, using Poisson models with disease-count data, to map
disease rates (Bernardinelli and Montomoli, 1992; Clayton et al., 1993; Waller et al., 1995). Recent work
of Daniels and Gatsonis (1997) studies hierarchical polytomous response models but with no missing data
or spatial concerns.
Motivated by the foregoing genetic questions we analyse a set of Class I HLA data collected using serological typing (laboratory identification of different forms of the genes) of nearly 6000 persons
sampled from the general population in roughly 250 areas of PNG (Bhatia et al., 1995; Smith et al.,
1995). We consider areal spatial variation of allele and haplotype frequencies at the HLA-A and HLA-B
loci, both in order to describe the spatial pattern, and to explain these patterns by investigating possible
environmental causal factors. In many areas the data are very sparse. We employ a hierarchical framework for modelling the observed areal multinomial data incorporating latent structure and introducing
novel extra-multinomial variation across the areas using the three-dimensional (longitude, latitude and
altitude) spatial coordinates of the areas as well as several area-level multivariate heterogeneity specifications.
The format of the paper is as follows. A description of the data is given in Section 2. Section 3 reviews
the global (ignoring area-level information) likelihood-based analysis but uses simulation-based model
fitting to avoid possibly inappropriate asymptotics. This enables routine examination of the association
between allelic type at locus A and allelic type at locus B, so-called linkage disequilibrium. Section 4
introduces area-level modelling, offering several different specifications. Section 5 discusses aspects of
the Markov chain Monte Carlo (MCMC) model fitting. An attractive approach for model comparison,
which neatly piggy-backs on to simulation-based model fitting, is presented in Section 6. Section 7 selects
a best model and then presents the analysis of the data under this model, revealing very strong spatial and
altitude effects. In Section 8, we offer a brief summary.
Table 1. Joint allele frequencies at the HLA-A and HLA-B loci
(rows: HLA-B genotype; columns: HLA-A genotype)

HLA-B      A11A11  A11A24  A11Aw34  A11A2  A2A2  A2A24  A2Aw34  A24A24  A24Aw34  Aw34Aw34   Freq
B13B13         15      14        3      0     0      0       0      27       22         7     88
B13B27         20      29       17      0     0      0       0      48       20         0    134
B13B39          2       9        3      1     0      0       0       9        6         5     35
B13Bw56        14      51       17      1     0      2       0     114       51        14    264
B13Bw60         8      26       16      1     0      1       0      65       55        11    183
B13Bw62        17      48       22      0     0      0       0      87       62        17    253
B27B27         21      18        6      0     0      1       0      35        9         1     91
B27B39          4      10        0      0     0      0       0       3        1         0     18
B27Bw56        20      82       28      1     0      0       0      99       38         2    270
B27Bw60        11      52       14      2     0      0       0      95       24         3    201
B27Bw62        20      86       18      0     0      1       0      88       27         2    242
B39B39          3       2        1      0     0      0       0       4        2         1     13
B39Bw56         4      11        2      0     0      0       0      25       12         4     58
B39Bw60         3       8        5      0     0      2       0      14        6         0     38
B39Bw62         0      10        0      0     0      0       0      14        1         2     27
Bw56Bw56       39      54       23      1     0      1       1     383      162        31    695
Bw56Bw60       26      97       27      2     0      3       3     358      189        35    740
Bw56Bw62       51     108       38      0     0      0       0     493      205        18    913
Bw60Bw60       20      18        7      1     0      3       5     178       49        16    297
Bw60Bw62       35      84       25      0     0      6       2     314       91        13    570
Bw62Bw62       49      91       22      1     0      0       1     284       58         9    515
Total         382     908      294     11     0     20      12    2737     1090       191   5645
2. DATA DESCRIPTION
The dataset consists of class I HLA typing data on 5645 PNG Melanesians drawn from the general
population. The data record the alleles found at two different loci, HLA-A and HLA-B. The allelic typing
was done serologically (Bhatia et al., 1989). In particular, four alleles (A2, A11, A24, Aw34) occur
at the HLA-A locus and six alleles (B13, B27, B39, Bw56, Bw60, Bw62) at the HLA-B locus. For
each individual, at each locus (A or B), two alleles are recorded. A joint 10 × 21 frequency table of the
genotypes in loci A and B is given in Table 1.
For agricultural planning purposes PNG is divided into 4566 environmental zones (Resource Mapping
Units (RMU)). The dataset was collected at 252 of these RMUs which we consider as the spatial units in
the data. The sampling intensity in each unit was determined by convenience and ranged from 1 to 643.
Within each unit healthy volunteers were recruited, mainly from field surveys at the village level. As a
result of collapsing RMUs which are very close together, we reduce the number of spatial areas to 98.
The sample size in these areas varies between 5 and 643 individuals. The maps in Figure 1 show these
areas, obtained as Delaunay tessellations, and reveal considerable spatial irregularity, that is, departure
from lattice or grid-like shape. Further interpretation of Figure 1 is deferred to Section 7.
[Fig. 1. Maps of raw frequencies for allelic types. (a) Allele A24 at HLA-A, (b) allele Aw34 at HLA-A, (c) allele Bw56 at HLA-B, (d) allele Bw62 at HLA-B.]
Associated with the $i$th unit is the longitude, latitude and altitude at its centroid, collected as
$x_i = (x_{i1}, x_{i2}, x_{i3})$ where $x_{i1}$ and $x_{i2}$ correspond to the longitude and latitude,
respectively, transformed to Cartesian coordinates, then centered and scaled, and $x_{i3}$ denotes the
altitude. Altitudes are observed in four ordered classifications: ≤ 600 m, 600–1200 m, 1200–1800 m and
> 1800 m with frequencies 27%, 12%, 46% and 15%, respectively. These classifications are centered and
scaled to −1.5, −0.5, 0.5 and 1.5. The distance between areas is calculated as the Euclidean distance
between their centroids.
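As a small illustration of the covariate coding just described, the sketch below maps an altitude in metres to its centered class score and computes Euclidean distances between centroids. The function names `altitude_score` and `centroid_distances`, the example altitudes and the centroid coordinates are hypothetical, and assigning a value that falls exactly on a class boundary to the lower class is an assumption, since the text does not say how boundaries are handled.

```python
import numpy as np

# Upper bounds (in metres) of the first three altitude classes given in the text.
CLASS_BOUNDS = np.array([600.0, 1200.0, 1800.0])
# Centered and scaled class scores used as the altitude covariate.
CLASS_SCORES = np.array([-1.5, -0.5, 0.5, 1.5])

def altitude_score(altitude_m):
    """Return the centered class score for altitudes given in metres.

    Assumption: a value exactly on a boundary falls in the lower class.
    """
    idx = np.digitize(altitude_m, CLASS_BOUNDS, right=True)
    return CLASS_SCORES[idx]

def centroid_distances(xy):
    """Pairwise Euclidean distances between the rows of an (n, 2) array."""
    diff = xy[:, None, :] - xy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical altitudes and (already centered and scaled) centroid coordinates.
scores = altitude_score(np.array([150.0, 600.0, 900.0, 1500.0, 2400.0]))
dist = centroid_distances(np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]))
```

The class scores are simply the class index (0 to 3) minus 1.5, which is what centering four equally spaced classes amounts to.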
3. GLOBAL ESTIMATION OF ALLELE AND HAPLOTYPE FREQUENCIES
Suppose that at a given locus (say HLA-A) there are $L$ known alleles $A_1, \ldots, A_L$ with frequencies
$\pi_l$, $l = 1, \ldots, L$ ($\sum_{l=1}^{L} \pi_l = 1$), respectively. An individual has a pair of alleles
at that particular locus. Let $o_l$ and $e_l$ be, respectively, the number of times that allele $A_l$
occurs in a homozygote (both alleles the same at locus A) or a heterozygote (alleles differ at locus A).
Thus, the number of occurrences of a given allele $A_l$ will be $n_l = 2o_l + e_l$ and the counts
$(n_1, n_2, \ldots, n_L)$ will follow a multinomial distribution,
$(n_1, n_2, \ldots, n_L) \sim \mathrm{Mult}\left(\sum_{l=1}^{L} n_l;\ \pi_1, \pi_2, \ldots, \pi_L\right)$,
yielding the likelihood $L(\pi) = \prod_{l=1}^{L} \pi_l^{2o_l + e_l}$ and maximum likelihood estimates of
the allele frequencies, $\hat{\pi}_l = n_l / \sum_{i=1}^{L} n_i$.
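The counting rule $n_l = 2o_l + e_l$ and the resulting estimate can be checked numerically. The sketch below uses the HLA-A genotype totals from Table 1; the dictionary layout and variable names are illustrative choices and are not part of the original analysis.

```python
from collections import Counter

# HLA-A genotype totals from Table 1 (allele pair -> number of individuals).
genotype_counts = {
    ("A11", "A11"): 382, ("A11", "A24"): 908, ("A11", "Aw34"): 294,
    ("A11", "A2"): 11, ("A2", "A2"): 0, ("A2", "A24"): 20,
    ("A2", "Aw34"): 12, ("A24", "A24"): 2737, ("A24", "Aw34"): 1090,
    ("Aw34", "Aw34"): 191,
}

# Each individual contributes both alleles, so a homozygote adds 2 to its
# allele (the 2*o_l term) and a heterozygote adds 1 to each allele (e_l).
allele_counts = Counter()
for (first, second), count in genotype_counts.items():
    allele_counts[first] += count
    allele_counts[second] += count

total_alleles = sum(allele_counts.values())  # twice the number of individuals
pi_hat = {allele: n / total_alleles for allele, n in allele_counts.items()}

for allele, freq in sorted(pi_hat.items()):
    print(f"{allele}: {freq:.3f}")
```

With these counts the estimates are about 0.175 (A11), 0.004 (A2), 0.664 (A24) and 0.157 (Aw34).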
The Hardy–Weinberg equilibrium occurs when genotype frequencies arise through independence of
the allele pairs, i.e. for, say, locus A, $P(\text{genotype } A_l A_{l'}) = 2\pi_l \pi_{l'}$ for $l \neq l'$
and $P(\text{genotype } A_l A_l) = \pi_l^2$. Table 2 shows that the genotype frequencies at each locus
roughly satisfy such equilibrium. That is, for each $l$ and $l' \neq l$ the observed proportion
of $A_l A_{l'}$ pairs is roughly $2\hat{\pi}_l \hat{\pi}_{l'}$. We assume such equilibrium in the following.
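To make the comparison in Table 2 concrete, the following sketch computes Hardy–Weinberg expected genotype proportions from allele frequencies. The allele counts are those implied by the HLA-A genotype totals of Table 1 (homozygotes counted twice); the function name `hwe_expected` is a hypothetical helper introduced only for this illustration.

```python
from itertools import combinations_with_replacement

# Allele counts at HLA-A implied by the genotype totals in Table 1.
allele_counts = {"A11": 1977, "A24": 7492, "Aw34": 1778, "A2": 43}
total = sum(allele_counts.values())
pi_hat = {a: n / total for a, n in allele_counts.items()}

def hwe_expected(freqs):
    """Expected genotype proportions under Hardy-Weinberg equilibrium."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Homozygote: pi_a^2; heterozygote: 2 * pi_a * pi_b.
        expected[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return expected

expected = hwe_expected(pi_hat)
```

For example, the expected proportion for A24A24 is about 0.440, against an observed 0.485 in Table 2.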
Table 2. Examination of Hardy–Weinberg equilibrium at the HLA-A and HLA-B loci

HLA-A frequencies
Genotype    Observed  HWE
Aw34Aw34    0.034     0.025
Aw34A24     0.193     0.208
Aw34A2      0.002     0.001
Aw34A11     0.052     0.055
A2A2        0.000     0.000
A2A24       0.004     0.005
A24A24      0.485     0.440
A24A11      0.161     0.232
A2A11       0.002     0.001
A11A11      0.068     0.031

HLA-B frequencies
Genotype    Observed  HWE      Genotype    Observed  HWE
B13B13      0.016     0.009    B39Bw56     0.010     0.011
B13B27      0.024     0.017    B39Bw60     0.006     0.007
B13B39      0.006     0.003    B39Bw62     0.005     0.009
B13Bw56     0.047     0.060    Bw56Bw56    0.123     0.104
B13Bw60     0.032     0.038    Bw56Bw60    0.131     0.133
B13Bw62     0.045     0.050    Bw56Bw62    0.162     0.173
B27B27      0.016     0.009    Bw60Bw60    0.053     0.042
B27B39      0.003     0.003    Bw60Bw62    0.101     0.111
B27Bw56     0.048     0.060    Bw62Bw62    0.091     0.072
B27Bw60     0.036     0.038    B27Bw62     0.043     0.050
B39B39      0.002     0.003
A two-locus haplotype is denoted by the pair $A_r B_s$. An individual has a pair of haplotypes for a
particular combination of two loci. However, we are only able to type the alleles at a locus, but not the
chromosome they come from. Therefore, haplotypes are latent data. For example, alleles
$(A_r, A_k, B_s, B_t)$ can give the pair of haplotypes $A_r B_s$ and $A_k B_t$ or the pair $A_r B_t$ and
$A_k B_s$.
Let $\pi_{rs}$ be the frequency of haplotype $(r, s)$ where $r$ indexes the alleles $1, \ldots, R$ at locus
HLA-A, $s$ indexes the alleles $1, \ldots, S$ at HLA-B and $\sum_{r=1}^{R} \sum_{s=1}^{S} \pi_{rs} = 1$.
Also, let $n_{rk,st}$, $r \leq k$, $s \leq t$, denote the number of times the four observed alleles are
$(A_r, A_k, B_s, B_t)$. The number of different $n$'s is $R(R + 1)S(S + 1)/4$.
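The count $R(R + 1)S(S + 1)/4$ can be verified by enumerating unordered allele pairs at each locus. The sketch below does this for the $R = 4$ HLA-A alleles and $S = 6$ HLA-B alleles listed in Section 2; the variable names are illustrative.

```python
from itertools import combinations_with_replacement, product

alleles_a = ["A2", "A11", "A24", "Aw34"]
alleles_b = ["B13", "B27", "B39", "Bw56", "Bw60", "Bw62"]

# Unordered pairs (r <= k) at locus A and (s <= t) at locus B.
pairs_a = list(combinations_with_replacement(alleles_a, 2))
pairs_b = list(combinations_with_replacement(alleles_b, 2))

# Every observable four-allele genotype (A_r, A_k, B_s, B_t).
genotypes = list(product(pairs_a, pairs_b))

R, S = len(alleles_a), len(alleles_b)
formula = R * (R + 1) * S * (S + 1) // 4
```

This gives 10 × 21 = 210 classes, the dimensions of Table 1.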
Under Hardy–Weinberg equilibrium, the probability of the genotype $(A_r, A_r, B_s, B_s)$ is
$\pi_{rs}^2$, the probabilities for the $(A_r, A_r, B_s, B_t)$ and $(A_r, A_k, B_s, B_s)$ genotypes are
$2\pi_{rs}\pi_{rt}$ and $2\pi_{rs}\pi_{ks}$, respectively, and the probability of the genotype
$(A_r, A_k, B_s, B_t)$ is $2\pi_{rs}\pi_{kt} + 2\pi_{rt}\pi_{ks}$. Thus, the likelihood function for
$\pi = (\pi_{11}, \pi_{12}, \ldots, \pi_{RS})^T$ is

$$L(\pi; n) \propto \prod_{r,s} \pi_{rs}^{2n_{rr,ss}} \prod_{\substack{r,s,t \\ s<t}} (2\pi_{rs}\pi_{rt})^{n_{rr,st}} \prod_{\substack{r,k,s \\ r<k}} (2\pi_{rs}\pi_{ks})^{n_{rk,ss}} \prod_{\substack{r,k,s,t \\ r<k,\ s<t}} (2\pi_{rs}\pi_{kt} + 2\pi_{rt}\pi_{ks})^{n_{rk,st}}. \qquad (1)$$
Explicit maximization of (1) over $\pi$ is not possible. Instead, when $r \neq k$ and $s \neq t$ let
$n_{rk,st} = Z_{rs,kt} + Z_{rt,ks}$, where $Z_{rs,kt}$ and $Z_{rt,ks}$ are latent data and represent the
number of times that alleles $(A_r, A_k, B_s, B_t)$ arose from the haplotype pair $(A_r B_s, A_k B_t)$ and
$(A_r B_t, A_k B_s)$, respectively. Then the log likelihood function with the latent data can be written as

$$\log L(\pi; Z, n) \propto \sum_{r,s} n_{rs} \log \pi_{rs} \qquad (2)$$

where

$$n_{rs} = 2n_{rr,ss} + \sum_{s<t} n_{rr,st} + \sum_{t<s} n_{rr,ts} + \sum_{r<k} n_{rk,ss} + \sum_{k<r} n_{kr,ss} + \sum_{k \neq r,\ t \neq s} Z_{rs,kt}.$$
Viewing (2) as a full-data log likelihood suggests using the EM algorithm to maximize (1). This has been
proposed by Excoffier and Slatkin (1995) and Long et al. (1995).
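A minimal sketch of an EM iteration of this kind is given below. It is not the implementation of Excoffier and Slatkin (1995) or Long et al. (1995); the function name `em_haplotypes`, the dictionary encoding of the genotype counts and the toy data are assumptions made for illustration. The E-step splits each double-heterozygote count between its two possible phases in proportion to the current haplotype-pair probabilities, and the M-step normalizes the completed haplotype counts $n_{rs}$ from (2).

```python
import numpy as np

def em_haplotypes(counts, R, S, n_iter=200):
    """EM estimate of two-locus haplotype frequencies.

    counts maps (r, k, s, t), with r <= k and s <= t, to the number of
    individuals whose observed alleles are (A_r, A_k, B_s, B_t).
    """
    pi = np.full((R, S), 1.0 / (R * S))
    total = 2.0 * sum(counts.values())      # two haplotypes per individual
    for _ in range(n_iter):
        n = np.zeros((R, S))
        for (r, k, s, t), c in counts.items():
            if r != k and s != t:
                # E-step: double heterozygote, so the phase is ambiguous.
                a = pi[r, s] * pi[k, t]
                b = pi[r, t] * pi[k, s]
                z = c * a / (a + b) if a + b > 0 else 0.5 * c
                n[r, s] += z
                n[k, t] += z
                n[r, t] += c - z
                n[k, s] += c - z
            else:
                # Phase is known: haplotypes are A_r B_s and A_k B_t.
                n[r, s] += c
                n[k, t] += c
        pi = n / total                      # M-step
    return pi

# Toy data (hypothetical): two alleles per locus, indices start at 0.
counts = {(0, 0, 0, 0): 30, (1, 1, 1, 1): 10, (0, 1, 0, 1): 20}
pi_hat = em_haplotypes(counts, R=2, S=2)
```

In this toy example the unambiguous genotypes only ever show haplotypes (0, 0) and (1, 1), so the iteration assigns essentially all double heterozygotes to that phase and converges to frequencies of 2/3 and 1/3.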
Table 3. Posterior summaries (median and 90% interval estimates) for the $\pi_{rs}$ (haplotype
frequencies) and the $\pi_{r\cdot}$ and $\pi_{\cdot s}$ (allele frequencies)

                                           HLA-A
HLA-B       A11                   A24                   Aw34                  Marginals
B13         0.020 (0.018, 0.023)  0.048 (0.044, 0.050)  0.034 (0.031, 0.036)  0.101 (0.097, 0.106)
B27         0.048 (0.044, 0.051)  0.049 (0.046, 0.053)  0.008 (0.007, 0.010)  0.105 (0.101, 0.109)
B39         0.007 (0.006, 0.009)  0.010 (0.008, 0.011)  0.003 (0.002, 0.004)  0.020 (0.018, 0.023)
Bw56        0.041 (0.036, 0.044)  0.209 (0.203, 0.215)  0.073 (0.070, 0.078)  0.323 (0.317, 0.330)
Bw60        0.031 (0.027, 0.035)  0.137 (0.131, 0.142)  0.044 (0.040, 0.048)  0.212 (0.207, 0.218)
Bw62        0.055 (0.051, 0.059)  0.183 (0.177, 0.188)  0.034 (0.032, 0.038)  0.273 (0.264, 0.279)
Marginals   0.202 (0.197, 0.208)  0.635 (0.628, 0.641)  0.197 (0.192, 0.202)
The EM algorithm may have difficulty converging if the likelihood has local modes. Regardless,
it provides only a point estimate and an approximate standard error. A Bayesian approach yields entire
marginal and joint distributions for the $\pi$'s and avoids asymptotics. With a flat (constant) prior on the
$\pi_{rs}$, the resultant analysis may be viewed as exact likelihood inference. Using MCMC simulation we
obtain samples approximately from the posterior distribution of $\pi$. Marginal posterior summaries of the
$\pi_{rs}$ using the median and equal-tail 90% interval estimates are given in Table 3. Since allele A2
occurs so infrequently, we elected to model only the frequencies for A11, A24 and Aw34 in the HLA-A and the
subsequent haplotype analyses. The marginals in Table 3 provide similar summaries for the $\pi_{r\cdot}$'s
and $\pi_{\cdot s}$'s. Independence of the loci would mean that $\pi_{rs} - \pi_{r\cdot}\pi_{\cdot s} = 0$.
For 12 of the 18 resulting posteriors, the 90% interval estimates failed to contain 0, suggesting
association or linkage disequilibrium between the loci.
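The disequilibrium check described above can be sketched as follows: for each posterior draw of the haplotype table, compute $\pi_{rs} - \pi_{r\cdot}\pi_{\cdot s}$, then form equal-tail 90% intervals cell by cell and flag those excluding 0. The Dirichlet draws below are a hypothetical stand-in for MCMC output, and the function name `disequilibrium` is an assumption; none of the numbers reproduce the paper's results.

```python
import numpy as np

def disequilibrium(pi):
    """pi_rs - pi_r. * pi_.s for an array whose last two axes are (R, S)."""
    row = pi.sum(axis=-1, keepdims=True)   # pi_r. (locus A marginals)
    col = pi.sum(axis=-2, keepdims=True)   # pi_.s (locus B marginals)
    return pi - row * col

rng = np.random.default_rng(0)
# Stand-in posterior draws (hypothetical): Dirichlet draws for a 3 x 6 table.
alpha = rng.integers(5, 200, size=18).astype(float)
draws = rng.dirichlet(alpha, size=1000).reshape(1000, 3, 6)

d_draws = disequilibrium(draws)
# Equal-tail 90% interval for each of the 18 cells.
lower, upper = np.percentile(d_draws, [5, 95], axis=0)
excludes_zero = (lower > 0) | (upper < 0)
n_flagged = int(excludes_zero.sum())
```

Working draw by draw, rather than plugging posterior medians into the formula, keeps the interval for each difference consistent with the joint posterior of the haplotype and allele frequencies.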
The models discussed in this section apply to the whole region and do not consider any areal unit spatial
structure. We view them as baseline specifications providing a reference against which more sophisticated
models, introduced in the next section, can be compared.
4. HIERARCHICAL SPATIAL MODELS
We extend the models of the previous section to incorporate heterogeneity among areas. We capture
this variation in gene frequency using models incorporating areal covariates and spatial random effects
with a goal of spatially smoothing the frequencies. Such spatial models for gene and haplotype frequencies
have not been investigated previously. Lonjou et al. (1995) look at the geographical distribution of
haplotype frequencies among potential bone-marrow donors in France. However, they use maximum
likelihood methods to estimate frequencies within each region ignoring spatial dependencies and variable
sample size in each region.
Indexing the areas by $i$, the area-level data are denoted as $n_i = (n_{i1}, n_{i2}, \ldots, n_{iL})^T$
where, in the case of a single locus, $l$ indexes allele types, i.e. $L = R$ or $S$. In the case of two
loci, $l$ indexes haplotypes, i.e. $L = RS$. Recall that, as in (2), here we do not observe the
$n_{i,rs}$; latent $Z_i$'s must be introduced. We propose a hierarchical model for the $n_i$ which, at the
first level, assumes that
$n_i = (n_{i1}, n_{i2}, \ldots, n_{iL}) \sim \mathrm{Mult}\left(\sum_{j=1}^{L} n_{ij},\ \pi_i\right)$,
where $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{iL})$ denotes the theoretical allele or haplotype
frequencies in spatial unit $i$.
The second-stage specification models $\theta_{il} = \log(\pi_{il}/\pi_{i1})$, $i = 1, \ldots, 98$,
$l = 2, \ldots, L$. The spatial component is captured either through random spatial effects (which we
denote by $\phi$'s) or by a linear regression form in the covariate vector $x_i$ (longitude, latitude,
altitude). The allelic type (or haplotype) explanation is captured using type effects (which we denote by
$\gamma$'s). In the regression case, we employ a vector of regression coefficients, $\beta_l$, which
depends upon $l$. The global model of the previous section sets $\theta_{il} = \gamma_l$.
Note that models for $\theta_{il}$ which are additive in spatial and type components will not be sensible.
They are too limited in the range of behavior they permit for the $\pi_{il}$. For instance, if
$\theta_{il} = \phi_i + \gamma_l$, then, as $\phi_i$ increases, all $\pi_{il}$, $l > 1$, increase; only
$\pi_{i1}$ decreases. If $\theta_{il} = x_i^T \beta + \gamma_l$ the situation is even worse. The
vector of spatial components is constrained to the manifold spanned by $X = (x_1, x_2, \ldots, x_{98})^T$
unlike the previous model where $\phi = (\phi_1, \phi_2, \ldots, \phi_{98})^T$ was unconstrained. Also, a
model which sets $\theta_{il} = \gamma_{il}$, that is, provides a vector $\gamma_i$ for each $i$, is not
interesting in the allele case. If the $\gamma_{il}$ are fixed it will provide a perfect fit under a
likelihood analysis and will reveal a small lack of fit if the $\gamma_{il}$ are assumed random
(with fairly vague priors) to permit Bayesian smoothing. In the haplotype case only goodness of fit to
the $n_{i,rk,st}$ can be assessed because the $n_{i,rs}$ are not observed. Hence, fit will not be near
perfect and this model now provides a measure of the best we can do in this case. As a result, we consider
the following collection of second-stage models:

Model (1)  $\theta_{il} = \gamma_l$
Model (2)  $\theta_{il} = \gamma_l + \beta_{1l} x_{1i} + \beta_{2l} x_{2i} + \beta_{3l} x_{3i}$
Model (3)  $\theta_{il} = \gamma_l + \phi_i^{(l)}$
Model (4)  $\theta_{il} = \gamma_l + \phi_i^{(l)} + \beta_l x_{3i}$

Model (1) is the earlier nonhierarchical baseline model. Model (2) adds a spatial component which is
a linear regression form. Model (3) adds random effects modelled as discussed below. The $\phi_i^{(l)}$
will be dependent across $i$ and provide areal adjustment to the $\gamma_l$. Model (4) reintroduces the
altitude covariate which is anticipated to be important in explaining genetic variation. It is a surrogate
for moisture level which influences the incidence of diseases such as malaria. Local genetics should
reflect response to such incidence.
We adopt a Bayesian perspective for convenient model fitting. We take flat priors for
$\gamma = (\gamma_2, \gamma_3, \ldots, \gamma_L)^T$ and $\beta$ to approximate a likelihood analysis. With
a proper prior on the $\phi_i^{(l)}$, a proper posterior is assured for all four models provided at least
$L + 3$ of the multinomial vectors, $n_i$, have all $n_{ij} > 0$.
See, for example, Speckman et al. (1999). We remark that using an alternative set of logits results in a
nonlinear reparametrization of $\gamma$ and $\beta$ for which the prior would no longer be flat. However,
this lack of invariance to parametrization is not an issue in terms of the logits. Again, spatial
smoothing of the logits, equivalently the $\pi_{il}$, is the objective.
Turning to the $\phi_i^{(l)}$, suppose we assume that the
$\phi^{(l)} = (\phi_1^{(l)}, \phi_2^{(l)}, \ldots, \phi_N^{(l)})$ are independent. For each $\phi^{(l)}$
we choose a spatial Gaussian prior given through a conditional normal distribution for each
$\phi_i^{(l)}$ given $\phi_j^{(l)}$, $j \neq i$, as in Bernardinelli and Montomoli (1992). This normal is
of the form $N\left(\sum_j \omega_{ij}\phi_j / \omega_{i+},\ \sigma^2/\omega_{i+}\right)$, where
$\omega_{ii} = 0$, $\omega_{ij} = \omega_{ji}$ and $\sum_j \omega_{ij} = \omega_{i+}$. The resultant joint
distribution for $\phi^{(l)}$ is improper and, in fact, the posteriors under models (3) and (4) are as
well. Imposing the constraint $\sum_{i=1}^{N} \phi_i^{(l)} = 0$ removes the impropriety returning us to the
conditions of the previous paragraph.
We recognize that there is some arbitrariness in the definition of our areal units, that the resulting
units are very irregular in size and shape, that at a different resolution or with rezoning of the areas at
the current resolution, spatial association might look different. Hence we avoid subjective distance-based
$\omega_{ij}$. We adopt a nearest-neighbor choice, setting $\omega_{ij} = 1$ for $j$ an immediate neighbor
of $i$, obtaining the conditional autoregressive (CAR) prior (Besag et al., 1995). We regard this prior as
an elementary attempt to qualitatively describe the spatial dependence in the data. In fact, we introduce
a variance $\sigma_l^2$ for each $l$ which specifies the amount of variability in the $\phi_i^{(l)}$
(Richardson et al., 1995). A proper inverse gamma prior is adopted for the $\sigma_l^2$ again to ensure a
proper posterior (Hobert and Casella, 1996) as well as a well-behaved MCMC simulation. We chose an inverse
gamma distribution having mean 0.1 and infinite variance which provides the right order of magnitude for
the $\phi$'s.
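The conditional specification above can be sketched numerically. Under nearest-neighbor weights, the conditional mean of $\phi_i$ is the average of its neighbors' values and the conditional variance is $\sigma^2$ divided by the number of neighbors. The adjacency matrix, the effect values, the value of $\sigma^2$ and the function name `car_conditional` below are hypothetical, chosen only for illustration.

```python
import numpy as np

def car_conditional(phi, weights, sigma2, i):
    """Mean and variance of phi_i given the other phi_j under a CAR prior.

    weights is a symmetric matrix with zero diagonal; with 0/1 entries the
    row sum w_i+ is the number of neighbors of area i.
    """
    w_row = weights[i]
    w_plus = w_row.sum()
    mean = w_row @ phi / w_plus
    var = sigma2 / w_plus
    return mean, var

# Hypothetical map of four areas arranged in a line: 0 - 1 - 2 - 3.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

phi = np.array([0.5, -0.1, 0.3, -0.3])
phi = phi - phi.mean()          # impose the sum-to-zero constraint

mean_1, var_1 = car_conditional(phi, W, sigma2=0.1, i=1)
```

Area 1 has two neighbors, so its conditional mean is the average of the centered effects of areas 0 and 2, and its conditional variance is half of $\sigma^2$.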
Dependence between the $\phi^{(l)}$ could be captured using a multivariate conditional autoregressive prior
(MCAR) for the $\phi_i = (\phi_i^{(2)}, \phi_i^{(3)}, \ldots, \phi_i^{(L)})$ as in Mardia (1988). We are
currently exploring such specifications but in any event, though the $\phi^{(l)}$ are assumed independent
a priori, the dependence is revised through the posterior.
5. BAYESIAN COMPUTATIONS
The foregoing hierarchical models give rise to high-dimensional posterior densities which require
MCMC methods to fit. To simulate from the respective posterior densities of each model we use a Gibbs
sampling scheme (Gelfand and Smith, 1990). The latent Z have independent binomial distributions.
Whenever full conditional distributions are not of a standard form we use a random-walk Metropolis proposal which is a normal distribution with mean equal to the current value and a variance–covariance matrix which
is the inverse of the Hessian matrix estimated at the current mean. The Hessian matrix can be calculated
straightforwardly since the second derivatives can be obtained analytically.
We note that for models involving $\phi_i^{(l)}$ the parameters $\gamma_l$ and $\phi_i^{(l)}$ are not
identifiable. For an arbitrary $c$, these models cannot distinguish $\gamma_l' = \gamma_l + c$ and
$\phi_i^{(l)\prime} = \phi_i^{(l)} - c$ from $\gamma_l$ and $\phi_i^{(l)}$ (whence the posterior
is improper). To overcome this problem, at the end of every Gibbs iteration we impose the constraint
$\sum_{i=1}^{N} \phi_i^{(l)} = 0$, by adding to the samples of $\gamma_l$ the average of the
$\phi_i^{(l)}$'s and subtracting it from each $\phi_i^{(l)}$.
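The recentering step can be written in a few lines. The sketch below, with a hypothetical helper `recenter` and made-up values, shows that moving the average of the spatial effects into the type effect leaves every logit $\gamma_l + \phi_i^{(l)}$ unchanged while enforcing the sum-to-zero constraint.

```python
import numpy as np

def recenter(gamma_l, phi_l):
    """Shift the mean of the spatial effects into the type effect.

    The logits gamma_l + phi_i^(l) are left unchanged by the shift.
    """
    shift = phi_l.mean()
    return gamma_l + shift, phi_l - shift

# Hypothetical draw for one allelic type l over five areas.
gamma_l = 1.2
phi_l = np.array([0.3, -0.1, 0.5, 0.2, 0.1])

theta_before = gamma_l + phi_l
gamma_new, phi_new = recenter(gamma_l, phi_l)
theta_after = gamma_new + phi_new
```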
We implement a multiple-chain Gibbs sampler with 10 independent chains running in parallel. We
assess convergence using the method of Gelman and Rubin (1992). After convergence, we collect a final
sample from the posterior of size 1000 by taking every 30th value of each chain.
6. MODEL CHOICE
Models (3) and (4) are not regular in that model dimension increases as sample size (number of areas)
increases. Model choice criteria requiring degrees of freedom or customary asymptotics are not applicable.
Moreover, Bayes factors, even if we could compute them, are not interpretable since some of the prior
specifications are improper. We adopt the model selection approach developed by Gelfand and Ghosh
(1998). In the context of spatial modelling it has been applied to Poisson data by Waller et al. (1997) and
to univariate normal data by Gelfand et al. (1998). As Gelfand and Ghosh note, because the multinomial
distribution is a multivariate exponential family, their deviance-loss–utility-maximization approach can
be extended to multinomial likelihoods.
The basic idea of this approach is to first specify a loss function suitable for prediction of observations
under the stochastic family for the data. Our choice is connected to the deviance associated with the
multinomial distribution. This loss is used to define two additive components, one which rewards good
predictive performance for a new $n_{il}$, the other which rewards fidelity to the observed $n_{il}$. Next,
the minimizing posterior action is obtained for each $n_{il}$ under this two-component loss. Surprisingly,
it can be obtained explicitly. Inserting this minimizing action into the posterior expected loss and
cumulating over $n_{il}$ yields a predictive model choice criterion: select the model yielding the smallest
value. For the multinomial, omitting details, we obtain a criterion of the form $D = G + P$ where

$$G = 2\sum_{i=1}^{N}\sum_{l=1}^{L}\left\{ (\tilde{\mu}_{il} + L^{-1}) \log\frac{2(\tilde{\mu}_{il} + L^{-1})}{\tilde{\mu}_{il} + n_{il} + 2L^{-1}} + (n_{il} + L^{-1}) \log\frac{2(n_{il} + L^{-1})}{\tilde{\mu}_{il} + n_{il} + 2L^{-1}} \right\}$$

and

$$P = 2\sum_{i=1}^{N}\sum_{l=1}^{L}\left\{ \tilde{t}_{il} - (n_{il} + L^{-1}) \log(n_{il} + L^{-1}) \right\}.$$
Table 4. Goodness of fit term (G), penalty term (P), and overall model choice criterion (D). Smaller
values indicate preferred models

                                                      HLA-A                       HLA-B                       HLA-AB
Model for θ_il =                                 G        P        D         G        P        D         G        P        D
(1) γ_l                                       1764.25   191.96  1956.21   1775.78   469.65  2245.43   5334.44  2182.42  7516.86
(2) γ_l + β_1l x_1i + β_2l x_2i + β_3l x_3i   1014.34   194.32  1208.66   1221.24   471.89  1693.13   4766.78  2156.02  6928.80
(3) γ_l + φ_i^(l)                              961.33   217.30  1178.63    991.66   590.08  1581.74   4578.02  2376.32  6954.34
(4) γ_l + φ_i^(l) + β_l x_3i                   508.59   238.19   746.78    715.93   627.67  1343.60   4536.99  2299.64  6836.63
Here, $\tilde{\mu}_{il}$ is the posterior predictive mean of a new $n_{il}$ given the data and model, i.e.
$\tilde{\mu}_{il} = E(n_{il,\mathrm{new}} \mid \text{data, model})$. Similarly,

$$\tilde{t}_{il} = E\left((n_{il,\mathrm{new}} + L^{-1}) \log(n_{il,\mathrm{new}} + L^{-1}) \mid \text{data, model}\right).$$

The $L^{-1}$'s are continuity corrections introduced to ensure that the log can be calculated and that the
expectations exist.
$G$ is interpreted as a goodness of fit term. It is 0 when $\tilde{\mu}_{il} = n_{il}$ and increases as
$\tilde{\mu}_{il}$ moves away from $n_{il}$. $P$ is interpreted as a penalty term and can be shown to be
approximately a weighted sum of predictive variances. A poor model will produce a large $G$ and $P$, hence
$D$. A good model will make both $G$ and $P$, hence $D$, small. An overfitted model will reveal a small $G$
but an inflated $P$. Thus, parsimony is encouraged. Finally, in the case of the haplotype models we replace
$n_{il}$ by $n_{i,rk,st}$ in all of the above.
Computation of the posterior predictive expectations is done by Monte Carlo integration using samples
from the required predictive distributions. These samples are generated straightforwardly, one for one,
using samples of the $\pi_{il}$. The posterior samples which are the output of the MCMC algorithm used to
fit the model yield posterior samples of the $\pi_{il}$.
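The Monte Carlo computation just described can be sketched as follows for the allele case. One multinomial replicate is drawn per posterior sample of $\pi_i$, using the observed area totals as the replicate sample sizes (an assumption consistent with the first-stage model), and $G$ and $P$ are then evaluated with the formulas exactly as printed above. The function name `predictive_criterion`, the toy counts and the Dirichlet stand-in for MCMC output are all hypothetical.

```python
import numpy as np

def predictive_criterion(n_obs, pi_draws, rng):
    """Return (G, P, D) from observed counts and posterior draws of pi.

    n_obs has shape (N, L); pi_draws has shape (M, N, L).
    """
    N, L = n_obs.shape
    c = 1.0 / L                               # continuity correction L^{-1}
    totals = n_obs.sum(axis=1)
    # One predictive replicate per posterior draw, area by area.
    n_new = np.stack([
        np.stack([rng.multinomial(totals[i], p[i]) for i in range(N)])
        for p in pi_draws
    ])
    mu = n_new.mean(axis=0)                                   # mu-tilde
    t = ((n_new + c) * np.log(n_new + c)).mean(axis=0)        # t-tilde
    denom = mu + n_obs + 2 * c
    G = 2 * np.sum((mu + c) * np.log(2 * (mu + c) / denom)
                   + (n_obs + c) * np.log(2 * (n_obs + c) / denom))
    # Penalty term coded as printed in the text.
    P = 2 * np.sum(t - (n_obs + c) * np.log(n_obs + c))
    return G, P, G + P

rng = np.random.default_rng(1)
n_obs = np.array([[12, 5, 3], [2, 9, 4]])
# Stand-in for MCMC output (hypothetical): 500 Dirichlet draws per area.
pi_draws = np.stack(
    [rng.dirichlet(n_obs[i] + 1.0, size=500) for i in range(2)], axis=1
)
G, P, D = predictive_criterion(n_obs, pi_draws, rng)
```

For the haplotype models the same computation would be applied to the observed genotype counts $n_{i,rk,st}$ and their predictive replicates rather than to the latent haplotype counts.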
Table 4 provides the model choice for models (1)–(4) for locus A, for locus B and for the AB haplotype.
We see that the baseline model is poor. Model (2) is better but the CAR spatial explanation, including
altitude, is much better. Though it yields slight overfitting, G is substantially reduced.
7. ANALYSIS OF THE DATA
In the HLA-A gene we used A11 as the base cell for the logits and, based upon the previous section,
used model (4) with CAR. Table 5 presents a posterior summary. Since the altitude levels and the spatial
effects are centered at 0, the $\gamma$'s may be thought of as population level logits. The altitude and
spatial terms provide adjustments for the $i$th area. Converting the median logits to probabilities gives
$\pi_{\mathrm{A11}} = 0.161$, $\pi_{\mathrm{A24}} = 0.676$ and $\pi_{\mathrm{Aw34}} = 0.163$, in reasonable
agreement with Table 3. Table 5 also reveals that altitude provides a significant adjustment to the
probabilities. The advantage of introducing distinct parameters, $\beta_{\mathrm{A24}}$ and
$\beta_{\mathrm{Aw34}}$, is apparent; the former is roughly twice the latter. Also, both altitude
coefficients are positive indicating that alleles A24 and Aw34 are more frequent at high altitudes in
contrast to A11 which is found mainly in low-altitude areas. Maps of raw allele frequencies in
Figures 1(a) and (b) confirm the association with altitude, especially for A24. Note that the central
areas of PNG are mainly the high-altitude areas, while the coastal areas are the lowlands. The spatial
effects are also important. Figure 2(a) shows a grey-scale map of the posterior medians of the
$\phi_i^{\mathrm{A24}}$. Recalling that, a priori, the $\phi_i^{\mathrm{A24}}$ are all centered at 0, the
spatial association is evident. Figure 2(b) shows a grey-scale map of the posterior medians of the
$\phi_i^{\mathrm{Aw34}}$. Again there is spatial association but with a different pattern from that in
Figure 2(a), demonstrating the importance of introducing distinct spatial models for
$\phi^{\mathrm{A24}}$ and $\phi^{\mathrm{Aw34}}$.
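The conversion from logits to probabilities used above can be sketched as follows. With A11 as the base cell its logit is fixed at 0, and the probabilities follow by exponentiating and normalizing. The function name `logits_to_probs` is a hypothetical helper; the inputs are the posterior medians for HLA-A reported in Table 5, and plugging medians into the formula with the spatial effects set to zero is a simplification for illustration only.

```python
import numpy as np

def logits_to_probs(theta):
    """Baseline-category logits (base cell first, fixed at 0) to probabilities."""
    full = np.concatenate(([0.0], theta))
    expo = np.exp(full - full.max())      # subtract the max for stability
    return expo / expo.sum()

# Posterior medians from Table 5 for HLA-A, ordered (A24, Aw34); base cell A11.
gamma = np.array([1.43, 0.001])
beta = np.array([0.708, 0.375])

pi_population = logits_to_probs(gamma)                 # altitude score 0
pi_lowland = logits_to_probs(gamma + beta * (-1.5))    # lowest altitude class
pi_highland = logits_to_probs(gamma + beta * 1.5)      # highest altitude class
```

With these medians the population-level probabilities come out at about 0.162, 0.676 and 0.162 for A11, A24 and Aw34, and the A11 share falls from roughly 0.33 in the lowest altitude class to roughly 0.07 in the highest.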
Table 5. Posterior median and 90% credible sets for parameters under model 4 for
HLA-A, -B, -AB frequencies

            γ                      β                      σ²
HLA-A
Aw34        0.001 (−0.15, 0.08)    0.375 (0.367, 0.377)   0.06 (0.02, 0.29)
A24         1.43 (1.30, 1.52)      0.708 (0.707, 0.709)   0.05 (0.02, 0.23)
HLA-B
Bw62        1.03 (0.82, 1.23)      0.61 (0.52, 0.70)      0.06 (0.02, 0.28)
B27         −0.05 (−0.31, 0.10)    0.35 (0.26, 0.51)      0.06 (0.02, 0.29)
B39         −1.80 (−2.17, −1.38)   0.07 (−0.05, 0.11)     0.06 (0.02, 0.27)
Bw56        1.18 (1.00, 1.37)      0.56 (0.51, 0.69)      0.05 (0.02, 0.27)
Bw60        0.75 (0.54, 0.92)      0.52 (0.50, 0.57)      0.07 (0.02, 0.33)
HLA-AB
Aw34Bw62    −1.21 (−1.49, 0.50)    0.21 (0.03, 0.70)      0.05 (0.02, 0.24)
Aw34B27     −2.11 (−3.30, −0.44)   0.36 (−0.03, 1.33)     0.05 (0.02, 0.23)
Aw34B39     −4.35 (−6.06, −2.93)   −1.32 (−2.53, −0.24)   0.05 (0.02, 0.23)
Aw34Bw56    −0.38 (−0.60, 1.58)    0.72 (0.69, 1.39)      0.05 (0.02, 0.24)
Aw34Bw60    −0.95 (−1.44, 1.03)    0.48 (0.13, 0.90)      0.05 (0.02, 0.25)
Aw34B13     −1.34 (−1.96, 0.22)    0.27 (0.03, 0.51)      0.05 (0.02, 0.20)
A24Bw62     0.67 (0.54, 2.67)      0.95 (0.87, 1.39)      0.04 (0.02, 0.27)
A24B27      −0.57 (−0.90, 1.36)    0.67 (0.55, 1.34)      0.04 (0.02, 0.25)
A24B39      −2.18 (−2.71, 0.23)    0.38 (0.05, 0.87)      0.05 (0.02, 0.24)
A24Bw56     0.80 (0.69, 2.67)      0.77 (0.62, 1.44)      0.05 (0.02, 0.21)
A24Bw60     0.38 (0.20, 2.29)      0.70 (0.63, 1.39)      0.05 (0.02, 0.26)
A24B13      −0.65 (−0.94, 1.37)    0.36 (0.26, 0.90)      0.05 (0.02, 0.28)
A11Bw62     −0.85 (−0.99, 1.23)    0.31 (0.61, 0.91)      0.04 (0.01, 0.19)
A11B27      −0.98 (−1.33, 1.07)    0.28 (0.12, 0.87)      0.05 (0.02, 0.27)
A11B39      −2.82 (−3.78, −0.92)   0.19 (−0.65, 0.58)     0.05 (0.015, 0.23)
A11Bw56     −0.92 (−1.15, 0.92)    0.10 (−0.06, 0.48)     0.05 (0.02, 0.19)
A11Bw60     −1.42 (−1.92, 0.64)    0.09 (−0.43, 0.72)     0.05 (0.02, 0.24)
Turning to the HLA-B gene and taking B13 as the base cell, Table 5 also summarizes posterior estimates of model parameters. Again converting the median logits to probabilities yields π B13 = 0.097,
π B27 = 0.092, π B39 = 0.016, π Bw56 = 0.316, π Bw60 = 0.212, and π Bw62 = 0.272, in remarkable agreement with Table 3. Again the magnitude and sign of the altitude coefficients indicate that Bw62, Bw56
and Bw60 are most frequent at high altitude, in contrast to allele B13. Allele B39 shows no association
with altitude. These results are confirmed by the maps of the raw frequencies of the alleles at HLA-B (e.g.
Figures 1(c) and (d)) and the maps of spatial effects (e.g. Figures 2(c) and (d)).
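The conversion from median logits to probabilities quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a baseline-category logit with B13 as the base cell (logit fixed at 0) and uses only the posterior median intercepts β from Table 5, with the altitude and spatial terms set to zero. The function name `logits_to_probabilities` is hypothetical.

```python
import math

# Posterior median intercepts (beta) for HLA-B as reported in Table 5.
# B13 is the base cell, so its logit is 0 by construction (assumption of
# a baseline-category logit parameterization).
median_logits = {
    "B13": 0.0,
    "Bw62": 1.03,
    "B27": -0.05,
    "B39": -1.80,
    "Bw56": 1.18,
    "Bw60": 0.75,
}


def logits_to_probabilities(logits):
    """Inverse baseline-category logit: pi_k = exp(b_k) / sum_j exp(b_j)."""
    weights = {allele: math.exp(value) for allele, value in logits.items()}
    total = sum(weights.values())
    return {allele: weight / total for allele, weight in weights.items()}


probabilities = logits_to_probabilities(median_logits)
for allele, p in probabilities.items():
    print(f"{allele}: {p:.3f}")
```

Under these assumptions the results agree with the probabilities quoted in the text to within about 0.01; small differences are expected because the medians in Table 5 are rounded.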
Finally, for the HLA-A and -B haplotypes, we obtain an 18-cell joint frequency distribution. Taking
(A11, B13) as the base haplotype, 13 of the 17 logits show significant altitude effects (Table 5). Plots
analogous to those in Figures 1 and 2 are omitted but again reveal strong spatial associations.
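The count of significant altitude effects can be checked directly against the Table 5 intervals. The sketch below assumes that "significant" means the 90% credible set for the altitude coefficient γ excludes zero; that reading, and the helper name `excludes_zero`, are assumptions made for illustration. The intervals are transcribed as printed in Table 5.

```python
# 90% credible sets (lower, upper) for the altitude coefficients gamma of the
# 17 haplotype logits, transcribed from Table 5 as printed.
gamma_intervals = {
    "Aw34Bw62": (0.03, 0.70),
    "Aw34B27": (-0.03, 1.33),
    "Aw34B39": (-2.53, -0.24),
    "Aw34Bw56": (0.69, 1.39),
    "Aw34Bw60": (0.13, 0.90),
    "Aw34B13": (0.03, 0.51),
    "A24Bw62": (0.87, 1.39),
    "A24B27": (0.55, 1.34),
    "A24B39": (0.05, 0.87),
    "A24Bw56": (0.62, 1.44),
    "A24Bw60": (0.63, 1.39),
    "A24Bw13": (0.26, 0.90),
    "A11Bw62": (0.61, 0.91),
    "A11B27": (0.12, 0.87),
    "A11B39": (-0.65, 0.58),
    "A11Bw56": (-0.06, 0.48),
    "A11Bw60": (-0.43, 0.72),
}


def excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


significant = [
    haplotype
    for haplotype, (lower, upper) in gamma_intervals.items()
    if excludes_zero(lower, upper)
]
print(f"{len(significant)} of {len(gamma_intervals)} logits")  # 13 of 17 logits
```

With this criterion the four haplotypes whose intervals cover zero are Aw34B27, A11B39, A11Bw56 and A11Bw60.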
Many studies have contrasted the HLA types found in the PNG highlands with those in coastal areas,
but most compared only small numbers of populations. Such studies are unable to establish whether
the differences result from natural selection (possibly related to altitudinal variation in disease patterns),
distinctions between waves of human migrants, or in situ divergence of populations via genetic drift. Since
it includes many sampling points, the present dataset can be used to distinguish between these hypotheses,
and Bhatia et al. (1991) applied the Ewens-Watterson test to these data to argue that the altitudinal cline
in HLA-A is a result of natural selection. In a further analysis, Smith et al. (1994) used matrix correlation
approaches to confirm that the clinal variation in HLA-A reflects genuine effects of altitude, not merely
geographical separation of the highland and lowland populations. However, neither of these analyses
could adequately assess the residual geographical patterns after accounting for the altitudinal variation.

[Figure 2: four grey-scale maps. Panels: (a) Allele A24 at HLA-A, (b) Allele Aw34 at HLA-A, (c) Allele Bw56 at HLA-B, (d) Allele Bw62 at HLA-B.]
Fig. 2. Spatial effects under model (4) (CAR) (see text for details) for allelic types. (a) φ A24, (b) φ Aw34, (c) φ Bw56,
(d) φ Bw62.
The present approach could incorporate even more data than were used by Smith et al. (1994) and
Bhatia et al. (1991) because it includes spatial units with arbitrarily small sample sizes. The earlier
analyses did not include sufficient units for the overall spatial patterns to be evident, but the inclusion of
these sparse units means that sampling variation obscures spatial patterns in the raw data. The residual
patterns in the smoothed data, however, confirm that there is variation in HLA allele frequencies over
short distances, comparable in magnitude to the variation across the whole of the South West Pacific
region (Serjeantson, 1989). Our analyses are the first to demonstrate that altitudinal differences in HLA-B are not due to spatial correlation, and to provide descriptions of the residual spatial patterns at both loci,
after adjustment for the altitudinal clines.
Features of spatial pattern which are not obvious in maps of raw frequencies in Figure 1 but which
are depicted by the CAR-smoothed estimates of the altitude-adjusted gene frequencies in Figure 2 (not
all maps are shown here) include west–east clinal variation. Since the original colonization of PNG is
thought to have been via waves of immigrants from the north-west, these clines may reflect later arrival
of the ancestors of the current western populations. There are higher frequencies of A24 and Bw56 in the
east of PNG, while B27 is more frequent in the west. The rather different spatial patterns shown by other
alleles might reflect natural selection by unknown causes, and could be further investigated by testing
additional covariates (e.g. climatic data) in the models. Thus, conditional on altitude, Bw62 frequencies
are highest along a crescent from the south-west (western province) continuing through the centre of PNG
(the main highland valleys). A complementary pattern is shown by B39. In the south-west Bw60 appears
to be most frequent.
8. SUMMARY
In this paper we have presented a hierarchical framework for modelling sparse spatially structured
multinomial data. Markov chain Monte Carlo methods have been applied to fit the models and to obtain
point and interval estimates for the model parameters. A model comparison criterion that is convenient for
multinomial data models fitted using MCMC has been employed.
We have implemented this approach to study geographical patterns in HLA-A and HLA-B alleles as
well as HLA-AB haplotypes at an areal level. The chosen model is of the same form in all three cases and
reveals strong spatial association as well as correlation of the allele frequencies with altitude.
ACKNOWLEDGEMENTS
This work was supported by the Swiss National Science Foundation grant 3200-049809.96. The authors are grateful to Michael Alpers and Kuldeep Bhatia for making the data available.
REFERENCES
ATTENBOROUGH, R. D. AND ALPERS, M. P. (1992). Human Biology in Papua New Guinea: The Small Cosmos,
Research Monographs on Human Population Biology 10.
BERNARDINELLI, L. AND MONTOMOLI, C. (1992). Empirical Bayes versus fully Bayesian analysis of geographical
variation in disease risk. Statistics in Medicine 11, 983–1007.
BESAG, J., GREEN, P., HIGDON, D. AND MENGERSEN, K. (1995). Bayesian computation and stochastic systems
(with discussion). Statistical Science 10, 3–66.
BHATIA, K. K., BLACK, F. L., SMITH, T. A., PRASAD, M. L. AND KOKI, G. N. (1995). Class I HLA antigens in
two long-separated populations: Melanesians and South Amerinds. American Journal of Physical Anthropology
97, 291–305.
BHATIA, K., JENKINS, C., PRASAD, M. L. AND KOKI, G. (1989). Immunogenetic studies on two recently contacted
populations from Papua New Guinea. Human Biology 61, 45–64.
BHATIA, K. K., SMITH, T. A., PRASAD, M. L. AND KOKI, G. N. (1991). Selection at HLA loci in Papua New
Guinea. In Programme and Abstracts of the Annual Scientific Meeting of the Australasian and South-East Asian
Tissue Typing Association, Auckland, pp. 189–199.
CAVALLI-SFORZA, L. L. (1997). Genes, peoples, and languages. Proceedings of the National Academy of Sciences USA
94, 7719–7724.
CLAYTON, D. G., BERNARDINELLI, L. AND MONTOMOLI, C. (1993). Spatial correlation in ecological analysis.
International Journal of Epidemiology 22, 1193–1202.
CRESSIE, N. A. C. (1993). Statistics for Spatial Data. New York: Wiley.
DANIELS, M. J. AND GATSONIS, C. (1997). Hierarchical polytomous regression models with applications to health
services research. Statistics in Medicine 16, 2311–2325.
EXCOFFIER, L. AND SLATKIN, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a
diploid population. Molecular Biology and Evolution 12, 921–927.
FAREWELL, V. T. (1982). The estimation of gene frequency, based on a particular leucocyte culture experiment.
Biometrics 38, 769–775.
GELFAND, A. E. AND GHOSH, S. K. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1–11.
GELFAND, A. E., GHOSH, S. K., KNIGHT, J. AND SIRMANS, C. F. (1998). Spatio-temporal modelling of residential
sales markets. Journal of Business and Economic Statistics 16, 312–331.
GELFAND, A. E. AND SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities.
Journal of the American Statistical Association 85, 398–409.
GELMAN, A. AND RUBIN, D. B. (1992). Inference from iterative simulations using multiple sequences (with discussion). Statistical Science 7, 457–511.
GREENWOOD, B. M. (1991). Common West African HLA antigens are associated with protection from severe
malaria. Nature 352, 595–600.
HILL, A. V. S. et al. (1991). Common West African HLA antigens are associated with protection from severe malaria.
Nature 352, 595–600.
HOBERT, J. P. AND CASELLA, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear
mixed models. Journal of the American Statistical Association 91, 1461–1473.
LONG, J. C., WILLIAMS, R. C. AND URBANEK, M. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56, 799–810.
LONJOU, C., CLAYTON, J., CAMBON-THOMSEN, A. AND RAFFOUX, C. (1995). HLA-A, -B, -DR haplotype frequencies in France—implications for recruitment of potential bone marrow donors. Transplantation 60, 375–383.
MARDIA, K. V. (1988). Multi-dimensional multivariate Gaussian Markov random fields with application to image
processing. Journal of Multivariate Analysis 24, 265–284.
RICHARDSON, S., MONFORT, C., GREEN, M., DRAPER, G. AND MUIRHEAD, C. (1995). Spatial variation of
natural radiation and childhood leukaemia incidence in Great Britain. Statistics in Medicine 14, 2487–2501.
SERJEANTSON, S. W. (1989). HLA genes and antigens. In The Colonization of the Pacific, A Genetic Trail, eds
Hill, A. V. S. and Serjeantson, S. W., pp. 120–173. Oxford: Oxford University Press.
SMITH, T., BHATIA, K., PRASAD, M., KOKI, G. AND ALPERS, M. (1994). Altitude, language, and the frequencies
of class I HLA alleles in Papua New Guinea. American Journal of Physical Anthropology 95, 155–168.
SPECKMAN, P. L., LEE, J. AND SUN, D. (1999). Existence of the MLE and propriety of posteriors for a general
multinomial choice model. Technical Report, Department of Statistics, University of Missouri, Columbia, MO.
WALLER, L. A., CARLIN, B. P., XIA, H. AND GELFAND, A. (1997). Hierarchical spatio-temporal mapping of
disease rates. Journal of the American Statistical Association 92, 607–617.
[Received June 23, 1999; revised October 27, 1999; accepted for publication November 1, 1999]