Biostatistics (2000), 1, 2, pp. 177–189
Printed in Great Britain
© Oxford University Press (2000)

Spatial modelling of multinomial data with latent structure: an application to geographical mapping of human gene and haplotype frequencies

P. VOUNATSOU, T. SMITH
Swiss Tropical Institute, Socinstrasse 57, PO Box CH-4002 Basel, Switzerland

and

A. E. GELFAND
Department of Statistics, University of Connecticut, 196 Auditorium Road, U-120, Storrs, CT 0629-3120, USA

SUMMARY

We develop hierarchical models for spatial multinomial data with missing categories, to analyse a database of HLA-A and -B gene and haplotype frequencies from Papua New Guinea, with a highly variable number of samples per spatial unit. The spatial structure of the multinomial data is incorporated by adopting conditional autoregressive (CAR) priors for the random effects, reflecting extra-multinomial variation. Different spatial structures are investigated, and covariate effects are evaluated using a novel model selection criterion. Tables and maps reveal strong spatial association and the importance of altitude, a covariate anticipated to be significant in explaining genetic variation. Our approach can be used in identifying associations with environmental factors, linguistic or epidemiological patterns and hence potential causes of genetic diversity (population movements, natural selection, stochastic effects).

Keywords: Alleles; CAR; Genetic variation; Haplotypes; Hierarchical model; Linkage disequilibrium; Missing data; Model choice.

1. INTRODUCTION

The study of geographical patterns of genetic variation can provide important information about human origins (see, for example, Cavalli-Sforza, 1997). In particular, with regard to the genetic structure of contemporary populations, such investigation can help to distinguish the contributions of natural selection mediated by environmental covariates from those of genetic drift (stochastic effects affecting gene frequencies), or population movements and mixing (exogamy). In this paper we address some of the issues involved in explaining variation in such spatially structured genetic data, and provide methods for modelling sparse data with latent structure and covariates.

Spatial variation in human genetic data is particularly relevant for the understanding of subsistence societies such as those in Papua New Guinea (PNG). There is a remarkable diversity of topography and climate in PNG, together with many small, culturally, linguistically, and genetically diverse human populations (Attenborough and Alpers, 1992). Since most of these populations have no documented history, analysis of their gene frequencies in terms of spatial pattern and covariate dependencies provides a phenomenological strategy for studying their origins and the reasons for their diversity.

Genetic information in humans is encoded linearly in 23 pairs of chromosomes. Genes which define different characteristics are found at specific locations (loci) on the chromosome, and may exist in different forms (alleles). For any given locus an individual has a pair of alleles (one from each chromosome) that may be both identical (homozygous) or may differ (heterozygous). Laboratory identification schemes, known as serological typing, only determine the alleles at a given locus, without assigning them to the individual chromosomes of the pair. Therefore, for each individual, we observe only the pair of alleles at each locus; assignment to the chromosome is missing. Populations can be characterized by the proportions of alleles for each gene (allele frequencies).

To study spatial genetic variation, we use data from the Human Leucocyte Antigen (HLA) system, which is a set of related genes on the sixth chromosome involved in various immunological processes. Two genes within the HLA system are analysed here: the HLA-A and HLA-B genes.
Variation in these two genes affects regulation of immune responses to many infectious diseases. HLA-B has been implicated in resistance to severe malaria (Hill et al., 1991). An altitudinal cline in HLA-A in PNG parallels that in malaria endemicity but may also reflect natural selection by other infectious diseases (Bhatia et al., 1991; Smith et al., 1994). In PNG there is limited diversity at loci A and B (Bhatia et al., 1995), with only four different alleles of HLA-A and six alleles of HLA-B occurring regularly.

An extension of geographical analysis of allele frequencies is that of haplotype frequencies. Haplotypes are combinations of alleles at different loci. For instance, a two-locus haplotype is defined by two alleles, one from each of the two loci. Thus an individual has a pair of haplotypes which we cannot directly observe. We see only the set of four alleles but not the pair of haplotypes which yielded them. Therefore haplotypes are latent data and estimation of their frequencies is a missing data problem.

Allele and haplotype data can be viewed as multinomial trials. Classical approaches for estimating associated probabilities are based on maximum likelihood methods as in Farewell (1982) and Lonjou et al. (1995). The EM algorithm, a standard tool for missing data problems, has been applied to estimation of haplotype frequencies in Long et al. (1995).

Hierarchical models offer the possibility of capturing extra-multinomial variation and spatial dependencies in the multinomial frequencies of alleles/haplotypes between geographical regions. They have been successfully applied in spatial epidemiology, using Poisson models with disease-count data, to map disease rates (Bernardinelli and Montomoli, 1992; Clayton et al., 1993; Waller et al., 1995). Recent work of Daniels and Gatsonis (1997) studies hierarchical polytomous response models but with no missing data or spatial concerns.
Motivated by the foregoing genetic questions we analyse a set of Class I HLA data collected using serological typing (laboratory identification of different forms of the genes) of nearly 6000 persons sampled from the general population in roughly 250 areas of PNG (Bhatia et al., 1995; Smith et al., 1995). We consider areal spatial variation of allele and haplotype frequencies at the HLA-A and HLA-B loci, both in order to describe the spatial pattern, and to explain these patterns by investigating possible environmental causal factors. In many areas the data are very sparse. We employ a hierarchical framework for modelling the observed areal multinomial data incorporating latent structure and introducing novel extra-multinomial variation across the areas using the three-dimensional (longitude, latitude and altitude) spatial coordinates of the areas as well as several area-level multivariate heterogeneity specifications.

The format of the paper is as follows. A description of the data is given in Section 2. Section 3 reviews the global (ignoring area-level information) likelihood-based analysis but uses simulation-based model fitting to avoid possibly inappropriate asymptotics. This enables routine examination of the association between allelic type at locus A and allelic type at locus B, so-called linkage disequilibrium. Section 4 introduces area-level modelling, offering several different specifications. Section 5 discusses aspects of the Markov chain Monte Carlo (MCMC) model fitting. An attractive approach for model comparison, which neatly piggy-backs on to simulation-based model fitting, is presented in Section 6. Section 7 selects a best model and then presents the analysis of the data under this model, revealing very strong spatial and altitude effects. In Section 8, we offer a brief summary.

2. DATA DESCRIPTION

The dataset consists of class I HLA typing data on 5645 PNG Melanesians drawn from the general population. The data record the alleles found at two different loci, HLA-A and HLA-B. The allelic typing was done serologically (Bhatia et al., 1989). In particular, four alleles (A2, A11, A24, Aw34) occur at the HLA-A locus and six alleles (B13, B27, B39, Bw56, Bw60, Bw62) at the HLA-B locus. For each individual, at each locus (A or B), two alleles are recorded. A joint 10 × 21 frequency table of the genotypes in loci A and B is given in Table 1.

Table 1. Joint allele frequencies at the HLA-A and HLA-B loci (columns: HLA-A genotype; rows: HLA-B genotype)

HLA-B       A11A11  A11A24  A11Aw34  A11A2  A2A2  A2A24  A2Aw34  A24A24  A24Aw34  Aw34Aw34   Freq
B13B13          15      14        3      0     0      0       0      27       22         7     88
B13B27          20      29       17      0     0      0       0      48       20         0    134
B13B39           2       9        3      1     0      0       0       9        6         5     35
B13Bw56         14      51       17      1     0      2       0     114       51        14    264
B13Bw60          8      26       16      1     0      1       0      65       55        11    183
B13Bw62         17      48       22      0     0      0       0      87       62        17    253
B27B27          21      18        6      0     0      1       0      35        9         1     91
B27B39           4      10        0      0     0      0       0       3        1         0     18
B27Bw56         20      82       28      1     0      0       0      99       38         2    270
B27Bw60         11      52       14      2     0      0       0      95       24         3    201
B27Bw62         20      86       18      0     0      1       0      88       27         2    242
B39B39           3       2        1      0     0      0       0       4        2         1     13
B39Bw56          4      11        2      0     0      0       0      25       12         4     58
B39Bw60          3       8        5      0     0      2       0      14        6         0     38
B39Bw62          0      10        0      0     0      0       0      14        1         2     27
Bw56Bw56        39      54       23      1     0      1       1     383      162        31    695
Bw56Bw60        26      97       27      2     0      3       3     358      189        35    740
Bw56Bw62        51     108       38      0     0      0       0     493      205        18    913
Bw60Bw60        20      18        7      1     0      3       5     178       49        16    297
Bw60Bw62        35      84       25      0     0      6       2     314       91        13    570
Bw62Bw62        49      91       22      1     0      0       1     284       58         9    515
Total          382     908      294     11     0     20      12    2737     1090       191   5645

For agricultural planning purposes PNG is divided into 4566 environmental zones (Resource Mapping Units (RMU)). The dataset was collected at 252 of these RMUs which we consider as the spatial units in the data. The sampling intensity in each unit was determined by convenience and ranged from 1 to 643. Within each unit healthy volunteers were recruited, mainly from field surveys at the village level.
As a result of collapsing RMUs which are very close together, we reduce the number of spatial areas to 98. The sample size in these areas varies between 5 and 643 individuals. The maps in Figure 1 show these areas, obtained as Delaunay tessellations, and reveal considerable spatial irregularity, that is, departure from lattice or grid-like shape. Further interpretation of Figure 1 is deferred to Section 7.

Associated with the $i$th unit is the longitude, latitude and altitude at its centroid, collected as $x_i = (x_{i1}, x_{i2}, x_{i3})$ where $x_{i1}$ and $x_{i2}$ correspond to the longitude and latitude, respectively, transformed to Cartesian coordinates, then centered and scaled, and $x_{i3}$ denotes the altitude. Altitudes are observed in four ordered classifications: ≤ 600 m, 600–1200 m, 1200–1800 m and > 1800 m with frequencies 27%, 12%, 46% and 15%, respectively. These classifications are centered and scaled to −1.5, −0.5, 0.5 and 1.5. The distance between areas is calculated as the Euclidean distance between their centroids.

[Fig. 1. Maps of raw frequencies for allelic types. (a) Allele A24 at HLA-A, (b) Allele Aw34 at HLA-A, (c) Allele Bw56 at HLA-B, (d) Allele Bw62 at HLA-B.]

3. GLOBAL ESTIMATION OF ALLELE AND HAPLOTYPE FREQUENCIES

Suppose that at a given locus (say HLA-A) there are $L$ known alleles $A_1, \dots, A_L$ with frequencies $\pi_l$, $l = 1, \dots, L$ ($\sum_{l=1}^{L} \pi_l = 1$), respectively. An individual has a pair of alleles at that particular locus. Let $o_l$ and $e_l$ be, respectively, the number of times that allele $A_l$ occurs in a homozygote (both alleles the same at locus A) or a heterozygote (alleles differ at locus A).
Thus, the number of occurrences of a given allele $A_l$ will be $n_l = 2o_l + e_l$ and the counts $(n_1, n_2, \dots, n_L)$ will follow a multinomial distribution,

\[
(n_1, n_2, \dots, n_L) \sim \mathrm{Mult}\Big(\sum_{l=1}^{L} n_l;\ \pi_1, \pi_2, \dots, \pi_L\Big),
\]

yielding the likelihood $L(\pi) = \prod_{l=1}^{L} \pi_l^{2o_l + e_l}$ and the maximum likelihood estimates of the allele frequencies, $\hat\pi_l = n_l / \sum_{i=1}^{L} n_i$.

The Hardy–Weinberg equilibrium (HWE) occurs when genotype frequencies arise through independence of the allele pairs, i.e. for, say, locus A, $P(\text{genotype } A_l A_{l'}) = 2\pi_l \pi_{l'}$ for $l \neq l'$ (and $\pi_l^2$ for the homozygote $A_l A_l$). Table 2 shows that the genotype frequencies at each locus roughly satisfy such equilibrium. That is, for each $l$ and $l'$ with $l \neq l'$, the observed proportion of $A_l A_{l'}$ pairs is roughly $2\hat\pi_l \hat\pi_{l'}$. We assume such equilibrium in the following.

Table 2. Examination of Hardy–Weinberg equilibrium at the HLA-A and HLA-B loci

HLA-A frequencies
Genotype    Observed   HWE
Aw34Aw34    0.034      0.025
Aw34A24     0.193      0.208
Aw34A2      0.002      0.001
Aw34A11     0.052      0.055
A2A2        0.000      0.000
A2A24       0.004      0.005
A24A24      0.485      0.440
A24A11      0.161      0.232
A2A11       0.002      0.001
A11A11      0.068      0.031

HLA-B frequencies
Genotype    Observed   HWE
B13B13      0.016      0.009
B13B27      0.024      0.017
B13B39      0.006      0.003
B13Bw56     0.047      0.060
B13Bw60     0.032      0.038
B13Bw62     0.045      0.050
B27B27      0.016      0.009
B27B39      0.003      0.003
B27Bw56     0.048      0.060
B27Bw60     0.036      0.038
B39B39      0.002      0.003
B39Bw56     0.010      0.011
B39Bw60     0.006      0.007
B39Bw62     0.005      0.009
Bw56Bw56    0.123      0.104
Bw56Bw60    0.131      0.133
Bw56Bw62    0.162      0.173
Bw60Bw60    0.053      0.042
Bw60Bw62    0.101      0.111
Bw62Bw62    0.091      0.072
B27Bw62     0.043      0.050

A two-locus haplotype is denoted by the pair $A_r B_s$. An individual has a pair of haplotypes for a particular combination of two loci. However, we are only able to type the alleles at a locus, but not the chromosome they come from. Therefore, haplotypes are latent data. For example, alleles $(A_r, A_k, B_s, B_t)$ can give the pair of haplotypes $A_r B_s$ and $A_k B_t$, or the pair $A_r B_t$ and $A_k B_s$.
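For the single-locus case, the allele-counting estimate and the Hardy–Weinberg proportions compared in Table 2 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function and variable names are ours, the genotype totals are the HLA-A column totals of Table 1, and homozygote proportions are taken as $\hat\pi_l^2$.

```python
from itertools import combinations_with_replacement

# HLA-A genotype totals (column totals of Table 1; they sum to 5645 individuals).
genotype_counts = {
    ("A11", "A11"): 382, ("A11", "A24"): 908, ("A11", "Aw34"): 294,
    ("A11", "A2"): 11, ("A2", "A2"): 0, ("A2", "A24"): 20,
    ("A2", "Aw34"): 12, ("A24", "A24"): 2737, ("A24", "Aw34"): 1090,
    ("Aw34", "Aw34"): 191,
}


def allele_frequencies(counts):
    """Allele-counting MLE: n_l = 2*o_l + e_l and pi_hat_l = n_l / sum(n)."""
    n = {}
    for (a, b), c in counts.items():
        n[a] = n.get(a, 0) + c
        n[b] = n.get(b, 0) + c  # a homozygote (a == b) contributes c twice
    total = sum(n.values())
    return {allele: value / total for allele, value in n.items()}


def hwe_expected(pi):
    """Expected genotype proportions under Hardy-Weinberg equilibrium."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(pi), 2):
        expected[(a, b)] = pi[a] ** 2 if a == b else 2 * pi[a] * pi[b]
    return expected


pi_hat = allele_frequencies(genotype_counts)
hwe = hwe_expected(pi_hat)
for genotype in [("A24", "A24"), ("A11", "A24"), ("A11", "A11")]:
    observed = genotype_counts[genotype] / sum(genotype_counts.values())
    print(genotype, round(observed, 3), round(hwe[genotype], 3))
```

Rounded to three decimals, the three genotypes printed here agree with the corresponding Observed and HWE entries of Table 2.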
Let $\pi_{rs}$ be the frequency of haplotype $(r, s)$, where $r$ indexes the alleles $1, \dots, R$ at locus HLA-A, $s$ indexes the alleles $1, \dots, S$ at HLA-B and $\sum_{r=1}^{R}\sum_{s=1}^{S} \pi_{rs} = 1$. Also, let $n_{rk,st}$, $r \le k$, $s \le t$, denote the number of times the four observed alleles are $(A_r, A_k, B_s, B_t)$. The number of different $n$s is $R(R+1)S(S+1)/4$. Under Hardy–Weinberg equilibrium, the probability of the genotype $(A_r, A_r, B_s, B_s)$ is $\pi_{rs}^2$, the probabilities of the $(A_r, A_r, B_s, B_t)$ and $(A_r, A_k, B_s, B_s)$ genotypes are $2\pi_{rs}\pi_{rt}$ and $2\pi_{rs}\pi_{ks}$, respectively, and the probability of the genotype $(A_r, A_k, B_s, B_t)$ is $2\pi_{rs}\pi_{kt} + 2\pi_{rt}\pi_{ks}$. Thus, the likelihood function for $\pi = (\pi_{11}, \pi_{12}, \dots, \pi_{RS})^T$ is

\[
L(\pi; n) \propto \prod_{r,s} \pi_{rs}^{2n_{rr,ss}}
\prod_{r,\; s<t} (2\pi_{rs}\pi_{rt})^{n_{rr,st}}
\prod_{r<k,\; s} (2\pi_{rs}\pi_{ks})^{n_{rk,ss}}
\prod_{r<k,\; s<t} (2\pi_{rs}\pi_{kt} + 2\pi_{rt}\pi_{ks})^{n_{rk,st}}. \tag{1}
\]

Explicit maximization of (1) over $\pi$ is not possible. Instead, when $r \neq k$ and $s \neq t$ let $n_{rk,st} = Z_{rs,kt} + Z_{rt,ks}$, where $Z_{rs,kt}$ and $Z_{rt,ks}$ are latent data and represent the number of times that alleles $(A_r, A_k, B_s, B_t)$ arose from the haplotype pair $(A_r B_s, A_k B_t)$ and $(A_r B_t, A_k B_s)$, respectively. Then the log likelihood function with the latent data can be written as

\[
\log L(\pi; Z, n) \propto \sum_{r,s} n_{rs} \log \pi_{rs}, \tag{2}
\]

where

\[
n_{rs} = 2n_{rr,ss} + \sum_{s<t} n_{rr,st} + \sum_{t<s} n_{rr,ts} + \sum_{r<k} n_{rk,ss} + \sum_{k<r} n_{kr,ss} + \sum_{k \neq r,\; t \neq s} Z_{rs,kt}.
\]

Viewing (2) as a full-data log likelihood suggests using the EM algorithm to maximize (1). This has been proposed by Excoffier and Slatkin (1995) and Long et al. (1995).

The EM algorithm may have difficulty converging if the likelihood has local modes. Regardless, it provides only a point estimate and an approximate standard error. A Bayesian approach yields entire marginal and joint distributions for the $\pi$s and avoids asymptotics. With a flat (constant) prior on the $\pi_{rs}$, the resultant analysis may be viewed as exact likelihood inference. Using MCMC simulation we obtain approximate samples from the posterior distribution of $\pi$. Marginal posterior summaries of the $\pi_{rs}$ using the median and equal-tail 90% interval estimates are given in Table 3. Since allele A2 occurs so infrequently, we elected to model only the frequencies for A11, A24 and Aw34 in the HLA-A and the subsequent haplotype analyses. The marginals in Table 3 provide similar summaries for the $\pi_{r\cdot}$s and $\pi_{\cdot s}$s. Independence of the loci would mean that $\pi_{rs} - \pi_{r\cdot}\pi_{\cdot s} = 0$. For 12 of the 18 resulting posteriors, the 90% interval estimates failed to contain 0, suggesting association or linkage disequilibrium between the loci.

Table 3. Posterior summaries (median and 90% interval estimates) for the $\pi_{rs}$ (haplotype frequencies) and the $\pi_{r\cdot}$ and $\pi_{\cdot s}$ (allele frequencies)

                                    HLA-A
HLA-B        A11                    A24                    Aw34                   Marginals
B13          0.020 (0.018, 0.023)   0.048 (0.044, 0.050)   0.034 (0.031, 0.036)   0.101 (0.097, 0.106)
B27          0.048 (0.044, 0.051)   0.049 (0.046, 0.053)   0.008 (0.007, 0.010)   0.105 (0.101, 0.109)
B39          0.007 (0.006, 0.009)   0.010 (0.008, 0.011)   0.003 (0.002, 0.004)   0.020 (0.018, 0.023)
Bw56         0.041 (0.036, 0.044)   0.209 (0.203, 0.215)   0.073 (0.070, 0.078)   0.323 (0.317, 0.330)
Bw60         0.031 (0.027, 0.035)   0.137 (0.131, 0.142)   0.044 (0.040, 0.048)   0.212 (0.207, 0.218)
Bw62         0.055 (0.051, 0.059)   0.183 (0.177, 0.188)   0.034 (0.032, 0.038)   0.273 (0.264, 0.279)
Marginals    0.202 (0.197, 0.208)   0.635 (0.628, 0.641)   0.197 (0.192, 0.202)

The models discussed in this section apply to the whole region and do not consider any areal unit spatial structure. We view them as baseline specifications providing a reference against which more sophisticated models, introduced in the next section, can be compared.

4. HIERARCHICAL SPATIAL MODELS

We extend the models of the previous section to incorporate heterogeneity among areas. We capture this variation in gene frequency using models incorporating areal covariates and spatial random effects with a goal of spatially smoothing the frequencies. Such spatial models for gene and haplotype frequencies have not been investigated previously. Lonjou et al. (1995) look at the geographical distribution of haplotype frequencies among potential bone-marrow donors in France. However, they use maximum likelihood methods to estimate frequencies within each region, ignoring spatial dependencies and variable sample size in each region.

Indexing the areas by $i$, the area-level data are denoted as $n_i = (n_{i1}, n_{i2}, \dots, n_{iL})^T$ where, in the case of a single locus, $l$ indexes allele types, i.e. $L = R$ or $S$. In the case of two loci, $l$ indexes haplotypes, i.e. $L = RS$. Recall that, as in (2), here we do not observe the $n_{i,rs}$; latent $Z_i$s must be introduced. We propose a hierarchical model for the $n_i$ which, at the first level, assumes that

\[
n_i = (n_{i1}, n_{i2}, \dots, n_{iL}) \sim \mathrm{Mult}\Big(\sum_{j=1}^{L} n_{ij},\ \pi_i\Big),
\]

where $\pi_i = (\pi_{i1}, \pi_{i2}, \dots, \pi_{iL})$ denotes the theoretical allele or haplotype frequencies in spatial unit $i$. The second-stage specification models

\[
\theta_{il} = \log\frac{\pi_{il}}{\pi_{i1}}, \qquad i = 1, \dots, 98, \quad l = 2, \dots, L.
\]

The spatial component is captured either through random spatial effects (which we denote by $\phi$s) or by a linear regression form in the covariate vector $x_i$ (longitude, latitude, altitude). The allelic type (or haplotype) explanation is captured using type effects (which we denote by $\gamma$s). In the regression case, we employ a vector of regression coefficients which depends upon $l$, $\beta_l$. The global model of the previous section sets $\theta_{il} = \gamma_l$.

Note that models for $\theta_{il}$ which are additive in spatial and type components will not be sensible.
They are too limited in the range of behavior they permit for the $\pi_{il}$. For instance, if $\theta_{il} = \phi_i + \gamma_l$, then, as $\phi_i$ increases, all $\pi_{il}$, $l > 1$, increase; only $\pi_{i1}$ decreases. If $\theta_{il} = x_i^T\beta + \gamma_l$ the situation is even worse. The vector of spatial components is constrained to the manifold spanned by $X = (x_1, x_2, \dots, x_{98})^T$, unlike the previous model where $\phi = (\phi_1, \phi_2, \dots, \phi_{98})^T$ was unconstrained. Also, a model which sets $\theta_{il} = \gamma_{il}$, that is, provides a $\gamma_i$ for each $i$, is not interesting in the allele case. If the $\gamma_{il}$ are fixed it will provide a perfect fit under a likelihood analysis and will reveal a small lack of fit if the $\gamma_{il}$ are assumed random (with fairly vague priors) to permit Bayesian smoothing. In the haplotype case only goodness of fit to the $n_{i,rk,st}$ can be assessed because the $n_{i,rs}$ are not observed. Hence, fit will not be near perfect and this model now provides a measure of the best we can do in this case.

As a result, we consider the following collection of second-stage models:

Model (1): $\theta_{il} = \gamma_l$
Model (2): $\theta_{il} = \gamma_l + \beta_{1l} x_{i1} + \beta_{2l} x_{i2} + \beta_{3l} x_{i3}$
Model (3): $\theta_{il} = \gamma_l + \phi_i^{(l)}$
Model (4): $\theta_{il} = \gamma_l + \phi_i^{(l)} + \beta_l x_{i3}$

Model (1) is the earlier nonhierarchical baseline model. Model (2) adds a spatial component which is a linear regression form. Model (3) adds random effects modelled as discussed below. The $\phi_i^{(l)}$ will be dependent across $i$ and provide areal adjustment to the $\gamma_l$. Model (4) reintroduces the altitude covariate, which is anticipated to be important in explaining genetic variation. It is a surrogate for moisture level, which influences the incidence of diseases such as malaria. Local genetics should reflect response to such incidence.

We adopt a Bayesian perspective for convenient model fitting. We take flat priors for $\gamma = (\gamma_2, \gamma_3, \dots, \gamma_L)^T$ and $\beta$ to approximate a likelihood analysis.
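To make the limitation of additive models concrete, the following sketch maps baseline-category logits back to frequencies and compares a shared area effect with type-specific effects as in Model (3). It is an illustration only: the numerical values of the type effects and area effects are hypothetical, and the function name is ours.

```python
import math


def logits_to_probs(theta):
    """Invert theta_l = log(pi_l / pi_1), l = 2..L; returns (pi_1, ..., pi_L)."""
    weights = [1.0] + [math.exp(value) for value in theta]
    total = sum(weights)
    return [w / total for w in weights]


gamma = [0.5, -0.3]  # hypothetical type effects gamma_2, gamma_3

# Additive form theta_il = phi_i + gamma_l: one area effect shared by all types.
base = logits_to_probs([g + 0.0 for g in gamma])
shifted = logits_to_probs([g + 1.0 for g in gamma])

# Model (3) form theta_il = gamma_l + phi_i^(l): one area effect per type.
phi_by_type = [1.0, -1.0]  # hypothetical phi_i^(2), phi_i^(3)
flexible = logits_to_probs([g + p for g, p in zip(gamma, phi_by_type)])

print([round(p, 3) for p in base])
print([round(p, 3) for p in shifted])
print([round(p, 3) for p in flexible])
```

Under the shared effect every non-base frequency rises and only the base frequency falls, whereas the type-specific effects allow one non-base frequency to rise while another falls.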
With a proper prior on the $\phi_i^{(l)}$, a proper posterior is assured for all four models provided at least $L + 3$ of the multinomial vectors, $n_i$, have all $n_{ij} > 0$. See, for example, Speckman et al. (1999). We remark that using an alternative set of logits results in a nonlinear reparametrization of $\gamma$ and $\beta$ for which the prior would no longer be flat. However, this lack of invariance to parametrization is not an issue in terms of the logits. Again, spatial smoothing of the logits, equivalently the $\pi_{il}$, is the objective.

Turning to the $\phi_i^{(l)}$, suppose we assume that the $\phi^{(l)} = (\phi_1^{(l)}, \phi_2^{(l)}, \dots, \phi_N^{(l)})$ are independent. For each $\phi^{(l)}$ we choose a spatial Gaussian prior given through a conditional normal distribution for each $\phi_i^{(l)}$ given $\phi_j^{(l)}$, $j \neq i$, as in Bernardinelli and Montomoli (1992). This normal is of the form

\[
N\Big(\sum_j \omega_{ij}\phi_j / \omega_{i+},\ \sigma^2 / \omega_{i+}\Big),
\]

where $\omega_{ii} = 0$, $\omega_{ij} = \omega_{ji}$ and $\sum_j \omega_{ij} = \omega_{i+}$. The resultant joint distribution for $\phi^{(l)}$ is improper and, in fact, the posteriors under models (3) and (4) are as well. Imposing the constraint $\sum_{i=1}^{N} \phi_i^{(l)} = 0$ removes the impropriety, returning us to the conditions of the previous paragraph.

We recognize that there is some arbitrariness in the definition of our areal units, that the resulting units are very irregular in size and shape, and that at a different resolution or with rezoning of the areas at the current resolution, spatial association might look different. Hence we avoid subjective distance-based $\omega_{ij}$. We adopt a nearest-neighbor choice, setting $\omega_{ij} = 1$ for $j$ an immediate neighbor of $i$, obtaining the conditional autoregressive (CAR) prior (Besag et al., 1995). We regard this prior as an elementary attempt to qualitatively describe the spatial dependence in the data. In fact, we introduce a variance $\sigma_l^2$ for each $l$ which specifies the amount of variability in the $\phi_i^{(l)}$ (Richardson et al., 1995).
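The nearest-neighbor conditional distribution can be sketched directly from its definition. With $\omega_{ij} = 1$ for neighbors and 0 otherwise, $\omega_{i+}$ is the number of neighbors of area $i$, so the conditional mean is the average of the neighboring effects and the conditional variance is $\sigma^2$ divided by the number of neighbors. The four-area adjacency structure and the values below are hypothetical, and the function name is ours.

```python
def car_conditional(i, phi, neighbours, sigma2):
    """Mean and variance of phi_i given the rest under a nearest-neighbour CAR prior.

    With w_ij = 1 for neighbours and 0 otherwise, w_i+ is the neighbour count.
    """
    adjacent = neighbours[i]
    w_plus = len(adjacent)
    mean = sum(phi[j] for j in adjacent) / w_plus
    variance = sigma2 / w_plus
    return mean, variance


# Hypothetical map of four areas; the adjacency must be symmetric (w_ij = w_ji).
neighbours = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
phi = [0.2, -0.1, 0.4, -0.5]  # hypothetical current values of the spatial effects
sigma2 = 0.1                  # hypothetical value of the CAR variance

for area in neighbours:
    print(area, car_conditional(area, phi, neighbours, sigma2))
```

Areas with many neighbors are therefore smoothed more strongly towards the local average than areas with few neighbors.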
A proper inverse gamma prior is adopted for the $\sigma_l^2$, again to ensure a proper posterior (Hobert and Casella, 1996) as well as a well-behaved MCMC simulation. We chose an inverse gamma distribution having mean 0.1 and infinite variance, which provides the right order of magnitude for the $\phi$s.

Dependence between the $\phi^{(l)}$ could be captured using a multivariate conditional autoregressive (MCAR) prior for the $\phi_i = (\phi_i^{(2)}, \phi_i^{(3)}, \dots, \phi_i^{(L)})$ as in Mardia (1988). We are currently exploring such specifications but in any event, though the $\phi^{(l)}$ are assumed independent a priori, the dependence is revised through the posterior.

5. BAYESIAN COMPUTATIONS

The foregoing hierarchical models give rise to high-dimensional posterior densities which require MCMC methods to fit. To simulate from the respective posterior densities of each model we use a Gibbs sampling scheme (Gelfand and Smith, 1990). The latent $Z$ have independent binomial distributions. Whenever full conditional distributions are not of a standard form we use a random-walk Metropolis proposal, which is a normal distribution with mean equal to the current value and a variance–covariance matrix which is the inverse of the Hessian matrix estimated at the current mean. The Hessian matrix can be calculated straightforwardly since the second derivatives can be obtained analytically.

We note that for models involving $\phi_i^{(l)}$ the parameters $\gamma_l$ and $\phi_i^{(l)}$ are not identifiable. For an arbitrary $c$, these models cannot distinguish $\gamma_l' = \gamma_l + c$ and $\phi_i^{(l)\prime} = \phi_i^{(l)} - c$ from $\gamma_l$ and $\phi_i^{(l)}$ (whence the posterior is improper). To overcome this problem, at the end of every Gibbs iteration we impose the constraint $\sum_{i=1}^{N} \phi_i^{(l)} = 0$, by adding to the samples of $\gamma_l$ the average of the $\phi_i^{(l)}$s and subtracting it from each $\phi_i^{(l)}$.

We implement a multiple-chain Gibbs sampler with 10 independent chains running in parallel. We assess convergence using the method of Gelman and Rubin (1992).
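The recentering applied at the end of each Gibbs iteration amounts to a few lines. The sketch below is an illustration under our own naming, with hypothetical values; it shifts the mean of the spatial effects into the type effect, which leaves every logit $\gamma_l + \phi_i^{(l)}$ unchanged while enforcing the sum-to-zero constraint.

```python
def recentre(gamma_l, phi_l):
    """Move the average of the phi_i^(l) into gamma_l so that the phi_i^(l) sum to zero."""
    mean_phi = sum(phi_l) / len(phi_l)
    return gamma_l + mean_phi, [value - mean_phi for value in phi_l]


gamma_l = 1.4               # hypothetical current draw of gamma_l
phi_l = [0.3, -0.1, 0.4]    # hypothetical current draws of phi_i^(l), i = 1..3

new_gamma, new_phi = recentre(gamma_l, phi_l)
print(new_gamma, new_phi, sum(new_phi))
```

Because only the split between $\gamma_l$ and the $\phi_i^{(l)}$ changes, the likelihood contribution of each area is the same before and after the step.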
After convergence, we collect a final sample from the posterior of size 1000 by taking every 30th value of each chain.

6. MODEL CHOICE

Models (3) and (4) are not regular in that model dimension increases as sample size (number of areas) increases. Model choice criteria requiring degrees of freedom or customary asymptotics are not applicable. Moreover, Bayes factors, even if we could compute them, are not interpretable since some of the prior specifications are improper. We adopt the model selection approach developed by Gelfand and Ghosh (1998). In the context of spatial modelling it has been applied to Poisson data by Waller et al. (1997) and to univariate normal data by Gelfand et al. (1998). As Gelfand and Ghosh note, because the multinomial distribution is a multivariate exponential family, their deviance-loss–utility-maximization approach can be extended to multinomial likelihoods.

The basic idea of this approach is to first specify a loss function suitable for prediction of observations under the stochastic family for the data. Our choice is connected to the deviance associated with the multinomial distribution. This loss is used to define two additive components, one which rewards good predictive performance for a new $n_{il}$, the other which rewards fidelity to the observed $n_{il}$. Next, the minimizing posterior action is obtained for each $n_{il}$ under this two-component loss. Surprisingly, it can be obtained explicitly. Inserting this minimizing action into the posterior expected loss and cumulating over $n_{il}$ yields a predictive model choice criterion: select the model yielding the smallest value. For the multinomial, omitting details, we obtain a criterion of the form $D = G + P$ where

\[
G = 2\sum_{i=1}^{N}\sum_{l=1}^{L}\left\{(\tilde\mu_{il} + L^{-1})\log\frac{2(\tilde\mu_{il} + L^{-1})}{\tilde\mu_{il} + n_{il} + 2L^{-1}} + (n_{il} + L^{-1})\log\frac{2(n_{il} + L^{-1})}{\tilde\mu_{il} + n_{il} + 2L^{-1}}\right\}
\]

and

\[
P = 2\sum_{i=1}^{N}\sum_{l=1}^{L}\left\{\tilde t_{il} - (\tilde\mu_{il} + L^{-1})\log(\tilde\mu_{il} + L^{-1})\right\}.
\]

Here, $\tilde\mu_{il}$ is the posterior predictive mean of a new $n_{il}$ given the data and model, i.e. $\tilde\mu_{il} = E(n_{il,\mathrm{new}} \mid \text{data, model})$. Similarly, $\tilde t_{il} = E\big((n_{il,\mathrm{new}} + L^{-1})\log(n_{il,\mathrm{new}} + L^{-1}) \mid \text{data, model}\big)$. The $L^{-1}$s are continuity corrections introduced to ensure that the log can be calculated and that the expectations exist. $G$ is interpreted as a goodness of fit term. It is 0 when $\tilde\mu_{il} = n_{il}$ and increases as $\tilde\mu_{il}$ moves away from $n_{il}$. $P$ is interpreted as a penalty term and can be shown to be approximately a weighted sum of predictive variances. A poor model will produce a large $G$ and $P$, hence $D$. A good model will make both $G$ and $P$, hence $D$, small. An overfitted model will reveal a small $G$ but an inflated $P$. Thus, parsimony is encouraged. Finally, in the case of the haplotype models we replace $n_{il}$ by $n_{i,rk,st}$ in all of the above.

Computation of the posterior predictive expectations is done by Monte Carlo integration using samples from the required predictive distributions. These samples are generated straightforwardly, one for one, using samples of the $\pi_{il}$. The posterior samples which are the output of the MCMC algorithm used to fit the model yield posterior samples of the $\pi_{il}$.

Table 4. Goodness of fit term (G), penalty term (P), and overall model choice criterion (D). Smaller values indicate preferred models

                                                                     HLA-A                         HLA-B                         HLA-AB
Model for $\theta_{il}$                                              G        P       D            G        P       D            G        P        D
(1) $\gamma_l$                                                       1764.25  191.96  1956.21      1775.78  469.65  2245.43      5334.44  2182.42  7516.86
(2) $\gamma_l + \beta_{1l}x_{i1} + \beta_{2l}x_{i2} + \beta_{3l}x_{i3}$   1014.34  194.32  1208.66      1221.24  471.89  1693.13      4766.78  2156.02  6928.80
(3) $\gamma_l + \phi_i^{(l)}$                                         961.33  217.30  1178.63       991.66  590.08  1581.74      4578.02  2376.32  6954.34
(4) $\gamma_l + \phi_i^{(l)} + \beta_l x_{i3}$                        508.59  238.19   746.78       715.93  627.67  1343.60      4536.99  2299.64  6836.63

Table 4 provides the model choice criterion for models (1)–(4) for locus A, for locus B and for the AB haplotype. We see that the baseline model is poor. Model (2) is better but the CAR spatial explanation, including altitude, is much better.
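The Monte Carlo evaluation of the criterion for a single area can be sketched as follows. Several choices here are assumptions of the sketch and not taken from the analysis: the observed counts are hypothetical, the posterior draws of $\pi_i$ are stood in for by a Dirichlet posterior under a flat prior in place of MCMC output, and the penalty is evaluated at the predictive mean $\tilde\mu_{il}$, the form under which $P$ is non-negative and behaves like a weighted sum of predictive variances. All names are ours.

```python
import math
import random


def t(y, c):
    """t(y) = (y + c) * log(y + c), with continuity correction c = 1/L."""
    return (y + c) * math.log(y + c)


def area_criterion(n, replicates):
    """Contribution of one area to (G, P), given predictive replicates of its counts."""
    L = len(n)
    c = 1.0 / L
    m = len(replicates)
    G = P = 0.0
    for l in range(L):
        mu = sum(rep[l] for rep in replicates) / m             # predictive mean
        t_tilde = sum(t(rep[l], c) for rep in replicates) / m  # predictive E[t(n_new)]
        denom = mu + n[l] + 2 * c
        G += 2 * ((mu + c) * math.log(2 * (mu + c) / denom)
                  + (n[l] + c) * math.log(2 * (n[l] + c) / denom))
        P += 2 * (t_tilde - t(mu, c))
    return G, P


def draw_dirichlet(alpha, rng):
    """One Dirichlet draw via normalised gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def draw_counts(size, probs, rng):
    """One multinomial draw of `size` trials with cell probabilities `probs`."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[cell] += 1
    return counts


rng = random.Random(1)
observed = [12, 30, 8]  # hypothetical allele counts in one area
# Stand-in for MCMC output: Dirichlet(n + 1) posterior draws of pi under a flat prior.
posterior_pi = [draw_dirichlet([x + 1 for x in observed], rng) for _ in range(500)]
# One predictive replicate per posterior draw ("one for one").
replicates = [draw_counts(sum(observed), p, rng) for p in posterior_pi]

G, P = area_criterion(observed, replicates)
print(G, P, G + P)
```

Summing the per-area contributions over all areas would give the overall $G$, $P$ and $D$ for a model; both terms are non-negative by construction.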
Though it yields slight overfitting, G is substantially reduced.

7. ANALYSIS OF THE DATA

In the HLA-A gene we used A11 as the base cell for the logits and, based upon the previous section, used model (4) with CAR. Table 5 presents a posterior summary. Since the altitude levels and the spatial effects are centered at 0, the $\gamma$s may be thought of as population level logits. The altitude and spatial terms provide adjustments for the $i$th area. Converting the median logits to probabilities gives $\pi_{A11} = 0.161$, $\pi_{A24} = 0.676$ and $\pi_{Aw34} = 0.163$, in reasonable agreement with Table 3. Table 5 also reveals that altitude provides a significant adjustment to the probabilities. The advantage of introducing distinct parameters, $\beta_{A24}$ and $\beta_{Aw34}$, is apparent; the former is roughly twice the latter. Also, both altitude coefficients are positive, indicating that alleles A24 and Aw34 are more frequent at high altitudes, in contrast to A11 which is met mainly in low-altitude areas. Maps of raw allele frequencies in Figures 1(a) and (b) confirm the association with altitude, especially for A24. Note that the central areas of PNG are mainly the high-altitude areas, while the coastal areas are the lowlands.

The spatial effects are also important. Figure 2(a) shows a grey-scale map of the posterior medians of the $\phi_i^{A24}$. Recalling that, a priori, the $\phi_i^{A24}$ are all centered at 0, the spatial association is evident. Figure 2(b) shows a grey-scale map of the posterior medians of the $\phi_i^{Aw34}$. Again there is spatial association but with a different pattern from Figure 2(a), demonstrating the importance of introducing distinct spatial models for $\phi^{A24}$ and $\phi^{Aw34}$.

Table 5. Posterior median and 90% credible sets for parameters under model 4 for HLA-A, -B, -AB frequencies

             γ                        β                        σ²
HLA-A
Aw34         0.001 (−0.15, 0.08)      0.375 (0.367, 0.377)     0.06 (0.02, 0.29)
A24          1.43 (1.30, 1.52)        0.708 (0.707, 0.709)     0.05 (0.02, 0.23)
HLA-B
Bw62         1.03 (0.82, 1.23)        0.61 (0.52, 0.70)        0.06 (0.02, 0.28)
B27          −0.05 (−0.31, 0.10)      0.35 (0.26, 0.51)        0.06 (0.02, 0.29)
B39          −1.80 (−2.17, −1.38)     0.07 (−0.05, 0.11)       0.06 (0.02, 0.27)
Bw56         1.18 (1.00, 1.37)        0.56 (0.51, 0.69)        0.05 (0.02, 0.27)
Bw60         0.75 (0.54, 0.92)        0.52 (0.50, 0.57)        0.07 (0.02, 0.33)
HLA-AB
Aw34Bw62     −1.21 (−1.49, 0.50)      0.21 (0.03, 0.70)        0.05 (0.02, 0.24)
Aw34B27      −2.11 (−3.30, −0.44)     0.36 (−0.03, 1.33)       0.05 (0.02, 0.23)
Aw34B39      −4.35 (−6.06, −2.93)     −1.32 (−2.53, −0.24)     0.05 (0.02, 0.23)
Aw34Bw56     −0.38 (−0.60, 1.58)      0.72 (0.69, 1.39)        0.05 (0.02, 0.24)
Aw34Bw60     −0.95 (−1.44, 1.03)      0.48 (0.13, 0.90)        0.05 (0.02, 0.25)
Aw34B13      −1.34 (−1.96, 0.22)      0.27 (0.03, 0.51)        0.05 (0.02, 0.20)
A24Bw62      0.67 (0.54, 2.67)        0.95 (0.87, 1.39)        0.04 (0.02, 0.27)
A24B27       −0.57 (−0.90, 1.36)      0.67 (0.55, 1.34)        0.04 (0.02, 0.25)
A24B39       −2.18 (−2.71, 0.23)      0.38 (0.05, 0.87)        0.05 (0.02, 0.24)
A24Bw56      0.80 (0.69, 2.67)        0.77 (0.62, 1.44)        0.05 (0.02, 0.21)
A24Bw60      0.38 (0.20, 2.29)        0.70 (0.63, 1.39)        0.05 (0.02, 0.26)
A24B13       −0.65 (−0.94, 1.37)      0.36 (0.26, 0.90)        0.05 (0.02, 0.28)
A11Bw62      −0.85 (−0.99, 1.23)      0.31 (0.61, 0.91)        0.04 (0.01, 0.19)
A11B27       −0.98 (−1.33, 1.07)      0.28 (0.12, 0.87)        0.05 (0.02, 0.27)
A11B39       −2.82 (−3.78, −0.92)     0.19 (−0.65, 0.58)       0.05 (0.015, 0.23)
A11Bw56      −0.92 (−1.15, 0.92)      0.10 (−0.06, 0.48)       0.05 (0.02, 0.19)
A11Bw60      −1.42 (−1.92, 0.64)      0.09 (−0.43, 0.72)       0.05 (0.02, 0.24)

Turning to the HLA-B gene and taking B13 as the base cell, Table 5 also summarizes posterior estimates of model parameters. Again converting the median logits to probabilities yields $\pi_{B13} = 0.097$, $\pi_{B27} = 0.092$, $\pi_{B39} = 0.016$, $\pi_{Bw56} = 0.316$, $\pi_{Bw60} = 0.212$, and $\pi_{Bw62} = 0.272$, in remarkable agreement with Table 3.
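The conversion from median logits to probabilities is the inverse of the baseline-category logit, $\pi_{B13} = 1/(1 + \sum_l e^{\gamma_l})$ and $\pi_l = e^{\gamma_l}\pi_{B13}$. A short sketch using the HLA-B posterior medians of $\gamma$ from Table 5 follows; the variable names are ours, and because the inputs are rounded medians the results need not match the reported values in every decimal.

```python
import math

# Posterior medians of gamma for HLA-B (Table 5); B13 is the base cell.
gamma_B = {"Bw62": 1.03, "B27": -0.05, "B39": -1.80, "Bw56": 1.18, "Bw60": 0.75}

denominator = 1.0 + sum(math.exp(g) for g in gamma_B.values())
probs = {"B13": 1.0 / denominator}
probs.update({allele: math.exp(g) / denominator for allele, g in gamma_B.items()})

for allele, p in probs.items():
    print(allele, round(p, 3))
```

From these rounded inputs the values for B13, B27, B39, Bw56 and Bw62 reproduce the figures quoted in the text to three decimals, while Bw60 comes out slightly lower (about 0.206 against the reported 0.212).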
Again the magnitude and sign of the altitude coefficients indicate that Bw62, Bw56 and Bw60 are most frequent at high altitude, in contrast to allele B13. Allele B39 shows no association with altitude. These results are confirmed by the maps of the raw frequencies of the alleles at HLA-B (e.g. Figures 1(c) and (d)) and the maps of spatial effects (e.g. Figures 2(c) and (d)).

Finally, for the HLA-A and -B haplotypes, we obtain an 18-cell joint frequency distribution. Taking (A11, B13) as the base haplotype, 13 of the 17 logits show significant altitude effects (Table 5). Plots analogous to those in Figures 1 and 2 are omitted but again reveal strong spatial associations.

Fig. 2. Spatial effects under model (4) (CAR) (see text for details) for allelic types. (a) φ_A24 (allele A24 at HLA-A), (b) φ_Aw34 (allele Aw34 at HLA-A), (c) φ_Bw56 (allele Bw56 at HLA-B), (d) φ_Bw62 (allele Bw62 at HLA-B).

Many studies have contrasted the HLA types found in the PNG highlands with those in coastal areas, but most compared only small numbers of populations. Such studies are unable to establish whether the differences result from natural selection (possibly related to altitudinal variation in disease patterns), distinctions between waves of human migrants, or in situ divergence of populations via genetic drift. Since it includes many sampling points, the present dataset can be used to distinguish between these hypotheses, and Bhatia et al. (1991) applied the Ewens-Watterson test to these data to argue that the altitudinal cline in HLA-A is a result of natural selection. In a further analysis, Smith et al.
(1994) used matrix correlation approaches to confirm that the clinal variation in HLA-A reflects genuine effects of altitude, not merely geographical separation of the highland and lowland populations. However, neither of these analyses could adequately assess the residual geographical patterns after accounting for the altitudinal variation. The present approach could incorporate even more data than were used by Smith et al. (1994) and Bhatia et al. (1991) because it included spatial units with arbitrarily small sample sizes. The earlier analyses did not include sufficient units for the overall spatial patterns to be evident, but the inclusion of these sparse units means that sampling variation obscures spatial patterns in the raw data. The residual patterns in the smoothed data, however, confirm that there is variation in HLA allele frequencies over short distances, comparable in magnitude to the variation across the whole of the South West Pacific region (Serjeantson, 1989). Our analyses are the first to demonstrate that altitudinal differences in HLA-B are not due to spatial correlation, and to provide descriptions of the residual spatial patterns at both loci, after adjustment for the altitudinal clines.

Features of spatial pattern which are not obvious in maps of raw frequencies in Figure 1 but which are depicted by the CAR-smoothed estimates of the altitude-adjusted gene frequencies in Figure 2 (not all maps are shown here) include west–east clinal variation. Since the original colonization of PNG is thought to have been via waves of immigrants from the north-west, these clines may reflect later arrival of the ancestors of the current western populations. There are higher frequencies of A24 and Bw56 in the east of PNG, while B27 is more frequent in the west.
The rather different spatial patterns shown by other alleles might reflect natural selection by unknown causes, and could be further investigated by testing additional covariates (e.g. climatic data) in the models. Thus, conditional on altitude, Bw62 frequencies are highest along a crescent from the south-west (western province) continuing through the centre of PNG (the main highland valleys). A complementary pattern is shown by B39. In the south-west Bw60 appears to be most frequent.

8. SUMMARY

In this paper we have presented a hierarchical framework for modelling sparse spatially structured multinomial data. Markov chain Monte Carlo (MCMC) methods have been applied to fit the models and to obtain point and interval estimates for the model parameters. A model comparison criterion, convenient for multinomial data models fitted using MCMC, has been employed. We have implemented this approach to study geographical patterns in HLA-A and HLA-B alleles as well as HLA-AB haplotypes at an areal level. The chosen model is of the same form in all three cases and reveals strong spatial association as well as correlation of the allele frequencies with altitude.

ACKNOWLEDGEMENTS

This work was supported by the Swiss National Science Foundation grant 3200-049809.96. The authors are grateful to Michael Alpers and Kuldeep Bhatia for making the data available.

REFERENCES

ATTENBOROUGH, R. D. AND ALPERS, M. P. (1992). Human Biology in Papua New Guinea: The Small Cosmos, Research Monographs on Human Population Biology 10.
BERNARDINELLI, L. AND MONTOMOLI, C. (1992). Empirical Bayes versus fully Bayesian analysis of geographical variation in disease risk. Statistics in Medicine 11, 983–1007.
BESAG, J., GREEN, P., HIGDON, D. AND MENGERSEN, K. (1995). Bayesian computation and stochastic systems (with discussion). Statistical Science 10, 3–66.
BHATIA, K. K., BLACK, F. L., SMITH, T. A., PRASAD, M. L. AND KOKI, G. N. (1995).
Class I HLA antigens in two long-separated populations: Melanesians and South Amerinds. American Journal of Physical Anthropology 97, 291–305.
BHATIA, K., JENKINS, C., PRASAD, M. L. AND KOKI, G. (1989). Immunogenetic studies on two recently contacted populations from Papua New Guinea. Human Biology 61, 45–64.
BHATIA, K. K., SMITH, T. A., PRASAD, M. L. AND KOKI, G. N. (1991). Selection at HLA loci in Papua New Guinea. In Programme and Abstracts of the Annual Scientific Meeting of the Australasian and South-East Asian Tissue Typing Association, Auckland, pp. 189–199.
CAVALLI-SFORZA, L. L. (1997). Genes, peoples, and languages. Proceedings of the National Academy of Sciences USA 94, 7719–7724.
CLAYTON, D. G., BERNARDINELLI, L. AND MONTOMOLI, C. (1993). Spatial correlation in ecological analysis. International Journal of Epidemiology 22, 1193–1202.
CRESSIE, N. A. C. (1993). Statistics for Spatial Data. New York: Wiley.
DANIELS, M. J. AND GATSONIS, C. (1997). Hierarchical polytomous regression models with applications to health services research. Statistics in Medicine 16, 2311–2325.
EXCOFFIER, L. AND SLATKIN, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution 12, 921–927.
FAREWELL, V. T. (1982). The estimation of gene frequency, based on a particular leucocyte culture experiment. Biometrics 38, 769–775.
GELFAND, A. E. AND GHOSH, S. K. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1–11.
GELFAND, A. E., GHOSH, S. K., KNIGHT, J. AND SIRMANS, C. F. (1998). Spatio-temporal modelling of residential sales markets. Journal of Business and Economic Statistics 16, 312–331.
GELFAND, A. E. AND SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409.
GELMAN, A. AND RUBIN, D. B. (1992). Inference from iterative simulations using multiple sequences (with discussion). Statistical Science 7, 457–511.
HILL, A. V. S. et al. (1991). Common West African HLA antigens are associated with protection from severe malaria. Nature 352, 595–600.
HOBERT, J. P. AND CASELLA, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association 91, 1461–1473.
LONG, J. C., WILLIAMS, R. C. AND URBANEK, M. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56, 799–810.
LONJOU, C., CLAYTON, J., CAMBON-THOMSEN, A. AND RAFFOUX, C. (1995). HLA-A, -B, -DR haplotype frequencies in France - implications for recruitment of potential bone marrow donors. Transplantation 60, 375–383.
MARDIA, K. V. (1988). Multi-dimensional multivariate Gaussian Markov random fields with application to image processing. Journal of Multivariate Analysis 24, 265–284.
RICHARDSON, S., MONFORT, C., GREEN, M., DRAPER, G. AND MUIRHEAD, C. (1995). Spatial variation of natural radiation and childhood leukaemia incidence in Great Britain. Statistics in Medicine 14, 2487–2501.
SERJEANTSON, S. W. (1989). HLA genes and antigens. In The Colonization of the Pacific, A Genetic Trail, eds Hill, A. V. S. and Serjeantson, S. W., pp. 120–173. Oxford: Oxford University Press.
SMITH, T., BHATIA, K., PRASAD, M., KOKI, G. AND ALPERS, M. (1994). Altitude, language, and the frequencies of class I HLA alleles in Papua New Guinea. American Journal of Physical Anthropology 95, 155–168.
SPECKMAN, P. L., LEE, J. AND SUN, D. (1999). Existence of the MLE and propriety of posteriors for a general multinomial choice model.
Technical Report, Department of Statistics, University of Missouri, Columbia, MO.
WALLER, L. A., CARLIN, B. P., XIA, H. AND GELFAND, A. (1997). Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association 92, 607–617.

[Received June 23, 1999; revised October 27, 1999; accepted for publication November 1, 1999]