Pθ - Department of Statistics

The Gibbs sampler on extended pedigrees:
Monte Carlo methods for the genetic analysis of complex traits
E. A. Thompson (1) and E. M. Wijsman (2)
(1) Department of Statistics, GN-22, University of Washington,
(2) Division of Medical Genetics, GG-20, University of Washington,
and
(1), (2) Department of Biostatistics, SC-32, University of Washington.
September, 1990
Abstract
This research plan is being produced as a technical report, first, because it summarises much preliminary work that has not yet been published, and second, and more importantly, because it provides
the methodological formulation for use of the Gibbs sampler in the Monte Carlo evaluation of likelihoods for complex genetic models. These methodological foundations have been presented in several
seminars and at meetings, March - June 1990.
The long term objective of this project is to develop computational and statistical techniques for
the study design and the analysis of complex familial diseases. This will aid in the identification of
genes which contribute to disease susceptibility for complex diseases such as heart disease. The specific
aims include development of methods for the computation and maximisation of likelihoods of genetic
models in order to fit models for complex genetic traits. These methods will provide the means of
analysing multilocus Mendelian models on extended complex pedigrees, for polygenic models with
genetic and environmental complexities, and for mixed models, including linkage analysis with multiple
marker loci. This project will also investigate the power of studies combining segregation and linkage
analysis in the analysis of complex traits, and will investigate efficiencies of alternative sampling
designs.
The approach employs simulation techniques based on the Gibbs sampler, a novel Monte Carlo
technique now becoming widely used in Statistics. This technique provides samples from the distribution of genetic effects, conditional on observed phenotypic data, and thence facilitates efficient Monte
Carlo estimation of likelihood ratios. Programs will be developed in order to perform several forms of
data analysis, and to evaluate the performance of these techniques. Several datasets are available for
use, including a large collection of pedigrees collected through the collaboration of E.M.W. in a study
of the genetics and epidemiology of heart disease. Analyses include applications of Monte Carlo techniques to 1) determine multilocus haplotypes segregating in large complex genealogies, 2) fit
polygenic models to quantitative measurements, such as Apolipoprotein B levels, associated with risk
for heart disease, 3) fit multiple threshold models to data on rank-ordered categorical traits, and 4)
perform segregation and linkage analysis of risk factors for heart disease. The performance of methods
can be evaluated by the analysis of data simulated under models which include both polygenic and
Mendelian effects which are associated with linked markers.
Acknowledgement
Contributions to the work described in this report have been made by several students: Charlie
Geyer has contributed to many useful discussions on the Monte Carlo evaluation of likelihood ratios:
Nuala Sheehan has implemented the Gibbs sampler for simple Mendelian traits, and her "neighbourhood structure" pedigree format which facilitates conditional local computations and sampling has been
used also in other programs: Sun Wei Guo has implemented the Gibbs sampler on pedigrees for the
additive genetic model, and for other more complex random effects models, and has used these in his
programs for Monte Carlo EM on pedigrees.
TABLE OF CONTENTS
A. SPECIFIC AIMS
B. SIGNIFICANCE AND BACKGROUND
B.1 Introduction
B.2 Coronary heart disease as a model of complex disease
B.3 The genetic approach
B.4 History and limitations of the genetic approach
B.4.1 Mendelian loci
B.4.2 Traits involving genes at more than one locus
B.4.3 Limitations of the genetic approach for complex traits
      Linkage analysis techniques
      Identification of modes of inheritance
B.4.4 Combined analyses
B.5 Computation of likelihoods on extended pedigrees
B.6 Simulation methods in genetic epidemiology
B.7 The Gibbs sampler on pedigrees
C. PRELIMINARY STUDIES
C.1 The need for complex models; the Gibbs sampler approach
C.2 Haplotype analysis
C.3 Quantitative genetic studies on pedigrees
C.3.1 Likelihood evaluation by peeling
C.3.2 Likelihood maximisation and the EM algorithm
C.4 Data sets
D. PROPOSED RESEARCH
D.1 Foundations
D.1.1 Overview
D.1.2 The Gibbs sampler on pedigrees
D.1.3 Monte Carlo evaluation of likelihood ratios
D.2 Mendelian models
D.2.1 Ancestral genotypes
D.2.2 Ancestral haplotypes; the Gibbs sampler solution
D.2.3 Peeling and the Gibbs sampler in combination
D.2.4 Likelihoods for Mendelian models; linkage analysis
D.3 Polygenic models
D.3.1 Likelihood evaluation for polygenic models
D.3.2 Maximising the likelihood; a Monte Carlo version of the EM algorithm
D.3.3 Polygenic models; evaluation of methods
D.4 Threshold models
D.4.1 Categorical traits; multiple threshold models
D.4.2 Monte Carlo importance sampling for likelihoods for categorical traits
D.5 Segregation and linkage analysis for complex genetic traits
D.5.1 The mixed model for complex traits
D.5.2 Linkage analysis for complex traits
D.6 Pedigree and sampling design
D.6.1 Power of a study
D.6.2 Sequential design; use of the Gibbs sampler
D.7 Program and software availability
RESEARCH PLAN
A. SPECIFIC AIMS
Long-term objective:
The overall objective is to develop techniques for the analysis of the genetic basis of complex
familial traits. There is a need for techniques for likelihood analysis of complex models that combine
linkage and segregation analysis. Only then can we address questions about traits, such as coronary
heart disease, that are influenced by environmental factors as well as by genes at several loci. Additionally, new computer technology has made timely a Monte Carlo approach to problems of likelihood evaluation and maximisation on complex data structures. Using complex genetic models we propose to
develop methods of likelihood analysis of phenotypic data observed on members of a pedigree. This will
aid in the identification of markers which are linked to loci which contribute to the determination of
complex genetic traits.
Specific objectives:
1. To develop Monte Carlo computational methods for the computation of likelihoods on extended
and/or complex pedigrees:
(a) for multilocus Mendelian models on extended complex pedigrees,
(b) for polygenic models with multiple random effects for quantitative traits, adding genetic and environmental complexities beyond those encompassed by an additive genetic model,
(c) for threshold models with multiple thresholds, for rank-ordered categorical traits, and
(d) for combining (a) - (c) to compute likelihoods for complex traits under mixed models, including
linkage analysis with multiple marker loci.
2. To develop methods for maximising likelihoods and fitting models for complex genetic traits, using
the same sequence of component models as in specific aims #1(a) - #1(d).
3. To evaluate the procedures of specific aims #1 and #2 by:
(a) analysing ancestry of multilocus haplotypes in the large complex genealogies of a genetic isolate,
(b) analysing Apolipoprotein B levels, and other quantitative measurements, both separately and jointly,
under multiple-effects polygenic models,
(c) analysing Alzheimer’s disease data and low-density lipoprotein measures and undertaking simulation
studies to assess the performance of multiple threshold models for rank-ordered categorical traits,
(d) segregation and linkage analysis of Apolipoprotein B levels, and other heart disease traits, incorporating genetic effects due to other loci, under complex models for data on extended pedigrees, and
(e) analysis of data simulated under complex models, involving both polygenic and Mendelian effects,
and associated with linked markers.
4. To develop a coherent set of computer programs, within a single pedigree analysis computing environment, for items #1, #2 and #3 above.
5. To investigate the statistical power of studies combining segregation and linkage analysis in the analysis of complex genetic traits, and, to the extent that time allows, to investigate efficiencies of alternative
sampling designs and pedigree structures.
B. SIGNIFICANCE AND BACKGROUND
B.1 Introduction
The risk for developing many common diseases is believed to be influenced by both genetic and
environmental factors. Examples include coronary heart disease, hypertension, several forms of mental
illness, and some cancers. Because multiple factors may play important roles in the susceptibility of
individuals to develop such diseases, these are often considered to be complex diseases. While the data
necessary to understand different complex traits is disease-specific, the underlying principles of analysis
of the genetic component are applicable to a variety of diseases. To facilitate discussion of some of the
problems involved in the dissection of complex traits, we will concentrate our examples on the problem
of identifying genetic factors which influence susceptibility to develop coronary heart disease. Our proposed research is also focused on this complex disease.
To aid the reviewers a detailed list of contents is supplied as Appendix A. Since this proposal
requires citations from the three major areas of modern Monte Carlo methods in Statistics, Coronary
Heart Disease, and segregation and linkage analysis in Genetic Epidemiology, it was impossible to
include paper titles in the Literature Cited section. For the convenience of primary reviewers a version
of Literature Cited that includes paper titles is included as Appendix B.
B.2 Coronary heart disease as a model of complex disease
Coronary heart disease (CHD) and quantitative risk factors for CHD cluster in families (Neufeld
and Goldbourt, 1983; Robertson, 1981; Berg, 1989; Hopkins and Williams, 1989; Berg, 1987; Sing and
Orr, 1978; Sosenko et al., 1980; Namboodiri et al., 1984; Goldstein et al., 1973; Brunzell et al., 1976). It
has been suggested that the aggregation of CHD in families may reflect the clustering of these quantitative risk factors including blood cholesterol, lipid, and apolipoprotein levels (Perkins, 1986).
Familial clustering can occur because of shared environment, shared genes, or both. Clustering
in family members is difficult to explain in the absence of genetic risk factors, especially in the absence
of a highly influential environmental risk factor (Khoury et al., 1988). Low correlations between
spouses for quantitative risk factors for CHD (e.g. for high density lipoprotein levels) can be interpreted
as evidence against a strong marital environmental effect and as evidence that familial correlations
reflect genetic effects. In addition, the presence of spousal correlations does not imply the absence of
genetic factors. In general, demonstration of genetic susceptibility must be obtained in a more convincing fashion than simple demonstration of familial correlations, if only because identification of a correlation does not explain the underlying cause of the correlation.
Various techniques have been used to demonstrate that genetic factors influence susceptibility for
developing CHD. Path analysis has been used to demonstrate a genetic component for total cholesterol
(CH), triglyceride levels (TG), very low density cholesterol levels (VLDL), low density cholesterol
(LDL), and high density cholesterol (HDL) (Rao et al., 1979). Twin studies have demonstrated
increased concordance among male monozygotic twins vs. dizygotic twins for CHD, suggestive of
genetic influences (Feinleib et al., 1977). Twin studies have shown increased concordance among
female twins for LDL, HDL and TG levels (Austin et al., 1987), but not for male twins (Feinleib et al.,
1977), possibly indicative of sex effects. Segregation analyses have indicated genetic effects for quantitative risk factors for CHD including Apolipoprotein B levels (ApoB), HDL-C levels (Amos et al., 1987;
Hasstedt et al., 1987) and LDL subclass patterns (Austin et al., 1988, 1990).
Environmental factors also influence CHD risk and levels of quantitative risk factors. For example, path analysis indicates that LDL, HDL, and possibly TG levels are influenced by common familial
environments (Rao et al., 1979). Monozygotic twins are not completely concordant for CHD (Feinleib
et al., 1977) or for lipid levels (Austin et al., 1987). This provides additional evidence that environmental factors influence these traits. Studies of CHD in Japanese populations demonstrate an increase in
CHD in Japanese Americans as compared to native Japanese, consistent with environmental effects on
CHD (Kagan et al., 1974).
Genetic and environmental risk factors may be statistically confounded. For example, in baboons
the difference in low density lipoprotein cholesterol (LDL-C) levels as a function of diet shows
significant genotype effects for genotypes at the low density lipoprotein receptor (Blangero et al., 1989).
The difference in risk for CHD as a function of age and sex may also be considered to be an example of
gene-environment interaction. In general, it would not be surprising to find that differential response of
genotypes to environmental risk factors is an important factor in complex diseases, although such interactions are not currently considered in most analyses.
Despite the strong evidence for a genetic component to heart disease, few genes have been identified which clearly influence the risk of developing CHD. A few rare, single locus diseases are known to
influence the risk for developing CHD. The most well understood of these single gene defects is familial hypercholesterolemia (FH) which is known to be inherited as a single gene defect in the LDL receptor (Brown and Goldstein, 1986). However, the incidence of heterozygotes in the population for this
defect is only about 1 in 500 (Goldstein et al., 1973), which is far too rare to explain common disease.
Other rare disorders have also been described: familial defective Apo B-100 (Innerarity et al., 1987),
mutations in the lipoprotein lipase gene (Langlois et al., 1989), and sitosterolemia (Beaty et al., 1986),
none of which is common enough to account for the bulk of familial CHD.
A few loci show associations with CHD or quantitative risk factors for CHD in the general population. These include Apolipoproteins B and E, possibly Lipoprotein(a), and a putative locus controlling
LDL subclass patterns (Austin et al., 1988). Lipoprotein(a) genotypes modify the risk to carriers of FH
in a genotype dependent fashion (Utermann, 1989), but it is not clear that these genotypes also modify the
risk of CHD in a normolipidemic population. Mean levels of total cholesterol differ among genotypes at
the ApoE locus (Sing and Davignon, 1988). In a variety of populations of different ethnic background,
the E2 allele has been consistently associated with lower levels of plasma cholesterol, and the E4 allele
with raised levels (Robertson and Cumming, 1985). However, studies based on patients with CHD vs.
controls have not found a consistent association between disease status and ApoE genotype (Davignon
et al., 1988; Sing and Moll, 1989). Associations between alleles at the ApoB locus and CHD status are
sometimes (Deeb et al., 1990; Hegele et al., 1986) but not always (Ferns and Dalton, 1986) observed.
B.3 The genetic approach.
The identification of genes which influence the development of diseases provides a powerful tool
for the study of disease. This identification of genes and their functions can be done through one of two
mechanisms: through study of loci which are suspected to play a role in the disease (candidate loci), or
through identification of previously unknown loci.
When a candidate locus is hypothesised to influence the disease, various studies may be used to
either confirm or reject this hypothesis. Often initial studies are association studies and do not use pedigree information. Allele frequencies in patients and controls or mean trait values as a function of the
genotype at the candidate locus may be compared (Sing and Davignon, 1985). However, association
studies are sensitive to any cause for underlying statistical associations including a number of causes
which have nothing to do with genetic linkage. Fortunately, use of pedigree information which demonstrates cosegregation of the disease and candidate locus is much less sensitive to associations caused by
phenomena other than linkage. Therefore, demonstration of presence (or absence) of cosegregation of
the disease and candidate loci in pedigrees provides much more secure evidence for (or against) the
hypothesis of candidate gene involvement than does an association study.
Three types of analysis need to be performed in order to identify previously unknown genes
which influence predisposition to disease. First, a genetic model must be developed which at least
approximates the mode or modes of inheritance of the disease. Second, one or more of the relevant
genes must be localised (mapped) to a chromosomal region. This is generally done by showing that
there is non-random inheritance of the trait in question with genetic markers of known location. Finally,
the gene must be isolated (cloned) from this region. In this proposal, we will concern ourselves only
with the first two of these three steps.
B.4 History and limitations of the genetic approach.
B.4.1. Mendelian Loci
The past decade has seen enormous success in the area of mapping single-locus traits. Notable
successes are the mapping of the Huntington’s locus, cystic fibrosis (CF), neurofibromatosis (NF1 and
NF2), and Duchenne’s muscular dystrophy (DM). The genes for CF (Riordan et al., 1989) and DM
(Monaco et al., 1985; Kunkel et al., 1985) have now been cloned, proving that it is not necessary to
know the function or structure of a gene in order to map and identify it. Most recently, the gene for NF1
has been not only cloned, but shown to be highly homologous to a protein under study in developmental
biology (Xu et al., 1990). As of a year ago, over 5100 genetic markers were mapped, of which 1631
were distinct genes with either known function or which caused a genetic disease (McAlpine et al.,
1989). More than 95% of these markers were mapped in the last decade.
The successful mapping of this large number of loci can be attributed to several factors. 1) The
rapid expansion of the human genetic map provided ample marker data for localising the disease loci. 2)
The well established modes of inheritance of the disorders being mapped maximised the power to detect
linkage. 3) The availability of a powerful technique of statistical analysis (Morton, 1955) which was
developed and implemented in a computer program, LIPED (Ott, 1976) provided a tool which was
available when massive amounts of data started to be collected. 4) Finally, an understanding of when the
analysis techniques are insensitive to approximations allowed resources to be channelled into the most
important aspect of the analysis.
B.4.2. Traits involving genes at more than one locus
A genetically complex trait is one for which there is more than one interacting genetic component and/or genetically influenced risk factor. This may include the relatively simple case where two
or more loci each can individually produce the same phenotype (genetic heterogeneity). Or it may
describe the situation where many genes interact with each other and with different environments to produce manifestation(s) of the trait. Loci may act independently, additively, or in some more complex
fashion. Many complex traits are common in the population.
Complex traits usually have phenotypes which do not show a one-to-one correspondence with
single locus genotypes. For example, many complex traits are quantitative or have quantitative indicators for diagnosis. Others may be multivariate phenotypes measured on a number of scales, some quantitative and some discrete. Other complex traits are discrete. Discrete traits may include not only binary
traits (affected/normal) but also categorical data with more than two possibilities (e.g., affected, probably affected, and normal). There may be age and/or sex-dependent penetrance of phenotypes as a function of underlying genotypes. These traits are also often influenced not only by random environmental
factors but also by individual and family environmental factors. As a result, models which use standard
quantitative genetic theory as developed in the context of more controlled plant and animal environments may not give reliable or interpretable results when applied to complex traits in humans.
Genes involved in complex traits have not yielded to linkage analyses with the success rate found
for Mendelian traits. The few exceptions to this statement are a small number of discrete, rare, diseases
which are highly penetrant but genetically heterogeneous. More commonly, initial reports of successful
linkage analyses of diseases which are familial but which are not simple genetic diseases have failed to
be reproduced in independent studies. One of the most impressive examples of this is the study on
manic depression in the Amish: the initial report of significant evidence for linkage (Egeland et al.,
1988) was followed by an extension to the pedigree which eliminated any evidence for genetic linkage
(Kelso et al., 1989). Results of linkage analyses may often be insecure due to problems of diagnosis,
multivariate phenotype, model inaccuracy, and techniques which are not appropriate for detecting linkage when the underlying genetic mechanism is not simple.
B.4.3. Limitations of genetic approach for complex traits
Limitations in the available techniques of analysis undoubtedly contribute to the failure to identify genes contributing to complex traits. These limitations exist for two reasons. First, there are severe
practical problems with performing computations for complex genetic models. Second, most of the
techniques of genetic analysis in human genetics have been designed around discrete single-locus disorders. Problems with implementation, power, and interpretation of results of analyses exist when these
techniques are used for analysis of complex disorders.
Linkage analysis techniques.
The most common technique of linkage analysis is the method of lod scores (Morton, 1955).
This method can efficiently provide both positive and negative evidence of linkage for single-locus traits
(Morton, 1955), and maximises the use of data by performing likelihood computations on data from
extended pedigrees. The methods are based on the assumption that exactly one locus exists for the trait
and that the penetrance of the trait is known.
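The lod score can be illustrated in its simplest textbook form. The sketch below assumes n phase-known, fully informative meioses of which k are recombinant, so that the likelihood is proportional to theta^k (1 - theta)^(n - k); this special case, the function names `lod_score` and `max_lod`, and the example counts are assumptions made for illustration and are not taken from the analyses described in this report.

```python
import math

def lod_score(k, n, theta):
    """Lod score Z(theta) = log10[L(theta) / L(1/2)] for k recombinants
    observed in n phase-known, fully informative meioses (illustrative case)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0:
        # theta = 0 is excluded by any observed recombinant
        return float("-inf") if k > 0 else n * math.log10(2.0)
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

def max_lod(k, n):
    """Maximum likelihood estimate of theta (restricted to [0, 0.5]) and its lod score."""
    theta_hat = min(k / n, 0.5)
    return theta_hat, lod_score(k, n, theta_hat)

if __name__ == "__main__":
    # Hypothetical data: 1 recombinant in 10 informative meioses.
    theta_hat, z = max_lod(1, 10)
    print(f"theta_hat = {theta_hat:.2f}, maximum lod = {z:.3f}")
```

On real pedigrees phase is usually unknown and penetrance may be incomplete, so the likelihood must be computed over the whole pedigree rather than by counting recombinants; the sketch only shows the form of the statistic.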
A number of problems exist when the lod score method is used for attempting to map gene(s) for
complex diseases. In the absence of evidence for single-locus inheritance of complex traits, theoretical
justification is lacking for what constitutes significant evidence either in favour of or against linkage. A
trait determined by two epistatically interacting loci may appear to be inherited as a recessive, even
though in fact the disease contribution of the locus linked to a marker is dominant. Such misspecification of the mode of inheritance of a trait locus can lead to a very low probability that a true linkage will
be detected (Greenberg and Hodge, 1989). The mode of inheritance of the trait as indicated by complex
segregation analysis (see below) is not guaranteed to be the same as the mode of inheritance of the susceptibility locus which is linked to the marker(s) being tested. True linkages may therefore be missed
when techniques which are designed for single locus traits are used for identifying multilocus traits.
False rejection of linkage is serious since considerable effort may then be wasted in searching elsewhere
for the gene(s) (Risch, 1990b).
The lod score method of linkage analysis loses power as genotypes are misclassified. This
occurs because of reduced penetrance, genetic heterogeneity, and quantitative variability. Note that
reduced penetrance may be caused by epistatic interactions with unlinked loci as well as from nongenetic events. The loss of power can be substantial. For example, if penetrance of a locus linked to a
genetic marker is only 50%, the sample size necessary to detect linkage is about 4 times that needed if
the penetrance is 100% (Ott, 1985). Genetic heterogeneity has an even greater effect on the power. If
50% of families with a particular disease are segregating for a locus linked to a marker at a recombination rate of 5%, the mean sample size necessary to detect linkage is about 13 times that needed to detect
linkage if the trait is caused by the same locus in all families (Ott, 1985).
Sib-pair (e.g., Suarez and Van Eerdewegh, 1984) and related methods (Weeks and Lange, 1988;
Haseman and Elston, 1972) can also be used to search for genetic linkage. These methods are useful
when penetrance is low, the mode of inheritance of the trait is poorly defined and/or the trait is suspected
to be controlled by more than one locus. Unfortunately, the methods require very large sample sizes
and, with the exception of the affected-pedigree-member method (Weeks and Lange 1988), cannot handle information from extended pedigrees.
Identification of modes of inheritance.
Complex segregation analysis is often used to investigate the mode of inheritance of complex
traits. The commonly used mixed model assumes the observed data can be explained with some combination of one Mendelian "major" locus, background polygenic effects, and random environmental
effects (Morton and MacLean, 1974). Oligogenic models have not been seriously attempted because of
the very large number of parameters which would need to be estimated.
The mixed model is not robust to violations of the underlying assumptions. For example, a simulation which included common familial environmental effects found that up to 40% of simulations
which did not include an underlying major locus gave false evidence for such a locus (Demenais and
Abel, 1989). Failure to correct for ascertainment bias can also lead to spurious results (Hanis and
Chakraborty, 1984). Finally, because the mixed model can only model segregation of one major gene,
the results from such an analysis may indicate a mode of inheritance which approximates the overall
pattern of inheritance of the trait but which does not indicate correctly the mode of inheritance of trait
contributions of individual loci.
B.4.4. Combined analyses
Slightly more complex models may enable successful genetic analysis of complex traits. Linkage
analysis in conjunction with the mixed model offers considerable scope for validation and location of
major genes. Preliminary attempts to use such techniques have been successful on simulated data when
either technique alone failed to succeed (Elston et al., 1989). Simultaneous use of multiple genetic
markers for the purpose of mapping a genetically heterogeneous trait is also much more efficient than
use of individual markers. For two disease loci, the sample size necessary to detect linkage with the use
of multiple markers is as little as one third of that needed when markers are analyzed individually (Lander and Botstein, 1986).
Analysis via a mixed model will also require large amounts of data. However, by first seeking one of
the largest major-gene components, and summarising other effects into a polygenic component, we can hope,
in conjunction with linkage analysis, to extract the effects of a major gene. In principle, with sufficient
data, the process could be repeated on the residual variation.
B.5. Computation of likelihoods on extended pedigrees
In fitting genetic models on extended pedigrees, the standard computational method has been that
of "peeling". Peeling algorithms originated with Hilden (1970), Elston and Stewart (1971) and Heuch
and Li (1972), and provide tractable methods for computing likelihoods on extended pedigrees. The
methods have been extended to complex pedigrees (Lange and Elston, 1975; Cannings, Thompson and
Skolnick, 1978). (A complex pedigree is one that contains inbreeding loops, multiple marriages
between two or more families, etc.) The peeling algorithm has also been extended to more complex
models (Cannings, Thompson and Skolnick, 1980), and has been used in the genetic epidemiological
analysis of complex genetic traits on extended pedigrees (Meyers et al., 1982; Bishop et al., 1988; Cannon-Albright et al., 1988). As described by Elston and Stewart (1971) and detailed by Ott (1979), the
peeling methodology is applicable also to polygenic models with multiple random effects. However,
peeling for models with common environmental effects, maternal effects, or dominance effects has not
been implemented in the available computer packages. In preliminary work we have implemented a
general random effects peeling algorithm for computation of polygenic likelihoods on extended, but
not yet complex, pedigrees (see C.3.1).
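The idea behind peeling can be sketched on a single nuclear family. The example below is a minimal illustration, not the random effects algorithm or any package referred to above: it assumes one diallelic Mendelian locus, founder genotypes in Hardy-Weinberg proportions, and a hypothetical penetrance function. Conditional on the parental genotypes the children are independent, so each child is summed out ("peeled") separately instead of enumerating all joint genotype configurations.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of copies of allele A (assumed diallelic locus)

def hardy_weinberg(q):
    """Founder genotype probabilities for allele frequency q (assumption)."""
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

def transmission(gc, gf, gm):
    """Mendelian probability of child genotype gc given parental genotypes."""
    tf, tm = gf / 2.0, gm / 2.0  # chance each parent transmits allele A
    return {0: (1 - tf) * (1 - tm),
            1: tf * (1 - tm) + (1 - tf) * tm,
            2: tf * tm}[gc]

def nuclear_family_likelihood(q, penetrance, father, mother, children):
    """Likelihood of the phenotypes on one nuclear family.

    penetrance(y, g) is P(phenotype y | genotype g); a phenotype of None
    means the individual is unobserved and contributes a factor of 1.
    """
    prior = hardy_weinberg(q)

    def pen(y, g):
        return 1.0 if y is None else penetrance(y, g)

    total = 0.0
    for gf, gm in product(GENOTYPES, repeat=2):
        term = prior[gf] * pen(father, gf) * prior[gm] * pen(mother, gm)
        for y in children:
            # peel each child: sum over its genotype given the parents
            term *= sum(transmission(gc, gf, gm) * pen(y, gc) for gc in GENOTYPES)
        total += term
    return total

def dominant_penetrance(y, g):
    """Hypothetical penetrances: P(affected) = 0.8 for carriers, 0.05 otherwise."""
    p_affected = 0.8 if g >= 1 else 0.05
    return p_affected if y == 1 else 1.0 - p_affected

if __name__ == "__main__":
    # Hypothetical family: affected father, unaffected mother, three children.
    print(nuclear_family_likelihood(0.01, dominant_penetrance, 1, 0, [1, 0, 1]))
```

The work per family grows linearly with the number of children, whereas direct enumeration grows exponentially; on an extended pedigree the same summation is applied family by family, passing the accumulated function of genotype to the connecting individual.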
However, peeling has its limitations. By its nature, it cannot be applied to models with both fixed
and random heritable effects, although various approximations have been proposed, and implemented in
packages (Lalouel et al., 1983; Hasstedt and Cartwright, 1979; Hasstedt 1982). Hasstedt’s method of
approximation of mixed-model likelihoods on extended pedigrees is to replace the set of polygenic contributions at each stage in peeling by a set of just three components. Each of these is peeled at the next
stage, proliferating more components in accordance with the major genotype combinations on the current nuclear family, and the reduction/approximation process repeated. Hasstedt (1982) showed the
method works well on small pedigrees, and when there are few possible major-genotype configurations
on the pedigree. However, our preliminary work (see C.1) has led us to conclude that such deterministic approximations need further investigation, and has led us to Monte Carlo methods both as an alternative evaluation procedure, and as a procedure for assessing the adequacy of current methods.
Moreover, even with increasing accessibility of faster computers, computational feasibility limits
the complexities of the pedigrees and of the genetic models we can consider. These limitations have
led to alternative approaches for genetically complex traits, such as the regressive models of Bonney
(1986). While successful in the important task of building covariate effects into genetic analyses, these
models also have their limitations: 1) the underlying genetics of the model, as summarised in marginal
and joint phenotypic distributions, is dependent upon the set of individuals observed, and 2) the individual models assume specific covariance structures among pedigree members.
B.6. Simulation methods in genetic epidemiology
The increasing speed of computers opens the way for very different approaches to the computational problems of genetic analysis. Simulation methods on pedigrees have attracted interest for a long
time: Wright and McPhee (1925) estimated inbreeding by random choices in tracing ancestral paths.
More recently, "gene drop" simulation methods have been used in the analysis of gene extinction (MacCluer et al., 1986) and studies of the performance of segregation and linkage analysis procedures
(Thompson et al., 1978; Boehnke, 1986). Lange and Matthysse (1989) have proposed using the
Metropolis algorithm (Metropolis et al., 1953) to provide a novel simulation approach to the estimation
of the potential power of a linkage study. In a very different genetic problem, we have used a Metropolis algorithm simulation method in the analysis of sparse and large contingency tables (Guo and Thompson, 1990).
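A "gene drop" simulation of the kind mentioned above can be sketched in a few lines. The pedigree format (a list of individual, father, mother triples with parents listed before offspring), the function names, and the small example pedigree are all hypothetical choices for this illustration. Each founder receives two distinct gene labels, and each descendant receives one gene chosen at random from each parent.

```python
import random

def gene_drop(pedigree, rng):
    """Drop labelled founder genes down a pedigree.

    pedigree: list of (individual, father, mother), parents before offspring;
    founders have father = mother = None. Returns {individual: (gene, gene)}.
    """
    genes = {}
    next_label = 0
    for ind, father, mother in pedigree:
        if father is None and mother is None:
            # founder: two distinct gene labels
            genes[ind] = (next_label, next_label + 1)
            next_label += 2
        else:
            # Mendelian segregation: one gene at random from each parent
            genes[ind] = (rng.choice(genes[father]), rng.choice(genes[mother]))
    return genes

def survival_frequency(pedigree, founder_gene, target, n_reps, seed=0):
    """Monte Carlo estimate of P(target individual carries founder_gene)."""
    rng = random.Random(seed)
    hits = sum(founder_gene in gene_drop(pedigree, rng)[target]
               for _ in range(n_reps))
    return hits / n_reps

# Hypothetical pedigree: grandparents, their child, the child's spouse, a grandchild.
PEDIGREE = [
    ("grandfather", None, None),
    ("grandmother", None, None),
    ("spouse", None, None),
    ("child", "grandfather", "grandmother"),
    ("grandchild", "child", "spouse"),
]

if __name__ == "__main__":
    # Gene 0 is the grandfather's first gene; the exact probability is 1/4.
    print(survival_frequency(PEDIGREE, 0, "grandchild", 20000, seed=1))
```

Because the simulation runs forwards from the founders, it cannot be conditioned on phenotypes observed lower in the pedigree except by discarding realisations that disagree with the data.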
The availability and accessibility of computers are also increasing, creating a demand for user-friendly interactive computing environments for genetic analysis. For example, expert system
approaches are becoming more widely investigated (Prokosch et al., 1989), and there is an increasing
number of packages for linkage and for segregation analysis. However, there remains a dearth of
methodology and software for the analysis of complex genetic models. With the advent of faster computers, Monte Carlo methods of analysis become a realistic and efficient proposition for likelihood analysis of complex models for data on large or complex pedigrees. Monte Carlo methods are simulation
methods where random variables may not be simulated directly from the process under consideration as
in "gene drop" or in most studies of expected power. The aim is the evaluation of a deterministic function; realisations are taken from any distribution that facilitates reliable and accurate evaluation of this
function. The paradigm example is importance sampling for the evaluation of deterministic integrals
(Hammersley and Handscombe, 1964); the key estimation equations of this proposal (section D.1.3)
may be viewed as a form of importance sampling.
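As a toy illustration of importance sampling for a deterministic integral, the following minimal sketch estimates the integral of exp(-x^4) over the real line from standard normal draws. The integrand, the proposal distribution, the sample size and the function name are all assumptions chosen for illustration; none of them comes from the proposal itself.

```python
import math
import random


def importance_sampling_integral(n_draws=20000, seed=0):
    """Estimate I = integral of exp(-x^4) over the real line using N(0, 1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(0.0, 1.0)                       # realisation from the proposal
        target = math.exp(-x ** 4)                    # deterministic integrand
        proposal = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += target / proposal                    # importance weight
    return total / n_draws


# Closed form for comparison: the integral equals 2 * Gamma(5/4).
EXACT = 2.0 * math.gamma(1.25)

if __name__ == "__main__":
    print(importance_sampling_integral(), EXACT)
```

The realisations are not drawn from any "process under consideration"; the normal distribution is used only because it is easy to sample and gives bounded weights for this integrand.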
B.7. The Gibbs sampler on pedigrees
A new method of simulation, the Gibbs sampler, is now becoming widely used in image analysis
(Geman and Geman, 1984) and in other areas of statistics (Gelfand and Smith, 1990). This method provides a solution to a major drawback to the use of simulation methods in pedigree analysis; namely, the
inability to simulate "backwards" from observed data. Simulation forwards from the founders ("gene
drop") is straightforward, but produces data in accordance with observations in only a minute proportion
of realisations. Related to the Metropolis algorithm (Metropolis et al., 1953; Hastings, 1970), which
recently also has been used in a genetic context (Lange and Matthysse, 1989), the Gibbs sampler produces realisations from the conditional distribution of genotype given phenotype, and can do so on large
and complex pedigrees. Although applications to date have been limited to simple Mendelian loci
(Sheehan, 1989), the method is also applicable to linked haplotypes, polygenic effects, and thence to
complex genetic models. Preliminary work on implementing and investigating the performance of the
Gibbs sampler on pedigrees is described in section C. As detailed in section D, the Gibbs sampler provides an approach to the evaluation and maximisation of likelihoods for complex genetic models, and to
the analysis of multilocus haplotypes underlying a possibly complex genetic trait.
C. PRELIMINARY STUDIES
C.1 The need for complex models: the Gibbs sampler approach
In a model for a complex genetic trait we may wish to include both Mendelian major loci (heritable fixed effects) and polygenic effects (heritable random effects). For this, we need a mixed model.
Analyses of simulated data indicate that segregation analysis performed in conjunction with linkage
analysis makes it possible to determine the modes of trait determination provided by susceptibility loci
(Elston et al., 1989). Failure to use linked marker information makes the task of identifying linked
markers and modes of inheritance very difficult. Thus, the future of segregation analysis is in conjunction with linkage analysis (Elston et al., 1989): we need to incorporate data on potentially linked markers. In order for likelihood analysis of genetic models to be successful, we need models that can at least
approximate the true pattern of effects segregating in pedigrees. For complex traits, we shall need complex models.
Estimates of the mixed model cannot be obtained by using peeling algorithms (see section B.5).
Moreover, the likelihood under a mixed model, with a contribution to the quantitative trait from a major
gene as well as a polygenic component, is a mixture of likelihood contributions of the polygenic form,
one corresponding to each possible major-genotype combination over the entire pedigree. There can be
many billion such combinations. Deterministic approximations to mixed model likelihoods have been
proposed (see section B.5), but can be checked only on small pedigrees, or on pedigrees on which there
are few contributing major-genotype combinations. In preliminary work on large pedigrees we found
the approximations sensitive to the way in which different components were combined. More seriously,
the approximations were sensitive to the number of components retained at each stage of peeling. As
more components are retained, the approximation should improve, but we found putative likelihoods
continuing to change.
Our initial Monte Carlo studies were designed as a procedure to assess the adequacy of current
approximation methods because we had reached the limits of exact computation. However, we quickly
recognised that these methods provide an alternative and potentially far more efficient and reliable
approach to the evaluation of likelihoods for complex models.
An essential component of a Monte-Carlo approach to the likelihood analysis of complex genetic
models is the Gibbs sampler. Even though our preliminary work does not directly address the mixed
model, we have shown how the Gibbs sampler can be applied to models involving Mendelian genes,
polygenic effects and multilocus haplotypes. Each of these applications is useful in its own right, as
described in sections C.2 and C.3, and the related work proposed in section D. In addition, each is a
necessary component of the tool kit needed for the application of the Gibbs sampler to Monte Carlo
likelihood analysis of complex genetic models.
The Gibbs sampler (Hastings, 1970; Geman and Geman, 1984) is becoming widely used in several statistical settings (Gelfand and Smith, 1990). Preliminary work (Sheehan, Possolo, and Thompson,
1989) has shown how the algorithm provides a method for sampling from the posterior distribution of
genotypes on a pedigree, conditional on observed phenotypes. We shall show (section D.5) that this
ability to sample from this conditional distribution opens the way for a Monte Carlo approach to the
computational problems of fitting complex genetic models to data on extended pedigrees, and thus a
whole multitude of algorithms for genetic analysis of complex genetic traits.
The idea of the Gibbs sampler relies on the same conditional independence structure of genetic
models that is the basis of other methods of computation. Specifically, the conditional probability of an
individual’s genotype, given phenotypic information on some subset of the pedigree, and on the
genotypes of all other individuals, depends only on his own phenotype (if observed) and the genotypes
of his parents (if present in the genealogy), his offspring (if any) and his offspring’s other parent (his
"spouses"). The Gibbs sampler on pedigrees is thus specified as follows:
a) Find any genotypic configuration over the pedigree that is consistent with phenotypic observations.
b) Take a random permutation of all individuals in the pedigree.
c) For each individual, in the order indicated by (b), update the genotypic configuration by replacing the
genotype of the individual by a realisation from the local conditional distribution given his own phenotype, and current genotypic estimates for his parents, spouses and offspring.
d) Repeat (b) and (c) for as long as required, accumulating/storing the configuration achieved (for example) after the 50th scan, and every subsequent 5th scan of the pedigree.
The general theoretical justification for this procedure was developed by Hastings (1970) and Geman
and Geman (1984). It has been shown to produce realisations from the global posterior distribution of
genotype given phenotype on a pedigree (Sheehan, 1989). Sheehan (1989,1990) has investigated the
determination of initial configurations, the advantages of random scans versus fixed scans or completely
random choice, and other aspects of the procedure. With regard to (d), note that the procedure must be
run "long enough" to "forget" the initial configuration before the configuration will be a reasonable realisation from the posterior distribution, and that adjacent realisations will be highly dependent. However,
convergence of the distribution occurs very quickly, dependence between realisations decays rapidly,
and for most purposes dependence of the realisations utilised does not cause difficulties (Geyer, 1990).
The general choice of numbers 50 and 5 of (d) requires investigation, but this choice has worked well in
preliminary studies of several applications of the Gibbs sampler on pedigrees. In section D, we shall
show how the Gibbs sampler may be used to address questions in the analysis of complex genetic traits;
the preliminary work on simpler models has established the feasibility of the approach. [Geyer and
Sheehan were both students of Thompson; their Ph. D. theses (University of Washington, 1990) detail
much relevant preliminary work.]
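Steps (a) to (d) of section C.1 can be sketched for a single diallelic locus as follows. This is a minimal illustration under stated assumptions, not the implementation of Sheehan (1989): the allele frequency `Q`, the penetrances `P_AFFECTED`, the five-member pedigree and all function names are hypothetical. The penetrances are taken strictly between 0 and 1, so that the all-heterozygous starting configuration is consistent with any phenotypes and every configuration remains reachable.

```python
import random

GENOTYPES = (0, 1, 2)            # copies of the disease allele at a diallelic locus
Q = 0.3                          # assumed allele frequency (illustrative)
P_AFFECTED = (0.05, 0.10, 0.90)  # assumed penetrances, all strictly inside (0, 1)


def founder_prior(g):
    """Hardy-Weinberg genotype probabilities for a founder."""
    return ((1 - Q) ** 2, 2 * Q * (1 - Q), Q ** 2)[g]


def segregation(child, mother, father):
    """Mendelian P(child genotype | parental genotypes)."""
    pm, pf = mother / 2.0, father / 2.0   # chance each parent transmits the allele
    return ((1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf)[child]


def penetrance(y, g):
    """P(phenotype y | genotype g); y is 1 (affected), 0 (unaffected) or None."""
    if y is None:
        return 1.0
    return P_AFFECTED[g] if y == 1 else 1.0 - P_AFFECTED[g]


def local_weights(i, config, parents, children, pheno):
    """Unnormalised local conditional of i given own phenotype, parents, spouses, offspring."""
    weights = []
    for g in GENOTYPES:
        w = penetrance(pheno[i], g)
        if parents[i] is None:
            w *= founder_prior(g)
        else:
            m, f = parents[i]
            w *= segregation(g, config[m], config[f])
        for c in children[i]:
            m, f = parents[c]
            w *= segregation(config[c],
                             g if m == i else config[m],
                             g if f == i else config[f])
        weights.append(w)
    return weights


def gibbs_sample(parents, pheno, n_scans=1000, burn_in=50, thin=5, seed=0):
    rng = random.Random(seed)
    ids = list(parents)
    children = {i: [c for c in ids if parents[c] is not None and i in parents[c]]
                for i in ids}
    config = {i: 1 for i in ids}          # (a) a consistent starting configuration
    stored = []
    for scan in range(1, n_scans + 1):
        rng.shuffle(ids)                  # (b) random permutation of individuals
        for i in ids:                     # (c) update each genotype in turn
            w = local_weights(i, config, parents, children, pheno)
            config[i] = rng.choices(GENOTYPES, weights=w)[0]
        if scan >= burn_in and (scan - burn_in) % thin == 0:
            stored.append(dict(config))   # (d) keep scan 50, then every 5th scan
    return stored


# Hypothetical pedigree: id -> None (founder) or (mother, father).
PARENTS = {1: None, 2: None, 3: (1, 2), 4: None, 5: (3, 4)}
PHENO = {1: 0, 2: None, 3: 1, 4: 0, 5: 1}
```

Posterior genotype probabilities are then estimated by averaging over the stored configurations, for example `sum(s[3] == 2 for s in samples) / len(samples)` for the probability that individual 3 is homozygous for the disease allele.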
C.2. Haplotype analysis
We have developed methods for investigating the ancestry of individual alleles at single loci in
large and complex pedigrees. For example, Thompson and Morgan (1989) use a recursive descent probability algorithm to investigate the ancestry of two recessive lethals in a very large and complex genealogy of Canadian Hutterites. In principle the same method applies to ancestry at linked loci (Thompson,
1988), but again there are computational limitations. Moreover, while Thompson and Morgan managed
to incorporate the effects of lethality of the recessive trait into their ancestral inference, the inclusion of
a general selection pattern is impossible with current methods. The preliminary work of Sheehan (1990)
on the use of the Gibbs sampler on pedigrees has been in the inference of single-locus genotypes on
pedigrees that are far too large and complex for exact results to be obtainable. The Monte Carlo
approach performs well on the genealogy studied by Sheehan (1990). We have used it to obtain preliminary estimates of cystic fibrosis carrier probabilities for ancestral individuals in the Hutterite genealogy.
The next stage is the extension of these methods to permit the inference of multilocus haplotypes, and in
particular to the ancestry of the different CF haplotypes in the Hutterite genealogy. One very attractive
feature of the Gibbs sampler approach is that lethality, or any specific less extreme selection pattern, can
be incorporated into the analysis. Details of our proposed work are given in section D.2.2.
We also have developed procedures based on logic for reconstructing extended haplotypes in
pedigrees (Wijsman, 1987). These techniques are computationally rapid and can provide haplotypes
segregating in complex pedigrees of up to 100-200 individuals in minutes on a personal computer. On
larger computers, the size of the pedigrees can be much greater. Pedigree complexity does not complicate the analysis. However, because the methods reconstruct only the genotypes and haplotypes which
are forced from the data, they do not provide probabilistic estimates of uncertain haplotypes, including
those removed by more than 2 generations from the observed data. However, the constraints to the data
supplied by using these techniques can efficiently limit the potential data space so that computations
need never be made for situations with zero probability.
C.3 Quantitative genetic studies on pedigrees
C.3.1 Likelihood evaluation by peeling
The essence of "peeling" is the sequential summing out of genotypes of individuals, permitting
probabilities derived from phenotypic observations to be accumulated through an extended pedigree.
Peeling programs for Gaussian models are only a component of our tool-kit for likelihood analysis of
complex models for pedigree data, but they are an essential ingredient (section D.5). For a polygenic
model, the sequential summation of the peeling process translates to integration with respect to different
genetic effects. The sequential computation of likelihoods for a Gaussian random effects model is based
on the fact that multivariate Gaussian densities can be explicitly integrated with respect to any component to give an expression of the same form in the remaining components. Specifically,
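$$ \int_{-\infty}^{\infty} \exp(-a^T B a / 2) \, da_k = (2\pi / b_{kk})^{1/2} \exp\left(-a^{*T} (B^* - b b^T / b_{kk}) a^* / 2\right) \tag{C1} $$
where, without loss of generality, $B$ is symmetric, $B^*$ is $B$ with the $k$th row/column eliminated, $b$ is
that row/column without the $(k, k)$ element, $b_{kk}$ is that element, and $a^*$ is $a$ without its $k$th component; i.e., with the $k$th component ordered last,
$$ B = \begin{pmatrix} B^* & b \\ b^T & b_{kk} \end{pmatrix} \tag{C2} $$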
In a simple polygenic model for a normally distributed quantitative trait with observed values $y$ on a set
of related individuals, we might have (for example) a model
$$ y = \mu + a + d + e \tag{C3} $$
where $\mu$ is a vector of means (incorporating perhaps some fixed effects), $a$ is a vector of breeding values
distributed as $N(0, \sigma^2 G)$, $d$ is a vector of dominance deviations distributed as $N(0, \tau^2 D)$, and $e$ is a
vector of uncorrelated residuals, each of variance $\varepsilon^2$. The matrix $G$ has components which are twice the
kinship coefficients between individuals. Alternatively, the vector of additive effects $a$ can be generated
via
$$ a_i = (a_m + a_f)/2 + s_i \tag{C4} $$
where $m$ and $f$ are the parents of $i$, and the segregation effects $s_i$ are independent, with mean zero and variance
$\sigma^2/2$. (This holds, regardless of inbreeding in the pedigree.) The dominance effects are effects correlated between bilateral relatives; these also have a simple conditional form.
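The recursion (C4) can be sketched by simulating breeding values down a small pedigree. The nuclear family, the value $\sigma^2 = 1$ and all names below are illustrative assumptions. In this non-inbred example each breeding value should have variance close to $\sigma^2$, and two full sibs should have covariance close to $\sigma^2/2$, that is, $\sigma^2$ times twice their kinship coefficient of 1/4.

```python
import random


def simulate_breeding_values(parents, sigma2, rng):
    """Generate additive effects by (C4); parents must be listed before offspring."""
    a = {}
    for i, par in parents.items():
        if par is None:
            a[i] = rng.gauss(0.0, sigma2 ** 0.5)           # founder: N(0, sigma^2)
        else:
            m, f = par
            s_i = rng.gauss(0.0, (sigma2 / 2.0) ** 0.5)    # segregation effect
            a[i] = 0.5 * (a[m] + a[f]) + s_i
    return a


def sample_cov(u, v):
    """Unbiased sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / (n - 1)


# Hypothetical nuclear family: two founders and two full sibs.
FAMILY = {"m": None, "f": None, "sib1": ("m", "f"), "sib2": ("m", "f")}


def replicate(n_rep=20000, sigma2=1.0, seed=0):
    """Independent replicate simulations of the family's breeding values."""
    rng = random.Random(seed)
    return [simulate_breeding_values(FAMILY, sigma2, rng) for _ in range(n_rep)]


if __name__ == "__main__":
    draws = replicate()
    s1 = [d["sib1"] for d in draws]
    s2 = [d["sib2"] for d in draws]
    print(sample_cov(s1, s1), sample_cov(s1, s2))
```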
We have implemented a general peeling program for multiple random effects models, and tested
its validity in analyses of simulated genetic data on extended zero-loop pedigrees. This work was performed as a part of our work on the genetics of disease resistance in extended (800-member) pedigrees
of Brassica campestris. We cannot detail here the peeling algorithm for such a model. It involves successive incorporation of observation, prior, segregation and dominance terms. For example, from equation (C4) the segregation from parents m and f to i gives an additional factor
$$ (\sigma^2/2)^{-1/2} \exp\left(-(a_i - 0.5(a_m + a_f))^2 / \sigma^2\right) \tag{C5} $$
Integration with respect to variables for which all relevant terms have been incorporated is accomplished
using equation (C1). The procedure is broadly analogous to that of peeling a Mendelian trait (Thompson, 1976; Thompson, 1986) although the detailed operations differ.
C.3.2 Likelihood maximisation and the EM algorithm
For a likelihood analysis we require not only likelihood evaluation, but also likelihood maximisation. With few exceptions (Lange et al., 1976), human geneticists have not considered multiple random
effects models for quantitative traits, but they are a major component of likelihood analysis of phenotypic data observed on extended pedigrees in the quantitative animal and plant genetics literature (Henderson, 1986; Shaw, 1987). This animal breeding methodology also has useful application in the fitting
of models for quantitative epidemiological traits observed on human pedigrees. One approach that is
frequently used to obtain likelihood estimates of genetic variance component parameters is the
EM algorithm (Dempster et al., 1977; Thompson and Meyer, 1986). However, direct implementation of
algorithms for random effects models involves the repeated inversion of large matrices. This is computationally very slow. Various methods of avoiding these computationally intensive matrix operations
have been proposed (Schaeffer and Kennedy, 1986; Graser et al., 1987). For example, for the model
(C3), the EM iterative equation for the additive genetic variance $\sigma^2$ takes the form
$$ \sigma^{*2} = \mathrm{E}_{\sigma^2, \tau^2, \varepsilon^2}(a^T G^{-1} a \mid y) = \tilde{a}^T G^{-1} \tilde{a} + \mathrm{tr}(G^{-1} W) \tag{C6} $$
where $\tilde{a} = \varepsilon^{-2} W (y - \mu)$ is the conditional mean of $a$ given the data $y$, $W = (\sigma^{-2} G^{-1} + \varepsilon^{-2} I)^{-1}$ is the
conditional variance, and other notation is as previously defined. This is just one component equation of
one simple model. For a simple additive genetic model, we have implemented an EM algorithm that
does not require any matrix inversion, and can be run easily on a workstation or microcomputer
(Thompson and Shaw, 1990). A similar method can be implemented for bivariate traits (Thompson and
Shaw, in preparation). However, the method does not readily extend to more complex genetic models.
A Monte Carlo EM approach using the Gibbs sampler is easily implemented, and extends readily
to more complex models. We have shown that the method compares well with that of Thompson and
Shaw (1990) for simple additive genetic models, and have implemented the algorithm for models
including the fitting of common sib, dominance effects, or maternal effects, on extended pedigrees (Guo
and Thompson; in preparation). While the method is simply a procedure for obtaining maximum likelihood estimates under a genetic model, and therefore cannot overcome the statistical problems of the
lack of precision inherent in these models, we have shown that computationally the method compares
well with alternatives. Details of how the Gibbs sampler will be used to implement a Monte Carlo EM
approach to quantitative genetic models are given in section D.3.2.
C.4 Data sets.
Several data sets are already available for testing the methodology developed in this project. Two
of these have been collected through projects on which Dr. Wijsman is collaborating. The first data set
is available through Dr. Wijsman’s collaborations with a large project which is studying the genetics and
epidemiology of heart disease. This data set consists of over 175 pedigrees ranging in size from 4 to
132 individuals. Data are available on levels of quantitative risk factors for heart disease including total
cholesterol, very low density, low density, high density cholesterol, triglyceride, and Apolipoprotein B.
Also available are qualitative or categorical data such as low density lipoprotein subclass patterns,
myocardial infarction, angina, etc. Data on environmental risk factors such as smoking, drug treatments, age, weight and height, etc. are also available. Genotype data are available for most pedigree
members for multiple polymorphic sites at the Apolipoprotein B locus, and the complex of loci encoding the Apolipoprotein A1-C3-A4 proteins. The second data set is a set of over 125 pedigrees which
have multiple cases of Alzheimer’s disease. Twelve of these pedigrees are of Volga German ancestry
(Bird et al., 1989). This pedigree collection includes age at diagnosis, if known, age at last exam, and
for 33 pedigrees, data on more than 50 genetic markers. The sizes of both data sets are increasing
steadily.
The data sets contain complicating factors which are typical of human data. These include: 1)
loops in the pedigrees (inbreeding, brothers marrying sisters, etc.), 2) extended pedigrees, 3) multiple
marriages, 4) pedigrees which extend up both maternal and paternal lineages, 5) missing data, 6) nonpaternity, 7) uncertain information (for example, age at onset or even disease status for Alzheimer’s
patients (Bird et al., 1989)), and 8) genetic marker data from highly polymorphic systems. Some of these
polymorphic systems consist of multiple polymorphic sites, and need to have haplotypes constructed in
order to extract the full information on segregation of the alleles.
Additionally, we have access to a large database of genealogical and medical genetic data on the
Canadian Hutterites, through Dr. Thompson’s collaboration with Professor Ken Morgan (see section G).
The preliminary genealogical dataset contains 9,205 married individuals. With the addition of offspring born to 1981, the new database contains almost 20,000 individuals living and dead. The total
Hutterite population now numbers over 30,000 living individuals. It is almost completely endogamous,
and ancestry traces to only about 70 founders about 12 generations ago. With its large complex genealogy, and the availability of haplotype information for many current individuals, this is an ideal database
for testing methods addressing multilocus Mendelian haplotypes. Another dataset, available through Dr.
Thompson’s previous and preliminary work, is of measures of disease resistance on six designed pedigrees each of between 500 and 800 individuals of Brassica campestris. These pedigrees are large,
but (by design) not very complex. Analyses to date have not addressed the rank-ordered categorical
nature of the trait. Plant and animal pedigrees are a useful surrogate for human pedigree and medical
information, for testing and evaluating computational techniques and statistical methodology. They can
be used in studies of the effects of missing data, by ignoring part of the observations. Dr. Thompson
also has other smaller human and animal pedigree data sets available from past work (for example, the
complex genealogy of a Newfoundland isolate, with a total of about 5,000 individuals but with genetic
data on only about 500 current members (Thompson, 1981)). Such data sets can be used to provide
detailed comparisons between previous results and those of our new approach.
D. PROPOSED RESEARCH
D.1 Foundations
D.1.1 Overview
Although many of the specific objectives are computational, the computational developments are
not the primary objective, but rather provide the framework within which to address questions of the
genetic basis of complex traits. Our ultimate objective is to develop methods which will allow us to
analyse traits on the basis of complex models that may involve linked markers, several loci and/or polygenic components. However, we propose to investigate first methods for the component parts of complex models -- Mendelian models, and polygenic models. Only where the performance of Monte Carlo
methods is understood fully, through investigations where comparisons with exact results are possible,
can we hope to understand performance for complex models.
Genetically complex traits result, by definition, from the interaction of more than one genetic
component, and are often also influenced by environment. Thus simple genetic mechanisms cannot
explain the observed familial distributions, and classical segregation analyses are inconclusive. The
major-gene components of such traits can only be resolved and confirmed by linkage analysis, and
genetic marker studies are necessary. Linkage analysis with a complex genetic model for the trait is not
easily accomplished. Whereas use of an incorrect or simplified trait model is often conservative for the
detection of linkage, overconservative procedures are undesirable. For traits which are not fully penetrant dominant diseases, a large amount of data is required to provide reasonable probabilities of detecting linkage. The power declines sharply if there are inaccuracies in the model. Since human data are
usually limited, it is necessary to have accurate genetic models for traits in order to detect true linkages. Against the conservatism of incorrect model specification must be set the unknown effects of multiple tests -- use of several models, alternative diagnoses, alternative markers, and alternative labs. In
view of this ill-defined multiplicity, it is not surprising that some apparent linkages (even with high LOD
scores) have not stood up to further investigation. Caution is necessary in the interpretation of linkage
studies for complex and common diseases. The Monte Carlo evaluation methods proposed here will
facilitate studies of the performance of likelihood analyses of complex models in conjunction with linkage analysis, and will increase our understanding of the properties of tests and estimates of linkage for
complex traits.
Another complexity lies in genetic heterogeneity. Inferences of genetic etiology are restricted to
the population studied, and sometimes to only a subset of the families that show a given disease trait.
Heterogeneity greatly decreases the power to detect linkage, and the chance of validating a suggested
linkage by analyses of additional data. To maximise power, it is important to make full use of data on
extended pedigrees, where genetic homogeneity is more likely to obtain. The Monte Carlo evaluation
methods proposed here are designed to permit likelihood analyses of data observed on members of
extended pedigrees. Such pedigrees are often complex; the methods proposed can also be used on complex pedigrees.
Yet other difficulties in the analysis of complex traits arise from problems of sampling and ascertainment, and the difficulties of meeting efficient design specifications in human genetic epidemiological studies, even where such specifications are available. Methods for likelihood evaluation can, of
course, also be used in evaluation of expected log-likelihoods and thence for the evaluation of alternative
sampling strategies and pedigree structures. Analyses of power and precision in the estimation of
parameters of complex models on large pedigrees have previously been unavailable. Little is known
about the amounts of data required to discriminate reliably between alternative models, when segregation analysis is performed in conjunction with linkage analysis. The Monte Carlo likelihood evaluation methods for complex models proposed here will also be useful in studies of power and of the precision of maximum likelihood estimates of parameters.
D.1.2 The Gibbs sampler on pedigrees.
The Gibbs sampler was described in section C.1. Although preliminary implementation and performance investigations have been for single Mendelian loci, we note that the Gibbs sampler is not limited to single loci, nor even to multilocus haplotypes. It works equally well for polygenic effects, other
genetic or environmental random effects, or even for a combination of random and fixed effects,
although its performance for more complex models remains to be more fully investigated. We note also
that size and complexity of the pedigree cause no problems in implementing the procedure, although
again its performance requires further investigation.
Uses of the ability to generate realisations from the posterior are not limited to direct questions
about risk probabilities, although these do have practical application (D.2, D.6). Realisations can also
be used in the evaluation of the likelihoods for complex genetic models (D.5), and in the maximisation
of likelihoods (D.3). Thus one important early component of the work will be an implementation and an
investigation of the performance and properties of the Gibbs sampler in the context of complex genetic
models. Unlike classical direct simulation approaches, the Gibbs sampler does not provide independent
realisations. Thus, before the properties of the use of Gibbs sampler realisations in likelihood estimation can be studied, we must first know and understand the properties of the realisations themselves.
D.1.3 Monte Carlo evaluation of likelihood ratios
Where the pedigree is very large and complex or the model involves several loci, peeling to evaluate the likelihood is impossible. The Gibbs sampler then provides a way for the Monte Carlo evaluation of likelihoods.
Many likelihoods of genetic models are of the general form
$$ L(\theta) = \int_x P_\theta(y \mid x) \, P_\theta(x) \, dx \tag{D1} $$
where y is the set of phenotypic observations, x the set of underlying genotypes or genetic effects, and θ
the set of all parameters of the genetic model, including segregation, linkage, and penetrance parameters. In making inferences, the absolute value of a likelihood has little meaning; it suffices to compute
relative likelihoods, or likelihood ratios. We note here the general result that for a likelihood of the form
(D1), the likelihood ratio between parameters $\theta$ and $\theta_0$ can be expressed as
$$ l(\theta; \theta_0) = \frac{L(\theta)}{L(\theta_0)} = \int_x \frac{P_\theta(y, x)}{P_{\theta_0}(y, x)} \, dP_{\theta_0}(x \mid y) \tag{D2} $$
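The step from (D1) to (D2) is short but worth writing out. The following derivation is a sketch that uses only $L(\theta_0) = P_{\theta_0}(y)$ and $P_\theta(y \mid x) P_\theta(x) = P_\theta(y, x)$, and assumes that $P_{\theta_0}(y, x) > 0$ wherever $P_\theta(y, x) > 0$, so that the ratio is well defined on the relevant configurations.

```latex
\begin{align*}
\frac{L(\theta)}{L(\theta_0)}
  &= \frac{1}{P_{\theta_0}(y)} \int_x P_\theta(y, x) \, dx \\
  &= \int_x \frac{P_\theta(y, x)}{P_{\theta_0}(y, x)}
            \cdot \frac{P_{\theta_0}(y, x)}{P_{\theta_0}(y)} \, dx \\
  &= \int_x \frac{P_\theta(y, x)}{P_{\theta_0}(y, x)} \, dP_{\theta_0}(x \mid y)
\end{align*}
```

The last line is an expectation with respect to the posterior distribution of $x$ given $y$ at $\theta_0$, which is the distribution from which the Gibbs sampler provides realisations.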
Thus if $N$ realisations, $\{x^{(j)};\ j = 1, \ldots, N\}$, are generated from the posterior distribution of genotype
given phenotype ($x$ given $y$) at parameter values $\theta_0$, then we can obtain a Monte-Carlo evaluation of the
likelihood ratio as
$$ \hat{l}(\theta; \theta_0) = \frac{1}{N} \sum_{j=1}^{N} \frac{P_\theta(y, x^{(j)})}{P_{\theta_0}(y, x^{(j)})} = \frac{1}{N} \sum_{j=1}^{N} \frac{P_\theta(y \mid x^{(j)}) \, P_\theta(x^{(j)})}{P_{\theta_0}(y \mid x^{(j)}) \, P_{\theta_0}(x^{(j)})} \tag{D3} $$
Thus for the many likelihoods arising in pedigree analysis which are of the form (D1), the form (D2) is
the relevant Gibbs sampler form, with samples arising from the posterior distribution of genotypic
effects (x) given phenotype (y). The form (D3) can thus be used to provide a Gibbs sampler based
Monte Carlo estimate of a likelihood ratio. This basic estimation form will arise several times in our
proposed work.
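The estimator (D3) can be illustrated on a toy problem in which the exact likelihood ratio is available for comparison. In this minimal sketch the individuals are unrelated, the only parameter is an allele frequency `q`, and the penetrances are fixed. These simplifications, and all names and numerical values, are assumptions made for illustration. Because the individuals are unrelated, the posterior at `q0` can be sampled directly; on a pedigree the realisations would instead come from the Gibbs sampler.

```python
import random

GENOTYPES = (0, 1, 2)
PENETRANCE = (0.05, 0.10, 0.90)   # assumed P(affected | genotype), held fixed


def hw_prior(g, q):
    """Hardy-Weinberg genotype probability at allele frequency q."""
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[g]


def pen(y, g):
    """P(phenotype y | genotype g) for y in {0, 1}."""
    return PENETRANCE[g] if y == 1 else 1.0 - PENETRANCE[g]


def joint(ys, xs, q):
    """P_q(y, x) = P(y | x) P_q(x) for unrelated individuals."""
    p = 1.0
    for y, g in zip(ys, xs):
        p *= pen(y, g) * hw_prior(g, q)
    return p


def exact_likelihood(ys, q):
    """L(q) of form (D1), summing over genotypes individual by individual."""
    like = 1.0
    for y in ys:
        like *= sum(pen(y, g) * hw_prior(g, q) for g in GENOTYPES)
    return like


def sample_posterior(ys, q0, rng):
    """One realisation of x from P_{q0}(x | y); stands in for a Gibbs sampler draw."""
    xs = []
    for y in ys:
        w = [pen(y, g) * hw_prior(g, q0) for g in GENOTYPES]
        xs.append(rng.choices(GENOTYPES, weights=w)[0])
    return xs


def mc_likelihood_ratio(ys, q, q0, n_real=20000, seed=1):
    """Monte Carlo estimate (D3) of L(q) / L(q0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_real):
        xs = sample_posterior(ys, q0, rng)
        total += joint(ys, xs, q) / joint(ys, xs, q0)
    return total / n_real


# Hypothetical phenotypes (1 = affected) on ten unrelated individuals.
YS = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

if __name__ == "__main__":
    print(mc_likelihood_ratio(YS, 0.35, 0.30),
          exact_likelihood(YS, 0.35) / exact_likelihood(YS, 0.30))
```

The estimate is most reliable when `q` is close to `q0`; as the two parameter values move apart the ratios in the sum become more variable, which is the usual behaviour of an importance sampling estimator.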
D.2 Mendelian models
D.2.1 Ancestral genotypes
For ease of presentation we shall develop our approach first in the context of Mendelian models.
This will enable us to evaluate the performance of the techniques in situations for which exact evaluation methods are available for comparison. There are many cases where it would be of interest to ascertain the genotypes of ancestors of a current group of individuals. These posterior probabilities have
implications for genetic counselling. Previous analyses of ancestral genotypes have used the "peeling
method" (Cannings, Thompson and Skolnick; 1978). This method provides probabilities of data
observed on a pedigree under a genetic model, and can thus be used either to estimate the parameters of
the model, or to obtain risk probabilities for specific individuals. In fact, posterior risk probabilities can
normally be obtained for all members of the pedigree with just two applications of the peeling algorithm
(Thompson, 1981).
The feasibility of the computational method of peeling to evaluate likelihoods or posterior probabilities is limited to pedigrees that are not too complex and to models with few genotypes. Single gene
models can be analysed on pedigrees as complex as that of Tristan da Cunha (Thompson, 1978), a West
Coast Newfoundland genealogy (Thompson, 1981) or the Samaritan genealogy (Thomas, 1986). Even
simpler questions, such as the ancestry of rare lethal recessives, can be addressed on more complex
genealogies, such as the Hutterite genealogy (Thompson and Morgan, 1989), but our ability to answer
questions about linked loci or multilocus haplotypes is very limited (see C.2). There are also deterministic logical procedures that can be used to infer haplotypes on pedigrees without extensive missing data
(see C.2) but these also have limitations.
D.2.2 Ancestral haplotypes; the Gibbs sampler solution
The Gibbs sampler provides a method of simulating from the posterior distribution of genotype
given phenotype. By time-averaging these realisations over long runs, we can obtain accurate estimates
of the marginal and joint posterior genotype probabilities (Sheehan, 1990). The Hutterite genealogical
database of 9,205 married individuals (see C.4) is far too large for exact computational methods such as
peeling; even the small subsection consisting of the 583 ancestors of individuals affected with cystic
fibrosis (CF) is too complex for any exact computations (Fujiwara et al., 1989). However, the computational requirements of the Gibbs sampler increase linearly with pedigree size, and pedigree complexity
is an advantage for the efficiency of the method (Sheehan, 1990). The ancestry of the CF haplotypes in
the Hutterite genealogy is of considerable interest. It had been previously recognised that with three different linked marker haplotypes carrying the CF mutation (Fujiwara et al. 1989), there were likely to be
at least three different origins among the 70 founder members of this population which now numbers
over 30,000. It now appears likely that there are in fact three distinct CF mutations present (Klinger et
al., 1990), two of which are (at best) uncommon in other Caucasians.
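The time-averaging of Gibbs realisations described above can be sketched on a toy example. The following is only an illustration under stated assumptions, not the program used in this project: the four-person pedigree, the allele frequency `Q`, the penetrances and the phenotypes are all hypothetical, and the local conditional distribution is obtained by recomputing the full joint probability, which is affordable only because the pedigree is tiny. Exact enumeration is included so that the time-averages can be checked.

```python
import itertools
import random

# Hypothetical toy pedigree: two founders (0, 1) and their two children (2, 3).
PARENTS = {0: None, 1: None, 2: (0, 1), 3: (0, 1)}
Q = 0.3                                  # assumed disease-allele frequency
PENETRANCE = (0.1, 0.2, 0.8)             # assumed P(affected | 0, 1, 2 disease alleles)
AFFECTED = (False, False, True, False)   # assumed observed phenotypes y


def joint(g):
    """P(y, g): product of founder, segregation and penetrance terms."""
    p = 1.0
    for i, par in PARENTS.items():
        if par is None:
            # Hardy-Weinberg founder probabilities
            p *= ((1 - Q) ** 2, 2 * Q * (1 - Q), Q ** 2)[g[i]]
        else:
            # Mendelian segregation: each parent transmits the allele w.p. g/2
            tf, tm = g[par[0]] / 2.0, g[par[1]] / 2.0
            p *= ((1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm)[g[i]]
        pen = PENETRANCE[g[i]]
        p *= pen if AFFECTED[i] else 1 - pen
    return p


def gibbs_marginals(n_sweeps, burn_in, seed):
    """Time-averaged estimates of P(g_i = k | y) from single-site Gibbs updates."""
    rng = random.Random(seed)
    g = [1, 1, 1, 1]  # any starting configuration with positive probability
    counts = [[0, 0, 0] for _ in PARENTS]
    for sweep in range(burn_in + n_sweeps):
        for i in PARENTS:
            weights = []
            for k in range(3):
                g[i] = k
                weights.append(joint(g))  # proportional to the local conditional
            g[i] = rng.choices(range(3), weights=weights)[0]
        if sweep >= burn_in:
            for i in PARENTS:
                counts[i][g[i]] += 1
    return [[c / n_sweeps for c in row] for row in counts]


def exact_marginals():
    """Exact posterior marginals by enumerating all 3^4 configurations."""
    total = 0.0
    marg = [[0.0] * 3 for _ in PARENTS]
    for g in itertools.product(range(3), repeat=len(PARENTS)):
        p = joint(g)
        total += p
        for i, k in enumerate(g):
            marg[i][k] += p
    return [[m / total for m in row] for row in marg]


if __name__ == "__main__":
    print(gibbs_marginals(5000, 500, seed=1)[3])
    print(exact_marginals()[3])
```

On a real genealogy, enumeration is impossible and each update would use only the terms involving the individual's parents, spouses and offspring; the sketch recomputes the whole product purely for brevity.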
Using the Gibbs sampler we shall estimate the probabilities that various ancestors carried, and
that various founders gave rise to, the three CF haplotypes now present in the Hutterite population. This
is not only an intriguing question in its own right, but will provide useful measures of the efficiency of
the Gibbs sampler in sampling from the posterior distribution over the enormous space of all a posteriori
possible genotype configurations on the pedigree. It will provide risk estimates for CF carrier status for
all members of the population past and present, and also risk probabilities separately for the three different CF haplotypes. The origins of the two "uncommon" CF haplotypes are of interest. Another question
of ancestral haplotype inference, which we shall investigate, is that of marker haplotypes of Alzheimer’s
patients in extended pedigrees of Volga German origin (section C.4).
D.2.3 Peeling and the Gibbs sampler in combination
Where exact and Monte Carlo approaches can be combined, we can use the optimal features of
both approaches. Many large complex pedigrees consist of a very complex core, but with peripheral
branches that are readily peeled. The final probability functions (Cannings, Thompson and Skolnick,
1978) that result from peeling a branch to a joint cutset of individuals on the periphery of the core
genealogy can be incorporated as boundary conditions into the Monte Carlo Gibbs sampler on the core.
Thus the Monte Carlo portion of an estimation can be made more efficient, by being restricted to relatively few individuals. This will be implemented and investigated, again using the Hutterite genealogy
as one example. We shall also investigate haplotypes segregating for the Apolipoprotein B region,
where some logical deterministic conclusions are possible, and haplotypes in the Alzheimer’s pedigrees
segregating for chromosome-21 markers (see C.4). This combined peeling and Gibbs sampler scheme
is just one example of the general objective of combining Monte Carlo evaluation with exact computational methods: other examples appear in the proposed analysis of the mixed model (section D.5).
D.2.4 Likelihoods for Mendelian models; linkage analysis.
The Gibbs sampler can also be used to provide estimates of the likelihood of a Mendelian model.
The first and simplest specific instance of the general likelihood form (D1) is the well known equation
\[
L(\theta) = \sum_{g} P_\theta(y \mid g)\, P_\theta(g) \tag{D4}
\]
where y is the observed phenotypic data, and g a genotypic configuration on the pedigree, and summation is over all genotypic configurations. This form is of no use for Monte Carlo evaluation for,
although it is simple to simulate from Pθ (g), only a minute fraction of the genotypic configurations
realised will even be consistent with the data y. However, in the conditional sampling form (D2), sampling is not only restricted to genotypes that are consistent with the data, but is focused on genotypes g
that have high posterior probability given the data y, providing an efficient form for Monte Carlo evaluation. The estimate (D3) becomes
\[
\hat{l}(\theta; \theta_0) = \frac{1}{N} \sum_{j=1}^{N}
\frac{P_\theta(y \mid g^{(j)})\, P_\theta(g^{(j)})}{P_{\theta_0}(y \mid g^{(j)})\, P_{\theta_0}(g^{(j)})} \tag{D5}
\]
where the N realisations, {g( j) ; j = 1 . . . , N }, are generated from the posterior distribution of genotype
given phenotype (g given y), obtained by running the Gibbs sampler at parameter values θ 0 .
The estimation (D5) is easily implemented; Pθ (y, g( j) ) is a simple product of founder probabilities, segregation probabilities and penetrances. It is only the summation over genotypic configurations
in (D4) that causes computational problems. This is replaced in (D5) by a summation over realisations.
Further, note that we have a free choice of the simulation parameter values θ 0 (although some choices
are more efficient than others), and obtain with one simulation the likelihood for all other θ relative to
L(θ 0 ). If we choose, for particular θ of interest, a θ 0 differing only in the penetrance probabilities, then
the ratio in (D5) becomes a simple ratio of penetrance probabilities Pθ (y|g( j) )/Pθ 0 (y|g( j) ), the segregation probabilities cancelling; Pθ (g( j) ) = Pθ 0 (g( j) ). Conversely, if the θ : θ 0 difference is in the segregation probabilities, then (D5) involves only ratios Pθ (g( j) )/Pθ 0 (g( j) ), and the data enter into the computation only via the Gibbs sampling procedure.
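As a hedged illustration of the estimator (D5), the sketch below runs a single-site Gibbs sampler at θ0 on a toy pedigree and averages the ratio P_θ(y, g)/P_θ0(y, g) over realisations. Everything specific here is an assumption made for the example: the four-person pedigree, the phenotypes, and the parameterisation `theta = (q, penetrances)`. The exact ratio L(θ)/L(θ0) is computed by enumeration only because the pedigree is small enough to allow a check.

```python
import itertools
import random

# Hypothetical toy pedigree and data (not from the studies described here).
PARENTS = {0: None, 1: None, 2: (0, 1), 3: (0, 1)}
AFFECTED = (False, False, True, False)


def joint(g, theta):
    """P_theta(y, g): founder probabilities x segregation x penetrances."""
    q, pen = theta
    p = 1.0
    for i, par in PARENTS.items():
        if par is None:
            p *= ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[g[i]]
        else:
            tf, tm = g[par[0]] / 2.0, g[par[1]] / 2.0
            p *= ((1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm)[g[i]]
        p *= pen[g[i]] if AFFECTED[i] else 1 - pen[g[i]]
    return p


def gibbs_draws(theta0, n, burn_in, seed):
    """Yield n genotype configurations from P_theta0(g | y)."""
    rng = random.Random(seed)
    g = [1, 1, 1, 1]
    for sweep in range(burn_in + n):
        for i in PARENTS:
            weights = []
            for k in range(3):
                g[i] = k
                weights.append(joint(g, theta0))
            g[i] = rng.choices(range(3), weights=weights)[0]
        if sweep >= burn_in:
            yield tuple(g)


def ratio_estimate(theta, theta0, n=20000, burn_in=1000, seed=2):
    """Monte Carlo estimate of L(theta) / L(theta0) in the form (D5)."""
    total = 0.0
    for g in gibbs_draws(theta0, n, burn_in, seed):
        total += joint(g, theta) / joint(g, theta0)
    return total / n


def exact_likelihood(theta):
    """L(theta) by the direct sum (D4); feasible only for tiny pedigrees."""
    return sum(joint(g, theta) for g in itertools.product(range(3), repeat=4))


if __name__ == "__main__":
    theta0 = (0.3, (0.1, 0.2, 0.8))
    theta = (0.3, (0.1, 0.3, 0.7))   # differs from theta0 only in penetrances
    print(ratio_estimate(theta, theta0, n=5000))
    print(exact_likelihood(theta) / exact_likelihood(theta0))
```

Because `theta` and `theta0` share the same allele frequency in this usage, the founder and segregation terms cancel in each ratio, which is the simplification noted in the text; the code does not exploit it, so the same function also handles differences in `q`.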
Thus we have here a method for the evaluation of likelihood ratios. A very preliminary prototype
program has shown feasibility of the approach. An early objective is to implement this procedure more
generally, to investigate its properties, and to use it on complex pedigrees, and for genetic models
involving multiple alleles or several loci, where exact computation ("peeling") may be impossible. The
incidence of CF and other haplotypes in the Hutterite genealogy will again provide a wealth of
examples. Apolipoprotein B haplotypes segregating in the heart disease families, and chromosome-21
haplotypes in the Alzheimer’s families will provide other examples, and the opportunity for comparison
of exact and Monte Carlo results. Additionally, we note that linkage likelihoods are of the form (D4)
and may be estimated via the form (D5). That is, the genotype vector g may consist of genotypes at several linked markers, and the parameters of interest, θ , may include recombination parameters. We shall
investigate the performance of the Gibbs sampler in the evaluation of likelihoods for linkage.
D.3 Polygenic models
D.3.1 Likelihood evaluation for polygenic models
The peeling algorithm of Elston and Stewart (1971) can be generalised to permit computation of
the likelihood for a variety of polygenic models (Ott, 1979). Implementation of a general polygenic
peeling algorithm was detailed in section C.3.1. For a zero loop pedigree, there is thus no need for a
Monte Carlo evaluation. However, for pedigrees with loops, Monte Carlo evaluation using the Gibbs
sampler may prove a useful alternative approach. Further, the cases of polygenic models will be a useful
testing ground for assessing the precision and efficiency of Monte Carlo evaluation, by comparison with
exact peeling results on large but simple pedigrees. In addition to simulation studies, our designed plant
pedigrees of Brassica campestris (section C.4) with measures of disease resistance and other quantitative traits will be used to test procedures. Additionally, the Gibbs sampler for polygenic effects will be
implemented and its properties assessed. This is necessary for developing the tool kit required for complex models (section D.5). The procedure is directly analogous to equations (D4) and (D5) above. For
the model (C3)
\[
L(\theta) = \int_{x} P_\theta(y \mid a, d)\, P_\theta(a, d)\, da\, dd \tag{D4*}
\]
where y is the set of phenotypic observations, a and d the additive and dominance effects. The estimation equation (D3) for the likelihood ratio becomes
\[
\hat{l}(\theta; \theta_0) = \frac{1}{N} \sum_{j=1}^{N}
\frac{P_\theta(y, a^{(j)}, d^{(j)})}{P_{\theta_0}(y, a^{(j)}, d^{(j)})} \tag{D5*}
\]
where the additive and dominance effects {(a( j) , d( j) ) ; j = 1, . . . , N } are realisations from the conditional distributions of these effects, given the phenotypic data y, obtained using the Gibbs sampler run at
parameter values θ 0 .
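A minimal sketch of (D5*) follows, restricted for brevity to additive effects only (no dominance term) on a single hypothetical parent-offspring pair. The assumptions are ours and are flagged in the code: phenotype y_i = μ + a_i + e_i with known μ, parent effect a_0 ~ N(0, σ²), offspring effect a_1 given a_0 ~ N(a_0/2, 3σ²/4) (the unobserved second parent being integrated out), residual variance τ², and `theta = (sigma2, tau2)`. The exact bivariate Normal likelihood is included as a check.

```python
import math
import random

MU = 0.0
Y = (1.2, 0.4)  # assumed phenotypes: parent (index 0), offspring (index 1)


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_joint(a, theta):
    """log P_theta(y, a) for the pair; additive effects only (an assumption)."""
    s2, t2 = theta
    lp = log_normal_pdf(a[0], 0.0, s2)                  # parent (founder) effect
    lp += log_normal_pdf(a[1], a[0] / 2.0, 0.75 * s2)   # offspring given one parent
    lp += log_normal_pdf(Y[0], MU + a[0], t2)
    lp += log_normal_pdf(Y[1], MU + a[1], t2)
    return lp


def gibbs_draws(theta0, n, burn_in, seed):
    """Yield n draws of (a_0, a_1) from their conditional distribution given y."""
    s2, t2 = theta0
    rng = random.Random(seed)
    a = [0.0, 0.0]
    for sweep in range(burn_in + n):
        # Local conditional of a_0 given a_1 and y_0 (Normal, by completing the square)
        prec = 1 / s2 + 1 / (3 * s2) + 1 / t2
        mean = (2 * a[1] / (3 * s2) + (Y[0] - MU) / t2) / prec
        a[0] = rng.gauss(mean, math.sqrt(1 / prec))
        # Local conditional of a_1 given a_0 and y_1
        prec = 4 / (3 * s2) + 1 / t2
        mean = (2 * a[0] / (3 * s2) + (Y[1] - MU) / t2) / prec
        a[1] = rng.gauss(mean, math.sqrt(1 / prec))
        if sweep >= burn_in:
            yield tuple(a)


def ratio_estimate(theta, theta0, n=20000, burn_in=500, seed=3):
    """Monte Carlo estimate of L(theta) / L(theta0) in the form (D5*)."""
    total = 0.0
    for a in gibbs_draws(theta0, n, burn_in, seed):
        total += math.exp(log_joint(a, theta) - log_joint(a, theta0))
    return total / n


def exact_likelihood(theta):
    """Bivariate Normal density of y: variance s2 + t2, covariance s2 / 2."""
    s2, t2 = theta
    v, c = s2 + t2, s2 / 2.0
    det = v * v - c * c
    d0, d1 = Y[0] - MU, Y[1] - MU
    quad = (v * d0 * d0 - 2 * c * d0 * d1 + v * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


if __name__ == "__main__":
    theta0, theta = (1.0, 1.0), (0.8, 1.1)
    print(ratio_estimate(theta, theta0, n=5000))
    print(exact_likelihood(theta) / exact_likelihood(theta0))
```

One design point worth noting: with continuous effects the ratio can have infinite variance if θ implies much larger variances than θ0, so in this sketch θ is deliberately chosen close to θ0.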
D.3.2 Maximising the likelihood; a Monte Carlo version of the EM algorithm
To fit genetic models, it is necessary to maximise likelihoods, not simply to evaluate them. Previous and preliminary work on use of the EM algorithm to obtain maximum likelihood estimates of the
variance parameters in quantitative genetic models was detailed in C.3.2. The main thrust of this work
has been to find ways of evaluating the terms of EM equations such as (C6) without computationally
intensive matrix manipulations. Unfortunately, the methods developed are only appropriate for models
with only one heritable random effect. For example, the method cannot readily be extended to include
dominance effects or maternal effects in addition to additive genetic effects.
However, the ability to sample from the conditional distributions of genetic effects given the
observed data provides an alternative approach to the evaluation of the conditional expectations arising
in the EM equations -- namely, Monte Carlo evaluation using the Gibbs sampler on pedigrees. We simply sample values of a_i and d_i from the local conditional distribution for individual i, and use a time-average across realisations as an estimate of a conditional expectation such as that in equation (C6). In this
approach there is virtually no limitation on the number of effects in the model, nor on the size or complexity of the pedigree. Nor do missing observations cause any difficulty in implementing the procedures.
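The Monte Carlo EM idea above can be sketched as follows. This is an illustrative toy under our own assumptions, not the procedure of section C.3.2: the data are simulated independent parent-offspring pairs, the mean is known to be zero, only additive effects are modelled, and the M-step shown is the closed form that holds for this particular model. The E-step replaces the exact conditional expectations by Gibbs time-averages.

```python
import math
import random


def simulate_pairs(n_pairs, s2, t2, rng):
    """Simulate (parent, offspring) phenotypes y = a + e; hypothetical data."""
    data = []
    for _ in range(n_pairs):
        a0 = rng.gauss(0.0, math.sqrt(s2))
        a1 = rng.gauss(a0 / 2.0, math.sqrt(0.75 * s2))
        data.append((a0 + rng.gauss(0.0, math.sqrt(t2)),
                     a1 + rng.gauss(0.0, math.sqrt(t2))))
    return data


def e_step(data, s2, t2, n_sweeps, rng):
    """Gibbs time-averages of the complete-data sufficient statistics."""
    sum_a, sum_e, count = 0.0, 0.0, 0
    for y0, y1 in data:
        a0 = a1 = 0.0
        for sweep in range(n_sweeps):
            # Sample each effect from its local conditional distribution
            prec = 1 / s2 + 1 / (3 * s2) + 1 / t2
            a0 = rng.gauss((2 * a1 / (3 * s2) + y0 / t2) / prec, math.sqrt(1 / prec))
            prec = 4 / (3 * s2) + 1 / t2
            a1 = rng.gauss((2 * a0 / (3 * s2) + y1 / t2) / prec, math.sqrt(1 / prec))
            if sweep >= n_sweeps // 5:  # discard the first fifth as burn-in
                sum_a += a0 ** 2 + (a1 - a0 / 2.0) ** 2 / 0.75
                sum_e += (y0 - a0) ** 2 + (y1 - a1) ** 2
                count += 2  # two individuals per retained sweep
    return sum_a / count, sum_e / count


def monte_carlo_em(data, s2, t2, n_iter=30, n_sweeps=50, seed=4):
    """Iterate E-step (Gibbs averages) and M-step (variance updates)."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        # M-step: in this toy model the new variances are the averaged statistics
        s2, t2 = e_step(data, s2, t2, n_sweeps, rng)
    return s2, t2


def log_likelihood(data, s2, t2):
    """Exact log-likelihood (bivariate Normal per pair), used only as a check."""
    v, c = s2 + t2, s2 / 2.0
    det = v * v - c * c
    ll = 0.0
    for y0, y1 in data:
        quad = (v * y0 * y0 - 2 * c * y0 * y1 + v * y1 * y1) / det
        ll += -0.5 * quad - math.log(2 * math.pi) - 0.5 * math.log(det)
    return ll


if __name__ == "__main__":
    pairs = simulate_pairs(100, 1.0, 1.0, random.Random(0))
    print(monte_carlo_em(pairs, 0.2, 3.0))
```

Adding a further random effect in such a scheme would mean one more local conditional to sample and one more averaged statistic, which is the sense in which the number of effects is not a limitation.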
D.3.3. Polygenic models; evaluation of methods.
We shall use the data from the heart disease study (see C.4), to assess our Monte Carlo procedures for the evaluation and maximisation of likelihoods for polygenic models with multiple heritable
and environmental effects, on extended pedigrees. This data set includes a number of quantitative normally distributed measures, for which such models are appropriate. We shall assess the effect of missing observations both on the performance of the techniques, and on the statistical power to test alternative hypotheses. We shall implement and evaluate multi-trait analyses of quantitative traits related to
coronary heart disease.
D.4. Threshold models
D.4.1 Categorical traits; multiple threshold models.
Much of the phenotypic data arising in quantitative genetic analyses is in fact categorical, even
though it may be treated as Normal and analysed via traditional quantitative genetic models. For example, age-of-onset of Alzheimer’s disease may be most appropriately collected in time intervals rather
than exact age. Many traits, including those we may wish to model as determined by an underlying
polygenic genetic model, are qualitative, not quantitative. Even where the categories are ordered, and/or
indexed by real numbers, the assumption of a Normal distribution is often unwarranted. The effect of
treating (for example) a disease susceptibility scored as 0,1,2,3,4 or 5 as a Normally distributed variate
has too often not been investigated. Nonetheless models for the treatment of such categorical data do
exist in the population genetics literature. For example Wright (1934) considered a multiple threshold
model underlying the number of digits in guinea pigs. (Moreover, Wright (1935) also detected a relevant major gene.)
Instead of
\[
y = \mu + a + d + e \qquad \text{(see (C3))}
\]
we have
\[
u = \mu + a + d + e \quad \text{and} \quad P(y_i = j) = \gamma_j(u_i), \qquad j = 1, \ldots, m \tag{D6}
\]
We need a form of γ j which results in a tractable form of likelihood on the basis of observations y.
There are two possibilities, which, subject to reparametrisation, are equivalent. In either case the unconditional marginal distribution, and the conditional marginal distributions of the y i can be derived. One
is a categorical version of a Gaussian susceptibility model
\[
\gamma_j(u) = \Phi(c_j - u) - \Phi(c_{j-1} - u) \tag{D7}
\]
with c_0 = −∞ and c_m = ∞, where Φ(.) is the standard Normal distribution function. An alternative is
to categorise the variable u directly producing a threshold model:
\[
\gamma_j(u) = I_{(k_{j-1},\,k_j)}(u) =
\begin{cases} 1 & \text{if } k_{j-1} \le u \le k_j, \\ 0 & \text{otherwise.} \end{cases} \tag{D8}
\]
Since the models are essentially reparametrisations of each other, for convenience we shall discuss here
only the second form.
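For concreteness, the two forms (D7) and (D8) can be written as small functions. This is only a sketch: the cut-point list `cuts = [c_0, ..., c_m]` with infinite end points, and the particular values used, are assumptions chosen for illustration.

```python
import math


def std_normal_cdf(x):
    """Standard Normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gamma_gaussian(j, u, cuts):
    """(D7): P(y = j | u) under the categorical Gaussian susceptibility model."""
    return std_normal_cdf(cuts[j] - u) - std_normal_cdf(cuts[j - 1] - u)


def gamma_threshold(j, u, cuts):
    """(D8): indicator that the liability u lies in the interval for category j."""
    return 1.0 if cuts[j - 1] <= u <= cuts[j] else 0.0


if __name__ == "__main__":
    cuts = [-math.inf, -0.5, 0.5, math.inf]   # hypothetical thresholds, m = 3
    u = 0.3
    print([gamma_gaussian(j, u, cuts) for j in (1, 2, 3)])
    print([gamma_threshold(j, u, cuts) for j in (1, 2, 3)])
```

In both forms the category probabilities sum to one for any u that is not exactly on a threshold; under (D8) all of the probability falls on a single category.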
D.4.2 Monte Carlo importance sampling for likelihoods for categorical traits
The likelihood on the basis of the observations y is obtainable only as a multivariate Normal integral over some region (see e.g. Johnson & Kotz, 1972). However, using equations (D6) and (D8), the
likelihood is clearly simply
\[
\int_{u \in K} f_\theta(u)\, du \tag{D9}
\]
where θ is the set of genetic parameters ((σ², τ², ε²) in the present case), where f(.) is the multivariate
Normal density of u, and where K is the m-dimensional rectangle
\[
K = \prod_{i=1}^{m} (k_{y_i - 1},\, k_{y_i}).
\]
Now, for each u the
multinormal likelihood fθ (u) can be computed by methods outlined in C.3.1. Thus the likelihood (D9)
is a prime candidate for Monte Carlo evaluation. However, as in (D4) of section D.2.4 this form is of little use, since few randomly generated u will even produce values in the relevant set K. Instead, we propose a form of importance sampling (Hammersley and Handscomb, 1964), writing (D9) as
\[
L(\theta, \{k_j;\ j = 1, \ldots, m\}) = \int_{K} \frac{f_\theta(u)}{h(u)}\, dH(u) \tag{D10}
\]
where H is a probability distribution with density h over the data-dependent set K. Thus again,
although we do not here use the Gibbs sampler, we have a method of data-dependent sampling which
will provide an efficient method of Monte Carlo evaluation of the likelihood function. Choice of the
probability density h(. ) remains open, although it should be strictly positive on K (and zero elsewhere).
In some very preliminary investigations, use of a uniform distribution in those dimensions in which K is
bounded (y_i ≠ 1 or m) and a semi-normal distribution providing the appropriate tail behaviour in
unbounded dimensions seems to work satisfactorily.
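The importance sampling form (D10) can be illustrated in two dimensions. The sketch below is an assumption-laden toy, not the preliminary program referred to above: the liabilities are taken to be bivariate Normal with unit variances and a hypothetical correlation `RHO`, the first individual falls in a bounded category (lo, hi) and the second in the unbounded top category (hi, ∞), and "semi-normal" is interpreted as a half-normal density starting at the threshold. A one-dimensional midpoint-rule integration is included purely as a check.

```python
import math
import random

RHO = 0.5  # assumed correlation between the two liabilities


def density(u0, u1):
    """Bivariate Normal density f(u) with unit variances and correlation RHO."""
    det = 1.0 - RHO ** 2
    quad = (u0 * u0 - 2 * RHO * u0 * u1 + u1 * u1) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def importance_estimate(lo, hi, n=50000, seed=5):
    """Estimate P(lo < u0 < hi, u1 > hi) as the average of f(u) / h(u), u ~ h."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u0 = rng.uniform(lo, hi)        # bounded dimension: uniform on (lo, hi)
        z = abs(rng.gauss(0.0, 1.0))    # unbounded dimension: half-normal offset
        u1 = hi + z
        h = (1.0 / (hi - lo)) * (2.0 * math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi))
        total += density(u0, u1) / h
    return total / n


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def reference_probability(lo, hi, steps=4000):
    """Midpoint rule over u0 of phi(u0) * P(u1 > hi | u0); a numerical check."""
    width = (hi - lo) / steps
    sd = math.sqrt(1.0 - RHO ** 2)
    total = 0.0
    for i in range(steps):
        u0 = lo + (i + 0.5) * width
        phi = math.exp(-0.5 * u0 * u0) / math.sqrt(2 * math.pi)
        total += phi * (1.0 - std_normal_cdf((hi - RHO * u0) / sd)) * width
    return total


if __name__ == "__main__":
    print(importance_estimate(-0.5, 0.5, n=10000))
    print(reference_probability(-0.5, 0.5))
```

Every sampled u lies in K by construction, so no draws are wasted. In this toy the half-normal tail is heavier than the conditional tail of f, which keeps the weights f/h bounded; that property would need checking for other choices of h.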
We shall fit such threshold models for rank-ordered categorical traits. On a large data set of
extended pedigrees of Brassica we have data on disease susceptibility that is scored from 0 to 5. In the
case of data on Alzheimer’s Disease in extended pedigrees (see C.4), we have categorical levels of disease status and onset, based on clinical symptoms, due to uncertainties of diagnosis. The low density
lipoprotein measurements also involve uncertainties resulting in different categorical classifications.
Additionally we shall use simulated data to investigate likelihood analysis of models for rank-ordered
categorical traits; many traits show variation in disease severity, age-of-onset or other ordered categorical features.
D.5 Segregation and linkage analysis for complex genetic traits.
The mixed model is characterised by having both fixed and random effects as components of phenotype. Moreover, in genetics, the mixed model refers to the case where components of each type are
heritable; "major genes" contribute effects dependent on the discrete major genotype, while "polygenic
background" contributes the random effects. The mixed model in its classical form has not contributed
much to the resolution of complex genetic traits. The future of resolving the genetic components of
such traits is now generally accepted to lie in linkage analysis. Only by linking the major gene components to a marker locus will the genetic components be resolved. However, for human familial data,
linkage analysis and mapping of quantitative trait loci (Lander and Botstein, 1989) is not straightforward. Thus useful mixed models must include not only the classical components of fixed and random,
heritable and non-heritable, effects, but the major-gene components may involve also linked markers.
D.5.1. The mixed model for complex traits
We may write the likelihood under a mixed model in the generic form
\[
L(\theta) = \sum_{G} \int_{z} P_\theta(y \mid z, G)\, P_\theta(G)\, dP_\theta(z) \tag{D11}
\]
where G incorporates the major genes, and z incorporates all random effects. (For simplicity, one may
think of z being the vector of breeding values, and G the vector of genotypes at a single locus, but the
model would normally be more complex than this.) The mixed model likelihood is intrinsically unpeelable; it is a multivariate normal mixture with as many components as there are major genotype combinations on the pedigree.
However, for given z,
\[
f_\theta(y \mid z) = \sum_{G} P(y \mid z, G)\, P(G) \tag{D12}
\]
is computable by regular peeling, and for given G
\[
f_\theta(y \mid G) = \int_{z} P(y \mid z, G)\, dP(z) \tag{D13}
\]
is computable by polygenic peeling. Thus the problem is particularly suited to Monte Carlo evaluation,
where either z or G is sampled from an appropriate distribution.
Now the Gibbs sampler provides a simple and effective method for the conditional sampling of
genotypes given phenotypes. Yet again, the forms (D11) - (D13) do not provide useful sampling procedures. Instead, we require the analogues of (D2); that is
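\begin{align}
\frac{L(\theta)}{L(\theta_0)}
&= \sum_{G} \int_{z} \frac{f_\theta(z)\, P_\theta(G)\, f_\theta(y \mid z, G)}{f_{\theta_0}(z)\, P_{\theta_0}(G)\, f_{\theta_0}(y \mid z, G)}\, dP_{\theta_0}(z, G \mid y) \tag{D14} \\
&= \int_{z} \frac{f_\theta(z)\, f_\theta(y \mid z)}{f_{\theta_0}(z)\, f_{\theta_0}(y \mid z)}\, dF_{\theta_0}(z \mid y) \tag{D15} \\
&= \sum_{G} \frac{P_\theta(G)\, f_\theta(y \mid G)}{P_{\theta_0}(G)\, f_{\theta_0}(y \mid G)}\, P_{\theta_0}(G \mid y) \tag{D16}
\end{align}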
In this example, G denotes the underlying major genotype contribution and z the polygenic contribution
(which may also include common environmental effects). In (D14), Monte Carlo sampling would be
effected with respect to (w.r.t) both G and z. In form (D15), sampling would be w.r.t z, with major
gene likelihoods fθ (y | z) computed by Mendelian peeling. In (D16), sampling would be w.r.t G, with
polygenic likelihoods fθ (y | G) for given G computed by polygenic peeling. In every case, the sampling is conditional on the data y and is achieved by the Gibbs sampler. Moreover, this form of sampling is particularly effective in focusing the sample on those values of G and/or z which are a posteriori probable, given the data y and the parameter value θ 0 . It is thus far more efficient than any "genedrop" sampling of the latent genotypic variables.
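The step behind the first of these forms can be made explicit. The short derivation below is a sketch under two assumptions that the text does not state in this section: that z has a density f_θ(z), so that dP_θ(z) = f_θ(z) dz, and that the θ0 terms are positive wherever the θ terms are.

```latex
\begin{align*}
P_{\theta_0}(z, G \mid y)
  &= \frac{f_{\theta_0}(z)\, P_{\theta_0}(G)\, f_{\theta_0}(y \mid z, G)}{L(\theta_0)}, \\
\sum_{G} \int_{z}
  \frac{f_\theta(z)\, P_\theta(G)\, f_\theta(y \mid z, G)}
       {f_{\theta_0}(z)\, P_{\theta_0}(G)\, f_{\theta_0}(y \mid z, G)}
  \, dP_{\theta_0}(z, G \mid y)
  &= \frac{1}{L(\theta_0)} \sum_{G} \int_{z}
     f_\theta(z)\, P_\theta(G)\, f_\theta(y \mid z, G)\, dz
   = \frac{L(\theta)}{L(\theta_0)}.
\end{align*}
```

The forms in which only z or only G is sampled follow in the same way, after first summing over G (by Mendelian peeling) or integrating over z (by polygenic peeling) inside the ratio.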
One interesting question here is whether it is indeed more effective to sample w.r.t the major
genotype or the polygenic components. Traditionally, the mixed model likelihood is thought of as a
mixture of polygenic likelihoods, but it is equally a multinormal mixture of major locus likelihoods.
The relative efficiencies of the two alternative Monte Carlo views will depend on the relative contributions of the two components, in a way that needs to be investigated. Many other aspects of this simulation approach require thorough investigation -- for example, the best prescriptions for simulation values
of θ 0 in order to obtain an overview of the likelihood surface L(θ ).
D.5.2. Linkage analysis for complex traits
In the likelihood equation (D11), the major genotype component G may involve not only loci
affecting the trait, but also linked markers, and the genetic model parameters θ may include recombination fractions. Thus the forms (D14)-(D16) which may be used to obtain Monte Carlo estimates of likelihood ratios also include the case of linkage analysis for a complex trait. In the Gibbs sampler approach
to likelihood evaluation, potentially linked markers add dimensions to the Monte Carlo sampling, but do
not alter the approach in any fundamental way. We shall include linkage analysis in our implementation
and evaluate the techniques using again primarily the heart disease data where there are data for numerous Apolipoprotein markers (see C.4).
D.6 Pedigree and sampling design
D.6.1 Power of a study
Simulation from the posterior distribution of genetic factors, given partial phenotypic information, or given only a pedigree structure, is the basic ingredient of studies of statistical power. The same
simulation procedures that are used in the evaluation of likelihoods can be used in the evaluation of
expected likelihoods or log-likelihood differences. For example, just as Lange and Matthysse (1989) use
a Metropolis algorithm to investigate the power of a linkage study, given phenotypic data on the disease
trait, the Gibbs sampler can be used to provide conditional or unconditional estimates of power of the
wide variety of likelihood analyses described above. Such power estimates are an important aspect of
any investigation of statistical procedures; it is important to know how many data are likely to be
required in order to detect effects of given types and magnitude.
D.6.2 Sequential design; use of the Gibbs sampler
In human genetics, the opportunity to design studies for optimal statistical properties is very limited. Nonetheless, sampling strategy is important, and in a long-term study questions of sequential sampling arise -- some pedigrees are followed up, others, for a variety of reasons, are not. The statistical
information provided by alternative sampling strategies is another important aspect of any assessment of
a methodological procedure. Thompson (1983) has shown that this information is closely related to
uncertainties in the genotypes of individuals available for sampling, conditional on the phenotypes of
those already observed. Thus, yet again, the Gibbs sampler provides an ideal method of estimating such
information, since its output consists precisely of realisations from the relevant conditional distribution.
D.7 Program and software availability
In the course of developing methods for the likelihood analysis of genetic models on extended
pedigrees, and testing and assessing these methods, we shall write many computer programs. Although
it is far beyond the scope of this proposal to produce a "package" of programs, we shall, for greater reliability, efficiency and our own convenience, use a single framework for handling pedigrees and pedigree
data, document programs adequately, and establish a library of executable programs that will serve as
building blocks for more extended studies. We write programs in C in a Unix environment, and are
using the pedigree neighbourhood structures of Sheehan (1989) in preliminary studies (e.g. Thompson
and Shaw, 1990). Our current objective is to implement and investigate the proposed methods. All programs developed will be made available to other interested researchers, but it would be premature to
propose a package of programs for complex segregation analysis. Before this is a realistic proposition,
the methods must be developed and their properties studied.
Literature cited.
Amos, C.I., R.C. Elston, S.R. Srinivasan, A.F. Wilson, J.L. Cresanta, L.J. Ward and G.S. Berenson.
1987. Linkage and segregation analyses of apolipoproteins A1 and B, and lipoprotein cholesterol
levels in a large pedigree with excess coronary heart disease: the Bogalusa heart study. Genet
Epidemiol 4: 115-128.
Austin, M.A., M.C. King, R.D. Bawol, S.B. Hulley, and G.D. Friedman. 1987. Risk factors for coronary
heart disease in adult female twins. Am J Epidemiol 125: 308-318.
Austin, M.A., M.C. King, K.M. Vranizan, B. Newman and R.M. Krauss. 1988. Inheritance of low-density lipoprotein subclass patterns: results of complex segregation analysis. Am J Hum Genet
43: 838-846.
Austin, M.A., J.D. Brunzell, W.L. Fitch and R.M. Krauss. 1990. Inheritance of low density lipoprotein
subclass patterns in familial combined hyperlipidemia. Arteriosclerosis, in press.
Beaty, T.H., P.O. Kwiterovich Jr, M.J. Khoury, S. White, P.S. Bachorik, H.H. Smith, B. Teng and A. Sniderman. 1986. Genetic analysis of plasma sitosterol, Apoprotein B, and lipoproteins in a large
Amish pedigree with sitosterolemia. Am J Hum Genet 38: 492-504.
Berg, K. 1987. Genetics of coronary heart disease and its risk factors. In: Molecular approaches to
human polygenic disease. Ciba Symposium 130. John Wiley & Sons, New York, pp 14-33.
Berg, K. 1989. Impact of medical genetics on research and practices in the area of cardiovascular disease. Clin Genet 36: 299-312.
Bird, T.D., S.M. Sumi, E.J. Nemens, D. Nochlin, G. Schellenberg, T.H. Lampe, A. Sadovnick, H. Chui,
G.W. Miner and J. Tinklenberg. 1989. Phenotypic heterogeneity in familial Alzheimer’s disease:
A study of 24 kindreds. Annals of Neurology 25: 12-25.
Bishop, D.T., L. Cannon-Albright, T. McLellan, E. J. Gardner and M. H. Skolnick. 1988. Segregation
and linkage analysis of nine Utah breast cancer pedigrees. Genet Epidemiol 5: 151-170.
Blangero, J., C. Kammerer, L. Konigsberg, J. Hixson, and J. MacCluer. 1989. Statistical detection of
genotype-environment interaction: a multivariate measured genotype approach. Am J Hum
Genet 45: A234.
Bonney, G.E. 1986. Regressive logistic models for familial disease and other binary traits. Biometrics
42: 611-626.
Boehnke, M. 1986. Estimating the power of a proposed linkage study: a practical computer simulation
approach. Am J Hum Genet 39: 513-527.
Brown, M.S. and J.L. Goldstein. 1986. A receptor-mediated pathway for cholesterol homeostasis. Science 232: 34-47.
Brunzell, J.B., H.G. Schrott, A.G. Motulsky and E.L. Bierman. 1976. Myocardial infarction in the
familial forms of hypertriglyceridemia. Metabolism 25: 313-320.
Cannings, C., E. A. Thompson, and M. H. Skolnick. 1978. Probability functions on complex
pedigrees. Adv Appl Prob 10: 26-61.
Cannings, C., E. A. Thompson and M. H. Skolnick. 1980. Pedigree analysis of complex models. In
Current Developments in Anthropological Genetics (J. Mielke and M. Crawford, eds.). Plenum
Press, New York pp. 251-298.
Cannon-Albright, L.A., M. H. Skolnick, T. Bishop, R. G. Lee and R. W. Burt. 1988. Common inheritance of susceptibility to colonic adenomatous polyps and associated colorectal cancers. NEJM
319: 533-537.
Davignon, J., R.E. Gregg, and C.F. Sing. 1988. Apolipoprotein E polymorphism and atherosclerosis.
Arteriosclerosis 8: 1-21.
Deeb, S., A. Failor, B.G. Brown, J. Brunzell, J. Albers, A.G. Motulsky and E. Wijsman. 1990. Association of apolipoprotein B gene variants with plasma apoB and LDL cholesterol levels. Submitted.
Demenais, F. and L. Abel. 1989. Robustness of the unified model to shared environmental effects in the
analysis of dichotomous traits. Genet Epidemiol 6: 229-234.
Dempster, A.P., N. M. Laird and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the
EM algorithm. J Roy Statist Soc, Series B 39, 1-38.
Elston, R.C. and J. Stewart. 1971. A general model for the genetic analysis of pedigree data. Human
Heredity, 21: 523-542.
Elston, R. C., J. W. MacCluer, S. E. Hodge, M. A. Spence, and R. H. King. 1989. Genetic Analysis
Workshop 6: Linkage analysis based on affected pedigree members. In Multipoint Mapping and
Linkage Based on Affected Pedigree Members. R. C. Elston, M. A. Spence, S. E. Hodge and J.
W. MacCluer (eds.). Alan R. Liss, New York. Pp. 93-103.
Feinleib, M., R.J. Garrison, R. Fabsitz, J.C. Christian, Z. Hrubec, N.O. Borhani, W.B. Kannel, R. Rosenman, J.T. Schwartz and J.O. Wagner. 1977. The NHLBI twin study of cardiovascular disease risk
factors: methodology and summary of results. Am J Epidemiol 106: 284-295.
Ferns, G.A.A. and D.J. Dalton. 1986. Frequency of XbaI polymorphism in myocardial infarct survivors.
Lancet ii: 572.
Fujiwara, T.M., K. Morgan, R. H. Schwartz, R. A. Doherty, S. R. Miller, K. Klinger, P. Stanislotovitis,
N. Stuart and P. C. Watkins. 1989. Genealogical analysis of cystic fibrosis families and chromosome 7q RFLP haplotypes in the Hutterite Brethren. Am J Hum Genet 44: 327-337.
Gelfand, A. E. and Smith, A. F. M. 1990. Sampling-based approaches to calculating marginal densities.
JASA 85: 398-409.
Geman S. and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration
of images. IEEE Trans. of Pattern Analysis and Machine Intelligence 6: 721-741.
Geyer, C. J. 1990. Likelihood and Exponential Families. Ph. D. Thesis, University of Washington.
Goldstein, J.L., H.G. Schrott, W.T. Hazzard, E.L. Bierman and A.G. Motulsky. 1973. Hyperlipidemia in
coronary heart disease. II. Genetic analysis of lipid levels in 176 families and delineation of a
new inherited disorder, combined hyperlipidemia. J Clin Invest 52: 1544-1568.
Graser, H.-U., S. P. Smith and B. Tier. 1987. A derivative-free approach for estimating variance components in animal models by restricted maximum likelihood. J Animal Science 64, 1362-1370.
Greenberg, D.A. and S.E. Hodge. 1989. Linkage analysis under "random" and "genetic" reduced penetrance. Genet Epidemiol 6: 259-264.
Guo, S-W. and Thompson, E. A. Performing the exact test of Hardy-Weinberg equilibrium for multiple
alleles. Biometrics in press.
Hammersley, J. M. and D. C. Handscomb. 1964. Monte Carlo Methods. Methuen and Co.: London.
Hanis, C. and R. Chakraborty. 1984. Nonrandom sampling in human genetics: familial correlations.
I.M.A. J Math Appl Med & Biol 1: 193-213.
Haseman, J.K. and R.C. Elston. 1972. The investigation of linkage between a quantitative trait and a
marker locus. Behav Genet 2: 3-19.
Hasstedt, S.J. and P. Cartwright. 1979. PAP -- Pedigree Analysis Package. Technical report No. 13,
Department of Medical Biophysics and Computing, University of Utah.
Hasstedt, S.J. 1982. A mixed-model likelihood approximation on large pedigrees. Computers in
Biomedical Research, 15, 295-307.
Hasstedt S.J., L. Wu and R.R. Williams. 1987. Major locus inheritance of apolipoprotein B in Utah
pedigrees. Genet Epidemiol 4: 67-76.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57: 97-109.
Hegele, R.A., L.S. Huang, P.N. Herbert, C.B. Blum, J.E. Buring, C.H. Hennekens, and J.L. Breslow.
1986. Apolipoprotein B-gene polymorphisms associated with myocardial infarction. New Eng J
Med 315: 1509-1515.
Henderson, C.R. 1986. Recent developments in variance and covariance estimation. J of Animal Science 63, 208-216.
Heuch I. and F. M. F. Li. 1972. PEDIG -- A computer program for calculation of genotype probabilities
using phenotypic information. Clin Genet 3: 501-504.
Hilden, J. 1970. GENEX -- An algebraic approach to pedigree probability calculus. Clin Genet 1:
319-348.
Hopkins, P.N. and R.R. Williams. 1989. Human Genetics and coronary heart disease: a public health
perspective. Ann Rev Nutr 9: 303-345.
Innerarity, T.L., K.H. Weisgraber, K.S. Arnold, R.W. Mahley, R.M. Krauss, G.L. Vega and S.M. Grundy.
1987. Familial defective apolipoprotein B-100: low density lipoproteins with abnormal receptor
binding. PNAS 84: 6919-6923.
Johnson, N. L. and S. Kotz. 1972. Distributions in Statistics: Continuous Multivariate Distributions.
Wiley, New York.
Kagen, A., B.R. Harris, W. Winkelstein, K.G. Johnson, H. Kato, S.L. Syme, G.G. Rhoads, M.L. Gay,
M.Z. Nichaman, H.B. Hamilton and J. Tillotson. 1974. Epidemiologic studies of coronary heart
disease and stroke in Japanese men living in Japan, Hawaii and California: demographic, physical, dietary, and biochemical characteristics. J Chronic Dis 27: 345-364.
Khoury, M.J., T.H. Beaty, K-Y. Liang. 1988. Can familial aggregation of disease be explained by familial aggregation of environmental risk factors? Am J Epidemiol 127: 674-683.
Klinger, K., G. T. Horn, P. Stanislotovitis, R. H. Schwartz, T.M. Fujiwara and K. Morgan. 1990. Cystic
fibrosis mutations in the Hutterite Brethren. Am J Hum Genet 46: 983-987.
Kunkel, L.M., A.P. Monaco, W. Middlesworth, H.D. Ochs, S.A. Latt. 1985. Specific cloning of DNA
fragments absent from the DNA of a male patient with an X chromosome deletion. PNAS 82:
4778-4782.
Lalouel, J.-M., D. C. Rao, N. E. Morton and R. C. Elston. 1983. A unified model for complex segregation analysis. Am J Hum Genet 35: 816-826.
Lander, E.S. and D. Botstein. 1986. Strategies for studying heterogeneous genetic traits in humans by
using a linkage map of restriction fragment length polymorphisms. Proc Natl Acad Sci 83:
7353-7357.
Lander, E. S. and D. Botstein. 1989. Mapping Mendelian factors underlying quantitative traits using
RFLP linkage maps. Genetics 121: 185-199.
Lange, K. and R. C. Elston. 1975. Extensions to pedigree analysis: likelihood computations for simple
and complex pedigrees. Hum. Hered. 25: 95-105.
Lange, K., J. Westlake and M. A. Spence. 1976. Extensions to pedigree analysis. III. Variance components by the scoring method. Ann. Hum. Genet. 39: 485-491.
Lange, K. and S. Matthysse. 1989. Simulation of pedigree genotypes by random walks. Am J Hum
Genet 45, 959-970.
Langlois, S., S. Deeb, J.D. Brunzell, J.J. Kastelein, and M.R. Hayden. 1989. A major insertion accounts
for a significant proportion of mutations underlying human lipoprotein lipase deficiency. PNAS
86: 948-952.
MacCluer, J. W., J. L. VandeBerg, B. Read, and O. A. Ryder. 1986. Pedigree analysis by computer simulation. Zoo Biology, 5, 147-160.
McAlpine, P.J., T.B. Shows, C. Boucheix, L.C. Stranc, T.G. Barent, A.J. Pakstis and R.C. Doute. 1989.
Report of the nomenclature committee and the 1989 catalog of mapped genes. HGM10. In Cytogenet and Cell Genet 51: 13-66.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller. 1953. Equation of
state calculations by fast computing machines. J. Chem. Phys. 21: 1087-1092.
Meyers, D.A., W. B. Bias and D. G. Marsh. 1982. A genetic study of total IgE levels in the Amish.
Hum. Hered. 32: 15-23.
Monaco, A.P., C.J. Bertelson, W. Middlesworth, C.A. Colletti, J. Aldridge, K.H. Fischbeck, R. Bartlett,
M.A. Pericak-Vance, A.D. Roses, and L.M. Kunkel. 1985. Detection of deletions spanning the
Duchenne muscular dystrophy locus using a tightly linked DNA segment. Nature 316: 842-845.
Morton, N.E. 1955. Sequential tests for the detection of linkage. Am J Hum Genet 7: 277-318.
Morton, N.E. and C.J. MacLean. 1974. Analysis of family resemblance. III. Complex segregation of
quantitative traits. Am J Hum Genet 26: 489-503.
Namboodiri, K.K., P.P. Green, E.B. Kaplan, J.A. Morrison, G.A. Chase, R.C. Elston, A.R.G. Owen,
R.M. Rifkind, C.J. Glueck and H.A. Tyroler. 1984. The collaborative lipid research clinics program family study. IV. Familial associations of plasma lipids and lipoproteins. Am J Epidemiol
119: 975-996.
Neufeld, H.N. and U. Goldbourt. 1983. Coronary heart disease: genetic aspects. Circulation 67:
943-954.
Ott, J. 1976. A computer program for linkage analysis of general human pedigrees. Am J Hum Genet
28: 528-529.
Ott, J. 1979. Maximum likelihood estimation by counting methods under polygenic and mixed models
in human pedigrees. Am J Hum Genet 31: 161-175.
Ott, J. 1985. Analysis of human genetic linkage. The Johns Hopkins University Press, Baltimore, MD.
Ott, J. 1990. Cutting a Gordian Knot in the linkage analysis of complex human traits. Am J Hum Genet
46: 219-221.
Perkins, K.A. 1986. Family history of coronary heart disease: is it an independent risk factor? Am J
Epidemiol 124: 182-194.
Ploughman, L.M. and M. Boehnke. 1989. Estimation of the power of a proposed linkage study for a
complex genetic trait. Am J Hum Genet 44, 543-551.
Prokosch, H.U., Seuchter, S.A., Thompson, E.A. and Skolnick, M.H. 1989. Applying expert system
techniques to human genetics. Computers and Biomedical Research, 22, 234-247.
Rao, D.C., N.E. Morton, C.L. Gulbrandsen, G.G. Rhoads, A. Kagan and S. Yee. 1979. Cultural and
biological determinants of lipoprotein concentrations. Ann Hum Genet 42: 467-477.
Riordan, J.R., J.M. Rommens, B-S. Kerem, N. Alon, R. Rozmahel, Z. Grzelczak, J. Zielenski, S. Lok,
N. Plavsic, J-L. Chou, M.L. Drumm, M.C. Iannuzzi, F.S. Collins, L-C. Tsui. 1989. Identification
of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245:
1066-1073.
Risch, N. 1990a. Linkage strategies for genetically complex models. I. Multilocus models. Am J Hum
Genet 46: 222-228.
Risch, N. 1990b. Genetic linkage and complex diseases: a response. Genet Epidemiol 7: 41-45.
Robertson, F.W. 1981. The genetic component in coronary heart disease: a review. Genet Res 37:
1-16.
Robertson, F.W. and A.M. Cumming. 1985. Effects of apoprotein E polymorphism on serum lipoprotein
concentration. Arteriosclerosis 5: 283-292.
Schaeffer, L. R. and B. W. Kennedy. 1986. Computing strategies for solving mixed model equations. J
of Dairy Science 69, 575-579.
Shaw, R.G. 1987. Maximum likelihood approaches applied to quantitative genetics of natural populations. Evolution 41: 812-826.
Sheehan, N. A. 1989. Image Processing Procedures Applied to the Estimation of Genotypes on Pedigrees. Technical report No. 176, Department of Statistics, University of Washington.
Sheehan, N. A., A. Possolo and E. A. Thompson. 1989. Image processing procedures applied to the
estimation of genotypes on pedigrees. Am J Hum Genet 45 (Suppl.): A248.
Sheehan, N. A. 1990. Genetic reconstruction on Pedigrees. Ph. D. Thesis, University of Washington.
Sing, C.F. and J.D. Orr. 1978. Analysis of genetic and environmental sources of variation in serum
cholesterol in Tecumseh, Michigan IV. Separation of polygene from common environmental
effects. Am J Hum Genet 30: 491-504.
Sing, C.F. and J. Davignon. 1985. Role of the apolipoprotein E polymorphism in determining normal
plasma lipid and lipoprotein variation. Am J Hum Genet 37: 268-285.
Sing, C.F. and P.P. Moll. 1989. Genetics of variability of CHD risk. Int J Epidemiol 18 (Suppl
1):S183-S195.
Sosenko, J.M., J.L. Breslow, R.C. Ellison and O.S. Miettinen. 1980. Familial aggregation of total
cholesterol, high density lipoprotein cholesterol and total triglyceride levels in plasma. Am J Epidemiol 112: 656-660.
Suarez, B.K. and P. Van Eerdewegh. 1984. A comparison of three affected-sib-pair scoring methods to
detect HLA-linked disease susceptibility genes. Am J Med Genet 18: 135-146.
Thomas, A. W. 1985. Data Structure, Methods of Approximation and Optimal Computation for Pedigree Analysis. Ph. D. Thesis, University of Cambridge.
Thompson, E.A. 1976. Peeling programs for zero-loop pedigrees. December, 1976. Technical Report
No. 5, Department of Medical Biophysics & Computing, University of Utah.
Thompson, E. A. 1978. Ancestral inference. II. The founders of Tristan da Cunha. Ann Hum Genet 42:
239-253.
Thompson, E.A., Kravitz, K., Hill, J. and Skolnick, M.H. 1978. Linkage and the power of a pedigree
structure. In Genetic Epidemiology, N.E. Morton (ed.), pp. 247-253. Academic Press, New York.
Thompson, E. A. 1981. Pedigree analysis of Hodgkin’s disease in a Newfoundland genealogy. Ann
Hum Genet 45: 279-292.
Thompson, E.A. 1983. Optimal sampling for pedigree analysis: parameter estimation and genotype
uncertainty. Theor Pop Biol 24, 39-58.
Thompson, E.A. 1986. Pedigree Analysis in Human Genetics. Johns Hopkins University Press, Baltimore, MD.
Thompson E.A. 1988. Two-locus and three-locus gene identity by descent in pedigrees. I.M.A. J Math
Appl in Med and Biol 5, 261-280.
Thompson E.A. and Morgan K. 1989. Recursive descent probabilities for rare recessive lethals. Ann
Hum Genet 53, 357-374.
Thompson E.A. and Shaw R.G. 1990. Pedigree analysis for quantitative traits; variance components
without matrices. Biometrics 46, 399-414.
Thompson, R. and K. Meyer. 1986. Estimation of variance components: what is missing in the EM
algorithm? J Statist Comp and Simulation 24, 215-230.
Utermann, G. 1989. The mysteries of lipoprotein(a). Science 246: 904-910.
Weeks, D.E. and K. Lange. 1988. The affected-pedigree-member method of linkage analysis. Am J
Hum Genet 42: 315-326.
Wijsman, E. M. 1987. A deductive method of haplotype analysis in pedigrees. Am J Hum Genet 41:
356-373.
Wright, S. and H. C. McPhee. 1925. An approximate method of calculating coefficients of inbreeding
and relationship from livestock pedigrees. J Agric Res 31: 377-383.
Wright, S. 1934. An analysis of variability of number of digits in an inbred strain of guinea pigs.
Genetics 19: 506-536.
Wright, S. 1935. A mutation of the guinea pig tending to restore the pentadactyl foot when heterozygous, producing a monstrosity when homozygous. Genetics 20: 84-107.
Xu, G., P. O’Connell, D. Viskochil, R. Cawthon, M. Robertson, M. Culver, D. Dunn, J. Stevens, R.
Gesteland, R. White and R. Weiss. 1990. The neurofibromatosis type I gene encodes a protein
related to GAP. Cell 62: 599-608.