Modeling of Identity-by-Descent Processes Along a

Copyright © 2011 by the Genetics Society of America
DOI: 10.1534/genetics.111.127720
Modeling of Identity-by-Descent Processes Along a Chromosome Between
Haplotypes and Their Genotyped Ancestors
Tom Druet*,1 and Frederic Paul Farnir†
*Unit
of Animal Genomics, Faculty of Veterinary Medicine and Centre for Biomedical Integrative Genoproteomics and †Unit of Animal
Productions, Faculty of Veterinary Medicine, University of Liège, B-4000 Liège, Belgium
Manuscript received February 13, 2011
Accepted for publication March 3, 2011
ABSTRACT
Identity-by-descent probabilities are important for many applications in genetics. Here we propose
a method for modeling the transmission of the haplotypes from the closest genotyped relatives along an
entire chromosome. The method relies on a hidden Markov model where hidden states correspond to the
set of all possible origins of a haplotype within a given pedigree. Initial state probabilities are estimated from
average genetic contribution of each origin to the modeled haplotype while transition probabilities are
computed from recombination probabilities and pedigree relationships between the modeled haplotype
and the various possible origins. The method was tested on three simulated scenarios based on real data sets
from dairy cattle, Arabidopsis thaliana, and maize. The mean identity-by-descent probabilities estimated for
the truly inherited parental chromosome ranged from 0.94 to 0.98 according to the design and the marker
density. The lowest values were observed in regions close to crossing over or where the method was not able
to discriminate between several origins due to their similarity. It is shown that the estimated probabilities
were correctly calibrated. For marker imputation (or QTL allele prediction for fine mapping or genomic
selection), the method was efficient, with 3.75% allelic imputation error rates on a dairy cattle data set with
a low marker density map (1 SNP/Mb). The method should prove useful for situations we are facing now in
experimental designs and in plant and animal breeding, where founders are genotyped with relatively high
markers densities and last generation(s) genotyped with a lower-density panel.
T
WO alleles at a single locus are identical-by-descent
(IBD) if they derive from a common ancestor without mutation. The probability for two alleles to be IBD
(IBDp) is an information of high importance as it is
used in several applications such as gene or QTL mapping (e.g., Meuwissen and Goddard 2004), markerassisted (Fernando and Grossman 1989), or genomic
selection (De Roos et al. 2007; Calus et al. 2008), haplotype reconstruction, or imputation (Kong et al. 2008).
Estimation of IBDp may rely on linkage or linkage
disequilibrium information, or on both. With linkage,
the estimation of IBDp is based on Mendelian segregation rules, recombination, pedigree, and genotypes.
With such information, IBDp are most often estimated
between parents and offspring with the sole informative
flanking markers (e.g., Fernando and Grossman 1989;
Wang et al. 1995; Pong-Wong et al. 2001). IBDp between more distant relatives are obtained by transmitting the parent–offspring information along the path
between the relatives. If some individuals on this path
are not genotyped, transmission of chromosomes can
no longer be traced unambiguously. In addition, since
Supporting information is available online at http://www.genetics.org/
cgi/content/full/genetics.111.127720/DC1.
1
Corresponding author: Unit of Animals Genomics, GIGA-R B34, 1 avenue de L’Hôpital, B-4000 Liège, Belgium. E-mail: [email protected]
Genetics 188: 409–419 ( June 2011)
the information relies only on flanking markers, it is
rapidly eroded along the pedigree path.
Other methods estimate IBDp probabilities using
multipoint linkage and segregation analysis and by
considering all the pedigree links. Lander and Green
(1987) proposed such a method considering each alternative gene flow pattern in a pedigree separately. However, it results in high computational costs when the
pedigree increases. Abecasis et al. (2002) described
an efficient algorithm for the analysis of dense genetic
maps in pedigree. Their program, called Merlin, proved
faster (and used less memory) than softwares based on
the Lander–Green algorithm. Although very efficient,
Merlin nevertheless becomes also impractical in large
pedigrees with complex structures such as those used in
animal and plant breeding.
Methods such as Loki (Heath 1997) or Morgan
(Wijsman et al. 2006) consider multipoint linkage and
segregation analysis using MCMC techniques to jointly
estimate IBDp or haplotypes probabilities. Although
fairly general and potentially using all the available information, these methods are also very demanding
from a computational point of view—especially in problems facing large pedigrees and in contexts of highdensity genotyping. Furthermore, these approaches
are always faced with the problem of getting stuck in
a local maximum region.
410
T. Druet and F. Farnir
A common feature to all the methods based on linkage is that the definition of the IBDp refers to a common ancestor within the pedigree: regions will be
declared to be IBD only if they can be shown to descend from one common ancestral region within the
pedigree. The situation is different with linkage disequilibrium methods (e.g., Meuwissen and Goddard
2001; Hernandez-Sanchez et al. 2006), where IBDp
are computed with reference to unobserved ancestors
of the studied population. The assumptions are that these
common ancestors were present T generations before
the current population and that the studied mutations
are old and were already present within these ancestors.
The size of the haplotypes shared IBD by two gametes is
relatively small and small haplotypes are used to estimate
IBDp. As a consequence, such methods require higher
marker densities than for linkage approaches, focus on
relatively small segments surrounding the region-ofinterest, and basically ignore pedigree relationships.
With current marker maps (e.g., 50K SNP chips for cattle or maize), these methods also have a high computational burden.
In plant and animal breeding, as in experimental designs, new more cost-effective genotyping strategies are
being implemented. Indeed, important founder individuals are genotyped at high density whereas other individuals are genotyped at lower density. For example, in the
nested association mapping (NAM) design (Yu et al.
2008) or in multiparent advanced generation intercrosses (MAGIC) (e.g., Threadgill et al. 2002; Kover
et al. 2009), the recombinant inbred lines are phenotyped
and genotyped at low density whereas founder lines are
genotyped at high density. In dairy cattle, important sires
could be genotyped using dense panels (or even sequenced), while younger animals (e.g., those without phenotype) can be genotyped at lower density (e.g., Habier
et al. 2009). In such designs, estimation of IBDp between
founder and nonfounder individuals might be particularly useful for applications such as QTL mapping,
genomic selection, or marker imputation. The IBDp estimation method should then rely on pedigree information
and linkage (because some animals are genotyped at
lower density), be computationally efficient, and handle
complex pedigrees with overlapping generations where
genotyped founders are spread over several generations.
In this article, we propose an alternative and efficient
method with which to describe the IBD process for a whole
chromosome between target individuals and their genotyped ancestors, taking benefit of the marker density and
using recombination probabilities and known pedigree.
MATERIALS AND METHODS
Estimation of IBD probabilities: The approach proposed
here is a linkage method using information from genotyped
ancestors in the pedigree even in situations where some
individuals in the pedigree are not genotyped. It does not
estimate IBDp with nonancestors.
Any chromosome of today’s individuals is a mosaic of ancestral chromosomes. In theory, except for mutation events,
any piece of chromosome could be traced back to a corresponding piece of an ancestral chromosome in the individual’s pedigree. Inferring the ancestral origin of genotyped
chromosomes can be done using probability statements: each
locus on a chromosome has a computable probability to descend from ancestral chromosomes taken in a subset C of the
possible chromosomes set V.
The set of chromosomes V: This set contains all the chromosomes in a given pedigree. For a pedigree containing n (genotyped or not) individuals, the dimension of V is 2 · n.
The subset C(c) of potential parental chromosomes of a given
chromosome c: For each target chromosome (TC) c in the pedigree, the potential parental chromosomes (PC) of c constitute a subset C(c) of V. The members of C can be obtained as
follows (an example is provided in supporting information,
File S1): starting from c and going up the pedigree (i.e., backward in time), we obtain the two possible (i.e., parental) origins of c in the previous generation. If this parent is
genotyped, these two chromosomes are included in (and
make up) C(c) and the process ends. If not, this process is
repeated for each of the two parental chromosomes until:
• either one finally reaches genotyped chromosomes,
which are then included in C(c), or
• either one reaches an ungenotyped chromosome for
which no genotyped PC exists. This ungenotyped chromosome is then included in C(c) as a “phantom”
chromosome.
Any locus on the TC is IBD with the corresponding locus on
one of these chromosomes.
In terms of probabilities, the conP
sequence is that CðcÞ
P ðcðkÞ [ xðkÞÞ 5 1 when the sum is taken
x
over all chromosomes making up C(c), where P(c(k) [ x(k))
denotes the probability for TC c to be IBD with x at a givenlocus k.
Modeling inherited parental chromosomes along the chromosome:
According to the way the ancestral chromosomes set C has
been defined, the process leading to the formation of a mosaic
of pieces of the members of C can be seen as follows:
• The beginning of the chromosome can correspond to
any of the chromosomes ci in C with a probability that
can be estimated (see Initial state probabilities).
• Considering any pair of positions k and k 1 1 along the
chromosome, there is a possibility to switch from ci in
position k (noted ci(k)) to another member cj of C in
position k 1 1, and the probability of these switches
can also be computed. These “transition probabilities”
depend on the couple (ci, cj) and on the distance between the positions k and k 1 1.
A Markov process, where PC stands for states, can be used
to approximate this process (Thompson 1994). In that case,
the PC inherited (i.e., state) at locus k 1 1 is dependent on the
PC inherited at locus k, and is assumed conditionally independent on PC inherited at previous loci:
P c k 1 1 [ cj k 1 1 jc k [ ci k ; c k 2 1 [ ci 9 k 2 1 ;
c k 2 2 [ ci 99 k 2 2 ; . . .
5 P c k 1 1 [cj k 1 1 j c k [ci k :
Consequently, the genetic process can be approximated
through the stochastic process of a Markov chain: this reduces
to first, describing the prior probability to be in any given state
(PC) and second, to computing the transition probabilities
between PC.
Identity-by-Descent Between Haplotypes
In the case of crossover interference, the Markov property
is not valid. In addition, previous studies showed that in some
particular pedigrees, the Markov assumption is not appropriate (e.g., Mcpeek and Sun 2000; Broman 2005) but that the
Markov approximation may nevertheless be successful to describe the IBD process (Thompson 1994; Leutenegger et al.
2003; Broman 2005). In our study, the Markov approximation
might be applicable because simplified pedigrees are used:
only IBDp with the closest ancestors are estimated and chromosomes are modeled independently.
Initial state probabilities: The prior probability of any given PC
ci is simply 0:5Gi , where Gi is the number of generations between the PC ci and the TC c (see examples in File S1).
State transition probabilities: Transition between PC can potentially take place anywhere along the chromosome, but
information is available only for discrete positions (i.e., markers),
and so we can use these markers as indicators of possible
transitions between PC (as later described). The needed transition probabilities are then conditional probabilities:
P c k 1 1 [ cj k 1 1 j c k [ c i k :
ity by P[c(k) [ ci(k)], which is simply 0:5Gi . To summarize,
the transition probability is then:
h i
P c k 1 1 [ cj k 1 1 j c k [ci k 5 0:5ðGj 2Ga Þ · r · ð1 2 r ÞðGa 21Þ :
ð2Þ
Estimating the probability of one sequence of parental chromosomes:
When modeling the IBD process along the chromosome, Np
positions of TC are considered simultaneously. Since we have
NPC parental chromosomes, each locus must a priori be IBD
with one of these PC, and without further (i.e., marker) information, we have NPC possible parental origin for each of
the positions, which amounts to a total of Ns ¼ ðNPC ÞNp possible sequences of PC. All these sequences are not equally likely,
and the probability for c to correspond to any particular sequence s ¼ {pc1, pc2, . . ., pcNp } can be computed using the
Markovian property given above,
Np
Y
P s 5 P ðcð1Þ [ pc1 ð1ÞÞ
P ½cðkÞ [ pck ðkÞ j cðk 2 1Þ [ pck 21 ðk 2 1Þ;
k52
As already mentioned above, this probability is the probability that the TC c is IBD with the PC cj in position k 1 1 when
the TC is IBD with PC ci in position k.
Two situations exist:
• If i ¼ j, this means that no recombination occurred between these two positions for the Gi generations between
this PC and the TC. The probability is:
h i
P c k 1 1 [ cj k 1 1 j c k [ ci k 5 ð1 2 r ÞGi ;
411
(1)
where r is the recombination fraction between positions k and
k 1 1.
• If i 6¼ j, we can decompose the situation into three necessary events. The first two events lead to a chromosome
where positions k and (k 1 1) originate from PC i and j,
respectively. The last event is then similar to the previous
situation:
• On the pedigree paths from ci and cj to c, there must be at
least one common ancestor where the two paths merge.
We note this ancestor “a” (see examples in File S1). The
probability for the PCi and PCj to proceed down to this
ancestor at positions k and k 1 1, respectively, is
0:5ðGi 2Ga Þ · 0:5ðGj 2Ga Þ , where Ga is the number of generations between the ancestor a and the TC.
• After reaching this common ancestor, the two chromosomes need to recombine in the next meiosis, leading to
the needed hybrid chromosome m in the next generation, with m(k) ¼ ci(k) and m(k 1 1) ¼ cj(k 1 1). The
probability for this recombination is r. In addition, this
recombinant haplotype is transmitted to the next generation with probability 0.5, resulting in a total probability
of 0.5 · r for this event.
• For the remaining (Ga 2 1) generations, the hybrid chromosome must be handed down without recombination.
The probability to do that is 0:5ðGa 21Þ · ð12r ÞðGa 21Þ .
As a consequence, we have the following joint probability:
h i
P ci k [ c k ; cj k 1 1 [ c k 1 1 5 0:5ðGi 1Gj 2Ga Þ · r · ð1 2 r ÞðGa 2 1Þ :
The needed conditional probability P[c(k 1 1) [ cj(k 1 1)j
c(k) [ ci(k)] can be obtained by dividing this joint probabil-
where pck corresponds to the kth PC in the sequence.
Among these Ns sequences, ðNPC ÞNp 21 share the same PC in
position k (say, ci(k)). Consequently, IBDp between TC c and
PC ci at position k can be estimated by summing the probabilities over all these sequences where pck(k) equals ci(k),
P ðcðkÞ [ ci ðkÞÞ 5
Ns
i
X
h j
P s j *I pck ðkÞ 5 ci ðkÞ ;
(3)
j51
j
where s j is the jth possible sequence, pc k ðkÞ is the kth PC in
the jth sequence, and I(a ¼ b) is an indicator function equals
to 1 if a equals b and 0 otherwise. The number of sequences
increases exponentially with the number of positions and the
number of ancestral chromosomes to be considered, which
potentially renders this computation very demanding. Fortunately, the use of hidden Markov models (HMM) can provide
efficient computation of such probabilities.
Conditioning IBD probabilities on observed haplotypes: Chromosomes are not observed but genotypes are. Using these
genotypes, haplotypes can be obtained for TC and genotyped
PC (e.g., Windig and Meuwissen 2004; Druet and Georges
2010) and can be used to infer IBDp. The available genotypic
information allows conditioning of the sequence probabilities
from the previous section on the obtained haplotypes. P(s j j hc)
would be used instead of P(s j ) in Equation 3, where hc stands
for a haplotype (on chromosome c):
P s j P hc j s j
P s j j h c 5 P j9 c j9 :
j9 P s P h j s
(4)
Probabilities such as P(hc j s j) are obvious to compute: if, for
all the PC making up the sequence sj, the corresponding alleles
are identical to alleles of the TC (hc), then P(hc j s j) ¼ 1;
otherwise P(hc j s j) ¼ 0 (computation of P(hc j sj) is based
on nonmissing and phased markers only). To allow for genotyping errors and/or mutations, these probabilities can be
replaced by ð12eÞðNp 2Nd Þ eNd ; where e is a small value (typically, 0.001) and Nd is the number of differences between
alleles of hc and alleles of the PC in the sequence j.
For phantom PC, both genotypes and haplotypes are unknown. Therefore, we replace known or estimated haplotypes
by the probability that phantom PC carries allele i at marker
m, taken to be equal to the frequency of allele i at marker m.
412
T. Druet and F. Farnir
Extending to inbreeding: In case of inbreeding, several paths
might lead from the same PC to the TC (see File S1 for examples in cases of inbreeding). To deal with this problem, we first
derive IBDp along each path individually, and then sum over
all paths belonging to the same PC. The set of parental chromosomes C(c) is extended to the set of distinct paths P(c)
leading from the PC to the TC. In the absence of inbreeding,
C(c) and P(c) are identical and PC appear in only one path,
but when inbreeding is present some of the PC will be present
in several paths. Transition probabilities are computed for all
pairs of paths in P(c) and haplotypes from the PC at the top of
the path are used in Equation 4. In addition, paths from PC to
TC might share more than one common ancestor in which
recombination could occur. Equation 2 considered that paths
merge in a single ancestor. To take into account the possibility
of several common meioses between paths pi and pj, Equation
2 can be generalized to estimate transition probabilities between path pi at marker k and path pj at marker k 1 1,
h i
P c k 1 1 )pj k 1 1 j c k )pi k 5 0:5ðMi Þ · r ðMo Þ · ð1 2 r ÞðMs Þ ;
ð5Þ
where c(k) )p(k) indicates that c was inherited through
path p at position k, Mi is the number of meioses in path pj that
are not on the path pi. Ms represents the number of common
meioses that result in the transmission of the same chromosome (no crossing over) while Mo represents the number of
common meioses that result in the transmission of the complementary chromosome (crossing over). The sum of Ms and
M0 is equal to the number of common meioses between the
two paths and the sum of Mi, Ms, and Mo is equal to the
number of meioses in path pj.
In absence of inbreeding, there is a unique path from PC
to TC (there is a one-to-one relationship between path and PC
in that case) and Equations 1 and 2 are identical to Equation
5. Indeed, when i ¼ j (Equation 1), Ms ¼ Gi (Mi ¼ Mo ¼ 0)
and when i 6¼ j (Equation 2), Mi ¼ Gj 2 Ga, Mo ¼ 1, and Ms ¼
Ga 21.
A hidden Markov Model to model the IBD process along a
chromosome: HMM (e.g., Rabiner 1989) allow efficient computation of sequence probabilities conditionally on haplotypes
(Equation 4). This result can be achieved without estimating
the probability of all sequences by using the forward–backward
algorithm (Baum and Egon 1967). Actually, HMM have already been used to model IBD processes along a chromosome
in various applications (e.g., Lander and Green 1987; Guo
1994; Mott et al. 2000; Leutenegger et al. 2003; Scheet and
Stephens 2006; Thompson 2008). Rabiner (1989) described
HMM as unobserved Markov chains (hidden Markov chains)
and observed sequences of “emitted” symbols. The symbol
observed at position k of the sequence depends only on the
state of the hidden Markov chain at the same position. The
Markov chain is fully characterized by a set of states and corresponding sets of initial probabilities (to be in any of the
states) and transition probabilities (between any ordered pair
of states). As mentioned earlier, our description of the
transmission of pieces of chromosomes from a set of PC to
a TC closely resembles this definition of Markov chain: the
NPC parental chromosomes could be used as states. Note
also that PC are not directly observed but “emitted” haplotypes can be obtained by genotyping and haplotype reconstruction and that the alleles of the haplotypes depend only
on the PC (hidden state) inherited at the corresponding
position.
In this study, we thus propose use of a HMM to identify the
origin of a chromosome segment among the genotyped PC in
the pedigree (see File S2 for more details on the algorithm).
The initial states and transition probabilities of our HMM are
defined by 0:5Gi and Equations 1 and 5, respectively. The
emission probability at marker k is equal to 1.00 2 e (e) if
the allele in TC is equal (different) to the allele in the PC for
that marker, where e is the probability of error (e.g., genotyping errors, mutations). When the PC is a “phantom” PC, the
emission probabilities are equal to the frequency of the observed marker allele.
The model implemented in this study relies on known
haplotypes for TC and genotyped PC eventually obtained by
other programs (e.g., Druet and Georges 2010). We discuss
this assumption below. Paternal and maternal chromosomes
are modeled independently.
Simulation study: Data sets: The method was tested with
simulations based on haplotypes from real data sets from dairy
cattle, maize, and Arabidopsis thaliana. For each species, genotypes from chromosome 1 were used.
The dairy cattle set consisted of 4732 animals genotyped on
a custom-made 60K Illumina panel described in Charlier
et al. (2008). Data were provided by CRV (The Netherlands)
and are available upon request. The 2691 BTA1 SNP spanned
approximately 160 Mb. The pedigree file contained 13,163 individuals. A subset of 3648 target chromosomes was selected to
study the estimation of IBDp. These TC correspond to individuals without any genotyped descendent and with male
ancestors carrying PC genotyped over three to eight generations (and no genotyped female ancestor). One generation of
genotyped male ancestor corresponds to a genotyped sire and
an ungenotyped dam. Two generations of genotyped male ancestors carrying PC corresponds to genotyped sire and maternal grandsire, while the dam and maternal grandam are
ungenotyped. For each additional generation of genotyped
male ancestors carrying PC, the sire of the (ungenotyped)
mate of the oldest genotyped sire is added. For instance, in
the first example in File S1, male ancestors carrying PC (individuals 2, 6, 9, and 10) are genotyped for four generations,
whereas status of individual 4 is irrelevant because individual
2 is genotyped. A subset of 1000 individuals represented the
PC: the 516 individuals with at least one genotyped descendent and 484 individuals chosen at random are to be used as
phantom PC in the simulation. Haplotypes from all individuals were obtained using Beagle (Browning and Browning
2007) and DAGPHASE (Druet and Georges 2010).
The genotypes from 19 “founder” accessions of the MAGIC
in A. thaliana (Kover et al. 2009) were downloaded from
http://gscan.well.ox.ac.uk/arabidopsis. We simulated the MAGIC
lines as described in Scarcelli et al. (2007) by intermating
19 founders lines for five generations producing 342 F5 outcrossed families. More precisely, in the first generation, a complete diallele cross (the 19 · 18 ¼ 342 possible F1 crosses) was
simulated. The generations F1–F4 were then randomly intercrossed. In each generation, two plants from each family were
selected to be paternal and maternal parents. The 342 final
lines were obtained by selfing F5 families for six generations.
Genotypes from 275 markers on the first chromosome (30 Mb)
were used.
Data from the NAM design in maize (Yu et al. 2008;
Mcmullen et al. 2009) were used to simulate the last design.
In the NAM design, 25 founders are crossed with the B73 line
to produce recombinant inbred lines (RILs). Since deducing
the origin is relatively easy in that case (based on the SNPs,
which are polymorphic across the two founder lines), we
changed the design to a complete diallele cross among the
26 founders and further crossed the F1 for three additional
generations (as described for the MAGIC data set). Genotypes
for the 175 SNP on chromosome 1 (200 Mb) were downloaded
from http://www.panzea.org/lit/data_sets.html#NAM_map.
For the three populations, data were simulated by using real
pedigree for dairy cattle and described pedigree for plants and
Identity-by-Descent Between Haplotypes
by transmitting the haplotypes from genotyped ancestors to
offspring using Mendelian segregation rules and recombination probabilities (assuming 1 cM ¼ 1 Mb). In dairy cattle
data, phantom PC haplotypes were selected randomly from
the 484 individuals mentioned above. Furthermore, with this
data set, a subset of 164 markers (the SNP with the highest
MAF was selected every megabase) was used to test the estimation of IBDp at a lower marker density.
In dairy cattle, haplotypes of really genotyped animals were
assumed known while in plants, all haplotypes of founder and
last-generation individuals were assumed known. Pedigrees
were also considered as known. To test the impact of missing
genotypes on the estimation of IBDp, 1 or 5% of marker
alleles were randomly erased in the dairy cattle data set.
Similarly, we randomly generated 0.01, 0.1, and 1% genotyping errors in the dairy cattle data set to study the effect of
genotyping errors (the error rate e used by the model was
equal to 0.001 in all cases) on estimation of IBDp. Finally, estimated haplotypes [using Beagle (Browning and Browning
2007) and DAGPHASE (Druet and Georges 2010)] were
also used in the dairy cattle data set to compare estimated
IBDp obtained with known and inferred haplotypes. In this
situation where haplotypes are inferred, when the paternal or
maternal origin of a TIPC was not known because the corresponding individual had no genotyped parent, we identified
the TIPC as the chromosome that transmitted the marker
alleles to the TC. All studies were repeated 100 times.
Estimation of accuracy of IBDp: The quality of IBDp estimation
can be described by the distribution, over all TC and markers,
of the estimated IBDp between the TC and the truly inherited
parental chromosome (TIPC). This distribution indicates
whether the model correctly predicts the origin of the chromosome. Ideally, the IBDp should be one for all positions.
For each haplotype and each marker position, the model
estimates NPC IBDp (one for each PC in the set). IBDp with
each PC were assigned to percentile classes (100 classes: 0–
0.01, 0.01–0.02, . . . , 0.99–1.00). When the corresponding PC
was (not) the TIPC, one (in)correct estimation was added
in the class. Then, the percentage of correct estimations was
computed within each class. It should be noted that the
expected proportion of TIPC in percentile q is q, meaning
that a plot of the observed proportion of TIPC vs. the estimated IBDp should be a straight diagonal line (see Figure 3).
For the dairy cattle data, results were given separately for
genotyped TIPC and phantom TIPC.
Application: marker imputation with low density chip in
dairy cattle:. To illustrate a potential application of the method,
we mimicked imputation of markers from a dense SNP panel
with genotypes from a low-density SNP chip. Using the dairy
cattle data described above, we conserved all the genotypes from
the 1000 animals representing the PC in the simulation study,
while for the remaining 3732 individuals, we conserved only 164
markers per animal (the same as in the simulation study).
Haplotypes of reference and target individuals were reconstructed using Beagle (Browning and Browning 2007) and
DAGPHASE (Druet and Georges 2010) as described in
Zhang and Druet (2010).
The method described in this study was then used to
estimate IBDp between TC and PC. Missing alleles were then
predicted by combining IBDp and alleles observed in PC as
NPC
X
P hkc 5 m 5
P ðcðkÞ [ x k P hkx 5 m ;
x51
where P ðhkx 5 mÞ is the probability that haplotype x at marker
k carries allele m (1 if yes, 0 if no, when haplotype is known)
413
and P ðcðkÞ [ xðkÞÞ is the estimated IBDp between PC x and
TC c at marker k. For ungenotyped ancestors (phantom PC),
the frequency of the marker allele was used instead of the
observed allele. Allelic probabilities of both chromosomes
from an individual were then combined to obtain genotype
probabilities. These probabilities were used to compute estimated number of allele “1” per genotype (ranging from 0 to
2). The number of errors per genotype was equal to the absolute value of the difference between the real and estimated
numbers of allele 1 (the proportion is obtained by dividing
the number of errors by the number of imputed alleles). The
efficiency of prediction was then obtained by averaging this
statistic over all imputed genotypes and markers. This value
indicates whether the model correctly predicts marker
alleles.
RESULTS
Estimation of IBDp in the three simulated designs:
Distributions of IBDp for TIPC with the four simulated
data sets are presented in Figure 1. The curves indicate
that a large majority of the estimated IBDp are above
0.99 and very few at low values for all simulated designs,
stressing the ability of the method to identify with high
probability the TIPC. The mean IBDp for TIPC was
equal to 0.9798 and 0.9415 in the A. thaliana and maize
designs, respectively. For the A. thaliana design, IBDp of
TIPC was higher than 0.99 and 0.90 for 91.3 and 95.9%
of the situations, respectively. For the maize design, with
lower density but closer founders (in terms of generations), these values dropped respectively to 74.0 and
88.1%. For the dairy cattle design, mean IBDp was equal
to 0.9800 for genotyped TIPC and only to 0.5511 for
phantom TIPC resulting in an overall mean IBDp of
0.9567 for TIPC (5.4% of the genome of TC was inherited from phantom PC). With 1% (5%) missing
markers, IBDp were almost identical to the ones
obtained with the full markers set: 0.9565 (0.9560) for
overall mean IBDp for TIPC and to 0.9799 (0.9796)
when considering only genotyped TIPC. With genotypes errors, overall mean IBDp for TIPC and IBDp
for genotyped TIPC slightly decreased to 0.9551 and
0.9783 (0.01% errors), 0.9442 and 0.9667 (0.1% errors),
and 0.9433 and 0.9657 (1% errors), respectively. Values
obtained with estimated haplotypes instead of known
haplotypes were in between results obtained in missing
genotypes and genotyping errors scenarios: 0.9531 for
overall mean IBDp for TIPC and 0.9762 IBDp for genotyped TIPC. Returning to the situation with no missing
markers and no genotyping errors, but decreasing the
number of markers from 2691 to 164, the average IBDp
dropped to 0.9456 (genotyped TIPC), 0.6294 (phantom
TIPC), and 0.9418 (all TIPC) (Figure 1). Of genotyped
TIPC, 88.1% (94.8%) and 78.6% (90.1%) had IBDp
.0.99 (0.90) with the high (low)-density maps. For
phantom TIPC, estimated IBDp were low more frequently, clearly indicating a lack of power to detect
phantom TIPC. For these TIPC, emission probabilities
414
T. Druet and F. Farnir
Figure 1.—Observed cumulative distribution of mean IBDp for truly inherited parental chromosomes (TIPC):
( ) all TIPC, (*) genotyped TIPC, and
(s) phantom TIPC (ungenotyped
TIPC). The figure was obtained by assigning estimated IBDp to 100 probability
classes (0–0.01, 0.01–0.02, . . ., 0.99–1.00).
•
are equal to allele frequencies while for genotyped TIPC,
emission probabilities are equal to 0.001 or 0.999 according to the allele actually observed on the PC. By
chance, emission probability associated to a genotyped
PC can be higher than the product of allele frequencies, leading to an incorrect assignment. In addition, PC
closer (in terms of the number of generations) to TC
will have higher transition probabilities than phantom
PC, often separated from TC by more generations.
When more genotyped PC are present in the set of
PC, the probability of obtaining by chance one PC with
emission probability higher than the product of allele
frequencies increases. This is illustrated in Table 1
where estimation of IBDp is summarized for individuals
with an increasing number of genotyped PC. Indeed,
with more genotyped PC, mean IBDp associated to
phantom TIPC decreases. In addition, the table indicates that proportion from the genome inherited from
genotyped PC increases as expected with a higher number of genotyped PC but that mean IBDp estimated for
genotyped TIPC slightly decreases. This is explained by
the fact that additional genotyped TIPC are separated
from the TC by more generations, resulting in smaller
transmitted segments (the expected size of the segment
is 1/(1 1 G) M, where G is the number of generations
between the TC and PC) with more transitions and
fewer markers per segment to identify the origin. However, since the proportion of genotyped TIPC increases,
the overall mean IBDp increases too (with the highdensity map). With the lower-density map, the overall
mean IBDp of TIPC remains relatively constant because
IBDp for phantom TIPC have a stronger decrease. The
most favorable configuration for IBDp estimation oc-
curs when a sire or a dam is genotyped because 100%
of the two TIPC are genotyped and transmitted fragments are large. In these cases, the mean IBDp of TIPC
is equal to 0.9949 and 0.9851 with the high- and lowdensity maps, respectively. For one animal, the mean
IBDp of TIPC is the mean of the IBDp estimated for
the paternal and maternal chromosomes, resulting in
0.9443 (0.9395) to 0.9656 (0.9432) mean IBDp when
male ancestors carrying PC are genotyped from three to
eight generations (see material and methods for more
details) for the high (low)-density map (see Table 1).
Results of the bovine data set clearly indicate that
estimation of IBDp is improved when transmitted chromosome segments are larger (or with less crossing over
along the chromosome). Figure 2 shows that mean estimated IBDp are low around crossing over and increases
with distance from crossing over. When the exact position of a crossing over cannot be identified, the method
estimates intermediate IBDp for the two concerned PC.
With increased marker density, the influence of the
crossing over is reduced because more markers are
available to precisely localize crossing over. These
events greatly contribute to the proportion of intermediate (or low) estimated IBDp for TIPC.
Figure 3 shows that percentage of TIPC within each
class of estimated IBDp linearly increases from 0 to 1
with IBDp with both plant designs. For each class, the
proportion of PC corresponding to the TIPC is equal to
the estimated IBDp, indicating that these probabilities
are well calibrated. For the dairy cattle data, this is no
longer true and we observe some high IBDp associated
to noninherited PC. However, when considering only
portions of chromosomes inherited from genotyped PC,
Identity-by-Descent Between Haplotypes
415
TABLE 1
Mean IBDp for TIPC with the bovine data set as a function of the number of generations of genotyped male ancestors carrying PC:
results from the dense map (in parentheses, results for the low-density map)
Paternal
chromosome
Number of
genotyped
male ancestors
carrying PCa
IBDp for TIPC
3
4
5
6
7
8
0.9949
0.9949
0.9949
0.9949
0.9949
0.9949
(0.9851)
(0.9851)
(0.9851)
(0.9851)
(0.9851)
(0.9851)
Maternal chromosome
IBDp for
genotyped
TIPC
0.9826
0.9671
0.9594
0.9562
0.9386
0.9412
(0.9554)
(0.9370)
(0.9265)
(0.9206)
(0.9031)
(0.9060)
IBDp for
phantom TIPC
0.6258
0.5161
0.4458
0.3949
0.3662
0.3171
(0.7091)
(0.5967)
(0.5155)
(0.4471)
(0.3989)
(0.3287)
Proportion of
chromosome
inherited from
genotyped TIPCb
0.7508
0.8754
0.9380
0.9686
0.9845
0.9921
(0.7500)
(0.8750)
(0.9375)
(0.9687)
(0.9844)
(0.9922)
IBDp for all
TIPCc
0.8936
0.9109
0.9276
0.9386
0.9298
0.9363
(0.8939)
(0.8946)
(0.9010)
(0.9057)
(0.8953)
(0.9014)
Mean IBDp per
chromosomed
0.9443
0.9529
0.9613
0.9668
0.9623
0.9656
(0.9395)
(0.9398)
(0.9430)
(0.9454)
(0.9402)
(0.9432)
a
One generation of genotyped male ancestor corresponds to a genotyped sire and two generations of genotyped male ancestors
carrying PC corresponds to sire and maternal grandsire genotyped. For each additional generation of genotyped male ancestors
carrying PC, the sire of the mate of the oldest genotyped sire is added. For instance, in the first example of File S1, male ancestors
carrying PC (individuals 2, 6, 9, and 10) are genotyped for four generations.
b
Expected proportion of maternal chromosome inherited from genotyped TIPC is indicated in parentheses.
c
The IBDp for all TIPC is an average of IBDp for genotyped (column 3) and phantom (column 4) TIPC weighted by the
proportion of genome inherited from genotyped TIPC (column 5).
d
The mean IBDp per chromosome is equal to the mean of paternal (column 2) and maternal chromosomes (column 6).
the concordance between estimated IBDp and proportion of TIPC is correct while for regions inherited from
phantom PC only, estimated IBDp are poorly calibrated.
Computational requirements: Computation times and
memory requirements were measured on a computer
with Intel Xeon “Harpertown” L5420 at 2.50 GHz. The
requirements are a function of number of TC (1200 for
the maize design, 684 for the A. thaliana design, and
1824 for the dairy cattle design), the number of different paths to PC per TC (8 for the maize design set, 1024
for the A. thaliana design, and maximum 15 for the
dairy cattle design), and the number of markers (175
for the maize design, 275 for the A. thaliana design, and
2691 for the dairy cattle design). Computation times to
estimate IBDp at all marker positions and for all TC
were equal to 41 sec, 7 min 53 sec, and 7 hr 13 min 22
sec, while memory requirements were equal to 129 Mb,
398 Mb, and 140 Mb for the maize, dairy cattle, and A.
thaliana designs, respectively.
Application: marker imputation with low density chip
in dairy cattle: Table S1 summarizes allelic imputation
error rates obtained in dairy cattle for TC genotyped
with the low-density map. The method was used to predict markers inherited from PC genotyped on the
higher-density marker panel. As for the estimation of
IBDp probabilities, the efficiency of the method increased when higher proportions of PC were genotyped. When the expected proportion of the genome
inherited from genotyped ancestors increased from
,0.25 to 1, the error rate decreases from 0.2647 to
0.0056. The majority (73.4%) of the individuals with
TC in our data sets had more than 87.5% of their genome inherited from genotyped ancestors (correspond-
ing to three genotyped male ancestors carrying PC or
more) and 13.6% had both their parents genotyped.
The error rates were below 0.05 for animals with at least
three generations of genotyped male ancestors carrying
PC and the overall mean imputation error rate per
animal was equal to 0.0375: the proportion of correctly
imputed alleles (0.9625) was above the mean estimated
IBDp for TIPC (0.9418) despite the fact that haplotypes
were estimated and that genotyping errors might be
present in the data set. Some genotypes might be correctly imputed by chance: although the correct PC was
not identified, the incorrect PC with high estimated
IBDp carried the same marker allele. Therefore, when
the method incorrectly identifies the inherited parental
chromosome because another PC is very similar (identityby-state for a stretch of markers), the error has limited
consequence on marker imputation or even QTL detection and genomic selection because the incorrect PC
most likely carries the same alleles as the TIPC. Similar
results were found by Zhang and Druet (2010) using
the same method but for the whole genome.
DISCUSSION
Accuracy of IBD probabilities estimation: We herein
present a method aimed at estimating IBDp between
a TC and a set of PC. Since the method relies on
concordance of haplotypes from TC and PC, it will be
more efficient when more markers per inherited segment are available. The size of these segments is proportional to the inverse of the number of generations
from the PC to the TC. Therefore, the method will be
416
T. Druet and F. Farnir
Figure 2.—Relation between estimated IBDp
of TIPC and distance from the closest crossing
over with the dairy cattle data: (A) high-marker
density, (B) low-marker density.
more efficient for closer PC and for higher marker
densities. On a simulated data set based on real founder
haplotypes and pedigree structures, the method assigned
high IBDp to the TIPC in a large majority of the cases
and low IBDp to the other PC, even for marker densities
as low as 1 marker/Mb and for founders separated from
the TC by up to 10 generations. However, for some regions, lower IBDp were assigned to the TIPC. Such
regions, where both TIPC and incorrect PC have
intermediate IBDp, correspond to portions for which
the origin is more difficult to determine: close to a
crossing over (as illustrated by Figure 2) or for which
several ancestors have very similar haplotypes. With
lower-density maps, these problems occur more often
because it is more likely that two PC have similar haplotypes and because less informative markers are available to precisely determine where crossing over take place.
Therefore, the estimated IBDp of TIPC decreased with
lower-density marker maps. Figure 3 indicates that estimated IBDp are well calibrated and are equal to the
probability that a PC is a TIPC. Therefore, it is possible
to identify regions in which the TIPC cannot be determined precisely.
Results also showed that the estimation of IBDp was
less efficient for regions inherited from a phantom PC.
The method should therefore ideally be applied to
designs with limited number of phantom PC, especially
when these are close ancestors of the TC. Phantom PC
are described on the basis of allele frequencies. A model
describing phantom PC with haplotypes (such as HMM
of Scheet and Stephens 2006 or Browning and
Browning 2007) might turn out to be more efficient
although more computationally intensive.
The impact of PC incorrect assignments might be
limited for marker allele prediction. Indeed, in regions
where the method cannot precisely determine the
haplotype origin, the method proposes several origins
that are very similar to each other and to the TC. Due to
this similarity, the method is not able to determine the
correct origin but thanks to this similarity, the correct
Identity-by-Descent Between Haplotypes
417
Figure 3.—Relation between real proportion of truly inherited parental chromosomes (TIPC) and estimated IBDp:
( ) all TIPC, (*) genotyped TIPC, and
(s) phantom TIPC. The figure was obtained by assigning PC in 100 probability
classes according to their estimated
IBDp (0–0.01, 0.01–0.02, . . . , 0.99–
1.00), and then the proportion of TIPC
was computed in each class.
•
marker allele is predicted anyway. Therefore, the method
proved to be efficient for prediction of marker alleles,
even with lower map densities.
Consequences of approximations of the model: In
this study, we assumed that haplotypes are known. In
inbred species (such as some plants or mice), haplotypes
are indeed known. Furthermore, Druet and Georges
(2010) showed that with the current marker densities,
haplotype reconstruction can be achieved very efficiently in livestock species, especially for important
parents with many offspring, like tested sires in artificial
insemination contexts. For genotyped (eventually at
lower density) target individuals, haplotypes can be partially reconstructed on the basis of homozygous markers
and familial information such as a genotyped parent.
When using inferred haplotypes with the dairy cattle
design, IBDp of TIPC remained high, showing that the
method was still efficient when haplotyping can be performed as accurately as in dairy cattle. In cases where
haplotyping reconstruction generates many errors, it is
likely that estimation of IBDp will be less accurate,
particularly when errors occur in TC. Since the efficiency
of the method was less affected by missing genotypes
than by genotyping errors, we advise reconstructing
portions of haplotypes that can be inferred very accurately (for instance, thanks to homozygous markers
or mendelian segregation rules) and consider other
markers as missing.
Results showed that the method remained efficient
with missing genotypes and genotyping errors under
normal conditions. Indeed, we tested call rates as low
as 95% whereas, for instance, with genotyping arrays
used in cattle, call rates are above 99% on average.
Genotyping errors rates were set as high as 1% while
use of genotyping arrays generally results in ,0.1% error
rates.
The effect of missing genotypes is to reduce marker
density since we do not have information for some
markers. Reduction of marker density by a few percentage has a small impact on estimation of IBDp.
Results, on simulated and real data, proved that the
Markov approximation of the IBD process along the
chromosome between a TC and its PC was efficient in
our designs.
Comparison to other methods: Since our method
models the entire chromosome and allows direct
estimation of IBDp between haplotypes separated by
more than one generation, it should achieve better results than traditional methods based on linkage equilibrium, which rely mostly on genotypes of close relatives
and flanking markers such as those of Wang et al. (1995)
or Pong-Wong et al. (2001). However, our method is less
suited to very sparse marker maps. For moderate- and
high-marker densities our method has reasonable computational costs, which might not be the case for multipoint linkage methods using MCMC techniques (Heath
1997).
Other multipoint linkage methods such as those
proposed by Lander and Green (1987) or Abecasis
et al. (2002) consider all the pedigree relationships.
These methods therefore make better use of all the
available information. However, Merlin (Abecasis et al.
2002) was not computationally efficient on the data set
used in this study (Merlin could not cope with our large
pedigree). In comparison, our method uses only information from ancestors, ignores other relatives, and
models both chromosomes separately. Consequently,
it works in subpedigrees including only the TC and
418
T. Druet and F. Farnir
their ancestors, which makes it computationally efficient. Our results showed that for designs in which haplotypes of important parents can be reconstructed
precisely (on the basis of LD or information from progeny), the IBDp between TC and PC can be efficiently
computed using our method and can be useful for transferring information (genotype, QTL, etc.) from important founders to other individuals in the population.
The IBDp estimated in our method are distinct from
IBDp estimated through methods based on LD, such as
the method developed by Meuwissen and Goddard
(2001). Indeed, these methods use small haplotypes
of dense markers to identify IBD relationships due to
distant common ancestors. Our method identifies IBD
relationships within a pedigree and only a few generations separate the haplotype from its ancestors. Since,
for closer relationships, animals share longer chromosome segments, our method can work with lower-density
maps than methods based on LD can.
Applications: Our method is particularly suited to
describing the transmission of information from recent
ancestors to current individuals, even with low-density
maps (e.g., 1 SNP/Mb). Burdick et al. (2006) already
used this technique for marker imputation in human
pedigrees with Merlin (Abecasis et al. 2002). Such applications are becoming more common in animal and
plant breeding or in experimental designs. For instance,
MAGIC designs (e.g., Threadgill et al. 2002; Kover
et al. 2009) rely on genotyping of founders and last generations and are now developed in several organisms for
the genetic dissection of complex traits. In addition, recent strategies aim to genotype founders with high-density
marker panels and use this genomic information to infer
genotypic information on other individuals genotyped
on lower-density panels. This strategy has, for example,
been implemented in the NAM in maize (Yu et al. 2008;
Mcmullen et al. 2009). In livestock species, genomic
selection (Meuwissen et al. 2001) is currently applied
by genotyping reference animals on medium-density
SNP chips (60K SNPs in dairy cattle). Effects of the
SNPs are then estimated and can be used to predict the
genetic value of other individuals, which must be genotyped for the same markers. Habier et al. (2009) proposed genotyping some animals with lower-density
marker panels and using statistical methods to transfer
information from reference individuals to these animals.
In dairy cattle, most breeding companies are studying
the possibility of implementing such a strategy to reduce
genotyping costs and extend genomic selection to
a larger fraction of the population. With the advent of
higher-marker density panels (for example, a bovine
SNP chip with more than 750,000 SNPs has been released in 2010) and of high-throughput sequencing
data, such strategies will become even more important.
Our method would be fitted to such new paradigms.
The program CHROMIBD implementing the current method is available from the authors upon request.
CONCLUSIONS
We present an efficient method for estimation of
IBDp between a target chromosome and its genotyped
ancestors. The method assigned high IBDp to the truly
inherited parental chromosome and was still efficient
when marker densities were 1 Mb/SNP or when
founders were separated from target chromosomes by
up to 10 generations. The proposed method proved
well adapted for situations that will become more common in the near future, with founders genotyped on
high-density maps and other individuals genotyped for
lower-density marker arrays, as proposed in livestock
species and in experimental designs such as the nested
association mapping design in maize.
Tom Druet is Research Associate from the Fonds de la Recherche
Scientifique (FNRS). The authors thank CRV (http://www.crv4all.com),
the Wellcome-Trust Centre for Human Genetics, and PANZEA for
access to their data. We acknowledge Laurence Moreau and Mathieu
Gautier for their comments and suggestions on this work. This work
was funded by grants of the Service Public de Wallonie and from the
Communauté Française de Belgique. We acknowledge University of
Liège (SEGI and GIGA bioinformatics platform) for the use of NIC3
and GIGA-grid supercomputers.
LITERATURE CITED
Abecasis, G. R., S. S. Cherny, W. O. Cookson and L. R. Cardon,
2002 Merlin–rapid analysis of dense genetic maps using sparse
gene flow trees. Nat. Genet. 30: 97–101.
Baum, L. E., and J. A. Egon, 1967 An inequality with applications
to statistical estimation for probabilistic functions of a Markov
process and to a model for ecology. Bull. Am. Meteorol. Soc.
73: 360–363.
Broman, K. W., 2005 The genomes of recombinant inbred lines.
Genetics 169: 1133–1146.
Browning, S. R., and B. L. Browning, 2007 Rapid and accurate
haplotype phasing and missing-data inference for whole-genome
association studies by use of localized haplotype clustering. Am. J.
Hum. Genet. 81: 1084–1097.
Burdick, J. T., W. M. Chen, G. R. Abecasis and V. G. Cheung,
2006 In silico method for inferring genotypes in pedigrees.
Nat. Genet. 38: 1002–1004.
Calus, M. P., T. H. Meuwissen, A. P. de Roos and R. F. Veerkamp,
2008 Accuracy of genomic selection using different methods to
define haplotypes. Genetics 178: 553–561.
Charlier, C., W. Coppieters, F. Rollin, D. Desmecht, J. S.
Agerholm et al., 2008 Highly effective SNP-based association mapping and management of recessive defects in livestock. Nat. Genet. 40: 449–454.
de Roos, A. P., C. Schrooten, E. Mullaart, M. P. Calus and R. F.
Veerkamp, 2007 Breeding value estimation for fat percentage
using dense markers on Bos taurus autosome 14. J. Dairy Sci. 90:
4821–4829.
Druet, T., and M. Georges, 2010 A hidden markov model combining linkage and linkage disequilibrium information for haplotype
reconstruction and quantitative trait locus fine mapping. Genetics 184: 789–798.
Fernando, R., and M. Grossman, 1989 Marker assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 21: 467–477.
Guo, S. W., 1994 Computation of identity-by-descent proportions
shared by two siblings. Am. J. Hum. Genet. 54: 1104–1109.
Habier, D., R. L. Fernando and J. C. Dekkers, 2009 Genomic selection using low-density marker panels. Genetics 182: 343–353.
Heath, S. C., 1997 Markov chain Monte Carlo segregation and linkage
analysis for oligogenic models. Am. J. Hum. Genet. 61: 748–760.
Identity-by-Descent Between Haplotypes
Hernandez-Sanchez, J., C. S. Haley and J. A. Woolliams,
2006 Prediction of IBD based on population history for fine
gene mapping. Genet. Sel. Evol. 38: 231–252.
Kong, A., G. Masson, M. L. Frigge, A. Gylfason, P. Zusmanovich
et al., 2008 Detection of sharing by descent, long-range phasing
and haplotype imputation. Nat. Genet. 40: 1068–1075.
Kover, P. X., W. Valdar, J. Trakalo, N. Scarcelli, I. M. Ehrenreich
et al., 2009 A multiparent advanced generation inter-cross to finemap quantitative traits in Arabidopsis thaliana. PLoS Genet. 5(7):
e1000551.
Lander, E. S., and P. Green, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84:
2363–2367.
Leutenegger, A. L., B. Prum, E. Genin, C. Verny, A. Lemainque
et al., 2003 Estimation of the inbreeding coefficient through use
of genomic data. Am. J. Hum. Genet. 73: 516–523.
McMullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. H.
Li et al., 2009 Genetic properties of the maize nested association
mapping population. Science 325: 737–740.
McPeek, M. S., and L. Sun, 2000 Statistical tests for detection of
misspecified relationships by use of genome-screen data. Am. J.
Hum. Genet. 66: 1076–1094.
Meuwissen, T. H., and M. E. Goddard, 2001 Prediction of identity
by descent probabilities from marker-haplotypes. Genet. Sel. Evol.
33: 605–634.
Meuwissen, T. H., and M. E. Goddard, 2004 Mapping multiple
QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet. Sel. Evol. 36: 261–279.
Meuwissen, T. H., B. J. Hayes and M. E. Goddard, 2001 Prediction
of total genetic value using genome-wide dense marker maps.
Genetics 157: 1819–1829.
Mott, R., C. J. Talbot, M. G. Turri, A. C. Collins and J. Flint,
2000 A method for fine mapping quantitative trait loci in outbred animal stocks. Proc. Natl. Acad. Sci. USA 97: 12649–12654.
Pong-Wong, R., A. W. George, J. A. Woolliams and C. S. Haley,
2001 A simple and rapid method for calculating identity-bydescent matrices using multiple markers. Genet. Sel. Evol. 33:
453–471.
419
Rabiner, L. R., 1989 A tutorial on Hidden Markov Chains and selected applications in speech recognition. Proc. IEEE 77: 257–286.
Scarcelli, N., J. M. Cheverud, B. A. Schaal and P. X. Kover,
2007 Antagonistic pleiotropic effects reduce the potential adaptive value of the FRIGIDA locus. Proc. Natl. Acad. Sci. USA 104:
16986–16991.
Scheet, P., and M. Stephens, 2006 A fast and flexible statistical
model for large-scale population genotype data: applications to
inferring missing genotypes and haplotypic phase. Am. J. Hum.
Genet. 78: 629–644.
Thompson, E. A., 1994 Monte Carlo estimation of multilocus autozygosity probabilities, pp. 498–506 in Proceedings of the 1994 Interface Conference, edited by J. Sall and A. Lehman. Interface
Foundation of North America, Fairfax Station, VA.
Thompson, E. A., 2008 The IBD process along four chromosomes.
Theor. Popul. Biol. 73: 369–373.
Threadgill, D. W., K. W. Hunter and R. W. Williams, 2002 Genetic dissection of complex and quantitative traits: from fantasy to
reality via a community effort. Mamm. Genome 13: 175–178.
Wang, T., R. Fernando, S. Van der Beek, M. Grossman and J. A. M.
Van Arendonk, 1995 Covariance between relatives for a marked
quantitative trait locus. Genet. Sel. Evol. 27: 251–274.
Wijsman, E. M., J. H. Rothstein and E. A. Thompson, 2006 Multipoint linkage analysis with many multiallelic or dense diallelic
markers: Markov chain-Monte Carlo provides practical approaches for genome scans on general pedigrees. Am. J. Hum.
Genet. 79: 846–858.
Windig, J. J., and T. H. E. Meuwissen, 2004 Rapid haplotype reconstruction in pedigrees with dense marker maps. J. Anim.
Breed. Genet. 121: 26–39.
Yu, J., J. B. Holland, M. D. McMullen and E. S. Buckler,
2008 Genetic design and statistical power of nested association
mapping in maize. Genetics 178: 539–551.
Zhang, Z., and T. Druet, 2010 Marker imputation with low-density
marker panels in Dutch Holstein cattle. J. Dairy Sci. 93: 5487–5494.
Communicating editor: H. Zhao
GENETICS
Supporting Information
http://www.genetics.org/cgi/content/full/genetics.111.127720/DC1
Modeling of Identity-by-Descent Processes Along a Chromosome Between
Haplotypes and Their Genotyped Ancestors
Tom Druet and Frederic Paul Farnir
Copyright © 2011 by the Genetics Society of America
DOI: 10.1534/genetics.111.127720
2 SI
T. Druet and F. P. Farnir
FILE S1
Examples for the method section
Pedigree example for estimation of IBD probabilities
We use the following pedigree to illustrate how to construct the subset of potential parental chromosomes, to estimate initial state
probabilities and to give examples of ancestors where paths from different parental chromosome merge.
Figure. Pedigree of the example, genotyped animals are represented in grey (squares for males and circles for females).
The subset C(c) of potential parental chromosomes of a given chromosome c
Noting iM (iP ) the maternal (paternal) chromosomes of individual i, we can find C(1M) as explained in the method section:
•
The potential PC of 1M are 3P and 3M, and these chromosomes are not genotyped.
•
Using the same process for 3P, the PC are then 6P and 6M. Since these chromosomes are genotyped, they are included in C(1M)
an the process stops here along that path.
•
Now, using the same process for 3M, the PC are now 7P and 7M, which are not genotyped. So we have to repeat the process for
these two chromosomes.
•
7P leads to 2 new genotyped chromosomes to be included in C(1M), i.e. 9P and 9M, while 7M leads to two ungenotyped
chromosomes 8P and 8M, and the process is to be repeated again.
As a result, we have that C(1M) = {6P, 6M, 8M, 9P, 9M, 10P, 10M}, which means that 1M is a mosaic of these chromosomes. It also
means that any locus on 1M is IBD with the corresponding locus on one of these chromosomes. In terms of probabilities, the
consequence is that
C(c)
P(c(k) x(k)) = 1 when the sum is taken over all chromosomes making up C(c), where P(c(k) x(k)) denotes
x
the probability for c to be IBD with x at a given locus k.
T. Druet and F. P. Farnir
3 SI
Initial state probabilities
To illustrate the property that the initial probabilities sum over C(c) to 1, in the example described above, the IBDp for the beginning
of chromosome 1M are:
P(1M(1) 6P(1)) = P(1M(1) 6M(1)) = 0.52 =0.25,
P(1M(1) 9P(1)) = P(1M(1) 9M(1)) = P(1M(1) 8M(1)) = 0.53 = 0.125,
P(1M(1) 10P(1)) = P(1M(1) 10M(1)) = 0.54 = 0.0625
and we have that
C(1M
x
)
(
)
P 1M (1) x(1) = 2 * 0.25+ 3 * 0.125+ 2 * 0.0625 = 1
State transition probabilities
As explained in the paper, when PC (ci and cj) are distinct at positions k and k+1, then on the pedigree paths from ci and cj to c, there
must be at least one common ancestor where the two paths merge. We note this ancestor ‘a’. To illustrate on the example, if ci = 10P
and cj = 9M, a = 7. Similarly, if ci=6M and cj=9P, a = 3.
4 SI
T. Druet and F. P. Farnir
Example in case of inbreeding
The figure below represents the maternal branch of the pedigree of individual #1. One ungenotyped great-grand-parent (male #5) is
parent of both grand-parents. Both parents of individual #5 (individuals #8 and #9) are genotyped and carry PC.
Figure. Example of inbred pedigree, genotyped animals are represented in grey (squares for males and circles for females). A. Maternal branch of the pedigree.
B. Each path from PC to TC is represented separately (the number of each distinct path is written in blue).
We start by representing each different path from PC to TC separately (Figure, part B). Paths were numbered from 1 to 16 (in blue).
For individuals #8 and #9, they are two paths to transmit chromosomes to individual #1. Through individual #3 or through
individual #4. Therefore, individuals #5, #8 and #9 and the corresponding meioses appear twice in the figure (part B).
IBD probabilities are first computed for each path. Then IBD probabilities for each PC is obtained by summing IBD probabilities for
each path leading from the concerned PC to the TC. To illustrate, P(1m 8p) = P(8p5p3p2p1m) + P(8p5p4p2m1m) =
P(8p5p)*[P(5p3p2p1m) + P(5p4p2m1m)]
T. Druet and F. P. Farnir
5 SI
To obtain transition probabilities from path x at position k to path y at position k+1, meioses in path y are divided in independent
meioses (those meioses are not present in path x) and common meioses (those meioses are present in path x - the same
parent/offspring pair appears in both paths). Common meioses are further divided in meioses resulting in the transmission of the
same haplotype of the parent (same meioses) or in the transmission of the complementary haplotype of the parent (opposite meioses).
Mi, Ms and Mo are the number of independent, same and opposite meioses, respectively. The sum of all the meioses is equal to the
number of meioses in path y. For example, considering PC noted 1 and 6 in the example (i.e. 8p and 10m, respectively) as x and y
respectively, the trajectories of y starts with 2 independent meioses (Mi=2) to reach maternal haplotype of individual 3. Then, to
allow for the transition from x to y, a recombination must take place in this individual (Mo=1). Then the recombined haplotype must
have been transmitted along the remaining (common to both paths) meiosis with no further recombination (Ms=1). Further examples
are given below. The formula given in the text (equation [5]) is then easily obtained, considering that the probability of transmission
of an haplotype is equal to 0.5 for an independent meiosis, (1-r) for a 'same' meiosis and r for an 'opposite' meiosis.
Other examples of number of independent and common meioses for transition between different paths, some of which including
inbred individuals, are:
From path 1 to path 1: Mi = 0, Ms =4, Mo =0.
From path 1 to path 2: Mi = 0, Ms =3, Mo =1.
From path 1 to path 3: Mi = 1, Ms =2, Mo =1.
From path 1 to path 9: Mi = 2, Ms =1, Mo =1.
From path 1 to path 10: Mi = 2, Ms =0, Mo =2.
From path 1 to path 11: Mi = 3, Ms =0, Mo =1.
From path 1 to path 16: Mi = 3, Ms =0, Mo =1.
Remark that, combining these results, we can for example compute:
P[c(k+1)9p | c(k) 8p]
=
P[c(k+1)3 | c(k) 1] + P[c(k+1)11 | c(k) 1] + P[c(k+1)3 | c(k) 9] + P[c(k+1)11 | c(k) 9] =
0.5*r*(1-r) + 0.5*r + 0.5*r + 0.5*r*(1-r)
6 SI
T. Druet and F. P. Farnir
FILE S2
Algorithm
Each target chromosome (TC) is modeled independently. The method starts by searching the set of paths leading to the closest
genotyped ancestors of the modeled TC by a recursive subroutine which searches the pedigree backwards in time until it reaches a
genotyped ancestor or an ancestor without genotyped parent in the pedigree (it corresponds then to a phantom parental
chromosome).
For each identified path, a list of meioses is stored (one meiosis per generation). A meiosis is defined by a parent-progeny pair.
Then, for each pair of paths, the number of common meioses (Mc) between the two paths is counted and stored (common meioses
share same parent-progeny pair). In addition, to the number of common meioses, the number of common meioses with transmission
of complementary chromosome (corresponding to a crossing over) is counted (Mo). This corresponds to common meioses where in
the first path the paternal chromosome was transmitted whereas in the second path the maternal chromosome was transmitted or
vice versa. The number of common meioses with transmission of identical chromosome (Ms) is equal to Mc-Mo.
Three matrices with dimension Np x Np (with Np equal to the number of paths) are then stored. The first contains for each pair of
paths the number of independent meioses (Mi). Mi is equal to the total number of meioses in the second path minus Mc. The second
and third matrix contains respectively Mo and Ms for each pair of paths.
For instance, in the first example in File S1, the set of parental chromosomes of TC 1M is {6P, 6M, 8M, 9P, 9M, 10P, 10M}. For this set,
the three matrices are:
The matrix of independent meioses, Mi(i,j) (rows (i) correspond to PC at position k and columns (j) to PC at position k+1) is:
6P
6M
8M
9P
9M
10P
10M
6P
0
0
2
2
2
3
3
6M
0
0
2
2
2
3
3
8M
1
1
0
1
1
1
1
9P
1
1
1
0
0
2
2
9M
1
1
1
0
0
2
2
10P
1
1
0
1
1
0
0
10M
1
1
0
1
1
0
0
The matrix of common meioses with transmission of complementary chromosome (crossing over), Mo(i,j) (rows correspond to PC at
position k and columns to PC at position k+1) is:
6P
6M
8M
9P
9M
10P
10M
6P
0
1
1
1
1
1
1
6M
1
0
1
1
1
1
1
8M
1
1
0
1
1
1
1
9P
1
1
1
0
1
1
1
9M
1
1
1
1
0
1
1
10P
1
1
1
1
1
0
1
10M
1
1
1
1
1
1
0
T. Druet and F. P. Farnir
7 SI
The matrix of common meioses with transmission of the same chromosome, Ms(i,j) (rows correspond to PC at position k and columns
to PC at position k+1) is:
6P
6M
8M
9P
9M
10P
10M
6P
2
1
0
0
0
0
0
6M
1
2
0
0
0
0
0
8M
0
0
3
1
1
2
2
9P
0
0
1
3
2
1
1
9M
0
0
1
2
3
1
1
10P
0
0
2
1
1
4
3
10M
0
0
2
1
1
3
4
The forward algorithm (Rabiner, 1989) is then implemented as:
Initialization for all paths i at marker 1:
Ms(i,i)
(i,1)=0,5
where Ms(i,i) is the number of identical meioses between paths i and i (or the number of meioses on the path i).
Induction for all paths i and marker k :
Np
(i,k) = ( ( j,k 1) * ( j,i,k 1)) * a(i,k)
j =1
where (j,i,k-1) is the transition probability from path j to path i between markers k-1 and k and a(i,k) is the emission probability of
path i at marker k.
The transition probabilities are calculated with equation [5] :
(i, j,k) = 0.5 Mi(i, j ) Mo(i, j ) (1 ) Ms(i, j )
where is the recombination rate between markers k-1 and k.
The emission probabilities are equal to 1.000 if alleles on target and parental chromosomes are identical at marker k (1.00 if it is
missing on one of the chromosomes).
Similarly, the backward algorithm (Rabiner, 1989) is implemented as follow. Initialization for all paths i at the last marker Nm:
(i,Nm)=1.00
Induction for all paths i and marker k :
Np
(i,k) =
( ( j,k + 1) * (i, j,k) * a( j,k + 1))
j =1
Finally, forward and backward probabilities can be combined to compute the probability that the hidden Markov chain went through
hidden states i at marker k (corresponding to the probability that TC is identical by descent to PC i at marker k or the IBD probability
between the TC and the PC i at marker k).
8 SI
T. Druet and F. P. Farnir
TABLE S1
Mean allelic imputation error rates of missing markers for animals genotyped on the low-density marker map
Expected percentage of genome inherited from
Number of individuals
Mean allelic error rate
25%
7
0.2647
25% < . 50%
81
0.2073
50% < . 75%
249
0.0985
75% < . 87.5%
657
0.0545
87.5% < . 90%
121
0.0451
90% < . 92.5%
270
0.0394
92.5% < . 95%
562
0.0306
95% < . 97.5%
734
0.0216
97.5% < . < 100%
545
0.0163
100 %
506
0.0056
genotyped TIPC