Copyright © 2011 by the Genetics Society of America
DOI: 10.1534/genetics.111.127720

Modeling of Identity-by-Descent Processes Along a Chromosome Between Haplotypes and Their Genotyped Ancestors

Tom Druet*,1 and Frederic Paul Farnir†

*Unit of Animal Genomics, Faculty of Veterinary Medicine and Centre for Biomedical Integrative Genoproteomics and †Unit of Animal Productions, Faculty of Veterinary Medicine, University of Liège, B-4000 Liège, Belgium

Manuscript received February 13, 2011
Accepted for publication March 3, 2011

ABSTRACT

Identity-by-descent probabilities are important for many applications in genetics. Here we propose a method for modeling the transmission of the haplotypes from the closest genotyped relatives along an entire chromosome. The method relies on a hidden Markov model where hidden states correspond to the set of all possible origins of a haplotype within a given pedigree. Initial state probabilities are estimated from average genetic contribution of each origin to the modeled haplotype while transition probabilities are computed from recombination probabilities and pedigree relationships between the modeled haplotype and the various possible origins. The method was tested on three simulated scenarios based on real data sets from dairy cattle, Arabidopsis thaliana, and maize. The mean identity-by-descent probabilities estimated for the truly inherited parental chromosome ranged from 0.94 to 0.98 according to the design and the marker density. The lowest values were observed in regions close to crossing over or where the method was not able to discriminate between several origins due to their similarity. It is shown that the estimated probabilities were correctly calibrated. For marker imputation (or QTL allele prediction for fine mapping or genomic selection), the method was efficient, with 3.75% allelic imputation error rates on a dairy cattle data set with a low marker density map (1 SNP/Mb).
The method should prove useful for situations we are facing now in experimental designs and in plant and animal breeding, where founders are genotyped with relatively high marker densities and the last generation(s) are genotyped with a lower-density panel.

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.111.127720/DC1.
1 Corresponding author: Unit of Animal Genomics, GIGA-R B34, 1 avenue de L’Hôpital, B-4000 Liège, Belgium. E-mail: [email protected]
Genetics 188: 409–419 (June 2011)

Two alleles at a single locus are identical-by-descent (IBD) if they derive from a common ancestor without mutation. The probability that two alleles are IBD (IBDp) is important information, as it is used in several applications such as gene or QTL mapping (e.g., Meuwissen and Goddard 2004), marker-assisted (Fernando and Grossman 1989) or genomic selection (De Roos et al. 2007; Calus et al. 2008), haplotype reconstruction, or imputation (Kong et al. 2008). Estimation of IBDp may rely on linkage or linkage disequilibrium information, or on both. With linkage, the estimation of IBDp is based on Mendelian segregation rules, recombination, pedigree, and genotypes. With such information, IBDp are most often estimated between parents and offspring using only the informative flanking markers (e.g., Fernando and Grossman 1989; Wang et al. 1995; Pong-Wong et al. 2001). IBDp between more distant relatives are obtained by transmitting the parent–offspring information along the path between the relatives. If some individuals on this path are not genotyped, transmission of chromosomes can no longer be traced unambiguously. In addition, since the information relies only on flanking markers, it is rapidly eroded along the pedigree path. Other methods estimate IBDp using multipoint linkage and segregation analysis and by considering all the pedigree links.
Lander and Green (1987) proposed such a method considering each alternative gene flow pattern in a pedigree separately. However, it results in high computational costs as the pedigree grows. Abecasis et al. (2002) described an efficient algorithm for the analysis of dense genetic maps in pedigrees. Their program, called Merlin, proved faster (and used less memory) than software based on the Lander–Green algorithm. Although very efficient, Merlin nevertheless also becomes impractical in large pedigrees with complex structures such as those used in animal and plant breeding. Methods such as Loki (Heath 1997) or Morgan (Wijsman et al. 2006) consider multipoint linkage and segregation analysis using MCMC techniques to jointly estimate IBDp or haplotype probabilities. Although fairly general and potentially using all the available information, these methods are also very demanding from a computational point of view, especially in problems involving large pedigrees and in contexts of high-density genotyping. Furthermore, these approaches are always faced with the problem of getting stuck in a local maximum region.

A common feature to all the methods based on linkage is that the definition of the IBDp refers to a common ancestor within the pedigree: regions will be declared to be IBD only if they can be shown to descend from one common ancestral region within the pedigree. The situation is different with linkage disequilibrium methods (e.g., Meuwissen and Goddard 2001; Hernandez-Sanchez et al. 2006), where IBDp are computed with reference to unobserved ancestors of the studied population. The assumptions are that these common ancestors were present T generations before the current population and that the studied mutations are old and were already present within these ancestors. The size of the haplotypes shared IBD by two gametes is relatively small and small haplotypes are used to estimate IBDp.
As a consequence, such methods require higher marker densities than linkage approaches, focus on relatively small segments surrounding the region of interest, and basically ignore pedigree relationships. With current marker maps (e.g., 50K SNP chips for cattle or maize), these methods also have a high computational burden.

In plant and animal breeding, as in experimental designs, new more cost-effective genotyping strategies are being implemented. Indeed, important founder individuals are genotyped at high density whereas other individuals are genotyped at lower density. For example, in the nested association mapping (NAM) design (Yu et al. 2008) or in multiparent advanced generation intercrosses (MAGIC) (e.g., Threadgill et al. 2002; Kover et al. 2009), the recombinant inbred lines are phenotyped and genotyped at low density whereas founder lines are genotyped at high density. In dairy cattle, important sires could be genotyped using dense panels (or even sequenced), while younger animals (e.g., those without phenotype) can be genotyped at lower density (e.g., Habier et al. 2009). In such designs, estimation of IBDp between founder and nonfounder individuals might be particularly useful for applications such as QTL mapping, genomic selection, or marker imputation. The IBDp estimation method should then rely on pedigree information and linkage (because some animals are genotyped at lower density), be computationally efficient, and handle complex pedigrees with overlapping generations where genotyped founders are spread over several generations.

In this article, we propose an alternative and efficient method with which to describe the IBD process for a whole chromosome between target individuals and their genotyped ancestors, taking benefit of the marker density and using recombination probabilities and known pedigree.
MATERIALS AND METHODS

Estimation of IBD probabilities: The approach proposed here is a linkage method using information from genotyped ancestors in the pedigree even in situations where some individuals in the pedigree are not genotyped. It does not estimate IBDp with nonancestors. Any chromosome of today’s individuals is a mosaic of ancestral chromosomes. In theory, except for mutation events, any piece of chromosome could be traced back to a corresponding piece of an ancestral chromosome in the individual’s pedigree. Inferring the ancestral origin of genotyped chromosomes can be done using probability statements: each locus on a chromosome has a computable probability to descend from ancestral chromosomes taken in a subset C of the possible chromosomes set V.

The set of chromosomes V: This set contains all the chromosomes in a given pedigree. For a pedigree containing n (genotyped or not) individuals, the dimension of V is 2 · n.

The subset C(c) of potential parental chromosomes of a given chromosome c: For each target chromosome (TC) c in the pedigree, the potential parental chromosomes (PC) of c constitute a subset C(c) of V. The members of C can be obtained as follows (an example is provided in supporting information, File S1): starting from c and going up the pedigree (i.e., backward in time), we obtain the two possible (i.e., parental) origins of c in the previous generation. If this parent is genotyped, these two chromosomes are included in (and make up) C(c) and the process ends. If not, this process is repeated for each of the two parental chromosomes until:

• either one finally reaches genotyped chromosomes, which are then included in C(c), or
• one reaches an ungenotyped chromosome for which no genotyped PC exists. This ungenotyped chromosome is then included in C(c) as a “phantom” chromosome.

Any locus on the TC is IBD with the corresponding locus on one of these chromosomes.
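The construction of C(c) described above can be sketched as a short recursion. This is only an illustration under stated assumptions, not the authors' implementation: the pedigree encoding (a dict mapping an individual to its sire and dam), the chromosome labels `("individual", "pat" | "mat")`, the function name, and the toy pedigree are all hypothetical, and inbreeding (several paths to the same PC) is ignored.

```python
def potential_parental_chromosomes(chrom, pedigree, genotyped, depth=0):
    """Return a list of (parental chromosome, G, is_phantom) for `chrom`.

    chrom     : (individual, "pat" or "mat"), hypothetical labelling
    pedigree  : dict individual -> (sire, dam); missing entries mean unknown parents
    genotyped : set of genotyped individuals
    G         : number of generations between the parental chromosome and the target
    """
    ind, side = chrom
    sire, dam = pedigree.get(ind, (None, None))
    parent = sire if side == "pat" else dam
    if parent is None:
        return []  # unknown parent: nothing to climb to from here
    if parent in genotyped:
        # Genotyped parent: its two chromosomes are the origins and the climb stops.
        return [((parent, s), depth + 1, False) for s in ("pat", "mat")]
    found = []
    for s in ("pat", "mat"):
        pc = (parent, s)
        above = potential_parental_chromosomes(pc, pedigree, genotyped, depth + 1)
        if any(not phantom for _, _, phantom in above):
            found.extend(above)  # at least one genotyped origin exists above pc
        else:
            found.append((pc, depth + 1, True))  # no genotyped origin: pc is a phantom
    return found


# Toy pedigree (assumed): genotyped sire S and maternal grandsire MGS,
# ungenotyped dam D, maternal granddam MGD with unknown parents.
pedigree = {"T": ("S", "D"), "D": ("MGS", "MGD")}
genotyped = {"S", "MGS"}
paternal_set = potential_parental_chromosomes(("T", "pat"), pedigree, genotyped)
maternal_set = potential_parental_chromosomes(("T", "mat"), pedigree, genotyped)
```

In this toy case the maternal chromosome of the target has three potential origins: the two chromosomes of the maternal grandsire (two generations away) and one phantom, the maternal chromosome of the ungenotyped dam (one generation away).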
In terms of probabilities, the consequence is that $\sum_{x \in C(c)} P(c(k) \equiv x(k)) = 1$, where the sum is taken over all chromosomes x making up C(c) and $P(c(k) \equiv x(k))$ denotes the probability for TC c to be IBD with x at a given locus k.

Modeling inherited parental chromosomes along the chromosome: According to the way the ancestral chromosomes set C has been defined, the process leading to the formation of a mosaic of pieces of the members of C can be seen as follows:

• The beginning of the chromosome can correspond to any of the chromosomes $c_i$ in C with a probability that can be estimated (see Initial state probabilities).
• Considering any pair of positions k and k + 1 along the chromosome, there is a possibility to switch from $c_i$ in position k (noted $c_i(k)$) to another member $c_j$ of C in position k + 1, and the probability of these switches can also be computed. These “transition probabilities” depend on the couple $(c_i, c_j)$ and on the distance between the positions k and k + 1.

A Markov process, where the PC serve as states, can be used to approximate this process (Thompson 1994). In that case, the PC inherited (i.e., state) at locus k + 1 is dependent on the PC inherited at locus k, and is assumed conditionally independent of the PC inherited at previous loci:

$$P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k),\ c(k-1) \equiv c_{i'}(k-1),\ c(k-2) \equiv c_{i''}(k-2), \ldots] = P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k)].$$

Consequently, the genetic process can be approximated through the stochastic process of a Markov chain: this reduces to, first, describing the prior probability to be in any given state (PC) and, second, computing the transition probabilities between PC.

In the case of crossover interference, the Markov property is not valid.
In addition, previous studies showed that in some particular pedigrees, the Markov assumption is not appropriate (e.g., McPeek and Sun 2000; Broman 2005) but that the Markov approximation may nevertheless be successful to describe the IBD process (Thompson 1994; Leutenegger et al. 2003; Broman 2005). In our study, the Markov approximation might be applicable because simplified pedigrees are used: only IBDp with the closest ancestors are estimated and chromosomes are modeled independently.

Initial state probabilities: The prior probability of any given PC $c_i$ is simply $0.5^{G_i}$, where $G_i$ is the number of generations between the PC $c_i$ and the TC c (see examples in File S1).

State transition probabilities: Transition between PC can potentially take place anywhere along the chromosome, but information is available only for discrete positions (i.e., markers), and so we can use these markers as indicators of possible transitions between PC (as later described). The needed transition probabilities are then conditional probabilities:

$$P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k)].$$

As already mentioned above, this probability is the probability that the TC c is IBD with the PC $c_j$ in position k + 1 when the TC is IBD with PC $c_i$ in position k. Two situations exist:

• If i = j, this means that no recombination occurred between these two positions for the $G_i$ generations between this PC and the TC. The probability is:

$$P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k)] = (1 - r)^{G_i}, \qquad (1)$$

where r is the recombination fraction between positions k and k + 1.

• If i ≠ j, we can decompose the situation into three necessary events. The first two events lead to a chromosome where positions k and (k + 1) originate from PC i and j, respectively. The last event is then similar to the previous situation:
  • On the pedigree paths from $c_i$ and $c_j$ to c, there must be at least one common ancestor where the two paths merge. We note this ancestor “a” (see examples in File S1). The probability for the PC i and PC j to proceed down to this ancestor at positions k and k + 1, respectively, is $0.5^{(G_i - G_a)} \cdot 0.5^{(G_j - G_a)}$, where $G_a$ is the number of generations between the ancestor a and the TC.
  • After reaching this common ancestor, the two chromosomes need to recombine in the next meiosis, leading to the needed hybrid chromosome m in the next generation, with $m(k) = c_i(k)$ and $m(k+1) = c_j(k+1)$. The probability for this recombination is r. In addition, this recombinant haplotype is transmitted to the next generation with probability 0.5, resulting in a total probability of 0.5 · r for this event.
  • For the remaining $(G_a - 1)$ generations, the hybrid chromosome must be handed down without recombination. The probability to do that is $0.5^{(G_a - 1)} \cdot (1 - r)^{(G_a - 1)}$.

As a consequence, we have the following joint probability:

$$P[c_i(k) \equiv c(k),\ c_j(k+1) \equiv c(k+1)] = 0.5^{(G_i + G_j - G_a)} \cdot r \cdot (1 - r)^{(G_a - 1)}.$$

The needed conditional probability $P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k)]$ can be obtained by dividing this joint probability by $P[c(k) \equiv c_i(k)]$, which is simply $0.5^{G_i}$. To summarize, the transition probability is then:

$$P[c(k+1) \equiv c_j(k+1) \mid c(k) \equiv c_i(k)] = 0.5^{(G_j - G_a)} \cdot r \cdot (1 - r)^{(G_a - 1)}. \qquad (2)$$

Estimating the probability of one sequence of parental chromosomes: When modeling the IBD process along the chromosome, $N_p$ positions of TC are considered simultaneously. Since we have $N_{PC}$ parental chromosomes, each locus must a priori be IBD with one of these PC, and without further (i.e., marker) information, we have $N_{PC}$ possible parental origins for each of the positions, which amounts to a total of $N_s = (N_{PC})^{N_p}$ possible sequences of PC. All these sequences are not equally likely, and the probability for c to correspond to any particular sequence $s = \{pc_1, pc_2, \ldots, pc_{N_p}\}$ can be computed using the Markovian property given above,

$$P(s) = P(c(1) \equiv pc_1(1)) \prod_{k=2}^{N_p} P[c(k) \equiv pc_k(k) \mid c(k-1) \equiv pc_{k-1}(k-1)],$$

where $pc_k$ corresponds to the kth PC in the sequence. Among these $N_s$ sequences, $(N_{PC})^{N_p - 1}$ share the same PC in position k (say, $c_i(k)$). Consequently, IBDp between TC c and PC $c_i$ at position k can be estimated by summing the probabilities over all these sequences where $pc_k(k)$ equals $c_i(k)$,

$$P(c(k) \equiv c_i(k)) = \sum_{j=1}^{N_s} P(s^j) \cdot I[pc_k^j(k) = c_i(k)], \qquad (3)$$

where $s^j$ is the jth possible sequence, $pc_k^j(k)$ is the kth PC in the jth sequence, and I(a = b) is an indicator function equal to 1 if a equals b and 0 otherwise. The number of sequences increases exponentially with the number of positions and the number of ancestral chromosomes to be considered, which potentially renders this computation very demanding. Fortunately, the use of hidden Markov models (HMM) can provide efficient computation of such probabilities.

Conditioning IBD probabilities on observed haplotypes: Chromosomes are not observed but genotypes are. Using these genotypes, haplotypes can be obtained for TC and genotyped PC (e.g., Windig and Meuwissen 2004; Druet and Georges 2010) and can be used to infer IBDp. The available genotypic information allows conditioning of the sequence probabilities from the previous section on the obtained haplotypes. $P(s^j \mid h_c)$ would be used instead of $P(s^j)$ in Equation 3, where $h_c$ stands for a haplotype (on chromosome c):

$$P(s^j \mid h_c) = \frac{P(s^j)\, P(h_c \mid s^j)}{\sum_{j'} P(s^{j'})\, P(h_c \mid s^{j'})}. \qquad (4)$$

Probabilities such as $P(h_c \mid s^j)$ are obvious to compute: if, for all the PC making up the sequence $s^j$, the corresponding alleles are identical to alleles of the TC ($h_c$), then $P(h_c \mid s^j) = 1$; otherwise $P(h_c \mid s^j) = 0$ (computation of $P(h_c \mid s^j)$ is based on nonmissing and phased markers only).
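As a numerical illustration of the initial state probabilities and of Equations 1 and 2, the sketch below builds the transition matrix for a small hypothetical set of three PC and checks that the initial probabilities and each row of transitions sum to one. The PC names, their generation counts, and the values of $G_a$ assigned to each pair are assumptions made for this example only (in particular, $G_a$ for the two chromosomes of the same genotyped ancestor is taken to be that ancestor's generation); none of this comes from File S1.

```python
import math


def transition_probability(g_i, g_j, g_a, r, same_pc):
    """Transition from PC i (position k) to PC j (position k + 1).

    Equation 1 when the PC is unchanged, Equation 2 otherwise.
    g_a is ignored when same_pc is True.
    """
    if same_pc:
        return (1.0 - r) ** g_i
    return 0.5 ** (g_j - g_a) * r * (1.0 - r) ** (g_a - 1)


# Hypothetical C(c) for a maternal target chromosome: both chromosomes of a
# genotyped maternal grandsire (G = 2) and a phantom dam chromosome (G = 1).
generations = {"MGS_pat": 2, "MGS_mat": 2, "D_mat_phantom": 1}


def merge_generation(a, b):
    """Assumed G_a (generation where the two pedigree paths merge) for this toy set."""
    return 2 if {a, b} == {"MGS_pat", "MGS_mat"} else 1


def transition_matrix(r):
    """Nested dict matrix[i][j] of transition probabilities for recombination fraction r."""
    names = list(generations)
    return {
        i: {
            j: transition_probability(
                generations[i], generations[j],
                generations[i] if i == j else merge_generation(i, j),
                r, i == j,
            )
            for j in names
        }
        for i in names
    }


initial = {name: 0.5 ** g for name, g in generations.items()}
matrix = transition_matrix(0.01)
row_sums = {i: sum(row.values()) for i, row in matrix.items()}
```

For this toy set the rows sum to one for any r, for instance $(1-r)^2 + r(1-r) + r = 1$ for a grandsire chromosome, which is a quick consistency check on the exponents in Equations 1 and 2.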
To allow for genotyping errors and/or mutations, these probabilities can be replaced by $(1 - e)^{(N_p - N_d)}\, e^{N_d}$, where e is a small value (typically, 0.001) and $N_d$ is the number of differences between alleles of $h_c$ and alleles of the PC in the sequence j. For phantom PC, both genotypes and haplotypes are unknown. Therefore, we replace known or estimated haplotypes by the probability that phantom PC carries allele i at marker m, taken to be equal to the frequency of allele i at marker m.

Extending to inbreeding: In case of inbreeding, several paths might lead from the same PC to the TC (see File S1 for examples in cases of inbreeding). To deal with this problem, we first derive IBDp along each path individually, and then sum over all paths belonging to the same PC. The set of parental chromosomes C(c) is extended to the set of distinct paths P(c) leading from the PC to the TC. In the absence of inbreeding, C(c) and P(c) are identical and PC appear in only one path, but when inbreeding is present some of the PC will be present in several paths. Transition probabilities are computed for all pairs of paths in P(c) and haplotypes from the PC at the top of the path are used in Equation 4. In addition, paths from PC to TC might share more than one common ancestor in which recombination could occur. Equation 2 considered that paths merge in a single ancestor. To take into account the possibility of several common meioses between paths $p_i$ and $p_j$, Equation 2 can be generalized to estimate transition probabilities between path $p_i$ at marker k and path $p_j$ at marker k + 1,

$$P[c(k+1) \Leftarrow p_j(k+1) \mid c(k) \Leftarrow p_i(k)] = 0.5^{M_i} \cdot r^{M_o} \cdot (1 - r)^{M_s}, \qquad (5)$$

where $c(k) \Leftarrow p(k)$ indicates that c was inherited through path p at position k, and $M_i$ is the number of meioses in path $p_j$ that are not on the path $p_i$.
$M_s$ represents the number of common meioses that result in the transmission of the same chromosome (no crossing over) while $M_o$ represents the number of common meioses that result in the transmission of the complementary chromosome (crossing over). The sum of $M_s$ and $M_o$ is equal to the number of common meioses between the two paths and the sum of $M_i$, $M_s$, and $M_o$ is equal to the number of meioses in path $p_j$. In the absence of inbreeding, there is a unique path from PC to TC (there is a one-to-one relationship between path and PC in that case) and Equations 1 and 2 are identical to Equation 5. Indeed, when i = j (Equation 1), $M_s = G_i$ ($M_i = M_o = 0$) and when i ≠ j (Equation 2), $M_i = G_j - G_a$, $M_o = 1$, and $M_s = G_a - 1$.

A hidden Markov model to model the IBD process along a chromosome: HMM (e.g., Rabiner 1989) allow efficient computation of sequence probabilities conditionally on haplotypes (Equation 4). This result can be achieved without estimating the probability of all sequences by using the forward–backward algorithm (Baum and Egon 1967). Actually, HMM have already been used to model IBD processes along a chromosome in various applications (e.g., Lander and Green 1987; Guo 1994; Mott et al. 2000; Leutenegger et al. 2003; Scheet and Stephens 2006; Thompson 2008). Rabiner (1989) described HMM as unobserved Markov chains (hidden Markov chains) and observed sequences of “emitted” symbols. The symbol observed at position k of the sequence depends only on the state of the hidden Markov chain at the same position. The Markov chain is fully characterized by a set of states and corresponding sets of initial probabilities (to be in any of the states) and transition probabilities (between any ordered pair of states). As mentioned earlier, our description of the transmission of pieces of chromosomes from a set of PC to a TC closely resembles this definition of a Markov chain: the $N_{PC}$ parental chromosomes could be used as states.
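To make the computational argument concrete, the following sketch compares a scaled forward–backward pass with the exhaustive sum over all sequences (Equations 3 and 4, with per-marker error probability e) on a tiny example. It is a generic textbook forward–backward written for illustration, not the authors' program (see File S2 for their algorithm); the two PC haplotypes, the target haplotype, and r = 0.01 per interval are invented values, and the two states stand for the two chromosomes of a genotyped parent (G = 1).

```python
from itertools import product


def forward_backward(init, trans, emis):
    """Posterior state probabilities post[k][i] for an HMM.

    init[i]: initial probability; trans[i][j]: transition between adjacent
    markers (assumed constant here); emis[k][i]: emission at marker k.
    """
    n_pos, n_st = len(emis), len(init)
    fwd = [[init[i] * emis[0][i] for i in range(n_st)]]
    for k in range(1, n_pos):
        fwd.append([emis[k][j] * sum(fwd[k - 1][i] * trans[i][j] for i in range(n_st))
                    for j in range(n_st)])
        total = sum(fwd[k])
        fwd[k] = [v / total for v in fwd[k]]  # rescale to avoid underflow
    bwd = [[1.0] * n_st for _ in range(n_pos)]
    for k in range(n_pos - 2, -1, -1):
        bwd[k] = [sum(trans[i][j] * emis[k + 1][j] * bwd[k + 1][j] for j in range(n_st))
                  for i in range(n_st)]
        total = sum(bwd[k])
        bwd[k] = [v / total for v in bwd[k]]
    post = []
    for k in range(n_pos):
        w = [fwd[k][i] * bwd[k][i] for i in range(n_st)]
        total = sum(w)
        post.append([v / total for v in w])
    return post


def brute_force(init, trans, emis):
    """Same posteriors by enumerating all N_PC ** N_p sequences (Equations 3 and 4)."""
    n_pos, n_st = len(emis), len(init)
    post = [[0.0] * n_st for _ in range(n_pos)]
    total = 0.0
    for seq in product(range(n_st), repeat=n_pos):
        p = init[seq[0]] * emis[0][seq[0]]
        for k in range(1, n_pos):
            p *= trans[seq[k - 1]][seq[k]] * emis[k][seq[k]]
        total += p
        for k, state in enumerate(seq):
            post[k][state] += p
    return [[v / total for v in row] for row in post]


# Invented example: target switches from PC 0 to PC 1 somewhere in the middle.
e, r = 0.001, 0.01
pc_haplotypes = [[0, 0, 1, 0, 1, 1], [1, 1, 1, 0, 0, 0]]
target = [0, 0, 1, 0, 0, 0]
emis = [[1 - e if hap[k] == target[k] else e for hap in pc_haplotypes]
        for k in range(len(target))]
init = [0.5, 0.5]
trans = [[1 - r, r], [r, 1 - r]]
post_fb = forward_backward(init, trans, emis)
post_bf = brute_force(init, trans, emis)
```

The enumeration visits $2^6 = 64$ sequences here, whereas the forward–backward pass scales linearly with the number of markers; both give the same posterior IBDp, with intermediate values at the two uninformative middle markers where the crossover cannot be localized.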
Note also that PC are not directly observed but “emitted” haplotypes can be obtained by genotyping and haplotype reconstruction and that the alleles of the haplotypes depend only on the PC (hidden state) inherited at the corresponding position. In this study, we thus propose the use of an HMM to identify the origin of a chromosome segment among the genotyped PC in the pedigree (see File S2 for more details on the algorithm). The initial state and transition probabilities of our HMM are defined by $0.5^{G_i}$ and Equations 1 and 5, respectively. The emission probability at marker k is equal to $1 - e$ if the allele in the TC is identical to the allele in the PC for that marker and to $e$ if it differs, where e is the probability of error (e.g., genotyping errors, mutations). When the PC is a “phantom” PC, the emission probabilities are equal to the frequency of the observed marker allele. The model implemented in this study relies on known haplotypes for TC and genotyped PC, possibly obtained by other programs (e.g., Druet and Georges 2010). We discuss this assumption below. Paternal and maternal chromosomes are modeled independently.

Simulation study: Data sets: The method was tested with simulations based on haplotypes from real data sets from dairy cattle, maize, and Arabidopsis thaliana. For each species, genotypes from chromosome 1 were used.

The dairy cattle set consisted of 4732 animals genotyped on a custom-made 60K Illumina panel described in Charlier et al. (2008). Data were provided by CRV (The Netherlands) and are available upon request. The 2691 BTA1 SNP spanned approximately 160 Mb. The pedigree file contained 13,163 individuals. A subset of 3648 target chromosomes was selected to study the estimation of IBDp. These TC correspond to individuals without any genotyped descendant and with male ancestors carrying PC genotyped over three to eight generations (and no genotyped female ancestor).
One generation of genotyped male ancestors corresponds to a genotyped sire and an ungenotyped dam. Two generations of genotyped male ancestors carrying PC correspond to a genotyped sire and maternal grandsire, while the dam and maternal granddam are ungenotyped. For each additional generation of genotyped male ancestors carrying PC, the sire of the (ungenotyped) mate of the oldest genotyped sire is added. For instance, in the first example in File S1, male ancestors carrying PC (individuals 2, 6, 9, and 10) are genotyped for four generations, whereas the status of individual 4 is irrelevant because individual 2 is genotyped. A subset of 1000 individuals represented the PC: the 516 individuals with at least one genotyped descendant and 484 individuals chosen at random to be used as phantom PC in the simulation. Haplotypes from all individuals were obtained using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010).

The genotypes from 19 “founder” accessions of the MAGIC in A. thaliana (Kover et al. 2009) were downloaded from http://gscan.well.ox.ac.uk/arabidopsis. We simulated the MAGIC lines as described in Scarcelli et al. (2007) by intermating 19 founder lines for five generations, producing 342 F5 outcrossed families. More precisely, in the first generation, a complete diallel cross (the 19 × 18 = 342 possible F1 crosses) was simulated. The generations F1–F4 were then randomly intercrossed. In each generation, two plants from each family were selected to be paternal and maternal parents. The 342 final lines were obtained by selfing F5 families for six generations. Genotypes from 275 markers on the first chromosome (30 Mb) were used.

Data from the NAM design in maize (Yu et al. 2008; McMullen et al. 2009) were used to simulate the last design. In the NAM design, 25 founders are crossed with the B73 line to produce recombinant inbred lines (RILs).
Since deducing the origin is relatively easy in that case (based on the SNPs, which are polymorphic across the two founder lines), we changed the design to a complete diallel cross among the 26 founders and further crossed the F1 for three additional generations (as described for the MAGIC data set). Genotypes for the 175 SNP on chromosome 1 (200 Mb) were downloaded from http://www.panzea.org/lit/data_sets.html#NAM_map.

For the three populations, data were simulated by using the real pedigree for dairy cattle and the described pedigree for plants and by transmitting the haplotypes from genotyped ancestors to offspring using Mendelian segregation rules and recombination probabilities (assuming 1 cM = 1 Mb). In dairy cattle data, phantom PC haplotypes were selected randomly from the 484 individuals mentioned above. Furthermore, with this data set, a subset of 164 markers (the SNP with the highest MAF was selected every megabase) was used to test the estimation of IBDp at a lower marker density.

In dairy cattle, haplotypes of actually genotyped animals were assumed known while in plants, all haplotypes of founder and last-generation individuals were assumed known. Pedigrees were also considered as known. To test the impact of missing genotypes on the estimation of IBDp, 1 or 5% of marker alleles were randomly erased in the dairy cattle data set. Similarly, we randomly generated 0.01, 0.1, and 1% genotyping errors in the dairy cattle data set to study the effect of genotyping errors (the error rate e used by the model was equal to 0.001 in all cases) on estimation of IBDp. Finally, estimated haplotypes [using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010)] were also used in the dairy cattle data set to compare estimated IBDp obtained with known and inferred haplotypes.
In this situation where haplotypes are inferred, when the paternal or maternal origin of a truly inherited parental chromosome (TIPC) was not known because the corresponding individual had no genotyped parent, we identified the TIPC as the chromosome that transmitted the marker alleles to the TC. All studies were repeated 100 times.

Estimation of accuracy of IBDp: The quality of IBDp estimation can be described by the distribution, over all TC and markers, of the estimated IBDp between the TC and the TIPC. This distribution indicates whether the model correctly predicts the origin of the chromosome. Ideally, the IBDp should be one for all positions.

For each haplotype and each marker position, the model estimates $N_{PC}$ IBDp (one for each PC in the set). IBDp with each PC were assigned to percentile classes (100 classes: 0–0.01, 0.01–0.02, . . . , 0.99–1.00). When the corresponding PC was (not) the TIPC, one (in)correct estimation was added in the class. Then, the percentage of correct estimations was computed within each class. It should be noted that the expected proportion of TIPC in percentile q is q, meaning that a plot of the observed proportion of TIPC vs. the estimated IBDp should be a straight diagonal line (see Figure 3). For the dairy cattle data, results were given separately for genotyped TIPC and phantom TIPC.

Application: marker imputation with a low-density chip in dairy cattle: To illustrate a potential application of the method, we mimicked imputation of markers from a dense SNP panel with genotypes from a low-density SNP chip. Using the dairy cattle data described above, we conserved all the genotypes from the 1000 animals representing the PC in the simulation study, while for the remaining 3732 individuals, we conserved only 164 markers per animal (the same as in the simulation study).
Haplotypes of reference and target individuals were reconstructed using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010) as described in Zhang and Druet (2010). The method described in this study was then used to estimate IBDp between TC and PC. Missing alleles were then predicted by combining IBDp and alleles observed in PC as

$$P(h_c^k = m) = \sum_{x=1}^{N_{PC}} P(c(k) \equiv x(k))\, P(h_x^k = m),$$

where $P(h_x^k = m)$ is the probability that haplotype x at marker k carries allele m (1 if yes, 0 if no, when haplotype is known) and $P(c(k) \equiv x(k))$ is the estimated IBDp between PC x and TC c at marker k. For ungenotyped ancestors (phantom PC), the frequency of the marker allele was used instead of the observed allele. Allelic probabilities of both chromosomes from an individual were then combined to obtain genotype probabilities. These probabilities were used to compute the estimated number of allele “1” per genotype (ranging from 0 to 2). The number of errors per genotype was equal to the absolute value of the difference between the real and estimated numbers of allele 1 (the proportion is obtained by dividing the number of errors by the number of imputed alleles). The efficiency of prediction was then obtained by averaging this statistic over all imputed genotypes and markers. This value indicates whether the model correctly predicts marker alleles.

RESULTS

Estimation of IBDp in the three simulated designs: Distributions of IBDp for TIPC with the four simulated data sets are presented in Figure 1. The curves indicate that a large majority of the estimated IBDp are above 0.99 and very few at low values for all simulated designs, stressing the ability of the method to identify with high probability the TIPC. The mean IBDp for TIPC was equal to 0.9798 and 0.9415 in the A. thaliana and maize designs, respectively. For the A. thaliana design, IBDp of TIPC was higher than 0.99 and 0.90 for 91.3 and 95.9% of the situations, respectively.
For the maize design, with lower density but closer founders (in terms of generations), these values dropped respectively to 74.0 and 88.1%.

For the dairy cattle design, mean IBDp was equal to 0.9800 for genotyped TIPC and only to 0.5511 for phantom TIPC, resulting in an overall mean IBDp of 0.9567 for TIPC (5.4% of the genome of TC was inherited from phantom PC). With 1% (5%) missing markers, IBDp were almost identical to the ones obtained with the full marker set: 0.9565 (0.9560) for overall mean IBDp for TIPC and 0.9799 (0.9796) when considering only genotyped TIPC. With genotyping errors, overall mean IBDp for TIPC and IBDp for genotyped TIPC slightly decreased to 0.9551 and 0.9783 (0.01% errors), 0.9442 and 0.9667 (0.1% errors), and 0.9433 and 0.9657 (1% errors), respectively. Values obtained with estimated haplotypes instead of known haplotypes were in between results obtained in the missing genotypes and genotyping errors scenarios: 0.9531 for overall mean IBDp for TIPC and 0.9762 for IBDp for genotyped TIPC. Returning to the situation with no missing markers and no genotyping errors, but decreasing the number of markers from 2691 to 164, the average IBDp dropped to 0.9456 (genotyped TIPC), 0.6294 (phantom TIPC), and 0.9418 (all TIPC) (Figure 1). Of genotyped TIPC, 88.1% and 94.8% had IBDp > 0.99 and > 0.90, respectively, with the high-density map, compared with 78.6% and 90.1% with the low-density map.

Figure 1. Observed cumulative distribution of mean IBDp for truly inherited parental chromosomes (TIPC): (•) all TIPC, (*) genotyped TIPC, and (s) phantom TIPC (ungenotyped TIPC). The figure was obtained by assigning estimated IBDp to 100 probability classes (0–0.01, 0.01–0.02, . . ., 0.99–1.00).

For phantom TIPC, estimated IBDp were low more frequently, clearly indicating a lack of power to detect phantom TIPC. For these TIPC, emission probabilities are equal to allele frequencies while for genotyped TIPC, emission probabilities are equal to 0.001 or 0.999 according to the allele actually observed on the PC. By chance, the emission probability associated with a genotyped PC can be higher than the product of allele frequencies, leading to an incorrect assignment. In addition, PC closer (in terms of the number of generations) to the TC will have higher transition probabilities than phantom PC, which are often separated from the TC by more generations. When more genotyped PC are present in the set of PC, the probability of obtaining by chance one PC with an emission probability higher than the product of allele frequencies increases. This is illustrated in Table 1 where estimation of IBDp is summarized for individuals with an increasing number of genotyped PC. Indeed, with more genotyped PC, mean IBDp associated with phantom TIPC decreases. In addition, the table indicates that the proportion of the genome inherited from genotyped PC increases as expected with a higher number of genotyped PC but that mean IBDp estimated for genotyped TIPC slightly decreases. This is explained by the fact that additional genotyped TIPC are separated from the TC by more generations, resulting in smaller transmitted segments (the expected size of the segment is 1/(1 + G) M, where G is the number of generations between the TC and PC) with more transitions and fewer markers per segment to identify the origin. However, since the proportion of genotyped TIPC increases, the overall mean IBDp increases too (with the high-density map). With the lower-density map, the overall mean IBDp of TIPC remains relatively constant because IBDp for phantom TIPC have a stronger decrease. The most favorable configuration for IBDp estimation occurs when a sire or a dam is genotyped because 100% of the two TIPC are genotyped and transmitted fragments are large. In these cases, the mean IBDp of TIPC is equal to 0.9949 and 0.9851 with the high- and low-density maps, respectively.
For one animal, the mean IBDp of TIPC is the mean of the IBDp estimated for the paternal and maternal chromosomes, resulting in 0.9443 (0.9395) to 0.9656 (0.9432) mean IBDp when male ancestors carrying PC are genotyped from three to eight generations (see material and methods for more details) for the high (low)-density map (see Table 1). Results of the bovine data set clearly indicate that estimation of IBDp is improved when transmitted chromosome segments are larger (or with less crossing over along the chromosome). Figure 2 shows that mean estimated IBDp are low around crossing over and increases with distance from crossing over. When the exact position of a crossing over cannot be identified, the method estimates intermediate IBDp for the two concerned PC. With increased marker density, the influence of the crossing over is reduced because more markers are available to precisely localize crossing over. These events greatly contribute to the proportion of intermediate (or low) estimated IBDp for TIPC. Figure 3 shows that percentage of TIPC within each class of estimated IBDp linearly increases from 0 to 1 with IBDp with both plant designs. For each class, the proportion of PC corresponding to the TIPC is equal to the estimated IBDp, indicating that these probabilities are well calibrated. For the dairy cattle data, this is no longer true and we observe some high IBDp associated to noninherited PC. 
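The calibration check behind Figure 3 (parental chromosomes are grouped into classes of estimated IBDp, and the proportion of TIPC is then computed within each class) can be sketched as below. This is a minimal illustration under stated assumptions: the function name and the input layout (one estimated IBDp and one true/false flag per PC and marker position) are hypothetical and are not taken from the authors' program.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    ibd_probs: Sequence[float],
    is_tipc: Sequence[bool],
    n_classes: int = 100,
) -> List[Tuple[float, float, int, float]]:
    """Group estimated IBDp into equal-width classes.

    Returns one tuple per non-empty class:
    (lower bound, upper bound, number of PC, observed proportion of TIPC).
    Well-calibrated probabilities give a proportion close to the class value.
    """
    counts = [0] * n_classes
    hits = [0] * n_classes
    for p, inherited in zip(ibd_probs, is_tipc):
        # A probability of exactly 1.0 is placed in the last class.
        idx = min(int(p * n_classes), n_classes - 1)
        counts[idx] += 1
        hits[idx] += int(inherited)
    rows = []
    for idx in range(n_classes):
        if counts[idx] > 0:
            rows.append(
                (idx / n_classes, (idx + 1) / n_classes,
                 counts[idx], hits[idx] / counts[idx])
            )
    return rows
```

Plotting the fourth field against the class bounds gives a curve of the kind described for Figure 3; points far from the diagonal indicate poorly calibrated probabilities.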
However, when considering only portions of chromosomes inherited from genotyped PC, the concordance between estimated IBDp and proportion of TIPC is correct, while for regions inherited from phantom PC only, estimated IBDp are poorly calibrated.

TABLE 1
Mean IBDp for TIPC with the bovine data set as a function of the number of generations of genotyped male ancestors carrying PC: results from the dense map (in parentheses, results for the low-density map)

| Number of genotyped male ancestors carrying PC (a) | Paternal chromosome: IBDp for TIPC | Maternal chromosome: IBDp for genotyped TIPC | Maternal chromosome: IBDp for phantom TIPC | Maternal chromosome: proportion of chromosome inherited from genotyped TIPC (b) | Maternal chromosome: IBDp for all TIPC (c) | Mean IBDp per chromosome (d) |
|---|---|---|---|---|---|---|
| 3 | 0.9949 (0.9851) | 0.9826 (0.9554) | 0.6258 (0.7091) | 0.7508 (0.7500) | 0.8936 (0.8939) | 0.9443 (0.9395) |
| 4 | 0.9949 (0.9851) | 0.9671 (0.9370) | 0.5161 (0.5967) | 0.8754 (0.8750) | 0.9109 (0.8946) | 0.9529 (0.9398) |
| 5 | 0.9949 (0.9851) | 0.9594 (0.9265) | 0.4458 (0.5155) | 0.9380 (0.9375) | 0.9276 (0.9010) | 0.9613 (0.9430) |
| 6 | 0.9949 (0.9851) | 0.9562 (0.9206) | 0.3949 (0.4471) | 0.9686 (0.9687) | 0.9386 (0.9057) | 0.9668 (0.9454) |
| 7 | 0.9949 (0.9851) | 0.9386 (0.9031) | 0.3662 (0.3989) | 0.9845 (0.9844) | 0.9298 (0.8953) | 0.9623 (0.9402) |
| 8 | 0.9949 (0.9851) | 0.9412 (0.9060) | 0.3171 (0.3287) | 0.9921 (0.9922) | 0.9363 (0.9014) | 0.9656 (0.9432) |

(a) One generation of genotyped male ancestor corresponds to a genotyped sire and two generations of genotyped male ancestors carrying PC correspond to sire and maternal grandsire genotyped. For each additional generation of genotyped male ancestors carrying PC, the sire of the mate of the oldest genotyped sire is added. For instance, in the first example of File S1, male ancestors carrying PC (individuals 2, 6, 9, and 10) are genotyped for four generations.
(b) Expected proportion of maternal chromosome inherited from genotyped TIPC is indicated in parentheses.
(c) The IBDp for all TIPC is an average of IBDp for genotyped (column 3) and phantom (column 4) TIPC weighted by the proportion of genome inherited from genotyped TIPC (column 5).
(d) The mean IBDp per chromosome is equal to the mean of paternal (column 2) and maternal chromosomes (column 6).

Computational requirements: Computation times and memory requirements were measured on a computer with Intel Xeon “Harpertown” L5420 at 2.50 GHz. The requirements are a function of the number of TC (1200 for the maize design, 684 for the A. thaliana design, and 1824 for the dairy cattle design), the number of different paths to PC per TC (8 for the maize design, 1024 for the A. thaliana design, and at most 15 for the dairy cattle design), and the number of markers (175 for the maize design, 275 for the A. thaliana design, and 2691 for the dairy cattle design). Computation times to estimate IBDp at all marker positions and for all TC were equal to 41 sec, 7 min 53 sec, and 7 hr 13 min 22 sec, while memory requirements were equal to 129 Mb, 398 Mb, and 140 Mb for the maize, dairy cattle, and A. thaliana designs, respectively.

Application: marker imputation with low density chip in dairy cattle: Table S1 summarizes allelic imputation error rates obtained in dairy cattle for TC genotyped with the low-density map. The method was used to predict markers inherited from PC genotyped on the higher-density marker panel. As for the estimation of IBDp, the efficiency of the method increased when higher proportions of PC were genotyped. When the expected proportion of the genome inherited from genotyped ancestors increased from <0.25 to 1, the error rate decreased from 0.2647 to 0.0056. The majority (73.4%) of the individuals with TC in our data sets had more than 87.5% of their genome inherited from genotyped ancestors (corresponding to three genotyped male ancestors carrying PC or more) and 13.6% had both their parents genotyped.
The error rates were below 0.05 for animals with at least three generations of genotyped male ancestors carrying PC and the overall mean imputation error rate per animal was equal to 0.0375: the proportion of correctly imputed alleles (0.9625) was above the mean estimated IBDp for TIPC (0.9418) despite the fact that haplotypes were estimated and that genotyping errors might be present in the data set. Some genotypes might be correctly imputed by chance: although the correct PC was not identified, the incorrect PC with high estimated IBDp carried the same marker allele. Therefore, when the method incorrectly identifies the inherited parental chromosome because another PC is very similar (identity-by-state for a stretch of markers), the error has limited consequence on marker imputation or even QTL detection and genomic selection because the incorrect PC most likely carries the same alleles as the TIPC. Similar results were found by Zhang and Druet (2010) using the same method but for the whole genome.

Figure 2. Relation between estimated IBDp of TIPC and distance from the closest crossing over with the dairy cattle data: (A) high-marker density, (B) low-marker density.

DISCUSSION

Accuracy of IBD probabilities estimation: We herein present a method aimed at estimating IBDp between a TC and a set of PC. Since the method relies on concordance of haplotypes from TC and PC, it will be more efficient when more markers per inherited segment are available. The size of these segments is proportional to the inverse of the number of generations from the PC to the TC. Therefore, the method will be more efficient for closer PC and for higher marker densities. On a simulated data set based on real founder haplotypes and pedigree structures, the method assigned high IBDp to the TIPC in a large majority of the cases and low IBDp to the other PC, even for marker densities as low as 1 marker/Mb and for founders separated from the TC by up to 10 generations.
However, for some regions, lower IBDp were assigned to the TIPC. Such regions, where both TIPC and incorrect PC have intermediate IBDp, correspond to portions for which the origin is more difficult to determine: close to a crossing over (as illustrated by Figure 2) or for which several ancestors have very similar haplotypes. With lower-density maps, these problems occur more often because it is more likely that two PC have similar haplotypes and because fewer informative markers are available to precisely determine where crossing over takes place. Therefore, the estimated IBDp of TIPC decreased with lower-density marker maps. Figure 3 indicates that estimated IBDp are well calibrated and are equal to the probability that a PC is a TIPC. Therefore, it is possible to identify regions in which the TIPC cannot be determined precisely. Results also showed that the estimation of IBDp was less efficient for regions inherited from a phantom PC. The method should therefore ideally be applied to designs with a limited number of phantom PC, especially when these are close ancestors of the TC. Phantom PC are described on the basis of allele frequencies. A model describing phantom PC with haplotypes (such as the HMM of Scheet and Stephens 2006 or Browning and Browning 2007) might turn out to be more efficient although more computationally intensive. The impact of incorrect PC assignments might be limited for marker allele prediction. Indeed, in regions where the method cannot precisely determine the haplotype origin, the method proposes several origins that are very similar to each other and to the TC. Due to this similarity, the method is not able to determine the correct origin but, thanks to this similarity, the correct marker allele is predicted anyway. Therefore, the method proved to be efficient for prediction of marker alleles, even with lower map densities.

Figure 3. Relation between real proportion of truly inherited parental chromosomes (TIPC) and estimated IBDp: (•) all TIPC, (*) genotyped TIPC, and (s) phantom TIPC. The figure was obtained by assigning PC to 100 probability classes according to their estimated IBDp (0–0.01, 0.01–0.02, . . . , 0.99–1.00), and then the proportion of TIPC was computed in each class.

Consequences of approximations of the model: In this study, we assumed that haplotypes are known. In inbred species (such as some plants or mice), haplotypes are indeed known. Furthermore, Druet and Georges (2010) showed that with the current marker densities, haplotype reconstruction can be achieved very efficiently in livestock species, especially for important parents with many offspring, like tested sires in artificial insemination contexts. For genotyped (possibly at lower density) target individuals, haplotypes can be partially reconstructed on the basis of homozygous markers and familial information such as a genotyped parent. When using inferred haplotypes with the dairy cattle design, IBDp of TIPC remained high, showing that the method was still efficient when haplotyping can be performed as accurately as in dairy cattle. In cases where haplotype reconstruction generates many errors, it is likely that estimation of IBDp will be less accurate, particularly when errors occur in TC. Since the efficiency of the method was less affected by missing genotypes than by genotyping errors, we advise reconstructing only the portions of haplotypes that can be inferred very accurately (for instance, thanks to homozygous markers or Mendelian segregation rules) and considering other markers as missing. Results showed that the method remained efficient with missing genotypes and genotyping errors under normal conditions. Indeed, we tested call rates as low as 95% whereas, for instance, with genotyping arrays used in cattle, call rates are above 99% on average.
Genotyping error rates were set as high as 1% while use of genotyping arrays generally results in <0.1% error rates. The effect of missing genotypes is to reduce marker density since we do not have information for some markers. Reduction of marker density by a few percent has a small impact on estimation of IBDp. Results, on simulated and real data, proved that the Markov approximation of the IBD process along the chromosome between a TC and its PC was efficient in our designs.

Comparison to other methods: Since our method models the entire chromosome and allows direct estimation of IBDp between haplotypes separated by more than one generation, it should achieve better results than traditional methods based on linkage equilibrium, which rely mostly on genotypes of close relatives and flanking markers such as those of Wang et al. (1995) or Pong-Wong et al. (2001). However, our method is less suited to very sparse marker maps. For moderate- and high-marker densities our method has reasonable computational costs, which might not be the case for multipoint linkage methods using MCMC techniques (Heath 1997). Other multipoint linkage methods such as those proposed by Lander and Green (1987) or Abecasis et al. (2002) consider all the pedigree relationships. These methods therefore make better use of all the available information. However, Merlin (Abecasis et al. 2002) was not computationally efficient on the data set used in this study (Merlin could not cope with our large pedigree). In comparison, our method uses only information from ancestors, ignores other relatives, and models both chromosomes separately. Consequently, it works in subpedigrees including only the TC and their ancestors, which makes it computationally efficient.
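To make the restriction to ancestors concrete, the sketch below traces one target chromosome back to its closest genotyped ancestors and accumulates a probability of 0.5 per meiosis, which is how the initial state probabilities are described in File S1. It is only an illustration under stated assumptions: the data structures and function name are hypothetical, an unknown parent is treated as the source of a phantom PC, and the small pedigree in the usage note is reconstructed from the textual description of the first example in File S1 rather than from the figure itself.

```python
from typing import Dict, Optional, Set, Tuple

Haplotype = Tuple[int, str]  # (individual, "P" paternal or "M" maternal)
Pedigree = Dict[int, Tuple[Optional[int], Optional[int]]]  # id -> (sire, dam)


def parental_chromosomes(
    target: Haplotype, pedigree: Pedigree, genotyped: Set[int]
) -> Dict[Haplotype, float]:
    """Return {parental chromosome: initial probability} for one target."""
    result: Dict[Haplotype, float] = {}

    def trace(hap: Haplotype, n_meioses: int) -> None:
        individual, origin = hap
        sire, dam = pedigree.get(individual, (None, None))
        parent = sire if origin == "P" else dam
        if parent is None:
            # No known parent: this haplotype is itself a phantom PC.
            result[hap] = result.get(hap, 0.0) + 0.5 ** n_meioses
            return
        for parent_hap in ((parent, "P"), (parent, "M")):
            if parent in genotyped:
                # Closest genotyped ancestor reached: stop along this path.
                prob = 0.5 ** (n_meioses + 1)
                result[parent_hap] = result.get(parent_hap, 0.0) + prob
            else:
                trace(parent_hap, n_meioses + 1)

    trace(target, 0)
    return result
```

With the assumed pedigree `{1: (2, 3), 3: (6, 7), 7: (9, 8), 8: (10, None)}` and genotyped individuals `{2, 6, 9, 10}`, calling `parental_chromosomes((1, "M"), ...)` returns the seven chromosomes 6P, 6M, 8M, 9P, 9M, 10P and 10M, with probabilities that sum to 1. In an inbred pedigree the same PC can be reached through several paths, and the sketch simply adds their probabilities.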
Our results showed that for designs in which haplotypes of important parents can be reconstructed precisely (on the basis of LD or information from progeny), the IBDp between TC and PC can be efficiently computed using our method and can be useful for transferring information (genotype, QTL, etc.) from important founders to other individuals in the population. The IBDp estimated in our method are distinct from IBDp estimated through methods based on LD, such as the method developed by Meuwissen and Goddard (2001). Indeed, these methods use small haplotypes of dense markers to identify IBD relationships due to distant common ancestors. Our method identifies IBD relationships within a pedigree and only a few generations separate the haplotype from its ancestors. Since, for closer relationships, animals share longer chromosome segments, our method can work with lower-density maps than methods based on LD can. Applications: Our method is particularly suited to describing the transmission of information from recent ancestors to current individuals, even with low-density maps (e.g., 1 SNP/Mb). Burdick et al. (2006) already used this technique for marker imputation in human pedigrees with Merlin (Abecasis et al. 2002). Such applications are becoming more common in animal and plant breeding or in experimental designs. For instance, MAGIC designs (e.g., Threadgill et al. 2002; Kover et al. 2009) rely on genotyping of founders and last generations and are now developed in several organisms for the genetic dissection of complex traits. In addition, recent strategies aim to genotype founders with high-density marker panels and use this genomic information to infer genotypic information on other individuals genotyped on lower-density panels. This strategy has, for example, been implemented in the NAM in maize (Yu et al. 2008; Mcmullen et al. 2009). In livestock species, genomic selection (Meuwissen et al. 
2001) is currently applied by genotyping reference animals on medium-density SNP chips (60K SNPs in dairy cattle). Effects of the SNPs are then estimated and can be used to predict the genetic value of other individuals, which must be genotyped for the same markers. Habier et al. (2009) proposed genotyping some animals with lower-density marker panels and using statistical methods to transfer information from reference individuals to these animals. In dairy cattle, most breeding companies are studying the possibility of implementing such a strategy to reduce genotyping costs and extend genomic selection to a larger fraction of the population. With the advent of higher-marker density panels (for example, a bovine SNP chip with more than 750,000 SNPs has been released in 2010) and of high-throughput sequencing data, such strategies will become even more important. Our method would be fitted to such new paradigms. The program CHROMIBD implementing the current method is available from the authors upon request. CONCLUSIONS We present an efficient method for estimation of IBDp between a target chromosome and its genotyped ancestors. The method assigned high IBDp to the truly inherited parental chromosome and was still efficient when marker densities were 1 Mb/SNP or when founders were separated from target chromosomes by up to 10 generations. The proposed method proved well adapted for situations that will become more common in the near future, with founders genotyped on high-density maps and other individuals genotyped for lower-density marker arrays, as proposed in livestock species and in experimental designs such as the nested association mapping design in maize. Tom Druet is Research Associate from the Fonds de la Recherche Scientifique (FNRS). The authors thank CRV (http://www.crv4all.com), the Wellcome-Trust Centre for Human Genetics, and PANZEA for access to their data. 
We acknowledge Laurence Moreau and Mathieu Gautier for their comments and suggestions on this work. This work was funded by grants of the Service Public de Wallonie and from the Communauté Française de Belgique. We acknowledge University of Liège (SEGI and GIGA bioinformatics platform) for the use of NIC3 and GIGA-grid supercomputers. LITERATURE CITED Abecasis, G. R., S. S. Cherny, W. O. Cookson and L. R. Cardon, 2002 Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30: 97–101. Baum, L. E., and J. A. Egon, 1967 An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull. Am. Meteorol. Soc. 73: 360–363. Broman, K. W., 2005 The genomes of recombinant inbred lines. Genetics 169: 1133–1146. Browning, S. R., and B. L. Browning, 2007 Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81: 1084–1097. Burdick, J. T., W. M. Chen, G. R. Abecasis and V. G. Cheung, 2006 In silico method for inferring genotypes in pedigrees. Nat. Genet. 38: 1002–1004. Calus, M. P., T. H. Meuwissen, A. P. de Roos and R. F. Veerkamp, 2008 Accuracy of genomic selection using different methods to define haplotypes. Genetics 178: 553–561. Charlier, C., W. Coppieters, F. Rollin, D. Desmecht, J. S. Agerholm et al., 2008 Highly effective SNP-based association mapping and management of recessive defects in livestock. Nat. Genet. 40: 449–454. de Roos, A. P., C. Schrooten, E. Mullaart, M. P. Calus and R. F. Veerkamp, 2007 Breeding value estimation for fat percentage using dense markers on Bos taurus autosome 14. J. Dairy Sci. 90: 4821–4829. Druet, T., and M. Georges, 2010 A hidden markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics 184: 789–798. Fernando, R., and M. 
Grossman, 1989 Marker assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 21: 467–477. Guo, S. W., 1994 Computation of identity-by-descent proportions shared by two siblings. Am. J. Hum. Genet. 54: 1104–1109. Habier, D., R. L. Fernando and J. C. Dekkers, 2009 Genomic selection using low-density marker panels. Genetics 182: 343–353. Heath, S. C., 1997 Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am. J. Hum. Genet. 61: 748–760. Hernandez-Sanchez, J., C. S. Haley and J. A. Woolliams, 2006 Prediction of IBD based on population history for fine gene mapping. Genet. Sel. Evol. 38: 231–252. Kong, A., G. Masson, M. L. Frigge, A. Gylfason, P. Zusmanovich et al., 2008 Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40: 1068–1075. Kover, P. X., W. Valdar, J. Trakalo, N. Scarcelli, I. M. Ehrenreich et al., 2009 A multiparent advanced generation inter-cross to fine-map quantitative traits in Arabidopsis thaliana. PLoS Genet. 5(7): e1000551. Lander, E. S., and P. Green, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84: 2363–2367. Leutenegger, A. L., B. Prum, E. Genin, C. Verny, A. Lemainque et al., 2003 Estimation of the inbreeding coefficient through use of genomic data. Am. J. Hum. Genet. 73: 516–523. McMullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. H. Li et al., 2009 Genetic properties of the maize nested association mapping population. Science 325: 737–740. McPeek, M. S., and L. Sun, 2000 Statistical tests for detection of misspecified relationships by use of genome-screen data. Am. J. Hum. Genet. 66: 1076–1094. Meuwissen, T. H., and M. E. Goddard, 2001 Prediction of identity by descent probabilities from marker-haplotypes. Genet. Sel. Evol. 33: 605–634. Meuwissen, T. H., and M. E.
Goddard, 2004 Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet. Sel. Evol. 36: 261–279. Meuwissen, T. H., B. J. Hayes and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829. Mott, R., C. J. Talbot, M. G. Turri, A. C. Collins and J. Flint, 2000 A method for fine mapping quantitative trait loci in outbred animal stocks. Proc. Natl. Acad. Sci. USA 97: 12649–12654. Pong-Wong, R., A. W. George, J. A. Woolliams and C. S. Haley, 2001 A simple and rapid method for calculating identity-by-descent matrices using multiple markers. Genet. Sel. Evol. 33: 453–471. Rabiner, L. R., 1989 A tutorial on Hidden Markov Chains and selected applications in speech recognition. Proc. IEEE 77: 257–286. Scarcelli, N., J. M. Cheverud, B. A. Schaal and P. X. Kover, 2007 Antagonistic pleiotropic effects reduce the potential adaptive value of the FRIGIDA locus. Proc. Natl. Acad. Sci. USA 104: 16986–16991. Scheet, P., and M. Stephens, 2006 A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78: 629–644. Thompson, E. A., 1994 Monte Carlo estimation of multilocus autozygosity probabilities, pp. 498–506 in Proceedings of the 1994 Interface Conference, edited by J. Sall and A. Lehman. Interface Foundation of North America, Fairfax Station, VA. Thompson, E. A., 2008 The IBD process along four chromosomes. Theor. Popul. Biol. 73: 369–373. Threadgill, D. W., K. W. Hunter and R. W. Williams, 2002 Genetic dissection of complex and quantitative traits: from fantasy to reality via a community effort. Mamm. Genome 13: 175–178. Wang, T., R. Fernando, S. Van der Beek, M. Grossman and J. A. M. Van Arendonk, 1995 Covariance between relatives for a marked quantitative trait locus. Genet. Sel. Evol. 27: 251–274. Wijsman, E. M., J. H. Rothstein and E. A.
Thompson, 2006 Multipoint linkage analysis with many multiallelic or dense diallelic markers: Markov chain-Monte Carlo provides practical approaches for genome scans on general pedigrees. Am. J. Hum. Genet. 79: 846–858. Windig, J. J., and T. H. E. Meuwissen, 2004 Rapid haplotype reconstruction in pedigrees with dense marker maps. J. Anim. Breed. Genet. 121: 26–39. Yu, J., J. B. Holland, M. D. McMullen and E. S. Buckler, 2008 Genetic design and statistical power of nested association mapping in maize. Genetics 178: 539–551. Zhang, Z., and T. Druet, 2010 Marker imputation with low-density marker panels in Dutch Holstein cattle. J. Dairy Sci. 93: 5487–5494.

Communicating editor: H. Zhao

GENETICS Supporting Information
http://www.genetics.org/cgi/content/full/genetics.111.127720/DC1
Modeling of Identity-by-Descent Processes Along a Chromosome Between Haplotypes and Their Genotyped Ancestors
Tom Druet and Frederic Paul Farnir

FILE S1
Examples for the method section

Pedigree example for estimation of IBD probabilities
We use the following pedigree to illustrate how to construct the subset of potential parental chromosomes, to estimate initial state probabilities and to give examples of ancestors where paths from different parental chromosomes merge.

Figure. Pedigree of the example; genotyped animals are represented in grey (squares for males and circles for females).

The subset C(c) of potential parental chromosomes of a given chromosome c
Noting iM (iP) the maternal (paternal) chromosome of individual i, we can find C(1M) as explained in the method section:
• The potential PC of 1M are 3P and 3M, and these chromosomes are not genotyped.
• Using the same process for 3P, the PC are then 6P and 6M. Since these chromosomes are genotyped, they are included in C(1M) and the process stops here along that path.
• Now, using the same process for 3M, the PC are now 7P and 7M, which are not genotyped. So we have to repeat the process for these two chromosomes.
• 7P leads to two new genotyped chromosomes to be included in C(1M), i.e. 9P and 9M, while 7M leads to two ungenotyped chromosomes 8P and 8M, and the process is to be repeated again.
As a result, we have that C(1M) = {6P, 6M, 8M, 9P, 9M, 10P, 10M}, which means that 1M is a mosaic of these chromosomes. It also means that any locus on 1M is IBD with the corresponding locus on one of these chromosomes. In terms of probabilities, the consequence is that

$$\sum_{x \in C(c)} P(c(k) \equiv x(k)) = 1$$

when the sum is taken over all chromosomes x making up C(c), where $P(c(k) \equiv x(k))$ denotes the probability for c to be IBD with x at a given locus k.

Initial state probabilities
To illustrate the property that the initial probabilities sum over C(c) to 1, in the example described above, the IBDp for the beginning of chromosome 1M are:

$$P(1M(1) \equiv 6P(1)) = P(1M(1) \equiv 6M(1)) = 0.5^2 = 0.25,$$
$$P(1M(1) \equiv 9P(1)) = P(1M(1) \equiv 9M(1)) = P(1M(1) \equiv 8M(1)) = 0.5^3 = 0.125,$$
$$P(1M(1) \equiv 10P(1)) = P(1M(1) \equiv 10M(1)) = 0.5^4 = 0.0625,$$

and we have that

$$\sum_{x \in C(1M)} P(1M(1) \equiv x(1)) = 2 \times 0.25 + 3 \times 0.125 + 2 \times 0.0625 = 1.$$

State transition probabilities
As explained in the paper, when PC (ci and cj) are distinct at positions k and k+1, then on the pedigree paths from ci and cj to c, there must be at least one common ancestor where the two paths merge. We note this ancestor ‘a’. To illustrate on the example, if ci = 10P and cj = 9M, a = 7. Similarly, if ci = 6M and cj = 9P, a = 3.

Example in case of inbreeding
The figure below represents the maternal branch of the pedigree of individual #1. One ungenotyped great-grand-parent (male #5) is parent of both grand-parents. Both parents of individual #5 (individuals #8 and #9) are genotyped and carry PC.

Figure. Example of inbred pedigree; genotyped animals are represented in grey (squares for males and circles for females). A. Maternal branch of the pedigree. B. Each path from PC to TC is represented separately (the number of each distinct path is written in blue).

We start by representing each different path from PC to TC separately (Figure, part B). Paths were numbered from 1 to 16 (in blue). For individuals #8 and #9, there are two paths to transmit chromosomes to individual #1: through individual #3 or through individual #4. Therefore, individuals #5, #8 and #9 and the corresponding meioses appear twice in the figure (part B). IBD probabilities are first computed for each path. Then the IBD probability for each PC is obtained by summing IBD probabilities over the paths leading from the concerned PC to the TC. To illustrate,

$$P(1m \equiv 8p) = P(8p \to 5p \to 3p \to 2p \to 1m) + P(8p \to 5p \to 4p \to 2m \to 1m) = P(8p \to 5p) \, [P(5p \to 3p \to 2p \to 1m) + P(5p \to 4p \to 2m \to 1m)]$$

To obtain transition probabilities from path x at position k to path y at position k+1, meioses in path y are divided into independent meioses (those meioses are not present in path x) and common meioses (those meioses are present in path x: the same parent/offspring pair appears in both paths). Common meioses are further divided into meioses resulting in the transmission of the same haplotype of the parent (same meioses) or in the transmission of the complementary haplotype of the parent (opposite meioses). Mi, Ms and Mo are the numbers of independent, same and opposite meioses, respectively. The sum of all the meioses is equal to the number of meioses in path y. For example, considering PC noted 1 and 6 in the example (i.e. 8p and 10m, respectively) as x and y respectively, the trajectory of y starts with 2 independent meioses (Mi=2) to reach the maternal haplotype of individual 3. Then, to allow for the transition from x to y, a recombination must take place in this individual (Mo=1).
Then the recombined haplotype must have been transmitted along the remaining (common to both paths) meiosis with no further recombination (Ms=1). Further examples are given below. The formula given in the text (equation [5]) is then easily obtained, considering that the probability of transmission of a haplotype is equal to 0.5 for an independent meiosis, (1-r) for a 'same' meiosis and r for an 'opposite' meiosis.

Other examples of numbers of independent and common meioses for transitions between different paths, some of which include inbred individuals, are:
From path 1 to path 1: Mi = 0, Ms = 4, Mo = 0.
From path 1 to path 2: Mi = 0, Ms = 3, Mo = 1.
From path 1 to path 3: Mi = 1, Ms = 2, Mo = 1.
From path 1 to path 9: Mi = 2, Ms = 1, Mo = 1.
From path 1 to path 10: Mi = 2, Ms = 0, Mo = 2.
From path 1 to path 11: Mi = 3, Ms = 0, Mo = 1.
From path 1 to path 16: Mi = 3, Ms = 0, Mo = 1.

Note that, combining these results, we can for example compute:
P[c(k+1) ≡ 9p | c(k) ≡ 8p] = P[c(k+1) ≡ 3 | c(k) ≡ 1] + P[c(k+1) ≡ 11 | c(k) ≡ 1] + P[c(k+1) ≡ 3 | c(k) ≡ 9] + P[c(k+1) ≡ 11 | c(k) ≡ 9] = 0.5*r*(1-r) + 0.5*r + 0.5*r + 0.5*r*(1-r)

FILE S2
Algorithm

Each target chromosome (TC) is modeled independently. The method starts by searching the set of paths leading to the closest genotyped ancestors of the modeled TC with a recursive subroutine that searches the pedigree backwards in time until it reaches a genotyped ancestor or an ancestor without genotyped parent in the pedigree (it corresponds then to a phantom parental chromosome). For each identified path, a list of meioses is stored (one meiosis per generation). A meiosis is defined by a parent-progeny pair. Then, for each pair of paths, the number of common meioses (Mc) between the two paths is counted and stored (common meioses share the same parent-progeny pair).
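The classification of meioses into independent, same and opposite, and the resulting transition probability (0.5 per independent meiosis, 1-r per same meiosis, r per opposite meiosis), can be sketched as follows. This is a hedged illustration and not the authors' implementation: a meiosis is represented here as a (parent, offspring, transmitted parental haplotype) triple, and the `PATHS` dictionary is an assumption reconstructed from the textual description of the first pedigree example in File S1.

```python
from typing import Dict, List, Tuple

# (parent, offspring, haplotype of the parent that is transmitted: "P" or "M")
Meiosis = Tuple[int, int, str]

# Assumed paths from each parental chromosome to target chromosome 1M.
PATHS: Dict[str, List[Meiosis]] = {
    "6P": [(3, 1, "P"), (6, 3, "P")],
    "6M": [(3, 1, "P"), (6, 3, "M")],
    "8M": [(3, 1, "M"), (7, 3, "M"), (8, 7, "M")],
    "9P": [(3, 1, "M"), (7, 3, "P"), (9, 7, "P")],
    "9M": [(3, 1, "M"), (7, 3, "P"), (9, 7, "M")],
    "10P": [(3, 1, "M"), (7, 3, "M"), (8, 7, "P"), (10, 8, "P")],
    "10M": [(3, 1, "M"), (7, 3, "M"), (8, 7, "P"), (10, 8, "M")],
}


def meiosis_counts(path_x: List[Meiosis], path_y: List[Meiosis]) -> Tuple[int, int, int]:
    """Return (Mi, Ms, Mo) for a transition from path_x to path_y."""
    transmitted_in_x = {(parent, child): hap for parent, child, hap in path_x}
    mi = ms = mo = 0
    for parent, child, hap in path_y:
        if (parent, child) not in transmitted_in_x:
            mi += 1  # independent: parent-offspring pair absent from path_x
        elif transmitted_in_x[(parent, child)] == hap:
            ms += 1  # same: identical parental haplotype transmitted
        else:
            mo += 1  # opposite: complementary haplotype, i.e. a crossing over
    return mi, ms, mo


def transition_probability(path_x: List[Meiosis], path_y: List[Meiosis], r: float) -> float:
    """0.5**Mi * r**Mo * (1 - r)**Ms for recombination rate r."""
    mi, ms, mo = meiosis_counts(path_x, path_y)
    return 0.5 ** mi * r ** mo * (1.0 - r) ** ms
```

Under these assumptions, `meiosis_counts(PATHS["8M"], PATHS["10P"])` gives Mi = 1, Ms = 2 and Mo = 1, and for any recombination rate the transition probabilities out of a given path sum to 1 over the seven destination paths.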
In addition to the number of common meioses, the number of common meioses with transmission of the complementary chromosome (corresponding to a crossing over) is counted (Mo). These are common meioses where the paternal chromosome was transmitted in the first path whereas the maternal chromosome was transmitted in the second path, or vice versa. The number of common meioses with transmission of the identical chromosome (Ms) is equal to Mc-Mo. Three matrices of dimension Np x Np (with Np equal to the number of paths) are then stored. The first contains, for each pair of paths, the number of independent meioses (Mi). Mi is equal to the total number of meioses in the second path minus Mc. The second and third matrices contain Mo and Ms, respectively, for each pair of paths.

For instance, in the first example in File S1, the set of parental chromosomes of TC 1M is {6P, 6M, 8M, 9P, 9M, 10P, 10M}. For this set, the three matrices are given below. In each matrix, rows (i) correspond to the PC at position k and columns (j) to the PC at position k+1.

The matrix of independent meioses, Mi(i,j):

| | 6P | 6M | 8M | 9P | 9M | 10P | 10M |
|---|---|---|---|---|---|---|---|
| 6P | 0 | 0 | 2 | 2 | 2 | 3 | 3 |
| 6M | 0 | 0 | 2 | 2 | 2 | 3 | 3 |
| 8M | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 9P | 1 | 1 | 1 | 0 | 0 | 2 | 2 |
| 9M | 1 | 1 | 1 | 0 | 0 | 2 | 2 |
| 10P | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 10M | 1 | 1 | 0 | 1 | 1 | 0 | 0 |

The matrix of common meioses with transmission of the complementary chromosome (crossing over), Mo(i,j):

| | 6P | 6M | 8M | 9P | 9M | 10P | 10M |
|---|---|---|---|---|---|---|---|
| 6P | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6M | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 8M | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 9P | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 9M | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 10P | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 10M | 1 | 1 | 1 | 1 | 1 | 1 | 0 |

The matrix of common meioses with transmission of the same chromosome, Ms(i,j):

| | 6P | 6M | 8M | 9P | 9M | 10P | 10M |
|---|---|---|---|---|---|---|---|
| 6P | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6M | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| 8M | 0 | 0 | 3 | 1 | 1 | 2 | 2 |
| 9P | 0 | 0 | 1 | 3 | 2 | 1 | 1 |
| 9M | 0 | 0 | 1 | 2 | 3 | 1 | 1 |
| 10P | 0 | 0 | 2 | 1 | 1 | 4 | 3 |
| 10M | 0 | 0 | 2 | 1 | 1 | 3 | 4 |

The forward algorithm (Rabiner, 1989) is then implemented as follows.

Initialization for all paths i at marker 1:

$$\alpha(i,1) = 0.5^{Ms(i,i)}$$

where Ms(i,i) is the number of identical meioses between paths i and i (i.e. the number of meioses on path i).

Induction for all paths i and marker k:

$$\alpha(i,k) = \left( \sum_{j=1}^{Np} \alpha(j,k-1) \, \tau(j,i,k-1) \right) a(i,k)$$

where $\tau(j,i,k-1)$ is the transition probability from path j to path i between markers k-1 and k, and $a(i,k)$ is the emission probability of path i at marker k. The transition probabilities are calculated with equation [5]:

$$\tau(i,j,k) = 0.5^{Mi(i,j)} \, r^{Mo(i,j)} \, (1-r)^{Ms(i,j)}$$

where r is the recombination rate in the marker interval spanned by the transition (markers k and k+1 for $\tau(i,j,k)$, in line with the induction steps). The emission probabilities are equal to 1.00 if the alleles on the target and parental chromosomes are identical at marker k (and also 1.00 if the allele is missing on one of the chromosomes).

Similarly, the backward algorithm (Rabiner, 1989) is implemented as follows.

Initialization for all paths i at the last marker Nm:

$$\beta(i,Nm) = 1.00$$

Induction for all paths i and marker k:

$$\beta(i,k) = \sum_{j=1}^{Np} \beta(j,k+1) \, \tau(i,j,k) \, a(j,k+1)$$

Finally, forward and backward probabilities can be combined to compute the probability that the hidden Markov chain went through hidden state i at marker k. This corresponds to the probability that the TC is identical by descent to PC i at marker k, i.e. the IBD probability between the TC and PC i at marker k.

TABLE S1

Mean allelic imputation error rates of missing markers for animals genotyped on the low-density marker map

| Expected percentage of genome inherited from genotyped TIPC | Number of individuals | Mean allelic error rate |
|---|---|---|
| ≤ 25% | 7 | 0.2647 |
| 25% < . ≤ 50% | 81 | 0.2073 |
| 50% < . ≤ 75% | 249 | 0.0985 |
| 75% < . ≤ 87.5% | 657 | 0.0545 |
| 87.5% < . ≤ 90% | 121 | 0.0451 |
| 90% < . ≤ 92.5% | 270 | 0.0394 |
| 92.5% < . ≤ 95% | 562 | 0.0306 |
| 95% < . ≤ 97.5% | 734 | 0.0216 |
| 97.5% < . < 100% | 545 | 0.0163 |
| 100% | 506 | 0.0056 |
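Returning to the forward and backward recursions of File S2, the following pure-Python sketch applies them to the seven parental chromosomes of TC 1M, using the Mi, Mo and Ms matrices given there. It is a minimal illustration, not the authors' program. The function names, the toy alleles and the recombination rate of 0.01 are invented for the example. The emission value `eps` used when alleles differ is an assumption, since the text only states the value for identical or missing alleles. The initialization follows the text as written, with no emission term at marker 1. No scaling is applied, so the sketch is only suitable for short marker maps.

```python
from typing import List, Optional

PCS = ["6P", "6M", "8M", "9P", "9M", "10P", "10M"]
# Count matrices from File S2 (rows: PC at marker k, columns: PC at marker k+1).
MI = [[0, 0, 2, 2, 2, 3, 3], [0, 0, 2, 2, 2, 3, 3], [1, 1, 0, 1, 1, 1, 1],
      [1, 1, 1, 0, 0, 2, 2], [1, 1, 1, 0, 0, 2, 2], [1, 1, 0, 1, 1, 0, 0],
      [1, 1, 0, 1, 1, 0, 0]]
MO = [[0 if i == j else 1 for j in range(7)] for i in range(7)]
MS = [[2, 1, 0, 0, 0, 0, 0], [1, 2, 0, 0, 0, 0, 0], [0, 0, 3, 1, 1, 2, 2],
      [0, 0, 1, 3, 2, 1, 1], [0, 0, 1, 2, 3, 1, 1], [0, 0, 2, 1, 1, 4, 3],
      [0, 0, 2, 1, 1, 3, 4]]
Allele = Optional[int]  # None encodes a missing allele


def transition_matrix(r: float) -> List[List[float]]:
    """Equation [5] applied to every pair of PCs for recombination rate r."""
    n = len(PCS)
    return [[0.5 ** MI[i][j] * r ** MO[i][j] * (1.0 - r) ** MS[i][j]
             for j in range(n)] for i in range(n)]


def emission(target: Allele, parent: Allele, eps: float) -> float:
    """1.0 for identical or missing alleles; eps (assumed) for a mismatch."""
    if target is None or parent is None or target == parent:
        return 1.0
    return eps


def ibd_probabilities(target: List[Allele], parents: List[List[Allele]],
                      rec_rates: List[float], eps: float = 0.001) -> List[List[float]]:
    """Return post[k][i], the probability that the TC descends from PC i at marker k."""
    n, nm = len(PCS), len(target)
    a = [[emission(target[k], parents[i][k], eps) for k in range(nm)] for i in range(n)]
    trans = [transition_matrix(r) for r in rec_rates]  # trans[k]: marker k -> k+1
    alpha = [[0.0] * nm for _ in range(n)]
    beta = [[1.0] * nm for _ in range(n)]  # beta(i, Nm) = 1
    for i in range(n):
        alpha[i][0] = 0.5 ** MS[i][i]  # initialization as written in File S2
    for k in range(1, nm):
        for i in range(n):
            alpha[i][k] = sum(alpha[j][k - 1] * trans[k - 1][j][i] for j in range(n)) * a[i][k]
    for k in range(nm - 2, -1, -1):
        for i in range(n):
            beta[i][k] = sum(beta[j][k + 1] * trans[k][i][j] * a[j][k + 1] for j in range(n))
    post = []
    for k in range(nm):
        total = sum(alpha[i][k] * beta[i][k] for i in range(n))
        post.append([alpha[i][k] * beta[i][k] / total for i in range(n)])
    return post


# Toy data (invented): the target carries the same alleles as 8M at all four markers.
TARGET = [1, 0, 1, 1]
PARENTS = [[0, 1, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1], [0, 0, 0, 1],
           [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
posterior = ibd_probabilities(TARGET, PARENTS, rec_rates=[0.01, 0.01, 0.01])

if __name__ == "__main__":
    for k, probs in enumerate(posterior, start=1):
        best = max(range(len(PCS)), key=lambda i: probs[i])
        print(f"marker {k}: most probable PC = {PCS[best]}")
```

With these matrices, each row of the transition matrix sums to 1 for any r, and the initial probabilities 0.5^Ms(i,i) sum to 1 over the seven PCs, which gives a quick consistency check on the counts.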