BIOINFORMATICS ORIGINAL PAPER Vol. 24 no. 23 2008, pages 2720–2725 doi:10.1093/bioinformatics/btn519

Genetics and population analysis

A better block partition and ligation strategy for individual haplotyping

Yuzhong Zhao1,2, Yun Xu1,2,∗, Zhihao Wang1,2, Hong Zhang1,2 and Guoliang Chen1,2
1 Department of Computer Science, University of Science and Technology of China and 2 Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application, Hefei, Anhui 230027, P.R. China

Received on April 15, 2008; revised on August 10, 2008; accepted on October 4, 2008
Advance Access publication October 9, 2008
Associate Editor: Alex Bateman

ABSTRACT
Motivation: Haplotypes have played an important role in association studies of disease genes and drug response over the past years, but the low throughput of expensive biological experiments has largely limited their application. Alternatively, efficient statistical methods have been developed to deduce haplotypes directly from genotypes. Because these algorithms usually need to estimate the frequencies of numerous possible haplotypes, the partition–ligation strategy is widely adopted to reduce the time complexity. In the past, haplotypes were usually partitioned uniformly, but recent studies have shown that haplotypes have their own block structure, which need not be uniform. A more reasonable block partition and ligation strategy that follows the haplotype structure may therefore further improve the accuracy of individual haplotyping.
Results: In this article, we present a simple algorithm for block partition and ligation that provides better accuracy for individual haplotyping. The block partition and ligation can be completed in O(m² log m + m²n) time, where m is the length of the genotypes and n is the number of individuals. We tested the performance of our algorithm on both real and simulated datasets.
The results show that our algorithm yields better accuracy with a short running time.
Availability: The software is publicly available at http://mail.ustc.edu.cn/∼zyzh.
Contact: [email protected]

1 INTRODUCTION

As the most common form of genetic variation, the single nucleotide polymorphism (SNP) has been widely studied to analyze possible associations between diseases and genomes. Many complex diseases, such as diabetes and cancer, may not be affected by a single SNP, so it is necessary and important to study multiple SNPs in a region together (International HapMap Consortium, 2003). These linked SNPs on a chromosome constitute a character string, also called a haplotype. Although haplotype analysis has gained increasing attention recently, it is still costly and time consuming to derive haplotypes from biological experiments (Bonizzoni et al., 2003). Indeed, many experimental datasets only provide the genotype of each individual, which is the combined information of the two haplotypes from the paired chromosomes. For each locus of a genotype, the two alleles of the haplotypes are given, but the position of each allele is uncertain; that is, we do not know whether an allele comes from the paternal or the maternal haplotype. Theoretically, there may be exponentially many possible haplotype configurations for a given genotype, whereas in practice the number of haplotype patterns in a certain population is much smaller, which makes it possible to infer haplotypes from genotypes directly. In the past 20 years, two categories of individual haplotyping algorithms have been widely studied. One focuses on finding the exact haplotype solution for each individual through combinatorial methods (Clark, 1990; Gusfield, 2002), and the other focuses on estimating the haplotype frequencies in the population according to certain statistical models (Excoffier and Slatkin, 1995; Stephens et al., 2001).

∗ To whom correspondence should be addressed.
Combinatorial haplotyping algorithms commonly follow the maximum parsimony principle: because the number of possible haplotypes in a real population is limited, the smallest haplotype set that can resolve all genotypes is believed to be closest to reality. Many algorithms have been developed to solve this problem (Clark, 1990; Li et al., 2005; Wang and Xu, 2003). Besides parsimony haplotyping, Gusfield (2002) proposed a somewhat different coalescent model, which requires the haplotypes in the solution to form a perfect phylogeny tree, and several research works are based on perfect phylogeny haplotyping (Chung and Gusfield, 2003; Gusfield, 2002). However, most problems based on these combinatorial models have been proved NP-hard, so it may be difficult to obtain the optimal solution in acceptable time. This implies that most combinatorial methods cannot handle large numbers of SNPs. Compared with combinatorial methods, statistical haplotyping algorithms can usually handle much longer genotypes. Rather than inferring the exact haplotype configuration of each individual, statistical methods estimate the frequencies of haplotypes in the population and choose the most probable haplotype pair as the solution. Many different statistical techniques have been adopted to estimate haplotype frequencies, such as Expectation Maximization (Excoffier and Slatkin, 1995; Qin et al., 2002), Bayesian inference (Niu et al., 2002) and Markov chain Monte Carlo (MCMC) (Stephens et al., 2001). Statistical algorithms usually need to consider a large number of probable haplotypes, which requires a large amount of storage. The partition–ligation strategy is usually adopted to alleviate this limitation (Kimmel and Shamir, 2005; Lin et al., 2004; Marchini et al., 2006; Qin et al., 2002).

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The genotypes are partitioned into a collection of blocks, the algorithm is performed on each block, and the frequencies of the haplotypes in each block are estimated. The final solution is then constructed by ligating the subsolutions of the blocks. In this process, many haplotypes with low probability can be discarded directly, which largely reduces the time and space complexity of the algorithm. The traditional partition–ligation strategy usually partitions haplotypes into uniform blocks. However, many studies have shown that haplotypes have their own block structure (Daly et al., 2001; Gabriel et al., 2002; Patil et al., 2001; Zhang et al., 2005), which need not be uniform. It may therefore be more reasonable to partition haplotypes into appropriate blocks according to the genome structure. Some researchers have noticed the effect of block partition on accuracy and adopted somewhat different partition strategies. Lin et al. (2004) defined a block as a region with high linkage disequilibrium (LD): the pairwise |D′| among segregating SNPs in the same block should be higher than a certain threshold (e.g. 0.8). But it may be difficult to select an appropriate threshold, and when the LD in the genome is low their method becomes infeasible. Delaneau et al. (2007) combined the block partition process with their iterative Expectation Maximization (IEM) algorithm: the blocks are partitioned such that the IEM algorithm generates the fewest haplotypes in each block. Their partition strategy is tied to the specific haplotyping algorithm, which makes it less flexible and efficient. Among the various haplotyping algorithms, the PHASE algorithm (Stephens et al., 2001) seems to provide the most accurate haplotyping results (Marchini et al., 2006). However, PHASE has a very long running time.
Recently, new algorithms such as fastPHASE (Scheet and Stephens, 2006), HaploRec (Eronen et al., 2006), 2SNP (Brinza and Zelikovsky, 2006) and BEAGLE (Browning and Browning, 2007) have been proposed to deduce haplotypes at much lower time cost, but their accuracies are also lower than that of PHASE.

In this article, we propose a better partition and ligation strategy to improve the accuracy of individual haplotyping. SNPs with relatively high association are assembled together to constitute a block. The performance improvement brought by the new block partition–ligation strategy is analyzed based on the Expectation Maximization algorithm. Compared with other algorithms, our algorithm attains comparable accuracy at much lower time cost.

2 METHODS

Without loss of generality, the two alleles of a SNP can be denoted by '0' and '1', so a haplotype can be represented as a string over {0, 1}. A genotype is a string over {0, 1, 2}, where '0' and '1' represent homozygous loci and '2' represents a heterozygous locus.

Although the partition–ligation strategy can largely reduce the time complexity of an algorithm, an unreasonable block partition will increase the error rate of haplotype frequency estimation. For example, consider a set of unrelated individuals with the following genotypes on five loci: '01001', '01001', '10011', '11111', '11111' and '22022'. If we restrict the minimum block size to two, there are two possible block partitions: '∗∗∗ | ∗∗' and '∗∗ | ∗∗∗'. In the first partition, if only the latter block is considered, the EM algorithm estimates the haplotype frequencies as 4/12, 7/12 and 1/12 for '01', '11' and '00', respectively; the haplotype '10' is discarded because of its low probability. However, if we apply the EM algorithm to all five loci, the haplotype frequencies are estimated as 5/12, 2/12, 4/12 and 1/12 for '01001', '10011', '11111' and '10010', respectively. That is, the first partition discards the 'right' haplotype '10010', whereas the second partition does not make such a mistake. The traditional PLEM algorithm alleviates this limitation by enlarging the buffer that stores the probable haplotypes. However, it is usually difficult to select the right haplotypes among those with low probability, and a large buffer also reduces the efficiency of the algorithm. It is therefore more adaptive to improve the accuracy by carefully choosing an appropriate partition. Many studies have shown that SNPs on a chromosome are not independent; some of them may have very strong association (Daly et al., 2001; Gabriel et al., 2002; Patil et al., 2001). A simple idea is to place SNPs with strong association in the same block, so that the linkage between them is not wrongly broken by the partition. The non-random association between different SNPs is usually measured by the degree of LD, which is computed in the first step of our algorithm.

2.1 First step: compute the LD score

One of the most widely used measures of LD is r², which can be computed as follows. Consider two SNP sites s0 and s1, each with two alleles '0' and '1'; there are four possible haplotypes: '00', '01', '10' and '11'. Let p00 denote the frequency of the haplotype '00' and, in general, let pij denote the frequency of the haplotype 'ij'. Suppose the frequencies of '0' and '1' at s0 are p0 and p1, and the frequencies of '0' and '1' at s1 are q0 and q1. The r² score between s0 and s1 is

r²(s0, s1) = (p00·p11 − p01·p10)² / (p0·p1·q0·q1)   (1)

The value of r² is bounded in [0, 1], with higher values representing higher LD and stronger association. The values of pi and qi can be obtained directly from the genotype data, whereas the pij cannot.
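As a concrete illustration of Equation (1), the r² score can be computed from the four two-locus haplotype frequencies alone, since the allele frequencies are their marginals. The following minimal Python sketch is our own (the function name and the guard for monomorphic sites are not from the paper):

```python
def r_squared(p00, p01, p10, p11):
    """Pairwise LD score r^2 from the four two-locus haplotype
    frequencies, as in Equation (1).  D = p00*p11 - p01*p10 is the
    disequilibrium coefficient; the allele frequencies p0, p1, q0, q1
    are the marginals of the haplotype frequencies."""
    p0 = p00 + p01          # frequency of allele '0' at site s0
    p1 = p10 + p11          # frequency of allele '1' at site s0
    q0 = p00 + p10          # frequency of allele '0' at site s1
    q1 = p01 + p11          # frequency of allele '1' at site s1
    denom = p0 * p1 * q0 * q1
    if denom == 0.0:        # a monomorphic site carries no LD information
        return 0.0
    d = p00 * p11 - p01 * p10
    return d * d / denom
```

Two sites in perfect LD (only haplotypes '00' and '11' present) give r² = 1, and independent sites give r² = 0.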
Let nij denote the number of genotypes 'ij' observed in the population and n the total number of individuals; then the pij can be estimated with the following EM update equations:

p00 = (2n00 + n02 + n20 + n22·p00p11/(p00p11 + p01p10)) / 2n   (2)
p01 = (2n01 + n02 + n21 + n22·p01p10/(p00p11 + p01p10)) / 2n   (3)
p10 = (2n10 + n12 + n20 + n22·p01p10/(p00p11 + p01p10)) / 2n   (4)
p11 = (2n11 + n12 + n21 + n22·p00p11/(p00p11 + p01p10)) / 2n   (5)

This is actually the EM algorithm applied to only two loci (Barrett et al., 2005). Consequently, we obtain the r² scores of m(m−1)/2 pairs of SNPs, where m is the length of the genotypes. The values of pi and qi can be computed in O(mn) time and the pij can be estimated in O(m²n + rm²) time, where r is the number of EM iterations. In total, the first step can be completed in O(m²n + rm²) time and O(m²) space.

2.2 Second step: determine the optimal block partition

In an ideal block partition, the SNPs within a block have high LD whereas the LD between neighboring blocks is low. For a block [i..j] from the i-th locus to the j-th locus, let LD[i][j] be the average LD score inside the block:

LD[i][j] = (2 / ((j−i+1)(j−i))) · Σ_{i≤s<t≤j} r²(s,t)   (6)

For two adjacent blocks [i..j] and [j+1..k], let LD[i][j][k] be the average LD score between them:

LD[i][j][k] = (1 / ((j−i+1)(k−j))) · Σ_{i≤s≤j, j+1≤t≤k} r²(s,t)   (7)

Our purpose is to find the block partition that maximizes the average LD score inside each block and minimizes the average LD score between adjacent blocks. Suppose a partition P = (b1, b2, ..., bk) establishes k+1 blocks [b0+1, b1], [b1+1, b2], ..., [bk+1, b(k+1)], where b0 = 0 and b(k+1) = m.

Fig. 1. The r² score estimated in the first step.
We evaluate a partition P by the score SP, which simply subtracts the total LD score between adjacent blocks from the total LD score inside the blocks:

SP = Σ_{i=0}^{k} LD[b(i)+1][b(i+1)] − Σ_{i=0}^{k−1} LD[b(i)+1][b(i+1)][b(i+2)]   (8)

The block partition P with the maximum SP score is chosen as the optimal solution, which can be found in polynomial time by a simple dynamic programming algorithm. Define S[i][j] to be the maximum score of a block partition whose last block is [i..j]. Then, by dynamic programming,

S[i][j] = max_{0<k<i−1} { S[k][i−1] + LD[i][j] − LD[k][i−1][j] }   (9)

To further improve the efficiency of our algorithm, we restrict the block size to be at least bmin and at most bmax, so the recursion becomes

S[i][j] = max_{i−bmax ≤ k ≤ i−bmin} { S[k][i−1] + LD[i][j] − LD[k][i−1][j] }   (10)

Using this recursion, we can design a dynamic programming algorithm to find the optimal block partition. The score array S[i][j] is an m×(bmax−bmin+1) matrix, so the space complexity of the block partition algorithm is O(dm), where d = bmax−bmin+1. In Equation (10), the internal LD score LD[i][j] and the external LD score LD[k][i−1][j] can be computed in O(d²) time, so each S[i][j] can be computed in O(d³) time, and the time complexity of the block partition algorithm is O(d⁴m). For the previous example of six unrelated individuals with genotypes '01001', '01001', '10011', '11111', '11111' and '22022', the r² score of each pair of SNPs is computed in the first step; Figure 1 shows the estimated r² values. Denote the partition '∗∗∗ | ∗∗' by P1 and the partition '∗∗ | ∗∗∗' by P2; their evaluation scores are 0.037 and 0.057, respectively, so P2 is selected for its higher score. Clearly, our algorithm chooses the better block partition.
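The recursion of Equation (10) can be sketched as a short dynamic program. The sketch below is our own illustrative implementation (the function names, 0-based inclusive indexing and the traceback bookkeeping are not taken from the paper's software); it assumes a precomputed symmetric matrix `r2` of pairwise r² scores:

```python
from itertools import combinations

def avg_ld_within(r2, i, j):
    """Average r^2 inside block [i..j] (0-based, inclusive), as in Eq. (6)."""
    n = j - i + 1
    if n < 2:
        return 0.0
    total = sum(r2[s][t] for s, t in combinations(range(i, j + 1), 2))
    return 2.0 * total / (n * (n - 1))

def avg_ld_between(r2, i, j, k):
    """Average r^2 between adjacent blocks [i..j] and [j+1..k], as in Eq. (7)."""
    total = sum(r2[s][t] for s in range(i, j + 1) for t in range(j + 1, k + 1))
    return total / ((j - i + 1) * (k - j))

def optimal_partition(r2, bmin=2, bmax=10):
    """Dynamic program over the last-block recursion (Eq. 10); returns the
    list of (start, end) blocks maximizing the S_P score."""
    m = len(r2)
    S, back = {}, {}   # S[(i, j)]: best score with last block [i..j]
    for j in range(bmin - 1, m):
        for i in range(max(0, j - bmax + 1), j - bmin + 2):
            if i == 0:                       # [0..j] is the first block
                S[(i, j)] = avg_ld_within(r2, i, j)
                back[(i, j)] = None
                continue
            best = None
            for k in range(max(0, i - bmax), i - bmin + 1):
                if (k, i - 1) not in S:      # no valid partition ends there
                    continue
                cand = (S[(k, i - 1)] + avg_ld_within(r2, i, j)
                        - avg_ld_between(r2, k, i - 1, j))
                if best is None or cand > best:
                    best, back[(i, j)] = cand, k
            if best is not None:
                S[(i, j)] = best
    # pick the best final block ending at m-1 and trace back the boundaries
    i = max((i for (i, j) in S if j == m - 1), key=lambda i: S[(i, m - 1)])
    blocks, j = [], m - 1
    while True:
        blocks.append((i, j))
        k = back[(i, j)]
        if k is None:
            break
        i, j = k, i - 1
    return blocks[::-1]
```

On a toy matrix where the first three SNPs and the last three SNPs each form a strongly linked group, the sketch recovers the partition [(0, 2), (3, 5)], mirroring how the paper's example selects '∗∗ | ∗∗∗'.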
It is interesting to note that our block partition algorithm does not assign the middle SNP to the left block even though it has higher LD with the first two SNPs than with the last two (0.17 + 0.36 versus 0.36 + 0.05). Our algorithm always attempts to find the globally optimal solution. As a result, the average LD scores of the blocks tend to be similar, because an asymmetric LD partition reduces the final evaluation score. Generally, there are more probable haplotypes in low-LD blocks than in high-LD blocks. In an asymmetric LD partition, the high-LD block wastes a portion of the buffer space whereas the low-LD block needs more space; a symmetric LD partition therefore utilizes the limited buffer space most efficiently and gives better accuracy.

2.3 Third step: frequency estimation and greedy ligation

The EM algorithm is performed on each block and the haplotype frequencies are estimated; only the haplotypes with relatively high frequencies are stored in the buffer. Unlike the traditional PLEM algorithm, which ligates two adjacent blocks at random, our algorithm attempts to preserve the high-LD block property as long as possible during the ligation process. Suppose the second step produces k+1 blocks [b0+1, b1], [b1+1, b2], ..., [bk+1, b(k+1)]. We first ligate the two adjacent blocks [b(i)+1, b(i+1)] and [b(i+1)+1, b(i+2)] with the highest LD[b(i)+1][b(i+1)][b(i+2)] score, reducing the number of blocks to k. We then recompute the LD scores between adjacent blocks, reselect the adjacent pair with the highest score to ligate, and so on; this process continues until the complete haplotypes are determined. In each ligation step, the selection of adjacent blocks can be completed in O(k(m/k)²) = O(m²/k) time, so the total time for block selection is O(m²/k) + O(m²/(k−1)) + ··· + O(m²) = O(m² log k). This greedy process ensures that at each step the adjacent blocks with the strongest association are ligated.
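The greedy ligation order described above can be sketched as follows. This is our own illustrative implementation, not the authors' code; the inter-block score mirrors the average pairwise r² of Equation (7), and blocks are 0-based inclusive (start, end) regions:

```python
def avg_r2_between(r2, i, j, k):
    """Average pairwise r^2 between adjacent regions [i..j] and [j+1..k],
    mirroring the LD[i][j][k] score of Eq. (7)."""
    total = sum(r2[s][t] for s in range(i, j + 1) for t in range(j + 1, k + 1))
    return total / ((j - i + 1) * (k - j))

def greedy_ligation_order(r2, blocks):
    """Repeatedly ligate the adjacent pair of blocks with the strongest
    association, so weakly associated (more uncertain) blocks are merged
    last.  Returns the sequence of merged (start, end) regions."""
    blocks = list(blocks)
    order = []
    while len(blocks) > 1:
        # score every adjacent pair and pick the strongest association
        best = max(range(len(blocks) - 1),
                   key=lambda idx: avg_r2_between(
                       r2, blocks[idx][0], blocks[idx][1], blocks[idx + 1][1]))
        merged = (blocks[best][0], blocks[best + 1][1])
        blocks[best:best + 2] = [merged]   # ligate the chosen pair
        order.append(merged)
    return order
```

For three blocks where the second and third are strongly associated, the sketch merges those two first and ligates the weakly associated first block last.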
Since the buffer size is limited, it is always better to ligate weakly associated blocks later, because they usually have more uncertainty and produce more candidate haplotypes. During frequency estimation, we discard haplotypes whose probability is lower than 0.00001; only haplotypes with relatively high probability are stored in the buffer. The maximum buffer size can also be specified by the user, to prevent storing so many haplotypes that the efficiency of the algorithm suffers. Some small tricks are also adopted to further improve the accuracy and robustness of our algorithm. Suppose the specified buffer size is B; besides the top B haplotypes with the highest frequencies, we also place the most probable haplotype pair of each individual into the buffer. This ensures that our algorithm will not terminate wrongly in the next ligation step for lack of complementary haplotypes. The practical buffer size is therefore a little larger than the user-specified B; in theory it will not exceed B+2n, and in practice it is much smaller. Compared with other individual haplotyping algorithms based on the partition–ligation strategy, our algorithm performs some additional operations to improve the accuracy of frequency estimation. The additional cost is O(m²n + rm² + d⁴m + m² log k), a simple sum of the time complexities of the steps above. Since r and d can be regarded as constants and k < m, our algorithm only adds O(m²n + m² log m) to the time complexity, which is acceptable compared with the cost of haplotype frequency estimation itself.
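The ligation step itself can be pictured as combining the candidate haplotype lists of two adjacent blocks and pruning. The sketch below is a simplified illustration under an independence approximation (in the full algorithm a further EM pass re-estimates the merged frequencies); the function name, dict representation and `buffer_size` default are our own, with only the 0.00001 cutoff taken from the text:

```python
def ligate_blocks(left, right, buffer_size=100, min_freq=1e-5):
    """Combine the candidate haplotype lists of two adjacent blocks.
    `left` and `right` map partial haplotype strings to estimated
    frequencies.  Products of the two frequencies serve as initial
    estimates for the concatenated haplotypes; candidates below
    `min_freq` are discarded and only the `buffer_size` most probable
    candidates are kept in the buffer."""
    merged = {}
    for h1, f1 in left.items():
        for h2, f2 in right.items():
            f = f1 * f2
            if f >= min_freq:          # drop very improbable haplotypes
                merged[h1 + h2] = f
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:buffer_size])
```

Keeping only the top candidates at every ligation is what bounds the storage, at the price of occasionally discarding a 'right' low-frequency haplotype, which is exactly the failure mode the block partition is designed to reduce.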
3 RESULTS AND DISCUSSION

To evaluate the accuracy of our better block partition–ligation estimation maximization (BBPLEM) algorithm, we used the individual error rate (IER) and the switch error rate (SER), two widely used criteria for assessing the performance of individual haplotyping (Delaneau et al., 2007; Marchini et al., 2006). The IER is the percentage of individuals whose haplotype configurations are incorrectly inferred. Generally, the IER decreases as the number of individuals increases and increases with the genotype length. When the genotype is long, almost all haplotyping algorithms fail to deduce a completely correct haplotype configuration; that is, the IER approaches 100% and loses statistical significance. A switch error is an error between an adjacent pair of heterozygous loci; such an error can be corrected by a single switch, hence the name. The SER is the number of switch errors divided by the number of heterozygous loci.

Table 1. Accuracy and time comparison of various algorithms on the ACE dataset

Method | IER | SER | Running Time (s)
PLEM1.0 (Qin et al., 2002) | 0.214 | 0.067 | 0.1
fastPhase1.2 (Scheet and Stephens, 2006) | 0.198 | 0.027 | 13.3
GERBIL1.1 (Kimmel and Shamir, 2005) | 0.091 | 0.008 | 3.9
2SNP1.7 (Brinza and Zelikovsky, 2006) | 0.091 | 0.008 | 0.1
Ishape2.0 (Delaneau et al., 2007) | 0.091 | 0.017 | 16.9
BEAGLE2.1 (Browning and Browning, 2007) | 0.218 | 0.082 | 2.4
BBPLEM (Uniform partition & Pairwise ligation) | 0.182 | 0.063 | 0.2
BBPLEM (Uniform partition & Greedy ligation) | 0.182 | 0.052 | 0.2
BBPLEM (Optimal partition & Greedy ligation) | 0.172 | 0.052 | 0.2

Although the maximum block size bmax and the minimum block size bmin can be chosen arbitrarily, different choices may lead to different results. The value of bmin should be greater than 1, since it is meaningless to estimate the haplotype frequency of a single SNP.
The value of bmax should not be too large, because large blocks cost a lot of memory. It is better to select appropriate bmin and bmax according to the practical genome structure; in our algorithm, we use bmin = 2 and bmax = 10 as the default setting. We compared BBPLEM with several programs, including PLEM (Qin et al., 2002), fastPhase (Scheet and Stephens, 2006), GERBIL (Kimmel and Shamir, 2005), 2SNP (Brinza and Zelikovsky, 2006), Ishape (Delaneau et al., 2007) and BEAGLE (Browning and Browning, 2007). PHASE was excluded because it was too slow to be run enough times to estimate its average performance. HaploRec (Eronen et al., 2006) was not tested because missing SNPs are not handled well by its current version. All experiments were carried out on a Windows server with a 3.20 GHz CPU and 1 GB RAM.

3.1 Real data

We compared the accuracy of our algorithm with that of the other algorithms on the human angiotensin converting enzyme (ACE) dataset provided by Rieder et al. (1999). It contains the genotypes of 11 unrelated individuals at 52 SNPs; 13 different haplotypes were identified through experiments. The buffer sizes of PLEM and BBPLEM were both set to 50, and the number of EM iterations to 20. The parameter K (number of clusters) of fastPhase was set to 10 to reduce its running time. The parameters of GERBIL, 2SNP and Ishape were left at their default settings. The parameter 'nsample' of BEAGLE was set to 200 and the parameter 'seed' was generated randomly in every independent run. To estimate the average performance, 100 independent runs were performed; the estimated IER, SER and running times are reported in Table 1. As shown in Table 1, the uniform block partition and pairwise ligation strategies were also applied to BBPLEM to evaluate the accuracy improvement brought by our new block partition and ligation strategy.
When the uniform block partition strategy was adopted, the block size was set to 2. Figure 2 shows the distribution of the sizes of the blocks produced by our optimal block partition algorithm; the number of blocks decreases as the block size increases. Among all algorithms, GERBIL and 2SNP provided the most accurate haplotyping results on the ACE dataset, but their performance was worse than that of the other algorithms in the later comparisons on larger datasets. BBPLEM yielded medium accuracy in less time. Among the three partition–ligation strategies, the optimal block partition with greedy ligation provided the most accurate haplotyping result.

Fig. 2. The distribution of block size (ACE dataset).

Table 2. Accuracy and time comparison of various algorithms on the chromosome 5q31 dataset

Method | IER | SER | Running Time (s)
PLEM1.0 (Qin et al., 2002) | − | − | −
fastPhase1.2 (Scheet and Stephens, 2006) | 0.392 | 0.042 | 282.7
GERBIL1.1 (Kimmel and Shamir, 2005) | 0.434 | 0.045 | 47.0
2SNP1.7 (Brinza and Zelikovsky, 2006) | 0.465 | 0.046 | 1.0
Ishape2.0 (Delaneau et al., 2007) | 0.388 | 0.047 | 2151.1
BEAGLE2.1 (Browning and Browning, 2007) | 0.404 | 0.043 | 8.1
BBPLEM (Uniform partition & Pairwise ligation) | 0.418 | 0.046 | 6.3
BBPLEM (Uniform partition & Greedy ligation) | 0.431 | 0.046 | 5.6
BBPLEM (Optimal partition & Greedy ligation) | 0.388 | 0.043 | 5.4

Besides the ACE dataset, we also tested the algorithms on the 5q31 dataset generated by Daly et al. (2001). It contains 129 father–mother–child trios with genotypes at 103 SNPs on chromosome 5q31. The genotypes of the 129 children were selected for the performance evaluation. In the original children data there were 3873 (29%) heterozygous alleles and 1334 (10%) missing alleles; after pedigree resolving, the phase of 2714 heterozygous alleles and 168 missing alleles could be identified.
According to these identified SNPs, we estimated the accuracies of the various algorithms. Because there were more genotypes in 5q31 than in ACE, the buffer sizes of PLEM and BBPLEM were both set to 100. The parameter nsample of BEAGLE was changed to 25 to gain better performance; the parameters of the other algorithms were set as before. To estimate the average performance, reported in Table 2, 10 independent runs were performed. For the 5q31 data, Ishape and fastPhase provided the minimum IER and SER, respectively, but both had very long running times. PLEM failed to produce a solution within 10 h. Compared with the other algorithms, BBPLEM yielded better accuracy in much less time. Moreover, the optimal block partition with greedy ligation provided the best accuracy among the three strategies, which indicates its effectiveness. The distribution of the sizes of the blocks produced by our optimal block partition algorithm is given in Figure 3.

Fig. 3. The distribution of block size (5q31 dataset).

Table 3. Accuracy and time comparison of various algorithms on the simulated dataset

Method | IER | SER | Running Time (s)
PLEM1.0 (Qin et al., 2002) | 0.808 | 0.223 | 2.6
fastPhase1.2 (Scheet and Stephens, 2006) | 0.853 | 0.155 | 205.6
GERBIL1.1 (Kimmel and Shamir, 2005) | 0.950 | 0.234 | 132.9
2SNP1.7 (Brinza and Zelikovsky, 2006) | 0.970 | 0.246 | 0.2
Ishape2.0 (Delaneau et al., 2007) | 0.242 | 0.038 | 1820.0
BEAGLE2.1 (Browning and Browning, 2007) | 0.745 | 0.131 | 9.2
BBPLEM (Uniform partition & Pairwise ligation) | 0.643 | 0.151 | 1.1
BBPLEM (Uniform partition & Greedy ligation) | 0.552 | 0.138 | 1.0
BBPLEM (Optimal partition & Greedy ligation) | 0.541 | 0.136 | 0.9

3.2 Simulated data

Some programs have been developed to simulate haplotypes for genome studies (Hudson, 2002; Liang et al., 2007). However, it may be difficult to simulate the natural block structure of haplotypes.
Wang et al. (2002) investigated the block-like structure of haplotypes. They observed that when the mutation rate was set to 0.3×10⁻⁹ per site per year and the recombination rate to 1.0×10⁻⁸ with 50% probability and 4.0×10⁻⁸ with the remaining 50% probability, the block-size distributions of the simulated haplotypes and of real human chromosome 21 were very similar. Here, we used the same parameter setting. The program GENOME was employed to generate haplotypes for our performance evaluation because it allows recombination rates to vary along the genome. We generated 10 samples of 100 haplotypes each; the haplotype length varied from 80 to 120. Genotypes were generated by randomly pairing two haplotypes. For each sample, 100 individuals were generated and the accuracies were estimated. The buffer sizes of BBPLEM and PLEM were set to 100, the number of EM iterations to 20, the parameter K of fastPhase to 10 and the parameter nsample of BEAGLE to 25; the other algorithms used their default settings. To estimate the average performance, 10 independent runs were performed for each sample. Table 3 presents the accuracy and time comparison. For the simulated data, Ishape yielded much better accuracy than the other algorithms, but it cost a lot of time. PLEM, BEAGLE, 2SNP and BBPLEM were much faster than the others; BBPLEM yielded comparable accuracy in a very short time, and the optimal block partition with greedy ligation again provided the best performance. To evaluate the efficiency of BBPLEM, we also examined its running time with respect to different m and n; the results are given in Figures 4 and 5.

Fig. 4. Running time versus length of genotype.
Fig. 5. Running time versus number of individuals.

Consistent with our earlier discussion of the time complexity, the running time is roughly proportional to the square of m and linear in n. As Figures 4 and 5 demonstrate, the BBPLEM algorithm is very fast: with about 1000 individuals and 1000 SNPs, BBPLEM finishes in a few seconds, whereas algorithms such as Ishape and fastPhase fail to produce haplotyping results within 10 h.

4 CONCLUSIONS

Partition–ligation is a widely used approach for reducing the time cost of individual haplotyping. However, an inappropriate partition may increase the error rate of frequency estimation. In this article, we proposed a better partition and ligation strategy based on the LD between each pair of SNPs. Haplotype blocks are partitioned such that the LD within a block is high whereas the LD between adjacent blocks is low, which largely reduces the haplotyping error rate. We evaluated the accuracy of our algorithm on both real and simulated datasets; compared with other algorithms, it attained comparable accuracy in much less time. Some haplotyping errors may be due to the limitations of the simple EM algorithm. Because our block partition and ligation strategy is flexible and efficient, combining it with better frequency estimation algorithms, such as Bayesian or MCMC methods, may lead to better accuracy. Although neighboring SNPs usually have high LD, strong association may also exist between distant SNPs. In our algorithm, a haplotype block must be a contiguous region, so distant high-LD SNPs are not considered together. Allowing distant SNPs to constitute a block may further improve the accuracy of haplotyping; this is the direction of our future work.
ACKNOWLEDGEMENTS

We thank Linbin Yu, Yiming Lei, Mingzhi Shao and Juan Liu, who provided many helpful suggestions for this article.

Funding: Key Project of the National Natural Science Foundation of China (grant no. 60533020).

Conflict of Interest: none declared.

REFERENCES

Barrett,J.C. et al. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.
Bonizzoni,P. et al. (2003) The haplotyping problem: an overview of computational models and solutions. J. Comput. Sci. Technol., 18, 675–688.
Brinza,D. and Zelikovsky,A. (2006) 2SNP: scalable phasing based on 2-SNP haplotypes. Bioinformatics, 22, 371–373.
Browning,S.R. and Browning,B.L. (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet., 81, 1084–1097.
Chung,R.H. and Gusfield,D. (2003) Perfect phylogeny haplotyper: haplotype inferral using a tree model. Bioinformatics, 19, 780–781.
Clark,A. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7, 111–122.
Daly,M.J. et al. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229–232.
Delaneau,O. et al. (2007) ISHAPE: new rapid and accurate software for haplotyping. BMC Bioinformatics, 8, 205.
Eronen,L. et al. (2006) HaploRec: efficient and accurate large-scale reconstruction of haplotypes. BMC Bioinformatics, 7, 542.
Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921–927.
Gabriel,S.B. et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 2225–2229.
Gusfield,D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of RECOMB 2002: The 6th Annual International Conference on Computational Biology. ACM, New York, USA, pp. 166–175.
Hudson,R.R.
(2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338.
International HapMap Consortium (2003) The international HapMap project. Nature, 426, 789–796.
Kimmel,G. and Shamir,R. (2005) GERBIL: genotype resolution and block identification using likelihood. Proc. Natl Acad. Sci. USA, 102, 158–162.
Li,Z. et al. (2005) A parsimonious tree-grow method for haplotype inference. Bioinformatics, 21, 3475–3481.
Liang,L. et al. (2007) GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics, 23, 1565–1567.
Lin,S. et al. (2004) Haplotype and missing data inference in nuclear families. Genome Res., 14, 1624–1632.
Marchini,J. et al. (2006) A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet., 78, 437–450.
Niu,T.H. et al. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet., 70, 157–169.
Patil,N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723.
Qin,Z.S. et al. (2002) Partition-ligation EM algorithm for haplotype inference with single nucleotide polymorphisms. Am. J. Hum. Genet., 71, 1242–1247.
Rieder,M.J. et al. (1999) Sequence variation in the human angiotensin converting enzyme. Nat. Genet., 22, 59–62.
Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Stephens,M. et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68, 978–989.
Wang,N. et al. (2002) Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet., 71, 1227–1234.
Wang,L.S. and Xu,Y. (2003) Haplotype inference by maximum parsimony.
Bioinformatics, 19, 1773–1780.
Zhang,K. et al. (2005) HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics, 21, 131–134.