BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 23 2008, pages 2720–2725
doi:10.1093/bioinformatics/btn519
Genetics and population analysis

A better block partition and ligation strategy for individual haplotyping
Yuzhong Zhao1,2, Yun Xu1,2,*, Zhihao Wang1,2, Hong Zhang1,2 and Guoliang Chen1,2
1Department of Computer Science, University of Science and Technology of China and 2Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application, Hefei, Anhui 230027, P.R. China
Received on April 15, 2008; revised on August 10, 2008; accepted on October 4, 2008
Advance Access publication October 9, 2008
Associate Editor: Alex Bateman
ABSTRACT
Motivation: Haplotypes have played an important role in association studies of disease genes and drug response over the past years, but the low throughput of expensive biological experiments has largely limited their application. Alternatively, efficient statistical methods have been developed to deduce haplotypes directly from genotypes. Because these algorithms usually need to estimate the frequencies of numerous possible haplotypes, the partition and ligation strategy has been widely adopted to reduce the time complexity. Haplotypes have usually been partitioned uniformly in the past, but recent studies show that haplotypes have their own block structure, which may not be uniform. A more reasonable block partition and ligation strategy that follows the haplotype structure may further improve the accuracy of individual haplotyping.
Results: In this article, we present a simple algorithm for block partition and ligation, which provides better accuracy for individual haplotyping. The block partition and ligation can be completed within O(m² log m + m²n) time, where m represents the length of the genotypes and n represents the number of individuals. We tested the performance of our algorithm on both real and simulated datasets. The results show that our algorithm yields better accuracy with a short running time.
Availability: The software is publicly available at http://mail.ustc.edu.cn/~zyzh.
Contact: [email protected]
1 INTRODUCTION
As the most common form of genetic variation, single nucleotide polymorphisms (SNPs) have been widely studied to analyze possible associations between diseases and genomes. In general, many complex diseases, such as diabetes and cancer, may not be affected by a single SNP, so it is necessary and important to study multiple SNPs in a region together (International HapMap Consortium, 2003). Linked SNPs on a chromosome constitute a character string, which is called a haplotype.
*To whom correspondence should be addressed.

Although haplotype analysis has attracted increasing attention recently, it is still costly and time consuming to derive haplotypes from biological experiments (Bonizzoni et al., 2003). Indeed, many experimental datasets provide only the genotype of each individual, which is the combined information of the two haplotypes from paired chromosomes. For each locus of a genotype, the two alleles of the haplotypes are given, but the assignment of each allele is uncertain; that is, we do not know whether an allele comes from the paternal or the maternal haplotype. Theoretically, there may be exponentially many possible haplotype configurations for a given genotype, whereas in practice the number of haplotype patterns in a certain population is much smaller, which makes it possible to infer haplotypes from genotypes directly.
In the past 20 years, two categories of individual haplotyping algorithms have been widely studied. One focuses on finding the exact haplotype solution of each individual through combinatorial methods (Clark, 1990; Gusfield, 2002), and the other focuses on estimating the haplotype frequencies in the population according to certain statistical models (Excoffier and Slatkin, 1995; Stephen et al., 2001).
Combinatorial haplotyping algorithms commonly follow the maximum parsimony principle. Because the number of possible haplotypes in a real population is limited, it is believed that the smallest haplotype set that can resolve all genotypes is closest to reality. Many algorithms have been developed to solve this problem (Clark, 1990; Li et al., 2005; Wang and Xu, 2003). Besides parsimony haplotyping, Gusfield (2002) proposed a somewhat different coalescent model, which requires the haplotypes in the solution to form a perfect phylogeny tree. There has also been further research based on perfect phylogeny haplotyping (Chung and Gusfield, 2003; Gusfield, 2002). However, most problems based on these combinatorial models have been proved to be NP-hard, so it may be difficult to obtain the optimal solution in acceptable time. This implies that most combinatorial methods cannot handle large numbers of SNPs.
Compared with combinatorial methods, statistical haplotyping algorithms can usually handle much longer genotypes. Rather than inferring the exact haplotype configuration of each individual, statistical methods estimate the frequencies of haplotypes in the population and choose the most probable haplotype pair as the solution. Many different statistical algorithms have been adopted to estimate haplotype frequencies, such as Expectation Maximization (Excoffier and Slatkin, 1995; Qin et al., 2002), Bayesian inference (Niu et al., 2002) and Markov chain Monte Carlo (MCMC) (Stephen et al., 2001). Statistical algorithms usually need to consider many probable haplotypes, which requires a large amount of storage. The partition–ligation strategy is usually adopted to alleviate this limitation (Kimmel and Shamir, 2005; Lin et al., 2004; Marchini et al., 2006;
Qin et al., 2002). The genotypes are partitioned into a collection of blocks. The algorithm is performed on each block, and the frequencies of haplotypes in each block are estimated. The final solution is constructed by ligating the subsolutions of the blocks. In this process, many haplotypes with low probability can be directly discarded, which largely reduces the time and space complexity of the algorithms.
The traditional partition–ligation strategy usually partitions haplotypes into uniform blocks. However, many studies have shown that haplotypes have their own block structure (Daly et al., 2001; Gabriel et al., 2002; Patil et al., 2001; Zhang et al., 2005), which may not be uniform. It may therefore be more reasonable to partition haplotypes into appropriate blocks according to the genome structure. Some researchers have noticed the effect of the block partition on accuracy and adopted somewhat different partition strategies. Lin et al. (2004) defined a block to be a region with high linkage disequilibrium (LD): the pairwise |D′| among segregating SNPs in the same block should be higher than a certain threshold (e.g. 0.8). However, it may be difficult to select an appropriate threshold, and when the LD in the genome is low, their method becomes infeasible. Delaneau et al. (2007) combined the block partition process with their iterative Expectation Maximization (IEM) algorithm: the blocks are partitioned such that the IEM algorithm generates the fewest haplotypes in each block. Their partition strategy is tied to a specific haplotyping algorithm, which is not very flexible or efficient.
Among the various haplotyping algorithms, the PHASE algorithm (Stephen et al., 2001) appears to provide the most accurate haplotyping results (Marchini et al., 2006). However, PHASE has a very long running time. Recently, new algorithms, such as fastPHASE (Scheet and Stephens, 2006), HaploRec (Eronen et al., 2006), 2SNP (Brinza and Zelikovsky, 2006) and BEAGLE (Browning and Browning, 2007), were proposed to deduce haplotypes at a much lower time cost, but their accuracies are also lower than PHASE's.
In this article, we propose a better partition and ligation strategy to improve the accuracy of individual haplotyping. SNPs with relatively high associations are assembled into the same block. The performance improvement from our new block partition–ligation strategy is also analyzed based on the Expectation Maximization algorithm. Compared with other algorithms, ours achieves comparable accuracy at much lower time cost.
2 METHODS
Without loss of generality, the alleles of a SNP can be denoted by '0' and '1', so a haplotype can be represented as a string over {0, 1}. A genotype is then a string over {0, 1, 2}, where '0' and '1' represent homozygous loci and '2' represents a heterozygous locus.

Although the partition–ligation strategy can largely reduce the time complexity of algorithms, an unreasonable block partition will increase the error rate of haplotype frequency estimation. For example, consider a set of unrelated individuals with the following genotypes on five loci: '01001', '01001', '10011', '11111', '11111' and '22022'. If we restrict the minimum block size to two, there are two probable block partitions: '∗∗∗ | ∗∗' and '∗∗ | ∗∗∗'. In the first partition, if only the latter block is considered, the EM algorithm estimates the haplotype frequencies as 4/12, 7/12 and 1/12 for '01', '11' and '00', respectively. The haplotype '10' is discarded because of its low probability. However, if we apply the EM algorithm to all five loci, the haplotype frequencies are estimated as 5/12, 2/12, 4/12 and 1/12 for '01001', '10011', '11111' and '10010', respectively. That is to say, we have discarded the 'right' haplotype '10010', whereas the second block partition does not make such a mistake. The traditional PLEM algorithm alleviates this limitation by enlarging the buffer that stores the probable haplotypes. However, it is usually difficult to select the right low-probability haplotypes, and a large buffer also reduces the efficiency of the algorithm. It is therefore more effective to improve the accuracy by carefully choosing an appropriate partition.

Many studies have shown that SNPs on a chromosome are not independent; some of them may have very strong associations (Daly et al., 2001; Gabriel et al., 2002; Patil et al., 2001). A simple idea is to place SNPs with strong associations into the same block, so that the linkage between them will not be wrongly broken by a partition. The non-random associations between different SNPs are usually measured by the degree of LD, which is computed in the first step of our algorithm.

2.1 First step: compute the LD score
One of the most widely used measures of LD is r², which can be computed as follows. Consider two SNP sites s0 and s1, each with two alleles '0' and '1'; there are then four possible haplotypes: '00', '01', '10' and '11'. Let p00 denote the frequency of haplotype '00' and, in general, let pij denote the frequency of haplotype 'ij'. Suppose the frequency of '0' ('1', respectively) at s0 is p0 (p1, respectively) and the frequency of '0' ('1', respectively) at s1 is q0 (q1, respectively). The r² score between s0 and s1 is

    r²(s0, s1) = (p00·p11 − p01·p10)² / (p0·p1·q0·q1)    (1)

The value of r² is bounded in [0, 1], with higher values representing higher LD and stronger association. The values of pi and qi can be read directly from the genotype data, whereas pij cannot. Let nij denote the number of genotypes 'ij' observed in the population and n denote the total number of individuals; then pij can be estimated with the following equations:

    p00 = (2n00 + n02 + n20 + n22·p00p11/(p00p11 + p01p10)) / 2n    (2)
    p01 = (2n01 + n02 + n21 + n22·p01p10/(p00p11 + p01p10)) / 2n    (3)
    p10 = (2n10 + n12 + n20 + n22·p01p10/(p00p11 + p01p10)) / 2n    (4)
    p11 = (2n11 + n12 + n21 + n22·p00p11/(p00p11 + p01p10)) / 2n    (5)

This is simply the EM algorithm applied to two loci (Barrett et al., 2005). Consequently, we obtain the r² scores of the m(m−1)/2 pairs of SNPs, where m represents the length of the genotypes.

The values of pi and qi can be computed in O(mn) time and the values of pij can be estimated in O(m²n + rm²) time, where r represents the number of EM iterations. In total, the first step can be completed with O(m²n + rm²) time complexity and O(m²) space complexity.
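The two-locus EM of Equations (2)–(5) and the r² score of Equation (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the uniform initialization and the zero-denominator guard are our own choices.

```python
from collections import Counter

def two_locus_r2(genotypes, s0, s1, rounds=20):
    """Estimate r^2 between sites s0 and s1 from unphased genotypes
    (strings over {0,1,2}) via the two-locus EM of Equations (2)-(5)."""
    n = len(genotypes)
    counts = Counter((g[s0], g[s1]) for g in genotypes)
    n00, n01 = counts[('0', '0')], counts[('0', '1')]
    n10, n11 = counts[('1', '0')], counts[('1', '1')]
    n02, n20 = counts[('0', '2')], counts[('2', '0')]
    n12, n21 = counts[('1', '2')], counts[('2', '1')]
    n22 = counts[('2', '2')]
    # start from a uniform guess over the four haplotype frequencies
    p00 = p01 = p10 = p11 = 0.25
    for _ in range(rounds):
        # expected share of the doubly heterozygous genotypes '22' that
        # resolve to the pair ('00','11') rather than ('01','10')
        ab = p00 * p11 + p01 * p10
        cis = p00 * p11 / ab if ab > 0 else 0.5
        p00 = (2 * n00 + n02 + n20 + n22 * cis) / (2 * n)
        p01 = (2 * n01 + n02 + n21 + n22 * (1 - cis)) / (2 * n)
        p10 = (2 * n10 + n12 + n20 + n22 * (1 - cis)) / (2 * n)
        p11 = (2 * n11 + n12 + n21 + n22 * cis) / (2 * n)
    p0, p1 = p00 + p01, p10 + p11    # allele frequencies at s0
    q0, q1 = p00 + p10, p01 + p11    # allele frequencies at s1
    d = p00 * p11 - p01 * p10
    denom = p0 * p1 * q0 * q1
    return d * d / denom if denom > 0 else 0.0
```

For two sites in complete LD the score is 1, and for independent sites it is 0, matching the bounds stated above.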
2.2 Second step: determine the optimal block partition
In an ideal block partition, the SNPs in the same block should have high LD, whereas the LD between neighboring blocks should be low. Suppose there is a block [i...j] from the i-th locus to the j-th locus; denote by LD[i][j] the average LD score of block [i...j]:

    LD[i][j] = (2 / ((j−i+1)(j−i))) · Σ_{i≤s<t≤j} r²(s, t)    (6)

Consider two adjacent blocks [i...j] and [j+1...k]; denote by LD[i][j][k] the average LD score between the two blocks:

    LD[i][j][k] = (1 / ((j−i+1)(k−j))) · Σ_{i≤s≤j} Σ_{j+1≤t≤k} r²(s, t)    (7)

Our purpose is to find the block partition that maximizes the average LD score inside each block and minimizes the average LD score between
Fig. 1. The r² scores estimated in the first step.
adjacent blocks. Suppose partition P = (b_1, b_2, ..., b_k) establishes k+1 blocks [b_0+1, b_1], [b_1+1, b_2], ..., [b_k+1, b_{k+1}], where b_0 = 0 and b_{k+1} = m. We evaluate partition P by the score S_P, which simply subtracts the total LD score between adjacent blocks from the total LD score inside each block:

    S_P = Σ_{i=0}^{k} LD[b_i+1][b_{i+1}] − Σ_{i=0}^{k−1} LD[b_i+1][b_{i+1}][b_{i+2}]    (8)
The block partition P with the maximum S_P score is chosen as the optimal solution, which can be found in polynomial time by a simple dynamic programming algorithm.

Define S[i][j] to be the maximum score of a block partition whose last block is [i...j]. Then, by dynamic programming,

    S[i][j] = max_{0<k<i−1} { S[k][i−1] + LD[i][j] − LD[k][i−1][j] }    (9)

To further improve the efficiency of our algorithm, we restrict the block size to be at least bmin and at most bmax; the above equation is then modified as follows:

    S[i][j] = max_{i−bmax ≤ k ≤ i−bmin} { S[k][i−1] + LD[i][j] − LD[k][i−1][j] }    (10)

Using this recursion, we can design a dynamic programming algorithm to find the optimal block partition. The score array S[i][j] is an m×(bmax−bmin+1) matrix, so the space complexity of our block partition algorithm is O(dm), where d = bmax−bmin+1 for simplicity. In Equation (10), the internal LD score LD[i][j] and the external LD score LD[k][i−1][j] can be computed within O(d²) time, so S[i][j] can be computed within O(d³) time. The time complexity of our block partition algorithm is therefore O(d⁴m).
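The size-restricted recursion of Equation (10), together with the block scores of Equations (6) and (7), can be sketched as a small dynamic program. This is an illustrative sketch with our own function names and 0-based inclusive indices; the paper's implementation may differ in detail.

```python
def optimal_partition(r2, m, bmin=2, bmax=10):
    """Find the block partition maximizing Equation (8) via the
    size-restricted recursion of Equation (10). `r2[s][t]` is the
    pairwise r^2 matrix from the first step; blocks are returned as
    0-based inclusive (start, end) pairs."""
    def ld_in(i, j):
        # average LD inside block [i..j], Equation (6)
        size = j - i + 1
        if size < 2:
            return 0.0
        total = sum(r2[s][t] for s in range(i, j + 1) for t in range(s + 1, j + 1))
        return 2.0 * total / (size * (size - 1))

    def ld_between(i, j, k):
        # average LD between adjacent blocks [i..j] and [j+1..k], Equation (7)
        total = sum(r2[s][t] for s in range(i, j + 1) for t in range(j + 1, k + 1))
        return total / ((j - i + 1) * (k - j))

    best = {}  # best[(i, j)] = (score, start of previous block) for last block [i..j]
    for j in range(bmin - 1, m):
        for i in range(max(0, j - bmax + 1), j - bmin + 2):
            if i == 0:
                best[(i, j)] = (ld_in(i, j), None)  # first block of the partition
                continue
            cand = []
            # previous block [k..i-1] must also have size in [bmin, bmax]
            for k in range(max(0, i - bmax), i - bmin + 1):
                if (k, i - 1) in best:
                    s = best[(k, i - 1)][0] + ld_in(i, j) - ld_between(k, i - 1, j)
                    cand.append((s, k))
            if cand:
                best[(i, j)] = max(cand)
    # trace back from the best final block ending at locus m-1
    _, i = max((best[(i, m - 1)][0], i) for i in range(m) if (i, m - 1) in best)
    bounds, j = [], m - 1
    while i is not None:
        bounds.append((i, j))
        prev = best[(i, j)][1]
        i, j = prev, i - 1
    return list(reversed(bounds))
```

On a toy r² matrix with two tightly linked groups {s0, s1} and {s2, s3, s4}, the program splits exactly between them rather than cutting through a high-LD group.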
For the previous example of six unrelated individuals with genotypes '01001', '01001', '10011', '11111', '11111' and '22022', the r² score of each pair of SNPs is computed in the first step; Figure 1 shows the estimates. Denote the partition '∗∗∗ | ∗∗' by P1 and the partition '∗∗ | ∗∗∗' by P2; the evaluation scores of P1 and P2 are 0.037 and 0.057, respectively. P2 is selected because of its higher evaluation score. Clearly, our algorithm chooses the better block partition.

It is interesting to notice that our block partition algorithm does not assign the middle SNP to the left block even though it has higher LD with the first two SNPs than with the last two (0.17 + 0.36 versus 0.36 + 0.05). Our algorithm always attempts to find the globally optimal solution. As a result, the average LD scores of the blocks should be similar, because an asymmetric LD partition reduces the final evaluation score. Generally, there are more probable haplotypes in low-LD blocks than in high-LD blocks. In an asymmetric LD partition, the high-LD block wastes a portion of the buffer space whereas the low-LD block requires more. A symmetric LD partition therefore utilizes the limited buffer space most efficiently, and the accuracy is better.
2.3 Third step: frequency estimation and greedy ligation
The EM algorithm is performed on each block, and the haplotype frequencies are estimated. Only the haplotypes with relatively high frequencies are stored in the buffer. Unlike the traditional PLEM algorithm, which randomly ligates two adjacent blocks, our algorithm attempts to preserve the high-LD block property as long as possible during the ligation process. Suppose there are k+1 blocks [b_0+1, b_1], [b_1+1, b_2], ..., [b_k+1, b_{k+1}] from the second step. We first ligate the two adjacent blocks [b_i+1, b_{i+1}] and [b_{i+1}+1, b_{i+2}] with the highest LD[b_i+1][b_{i+1}][b_{i+2}] score, which reduces the number of blocks to k. We then recompute the LD scores between adjacent blocks, reselect the adjacent pair with the highest LD[b_i+1][b_{i+1}][b_{i+2}] score to ligate, and so on. This process continues until the complete haplotypes are determined. In each ligation step, the selection of adjacent blocks can be completed within O(k(m/k)²) = O(m²/k) time, so the total time for block selection is O(m²/k) + O(m²/(k−1)) + ··· + O(m²) = O(m² log k).

Our algorithm performs this greedy process to ensure that, in each step, the adjacent blocks with the strongest association are ligated. Since the buffer size is limited, it is always better to ligate weakly associated blocks later, because they usually have more uncertainty and produce more haplotypes.
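The greedy ligation order described above can be sketched as follows. Only the block-selection logic is shown, not the per-block EM frequency estimation; the function name and data layout are our own.

```python
def greedy_ligation_order(bounds, r2):
    """Return the order in which adjacent blocks are ligated: repeatedly
    merge the adjacent pair with the highest average between-block LD
    (Equation (7)), so weakly associated blocks are ligated last.
    `bounds` is a list of 0-based inclusive (start, end) blocks in order."""
    def ld_between(i, j, k):
        total = sum(r2[s][t] for s in range(i, j + 1) for t in range(j + 1, k + 1))
        return total / ((j - i + 1) * (k - j))

    blocks = list(bounds)
    order = []
    while len(blocks) > 1:
        # score every adjacent pair and pick the strongest association
        scores = [ld_between(blocks[t][0], blocks[t][1], blocks[t + 1][1])
                  for t in range(len(blocks) - 1)]
        t = max(range(len(scores)), key=scores.__getitem__)
        order.append((blocks[t], blocks[t + 1]))
        blocks[t:t + 2] = [(blocks[t][0], blocks[t + 1][1])]  # merge the pair
    return order
```

With three blocks where the right-hand pair has the stronger between-block LD, that pair is ligated first and the weakly linked left block joins last.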
During frequency estimation, we discard haplotypes whose probability is lower than 0.00001. Only haplotypes with relatively high probability are stored in the buffer. The maximum buffer size can also be specified by the user, to prevent so many haplotypes being stored that the efficiency of the algorithm suffers. Some small tricks are also adopted to further improve the accuracy and robustness of our algorithm. Suppose the specified buffer size is B; besides the top B haplotypes with the highest frequencies, we also select the most probable haplotype pair of each individual into the buffer. This ensures that our algorithm will not wrongly terminate in the next ligation step for lack of complementary haplotypes. The practical buffer size is therefore a little larger than the user-specified B: in theory it will not exceed B+2n, whereas in practice it is much smaller.

Compared with other individual haplotyping algorithms based on the partition–ligation strategy, our algorithm performs some additional operations to improve the accuracy of frequency estimation. These operations add O(m²n + rm² + d⁴m + m² log k) time, a simple sum of the time complexities of each step. The values of r and d can be regarded as constants and k < m, so our algorithm only adds O(m²n + m² log m) time, which is acceptable compared with the time cost of haplotype frequency estimation itself.
3 RESULTS AND DISCUSSION
To evaluate the accuracy of our better block partition–ligation expectation maximization (BBPLEM) algorithm, we used the individual error rate (IER) and the switch error rate (SER), two widely used criteria for assessing the performance of individual haplotyping (Delaneau et al., 2007; Marchini et al., 2006). The IER is the percentage of individuals whose haplotype configurations are incorrectly inferred. Generally, the IER decreases as the number of individuals increases and increases with the genotype length. When the genotype is long, almost all haplotyping algorithms fail to deduce the completely correct haplotype configuration; that is, the IER approaches 100% and loses statistical significance. A switch error is an error between an adjacent pair of heterozygous loci. Such an error can be corrected by a single switch, hence the name 'switch error'. The SER is the number of switch errors divided by the number of heterozygous loci.
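The SER criterion can be sketched as a small function. This is a minimal sketch following the text's definition (dividing by the number of heterozygous loci; other papers divide by that number minus one); the function name is our own. Note that counting phase flips between adjacent heterozygous sites is invariant to which of the two true haplotypes the inferred one is compared against.

```python
def switch_error_rate(true_hap, inferred_hap, genotype):
    """Fraction of adjacent heterozygous-locus pairs whose relative phase
    is inferred incorrectly; each such error is fixable by one 'switch'."""
    het = [i for i, g in enumerate(genotype) if g == '2']
    if len(het) < 2:
        return 0.0
    # phase indicator at each het site: does the inferred haplotype match
    # the chosen true haplotype there?  Transitions in this sequence are
    # unchanged if the whole true pair is swapped, so no global flip needed.
    match = [true_hap[i] == inferred_hap[i] for i in het]
    switches = sum(1 for a, b in zip(match, match[1:]) if a != b)
    return switches / len(het)
```

For example, inferring '0110' when the truth is '0101' over genotype '2222' misphases one adjacent pair, while the globally flipped '1010' describes the same haplotype pair and scores zero.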
Table 1. Accuracy and time comparison of various algorithms on the ACE dataset

Method                                            IER     SER     Running Time (s)
PLEM1.0 (Qin et al., 2002)                        0.214   0.067   0.1
fastPhase1.2 (Scheet and Stephens, 2006)          0.198   0.027   13.3
GERBIL1.1 (Kimmel and Shamir, 2005)               0.091   0.008   3.9
2SNP1.7 (Brinza and Zelikovsky, 2006)             0.091   0.008   0.1
Ishape2.0 (Delaneau et al., 2007)                 0.091   0.017   16.9
BEAGLE2.1 (Browning and Browning, 2007)           0.218   0.082   2.4
BBPLEM (Uniform partition & Pairwise ligation)    0.182   0.063   0.2
BBPLEM (Uniform partition & Greedy ligation)      0.182   0.052   0.2
BBPLEM (Optimal partition & Greedy ligation)      0.172   0.052   0.2
Although the maximum block size bmax and the minimum block size bmin can be selected arbitrarily, different choices may lead to different results. The value of bmin should be greater than 1, since it is meaningless to estimate the haplotype frequency of a single SNP. The value of bmax should not be too large, because large blocks cost a great deal of memory. It is better to select bmin and bmax according to the actual genome structure. In our algorithm, we use bmin = 2 and bmax = 10 as the default setting.
We compared BBPLEM with several programs, including PLEM (Qin et al., 2002), fastPhase (Scheet and Stephens, 2006), GERBIL (Kimmel and Shamir, 2005), 2SNP (Brinza and Zelikovsky, 2006), Ishape (Delaneau et al., 2007) and BEAGLE (Browning and Browning, 2007). PHASE was excluded because it was too slow to be run enough times to estimate average performance. The program HaploRec (Eronen et al., 2006) was not tested because missing SNPs cannot be handled well by its current version. All experiments were completed on a Windows server with a 3.20 GHz CPU and 1 GB RAM.
3.1 Real data
We compared the accuracy of our algorithm with the other algorithms on the human angiotensin converting enzyme (ACE) dataset provided by Rieder et al. (1999). It contains the genotypes of 11 unrelated individuals at 52 SNPs, with 13 different haplotypes identified through experiments. The buffer sizes of PLEM and BBPLEM were both set to 50. The number of EM iterations was set to 20. The parameter K (number of clusters) of fastPhase was set to 10 to reduce its running time. The parameters of GERBIL, 2SNP and Ishape were all left at their default settings. The parameter 'nsample' of BEAGLE was set to 200 and the parameter 'seed' was randomly generated in every independent run. To estimate average performance, 100 independent runs were performed. The estimated IER, SER and running times are given in Table 1.
As shown in Table 1, the uniform block partition and pairwise ligation strategies were also applied within the BBPLEM algorithm to evaluate the accuracy improvement due to our block partition and ligation strategy. When the uniform block partition strategy was adopted, the block size was set to 2. Figure 2 shows the distribution of the sizes of the blocks produced by our optimal block partition algorithm. The number of blocks decreases as the block size increases.
Among all algorithms GERBIL and 2SNP provided the most
accurate haplotyping result on the ACE dataset, but their
Fig. 2. The distribution of block size (ACE dataset).
Table 2. Accuracy and time comparison of various algorithms on the chromosome 5q31 dataset

Method                                            IER     SER     Running Time (s)
PLEM1.0 (Qin et al., 2002)                        −       −       −
fastPhase1.2 (Scheet and Stephens, 2006)          0.392   0.042   282.7
GERBIL1.1 (Kimmel and Shamir, 2005)               0.434   0.045   47.0
2SNP1.7 (Brinza and Zelikovsky, 2006)             0.465   0.046   1.0
Ishape2.0 (Delaneau et al., 2007)                 0.388   0.047   2151.1
BEAGLE2.1 (Browning and Browning, 2007)           0.404   0.043   8.1
BBPLEM (Uniform partition & Pairwise ligation)    0.418   0.046   6.3
BBPLEM (Uniform partition & Greedy ligation)      0.431   0.046   5.6
BBPLEM (Optimal partition & Greedy ligation)      0.388   0.043   5.4
performances were worse than those of the other algorithms in the later comparisons on larger datasets. BBPLEM yielded medium accuracy in less time. Among the three partition–ligation strategies, the optimal block partition and greedy ligation strategy provided the most accurate haplotyping result.
Besides the human angiotensin converting enzyme dataset, we
also tested various algorithms on the 5q31 dataset, which was
generated by Daly et al. (2001). The 5q31 dataset contained 129
trio pedigrees of father, mother and child with their genotypes at
103 SNPs in chromosome 5q31. The genotypes of 129 children
were selected for the performance evaluation of various algorithms.
In the original children data, there were 3873 (29%) heterozygous
alleles and 1334 (10%) missing alleles. After pedigree resolving,
the phase of 2714 heterozygous alleles and 168 missing alleles can
be identified. According to these identified SNPs, we estimated
the accuracies of various algorithms. Because there were more
genotypes in 5q31 than ACE, the buffer sizes of PLEM and
BBPLEM were both set to 100. The parameter nsample of BEAGLE was changed to 25 to gain better performance. The parameters of the other algorithms were set as before. To estimate their average performance, 10 independent runs were performed; the results are given in Table 2.
For the 5q31 data, Ishape and fastPhase provided the minimum
IER and SER, respectively. But they both had very long running
time. PLEM failed to gain a solution in 10 h. Compared with other
Table 3. Accuracy and time comparison of various algorithms on the simulated dataset

Method                                            IER     SER     Running Time (s)
PLEM1.0 (Qin et al., 2002)                        0.808   0.223   2.6
fastPhase1.2 (Scheet and Stephens, 2006)          0.853   0.155   205.6
GERBIL1.1 (Kimmel and Shamir, 2005)               0.950   0.234   132.9
2SNP1.7 (Brinza and Zelikovsky, 2006)             0.970   0.246   0.2
Ishape2.0 (Delaneau et al., 2007)                 0.242   0.038   1820.0
BEAGLE2.1 (Browning and Browning, 2007)           0.745   0.131   9.2
BBPLEM (Uniform partition & Pairwise ligation)    0.643   0.151   1.1
BBPLEM (Uniform partition & Greedy ligation)      0.552   0.138   1.0
BBPLEM (Optimal partition & Greedy ligation)      0.541   0.136   0.9

Fig. 3. The distribution of block size (5q31 dataset).
algorithms, BBPLEM yielded better accuracy in much less time. Moreover, the optimal block partition and greedy ligation strategy provided the best accuracy among the three strategies, which indicates its effectiveness. The distribution of the sizes of the blocks produced by our optimal block partition algorithm is given in Figure 3.
3.2 Simulated data
Several programs have been developed to simulate haplotypes for genome studies (Hudson, 2002; Liang et al., 2007). However, it may be difficult to simulate the natural block structure of haplotypes. Wang et al. (2002) investigated the block-like structure of haplotypes. They observed that when the mutation rate was set to 0.3×10⁻⁹ per site per year and the recombination rate was set to 1.0×10⁻⁸ with 50% probability and 4.0×10⁻⁸ with the remaining 50% probability, the block size distributions of the simulated haplotypes and of real human chromosome 21 were very similar. Here, we used the same parameter setting. The program 'GENOME' was employed to generate haplotypes for our performance evaluation, because it allows recombination rates to vary along the genome. We generated 10 samples of 100 haplotypes each. The length of the haplotypes varied from 80 to 120.
The genotypes were generated by randomly pairing two haplotypes. For each sample, 100 individuals were generated and the accuracies were estimated. The buffer sizes of BBPLEM and PLEM were set to 100. The number of EM iterations was set to 20. The parameter K of fastPhase was set to 10. The parameter nsample of BEAGLE was set to 25. For the other algorithms, we used their default parameter settings. To estimate the average performance of the various algorithms, 10 independent runs were performed for each sample. Table 3 presents the accuracy and time comparison.
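The random pairing step can be sketched as follows. This is an illustrative sketch, not the paper's tool: the function name and the fixed default seed are our own, and it only shows how matching alleles give homozygous loci while differing alleles give a heterozygous '2'.

```python
import random

def random_genotypes(haplotypes, n_individuals, rng=None):
    """Form genotypes by randomly pairing haplotypes from a simulated
    sample: equal alleles yield a homozygous '0'/'1' locus, unequal
    alleles yield a heterozygous '2'."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    genotypes = []
    for _ in range(n_individuals):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        genotypes.append(''.join(a if a == b else '2' for a, b in zip(h1, h2)))
    return genotypes
```

Pairing the example haplotypes '01001' and '10010' in this way yields either a homozygous genotype or the fully heterozygous '22022' used in Section 2.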
For the simulated data, Ishape yielded much better accuracy than the other algorithms, but at a high time cost. PLEM, BEAGLE, 2SNP and BBPLEM were much faster than the other algorithms. BBPLEM yielded comparable accuracy in a very short time. The optimal block partition and greedy ligation strategy still provided the best performance.
To evaluate the efficiency of BBPLEM, we examined the running
time of BBPLEM with respect to different m and n. The results are
given in Figures 4 and 5.
Fig. 4. Running time versus length of genotype (100 individuals).
Fig. 5. Running time versus number of individuals (1000 SNPs).
Consistent with our earlier discussion of the time complexity of our algorithm, the running time is approximately proportional to the square of m and linear in n. As demonstrated in Figures 4 and 5, the BBPLEM algorithm is very fast: with about 1000 individuals and 1000 SNPs, BBPLEM finishes in a few seconds, whereas algorithms such as Ishape and fastPhase fail to produce haplotyping results within 10 h.
4 CONCLUSIONS
Partition–ligation is a widely used approach for reducing the time cost of individual haplotyping. However, an inappropriate partition may increase the error rate of frequency estimation. In this article, we proposed a better partition and ligation strategy based on the LD between each pair of SNPs. The haplotype blocks are partitioned such that the LD within a block is high whereas the LD between adjacent blocks is low, which largely reduces the haplotyping error rate.
We evaluated the accuracy of our algorithm on both real and simulated datasets. Compared with other algorithms, ours achieved comparable accuracy at much lower time cost. Some haplotyping errors may be due to the limitations of the simple EM algorithm. Because our block partition and ligation strategy is flexible and efficient, combining it with better frequency estimation algorithms, such as Bayesian or MCMC methods, may lead to better accuracy.
Although neighboring SNPs usually have high LD, strong associations may also exist between distant SNPs. In our algorithm, a haplotype block must be a contiguous region, so distant high-LD SNPs are not considered together. Allowing distant SNPs to constitute a block may further improve the accuracy of haplotyping, which is a direction for future work.
ACKNOWLEDGEMENTS
We thank Linbin Yu, Yiming Lei, Mingzhi Shao and Juan Liu, who
provided many helpful suggestions for our article.
Funding: The Key Project of the National Natural Science Foundation of China under grant No. 60533020.
Conflict of Interest: none declared.
REFERENCES
Barrett,J.C. et al. (2005) Haploview: analysis and visualization of LD and haplotype
maps. Bioinformatics, 21, 263–265.
Bonizzoni,P. et al. (2003) The haplotyping problem: an overview of computational
models and solutions. J. Comput. Sci. Technol., 18, 675–688.
Brinza,D. and Zelikovsky,A. (2006) 2SNP: scalable phasing based on 2-SNP haplotypes. Bioinformatics, 22, 371–373.
Browning,S.R. and Browning,B.L. (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am. J. Hum. Genet., 81, 1084–1097.
Chung,R.H. and Gusfield,D. (2003) Perfect phylogeny haplotyper: haplotype inferral using a tree model. Bioinformatics, 19, 780–781.
Clark,A. (1990) Inference of haplotypes from PCR-amplified samples of diploid
populations. Mol. Biol. Evol., 7, 111–122.
Daly,M.J. et al. (2001) High-resolution haplotype structure in the human genome. Nat.
Genet., 29, 229–232.
Delaneau,O. et al. (2007) ISHAPE: new rapid and accurate software for haplotyping.
BMC Bioinformatics, 8, 205.
Eronen,L. et al. (2006) HaploRec: efficient and accurate large-scale reconstruction of haplotypes. BMC Bioinformatics, 7, 542.
Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular
haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921–927.
Gabriel,S.B. et al. (2002) The structure of haplotype blocks in the human genome.
Science, 296, 2225–2229.
Gusfield,D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of RECOMB 2002: The 6th Annual International Conference on Computational Biology. ACM, New York, USA, pp. 166–175.
Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model of genetic
variation. Bioinformatics, 18, 337–338.
International HapMap Consortium (2003) The international HapMap project. Nature,
426, 789–796.
Kimmel,G. and Shamir,R. (2005) GERBIL: genotype resolution and block identification
using likelihood. Proc. Natl Acad. Sci. USA, 102, 158–162.
Li,Z. et al. (2005) A parsimonious tree-grow method for haplotype inference.
Bioinformatics, 21, 3475–3481.
Liang,L. et al. (2007) GENOME: a rapid coalescent-based whole genome simulator.
Bioinformatics, 23, 1565–1567.
Lin,S. et al. (2004) Haplotype and missing data inference in nuclear families. Genome
Res., 14, 1624–1632.
Marchini,J. et al. (2006) A comparison of phasing algorithms for trios and unrelated
individuals. Am. J. Hum. Genet., 78, 437–450.
Niu,T.H. et al. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet., 70, 157–169.
Patil,N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution
scanning of human chromosome 21. Science, 294, 1719–1723.
Qin,Z.S. et al. (2002) Partition-Ligation EM algorithm for haplotype inference with
single nucleotide polymorphisms. Am. J. Hum. Genet., 71, 1242–1247.
Rieder,M.J. et al. (1999) Sequence variation in the human angiotensin converting
enzyme. Nat. Genet., 22, 59–62.
Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Stephen,M. et al. (2001) A new statistical method for haplotype reconstruction from
population data. Am. J. Hum. Genet., 68, 978–989.
Wang,N. et al. (2002) Distribution of recombination crossovers and the origin of
haplotype blocks: the interplay of population history, recombination, and mutation.
Am. J. Hum. Genet., 71, 1227–1234.
Wang,L.S. and Xu,Y. (2003) Haplotype inference by maximum parsimony.
Bioinformatics, 19, 1773–1780.
Zhang,K. et al. (2005) HapBlock: haplotype block partitioning and tag SNP selection
software using a set of dynamic programming algorithms. Bioinformatics, 21,
131–134.