Approaches to haplotype assembly

Robert Vaser
Faculty of Electrical Engineering and Computing, University of Zagreb
Laboratory for Bioinformatics and Computational Biology
Unska 3, 10000 Zagreb
Email: [email protected]
been developed, from the first generation based on Sanger's
sequencing method [5], through next generation sequencing
(NGS) methods such as Roche's 454 Life Sciences and
Illumina Solexa [6], to the most recent methods such as Pacific
Biosciences [7] and Oxford Nanopore Technologies [8]. No
matter what sequencing method is used, only small portions
(fragments or reads) of the DNA can be read at a time. In
order to reconstruct the sequenced DNA, the read fragments
need to be stitched back together. Large amounts of
sequencing data, sequencing errors and genomic repeats make
this a non-trivial task. Tools tailored for this problem are
called assemblers and are based on the assumption that highly
similar fragments originate from the same area in the DNA
[9]. There are three main groups of assembly methods: greedy
approaches, de Bruijn graph based approaches and
overlap-layout-consensus (OLC) based approaches [9]. Most greedy
algorithms join fragments by the best alignment
criterion. De Bruijn graphs and the OLC paradigm, on the
other hand, are more sophisticated. De Bruijn graphs are
directed graphs in which vertices are $k$-mers obtained from
sequenced fragments and edges represent overlaps of length
$k - 1$ between two vertices. Sequence reconstruction is done
by finding an Eulerian path in such a graph. They are optimized
for short fragments with minimal error rates and were
popularized with the assembler Euler [10]. The OLC paradigm
was introduced with algorithms described in [11] and the
Celera assembler [12] which is still extensively used today,
with modifications for new sequencing technologies. The
paradigm consists of three phases: an overlap phase, where
fragments are aligned to each other; a layout phase, where an
overlap graph is constructed and linearized; and a consensus
phase, where all the initial fragments are aligned to the layout
sequences in order to resolve ambiguities in the data. It is best
suited for longer fragments with no particular restrictions on
the error rate.
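The de Bruijn approach described above can be illustrated with a minimal sketch. It assumes error-free reads, a genome without repeated $k$-mers and a single strand; the function names are hypothetical and do not correspond to any particular assembler.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Edges join consecutive k-mers of a read, which overlap by k - 1 bases."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return sorted(edges)


def eulerian_path(edges):
    """Hierholzer's algorithm: visit every edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one surplus outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Each step along the path contributes one new base."""
    return path[0] + "".join(v[-1] for v in path[1:])


reads = ["ACGTT", "GTTCA", "TTCAG"]
print(spell(eulerian_path(de_bruijn_edges(reads, 3))))  # ACGTTCAG
```

Real assemblers additionally handle reverse complements, sequencing errors and repeats, all of which are left out here.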
As the assembly process is hard on its own and sequencing
technologies are unable to sequence haplotypes separately,
most modern assemblers create genomic sequences with
fused haplotypes, thus losing a portion of the information. To
obtain the complete information within a genome, all
haplotypes of a chromosome need to be reconstructed.
Computational methods which try to solve this problem can be
divided into two groups: haplotype phasing and haplotype
assembly. Experimental methods also exist [13]–[15], but they
tend to be costly, time consuming and impractical
for some applications.
Abstract: DNA sequencing methods have enabled new insight into
the human genome as well as the genomes of other organisms. Humans
are diploid, which means that each chromosome comes as a
homologous pair. Each chromosome of a pair is called a
haplotype and the main variation between them is manifested in
the form of single nucleotide polymorphisms (SNPs). Most reference
genomes obtained by assembling sequenced fragments contain
fused haplotypes of a chromosome set. Various
computational and experimental methods have been developed
to reconstruct the haplotypes of an organism. This paper describes
the haplotype assembly problem, which is based on data gathered
from whole genome sequencing, with an overview of
computational algorithms from the first problem formulation
to date.
Keywords: haplotype assembly, genome assembly, single
nucleotide polymorphisms, diploid, polyploid
I. INTRODUCTION
An organism's complete set of chromosomes, which are
packed and organized structures containing DNA molecules,
is called a genome. Depending on the ploidy of the species,
chromosomes can exist as singles or in pairs, triplets,
quadruplets or in even larger sets. The number of
chromosomes in a set determines the notation for an organism,
such as haploid, diploid (e.g. humans) or, if there are more than
two chromosomes per set, polyploid (e.g. the cultivated
kiwifruit is hexaploid [1]). DNA contained in a single chromosome
is called a haplotype. Haplotypes of the same set are
homologous to each other, i.e. they contain the same genes at
the same position (locus). However, these genes can
differ from each other, which results in different
phenotypes. Those variant forms of genes are called alleles.
Differences occur as single nucleotide polymorphisms (SNP,
“snip”), insertions, deletions and structural variants [2]. The
most frequent genetic variations are SNPs [3],
which have been extensively researched. The International
HapMap Project aimed to develop a reference map of
common patterns of DNA variation (the polymorphism rate in
humans is around 0.1%) by taking DNA samples from three
different human populations [4]. Knowing the precise order of
nucleotides within a genome and the patterns of genetic
variation enables new avenues of research in various fields
such as virology, medical diagnosis, biotechnology etc.
In order to determine the order of nucleotides in a DNA
molecule, a process called DNA sequencing is conducted. In
the past decade, many different sequencing platforms have
The first group of methods, called haplotype phasing, is
based on obtaining genotypes from a population of unrelated
individuals with SNP arrays [16]. A genotype is a list of
nucleotide pairs (SNPs in diploid organisms have two
different values) for each variant site in the genome. For an
individual's genotype with $N$ variant sites, there exist $2^{N-1}$
possible haplotype pairs. With genotypes of a population,
assumptions about haplotype evolution can be used to infer
the haplotypes of each individual in the population [16]. Various
methods have been developed to solve haplotype
phasing, from the first heuristic method described by Clark
[17] to more reliable approaches such as expectation maximization
algorithms and statistical methods (e.g. PHASE [18]).
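The count of $2^{N-1}$ can be checked by direct enumeration. The sketch below assumes that every listed site is heterozygous and represents a genotype as a list of allele pairs; the function name and the input format are illustrative choices, not part of any phasing tool.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate all unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele_a, allele_b) tuples, one per heterozygous site.
    """
    first, rest = genotype[0], genotype[1:]
    pairs = []
    # The first site keeps a fixed orientation so that (h1, h2) and (h2, h1)
    # are not counted as two different pairs.
    for flips in product((False, True), repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            h1.append(b if flip else a)
            h2.append(a if flip else b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


genotype = [("A", "G"), ("C", "T"), ("A", "T")]
for pair in haplotype_pairs(genotype):
    print(pair)  # four pairs for N = 3, i.e. 2 ** (3 - 1)
```

The exponential growth of this list is the reason why phasing methods rely on population-level assumptions rather than enumeration.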
The second group of methods, haplotype assembly, is based on
whole genome sequencing (WGS) and can be further divided
into two subgroups. One subgroup finds variant sites by
overlapping sequenced fragments (or mapping them to a
reference or assembly), locating positions of disagreement and
solving optimization problems given all variant sites [19]. The
other subgroup incorporates haplotype reconstruction within
the genome assembly using heuristic algorithms. In the next
section, the haplotype assembly problem is presented in more
detail with an overview of computational algorithms specially
designed for it.
nucleotides at that position. With the natural order of SNPs in
a genome, an $m \times n$ SNP matrix $X$ can be defined over the
values {0, 1, −}. An example of this matrix is shown in figure
2a. The value '−' in matrix cell $(i, j)$ denotes that SNP $S_j$ is
not covered by fragment $F_i$. Furthermore, let $G_F(X) = (F, E_F)$
be a fragment conflict graph (figure 2b). Vertices of $G_F(X)$ are
fragments and each edge indicates that the connected
fragments are in conflict, i.e. they disagree on at least one of
the SNPs they both cover. If matrix $X$ is feasible (error-free),
$G_F(X)$ is a bipartite graph where each of the two shores
defines a haplotype, i.e. fragments can be divided into two
disjoint subsets in which no two fragments are in conflict. Another
graph formulation is the SNP conflict graph $G_S(X) = (S, E_S)$
(figure 2c). In this graph, vertices represent SNPs and edges denote
conflicts between SNPs. For two SNPs $i, j$ to be in conflict,
there have to exist two fragments $k, l$ which both cover the
two SNPs and the submatrix defined by $i, j, k, l$ has three
values of one type (0 or 1) and one of the other (1 or 0,
respectively). Otherwise the two SNPs are either in the same
haplotype or in different haplotypes. If the matrix $X$ is gapless,
i.e. all fragments cover a set of consecutive SNPs, then
$G_F(X)$ is bipartite, and thus the matrix $X$ is feasible, if and only if
$G_S(X)$ is an independent set (no two vertices are adjacent).
With the above definitions, the fundamental problem of
haplotype assembly is to determine the optimal set of changes
to an SNP matrix $X$ so that it becomes feasible. The authors
proposed two main formulations of this problem, for which
they proved complexity results: minimum fragment
removal (MFR) and minimum SNP removal (MSR). They are
defined as follows. Given an SNP matrix $X$, find the smallest set
of fragments if solving MFR, or the smallest set of SNPs if
solving MSR, whose removal makes the resulting matrix
feasible. For the gapless case the MFR and MSR problems can
be solved in polynomial time. It is shown that MFR can be
reduced to a maximum cost flow problem and MSR to finding
independent sets in perfect graphs. In the more realistic case where
fragments have gaps (mostly due to paired-end fragments, which
have sequence gaps of several kilobases), both the MFR and MSR
problems are proven to be NP-hard.
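The fragment conflict graph and the feasibility test defined above can be sketched in a few lines. In this illustration, fragments are encoded as strings over {0, 1, -}, one character per SNP; the encoding and the function names are assumptions made for the example.

```python
from collections import deque

GAP = "-"


def in_conflict(f, g):
    """Two fragments conflict if they disagree on a SNP that both cover."""
    return any(a != b for a, b in zip(f, g) if a != GAP and b != GAP)


def fragment_conflict_graph(X):
    """Adjacency sets of the fragment conflict graph of SNP matrix X."""
    adj = {i: set() for i in range(len(X))}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if in_conflict(X[i], X[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def bipartition(adj):
    """Return a 2-colouring {fragment: 0 or 1}, or None if the graph is not bipartite."""
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # odd cycle: X is not feasible
    return colour


X = ["01-", "-10", "10-", "-01"]
print(bipartition(fragment_conflict_graph(X)))            # a valid 2-colouring
print(bipartition(fragment_conflict_graph(X + ["11-"])))  # None
```

In the first matrix the two colour classes correspond to the haplotypes 010 and 101. Adding the fragment 11- creates an odd cycle, so the matrix is no longer feasible and one of the removal or correction problems has to be solved.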
A year after the initial problem formulation, the same group
of researchers refined the definition of haplotype assembly and
proposed additional optimization problems, of which
minimum error correction (MEC) [20] has attracted the most research
interest. MEC is defined for a given SNP matrix $X$ as finding
the smallest set of matrix cell changes, 0 to 1 and vice versa,
which makes the matrix feasible. It is proven to be NP-hard
even for the gapless case [21].
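For very small inputs the MEC objective can be evaluated exhaustively, which makes the definition concrete. The following sketch tries every bipartition of the fragments and counts, per column and per side, the minority values that would have to be flipped. It is exponential in the number of fragments and is meant only as an illustration; the string encoding of fragments and the function names are assumptions.

```python
from itertools import product


def partition_cost(X, sides):
    """Minimum number of cell flips that makes each side of a bipartition conflict free."""
    cost = 0
    for side in (0, 1):
        rows = [frag for frag, s in zip(X, sides) if s == side]
        for column in zip(*rows):
            zeros, ones = column.count("0"), column.count("1")
            cost += min(zeros, ones)  # flip the minority value in this column
    return cost


def brute_force_mec(X):
    """Try every bipartition of the fragments and keep the cheapest one."""
    m = len(X)
    # Fragment 0 is fixed on side 0 because swapping the two sides changes nothing.
    candidates = ((0,) + rest for rest in product((0, 1), repeat=m - 1))
    best = min(candidates, key=lambda sides: partition_cost(X, sides))
    return best, partition_cost(X, best)


X = ["01-", "-10", "10-", "-01", "11-"]
print(brute_force_mec(X))  # ((0, 0, 1, 1, 0), 1)
```

A side is conflict free exactly when every column holds a single value among the fragments that cover it, which is why counting minority values per column gives the number of corrections.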
The list of haplotype assembly algorithms developed from
the first problem definition until today is extensive. Most of
them were designed for diploid organisms, although some
attention has been given to polyploid organisms as well. Some
of the algorithms solve the problem formulations stated
above or their variations. Others try to reconstruct partial
haplotypes during the genome assembly and finalize them by
using heuristics. In the next section, we present some of the
haplotype assembly algorithms which we find noteworthy.
II. HAPLOTYPE ASSEMBLY
Most assemblies of diploid organisms, which were
constructed from WGS fragments, contain merged haplotypes
of a chromosome set. This can be easily observed when all of
the sequenced fragments are mapped back to the assembly.
Figure 1 shows such an example. SNPs can be identified at
positions where nucleotide bases differ in mapped fragments,
keeping in mind possible sequencing errors. The haplotype
assembly problem is to find the maximally consistent pair of
haplotypes for a given set of variant sites (SNPs) inferred from
fragment assembly. A formal definition of this problem was
introduced in 2001 by Lancia et al. [19] for diploid organisms.
Their notation for the problem is as follows. There is a set of $n$
SNPs $S = \{S_1, S_2, \ldots, S_n\}$ and a set of $m$ fragments
$F = \{F_1, F_2, \ldots, F_m\}$. Each SNP is covered by some of the fragments
and has two possible values, 0 or 1, regardless of the actual
Figure 1 Locating variant sites is done by aligning fragments to a
reference sequence which was assembled from whole genome
sequencing data.
Figure 2 Examples were redrawn from [19] and show an SNP matrix $X$ (a) with its corresponding graphs, the fragment conflict graph $G_F(X)$ (b)
and the SNP conflict graph $G_S(X)$ (c).
$O(D, B)$ properly contain overlap $O(A, B)$, and overlaps
$O(C, B)$, $O(D, A)$ do not exist (figure 4). The second condition
requires that all fragments which overlap $A$ or $B$ match
either the haplotype of $A$ or the haplotype of $B$, or both if they
are contained in a region with no variant sites.
Wang et al. [28] presented a branch-and-bound approach
similar to that of Lippert et al. [20], but for the MEC problem. In
addition, they used a genetic algorithm (GA) to search for the
optimal solution. Their candidate solution represents the
classification of each of the $m$ fragments. When the best solution
is found, haplotypes are inferred from the fragment subsets so that
at each position the most frequent value is chosen.
Qian et al. [29] proposed a particle swarm optimization (PSO)
approach, similar to the GA by Wang et al., which yields
better performance. Wu et al. [30] argue that earlier
implementations of GA and PSO fail to perform well
when dealing with millions of fragments because of the
large number of features. They introduced a different and
practically more suitable definition of a candidate solution in
their GA. Their candidate solution represents a haplotype, i.e.
the number of features is bounded by the number of SNPs, while its
pair haplotype is obtained via bit-wise complement.
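The two candidate encodings can be contrasted in a short sketch. The first function derives haplotypes from a fragment classification by majority vote, as in the encoding attributed to Wang et al.; the second derives the partner haplotype by complement, as in the encoding attributed to Wu et al. Function names, the string encoding and the tie-breaking rule are assumptions made for the illustration and are not taken from the cited papers.

```python
GAP = "-"


def consensus(X, sides):
    """Fragment-level candidate: one class (0 or 1) per fragment, m features in total."""
    n = len(X[0])
    haplotypes = []
    for side in (0, 1):
        rows = [frag for frag, s in zip(X, sides) if s == side]
        hap = []
        for j in range(n):
            column = [row[j] for row in rows]
            zeros, ones = column.count("0"), column.count("1")
            if zeros == 0 and ones == 0:
                hap.append(GAP)  # no fragment in this class covers SNP j
            else:
                hap.append("0" if zeros >= ones else "1")  # ties go to 0 (arbitrary)
        haplotypes.append("".join(hap))
    return tuple(haplotypes)


def complement(hap):
    """SNP-level candidate: one haplotype of n bits, the partner is its complement."""
    return "".join("1" if a == "0" else "0" for a in hap)


X = ["01-", "-10", "10-", "-01", "11-"]
print(consensus(X, (0, 0, 1, 1, 0)))  # ('010', '101')
print(complement("010"))              # 101
```

The search space of the first encoding grows with the number of fragments ($2^m$ classifications), whereas the second grows with the number of SNPs ($2^n$ haplotypes), which is the practical advantage the text attributes to the SNP-level encoding.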
III. ALGORITHMS FOR HAPLOTYPE ASSEMBLY
In 2002 Lippert et al. [20] developed the first practical
method for the MFR problem based on the branch-and-bound
algorithm which they claim is efficient on real and simulated
data. In the same year, Rizzi et al. [22] described
practical algorithms for the MFR and the MSR problems based
on dynamic programming. Algorithms for the gapless case
have time complexities $O(m^2 n + m^3)$ and $O(mn^2)$ for MFR
and MSR, respectively. In case of gaps in fragments, the time
complexities change to $O(2^{2k} m^2 n + 2^{3k} m^3)$ and
$O(mn^{2k+2})$, where $k$ is the maximal number of gaps in any
fragment. Several years later, He et al. [23] presented a
novel dynamic programming algorithm with a time
complexity of $O(2^k mn)$, where $k$ is the length of the longest
fragment.
Jones et al. [24] used heuristics to assemble the diploid
genome of Candida albicans (fungus). First the PHRAP
assembler [25] was run to obtain contiguous stretches of DNA
with no gaps (contigs). Afterwards, alignments between all
contigs were found in order to find heterozygous areas (areas
with a significant number of variant sites). Concerning
assembler results, there are two cases, one where contigs were
broken at heterozygous areas (figure 3a) and the other in
which a big contig contains a smaller one (figure 3b). The
haplotypes are reconstructed by copying the non-overlapping
(nearly homozygous) areas to contigs missing them (figure
3c). Additionally, variant sites were found in nearly
homozygous areas by looking for patterns of high base quality
disagreements between individual fragments.
In the assembly of diploid Ciona savignyi (sea squirt),
which has a high polymorphism rate of 4.6%, Vinson et
al. [26] used heuristics to split fragments into two sets and
assemble the haplotypes separately. They modified the
assembler Arachne [27] by adding a new step after obtaining
overlaps between fragments. In this step, they remove
overlaps between fragments which originate from different
haplotypes. Two conditions have to be satisfied for an overlap
between fragments $A$ and $B$, denoted $O(A, B)$, to be discarded.
The first demands that there are two additional fragments $C$, $D$ that
locally establish two distinct haplotypes, overlaps $O(C, A)$ and
Figure 3 Figure was redrawn from [24] and displays assembled
contigs which consist of nearly homozygous areas (in blue) and
heterozygous areas (in orange). Those contigs are either broken at
a heterozygous area (a) or a big contig contains a smaller one (b). In
order to reconstruct haplotypes, homozygous areas are copied to
contigs that are missing them (c).
of fragments supporting the second. The proposed algorithm,
called minimum weighted edge removal (MWER), is
tailored for the compass graph. It is based on finding spanning
trees in graphs. The authors extended their tool a year later to
reconstruct haplotypes of polyploid organisms [35].
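The edge weights of a compass graph can be sketched as follows. The example assumes a simplified reading of the definition: each fragment that covers both SNPs casts one vote, +1 if it carries the same value at both SNPs (supporting the 00/11 configuration) and −1 otherwise (supporting 01/10). The function name and the string encoding of fragments are hypothetical and are not taken from HapCompass.

```python
from collections import defaultdict
from itertools import combinations

GAP = "-"


def compass_edge_weights(X):
    """Map each SNP pair (i, j) to (#votes for 00/11) minus (#votes for 01/10)."""
    weights = defaultdict(int)
    for frag in X:
        covered = [j for j, a in enumerate(frag) if a != GAP]
        for i, j in combinations(covered, 2):
            weights[(i, j)] += 1 if frag[i] == frag[j] else -1
    return dict(weights)


X = ["01-", "-10", "10-", "-01", "11-"]
print(compass_edge_weights(X))  # {(0, 1): -1, (1, 2): -2}
```

A negative weight indicates that most fragments place the two SNPs in opposite phase; the erroneous fragment 11- is outvoted on the edge (0, 1).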
Berger et al. [36] implemented a maximum-likelihood
estimation algorithm for diploid and polyploid haplotype
assembly in their tool HapTree.
WhatsHap has been developed by Patterson et al. [37] as a
response to sequencing technologies which produce longer
fragments with higher error rates. It is tailored for solving the
weighted MEC problem (wMEC) and is based on dynamic
programming. The wMEC problem differs from the original
MEC problem by having an additional function which assigns
weights to elements of the SNP matrix $X$. Weights are based
on the sequencing quality values of each base. The goal of wMEC
is to make $G_F(X)$ conflict free by changing values in the
matrix $X$ with a minimum total weight.
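The weighted objective can be made concrete with a small sketch that evaluates the wMEC cost of a given bipartition of the fragments. It only computes the objective and is not the dynamic programming algorithm of WhatsHap; the weight matrix `W`, the string encoding of fragments and the function name are assumptions for the illustration.

```python
def wmec_partition_cost(X, W, sides):
    """Minimum total weight of cell flips that makes both sides conflict free.

    X: fragments as strings over {0, 1, -}.
    W: per-cell weights with the same shape as X (e.g. derived from base qualities).
    sides: one entry (0 or 1) per fragment.
    """
    cost = 0
    n = len(X[0])
    for side in (0, 1):
        rows = [i for i, s in enumerate(sides) if s == side]
        for j in range(n):
            w0 = sum(W[i][j] for i in rows if X[i][j] == "0")
            w1 = sum(W[i][j] for i in rows if X[i][j] == "1")
            cost += min(w0, w1)  # correct whichever value is cheaper to flip
    return cost


X = ["01-", "-10", "11-"]
W = [[30, 30, 0], [0, 30, 30], [5, 40, 0]]
print(wmec_partition_cost(X, W, (0, 0, 0)))  # 5
```

With all weights equal to 1 the function reduces to the unweighted MEC cost of the same bipartition. In the example, the low-quality 1 in the first column of the third fragment is the cheapest cell to correct.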
Another tool tailored for future-generation technologies is HapCol
[38]. It is an exact algorithm, based on dynamic programming,
with a time complexity that is exponential in the number of
corrections for each SNP. HapCol was designed to solve the
haplotype assembly problem for sequencing techniques that
produce long fragments with a uniform distribution of
sequencing errors. It proposes a new variant of the MEC problem, called
k-cMEC, in which each column can have at most k errors.
Bonizzoni et al. [39] gave the first k-ploid MEC
formulation and proved its complexity. The k-MEC problem
they defined is suited for situations where the number of
haplotypes k is known a priori. The SNP matrix $X$ is k-conflict
free iff there exists a k-partition of fragments such that in each
subset no fragments are in conflict. The k-MEC problem is solved by
finding the minimum number of corrections that make
the matrix $X$ k-conflict free.
Most recently, diploid-aware assemblers have started to emerge.
One of them is Falcon with its haplotype resolving module
Falcon-Unzip [40]. It is an OLC based assembler which
corrects long erroneous fragments before assembly. Falcon
produces haplotype-fused primary contigs, and associate
contigs which are alternative paths from bubble-like structures
not included in primary contigs (figure 5). Bubbles in overlap
Figure 4 Figure was redrawn from [26] and shows an example of the
splitting rule. Blue fragments define one haplotype and orange
fragments the other. Overlap $O(A, B)$ is removed as it connects two
fragments from different haplotypes.
Levy et al. [31] developed a heuristic algorithm which was
used in assembling the first diploid human genome. The first
part of the algorithm is done in a greedy fashion with iterative
refinement steps afterwards. The fragment with the fewest
missing elements is chosen to seed a partial haplotype pair
(the other haplotype is obtained via bit-wise complement).
While fragments that share non-missing information with a
haplotype exist, the fragment with the strongest signal is
assigned to the corresponding haplotype. The signal is defined
as the number of SNPs indicating one haplotype minus the
number of SNPs indicating the other. When all fragments are
assigned to partial haplotypes the refinement step assigns the
majority value to each SNP in the haplotype. Afterwards, each
fragment is assigned to an appropriate haplotype, again via
majority rule. Geraci [32] benchmarked seven state-of-the-art
algorithms and showed that this heuristic had the best
performance.
Bansal and Bafna [33] provided a novel formulation of the
MEC problem based on graph cuts in their tool HapCut. They
build a weighted SNP graph, where edges between two SNPs
indicate that there exists a fragment covering both SNPs. Edge
weights equal the number of fragments disagreeing with the
current haplotype phase minus the number of fragments
supporting it. They show that finding a max-cut in such graphs
results in an optimal MEC solution. The proposed algorithm is
heuristic and consists of two steps. First a random initial
haplotype configuration is declared. Afterwards, graph cuts
which have positive sums of edge weights are found. If a cut
improves the MEC score, the haplotypes are updated. This
procedure is run until no more improvements are possible.
The tool HapCompass and corresponding compass graphs
were developed by Aguiar and Istrail [34]. Compass graphs
are a type of weighted SNP graphs where edges denote that
two SNPs are covered by at least one fragment. For a pair of
SNPs and a pair of fragments, there exist two possible
haplotype configurations, $\begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$ (fragments from the same
haplotype) and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ (fragments from different haplotypes).
With that defined, edge weights equal the number of
fragments supporting the first configuration minus the number
Figure 5 Figure was redrawn from [40] and shows an example of the
initial assembly produced by Falcon. This assembly is split into
primary and associate contigs to save information about structural
variants.
graph represent structural variants and are suited for the
haplotype assembly problem. The Falcon-Unzip module finds
variant sites by aligning all raw fragments to primary and
associate contigs. At each position, it checks whether there exists a
base with occurrence less than 75% and another base with
occurrence higher than 25%. A greedy algorithm is then used
to phase SNPs into blocks. The initial phasing is random for
all SNPs. For any two SNPs that share a fragment, a coupling
score is calculated. The score equals the number of
fragments supporting the particular phase. By flipping SNPs at
each position, the phase with the best score is kept. If there are
not enough fragments connecting two SNPs, the
haplotype block is broken. The process is run iteratively until no more
optimizations are possible or a pre-defined limit is reached.
After the phasing, each fragment is tagged with a list of pairs
$(block_{id}, phase_{id})$. Overlaps between fragments of the same
block but with different phases are removed. The end results
are correctly phased primary contigs and their associate
contigs called haplotigs.
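The variant site test described above can be sketched as a simple frequency check on one alignment column. This is an interpretation of the stated thresholds rather than the Falcon-Unzip source code; the function name and the parameters `major_max` and `minor_min` are hypothetical.

```python
from collections import Counter


def is_variant_site(column, major_max=0.75, minor_min=0.25):
    """column: the bases that aligned fragments show at one contig position."""
    counts = Counter(column).most_common(2)
    if len(counts) < 2:
        return False  # every fragment shows the same base
    total = len(column)
    major = counts[0][1] / total
    minor = counts[1][1] / total
    return major < major_max and minor > minor_min


print(is_variant_site("AAAAAGGG"))    # True: 62.5% A and 37.5% G
print(is_variant_site("AAAAAAAAAG"))  # False: the single G looks like an error
```

The thresholds separate heterozygous positions, where two bases appear at comparable frequencies, from isolated sequencing errors, which are expected to be rare at any single position.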
Another newly developed diploid-aware assembler is
Supernova [41], which is based on de Bruijn graphs. Its authors
used the 10x Genomics technology to produce short linked
fragments. Linked fragments come from the same genome
region and are tagged with the same barcode. The assembler
builds a standard de Bruijn graph and looks for linear paths
which are punctuated only by bubbles. Those linear paths are
called lines and are used to create contigs (figure 6). Bubbles
in those lines are simplified so that each bubble contains all of
its paths (figure 6). Contigs are ordered and connected
with the help of paired-end fragments and linked fragments. The
final step is line phasing, i.e. determining the orientation of
each bubble that has two branches. The orientation means
placing one path on top and one on bottom. Sets of linked
fragments that cover bubbles vote for their orientations. These
sets can be represented as strings over the alphabet {+1, 0, −1},
where +1 stands for top, 0 for silent and −1 for bottom. A
phasing is good if each set is coherent, i.e. nearly all its
non-silent values are +1 or nearly all are −1. A heuristic
algorithm is used with a random
orientation for each bubble at the beginning. In order to
achieve coherence of all linked fragment sets, different
perturbations which flip bubbles are conducted. At the end,
lines with phased bubbles are transformed into megabubbles
which contain partial haplotypes.
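The notion of coherence can be illustrated with a small scoring sketch. It assumes one vote string per barcode and measures coherence as the fraction of non-silent votes that agree with the majority sign; this particular measure and all function names are assumptions made for the example and are not taken from Supernova.

```python
def coherence(votes):
    """Fraction of non-silent votes that share the majority sign (1.0 if all are silent)."""
    plus = sum(1 for v in votes if v > 0)
    minus = sum(1 for v in votes if v < 0)
    return max(plus, minus) / (plus + minus) if plus + minus else 1.0


def apply_orientation(votes, flipped):
    """Flipping bubble j swaps top and bottom, which negates every vote cast on it."""
    return [-v if f else v for v, f in zip(votes, flipped)]


def phasing_score(barcode_votes, flipped):
    """Sum of coherences over all barcodes for a given choice of flipped bubbles."""
    return sum(coherence(apply_orientation(v, flipped)) for v in barcode_votes)


# Three barcodes voting on three bubbles of one line.
barcode_votes = [[+1, -1, +1], [-1, +1, 0], [0, -1, +1]]
print(phasing_score(barcode_votes, [False, False, False]))
print(phasing_score(barcode_votes, [False, True, False]))  # 3.0
```

Flipping the second bubble makes every barcode fully coherent, so a perturbation-based heuristic that maximizes such a score would accept that flip.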
IV. CONCLUSION
Obtaining haplotypes via whole genome sequencing has
proven to be a complex problem. There have been many
formulations and even more solutions to obtain haplotypes of
diploid organisms. Diploid and polyploid aware genome
assemblers have great potential to facilitate advancements in
the fields of medicine and microbiology. With the dawn of
long read sequencing, where fragments will span more SNPs,
we expect that the research in this field will continue at an
even faster pace.
ACKNOWLEDGMENT
We would like to give special thanks to Krešimir
Križanović and Ivan Sović for proofreading this paper and
providing valuable suggestions.
Figure 6 Figure was redrawn from [41] and shows a sample de
Bruijn graph. The path colored in blue is called a line and is used to
construct contigs in the Supernova assembler. Complex bubbles in
lines are simplified so that they contain all of their paths.
REFERENCES
[1] R. N. Crowhurst, D. Whittaker, and R. C. Gardner, “The genetic origin of kiwifruit,” Acta Hortic., no. 297, pp. 61–62, Apr. 1992.
[2] V. Bansal, A. L. Halpern, N. Axelrod, and V. Bafna, “An MCMC algorithm for haplotype assembly from whole-genome sequence data,” Genome Res., vol. 18, no. 8, pp. 1336–1346, 2008.
[3] A. Chakravarti, “It's raining SNPs, hallelujah?,” Nat. Genet., vol. 19, no. 3, pp. 216–7, Jul. 1998.
[4] The International HapMap Consortium, “The International HapMap Project,” Nature, vol. 426, no. 6968, pp. 789–796, Dec. 2003.
[5] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminating inhibitors,” Proc. Natl. Acad. Sci. U. S. A., vol. 74, no. 12, pp. 5463–7, 1977.
[6] E. Pettersson, J. Lundeberg, and A. Ahmadian, “Generations of sequencing technologies,” Genomics, vol. 93, no. 2, pp. 105–111, 2009.
[7] A. Rhoads and K. F. Au, “PacBio Sequencing and Its Applications,” Genomics, Proteomics Bioinforma., vol. 13, no. 5, pp. 278–289, 2015.
[8] Y. Feng, Y. Zhang, C. Ying, D. Wang, and C. Du, “Nanopore-based fourth-generation DNA sequencing technology,” Genomics, Proteomics Bioinforma., vol. 13, no. 1, pp. 4–16, 2015.
[9] N. Nagarajan and M. Pop, “Sequence assembly demystified,” Nat. Rev. Genet., vol. 14, no. 3, pp. 157–67, 2013.
[10] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach to DNA fragment assembly,” Proc. Natl. Acad. Sci. U. S. A., vol. 98, no. 17, pp. 9748–53, 2001.
[11] E. W. Myers, “Toward Simplifying and Accurately Formulating Fragment Assembly,” J. Comput. Biol., vol. 2, no. 2, pp. 275–290, 1995.
[12] E. W. Myers, “A Whole-Genome Assembly of Drosophila,” Science, vol. 287, no. 5461, pp. 2196–2204, 2000.
[13] A. T. Woolley, C. Guillemette, C. L. Cheung, D. E. Housman, and C. M. Lieber, “Direct haplotyping of kilobase-size DNA using carbon nanotube probes,” Nat. Biotechnol., vol. 18, no. 7, pp. 760–3, 2000.
[14] H. C. Fan, J. Wang, A. Potanina, and S. R. Quake, “Whole-genome molecular haplotyping of single cells,” Nat. Biotechnol., vol. 29, no. 1, pp. 51–7, 2011.
[15] E. F. Kirkness, R. V. Grindberg, J. Yee-Greenbaum, C. R. Marshall, S. W. Scherer, R. S. Lasken, and J. C. Venter, “Sequencing of isolated sperm cells for direct haplotyping of a human genome,” Genome Res., vol. 23, no. 5, pp. 826–832, 2013.
[16] C. Lo, “Algorithms for Haplotype Phasing.”
[17] A. G. Clark, “Inference of haplotypes from PCR-amplified samples of diploid populations,” Mol. Biol. Evol., vol. 7, no. 2, pp. 111–122, 1990.
[18] M. Stephens, N. J. Smith, and P. Donnelly, “A new statistical method for haplotype reconstruction from population data,” Am. J. Hum. Genet., vol. 68, no. 4, pp. 978–989, 2001.
[19] G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz, “SNPs problems, complexity, and algorithms,” Algorithms - ESA 2001, vol. 2161, pp. 182–193, 2001.
[20] R. Lippert, R. Schwartz, G. Lancia, and S. Istrail, “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform., vol. 3, no. 1, pp. 23–31, 2002.
[21] R. Cilibrasi, L. Van Iersel, S. Kelk, and J. Tromp, “The complexity of the single individual SNP haplotyping problem,” Algorithmica, vol. 49, no. 1, pp. 13–36, 2007.
[22] R. Rizzi, V. Bafna, S. Istrail, and G. Lancia, “Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem,” Proc. 2nd Int. Work. Algorithms Bioinforma., vol. 2452, pp. 29–43, 2002.
[23] D. He, A. Choi, and K. Pipatsrisawat, “Optimal algorithms for haplotype assembly from whole-genome sequence data,” Bioinformatics, vol. 26, no. 12, pp. i183–90, 2010.
[24] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee, G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, and S. Scherer, “The diploid genome sequence of Candida albicans,” Proc. Natl. Acad. Sci. U. S. A., vol. 101, no. 19, pp. 7329–7334, 2004.
[25] M. de la Bastide and W. R. McCombie, “Assembling genomic DNA sequences with PHRAP,” Curr. Protoc. Bioinformatics, vol. Chapter 11, p. Unit11.4, Mar. 2007.
[26] J. P. Vinson, D. B. Jaffe, K. O'Neill, E. K. Karlsson, N. Stange-Thomann, S. Anderson, J. P. Mesirov, N. Satoh, Y. Satou, C. Nusbaum, B. Birren, J. E. Galagan, and E. S. Lander, “Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi,” Genome Res., vol. 15, no. 8, pp. 1127–1135, 2005.
[27] S. Batzoglou, “ARACHNE: A Whole-Genome Shotgun Assembler,” Genome Res., vol. 12, no. 1, pp. 177–189, 2002.
[28] R. S. Wang, L. Y. Wu, Z. P. Li, and X. S. Zhang, “Haplotype reconstruction from SNP fragments by minimum error correction,” Bioinformatics, vol. 21, no. 10, pp. 2456–2462, 2005.
[29] W. Qian, Y. Yang, N. Yang, and C. Li, “Particle swarm optimization for SNP haplotype reconstruction problem,” vol. 196, pp. 266–272, 2008.
[30] J. Wu, J. Wang, and J. Chen, “A genetic algorithm for single individual SNP haplotype assembly,” Proc. 9th Int. Conf. Young Comput. Sci. ICYCS 2008, pp. 1012–1017, 2008.
[31] S. Levy, G. Sutton, P. C. Ng, L. Feuk, A. L. Halpern, B. P. Walenz, N. Axelrod, J. Huang, E. F. Kirkness, G. Denisov, Y. Lin, J. R. MacDonald, A. W. C. Pang, M. Shago, T. B. Stockwell, A. Tsiamouri, V. Bafna, V. Bansal, S. A. Kravitz, D. A. Busam, K. Y. Beeson, T. C. McIntosh, K. A. Remington, J. F. Abril, J. Gill, J. Borman, Y. H. Rogers, M. E. Frazier, S. W. Scherer, R. L. Strausberg, and J. C. Venter, “The diploid genome sequence of an individual human,” PLoS Biol., vol. 5, no. 10, pp. 2113–2144, 2007.
[32] F. Geraci, “A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem,” Bioinformatics, vol. 26, no. 18, pp. 2217–2225, 2010.
[33] V. Bansal and V. Bafna, “HapCUT: An efficient and accurate algorithm for the haplotype assembly problem,” Bioinformatics, vol. 24, no. 16, pp. 153–159, 2008.
[34] D. Aguiar and S. Istrail, “HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data,” J. Comput. Biol., vol. 19, no. 6, pp. 577–590, 2012.
[35] D. Aguiar and S. Istrail, “Haplotype assembly in polyploid genomes and identical by descent shared tracts,” Bioinformatics, vol. 29, no. 13, pp. 352–360, 2013.
[36] E. Berger, D. Yorukoglu, J. Peng, and B. Berger, “HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data,” Lect. Notes Comput. Sci., vol. 8394 LNBI, no. 3, pp. 18–19, 2014.
[37] M. Patterson, T. Marschall, N. Pisanti, and L. van Iersel, “WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads,” J. Comput. Biol., vol. 22, no. 0, pp. 1–12, 2014.
[38] Y. Pirola, S. Zaccaria, R. Dondi, G. W. Klau, N. Pisanti, and P. Bonizzoni, “HapCol: Accurate and Memory-efficient Haplotype Assembly from Long Reads,” Bioinformatics, no. August, p. btv495, 2015.
[39] P. Bonizzoni, R. Dondi, G. W. Klau, Y. Pirola, N. Pisanti, and S. Zaccaria, “On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes,” J. Comput. Biol., vol. 23, no. 0, pp. 1–19, 2016.
[40] C. Chin, P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion, C. Dunn, R. O'Malley, R. Figueroa-Balderas, A. Morales-Cruz, R. Grant, M. Delledonne, C. Luo, J. R. Ecker, D. Cantu, and D. R. Rank, “Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing,” 2016.
[41] N. I. Weisenfeld, V. Kumar, P. Shah, D. M. Church, and D. B. Jaffe, “Direct determination of diploid genome sequences,” pp. 1–21, 2016.