Approaches to haplotype assembly

Robert Vaser
Faculty of Electrical Engineering and Computing, University of Zagreb
Laboratory for Bioinformatics and Computational Biology
Unska 3, 10000 Zagreb
Email: [email protected]

been developed, from the first generation based on Sanger’s sequencing method [5], through next generation sequencing (NGS) methods such as Roche’s 454 Life Sciences and Illumina Solexa [6], to the most recent methods such as Pacific Biosciences [7] and Oxford Nanopore Technologies [8]. No matter which sequencing method is used, only small portions (fragments or reads) of the DNA can be read at a time. In order to reconstruct the sequenced DNA, the read fragments need to be stitched back together. Large amounts of sequencing data, sequencing errors and genomic repeats make this a non-trivial task. Tools tailored for this problem are called assemblers and are based on the assumption that highly similar fragments originate from the same area in the DNA [9]. There are three main groups of assembly methods: greedy, de Bruijn graph based and overlap-layout-consensus (OLC) based approaches [9]. Most greedy algorithms join fragments by the best alignment criterion. De Bruijn graphs and the OLC paradigm are more sophisticated. De Bruijn graphs are directed graphs in which vertices are $k$-mers obtained from sequenced fragments and edges represent overlaps of length $k - 1$ between two vertices. Sequence reconstruction is done by finding an Eulerian path in such a graph. They are optimized for short fragments with minimal error rates and were popularized by the assembler Euler [10]. The OLC paradigm was introduced with the algorithms described in [11] and the Celera assembler [12], which is still extensively used today, with modifications for new sequencing technologies.
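The de Bruijn construction described above can be illustrated with a minimal sketch. It assumes error-free reads and a toy genome without repeated $k$-mers; the function names (`de_bruijn_graph`, `eulerian_path`, `spell`) are hypothetical and are not taken from Euler [10] or any other assembler.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Vertices are k-mers; an edge u -> v joins consecutive k-mers of a read,
    i.e. two k-mers that overlap by k - 1 characters."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes that an Eulerian path exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: sorted(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate a k-mer path into the sequence it spells."""
    return path[0] + "".join(v[-1] for v in path[1:]) if path else ""


reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]  # toy reads, assumed error-free
print(spell(eulerian_path(de_bruijn_graph(reads, 3))))
```

Real assemblers additionally have to handle sequencing errors, repeats and both strands, which this sketch deliberately ignores.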
The OLC paradigm consists of three phases: the overlap phase, where fragments are aligned to each other; the layout phase, where an overlap graph is constructed and linearized; and the consensus phase, where all the initial fragments are aligned to the layout sequences in order to resolve ambiguities in the data. It is best suited for longer fragments with no particular restrictions on the error rate.

As the assembly process is hard on its own and sequencing technologies are unable to sequence haplotypes separately, most modern assemblers create genomic sequences with fused haplotypes, thus losing a portion of the information. To obtain the complete information within a genome, all haplotypes of a chromosome need to be reconstructed. Computational methods which try to solve this problem can be divided into two groups: haplotype phasing and haplotype assembly. There also exist experimental methods [13]–[15], which tend to be costly, time consuming and impractical for some applications.

Abstract: DNA sequencing methods enabled new insight into the human genome as well as the genomes of other organisms. Humans are diploid, which means that each chromosome has a homologous pair. Each chromosome of a pair is called a haplotype, and the main variation between them is manifested in the form of single nucleotide polymorphisms (SNPs). Most reference genomes obtained by assembling sequenced fragments contain fused haplotypes of a chromosome set. Various computational and experimental methods have been developed to reconstruct the haplotypes of an organism. This paper describes the haplotype assembly problem, which is based on data gathered from whole genome sequencing, with an overview of computational algorithms from the first problem formulation to date.

Keywords: haplotype assembly, genome assembly, single nucleotide polymorphisms, diploid, polyploid

I.
INTRODUCTION

An organism’s complete set of chromosomes, which are packed and organized structures containing DNA molecules, is called a genome. Depending on the ploidy of the species, chromosomes can exist as singles or in pairs, triplets, quadruplets or even larger sets. The number of chromosomes in a set determines the notation for an organism: haploid, diploid (e.g. humans) or, if there are more than two chromosomes per set, polyploid (e.g. the cultivated kiwifruit is hexaploid [1]). DNA contained in a single chromosome is called a haplotype. Haplotypes of the same set are homologous to each other, i.e. they contain the same genes at the same positions (loci). However, these genes can differ from each other, which results in different phenotypes. Those variant forms of genes are called alleles. Differences occur as single nucleotide polymorphisms (SNP, “snip”), insertions, deletions and structural variants [2]. The most frequent form of genetic variation is the SNP [3], which has been extensively researched. The International HapMap Project had the goal of developing a reference map of common patterns of DNA variation (the polymorphism rate in humans is around 0.1%) by taking DNA samples from three different human populations [4].

Knowing the precise order of nucleotides within a genome and the patterns of genetic variation opens new avenues of research in fields such as virology, medical diagnosis and biotechnology. In order to determine the order of nucleotides in a DNA molecule, a process called DNA sequencing is conducted. In the past decade, many different sequencing platforms have

The first group of methods, called haplotype phasing, is based on obtaining genotypes from a population of unrelated individuals with SNP arrays [16]. A genotype is a list of nucleotide pairs (SNPs in diploid organisms have two different values) for each variant site in the genome.
For an individual’s genotype with $N$ variant sites, there exist $2^{N-1}$ possible haplotype pairs. With the genotypes of a population, assumptions about haplotype evolution can be used to infer the haplotypes of each individual in the population [16]. Various methods have been developed to solve haplotype phasing, from the first heuristic method described by Clark [17] to more reliable methods such as expectation maximization algorithms and statistical methods (e.g. PHASE [18]).

The second group of methods, haplotype assembly, is based on whole genome sequencing (WGS) and can be further divided into two subgroups. One subgroup finds variant sites by overlapping sequenced fragments (or mapping them to a reference or assembly), locating positions of disagreement and solving optimization problems given all variant sites [19]. The other subgroup incorporates haplotype reconstruction within the genome assembly using heuristic algorithms. In the next section, the haplotype assembly problem is presented in more detail with an overview of computational algorithms specially designed for it.

nucleotides at that position. With the natural order of SNPs in a genome, an $m \times n$ SNP matrix $M$ can be defined over the values $\{0, 1, -\}$. An example of this matrix is shown in figure 2a. The value “$-$” in matrix cell $(i, j)$ denotes that SNP $S_j$ is not covered by fragment $F_i$. Furthermore, let $G_F(M) = (F, E_F)$ be a fragment conflict graph (figure 2b). Vertices of $G_F(M)$ are fragments and each edge indicates that the connected fragments are in conflict, i.e. they disagree on at least one of the SNPs they both cover. If matrix $M$ is feasible (error-free), $G_F(M)$ is a bipartite graph where each of the two shores defines a haplotype, i.e. fragments can be divided into two disjoint subsets in which no two fragments are in conflict. Another graph formulation is the SNP conflict graph $G_S(M) = (S, E_S)$ (figure 2c). In this graph, vertices represent SNPs and edges denote conflicts between SNPs.
For two SNPs $a, b$ to be in conflict, there have to exist two fragments $c, d$ which both cover the two SNPs such that the submatrix defined by $a, b, c, d$ has three values of one type (0 or 1) and one of the other (1 or 0, respectively). Otherwise the two SNPs are either in the same haplotype or in different haplotypes. If the matrix $M$ is gapless, i.e. all fragments cover a set of consecutive SNPs, then $G_F(M)$ is bipartite, and thus $M$ is feasible, if and only if $G_S(M)$ is an independent set (no two vertices are adjacent).

With the above definitions, the fundamental problem of haplotype assembly is to determine the optimal set of changes to an SNP matrix $M$ so that it becomes feasible. The authors proposed two main formulations of this problem, for which they proved complexity results: minimum fragment removal (MFR) and minimum SNP removal (MSR). Given an SNP matrix $M$, the task is to find the smallest set of fragments (MFR) or the smallest set of SNPs (MSR) whose removal makes the resulting matrix feasible. For the gapless case, the MFR and MSR problems can be solved in polynomial time. It is shown that MFR can be reduced to a maximum cost flow problem and MSR to finding independent sets in perfect graphs. In the more realistic case when fragments have gaps (mostly due to paired-end fragments, which have sequence gaps of several kilobases), both MFR and MSR are proven to be NP-hard.

A year after the initial problem formulation, the same group of researchers refined the definition of haplotype assembly and proposed additional optimization problems, of which minimum error correction (MEC) [20] attracted the most research interest. MEC is defined for a given SNP matrix $M$ as finding the smallest set of matrix cell changes, 0 to 1 and vice versa, which make the matrix feasible. It is proven to be NP-hard even for the gapless case [21].
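The definitions above can be made concrete with a small sketch. It assumes the SNP matrix is given as a list of equal-length strings over `0`, `1` and `-` (one row per fragment); the function names are hypothetical. Feasibility is checked by 2-colouring the fragment conflict graph, and the MEC score is found by brute force over all fragment bipartitions, which is exponential in the number of fragments and only meant as an illustration, not as one of the algorithms from the literature.

```python
from itertools import product


def in_conflict(f, g):
    """Two fragments conflict if they disagree on a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_feasible(matrix):
    """Feasible iff the fragment conflict graph is bipartite (2-colourable)."""
    colour = [None] * len(matrix)
    for start in range(len(matrix)):
        if colour[start] is not None:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(matrix)):
                if v == u or not in_conflict(matrix[u], matrix[v]):
                    continue
                if colour[v] is None:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    return False  # odd cycle in the conflict graph
    return True


def mec_score(matrix):
    """Minimum number of cell flips that make the matrix feasible (brute force)."""
    m = len(matrix)
    if m == 0:
        return 0
    n = len(matrix[0])
    best = None
    # Fix the first fragment on side 0 to skip mirrored bipartitions.
    for tail in product((0, 1), repeat=m - 1):
        side = (0,) + tail
        cost = 0
        for j in range(n):
            for s in (0, 1):
                column = [matrix[i][j] for i in range(m) if side[i] == s]
                # Within one side, flip the minority value of each column.
                cost += min(column.count("0"), column.count("1"))
        if best is None or cost < best:
            best = cost
    return best


clean = ["01-", "-10", "10-", "-01"]  # consistent with haplotypes 010 and 101
noisy = clean + ["011"]               # last fragment carries one error
print(is_feasible(clean), mec_score(clean))
print(is_feasible(noisy), mec_score(noisy))
```

In the second matrix the added fragment forms a triangle in the conflict graph, so the graph is no longer bipartite, and a single flip in that fragment restores feasibility.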
The list of haplotype assembly algorithms developed from the first problem definition until today is long. Most of them were designed for diploid organisms, although some attention has been given to polyploid organisms as well. Some of the algorithms solve the problem formulations described above or their variations. Others try to reconstruct partial haplotypes during the genome assembly and finalize them by using heuristics. In the next section, we present some of the haplotype assembly algorithms which we find noteworthy.

II. HAPLOTYPE ASSEMBLY

Most assemblies of diploid organisms which were constructed from WGS fragments contain merged haplotypes of a chromosome set. This can be easily observed when all of the sequenced fragments are mapped back to the assembly. Figure 1 shows such an example. SNPs can be identified at positions where nucleotide bases differ in the mapped fragments, keeping in mind possible sequencing errors. The haplotype assembly problem is to find the maximally consistent pair of haplotypes for a given set of variant sites (SNPs) inferred from the fragment assembly. A formal definition of this problem was introduced in 2001 by Lancia et al. [19] for diploid organisms. Their notation for the problem is as follows. There is a set of $n$ SNPs $S = \{S_1, S_2, \ldots, S_n\}$ and a set of $m$ fragments $F = \{F_1, F_2, \ldots, F_m\}$. Each SNP is covered by some of the fragments and has two possible values, 0 or 1, regardless of the actual

Figure 1. Locating variant sites is done by aligning fragments to a reference sequence which was assembled from whole genome sequencing data.

Figure 2. Examples were redrawn from [19] and show an SNP matrix $M$ (a) with its corresponding graphs, the fragment conflict graph $G_F(M)$ (b) and the SNP conflict graph $G_S(M)$ (c).

$o(D, B)$ properly contain overlap $o(A, B)$ and overlaps $o(C, B)$, $o(D, A)$ do not exist (figure 4).
The second condition requires that all fragments which overlap $A$ or $B$ match either the haplotype of $A$ or the haplotype of $B$, or both if they are contained in a region with no variant sites.

Wang et al. [28] presented a branch-and-bound approach similar to that of Lippert et al. [20], but for the MEC problem. In addition, they used a genetic algorithm (GA) to search for the optimal solution. Their candidate solution represents the classification of each of the $m$ fragments. When the best solution is found, haplotypes are inferred from the fragment subsets so that at each position the most frequent value is chosen. Qian et al. [29] proposed a particle swarm optimization (PSO) approach, similar to the GA by Wang et al., which yields better performance. Wu et al. [30] argue that earlier implementations of GA and PSO fail to perform well when dealing with millions of fragments, due to the high number of features. They introduced a different and practically more suitable definition of a candidate solution in their GA. Their candidate solution represents a haplotype, i.e. the number of features is bounded by the number of SNPs, while its pair haplotype can be obtained via bit-wise complement.

III. ALGORITHMS FOR HAPLOTYPE ASSEMBLY

In 2002, Lippert et al. [20] developed the first practical method for the MFR problem, based on the branch-and-bound algorithm, which they claim is efficient on real and simulated data. In the same year, Rizzi et al. [22] defined practical algorithms for the MFR and MSR problems based on dynamic programming. The algorithms for the gapless case have time complexities $O(m^2 n + m^3)$ and $O(m n^2)$ for MFR and MSR, respectively. When fragments contain gaps, the time complexities change to $O(2^{2k} m^2 n + 2^{3k} m^3)$ and $O(m n^{2k+2})$, where $k$ is the maximal number of gaps in any fragment. Several years later, He et al.
[23] presented a novel dynamic programming algorithm with a time complexity of $O(2^k mn)$, where $k$ is the length of the longest fragment.

Jones et al. [24] used heuristics to assemble the diploid genome of Candida albicans (fungus). First, the PHRAP assembler [25] was run to obtain contiguous stretches of DNA with no gaps (contigs). Afterwards, alignments between all contigs were computed in order to find heterozygous areas (areas with a significant number of variant sites). The assembler results fall into two cases: one where contigs were broken at heterozygous areas (figure 3a) and the other in which a big contig contains a smaller one (figure 3b). The haplotypes are reconstructed by copying the non-overlapping (nearly homozygous) areas to the contigs missing them (figure 3c). Additionally, variant sites were found in nearly homozygous areas by looking for patterns of high base quality disagreements between individual fragments.

In the assembly of diploid Ciona savignyi (sea squirt), which has a very high polymorphism rate of 4.6%, Vinson et al. [26] used heuristics to split fragments into two sets and assemble the haplotypes separately. They modified the assembler Arachne [27] by adding a new step after obtaining overlaps between fragments. In this step, they remove overlaps between fragments which originate from different haplotypes. Two conditions have to be satisfied for an overlap between fragments $A$ and $B$, denoted $o(A, B)$, to be discarded. The first demands that there are two additional fragments $C$, $D$ that locally establish two distinct haplotypes, overlaps $o(C, A)$ and

Figure 3. Figure was redrawn from [24] and displays assembled contigs which consist of nearly homozygous areas (in blue) and heterozygous areas (in orange). Those contigs are either broken at a heterozygous area (a) or a big contig contains a smaller one (b). In order to reconstruct haplotypes, homozygous areas are copied to the contigs that are missing them (c).
of fragments supporting the second. The proposed algorithm, called minimum weighted edge removal (MWER), is tailored for the compass graph and is based on finding spanning trees in graphs. The authors extended their tool a year later to reconstruct the haplotypes of polyploid organisms [35]. Berger et al. [36] implemented a maximum-likelihood estimation algorithm for diploid and polyploid haplotype assembly in their tool HapTree.

WhatsHap was developed by Patterson et al. [37] as a response to sequencing technologies which produce longer fragments with higher error rates. It is tailored for solving the weighted MEC problem (wMEC) and is based on dynamic programming. The wMEC problem differs from the original MEC problem by having an additional function which assigns weights to the elements of the SNP matrix $M$. The weights are based on the sequencing quality values of each base. The goal of wMEC is to make $G_F(M)$ conflict free by changing values in the matrix $M$ with a minimum total weight.

Another tool tailored for future-generation technologies is HapCol [38]. It is an exact algorithm, based on dynamic programming, with a time complexity that is exponential in the number of corrections for each SNP. HapCol was designed to solve the haplotype assembly problem for sequencing techniques that produce long fragments with a uniform distribution of sequencing errors. It addresses a new variant of the MEC problem, called $k$-cMEC, in which each column can have at most $k$ errors.

Bonizzoni et al. [39] gave the first $k$-ploid MEC formulation and proved its complexity. The $k$-MEC problem they defined is suited for situations where the number of haplotypes $k$ is known a priori. The SNP matrix $M$ is $k$-conflict free iff there exists a $k$-partition of the fragments such that in each subset no fragments are in conflict. The $k$-MEC problem is solved by finding the minimum number of corrections that make the matrix $M$ $k$-conflict free.

Most recently, diploid aware assemblers started to emerge.
One of them is Falcon with its haplotype resolving module Falcon-Unzip [40]. It is an OLC based assembler which corrects long erroneous fragments before assembly. Falcon produces haplotype-fused primary contigs and associate contigs, which are alternative paths from bubble-like structures not included in the primary contigs (figure 5). Bubbles in overlap

Figure 4. Figure was redrawn from [26] and shows an example of the splitting rule. Blue fragments define one haplotype and orange fragments the other. Overlap $o(A, B)$ is removed as it connects two fragments from different haplotypes.

Levy et al. [31] developed a heuristic algorithm which was used in assembling the first diploid human genome. The first part of the algorithm is greedy and is followed by iterative refinement steps. The fragment with the fewest missing elements is chosen to seed a partial haplotype pair (the other haplotype is obtained via bit-wise complement). While there exist fragments that share non-missing information with a haplotype, the fragment with the strongest signal is assigned to the corresponding haplotype. The signal is defined as the number of SNPs indicating one haplotype minus the number of SNPs indicating the other. When all fragments are assigned to partial haplotypes, the refinement step assigns the majority value to each SNP in the haplotype. Afterwards, each fragment is assigned to the appropriate haplotype, again via majority rule. Geraci [32] benchmarked seven state-of-the-art algorithms and showed that this heuristic had the best performance.

Bansal and Bafna [33] provided a novel formulation of the MEC problem based on graph cuts in their tool HapCut. They build a weighted SNP graph, where an edge between two SNPs indicates that there exists a fragment covering both SNPs. Edge weights equal the number of fragments disagreeing with the current haplotype phase minus the number of fragments supporting it.
They show that finding a max-cut in such graphs results in an optimal MEC solution. The proposed algorithm is heuristic and consists of two steps. First, a random initial haplotype configuration is declared. Afterwards, graph cuts which have positive sums of edge weights are found. If a cut improves the MEC score, the haplotypes are updated. This procedure is run until no more improvements are possible.

The tool HapCompass and the corresponding compass graphs were developed by Aguiar and Istrail [34]. Compass graphs are a type of weighted SNP graph where edges denote that two SNPs are covered by at least one fragment. For a pair of SNPs and a pair of fragments, there exist two possible haplotype configurations, $\begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$ (fragments from the same haplotype) and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ (fragments from different haplotypes). With that defined, edge weights equal the number of fragments supporting the first configuration minus the number

Figure 5. Figure was redrawn from [40] and shows an example of the initial assembly produced by Falcon. This assembly is split into primary and associate contigs to save information about structural variants.

graph represent structural variants and are suited for the haplotype assembly problem. The Falcon-Unzip module finds variant sites by aligning all raw fragments to the primary and associate contigs. At each position, it checks whether there exists a base with occurrence less than 75% and another base with occurrence higher than 25%. A greedy algorithm is then used to phase SNPs into blocks. The initial phasing is random for all SNPs. For any two SNPs that share a fragment, a coupling score is calculated. The score equals the number of fragments supporting the particular phase. By flipping SNPs at each position, the phase with the best score is kept. If there are not enough fragments connecting two SNPs, the haplotype block is broken between them. The process is run iteratively until no more optimizations are possible or a pre-defined limit is reached.
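The variant-site rule described above can be sketched as follows. This is one possible reading of the 75%/25% criterion (the most frequent base stays below 75% while a second base exceeds 25%), not the actual Falcon-Unzip code; the function name and parameters are hypothetical.

```python
from collections import Counter


def is_variant_site(bases, major_max=0.75, minor_min=0.25):
    """Decide whether a pileup column looks heterozygous.

    `bases` holds the bases that the aligned fragments show at one position.
    Assumed rule: the most frequent base occurs in fewer than `major_max`
    of the fragments and the second most frequent in more than `minor_min`.
    """
    counts = Counter(b for b in bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    (_, first), (_, second) = counts.most_common(2)
    return first / total < major_max and second / total > minor_min


print(is_variant_site("AAAATTTT"))  # balanced support for two bases
print(is_variant_site("AAAAAAAT"))  # a single deviating fragment
```

A column with one deviating fragment among eight is treated as a likely sequencing error, whereas balanced support for two bases marks a candidate SNP.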
After the phasing, each fragment is tagged with a list of pairs $(\mathit{block}_{id}, \mathit{phase}_{id})$. Overlaps between fragments of the same block but with different phases are removed. The end results are correctly phased primary contigs and their associate contigs, called haplotigs.

Another newly developed diploid aware assembler is Supernova [41], which is based on de Bruijn graphs. It relies on the 10x Genomics technology, which produces short linked fragments. Linked fragments come from the same genome region and are tagged with the same barcode. The assembler builds a standard de Bruijn graph and looks for linear paths which are punctuated only by bubbles. Those linear paths are called lines and are used to create contigs (figure 6). Bubbles in those lines are simplified so that each bubble contains all of its paths (figure 6). Contigs are ordered and connected with the help of paired-end fragments and linked fragments. The final step is line phasing, i.e. determining the orientation of each bubble that has two branches. The orientation means placing one path on top and one on the bottom. Sets of linked fragments that cover bubbles vote for their orientations. These sets can be represented as strings over the alphabet $\{+1, 0, -1\}$, where $+1$ stands for top, $0$ for silent and $-1$ for bottom. A phasing is good if each set is coherent, i.e. nearly all of its values are $+1$ or nearly all are $-1$. A heuristic algorithm is used, starting with a random orientation for each bubble. In order to achieve coherence of all sets of linked fragments, different perturbations which flip bubbles are conducted. At the end, lines with phased bubbles are transformed into megabubbles which contain partial haplotypes.

IV. CONCLUSION

Obtaining haplotypes via whole genome sequencing has proven to be a complex problem. There have been many formulations and even more solutions to obtain the haplotypes of diploid organisms.
Diploid and polyploid aware genome assemblers have great potential to facilitate advancements in the fields of medicine and microbiology. With the dawn of long read sequencing, where fragments will span more SNPs, we expect that the research in this field will continue at an even faster pace.

Figure 6. Figure was redrawn from [41] and shows a sample de Bruijn graph. The path colored in blue is called a line and is used to construct contigs in the Supernova assembler. Complex bubbles in lines are simplified so that they contain all of their paths.

ACKNOWLEDGMENT

We would like to give special thanks to Krešimir Križanović and Ivan Sović for proofreading this paper and providing valuable suggestions.

REFERENCES

[1] R. N. Crowhurst, D. Whittaker, and R. C. Gardner, “The genetic origin of kiwifruit,” Acta Hortic., no. 297, pp. 61–62, Apr. 1992.
[2] V. Bansal, A. L. Halpern, N. Axelrod, and V. Bafna, “An MCMC algorithm for haplotype assembly from whole-genome sequence data,” Genome Res., vol. 18, no. 8, pp. 1336–1346, 2008.
[3] A. Chakravarti, “It’s raining SNPs, hallelujah?,” Nat. Genet., vol. 19, no. 3, pp. 216–7, Jul. 1998.
[4] The International HapMap Consortium, “The International HapMap Project,” Nature, vol. 426, no. 6968, pp. 789–796, Dec. 2003.
[5] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminating inhibitors,” Proc. Natl. Acad. Sci. U. S. A., vol. 74, no. 12, pp. 5463–7, 1977.
[6] E. Pettersson, J. Lundeberg, and A. Ahmadian, “Generations of sequencing technologies,” Genomics, vol. 93, no. 2, pp. 105–111, 2009.
[7] A. Rhoads and K. F. Au, “PacBio Sequencing and Its Applications,” Genomics, Proteomics Bioinforma., vol. 13, no. 5, pp. 278–289, 2015.
[8] Y. Feng, Y. Zhang, C. Ying, D. Wang, and C. Du, “Nanopore-based fourth-generation DNA sequencing technology,” Genomics, Proteomics Bioinforma., vol. 13, no. 1, pp. 4–16, 2015.
[9] N. Nagarajan and M. Pop, “Sequence assembly demystified,” Nat. Rev.
Genet., vol. 14, no. 3, pp. 157–67, 2013.
[10] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach to DNA fragment assembly,” Proc. Natl. Acad. Sci. U. S. A., vol. 98, no. 17, pp. 9748–53, 2001.
[11] E. W. Myers, “Toward Simplifying and Accurately Formulating Fragment Assembly,” J. Comput. Biol., vol. 2, no. 2, pp. 275–290, 1995.
[12] E. W. Myers, “A Whole-Genome Assembly of Drosophila,” Science, vol. 287, no. 5461, pp. 2196–2204, 2000.
[13] A. T. Woolley, C. Guillemette, C. L. Cheung, D. E. Housman, and C. M. Lieber, “Direct haplotyping of kilobase-size DNA using carbon nanotube probes,” Nat. Biotechnol., vol. 18, no. 7, pp. 760–3, 2000.
[14] H. C. Fan, J. Wang, A. Potanina, and S. R. Quake, “Whole-genome molecular haplotyping of single cells,” Nat. Biotechnol., vol. 29, no. 1, pp. 51–7, 2011.
[15] E. F. Kirkness, R. V. Grindberg, J. Yee-Greenbaum, C. R. Marshall, S. W. Scherer, R. S. Lasken, and J. C. Venter, “Sequencing of isolated sperm cells for direct haplotyping of a human genome,” Genome Res., vol. 23, no. 5, pp. 826–832, 2013.
[16] C. Lo, “Algorithms for Haplotype Phasing.”
[17] A. G. Clark, “Inference of haplotypes from PCR-amplified samples of diploid populations,” Mol. Biol. Evol., vol. 7, no. 2, pp. 111–122, 1990.
[18] M. Stephens, N. J. Smith, and P. Donnelly, “A new statistical method for haplotype reconstruction from population data,” Am. J. Hum. Genet., vol. 68, no. 4, pp. 978–989, 2001.
[19] G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz, “SNPs problems, complexity, and algorithms,” Algorithms – ESA 2001, vol. 2161, pp. 182–193, 2001.
[20] R. Lippert, R. Schwartz, G. Lancia, and S. Istrail, “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform., vol. 3, no. 1, pp. 23–31, 2002.
[21] R. Cilibrasi, L. Van Iersel, S. Kelk, and J. Tromp, “The complexity of the single individual SNP haplotyping problem,” Algorithmica, vol.
49, no. 1, pp. 13–36, 2007.
[22] R. Rizzi, V. Bafna, S. Istrail, and G. Lancia, “Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem,” Proceedings 2nd Int. Work. Algorithms Bioinforma., vol. 2452, pp. 29–43, 2002.
[23] D. He, A. Choi, and K. Pipatsrisawat, “Optimal algorithms for haplotype assembly from whole-genome sequence data,” Bioinformatics, vol. 26, no. 12, pp. i183–90, 2010.
[24] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee, G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, and S. Scherer, “The diploid genome sequence of Candida albicans,” Proc. Natl. Acad. Sci. U. S. A., vol. 101, no. 19, pp. 7329–7334, 2004.
[25] M. de la Bastide and W. R. McCombie, “Assembling genomic DNA sequences with PHRAP,” Curr. Protoc. Bioinformatics, vol. Chapter 11, p. Unit11.4, Mar. 2007.
[26] J. P. Vinson, D. B. Jaffe, K. O’Neill, E. K. Karlsson, N. Stange-Thomann, S. Anderson, J. P. Mesirov, N. Satoh, Y. Satou, C. Nusbaum, B. Birren, J. E. Galagan, and E. S. Lander, “Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi,” Genome Res., vol. 15, no. 8, pp. 1127–1135, 2005.
[27] S. Batzoglou, “ARACHNE: A Whole-Genome Shotgun Assembler,” Genome Res., vol. 12, no. 1, pp. 177–189, 2002.
[28] R. S. Wang, L. Y. Wu, Z. P. Li, and X. S. Zhang, “Haplotype reconstruction from SNP fragments by minimum error correction,” Bioinformatics, vol. 21, no. 10, pp. 2456–2462, 2005.
[29] W. Qian, Y. Yang, N. Yang, and C. Li, “Particle swarm optimization for SNP haplotype reconstruction problem,” vol. 196, pp. 266–272, 2008.
[30] J. Wu, J. Wang, and J. Chen, “A genetic algorithm for single individual SNP haplotype assembly,” Proc. 9th Int. Conf. Young Comput. Sci. ICYCS 2008, pp. 1012–1017, 2008.
[31] S. Levy, G. Sutton, P. C. Ng, L. Feuk, A. L. Halpern, B. P. Walenz, N. Axelrod, J. Huang, E. F. Kirkness, G. Denisov, Y. Lin, J. R. MacDonald, A. W. C. Pang, M. Shago, T. B. Stockwell, A. Tsiamouri, V.
Bafna, V. Bansal, S. A. Kravitz, D. A. Busam, K. Y. Beeson, T. C. McIntosh, K. A. Remington, J. F. Abril, J. Gill, J. Borman, Y. H. Rogers, M. E. Frazier, S. W. Scherer, R. L. Strausberg, and J. C. Venter, “The diploid genome sequence of an individual human,” PLoS Biol., vol. 5, no. 10, pp. 2113–2144, 2007.
[32] F. Geraci, “A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem,” Bioinformatics, vol. 26, no. 18, pp. 2217–2225, 2010.
[33] V. Bansal and V. Bafna, “HapCUT: An efficient and accurate algorithm for the haplotype assembly problem,” Bioinformatics, vol. 24, no. 16, pp. 153–159, 2008.
[34] D. Aguiar and S. Istrail, “HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data,” J. Comput. Biol., vol. 19, no. 6, pp. 577–590, 2012.
[35] D. Aguiar and S. Istrail, “Haplotype assembly in polyploid genomes and identical by descent shared tracts,” Bioinformatics, vol. 29, no. 13, pp. 352–360, 2013.
[36] E. Berger, D. Yorukoglu, J. Peng, and B. Berger, “HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8394 LNBI, no. 3, pp. 18–19, 2014.
[37] M. Patterson, T. Marschall, N. Pisanti, and L. van Iersel, “WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads,” J. Comput. Biol., vol. 22, no. 0, pp. 1–12, 2014.
[38] Y. Pirola, S. Zaccaria, R. Dondi, G. W. Klau, N. Pisanti, and P. Bonizzoni, “HapCol: Accurate and Memory-efficient Haplotype Assembly from Long Reads,” Bioinformatics, no. August, p. btv495, 2015.
[39] P. Bonizzoni, R. Dondi, G. W. Klau, Y. Pirola, N. Pisanti, and S. Zaccaria, “On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes,” J. Comput. Biol., vol. 23, no. 0, pp. 1–19, 2016.
[40] C. Chin, P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion, C.
Dunn, R. O’Malley, R. Figueroa-Balderas, A. Morales-Cruz, R. Grant, M. Delledonne, C. Luo, J. R. Ecker, D. Cantu, and D. R. Rank, “Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing,” 2016.
[41] N. I. Weisenfeld, V. Kumar, P. Shah, D. M. Church, D. B. Jaffe, and P. Ca, “Direct determination of diploid genome sequences,” pp. 1–21, 2016.