Approaches to DNA de novo assembly

Ivan Sović
Center for Informatics and Computing
Ruđer Bošković Institute
Bijenička cesta 54, 10 000 Zagreb
Email: [email protected]

Abstract—DNA is the basic building block of all known life, accounting for the diversity found in nature. Determining the DNA of an individual organism is performed through a process called DNA sequencing. Although several different sequencing technologies exist, they are limited to acquiring relatively short sequence reads. One approach to sequencing involves randomly breaking a long DNA molecule into small fragments and sequencing only those fragments. Due to the random positioning of fragments on the source DNA, the majority of them overlap, providing the information necessary to combine them back together. The process of reconstructing the original DNA sequence from fragment reads is called DNA assembly. Assembly is a computationally very intensive process that may take days, or even weeks, to produce the sequence of a more complex organism. Reconstructing a DNA sequence in the absence of a previously reconstructed reference sequence from a similar organism is called de novo assembly. De novo assembly methods currently provide the only means to discover new, previously unknown sequences, and are indispensable in biological research. In this paper, short descriptions of the sequencing process and the current sequencing platforms are given. The DNA assembly process is thoroughly described, and an analysis of several de novo approaches used for assembly is presented. An overview and description of existing software tools is given, including some parallel implementations. In conclusion, aspects of possible future development of DNA assembly are considered.

Index Terms—DNA sequencing, assembly, overlap, layout, consensus, de Bruijn, parallel

I. INTRODUCTION

An extremely important research subject in biology is the determination of the sequence of naturally occurring deoxyribonucleic acid (DNA) molecules. DNA is a long molecule consisting of a large number of simple components called nucleotides or bases, which comprise a long chain (sequence). The process of resolving the structure of the DNA molecule is called DNA sequencing. Currently, several methodologically different sequencing techniques exist: first generation techniques, mostly based on Sanger's method of sequencing, and next generation sequencing (NGS) techniques, the most prominent of them being Roche's 454 Life Sciences, Illumina Solexa and Applied Biosystems' SOLiD systems [1], and Ion Proton from Life Technologies. The general differences between the first and the next generation sequencers are the lower cost and higher throughput of next generation sequencers, but with the disadvantages of shorter read lengths and higher error rates [2], [3]. Commonly, DNA molecules have lengths from a few million up to a few billion base pairs (bp), depending on the species (e.g. Escherichia coli's DNA is 4.6 million base pairs long, while Homo sapiens' DNA is 3.3 billion base pairs long). Limited by current sequencing technology, the entire DNA cannot be read at once. Instead, multiple smaller reads of different parts of the DNA molecule are produced, each about 30-400 bp long, representing fragments of the original DNA molecule. The process of reconstruction of the original DNA sequence from read fragments is called DNA sequence assembly. The assembly process is generally based on finding overlaps among reads and joining them into contiguous sequences (contigs). The rationale for such an approach is that a large number of copies of the same DNA molecule are randomly broken into fragments, causing the fragments of different DNA copies to partially overlap.
Computational DNA assembly is currently the only means of obtaining the sequence of a DNA molecule. This makes research in the field of assembly of crucial importance to the development of biology and to furthering the understanding of life.

II. DNA SEQUENCING

The DNA molecule is a long, unbranched, paired polymer chain, formed of the same four types of monomers called nucleotides, connected in a long linear sequence that encodes the genetic information [4]. Monomers are small molecules, such as amino acids, nucleotides and glucose, which may bind to other monomers in order to form larger molecules called polymers. Nucleotides are the basic building blocks of DNA, consisting of a sugar (deoxyribose) with a phosphate group attached to it, and a base [4]. The base of a nucleotide can be one of the following: adenine (A), cytosine (C), thymine (T) or guanine (G). In graphical or textual representations of DNA molecules, nucleotides are commonly denoted by the abbreviations of their base names: A, T, C and G. In this case, the DNA can be written as a sequence of these four letters. An A nucleotide from one strand of the DNA always bonds to a T nucleotide of the other strand, and vice versa. For this reason, A and T nucleotides are called complements. The same applies to the C and G nucleotides. Figure 1a shows the molecular structure of a nucleotide, while Figure 1b depicts the double-stranded structure of DNA. However, although current techniques can read the sequence of nucleotides, the maximum length that can be read at once is limited to at most 1000 nucleotides. This is an extremely important limitation, since the DNA can be up to a few billion nucleotides long. All current solutions to this problem are based on the same principle: multiple copies of the DNA molecule are randomly broken into small fragments of a size that can be read by modern technology. The fragments are then multiplied into a large number of copies if/as required by the sequencing platform.
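The random-fragmentation principle described above can be illustrated with a minimal Python sketch. This is a toy illustration only, not any platform's actual protocol; the `shotgun_fragments` helper and the toy genome string are invented for the example:

```python
import random

def shotgun_fragments(dna, n_copies, min_len, max_len, seed=42):
    # Break each copy of the molecule at random positions into consecutive
    # fragments; overlaps arise between fragments of *different* copies,
    # since each copy is cut at different random positions.
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_copies):
        pos = 0
        while pos < len(dna):
            cut = min(pos + rng.randint(min_len, max_len), len(dna))
            fragments.append(dna[pos:cut])
            pos = cut
    return fragments

molecule = "ATCGGCTTACCGATGCAATTCCGA" * 4  # toy 96 bp "genome"
reads = shotgun_fragments(molecule, n_copies=5, min_len=8, max_len=12)

# Every fragment is an exact substring of the source molecule, and all
# copies together account for every base of the molecule five times.
assert all(f in molecule for f in reads)
assert sum(len(f) for f in reads) == 5 * len(molecule)
```

In a real experiment the fragment positions are of course unknown, which is exactly why the overlaps between fragments are the only information available for reassembly.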
The method of sequencing the entire genome by dividing it into smaller fragments is called whole genome shotgun (WGS) sequencing. Given a large enough number of DNA samples, randomly created fragments will overlap with each other, which gives the only available information for putting them back together. The average number of sequences that independently contain a certain nucleotide is called the depth of coverage (often just coverage) [5]. Combining the fragments back into a single chain of nucleotides is called DNA assembly, while the programs that perform this process are called genome assemblers [6]. Sequencing of a DNA fragment can be performed in two ways: sequencing only one strand of DNA from its 5' end, producing single-end (SE) reads, or sequencing the fragment on both strands from their 5' ends, producing paired-end (PE) reads. The sequences of a PE read are also often called mate-pairs. The method of sequencing an entire genome and producing PE reads is called double-barrelled whole genome shotgun (DBWGS) sequencing. To help gain a better understanding of DNA assembly methods, a short description of current sequencing technologies and their limitations is required. First generation sequencing, or Sanger sequencing, is based on a modification of the process of DNA synthesis. During this process, the two chains of DNA are separated, followed by the addition of nucleotides that are complements to those contained within the chains. Sanger's method adds modified nucleotides alongside the normal ones, where each modified nucleotide is missing an oxygen at the 3' end and has been fluorescently marked [1]. Integrating these nucleotides into a DNA chain causes a halt (termination) of the elongation process. Using capillary electrophoresis, sequences are separated by their length (mass) and the termination base is read.
This process of sequencing can deliver read lengths of up to 1000 bases and high raw accuracy, and allows 384 samples to be sequenced in parallel, generating 24 bases per instrument second [1]. The great shortcomings of this method are its high price of about $10 per 10000 bases, and its long sequencing time. Examples of machines used for first generation sequencing include the Applied Biosystems Prism 3730 and the Molecular Dynamics MegaBACE [7]. Commercial NGS DNA sequencing platforms today include the Genome Sequencer from Roche 454 Life Sciences, the Solexa Genome Analyzer from Illumina, the SOLiD System from Applied Biosystems, Ion Torrent and Ion Proton from Life Technologies, the Heliscope from Helicos and the Polonator [7]. There are several different methods that these technologies use for sequencing.

Fig. 1: a) A generic molecular structure of a single nucleotide, consisting of a sugar backbone, a phosphate group and a base (denoted by B). The base of a nucleotide in the DNA can be one of the four: A, C, T and G. b) Depiction of the double-stranded DNA molecule, composed of A, C, T and G nucleotides.

It is important to note that the carbon atoms in Figure 1a are marked 1'-5'. Their order in the molecule is not coincidental, and is of great importance for the formation of DNA. Bonds between nucleotides are always formed between their 3' and 5' ends via the phosphate group, creating a polymer chain composed of a repetitive sugar-based backbone with a series of bases protruding from it [4], which represents a single strand of the DNA. During synthesis, DNA is always extended in the "direction" from its 5' end to its 3' end - in other words, the 5' carbon atom of a nucleotide that is to be added to an existing chain is always bonded to the last 3' carbon atom of that chain, through a covalent bond between the phosphate group of one nucleotide and the 3' carbon in the deoxyribose ring of the other.
By convention, a DNA sequence is always read from its 5' end towards its 3' end. Significant effort has been put into the development of methods for determining the exact sequence of nucleotides comprising a DNA molecule. The process of "reading" the sequence is commonly referred to as DNA sequencing. Methods and technology for performing DNA sequencing are broadly divided into two groups: first generation sequencing, based on Sanger's method, and next (or second) generation sequencing (NGS), based on several different approaches. The methods these technologies use for sequencing include hybridization to tiling arrays, parallelized pyrosequencing (454), reverse termination (Solexa), ligation of degenerate probes (SOLiD) and single-molecule sequencing (Heliscope) [1]. Read lengths obtained from NGS platforms are in the 400 bp range (454), the 100 bp range (Solexa and SOLiD), or less [7]. Compared to Sanger sequencing, next generation sequencers have characteristic error profiles, different for each of the technologies [7]. These error profiles can include inaccurate determination of simple sequence repeats, enrichment of base call errors toward the 3' ends of reads, insertion or deletion errors during homopolymer runs, and others [7], [8]. Error profiles of some of the platforms have been published and are available [7]. During the process of determining nucleotide types (commonly referred to as base-calling), information about the quality of a certain base call is often available. Software tools such as Phred [9] read chromatogram files and analyse peaks in order to call bases, assigning each a quality score that is proportional to the probability that the base was called incorrectly. Quality scores are used by some assembly tools to enhance efficiency, and also for comparison of different sequencing technologies.
Fig. 2: Examples of DNA fragments read from (a) only one end and (b) from both ends.

The first sequence fragment assembly algorithms, developed in the 1980s, were focused on creating multiple overlapping alignments of the reads to produce a layout assembly of the data [17]. The reads were typically sequences obtained using Sanger's method. The alignment is used to extract the consensus sequence, from which the DNA sequence is inferred. The described method is called the overlap-layout-consensus (OLC) approach. Currently, OLC is one of the two mainstream approaches to de novo DNA assembly, and is especially used for the assembly of long, high-quality reads [17]. Examples of OLC assembly tools include Celera Assembler [18], Arachne [19] and Newbler [20]. Another very important approach, currently the most utilized in combination with NGS data, is based on the de Bruijn graph. This approach was first presented by Pevzner et al. in 2001 [21], and is currently the prevalent representation in short-read assemblers. Unlike the computationally expensive step of determining overlaps among reads, the de Bruijn graph is constructed from fixed-length subsequences (of length k) that were consecutive in larger sequences, and that have a perfect (k − 1)-suffix to (k − 1)-prefix overlap. Examples of assemblers based on the de Bruijn approach include Euler [21], AllPaths [22] and Velvet [23]. Both the OLC and the de Bruijn approaches are graph based. Different approaches also exist, however: they include greedy string-based methods, e.g. as implemented in SSAKE [24], and hybrid methods such as Taipan [25]. Recently, several parallelized versions of assemblers have appeared. This step in the development of assembly tools is of great importance, since it can speed up an otherwise highly computationally intensive task. Examples include ABySS [3], a method by Kundeti et al. [26], and a parallelization of the Euler assembler [27].
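The fixed-length decomposition described here is simple to make concrete. A minimal Python sketch (the function name `kmers` is ours, not from any cited tool):

```python
def kmers(seq, k):
    # The k-spectrum of seq, in order of appearance (duplicates kept).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

ks = kmers("ACCATTC", k=4)
assert ks == ["ACCA", "CCAT", "CATT", "ATTC"]

# k-mers taken from consecutive positions share a perfect
# (k-1)-suffix to (k-1)-prefix overlap, as described above:
assert all(a[1:] == b[:-1] for a, b in zip(ks, ks[1:]))
```

This perfect, fixed-length overlap is what lets de Bruijn assemblers replace pairwise overlap computation with simple exact matching of (k − 1)-mers.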
Errors in assembly can occur for two main reasons: incomplete or incorrect information provided to the assembler, and the limitations of the assembly algorithm [6]. Real-world WGS data containing errors can induce the following problems in overlap and de Bruijn graphs [7]:
1) Spurs - short dead-end branches (divergences) of the main path (Figure 3a). Possible causes include sequencing errors toward one end of a read, and low coverage.
2) Bubbles - divergence of a path into two branches that afterwards join together again into one path (Figure 3b).

III. FRAGMENT ASSEMBLY

All sequencing platforms produce observations of the target DNA molecule in the form of reads - sequences of single-letter base calls with a numeric quality value [7]. An example of a single-end read and a paired-end read is shown in Figure 2a and Figure 2b, respectively. Genome assembly is commonly divided into two groups:
• De novo - assembly of sequence reads into longer contiguous sequences called contigs, followed by the process of correctly ordering the contigs into scaffolds, in the absence of a reference genome sequence [8], [10], [11].
• Mapping/reference - when available, a pre-existing reference genome sequence can be used to align the reads of a newly obtained genome, thus avoiding the construction of the complex data structures needed for de novo assembly [8], [10].
As a result of the de novo assembly stage, a set of scaffolded contigs is available. Besides ordered contigs, scaffolds also contain gaps of imprecise length separating the contigs [12]. Scaffolds are also often called supercontigs. Low-risk assemblies consistent with almost all of the detectable pairwise read overlaps are called unitigs [12]. Sequence alignment is the procedure of comparing two or more sequences by searching for a series of individual characters (in this case nucleotides) or character patterns that appear in the same order in all sequences [13].
In most cases, the sequences cannot be aligned perfectly due to mismatches between them. In an optimal alignment, non-identical characters and gaps are placed so as to match together as many identical or similar characters as possible. Popular tools that perform sequence alignment include BLAST [14], BLAT [15] and BLASTZ [16]. When mate-pair information is available, it constrains the placement of reads during the assembly. These constraints limit the possible traversals of the assembly graph, allowing longer segments of the genome to be unambiguously reconstructed [33]. In the next subsections, the following notation is used to formally define the problem of DNA fragment assembly. Let Σ be the alphabet consisting of four elements, Σ = {A, C, T, G}. Then Σ^k is the set of all strings of length k over the alphabet Σ. Also, let v ∈ Σ^k. The length of a string v is denoted by |v|. Additionally, given integers 1 ≤ i ≤ j ≤ |v|, the i-th element of a string v is denoted by v[i], while a substring of a string v is denoted by v[i, j], where i is the starting position of the substring within v, and j the end position, inclusive. A string of length k is called a k-mer, while the k-spectrum of v is the set of all k-mers that are substrings of v. A k-molecule is a pair of k-mers which are reverse complements of each other [34]. If v and w are strings, and there exists a maximal-length non-empty string z that is a prefix of v and a suffix of w, then it is said that w overlaps v, with the length of the overlap ov(w, v) = |z|. This definition is not symmetric [34]. If w does not overlap v, then ov(w, v) = 0.

Fig. 3: Problems that can appear during the process of building graph representations for DNA de novo assembly: (a) spur; (b) bubble; (c) convergence; (d) cycle.

Possible causes of bubbles include sequencing errors toward the middle of a read, and polymorphisms in the target. It is important to note that efficient bubble detection is not trivial.
3) Converging and diverging paths - the inverse of bubbles: two paths converge into one, which later diverges again into two separate paths (Figure 3c). Possible causes are repeats in the target genome.
4) Cycles - paths that converge on themselves (Figure 3d). Possible causes are repeats in the target genome.
Although sequencing technology has greatly improved, still no sequencing platform can produce sufficient data to assemble a complete genome from a single experiment [28]. One of the key problems in shotgun sequencing is caused by repeats in genome sequences [29]. Repeats, when longer than the fragment reads, induce problems during the overlap phase of assembly. As a result, assemblies end up as fragmented contigs separated by gaps. Repeat-caused fragmentation is more pronounced with NGS technology, since the reads are generally of smaller size [28]. Besides repeats, gaps can also be caused by other, technology-specific reasons. Assemblies containing gaps are called draft assemblies, while the process of filling the gaps is called finishing. Finishing involves obtaining missing sequences, improving low-quality regions, resolving misassemblies and ordering scaffolds, and often accounts for the majority of the labour and cost of genome projects [28]. In order to fill gaps, further sequencing is usually required, either by amplifying and sequencing fragments spanning the gaps, or by resequencing the entire DNA. Tools that perform gap closure and finishing do exist, such as IMAGE [30] and Consed [31], although some hybrid assemblers perform gap closure indirectly by incorporating reads from multiple sequencing platforms (e.g. Velvet [23], CABOG [32]). In case paired-end sequencing was performed, mate-pair information can be used to provide additional constraints on the placement of reads during assembly.

A. Overlap-Layout-Consensus approach (OLC)

Let S = {s1, s2, ..., sn} be a set of non-empty strings over an alphabet Σ.
An overlap graph of S is a complete weighted directed graph where each string in S is a vertex, and the length of an edge x → y between vertices x and y is |y| − ov(x, y) [34]. An example of assembly using the overlap graph is depicted in Figure 4. The construction of the overlap graph is the first step of the OLC approach. The general process of this approach consists of three main phases [7], [8], [35], from which it derives its name:
1) Overlap - reads are compared to each other in a pairwise manner to construct an overlap graph.
2) Layout - the overlap graph is analysed and simplified with the application of graph algorithms to identify the appropriate paths traversing the graph, producing an approximate layout of the reads along the genome. The ultimate goal is a single path that traverses each node in the overlap graph exactly once [6].
3) Consensus - a multiple sequence alignment of all reads covering the genome is performed, and the original sequence of the genome being assembled is inferred through the consensus of the aligned reads.
A very important observation is that identifying a path that traverses every node in the graph exactly once is the computationally difficult Hamiltonian path problem, which is closely related to the Travelling Salesman Problem and is NP-complete [36].

Discussion

The three-phase definition of the OLC approach means that algorithms utilizing this method are naturally implemented through a modular design, which allows for simpler modification and optimization of distinct assembly steps. Another advantage of OLC assemblers is that the overlaps among reads may vary in length, which equips them to handle data from either Sanger or NGS platforms. On the other hand, the processing cost of the overlap phase is very high, as determining the overlap of every pair of reads in the data set is very time consuming.
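As a concrete sketch of the overlap phase and the cost it implies, the following Python computes ov(x, y) by brute force for every ordered pair of reads, then replays the Figure 4 toy example by merging the reads along a known path. The helper names are ours, and real assemblers use far more efficient overlap detection than this quadratic scan:

```python
def ov(w, v):
    # Longest non-empty suffix of w that is also a prefix of v (0 if none).
    for n in range(min(len(w), len(v)), 0, -1):
        if w[-n:] == v[:n]:
            return n
    return 0

def overlap_graph(reads, min_ov=1):
    # Edge x -> y with length |y| - ov(x, y), for every ordered pair.
    # This all-pairs scan is the expensive step of the overlap phase.
    return {(x, y): len(y) - ov(x, y)
            for x in reads for y in reads
            if x != y and ov(x, y) >= min_ov}

reads = ["ATC", "CCA", "CAG", "TCC", "AGT"]   # the Fig. 4 reads
g = overlap_graph(reads, min_ov=2)

# Merging reads along ATC -> TCC -> CCA -> CAG -> AGT (a Hamiltonian
# path in the overlap graph) recovers the Fig. 4 target sequence:
path = ["ATC", "TCC", "CCA", "CAG", "AGT"]
assembly = path[0]
for x, y in zip(path, path[1:]):
    assembly += y[ov(x, y):]
assert assembly == "ATCCAGT"
```

Note that the sketch evaluates all n² ordered pairs, which is exactly the quadratic overlap-calculation cost discussed above.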
Although the OLC approach is capable of handling NGS data, such data commonly consists of an order of magnitude more reads than were commonly generated in the Sanger-based projects [6] for which OLC was initially designed, causing a significant increase in overlap calculation time (quadratic complexity) [37]. However, some OLC assemblers, such as Edena [38] and Shorty [39], have been well optimized to handle such data, and even outperform some other, NGS-specialized assemblers [6]. Also, finding the Hamiltonian path in the layout step is still not solvable in polynomial time, which makes OLC additionally difficult and dependent on various heuristics to produce reliable results [21]. Shorty is designed for the special case when a small number of long reads is available, and uses them as seeds to recruit short reads and their mate pairs, iteratively repeating this process to create larger contigs from previously constructed contigs. Parallelization of the OLC approach can be achieved straightforwardly by distributing the overlap computation over different processing units [6]. However, OLC-based parallel assembly tools are rare, as most recent implementations focus on the de Bruijn approach. PASQUAL [40] is an example of a parallel OLC assembler, designed for shared-memory parallelism using OpenMP. Its authors claim to have achieved better performance and results than any other assembly tool they tested. In [41], the authors implemented both a parallel OLC and a parallel de Bruijn assembler, in order to analyse the differences in their scalability and efficiency.

B. The de Bruijn approach

As in the case of overlap graphs, let S = {s1, s2, ..., sn} be a set of non-empty strings over an alphabet Σ. Given a positive integer parameter k, the de Bruijn graph B_k(S) is a directed graph with vertices {d ∈ Σ^k | ∃i such that d is a substring of s_i}, and edges {d[1, k] → d[2, k + 1] | d ∈ Σ^(k+1), ∃i such that d is a substring of s_i} [34].
A vertex of the de Bruijn graph is commonly identified by its associated k-mer. An example of assembly using the de Bruijn graph is depicted in Figure 5. The process of constructing a de Bruijn graph consists of the following steps [42]:
1) Construction of the k-spectrum - reads are divided into overlapping subsequences of length k.
2) Graph node creation - a node is created for every unique (k − 1)-mer appearing as the prefix or suffix of a k-mer in the spectrum.
3) Edge creation - a directed edge is created from node a to node b iff there exists a k-mer whose prefix is a and whose suffix is b.
The first application of de Bruijn graphs to genome assembly was proposed by Pevzner et al. in 2001 [21]. By converting the set of reads into edges of the de Bruijn graph, the assembly problem becomes equivalent to finding an Eulerian path in the graph.

Implementations

Celera [18] is an assembler that was first designed to handle data from Sanger sequencers, but was later revised and optimized for Roche's 454 NGS data. Its revised pipeline, called CABOG, is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. It also uses mate-pair information to merge sets of unitigs into larger structures. Newbler [20] is a well-known and widely used assembly tool, distributed with Roche 454 Life Sciences sequencers. Since newer revisions it provides scaffold building from paired-end data, exploits coverage to handle base-calling errors, uses instrument metrics to overcome inaccurate calls of the number of bases in homopolymer runs, and implements OLC twice - once to generate unitigs from reads, and a second time to generate larger contigs from the unitigs. Edena [38] and Shorty [39] are OLC assemblers targeted at the short reads generated by the Solexa and SOLiD sequencing platforms. Edena discards duplicate reads, finds all perfect error-free overlaps, and removes individual overlaps that are redundant with pairs of other overlaps.

Fig. 5: An example of assembly using the de Bruijn graph by finding an Eulerian path.
A sample sequence ACCATTCCAA was fragmented into two reads, {ACCATTC, ATTCCAA}. The reads were used to obtain the k-mer spectrum for k = 4, {ACCA, CCAT, CATT, ATTC, TTCC, TCCA, CCAA}, and the (k − 1)-mer spectrum, {ACC, CCA, CAT, ATT, TTC, TCC, CAA}, required to construct the graph. The original sample sequence is fully reconstructed by traversing the graph.

Fig. 4: An example of assembly using the overlap graph by finding a Hamiltonian path. In this simple example, the set of input fragments consists of five reads of equal length, {ATC, CCA, CAG, TCC, AGT}. The result of the assembly is the reconstructed sequence ATCCAGT.

An Eulerian path is a path that uses every edge in the graph, and efficient algorithms for finding one do exist [6]. In this case, assembly is a by-product of the graph construction, which proceeds quickly using a constant-time hash table lookup to check the existence of each k-mer in the k-spectrum [7]. However, there can be an exponential number of distinct Eulerian paths in a graph, while only one can be deemed the correct assembly of the genome. To reduce the complexity of the problem, heuristics are usually applied to the constructed graph. The graph is filtered of erroneous structures such as bubbles, spurs, cycles and convergences/divergences, and nodes that are unambiguously connected by an edge are merged together [3]. AllPaths, for example, resolves repeats by assembling the regions that are locally non-repetitive, and then glues the local graphs together where they have overlapping structure. Unlike the case of the OLC approach, many parallel assembly tools exist that use the de Bruijn approach, some specially designed for that purpose, while others incorporated support for multi-threading in later versions. ABySS [3] was among the first de Bruijn assemblers to utilize the parallel paradigm. ABySS distributes the de Bruijn graph over multiple computers by placing k-mers on the available nodes, with the location of a k-mer determined by a simple hash function [3].
ABySS uses the MPI (Message Passing Interface) protocol for communication between nodes. Ray [43] also distributes the graph across multiple computers through MPI, but additionally allows simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Contrail presents a cloud computing solution for the assembly of large genomes, and relies on Hadoop to iteratively transform an on-disk representation of the assembly graph [44]. Contrail is currently being extended to support the OLC approach, which will be of great use for long reads. As mentioned in the previous subsection, implementations and a comparison of a de Bruijn and an OLC assembler are given in [41]. An out-of-core algorithm for constructing large bidirected de Bruijn graphs in Θ(n/p) time, and with Θ(n) message complexity, where p is the number of processors and n is a constant-degree polynomial in p, is presented in [26].

Discussion

A great advantage of the de Bruijn approach is that explicit computation of pairwise overlaps is not required, unlike in the OLC approach. Finding pairwise overlaps is a computationally very expensive process, and thus the de Bruijn approach provides great performance on very large data sets (such as NGS data) [42]. Finding an Eulerian path through the edges of a graph is known to be a polynomial problem, and can therefore be executed more efficiently than finding a Hamiltonian path. Since there can be very many Eulerian paths in a graph, constraints have to be added in order to find the path that represents the actual genome. These constraints can make the assembly much more difficult, transforming this polynomial problem into an NP-hard one, as shown in [34]. Also, de Bruijn graphs are very sensitive to sequencing errors and repeats, as these lead to new k-mers, adding to the graph complexity [42].
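The construction steps and the Figure 5 example can be reproduced with a short Python sketch. The walk below is deliberately naive: it follows unused edges until it gets stuck, with no handling of branches, bubbles or repeats, and the helper names are ours. (The k-mer spectrum given in the Figure 5 caption implies the sample sequence ACCATTCCAA.)

```python
from collections import defaultdict

def de_bruijn(reads, k):
    # Steps from the construction above: enumerate the k-spectrum, use
    # (k-1)-mers as nodes, and add a directed edge prefix -> suffix for
    # every k-mer occurrence (multi-edges kept for repeated k-mers).
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    # Naive Eulerian-style traversal: consume unused edges until stuck.
    # Each step extends the sequence by the last character of the
    # successor (k-1)-mer, mirroring the perfect (k-1) overlap.
    g = {u: list(vs) for u, vs in graph.items()}
    seq, node = start, start
    while g.get(node):
        node = g[node].pop(0)
        seq += node[-1]
    return seq

reads = ["ACCATTC", "ATTCCAA"]          # the two reads from Fig. 5
g = de_bruijn(reads, k=4)
assert walk(g, "ACC") == "ACCATTCCAA"   # original sequence recovered
```

No pairwise overlap computation occurs anywhere in the sketch: the reads meet simply because they share k-mers, which is the key efficiency argument made above.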
Additionally, although the de Bruijn approach can handle both Sanger and NGS data by dividing long Sanger reads into short k-mers, there is an effective loss of the long-range connectivity information implied by each read [6].

C. Other methods

Besides the described approaches, two main other methods exist: greedy and hybrid. Greedy algorithms were common in genome assemblers for Sanger data [6], and were part of the first NGS assembly packages [7]. They operate in the simplest and most intuitive fashion: reads are joined together into contigs iteratively, starting with the reads that have the best overlaps; the process is repeated until there are no more reads or contigs that can be joined. This approach may not lead to a globally optimal solution, for instance when the contig at hand takes on reads that, combined with another contig, might have allowed that contig to grow even larger. Examples of assemblers that use the greedy approach include SSAKE [24], which aggressively assembles short nucleotide sequences by progressively searching for the longest possible overlap between sequences through a prefix tree; VCAKE [45], which implements an iterative extension algorithm that, unlike its predecessors, can incorporate imperfect matches during contig extension; and SHARCGS [46], which adds pre- and post-processing to the SSAKE assembler: in preprocessing it filters the raw read set three times with different settings and constructs three assemblies, and in the postprocessing step it merges them. Hybrid assembly has a twofold meaning. For one, it is an approach where data obtained from several different sequencing platforms is used simultaneously for the assembly of one genome [47]. The data is commonly acquired from both platforms with long reads, base-space output and moderate throughput (e.g. Roche 454), and platforms with short reads, base- or color-space output and high throughput (e.g. Solexa and SOLiD).
Implementations

Euler [21] was the first assembler based on the de Bruijn approach, developed for Sanger reads and later modified for short Roche 454 reads, very short unpaired Illumina/Solexa reads and paired-end Solexa reads. It first filters the input reads by detecting erroneous base calls through noting low-frequency k-mers, and then constructs two k-mer graphs at different values of k and compares them. Velvet [37] is a collection of methods for assembly using de Bruijn graphs. It consists of two parts: the first, called Tour Bus, removes sequencing errors and handles polymorphisms, while the second aims to resolve repeats based on the available information from low-coverage long reads or paired shotgun reads. It applies a series of heuristics based on local graph topology, coverage, sequence identity and paired-end constraints, in order to reduce graph complexity. AllPaths [22] was initially designed for large genome assembly using paired-end short reads obtained from Solexa sequencers, but now works on both small and large (mammalian-size) genomes. AllPaths partitions the graph to resolve repeats. Here, base-space and color-space refer to the notations of DNA sequences output by DNA sequencers. In base-space, a DNA sequence is given plainly through the four types of nucleotides (A, C, T, G), while in color-space each base is coded by two values called colors, with each color used in coding two neighbouring bases. Color-space coding provides additional error checking during the sequencing process. An example of an assembler that uses this type of hybrid approach is HybridVelvet [47], a one-step pipeline incorporating Velvet that can take any long base-space sequences and short color-space reads as inputs and export assembled base-space contigs and scaffolds. The second meaning of hybrid assembly is an approach that combines both greedy and graph-based assembly methods [25]. An example of such an assembler is Taipan [25].
Taipan uses greedy extensions for contig construction, but at each step also constructs enough of the corresponding read graph to make better decisions about how the assembly should continue. Additionally, some assemblers use a suffix trie approach, as in [48], or combinatorial algorithms, as in [49]. These and other approaches are seldom found in assembly tools.

Looking ahead to the comparison of implementations below, the authors in [35] observed that although N50 initially followed the increase of the depth of coverage, it reached a plateau (converged) once the depth of coverage exceeded a certain threshold. To compare the N50 values among assemblers, they therefore used the value of N50 at the depth of coverage at which the N50s of all assemblers had converged. A qualitative comparison of several assembly tools is given in Table I. More detailed descriptions of benchmarking procedures and results of various assembly tools can be found in comparative and review articles such as [6], [7], [8], [35], [50], [51].

IV. DISCUSSION ABOUT FUTURE PROSPECTS

With all the presented aspects of DNA assembly taken into account, there are several possible directions in which future research can progress beyond the state of the art. One such direction is the exploration of new assembly algorithms and of new data types that can be utilized, reducing computational complexity and increasing assembly accuracy. Assembly tools also have to become capable of handling and processing large amounts of data more rapidly, while reducing memory consumption. From the implementation point of view, two possible optimizations can be observed: first, compression techniques can be applied to the sequence data, reducing its storage and operational size; and second, the computationally intensive task of assembly can be parallelized and deployed on cloud platforms. Clouds can provide researchers with scalable, on-demand resources, reusable workflows and reusability of results.
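The compression direction can be illustrated with a toy example: because DNA sequences draw on a four-letter alphabet, each base fits into two bits, a fourfold reduction over one-byte ASCII storage. The sketch below (hypothetical helper names, illustrative only) shows plain bit-packing; production assemblers such as SGA [52] instead build far more capable compressed data structures over the reads.

```python
# Illustrative 2-bit packing of a DNA sequence (A, C, G, T only).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= _CODE[base] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the original sequence; `length` is the base count."""
    bases = []
    for i in range(length):
        byte = data[i // 4]
        bases.append(_BASE[(byte >> (2 * (i % 4))) & 0b11])
    return "".join(bases)

seq = "ACGTACGTTTGA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
assert len(packed) == 3  # 12 bases -> 3 bytes, 4x smaller than ASCII
```

Cloud deployment, the second avenue, is already supported by existing platforms.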
CloudMan is an excellent example of an easy-to-use cloud manager platform that addresses the stated possibilities and has already been used for bioinformatics purposes [53]. Regarding the observation about compression, an assembler that uses this approach, SGA, has recently been published [52]. On the assembly of a C. elegans data set, SGA used only 4.5 GB of RAM, whereas ABySS required 14.1 GB, Velvet 23.0 GB and SOAPdenovo 38.8 GB [52]. However, SGA resulted in a longer execution time than the other stated assemblers.

On a side note, an interesting announcement regarding sequencing technologies has recently been made public. The company Life Technologies has released a press statement claiming to have created a sequencing machine, the Benchtop Ion Proton Sequencer, capable of sequencing an entire genome in under a day for the price of $1,000, and has already started taking orders for the device [54]. The $1,000 genome barrier has long been a goal for the developers of genome sequencing technologies, as it is expected to bring genome sequencing to the masses through hospitals, medical institutions and research facilities.

D. Comparison of implementations

The efficiency and performance of assemblers are generally assessed through the size and the accuracy of the assembled contigs and scaffolds [8], and through resource consumption [35]. The resource consumption of an assembler includes the total processing time and RAM occupancy. Common size measurements include:
• N50 length - the longest length such that at least 50% of all base pairs are contained in contigs of this length or larger [35]. N50 provides a standard measure of assembly connectivity; higher N50 lengths indicate better performance of an assembler.
• Minimum contig length
• Maximum contig length
On the other hand, the accuracy of assemblies is generally difficult to measure, since accuracy estimates can only be given in comparison to a known correct (benchmark) sequence, such as a reference genome or an artificially created data set. Accuracy measurements include:
• Sequence coverage - the percentage of the benchmark sequence covered by the output contigs.
• Assembly error rate - the output contigs are aligned to the benchmark sequence, and the number of mismatched bases is calculated; the assembly error rate is the percentage of mismatched bases among the total bases of the aligned contigs in the reference sequence.
It is important to note that N50 lengths are only comparable between different assemblers when each is measured at the same combined length value [8]. In [35], the authors compared seven different assembly tools and discovered an interesting pattern in assembly connectivity, measured by N50 length and compared against the depth of coverage.

V. CONCLUSION

While the length of sequences that can be read at once is still limited, de novo assembly methods are irreplaceable. Currently, they provide the only means to discover new, previously unknown sequences, which is essential for the characterization of the biological diversity of our world [8]. However, although genome assembly is the only way to obtain the information contained in the genetic code of an organism, the complexity and accuracy of current assembly methods are still greatly influenced by read errors and imprecise algorithms, as well as by the number and size of the reads obtained from sequencing. One can say that the areas of sequencing, assembly and bioinformatics in general have never been as vivid and interesting as they are now, while also providing great opportunities for research that will find real-life applications and, ultimately, affect our lives for the better.

TABLE I: Comparison of performance and quality of several assembly tools on sequences obtained from a C. elegans data set. Results were taken from [52].

                                       SGA              Velvet           ABySS            SOAPdenovo
Scaffold N50 size                      26.3 kbp         31.3 kbp         23.8 kbp         31.1 kbp
Aligned contig N50 size                16.8 kbp         13.6 kbp         18.4 kbp         16.0 kbp
Mean aligned contig size               4.9 kbp          5.3 kbp          6.0 kbp          5.6 kbp
Reference bases covered                96.2 Mbp         94.8 Mbp         95.9 Mbp         95.1 Mbp
Mismatch rate at all assembled bases   1 per 21,545 bp  1 per 8,786 bp   1 per 5,577 bp   1 per 26,585 bp
Contigs with split/bad alignment       458 (4.4 Mbp)    787 (7.2 Mbp)    638 (9.1 Mbp)    483 (4.4 Mbp)
Total CPU time                         41 h             2 h              5 h              13 h
Maximum memory usage                   4.5 GB           23.0 GB          14.1 GB          38.8 GB

ACKNOWLEDGEMENTS

The author would like to acknowledge the support of the scientific research project "Methods of scientific visualization" (098-0982562-2567), funded by the Ministry of Science, Education and Sports of the Republic of Croatia. Additionally, the author would like to thank prof. dr. sc. Karolj Skala and doc. dr. sc. Mile Šikić for their support and guidance during the work on this project.

REFERENCES

[1] E. Pettersson, J. Lundeberg, and A. Ahmadian, "Generations of sequencing technologies," Genomics, vol. 93, no. 2, pp. 105–111, Feb. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.ygeno.2008.10.003
[2] J. R. Miller, S. Koren, and G. Sutton, "Assembly algorithms for next-generation sequencing data," Genomics, vol. 95, no. 6, pp. 315–327, 2010. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/20211242
[3] J. T. Simpson, K. Wong, S. D. Jackman et al., "ABySS: A parallel assembler for short read sequence data," Genome Research, vol. 19, no. 6, pp. 1117–1123, Jun. 2009. [Online]. Available: http://dx.doi.org/10.1101/gr.089532.108
[4] B. Alberts, A. Johnson, J. Lewis, M. Raff, and K. Roberts, Molecular Biology of the Cell, 5th ed. Garland Science, 2008.
[5] S. Taudien, I. Ebersberger, G. Glöckner, and M. Platzer, "Should the draft chimpanzee sequence be finished?" Trends in Genetics, vol. 22, no. 3, pp. 122–125, 2006. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/16430990
[6] M. Pop, "Genome assembly reborn: recent computational challenges," Briefings in Bioinformatics, vol. 10, no. 4, pp. 354–366, 2009. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/19482960
[7] J. R. Miller, S. Koren, and G. Sutton, "Assembly algorithms for next-generation sequencing data," Genomics, vol. 95, no. 6, pp. 315–327, Jun. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.ygeno.2010.03.001
[8] S. Bao, R. Jiang, W. Kwan, X. Ma, and Y.-Q. Song, "Evaluation of next-generation sequencing software in mapping and assembly," Journal of Human Genetics, vol. 56, no. 6, pp. 406–414, Jun. 2011. [Online]. Available: http://dx.doi.org/10.1038/jhg.2011.43
[9] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, "Base-calling of automated sequencer traces using phred. I. Accuracy assessment," Genome Research, vol. 8, pp. 175–185, 1998.
[10] Applied Biosystems, "Applied Biosystems SOLiD 3 Plus System: De Novo Assembly Protocol," Life Technologies Corporation, Tech. Rep., 2010.
[11] L. Jorde, Ed., Encyclopedia of Genetics, Genomics, Proteomics, and Bioinformatics: 8 Volume Set. John Wiley & Sons Ltd., 2005. [Online]. Available: http://books.google.hr/books?id=fytFAQAAIAAJ
[12] S. Koren, J. Miller, B. Walenz, and G. Sutton, "An algorithm for automated closure during assembly," BMC Bioinformatics, vol. 11, no. 1, p. 457, 2010. [Online]. Available: http://www.biomedcentral.com/1471-2105/11/457
[13] D. W. Mount, Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor Laboratory Press, Jul. 2004.
[14] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.
[15] W. J. Kent, "BLAT—the BLAST-like alignment tool," Genome Research, vol. 12, no. 4, pp. 656–664, Apr. 2002.
[16] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, "Human–mouse alignments with BLASTZ," Genome Research, vol. 13, no. 1, pp. 103–107, 2003.
[17] D. Edwards, J. Stajich, and D. Hansen, Bioinformatics: Tools and Applications. Springer Science+Business Media, 2009.
[18] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, "A whole-genome assembly of Drosophila," Science, vol. 287, no. 5461, pp. 2196–2204, Mar. 2000. [Online]. Available: http://dx.doi.org/10.1126/science.287.5461.2196
[19] S. Batzoglou et al., "ARACHNE: A whole-genome shotgun assembler," Genome Research, vol. 12, 2002.
[20] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y. J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. Alenquer, T. P. Jarvie, K. B. Jirage, J. B. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg, "Genome sequencing in microfabricated high-density picolitre reactors," Nature, vol. 437, no. 7057, pp. 376–380, Sep. 2005.
[21] P. A. Pevzner, H. Tang, and M. S. Waterman, "An Eulerian path approach to DNA fragment assembly," Proceedings of the National Academy of Sciences, vol. 98, no. 17, pp. 9748–9753, Aug. 2001. [Online]. Available: http://dx.doi.org/10.1073/pnas.171285098
[22] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe, "ALLPATHS: De novo assembly of whole-genome shotgun microreads," Genome Research, vol. 18, no. 5, pp. 810–820, 2008. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/18340039
[23] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs," Genome Research, vol. 18, no. 5, pp. 821–829, May 2008. [Online]. Available: http://dx.doi.org/10.1101/gr.074492.107
[24] R. L. Warren, G. G. Sutton, S. J. M. Jones, and R. A. Holt, "Assembling millions of short DNA sequences using SSAKE," Bioinformatics, vol. 23, no. 4, pp. 500–501, Feb. 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btl629
[25] B. Schmidt, R. Sinha, B. Beresford-Smith, and S. J. Puglisi, "A fast hybrid short read fragment assembly algorithm," Bioinformatics, vol. 25, pp. 2279–2280, Sep. 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1596446.1596453
[26] V. K. Kundeti, S. Rajasekaran, H. Dinh, M. Vaughn, and V. Thapar, "Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs," BMC Bioinformatics, vol. 11, no. 1, p. 560, 2010. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-11-560
[27] W. Shi and W. Zhou, "A parallel Euler approach for large-scale biological sequence assembly," in Proc. International Conference on Information Technology and Applications, vol. 1, 2005, pp. 437–441.
[28] X. Yang, D. Medvin, G. Narasimhan, D. Yoder-Himes, and S. Lory, "CloG: A pipeline for closing gaps in a draft assembly using short reads," in Proceedings of the 2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS '11). Washington, DC, USA: IEEE Computer Society, 2011, pp. 202–207. [Online]. Available: http://dx.doi.org/10.1109/ICCABS.2011.5729881
[29] E. Arner, Solving Repeat Problems in Shotgun Sequencing, 2006. [Online]. Available: http://books.google.hr/books?id=eBMINAAACAAJ
[30] I. Tsai, T. Otto, and M. Berriman, "Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps," Genome Biology, vol. 11, no. 4, p. R41, 2010. [Online]. Available: http://genomebiology.com/2010/11/4/R41
[31] D. Gordon, C. Abajian, and P. Green, "Consed: A graphical tool for sequence finishing," Genome Research, vol. 8, pp. 195–202, 1998.
[32] J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton, "Aggressive assembly of pyrosequencing reads with mates," Bioinformatics, vol. 24, pp. 2818–2824, Dec. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1520805.1520818
[33] J. Wetzel, C. Kingsford, and M. Pop, "Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies," BMC Bioinformatics, vol. 12, no. 1, p. 95, 2011. [Online]. Available: http://www.biomedcentral.com/1471-2105/12/95
[34] P. Medvedev, K. Georgiou, G. Myers, and M. Brudno, "Computability of models for sequence assembly," in WABI, 2007, pp. 289–301.
[35] Y. Lin, J. Li, H. Shen, L. Zhang, C. J. Papasian, and H.-W. Deng, "Comparative studies of de novo assembly tools for next-generation sequencing technologies," Bioinformatics, vol. 27, no. 15, pp. 2031–2037, Jun. 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr319
[36] D. Davendra, Ed., Traveling Salesman Problem, Theory and Applications. Rijeka: InTech, 2010.
[37] D. R. Zerbino, "Genome assembly and comparison using de Bruijn graphs," Ph.D. dissertation, 2009.
[38] D. Hernandez, P. François, L. Farinelli, M. Østerås, and J. Schrenzel, "De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer," Genome Research, vol. 18, no. 5, pp. 802–809, 2008. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/18332092
[39] M. S. Hossain, N. Azimi, and S. Skiena, "Crystallizing short-read assemblies around seeds," BMC Bioinformatics, vol. 10, suppl. 1, 2009.
[40] X. Liu, P. R. Pande, H. Meyerhenke, and D. A. Bader, "PASQUAL: A parallel de novo assembler for next generation genome sequencing." [Online]. Available: http://www.cc.gatech.edu/pasqual/index.html
[41] M. Ahmed, I. Ahmad, and S. U. Khan, "A comparative analysis of parallel computing approaches for genome assembly," Interdisciplinary Sciences: Computational Life Sciences, vol. 3, no. 1, pp. 57–63, Mar. 2011. [Online]. Available: http://dx.doi.org/10.1007/s12539-011-0062-0
[42] K. Löwe, "Mapping of WDLPS neochromosomes," Otto-von-Guericke-Universität Magdeburg, Tech. Rep., Nov. 2010.
[43] S. Boisvert, F. Laviolette, and J. Corbeil, "Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies," Journal of Computational Biology, vol. 17, no. 11, pp. 1519–1533, 2010. [Online]. Available: http://www.liebertonline.com/doi/abs/10.1089/cmb.2009.0238
[44] Contrail. [Online]. Available: http://sourceforge.net/apps/mediawiki/contrailbio/index.php?title=Contrail
[45] "Extending assembly of short DNA sequences to handle error," Bioinformatics, vol. 23, no. 21, pp. 2942–2944, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btm451
[46] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, "SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing," Genome Research, vol. 17, no. 11, pp. 1697–1706, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1101/gr.6435207
[47] F. M. You, Y. Wang, Y. Q. Gu, M.-C. Luo, N. Huo, K. R. Deal, J. Lee, G. R. Lazo, J. Dvorak, and O. D. Anderson, "HybridVelvet: de novo hybrid assembly of next generation sequencing data with Velvet," Plant & Animal Genomes XIX Conference, Jan. 2011.
[48] C. Barnes, G. Dalgleish, G. Jared, K. Swift, and S. Tate, Trie-Based Data Structures for Sequence Assembly. Springer, 1997, pp. 206–223. [Online]. Available: http://www.springerlink.com/index/31617V66L054T702.pdf
[49] J. Kececioglu and E. Myers, "Combinatorial algorithms for DNA sequence assembly," Algorithmica, vol. 13, no. 1, pp. 7–51, Feb. 1995. [Online]. Available: http://dx.doi.org/10.1007/BF01188580
[50] M. Ruffalo, T. LaFramboise, and M. Koyutürk, "Comparative analysis of algorithms for next-generation sequencing read alignment," Bioinformatics, vol. 27, no. 20, pp. 2790–2796, Oct. 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr477
[51] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, "A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies," PLoS ONE, vol. 6, no. 3, 2011.
[52] J. T. Simpson and R. Durbin, "Efficient de novo assembly of large genomes using compressed data structures," Genome Research, Dec. 2011. [Online]. Available: http://dx.doi.org/10.1101/gr.126953.111
[53] E. Afgan, D. Baker, N. Coraor, H. Goto, I. M. Paul, K. D. Makova, A. Nekrutenko, and J. Taylor, "Harnessing cloud computing with Galaxy Cloud," Nature Biotechnology, vol. 29, no. 11, pp. 972–974, Nov. 2011. [Online]. Available: http://dx.doi.org/10.1038/nbt.2028
[54] Life Technologies, press release, 2012. [Online]. Available: http://www.lifetechnologies.com/global/en/home/aboutus/news-gallery/press-releases/2012/life-techologies-itroduces-thebechtop-io-proto.html