Approaches to DNA de novo assembly

Ivan Sović
Center for informatics and computing
Ruđer Bošković Institute
Bijenička cesta 54, 10 000 Zagreb
Email: [email protected]
Abstract— DNA is the basic building block of all known
life, accounting for all the diversities in nature. Determining
the DNA of an individual organism is performed through
a process called DNA sequencing. Although several different
sequencing technologies do exist, they are limited and are able to
acquire relatively short sequence reads. One of the approaches
to sequencing involves randomly breaking a long DNA molecule
into small fragments and sequencing only those fragments. Due
to the random positioning of fragments on the source DNA,
the majority of them overlap and provide the necessary information
to combine them back together. The process of reconstructing
the original DNA sequence from fragment reads is called DNA
assembly. Assembly is a computationally very intensive process
that may take days, or even weeks to produce the sequence of a
more complex organism. Reconstructing a DNA sequence in the
absence of a previously reconstructed reference sequence from a
similar organism is called de novo assembly. De novo assembly
methods currently provide the only means to discover new,
previously unknown sequences, and are currently indispensable
in biological research.
In this paper, short descriptions of the sequencing process
and the current sequencing platforms are given. The DNA assembly
process is thoroughly described, and an analysis of several de
novo assembly approaches is presented. An overview and
description of existing software tools is given, including some
parallel implementations. In conclusion, aspects of possible
future development of DNA assembly are considered.
Index Terms—DNA sequencing, assembly, overlap, layout,
consensus, de Bruijn, parallel
I. INTRODUCTION
An extremely important research subject in biology is the
determination of the sequence of naturally occurring deoxyribonucleic acid (DNA) molecules. The DNA is a long molecule
consisting of a large number of simple components called
nucleotides or bases, that comprise a long chain (sequence).
The process of resolving the structure of the DNA molecule
is called DNA sequencing. Currently, several methodologically
different sequencing techniques exist: first generation techniques mostly based on Sanger’s method of sequencing, and
next generation sequencing techniques (NGS), most prominent
of them being Roche’s 454 Life Sciences, Illumina Solexa,
Applied Biosystems’ SOLiD systems [1] and Ion Proton
from Life Technologies. In general, next generation sequencers
offer lower cost and higher throughput than first generation
sequencers, at the price of shorter read lengths and higher error rates
[2] [3].
Commonly, DNA molecules range in length from a few
million up to a few billion base-pairs (bp), depending on the
species (e.g. Escherichia coli’s DNA has a length of 4.6
million base-pairs, while Homo sapiens’ DNA is 3.3
billion base-pairs long). Limited by current sequencing technology, the entire DNA cannot be read at once. Instead, multiple
smaller reads of different parts of the DNA molecule are
produced, each about 30-400 bp long, representing fragments
of the original DNA molecule. The process of reconstruction
of the original DNA sequence from read fragments is called
DNA sequence assembly. The assembly process is generally
based on finding overlaps among reads, and joining them into
contiguous sequences (contigs). The explanation for such an
approach is based on the fact that a large number of copies
of the same DNA molecule have been randomly broken into
fragments, thus causing the fragments of different DNA copies
to partially overlap.
Computational DNA assembly is currently the only means
of obtaining the sequence of a DNA molecule. This makes the
research in the field of assembly of crucial importance to the
development of biology and furthering the understanding of
life.
II. DNA SEQUENCING
The molecule of DNA is a long, unbranched, paired polymer
chain, formed of the same four types of monomers called
nucleotides, connected in a long linear sequence that encodes
the genetic information [4]. Monomers are small molecules,
such as amino acids, nucleotides and glucose, which may bind
to other monomers in order to form larger molecules called
polymers. Nucleotides are the basic building blocks of DNA,
consisting of a sugar (deoxyribose) with a phosphate group
attached to it, and a base [4]. The base of a nucleotide can
be one of the following: adenine (A), cytosine (C), thymine
(T) and guanine (G). In graphical or textual representations of
DNA molecules, nucleotides are commonly denoted through
the abbreviations of their base names A, T, C and G. In this
case, the DNA can be written as the sequence of these four
letters. Nucleotide A from one strand of the DNA always
bonds to a T nucleotide of the other strand of the DNA,
and vice versa. For this reason A and T nucleotides are
called complements. The same applies for the C and the G
nucleotides.
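This pairing rule is easy to capture in code. The sketch below (the function and table names are our own, not taken from any assembly tool) returns the reverse complement: the sequence of the opposite strand, read in its own direction:

```python
# Base-pairing table: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the opposite strand of `seq`, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

For example, `reverse_complement("ATCG")` yields `"CGAT"`, and applying the function twice returns the original sequence.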
Figure 1a shows the molecular structure of a nucleotide,
while Figure 1b depicts the double-stranded structure of the
DNA molecule. Regardless of the sequencing technology used,
the maximum length that can currently be read
at once is limited to at most 1000 nucleotides. This is an
extremely important limitation, since the DNA can be up to
a few billion nucleotides long. All current solutions of this
problem are based on the same principle: multiple copies of
the DNA molecule are randomly broken into small fragments
of such size that can be read by modern technology. The
fragments are then multiplied into a large number of copies if/as
required by the sequencing platform. The method of sequencing the entire genome by dividing it in smaller fragments
is called whole genome shotgun (WGS) sequencing. Given
a large enough number of DNA samples, randomly created
fragments will start to overlap with each other, which gives
the only available information about putting them together.
The average number of sequences that independently contain
a certain nucleotide is called the depth of coverage (often
only coverage) [5]. Combining the fragments back into a
single chain of nucleotides is called DNA assembly, while
the programs that perform this process are called genome
assemblers [6].
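The expected depth of coverage is commonly estimated with the standard Lander-Waterman relation c = N·L / G, where N is the number of reads, L the read length and G the genome length. A minimal sketch (the function name is our own):

```python
def depth_of_coverage(num_reads, read_length, genome_length):
    """Lander-Waterman estimate: average number of reads covering a base."""
    return num_reads * read_length / genome_length

# One million 100 bp reads over the 4.6 Mbp E. coli genome
# give roughly 22x coverage.
```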
Sequencing of a DNA fragment can be performed in two
ways: sequencing only one strand of DNA from the 5' end,
producing single-end (SE) reads, or sequencing the fragment
on both strands from their 5' ends, producing paired-end (PE)
reads. Sequences of a PE read are also often called mate-pairs.
The method of sequencing an entire genome and producing PE
reads is called the double barrelled whole genome shotgun
(DBWGS) sequencing.
To help gain a better understanding of DNA assembly methods, a short description of current sequencing technologies and
their limitations is required. The first generation sequencing, or
Sanger sequencing, is based on the modification of the process
of DNA synthesis. During this process, the two chains of
DNA are separated, followed by the addition of nucleotides
that are complements to those contained within the chains.
Sanger’s method uses modified nucleotides in addition to
the normal ones, where each modified nucleotide is missing
an oxygen on the 3' end and has also been fluorescently
marked [1]. Integrating these nucleotides into a DNA chain
causes a halt (or termination) in the elongation process. Using
capillary electrophoresis, sequences are separated by their
length (mass) and the termination base is read. This process
of sequencing can deliver read lengths up to 1000 bases, high
raw accuracy, and allow for 384 samples to be sequenced
in parallel, generating 24 bases per instrument second [1].
Great shortcomings of this method are its high price of about
$10 per 10000 bases, and long sequencing time. Examples
of machines used for first generation sequencing include the
Applied Biosystems Prism 3730 and the Molecular Dynamics
MegaBACE [7].
Commercial NGS DNA sequencing platforms today include
the Genome Sequencer from Roche 454 Life Sciences, the
Solexa Genome Analyzer from Illumina, the SOLiD System
from Applied Biosystems, Ion Torrent and Ion Proton from Life
Technologies, the Heliscope from Helicos and the Polonator
[7]. There are several different methods that these technologies use for sequencing.
(a) Generic nucleotide structure
(b) Structure of the DNA chain
Fig. 1: a) A generic molecular structure of a single nucleotide,
consisting of a sugar backbone, a phosphate group and a base
(denoted by B). The base of a nucleotide in the DNA can be
one of the four: A, C, T and G. b) Depiction of the double-stranded DNA molecule, composed of A, C, T and G nucleotides.
It is important to note that the carbon atoms in Figure 1a are marked 1' to 5'.
Their order in the molecule is not coincidental, and is of
great importance for the formation of DNA. Bonds between
nucleotides are always formed between their 3’ and 5’ ends
via the phosphate group, creating a polymer chain composed
of a repetitive sugar-based backbone with a series of bases
protruding from it [4], that represents a single strand of
the DNA. During synthesis, DNA is always extended in the
"direction" from its 5' end to its 3' end - in other words, the 5'
carbon atom of a nucleotide that is to be added to an existing
chain is always bonded with the last 3' carbon atom of that
chain through covalent bonds between the phosphate group
of one nucleotide, and the 3' carbon in the deoxyribose ring
of the other. By convention, a DNA sequence is always read
from its 5' end towards the 3' end.
Significant effort has been put into the development of
methods for determining the exact sequence of nucleotides
comprising a DNA molecule. The process of "reading" the sequence is commonly referred to as DNA sequencing. Methods
and technology for performing DNA sequencing are broadly
divided into two groups: the first generation sequencing based
on Sanger’s method, and the next (or second) generation
sequencing (NGS) based on several different approaches.
However, all of these techniques are limited in the length of sequence they can read at once.
These technologies use several different methods for sequencing: hybridization to tiling arrays, parallelized
pyrosequencing (454), reverse termination (Solexa), ligating
degenerated probes (SOLiD) and single molecule sequencing
(Heliscope) [1]. Read lengths obtained from the NGS reads
are in the 400bp range (454), the 100bp range (Solexa and
SOLiD), or less [7].
Compared to Sanger sequencing, next generation sequencers
have characteristic error profiles, different in case of each of
the technologies [7]. These error profiles can include inaccurate determination of simple sequence repeats, enrichment
of base call errors toward the 3' ends of reads, and insertion or
deletion errors during homopolymer runs, among others [7] [8].
Error profiles of some of the platforms have been published
and are available [7].
During the process of determining nucleotide types (commonly referred to as base-calling), information about the
quality of a certain base-call is often available. Software tools
such as Phred [9] read chromatogram files and analyse peaks
in order to call bases, while assigning them a quality score that
reflects the probability that the base was called incorrectly.
Quality scores are used by some assembly tools to enhance
efficiency; and also for comparison of different sequencing
technologies.
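In the standard Phred convention, a quality score Q encodes the base-calling error probability p as Q = −10·log10(p), so Q = 30 corresponds to a 1-in-1000 chance of a wrong call. A small sketch of the conversion (the function names are our own):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)
```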
(a) An example of a single-end DNA fragment read.
(b) An example of a paired-end DNA fragment read.
Fig. 2: Examples of DNA fragments read from a) only one
end and b) from both ends.
The first sequence fragment assembly algorithms, developed
in the 1980s, were focused on creating multiple overlapping
alignments of the reads to produce a layout assembly of the
data [17]. The reads were typically sequences obtained using
Sanger’s method. The alignment is used to read the consensus
sequence, from which the DNA sequence is inferred. The
described method was called the overlap-layout-consensus
approach (OLC). Currently, the OLC approach is one of the
two mainstream approaches of the de novo DNA assembly, and
is especially used during the assembly of long, high-quality
reads [17]. Examples of OLC assembly tools include Celera
Assembler [18], Arachne [19] and Newbler [20].
Another very important approach, currently the most utilized in combination with NGS data, is based on the de
Bruijn graph. This approach was first presented by Pevzner et
al. in 2001 [21], and is currently the prevalent representation in
short-read assemblers. Unlike the computationally expensive
step of determining overlap amongst reads, the de Bruijn graph
is constructed from fixed-length subsequences (of length k)
that were consecutive in larger sequences, and that have a
perfect (k − 1)-suffix to (k − 1)-prefix overlap. Examples of
assemblers based on the de Bruijn approach include Euler [21],
AllPaths [22] and Velvet [23].
Both the OLC and the de Bruijn approaches are graph
based. On the other hand, different approaches also do exist.
They include greedy string-based methods, e.g. as implemented in
SSAKE [24], and hybrid methods such as Taipan [25].
Recently, several parallelized versions of assemblers have
appeared. This step in the development of assembly tools
is of great importance, since it can provide rapid execution
of an otherwise highly computationally intensive task. Examples
include ABySS [3]; a method by Kundeti et al. [26]; and a
parallelization of the Euler assembler [27].
Errors in assembly can occur for two main reasons: incomplete or incorrect information provided to the assembler,
and the limitations of the assembly algorithm [6]. Real-world
WGS data that contains errors can induce the following
problems in overlap and de Bruijn graphs [7]:
1) Spurs - short dead-end branches (divergences) of the
main path (Figure 3a). Possible causes include sequencing errors toward one end of a read, and low coverage.
2) Bubbles - divergence of a path into two branches that
afterwards join together again into one path (Figure 3b).
III. FRAGMENT ASSEMBLY
All sequencing platforms produce observations of the target
DNA molecule in the form of reads - sequences of single-letter
base calls with a numeric quality value [7]. An example of a
single-end read and a paired-end read is shown in Figure 2a
and Figure 2b, respectively. Genome assembly is commonly
divided into two groups:
• De novo - assembly of sequence reads into longer contiguous sequences called contigs, followed by the process
of correctly ordering contigs into scaffolds, in the absence
of a reference genome sequence [8] [10] [11].
• Mapping/reference - when available, a pre-existing reference genome sequence can be used to align the reads
of a newly obtained genome, thus avoiding the process
of constructing complex data structures as with de novo
assembly [8] [10].
As a result of the de novo assembly stage, a set of scaffolded contigs is available. Besides ordered contigs, scaffolds
also contain gaps of imprecise length separating the contigs
[12]. Scaffolds are also often called supercontigs. Low-risk
assemblies, consistent with almost all of the detectable pairwise read overlaps, are called unitigs [12].
Sequence alignment is the procedure of comparing two
or more sequences by searching for a series of individual
characters (in this case nucleotides) or character patterns that
are in the same order in all sequences [13]. In most cases, the
sequences cannot be aligned perfectly due to mismatches between them.
In an optimal alignment, mismatched characters and gaps are
placed so as to match as many identical or similar
characters as possible. Popular tools that perform sequence
alignment include BLAST [14], BLAT [15] and BLASTZ [16].
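The optimal placement of mismatches and gaps described above is typically found by dynamic programming. A minimal global-alignment scoring sketch in the style of Needleman-Wunsch, with illustrative match/mismatch/gap values (the scoring scheme is an assumption, not taken from the cited tools; traceback of the actual alignment is omitted):

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score between sequences a and b."""
    n, m = len(a), len(b)
    # dp[i][j]: best score aligning the first i characters of a
    # with the first j characters of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]
```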
Mate-pair information, when available, provides additional constraints on
the placement of reads during the assembly. These constraints
limit the possible traversals of the assembly graph, allowing
longer segments of the genome to be unambiguously reconstructed [33].
In the next subsections, the following notation is used to
formally define the problem of DNA fragment assembly. Let Σ
be the alphabet consisting of four elements Σ = {A, C, T, G}.
Then, Σk is a set of all strings of length k over the alphabet
Σ. Also, let v ∈ Σk . The length of a string v is denoted by
|v|. Additionally, given i, j ∈ N such that 1 ≤ i ≤ j ≤ |v|, the i-th
element of a string v is denoted with v[i], while a substring of
a string v is denoted by v[i, j], where i is the starting position
of the substring within v, and j the end position, inclusive.
A string of length k is called a k-mer, while the k-spectrum
of v is the set of all k-mers that are substrings of v. A k-molecule is a pair of k-mers which are reverse complements
of each other [34]. If v and w are strings, and there exists a
maximal length non-empty string z that is a prefix of v and a
suffix of w, then it is said that v overlaps w, with the length of
overlap ov(w, v) = |z|. This definition is not symmetric [34].
If v does not overlap w, then ov(w, v) = 0.
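These definitions translate almost directly into code; a small sketch (the function names are our own):

```python
def k_spectrum(v, k):
    """The k-spectrum of v: the set of all k-mers that are substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}

def ov(w, v):
    """Length of the longest non-empty suffix of w that is a prefix of v,
    or 0 if no such string exists. Note the definition is not symmetric."""
    for n in range(min(len(w), len(v)), 0, -1):
        if w[-n:] == v[:n]:
            return n
    return 0
```

The asymmetry is easy to see: `ov("ATC", "TCA")` is 2 (via "TC"), while `ov("TCA", "ATC")` is 1 (via "A").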
(a) Spur
(b) Bubble
(c) Convergence
(d) Cycle
Fig. 3: Problems that can appear during the process of building
graph representations for DNA de novo assembly.
Possible causes include sequencing errors toward the
middle of a read, and by polymorphisms in the target.
It is important to note that efficient bubble detection is
not trivial.
3) Converging and diverging paths - inverse definition
than for the bubbles, two paths converge into one, that
later diverges again into two separate paths (Figure 3c).
Possible causes are repeats in the target genome.
4) Cycles - paths that converge on themselves (Figure 3d).
Possible causes are repeats in the target genome.
Although sequencing technology has greatly improved, still
no sequencing platform can produce sufficient data to assemble
a complete genome from a single experiment [28]. One of the
key problems in shotgun sequencing is caused by repeats in
genome sequences [29]. Repeats, in cases when they are longer
than fragment reads, induce problems during the overlap phase
of assembly. As a result, assemblies consist of fragmented contigs, separated by gaps. Repeat-caused fragmentation is more
pronounced with NGS technology, since reads are generally
of smaller size [28]. Other than repeats, gaps can also be
caused by other, technology-specific reasons. Assemblies containing gaps are called draft assemblies, while the process of
filling the gaps is called finishing. Finishing involves obtaining
missing sequences, improving low quality regions, resolving
misassemblies and ordering scaffolds; and often accounts for
the majority of labour and cost of genome projects [28]. In
order to fill gaps, further sequencing is usually required, either
by amplifying and sequencing fragments spanning gaps, or by
resequencing the entire DNA. Tools that perform gap closure
and finishing do exist, such as IMAGE [30] and Consed
[31], although some hybrid assemblers perform gap closure
indirectly by incorporating reads from multiple sequencing
platforms (i.e. Velvet [23], CABOG [32]).
In case paired-end sequencing was performed, mate-pair
information can be used to provide additional constraints on
the placement of reads during the assembly.
A. Overlap-Layout-Consensus approach (OLC)
Let S = {s1, s2, ..., sn} be a set of non-empty strings over
an alphabet Σ. An overlap graph of S is a complete weighted
directed graph where each string in S is a vertex and the length
of an edge between vertices x and y, x → y, is |y| − ov(x, y)
[34].
An example of assembly using the overlap graph is depicted
in Figure 4. The construction of the overlap graph is the first
step of the OLC approach. General process of this approach
consists of three main phases [7] [8] [35], from which it
derives its name:
1) Overlap - reads are compared to each other in a pairwise
manner to construct an overlap graph.
2) Layout - the overlap graph is analysed and simplified
with the application of graph algorithms to identify
the appropriate paths traversing through the graph and
produce an approximate layout of the reads along the
genome. The ultimate goal is a single path that traverses
each node in the overlap graph exactly once [6].
3) Consensus - multiple sequence alignment of all reads
covering the genome is performed, and the original
sequence of the genome being assembled is inferred
through the consensus of the aligned reads.
A very important observation is that the identification of
a path that traverses every node in the graph only once is
a computationally difficult Hamiltonian path problem. The
problem is closely related to the Travelling Salesman Problem,
and is NP-complete [36].
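As a toy illustration of the layout phase, the sketch below searches every read order - i.e. every Hamiltonian path in the complete overlap graph - and keeps the shortest merged sequence. The exhaustive search is exponential and only feasible for a handful of reads; real assemblers rely on heuristics instead. Applied to the five reads of the Figure 4 example, it reconstructs ATCCAGT (the function names are our own):

```python
from itertools import permutations

def ov(w, v):
    """Length of the longest suffix of w that is a prefix of v (0 if none)."""
    for n in range(min(len(w), len(v)), 0, -1):
        if w[-n:] == v[:n]:
            return n
    return 0

def olc_brute_force(reads):
    """Toy layout step: try every read order and keep the shortest merge."""
    best = None
    for order in permutations(reads):
        seq = order[0]
        for read in order[1:]:
            seq += read[ov(seq, read):]  # append only the non-overlapping tail
        if best is None or len(seq) < len(best):
            best = seq
    return best
```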
Discussion
The three-phase structure of the OLC approach means that
algorithms utilizing this method are naturally implemented
through a modular design, which allows for simpler modification and optimization of distinct assembly steps. Another
advantage of OLC assemblers is that the overlaps among reads
may vary in length, which equips them to handle the data either
from Sanger or NGS platforms.
On the other hand, the processing cost of the overlap
phase is very high, as determining the overlap of every pair
of reads in the data set is very time consuming.
Although the OLC approach is capable of handling NGS data,
this data commonly consists of an order of magnitude more
reads than were commonly generated in Sanger-based projects
[6] (for which the OLC was initially designed), causing a
significant increase of the overlap calculation time (quadratic
complexity) [37]. However, some OLC assemblers such as
Edena [38] and Shorty [39] have been well optimized to handle
such data, and even outperform some other, NGS specialized
assemblers [6]. Also, finding the Hamiltonian path in the
layout step is still not solvable in polynomial time, which
makes OLC additionally difficult and dependent on various
heuristics to produce reliable results [21].
Shorty is designed
for a special case when a small number of long reads are
available, and uses them as seeds to recruit short reads and
their mate pairs, and iteratively repeats this process to create
larger contigs from previously constructed contigs.
Parallelization of the OLC approach can be easily achieved
by distributing the overlap computation on different processing
units [6]. However, OLC-based parallel assembly tools are
rare, as most recent implementations focus on the
de Bruijn approach. PASQUAL [40] is an example of a parallel
OLC assembler, designed for shared memory parallelism using
OpenMP. Its authors claim to have achieved better performance
and results than any other assembly tool they tested.
In [41] authors have implemented a parallel OLC and a parallel
de Bruijn assembler, in order to analyse the differences in their
scalability and efficiency.
B. The de Bruijn approach
As in the case of overlap graphs, let S = {s1, s2, ..., sn} be a
set of non-empty strings over an alphabet Σ. Given a positive
integer parameter k, the de Bruijn graph Bk(S) is a directed
graph with vertices defined by {d ∈ Σk | ∃i such that d ⊆ si}, and
edges defined by {d[1..k] → d[2..k + 1] | d ∈ Σk+1, ∃i such that
d ⊆ si} [34]. A vertex of the de Bruijn graph is commonly
identified through its associated k-mer.
An example of the assembly using the de Bruijn graph is
depicted in Figure 5. The process of constructing a de Bruijn
graph consists of the following steps [42]:
1) Construction of k-spectrum - reads are divided into
overlapping subsequences of length k.
2) Graph node creation - a node is created for every (k − 1)-mer in the (k − 1)-spectrum of each unique k-mer.
3) Edge creation - a directed edge is created from node a
to node b iff there exists a k-mer such that its prefix is
equal to a and its suffix to b.
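The three steps above, followed by an Eulerian-path traversal (here Hierholzer's algorithm, a standard method), can be sketched as follows. The sketch ignores k-mer multiplicities, reverse complements and the error structures discussed elsewhere in the text, and assumes the graph actually admits an Eulerian path; applied to the reads of the Figure 5 example it reconstructs the sample sequence:

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k):
    """Build the de Bruijn graph of the reads' k-spectrum and spell
    the sequence of an Eulerian path through it."""
    # Steps 1-3: one edge per distinct k-mer, from its (k-1)-prefix
    # node to its (k-1)-suffix node.
    edges = defaultdict(list)
    indeg = defaultdict(int)
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    for km in sorted(kmers):
        edges[km[:-1]].append(km[1:])
        indeg[km[1:]] += 1
    # Start at a node whose out-degree exceeds its in-degree, if any.
    start = next(iter(edges))
    for node in list(edges):
        if len(edges[node]) > indeg[node]:
            start = node
    # Hierholzer's algorithm: walk until stuck, then backtrack and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges[node]:
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell the sequence: first node plus the last character of each successor.
    return path[0] + "".join(node[-1] for node in path[1:])
```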
The first application of de Bruijn graphs for genome assembly was proposed by Pevzner et al. in 2001 [21]. By converting
the set of reads into edges of the de Bruijn graph, the assembly
problem becomes equivalent to finding an Eulerian path in the graph.
Implementations
Celera [18] is an assembler that was first designed to handle
data from Sanger sequencers, but was later revised and optimized for Roche's 454 NGS data. Its revised pipeline,
called CABOG, is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. It
also uses mate pair information to merge sets of unitigs into
larger structures.
Newbler [20] is a well known and widely used assembly
tool, distributed with Roche 454 Life Science sequencers.
Newer revisions provide scaffold building from
paired-end data, exploit coverage to handle base calling
errors, use instrument metrics to overcome inaccurate calls
of the number of bases in homopolymer runs, and implement
OLC twice - once to generate unitigs from reads, and a second
time to generate larger contigs from unitigs.
Edena [38] and Shorty [39] are OLC assemblers that are
targeted for the short reads generated by Solexa and SOLiD
sequencing platforms. Edena discards duplicate reads, finds all
perfect error-free overlaps, and removes individual overlaps that
are redundant with pairs of other overlaps.
Fig. 5: An example of assembly using the de Bruijn graph by
finding a Eulerian path. A sample sequence ACCATTCCAA
was fragmented into two reads, {ACCATTC, ATTCCAA}.
Reads were used to obtain the k-mer spectrum for k =
4, {ACCA, CCAT, CATT, ATTC, TTCC, TCCA,
CCAA}, and the (k − 1)-mer spectrum, {ACC, CCA,
CAT, ATT, TTC, TCC, CAA}, required to construct the
graph. The original sample sequence is fully reconstructed by
traversing the graph.
Fig. 4: An example of assembly using the overlap graph by
finding a Hamiltonian path. In this simple example, the set of
input fragments consists of five reads of equal length, {ATC,
CCA, CAG, TCC, AGT}. The result of assembly is the
reconstruction of the sequence ATCCAGT.
An Eulerian path is a path that uses every edge in the graph, and
efficient algorithms for finding one do exist [6]. In this case, assembly is a
by-product of the graph construction, which proceeds quickly
using a constant-time hash table lookup for the existence of
each k-mer in the k-spectrum [7]. However, there can be
an exponential number of distinct Eulerian paths in a graph,
while only one can be deemed to be the correct assembly
of the genome. To reduce the complexity of the problem,
heuristics are usually applied to the constructed graph. The
graph is cleared of erroneous structures such as bubbles,
spurs, cycles and convergences/divergences, and nodes that are
unambiguously connected by an edge are merged together [3].
Unlike the case of the OLC approach, many parallel assembly tools exist that use the de Bruijn approach, some being
specially designed for that purpose, while others incorporated support for multi-threading in later versions. ABySS [3]
was among the first de Bruijn assemblers to utilize the parallel
paradigm. ABySS distributes the de Bruijn graph over multiple
computers by placing k-mers over available nodes, with the
location of a k-mer determined by a simple hash function [3].
ABySS uses the MPI (Message Passing Interface) protocol for
communication between nodes. Ray [43] also distributes the
graph across multiple computers through MPI, but also allows
simultaneous assembly of reads from a mix of high-throughput
sequencing technologies. Contrail presents a cloud computing
solution to the assembly of large genomes, and relies on Hadoop
to iteratively transform an on-disk representation of the assembly graph [44]. Contrail is currently being extended to support
the OLC approach, which will be of great use for long reads. As
mentioned in the previous subsection, an implementation and
comparison of a de Bruijn and an OLC assembler is given
in [41]. An out-of-core algorithm for constructing large bidirected de Bruijn graphs in Θ(n/p) time, and with Θ(n) message
complexity, where p is the number of processors, and n is a
constant degree polynomial in p, is presented in [26].
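The distributed placement idea can be illustrated with a tiny sketch: each k-mer's owner is chosen by a deterministic hash of the k-mer, so every compute node can determine the location without communication. This is only in the spirit of ABySS's scheme [3]; the actual hash function and the handling of reverse complements differ (CRC32 is used here purely for determinism, and all names are our own):

```python
import zlib

def owner_node(kmer, num_nodes):
    """Compute node that owns this k-mer (deterministic hash placement)."""
    return zlib.crc32(kmer.encode()) % num_nodes

def distribute(reads, k, num_nodes):
    """Partition the distinct k-mers of all reads across num_nodes buckets."""
    buckets = [set() for _ in range(num_nodes)]
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            buckets[owner_node(km, num_nodes)].add(km)
    return buckets
```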
Discussion
A great advantage of the de Bruijn approach is that explicit
computations of pairwise overlaps are not required, unlike the
case of the OLC approach. Finding pairwise overlaps is a
computationally very expensive process, and thus the de Bruijn
approach provides great performance when very large data sets
are given (such as NGS data) [42]. Finding an Eulerian path
through the edges of a graph is known to be a polynomial
problem, and therefore can be executed more efficiently than
finding a Hamiltonian path.
Since there can be very many Eulerian paths in a graph,
constraints have to be added in order to find the path that
represents the actual genome. These constraints can make the
assembly much more difficult, and transform this polynomial
problem into an NP-hard problem, as shown in [34]. Also, de
Bruijn graphs are very sensitive to sequencing errors and
repeats, as they lead to new k-mers, adding to the graph
complexity [42]. Additionally, although the de Bruijn approach
can handle both Sanger and NGS data by dividing long reads
from Sanger sequencing into short k-mers, there is an effective
loss of long range connectivity information implied by each
read [6].
C. Other methods
Besides the described approaches, two other main methods
exist: greedy and hybrid. Greedy algorithms were common
in genome assemblers for Sanger data [6], and were part
of the first NGS assembly packages [7]. They operate in
simplest and most intuitive fashion: reads are joined together
into contigs iteratively, starting with the reads that have best
overlaps; the process is repeated until there are no more reads
or contigs that can be joined. This approach may not lead to a
globally optimal solution, for instance in a case where a contig
at hand takes on reads that, combined with another contig,
might have allowed that other contig to grow even larger. Examples
of assemblers that use the greedy approach include SSAKE [24]
that aggressively assembles short nucleotide sequences by progressively searching for the longest possible overlap between
sequences through a prefix tree; VCAKE [45] that implements
an iterative extension algorithm that, unlike its predecessors,
can incorporate imperfect matches during contig extension;
and SHARCGS [46] that adds pre- and postprocessing to the
SSAKE assembler, where at preprocessing it filters the raw
read set three times with different settings, constructs three
assemblies, and in the postprocessing step merges them.
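The greedy scheme described above can be sketched in a few lines: repeatedly merge the pair of sequences with the largest suffix-prefix overlap until nothing overlaps any more. This is a generic illustration, not the algorithm of any specific tool named here:

```python
def greedy_assemble(reads, min_overlap=1):
    """Iteratively merge the best-overlapping pair of sequences."""
    def ov(w, v):
        # Longest suffix of w that is a prefix of v (0 if none).
        for n in range(min(len(w), len(v)), 0, -1):
            if w[-n:] == v[:n]:
                return n
        return 0

    seqs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(seqs) > 1:
        best = None
        for a in seqs:
            for b in seqs:
                if a != b:
                    n = ov(a, b)
                    if n >= min_overlap and (best is None or n > best[0]):
                        best = (n, a, b)
        if best is None:
            break  # nothing overlaps: leave the remaining contigs as they are
        n, a, b = best
        seqs.remove(a)
        seqs.remove(b)
        seqs.append(a + b[n:])
    return seqs
```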
Hybrid assembly has twofold meaning. For one, it is
an approach where data obtained from several different sequencing platforms is simultaneously used for the assembly
of one genome [47]. The data is commonly acquired both
from: platforms with long reads, base-space and moderate
throughput (i.e. Roche 454), and platforms with short reads,
base- or color-space and high throughput (i.e. Solexa and SOLiD).
Implementations
Euler [21] was the first assembler based on the de Bruijn
approach, developed for Sanger reads, and later modified for
short Roche 454 reads, very short unpaired Illumina/Solexa
reads and paired-end Solexa reads. It first filters input reads by
detecting erroneous base calls through noting low-frequency k-mers, and then constructs two k-mer graphs at different values
of k, and compares them.
Velvet [37] is a collection of methods for assembly using
de Bruijn graphs. It consists of two parts: first, called Tour
Bus, removes sequencing errors and handles polymorphisms,
and second aimed to resolve repeats based on the available
information from low coverage long reads or paired shotgun reads. It applies a series of heuristics based on local
graph topology, coverage, sequence identity and paired-end
constraints, in order to reduce graph complexity.
AllPaths [22] was initially designed for large genome assembly using paired-end short reads obtained from Solexa sequencers, but now works both on small and large (mammalian
size) genomes. AllPaths partitions the graph to resolve repeats
6
SOLiD). Here, base-space and color space stand for notations
of DNA sequences outputted from DNA sequencers. In basespace, a DNA sequence is given plainly through the four
types of nucleotides (A, C, T, G), while in color-space each
base is coded by two values called colors, with each color
used in coding two neighbouring bases. Color-space coding
provides additional error checking during the sequencing
process. Example of an assembler that uses this type of the
hybrid approach is HybridVelvet [47] which presents a onestep pipeline incorporating Velvet, that can take any long basespace sequences and short color-space reads as inputs and
export assembled base-space contigs and scaffolds.
The second meaning of hybrid assembly is an approach that combines both greedy and graph-based assembly methods [25]. An example of such an assembler is Taipan [25]. It uses greedy extensions for contig construction, but at each step it also constructs enough of the corresponding read graph to make better decisions about how to continue the assembly.
Additionally, some assemblers use a suffix trie approach, as in [48], or an approach based on combinatorial algorithms, as in [49]. These and other approaches are seldom found in assembly tools.
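To make the k-mer graphs used by the de Bruijn tools above more concrete, here is a toy construction with a simple count-based error filter in the spirit of Euler's low-frequency k-mer filtering. This is a sketch; the function name and the threshold are mine:

```python
# Toy k-mer (de Bruijn) graph: nodes are (k-1)-mers, edges are k-mers;
# k-mers seen only rarely are discarded as likely sequencing errors.

from collections import Counter, defaultdict

def kmer_graph(reads, k, min_count=2):
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:                  # drop low-frequency (erroneous) k-mers
            graph[kmer[:-1]].add(kmer[1:])  # edge: (k-1)-mer prefix -> suffix
    return graph

# The third read carries an error (T -> A), so CGA and GAT occur only once
# and are filtered out; only the AC -> CG -> GT -> TT chain survives.
g = kmer_graph(["ACGTT", "ACGTT", "ACGAT"], k=3)
```

An assembly then corresponds to a path through this graph (an Eulerian path in the formulation of [21]), which is why repeat resolution reduces to untangling branching nodes.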
D. Comparison of implementations
The efficiency and performance of assemblers are generally assessed through the size and accuracy of the assembled contigs and scaffolds [8], and through resource consumption [35]. The resource consumption of an assembler includes the total processing time and RAM occupancy. The common size measurements include:
• N50 length - the longest length such that at least 50% of all base pairs are contained in contigs of this length or larger [35]. It provides a standard measure of assembly connectivity; higher N50 lengths indicate better performance of an assembler.
• Minimum contig length
• Maximum contig length
On the other hand, the accuracy of assemblies is generally difficult to measure, since these estimates can only be given in comparison to a known correct (benchmark) sequence, such as a reference genome or an artificially created data set. Accuracy measurements include:
• Sequence coverage - the percentage of the benchmark sequence covered by the output contigs.
• Assembly error rate - the output contigs are aligned to the benchmark sequence, and the number of mismatched bases is calculated. The assembly error rate is the percentage of mismatched bases out of the total number of aligned bases.
It is important to note that N50 lengths are only comparable between different assemblers when each is measured at the same combined length value [8]. In [35], the authors compared seven different assembly tools and discovered an interesting pattern in assembly connectivity, measured by N50 length and compared against the depth of coverage. Although N50 initially followed the increase of the depth of coverage, it reached a plateau (converged) once the depth of coverage exceeded a certain threshold [35]. To compare the N50 values among assemblers, the authors in [35] used the value of N50 at the depth of coverage where the N50s of all assemblers had converged.
A qualitative comparison of several assembly tools is given in Table I. More detailed descriptions of benchmarking procedures and results of various assembly tools can be found in comparative and review articles such as [6], [7], [8], [35], [50], [51].
IV. DISCUSSION ABOUT FUTURE PROSPECTS
With all the presented aspects of DNA assembly taken into account, there are several possible directions in which future research can progress beyond the state of the art. One such direction is the exploration of new assembly algorithms and of the data types that can be utilized, reducing computational complexity and increasing assembly accuracy. Assembly tools also have to become capable of handling and processing large amounts of data more rapidly, while reducing memory consumption. From the implementation point of view, two possible optimizations can be observed: for one, compression techniques can be applied to the sequence data, reducing storage and operational size; and for two, the computationally intensive task of assembly can be parallelized and deployed on cloud platforms. Clouds can provide researchers with scalable, on-demand resources, reusable workflows and reusability of results. CloudMan is an excellent example of an easy-to-use cloud manager platform that addresses the stated possibilities and has already been used for bioinformatics purposes [53]. To address the point about compression, an assembler that uses this approach has recently been published [52]. On the assembly of a C. elegans data set, SGA used only 4.5 GB of RAM, whereas ABySS required 14.1 GB, Velvet 23.0 GB and SOAPdenovo 38.8 GB [52]. However, SGA had a longer execution time than the other stated assemblers.
On a side note, an interesting announcement regarding sequencing technologies was recently made public. A company called Life Technologies released a press statement claiming that they have created a sequencing machine, the Benchtop Ion Proton™ Sequencer, capable of sequencing an entire genome in under a day for the price of $1000, and that they have already started taking orders for the device [54]. The $1000 genome barrier has long been a goal for the developers of genome sequencing technologies, as it is expected to bring genome sequencing to the masses through hospitals, medical institutions and research facilities.
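The N50 measure discussed above can be computed in a few lines. This is an illustrative sketch and the helper name is mine:

```python
# N50: the longest length L such that contigs of length >= L together
# contain at least 50% of all assembled base pairs.

def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the total assembled bases
            return length
    return 0

print(n50([80, 70, 50, 40, 30, 20]))  # total 290 bp; 80 + 70 >= 145 -> 70
```

Note that, as stated above, the value only becomes meaningful when the total assembled length is comparable between the assemblers, since a tool that outputs fewer total bases can reach the 50% mark with shorter contigs.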
V. CONCLUSION
While the length of the sequences that can be read at once is still limited, de novo assembly methods remain irreplaceable. Currently, they provide the only means to discover new, previously unknown sequences, which is essential for the characterization of the biological diversity of our world [8]. However, although genome assembly is the only way to obtain the information
contained in the genetic code of an organism, the complexity and accuracy of current assembly methods are still greatly influenced by read errors and imprecise algorithms, as well as by the number and size of the reads obtained from sequencing.
One can say that the areas of sequencing, assembly and bioinformatics in general have never been as vivid and interesting as they are now, while also providing great opportunities for research that will find real-life applications and, ultimately, affect our lives for the better.

TABLE I: Comparison of performance and quality of several assembly tools on sequences obtained from a C. elegans data set. Results were taken from [52].

                                       SGA              Velvet           ABySS            SOAPdenovo
Scaffold N50 size                      26.3 kbp         31.3 kbp         23.8 kbp         31.1 kbp
Aligned contig N50 size                16.8 kbp         13.6 kbp         18.4 kbp         16.0 kbp
Mean aligned contig size               4.9 kbp          5.3 kbp          6.0 kbp          5.6 kbp
Reference bases covered                96.2 Mbp         94.8 Mbp         95.9 Mbp         95.1 Mbp
Mismatch rate at all assembled bases   1 per 21,545 bp  1 per 8,786 bp   1 per 5,577 bp   1 per 26,585 bp
Contigs with split/bad alignment       458 (4.4 Mbp)    787 (7.2 Mbp)    638 (9.1 Mbp)    483 (4.4 Mbp)
Total CPU time                         41 h             2 h              5 h              13 h
Maximum memory usage                   4.5 GB           23.0 GB          14.1 GB          38.8 GB
ACKNOWLEDGEMENTS
The author would like to acknowledge the support of the scientific research project "Methods of scientific visualization" (098-0982562-2567), funded by the Ministry of Science, Education and Sports of the Republic of Croatia. Additionally, the author would like to thank prof. dr. sc. Karolj Skala and doc. dr. sc. Mile Šikić for their support and guidance during the work on this project.

REFERENCES
[1] E. Pettersson, J. Lundeberg, and A. Ahmadian, “Generations of sequencing technologies,” Genomics, vol. 93, no. 2, pp. 105–111, Feb. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.ygeno.2008.10.003
[2] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data,” Genomics, vol. 11, no. 6, pp. 315–327, 2010. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/20211242
[3] J. T. Simpson, K. Wong, S. D. Jackman et al., “ABySS: A parallel assembler for short read sequence data,” Genome Research, vol. 19, no. 6, pp. 1117–1123, Jun. 2009. [Online]. Available: http://dx.doi.org/10.1101/gr.089532.108
[4] B. Alberts, A. Johnson, J. Lewis, M. Raff, and K. Roberts, Molecular Biology of the Cell. Garland Science, 2008.
[5] S. Taudien, I. Ebersberger, G. Glöckner, and M. Platzer, “Should the draft chimpanzee sequence be finished?” Trends in Genetics, vol. 22, no. 3, pp. 122–125, 2006. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/16430990
[6] M. Pop, “Genome assembly reborn: recent computational challenges,” Briefings in Bioinformatics, vol. 10, no. 4, pp. 354–366, 2009. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/19482960
[7] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data,” Genomics, vol. 95, no. 6, pp. 315–327, Jun. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.ygeno.2010.03.001
[8] S. Bao, R. Jiang, W. Kwan, X. Ma, and Y.-Q. Song, “Evaluation of next-generation sequencing software in mapping and assembly,” Journal of Human Genetics, vol. 56, no. 6, pp. 406–414, Jun. 2011. [Online]. Available: http://dx.doi.org/10.1038/jhg.2011.43
[9] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome Research, vol. 8, pp. 175–185, 1998.
[10] Applied Biosystems, “Applied Biosystems SOLiD 3 Plus System: De Novo Assembly Protocol,” Life Technologies Corporation, Tech. Rep., 2010.
[11] L. Jorde, Encyclopedia of Genetics, Genomics, Proteomics, and Bioinformatics: 8 Volume Set. John Wiley & Sons Ltd., 2005. [Online]. Available: http://books.google.hr/books?id=fytFAQAAIAAJ
[12] S. Koren, J. Miller, B. Walenz, and G. Sutton, “An algorithm for automated closure during assembly,” BMC Bioinformatics, vol. 11, no. 1, p. 457, 2010. [Online]. Available: http://www.biomedcentral.com/1471-2105/11/457
[13] Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor Laboratory Press, Jul. 2004.
[14] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.
[15] W. J. Kent, “BLAT—the BLAST-like alignment tool,” Genome Research, vol. 12, no. 4, pp. 656–664, Apr. 2002.
[16] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-mouse alignments with BLASTZ,” Genome Research, vol. 13, no. 1, pp. 103–107, 2003.
[17] D. Edwards, J. Stajich, and D. Hansen, Bioinformatics: Tools and Applications. Springer Science+Business Media, 2009.
[18] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, “A whole-genome assembly of Drosophila,” Science, vol. 287, no. 5461, pp. 2196–2204, Mar. 2000. [Online]. Available: http://dx.doi.org/10.1126/science.287.5461.2196
[19] “Arachne: A whole-genome shotgun assembler,” vol. 12.
[20] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y. J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. Alenquer, T. P. Jarvie, K. B. Jirage, J. B. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg, “Genome sequencing in microfabricated high-density picolitre reactors,” Nature, vol. 437, no. 7057, pp. 376–380, Sep. 2005.
[21] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian path approach to DNA fragment assembly,” Proceedings of the National Academy of Sciences, vol. 98, no. 17, pp. 9748–9753, Aug. 2001. [Online]. Available: http://dx.doi.org/10.1073/pnas.171285098
[22] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe, “ALLPATHS: De novo assembly of whole-genome shotgun microreads,” Genome Research, vol. 18, no. 5, pp. 810–820, 2008. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/18340039
[23] D. R. Zerbino and E. Birney, “Velvet: Algorithms for de novo short read assembly using de Bruijn graphs,” Genome Research, vol. 18, no. 5, pp. 821–829, May 2008. [Online]. Available: http://dx.doi.org/10.1101/gr.074492.107
[24] “Assembling millions of short DNA sequences using SSAKE,” Bioinformatics, vol. 23, no. 4, pp. 500–501, Feb. 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btl629
[25] B. Schmidt, R. Sinha, B. Beresford-Smith, and S. J. Puglisi, “A fast hybrid short read fragment assembly algorithm,” Bioinformatics, vol. 25, pp. 2279–2280, Sep. 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1596446.1596453
[26] V. K. Kundeti, S. Rajasekaran, H. Dinh, M. Vaughn, and V. Thapar, “Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs,” BMC Bioinformatics, vol. 11, no. 1, p. 560, 2010. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-11-560
[27] W. Shi and W. Zhou, “A parallel Euler approach for large-scale biological sequence assembly,” in International Conference on Information Technology and Applications, vol. 1, pp. 437–441, 2005.
[28] X. Yang, D. Medvin, G. Narasimhan, D. Yoder-Himes, and S. Lory, “CLOG: A pipeline for closing gaps in a draft assembly using short reads,” in Proceedings of the 2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences, ser. ICCABS ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 202–207. [Online]. Available: http://dx.doi.org/10.1109/ICCABS.2011.5729881
[29] E. Arner, Solving Repeat Problems in Shotgun Sequencing, 2006. [Online]. Available: http://books.google.hr/books?id=eBMINAAACAAJ
[30] I. Tsai, T. Otto, and M. Berriman, “Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps,” Genome Biology, vol. 11, no. 4, p. R41, 2010. [Online]. Available: http://genomebiology.com/2010/11/4/R41
[31] D. G. C. Gordon and P. Green, “Consed: A graphical tool for sequence finishing,” Genome Research, vol. 8, pp. 195–202, 1998.
[32] J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton, “Aggressive assembly of pyrosequencing reads with mates,” Bioinformatics, vol. 24, pp. 2818–2824, Dec. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1520805.1520818
[33] J. Wetzel, C. Kingsford, and M. Pop, “Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies,” BMC Bioinformatics, vol. 12, no. 1, p. 95, 2011. [Online]. Available: http://www.biomedcentral.com/1471-2105/12/95
[34] P. Medvedev, K. Georgiou, G. Myers, and M. Brudno, “Computability of models for sequence assembly,” in WABI, 2007, pp. 289–301.
[35] Y. Lin, J. Li, H. Shen, L. Zhang, C. J. Papasian, and H.-W. Deng, “Comparative studies of de novo assembly tools for next-generation sequencing technologies,” Bioinformatics, vol. 27, no. 15, pp. 2031–2037, Jun. 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr319
[36] D. Davendra, Ed., Traveling Salesman Problem, Theory and Applications. Rijeka: InTech, 2010.
[37] D. R. Zerbino, “Genome assembly and comparison using de Bruijn graphs,” 2009.
[38] D. Hernandez, P. François, L. Farinelli, M. Østerås, and J. Schrenzel, “De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer,” Genome Research, vol. 18, no. 5, pp. 802–809, 2008. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/18332092
[39] M. S. Hossain, N. Azimi, and S. Skiena, “Crystallizing short-read assemblies around seeds,” BMC Bioinformatics, no. S-1.
[40] X. Liu, P. R. Pande, H. Meyerhenke, and D. A. Bader, “PASQUAL: A parallel de novo assembler for next generation genome sequencing.” [Online]. Available: http://www.cc.gatech.edu/pasqual/index.html
[41] M. Ahmed, I. Ahmad, and S. U. Khan, “A comparative analysis of parallel computing approaches for genome assembly,” Interdisciplinary Sciences: Computational Life Sciences, vol. 3, no. 1, pp. 57–63, Mar. 2011. [Online]. Available: http://dx.doi.org/10.1007/s12539-011-0062-0
[42] K. Löwe, “Mapping of WDLPS neochromosomes,” Otto-von-Guericke-Universität Magdeburg, Tech. Rep., Nov. 2010.
[43] S. Boisvert, F. Laviolette, and J. Corbeil, “Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies,” Journal of Computational Biology, vol. 17, no. 11, pp. 1519–1533, 2010. [Online]. Available: http://www.liebertonline.com/doi/abs/10.1089/cmb.2009.0238
[44] [Online]. Available: http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail
[45] “Extending assembly of short DNA sequences to handle error,” Bioinformatics, vol. 23, no. 21, pp. 2942–2944, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btm451
[46] “SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing,” Genome Research, vol. 17, no. 11, pp. 1697–1706, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1101/gr.6435207
[47] F. M. You, Y. Wang, Y. Q. Gu, M.-C. Luo, N. Huo, K. R. Deal, J. Lee, G. R. Lazo, J. D., and O. D. Anderson, “HybridVelvet: de novo hybrid assembly of next generation sequencing data with Velvet,” Plant & Animal Genomes XIX Conference, Jan. 2011.
[48] C. Barnes, G. Dalgleish, G. Jared, K. Swift, and S. Tate, Trie-based Data Structures for Sequence Assembly. Springer, 1997, pp. 206–223. [Online]. Available: http://www.springerlink.com/index/31617V66L054T702.pdf
[49] J. Kececioglu and E. Myers, “Combinatorial algorithms for DNA sequence assembly,” Algorithmica, vol. 13, no. 1, pp. 7–51, Feb. 1995. [Online]. Available: http://dx.doi.org/10.1007/BF01188580
[50] M. Ruffalo, T. LaFramboise, and M. Koyutürk, “Comparative analysis of algorithms for next-generation sequencing read alignment,” Bioinformatics, vol. 27, no. 20, pp. 2790–2796, Oct. 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr477
[51] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies,” vol. 6, no. 3, 2011.
[52] J. T. Simpson and R. Durbin, “Efficient de novo assembly of large genomes using compressed data structures,” Genome Research, Dec. 2011. [Online]. Available: http://dx.doi.org/10.1101/gr.126953.111
[53] E. Afgan, D. Baker, N. Coraor, H. Goto, I. M. Paul, K. D. Makova, A. Nekrutenko, and J. Taylor, “Harnessing cloud computing with Galaxy Cloud,” Nature Biotechnology, vol. 29, no. 11, pp. 972–974, Nov. 2011. [Online]. Available: http://dx.doi.org/10.1038/nbt.2028
[54] [Online]. Available: http://www.lifetechnologies.com/global/en/home/about-us/news-gallery/press-releases/2012/life-technologies-introduces-the-benchtop-ion-proton.html