DE-NOVO SEQUENCING OF GENOMES USING UP TO FIVE
DIFFERENT TYPES OF GENOMIC DNA LIBRARIES
Dr. Georg Gradl, Dr. Axel Strittmatter, Dr. Sascha Glinka,
Eurofins Genomics, Anzinger Str. 7a, 85560 Ebersberg, Germany
ABSTRACT
This application note describes a strategy for de-novo sequencing of prokaryotic, fungal and diploid higher eukaryotic genomes
of any size. Routinely, up to 5 different non-cloned genomic DNA libraries are prepared and sequenced with Roche GS FLX Titanium
series chemistry using massively parallel sequencing technology. The libraries are a combination of one Shotgun library (SG) and
various Long Paired End libraries (LPE) with different jumping distances (3 kb, 8 kb, 20 kb and up to 40 kb). After sequencing, the
resulting reads are assembled into large contigs (> 1 kb); they provide not only the genetic sequence but also the necessary scaffolding information. In a final step, remaining gaps are closed by preparing PCR fragments, followed by double-stranded Sanger
sequencing and assembly.
INTRODUCTION
Next Generation Sequencing technology enables the de-novo sequencing of any kind of genome in a fast and efficient way. In
order to achieve the optimum result in a reasonable time and at affordable cost, we have developed a strategy which combines
sequencing and scaffolding of up to 5 different types of libraries. This setup allows us to directly span repeated genetic regions of
up to 40 kb. Therefore the assembly of data not only orders the hundreds or thousands of contigs, but also drastically cuts down
gap closing and manual editing times. After assembly, and with the scaffolding information, it is possible to automatically design
PCR primer pairs for adjacent contig ends, perform PCR on the genomic DNA and sequence PCR fragments by Sanger technology
in 96-well format. The resulting sequence reads are incorporated into the original assembly, thereby closing the gaps. The same
strategy can also be used to verify doubtful sequences, mainly in areas of low coverage or of suspected
mis-assembly.
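The primer selection for adjacent contig ends can be sketched as follows. This is a minimal illustration under stated assumptions, not the primer design software used for the work described here: the function names, the 20-base primer length and the 100-base offset from the contig end are hypothetical, and a real design would also check melting temperature, GC content and repeat content.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def gap_primers(left_contig: str, right_contig: str,
                primer_len: int = 20, offset: int = 100) -> tuple[str, str]:
    """Pick one primer pair flanking the gap between two adjacent contigs.

    The forward primer is a window of the left contig ending `offset`
    bases before the contig end. The reverse primer is the reverse
    complement of a window starting `offset` bases into the right contig.
    Both default values are illustrative assumptions.
    """
    if min(len(left_contig), len(right_contig)) < offset + primer_len:
        raise ValueError("contig too short for the requested offset")
    fwd_start = len(left_contig) - offset - primer_len
    forward = left_contig[fwd_start:fwd_start + primer_len].upper()
    reverse = reverse_complement(right_contig[offset:offset + primer_len])
    return forward, reverse
```

Keeping the primers some distance away from the contig ends is a deliberate choice in this sketch, since contig ends tend to have lower coverage and less reliable sequence.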
MATERIAL AND METHODS
DNA Preparation
High molecular weight DNA should be prepared with one of the commercially available kits or with the customer’s in-house lab protocol. In
cases where the DNA was prepared with phenol and/or chloroform, we include an additional purification step to avoid loss of enzymatic
activity during the subsequent library preparation steps.
Library Preparation
All libraries for this sequencing strategy are prepared according to
the recommendations and protocols of Roche/454.
Shotgun libraries (SG)
High molecular weight genomic DNA is shotgun fragmented using
the Roche/454 GS Rapid Library Prep Kit and nebulizers provided
with the kit. Further library preparation is performed according to
“GS FLX Titanium Rapid Library Preparation Method Manual”.
Long Paired End libraries (LPE)
Genomic DNA is fragmented into the appropriate fragment sizes (3 kb, 8 kb, 20 kb and up to 40 kb) using the
HydroShear™ DNA Shearing Device (GeneMachines). Further library
preparation is performed according to “GS FLX Titanium Paired
End Library Prep 3kb Span Method Manual” and “GS FLX Titanium
Paired End Library Prep 20+8kb Span Method Manual”. See Fig.1
eurofinsgenomics.com
Fig.1 Preparation of a Long Paired End Library (LPE) with 3, 8,
20 or 40 kb spanning distance
DNA Sequencing
Sequencing is performed on Roche/454 GS FLX systems equipped with on-instrument Data Analysis Software Modules v.2.3.
Assembly and Scaffolding
After sequencing, data assembly and scaffolding are performed
with appropriate hardware and adapted software solutions. Depending on the amount of data produced, we use either
the current version of the Roche Genome Assembler (formerly
known as “Newbler”), the Celera Assembler Version 6.1 or MIRA
Version 3.2. All sequencing data are processed on one of several
multiprocessor computer systems ranging from 8 cores with 32 GB
of RAM to 32 cores with 1024 GB of RAM. See Fig.2 and Fig.3
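How paired end reads contribute scaffolding information can be illustrated with a short sketch. It is not the algorithm of any of the assemblers named above. It only counts read pairs whose two ends were placed in different contigs and keeps links supported by a minimum number of pairs. The input format and the `min_pairs` threshold are assumptions made for illustration.

```python
from collections import Counter


def contig_links(pair_placements, min_pairs=2):
    """Count paired-end links between different contigs.

    `pair_placements` is an iterable of (contig_a, contig_b) tuples, one
    per read pair, naming the contig in which each end was placed. Links
    seen fewer than `min_pairs` times are discarded as unreliable.
    """
    counts = Counter()
    for contig_a, contig_b in pair_placements:
        if contig_a != contig_b:  # pairs inside one contig add no link
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {link: n for link, n in counts.items() if n >= min_pairs}
```

For example, `contig_links([("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")])` keeps only the link between `c1` and `c2`, because the link between `c2` and `c3` is supported by a single pair.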
Strategy depends on genome size
With the library portfolio described above, Eurofins Genomics can
address any larger diploid eukaryotic genome, e.g. complex
and higher plants, but also fungal, algal or insect samples. For any
of these large and complex genomes, we routinely
use our 5 library strategy, which also includes the preparation of
several copies of each type of the described libraries. This is done
to avoid bias in the chromosomal fragments used for each library,
thereby covering the entire genetic information.
Fig.2 Assembly and scaffolding of a combination of shotgun
(SG) and long paired end (LPE) libraries. Gap closing made easy!
For medium size genomes, such as fungi of 40 Mb, we recommend a
2 – 3 library setup (shotgun library plus 8 kb library, optionally with
a 3 kb or 20 kb library). For smaller fungal genomes or large bacterial genomes, a 2 library setup (shotgun and 8 kb) is likely to be sufficient.
Small or non-complex bacterial genomes (not including some extremophiles) are sequenced with a 1 library strategy, that is, a long
paired end library with 8 kb jumping distance.
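The recommendations above can be summarised as a small lookup. This is a sketch only: the category labels are hypothetical shorthand for the genome classes named in the text, and since the text gives no exact size thresholds, none are encoded here.

```python
# Hypothetical category labels mapped to the library setups described above.
LIBRARY_SETUPS = {
    "large_eukaryote": ["SG", "LPE3kb", "LPE8kb", "LPE20kb", "LPE40kb"],
    "medium_genome": ["SG", "LPE8kb"],
    "small_fungus_or_large_bacterium": ["SG", "LPE8kb"],
    "small_bacterium": ["LPE8kb"],
}

# Only the medium size setup is described with an optional third library.
OPTIONAL_EXTRAS = {"medium_genome": {"LPE3kb", "LPE20kb"}}


def recommended_libraries(category, extra=None):
    """Return the library list for a genome category, plus an optional extra."""
    libraries = list(LIBRARY_SETUPS[category])
    if extra is not None:
        if extra not in OPTIONAL_EXTRAS.get(category, set()):
            raise ValueError(f"{extra} is not an optional library for {category}")
        libraries.append(extra)
    return libraries
```

Calling `recommended_libraries("medium_genome", extra="LPE20kb")` returns the shotgun library, the 8 kb library and the 20 kb library, which corresponds to the 3 library variant of the medium size setup.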
RESULTS AND DISCUSSION
Depending on the complexity and size of the genome, we select the
appropriate strategy for our sequencing approach.
Example 1: Sequencing of an unknown 11.3 Mb fungal species with the
4 library protocol (shotgun, LPE3kb, LPE8kb, LPE20kb). Each library
was sequenced on a quarter picotiter plate of the Roche GS FLX sequencer, resulting in 992,611 reads with a very even distribution of
reads per library (SG 25.6%, LPE3kb 24.3%, LPE8kb 26.8%, LPE20kb
23.3%). The draft assembly and scaffolding resulted in 11 large
contigs (see Fig.4). The largest scaffold was 3,020,278 base pairs long,
and manual inspection showed that it linked two chromosomes.
After resolving this, we had 12 large scaffolds representing the 11 chromosomes and a plasmid. The mitochondrial
sequence was not correctly assembled under the selected conditions; it was, however, possible to resolve the correct structure
during the manual editing phase.
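The read distribution in Example 1 can be converted into approximate read counts per library. The sketch below uses only the total and the percentages quoted in the example. The counts are rounded, so they may not add up exactly to the total.

```python
# Figures taken from Example 1: total reads and percentage share per library.
TOTAL_READS = 992_611
SHARE_PERCENT = {"SG": 25.6, "LPE3kb": 24.3, "LPE8kb": 26.8, "LPE20kb": 23.3}


def reads_per_library(total_reads, share_percent):
    """Convert percentage shares into approximate (rounded) read counts."""
    return {name: round(total_reads * pct / 100)
            for name, pct in share_percent.items()}


if __name__ == "__main__":
    for name, count in reads_per_library(TOTAL_READS, SHARE_PERCENT).items():
        print(f"{name}: about {count:,} reads")
```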
Fig.3 Screenshot of a contig, coassembled out of SG, LPE3kb,
LPE8kb and LPE20kb reads. Contig visualised with GAP5 (Staden
Package version 2.0)
Fig.4 First draft assembly and scaffold of the yeast genome.
Scaffold 1 could be resolved into 2 chromosomes that had been
accidentally linked together.
Example 2: Sequencing of a highly GC rich and repetitive Streptomyces spec. with 4 libraries (shotgun, LPE3kb, LPE8kb, LPE20kb).
Each library was sequenced on a quarter picotiter plate of the
Roche GS FLX sequencer, resulting in approx. 250,000 reads per
library with a very even distribution of reads (SG 28.5%, LPE3kb
23.7%, LPE8kb 23.8%, LPE20kb 24.0%). The sequencing revealed a
genome size of 8.8 Mb. Assembly and scaffolding resulted in 118 large contigs in 1 large single scaffold of 8,760,064 bases with a total
gap length of 31 kb.
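To put the remaining gap length in Example 2 into perspective, the figures can be turned into a gap share and a mean gap size. The sketch assumes a single linear scaffold, so that 118 contigs are separated by 117 gaps. That assumption is ours and is not stated in the example. Under it, the gaps make up about 0.35% of the scaffold and average roughly 265 bases each, a size that a single PCR fragment can span.

```python
SCAFFOLD_LENGTH = 8_760_064   # bases, from Example 2
TOTAL_GAP_LENGTH = 31_000     # bases (31 kb), from Example 2
LARGE_CONTIGS = 118           # from Example 2


def gap_summary(scaffold_length, total_gap_length, contigs):
    """Return (gap share of the scaffold in percent, mean gap size in bases).

    Assumes one linear scaffold, so `contigs - 1` gaps separate the contigs.
    """
    gaps = contigs - 1
    gap_percent = 100 * total_gap_length / scaffold_length
    mean_gap = total_gap_length / gaps
    return gap_percent, mean_gap
```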
Example 3: Sequencing of a 3.1 Mb beta-Proteobacterium. Cost-efficient sequencing was carried out with a single LPE8kb library only.
Owing to the way the library is produced and sequenced, an LPE library delivers 3 types of sequences: true paired ends, paired end reads without a
partner, and shotgun-like reads (see Fig.5). These reads can be assembled into contigs and ordered into scaffolds at the same time (see Fig.6 and Fig.7).
The sequencing resulted in 21 large contigs, 18 of which were placed in 5 scaffolds. The largest contig was 650 kb long.
This allowed straightforward gap closing.
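The three read types of an LPE library can be told apart by where the adapter (linker) sits in the read. The sketch below is a simplified illustration, not the processing performed by the Roche software. The exact-match search, the `min_flank` threshold of 15 bases and the linker sequence passed in by the caller are all assumptions.

```python
def classify_lpe_read(read: str, linker: str, min_flank: int = 15) -> str:
    """Classify a read from a long paired end library.

    - no linker found: the read behaves like a shotgun read
    - linker with enough sequence on both sides: a true paired end
    - linker too close to one end: only one usable side, i.e. a
      paired end read without a partner
    """
    pos = read.find(linker)
    if pos == -1:
        return "shotgun_like"
    left_flank = pos
    right_flank = len(read) - pos - len(linker)
    if left_flank >= min_flank and right_flank >= min_flank:
        return "true_pair"
    return "pair_without_partner"
```

All three classes remain useful: true pairs carry the distance information for scaffolding, while the other two classes still contribute sequence coverage to the contigs.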
Example 4: Sequencing of numerous other species with the LPE8kb strategy results, in most cases, in a manageable number of
contigs and scaffolds. The outcome depends strongly on the complexity and G/C content of the genome. If the results are not
sufficient, additional shotgun sequencing improves the quality immediately (see Table 1).
Fig.5 Because the adapter is not always in the middle of the
fragment, an LPE library delivers 3 types of sequences.
Fig.6 Sequencing and scaffolding with one LPE8kb library
Fig.7 A typical result for an LPE only approach
Organism               Genome Size   Coverage (fold)   Large Contigs   Scaffolds
Cyanobacterium spec.   8.6 Mb        28                198             24
Helicobacter spec.     2.1 Mb        27                11              9
Clostridium spec.      3.9 Mb        35                98              23
Clostridium spec.      3.9 Mb        44                91              27
E. coli spec.          4.5 Mb        28                18              11
Lactobacillus spec.    3.1 Mb        45                14              7
Lactobacillus spec.    2.4 Mb        37                24              22
Fungal Genome          42.9 Mb       20                231             32
Table 1 Examples of genomes that have been sequenced with one
LPE8kb library only
OUTLOOK
We are currently converting all of the protocols described above for use with the Illumina sequencing technology. After the final evaluation steps, the described sequencing strategy will give way to de novo sequencing of genomes on the Illumina HiSeq 2000 sequencer,
which will also deliver highly accurate and correctly assembled data sets. This will in turn further decrease the price per sample, as the data
output of the Illumina HiSeq 2000 makes it possible to sequence several samples in parallel. New sequencing targets, e.g. complex
communities, might also be addressed in the near future.
CONCLUSION
It is possible and affordable to routinely sequence and assemble haploid or diploid genomes of any size with the technology described
above. Further improvements will open the way to new applications and to a new sequencing platform.
CONTACT
Global Sales Manager Next Generation Sequencing
Dr. Georg Gradl
Tel. +49 8092 8289-945
European Sales Manager Next Generation Sequencing
Dr. Axel Strittmatter
Tel. +49 8092 8289-972
GS FLX is a trademark of Roche.
Illumina and HiSeq are trademarks of Illumina, Inc.