drVM: Detection and Reconstruction of Viral Genomes from

1
drVM: Detection and Reconstruction of Viral Genomes from Metagenomes
2
3
Hsin-Hung Lin and Yu-Chieh Liao*
4
5
Institute of Population Health Sciences, National Health Research Institutes, Miaoli,
6
Taiwan
7
8
Running title: Viral metagenome analysis
9
10
*Correspondence to Yu-Chieh Liao: [email protected]
11
E-mail address for Hsin-Hung Lin: [email protected]
12
13
14
15
1
16
ABSTRACT
17
Background
18
The discovery of new or divergent viruses using high-throughput next-generation
19
sequencing (NGS) has become more commonplace. However, although analysis of deep
20
NGS data allows us to identity potential pathogens, the entire analytical procedure
21
requires competency in the bioinformatics domain, which include implementing proper
22
software packages and preparing prerequisite databases. Simple and user-friendly
23
bioinformatics pipelines are urgently required to obtain complete viral genome sequences
24
from metagenomic data.
25
Results
26
This paper presents a pipeline, drVM (detection and reconstruction of viral genomes from
27
metagenomes), for rapid viral read identification, genus-level read partition, read
28
normalization, de novo assembly, sequence annotation and coverage profiling. The
29
feasibility of drVM was validated via the analysis of over 300 sequencing runs generated
30
by Illumina and Ion Torrent technologies to detect and reconstruct a variety of virus types
31
including DNA viruses, RNA viruses and retroviruses. drVM is available for free
32
download at: https://sourceforge.net/projects/sb2nhri/files/drVM/ and it is also assembled
33
as a Docker container and a virtual machine to facilitate seamless deployment.
2
34
Conclusions
35
drVM was compared with other viral detection tools to demonstrate its merits in terms of
36
viral genome completeness and reduced computation time. This substantiates the
37
platform’s potential to produce prompt and accurate viral genome sequences from clinical
38
samples.
39
40
Keywords
41
Next-generation
42
reconstruction
sequencing
(NGS),
bioinformatics,
3
metagenomics,
detection,
43
BACKGROUND
44
Viruses are the most abundant biological entities on Earth and are found among all
45
cellular forms of life including animals, plants, bacteria, and fungi. More than 4,500 viral
46
species have been discovered; their sequence information has been collected by
47
researchers [1-3]. Viruses have caused some of the most dramatic and deadly disease
48
pandemics in human history and viral disease outbreaks tend to occur every several years.
49
Over the past two decades, avian influenza H5N1 virus [4], SARS coronavirus, H1N1
50
pandemic, MERS coronavirus [5], Ebola virus [6] and Zika virus [7] have emerged in the
51
human population. During such outbreaks, identification of the causative agent and
52
comparative genome analysis is of cornerstone importance for disease surveillance and
53
epidemiology. Unbiased next-generation sequencing (NGS) is emerging as an attractive
54
approach for viral identification from varied samples, including blood, feces, sputum and
55
other swab samples [8, 9]. The technique holds the promise to aid in the identification of
56
potential pathogens in a single assay without a prior knowledge of the target [10].
57
58
SURPI [10] and Taxonomer [11] are pathogen detection tools that have been proposed to
59
analyze metagenomic NGS data for comprehensive diagnostic applications. However,
60
both tools are incapable of complete viral genome assembly. VIP has pursues the same
4
61
strategy as SURPI-subtraction to identification-to identify viral reads but provides an
62
alternative approach to assembly - classification to assembly - for improved viral
63
genome assembly [12]. Although VIP generates phylogenetic trees, thereby facilitating
64
the visualization of genealogy between a candidate virus and existing reference
65
sequences, it does not produce assembled viral sequences. Moreover, ease-of-use
66
challenges associated with operating SURPI and VIP impede their applications in most
67
laboratories and are often accessible only by trained personnel. VirusTAP is a web-based
68
integrated NGS analysis tool for viral genome assembly from metagenomic reads [13].
69
This user-friendly tool enables users to obtain viral genomes more easily by just
70
uploading raw NGS reads and clicking several selections. However, VirusTAP only
71
accepts Illumina data and does not appear to be fully functional, as evaluated via various
72
viral metagenomic reads. Therefore, tools for metagenomics analyses are urgently
73
required that are substantially more computationally efficient, accurate (for compete viral
74
genome assembly), and easy-to-use.
75
76
Here we present drVM (“detection and reconstruction of viral genomes from
77
metagenomes”), a bioinformatics pipeline that firstly provides rapid classification of
78
NGS reads generated by Illumina or Ion Torrent sequencing technology against a viral
5
79
database, then partitions the viral reads into genus groups, and finally assembles viral
80
genomes from the corresponding genus-level reads. For ease of deployment, a Docker
81
container [14] and virtual machine [15] image were created for drVM. The performance
82
of the platform was evaluated in the analysis of 349 sequence data in the Sequence Read
83
Archive (SRA) from 18 independent studies [8-10, 13, 16-29]. These data sets encompass
84
a variety of sample types, viruses, and sequencing depths. drVM was demonstrated to be
85
highly adept at the detection and reconstruction of various viral genomes and
86
concomitantly outperforms other analytical pipelines including SURPI, VIP and
87
VirusTAP.
88
6
89
MATERIALS AND METHODS
90
Reference databases
91
Viral nucleotide sequences were obtained from the Nucleotide database of the National
92
Center for Biotechnology Information (NCBI) using the query term “(complete[title])
93
AND (viridae[organism])”; this query resulted in 642,079 hits (as of 16 March 2016).
94
Based on each sequence identifier (accession number) of the viral sequence, its
95
taxonomic ID, scientific name and genus-level annotation were separately obtained from
96
the NCBI Taxonomy database. Viral sequences with taxonomy ID not included in the
97
virus division (division id = 9), those with taxonomic information absent, and those
98
lacking genus-level annotation, were labeled “nonVira”, “noTax”, and “noGenus”,
99
respectively. Those sequences were excluded from database construction. The remaining
100
viral sequences were utilized as rawDB. Three hundred and seventy-seven viral genomes
101
were obtained by filtering with host “human” from the NCBI viral genome resource to
102
retrieve 684 sequences (1 March 2016) [1]. The viral sequences were segmented into
103
their corresponding genus level for refDB; the corresponding refDB was uploaded to
104
SourceForge
105
(https://sourceforge.net/projects/sb2nhri/files/drVM/refDB.tar.gz)
106
alignments and reduce memory requirements, rawDB was split into 8 sub-databases, each
for
later
7
use
To
increase
read
107
contained sequences with total length no greater than 200 Mbp. SNAP (version 0.15.4)
108
[30], possessing a default seed size of 20, was used to generate index tables
109
(snap_index_virus) for the viral sub-databases. In addition, a BLAST database
110
(blast_index_virus) was created by makeblastdb (BLAST 2.2.28+) [31] from rawDB to
111
enable contig annotation.
112
113
Implementation of drVM
114
The drVM pipeline is implemented in Python and incorporates several open-source tools,
115
including BLAST, SNAP and SPAdes [32]. The package comprises two modules:
116
CreateDB.py and drVM.py. A schematic flowchart of drVM is shown in Fig. 1. In
117
CreateDB.py, viral sequences (provided by the user) are processed to produce rawDB for
118
SNAP and BLAST database construction, and refDB is downloaded from the
119
SourceForge link. drVM.py takes single- or paired-end reads as input. It aligns reads to
120
SNAP DB and, accordingly, identifies viral reads using the following parameters: snap
121
single -x -h 250 -d 24 -n 25 -F a. Based on taxonomic information (genus-level
122
annotation) of the aligned sequence, the viral reads are partitioned into genus-level
123
groups. Each group is labeled according to its corresponding genus. By examining the
124
aligned sequences with reads, sequence coverage and read depth are calculated.
8
125
Maximum coverage and average depth for a genus are estimated by taking all aligned
126
sequences corresponding to that genus into consideration. Viral reads in a genus with
127
average depth ≧ min depth (default =1) undergo de novo assembly via SPAdes (v.3.6.1).
128
Prior to assembly, digital normalization [33] in khmer [34] is employed to remove reads
129
with high-abundance (default = 100) k-mers. Using BLAST, each genus-level assembly
130
is annotated to its closest reference against refDB using blastn (identity ≧ 80%). Each
131
reference is examined to confirm that it contains an at least 50% alignment rate. If this
132
procedure returns no hits, the assembly is annotated by BLAST against rawDB. Reads
133
are aligned to contigs by means of SOAP2 [35] and the read-covered contigs are placed
134
into a coverage plot with reference-guide coordinates. To increase contig continuity, reads
135
aligned to contigs corresponding to a reference are extracted for pairs and those paired
136
reads are re-assembled. Please note this re-assembly process is only performed for
137
paired-end reads.
138
139
Metagenomic datasets
140
A total of 349 runs in the Sequence Read Archive (SRA) were downloaded (see Table S1).
141
The metagenome datasets based on prior studies [8-10, 13, 16-29] are summarized in
142
Table 1. Each sequence run was analyzed by drVM: drVM.py -1 read1.fastq -2
9
143
read2.fastq -t 16 (-type iontorrent for Ion Torrent datasets) on a server (Intel Xeon
144
E7-4820, 2.00GHz with 256 GB of RAM).
10
145
RESULTS
146
drVM pipeline
147
drVM was designed to detect and reconstruct viral genomes from metagenomes. As
148
described in the Materials and Methods, it contains two modules: CreateDB.py and
149
drVM.py (Fig. 1). The viral sequences downloaded from NCBI were processed by
150
CreateDB.py to produce SNAP and BLAST databases. Taxonomy information was also
151
automatically extracted from NCBI. In drVM.py, metagenomic reads that were aligned to
152
complete viral sequences were firstly partitioned into genus-level groups to facilitate de
153
novo assembly. Each genus-level assembly was then annotated with a close reference in
154
refDB or rawDB to produce corresponding coverage plots. The drVM pipeline was
155
implemented, in Python, to incorporate open-source tools including BLAST, khmer,
156
SNAP, SOAP2 and SPAdes. It is open-source and is available for download:
157
https://sourceforge.net/projects/sb2nhri/files/drVM/. In addition to the drVM script, the
158
module is distributed as a Docker image in the Docker Hub (https://hub.docker.com/)
159
repository and as a virtual machine image (drVM.ova) on the sourceforge website. With
160
the Docker image and the drVM virtual machine, users are able to interface with drVM
161
with ease. In addition to executing commands in a terminal window (Fig. 2A), a
162
graphical user interface (GUI) (Fig. 2B) was created in the drVM virtual machine. Users
11
163
are thus able to construct the required databases for drVM by simply clicking on the
164
‘Create’ button featured in CreateDB.py. Following a 30-minute period while DB is
165
created, users are able to analyze metagenomic data with drVM.py to produce a folder
166
(Fig. 2C) containing assembled viral genomes (*.ctg.fa) along with annotated coverage
167
plots (Fig. 2D).
168
169
Detection and reconstruction of viral genomes
170
Three hundred and forty-nine sequencing datasets from 18 research studies (Table 1 and
171
Table S1) were downloaded and analyzed by means of drVM. As shown in Table 1,
172
drVM detected 260 runs and produced coverage plots corresponding to annotated viruses;
173
viral genomes were reconstructed from 147 runs by presenting a single contig mapped on
174
a reference. The results generated by the drVM package (except filtered reads) for the
175
349 runs are provided at http://sb.nhri.org.tw/drVM/, and the assembled viral sequences
176
and their corresponding coverage plots are available for download. Here, we take the
177
second study listed in Table 1 as an example – the authors described
178
sequence-independent amplification for samples containing ultra-low amounts of viral
179
RNA coupled with 34-run Illumina sequencing (SRR513075 to SRR527726 and
180
SRR629705-9 listed in Table S1). De novo assembly, optimized for viral genomes, was
12
181
performed, so as to capture 96 to 100% of the viral protein coding region of the human
182
immunodeficiency virus (HIV), respiratory syncytial virus (RSV) and West Nile virus
183
(WNV). With drVM, the three viruses were successfully detected and some complete
184
viral genomes were obtained, such as HIV in SRR513075, WNV in SRR527701 and
185
RSV in SRR527708. In addition to these three viruses, drVM detected and reconstructed
186
the GB virus C in SRR513075, SRR513080, SRR513087, SRR629705 and SRR629708
187
and detected the Hepatitis C virus in SRR513080, SRR629705 and SRR629708, and
188
torque teno virus in SRR527718, SRR527725, SRR629705 and SRR629707, which has,
189
to date, not been reported in the literature. Furthermore, 12 complete or near-complete
190
viral genomes of adenovirus, norovirus and hepatitis B virus have been obtained in the
191
fourth study [20]; drVM was able to automatically produce 8 complete genomes among
192
this dataset (e.g., ERR233414) and additional 7 plant virus genomes (e.g. ERR233413).
193
To illustrate its wide-ranging versatility, drVM was also leveraged to reconstruct
194
influenza viruses from influenza virus-positive respiratory samples [9]. drVM produced
195
complete influenza A H1N1 and H3N2 virus genomes in some influenza-positive samples
196
(e.g. ERR690494 and ERR690510), detected human herpesvirus 1 (HSV-1) in one
197
sample, and also recovered a viral genome associated with the respiratory syncytical
198
virus (RSV) from an influenza-negative sample. In analyzing the metagenomic datasets,
13
199
drVM reconstructed various viral genomes including DNA, RNA and retro-transcribing
200
viruses (Table 2). For example, drVM produced a 34,901-bp sequence of human
201
adenovirus D, two segments of human picobirnavirus (2,485 and 1,717 bp), a 12,216-bp
202
sequence of bovine viral diarrhea virus and a 3,166-bp sequence of hepatitis B virus in
203
ERR233414, ERR227885, SRR1170806, and ERR233420, respectively. In summary,
204
drVM successfully reconstructed various viral genomes (around 30) among 349
205
metagenomic sequencing runs.
206
207
Performance comparison
208
Unlike SURPI [10], VIP [12] and VirusTAP [13], drVM directly identifies reads by
209
aligning them to viral genomes and therefore is expected to save a substantial amount of
210
computational time from subtracting reads of the host and co-located bacteria. Three
211
datasets – SRR1170797, SRR1106548 and DRR049387 – were separately input into VIP,
212
SURPI and VirusTAP for pipeline evaluation [10, 12, 13]. Two sequencing runs
213
comprising co-existing multiple human papillomavirus types [21] and segmented-genome
214
influenza virus [9] were used as additional datasets to facilitate performance comparison.
215
As demonstrated in Table 3, drVM requires the shortest execution time with regards to
216
SRR1106548 and DRR049387, and the second-shortest time for the other three datasets.
14
217
Note that the analyses of drVM, SURPI and VIP were performed on a quad-core CPU
218
with 128 GB RAM while VirusTAP was executed on a 120-core CPU web server with 1
219
TB RAM. Therefore, drVM was validated to be the most rapid tool for virus
220
identification when computational hardware is considered. Moreover, the complete
221
genomes of bovine viral diarrhea virus, human papillomavirus and influenza A virus were
222
generated by drVM during the SRR1170797, SRR062073 and ERR690519 runs,
223
respectively, while the other three tools were not able to produce complete genomes for
224
these viruses. The outputs generated by each tool running on the datasets delineated in
225
Table 3 are available at: https://sourceforge.net/projects/sb2nhri/files/drVM/Comparison/.
226
To simulate the disturbance of host reads, homo sapiens raw reads from liver cancer cells
227
(SRR3031107) were concatenated to sequencing reads (SRR544883) from the hepatitis C
228
virus infection [18]. Before read concatenation, drVM reconstructed the GB virus C from
229
SRR544883, as shown in Table 2. Although the software still was able to reconstruct the
230
viral genome of GB virus C from the concatenated reads, execution time increased by a
231
factor of three (from 43 min to 2 hr 7 min while analyzing read bases of 711 Mb to 10
232
Gb). Taken together, drVM not only rapidly detects viral reads but also enables the de
233
novo assembly of the reads into viral genomes.
15
234
DISCUSSION
235
As described in SURPI, SNAP runs 10-100× faster than existing alignment tools
236
including Bowtie 2 [36] and BWA [37]. drVM and SURPI use SNAP while VIP and
237
VirusTAP utilize Bowtie 2 and BWA-SW [38], respectively. SNAP coupling with the
238
light reference database (viral sequences vs. human, bacterial and viral sequences used in
239
SURPI, VIP and VirusTAP) yields a noteworthy reduction in the processing time for viral
240
read identification in drVM. Furthermore, the reference database can easily be updated
241
whenever necessary. In contrast, SURPI classifies reads against viral and bacterial
242
databases to identify all potential pathogens, which invariably requires extended
243
execution time and requires more effort with regards to ensuring that the references are
244
updated and current. In drVM, viral reads with the loosest edit distance of 24 (in SNAP)
245
were partitioned into genus-level groups based on the virus taxonomy retrieved from the
246
NCBI Taxonomy databases, excluding phages. Digital normalization was then employed
247
to correct uneven read depth distribution, so as to eliminate redundant reads while
248
retaining sufficient information for subsequent assembly. For example, drVM produced a
249
7,455-bp contig of porcine kobuvirus for ERR1097471 (in Table 2) but resulted in a
250
fragmented assembly when digital normalization was not used (-dn off). Additionally,
251
normalization is able to reduce the processing time required for assembly. For example,
16
252
not activating the digital normalization routine increased the execution time of
253
ERR1097472 by drVM from 20 min to 3 hr. It is important to note that the digital
254
normalization routine comprises a sequence of random samples, insinuating that the
255
results produced by drVM can be different (e.g., SRR527703) from run-to-run. The
256
normalized reads were assembled, in a de novo fashion, by SPAdes with multiple k-mer
257
sizes (21, 33, 55… automatically selected based on read length). Since SPAdes operates
258
with Illumina and Ion Torrent reads, drVM is able to handle sequencing reads produced
259
by these two platforms (default for Illumina, -type iontorrent for Ion Torrent). Moreover,
260
drVM annotates each genus-level assembly with a close reference to produce the
261
corresponding coverage plots. If multiple contigs present in one coverage plot, reads
262
aligned to the contigs are extracted in pairs for subsequent re-assembly; such a process is
263
able to improve genome completeness. An example can be seen in the run corresponding
264
to SRR527705 – fragmented contigs were annotated with the West Nile virus in the
265
pre.run routine; drVM produced a 11,017-bp contig of the complete genome. Over
266
99.98% of the target sequence identity was obtained in the drVM-produced contig when
267
referenced against the submitted assembly (JX503096.1) [19].
268
17
269
In the comparative study of VIP and VirusTAP, the authors have compared their
270
assemblies with direct metagenomic assemblies via Ensemble Assembler, IDBA_UD,
271
A5-miseq or CLC workbench. Results have led the authors to conclude that
272
“classification to assembly” outperformed direct assembly in terms of assembly
273
continuity and execution time [12, 13]. drVM was not compared with SPAdes, but it was
274
compared against SURPI, VIP and VirusTAP. As evident from Table 3, drVM produced
275
complete genomes of bovine viral diarrhea virus and human papillomavirus in the
276
analysis of reads produced by Ion Torrent and Illumina platforms, respectively; the other
277
tools were unable to assemble the viral genomes. Unlike SURPI and VIP, the coverage
278
plots produced by drVM are created by mapping raw reads back to the assembled contigs,
279
not to closely related references. A coverage plot with a continuous profile reflects the
280
accuracy and continuity of the assembly paradigm.
281
282
Although drVM operated in such a manner that the subtraction of the host read is
283
neglected may produce valid host sequences and annotate with viral references, this
284
mistake can be easily identified via inspection of the assembled contigs. For example, in
285
running SRR3031107 concatenated with SRR544883, a 1,872-bp contig annotated with
286
Feline leukemia virus (92% identity) actually corresponded to homo sapiens notch 2
18
287
(100% identity). Therefore, conclusions drawn from drVM should be made with caution,
288
especially with regards to retrovirus detection.
289
290
Among SURPI, VIP and VirusTAP, VirusTAP is the only user-friendly tool for viral
291
genome assembly from metagenomic reads. The other two pipelines are not easily
292
operated by users lacking advanced bioinformatics skills. Although academic users can
293
access VirusTAP and obtain viral genomes with ease, the package is not suitable for
294
assembling Ion Torrent data (Table 3, SRR1170797) and low-copy viruses (Table 3,
295
SRR062073). Alternatively, drVM provides a user-friendly interface (as shown in Fig. 2B)
296
and generates self-explanatory results (Fig. 2C and 2D); it therefore is well-suited for
297
clinical applications. We have applied drVM in the analyses of over 300 sequencing runs
298
retrieved from NCBI’s SRA to demonstrate that drVM is indeed able to detect and
299
reconstruct various viral genomes from metagenomic data. We can therefore expect to
300
extend our knowledge of viral diversity by leveraging the unique genome assembly
301
capabilities of drVM in the metagenomics domain.
302
303
19
304
AVAILABILITY AND REQUIREMTNS
305
Project name: drVM
306
Project home page: https://sourceforge.net/projects/sb2nhri/files/drVM/
307
Operating system(s): OS X, Linux, Windows
308
Programming language: Python
309
Requirements: Docker or Virtual machine; 8 GB RAM
310
License: GNU General Public License, version 3.0 (GPL-3.0)
311
312
20
313
DECLARATIONS
314
List of abbreviations
315
NGS:
316
immunodeficiency virus, RSV: respiratory syncytial virus, WNV: West Nile virus
next-generation
sequencing,
SRA:
sequencing
read
archive,
HIV:
317
318
Competing interests
319
The authors declare that they have no competing interests.
320
321
Author’ contributions
322
YCL conceived the project. HHL implemented the pipeline. YCL and HHL prepared,
323
read and approved the final manuscript.
324
325
Acknowledges
326
This work was supported by grants (PH-105-PP-05) from the National Health Research
327
Institute, Taiwan, and a research grant (MOST-104-2320-B-400-021-MY2) from the
328
Ministry of Science and Technology, Taiwan.
329
330
Additional file
21
331
Supplementary Table S1. List of 349 sequencing datasets from 18 research studies.
332
333
22
334
REFERENCES
335
336
337
338
339
340
341
342
1.
343
344
345
346
347
348
349
350
351
4.
352
353
354
355
356
357
358
359
360
2.
3.
5.
6.
7.
8.
9.
Brister JR, Ako-Adjei D, Bao Y, Blinkova O: NCBI viral genomes resource. Nucleic
Acids Res 2015, 43:D571-577.
Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S,
Zaremba S, Gu Z, et al: ViPR: an open bioinformatics database and analysis
resource for virology research. Nucleic Acids Res 2012, 40:D593-598.
Sharma D, Priyadarshini P, Vrati S: Unraveling the web of viroinformatics:
computational tools and databases in virus research. J Virol 2015,
89:1489-1501.
Chan PK: Outbreak of avian influenza A(H5N1) virus infection in Hong Kong in
1997. Clin Infect Dis 2002, 34 Suppl 2:S58-64.
Bean AG, Baker ML, Stewart CR, Cowled C, Deffrasnes C, Wang LF, Lowenthal JW:
Studying immunity to zoonotic diseases in the natural host - keeping it real. Nat
Rev Immunol 2013, 13:851-861.
Feldmann H: Ebola--a growing threat? N Engl J Med 2014, 371:1375-1378.
Calvet G, Aguiar RS, Melo ASO, Sampaio SA, de Filippis I, Fabri A, Araujo ESM, de
Sequeira PC, de Mendonça MCL, de Oliveira L, et al: Detection and sequencing of
Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case
study. The Lancet Infectious Diseases 2016.
Batty EM, Wong TH, Trebes A, Argoud K, Attar M, Buck D, Ip CL, Golubchik T, Cule
M, Bowden R, et al: A modified RNA-Seq approach for whole genome
sequencing of RNA viruses from faecal and blood samples. PLoS One 2013,
8:e66129.
Fischer N, Indenbirken D, Meyer T, Lutgehetmann M, Lellek H, Spohn M,
Aepfelbacher M, Alawi M, Grundhoff A: Evaluation of Unbiased Next-Generation
Sequencing of RNA (RNA-seq) as a Diagnostic Method in Influenza
Virus-Positive Respiratory Samples. J Clin Microbiol 2015, 53:2238-2250.
361
362
363
364
365
10.
Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E,
Bouquet J, Greninger AL, Luk KC, Enge B, et al: A cloud-compatible
bioinformatics pipeline for ultrarapid pathogen identification from
next-generation sequencing of clinical samples. Genome Res 2014,
24:1180-1192.
366
367
11.
Flygare S, Simmon K, Miller C, Qiao Y, Kennedy B, Di Sera T, Graf EH, Tardif KD,
Kapusta A, Rynearson S, et al: Taxonomer: an interactive metagenomics analysis
23
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
portal for universal pathogen detection and host mRNA expression profiling.
Genome Biol 2016, 17:111.
Li Y, Wang H, Nie K, Zhang C, Zhang Y, Wang J, Niu P, Ma X: VIP: an integrated
pipeline for metagenomics of virus identification and discovery. Sci Rep 2016,
6:23774.
Yamashita A, Sekizuka T, Kuroda M: VirusTAP: Viral Genome-Targeted Assembly
Pipeline. Front Microbiol 2016, 7:32.
Merkel D: Docker: lightweight Linux containers for consistent development and
deployment. Linux J 2014, 2014:2.
Nocq J, Celton M, Gendron P, Lemieux S, Wilhelm BT: Harnessing virtual
machines to simplify next-generation DNA sequencing analysis. Bioinformatics
2013, 29:2075-2083.
Yozwiak NL, Skewes-Cox P, Stenglein MD, Balmaseda A, Harris E, DeRisi JL: Virus
identification in unknown tropical febrile illness cases using deep sequencing.
PLoS Negl Trop Dis 2012, 6:e1485.
Chiu CY, Yagi S, Lu X, Yu G, Chen EC, Liu M, Dick EJ, Jr., Carey KD, Erdman DD,
Leland MM, Patterson JL: A novel adenovirus species associated with an acute
respiratory outbreak in a baboon colony and evidence of coincident human
infection. MBio 2013, 4:e00084.
Law J, Jovel J, Patterson J, Ford G, O'Keefe S, Wang W, Meng B, Song D, Zhang Y,
Tian Z, et al: Identification of hepatotropic viruses from plasma using deep
sequencing: a next generation diagnostic tool. PLoS One 2013, 8:e60595.
Malboeuf CM, Yang X, Charlebois P, Qu J, Berlin AM, Casali M, Pesko KN, Boutwell
CL, DeVincenzo JP, Ebel GD, et al: Complete viral RNA genome sequencing of
ultra-low copy samples by sequence-independent amplification. Nucleic Acids
Res 2013, 41:e13.
Cotten M, Oude Munnink B, Canuti M, Deijs M, Watson SJ, Kellam P, van der
Hoek L: Full genome virus detection in fecal samples using sensitive nucleic acid
preparation, deep sequencing, and a novel iterative sequence classification
algorithm. PLoS One 2014, 9:e93269.
Ma Y, Madupu R, Karaoz U, Nossa CW, Yang L, Yooseph S, Yachimski PS, Brodie EL,
Nelson KE, Pei Z: Human papillomavirus community in healthy persons, defined
by metagenomics analysis of human microbiome project shotgun sequencing
data sets. J Virol 2014, 88:4786-4797.
Neill JD, Bayles DO, Ridpath JF: Simultaneous rapid sequencing of multiple RNA
24
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
virus genomes. J Virol Methods 2014, 201:68-72.
Berg MG, Lee D, Coller K, Frankel M, Aronsohn A, Cheng K, Forberg K, Marcinkus
M, Naccache SN, Dawson G, et al: Discovery of a Novel Human Pegivirus in
Blood Associated with Hepatitis C Virus Co-Infection. PLoS Pathog 2015,
11:e1005325.
Day JM, Oakley BB, Seal BS, Zsak L: Comparative analysis of the intestinal
bacterial and RNA viral communities from sentinel birds placed on selected
broiler chicken farms. PLoS One 2015, 10:e0117210.
Greninger AL, Naccache SN, Federman S, Yu G, Mbala P, Bres V, Stryke D, Bouquet
J, Somasekar S, Linnen JM, et al: Rapid metagenomic identification of viral
pathogens in clinical samples by real-time nanopore sequencing analysis.
Genome Medicine 2015, 7.
Nouri S, Salem N, Nigg JC, Falk BW: Diverse Array of New Viral Sequences
Identified in Worldwide Populations of the Asian Citrus Psyllid (Diaphorina citri)
Using Viral Metagenomics. J Virol 2015, 90:2434-2445.
Karlsson OE, Larsson J, Hayer J, Berg M, Jacobson M: The Intestinal Eukaryotic
Virome in Healthy and Diarrhoeic Neonatal Piglets. PLoS One 2016,
11:e0151481.
Lojkic I, Bidin M, Prpic J, Simic I, Kresic N, Bedekovic T: Faecal virome of red foxes
from peri-urban areas. Comp Immunol Microbiol Infect Dis 2016, 45:10-15.
Wang Y, Zhu N, Li Y, Lu R, Wang H, Liu G, Zou X, Xie Z, Tan W: Metagenomic
analysis of viral genetic diversity in respiratory samples from children with
severe acute respiratory infection in China. Clin Microbiol Infect 2016.
Zaharia M, Bolosky WJ, Curtis K, Fox A, Patterson D, Shenker S, Stoica I, Karp RM,
Sittler T: Faster and more accurate sequence alignment with SNAP. arXiv
preprint arXiv:11115572 2011.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL:
BLAST+: architecture and applications. BMC Bioinformatics 2009, 10:421.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM,
Nikolenko SI, Pham S, Prjibelski AD, et al: SPAdes: a new genome assembly
algorithm and its applications to single-cell sequencing. J Comput Biol 2012,
19:455-477.
Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT: Tackling soil
diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci
U S A 2014, 111:4904-4909.
25
438
439
440
441
442
443
444
445
446
447
34.
448
449
38.
35.
36.
37.
Crusoe MR, Alameldin HF, Awad S, Boucher E, Caldwell A, Cartwright R,
Charbonneau A, Constantinides B, Edvenson G, Fay S, et al: The khmer software
package: enabling efficient nucleotide sequence analysis. F1000Res 2015,
4:900.
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved
ultrafast tool for short read alignment. Bioinformatics 2009, 25:1966-1967.
Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat
Methods 2012, 9:357-359.
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 2009, 25:1754-1760.
Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler
transform. Bioinformatics 2010, 26:589-595.
450
451
26
452
FIGURE LEGENDS
453
Fig. 1
454
produce SNAP and BLAST databases. drVM.py analyzes NGS reads to produce viral
455
genomes.
456
Fig. 2
457
command-line interface. (B) Input associated with the graphical user interface. (C)
458
Output file structure generated by drVM. (D) Coverage profiles produced by drVM.
A schematic flowchart of drVM. CreateDB.py processes viral sequences to
Input and output explanation of drVM. (A) Input associated with the
459
27
460
Table 1 The feasibility of drVM in various types of clinical samples and public datasets.
Sequencing
method
No. of
runs
Ref
Bioinformatics pipeline for ultrarapid pathogen identification
Complete viral RNA genome sequencing of ultra-low copy samples
Illumina
Illumina
13
34
Faecal virome of red foxes from peri-urban areas
Full genome virus detection in fecal samples
Human papillomavirus community in healthy persons
Identification of hepatotropic viruses from plasma
Intestinal bacterial and RNA viral communities from sentinel birds
Intestinal virome in healthy and diarrhoeic neonatal piglets
Metagenomic analysis for severe acute respiratory infection
Illumina
Illumina
Illumina
Illumina
Ion Torrent
Ion Torrent
Illumina
Metagenomic identification of viral pathogens in clinical samples
New viral sequences identified in Asian citrus psyllid
Novel adenovirus associated with baboon acute respiratory outbreak
Novel human pegivirus associated with hepatitis C virus co-infection
RNA sequencing in influenza virus-positive respiratory samples
RNA-seq of RNA viruses from faecal and blood samples
Simultaneous sequencing of multiple RNA virus genomes
Viral genome-targeted assembly pipeline
Study description
Virus identification in unknown tropical febrile illness cases
461
462
a
b
Selected runs with assembled contigs > 1,000 bp [21]
A selected run with mapped viral reads > 5000 [16]
28
drVM result (No. of run)
Detection
Reconstruction
[10]
[19]
5
34
0
21
8
20
20a
14
8
29
4
[28]
[20]
[21]
[18]
[24]
[27]
[29]
1
11
15
5
7
6
4
1
8
5
2
2
2
4
Illumina
Illumina
Illumina
Illumina
Illumina
Illumina
Ion Torrent
Ion Torrent
3
4
4
15
49
82
40
1
[25]
[26]
[17]
[23]
[9]
[8]
[22]
[13]
2
3
4
13
29
82
37
1
1
0
3
2
13
73
9
0
Illumina
1b
[16]
1
0
463
464
Table 2 Various viral genomes reconstructed by drVM based on the corresponding NGS reads.
Virus type
dsDNA
ssDNA
dsRNA
ssRNA (+)
ssRNA (-)
Retro-transcribing
Viral genome reconstruction by drVM
Human adenovirus (ERR233414), Human papillomavirus (ERR233428), Simian adenovirus (SRR766764)
Adeno-associated virus (SRR766763), Bovine parvovirus (SRR2010686), Fox circovirus (SRR1920168),
Human bocavirus (SRR2010686), Torque teno mini virus (SRR2040553), Torque teno virus (SRR2010686)
Human picobirnavirus (ERR227885)
Bovine viral diarrhea virus (SRR1170806), Chikungunya virus (SRR3180716), GB virus C (SRR544883),
Hepatitis C virus (SRR2940645), Human pegivirus (SRR2940645), Human rhinovirus (SRR2010685),
Norovirus (ERR233428), Pepino mosaic virus (ERR227848), Pepper mild mottle virus (ERR233431), Porcine
kobuvirus (ERR1097471), Tobacco mild green mosaic virus (ERR233421), Tobacco mosaic virus
(ERR233421), Tomato mosaic virus (ERR233424), West Nile virus (SRR527705)
Human parainfluenza virus (SRR2010686), Human respiratory syncytial virus (SRR527716), Influenza A
virus (ERR690510)
Hepatitis B virus (ERR233420), Human immunodeficiency virus (SRR513075)
465
466
29
467
468
Table 3 Performance comparison between drVM, SURPI, VIP and VirusTAP.
Target virus
(run accession)
Bovine viral diarrhea
virus (SRR1170797)
Bases
(Mbp)
12.5
Run timea
Result
b
drVM
SURPI
(Comprehensive)
VIP (Sense)
VirusTAP
149 sec
52,365 sec
1,683 sec
71 sec
12,224 bp
262 bp
9,078 bp
353 bp
Human immunodeficiency
virus (SRR1106548)c
600.9
Run time
Result
598 sec
3,055 bp
32,604 sec
799 bp
25,049 sec
4,632 bp
1,388 sec
2,896 bp
Human papillomavirus
(SRR062073)
5000
Run time
Result
608 sec
7,654 & 7,914 bp
18,107 sec
3,515 bp
8,603 sec
2,164 bp
519 sec
555 bp
Human rotavirus A
(DRR049387)
266.1
Run time
Result
464 sec
13 contigs
59,510 sec
13 contigs
6,259 sec
41377 reads
925 sec
Influenza A virus
(ERR690519)
3300
Run time
12,697sec
16,997 sec
86,001 sec
4,504 sec
Result
8 H3N2 segments
11 contigs
2,673 reads
34 viral contigs
469
470
471
472
473
a
474
c
11 viral contigs
The analyses were executed on a quad-core CPU with 128 GB RAM, except for VirusTAP (executed on a 120-core CPU with 1 TB
RAM).
b
The result is summarized as the largest length of viral contig (bold: close to the genome size of target virus), read number mapped to
the target virus, or contig number depending on the output of tools and the target virus (Human rotavirus A: 11 segments, Influenza A
virus: 8 segments).
The results were produced by analyzing reads with barcode: GCCAAT, except for SURPI, which recognizes multiple barcodes.
30
475
Fig. 1 A schematic flowchart of drVM. CreateDB.py processes viral sequences
to produce SNAP and BLAST databases. drVM.py
31
analyzes NGS reads to produce viral genomes.
476
477
Fig. 2 Input and output explanation of drVM. (A) Input of command-line
interface. (B) Input of graphic user interface. (C) Output
32
file structure from drVM. (D) Coverage profiles produced by drVM.
478
33