1 drVM: Detection and Reconstruction of Viral Genomes from Metagenomes 2 3 Hsin-Hung Lin and Yu-Chieh Liao* 4 5 Institute of Population Health Sciences, National Health Research Institutes, Miaoli, 6 Taiwan 7 8 Running title: Viral metagenome analysis 9 10 *Correspondence to Yu-Chieh Liao: [email protected] 11 E-mail address for Hsin-Hung Lin: [email protected] 12 13 14 15 1 16 ABSTRACT 17 Background 18 The discovery of new or divergent viruses using high-throughput next-generation 19 sequencing (NGS) has become more commonplace. However, although analysis of deep 20 NGS data allows us to identity potential pathogens, the entire analytical procedure 21 requires competency in the bioinformatics domain, which include implementing proper 22 software packages and preparing prerequisite databases. Simple and user-friendly 23 bioinformatics pipelines are urgently required to obtain complete viral genome sequences 24 from metagenomic data. 25 Results 26 This paper presents a pipeline, drVM (detection and reconstruction of viral genomes from 27 metagenomes), for rapid viral read identification, genus-level read partition, read 28 normalization, de novo assembly, sequence annotation and coverage profiling. The 29 feasibility of drVM was validated via the analysis of over 300 sequencing runs generated 30 by Illumina and Ion Torrent technologies to detect and reconstruct a variety of virus types 31 including DNA viruses, RNA viruses and retroviruses. drVM is available for free 32 download at: https://sourceforge.net/projects/sb2nhri/files/drVM/ and it is also assembled 33 as a Docker container and a virtual machine to facilitate seamless deployment. 2 34 Conclusions 35 drVM was compared with other viral detection tools to demonstrate its merits in terms of 36 viral genome completeness and reduced computation time. This substantiates the 37 platform’s potential to produce prompt and accurate viral genome sequences from clinical 38 samples. 39 40 Keywords 41 Next-generation 42 reconstruction sequencing (NGS), bioinformatics, 3 metagenomics, detection, 43 BACKGROUND 44 Viruses are the most abundant biological entities on Earth and are found among all 45 cellular forms of life including animals, plants, bacteria, and fungi. More than 4,500 viral 46 species have been discovered; their sequence information has been collected by 47 researchers [1-3]. Viruses have caused some of the most dramatic and deadly disease 48 pandemics in human history and viral disease outbreaks tend to occur every several years. 49 Over the past two decades, avian influenza H5N1 virus [4], SARS coronavirus, H1N1 50 pandemic, MERS coronavirus [5], Ebola virus [6] and Zika virus [7] have emerged in the 51 human population. During such outbreaks, identification of the causative agent and 52 comparative genome analysis is of cornerstone importance for disease surveillance and 53 epidemiology. Unbiased next-generation sequencing (NGS) is emerging as an attractive 54 approach for viral identification from varied samples, including blood, feces, sputum and 55 other swab samples [8, 9]. The technique holds the promise to aid in the identification of 56 potential pathogens in a single assay without a prior knowledge of the target [10]. 57 58 SURPI [10] and Taxonomer [11] are pathogen detection tools that have been proposed to 59 analyze metagenomic NGS data for comprehensive diagnostic applications. However, 60 both tools are incapable of complete viral genome assembly. VIP has pursues the same 4 61 strategy as SURPI-subtraction to identification-to identify viral reads but provides an 62 alternative approach to assembly - classification to assembly - for improved viral 63 genome assembly [12]. Although VIP generates phylogenetic trees, thereby facilitating 64 the visualization of genealogy between a candidate virus and existing reference 65 sequences, it does not produce assembled viral sequences. Moreover, ease-of-use 66 challenges associated with operating SURPI and VIP impede their applications in most 67 laboratories and are often accessible only by trained personnel. VirusTAP is a web-based 68 integrated NGS analysis tool for viral genome assembly from metagenomic reads [13]. 69 This user-friendly tool enables users to obtain viral genomes more easily by just 70 uploading raw NGS reads and clicking several selections. However, VirusTAP only 71 accepts Illumina data and does not appear to be fully functional, as evaluated via various 72 viral metagenomic reads. Therefore, tools for metagenomics analyses are urgently 73 required that are substantially more computationally efficient, accurate (for compete viral 74 genome assembly), and easy-to-use. 75 76 Here we present drVM (“detection and reconstruction of viral genomes from 77 metagenomes”), a bioinformatics pipeline that firstly provides rapid classification of 78 NGS reads generated by Illumina or Ion Torrent sequencing technology against a viral 5 79 database, then partitions the viral reads into genus groups, and finally assembles viral 80 genomes from the corresponding genus-level reads. For ease of deployment, a Docker 81 container [14] and virtual machine [15] image were created for drVM. The performance 82 of the platform was evaluated in the analysis of 349 sequence data in the Sequence Read 83 Archive (SRA) from 18 independent studies [8-10, 13, 16-29]. These data sets encompass 84 a variety of sample types, viruses, and sequencing depths. drVM was demonstrated to be 85 highly adept at the detection and reconstruction of various viral genomes and 86 concomitantly outperforms other analytical pipelines including SURPI, VIP and 87 VirusTAP. 88 6 89 MATERIALS AND METHODS 90 Reference databases 91 Viral nucleotide sequences were obtained from the Nucleotide database of the National 92 Center for Biotechnology Information (NCBI) using the query term “(complete[title]) 93 AND (viridae[organism])”; this query resulted in 642,079 hits (as of 16 March 2016). 94 Based on each sequence identifier (accession number) of the viral sequence, its 95 taxonomic ID, scientific name and genus-level annotation were separately obtained from 96 the NCBI Taxonomy database. Viral sequences with taxonomy ID not included in the 97 virus division (division id = 9), those with taxonomic information absent, and those 98 lacking genus-level annotation, were labeled “nonVira”, “noTax”, and “noGenus”, 99 respectively. Those sequences were excluded from database construction. The remaining 100 viral sequences were utilized as rawDB. Three hundred and seventy-seven viral genomes 101 were obtained by filtering with host “human” from the NCBI viral genome resource to 102 retrieve 684 sequences (1 March 2016) [1]. The viral sequences were segmented into 103 their corresponding genus level for refDB; the corresponding refDB was uploaded to 104 SourceForge 105 (https://sourceforge.net/projects/sb2nhri/files/drVM/refDB.tar.gz) 106 alignments and reduce memory requirements, rawDB was split into 8 sub-databases, each for later 7 use To increase read 107 contained sequences with total length no greater than 200 Mbp. SNAP (version 0.15.4) 108 [30], possessing a default seed size of 20, was used to generate index tables 109 (snap_index_virus) for the viral sub-databases. In addition, a BLAST database 110 (blast_index_virus) was created by makeblastdb (BLAST 2.2.28+) [31] from rawDB to 111 enable contig annotation. 112 113 Implementation of drVM 114 The drVM pipeline is implemented in Python and incorporates several open-source tools, 115 including BLAST, SNAP and SPAdes [32]. The package comprises two modules: 116 CreateDB.py and drVM.py. A schematic flowchart of drVM is shown in Fig. 1. In 117 CreateDB.py, viral sequences (provided by the user) are processed to produce rawDB for 118 SNAP and BLAST database construction, and refDB is downloaded from the 119 SourceForge link. drVM.py takes single- or paired-end reads as input. It aligns reads to 120 SNAP DB and, accordingly, identifies viral reads using the following parameters: snap 121 single -x -h 250 -d 24 -n 25 -F a. Based on taxonomic information (genus-level 122 annotation) of the aligned sequence, the viral reads are partitioned into genus-level 123 groups. Each group is labeled according to its corresponding genus. By examining the 124 aligned sequences with reads, sequence coverage and read depth are calculated. 8 125 Maximum coverage and average depth for a genus are estimated by taking all aligned 126 sequences corresponding to that genus into consideration. Viral reads in a genus with 127 average depth ≧ min depth (default =1) undergo de novo assembly via SPAdes (v.3.6.1). 128 Prior to assembly, digital normalization [33] in khmer [34] is employed to remove reads 129 with high-abundance (default = 100) k-mers. Using BLAST, each genus-level assembly 130 is annotated to its closest reference against refDB using blastn (identity ≧ 80%). Each 131 reference is examined to confirm that it contains an at least 50% alignment rate. If this 132 procedure returns no hits, the assembly is annotated by BLAST against rawDB. Reads 133 are aligned to contigs by means of SOAP2 [35] and the read-covered contigs are placed 134 into a coverage plot with reference-guide coordinates. To increase contig continuity, reads 135 aligned to contigs corresponding to a reference are extracted for pairs and those paired 136 reads are re-assembled. Please note this re-assembly process is only performed for 137 paired-end reads. 138 139 Metagenomic datasets 140 A total of 349 runs in the Sequence Read Archive (SRA) were downloaded (see Table S1). 141 The metagenome datasets based on prior studies [8-10, 13, 16-29] are summarized in 142 Table 1. Each sequence run was analyzed by drVM: drVM.py -1 read1.fastq -2 9 143 read2.fastq -t 16 (-type iontorrent for Ion Torrent datasets) on a server (Intel Xeon 144 E7-4820, 2.00GHz with 256 GB of RAM). 10 145 RESULTS 146 drVM pipeline 147 drVM was designed to detect and reconstruct viral genomes from metagenomes. As 148 described in the Materials and Methods, it contains two modules: CreateDB.py and 149 drVM.py (Fig. 1). The viral sequences downloaded from NCBI were processed by 150 CreateDB.py to produce SNAP and BLAST databases. Taxonomy information was also 151 automatically extracted from NCBI. In drVM.py, metagenomic reads that were aligned to 152 complete viral sequences were firstly partitioned into genus-level groups to facilitate de 153 novo assembly. Each genus-level assembly was then annotated with a close reference in 154 refDB or rawDB to produce corresponding coverage plots. The drVM pipeline was 155 implemented, in Python, to incorporate open-source tools including BLAST, khmer, 156 SNAP, SOAP2 and SPAdes. It is open-source and is available for download: 157 https://sourceforge.net/projects/sb2nhri/files/drVM/. In addition to the drVM script, the 158 module is distributed as a Docker image in the Docker Hub (https://hub.docker.com/) 159 repository and as a virtual machine image (drVM.ova) on the sourceforge website. With 160 the Docker image and the drVM virtual machine, users are able to interface with drVM 161 with ease. In addition to executing commands in a terminal window (Fig. 2A), a 162 graphical user interface (GUI) (Fig. 2B) was created in the drVM virtual machine. Users 11 163 are thus able to construct the required databases for drVM by simply clicking on the 164 ‘Create’ button featured in CreateDB.py. Following a 30-minute period while DB is 165 created, users are able to analyze metagenomic data with drVM.py to produce a folder 166 (Fig. 2C) containing assembled viral genomes (*.ctg.fa) along with annotated coverage 167 plots (Fig. 2D). 168 169 Detection and reconstruction of viral genomes 170 Three hundred and forty-nine sequencing datasets from 18 research studies (Table 1 and 171 Table S1) were downloaded and analyzed by means of drVM. As shown in Table 1, 172 drVM detected 260 runs and produced coverage plots corresponding to annotated viruses; 173 viral genomes were reconstructed from 147 runs by presenting a single contig mapped on 174 a reference. The results generated by the drVM package (except filtered reads) for the 175 349 runs are provided at http://sb.nhri.org.tw/drVM/, and the assembled viral sequences 176 and their corresponding coverage plots are available for download. Here, we take the 177 second study listed in Table 1 as an example – the authors described 178 sequence-independent amplification for samples containing ultra-low amounts of viral 179 RNA coupled with 34-run Illumina sequencing (SRR513075 to SRR527726 and 180 SRR629705-9 listed in Table S1). De novo assembly, optimized for viral genomes, was 12 181 performed, so as to capture 96 to 100% of the viral protein coding region of the human 182 immunodeficiency virus (HIV), respiratory syncytial virus (RSV) and West Nile virus 183 (WNV). With drVM, the three viruses were successfully detected and some complete 184 viral genomes were obtained, such as HIV in SRR513075, WNV in SRR527701 and 185 RSV in SRR527708. In addition to these three viruses, drVM detected and reconstructed 186 the GB virus C in SRR513075, SRR513080, SRR513087, SRR629705 and SRR629708 187 and detected the Hepatitis C virus in SRR513080, SRR629705 and SRR629708, and 188 torque teno virus in SRR527718, SRR527725, SRR629705 and SRR629707, which has, 189 to date, not been reported in the literature. Furthermore, 12 complete or near-complete 190 viral genomes of adenovirus, norovirus and hepatitis B virus have been obtained in the 191 fourth study [20]; drVM was able to automatically produce 8 complete genomes among 192 this dataset (e.g., ERR233414) and additional 7 plant virus genomes (e.g. ERR233413). 193 To illustrate its wide-ranging versatility, drVM was also leveraged to reconstruct 194 influenza viruses from influenza virus-positive respiratory samples [9]. drVM produced 195 complete influenza A H1N1 and H3N2 virus genomes in some influenza-positive samples 196 (e.g. ERR690494 and ERR690510), detected human herpesvirus 1 (HSV-1) in one 197 sample, and also recovered a viral genome associated with the respiratory syncytical 198 virus (RSV) from an influenza-negative sample. In analyzing the metagenomic datasets, 13 199 drVM reconstructed various viral genomes including DNA, RNA and retro-transcribing 200 viruses (Table 2). For example, drVM produced a 34,901-bp sequence of human 201 adenovirus D, two segments of human picobirnavirus (2,485 and 1,717 bp), a 12,216-bp 202 sequence of bovine viral diarrhea virus and a 3,166-bp sequence of hepatitis B virus in 203 ERR233414, ERR227885, SRR1170806, and ERR233420, respectively. In summary, 204 drVM successfully reconstructed various viral genomes (around 30) among 349 205 metagenomic sequencing runs. 206 207 Performance comparison 208 Unlike SURPI [10], VIP [12] and VirusTAP [13], drVM directly identifies reads by 209 aligning them to viral genomes and therefore is expected to save a substantial amount of 210 computational time from subtracting reads of the host and co-located bacteria. Three 211 datasets – SRR1170797, SRR1106548 and DRR049387 – were separately input into VIP, 212 SURPI and VirusTAP for pipeline evaluation [10, 12, 13]. Two sequencing runs 213 comprising co-existing multiple human papillomavirus types [21] and segmented-genome 214 influenza virus [9] were used as additional datasets to facilitate performance comparison. 215 As demonstrated in Table 3, drVM requires the shortest execution time with regards to 216 SRR1106548 and DRR049387, and the second-shortest time for the other three datasets. 14 217 Note that the analyses of drVM, SURPI and VIP were performed on a quad-core CPU 218 with 128 GB RAM while VirusTAP was executed on a 120-core CPU web server with 1 219 TB RAM. Therefore, drVM was validated to be the most rapid tool for virus 220 identification when computational hardware is considered. Moreover, the complete 221 genomes of bovine viral diarrhea virus, human papillomavirus and influenza A virus were 222 generated by drVM during the SRR1170797, SRR062073 and ERR690519 runs, 223 respectively, while the other three tools were not able to produce complete genomes for 224 these viruses. The outputs generated by each tool running on the datasets delineated in 225 Table 3 are available at: https://sourceforge.net/projects/sb2nhri/files/drVM/Comparison/. 226 To simulate the disturbance of host reads, homo sapiens raw reads from liver cancer cells 227 (SRR3031107) were concatenated to sequencing reads (SRR544883) from the hepatitis C 228 virus infection [18]. Before read concatenation, drVM reconstructed the GB virus C from 229 SRR544883, as shown in Table 2. Although the software still was able to reconstruct the 230 viral genome of GB virus C from the concatenated reads, execution time increased by a 231 factor of three (from 43 min to 2 hr 7 min while analyzing read bases of 711 Mb to 10 232 Gb). Taken together, drVM not only rapidly detects viral reads but also enables the de 233 novo assembly of the reads into viral genomes. 15 234 DISCUSSION 235 As described in SURPI, SNAP runs 10-100× faster than existing alignment tools 236 including Bowtie 2 [36] and BWA [37]. drVM and SURPI use SNAP while VIP and 237 VirusTAP utilize Bowtie 2 and BWA-SW [38], respectively. SNAP coupling with the 238 light reference database (viral sequences vs. human, bacterial and viral sequences used in 239 SURPI, VIP and VirusTAP) yields a noteworthy reduction in the processing time for viral 240 read identification in drVM. Furthermore, the reference database can easily be updated 241 whenever necessary. In contrast, SURPI classifies reads against viral and bacterial 242 databases to identify all potential pathogens, which invariably requires extended 243 execution time and requires more effort with regards to ensuring that the references are 244 updated and current. In drVM, viral reads with the loosest edit distance of 24 (in SNAP) 245 were partitioned into genus-level groups based on the virus taxonomy retrieved from the 246 NCBI Taxonomy databases, excluding phages. Digital normalization was then employed 247 to correct uneven read depth distribution, so as to eliminate redundant reads while 248 retaining sufficient information for subsequent assembly. For example, drVM produced a 249 7,455-bp contig of porcine kobuvirus for ERR1097471 (in Table 2) but resulted in a 250 fragmented assembly when digital normalization was not used (-dn off). Additionally, 251 normalization is able to reduce the processing time required for assembly. For example, 16 252 not activating the digital normalization routine increased the execution time of 253 ERR1097472 by drVM from 20 min to 3 hr. It is important to note that the digital 254 normalization routine comprises a sequence of random samples, insinuating that the 255 results produced by drVM can be different (e.g., SRR527703) from run-to-run. The 256 normalized reads were assembled, in a de novo fashion, by SPAdes with multiple k-mer 257 sizes (21, 33, 55… automatically selected based on read length). Since SPAdes operates 258 with Illumina and Ion Torrent reads, drVM is able to handle sequencing reads produced 259 by these two platforms (default for Illumina, -type iontorrent for Ion Torrent). Moreover, 260 drVM annotates each genus-level assembly with a close reference to produce the 261 corresponding coverage plots. If multiple contigs present in one coverage plot, reads 262 aligned to the contigs are extracted in pairs for subsequent re-assembly; such a process is 263 able to improve genome completeness. An example can be seen in the run corresponding 264 to SRR527705 – fragmented contigs were annotated with the West Nile virus in the 265 pre.run routine; drVM produced a 11,017-bp contig of the complete genome. Over 266 99.98% of the target sequence identity was obtained in the drVM-produced contig when 267 referenced against the submitted assembly (JX503096.1) [19]. 268 17 269 In the comparative study of VIP and VirusTAP, the authors have compared their 270 assemblies with direct metagenomic assemblies via Ensemble Assembler, IDBA_UD, 271 A5-miseq or CLC workbench. Results have led the authors to conclude that 272 “classification to assembly” outperformed direct assembly in terms of assembly 273 continuity and execution time [12, 13]. drVM was not compared with SPAdes, but it was 274 compared against SURPI, VIP and VirusTAP. As evident from Table 3, drVM produced 275 complete genomes of bovine viral diarrhea virus and human papillomavirus in the 276 analysis of reads produced by Ion Torrent and Illumina platforms, respectively; the other 277 tools were unable to assemble the viral genomes. Unlike SURPI and VIP, the coverage 278 plots produced by drVM are created by mapping raw reads back to the assembled contigs, 279 not to closely related references. A coverage plot with a continuous profile reflects the 280 accuracy and continuity of the assembly paradigm. 281 282 Although drVM operated in such a manner that the subtraction of the host read is 283 neglected may produce valid host sequences and annotate with viral references, this 284 mistake can be easily identified via inspection of the assembled contigs. For example, in 285 running SRR3031107 concatenated with SRR544883, a 1,872-bp contig annotated with 286 Feline leukemia virus (92% identity) actually corresponded to homo sapiens notch 2 18 287 (100% identity). Therefore, conclusions drawn from drVM should be made with caution, 288 especially with regards to retrovirus detection. 289 290 Among SURPI, VIP and VirusTAP, VirusTAP is the only user-friendly tool for viral 291 genome assembly from metagenomic reads. The other two pipelines are not easily 292 operated by users lacking advanced bioinformatics skills. Although academic users can 293 access VirusTAP and obtain viral genomes with ease, the package is not suitable for 294 assembling Ion Torrent data (Table 3, SRR1170797) and low-copy viruses (Table 3, 295 SRR062073). Alternatively, drVM provides a user-friendly interface (as shown in Fig. 2B) 296 and generates self-explanatory results (Fig. 2C and 2D); it therefore is well-suited for 297 clinical applications. We have applied drVM in the analyses of over 300 sequencing runs 298 retrieved from NCBI’s SRA to demonstrate that drVM is indeed able to detect and 299 reconstruct various viral genomes from metagenomic data. We can therefore expect to 300 extend our knowledge of viral diversity by leveraging the unique genome assembly 301 capabilities of drVM in the metagenomics domain. 302 303 19 304 AVAILABILITY AND REQUIREMTNS 305 Project name: drVM 306 Project home page: https://sourceforge.net/projects/sb2nhri/files/drVM/ 307 Operating system(s): OS X, Linux, Windows 308 Programming language: Python 309 Requirements: Docker or Virtual machine; 8 GB RAM 310 License: GNU General Public License, version 3.0 (GPL-3.0) 311 312 20 313 DECLARATIONS 314 List of abbreviations 315 NGS: 316 immunodeficiency virus, RSV: respiratory syncytial virus, WNV: West Nile virus next-generation sequencing, SRA: sequencing read archive, HIV: 317 318 Competing interests 319 The authors declare that they have no competing interests. 320 321 Author’ contributions 322 YCL conceived the project. HHL implemented the pipeline. YCL and HHL prepared, 323 read and approved the final manuscript. 324 325 Acknowledges 326 This work was supported by grants (PH-105-PP-05) from the National Health Research 327 Institute, Taiwan, and a research grant (MOST-104-2320-B-400-021-MY2) from the 328 Ministry of Science and Technology, Taiwan. 329 330 Additional file 21 331 Supplementary Table S1. List of 349 sequencing datasets from 18 research studies. 332 333 22 334 REFERENCES 335 336 337 338 339 340 341 342 1. 343 344 345 346 347 348 349 350 351 4. 352 353 354 355 356 357 358 359 360 2. 3. 5. 6. 7. 8. 9. Brister JR, Ako-Adjei D, Bao Y, Blinkova O: NCBI viral genomes resource. Nucleic Acids Res 2015, 43:D571-577. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z, et al: ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res 2012, 40:D593-598. Sharma D, Priyadarshini P, Vrati S: Unraveling the web of viroinformatics: computational tools and databases in virus research. J Virol 2015, 89:1489-1501. Chan PK: Outbreak of avian influenza A(H5N1) virus infection in Hong Kong in 1997. Clin Infect Dis 2002, 34 Suppl 2:S58-64. Bean AG, Baker ML, Stewart CR, Cowled C, Deffrasnes C, Wang LF, Lowenthal JW: Studying immunity to zoonotic diseases in the natural host - keeping it real. Nat Rev Immunol 2013, 13:851-861. Feldmann H: Ebola--a growing threat? N Engl J Med 2014, 371:1375-1378. Calvet G, Aguiar RS, Melo ASO, Sampaio SA, de Filippis I, Fabri A, Araujo ESM, de Sequeira PC, de Mendonça MCL, de Oliveira L, et al: Detection and sequencing of Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case study. The Lancet Infectious Diseases 2016. Batty EM, Wong TH, Trebes A, Argoud K, Attar M, Buck D, Ip CL, Golubchik T, Cule M, Bowden R, et al: A modified RNA-Seq approach for whole genome sequencing of RNA viruses from faecal and blood samples. PLoS One 2013, 8:e66129. Fischer N, Indenbirken D, Meyer T, Lutgehetmann M, Lellek H, Spohn M, Aepfelbacher M, Alawi M, Grundhoff A: Evaluation of Unbiased Next-Generation Sequencing of RNA (RNA-seq) as a Diagnostic Method in Influenza Virus-Positive Respiratory Samples. J Clin Microbiol 2015, 53:2238-2250. 361 362 363 364 365 10. Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J, Greninger AL, Luk KC, Enge B, et al: A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res 2014, 24:1180-1192. 366 367 11. Flygare S, Simmon K, Miller C, Qiao Y, Kennedy B, Di Sera T, Graf EH, Tardif KD, Kapusta A, Rynearson S, et al: Taxonomer: an interactive metagenomics analysis 23 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. portal for universal pathogen detection and host mRNA expression profiling. Genome Biol 2016, 17:111. Li Y, Wang H, Nie K, Zhang C, Zhang Y, Wang J, Niu P, Ma X: VIP: an integrated pipeline for metagenomics of virus identification and discovery. Sci Rep 2016, 6:23774. Yamashita A, Sekizuka T, Kuroda M: VirusTAP: Viral Genome-Targeted Assembly Pipeline. Front Microbiol 2016, 7:32. Merkel D: Docker: lightweight Linux containers for consistent development and deployment. Linux J 2014, 2014:2. Nocq J, Celton M, Gendron P, Lemieux S, Wilhelm BT: Harnessing virtual machines to simplify next-generation DNA sequencing analysis. Bioinformatics 2013, 29:2075-2083. Yozwiak NL, Skewes-Cox P, Stenglein MD, Balmaseda A, Harris E, DeRisi JL: Virus identification in unknown tropical febrile illness cases using deep sequencing. PLoS Negl Trop Dis 2012, 6:e1485. Chiu CY, Yagi S, Lu X, Yu G, Chen EC, Liu M, Dick EJ, Jr., Carey KD, Erdman DD, Leland MM, Patterson JL: A novel adenovirus species associated with an acute respiratory outbreak in a baboon colony and evidence of coincident human infection. MBio 2013, 4:e00084. Law J, Jovel J, Patterson J, Ford G, O'Keefe S, Wang W, Meng B, Song D, Zhang Y, Tian Z, et al: Identification of hepatotropic viruses from plasma using deep sequencing: a next generation diagnostic tool. PLoS One 2013, 8:e60595. Malboeuf CM, Yang X, Charlebois P, Qu J, Berlin AM, Casali M, Pesko KN, Boutwell CL, DeVincenzo JP, Ebel GD, et al: Complete viral RNA genome sequencing of ultra-low copy samples by sequence-independent amplification. Nucleic Acids Res 2013, 41:e13. Cotten M, Oude Munnink B, Canuti M, Deijs M, Watson SJ, Kellam P, van der Hoek L: Full genome virus detection in fecal samples using sensitive nucleic acid preparation, deep sequencing, and a novel iterative sequence classification algorithm. PLoS One 2014, 9:e93269. Ma Y, Madupu R, Karaoz U, Nossa CW, Yang L, Yooseph S, Yachimski PS, Brodie EL, Nelson KE, Pei Z: Human papillomavirus community in healthy persons, defined by metagenomics analysis of human microbiome project shotgun sequencing data sets. J Virol 2014, 88:4786-4797. Neill JD, Bayles DO, Ridpath JF: Simultaneous rapid sequencing of multiple RNA 24 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. virus genomes. J Virol Methods 2014, 201:68-72. Berg MG, Lee D, Coller K, Frankel M, Aronsohn A, Cheng K, Forberg K, Marcinkus M, Naccache SN, Dawson G, et al: Discovery of a Novel Human Pegivirus in Blood Associated with Hepatitis C Virus Co-Infection. PLoS Pathog 2015, 11:e1005325. Day JM, Oakley BB, Seal BS, Zsak L: Comparative analysis of the intestinal bacterial and RNA viral communities from sentinel birds placed on selected broiler chicken farms. PLoS One 2015, 10:e0117210. Greninger AL, Naccache SN, Federman S, Yu G, Mbala P, Bres V, Stryke D, Bouquet J, Somasekar S, Linnen JM, et al: Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Medicine 2015, 7. Nouri S, Salem N, Nigg JC, Falk BW: Diverse Array of New Viral Sequences Identified in Worldwide Populations of the Asian Citrus Psyllid (Diaphorina citri) Using Viral Metagenomics. J Virol 2015, 90:2434-2445. Karlsson OE, Larsson J, Hayer J, Berg M, Jacobson M: The Intestinal Eukaryotic Virome in Healthy and Diarrhoeic Neonatal Piglets. PLoS One 2016, 11:e0151481. Lojkic I, Bidin M, Prpic J, Simic I, Kresic N, Bedekovic T: Faecal virome of red foxes from peri-urban areas. Comp Immunol Microbiol Infect Dis 2016, 45:10-15. Wang Y, Zhu N, Li Y, Lu R, Wang H, Liu G, Zou X, Xie Z, Tan W: Metagenomic analysis of viral genetic diversity in respiratory samples from children with severe acute respiratory infection in China. Clin Microbiol Infect 2016. Zaharia M, Bolosky WJ, Curtis K, Fox A, Patterson D, Shenker S, Stoica I, Karp RM, Sittler T: Faster and more accurate sequence alignment with SNAP. arXiv preprint arXiv:11115572 2011. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 2009, 10:421. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012, 19:455-477. Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT: Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A 2014, 111:4904-4909. 25 438 439 440 441 442 443 444 445 446 447 34. 448 449 38. 35. 36. 37. Crusoe MR, Alameldin HF, Awad S, Boucher E, Caldwell A, Cartwright R, Charbonneau A, Constantinides B, Edvenson G, Fay S, et al: The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res 2015, 4:900. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25:1966-1967. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods 2012, 9:357-359. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010, 26:589-595. 450 451 26 452 FIGURE LEGENDS 453 Fig. 1 454 produce SNAP and BLAST databases. drVM.py analyzes NGS reads to produce viral 455 genomes. 456 Fig. 2 457 command-line interface. (B) Input associated with the graphical user interface. (C) 458 Output file structure generated by drVM. (D) Coverage profiles produced by drVM. A schematic flowchart of drVM. CreateDB.py processes viral sequences to Input and output explanation of drVM. (A) Input associated with the 459 27 460 Table 1 The feasibility of drVM in various types of clinical samples and public datasets. Sequencing method No. of runs Ref Bioinformatics pipeline for ultrarapid pathogen identification Complete viral RNA genome sequencing of ultra-low copy samples Illumina Illumina 13 34 Faecal virome of red foxes from peri-urban areas Full genome virus detection in fecal samples Human papillomavirus community in healthy persons Identification of hepatotropic viruses from plasma Intestinal bacterial and RNA viral communities from sentinel birds Intestinal virome in healthy and diarrhoeic neonatal piglets Metagenomic analysis for severe acute respiratory infection Illumina Illumina Illumina Illumina Ion Torrent Ion Torrent Illumina Metagenomic identification of viral pathogens in clinical samples New viral sequences identified in Asian citrus psyllid Novel adenovirus associated with baboon acute respiratory outbreak Novel human pegivirus associated with hepatitis C virus co-infection RNA sequencing in influenza virus-positive respiratory samples RNA-seq of RNA viruses from faecal and blood samples Simultaneous sequencing of multiple RNA virus genomes Viral genome-targeted assembly pipeline Study description Virus identification in unknown tropical febrile illness cases 461 462 a b Selected runs with assembled contigs > 1,000 bp [21] A selected run with mapped viral reads > 5000 [16] 28 drVM result (No. of run) Detection Reconstruction [10] [19] 5 34 0 21 8 20 20a 14 8 29 4 [28] [20] [21] [18] [24] [27] [29] 1 11 15 5 7 6 4 1 8 5 2 2 2 4 Illumina Illumina Illumina Illumina Illumina Illumina Ion Torrent Ion Torrent 3 4 4 15 49 82 40 1 [25] [26] [17] [23] [9] [8] [22] [13] 2 3 4 13 29 82 37 1 1 0 3 2 13 73 9 0 Illumina 1b [16] 1 0 463 464 Table 2 Various viral genomes reconstructed by drVM based on the corresponding NGS reads. Virus type dsDNA ssDNA dsRNA ssRNA (+) ssRNA (-) Retro-transcribing Viral genome reconstruction by drVM Human adenovirus (ERR233414), Human papillomavirus (ERR233428), Simian adenovirus (SRR766764) Adeno-associated virus (SRR766763), Bovine parvovirus (SRR2010686), Fox circovirus (SRR1920168), Human bocavirus (SRR2010686), Torque teno mini virus (SRR2040553), Torque teno virus (SRR2010686) Human picobirnavirus (ERR227885) Bovine viral diarrhea virus (SRR1170806), Chikungunya virus (SRR3180716), GB virus C (SRR544883), Hepatitis C virus (SRR2940645), Human pegivirus (SRR2940645), Human rhinovirus (SRR2010685), Norovirus (ERR233428), Pepino mosaic virus (ERR227848), Pepper mild mottle virus (ERR233431), Porcine kobuvirus (ERR1097471), Tobacco mild green mosaic virus (ERR233421), Tobacco mosaic virus (ERR233421), Tomato mosaic virus (ERR233424), West Nile virus (SRR527705) Human parainfluenza virus (SRR2010686), Human respiratory syncytial virus (SRR527716), Influenza A virus (ERR690510) Hepatitis B virus (ERR233420), Human immunodeficiency virus (SRR513075) 465 466 29 467 468 Table 3 Performance comparison between drVM, SURPI, VIP and VirusTAP. Target virus (run accession) Bovine viral diarrhea virus (SRR1170797) Bases (Mbp) 12.5 Run timea Result b drVM SURPI (Comprehensive) VIP (Sense) VirusTAP 149 sec 52,365 sec 1,683 sec 71 sec 12,224 bp 262 bp 9,078 bp 353 bp Human immunodeficiency virus (SRR1106548)c 600.9 Run time Result 598 sec 3,055 bp 32,604 sec 799 bp 25,049 sec 4,632 bp 1,388 sec 2,896 bp Human papillomavirus (SRR062073) 5000 Run time Result 608 sec 7,654 & 7,914 bp 18,107 sec 3,515 bp 8,603 sec 2,164 bp 519 sec 555 bp Human rotavirus A (DRR049387) 266.1 Run time Result 464 sec 13 contigs 59,510 sec 13 contigs 6,259 sec 41377 reads 925 sec Influenza A virus (ERR690519) 3300 Run time 12,697sec 16,997 sec 86,001 sec 4,504 sec Result 8 H3N2 segments 11 contigs 2,673 reads 34 viral contigs 469 470 471 472 473 a 474 c 11 viral contigs The analyses were executed on a quad-core CPU with 128 GB RAM, except for VirusTAP (executed on a 120-core CPU with 1 TB RAM). b The result is summarized as the largest length of viral contig (bold: close to the genome size of target virus), read number mapped to the target virus, or contig number depending on the output of tools and the target virus (Human rotavirus A: 11 segments, Influenza A virus: 8 segments). The results were produced by analyzing reads with barcode: GCCAAT, except for SURPI, which recognizes multiple barcodes. 30 475 Fig. 1 A schematic flowchart of drVM. CreateDB.py processes viral sequences to produce SNAP and BLAST databases. drVM.py 31 analyzes NGS reads to produce viral genomes. 476 477 Fig. 2 Input and output explanation of drVM. (A) Input of command-line interface. (B) Input of graphic user interface. (C) Output 32 file structure from drVM. (D) Coverage profiles produced by drVM. 478 33
© Copyright 2026 Paperzz