BBMap: A Fast, Accurate, Splice-Aware Aligner Brian Bushnell1* 1 LBNL Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, USA *To whom correspondence should be addressed: Email: [email protected] March 21, 2014 ACKNOWLEDGMENTS: The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Special thanks for testing, suggestions and feedback to: Rob Egan, Alex Copeland, Brian Foster, Alicia Clum, Hui Sun, Kurt LaButti, Vasanth Singan, Andrew Tritt, Alexander Spunde, Joel Martin, James Han, Mat Nolan and Matt Scholz. DISCLAIMER: LBNL: This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or The Regents of the University of California. BBMap: A Fast, Accurate, Splice-Aware Aligner Brian Bushnell Department of Energy, Joint Genome Institute, 2800 Mitchel Drive, Walnut Creek, California 94598, USA Introduction Results Conclusions Speed vs Mismatches (Soil Metagenome) 900 800 smalt bwa Correctly Mapped Kbps 25 bbmap Percent True Positive Alignment of reads is one of the primary computational tasks in bioinformatics. Of paramount importance to resequencing, alignment is also crucial to other areas quality control, scaffolding, string-graph assembly, homology detection, assembly evaluation, error-correction, expression quantification, and even as a tool to evaluate other tools. An optimal aligner would greatly improve virtually any sequencing process, but optimal alignment is prohibitively expensive for gigabases of data. Here, we will present BBMap [1], a fast splice-aware aligner for short and long reads. We will demonstrate that BBMap has superior speed, sensitivity, and specificity to alternative high-throughput aligners bowtie2 [2], bwa [3], smalt, [4] GSNAP [5], and BLASR [6]. ROC Curve (Soil Metagenome) 30 bwa 20 15 10 700 bbmap 600 smalt 500 400 300 200 100 5 0 0 0 0.0001 0.001 0.01 0.1 1 10 4 8 12 20 24 28 32 Number of Mismatches Percent False Positive Only 3 aligners were capable of indexing the metagenome, though it was not particularly large, at 5Gbp and 22M scaffolds. bowtie2 Signal/Noise Ratio vs Substitution Length (Phycomyces) 40 35 bbmap smalt 30 Speed vs Substitution Length (Phycomyces) 7000 bwa Correctly Mapped Kbps Mapping perfect reads is easy, but real reads have errors and mutations; any reference substring can be transformed into any read by applying a series of insertions, deletions, substitutions, and no-calls. The alignment game is played by assigning scores to these operations, then finding the location(s) in a reference maximizing that function. A function correctly reflecting probabilities of errors and mutations will yield its highest score at a read’s most likely origin, allowing correct alignments to be made. A good aligner will be able to map reads rapidly and accurately in the presence of mutations. SNR, Decibels Problem Description gsnap 25 20 15 10 Materials & Methods bowtie2 bwa 6000 bbmap 5000 smalt 4000 gsnap 3000 2000 1000 5 0 0 0 4 8 12 16 20 24 28 32 0 4 8 12 Substitution Length 16 20 24 28 32 Substitution Length BBMap is good with contiguous substitutions that trouble FM-indexed aligners. True Positive Rate vs Insertion Length (E.coli) 90 80 bowtie2 70 60 bwa 50 bbmap 40 smalt 30 gsnap 20 10 0 ROC Curve (Maize) 0 4 8 12 16 20 24 False Positive Rate vs Insertion Length (E.coli) Percent False Positive (of Output) 100 Percent Correct To evaluate the aligners, synthetic reads were generated, transmuted, and tagged with their genomic start and stop. Aligners then mapped reads to the reference, and the resulting sam file was graded. A mapping was considered ‘strictly correct’ if the start and stop matched the genomic origin. To reflect the diversity of JGI research, multiple genomes were used: 1 fungus, 1 bacteria, 1 plant, and 1 soil metagenome. Prior to mapping, the genomes were fully indexed, and the times recorded. 70 16 100 28 100 90 80 bowtie2 70 bwa 60 bbmap 50 40 smalt 30 gsnap References 20 10 0 32 0 4 8 Insertion Length bowtie2 12 16 20 24 28 32 Insertion Length 60 Bowtie2 and BBMap both handle insertions well, but BBMap handles longer ones. gsnap bbmap True Positive Rate vs Deletion Length (Maize) 100 bwa 20 10 80 70 60 bowtie2 50 bwa 40 bbmap 30 0 0.001 smalt 20 0.01 0.1 1 10 gsnap 10 100 0 Percent False Positive False Positive Rate vs Deletion Length (Maize) 100 90 blasr 30 Percent False Positive (of Input) 40 Percent Correct Percent True Positive smalt 50 90 bowtie2 80 bwa 70 bbmap 60 smalt 50 gsnap 40 30 20 4 16 64 256 1024 4096 10 16384 1 4 16 64 Deletion Length 100 90 100 80 70 80 40 20 Seconds Minutes 60 60 50 40 30 20 bowtie2 True Positive Rate vs Error Rate For Synthetic PacBio Data (Maize) 100 90 Indexing Time (Phycomyces) 1024 4096 16384 Signal/Noise Ratio vs Mismatches (Maize) bwa 35 bbmap 70 blasr 60 bowtie2 40 bwa 80 smalt 50 40 30 SNR, Decibels Indexing Time (Maize) 256 Deletion Length BBMap is unrivalled at processing long deletions. Bowtie2 is second, while smalt has an exceptionally high error rate. Percent Correct 120 [1] sourceforge.net/projects/bbmap/ [2] bowtie-bio.sourceforge.net/bowtie2/ [3] bio-bwa.sourceforge.net/ [4] research-pub.gene.com/gmap/ [5] www.sanger.ac.uk/resources/software/smalt/ [6] Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13:238 doi:10.1186/1471-2105-13-238 0 1 The tests were run on an exclusive NERSC Mendel node with 128GB RAM and 16 Ivy Bridge E cores in two sockets. All programs used their default settings and 32 threads, as hyperthreading was enabled. All reads were single-ended 150bp reads except synthetic PacBio, which was 400bp reads. ROC curves are generated from reads with a variety of SNPs, indels, and nocalls. • BBMap is shown to be a fast and accurate aligner, capable of correctly handling an overall wider variety of references, reads, and mutations than others. It has particularly outstanding performance with deletions, especially long ones, that other aligners cannot handle at all. While less common than SNPs, such largescale features indicating (for example) the complete absence of a gene or promoter will not even be detected by other aligners. • GSNAP’s performance was unexpectedly bad. It was incapable of indexing soil, and yielded 100% incorrect mappings against Maize, for unknown reasons. It had generally inferior performance on the two genomes it seemed able to process, Phycomyces and E.coli. And despite being billed as a de novo splice aligner, GSNAP was incapable of mapping reads across any long deletions. • Bwa was fairly fast and showed fairly good results as long as the edit distance was less than 7, but was incapable of handling low-identity reads and performed poorly with indels. Though these tests were run with default settings, more sensitive settings were also explored, causing an exponential increase in runtime with marginal improvement in results. • Bowtie2 did a fairly good job in all cases. Though slower than bwa with perfect reads, it was much better able to handle lower-identity reads and indels. Overall, it seemed like a better tool than bwa for general use; but as it failed to index the soil metagenome, it can’t be recommended for metagenomics. • Smalt was the only aligner to maintain a consistent speed with decreasing read identity, and was able to map more highly mutated reads than anything else, even BBMap. However, it does so at the cost of extremely high false positive rates, particularly for indels. • BLASR had acceptable speed but poor accuracy on synthetic PacBio data, mapping fewer reads and generating more false positives than alternatives. This may in part be because BLASR is optimized for its native format, rather than fastq. Regardless, it does not seem like a good choice for mapping short Illumina reads to PacBio for error correction. bbmap 30 smalt gsnap 20 15 10 20 5 10 0 0 0 4 8 12 16 Percent Error 20 Acknowledgements 25 0 4 8 12 16 20 24 28 32 Special thanks for testing, suggestions, and feedback go to: Rob Egan, Alex Copeland, Brian Foster, Alicia Clum, Hui Sun, Kurt LaButti, Vasanth Singan, Andrew Tritt, Alexander Spunde, Joel Martin, James Han, Matt Nolan, and Matt Scholz.. Number of Mismatches 10 0 0 Smalt and BBMap have much faster indexing than FM-transform aligners. BBMap is the most accurate with PacBio’s error profile, even moreso than PacBio’s own BLASR. Real PacBio data is around 15% error. BBMap is always first or second in this graph, and is capable of accurately mapping reads with more errors than other aligners. For any questions or comments, please contact the author at [email protected] The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
© Copyright 2026 Paperzz