TECHNICAL REPORT

One Codex: Genomic microbial detection

Microbial detection on the One Codex platform – October 2015

One Codex provides a robust and reproducible cloud-based computational platform for identifying microbes using state-of-the-art bioinformatics. The One Codex metagenomic classification system, powered by a database containing roughly 40,000 whole genomes, provides microbial identification with best-in-class accuracy.

Summary

• One Codex exhaustively compares input samples against a database of 40,000 whole microbial genomes
• Extensive testing against validation datasets demonstrates best-in-class detection accuracy
• The One Codex data platform supports data archival, sample comparison, and metagenomic search

Microbial detection

One Codex classifies unknown nucleotide sequences according to the set of signature sequences within them that are unique to a specific taxonomic group. Each read is first broken into the complete set of overlapping sequences of length 31 bp that comprise it (k-mers). These k-mers are compared against an exhaustive database that contains every known k-mer that is unique to a specific taxonomic grouping (e.g., a specific clade of bacteria, viruses, etc.). Each read or contig is then assigned to the microbial clade it most closely resembles, and the complete sample is summarized as a collection of these signature sequences indicating the presence of a group of organisms. The user is presented with the complete set of evidence for every organism in each sample.

Reference Database

The One Codex reference database is the largest available index of whole microbial genomes, numbering approximately 40,000 bacteria, viruses, protists, archaea, and fungi. A compressed data structure is used to index and rapidly search the complete set of unique k-mers found within these genomes. As microbial genomes continue to be sequenced at a rapid rate, the One Codex reference database will continue to expand and incorporate those novel genomes as well.

Figure: ROC curve for six metagenomic classification methods and databases (False Positive Rate on horizontal axis, True Positive Rate on vertical axis). The accuracy associated with the detection of 100 species-specific reads is labeled. Overall accuracy is represented by the AUC statistic, for which One Codex shows the best performance.

Data Platform

Users upload raw metagenomic sequence data to the One Codex platform through a graphical upload tool with both drag-and-drop and folder navigation options. A command-line tool and API are also available for large-volume data upload (https://docs.onecodex.com). Once uploaded, reads are taxonomically classified and an interactive report is populated and linked to the user’s account. The user interface allows users to organize metadata, compare samples, and search quickly by the microbes detected in each sample.

Deployment and Reproducibility

The One Codex microbial detection algorithm is executed within a static Linux container running in a secure cloud environment. Every version of the algorithm, database, and input data is tracked with a unique identifier that is stored alongside each executed analysis. This deployment strategy guarantees deterministic execution such that each analysis result can be exactly reproduced. The One Codex platform ensures that all users have access to the highest level of analytical accuracy, state-of-the-art bioinformatics, and intuitive data visualization.

Benchmarks

Accuracy Metrics

Two metrics are commonly used to summarize the accuracy of any microbial detection algorithm that assigns a specific label to individual sequences within a dataset. Sensitivity measures the proportion of all labeled reads that are assigned correctly (in the ROC figure above, at the genus level). Specificity measures the proportion of reads with an assignment consistent with the true taxon. A higher rate of false positive results leads to a lower specificity, while a higher rate of false negatives leads to a lower sensitivity. Both of those measures of accuracy are encapsulated in the AUC statistic.

Validation Datasets

In order to measure the performance of a panel of microbial detection algorithms, we used a validation dataset generated from a synthetic microbial mixture [i]. This dataset contains 50 million reads simulated from 10,639 microbial genomes and was designed to mimic shotgun-sequencing datasets from complex microbial mixtures, except that the source of each individual read is known.

Results

The accuracy statistics for a panel of modern metagenomic classification methods are shown in the table below. The sensitivity and specificity of One Codex at the species and strain levels exceed those of all other methods. The One Codex “RefSeq” database, which only includes the microbial genomes contained in NCBI RefSeq, is shown for comparison.

Classifier              Species-Level Specificity   Strain-Level Specificity
One Codex               97.9%                       82.4%
One Codex “RefSeq” DB   86.4%                       11.8%
Minikraken              93.6%                       20.3%
Kraken                  89.3%                       15.7%
Clark                   84.9%                       –
Seed-Kraken             86.9%                       11.4%

Table: Accuracy metrics for a panel of microbial detection methods generated using 50 million reads simulated from 10,639 genomes. Specificity statistics are at the individual read level.

The overall accuracy of detection was also summarized by a ROC curve analysis, shown in the figure above, which summarizes the probability that a given analytical threshold (number of reads detected) will result in accurate detection. The AUC statistic summarizes classification accuracy, indicating that One Codex is the most accurate detection method. As a point of illustration, the True Positive Rate and False Positive Rate resulting from an analytical cutoff of 100 reads are indicated with a point. The placement of that point shows that using a threshold of 100 reads to determine whether or not an organism is present is only feasible when using One Codex.

Conclusion

One Codex detects specific microbes within genomic data by exhaustively searching against a database of 40,000 whole microbial genomes. The performance of this method, defined by the sensitivity and precision of detection, exceeds that of the state-of-the-art comparison methods. Crucially, the framework of execution of One Codex is user-friendly and reproducible, such that any result can be exactly recreated from the same input data. This combination of accuracy, precision, and reproducibility makes the One Codex platform the most powerful and robust method currently available for microbial detection.

[i] Minot S, et al. (2015) bioRxiv doi: 10.1101/027607
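The k-mer matching described under Microbial detection can be illustrated with a minimal sketch. It is not the One Codex implementation: the dictionary `signature_index` (k-mer to taxon name), the function names, and the simple majority vote used to pick the "closest" taxon are all assumptions made for illustration. The report only states that reads are split into overlapping 31 bp k-mers and compared against taxon-specific k-mers.

```python
from collections import Counter

K = 31  # k-mer length stated in the report


def kmers(read, k=K):
    """Return every overlapping k-mer of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def classify_read(read, signature_index, k=K):
    """Assign a read to the taxon whose signature k-mers it hits most often.

    signature_index is a hypothetical dict mapping k-mer -> taxon name.
    The majority vote is an assumed simplification, not the published method.
    Returns (taxon, hit_count), or (None, 0) when no k-mer matches.
    """
    hits = Counter(
        signature_index[km] for km in kmers(read, k) if km in signature_index
    )
    if not hits:
        return None, 0
    taxon, count = hits.most_common(1)[0]
    return taxon, count
```

With a toy index and a short k, for example `classify_read("ACGTACGT", {"ACGT": "taxon_A", "GTAC": "taxon_B"}, k=4)`, the read is assigned to the taxon with the most matching k-mers, and the per-taxon hit counts play the role of the "evidence" reported for each organism.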
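The version tracking described under Deployment and Reproducibility can be sketched as a fingerprint computed from the three tracked inputs. The report does not say how its identifiers are generated; the use of SHA-256, the function name `analysis_id`, and the idea of combining the three inputs into a single digest are assumptions for illustration only.

```python
import hashlib


def analysis_id(algorithm_version, database_version, input_bytes):
    """Return a hex digest that changes if any of the three inputs changes.

    Hypothetical scheme: each part is length-prefixed before hashing so that
    different splits of the same concatenated bytes cannot collide.
    """
    digest = hashlib.sha256()
    parts = (algorithm_version.encode(), database_version.encode(), input_bytes)
    for part in parts:
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()
```

Storing such an identifier alongside each result makes it possible to check that a re-run used the same algorithm, database, and input data before comparing outputs.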
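The read-count cutoff discussed in the ROC analysis can be made concrete with a small sketch that computes one point on such a curve. The function name, the argument names, and the toy data are hypothetical; only the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN) and the 100-read cutoff come from common usage and the report respectively.

```python
def detection_rates(read_counts, truly_present, candidates, threshold=100):
    """Return (true_positive_rate, false_positive_rate) at a read-count cutoff.

    read_counts:   dict mapping organism -> number of reads assigned to it
    truly_present: set of organisms actually in the sample
    candidates:    set of all organisms that could be reported
    An organism is called present when it has at least `threshold` reads.
    """
    called = {org for org, n in read_counts.items() if n >= threshold}
    tp = len(called & truly_present)
    fp = len(called - truly_present)
    fn = len(truly_present - called)
    tn = len(candidates - called - truly_present)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

Sweeping `threshold` over a range of values and plotting the resulting pairs traces out a ROC curve; the area under that curve is the AUC statistic the report uses to compare methods.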