One Codex White Paper

 TECHNICAL REPORT
One Codex: Genomic microbial detection
One Codex provides a robust and reproducible cloud-based computational platform for identifying microbes using state-of-the-art bioinformatics. The One Codex metagenomic classification system, powered by a database containing roughly 40,000 whole genomes, provides microbial identification with best-in-class accuracy.
Summary
• One Codex exhaustively compares input samples against a database of 40,000 whole microbial genomes
• Extensive testing against validation datasets demonstrates best-in-class detection accuracy
• The One Codex data platform supports data archival, sample comparison, and metagenomic search
Data Platform
As microbial genomes continue to be sequenced at a rapid rate, the One Codex reference database will continue to expand and incorporate those novel genomes.
Users upload raw metagenomic sequence data to the One Codex platform through a graphical upload tool with both drag-and-drop and folder navigation options. A command-line tool and API are also available for large-volume data upload (https://docs.onecodex.com). Once uploaded, reads are taxonomically classified and an interactive report is populated and linked to the user’s account. The user interface allows users to organize metadata, compare samples, and search quickly by the microbes detected in each sample.
Deployment and Reproducibility
The One Codex microbial detection algorithm is
executed within a static Linux container running in a
secure cloud environment. Every version of the algorithm, database, and input data is tracked with a
unique identifier that is stored alongside each analysis that is executed. This deployment strategy guarantees deterministic execution such that each analysis result can be exactly reproduced. The One Codex platform ensures that all users have access to
the highest level of analytical accuracy, state-of-the-art bioinformatics, and intuitive data visualization.
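As an illustration of how an analysis could be tied to its exact inputs, the sketch below derives one identifier from an algorithm version, a database version, and a hash of the input data. The function name, field names, and version strings are hypothetical and are not taken from the One Codex platform; the point is only that identical inputs always yield the same identifier.

```python
import hashlib
import json


def analysis_id(algorithm_version: str, database_version: str, input_data: bytes) -> str:
    """Return a deterministic identifier for one analysis run (illustrative only)."""
    record = {
        "algorithm": algorithm_version,
        "database": database_version,
        # Hash the raw input so any change to the reads changes the identifier.
        "input_sha256": hashlib.sha256(input_data).hexdigest(),
    }
    # sort_keys keeps the serialization, and therefore the hash, stable.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical version strings and a tiny FASTQ-like input.
reads = b"@read1\nACGTACGT\n+\nIIIIIIII\n"
run_a = analysis_id("classifier-1.0", "db-2015-10", reads)
run_b = analysis_id("classifier-1.0", "db-2015-10", reads)
run_c = analysis_id("classifier-1.0", "db-2015-11", reads)
```

Here run_a and run_b are identical because every input is the same, while run_c differs because the database version changed.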
Microbial Detection
One Codex classifies unknown nucleotide sequences according to the set of signature sequences
within it that are unique to a specific taxonomic
group. Each read is first broken into the complete
set of overlapping subsequences of length 31 bp
(k-mers) that comprise it. These k-mers are compared
against an exhaustive database that contains every
known k-mer that is unique to a specific taxonomic
grouping (e.g., a specific clade of bacteria, viruses,
etc.). Each read or contig is then assigned to the
microbial clade it most closely resembles, and the
complete sample is summarized as a collection of
these signature sequences indicating the presence
of a group of organisms. The user is presented with
the complete set of evidence for every organism in
each sample.
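The k-mer lookup described above can be sketched in a few lines. This is a minimal illustration, not the One Codex implementation: the toy index, the taxon names, and the majority-vote rule for picking the closest clade are all assumptions, and details such as reverse complements and the taxonomic tree are left out.

```python
from collections import Counter

K = 31  # k-mer length stated in the text


def kmers(sequence: str, k: int = K):
    """Yield every overlapping k-mer in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def classify_read(read: str, unique_kmers: dict, k: int = K):
    """Assign a read to the taxon supported by the most taxon-specific k-mers.

    `unique_kmers` maps a k-mer to the single taxon it is unique to
    (a hypothetical toy index). Returns (taxon, hit_counts); taxon is None
    when no k-mer of the read appears in the index.
    """
    hits = Counter(unique_kmers[km] for km in kmers(read, k) if km in unique_kmers)
    if not hits:
        return None, hits
    return hits.most_common(1)[0][0], hits


# Toy example with k=4 so the sequences stay readable.
toy_index = {"ACGT": "taxon_A", "CGTA": "taxon_A", "TTTT": "taxon_B"}
taxon, evidence = classify_read("ACGTAC", toy_index, k=4)
unmatched, _ = classify_read("GGGGGG", toy_index, k=4)
```

The returned counter plays the role of the per-organism evidence: it records how many signature k-mers supported each taxon for that read.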
Reference Database
Figure: ROC curves for six metagenomic classification methods and databases (False Positive Rate on the horizontal axis, True Positive Rate on the vertical axis). The accuracy associated with the detection of 100 species-specific reads is labeled. Overall accuracy is represented by the AUC statistic, for which One Codex shows the best performance.
The One Codex reference database is the largest
available index of whole microbial genomes, comprising approximately 40,000 genomes from bacteria, viruses, protists, archaea, and fungi. A compressed data structure is used to index and rapidly search the complete set of unique k-mers found within these genomes.
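The text does not say which compressed data structure is used, so the following is only a sketch of one common way to store k-mers compactly: packing each base into 2 bits, so that a 31-mer fits in 62 bits, and keeping only k-mers that occur in a single taxon. The function names and toy genomes are hypothetical, and bases other than A, C, G, and T are not handled.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value


def build_unique_index(genomes: dict, k: int) -> dict:
    """Map each encoded k-mer to its taxon, keeping only k-mers seen in one taxon.

    `genomes` maps a taxon name to a genome sequence (toy data).
    """
    owner = {}
    shared = set()
    for taxon, sequence in genomes.items():
        for i in range(len(sequence) - k + 1):
            code = encode(sequence[i:i + k])
            if code in shared:
                continue
            if code in owner and owner[code] != taxon:
                # Seen in a second taxon: no longer a unique signature.
                del owner[code]
                shared.add(code)
            else:
                owner[code] = taxon
    return owner


# Toy genomes, k=4: "CGTA" and "GTAC" occur in both and are dropped.
toy_genomes = {"taxon_A": "ACGTAC", "taxon_B": "CGTACC"}
unique_index = build_unique_index(toy_genomes, k=4)
```

In this toy case only "ACGT" (taxon_A) and "TACC" (taxon_B) survive as signatures, since the other two k-mers are shared between the genomes.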
Microbial detection on the One Codex platform – October 2015
Benchmarks
The One Codex “RefSeq” database, which only includes the microbial genomes
contained in NCBI RefSeq, is shown for comparison.
Classifier               Species-Level Specificity   Strain-Level Specificity
One Codex                97.9%                       82.4%
One Codex “RefSeq” DB    86.4%                       11.8%
Minikraken               93.6%                       20.3%
Kraken                   89.3%                       15.7%
Clark                    84.9%                       –
Seed-Kraken              86.9%                       11.4%
The overall accuracy of detection was also summarized by a ROC curve analysis, shown on the previous page, which summarizes how likely a given analytical threshold (number of reads detected) is to result in accurate detection. The AUC statistic summarizes classification accuracy and indicates that One Codex is the most accurate of the detection methods compared. As a point of illustration, the True Positive Rate and False Positive Rate resulting from an analytical cutoff of 100 reads are indicated with a point. The placement of that point shows that using a threshold of 100 reads to determine whether or not an organism is present is only feasible when using One Codex.
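To make the threshold idea concrete, the sketch below computes the True Positive Rate and False Positive Rate for a read-count cutoff, sweeps all cutoffs to trace a ROC curve, and integrates it with the trapezoidal rule. The read counts and presence labels are invented toy values, not benchmark data, and the code assumes at least one organism is present and at least one is absent.

```python
def operating_point(read_counts, truly_present, threshold):
    """Return (FPR, TPR) when organisms with >= threshold reads are called present."""
    positives = sum(truly_present)
    negatives = len(truly_present) - positives
    calls = [count >= threshold for count in read_counts]
    tp = sum(c and t for c, t in zip(calls, truly_present))
    fp = sum(c and not t for c, t in zip(calls, truly_present))
    return fp / negatives, tp / positives


def roc_points(read_counts, truly_present):
    """Return the ROC curve as sorted (FPR, TPR) pairs, one per distinct cutoff."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for threshold in set(read_counts):
        points.add(operating_point(read_counts, truly_present, threshold))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: reads assigned to six organisms and whether each is truly present.
counts = [500, 120, 90, 3, 150, 1]
present = [True, True, True, False, False, False]
fpr_100, tpr_100 = operating_point(counts, present, threshold=100)
area = auc(roc_points(counts, present))
```

With these toy values, a 100-read cutoff detects two of the three true organisms and wrongly calls one of the three absent organisms, and the area under the curve is 7/9.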
Table: Accuracy metrics for a panel of microbial detection methods generated using 50 million reads simulated from 10,639 genomes. Specificity statistics are at the individual read level.
Conclusion
One Codex detects specific microbes within genomic data by exhaustively searching against a database of 40,000 whole microbial genomes. The
performance of this method, defined by the sensitivity and precision of detection, exceeds that of the
state-of-the-art comparison methods. Crucially, the
framework of execution of One Codex is user-friendly and reproducible, such that any result can
be exactly recreated from the same input data. This
combination of accuracy, precision, and reproducibility makes the One Codex platform the most powerful and robust method currently available for microbial detection.
Accuracy Metrics
Two metrics are commonly used to summarize the
accuracy of any microbial detection algorithm that
assigns a specific label to individual sequences
within a dataset. Sensitivity measures the proportion of all labeled reads that are assigned correctly
(in the AUC graph on the prior page, at the genus
level). Specificity measures the proportion of reads
with an assignment consistent with the true taxon. A
higher rate of false positive results leads to a lower
specificity, while a higher rate of false negatives
leads to a lower sensitivity. Both of those measures
of accuracy are encapsulated in the AUC statistic.
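The sketch below shows one possible reading of these two read-level metrics, stated here as an assumption rather than as the exact formulas behind the reported figures: sensitivity is taken as the fraction of all reads with a known source that are assigned correctly, and specificity as the fraction of assigned reads whose assignment matches the true taxon. The function name and toy labels are hypothetical.

```python
def read_level_metrics(true_taxa, assigned_taxa):
    """Return (sensitivity, specificity) under the reading described above.

    true_taxa[i] is the known source of read i; assigned_taxa[i] is the
    classifier's label, or None if the read was left unassigned.
    """
    total = len(true_taxa)
    assigned = sum(a is not None for a in assigned_taxa)
    correct = sum(a == t for a, t in zip(assigned_taxa, true_taxa))
    sensitivity = correct / total
    # With no assigned reads there is nothing to be specific about.
    specificity = correct / assigned if assigned else 0.0
    return sensitivity, specificity


# Toy example: one read left unassigned, one read assigned to the wrong taxon.
truth = ["A", "A", "B", "B", "C"]
calls = ["A", None, "B", "A", "C"]
sensitivity, specificity = read_level_metrics(truth, calls)
```

In this toy case the unassigned read lowers only the sensitivity, while the wrongly assigned read lowers both values.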
[i] Minot S, et al. (2015) bioRxiv doi: 10.1101/027607
Validation Datasets
To measure the performance of a panel of microbial detection algorithms, we used a validation dataset generated from a synthetic microbial mixture [i]. This dataset contains 50 million reads simulated from 10,639 microbial genomes and was designed to mimic shotgun-sequencing datasets from complex microbial mixtures, except that the source of each individual read is known.
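The key property of such a dataset, that every read carries a known source, can be illustrated with a small simulation. This is a simplified sketch under stated assumptions (uniform sampling, no sequencing errors, toy genomes, hypothetical function name); the procedure actually used for the validation dataset is described in the cited preprint and is not reproduced here.

```python
import random


def simulate_reads(genomes, n_reads, read_length, seed=0):
    """Draw reads uniformly from toy genomes and record the source of each.

    Returns a list of (source_taxon, read_sequence) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    taxa = sorted(genomes)
    reads = []
    for _ in range(n_reads):
        taxon = rng.choice(taxa)
        genome = genomes[taxon]
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((taxon, genome[start:start + read_length]))
    return reads


# Two toy genomes; real validation sets draw from thousands of genomes.
toy_sources = {"taxon_A": "ACGTACGTTGCA", "taxon_B": "GGCCTTAAGGCC"}
simulated = simulate_reads(toy_sources, n_reads=5, read_length=6)
```

Because each read is stored with its source taxon, any classifier's output can be scored read by read against the truth.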
Results
The accuracy statistics for a panel of modern metagenomic classification methods are shown in the table above. The sensitivity and specificity of One Codex at the species and strain levels exceed those of
all other methods.