OTU Clustering

Introduction to
OTU Clustering
Susan Huse
August 4, 2016
What is an OTU?
Operational Taxonomic Units
a.k.a. phylotypes
a.k.a. clusters
aggregations of reads
based only on sequence similarity,
independent of taxonomic name
The
Magical 3%
3% SSU OTUs = Species
and
6% SSU OTUs = Genera
NOT!
History of The 3% OTUs
Wayne et al (1987) IJSEM 37(4): 463-464
Uses genome-level phylogeny to define species,
using 70% DNA-DNA reassociation.
Stackebrandt and Goebel (1994) IJSEM 44(4): 846-849
70% DNA-DNA ~ 97% similarity of full-length 16S
Therefore…
97% similarity of any hypervariable region in any bacteria
or fungus = species?
Cumulative Percentage of Non-Singleton Taxa
18% of genera are
> 10% OTUs
30% of genera are
< 3% OTUs
Genus (827)
Family (244)
Order (92)
Class (52)
Phylum (31)
Fig 1A Schloss & Westcott
(2011) Appl Env Microbiol
Maximum Intra-Taxon Distance for Full-length 16S
Why OTUs?
Because we can!
We can’t:
• do whole genome sequencing of communities
• assign complete taxonomic names
• derive phylogenetic trees from millions of short tags
Continuing to develop new methods that are more
biologically or evolutionarily meaningful
Come back in 10 years…
Pros and Cons
OTUs
Taxonomy
Novel organisms
Universal names
Insufficient taxonomy
Meaning associated with
names
Does not lump together all
order or family-level
classifications
Independent of clustering
width and algorithm
Many names based on
phenotype rather than
genotype
Historically well-studied are
split
New areas are lumped
Cluster “Width”
Diameter
Sequences are
never more than D
apart.
(CL)
Radius
Sequences are
never more than R
from seed or
reference.
(Ref, Greedy)
Variable
Only selected
positions (Oligo)
Averaged radii (AL)
Primary Aggregation Methods
Complete linkage - no two sequences are farther apart than
the clustering width
Average linkage - the average distance between sequences
is less than the width
Greedy clusters – sequences are less than the width to OTU
representatives built de novo
Reference – sequences are less than the width to universal
reference set of OTU representatives
Oligotyping – sequences are equivalent at most informationrich nucleotide positions
Probabilistic – error modeling of likely clusters
De Novo Greedy Clustering
Most abundant reads:
• are likely true sequences
• least likely to be errors or chimeras, and
• most likely to spawn errors and chimeras
Seed OTUs with most abundant,
Assign variation to most abundant source.
De Novo Greedy Clustering
Sort by abundance
First Seq1 is seed of Cluster1
Compare next Seq to each cluster in order:
if Distance(Seq, Seed) <= W, include it in
the cluster
if Seq is not a member of any cluster, seed a
new cluster
Edgar (2010) Bioinformatics 26(19):2460-1
Greedy Seeds
>W
Each tag goes to OTU with nearest representative,
not OTU with the nearest sequence
Closed Reference OTUs
1. Create a set full-length reference
sequences (e.g., Greengenes)
2. Cluster reference sequences
3. Select representative sequence for each
cluster
4. Map each tag to nearest rep <= Width
5. Bin each tag to that reference’s OTU#,
or unclustered if > Width
Rideout et al (2014) PeerJ in press
Open Reference OTUs
1. Perform Closed Reference OTU mapping
2. Perform de novo clustering on unclustered
sequences
3. Subsequent datasets can be mapped to
the reference and to an expanding set of
de novo clusters.
Reference OTU Caveat
X% clustering of full-length reference
sequences does not represent X% for each
hypervariable region.
Will vary by phylotype
Oligotyping
Calculated Entropy
Alignment positions of greatest information
Oligotypes identical in selected positions
Ignore the rest as noise
Alignment Position
Eren et al (2013) Methods in Ecol and Evol 4:1111-9
Pros and Cons
• Greedy, AL - standard de novo methods,
abundance dependent, project-specific
• Reference OTU – compare V-regions,
universal IDs, dependent on reference
database and reference clustering
• Oligotyping / DADA2 – increased
sensitivity, no specific radius, projectspecific
The Problem of Inflation
Clustering algorithms return more OTUs
than predicted for mock communities.
OTU inflation leads to:
• alpha diversity inflation
• beta diversity inflation
Where does this inflation come from?
How can we adjust for this inflation?
OTU inflation varied with sample depth
MS-CL
M2FN Rarefaction - PML
7000
6000
OTUs
5000
5K
4000
10K
3000
15K
20K
2000
50K
1000
100K
0
-
20,000
40,000
60,000
80,000 100,000 120,000
Number of Sequences Sampled
Low-Quality Reads and
Errant OTUs
Word to the wise: minimize adventurous OTUs
Different clustering algorithms
have very different effects on the size
and number of OTUs created…
Inflation in Action
>200,000 V6 reads of E. coli clone (454)
1,042 is a few more
than the expected 2
Huse et al (2010) Environ Microbiol 12(7): 1889-98
Sequencing Noise in E.coli
Distance
Tags
0 177,697
0.02 33,425
0.03
3738
0.04
2
0.05
635
0.06
38
0.07
89
0.08
22
0.09
0
0.10
4
> 0.10
23
Seqs
2
633
2268
1
306
25
72
17
0
4
16
82.4%
97.9%
99.6%
99.6%
99.9%
99.9%
100.0%
100.0%
100.0%
100.0%
100.0%
Cluster at
3% using
only tags
within 3%
of the
“correct”
sequence
Inflation with small sequence error
Only 129 OTUs…
much better!!
Clustering V6 sequences
within 3% of known templates
(outliers removed)
E. coli (2)
S. epidermidis (1)
MS-CL
129 / 89 / 694
Clustering method
Clone43 (43)
MS-AL
54 / 44 / 218
Alignment method
PW-CL
6 / 5 / 308
MS – Multiple Sequence Alignment
PW – Pairwise Alignment
PW-AL
2 / 1 / 43
CL – Complete Linkage clustering
AL – Average Linkage clustering
Multiple Sequence Alignments
Regardless of clustering algorithm,
an MSA cannot fully align tags whose
sequences are too divergent,
more tags cause more problems
18,156 sequences and 392 positions
MSA Limitations
Independent Distances
Distance Comparison
Subsampled Distances
X-axis: 1000 distances subsampled from MSA of 40,000 sequences
Y-axis: 1000 distances from MSA of only the 1,000 bacterial sequences
Complete Linkage
propagates OTUs
ClusterCount: 36
27
1
5
4
#5
#6
#4
#3
#2 #7
#1
#9
#8
#12
#11
#10
#13
EachV6variantwith2ntdistance,willstartanewOTU.
EachV6variantwith1ntdistance,canclusterwiththeseedora2ndvariant.
Average Linkage
collapses errors
ClusterCount: 1
#1
Clusterstendtobeheavilydominatedbytheirmostabundant
sequence,whichstronglyweightstheaverageandsmoothesthenoise.
Greedy Seeds
(UClust)
ClusterCount: 1
Effect of Sampling Depth
M2FN Rarefaction - PML
MS-CL
7000
6000
OTUs
5000
5K
4000
10K
3000
15K
20K
2000
50K
1000
100K
0
-
20,000
40,000
60,000
80,000 100,000 120,000
Number of Sequences Sampled
Newer methods do not
Previous algorithms
differentially inflated OTUs
at a sampling depth
dependent upon total
sample depth
Errant OTUs
If a low-quality sequence is >3% from its
source,
it can create a new OTU.
If the rate of an errant read = 1 in 10
thousand, and
we have 1 million reads:
0.0001 * 1,000,000 reads = 100 errant OTUs
Impact of Errant OTUs
Errant OTUs as percent of OTUs decreases
with diversity. If we have 100 errant OTUs:
Mock community:
100 / 50 OTUs = +200%
Diverse community:
100 / 2,000 OTUs = +5%
Relative Inflation
Absolute
number of
errant OTUs
will increase
with sample
size.
Relative
number of
errant OTUs
will decrease
with sample
complexity
What is the right cluster width?
No.
What is the right cluster width?
Distances vary by hypervariable region
– Full-length does not equal V1-V3
– which does not equal V3-V5
– which does not equal V6, etc.
Distance vary by bacterial lineage
What specificity are you trying to gain?
No single clustering width is the right answer,
use several (and wave your hands a lot) or use
other methods like Oligotyping
Questions about Clustering
• How meaningful are clusters functionally?
• When is an errare rare and when is it an error?
• Should it be included in an existing cluster or start its
own?
• How to assign sequences if OTUs overlap?
• What is the effect of residual low quality data or
chimeras?
• How sensitive are alpha and beta diversity estimates to
clustering results?