Introduction to OTU Clustering Susan Huse August 4, 2016 What is an OTU? Operational Taxonomic Units a.k.a. phylotypes a.k.a. clusters aggregations of reads based only on sequence similarity, independent of taxonomic name The Magical 3% 3% SSU OTUs = Species and 6% SSU OTUs = Genera NOT! History of The 3% OTUs Wayne et al (1987) IJSEM 37(4): 463-464 Uses genome-level phylogeny to define species, using 70% DNA-DNA reassociation. Stackebrandt and Goebel (1994) IJSEM 44(4): 846-849 70% DNA-DNA ~ 97% similarity of full-length 16S Therefore… 97% similarity of any hypervariable region in any bacteria or fungus = species? Cumulative Percentage of Non-Singleton Taxa 18% of genera are > 10% OTUs 30% of genera are < 3% OTUs Genus (827) Family (244) Order (92) Class (52) Phylum (31) Fig 1A Schloss & Westcott (2011) Appl Env Microbiol Maximum Intra-Taxon Distance for Full-length 16S Why OTUs? Because we can! We can’t: • do whole genome sequencing of communities • assign complete taxonomic names • derive phylogenetic trees from millions of short tags Continuing to develop new methods that are more biologically or evolutionarily meaningful Come back in 10 years… Pros and Cons OTUs Taxonomy Novel organisms Universal names Insufficient taxonomy Meaning associated with names Does not lump together all order or family-level classifications Independent of clustering width and algorithm Many names based on phenotype rather than genotype Historically well-studied are split New areas are lumped Cluster “Width” Diameter Sequences are never more than D apart. (CL) Radius Sequences are never more than R from seed or reference. (Ref, Greedy) Variable Only selected positions (Oligo) Averaged radii (AL) Primary Aggregation Methods Complete linkage - no two sequences are farther apart than the clustering width Average linkage - the average distance between sequences is less than the width Greedy clusters – sequences are less than the width to OTU representatives built de novo Reference – sequences are less than the width to universal reference set of OTU representatives Oligotyping – sequences are equivalent at most informationrich nucleotide positions Probabilistic – error modeling of likely clusters De Novo Greedy Clustering Most abundant reads: • are likely true sequences • least likely to be errors or chimeras, and • most likely to spawn errors and chimeras Seed OTUs with most abundant, Assign variation to most abundant source. De Novo Greedy Clustering Sort by abundance First Seq1 is seed of Cluster1 Compare next Seq to each cluster in order: if Distance(Seq, Seed) <= W, include it in the cluster if Seq is not a member of any cluster, seed a new cluster Edgar (2010) Bioinformatics 26(19):2460-1 Greedy Seeds >W Each tag goes to OTU with nearest representative, not OTU with the nearest sequence Closed Reference OTUs 1. Create a set full-length reference sequences (e.g., Greengenes) 2. Cluster reference sequences 3. Select representative sequence for each cluster 4. Map each tag to nearest rep <= Width 5. Bin each tag to that reference’s OTU#, or unclustered if > Width Rideout et al (2014) PeerJ in press Open Reference OTUs 1. Perform Closed Reference OTU mapping 2. Perform de novo clustering on unclustered sequences 3. Subsequent datasets can be mapped to the reference and to an expanding set of de novo clusters. Reference OTU Caveat X% clustering of full-length reference sequences does not represent X% for each hypervariable region. Will vary by phylotype Oligotyping Calculated Entropy Alignment positions of greatest information Oligotypes identical in selected positions Ignore the rest as noise Alignment Position Eren et al (2013) Methods in Ecol and Evol 4:1111-9 Pros and Cons • Greedy, AL - standard de novo methods, abundance dependent, project-specific • Reference OTU – compare V-regions, universal IDs, dependent on reference database and reference clustering • Oligotyping / DADA2 – increased sensitivity, no specific radius, projectspecific The Problem of Inflation Clustering algorithms return more OTUs than predicted for mock communities. OTU inflation leads to: • alpha diversity inflation • beta diversity inflation Where does this inflation come from? How can we adjust for this inflation? OTU inflation varied with sample depth MS-CL M2FN Rarefaction - PML 7000 6000 OTUs 5000 5K 4000 10K 3000 15K 20K 2000 50K 1000 100K 0 - 20,000 40,000 60,000 80,000 100,000 120,000 Number of Sequences Sampled Low-Quality Reads and Errant OTUs Word to the wise: minimize adventurous OTUs Different clustering algorithms have very different effects on the size and number of OTUs created… Inflation in Action >200,000 V6 reads of E. coli clone (454) 1,042 is a few more than the expected 2 Huse et al (2010) Environ Microbiol 12(7): 1889-98 Sequencing Noise in E.coli Distance Tags 0 177,697 0.02 33,425 0.03 3738 0.04 2 0.05 635 0.06 38 0.07 89 0.08 22 0.09 0 0.10 4 > 0.10 23 Seqs 2 633 2268 1 306 25 72 17 0 4 16 82.4% 97.9% 99.6% 99.6% 99.9% 99.9% 100.0% 100.0% 100.0% 100.0% 100.0% Cluster at 3% using only tags within 3% of the “correct” sequence Inflation with small sequence error Only 129 OTUs… much better!! Clustering V6 sequences within 3% of known templates (outliers removed) E. coli (2) S. epidermidis (1) MS-CL 129 / 89 / 694 Clustering method Clone43 (43) MS-AL 54 / 44 / 218 Alignment method PW-CL 6 / 5 / 308 MS – Multiple Sequence Alignment PW – Pairwise Alignment PW-AL 2 / 1 / 43 CL – Complete Linkage clustering AL – Average Linkage clustering Multiple Sequence Alignments Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent, more tags cause more problems 18,156 sequences and 392 positions MSA Limitations Independent Distances Distance Comparison Subsampled Distances X-axis: 1000 distances subsampled from MSA of 40,000 sequences Y-axis: 1000 distances from MSA of only the 1,000 bacterial sequences Complete Linkage propagates OTUs ClusterCount: 36 27 1 5 4 #5 #6 #4 #3 #2 #7 #1 #9 #8 #12 #11 #10 #13 EachV6variantwith2ntdistance,willstartanewOTU. EachV6variantwith1ntdistance,canclusterwiththeseedora2ndvariant. Average Linkage collapses errors ClusterCount: 1 #1 Clusterstendtobeheavilydominatedbytheirmostabundant sequence,whichstronglyweightstheaverageandsmoothesthenoise. Greedy Seeds (UClust) ClusterCount: 1 Effect of Sampling Depth M2FN Rarefaction - PML MS-CL 7000 6000 OTUs 5000 5K 4000 10K 3000 15K 20K 2000 50K 1000 100K 0 - 20,000 40,000 60,000 80,000 100,000 120,000 Number of Sequences Sampled Newer methods do not Previous algorithms differentially inflated OTUs at a sampling depth dependent upon total sample depth Errant OTUs If a low-quality sequence is >3% from its source, it can create a new OTU. If the rate of an errant read = 1 in 10 thousand, and we have 1 million reads: 0.0001 * 1,000,000 reads = 100 errant OTUs Impact of Errant OTUs Errant OTUs as percent of OTUs decreases with diversity. If we have 100 errant OTUs: Mock community: 100 / 50 OTUs = +200% Diverse community: 100 / 2,000 OTUs = +5% Relative Inflation Absolute number of errant OTUs will increase with sample size. Relative number of errant OTUs will decrease with sample complexity What is the right cluster width? No. What is the right cluster width? Distances vary by hypervariable region – Full-length does not equal V1-V3 – which does not equal V3-V5 – which does not equal V6, etc. Distance vary by bacterial lineage What specificity are you trying to gain? No single clustering width is the right answer, use several (and wave your hands a lot) or use other methods like Oligotyping Questions about Clustering • How meaningful are clusters functionally? • When is an errare rare and when is it an error? • Should it be included in an existing cluster or start its own? • How to assign sequences if OTUs overlap? • What is the effect of residual low quality data or chimeras? • How sensitive are alpha and beta diversity estimates to clustering results?
© Copyright 2026 Paperzz