CompBio

Computational Metagenomics:
Algorithms for Understanding the
"Unculturable" Microbial Majority
Sourav Chatterji
UC Davis Genome Center
[email protected]
Background
The Microbial World
Exploring the Microbial World
• Culturing
– Majority of microbes currently unculturable.
– No ecological context.
• Molecular Surveys (e.g. 16S rRNA)
– “who is out there?”
– “what are they doing?”
Environmental Shotgun Sequencing
Interpreting Metagenomic Data
• Nature of Metagenomic Data
– Mosaic
– Fragmentary
• New Sequencing Technologies
– Enormous amount of data
– Short Reads
Overview of Talk
• Metagenomic Binning
• PhyloMetagenomics
• The Big Picture/ Future Work
Overview of Talk
• Metagenomic Binning
– Background
– CompostBin [to appear in RECOMB 2008]
• PhyloMetagenomics
• The Big Picture
Metagenomic Binning
Classification of sequences by taxa
Current Binning Methods
•
•
•
•
•
Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA,Phylopythia]
Current Binning Methods
• Need closely related reference genomes.
• Poor performance on short fragments.
– Sanger sequence reads 500-1000 bp long.
– Current assembly methods unreliable
• Complex Communities Hard to Bin.
Genome Signatures
• Does genomic sequence from an organism have a
unique “signature” that distinguishes it from genomic
sequence of other organisms?
– Yes [Karlin et al. 1990s]
• What is the minimum length sequence that is
required to distinguish genomic sequence of one
organism from the genomic sequence of another
organism?
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
DNA-composition metrics
• Working with K-mers for Binning.
– Curse of Dimensionality : O(4K) independent
dimensions.
– Statistical noise increases with decreasing
fragment lengths.
• Project data into a lower dimensional space to
decrease noise.
– Principal Component Analysis.
PCA separates species
Gluconobacter oxydans[65% GC] and Rhodospirillum rubrum[61% GC]
Effect of Skewed Relative Abundance
Abundance 1:1
Abundance 20:1
B. anthracis and L. monogocytes
A Weighting Scheme
For each read, find overlap with other sequences
A Weighting Scheme
4
5
5
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
3
Weighted PCA
• Calculate weighted mean µw :
N
μ w =  w iXi
i =1
• Calculates weighted co-variance matrix Mw
N
T
=
M w  w i (X i μ w )(X i μ w )
i =1
• Principal Components are eigenvectors of Mw.
– Use first three PCs for further analysis.
Weighted PCA separates species
PCA
Weighted PCA
B. anthracis and L. monogocytes : 20:1
Un-supervised Classification
Semi-Supervised Classification
• 31 Marker Genes [courtesy Martin Wu]
– Omni-present
– Relatively Immune to Lateral Gene Transfer
• Reads containing these marker genes can be
classified with high reliability.
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
The Semi-supervised
Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph (KNNgraph) from the point set.
2. Update the KNN-graph with information from
marker genes.
3. Bisect the graph using the normalized-cut
algorithm.
Generalization to multiple bins
Apply algorithm
recursively
Gluconobacter oxydans [0.61], Granulobacter
bethesdensis[0.59] and Nitrobacter hamburgensis [0.62]
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulobacter
bethesdensis[0.59] and Nitrobacter hamburgensis [0.62]
Testing
• Simulate Metagenomic Sequencing
– Variables
•
•
•
•
Number of species
Relative abundance
GC content
Phylogenetic Diversity
• Test on a “real” dataset where answer is wellestablished.
Future Directions
• Holy Grail : Complex Communities
• Semi-supervised methods
– More marker genes
– Semi-supervised projection?
• Hybrid Methods
– Assembly Information
– Population Genetic Information
Overview of Talk
• Metagenomic Binning
• Phylo-Metagenomics
– Applications
– Incorporating Alignment Accuracy
• The Big Picture/ Future Work
Population Structure of Communities
Garcia Martin et al., Nat. Biotechnology (2006)
Gene Family Characterization
Yooseph et al., PLoS Biology (2007)
Manual Masking
• Require skilled and tedious manual
intervention
• Subjective and non-reproducible
• Impractical for high throughput data
– Frequently ignored. “Garbage-in-and-garbageout”
Gblocks
Probabilistic Masking using pair-HMMs
• Probabilistic formulation
of alignment problem.
• Can answer additional
questions
– Alignment Reliability
– Sub-optimal Alignments
Durbin et al., Cambridge University Press (1998)
Probabilistic Masking
• What is the probability residues xi and yj are
homologous?
• Posterior Probability the residues xi and yj are
homologous
Pr[x i y j ] =
Pr[x i y j , x, y]
Pr[x, y]
• Can be calculated efficiently for all pairs (and
gaps) in quadratic time.
Scoring Multiple Alignments
• Calculate the “posterior probability matrix”
and distances dij between every pair of
sequences.
• Weighted “sum of pairs” score for column r :
 d Pr[r  r ]
d
ij
i
i, j
ij
i, j
j
Testing
The Balibase 3.0 Benchmark Database
Testing
• Realign sequences using MSA programs like
Clustalw.
• Sensitivity: for all correctly aligned columns,
the fraction that has been masked as good
• Specificity: for all incorrectly aligned
columns, the fraction that has been masked
as bad
Performance
Sensitivity
Specificity
Prob Mask
97%
93%
Gblocks
53%
94%
The Final Result
A Phylogenetic Database/Pipeline (with Martin Wu)
Overview of Talk
• Metagenomic Binning
• Phylo-Metagenomics
• The Big Picture/ Future Work
Population Structure
Venter et al. , Science (2004)
How to integrate information from
multiple markers?
Species-species Interactions
Interactions in Microbial Communities
Time Series Data
Ruan et al., Bioinformatics (2006)
Interaction Networks in
Microbial Communities
Ruan et al., Bioinformatics (2006)
Functional Profiling
Prediction of Gene Function
Prediction of Metabolic Pathway
Functional Profiling (with Binning)
McCutcheon and Moran PNAS.(2007)
Single Cell Genomics
Hutchinson and Venter, Nature Biotechnology (2006)
Single Cell Genomics
Reads From Single Cell
“Simulated” Contamination
The Big Picture
Microbial Community
Metagenomic Sampling
Single Cell Genomics
Population Structure
Functional Profiling
Time Series Data
Species Interaction Network
Acknowledgements
UC Davis
• Jonathan Eisen
• Martin Wu
• Dongying Wu
• Ichitaro Yamazaki
• Amber Hartman
• Marcel Huntemann
UC Berkeley
• Lior Pachter
• Richard Karp
• Ambuj Tewari
• Narayanan Manikandan
Princeton University
• Simon Levin
• Josh Weitz
• Jonathan Dushoff