CIPRES:
Enabling Tree of Life Projects
Tandy Warnow
The Program in Evolutionary
Dynamics at Harvard University
The University of Texas at Austin
Phylogeny
From the Tree of the Life Website,
University of Arizona
Orangutan
Gorilla
Chimpanzee
Human
Evolution informs about
everything in biology
• Big genome sequencing projects just produce data – so
what?
• Evolutionary history relates all organisms and genes, and
helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species
Cyber Infrastructure for Phylogenetic Research
Purpose: to create a national infrastructure of hardware,
algorithms, database technology, etc., necessary to infer
the Tree of Life.
Group: 40 biologists, computer scientists, and
mathematicians from 13 institutions.
Funding: $11.6 M (large ITR grant from NSF).
CIPRes Members
University of New Mexico
Bernard Moret
David Bader
UCSD/SDSC
Fran Berman
Alex Borchers
Phil Bourne
John Huelsenbeck
Terri Liebowitz
Mark Miller
UT Austin
Tandy Warnow
David M. Hillis
Warren Hunt
Robert Jansen
Randy Linder
Lauren Meyers
Daniel Miranker
University of Arizona
David R. Maddison
University of Connecticut
Paul O Lewis
University of British Columbia
Wayne Maddison
University of Pennsylvania
Junhyong Kim
Susan Davidson
Sampath Kannan
North Carolina State University
Spencer Muse
Texas A&M
Tiffani Williams
American Museum of Natural
History
Ward C. Wheeler
NJIT
Usman Roshan
UC Berkeley
Satish Rao
Steve Evans
Richard M Karp
Brent Mishler
Elchanan Mossel
Eugene W. Myers
Christos M. Papadimitriou
Stuart J. Russell
Rice
Luay Nakhleh
SUNY Buffalo
William Piel
Florida State University
David L. Swofford
Mark Holder
Yale
Michael Donoghue
Paul Turner
DNA Sequence Evolution
[Figure: evolution of a DNA sequence down a tree, from the ancestral
sequence AAGACTT at -3 million years, through intermediate sequences
(AAGGCCT, TGGACTT; then AGGGCAT, TAGCCCT, AGCACTT) at -2 and -1 million
years, to today's sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]
Steps in a phylogenetic analysis
• Gather data
• Align sequences
• Reconstruct phylogeny on the multiple alignment, often obtaining a large number of trees
• Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
• Perform post-tree analyses.
Phylogeny Problem
[Figure: the input is one sequence per taxon (U, V, W, X, Y, with
sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT), and the output
is the unrooted tree relating the five taxa.]
CIPRES research in algorithms
• Heuristics for NP-hard problems in phylogeny reconstruction
• Compact representation of sets of trees
• Reticulate evolution reconstruction
• Performance of phylogeny reconstruction methods under stochastic models of evolution
• Gene order phylogeny
• Genomic alignment
• Lower bounds for MP
• Distance-based reconstruction
• Gene family evolution
• High-throughput phylogenetic placement
• Multiple sequence alignment
Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria
   (Maximum Parsimony and Maximum Likelihood)

   [Figure: search landscape over phylogenetic trees, showing the cost at
   a local optimum vs. the global optimum]

2. Polynomial time distance-based methods: Neighbor Joining, FastME,
   Weighbor, etc.
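The second class of methods above can be made concrete with a minimal sketch of Neighbor Joining: repeatedly join the pair of nodes minimizing the Q-criterion and replace the pair with a new internal node. The function name, the dictionary-of-dictionaries distance representation, and the returned list of join events are illustrative choices, not CIPRES code.

```python
import itertools

def neighbor_joining(labels, dist):
    """Neighbor Joining (Saitou & Nei): repeatedly join the pair (a, b)
    minimizing the Q-criterion (n-2)*d(a,b) - r(a) - r(b), replacing the
    pair with a new internal node. Returns the list of join events."""
    nodes = list(labels)
    D = {a: dict(dist[a]) for a in nodes}
    joins = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r(a): sum of distances from a to all other current nodes
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        a, b = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        new = f"internal{counter}"
        counter += 1
        joins.append((a, b, new))
        # distances from the new internal node to every remaining node
        D[new] = {}
        for c in nodes:
            if c in (a, b):
                continue
            d = 0.5 * (D[a][c] + D[b][c] - D[a][b])
            D[new][c] = d
            D[c][new] = d
        nodes = [c for c in nodes if c not in (a, b)]
        nodes.append(new)
    return joins
```

On an additive matrix (one that exactly fits a tree), the first join recovers a true cherry, which is the property behind NJ's statistical consistency discussed later in the talk.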
Performance criteria
• Running time.
• Space.
• Statistical performance issues (e.g., statistical
consistency) with respect to a Markov model of
evolution.
• “Topological accuracy” with respect to the
underlying true tree. Typically studied in
simulation.
• Accuracy with respect to a particular criterion
(e.g. tree length or likelihood score), on real data.
Markov models of site evolution
Simplest (Jukes-Cantor):
• The model tree is a pair (T,{e,p(e)}), where T is a rooted binary
tree, and p(e) is the probability of a substitution on the edge e
• The state at the root is random
• If a site changes on an edge, it changes with equal probability to
each of the remaining states
• The evolutionary process is Markovian
More complex models (such as the General Markov model) are
also considered, with little change to the theory.
Variation between different sites is either prohibited or minimized,
in order to ensure identifiability of the model.
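The Jukes-Cantor substitution rule on a single edge can be sketched directly from the definition above: a site changes with probability p(e), and a change moves to each of the three remaining states with equal probability. The function name is an illustrative choice.

```python
import random

def evolve_jc(seq, p, rng=random):
    """Evolve a DNA sequence across one edge under Jukes-Cantor:
    each site changes with probability p; if it changes, it moves to
    one of the three other states uniformly at random."""
    out = []
    for state in seq:
        if rng.random() < p:
            out.append(rng.choice([s for s in "ACGT" if s != state]))
        else:
            out.append(state)
    return "".join(out)
```

Applying this edge-by-edge down a model tree, starting from a random root state at each site, generates sequences like those in the evolution figure earlier in the talk.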
Distance-based Phylogenetic Methods
Maximum Parsimony
• Input: Set S of n aligned sequences of length k
• Output:
– A phylogenetic tree T leaf-labeled by sequences in S
– additional sequences of length k labeling the internal
nodes of T
such that
Σ_(i,j)∈E(T) H(i,j)
is minimized.
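Given a full labeling (leaves plus internal nodes), the objective above is just a sum of Hamming distances over the edges. A minimal sketch, using the four-sequence example that appears later in the deck; the internal node names "u" and "v" and their labels are illustrative:

```python
def hamming(si, sj):
    """H(i, j): number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(si, sj))

def parsimony_length(edges, labels):
    """Sum of H(i, j) over all edges (i, j) of T, given sequences
    labeling every node of T (leaves and internal nodes alike)."""
    return sum(hamming(labels[i], labels[j]) for i, j in edges)
```

The hard part of Maximum Parsimony is not evaluating this sum but searching over both tree topologies and internal labelings.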
Maximum Likelihood
• Input: Set S of n aligned sequences of length k,
and a specified parametric model
• Output:
– A phylogenetic tree T leaf-labeled by sequences in S
– With additional model parameters (e.g. edge “lengths”)
such that Pr[S|(T, params)] is maximized.
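For intuition about the ML objective, the likelihood of a single edge under Jukes-Cantor has a simple closed form: a differing site contributes p/3 (the probability of that specific change) and a matching site contributes 1-p. This one-edge sketch is an assumption-laden simplification (real ML sums over unknown internal states on a whole tree); the function name is illustrative.

```python
from math import log

def jc_edge_log_likelihood(s1, s2, p):
    """Log of Pr[s2 | s1] across a single edge under Jukes-Cantor,
    where p is the total substitution probability on the edge
    (so each specific change has probability p/3)."""
    ll = 0.0
    for a, b in zip(s1, s2):
        ll += log(p / 3.0) if a != b else log(1.0 - p)
    return ll
```

Maximizing such products of per-site terms over topologies and edge parameters is what makes ML computationally hard.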
Approaches for “solving” MP/ML
1. Hill-climbing heuristics (which can get stuck in local optima)
2. Randomized algorithms for getting out of local optima
3. Approximation algorithms for MP (based upon Steiner Tree
   approximation algorithms)

[Figure: search landscape over phylogenetic trees, showing the cost at a
local optimum vs. the global optimum]
Theoretical results
• Neighbor Joining is polynomial time, and
statistically consistent under typical models
of evolution.
• Maximum Parsimony is NP-hard, and even
exact solutions are not statistically
consistent under typical models.
• Maximum Likelihood is NP-hard and
statistically consistent under typical models.
Theoretical convergence rates
• Atteson: Let T be a General Markov model tree defining additive matrix
  D. Then Neighbor Joining will reconstruct the true tree with high
  probability from sequences that are of length at least
  O(lg n · e^(max_ij D_ij)).
• Proof: Show NJ is accurate on any input matrix d such that
  max_ij |D_ij - d_ij| < f/2, for f the minimum edge length.
Problems with NJ
• Theory: The convergence rate is
exponential: the number of sites needed to
obtain an accurate reconstruction of the tree
with high probability grows exponentially
in the evolutionary diameter.
• Empirical: NJ has poor performance on
datasets with some large leaf-to-leaf
distances.
Quantifying Error
[Figure: true tree vs. inferred tree with a 50% error rate.
FN: false negative (missing edge); FP: false positive (incorrect edge).]
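These error rates are computed by comparing the internal edges (bipartitions of the leaf set) of the two trees. A minimal sketch, assuming each bipartition is already represented as a frozenset of the taxa on one consistently chosen side; the function name is illustrative:

```python
def fn_fp_rates(true_splits, inferred_splits):
    """FN rate: fraction of the true tree's internal edges (bipartitions)
    missing from the inferred tree. FP rate: fraction of the inferred
    tree's internal edges absent from the true tree. Each split is a
    frozenset of the taxa on one (consistently chosen) side."""
    true_splits = set(true_splits)
    inferred_splits = set(inferred_splits)
    fn = len(true_splits - inferred_splits) / len(true_splits)
    fp = len(inferred_splits - true_splits) / len(inferred_splits)
    return fn, fp
```

When both trees are binary, the FN and FP rates coincide and equal the (normalized) Robinson-Foulds distance.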
Neighbor joining has poor performance on large
diameter trees [Nakhleh et al. ISMB 2001]
[Figure: NJ error rate (proportion of incorrect edges in inferred trees)
vs. number of taxa, rising toward 0.8 as the number of taxa grows to
1600. Simulation study based upon fixed edge lengths, K2P model of
evolution, sequence lengths fixed to 1000 nucleotides.]
• Other standard polynomial time methods
don’t improve substantially on NJ (and have
the same problem with large diameter
datasets).
• What about trying to “solve” maximum
parsimony or maximum likelihood?
Solving NP-hard problems
exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would
  find the best tree in 2890 millennia

  #leaves   #trees
  4         3
  5         15
  6         105
  7         945
  8         10395
  9         135135
  10        2027025
  20        2.2 x 10^20
  100       4.5 x 10^190
  1000      2.7 x 10^2900
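The counts in the table come directly from the double factorial (2n-5)!! = 1 · 3 · 5 · ... · (2n-5); a minimal sketch (function name illustrative):

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!!: the number of distinct unrooted binary trees on
    n labeled leaves (n >= 3). Each new leaf can be attached to any
    of the 2k-5 edges of the previous tree, giving the odd product."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 1 * 3 * 5 * ... * (2n-5)
        count *= k
    return count
```

Python's arbitrary-precision integers make even the 1000-leaf count exact, which is precisely why exhaustive search is hopeless.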
How good an MP analysis do we
need?
• Our research shows that we need to get within 0.01% of optimal (or
  even better, on large datasets) to return reasonable estimates of the
  true tree's "topology"
Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real
dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using
any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: "Performance of TNT with time." Average MP score above optimal,
shown as a percentage of the optimal (y-axis, 0 to 0.2%), plotted against
hours of analysis (x-axis, 0 to 24).]
Empirical problems with existing
methods
• Heuristics for Maximum Parsimony (MP) and
Maximum Likelihood (ML) cannot handle large
datasets (take too long!) – we need new heuristics
for MP/ML that can analyze large datasets
• Polynomial time methods have poor topological
accuracy on large diameter datasets – we need
better polynomial time methods
Using divide-and-conquer
• Conjecture: better (more accurate) solutions will be found if we
  analyze a small number of smaller subsets and then combine solutions
• Note: different "base" methods will need potentially different
  decompositions.
• Alert: the subtree compatibility problem is NP-complete!
DCMs: Divide-and-conquer for
improving phylogeny reconstruction
Strict Consensus Merger (SCM)
“Boosting” phylogeny
reconstruction methods
• DCMs “boost” the performance of
phylogeny reconstruction methods.
[Diagram: base method M + DCM → DCM-M]
DCMs (Disk-Covering Methods)
• DCMs for polynomial time methods
improve topological accuracy (empirical
observation), and have provable theoretical
guarantees under Markov models of
evolution
• DCMs for hard optimization problems reduce the running time needed to
  achieve good levels of accuracy (empirical observation)
Absolute fast convergence vs.
exponential convergence
DCM-Boosting [Warnow et al. 2001]
• DCM+SQS is a two-phase procedure which
reduces the sequence length requirement of
methods.
[Diagram: exponentially converging method + DCM + SQS → absolute fast
converging method]
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
[Figure: error rate vs. number of taxa (0 to 1600) for NJ and DCM1-NJ,
on a y-axis from 0 to 0.8.]

• DCM1-boosting makes distance-based methods more accurate
• Theoretical guarantees that DCM1-NJ converges to the true tree from
  polynomial length sequences
Major challenge: MP and ML
• Maximum Parsimony (MP) and Maximum
Likelihood (ML) remain the methods of
choice for most systematists
• The main challenge here is to make it
possible to obtain good solutions to MP or
ML in reasonable time periods on large
datasets
Maximum Parsimony
• Input: Set S of n aligned sequences of
length k
• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the
internal nodes of T
such that
Σ_(i,j)∈E(T) H(i,j)
is minimized.
Maximum parsimony (example)
• Input: Four sequences
  – ACT
  – ACA
  – GTT
  – GTA
• Question: which of the three trees has the best MP score?
Maximum Parsimony

[Figure: the three possible unrooted tree topologies on the four
sequences ACT, ACA, GTT, GTA.]

Maximum Parsimony

[Figure: the same three trees with internal labels and per-edge Hamming
distances. The three trees have MP scores 7, 5, and 4; the tree grouping
ACT with ACA and GTT with GTA has MP score 4 and is the optimal MP tree.]
Maximum Parsimony:
computational complexity

Optimal labeling of a fixed tree can be computed in linear time O(nk)

[Figure: the optimal MP tree on ACT, ACA, GTT, GTA, with internal labels
and per-edge costs summing to MP score = 4]

Finding the optimal MP tree is NP-hard
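The linear-time optimal labeling is achieved by Fitch's algorithm for unweighted parsimony: one post-order pass per site, intersecting child state sets where possible and counting a substitution where the intersection is empty. A minimal sketch returning the score only (a backtrace would recover the labels); the nested-tuple tree encoding is an illustrative choice:

```python
def fitch_score(tree, leaf_seqs):
    """Fitch's algorithm: the minimum total number of substitutions on a
    fixed binary topology, computed independently per site. `tree` is a
    nested tuple whose leaves are keys of `leaf_seqs`."""
    def site_score(node, i):
        if isinstance(node, str):            # leaf: its state set is fixed
            return {leaf_seqs[node][i]}, 0
        left, right = node
        ls, lc = site_score(left, i)
        rs, rc = site_score(right, i)
        common = ls & rs
        if common:                           # no substitution forced here
            return common, lc + rc
        return ls | rs, lc + rc + 1          # one substitution at this node
    k = len(next(iter(leaf_seqs.values())))
    return sum(site_score(tree, i)[1] for i in range(k))
```

On the four-sequence example, the topology pairing ACT with ACA (and GTT with GTA) scores 4, agreeing with the optimal MP tree shown in the slides.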
Problems with current techniques for MP
Best methods are a combination of simulated annealing, divide-and-conquer and
genetic algorithms, as implemented in the software package TNT. However, they
do not reach 0.01% of optimal on large datasets in 24 hours.
[Figure: "Performance of TNT with time." Average MP score above optimal,
shown as a percentage of the optimal (y-axis, 0 to 0.2%), plotted against
hours of analysis (x-axis, 0 to 24).]
Observations
• The best MP heuristics cannot get
acceptably good solutions within 24 hours
on most of these large datasets.
• Datasets of these sizes may need months (or
years) of further analysis to reach
reasonable solutions.
• Apparent convergence can be misleading.
Our objective: speed up the best
MP heuristics
Fake study
[Figure: MP score of best trees vs. time, contrasting the performance of
a hill-climbing heuristic with the desired performance.]
Divide-and-conquer technique
for speeding up MP/ML searches
But: it didn’t work!
• A simple divide-and-conquer was
insufficient for the best performing MP
heuristics -- TNT by itself was as good as
DCM(TNT).
How can we improve upon existing techniques?
Tree Bisection and Reconnection (TBR)

[Figure: delete an edge, splitting the tree in two; then reconnect the
two trees with a new edge that bifurcates an edge in each tree]
A conjecture as to why current
techniques are poor:
• Our studies suggest that trees with near-optimal scores tend to be
  topologically close (RF distance less than 15%) to the other
  near-optimal trees.
• The standard technique (TBR) for moving around tree space explores
  O(n^3) trees, which are mostly topologically distant.
• So TBR may be useful initially (to reach near-optimality), but then
  more "localized" searches are more productive.
Using DCMs differently
• Observation: DCMs make small local changes to
the tree
• New algorithmic strategy: use DCMs iteratively
and/or recursively to improve heuristics on large
datasets
• However, the initial DCMs for MP
– produced large subproblems and
– took too long to compute
• We needed a decomposition strategy that produces
small subproblems quickly.
New DCM3 decomposition
Input: Set S of sequences, and guide-tree T
1. Compute short subtree graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T) and form subproblems
DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively
Iterative-DCM3

[Diagram: guide tree T → DCM3 → base method → improved tree T', which
feeds back in as the next guide tree]
New DCMs

• DCM3
  1. Compute subproblems using the DCM3 decomposition
  2. Apply the base method to each subproblem to yield subtrees
  3. Merge subtrees using the Strict Consensus Merger technique
  4. Randomly refine to make it binary
• Recursive-DCM3
• Iterative DCM3
  1. Compute a DCM3 tree
  2. Perform local search and go to step 1
• Recursive-Iterative DCM3
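The iterative boosting loop above can be sketched at a high level. All four interfaces (decompose, base_method, merge, refine) are hypothetical placeholders standing in for the DCM3 decomposition, the MP/ML heuristic, the Strict Consensus Merger, and random refinement; this is the shape of the loop, not an implementation of those components.

```python
def iterative_dcm3(guide_tree, decompose, base_method, merge, refine, rounds):
    """High-level shape of Iterative-DCM3: each round decomposes around
    the current guide tree, solves the subproblems with the base method,
    merges the resulting subtrees, and refines the merger to a binary
    tree, which becomes the next guide tree."""
    tree = guide_tree
    for _ in range(rounds):
        subproblems = decompose(tree)                 # DCM3 decomposition
        subtrees = [base_method(s) for s in subproblems]
        tree = refine(merge(subtrees))                # SCM + random refinement
    return tree
```

Making `decompose` recurse on subproblems that are still too large yields the Recursive-Iterative DCM3 variant.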
Datasets
Obtained from various researchers and online databases
• 1322 lsu rRNA of all organisms
• 2000 Eukaryotic rRNA
• 2594 rbcL DNA
• 4583 Actinobacteria 16s rRNA
• 6590 ssu rRNA of all Eukaryotes
• 7180 three-domain rRNA
• 7322 Firmicutes bacteria 16s rRNA
• 8506 three-domain+2org rRNA
• 11361 ssu rRNA of all Bacteria
• 13921 Proteobacteria 16s rRNA
Comparison of DCMs (13,921 sequences)
[Figure: average MP score above optimal, shown as a percentage of the
optimal (y-axis, 0 to 0.4%), over 24 hours for TNT, DCM3, Rec-DCM3,
I-DCM3, and Rec-I-DCM3.]
Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default
to recursion to iteration to recursion+iteration. On very large datasets Rec-I-DCM3 gives
significant improvements over unboosted TNT.
Rec-I-DCM3 significantly improves performance
[Figure: average MP score above optimal, shown as a percentage of the
optimal (y-axis, 0 to 0.2%), over 24 hours, comparing the current best
techniques with the DCM-boosted version of the best techniques.
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset.]
Rec-I-DCM3(TNT) vs. TNT
(Comparison of scores at 24 hours)
[Figure: average MP score above optimal at 24 hours, shown as a
percentage of the optimal (y-axis, 0 to 0.1%), for TNT vs. Rec-I-DCM3
on datasets 1 through 10.]
Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by returning trees which are at most 0.01%
above optimal on most datasets.
Observations
• Rec-I-DCM3 improves upon the best
performing heuristics for MP.
• The improvement increases with the
difficulty of the dataset.
Other DCMs
• DCM for NJ and other distance methods produces
absolute fast converging (afc) methods, which
improve upon NJ in simulation studies
• DCMs have been used to scale GRAPPA (software
for whole genome phylogenetic analysis) from its
maximum of about 15-20 genomes to 1000
genomes.
• Current projects: DCM development for
maximum likelihood and multiple sequence
alignment.
Questions
• Tree shape (including branch lengths) has an
impact on phylogeny reconstruction - but what
model of tree shape to use?
• What is the sequence length requirement for
Maximum Likelihood? (Result by Szekely and
Steel is worse than that for Neighbor Joining.)
• Why is MP not so bad?
Acknowledgements
• NSF
• The David and Lucile Packard Foundation
• The Program in Evolutionary Dynamics at
Harvard
• The Institute for Cellular and Molecular Biology
at UT-Austin
• Collaborators: Usman Roshan, Bernard Moret,
and Tiffani Williams
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species
The “Tree of Life” is not
really a tree:
reticulate evolution