Computational problems in evolutionary tree reconstruction

CS 394C: Computational Biology Algorithms
Tandy Warnow
Department of Computer Sciences
University of Texas at Austin
DNA Sequence Evolution
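[Figure: a tree drawn against a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today), starting from the sequence AAGACTT and showing descendant sequences such as AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]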
Molecular Systematics
[Figure: five taxa labeled U, V, W, X, Y with the sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and a tree whose leaves are those five taxa.]
Phylogeny estimation methods
• Distance-based (Neighbor joining, NQM, and
others): mostly statistically consistent and
polynomial time
• Maximum parsimony and maximum
compatibility: NP-hard and not statistically
consistent
• Maximum likelihood: NP-hard and usually
statistically consistent (if solved exactly)
• Bayesian Methods: statistically consistent if run
long enough
Distance-based methods
• Theorem: Let (T,) be a Cavender-Farris
model tree, with additive matrix [(i,j)]. Let
>0 be given. The sequence length that
suffices for accuracy with probability at
least 1-  of NJ (neighbor joining) and the
Naïve Quartet Method is
O(log n e(O(max (i,j))).
Neighbor joining (although statistically consistent)
has poor performance on large diameter trees
[Nakhleh et al. ISMB 2001]
[Figure: error rate (0 to 0.8) of NJ plotted against number of taxa (0 to 1600).]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.
Error rates reflect proportion of incorrect edges in inferred trees.
Maximum Parsimony
• Input: Set S of n aligned sequences of
length k
• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the
internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized,
where $H(i,j)$ is the Hamming distance between the
sequences labeling nodes i and j.
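The objective can be evaluated directly once every node of a tree carries a sequence. The following is a minimal sketch, assuming the tree is given as a list of edges and a dictionary of node labels; the names `hamming` and `tree_length` and the node names are illustrative and do not come from the lecture.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, label):
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)


# Hypothetical labeled tree: leaves a, b, c, d and internal nodes u, v.
label = {"a": "ACA", "b": "ACT", "c": "GTT", "d": "GTA",
         "u": "ACA", "v": "GTA"}
edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
print(tree_length(edges, label))  # prints 4
```

Maximum parsimony asks for the tree and internal labels that make this sum as small as possible.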
Maximum parsimony (example)
• Input: Four sequences
– ACT
– ACA
– GTT
– GTA
• Question: which of the three trees has the
best MP score?
Maximum Parsimony
[Figure: the three possible unrooted trees on the four sequences: ((ACT,GTA),(ACA,GTT)), ((ACT,GTT),(ACA,GTA)), and ((ACA,ACT),(GTT,GTA)).]
Maximum Parsimony
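[Figure: the three trees with internal nodes labeled and edge costs (Hamming distances) shown.]
– ((ACT,GTT),(GTA,ACA)), internal nodes GTT and GTA, edge costs 2, 1, 2: MP score = 5
– ((ACA,GTT),(ACT,GTA)), internal nodes ACA and ACT, edge costs 3, 1, 3: cost of this labeling = 7 (labeling both internal nodes ACA gives 6, the MP score of this tree)
– ((ACA,ACT),(GTT,GTA)), internal nodes ACA and GTA, edge costs 1, 2, 1: MP score = 4
Optimal MP tree: ((ACA,ACT),(GTT,GTA))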
Maximum Parsimony
Optimal labeling can be computed in polynomial
time using Dynamic Programming
[Figure: the tree ((ACA,ACT),(GTT,GTA)) with internal nodes labeled ACA and GTA and edge costs 1, 2, 1. MP score = 4.]
Finding the optimal MP tree is NP-hard
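One standard dynamic program for the fixed-tree problem is Fitch's algorithm; the slides do not say which algorithm is meant, so treat this as an illustrative choice. The sketch below assumes an unrooted binary tree that has been rooted on an arbitrary edge and written as nested pairs (the score does not depend on where the root is placed). The function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate state set, number of changes) for one site.

    `tree` is either a leaf name or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is forced at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Minimum total number of changes over all labelings of `tree`."""
    k = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(k)
    )


# Leaves are named by their own sequences for readability.
seqs = {s: s for s in ("ACT", "ACA", "GTT", "GTA")}
trees = [
    (("ACA", "ACT"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
for t in trees:
    print(t, parsimony_score(t, seqs))
```

The scores are 4, 5, and 6. The algorithm reports the minimum over all internal labelings, so a tree can score lower than one particular labeling drawn for it: labeling the internal nodes of ((ACA,GTT),(ACT,GTA)) as ACA and ACT costs 7, while labeling both ACA costs 6.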
Solving NP-hard problems
exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in about 6 x 10^2846 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
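The count (2n-5)!! is the product of the odd numbers 1, 3, 5, ..., 2n-5, so it is easy to tabulate exactly with arbitrary-precision integers. A minimal sketch follows; the function name is illustrative.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))

# For large n, the number of decimal digits gives the order of magnitude.
for n in (20, 100, 1000):
    digits = len(str(num_unrooted_binary_trees(n)))
    print(n, f"about 10^{digits - 1}")
```

Multiplying the count for 1000 leaves by 0.001 seconds per tree gives the exhaustive-search time, which is why exact search is out of reach at this scale.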
Approaches for “solving” MP and ML
(and other NP-hard problems in phylogeny)
1. Hill-climbing heuristics (which can get stuck in local optima)
2. Randomized algorithms for getting out of local optima
3. Approximation algorithms for MP (based upon Steiner Tree approximation algorithms) -- however, the approx. ratio that is needed is probably 1.01 or smaller!
[Figure: cost plotted over the space of phylogenetic trees, marking a local optimum and the global optimum.]
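The first two approaches can be illustrated on a toy problem. The sketch below is not a tree search: it assumes a made-up one-dimensional cost landscape (the list `cost`) in place of tree space, with neighbors being adjacent positions, and uses random restarts as one simple randomized way to get out of a local optimum. All names are illustrative.

```python
import random


def hill_climb(cost, start):
    """Greedy descent: move to the cheaper neighbor until none is cheaper."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(cost)]
        best = min(neighbors, key=lambda j: cost[j])
        if cost[best] >= cost[i]:
            return i  # no neighbor improves: a local optimum
        i = best


def random_restarts(cost, tries, rng):
    """Run hill climbing from random starts and keep the best end point."""
    starts = [rng.randrange(len(cost)) for _ in range(tries)]
    return min((hill_climb(cost, s) for s in starts), key=lambda i: cost[i])


# Toy landscape: local optimum at index 2, global optimum at index 8.
cost = [5, 4, 3, 4, 5, 6, 2, 1, 0, 1, 2]
print(hill_climb(cost, 0))    # prints 2 (stuck in the local optimum)
print(hill_climb(cost, 10))   # prints 8 (reaches the global optimum)
print(random_restarts(cost, 50, random.Random(0)))
```

In tree search the neighbors would instead come from tree rearrangements, and each cost evaluation is itself expensive, which is part of why these heuristics take so long on large datasets.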
Problems with techniques for MP and ML
Shown here is the performance of a TNT heuristic maximum parsimony analysis on a
real dataset of almost 14,000 sequences. (“Optimal” here means best score to date,
using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time. x-axis: hours (0 to 24). y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]
MP and Cavender-Farris
• Consider a tree (AB,CD) with two very
long branches leading to A and C, and all
other branches very short.
• MP will be statistically inconsistent (and
“positively misleading”) on this tree.
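This effect can be seen in a small simulation. The sketch below assumes the Cavender-Farris model in its usual form (binary characters, each edge flips the state independently with its own probability) and uses illustrative flip probabilities of 0.4 on the two long branches and 0.05 elsewhere; these values and all function names are assumptions, not taken from the lecture. On four taxa with binary characters, MP picks the tree whose split is supported by the most sites, so counting supporting sites is enough.

```python
import random


def simulate_site(p_long, p_short, rng):
    """One binary site on the tree (AB,CD); long edges lead to A and C."""
    def flip(state, p):
        return state ^ (rng.random() < p)

    u = rng.randrange(2)      # internal node adjacent to A and B
    v = flip(u, p_short)      # internal node adjacent to C and D
    a = flip(u, p_long)
    b = flip(u, p_short)
    c = flip(v, p_long)
    d = flip(v, p_short)
    return a, b, c, d


def mp_choice(sites):
    """Return the quartet tree favored by MP and the support counts."""
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in sites:
        # Only sites with a 2-2 split distinguish the three trees.
        if a == b and c == d and a != c:
            support["AB|CD"] += 1
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    return max(support, key=support.get), support


rng = random.Random(0)
sites = [simulate_site(0.4, 0.05, rng) for _ in range(5000)]
best, support = mp_choice(sites)
print(best, support)
```

With these probabilities, sites supporting AC|BD are expected roughly 14% of the time and sites supporting the true split AB|CD roughly 4% of the time, so adding more sites makes MP more confident in the wrong tree.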
Problems with existing phylogeny
reconstruction methods
• Polynomial time methods (generally based
upon distances) have poor accuracy with
large diameter datasets.
• Heuristics for NP-hard optimization
problems take too long (months to reach
acceptable local optima).
Warnow et al.: Meta-algorithms
for phylogenetics
• Basic technique: determine the conditions under which a
phylogeny reconstruction method does well (or poorly),
and design a divide-and-conquer strategy (specific to the
method) to improve its performance
• Warnow et al. developed a class of divide-and-conquer
methods, collectively called DCMs (Disk-Covering
Methods). These are based upon chordal graph theory to
give fast decompositions and provable performance
guarantees.
Disk-Covering Method (DCM)
Improving phylogeny reconstruction
methods using DCMs
• Improving the theoretical convergence rate
and performance of polynomial time
distance-based methods using DCM1
• Speeding up heuristics for NP-hard
optimization problems (Maximum
Parsimony and Maximum Likelihood) using
Rec-I-DCM3
DCM1
Warnow, St. John, and Moret, SODA 2001
[Diagram: exponentially converging method -> DCM -> SQS -> absolute fast converging method]
• A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree.
• The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
[Figure: error rate (0 to 0.8) of NJ and DCM1-NJ plotted against number of taxa (0 to 1600).]
• Theorem: DCM1-NJ converges to the true tree from polynomial length sequences
Rec-I-DCM3 significantly improves performance
(Roshan et al. CSB 2004)
[Figure: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2), plotted against hours (0 to 24), for two curves: current best techniques, and DCM boosted version of best techniques.]
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset.
Similar improvements obtained for RAxML (maximum likelihood).
Summary (so far)
• Optimization problems in biology are
almost all NP-hard, and heuristics may run
for months before finding local optima.
• The challenge here is to find better
heuristics, since exact solutions are very
unlikely to ever be achievable on large
datasets.
Summary
• NP-hard optimization problems abound in
phylogeny reconstruction, and in
computational biology in general, and need
very accurate solutions
• Many real problems have beautiful and
natural combinatorial and graph-theoretic
formulations