Computational and mathematical challenges
involved in very large-scale phylogenetics
Tandy Warnow
The University of Texas at Austin
How did life evolve on earth?
An international effort to
understand how life
evolved on earth
Biomedical applications:
drug design, protein
structure and function
prediction, biodiversity.
•
Courtesy of the Tree of Life project
Steps in a phylogenetic analysis
• Gather data
• Align sequences
• Reconstruct phylogeny on the multiple alignment often obtaining a large number of trees
• Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
• Perform post-tree analyses.
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species, and
most problems are NP-hard
Evolution is complex:
Reticulate evolution
Indels, genome rearrangements, etc.
Gene tree/species tree differences
The CIPRES Project
(Cyber-Infrastructure for Phylogenetic Research)
The US National Science Foundation funds this project, which has the following major
components:
• ALGORITHMS and SOFTWARE: scaling to millions of sequences
(open source, freely distributed)
• MATHEMATICS/PROBABILITY/STATISTICS: Obtaining better
mathematical theory under complex models of evolution
• DATABASES: Producing new database technology for structured
data, to enable scientific discoveries
• SIMULATIONS: The first million taxon simulation under
realistically complex models
• OUTREACH: Museum partners, K-12, general scientific public
• PORTAL available to all researchers
• See www.phylo.org for more about CIPRES.
This talk
• Part 1: Overview of some interesting mathematical
issues
• Part 2: Better heuristics for NP-hard optimization
problems
• Part 3: New methods for simultaneous estimation
of trees and alignments
• Part 4: Related research and open problems.
Part 1: Mathematical issues under
stochastic models
DNA Sequence Evolution
-3 mil yrs
AAGACTT
AAGGCCT
AGGGCAT
AGGGCAT
TAGCCCT
TAGCCCA
-2 mil yrs
TGGACTT
TAGACTT
AGCACTT
AGCACAA
AGCGCTT
-1 mil yrs
today
Markov models of single site
evolution
Simplest (Jukes-Cantor):
• The model tree is a pair (T,{e,p(e)}), where T is a rooted binary
tree, and p(e) is the probability of a substitution on the edge e.
• The state at the root is random.
• If a site changes on an edge, it changes with equal probability to
each of the remaining states.
• The evolutionary process is Markovian.
More complex models (such as the General Markov model) are
also considered, often with little change to the theory.
Modelling variation between
characters: Rates-across-sites
B
D
A
C
B
A
D
C
• If a site (i.e., character) is twice as fast as another on one edge, it is
twice as fast everywhere.
• The distribution of the rates is typically assumed to be gamma.
Modelling variation between characters:
The “no-common-mechanism” model
• A separate random variable for every combination of site and edge the underlying tree is fixed, but otherwise there are no constraints on
variation between sites.
C
A
B
D
A
B
C
D
Identifiability and statistical
consistency
• A model is identifiable if it is uniquely
characterized by the probability distribution
it defines.
• A phylogenetic reconstruction method is
statistically consistent under a model if the
probability that the method reconstructs the
true tree goes to 1 as the sequence length
increases.
Identifiability results
• The “standard” Markov models (from Jukes-Cantor to the
General Markov model) are identifiable.
• These models are also identifiable when sites draw rates
from a gamma distribution (easy to prove if the distribution
is known, and harder to prove if the distribution must be
estimated - cf. Allman’s talk).
• However, mixed models are often not identifiable (cf.
Matsen’s talk), nor are some models in which sites draw
rates from more complex distributions.
Phylogeny estimation typically is done under identifiable
models.
What about phylogeny reconstruction methods?
U
AGGGCAT
V
W
TAGCCCA
X
TAGACTT
Y
TGCACAA
X
U
Y
V
W
TGCGCTT
Phylogenetic reconstruction methods
1.
Hill-climbing heuristics for NP-hard optimization
criteria (Maximum Parsimony and Maximum
Likelihood)
Local optimum
Cost
Global optimum
Phylogenetic trees
2.
3.
Polynomial time distance-based methods: Neighbor
Joining, FastME, Weighbor, etc.
Bayesian methods
Performance criteria
• Running time.
• Space.
• Statistical performance issues (e.g., statistical
consistency and sequence length requirements)
• “Topological accuracy” with respect to the
underlying true tree. Typically studied in
simulation.
• Accuracy with respect to a particular criterion
(e.g. tree length or likelihood score), on real data.
Quantifying Error
FN
FN: false negative
(missing edge)
FP: false positive
(incorrect edge)
50% error rate
FP
Statistical consistency, exponential convergence, and
absolute fast convergence (afc)
Theoretical results: statistical
consistency under typical models?
• Neighbor Joining is polynomial time, and
statistically consistent.
• Maximum Parsimony is NP-hard, and even
exact solutions are not statistically consistent.
• Maximum Likelihood is NP-hard, but exact
solutions are statistically consistent
Theoretical results: sequence length
requirements under typical models?
• Neighbor joining (and some other distance-based
methods) will return the true tree with high
probability provided sequence lengths are
exponential in the diameter of the tree (Erdos et
al., Atteson).
• Maximum likelihood will return the true tree with
high probability provided sequence lengths are
exponential in the number of taxa (Steel and
Szekely).
Neighbor joining has poor performance on large
diameter trees [Nakhleh et al. ISMB 2001]
Error Rate
0.8
NJ
Simulation study based
upon fixed edge
lengths, K2P model of
evolution, sequence
lengths fixed to 1000
nucleotides.
Error rates reflect
proportion of incorrect
edges in inferred trees.
0.6
0.4
0.2
0
0
400
800
No. Taxa
1200
1600
DCM1
Warnow, St. John, and Moret, SODA 2001
Exponentially
converging
method
DCM
SQS
Absolute fast
converging
method
• A two-phase procedure which reduces the sequence length
requirement of methods. The DCM phase produces a
collection of trees, and the SQS phase picks the “best” tree.
• The “base method” is applied to subsets of the original
dataset. When the base method is NJ, you get DCM1-NJ.
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
0.8
NJ
Error Rate
DCM1-NJ
0.6
0.4
0.2
0
0
400
800
No. Taxa
1200
•Theorem:
DCM1-NJ
converges to the
true tree from
polynomial
length sequences
1600
However,
• The best accuracy in simulation tends to be from
computationally intensive methods (and most
molecular phylogeneticists prefer these methods).
• Unfortunately, these approaches can take weeks or
more, just to reach decent local optima.
• Conclusion: We need better heuristics!
Part 2: Improved heuristics for
NP-hard optimization problems
Approaches for “solving” MP/ML
1.
2.
3.
Hill-climbing heuristics (which can get stuck in local optima)
Randomized algorithms for getting out of local optima
Approximation algorithms for MP (based upon Steiner Tree
approximation algorithms).
Local optimum
Cost
Global optimum
Phylogenetic trees
Problems with current techniques for MP
Shown here is the performance of a TNT heuristic maximum parsimony analysis on a
real dataset of almost 14,000 sequences. (“Optimal” here means best score to date,
using any method for any amount of time.) Acceptable error is below 0.01%.
0.2
0.18
Performance of TNT with time
0.16
0.14
Average MP
0.12
score above
optimal, shown as 0.1
a percentage of
0.08
the optimal
0.06
0.04
0.02
0
0
4
8
12
Hours
16
20
24
Observations
• The best MP heuristics cannot get acceptably good
solutions within 24 hours on some large datasets.
• Datasets of these sizes may need months (or years)
of further analysis to reach reasonable solutions.
• Apparent convergence can be misleading.
Rec-I-DCM3: a new technique
(Roshan et al.)
• Combines a new decomposition technique
(DCM3) with recursion and iteration, to
produce a novel approach for escaping local
optima
• Tested initially on MP (maximum
parsimony), but also implemented for ML
and other optimization problems
The DCM3 decomposition
Input: Set S of sequences, and guide-tree T
1. Compute short subtree graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T) and form subproblems
DCM3 decompositions
(1) can be obtained in O(n) time (the
short subtree graph is triangulated)
(2) yield small subproblems
(3) can be used iteratively
Iterative-DCM3
T
DCM3
Base method
T’
Rec-I-DCM3 significantly improves performance
(Roshan et al.)
0.2
0.18
Current best techniques
0.16
0.14
Average MP
0.12
score above
optimal, shown as 0.1
a percentage of
0.08
the optimal
0.06
DCM boosted version of best techniques
0.04
0.02
0
0
4
8
12
16
Hours
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset
20
24
Observations
• Rec-I-DCM3 improves upon the best performing
heuristics for MP.
• The improvement increases with the difficulty of
the dataset.
Rec-I-DCM3 also improves maximum likelihood
(using RAxML) analyses (data not shown), and is
included in the CIPRES portal.
Part 3: Multiple sequence
alignment
Steps in a phylogenetic analysis
• Gather data
• Align sequences
• Reconstruct phylogeny on the multiple alignment often obtaining a large number of trees
• Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
• Perform post-tree analyses.
Steps in a phylogenetic analysis
• Gather data
• Align sequences
• Reconstruct phylogeny on the multiple alignment often obtaining a large number of trees
• Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
• Perform post-tree analyses.
Basic observations about
standard two-phase methods
• Many new MSA methods improve on
ClustalW, with ProbCons and MAFFT the two
best MSA methods.
• Although alignment accuracy correlates with
phylogenetic accuracy, it has less effect than
might be expected (Wang et al., unpublished).
What about simultaneous estimation?
• Several Bayesian methods for simultaneous
estimation of trees and alignments have been
developed, but none can be applied to datasets
with more than (approx.) 20 sequences.
• POY attempts to solve the NP-hard “minimum
length tree” problem, where gaps contribute to the
length of the tree and can be applied to large
datasets. However, its performance on simulated
data isn’t competitive with the best two-phase
methods (unpublished data).
New method: SATe
(Simultaneous Alignment and Tree estimation)
• Developers: Warnow, Linder, Liu, Nelesen,
and Zhao.
• Basic technique: propose alignments (using
treelength under a selected affine gap
penalty), and compute maximum likelihood
trees for these alignments under GTR.
• Unpublished.
Simulation study
• 100 taxon model trees (generated by r8s and then
modified, so as to deviate from the molecular clock).
• DNA sequences evolved under ROSE (indel events of
blocks of nucleotides, plus HKY site evolution). The root
sequence has 1000 sites.
• We vary the gap length distribution, probability of gaps,
and probability of substitutions, to produce 8 model
conditions: models 1-4 have “long gaps” and 5-8 have
“short gaps”.
• We compared RAxML on various alignments (including
the true alignment) to SATe.
Topological accuracy
FN (false negative): proportion of correct edges
missing from the estimated tree
FP (false positive): proportion of incorrect edges in
the estimated tree
QuickTime™ and a
decompressor
are needed to see this pi cture.
Alignment accuracy
• Normalized number of columns in the estimated alignment relative to
the true alignment.
QuickTi me™ and a
decompressor
are needed to see this picture.
Alignment accuracy
FN: proportion of correctly homologous pairs of nucleotides missing from
the estimated alignment (i.e., 1-SP score).
FP: proportion of incorrect pairings of nucleotides in the estimated
alignment.
QuickTi me™ and a
decompressor
are needed to see this picture.
Other observations
• SATe is more accurate at estimating the number of
gaps and the correct alignment length than other
methods.
• The standard alignment accuracy measure, SP, is
not particularly predictive of phylogenetic
accuracy.
• We still need methods for MSA under models that
include rearrangements and duplications.
Part IV: Open questions
• Tree shape (including branch lengths) has an impact on
phylogeny reconstruction - but what model of tree shape to
use?
• What is the sequence length requirement for Maximum
Likelihood? (Current bound is probably too large.)
• Why is MP not so bad?
• How to detect and reconstruct reticulate evolution?
• “Teasing apart” trees under complex models?
• What complex models are identifiable?
General comments
• Current models of sequence evolution are clearly
too simple, and more realistic ones are not
identifiable.
• Relative performance between methods can
change as the models become more complex or as
the number of taxa increases.
• We do not know how methods perform under
realistic conditions (nor how long we need to let
computationally intensive methods run).
Acknowledgements
• Funding: NSF, The David and Lucile Packard
Foundation, The Program in Evolutionary
Dynamics at Harvard, and The Institute for
Cellular and Molecular Biology at UT-Austin.
• Collaborators: Peter Erdos, Daniel Huson, Randy
Linder, Kevin Liu, Bernard Moret, Serita Nelesen,
Usman Roshan, Mike Steel, Katherine St. John,
Laszlo Szekely, Tiffani Williams, and David Zhao.
© Copyright 2026 Paperzz