CIPRES.2006.algorithms_tw

The Algorithms Group
Tandy Warnow, Focus Leader
The University of Texas at Austin
CIPRES Algorithms Group
• Berkeley: Steve Evans, Dick Karp, Elchanan
Mossel, Christos Papadimitriou, Satish Rao, and
Stuart Russell
• Texas/UNM: David Bader, Warren Hunt, Bernard
Moret, Luay Nakhleh, Usman Roshan, Jijun Tang,
Li-San Wang, Tandy Warnow, and Tiffani Williams
• Elsewhere: John Huelsenbeck (UCSD), Sampath
Kannan (Penn), Paul Lewis (UConn), and foreign
collaborators
• Students and postdocs: 19 funded, 16 unfunded
Algorithms group presentation
• Tandy: Overview of the algorithms group,
and empirically-driven algorithm
development
• Satish: Theoretically-driven algorithm
development
DNA Sequence Evolution
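[Figure: a tree of DNA sequences evolving over time, with a time axis running from -3 mil yrs through -2 mil yrs and -1 mil yrs to today. Sequences shown: AAGACTT, AAGGCCT, AGGGCAT, AGGGCAT, TAGCCCT, TAGCCCA, TGGACTT, TAGACTT, AGCACTT, AGCACAA, AGCGCTT.]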
Phylogeny Problem
[Figure: five sequences (AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT) labeled U, V, W, X, and Y, shown next to a tree whose leaves are X, U, Y, V, and W.]
Steps in a typical phylogenetic
analysis
1. Gather data
2. Align sequences
3. Reconstruct phylogeny on the multiple
alignment - often obtaining a large number of
trees (or do 2&3 simultaneously)
4. Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
5. Perform post-tree analyses.
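As an illustration of step 4, the sketch below computes the clusters that appear in more than half of a set of trees, which is the idea behind a majority-rule consensus. It is a minimal sketch under stated assumptions, not a tool used by the group: trees are assumed to be rooted and written as nested tuples of taxon names, and the function names `clusters` and `majority_clusters` are hypothetical.

```python
from collections import Counter

def clusters(tree):
    """Return (all leaves, non-trivial clusters) of a rooted tree given as nested tuples."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):              # a leaf is a taxon name
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)                       # record the cluster under this internal node
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)                  # the root cluster is in every tree, so skip it
    return all_leaves, found

def majority_clusters(trees):
    """Clusters that occur in strictly more than half of the input trees."""
    counts = Counter()
    for tree in trees:
        _, tree_clusters = clusters(tree)
        counts.update(tree_clusters)
    return {c for c, k in counts.items() if 2 * k > len(trees)}
```

Clusters that pass the majority threshold are pairwise compatible, so they can always be assembled into a single consensus tree.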
Markov models of evolution
[Figure: a tree on leaves s1, s2, s3, s4, and s5, with edge lengths .09, .1, .22, .15, .14, .15, .5, and .2.]
• The number of changes for a site
on each edge is drawn from a
Poisson distribution, whose mean
is the “length” of the edge.
• The probability of each
substitution is given by a matrix.
• The nucleotide at the root for a
given site is drawn from the
uniform distribution.
• Standard models also assume that
sites evolve under i.i.d.
processes.
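The model described above can be sketched as a small simulator. This is a minimal sketch under assumptions: the slide does not give the substitution matrix, so the code assumes a Jukes-Cantor-style process in which every substitution moves to one of the three other nucleotides with equal probability. The tree encoding (a leaf name, or a list of `(child, edge_length)` pairs) and the function names are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def evolve_site(rng, state, length):
    """Apply Poisson(length) substitutions; each picks one of the 3 other nucleotides uniformly."""
    for _ in range(poisson(rng, length)):
        state = rng.choice([x for x in NUCLEOTIDES if x != state])
    return state

def simulate(tree, n_sites, seed=0):
    """Evolve n_sites i.i.d. sites down the tree and return {leaf name: sequence}."""
    rng = random.Random(seed)
    result = {}

    def descend(node, seq):
        if isinstance(node, str):              # reached a leaf: record its sequence
            result[node] = "".join(seq)
            return
        for child, length in node:             # evolve each site independently along the edge
            descend(child, [evolve_site(rng, s, length) for s in seq])

    root = [rng.choice(NUCLEOTIDES) for _ in range(n_sites)]   # uniform root states
    descend(tree, root)
    return result
```

For example, `simulate([("s1", 0.1), ([("s2", 0.2), ("s3", 0.15)], 0.09)], 50)` returns three sequences of length 50; the edge lengths here are illustrative and are not read off the figure.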
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species
NP-hard problems
True evolution is very complex (e.g.
reticulation, events that scramble
genomes): new data, and new model
development
Types of algorithms research
• Theoretically motivated: focused on
computational complexity, exact and approximate
solutions to NP-hard problems, sequence length
requirements for accuracy, and novel models.
• Empirically motivated: focused on performance
on real and simulated data, with respect to
topological accuracy and/or solution to the major
NP-hard problems.
Solving NP-hard problems
exactly is … unlikely
• Number of
(unrooted) binary
trees on n leaves is
(2n-5)!!
• If each tree on
1000 taxa could be
analyzed in 0.001
seconds, we would
find the best tree in
2890 millennia
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       4.5 x 10^190
1000      2.7 x 10^2900
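The double factorial (2n-5)!! is the product of the odd numbers 1, 3, 5, ..., 2n-5, so the small entries of the table can be reproduced directly. The sketch below does that and also estimates, on a log scale, how long an exhaustive search would take at a fixed cost per tree. The function names and the 365.25-day year are assumptions made for illustration.

```python
import math

def num_unrooted_binary_trees(n_leaves):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), defined for n >= 3."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n_leaves - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count

def log10_years_to_enumerate(n_leaves, seconds_per_tree=0.001):
    """log10 of the years needed to score every tree once (avoids float overflow)."""
    seconds_per_year = 365.25 * 24 * 3600
    return (math.log10(num_unrooted_binary_trees(n_leaves))
            + math.log10(seconds_per_tree)
            - math.log10(seconds_per_year))

for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the same function also works for 100 or 1000 leaves; only the conversion to years is done in logarithms.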
Recent software developed by
algorithms group members
• GARLI (Zwickl) and RAxML
(Stamatakis), the two best software
packages for maximum likelihood
• Rec-I-DCM3, a meta-method, which boosts
the performance of base methods for
phylogeny reconstruction (Roshan, Warnow,
Moret, and Williams)
Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real
dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using
any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time. Average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), against hours (x-axis, 0 to 24).]
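The MP score of a tree is its parsimony length: the minimum number of substitutions needed to explain the sequences on that tree. As a minimal sketch, the code below scores a fixed rooted binary tree with Fitch's algorithm and expresses a score as a percentage above the best known score, which is the quantity on the plot's y-axis. The nested-tuple tree encoding and the function names are assumptions for illustration; this is not how TNT is implemented.

```python
def fitch_score(tree, sequences):
    """Parsimony length of a rooted binary tree given as nested 2-tuples of leaf names."""
    changes = 0

    def state_sets(node):
        nonlocal changes
        if isinstance(node, str):                      # leaf: one singleton set per site
            return [{base} for base in sequences[node]]
        left, right = (state_sets(child) for child in node)
        merged = []
        for a, b in zip(left, right):
            common = a & b
            if common:                                 # children agree: no change needed
                merged.append(common)
            else:                                      # children disagree: one substitution
                merged.append(a | b)
                changes += 1
        return merged

    state_sets(tree)
    return changes

def percent_above_optimal(score, best_known):
    """How far a score is above the best known score, as a percentage of the latter."""
    return 100.0 * (score - best_known) / best_known
```

With this measure, the 0.01% threshold mentioned above corresponds to a tree whose score exceeds the best known score by at most one step per 10,000.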
“Boosting” phylogeny
reconstruction methods
• DCMs “boost” the performance of
phylogeny reconstruction methods.
[Diagram: a base method M combined with a DCM gives the boosted method DCM-M.]
DCMs (Disk-Covering Methods)
• DCMs for polynomial time methods
improve topological accuracy (empirical
observation), and have provable theoretical
guarantees under Markov models of
evolution
• DCMs for hard optimization problems
reduce the running time needed to achieve good
levels of accuracy (empirical observation)
DCMs: Divide-and-conquer for
improving phylogeny reconstruction
[Diagram: Iterative-DCM3. A tree T is passed through DCM3 and the base method to produce a new tree T’.]
Rec-I-DCM3 significantly improves performance
[Figure: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), against hours (x-axis, 0 to 24), with two curves: TNT and the DCM-boosted version of TNT.]
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset (“optimal” here means the
best score found to date, using any method for any amount of time).
Comments
• Rec-I-DCM3 improves upon the best performing
heuristics for MP, and the improvement increases
with the difficulty of the dataset.
• DCMs also boost the performance of heuristics for
maximum likelihood, methods for phylogeny
reconstruction from gene order, and heuristics for
the simultaneous estimation of trees and multiple
alignments (data not shown).
• Rec-I-DCM3 will be in the first software release
from the CIPRES project.
Connections to other focus group
activity
• We develop and study new stochastic models of
evolution, and implement new simulators.
• We assist in developing new software for the
major optimization problems in phylogenetics.
• We study how to compactly store sets of trees for
later database queries.
• We perform outreach to the Computer Science,
Mathematics, and Statistics research communities.