The Algorithms Group
Tandy Warnow, Focus Leader
The University of Texas at Austin

CIPRES Algorithms Group
• Berkeley: Steve Evans, Dick Karp, Elchanan Mossel, Christos Papadimitriou, Satish Rao, and Stuart Russell
• Texas/UNM: David Bader, Warren Hunt, Bernard Moret, Luay Nakhleh, Usman Roshan, Jijun Tang, Li-San Wang, Tandy Warnow, and Tiffani Williams
• Elsewhere: John Huelsenbeck (UCSD), Sampath Kannan (Penn), Paul Lewis (UConn), and foreign collaborators
• Students and postdocs: 19 funded, 16 unfunded

Algorithms group presentation
• Tandy: Overview of the algorithms group, and empirically-driven algorithm development
• Satish: Theoretically-driven algorithm development

DNA Sequence Evolution
[Figure: DNA sequences evolving down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today.]

Phylogeny Problem
[Figure: five present-day sequences (AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT) at taxa labeled U, V, W, X, Y, and a tree relating those taxa.]

Steps in a typical phylogenetic analysis
1. Gather data
2. Align sequences
3. Reconstruct phylogeny on the multiple alignment, often obtaining a large number of trees (or do 2 and 3 simultaneously)
4. Compute consensus (or otherwise estimate the reliable components of the evolutionary history)
5. Perform post-tree analyses

Markov models of evolution
[Figure: a model tree on leaves s1 to s5 with edge lengths .09, .1, .22, .15, .14, .15, .5, and .2.]
• The number of changes for a site on each edge is drawn from a Poisson distribution, whose mean is the “length” of the edge.
• The probability of each substitution is given by a matrix.
• The nucleotide at the root for a given site is drawn from the uniform distribution.
• Standard models also assume that sites evolve under i.i.d. processes.

Reconstructing the “Tree” of Life
• Handling large datasets: millions of species
• NP-hard problems
• True evolution is very complex (e.g.
reticulation, events that scramble genomes): new data, and new model development

Types of algorithms research
• Theoretically motivated: focused on computational complexity, exact and approximate solutions to NP-hard problems, sequence length requirements for accuracy, and novel models.
• Empirically motivated: focused on performance on real and simulated data, with respect to topological accuracy and/or solution to the major NP-hard problems.

Solving NP-hard problems exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         2.2 x 10^20
100        4.5 x 10^190
1000       2.7 x 10^2900

Recent software developed by algorithms group members
• GARLI (Zwickl) and RAxML (Stamatakis), the two best software packages for maximum likelihood
• Rec-I-DCM3, a meta-method, which boosts the performance of base methods for phylogeny reconstruction (Roshan, Warnow, Moret, and Williams)

Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]

“Boosting” phylogeny reconstruction methods
• DCMs “boost” the performance of phylogeny reconstruction methods.
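
The count (2n-5)!! quoted on the “Solving NP-hard problems exactly” slide can be checked directly. The sketch below is an illustration only; the function name `num_unrooted_binary_trees` is hypothetical and not part of any CIPRES software. It multiplies the odd numbers 1, 3, ..., 2n-5 and prints the exact values for 4 through 10 leaves, the rows of the table that are given exactly.

```python
from math import prod


def num_unrooted_binary_trees(n: int) -> int:
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n < 3:
        raise ValueError("this sketch applies the formula only for n >= 3")
    # (2n-5)!! is the product of the odd numbers 1, 3, ..., 2n-5.
    return prod(range(1, 2 * n - 4, 2))


# Print the exact counts for the small cases in the table.
for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the same function also returns exact values for much larger n; the growth is what rules out exhaustive search.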
[Figure: a base method M combined with a DCM yields the boosted method DCM-M.]

DCMs (Disk-Covering Methods)
• DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution
• DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

DCMs: Divide-and-conquer for improving phylogeny reconstruction

Iterative-DCM3
[Figure: a tree T is passed through DCM3 and the base method to produce a new tree T’.]

Rec-I-DCM3 significantly improves performance
[Figure: two curves, TNT and the DCM-boosted version of TNT. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset (“optimal” here means the best score found to date, using any method for any amount of time).

Comments
• Rec-I-DCM3 improves upon the best performing heuristics for MP, and the improvement increases with the difficulty of the dataset.
• DCMs also boost the performance of heuristics for maximum likelihood, methods for phylogeny reconstruction from gene order, and heuristics for the simultaneous estimation of trees and multiple alignments (data not shown).
• Rec-I-DCM3 will be in the first software release from the CIPRES project.

Connections to other focus group activity
• We develop and study new stochastic models of evolution, and implement new simulators.
• We assist in developing new software for the major optimization problems in phylogenetics.
• We study how to compactly store sets of trees for later database queries.
• We perform outreach to the Computer Science, Mathematics, and Statistics research communities.
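
The generative process described on the “Markov models of evolution” slide can be sketched as a small simulator. This is a minimal illustration under stated assumptions, not one of the group's simulators: the tree `EXAMPLE_TREE`, its edge lengths, and all function names are hypothetical, and the substitution matrix is assumed to be the simplest one, in which each change moves to one of the other three nucleotides with equal probability.

```python
import random

NUCLEOTIDES = "ACGT"

# Hypothetical model tree: node -> list of (child, edge length).
# Nodes that never appear as keys are leaves.
EXAMPLE_TREE = {
    "root": [("a", 0.1), ("b", 0.2)],
    "a": [("s1", 0.15), ("s2", 0.3)],
    "b": [("s3", 0.1), ("s4", 0.25)],
}


def poisson(mean, rng):
    """Sample a Poisson count by counting rate-1 arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def evolve_site(state, length, rng):
    """Apply a Poisson(length) number of changes to one site along one edge."""
    for _ in range(poisson(length, rng)):
        # Assumed matrix: uniform choice among the three other nucleotides.
        state = rng.choice([x for x in NUCLEOTIDES if x != state])
    return state


def simulate(tree, n_sites, seed=0):
    """Return leaf sequences evolved down `tree` from a uniform random root."""
    rng = random.Random(seed)
    # Each root site is drawn uniformly; sites then evolve independently.
    seqs = {"root": [rng.choice(NUCLEOTIDES) for _ in range(n_sites)]}
    stack = ["root"]
    while stack:
        node = stack.pop()
        for child, length in tree.get(node, []):
            seqs[child] = [evolve_site(s, length, rng) for s in seqs[node]]
            stack.append(child)
    return {name: "".join(s) for name, s in seqs.items() if name not in tree}


if __name__ == "__main__":
    for name, seq in sorted(simulate(EXAMPLE_TREE, 20, seed=1).items()):
        print(name, seq)
```

Longer edges produce more changes on average, so leaves separated by long paths share fewer sites; this is the signal that reconstruction methods try to invert.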