Bayesian phylogenetic reconstruction using PhyloBayes

June 6, 2012

The aim of this practical is to explore a few real-case examples using PhyloBayes (Lartillot, Lepage and Blanquart, 2009). These examples have been chosen to illustrate the importance of model choice in phylogenetics: depending on the model, different topologies and/or support values will be obtained. Goodness-of-fit tests can then be performed to reveal the strengths and weaknesses of each model.

Some of the models used here (mostly the non-mixture models) are available in most other Bayesian software packages, in particular MrBayes (Ronquist and Huelsenbeck, 2003), and in maximum likelihood implementations, e.g. RAxML (Stamatakis, Ludwig and Meier, 2005) and PhyML (Guindon and Gascuel, 2003). You are invited to conduct a few experiments on your own after this practical, to check that the trees obtained under alternative frameworks (maximum likelihood or Bayes), or with alternative software, but using the same substitution model, are not fundamentally different. In contrast, the choice of the model will generally have a substantial impact on the resulting tree. In particular, mixture models (which are a specific feature of PhyloBayes) can lead to a dramatic improvement in phylogenetic accuracy, particularly in the case of 'deep' and saturated phylogenies.

Bilaterian phylogeny: a case study

This subsection is inspired by Lartillot, Brinkmann and Philippe (2007). In the phylobayes3.3d/data/bilateria folder, you will find a multiple alignment called platy3.ali. This is a concatenation of 30 genes from 10 bilaterian taxa. Scaffolding and multiple alignment were done in Philippe, Lartillot and Brinkmann (2005). The entire set of genes of the original article is in the directory bilateria/meta2005, and the complete concatenation is bilateria/m2g.puz.

Step 1

Using this alignment, you can try one of the available site-homogeneous models (GTR, WAG, JTT or LG), and compare it with one of the two infinite mixture models (CAT-Poisson or CAT-GTR). For each model, you should run 2 chains in parallel. Each time you create a new chain, you give it a name. Thus, for instance, in order to run 2 chains under the JTT + Gamma model:

    pb -d platy3.ali -jtt -ncat 1 -dgam 4 -dc -s platyjtt1
    pb -d platy3.ali -jtt -ncat 1 -dgam 4 -dc -s platyjtt2

All options are preceded by a '-'. They allow you to customize the model, as well as other specific details of the run. The -d option specifies the dataset. Concerning the substitution model, here we have activated the JTT empirical relative exchangeabilities (-jtt), with only one category (-ncat 1, corresponding to the classical JTT model), combined with a discrete gamma distribution of rates across sites (-dgam 4, with 4 categories). Finally, the -dc option eliminates all constant positions from the alignment (as a way of speeding up convergence), while the -s option makes phylobayes save all parameter configurations visited by the MCMC (and not just the tree topologies). Using the -s option is important, since we want to conduct goodness-of-fit tests. The chain name (here platyjtt1 and platyjtt2) is arbitrary and is given last. For the many other options available, see the manual for details, or type pb without any argument.

To run the Dirichlet process CAT model, use the -cat option:

    pb -d platy3.ali -cat -poisson -dgam 4 -dc -s platycatpoisson1
    pb -d platy3.ali -cat -gtr -dgam 4 -dc -s platycatgtr1

The CAT-Poisson model is faster, and should therefore be preferred over the CAT-GTR model in the context of the practical session. On the other hand, the CAT-GTR model probably has a better statistical fit than CAT-Poisson on most datasets.

Step 2

You can check convergence and mixing of the chains by visualizing the content of the .trace files using gnuplot, and by using tracecomp and bpcomp.
For instance, for two chains run under the WAG model and named platywag1 and platywag2:

    tracecomp -x 100 1 platywag1 platywag2

will produce an output summarizing the discrepancies and the effective sizes estimated for each column of the trace file (here, -x 100 1 discards the first 100 points as burn-in and then uses every remaining point). The discrepancy is defined as the difference between the two means, divided by the standard deviation. It is computed for each column of the trace file (log likelihood, tree length, alpha parameter, number of categories, etc.). The effective size is evaluated using the method of Geyer (1992). A run with a maximum discrepancy < 0.3 and a minimum effective size > 50 can be considered acceptable.

Similarly:

    bpcomp -x 100 1 platywag1 platywag2

will compare the two lists of trees produced by the two chains, and output a discrepancy index (maxdiff) measuring how different the consensus trees produced by the two chains are. Ideally, the maxdiff statistic should be 0.1 or less. However, a maxdiff of less than 0.3 should already give you a good qualitative idea of the phylogenetic tree.

You can call tracecomp and bpcomp on two chains while they are running. This allows you to probe the chains and, depending on the results, decide whether or not you want to stop them.

In addition to comparing the lists of trees produced by the two chains, bpcomp will also pool the two lists and compute a majority-rule consensus based on this combined set of trees. The tree will be written to the bpcomp.con.tre file, unless you specify another output, e.g.:

    bpcomp -x 100 1 -o platywag platywag1 platywag2

In this case, it will be written to a file named platywag.con.tre. You can then visualize the consensus tree using any tree visualization software (e.g. njplot or figtree).

To stop a chain, you can simply put a 0 in its .run file:

    echo 0 > platywag1.run

Then, you can restart the chain from where it stopped:

    pb platywag1

You can also restart a chain that was killed (or interrupted by a computer crash), unless the files were corrupted during the crash (which happens, but rarely).
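
To make the discrepancy reported by tracecomp more concrete, here is a minimal Python sketch. It is only an illustration, not the tracecomp implementation: the function name discrepancy is hypothetical, the burn-in is handled by simply dropping the first points, and the "standard deviation" is assumed here to be the average of the two chains' sample standard deviations. The input is the same column (e.g. tree length) read from the trace files of two chains.

```python
import statistics

def discrepancy(chain_a, chain_b, burnin=0):
    """Absolute difference of the two means, divided by the mean of the two SDs.

    Assumption: the standard deviation used for scaling is the average of the
    two chains' sample standard deviations.
    """
    a = chain_a[burnin:]
    b = chain_b[burnin:]
    mean_gap = abs(statistics.mean(a) - statistics.mean(b))
    mean_sd = (statistics.stdev(a) + statistics.stdev(b)) / 2
    return mean_gap / mean_sd

# Toy values standing in for one trace column of two independent chains.
chain_1 = [9.0, 1.0, 2.0, 3.0, 4.0, 5.0]
chain_2 = [7.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Drop the first point of each chain as burn-in before comparing.
d = discrepancy(chain_1, chain_2, burnin=1)
print(round(d, 3))
```

With real runs, the same function would be applied to each column in turn, and the largest value compared with the 0.3 threshold mentioned above.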

Step 3

Once the chains have reached convergence, you can compare the trees obtained under the site-homogeneous and Dirichlet process mixture models. What do you observe?

When models disagree about the topology of the phylogenetic tree, one usually wants an independent, objective means of assessing which model is most likely to give the right answer. There are several approaches to comparing models:

• Bayes factors. The numerical computation of Bayes factors is CPU intensive, and the most widely used method thus far (the harmonic mean estimator) is not at all reliable (Lartillot and Philippe, 2006). It is thus better to consider that Bayes factor evaluation is simply not available using current computational means.

• Cross-validation. This method is not widely used, but was developed in the specific context of comparing homogeneous and mixture models (Lartillot and Philippe, 2008). It would be applicable here, but is computationally too intensive to be done in the context of this practical session.

• Posterior predictive testing. This is a fast and easy method, which we will use now.

Posterior predictive testing is done using the ppred program. For instance:

    ppred -x 100 1 -div platywag1

will perform a posterior predictive test, using the diversity as the test statistic, on the first chain run under the WAG model. As mentioned during the lecture, correctly modeling site-specific biochemical specificities is crucial for obtaining correct phylogenies, in particular when the data are saturated (Lartillot, Brinkmann and Philippe, 2007).

Step 4

Perform a posterior predictive test, using site-specific diversity as the test statistic, on all the chains that you have run. Which models pass the test, and which fail? What do you conclude from that in terms of phylogenetic accuracy?
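
The logic of a diversity-based posterior predictive test can be sketched in Python as follows. This is a hedged illustration rather than what ppred does internally: it assumes that the diversity statistic is the mean number of distinct amino acids observed per alignment column (gaps and unknown characters ignored), and that the p-value is the fraction of simulated replicates whose diversity is at most the observed one. The function names and the toy alignments are hypothetical.

```python
def mean_diversity(alignment):
    """Mean number of distinct residues per column, ignoring gaps and unknowns."""
    ignored = set("-?Xx*")
    n_sites = len(alignment[0])
    total = 0
    for site in range(n_sites):
        residues = {seq[site] for seq in alignment} - ignored
        total += len(residues)
    return total / n_sites

def pp_pvalue(observed, simulated):
    """Fraction of simulated values that are <= the observed value.

    Assumption: 'more extreme' means 'at least as low as observed', which is
    the relevant tail when a model overestimates site diversity.
    """
    hits = sum(1 for value in simulated if value <= observed)
    return hits / len(simulated)

# Toy empirical alignment (3 taxa, 4 sites) and 4 toy simulated replicates.
observed_aln = ["MKVL", "MKIL", "MRVL"]
replicates = [
    ["MKVL", "LRIA", "MSTL"],
    ["MKVL", "MKVL", "MKIL"],
    ["AKVL", "MRIL", "MSVF"],
    ["MQVL", "MKIL", "TRVL"],
]

obs = mean_diversity(observed_aln)
sims = [mean_diversity(rep) for rep in replicates]
p = pp_pvalue(obs, sims)
print(obs, sims, p)
```

A small p-value in this sketch means that the model almost always simulates alignments with more amino acids per site than are seen in the real data, i.e. that it fails to capture site-specific constraints.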

Note that you can also run ppred to obtain a sample of simulated datasets from the posterior predictive distribution:

    ppred -x 100 1 platywag1

You can then compute any statistic of interest (not necessarily one of those available through the options of the ppred program) on the empirical dataset (observed value of the statistic) and on each of the simulation replicates (which gives you the null distribution). The fraction of replicates that give a statistic more extreme than the value computed on the true empirical data is a Monte Carlo estimate of the posterior predictive p-value. If you are comfortable with perl, you could try to implement your own script computing the average diversity of a sequence alignment, and then perform the diversity test using this script combined with ppred (now used without the -div option). You could then invent other statistics, which might help you uncover other types of model violations. As a suggestion, here is a problem for which no statistic has been defined in ppred:

Step 5

The CAT-like models account for site-specific amino acid propensities. On the other hand, they assume that the amino acid propensities at a given site are the same over the entire phylogeny. Yet, it is possible to imagine that sites would visit slightly different subsets of amino acids in different regions of the tree. Assume that you are considering a relatively large phylogenetic tree made of 2 large subgroups (e.g. protostomes and deuterostomes in animals). Can you imagine a test statistic that would test for the presence of differences between the two taxon subsets in the site-specific amino acid propensities? Could you implement this test statistic, apply it in the present case, and then, later on, apply it to a larger dataset?

Chordates: taxon sampling and gene jackknife

The folder phylobayes3.3d/data/chordates contains all the data used in the two articles Delsuc et al. (2006) and Delsuc et al. (2008).
The file chord38.ali contains the concatenation used in Delsuc et al. (2006). This article reports the new finding made at that time, that urochordates (sea squirts) are more closely related to vertebrates (including lamprey) than are cephalochordates (amphioxus), thus challenging the traditional view on these 3 groups. However, the phylogeny displayed in the figure of this article also contains an oddity: in this tree, cephalochordates are the sister group of echinoderms (represented by the sea urchin Strongylocentrotus).

This dataset was subsequently reanalyzed using CAT (Delsuc et al., 2008). You can reproduce this analysis: infer the phylogeny from the chord38.ali dataset under various models (in particular, a single-matrix model such as LG or WAG, versus CAT or CAT-GTR), and observe the role of model choice in that case. Again, you can then perform posterior predictive tests to assess the goodness-of-fit of the various models that you have tried.

The dataset is relatively large, and therefore it might be a good idea to try the (still experimental) MPI version of phylobayes, e.g.:

    mpirun -np 8 pb_mpi -d chord38.ali -cat -gtr -dgam 4 -s -dc chordcatgtr1

However, the posterior predictive tests have not yet been implemented in this MPI version, so you can only estimate the topology under alternative models.

In the second article, in addition to reanalyzing the original dataset with more sophisticated models, the authors also improved taxon sampling, going from 38 taxa in the original dataset to 50 taxa in the upgraded version. This article also introduces the idea of the gene jackknife: instead of analyzing the entire concatenation of all 179 genes (which would take too long), the idea is to build 50 (or 100) replicates, each replicate consisting of a concatenation of a random set of 50 genes. In the data folder, you will find 2 perl scripts, jackknife.pl and concat.pl. Using these scripts, you can build jackknife replicates and analyze them under various models.
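
The sampling step of the gene jackknife can be sketched in Python as follows. This is not the provided jackknife.pl script, whose interface is not described here; the gene file names, the function name and the fixed random seed are hypothetical. Each replicate is a random draw, without replacement, of 50 genes among the 179, whose alignments would then be concatenated.

```python
import random

def jackknife_replicates(genes, n_replicates=50, genes_per_replicate=50, seed=1):
    """Draw one random subset of genes (without replacement) per replicate."""
    rng = random.Random(seed)  # fixed seed so that the draws are reproducible
    return [
        sorted(rng.sample(genes, genes_per_replicate))
        for _ in range(n_replicates)
    ]

# Hypothetical file names standing in for the 179 single-gene alignments.
all_genes = [f"gene{i:03d}.ali" for i in range(1, 180)]

replicates = jackknife_replicates(all_genes)
print(len(replicates), len(replicates[0]))
print(replicates[0][:3])
```

Genes are not repeated within a replicate, but the same gene can appear in several replicates, so support values summarized across replicates reflect how sensitive the tree is to the choice of genes.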
During the practical session, you will not be able to analyze much more than a handful of replicates.

Plastidial genomes, and the position of Mesostigma

A set of 50 genes from 28 green algae (including land plants), from Rodríguez-Ezpeleta et al. (2007), has been gathered in phylobayes/data/plastid/genewise. You can concatenate them using the concat.pl script. Here, they are already aligned, and stripped down to 12 taxa, again for computational reasons. A slightly smaller concatenation, excluding the 3 RNA polymerase genes, has also been constructed (wornapol.ali).

Step 6

You can analyze the 2 concatenations, with and without the RNA polymerase genes, using CAT. What do you observe?

Step 7

You can analyze the complete concatenation using CAT, either alone or combined with the "gene-specific branch lengths" model (-gbl option). What do you observe?

In order to use the -gbl option, you need to specify the partition of the dataset into separate genes. This is done by giving the name of a file containing this partition:

    -gbl <partition>

The partition file should simply specify the total number of genes, followed by the size of each gene (the sum of all sizes should then be equal to the total number of aligned positions in the alignment).

Mammalian dataset

The data in the folder phylobayes/data/mamdating are very similar to those used in Springer et al. (2003), although with only the nuclear genes (mitochondrial genes are not considered here), and with taxon sampling restricted to placentals. This dataset represents an easy case on which to estimate divergence times (planned for the second day). Divergence time estimation with PhyloBayes does not integrate over tree topologies: one should first estimate the tree topology, and then enforce this topology as a constraint for divergence time estimation. A pre-estimated tree topology is provided in the folder (plac.tree). However, if we have enough time (and CPU), we can also estimate it.
In the present case, the dataset is a nucleotide dataset, and is not saturated at the synonymous level (nuclear sequences are slow evolving, and placentals are a relatively recent group). Therefore, the GTR model should be good enough.

References

Delsuc F, Brinkmann H, Chourrout D, Philippe H. 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature. 439:965–968.

Delsuc F, Tsagkogeorga G, Lartillot N, Philippe H. 2008. Additional molecular support for the new chordate phylogeny. Genesis. 46:592–604.

Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 52:696–704.

Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology. 7:S4.

Lartillot N, Lepage T, Blanquart S. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 25:2286.

Lartillot N, Philippe H. 2006. Computing Bayes factors using thermodynamic integration. Systematic Biology. 55:195.

Lartillot N, Philippe H. 2008. Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philosophical Transactions of the Royal Society B: Biological Sciences. 363:1463.

Philippe H, Lartillot N, Brinkmann H. 2005. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Molecular Biology and Evolution. 22:1246.

Rodríguez-Ezpeleta N, Philippe H, Brinkmann H, Becker B, Melkonian M. 2007. Phylogenetic analyses of nuclear, mitochondrial, and plastid multigene data sets support the placement of Mesostigma in the Streptophyta. Molecular Biology and Evolution. 24:723–31.

Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19:1572–4.

Springer M, Murphy W, Eizirik E, O'Brien S. 2003. Placental mammal diversification and the Cretaceous–Tertiary boundary. Proceedings of the National Academy of Sciences of the United States of America. 100:1056.

Stamatakis A, Ludwig T, Meier H. 2005. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics. 21:456–63.