Bayesian phylogenetic reconstruction using PhyloBayes

June 6, 2012
The aim of this practical is to explore a few real-case examples, using PhyloBayes (Lartillot,
Lepage and Blanquart, 2009). These examples have been chosen to illustrate the importance of
model choice in phylogenetics: depending on the model, different topologies and/or support values
will be obtained. Goodness-of-fit tests can then be performed to reveal the strengths and weaknesses of
each model.
Some of the models used here (mostly the non-mixture models) are available in most other
Bayesian software packages, in particular in MrBayes (Ronquist and Huelsenbeck, 2003), or in maximum
likelihood implementations, e.g. RAxML (Stamatakis, Ludwig and Meier, 2005) and PhyML (Guindon and Gascuel, 2003). You are invited to conduct a few experiments on your side, after this
practical, just to check that the trees that you will obtain under alternative frameworks (maximum
likelihood or Bayes), or with alternative software, but using the same substitution model, are
not fundamentally different. In contrast, the choice of the model will generally have a substantial
impact on the resulting tree. In particular, mixture models (which are a specific feature of PhyloBayes) can lead to a dramatic improvement in phylogenetic accuracy, particularly in the cases of
’deep’ and saturated phylogenies.
Bilaterian phylogeny: a case study
This subsection is inspired by Lartillot, Brinkmann and Philippe (2007).
If you go into the phylobayes3.3d/data/bilateria folder, you will find a multiple alignment
called platy3.ali. This is a concatenation of 30 genes from 10 bilaterian taxa. Scaffolding
and multiple alignment have been done in Philippe, Lartillot and Brinkmann (2005). The entire
set of genes of the original article is in the directory bilateria/meta2005, and the complete
concatenation is bilateria/m2g.puz.
Step 1 Using this alignment, you can try one of the available site-homogeneous models (GTR,
WAG, JTT or LG), and compare with one of the two infinite mixture models (CAT-Poisson or
CAT-GTR). For each model, you should run 2 chains in parallel.
Each time you create a new chain, you give it a name. Thus, for instance, in order to run 2
chains under the JTT + Gamma model:
pb -d platy3.ali -jtt -ncat 1 -dgam 4 -dc -s platyjtt1
pb -d platy3.ali -jtt -ncat 1 -dgam 4 -dc -s platyjtt2
All options are preceded by a ’-’. They allow you to customize the model, as well as other
specific details of the run. The -d option allows you to specify the dataset. Concerning the substitution model, here, we have activated the model with the JTT empirical relative exchangeabilities
(-jtt), with only one category (-ncat 1, corresponding to the classical JTT model), combined
with a discrete gamma distribution of rates across sites (-dgam 4, with 4 categories). Finally, the
-dc option will eliminate all constant positions from the alignment (as a way of speeding up the
convergence), while the -s option will make phylobayes save all parameter configurations visited
by the MCMC (and not just the tree topologies). Using this -s option is important, since we want
to conduct goodness-of-fit tests. For the many other options available, see the manual for details,
or type pb without any argument.
To run the Dirichlet process CAT model, use the -cat option:
pb -d platy3.ali -cat -poisson -dgam 4 -dc -s platycatpoisson1
pb -d platy3.ali -cat -gtr -dgam 4 -dc -s platycatgtr1
The CAT-Poisson model is faster and should therefore be preferred over the CAT-GTR model in
the context of the practical session. On the other hand, the CAT-GTR model is probably better
than CAT-Poisson on most datasets, in terms of statistical fit.
Step 2 You can check convergence and mixing of the chains by visualizing the content of the
.trace files, using gnuplot, and by using tracecomp and bpcomp. For instance, for two chains named platywag1 and platywag2 (run under the WAG model):
tracecomp -x 100 1 platywag1 platywag2
will produce an output summarizing the discrepancies and the effective sizes estimated for each
column of the trace file. The discrepancy is defined as the difference between the two means,
divided by the standard deviation. It is computed for each column of the trace file (log likelihood,
tree length, alpha parameter, number of categories, etc). The effective size is evaluated using the
method of Geyer (1992). A run with a maximum discrepancy < 0.3 and a minimum effective size > 50 can be
considered acceptable.
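As an illustration of the discrepancy defined above, here is a minimal Python sketch for one column of two trace files. It assumes that the "standard deviation" in the definition is the average of the two chains' sample standard deviations; tracecomp may differ in detail. The function names discrepancy and acceptable are hypothetical, and the effective size is taken as a given number rather than re-estimated.

```python
import statistics


def discrepancy(trace_a, trace_b):
    """Discrepancy between two chains for one trace column.

    Assumption: the difference between the two means is divided by the
    average of the two sample standard deviations.
    """
    mean_a, mean_b = statistics.mean(trace_a), statistics.mean(trace_b)
    sd_sum = statistics.stdev(trace_a) + statistics.stdev(trace_b)
    if sd_sum == 0.0:
        # Both columns are constant: identical means give 0, otherwise infinity.
        return 0.0 if mean_a == mean_b else float("inf")
    return 2.0 * abs(mean_a - mean_b) / sd_sum


def acceptable(max_discrepancy, min_effective_size):
    """Apply the rule of thumb quoted in the text (< 0.3 and > 50)."""
    return max_discrepancy < 0.3 and min_effective_size > 50
```

The values passed to discrepancy should be the post-burn-in entries of the same column (for example the log likelihood) in the two chains.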
Similarly:
bpcomp -x 100 1 platywag1 platywag2
will compare the two lists of trees produced by the two chains, and output a discrepancy index
(maxdiff), measuring how different the consensus trees produced by the two chains are. Ideally,
the maxdiff statistic should be 0.1 or less. However, a maxdiff of less than 0.3 should already give
you a good qualitative idea of the phylogenetic tree.
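To make the maxdiff index more concrete, the following Python sketch assumes that it is the largest absolute difference, over all bipartitions, between the frequencies with which a bipartition appears in the two chains. The dictionary representation and the function name maxdiff are illustrative assumptions, not the bpcomp implementation.

```python
def maxdiff(freqs_a, freqs_b):
    """Largest absolute difference in bipartition frequency between two chains.

    freqs_a and freqs_b map a bipartition label to its frequency in the
    sample of one chain; a bipartition absent from a chain counts as 0.
    """
    all_bipartitions = set(freqs_a) | set(freqs_b)
    return max(
        (abs(freqs_a.get(b, 0.0) - freqs_b.get(b, 0.0)) for b in all_bipartitions),
        default=0.0,
    )
```

With this definition, a bipartition found with frequency 0.9 in one chain and 0.7 in the other contributes a difference of 0.2, which is already above the ideal threshold of 0.1.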
You can call tracecomp and bpcomp on two chains while they are running. This allows you to
probe the chains, and depending on the results, decide whether or not you want to stop them.
In addition to comparing the list of trees produced by the two chains, bpcomp will also pool
the two lists, and compute a majority rule consensus based on this combined set of trees. The tree
will be written in the bpcomp.con.tre file, unless you specify another output, e.g.:
bpcomp -x 100 1 -o platywag platywag1 platywag2
In the present case, the tree will be written to a file named platywag.con.tre. You can then visualize
the consensus tree using any tree visualization software (e.g. NJplot or FigTree).
To stop chains, you can simply put a 0 in the .run file:
echo 0 > platywag1.run
Then, you can restart the chain from where it stopped:
pb platywag1
You can also restart a chain that was killed (or after a computer crash), except if the files have
been corrupted during the crash (which happens sometimes, but rarely).
Step 3 Once chains have reached convergence, you can compare the trees obtained under the site-homogeneous and Dirichlet process mixture models. What do you observe?
When models disagree about the topology of the phylogenetic tree, one usually wants to use
independent/objective means to assess which model is most likely giving you the right answer. There
are several approaches to compare models:
• Bayes factor. The numerical computation of Bayes factors, however, is CPU intensive, and
thus far the most widely used method (the harmonic mean estimator) is not at all reliable (Lartillot and
Philippe, 2006). Thus, it is better to consider that Bayes factor evaluation is simply not
available using current computational means.
• Cross validation. This method is not widely used, but was developed in the specific context
of comparing homogeneous and mixture models (Lartillot and Philippe, 2008). It would be
accessible here, but is computationally too intensive to be done in the context of this practical
session.
• Posterior predictive testing. This is a fast and easy method, which we will use now.
Posterior predictive testing is done using the ppred program. For instance:
ppred -x 100 1 -div platywag1
will perform a posterior predictive test, using the diversity as the test statistic, on the first chain
run under the WAG model. As mentioned during the lecture, correctly modeling site-specific
biochemical specificities is crucial for obtaining correct phylogenies, in particular when the data
are saturated (Lartillot, Brinkmann and Philippe, 2007).
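For reference, a diversity statistic of this kind can be sketched in a few lines of Python. The sketch assumes that diversity means the mean number of distinct residues observed per alignment column, ignoring gaps and unknown characters; the exact definition used by ppred -div may differ in detail, and the function name mean_diversity is hypothetical.

```python
def mean_diversity(sequences, missing="-?X"):
    """Mean number of distinct residues per column of an alignment.

    sequences: list of aligned sequences (strings of equal length).
    Characters listed in `missing` are ignored (assumed gap/unknown codes).
    """
    if not sequences:
        raise ValueError("empty alignment")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0
    for column in zip(*sequences):
        # Distinct residues actually observed at this site.
        residues = {c.upper() for c in column if c.upper() not in missing}
        total += len(residues)
    return total / length
```

A model that ignores site-specific biochemical constraints is expected to predict a higher mean diversity than the one observed in the real alignment.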
Step 4 Perform a posterior predictive test, using site-specific diversity as the test statistic, on all
chains that you have run. Which models pass the test, and which fail? What do you conclude from that,
in terms of phylogenetic accuracy?
Note that you can also run ppred to obtain a sample of simulated datasets from the posterior
predictive distribution:
ppred -x 100 1 platywag1
You can then compute any statistic of interest (not necessarily those available through the options
of the ppred program), on the empirical data set (observed value of the statistic), and on each of
the simulation replicates (which gives you the null distribution). The fraction of replicates that
give you a statistic more extreme than the value computed on the true empirical data is a Monte
Carlo estimate of the posterior predictive p-value.
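The Monte Carlo estimate described above can be written as a short Python sketch. The function name posterior_predictive_pvalue and the tail argument are illustrative assumptions: which direction counts as "more extreme" depends on the statistic (for diversity, site-homogeneous models typically over-predict, so the lower tail is the relevant one).

```python
def posterior_predictive_pvalue(observed, simulated, tail="lower"):
    """Fraction of simulated statistics at least as extreme as the observed one.

    observed:  statistic computed on the empirical alignment.
    simulated: statistics computed on the posterior predictive replicates.
    tail:      "lower" counts replicates <= observed,
               "upper" counts replicates >= observed.
    """
    if not simulated:
        raise ValueError("no simulated replicates")
    if tail == "lower":
        extreme = sum(1 for s in simulated if s <= observed)
    elif tail == "upper":
        extreme = sum(1 for s in simulated if s >= observed)
    else:
        raise ValueError("tail must be 'lower' or 'upper'")
    return extreme / len(simulated)
```

A p-value close to 0 indicates that the model almost never reproduces a value as extreme as the one observed on the real data.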
If you are comfortable with Perl, you could try to implement your own script computing the
average diversity of a sequence alignment, and then perform the diversity test using this script
combined with ppred (now used without the -div option). You could then invent other statistics,
which might help you uncover other types of model violations. As a suggestion, here is a problem
for which no statistic has been defined in ppred:
Step 5 The CAT-like models account for site-specific amino-acid propensities. On the other hand,
they assume that the amino-acid propensities at a given site are the same over the entire phylogeny.
Yet, it is possible to imagine that sites would be visiting slightly different subsets of amino-acids in
different regions of the tree.
Assume that you are considering a relatively large phylogenetic tree made of 2 large subgroups
(e.g. protostomes and deuterostomes in animals). Can you imagine a test statistic that would
test for the presence of differences between the two taxon subsets in the site-specific amino-acid
propensities? Could you implement this test statistic, and apply it in the present case, and then,
later on, on a larger dataset?
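As one possible starting point (not the only answer, and not a statistic taken from ppred), the Python sketch below counts, at each site, the residues seen in only one of the two taxon subsets, and averages this count over sites. The function name group_specific_diversity and the treatment of gaps are assumptions made for illustration.

```python
def group_specific_diversity(group_a, group_b, missing="-?X"):
    """Mean number of residues per site that are observed in only one group.

    group_a, group_b: lists of aligned sequences for the two taxon subsets,
    all of the same length. Characters in `missing` are ignored.
    """
    sequences = list(group_a) + list(group_b)
    if not group_a or not group_b:
        raise ValueError("both groups must contain at least one sequence")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0
    for site in range(length):
        set_a = {s[site].upper() for s in group_a if s[site].upper() not in missing}
        set_b = {s[site].upper() for s in group_b if s[site].upper() not in missing}
        # Symmetric difference: residues found in exactly one of the two groups.
        total += len(set_a ^ set_b)
    return total / length
```

The statistic would be computed on the empirical alignment and on each posterior predictive replicate; an observed value much higher than the simulated ones would suggest that the two subgroups explore different amino-acid subsets.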
Chordates: taxon sampling and gene jackknife
The folder phylobayes3.3d/data/chordates contains all the data used in the two articles Delsuc
et al. (2006) and Delsuc et al. (2008).
The file chord38.ali contains the concatenation used in Delsuc et al. (2006). This article
reports the finding, new at that time, that urochordates (sea squirts) are more closely related
to vertebrates (including lamprey) than are cephalochordates (amphioxus), thus challenging
the traditional view on these 3 groups. However, the phylogeny displayed in the figure of this
article also contains an oddity: in this tree, cephalochordates are the sister group of echinoderms
(represented by the sea urchin Strongylocentrotus).
This dataset was subsequently reanalyzed using CAT (Delsuc et al., 2008). You can reproduce
this analysis: infer the phylogeny from the chord38.ali dataset under various models (in particular,
a single empirical matrix, such as LG or WAG, versus CAT or CAT-GTR), and observe the role of model choice
in that case. Again, you can then perform posterior predictive tests to assess the goodness-of-fit
of the various models that you might have tried.
The dataset is relatively large, and therefore, it might be a good idea to try the (still experimental) MPI version of phylobayes, e.g.:
mpirun -np 8 pb_mpi -d chord38.ali -cat -gtr -dgam 4 -s -dc chordcatgtr1
However, the posterior predictive tests have not yet been implemented in this MPI version, so you
can only estimate the topology under alternative models.
In the second article, in addition to reanalyzing the original dataset with more sophisticated
models, the authors also improved taxon sampling, going from 38 taxa in the original dataset to
50 taxa in the upgraded version. Also, this article introduces the idea of gene jackknife: instead of
analyzing the entire concatenation of all 179 genes (which would take too long), the idea is to build
50 (or 100) replicates, each replicate consisting of a concatenation of a random set of 50 genes.
In the data folder, you will find 2 Perl scripts, jackknife.pl and concat.pl. Using these scripts,
you can build jackknife replicates, and analyze them under various models. During the practical
session, you will not be able to analyze much more than a handful of replicates.
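The sampling step of the gene jackknife can be sketched as follows in Python. This is only an illustration of the idea, assuming that each replicate draws its genes without replacement; it does not reproduce the behaviour of jackknife.pl, and the function name and gene labels are hypothetical.

```python
import random


def jackknife_replicates(gene_names, genes_per_replicate=50, n_replicates=50, seed=None):
    """Draw random gene subsets, one per jackknife replicate.

    Each replicate is a sorted list of `genes_per_replicate` distinct genes
    sampled without replacement from `gene_names`.
    """
    genes = list(gene_names)
    if genes_per_replicate > len(genes):
        raise ValueError("cannot sample more genes than are available")
    rng = random.Random(seed)  # seeded generator for reproducible replicates
    return [sorted(rng.sample(genes, genes_per_replicate)) for _ in range(n_replicates)]
```

Each list of gene names would then be concatenated into one alignment (for instance with concat.pl) and analyzed as a separate dataset.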
Plastidial genomes, and the position of Mesostigma
A set of 50 genes from 28 green algae (including land plants), from Rodríguez-Ezpeleta et al.
(2007), has been gathered in phylobayes/data/plastid/genewise. You can concatenate them
using the concat.pl script. Here, they are already aligned, and stripped down to 12 taxa, again for
computational reasons. A slightly smaller concatenation, excluding the 3 RNA polymerase genes,
has also been constructed (wornapol.ali).
Step 6 You can analyze the 2 concatenations, with and without the RNA pol genes, using CAT.
What do you observe?
Step 7 You can analyze the complete concatenation using CAT, either alone or combined with the
"gene-specific branch lengths" model (-gbl option). What do you observe?
In order to use the -gbl option, you need to specify the partition of the dataset into separate
genes. This is done by giving the name of a file containing this partition:
-gbl <partition>
The partition file should simply specify the total number of genes, followed by the size of each (the
sum of all sizes should then be equal to the total number of aligned positions in the alignment).
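A partition file of this kind can be generated and checked with a short Python sketch. The layout used here (one number per line) is an assumption; the text above only states that the file gives the number of genes followed by their sizes. The function names are hypothetical.

```python
def partition_text(gene_lengths):
    """Return the partition file content: number of genes, then each size.

    Assumption: one integer per line.
    """
    if not gene_lengths or any(n <= 0 for n in gene_lengths):
        raise ValueError("gene lengths must be a non-empty list of positive integers")
    lines = [str(len(gene_lengths))] + [str(n) for n in gene_lengths]
    return "\n".join(lines) + "\n"


def check_partition(gene_lengths, alignment_length):
    """True if the gene sizes add up to the number of aligned positions."""
    return sum(gene_lengths) == alignment_length
```

Running check_partition before starting a chain is a cheap way to catch a partition that does not match the concatenated alignment.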
Mammalian dataset
The data in the folder phylobayes/data/mamdating are very similar to those used in Springer
et al. (2003), although with only the nuclear genes (mitochondrial genes are not considered here),
and restricting taxon sampling to placentals.
This dataset represents an easy case on which to estimate divergence times (planned for the
second day). Divergence time estimation with PhyloBayes does not integrate over tree topologies:
one should first estimate the tree topology, and then enforce this topology as a constraint for
divergence time estimation.
A pre-estimated tree topology is provided in the folder (plac.tree). However, if we have
enough time (and CPU), we can also estimate it. In the present case, the dataset is a nucleotide
dataset, and is not saturated at the synonymous level (nuclear sequences are slow-evolving, and
placentals are a relatively recent group). Therefore, the GTR model should be good enough.
References
Delsuc F, Brinkmann H, Chourrout D, Philippe H. 2006. Tunicates and not cephalochordates are the closest
living relatives of vertebrates. Nature. 439:965–968.
Delsuc F, Tsagkogeorga G, Lartillot N, Philippe H. 2008. Additional molecular support for the
new chordate phylogeny. genesis. 46:592–604.
Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies
by maximum likelihood. Syst Biol. 52:696–704.
Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch attraction artefacts in the
animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology. 7:S4.
Lartillot N, Lepage T, Blanquart S. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 25:2286.
Lartillot N, Philippe H. 2006. Computing Bayes factors using thermodynamic integration. Systematic Biology. 55:195.
Lartillot N, Philippe H. 2008. Improvement of molecular phylogenetic inference and the phylogeny
of Bilateria. Philosophical Transactions of the Royal Society B: Biological Sciences. 363:1463.
Philippe H, Lartillot N, Brinkmann H. 2005. Multigene analyses of bilaterian animals corroborate
the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Molecular Biology and Evolution.
22:1246.
Rodríguez-Ezpeleta N, Philippe H, Brinkmann H, Becker B, Melkonian M. 2007. Phylogenetic
analyses of nuclear, mitochondrial, and plastid multigene data sets support the placement of
Mesostigma in the Streptophyta. Mol Biol Evol. 24:723–31.
Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed
models. Bioinformatics. 19:1572–4.
Springer M, Murphy W, Eizirik E, O’Brien S. 2003. Placental mammal diversification and the
Cretaceous–Tertiary boundary. Proceedings of the National Academy of Sciences of the United
States of America. 100:1056.
Stamatakis A, Ludwig T, Meier H. 2005. RAxML-III: a fast program for maximum likelihood-based
inference of large phylogenetic trees. Bioinformatics. 21:456–63.