
PHASE: a Software Package for Phylogenetics And Sequence Evolution
Version 1.1, April 24, 2003
Copyright 2002, 2003 by the University of Manchester.
PHASE is distributed under the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed
in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Howsun Jow and Vivek Gowri-Shankar
Bug reports: [email protected]
Why is PHASE different from other phylogenetic programs?
This package is designed specifically for use with RNA sequences that have a conserved secondary
structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired
regions of RNA secondary structures; this means that substitutions occurring on one side of a pair are
correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a
molecule evolves independently of the others but this assumption is not valid for RNA genes.
Substitution models of sequence evolution that consider pairs of sites rather than single sites are
implemented in this package, along with the standard nucleotide substitution models in common use. When
an RNA molecule with a secondary structure is used in conjunction with an RNA substitution model,
PHASE requires a structure-based alignment of the sequences with the consensus secondary structure
indicated in bracket and dot notation at the top of the alignment. We assume that you can provide this
structure.
It is now commonplace to perform combined analyses of heterogeneous sequence data when nucleotides with different patterns of evolution are sequenced for a set of studied species. PHASE can use
several substitution models simultaneously (for paired and/or unpaired sites), for example when
analysing protein-coding genes or when both stems and loops of RNA genes are used.
PHASE provides a Markov Chain Monte Carlo sampler to generate large numbers of possible phylogenetic trees with probability proportional to their likelihood. This is a Bayesian statistical method that
allows posterior probabilities to be generated for alternative trees and alternative clades. These posterior probabilities provide a sound statistical measure of support of alternative phylogenetic hypotheses,
and they remove the need for bootstrapping. Where many alternative arrangements of a given set of
species exist, it is possible to calculate posterior probabilities for all the alternative arrangements of
these species in a convenient way.
Standard Maximum Likelihood techniques for inferring the optimal tree with any of the DNA or
RNA evolution models are also implemented.
The program’s features include:
• Bayesian estimation of phylogenies and substitution model parameters
• standard ML search algorithms for inferring the optimal tree with optional topology constraints
• 6, 7 and 16 state RNA models
• standard 4 state DNA models
• invariant and discrete gamma models for substitution rate heterogeneity between sites
• mixing of molecular data types in a single analysis
Journal publications:
• C. Hudelot, V. Gowri-Shankar, H. Jow, M. Rattray and P. Higgs. “RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences”. Molecular Phylogenetics and Evolution (in press, 2003).
• H. Jow, C. Hudelot, M. Rattray and P. Higgs. “Bayesian phylogenetics using an RNA substitution
model applied to early mammalian evolution”. Molecular Biology and Evolution, 19(9):1591-1601
(2002).
Acknowledgements
Howsun Jow and Vivek Gowri-Shankar carried out this work as PhD students at Manchester University
under the supervision of Magnus Rattray. We gratefully acknowledge contributions to the design, documentation and testing from Paul Higgs and Cendrine Hudelot. The PHASE software was developed
as part of a BBSRC funded research project into RNA-based phylogenetic methods (investigators: Paul
Higgs and Magnus Rattray).
Contents
Why is PHASE different from other phylogenetic programs?
Acknowledgements
Introduction
    How to read this manual
    Acquiring and installing the software
        MS-Windows installation
        Unix-like system installation
    Description of programs in the PHASE package
        optimise and mlphase
        mcmcphase and consensus
        likelihood
        simulate
        analyser
    Running the programs
1 Using programs in the PHASE package
    1.1 Inputs/outputs in PHASE
        1.1.1 Data file format
        1.1.2 Control file format
        1.1.3 Tree file format
        1.1.4 Substitution model parameters file format
        1.1.5 Parameters displayed on the screen and output of each program
        1.1.6 Clade file format
    1.2 Control files
        1.2.1 Structure of the control files
        1.2.2 Datafile block
        1.2.3 Model block
    1.3 Using the programs in the PHASE package
        1.3.1 likelihood
        1.3.2 optimise
        1.3.3 simulate
        1.3.4 mlphase
        1.3.5 mcmcphase
        1.3.6 analyser
2 Elements of phylogenetic theory
    2.1 Phylogenetic trees
        2.1.1 Unrooted phylogenies
        2.1.2 String representation of a tree
        2.1.3 Branch lengths
    2.2 Nucleotide substitution models
        2.2.1 A Markov model of substitution
        2.2.2 Transition matrices
        2.2.3 Nucleotide substitution models implemented in PHASE
    2.3 Paired-site substitution models
        2.3.1 RNA secondary structure
        2.3.2 Theory of compensatory substitutions
        2.3.3 Base-paired substitution models implemented in PHASE
    2.4 Refinements to substitution models
        2.4.1 Invariant and discrete gamma models
        2.4.2 The MIXED model
    2.5 Bayesian phylogenetics
        2.5.1 Bayes’ theorem
        2.5.2 Markov chain Monte-Carlo (MCMC)
        2.5.3 Priors and proposals
        2.5.4 Pitfalls of Markov chain Monte-Carlo techniques
A Some examples of control files
    A.1 Control file for likelihood
    A.2 Control file for optimise
    A.3 Control file for simulate (1)
    A.4 Control file for simulate (2)
    A.5 Control file for mlphase (1)
    A.6 Control file for mlphase (2)
    A.7 Control file for mcmcphase (1)
    A.8 Control file for mcmcphase (2)
Bibliography
Introduction
How to read this manual
Readers with a good background in phylogenetic inference might be interested only in the first chapter,
which explains how to use PHASE. The second chapter covers a few elements of the theory of
phylogenetic inference, together with information about PHASE that can make technical details in
the first chapter clearer. Experienced phylogeneticists might find it useful to read the RNA substitution
models section (2.3) to learn about RNA substitution models and the Bayesian phylogenetics section (2.5)
if they are not familiar with Markov Chain Monte Carlo (MCMC) techniques.
Once you have read the short description of the programs in this introduction, you can try them
straightaway with the examples provided. However, be warned that inferences using the mammals
dataset of 69 species and the maximum likelihood inference with the primates (primates-rna-ml.control)
require at least one day. To get started, use the other control files instead.
The first chapter of this manual is intended as a reference, to be consulted when a point
about the PHASE programs is unclear. The HTML version of these pages is probably more convenient for looking up
information.
Acquiring and installing the software
PHASE can be downloaded from http://www.bioinf.man.ac.uk/resources; it is currently available for
Windows and Unix/Linux platforms.
MS-Windows installation
Download the archive phase-1.1-MSWin-exec.zip and decompress it into the directory of your choice, for
instance c:\Phase\. PHASE does not require any other installation procedure and you can therefore
test the software straightaway with the provided example files.
Unix-like system installation
For Unix and Linux systems we recommend that you compile the program yourself. However, if
compilation fails and you cannot produce a working executable, you can try the precompiled Linux
version in the archive phase-1.1-linux-i586-exec.tgz.
To compile the program yourself:
• decompress and extract the archive into the directory of your choice
tar -xvvzf phase-1.1.tgz
• enter the newly created phase-1.1 directory
cd phase-1.1
• compile with the provided Makefile
make
We assume here that a recent version of the g++ C++ compiler is installed on your platform. PHASE cannot
be compiled with gcc v2.96 or older. You can check the gcc version installed on your system by
typing “g++ -v”. You might want to (or might have to) edit the makefiles in order to
adapt them to your specific system configuration. In that case please have a look at the readme file first.
PHASE uses the BLAS and LAPACK library routines. Unless your system is equipped with optimised versions of these mathematical libraries, in which case you are strongly advised to modify the
makefile, generic versions will be built during the compilation process. The g77 compiler and the libg2c
library are required but they should already be present on your system.
Description of programs in the PHASE package
The PHASE (PHylogenetics and Sequence Evolution) package consists of two main programs, mlphase
and mcmcphase.
• mlphase performs maximum-likelihood inference
• mcmcphase is a Bayesian phylogenetic inference program
There are five other smaller programs in the package:
• analyser checks the content of the molecular sequences
• likelihood computes the likelihood given a specified evolution model
• simulate generates sequences according to a specified evolution model
• optimise is a smaller version of mlphase without tree search capabilities
• consensus is used with mcmcphase to summarize the results of an MCMC run.
Below we summarize the behaviour of these programs. Please refer to the first chapter in order to
learn how to use them.
optimise and mlphase
The mlphase program is a maximum-likelihood phylogenetic inference program similar to dnaml in
PHYLIP [1] and baseml in PAML [2]. The mlphase program has a broad range of functionalities and can be
used with a large number of evolutionary substitution models including those which take into account
the RNA secondary structure in the evolution of RNA sequences (see sections 2.2 and 2.3).
The mlphase program has two main modes of operation:
1. Optimisation of user-defined trees: estimation of maximum likelihood (ML) branch lengths
and, optionally, evolutionary model parameters, given a set of labelled molecular sequences, for a
user-defined set of phylogenetic tree topologies.
2. Maximum-likelihood tree search: the program aims at finding the model (tree topology, associated branch lengths and, optionally, sequence evolution model parameters) that yields the highest
likelihood. The user can choose among three topology search algorithms:
• Simple exhaustive search: all the possible phylogenies are considered.
• Branch and bound search: non-optimal phylogenies are rejected before evaluation.
• Heuristic search via stepwise addition: greedy search for the best topology.
Constraints can be placed on the phylogenetic tree topologies that are considered during ML inference
in order to reduce the search space and the computation time.
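To give a sense of why the exhaustive search is only practical for a handful of species, and why topology constraints help, the following sketch counts the unrooted bifurcating topologies for n labelled species with the standard formula (2n-5)!! = 3 × 5 × ... × (2n-5). The function name is ours and is not part of PHASE.

```python
def count_unrooted_topologies(n_species: int) -> int:
    """Number of distinct unrooted bifurcating topologies for n labelled species."""
    if n_species < 3:
        return 1
    count = 1
    # The k-th species can be attached to any of the 2k-5 branches
    # of an unrooted tree that already holds k-1 species.
    for k in range(4, n_species + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_topologies(n))
# 4 3
# 5 15
# 10 2027025
```

Every clade that is fixed in advance removes a large share of these candidates, which is why constraints shorten the search so effectively.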
[1] http://evolution.genetics.washington.edu/phylip.html
[2] http://abacus.gene.ucl.ac.uk/software/paml.html
The optimise program is a simpler version of mlphase and is provided for convenience. This program
returns the ML branch lengths and ML evolutionary parameters of a fixed user-defined tree topology, for
instance a consensus tree found with an MCMC run. This is equivalent to the first mode of mlphase with
only one tree. The optimise program requires fewer parameters than mlphase; it is simpler to use and
allows quick experimentation with different initial parameters when the optimisation is suspected of being
trapped in a local maximum of the likelihood.
mcmcphase and consensus
The mcmcphase program performs Bayesian phylogenetic inference. It uses a Markov Chain Monte
Carlo algorithm to sample from the posterior probability distribution of phylogenetic tree topology,
branch lengths and sequence evolution model parameters. For an explanation of Bayesian phylogenetics
and a description of the MCMC sampling algorithms used in mcmcphase, please consult the Bayesian
Phylogenetics section (section 2.5) or Jow et al. (2002).
The consensus program is used to summarize the results of an MCMC run. This program produces two
consensus models (using the mean and the median of the parameters in the sample) and can return consensus
branch lengths for any supplied topology, e.g., a PHYLIP-style consensus tree, if similar topologies were
sampled during the run.
likelihood
The likelihood program computes the likelihood of a phylogeny under any of the implemented substitution models.
simulate
The program simulate generates molecular sequences according to a user-specified tree (topology and
branch lengths) and substitution model (type and parameters).
analyser
Currently, analyser outputs basic statistics about a sequence data file. It can also locate
the sites that contain too many gaps, in case you decide to remove them. The analyser program
can be quite useful to validate your secondary structure alignment and to set a maximum limit for the
mismatch frequency at each site (see section 2.3.1).
Running the programs
Programs in the PHASE package are run from the command line under both Unix-like systems and
MS-Windows systems. For Windows operating systems, you have to open an MS-DOS command window
to use them. Click on “Run...” in the “Start...” menu and type cmd in the newly opened dialog box.
You might have to type command instead of cmd depending on your MS-Windows version. Once
the command window is open, you have to move to the directory where you extracted the software.
At the shell prompt, you can type, for example, cd c:\Phase\. You can then run any program of the
PHASE package.
Run the programs by typing their name followed by the arguments they require. In most cases,
PHASE's programs take one argument, which is the name of a control file (see section 1.2). You can
type for instance:
    mcmcphase control\hiv-dna-mcmc-1.control
or,
    optimise control/primates-rna-optimise-7a.control
After installation, if the examples are all present, these commands should work. Please note that the use
of the ‘\’ or ‘/’ character as path separator depends on your operating system. On Unix systems you
might have to type “./” before the program name.
1 - Using programs in the PHASE package
1.1 Inputs/outputs in PHASE
1.1.1 Data file format
All molecular sequence data used by the PHASE programs are stored in a common format. The data
file format is similar to the PHYLIP data file format but has a few minor modifications.
A data file is divided into four sections but two of them are not compulsory. Comments can be
included by preceding the commented lines with a hash (#) symbol. The entire commented line is
ignored by the program. Taking a look at the example data files in the package (*.dna, *.rna, and *.mix
in the data directory) will make the following explanations easier to understand.
File content
The first non-comment section of the data file is a single line containing
1. the number of species
2. the length of the molecular sequences
3. a code which can be either DNA for usual unpaired molecular sequences or RNA for base-paired
molecular sequences. In fact the purpose of this code is to indicate whether a pairing mask (see below)
is present in the data file and you can use the code RNA even if some nucleotides are unpaired.
For example the line,
5 100 DNA
at the beginning of a data file indicates that there are five non-base-paired sequences of length 100 in
the file.
For convenience a third code, MIXED, can be used instead of RNA when the data are a
concatenation of RNA loops and stems (see section 2.3.1); it should be avoided in other cases. More
details on the specific meaning of the code MIXED are given in the class section below. The lines,
10 300 RNA
and,
10 300 MIXED
both indicate that there are ten sequences of length 300 in the file and that a pairing mask is associated
with them.
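As an illustration of the header line described above, here is a minimal Python sketch that reads the three fields and reports whether a pairing mask is expected. The names DataFileHeader, parse_header and has_pairing_mask are hypothetical helpers of ours, not part of PHASE.

```python
from typing import NamedTuple


class DataFileHeader(NamedTuple):
    n_species: int
    seq_length: int
    code: str  # "DNA", "RNA" or "MIXED"


def parse_header(line: str) -> DataFileHeader:
    """Parse the first non-comment line of a data file."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected: <number of species> <length> <code>")
    code = fields[2]
    if code not in ("DNA", "RNA", "MIXED"):
        raise ValueError("unknown code: " + code)
    return DataFileHeader(int(fields[0]), int(fields[1]), code)


def has_pairing_mask(header: DataFileHeader) -> bool:
    # RNA and MIXED both announce a pairing mask; DNA forbids one.
    return header.code in ("RNA", "MIXED")


print(parse_header("5 100 DNA"))
print(has_pairing_mask(parse_header("10 300 MIXED")))
```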
Pairing mask
The second section of the data file is the pairing mask. This mask is only required when sequences
contain some base-paired nucleotides (in that case the code should be RNA or MIXED). In the case
of fully unpaired sequences – i.e., when the DNA code is used – the pairing mask must not be
provided. The pairing mask is in the form of a mathematical expression consisting of round brackets.
Corresponding brackets indicate that the bases at those positions in the sequence form a base-pair in
the RNA secondary structure. Unpaired sites can be indicated with a dot “.” or a hyphen “-”. For
example a sequence ACCAGAUGGU with a pairing mask (((.(.)))) indicates that the sequence is
made of the base-pairs AU-CG-CG-GU and unpaired sites A-A.
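The bracket matching described above can be sketched in a few lines of Python using a stack. This is only an illustration of how the mask is read, not PHASE's own parser, and the function name parse_pairing_mask is ours.

```python
from typing import List, Tuple


def parse_pairing_mask(mask: str) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return (pairs, unpaired) as 0-based positions in the sequence."""
    open_positions = []
    pairs = []
    unpaired = []
    for position, symbol in enumerate(mask):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unmatched ')' at position %d" % position)
            # A closing bracket pairs with the most recent open bracket.
            pairs.append((open_positions.pop(), position))
        elif symbol in ".-":
            unpaired.append(position)
        else:
            raise ValueError("unexpected symbol %r in mask" % symbol)
    if open_positions:
        raise ValueError("unmatched '(' at position %d" % open_positions[-1])
    return sorted(pairs), unpaired


sequence = "ACCAGAUGGU"
pairs, unpaired = parse_pairing_mask("(((.(.))))")
print([sequence[i] + sequence[j] for i, j in pairs])  # ['AU', 'CG', 'CG', 'GU']
print([sequence[i] for i in unpaired])                # ['A', 'A']
```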
Molecular sequences
The third section of the data file contains the molecular sequences. Indels (-) and ambiguities (purine
(R), pyrimidine (Y), unknown (N or ?)) are allowed. Sequences can be written in one of two formats.
The first is the non-interleaved format. This consists of an identifying label for each sequence followed
by the whole sequence. An example is:
    2 16 DNA
    Mouse ACCGUGGU
          UCCAUAAA
    Rat   ACUGUGGC
          UCGAUAUA
There can be no spaces in the label, though the sequence itself can be formatted into blocks using
multiple lines and spaces. An alternative way of specifying the sequences is the interleaved format.
This enables the sequences to be split into homologous blocks. The non-interleaved example given above
could equivalently be written:
    2 16 DNA
    Mouse ACCG
    Rat   ACUG
          UGGUUCCAUAAA
          UGGCUCGAUAUA
Notice that only the first interleaved block should contain labels. Subsequent interleaved blocks are
assumed to have the same labels and to be in the same order.
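The rule that later interleaved blocks reuse the labels and order of the first block can be sketched as follows. This is a simplified reader written for illustration: it assumes that every block holds exactly one line per species and that the header and pairing mask have already been removed. The function name read_interleaved is ours.

```python
from typing import Dict, Iterable


def read_interleaved(lines: Iterable[str], n_species: int) -> Dict[str, str]:
    """Join interleaved blocks into one sequence per label."""
    rows = [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
    labels = []
    chunks = []
    for index, row in enumerate(rows):
        if index < n_species:
            # First block: "<label> <sequence chunk>"
            fields = row.split(None, 1)
            labels.append(fields[0])
            rest = fields[1] if len(fields) > 1 else ""
            chunks.append(["".join(rest.split())])
        else:
            # Later blocks: sequence only, same species order as block one.
            chunks[index % n_species].append("".join(row.split()))
    return {label: "".join(parts) for label, parts in zip(labels, chunks)}


body = [
    "Mouse ACCG",
    "Rat   ACUG",
    "      UGGUUCCAUAAA",
    "      UGGCUCGAUAUA",
]
print(read_interleaved(body, n_species=2))
```

Both example layouts above yield the same two sequences once the blocks are joined.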
Class section
The fourth section is not compulsory and is used when performing a combined analysis of heterogeneous
data sets (e.g., loops and stems of an RNA molecule, protein-coding genes with three codon positions
or concatenated data of different genes with different evolutionary patterns). You can safely skip this
section if you plan to study DNA sequences or RNA helices only (i.e., no “.” in the pairing mask) with
only one appropriate nucleotide/base-pair substitution model.
The aim of this section is to assign each nucleotide/pair to a class. Each class is expected to have
a different pattern of evolution. This section consists of a sequence of integers which correspond to the
class of each nucleotide. For instance, the class section of a protein-coding gene may look like:
...2 3 1 2 3 1 2 3 1 2 3 1 2 ...
When the data file contains a class section, programs in the PHASE package expect it to comply with the
following set of rules:
• class labels are separated by a space
• classes are labelled from 1 to K, where K is the number of distinct classes
• the number of labels equals the length of the sequences
• when used in conjunction with a base-paired structure, the two components of a paired site are in
the same class.
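The rules above are easy to check mechanically before starting a long run. The sketch below is a hypothetical helper, not a PHASE feature: it takes the class labels, the sequence length and, optionally, the base pairs as 0-based position tuples, and returns a list of rule violations.

```python
from typing import List, Sequence, Tuple


def parse_class_section(text: str) -> List[int]:
    """Class labels are integers separated by spaces."""
    return [int(token) for token in text.split()]


def check_class_section(classes: Sequence[int], seq_length: int,
                        pairs: Sequence[Tuple[int, int]] = ()) -> List[str]:
    errors = []
    if len(classes) != seq_length:
        errors.append("expected %d labels, found %d" % (seq_length, len(classes)))
    k = max(classes) if classes else 0
    if set(classes) != set(range(1, k + 1)):
        errors.append("classes must be labelled 1..K with no label missing")
    for i, j in pairs:
        # Both components of a paired site must belong to the same class.
        if i < len(classes) and j < len(classes) and classes[i] != classes[j]:
            errors.append("pair (%d, %d) spans two classes" % (i, j))
    return errors


labels = parse_class_section("2 2 2 1 2 1 2 2 2 2")
print(check_class_section(labels, 10, [(0, 9), (1, 8), (2, 7), (4, 6)]))  # []
```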
Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure,
the most common use of the class section is to separate unpaired and base-paired
sites into two distinct classes. To spare the user this tedious task, the code MIXED can replace the code RNA;
it tells PHASE to build the class section from the provided pairing mask (e.g., the mask
(((.()))).. implies the classes 2 2 2 1 2 2 2 2 2 1 1). When the code MIXED is used the class section is
not compulsory, and the unpaired and paired sites are automatically assigned to classes 1 and 2
respectively. When the class section is present, the code MIXED is equivalent to RNA: the user's assignment
takes precedence over the automatic one.
Usually classes are used to determine the model of sequence evolution that PHASE applies to each
nucleotide. Each class in the data file is handled by its own model of nucleotide substitution during the
phylogenetic inference. The models are defined later in the model section of the control file (see 1.2.3).
Let us just point out here that if you use the MIXED type for your data with the automatic assignment,
i.e., without the class section, you have to make sure that your first and second models are respectively a
nucleotide substitution model and a base-pair substitution model when you declare your models of
evolution. We will return to this point later on.
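The automatic assignment used with the MIXED code amounts to a one-pass translation of the pairing mask. A minimal sketch, assuming the mask uses only the symbols described in the pairing mask section (the function name is ours):

```python
from typing import List


def classes_from_mask(mask: str) -> List[int]:
    """Class 1 for unpaired sites, class 2 for base-paired sites."""
    classes = []
    for symbol in mask:
        if symbol in "()":
            classes.append(2)
        elif symbol in ".-":
            classes.append(1)
        else:
            raise ValueError("unexpected symbol %r in mask" % symbol)
    return classes


print(" ".join(str(c) for c in classes_from_mask("(((.(.))))")))
# 2 2 2 1 2 1 2 2 2 2
```

With this assignment, the first model declared in the control file handles the unpaired sites (class 1) and the second handles the base pairs (class 2).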
1.1.2 Control file format
Most programs in the package use a control file. The purpose of this file is to define the task assigned to the
program, i.e., the sequences to analyse, the assumed substitution model, and other specific parameters. Control
files are the key to using the software and two sections are devoted to them. Section 1.2 describes the
structure of this file and the features common to many programs in the package. Section 1.3
presents the specific parameters for each program.
1.1.3 Tree file format
PHASE can output trees into a file and sometimes the user has to provide a file which contains one or
more trees. A tree file is simply a file with one or more phylogenies written in the computer-readable
format described in the tree representation section (2.1.2).
1.1.4 Substitution model parameters file format
With a model parameters file, one can provide initial values for the parameters of the substitution models
used. PHASE can also create a model parameters file to store the results concerning a substitution
model after a run (these could be Maximum Likelihood Estimate (MLE) parameters or Mean Posterior
Estimate (MPE) parameters).
Model parameters file content
The content of this file is highly dependent on the substitution model used and we cannot describe it in
general terms. The fields used to assign a value to each parameter should be self-explanatory
as long as you know the underlying substitution model. You might need to have a look at the transition
matrices section (2.2.2) to understand the PHASE concept of “rate ratios” in substitution models.
Each “Rate ratio i” parameter in this file stands for the parameter αi in the transition matrix of
the corresponding model. Transition matrices for all implemented substitution models are given in
section 2.2.3 for DNA models and in section 2.3.3 for RNA models.
Producing a model parameters file
Model parameters files and control files share the same structural elements. Some examples can be
found in the data directory (*.model). Although it is quite easy to understand the content of a model
parameters file without explanations when reading it, you might find it harder to produce your own file
from scratch and without guidance if you want to initialise a substitution model with specific values. It
is possible to use the simulate program to generate a stub of this file for each model implemented in the
PHASE package. This skeleton can be modified easily to suit your needs. See section 1.3.3 for details.
1.1.5 Parameters displayed on the screen and output of each program
Each program in the package outputs information on the screen and writes one or more files to store the
results permanently. The outputs are reviewed individually for each program in section 1.3.
The content displayed on the screen is usually quite easy to understand, but you might be a bit
confused by the parameters of substitution models. PHASE outputs two kinds of matrices on the screen:
• one “rate ratios” matrix R
• one transition matrix Q
These matrices are described in the transition matrices section (2.2.2). Other parameters have a straightforward meaning.
1.1.6 Clade file format
The user can specify some invariant clades to reduce the number of possible topologies when
using mlphase. A clade file contains a list of monophyletic clades in Newick format (see section 2.1.2).
All studied species must appear once (and only once) in the file, either alone or in a clade. Here is a
simple clade file example for 6 species:
(Specie5,Specie6);
(Specie4,(Specie3,Specie2));
Specie1;
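The requirement that every species appears exactly once can be checked before running mlphase. The sketch below is a hypothetical helper, not part of PHASE, and it assumes that the clade file contains only labels, brackets, commas and semicolons (no branch lengths and no comments).

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def clade_file_problems(text: str, species: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the labels found in a clade file with the studied species."""
    # A label is any run of characters other than whitespace and ( ) , ;
    counts = Counter(re.findall(r"[^\s(),;]+", text))
    expected = set(species)
    return {
        "missing": sorted(expected - set(counts)),
        "unknown": sorted(set(counts) - expected),
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
    }


clades = """(Specie5,Specie6);
(Specie4,(Specie3,Specie2));
Specie1;
"""
species = ["Specie%d" % i for i in range(1, 7)]
print(clade_file_problems(clades, species))
# {'missing': [], 'unknown': [], 'duplicated': []}
```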
1.2 Control files
Most programs in the PHASE package have their options set using a simple text file. We call this file
the control file. Although the content of this file may differ for each program in the package, its structure
remains the same. Some control files are provided as examples with the package (*.control in the control
directory). The easiest and safest way to use PHASE is to copy one of these examples and to adapt it
to your needs.
1.2.1 Structure of the control files
A control file contains logical blocks (e.g., DATAFILE block, MODEL block, . . . ) and control lines.
Lines preceded by a hash (#) symbol are considered comments and ignored. Comments can be placed
anywhere.
A control line defines a parameter and gives it a value. It has the format:
label = value
The order in which control lines are provided in the control file is not important but they must appear in
the right block. Note that PHASE is case sensitive: “Tree file” and “Tree File” are two different labels.
At the moment no warning is issued if the user mistypes an optional parameter. Please check your
control files against the provided examples, otherwise PHASE might silently ignore some important
parameters.
A block is a container. It contains control lines but can also contain other blocks. The block
BLOCKNAME begins with the tag:
{BLOCKNAME}
and ends with the tag:
{\BLOCKNAME}
Each tag must be alone on its line. By convention, block names are all uppercase.
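To make the structure concrete, here is a small Python sketch that reads a control file into nested dictionaries, following the rules given above (whole-line comments, label = value lines, and paired tags). It is an illustration only and does not claim to reproduce PHASE's own parser. Labels are kept exactly as typed because they are case sensitive.

```python
def parse_control(text: str) -> dict:
    """Parse control-file text into nested dicts (blocks) and strings (values)."""
    root = {}
    stack = [("", root)]  # (block name, block content) from outermost to innermost
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("{") and line.endswith("}"):
            name = line[1:-1]
            if name.startswith("\\"):
                # Closing tag: must match the innermost open block.
                if len(stack) == 1 or stack[-1][0] != name[1:]:
                    raise ValueError("line %d: unexpected tag %s" % (lineno, line))
                stack.pop()
            else:
                block = {}
                stack[-1][1][name] = block
                stack.append((name, block))
        elif "=" in line:
            label, value = line.split("=", 1)
            stack[-1][1][label.strip()] = value.strip()
        else:
            raise ValueError("line %d: not a tag or a control line" % lineno)
    if len(stack) != 1:
        raise ValueError("block %s is never closed" % stack[-1][0])
    return root


example = r"""
# sequences to analyse
{DATAFILE}
Data file = data/sequences.dna
Interleaved data file = yes
{\DATAFILE}
"""
print(parse_control(example))
```

Comparing the labels found this way against one of the provided example files is a simple guard against mistyped optional parameters, which PHASE does not report.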
In the remainder of this document, parameters of the control files are colored depending on their
status. Compulsory parameters are in red and you must provide a value for them. Optional parameters
are in green and they do not need to appear in the control file. Often, a default value will be assumed
for optional parameters. Some fields are dependent on the presence and/or values of other parameters
and their presence (or absence) is compulsory under certain conditions. These conditional parameters
are in orange.
1.2.2 Datafile block
Almost all programs in the PHASE package require a DATAFILE block to parse the sequences to be analysed.
As stated previously, the DATAFILE block begins with the tag {DATAFILE} alone on a line and
ends with the tag {\DATAFILE} alone on a line. The DATAFILE block contains some necessary
information which is not included in the data file itself (see section 1.1.1 for the format of this file); it
contains the following control lines:
• Data file: the location of the molecular sequences file to be used.
Data file = data/sequences.dna
• Interleaved data file: a yes/no option that specifies whether the molecular data is interleaved.
Interleaved data file = yes
• Outgroup: the label of the outgroup sequence (see section 2.1.1). The inference techniques used in
PHASE produce unrooted phylogenies and using an outgroup in your study is not required. However,
PHASE requires this parameter to produce a unique Newick representation (2.1.2) for unrooted trees.
Outgroup = Mole
• Heterogeneous data models: a yes/no parameter that specifies whether the data file contains a
class section. The default value is no, and the class section of your data file will be ignored if you
omit this field.
Heterogeneous data models = yes
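Put together, the four control lines above form a complete DATAFILE block. The following sketch assembles one from a Python dictionary; render_block is a hypothetical helper of ours, and the values are the illustrative ones used in this section.

```python
def render_block(name: str, settings: dict) -> str:
    """Write a block as an opening tag, label = value lines, and a closing tag."""
    lines = ["{" + name + "}"]
    for label, value in settings.items():
        if isinstance(value, bool):
            value = "yes" if value else "no"  # yes/no options
        lines.append("%s = %s" % (label, value))
    lines.append("{\\" + name + "}")
    return "\n".join(lines)


print(render_block("DATAFILE", {
    "Data file": "data/sequences.dna",
    "Interleaved data file": True,
    "Outgroup": "Mole",
    "Heterogeneous data models": False,
}))
```

Generating blocks this way keeps the label spelling in one place, which helps because labels are case sensitive.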
1.2.3 Model block
Most programs in the PHASE package require the specification of a substitution model for sequence
evolution. This is the purpose of the MODEL block. The MODEL block is delimited by the {MODEL}
and {\MODEL} tags. It contains the name of the substitution model followed by parameters (and
sometimes blocks) specific to the model (see section 2.2 for background information on substitution
models of nucleotide evolution).
Simple substitution model
Depending on the data to be analysed, the PHASE package can be used with a wide variety of
DNA substitution models or RNA-specific base-paired models (see sections 2.2.3 and 2.3.3 for a review of these models). The content of the MODEL block is the same for all these models and the
parameters are:
• Model: the model’s name; by convention it should be all upper case.
Model = REV
Nucleotide substitution models implemented include JC69, K80, HKY85, TN93 and REV.
Base-paired substitution models implemented include RNA6A, RNA6B, RNA7A, RNA7D,
RNA16A.
• Discrete gamma distribution of rates: the discrete gamma model (see section 2.4.1) can be used to
account for among site rate variation. Use yes/no values to turn this option on/off. When a discrete
gamma model is used, PHASE expects the number of gamma categories to be specified. By default
the discrete gamma model is not used.
Discrete gamma distribution of rates = yes
• Number of gamma categories: when the discrete gamma model is used, you have to provide an integer
to specify the desired number of discrete gamma categories.
Number of gamma categories = 5
• Invariant sites: alternatively, or in conjunction with the discrete gamma model, the user can allow
a proportion of sites to be invariant, i.e., with zero rate of evolution. The default value is no.
Invariant sites = yes
Mixed model for combined analyses of heterogeneous data
To study heterogeneous sequences several models are required. The mixed model (see section 2.4.2)
allows these models to work concurrently.
• Model: this field contains the name of the model which is MIXED.
Model = MIXED
• Number of models: the number of models used concurrently. If a class section was provided with the
data file then the number of models should be the same as the number of classes. If you used the
flag MIXED in your data file and did not provide a class section then this parameter has to be set
to 2 and the two models must be a DNA substitution model and a base-paired substitution model
respectively.
Number of models = 3
• {MODELi} block: each model used in the mixed model must be defined in its own block. If the
number of models is n then the MODEL block must contain n blocks whose names are MODEL1,
MODEL2, . . . , MODELn. The content of these blocks is the same as for a simple substitution model
block.
{MODEL}
Model = MIXED
Number of models = 3
{MODEL1}
Model = REV
Invariant sites = yes
{\MODEL1}
{MODEL2}
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 5
{\MODEL2}
{MODEL3}
Model = RNA7D
Invariant sites = no
Discrete gamma distribution of rates = no
{\MODEL3}
{\MODEL}
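The rule that a MIXED model declaring n models needs exactly n MODELi sub-blocks can be checked before launching a run. The Python sketch below is not part of PHASE: the helper name check_mixed_model and the regular expressions are illustrative assumptions based only on the control-file syntax shown above.

```python
import re


def check_mixed_model(control_text):
    """Return (declared, found) for a MIXED MODEL block.

    declared: the integer given in the 'Number of models' field
    found:    sorted indices i of the {MODELi} sub-blocks
    Hypothetical helper for illustration; not a PHASE function.
    """
    declared_match = re.search(r"^\s*Number of models\s*=\s*(\d+)\s*$",
                               control_text, flags=re.MULTILINE)
    if declared_match is None:
        raise ValueError("no 'Number of models' field found")
    declared = int(declared_match.group(1))

    # Opening tags look like {MODEL1}; closing tags look like {\MODEL1}.
    opened = {int(i) for i in re.findall(r"\{MODEL(\d+)\}", control_text)}
    closed = {int(i) for i in re.findall(r"\{\\MODEL(\d+)\}", control_text)}
    if opened != closed:
        raise ValueError("unbalanced MODELi tags")
    return declared, sorted(opened)


example = r"""
{MODEL}
Model = MIXED
Number of models = 3
{MODEL1}
Model = REV
{\MODEL1}
{MODEL2}
Model = RNA7A
{\MODEL2}
{MODEL3}
Model = RNA7D
{\MODEL3}
{\MODEL}
"""

declared, found = check_mixed_model(example)
consistent = found == list(range(1, declared + 1))
print(declared, found, consistent)
```

A block is consistent when the sub-block indices are exactly 1, 2, ..., n for the declared n.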
1.3 Using the programs in the PHASE package
Each program in the PHASE package requires a specific control file, the content of which is described
here. As in the previous section, compulsory parameters appear in red, optional parameters in green
and conditional parameters dependent on the others are in orange.
1.3.1 likelihood
Using likelihood
The likelihood program is used to compute the likelihood of a model of evolution (i.e., tree + parameterised substitution model) given a set of studied sequences. To use likelihood, one has to provide a
phylogeny for the taxa under investigation (i.e., topology and branch lengths) and a substitution model
for nucleotide evolution with user-defined parameters. To use likelihood, type at the command-line:
likelihood likelihood-control-file
where likelihood-control-file is a valid control file for the likelihood program. For verification purposes
likelihood outputs the phylogenetic tree used on the screen before the likelihood value. Unlike most
other PHASE programs, likelihood does not send any results to a file.
Control file for likelihood
An example of a valid control file for likelihood can be found in appendix A.1. In its control file, the
likelihood program requires the specification of:
• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (section 2.1.2),
with branch length values.
Tree file = data/mammals-consensus.tree
• Model parameters file: the name of the file containing parameter values for the model defined in the
MODEL block above. The simulate program (section 1.3.3) can help you to produce this file.
Model parameters file = data/mammals-consensus.model
1.3.2 optimise
Using optimise
The program optimise is used to compute maximum-likelihood estimates (MLE) for the branch lengths
and substitution model parameters of a given model of evolution (i.e., a fixed tree topology and a specified
substitution model with free parameters). One can specify some initial values for branch lengths and
substitution model parameters to speed up the convergence or to detect trapping in local maxima of the
likelihood function. To use optimise, type at the command-line:
optimise optimise-control-file
where optimise-control-file is a valid control file for the optimise program.
When launched, optimise displays the initial tree and the initial likelihood on the screen and begins
the optimisation. Once it is finished, the ML substitution model parameters are printed on the screen
and saved in the “.output” file with the ML tree and the value of the maximum likelihood. The ML tree
is also saved in the “.tree” file and a “.model” file (see input section 1.1.4) is created to store the MLE
for the substitution model parameters.
Control file for optimise
An example of a valid control file for optimise can be found in appendix A.2. The control file of the
optimise program must/may provide:
• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (see section 2.1.2)
with optional initial branch length values.
Tree file = mammals69-mix-consensus.tree
• Random seed: the integer value provided with this field is used to initialise the random number
generator (used to draw random initial branch lengths if they are not provided).
Random seed = 1
• Starting model parameters file: the name of a file containing initial values for the parameters of the
substitution model used. If this field is not provided, the analysed sequences are used to initialise
the model.
Starting model parameters file = data/hiv.model
• Output file: the basename for the three files basename.tree, basename.model and basename.output.
They contain the results generated by optimise.
Output file = mammals69-mix-optimise
1.3.3 simulate
Using simulate
Simulate is used:
1. to generate examples of “.model” files for all the substitution models implemented in PHASE. A
“.model” file (see section 1.1.4) is used to provide initial or fixed values for the model parameters to
some programs in the package.
2. to generate molecular sequences which evolved from a random initial one according to a specified
model of evolution, i.e., phylogeny and substitution model.
To use simulate, type at the command-line:
simulate simulate-control-file
where simulate-control-file is a valid control file for the simulate program. In its first mode of operation
simulate creates a single “.model” file and you can modify this file with your own initial values. In its
second mode of operation, simulate displays on screen the tree used to generate the actual sequences.
This tree was either provided by the user or randomly created by the program. In the latter case
the tree is saved in a file specified by the user. Finally, the likelihood of the generated molecular
sequences given the model is printed on the screen and simulate saves the sequences in a file specified
by the user. The format of this file is described in the data file format section (1.1.1). If the MIXED
model described in section 2.4.2 is used, heterogeneous sequences are generated in sequential order.
Control file for simulate
In appendices A.3 and A.4, example control files are provided for the first and the second mode of operation
respectively.
The control file of the simulate program must provide:
• a MODEL block: see the model block section (1.2.3).
• Retrieve the name of the model’s parameters: a boolean field to specify the user’s aim. Use yes for
the first mode of usage mentioned above and no for the second mode.
Retrieve the name of the model’s parameters = no
• Model parameters file: if simulate is used to generate an example of a substitution model parameters
file, the parameters are saved in a file having the name provided. When simulate is used to generate
sequences, the user must provide parameters for the substitution model and they are read from the
given file.
Model parameters file = simulate.model
The following fields may be required when simulate is used to generate sequences.
• Random seed: the integer value provided with this field is used to initialise the random number
generator.
Random seed = 1
• Random tree and Tree file: simulate can either generate a random tree or use a supplied phylogeny.
If Random tree is equal to yes then simulate generates a random tree and saves it in the specified
file. If Random tree is equal to no then simulate parses the user tree from the specified file.
Random tree = no
Tree file = 8-species.tree
• Number of species and Maximum branch length: when the Random tree field is set to yes, the user
must provide the number of species and the maximum value for branch lengths in the generated
tree.
Number of species = 10
Maximum branch length = .4
• Number of symbols from class i: you have to specify the number of symbols (e.g., number of nucleotides or number of paired sites) you want to generate for each class in your final sequence.
Number of symbols from class 1 = 100
Number of symbols from class 2 = 100
Number of symbols from class 3 = 100
Number of symbols from class 4 = 500
Number of symbols from class 5 = 300
• Structure for the elements of class i: simulate can add a structure to the generated data file, in which
case you have to specify the appropriate structure for the elements of each class.
Structure for the elements of class 1 = .
Structure for the elements of class 2 = .
Structure for the elements of class 3 = .
Structure for the elements of class 4 = .
Structure for the elements of class 5 = ()
• Data file type and Total length of the raw sequences: simulate produces an input file following the
format defined in the data file format section (1.1.1). To produce this file, you have to specify yourself
the type and the length written in the first line (see section 1.1.1). With the 5 classes described above:
Data file type = RNA
Total length of the raw sequences = 1400
#(100+100+100+500+300*2)
• Output file: the name of the file where generated sequences are saved.
Output file = simulated-data/codons and rna.sequences
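The total written on the first line of the generated data file can be cross-checked from the per-class settings. The Python sketch below assumes, as the comment in the example above suggests, that a class with structure "." contributes one raw character per symbol and a class with structure "()" contributes two (one for each side of the pair). The helper raw_length is hypothetical and not part of PHASE.

```python
def raw_length(symbols_per_class, structure_per_class):
    """Total length of the raw sequences for the given classes.

    Assumption: '.' means unpaired symbols (1 character each) and
    '()' means paired sites (2 characters each).
    """
    width = {".": 1, "()": 2}
    return sum(count * width[structure]
               for count, structure in zip(symbols_per_class, structure_per_class))


# Values taken from the five-class example above.
symbols = [100, 100, 100, 500, 300]
structures = [".", ".", ".", ".", "()"]
total = raw_length(symbols, structures)
print(total)  # 1400
```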
1.3.4 mlphase
Using mlphase
The mlphase program can be used:
1. to find the Maximum Likelihood Estimates for branch lengths and, optionally, evolutionary model
parameters for a user-defined set of topologies.
2. to find the phylogeny and, optionally, evolutionary model parameters that yield the maximum likelihood. Three algorithms are provided for topology search:
• Simple exhaustive search
• Branch-and-bound exhaustive search
• Heuristic stepwise addition
In the first mode of operation, mlphase operates like optimise but several trees can be considered at once.
In the second mode of operation, when mlphase performs a branch and bound search or an exhaustive
search, the ten phylogenies (and associated substitution model parameters) with the highest likelihood
are returned. These two search algorithms return the best tree unless the optimiser becomes trapped in a
local optimum of the likelihood function. The heuristic stepwise addition returns only one tree. It is
less likely to find the optimal tree but it is computationally feasible with a larger number of taxa. Be
warned that the optimiser may occasionally crash unexpectedly; changing the initial values can
overcome this (hopefully rare) problem.
To reduce the search space and the computation time, constraints can be placed on the phylogenetic
tree topologies considered during ML inference. With a clade file (see section 1.1.6) one can specify
invariant monophyletic clade topologies which should be preserved during phylogenetic inference. The
program will look for an optimal topology consistent with these clade arrangements. To use mlphase,
type at the command-line:
mlphase mlphase-control-file
where mlphase-control-file is a valid control file for mlphase. The mlphase program saves the results of
an inference in a single file. Results are also displayed on screen during the run.
Control file for mlphase
Please see the examples in appendices A.5 and A.6. These control files show the two main modes of
operation. The control file of the mlphase program contains:
• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• a FUNCTION block dependent on the operating mode of mlphase (see below)
• Random seed: the seed for the random number generator.
Random seed = 13
• Output file: the name of the file where the results are sent.
Output file = results/hiv-mlphase.output
The FUNCTION block contains specific parameters according to the mode of operation. At the
moment, mlphase can “Optimise user-defined phylogenetic trees” or “Search for ML topology”.
When the user wants to optimise a set of defined trees the FUNCTION block contains the following fields:
• Function: the parameter to specify the mode of operation.
Function = Optimise user-defined phylogenetic trees
• Trees file: the name of the file containing the phylogenies, i.e., a set of trees in the Newick format
(section 2.1.2) with optional initial branch length values.
Trees file = primates.phylogenies
• Number of trees: the user has to specify the number of trees in the previous file.
Number of trees = 4
• Optimise model parameters: set this field to no if the model parameters are to be considered fixed,
set it to yes if you want to optimise them.
Optimise model parameters = no
• User’s model parameters file: if the parameters are constant one must provide values for them. This
field is for the name of the file containing the parameters for the model defined in the MODEL block.
If provided when not required, the content of this file is used to initialise the parameters of the model
before optimisation.
User’s model parameters file = data/hiv-REV.model
When looking for the ML tree, the FUNCTION block contains:
• Function: the parameter to specify the mode of operation.
Function = Search for ML topology
• Topology search: this field specifies the search algorithm used to determine the phylogenies with
the highest likelihood. At the moment the search algorithms implemented are Simple exhaustive
search, Branch-and-bound exhaustive search and Heuristic stepwise addition.
Topology search = Heuristic stepwise addition
• User defined monophyletic clades and Clade file: set the first field to yes if you want to constrain the
search in the topology space. The second field is the name of your clade file (see section 1.1.6).
User defined monophyletic clades = yes
Clade file = primates.clades
• Optimise model parameters: set this field to no if the model parameters are to be considered fixed,
set it to yes if you want to optimise them.
Optimise model parameters = yes
• User’s model parameters file: if the parameters are constant one must provide values for them. This
field is for the name of the file containing the parameters for the model defined in the MODEL block.
User’s model parameters file = data/primates-RNA7A.model
1.3.5 mcmcphase
Using mcmcphase
The mcmcphase program performs Bayesian estimation of phylogenies (see section 2.5) and uses Markov
chain Monte Carlo to produce large samples from the posterior probability density. To use mcmcphase,
simply type at the command-line:
mcmcphase mcmcphase-control-file
where mcmcphase-control-file is a valid control file for the mcmcphase program.
The mcmcphase program saves the results of an inference in many files. Be warned that it might
require a large amount of disk space for large studies (around 90 Mb for 70 species and 50000 samples).
• .besttree and .bestmodel files: the phylogeny and the parameters of the substitution model when
the best state (i.e., the state with the highest likelihood) was visited. It is not necessarily one of the
sampled configurations and this state might have been visited during the burnin period. The best
configuration is not very important in an MCMC analysis but a “strange” best state quickly indicates
that something went wrong. The tree and the model can also be used as starting points in maximum
likelihood inference.
• .mp file: the file with the sampled parameters of the substitution model(s). Each sample occupies
one line. The parameters are, in order,
– the proportion of invariant sites if an invariant category is used (+I models)
– the gamma shape parameter (α) if the discrete gamma model is used (+dGX models)
– the frequencies of the states as they appear in the substitution matrix
– the rate ratios
When a MIXED model is used, substitution model parameters are printed sequentially. Except for
the first model, each set of parameters is preceded by the average substitution rate of the model. The
average substitution rate for the first model is always 1.0 and therefore this value is not reported.
• .samples file: the sampled topologies. This file can be used with another phylogenetic package to
produce a consensus tree. To avoid wasting disk space, mcmcphase will output the sampled topologies
using an index for each species according to their appearance order in the datafile.
• .bl file: the branch lengths for the previous topologies (for use with other PHASE programs).
• .output file: a file with similar content to the screen output.
• .plot file: the evolution of the likelihood during the run. Sampling of these values starts at the
beginning of the run, i.e., likelihood values are stored during the burnin too.
Using consensus
The consensus program is used to exploit the large sample of states produced by mcmcphase. The
program still lacks the ability to produce a consensus tree by itself and requires that tree from the user.
Many phylogenetic programs can build a consensus tree from the sample of topologies produced by
mcmcphase in the “.samples” file. You can use the consense program of PHYLIP (see footnote 2), for instance. To use
consensus, simply type at the command-line:
consensus mcmcphase-control-file consensus-topology-file
where mcmcphase-control-file is the control file that was used by the mcmcphase program to produce
the results and consensus-topology-file is the file which contains the consensus topology. Since mcmcphase outputs the topologies using numbers instead of the names of the species, consensus expects the
consensus topology to be given with numbers too.
2 http://evolution.genetics.washington.edu/phylip.html
The consensus program retrieves the model used and the location of the sample files from the control
file. Two consensus substitution models are produced, using the mean and the median values of the
sample respectively. The consensus topology is used to produce a consensus tree with branch lengths. The
branch lengths of the states whose topology is identical to the consensus topology are used. For each
branch, the consensus length is simply the mean value of all the lengths. The consensus program cannot
return a consensus tree if the consensus topology has never been visited. In such a case, we suggest you
use optimise to produce ML branch lengths.
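The averaging rule for consensus branch lengths can be illustrated with a small Python sketch. It is not the code used by consensus. The data layout is an assumption made for the illustration: each sampled state is a pair of a topology string and a dictionary of branch lengths keyed by hypothetical branch identifiers, and two topologies are treated as identical only when their strings are equal, which is a simplification.

```python
from statistics import mean


def consensus_branch_lengths(samples, consensus_topology):
    """Mean length of each branch over the samples matching the consensus topology.

    Returns None when the consensus topology was never visited,
    mirroring the case where no consensus tree can be returned.
    """
    matching = [lengths for topology, lengths in samples
                if topology == consensus_topology]
    if not matching:
        return None
    return {branch: mean(lengths[branch] for lengths in matching)
            for branch in matching[0]}


# Made-up sample of three states; species are given by index, as in the .samples file.
samples = [
    ("(1,2,(3,4))", {"b1": 0.25, "b2": 0.5}),
    ("(1,3,(2,4))", {"b1": 1.0, "b2": 1.0}),
    ("(1,2,(3,4))", {"b1": 0.75, "b2": 1.5}),
]

result = consensus_branch_lengths(samples, "(1,2,(3,4))")
print(result)
print(consensus_branch_lengths(samples, "(1,4,(2,3))"))  # None
```

Only the first and third states contribute to the means here, because the second one has a different topology.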
Control file for mcmcphase
Please see the examples provided in appendices A.7 and A.8. In the control file of mcmcphase one
can/must have:
• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• a PERTURBATION block: control block for the mixing properties of mcmcphase (see below).
• Random seed: the seed for the random number generator.
Random seed = 1
• Burnin iterations: the number of “burnin” cycles (i.e., cycles before the beginning of the sampling).
During the “burnin”, only likelihood values are stored.
Burnin iterations = 150000
• Sampling iterations: the number of cycles for sampling.
Sampling iterations = 600000
• Sampling period: the number of cycles between extraction of two consecutive samples.
Sampling period = 20
• Random start model parameters and User’s starting model parameters file: to reduce the necessary
“burnin” time, the chain can be initialised with some user-specified model parameters. Otherwise
the sequences are used to initialise the substitution model.
Random start model parameters = no
User’s starting model parameters file = data/primates-RNA7A.model
• Random start tree and User’s starting tree file: similarly, one can choose to initialise the chain
randomly or with a user-defined topology. We do not encourage the use of an initial user-defined
topology but this option can be useful to quickly gain an idea of what results can be expected.
Random start tree = yes
User’s starting tree file = this field is ignored in this case
• Output file: the basename for all the output files (basename.besttree, basename.bestmodel, basename.mp, basename.samples, basename.bl, basename.output and basename.plot).
Output file = results/hiv-dna
• Output format: the format used for the topologies in the .samples file. It can be phylip (with a
semi-colon at the end) or bambe (without semi-colon).
Output format = phylip
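The three iteration fields together determine the length of the run and the size of the sample, which helps when estimating disk usage. The Python sketch below assumes that one sample is extracted every Sampling period cycles during the sampling phase; the exact count written by mcmcphase may differ by one depending on how the first and last cycles are handled. The helper mcmc_run_summary is hypothetical.

```python
def mcmc_run_summary(burnin_iterations, sampling_iterations, sampling_period):
    """Return (total number of cycles, approximate number of samples)."""
    total_cycles = burnin_iterations + sampling_iterations
    n_samples = sampling_iterations // sampling_period
    return total_cycles, n_samples


# Values taken from the example fields above.
total_cycles, n_samples = mcmc_run_summary(150000, 600000, 20)
print(total_cycles, n_samples)  # 750000 30000
```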
PERTURBATION block
The PERTURBATION block contains the mixing parameters used for the proposals. The following
mixing parameters are relative to the branches:
• Initial branch step proposal parameter: the initial standard deviation of the normal distribution used
to modify the branch lengths. This proposal parameter is modified during the “burnin”.
Initial branch step proposal parameter = 0.1
• Branch length upper bound: the upper bound used for the uniform prior distribution of branch
lengths.
Branch length upper bound = 3.5
The PERTURBATION block also contains mixing parameters for the proposals of substitution model
parameters. These parameters are dependent on the substitution model used. With simple substitution
models, i.e., any nucleotide or base-paired model used alone, the following parameters are added in the
PERTURBATION block to perturb the frequencies:
• Frequencies, proposal priority: an integer value to specify how often we try to perturb the frequencies
with respect to other parameters. This parameter is usually compulsory except for models with fixed
frequencies (i.e., JC69 and K80). Use the value 0 to prevent the perturbation (i.e., if you want
the frequencies to remain equal to the empirical frequencies or to the values provided in an initial
substitution model).
Frequencies, proposal priority = 1
• Frequencies, proposal minimum acceptance rate and Frequencies, proposal maximum acceptance rate:
during the “burnin”, mcmcphase will try to adapt the proposal step so as to reach an acceptance rate
within the specified range. By default this range is [0.21, 0.25]. If you want to change the default
values, provide the two parameters. Use the range [0.0, 1.0] to turn off the dynamic adaptation of
the proposal step.
Frequencies, proposal minimum acceptance rate = 0.0
Frequencies, proposal maximum acceptance rate = 1.0
• Frequencies, initial Dirichlet tuning parameter: the initial proposal parameter (the higher the Dirichlet parameter, the lower the step). This parameter is compulsory if you turned off the dynamic
adaptation of the step. Allowed values are in the range [100.0, 100000.0] and the default value is
1000.0.
Frequencies, initial Dirichlet tuning parameter = 1500.0
Similar parameters are used for the rate ratios:
• Rate ratios, proposal priority: an integer value to specify how often we try to perturb each rate
ratio with respect to other parameters. Rate ratios are treated individually but they share the same
priority. This parameter is usually compulsory except for models with fixed rates (i.e., JC69). You
can use the value 0 to have constant rate ratios but you should not do that unless you provide initial
parameters with a “.model” file.
Rate ratios, proposal priority = 2
• Rate ratios, proposal minimum acceptance rate and Rate ratios, proposal maximum acceptance rate:
rate ratios are treated individually but they share the same range for the acceptance rate. The
default range is [0.21, 0.25] and you can turn off the dynamic modification of the step with the range
[0.0, 1.0].
Rate ratios, proposal minimum acceptance rate = 0.3
Rate ratios, proposal maximum acceptance rate = 0.6
• Rate Ratios, initial step: allowed initial values are in the range ]0, 5.0]. The default value (when the
dynamic step option is on) is 0.2. You cannot provide a different initial step for each rate ratio at
the moment.
Rate Ratios, initial step = 0.25
Similarly, you can/must use the following parameters when using a +I (invariant category) and/or
a +dG (discrete gamma categories) model:
• Gamma parameter, proposal priority: compulsory if a gamma model is used.
• Gamma parameter, proposal minimum acceptance rate: default value = 0.21
• Gamma parameter, proposal maximum acceptance rate: default value = 0.25
• Gamma parameter, initial step: default value = 0.2
and,
• Invariant parameter, proposal priority: compulsory if an invariant model is used.
• Invariant parameter, proposal minimum acceptance rate: default value = 0.21
• Invariant parameter, proposal maximum acceptance rate: default value = 0.25
• Invariant parameter, initial step: default value = 0.05
The upper bound for the uniform prior distribution of each rate ratio is 1000.0. The upper bound
for the uniform prior distribution of the gamma parameter is 200000.0. Be aware that something may be
wrong if you reach that upper bound during an MCMC inference, i.e., there may be
insufficient data to properly estimate the parameter.
PERTURBATION Block for MIXED models
When a mixed model is used, the previous proposal parameters are required for each model and must
be enclosed in separate blocks. You must complete a PERTURBATIONi block for each model in the
PERTURBATION block with the previously described parameters (see appendix A.8). Parameters
which are specific to the MIXED model appear in the PERTURBATION block.
• Model i priority: specify a priority for each model with respect to the other models and the priority
used for the average rates.
Model 1 priority = 8
Model 2 priority = 24
• Average rates, proposal priority: specify a priority for the perturbation of the model average substitution rate.
Average rates, proposal priority = 1
• Average rates, proposal minimum acceptance rate and Average rates, proposal maximum acceptance
rate: the acceptance range mcmcphase aims for, [0.21, 0.25] by default.
Average rates, proposal minimum acceptance rate = 0.15
Average rates, proposal maximum acceptance rate = 0.25
• Average rates, initial step: allowed initial values are in the range ]0, 1.0]. The default value (when
the dynamic step option is on) is 0.2. You cannot provide a specific initial step for each average
substitution rate at the moment.
Average rates, initial step = 0.14
The upper-bound for the uniform prior distribution of the average substitution rate ratios is 100.0.
Proposals priority
In each cycle, mcmcphase perturbs one branch length, and a topology change is tried every ten cycles (see
the Bayesian phylogenetics algorithms in section 2.5). In each cycle, mcmcphase will also try to modify the
substitution model. Let us consider the following example with the MIXED model and three substitution models:
{PERTURBATION}
#PERTURBATION OF THE TREE :
...
#PERTURBATION OF THE MODEL :
Model 1 priority = 8
Model 2 priority = 24
Model 3 priority = 7
Average rates, proposal priority = 1
Average rates, initial step = .3
Average rates, proposal minimum acceptance rate = .15
Average rates, proposal maximum acceptance rate = .20
{PERTURBATION1}
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 2
Gamma parameter, proposal priority = 1
{\PERTURBATION1}
{PERTURBATION2}
Frequencies, proposal priority = 3
Rate ratios, proposal priority = 4
Gamma parameter, proposal priority = 1
Invariant parameter, proposal priority = 1
{\PERTURBATION2}
{PERTURBATION3}
Frequencies, proposal priority = 4
Rate ratios, proposal priority = 2
Gamma parameter, proposal priority = 1
{\PERTURBATION3}
{\PERTURBATION}
The mcmcphase program will modify the average substitution rate of the second model with the probability given in 1.1a; the probability of modifying the average substitution rate of the third model is the
same. The average substitution rate of the first model is the reference: it is never modified and remains
equal to 1.0. With the probability given in 1.1b, mcmcphase will modify the parameters of the substitution
model i (i.e., any parameters but the average substitution rate). The priorities inside the corresponding
PERTURBATIONi block are then used; the figures inside each PERTURBATIONi block do not have any
effect outside the block. In 1.1c, N is the number of models.
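\begin{align}
P(\text{average rate}) &= \frac{\text{Average rates, proposal priority}}{\text{total priority}} \tag{1.1a} \\
P(\text{model } i) &= \frac{\text{Model } i \text{ priority}}{\text{total priority}} \tag{1.1b} \\
\text{total priority} &= (N - 1) \times \text{Average rates, proposal priority} + \sum_{i=1}^{N} \text{Model } i \text{ priority} \tag{1.1c}
\end{align}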
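As a worked illustration of 1.1a to 1.1c, the Python sketch below evaluates the model-level probabilities for the priorities of the example PERTURBATION block (Model 1 priority = 8, Model 2 priority = 24, Model 3 priority = 7, Average rates, proposal priority = 1). The function name is hypothetical and the code is only a transcription of the formulas, not the code used by mcmcphase.

```python
from fractions import Fraction


def model_level_probabilities(model_priorities, average_rates_priority):
    """Evaluate equations 1.1a-1.1c.

    Returns (total priority, P(average rate of one given model), [P(model i)]).
    """
    n_models = len(model_priorities)
    # 1.1c: the first model is the reference, so only N - 1 average rates can move.
    total = (n_models - 1) * average_rates_priority + sum(model_priorities)
    p_average_rate = Fraction(average_rates_priority, total)      # 1.1a
    p_model = [Fraction(p, total) for p in model_priorities]      # 1.1b
    return total, p_average_rate, p_model


total, p_average_rate, p_model = model_level_probabilities([8, 24, 7], 1)
print(total)                       # 41
print(p_average_rate)              # 1/41
print([str(p) for p in p_model])   # ['8/41', '24/41', '7/41']
```

The probabilities of the two average-rate moves and of the three model moves add up to one, as expected: 2/41 + 39/41 = 1.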
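In our example, if the second model is selected for the next modification, we define $\text{priority}_{\text{total}}$:
\[
\text{priority}_{\text{total}} = \text{priority}_{\text{frequencies}} + \text{priority}_{\text{gamma}} + \text{priority}_{\text{invariant}} + \text{number}_{\text{rate ratios}} \times \text{priority}_{\text{rate ratios}}
\]
and we modify either all the frequencies (probability 1.2a), or one of the rate ratios (probability 1.2b
each), or the invariant parameter (probability 1.2c), or the gamma shape parameter (probability 1.2d).
\begin{align}
P &= \frac{\text{priority}_{\text{frequencies}}}{\text{priority}_{\text{total}}} \tag{1.2a} \\
P &= \frac{\text{priority}_{\text{rate ratios}}}{\text{priority}_{\text{total}}} \tag{1.2b} \\
P &= \frac{\text{priority}_{\text{invariant}}}{\text{priority}_{\text{total}}} \tag{1.2c} \\
P &= \frac{\text{priority}_{\text{gamma}}}{\text{priority}_{\text{total}}} \tag{1.2d}
\end{align}
1.3.6 analyser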
The analyser program does not require any control file. To use analyser, type at the command-line:
analyser
or
analyser data-file
where data-file is a file following the data file format described in section 1.1.1. The analyser program
needs you to provide the fields usually used in the DATAFILE block (see section 1.2.2) in order to parse
your sequences. Once this is done, you will be prompted for the class to check (if the data file contains
heterogeneous sites) and analyser will ask for a “.lump” file.
The “.lump” file is used:
1. to match sites to a given state, e.g., indels (-) and purines (R) can be lumped into an ambiguity state (X).
2. to choose the state(s) used for the cut-off (see below).
Two “.lump” files are provided, data/dna.lump and data/rna.lump. The first one is used with single
nucleotides: ambiguities (i.e., -, R, Y and X) are lumped into a state X used for the cut-off. The
second one is used with paired sites: mismatches (e.g., AC, UU, . . . ) are lumped into the single
state MM and ambiguities (e.g., C-, UR, . . . ) are lumped into the state XX. Both states are used
for the cut-off.
Once the “.lump” file is provided, analyser outputs statistics for each state and requires a value for
the cut-off (between 0 and 1.0). For each site the frequency of the “cut-off states” is computed and the
sites above the cut-off threshold provided are displayed on the screen.
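The cut-off step can be sketched in a few lines of Python. This is an illustration of the idea and not the analyser code: the function name is hypothetical, the sequences are made up, the set of cut-off symbols is the single-nucleotide lumping described above (-, R, Y and X), and "above the threshold" is assumed to mean strictly greater.

```python
def sites_above_cutoff(alignment, cutoff_states, cutoff):
    """List (site number, frequency) for sites whose cut-off-state frequency exceeds cutoff.

    alignment:     list of equal-length sequences (one string per species)
    cutoff_states: symbols counted as cut-off states
    cutoff:        threshold between 0 and 1.0
    """
    n_sequences = len(alignment)
    flagged = []
    for site in range(len(alignment[0])):
        count = sum(1 for sequence in alignment if sequence[site] in cutoff_states)
        frequency = count / n_sequences
        if frequency > cutoff:
            flagged.append((site + 1, frequency))  # sites are numbered from 1
    return flagged


# Made-up alignment of four sequences and five sites.
alignment = ["ACG-A", "AC--R", "ACGTA", "A-G-A"]
flagged = sites_above_cutoff(alignment, {"-", "R", "Y", "X"}, 0.5)
print(flagged)  # [(4, 0.75)]
```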
2 - Elements of phylogenetic theory
2.1 Phylogenetic trees
Usually a sketch of a tree-like structure is used to describe evolution; the evolutionary tree represents
the hierarchical relationships among species arising through evolution. Ancestral species are located
near the root of the tree and contemporary species are the leaves. Almost all methods accept the
appropriateness of a tree-like model to describe the evolution of species but one must keep in mind that
it is a strong assumption in itself.
2.1.1 Unrooted phylogenies
Since the data for the ancestors are usually missing, the phylogenetic trees produced by PHASE are
only schematic trees comprising a set of nodes linked together by branches. Terminal nodes, usually
called tips or leaves, are known sequences of existing organisms or contemporary taxa. Internal nodes
are bifurcation points between genetically isolated groups.
The analytical techniques used in PHASE result in the inference of an unrooted, strictly bifurcating
tree. The location of the common ancestor of all the species under study (i.e., the earliest point in
time) cannot be identified by our inference method. An unrooted, strictly bifurcating tree can be
seen as a kind of network where every internal node is linked to exactly three other nodes, either
internal nodes or leaves. To draw the result as a rooted tree, one or more outgroup species, known to
be genetically isolated from all the others, should be used to root the tree (see figure 2.1).
Figure 2.1: Two equivalent representations of the same unrooted tree
2.1.2 String representation of a tree
All trees used by the programs follow the Newick standard (see footnote 1), though the grammar accepted
by PHASE is more limited. The Newick format uses the recursive definition of a tree to represent
phylogenies in a computer-readable form with nested parentheses. The tree in figure 2.1 can be written:
(Outgroup, gorilla, (human, chimpanzee));
Footnote 1: http://evolution.genetics.washington.edu/phylip/newicktree.html
However, one must be aware that this representation is not unique; the following one works as well:
(human, (Outgroup, gorilla), chimpanzee);
Sometimes, when an outgroup was provided, the rooted representation is used in PHASE:
(Outgroup, (gorilla, (human, chimpanzee)));
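One way to convince oneself that these strings describe the same unrooted tree is to compare the sets of leaves separated by each internal branch. The Python sketch below does this for trees without branch lengths. The parser is a minimal illustration written for this example (it is not the PHASE parser) and the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested tuples of leaf names."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()

def leaves(tree):
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of the leaves, each stored as the side without a reference leaf."""
    everything = leaves(tree)
    reference = min(everything)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        side = below if reference not in below else everything - below
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

first = "(Outgroup, gorilla, (human, chimpanzee));"
second = "(human, (Outgroup, gorilla), chimpanzee);"
rooted = "(Outgroup, (gorilla, (human, chimpanzee)));"
same_topology = splits(parse_newick(first)) == splits(parse_newick(second)) == splits(parse_newick(rooted))
print(same_topology)
```

All three strings yield the single split {human, chimpanzee} against {Outgroup, gorilla}, so they encode the same unrooted topology.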
2.1.3 Branch lengths
The branch lengths usually represent the evolutionary distances between two consecutive nodes. We
tend to split the phylogenetic tree into two parts: its topology (i.e., the pattern of branching) and its
associated edge lengths. The expected rate of evolutionary change is assumed constant across all lineages
in a phylogeny and the length of a branch is scaled to the expected number of substitutions per site
along that branch. These lengths can be integrated in the string representation seen in section 2.1.2; for
instance we can write:
(Outgroup : 0.35, gorilla : 0.25, (human : 0.3, chimpanzee : 0.2));
2.2 Nucleotide substitution models
Substitution models are a description of the way sequences evolve in time by nucleotide replacements.
The PHASE package provides a wide range of substitution models. These consist of standard nucleotide
substitution models as well as specific base-pair substitution models.
2.2.1 A Markov model of substitution
Replacements within DNA sequences can be described and modelled by a Markov process with four
states. Each state represents one base: Adenine, Cytosine, Guanine or Thymine (see figure 2.2).
Figure 2.2: Markov model for nucleotide evolution in DNA sequences
Several assumptions are made in order to make phylogenetic reconstruction computationally feasible.
First, each nucleotide is supposed to evolve independently of the evolution of other sites and of its
own past history: we suppose there is no interaction between sites and we treat them independently.
Further, the Markov process of substitution is assumed to be the same across all sites (spatial
homogeneity). Finally, the process is assumed to be stationary and time-homogeneous, i.e., nucleotide
frequencies and substitution rates are assumed constant through time and across all sites in an
alignment.
One might concede that the assumptions made for the nucleotide evolutionary process are not strictly
valid. Actual data show some discrepancies, e.g., heterogeneous selection pressure or unequal base
frequencies among species. We can relax these assumptions and allow for substitution rate variation
across sites with the gamma model of Yang (1994), described in section 2.4.1. It is also possible to use
multiple substitution processes simultaneously when heterogeneous data are analysed (see section 2.4.2).
In spite of their name, DNA models can naturally be used for the treatment of the loops within RNA
sequences (see figure 2.3). In RNA loops, nucleotides are not subject to any structural constraints and
they are assumed to evolve independently from other sites. Therefore, the use of similar Markov models
for nucleotide evolution in RNA loops is appropriate.
2.2.2 Transition matrices
The mathematical expression of a DNA Markov model uses a matrix Q of substitution rates in which
each element rij represents the rate of substitution from nucleotide i to nucleotide j. The diagonal
elements of the instantaneous rate matrix must satisfy the equation
$$ r_{ii} = -\sum_{j \neq i} r_{ij} \tag{2.1} $$
so that each row of Q sums to zero. The process must be homogeneous and stationary; if πA, πC, πG
and πT are the four equilibrium base frequencies then the rates must obey the following constraint:
$$ \pi_i r_{ij} = \pi_j r_{ji} \quad \forall i, j \tag{2.2} $$
also known as the time-reversibility constraint. To enforce this constraint we define αij so that
$$ Q_{ij} = r_{ij} = m_r \times \pi_j \alpha_{ij} \quad \forall i, \forall j \neq i \tag{2.3} $$
where mr is a constant factor described later. The time-reversibility condition is satisfied with a
symmetric choice of αij (αij = αji). In practice, PHASE uses one of these αij parameters as a reference
and sets its value to 1.0. Depending on the model, the other parameters (we call them rate ratios) are
fixed or inferred during an analysis.
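As an illustration of equations 2.1 to 2.3, the following Python sketch builds a rate matrix from a frequency vector and a symmetric table of rate ratios, then checks that rows sum to zero and that the time-reversibility constraint holds. The function name and the numerical values are assumptions chosen for the example; they are not taken from PHASE.

```python
def rate_matrix(freqs, ratios, mr=1.0):
    """Q[i][j] = mr * freqs[j] * ratios[i][j] for i != j; the diagonal makes each row sum to zero."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = mr * freqs[j] * ratios[i][j]
        q[i][i] = -sum(q[i])
    return q

# Illustrative values only: states ordered A, C, G, T, transversion ratio 0.5,
# transitions (A<->G, C<->T) at the reference ratio 1.
freqs = [0.3, 0.2, 0.2, 0.3]
a = 0.5
ratios = [[0, a, 1, a],
          [a, 0, a, 1],
          [1, a, 0, a],
          [a, 1, a, 0]]
Q = rate_matrix(freqs, ratios)

row_sums = [sum(row) for row in Q]
reversible = all(abs(freqs[i] * Q[i][j] - freqs[j] * Q[j][i]) < 1e-12
                 for i in range(4) for j in range(4))
print(row_sums, reversible)
```

Because the ratio table is symmetric, freqs[i] * Q[i][j] and freqs[j] * Q[j][i] are the same product of three numbers, which is why the reversibility check succeeds for any frequency vector.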
With Q we can compute the transition probability matrix over time t (see footnote 2):
$$ \frac{dP(t)}{dt} = P(t) \times Q $$
$$ P(t) = \exp(Qt) = \exp(\{\pi_j \alpha_{ij}\} \times m_r \times t) $$
The transition probability matrix P (t) = {pij (t)} is used to compute the probability that nucleotide i
will be nucleotide j after time t (i can be equal to j). The “rate ratios” matrix in PHASE refers to the
matrix {αij } and the “transition rates” matrix refers to Q.
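The matrix exponential can be evaluated numerically. The sketch below uses a plain scaling-and-squaring scheme around a truncated Taylor series, written in pure Python for illustration; it is not the method used inside PHASE. It is checked against the well-known closed form of the Jukes-Cantor model, assuming the rates are scaled so that the average substitution rate is one (off-diagonal rates of 1/3).

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=20):
    """exp(m) by scaling and squaring around a truncated Taylor series."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Jukes-Cantor rate matrix with average rate one: off-diagonal 1/3, diagonal -1.
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
t = 0.5
P = mat_exp([[q * t for q in row] for row in Q])

same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)    # closed form for p_ii(t)
other = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)   # closed form for p_ij(t), i != j
print(P[0][0], same, P[0][1], other)
```

Each row of P(t) sums to one, and as t grows every row tends to the equilibrium frequencies.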
The inference methods used do not permit the separation of mr, a factor proportional to the average
substitution rate of the model, from t, the branch lengths of the evolutionary tree (see section 2.1),
which reflect an amount of change: only the product mr × t enters P(t). The longer the branch, the
bigger the evolutionary distance between its two incident nodes. We therefore have to impose a scaling
on the branch lengths. In practice, we fix the average rate of substitution of our model to be one per
“unit of time”. This is done by adding a constraint for the factor mr:
$$ \sum_{i=1}^{\text{nb states}} \sum_{j \neq i} \pi_i r_{ij} = m_r \times \sum_{i=1}^{\text{nb states}} \sum_{j \neq i} \pi_i \pi_j \alpha_{ij} = 1.0 \tag{2.4} $$
This last constraint does not hold when multiple substitution models are used simultaneously in the
MIXED model. The average substitution rate of the first model is still fixed equal to 1.0 but the
average substitution rate of other models is now a free parameter.
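For a single model, the scaling constraint can be solved directly for mr once the frequencies and rate ratios are known. The Python sketch below reads equation 2.4 as "the average substitution rate equals one"; the function name and the two test cases are assumptions made for illustration.

```python
def scaling_factor(freqs, ratios):
    """mr such that sum_i sum_{j != i} freqs[i] * (mr * freqs[j] * ratios[i][j]) equals 1."""
    n = len(freqs)
    unscaled = sum(freqs[i] * freqs[j] * ratios[i][j]
                   for i in range(n) for j in range(n) if i != j)
    return 1.0 / unscaled

equal = [0.25] * 4
# All rate ratios equal to one (Jukes-Cantor style).
ones = [[0 if i == j else 1.0 for j in range(4)] for i in range(4)]
# States ordered A, C, G, T: transitions at ratio 1, transversions at ratio 0.5 (Kimura style).
a = 0.5
kimura = [[0, a, 1, a],
          [a, 0, a, 1],
          [1, a, 0, a],
          [a, 1, a, 0]]

mr_jc = scaling_factor(equal, ones)
mr_k80 = scaling_factor(equal, kimura)
print(mr_jc, mr_k80)
```

With equal frequencies and all ratios equal to one, the unscaled average rate is 12 × 1/16 = 0.75, hence mr = 4/3; halving the transversion ratio lowers the unscaled average to 0.5 and raises mr to 2.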
Footnote 2: the term branch length would be more correct.
2.2.3 Nucleotide substitution models implemented in PHASE
One can refer to Whelan et al. (2001) for a comprehensive review of the following substitution models
and their hierarchical relationships. The transition rate matrices of these models highlight their
differences. They are presented by increasing complexity, i.e., ordered according to their number of
free parameters (equilibrium frequencies and/or rates). In nucleotide substitution models, the A↔G
transition is used as a reference by PHASE: αAG = αGA = 1.
JC69 model (Jukes-Cantor, 69)
The Jukes-Cantor model assumes equal base frequencies and equal mutation rates, therefore it does not
have any free parameter: πi = 1/4 ∀i, αij = 1.0 ∀i, ∀j ≠ i.
Q = mr ×
    A         C         G         T
A   *         0.25      0.25      0.25
C   0.25      *         0.25      0.25
G   0.25      0.25      *         0.25
T   0.25      0.25      0.25      *
Table 2.1: JC69 transition matrix
K80 model (Kimura, 80)
The Kimura model assumes equal base frequencies and accounts for the difference between transitions
and transversions with one parameter: πi = 1/4 ∀i, αtransition = 1.0, αtransversion = α1.
Q = mr ×
    A         C         G         T
A   *         0.25α1    0.25      0.25α1
C   0.25α1    *         0.25α1    0.25
G   0.25      0.25α1    *         0.25α1
T   0.25α1    0.25      0.25α1    *
Table 2.2: K80 transition matrix
HKY85 model (Hasegawa-Kishino-Yano, 85)
The HKY85 model does not assume equal base frequencies and accounts for the difference between
transitions and transversions with one parameter: αtransition = 1.0, αtransversion = α1.
Q = mr ×
    A         C         G         T
A   *         πC α1     πG        πT α1
C   πA α1     *         πG α1     πT
G   πA        πC α1     *         πT α1
T   πA α1     πC        πG α1     *
Table 2.3: HKY85 transition matrix
TN93 model (Tamura-Nei, 93)
The TN93 model has four frequency parameters. It accounts for the difference between transitions and
transversions and differentiates the two kinds of transitions (purine↔purine and pyrimidine↔pyrimidine):
αAG = αGA = 1.0, αtransversion = α1, αCT = αTC = α2.
Q = mr ×
    A         C         G         T
A   *         πC α1     πG        πT α1
C   πA α1     *         πG α1     πT α2
G   πA        πC α1     *         πT α1
T   πA α1     πC α2     πG α1     *
Table 2.4: TN93 transition matrix
REV model (Yang, 94)
The REV model is the most general model for nucleotide substitution subject to the time-reversibility
constraint. It has four frequencies and five rate parameters.
Q = mr ×
    A         C         G         T
A   *         πC α1     πG        πT α2
C   πA α1     *         πG α3     πT α4
G   πA        πC α3     *         πT α5
T   πA α2     πC α4     πG α5     *
Table 2.5: REV transition matrix
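The hierarchy between these models can be made explicit by writing each one as a constraint on the five rate ratios of REV. The Python sketch below is only an illustration of that nesting as read from tables 2.1 to 2.5; the function names are hypothetical and do not correspond to PHASE options.

```python
def rev_ratios(ac, at, cg, ct, gt):
    """Symmetric rate-ratio table of the REV model; A<->G is the reference (1.0)."""
    pairs = {("A", "C"): ac, ("A", "G"): 1.0, ("A", "T"): at,
             ("C", "G"): cg, ("C", "T"): ct, ("G", "T"): gt}
    table = {}
    for (x, y), value in pairs.items():
        table[x, y] = table[y, x] = value
    return table

def tn93_ratios(transversion, ct):
    """TN93: one ratio shared by the four transversions, a separate ratio for C<->T."""
    return rev_ratios(transversion, transversion, transversion, ct, transversion)

def hky85_ratios(transversion):
    """HKY85: TN93 with the C<->T ratio tied to the reference A<->G."""
    return tn93_ratios(transversion, 1.0)

def jc69_ratios():
    """JC69: every ratio equal to one (used together with equal frequencies)."""
    return hky85_ratios(1.0)

hky = hky85_ratios(0.5)
print(hky["C", "T"], hky["A", "C"], hky["G", "T"])
```

K80 uses the same ratios as HKY85 and additionally fixes all four frequencies to 1/4.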
2.3 Paired-site substitution models
RNA substitution models are an attempt to add biological realism to the evolution model. The
assumption that each nucleotide site evolves independently must be modified for RNA molecules.
Paired-site substitution models can account for the secondary structure of these molecules.
2.3.1 RNA secondary structure
In the double helical structure of the DNA molecule, two complementary nucleotide strands are held
together by hydrogen bonds between the Watson-Crick pairs A-T and C-G. RNA molecules usually
come as single strands, but in their environment they fold into their tertiary structure
because of the same hydrogen bonding mechanism. Helices, also known as stems, are formed
intramolecularly.
There are 16 possible base-pairings; of these, only six (AU, GU, GC, UA, UG, CG)
are stable enough to form actual base-pairs. The rest are called mismatches and occur at very low
frequencies in helices. RNA molecules such as ribosomal RNAs and transfer RNAs have an important
role: their structure cannot easily be disrupted without impact on their function and lethal
consequences, so selection acts to maintain the secondary structure. Yet the primary structure of the
stems (i.e., their nucleotide sequence) can still vary, and in fact we observe that RNA helical regions
are quite variable in sequence. The nature of the bases is not important and substitutions are possible
as long as they preserve the secondary structure. One could model the evolution of stems using the DNA
models described above, but there may be a substantial bias in the results because paired substitutions
would seem far less probable than they are in reality (see Jow et al., 2002). Statistics become invalid
and this can have an effect on inferred phylogenies.
Figure 2.3: A RNA molecule secondary structure
The secondary structure is left unchanged when complementary substitutions occur in the DNA gene
coding for the RNA molecule. The process can be a single-step process (double substitution) or a
two-step process (two single substitutions). These two processes are described in the section on the
theory of compensatory substitutions below.
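The partition of the 16 possible pairs into stable pairs and mismatches can be enumerated in a few lines of Python. This is a small illustrative sketch; the names are assumptions, and the lumping of mismatches into a single MM label simply mirrors the seven-state convention used later in this chapter.

```python
from itertools import product

WATSON_CRICK = {"AU", "UA", "GC", "CG"}
GU_PAIRS = {"GU", "UG"}

def classify_pair(pair):
    """Label a two-letter pair as Watson-Crick, GU pair or mismatch."""
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in GU_PAIRS:
        return "GU pair"
    return "mismatch"

def seven_state(pair):
    """Keep the six stable pairs and lump every mismatch into the single label MM."""
    return pair if classify_pair(pair) != "mismatch" else "MM"

all_pairs = ["".join(p) for p in product("ACGU", repeat=2)]
stable = [p for p in all_pairs if classify_pair(p) != "mismatch"]
mismatches = [p for p in all_pairs if classify_pair(p) == "mismatch"]
print(len(all_pairs), sorted(stable), len(mismatches))
```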
2.3.2 Theory of compensatory substitutions
From the individual sequence viewpoint, complementary mutations are a two-step process typically
involving a U-G or a G-U pair as an intermediate state. These pairs are thermodynamically less stable
than Watson-Crick pairs but they are still more likely to arise than any other mismatches. Nonetheless,
in phylogenetic studies we are not considering individual copies of a gene; we are rather modelling
consensus sequences for a large number of individuals. From the population genetics viewpoint, evolution
in stems can occur either by two single substitutions or by simultaneous compensatory substitutions
(see, e.g., Higgs, 1998; Savill et al., 2001). The first mechanism is the fixation of the slightly
deleterious UG or GU pair in the population before the second mutation occurs. The second mechanism
happens when natural selection against intermediate mutants is too strong. In such a case, deleterious
pairs are kept at low frequency until a second mutation takes place by chance in one of the sporadic
mutant sequences. Afterwards, the new neutral variant may replace the original one due to drift in gene
frequencies (see figure 2.4).
Figure 2.4: Substitution mechanisms for paired-sites
Therefore, even if simultaneous mutations are very unlikely to occur in a single organism, it is
reasonable, although not compulsory, to allow double substitutions in models from the population point
of view. Experimental results obtained with PHASE support this. Since natural selection
against intermediate mutants with any mismatch pair other than U-G or G-U is usually much stronger,
one can distinguish two groups of states within which rapid interchange occurs, while interchange
between the two groups, although possible, is very slow (see figure 2.5).
Figure 2.5: Mutation rate between paired-sites
2.3.3 Base-paired substitution models implemented in PHASE
Like DNA models, RNA substitution models are Markov models, but they consider pairs of nucleotides
as their elementary states rather than single sites. The PHASE software contains 16-state models to
account for the 16 possible pairs that can be formed with 4 bases. These models have a lot of
parameters and you might prefer the 6-state and 7-state models, where mismatch pairs are respectively
discarded or lumped into a single state MM. The time-reversibility constraint and the average mutation
rate are set as they were for DNA models. One can refer to Savill et al. (2001) for a more complete
review of the following substitution models and their hierarchical relationships. With base-paired
models, PHASE uses the rate of the double transition AU↔GC as a reference for the rate ratios.
RNA6A model
Six-state models completely ignore mismatches and consider substitutions between the six stable
base-pairs only. Mismatch pairs are assigned to one of the 6 states in some deterministic fashion (the
treatment of a mismatch is quite similar to the treatment of a gap with the DNA models). The RNA6A
model is the most general six-state model, with 15 rate parameters and 6 frequencies (and 2
constraints), as shown in table 2.6.
Q = mr ×
     AU        GU        GC        UA        UG        CG
AU   *         πGU α1    πGC       πUA α2    πUG α3    πCG α4
GU   πAU α1    *         πGC α5    πUA α6    πUG α7    πCG α8
GC   πAU       πGU α5    *         πUA α9    πUG α10   πCG α11
UA   πAU α2    πGU α6    πGC α9    *         πUG α12   πCG α13
UG   πAU α3    πGU α7    πGC α10   πUA α12   *         πCG α14
CG   πAU α4    πGU α8    πGC α11   πUA α13   πUG α14   *
Table 2.6: RNA6A transition matrix
RNA6B model (Tillier, 94)
The RNA6B model (Tillier, 1994) is formed by restriction of the RNA6A model. The RNA6B model
has only 3 rate parameters and 6 frequencies; it uses a rate of single transitions α1 and a rate of
double transversions α2. The reference for the rate ratios is the rate of double transitions. The
transition rate matrix is given in table 2.7.
Q = mr ×
     AU        GU        GC        UA        UG        CG
AU   *         πGU α1    πGC       πUA α2    πUG α2    πCG α2
GU   πAU α1    *         πGC α1    πUA α2    πUG α2    πCG α2
GC   πAU       πGU α1    *         πUA α2    πUG α2    πCG α2
UA   πAU α2    πGU α2    πGC α2    *         πUG α1    πCG
UG   πAU α2    πGU α2    πGC α2    πUA α1    *         πCG α1
CG   πAU α2    πGU α2    πGC α2    πUA       πUG α1    *
Table 2.7: RNA6B transition matrix
RNA7A model
The RNA7A model is the most general of the seven state models. It has 21 rate parameters (including
the reference rate AU↔GC) and 7 frequencies. All mismatches are treated in a single state MM. The
RNA7A model is described by the following rate matrix (table 2.8).
Q = mr ×
     AU        GU        GC        UA        UG        CG        MM
AU   *         πGU α1    πGC       πUA α2    πUG α3    πCG α4    πMM α5
GU   πAU α1    *         πGC α6    πUA α7    πUG α8    πCG α9    πMM α10
GC   πAU       πGU α6    *         πUA α11   πUG α12   πCG α13   πMM α14
UA   πAU α2    πGU α7    πGC α11   *         πUG α15   πCG α16   πMM α17
UG   πAU α3    πGU α8    πGC α12   πUA α15   *         πCG α18   πMM α19
CG   πAU α4    πGU α9    πGC α13   πUA α16   πUG α18   *         πMM α20
MM   πAU α5    πGU α10   πGC α14   πUA α17   πUG α19   πCG α20   *
Table 2.8: RNA7A transition matrix
RNA7D model (Tillier, 98)
The RNA7D model (Tillier and Collins, 1998) is a biologically plausible restriction of the RNA7A
model. The restrictions in the 7D model are analogous to the restrictions made in the 6B model. There
is one more frequency parameter for the mismatch state and one more rate ratio parameter for the
substitution rates involving this state. The reference for the rate ratios is the rate of double
transitions. This model is described by the following rate matrix (table 2.9).
Q = mr ×
     AU        GU        GC        UA        UG        CG        MM
AU   *         πGU α1    πGC       πUA α2    πUG α2    πCG α2    πMM α3
GU   πAU α1    *         πGC α1    πUA α2    πUG α2    πCG α2    πMM α3
GC   πAU       πGU α1    *         πUA α2    πUG α2    πCG α2    πMM α3
UA   πAU α2    πGU α2    πGC α2    *         πUG α1    πCG       πMM α3
UG   πAU α2    πGU α2    πGC α2    πUA α1    *         πCG α1    πMM α3
CG   πAU α2    πGU α2    πGC α2    πUA       πUG α1    *         πMM α3
MM   πAU α3    πGU α3    πGC α3    πUA α3    πUG α3    πCG α3    *
Table 2.9: RNA7D transition matrix
RNA16A model
PHASE contains a general 16-state model (RNA16); however, this model has 119 + 15 free parameters
and is not well suited for phylogenetic inference, especially maximum-likelihood inference. RNA16A is
a simplified 16-state model: it reduces some of the complexity of the RNA16 model by cutting down
the number of rate parameters from 120 to 5. It uses a rate of single transitions α1, a rate of
double transversions α2, a mismatch↔non-mismatch transition rate α3 for transitions requiring only
one substitution and a mismatch↔mismatch transition rate α4, also for transitions requiring only one
substitution. The reference rate is the rate of double transitions. Some base-pair substitutions are
not allowed (null substitution rate). The transition matrix for the RNA16A model is given in table 2.10.
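The five rate classes of RNA16A follow from simple rules on the two pairs involved, which the Python sketch below spells out. It is our reading of the description above and of table 2.10, written for illustration only; the function names and class labels are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

STABLE = {"AU", "GU", "GC", "UA", "UG", "CG"}
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and (x in PURINES) == (y in PURINES)

def rate_class(p, q):
    """Rate-ratio label for the substitution between two distinct pairs p and q."""
    changed = [(x, y) for x, y in zip(p, q) if x != y]
    p_stable, q_stable = p in STABLE, q in STABLE
    if p_stable and q_stable:
        if len(changed) == 1:
            return "alpha1"        # single transition, e.g. AU <-> GU
        if all(is_transition(x, y) for x, y in changed):
            return "reference"     # double transition, e.g. AU <-> GC
        return "alpha2"            # double transversion, e.g. AU <-> UA
    if len(changed) != 1:
        return "zero"              # two substitutions involving a mismatch: not allowed
    return "alpha3" if (p_stable or q_stable) else "alpha4"

pairs = ["".join(p) for p in product("ACGU", repeat=2)]
counts = Counter(rate_class(p, q) for p, q in combinations(pairs, 2))
print(counts)
```

Of the 120 unordered pairs of states, only 59 receive a non-zero rate under these rules, which is what makes the model tractable compared with the general RNA16 model.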
2.4 Refinements to substitution models
In this section we introduce some refinements made to the substitution models described above.
2.4.1 Invariant and discrete gamma models
Substitution rates vary over the sites of a sequence for many real datasets, if not all. Including
the heterogeneity of rates in substitution models is widely recognized as an important factor in the
fit to data. One attempt to take this acknowledged biological fact into account is to suppose that
a proportion of sites are invariant while the others evolve at the same single rate. PHASE provides
this invariant model. One extra parameter in the model governs the proportion of sites with zero rate
of evolution.
Models that allow continuous variability of mutation rates over sites are more realistic and the
gamma model of Yang (1994) outperforms the invariant model. The discrete gamma model is implemented
in PHASE. The continuous rate distribution is approximated with a discrete distribution, which
is computationally tractable, and sites are divided into k equally probable rate categories. A single
parameter α governs the shape of this distribution and the substitution rates for all categories. The
mean E(r) of the gamma distribution is the average mutation rate of our substitution model as stated
earlier and its variance is V(r) = E(r)^2/α. A small α suggests that rates differ significantly between
sites, with few sites having high rates and others being practically invariant; on the contrary, a
large α models weak rate heterogeneity (see figure 2.6). When α → +∞, the gamma model reduces to the
single rate model. The computational cost of the discrete gamma model is roughly linear in k, i.e., the
application of a discrete gamma model with k categories is about k times slower than the use of a model
where rate heterogeneity is not considered.
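To get a feeling for the category rates, one can approximate them by simulation. The Python sketch below draws from a gamma distribution with mean one, sorts the draws into k equally probable bins and takes the mean of each bin. This Monte Carlo shortcut is an assumption made for illustration; it is not the numerical procedure used by PHASE, and the function name is hypothetical.

```python
import random

def discrete_gamma_rates(alpha, k, n_samples=100_000, seed=1):
    """Approximate mean rate of each of k equally probable gamma categories."""
    rng = random.Random(seed)
    # A gamma distribution with shape alpha and scale 1/alpha has mean 1 and variance 1/alpha.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // k
    means = [sum(draws[c * size:(c + 1) * size]) / size for c in range(k)]
    overall = sum(means) / k
    return [m / overall for m in means]   # rescale so that the average rate is exactly 1

rates = discrete_gamma_rates(alpha=0.5, k=4)
print(rates)
```

With a small shape parameter such as α = 0.5, the lowest category is close to invariant while the highest category evolves several times faster than average.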
2.4.2 The MIXED model
Since the current trend in phylogenetic analysis is to use several genes and/or several sorts of
sequences at once, models were designed for combined analyses of heterogeneous sequence data from the
same set of species (Yang, 1996). PHASE allows the use of multiple substitution models simultaneously
to treat this kind of data, each model having its own independent set of parameters. The average
mutation rate of the first model is still set to 1.0 but the average mutation rates of the others are
now free parameters of the model. The MIXED model for combined analysis of heterogeneous data is
equivalent to the model with proportional branch lengths described in Yang (1996).
2.5 Bayesian phylogenetics
2.5.1 Bayes’ theorem
A Bayesian approach to phylogeny reconstruction requires the definition of a parameter space Ω which
contains the set of all possible combined states φ = {τi, νi, θ}, where the symbol τi labels the ith
possible tree topology, νi are the branch lengths associated with this topology and θ is a set of allowed
parameters for our evolutionary model (e.g., rate ratios αij, nucleotide or base-pair frequencies πi,
gamma distribution parameter α, ...). According to Bayes’ theorem, we can calculate the posterior
probability of the combined state φ given sequence data X,
$$ p(\phi|X) = \frac{P(X|\phi)\,p(\phi)}{\sum_{i=1}^{N_s} \int\!\!\int d\nu_i\, d\theta\, P(X|\phi)\,p(\phi)} \tag{2.5} $$
where Ns is the number of possible tree topologies for a data set containing s species, P(X|φ) is the
likelihood of the data and p(φ) is the prior probability density associated with state φ.
Figure 2.6: Probability density function of several gamma distributions of rate heterogeneity with mean E(r) = 1
2.5.2 Markov chain Monte-Carlo (MCMC)
Computing the denominator of equation 2.5 is infeasible for realistically sized problems. A Markov
chain Monte-Carlo method is therefore used. The standard Metropolis-Hastings MCMC algorithm can
construct a Markov chain in our state space Ω by iterating a two-step process (Metropolis et al., 1953;
Hastings, 1970). Firstly, a new state φ′ is drawn from the current state φn according to some proposal
mechanism. The proposed state is then accepted or rejected with a probability which depends on
the ratio of the posterior probabilities of the two states φ′ and φn and on the proposal.
After a “burnin” period, the chain converges to an equilibrium, under quite weak conditions. After
discarding an initial portion of the chain, states are distributed according to the posterior probability
density p(φ|X). PHASE can produce a large sample from the posterior probability density and with
this sample one can compute the posterior probability of any identifiable phylogenetic feature of interest.
For instance the posterior probability of a specific topology is simply given by the fraction of times this
topology appears in our MCMC sample. Similarly we can fit a posterior probability density curve to
the gamma distribution parameter.
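The mechanics can be illustrated on a toy problem with a single positive parameter. The Python sketch below runs a Metropolis sampler with a symmetric normal proposal (so the proposal ratio cancels), discards a burnin period and estimates a mean and a probability from the remaining sample. The target density, step size and chain length are assumptions chosen for the example and have nothing to do with a real phylogenetic posterior.

```python
import math
import random

def metropolis(log_density, start, step, n_iter, burnin, seed=1):
    """Metropolis sampler with a symmetric normal proposal; returns the states kept after burnin."""
    rng = random.Random(seed)
    x, logp = start, log_density(start)
    sample = []
    for it in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, density ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < logp_new - logp:
            x, logp = proposal, logp_new
        if it >= burnin:
            sample.append(x)
    return sample

def log_target(x):
    """Unnormalised log density x * exp(-x) on x > 0 (a gamma density with mean 2)."""
    return math.log(x) - x if x > 0 else -math.inf

sample = metropolis(log_target, start=1.0, step=1.5, n_iter=60_000, burnin=10_000)
posterior_mean = sum(sample) / len(sample)
prob_below_one = sum(x < 1.0 for x in sample) / len(sample)   # fraction of the sample with x < 1
print(posterior_mean, prob_below_one)
```

The fraction of sampled states with a given property plays the same role here as the fraction of sampled trees showing a given topology; for this toy target the exact values are 2 for the mean and 1 − 2/e ≈ 0.26 for the probability.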
2.5.3 Priors and proposals
Uniform priors?
We have no strong evidence for any particular prior and we therefore choose a simple factorized prior
p(φ) = p(θ)p(νi)P(τi). We assume a uniform prior on all trees, P(τi) = 1/Ns, we use a flat Dirichlet
distribution prior for frequency parameters (i.e., all sets of frequencies are equally likely as long as
they sum to one) and we choose a uniform positive prior for substitution rate parameters, gamma
distribution parameter and branch lengths. Consequently, for all pairs of possible states (φ, φ′), the
priors are equal. One should set upper limits in the case of uniform priors, but since these parameters
usually remain between reasonable limits during simulations these boundaries should not have any effect
on experimental results unless unreasonable values are chosen. It is good practice to check whether
these upper boundaries are reached while monitoring the convergence of the parameters.
Proposals for the parameters
For the proposal step, we have to balance the desire to move globally through the parameter space Ω
with the need to make computationally feasible moves in areas of high probability. Therefore we split
up the process and we apply a suitable proposal to the variables at each iteration. For the frequency
parameters we adopt a Dirichlet proposal distribution centred at the current frequency vector, as used
by Larget and Simon (1999). For the gamma distribution parameter and the substitution rate ratios, a
normal proposal distribution centred at the current value is used, with reflecting boundaries at zero
and at the upper limit defined above. Distant moves in Ω might result in a low acceptance rate, whereas
small modifications will prevent a full inspection of highly probable areas. The parameters that control
the moves in the state space must therefore be carefully chosen for proper mixing (quick convergence and
good sampling from the posterior probability density). With mcmcphase, these parameters can be adjusted
during the burnin period.
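A normal proposal with reflecting boundaries can be sketched in Python as follows. The folding rule below is one common way of implementing reflection at zero and at an upper limit; it is an assumption made for illustration, and the function names, step size and limit are hypothetical.

```python
import random

def reflect(value, upper):
    """Fold a proposed value back into [0, upper] by reflecting at both boundaries."""
    period = 2.0 * upper
    value = abs(value) % period
    return period - value if value > upper else value

def propose(current, step, upper, rng):
    """Normal proposal centred at the current value, reflected into [0, upper]."""
    return reflect(current + rng.gauss(0.0, step), upper)

rng = random.Random(1)
proposals = [propose(0.2, 0.5, 10.0, rng) for _ in range(1000)]
print(min(proposals), max(proposals))
```

A larger step makes bolder moves that are rejected more often, while a smaller step gives a high acceptance rate but explores the space slowly; this is the trade-off that tuning during the burnin period tries to balance.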
Proposals for the tree
The tree topology is perturbed every ten cycles with either the nearest neighbor interchange (NNI)
proposal shown in figure 2.7 or the subtree pruning and re-grafting (SPR) proposal (Swofford et al.,
1996) shown in figure 2.8.
Figure 2.7: The nearest neighbor interchange algorithm (Jow et al., 2002)
Figure 2.8: The subtree pruning and regrafting algorithm (Jow et al., 2002)
At each cycle a randomly chosen branch length is modified by a value δ drawn from a normal
distribution centred at zero. When the branch length becomes negative, special rules which can lead to
a topology change are applied (Jow et al., 2002). If the branch is an internal branch then one of the
two nearest neighbor topologies is proposed, each with equal probability; this is the nearest neighbor
interchange described above. The new internal branch length is set to y = |x + δ| (see figure 2.9). If
the branch is a terminal branch, we cannot apply the NNI algorithm and we simply use a reflecting
boundary. The new proposed length is y = |x + δ|.
Figure 2.9: The continuous change algorithm when x + δ < 0 (Jow et al., 2002)
The acceptance rates for the SPR and the NNI proposals are usually quite low. The “local” NNI
proposal, induced by a branch length modification, has a better acceptance rate.
2.5.4 Pitfalls of Markov chain Monte-Carlo techniques
One can doubt that maximum-likelihood algorithms always find the true global maximum of the
likelihood function. Similarly, with MCMC techniques, the Markov chain can fail to converge to the
stationary distribution of the posterior probabilities. A possible reason for this is the failure to
visit all highly probable regions of the parameter space because of local maxima in the likelihood
surface. However, poor proposal mechanisms and/or failure to run the chain long enough are usually the
main causes of a defective sample (see Huelsenbeck et al., 2002). Unfortunately it is not always easy
to identify these traps. We can only recommend doing long runs, monitoring the convergence of several
model parameters (monitoring the likelihood alone is not enough) and repeating the experiment using
different random starting trees to check that all the chains give similar results (i.e., substitution
model parameters, consensus tree, likelihood, ...).
Q = mr ×
     AU       GU       GC       UA       UG       CG       AA       AG       AC       GA       GG       CA       CC       CU       UC       UU
AU   *        πGU α1   πGC      πUA α2   πUG α2   πCG α2   πAA α3   πAG α3   πAC α3   0        0        0        0        πCU α3   0        πUU α3
GU   πAU α1   *        πGC α1   πUA α2   πUG α2   πCG α2   0        0        0        πGA α3   πGG α3   0        0        πCU α3   0        πUU α3
GC   πAU      πGU α1   *        πUA α2   πUG α2   πCG α2   0        0        πAC α3   πGA α3   πGG α3   0        πCC α3   0        πUC α3   0
UA   πAU α2   πGU α2   πGC α2   *        πUG α1   πCG      πAA α3   0        0        πGA α3   0        πCA α3   0        0        πUC α3   πUU α3
UG   πAU α2   πGU α2   πGC α2   πUA α1   *        πCG α1   0        πAG α3   0        0        πGG α3   0        0        0        πUC α3   πUU α3
CG   πAU α2   πGU α2   πGC α2   πUA      πUG α1   *        0        πAG α3   0        0        πGG α3   πCA α3   πCC α3   πCU α3   0        0
AA   πAU α3   0        0        πUA α3   0        0        *        πAG α4   πAC α4   πGA α4   0        πCA α4   0        0        0        0
AG   πAU α3   0        0        0        πUG α3   πCG α3   πAA α4   *        πAC α4   0        πGG α4   0        0        0        0        0
AC   πAU α3   0        πGC α3   0        0        0        πAA α4   πAG α4   *        0        0        0        πCC α4   0        πUC α4   0
GA   0        πGU α3   πGC α3   πUA α3   0        0        πAA α4   0        0        *        πGG α4   πCA α4   0        0        0        0
GG   0        πGU α3   πGC α3   0        πUG α3   πCG α3   0        πAG α4   0        πGA α4   *        0        0        0        0        0
CA   0        0        0        πUA α3   0        πCG α3   πAA α4   0        0        πGA α4   0        *        πCC α4   πCU α4   0        0
CC   0        0        πGC α3   0        0        πCG α3   0        0        πAC α4   0        0        πCA α4   *        πCU α4   πUC α4   0
CU   πAU α3   πGU α3   0        0        0        πCG α3   0        0        0        0        0        πCA α4   πCC α4   *        0        πUU α4
UC   0        0        πGC α3   πUA α3   πUG α3   0        0        0        πAC α4   0        0        0        πCC α4   0        *        πUU α4
UU   πAU α3   πGU α3   0        πUA α3   πUG α3   0        0        0        0        0        0        0        0        πCU α4   πUC α4   *
Table 2.10: RNA16A transition matrix
Appendices
Appendix A - Some examples of control files
A.1 Control file for likelihood
######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = data/mammals69.mix
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 26
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
#and we define each substitution model inside its own block.
#the file "data/mammals69.mix" is a RNA sequence with loops and stems
#we did not specify a class section but the code used was MIXED
#therefore the first model must be the DNA model for the loop
#and the second model must be the RNA model for the helices.
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
####################### The tree & model Section ####################
#To evaluate the likelihood of a phylogeny you must provide
#1.a phylogeny file (tree with branch lengths)
Tree file = data/mammals69-mix-consensus.tree
#2.the parameters for the model you defined above
Model parameters file = data/mammals69-mix-consensus.model
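The block structure of a control file is easy to inspect with a few lines of Python. The reader below is a hypothetical sketch written for this guide; it is not the parser used by the PHASE programs and it only assumes what the example above shows: comment lines start with #, blocks open with {NAME} and close with {\NAME}, and fields are written as "name = value".

```python
def parse_control(text):
    """Read a control file into nested dictionaries (one dictionary per {BLOCK})."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.startswith("{\\") and line.endswith("}"):
            stack.pop()                   # closing tag such as {\MODEL}
        elif line.startswith("{") and line.endswith("}"):
            block = {}
            stack[-1][line[1:-1]] = block
            stack.append(block)           # opening tag such as {MODEL}
        elif "=" in line:
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
    return root

example = r"""
{DATAFILE}
#The name of your data file
Data file = data/mammals69.mix
Outgroup = 26
{\DATAFILE}
{MODEL}
Model = MIXED
{MODEL1}
Model = REV
{\MODEL1}
{\MODEL}
Tree file = data/mammals69-mix-consensus.tree
"""
settings = parse_control(example)
print(settings["MODEL"]["MODEL1"]["Model"])
```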
A.2 Control file for optimise
######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = data/primates.rna
#the format of your data file (interleaved or not)
Interleaved data file = no
#the species used to root the tree
Outgroup = 14
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : RNA16A + dG4
Model = RNA16A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
{\MODEL}
####################### The tree & model Section ####################
#the phylogeny to optimise
Tree file = data/primates.tree
#an optional field to choose the initial model parameters (and check
#whether the method always converges to the same tree)
#Starting model parameters file =
#a random seed to initialise the branch lengths randomly in case the phylogeny
#provided does not hold this information
Random seed = 1
#the base name of the three output files (base.output, base.model, base.tree)
Output file = results/primates-rna-optimise/primates-rna-optimise-RNA16A
A.3 Control file for simulate (1)
######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
######################## The Simulate Section ############
#to produce an example of '.model' file for the specified model set this field
#to 'yes'
Retrieve the name of the model's parameters = yes
#the following field is the name of the '.model' file to create
Model parameters file = data/simulate.model
A.4 Control file for simulate (2)
######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
######################## The Simulate Section ############
#to simulate some sequences set this field to ’no’
Retrieve the name of the model’s parameters = no
#the file with the user-specified parameters of the substitution model
Model parameters file = data/simulate.model
#Initialise the random number generator with a seed
Random seed = 1
#Random tree or user-specified tree ?
Random tree = yes
#parameters used if a random tree is generated
Number of species = 8
Maximum branch length = .5
#if Random tree == yes the tree will be saved with that file name
#if Random tree == no the tree is read from that file
Tree file = simulated-data/random-8species.tree
#generate sequences:
#for each model you have to specify the desired number of symbols
#if you are not using a MIXED model fill this field for the class 1 only
Number of symbols from class 1 = 600
Number of symbols from class 2 = 1200
#if you need a secondary structure fill the following fields
Structure for the elements of class 1 = .
Structure for the elements of class 2 = ()
#to produce a complete PHASE input file, you have to specify the type and the
#final length yourself
Data file type = RNA
Total length of the raw sequences = 3000
#the name of the file where your sequences are saved, please check this file
#before use
Output file = simulated-data/simulated-8species.mix
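The value given for "Total length of the raw sequences" in this file can be checked by hand. The short Python sketch below assumes that one symbol of class 1 (structure ".") is a single unpaired nucleotide and that one symbol of class 2 (structure "()") is a base pair occupying two nucleotides. This is an inference from the structure masks and the paired-site RNA7A model, not a statement taken from the PHASE source code.

```python
# Number of symbols requested for each class in the control file above.
symbols = {1: 600, 2: 1200}
# Structure mask for one symbol of each class: "." is one nucleotide,
# "()" is a pair, so the mask length is the width in nucleotides (assumed).
structure = {1: ".", 2: "()"}

total_length = sum(symbols[c] * len(structure[c]) for c in symbols)
print(total_length)  # 3000
```

Under this assumption, 600 + 2 x 1200 = 3000, which agrees with the total length specified in the control file.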
A.5 Control file for mlphase (1)
####################### The Data Section ###########################
{DATAFILE}
#The name of your data file
Data file = data/hiv.dna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : REV + dG4
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
{\MODEL}
####################### The Function Section ###########################
{FUNCTION}
Function = Optimise user-defined phylogenetic trees
#The file with the trees
Trees file = data/hiv-dna.trees
#The number of trees in this file
Number of trees = 2
#Optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes
# The name of the file containing initial substitution model parameters,
# if the previous field is set to no (ie, fixed parameters for the model) this
# field is compulsory
#### User’s model parameters file =
{\FUNCTION}
# Random seed for the random number generator
Random seed = 2
# The next control line sets the output file
Output file = results/hiv-dna-ml/hiv-dna-ml.output
A.6 Control file for mlphase (2)
####################### The Data Section ###########################
{DATAFILE}
#The name of your data file
Data file = data/primates.rna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : RNA7A + dG3
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL}
####################### The Function Section ###########################
{FUNCTION}
Function = Search for ML topology
#Monophyletic clades ?
User defined monophyletic clades = yes
Clade file = data/primates.clades
#The search method for tree topology :
# ’Simple exhaustive search’, ’Branch-and-bound exhaustive search’ or
# ’Heuristic stepwise addition’
Topology search = Branch-and-bound exhaustive search
#optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes
#a field to choose initial model parameters, this field is compulsory if
#the previous field is set to no
#### User’s model parameters file =
{\FUNCTION}
# Random seed for the random number generator
Random seed = 2
# The next control line sets the output file
Output file = results/primates-rna-ml/primates-rna-ml.output
A.7 Control file for mcmcphase (1)
######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = simulated-data/suzuki-arranged.dna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
#Is there a class section in your data file ?
Heterogeneous data models = yes
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : K80 + dG3
Model = K80
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
Invariant sites = no
{\MODEL}
######################## The Perturbation Section ############
{PERTURBATION}
#Initial branch step proposal parameter
Initial branch step proposal parameter = 0.1
#Upper bound for the branch length uniform distribution
Branch length upper bound = 1.5
#priority for the frequencies perturbation
Frequencies, proposal priority = 1
#optional initial parameter for the perturbation
Frequencies, initial Dirichlet tuning parameter = 500.0
#priority for the rate ratios perturbation
Rate ratios, proposal priority = 1
#optional, initial rate ratio step proposal parameter
Rate ratios, initial step = 0.3
#optional, set the lower bound for the acceptance rate
Rate ratios, proposal minimum acceptance rate = 0.2
#optional, set the upper bound for the acceptance rate
Rate ratios, proposal maximum acceptance rate = 0.6
#priority for the gamma shape parameter perturbation
Gamma parameter, proposal priority = 1
#do not adapt the proposal parameter during the burnin period
Gamma parameter, proposal minimum acceptance rate = .0
Gamma parameter, proposal maximum acceptance rate = 1.0
# the initial proposal step is required because it is fixed
Gamma parameter, initial step = .1
#priority for the invariant parameter (% of invariant sites) perturbation
Invariant parameter, proposal priority = 1
{\PERTURBATION}
######################## The program Section ############
#initialise the random number generator
Random seed = 1
#number of burnin iterations
Burnin iterations = 150000
#number of sampling iterations
Sampling iterations = 300000
#sample every 20 cycles
Sampling period = 20
#initialise the chain with user-defined substitution parameters ?
Random start model parameters = yes
#### User’s starting model parameters file =
#initialise the chain with a given tree ?
Random start tree = yes
#### User’s starting tree file =
#The base name for the output files (base.output, base.bestmp, base.besttree,
#base.samples, base.mp, base.bl, base.plot)
Output file = results/simulation-mix-mcmc/simulation-suzuki-mcmc
#the format for the ’base.samples’ file (phylip or bambe)
Output format = phylip
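The burnin, sampling and period settings in this file determine how long the chain runs and how many states are written to the samples file. The Python sketch below works through the arithmetic, assuming that exactly one state is recorded every "Sampling period" iterations during the sampling phase and that none is recorded during the burnin. The function name retained_samples is hypothetical.

```python
def retained_samples(sampling_iterations, sampling_period):
    # Assumed rule: one recorded state per full sampling period.
    return sampling_iterations // sampling_period

burnin_iterations = 150000
sampling_iterations = 300000
sampling_period = 20

total_iterations = burnin_iterations + sampling_iterations
n_samples = retained_samples(sampling_iterations, sampling_period)
print(total_iterations)  # 450000
print(n_samples)         # 15000
```

With these settings the burnin accounts for one third of the 450000 iterations.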
A.8 Control file for mcmcphase (2)
######################## The Sequence Alignment Section ############
{DATAFILE}
Data file = data/mammals69.mix
Interleaved data file = no
Outgroup = 26
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
Model = MIXED
Number of models = 2
{MODEL1}
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no
{\MODEL1}
{MODEL2}
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no
{\MODEL2}
{\MODEL}
######################## The MCMC PERTURBATION Section ############
{PERTURBATION}
#PERTURBATION OF THE TREE :
Initial branch step proposal parameter = 0.03
Branch length upper bound = 1.7
#PERTURBATION OF THE MODEL :
Model 1 priority = 8
Model 2 priority = 24
Average rates, proposal priority = 1
Average rates, initial step = .3
Average rates, proposal minimum acceptance rate = .15
Average rates, proposal maximum acceptance rate = .20
{PERTURBATION1}
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1
{\PERTURBATION1}
{PERTURBATION2}
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1
{\PERTURBATION2}
{\PERTURBATION}
Random seed = 1
Burnin iterations = 40000
Sampling iterations = 100000
Sampling period = 10
Random start model parameters = no
User’s starting model parameters file = data/mammals69-mix-consensus.model
Random start tree = no
User’s starting tree file = data/mammals69-mix-consensus.tree
Output file = results/mammals69-mix-mcmc/mammals69-mix-mcmc-preinit
Output format = phylip
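The two "Model N priority" fields in the perturbation section above can be read as relative weights. The Python sketch below assumes that, when a model perturbation is proposed, each model is chosen with probability proportional to its priority. This interpretation is an assumption made for illustration and should be checked against the description of the perturbation fields in the main text.

```python
# Priorities copied from the control file above.
priorities = {"Model 1": 8, "Model 2": 24}

total = sum(priorities.values())
# Share of model proposals directed at each model (assumed proportional rule).
shares = {name: value / total for name, value in priorities.items()}
print(shares)  # {'Model 1': 0.25, 'Model 2': 0.75}
```

Under this reading the RNA7A model (model 2) would receive three proposals for every proposal made to the REV model (model 1).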
45
Bibliography
Hasegawa, M. et al.
1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol.,
42:160–174.
Hastings, W.
1970. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57:97–
109.
Higgs, P.
1998. Compensatory neutral mutation and the evolution of RNA. Genetica, 102:91–101.
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. Higgs
2003. RNA-based phylogenetics methods: Application to mammalian mitochondrial RNA sequences.
Mol. Phyl. Evol.
Huelsenbeck, J., B. Larget, R. Miller, and F. Ronquist
2002. Potential applications and pitfalls of bayesian inference of phylogeny. Syst. Biol., 51(5):673–688.
Jow, H., C. Hudelot, M. Rattray, and P. Higgs
2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution.
Mol. Biol. Evol., 19(9):1591–1601.
Jukes, T. and C. Cantor
1969. Evolution of protein molecules. In Mammalian Protein Metabolism, volume 3, Pp. 21–132.
Munro, H.H., ed.
Kimura, M.
1980. A simple method for estimating evolutionary rate of base substitutions through comparative
studies of nucleotide sequences. J. Mol. Evol., 16:111–120.
Larget, B. and D. Simon
1999. Markov chain monte carlo algorithms for the bayesian analysis of phylogenetic trees. Molecular
Biology and Evolution, 16(6):750–759.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller
1953. Equations of states calculations for fast computing machines. Journal of Chemical Physics,
21:1087–1091.
Savill, N., D. Hoyle, and P. Higgs
2001. Rna sequence evolution with secondary structure constraints: Comparison of substitution rate
models using maximum likelyhood methods. Genetics, 157:399–411.
Swofford, D. L., G. Olsen, P. Waddell, and D. Hillis
1996. Phylogenetic inference. In Molecular Systematics (2nd edition), Pp. 407–515. Hillis, D.M.
Tamura, K. and M. Nei
1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial dna
in humans and chimpanzees. Mol. Biol. Evol., 10(3):512–526.
Tillier, E. and R. Collins
1998. High apparent rate of simultaneous compensatory basepair substitutions in ribosomal RNA.
Genetics, 148:1993–2002.
46
Tillier, E. R. M.
1994. Maximum likelihood with multiparameter models of substitution. Journal of Molecular Evolution, 39:409–417.
Whelan, S., P. Liò, and N. Goldman
2001. Molecular phylogenetics: state-of-the art methods for looking into the past. TRENDS in
Genetics, 17(5):262–272.
Yang, Z.
1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites:
Approximate methods. J. Mol. Evol., 39:306–314.
Yang, Z.
1996. Maximum likelihood models for combined analyses of multiple sequence data. J. Mol. Evol.,
42:587–596.
47