PHASE: a Software Package for Phylogenetics And Sequence Evolution

Version 1.1, April 24, 2003

Copyright 2002, 2003 by the University of Manchester. PHASE is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Howsun Jow and Vivek Gowri-Shankar*

* bug report: [email protected]

Why is PHASE different from other phylogenetic programs?

This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired regions of RNA secondary structures: substitutions occurring on one side of a pair are correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a molecule evolves independently of the others, but this assumption is not valid for RNA genes. This package implements substitution models that consider pairs of sites rather than single sites, along with the standard nucleotide substitution models in common use.

When an RNA molecule with a secondary structure is used in conjunction with an RNA substitution model, PHASE requires a structure-based alignment of the sequences, with the consensus secondary structure indicated in bracket and dot notation at the top of the alignment. We assume that you can provide this structure.

It is now commonplace to perform combined analyses of heterogeneous sequence data when nucleotides with different patterns of evolution are sequenced for a set of studied species.
It is possible to use several substitution models simultaneously with PHASE (for paired and/or unpaired sites) when analysing protein coding genes or when stems and loops of RNA genes are used.

PHASE provides a Markov Chain Monte Carlo sampler to generate large numbers of possible phylogenetic trees with probability proportional to their likelihood. This is a Bayesian statistical method that allows posterior probabilities to be generated for alternative trees and alternative clades. These posterior probabilities provide a sound statistical measure of support for alternative phylogenetic hypotheses, and they remove the need for bootstrapping. Where many alternative arrangements of a given set of species exist, it is possible to calculate posterior probabilities for all the alternative arrangements of these species in a convenient way. Standard Maximum Likelihood techniques for inferring the optimal tree with any of the DNA or RNA evolution models are also implemented.

The program’s features include:
• Bayesian estimation of phylogenies and substitution model parameters
• standard ML search algorithms for inferring the optimal tree with optional topology constraints
• 6, 7 and 16 state RNA models
• standard 4 state DNA models
• invariant and discrete gamma models for substitution rate heterogeneity between sites
• mixing of molecular data types in a single analysis

Journal publications:
• C. Hudelot, V. Gowri-Shankar, H. Jow, M. Rattray and P. Higgs. “RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences”. Molecular Phylogenetics and Evolution (in press, 2003).
• H. Jow, C. Hudelot, M. Rattray and P. Higgs. “Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution”. Molecular Biology and Evolution, 19(9):1591-1601 (2002).

Acknowledgements

Howsun Jow and Vivek Gowri-Shankar carried out this work as PhD students at Manchester University under the supervision of Magnus Rattray.
We gratefully acknowledge contributions to the design, documentation and testing from Paul Higgs and Cendrine Hudelot. The PHASE software was developed as part of a BBSRC funded research project into RNA-based phylogenetic methods (investigators: Paul Higgs and Magnus Rattray).

Contents

Why is PHASE different from other phylogenetic programs?
Acknowledgements
Introduction
    How to read this manual?
    Acquiring and installing the software
        MS-Windows installation
        Unix-like system installation
    Description of programs in the PHASE package
        optimise and mlphase
        mcmcphase and consensus
        likelihood
        simulate
        analyser
    Running the programs
1 Using programs in the PHASE package
    1.1 Inputs/outputs in PHASE
        1.1.1 Data file format
        1.1.2 Control file format
        1.1.3 Tree file format
        1.1.4 Substitution model parameters file format
        1.1.5 Parameters displayed on the screen and output of each program
        1.1.6 Clade file format
    1.2 Control files
        1.2.1 Structure of the control files
        1.2.2 Datafile block
        1.2.3 Model block
    1.3 Using the programs in the PHASE package
        1.3.1 likelihood
        1.3.2 optimise
        1.3.3 simulate
        1.3.4 mlphase
        1.3.5 mcmcphase
        1.3.6 analyser
2 Elements of phylogenetic theory
    2.1 Phylogenetic trees
        2.1.1 Unrooted phylogenies
        2.1.2 String representation of a tree
        2.1.3 Branch lengths
    2.2 Nucleotide substitution models
        2.2.1 A Markov model of substitution
        2.2.2 Transition matrices
        2.2.3 Nucleotide substitution models implemented in PHASE
    2.3 Paired-site substitution models
        2.3.1 RNA secondary structure
        2.3.2 Theory of compensatory substitutions
        2.3.3 Base-paired substitution models implemented in PHASE
    2.4 Refinements to substitution models
        2.4.1 Invariant and discrete gamma models
        2.4.2 The MIXED model
    2.5 Bayesian phylogenetics
        2.5.1 Bayes’ theorem
        2.5.2 Markov chain Monte-Carlo (MCMC)
        2.5.3 Priors and proposals
        2.5.4 Pitfalls of Markov chain Monte-Carlo techniques
A Some examples of control files
    A.1 Control file for likelihood
    A.2 Control file for optimise
    A.3 Control file for simulate (1)
    A.4 Control file for simulate (2)
    A.5 Control file for mlphase (1)
    A.6 Control file for mlphase (2)
    A.7 Control file for mcmcphase (1)
    A.8 Control file for mcmcphase (2)
Bibliography

Introduction

How to read this manual?

People with a good background in phylogenetic inference might be interested only in the first chapter, which explains how to use PHASE. The second chapter contains a few elements of the theory of phylogenetic inference, with information about PHASE that can make technical details in the first chapter clearer. Experienced phylogeneticists might find it useful to read the section on paired-site substitution models (2.3) to learn about RNA substitution models, and the Bayesian phylogenetics section (2.5) if they are not familiar with Markov Chain Monte-Carlo (MCMC) techniques.

Once you have read the short description of the programs in this introduction, you can try them straightaway with the examples provided. However, be warned that inferences using the mammals dataset of 69 species and the maximum likelihood inference with the primates (primates-rna-ml.control) require at least one day. You should start with other control files instead.

The first chapter of this manual is meant to be used as a reference, to clarify obscure points about the PHASE programs. The HTML version of these pages is probably more convenient for looking up information.

Acquiring and installing the software

PHASE can be downloaded from http://www.bioinf.man.ac.uk/resources; it is currently available for Windows and Unix/Linux platforms.

MS-Windows installation

Download the archive phase-1.1-MSWin-exec.zip and decompress it into the directory of your choice, for instance c:\Phase\. PHASE does not require any other installation procedure, so you can test the software straightaway with the provided example files.
Unix-like system installation

For Unix and Linux systems we recommend that you compile the program yourself. If the process fails and you cannot produce a working executable, you can try the precompiled Linux version in the archive phase-1.1-linux-i586-exec.tgz. To compile the program yourself:
• decompress and extract the archive into the directory of your choice
    tar -xvvzf phase-1.1.tgz
• enter the newly created phase-1.1 directory
    cd phase-1.1
• compile with the provided Makefile
    make

We assume here that a recent version of the C++ compiler g++ is available on your platform. You cannot compile PHASE with gcc v2.96 and older. You can check the gcc version installed on your system by typing “g++ -v”. You might want to (or might have to) edit the makefiles in order to adapt them to your specific system configuration. In that case please have a look at the readme file first.

PHASE uses the BLAS and LAPACK library routines. Generic versions of these mathematical libraries are built during the compilation process. If your system is equipped with optimised versions of them, you are strongly advised to modify the makefile to use those instead. The g77 compiler and the libg2c library are required, but they should already be present on your system.

Description of programs in the PHASE package

The PHASE (PHylogenetics and Sequence Evolution) package consists of two main programs, mlphase and mcmcphase.
• mlphase performs maximum-likelihood inference
• mcmcphase is a Bayesian phylogenetic inference program

There are five other smaller programs in the package:
• analyser checks the content of the molecular sequences
• likelihood computes the likelihood given a specified evolution model
• simulate generates sequences according to a specified evolution model
• optimise is a smaller version of mlphase without tree search capabilities
• consensus is used with mcmcphase to summarize the results of an MCMC run.

Below we summarize the behaviour of these programs.
Please refer to the first chapter in order to learn how to use them.

optimise and mlphase

The mlphase program is a maximum-likelihood phylogenetic inference program similar to dnaml in PHYLIP (http://evolution.genetics.washington.edu/phylip.html) and baseml in PAML (http://abacus.gene.ucl.ac.uk/software/paml.html). The mlphase program has a broad range of functionalities and can be used with a large number of evolutionary substitution models, including those which take into account the RNA secondary structure in the evolution of RNA sequences (see sections 2.2 and 2.3). The mlphase program has two main modes of operation:

1. Optimisation of user-defined trees: estimation of maximum likelihood (ML) branch lengths and, optionally, evolutionary model parameters, given a set of labelled molecular sequences, for a user-defined set of phylogenetic tree topologies.

2. Maximum-likelihood tree search: the program aims at finding the model (tree topology, associated branch lengths and, optionally, sequence evolution model parameters) that yields the highest likelihood. The user can choose among three topology search algorithms:
• Simple exhaustive search: all the possible phylogenies are considered.
• Branch and bound search: non-optimal phylogenies are rejected before evaluation.
• Heuristic search via stepwise addition: greedy search for the best topology.

Constraints can be placed on the phylogenetic tree topologies that are considered during ML inference in order to reduce the search space and the computation time.

The optimise program is a simpler version of mlphase and is provided for convenience. This program returns the ML branch lengths and ML evolutionary parameters of a fixed user-defined tree topology, for instance a consensus tree found with an MCMC run. This is equivalent to the first mode of mlphase with only one tree.
The optimise program requires fewer parameters than mlphase; it is simpler to use and allows quick experimentation with different initial parameters when entrapment in a local maximum of the likelihood is suspected.

mcmcphase and consensus

The mcmcphase program performs Bayesian phylogenetic inference. It uses a Markov Chain Monte Carlo algorithm to sample from the posterior probability distribution of phylogenetic tree topology, branch lengths and sequence evolution model parameters. For an explanation of Bayesian phylogenetics and a description of the MCMC sampling algorithms used in mcmcphase, please consult the Bayesian phylogenetics section (section 2.5) or Jow et al. (2002).

The consensus program is used to exploit the results of an MCMC run. This program produces two consensus models (using the mean and the median of the parameters in the sample) and can return consensus branch lengths for any supplied topology, e.g., a PHYLIP-style consensus tree, if similar topologies were sampled during the run.

likelihood

The likelihood program computes the likelihood of a phylogeny with respect to any implemented substitution model.

simulate

The simulate program generates molecular sequences according to a user-specified tree (topology and branch lengths) and substitution model (type and parameters).

analyser

At the moment analyser outputs basic statistics about a sequence data file. It can also be used to locate the sites with too many gaps, in case you decide to remove them. The analyser program can be quite useful to validate your secondary structure alignment and to set a maximum limit for the mismatch frequency at each site (see section 2.3.1).

Running the programs

Programs in the PHASE package are run from the command line under both Unix-like systems and MS-Windows systems. Under Windows, you have to open an MS-DOS command window to use them. Click on “Run...” in the “Start...” menu and type cmd in the newly opened dialog box.
You might have to type command instead of cmd depending on your MS-Windows version. Once the command window is open, move to the directory where you extracted the software. At the prompt you can type, for example, cd c:\Phase\. You can then run any program of the PHASE package.

Run the programs by typing their name followed by the arguments they require. In most cases, PHASE programs take one argument, which is the name of a control file (see section 1.2). You can type for instance:

    mcmcphase control\hiv-dna-mcmc-1.control

or

    optimise control/primates-rna-optimise-7a.control

After installation, if the examples are all present, these commands should work. Please note that whether you use the ‘\’ or the ‘/’ character depends on your operating system. On Unix systems you might have to type “./” before the program name.

1 - Using programs in the PHASE package

1.1 Inputs/outputs in PHASE

1.1.1 Data file format

All molecular sequence data used by the PHASE programs are stored in a common format. The data file format is similar to the PHYLIP data file format, with a few minor modifications. A data file is divided into four sections, two of which are not compulsory. Comments can be included by preceding the commented lines with a hash (#) symbol. The entire commented line is ignored by the program. Taking a look at the example data files in the package (*.dna, *.rna, and *.mix in the data directory) will make the following explanations easier to understand.

File content

The first non-comment section of the data file is a single line containing
1. the number of species
2. the length of the molecular sequences
3. a code which can be either DNA for usual unpaired molecular sequences or RNA for base-paired molecular sequences. In fact the purpose of this code is to indicate whether a pairing mask (see below) is present in the data file, and you can use the code RNA even if some nucleotides are unpaired.
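The header line described above can be read with a few lines of code. The following is a minimal Python sketch, not part of PHASE: the names Header and read_header are hypothetical, and skipping blank lines is an assumption (the manual only states that lines starting with # are ignored). The code MIXED, accepted here alongside DNA and RNA, is the third code introduced later in this section.

```python
from typing import NamedTuple

# Codes named in this section of the manual.
VALID_CODES = {"DNA", "RNA", "MIXED"}


class Header(NamedTuple):
    n_species: int
    seq_length: int
    code: str

    @property
    def has_pairing_mask(self) -> bool:
        # RNA and MIXED announce that a pairing mask follows; DNA does not.
        return self.code in {"RNA", "MIXED"}


def read_header(lines):
    """Return the Header found on the first non-comment line."""
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines are skipped just like '#' comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields in header, got {len(fields)}")
        n_species, seq_length, code = fields
        if code not in VALID_CODES:
            raise ValueError(f"unknown data code {code!r}")
        return Header(int(n_species), int(seq_length), code)
    raise ValueError("no header line found")
```

For instance, read_header(["# primates", "5 100 DNA"]) returns a Header with five species, length 100 and no pairing mask.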
For example, the line

    5 100 DNA

at the beginning of a data file indicates that there are five non-base-paired sequences of length 100 in the file. For convenience a third code, MIXED, can be used instead of RNA when the user is using a concatenation of RNA loops and stems (see section 2.3.1), but it should be avoided in other cases. More details on the specific meaning of the code MIXED are given in the class section below. The lines

    10 300 RNA

and

    10 300 MIXED

both indicate that there are ten sequences of length 300 in the file and that a pairing mask is associated with them.

Pairing mask

The second section of the data file is the pairing mask. This mask is only required when sequences contain some base-paired nucleotides (in that case the code should be RNA or MIXED). In the case of fully unpaired sequences, i.e., when the DNA code is used, the pairing mask must not be provided. The pairing mask takes the form of an expression made of round brackets. Corresponding brackets indicate that the bases at those positions in the sequence form a base-pair in the RNA secondary structure. Unpaired sites can be indicated with a dot “.” or a hyphen “-”. For example, a sequence ACCAGAUGGU with a pairing mask (((.(.)))) is made of the base-pairs AU, CG, CG and GU and of the unpaired sites A and A.

Molecular sequences

The third section of the data file contains the molecular sequences. Indels (-) and ambiguities (purine (R), pyrimidine (Y), unknown (N or ?)) are allowed. Sequences can be written in one of two formats. The first is the non-interleaved format. This consists of an identifying label for each sequence followed by the whole sequence. An example is:

    2 16 DNA
    Mouse ACCGUGGU UCCAUAAA
    Rat   ACUGUGGC UCGAUAUA

There can be no spaces in the label, though the sequence itself can be formatted into blocks using multiple lines and spaces.

An alternative way of specifying the sequences is the interleaved format. This enables the sequences to be split into homologous blocks. The non-interleaved example given above could equivalently be written:

    2 16 DNA
    Mouse ACCG
    Rat   ACUG
    UGGUUCCAUAAA
    UGGCUCGAUAUA

Notice that only the first interleaved block should contain labels. Subsequent interleaved blocks are assumed to have the same labels and to be in the same order.

Class section

The fourth section is not compulsory and is used when performing a combined analysis of heterogeneous data sets (e.g., loops and stems of an RNA molecule, protein coding genes with three codon positions, or concatenated data of different genes with different evolutionary patterns). You can safely skip this section if you plan to study DNA sequences or RNA helices only (i.e., no “.” in the pairing mask) with only one appropriate nucleotide/base-pair substitution model.

The aim of this section is to assign each nucleotide/pair to a class. Each class is expected to have a different pattern of evolution. This section consists of a sequence of integers which correspond to the class of each nucleotide. For instance, the class section of a protein coding gene may look like:

    ...2 3 1 2 3 1 2 3 1 2 3 1 2 ...

When the data file contains a class section, programs in the PHASE package expect it to comply with the following set of rules:
• class labels are separated by a space
• classes are labelled from 1 to K, where K is the number of distinct classes
• the number of labels equals the length of the sequences
• when used in conjunction with a base-paired structure, the two components of a paired site are in the same class.

Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure, the most common use of the class section should be the obvious separation of unpaired and base-paired sites into two distinct classes.
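The class-section rules listed above can be checked mechanically before a run. The following Python sketch is an illustration under stated assumptions, not PHASE code: the function names paired_positions and check_classes are hypothetical, positions are counted from 0, and the labels are assumed to have already been read into a list of integers (so the space-separation rule is not checked here). Bracket matching uses a stack, which is the usual way to pair round brackets.

```python
def paired_positions(mask):
    """Map each bracketed position to its partner; unpaired sites are omitted."""
    partner = {}
    stack = []  # positions of '(' still waiting for their ')'
    for i, ch in enumerate(mask):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partner[i] = j
            partner[j] = i
        elif ch not in ".-":  # dot and hyphen both mark unpaired sites
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partner


def check_classes(labels, seq_length, mask=None):
    """Raise ValueError if the labels break one of the class-section rules."""
    if len(labels) != seq_length:
        raise ValueError("the number of labels must equal the sequence length")
    k = max(labels)
    if set(labels) != set(range(1, k + 1)):
        raise ValueError("classes must be labelled from 1 to K")
    if mask is not None:
        if len(mask) != seq_length:
            raise ValueError("the pairing mask must cover every site")
        for i, j in paired_positions(mask).items():
            if labels[i] != labels[j]:
                raise ValueError(f"paired sites {i} and {j} are in different classes")


# The pairing-mask example from this section.
seq, mask = "ACCAGAUGGU", "(((.(.))))"
partner = paired_positions(mask)
print([seq[i] + seq[j] for i, j in sorted(partner.items()) if i < j])
# prints ['AU', 'CG', 'CG', 'GU']
check_classes([2, 2, 2, 1, 2, 1, 2, 2, 2, 2], len(seq), mask)  # passes silently
```

Running the same check with a label list that puts the two halves of a pair in different classes, or that skips a class number, raises a ValueError naming the broken rule.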
The code MIXED can replace the code RNA to avoid a tiresome task: it lets PHASE know that it can simply use the provided pairing mask to build the class section (e.g., (((.())))..) implies 2 2 2 1 2 2 2 2 2 1 1 2). When the code MIXED is used the class section is not compulsory, and the unpaired and paired sites are automatically attributed to the classes 1 and 2 respectively. When the class section is present, the code MIXED is equivalent to RNA: the user assignment overrides the automatic one.

Usually classes are used to determine the model of sequence evolution that PHASE uses with each nucleotide. Each class in the data file is treated by its own model of nucleotide substitution during the phylogenetic inference. The models are defined later in the model section of the control file (see 1.2.3). Let us just point out here that if you use the MIXED type for your data with the automatic assignment, i.e., without the class section, you have to make sure that your first and second models are respectively a nucleotide substitution model and a base-pair substitution model when you declare your models of evolution. We will return to this point later on.

1.1.2 Control file format

Most programs in the package use a control file. The purpose of this file is to assign a specific task to the program, i.e., the sequences to analyse, the assumed substitution model, and other specific parameters. Control files are the key to using the software and two sections are devoted to them. Section 1.2 describes the structure of this file and the features common to many programs in the package. Section 1.3 presents the specific parameters for each program.

1.1.3 Tree file format

PHASE can output trees into a file, and sometimes the user has to provide a file which contains one or more trees. A tree file is simply a file with one or more phylogenies written in the computer readable format described in the tree representation section (2.1.2).
1.1.4 Substitution model parameters file format

With a model parameters file, one can provide initial values for the parameters of the substitution models used. PHASE can also create a model parameters file to store the results concerning a substitution model after a run (these could be Maximum Likelihood Estimate (MLE) parameters or Mean Posterior Estimate (MPE) parameters).

Model parameters file content

The content of this file is highly dependent on the substitution model used and we cannot describe it in general terms. The fields used to assign a value to each parameter should be self-explanatory as long as you know the underlying substitution model. You might need to have a look at the transition matrices section (2.2.2) to understand the PHASE concept of “rate ratios” in substitution models. Each “Rate ratio i” parameter in this file stands for the parameter αi in the transition matrix of the corresponding model. Transition matrices for all implemented substitution models are given in section 2.2.3 for DNA models and in section 2.3.3 for RNA models.

Producing a model parameters file

Model parameters files and control files share the same structural elements. Some examples can be found in the data directory (*.model). Although the content of a model parameters file is easy to understand when reading it, you might find it harder to produce your own file from scratch if you want to initialise a substitution model with specific values. It is possible to use the simulate program to generate a stub of this file for each model implemented in the PHASE package. This skeleton can be modified easily to suit your needs. See section 1.3.3 for details.

1.1.5 Parameters displayed on the screen and output of each program

Each program in the package will output information on the screen, and one or more files to store the results permanently.
The outputs will be reviewed individually for each program in section 1.3. The content displayed on the screen is usually quite easy to understand, but you might be a bit confused by the parameters of substitution models. PHASE outputs two kinds of matrices on the screen:
• one “rate ratios” matrix R
• one transition matrix Q

These matrices are described in the transition matrices section (2.2.2). Other parameters have a straightforward meaning.

1.1.6 Clade file format

The user is allowed to specify some invariant clades to reduce the number of possible topologies when using mlphase. A clade file contains a list of monophyletic clades in Newick format (see section 2.1.2). All studied species must appear once (and only once) in the file, either alone or in a clade. Here is a simple clade file example for 6 species:

    (Specie5,Specie6);
    (Specie4,(Specie3,Specie2));
    Specie1;

1.2 Control files

Most programs in the PHASE package have their options set using a simple text file. We call this file the control file. Although the content of this file may differ for each program in the package, its structure remains the same. Some control files are provided as examples with the package (*.control in the control directory). The easiest and safest way to use PHASE is to copy one of these examples and to adapt it to your needs.

1.2.1 Structure of the control files

A control file contains logical blocks (e.g., DATAFILE block, MODEL block, ...) and control lines. Lines preceded by a hash (#) symbol are considered comments and ignored. Comments can be placed anywhere. A control line is used to define a parameter and give it a value. It has the format:

    label = value

The order in which control lines are provided in the control file is not important, but they must appear in the right block. Note that PHASE is case sensitive: “Tree file” and “Tree File” are two different labels. At the moment no warning is issued if the user mistypes an optional parameter.
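Because a mistyped optional label is silently ignored, it can help to check a control file against a list of labels you know to be valid before launching a long run. The sketch below is a hypothetical helper written in Python, not a PHASE feature: the names control_lines and unknown_labels are invented for illustration, and the set of known labels has to be supplied by you (for instance, collected from the provided example control files). Lines starting with “{” are treated as block tags and skipped.

```python
def control_lines(text):
    """Yield (line_number, label, value) for each 'label = value' line."""
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines, '#' comments and block tags such as {MODEL}.
        if not line or line.startswith("#") or line.startswith("{"):
            continue
        label, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {number}: expected 'label = value'")
        yield number, label.strip(), value.strip()


def unknown_labels(text, known):
    """Return (line_number, label) pairs whose label is not in `known`.

    The comparison is case sensitive, like PHASE itself.
    """
    return [(n, label) for n, label, _ in control_lines(text) if label not in known]


example = """# likelihood run
Tree File = data/mammals-consensus.tree
Model parameters file = data/mammals-consensus.model
"""
print(unknown_labels(example, {"Tree file", "Model parameters file"}))
# prints [(2, 'Tree File')]
```

Here the label “Tree File” is reported because only “Tree file” is in the known set, which is exactly the kind of slip that would otherwise go unnoticed.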
Please check your control files against the provided examples, otherwise PHASE might miss some important parameters without you noticing it.

A block is a container. It contains control lines but can also contain other blocks. The block BLOCKNAME begins with the tag:

    {BLOCKNAME}

and ends with the tag:

    {\BLOCKNAME}

Each tag must be alone on its line. By convention the names of blocks are all uppercase.

In the remainder of this document, parameters of the control files are colored depending on their status. Compulsory parameters are in red and you must provide a value for them. Optional parameters are in green and they do not need to appear in the control file. Often, a default value will be assumed for optional parameters. Some fields are dependent on the presence and/or values of other parameters and their presence (or absence) is compulsory under certain conditions. These conditional parameters are in orange.

1.2.2 Datafile block

Almost all programs in the PHASE package require a DATAFILE block to parse the analysed sequences. As stated previously, the DATAFILE block begins with the tag {DATAFILE} alone on a line and ends with the tag {\DATAFILE} alone on a line. The DATAFILE block contains some necessary information which is not included in the data file itself (see section 1.1.1 for the format of this file); it contains the following control lines:

• Data file: the location of the molecular sequences file to be used.
    Data file = data/sequences.dna

• Interleaved data file: a yes/no option that specifies whether the molecular data is interleaved.
    Interleaved data file = yes

• Outgroup: the label of the outgroup sequence (see section 2.1.1). The inference techniques used in PHASE produce unrooted phylogenies and using an outgroup in your study is not required. However PHASE requires this parameter to produce a unique Newick representation (2.1.2) for unrooted trees.
    Outgroup = Mole

• Heterogeneous data models: a yes/no parameter which specifies whether the data file contains a class section. The default value is no, and the class section of your data file will be ignored if you forget this field.
    Heterogeneous data models = yes

1.2.3 Model block

Most programs in the PHASE package require the specification of a substitution model for sequence evolution. This is the purpose of the MODEL block. The MODEL block is delimited by the {MODEL} and {\MODEL} tags. It contains the name of the substitution model followed by parameters (and sometimes blocks) specific to the model (see section 2.2 for background information on substitution models of nucleotide evolution).

Simple substitution model

Depending on the data to be analysed, the PHASE package can be used with a wide variety of DNA substitution models or RNA-specific base-paired models (see sections 2.2.3 and 2.3.3 for a review of these models). The content of the MODEL block is the same for all these models and the parameters are:

• Model: the model’s name; by convention it should be all upper case.
    Model = REV
Nucleotide substitution models implemented include JC69, K80, HKY85, TN93 and REV. Base-paired substitution models implemented include RNA6A, RNA6B, RNA7A, RNA7D and RNA16A.

• Discrete gamma distribution of rates: the discrete gamma model (see section 2.4.1) can be used to account for among-site rate variation. Use yes/no values to turn this option on/off. When a discrete gamma model is used, PHASE expects the number of gamma categories to be specified. By default the discrete gamma model is not used.
    Discrete gamma distribution of rates = yes

• Number of gamma categories: when the discrete gamma model is used, you have to provide an integer to specify the desired number of discrete gamma categories.
    Number of gamma categories = 5
• Invariant sites: alternatively, or in conjunction with the discrete gamma model, the user can allow a proportion of sites to be invariant, i.e., with zero rate of evolution. The default value is no.
    Invariant sites = yes

Mixed model for combined analyses of heterogeneous data

To study heterogeneous sequences several models are required. The mixed model (see section 2.4.2) allows these models to work concurrently.

• Model: this field contains the name of the model, which is MIXED.
    Model = MIXED
• Number of models: the number of models used concurrently. If a class section was provided with the data file then the number of models should be the same as the number of classes. If you used the flag MIXED in your data file and did not provide a class section then this parameter has to be set to 2 and the two models must be a DNA substitution model and a base-paired substitution model respectively.
    Number of models = 3
• {MODELi} block: each model used in the mixed model must be defined in its own block. If the number of models is n then the MODEL block must contain n blocks named MODEL1, MODEL2, ..., MODELn. The content of these blocks is the same as for a simple substitution model block.
    {MODEL}
    Model = MIXED
    Number of models = 3
    {MODEL1}
    Model = REV
    Invariant sites = yes
    {\MODEL1}
    {MODEL2}
    Model = RNA7A
    Discrete gamma distribution of rates = yes
    Number of gamma categories = 5
    {\MODEL2}
    {MODEL3}
    Model = RNA7D
    Invariant sites = no
    Discrete gamma distribution of rates = no
    {\MODEL3}
    {\MODEL}

1.3 Using the programs in the PHASE package

Each program in the PHASE package requires a specific control file, the content of which is described here. As in the previous section, compulsory parameters appear in red, optional parameters in green, and conditional parameters that depend on the others in orange.
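Because PHASE may miss a parameter without warning, it can help to check a control file before launching a program. The Python sketch below is not part of PHASE. It assumes one tag or one "name = value" control line per line and, as an assumption suggested by the examples in this manual, that "#" starts a comment. The names parse_control, check_mixed_model and EXAMPLE are hypothetical.

```python
import re

# {NAME} opens a block, {\NAME} closes it; each tag sits alone on its line.
TAG = re.compile(r"^\{(\\?)([A-Za-z0-9]+)\}$")

def parse_control(text):
    """Parse PHASE-style control text into nested dictionaries."""
    root = {}
    stack = [("", root)]                      # (block name, block contents)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # assumption: '#' starts a comment
        if not line:
            continue
        tag = TAG.match(line)
        if tag and not tag.group(1):          # opening tag
            block = {}
            stack[-1][1][tag.group(2)] = block
            stack.append((tag.group(2), block))
        elif tag:                             # closing tag
            name, _ = stack.pop()
            if name != tag.group(2):
                raise ValueError(f"unexpected closing tag {line!r}")
        else:                                 # control line: name = value
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError(f"not a control line: {line!r}")
            stack[-1][1][key.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError(f"block {stack[-1][0]!r} is never closed")
    return root

def check_mixed_model(model):
    """Check that a MIXED model announces exactly the MODELi blocks it contains."""
    if model.get("Model") != "MIXED":
        return
    n = int(model["Number of models"])
    expected = {f"MODEL{i}" for i in range(1, n + 1)}
    found = {key for key, value in model.items() if isinstance(value, dict)}
    if found != expected:
        raise ValueError(f"expected blocks {sorted(expected)}, found {sorted(found)}")

EXAMPLE = r"""
{MODEL}
Model = MIXED
Number of models = 3   # one MODELi block per model
{MODEL1}
Model = REV
Invariant sites = yes
{\MODEL1}
{MODEL2}
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 5
{\MODEL2}
{MODEL3}
Model = RNA7D
{\MODEL3}
{\MODEL}
"""

config = parse_control(EXAMPLE)
check_mixed_model(config["MODEL"])            # raises ValueError on a mismatch
print(config["MODEL"]["MODEL2"]["Model"])     # RNA7A
```

The check only covers the rule stated in section 1.2.3 (n models require blocks MODEL1 to MODELn); it does not know which parameters each PHASE program accepts.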
1.3.1 likelihood

Using likelihood

The likelihood program is used to compute the likelihood of a model of evolution (i.e., tree + parameterised substitution model) given a set of studied sequences. To use likelihood, one has to provide a phylogeny for the taxa under investigation (i.e., topology and branch lengths) and a substitution model for nucleotide evolution with user-defined parameters. To use likelihood, type at the command-line:
    likelihood likelihood-control-file
where likelihood-control-file is a valid control file for the likelihood program.

For verification purposes, likelihood prints the phylogenetic tree it used on the screen before the likelihood value. Unlike most other PHASE programs, likelihood does not send any results to a file.

Control file for likelihood

An example of a valid control file for likelihood can be found in appendix A.1. In its control file, the likelihood program requires the specification of:

• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (section 2.1.2), with branch length values.
    Tree file = data/mammals-consensus.tree
• Model parameters file: the name of the file containing parameter values for the model defined in the MODEL block above. The simulate program can help you to produce this file.
    Model parameters file = data/mammals-consensus.model

1.3.2 optimise

Using optimise

The program optimise is used to compute maximum-likelihood estimates (MLE) for the branch lengths and substitution model parameters of a given model of evolution (i.e., a fixed tree topology and a specified substitution model with free parameters). One can specify initial values for branch lengths and substitution model parameters to speed up convergence or to detect trapping in local maxima of the likelihood function.
To use optimise, type at the command-line:
    optimise optimise-control-file
where optimise-control-file is a valid control file for the optimise program.

When launched, optimise displays the initial tree and the initial likelihood on the screen and begins the optimisation. Once it is finished, the ML substitution model parameters are printed on the screen and saved in the “.output” file with the ML tree and the value of the maximum likelihood. The ML tree is also saved in the “.tree” file and a “.model” file (see input section 1.1.4) is created to store the MLE for the substitution model parameters.

Control file for optimise

An example of a valid control file for optimise can be found in appendix A.2. The control file of the optimise program must/may provide:

• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (see section 2.1.2) with optional initial branch length values.
    Tree file = mammals69-mix-consensus.tree
• Random seed: the integer value provided with this field is used to initialise the random number generator (used to draw random initial branch lengths if they are not provided).
    Random seed = 1
• Starting model parameters file: the name of a file containing initial values for the parameters of the substitution model used. If this field is not provided, the analysed sequences are used to initialise the model.
    Starting model parameters file = data/hiv.model
• Output file: the basename for the three files basename.tree, basename.model and basename.output. They contain the results generated by optimise.
    Output file = mammals69-mix-optimise

1.3.3 simulate

Using simulate

The simulate program is used:

1. to generate examples of “.model” files for all the substitution models implemented in PHASE.
A “.model” file (see section 1.1.4) is used to provide initial or fixed values for the model parameters to some programs in the package.
2. to generate molecular sequences which evolved from a random initial one according to a specified model of evolution, i.e., phylogeny and substitution model.

To use simulate, type at the command-line:
    simulate simulate-control-file
where simulate-control-file is a valid control file for the simulate program.

In its first mode of operation, simulate creates a single “.model” file and you can modify this file with your own initial values. In its second mode of operation, simulate displays on screen the tree used to generate the actual sequences. This tree was either provided by the user or randomly created by the program. In the second case the tree is saved in a file specified by the user. Finally, the likelihood of the generated molecular sequences given the model is printed on the screen and simulate saves the sequences in a file specified by the user. The format of this file is described in the data file format section (1.1.1). If the MIXED model described in section 2.4.2 is used, heterogeneous sequences are generated in sequential order.

Control file for simulate

In appendices A.3 and A.4, example control files are provided for the first and the second mode of operation respectively. The control file of the simulate program must provide:

• a MODEL block: see the model block section (1.2.3).
• Retrieve the name of the model’s parameters: a boolean field to specify the user’s aim. Use yes for the first mode of usage mentioned above and no for the second mode.
    Retrieve the name of the model’s parameters = no
• Model parameters file: if simulate is used to generate an example of a substitution model parameters file, the parameters are saved in a file having the name provided. When simulate is used to generate sequences, the user must provide parameters for the substitution model and they are read from the given file.
    Model parameters file = simulate.model

The following fields may be required when simulate is used to generate sequences.

• Random seed: the integer value provided with this field is used to initialise the random number generator.
    Random seed = 1
• Random tree and Tree file: simulate can either generate a random tree or use a supplied phylogeny. If Random tree is equal to yes then simulate generates a random tree and saves it in the specified file. If Random tree is equal to no then simulate parses the user tree from the specified file.
    Random tree = no
    Tree file = 8-species.tree
• Number of species and Maximum branch length: when the Random tree field is set to yes, the user must provide the number of species and the maximum value for branch lengths in the generated tree.
    Number of species = 10
    Maximum branch length = .4
• Number of symbols from class i: you have to specify the number of symbols (e.g., number of nucleotides or number of paired sites) you want to generate for each class in your final sequence.
    Number of symbols from class 1 = 100
    Number of symbols from class 2 = 100
    Number of symbols from class 3 = 100
    Number of symbols from class 4 = 500
    Number of symbols from class 5 = 300
• Structure for the elements of class i: simulate can add a structure to the generated data file, in which case you have to specify the appropriate structure for the elements of each class.
    Structure for the elements of class 1 = .
    Structure for the elements of class 2 = .
    Structure for the elements of class 3 = .
    Structure for the elements of class 4 = .
    Structure for the elements of class 5 = ()
• Data file type and Total length of the raw sequences: simulate produces an input file following the format defined in the data file format section (1.1.1). To produce this file, you have to specify the type and the length written in its first line yourself (see section 1.1.1).
  With the 5 classes described above:
    Data file type = RNA
    Total length of the raw sequences = 1400 #(100+100+100+500+300*2)
• Output file: the name of the file where generated sequences are saved.
    Output file = simulated-data/codons and rna.sequences

1.3.4 mlphase

Using mlphase

The mlphase program can be used:

1. to find the Maximum Likelihood Estimates for branch lengths and, optionally, evolutionary model parameters for a user-defined set of topologies.
2. to find the phylogeny and, optionally, evolutionary model parameters that yield the maximum likelihood. Three algorithms are provided for topology search:
   • Simple exhaustive search
   • Branch-and-bound exhaustive search
   • Heuristic stepwise addition

In the first mode of operation, mlphase operates like optimise but several trees can be considered at once. In the second mode of operation, when mlphase performs a branch-and-bound search or an exhaustive search, the ten phylogenies (and associated substitution model parameters) with the highest likelihood are returned. These two search algorithms return the best tree unless the optimiser becomes trapped in a local maximum of the likelihood. The heuristic stepwise addition returns only one tree. It is less likely to find the optimal tree but it remains computationally feasible with a larger number of taxa. Be warned that the optimiser might occasionally crash; changing the initial values usually overcomes this (hopefully rare) problem.

To reduce the search space and the computation time, constraints can be placed on the phylogenetic tree topologies considered during ML inference. With a clade file (see section 1.1.6) one can specify invariant monophyletic clade topologies which should be preserved during phylogenetic inference. The program will look for an optimal topology consistent with these clade arrangements.

To use mlphase, type at the command-line:
    mlphase mlphase-control-file
where mlphase-control-file is a valid control file for mlphase.
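To see why the exhaustive searches described above are only practical for a small number of taxa, one can count the candidate topologies. The sketch below is not PHASE code; it uses the standard combinatorial result that n taxa admit (2n-5)!! = 3 × 5 × ... × (2n-5) distinct unrooted, strictly bifurcating topologies. The function name unrooted_topologies is hypothetical.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted, strictly bifurcating topologies: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, unrooted_topologies(n))
```

With 10 taxa there are already 2027025 topologies, each of which needs its own branch length optimisation; this is the motivation for the branch-and-bound and clade constraints, and for the heuristic search with larger data sets.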
The mlphase program saves the results of an inference in a single file. Results are also displayed on screen during the run.

Control file for mlphase

Please see the examples in appendices A.5 and A.6. These control files show the two main modes of operation. The control file of the mlphase program contains:

• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• a FUNCTION block dependent on the operating mode of mlphase (see below).
• Random seed: the seed for the random number generator.
    Random seed = 13
• Output file: the name of the file where the results are sent.
    Output file = results/hiv-mlphase.output

The FUNCTION block contains specific parameters according to the mode of operation. At the moment, mlphase can “Optimise user-defined phylogenetic trees” or “Search for ML topology”. When the user wants to optimise a set of defined trees, the FUNCTION block contains the following fields:

• Function: the parameter to specify the mode of operation.
    Function = Optimise user-defined phylogenetic trees
• Trees file: the name of the file containing the phylogenies, i.e., a set of trees in the Newick format (section 2.1.2) with optional initial branch length values.
    Trees file = primates.phylogenies
• Number of trees: the user has to specify the number of trees in the previous file.
    Number of trees = 4
• Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.
    Optimise model parameters = no
• User’s model parameters file: if the parameters are constant one must provide values for them. This field is for the name of the file containing the parameters for the model defined in the MODEL block. If provided when not required, the content of this file is used to initialise the parameters of the model before optimisation.
    User’s model parameters file = data/hiv-REV.model

When looking for the ML tree, the FUNCTION block contains:

• Function: the parameter to specify the mode of operation.
    Function = Search for ML topology
• Topology search: this field specifies the search algorithm used to determine the phylogenies with the highest likelihood. At the moment the search algorithms implemented are Simple exhaustive search, Branch-and-bound exhaustive search and Heuristic stepwise addition.
    Topology search = Heuristic stepwise addition
• User defined monophyletic clades and Clade file: set the first field to yes if you want to constrain the search in the topology space. The second field is the name of your clade file (see section 1.1.6).
    User defined monophyletic clades = yes
    Clade file = primates.clades
• Optimise model parameters: set this field to no if the model parameters are to be considered fixed; set it to yes if you want to optimise them.
    Optimise model parameters = yes
• User’s model parameters file: if the parameters are constant one must provide values for them. This field is for the name of the file containing the parameters for the model defined in the MODEL block.
    User’s model parameters file = data/primates-RNA7A.model

1.3.5 mcmcphase

Using mcmcphase

The mcmcphase program performs Bayesian estimation of phylogenies (see section 2.5) and uses Markov chain Monte Carlo to produce large samples from the posterior probability density. To use mcmcphase, simply type at the command-line:
    mcmcphase mcmcphase-control-file
where mcmcphase-control-file is a valid control file for the mcmcphase program.

The mcmcphase program saves the results of an inference in many files. Be warned that it might require a large amount of disk space for large studies (around 90 Mb for 70 species and 50000 samples).
• .besttree and .bestmodel files: the phylogeny and the parameters of the substitution model when the best state (i.e., the state with the highest likelihood) was visited. It is not necessarily one of the sampled configurations, and this state might have been visited during the burnin period. The best configuration is not very important in an MCMC analysis, but a “strange” best state quickly indicates that something went wrong. The tree and the model can also be used as starting points in maximum likelihood inference.
• .mp file: the file with the sampled parameters of the substitution model(s). Each sample occupies one line. The parameters are, in order,
  – the proportion of invariant sites if an invariant category is used (+I models)
  – the gamma shape parameter (α) if the discrete gamma model is used (+dGX models)
  – the frequencies of the states as they appear in the substitution matrix
  – the rate ratios
  When a MIXED model is used, substitution model parameters are printed sequentially. Except for the first model, each set of parameters is preceded by the average substitution rate of the model. The average substitution rate for the first model is always 1.0 and therefore this value is not reported.
• .samples file: the sampled topologies. This file can be used with another phylogenetic package to produce a consensus tree. To avoid wasting disk space, mcmcphase writes the sampled topologies using an index for each species, assigned according to its order of appearance in the data file.
• .bl file: the branch lengths for the previous topologies (for use with other PHASE programs).
• .output file: a file with similar content to the screen output.
• .plot file: the evolution of the likelihood during the run. Sampling of these values starts at the beginning of the run, i.e., likelihood values are stored during the burnin too.

Using consensus

The consensus program is used to exploit the large sample of states produced by mcmcphase.
The program still lacks the ability to produce a consensus tree by itself and requires that tree from the user. Many phylogenetic programs can build a consensus tree from the sample of topologies produced by mcmcphase in the “.samples” file. You can use the consense program of PHYLIP², for instance.

² http://evolution.genetics.washington.edu/phylip.html

To use consensus, simply type at the command-line:
    consensus mcmcphase-control-file consensus-topology-file
where mcmcphase-control-file is the control file that was used by the mcmcphase program to produce the results and consensus-topology-file is the file which contains the consensus topology. Since mcmcphase outputs the topologies using numbers instead of the names of the species, consensus expects the consensus topology to be given with numbers too.

The consensus program retrieves the model used and the location of the sample files from the control file. Two consensus substitution models are produced using respectively the mean and median values of the sample. The consensus topology is used to produce a consensus tree with branch lengths. The branch lengths of the states whose topology is identical to the consensus topology are used. For each branch, the consensus length is simply the mean value of all the lengths. The consensus program cannot return a consensus tree if the consensus topology has never been visited. In such a case, we suggest you use optimise to produce ML branch lengths.

Control file for mcmcphase

Please see the examples provided in appendices A.7 and A.8. The control file of mcmcphase can/must contain:

• a DATAFILE block: see the data file block section (1.2.2).
• a MODEL block: see the model block section (1.2.3).
• a PERTURBATION block: control block for the mixing properties of mcmcphase (see below).
• Random seed: the seed for the random number generator.
    Random seed = 1
• Burnin iterations: the number of “burnin” cycles (i.e., cycles before the beginning of the sampling).
  During the “burnin”, only likelihood values are stored.
    Burnin iterations = 150000
• Sampling iterations: the number of cycles for sampling.
    Sampling iterations = 600000
• Sampling period: the number of cycles between extraction of two consecutive samples.
    Sampling period = 20
• Random start model parameters and User’s starting model parameters file: to reduce the necessary “burnin” time, the chain can be initialised with some user-specified model parameters. Otherwise the sequences are used to initialise the substitution model.
    Random start model parameters = no
    User’s starting model parameters file = data/primates-RNA7A.model
• Random start tree and User’s starting tree file: similarly, one can choose to initialise the chain randomly or with a user-defined topology. We do not encourage the use of an initial user-defined topology but this option can be useful to quickly gain an idea of what results can be expected.
    Random start tree = yes
    User’s starting tree file = this field is ignored in this case
• Output file: the basename for all the output files (basename.besttree, basename.bestmodel, basename.mp, basename.samples, basename.bl, basename.output and basename.plot).
    Output file = results/hiv-dna
• Output format: the format used for the topologies in the .samples file. It can be phylip (with a semi-colon at the end) or bambe (without semi-colon).
    Output format = phylip

PERTURBATION block

The PERTURBATION block contains the mixing parameters used for the proposals. The following mixing parameters are relative to the branches:

• Initial branch step proposal parameter: the initial standard deviation of the normal distribution used to modify the branch lengths. This proposal parameter is modified during the “burnin”.
    Initial branch step proposal parameter = 0.1
• Branch length upper bound: the upper bound used for the uniform prior distribution of branch lengths.
    Branch length upper bound = 3.5

The PERTURBATION block also contains mixing parameters for the proposals of substitution model parameters. These parameters depend on the substitution model used. With simple substitution models, i.e., any nucleotide or base-paired model used alone, the following parameters are added in the PERTURBATION block to perturb the frequencies:

• Frequencies, proposal priority: an integer value to specify how often we try to perturb the frequencies with respect to other parameters. This parameter is usually compulsory except for models with fixed frequencies (i.e., JC69 and K80). Use the value 0 to prevent the perturbation (i.e., if you want the frequencies to remain equal to the empirical frequencies or to the values provided in an initial substitution model).
    Frequencies, proposal priority = 1
• Frequencies, proposal minimum acceptance rate and Frequencies, proposal maximum acceptance rate: during the “burnin”, mcmcphase will try to adapt the proposal step so as to reach an acceptance rate within the specified range. By default this range is [0.21, 0.25]. If you want to change the default values, provide the two parameters. Use the range [0.0, 1.0] to turn off the dynamic adaptation of the proposal step.
    Frequencies, proposal minimum acceptance rate = 0.0
    Frequencies, proposal maximum acceptance rate = 1.0
• Frequencies, initial Dirichlet tuning parameter: the initial proposal parameter (the higher the Dirichlet parameter, the smaller the step). This parameter is compulsory if you turned off the dynamic adaptation of the step. Allowed values are in the range [100.0, 100000.0] and the default value is 1000.0.
    Frequencies, initial Dirichlet tuning parameter = 1500.0

Similar parameters are used for the rate ratios:

• Rate ratios, proposal priority: an integer value to specify how often we try to perturb each rate ratio with respect to other parameters. Rate ratios are treated individually but they share the same priority.
  This parameter is usually compulsory except for models with fixed rates (i.e., JC69). You can use the value 0 to have constant rate ratios but you should not do that unless you provide initial parameters with a “.model” file.
    Rate ratios, proposal priority = 2
• Rate ratios, proposal minimum acceptance rate and Rate ratios, proposal maximum acceptance rate: rate ratios are treated individually but they share the same range for the acceptance rate. The default range is [0.21, 0.25] and you can turn off the dynamic modification of the step with the range [0.0, 1.0].
    Rate ratios, proposal minimum acceptance rate = 0.3
    Rate ratios, proposal maximum acceptance rate = 0.6
• Rate Ratios, initial step: allowed initial values are in the range (0, 5.0]. The default value (when the dynamic step option is on) is 0.2. You cannot provide a different initial step for each rate ratio at the moment.
    Rate Ratios, initial step = 0.25

Similarly, you can/must use the following parameters when using a +I (invariant category) and/or a +dG (discrete gamma categories) model:

• Gamma parameter, proposal priority: compulsory if a gamma model is used.
• Gamma parameter, proposal minimum acceptance rate: default value = 0.21
• Gamma parameter, proposal maximum acceptance rate: default value = 0.25
• Gamma parameter, initial step: default value = 0.2

and,

• Invariant parameter, proposal priority: compulsory if an invariant model is used.
• Invariant parameter, proposal minimum acceptance rate: default value = 0.21
• Invariant parameter, proposal maximum acceptance rate: default value = 0.25
• Invariant parameter, initial step: default value = 0.05

The upper bound for the uniform prior distribution of each rate ratio is 1000.0. The upper bound for the uniform prior distribution of the gamma parameter is 200000.0. Be aware that reaching that upper bound during an MCMC inference suggests a problem, e.g., there may be insufficient data to properly estimate the parameter.
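The manual states that mcmcphase adapts each proposal step during the “burnin” to reach an acceptance rate inside the configured range, but it does not give the update rule. The Python sketch below shows one common way such tuning can work; it is an illustration under assumptions, not the rule implemented in PHASE. The function name adapt_step, the multiplicative factor of 1.1 and the use of 5.0 as a cap (borrowed from the allowed range of the rate ratio step) are all hypothetical choices.

```python
def adapt_step(step, accepted, proposed,
               min_rate=0.21, max_rate=0.25, factor=1.1, max_step=5.0):
    """Return an updated proposal step from the acceptance rate observed so far.

    Hypothetical rule: shrink the step when too few proposals are accepted,
    enlarge it when too many are, and leave it alone inside [min_rate, max_rate].
    """
    if proposed == 0:
        return step
    rate = accepted / proposed
    if rate < min_rate:            # too many rejections: propose smaller moves
        return step / factor
    if rate > max_rate:            # too many acceptances: propose bolder moves
        return min(step * factor, max_step)
    return step

step = 0.2
step = adapt_step(step, accepted=10, proposed=100)   # rate 0.10, step shrinks
step = adapt_step(step, accepted=50, proposed=100)   # rate 0.50, step grows
```

Under this rule the range [0.0, 1.0] never triggers a change, which matches the documented way of switching the dynamic adaptation off. For the Dirichlet tuning parameter of the frequencies the direction would be reversed, since a higher value means a smaller step.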
PERTURBATION block for MIXED models

When a mixed model is used, the previous proposal parameters are required for each model and must be enclosed in separate blocks. Inside the PERTURBATION block, you must complete a PERTURBATIONi block for each model with the previously described parameters (see appendix A.8). Parameters which are specific to the MIXED model appear directly in the PERTURBATION block.

• Model i priority: specify a priority for each model with respect to the other models and the priority used for the average rates.
    Model 1 priority = 8
    Model 2 priority = 24
• Average rates, proposal priority: specify a priority for the perturbation of the model average substitution rate.
    Average rates, proposal priority = 1
• Average rates, proposal minimum acceptance rate and Average rates, proposal maximum acceptance rate: the acceptance range mcmcphase aims for, [0.21, 0.25] by default.
    Average rates, proposal minimum acceptance rate = 0.15
    Average rates, proposal maximum acceptance rate = 0.25
• Average rates, initial step: allowed initial values are in the range (0, 1.0]. The default value (when the dynamic step option is on) is 0.2. You cannot provide a specific initial step for each average substitution rate at the moment.
    Average rates, initial step = 0.14

The upper bound for the uniform prior distribution of the average substitution rate ratios is 100.0.

Proposals priority

At each cycle, mcmcphase perturbs one branch length, and a topology change is tried every ten cycles (see the Bayesian phylogenetics algorithms in section 2.5). At each cycle, PHASE will also try to modify the substitution model. Let us consider the following example with the MIXED model and three substitution models,
    {PERTURBATION}
    #PERTURBATION OF THE TREE :
    ...
    #PERTURBATION OF THE MODEL :
    Model 1 priority = 8
    Model 2 priority = 24
    Model 3 priority = 7
    Average rates, proposal priority = 1
    Average rates, initial step = .3
    Average rates, proposal minimum acceptance rate = .15
    Average rates, proposal maximum acceptance rate = .20
    {PERTURBATION1}
    Frequencies, proposal priority = 1
    Rate ratios, proposal priority = 2
    Gamma parameter, proposal priority = 1
    {\PERTURBATION1}
    {PERTURBATION2}
    Frequencies, proposal priority = 3
    Rate ratios, proposal priority = 4
    Gamma parameter, proposal priority = 1
    Invariant parameter, proposal priority = 1
    {\PERTURBATION2}
    {PERTURBATION3}
    Frequencies, proposal priority = 4
    Rate ratios, proposal priority = 2
    Gamma parameter, proposal priority = 1
    {\PERTURBATION3}
    {\PERTURBATION}

The mcmcphase program will modify the average substitution rate of the second model with the probability given in 1.1a; the probability of modifying the average substitution rate of the third model is the same. The average substitution rate of the first model is the reference: it is never modified and remains equal to 1.0. With the probability given in 1.1b, mcmcphase will modify the parameters of substitution model i (i.e., any parameter but the average substitution rate). The priorities inside the corresponding PERTURBATIONi block are then used; the figures inside each PERTURBATIONi block do not have any effect outside the block.
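As a numerical illustration, the selection probabilities that equations 1.1a to 1.1c and 1.2a to 1.2d define can be evaluated for the example priorities. The Python sketch below only performs that arithmetic and is not PHASE code; the function names are hypothetical, and the number of rate ratios used for the second model (5) is an arbitrary placeholder, since it depends on the substitution model chosen.

```python
from fractions import Fraction

def model_level_probabilities(model_priorities, average_rates_priority):
    """Probabilities of equations 1.1a and 1.1b for a MIXED model."""
    n = len(model_priorities)
    # 1.1c: the first model has no free average rate, hence (n - 1)
    total = (n - 1) * average_rates_priority + sum(model_priorities)
    p_average_rate = Fraction(average_rates_priority, total)    # 1.1a, per model 2..n
    p_models = [Fraction(p, total) for p in model_priorities]   # 1.1b
    return p_average_rate, p_models

def within_model_probabilities(frequencies, rate_ratios, n_rate_ratios,
                               gamma=0, invariant=0):
    """Probabilities of equations 1.2a to 1.2d inside one PERTURBATIONi block."""
    total = frequencies + gamma + invariant + n_rate_ratios * rate_ratios
    return {
        "frequencies": Fraction(frequencies, total),       # 1.2a
        "each rate ratio": Fraction(rate_ratios, total),   # 1.2b
        "invariant": Fraction(invariant, total),           # 1.2c
        "gamma": Fraction(gamma, total),                   # 1.2d
    }

# Priorities of the example PERTURBATION block
p_rate, p_models = model_level_probabilities([8, 24, 7], 1)
print(p_rate)          # 1/41
print(p_models[1])     # 24/41

# PERTURBATION2 priorities; 5 rate ratios is a placeholder, not a PHASE value
inside = within_model_probabilities(3, 4, n_rate_ratios=5, gamma=1, invariant=1)
print(inside["each rate ratio"])   # 4/25
```

With these priorities the total priority is 2 × 1 + 8 + 24 + 7 = 41, so the second model is chosen far more often than the others, and within it most of the effort goes to the rate ratios.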
$$P(\text{average rate}) = \frac{\text{Average rates, proposal priority}}{\text{total priority}} \tag{1.1a}$$

$$P(\text{model } i) = \frac{\text{Model } i \text{ priority}}{\text{total priority}} \tag{1.1b}$$

$$\text{total priority} = (N - 1) \times \text{Average rates, proposal priority} + \sum_{i=1}^{N} \text{Model } i \text{ priority} \tag{1.1c}$$

In our example, if the second model is selected for the next modification, we define priority_total:

$$\text{priority}_{\text{total}} = \text{priority}_{\text{frequencies}} + \text{priority}_{\text{gamma}} + \text{priority}_{\text{invariant}} + \text{number}_{\text{rate ratios}} \times \text{priority}_{\text{rate ratios}}$$

and we modify either all the frequencies (probability 1.2a), or one of the rate ratios (probability 1.2b each), or the invariant parameter (probability 1.2c), or the gamma shape parameter (probability 1.2d).

$$P = \frac{\text{priority}_{\text{frequencies}}}{\text{priority}_{\text{total}}} \tag{1.2a}$$

$$P = \frac{\text{priority}_{\text{rate ratios}}}{\text{priority}_{\text{total}}} \tag{1.2b}$$

$$P = \frac{\text{priority}_{\text{invariant}}}{\text{priority}_{\text{total}}} \tag{1.2c}$$

$$P = \frac{\text{priority}_{\text{gamma}}}{\text{priority}_{\text{total}}} \tag{1.2d}$$

1.3.6 analyser

The analyser program does not require any control file. To use analyser, type at the command-line:
    analyser
or
    analyser data-file
where data-file is a file following the data file format described in section 1.1.1.

The analyser program needs you to provide the fields usually used in the DATAFILE block (see section 1.2.2) in order to parse your sequences. Once this is done, you will be prompted for the class to check if the data file contains heterogeneous sites, and analyser will ask for a “.lump” file. The “.lump” file is used:

1. to match sites to a given state, e.g., indels (-) and purines (R) can be lumped into an ambiguity state (X).
2. to choose the state(s) used for the cut-off (see below).

Two “.lump” files are provided, data/dna.lump and data/rna.lump. The first one is used with single nucleotides: ambiguity symbols (i.e., -, R, Y and X) are lumped into a state X used for the cut-off. The second one is used with paired sites: mismatches (e.g., AC, UU, ...) are lumped into the single state MM and ambiguities (e.g., C-, UR, ...) are lumped into the state XX. Both states are used for the cut-off.
Once the “.lump” file is provided, analyser outputs statistics for each state and asks for a cut-off value (between 0 and 1.0). For each site, the frequency of the “cut-off states” is computed, and the sites whose frequency exceeds the cut-off threshold are displayed on the screen.

2 - Elements of phylogenetic theory

2.1 Phylogenetic trees

Usually a sketch of a tree-like structure is used to describe evolution; the evolutionary tree represents the hierarchical relationships among species arising through evolution. Ancestral species are located near the root of the tree and contemporary species are the leaves. Almost all methods accept the appropriateness of a tree-like model to describe the evolution of species but one must keep in mind that it is a strong assumption in itself.

2.1.1 Unrooted phylogenies

Since the data for the ancestors are usually missing, the phylogenetic trees produced by PHASE are only schematic trees comprising a set of nodes linked together by branches. Terminal nodes, usually called tips or leaves, are known sequences of existing organisms or contemporary taxa. Internal nodes are bifurcation points between genetically isolated groups.

The analytical techniques used in PHASE result in the inference of an unrooted, strictly bifurcating tree. The location of the common ancestor of all the species under study (i.e., the earliest point in time) cannot be identified by our inference method. An unrooted, strictly bifurcating tree can be seen as a kind of network where every internal node is linked to exactly three other nodes, either internal nodes or leaves. To produce a neat tree-like structure, one or more outgroup species, known to be genetically isolated from all the others, should be used to root the tree (see figure 2.1).

Figure 2.1: Two equivalent representations of the same unrooted tree

2.1.2 String representation of a tree

All trees used by the programs follow the Newick¹ standard, though the grammar in PHASE is more limited.
The newick format uses the recursive definition of a tree to represent phylogenies in a computer-readable form with nested parentheses. The tree in figure 2.1 can be written:

(Outgroup, gorilla, (human, chimpanzee));

However, one must be aware that this representation is not unique; the following one works as well:

(human, (Outgroup, gorilla), chimpanzee);

Sometimes, when an outgroup is provided, the rooted representation is used in PHASE:

(Outgroup, (gorilla, (human, chimpanzee)));

Footnote 1: http://evolution.genetics.washington.edu/phylip/newicktree.html

2.1.3 Branch lengths

The branch lengths usually represent the evolutionary distances between two consecutive nodes. We tend to split the phylogenetic tree into two parts: its topology (i.e., the pattern of branching) and its associated edge lengths. The expected rate of evolutionary change is assumed constant across all lineages in a phylogeny, and the length of a branch is scaled to the expected number of substitutions per site along that branch. These lengths can be integrated in the string representation seen in section 2.1.2; for instance we can write:

(Outgroup : 0.35, gorilla : 0.25, (human : 0.3, chimpanzee : 0.2));

2.2 Nucleotide substitution models

Substitution models describe the way sequences evolve in time by nucleotide replacements. The PHASE package provides a wide range of substitution models. These consist of standard nucleotide substitution models as well as specific base-pair substitution models.

2.2.1 A Markov model of substitution

Replacements within DNA sequences can be described and modelled by a Markov process with four states. Each state represents one base: Adenine, Cytosine, Guanine or Thymine (see figure 2.2).

Figure 2.2: Markov model for nucleotide evolution in DNA sequences

Many assumptions are made in order to make phylogenetic reconstructions computationally feasible.
First, each nucleotide is assumed to evolve independently of the other sites and of its own past history. We suppose there is no interaction between sites and we treat them independently. Further, the Markov process of substitution is assumed to be the same across all sites (spatial homogeneity). Finally, the process is assumed to remain constant over time (stationary) and to be time homogeneous, i.e., nucleotide frequencies and substitution rates can be assumed constant through time and across all sites in an alignment.

One might concede that the assumptions made for the nucleotide evolutionary process are not strictly valid. Actual data show some discrepancies, e.g., heterogeneous selection pressure or unequal base frequencies among species. We can relax these assumptions and allow for substitution rate variation across sites with the gamma model of Yang (1994), described in section 2.4.1. It is also possible to use multiple substitution processes simultaneously when heterogeneous data are analysed (see section 2.4.2).

In spite of their name, DNA models can naturally be used for the treatment of the loops within RNA sequences (see figure 2.3). In RNA loops, nucleotides are not subject to any structural constraints and they are assumed to evolve independently from other sites. Therefore, the use of similar Markov models for nucleotide evolution in RNA loops is appropriate.

2.2.2 Transition matrices

The mathematical expression of a DNA Markov model uses a matrix $Q$ of substitution rates in which each element $r_{ij}$ represents the rate of substitution from nucleotide $i$ to nucleotide $j$. The diagonal elements of the instantaneous rate matrix must satisfy the equation

\[
r_{ii} = -\sum_{j \neq i} r_{ij} \tag{2.1}
\]

so that each row of $Q$ sums to zero. The process must be homogeneous and stationary; if $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ are the four equilibrium base frequencies then the rates must obey the following constraint:

\[
\pi_i r_{ij} = \pi_j r_{ji} \quad \forall i, j \tag{2.2}
\]

also known as the time-reversibility constraint.
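Equations 2.1 and 2.2 can be checked numerically on any candidate rate matrix. The Python sketch below is only an illustration: the function names are ours and are not part of PHASE. As an example it uses off-diagonal rates equal to the frequency of the destination nucleotide, one simple choice that satisfies the time-reversibility constraint, and then breaks the constraint by changing a single rate.

```python
def complete_diagonal(rates):
    """Return a copy of the matrix whose diagonal makes each row sum to zero (eq. 2.1)."""
    n = len(rates)
    q = [row[:] for row in rates]
    for i in range(n):
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q


def is_time_reversible(q, pi, tol=1e-12):
    """Check pi_i * r_ij == pi_j * r_ji for every pair of states (eq. 2.2)."""
    n = len(q)
    return all(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Illustrative equilibrium frequencies for A, C, G, T (arbitrary values).
pi = [0.1, 0.2, 0.3, 0.4]

# Off-diagonal rate i -> j set to pi[j]; the diagonal is filled in afterwards.
rates = [[0.0 if i == j else pi[j] for j in range(4)] for i in range(4)]
q = complete_diagonal(rates)

print(all(abs(sum(row)) < 1e-12 for row in q))  # rows sum to zero
print(is_time_reversible(q, pi))                # constraint 2.2 holds

# Changing one rate without its mirror entry violates the constraint.
broken = [row[:] for row in rates]
broken[0][1] = 1.0
print(is_time_reversible(complete_diagonal(broken), pi))
```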
To enforce this constraint we define $\alpha_{ij}$ so that

\[
Q_{ij} = r_{ij} = mr \times \pi_j \alpha_{ij} \quad \forall i, \forall j \neq i \tag{2.3}
\]

where $mr$ is a constant factor described later. The time-reversibility condition is satisfied with a symmetric choice of $\alpha_{ij}$. In practice, PHASE uses one of these $\alpha_{ij}$ parameters as a reference and sets its value to 1.0. Depending on the model, the other parameters (we call them rate ratios) are fixed or inferred during an analysis.

With $Q$ we can compute the transition probability matrix over time $t$ (footnote 2).

\[
\frac{dP(t)}{dt} = P(t) \times Q
\]
\[
P(t) = \exp(Qt) = \exp(\{\pi_j \alpha_{ij}\} \times mr \times t)
\]

The transition probability matrix $P(t) = \{p_{ij}(t)\}$ is used to compute the probability that nucleotide $i$ will be nucleotide $j$ after time $t$ ($i$ can be equal to $j$). The “rate ratios” matrix in PHASE refers to the matrix $\{\alpha_{ij}\}$ and the “transition rates” matrix refers to $Q$.

The inference methods used do not permit the separation of $mr$, a factor proportional to the average substitution rate of the model, and $t$, the branch lengths of the evolutionary tree (see section 2.1), which reflect an amount of change. The longer the branch, the bigger the evolutionary distance between its two incident nodes. We therefore have to impose a scaling on the branch lengths. In practice, we fix the average rate of substitutions of our model to be one per “unit of time”. This is done by adding a constraint on the factor $mr$:

\[
\sum_{i=1}^{nb\ states} \sum_{j \neq i} \pi_i r_{ij} = mr \times \sum_{i=1}^{nb\ states} \sum_{j \neq i} \pi_i \pi_j \alpha_{ij} = 1.0 \tag{2.4}
\]

This last constraint does not hold when multiple substitution models are used simultaneously in the MIXED model. The average substitution rate of the first model is still fixed to 1.0 but the average substitution rates of the other models are now free parameters.

Footnote 2: the term branch length would be more correct.

2.2.3 Nucleotide substitution models implemented in PHASE

One can refer to Whelan et al. (2001) for a comprehensive review of the following substitution models and their hierarchical relationships. The transition rate matrices of these models highlight their differences.
They are presented by increasing complexity, i.e., ordered according to their number of free parameters (equilibrium frequencies and/or rates). In nucleotide substitution models, the A↔G transition is used as a reference by PHASE, $\alpha_{AG} = \alpha_{GA} = 1$.

JC69 model (Jukes-Cantor, 69)

The Jukes-Cantor model assumes equal base frequencies and equal mutation rates; therefore it does not have any free parameter.

\[
\pi_i = \tfrac{1}{4} \quad \forall i, \qquad \alpha_{ij} = 1.0 \quad \forall i, \forall j \neq i
\]
\[
Q = mr \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & 0.25 & 0.25 & 0.25 \\
C & 0.25 & * & 0.25 & 0.25 \\
G & 0.25 & 0.25 & * & 0.25 \\
T & 0.25 & 0.25 & 0.25 & *
\end{array}
\]

Table 2.1: JC69 transition matrix

K80 model (Kimura, 80)

The Kimura model assumes equal base frequencies and accounts for the difference between transitions and transversions with one parameter.

\[
\pi_i = \tfrac{1}{4} \quad \forall i, \qquad \alpha_{transition} = 1.0, \qquad \alpha_{transversion} = \alpha_1
\]
\[
Q = mr \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & 0.25\alpha_1 & 0.25 & 0.25\alpha_1 \\
C & 0.25\alpha_1 & * & 0.25\alpha_1 & 0.25 \\
G & 0.25 & 0.25\alpha_1 & * & 0.25\alpha_1 \\
T & 0.25\alpha_1 & 0.25 & 0.25\alpha_1 & *
\end{array}
\]

Table 2.2: K80 transition matrix

HKY85 model (Hasegawa-Kishino-Yano, 85)

The HKY85 model does not assume equal base frequencies and accounts for the difference between transitions and transversions with one parameter.

\[
\alpha_{transition} = 1.0, \qquad \alpha_{transversion} = \alpha_1
\]
\[
Q = mr \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_1 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_1 & \pi_T \\
G & \pi_A & \pi_C\alpha_1 & * & \pi_T\alpha_1 \\
T & \pi_A\alpha_1 & \pi_C & \pi_G\alpha_1 & *
\end{array}
\]

Table 2.3: HKY85 transition matrix

TN93 model (Tamura-Nei, 93)

The TN93 model has four frequency parameters. It accounts for the difference between transitions and transversions and differentiates the two kinds of transitions (purine↔purine and pyrimidine↔pyrimidine).

\[
\alpha_{AG} = \alpha_{GA} = 1.0, \qquad \alpha_{transversion} = \alpha_1, \qquad \alpha_{CT} = \alpha_{TC} = \alpha_2
\]
\[
Q = mr \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_1 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_1 & \pi_T\alpha_2 \\
G & \pi_A & \pi_C\alpha_1 & * & \pi_T\alpha_1 \\
T & \pi_A\alpha_1 & \pi_C\alpha_2 & \pi_G\alpha_1 & *
\end{array}
\]

Table 2.4: TN93 transition matrix

REV model (Yang, 94)

The REV model is the most general model for nucleotide substitution subject to the time-reversibility constraint. It has four frequencies and five rate parameters.

\[
Q = mr \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_2 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_3 & \pi_T\alpha_4 \\
G & \pi_A & \pi_C\alpha_3 & * & \pi_T\alpha_5 \\
T & \pi_A\alpha_2 & \pi_C\alpha_4 & \pi_G\alpha_5 & *
\end{array}
\]

Table 2.5: REV transition matrix

2.3 Paired-site substitution models

RNA substitution models are an attempt to add biological realism to the evolution model. The assumption that each nucleotide site evolves independently must be modified for RNA molecules. Paired-site substitution models can account for the secondary structure of these molecules.

2.3.1 RNA secondary structure

In the double helical structure of the DNA molecule, two complementary nucleotide strands are held together by hydrogen bonds between the Watson-Crick pairs A-T and C-G. RNA molecules usually come as single strands, but left in their environment they fold into their tertiary structure because of the same hydrogen bonding mechanism. Helices, also known as stems, are formed intramolecularly. There are 16 possible base-pairings; however, of these, only six (AU, GU, GC, UA, UG, CG) are stable enough to form actual base-pairs. The rest are called mismatches and occur at very low frequencies in helices.

RNA molecules, such as ribosomal RNAs and transfer RNAs, have an important role. Their structure cannot easily be disrupted without impact on their function and lethal consequences, and selection acts to maintain the secondary structure. Yet, the primary structure of the stems (i.e., their nucleotide sequence) can still vary, and in fact we observe that RNA helical regions are quite variable in sequence. The nature of the bases is not important and substitutions are possible as long as they preserve the secondary structure.

Figure 2.3: An RNA molecule secondary structure

One could model the evolution of stems using the DNA models described above, but there may be a substantial bias in the results because paired substitutions would seem far less probable than they are in reality (see Jow et al., 2002). Statistics become invalid and this can have an effect on inferred phylogenies.
The secondary structure is left unchanged when complementary substitutions occur in the DNA gene coding for the RNA molecule. The process can be a single-step process (double substitution) or a two-step process (two single substitutions). These two processes are described in the theory of compensatory substitutions section below.

2.3.2 Theory of compensatory substitutions

From the individual sequence viewpoint, complementary mutations are a two-step process typically involving a U-G or a G-U pair as a transition state. These pairs are thermodynamically less stable than Watson-Crick pairs but they are still more likely to arise than any other mismatches.

Nonetheless, in phylogenetic studies we are not considering individual copies of a gene; we are rather modelling consensus sequences for a large number of individuals. From the population genetics viewpoint, evolution in stems can occur either by two single substitutions or by simultaneous compensatory substitutions (see, e.g., Higgs, 1998; Savill et al., 2001). The first mechanism is the fixation of the slightly deleterious UG or GU pair in the population before the second mutation occurs. The second mechanism happens when natural selection against intermediate mutants is too strong. In such a case, deleterious pairs are kept low in frequency until a second mutation takes place by chance in one of the sporadic mutant sequences. Afterwards, the new neutral variant may replace the original one due to drift in gene frequencies (see figure 2.4).

Figure 2.4: Substitution mechanisms for paired-sites

Therefore, even if simultaneous mutations are very unlikely to occur in a single organism, it is reasonable, although not compulsory, to allow double substitutions in models from the population point of view. The experimental results you can obtain with PHASE confirm this.
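The two-step route can be made concrete with a small Python sketch. It is not taken from PHASE and the function names are ours; it only encodes the six stable pairs listed in section 2.3.1 and counts the number of nucleotide substitutions separating two pairs, in order to list the intermediate pairs through which a change such as AU to GC can pass.

```python
STABLE_PAIRS = {"AU", "GU", "GC", "UA", "UG", "CG"}
WOBBLE_PAIRS = {"GU", "UG"}  # less stable than Watson-Crick pairs, but still paired


def pair_class(pair):
    """Classify a pair as 'Watson-Crick', 'wobble' or 'mismatch'."""
    if pair in WOBBLE_PAIRS:
        return "wobble"
    if pair in STABLE_PAIRS:
        return "Watson-Crick"
    return "mismatch"


def substitutions_needed(src, dst):
    """Number of positions (0, 1 or 2) at which two pairs differ."""
    return sum(a != b for a, b in zip(src, dst))


def single_step_intermediates(src, dst):
    """Pairs one substitution away from src that are also one substitution away from dst."""
    intermediates = []
    for pos in (0, 1):
        for base in "ACGU":
            mid = src[:pos] + base + src[pos + 1:]
            if mid != src and substitutions_needed(mid, dst) == 1:
                intermediates.append(mid)
    return intermediates


print(substitutions_needed("AU", "GC"))
for mid in single_step_intermediates("AU", "GC"):
    print(mid, pair_class(mid))
```

For the double transition AU to GC the sketch finds two possible intermediates, GU and AC. Only GU is a stable pair, which is why the two-step route typically goes through a G-U or U-G pair.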
Since natural selection against intermediate mutants with any mismatch pair other than U-G or G-U is usually much stronger, one can notice two groups of states within which rapid interchange occurs, while interchange between the two groups, although possible, is really slow (see figure 2.5).

Figure 2.5: Mutation rate between paired-sites

2.3.3 Base-paired substitution models implemented in PHASE

Like DNA models, RNA substitution models are Markov models, but they consider pairs of nucleotides as their elementary states rather than single sites. The PHASE software contains 16-state models to account for the 16 possible pairs that can be formed with 4 bases. These models have a lot of parameters and you might prefer the 6-state and 7-state models, in which mismatch pairs are respectively discarded or lumped into a single state MM.

The time-reversibility constraint and the average mutation rate are set as they were for DNA models. One can refer to Savill et al. (2001) for a more detailed review of the following substitution models and their hierarchical relationships. With base-paired models, PHASE uses the rate of the double transition AU↔GC as a reference for the rate ratios.

RNA6A model

Six-state models completely ignore mismatches and consider substitutions between the six stable base-pairs only. Mismatch pairs are assigned to one of the 6 states in some deterministic fashion (the treatment of a mismatch is quite similar to the treatment of a gap with the DNA models). The RNA6A model is the most general six-state model, with 15 rate parameters and 6 frequencies (and 2 constraints), as shown in table 2.6.
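\[
Q = mr \times
\begin{array}{c|cccccc}
   & AU & GU & GC & UA & UG & CG \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_4 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_5 & \pi_{UA}\alpha_6 & \pi_{UG}\alpha_7 & \pi_{CG}\alpha_8 \\
GC & \pi_{AU} & \pi_{GU}\alpha_5 & * & \pi_{UA}\alpha_9 & \pi_{UG}\alpha_{10} & \pi_{CG}\alpha_{11} \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_6 & \pi_{GC}\alpha_9 & * & \pi_{UG}\alpha_{12} & \pi_{CG}\alpha_{13} \\
UG & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_7 & \pi_{GC}\alpha_{10} & \pi_{UA}\alpha_{12} & * & \pi_{CG}\alpha_{14} \\
CG & \pi_{AU}\alpha_4 & \pi_{GU}\alpha_8 & \pi_{GC}\alpha_{11} & \pi_{UA}\alpha_{13} & \pi_{UG}\alpha_{14} & *
\end{array}
\]

Table 2.6: RNA6A transition matrix

RNA6B model (Tillier, 94)

The RNA6B model (Tillier, 1994) is formed by restriction of the RNA6A model. The RNA6B model has only 3 rate parameters and 6 frequencies; it uses a rate of single transitions $\alpha_1$ and a rate of double transversions $\alpha_2$. The reference for the rate ratios is the rate of double transitions. The transition rate matrix is given in table 2.7.

\[
Q = mr \times
\begin{array}{c|cccccc}
   & AU & GU & GC & UA & UG & CG \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_1 & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
GC & \pi_{AU} & \pi_{GU}\alpha_1 & * & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & * & \pi_{UG}\alpha_1 & \pi_{CG} \\
UG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA}\alpha_1 & * & \pi_{CG}\alpha_1 \\
CG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA} & \pi_{UG}\alpha_1 & *
\end{array}
\]

Table 2.7: RNA6B transition matrix

RNA7A model

The RNA7A model is the most general of the seven-state models. It has 21 rate parameters (including the reference rate AU↔GC) and 7 frequencies. All mismatches are treated as a single state MM. The RNA7A model is described by the following rate matrix (table 2.8).

\[
Q = mr \times
\begin{array}{c|ccccccc}
   & AU & GU & GC & UA & UG & CG & MM \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_4 & \pi_{MM}\alpha_5 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_6 & \pi_{UA}\alpha_7 & \pi_{UG}\alpha_8 & \pi_{CG}\alpha_9 & \pi_{MM}\alpha_{10} \\
GC & \pi_{AU} & \pi_{GU}\alpha_6 & * & \pi_{UA}\alpha_{11} & \pi_{UG}\alpha_{12} & \pi_{CG}\alpha_{13} & \pi_{MM}\alpha_{14} \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_7 & \pi_{GC}\alpha_{11} & * & \pi_{UG}\alpha_{15} & \pi_{CG}\alpha_{16} & \pi_{MM}\alpha_{17} \\
UG & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_8 & \pi_{GC}\alpha_{12} & \pi_{UA}\alpha_{15} & * & \pi_{CG}\alpha_{18} & \pi_{MM}\alpha_{19} \\
CG & \pi_{AU}\alpha_4 & \pi_{GU}\alpha_9 & \pi_{GC}\alpha_{13} & \pi_{UA}\alpha_{16} & \pi_{UG}\alpha_{18} & * & \pi_{MM}\alpha_{20} \\
MM & \pi_{AU}\alpha_5 & \pi_{GU}\alpha_{10} & \pi_{GC}\alpha_{14} & \pi_{UA}\alpha_{17} & \pi_{UG}\alpha_{19} & \pi_{CG}\alpha_{20} & *
\end{array}
\]

Table 2.8: RNA7A transition matrix

RNA7D model (Tillier, 98)

The RNA7D model (Tillier and Collins, 1998) is a biologically plausible restriction of the RNA7A model. The restrictions in the 7D model are analogous to the restrictions made in the 6B.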
There is one more frequency parameter for the mismatch state and one more rate ratio parameter for the substitution rates involving this state. The reference for the rate ratios is the rate of double transitions. This model is described by the following rate matrix (table 2.9).

\[
Q = mr \times
\begin{array}{c|ccccccc}
   & AU & GU & GC & UA & UG & CG & MM \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_1 & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
GC & \pi_{AU} & \pi_{GU}\alpha_1 & * & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & * & \pi_{UG}\alpha_1 & \pi_{CG} & \pi_{MM}\alpha_3 \\
UG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA}\alpha_1 & * & \pi_{CG}\alpha_1 & \pi_{MM}\alpha_3 \\
CG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA} & \pi_{UG}\alpha_1 & * & \pi_{MM}\alpha_3 \\
MM & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_3 & \pi_{GC}\alpha_3 & \pi_{UA}\alpha_3 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_3 & *
\end{array}
\]

Table 2.9: RNA7D transition matrix

RNA16A model

PHASE contains a general 16-state model (RNA16); however, this model has 119 + 15 free parameters and is not well suited for phylogenetic inference, especially maximum-likelihood inference. RNA16A is a simplified 16-state model; it reduces some of the complexity of the RNA16 model by cutting down the number of rate parameters from 120 to 5. It uses a rate of single transitions $\alpha_1$, a rate of double transversions $\alpha_2$, a mismatch↔non-mismatch transition rate $\alpha_3$ for transitions requiring only one substitution, and a mismatch↔mismatch transition rate $\alpha_4$, also for transitions requiring one substitution. The reference rate is the rate of double transitions. Some base-pair substitutions are not allowed (null substitution rate). The transition matrix for the RNA16A model is given in table 2.10.

2.4 Refinements to substitution models

In this section we introduce some refinements made to the substitution models described above.

2.4.1 Invariant and discrete gamma models

Substitution rates vary over the sites of a sequence for many real datasets, if not all. Including the heterogeneity of rates in substitution models is widely recognized as an important factor in the fit to data.
One attempt to take this acknowledged biological fact into account is to suppose that a proportion of sites are invariant while the others evolve at the same single rate. PHASE provides this invariant model. One extra parameter in the model governs the proportion of sites with zero rate of evolution.

Models that allow continuous variability of mutation rates over sites are more realistic, and the gamma model of Yang (1994) outperforms the invariant model. The discrete gamma model is implemented in PHASE. The continuous rate distribution is approximated with a discrete distribution, which is computationally tractable, and sites are divided into $k$ equally probable rate categories. A single parameter $\alpha$ governs the shape of this distribution and the substitution rates for all categories. The mean $E(r)$ of the gamma distribution is the average mutation rate of our substitution model as stated earlier, and its variance is $V(r) = E(r)^2/\alpha$. A small $\alpha$ suggests that rates differ significantly between sites, with a few sites having high rates and others being practically invariant; on the contrary, a large $\alpha$ models weak rate heterogeneity (see figure 2.6). When $\alpha \to +\infty$, the gamma model reduces to the single rate model.

The computational requirement of the discrete gamma model is roughly linear, i.e., the application of a discrete gamma model with $k$ categories is about $k$ times slower than the use of a model where rate heterogeneity is not considered.

2.4.2 The MIXED model

Since the current trend in phylogenetic analysis is to use several genes and/or several sorts of sequences at once, models were designed for combined analyses of heterogeneous sequence data from the same set of species (Yang, 1996). PHASE allows the use of multiple substitution models simultaneously to treat this kind of sequence, each model having its own independent set of parameters.
The average mutation rate of the first model is still set to 1.0 but the average mutation rates of the other models are now free parameters. The MIXED model for combined analysis of heterogeneous data is equivalent to the model with proportional branch lengths described in Yang (1996).

2.5 Bayesian phylogenetics

2.5.1 Bayes’ theorem

A Bayesian approach to phylogeny reconstruction requires the definition of a parameter space $\Omega$ which contains the set of all possible combined states $\phi = \{\tau_i, \nu_i, \theta\}$, where the symbol $\tau_i$ labels the $i$th possible tree topology, $\nu_i$ are the branch lengths associated with this topology and $\theta$ is a set of allowed parameters for our evolutionary model (e.g., rate ratios $\alpha_{ij}$, nucleotide or base-pair frequencies $\pi_i$, gamma distribution parameter $\alpha$, ...).

According to Bayes’ theorem, we can calculate the posterior probability of the combined state $\phi$ given sequence data $X$,

\[
p(\phi|X) = \frac{P(X|\phi)\,p(\phi)}{\sum_{i=1}^{N_s} \int\!\!\int d\nu_i\, d\theta\; P(X|\phi)\,p(\phi)} \tag{2.5}
\]

where $N_s$ is the number of possible tree topologies for a data set containing $s$ species, $P(X|\phi)$ is the likelihood of the data and $p(\phi)$ is the prior probability density associated with state $\phi$.

Figure 2.6: Probability density function of several gamma distributions of rate heterogeneity with mean $E(r) = 1$

2.5.2 Markov chain Monte-Carlo (MCMC)

Computing the denominator of equation 2.5 is infeasible for realistically sized problems. A Markov Chain Monte-Carlo method is therefore used. The standard Metropolis-Hastings MCMC algorithm can construct a Markov chain in our state space $\Omega$ by iterating a two-step process (Metropolis et al., 1953; Hastings, 1970). Firstly, a new state $\phi'$ is drawn from the current state $\phi_n$ according to some proposal mechanism. The proposed state is then accepted or rejected with a probability which depends on the ratio of the posterior probabilities of the two states $\phi'$ and $\phi_n$ and on the proposal. After a “burnin” period, the chain converges to an equilibrium, under quite weak conditions.
After discarding an initial portion of the chain, states are distributed according to the posterior probability density $p(\phi|X)$.

PHASE can produce a large sample from the posterior probability density, and with this sample one can compute the posterior probability of any identifiable phylogenetic feature of interest. For instance, the posterior probability of a specific topology is simply given by the fraction of times this topology appears in our MCMC sample. Similarly, we can fit a posterior probability density curve to the gamma distribution parameter.

2.5.3 Priors and proposals

Uniform priors?

We have no strong evidence for any particular prior and we therefore choose a simple factorized prior $p(\phi) = p(\theta)\,p(\nu_i)\,P(\tau_i)$. We assume a uniform prior on all trees, $P(\tau_i) = 1/N_s$; we use a flat Dirichlet distribution prior for frequency parameters (i.e., all sets of frequencies are equally likely as long as they sum to one); and we choose a uniform positive prior for substitution rate parameters, the gamma distribution parameter and branch lengths. Consequently, for all pairs of possible states $(\phi, \phi')$, the priors are equal.

One should set upper limits in the case of uniform priors, but since these parameters usually remain within reasonable limits during simulations, these boundaries should not have any effect on experimental results unless unreasonable values are chosen. It is good practice to check whether these upper boundaries are reached while monitoring the convergence of the parameters.

Proposals for the parameters

For the proposal step, we have to balance the desire to move globally through the parameter space $\Omega$ with the need to make computationally feasible moves in areas of high probability. Therefore we split up the process and apply a suitable proposal to the variables at each iteration.

For the frequency parameters we adopt the Dirichlet proposal distribution, centred at the current frequency vector, used by Larget and Simon (1999).
For the gamma distribution parameter and the substitution rate ratios, a normal proposal distribution centred at the current value is used, with a reflecting boundary at zero and at the upper limit defined above.

Distant moves in $\Omega$ might result in a low acceptance rate, whereas small modifications will prevent a full inspection of highly probable areas. The parameters used to move in the state space must therefore be carefully chosen for proper mixing (quick convergence and good sampling from the posterior probability density). With mcmcphase, these parameters can be adjusted during the burnin period.

Proposals for the tree

The tree topology is perturbed every ten cycles with either the nearest neighbor interchange (NNI) proposal shown in figure 2.7 or the subtree pruning and re-grafting (SPR) proposal (Swofford et al., 1996) shown in figure 2.8.

Figure 2.7: The nearest neighbor interchange algorithm (Jow et al., 2002)

Figure 2.8: The subtree pruning and regrafting algorithm (Jow et al., 2002)

At each cycle, a randomly chosen branch length $x$ is modified by an amount $\delta$ drawn from a normal distribution centred at zero. When the branch length becomes negative, special rules which can lead to a topology change are applied (Jow et al., 2002). If the branch is an internal branch then one of the two nearest neighbor topologies is proposed, each with equal probability; this is the Nearest Neighbour Interchange described above. The new internal branch length is set to $y = |x + \delta|$ (see figure 2.9). If the branch is a terminal branch, we cannot apply the NNI algorithm and we simply use a reflecting boundary. The new proposed length is $y = |x + \delta|$.

Figure 2.9: The continuous change algorithm when $x + \delta < 0$ (Jow et al., 2002)

The acceptance rates for the SPR and the NNI proposals are usually quite low. The “local” NNI proposal, induced by a branch length modification, has a better acceptance rate.
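The branch-length update for a terminal branch can be sketched in a few lines of Python. This is a simplified illustration and not the mcmcphase code: the function names are ours, the log-likelihood is a toy stand-in for the phylogenetic likelihood, and we assume that a proposal beyond the upper bound of the uniform prior is simply rejected. Because the reflected normal proposal is symmetric and the prior is uniform, the acceptance probability reduces to the likelihood ratio.

```python
import math
import random


def propose_branch_length(x, step, rng):
    """Add delta ~ N(0, step^2) to x and reflect at zero: y = |x + delta|."""
    return abs(x + rng.gauss(0.0, step))


def metropolis_step(x, log_likelihood, step, upper_bound, rng):
    """One Metropolis update of a single branch length under a uniform prior.

    Returns the new length and whether the proposal was accepted.
    """
    y = propose_branch_length(x, step, rng)
    if y > upper_bound:
        # Assumption: zero prior density beyond the upper bound, so reject.
        return x, False
    log_ratio = log_likelihood(y) - log_likelihood(x)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return y, True
    return x, False


def toy_log_likelihood(t):
    """Toy stand-in for the tree likelihood, peaked at a length of 0.3."""
    return -50.0 * (t - 0.3) ** 2


rng = random.Random(1)
x, accepted, samples = 0.1, 0, []
for cycle in range(2000):
    x, ok = metropolis_step(x, toy_log_likelihood, step=0.1, upper_bound=1.5, rng=rng)
    accepted += ok
    samples.append(x)

print("acceptance rate:", accepted / len(samples))
print("mean sampled length:", sum(samples) / len(samples))
```

The step size plays the role of the proposal parameter discussed above: a larger value lowers the acceptance rate, while a smaller value makes the chain explore the posterior more slowly.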
2.5.4 Pitfalls of Markov chain Monte-Carlo techniques

There is no guarantee that maximum-likelihood algorithms always find the true global maximum of the likelihood function. Similarly, with MCMC techniques, the Markov chain can fail to converge to the stationary distribution of the posterior probabilities. A possible reason for this is the failure to visit all highly probable regions of the parameter space because of local maxima in the likelihood curve. However, poor proposal mechanisms and/or failure to run the chain long enough are usually the main causes of a defective sample (see Huelsenbeck et al., 2002).

Unfortunately it is not always easy to identify these traps. We can only recommend that you do long runs, monitor the convergence of several model parameters (monitoring the likelihood alone is not enough), and repeat the experiment using different random starting trees to check that all the chains give similar results (i.e., substitution model parameters, consensus tree, likelihood, ...).

Each element of the RNA16A matrix is $Q_{ij} = mr \times \pi_j \times c_{ij}$, where $\pi_j$ is the frequency of the destination pair (column) and $c_{ij}$ is the coefficient given below (1 is the reference rate ratio, 0 a substitution that is not allowed).

\[
\{c_{ij}\} =
\begin{array}{c|cccccccccccccccc}
   & AU & GU & GC & UA & UG & CG & AA & AG & AC & GA & GG & CA & CC & CU & UC & UU \\ \hline
AU & * & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 & 0 & 0 & 0 & 0 & \alpha_3 & 0 & \alpha_3 \\
GU & \alpha_1 & * & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & 0 & 0 & 0 & \alpha_3 & \alpha_3 & 0 & 0 & \alpha_3 & 0 & \alpha_3 \\
GC & 1 & \alpha_1 & * & \alpha_2 & \alpha_2 & \alpha_2 & 0 & 0 & \alpha_3 & \alpha_3 & \alpha_3 & 0 & \alpha_3 & 0 & \alpha_3 & 0 \\
UA & \alpha_2 & \alpha_2 & \alpha_2 & * & \alpha_1 & 1 & \alpha_3 & 0 & 0 & \alpha_3 & 0 & \alpha_3 & 0 & 0 & \alpha_3 & \alpha_3 \\
UG & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & * & \alpha_1 & 0 & \alpha_3 & 0 & 0 & \alpha_3 & 0 & 0 & 0 & \alpha_3 & \alpha_3 \\
CG & \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & * & 0 & \alpha_3 & 0 & 0 & \alpha_3 & \alpha_3 & \alpha_3 & \alpha_3 & 0 & 0 \\
AA & \alpha_3 & 0 & 0 & \alpha_3 & 0 & 0 & * & \alpha_4 & \alpha_4 & \alpha_4 & 0 & \alpha_4 & 0 & 0 & 0 & 0 \\
AG & \alpha_3 & 0 & 0 & 0 & \alpha_3 & \alpha_3 & \alpha_4 & * & \alpha_4 & 0 & \alpha_4 & 0 & 0 & 0 & 0 & 0 \\
AC & \alpha_3 & 0 & \alpha_3 & 0 & 0 & 0 & \alpha_4 & \alpha_4 & * & 0 & 0 & 0 & \alpha_4 & 0 & \alpha_4 & 0 \\
GA & 0 & \alpha_3 & \alpha_3 & \alpha_3 & 0 & 0 & \alpha_4 & 0 & 0 & * & \alpha_4 & \alpha_4 & 0 & 0 & 0 & 0 \\
GG & 0 & \alpha_3 & \alpha_3 & 0 & \alpha_3 & \alpha_3 & 0 & \alpha_4 & 0 & \alpha_4 & * & 0 & 0 & 0 & 0 & 0 \\
CA & 0 & 0 & 0 & \alpha_3 & 0 & \alpha_3 & \alpha_4 & 0 & 0 & \alpha_4 & 0 & * & \alpha_4 & \alpha_4 & 0 & 0 \\
CC & 0 & 0 & \alpha_3 & 0 & 0 & \alpha_3 & 0 & 0 & \alpha_4 & 0 & 0 & \alpha_4 & * & \alpha_4 & \alpha_4 & 0 \\
CU & \alpha_3 & \alpha_3 & 0 & 0 & 0 & \alpha_3 & 0 & 0 & 0 & 0 & 0 & \alpha_4 & \alpha_4 & * & 0 & \alpha_4 \\
UC & 0 & 0 & \alpha_3 & \alpha_3 & \alpha_3 & 0 & 0 & 0 & \alpha_4 & 0 & 0 & 0 & \alpha_4 & 0 & * & \alpha_4 \\
UU & \alpha_3 & \alpha_3 & 0 & \alpha_3 & \alpha_3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \alpha_4 & \alpha_4 & *
\end{array}
\]

Table 2.10: RNA16A transition matrix

Appendices

Appendix A - Some examples of control files

A.1 Control file for likelihood

######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = data/mammals69.mix
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 26
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
#and we define each substitution model inside its own block.
#the file "data/mammals69.mix" is a RNA sequence with loops and stems
#we did not specify a class section but the code used was MIXED
#therefore the first model must be the DNA model for the loop
#and the second model must be the RNA model for the helices.
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
####################### The tree & model Section ####################
#To evaluate the likelihood of a phylogeny you must provide
#1.a phylogeny file (tree with branch lengths)
Tree file = data/mammals69-mix-consensus.tree
#2.the parameters for the model you defined above
Model parameters file = data/mammals69-mix-consensus.model

A.2 Control file for optimise

######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = data/primates.rna
#the format of your data file (interleaved or not)
Interleaved data file = no
#the species used to root the tree
Outgroup = 14
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : RNA16A + dG4
Model = RNA16A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
{\MODEL}
####################### The tree & model Section ####################
#the phylogeny to optimise
Tree file = data/primates.tree
#an optional field to choose the initial model parameters (and check
#whether the method always converge to the same tree)
#Starting model parameters file =
#a random seed to initialise the branch lengths randomly in case the phylogeny
#provided does not hold this information
Random seed = 1
#the base name of the three output files (base.output, base.model, base.tree)
Output file = results/primates-rna-optimise/primates-rna-optimise-RNA16A

A.3 Control file for simulate (1)

######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
######################## The Simulate Section ############
#to produce an example of ’.model’ file for the specified model set this field
#to ’yes’
Retrieve the name of the model’s parameters = yes
#the following field is the name of the ’.model’ file to create
Model parameters file = data/simulate.model

A.4 Control file for simulate (2)

######################## The Evolutionary Model Section ############
{MODEL}
#the name of your model
Model = MIXED
#since we are using the mixed model we provide the number of models
Number of models = 2
{MODEL1}
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL1}
{MODEL2}
#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes
{\MODEL2}
{\MODEL}
######################## The Simulate Section ############
#to simulate some sequences set this field to ’no’
Retrieve the name of the model’s parameters = no
#the file with the user-specified parameters of the substitution model
Model parameters file = data/simulate.model
#Initialise the random number generator with a seed
Random seed = 1
#Random tree or user-specified tree ?
Random tree = yes
#parameters used if a random tree is generated
Number of species = 8
Maximum branch length = .5
#if Random tree == yes the tree will be saved with that file name
#if Random tree == no the tree is read from that file
Tree file = simulated-data/random-8species.tree
#generate sequences:
#for each model you have to specify the desired number of symbols
#if you are not using a MIXED model fill this field for the class 1 only
Number of symbols from class 1 = 600
Number of symbols from class 2 = 1200
#if you need a secondary structure fill the following fields
Structure for the elements of class 1 = .
Structure for the elements of class 2 = ()
#to produce a complete PHASE input file, you have to specify the type and the
#final length yourself
Data file type = RNA
Total length of the raw sequences = 3000
#the name of the file where your sequences are saved, please check this file
#before use
Output file = simulated-data/simulated-8species.mix

A.5 Control file for mlphase (1)

####################### The Data Section ###########################
{DATAFILE}
#The name of your data file
Data file = data/hiv.dna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
{\DATAFILE}
######################## The Evolutionary Model Section ############
{MODEL}
#model : REV + dG4
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
{\MODEL}
####################### The Function Section ###########################
{FUNCTION}
Function = Optimise user-defined phylogenetic trees
#The file with the trees
Trees file = data/hiv-dna.trees
#The number of trees in this file
Number of trees = 2
#Optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes
# The name of the file containing initial substitution model parameters,
# if the previous field is set to no (i.e., fixed parameters for the model) this
# field is compulsory
#### User's model parameters file =
{\FUNCTION}

# Random seed for the random number generator
Random seed = 2
# The next control line sets the output file
Output file = results/hiv-dna-ml/hiv-dna-ml.output

A.6 Control file for mlphase (2)

####################### The Data Section ###########################
{DATAFILE}
#The name of your data file
Data file = data/primates.rna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
{\DATAFILE}

######################## The Evolutionary Model Section ############
{MODEL}
#model : RNA7A + dG3
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
{\MODEL}

####################### The Function Section ###########################
{FUNCTION}
Function = Search for ML topology
#Monophyletic clades ?
User defined monophyletic clades = yes
Clade file = data/primates.clades
#The search method for tree topology :
# 'Simple exhaustive search', 'Branch-and-bound exhaustive search' or
# 'Heuristic stepwise addition'
Topology search = Branch-and-bound exhaustive search
#optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes
#a field to choose initial model parameters, this field is compulsory if
#the previous field is set to no
#### User's model parameters file =
{\FUNCTION}

# Random seed for the random number generator
Random seed = 2
# The next control line sets the output file
Output file = results/primates-rna-ml/primates-rna-ml.output

A.7 Control file for mcmcphase (1)

######################## The Sequence Alignment Section ############
{DATAFILE}
#The name of your data file
Data file = simulated-data/suzuki-arranged.dna
#The format of your data file (interleaved or not)
Interleaved data file = no
#The species used to root the tree
Outgroup = 1
#Is there a class section in your data file ?
Heterogeneous data models = yes
{\DATAFILE}

######################## The Evolutionary Model Section ############
{MODEL}
#model : K80 + dG3
Model = K80
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
Invariant sites = no
{\MODEL}

######################## The Perturbation Section ############
{PERTURBATION}
#Initial branch step proposal parameter
Initial branch step proposal parameter = 0.1
#Upper bound for the branch length uniform distribution
Branch length upper bound = 1.5
#priority for the frequencies perturbation
Frequencies, proposal priority = 1
#optional initial parameter for the perturbation
Frequencies, initial Dirichlet tuning parameter = 500.0
#priority for the rate ratios perturbation
Rate ratios, proposal priority = 1
#optional, initial rate ratio step proposal parameter
Rate ratios, initial step = 0.3
#optional, set the lower bound for the acceptance rate
Rate ratios, proposal minimum acceptance rate = 0.2
#optional, set the upper bound for the acceptance rate
Rate ratios, proposal maximum acceptance rate = 0.6
#priority for the gamma shape parameter perturbation
Gamma parameter, proposal priority = 1
#do not adapt the proposal parameter during the burnin period
Gamma parameter, proposal minimum acceptance rate = .0
Gamma parameter, proposal maximum acceptance rate = 1.0
# the initial proposal step is required because it is fixed
Gamma parameter, initial step = .1
#priority for the invariant parameter (% of invariant sites) perturbation
Invariant parameter, proposal priority = 1
{\PERTURBATION}

######################## The program Section ############
#initialise the random number generator
Random seed = 1
#number of burnin iterations
Burnin iterations = 150000
#number of sampling iterations
Sampling iterations = 300000
#sample every 20 cycles
Sampling period = 20
#initialise the chain with user-defined substitution parameters ?
Random start model parameters = yes
#### User's starting model parameters file =
#initialise the chain with a given tree ?
Random start tree = yes
#### User's starting tree file =
#The base name for the output files (base.output, base.bestmp, base.besttree,
# base.samples, base.mp, base.bl, base.plot)
Output file = results/simulation-mix-mcmc/simulation-suzuki-mcmc
#the format for the 'base.samples' file (phylip or bambe)
Output format = phylip

A.8 Control file for mcmcphase (2)

######################## The Sequence Alignment Section ############
{DATAFILE}
Data file = data/mammals69.mix
Interleaved data file = no
Outgroup = 26
{\DATAFILE}

######################## The Evolutionary Model Section ############
{MODEL}
Model = MIXED
Number of models = 2
{MODEL1}
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no
{\MODEL1}
{MODEL2}
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no
{\MODEL2}
{\MODEL}

######################## The MCMC PERTURBATION Section ############
{PERTURBATION}
#PERTURBATION OF THE TREE :
Initial branch step proposal parameter = 0.03
Branch length upper bound = 1.7
#PERTURBATION OF THE MODEL :
Model 1 priority = 8
Model 2 priority = 24
Average rates, proposal priority = 1
Average rates, initial step = .3
Average rates, proposal minimum acceptance rate = .15
Average rates, proposal maximum acceptance rate = .20
{PERTURBATION1}
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1
{\PERTURBATION1}
{PERTURBATION2}
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1
{\PERTURBATION2}
{\PERTURBATION}

Random seed = 1
Burnin iterations = 40000
Sampling iterations = 100000
Sampling period = 10
Random start model parameters = no
User's starting model parameters file = data/mammals69-mix-consensus.model
Random start tree = no
User's starting tree file = data/mammals69-mix-consensus.tree
Output file = results/mammals69-mix-mcmc/mammals69-mix-mcmc-preinit
Output format = phylip
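The control files in this appendix share a simple layout: comment lines start with '#', fields are written as 'name = value', and blocks are opened by '{NAME}' and closed by '{\NAME}'. The Python sketch below is not part of PHASE; it is a hypothetical helper (the names parse_control and EXAMPLE are assumptions made for illustration) that reads this layout into nested dictionaries, which can be handy for checking a control file before a long run. It also derives the number of retained samples from the A.8 settings, assuming one sample is recorded every 'Sampling period' iterations during the sampling stage.

```python
import re

# Matches an opening tag such as {MODEL1} or a closing tag such as {\MODEL1}.
BLOCK_TAG = re.compile(r"\{(\\?)(\w+)\}")


def parse_control(text):
    """Parse control-file text into nested dicts (illustrative helper, not PHASE code)."""
    root = {}
    stack = [("", root)]  # (block name, dict holding that block's fields)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tag = BLOCK_TAG.fullmatch(line)
        if tag:
            closing, name = tag.groups()
            if closing:
                open_name, _ = stack.pop()
                if open_name != name:
                    raise ValueError("unbalanced closing tag for " + name)
            else:
                block = {}
                stack[-1][1][name] = block
                stack.append((name, block))
            continue
        # Ordinary field: everything before the first '=' is the field name.
        key, _, value = line.partition("=")
        stack[-1][1][key.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError("unclosed block: " + stack[-1][0])
    return root


# Excerpt of the A.8 control file; a raw string keeps the backslashes intact.
EXAMPLE = r"""
{MODEL}
Model = MIXED
Number of models = 2
{MODEL1}
Model = REV
{\MODEL1}
{MODEL2}
Model = RNA7A
{\MODEL2}
{\MODEL}
Burnin iterations = 40000
Sampling iterations = 100000
Sampling period = 10
"""

cfg = parse_control(EXAMPLE)
# Assumption: one sample is kept every 'Sampling period' sampling iterations.
samples = int(cfg["Sampling iterations"]) // int(cfg["Sampling period"])
print(cfg["MODEL"]["MODEL2"]["Model"], samples)
```

Under the same assumption, the A.7 settings (300000 sampling iterations, sampling period 20) would retain 15000 samples. The helper deliberately ignores fields that are commented out with '####', so an unset "User's ... file" entry simply does not appear in the result.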