Practical 3: Birds

Background

Birds form a colourful and diverse part of the modern-day fauna. There are nearly 10,000 living bird species, grouped into about 34 orders. Some of the relationships among these orders have been difficult to resolve, even with large amounts of DNA sequence data.

The timing of the origin of modern birds has been a long-standing question. Although the first birds arose in the Jurassic, lineages belonging to the modern (crown) group did not appear in the fossil record until the Late Cretaceous. Most of the modern bird orders seem to have arisen in the early Paleogene, leading to the hypothesis that they arose after the mass extinction at the end of the Cretaceous. The disappearance of the non-avian dinosaurs and other animals at this time opened up a range of opportunities for birds (and mammals) to diversify.

Recently, Jarvis et al. (2014) analysed the genomes of 48 bird species (shown below), representing all of the modern orders. The core data set consisted of 8,295 orthologous genes. This data set was used to investigate the relationships among the living orders of birds and to estimate their timescale of evolution.

In this practical you will use Bayesian phylogenetic methods to estimate the evolutionary timescale of 11 of these bird taxa. In Part 1, you will analyse a 5-gene data set using BEAST to co-estimate the phylogeny and evolutionary timescale. In Part 2, you will analyse a 12-gene data set using MCMCtree.
Part 1: Molecular dating using BEAST

BEAST

BEAST is a package of programs designed for Bayesian phylogenetic analysis, developed primarily by Alexei Drummond and Andrew Rambaut. It implements a wide range of evolutionary models, including a variety of tree priors and clock models. BEAST is one of the most widely used programs for Bayesian phylogenetic analysis, and is probably the most commonly used for molecular dating.

The program is particularly useful for analyses of sequences sampled at the population level, with a range of coalescent-based models being available. For example, various demographic parameters can be estimated from a sample of sequences from a population, and skyline plots can be used to estimate population size through time. BEAST can also handle species-level sampling, with a number of tree priors based on speciation models.

A notable feature of BEAST is that all of its analyses are based on rooted phylogenetic trees. This means that a model of among-lineage rate variation is assumed in every analysis. A range of clock models are available in BEAST, including uncorrelated relaxed clocks and local clocks. The program can also explicitly handle time-structured data, such as those sampled from viruses or ancient DNA.

The BEAST package includes a number of programs that are designed to carry out various tasks associated with Bayesian phylogenetic analysis. BEAUti is used to prepare input files for BEAST, while LogCombiner and TreeAnnotator are used to process the log files produced by the MCMC analysis.
Preparing an input file using BEAUti

BEAST reads input files formatted in XML, a language that shares similarities with HTML. Input files can be prepared by BEAUti, which has a graphical user interface.

After launching BEAUti, the first step is to load the sequence data into the program. Select “Import Alignment” from the “File” menu and import the 5 data files birds.locus044.nex, birds.locus081.nex, birds.locus223.nex, birds.locus225.nex, and birds.locus260.nex. These are the nucleotide alignments of 5 of the 8,295 genes analysed by Jarvis et al. (2014). Each alignment contains sequences from 11 birds and 1 outgroup taxon (alligator).

You should now be in the “Partitions” section of BEAUti. The window will display some of the characteristics of the data that you have loaded. In our analysis, we will assign a separate substitution model to each of the 5 genes. To do this, select all of the genes and unlink their substitution models. We will assume that the 5 genes share the same clock model and the same tree topology, so leave these linked across genes.

Go to the “Taxon Sets” section. Here we can define groups of sequences that might be of interest. In the current analysis, we are interested in some of the nodes of the tree that can be used for age calibration. Taxon sets can be created by clicking on the “+” symbol in the bottom-left of the BEAUti window. Create a taxon set and call it “Birds”. In this taxon set, we want to include the 11 ingroup sequences (everything except the alligator). Select these taxa and click on the green arrow to put them in the “Included Taxa” window. Do not check the boxes in the columns “Mono?” or “Stem?”. Create another taxon set called “Galloanseres” that contains the mallard and red junglefowl.

Skip the “Tips” section, which would normally be used to define the ages of time-structured sequence data. Skip the “Traits” section, which would normally be used to add trait data to the taxa in the data set.

Go to the “Sites” section. Here we choose the nucleotide substitution model. In the current analysis, we will use the HKY+G model of nucleotide substitution for each gene. To specify this model, select “HKY” for “Substitution Model”, “Estimated” for “Base Frequencies”, and “Gamma” for “Site Heterogeneity Model”. Leave “Number of Gamma Categories” at 4.
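To make the HKY part of the HKY+G model more concrete, the following Python sketch builds an HKY rate matrix by hand. It is only an illustration: the function name, the kappa value, and the base frequencies are hypothetical choices, BEAST estimates these quantities from the data, and the gamma-distributed rate variation among sites is not shown.

```python
# Illustrative sketch only: builds an HKY85 rate matrix by hand.
# The kappa value and base frequencies used below are made-up example
# values, not estimates from the bird data.

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(kappa, freqs):
    """Return a 4x4 HKY85 rate matrix (rows and columns ordered A, C, G, T).

    Off-diagonal entry (i, j) is kappa * freqs[j] for transitions and
    freqs[j] for transversions. Each diagonal entry makes its row sum to
    zero, and the matrix is rescaled so that the expected substitution
    rate at equilibrium is 1.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = (kappa if (x, y) in TRANSITIONS else 1.0) * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the sum is taken
    # Expected number of substitutions per unit time at equilibrium.
    scale = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[value / scale for value in row] for row in q]


example_freqs = [0.3, 0.2, 0.2, 0.3]  # hypothetical A, C, G, T frequencies
Q = hky_rate_matrix(kappa=4.0, freqs=example_freqs)
for base, row in zip(BASES, Q):
    print(base, " ".join(f"{value:8.4f}" for value in row))
```

With kappa greater than 1, transitions (A to G, C to T) occur at a higher rate than transversions, which is the single feature that distinguishes HKY from simpler models with unequal base frequencies.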
Go to the “Clocks” section. Here we need to choose the clock model that we want to use in our analysis. We will use a “strict clock” model, which assumes that all lineages evolve at the same rate. Because we want to estimate the evolutionary rate, check the “Estimate” box and enter a value of 10⁻³ (you need to type this as “1e-3” or simply as “0.001”).

Go to the “Trees” section. Here we need to choose the prior distribution for the tree in our analysis. In the drop-down menu next to “Tree Prior”, there are various models that can be used to generate a prior distribution for the tree. In the current analysis, we are dealing with sequences from different species, which means that we need to use one of the speciation models. The “Coalescent” models are only appropriate for population-level analyses. For this analysis, we only have 12 taxa so we shall choose the simplest speciation model, which is the Yule process. This is a pure-birth model in which all lineages have an equal chance of splitting into two descendant lineages. Choose the “Speciation: Yule Process” model.

Skip the “States” section, which would normally be used to activate the estimation of ancestral states.

Go to the “Priors” section. Here we need to choose prior distributions for the various parameters in the analysis. Most of the default choices can be left as they are. However, this is where we need to specify our calibrations:

• There is extensive fossil evidence for the split between the alligator and the ingroup (birds). We can specify an age range of 239–250.4 million years for this evolutionary divergence. Click on the “Using Tree Prior” box next to “treeModel.rootHeight” and change the prior to a “Uniform Distribution” with a minimum of 239 and a maximum of 250.4. Use an initial value within this interval. Note that we are giving the dates in Myr.

• The oldest fossil taxon that can be assigned to the crown bird clade is Vegavis iaai, which has been dated at 66 million years. Despite extensive study, there has been no evidence of any earlier Cretaceous fossil taxa that can be assigned to modern bird lineages. We can use this information to place a maximum constraint of 99.6 million years (the beginning of the Late Cretaceous) on the age of modern birds. Change the prior on “tmrca(Birds)” to a uniform(66, 99.6) distribution. Use an initial value within this interval.

• The fossil Anatalavis oxfordi can be assigned to the duck lineage. We can use this information to calibrate the divergence between ducks and chickens. Change the prior on “tmrca(Galloanseres)” to a uniform(51, 99.6) distribution.

We also need to specify the prior distribution for the substitution rate (“clock.rate”). Here we can use the CTMC Reference Prior, which is an uninformative prior distribution.

Skip the “Operators” section. This section lists the mechanisms for proposing changes to the tree and parameter values during the MCMC analysis. The default settings are fine.

Go to the “MCMC” section. Here we need to choose the settings for the MCMC analysis. To keep the analysis fairly short, use a chain length of 5,000,000 steps and sample every 500 steps. Choose the names of your output files by typing a desired name into the field next to “File name stem”. Something like “birds.strictclock” should be fine. Uncheck the box next to “Create operator analysis file”.

Now click on “Generate BEAST file” to create the input file for BEAST. Keep BEAUti open because we will want to change some settings later.
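The MCMC settings above determine how many samples end up in the output files. The short Python sketch below works through that arithmetic. The variable names are illustrative, and the 10% burn-in figure is the rule of thumb that these notes apply later when the results are examined.

```python
# Arithmetic sketch for the MCMC settings described above.
chain_length = 5_000_000   # total number of MCMC steps
sample_every = 500         # steps between logged samples

n_samples = chain_length // sample_every

burnin_percent = 10        # rule of thumb used later in these notes
burnin_samples = n_samples * burnin_percent // 100
burnin_states = burnin_samples * sample_every

print("samples written to each output file:", n_samples)
print("samples discarded as burn-in:", burnin_samples)
print("equivalent number of MCMC steps (states):", burnin_states)
```

Keeping the distinction between samples and states in mind is useful later, because some programs ask for the burn-in as a number of samples and others as a number of states.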
Bayesian phylogenetic analysis using BEAST

Open the program BEAST and load the input file that you created using BEAUti. Uncheck the box next to “Use BEAGLE library if available” and click on the “Run” button.

You’ll notice that the program quits after a few seconds. BEAST has generated a random starting tree for the MCMC analysis, but the starting tree does not satisfy the calibration constraints that we have chosen. We can get around this problem by giving a starting tree with branch lengths that satisfy the constraints. The file birds.tree.nex contains such a tree.

Go back to BEAUti and click on the “Partitions” section. Drag and drop the tree file into this window. This will load the tree into BEAUti. Go to the “Trees” section. Click on the button next to “User-specified starting tree”. Note that this option was previously greyed out. Save the input file (you can replace the previous one that you generated).

Try running BEAST using the new input file. The program will again quit with an error. This time, the cause is a bug in the XML file created by BEAUti. You’ll need to open the XML file in a text editor and change the specification of the starting tree. This is the original version:

    <!-- Construct a starting tree that is compatible with specified clade heights-->
    <rescaledTree id="startingTree">
        <!-- The user-specified starting tree in a newick tree format. -->
        <newick usingDates="false">
            (Alligator:245.0,((GreatTinamou:11.25,Ostrich:11.25):67.5,((Mallard:55.0,RedJunglefowl:55.0):12.5,((AmericanFlamingo:11.25,RockPigeon:11.25):45.0,(Hoatzin:45.0,(EmperorPenguin:33.75,(BarnOwl:22.5,(Budgerigar:11.25,ZebraFinch:11.25):11.25):11.25):11.25):11.25):11.25):11.25):166.25);
        </newick>
    </rescaledTree>

Delete the lines that refer to “rescaledTree”, and change the opening tag of the newick block so that it reads:

    <newick id="startingTree">

Make sure that the quotation marks are plain straight quotes, because curly quotes are not valid in XML attributes. This part of the input file should now look like this:

    <!-- The user-specified starting tree in a newick tree format. -->
    <newick id="startingTree">
        (Alligator:245.0,((GreatTinamou:11.25,Ostrich:11.25):67.5,((Mallard:55.0,RedJunglefowl:55.0):12.5,((AmericanFlamingo:11.25,RockPigeon:11.25):45.0,(Hoatzin:45.0,(EmperorPenguin:33.75,(BarnOwl:22.5,(Budgerigar:11.25,ZebraFinch:11.25):11.25):11.25):11.25):11.25):11.25):11.25):166.25);
    </newick>

Run BEAST using the revised input file. While you are waiting for the analysis to finish, try answering the questions below.
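It can be reassuring to confirm that the starting tree really does satisfy the three calibrations. The Python sketch below is one way to do this outside BEAST. It is not part of the BEAST package: the small Newick parser and the helper names are written for this illustration only, and the parser handles just the simple format used here (names and branch lengths, no comments or quoted labels).

```python
# Sketch: check the user-specified starting tree against the calibrations.
# The parser below is a minimal illustration, not a general Newick reader.

NEWICK = (
    "(Alligator:245.0,((GreatTinamou:11.25,Ostrich:11.25):67.5,"
    "((Mallard:55.0,RedJunglefowl:55.0):12.5,"
    "((AmericanFlamingo:11.25,RockPigeon:11.25):45.0,"
    "(Hoatzin:45.0,(EmperorPenguin:33.75,(BarnOwl:22.5,"
    "(Budgerigar:11.25,ZebraFinch:11.25):11.25):11.25):11.25):11.25)"
    ":11.25):11.25):166.25);"
)


def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = "".join(text.split())
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # skip ")"
                break
        start = pos
        while text[pos] not in ":,);":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",);":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return parse_node()


def clade_heights(node, out):
    """Fill out with {frozenset of tip names: node height}; return (tips, height)."""
    name, _, children = node
    if not children:
        return frozenset([name]), 0.0
    tips, height = frozenset(), 0.0
    for child in children:
        child_tips, child_height = clade_heights(child, out)
        tips |= child_tips
        height = max(height, child_height + child[1])
    out[tips] = height
    return tips, height


def tmrca(heights, taxa):
    """Height of the smallest clade that contains all of the given taxa."""
    taxa = frozenset(taxa)
    return min(h for clade, h in heights.items() if taxa <= clade)


heights = {}
all_tips, root_height = clade_heights(parse_newick(NEWICK), heights)
birds = all_tips - {"Alligator"}

# Calibration bounds in Myr, as entered in the "Priors" section.
checks = {
    "root": (root_height, 239.0, 250.4),
    "Birds": (tmrca(heights, birds), 66.0, 99.6),
    "Galloanseres": (tmrca(heights, {"Mallard", "RedJunglefowl"}), 51.0, 99.6),
}
for label, (age, lower, upper) in checks.items():
    print(f"{label}: {age} Myr, within [{lower}, {upper}]: {lower <= age <= upper}")
```

A randomly generated starting tree would usually fail at least one of these checks, which is why BEAST stopped on the first attempt.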
Q. Given the available fossil information, would you consider the use of uniform priors for node ages to be a conservative approach?

Q. Why might the assumption of a strict clock be incorrect in this analysis?

Q. Assume that substitution rates vary substantially among bird lineages. Consider the date estimates that would be produced using the strict clock and the relaxed clock. In terms of accuracy and precision, how would you expect the date estimates to differ between these two clock models?
Accounting for rate variation across lineages

Different lineages of birds are likely to experience different rates of evolution. We will now take this into account by using a relaxed-clock model in our dating analysis.

Go back to BEAUti and click on the “Clocks” section. Select the uncorrelated lognormal relaxed clock. If you have already closed BEAUti, you’ll need to go back to the beginning of the practical notes and set everything up in the same way (except for the clock model).

Save the file under a name that is different from your previous file. Use something like “birds.relaxedclock” for “File name stem”.

If your computer has more than 1 processor, run BEAST using the new input file. Otherwise, you should wait until your first analysis (using the strict-clock model) is done before starting the second BEAST analysis.

Feel free to take a break at this point while you are waiting for your two analyses to finish running.

Q. When might it be more appropriate to use an autocorrelated relaxed clock rather than an uncorrelated relaxed clock?

Q. What is the random local clock and when might it be appropriate to use this model?
Comparison of clock models

Open the program Tracer, which is used to view the parameter samples that are drawn from the posterior distribution. Load the 2 .log files from your BEAST analyses. You can inspect the characteristics of the posterior distributions of parameters.

The first thing to check is that the effective sample sizes (ESSs) of all of our sampled parameters are greater than 200. This indicates that we have drawn enough samples to be able to produce a reasonable estimate of the posterior distribution of each parameter. The effective sample size is lower than the actual number of samples because the samples drawn from the MCMC are not entirely independent of each other. If any ESS values are below ~200, it means that we need to run the MCMC analysis for a greater number of steps. If this is the case, ignore it for the purposes of this practical.

We also want to draw our samples only from the stationary distribution. For this reason, we normally discard the first ~10% of samples. This is known as the “burn-in” phase. By default, Tracer excludes the first 10% of your samples when calculating the mean and other statistics.

Q. Look at the results from your analysis based on the strict-clock model. What are the mean and 95% HPD interval (=95% credible interval) for the estimate of the age of the crown bird clade, which is given by “tmrca(Birds)”? Does this match the prior distribution that we assigned to it?

Now we want to compare the fit of the two clock models to our data. This can be done using a Bayes factor, which compares the marginal likelihoods of the two models. To calculate the marginal likelihoods, we will use a quick (but not very good) method known as the harmonic-mean estimator.

Select both of the files in the top-left box of Tracer. Go to the “Analysis” menu and select “Model Comparison”. Select “Bayes factors” and click “OK”.

The Bayes factor now needs to be interpreted. The (natural) log of the Bayes factor is displayed. Values of log(BF) can be interpreted as follows: 1–3 positive support, 3–5 strong support, >5 decisive support.

Q. What is the log Bayes Factor of the strict clock compared with the uncorrelated lognormal relaxed clock? What level of support does this indicate?
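The model comparison described above can be sketched in a few lines of Python. This is a bare-bones illustration of the harmonic-mean idea and of the interpretation scale quoted in these notes. It is not the code used by Tracer, whose estimator may differ in detail, and the log-likelihood values are made-up numbers rather than output from the bird analyses.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Harmonic-mean estimate of the log marginal likelihood.

    log ML = log(N) - log(sum(exp(-logL_i))), computed stably in log space.
    """
    negated = [-x for x in log_likelihoods]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(negated)) - log_sum


def interpret_log_bf(log_bf):
    """Map the size of log(BF) onto the scale quoted in the text above.

    The text does not say which category the boundary values 3 and 5
    belong to; here 3 and 5 are both treated as "strong" (an assumption).
    """
    size = abs(log_bf)
    if size > 5:
        return "decisive"
    if size >= 3:
        return "strong"
    if size >= 1:
        return "positive"
    return "below the scale"  # assumption: the notes give no label below 1


# Hypothetical post-burn-in log-likelihood samples for two models.
strict_samples = [-12010.0, -12012.5, -12011.2, -12009.8]
relaxed_samples = [-12001.3, -12003.0, -12002.1, -12000.7]

log_bf = log_harmonic_mean(relaxed_samples) - log_harmonic_mean(strict_samples)
favoured = "relaxed clock" if log_bf > 0 else "strict clock"
print(f"log BF (relaxed vs strict) = {log_bf:.2f}")
print(f"{interpret_log_bf(log_bf)} support for the {favoured}")
```

The sign of log(BF) shows which model is favoured, and its size is what the interpretation scale refers to. Because the harmonic mean is dominated by the lowest likelihood values in the sample, the estimate can be unstable, which is one reason the method is described above as not very good.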
In the uncorrelated lognormal relaxed-clock model, we can also look at the estimate of the coefficient of variation of rates and the covariance of rates. The coefficient of variation is the standard deviation of the branch rates divided by the mean of the branch rates. This gives an indication of the amount of rate variation across lineages. Have a look at the posterior distribution of this statistic. Note that a value of 0 corresponds to a strict clock.

Q. Does the coefficient of variation of rates provide support for rate variation across branches in the bird tree?

The covariance of rates measures the correlation in rates between neighbouring branches. This gives an indication of how autocorrelated the branch rates are (i.e., whether the rate changes gradually throughout the tree). A value of 0 indicates that there is no rate autocorrelation.

Q. Does the covariance of rates provide support for rate autocorrelation across branches in the bird tree?

Now we want to try something that is a lot more complicated. We will assign a relative rate parameter to each gene, which will allow the genes to have distinct substitution rates. However, we want to keep a single underlying relaxed-clock model. So all 5 genes will share the same pattern of rate variation across branches, but the genes can evolve at different rates relative to each other.

To do this, you will need to edit your newest XML file (birds.relaxedclock.xml or similar) using a text editor. First, add a relative rate parameter to the substitution model for each gene like this (note that you should give a different name to the “.mu” parameter for each gene). The added text is the “relativeRate” element near the end of the block.

    <!-- site model -->
    <siteModel id="locus044.siteModel">
        <substitutionModel>
            <HKYModel idref="birds.locus044.hky"/>
        </substitutionModel>
        <gammaShape gammaCategories="4">
            <parameter id="birds.locus044.alpha" value="0.5" lower="0.0"/>
        </gammaShape>
        <relativeRate>
            <parameter id="birds.locus044.mu" value="1.0" lower="0.0" upper="1000"/>
        </relativeRate>
    </siteModel>
You will then need to construct a compound parameter that is a vector of the five relative rates. Add this after the last of the five site models:

    <compoundParameter id="allMus">
        <parameter idref="birds.locus044.mu"/>
        <parameter idref="birds.locus081.mu"/>
        <parameter idref="birds.locus223.mu"/>
        <parameter idref="birds.locus225.mu"/>
        <parameter idref="birds.locus260.mu"/>
    </compoundParameter>

Next, you will need to add an operator that proposes changes to the relative rates during the MCMC. Add this to the top of the “operators” block. The added text is the “deltaExchange” element.

    <operators id="operators" optimizationSchedule="default">
        <deltaExchange delta="0.75" parameterWeights="690 897 597 735 960" weight="1">
            <parameter idref="allMus"/>
        </deltaExchange>

The 5 parameter weights correspond to the lengths of the 5 genes. These values are needed because the relative rates need to be weighted by sequence length. This operator proposes changes to the relative rates, while keeping their weighted mean at a value of 1.

We should now add the parameter to the “fileLog” section, so that it is recorded in the .log output file. Add this line above the parameters of the relaxed-clock model. The added text is the line that refers to “allMus”.

    <parameter idref="allMus"/>
    <parameter idref="ucld.mean"/>
    <parameter idref="ucld.stdev"/>

The last thing left to do is to change the names of the output files. You can do a find-and-replace of “birds.relaxedclock”, changing it to “birds.relaxedclock2”.

Run BEAST using the modified XML file and have a look at the .log file in Tracer.

Q. What are the relative rates of the 5 genes?
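The idea behind the operator on the relative rates can be illustrated with a short Python sketch. This is a simplified stand-in written for these notes, not BEAST's own implementation: the function names are hypothetical, and the loop only generates proposals, whereas a real MCMC would accept or reject each proposal according to the posterior. The point is to show how a change to one rate can be balanced by a change to another so that the length-weighted mean stays at 1.

```python
import random

# Gene lengths, taken from the parameterWeights attribute shown above.
WEIGHTS = [690, 897, 597, 735, 960]


def weighted_mean(rates, weights):
    """Mean of the rates, weighted by gene length."""
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


def delta_exchange(rates, weights, delta, rng):
    """Propose new relative rates that leave the weighted mean unchanged.

    A random amount d is added to one rate, and a compensating amount is
    subtracted from another so that sum(rate * weight) stays the same.
    Proposals that would make a rate non-positive are rejected here.
    """
    i, j = rng.sample(range(len(rates)), 2)
    d = rng.uniform(0.0, delta)
    proposal = list(rates)
    proposal[i] += d
    proposal[j] -= d * weights[i] / weights[j]
    if proposal[j] <= 0.0:
        return list(rates)
    return proposal


rng = random.Random(1)  # fixed seed so that the example is repeatable
rates = [1.0] * 5       # starting values, as in the XML above
for _ in range(1000):
    rates = delta_exchange(rates, WEIGHTS, delta=0.75, rng=rng)

print("relative rates:", [round(r, 3) for r in rates])
print("weighted mean:", round(weighted_mean(rates, WEIGHTS), 6))
```

Because the weighted mean is fixed at 1, the relative rates reported in Tracer can be read as multipliers of the overall mean rate, so a gene with a relative rate of 2 evolves twice as fast as the length-weighted average.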
Timing the diversification of modern birds

We should now have a look at the estimate of the tree and divergence times. Open the program TreeAnnotator. This program is used to process the .trees file from BEAST. It reads all of the sampled trees and summarises the information in the form of a single tree.

In the box next to “Burnin (as states)”, enter the value “1000”. The aim is to throw out the first 1000 samples (10%) because we are regarding them as “burn-in”. Note that if your version of TreeAnnotator counts the burn-in in states (MCMC steps) rather than in sampled trees, 1000 samples correspond to 500,000 states, because we sampled every 500 steps.

For the “Input Tree File”, click “Choose File” and select the .trees file produced by the BEAST analysis using the relaxed-clock model (use the earlier one, without the relative rates). For the “Output File”, click “Choose File” and select the directory where you want to save the output file from TreeAnnotator. Give the output file the name birds.beast.tre and click “Run”.

Use FigTree to view the file birds.beast.tre produced by TreeAnnotator. The summary tree for your Bayesian phylogenetic analysis will be displayed. We are mainly interested in two features of the tree.

First, we want to see where the passerine (Zebra Finch) has been placed. Check the box next to “Node Labels” and select “posterior” in the drop-down menu next to “Display”. This will label the nodes of the tree with posterior probabilities, which indicate the support for each of the groupings represented in the tree.

Q. Where has the Zebra Finch been placed, and with what posterior probability?

We are also interested in the age of modern birds. In the “Node Labels” box, select “height” in the drop-down menu next to “Display”. This will label the nodes of the tree with the estimated ages in Myr. You can also view the 95% HPD intervals by selecting “height_95%_HPD” in the drop-down menu next to “Display”.

Q. What are the mean and 95% HPD (credibility) interval for the estimate of the age of the modern bird clade? Does this correspond to the mass extinction at the end of the Cretaceous period (66 million years ago)?
Part 2: Genome-scale dating using MCMCtree

MCMCtree

Bayesian phylogenetic programs such as BEAST and MrBayes are very flexible and implement a wide range of useful evolutionary models, but they are very computationally intensive. These methods generally cannot be used for analysing large data sets, especially genome-scale data sets. For example, the analyses in Part 1 of this practical only involved a tiny fraction of the 8,295 genes from 48 taxa that were analysed by Jarvis et al. (2014). Quicker methods are needed to handle such data sets.

MCMCtree, part of the PAML package by Ziheng Yang, is a Bayesian dating program that can reduce the computational demand by using approximate likelihood calculation. This involves a 2-step process. First, an approximation of the likelihood function is constructed from the sequence data. This step produces a variance-covariance matrix. Second, the matrix is used for estimating divergence times. Note that the sequence data are not analysed directly in this step.

The program can implement the strict clock, autocorrelated lognormal clock, and uncorrelated lognormal clock. MCMCtree is able to implement age constraints with soft bounds. A limitation of the program is that the tree topology needs to be fixed. The user needs to provide a tree file that includes the calibrations.

Preparing a control file

For this part of the practical, you will be looking at 2 different data files: birds5.phy, which contains the 5 genes that were analysed in Part 1 of this practical, and birds20.phy, which is a larger data set that contains 20 genes from the same taxa. Also provided is the control file mcmctree.ctl, which contains the settings for MCMCtree. Open mcmctree.ctl in a text editor and have a look at the contents.

First we need to set the random seed used by MCMCtree. When set to -1, MCMCtree will use the computer’s current time to set the seed. You can choose a specific positive number here if you want the results to be exactly reproducible.

    seed = -1

The names of various files are then given: the file containing the sequence data, the file containing the tree and calibrations, and the output file.

    seqfile = birds5.phy
    treefile = birds.cal.tre
    outfile = birds5.out.txt
We then give the number of data subsets. Here we will treat the entire alignment as a single subset. A potential problem with defining many subsets in MCMCtree is that each subset is assigned an independent clock model. This can lead to a lot of parameters if we are using relaxed-­‐clock models. ndata = 1 seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs The next part of the control file specifies whether the standard or approximate likelihood calculation is used. A value of 0 tells MCMCtree to sample from the prior only. We will start with this so that we can investigate the calibrations (we will discuss this in more detail below). usedata = 0 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV Next we specify the clock model and a maximum age constraint on the root. We will use the uncorrelated lognormal relaxed clock, which is called the “independent rates” model here. We will also specify a maximum age of 250.4 Myr for the root. Note that in MCMCtree the convention is to give time in units of 100 Myr. So 250.4 Myr is given as “2.504”. clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates RootAge = '<2.504' * safe constraint on root age, used if no fossil for root. We will use the HKY+G model here, with four rate categories. model = 4 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85 alpha = 0.5 * alpha for gamma rates at sites ncatG = 4 * No. categories in discrete gamma We will choose to keep sites with gaps and missing data. cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? Next we need to set the tree prior. Here we are setting the parameters that control the birth-­‐death process. This process is used to construct the prior for the node times in the tree that do not have calibrations. Here we are using the default 1 1 0, which generates a uniform prior on node ages. The effects of varying these values are not well known, and it can be worthwhile to test different values of these when you are doing a dating analysis of your data with MCMCtree. 
    BDparas = 1 1 0 * birth, death, sampling
Q. What is the difference between the Yule and birth-death speciation models? Do you think that these models would generate different tree priors, and in what way (hint: consider the expected lengths of branches throughout the tree)?

We also set some gamma priors for the transition/transversion ratio and the alpha shape parameter of the gamma distribution for rates across sites.

    kappa_gamma = 6 2 * gamma prior for kappa
    alpha_gamma = 1 1 * gamma prior for alpha

We now need to set the gamma priors for parameters in the clock model. “rgene” is the mean substitution rate, whereas “sigma2” controls the standard deviation of the rate. In the “independent rates” clock model, “sigma2” reflects the degree of rate variation across branches. In the autocorrelated clock model, “sigma2” reflects the degree of autocorrelation between ancestral and descendant branches.
    rgene_gamma = 1 10 * gamma prior for overall rates for genes
    sigma2_gamma = 1 1 * gamma prior for sigma^2 (for clock=2 or 3)

We can leave the details of the MCMC proposal mechanisms at the default values:

    finetune = 1: .05 0.1 0.12 0.1 .3 * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr

Now the MCMC settings are given. Here we will discard a burn-in of 100,000 steps and then draw 10,000 samples, one every 100 steps, from a further 1,000,000 steps. This is not a lot of steps compared with what we used for MrBayes and BEAST. Bear in mind, however, that we are not estimating the tree topology in MCMCtree.

    print = 1
    burnin = 100000
    sampfreq = 100
    nsample = 10000
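The arithmetic behind these settings can be checked with a few lines of Python. This is only a sketch: the function name `mcmc_length` is made up for illustration, and it assumes that the chain runs `sampfreq * nsample` steps after the burn-in, as described above.

```python
def mcmc_length(burnin, sampfreq, nsample):
    """Return (sampling_steps, total_steps) for the given MCMC settings.

    Assumes the sampling phase runs sampfreq * nsample steps after
    the burn-in has been discarded.
    """
    sampling_steps = sampfreq * nsample
    return sampling_steps, burnin + sampling_steps


# Settings used in this practical.
sampling, total = mcmc_length(burnin=100000, sampfreq=100, nsample=10000)
print(sampling, total)  # prints: 1000000 1100000
```

If you later double `nsample` to improve the effective sample sizes, the same function shows how much longer the run will be.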
Preparing a tree file with calibrations

MCMCtree requires a tree file that also contains the calibrations. Open the file birds.tre in a text editor. This is the tree topology that will be used in the analysis. However, we need to add the calibrations to this tree. We will implement all 3 calibrations as uniform priors, but note that these have soft bounds in MCMCtree. In other words, there is a 2.5% tail of probability at each end of the uniform interval.

First, we will add a uniform prior from 51 to 99.6 Myr for the split between mallard and red junglefowl. Remember that time is given in units of 100 Myr. The text that you need to add to the tree is the quoted calibration that follows the closing parenthesis:

    (Mallard,RedJunglefowl)'>.51<.996'

Then add the calibrations for the alligator-bird split and for the node uniting all modern birds. This is a bit more difficult (be careful with the parentheses):

    (Budgerigar,ZebraFinch)))))))'>.66<.996')'>2.39<2.504';

Save the file as birds.cal.tre, then open the file in FigTree and display the values at the internal nodes to check that you have added the calibrations correctly. It should look like this:

[Figure: the calibrated tree displayed in FigTree, with calibrations shown at the internal nodes]

When we have calibrations at multiple nodes in the tree, they can interact so that the marginal prior densities differ from the ones that we specified. We can investigate this by running the analysis without data, which is equivalent to sampling from the prior. Ensure that you have set “usedata” to 0 in the control file mcmctree.ctl, then run MCMCtree

    ./mcmctree

and open the output file mcmc.txt in Tracer. There are three parameters of interest here. “t_n13” is the age of the root, “t_n14” is the age of the node uniting modern birds, and “t_n17” is the age of the node uniting mallard and red junglefowl.

Q. Do these distributions differ from the 3 calibration densities that we chose above?
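If you would like a numerical cross-check alongside Tracer, the prior samples can also be summarised with a short script. The sketch below assumes that mcmc.txt is a tab-delimited text file with a header row containing the columns “t_n13”, “t_n14” and “t_n17”; please check this against your own file. The function names are made up for illustration, and the bounds are the three calibrations entered above, in units of 100 Myr.

```python
import csv

# Calibration bounds from birds.cal.tre, in units of 100 Myr.
CALIBRATIONS = {
    "t_n13": (2.39, 2.504),  # root (alligator-bird split)
    "t_n14": (0.66, 0.996),  # node uniting modern birds
    "t_n17": (0.51, 0.996),  # mallard / red junglefowl split
}


def quantile(sorted_values, q):
    """Quantile of an already sorted list, by linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def summarise_prior(lines, calibrations=CALIBRATIONS):
    """Summarise sampled node ages for each calibrated node.

    `lines` is an open file (or any iterable of text lines) in the
    assumed tab-delimited format with a header row.
    """
    rows = list(csv.DictReader(lines, delimiter="\t"))
    summary = {}
    for name, (lower, upper) in calibrations.items():
        values = sorted(float(row[name]) for row in rows)
        outside = sum(1 for v in values if v < lower or v > upper)
        summary[name] = {
            "mean": sum(values) / len(values),
            "q025": quantile(values, 0.025),
            "q975": quantile(values, 0.975),
            "frac_outside": outside / len(values),
        }
    return summary
```

To use it, open the file and pass the handle to the function, for example `with open("mcmc.txt") as handle: print(summarise_prior(handle))`. With soft bounds and no interaction between calibrations, you would expect about 5% of the samples for a calibrated node to fall outside its interval, so a very different fraction hints that the marginal prior has shifted.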
Running MCMCtree

Now we will proceed with the dating analysis in MCMCtree. Go back to the control file mcmctree.ctl and look at the line that begins with “usedata”. If we give a value of 1, the standard likelihood is calculated. Here we want to use the approximate likelihood method, so we first need to specify a value of 3. This will tell MCMCtree to produce a file that contains the variance-covariance matrix that will be used for constructing an approximation of the likelihood.

    usedata = 3 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV

From a command line, go to the folder containing these files and type

    ./mcmctree

When MCMCtree is done, you will see an output file called out.BV. This is the variance-covariance matrix that we want to use for the dating analysis. Change the name of this file to in.BV. Open mcmctree.ctl in a text editor and change the settings for the analysis to:

    usedata = 2 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV

Return to the command line and run MCMCtree

    ./mcmctree

When MCMCtree is done, it will produce a file called FigTree.tre. As its name suggests, you can open this in FigTree to see the results. You can also show the 95% credible intervals of the date estimates.

Q. What are the mean and 95% HPD (credible) interval for the estimate of the age of the modern bird clade? Does this correspond to the mass extinction at the end of the Cretaceous period (66 million years ago)?

Now repeat the MCMCtree analysis using the data file that contains 20 genes (birds20.phy). You can use the same control file (modifying it to change the name of the data file) and the same tree file. To avoid confusion, it is best to run this analysis in a new folder.
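When you repeat the analysis, the two-step procedure (run with usedata = 3, rename out.BV to in.BV, run again with usedata = 2) can be scripted rather than done by hand. The following is a rough sketch only: both function names are made up for illustration, and it assumes that the folder contains the mcmctree executable, the control file mcmctree.ctl, the tree file and the sequence file, exactly as in the manual steps above.

```python
import re
import subprocess
from pathlib import Path


def set_usedata(ctl_text, value):
    """Return the control-file text with the usedata value changed.

    Only the number after "usedata =" is replaced; the trailing
    comment on that line is left untouched.
    """
    if value not in (0, 1, 2, 3):
        raise ValueError("usedata must be 0, 1, 2 or 3")
    pattern = re.compile(r"^(\s*usedata\s*=\s*)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), ctl_text)
    if count != 1:
        raise ValueError("expected exactly one usedata line, found %d" % count)
    return new_text


def run_approximate_likelihood(folder="."):
    """Run the two MCMCtree steps described above inside `folder`."""
    folder = Path(folder)
    ctl = folder / "mcmctree.ctl"

    # Step 1: usedata = 3 writes out.BV.
    ctl.write_text(set_usedata(ctl.read_text(), 3))
    subprocess.run(["./mcmctree"], cwd=folder, check=True)

    # Step 2: rename out.BV to in.BV and rerun with usedata = 2.
    (folder / "out.BV").replace(folder / "in.BV")
    ctl.write_text(set_usedata(ctl.read_text(), 2))
    subprocess.run(["./mcmctree"], cwd=folder, check=True)
```

Calling `run_approximate_likelihood("birds20")` would then carry out both steps in a folder named birds20 (a hypothetical folder name). Remember that you still need to change “seqfile” in the control file yourself.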
Infinite-sites plot

Now we want to examine the effect of increasing the amount of data on the estimates of divergence times. Open the .out.txt files from the 2 MCMCtree analyses and have a look at the end of each file. You will see a table that summarises the estimates of node times, along with their 95% credible intervals:

    Posterior mean (95% Equal-tail CI) (95% HPD CI) HPD-CI-width

    t_n13    2.4471 (2.3900, 2.5038) (2.3897, 2.5032) 0.1135 (Jnode 22)
    t_n14    0.8979 (0.7191, 1.0037) (0.7374, 1.0107) 0.2733 (Jnode 21)
    t_n15    0.6523 (0.3672, 0.8600) (0.3921, 0.8711) 0.4790 (Jnode 20)
    t_n16    0.8123 (0.6474, 0.9458) (0.6559, 0.9490) 0.2931 (Jnode 19)
    t_n17    0.5858 (0.4937, 0.7439) (0.4840, 0.7260) 0.2419 (Jnode 18)
    t_n18    0.6947 (0.4893, 0.8568) (0.5092, 0.8678) 0.3586 (Jnode 17)
    t_n19    0.6165 (0.3612, 0.7990) (0.3897, 0.8160) 0.4263 (Jnode 16)
    t_n20    0.6767 (0.4690, 0.8404) (0.4900, 0.8532) 0.3633 (Jnode 15)
    t_n21    0.6550 (0.4394, 0.8225) (0.4555, 0.8330) 0.3774 (Jnode 14)
    t_n22    0.6048 (0.3782, 0.7773) (0.3973, 0.7900) 0.3927 (Jnode 13)
    t_n23    0.4766 (0.2500, 0.6847) (0.2500, 0.6843) 0.4344 (Jnode 12)
    mu       0.0880 (0.0609, 0.1355) (0.0564, 0.1250) 0.0686
    sigma2   0.4579 (0.1846, 1.0344) (0.1285, 0.8755) 0.7470
    lnL    -10.3507 (-17.4950, -5.0300) (-16.8120, -4.5820) 12.2300

Plot the widths of the 95% CIs (second-last column) against the mean estimates (second column) for the 11 node times (“t_n13” to “t_n23”). You can do this in your program of choice, such as Microsoft Excel or R.

Q. What is the relationship between the 95% CI widths and the mean estimates?
Q. With an increasing amount of sequence data, we would expect to see a positive linear relationship between 95% CI widths and the mean estimates of node times. Why might this pattern not be evident here?
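If you prefer a script to a spreadsheet, the relationship in the infinite-sites plot can be quantified with a few lines of Python. The sketch below uses the posterior means and HPD widths copied from the example table above (your own output will differ). It computes the Pearson correlation and the slope of a regression line forced through the origin, which is one common way of summarising an infinite-sites plot. The function name is made up for illustration.

```python
# (node, posterior mean, 95% HPD width), copied from the example table.
NODE_ESTIMATES = [
    ("t_n13", 2.4471, 0.1135),
    ("t_n14", 0.8979, 0.2733),
    ("t_n15", 0.6523, 0.4790),
    ("t_n16", 0.8123, 0.2931),
    ("t_n17", 0.5858, 0.2419),
    ("t_n18", 0.6947, 0.3586),
    ("t_n19", 0.6165, 0.4263),
    ("t_n20", 0.6767, 0.3633),
    ("t_n21", 0.6550, 0.3774),
    ("t_n22", 0.6048, 0.3927),
    ("t_n23", 0.4766, 0.4344),
]


def correlation_and_slope(means, widths):
    """Return (Pearson r, slope of the least-squares line through the origin)."""
    n = len(means)
    mean_x = sum(means) / n
    mean_y = sum(widths) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    syy = sum((y - mean_y) ** 2 for y in widths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, widths))
    r = sxy / (sxx * syy) ** 0.5
    slope_origin = sum(x * y for x, y in zip(means, widths)) / sum(x * x for x in means)
    return r, slope_origin


means = [mean for _, mean, _ in NODE_ESTIMATES]
widths = [width for _, _, width in NODE_ESTIMATES]
r, slope = correlation_and_slope(means, widths)
print("Pearson r = %.3f, slope through origin = %.3f" % (r, slope))
```

Running the same calculation on the estimates from the 5-gene and 20-gene analyses lets you compare how close each data set comes to the expected linear relationship.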