Homework II remarks: Bootstrap analysis

[Figure: the ML tree shown alongside two bootstrap replicate trees, Rep 1 and Rep 2.]

Step 1: Replicate sites, sampled at random with replacement.
Step 2: Build a tree for each replicate.
Step 3: Build a consensus of the clades. In the example, clades are marked as present in all three trees, present in 2 trees but not the ML tree, or present in one tree.

Which of these statements about bootstrap analysis is NOT true?

A. Bootstrap values provide a measure, similar to a confidence level, that we use to assess the strength of support (or lack thereof) for each clade on a phylogenetic tree.
B. The bootstrap can help us decide which model of sequence evolution to use for phylogenetic inference.
C. A small bootstrap value on a branch suggests that only a few sites in the alignment support the clade formed by that branch.
D. A high bootstrap value (100%) assures us that the sequence data is not biased. We can be sure that the taxa connected by that branch are truly joined by common descent rather than some quirky feature of the sequence that we have chosen to analyse.

Answers:

A. True. This is why we do the test.
C. True. The sites that support the clade are too rare to be sampled much.
B. NOT true. The bootstrap tells us nothing about whether we used the right test or model. We compare log likelihood scores to compare models.
D. NOT true. We cannot detect bias with the bootstrap.
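The three bootstrap steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the alignment is a toy, and `closest_pair_clades` is a deliberately crude stand-in for tree building (it just groups the two most similar taxa), not an ML, MP or NJ search.

```python
import itertools
import random
from collections import Counter


def resample_sites(alignment, rng):
    """Step 1: sample alignment columns at random, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def closest_pair_clades(alignment):
    """Toy stand-in for tree building (NOT a real tree search).

    Returns a set holding one clade: the two taxa with the fewest differences.
    """
    def distance(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    pairs = itertools.combinations(sorted(alignment), 2)
    return {frozenset(min(pairs, key=distance))}


def bootstrap_support(alignment, build_clades, n_reps=100, seed=0):
    """Steps 1-3: resample, rebuild, then score each clade of the original tree."""
    rng = random.Random(seed)
    reference = build_clades(alignment)
    counts = Counter()
    for _ in range(n_reps):
        replicate = resample_sites(alignment, rng)  # Step 1
        counts.update(build_clades(replicate))      # Step 2
    # Step 3: percentage of replicates that recover each reference clade
    return {clade: 100.0 * counts[clade] / n_reps for clade in reference}


# Hypothetical toy alignment: A and B are identical, C and D are not.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAC",
    "C": "CGTACGTACG",
    "D": "TTGGCCAATT",
}
support = bootstrap_support(toy_alignment, closest_pair_clades, n_reps=50)
for clade, pct in support.items():
    print(sorted(clade), pct)
```

In a real analysis the `build_clades` argument would be replaced by a call to a proper tree-building program, and the support values would be written onto the branches of the ML tree.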
Why do we need bootstrap analysis for MP, ML and distance (NJ) trees? Why can't we just list the optimal tree and leave it at that?

The bootstrap analysis gives us some sense of our confidence in the ML tree, and specifically in individual clades.

Homework II remarks: ML analysis

[Table: models ordered by degrees of freedom (df), the number of free model parameters, with the smallest and next best log likelihoods marked.]

Which parameter has the largest effect on the fit of the model to the sequence data?

Parameters to estimate:
• Ts:Tv (2 STs, substitution types)
• Ts:Tv + base frequencies
• 6 STs + base frequencies
• Ts:Tv + G (rate heterogeneity)
• Ts:Tv + base frequencies + G
• 6 STs + base frequencies + G

Effect of adding each kind of parameter:
• Add base frequency parameters: change in -lnL = 1901
• Add substitution types (ST): change in -lnL = 1830
• Add across-site rate heterogeneity (G): change in -lnL = 9765 in one comparison and 11,337 in another
• Add across-site rate heterogeneity (G): change in -lnL = 9957

In fact, the Gamma parameter is so important for this alignment that a simpler model that includes it (5 fewer parameters than GTR) fits better than more complex models.

More on model choice for ML and Bayesian analysis

Hierarchical log likelihood ratio test (hLRT): model comparison for nested models.

Comparison of 12 models of evolution for mammalian sequences. Output is from PAUP 4.0.

Models nested within the general time reversible (GTR) model:

Equal base frequencies
• JC69: 1 substitution type (ST)
• K80: 2 STs (transitions and transversions)

Unequal base frequencies
• F81: 1 ST
• HKY/F84: 2 STs
• GTR: 6 STs (A<-->G, A<-->T, . . .)

Nested models: compare a "larger" model (more parameters) to any smaller model (fewer parameters) nested within it. E.g. compare GTR+I+G (10 df) to HKY+I+G (7 df). The tabled scores are -lnL values, so with 10 - 7 = 3 degrees of freedom:

$\chi^2_3 = -2\,[(-\ln L_{\text{large}}) - (-\ln L_{\text{small}})] = -2\,(268402 - 268959) = 1114$

• The larger model is a significantly better fit (p < 0.05) if $\chi^2_3 \ge 7.82$.
• So the GTR+I+G model is a highly significantly better fit to the sequence data than HKY+I+G.

Non-nested models: compare any model to any other model. E.g. compare the HKY+I+G model vs. the GTR model. HKY is nested in GTR, HKY is nested in GTR+I+G, and HKY+I+G is nested in GTR+I+G. But HKY+I+G is not nested in GTR.

Akaike Information Criterion:

$\mathrm{AIC}_i = -2\ln L_i + 2k_i$

where $k_i$ is the number of parameters (df) for model i.
• Especially useful for non-nested models.
• Incorporates a penalty for each parameter.
• Smallest AIC --> best model

hLRT and AIC are calculated and tabled. The model with the highest likelihood or smallest AIC or BIC is chosen for the final search.

Gamma (G) parameter

What about the Gamma parameter from homework 2? What is it?
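As a rough illustration of what the Gamma shape parameter controls, the sketch below draws per-site relative rates from gamma distributions with shape values α = 20, 5, 1, 0.5 and 0.1. It assumes the common convention that the mean rate is fixed at 1 (scale = 1/α), which these slides do not state. The function names, the sample size and the 0.1 cut-off for "nearly invariant" sites are illustrative choices only.

```python
import random
import statistics


def gamma_rates(alpha, n_sites, rng):
    """Draw per-site relative rates from a gamma distribution with mean 1.

    random.gammavariate(shape, scale) has mean shape * scale,
    so scale = 1 / alpha fixes the mean at 1 (an assumed convention).
    """
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def summarize(alpha, n_sites=20000, seed=1):
    """Summarize simulated rates for one value of the shape parameter."""
    rng = random.Random(seed)
    rates = gamma_rates(alpha, n_sites, rng)
    return {
        "alpha": alpha,
        "mean": statistics.fmean(rates),
        "variance": statistics.pvariance(rates),
        # share of sites evolving at under a tenth of the average rate
        "share_nearly_invariant": sum(r < 0.1 for r in rates) / n_sites,
    }


summaries = [summarize(a) for a in (20, 5, 1, 0.5, 0.1)]
for s in summaries:
    print(s)
```

Under this parameterization the variance of the rates is 1/α, so small α means that a few sites change quickly while many barely change, and large α means that nearly all sites evolve at about the same rate.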
[Figure: gamma distributions of substitution rates (x-axis) against proportion of sites (y-axis), for α = 20, 5, 1, 0.50 and 0.1.]

The Gamma distribution describes across-site rate variation with a shape parameter α.
• If α is large, there is very little across-site rate variation.
• If α is small, there is more across-site rate variation.

Gamma rate heterogeneity (α) differs across codon position:
• 1st codon position, α = 1.10: slow average rate but moderate rate heterogeneity.
• 2nd codon position, α = 0.55: slowest average rate but highest rate heterogeneity.
• 3rd codon position, α = 1.16: fastest average rate but low rate heterogeneity.

Transition to transversion ratios also vary, and so do base frequencies. All parameters vary by site (here, by codon position):

Codon position   TS:TV   p(inv)   α
1st              2.41    0.39     1.10
2nd              2.70    0.48     0.55
3rd              8.64    0.01     1.16

Data partitions

The importance of data partitions: different purposes for different partitions.
• Not surprisingly, nuclear genes and mitochondrial genes evolve differently.
• We suspect that the three codon positions in protein-coding genes evolve differently (e.g. a change at the 3rd codon position does not change the protein much).
• Ribosomal genes have important folding patterns. Does one rate of change apply across the whole sequence? Secondary folding: loops (fast) and stems (slow to change).

Why do we need to partition data?

Not all software partitions data (PAUP, for example, does not). MrBayes and MetaPiga do partition models, so we will learn these. We need to partition all parameters because of confounding.

Confounded model parameters

• Base frequency estimates vary with the ratio of transitions to transversions.
• Although the GTR+I+Γ model provides a statistically better fit than the HKY+I+Γ model, branch length and topology estimates differ very little between models. Thus, the three extra parameters may be unnecessarily costly in terms of power to detect the optimal tree.
• The change in the values of model parameters is evidence that they are confounded with one another: the value of one parameter estimate depends on the others.

Data partitions of ALL parameters

HKY+I+G model, but a different model for each partition of the data. Differences are relative to the "All sites" row.

Partition     df   -lnL     -2ΔlnL   Δdf   AIC
All sites      7   107457                  214928
13 genes      91   105843    3228    84    211868
3 positions   21   102303   10308    14    204636

Components of likelihood

Effect of model parameters on random trees:
• Highest likelihood: one parameter for every pattern. This 5273-parameter model is the largest possible model.
• Worst likelihood: random tree and 1 parameter.

Effect of model parameters on ML trees (smallest model first):
• JC model ML tree: -lnL = 125000 (1-parameter model)
• GTR+I+G model: -lnL = 106000 (10-parameter model)

MLEs from data partitioned (stratified) by codon position

Codon position    -lnL      df   A      C      G      T      ts/tv   p(inv)   α
1                  28161     7   0.35   0.27   0.14   0.24   2.41    0.39     1.10
2                  15985     7   0.20   0.28   0.11   0.41   2.70    0.48     0.55
3                  58157     7   0.43   0.32   0.05   0.20   8.64    0.01     1.16
Sum of 1 to 3    102,303    21
All (one model)  107,457     7   0.36   0.33   0.08   0.24   3.83    0.34     0.79

Partitioning by codon position gives the better model fit (-lnL = 102,303 with 21 df) compared with the same model for all codon positions (-lnL = 107,457 with 7 df).
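The likelihood ratio and AIC arithmetic used in these comparisons can be checked with a few lines of Python. This is a minimal sketch: the function and variable names are illustrative, and the inputs are the -lnL and df values reported in the slides.

```python
def lrt_statistic(neg_lnl_small, neg_lnl_large):
    """Likelihood ratio statistic 2 * (lnL_large - lnL_small).

    Both arguments are -lnL scores, as reported in the tables.
    """
    return 2 * (neg_lnl_small - neg_lnl_large)


def aic(neg_lnl, n_params):
    """Akaike Information Criterion: -2 lnL + 2k, with -lnL as input."""
    return 2 * neg_lnl + 2 * n_params


# One model for all sites (7 df) vs. a separate model per codon position (21 df)
partition_stat = lrt_statistic(107457, 102303)
partition_df = 21 - 7
print("codon partitions:", partition_stat, "on", partition_df, "df")

# HKY+I+G (7 df) vs. GTR+I+G (10 df), the hLRT example
hlrt_stat = lrt_statistic(268959, 268402)
hlrt_df = 10 - 7
CRITICAL_3DF = 7.82  # p = 0.05 threshold for 3 df, as quoted in the slides
print("GTR+I+G vs HKY+I+G:", hlrt_stat, "on", hlrt_df, "df,",
      "significant" if hlrt_stat >= CRITICAL_3DF else "not significant")

# AIC for the same pair of models: the smaller value is preferred
aic_gtr = aic(268402, 10)
aic_hky = aic(268959, 7)
print("AIC GTR+I+G:", aic_gtr, " AIC HKY+I+G:", aic_hky)
```

Because the partitioned model is simply the single model fitted separately to each codon position, the two are nested and the likelihood ratio statistic applies directly. The AIC would be the tool to reach for when comparing models that are not nested.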