Which of these statements about bootstrap analysis is NOT true?

Homework II remarks: Bootstrap analysis
Step 1: Replicate sites at random, with replacement.
Step 2: Build a tree for each replicate.
Step 3: Find the consensus clades.
[Figure: the ML tree beside bootstrap replicate trees Rep 1 and Rep 2. Clades are labelled as present in all three trees, present in 2 trees but not the ML tree, or present in one tree.]
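The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the homework software: the alignment, the function names `resample_sites` and `clade_support`, and the clade sets are all made up for the example, and Step 2 (tree building) is left to whatever tree-search program is in use. Each tree is represented simply as a set of clades, and support is counted over the ML tree plus two replicates to mirror the figure; in practice support is normally computed over many replicates.

```python
import random
from collections import Counter

def resample_sites(alignment, rng):
    """Step 1: draw alignment columns at random, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every taxon, so sites stay aligned.
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}

def clade_support(trees):
    """Step 3: fraction of the supplied trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}

# Hypothetical four-taxon alignment (made up for illustration).
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTTC", "D": "TTGATC"}
replicate = resample_sites(alignment, random.Random(1))

# Step 2 is not shown: assume a tree was built for each replicate and
# summarised as a set of clades (each clade is a frozenset of taxa).
ml_tree = {frozenset("AB"), frozenset("ABC")}
rep1 = {frozenset("AB"), frozenset("ABD")}
rep2 = {frozenset("AB"), frozenset("ABD")}
support = clade_support([ml_tree, rep1, rep2])

for clade, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(value, 2))
```

Here the clade (A, B) is present in all three trees, (A, B, D) is present in two trees but not the ML tree, and (A, B, C) is present in one tree only.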
Which of these statements about bootstrap analysis is NOT true?
A. Bootstrap values provide a measure, similar to a confidence level, that we use to assess the strength of support (or lack thereof) for each clade on a phylogenetic tree.
B. The bootstrap can help us decide which model of sequence evolution to use for phylogenetic inference.
C. A small bootstrap value on a branch suggests that only a few sites in the alignment support the clade formed by that branch.
D. A high bootstrap value (100%) assures us that the sequence data is not biased. We can be sure that the taxa connected by that branch are truly joined by common descent rather than some quirky feature of the sequence that we have chosen to analyse.
Answers:
A. True. This is why we do the test.
C. True. These sites are too rare to be sampled much.
B. NOT true. The test tells us nothing about whether we used the right test or model. We compare log likelihood scores to compare models.
D. NOT true. We cannot detect bias with the bootstrap.
Why do we need bootstrap analysis for MP, ML and distance (NJ) trees? Why can’t we just list the optimal tree and leave it at that?
The bootstrap analysis gives us some sense of our confidence in the ML tree, and specifically in individual clades.
Homework II remarks: ML analysis
Ordered by degrees of freedom (df): number of (free) model parameters
Table annotations: next best log likelihood; smallest log likelihood.
Parameter with the largest effect on the fit of the model to the sequence data?
Parameters to estimate:
• Ts:Tv (2 STs, substitution types)
• Ts:Tv + base frequencies
• 6 STs + base frequencies
• Ts:Tv + G (rate heterogeneity)
• Ts:Tv + base frequencies + G
• 6 STs + base frequencies + G
Change in -lnL for each addition:
• Add base frequency parameters (Ts:Tv to Ts:Tv + base frequencies): 1901
• Add substitution types (Ts:Tv + base frequencies to 6 STs + base frequencies): 1830
• Add across-site rate heterogeneity (G): 9765, 11,337 and 9957 in the three comparisons shown
In fact the Gamma parameter is so important for this alignment that a simpler model that includes G (fewer parameters: 5 fewer than GTR) fits better than more complex models.
Hierarchical log likelihood ratio test (hLRT).
Model comparison for nested models.
More on model choice for ML
and Bayesian analysis
Comparison of 12 models of evolution for mammalian sequences. Output is from PAUP 4.0.
Nested models: compare a “larger” model (more parameters) to any smaller model (fewer parameters) nested within it.
Compare GTR+I+G (10 df) with HKY+I+G (7 df), a difference of 3 df. The scores in the table are -lnL values, so:
$\chi^2_3 = 2(\ln L_{\text{large}} - \ln L_{\text{small}}) = -2[(-\ln L_{\text{large}}) - (-\ln L_{\text{small}})] = -2(268402 - 268959) = 1114$
• The larger model is a significantly better fit (p < 0.05) if $\chi^2_3 \geq 7.82$.
• So the GTR+I+G model is a highly significantly better fit to the sequence data than is HKY+I+G.

Models nested within the general time reversible (GTR) model
Equal base frequencies
  JC69: 1 substitution type (ST)
  K80: 2 STs (transitions and transversions)
Unequal base frequencies
  F81: 1 ST
  HKY/F84: 2 STs
  GTR: 6 STs (A<-->G, A<-->T, . . .)
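The likelihood ratio test above can be checked with a short Python sketch. It assumes the two scores from the slide are -lnL values, and the function names are made up for illustration. To stay within the standard library it uses the closed-form upper tail of the chi-square distribution that holds for exactly 3 degrees of freedom, so it is not a general-purpose test.

```python
import math

def lrt_statistic(neg_lnl_large, neg_lnl_small):
    """2 * (lnL_large - lnL_small), written in terms of the reported -lnL scores."""
    return 2.0 * (neg_lnl_small - neg_lnl_large)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# -lnL scores from the comparison above: GTR+I+G (10 df) vs HKY+I+G (7 df).
stat = lrt_statistic(neg_lnl_large=268402, neg_lnl_small=268959)
p_value = chi2_sf_3df(stat)
significant = p_value < 0.05

print("LRT statistic:", stat)
print("significant at 0.05:", significant)
```

For models that differ by some other number of parameters, a statistics library's chi-square distribution would be the more natural choice than this 3-df shortcut.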
Non-nested models: compare any model to any other model, whether or not one is nested in the other.
E.g. compare the HKY+I+G model vs. the GTR model:
• HKY is nested in GTR,
• HKY is nested in GTR+I+G, and
• HKY+I+G is nested in GTR+I+G,
• but HKY+I+G is not nested in GTR.
Akaike Information Criterion
$\mathrm{AIC}_i = -2 \ln L_i + 2 k_i$
where $k_i$ is the number of parameters (df) for model $i$.
• Especially useful for non-nested models.
• Incorporates a penalty for each parameter.
• Smallest AIC --> best model
hLRT and AIC are calculated and tabulated. The model with the highest likelihood, or the smallest AIC or BIC, is chosen for the final search.
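As a small worked example of the AIC formula, the sketch below scores the two models whose -lnL and df values appear in the likelihood ratio comparison (GTR+I+G and HKY+I+G). Those happen to be nested, but the same calculation applies unchanged to non-nested pairs. The function and variable names are illustrative only.

```python
def aic(neg_lnl, k):
    """AIC = -2 lnL + 2k, taking the reported -lnL score and the df count k."""
    return 2 * neg_lnl + 2 * k

# (-lnL, df) pairs taken from the GTR+I+G vs HKY+I+G comparison.
models = {
    "GTR+I+G": (268402, 10),
    "HKY+I+G": (268959, 7),
}

scores = {name: aic(neg_lnl, k) for name, (neg_lnl, k) in models.items()}
best = min(scores, key=scores.get)  # smallest AIC --> best model

for name, score in scores.items():
    print(name, score)
print("best model by AIC:", best)
```

The penalty of 2 per parameter is small next to the difference in fit here, so the larger model is still preferred.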
Gamma (G) parameter
What about the Gamma parameter from homework 2? What is it?
Describes across-site rate variation with a shape parameter α.
[Figure: gamma distributions of substitution rates (x-axis) against proportion of sites (y-axis) for α = 20, 5, 1, 0.50 and 0.1.]
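To get a feel for what the shape parameter does, the sketch below simulates per-site rates from a gamma distribution for the α values shown in the figure. It assumes the usual convention of fixing the mean rate at 1 (shape α, scale 1/α), under which the variance of the rates is 1/α. The function name, the number of sites and the random seed are arbitrary choices for the illustration.

```python
import random
import statistics

def simulate_site_rates(alpha, n_sites, rng):
    """Draw relative rates from a gamma with shape alpha and mean 1."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(0)
summary = {}
for alpha in (20, 5, 1, 0.5, 0.1):
    rates = simulate_site_rates(alpha, 20000, rng)
    summary[alpha] = (statistics.mean(rates), statistics.pvariance(rates))

for alpha, (mean_rate, variance) in summary.items():
    # Theoretical variance is 1/alpha when the mean is fixed at 1.
    print(f"alpha={alpha}: mean={mean_rate:.2f}, "
          f"variance={variance:.2f}, expected variance={1 / alpha:.2f}")
```

Large α gives rates clustered near the mean, while small α gives many nearly invariant sites and a few very fast ones.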
Gamma rate heterogeneity (α) differs across codon position.
• 1st codon position: Gamma shape parameter α = 1.10. Slow average rate but moderate rate heterogeneity.
• 2nd codon position: Gamma shape parameter α = 0.55. Slowest average rate but highest rate heterogeneity.
• 3rd codon position: Gamma shape parameter α = 1.16. Fastest average rate but low rate heterogeneity.

Gamma rate parameter
• If α is large, there is very little across-site rate variation.
• If α is small, there is more across-site rate variation.

Also transition to transversion ratios vary. And base frequencies.
All parameters vary by site
• 1st codon position: Ts:Tv = 2.41, p(inv) = 0.39, α = 1.10
• 2nd codon position: Ts:Tv = 2.70, p(inv) = 0.48, α = 0.55
• 3rd codon position: Ts:Tv = 8.64, p(inv) = 0.01, α = 1.16
[Figure panels: 1st, 2nd and 3rd codon positions.]
Data partitions
• The importance of data partitions; different purposes for different partitions.
• Not surprisingly, nuclear genes and mitochondrial genes evolve differently.
• We suspect that the three codon positions in protein-coding genes evolve differently (e.g. a change at the 3rd codon position does not change the protein much).
• Ribosomal genes have important folding patterns.
Does one rate of change apply across the whole sequence?
Secondary folding: loops (fast) and stems (slow to change).
Why do we need to partition data?
• Not all software can partition data (PAUP, for example, cannot).
• MrBayes and MetaPiga do partition models, so we will learn these.
• We need to partition all parameters because of confounding.
Confounded model parameters
Base frequency estimates vary with the ratio of transitions to transversions.
Although the GTR+I+Γ model provides a statistically better fit than the HKY+I+Γ model, branch length and topology estimates differ very little between models. Thus, the three extra parameters may be unnecessarily costly in terms of power to detect the optimal tree.
The change in the values of model parameters is evidence that they are confounded with one another: the value of one parameter estimate depends on the others.
Data partitions of ALL parameters
HKY+I+G model, but a different model for each partition of the data.

Partition      df    -lnL      -2ΔlnL    Δdf    AIC
All sites       7    107457    -         -      214928
13 genes       91    105843    3228      84     211868
3 positions    21    102303    10308     14     204648

(-2ΔlnL and Δdf are relative to the All sites model.)

Components of likelihood.
• Highest likelihood: one parameter for every pattern. 5273-parameter model: largest possible model.
• Worst likelihood: random tree and 1 parameter.
Effect of model parameters on random trees and on ML trees:
• Smallest model: JC model ML tree, -lnL = 125000 (1-parameter model).
• GTR+I+G model: -lnL = 106000 (10-parameter model).
MLEs from data partitioned (stratified) by codon position

Position   -lnL       df    A      C      G      T      ts/tv   p(inv)   α
1          28,161      7    0.35   0.27   0.14   0.24   2.41    0.39     1.10
2          15,985      7    0.20   0.28   0.11   0.41   2.70    0.48     0.55
3          58,157      7    0.43   0.32   0.05   0.20   8.64    0.01     1.16
Sum        102,303    21
All        107,457     7    0.36   0.33   0.08   0.24   3.83    0.34     0.79

Better model fit: a separate model for each codon position (-lnL = 102,303, 21 df).
Same model for all codon positions: -lnL = 107,457, 7 df.
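The partition comparison can be recomputed from the reported -lnL and df values alone. The sketch below is a hedged illustration, with made-up function and variable names: it takes the three partition schemes (all sites, 13 genes, 3 codon positions), computes the likelihood ratio statistic and the extra df of each partitioned scheme relative to the single-model analysis, and ranks all three by AIC using the formula AIC = -2 lnL + 2k.

```python
def compare_to_baseline(baseline, other):
    """Return (-2 * delta lnL, delta df) of a partitioned scheme vs the baseline."""
    base_neg_lnl, base_df = baseline
    neg_lnl, df = other
    return 2 * (base_neg_lnl - neg_lnl), df - base_df

def aic(neg_lnl, k):
    """AIC = -2 lnL + 2k, taking the reported -lnL score and the df count k."""
    return 2 * neg_lnl + 2 * k

# (-lnL, df) for each partition scheme under the HKY+I+G model.
schemes = {
    "All sites": (107457, 7),
    "13 genes": (105843, 91),
    "3 positions": (102303, 21),
}

baseline = schemes["All sites"]
differences = {name: compare_to_baseline(baseline, values)
               for name, values in schemes.items() if name != "All sites"}
aic_scores = {name: aic(*values) for name, values in schemes.items()}
best = min(aic_scores, key=aic_scores.get)

for name, (lrt, ddf) in differences.items():
    print(f"{name}: -2*delta lnL = {lrt}, delta df = {ddf}")
print("best scheme by AIC:", best)
```

Partitioning by codon position improves the fit far more than partitioning by gene while adding far fewer parameters, which is why it has the smallest AIC.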