Building a Maximum Likelihood Tree

Estimate a “pretty good” starting tree (NJ or parsimony).
Use that tree to estimate the various model parameters.
Choose the model parameters that have the highest likelihood (lowest -lnL).
Search tree space using the optimal model and a good tree-search method (NNI, SPR, TBR) with 5-10 random starts.
Choose the tree with the highest likelihood (lowest -lnL).
Using the same optimal model parameters, run a bootstrap analysis to assess support for individual clades.

Comparing alternative trees: what do the likelihood scores mean?

Recall the ln-likelihood tests of alternative models.
Compare the likelihoods of alternative tree hypotheses.
Do the trees differ significantly?
Again we calculate pair-wise site differences (conditioned on the best model).









H1: Chimp with human tree: lnL = -1405.61
H2: Chimp with gorilla tree: lnL = -1408.80
But what do these log likelihoods (lnL) mean?
Remember the likelihood function: L = Pr(D|H)
And the likelihood ratio test compares Pr(D|H1) / Pr(D|H2).
That ratio is harder mathematically: it is e^(-1405.61) / e^(-1408.80),
and we don't know how the likelihood score itself is distributed,
so we can't test the hypothesis statistically that way.
But the log of the ratio is easy to compute and equivalent to the ratio (see the numeric sketch below),
and for nested models, twice the difference between the two log-likelihoods is (asymptotically) chi-squared (χ²) distributed.
From Felsenstein, 2003, "Inferring Phylogenies"
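A minimal numeric sketch (Python) of this point, using the lnL values from the slide: the raw likelihoods underflow ordinary floating point, but the log of the ratio is just a difference.

import math

lnL1 = -1405.61   # H1: chimp with human
lnL2 = -1408.80   # H2: chimp with gorilla

# The raw likelihoods are astronomically small fractions;
# e^(-1405.61) underflows to 0.0 in double precision.
print(math.exp(lnL1))        # 0.0

# The log of the ratio is simply the difference of the logs:
log_ratio = lnL1 - lnL2      # ln[ Pr(D|H1) / Pr(D|H2) ]
print(log_ratio)             # ~3.19, so H1 has the higher likelihood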
So we use logs to compare tree
hypotheses






H1: Chimp with human tree: lnL1 = -1405.61
H2: Chimp with gorilla tree: lnL2 = -1408.80
Note that the lnL scores are negative.
That is because the likelihoods (L) are
probabilities and are therefore fractions.
The log of any fraction is a negative number.
e.g. 100 = 10^2, so the log of 100 is 2,
and 1/100 = 1/10^2 = 10^-2, so the log of 1/100 is -2,
and the log of 1/1000 is -3.
Recall basic math




Which number is bigger? 1/100 or 1/1000 ?
Which number is bigger? -2 or -3?
So which number is bigger? -1405.61 or -1408.80
So which hypothesis has the larger (or highest)
likelihood?
For example




Which number is bigger: 1/100 or 1/1000 ?
» 1/100
Which number is bigger: -2 or -3?
» -2
So which number is bigger: -1405.61 or -1408.80?
» -1405.61
So which hypothesis has the larger (or highest)
likelihood?
» Chimp with Human
We use the natural log (lnL).





The log in our case is the natural log: ln, to the base e ≈ 2.718
(instead of to the base 10).
H1: Chimp with human tree: lnL1 = -1405.61
H2: Chimp with gorilla tree: lnL2 = -1408.80
NOTE: we often see these reported as -lnL1 = 1405.61 and -lnL2 = 1408.80.
In that case we want the smallest -lnL, because it corresponds to the highest likelihood.
But the same principle applies:
we prefer H1 (the tree with the chimp/human clade).

How to read Seaview output?
The ln(L) statistic for the tree inferred using the GTR+G model:
PhyML ln(L)=-87280.1 7872 sites GTR 4 rate classes
The ln(L) statistic for the tree inferred using the GTR model:
PhyML ln(L)=-97326.8 7872 sites GTR

Homework
Which of the trees built from the six models is the best?
Which tree has the highest likelihood?
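A tiny sketch (Python) of the homework logic. Only the GTR and GTR+G scores come from the PhyML output quoted above; the other four lnL values are made-up placeholders, used just to show that the best tree is the one with the highest lnL (the least negative score).

scores = {
    "JC": -101234.5,       # hypothetical value
    "K2P": -99876.3,       # hypothetical value
    "HKY": -98321.7,       # hypothetical value
    "GTR": -97326.8,       # from the PhyML output above
    "HKY+I+G": -90012.4,   # hypothetical value
    "GTR+G": -87280.1,     # from the PhyML output above (GTR, 4 rate classes)
}
best = max(scores, key=scores.get)   # highest lnL = least negative
print(best, scores[best])            # GTR+G -87280.1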

Comparing alternative trees
Do the trees differ significantly?
Recall the ln-likelihood tests of alternative models:
again we calculate pair-wise site differences (conditioned on the best model).
Kishino-Hasegawa (KH) test
Simply a paired t-test comparing two trees:
Calculate the pair-wise lnL differences at each site between the two trees.
Sum the differences over all sites.
Calculate the standard error (SE) of the summed pairwise differences.
If ΔlnL / SE > 1.96, then p ≤ 0.05: the trees are significantly different.
(A small sketch of this calculation follows the citations.)
Kishino & Hasegawa. 1989. J. Mol. Evol. 29:170-179.
From Felsenstein, 2003, "Inferring Phylogenies"
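A minimal sketch (Python) of the arithmetic just described, assuming you already have per-site lnL values for the two trees (e.g. exported from PAUP or PhyML). This is the simple paired-z flavour of the test, not a drop-in replacement for the program implementations.

import math

def kh_test(site_lnL_tree1, site_lnL_tree2):
    """Return (delta_lnL, SE, z) for two equal-length lists of per-site lnL values."""
    d = [a - b for a, b in zip(site_lnL_tree1, site_lnL_tree2)]   # per-site differences
    n = len(d)
    delta = sum(d)                                    # total lnL difference
    mean = delta / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance of the differences
    se = math.sqrt(n * var)                           # SE of the summed difference
    return delta, se, delta / se

# If |z| > 1.96, the two trees differ significantly at p <= 0.05 (two-sided).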
Shimodaira-Hasegawa (SH) test

A newer variant of the KH test that corrects for multiple tests and some bias.
We should also correct the KH test for multiple tests (the critical value becomes 0.05 / number of trees tested).
For both, use RELL-calculated p-values (Resampling Estimated Log Likelihoods); a sketch of the resampling idea follows the citation.
For both, use a one-sided test if the ML tree is one of the trees compared.
Shimodaira & Hasegawa. 1999. MBE 16(8):1114-1116.
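A hedged sketch (Python) of the RELL idea only: resample the per-site lnL values instead of re-optimising trees on each bootstrap replicate. Real KH/SH implementations add centring and multiple-test corrections on top of this.

import random

def rell_deltas(site_lnL_tree1, site_lnL_tree2, n_reps=1000, seed=1):
    """Bootstrap distribution of the lnL difference between two trees, by resampling sites."""
    rng = random.Random(seed)
    n = len(site_lnL_tree1)
    deltas = []
    for _ in range(n_reps):
        idx = [rng.randrange(n) for _ in range(n)]    # resample sites with replacement
        deltas.append(sum(site_lnL_tree1[i] - site_lnL_tree2[i] for i in idx))
    return deltas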
Seaview?
Seaview does not do these tests.
PAUP and Phylip (and others) will perform these tests.
Some newer variants are in the newest Consel software (Shimodaira, 2008).
These tests are not used for Bayesian analysis.
Here I compared 10 trees using PAUP. Four were statistically poorer than the ML tree.

Likelihood vs. Bayesian methods
Both use maximum-likelihood-optimized models and Markov chains.

Bayesian methods I

Likelihood:
L = Pr(D|H)
The (joint) probability of the data (D) given the hypothesis (H).
H may be a tree, a branch length, or a model parameter.
D is the sequence of nucleotides.
Bayesian adds a prior:

Pr(H|D) = Pr(D|H) × Pr(H) / Pr(D)

The probability of the hypothesis (H) given the data (D):
the product of the likelihood and the prior, normalized by Pr(D).
Typically uses Markov chain Monte Carlo (MCMC) to search tree space.
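A minimal sketch (Python) of the prior-times-likelihood relationship for the two tree hypotheses above. The flat 0.5/0.5 prior is an assumption made for illustration; Pr(D) drops out when we normalise.

import math

lnL = {"H1_chimp_human": -1405.61, "H2_chimp_gorilla": -1408.80}   # likelihoods from the slide
prior = {h: 0.5 for h in lnL}                                      # assumed flat prior

# Pr(H|D) = Pr(D|H) * Pr(H) / Pr(D); work in logs and normalise so Pr(D) cancels.
log_post = {h: lnL[h] + math.log(prior[h]) for h in lnL}
m = max(log_post.values())                          # subtract the max to avoid underflow
weights = {h: math.exp(lp - m) for h, lp in log_post.items()}
total = sum(weights.values())
posterior = {h: w / total for h, w in weights.items()}
print(posterior)   # roughly {'H1_chimp_human': 0.96, 'H2_chimp_gorilla': 0.04}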
Models of sequence evolution

JC       equal probability of change (1 df)
K2P      transition rate ≠ transversion rate (2 df)
HKY      adds unequal nucleotide frequencies (5 df)
GTR      p(A<-->G) ≠ p(A<-->T) ≠ ... ≠ p(T<-->C) (8 df)
HKY+I+G  adds invariant sites (I) + rate heterogeneity (G) (8 df)
GTR+I+G  most complex (10 df)
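As a concrete illustration of what the GTR parameters are, here is a minimal sketch (Python/NumPy) that assembles a GTR rate matrix Q from exchangeability rates and base frequencies. All numbers are made up for illustration.

import numpy as np

freqs = np.array([0.30, 0.20, 0.25, 0.25])    # pi_A, pi_C, pi_G, pi_T (assumed)
rates = {"AC": 1.0, "AG": 4.0, "AT": 0.8,     # six exchangeabilities (assumed values)
         "CG": 1.2, "CT": 4.5, "GT": 1.0}

bases = "ACGT"
Q = np.zeros((4, 4))
for i, x in enumerate(bases):
    for j, y in enumerate(bases):
        if i != j:
            Q[i, j] = rates["".join(sorted(x + y))] * freqs[j]   # q_ij = r_ij * pi_j
np.fill_diagonal(Q, -Q.sum(axis=1))                              # each row sums to zero
print(Q)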
Bayesian analysis and MCMC

ML: fix the parameters. Bayes: marginalize over the parameters.
Markov chain Monte Carlo combines parameter estimation with the tree-search algorithm
• (it integrates over tree space and parameter space).
Whereas conventional likelihood does a tree search conditioned on parameters estimated from preliminary trees
• (it integrates over tree space only).
ω represents the model parameters.
ML optimization: optimizes the tree likelihood over a fixed ω; ω is determined first, then the tree is optimized.
Bayesian optimization: simultaneously optimizes parameters and trees.

Bayesian tree search: Markov chain Monte Carlo
Recall ML tree search and tree space?
We improve hill climbing with NNI, SPR, and TBR, and with random starts for ML trees.
Search method

Markov chain Monte Carlo (MCMC)
simulates a walk through parameter space and tree space.
It is analogous to a Maximum Likelihood heuristic search:
"hill climbing" through tree space to find the highest-likelihood tree.


Thanks to Mark Holder for portions of the following slides,
from the Workshop on Molecular Evolution, Woods Hole, MA, July 2003.
Similarly for Bayesian trees: a hill-climbing variant.
R = the ratio of the new tree's height to the present one.
Always move to the next tree if R > 1.
If R < 1, then the probability of the move = R
(e.g. a proposed move with R = 0.03 is accepted with low probability; one with R = 0.92 with high probability).

Tree search in MrBayes
Begins with a wander through tree and parameter space:
Propose a new location.
Calculate the height of the new location.
R = new height / old height.
Move with a probability that is a function of R.
Always move if R > 1.
Early steps are discarded: the burn-in.
Does this help to avoid local optima?
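A minimal sketch (Python) of the accept/reject rule just described; "height" here stands for the (unnormalised) posterior of the proposed versus the current tree and parameters.

import random

def accept_move(height_new, height_old, rng=random):
    """Metropolis rule: always accept if R > 1, otherwise accept with probability R."""
    R = height_new / height_old
    if R >= 1.0:
        return True
    return rng.random() < R   # e.g. R = 0.03 is rarely accepted, R = 0.92 usually is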




Metropolis-coupled Markov chain Monte Carlo (MCMCMC)
Run at least 4 chains simultaneously.
One chain - the cold chain - explores with relatively short steps.
The others - heated chains - explore with big steps and cover much more of the tree space.

Advantages of MCMCMC
The cold chain with short steps better explores parameter space.
Metropolis-coupled Markov chain Monte Carlo: MC3

Run multiple (at least four) chains simultaneously.
The cold chain is the main chain - the one that shows up in the buffer and in the output.
Three heated chains take bigger steps across the posterior-probability hills.
A heated chain sometimes swaps with the cold chain when the hot chain finds better space.
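A hedged sketch (Python) of how chain heating and swapping can work. The beta = 1/(1 + lambda*i) temperature scheme and lambda = 0.1 follow the commonly described MrBayes-style defaults, but treat the details as assumptions rather than MrBayes' exact code.

import math, random

def chain_betas(n_chains=4, lam=0.1):
    """Chain 0 is the cold chain (beta = 1); the others are progressively heated."""
    return [1.0 / (1.0 + lam * i) for i in range(n_chains)]

def accept_swap(lnP_i, lnP_j, beta_i, beta_j, rng=random):
    """Metropolis rule for swapping the states of chains i and j (lnP = log posterior of each state)."""
    log_R = (beta_i - beta_j) * (lnP_j - lnP_i)
    return log_R >= 0 or rng.random() < math.exp(log_R)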
Advantages of MCMCMC
Heated chains may miss the optimal parameter space but cover tree space more thoroughly.
Short steps may miss the globally optimal hill.
Hot chains may become the cold chain.

Chain results (the current cold chain of each run is shown in square brackets):
1 -- [-41631.791] (-43694.786) (-42920.096) (-42782.307) * (-42388.547) [-41306.253] (-43688.544) (-42883.304)
1000 -- (-32120.952) (-31590.257) (-31579.554) [-31096.284] * (-31353.766) (-31437.477) [-31176.966] (-31814.110) -- 0:08:51
Average standard deviation of split frequencies: 0.106151
2000 -- (-30922.429) (-30900.476) (-30861.073) [-30822.676] * [-30826.747] (-30849.901) (-30848.131) (-30874.821) -- 0:07:40

Standard deviations ≤ 0.01?
White noise - no trend over generations.
MrBayes > exe cynmix.nex
begin mrbayes;
outgroup Ibalia;
charset morphology = 1-166;
charset molecules = 167-3246;
charset COI = 167-1244;
charset COI_1st = 167-1244\3;
charset COI_2nd = 168-1244\3;
charset COI_3rd = 169-1244\3;
charset EF1a = 1245-1611;
charset EF1a_2nd = 1245-1611\3;
charset EF1a_3rd = 1246-1611\3;
charset EF1a_1st = 1247-1611\3;
charset LWRh = 1612-2092;
charset LWRh_2nd = 1612-2092\3;
charset LWRh_3rd = 1613-2092\3;
charset LWRh_1st = 1614-2092\3;
charset 28S = 2093-3246;
charset 28S_Stem = 2160-2267 2361-2401 2489-2528 2539-2565 2577-2647
2671-2760 2768-2827 2848-3194 3220-3246;
charset 28S_Loop = 2093-2159 2268-2360 2402-2488 2529-2538 2566-2576
2648-2670 2761-2767 2828-2847 3195-3219;
partition Names = 5: morphology, COI, EF1a, LWRh, 28S;
partition Nopart = 2: morphology, molecules;
partition Morph_mito_nucl_ribo = 4: morphology, COI, EF1a LWRh, 28S;
partition Extreme = 12: morphology, COI_1st, COI_2nd, COI_3rd, EF1a_2nd,
EF1a_3rd, EF1a_1st, LWRh_2nd, LWRh_3rd, LWRh_1st, 28S_Stem, 28S_Loop;
end;
begin mrbayes;
   set partition=Names;                                            [use the 5-subset partition defined above]
   lset applyto=(2,3,4,5) nst=6 rates=invgamma;                    [GTR+I+G for the four molecular subsets]
   unlink shape=(all) pinvar=(all) statefreq=(all) revmat=(all);   [each subset gets its own parameters]
   prset ratepr=variable;                                          [allow overall rates to differ among subsets]
end;
The entire MrBayes block for a mixed, partitioned analysis: cynmix.nex, with lots of parameters for mixed data.

MrBayes > mcmc ngen=50000 samplefreq=50
(Run until the lnL score is no longer improving.)
MrBayes > sump burnin=500
(Post burn-in parameter summary.)
MrBayes > sumt burnin=500
(Majority-rule consensus of all trees sampled after the burn-in.)
MrBayes > comparetree
(Compares the trees sampled by the two runs.)
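A hedged sketch (Python) of summarising a sampled-parameter (.p) file by hand, mirroring what "sump burnin=500" reports. It assumes the usual layout of one bracketed ID line, a tab-separated header row, and a column named LnL; check your own file before relying on it.

import csv

def summarise_p_file(path, burnin=500):
    with open(path) as fh:
        next(fh)                                  # skip the "[ID: ...]" comment line
        reader = csv.DictReader(fh, delimiter="\t")
        rows = list(reader)[burnin:]              # discard the burn-in samples
    lnl = [float(r["LnL"]) for r in rows]
    return {"n_samples": len(lnl),
            "mean_LnL": sum(lnl) / len(lnl),
            "min_LnL": min(lnl),
            "max_LnL": max(lnl)}

# e.g. summarise_p_file("cynmix.nex.run1.p")      # file name is illustrative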
List of taxon bipartitions found in tree file:
[MrBayes output for cynmix.nex: the table of the 55 taxon bipartitions (splits) found in the tree files of the two runs. For each split the table gives its dotted taxon-membership pattern, the number of the 4001 sampled trees (from 200,000 generations) that contain it, and the corresponding posterior probability; these range from 4001/4001 (1.000) down to 3635/4001 (0.909). The split frequencies of the two runs agree closely - the runs are converging to stationarity.]
lnL plot of both runs across the generations (200 to 200,000)

Markov chains are run until stationarity is reached:
the point where the fit is good and does not improve,
the "top of the hill" in tree/parameter space,
detected when the lnL score plateaus.
But has stationarity been reached?
The plot of lnL by generation looks pretty patternless - just as we'd like.
[ASCII trace plot from MrBayes: lnL by generation for run 1 and run 2 (plotted as "1" and "2", with "*" where they overlap), spanning roughly -26570.79 to -26592.29 on the y-axis; the two runs are interleaved with no visible trend.]
It looks pretty stationary from generation 200 to 200,000 - except that the SE is too big.

199900 -- [-26577.614] (-26610.759) (-26597.502) (-26603.504) * (-26614.970) (-26643.536) (-26600.831) [-26580.881] -- 0:00:10
200000 -- [-26576.965] (-26619.707) (-26590.745) (-26594.843) * (-26606.805) (-26644.199) (-26601.677) [-26577.404] -- 0:00:00

Average standard deviation of split frequencies: 0.020419

Monitor the run to check the standard error of the split frequencies.
It should be smaller than 0.01, but here it is about 0.02.
We should have run the analysis for longer than 200,000 generations -
there are lots and lots of parameters to estimate here.
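A rough sketch (Python) of the "no trend over generations" check: after dropping the burn-in, the first and second halves of the lnL trace should have similar means if the run is stationary. This is only a crude heuristic, not a formal convergence test, and the tolerance is an arbitrary assumption.

def looks_stationary(lnl_trace, burnin_fraction=0.25, tolerance=1.0):
    """Compare the mean lnL of the first and second halves of the post-burn-in trace."""
    post = lnl_trace[int(len(lnl_trace) * burnin_fraction):]
    half = len(post) // 2
    mean1 = sum(post[:half]) / half
    mean2 = sum(post[half:]) / (len(post) - half)
    return abs(mean1 - mean2) < tolerance   # tolerance in lnL units (assumed)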
Standard deviations of split frequencies

What are they?
Splits are bipartitions of the taxa that define clades.
The standard deviation measures the discrepancies in split frequencies between the two runs.
It gets smaller as the trees for the two runs become more similar.
We look for SE ≤ 0.01 (should it be SE ≤ 0.0001?).
We want the SE to approach 0; as it does, the two runs converge on the same optimal tree.
The SE is a simple and perhaps better diagnostic than lnL plots (a small sketch of the calculation follows below).

Bivariate plot of clade probabilities
This graph plots the probabilities of clades found in file 1 (run one, "clade probability in analysis 1", the x-axis) against the probabilities of the same clades found in file 2 (run two, "clade probability in analysis 2", the y-axis), each axis running from 0.0 to 1.0.
Goal: a tight fit to the diagonal as the two runs converge to the same tree.
In this example there is quite a lot of variation between runs: e.g. one clade has probability < 0.75 in analysis 2 but ~1.00 in analysis 1.
From Ronquist lecture
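A minimal sketch (Python) of the average-standard-deviation-of-split-frequencies diagnostic for two runs. MrBayes' own calculation has extra details (for example it can exclude very rare splits), so this only illustrates the idea.

import math

def asdsf(freqs_run1, freqs_run2):
    """freqs_runX maps a split pattern (e.g. '.....**...') to its frequency in that run."""
    splits = set(freqs_run1) | set(freqs_run2)
    sds = []
    for s in splits:
        f1, f2 = freqs_run1.get(s, 0.0), freqs_run2.get(s, 0.0)
        mean = (f1 + f2) / 2.0
        sds.append(math.sqrt((f1 - mean) ** 2 + (f2 - mean) ** 2))   # sample SD of two values
    return sum(sds) / len(sds)

# We want this to approach 0 (e.g. <= 0.01) as the two runs converge on the same splits.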