Consensus methods
• A consensus tree is a summary of the
agreement among a set of fundamental trees
• There are many consensus methods that
differ in:
1. the kind of agreement
2. the level of agreement
• Consensus methods can be used with multiple
trees from a single analysis or from multiple
analyses
Strict consensus methods
• Strict consensus methods require agreement across all the
fundamental trees
• They show only those relationships that are unambiguously
supported by the parsimonious interpretation of the data
• The commonest method (strict component consensus) focuses on
clades/components/full splits
• This method produces a consensus tree that includes all and
only those full splits found in all the fundamental trees
• Other relationships (those in which the fundamental trees
disagree) are shown as unresolved polytomies
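The rule "all and only those groups found in every fundamental tree" can be sketched in a few lines. This is a minimal illustration, not the routine of any particular package: it assumes rooted trees written as nested tuples of taxon names, compares clades rather than unrooted full splits, and uses two hypothetical trees rather than those in the slide figure. The names `clade_sets` and `strict_consensus_clades` are made up for this example.

```python
def clade_sets(tree):
    """Return the non-trivial clades (frozensets of taxon names) of a rooted tree.

    A tree is a nested tuple whose leaves are taxon-name strings.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf: just the taxon itself
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                     # record this internal node's clade
        return members

    all_taxa = leaves(tree)
    found.discard(all_taxa)                    # the clade of all taxa is uninformative
    return found


def strict_consensus_clades(trees):
    """Clades found in every one of the fundamental trees."""
    return set.intersection(*(clade_sets(t) for t in trees))


# Hypothetical fundamental trees (not the ones in the slide figure).
tree1 = ((("A", "B"), "C"), (("D", "E"), "F"))
tree2 = ((("A", "C"), "B"), (("D", "E"), "F"))

for clade in sorted(strict_consensus_clades([tree1, tree2]), key=sorted):
    print(sorted(clade))
```

Here the two trees disagree about A, B and C, so only the clade A+B+C survives as a polytomy, while D+E and D+E+F are retained.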
Strict consensus methods
[Figure: two fundamental trees for taxa A-G and the resulting strict component consensus tree.]
Majority-rule consensus methods
• Majority-rule consensus methods require agreement
across a majority of the fundamental trees
• May include relationships that are not supported by
the MP tree
• This method produces a consensus tree that includes
all and only those full splits found in a majority
(>50%) of the fundamental trees
• Other relationships are shown as unresolved
polytomies
• Of particular use in bootstrapping
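The majority-rule criterion amounts to counting how often each group occurs and keeping those above 50%. The sketch below assumes rooted trees written as nested tuples and counts clades rather than unrooted full splits; the three trees are hypothetical, and `clade_sets` and `majority_rule_clades` are names invented for the example.

```python
from collections import Counter


def clade_sets(tree):
    """Return the non-trivial clades (frozensets of taxon names) of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf: just the taxon itself
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_taxa = leaves(tree)
    found.discard(all_taxa)                    # ignore the clade of all taxa
    return found


def majority_rule_clades(trees):
    """Map each clade found in more than half of the trees to its frequency (%)."""
    counts = Counter()
    for tree in trees:
        counts.update(clade_sets(tree))        # a tree contributes each clade once
    n = len(trees)
    return {clade: 100 * k / n for clade, k in counts.items() if 2 * k > n}


# Hypothetical fundamental trees (not the ones in the slide figure).
trees = [
    ((("A", "B"), "C"), (("D", "E"), "F")),
    ((("A", "C"), "B"), (("D", "E"), "F")),
    ((("A", "B"), "C"), (("D", "F"), "E")),
]

for clade, percent in sorted(majority_rule_clades(trees).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clade), round(percent))
```

Applied to bootstrap replicates instead of equally parsimonious trees, the same frequencies are the bootstrap proportions usually written on the branches.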
Majority rule consensus
[Figure: three fundamental trees for taxa A-G and the resulting majority-rule component consensus tree. Numbers indicate frequency of clades in the fundamental trees; the clades in the consensus tree are labelled 100 or 66.]
Reduced consensus methods
[Figure: two fundamental trees for taxa A-G.]
Strict component consensus: completely unresolved
Strict reduced consensus tree: taxon G is excluded
Parsimonious Character Optimization
[Figure: a character with states A = 0, B = 0, C = 1, D = 1, E = 0 mapped on a tree. It can be explained either by an origin (0 => 1) and a reversal (1 => 0) (ACCTRAN), or by parallelism, with 2 separate origins (0 => 1) (DELTRAN).]
Homoplastic characters often have alternative equally parsimonious
optimizations
Commonly used varieties are:
ACCTRAN - accelerated transformation
DELTRAN - delayed transformation
Consequently, branch lengths are not always fully determined
PAUP reports minimum and maximum branch lengths
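The number of steps that both optimizations share can be checked with Fitch's algorithm for unordered characters, which is named here as a swapped-in illustration and is not mentioned on the slide. The sketch takes the character states from the figure (A = 0, B = 0, C = 1, D = 1, E = 0), but the tree shape `(A,(B,(C,(D,E))))` is an assumption, since the original topology is not recoverable from the text.

```python
def fitch(tree, states):
    """Fitch down-pass for one character on a rooted binary tree.

    Returns (possible state set at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):                  # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                                 # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Assumed topology; states as read from the figure.
tree = ("A", ("B", ("C", ("D", "E"))))
states = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}

root_states, steps = fitch(tree, states)
print(root_states, steps)
```

On this assumed tree the character needs two steps. ACCTRAN would place them as a gain on the branch leading to C+D+E and a loss in E, whereas DELTRAN would place them as separate gains in C and D, so the lengths of those branches depend on the optimization chosen.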
Questions
History?
India
Sri lanka
Missing data
• Missing data is ignored in tree building but can lead to
alternative equally parsimonious optimizations in the
absence of homoplasy
[Figure: a character with states A = 1, B = ?, C = ?, D = 0, E = 0 mapped on a tree; the single origin (0 => 1) can be placed on any one of 3 branches (marked *).]
Abundant missing data can lead to multiple equally
parsimonious trees.
This can be a serious problem with morphological
data but is less likely to arise with molecular data
Maximum Likelihood
• The aim is to estimate the probability that we would
observe a particular dataset, given a
phylogenetic tree and some notion of how the
evolutionary process worked over time.
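[Figure: the probability of the dataset, given the tree and the process below.]
\[
\begin{pmatrix}
a & b & c & d \\
b & a & e & f \\
c & e & a & g \\
d & c & f & a
\end{pmatrix},
\qquad \pi = [a, c, g, t]
\]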
What is the probability of
observing a datum?
• If we flip a coin and get a head and we think the coin is
unbiased, then the probability of observing this head is
0.5.
• If we think the coin is biased so that we expect to get a
head 80% of the time, then the likelihood of observing
this datum (a head) is 0.8.
• Therefore: The likelihood of making some observation is
entirely dependent on the model that underlies our
assumption.
Lesson: The datum has not
changed, our model has.
Therefore under the new model
the likelihood of observing the
datum has changed.
What is the probability of observing
a 'G' nucleotide?
– Model 1: frequency of G = 0.4 => likelihood(G) = 0.4
– Model 2: frequency of G = 0.25 => likelihood(G) = 0.25
One rule…the rule of 1.
• The sum of the likelihoods of all the
possibilities will always equal 1.
• E.g. for DNA p(a)+p(c)+p(g)+p(t)=1
What about longer sequences?
• If we consider a gene of length 2:
Gene 1:
ga
• The probability of observing this gene is
the product of the probabilities of observing
each character.
• E.g.
– p(g) = 0.4; p(a) = 0.15 (for instance)
– likelihood(ga) = 0.4 x 0.15 = 0.06
…or even longer sequences?
• Gene 1: gactagctagacagatacgaattac
• Model (simple base frequency model):
– p(a)=0.15; p(c)=0.2; p(g)=0.4; p(t)=0.25;
– (the sum of all probabilities must equal 1)
Like(Gene 1) = 0.000000000000000018452813
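The product over sites is easy to reproduce. The sketch below uses the sequence and base frequencies given above; the function name `sequence_likelihood` is invented for the example, and summing logarithms rather than multiplying raw probabilities is a design choice made here because products of many small numbers underflow for realistic sequence lengths.

```python
import math


def sequence_likelihood(seq, freqs):
    """Likelihood of a sequence under a pure base-frequency model.

    Returns (likelihood, log-likelihood); sites are treated as independent.
    """
    log_like = sum(math.log(freqs[base]) for base in seq)
    return math.exp(log_like), log_like


gene1 = "gactagctagacagatacgaattac"
model = {"a": 0.15, "c": 0.2, "g": 0.4, "t": 0.25}

like, log_like = sequence_likelihood(gene1, model)
print(f"{like:.8e}  (log-likelihood {log_like:.3f})")
```

In practice phylogenetic software reports the log-likelihood for exactly this reason.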
Note about models
• You might notice that our model of base
frequency is not the optimal model for our
observed data. If we had used the following
model:
p(a)=0.4; p(c)=0.2; p(g)=0.2; p(t)=0.2;
The likelihood of observing the gene is:
Like(gene 1) = 0.000000000000003435973837
(a value that is about 186 times higher)
Lesson: The datum has not
changed, our model has.
Therefore under the new model
the likelihood of observing the
datum has changed.
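For a pure base-frequency model the best-fitting frequencies can be written down directly. As a sketch, with n_i denoting the number of times base i occurs in the sequence and N the sequence length (symbols introduced here, not used on the slides), the likelihood and the frequencies that maximize it are:

```latex
L(\pi) = \prod_{i \in \{a,c,g,t\}} \pi_i^{\,n_i},
\qquad \sum_{i} \pi_i = 1
\quad\Longrightarrow\quad
\hat{\pi}_i = \frac{n_i}{N}
```

Gene 1 contains 10 a and 5 each of c, g and t in 25 sites, so the maximizing frequencies are (0.4, 0.2, 0.2, 0.2). That is the second model above, which is why it fits the observed gene better than the first.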
How does this relate to
phylogenetic trees?
• Consider an alignment of two sequences:
– Gene 1: gaac
– Gene 2: gacc
• We assume these genes are related by a
(simple) phylogenetic tree with branch lengths.
Increase in model sophistication
• It is no longer possible to simply invoke a model that
encompasses base composition; we must also include the
mechanism of sequence change and stasis.
• There are two parts to this model - the tree and the
process (the latter is confusingly referred to as the
model, although both parts really compose the model).
Note: We will stay with the confusing notation - to avoid further confusion.
The model
• The two parts of the model are the tree and the
process (the model).
• The model is composed of the composition and the
substitution process - the rate of change from one character
state to another character state.
Model = [Figure: tree] +
\[
\begin{pmatrix}
a & b & c & d \\
b & a & e & f \\
c & e & a & g \\
d & c & f & a
\end{pmatrix},
\qquad \pi = [a, c, g, t]
\]
Simple “time-reversible” model
• A simple model is that the rate of change from a to c or
vice versa is 0.4, the composition of a is 0.25 and the
composition of c is 0.25 (a simplified version of the
Jukes and Cantor 1969 model)
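\[
P = \begin{pmatrix}
\cdot & 0.4 & \cdot & \cdot \\
0.4 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot
\end{pmatrix},
\qquad \pi = [0.25 \;\; 0.25 \;\; \cdot \;\; \cdot]
\]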
Probability of the third nucleotide
position in our current alignment
• p(a) = 0.25; p(c) = 0.25; $p_{a \to c} = 0.4$
Starting with a, the likelihood of the nucleotide is 0.25 and
the likelihood of the substitution (branch) is 0.4. So the
likelihood of observing these data is:
*Likelihood(D|M) = 0.25 x 0.4 = 0.1
Note: you will get the same result if you start with c, since this model is
reversible
*The likelihood of the data, given the model.
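The remark about reversibility can be made explicit. As a sketch, writing π for the composition and P for the substitution probability (the notation used on the later slides), starting from either end of the branch gives the same product:

```latex
\pi_a \, P_{a \to c} = 0.25 \times 0.4 = \pi_c \, P_{c \to a}
```

More generally, a time-reversible model satisfies $\pi_i P_{i \to j} = \pi_j P_{j \to i}$ for every pair of states i and j, which is why the direction of time along a branch does not affect the likelihood.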
Substitution matrix
• For nucleotide sequences, there are 16 possible
ways to describe substitutions - a 4x4 matrix.
\[
P = \begin{pmatrix}
a & b & c & d \\
e & f & g & h \\
i & j & k & l \\
m & n & o & p
\end{pmatrix}
\]
Convention dictates that the order of the nucleotides is a,c,g,t
Note: for amino acids, the matrix is a 20 x 20 matrix and for codon-based models,
the matrix is 61 x 61
Substitution matrix - an example
\[
P = \begin{pmatrix}
0.976 & 0.01 & 0.007 & 0.007 \\
0.002 & 0.983 & 0.005 & 0.01 \\
0.003 & 0.01 & 0.979 & 0.007 \\
0.002 & 0.013 & 0.005 & 0.979
\end{pmatrix}
\]
• In this matrix, the probability of an a changing
to a c is 0.01 and the probability of a c
remaining the same is 0.983, etc.
Note: The rows of this matrix sum to 1 - meaning that for every
nucleotide, we have covered all the possibilities of what might happen
to it. The columns do not sum to anything in particular.
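The row-sum property is quick to verify. The sketch below copies the example matrix as printed (rows and columns in the order a, c, g, t); because the entries are rounded to three decimals, the row sums are only expected to equal 1 to within rounding.

```python
# Example substitution matrix, rows = from, columns = to, order a, c, g, t.
P = [
    [0.976, 0.010, 0.007, 0.007],
    [0.002, 0.983, 0.005, 0.010],
    [0.003, 0.010, 0.979, 0.007],
    [0.002, 0.013, 0.005, 0.979],
]

row_sums = [sum(row) for row in P]             # everything that can happen to one base
col_sums = [sum(col) for col in zip(*P)]       # no constraint applies to these

print("rows:   ", [round(s, 3) for s in row_sums])
print("columns:", [round(s, 3) for s in col_sums])
```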
To calculate the likelihood of the entire
dataset, given a substitution matrix, base
composition and a branch length of one
"certain evolutionary distance" or "ced"
\[
P = \begin{pmatrix}
0.976 & 0.01 & 0.007 & 0.007 \\
0.002 & 0.983 & 0.005 & 0.01 \\
0.003 & 0.01 & 0.979 & 0.007 \\
0.002 & 0.013 & 0.005 & 0.979
\end{pmatrix}
\]
Gene 1: ccat
Gene 2: ccgt
Likelihood given π = [0.1, 0.4, 0.2, 0.3]
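Likelihood of a two-sequence
alignment.
• ccat
• ccgt
\[
\pi_c P_{c \to c} \; \pi_c P_{c \to c} \; \pi_a P_{a \to g} \; \pi_t P_{t \to t}
= 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.3 \times 0.979
\approx 0.0000300
\]
(The rounded entries shown multiply to about 0.0000318; the quoted
figure presumably reflects unrounded matrix entries.)
Likelihood of going from the first to the second
sequence is 0.0000300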
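The site-by-site product can be sketched as follows. The matrix and composition are copied from the slide as printed, so the result is computed from rounded entries; `alignment_likelihood` is a name invented for this example.

```python
ORDER = "acgt"                                 # conventional nucleotide order

P = [
    [0.976, 0.010, 0.007, 0.007],
    [0.002, 0.983, 0.005, 0.010],
    [0.003, 0.010, 0.979, 0.007],
    [0.002, 0.013, 0.005, 0.979],
]
PI = [0.1, 0.4, 0.2, 0.3]                      # base composition, same order


def alignment_likelihood(seq1, seq2, pi, matrix):
    """Product over sites of pi[x] * P[x -> y] for aligned bases x, y."""
    like = 1.0
    for x, y in zip(seq1, seq2):
        i, j = ORDER.index(x), ORDER.index(y)
        like *= pi[i] * matrix[i][j]
    return like


# About 3.18e-05 with the rounded entries as printed.
print(f"{alignment_likelihood('ccat', 'ccgt', PI, P):.7f}")
```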
Different Branch Lengths
• For very short branch lengths, the probability of a
character staying the same is high and the probability of it
changing is low (for our particular matrix).
• For longer branch lengths, the probability of character
change becomes higher and the probability of staying the
same is lower.
• The previous calculations are based on the assumption that
the branch length describes one Certain Evolutionary
Distance or CED.
• If we want to consider a branch length that is twice as long
(2 CED), then we can multiply the substitution matrix by
itself (the matrix squared, P^2).
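Multiplying the matrix by itself, and repeating for longer branches, can be sketched in plain Python. The matrix, composition and the ccat / ccgt alignment are taken from the slides; the helper names are invented here. Because P is printed to three decimals, the computed powers and likelihoods only approximate the values quoted in the text; for instance, squaring P as printed gives g -> g and t -> t entries near 0.959, which is worth bearing in mind when comparing against the printed 2 CED matrix.

```python
ORDER = "acgt"
P = [
    [0.976, 0.010, 0.007, 0.007],
    [0.002, 0.983, 0.005, 0.010],
    [0.003, 0.010, 0.979, 0.007],
    [0.002, 0.013, 0.005, 0.979],
]
PI = [0.1, 0.4, 0.2, 0.3]


def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(matrix, k):
    """Return the k-th power of a matrix (k >= 1), i.e. the k CED matrix."""
    result = matrix
    for _ in range(k - 1):
        result = matmul(result, matrix)
    return result


def alignment_likelihood(seq1, seq2, k):
    """Likelihood of the two-sequence alignment for a branch of k CED."""
    pk = matrix_power(P, k)
    like = 1.0
    for x, y in zip(seq1, seq2):
        i, j = ORDER.index(x), ORDER.index(y)
        like *= PI[i] * pk[i][j]
    return like


for k in (1, 2, 3, 10, 15, 20, 30):
    print(k, f"{alignment_likelihood('ccat', 'ccgt', k):.7f}")
```

Repeated multiplication is enough for whole numbers of CED; branch lengths in between would need a continuous-time formulation, which is outside this sketch.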
2 CED model
\[
P^2 = P \times P =
\begin{pmatrix}
0.953 & 0.02 & 0.013 & 0.015 \\
0.005 & 0.966 & 0.015 & 0.029 \\
0.01 & 0.029 & 0.939 & 0.022 \\
0.007 & 0.038 & 0.015 & 0.94
\end{pmatrix}
\]
Which gives a likelihood of 0.0000559
Note the higher likelihood
For 3 CED
\[
P^3 = \begin{pmatrix}
0.93 & 0.029 & 0.019 & 0.022 \\
0.007 & 0.949 & 0.015 & 0.029 \\
0.01 & 0.029 & 0.939 & 0.022 \\
0.007 & 0.038 & 0.015 & 0.94
\end{pmatrix}
\]
This gives a likelihood of 0.0000782
Note that as the branch lengths increase, the
values on the diagonal decrease and the values on
the off-diagonals increase.
For higher values of CED units
Branch length (CED)    Likelihood
1                      0.0000300
2                      0.0000559
3                      0.0000782
10                     0.0001620
15                     0.0001770
20                     0.0001750
30                     0.0001520
[Figure: plot of likelihood against branch length, 0 to 40 CED.]
Likelihood of the alignment at
various branch lengths
[Figure: plot of the likelihood of the alignment ccat / ccgt (vertical axis, 0 to 0.0002) against branch length (horizontal axis, 0 to 0.6).]
The maximum likelihood value is 0.0001777 at a branch length
of 0.330614
The evolutionary revolution
• Organisms share a common ancestry and our
classification should reflect these histories (Darwin)
• Philosophy and methodology for reconstructing
evolutionary history - cladistics (Hennig)
• Philosophical nature of natural groups (Ghiselin, Hull)
• A nomenclatural system adapted to phylogenetic
systematics (Ghiselin, Griffiths, de Queiroz & Gauthier,
etc)
Content
Ancestry
Bryant & Cantino (2002)
claim that…
• Traditional taxonomists tend to
conceptualize taxa in terms of
content.
• Proponents of the PhyloCode tend
to conceptualize taxa in terms of
ancestry.
Definitions or Fixing the reference
[Figure: three trees of taxa A, B and C, each marking the clade called "Name": one defined at a node, one at a stem, and one at the origin of apomorphy x.]
Node based:
"Name" refers to the least inclusive clade comprising B and C
Stem based:
"Name" refers to the most inclusive clade comprising B and C,
but not A.
Apomorphy based:
"Name" refers to all taxa descending from the first ancestor
possessing apomorphy x.
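The node-based definition translates directly into a search for the smallest clade that still contains every specifier. The sketch below assumes a rooted tree written as nested tuples and a hypothetical extra taxon D; `leaf_set` and `node_based_clade` are names invented for the illustration, not part of any nomenclatural code or software.

```python
def leaf_set(node):
    """All taxon names at or below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def node_based_clade(tree, specifiers):
    """Least inclusive clade of `tree` that contains every specifier."""
    wanted = frozenset(specifiers)
    if not wanted <= leaf_set(tree):
        raise ValueError("not all specifiers occur in the tree")
    if not isinstance(tree, str):
        for child in tree:
            if wanted <= leaf_set(child):      # a smaller clade still holds them all
                return node_based_clade(child, specifiers)
    return leaf_set(tree)                      # no child does: this is the clade


# Hypothetical tree: D is an extra taxon nested inside the clade of B and C.
tree = ("A", (("B", "D"), "C"))
print(sorted(node_based_clade(tree, ["B", "C"])))
```

The point of the example is that the content of "Name" is not fixed in advance: D belongs to it only because of where D falls on the tree, which is the ancestry-based view of taxa described above.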