Models of nucleotide substitution

Molecular phylogenetics 1
Level 3 Molecular Evolution and
Bioinformatics
Jim Provan
Page and Holmes: Sections 6.1-2
Distances vs. discrete characters
This division is based on how the data are treated:
Distance methods first convert aligned sequences into a
pairwise distance matrix, then input that matrix into a tree
building method
Discrete methods consider each nucleotide site (or function
of each site) separately
Example alignment (four sequences, seven sites):

        Sequences
Sites   1   2   3   4
1       T   A   A   A
2       T   A   A   A
3       A   T   A   A
4       T   T   A   A
5       T   T   A   A
6       A   A   T   A
7       A   A   A   T
[Figure: Parsimony tree for sequences 1-4, with the substitutions at sites 1-7 mapped onto its branches]
Pairwise distances (number of sites that differ):

     1   2   3
2    3
3    5   4
4    5   4   2

[Figure: Distance tree for sequences 1-4, with branch lengths]
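The first step of a distance method, converting aligned sequences into a pairwise distance matrix, can be sketched in Python as below. This is only an illustration: the sequence strings are read off the example alignment (one string per sequence, one character per site), the distance is the raw count of differing sites with no correction for multiple substitutions, and the function names `hamming` and `distance_matrix` are made up for this sketch.

```python
from itertools import combinations

# The four example sequences, read down the columns of the
# site-by-sequence table (seven sites each).
sequences = {
    1: "TTATTAA",
    2: "AATTTAA",
    3: "AAAAATA",
    4: "AAAAAAT",
}


def hamming(a, b):
    """Count the sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Return {(i, j): distance} for every pair of sequences with i < j."""
    return {
        (i, j): hamming(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


distances = distance_matrix(sequences)
for (i, j), d in distances.items():
    print(f"d({i},{j}) = {d}")
```

The resulting dictionary is what a tree-building method such as neighbour joining or minimum evolution would then take as input.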
Distances vs. discrete characters
For this example, the trees obtained by parsimony (a discrete
method) and minimum evolution (a distance method) are
identical in topology and branch lengths:
Parsimony analysis identifies seven substitutions and places them
on the five branches of the tree
Distance tree apportions observed distances between sequences
over branches of the tree
Under parsimony each site requires one change, which gives a total
of seven changes
Summing the branch lengths of the distance tree gives the same
value: 2 + 1 + 2 + 1 + 1 = 7
The parsimony tree gives additional information: which site
contributes to which branch, and the ancestral states
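The branch-length arithmetic above can be checked with a short Python sketch. It assumes the tree groups sequences 1 and 2 against 3 and 4, and it assigns the five lengths 2, 1, 2, 1, 1 to particular branches in the one way that reproduces the pairwise distances; that assignment, and the internal node names "X" and "Y", are assumptions made for this illustration.

```python
from itertools import combinations

# Assumed branch lengths of the unrooted tree ((1,2),(3,4)).
# "X" and "Y" are hypothetical names for the two internal nodes.
branches = {
    (1, "X"): 2,
    (2, "X"): 1,
    ("X", "Y"): 2,
    (3, "Y"): 1,
    (4, "Y"): 1,
}

# Observed pairwise distances between the four example sequences.
observed = {(1, 2): 3, (1, 3): 5, (1, 4): 5, (2, 3): 4, (2, 4): 4, (3, 4): 2}


def build_adjacency(branch_lengths):
    """Turn the branch list into a symmetric adjacency map."""
    adjacency = {}
    for (u, v), length in branch_lengths.items():
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency


def path_length(adjacency, start, goal):
    """Length of the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in adjacency[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    raise ValueError("nodes are not connected")


adjacency = build_adjacency(branches)
tree_distances = {
    (i, j): path_length(adjacency, i, j)
    for i, j in combinations([1, 2, 3, 4], 2)
}
total_length = sum(branches.values())

print(tree_distances == observed)  # do the path lengths match the data?
print(total_length)                # sum of the five branch lengths
```

With these lengths every tip-to-tip path equals the corresponding observed distance, which is what it means for the distance tree to apportion the distances over its branches.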
Clustering methods vs. search methods
Cluster methods follow a set of steps (an algorithm)
and arrive at a tree:
Advantages:
– Easy to implement, resulting in very fast computer programs
– Always produce a single tree
Disadvantages:
– Results obtained from simple clustering algorithms often
depend on the order in which sequences are added to the
growing tree
– Do not allow evaluation of competing hypotheses: two different
trees could explain the data equally well, but there is no way of
measuring the fit between tree and data
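As an illustration of a clustering algorithm, here is a minimal Python sketch of UPGMA applied to the pairwise distances of the four example sequences. It is a simplified version written for this example, not a reference implementation: it returns only the topology (as nested tuples, without branch lengths), it breaks ties by taking the first closest pair it meets, and the function name `upgma` and its data layout are choices made here.

```python
def upgma(labels, dist):
    """Join the two closest clusters repeatedly until one tree remains.

    labels: list of tip names
    dist:   {(a, b): distance} for each pair of tips
    Returns the rooted tree as nested tuples.
    """
    # Each cluster is (tree, number_of_tips).
    clusters = [(label, 1) for label in labels]
    # Store distances symmetrically, keyed by a pair of trees.
    d = {}
    for (a, b), value in dist.items():
        d[(a, b)] = d[(b, a)] = value

    while len(clusters) > 1:
        # Step 1: find the closest pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                value = d[(clusters[i][0], clusters[j][0])]
                if best is None or value < best[0]:
                    best = (value, i, j)
        _, i, j = best
        (tree_i, n_i), (tree_j, n_j) = clusters[i], clusters[j]
        merged = (tree_i, tree_j)

        # Step 2: distance from the new cluster to each remaining cluster
        # is the size-weighted average of the two old distances.
        for k, (tree_k, _) in enumerate(clusters):
            if k not in (i, j):
                avg = (n_i * d[(tree_i, tree_k)]
                       + n_j * d[(tree_j, tree_k)]) / (n_i + n_j)
                d[(merged, tree_k)] = d[(tree_k, merged)] = avg

        # Step 3: replace the two old clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, n_i + n_j))

    return clusters[0][0]


observed = {(1, 2): 3, (1, 3): 5, (1, 4): 5, (2, 3): 4, (2, 4): 4, (3, 4): 2}
tree = upgma([1, 2, 3, 4], observed)
print(tree)
```

The sketch shows both properties listed above: the procedure is a fixed sequence of steps that always ends with exactly one tree, and nothing in it measures how well that tree fits the data compared with any alternative.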
A clustering method
[Figure: Panels "Start tree", "Round 1" and "Round 2" for sequences A-E. In each round: decide where to place the next sequence, then add it to the tree]
Search methods
Tree-building methods in this class use optimality
criteria to choose among the set of all possible trees:
Criterion is used to assign a “score” or “rank” to each tree
which is a function of the relationship between the tree and
the data
Require an explicit function relating tree and data (e.g. a
model of how sequences evolve)
Allow comparison of how well competing hypotheses of
evolutionary relationships fit the data
Major disadvantage is that optimality methods are
computationally very expensive, because two problems must be solved:
– For a given data set and tree, what is the optimality value?
– Which of all possible trees has the best optimality value?
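Both questions can be illustrated with a small Python sketch that uses maximum parsimony as the optimality criterion on the four example sequences. Scoring a tree is done here with Fitch's counting method, and the search is simply an exhaustive loop over the three possible groupings of four sequences; the function names and the nested-tuple tree encoding are assumptions made for this sketch.

```python
# The four example sequences (seven sites each).
sequences = {1: "TTATTAA", 2: "AATTTAA", 3: "AAAAATA", 4: "AAAAAAT"}


def fitch_site(tree, states):
    """Return (possible states, number of changes) for one site."""
    if not isinstance(tree, tuple):          # a tip: its observed state
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                               # children agree: no new change
        return common, left_changes + right_changes
    # children disagree: one extra change is needed at this node
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, seqs):
    """Question 1: the optimality value of one tree for the data."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in seqs.items()}
        total += fitch_site(tree, states)[1]
    return total


# Question 2: which of all possible trees has the best value?
# With four sequences there are only three unrooted topologies.
candidate_trees = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
scores = {tree: parsimony_score(tree, sequences) for tree in candidate_trees}
best_tree = min(scores, key=scores.get)

for tree, score in scores.items():
    print(tree, score)
print("best:", best_tree)
```

Because every candidate receives a score, competing trees can be compared directly, which is exactly what a clustering method does not offer. The cost is that the loop over candidates grows very quickly with the number of sequences.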
An optimality method
[Figure: The 15 possible unrooted trees for five sequences (A-E), each evaluated under the optimality criterion]
Non-deterministic polynomial-completeness (NP-complete) problems
NP-complete problems are a class of problems for which no
efficient (polynomial-time) algorithm is known
Problem of finding the optimal evolutionary tree for a
variety of criteria (e.g. minimum evolution, maximum
parsimony) is NP-complete:
For even a moderate number of sequences (e.g. 20) there are far too
many possible trees to examine them all, so there is no practical
guarantee that the optimal tree has been found
In such cases, we must rely on heuristics to find something
approaching the best tree, but the result may be far from optimal
Example: human mitochondrial DNA, where different researchers
obtained quite different trees using different heuristic searches
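To give a sense of scale, the number of distinct unrooted bifurcating trees for n sequences is (2n - 5)!! = 3 x 5 x ... x (2n - 5). The short Python sketch below evaluates this product; the function name `count_unrooted_trees` is made up for this illustration.

```python
def count_unrooted_trees(n_sequences):
    """Number of unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_sequences < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_sequences - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four sequences give 3 trees and five give 15, but 20 sequences already give roughly 2.2 x 10^20, which is why exhaustive evaluation is out of reach and heuristic searches are used instead.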
[Figure: A heuristic method]
Subtree methods
The effectiveness of an heuristic search depends in part on
the number of trees examined, which can be
computationally demanding
An alternative approach is to divide the set of sequences
into smaller sets and find optimal trees for these subsets:
Smallest unrooted tree is a quartet
Each quartet has three possible unrooted trees
Quartet puzzling follows these two steps:
– For each quartet, identify the optimal tree
– Take all four-sequence trees from step 1 and assemble them into a tree
Because of homoplasy, the quartet trees will not all agree, so the
best tree is usually taken to be the one which contains the most
quartets (but finding it is an NP-complete problem as well)
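Step 1 of the procedure can be sketched in Python as follows. Note the substitution: to keep the example short, the "optimal" tree for each quartet is chosen here with a simple distance rule (the grouping whose two within-pair distances have the smallest sum) rather than with the likelihood criterion normally associated with quartet puzzling, and step 2 (assembling the quartet trees) is not attempted. All function names are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(a, b, c, d):
    """The three possible unrooted trees for four sequences."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def best_quartet(dist, a, b, c, d):
    """Pick the grouping with the smallest sum of within-pair distances.

    This distance rule is a stand-in for the optimality criterion.
    """
    def pair_sum(topology):
        (p, q), (r, s) = topology
        return dist[frozenset((p, q))] + dist[frozenset((r, s))]
    return min(quartet_topologies(a, b, c, d), key=pair_sum)


# Pairwise distances for the four example sequences.
raw = {(1, 2): 3, (1, 3): 5, (1, 4): 5, (2, 3): 4, (2, 4): 4, (3, 4): 2}
observed = {frozenset(pair): value for pair, value in raw.items()}

# Step 1: identify the preferred tree for every quartet of sequences.
labels = [1, 2, 3, 4]
step1 = {
    quartet: best_quartet(observed, *quartet)
    for quartet in combinations(labels, 4)
}
print(step1)

# The number of quartets grows as "n choose 4".
print(comb(20, 4))
```

With only four sequences there is a single quartet, so the sketch mainly shows the shape of the computation; for 20 sequences the same loop would visit several thousand quartets, each with just three candidate trees.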
Comparing tree-building methods
                          Type of data
Tree-building method      Distances            Nucleotide sites
Clustering algorithm      UPGMA                (none)
                          Neighbour joining
Optimality criterion      Minimum evolution    Maximum parsimony
                                               Maximum likelihood
Comparing tree-building methods
Efficiency:
Effectively the time in which a computer program can find a tree
Since virtually all optimality methods are NP-complete, efficient
tree-searching algorithms that guarantee the best tree are unlikely
to exist
Some optimality criteria can be evaluated more quickly than others:
heuristic searches using parsimony can explore a much larger
number of trees than searches using likelihood
Power:
Measure of how much data are needed before we can be
reasonably sure of arriving at the correct result
A method may be theoretically appealing, but if it requires huge
numbers of sites it is not practical
Comparing tree-building methods
Consistency:
Will the method converge on the true tree as data are added?
Inconsistent methods will fail even if data are continually added
Robustness:
All tree-building methods make (implicit or explicit) assumptions
about evolutionary processes
A method that is sensitive to violations of its underlying model
(e.g. the assumption of a molecular clock) will return poor
estimates of phylogeny when those assumptions do not hold
Falsifiability:
The ability to tell whether these assumptions have been violated
i.e. that we should not be using the method at all!