
III/ Phylogenetics
Up to now we have compared only two genomes.
Phylogenetics is a useful context to compare
more than two.
Example: Multiple alignment.
Suppose we have found 4 homologous
sequences by pairwise comparisons
AACGTACGATAT, ACGTAAATGT
GTCGTATTAA
GTCAAAATAA
And we would like to align them
But pairwise alignment might be incompatible
(for example, see AACC, ATCC and ATAC)
Score if match = +2, substitution = -1, indel = -1
Multiple alignment integrates all sequences.
A A C G T A C G A T A T
A - C G T A - A A T G T
G T C G T A - - T T A A
G T C A A A - - A T A A
Score?
First possibility: sum the score of pairwise
comparisons
Second possibility: evolutionary score with a
phylogeny
[Figure: a phylogeny with the four aligned sequences at its
leaves: GTCGTATTAA, GTCAAAATAA, AACGTACGATAT, ACGTAAATGT]
But then the phylogeny and
the alignment have to be
estimated
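The first possibility (sum of the pairwise scores) can be sketched as follows, with the scores given above (match = +2, substitution = -1, indel = -1). The function names are illustrative, and scoring a gap facing a gap as 0 is an assumption: the text does not say how such columns are counted.

```python
from itertools import combinations

def pair_score(a, b, match=2, substitution=-1, indel=-1):
    """Score one column of a pairwise comparison."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap facing a gap costs nothing
    if a == "-" or b == "-":
        return indel
    return match if a == b else substitution

def sum_of_pairs(alignment):
    """Sum the pairwise scores over all pairs of rows and all columns."""
    return sum(
        pair_score(r[c], s[c])
        for r, s in combinations(alignment, 2)
        for c in range(len(r))
    )

# The multiple alignment of the example, with "-" for gaps.
alignment = [
    "AACGTACGATAT",
    "A-CGTA-AATGT",
    "GTCGTA--TTAA",
    "GTCAAA--ATAA",
]
print(sum_of_pairs(alignment))
```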
Definition: A phylogenetic tree on a set X is a tree and a
bijection between X and its leaves. It is binary if all internal
vertices have degree 3
Exercise: compute the number of binary phylogenetic trees
over a set X of size n
Definition: A rooted phylogenetic tree is a phylogenetic tree
with a root. It is binary if the root has degree 2 and all
internal vertices have degree 3.
Exercise: compute the number of rooted binary
phylogenetic trees over a set of size n
Rooted binary phylogenetic trees tell an
evolutionary history: Internal nodes are ancestral
states at the moment of a diversification event,
edges represent evolution.
Evolution at all scales of living systems might not be tree-like
Events leading to tree-like evolution :
- cell division and replication
- speciation
- duplication
Events leading to non tree-like evolution
- sex and recombination
- hybridization
- lateral transfer
- symbiosis
The definition of « atoms » of evolution is an issue
- nucleotides
- genes
- genomes
- individuals
- populations, species
Some phylogenetic problems
1/ From a pairwise distance matrix, compute a tree
whose distance is as close as possible to the matrix
2/ From a rooted binary tree and homologous
characters at its leaves, compute ancestral states
3/ From a set of homologous characters, compute
the binary tree which explains best the data
4/ Combine trees and models at different scales
1.1 Trees and distances
A way to scale to multiple genomes at minimum
cost when you know how to compute pairwise
evolutionary distances.
Such methods are rarely used to produce trustworthy
trees, but are very often used as a first quick hint, or as
starting points for exploration methods.
Definition: a non negative edge weighted tree
defines a metric between the leaves: sum of the
edge weights along the paths
Definition: a metric is a tree metric if it is induced
by some edge-labeled tree (discuss the 0 weight
edges)
Theorem: a metric d is a tree metric if and only if
for any quartet i,j,k,l, among dij+dkl, dik+djl, dil+djk,
two are equal and not less than the third
Proof: => check the relation on any quartet
<= construct a tree from a metric by
induction: (1) for a quartet construct a tree,
the weights are uniquely determined, (2)
choose p,q,r such that d(p,r)+d(q,r)-d(p,q) is
maximal, construct an additional point t on
the path from p to q such that
d(t,p) = (d(p,r)+d(p,q)-d(q,r))/2
d(t,q) = (d(q,r)+d(p,q)-d(p,r))/2
d(t,x) = d(p,x)-d(t,p)
Construct the tree on X\{p,q}+{t} and add two
edges for p and q adjacent to t
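The quartet condition of the theorem can be checked directly. This is a minimal sketch: the function name and the tolerance `tol` (for floating-point distances) are illustrative choices, and D is assumed to be a symmetric matrix given as a list of lists.

```python
from itertools import combinations

def is_tree_metric(D, tol=1e-9):
    """Four-point condition: for every quartet, the two largest of the
    three pair sums must be equal."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Quartet ((a,b),(c,d)) with terminal weights 1, 2, 3, 4 and internal weight 5.
tree_like = [[0, 3, 9, 10],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [10, 11, 7, 0]]
# Four points on a cycle with unit edges: not a tree metric.
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
print(is_tree_metric(tree_like), is_tree_metric(cycle))
```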
Theorem: the weighted (non-binary) tree
inducing a tree metric is unique
Proof:
(1) a metric determines quartets
(2) quartets determine bipartitions
(3) bipartitions determine the tree
Distance on rooted tree
Definition: a metric is ultrametric if for all triplets
i,j,k, among dij,dik,djk, two are equal and not less
than the third.
Theorem. A metric on X is ultrametric if and only if
it is induced by a rooted equidistant phylogenetic tree on
X. The rooted (non binary) tree is unique.
Proof: exercise, by induction on the size of X.
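The triplet condition of the definition can be tested in the same spirit. A minimal sketch, with an illustrative function name and a tolerance `tol` added for floating-point input:

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Three-point condition: in every triplet, the two largest
    distances must be equal."""
    for i, j, k in combinations(range(len(D)), 3):
        smallest, middle, largest = sorted([D[i][j], D[i][k], D[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True

print(is_ultrametric([[0, 2, 4], [2, 0, 4], [4, 4, 0]]))
print(is_ultrametric([[0, 2, 3], [2, 0, 4], [3, 4, 0]]))
```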
The molecular clock hypothesis
Molecular Evolution will produce ultrametric trees
Zuckerkandl, 1962
In practice, it is far from a general rule
1.2 Reconstruction methods
Usually metrics obtained from data are not
exactly tree-like.
Reconstruction method: from a metric,
reconstruct a tree
Consistency property: if the metric is a tree
metric, the reconstruction method returns
the corresponding tree.
Optimality: the reconstruction method
returns a "least squares" tree (implies
consistency)
Least squares means we minimize, over all tree metrics d,
Σ_ij (d_ij - D_ij)^2
where D_ij is the initial distance
We can write this
Σ_ij (Σ_k x_kij v_k - D_ij)^2
where x_kij is 1 if the path between i and j contains branch k
and 0 otherwise, and v_k is the weight of branch k
x_kij contains the structure (the topology) of the tree, and
v_k the weights. If we consider the structure fixed, we can
find the weights by minimizing
Σ_ij (Σ_k x_kij v_k - D_ij)^2
over all v_k with x_kij fixed, which can be achieved by
differentiating this function with respect to each v_k and
setting the derivatives to 0
This gives a system of linear equations, one per branch k, of type
Σ_ij x_kij (Σ_l x_lij v_l - D_ij) = 0
which means: summed over all paths containing k,
the tree distances d_ij equal the initial distances D_ij
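For a fixed topology, the linear system above can be solved directly. The sketch below is one possible encoding, not a reference implementation: each branch k is described by the set of leaves on one of its sides (so x_kij = 1 exactly when i and j are separated by it), and the normal equations are solved by a small Gauss-Jordan elimination. All names are illustrative.

```python
from itertools import combinations

def solve_linear(M, b):
    """Solve M v = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def least_squares_weights(leaves, splits, D):
    """Branch weights v_k minimizing sum_ij (sum_k x_kij v_k - D_ij)^2.

    splits[k] is the set of leaves on one side of branch k;
    D maps each pair (i, j), in the order of `leaves`, to its distance.
    """
    pairs = list(combinations(leaves, 2))
    # x[k][m] = 1 if branch k is on the path between the m-th pair
    x = [[1.0 if (i in s) != (j in s) else 0.0 for i, j in pairs]
         for s in splits]
    target = [D[pair] for pair in pairs]
    K, P = len(splits), len(pairs)
    M = [[sum(x[k][m] * x[l][m] for m in range(P)) for l in range(K)]
         for k in range(K)]
    b = [sum(x[k][m] * target[m] for m in range(P)) for k in range(K)]
    return solve_linear(M, b)

# Quartet ((a,b),(c,d)): terminal weights 1, 2, 3, 4 and internal weight 5.
D = {("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
     ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7}
splits = [{"a"}, {"b"}, {"c"}, {"d"}, {"a", "b"}]
weights = least_squares_weights(["a", "b", "c", "d"], splits, D)
print(weights)
```

Since D is here an exact tree metric for this topology, the recovered weights should match the weights used to build it, up to rounding.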
Constructing the optimum tree is NP-hard, but there
are reasonably fast heuristics
- If we suppose that the distance is close to ultrametric, a
simple method is UPGMA: choose the two closest
elements p and q, connect them with a new element t
in between
d(t,p) = d(t,q) = d(p,q)/2
d(t,x) = (d(p,x)+d(q,x))/2
(in UPGMA proper, d(p,x) and d(q,x) are weighted by the
numbers of leaves under p and q; the simple average is the
WPGMA variant)
The method is consistent with an ultrametric
distance. It is easy to find examples where
consistency is not guaranteed for general tree metrics.
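A minimal sketch of this clustering procedure. It uses the size-weighted average with which UPGMA is usually defined (it coincides with the simple average when p and q are single leaves); the nested-tuple output and all names are illustrative choices.

```python
def upgma(names, D):
    """Return (tree, height): a nested-tuple tree and the root height."""
    # cluster id -> (subtree, number of leaves, height)
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    dist = {(i, j): D[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        p, q = min(dist, key=dist.get)        # the two closest elements
        tp, sp, _ = clusters.pop(p)
        tq, sq, _ = clusters.pop(q)
        height = dist[(p, q)] / 2             # d(t,p) = d(t,q) = d(p,q)/2
        new_dist = {}
        for x in clusters:
            dpx = dist[(min(p, x), max(p, x))]
            dqx = dist[(min(q, x), max(q, x))]
            new_dist[(x, next_id)] = (sp * dpx + sq * dqx) / (sp + sq)
        dist = {k: v for k, v in dist.items() if p not in k and q not in k}
        dist.update(new_dist)
        clusters[next_id] = ((tp, tq), sp + sq, height)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height

tree, height = upgma(["a", "b", "c"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree, height)
```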
Improvement: Neighbor-Joining (Saitou, Nei, 1987)
Identify two vertices p and q such that the sum of the
branches of the tree ((p,q),(other vertices)) estimated by
the least-squares method is minimum (selection step).
With r the current number of vertices, this leads to
minimizing over all p,q
S_pq = 1/(2(r-2)) Σ_x (D_px + D_qx) + D_pq/2 + 1/(r-2) Σ_xy D_xy
(x ranges over the vertices other than p and q, and xy over
the pairs of such vertices)
which yields an O(n^5) algorithm, where n is the number of
leaves
Improvement: minimizing S_pq
is equivalent to minimizing
Q(p,q) = (r-2) D_pq - S_p - S_q, where S_p = Σ_x D_px
which lowers the complexity to O(n^3)
Once p and q are chosen, add a new element t,
and construct the tree on the metric space with r-1 points
obtained by removing p,q and adding t
(reduction step)
Distances are estimated by least squares
(estimation step):
pt = 1/2(Dpq+(Sp-Sq)/(r-2))
qt = 1/2(Dpq+(Sq-Sp)/(r-2))
tx = 1/2(px+qx-pt-qt)
Recursively compute a tree on the metric space with r-1 points
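The selection, estimation and reduction steps can be sketched as below. The selection criterion used is the standard Neighbor-Joining one, Q(p,q) = (r-2) D_pq - S_p - S_q with S_p the sum of the distances from p; the edge-list output and the names of the internal vertices ("t0", "t1", ...) are illustrative choices.

```python
def neighbor_joining(names, D):
    """Return the edges (u, v, length) of the Neighbor-Joining tree."""
    nodes = list(names)
    d = {a: {b: float(D[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges, counter = [], 0
    while len(nodes) > 2:
        r = len(nodes)
        S = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Selection step: minimize Q(p,q) = (r-2) D_pq - S_p - S_q
        p, q = min(
            ((a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]),
            key=lambda ab: (r - 2) * d[ab[0]][ab[1]] - S[ab[0]] - S[ab[1]],
        )
        # Estimation step
        pt = 0.5 * (d[p][q] + (S[p] - S[q]) / (r - 2))
        qt = d[p][q] - pt
        t = f"t{counter}"
        counter += 1
        edges += [(p, t, pt), (q, t, qt)]
        # Reduction step: tx = 1/2 (px + qx - pt - qt), with pt + qt = D_pq
        d[t] = {t: 0.0}
        for x in nodes:
            if x not in (p, q):
                d[t][x] = d[x][t] = 0.5 * (d[p][x] + d[q][x] - d[p][q])
        nodes = [x for x in nodes if x not in (p, q)] + [t]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

# Tree metric of the quartet ((a,b),(c,d)) with weights 1, 2, 3, 4 and 5.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
edges = neighbor_joining(["a", "b", "c", "d"], D)
for edge in edges:
    print(edge)
```

On a tree metric such as this one, the edges returned should be those of the tree realizing the metric, which is the consistency property discussed in the text.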
Neighbor Joining is consistent
(short proof by Bryant, journal of classification, 2005)
All there is to prove is that p and q are neighbors in the
tree realizing the metric (then the reduction is exact in
the case of a tree).
Assume all weights on terminal branches are equal.
Take v an internal vertex maximizing sum_i Dvi
Take two leaves a,b adjacent to v and prove that
Q(a,b) < Q(p,q)
The estimation step is flexible : it is possible to
change the formulas and maintain consistency.
Exercise : for example, prove that consistency is
maintained if we replace
tx = 1/2(px+qx-pt-qt)
by
tx = lambda(px-pt) + (1-lambda)(qx-qt)
Then lambda can be estimated at each step
to optimize another criterion (principle of
BioNJ)
2. Ancestral character reconstruction
Another way to find good trees is to model the evolution
of characters along trees, and to find the minimum
evolution one or the most probable one.
As a sub-problem, we assume that we have found a set
of homologous characters, and we know their
phylogenetic relationships.
For example, a multiple alignment and a tree, or a set of
genomes as matchings or permutations and a tree, or
any set of characters (presence/absence of a trait,
discrete or continuous value for a trait)
Then we can sketch a general rough method for a
molecular evolutionary study:
- find homologies (pattern matching)
- find a starting tree (distance method)
- compute the evolution of homologous characters along
this tree (sequence evolution or rearrangements)
- score this evolution, supposing every homologous
character evolves independently
- try to find a better tree nearby
Example: presence or absence of a trait
(values shown on the tree: 0, 1, 1, 0, 0)
Example: one amino acid site
(values shown on the tree: E, R, R, U, U)
Example: genome size
(values shown on the tree: 3.5, 3.2, 3.2, 2.4, 2.2)
General problem. Given
- a rooted phylogenetic tree
- a state space, and values from this space on the leaves
- a non negative cost function d(x,y) on the state space
Find the states at the internal nodes, minimizing the sum
of the cost over all branches
Reference for this part is Miklos, 2013, in: Models and Algorithms for Genome Evolution, Springer
Variants: find
- the minimum cost solutions or
- the most probable solution, or
- the most probable solutions, or
- sample among probable solutions, or
- sample according to the probability distribution, or
- integrate over all solutions.
Examples
1- The directed case
The label space is totally ordered,
d(x,y) = infinite if x > y and
d(x,z) = d(x,y) + d(y,z) if x < y < z
2- The Dollo case on binary state space
Only one 0->1 transition is allowed,
minimize the number of 1->0
3- Fitch parsimony on a discrete space
d(x,y)=1 if x ≠ y and d(x,y) = 0 if x = y
4- On continuous characters on R,
d(x,y)=|x-y| (Wagner parsimony)
5- On continuous characters on R,
squared parsimony, d(x,y)=(x-y)^2
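Fitch Method:
Ascending phase: each leaf gets the set containing its state;
each internal node gets the intersection of the sets of its
children if it is not empty, and their union otherwise
(for example {0,1} above children 0 and 1)
Fitch Method:
Descending phase: choose a state in the set of the root;
each other node takes the state of its parent if it belongs
to its set, and any state of its set otherwise
Fitch Method:
Some optimal solutions are never examined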
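The two phases of the Fitch method can be sketched as follows on a rooted binary tree. Assumptions for illustration: the tree is a nested tuple whose leaves are the observed states, a labelled internal node is returned as (state, left, right), and ties are broken by taking the smallest state.

```python
def fitch(tree):
    """Return (labelled_tree, number_of_changes) for a nested-tuple tree."""
    changes = 0

    def ascend(node):
        """Ascending phase: compute the state set of each node."""
        nonlocal changes
        if not isinstance(node, tuple):
            return ({node}, None, None)
        left, right = ascend(node[0]), ascend(node[1])
        common = left[0] & right[0]
        if common:
            return (common, left, right)
        changes += 1  # empty intersection: one change is needed
        return (left[0] | right[0], left, right)

    def descend(info, parent_state):
        """Descending phase: keep the parent state whenever possible."""
        states, left, right = info
        state = parent_state if parent_state in states else min(states)
        if left is None:
            return state
        return (state, descend(left, state), descend(right, state))

    return descend(ascend(tree), None), changes

labelled, changes = fitch(((0, 1), (1, (0, 0))))
print(labelled, changes)
```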
A general scheme by dynamic programming
f(u,x) is the cost of assigning x to node u in the subtree
rooted at u
If u is a leaf, f(u,x) = 0 if x is the value of u, ∞ otherwise
If u is not a leaf, f(u,x) is the sum, over the children v of u, of
min_y (f(v,y) + d(x,y))
Sankoff, Rousseau, 1975
For a binary state space, with unit costs, at a node x with
children y and z:
c(x,1) = min (c(y,1)+c(z,1),
c(y,0)+c(z,1)+1,
c(y,1)+c(z,0)+1,
c(y,0)+c(z,0)+2)
c(x,0) = min (c(y,1)+c(z,1) +2,
c(y,0)+c(z,1)+1,
c(y,1)+c(z,0)+1,
c(y,0)+c(z,0))
If x is a leaf:
c(x,1) = 0 and c(x,0) = ∞ if x has value 1
c(x,0) = 0 and c(x,1) = ∞ if x has value 0
Forward algorithm to compute the scores f(u,x)
(postorder traversal of the tree)
Backward algorithm to assign values to the nodes
If the state space is small, the algorithm can be
implemented with complexity O(m r^2), where
m is the size of the tree, and r is the size of the
state space
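A minimal sketch of the forward and backward passes for an arbitrary cost function d(x,y). The nested-tuple tree encoding (leaves are observed states) and the function names are assumptions made for the illustration; the recurrence used is f(u,x) = sum over children v of min_y (f(v,y) + d(x,y)).

```python
import math

def sankoff(tree, states, cost):
    """Return (minimum cost, labelled tree) for a nested-tuple tree."""

    def forward(node):
        """Postorder pass: table f(u, x) for every node u."""
        if not isinstance(node, tuple):
            return ({x: 0 if x == node else math.inf for x in states}, ())
        children = tuple(forward(child) for child in node)
        f = {x: sum(min(cf[y] + cost(x, y) for y in states)
                    for cf, _ in children)
             for x in states}
        return (f, children)

    def backward(info, parent_state):
        """Preorder pass: pick at each node a state reaching the minimum."""
        f, children = info
        if parent_state is None:
            state = min(states, key=lambda x: f[x])
        else:
            state = min(states, key=lambda y: f[y] + cost(parent_state, y))
        if not children:
            return state
        return (state,) + tuple(backward(child, state) for child in children)

    root = forward(tree)
    return min(root[0].values()), backward(root, None)

def unit_cost(x, y):
    """Fitch parsimony cost: 1 for a change, 0 otherwise."""
    return 0 if x == y else 1

best, labelled = sankoff(((0, 1), (1, (0, 0))), [0, 1], unit_cost)
print(best, labelled)
```

Each internal node loops over all pairs of states for each child, which gives the O(m r^2) bound stated above.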
Choice between equally parsimonious
solutions: possibility of favoring convergent
evolution or reversions by pushing
mutations to the root or to the leaves
Quantitative characters
d(x,y)=(x-y)^2 (for continuous characters)
or
d(x,y) = |x-y| (for discrete or continuous characters)
[Figure: a node a above b and c, labelled "Mean or median"]
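Squared parsimony
For a node N with children P and Q:
S_N(x) = min_z(S_P(z)+(x-z)^2) + min_z(S_Q(z)+(x-z)^2)
By induction, S_N(x) is a quadratic function of x
S_N(x) = a x^2 + b x + c
a, b and c can be calculated from the parameters of P and Q:
- a leaf child with value v contributes (x-v)^2
- an internal child P with S_P(z) = a_P z^2 + b_P z + c_P contributes
(a_P/(a_P+1)) x^2 + (b_P/(a_P+1)) x + c_P - b_P^2/(4(a_P+1))
The three cases (P internal and Q leaf, both internal, both
leaves) are sums of two such contributions
At the root R, minimize the quadratic function S_R(x)
Propagate minimum values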
Complexity O(m)
Easily extendable to a weighted version
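The ascending pass of squared parsimony can be sketched by propagating the three coefficients of the quadratic function. Assumptions for illustration: the tree is a nested tuple with numeric leaves, and only the root value and the minimum cost are returned (the downward propagation of the minimum values is left out). The contribution of an internal child with coefficients (a, b, c) is obtained by minimizing a z^2 + b z + c + (x-z)^2 over z.

```python
def squared_parsimony(tree):
    """Return (optimal root value, minimum total squared cost)."""

    def contribution(node):
        """Coefficients of min_z (S(z) + (x - z)^2) as a quadratic in x."""
        if not isinstance(node, tuple):
            # leaf with value v: (x - v)^2 = x^2 - 2 v x + v^2
            return (1.0, -2.0 * node, float(node) ** 2)
        a, b, c = quadratic(node)
        return (a / (a + 1), b / (a + 1), c - b * b / (4 * (a + 1)))

    def quadratic(node):
        """Coefficients (a, b, c) of S_N(x) for an internal node N."""
        parts = [contribution(child) for child in node]
        return tuple(sum(part[i] for part in parts) for i in range(3))

    a, b, c = quadratic(tree)
    x = -b / (2 * a)  # minimum of a x^2 + b x + c
    return x, c - b * b / (4 * a)

root_value, cost = squared_parsimony(((0.0, 2.0), 4.0))
print(root_value, cost)
```

Each node is visited once and does a constant amount of work, in line with the O(m) complexity.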
Exercise: algorithm for Wagner parsimony
d(x,y) = |x-y|
Hint: propagate an interval in the dynamic programming algorithm
Finds one solution in O(m)
Or find all solutions in O(m^3)
Linear solution in Miklos, 2013, MAGE, Springer
Application for Genome Rearrangement
Suppose genomes are matchings on a graph with an
even number of vertices, and evolve by Single Cut or
Join (SCJ) of edges.
[Figure: a genome (Genome 1) seen as a set of adjacencies
between genes; the presence (1) or absence (0) of each
edge (AB, AC, ...) is a binary character on the tree]
Theorem. "Fitch" solutions choosing 0 (absence) if there
is an ambiguity at the root guarantee that evolving each
edge independently makes a matching at each ancestral
node.
Equivalent to: if an edge xy is present at an ancestral node
then no other edge xt is present
Proof:
- by induction on the ascending phase
- by contradiction on the descending phase
Counting or sampling solutions is open (if solutions are
ancestral states and not scenarios)
For any other kind of rearrangement (DCJ, inversion,
…) the ancestral character reconstruction problem is
NP-hard
The problem comes from the exponential size of the
badly structured finite state space
3/ Inference of trees with evolutionary events
Now find the « best » tree if there are m
independent characters.
NP-completeness proof from reduction of the
Steiner tree problem on a hypercube
The parsimony method (minimizing the
number of events) is inconsistent if we model
evolution as a random process
From Felsenstein, 1978:
Ancestral character is 0, P,Q,R are probabilities
of a change 0->1 in each branch.
It is possible that P101 > P110, and parsimony
will systematically place (A,C),B even with
arbitrarily large datasets
When P, Q and R satisfy this inequality,
the tree constructed by a maximum
parsimony principle will be ((A,C),B) with more and
more support as the size of the data increases
3.1/ Likelihood computation
Model mutations with a homogeneous Markov random process:
- Random process: a random variable X(t) for each time t
- Markov process: P(X(t+s) | X(u), u ≤ s) = P(X(t+s) | X(s))
- Homogeneous: P(X(t+s) = j | X(s) = i)
does not depend on s
Dynamics of a homogeneous Markov model
Parameters of the model: let λij be a « rate » of transition from i to j (i ≠ j). This means
P(X=j,t+dt | X=i,t) = λij * dt
The dynamics of P(t) = (P(X=i,t))i is given by
P'(t) = P(t)*Q
where Q is the generator matrix: Qij = λij for i ≠ j, and Qii = -Σ_{j≠i} λij
which is solved by P(t) = P(0) * exp(Qt)
→ P(0) is a parameter or is computed from the equilibrium frequencies
→ exp(Qt) is computed by matrix diagonalization: Q = A^-1 D A, and exp(Qt) = A^-1 exp(Dt) A
Example: Jukes-Cantor model of DNA evolution
All substitutions have the same rate α, so the equilibrium
frequencies are uniform
So P(0) = (1/4,1/4,1/4,1/4)
Q has -3α on the diagonal and α everywhere else
exp(Qt) has 1/4 + 3/4 exp(-4αt) on the diagonal
and 1/4 - 1/4 exp(-4αt) everywhere else
[Figure: Jukes-Cantor distance and number of differences, N=1000]
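A small sketch of the Jukes-Cantor quantities, under the usual parametrization in which every substitution has rate alpha: transition probabilities depend on exp(-4 alpha t), and the distance (expected number of substitutions per site) is obtained from the proportion p of differing sites as -3/4 ln(1 - 4p/3). Function names are illustrative.

```python
import math

def jc_prob(x, y, t, alpha=1/3):
    """Probability of observing y after time t, starting from x."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def jc_distance_from_p(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if p >= 0.75:
        return math.inf  # saturation: the distance cannot be estimated
    return -0.75 * math.log(1 - 4 * p / 3)

def jc_distance(seq1, seq2):
    """Distance between two aligned sequences of the same length."""
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return jc_distance_from_p(differences / len(seq1))

print(jc_distance("ACGTACGTAC", "ACGTACGTTC"))
```

With alpha = 1/3 the total substitution rate is 1, so the expected proportion of differences after time t, 1 - jc_prob(x, x, t), maps back to a distance of t.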
Other models (e.g. distinguishing transitions from transversions):
- Kimura
- HKY
- Felsenstein
- GTR
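Likelihood of a tree
Parameters θ: tree, branch lengths, rates
Data D: homologous characters at the leaves
L(θ) = P(D|θ)
Independence of sites: for three sequences and two sites,
L(θ) = P(c,c,a|θ)*P(t,g,t|θ)
P(c,c,a|θ) is a sum over all assignments of characters to
the internal nodes
Naive algorithm in O(|A|^(n-1))
Alphabet A, n leaves
Likelihood of a tree: Felsenstein's algorithm
(usual dynamic programming)
L_u(x) is the probability of the data below node u, given
that u has character x; P_xy(t) is the entry xy of exp(Qt)
Leaf i, character σ: L_i(x) = 1 if x = σ, 0 otherwise
Internal node u with children v and w, at branch lengths t_v and t_w:
L_u(x) = (Σ_y P_xy(t_v) L_v(y)) * (Σ_y P_xy(t_w) L_w(y))
Root r: L(θ) = Σ_x π_x L_r(x), where π is the equilibrium distribution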
The likelihood of a tree does not depend on the position
of the root. (Proof: exercise)
It is due to taking equilibrium frequencies at the root, and the
property of reversibility
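The dynamic programming for the likelihood can be sketched as below, assuming the Jukes-Cantor model with alpha = 1/3 and uniform frequencies at the root. The tree encoding is a hypothetical choice: a leaf is a name, an internal node is a tuple of (child, branch length) pairs, and a site maps leaf names to characters.

```python
import math

ALPHABET = "ACGT"

def jc_prob(x, y, t, alpha=1/3):
    """Jukes-Cantor transition probability from x to y in time t."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def conditional(node, site):
    """L_u(x): probability of the data below node u given state x at u."""
    if isinstance(node, str):  # leaf: only the observed character is possible
        return {x: 1.0 if x == site[node] else 0.0 for x in ALPHABET}
    result = {x: 1.0 for x in ALPHABET}
    for child, length in node:
        below = conditional(child, site)
        for x in ALPHABET:
            result[x] *= sum(jc_prob(x, y, length) * below[y]
                             for y in ALPHABET)
    return result

def likelihood(tree, sites):
    """Product over independent sites of sum_x pi_x L_root(x)."""
    total = 1.0
    for site in sites:
        root = conditional(tree, site)
        total *= sum(0.25 * root[x] for x in ALPHABET)
    return total

sites = [{"A": "c".upper(), "B": "C"}, {"A": "T", "B": "G"}]
tree1 = (("A", 0.1), ("B", 0.3))
tree2 = (("A", 0.4), ("B", 0.0))  # same path length, root moved onto B
print(likelihood(tree1, sites), likelihood(tree2, sites))
```

The two trees differ only by the position of the root along the path between A and B, so the two printed values should agree, which illustrates the statement above about the root.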
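Maximum Likelihood tree:
1/ Given a tree, optimize branch lengths
Heuristic: optimize each branch length, given the others
For one branch of length t between nodes u and v, the
likelihood of a site can be written
L(t) = Σ_x Σ_y π_x L_u(x) P_xy(t) L_v(y)
where L_u and L_v are the conditional likelihoods of the
two subtrees on each side of the branch
Optimize a continuous function of t
For the Jukes-Cantor model with α=1/3, P_xy(t) depends on t
only through exp(-4t/3)
Set the derivative of the log-likelihood to 0 and set the
branch length to topt.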
3.2/ Compute the best tree
As in parsimony, perform any heuristic search among trees
to find the maximum likelihood one. Usually NNI and SPR
moves are efficient for hill-climbing
3.3/ Statistical assessment of the tree
Bootstrapping: sample n sites at random, with
replacement (n is the total number of sites), to get an
idea of the variability of your data; repeat many times
(100, 1000)
Then reconstruct an ML tree for each sample of n sites,
and note the proportion of the trees which carry each of the
branches of your ML tree.
Under the hypothesis that sites are drawn independently
from an unknown distribution, bootstrapping gives
confidence intervals.
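The resampling and counting steps can be sketched as follows. The tree reconstruction itself is not shown: `branch_support` assumes that each replicate tree has already been turned into a set of branches (for example bipartitions written as strings), which is a hypothetical encoding chosen for the illustration.

```python
import random

def bootstrap_columns(alignment, rng):
    """Draw n columns with replacement from an alignment of n columns."""
    n = len(alignment[0])
    picked = [rng.randrange(n) for _ in range(n)]
    return ["".join(row[c] for c in picked) for row in alignment]

def branch_support(reference_branches, replicate_branches):
    """Proportion of replicate trees carrying each reference branch."""
    total = len(replicate_branches)
    return {branch: sum(branch in rep for rep in replicate_branches) / total
            for branch in reference_branches}

rng = random.Random(0)  # fixed seed so that the run is reproducible
sample = bootstrap_columns(["ACGT", "AGGT"], rng)
print(sample)

support = branch_support({"ab|cd", "ac|bd"},
                         [{"ab|cd"}, {"ab|cd", "ac|bd"}])
print(support)
```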
Monte Carlo explorations of the trees
Likelihood ratios follow some statistical rules...
Hemoglobin gene family phylogeny in mammals, with bootstraps
Often there are not enough mutations in a gene family
alignment (typically 500 sites, ~500 sequences) to have
good support for all branches.
Trees are then traditionally made with
- concatenates of alignments of genes
- supertree methods
Concatenates:
Example of a tree with 31 universal genes and a few hundred
species from the whole known diversity of life
(Ciccarelli et al, Science, 2006)
Problems: only 31 genes are selected, for their supposed
history following the species history.
The others have duplications, losses, transfers; they are not
universal and thus cannot be used.
Do these 31 really depict the history of life?
Adding species makes this 31 quickly drop to 0.
Supertrees :
Make a lot of trees with a lot of genes, and try to obtain a
consensus from those.
Trees are considered as a sample from the real tree,
with variation.
For example, find the tree with minimal total distance to
all trees
Distance in trees : Robinson-Foulds, NNI, SPR, ...
A phylogeny of mammals with supertrees
Bininda-Emonds et al, Nature, 2007
Problem : how to model conflicts between trees ?
Any measure, like median tree, votes for clades, has little
biological interpretation
Attempt for 5741 gene families on 185 species for the
origin of eukaryotes
Pisani et al, MBE, 2007
Integrating a lot of gene families with complex histories and
biological interpretations is the subject of « reconciliation »
That is, interpreting and modelling the discord instead of
averaging it
4/ Integrative models of evolution at different scales
4.1. Interpretation of a gene history and LCA reconciliation
4.2 Reconciliations with transfers
Hemoglobin gene
Zuckerkandl and Pauling, 1965
Given a rooted gene tree and a rooted species tree,
each gene belongs to a species, find the duplication
and losses.
First proposition for the reconstruction : minimize the
number of duplications and/or losses in the
interpretation of the tree.
If no gene tree is known, minimizing D+L (or w(D)
+w(L)) in the species tree is an ancestral character
reconstruction problem with state space {0,1,2,…}
and transitions d(k->k+1) = D, d(k->k-1) = L, except if
k=0, where no duplication is possible.
Dynamic programming:
compute c(x,u) for each
node x and value u
Complexity O(n^3)
If a gene tree is known and fully specified,
each gene (extant genes at the leaves and ancestral
genes at the internal nodes) belongs to a species, find
the duplication and losses.
Label the vertices of the gene trees with the LCA of
corresponding species
Duplications are the nodes that have the same label as one of their children
This minimizes D and L
(D = number of duplications, L = number of losses)
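The LCA labelling and the detection of duplications can be sketched as below. Assumptions for illustration: both trees are nested tuples, the leaves of the gene tree are written as the name of the species they belong to, and a species-tree node is represented by the set of species below it. Losses are not counted in this sketch.

```python
def clades(species_tree):
    """All nodes of the species tree, each as the frozenset of its leaves.
    The clade of the root of the given (sub)tree is the last element."""
    if not isinstance(species_tree, tuple):
        return [frozenset([species_tree])]
    below = [clades(child) for child in species_tree]
    here = frozenset().union(*(c[-1] for c in below))
    return [clade for c in below for clade in c] + [here]

def count_duplications(gene_tree, species_tree):
    """Label gene-tree nodes with LCAs and count the duplication nodes."""
    all_clades = clades(species_tree)

    def lca(species):
        # smallest species-tree node containing all the given species
        return min((c for c in all_clades if species <= c), key=len)

    duplications = 0

    def label(node):
        nonlocal duplications
        if not isinstance(node, tuple):
            return frozenset([node])
        child_labels = [label(child) for child in node]
        here = lca(frozenset().union(*child_labels))
        if here in child_labels:  # same label as a child: duplication
            duplications += 1
        return here

    label(gene_tree)
    return duplications

species_tree = (("a", "b"), "c")
print(count_duplications((("a", "b"), "c"), species_tree))
print(count_duplications((("a", "c"), ("b", "c")), species_tree))
```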
Possible improvements and extensions :
- switch to a birth and death process
- add the possibility of gene transfers
Transfers : dynamic programming again
1/ without gene phylogeny
Add the possibility to gain genes from counts 0
2/ with binary phylogeny
To follow
3/ with non binary phylogeny
Open problem