A Few Facts Regarding Continuous Time Markov Chains (CTMCs)

Markov Chains and Models of Molecular Evolution
Transition probability matrices
We will start by considering a single nucleotide site in a DNA sequence. There are four
possible nucleotides that can occur in a site. We represent these four nucleotides by the
four letters A, C, T and G. We imagine following an ancestral line through time. Initially
we may observe, say, an A. After a certain amount of time has passed a mutation may
occur and A mutates to, say, G. If we keep following this ancestral lineage through time
we may observe further mutations to other nucleotides. This process can be described by
a Markov chain. A Markov chain is a stochastic process in which some variable (here the
nucleotide in a particular site) is followed through time and where the probabilities of
various changes between states (transitions) depend only on the immediately preceding
state. A Markov chain is defined by a state space, which is the set of possible values that
the Markov chain can take, and the transition probabilities of the process. More formally
we might define a variable X(t), which is the value of the Markov chain at time t, and
write X(t) ∈ {A, C, T, G}, t = 0, 1, 2,…. The transition probabilities are defined by
Pij = Pr(X(t+1) = j | X(t) = i).
We often summarize the transition probabilities in a transition probability matrix. For
the DNA mutation process, we would have the following transition probability matrix:
$$
P = \begin{pmatrix}
P_{AA} & P_{AC} & P_{AT} & P_{AG} \\
P_{CA} & P_{CC} & P_{CT} & P_{CG} \\
P_{TA} & P_{TC} & P_{TT} & P_{TG} \\
P_{GA} & P_{GC} & P_{GT} & P_{GG}
\end{pmatrix},
$$

with rows and columns ordered A, C, T, G.
PAA is the probability of no mutations in a site in a unit of time (e.g. generation) if the
current state is A, i.e. the probability of a transition A → A. Likewise, PAT is the
probability that a mutation happens from nucleotide A to nucleotide T in a single unit of
time, i.e. the probability of a transition A → T, and so forth. For example, if the rate of
mutation is 0.01 and all nucleotides mutate with equal probability to all other nucleotides,
we might define the following transition probability matrix
$$
P_{JC,\,0.01} = \begin{pmatrix}
0.99 & 0.01/3 & 0.01/3 & 0.01/3 \\
0.01/3 & 0.99 & 0.01/3 & 0.01/3 \\
0.01/3 & 0.01/3 & 0.99 & 0.01/3 \\
0.01/3 & 0.01/3 & 0.01/3 & 0.99
\end{pmatrix}
$$
Of course, real DNA sequences mutate at much lower rates and the mutational process is usually not symmetric among the four nucleotides. However, this type of model, in which mutations between all four nucleotides are equally likely, has been used extensively in studies of molecular evolution. This model is known as the Jukes and Cantor model and has the transition probability matrix
$$
P_{JC} = \begin{pmatrix}
1-\mu & \mu/3 & \mu/3 & \mu/3 \\
\mu/3 & 1-\mu & \mu/3 & \mu/3 \\
\mu/3 & \mu/3 & 1-\mu & \mu/3 \\
\mu/3 & \mu/3 & \mu/3 & 1-\mu
\end{pmatrix},
$$
where µ is the mutation rate per generation.
n-step transition probabilities
Under any particular model we can calculate the probability of being in state j after n generations if we started in state i. These probabilities are denoted by Pij(n), and are known as n-step transition probabilities. For example, for the Jukes and Cantor model
$$
P_{AA}^{(1)} = P_{AA} = 1 - \mu
$$

$$
P_{AA}^{(2)} = (1-\mu)^2 + 3\left(\frac{\mu}{3}\right)^2
$$

$$
P_{AA}^{(3)} = (1-\mu)^3 + 9\left(\frac{\mu}{3}\right)^2(1-\mu) + 6\left(\frac{\mu}{3}\right)^3
$$
and so forth. We see that writing down the n-step transition probabilities like this easily becomes quite cumbersome. However, we might also recognize that the matrix of n-step transition probabilities is simply given by the nth power of the transition probability matrix. To realize this, notice that Pij(2) = ∑k Pik Pkj, so the matrix P(2) = {Pij(2)} is exactly given by P^2. Similarly, we have Pij(3) = ∑k Pik(2) Pkj, so P(3) = P(2)P = P^2 P = P^3. Continuing like this we realize that P(n) = P^n. So we have a general method for
calculating n-step transition probabilities, at least numerically. However, it can be quite
computationally intensive to calculate these matrix powers.
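As a minimal sketch of this computation (assuming numpy is available; µ = 0.01 is just the illustrative rate used above), the n-step probabilities can be obtained as a matrix power instead of by enumerating mutational paths by hand:

```python
# n-step transition probabilities for the Jukes and Cantor model,
# computed as the n-th matrix power P^(n) = P^n.
import numpy as np

mu = 0.01  # illustrative mutation rate per generation

# Jukes-Cantor transition probability matrix (rows/columns ordered A, C, T, G)
P = np.full((4, 4), mu / 3)
np.fill_diagonal(P, 1 - mu)

n = 3
P_n = np.linalg.matrix_power(P, n)

# Compare the A -> A entry with the 3-step formula derived above
analytic = (1 - mu)**3 + 9 * (mu / 3)**2 * (1 - mu) + 6 * (mu / 3)**3
print(P_n[0, 0], analytic)  # the two values agree
```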
Stationary distributions
In general we will assume that the state space of the Markov chain is of finite
size. For example, the size of the state space for the nucleotide models is four. We will
further assume that all states can be reached from all other states with positive
probability, i.e. the Markov chain cannot get 'trapped' in any of the states – it is
irreducible. Two states, i and j, are said to communicate if Pij(n) > 0 for some n and Pji(n) > 0 for some n. A Markov chain is irreducible if all states in the chain communicate. With the additional assumption of aperiodicity (roughly meaning that the chain will not cycle among different states) the Markov chain is ergodic, meaning that it has a unique equilibrium distribution or stationary distribution. Let the stationary (equilibrium) probability of being in state i be πi. Formally, we define it as

$$
\lim_{n\to\infty} P_{ij}^{(n)} = \pi_j \ \text{for all } i, j, \qquad \sum_j \pi_j = 1.
$$

The vector π = {πj} provides the stationary distribution of the process. We can think of πj as the proportion of time the chain spends in state j in equilibrium or, equivalently, as the probability of observing state j at a random point in time if we have no knowledge regarding the initial state of the chain.
The stationary distribution can be found as the solution to the following set of equations:

$$
\pi_i = \sum_j \pi_j P_{ji} \qquad \text{and} \qquad \sum_j \pi_j = 1.
$$
For the Jukes and Cantor model we get

$$
\pi_i = (1-\mu)\pi_i + \sum_{j \neq i} \pi_j \frac{\mu}{3} = (1-\mu)\pi_i + \frac{\mu}{3}(1 - \pi_i),
$$

which implies

$$
\pi_i = \frac{1}{4}.
$$
This result could also easily be found directly from the definition of the process using an argument of symmetry. Since Pij = µ/3 for all i ≠ j and Pii = Pjj for all i, j, the four states are exchangeable and all states must have the same stationary probability. The calculation we would do to find, say, πA is completely identical to the calculation we would do to find πT. So πA = πC = πT = πG = 1/4. Under the Jukes and Cantor process the four nucleotides will occur at equal frequencies in the genome of an organism.
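As a minimal sketch (again assuming numpy), the stationary distribution can also be found numerically by solving πP = π together with the constraint that the probabilities sum to one; for the Jukes and Cantor chain the result should be the uniform distribution found above:

```python
# Solve pi P = pi and sum(pi) = 1 as an overdetermined linear system.
import numpy as np

mu = 0.01
P = np.full((4, 4), mu / 3)
np.fill_diagonal(P, 1 - mu)

# pi (P - I) = 0 stacked with the normalization row 1^T pi = 1
A = np.vstack([(P - np.eye(4)).T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # approximately [0.25 0.25 0.25 0.25]
```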
Other models of mutation
The basic model of Jukes and Cantor was first introduced in 1969 but has since then been modified and expanded on in a variety of ways. One of the first modifications was by the famous population geneticist Motoo Kimura and is known as the Kimura 2-parameter model. It had been known for some time that certain types of mutations occur at a higher rate than other mutations. In particular, mutations between two purines (A and G) or between two pyrimidines (C and T) occur at a higher rate than mutations between a purine and a pyrimidine. The changes between two purines or two pyrimidines are called ‘transitions’ (not to be confused with transitions in a Markov chain!) and mutations between a pyrimidine and a purine are known as ‘transversions’. Kimura suggested using a transition probability matrix of the following form
$$
P = \begin{pmatrix}
1-\alpha-2\beta & \beta & \beta & \alpha \\
\beta & 1-\alpha-2\beta & \alpha & \beta \\
\beta & \alpha & 1-\alpha-2\beta & \beta \\
\alpha & \beta & \beta & 1-\alpha-2\beta
\end{pmatrix},
$$

with rows and columns ordered A, C, T, G.
α/β is known as the transition/transversion rate ratio. We see that the total rate of mutation out of any state in this model is α + 2β.
By the same argument of symmetry as used for the Jukes and Cantor model, we can show that πA = πC = πT = πG = 1/4 also for the Kimura 2-parameter model.
Instead of modeling the mutation process at the nucleotide level we could have modeled
it at the level of the protein. In that case we establish a Markov chain with state space on
the set of possible amino acids, of which there are 20. The simplest such model would be
an equal rates model similar to the Jukes and Cantor model. It has transition probabilities
$$
P_{ij} = \begin{cases} 1-\mu & \text{if } i = j \\[1mm] \dfrac{\mu}{19} & \text{if } i \neq j \end{cases}
$$
where µ now is the mutation rate at the protein level. Using the same argument of symmetry as for the two previous models, we find πi = 1/20 for all i.
Continuous time Markov chains
For computational reasons it is often convenient to approximate a discrete time Markov
chain by a Markov chain defined in continuous time. A continuous time Markov chain
(CTMC) is a stochastic process on a discrete state space in continuous time, {X(t), t ≥ 0},
for which the conditional distribution of a future state at time t + s, given the present state at time s and all past states, depends only on the present state and is independent of the past. As for a discrete time Markov chain, the state space of a CTMC is the set of
possible values of X(t). For example, for a model of DNA sequence evolution X(t) ∈ {A,
C, T, G}. The difference between a CTMC and a discrete time Markov chain is that t can
be any non-negative real value, whereas t is a non-negative integer in a discrete time
Markov chain.
A CTMC is defined by the rate of change between pairs of states. The rates are usually
defined by
$$
q_{ij} = \lim_{t \to 0} \Pr(X(t+s) = j \mid X(s) = i)\,/\,t, \qquad i \neq j,
$$
where qij is the rate of change from state i to state j. For example, the rates of change in the Jukes and Cantor model are defined in continuous time as qij = µ/3 for all i ≠ j, i, j ∈ {A, C, T, G}, i.e. the rate of change between all pairs of nucleotides is defined as µ/3.
If we measure t in number of generations, µ is the mutation rate per generation. Notice
that the total rate of change out of any state (A, C, T or G) is given by µ/3 + µ/3 + µ/3 =
µ, i.e. µ is the rate of mutation for this process. We often specify a CTMC using a rate
matrix. For example, for the Jukes and Cantor model we could specify the following
matrix: Q = {qij},
$$
Q = \begin{pmatrix}
-\mu & \mu/3 & \mu/3 & \mu/3 \\
\mu/3 & -\mu & \mu/3 & \mu/3 \\
\mu/3 & \mu/3 & -\mu & \mu/3 \\
\mu/3 & \mu/3 & \mu/3 & -\mu
\end{pmatrix},
$$

with rows and columns ordered A, C, T, G.
Notice that the diagonal in this matrix is chosen such that the row sums equal zero, i.e. we define $q_{ii} = -\sum_{j \neq i} q_{ij}$. It will later become apparent why this notation is convenient.
As in the case of the discrete time Markov chains we will impose certain restrictions on
the type of Markov chain that we consider so we are guaranteed that it has a stationary
distribution (i.e. it is ergodic) and that all the relevant theory applies. We will again
assume that all states communicate, that the state space is finite and that the rates of
transition are constant in time. All the models of molecular evolution we will consider
here meet these assumptions.
Transition Probabilities of CTMCs
For most of the calculations we would like to do using these models, we need to be able to calculate transition probabilities of the processes. The transition probabilities of a CTMC (under the assumptions stated above) are given by

$$
P_{ij}(t) = \Pr(X(t+s) = j \mid X(s) = i).
$$

The transition probabilities of a CTMC correspond to the n-step transition probabilities of a discrete time Markov chain. One of the advantages of using CTMCs instead of discrete time Markov chains is that for many problems it is easier to find the transition probabilities of the CTMC than the n-step transition probabilities of a discrete time Markov chain. As we shall see, the transition probabilities can be found simply by solving a set of linear ordinary differential equations. These equations are known as the Kolmogorov Equations, and the particular versions that we will be using are known as the Forward Kolmogorov Equations:
$$
\frac{dP_{ij}(t)}{dt} = \sum_k q_{kj} P_{ik}(t), \qquad t \geq 0.
$$
An intuitive explanation of this equation goes as follows: if the current state of the chain is state k, k ≠ j, the rate of change into state j is qkj. If the current state is state j, the rate of change away from state j is $\sum_{i \neq j} q_{ji}$. Starting in state i at time 0, the probability of being in state k at time t is Pik(t). The rate of change in the probability of being in state j at time t, dPij(t)/dt, is simply the sum over all k, k ≠ j, of the probability of being in state k times the rate of change into state j from state k, minus the probability of being in state j times the rate of change away from state j. Remembering that we have defined $q_{jj} = -\sum_{i \neq j} q_{ji}$, we arrive at the Forward Kolmogorov Equations as stated above.
For the Jukes and Cantor model, we get

$$
\frac{dP_{AA}(t)}{dt} = \frac{\mu}{3} P_{AC}(t) + \frac{\mu}{3} P_{AT}(t) + \frac{\mu}{3} P_{AG}(t) - \mu P_{AA}(t) = \frac{\mu}{3}\left(1 - P_{AA}(t)\right) - \mu P_{AA}(t) = \frac{\mu}{3}\left(1 - 4P_{AA}(t)\right).
$$
The solution to this differential equation is

$$
P_{AA}(t) = \frac{1}{4} + c\,e^{-\frac{4}{3}\mu t},
$$

where c is a constant determined by the initial conditions. With the initial condition $P_{AA}(0) = 1$, we find

$$
P_{AA}(t) = \frac{1}{4} + \frac{3}{4} e^{-\frac{4}{3}\mu t}.
$$
This implies that for any nucleotide k, k ≠ A:

$$
P_{Ak}(t) = \left(1 - \left(\frac{1}{4} + \frac{3}{4}e^{-\frac{4}{3}\mu t}\right)\right)\Big/\,3 = \frac{1}{4} - \frac{1}{4}e^{-\frac{4}{3}\mu t}, \qquad k \neq A.
$$
By symmetry we see that the continuous time transition probabilities for the Jukes and Cantor model are given by

$$
P_{ij}(t) = \begin{cases} \dfrac{1}{4} + \dfrac{3}{4} e^{-\frac{4}{3}\mu t} & \text{if } i = j \\[2mm] \dfrac{1}{4} - \dfrac{1}{4} e^{-\frac{4}{3}\mu t} & \text{if } i \neq j \end{cases}
$$
In general, it is not so easy to find the transition probabilities of a particular rate matrix,
Q. However, by using a bit of matrix algebra we can usually find solutions to the
differential equations. In general, we can write Kolmogorov’s Forward Equations as

$$
\frac{dP(t)}{dt} = P(t)Q,
$$

where P(t) = {Pij(t)} is the matrix of transition probabilities. From multivariable calculus we may have learned that the solution to this set of linear ordinary differential equations, with the initial condition P(0) = I (the identity matrix), can be obtained by exponentiating Q. The matrix of transition probabilities is obtained as

$$
P(t) = e^{Qt} = \sum_{i=0}^{\infty} \frac{(Qt)^i}{i!}.
$$

If you have not taken courses in linear algebra or multivariable calculus, this may seem quite foreign to you and you may be satisfied just knowing that solutions can in fact be obtained for the system of differential equations, at least numerically. However, you might also notice the advantage of the definition of the diagonal of Q in that it allows the matrix representation of Kolmogorov’s Forward Equations.
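As a minimal sketch (assuming scipy is available; µ and t are arbitrary illustrative values), the matrix exponential can be evaluated numerically and checked against the closed-form Jukes and Cantor transition probabilities derived above:

```python
# Transition probabilities of the Jukes and Cantor CTMC via P(t) = exp(Qt).
import numpy as np
from scipy.linalg import expm

mu, t = 1.0, 0.5
Q = np.full((4, 4), mu / 3)
np.fill_diagonal(Q, -mu)  # diagonal chosen so that the rows sum to zero

P_t = expm(Q * t)

p_same = 1/4 + 3/4 * np.exp(-4/3 * mu * t)
p_diff = 1/4 - 1/4 * np.exp(-4/3 * mu * t)
print(P_t[0, 0], p_same)  # these agree
print(P_t[0, 1], p_diff)  # and so do these
```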
For example, for the Kimura 2-parameter model we obtain the following continuous time
rate matrix:
$$
Q = \begin{pmatrix}
-\alpha-2\beta & \beta & \beta & \alpha \\
\beta & -\alpha-2\beta & \alpha & \beta \\
\beta & \alpha & -\alpha-2\beta & \beta \\
\alpha & \beta & \beta & -\alpha-2\beta
\end{pmatrix},
$$

with rows and columns ordered A, C, T, G,
where the transition/transversion rate ratio is α/β. The transition probabilities we obtain by solving the system of Kolmogorov’s Forward Equations for this model are

$$
P_{ij}(t) = \begin{cases} \dfrac{1}{4} + \dfrac{1}{4}e^{-4\beta t} + \dfrac{1}{2}e^{-2(\alpha+\beta)t} & \text{if } i = j \\[2mm] \dfrac{1}{4} + \dfrac{1}{4}e^{-4\beta t} - \dfrac{1}{2}e^{-2(\alpha+\beta)t} & \text{if } i \neq j \text{ (a transition)} \\[2mm] \dfrac{1}{4} - \dfrac{1}{4}e^{-4\beta t} & \text{if } i \neq j \text{ (a transversion)} \end{cases}
$$
Stationary probabilities of CTMCs
The stationary probabilities of a CTMC are defined as $\lim_{t\to\infty} P_{ij}(t) = \pi_j$, where $\sum_j \pi_j = 1$. As for the corresponding discrete time chain, the vector of stationary probabilities π = {πj} is the stationary distribution of the process. For the Jukes and Cantor process we get $\pi_i = \lim_{t\to\infty} P_{ii}(t) = 1/4$, which agrees with the result we found from the discrete time version.
Calculation of Sampling Probabilities for DNA mutation models
Consider two homologous sequences, sampled from two different species, say
ACTGACACAGTAGACACGT
ACTCACACATTAGACACGT
What is the probability of observing this particular data pattern, and how should we go
about estimating the number of substitutions that occurred in the history between the two
sequences, or alternatively, the time during which the two sequences have been evolving
since they diverged from a common ancestor? Let x denote the data obtained from the two individuals, e.g. x = (ACTGACACAGTAGACACGT, ACTCACACATTAGACACGT), and let xi be the data pattern observed in site i, e.g. x1 = (A, A), x2 = (C, C), etc. In general xi = (xi1, xi2). Assume that sites evolve independently of each other; then
$$
\Pr(x) = \prod_{i=1}^{S} \Pr(x_i),
$$
where S is the number of sites in the DNA sequence. How can we calculate Pr(xi)?
Consider the example of xi = (A, T) and assume that the length of time during which the
two sequences have diverged from each other is t (Figure 1).
[Figure 1: two lineages, each of length t, descending from a common ancestor; at site i one sequence carries A and the other carries T.]
Then we realize that

$$
\Pr((A,T)) = \sum_{i \in \{A,C,T,G\}} \pi_i P_{iA}(t) P_{iT}(t).
$$
The data pattern (A, T) may occur if the ancestor of the two sequences was of type T, and
one individual evolved from type T to type A and the other individual evolved from type
T to type T. Similarly the ancestor could have been of type A, C or G. Summing over all
these possibilities we arrive at the above equation. Of course, the more general version
of this equation would be
$$
\Pr((j,k)) = \sum_{i \in \{A,C,T,G\}} \pi_i P_{ij}(t) P_{ik}(t),
$$
where j and k may be any nucleotides. It turns out that for most nucleotide substitution
processes, including the Jukes and Cantor model and the Kimura 2-parameter model, this
result can be simplified somewhat. To simplify these equations we need to know two
more ideas from Markov chain theory: the Chapman-Kolmogorov Equations and the idea
of time reversibility.
A CTMC is time reversible if πjqji = πiqij, i.e. the proportion of transitions occurring in the chain at stationarity from state i to state j equals the proportion of transitions from state j to state i. For a time reversible CTMC, πjPji(t) = πiPij(t). For the Jukes and Cantor model πi = πj and qij = qji, so this model is clearly time reversible according to the definition. Almost all models of DNA sequence evolution are time reversible and we will in the following assume that the time reversibility assumption holds.
The Chapman-Kolmogorov equation states
$$
P_{ij}(t+s) = \sum_k P_{ik}(t) P_{kj}(s), \qquad s \geq 0,\ t \geq 0.
$$
In other words: a transition from state i to state j in time t + s may occur by first having a
transition from state i to k in time t and then a transition from state k to j in time s,
summed over all possible values of k.
Using the property of time reversibility, we have
$$
\Pr((j,k)) = \sum_{i \in \{A,C,T,G\}} \pi_i P_{ij}(t) P_{ik}(t) = \sum_{i \in \{A,C,T,G\}} \pi_j P_{ji}(t) P_{ik}(t).
$$
Now by the Chapman-Kolmogorov equation we have
$$
\Pr((j,k)) = \sum_{i \in \{A,C,T,G\}} \pi_j P_{ji}(t) P_{ik}(t) = \pi_j \sum_{i \in \{A,C,T,G\}} P_{ji}(t) P_{ik}(t) = \pi_j P_{jk}(2t).
$$
For example, for the Jukes and Cantor process and the example of xi = (A, T), we have πA = 1/4 and $P_{AT}(2t) = \frac{1}{4} - \frac{1}{4}e^{-\frac{4}{3}\cdot 2t\mu}$, so we get

$$
\Pr((A,T)) = \frac{1}{16}\left(1 - e^{-\frac{8}{3}t\mu}\right).
$$
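As a minimal sketch (t and µ are arbitrary illustrative values), the probability of the site pattern (A, T) can be computed both by summing over the unknown ancestral state and via the time-reversibility shortcut πA PAT(2t); the two results should agree:

```python
# Sampling probability of the site pattern (A, T) under Jukes and Cantor.
import numpy as np

def jc_prob(i, j, t, mu=1.0):
    """Jukes-Cantor transition probability P_ij(t); states coded 0-3."""
    e = np.exp(-4.0 / 3.0 * mu * t)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

t = 0.1
A, T = 0, 2  # nucleotides coded 0=A, 1=C, 2=T, 3=G
pi = 0.25    # stationary probability of every nucleotide

# Sum over all possible ancestral states i
direct = sum(pi * jc_prob(i, A, t) * jc_prob(i, T, t) for i in range(4))

# Shortcut: pi_A * P_AT(2t)
shortcut = pi * jc_prob(A, T, 2 * t)
print(direct, shortcut)  # the two values agree
```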
Some Statistical Concepts
We can use these results to provide statistical estimates of the parameters µ and t. Before going into detail with this, it might be useful to review some concepts from statistics.

Much of statistics proceeds by providing estimates of parameters. Say we have a parameter with an unknown true value, λ; we are then interested in finding an estimate of this parameter, λ̂. There are many different methods we could use to find good estimates of a parameter, and we will shortly describe two of these. It is important to remember that in classical statistics the true value of λ is considered to be fixed; it is not a random variable, although we do not know its true value. The estimate of the parameter (λ̂) is, in contrast, a random variable because it is a function of the data (which is random). Any quantity that we can calculate based on the data is called a statistic. The
parameter estimate ( λ̂ ) is an example of a statistic, whereas the true value of the
parameter λ is not a statistic. Other examples of statistics might include the sample
mean, the sample median or the 25th percentile of a sample. Two of the most common
methods for finding statistical estimators of parameters are the method of moments and
the method of maximum likelihood. We will in the following show how these methods
can be used to provide estimates of µ and t.
Method of moments estimation of λ
For a single scalar parameter, a method of moments estimate of the parameter is obtained by first finding an appropriate scalar statistic (some summary of the data) and the expectation of this statistic under an appropriate model. The expected value of the statistic is then equated to the observed value of the statistic, and this equation is solved with respect to the parameter.
We will now use the method of moments to provide an estimate of µ and t using the number of nucleotide differences between the two sequences as the statistic. The number of nucleotide differences, d, is the number of sites (positions) in the alignment of the two sequences in which the two nucleotides are different from each other. First, we need to find the expected number of nucleotide differences between two sequences of length S for the Jukes and Cantor model. Notice that in the formulae for the transition probabilities, µ and t always appear as a product of each other. To simplify the notation we will, therefore, use the definition λ = 2µt. Then

$$
E[d] = S \times \Pr(\text{a site is variable}) = S \times 3 \times \left(\frac{1}{4} - \frac{1}{4}e^{-4\lambda/3}\right) = S\left(\frac{3}{4} - \frac{3}{4}e^{-4\lambda/3}\right).
$$
We then equate the observed value of d to its expectation,

$$
d = S\left(\frac{3}{4} - \frac{3}{4}e^{-4\hat\lambda/3}\right).
$$

Now we solve this equation with respect to λ̂:

$$
\hat\lambda = -\frac{3}{4}\log\left(1 - \frac{4d}{3S}\right), \qquad d/S < 3/4.
$$
The expected number of mutations separating two sequences is given by the rate at which mutations arise multiplied by the total time during which the process has been running, here 2t because mutations accumulate along both lineages, i.e. λ = 2µt. So λ̂ provides an estimate of the number of substitutions that occurred in the history of the two sequences. Converting d to an estimate of λ is sometimes referred to as 'correcting for multiple hits'.
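As a minimal sketch, the correction can be wrapped in a small function (the helper name jc_distance is just an illustration) and applied to the two example sequences from the beginning of this section:

```python
# Method-of-moments (Jukes-Cantor) estimate of lambda = 2*mu*t from two
# aligned sequences: count differences d and 'correct for multiple hits'.
import math

def jc_distance(seq1, seq2):
    S = len(seq1)
    d = sum(a != b for a, b in zip(seq1, seq2))
    if d / S >= 0.75:
        raise ValueError("d/S must be < 3/4 for the estimate to exist")
    return -0.75 * math.log(1.0 - 4.0 * d / (3.0 * S))

# The two homologous sequences considered above (S = 19, d = 2)
x1 = "ACTGACACAGTAGACACGT"
x2 = "ACTCACACATTAGACACGT"
print(jc_distance(x1, x2))  # about 0.113, slightly larger than d/S = 2/19
```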
Maximum Likelihood Estimation of λ
The statistical principle that is most commonly used to devise estimators is the principle
of maximum likelihood. The statistic used in maximum likelihood estimation is the
likelihood function. The likelihood function is defined as any function proportional to the sampling probability of the data. The maximum likelihood principle tells us that we should use the value of the parameter θ that maximizes the likelihood function L(θ). When the parameter is a simple scalar variable (such as θ), this is usually done using the following steps:
(1) Define the likelihood function, L(θ).
(2) Take the logarithm of the likelihood function, ℓ(θ).
(3) Take the derivative of the log likelihood function with respect to the parameter, ℓ’(θ).
(4) Equate the derivative to zero (ℓ’(θ) = 0) and solve for the parameter to find θ̂. Confirm that θ̂ is in fact a maximum by checking that the second derivative of ℓ(θ) evaluated at θ̂ is negative, and verify that the global maximum has been found.
We will now use these ideas to obtain a maximum likelihood estimate of λ. We have

$$
L(\lambda) \propto P(x \mid \lambda) = \prod_{i=1}^{S} \pi_{x_{i1}} P_{x_{i1} x_{i2}}(\lambda),
$$

where ∝ means 'proportional to'. Notice here that we, for notational convenience, express the transition probabilities as a function of λ = 2µt instead of t. The stationary probabilities do not depend on λ in the Jukes and Cantor model, and so

$$
L(\lambda) \propto \prod_{i=1}^{S} P_{x_{i1} x_{i2}}(\lambda) = \left(\frac{1}{4} + \frac{3}{4}e^{-4\lambda/3}\right)^{S-d} \left(\frac{1}{4} - \frac{1}{4}e^{-4\lambda/3}\right)^{d}.
$$
Because µ and t only appear as a product of each other in the likelihood function, statistical theory tells us that we cannot separately estimate these two parameters. This is yet another good reason for using the change in notation λ = 2µt. We then take the logarithm of the likelihood function:

$$
\ell(\lambda) = (S-d)\log\left(\frac{1}{4} + \frac{3}{4}e^{-4\lambda/3}\right) + d\log\left(\frac{1}{4} - \frac{1}{4}e^{-4\lambda/3}\right).
$$
And then we take the derivative of the log likelihood function with respect to λ:

$$
\ell'(\lambda) = (S-d)\,\frac{-(4/3)(3/4)e^{-4\lambda/3}}{(1/4) + (3/4)e^{-4\lambda/3}} + d\,\frac{(4/3)(1/4)e^{-4\lambda/3}}{(1/4) - (1/4)e^{-4\lambda/3}},
$$
and solve ℓ’(λ̂) = 0:

$$
0 = \frac{(d-S)e^{-4\hat\lambda/3}}{(1/4) + (3/4)e^{-4\hat\lambda/3}} + \frac{(d/3)e^{-4\hat\lambda/3}}{(1/4) - (1/4)e^{-4\hat\lambda/3}},
$$

which implies

$$
0 = -S\left(\frac{1}{4} - \frac{1}{4}e^{-4\hat\lambda/3}\right) + \frac{d}{3},
$$

so

$$
\hat\lambda = -\frac{3}{4}\log\left(1 - \frac{4d}{3S}\right), \qquad d/S < 3/4,
$$

which provides the maximum likelihood estimate of λ. Taking second derivatives will convince us that this is in fact a maximum, and that it is the only maximum of the likelihood function. Notice that this estimate is the same as the method of moments estimate.
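As a minimal sketch (assuming scipy), we can also check numerically that maximizing the log likelihood reproduces the closed-form estimate; S and d are taken from the example sequences above:

```python
# Numerical maximization of the Jukes-Cantor log likelihood in lambda.
import numpy as np
from scipy.optimize import minimize_scalar

S, d = 19, 2  # sites and observed differences in the example sequences

def neg_log_lik(lam):
    p_same = 0.25 + 0.75 * np.exp(-4.0 * lam / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * lam / 3.0)
    return -((S - d) * np.log(p_same) + d * np.log(p_diff))

numeric = minimize_scalar(neg_log_lik, bounds=(1e-6, 5.0), method="bounded").x
closed_form = -0.75 * np.log(1.0 - 4.0 * d / (3.0 * S))
print(numeric, closed_form)  # both are approximately 0.113
```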
Phylogenies
Evolutionary trees, or phylogenies, describe the evolutionary relationship between groups
of organisms. For example Figure 2 depicts the evolutionary relationship between
humans, chimpanzees, gorillas and orangutans. Trees are of major importance, not only
in evolutionary biology, but in many areas of bioinformatics because they describe the
correlation between data obtained from multiple species or other evolutionary units.
When analyzing data from more than two homologous sequences, it is usually necessary to consider the underlying tree relating the sequences. The main problem considered in this section is how to estimate and analyze phylogenies. Most students should be familiar with phylogenies from their basic biology training; however, there is a large vocabulary associated with the description of phylogenies which we will briefly review. Some of this vocabulary comes from the mathematical theory of graphs, to which the description of trees belongs. The evolutionary units used in phylogenetic analyses are called taxa (singular: taxon). We see from Figure 2 that humans and chimps are each other's closest living relatives; they are sister taxa. A set of taxa that share a most recent common ancestor with each other, but do not share this ancestor with any other taxa, is called a clade or a monophyletic group. For example, in Figure 2 humans and chimps form a clade and humans, chimps and gorillas form another clade. Most evolutionary biologists agree that the natural groupings of interest are the monophyletic groups.
[Figure 2: a rooted phylogeny of humans, chimps, gorilla and orangutan, with labels indicating a clade (monophyletic group), a taxon, sister taxa, an external node, a lineage (branch), an internal node and the root.]
The lines in the tree are known as lineages or branches in the biological literature
and arcs or edges in the mathematical literature. Each branch initiates and terminates in a
node (or vertex in some of the mathematical literature). In a phylogeny, the nodes at the
tip of the tree representing extant species are called leaf nodes (or external nodes,
terminal nodes or tips) and nodes that represent ancestral taxa are called internal nodes
(or interior nodes). The internal node in the tree representing the most recent common
ancestor of all the taxa is called the root or root node. Leaf nodes have only one branch
leading into them (degree one) whereas internal nodes have three branches leading into
them (degree three), except for the root which has two branches leading into it (degree
two).
Trees can be either unrooted or rooted. An example of an unrooted tree is shown
in Figure 3 (Tree 1). In this tree, there is no node representing the most recent common
ancestor of all the species, i.e. no node with degree two. The root node has been
eliminated from the tree and information regarding the location of the most recent
common ancestor is not represented. In a rooted tree, nodes that are linked to each other (adjacent) are sometimes referred to as parent and child nodes (or daughter nodes). For example, in Figure 2 the root node is the parent node of the orangutan leaf node and the orangutan node is the child node of the root node.
[Figure 3: three phylogenies relating lemur, human, chimpanzee, gorilla and orangutan. Tree 1 is unrooted; Tree 2 has the same topology as Tree 1 but one shorter branch; Tree 3 has a different topology.]
Any rooted tree can always be converted into an unrooted tree, but there are many ways
in which an unrooted tree can be converted into a rooted tree.
The lengths of the lineages of a phylogeny often represent evolutionary time
measured in number of generations, number of years, or in expected number of
substitutions (µt). The expected number of substitutions is the most common scale of
the branch lengths for trees estimated from molecular data. As previously discussed, the
rate of substitution (µ) and absolute time (t) can usually not be independently estimated.
Only their product (µt) can be estimated directly from the molecular data.
Two trees can differ from each other because they have different branch lengths and/or because they have different topologies. In Figure 3, tree 1 and tree 2 differ from each other because the length of one of the lineages in tree 2 is shorter than in tree 1. However, both trees have the same topology, i.e. when ignoring the branch lengths the basic structures of the two trees are the same. In contrast, the topology of both tree 1 and tree 2 differs from the topology of tree 3.
There are many different methods of estimating phylogenies. However, they can
roughly be divided into three categories of methods. The first type of methods are
distance methods in which a distance is calculated between all pairs of taxa and a tree is
then estimated based on these distances. The second type of method is parsimony, in
which the tree that would require the fewest possible number of evolutionary events
(mutations) is chosen. Finally, the maximum likelihood principle can be used to estimate
trees by choosing the tree that maximizes the likelihood function.
How many trees are there?
Some of these methods can be very computationally intensive. Parsimony and maximum
likelihood methods require a search for the tree that fulfills an optimality criterion. It can
be very difficult to find this tree because the number of possible trees might be very
large. The number of possible topologies can be counted relatively easily. For rooted trees there is 1 possible topology for two taxa. For three taxa (say A, B and C), there are 3 possible rooted topologies:
[Figure 4: the three possible rooted topologies for three taxa: ((A,B),C), ((A,C),B) and ((B,C),A).]
Similarly, if we wrote down all the possible topologies for rooted trees with four taxa, we would find that there are 15 different topologies. Can we find a general formula for n taxa? To do this we must first find the number of branches in a tree with n taxa. Adding a taxon to a tree introduces two new branches (Figure 5): the branch to which the new taxon is attached is split in two, and one new branch leads to the new taxon. So a tree with n taxa must have two more branches than a tree with n - 1 taxa, i.e. the number of branches is of the form 2n - c. Because a tree with two taxa has two branches, c = 2, implying that there are 2n - 2 branches in a rooted tree of n taxa.
[Figure 5: rooted trees with taxa (A, B), (A, B, C) and (A, B, C, D), illustrating how each added taxon adds two branches.]
In a rooted tree with k branches, there are k + 1 different ways a new branch can be added to the tree (Figure 5): on any of the k existing branches, or above the current root. Each of these ways of adding a new branch results in a new unique topology. Because a tree with n - 1 taxa has 2(n - 1) - 2 = 2n - 4 branches, a rooted tree with n taxa must have 2n - 4 + 1 = 2n - 3 times as many topologies as a rooted tree with n - 1 taxa. So the total number of rooted topologies for a tree with n taxa must be
$$
1 \times 3 \times 5 \times \cdots \times (2n-3) = \prod_{i=3}^{n} (2i-3).
$$
Obviously, the number of topologies increases very fast with n. For 10 taxa there are more than 34 million topologies. For 50 taxa there are more than 2.7×10^76 possible topologies.
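As a minimal sketch, the count can be computed with exact integer arithmetic:

```python
# Number of rooted, binary topologies for n taxa: 1 * 3 * 5 * ... * (2n - 3).
def n_rooted_topologies(n):
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count

print(n_rooted_topologies(4))   # 15
print(n_rooted_topologies(10))  # 34459425, i.e. more than 34 million
```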
Maximum Likelihood Estimation of Phylogenies
We have previously used maximum likelihood to estimate µt from DNA sequence data
from two taxa. The same basic idea can be used to estimate trees for more than two taxa.
Using maximum likelihood we would choose the tree that has the topology and branch lengths with the highest likelihood. The calculation of likelihood functions on trees is greatly simplified by assuming independence among sites in the alignment (given the tree). The likelihood can then simply be obtained as the product of the likelihoods calculated at individual sites. We now just need to concentrate on the problem of how to calculate the likelihood at a single site. An example might be illustrative.
Consider the rooted phylogeny in Figure 6 with branch lengths t1, t2, t3 and t4, nodes labeled 1, 2, 3, 4 and 5, and data pattern in a single site X = (T, C, C). Assume a Markov model of evolution such as the Jukes and Cantor model and let Θ be a vector of the parameters of this model. Then the likelihood can be calculated by summing over all the possible values of the unknown allelic states in the internal nodes (nodes 4 and 5) of the phylogeny.

[Figure 6: a rooted tree in which leaves 1 and 2 (carrying T and C, branch lengths t1 and t2) descend from internal node 4; node 4 is connected to the root (node 5) by a branch of length t4, and leaf 3 (carrying C, branch length t3) attaches directly to the root.]

Starting at the root (node 5), the probability that the ancestral state was of type z is the stationary probability of z, πz. Then, given that the state in the root was z, the probability of observing state y at node 4 is just given by the transition probability of the Markov chain: Pzy(t4) = Pr(z → y | t4, Θ). Likewise, the probability of observing state T at node 1, given state y at node 4, is PyT(t1). In this way the likelihood can be calculated by superimposing a Markov model of evolution along the branches of the tree. The only problem is that we cannot observe the ancestral states at the internal nodes. Instead, we have to sum over all the possible values of these ancestral states. The likelihood function then becomes

$$
\Pr(X \mid \Theta, t_1, t_2, t_3, t_4) = \sum_z \sum_y \pi_z P_{zy}(t_4) P_{yT}(t_1) P_{yC}(t_2) P_{zC}(t_3).
$$
Assuming time reversibility, and then using the Chapman-Kolmogorov equation, we have

$$
\Pr(X \mid \Theta, t_1, t_2, t_3, t_4) = \sum_y \pi_y P_{yT}(t_1) P_{yC}(t_2) \sum_z P_{yz}(t_4) P_{zC}(t_3) = \sum_y \pi_y P_{yT}(t_1) P_{yC}(t_2) P_{yC}(t_3 + t_4),
$$
which simplifies the calculations somewhat. We see that the calculations would be the same for any rooted tree that gives the same unrooted tree, i.e. we can place the root anywhere in the tree we want and we would get the same likelihood. If the model of evolution is time-reversible, only unrooted trees can be estimated. The idea that the root can be moved to any other position in the tree is known as the pulley principle. It was first discovered by Joseph Felsenstein in his seminal paper in Systematic Zoology in 1981. Before this paper, it was considered computationally intractable to use maximum likelihood to estimate trees, because the likelihood function involves a large sum over the ancestral states at all internal nodes. With k species there are k - 1 internal nodes in a rooted tree, and 4^(k-1) possible assignments of ancestral states. However, Felsenstein devised an algorithm with computational time that is linear in the number of species. Assume all the nodes are labeled with an integer such that the number of the root is 2n - 1, let the branch length of the branch leading to node i be ti, as in Figure 6, and let the nucleotide associated with leaf node i be Xi. Then the Felsenstein algorithm can be described as follows:
Felsenstein’s (1981) Algorithm
1. Set k = 2n - 1 (the root).
2. For all i ∈ {A, C, T, G} calculate fk(i) as follows:
- If k is a leaf node, set fk(i) = 1 if i = Xk and fk(i) = 0 otherwise.
- Otherwise, compute fa(j) and fb(j) for all j for the child nodes a, b of k and set

$$
f_k(i) = \left(\sum_j P_{ij}(t_a) f_a(j)\right)\left(\sum_j P_{ij}(t_b) f_b(j)\right).
$$

3. Calculate the likelihood as $\sum_j \pi_j f_k(j)$.
This is an example of a recursive algorithm performing what is known as a post-order traversal of a tree: it calculates the function fk(i) for node k only after it has been calculated for the child nodes of k. In this way the function will be calculated for the leaf nodes first and the root node last. The function fk(i) is often referred to as “the fractional likelihood” or “conditional likelihood” of nucleotide i in node k. The fractional likelihood fk(i) can be interpreted as the probability of the data in the descendent leaf nodes of node k given that nucleotide i is observed in node k. By recursively calculating the fractional likelihoods for all nodes in the tree, we end up with the fractional likelihoods in the root node. Thereafter, the total likelihood can easily be obtained by summing the product of the stationary probability and the fractional likelihood over all possible states in the root node. Because of the pulley principle we obtain the same likelihood no matter where the root is placed.
To really understand how the algorithm works, it might be useful to consider an example. Figure 7 depicts a tree with four taxa, data X = (A, A, C, T) and branch lengths as depicted in the figure. We will assume a Jukes and Cantor model with µ = 1. The algorithm will first initialize the fractional likelihoods in the visited leaf nodes: f1(A) = 1, f1(C) = f1(T) = f1(G) = 0 and f2(A) = 1, f2(C) = f2(T) = f2(G) = 0. Thereafter, the fractional likelihood is calculated in node 5 as f5(A) = PAA(0.1)PAA(0.1) = 0.8215. Similarly, f5(C) = f5(G) = f5(T) = 0.0009 is found. The algorithm will then initialize the fractional likelihoods in node 3 and calculate the fractional likelihood in node 6 as

$$
f_6(A) = P_{AC}(0.2) \sum_{y \in \{A,C,T,G\}} P_{Ay}(0.1) f_5(y) = 0.0436,
$$

and we similarly find f6(T) = f6(G) = 0.0016 and f6(C) = 0.0219. The fractional likelihoods in node 4 are then initialized and we find

$$
f_7(A) = P_{AT}(0.3) \sum_{y \in \{A,C,T,G\}} P_{Ay}(0.1) f_6(y) = 0.0033,
$$

and similarly f7(T) = 0.0026, f7(G) = 0.0003 and f7(C) = 0.0018. Finally, the total likelihood is found as

$$
L = \sum_{y \in \{A,C,T,G\}} \frac{1}{4} f_7(y) = 0.0020.
$$
[Figure 7: a rooted tree in which leaves 1 and 2 (both A) join at node 5 with branches of length 0.1; node 5 (branch length 0.1) and leaf 3 (C, branch length 0.2) join at node 6; and node 6 (branch length 0.1) and leaf 4 (T, branch length 0.3) join at the root, node 7. The computed fractional likelihoods are f5(A) = 0.8215, f5(C) = f5(G) = f5(T) = 0.0009; f6(A) = 0.0436, f6(C) = 0.0219, f6(G) = f6(T) = 0.0016; f7(A) = 0.0033, f7(C) = 0.0018, f7(G) = 0.0003, f7(T) = 0.0026.]
This algorithm provides the computational backbone for much of the statistical and
computational analysis of DNA sequence data from multiple species or multiple
homologous genes.
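As a minimal sketch of the pruning algorithm under the Jukes and Cantor model with µ = 1, the following reproduces the worked example of Figure 7 (the nested-tuple tree encoding is just one convenient choice, not part of the original algorithm description):

```python
# Felsenstein's pruning algorithm for a single site under Jukes and Cantor.
# A leaf is a (nucleotide, branch length) pair; an internal node is a
# (left subtree, right subtree, branch length) triple (root length unused).
import numpy as np

NUC = {"A": 0, "C": 1, "T": 2, "G": 3}

def jc_matrix(t, mu=1.0):
    """Jukes-Cantor transition probability matrix P(t)."""
    e = np.exp(-4.0 / 3.0 * mu * t)
    P = np.full((4, 4), 0.25 - 0.25 * e)
    np.fill_diagonal(P, 0.25 + 0.75 * e)
    return P

def fractional_likelihood(node):
    """Post-order computation of the fractional likelihoods f_k(i)."""
    if len(node) == 2:  # leaf: (nucleotide, branch length)
        f = np.zeros(4)
        f[NUC[node[0]]] = 1.0
        return f
    left, right, _ = node
    fa, fb = fractional_likelihood(left), fractional_likelihood(right)
    Pa, Pb = jc_matrix(left[-1]), jc_matrix(right[-1])
    return (Pa @ fa) * (Pb @ fb)

# The tree of Figure 7: leaves 1, 2 = A, A; leaf 3 = C; leaf 4 = T
node5 = (("A", 0.1), ("A", 0.1), 0.1)
node6 = (node5, ("C", 0.2), 0.1)
root = (node6, ("T", 0.3), 0.0)

f_root = fractional_likelihood(root)
print(f_root)               # about (0.0033, 0.0018, 0.0026, 0.0003) for A, C, T, G
print(0.25 * f_root.sum())  # total likelihood, approximately 0.0020
```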
Optimization
We have now devised a method for calculating the likelihood function for a tree and a set of parameters from a model of molecular evolution. However, we have not touched upon the issue of how this likelihood function can be maximized to obtain maximum likelihood estimates of the tree and model parameters. There are two optimization problems embedded in this problem: a continuous optimization problem over the branch lengths and parameters of the evolutionary model, and a discrete optimization problem over the tree topology. We will not go into detail with the continuous optimization problem but just mention that it is a quite classical problem that can be approached using one of the many available algorithms for continuous optimization, such as the Newton-Raphson method or one of its many derivatives.
The discrete optimization problem (finding the best tree) has two types of solutions: exact algorithms, in which the best tree is guaranteed to be found, and heuristic algorithms, which typically are much faster but are not guaranteed to provide the optimal solution. The simplest exact algorithm we can imagine would be an algorithm that simply calculates the likelihood for all possible trees to find the tree with the highest likelihood. This type of search is called an exhaustive search and is almost never used in phylogenetics, because of the availability of faster exact searches. The most commonly used exact search algorithm is the branch-and-bound algorithm. The branch-and-bound algorithm proceeds by first making an initial guess of the tree topology and calculating the maximum likelihood value of this tree (LB). LB then bounds the global maximum likelihood value over all trees (L), i.e. L ≥ LB must be true. The branch-and-bound algorithm then proceeds by examining categories of trees. In a category examined, it will either find a new tree with a higher likelihood value than LB, providing a new bound for L, or eliminate the whole category of trees because its members must have maximum likelihood values less than LB. The advantage of this algorithm is that when whole families of trees can be eliminated at the same time, the number of trees that needs to be examined is drastically reduced. The branch-and-bound algorithm achieves this using a total enumeration tree. The total enumeration tree can in phylogenetics be thought of as a tree of trees. Consider the 15 possible unrooted trees for the five taxa a, b, c, d and e. We can enumerate all possible trees by starting with a tree of only three taxa, and then adding the two remaining taxa one by one to the tree. We can think of the three-taxon tree as a root node in a tree of trees, where the three possible four-taxon trees are internal nodes and the leaf nodes are the 15 possible five-taxon trees obtained by adding species to the three-taxon tree (Figure 8). The branch-and-bound algorithm will start by evaluating the likelihood at the four-taxon trees. If the likelihood calculated based on any of those trees is lower than the current bound, the possible five-taxon trees that can be derived from that four-taxon tree can be eliminated. The reason is that when adding a species to the tree, the likelihood cannot improve.

In the example in Figure 8 we assume that our initial guess of the tree was the tree with log likelihood -291.3, so LB = -291.3. The branch-and-bound algorithm then starts by calculating the likelihood of the possible four-taxon trees that can be derived by adding an extra taxon to the three-taxon tree. One of the four-taxon trees has a log likelihood of -302.4, which is lower than LB, so we can immediately eliminate the five trees that can be derived by adding another taxon to this tree. We then proceed by evaluating the likelihood of all the five-taxon trees that can be derived from the four-taxon tree with the highest likelihood. In doing that, we discover a new tree with a higher likelihood and we update the current bound to LB = -284.7. With this new bound we can then eliminate all the trees that can be derived from the last four-taxon tree, which has log likelihood -289.5. We conclude that the tree with the highest likelihood is the tree with log likelihood -284.7, and in reaching this conclusion we only had to calculate the likelihood for five of the 15 five-taxon trees. Of course, this algorithm can similarly be applied to data sets with many more species, where the savings in computational time achieved by the branch-and-bound algorithm can be far greater.
[Figure 8: the total enumeration tree ('tree of trees') for the five taxa a, b, c, d and e. The three-taxon tree sits at the root, the three four-taxon trees below it, and the 15 five-taxon trees at the leaves, each annotated with its log likelihood (among them -302.4, -300.9, -289.5, -291.3, -287.1, -284.7 and -305.2).]
Most of the heuristic optimization algorithms proceed by guessing an initial tree and, thereafter, introducing alterations of the tree. If one of the new trees examined by introducing these alterations has a higher likelihood than the current tree, this tree is kept as the best tree and new alterations are proposed based on this tree. These iterative algorithms stop when, in one iteration, none of the proposed alterations lead to trees with a higher likelihood value than the current tree. One of the most famous algorithms of this kind is the nearest-neighbor interchange algorithm. This algorithm proceeds by examining all the internal edges of a tree. Alterations to the tree are proposed by erasing the two nodes around an internal edge and examining the two possible trees that can be obtained by reconnecting the edges leading to the two nodes in a different way (Figure 9). For n species, there are n - 3 internal edges, so according to this algorithm an unrooted tree has 2n - 6 nearest neighbors. In each of the iterations of this algorithm the likelihood for 2n - 6 trees must be calculated. Very often this algorithm terminates after relatively few iterations, and it is, therefore, known to be a fast algorithm. Unfortunately, it very often finds a suboptimal tree, i.e. a tree different from the maximum likelihood tree. The optimization problem for phylogenies is still an open area of research.
[Figure 9: nearest-neighbor interchange. Erasing the two nodes around the internal edge of the unrooted tree ((a,b),(c,d)) and reconnecting gives the two alternative trees ((a,c),(b,d)) and ((a,d),(c,b)).]
Parsimony
The idea in the parsimony method is that the tree requiring the minimum number of
evolutionary events (mutations) should be preferred. To apply this principle, we need a
method for finding the minimum number of evolutionary events on a tree. For example,
consider the tree and four sites in Figure 10. The first site pattern can be explained by assuming a single T→G mutation. Likewise, the second site pattern can be explained by assuming a single C→A mutation. In these two sites, we need at least one mutation in each of the sites to explain the observed data, and both have a unique parsimony mapping of mutations requiring one mutation. In the third site, there are two different parsimony mappings: the first mapping assumes a C→A and an A→G mutation, and the second mapping requires an A→C and an A→G mutation. These two mappings of mutations are equally parsimonious. In the fourth site, two mutations are minimally required to explain the data, but there are three possible parsimony mappings with two mutations.
[Figure 10: a five-taxon tree shown with four site patterns. Site 1 is explained by a single T→G mutation and site 2 by a single C→A mutation. Site 3 has two equally parsimonious mappings: {C→A, A→G} and {A→C, A→G}. Site 4 requires two mutations and has three possible parsimony mappings: {C→T, C→T}, {T→C, T→C} and {C→T, T→C}.]
In these examples, it was relatively easy to find the minimum number of mutations (the
parsimony score). However, for large data sets we need a general algorithm for finding
the parsimony score. The most commonly used algorithm is known as the Fitch-Wagner
parsimony algorithm (Fitch 1971). Let the parsimony cost be C and assume all the nodes
are labeled with an integer such that the number of the root is 2n - 1. The nucleotide
observed in node i is xi. Let Ri be a set containing the possible parsimony assignments of
ancestral nucleotides in node i given the data in the descendents of node i only. Then the
algorithm is defined as:
Fitch-Wagner parsimony algorithm
1. Set k = 2n - 1 and C = 0
2. Compute Rk as follows:
- If k is a leaf node set Rk = {xk}.
- Otherwise compute Ri and Rj for the child nodes i, j of k, and set Rk = Ri ∩ Rj if
this intersection is not empty, or else set Rk = Ri ∪ Rj and let C = C + 1.
The parsimony score for the site is then given by C. For an entire alignment, this algorithm is applied to all sites and the final parsimony score is calculated as the sum of the scores in all sites.
Like Felsenstein's algorithm, this is a recursive algorithm that performs a post-order traversal of the tree. Intuitively, we can understand why this algorithm works by noting that since Ri and Rj are the sets of parsimony assignments of ancestral nucleotides in nodes i and j, no extra mutation is required if the intersection of these two sets is non-empty. However, one more mutation is required if the intersection is empty.
To understand how the algorithm works, it might be illustrative to consider the example in Figure 11. The algorithm will first find R1 = {A} and R2 = {C}. Then it will calculate R6 = R1 ∪ R2 = {A, C} and set C = 1 because R1 ∩ R2 is empty. Then the algorithm will visit node 3 and initialize R3 = {A}. Thereafter, node 7 is visited and R7 = R6 ∩ R3 = {A}, etc. At the root node the parsimony score has been found as C = 3.
[Figure 11: the Fitch-Wagner algorithm on a five-taxon tree with leaves 1-5 carrying A, C, A, T and C. Node 6 (parent of 1 and 2): R6 = {A, C}, C = 1. Node 7 (parent of 6 and 3): R7 = {A}, C = 1. Node 8 (parent of 7 and 4): R8 = {A, T}, C = 2. Node 9, the root (parent of 8 and 5): R9 = {A, C, T}, C = 3.]
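As a minimal sketch, the Fitch-Wagner recursion for a single site can be written directly from the description above; run on the tree of Figure 11 it returns the root set and score from the example (the nested-tuple tree encoding is again just an illustrative choice):

```python
# Fitch-Wagner parsimony for a single site. A leaf is a one-letter string;
# an internal node is a (left subtree, right subtree) pair.
def fitch(node):
    """Return (set of parsimony state assignments, parsimony score)."""
    if isinstance(node, str):        # leaf node
        return {node}, 0
    (ri, ci), (rj, cj) = fitch(node[0]), fitch(node[1])
    if ri & rj:                      # non-empty intersection: no new mutation
        return ri & rj, ci + cj
    return ri | rj, ci + cj + 1      # empty intersection: one more mutation

# The tree of Figure 11: ((((1:A, 2:C), 3:A), 4:T), 5:C)
tree = (((("A", "C"), "A"), "T"), "C")
states, score = fitch(tree)
print(states, score)  # {'A', 'C', 'T'} and a parsimony score of 3
```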
Similarly to maximum likelihood, a search must be conducted to find the tree with the smallest parsimony score. Again, there is no fast algorithm that is guaranteed to find the tree with the smallest parsimony score, so when the branch-and-bound algorithm is too slow, heuristic algorithms are often used in an attempt to find the best tree.
Distance Methods (Clustering Algorithms)
Most of the distance methods have the advantage that they do not require that an optimal tree be found. Instead, an algorithm is constructed by which a tree is chosen without regard to its optimality according to any well-defined criterion. Most distance methods proceed by first estimating, or calculating, a distance between all pairs of sequences. Commonly used distances are uncorrected distances, which are simply the number of positions in which two DNA (or protein) sequences differ, or the estimated number of substitutions (λ̂) as described in the section regarding models of molecular evolution. After the distances have been estimated, an algorithm is used to find a tree based on the estimated distances.
The first, and most famous of these algorithms, is the UPGMA (Unweighted Pair
Groups Method using arithmetic Averages) algorithm by Sokal and Michener (1958).
This algorithm estimates a rooted tree with branch lengths. Let the distance between sequences i and j be dij and the number of sequences be n. Each sequence is represented by a leaf node in the tree, so we can think of the distances as distances between nodes. Think of time as running backwards, place all external nodes at time zero, number all of the leaf nodes from n - 1 to 2n - 2, and let M = {n - 1, n, …, 2n - 2} be the set containing all leaf nodes. Then the UPGMA algorithm proceeds as follows:
UPGMA
For k = 0, 1,…, n-2
(1) Identify the pair of nodes (i and j) in M with the smallest value of dij
(2) Define node k as the parent node of i and j at time dij/2
(3) Define the distance from node k to any other node v as the average distance
between all descendent nodes of node k and all descendent nodes of node v.
(4) Eliminate i and j from M and add k.
At this point it might be illustrative to consider an example. Let the observed distances between four nodes labeled 3, 4, 5 and 6 be

        4     5     6
  3    10    25    30
  4          15    20
  5                15

We identify nodes 3 and 4 as the two nodes with the smallest distance between them and establish a new parent node (0) at height 10/2 = 5. The new distances are calculated as d05 = (d35 + d45)/2 = (25 + 15)/2 = 20 and d06 = (d36 + d46)/2 = (30 + 20)/2 = 25. The new distances are then

        5     6
  0    20    25
  5          15

We then identify nodes 5 and 6 as the two nodes with the smallest distance. We establish a new parent node (1) at height 15/2 = 7.5 and obtain the following set of distances:

        1
  0   22.5

We place the last node (2) at height 22.5/2 = 11.25 as the parent node of nodes 0 and 1. This leads us to the following tree:

[Tree: leaves 3 and 4 join at node 0 (height 5), leaves 5 and 6 join at node 1 (height 7.5), and nodes 0 and 1 join at the root, node 2 (height 11.25).]
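As a minimal sketch, the UPGMA steps can be coded directly and run on the distances of the example above; the printed merges should match the worked example (the choice of frozenset keys for the distance table is just a convenience):

```python
# UPGMA on the example distances for leaf nodes 3, 4, 5 and 6.
import itertools

def upgma(dist, leaves):
    """dist maps frozenset({i, j}) to a distance; leaves are node ids."""
    members = {v: {v} for v in leaves}  # descendant leaves of each node
    M = set(leaves)
    next_id = 0                         # internal nodes numbered 0, 1, ...
    while len(M) > 1:
        # (1) pair in M with the smallest distance
        i, j = min(itertools.combinations(M, 2),
                   key=lambda p: dist[frozenset(p)])
        k, next_id = next_id, next_id + 1
        # (2) parent node k placed at height d_ij / 2
        height = dist[frozenset((i, j))] / 2.0
        members[k] = members[i] | members[j]
        # (3) distance from k to any other node v: average over leaf pairs
        M -= {i, j}
        for v in M:
            total = sum(dist[frozenset((a, b))]
                        for a in members[k] for b in members[v])
            dist[frozenset((k, v))] = total / (len(members[k]) * len(members[v]))
        M.add(k)  # (4) replace i and j by k
        print(f"node {k} = parent of {i} and {j} at height {height}")

d = {frozenset((3, 4)): 10, frozenset((3, 5)): 25, frozenset((3, 6)): 30,
     frozenset((4, 5)): 15, frozenset((4, 6)): 20, frozenset((5, 6)): 15}
upgma(d, [3, 4, 5, 6])  # heights 5.0, 7.5 and 11.25, as in the example
```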
The UPGMA defines a fast algorithm for estimating phylogenies. However, it has one major drawback in that it assumes a molecular clock. A (rooted) tree obeying a molecular clock (an ultrametric tree) is a tree in which the distance from the root to any leaf node is the same no matter which leaf node is considered. In the previous example, this distance was 11.25. In general, when distances are measured in expected numbers of substitutions, trees are only expected to obey a molecular clock when the rate of substitution is the same among all branches of the phylogeny. Except possibly for very closely related species, most data do not seem to obey a molecular clock. Under such circumstances, the UPGMA method may not provide good estimates of phylogenies. The neighbor-joining algorithm is a modification of the basic UPGMA algorithm that allows deviations from the molecular clock. Using the same definitions and assumptions as for the UPGMA method, except that we now number all of the leaf nodes from n - 2 to 2n - 3, the neighbor-joining method can be described by the following algorithm:
Neighbor-joining
For k = 0, 1,…, n - 3:
(1) For each node i in M, define $r_i = \frac{1}{n-k-2} \sum_{v \in M} d_{iv}$.
(2) Identify the pair of nodes (i and j) in M with the smallest value of dij - (ri + rj).
(3) Define node k as the parent node of i and j, with branch lengths dik = (1/2)(dij + ri - rj) and djk = dij - dik, to nodes i and j, respectively.
(4) For all v in M, define dkv = (1/2)(div + djv - dij).
(5) Eliminate i and j from M and add k.
(6) If k = n - 3, link the last two nodes (k and v) with a branch of length dkv.
Notice that the trees estimated by this algorithm are unrooted trees. The basic idea in the algorithm is that in the absence of a molecular clock, some estimate of the distance from each of the child nodes to the parent should be obtained; it is no longer adequate just to place the parent node at time dij/2. If the distances are all compatible with the same tree, the distance from child node i to parent node k should be of length dik = (1/2)(dij + ri - rj).
The neighbor-joining algorithm is guaranteed to reconstruct the tree accurately if the distances between leaf nodes are known without error. In such cases, there is a one-to-one correspondence between the tree and the matrix of all distances between external nodes. This provides the major motivation for the algorithm. The UPGMA algorithm achieves the same for a rooted tree obeying a molecular clock.
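As a minimal sketch of the steps above, here is the algorithm run on a made-up set of additive distances (an assumption for illustration): they are generated by an unrooted four-taxon tree with terminal branches of lengths 2, 3, 4 and 5 and an internal branch of length 1, and neighbor-joining recovers exactly those branch lengths:

```python
# Neighbor-joining; dist maps frozenset({i, j}) to a distance.
import itertools

def neighbor_joining(dist, leaves):
    M = set(leaves)
    branches = []                     # (child, parent, branch length)
    next_id = max(leaves) + 1         # ids for the new internal nodes
    while len(M) > 2:
        # (1) net divergence r_i of each node
        r = {i: sum(dist[frozenset((i, v))] for v in M if v != i)
                / (len(M) - 2) for i in M}
        # (2) pair minimizing d_ij - (r_i + r_j)
        i, j = min(itertools.combinations(M, 2),
                   key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]])
        k, next_id = next_id, next_id + 1
        # (3) branch lengths from i and j to the new parent node k
        dij = dist[frozenset((i, j))]
        dik = 0.5 * (dij + r[i] - r[j])
        branches += [(i, k, dik), (j, k, dij - dik)]
        # (4) distances from k to the remaining nodes
        M -= {i, j}
        for v in M:
            dist[frozenset((k, v))] = 0.5 * (dist[frozenset((i, v))]
                                             + dist[frozenset((j, v))] - dij)
        M.add(k)                      # (5) replace i and j by k
    i, j = M                          # link the last two nodes
    branches.append((i, j, dist[frozenset((i, j))]))
    return branches

d = {frozenset((0, 1)): 5, frozenset((0, 2)): 7, frozenset((0, 3)): 8,
     frozenset((1, 2)): 8, frozenset((1, 3)): 9, frozenset((2, 3)): 9}
for branch in neighbor_joining(d, [0, 1, 2, 3]):
    print(branch)  # terminal branches 2, 3, 4, 5 and internal branch 1
```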
Both of these algorithms are commonly used clustering algorithms outside the context of phylogenetics. They can be used to perform a hierarchical clustering of any type of objects for which distances can be defined, without any evolutionary interpretation; a clade then corresponds to a cluster. For example, the UPGMA algorithm has commonly been used to cluster genes with similar expression patterns in microarray experiments.