Finding a Perfect Phylogeny

Caroline Uhler
Department of Statistics, University of California, Berkeley
May 2008
This work is based on the paper ‘Incomplete perfect phylogeny’ by Pe’er et al.
(2000). The goal of this paper is to review some important findings concerning the
problem of constructing a perfect phylogeny if one exists. In the first section, I
will present some concepts used in phylogenetics in general. Then, we will discuss
the complete directed perfect phylogeny problem and review Gusfield’s algorithm
(Gusfield, 1991), which solves this problem in linear time. In the third section, I
introduce the problem of incomplete data, where some character states are missing,
and we ask if the missing states can be completed in a way admitting a perfect
phylogeny. Pe’er et al. (2000) described an algorithm that solves this problem in
near-linear time. Finally, in the fourth section, we will discuss some applications and
extensions of the algorithm by Pe’er et al. (2000).
1 Introduction
In evolutionary genetics a frequently encountered problem is the construction of a
phylogenetic tree representing the evolutionary history of a set of taxa. In a phylogenetic tree every node represents a species and the tree branches model the changes
through time. In the phylogenetic reconstruction problem one is usually given information regarding the leaf nodes, which represent extant species. The tree that best
explains the information at the leaves is to be inferred.
Two types of data can be used for building phylogenetic trees. In a distance-based approach to tree reconstruction a matrix of distances between the species is
given as input and the goal is to find a tree whose edge lengths are consistent with the
given distances. In a character-based approach we are given n species and for each
species the values of m characters, where each character takes on one of s possible
states. So the data can be summarized in an n × m character-state matrix M, where
Mi,j represents the state of the j-th character of the i-th species. The goal is to find
a phylogenetic tree including the character vectors of all internal nodes that best
explains the data.
In this paper I will concentrate on character-based methods. In the remaining
part of this section, I will first describe the set of states that are currently used in
character-based methods, then present the underlying model of mutation that we
will assume throughout this paper, and finally state the problem of finding a perfect
phylogeny.
1.1 State space in character-based methods
Nowadays, the character vectors usually consist of DNA sequences. So we could
imagine a model where the four nucleotides represent the possible states. However,
any two individuals have almost identical DNA sequences. For example, in the human
genome there are only about 600,000 differences in the nucleotide sequences. A site,
where different nucleotides currently occur, is called a single-nucleotide polymorphism
(SNP). Among the four possible nucleotides of the DNA typically only two nucleotides
commonly occur at a particular site and can therefore be encoded by ‘0’ and ‘1’, respectively. This leads to a model with binary characters. In a phylogenetic framework
only polymorphic sites contribute information and so only SNPs are the interesting
sites in the genome.
The sequence of SNPs on a chromosome is called a haplotype. In diploid organisms
each individual has two copies of each chromosome and therefore two haplotypes
which might be unequal. Collecting haplotype data empirically is prohibitively expensive. Therefore, often only the genotype data is collected instead. The genotype
is a superposition of the two haplotypes, where the binary characters of the two corresponding SNPs are added together. So sites with values ‘0’ or ‘2’ are homozygous
and denote the combination ‘0-0’ and ‘1-1’, respectively. A site with value ‘1’ is heterozygous and denotes either the combination ‘0-1’ or the combination ‘1-0’. Thus
many pairs of haplotypes can give rise to the same genotype. So in short, haplotype
data give rise to binary characters whereas genotype data give rise to characters with
three states.
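The superposition of two haplotypes into a genotype can be illustrated with a short Python sketch. The function names `genotype` and `haplotype_pairs` are chosen here for illustration only and do not come from the cited papers; haplotypes are assumed to be given as lists of 0/1 values.

```python
from itertools import product

def genotype(h1, h2):
    """Superpose two binary haplotypes site by site (0, 1 or 2 per site)."""
    return [a + b for a, b in zip(h1, h2)]

def haplotype_pairs(g):
    """Enumerate all unordered haplotype pairs that give rise to genotype g."""
    # A homozygous site fixes both copies; a heterozygous site ('1')
    # can be resolved as 0-1 or as 1-0.
    options = []
    for x in g:
        if x == 0:
            options.append([(0, 0)])
        elif x == 2:
            options.append([(1, 1)])
        else:
            options.append([(0, 1), (1, 0)])
    pairs = set()
    for choice in product(*options):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(frozenset([h1, h2]))  # unordered pair
    return pairs

g = genotype([0, 1, 1, 0], [1, 1, 0, 0])
print(g)                        # [1, 2, 1, 0]
print(len(haplotype_pairs(g)))  # 2
```

A genotype with k ≥ 1 heterozygous sites is explained by 2^(k−1) unordered haplotype pairs, so the ambiguity grows quickly with k.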
1.2 Model of mutation
Mutation is the basis of phylogenetic reconstruction. The structure of the phylogenetic
tree is only revealed if polymorphisms exist among the sampled sequences. Inference
about the tree would be impossible without the implicit or explicit use of some genetic
model, either to assess the biological fidelity of any proposed solution, or to guide the
algorithm in constructing a solution. In the following, I briefly present the infinite
sites model, which I will use as the underlying mutation model throughout the paper.
Infinite sites model: This model assumes that each mutation occurs at a previously unmutated site. So this model is based on the assumption that SNPs in the
sequence are so sparse relative to the mutation rate, that in the time frame of interest at most one mutation will have occurred at any SNP. This assumption is often
appropriate for DNA sequences, in which the rate of mutation per nucleotide site is
typically low. However, there are cases in which this model is clearly not applicable,
for example in viral genomes.
In the following, we will assume the infinite sites model as the underlying model of
mutation. In addition, we will focus on binary characters and discuss the extension
to genotype data only briefly in Section 4.2. Moreover, we assume that the ancestral
haplotype at the root of the tree is known, and assign all of its states to be zero.
Under these assumptions the problem of finding a phylogenetic tree explaining the
data is called perfect phylogeny and is further explained in the following section.
1.3 Perfect phylogeny
A perfect phylogeny for an n × m character-state matrix M is a rooted tree T with n
leaves satisfying:
(i) Each row of M labels exactly one leaf of T .
(ii) Each column of M labels exactly one edge of T .
(iii) Every interior edge of T is labeled by at least one column of M .
(iv) The characters associated with the edges along the unique path from the root
to a leaf v exactly specify the character vector of v, i.e. the character vector has
a ‘1’ entry in all columns corresponding to characters associated to path edges
and a ‘0’ entry otherwise.
So this implies that the root of the tree represents an ancestral object that has state
‘0’ in all m characters, and each of the characters changes from state ‘0’ to state ‘1’
exactly once and never changes back from state ‘1’ to state ‘0’. Hence, any nodes
below an edge associated with character c definitely have state ‘1’ for that character.
In what follows, we will discuss the following phylogeny problem: Given an n × m
character-state matrix M , determine whether there is a perfect phylogeny for M , and
if so, build one. This problem is also called the complete directed perfect phylogeny
problem and is further discussed in the following section.
2 Complete directed perfect phylogeny
We start with an example of a character-state matrix which has a solution to the
complete directed perfect phylogeny problem and an example of a matrix which has
no solution. We will use these two examples throughout the paper.
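Example 2.1. Define
\[
M_1 = \begin{array}{c|ccccc}
    & c_1 & c_2 & c_3 & c_4 & c_5 \\ \hline
s_1 & 1 & 1 & 0 & 0 & 0 \\
s_2 & 0 & 0 & 1 & 0 & 0 \\
s_3 & 1 & 1 & 0 & 0 & 1 \\
s_4 & 0 & 0 & 1 & 1 & 0 \\
s_5 & 0 & 1 & 0 & 0 & 0
\end{array}
\quad\text{and}\quad
M_2 = \begin{array}{c|ccccc}
    & c_1 & c_2 & c_3 & c_4 & c_5 \\ \hline
s_1 & 1 & 1 & 0 & 0 & 0 \\
s_2 & 0 & 0 & 1 & 0 & 1 \\
s_3 & 1 & 1 & 0 & 0 & 1 \\
s_4 & 0 & 0 & 1 & 1 & 0 \\
s_5 & 0 & 1 & 0 & 0 & 1
\end{array}.
\]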
The matrix M2 has no perfect phylogeny, while the matrix M1 has one, namely:
Figure 1: Showing the perfect phylogeny for M1 .
Answering the question if there is a perfect phylogeny can be seen to be equivalent
to a compatibility problem, a graph-theoretical problem or a matrix avoiding problem.
To explain this, we first need to make some definitions and can then state the main
theorem giving various characterizations of when a perfect phylogeny exists. We begin
with the definition of compatibility.
Definition 2.2. Two sets are called compatible if they are disjoint or one of them
contains the other, i.e.
A ∩ B ∈ {∅, A, B}.
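As a minimal sketch, the compatibility condition of Definition 2.2 can be written directly with Python sets. The function name `compatible` is an illustrative choice.

```python
def compatible(a, b):
    """Return True if the sets a and b are disjoint or one contains the other."""
    common = a & b
    return len(common) == 0 or common == a or common == b

# Nested sets are compatible.
print(compatible({"s1", "s3"}, {"s1", "s3", "s5"}))        # True
# Disjoint sets are compatible.
print(compatible({"s1", "s3"}, {"s2", "s4"}))              # True
# Overlapping sets without containment are not compatible.
print(compatible({"s1", "s3", "s5"}, {"s2", "s3", "s5"}))  # False
```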
Now, we look at the problem from a graph-theoretical point of view.
Definition 2.3. Let S = {s1, ..., sn} be the set of species and C = {c1, ..., cm} the
set of characters.
(i) The character-taxon graph of a given character-state matrix M is the bipartite graph G(M) = (S, C, E), where E = {(si, cj) | Mi,j = 1}.
(ii) A Σ subgraph is an induced path of length four in G(M) that starts and ends at a vertex corresponding to a species.
Figure 2: Showing a Σ subgraph.
(iii) G(M) is called Σ-free if it contains no Σ subgraph, i.e. if there is no induced path of length four in G(M) starting
and ending at a vertex corresponding to a species.
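To make the notion of a Σ subgraph concrete, the following sketch searches G(M) for an induced path species, character, species, character, species. It works on the matrix directly instead of building the graph: such a path on characters cj and ck exists exactly when some species has only cj, some species has both, and some species has only ck. The function name `find_sigma` and the use of 0-based indices are assumptions of this sketch.

```python
from itertools import combinations

def find_sigma(M):
    """Return (s_a, c_j, s_b, c_k, s_c) as 0-based indices of an induced path
    of length four in G(M), or None if G(M) is Sigma-free."""
    n, m = len(M), len(M[0])
    for j, k in combinations(range(m), 2):
        only_j = [i for i in range(n) if M[i][j] == 1 and M[i][k] == 0]
        both = [i for i in range(n) if M[i][j] == 1 and M[i][k] == 1]
        only_k = [i for i in range(n) if M[i][j] == 0 and M[i][k] == 1]
        if only_j and both and only_k:
            return (only_j[0], j, both[0], k, only_k[0])
    return None

# The matrices M1 and M2 of Example 2.1 (rows s1..s5, columns c1..c5).
M1 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
M2 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 1], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 1]]

print(find_sigma(M1))  # None
print(find_sigma(M2))  # (0, 0, 2, 4, 1), i.e. the path s1 - c1 - s3 - c5 - s2
```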
Finally, we need two last definitions regarding matrices.
Definition 2.4. A matrix A is said to avoid a matrix B, if B is not equal to any
submatrix of A.
Definition 2.5. A binary matrix A is called canonical if it can be decomposed as
follows:
(i) The leftmost k0 ≥ 0 columns are all zero.
(ii) The next k1 ≥ 0 columns are all one.
(iii) There exist canonical matrices A1 , . . . , Ar such that A is of the form shown in
Figure 3.
Figure 3: Showing a canonical matrix, where the submatrices are defined recursively.
With these definitions we can now give a characterization of when the complete
directed perfect phylogeny problem is solvable. The proof of the following theorem
can be found in Pe’er et al. (2000).
Theorem 2.6. Let S = {s1, ..., sn} be the set of species, C = {c1, ..., cm} the set
of characters and M a binary character-state matrix. Then the following statements
are equivalent:
(i) M has a perfect phylogeny.
(ii) The 1-sets Cj = {si | Mi,j = 1}, Ck = {si | Mi,k = 1} of any two characters cj, ck
are compatible.
(iii) G(M) is Σ-free.
(iv) Every ordering of the rows and columns of M results in a matrix that avoids
the matrix
\[
A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
(v) There exists a permutation of the rows and columns of M such that the resulting
matrix is a canonical matrix.
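(vi) There exists a permutation of the rows and columns of M which yields a matrix
avoiding the following matrices:
\[
A_1 = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad
A_3 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
A_4 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}.
\]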
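Characterizations (ii) and (iv) of Theorem 2.6 are easy to check by brute force on small matrices. The sketch below does this for M1 and M2 from Example 2.1; the function names are illustrative, and the quadratic loop over column pairs is chosen for readability, not efficiency.

```python
from itertools import combinations

def one_sets(M):
    """1-set of each character: the row indices with state 1 in that column."""
    return [{i for i, row in enumerate(M) if row[j] == 1}
            for j in range(len(M[0]))]

def pairwise_compatible(M):
    """Characterization (ii): any two 1-sets are disjoint or nested."""
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(one_sets(M), 2))

def avoids_forbidden(M):
    """Characterization (iv): no two columns contain the three row patterns
    (1,1), (1,0) and (0,1), in any order of rows and columns."""
    for j, k in combinations(range(len(M[0])), 2):
        patterns = {(row[j], row[k]) for row in M}
        if {(1, 1), (1, 0), (0, 1)} <= patterns:
            return False
    return True

M1 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
M2 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 1], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 1]]

print(pairwise_compatible(M1), avoids_forbidden(M1))  # True True
print(pairwise_compatible(M2), avoids_forbidden(M2))  # False False
```

For M2 the test fails, for instance, on the columns c1 and c5, whose 1-sets {s1, s3} and {s2, s3, s5} overlap without one containing the other.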
Gusfield (1991) gave an O(nm) algorithm to test for perfect phylogeny and to
construct a perfect phylogeny if it exists. The algorithm is given below. It is based
on the characterization (ii) in Theorem 2.6.
Algorithm:
Input: an n × m binary character-state matrix M ;
Output: a perfect phylogeny (if it exists);
1. Consider each column of M as a binary number with the most significant bit in
the first row. Sort the columns into decreasing order (by radix sort) and delete
any duplicated columns. Denote this new matrix by M′.
2. Construct an n × m keyword matrix K such that every row i contains the
characters of species i that are in state ‘1’, listed in the column order of M′. Let all non-defined entries be equal
to zero. Then set the first zero entry in each row equal to ‘%’.
3. Build the corresponding keyword tree T′ as illustrated in the example below.
4. Remove the edge labels ‘%’ and output the resulting tree.
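The steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Gusfield's implementation: Python's built-in `sorted` replaces the radix sort (so the sketch does not achieve the O(nm) bound), the end marker ‘%’ is replaced by recording the node at which each keyword ends, and the names `gusfield`, `children` and `leaf_of` are hypothetical.

```python
from collections import Counter

def gusfield(M):
    """Return (children, leaf_of) if every character labels exactly one edge
    of the keyword tree, otherwise None. Node 0 is the root."""
    n, m = len(M), len(M[0])

    def column(j):
        return tuple(row[j] for row in M)

    # Step 1: sort columns as binary numbers, largest first, drop duplicates.
    kept, seen = [], set()
    for j in sorted(range(m), key=column, reverse=True):
        if column(j) not in seen:
            seen.add(column(j))
            kept.append(j)

    # Step 2: keyword of species i = its characters in state 1, in sorted order.
    keywords = [[j for j in kept if M[i][j] == 1] for i in range(n)]

    # Step 3: insert the keywords into a trie.
    children = {0: {}}       # node -> {character index: child node}
    edge_count = Counter()   # character index -> number of edges it labels
    leaf_of = []             # node at which the keyword of each species ends
    for word in keywords:
        node = 0
        for c in word:
            if c not in children[node]:
                new = len(children)
                children[node][c] = new
                children[new] = {}
                edge_count[c] += 1
            node = children[node][c]
        leaf_of.append(node)

    # Step 4: a character on two different edges rules out a perfect phylogeny.
    if any(count > 1 for count in edge_count.values()):
        return None
    return children, leaf_of

M1 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
M2 = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 1], [1, 1, 0, 0, 1],
      [0, 0, 1, 1, 0], [0, 1, 0, 0, 1]]

print(gusfield(M1) is not None)  # True
print(gusfield(M2))              # None
```

For M2 the character c5 ends up on two different edges of the keyword tree, which is why the sketch returns None.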
Example 2.7. We will illustrate this algorithm on the two character-state matrices
M1 and M2 given in Example 2.1.
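\[
M_1' = \begin{array}{c|ccccc}
    & c_2 & c_1 & c_3 & c_5 & c_4 \\ \hline
s_1 & 1 & 1 & 0 & 0 & 0 \\
s_2 & 0 & 0 & 1 & 0 & 0 \\
s_3 & 1 & 1 & 0 & 1 & 0 \\
s_4 & 0 & 0 & 1 & 0 & 1 \\
s_5 & 1 & 0 & 0 & 0 & 0
\end{array}
\;\longrightarrow\;
K_1 = \begin{pmatrix}
c_2 & c_1 & \% & 0 & 0 \\
c_3 & \% & 0 & 0 & 0 \\
c_2 & c_1 & c_5 & \% & 0 \\
c_3 & c_4 & \% & 0 & 0 \\
c_2 & \% & 0 & 0 & 0
\end{pmatrix}
\;\longrightarrow\; T_1' \;\longrightarrow\; T_1
\]
[Figures: the keyword tree T1′ and the resulting tree T1.]
\[
M_2' = \begin{array}{c|ccccc}
    & c_2 & c_1 & c_5 & c_3 & c_4 \\ \hline
s_1 & 1 & 1 & 0 & 0 & 0 \\
s_2 & 0 & 0 & 1 & 1 & 0 \\
s_3 & 1 & 1 & 1 & 0 & 0 \\
s_4 & 0 & 0 & 0 & 1 & 1 \\
s_5 & 1 & 0 & 1 & 0 & 0
\end{array}
\;\longrightarrow\;
K_2 = \begin{pmatrix}
c_2 & c_1 & \% & 0 & 0 \\
c_5 & c_3 & \% & 0 & 0 \\
c_2 & c_1 & c_5 & \% & 0 \\
c_3 & c_4 & \% & 0 & 0 \\
c_2 & c_5 & \% & 0 & 0
\end{pmatrix}
\;\longrightarrow\; T_2' \;\longrightarrow\; T_2
\]
[Figures: the keyword tree T2′ and the resulting tree T2.]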
So the first matrix has a perfect phylogeny, whereas the second matrix results in
a tree that is not a perfect phylogeny, as some characters label multiple edges. This
implies that the second matrix has no perfect phylogeny.
By using radix sort, the algorithm described by Gusfield (1991) runs in O(nm) time,
which is linear in the size of the input matrix. Agarwala and Fernández-Baca (1994) even improved the running time to O(k),
where k is the number of ones in the data matrix. We now investigate whether a similar
algorithm can also be found for incomplete data.
3 Incomplete directed perfect phylogeny
In this setting we are only given partial haplotypes. So the character vectors are
elements in {0, 1, ∗}^m, where ‘∗’ indicates that the nucleotide at a given position is
undetermined. As in the previous problem we assume that the characters are directed,
which means that the haplotype at the root of the tree is assumed to be known and
having all states equal to zero. We are interested in the question whether one can
complete the missing states in a way admitting a perfect phylogeny. Pe’er et al. (2000)
gave a near-linear time algorithm to solve this problem. But before presenting the
algorithm, note that the analogous statement to (ii) of Theorem 2.6 does not hold
for incomplete data. Even if every pair of columns is compatible and therefore has a
phylogenetic tree, the full character-state matrix might not have one.
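Example 3.1. Using the matrix M2 from Example 2.1, we can construct an incomplete matrix M3, which has no perfect phylogeny, although every pair of its columns
has one.
\[
M_3 = \begin{array}{c|ccccc}
    & c_1 & c_2 & c_3 & c_4 & c_5 \\ \hline
s_1 & 1 & 1 & 0 & 0 & 0 \\
s_2 & * & * & 1 & 0 & 1 \\
s_3 & 1 & 1 & 0 & 0 & 1 \\
s_4 & 0 & 0 & 1 & 1 & * \\
s_5 & * & 1 & 0 & 0 & 1
\end{array}
\]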
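The claim of Example 3.1 can be checked by exhaustive search, since M3 has only five missing entries. The sketch below tries every 0/1 completion and tests pairwise compatibility of the 1-sets, which by Theorem 2.6 (ii) decides the complete problem. Encoding ‘∗’ as `None` and the function names are assumptions of this sketch; the brute force is exponential in the number of missing entries and is meant only as a check.

```python
from itertools import combinations, product

STAR = None  # stands for a missing state

M3 = [
    [1,    1,    0, 0, 0],
    [STAR, STAR, 1, 0, 1],
    [1,    1,    0, 0, 1],
    [0,    0,    1, 1, STAR],
    [STAR, 1,    0, 0, 1],
]

def compatible_columns(M):
    """Pairwise compatibility of the 1-sets of a complete binary matrix."""
    sets = [{i for i, row in enumerate(M) if row[j] == 1}
            for j in range(len(M[0]))]
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(sets, 2))

def has_completion(M):
    """Try every 0/1 assignment of the missing entries."""
    holes = [(i, j) for i, row in enumerate(M)
             for j, x in enumerate(row) if x is STAR]
    for values in product((0, 1), repeat=len(holes)):
        filled = [list(row) for row in M]
        for (i, j), v in zip(holes, values):
            filled[i][j] = v
        if compatible_columns(filled):
            return True
    return False

def submatrix(M, cols):
    """Restrict M to the given column indices."""
    return [[row[j] for j in cols] for row in M]

pairs_ok = all(has_completion(submatrix(M3, pair))
               for pair in combinations(range(5), 2))
print(pairs_ok)            # True
print(has_completion(M3))  # False
```

The conflict sits in column c5 at species s4: compatibility of c3 and c5 forces this entry to be 1, while compatibility of c1 and c5 forces it to be 0.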
Before we can give the algorithm solving the incomplete directed perfect phylogeny
problem, another definition and a remark are needed.
Definition 3.2. Let M be a finite set. A collection H of non-empty subsets of M is
called a hierarchy on M if it satisfies the property that
A ∩ B ∈ {∅, A, B} for all A, B ∈ H.
Remark 3.3. There is a bijection between hierarchies on S and phylogenetic trees
on S.
Proof. A phylogenetic tree T on S induces a hierarchy on S in the following way. For
a node v in T we denote the leaf set of the subtree rooted in v by L(v). L(v) is called
the clade of T associated to v. We denote the set of all clades of T by H(T ). This
set is clearly a hierarchy on S.
On the other hand, given a hierarchy H, one can easily construct the corresponding
phylogenetic tree on S. Note that the inclusion ⊆ is a partial order on H. So (H, ⊆)
is a partially ordered set and can therefore be represented in a Hasse diagram. This
structure defines a phylogenetic tree on S.
So a phylogenetic tree T on S is uniquely defined by the set of its clades. We
therefore identify T with the set {L | L is a clade of T }. Further results on hierarchies
can be found in Semple and Steel (2003).
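The passage from a hierarchy to a tree in the proof of Remark 3.3 can be sketched as follows: the parent of a clade is the smallest clade strictly containing it, which is exactly the covering relation of the Hasse diagram. In a hierarchy the clades containing a given clade are nested, so this smallest superset is unique. The function name `tree_from_hierarchy` and the parent-map representation are illustrative choices.

```python
def tree_from_hierarchy(H):
    """Map every non-empty clade of the hierarchy H to its parent clade
    (None for clades that have no proper superset in H)."""
    clades = {frozenset(c) for c in H if c}
    parent = {}
    for c in clades:
        supersets = [d for d in clades if c < d]
        parent[c] = min(supersets, key=len) if supersets else None
    return parent

S = {"s1", "s2", "s3", "s4", "s5"}
H = [S, {"s1"}, {"s2"}, {"s3"}, {"s4"}, {"s5"},
     {"s1", "s3", "s5"}, {"s1", "s3"}, {"s2", "s4"}]

parent = tree_from_hierarchy(H)
print(sorted(parent[frozenset({"s1"})]))        # ['s1', 's3']
print(sorted(parent[frozenset({"s1", "s3"})]))  # ['s1', 's3', 's5']
print(parent[frozenset(S)])                     # None
```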
Algorithm: Let S = {s1, ..., sn} be the set of species and C = {c1, ..., cm} the set
of characters.
Input: an n × m incomplete character-state matrix M ∈ {0, 1, ∗}^(n×m);
Output: a perfect phylogeny T (if it exists) and the corresponding complete character-state matrix M′;
1. G := G(M) and T := {∅, S, {s1}, ..., {sn}}.
2. Remove all columns of M with no zero entries and all columns with zero entries
only.
3. While E(G) ≠ ∅ do:
   For each connected component K of G satisfying |E(K)| ≥ 1 do:
   i) S′ := S ∩ K.
   ii) Compute the set U of all characters in K that have no zero entries in the
   rows corresponding to S′.
   iii) If U = ∅ then return False and halt.
   iv) Else remove U from G, set T := T ∪ {S′} and associate the clade S′ with every character in U.
4. Associate to each column with no zero entries the clade S and to each column
with zero entries only the empty clade. Set
\[
M'_{s,c} = \begin{cases} 1 & \text{if } s \text{ belongs to the clade associated with } c, \\ 0 & \text{otherwise.} \end{cases}
\]
5. Output T and M′.
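The following is a direct, unoptimized Python sketch of the steps above. It recomputes the connected components from scratch in every pass, so it does not reach the near-linear running time of Pe’er et al. (2000), which relies on more elaborate data structures. Missing states are encoded as `None`, species and characters are 0-based indices, columns without any ‘1’ are treated like all-zero columns, and all function names are assumptions of this sketch.

```python
STAR = None  # stands for a missing state

def components(species, chars, M):
    """Connected components (species set, character set) of the bipartite
    graph on the given characters, with an edge (s, c) iff M[s][c] == 1."""
    comps, seen = [], set()
    for start in chars:
        if start in seen:
            continue
        seen.add(start)
        comp_s, comp_c, stack = set(), {start}, [start]
        while stack:
            c = stack.pop()
            for s in species:
                if M[s][c] != 1 or s in comp_s:
                    continue
                comp_s.add(s)
                for c2 in chars:
                    if M[s][c2] == 1 and c2 not in comp_c:
                        comp_c.add(c2)
                        seen.add(c2)
                        stack.append(c2)
        if comp_s:  # keep only components with at least one edge
            comps.append((comp_s, comp_c))
    return comps

def incomplete_perfect_phylogeny(M):
    """Return (clades, completed matrix), or None if no completion exists."""
    n, m = len(M), len(M[0])
    species = range(n)
    everyone = frozenset(species)
    clades = {frozenset(), everyone} | {frozenset([s]) for s in species}
    clade_of, active = {}, set()
    for c in range(m):
        col = [M[s][c] for s in species]
        if 0 not in col:
            clade_of[c] = everyone     # no zero entries: clade S
        elif 1 not in col:
            clade_of[c] = frozenset()  # no ones: completed as all-zero
        else:
            active.add(c)
    while active:
        for comp_s, comp_c in components(species, active, M):
            U = {c for c in comp_c if all(M[s][c] != 0 for s in comp_s)}
            if not U:
                return None
            for c in U:
                clade_of[c] = frozenset(comp_s)
            clades.add(frozenset(comp_s))
            active -= U
    completed = [[1 if s in clade_of[c] else 0 for c in range(m)]
                 for s in species]
    return clades, completed

# An incomplete version of M1 from Example 2.1 (rows s1..s5, columns c1..c5).
M = [[1, 1,    0, 0, STAR],
     [0, STAR, 1, 0, STAR],
     [1, STAR, 0, 0, 1],
     [0, 0,    1, 1, STAR],
     [0, 1,    0, 0, STAR]]

clades, completed = incomplete_perfect_phylogeny(M)
for row in completed:
    print(row)
```

On this input the sketch adds the clades {s1, s3, s5}, {s1, s3} and {s2, s4} (as index sets {0, 2, 4}, {0, 2} and {1, 3}) and completes the last column with ones throughout.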
It is proven in Pe’er et al. (2000) that this algorithm can be implemented in near-linear time, namely in time O(nm + k log²(n + m)), where k denotes the total number
of edges in G(M). The algorithm is illustrated in the following example.
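Example 3.4. Using the matrix M1 from Example 2.1, we define an incomplete
matrix M3:
\[
M_3 = \begin{array}{c|ccccc}
    & c_1 & c_2 & c_3 & c_4 & c_5 \\ \hline
s_1 & 1 & 1 & 0 & 0 & * \\
s_2 & 0 & * & 1 & 0 & * \\
s_3 & 1 & * & 0 & 0 & 1 \\
s_4 & 0 & 0 & 1 & 1 & * \\
s_5 & 0 & 1 & 0 & 0 & *
\end{array}.
\]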
In the second step of the algorithm we remove the last column of M3 as it has no zero
entries. For this example the algorithm given above performs three non-trivial loops as follows:
1. [Figure: the graph G before the first loop.]
   K = {s1, s3, s5, c1, c2}, S′ = {s1, s3, s5}, U = {c2},
   T = {∅, S, {s1}, ..., {s5}, {s1, s3, s5}}
2. [Figure: the graph G before the second loop.]
   K = {s1, s3, c1}, S′ = {s1, s3}, U = {c1},
   T = {∅, S, {s1}, ..., {s5}, {s1, s3, s5}, {s1, s3}}
3. [Figure: the graph G before the third loop.]
   K = {s2, s4, c3, c4}, S′ = {s2, s4}, U = {c3},
   T = {∅, S, {s1}, ..., {s5}, {s1, s3, s5}, {s1, s3}, {s2, s4}}
After the third loop only the edge between s4 and c4 remains. A final pass with K = {s4, c4} removes c4 and yields S′ = {s4}, a clade that T already contains.
So the given matrix can be completed to a matrix admitting the perfect phylogeny T. T corresponds to the tree shown
in Figure 1.
4 Extensions and applications
I will conclude this paper by pointing out some extensions and interesting applications
of the algorithm of Pe’er et al. (2000) solving the incomplete data problem. I will first
discuss if and how this algorithm can be generalized to the undirected situation and
under what circumstances it can be used for genotype data. Then, I will explain how
this algorithm can be helpful when using inserted repetitive genomic elements as a
source of evolutionary information. Finally, I will briefly discuss how incompatibility
is dealt with and what the connection to supertrees is.
4.1 Incomplete undirected perfect phylogeny
In this setting we are given partial haplotypes and we assume that the haplotype
at the root of the tree is unknown. This problem was proven to be NP-hard by Steel
(1992). However, when any complete haplotype (not necessarily the root) is available,
this problem can be solved in near-linear time by the algorithm of Pe’er et al. (2000)
presented in Section 3. To apply the algorithm, the available haplotype is chosen
as root and the character vector of the root is converted into an all-zero vector by
flipping the zeroes and ones in every column that is not zero at the root.
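The re-rooting step can be sketched in a few lines of Python. The function name `reroot` is hypothetical, missing states are encoded as `None` and left untouched, and row r is assumed to be a complete haplotype.

```python
def reroot(M, r):
    """Flip every column in which row r has state 1, so that row r becomes
    the all-zero vector. Missing entries (None) stay missing."""
    root = M[r]
    return [[x if root[j] == 0 or x is None else 1 - x
             for j, x in enumerate(row)]
            for row in M]

M = [[1, 0, 1],
     [0, None, 1],
     [1, 1, 0]]

for row in reroot(M, 0):
    print(row)
# [0, 0, 0]
# [1, None, 0]
# [0, 1, 1]
```

After this transformation the chosen haplotype plays the role of the all-zero root, and the directed algorithm of Section 3 can be applied to the flipped matrix.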
So for the general undirected case, the problem can be reduced to finding one
haplotype in the tree. Although this problem is NP-hard in general, it has been shown
by Halperin and Karp (2004) that in the case where enough explicit information is
given on the underlying tree, such a haplotype can be found. This constraint involves
the rich data hypothesis, which assumes that in each pair of columns there are exactly
three valid pairs. (In the general setting each pair of columns has at most three valid
pairs.) As proven by Halperin and Karp (2004), under the requirement that the
character-state matrix meets the rich data hypothesis, a haplotype can be found in
linear time and so by the algorithm of Pe’er et al. (2000), this problem can be solved
in near-linear time.
Satya and Mukherjee (2005) present the conditions under which a given character-state matrix admits a perfect phylogeny. Based on this characterization they present
an efficient enumerative algorithm. They essentially do an exhaustive search over all
possible arrangements that might result in a perfect phylogeny, but exploit in a clever
way the fact that the search space is reduced.
4.2 Genotype data
So far we have been discussing haplotype data only. In this section we want to explore
the question how the preceding concepts can be extended to genotype data. We will
first assume complete data and in a second stage deal with incomplete genotype data.
So in this setting we are given an n × m character-state matrix M ∈ {0, 1, 2}^(n×m).
Because genotyping is cheaper, but haplotype data is usually needed, e.g. in disease
association studies, a common approach is to determine the genotypes of individuals
experimentally, and then attempt to infer their haplotypes computationally. This
leads to the problem of finding a haplotype matrix M′ ∈ {0, 1}^(2n×m) which is consistent
with the genotype data given in M and has a perfect phylogeny. In order to get the
haplotype matrix M′ we need to duplicate each row of M and replace every ‘0’
entry by ‘0-0’, every ‘2’ entry by ‘1-1’ and every ‘1’ entry by either ‘0-1’ or ‘1-0’, where
the two values are placed in the two copies of the row. So
this problem is reduced to finding substitutes for the ‘1’ entries such that the resulting
matrix M′ has a perfect phylogeny.
One could try to build a bridge to the algorithm of Pe’er et al. (2000) by substituting each ‘2’ in M by ‘1’ and each ‘1’ by ‘∗’, as we must change each ‘1’ to either
‘0’ or ‘1’. This results in an incomplete binary matrix for which we can test for a perfect
phylogeny by the near-linear time algorithm of Pe’er et al. (2000). Note, however,
that this problem is not identical to the problem stated above, due to the required
row duplications and the associated constraints. Nevertheless, as shown by Eskin et al.
(2002), the haplotyping problem stated above can also be solved in polynomial time, namely in O(nm²).
Another problem arises when the data matrix consists of an incomplete n × m
genotype matrix M ∈ {0, 1, 2, ∗}^(n×m). Similarly to the incomplete directed perfect
phylogeny problem for haplotype data, we might be interested in the question whether
one can complete the missing states in a way admitting a perfect phylogeny. This
problem, also called the incomplete directed perfect phylogeny haplotyping problem,
has been proven to be NP-complete by Kimmel and Shamir (2004). Under certain
distributional assumptions, Halperin and Karp (2004) give a quadratic-time algorithm
for inferring a perfect phylogeny from genotype data with missing values with high
probability. Satya and Mukherjee (2005) present another algorithm, which makes no
further assumptions on the given matrix M . It is an enumerative algorithm, based on
the algorithm of Pe’er et al. (2000) and similar to their algorithm for the haplotype
data.
4.3 Application to short interspersed elements in phylogenetic analysis
Recently, repetitive elements, particularly SINEs (short interspersed elements) have
been used in phylogenetic analysis. SINEs are short DNA sequences that were copied
from the genome and then randomly reinserted. The distinct insertion events can
be identified by the flanking sequences as shown in Figure 4. It is assumed to be
highly unlikely that an exact complete SINE gets lost. However, it might get lost by
a deletion event of a large genomic region including the SINE. In this case, also the
flanking segments are lost. So we can observe, that a deletion happened, but we do
not know whether a SINE was inserted in this region prior to the deletion. This is
illustrated in Figure 4. The reconstruction problem can be modeled as follows: To
each site we associate the state ‘0’ if the locus is present but no SINE occurs at that
locus, ‘1’ if a SINE occurs in that locus and ‘∗’ if the locus is missing, as we do not
know if a SINE was present before the deletion event. So this leads exactly to the
problem described in Section 3.
Figure 4: A SINE was inserted to transform genome 1 into genome 2. Genome 3
resulted from a deletion in genome 2.
4.4 Supertrees and dealing with incompatibility
It is an unavoidable fact that most data matrices have no perfect phylogeny. So
we have the option of finding a maximum cardinality subset S′ ⊂ S that has a
perfect phylogeny, or of constructing a perfect phylogeny that is as consistent as possible
with the given data matrix. The same problem arises when dealing with supertrees,
where phylogenetic trees are combined on the overlapping set of taxa. Due to errors,
inconsistencies usually occur and make this task complicated. Chen et al. (2006)
consider the case where the input trees are rooted. They try to find the minimal
number of flips needed to solve the inconsistencies, where each flip moves a species
into or out of a clade. To do this, they use the characterizations given in Theorem
2.6. Further approaches to this complicated subject are discussed in a survey by
Fernández-Baca (1995).
References
Agarwala, R., Fernández-Baca, D. (1994). A polynomial time algorithm for the perfect
phylogeny problem when the number of character states is fixed. SIAM Journal on
Computing, 23, 1216–1224.
Fernández-Baca, D. (1995). The perfect phylogeny problem. In Steiner Trees in Industries (eds D. Z. Du, and X. Cheng). Kluwer Academic Publishers.
Chen, D., Eulenstein, O., Fernández-Baca, D. (2006). Minimum-Flip Supertrees:
Complexity and Algorithms. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 03, 1545–5963.
Eskin, E., Halperin, E., Karp, R. M. (2002). Efficient reconstruction of haplotype
structure via perfect phylogeny. Technical Report, UC Berkeley, Computer Science.
Gusfield, D. (1991). Efficient algorithms for inferring evolutionary trees. Networks,
21, 19–28.
Halperin, E., Karp, R. M. (2004). Perfect phylogeny and haplotype assignment. Proceedings of the Eighth Annual International RECOMB Conference, 10–19.
Kimmel, G., Shamir, R. (2004). The incomplete perfect phylogeny haplotype problem.
Proceedings of the Second RECOMB Satellite Workshop on Computational Methods
for SNPs and Haplotypes, 59–70.
Pe’er, I., Shamir, R., Sharan, R. (2000). Incomplete directed perfect phylogeny. In
Combinatorial Pattern Matching (eds R. Giancarlo, and D. Sankoff), pp. 143–153.
Springer-Verlag.
Satya, R. V., Mukherjee, A. (2005). The undirected incomplete perfect phylogeny
problem. Technical Report, University of Central Florida, School of Computer Science.
Semple, C., Steel, M. (2003). Phylogenetics. Oxford University Press.
Steel, M. (1992). The complexity of reconstructing trees from qualitative characters
and subtrees. Journal of Classification, 9, 91–116.