Computation of Measurements of Phylogenetic trees

Fast Computation of the Exact Hybridization Number of Two Phylogenetic Trees
Yufeng Wu and Jiayin Wang
Department of Computer Science and Engineering
University of Connecticut
ISBRA 2010
Phylogenetic Tree and Hybridization Network
• Phylogenetic Tree:
rooted, binary trees
• Reticulate Evolution:
tree model no longer sufficient: e.g. hybrid speciation,
horizontal gene transfer, recombination
• Hybridization Network:
a directed acyclic graph that displays two phylogenetic
trees in a compact way
[Figure: input phylogenies T and T’ (root ρ, leaves 1 to 4) and a hybridization network; labels "delete two yellow edges" and "delete two red edges".]
Hybridization Number Problem:
compute the minimum number of hybridization events needed to
construct a hybridization network displaying two trees
Hybridization event: a node with in-degree two or more
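The definition of a hybridization event can be checked mechanically on a small network. The sketch below is only an illustration, assuming the network is given as a list of (parent, child) edges; the function name `hybridization_nodes` and the node labels (`rho`, `u`, `h1`, and so on) are hypothetical and are not taken from the slides.

```python
from collections import Counter

def hybridization_nodes(edges):
    """Return the nodes whose in-degree is two or more."""
    indegree = Counter(child for _parent, child in edges)
    return sorted(node for node, deg in indegree.items() if deg >= 2)

# Hypothetical network on leaves 1..4 (labels are illustrative only).
network = [
    ("rho", "u"), ("rho", "v"),
    ("u", "1"), ("u", "x"),
    ("x", "h1"), ("x", "h2"),
    ("v", "y"), ("v", "4"),
    ("y", "h1"), ("y", "h2"),
    ("h1", "2"), ("h2", "3"),
]

events = hybridization_nodes(network)
print(events, len(events))
```

In this toy network the two nodes with two parents are the hybridization events; every other node has in-degree at most one.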
A Related Problem: rSPR Distance Problem
[Figure: input phylogenies T and T’, each with root ρ and leaves 1 to 4.]
• rSPR distance problem:
the minimum number of rooted Subtree Prune and Regraft
operations to transform T into T’
[Figure: rSPR operations on leaves 1 to 4, with labels "Prune 2" and "Regraft 2"; panels "One rSPR operation" and "Two rSPR operations".]
rSPR distance of two phylogenies = the number of subtrees in
a Maximum Agreement Forest (MAF) - 1
(Hein, et al and Bordewich, et al)
Maximum Agreement Forest (MAF)
• Agreement Forest (AF) of T and T’:
a set of subtrees s.t.
– each subtree in the AF has the same topology in T and T’
– the subtrees partition the given taxa
– any two subtrees are vertex-disjoint
[Figure: input phylogenies T and T’ (root ρ, leaves 1 to 6), an Agreement Forest and a Maximum Agreement Forest; label "Number of subtrees is 3".]
• Maximum Agreement Forest is an agreement forest of two
trees where the number of subtrees is minimized
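The three agreement-forest conditions can be tested directly on small trees. The following is a minimal sketch under stated assumptions: trees are nested Python tuples with leaf labels at the tips, the root ρ is left out for brevity, and the two example trees and all function names (`leaf_set`, `restrict`, `spanned_nodes`, `is_agreement_forest`) are hypothetical rather than taken from the slides. It checks whether a given forest is an agreement forest; it does not search for a maximum one.

```python
from itertools import combinations

def leaf_set(tree):
    """Leaves below a node; a tree is a leaf label or a tuple of subtrees."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def restrict(tree, keep):
    """Canonical string for the topology of `tree` restricted to `keep`."""
    if not isinstance(tree, tuple):
        return str(tree) if tree in keep else None
    parts = sorted(p for p in (restrict(c, keep) for c in tree) if p is not None)
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else "(" + ",".join(parts) + ")"

def spanned_nodes(tree, keep):
    """Nodes (named by their leaf sets) of the minimal subtree joining `keep`."""
    nodes = []
    def walk(node):
        nodes.append(leaf_set(node))
        if isinstance(node, tuple):
            for child in node:
                walk(child)
    walk(tree)
    top = min((n for n in nodes if keep <= n), key=len)
    return {n for n in nodes if n <= top and n & keep}

def is_agreement_forest(t1, t2, components):
    comps = [frozenset(c) for c in components]
    taxa = leaf_set(t1)
    if leaf_set(t2) != taxa:
        return False
    # The components must partition the taxa.
    if sum(len(c) for c in comps) != len(taxa) or frozenset().union(*comps) != taxa:
        return False
    # Each component must have the same topology in both trees.
    if any(restrict(t1, c) != restrict(t2, c) for c in comps):
        return False
    # Components must be vertex-disjoint within each tree.
    for tree in (t1, t2):
        spans = [spanned_nodes(tree, c) for c in comps]
        if any(a & b for a, b in combinations(spans, 2)):
            return False
    return True

# Hypothetical input trees on leaves 1..6.
T = ((1, 2), ((3, 4), (5, 6)))
T_prime = ((1, (3, 4)), (2, (6, 5)))

print(is_agreement_forest(T, T_prime, [{1}, {2}, {3, 4, 5, 6}]))
print(is_agreement_forest(T, T_prime, [{1, 2}, {3, 4, 5, 6}]))
```

The second forest has matching topologies but fails the vertex-disjointness condition in the second tree, which is why that condition is checked separately from the other two.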
Maximum Acyclic Agreement Forest (MAAF)
• Maximum Acyclic Agreement Forest:
subtrees in MAF are acyclic
[Figure: input phylogenies T and T’ (leaves 1 to 5), with panels for the MAF and the Maximum Acyclic Agreement Forest.]
Ti in the AF is ancestral to Tj if the root of Ti is
ancestral to the root of Tj in either T or T’
[Figure: T and T’ (leaves 1 to 5) with AF subtrees T12 and T34 labelled.]
• Graph of Agreement Forest: GF(T,T’)
• nodes in graph G correspond to trees in the AF
• an edge from Ti to Tj if Ti is ancestral to Tj in the AF
• When graph of the AF is acyclic, the AF is said to be
acyclic
[Figure: cyclic graph of the AF.]
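Deciding whether the graph of an agreement forest is acyclic is a standard directed-graph test. The sketch below uses Kahn's topological-sort procedure; this is a swapped-in textbook technique, not necessarily what the authors implemented. The component names and the edge lists are hypothetical examples of "Ti is ancestral to Tj" relations.

```python
from collections import deque

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(v for v in nodes if indegree[v] == 0)
    removed = 0
    while ready:
        u = ready.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return removed == len(nodes)

# Hypothetical forest whose subtrees T12 and T34 are ancestral to each
# other (one direction from T, the other from T').
cyclic_nodes = ["T12", "T34", "T5"]
cyclic_edges = [("T12", "T34"), ("T34", "T12")]

# Hypothetical forest in which one subtree is ancestral to the others.
acyclic_nodes = ["T125", "T3", "T4"]
acyclic_edges = [("T125", "T3"), ("T125", "T4")]

print(is_acyclic(cyclic_nodes, cyclic_edges))
print(is_acyclic(acyclic_nodes, acyclic_edges))
```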
Hybridization Number and Size of MAAF
• Hybridization Number of two original trees = the number of
subtrees in a MAAF - 1
(Baroni, et al, 2005)
For example, the size of the
Maximum Acyclic Agreement
Forest is 3, so the
hybridization number is 3-1=2
[Figure: input phylogenies T and T’ (leaves 1 to 5), the Maximum Acyclic Agreement Forest, and the resulting Hybridization Network; labels "Keep two red edges" and "Keep two yellow edges". Nodes 3 and 4 are hybridization events.]
Computation of the Exact Hybridization Number
• Previous Work:
Bordewich, Semple, et al,
(2007), HybridNumber
• Our Approach:
Use Integer Linear Programming (ILP) to minimize the number
of subtrees
• Objective: $\min \sum_{i=1}^{m} C_i$,
where $C_i = 1$ if edge $e_i$ is cut
• Subject to 3 groups of constraints to ensure
the resulting AF is a MAAF
• Our Idea: Find a minimum
collection of edge-cuts to break
down the tree into MAAF
[Figure: input phylogenies T and T’ (root ρ, leaves 1 to 4) with edges e1 to e5 labelled.]
Triple incompatible
ILP constraint for triple 1,2,3 (at least one of these edges must be cut):
$C_1 + C_2 + C_3 + C_4 + C_5 \ge 1$
Triple Constraint
Pathway Constraint
Cyclic Constraint
More details for Triple Constraint and Pathway Constraint in Wu (2009)
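The test behind the Triple Constraint, namely whether three leaves have the same rooted topology in both trees, can be sketched as follows. This is an illustration only: trees are nested tuples, the two example trees are hypothetical, and the names `leaves`, `grouped_pair` and `triple_compatible` are not from the slides or from Wu (2009).

```python
def leaves(tree):
    """Set of leaf labels below a node."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def grouped_pair(tree, triple):
    """For a rooted triple xy|z, return {x, y}: the two leaves that join first."""
    wanted = set(triple)
    node = tree
    while isinstance(node, tuple):
        # Keep only the children that contain at least one leaf of the triple.
        split = [(child, wanted & leaves(child)) for child in node]
        split = [(child, hit) for child, hit in split if hit]
        if len(split) == 1:
            node = split[0][0]  # all three leaves lie below this child
            continue
        for _child, hit in split:
            if len(hit) == 2:
                return frozenset(hit)
        return None  # unresolved node: no pair is grouped
    return None

def triple_compatible(t1, t2, triple):
    """True if the triple has the same rooted topology in both trees."""
    return grouped_pair(t1, triple) == grouped_pair(t2, triple)

# Hypothetical input trees on leaves 1..4.
T = (((1, 2), 3), 4)
T_prime = (((1, 3), 2), 4)

print(triple_compatible(T, T_prime, (1, 2, 3)))
print(triple_compatible(T, T_prime, (1, 2, 4)))
```

Only the incompatible triples would contribute a Triple Constraint to the ILP; compatible triples need no constraint of this kind.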
Graph of AF and Leaf Pair (LP) Graph
• Difficulty:
Graph of AF depends on AF
• Leaf Pair (LP) Graph:
a node corresponds to a pair of two distinct leaves
• create an edge from lp(i,j) to lp(p,q) if:
  • the path between i and j is disjoint with that of p and q
    in both T and T’; and
  • lp(i,j) is ancestral to lp(p,q) in either T or T’
leaf pair lp(i,j) is ancestral to lp(p,q) if the
Most Recent Common Ancestor (MRCA) of (i,j) is ancestral to the MRCA of (p,q)
[Figure: input phylogenies T and T’ (leaves 1 to 5) with MRCA(1,2) and MRCA(3,4) marked, and part of the Leaf Pair (LP) Graph with nodes 1,2 and 3,4.]
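The two edge conditions of the LP graph can be written down compactly if each tree node is identified by the set of leaves below it. The sketch below makes that assumption; the example trees and the names `node_sets`, `mrca`, `path` and `lp_edge` are hypothetical. With this encoding, a node is a proper ancestor of another exactly when its leaf set is a proper superset.

```python
def node_sets(tree):
    """All nodes of a tree, each named by the frozenset of leaves below it."""
    found = set()
    def walk(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(walk(child) for child in node))
        else:
            below = frozenset([node])
        found.add(below)
        return below
    walk(tree)
    return found

def mrca(nodes, i, j):
    """Most recent common ancestor: the smallest node containing both leaves."""
    return min((n for n in nodes if i in n and j in n), key=len)

def path(nodes, i, j):
    """Nodes on the path between leaves i and j (end points included)."""
    top = mrca(nodes, i, j)
    return {n for n in nodes if n <= top and (i in n or j in n)}

def lp_edge(trees, pair_a, pair_b):
    """True if the LP graph has an edge from lp(pair_a) to lp(pair_b)."""
    disjoint = all(path(t, *pair_a).isdisjoint(path(t, *pair_b)) for t in trees)
    ancestral = any(mrca(t, *pair_a) > mrca(t, *pair_b) for t in trees)
    return disjoint and ancestral

# Hypothetical input trees on leaves 1..5.
T = ((1, ((3, 4), 2)), 5)
T_prime = ((3, ((1, 2), 4)), 5)
both = [node_sets(T), node_sets(T_prime)]

print(lp_edge(both, (1, 2), (3, 4)))
print(lp_edge(both, (3, 4), (1, 2)))
```

In this example both calls succeed, so lp(1,2) and lp(3,4) form a two-node cycle in the LP graph.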
Acyclicity of Leaf Pair Graph
[Figure: input phylogenies T and T’ (leaves 1 to 5), LP graph nodes 1,2 and 3,4, the Maximum Agreement Forest, and the Maximum Acyclic Agreement Forest.]
• Realized Leaf Pair:
a leaf pair whose two leaves are in the same subtree of the AF
• Reduced LP Graph:
an LP Graph for a given AF
• Lemma: For an AF, say F, GF(T,T’) is acyclic iff
LP Graph(F) is acyclic
• Add constraints naively:
enumerate all cycles –
impractical in most cases
An Easy Way for Acyclic Constraints
[Figure: input phylogenies T and T’ (leaves 1 to 7) and LP graph nodes 1,3; 3,7; 4,5; 1,2; 4,6.]
ILP Constraint:
$M_{1,3} + M_{4,5} \le 1$
• Deal with infeasible twin pairs: $M_{i,j} + M_{p,q} \le 1$,
where $M_{i,j} = 1$ if the path between i and j is not cut
• Enumerate all possible elementary cycles after reducing
infeasible twin pairs; for a cycle of c leaf pairs $i_1, \dots, i_c$:
$\sum_{k=1}^{c} M_{i_k} \le c - 1$
On biological data, this seems to be a great reduction
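How cycles in the LP graph turn into constraints can be illustrated on a toy graph. The sketch below enumerates elementary cycles with a plain depth-first search and prints one constraint per cycle: a cycle of c leaf pairs may keep at most c - 1 of them realized. The graph, the function names and the `M[i,j]` text format are hypothetical. For simplicity it treats an infeasible twin pair as a cycle of length two, rather than removing twin pairs first as the slide describes.

```python
def elementary_cycles(graph):
    """List each elementary cycle once, starting at its smallest node."""
    nodes = sorted(set(graph) | {v for vs in graph.values() for v in vs})
    order = {v: k for k, v in enumerate(nodes)}
    cycles = []

    def extend(start, path, on_path):
        for nxt in graph.get(path[-1], ()):
            if nxt == start:
                cycles.append(list(path))
            elif order[nxt] > order[start] and nxt not in on_path:
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for s in nodes:
        extend(s, [s], {s})
    return cycles

def cycle_constraint(cycle):
    """At most len(cycle) - 1 leaf pairs of a cycle may stay realized."""
    terms = " + ".join(f"M[{i},{j}]" for i, j in cycle)
    return f"{terms} <= {len(cycle) - 1}"

# Hypothetical LP graph: keys and values are leaf pairs (i, j).
lp_graph = {
    (1, 3): [(4, 5)],
    (4, 5): [(1, 3), (4, 6)],
    (4, 6): [(1, 2)],
    (1, 2): [(4, 5)],
}

for cycle in elementary_cycles(lp_graph):
    print(cycle_constraint(cycle))
```

The depth-first enumeration is exponential in the worst case, which is consistent with the slide's remark that enumerating all cycles naively is impractical in most cases.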
Speed up by Divide and Conquer Approach
[Figure: input phylogenies T and T’ (leaves 1 to 9) with the common clusters T1 and T’1 and the remaining trees T2 and T’2.]
• Subtree Reduction:
replace a pendant subtree that occurs identically in T and T’
with a new label
• Subtree reduction preserves the
Hybridization Number
• Cluster Reduction:
replace a cluster common to T and T’, say T1 and T’1, with
a new label; the remaining parts of the two trees are T2 and T’2
• h(T,T’)=h(T1,T’1)+h(T2,T’2)
See Bordewich, et al (2007) for details
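Finding where the two reductions apply amounts to comparing the clusters (leaf sets below a node) of the two trees. The sketch below is a minimal illustration with hypothetical trees and function names: a common cluster whose subtrees are identical is a candidate for subtree reduction, and a common cluster whose subtrees differ is a candidate for cluster reduction. It only lists candidates and does not perform the relabelling.

```python
def subtrees_by_cluster(tree):
    """Map each node's leaf set to a canonical string of the subtree below it."""
    table = {}
    def walk(node):
        if isinstance(node, tuple):
            parts = [walk(child) for child in node]
            below = frozenset().union(*(p[0] for p in parts))
            shape = "(" + ",".join(sorted(p[1] for p in parts)) + ")"
        else:
            below, shape = frozenset([node]), str(node)
        table[below] = shape
        return below, shape
    walk(tree)
    return table

def reduction_candidates(t1, t2):
    """Return (subtree-reduction clusters, cluster-reduction clusters)."""
    a, b = subtrees_by_cluster(t1), subtrees_by_cluster(t2)
    n_taxa = len(max(a, key=len))
    same_shape, different_shape = [], []
    for cluster in a.keys() & b.keys():
        if 1 < len(cluster) < n_taxa:  # skip single leaves and the full tree
            target = same_shape if a[cluster] == b[cluster] else different_shape
            target.append(sorted(cluster))
    return sorted(same_shape), sorted(different_shape)

# Hypothetical input trees on leaves 1..7.
T = (((((1, 2), 3), 4), (5, 6)), 7)
T_prime = ((((1, 2), 4), 3), (7, (5, 6)))

subtree_cands, cluster_cands = reduction_candidates(T, T_prime)
print(subtree_cands)
print(cluster_cands)
```

Each cluster-reduction candidate splits the problem into two smaller tree pairs whose hybridization numbers add up, which is the divide-and-conquer step on the slide.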
Results on Simulation Datasets
Simulation datasets are from Beiko and Hamilton (2006)
Each pair of phylogenies has 100 leaves and is generated by
applying 10 rSPR operations on one tree
HybridNumber is another
software tool to compute
exact Hybridization Number
[Figure: running time (s) of HybridNumber and SPRDist.]
This version of HybridNumber was downloaded in Oct. 2009
A later version of HybridNumber appears faster, but is still
very slow for EEEP data
Results on Biological Datasets
Tree pairs for a Grass (Poaceae) dataset from
the Grass Phylogeny Working Group (2001)
The results were obtained with CPLEX
The later version of HybridNumber gives roughly the same
running time as ours but is still not as scalable
Pair  #Taxa  #Hybridization  SPRDist (CPLEX)  HybridNumber
1     40     14              5s               3s
2     36     13              10s              3s
3     34     12              7s               6s
4     19     9               1s               1s
5     46     19              51s              667s
6     21     4               0s               1s
7     21     7               3s               1s
8     14     3               1s               1s
9     30     8               1s               1s
10    26     13              14s              16s
11    12     7               1s               1s
12    29     14              80s              4h2716s
13    10     1               0s               1s
14    31     15              115s             7h776s
15    15     8               1s               2s
Acknowledgment
Research is supported by National Science Foundation [IIS-0803440]
and the Research Foundation of University of Connecticut