Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees
Yufeng Wu
Dept. of Computer Science & Engineering
University of Connecticut, USA
ISMB 2010
Reticulate Networks
Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
[Figure: sequences for taxa 1-4 at Gene A (1: TCG, 2: TCA, 3: CGG, 4: CCG) and Gene B (1: AGC, 2: TGT, 3: AAC, 4: AGT), with the corresponding gene trees TA and TB.]
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Reticulate network: a directed acyclic graph displaying each of the gene trees
Reticulation event(s): nodes with in-degree two or more
[Figure: a reticulate network on taxa 1-4. Keeping the two red edges gives one gene tree; keeping the two black edges gives the other.]
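The definition above (reticulation events are nodes with in-degree two or more) can be sketched in a few lines of Python. This is only an illustration: the edge-list representation, the function name `reticulation_nodes` and the toy network are assumptions made here, not taken from the talk or from PIRN.

```python
from collections import defaultdict


def reticulation_nodes(edges):
    """Return the nodes with in-degree two or more in a directed graph.

    `edges` is a list of (parent, child) pairs; node names are arbitrary labels.
    """
    indegree = defaultdict(int)
    for _parent, child in edges:
        indegree[child] += 1
    return sorted(node for node, deg in indegree.items() if deg >= 2)


# Hypothetical toy network on taxa 1-4: node "h" has two parents ("a" and "b").
toy_edges = [
    ("root", "a"), ("root", "b"),
    ("a", "1"), ("a", "h"),
    ("b", "h"), ("b", "c"),
    ("h", "2"),
    ("c", "3"), ("c", "4"),
]

print(reticulation_nodes(toy_edges))
```

Dropping either ("a", "h") or ("b", "h") from the toy edge list leaves a tree, which is the sense in which a network "displays" a tree by keeping one incoming edge at each reticulation node.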
The Minimum Reticulation Problem
Given: a set G of K gene trees.
Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G), the minimum number of reticulation events.
NP-complete: even for K=2
Current approaches:
• exact methods for the K=2 case (see Semple, et al.)
• impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies
For simplicity: assume a reticulation node has exactly two incoming edges (our approach allows the more general case)
[Figure: three gene trees T1, T2 and T3 on taxa 1-4, and a network N displaying all three with 2 reticulation events. Minimum!]
Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees
Key idea: develop novel lower and upper bounds for Rmin(G), where G is the set of K gene trees.
RH(G) ≤ Rmin(G) ≤ SIT(G)
• RH(G): lower bound. Novel: first non-trivial bound
• Rmin(G): minimum. Challenging for K ≥ 3
• SIT(G): upper bound. Works for any K
The bounds provide a range for Rmin(G)
If RH(G) = SIT(G), then Rmin(G) = RH(G) = SIT(G)
Pairwise Distance
Pairwise distances for T1, T2 and T3
Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum number of reticulation events in any reticulate network for T1 and T2
[Figure: T1, T2 and T3 on taxa 1-4; each pair of trees has pairwise distance 1.]
Rmin(T1,T2,T3) ≥ max(1,1,1) = 1
Question: can Rmin(T1,T2,T3) = 1?
[Figure: an imaginary network with one reticulation node v.]
Choosing the same reticulate edge ⇒ the same gene tree
• One reticulation node with two incoming edges offers only two choices, so at most two distinct trees can be displayed
Rmin(T1,T2,T3) ≥ 2!
Display Vector
Tree T is displayed in a network
Each tree has a display vector
[Figure: a network on taxa 1-4 with two reticulation nodes v1 and v2, each with incoming edges labeled 0 and 1; trees T and T’ are displayed with VT: 0 1 and VT’: 1 0.]
VT: display vector of T, describing how T is displayed in the network
• one bit per reticulation node ⇒ length of display vector = number of reticulation nodes in the network
• value 0/1: at each reticulation node, which edge (the 0-edge or 1-edge) is kept for T?
Intuition: display vectors cannot be too similar
Lemma: D(VT1,VT2) ≥ d(T1,T2) for any network displaying T1 and T2.
D(VT1,VT2): Hamming distance of VT1 and VT2.
d(T1,T2): pairwise reticulation distance of T1 and T2
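As a small illustration of the lemma, the sketch below computes the Hamming distance of two display vectors and checks the constraint D(VT1,VT2) ≥ d(T1,T2) for every pair of trees. The function names, the dictionary layout and the example distance value are assumptions chosen for this sketch; only the vectors 0 1 and 1 0 come from the slide.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions where two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("display vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def satisfies_lemma(vectors, pairwise):
    """Check D(V_a, V_b) >= d(a, b) for every pair of trees.

    `vectors` maps a tree name to its display vector (tuple of 0/1 bits).
    `pairwise` maps frozenset({a, b}) to the pairwise reticulation distance.
    """
    return all(
        hamming(vectors[a], vectors[b]) >= pairwise[frozenset((a, b))]
        for a, b in combinations(vectors, 2)
    )


# Display vectors from the slide: VT = 0 1 and VT' = 1 0.
vectors = {"T": (0, 1), "T'": (1, 0)}
# Assumed pairwise distance for illustration only.
pairwise = {frozenset(("T", "T'")): 1}

print(hamming(vectors["T"], vectors["T'"]))
print(satisfies_lemma(vectors, pairwise))
```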
The RH Lower Bound
Key: if R reticulation events suffice, then there exist K display vectors of length R satisfying the distance constraints: Hamming distance D(VT,VT’) ≥ d(T,T’)
• Analogy: selecting K points on the R-dimensional binary hypercube s.t. the points cannot be too close
• If such K points do not exist, then we need at least R+1 reticulation events.
RH lower bound: the smallest R s.t. K points can be selected on the R-dimensional hypercube satisfying the distance constraints.
[Figure: T1, T2 and T3 with pairwise distances 3, 2 and 2.]
Question: can Rmin(T1,T2,T3) = 3?
Rmin(T1,T2,T3) ≥ 4!
No polynomial time algorithm is known for the general HPP problem.
We use integer linear programming to solve it.
Closed-form formula of the RH bound for K=3.
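For very small inputs, the hypercube analogy can be tried directly. The sketch below is a brute-force backtracking search, not the integer linear programming formulation used in the talk, and it does not use the closed-form formula for K=3. The function names and the distance-matrix layout are assumptions. It is exponential in R and only meant to reproduce the two three-tree examples from the slides.

```python
from itertools import product


def hamming(u, v):
    """Number of positions where two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def placement_exists(dist, r):
    """Can len(dist) points be placed on the r-dimensional binary hypercube
    so that points i and j are at Hamming distance >= dist[i][j]?"""
    k = len(dist)
    points = list(product((0, 1), repeat=r))
    chosen = []

    def extend(i):
        if i == k:
            return True
        # By symmetry of the hypercube, the first point can be fixed to all zeros.
        candidates = points[:1] if i == 0 else points
        for p in candidates:
            if all(hamming(p, chosen[j]) >= dist[j][i] for j in range(i)):
                chosen.append(p)
                if extend(i + 1):
                    return True
                chosen.pop()
        return False

    return extend(0)


def rh_lower_bound(dist):
    """Smallest r for which a valid placement exists (brute force)."""
    r = 0
    while not placement_exists(dist, r):
        r += 1
    return r


# Symmetric matrices of pairwise distances for the two examples in the slides.
dist_111 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
dist_322 = [[0, 3, 2], [3, 0, 2], [2, 2, 0]]

print(rh_lower_bound(dist_111))
print(rh_lower_bound(dist_322))
```

In the second example, two points at distance 3 on the 3-dimensional hypercube must be opposite corners, and no third corner is at distance 2 or more from both, which is why the search moves on to R = 4.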
Upper Bound
Problem: how to reconstruct a network for T1, T2, …, TK using small
(may not be minimum) number, U, of reticulation events?
• U: an upper bound
Key idea: sequentially insert gene trees Ti into a growing network N
• Each step inserts a tree into N.
• New reticulation events are needed to display Ti in N.
• Minimize the new reticulation events at each step.
[Figure: trees T1, T2 and T3 on taxa 1-4 are inserted one at a time, and the network N grows after each insertion.]
SIT Upper Bound: Stepwise Insertion of Trees
Insertion of tree into a network: given a reticulate network N and a
gene tree T, grow N by adding the minimum number of reticulation
events to make T displayed in N
• NP complete
• Practical computation using integer linear programming
SIT bound
• Try all orderings of T1, T2, …, TK
• For each ordering, insert each tree Ti in turn and count the reticulation events needed for each insertion. Obtain a network for each ordering.
• SIT bound = the smallest number of reticulation events among these networks.
• Heuristics when K is large or trees are large and different
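The outer loop of the SIT bound (try every ordering, insert trees one at a time, keep the best total) can be sketched as follows. The insertion step itself is NP-complete and is solved with integer linear programming in the talk; here it is replaced by a caller-supplied function. The names `sit_upper_bound`, `insert_tree` and `toy_insert` are hypothetical, and `toy_insert` is a placeholder cost that is NOT a real reticulation count.

```python
from itertools import permutations


def sit_upper_bound(trees, insert_tree, empty_network):
    """Try every ordering of `trees` and return (best_total, best_order, network).

    `insert_tree(network, tree)` must return (new_network, events_added).
    """
    best = None
    for order in permutations(trees):
        network, total = empty_network, 0
        for tree in order:
            network, added = insert_tree(network, tree)
            total += added
        if best is None or total < best[0]:
            best = (total, order, network)
    return best


def toy_insert(network, tree):
    """Placeholder insertion step (an assumption for illustration only).

    A "tree" is a set of opaque labels; the first tree is free and each
    later tree costs one unit per label not yet in the network.
    """
    if not network:
        return frozenset(tree), 0
    new = frozenset(tree) - network
    return network | new, len(new)


# Made-up inputs; the labels carry no phylogenetic meaning.
t1 = frozenset({"a", "b"})
t2 = frozenset({"a", "c"})
t3 = frozenset({"d"})

total, order, network = sit_upper_bound([t1, t2, t3], toy_insert, frozenset())
print(total)
```

Trying all K! orderings is only feasible for small K, which matches the slide's note that heuristics are used when K is large.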
Simulation
PIRN: a downloadable open-source software tool
• Implemented in C++, using GNU GLPK (and CPLEX) for integer linear programming
Generation of Simulation Data: a two-stage approach
• Simulate a reticulate network N backwards in time for n species
• Randomly select K trees embedded in N.
Evaluation Criteria:
• How often exact minimum is found when lower and upper
bounds match?
• The gap between the lower and upper bounds
• Average running time (see paper)
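The first two criteria reduce to simple arithmetic over per-dataset (lower bound, upper bound) pairs. The sketch below shows one way to compute them. The function name `summarize_bounds` is an assumption, and the input values are made up for illustration; they are not results from the paper.

```python
def summarize_bounds(results):
    """Return (percent of datasets with LB = UB, average gap UB - LB).

    `results` is a non-empty list of (lower_bound, upper_bound) pairs,
    one pair per simulated dataset.
    """
    if not results:
        raise ValueError("need at least one dataset")
    for lb, ub in results:
        if lb > ub:
            raise ValueError("lower bound exceeds upper bound")
    matched = sum(lb == ub for lb, ub in results)
    percent_matched = 100.0 * matched / len(results)
    average_gap = sum(ub - lb for lb, ub in results) / len(results)
    return percent_matched, average_gap


# Made-up bounds for four hypothetical datasets (not data from the talk).
example = [(2, 2), (3, 4), (5, 5), (1, 1)]
percent_matched, average_gap = summarize_bounds(example)
print(percent_matched, average_gap)
```

Whenever a dataset has matching bounds, the upper-bound network is optimal, so the first number is also the fraction of datasets where the exact minimum is known.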
Performance of PIRN: % of Datasets with Optimal Solution Found
[Figure: plot of % LB=UB against number of taxa.]
Horizontal axis: number of taxa
Vertical axis: % of datasets with lower bound = upper bound
K: number of gene trees
r: level of reticulation
Average over 100 datasets
Lower and upper bounds often match for many datasets
Performance of PIRN: Gap of Bounds
[Figure: plot of the gap between bounds against number of taxa.]
Horizontal axis: number of taxa
Vertical axis: gap between lower and upper bounds
K: number of gene trees
Gap between the lower and upper bounds is often small for many datasets
Reticulate Network for Five Poaceae Trees
[Figure: five Poaceae gene trees: ndhF, phyB, rbcL, rpoC2 and ITS.]
RH bound: 11
SIT bound: 13
Reticulate Network for Five Poaceae Trees
[Figure: reticulate network for the five Poaceae gene trees (figure label: ITS).]
SIT bound: 13 reticulation events used in the network
Acknowledgement
• More information available at:
http://www.engr.uconn.edu/~ywu
• Research supported by National
Science Foundation