Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees
Yufeng Wu
Dept. of Computer Science & Engineering, University of Connecticut, USA
ISMB 2010

Reticulate Networks
Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
[Figure: sequences for taxa 1-4 at Gene A (TCG, TCA, CGG, CCG) and Gene B (AGC, TGT, AAC, AGT), with the corresponding gene trees TA and TB]
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Reticulate network: a directed acyclic graph displaying each of the gene trees
Reticulation event(s): nodes with in-degree two or more
[Figure: a reticulate network on taxa 1-4, annotated "Keep two red edges" and "Keep two black edges"]

The Minimum Reticulation Problem
Given: a set of K gene trees G.
Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G), the minimum number of reticulation events.
NP complete: even for K=2
Current approaches:
• exact methods for the K=2 case (see Semple, et al.)
• impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies
For simplicity: a reticulation node has exactly two incoming edges (our approach allows the more general case)
[Figure: gene trees T1, T2, T3 on taxa 1-4 and a network N with 2 reticulation events, which is the minimum]

Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees
Key idea: developing novel lower and upper bounds for Rmin(G), where G is the set of K gene trees.
RH(G) ≤ Rmin(G) ≤ SIT(G)
• RH(G): lower bound. Novel: first non-trivial bound
• Rmin(G): minimum. Challenging for K ≥ 3
• SIT(G): upper bound. Works for any K
The bounds provide a range for Rmin(G).
If RH(G) = SIT(G), then Rmin(G) = RH(G) = SIT(G)

Pairwise Distance
Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum number of reticulation events in any reticulate network for T1 and T2
[Figure: pairwise distances for T1, T2 and T3 on taxa 1-4; each pair has distance 1]
Rmin(T1,T2,T3) ≥ max(1,1,1) = 1
Question: can Rmin(T1,T2,T3) = 1?
Choosing the same reticulate edge gives the same gene tree, so Rmin(T1,T2,T3) ≥ 2!
[Figure: imaginary network on taxa 1-4 with one reticulation node v]

Display Vector
Tree T is displayed in a network. Each tree has a display vector.
VT: display vector of T, how T is displayed in the network
• one bit per reticulation node: length of display vector = number of reticulation nodes in the network
• value 0/1: at each reticulation node, which edge (the 0-edge or 1-edge) is kept for T?
[Figure: trees T and T' displayed in a network with reticulation nodes v1 and v2, each with a 0-edge and a 1-edge; VT: 0 1, VT': 1 0]
Intuition: display vectors cannot be too similar
Lemma: D(VT1,VT2) ≥ d(T1,T2) for any network displaying T1 and T2.
• D(VT1,VT2): Hamming distance of VT1 and VT2
• d(T1,T2): pairwise reticulation distance of T1 and T2

The RH Lower Bound
Key: if R reticulation events are enough, then there exist K display vectors of length R satisfying the distance constraints: Hamming distance D(VT,VT') ≥ d(T,T')
• Analogy: selecting K points on the R-dimensional binary hypercube s.t. the points cannot be too close
• If such K points do not exist, then we need at least R+1 reticulation events.
RH lower bound: the smallest R s.t. K points can be selected on the R-dimensional hypercube satisfying the distance constraints.
[Figure: T1, T2 and T3 with pairwise distances 3, 2 and 2]
Question: can Rmin(T1,T2,T3) = 3?
Rmin(T1,T2,T3) ≥ 4!
No polynomial time algorithm is known for the general HPP problem. We use integer linear programming to solve it.
Closed-form formula of the RH bound for K=3.

Upper Bound
Problem: how to reconstruct a network for T1, T2, …, TK using a small (may not be minimum) number, U, of reticulation events?
• U: an upper bound
Key idea: sequentially insert gene trees Ti into a growing network N
• Each step inserts a tree into N.
• New reticulation events are needed to display Ti in N.
• Minimize the new reticulation events at each step.
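To make the RH lower bound from the earlier slide concrete, here is a minimal brute-force sketch in Python. It is not the integer linear programming formulation used in the talk: it simply enumerates points on the R-dimensional hypercube, so it is only practical for a handful of trees and small R. The function names (`hamming`, `place_points`, `rh_lower_bound`), the `max_r` cutoff, and the distance matrices in the usage note are illustrative assumptions, not part of PIRN.

```python
from typing import List, Optional, Sequence


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as integers."""
    return bin(a ^ b).count("1")


def place_points(d: Sequence[Sequence[int]], r: int) -> Optional[List[int]]:
    """Try to pick one r-bit display vector per tree with D(Vi, Vj) >= d[i][j].

    Returns the chosen vectors (as integers), or None if no placement exists.
    """
    k = len(d)
    if k == 0:
        return []
    # Flipping the same bits in every vector preserves all Hamming distances,
    # so the first vector can be fixed to all zeros without loss of generality.
    chosen: List[int] = [0]

    def extend(i: int) -> bool:
        if i == k:
            return True
        for v in range(1 << r):  # every point of the r-dimensional hypercube
            if all(hamming(v, chosen[j]) >= d[i][j] for j in range(i)):
                chosen.append(v)
                if extend(i + 1):
                    return True
                chosen.pop()  # backtrack and try the next point
        return False

    return list(chosen) if extend(1) else None


def rh_lower_bound(d: Sequence[Sequence[int]], max_r: int = 12) -> int:
    """Smallest R for which the distance constraints can be met (exhaustive)."""
    for r in range(max_r + 1):
        if place_points(d, r) is not None:
            return r
    raise ValueError("no placement found up to max_r; increase max_r")


if __name__ == "__main__":
    # Three trees, all pairwise distances 1 (first example in the slides).
    print(rh_lower_bound([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))  # prints 2
    # Three trees with pairwise distances 3, 2, 2 (second example).
    print(rh_lower_bound([[0, 3, 2], [3, 0, 2], [2, 2, 0]]))  # prints 4
```

The two calls reproduce the conclusions stated on the slides: with all pairwise distances equal to 1 the bound is 2, and with distances 3, 2, 2 the bound is 4. Which pair of trees carries the distance 3 in the second matrix is an assumption here; the bound does not depend on it.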
[Figure: gene trees T1, T2 and T3 on taxa 1-4 inserted one at a time into a growing network N]

SIT Upper Bound: Stepwise Insertion of Trees
Insertion of a tree into a network: given a reticulate network N and a gene tree T, grow N by adding the minimum number of reticulation events to make T displayed in N
• NP complete
• Practical computation using integer linear programming
SIT bound
• Try all orderings of T1, T2, …, TK
• For each ordering, insert each tree Ti and compute the number of reticulation events needed for inserting each Ti. Obtain a network for each ordering.
• SIT bound = the smallest number of reticulation events in these networks.
• Heuristics when K is large or the trees are large and different

Simulation
PIRN: a downloadable open-source software tool
• Implemented in C++ and GNU GLPK (and CPLEX)
Generation of simulation data: a two-stage approach
• Simulate a reticulate network N backwards in time for n species
• Randomly select K trees embedded in N.
Evaluation criteria:
• How often is the exact minimum found, i.e. how often do the lower and upper bounds match?
• The gap between the lower and upper bounds
• Average running time (see paper)

Performance of PIRN: % of Datasets Optimal Solution Found
[Figure: percentage of datasets with lower bound = upper bound. Horizontal axis: number of taxa. Vertical axis: % of datasets where the lower and upper bounds match. K: number of gene trees; r: level of reticulation. Average over 100 datasets.]
Lower and upper bounds often match for many data

Performance of PIRN: Gap of Bounds
[Figure: gap between the bounds. Horizontal axis: number of taxa. Vertical axis: gap between the lower and upper bounds. K: number of gene trees.]
Gap between the lower and upper bounds is often small for many data

Reticulate Network for Five Poaceae Trees
Gene trees: ndhF, phyB, rbcL, rpoC2, ITS
RH bound: 11
SIT bound: 13
[Figure: reticulate network for the five Poaceae trees; 13 reticulation events (the SIT bound) are used in the network]

Acknowledgement
• More information available at: http://www.engr.uconn.edu/~ywu
• Research supported by National Science Foundation