Decomposition Theory for ARGs and MinARGs
Dan Gusfield

Reconstructing the Evolution of Binary Bio-Sequences (SNPs)
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with recombination (ARG)
• Blobbed Trees
• Incompatibility Graph and its Connected Components
• ARG Decomposition Theorem and Proof Sketch
• NASC for a fully-decomposed MinARG

The Perfect Phylogeny Model for binary sequences
Only one mutation per site is allowed (infinite sites).
[Figure: a perfect phylogeny over sites 1 to 5, rooted at 00000, with site mutations labeling the edges and the extant sequences at the leaves. The tree derives the set M: 10100, 10000, 01011, 01010, 00010.]

When can a set of sequences be derived on a perfect phylogeny?
Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1.
This is the 4-Gamete Test.

A richer model
M: 10100, 10000, 01011, 01010, 00010, with 10101 added.
The pair of sites 4, 5 fails the four-gamete test. Sites 4 and 5 are "incompatible".
[Figure: the previous perfect phylogeny, rooted at 00000, which cannot derive the added sequence 10101.]
Real sequence histories often involve recombination.

Sequence Recombination
[Figure: single crossover recombination of P = 10100 and S = 01011 at recombination point 5, producing 10101.]
A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
Called "crossing over" in genetics.

Network with Recombination = ARG
M: 10100, 10000, 01011, 01010, 00010, 10101 (new)
The previous tree with one recombination event now derives all the sequences.
[Figure: the previous tree with one added recombination node at recombination point 5, with parents P and S, deriving 10101.]

A Phylogenetic Network
[Figure: a phylogenetic network over sites 1 to 5, rooted at 00000, with two recombination nodes (crossover points 3 and 4), deriving the sequences a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

Minimizing Recombinations
• Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is small in realistic sets of sequences. "Observable" depends on n and m relative to the number of recombinations.
• Problem: Given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations. NP-hard (Wang et al 2000, Semple et al 2004).

Decomposition can help
First we introduce the viewpoint needed.

Blobs in Networks
• In a phylogenetic network with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
• The cycle specified by those two paths is called a "recombination cycle".
• In a phylogenetic network, a maximal set of (edge) intersecting cycles is called a blob.

A maximal set of intersecting cycles forms a Blob
[Figure: the phylogenetic network from the previous slide, rooted at 00000, whose intersecting recombination cycles form a blob.]
If directions on the edges are removed, a blob is a bi-connected component of the network.

Blobbed-trees
• Contracting each blob to a single node results in a directed, rooted tree, otherwise one of the "blobs" was not maximal.
• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.

Every network is a tree of blobs.
How do the tree parts and the blobs relate? How can we exploit this relationship?
[Figure: a blobbed-tree, captioned "Ugly tangled network inside the blob."]

Incompatible Sites
Recall, a pair of sites (columns) of M that fails the 4-gamete test is said to be incompatible. A site that is not in such a pair is compatible.

    M   12345
    a   00010
    b   10010
    c   00100
    d   10100
    e   01100
    f   01101
    g   00101

Incompatibility Graph G(M)
[Figure: G(M) on nodes 1 to 5, with edges 1-3, 1-4 and 2-5.]
Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test. G(M) has two connected components.

The connected components of G(M) are very informative
• The number of non-trivial connected components is a lower bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson).
• When each blob is a single cycle (galled-tree case), all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)

Galled-Tree Structure
So when each blob contains only a single cycle, there is a one-one correspondence between the blobs and the non-trivial connected components of the incompatibility graph.

Motivating Question: To what extent does this clean one-one structure carry over to general phylogenetic networks? How do we exploit the general structure?

First Observation
First, in any network N for M, all sites from the same connected component of G(M) must appear together in a single blob in N.
This follows by transitivity from the simple observation that two incompatible sites must be in a common cycle in any ARG for M.
So we can't decompose more finely than the connected components (CCs).

The Decomposition Theorem
For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob.
This "fully-decomposed" network is the finest decomposition possible.
Moreover, the tree part of the fully-decomposed blobbed-tree T(M) is invariant over all the fully-decomposed networks for M, and can be determined in polynomial time.
So, we can find a network for M by solving the recombination minimization problem for each connected component of G(M) separately, and then connecting those subnetworks in an invariant way.

A Phylogenetic Network with one Blob
[Figure: the earlier phylogenetic network, rooted at 00000, deriving a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101, with all of its recombination cycles in a single blob.]

A fully-decomposed network for the sequences generated by the prior network.
[Figure: the incompatibility graph on sites 1 to 5, shown beside a fully-decomposed network deriving a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

Proof Ideas
Let C be a connected component of G(M).
Define M[C] as the sequences in M restricted to the sites in C.

    M   12345
    a   00010
    b   10010
    c   00100
    d   10100
    e   01100
    f   01101
    g   00101

[Figure: G(M) with components C1 = {1, 3, 4} and C2 = {2, 5}, corresponding to blobs B1 and B2.]

        M[C1]   M[C2]
        134     25
    a   001     00
    b   101     00
    c   010     00
    d   110     00
    e   010     10
    f   010     11
    g   010     01

Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.

        M[C1]   supercharacter   M[C2]   supercharacter
    a   001     1                00      5
    b   101     2                00      5
    c   010     3                00      5
    d   110     4                00      5
    e   010     3                10      6
    f   010     3                11      7
    g   010     3                01      8

    W   12345678
    a   10001000
    b   01001000
    c   00101000
    d   00011000
    e   00100100
    f   00100010
    g   00100001

Proof Ideas for the Decomposition Theorem
Lemma: No pair of supercharacters is incompatible.
So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.

Proof Ideas - Further:
For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T.
So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.

Algorithmically, T is easy to find. T can be constructed from M in O(nm^2) time.
T is the tree resulting from contracting each blob in the fully-decomposed blobbed-tree T(M) for M.

The supercharacters from M play the role in phylogenetic networks that normal binary characters play in perfect phylogeny trees. So supercharacters are the fundamental characters of phylogenetic networks.

However ...
While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes over all possible networks: for some M, there is no fully-decomposed MinARG.
That is, sometimes it pays to put sites from different connected components together on the same blob.
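The supercharacter construction from the proof ideas above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function names (incompatible, components, restrict, supercharacters) are hypothetical, the pairwise four-gamete test is done naively, and no attempt is made to match the O(nm^2) bound. On the seven-sequence example it should reproduce the components {1, 3, 4} and {2, 5} and the matrix W from the slides, and it checks the Lemma by running the four-gamete test on W.

```python
from itertools import combinations

# The 7-sequence example matrix M from the slides (rows a..g, sites 1..5).
M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}


def incompatible(rows, i, j):
    """Four-gamete test: True if columns i, j (0-based) show all of 00, 01, 10, 11."""
    return len({(r[i], r[j]) for r in rows}) == 4


def components(rows):
    """Connected components of the incompatibility graph, as sorted lists of 1-based sites."""
    m = len(rows[0])
    adj = {s: set() for s in range(m)}
    for i, j in combinations(range(m), 2):
        if incompatible(rows, i, j):
            adj[i].add(j)
            adj[j].add(i)
    seen, comps = set(), []
    for start in range(m):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:  # depth-first search from an unvisited site
            v = stack.pop()
            comp.append(v + 1)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(sorted(comp))
    return comps


def restrict(seq, comp):
    """The sequence restricted to the (1-based) sites of one component, i.e. a row of M[C]."""
    return "".join(seq[s - 1] for s in comp)


def supercharacters(M):
    """Return the list of supercharacters and the indicator matrix W (one row per sequence)."""
    rows = list(M.values())
    supers = []  # (component, restricted sequence) pairs, in order of first appearance
    for comp in components(rows):
        for r in rows:
            key = (tuple(comp), restrict(r, comp))
            if key not in supers:
                supers.append(key)
    W = {name: "".join("1" if restrict(seq, comp) == s else "0" for comp, s in supers)
         for name, seq in M.items()}
    return supers, W


supers, W = supercharacters(M)
print(components(list(M.values())))
for name, row in W.items():
    print(name, row)

# Lemma check: no pair of supercharacter columns of W fails the four-gamete test.
w_rows = list(W.values())
lemma_holds = not any(incompatible(w_rows, i, j)
                      for i, j in combinations(range(len(supers)), 2))
print(lemma_holds)
```

Supercharacters are numbered here in order of first appearance down the rows of M, component by component, which is an assumption chosen to match the numbering 1 to 8 on the slide.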
[Figure: a network over sites 1 to 6, rooted at 000000, with three recombination nodes. Sequences in M are in black. Sequence 100010 is not in M. Legible node labels: 100010, 001000, 000100, 010010, 100001, 100101; one further label appears in the source as 0011010.]
G(M) has two components. Each requires two recombinations, but this combined network needs only three.
G(L) has one component. The addition of sequence 100010 reduces the number of components in G(M) from 2 to 1.

[Figure: G(M) for the original data, on sites 1 to 6.]
Two components, so two blobs; each blob requires two recombinations, by the HK lower bound theorem, so a fully-decomposed network needs at least four recombinations.

[Figure: G(L), on sites 1 to 6, created from the original data and the addition of the new interior sequence 100010.]
G(L) has only one connected component, compared to two components for G(M).

Theorem: For any K, there is a dataset where the best fully-decomposed network uses K more recombinations than optimal.
In that construction, the ratio of the number of recombinations in the best fully-decomposed network to the optimal is constant as K grows.
Open Question: Construct examples that show that the ratio can be arbitrarily large.

Sufficient Conditions
But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks. The deepest result:

Main Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.

Restatement with converse
There is a fully-decomposed MinARG for M if and only if there is some MinARG for M whose nodes are labeled with the sequences L, such that the incompatibility graphs G(L) and G(M) have the same number of connected components.
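The condition in the Main Theorem is easy to test once the node labels L of a candidate network are known. The sketch below is an illustration under stated assumptions, not the author's code: the function names are hypothetical, it reuses the seven-sequence matrix M from the earlier slides, and both label sets (L_same and L_merged) are hypothetical examples chosen for this illustration, not node labels read off any network in the talk. It counts connected components of the incompatibility graph with a small union-find and compares the counts for M and L.

```python
from itertools import combinations

# The 7-sequence example matrix M from the earlier slides (rows a..g).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def incompatible(rows, i, j):
    """Four-gamete test on columns i and j (0-based)."""
    return len({(r[i], r[j]) for r in rows}) == 4


def component_count(rows):
    """Number of connected components (trivial ones included) of the incompatibility graph."""
    m = len(rows[0])
    parent = list(range(m))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(m), 2):
        if incompatible(rows, i, j):
            parent[find(i)] = find(j)
    return len({find(x) for x in range(m)})


def same_component_count(M_rows, L_rows):
    """The hypothesis of the Main Theorem: G(L) and G(M) have equally many components."""
    return component_count(M_rows) == component_count(L_rows)


# Hypothetical label set 1: M plus the all-zero root sequence.
L_same = M + ["00000"]
# Hypothetical label set 2: M plus an interior sequence that puts the pair 1,1
# on sites 1 and 2, making two sites from different components incompatible.
L_merged = M + ["11000"]

print(component_count(M))
print(same_component_count(M, L_same))
print(same_component_count(M, L_merged))
```

The second label set illustrates the situation described in the corollary: a sequence outside M whose addition creates an incompatibility between sites in different components of G(M), so the component counts of G(L) and G(M) no longer agree.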
Corollary
A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

A Practical Sufficient Condition
If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Proof Sketch of Main Theorem