Molecular Phylogenetics and Evolution 107 (2017) 209–220

Convex recoloring as an evolutionary marker

Zeev Frenkel a, Yosef Kiat b, Ido Izhaki a, Sagi Snir a,⇑

a Department of Ecology and Evolutionary Biology, University of Haifa, Israel
b Israeli Bird Ringing Center, Society for the Protection of Nature in Israel, Israel

Article history: Received 21 May 2016; Revised 16 October 2016; Accepted 25 October 2016; Available online 3 November 2016

Keywords: Phylogenetics; Maximum parsimony; Character compatibility; Perfect phylogeny; Statistical significance; Supertree; Optimal convex recoloring cost

Abstract

With the availability of enormous quantities of genetic data it has become common to construct very accurate trees describing the evolutionary history of the species under study, as well as of every single gene of these species. These trees allow us to examine the evolutionary compliance of given markers (characters). A marker compliant with the history of the species investigated has undergone mutations along the species tree branches, such that every subtree of that tree exhibits a different state. Convex recoloring (CR) uses a combinatorial representation to measure the adequacy of a taxonomic classifier to a given tree. Despite its biological origins, research on CR has been almost exclusively dedicated to mathematical properties of the problem, or to variants of it with little, if any, relationship to taxonomy. In this work we return to the origins of CR. We put CR in a statistical framework and introduce and study the notion of the statistical significance of a character. We apply this measure to two data sets - Passerine birds and prokaryotes - and four examples. These examples demonstrate various applications of CR, from evolutionary relatedness, through lateral evolution, to supertree construction.
The above study was done with new software that we provide, containing an algorithmic improvement and a graphical output of an (optimally) recolored tree.

Availability: Code implementing the features and a README are available at http://research.haifa.ac.il/ ssagi/software/convexrecoloring.zip.

© 2016 Elsevier Inc. All rights reserved.

⇑ Corresponding author.
http://dx.doi.org/10.1016/j.ympev.2016.10.018
1055-7903/© 2016 Elsevier Inc. All rights reserved.

1. Introduction

The practice of constructing a tree depicting the evolutionary history of a set of organisms is nowadays common to almost every phylogenomic study - an area combining genomic data and techniques for the study of evolution (Eisen and Fraser, 2003; Delsuc et al., 2005). In particular, the deluge of molecular data accumulating constantly allows us to gauge the accuracy of the constructed trees. A character, genetic or morphological, classifies the species set into several character classes. If we consider each class as a different color, then every species is colored by the state of the character it possesses, and the given character induces a coloring over the tree leaves. We say that the coloring is convex on the given tree if every color class induces a clade or a subtree and these subtrees do not overlap (or equivalently, do not intersect) (Moran and Snir, 2008). Convexity is a desirable and natural property in classification. When a character is convex on a tree, it is denoted homoplasy free, meaning it displays no reversals or convergence (Zhang and Kumar, 1997). The well-founded and widespread phylogenetic approach of maximum parsimony (Fitch, 1971) seeks a tree with minimal changes on its edges, summed over all input characters. A minimum can be obtained when a perfect phylogeny exists, in which case each input character is homoplasy-free on that phylogeny (Fernandez-Baca, 2001).
Such a tree does not necessarily exist, and even finding it is computationally intractable (Bodlaender et al., 1992).

In the above setting, the characters are given and assumed to be reliable, and a plausible tree is sought. In other settings, the tree is also given, along with the characters, but one or more characters are not convex on that tree. In this case, we may question the reliability of that tree. Alternatively, in a setting where the tree provides enough confidence, the question shifts to the reliability of the input characters. Moreover, we may wonder whether the character under examination has evolutionary traces or is influenced by other factors such as environment or simply randomness. In both cases, questioning the tree while assuming character reliability or questioning the character's evolutionary meaningfulness, we look for the recoloring distance, which counts the minimum number of tree nodes we need to recolor in order to arrive at convexity. This value indicates the level of disagreement between the tree and the coloring.

The notion of the recoloring distance was coined in Moran and Snir (2008), where the problem, convex recoloring (CR), was defined and studied for several types of trees and input colorings. Despite its biological origin, and owing to its mathematical cleanliness, research has mainly addressed combinatorial/algorithmic aspects of the problem and its derivatives, which have little, if any, biological relevance. These include extensions to certain graph types rather than a tree, specific input colorings, constrained recoloring schemes, and the like (see e.g. Kanj and Kratsch, 2009; Kammer and Tholey, 2012; Campêlo et al., 2013 and references therein, but see also Matsen, 2015 for a classification oriented study). In this work we bring the high-level theory of CR back down to biological ground in several respects.
For a taxonomist, it would be desirable to determine, quantitatively and statistically, the relevance of a character (i.e. any classification) to the tree at hand. The recoloring distance is an absolute, context-less number. We therefore introduce the notion of a coloring significance, indicating how likely we are to see a coloring of this distance or less by chance on the given tree. In the Results section we demonstrate the use of the coloring significance measure by applying CR to several examples. First, in order to obtain an intuition regarding this measure, we show a simulation study. The results reveal that the recoloring distance is more structured than expected. Next, using two data sets, we demonstrate the various uses of CR as an evolutionary marker. The first data set is over eighty Passerine birds, and the second is over a hundred prokaryotes, with a few colorings (characters) for each data set. The results obtained concern not only questions of phylogeny/character reliability, but also the intensity of non tree-like activity in prokaryotes and the power of supertree methods.

Importantly, we provide software that implements the features we describe in this work. In this respect, in the Methods section we describe an algorithmic improvement to the algorithm presented in Moran and Snir (2008). The improvement is achieved by reducing the average number of colors checked at a node. We do not give an asymptotic analysis for this improvement but do provide a rigorous proof of its correctness. We are aware that since the appearance of the algorithm of Moran and Snir (2008), there have been further improvements (e.g. Bar-Yehuda et al., 2008) to that first algorithm, and there might be other algorithms with better complexity than the one presented here. However, a basic property of this algorithm, which to the best of our knowledge was not used before, is a local view that allows a dynamic calculation of the set of candidate colors of each tree node.
Accordingly, we believe that the algorithmic improvements provided here, accompanied by more fundamental theoretical improvements to CR, viewing it as a fixed parameter tractable problem (Bodlaender et al., 2011), will allow application of CR to data sets on the order of thousands of species and hundreds of colors.

2. Results

We now show four examples of the application of convex recoloring to synthetic and real data. The first one is a simple example based on random colorings of a binary tree, demonstrating the distribution of the optimal convex recoloring cost in one simple case. The other three are applications to real biological examples of colored trees, where the colorings represent a different classification each time. In each case we compute the optimal recoloring and its associated p-value, signifying how much the given coloring complies with the evolutionary history of the given species set (which is also given as input, and is represented by the tree topology).

2.1. Example 1: Statistical distribution of the recoloring distance

Our first example shows how the recoloring distance is distributed for a given tree size and number of colors. We constructed a set of random binary trees with 50 leaves. Next, we randomly and uniformly colored the tree leaves with 4 colors (no uncolored leaves; all internal nodes are uncolored). This is simply done by choosing, for every leaf, each color with probability $1/4$. Therefore, the trees obtained differ in topology and also in the proportions between color sets. For each of these trees a convex recoloring was calculated. The distribution of the cost of recoloring is presented in Fig. 1(a). We note that a naive upper bound on the expected value of this statistic is $3n/4$, where $n$ is the number of leaves. This is achieved by recoloring all the leaves with the most common color. As this color must cover at least $n/4$ leaves, the bound is trivially obtained.
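The naive bound can be illustrated with a short simulation. The following is a minimal Python sketch, not part of the software described in this work: the function names `naive_recoloring_bound` and `mean_naive_bound` are hypothetical, and the sketch ignores the tree topology because the naive bound depends only on the leaf color counts. It draws each leaf color uniformly from 4 colors and records how many leaves would have to change if all leaves took the most common color.

```python
import random
from collections import Counter


def naive_recoloring_bound(leaf_colors):
    """Number of leaves that change when every leaf takes the most common color.

    A single-colored tree is trivially convex, so this value is an upper
    bound on the optimal recoloring cost (it is not the optimum itself).
    """
    counts = Counter(leaf_colors)
    return len(leaf_colors) - max(counts.values())


def mean_naive_bound(n_leaves=50, n_colors=4, n_trials=2000, seed=0):
    """Average naive bound over random uniform leaf colorings (assumed setup)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        # Each leaf independently receives one of n_colors with equal probability.
        colors = [rng.randrange(n_colors) for _ in range(n_leaves)]
        total += naive_recoloring_bound(colors)
    return total / n_trials


if __name__ == "__main__":
    n = 50
    print("mean naive bound:", mean_naive_bound(n_leaves=n))
    print("3n/4            :", 0.75 * n)
```

Since the most common of 4 colors always covers at least $n/4$ leaves, every sampled value, and hence the mean, stays at or below $3n/4$.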
However, as we see in the figure, a much smaller value (from $n/2$ to $3n/5$) is usually obtained, signifying the existence of a more profound structure in this question than that naive bound. Notwithstanding, a more precise bound is not trivial to obtain and is beyond the scope of this work. The distribution of color frequencies on the resulting convex trees is presented in Fig. 1(b). The results are divided into three cases (three bar charts in the figure), representing cases in which the most common color had (i) below 25 members (Blue bars), (ii) between 25 and 28 members (Brown bars), and (iii) above 28 members (Green bars). As shown, this difference in the prevalence of the most common color has a minimal effect on the distribution of the final colors, where the most common color covers around 70% of the leaves. We note that as there are many (possibly even exponentially many) optimal recolorings, this distribution might be biased according to the strategy employed by the algorithm. One may observe that in a tree, every color is preserved by at least a single leaf, as this does not violate convexity of the tree. This observation explains the three short bars on the right of Fig. 1(b).

2.2. Example 2: Birds moult strategies

In this example the compatibility of the adult/juvenile moult strategy of birds with their evolutionary history was examined. We took a tree over 80 bird taxa representing 29 of the 46 Passerine families (Treplin et al., 2008). The leaves of this phylogeny were classified by their main moult strategies in adult/juvenile life stages as described in Jenni and Winkler (1994), Cramp et al. (1993), and Ginn and Melville (1983). Such characterization was made only for 43 of these genera and species and was expressed by one, two or even three of three observed moult strategy types: "Summer complete/summer partial", "Summer complete/summer complete", and "Winter complete/winter complete".
Such characterization induces the following coloring of the phylogenetic tree's leaves: leaves corresponding to non-characterized species and to species characterized by more than one strategy type are uncolored; leaves corresponding to species characterized by only one moult strategy type are colored Blue, Red and Green (26, 7 and 4 leaves respectively). Using our program we found that this coloring is not convex: $P_{opt} = 8$, p-value = 0.26 (see Fig. 2). Excluding the Green color results in a non-convex coloring with $P_{opt} = 5$, p-value = 0.46. Unifying colors Red and Blue (in the initial coloring) results in $P_{opt} = 3$, p-value = 1.0. The latter means the following. After the unification of Red and Blue, we are left with two colors - Red/Blue and Green - where Green comprises only 4 members, which are dispersed. A cost of 3 means that in order to arrive at convexity we must uncolor all but one of the Green leaves. As shown in the previous section (Section 2.1), any tree recoloring retains at least one leaf of any color class intact. This implies that 3 is not only the minimum cost possible, but also the maximum cost for the given configuration of 4 Green leaves. Moreover, since any other input coloring with 4 Green leaves, random or not, cannot achieve a cost greater than 3, all colorings attain this cost or a smaller one, explaining the p-value of 1 for that result. The biological interpretation of the results above is that the adult/juvenile moult strategies of birds are not evolutionarily compliant. This can be explained by the hypothesis that similar adult/juvenile moult strategies were formed independently for different bird species and/or changed in different directions during the process of evolution (e.g., caused by changes of climatic niches).

Fig. 1. Results based on simulation data. (a) A distribution of the minimal cost of convex recoloring for random binary trees with 50 leaves: leaves are colored randomly (4 colors with the same probabilities, no uncolored leaves), internal nodes are uncolored. (b) A distribution of leaf color frequencies in the minimal convex recoloring for random binary trees with 50 leaves. The distribution is about the same for situations where the most frequent color was present in $\le 24$, from 25 to 28, and $\ge 29$ leaves in the input random coloring. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

2.3. Example 3: Birds migration strategies

Bird genera and species from Example 2 above were also classified by subdivision into three overlapping classes based on main migration strategy: "residents", "short-" and "long-distance migrants" (Hall and Tullberg, 2004; Cramp et al., 1993). In total, 41 out of 80 genera were classified. Such a classification induces the following coloring on the tree leaves: 15 "pure" residents (Red), 8 short-distance migrants (Blue), 6 long-distance migrants (Green) and 12 having various strategies. We found that such a coloring is also non convex: removing genera with various strategies yields $P_{opt} = 9$, p-value = 0.2 (see Fig. 3). Combining Blue (short-distance migrants) and Green (long-distance migrants) into "migrants" gives a bi-colored tree that is non convex with $P_{opt} = 11$, p-value = 0.18. Removal of the Blue color (nodes) results in $P_{opt} = 4$, p-value = 0.51. Finally, combining Blue with Red results in $P_{opt} = 6$, p-value = 1.0. The above means that migration distance is, similarly to moult strategy, also not evolutionarily compliant (presumably like many ecological/geographical/behavioral characters).
This finding can be explained by the hypothesis that the ability and preference to migrate over long distances changed in both directions during the process of evolution and was caused by multiple internal and environmental factors.

2.4. Example 4: Evolutionary classes among prokaryotes

In this part, we study convexity among prokaryotes. Our species set is composed of 41 archaeal and 59 bacterial genomes, representing the forest of life, that were studied in Puigbó et al. (2009). The characters used for colorings represent three different classifications: (i) domain based (2 colors, archaeal/bacterial), (ii) phylum based (24 colors), and (iii) order based (57 colors). The underlying approach here is different from the examples above, as these characters are considered accurate and largely representative of the main trend of evolution of the given species set. Under this setting, the given tree is under scrutiny. Here, trees represent gene specific histories, dubbed gene trees. These histories are substantially different, as many genes are subjected to the phenomenon of horizontal gene transfer (HGT), the passage of genetic material between organisms by means other than lineal descent (Doolittle, 1999; Ochman et al., 2000). Evolution in light of HGT tangles the traditional universal Tree of Life, turning it into a network of relationships (Gogarten et al., 2002; Zhaxybayeva et al., 2004; Gogarten and Townsend, 2005; Bapteste et al., 2005).

To put the above discussion in the context of color convexity, we did the following. First we considered a tree representing the evolution of the Isoleucyl-tRNA synthetase (IleS, COG0060) gene, henceforth the IleS-tree, that is present in all 100 considered prokaryotes. The IleS-tree and the corresponding colorings (domain-, phylum-, and order-based) are depicted in Fig. 4. Leaf coloration follows the order classification. Our results show that none of the colorings is convex on the IleS-tree.
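A convexity test of this kind follows directly from the definition: compute, for each color, the minimal subtree spanning its nodes (its carrier) and verify that no two carriers share a node. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the tree is given as an undirected adjacency dictionary, the coloring maps only colored nodes to colors, and the names `carrier` and `is_convex` are hypothetical.

```python
from itertools import combinations


def carrier(adj, nodes):
    """Minimal subtree of the tree `adj` that contains all of `nodes`.

    Repeatedly prunes leaves that are not in `nodes`; what is left is the
    union of the paths connecting the given nodes.
    """
    keep = set(nodes)
    if not keep:
        return set()
    remaining = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] <= 1 and v not in keep]
    while stack:
        v = stack.pop()
        if v not in remaining:
            continue  # already pruned
        remaining.discard(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
                if degree[u] <= 1 and u not in keep:
                    stack.append(u)
    return remaining


def is_convex(adj, coloring):
    """True if the carriers of every two distinct colors are disjoint."""
    by_color = {}
    for node, color in coloring.items():
        by_color.setdefault(color, set()).add(node)
    carriers = {c: carrier(adj, ns) for c, ns in by_color.items()}
    return all(carriers[a].isdisjoint(carriers[b])
               for a, b in combinations(carriers, 2))


# Toy quartet tree with split ab|cd (internal nodes x and y are uncolored).
quartet = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}
```

On the toy quartet, coloring the two leaves on each side of the split with the same color is convex, while alternating colors across the split makes both carriers pass through the internal edge, so the coloring is not convex.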
In order to delve deeper into the meaning of this result, we analyzed each category (coloring) separately. Starting with the domain level, the tree from Fig. 4 can be perceived as an unrooted quartet tree (Avni et al., 2015) over four large clades (subtrees), each pertaining almost exclusively either to bacteria or to archaea. If we ignore the outliers and color these clades as depicted in the figure (archaea - Red, bacteria - Green), we see a quartet colored Red, Green | Red, Green. Obviously, this coloring is not convex, suggesting a very early HGT of the IleS gene between archaea and bacteria. At the phylum level, several violations of convexity can be seen, evidenced by the presence of members of a single phylum in two of the four domain clades indicated above. This is by definition a violation of convexity, as we require all members of a phylum to be present in a single domain clade. Specifically, in Fig. 4 we point at three members of the Proteobacteria-Alpha phylum (index 22, green arrows), present in the two bacteria clades. Finally, there are many violations of convexity at the level of orders. One such violation, which we also mark in the figure, involves the orders with indices 20 (Desulfurococcales, indicated by blue arrows) and 55 (Thermoproteales, indicated by red arrows). Several quartets can be found over these orders, composed of a pair from the index 20 order and another pair from the index 55 order, that exhibit a 20, 55 | 20, 55 arrangement. It can be shown that such an arrangement requires a recoloring of at least one leaf (see Fig. 4).

Despite the deep discordance between individual gene histories, the belief in an underlying, vertical trend of evolution, even among prokaryotes, yields the major challenge of finding this tree. Normally, this underlying phylogeny is inferred by constructing gene trees for genes thought to be immune to HGT, typically ribosomal RNA genes. Nevertheless, even such genes are subjected to HGT, obfuscating the central trend of evolutionary relationships (Berkum et al., 2003; Dewhirst et al., 2005; Schouls et al., 2003; Yap et al., 1999). Therefore, it was suggested to construct the underlying species tree by a two stage approach as follows. First, gene trees, such as the IleS-tree above, are constructed separately for a multitude of genes. These trees do not necessarily span the entire taxa set but rather overlap on subsets of it. Subsequently these trees are amalgamated to produce a big tree over the complete taxa set. This approach is denoted supertree construction and the resulting tree is denoted a supertree (Bininda-Emonds et al., 2002; Creevey and McInerney, 2005).

Fig. 2. Adult/juvenile moult strategy of birds. Leaf coloring is as follows: "Summer complete/summer complete" - Blue, "Winter complete/winter complete" - Red, "Summer and winter complete/winter complete" - Green; non-characterized species and species characterized by more than one strategy type - uncolored (black). The optimal (convex) recoloring is schematically shown by lines of corresponding colors. Note that only one green and two red colored leaves remained.

Fig. 3. Migration strategy of birds. Leaf coloring is as follows: "pure" residents - Red; short-distance migrants - Blue; long-distance migrants - Green; non-characterized species and species characterized by more than one strategy type - uncolored (black). The initial coloring is not convex. The optimal (convex) recoloring is schematically shown by lines of corresponding colors. Indeed, the significance of the initial coloring (p-value, see definition) is low (see text).

Fig. 4. A tree over 100 prokaryotes based on the gene Isoleucyl-tRNA synthetase (IleS, COG0060) from Puigbó et al. (2009). For convenience, organism names are appended by three numbers separated by underscores (representing domain, phylum, and order indices respectively; order indices 1 and 57 correspond to organisms with questionable order). Leaf coloration follows the coloring defined by order. It can be seen that the three colorings - domain, phylum, and order - are not convex on the IleS tree. On the domain level, one can see two pairs of large clades (subtrees), a pair for each domain, corresponding to domains 1 (Archaea, red lines) and 2 (Bacteria, green lines), intertwined along the tree, yielding non convexity of the domain coloring. At the phylum level, phylum 22 (Proteobacteria-Alpha, pointed at by green arrows in the figure) was found in both bacteria clades, hence yielding non convexity also at the level of phylum. We remark that one can find a few additional such examples of badly classified phyla according to this gene tree. The archaea domain was also found non convex by the order coloring: the carriers of the colors corresponding to orders 20 (Desulfurococcales, pointed at by blue arrows) and 55 (Thermoproteales, pointed at by red arrows) overlap.

In light of the above, the question we pursue here is how much the supertree "corrects" the non convexity of individual gene trees. In Puigbó et al. (2010), a set of 6901 orthologous gene families (COGs; Tatusov et al., 2001) was selected, and for each such family its gene tree was reconstructed.
From this set, a subset of around a hundred fairly conserved, ubiquitous genes, denoted nearly universal trees (or NUTs), was taken. A tree spanning the entire taxa set was constructed by a supertree method, based on the NUTs trees. We denote it the NUTs-tree. We wanted to measure the convexity of the NUTs-tree with respect to each of our three colorings. Applying our program, we found that all are convex on this tree. For illustration, the tree, leaf-colored according to phylum, is shown in Fig. 5.

Fig. 5. The NUTs-tree constructed from a subset of around a hundred gene trees by the supertree approach (Puigbó et al., 2009). The tree is leaf-colored according to the phylum classification. As can be seen, this coloring is convex on the tree.

To summarize this part, we start with the IleS-tree. We note that the fact that all three colorings were found highly insignificant (high p-values) suggests intensive HGT activity. Nevertheless, in the case of HGT, one caveat should be raised. HGT operates on the scale of subtrees while recoloring counts single nodes, and therefore a recoloring distance cannot, at face value, be indicative of the intensity of HGT. Our domain-level coloring illustrates that. Recall that we had a tree over four large clades, two colored Green and two Red. In order to turn this coloring into a convex one on that tree, a whole clade needs to be recolored. In contrast, one Subtree Pruning and Regrafting (SPR) operation, which cuts an entire clade from its current location and joins it at another, would have fixed this situation, yielding a convex tree. However, this SPR move would have modified the tree topology - an operation that stands in contradiction to the CR philosophy, which keeps the tree topology intact. The intensity of HGT is normally measured by the SPR-distance to the species tree (Hein, 1990), but finding such a distance is computationally intractable (NP-hard, though exponential only in the number of SPR events) (Bordewich and Semple, 2005). Finding the recoloring distance may provide some intuition and is exponential only in the number of colors.

The second example with prokaryotic data dealt with the power of the supertree approach and how this relates to CR. We have shown that the supertree approach can "correct" all coloring violations exhibited by the IleS-tree. We note that convexity with respect to these classifications is not the only criterion. Therefore, we can frequently find trees that are not convex with respect to this classification yet provide other, insightful relationships.

3. Conclusions

In this work we studied convex recoloring (CR) and focused on relevant biological aspects of it.
Since its introduction in 2005 (Moran and Snir, 2005; Moran and Snir, 2005), CR was almost entirely studied in the context of theoretical computer science, while its biological relevance was neglected. We believe that this is the prime importance of the work presented here. Specifically, we used CR as a marker for character compliance with organismal evolution, by fitting it to the tree nodes and measuring compatibility. We augmented the parameterless value of the recoloring distance with a statistical framework that provides the (statistical) significance of the given input coloring in terms of a p-value, allowing determination of the evolutionary relatedness of the character under study.

On a more technical level, we provided algorithmic improvements to the basic algorithm for CR introduced in Moran and Snir (2008). The improvement is achieved by reducing the set of possible recolorings and considering a more local, instead of a global, view of the problem. In general, when the input coloring is near random and has a big recoloring distance, this improvement appears to be of little benefit over the asymptotic bound. Nevertheless, this improvement is more pronounced in the case of a coloring close to convexity. It appears that our heuristic bears some similarity to the principles implemented in Bar-Yehuda et al. (2008). While we do not have a theoretical asymptotic analysis for this improvement, it was experimentally demonstrated in our simulations and real data examples. Importantly, we also provide a software implementation of the algorithm, containing the features discussed above and providing an output that can be used conveniently in tree viewing software, as demonstrated in our examples. To the best of our knowledge, no such software exists.

In the experimental realm, we applied our software to four examples, two from Ornithology and two from Microbiology. The examples from Ornithology addressed the topic of character compliance with species evolution.
Our results show that both characters, migration strategy and moult strategy, are insignificant on the tree, implying that they did not evolve along with the species. The examples from Microbiology focused on horizontal gene transfer (HGT) and the strength of the tree signal in light of it. Here, as opposed to the previous examples, we treated the characters as reliable and questioned the tree. In our first example, we showed that classification based on the individual gene tree is evolutionarily unrelated. The second example shows that the supertree approach makes it possible to construct an evolutionarily consistent tree from the HGT-affected trees obtained for individual genes. These examples demonstrate various applications of the convexity criterion. Moreover, coming from very distant fields in Biology, they attest to the generality and applicability of the concept and its implementation.

4. Materials and methods

4.1. Convex recoloring of trees - basic definitions

The theory of convex recoloring (CR) relies on some non-trivial mathematical concepts that were defined and introduced in Moran and Snir (2008). We provide here a brief, minimal introduction to CR. A coloring is some property associated with a set. A coloring $C$ of a tree $T$ assigns colors to the nodes of the tree. A coloring is called partial if not all nodes are colored; otherwise the coloring is total. The carrier of a set of nodes is the minimal subtree containing all nodes in the set. The carrier of a color $d$ is the carrier of the nodes colored by $d$, formally $\mathrm{carrier}(C^{-1}(d))$. $C$ is said to be convex on $T$ if for any pair of colors $d_1 \neq d_2$, $\mathrm{carrier}(C^{-1}(d_1))$ and $\mathrm{carrier}(C^{-1}(d_2))$ do not intersect (see Fig. 6). A color $d_1$ is considered a bad color if there exists another color $d_2$ such that $\mathrm{carrier}(C^{-1}(d_1)) \cap \mathrm{carrier}(C^{-1}(d_2)) \neq \emptyset$. Note that our definition of a bad color is different from the one in Moran and Snir (2008), where a bad color was defined only for a total coloring, and a color $d_1$ with $\mathrm{carrier}(C^{-1}(d_1))$ containing no nodes with other colors was considered a good color even in the case of $\mathrm{carrier}(C^{-1}(d_1)) \subset \mathrm{carrier}(C^{-1}(d_2))$.

A recoloring scheme may have several cost functions (see Moran and Snir, 2008). We consider here the uniform cost model, under which coloring an uncolored vertex is free, recoloring any colored node $v$ to any other color costs 1, and uncoloring a colored node is prohibited. Hence, given an input (partial) coloring $C$, the cost of a recoloring $C'$ with respect to $C$, denoted $\mathrm{cost}_C(C')$, is the number of recolored vertices that were previously colored by $C$ (we note, however, that the software we provide implements the weighted cost model, in which the cost of a recoloring is the sum of the weights of the recolored vertices). A convex recoloring $C'$ is optimal for an input coloring $C$ if it has the minimal possible cost with respect to $C$. We denote this cost by $P_{opt}(C)$. Henceforth, we refer only to convex recolorings, i.e., unless stated otherwise, a recoloring is assumed to be convex.

Fig. 6. An example of a convex coloring on a tree (white nodes are considered as uncolored). Borders of color carriers are shown by violet lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.2. Improved algorithm - candidate colors

To increase the efficiency of the search for an optimal recoloring, we restrict the set of candidate recolorings. We apply a more local approach than the one pursued in Moran and Snir (2008). While in Moran and Snir (2008) a color was defined as bad globally, i.e., across the whole tree, and every bad color was examined at every node, here we restrict ourselves, at every individual node, only to colors that are relevant to this node. We therefore define the following. For a node $v$, a color $d_0$ is a candidate color if $C(v) = d_0$, or $v \in \mathrm{carrier}(C^{-1}(d_0))$, or there exist colors $d_1, \ldots, d_n$ such that $v \in \mathrm{carrier}(C^{-1}(d_n))$ and $\mathrm{carrier}(C^{-1}(d_i)) \cap \mathrm{carrier}(C^{-1}(d_{i-1})) \neq \emptyset$ for all $i = 1, \ldots, n$. Informally, either $d_0$ is $v$'s original color, or $v$ sits inside $d_0$'s carrier, or there is a chain of carrier intersections from $d_0$ to $v$ (see Fig. 7).

Fig. 7. A non-convex coloring of a tree. White nodes are considered uncolored; the carriers of colors red and black intersect at nodes A and B; the carriers of colors blue and green intersect at nodes D and E; the carriers of colors red and blue intersect at node C. This means that all four colors red, black, green and blue are candidates for all of the tree nodes. Indeed, recoloring nodes F and G both by black or both by green results in a convex coloring (although the black and green carriers are initially disjoint). It is easy to see that any convex recoloring of this tree changes the colors of at least two colored nodes, i.e., has cost at least 2.

The set of candidate colors for a concrete node colored by a bad color can be significantly smaller than the entire set of bad colors (as there can be a situation where a certain bad color is a candidate only for a small subset of the nodes colored by bad colors; see the example in Fig. 8a). An additional reduction of the set of candidate colors for a node can be achieved by dynamically recalculating the set of candidate colors for subtrees, while taking into account previous decisions made for other nodes that affect the node in question (such decisions can break chains of overlapping carriers of candidate colors, splitting a cluster of candidate colors into smaller parts or even singletons; see Fig. 8b for an example). In the Appendix we provide rigorous arguments for why the restriction to candidate colors still guarantees an optimal convex recoloring. Our algorithm follows the lines of Lemma 4.8 of Moran and Snir (2008); however, instead of considering the fixed set of bad colors, we use the smaller sets of candidate colors, which are calculated dynamically during the run of the algorithm.

Fig. 8. (a) Candidate colors vs. bad colors. Nodes A, B and C are at the intersection of the carriers of red and blue, black and gray, green and yellow, respectively. Hence, all these six colors (white nodes are considered as uncolored) are bad. However, only colors red and blue are candidates for node A, only colors black and gray are candidates for node B, and only green and yellow for C. Following the arguments provided in the Appendix, in searching for an optimal convex recoloring it is enough to check only the candidate colors for each node, i.e., there is no need to check all 6 bad colors (in contrast to Moran and Snir, 2008). (b) Simplified search for an optimal recoloring by dynamic recalculation of candidate colors. In the figure, colors red, blue, yellow, green and black are candidates for node B (white nodes are considered as uncolored). The algorithm of Moran and Snir (2008), in the search for an optimal recoloring, considers all 5 bad colors as possible recolorings of node B, and all possible color partitions of the remaining set of the other 4 bad colors (in total, $3^4 = 81$ partitions standing for the options "left", "right", and "none") as recolorings of the subtree rooted at B. As we prove in the Appendix, we do not need to consider partitions assigning the yellow and green colors to the subtree rooted at A. Since the initial coloring of this subtree is convex, the coloring of B together with the set of currently assigned colors (based on the candidate colors for B) directly determines the extension of the constructed recoloring (of minimal possible cost) to this subtree. In the case when B is not recolored by red and the partition of candidate colors does not assign the red color to the subtree rooted at C, the initial coloring of this subtree, restricted to colors other than red, is convex. Hence, the coloring of B by the assigned color set determines the extension of the constructed optimal recoloring to this subtree. In all other cases, only colors red, yellow, green and blue (not black) can be candidates for node C. In particular, if B is recolored by blue then only blue is a candidate color for C; otherwise, blue is not a candidate color for C.

4.3. Coloring significance - estimation of p-value

An important feature of CR that has not been explored so far is the significance of a given input (non-convex) coloring. A relatively low cost $P_{opt}(C)$ of some coloring $C$ is not necessarily proof of the goodness of the input tree coloring $C$. For example, it might be that this optimal cost is attained by many random colorings. Hence, the significance of a given coloring estimates how likely we are to find at random another coloring with the same cost. The biological meaning of this value can be interpreted as follows.
Assume we believe in the given tree topology (this is also the underlying assumption of CR in general, as opposed to perfect phylogeny, where the tree is built from the given set of discrete characters). We also believe in the coloring on the tree (i.e., the color assignment to the tree nodes). Then this significance value can be interpreted as a statistical measure of the compliance of this character with the evolutionary history of the taxa set at hand (as depicted by the tree). Therefore, in addition to the $P_{opt}(C)$ value, we provide an estimate of the quality of $C$ by the probability of obtaining $P_{opt}(C') \le P_{opt}(C)$ for a "random" coloring $C'$. As an analytic calculation of this measure appears to be hard and presumably computationally intractable, the straightforward way to proceed is via simulations (a method known as permutation test or bootstrap; Wasserman, 2004). Hence, to estimate this probability (which can be dubbed a p-value) we calculate the frequency of the event $R = \{P_{opt}(C_i) \le P_{opt}(C)\}$ among $N$ random colorings $C_1, \ldots, C_N$ of the tree (e.g., $N$ = 10,000), and denote by $N_R$ the number of random colorings for which $R$ occurs. In order to maintain the initial properties of the input coloring $C$, we preserve the proportions between colors of the original coloring. Hence, each random coloring of the tree is simulated by reshuffling the input colors of $C$ among the nodes. As the initial coloring $C$ can itself be considered a realization of a random tree coloring, we get: p-value $= (N_R + 1)/(N + 1)$. Uncolored nodes are not affected by the reshuffling, just as they do not affect the cost function. The software implementation associated with this article provides this value along with the absolute cost of the optimal convex recoloring for the given coloring $C$.

4.4. Implementation

The algorithm is implemented in Python and receives as input a colored tree in either Newick or NEXUS format. Colors are assigned to nodes either as part of their names, by some convention, or by a separate table.
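The permutation test of Section 4.3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of the accompanying software: the function names (`shuffle_colors`, `permutation_p_value`, `toy_cost`) are hypothetical, a coloring is assumed to be a dictionary mapping each node to a color or to `None` for uncolored nodes, and the optimal-cost routine is passed in by the caller. The `toy_cost` function is only a stand-in used to make the sketch runnable; it is not the optimal convex recoloring cost $P_{opt}$.

```python
import random
from collections import Counter


def shuffle_colors(coloring, rng):
    """Reshuffle the input colors among the colored nodes.

    Color proportions are preserved and uncolored nodes (None) stay uncolored.
    """
    colored = [v for v, c in coloring.items() if c is not None]
    colors = [coloring[v] for v in colored]
    rng.shuffle(colors)
    shuffled = dict(coloring)
    shuffled.update(zip(colored, colors))
    return shuffled


def permutation_p_value(coloring, opt_cost, n_perm=10000, seed=0):
    """Estimate the p-value as (N_R + 1) / (N + 1).

    opt_cost is a caller-supplied function returning the cost of a coloring
    (assumed here; in the paper this role is played by P_opt).
    """
    rng = random.Random(seed)
    observed = opt_cost(coloring)
    # N_R: number of random colorings whose cost does not exceed the observed one.
    n_r = sum(opt_cost(shuffle_colors(coloring, rng)) <= observed
              for _ in range(n_perm))
    return (n_r + 1) / (n_perm + 1)


def toy_cost(coloring):
    """Stand-in cost for illustration only: color changes along the node order."""
    seq = [coloring[v] for v in sorted(coloring) if coloring[v] is not None]
    return sum(a != b for a, b in zip(seq, seq[1:]))


example = {1: "red", 2: "red", 3: "red", 4: None,
           5: "blue", 6: "blue", 7: "blue"}
p = permutation_p_value(example, toy_cost, n_perm=999, seed=1)
```

Adding 1 to both the numerator and the denominator counts the input coloring itself as one realization of a random coloring, so the estimate can never be exactly zero.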
It is also possible to assign colors to internal nodes, and these colors are interpreted by the program as part of the input. The output of the program is an optimally recolored tree (one of the many possible, saved in both Newick and NEXUS formats), the list of recolored nodes, the cost of the optimal convex recoloring, and the p-value. This output can be used by several tree viewers (e.g., FigTree; Rambaut, 2010), as demonstrated in our Results section. Our experiments showed that the program was able to find an optimal convex recoloring for trees with 200 leaves, randomly colored by up to 60 colors, in a few seconds. Recall that by Moran and Snir (2008), the algorithm runs in time that is linear in the number of vertices and even in the number of good colors, and exponential in the number of bad colors (that is, it is fixed-parameter tractable; Downey and Fellows, 1999). Consequently, it can also handle larger trees (e.g., of 1000 leaves), provided the number of "bad" colors is relatively small (e.g., 20). More implementation details can be found in the Appendix.

Acknowledgements

We wish to thank Lana Martin for valuable editing of the manuscript.

Appendix A. Rigorous proofs for optimal recoloring via candidate colors

Here we present a formal proof that there exists an optimal convex recoloring (one of all possible) that recolors nodes only by their corresponding candidate colors (similar to Lemma 4.7 of Moran and Snir, 2008). In fact, in our algorithm for searching for an optimal convex recoloring, candidate colors are calculated dynamically, taking into account the restrictions caused by current decisions not to use some colors in the recoloring of a subtree.
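To make the notions of carrier, convexity and candidate color from Sections 4.1 and 4.2 concrete, the following is a minimal sketch. It assumes the tree is given as an undirected adjacency dictionary and the coloring as a dictionary of colored nodes only; all function names are illustrative and are not taken from the accompanying software. It computes candidate colors statically from the input coloring and does not perform the dynamic recalculation used by the algorithm.

```python
def carrier(adj, nodes):
    """Minimal subtree containing `nodes`: repeatedly prune leaves outside the set."""
    targets = set(nodes)
    if not targets:
        return set()
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for u in adj[v]:
            if u in keep:
                degree[u] -= 1
                if degree[u] <= 1 and u not in targets:
                    stack.append(u)
    return keep


def color_carriers(adj, coloring):
    """Map each color d to carrier(C^{-1}(d))."""
    by_color = {}
    for v, d in coloring.items():
        if d is not None:
            by_color.setdefault(d, set()).add(v)
    return {d: carrier(adj, vs) for d, vs in by_color.items()}


def is_convex(adj, coloring):
    """A coloring is convex if the carriers of distinct colors are disjoint."""
    seen = set()
    for sub in color_carriers(adj, coloring).values():
        if seen & sub:
            return False
        seen |= sub
    return True


def candidate_colors(adj, coloring):
    """Candidate colors of each node: all colors linked by a chain of
    intersecting carriers to a color whose carrier contains the node."""
    carriers = color_carriers(adj, coloring)
    covering = {v: {d for d, sub in carriers.items() if v in sub} for v in adj}
    parent = {d: d for d in carriers}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    # Colors whose carriers share a node belong to the same chain.
    for colors in covering.values():
        colors = list(colors)
        for d in colors[1:]:
            parent[find(d)] = find(colors[0])
    groups = {}
    for d in carriers:
        groups.setdefault(find(d), set()).add(d)
    return {v: set().union(*(groups[find(d)] for d in cov))
            for v, cov in covering.items()}


# Hypothetical example: a path 0-1-...-9 with an extra leaf 10 attached to node 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (6, 7), (7, 8), (8, 9), (4, 10)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
coloring = {0: "red", 1: "blue", 2: "red", 3: "blue",
            6: "green", 7: "yellow", 8: "green", 9: "yellow"}
cands = candidate_colors(adj, coloring)
```

In this example all four colors are bad, yet each colored node has only two candidate colors, which mirrors the situation described for Fig. 8a.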
Any partial convex coloring can be naturally extended so that no uncolored node remains inside the carrier of a used color (for an uncolored node $v \in \mathrm{carrier}(C^{-1}(d))$ we can define $C(v) := d$; this is well defined because the coloring $C$ was convex and the carriers of the colors are not changed, hence $C$ remains convex). We consider only recolorings in which coloring an uncolored node has zero cost; hence, for simplicity, all convex colorings and recolorings are considered as already extended in this way (i.e., if a node is uncolored in a convex coloring under consideration, then it does not belong to the carrier of any used color). A pair of colors $(d_1, d_2)$ is called neighboring in a coloring $C$ (not necessarily convex) if there exist vertices $v_1$ and $v_2$ such that $C(v_1) = d_1$, $C(v_2) = d_2$, and $v_1$ is connected to $v_2$ by a single edge or by a path (along tree edges) visiting only nodes uncolored in $C$.

Claim A.1. Let $C$ be an input coloring of a tree $T$. There exists a convex recoloring $C'$ of minimal possible cost such that for each $d \in C'(T)$ there exists a node $v$ such that $C(v) = C'(v) = d$.

Proof. The case of a convex input coloring $C$ (e.g., no colored nodes) is trivial. Let $C''$ be a convex recoloring of minimal possible cost with the minimal possible number of colors (it always exists because the set of nodes is finite). Suppose there exists a color $d \in C''(T)$ such that every $v \in T$ is uncolored in $C$ or satisfies $C(v) \neq d$. Let $d'$ be one of the neighbor colors of $d$ in $C''$. Then the recoloring $C'$ that coincides with $C''$ outside $C''^{-1}(d)$ and colors all nodes of $C''^{-1}(d)$ by $d'$ is convex (because $C''$ is convex) and uses fewer colors (it does not use $d$). This contradicts the definition of $C''$. □

Claim A.2. Let $d$ be a candidate color for nodes $u$ and $v$ with respect to an input coloring $C$. Then $d$ is a candidate for all nodes on the path (along tree edges, without returns) from $u$ to $v$.

Proof. Color $d$ is a candidate for $u$ and $v$, hence there exist colors $d^{(u)}_0, \ldots, d^{(u)}_{n^{(u)}}$ and $d^{(v)}_0, \ldots, d^{(v)}_{n^{(v)}}$ such that $u \in \mathrm{carrier}(C^{-1}(d^{(u)}_0))$, $v \in \mathrm{carrier}(C^{-1}(d^{(v)}_0))$, $d^{(u)}_{n^{(u)}} = d^{(v)}_{n^{(v)}} = d$, $\mathrm{carrier}(C^{-1}(d^{(u)}_i)) \cap \mathrm{carrier}(C^{-1}(d^{(u)}_{i-1})) \neq \emptyset$ and $\mathrm{carrier}(C^{-1}(d^{(v)}_j)) \cap \mathrm{carrier}(C^{-1}(d^{(v)}_{j-1})) \neq \emptyset$ for all $i = 1, \ldots, n^{(u)}$ and $j = 1, \ldots, n^{(v)}$. Hence, there exists a path from $u$ to $v$ (along tree edges) such that color $d$ is a candidate for all visited nodes. The path from $u$ to $v$ along tree edges without returns is unique, hence color $d$ is a candidate for all its nodes. □

Recoloring any $n$ colored nodes, each to a color different from its own (i.e., $C'(v) \neq C(v)$, although it can be that $C'(v) = C(u)$ for another node $u$), costs $n$. This enables the following observation:

Observation A.3. Let $T'$ be a subtree of a tree $T$. Let $C'$ be a recoloring of $T$ with respect to an input coloring $C$. Suppose all nodes of $T'$ are recolored by $C'$ only by their non-candidate colors. Then any recoloring of $T$ that coincides with $C'$ on $T \setminus T'$ has cost not higher than that of $C'$.

Claim A.4. Let $C'$ be a convex recoloring (after all extensions, see above) of $T$ with respect to an input coloring $C$. Assume nodes $u$ and $v$ are connected by a single edge or by a path along tree edges visiting only nodes uncolored in $C'$ (and hence not in the carrier of any color in $C'$). Assume also that $C'(u)$ is a candidate for nodes $u$ and $v$, but $C'(v)$ is not a candidate for $v$. Then there exists a convex recoloring $C''$ with $\mathrm{cost}(C'') \le \mathrm{cost}(C')$ that colors more nodes by their candidate colors than $C'$.

Proof. Let $T^{(v,C')}$ be the minimal subtree of $T$ containing all nodes colored in $C'$ by color $C'(v)$, i.e., $T^{(v,C')} = \mathrm{carrier}(C'^{-1}(C'(v)))$. Denote by $T_0^{(v,C')}$ the maximal subtree of $T^{(v,C')}$ containing node $v$ and not containing nodes for which $C'(v)$ is a candidate color. Based on Claim A.2, the set $T^{(v,C')} \setminus T_0^{(v,C')}$ (which can be empty) is connected (see the example presented in Fig. 9). $C'$ is convex, hence all nodes of $T^{(v,C')}$ are colored in $C'$ by $C'(v)$. Therefore, the recoloring $C''$ that coincides with $C'$ on $T \setminus T_0^{(v,C')}$ and colors the nodes of $T_0^{(v,C')}$ by color $C'(u)$ is convex. Now $C''(v)$ is a candidate (in $C$) for $v$. Color $C'(v)$ was not a candidate (in $C$) for any node of $T_0^{(v,C')}$, hence all nodes that were colored by their candidate colors in $C'$ remain colored by the same candidate colors in $C''$. Following Observation A.3, the cost of $C''$ is not higher than the cost of $C'$. □

Fig. 9. Illustration for the proof of Claim A.4. Nodes with white color are considered as uncolored. The color of the internal disk indicates the color in coloring $C$, while the color of the ring indicates the color in coloring $C'$. Candidate colors are indicated by colored squares. Color $C'(v)$ cannot be a candidate for nodes 2 and 16 because it is a candidate for node 7 and not a candidate for node $v$ (see Claim A.2). In this example $T^{(v,C')}$ is the subtree with nodes $v$, 3, 6, 7, 8, 10, 11, 12, 13, 17, 18 and 20; $T_0^{(v,C')}$ is the subtree with nodes $v$, 6, 10, 11, 17 and 18; $T_1^{(v,C')}$ is the subtree with nodes 3, 7, 8, 12, 13 and 20. Nodes 6 and 8 should be colored by red in $C'$, and node 9 should be colored by blue in $C'$. Note that in this example colors blue, green, red and violet are bad. Nevertheless, colors red and violet are not candidates for nodes with candidate colors green and blue, and colors green and blue are not candidates for nodes with candidate colors red and violet. Such a subdivision of bad colors into groups of candidate colors can dramatically reduce the number of variants in the search for a convex recoloring of minimal possible cost (see Claim A.6).

Claim A.5. There exists a recoloring of minimum possible cost such that all nodes are recolored by their candidate colors or remain uncolored.

Proof. Let $C'$ be a convex recoloring (with all possible extensions, see above) of minimum possible cost such that for each $d \in C'(T)$ there exists a node $v$ such that $C(v) = C'(v) = d$ (see Claim A.1), and which, among such recolorings, colors the largest possible number of nodes by their candidate colors. We now show that if some nodes are colored in $C'$ by a non-candidate color, then either these nodes were uncolored in $C$ (hence $C''$, which coincides with $C'$ on all other nodes and leaves these nodes uncolored, is as required), or there exists a convex recoloring $C''$ having cost (relative to the input coloring $C$) not higher than $C'$ and coloring more nodes by their candidate colors. This contradicts the definition of $C'$ and suffices for the proof. Assume node $v$ is colored in $C$ and $C'$ such that $C'(v)$ is not a candidate for $v$ (in $C$).

1. Suppose color $C(v)$ is not present in $C'$. Using the notation from the proof of Claim A.4, the recoloring $C''$ that coincides with $C'$ on $T \setminus T_0^{(v,C')}$ and colors the nodes of $T_0^{(v,C')}$ by color $C(v)$ is convex and has cost lower than $C'$ (because $C''(v) = C(v) \neq C'(v)$), which contradicts the definition of $C'$.

2. Assume color $C(v)$ is already present in $C'$ (hence the recoloring $C''$ from (1) can be non-convex). Let $u$ be such that $C(u) = C'(u) = C(v)$ (it exists by the definition of $C'$). Color $C(v)$ is a candidate for $u$ and $v$; hence, following Claim A.2, color $C(v)$ is a candidate for all nodes on the path from $u$ to $v$ (along tree edges, without repeats). Let $v'$ be the first node on this path (starting from $u$) colored in $C'$ by a non-candidate color $d$ (e.g., $v'$ can coincide with $v$). Let $u'$ be the last node on the path from $u$ to $v'$ colored by its candidate color (e.g., $u'$ can coincide with $u$).
Node $u'$ belongs to the overlap of $\mathrm{carrier}(C^{-1}(C'(u')))$ and $\mathrm{carrier}(C^{-1}(C(v))) \supseteq \mathrm{carrier}(\{u, v\})$, hence color $C'(u')$ is a candidate for $u$ and $v$, and, based on Claim A.2, it is a candidate for $v'$. Hence, based on Claim A.4, there exists a convex recoloring $C''$ having cost not higher than $C'$ and coloring more nodes by candidate colors. This contradicts our definition of $C'$. □

We now consider a special case of a convex recoloring. A partial convex recoloring $C'$ is conservative relative to an initial coloring $C$ if it satisfies the following: (1) only vertices uncolored by $C$ can be uncolored by $C'$; (2) a node uncolored by $C$ can be colored in $C'$ only by a bad color of $C$, or remain uncolored; (3) a vertex can change its color only to a bad color of $C$; and (4) for every color $d$ used in coloring $C'$, the set $C'^{-1}(d)$ is connected. In Moran and Snir (2008) it is shown that an optimal conservative recoloring is also a general optimal convex recoloring. By our next claim, optimality holds even if we replace "bad color" (in the definition of Moran and Snir (2008)) by "candidate color" in the definition of conservative recoloring (see above). We refer to such a conservative recoloring as a candidate conservative recoloring.

Claim A.6. An optimal candidate conservative recoloring is an optimal convex recoloring in general.

Proof. Let $C'$ be an optimal convex recoloring from Claim A.5. Using all possible extensions of $C'$ to nodes uncolored in $C'$, one can obtain an optimal convex recoloring satisfying the conditions of a candidate conservative convex recoloring. □

Observation A.7. Let $T'$ be a subtree of a tree $T$. Let $C'$ be an optimal recoloring for the subtree $T'$ restricted to some set of colors. In this case we can apply Claim A.6 with the candidate colors computed as if the nodes of the excluded colors were uncolored (see Fig. 10).

Fig. 10. Recoloring without using some colors, and candidate colors. Nodes with white color are considered as uncolored. When all colors may be used, the carrier of color red overlaps the carrier of color black (node 10), and the carrier of color black overlaps the carriers of colors blue and green (nodes 11 and 13); hence color red is a candidate for node 9, and colors blue and green are candidates for node 1. If a convex recoloring of minimal possible cost is sought under the condition that color black is not used, then node 1 has only one candidate color (red) and node 9 has only two candidate colors (green and blue).

Observation A.8. Use the definition of good colors in the sense of Moran and Snir (2008), i.e., a color $d$ is good in a partial coloring $C$ if $\mathrm{carrier}(C^{-1}(d))$ contains no nodes with other colors and no uncolored nodes from carriers of other colors. Then there exists an optimal candidate conservative recoloring in which no good color is recolored by another good color.

Using Observation A.8 we can improve the algorithm searching for an optimal convex recoloring.

Algorithm for searching for optimal convex recoloring: For a rooted tree $T$, we denote by $T_v$ the subtree rooted at vertex $v$. We denote by $P_s(v, d, D)$ the minimal cost of a candidate conservative convex recoloring $C'$ of $T_v$ such that $C'(v) = d$ and $C'$ uses only colors from $D$ for the descendants of $v$. For convenience, we use the symbol $H$ to denote the "color" of an uncolored vertex and assume an infinite cost for uncoloring a colored vertex. Denote $P_s(v, D) := \min_{d \in D \cup \{H\}} P_s(v, d, D)$, the minimum cost of a convex recoloring that uses colors from $D$. We also denote by $P_c(v, d, D)$ the minimal cost of a convex recoloring under which either $C'(v) = d$ or $C'(T_v)$ does not contain $d$.
This means that

$P_c(v, d, D) = \min\{P_s(v, d, D \cup \{d, H\}),\ \min_{d' \in (D \cup \{H\}) \setminus \{d\}} P_s(v, d', (D \cup \{H\}) \setminus \{d\})\}$.

Assume $T$ is rooted at some vertex $v_r$. Let $\delta_{a,B}$ denote the inverse Kronecker delta, such that $\delta_{a,B} = 0$ if $a \in B$, and $\delta_{a,B} = 1$ otherwise. Denote by $\mathcal{C} = \mathcal{C}_C(T)$ the set of all node colors used in coloring $C$. Then, analogously to Lemma 4.8 of Moran and Snir (2008), the cost of a minimal convex recoloring of the entire tree $T$ can be written as $P_s(v_r, \mathcal{C})$ and is calculated recursively:

$P_s(v, d, D) = \delta_{C(v),\{H,d\}} + \min_{(D_1, \ldots, D_k)} \sum_{i=1}^{k} P_c(v_i, d, D_i)$,

where $v_i$ are the children of vertex $v$, $\bigcup_{i=1}^{k} D_i = D$, and $D_i \cap D_j = \emptyset$ for $i \neq j$. The restriction to candidate conservative convex recolorings, rather than all conservative convex recolorings, has no asymptotic implication for the running time of the algorithm; however, it allows us to discard a large fraction of the valid color partitions $(D_1, \ldots, D_k)$ and reduce the running time dramatically. The candidate restriction is implemented recursively, one child after the other: the color assignment of the $i$-th child is checked only after it is guaranteed that, for all $j < i$, the color assignment of the $j$-th child satisfies the candidate criterion.

Appendix B. Implementation details

This algorithm is implemented in Python and receives as input a colored tree in either Newick or NEXUS format. In standard Newick format, node names (captions that can include a color) can be specified only for leaves. A color is assigned to a leaf by a specification immediately after the leaf name: $<Color>$. It is also possible to set the initial coloring of leaves by a separate table in the following format: <LeafIdInTree> <NameOfLeafToDisplay> <Color> (see examples in the ReadMe.txt file). In NEXUS format it is also possible to assign input colors to internal nodes (in this case the program assigns a vertex the color that is indicated on the vertex's incoming edge: [&!color = #-<CodeOfColor>]: <LengthOfEdge>). The output of the program is an optimally convex-recolored tree (one of the many possible, saved in both Newick and NEXUS formats), the recoloring of the nodes, and the cost of the optimal convex recoloring.

References

Avni, E., Cohen, R., Snir, S., 2015. Weighted quartets phylogenetics. Syst. Biol. 64 (2), 233–242. Bapteste, E., Susko, E., Leigh, J., MacLeod, D., Charlebois, R.L., Doolittle, W.F., 2005. Do orthologous gene phylogenies really support tree-thinking? BMC Evol. Biol. 5, 33. Bar-Yehuda, R., Feldman, I., Rawitz, D., 2008. Improved approximation algorithm for convex recoloring of trees. Theory Comput. Syst. 43 (1), 3–18. Berkum, P., Terefework, Z., Paulin, L., Suomalainen, S., Lindstrom, K., Eardly, B.D., 2003. Discordant phylogenies within the rrn loci of rhizobia. J. Bacteriol. 185 (10), 2988–2998. Bininda-Emonds, O.R.P., Gittleman, J.L., Steel, M.A., 2002. The (super)tree of life: procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33 (1), 265–289. Bodlaender, H.L., Fellows, M.R., Warnow, T., 1992. Two strikes against perfect phylogeny. In: ICALP, pp. 273–283. Bodlaender, H.L., Fellows, M.R., Langston, M.A., Ragan, M.A., Rosamond, F.A., Weyer, M., 2011. Quadratic kernelization for convex recoloring of trees. Algorithmica 61 (2), 362–388. Bordewich, M., Semple, C., 2005. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 8, 409–423. http://dx.doi.org/10.1007/s00026-004-0229-z. Campêlo, M., Lima, K.R., Moura, P.F.S., Wakabayashi, Y., 2013. Polyhedral studies on the convex recoloring problem. Electron. Notes Discrete Math. 44, 233–238. Cramp, S., Perrins, C.M., Brooks, D.J., Dunn, E., 1993.
Handbook of the birds of Europe, the Middle East and North Africa: the birds of the western Palearctic, vol. VII: Flycatchers to shrikes. Oxford University Press. Creevey, C.J., McInerney, J.O., 2005. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 21 (3), 390–392. Delsuc, F., Brinkmann, H., Philippe, H., 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6 (5), 361–375. Dewhirst, F.E., Shen, Z., Scimeca, M.S., Stokes, L.N., Boumenna, T., Chen, T., Paster, B.J., Fox, J.G., 2005. Discordant 16S and 23S rRNA gene phylogenies for the Genus Helicobacter: implications for phylogenetic inference and systematics. J. Bacteriol. 187 (17), 6106–6118. Doolittle, W.F., 1999. Phylogenetic classification and the universal tree. Science 284 (5423), 2124–2129. Downey, R.G., Fellows, M.R., 1999. Parameterized Complexity. Springer. Eisen, J.A., Fraser, C.M., 2003. Phylogenomics: intersection of evolution and genomics. Science 300 (5626), 1706–1707. Fernandez-Baca, D., 2001. The perfect phylogeny problem. In: Cheng, X., Du, D.Z. (Eds.), Steiner Trees in Industry. Kluwer. Fitch, W.M., 1971. Towards defining the course of evolution: minimum change for a specified tree topology. Syst. Zool. 20, 406–416. Ginn, H.B., Melville, D.S., 1983. Moult in Birds. BTO Guide, British Trust for Ornithology. Hall, K.S., Tullberg, B.S., 2004. Phylogenetic analyses of the diversity of moult strategies in Sylviidae in relation to migration. Evol. Ecol. 18 (1), 85–105. Hein, J., 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98 (2), 185–200. Jenni, L., Winkler, R., 1994. Moult and Ageing of European Passerines. Academic Press. Kammer, F., Tholey, T., 2012. The complexity of minimum convex coloring. Discrete Appl. Math. 160 (6), 810–833.
Kanj, I.A., Kratsch, D., 2009. Convex recoloring revisited: complexity and exact algorithms. In: Computing and Combinatorics. Springer, pp. 388–397. Matsen, F.A., 2015. Phylogenetics and the human microbiome. Syst. Biol. 64 (1), e26–e41. Moran, S., Snir, S., 2005. Efficient approximation of convex recolorings. In: Approximation, Randomization and Combinatorial Optimization, Algorithms and Techniques, 8th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2005 and 9th International Workshop on Randomization and Computation, RANDOM 2005, Berkeley, CA, USA, August 22–24, 2005, Proceedings, pp. 192–208. Moran, S., Snir, S., 2005. Convex recolorings of strings and trees: definitions, hardness results and algorithms. In: Algorithms and Data Structures, 9th International Workshop, WADS 2005, Waterloo, Canada, August 15–17, 2005, Proceedings, pp. 218–232. Moran, S., Snir, S., 2008. Convex recolorings of strings and trees: definitions, hardness results and algorithms. J. Comput. Syst. Sci. 74 (5), 850–869. Ochman, H., Lawrence, J.G., Groisman, E.A., 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405 (6784), 299–304. Gogarten, J.P., Townsend, J.P., 2005. Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Micro. 3 (9), 679–687. Gogarten, J.P., Ford Doolittle, W., Lawrence, J.G., 2002. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19 (12), 2226–2238. Puigbó, P., Wolf, Y.I., Koonin, E.V., 2009. Search for a ‘tree of life’ in the thicket of the phylogenetic forest. J. Biol. 8 (6), 59. Puigbó, P., Wolf, Y.I., Koonin, E.V., 2010. The tree and net components of prokaryote evolution. Genome Biol. Evol. 2, 745–756. Rambaut, A., 2010. Figtree v1.3.1. Institute of Evolutionary Biology. University of Edinburgh. Schouls, L.M., Schot, C.S., Jacobs, J.A., 2003. Horizontal transfer of segments of the 16S rRNA genes between species of the Streptococcus anginosus group. J.
Bacteriol. 185 (24), 7241–7246. Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B. S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V., 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucl. Acids Res. 29 (1), 22–28. Treplin, S., Siegert, R., Bleidorn, C., Thompson, H.S., Fotso, R., Tiedemann, R., 2008. Molecular phylogeny of songbirds (Aves: Passeriformes) and the relative utility of common nuclear marker loci. Cladistics 24 (3), 328–349. Wasserman, L., 2004. All of Statistics. Springer, New York. Yap, W.H., Zhang, Z., Wang, Y., 1999. Distinct types of rrna operons exist in the genome of the Actinomycete Thermomonospora chromogena and evidence for horizontal transfer of an entire rRNA operon. J. Bacteriol. 181 (17), 5201–5209. Zhang, J., Kumar, S., 1997. Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol. 14 (5), 527–536. Zhaxybayeva, O., Lapierre, P., Gogarten, J.P., 2004. Genome mosaicism and organismal lineages. Trends Genet 20, 254–260.