Convex recoloring as an evolutionary marker

Molecular Phylogenetics and Evolution 107 (2017) 209–220
Zeev Frenkel a, Yosef Kiat b, Ido Izhaki a, Sagi Snir a,⇑
a Department of Ecology and Evolutionary Biology, University of Haifa, Israel
b Israeli Bird Ringing Center, Society for the Protection of Nature in Israel, Israel
⇑ Corresponding author.
Article info
Article history:
Received 21 May 2016
Revised 16 October 2016
Accepted 25 October 2016
Available online 3 November 2016
Keywords:
Phylogenetics
Maximum parsimony
Character compatibility
Perfect phylogeny
Statistical significance
Supertree
Optimal convex recoloring cost
Abstract
With the availability of enormous quantities of genetic data it has become common to construct very
accurate trees describing the evolutionary history of the species under study, as well as every single gene
of these species. These trees allow us to examine the evolutionary compliance of given markers (characters). A marker compliant with the history of the species investigated, has undergone mutations along the
species tree branches, such that every subtree of that tree exhibits a different state. Convex recoloring
(CR) uses combinatorial representation to measure the adequacy of a taxonomic classifier to a given tree.
Despite its biological origins, research on CR has been almost exclusively dedicated to mathematical
properties of the problem, or variants of it with little, if any, relationship to taxonomy. In this work we
return to the origins of CR. We put CR in a statistical framework and introduce and study the notion of
the statistical significance of a character. We apply this measure to two data sets - Passerine birds and
prokaryotes - and four examples. These examples demonstrate various applications of CR, from evolutionary relatedness, through lateral evolution, to supertree construction. The study was carried out with
new software that we provide, which contains an algorithmic improvement and produces graphical output of an (optimally) recolored tree.
Availability: Code implementing the features, together with a README, is available at http://research.haifa.ac.il/ssagi/software/convexrecoloring.zip.
© 2016 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.ympev.2016.10.018
1. Introduction
The practice of constructing a tree depicting the evolutionary
history of a set of organisms is nowadays common to almost every
phylogenomic study - an area combining genomic data and techniques for the study of evolution (Eisen and Fraser, 2003; Delsuc
et al., 2005). In particular, the deluge of the molecular data accumulating constantly, allows us to gauge the accuracy of the constructed trees. A character, genetic or morphological, classifies
the species set into several character classes. If we consider each
class as a different color, then every species is colored by the state
of the character it possesses, and the given character induces a coloring over the tree leaves. We say that the coloring is convex on the
given tree if every color class induces a clade or a subtree and these
subtrees do not overlap (Moran and Snir, 2008) (or equivalently, do
not intersect). Convexity is a desirable and natural property in classification. When a character is convex on a tree, it is denoted as
homoplasy free meaning it displays no reversals or convergence
(Zhang and Kumar, 1997). The well-founded and widespread phylogenetic approach maximum parsimony (Fitch, 1971) seeks a tree
with minimal changes on its edges, summed over all input characters. A minimum can be obtained when a perfect phylogeny exists, in
which case each input character is homoplasy-free on that phylogeny (Fernandez-Baca, 2001). Such a tree does not necessarily exist,
and finding one is computationally intractable (Bodlaender
et al., 1992). In the above setting, the characters are given and
assumed to be reliable, and a plausible tree is sought. In other settings, the tree is also given, along with the characters, but one or
more characters are not convex on that tree. In this case, we may
question the reliability of that tree.
Alternatively, in a setting where the tree provides enough confidence, the question shifts to the reliability of the input characters.
Moreover, we may wonder if the character under examination has
evolutionary traces or is influenced by other factors such as environment or simply randomness. In both cases, questioning the tree
while assuming character reliability or questioning the character
evolutionary meaningfulness, we look for the recoloring distance
that counts the minimum number of tree nodes we need to recolor
in order to arrive at convexity. This value indicates the level of disagreement between the tree and the coloring. The notion of the
recoloring distance was coined in Moran and Snir (2008) where
the problem, convex recoloring (CR), was defined and studied for
several types of trees and input colorings. Despite its biological
origin, the mathematical elegance of the problem has meant that mainly the combinatorial/
algorithmic aspects of it and of its derivatives, which have
little if any biological relevance, were studied. These include
extensions to certain graph types rather than a tree, specific input
colorings, constrained recoloring schemes, and the like (see e.g. Kanj
and Kratsch, 2009; Kammer and Tholey, 2012; Campêlo et al.,
2013 and references therein, but see also Matsen, 2015 for a classification oriented study).
In this work we bring the high-level theory of CR back down to
biological ground in several respects. For a taxonomist, it would
be desirable to determine, quantitatively and statistically, the relevance of a character (i.e. any classification) to the tree at hand. The
recoloring distance is an absolute, context-free number. We therefore introduce the notion of coloring significance, indicating how
likely we are to see a coloring of this distance or less by chance
on the given tree. In the Results section we demonstrate the use
of the coloring significance measure by applying CR to several
examples. First, in order to obtain an intuition regarding this measure, we show a simulation study. The results reveal that the recoloring distance is more structured than expected. Next, using two
data sets, we demonstrate the various uses of CR as an evolutionary
marker. The first data set covers eighty Passerine birds, and the
second covers a hundred prokaryotes, with a few colorings (characters) for each data set. The results obtained concern not only questions of phylogeny/character reliability, but also the intensity of non
tree-like activity in prokaryotes and the power of supertree
methods.
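The coloring significance described in this paragraph can be illustrated with a small Monte Carlo sketch. It rests on two assumptions that are not taken from the text: the null model shuffles the observed leaf colors among the leaves (so color counts are preserved), and a solver for the recoloring distance is supplied by the caller as `cost_fn`. The names `coloring_p_value` and `toy_cost` are hypothetical, and `toy_cost` is only a placeholder, not the recoloring distance.

```python
import random


def coloring_p_value(leaf_colors, cost_fn, n_samples=1000, seed=0):
    """Estimate how often a shuffled coloring costs no more than the observed one.

    leaf_colors: dict mapping leaf name -> color (the observed coloring).
    cost_fn: callable taking such a dict and returning a recoloring cost.
             ASSUMPTION: supplied by the caller, e.g. a convex recoloring solver.
    """
    rng = random.Random(seed)
    leaves = list(leaf_colors)
    colors = [leaf_colors[leaf] for leaf in leaves]
    observed = cost_fn(leaf_colors)
    hits = 0
    for _ in range(n_samples):
        shuffled = colors[:]
        # Assumed null model: same color counts, random placement on the leaves.
        rng.shuffle(shuffled)
        if cost_fn(dict(zip(leaves, shuffled))) <= observed:
            hits += 1
    return hits / n_samples


def toy_cost(leaf_colors):
    """Placeholder cost, NOT the recoloring distance: color changes along the leaf order."""
    ordered = [leaf_colors[leaf] for leaf in sorted(leaf_colors)]
    return sum(a != b for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    observed = {f"leaf{i}": ("Red" if i < 4 else "Blue") for i in range(8)}
    print("estimated p-value:", coloring_p_value(observed, toy_cost))
```

Under this reading, a coloring whose cost cannot be exceeded by any shuffle gets an estimate of 1, which is the situation discussed for the merged colors in Section 2.2.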
Importantly, we provide software that implements the features we describe in this work. In this respect, in the Methods section we describe an algorithmic improvement to the algorithm
presented in Moran and Snir (2008). The improvement is achieved
by reducing the average number of colors checked at a node. We do
not give an asymptotic analysis for this improvement but do provide rigorous proof for its correctness. We are aware that since
the appearance of the algorithm of Moran and Snir (2008), there
have been further improvements (e.g. Bar-Yehuda et al., 2008) to
that first algorithm, and there might be other algorithms with better complexity than the one presented here. However a basic property of this algorithm, which to the best of our knowledge was not
used before, is a local view that allows a dynamic calculation of the
set of candidate colors of each tree node. Accordingly, we believe
that the algorithmic improvements provided here, accompanied
with more fundamental theoretical improvements to CR, viewing
it as a fixed parameter tractable problem (Bodlaender et al.,
2011), will allow application of CR to data sets of orders of thousands of species and hundreds of colors.
2. Results
We now show four examples for the application of convex
recoloring to synthetic and real data. The first one is a simple
example based on random colorings of a binary tree, demonstrating the distribution of optimal convex recoloring cost in one simple
case. The other three are applications to real biological examples of
colored trees where the colorings represent a different classification each time. In each case we compute the optimal recoloring
and its associated p-value, signifying how much the given coloring
complies with the evolutionary history of the given species set
(that is also given as input, and is represented by the tree
topology).
2.1. Example 1: Statistical distribution of the recoloring distance
Our first example shows how the recoloring distance is distributed
for a given tree size and number of colors. We constructed a set of
random binary trees with 50 leaves. Next, we colored the leaves of each tree uniformly at random with 4 colors (no uncolored leaves; all
internal nodes are uncolored). This is done simply by choosing,
for every leaf, each color with probability 1/4. Therefore, the trees
obtained differ both in topology and in the proportions
between color sets. For each of these trees a convex recoloring
was calculated. The distribution of the recoloring cost is presented
in Fig. 1(a). We note that a naive upper bound on the expected
value of this statistic is 3n/4, where n is the number
of leaves. This is achieved by recoloring all the leaves with the most
common color. As this color must cover at least n/4 leaves, the bound is trivially
obtained. However, as we see in the figure, a much smaller value
(from n/2 to 3n/5) is usually obtained, signifying the existence of a
more profound structure in this question than that naive bound.
Notwithstanding, a more precise bound is not trivial to obtain
and is beyond the scope of this work. The distribution of color frequencies on the resulting convex trees is presented in Fig. 1(b).
The results are divided into three cases (three bar charts in the figure) representing cases in which the most common color had (i)
below 25 members (Blue bars), (ii) between 25 and 28 members
(Brown bars), and (iii) above 28 members (Green bars). As shown,
this difference in the prevalence of the most common color has only a
minimal effect on the distribution of the final colors, where the most
common color colors around 70% of the leaves. We note that as
there are many (possibly even exponentially many) optimal recolorings, this distribution might be biased according to the strategy
employed by the algorithm. One may observe that in a tree, every
color is preserved by at least a single leaf, as this does not violate
convexity of the tree. This observation explains the three
short bars on the right of Fig. 1(b).
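The naive 3n/4 bound from this example is easy to check numerically. The sketch below only simulates the random leaf colorings and the cost of repainting every leaf with the most common color; it does not compute the optimal convex recoloring and ignores tree topology, on which the naive bound does not depend. The function names are hypothetical.

```python
import random
from collections import Counter


def naive_recoloring_cost(leaf_colors):
    """Cost of recoloring every leaf with the most common color."""
    counts = Counter(leaf_colors)
    return len(leaf_colors) - max(counts.values())


def simulate_naive_costs(n_leaves=50, n_colors=4, n_trials=1000, seed=0):
    """Draw uniform random leaf colorings and return the naive cost of each."""
    rng = random.Random(seed)
    costs = []
    for _ in range(n_trials):
        # Each leaf receives each color with probability 1 / n_colors.
        colors = [rng.randrange(n_colors) for _ in range(n_leaves)]
        costs.append(naive_recoloring_cost(colors))
    return costs


if __name__ == "__main__":
    n = 50
    costs = simulate_naive_costs(n_leaves=n)
    print("largest naive cost:", max(costs), "| bound 3n/4 =", 3 * n / 4)
    print("mean naive cost:", sum(costs) / len(costs))
```

Because the most common of 4 colors always covers at least n/4 leaves, every simulated cost stays at or below 3n/4; comparing these values with the optimal costs in Fig. 1(a) would require an actual convex recoloring solver.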
2.2. Example 2: Birds moult strategies
In this example compatibility of adult/juvenile moult strategy
of birds with their evolutionary history was examined. We took a
tree over 80 bird taxa representing 29 of the 46 Passerine families
(Treplin et al., 2008). The leaves of this phylogeny were classified
by their main moult strategies in adult/juvenile life stages as
described in Jenni and Winkler (1994), Cramp et al. (1993), and
Ginn and Melville (1983). Such characterization was made only
for 43 of these genera and species and was expressed by one, two
or even three of three observed moult strategy types: “Summer
complete/summer partial”, “Summer complete/summer complete”, and “Winter complete/winter complete”. Such characterization induces the following coloring of the phylogenetic tree’s leaves:
leaves corresponding to non-characterized species and species characterized by more than one strategy type are uncolored; leaves corresponding to species characterized by only one moult strategy type
are colored Blue, Red and Green (26, 7 and 4 leaves respectively).
Using our program we found that this coloring is not convex:
$P_{opt} = 8$, p-value = 0.26 (see Fig. 2). Excluding the Green color
results in a non-convex coloring with $P_{opt} = 5$, p-value = 0.46. Unifying the colors Red and Blue (in the initial coloring) results in $P_{opt} = 3$,
p-value = 1.0. The latter means the following. After unification of
Red and Blue, we are left with two colors - Red/Blue and Green - where Green comprises only 4 members, which are dispersed.
A cost of 3 means that in order to arrive at convexity we must
uncolor all but one of the Green leaves. As shown in the previous section (Section 2.1), any tree recoloring retains at least one leaf of any
color class intact. This implies that 3 is not only the minimum cost possible, but also the maximum cost for the given
configuration of 4 Green leaves. Moreover, since any other, random
or not, input coloring with 4 Green leaves cannot achieve a cost
higher than that (i.e. a cost greater than 3), all colorings attain this
(3) or smaller cost, explaining the p-value of 1 of that result. The
biological interpretation of the results above is that adult/juvenile
moult strategies of birds are not evolutionarily compliant. This can be
explained by the hypothesis that similar adult/juvenile moult
strategies were formed independently in different bird species
and/or changed in different directions during the process of evolution (e.g., caused by changes of climatic niches).

Fig. 1. Results based on simulation data. (a) Distribution of the minimal cost of convex recoloring for random binary trees with 50 leaves: leaves are colored randomly (4 colors
with the same probabilities, no uncolored leaves), internal nodes are uncolored. (b) Distribution of leaf color frequencies in minimal convex recolorings for random binary
trees with 50 leaves. The distribution is about the same for situations in which the most frequent color was present in ≤ 24, from 25 to 28, and ≥ 29 leaves in the input random
coloring. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
2.3. Example 3: Birds migration strategies
The bird genera and species from Example 2 above were also classified by subdivision into three overlapping classes based on main
migration strategy: “residents”, “short-” and “long-distance
migrants” (Hall and Tullberg, 2004; Cramp et al., 1993). In total,
41 out of 80 genera were classified. Such a classification induces
the following coloring on the tree leaves: 15 “pure” residents
(Red), 8 short-distance migrants (Blue), 6 long-distance migrants
(Green) and 12 having various strategies. We found that such a coloring is also non-convex: removing genera with various strategies
yields $P_{opt} = 9$, p-value = 0.2 (see Fig. 3). Combining Blue (short-distance migrants) and Green (long-distance migrants) into
“migrants” gives a bi-colored tree that is non-convex with
$P_{opt} = 11$, p-value = 0.18. Removal of the Blue color (nodes) results
in $P_{opt} = 4$, p-value = 0.51. Finally, combining Blue with Red results in
$P_{opt} = 6$, p-value = 1.0. The above means that migration distance is,
similarly to moult strategy, also not evolutionarily compliant
(presumably like many ecological/geographical/behavioral
characters). This estimate can be explained by the hypothesis
that the ability and preference to migrate over long distances changed
in both directions during the process of evolution and was caused
by multiple internal and environmental factors.
2.4. Example 4: Evolutionary classes among prokaryotes
In this part, we study convexity among prokaryotes. Our species
set is composed of 41 archaeal and 59 bacterial genomes, representing the forest of life, that were studied
in Puigbó et al. (2009). The characters used for colorings represent
three different classifications: (i) domain based (2 colors, archaeal/
bacterial), (ii) phylum based (24 colors), and (iii) order based (57
colors). The underlying approach here is different from the examples above, as these characters are considered accurate and largely
representative of the main trend of evolution of the given species set.
Under this setting, the given tree is under scrutiny. Here, trees represent gene-specific histories, dubbed gene trees. These histories are
substantially different, as many genes are subject to the phenomenon
of horizontal gene transfer (HGT), the passage of genetic material between organisms by means other than lineal descent
(Doolittle, 1999; Ochman et al., 2000). Evolution in light of HGT
tangles the traditional universal Tree of Life, turning it into a network of relationships (Gogarten et al., 2002; Zhaxybayeva et al.,
2004; Gogarten and Townsend, 2005; Bapteste et al., 2005).
To put the above discussion in the context of color convexity,
we did the following. First we considered a tree representing the
evolution of the Isoleucyl-tRNA synthetase (IleS, COG0060) gene,
henceforth the IleS-tree, that is present in all 100 considered
prokaryotes. The IleS-tree and the corresponding colorings of
(domain-, phylum-, and order-based) are depicted in Fig. 4. Leaf
coloration follows order classification. Our results show that none
of the colorings is convex on the IleS-tree.
In order to delve deeper into the meaning of this result, we analyzed each category (coloring) separately. Starting with the domain
level, the tree from Fig. 4 can be perceived as an unrooted quartet
tree (Avni et al., 2015) over four large clades (subtrees), each pertaining
almost exclusively either to bacteria or to archaea. If we ignore the
outliers and color these clades as depicted in the figure, archaea - Red and bacteria - Green, we see a quartet colored
Red, Green | Red, Green. Obviously, this coloring is not convex, suggesting a very early HGT of the IleS
gene between archaea and bacteria.
At the phylum level, several violations of convexity can be seen, evidenced by the presence of members of a single phylum populating two of the four domain clades indicated
above. This is by definition a violation of convexity, as we require
all members of a phylum to be present in a single domain clade.
Specifically, in Fig. 4 we point at three members of
the Proteobacteria-Alpha phylum (index 22, green arrows), present
in the two bacteria clades.
Finally, there are many violations of convexity at the level of
orders. One such violation, which we also mark in the figure, is
that of the orders with indices 20 (Desulfurococcales, indicated by blue
arrows) and 55 (Thermoproteales, indicated by red arrows). Several quartets
can be found over these orders, each composed of a pair
from the index 20 order and another pair from the index 55 order,
that exhibit a 20, 55 | 20, 55 arrangement. It can be
shown that such an arrangement requires a recoloring of at least
one leaf (see Fig. 4).
Despite the deep discordance between individual gene histories, the belief in an underlying, vertical trend of evolution even
among prokaryotes poses the major challenge of finding this tree.
Normally, this underlying phylogeny is inferred by constructing
gene trees for genes thought to be immune to HGT, typically
ribosomal RNA genes. Nevertheless, even such genes are subject
to HGT, obfuscating the central trend of evolutionary relationships
(Berkum et al., 2003; Dewhirst et al., 2005; Schouls et al., 2003;
Yap et al., 1999). Therefore, it was suggested to construct the
underlying species tree by a two-stage approach as follows: First,
gene trees such as the IleS-tree above are constructed separately
for a multitude of genes. These trees do not necessarily span the
entire taxa set, but rather overlap at subsets of it. Subsequently these
trees are amalgamated to produce a big tree over the
complete taxa set. This approach is denoted supertree construction and the resulting tree is denoted a supertree (Bininda-Emonds
et al., 2002; Creevey and McInerney, 2005).

Fig. 2. Adult/juvenile moult strategy of birds. Leaf coloring is as follows: “Summer complete/summer complete” - Blue, “Winter complete/winter complete” - Red, “Summer
and winter complete/winter complete” - Green; non-characterized species and species characterized by more than one strategy type - uncolored (black). Optimal (convex)
recoloring is schematically shown by lines of corresponding colors. Note that only one green and two red colored leaves remained. (For interpretation of the references to
color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Migration strategy of birds. Leaf coloring is as follows: “pure” residents - Red; short-distance migrants - Blue; long-distance migrants - Green; non-characterized
species and species characterized by more than one strategy type - uncolored (black). Initial coloring is not convex. Optimal (convex) recoloring is schematically shown by
lines of corresponding colors. Indeed, significance of initial coloring (p-value, see definition) is low (see text). (For interpretation of the references to color in this figure legend,
the reader is referred to the web version of this article.)

Fig. 4. A tree over 100 prokaryotes based on the Isoleucyl-tRNA synthetase (IleS, COG0060) gene from Puigbó et al. (2009). For convenience, organism names are appended by
three numbers separated by underscores (representing domain, phylum, and order indices respectively; order indices 1 and 57 correspond to organisms with questionable
order). Leaf coloration follows the coloring defined by order. It can be seen that the three colorings - domain, phylum, and order - are not convex on the IleS tree. On the
domain level, one can see two pairs of large clades (subtrees), a pair for each domain, corresponding to domains 1 (Archaea, red lines) and 2 (Bacteria, green lines), intertwined
along the tree, yielding non-convexity of the domain coloring. At the phylum level, phylum 22 (Proteobacteria-Alpha, indicated by green arrows in the figure) was found in both
bacteria clades, hence yielding non-convexity also at the level of phylum. We remark that one can find a few additional such examples of poorly classified phyla according
to this gene tree. The archaea domain was also found non-convex by the order coloring: the carriers of the colors corresponding to orders 20 (Desulfurococcales, indicated by blue arrows)
and 55 (Thermoproteales, indicated by red arrows) overlap. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)
In light of the above, the question we pursue here is how much the
supertree “corrects” the non-convexity of individual gene trees. In
Puigbó et al. (2010), a set of 6901 orthologous gene families (COGs;
Tatusov et al., 2001) was selected and for each such family, its gene
tree was reconstructed. From this set, a subset of around a hundred
fairly conserved, ubiquitous genes, denoted nearly universal trees
(or NUTs), were taken. A tree spanning the entire taxa set was constructed by a supertree method, based on the NUTs trees. We
denote it as the NUTs-tree. We wanted to measure the convexity
of the NUTs-tree with respect to each of our three colorings. Applying our program we found that all are convex on this tree. For illustration, the tree, leaf-colored according to phylum, is shown in
Fig. 5.
To summarize this part, we start with the IleS-tree. We note that
the fact that all three colorings were found highly insignificant
(high p-values) suggests intensive HGT activity. Nevertheless, in
the case of HGT, one caveat should be raised. HGT operates at the scale
of subtrees while recoloring counts single nodes, and therefore a
recoloring distance cannot, at face value, be indicative of the intensity of HGT. Our domain-level coloring illustrates that. Recall we
had a tree over four large clades, two colored Green and
two Red. In order to make this coloring convex on that tree,
a whole clade needs to be recolored. In contrast, one Subtree Pruning and Regrafting (SPR) operation, which cuts an entire clade from its
current location and joins it at another, would have fixed this situation, yielding a convex tree. However, this SPR move would have
modified the tree topology - an operation that stands in contradiction to the CR philosophy, which keeps the tree topology intact.
Therefore, while the intensity of HGT is normally measured by the
SPR-distance to the species tree (Hein, 1990), finding such a distance is computationally intractable
(NP-hard) (Bordewich and Semple, 2005) (but exponential in the
number of SPR events), whereas finding the recoloring distance may provide
some intuition and is exponential only in the number of colors.

Fig. 5. The NUTs-tree constructed from a subset of around a hundred gene trees by the supertree approach (Puigbó et al., 2009). The tree is leaf-colored according to the
phylum classification. As can be seen, this coloring is convex on the tree. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.)
The second example with prokaryotic data dealt with the
power of the supertree approach and how it relates to CR.
We have shown that the supertree approach can “correct” all coloring violations exhibited by the IleS-tree. We note that convexity
with respect to these classifications is not the only criterion.
Therefore, we can frequently find trees that are not convex with
respect to this classification yet provide other, insightful
relationships.
3. Conclusions
In this work we studied convex recoloring (CR) and focused on
relevant biological aspects of it. Since its introduction in 2005
(Moran and Snir, 2005; Moran and Snir, 2005), CR was almost
entirely studied in the context of theoretical computer science
while the biological relevance of it was neglected. We believe that
this is the prime importance of the work presented here. Specifically, we used CR as a marker for character compliance with organismal evolution, by fitting it to the tree nodes and measuring
compatibility. We augmented the parameterless value of the
recoloring distance with a statistical framework that provides the
(statistical) significance of the given input coloring in terms of a
p-value, allowing determination of the evolutionary relatedness
of the character under study.
On a more technical level, we provided algorithmic improvements to the basic algorithm for CR introduced in Moran and
Snir (2008). The improvement is achieved by reducing the set of
possible recolorings and considering a more local, instead of a global, view of the problem. In general, when the input coloring is
near random and has a big recoloring distance, this improvement
appears to be of little benefit over the asymptotic bound. Nevertheless this improvement is more pronounced in the case of a coloring
close to convexity. It appears that our heuristic bears some similarity to the principles implemented in Bar-Yehuda et al. (2008).
While we do not have theoretical asymptotic analysis for this
improvement, it was experimentally demonstrated in our simulations and real data examples.
Importantly, we also provide software implementation for the
algorithm, containing the features discussed above and providing
an output that can be used conveniently in tree viewing software
as demonstrated in our examples. To the best of our knowledge,
no such software exists.
In the experimental realm, we applied our software to four
examples, two from Ornithology and two from Microbiology. The
examples from Ornithology addressed the topic of character compliance with species evolution. Our results show that both characters, migration strategy and moult strategy, are insignificant on
the tree - implying that they did not evolve along with the
species.
The examples from Microbiology focused on horizontal gene
transfer (HGT) and the strength of the tree signal in light of it. Here,
as opposed to the previous examples, we treated the characters as
reliable and questioned the tree. In our first example, we showed
that classification based on the individual gene tree is evolutionarily
unrelated. The second example shows that the supertree approach
makes it possible to construct an evolutionarily consistent tree from the HGT-affected trees obtained for individual genes. These examples
demonstrate various applications of the convexity criterion. Moreover, these examples from very distant fields in Biology attest to
the generality and applicability of the concept and its
implementation.
introduction of CR. A coloring is some property associated with a
set. A coloring C of a tree T assigns colors to the nodes of the tree.
A coloring is denoted partial if not all nodes are colored; otherwise
the coloring is total. The carrier of a set of nodes is the minimal subtree containing all nodes in the set. We define the carrier of the
color $d$ as the carrier of the nodes colored by $d$, formally
$\mathrm{carrier}(C^{-1}(d))$. $C$ is said to be convex on $T$ if for any pair of colors
$d_1 \neq d_2$, $\mathrm{carrier}(C^{-1}(d_1))$ and $\mathrm{carrier}(C^{-1}(d_2))$ do not intersect (see
Fig. 6).
A color $d_1$ is considered a bad color if there exists a color $d_2$
such that $\mathrm{carrier}(C^{-1}(d_1)) \cap \mathrm{carrier}(C^{-1}(d_2)) \neq \emptyset$. Note that our definition of a bad color is different from the one in Moran and Snir
(2008), where a bad color was defined only for total coloring,
and a color $d_1$ with $\mathrm{carrier}(C^{-1}(d_1))$ containing no nodes with other
colors was considered a good color even in the case of
$\mathrm{carrier}(C^{-1}(d_1)) \subset \mathrm{carrier}(C^{-1}(d_2))$. A recoloring scheme may have
several cost functions (see Moran and Snir, 2008). We here consider
the uniform cost model under which the recoloring of uncolored
vertices is free, recoloring any colored node v to any other color
costs 1, and uncoloring a colored node is prohibited. Hence, given
an input (partial) coloring C. the cost of a recoloring C 0 with respect
to C, denoted costC ðC 0 Þ, is the number of recolored vertices that
were previously colored by C (we note however that the software
we provide implements the weighted cost model in which the cost
of the recoloring is the sum of the weights of the recolored vertices). A convex recoloring C 0 is optimal for an input coloring C if
it has a minimal possible cost with respect to C. We denote this
cost by Popt ðCÞ. Henceforth, we refer only to convex recoloring,
i.e., if not specifically mentioned, a recoloring is assumed to be
convex.
4.2. Improved algorithm - candidate colors
To increase the efficiency of the search algorithm for an optimal
recoloring we use the following restriction on the set of candidate
recolorings. We apply a more local approach as was pursued in
Moran and Snir (2008). While in Moran and Snir (2008) a color
was defined as bad globally, i.e., across the whole tree, and every
bad color was examined at every node, here we restrict ourselves
at every individual node, only to colors that are relevant to this
node. We therefore define the following. For a node v, a color d0
is a candidate color if Cðv Þ ¼ d0 , or
exist
colors
1
d1 ; . . . ; dn
such
1
v 2 carrierðC 1 ðd0 ÞÞ, or
that v 2 carrierðC 1 ðdn ÞÞ
there
and
carrierðC ðdi ÞÞ \ carrierðC ðdi1 ÞÞ – £ for all i ¼ 1; . . . ; n. Informally, either d0 is v’s original color, or v sits inside d0 ’s carrier, or
there is a chain of carrier intersections from d0 to v (see Fig. 7).
The set of candidate colors for a concrete node colored by a bad
Black
Blue
4. Materials and methods
4.1. Convex recoloring of trees - basic definitions
The theory of convex recoloring (CR) relies on some non trivial
mathematical concepts that were defined and introduced in Moran
and Snir (2008). We here provide a brief and a minimum necessary
Red
Green
Fig. 6. An example of a convex coloring on a tree (white nodes are considered as
uncolored). Borders of color carriers are shown by violet lines. (For interpretation of
the references to color in this figure legend, the reader is referred to the web version
of this article.)
Z. Frenkel et al. / Molecular Phylogenetics and Evolution 107 (2017) 209–220
C
B
A
F
4.3. Coloring significance - estimation of p-value
D
E
G
Fig. 7. A non convex coloring of tree. White nodes are considered uncolored;
carriers of colors red and black intersect at nodes A and B; carriers of colors blue and
green intersect at nodes D and E; carriers of colors red and blue intersect at node C.
This means that all four colors red, black, green and blue are candidate for all of the
tree nodes. Indeed, recoloring of nodes F and G both by black or both by green
results in convex coloring (although the black and green carriers are initially
disjoint). It is easy to see that any convex recoloring of this tree changes the colors
of at least two colored nodes, i.e., has cost at least 2. (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of
this article.)
color can be significantly smaller than the entire set of bad colors
(as there can be a situation where a certain bad color is candidate
only for a small subset of nodes colored by bad colors, see example
presented in Fig. 8a). An additional reduction of the set of candidate colors for a node can be achieved by dynamically recalculating
the set of candidate colors for subtrees, while considering previous
decisions made for other nodes affecting the node in question
(such decisions can break down chains of overlapping carriers of
candidate colors that can result in splitting a cluster of candidate
colors into smaller parts or even singletons, see Fig. 8b for example). In the Appendix we provide rigorous arguments why the
restriction to candidate colors indeed guarantees optimal convex
recoloring. Our algorithm follows along the lines induced by
Lemma 4.8 of Moran and Snir (2008) however instead of considering the fixed set of bad colors, we use the smaller sets of candidate
colors, that are calculated dynamically during the run of the
algorithm.
(a)
(b)
B
A
C
217
B
C
A
Fig. 8. (a) Candidate colors vs. bad colors. Nodes A; B and C are at the intersection of
the carriers of red and blue, black and gray, green and yellow, respectively. Hence,
all these six colors (white nodes are considered as uncolored) are bad. However,
only colors red and blue are candidate for node A, only colors black and gray are
candidate for node B, and only green and yellow for C. Following the arguments
provided in the Appendix, in searching for an optimal convex recoloring, it is
enough to check only candidate colors for each node, i.e., no need to check all bad 6
colors (in contrast to Moran and Snir, 2008). (b) Simplified search for optimal
recoloring by dynamic recalculation of candidate colors. In the figure, colors red,
blue, yellow, green and black are candidate for node B (white nodes are considered
as uncolored). The algorithm of Moran and Snir (2008), in the search after an
optimal recoloring, considers all 5 bad colors as possible recolorings of node B, and
all possible color partitions of the remaining set of the other 4 bad colors (in total,
34 ¼ 81 partitions standing for the options ‘‘left”, ‘‘right”, and ‘‘none”) as recoloring
of the subtree rooted at B. As we prove in the Appendix, we don’t need to consider
parts assigning the yellow and green colors to the subtree rooted at A. As the initial
coloring for this subtree is convex, hence the coloring of B by the set of currently
assigned colors (based on the candidate colors for B) are straightforward and
determine the extension of the constructed recoloring (of minimal possible cost) for
this subtree. In the case when B is not recolored by red and the partition of
candidate colors does not assign the red color to the subtree rooted at C, initial
coloring of this subtree is convex by colors excluding red. Hence, the coloring of B
by the assigned color set determine the extension of constructed optimal recoloring
for this subtree. In all other cases, only colors red, yellow, green and blue (no black)
can be a candidate for node C. In particular, if B is recolored by blue then only blue is
a candidate color for C; else - blue is not a candidate color for C. (For interpretation
of the references to color in this figure legend, the reader is referred to the web
version of this article.)
An important feature of CR that has not been explored so far is the
significance of a given input (non-convex) coloring. A relatively
low cost $P_{opt}(C)$ of some coloring $C$ is not necessarily proof of
the goodness of the input tree coloring $C$. For example, it might
be that this optimal cost is attained by many random colorings.
Hence, the significance of a given coloring gives an estimate of
how likely we are to find at random another coloring with the
same cost. The biological meaning of this value can be interpreted
as follows. Assume we believe in the given tree topology (this is
also the underlying assumption in CR in general, as opposed to perfect
phylogeny, where the tree is built based on the given set of discrete
characters). We also believe in the coloring on the tree (i.e.,
the color assignment to the tree nodes). Then this significance
value can be interpreted as a means to measure statistically the
compliance of this character with the evolutionary history of the
taxa set at hand (which is depicted by the tree).
Therefore, in addition to the $P_{opt}(C)$ value, we provide an estimate
of the quality of $C$ by the probability of obtaining
$P_{opt}(C') \leq P_{opt}(C)$ for a “random” coloring $C'$. As analytic calculation
of this measure appears to be hard and presumably computationally
intractable, the straightforward way to proceed is via simulations
(a method known as a permutation test or bootstrap;
Wasserman, 2004). Hence, to estimate this probability (which can
be dubbed a p-value) we count the number $N_R$ of events
$R = \{P_{opt}(C_i) \leq P_{opt}(C)\}$ among $N$ random colorings $C_i$ of the tree (e.g.,
$N$ = 10,000). In order to maintain the initial properties of the input
coloring $C$, we preserve the proportions between the colors of the original
coloring. Hence, each random coloring of the tree is simulated
by reshuffling the input colors of $C$ among the nodes. As
the initial coloring $C$ can itself be considered a realization of a random
tree coloring, we get: p-value $= (N_R + 1)/(N + 1)$. Uncolored nodes
are not affected by the reshuffling, just as they do not affect
the cost function. The software implementation associated with
this article provides this value along with the absolute cost of
the optimal convex recoloring for the given coloring $C$.
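The reshuffling procedure described above can be sketched as follows. This is a minimal illustration, not the code distributed with the article: `permutation_p_value` and its parameters are names introduced here, and `cost_fn` is a hypothetical stand-in for any routine that returns $P_{opt}$ for a coloring of a fixed tree.

```python
import random
from typing import Callable, Dict, Hashable, Optional

# node -> color; None marks an uncolored node
Coloring = Dict[Hashable, Optional[str]]


def permutation_p_value(
    coloring: Coloring,
    cost_fn: Callable[[Coloring], float],
    n_perm: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Estimate the p-value (N_R + 1) / (N + 1) by reshuffling colors.

    cost_fn is an assumed callable returning the optimal convex
    recoloring cost of a coloring on a fixed tree.
    """
    rng = random.Random(seed)
    observed = cost_fn(coloring)

    # Only colored nodes take part in the reshuffling.
    colored_nodes = [v for v, d in coloring.items() if d is not None]
    colors = [coloring[v] for v in colored_nodes]

    n_r = 0
    for _ in range(n_perm):
        rng.shuffle(colors)  # preserves the color proportions of the input
        shuffled = dict(coloring)  # uncolored nodes stay uncolored
        shuffled.update(zip(colored_nodes, colors))
        if cost_fn(shuffled) <= observed:
            n_r += 1

    # The input coloring counts as one more realization.
    return (n_r + 1) / (n_perm + 1)
```

Because the observed coloring is counted as one realization, the estimate can never be exactly zero; with `n_perm` reshuffles its smallest possible value is `1 / (n_perm + 1)`.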
4.4. Implementation
The algorithm is implemented in Python and receives as input a
colored tree in either Newick or NEXUS format. Colors are assigned to
nodes either as part of their names, by some convention, or by a
separate table. It is also possible to specify colors for internal
nodes, and these colors are interpreted by the program as part of
the input. The output of the program is an optimally recolored tree
(one of the many possible, saved in both Newick and NEXUS formats),
the list of recolorings of the nodes, the cost of the optimal
convex recoloring, and the p-value. This output can be used by several
tree viewers (e.g., FigTree; Rambaut, 2010), as demonstrated
in our Results section. Our experiments showed that
the program was able to find an optimal convex recoloring for trees
with 200 leaves, randomly colored by up to 60 colors, in a few seconds.
Recall that by Moran and Snir (2008), the algorithm runs in
time that is linear in the number of vertices and even in the number
of good colors, and exponential (that is, fixed-parameter tractable;
Downey and Fellows, 1999) in the number of bad colors.
Consequently, it can also handle larger trees (e.g., of 1000 leaves),
however with a relatively small number of “bad” colors (e.g., 20).
More implementation details can be found in the Appendix.
Acknowledgements
We wish to acknowledge Lana Martin for valuable editing of
the manuscript.
Appendix A. Rigorous proofs for optimal recoloring via
candidate colors
Here we present a formal proof that there exists an optimal convex
recoloring (one of all possible) that recolors nodes
only by their corresponding candidate colors (similar to Lemma 4.7 of
Moran and Snir, 2008). In fact, in our algorithm for finding an optimal convex
recoloring, candidate colors are calculated dynamically,
taking into account restrictions caused by current decisions
not to use some colors in the recoloring of a subtree. Any partial convex
coloring can be naturally extended in such a way that no
uncolored node remains in the carrier of any used
color (let $v \in carrier(d)$; we can define $C(v) := d$; this definition is
correct because the coloring $C$ was initially convex and the carriers of the colors
are not changed, hence $C$ remains convex). We consider
only recolorings in which coloring an uncolored node has zero cost;
hence, for simplicity, all convex colorings and recolorings are
considered after all these extensions (i.e., if some
node is uncolored in the considered convex coloring then it does not
belong to the carrier of any used color). A pair of colors $(d_1, d_2)$ is
considered neighboring in a coloring $C$ (not necessarily convex) if there exist
vertices $v_1$ and $v_2$ such that $C(v_1) = d_1$, $C(v_2) = d_2$, and $v_1$ is connected
with $v_2$ by a single edge or by a path (along edges of the tree) visiting only nodes
uncolored in $C$.
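The notions of carrier, convexity and extension used above can be made concrete with a short sketch. It assumes a tree given as an adjacency dictionary and a coloring given as a dictionary in which uncolored nodes are absent or mapped to `None`; the function names are introduced here for illustration and are not taken from the authors' software.

```python
def carrier(adj, nodes):
    """Node set of the minimal subtree of the tree `adj` containing `nodes`."""
    targets = set(nodes)
    if not targets:
        return set()
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Repeatedly prune leaves that are not target nodes.
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for w in adj[v]:
            if w in keep:
                degree[w] -= 1
                if degree[w] <= 1 and w not in targets:
                    stack.append(w)
    return keep


def color_carriers(adj, coloring):
    """Map each used color d to carrier(C^{-1}(d))."""
    classes = {}
    for v, d in coloring.items():
        if d is not None:
            classes.setdefault(d, set()).add(v)
    return {d: carrier(adj, s) for d, s in classes.items()}


def is_convex(adj, coloring):
    """True if the carriers of distinct colors are pairwise disjoint."""
    seen = set()
    for nodes in color_carriers(adj, coloring).values():
        if seen & nodes:
            return False
        seen |= nodes
    return True


def extend(adj, coloring):
    """Color every node inside a carrier by that carrier's color.

    Assumes `coloring` is convex, so each node lies in at most one carrier.
    """
    extended = dict(coloring)
    for d, nodes in color_carriers(adj, coloring).items():
        for v in nodes:
            extended[v] = d
    return extended
```

On the path a-b-c-d-e with a and c red and e blue, the red carrier is {a, b, c}, the coloring is convex, and the extension colors b red while d stays uncolored.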
Claim A.1. Let $C$ be an input coloring of a tree $T$. There exists a convex
recoloring $C'$ of minimal possible cost such that for each $d \in C'(T)$
there exists a node $v$ such that $C(v) = C'(v) = d$.
Proof. The case of a convex input coloring $C$ (e.g., no colored
nodes) is trivial. Let $C''$ be a convex recoloring of minimal possible cost
with the minimal possible number of colors (it always exists because
the set of nodes is finite). Suppose there exists a color $d \in C''(T)$ such that
every $v \in T$ is uncolored in $C$ or satisfies $C(v) \neq d$. Let $d'$ be one of the neighbor
colors of $d$ in $C''$. Then the recoloring $C'$ coinciding with $C''$ outside $C''^{-1}(d)$
and coloring all nodes of $C''^{-1}(d)$ by $d'$ is convex (because $C''$ is
convex) and uses fewer colors (it does not use $d$). This contradicts the
definition of $C''$. □
Claim A.2. Let $d$ be a candidate color for nodes $u$ and $v$ with respect to an
input coloring $C$. Then $d$ is a candidate for all nodes on the path (along
edges of the tree, without returns) from $u$ to $v$.
Proof. Color $d$ is a candidate for $u$ and $v$, hence there exist colors
$d_0^{(u)}, \ldots, d_{n^{(u)}}^{(u)}$ and $d_0^{(v)}, \ldots, d_{n^{(v)}}^{(v)}$ such that
$u \in carrier(C^{-1}(d_0^{(u)}))$, $v \in carrier(C^{-1}(d_0^{(v)}))$,
$d_{n^{(u)}}^{(u)} = d_{n^{(v)}}^{(v)} = d$,
$carrier(C^{-1}(d_i^{(u)})) \cap carrier(C^{-1}(d_{i-1}^{(u)})) \neq \emptyset$ and
$carrier(C^{-1}(d_j^{(v)})) \cap carrier(C^{-1}(d_{j-1}^{(v)})) \neq \emptyset$
for all $i = 1, \ldots, n^{(u)}$ and $j = 1, \ldots, n^{(v)}$. Hence, there exists a path
from $u$ to $v$ (along edges of the tree) such that color $d$ is a candidate
for all visited nodes. The path from $u$ to $v$ along edges of the tree
without returns is unique, hence color $d$ is a candidate for all its
nodes. □
Recoloring any $n$ colored nodes, each to some color other than its own
(i.e., $C'(v) \neq C(v)$, although it can be that $C'(v) = C(u)$), costs $n$. This
enables us to make the following observation:
Observation A.3. Let $T'$ be a subtree of the tree $T$. Let $C'$ be a recoloring
of $T$ with respect to an input coloring $C$. Suppose all nodes in $T'$ are recolored
by $C'$ only by their non-candidate colors. Then any recoloring of $T$
coinciding with $C'$ on $T \setminus T'$ has cost not higher than that of $C'$.
Claim A.4. Let $C'$ be a convex recoloring (after all extensions, see
above) of $T$ with respect to an input coloring $C$. Assume nodes $u$ and $v$ are
connected by a single edge or by a path along tree edges visiting
only nodes uncolored in $C'$ (and hence not in the carrier of any color in $C'$).
Assume also that $C'(u)$ is a candidate for nodes $u$ and $v$, but $C'(v)$ is not a
candidate for $v$. Then there exists a convex recoloring $C''$ with
$cost(C'') \leq cost(C')$ coloring more nodes by their candidate colors than
$C'$.
Proof. Let $T^{(v,C')}$ be the minimal subtree of $T$ containing all nodes
colored in $C'$ by color $C'(v)$, i.e., $T^{(v,C')} = carrier(C'^{-1}(C'(v)))$. Denote
by $T_0^{(v,C')}$ the maximal subtree of $T^{(v,C')}$ containing node $v$ and not
containing nodes for which $C'(v)$ is a candidate color. Based on
Claim A.2, the set $T^{(v,C')} \setminus T_0^{(v,C')}$ (it can be empty) is connected (see the
example presented in Fig. 9). $C'$ is convex, hence all nodes of $T^{(v,C')}$ are
colored in $C'$ by $C'(v)$. Therefore, the recoloring $C''$ coinciding with $C'$
on $T \setminus T_0^{(v,C')}$ and coloring the nodes of $T_0^{(v,C')}$ by color $C'(u)$ is convex.
Now $C''(v)$ is a candidate (in $C$) for $v$. Color $C'(v)$ was not a candidate
(in $C$) for any node of $T_0^{(v,C')}$, hence all nodes that were colored by
their candidate color in $C'$ remain colored by the same candidate
color in $C''$. Following Observation A.3, the cost of $C''$ is not higher
than the cost of $C'$. □
Fig. 9. Illustration for the proof of Claim A.4. Nodes with white color are considered
uncolored. The color of the internal disk indicates the color in coloring $C$, while the color of the
ring indicates the color in coloring $C'$. Candidate colors are indicated by colored
squares. Color $C'(v)$ cannot be a candidate for nodes 2 and 16 because it is a candidate for
node 7 and not a candidate for node $v$ (see Claim A.2). In this example $T^{(v,C')}$ is a subtree
with nodes $v$, 3, 6, 7, 8, 10, 11, 12, 13, 17, 18 and 20; $T_0^{(v,C')}$ is a subtree with nodes
$v$, 6, 10, 11, 17 and 18; $T_1^{(v,C')}$ is a subtree with nodes 3, 7, 8, 12, 13 and 20. Nodes 6
and 8 should be colored by red in $C'$, and node 9 should be colored by blue in $C'$. Note
that in this example colors blue, green, red and violet are bad. Nevertheless, colors
red and violet are not candidates for nodes with candidate colors green and blue, and
colors green and blue are not candidates for nodes with candidate colors red and
violet. Such a subdivision of bad colors into groups of candidate colors can
dramatically reduce the number of variants in the search for a convex recoloring of
minimal possible cost (see Claim A.6). (For interpretation of the references to color in
this figure legend, the reader is referred to the web version of this article.)
Claim A.5. There exists a recoloring of the minimum possible cost such
that all nodes are recolored by their candidate colors or remain
uncolored.
Proof. Let $C'$ be a convex recoloring (with all possible extensions,
see above) of the minimum possible cost such that for each
$d \in C'(T)$ there exists a node $v$ such that $C(v) = C'(v) = d$ (see Claim
A.1), recoloring the largest possible number of nodes by their candidate
colors. We now show that if some nodes are colored in $C'$ by a
non-candidate color, then either these nodes were uncolored in $C$ (hence the recoloring
$C''$ coinciding with $C'$ on all other nodes and not coloring these
nodes is the one sought) or there exists a convex recoloring $C''$ having
cost (relative to the input coloring $C$) not higher than that of $C'$ and coloring
more nodes by their candidate colors. The latter contradicts the
definition of $C'$, which suffices for the proof. Assume node $v$ is colored
in $C$ and $C'$ such that $C'(v)$ is not a candidate for $v$ (in $C$).
1. Suppose color $C(v)$ is not present in $C'$. Using the notation of
the proof of Claim A.4, the recoloring $C''$ coinciding with $C'$
on $T^{(v,C')} \setminus T_0^{(v,C')}$ and coloring the nodes of $T_0^{(v,C')}$
by color $C(v)$ is convex and has lower cost than $C'$ (because $C''(v) = C(v) \neq C'(v)$),
which contradicts the definition of $C'$.
2. Assume color $C(v)$ is already present in $C'$ (hence the recoloring
$C''$ from (1) can be non-convex). Let $u$ be such that
$C(u) = C'(u) = C(v)$ (it exists by the definition of $C'$). Color $C(v)$ is
a candidate for $u$ and $v$, hence, following Claim A.2, color $C(v)$ is
a candidate for all nodes on the path from $u$ to $v$ (along edges
of the tree, without repeats). Let node $v'$ be the first node on this path
(starting from $u$) colored in $C'$ by a non-candidate color $d$ (e.g., $v'$
can coincide with $v$). Let node $u'$ be the last node on the path
from $u$ to $v'$ colored by its candidate color (e.g., $u'$ can coincide
with $u$). Node $u'$ belongs to the overlap of $carrier(C^{-1}(C'(u')))$ and
$carrier(C^{-1}(C(v))) \supseteq carrier(\{u, v\})$, hence color $C'(u')$ is a
candidate for $u$ and $v$, and, based on Claim A.2, it is a candidate for $v'$.
Hence, based on Claim A.4, there exists a convex recoloring $C''$ having
cost not higher than that of $C'$ and coloring more nodes by candidate
colors. That contradicts our definition of $C'$. □
We now consider a special case of a convex recoloring. A partial
convex recoloring $C'$ is conservative relative to an initial coloring $C$ if it
satisfies the following: (1) only vertices uncolored by $C$ can be
uncolored by $C'$; (2) a node uncolored by $C$ can be colored in $C'$ only
by a bad color of $C$, or remain uncolored; (3) a vertex can change its
color only to a bad color of $C$; and (4) for every color $d$ used in the
coloring $C'$, the set $C'^{-1}(d)$ is connected. In Moran and Snir (2008) it is
shown that an optimal conservative recoloring is also a general
optimal convex recoloring. By our next claim, optimality holds
even if we replace “bad color” (in the definition of Moran and
Snir (2008)) by “candidate color” in the definition of conservative
recoloring (see above). We refer to such conservative recolorings
as candidate conservative recolorings.
Claim A.6. An optimal candidate conservative recoloring is an
optimal convex recoloring in general.
Proof. Let $C'$ be an optimal convex recoloring as in Claim A.5. Using
all possible extensions of $C'$ to nodes uncolored in $C'$, one
obtains an optimal convex recoloring satisfying the conditions of a
candidate conservative convex recoloring. □
Observation A.7. Let $T'$ be a subtree of the tree $T$. Let $C'$ be an optimal
recoloring of the subtree $T'$ restricted to some set of colors. In this case
we can use Claim A.6 with the set of candidate colors restricted accordingly,
i.e., with candidate colors computed after uncoloring the nodes of the excluded colors
(see Fig. 10).
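The effect of excluding a color on the candidate sets can be illustrated with a small sketch. It computes candidate colors statically from the definition in Section 4.2 (own color, containing carrier, or a chain of intersecting carriers), treating nodes of excluded colors as uncolored; it does not reproduce the dynamic recalculation performed during the search. The names `candidate_colors` and `excluded` are introduced here for illustration.

```python
from itertools import combinations


def carrier(adj, nodes):
    """Node set of the minimal subtree of the tree `adj` containing `nodes`."""
    targets = set(nodes)
    if not targets:
        return set()
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:  # prune non-target leaves until none are left
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for w in adj[v]:
            if w in keep:
                degree[w] -= 1
                if degree[w] <= 1 and w not in targets:
                    stack.append(w)
    return keep


def candidate_colors(adj, coloring, excluded=frozenset()):
    """Map every node to its set of candidate colors.

    Nodes whose color is in `excluded` are treated as uncolored.
    """
    classes = {}
    for v, d in coloring.items():
        if d is not None and d not in excluded:
            classes.setdefault(d, set()).add(v)
    carriers = {d: carrier(adj, s) for d, s in classes.items()}

    # Union-find over colors: colors with intersecting carriers are merged,
    # so each cluster is closed under chains of carrier intersections.
    parent = {d: d for d in carriers}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for d1, d2 in combinations(carriers, 2):
        if carriers[d1] & carriers[d2]:
            parent[find(d1)] = find(d2)

    clusters = {}
    for d in carriers:
        clusters.setdefault(find(d), set()).add(d)

    result = {v: set() for v in adj}
    for d, nodes in carriers.items():
        for v in nodes:
            result[v] |= clusters[find(d)]
    return result
```

For example, on the path 1-2-...-7 with nodes 1 and 3 red, 2 and 5 black, and 4 and 6 blue, all three colors are candidates for node 1; once black is excluded, the red and blue carriers no longer meet and node 1 is left with red only.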
Observation A.8. We use the definition of good colors in the sense
of Moran and Snir (2008), i.e., a color $d$ is good in a partial coloring $C$ if
$carrier(C^{-1}(d))$ contains no nodes with other colors and no uncolored
nodes from carriers of other colors. Then there exists an optimal
candidate conservative recoloring with no recoloring of good
colors by other good colors.
Fig. 10. Recoloring without using some colors, and candidate colors. Nodes with
white color are considered uncolored. When all colors may be used,
the carrier of color red overlaps with the carrier of color black (node 10), and the carrier
of color black overlaps with the carriers of colors blue and green (nodes 11 and 13);
hence color red is a candidate for node 9, and colors blue and green are candidates
for node 1. If a convex recoloring of minimal possible cost is sought under
the condition that color black is not used, then node 1 has only one candidate color (red)
and node 9 has only two candidate colors (green and blue). (For interpretation of
the references to color in this figure legend, the reader is referred to the web version
of this article.)
Using Observation A.8 we can improve the algorithm searching
for an optimal convex recoloring.
Algorithm for searching for an optimal convex recoloring: For a
rooted tree $T$, we denote by $T_v$ the subtree rooted at vertex $v$. We
designate by $P_s(v, d, D)$ the minimal cost of a candidate conservative
convex recoloring $C'$ of $T_v$ such that $C'(v) = d$ and $C'$ uses only
colors from $D$ for the descendants of $v$. For convenience, we use the
symbol $H$ to denote the “color” of an uncolored vertex and assume
an infinite cost for uncoloring a colored vertex. Denote
$P_s(v, D) := \min_{d \in D \cup \{H\}} P_s(v, d, D)$, the minimum cost of a convex
recoloring that uses colors from $D$. We also designate by $P_c(v, d, D)$ the
minimal cost of a convex recoloring under which either
$C'(v) = d$ or $C'(T_v)$ does not contain $d$. This means that
$P_c(v, d, D) = \min\{P_s(v, d, D \cup \{d, H\}),\ \min_{d' \in (D \cup \{H\}) \setminus \{d\}} P_s(v, d', (D \cup \{H\}) \setminus \{d\})\}$.
Assume $T$ is rooted at some vertex $v_r$. Let $\delta_{a,B}$ denote the inverse
Kronecker delta, such that $\delta_{a,B} = 0$ if $a \in B$, and $\delta_{a,B} = 1$ otherwise.
Denote by $\mathcal{C} = \mathcal{C}_C(T)$ the set of all node colors used in the coloring $C$.
Then, analogously to Lemma 4.8 of Moran and Snir (2008), the
cost of a minimal convex recoloring of the entire tree $T$ can be
written as $P_s(v_r, \mathcal{C})$ and is calculated recursively:
$P_s(v, d, D) = \delta_{C(v),\{H,d\}} + \min_{(D_1, \ldots, D_k)} \sum_{i=1}^{k} P_c(v_i, d, D_i)$,
where $v_i$ are the children of vertex $v$, $\bigcup_{i=1}^{k} D_i = D$, and $D_i \cap D_j = \emptyset$ for $i \neq j$.
The restriction to candidate conservative convex recolorings rather than all conservative
convex recolorings has no asymptotic implication for the running
time of the algorithm; however, it allows us to discard a large fraction
of valid color partitions $(D_1, \ldots, D_k)$ and reduce the running
time dramatically. The candidate criterion is implemented
recursively, one child after the other, where the color assignment of the
$i$-th child is checked only after it is guaranteed that for all $j < i$
the color assignment of the $j$-th child satisfies the candidate
criterion.
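To give a feeling for why discarding color partitions matters, the sketch below enumerates the assignments of colors to children that the minimization ranges over. It follows the counting used in the caption of Fig. 8, where each color goes to one child or to none, and it is only an illustration: the function name and the `allowed` argument (a hypothetical map from a color to the child indices for which it is still a candidate) are assumptions made here, not part of the published implementation.

```python
from itertools import product


def child_partitions(colors, n_children, allowed=None):
    """Yield tuples (D_1, ..., D_k) of pairwise disjoint color sets.

    Each color is given to exactly one child or to no child. If `allowed`
    is provided, allowed[color] restricts the children a color may go to
    (an assumed stand-in for the candidate criterion).
    """
    colors = list(colors)
    options = []
    for d in colors:
        if allowed is None:
            kids = list(range(n_children))
        else:
            kids = sorted(allowed.get(d, range(n_children)))
        options.append([None, *kids])  # None means "given to no child"

    for choice in product(*options):
        parts = [set() for _ in range(n_children)]
        for d, i in zip(colors, choice):
            if i is not None:
                parts[i].add(d)
        yield tuple(frozenset(p) for p in parts)
```

With four colors and two children the unrestricted enumeration has $3^4 = 81$ members; if two of the colors may only go to the first child or to none, it shrinks to $3 \cdot 3 \cdot 2 \cdot 2 = 36$.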
Appendix B. Implementation details
This algorithm is implemented in Python and receives as input a
colored tree in either Newick or NEXUS format. In the standard Newick
format, node names (captions that can include a color) can be
specified only for leaves. A color is assigned to a leaf by a specification
immediately after the leaf name: $⟨Color⟩$. It is also possible to set
the initial coloring of leaves by a separate table in the following format:
⟨LeafIdInTree⟩ ⟨NameOfLeafToDisplay⟩ ⟨Color⟩ (see examples in the
ReadMe.txt file). In the NEXUS format it is also possible to assign input
colors to internal nodes (in this case the program assigns a vertex
the color that is indicated on the vertex's incoming edge:
[&!color = #-⟨CodeOfColor⟩]: ⟨LengthOfEdge⟩). The output of the
program is an optimally convex-recolored tree (one of the many
possible, saved in both Newick and NEXUS formats), the recoloring
of the nodes, and the cost of the optimal convex recoloring.
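As an illustration of the leaf-color convention described above, the following sketch extracts colors from a Newick string in which a leaf name is immediately followed by its color between dollar signs. The exact rules accepted by the program are documented in its ReadMe.txt; the regular expression and function names below are assumptions made for this example only.

```python
import re

# Assumed label shape: a leaf name directly followed by $Color$.
LEAF_COLOR = re.compile(r"([^\s(),:;$\[\]]+)\$([^$]+)\$")


def leaf_colors(newick):
    """Return {leaf_name: color} for labels written as name$Color$."""
    return {name: color for name, color in LEAF_COLOR.findall(newick)}


def strip_colors(newick):
    """Remove the $Color$ suffixes, leaving a plain Newick string."""
    return LEAF_COLOR.sub(r"\1", newick)
```

Leaves without a color suffix are simply absent from the returned dictionary, which corresponds to treating them as uncolored.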
References
Avni, E., Cohen, R., Snir, S., 2015. Weighted quartets phylogenetics. Syst. Biol. 64 (2),
233–242.
Bapteste, E., Susko, E., Leigh, J., MacLeod, D., Charlebois, R.L., Doolittle, W.F., 2005.
Do orthologous gene phylogenies really support tree-thinking? BMC Evol. Biol.
5, 33.
Bar-Yehuda, R., Feldman, I., Rawitz, D., 2008. Improved approximation algorithm for
convex recoloring of trees. Theory Comput. Syst. 43 (1), 3–18.
Berkum, P., Terefework, Z., Paulin, L., Suomalainen, S., Lindstrom, K., Eardly, B.D.,
2003. Discordant phylogenies within the rrn loci of rhizobia. J. Bacteriol. 185
(10), 2988–2998.
Bininda-Emonds, O.R.P., Gittleman, J.L., Steel, M.A., 2002. The (super)tree of life:
procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33 (1), 265–289.
Bodlaender, H.L., Fellows, M.R., Warnow, T., 1992. Two strikes against perfect
phylogeny. In: ICALP, pp. 273–283.
Bodlaender, H.L., Fellows, M.R., Langston, M.A., Ragan, M.A., Rosamond, F.A., Weyer,
M., 2011. Quadratic kernelization for convex recoloring of trees. Algorithmica
61 (2), 362–388.
Bordewich, M., Semple, C., 2005. On the computational complexity of the rooted
subtree prune and regraft distance. Ann. Comb. 8, 409–423. http://dx.doi.org/
10.1007/s00026-004-0229-z.
Campêlo, M., Lima, K.R., Moura, P.F.S., Wakabayashi, Y., 2013. Polyhedral studies on
the convex recoloring problem. Electron. Notes Discrete Math. 44, 233–238.
Cramp, S., Perrins, C.M., Brooks, D.J., Dunn, E., 1993. Handbook of the birds of
Europe, the Middle East and North Africa: the birds of the western Palearctic.
Flycatchers to shrikes, vol. VII. Oxford University Press.
Creevey, C.J., McInerney, J.O., 2005. Clann: investigating phylogenetic information
through supertree analyses. Bioinformatics 21 (3), 390–392.
Delsuc, F., Brinkmann, H., Philippe, H., 2005. Phylogenomics and the reconstruction
of the tree of life. Nat. Rev. Genet. 6 (5), 361–375.
Dewhirst, F.E., Shen, Z., Scimeca, M.S., Stokes, L.N., Boumenna, T., Chen, T., Paster, B.
J., Fox, J.G., 2005. Discordant 16S and 23S rRNA gene phylogenies for the Genus
Helicobacter: implications for phylogenetic inference and systematics. J.
Bacteriol. 187 (17), 6106–6118.
Doolittle, W.F., 1999. Phylogenetic classification and the universal tree. Science 284
(5423), 2124–2129.
Downey, R.G., Fellows, M.R., 1999. Parameterized Complexity. Springer.
Eisen, J.A., Fraser, C.M., 2003. Phylogenomics: intersection of evolution and
genomics. Science 300 (5626), 1706–1707.
Fernandez-Baca, D., 2001. The perfect phylogeny problem. In: Cheng, X., Du, D.Z.
(Eds.), Steiner Trees in Industry. Kluwer.
Fitch, W.M., 1971. Towards defining the course of evolution: minimum change for a
specified tree topology. Syst. Zool. 20, 406–416.
Ginn, H.B., Melville, D.S., 1983. Moult in Birds. BTO Guide, British Trust for
Ornithology.
Hall, K.S., Tullberg, B.S., 2004. Phylogenetic analyses of the diversity of moult
strategies in Sylviidae in relation to migration. Evol. Ecol. 18 (1), 85–105.
Hein, J., 1990. Reconstructing evolution of sequences subject to recombination
using parsimony. Math. Biosci. 98 (2), 185–200.
Jenni, L., Winkler, R., 1994. Moult and Ageing of European Passerines. Academic
Press.
Kammer, F., Tholey, T., 2012. The complexity of minimum convex coloring. Discrete
Appl. Math. 160 (6), 810–833.
Kanj, I.A., Kratsch, D., 2009. Convex recoloring revisited: complexity and exact
algorithms. In: Computing and Combinatorics. Springer, pp. 388–397.
Matsen, F.A., 2015. Phylogenetics and the human microbiome. Syst. Biol. 64 (1),
e26–e41.
Moran, S., Snir, S., 2005. Efficient approximation of convex recolorings. In:
Approximation, Randomization and Combinatorial Optimization, Algorithms
and Techniques, 8th International Workshop on Approximation Algorithms for
Combinatorial Optimization Problems, APPROX 2005 and 9th International
Workshop on Randomization and Computation, RANDOM 2005,
Berkeley, CA, USA, August 22–24, 2005, Proceedings, pp. 192–208.
Moran, S., Snir, S., 2005. Convex recolorings of strings and trees: definitions,
hardness results and algorithms. In: Algorithms and Data Structures, 9th
International Workshop, WADS 2005, Waterloo, Canada, August 15–17, 2005,
Proceedings, pp. 218–232.
Moran, S., Snir, S., 2008. Convex recolorings of strings and trees: definitions,
hardness results and algorithms. J. Comput. Syst. Sci. 74 (5), 850–869.
Ochman, H., Lawrence, J.G., Groisman, E.A., 2000. Lateral gene transfer and the
nature of bacterial innovation. Nature 405 (6784), 299–304.
Gogarten, J.P., Townsend, J.P., 2005. Horizontal gene transfer, genome innovation
and evolution. Nat. Rev. Micro. 3 (9), 679–687.
Gogarten, J.P., Ford Doolittle, W., Lawrence, J.G., 2002. Prokaryotic evolution in light
of gene transfer. Mol. Biol. Evol. 19 (12), 2226–2238.
Puigbó, P., Wolf, Y.I., Koonin, E.V., 2009. Search for a ‘tree of life’ in the thicket of the
phylogenetic forest. J. Biol. 8 (6), 59.
Puigbó, P., Wolf, Y.I., Koonin, E.V., 2010. The tree and net components of prokaryote
evolution. Genome Biol. Evol. 2, 745–756.
Rambaut, A., 2010. Figtree v1.3.1. Institute of Evolutionary Biology. University of
Edinburgh.
Schouls, L.M., Schot, C.S., Jacobs, J.A., 2003. Horizontal transfer of segments of the
16S rRNA genes between species of the Streptococcus anginosus group. J.
Bacteriol. 185 (24), 7241–7246.
Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.
S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V., 2001. The COG
database: new developments in phylogenetic classification of proteins from
complete genomes. Nucl. Acids Res. 29 (1), 22–28.
Treplin, S., Siegert, R., Bleidorn, C., Thompson, H.S., Fotso, R., Tiedemann, R., 2008.
Molecular phylogeny of songbirds (Aves: Passeriformes) and the relative utility
of common nuclear marker loci. Cladistics 24 (3), 328–349.
Wasserman, L., 2004. All of Statistics. Springer, New York.
Yap, W.H., Zhang, Z., Wang, Y., 1999. Distinct types of rrna operons exist in the
genome of the Actinomycete Thermomonospora chromogena and evidence for
horizontal transfer of an entire rRNA operon. J. Bacteriol. 181 (17), 5201–5209.
Zhang, J., Kumar, S., 1997. Detection of convergent and parallel evolution at the
amino acid sequence level. Mol. Biol. Evol. 14 (5), 527–536.
Zhaxybayeva, O., Lapierre, P., Gogarten, J.P., 2004. Genome mosaicism and
organismal lineages. Trends Genet 20, 254–260.