Constructing a level-2 phylogenetic network
from a dense set of input triplets
Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1,2)
(1) Technische Universiteit Eindhoven (TU/e)
(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam
Email: [email protected]
Web: http://homepages.cwi.nl/~kelk
Triplet-based methods (1)
Given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.)
Find a tree from which each of the triplets can be obtained as a minor, that is, by contracting and deleting edges.
[Figure: the four input triplets are passed to the algorithm, which outputs the solution tree on leaves x, y, z, w.]
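A minimal sketch of the classical tree-building procedure of Aho et al. (1981) for this problem, in Python. The tuple encoding (a, b, c) for the triplet ab|c, the function name build and the nested-tuple output are assumptions made for illustration; they do not come from the slides.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None.

    A triplet ab|c is encoded as the tuple (a, b, c).
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only the triplets whose three leaves all lie in the current leaf set.
    relevant = [t for t in triplets if set(t) <= leaves]
    # Union-find: for every triplet ab|c, a and b must end up in the same subtree.
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b, _ in relevant:
        parent[find(a)] = find(b)
    blocks = {}
    for v in leaves:
        blocks.setdefault(find(v), set()).add(v)
    if len(blocks) == 1:
        return None  # the leaves cannot be split at the root: no tree exists
    subtrees = []
    for block in blocks.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


tree = build("wxyz", [("z", "w", "x"), ("y", "x", "w"),
                      ("x", "y", "z"), ("w", "z", "y")])
```

For the four triplets above, the root splits the leaves into {x, y} and {z, w}. For the input {xy|z, xz|y} the function returns None, because no tree is consistent with both triplets.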
From trees to networks…
• The algorithm of Aho et al. (1981) can be used to construct trees from rooted
triplets.
• But…what if the algorithm fails? Why might the algorithm fail?
• Possible reason 1: The underlying evolution is tree-like, but the input triplets
contain errors.
• Possible reason 2: The triplets are correct, but the underlying evolution is not
tree-like. Biological phenomena such as hybridization, horizontal gene transfer,
recombination and gene duplication can lead to evolutionary scenarios that are
not tree-like!
• Response: try to construct not phylogenetic trees, but phylogenetic networks.
From trees to networks (2)
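• For example, suppose the input is {xy|z, xz|y}. No tree is consistent with both triplets, but a network with a recombination vertex can be.
[Figure: a network on leaves x, y, z, shown with the triplets xy|z and xz|y.]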
(Note that there are cases when, even if there is at most one triplet per 3
species, a tree is not possible)
Level-k phylogenetic networks
A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices.
[Figure: example network on leaves x, y, z, with labels: root (only one!), split-vertex, recombination-vertex, leaf-vertex.]
Level-1 Networks
• A set of input triplets is dense iff, for every subset of 3 species, there is at
least one triplet corresponding to those 3 species.
• Therefore, a dense set of input triplets for n species contains O(n^3) triplets (at least one and at most three for each subset of 3 species).
• Jansson & Sung (2006) showed:
Given a dense set of triplets T for a set L of species, it is possible to
determine in polynomial-time whether a level-1 phylogenetic network N
exists such that all the triplets in T are consistent with N. (And if so, to
construct such a network.)
• They later showed, together with Nguyen, how to do this in time linear in |T|.
They also showed that, in the non-dense case, the problem is NP-hard.
• But what about level-2 networks, and higher?
[Figure: an example of a level-2 network.]
Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
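The density condition defined on this slide can be checked directly. The following is a minimal sketch, assuming a triplet ab|c is encoded as the tuple (a, b, c); the function name is_dense is hypothetical.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """True iff every 3-subset of the leaves is covered by at least one triplet."""
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered for c in combinations(leaves, 3))


# The four triplets from the first example: zw|x, yx|w, xy|z, wz|y.
T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
dense = is_dense("wxyz", T)
```

With 4 species there are 4 subsets of 3 species, and each of the four example triplets covers a different one, so this set is dense. Dropping any one triplet makes it non-dense.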
Algorithm, basic idea
• The basic idea behind Aho’s algorithm for trees is that we are able to determine, recursively, which species belong to which of the two subtrees hanging from some root vertex.
• For level-1 and level-2 networks, if such a clear dichotomy exists again, we recurse on the two subsets.
[Figure: a root with two subnetworks hanging below it.]
Algorithm, basic idea (2)
• For level-1 networks, if such a clear dichotomy exists again, we recurse on the two subsets. Otherwise there must exist a network of the following form:
[Figure: a blue backbone network with five subnetworks hanging from it.]
• Find the partition of the species (leaves) into the subnetworks.
• Find the blue backbone network.
• Treat each of the partition elements (subnetworks) as a leaf to be hung on the backbone.
• Recurse on the subnetworks.
Algorithm, high-level idea
For level-2 networks the idea is similar:
• Find the partition of the species (leaves) into the subnetworks. (There is a complication in level-2.)
• Find the blue backbone network! (There are more level-2 backbone forms.)
• Treat each of the partition elements (subnetworks) as a (meta-)leaf to be hung on the backbone.
• Recurse on the subnetworks.
[Figure: a backbone network with five subnetworks hanging from it.]
Definition: inducing new triplet sets from partitions of
the leaf set
• Suppose I have a partition P = {P1, P2, …, Pt} of the leaf set L.
• Suppose I have a dense set of triplets T on the leaf set L.
• Let T’ be a new triplet set on leaf set {q1, q2,…, qt} defined as follows:
• qiqj|qk is in T’ if and only if i, j and k are pairwise distinct and there exists a triplet xy|z in T such that x is in Pi, y is in Pj and z is in Pk.
• Then we say that T’ is the triplet set induced by the partition P of L.
• Critically: if T is dense, then T’ is also dense.
• In some sense this can be perceived as a ‘coarsening’ of the input set.
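The definition above can be sketched in a few lines of Python. This is an illustration only: the tuple encoding (x, y, z) for xy|z, the use of block indices in place of the meta-leaves q1, …, qt (counted from 0), and the name induced_triplets are all assumptions.

```python
def induced_triplets(partition, triplets):
    """Triplet set induced by a partition of the leaf set.

    partition: list of disjoint sets of leaves covering all leaves.
    Returns a set of index triplets (i, j, k), meaning q_i q_j | q_k.
    """
    block_of = {leaf: i for i, block in enumerate(partition) for leaf in block}
    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        if len({i, j, k}) == 3:  # the three blocks must be pairwise distinct
            # Sort the unordered pair so q_i q_j|q_k and q_j q_i|q_k coincide.
            induced.add((min(i, j), max(i, j), k))
    return induced


T = [("x", "y", "z"), ("x", "y", "w"), ("z", "w", "x"), ("z", "w", "y")]
T_induced = induced_triplets([{"x", "y"}, {"z"}, {"w"}], T)
```

Here the triplets xy|z and xy|w are dropped because x and y fall into the same block, while zw|x and zw|y both map to the single induced triplet q1q2|q0.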
Definition: simple level-2 networks
Lemma: There are exactly 4 different backbone networks.
[Figure: the 4 backbone structures.]
A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures.
A picture description of the simple level-2 algorithm
Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.
Level-2 network algorithm
Assume some oracle gives us the partition of the leaves into subnetworks.
Treat each subnetwork as a leaf and construct a simple level-2 network.
The simple level-2 network algorithm
• Guess the right “recombination leaf”.
• Remove it, together with all triplets that contain this leaf. One recombination vertex is left, with a caterpillar below it.
Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination vertex. If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination vertices are needed).
• Guess the right “caterpillar set”.
Caterpillar set: a caterpillar set with respect to a dense triplet set T is the set of leaves of a caterpillar subgraph of a network consistent with T. The empty set is also a caterpillar set.
Suppose we subsequently guess that the caterpillar containing h now hangs below a recombination vertex in the new network. If we remove the h-caterpillar, and all triplets that contain leaves of it, then we know that a level-0 network (a tree) must be possible on this new set of triplets (because now even fewer recombination vertices are needed).
• Remove it, together with all triplets that contain any element of this set.
• Construct the unique tree for the remaining triplets [Jansson & Sung 2006].
In such a case the resulting tree is unique (J&S). So now we have a tree. We are going to guess how to add the h-caterpillar back in, and then guess how to add leaf g back in.
[Figure: adding the h-caterpillar back in.]
[Figure: and finally adding leaf g back in.]
• Insert the caterpillar set and the recombination leaf into the tree in the correct way.
For each pair of guesses, try all 4 backbone structures.
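The removal step used twice in the algorithm above (first for the recombination leaf, then for the caterpillar set) can be sketched as follows. This covers only the bookkeeping: testing whether the reduced triplet set admits a level-1 network or a tree, and re-inserting the removed leaves, are not shown. The names remove_leaves and recombination_leaf_guesses and the tuple encoding (a, b, c) for ab|c are assumptions.

```python
def remove_leaves(triplets, removed):
    """Drop every triplet that contains at least one removed leaf."""
    removed = set(removed)
    return [t for t in triplets if removed.isdisjoint(t)]


def recombination_leaf_guesses(leaves, triplets):
    """Yield each candidate recombination leaf g with the reduced triplet set."""
    for g in leaves:
        yield g, remove_leaves(triplets, {g})


T = [("x", "y", "z"), ("x", "y", "w"), ("z", "w", "x"), ("z", "w", "y")]
guesses = dict(recombination_leaf_guesses("wxyz", T))
```

"Guessing" is implemented here as plain enumeration: every leaf is tried in turn, which is one natural way to read the slide but is an interpretation rather than a statement from it.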
Simple level-2 algorithm
Theorem: The simple level-2 network algorithm runs in time O(|T|^3).
SN-sets to partition the set of leaves
• Jansson & Sung introduced the SN-set to partition the set of leaves.
• SN-sets are special subsets of the leaves L, and are defined w.r.t. T.
• All sets containing just a single leaf are SN-sets.
• Any other SN-set is obtained by taking the closure of some subset S of the leaves L w.r.t. the following operation:
If x,y ∈ S and xz|y ∈ T or yz|x ∈ T, then z ∈ S.
• The SN-set that is equal to the total leaf set L is called the trivial SN-set.
• An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set.
(If the network is a tree, there are 2 maximal SN-sets: the set of leaves of the left subtree of the root, and the set of leaves of the right subtree.)
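The closure operation above can be sketched as a simple fixed-point loop. This is an unoptimized illustration, not the method of Jansson & Sung. It assumes that a triplet ab|c is encoded as the tuple (a, b, c), and that for a dense T it suffices to take closures of single leaves and of pairs of leaves to find the maximal SN-sets. The function names are hypothetical.

```python
from itertools import combinations


def sn_closure(seed, triplets):
    """Close a set of leaves under: x,y in S and xz|y in T implies z in S."""
    closed = set(seed)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:
            # Triplet ab|c: if c and exactly one of a, b are in the set,
            # the other one must be added.
            if c in closed and (a in closed) != (b in closed):
                closed.update((a, b))
                changed = True
    return frozenset(closed)


def maximal_sn_sets(leaves, triplets):
    """Non-trivial SN-sets not strictly contained in another non-trivial SN-set."""
    leaves = frozenset(leaves)
    candidates = {frozenset([v]) for v in leaves}
    # Assumption: closures of pairs are enough in the dense case.
    candidates |= {sn_closure(p, triplets) for p in combinations(leaves, 2)}
    nontrivial = {s for s in candidates if s != leaves}
    return {s for s in nontrivial if not any(s < other for other in nontrivial)}


# Dense triplet set of the tree with subtrees {x, y} and {z, w} below the root.
T = [("x", "y", "z"), ("x", "y", "w"), ("z", "w", "x"), ("z", "w", "y")]
maximal = maximal_sn_sets("wxyz", T)
```

For this tree the closure of {x, y} stays {x, y}, the closure of {x, z} grows to the whole leaf set, and the two maximal SN-sets are the leaf sets of the two subtrees of the root, matching the remark on the slide.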
Definition: maximal SN-set
• Jansson and Sung proved that the maximal SN-sets indeed partition the leaf set L: no two maximal SN-sets overlap, and together they cover the whole set of input leaves.
• All SN-sets and all maximal SN-sets can be found in polynomial time.
• Jansson & Sung solved the level-1 problem by observing that each maximal SN-set hangs as a ‘meta-leaf’ on the level-1 backbone network; each maximal SN-set can be completely separated from the rest of the network by removing just one edge.
• There are maximal SN-sets in level-2 networks that can hang under more than one edge!
Definition: highest cut-edge
• In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the underlying undirected graph.
• A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf.
• A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
• Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set.
• Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. Clearly, for any two leaves p,q in L’ and any leaf r outside L’ there cannot be triplets pr|q and qr|p: the edge (x,y) forms a bottleneck. Thus, since T is dense, pq|r must exist.
So: each maximal SN-set can be expressed as the union of the leaves reachable from one or more highest cut-edges.
[Figure: a highest cut-edge (x,y) with the leaf set L’ below y containing p and q, a leaf r outside L’, and the triplet pq|r.]
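The definitions of cut-edge and highest cut-edge can be made concrete with a small brute-force sketch. It assumes the network is given as a list of directed edges (u, v) without parallel edges; the function names and the example network are invented for illustration, and no attempt is made at efficiency.

```python
def cut_edges(edges):
    """Edges whose removal disconnects the underlying undirected graph."""
    nodes = {v for e in edges for v in e}

    def connected_without(skip):
        adj = {v: set() for v in nodes}
        for u, v in edges:
            if (u, v) != skip:
                adj[u].add(v)
                adj[v].add(u)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen == nodes

    return [e for e in edges if not connected_without(e)]


def descendants(edges, start):
    """All vertices reachable from start by a directed path (including start)."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    seen, stack = {start}, [start]
    while stack:
        for w in children.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def highest_cut_edges(edges):
    """Cut-edges (x, y) with no other cut-edge (p, q) such that q reaches x."""
    cuts = cut_edges(edges)
    return [(x, y) for (x, y) in cuts
            if not any((p, q) != (x, y) and x in descendants(edges, q)
                       for (p, q) in cuts)]


# Hypothetical level-1 network: cycle r-a-c-b with leaf x below the
# recombination vertex c, a subtree {y1, y2} below a, and leaf z below b.
N = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
     ("a", "d"), ("d", "y1"), ("d", "y2"), ("b", "z")]
highest = highest_cut_edges(N)
```

In this example the edges (d,y1) and (d,y2) are cut-edges but not highest, because the cut-edge (a,d) lies above them. The leaves below the three highest cut-edges are {x}, {y1, y2} and {z}.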
Central Theorem (simplified). Suppose there is a dense triplet set T
consistent with some simple level-2 network N. Then there exists a level-2
network N’ (not necessarily simple) such that, with the exception of perhaps
one maximal SN-set with respect to T, every maximal SN-set appears below
a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it
exists) will be equal to the union of leaves below two cut-edges.
In other words: there exists at most one maximal SN-set which is the union of the leaves below two highest cut-edges, whereas all other maximal SN-sets consist of the leaves below one highest cut-edge.
The algorithm
• Determine the maximal SN-sets.
• Guess the right SN-set to be split.
• Treat the maximal SN-sets and the two split sets as leaves {S1,S2,…,Sq}.
• Adapt T to a new triplet set T’: SiSk|Sh ∈ T’ if and only if there exist x ∈ Si, y ∈ Sk, z ∈ Sh s.t. xy|z ∈ T.
• Construct a simple level-2 network for T’.
• Recursively find the subnetworks for the sets S1,S2,…,Sq.
Conclusions & open problems
• So we know how to efficiently construct level-2 networks from dense triplet
sets. What’s next?
• Applicability: how useful is it?
• Initial implementation: programming and fine-tuning
• Improving running time: in the spirit of the “SN-tree” of J&S&N
• Complexity: what about level-3 and higher?
• Bounds: worst-case, best-case scenarios
• Building all networks
• Properties of output networks as function of input
• Different triplet restrictions
• Confidence: how good are the solutions?
• Exponential-time exact algorithms for NP-hard problems