A Fundamental Decomposition Theory for Phylogenetic Networks

Decomposition Theory for
ARGs and MinARGs
Dan Gusfield
Reconstructing the Evolution
of Binary Bio-Sequences
(SNPs)
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with
recombination (ARG)
• Blobbed Trees
• Incompatibility Graph and its Connected
Components
• ARG Decomposition Theorem and Proof
Sketch
• NASC for a fully-decomposed MinARG
The Perfect Phylogeny Model
for binary sequences
Only one mutation per site
allowed (infinite sites)
[Figure: a perfect phylogeny on sites 1 to 5, rooted at the ancestral sequence 00000. Each site mutation (1, 2, 3, 4, 5) labels one edge; the extant sequences are at the leaves.]
The tree derives the set M:
10100
10000
01011
01010
00010
When can a set of sequences
be derived on a perfect
phylogeny?
Classic NASC (necessary and sufficient condition): Arrange the sequences in
a matrix. Then (with no duplicate
columns), the sequences can be
generated on a unique perfect
phylogeny if and only if no two columns
(sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test
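The test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `incompatible_pairs` and `has_perfect_phylogeny` are hypothetical, sequences are assumed to be equal-length strings of 0s and 1s, and sites are numbered from 1.

```python
from itertools import combinations

def incompatible_pairs(seqs):
    """Return the 1-based site pairs whose two columns contain all four gametes."""
    m = len(seqs[0])
    bad = []
    for i, j in combinations(range(m), 2):
        # Collect the distinct (site i, site j) value pairs seen across all rows.
        gametes = {(s[i], s[j]) for s in seqs}
        if len(gametes) == 4:
            bad.append((i + 1, j + 1))
    return bad

def has_perfect_phylogeny(seqs):
    """Four-gamete test: a perfect phylogeny exists iff no pair of sites fails."""
    return not incompatible_pairs(seqs)

# The set M derived by the tree on the previous slide.
M = ["10100", "10000", "01011", "01010", "00010"]
tree_like = has_perfect_phylogeny(M)
print(tree_like)
```

Here `tree_like` is True, matching the fact that M was derived on a perfect phylogeny.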
A richer model
10100
10000
01011
01010
00010
10101 added
Pair 4, 5 fails the four-gamete
test. The sites 4, 5
are “incompatible”
[Figure: the perfect phylogeny for the original five sequences, on sites 1 to 5 with root 00000 and leaves 10100, 10000, 00010, 01011, 01010.]
Real sequence histories often involve recombination.
Sequence Recombination
[Figure: single-crossover recombination. P = 10100 and S = 01011 recombine at point 5 to give 10101.]
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
Called “crossing over” in genetics
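A single crossover is easy to state as code. The sketch below assumes sequences are equal-length strings and that the recombination point is the 1-based index of the first site taken from S; the function name `recombine` is hypothetical.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single-crossover recombination.

    Sites 1 .. point-1 come from prefix_parent (P),
    sites point .. m come from suffix_parent (S).
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point must be a site index")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

# P contributes sites 1-4, S contributes site 5.
child = recombine("10100", "01011", 5)
print(child)  # 10101
```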
Network with Recombination =
ARG
10100
10000
01011
01010
00010
10101 new
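[Figure: the previous tree on sites 1 to 5 (root 00000), extended by one recombination node with parents marked P and S and crossover point 5, which produces 10101.]
The previous tree with one
recombination event now derives
all the sequences.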
A Phylogenetic Network
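[Figure: a phylogenetic network on sites 1 to 5, rooted at 00000, with site mutations on the tree edges and two recombination nodes: one with crossover point 3 producing 10100, and one with crossover point 4 producing 01101.]
Leaf sequences:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101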
Minimizing Recombinations
• Any set M of sequences can be generated by a phylogenetic
network with enough recombinations, and one mutation per site.
This is not interesting or useful.
• However, the number of (observable) recombinations is small in
realistic sets of sequences. “Observable” depends on n and m
relative to the number of recombinations.
• Problem: Given a set of sequences M, find a phylogenetic
network generating M, minimizing the number of
recombinations. NP-hard (Wang et al 2000, Semple et al 2004)
Decomposition can help
First we introduce the viewpoint needed.
Blobs in Networks
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a “recombination cycle”.
• In a phylogenetic Network a maximal set of
(edge) intersecting cycles is called a blob.
A maximal set of intersecting
cycles forms a Blob
[Figure: the example network on sites 1 to 5, whose two recombination cycles (crossover points 3 and 4) together form a single blob.]
If directions on the edges are removed, a blob is
a bi-connected component of the network.
Blobbed-trees
• Contracting each blob to a single node results in a
directed, rooted tree, otherwise one of the “blobs”
was not maximal.
• So every phylogenetic network can be viewed as a
directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
Every network is a
tree of blobs.
How do the tree parts
and the blobs relate?
How can we exploit
this relationship?
Ugly tangled
network inside
the blob.
Incompatible Sites
Recall that a pair of sites (columns) of M that
fails the 4-gamete test is said to be
incompatible.
A site that is in no such pair is
compatible.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
Incompatibility Graph G(M)
[Figure: G(M), with one node per site 1 to 5.]
Two nodes are connected iff the pair
of sites is incompatible, i.e., fails the
4-gamete test.
G(M) has two connected components: {1, 3, 4} and {2, 5}.
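The graph and its components can be computed directly from M. The following is a minimal sketch under the same assumptions as the slides (binary strings, sites numbered from 1); `incompatibility_graph` and `connected_components` are hypothetical names, and the component search is a plain depth-first traversal.

```python
from itertools import combinations

def incompatibility_graph(seqs):
    """Adjacency sets of G(M): sites i and j are joined iff they fail the 4-gamete test."""
    m = len(seqs[0])
    adj = {site: set() for site in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        if len({(s[i], s[j]) for s in seqs}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj

def connected_components(adj):
    """Connected components of the graph, each returned as a sorted list of sites."""
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}
G = incompatibility_graph(list(M.values()))
components = connected_components(G)
print(components)  # [[1, 3, 4], [2, 5]]
```

For this M the only incompatible pairs are (1, 3), (1, 4) and (2, 5), which gives the two components.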
The connected components of
G(M) are very informative
• The number of non-trivial connected
components is a lower bound on the number
of recombinations needed in any network
(Bafna, Bansal; Gusfield, Hickerson).
• When each blob is a single-cycle (galled-tree
case) all the incompatible sites in a blob
must come from a single connected
component C, and that blob must contain all
the sites from C. Compatible sites need not
be inside any blob. (Gusfield et al 2003-5)
Galled-Tree Structure
So when each blob contains only a single cycle,
there is a one-one correspondence between
the blobs and the non-trivial connected
components of the incompatibility graph.
Motivating Question: To what extent does this
clean one-one structure carry over to general
phylogenetic networks? How do we exploit
the general structure?
First Observation
First, in any network N for M, all sites from the
same connected component of G(M) must
appear together in a single blob in N. This follows
by transitivity from
the simple observation that two incompatible
sites must be in a common cycle in any ARG
for M.
So we can’t decompose more finely than the connected components.
The Decomposition Theorem
For any set of sequences M, there is a phylogenetic
network that derives M, where each blob contains all
and only the sites in one non-trivial connected
component of G(M). The compatible sites can
always be put on edges outside of any blob.
This “fully-decomposed” network is the finest
decomposition possible.
Moreover, the tree part of T(M) is invariant
over all the fully-decomposed networks
for M, and can be determined in
polynomial-time.
So, we can find a network for M by solving
the recombination minimization problem for
each connected component of G(M)
separately, and then connect those
subnetworks in an invariant way.
A Phylogenetic Network with
one Blob
[Figure: the example network on sites 1 to 5, rooted at 00000, with two recombination nodes (crossover points 3 and 4) inside one blob.]
Leaf sequences:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101
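A fully-decomposed
network for the sequences
generated by the prior
network.
[Figure: the fully-decomposed network, drawn next to the incompatibility graph on sites 1 to 5. Each blob contains the sites of one connected component; recombination parents are marked p and s.]
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101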
Proof Ideas
Let C be a connected component of G(M).
Define M[C] as the sequences in M
restricted to the sites in C.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
[Figure: G(M) on sites 1 to 5, with its connected components labeled C1 and C2 and the corresponding blobs labeled B1 and B2.]

     M[C1]   M[C2]
     134     25
a    001     00
b    101     00
c    010     00
d    110     00
e    010     10
f    010     11
g    010     01
Now for each connected component C in G(M), call each distinct
sequence in M[C] a supercharacter, and let W be the indicator
matrix for the supercharacters. So W indicates which rows of M
contain which particular supercharacters.
Supercharacter numbers (one per distinct sequence in M[C1] or M[C2]):

     M[C1]          M[C2]
     134   number   25   number
a    001   1        00   5
b    101   2        00   5
c    010   3        00   5
d    110   4        00   5
e    010   3        10   6
f    010   3        11   7
g    010   3        01   8

W
     12345678
a    10001000
b    01001000
c    00101000
d    00011000
e    00100100
f    00100010
g    00100001
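The construction of W from M can be sketched as follows. This is an illustration only: the helper names `restrict` and `supercharacter_matrix` are hypothetical, the components are passed in as lists of 1-based sites, and supercharacters are numbered in order of first appearance, which is an assumption chosen to match the numbering on the slide.

```python
def restrict(seq, sites):
    """M[C] for one row: keep only the given 1-based sites."""
    return "".join(seq[s - 1] for s in sites)

def supercharacter_matrix(rows, components):
    """Return (labels, W): labels[k] is the (component index, restricted sequence)
    behind column k of W; W maps each row name to its 0/1 indicator string."""
    labels, column_of = [], {}
    for c, sites in enumerate(components):
        for seq in rows.values():
            key = (c, restrict(seq, sites))
            if key not in column_of:        # new distinct sequence in M[C]
                column_of[key] = len(labels)
                labels.append(key)
    W = {}
    for name, seq in rows.items():
        bits = ["0"] * len(labels)
        for c, sites in enumerate(components):
            bits[column_of[(c, restrict(seq, sites))]] = "1"
        W[name] = "".join(bits)
    return labels, W

M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}
labels, W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
for name, bits in W.items():
    print(name, bits)
```

Each row of W contains exactly one 1 per connected component, since every row of M carries exactly one supercharacter from each component.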
Proof Ideas for the
Decomposition Theorem
Lemma: No pair of supercharacters is
incompatible.
So by the NASC for a Perfect Phylogeny,
there is a unique perfect phylogeny T for
W.
Proof Ideas - Further:
For each connected component C of G(M), all
supercharacters that originate from C label edges in
T that are incident with one single node v[C] in T.
So, if we expand each node v[C] to be a network that
generates the supercharacters from C (the
sequences in M[C]), and connect each network
correctly to the edges in T, the resulting network is a
fully-decomposed blobbed-tree that generates M.
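The claim about v[C] can be checked on the example W. The sketch below uses one standard way to build a perfect phylogeny, which is not necessarily the construction used in the talk: flip columns so that a chosen row becomes all zeros, after which the column sets are nested or disjoint and each column's edge hangs below its smallest enclosing column. The names `rerooted_column_sets`, `tree_edges` and `hub_nodes` are hypothetical, columns are 0-based, and the code assumes that W passes the 4-gamete test and that no two columns coincide after flipping.

```python
def rerooted_column_sets(W, root_row):
    """Flip columns so root_row is all zeros; return, per column, the rows holding a 1."""
    root = W[root_row]
    return [frozenset(r for r, bits in W.items() if bits[j] != root[j])
            for j in range(len(root))]

def tree_edges(col_sets):
    """Edge j joins node j to its parent: the smallest column set strictly containing it,
    or the node 'root' if there is none. Node j is the lower endpoint of edge j."""
    edges = {}
    for j, s in enumerate(col_sets):
        supersets = [k for k, t in enumerate(col_sets) if s < t]
        parent = min(supersets, key=lambda k: len(col_sets[k])) if supersets else "root"
        edges[j] = (parent, j)
    return edges

def hub_nodes(edges, columns):
    """Nodes of T incident with every edge labeled by one of the given columns."""
    common = set(edges[columns[0]])
    for j in columns[1:]:
        common &= set(edges[j])
    return common

W = {"a": "10001000", "b": "01001000", "c": "00101000", "d": "00011000",
     "e": "00100100", "f": "00100010", "g": "00100001"}
edges = tree_edges(rerooted_column_sets(W, "a"))
v_C1 = hub_nodes(edges, [0, 1, 2, 3])   # supercharacters 1-4 (from C1)
v_C2 = hub_nodes(edges, [4, 5, 6, 7])   # supercharacters 5-8 (from C2)
print(v_C1, v_C2)
```

In this example each component's supercharacters meet at exactly one node of T, which is the node that would be expanded into a blob.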
Algorithmically, T is easy to find.
T can be constructed from M in O(nm^2)
time.
T is the tree resulting from contracting
each blob in the fully-decomposed
blobbed-tree T(M) for M.
The supercharacters from M play the role
in phylogenetic networks that normal
binary characters play in perfect
phylogeny trees.
So supercharacters are the fundamental
characters of phylogenetic networks.
However …
While fully-decomposed networks always exist, they do
not necessarily minimize the number of
recombination nodes, over all possible networks: For
some M, there is no fully-decomposed MinARG.
That is, sometimes it pays to put sites from different
connected components together on the same blob.
[Figure: a network on sites 1 to 6, rooted at 000000, with three recombination nodes (parents marked p and s). Interior labels include 100010 and 001000; the sequence labels along the bottom read 000100, 0011010, 010010, 100001, 100101.]
Sequences in M are in black.
Sequence 100010 is not in M.
G(M) has two
components. Each
requires two recombinations, but
this combined network
needs only three.
G(L) has one component. The addition of
sequence 100010
reduces the number of components from 2 in G(M) to 1 in G(L).
G(M) for the original data
[Figure: G(M) on sites 1 to 6.]
Two components, so two blobs.
Each blob requires two recombinations,
by the HK lower bound theorem,
so a fully-decomposed network needs
at least four recombinations.
[Figure: G(L) on sites 1 to 6.]
G(L) is created from the original data
plus the new
interior sequence 100010.
G(L) has only one connected
component, compared to two
components for G(M).
Theorem: For any K, there is a dataset where the best fully-decomposed
network uses K recombinations more than optimal.
In that construction, the ratio of the number of recombinations
in the best fully-decomposed network to the optimal is constant as
K grows.
Open Question: Construct examples showing that the
ratio can be arbitrarily large.
Sufficient Conditions
But we can prove several useful sufficient conditions
for when there is a fully-decomposed network that minimizes the
number of recombinations, over all possible networks.
The deepest result:
Main Theorem: Let N be a phylogenetic network for input M, let L be
the set of sequences that label the nodes of N, and let G(L) be the
incompatibility graph for L. If G(L) and G(M) have the same number
of connected components, then there is a fully-decomposed network for
M with the same number of recombinations as in N.
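The hypothesis of the Main Theorem is mechanical to check once the node labels L of a network are known. The sketch below is an illustration under assumptions: `component_count` is a hypothetical name, it counts all components of the incompatibility graph with one node per site (using union-find), and L is taken to be the earlier seven-sequence example M plus the root label 00000, as read from the node labels of that example network.

```python
from itertools import combinations

def component_count(seqs):
    """Number of connected components of the incompatibility graph (one node per site)."""
    m = len(seqs[0])
    parent = list(range(m))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(m), 2):
        if len({(s[i], s[j]) for s in seqs}) == 4:   # sites i and j are incompatible
            parent[find(i)] = find(j)
    return len({find(x) for x in range(m)})

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
L = M + ["00000"]   # assumed node labels: the sequences in M plus the root
hypothesis_holds = component_count(L) == component_count(M)
print(component_count(M), component_count(L), hypothesis_holds)
```

With these inputs both graphs have two components, so the theorem's condition is met for that network.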
Restatement with converse
There is a fully-decomposed MinARG for M if and only if
there is some MinARG for M whose nodes are labeled
with the sequences L, such that the incompatibility
graphs G(L) and G(M) have the same number of connected
components.
Corollary
A fully-decomposed network exists that
minimizes the number of recombinations,
unless every optimal network uses some
recombination node(s) labeled by sequence(s)
not in M, and the addition of those sequences
to M creates an incompatibility between sites
in different components of G(M).
A Practical Sufficient
Condition
If M can be derived on a network N in which
every edge contains at most
one site, and every node is labeled with a
sequence in M, then there is a
fully-decomposed network for M which
minimizes the number of recombinations
over all possible networks for M.
Proof Sketch of Main Theorem