Phylogenetic Networks with Constrained Recombination

Computing close bounds on the minimum number of recombinations
Dan Gusfield UCD
Y. Song, Y. F. Wu, D. Gusfield (ISMB 2005)
D. Gusfield, D. Hickerson (Discrete Applied Mathematics 2005)
D. Gusfield, V. Bansal (RECOMB 2005)
Reconstructing the Evolution of
Binary Bio-Sequences (SNPs)
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with
recombination (ARG)
• Efficiently Computed Lower and Upper
Bounds on the number of recombinations
needed.
• Decomposition Theorem and open question.
Genealogical or Phylogenetic
Networks
• The major biological motivation comes from
genetics and attempts to reconstruct the history of
recombination in populations.
• Some of the algorithmic and mathematical results
also have phylogenetic applications, for example
in hybrid speciation and lateral gene transfer.
• A critical distinction is whether there is a fixed
linear ordering of the sites (some results need it
and some do not).
The Perfect Phylogeny Model for
binary sequences
sites 1 2 3 4 5; ancestral sequence 00000
[Figure: a perfect phylogeny with site mutations on its edges (sites 1-5). The tree derives the set M = {10100, 10000, 01011, 01010, 00010}; the extant sequences appear at the leaves.]
When can a set of sequences be
derived on a perfect phylogeny
with the all-0 root?
Classic NASC (necessary and sufficient condition): arrange
the sequences in a matrix. Then (with no duplicate columns),
the sequences can be generated on a unique
perfect phylogeny if and only if no two
columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test.
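To make the test concrete, here is a minimal Python sketch (illustrative only, not code from the papers; the function names are mine) of the 4-gamete test on the example matrix:

```python
# Minimal sketch of the four-gamete test on a 0/1 matrix M (rows = sequences).
def four_gametes(M, i, j):
    """Set of gametes (value pairs) shown by columns i and j across the rows of M."""
    return {(row[i], row[j]) for row in M}

def compatible(M, i, j):
    """True iff columns i and j pass the 4-gamete test (not all four gametes present)."""
    return four_gametes(M, i, j) != {(0, 0), (0, 1), (1, 0), (1, 1)}

# The matrix M derived by the perfect phylogeny above: every pair of sites is compatible.
M = [(1, 0, 1, 0, 0), (1, 0, 0, 0, 0), (0, 1, 0, 1, 1), (0, 1, 0, 1, 0), (0, 0, 0, 1, 0)]
assert all(compatible(M, i, j) for i in range(5) for j in range(i + 1, 5))
# Adding 10101 (next slide) makes sites 4 and 5 (0-based columns 3 and 4) incompatible.
assert not compatible(M + [(1, 0, 1, 0, 1)], 3, 4)
```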
A richer model
10100
10000
01011
01010
00010
10101  (added)
The pair of sites 4, 5 now fails the four-gamete test:
the sites 4 and 5 ``conflict''.
[Figure: the previous perfect phylogeny (sites 1-5, root 00000); it cannot derive the enlarged set, since no perfect phylogeny exists once 10101 is added.]
Real sequence histories often involve recombination.
Sequence Recombination
[Figure: single-crossover recombination. P = 10100 and S = 01011 recombine at point 5 to give 10101.]
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
This is called ``crossing over'' in genetics.
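In code, single-crossover recombination is just a prefix/suffix splice; a small illustrative sketch (sequences as 0/1 strings, 1-based crossover point, not code from the papers):

```python
def recombine(P, S, k):
    """Single-crossover recombinant: sites 1..k-1 from prefix parent P, sites k..m from suffix parent S."""
    return P[:k - 1] + S[k - 1:]

# The slide's example: P = 10100 and S = 01011 at crossover point 5 give 10101.
assert recombine("10100", "01011", 5) == "10101"
```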
Network with Recombination
10100
10000
01011
01010
00010
10101  (new)
[Figure: the previous tree (sites 1-5, root 00000), now with a recombination node at crossover point 5 that takes its prefix from P and its suffix from S (as on the previous slide) to derive 10101.]
The previous tree with one recombination event now derives all the sequences.
Multiple Crossover
Recombination
4-crossovers
2-crossovers = ``gene conversion”
A Phylogenetic Network
[Figure: a phylogenetic network with root 00000, site mutations 1-5 on its edges, and two recombination nodes (each with a P and an S parent), deriving the leaf sequences a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101.]
Minimizing recombinations
• Any set M of sequences can be generated by a
phylogenetic network with enough recombinations, and
one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is
small in realistic sets of sequences. ``Observable” depends
on n and m relative to the number of recombinations.
• Two algorithmic problems: given a set of sequences M,
find a phylogenetic network generating M that minimizes the
number of recombinations; and find a network generating M
that has some biologically motivated structural properties.
Minimization is NP-hard
The problem of finding a phylogenetic network that creates a given set of
sequences M and minimizes the number of recombinations is NP-hard
(Wang et al. 2000; Semple 2004).
A super-exponential-time method (Song and Hein) computes the exact minimum,
but it works only for a small number of sequences.
Wang et al. explored the problem of finding a phylogenetic network where the
recombination cycles are required to be node-disjoint, if possible.
They gave a sufficient but not necessary condition for recognizing when this
is possible, in O(nm + n^4) time. A full solution, in O(nm + n^3) time, is given
by Gusfield, Eddhu, and Langley (the root-unknown case was solved later
in the same time bound).
Practical and Accurate Lower
Bound Computation
• Unconstrained recombination.
• Want efficient computation of lower bounds on the
minimum number of recombinations needed.
Many methods: HK, Haplotype, History, Connected
Components, Composite Interval, Gall-Interval
Composite Method, Subsets, RecMin, Composite
ILP, HapBound (-S, -M). Some require a linear
order, and others are more general.
Incompatible Sites
A pair of sites (columns) of M that fails the
4-gamete test is said to be incompatible.
A site that is not in such a pair is compatible.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
Incompatibility Graph
[Figure: the incompatibility graph on sites 1-5, with edges (1,3), (1,4), and (2,5).]
Two nodes are connected iff the pair
of sites is incompatible, i.e., fails the
4-gamete test.
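Continuing the illustrative Python sketch (not the authors' code), the incompatibility graph can be built directly from the 4-gamete test:

```python
from itertools import combinations

def incompatibility_graph(M):
    """Edges (i, j), 1-based site labels, between incompatible pairs of columns of M."""
    m = len(M[0])
    edges = set()
    for i, j in combinations(range(m), 2):
        gametes = {(row[i], row[j]) for row in M}
        if gametes == {(0, 0), (0, 1), (1, 0), (1, 1)}:   # fails the 4-gamete test
            edges.add((i + 1, j + 1))
    return edges

# The matrix M above (rows a..g):
M = [tuple(int(c) for c in row)
     for row in ("00010", "10010", "00100", "10100", "01100", "01101", "00101")]
print(incompatibility_graph(M))    # -> {(1, 3), (1, 4), (2, 5)}, as in the figure
```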
The grandfather of all lower
bounds - HK 1985
• Arrange the nodes of the incompatibility graph on
a line, in the order in which the sites appear in the
sequence. This bound requires a linear order.
• The HK bound is the minimum number of vertical
lines needed to cut every edge in the
incompatibility graph. Weak bound, but widely
used - not only to bound the number of
recombinations, but also to suggest their locations.
Justification for HK
If two sites are incompatible, there must have
been some recombination where the
crossover point is between the two sites.
HK Lower Bound
[Figure: sites 1-5 arranged on a line with the incompatibility edges drawn; a single vertical line (between sites 2 and 3) cuts every edge, so the HK Lower Bound = 1.]
More general view of HK
Given a set of intervals on the line, and for each interval I, a
number N(I), define the composite problem: Find the minimum
number of vertical lines so that every interval I intersects at
least N(I) of the vertical lines.
In HK, each incompatibility defines an interval I where N(I) = 1.
The composite problem is easy to solve by a left-to-right myopic
placement of vertical lines: sort the intervals by right endpoint;
process the intervals left to right in that order; when the right
endpoint of an interval I is reached, place there (if needed)
additional vertical lines so that N(I) lines intersect I.
If each N(I) is a ``local” lower bound on the number of
recombinations needed in interval I, then the solution to
the composite problem is a valid lower bound for the
full sequences. The resulting bound is called the composite
bound given the local bounds.
This general approach is called the Composite Method
(Simon Myers 2002).
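An illustrative Python sketch of this greedy solver (my indexing convention, not the papers': breakpoint k sits between sites k and k+1, so an incompatible pair (i, j) becomes the breakpoint interval [i, j-1] with N = 1 for HK):

```python
def composite_bound(local_bounds):
    """Greedy myopic solver: local_bounds is a list of (l, r, N) breakpoint intervals."""
    placed = []                                       # breakpoints placed so far
    for l, r, need in sorted(local_bounds, key=lambda t: t[1]):   # by right endpoint
        have = sum(1 for x in placed if l <= x <= r)
        placed.extend([r] * max(0, need - have))      # any new lines go at the right end
    return len(placed)

# HK on the running example: incompatible pairs (1,3), (1,4), (2,5).
print(composite_bound([(1, 2, 1), (1, 3, 1), (2, 4, 1)]))   # -> 1, as on the HK slide
```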
Example: New bound
The number of non-trivial connected
components in the incompatibility graph is
a lower bound on the needed number of
recombinations (Gusfield, Hickerson;
Bafna, Bansal).
CC Bound = 2 in the prior example.
The CC bound can be arbitrarily larger than the HK bound,
but can also be smaller.
When the CC bound is used for the local bound
N(I) in each interval I,
the resulting composite bound is provably at least as large as the
HK bound, and is usually larger (in our simulations).
Better composite bounds are obtained by using the
``haplotype bound” as the local bound.
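A small illustrative sketch of the CC bound (union-find over the incompatibility edges computed earlier; the helper name is mine):

```python
def cc_bound(edges):
    """Number of non-trivial connected components of the incompatibility graph."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]             # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)                     # union the two endpoints
    return len({find(u) for e in edges for u in e})   # only sites touching an edge count

print(cc_bound({(1, 3), (1, 4), (2, 5)}))             # -> 2, the CC bound of the example
```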
Haplotype Bound (Simon Myers)
• Rh = (number of distinct sequences (rows)) - (number of distinct sites (columns)) - 1   (folklore)
• Before computing Rh, remove any site that is
compatible with all other sites. A valid lower
bound results - generally increases the bound.
• Generally a really bad bound, often negative, when
used on large intervals, but very good when used
as a local bound in the Composite Interval Method
and other methods.
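An illustrative sketch of Rh for a 0/1 matrix or matrix slice (the optional removal of fully compatible sites is omitted here):

```python
def rh_bound(M):
    """Haplotype bound: (# distinct rows) - (# distinct columns) - 1."""
    rows = {tuple(r) for r in M}
    cols = {tuple(r[j] for r in M) for j in range(len(M[0]))}
    return len(rows) - len(cols) - 1

# On the 7-by-5 running example (rows a..g), rh_bound(M) == 7 - 5 - 1 == 1.
```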
Composite Interval Method using
RH bounds
Compute Rh separately for each of the C(m,2)
intervals of the m sites; let N(I) = Rh(I) be the
local lower bound for interval I. Then compute
the composite bound using these local bounds.
Polynomial time and gives bounds that often double
HK in our simulations.
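Putting the illustrative sketches together (reusing rh_bound and composite_bound from above; again not the authors' code), the Composite Interval Method is just Rh on every interval fed into the greedy composite solver:

```python
from itertools import combinations

def composite_interval_bound(M):
    """Composite bound with N(I) = Rh(I) for every site interval I = [a, b]."""
    m = len(M[0])
    local = []
    for a, b in combinations(range(1, m + 1), 2):     # 1-based site intervals
        n_i = rh_bound([row[a - 1:b] for row in M])   # local bound for interval [a, b]
        if n_i > 0:
            local.append((a, b - 1, n_i))             # breakpoints of [a, b] are a..b-1
    return composite_bound(local)
```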
Composite Subset Method
(Myers)
• Let S be a subset of sites, and Rh(S) be the
haplotype bound for subset S. If the
leftmost site in S is L and the rightmost site
in S is R, then use Rh(S) as a local bound
N(I) for the interval I = [L, R].
• Compute Rh(S) on many subsets, and then
solve the composite problem to find a
composite bound.
RecMin (Myers)
• World Champion Lower Bound Program
(available). Often RecMin gives a bound three
times as large as HK.
• Computes Rh on subsets of sites, but limits the
size and the span of the subsets. Default
parameters are s = 6, w = 15 (s = size, w = span).
• It is generally impractical to set s = w = m, so one
usually does not know whether increasing the
parameters would increase the bound (example:
Myers' bound of 70 on the LPL data).
Optimal RecMin Bound (ORB)
• The Optimal RecMin Bound is the lower
bound that RecMin would produce if both
parameters were set to their maximum
values (s = w = m).
• In general, RecMin cannot compute (in
practical time) the ORB.
• Practical computation of the ORB is our
first contribution.
Computing the ORB
• Gross Idea: For each interval I, use ILP to find a
subset of sites S that maximizes Rh(S) over all
subsets in interval I. Call the result Opt(I).
• Set N(I) = Opt(I), and compute the composite
bound using those local bounds.
• The composite bound is the ORB -- the result
one would get by using all 2^m subsets in
RecMin, with s = w = m.
We have moved from doing an exponential number of simple
computations (computing Rh for each subset), to solving
a quadratic number of (possibly expensive) ILPs.
Is this a good trade-off in practice? Our experience - very much
so!
How to compute Opt(I) by ILP
Create one variable Xi for each row i; one variable Yj
for each column j in interval I. All variables are 0/1 variables.
Define S(i,i’) as the set of columns where rows i and i’ have different
values. Each column in S(i,i’) is a witness that rows i and i’ differ.
For each pair of rows (i,i’), create the constraint:
Xi + Xi’ <= 1 + ∑ [Yj: j in S(i,i’)]
Objective function: Maximize ∑ Xi - ∑ Yj - 1
Alternate way to compute Opt(I)
by ILP
First remove any duplicate rows. Let N be the number of rows
remaining.
Create one variable Yj for each column j
in interval I. All variables are 0/1 variables.
S(i,i’) as before.
For each pair of rows (i,i’) create the constraint:
1 <= ∑ [Yj: j in S(i,i’)]
Objective function: Maximize N - ∑ Yj - 1
Finds the smallest number of columns to distinguish the rows.
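As an illustration only (this is not the solver setup used in the papers, which ran the GNU solver or CPLEX), the second formulation can be written with the PuLP ILP library roughly as follows; the function name and the use of PuLP/CBC are my assumptions:

```python
from itertools import combinations
import pulp   # assumption: PuLP with its bundled CBC solver is available

def opt_interval(M, a, b):
    """Opt(I) for the 1-based site interval I = [a, b]: N - (min # distinguishing columns) - 1."""
    rows = sorted({tuple(r[a - 1:b]) for r in M})     # duplicate rows removed
    N, cols = len(rows), range(b - a + 1)
    prob = pulp.LpProblem("opt_interval", pulp.LpMinimize)
    Y = {j: pulp.LpVariable(f"Y_{j}", cat="Binary") for j in cols}
    prob += pulp.lpSum(Y.values())                    # minimize the number of chosen columns
    for r1, r2 in combinations(rows, 2):
        witnesses = [j for j in cols if r1[j] != r2[j]]          # S(i, i')
        prob += pulp.lpSum(Y[j] for j in witnesses) >= 1         # the pair must be distinguished
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return N - int(round(pulp.value(prob.objective))) - 1
```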
Two critical tricks
1) Use the second ILP formulation to compute Opt(I). It solves
much faster than the first (why?)
2) Reduce the number of intervals where Opt(I) is computed:
[Figure: an interval L nested inside a larger interval I.]
If the solution to Opt(I) uses columns that
only span interval L, then there is no
need to find Opt(I’) in any interval
I’ containing L. Each ILP computation
directly spawns at most 4 other ILPs.
Apply this idea recursively.
With the second trick we need to find Opt(I)
for only 0.5% - 35% of all the C(m,2)
intervals (empirical result). Surprisingly
fast in practice (with either the GNU solver
or CPLEX).
Bounds Higher Than the Optimal
RecMin Bound
• Often the ORB underestimates the true minimum,
e.g., Kreitman's ADH data: 6 vs. 7.
• How to derive a sharper bound? Idea: in the
composite method, check whether each local bound N(I)
= Opt(I) is tight, and if not, increase N(I) by one.
• Small increases in local bounds can increase the
composite bound.
Bounds Sharper Than Optimal
RecMin Bound
• A set of sequences M is called self-derivable if there is a
network generating M where the sequence at every node
(leaf and intermediate) is in M.
• Observation: The haplotype bound for a set of sequences is
tight if and only if the sequences are self-derivable.
• So for each interval I where Opt(I) is computed, we check
self-derivability of the sequences induced by columns S*,
where S* are the columns specified by Opt(I). Increase
N(I) by 1 if the sequences are not self-derivable.
Algorithm for Self-Derivability
• The solution is easy when there are no mutations: only
recombinations are allowed, and one initial
pair of sequences is chosen as ``reached''.
• Two reached sequences S1 and S2 can reach a
third sequence S3, if S3 can be created by
recombining S1 and S2.
• Do BFS to see if all sequences can be reached by
successive application of the ``reach operation”.
• Clearly polynomial time, and can be optimized
with suffix trees, etc. (Kececioglu, Gusfield)
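A rough illustrative sketch of this test (it assumes single-crossover recombination and tries every pair of sequences from M as the initial ``reached'' pair; not an optimized implementation):

```python
from itertools import combinations

def is_recombinant(s3, s1, s2):
    """Can s3 be formed from s1 and s2 by one crossover (either parent supplying the prefix)?"""
    return any(s3 == s1[:k] + s2[k:] or s3 == s2[:k] + s1[k:] for k in range(len(s3) + 1))

def self_derivable_no_mutation(seqs):
    """Recombination-only self-derivability: BFS-style closure of the ``reach'' operation."""
    seqs = list(dict.fromkeys(seqs))                  # drop duplicate sequences
    if len(seqs) < 2:
        return True
    for start in combinations(seqs, 2):               # try every initial reached pair
        reached, changed = set(start), True
        while changed:
            changed = False
            for s3 in seqs:
                if s3 not in reached and any(is_recombinant(s3, s1, s2)
                                             for s1, s2 in combinations(reached, 2)):
                    reached.add(s3)
                    changed = True
        if len(reached) == len(seqs):
            return True
    return False
```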
Self-derivability Test with
mutations allowed
• For each site i, construct the set MUT(i) of sequence pairs (S1,
S2) in M where S1 and S2 differ only at site i.
• Try each sequence in M as root (which is the only reached
sequence initially).
• For each root, enumerate all ways of choosing exactly one
ordered pair of sequences (Sp, Sq) from each MUT(i). Sp
is allowed to ``reach” Sq.
• Run the prior self-derivability algorithm with these new
permitted reach relations, to test if all sequences in M can
be reached. If so, then M is self-derivable, otherwise it is
not.
Checking if N(I) should be
increased by two
If the set of sequences is not self-derivable,
we can test if adding one new sequence makes
it self-derivable.
The number of candidate sequences is
polynomial, and for each one added to M we
check self-derivability. N(I) should be
increased by two if none of these augmented sets of
sequences is self-derivable.
Program HapBound
• HapBound -S: checks each Opt(I) subset for self-derivability,
and increases N(I) by 1 or 2 when possible.
This often beats the ORB and is still practical for
most datasets.
• HapBound -M: explicitly examines each interval
directly for self-derivability and increases the local bound
when possible. In this mode it derives a lower bound of 7 for
Kreitman's ADH data.
HapBound vs. RecMin on LPL
from Clark et al.
Program               Lower Bound   Time
RecMin (default)      59            3s
RecMin -s 25 -w 25    75            7944s
RecMin -s 48 -w 48    no result     5 days
HapBound              75            31s
HapBound -S           78            1643s

(Times on a 2 GHz PC.)
Example where RecMin has
difficulty in finding the optimal bound
on a 25 by 376 data matrix
Program               Bound   Time
RecMin (default)      36      1s
RecMin -s 30 -w 30    42      3m 25s
RecMin -s 35 -w 35    43      24m 2s
RecMin -s 40 -w 40    43      2h 9m 4s
RecMin -s 45 -w 45    43      10h 20m 59s
HapBound              44      2m 59s
HapBound -S           48      39m 30s
Frequency of HapBound -S
Bound Sharper Than the Optimal
RecMin Bound
ms param.   Rho=1   Rho=5   Rho=10   Rho=20
Theta=1     0.0%    0.4%    0.5%     1.5%
Theta=5     0.7%    4.0%    10.4%    27.0%
Theta=10    1.4%    9.2%    17.8%    40.4%
Theta=20    1.4%    10.5%   27.8%    45.4%
For each setting of the ms parameters, 1000 data sets were used.
Computing Upper Bounds
• The method is an adaptation of the
``history” lower bound (Myers).
• A non-informative column is one with
fewer than two 0’s or fewer than two 1’s.
Single History Computation
1) Set W = 0
2) Collapse identical rows together, and remove
non-informative columns. Repeat until neither is
possible.
3) Let A be the data at this point. If A is empty,
stop, else remove some row r from A, and set W
= W + 1. Go to step 2).
Note that the choice of r is arbitrary in Step 3), so the
resulting W can vary.
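An illustrative sketch of one pass of this computation (rows as 0/1 strings; the arbitrary choice of r in step 3 is supplied by the caller, since different choices give different W; function names are mine):

```python
def collapse(A):
    """Step 2: merge identical rows and drop non-informative columns until stable ([] if nothing remains)."""
    while True:
        A = list(dict.fromkeys(A))                    # merge identical rows
        if len(A) <= 1:
            return []                                 # nothing informative is left
        keep = [j for j in range(len(A[0]))
                if sum(r[j] == '0' for r in A) >= 2 and sum(r[j] == '1' for r in A) >= 2]
        if len(keep) == len(A[0]):
            return A
        A = [''.join(r[j] for j in keep) for r in A]  # drop non-informative columns

def single_history(rows, choose_row):
    """One single-history computation; returns its W value."""
    W, A = 0, collapse(list(rows))
    while A:
        r = choose_row(A)                             # the arbitrary choice in step 3
        A = collapse([row for row in A if row != r])
        W += 1
    return W

# e.g. single_history(["00010", "10010", "00100", "10100", "01100", "01101", "00101"],
#                     choose_row=lambda A: A[0])      # W depends on the removal order
```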
History Lower Bound
Theorem (Myers) Let W* be the minimum W
obtained from all possible single history
computations.
Then W* is a valid lower bound on the
number of recombinations needed.
Naïve time: Θ(n!) (RecMin), but it can be
reduced to Θ(2^n) (Bafna, Bansal).
Converting the History Lower
Bound to an Upper Bound
• Given a set of rows A and a single row r,
define w(r | A - r) as the minimum number
of recombinations needed to create r from
A-r (well defined in our application).
• w(r | A-r) can be computed in linear time by
a greedy-type algorithm.
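An illustrative sketch of such a greedy (it extends the current segment of r as far as possible before paying for a crossover, and assumes, as in our application, that every column of r is matched by some remaining row):

```python
def w_needed(r, others):
    """Greedy w(r | A - r): fewest crossovers to assemble r as a mosaic of the rows in `others`."""
    crossovers = 0
    candidates = set(others)                          # rows agreeing with r on the current segment
    for j, c in enumerate(r):
        matching = {s for s in candidates if s[j] == c}
        if matching:
            candidates = matching                     # the current segment can be extended
        else:
            crossovers += 1                           # cut here and start a new segment
            candidates = {s for s in others if s[j] == c}
    return crossovers
```

In the single-history sketch above, replacing W += 1 with W += w_needed(r, remaining rows) gives the upper bound computation described on the next slide.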
Upper Bound Computation
1) Set W = 0.
2) Collapse identical rows together, and remove non-informative
columns. Repeat until neither is possible.
3) Let A be the data at this point. If A is empty, stop; else
remove some row r from A, and set W = W + w(r | A - r).
Go to step 2).
Note that the choice of r is arbitrary in Step 3), so the
resulting W can vary.
This is the Single History Computation, with a change in
step 3).
Note that even a single execution of the upper bound
computation gives a valid upper bound, and a way
to construct a network. This is in contrast to the
History Bound, which requires finding the
minimum W over all histories.
However, we would also like to find the lowest
upper bound possible with this approach. This can
be done in O(2^n) time by DP. In practice, we can
use branch and bound to find the lowest upper
bound, but we have also found that branching on
the best local choice, or randomizing, gives good
results.
Branch and Bound
(Branching) In Step 3) choose r to minimize
w(r | A-r) + L(A-r), where L(A-r) is some
fast lower bound on the number of
recombinations needed for the set A-r.
Even HK is good for this purpose.
(Bounding) Let C be the minimum for any full
solution found so far; if W + L(A) >= C,
then backtrack.
Kreitman’s 1983 ADH Data
• 11 sequences, 43 segregating sites
• Both HapBound and SHRUB took only a fraction of a second to analyze this data.
• Both produced 7 for the number of detected recombination events.
Therefore, independently of all other methods, our lower
and upper bound methods together imply that 7 is the
minimum number of recombination events.
A minimal ARG for Kreitman’s data
SHRUB produces code that can be input to an open-source program to display the constructed ARG.
Papers and Software
wwwcsif.cs.ucdavis.edu/~gusfield/
A general result and open
question
Recombination Cycles
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a ``recombination cycle”.
Blobbed-trees: generalizing
galled-trees
• In a phylogenetic network a maximal set of
intersecting cycles is called a blob.
• Contracting each blob results in a directed, rooted
tree, otherwise one of the “blobs” was not
maximal.
• So every phylogenetic network can be viewed as a
directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
Every network is a tree of blobs.
How do the tree parts and the blobs relate?
How can we exploit this relationship?
[Figure: a blobbed-tree; an ugly, tangled network sits inside the blob.]
The Decomposition Theorem
(Recomb 2005)
For any set of sequences M, there is a blobbed-tree
T(M) that derives M, where each blob contains all
and only the sites in one non-trivial connected
component of the incompatibility graph. The
compatible sites can always be put on edges
outside of any blob. Moreover, the tree part of
T(M) is unique. And it is easy to find the tree part.
This is weaker than the result for galled-trees: it
replaces “must” with “can”.
The main open question
• Is there always a blobbed-tree where each blob has
all and only the sites of a single connected
component of the conflict graph, which minimizes
the number of recombinations over all possible
phylogenetic networks for M?
• If so, then we can optimize on each component
separately, and find lower bounds on each
component separately and then add those bounds
to get a valid lower bound for M.