Trapezoid Graphs, Independent Sets, and Whole Genome

Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome
Comparisons
Raluca Uricaru, Eric Rivals
LIRMM, CNRS Université de Montpellier 2
6 novembre 2009
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
1 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
2 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
3 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Problem and Motivation
Align two or several genomic sequences to determine
resemblances between them ;
infer knowledge from one sequence to other highly similar sequences ;
identify the conserved parts between sequences indicating common
biological components ;
identify differences between sequences responsible for what
distinguishes them (e.g., virulence, pathogenicity).
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
4 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
General Method for Pairwise Alignment
complex problem due to the sizes of the genomic sequences ;
3-phase heuristic called the "anchor based strategy" :
1
2
3
computation of local similarities (fragments) between sequences ;
using an external method ;
fragment chaining phase : selects a subset of non-overlapping,
collinear anchors giving a maximum weighted chain ;
non-collinear anchors −→ NP-complete problem ;
apply recursively the first 2 phases on yet not aligned regions ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
5 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
6 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
Trapezoid Representation
S1
1
2
3
4
5
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
7 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
Box Representation
7
y
6
w(B5) = 13
6
w(B4) = 6 3
3
W [Cmax ] = w(B2 ) + w(B5 )
= 8 + 13 = 21
w(B3) = 10
4
4
w(B2) = 8
4
3
w(B1) = 5 2
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
8 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining Definition
n fragments represented as boxes in a dotplot : {B1 , . . . , Bn } ;
lower corner of Bi = l(Bi ), upper corner of Bi = u(Bi ) ;
if u(Bi ) < l(Bj ) then Bi Bj ;
the weight of a box Bi : w(Bi ) =| Px (Bi ) | + | Py (Bi ) |
Px and Py are the projections on the 2 axis ;
{Bi1 , Bi2 , . . . Bik } forms a chain C if Bij Bij+1 ∀j, 1 ≤ j ≤ k , with
P
W [C] = kj=1 w(Bij ).
We are looking for the chain Cmax giving the maximum weight.
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
9 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
equivalent to the maximum weighted independent set in the
trapezoid graph corresponding to the trapezoid/box
representations above ;
[S. Felsner et al, Trapezoid graphs and generalizations, geometry and
algorithms, 1995.]
dynamic programming algorithm in O(n log(n)), using the sweep
line paradigm ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
10 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
11 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Chaining with Overlaps
Box Representation
7
y
6
w(B5) = 13
r = 33%
6
w(B4) = 6 3
W [Cmax ] = w(B2 ) + w(B3 ) + w(B5 )
−Ox (B2 , B3 ) − Ox (B3 , B5 )
= 28
3
w(B3) = 10
4
4
w(B2) = 8
4
3
w(B1) = 5 2
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
12 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Chaining with Overlaps
overlap between Bi , Bj boxes on x-axis : Ox (Bi , Bj ) =| Px (Bi ) ∩ Px (Bj ) |
maximum allowed overlap on x-axis : r ∗ min(| Px (Bi ) |, | Px (Bj ) |)
if l(Bi ) < l(Bj ) and Ox (Bi , Bj ) ≤ r ∗ min(| Px (Bi ) |, | Px (Bj ) |) or Bi Bj
then Bi r ,x Bj . Similarly on y-axis . . .
if Bi r ,x Bj and Bi r ,y Bj then Bi r Bj
total overlap between Bi , Bj boxes : O(Bi , Bj ) = Ox (Bi , Bj ) + Oy (Bi , Bj )
{Bi1 , Bi2 , . . . Bik } forms a chain C if Bij r Bij+1 ∀j, 1 ≤ j ≤ k ,
P
P −1
with W [C] = kj=1 w(Bij ) − kj=1
O(Bij , Bij+1 ).
We are looking for the chain Cmax giving the maximum weight.
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
13 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Remarks on Chaining With Overlaps
adapted to all types of fragments ;
makes sense from a biological point of view ;
a straightforward but non-practical
dynamic programming algorithm in O(n2 ),
n = number of fragments ;
partial order on boxes r with the sweep line paradigm :
novel algorithm in O(n log n + nm), where m ≤ r % max(| Py |)
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
14 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Algorithm for Chaining With Overlaps
Algorithm 1: Maximum Weighted Chain with Overlaps
Data: P a set of points corresponding to the 2n + 4 box corners
n boxes + 2 additional boxes corresponding to ⊥, >
Result: Prev a vector of previous boxes in the maximum weighted chain
begin
foreach p ∈ P in ascending order on x-coordinate do
if ∃ a box Bi , with p the lower corner of Bi then
Case 1 Compute B , the best previous box ending before B ;
j
i
/* Bj may overlap Bi on y -axis */
else
/* ∃ a box Bi , with p the upper corner of Bi */
Case 2 Update the weight and the predecessor of opened boxes ;
/* deal with overlaps on x-axis */
Update the list of potential predecessors ;
traceback (Prev [>]);
end
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
15 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Algorithm for Chaining With Overlaps
Algorithm 2: Maximum Weighted Chain with Overlaps
Data: P a set of points corresponding to the 2n + 4 box corners
n boxes + 2 additional boxes corresponding to ⊥, >
Result: Prev a vector of previous boxes in the maximum weighted chain
begin
foreach p ∈ P in ascending order on x-coordinate do
if ∃ a box Bi , with p the lower corner of Bi then
Case 1 Compute B , the best previous box ending before B ;
j
i
/* Bj may overlap Bi on y -axis */
else
/* ∃ a box Bi , with p the upper corner of Bi */
Case 2 Update the weight and the predecessor of opened boxes ;
/* deal with overlaps on x-axis */
Update the list of potential predecessors ;
traceback (Prev [>]);
end
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
15 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Case 1
Compute Bj , the best previous box ending before Bi . . .
7
y
6
B5
B4
3
3
4
B3
6
4
B2
4
3
B1
2
x
Algorithm
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
16 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Case 2
Update the weight and the predecessor of opened boxes . . .
7
y
B5
B4
6
3
3
B3
4
6
B2
4
3
B1
2
4
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
17 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
18 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Conclusion
the chaining with proportional overlaps corresponds to an
extended definition of a maximum weighted independent set
including a tolerance notion ;
the algorithm in O(n log n + nm) is very efficient in practice ;
e.g., on a real case composed of 190000 fragments :
O(n2 ) algorithm : 34min
O(n log n + nm) algorithm : < 2min ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
19 / 20
Annexe
Support :
CoCoGEN project
http://www.lirmm.fr/~uricaru/CoCoGEN
Thank you for your attention !
Questions ?
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
20 / 20