Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome
Comparisons
Raluca Uricaru, Eric Rivals
LIRMM, CNRS Université de Montpellier 2
6 novembre 2009
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
1 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
2 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
3 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Problem and Motivation
Align two or several genomic sequences to determine
resemblances between them ;
infer knowledge from one sequence to other highly similar sequences ;
identify the conserved parts between sequences indicating common
biological components ;
identify differences between sequences responsible for what
distinguishes them (e.g., virulence, pathogenicity).
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
4 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
General Method for Pairwise Alignment
complex problem due to the sizes of the genomic sequences ;
3-phase heuristic called the "anchor based strategy" :
1
2
3
computation of local similarities (fragments) between sequences ;
using an external method ;
fragment chaining phase : selects a subset of non-overlapping,
collinear anchors giving a maximum weighted chain ;
non-collinear anchors −→ NP-complete problem ;
apply recursively the first 2 phases on yet not aligned regions ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
5 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
6 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
Trapezoid Representation
S1
1
2
3
4
5
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
7 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
Box Representation
7
y
6
w(B5) = 13
6
w(B4) = 6 3
3
W [Cmax ] = w(B2 ) + w(B5 )
= 8 + 13 = 21
w(B3) = 10
4
4
w(B2) = 8
4
3
w(B1) = 5 2
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
8 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining Definition
n fragments represented as boxes in a dotplot : {B1 , . . . , Bn } ;
lower corner of Bi = l(Bi ), upper corner of Bi = u(Bi ) ;
if u(Bi ) < l(Bj ) then Bi Bj ;
the weight of a box Bi : w(Bi ) =| Px (Bi ) | + | Py (Bi ) |
Px and Py are the projections on the 2 axis ;
{Bi1 , Bi2 , . . . Bik } forms a chain C if Bij Bij+1 ∀j, 1 ≤ j ≤ k , with
P
W [C] = kj=1 w(Bij ).
We are looking for the chain Cmax giving the maximum weight.
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
9 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Fragment Chaining
equivalent to the maximum weighted independent set in the
trapezoid graph corresponding to the trapezoid/box
representations above ;
[S. Felsner et al, Trapezoid graphs and generalizations, geometry and
algorithms, 1995.]
dynamic programming algorithm in O(n log(n)), using the sweep
line paradigm ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
10 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
11 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Chaining with Overlaps
Box Representation
7
y
6
w(B5) = 13
r = 33%
6
w(B4) = 6 3
W [Cmax ] = w(B2 ) + w(B3 ) + w(B5 )
−Ox (B2 , B3 ) − Ox (B3 , B5 )
= 28
3
w(B3) = 10
4
4
w(B2) = 8
4
3
w(B1) = 5 2
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
12 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Chaining with Overlaps
overlap between Bi , Bj boxes on x-axis : Ox (Bi , Bj ) =| Px (Bi ) ∩ Px (Bj ) |
maximum allowed overlap on x-axis : r ∗ min(| Px (Bi ) |, | Px (Bj ) |)
if l(Bi ) < l(Bj ) and Ox (Bi , Bj ) ≤ r ∗ min(| Px (Bi ) |, | Px (Bj ) |) or Bi Bj
then Bi r ,x Bj . Similarly on y-axis . . .
if Bi r ,x Bj and Bi r ,y Bj then Bi r Bj
total overlap between Bi , Bj boxes : O(Bi , Bj ) = Ox (Bi , Bj ) + Oy (Bi , Bj )
{Bi1 , Bi2 , . . . Bik } forms a chain C if Bij r Bij+1 ∀j, 1 ≤ j ≤ k ,
P
P −1
with W [C] = kj=1 w(Bij ) − kj=1
O(Bij , Bij+1 ).
We are looking for the chain Cmax giving the maximum weight.
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
13 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Remarks on Chaining With Overlaps
adapted to all types of fragments ;
makes sense from a biological point of view ;
a straightforward but non-practical
dynamic programming algorithm in O(n2 ),
n = number of fragments ;
partial order on boxes r with the sweep line paradigm :
novel algorithm in O(n log n + nm), where m ≤ r % max(| Py |)
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
14 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Algorithm for Chaining With Overlaps
Algorithm 1: Maximum Weighted Chain with Overlaps
Data: P a set of points corresponding to the 2n + 4 box corners
n boxes + 2 additional boxes corresponding to ⊥, >
Result: Prev a vector of previous boxes in the maximum weighted chain
begin
foreach p ∈ P in ascending order on x-coordinate do
if ∃ a box Bi , with p the lower corner of Bi then
Case 1 Compute B , the best previous box ending before B ;
j
i
/* Bj may overlap Bi on y -axis */
else
/* ∃ a box Bi , with p the upper corner of Bi */
Case 2 Update the weight and the predecessor of opened boxes ;
/* deal with overlaps on x-axis */
Update the list of potential predecessors ;
traceback (Prev [>]);
end
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
15 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Algorithm for Chaining With Overlaps
Algorithm 2: Maximum Weighted Chain with Overlaps
Data: P a set of points corresponding to the 2n + 4 box corners
n boxes + 2 additional boxes corresponding to ⊥, >
Result: Prev a vector of previous boxes in the maximum weighted chain
begin
foreach p ∈ P in ascending order on x-coordinate do
if ∃ a box Bi , with p the lower corner of Bi then
Case 1 Compute B , the best previous box ending before B ;
j
i
/* Bj may overlap Bi on y -axis */
else
/* ∃ a box Bi , with p the upper corner of Bi */
Case 2 Update the weight and the predecessor of opened boxes ;
/* deal with overlaps on x-axis */
Update the list of potential predecessors ;
traceback (Prev [>]);
end
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
15 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Case 1
Compute Bj , the best previous box ending before Bi . . .
7
y
6
B5
B4
3
3
4
B3
6
4
B2
4
3
B1
2
x
Algorithm
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
16 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Case 2
Update the weight and the predecessor of opened boxes . . .
7
y
B5
B4
6
3
3
B3
4
6
B2
4
3
B1
2
4
x
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
17 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Summary
1
Whole Genome Comparisons
2
Fragment Chaining
3
Chaining with Proportional Overlaps
4
Conclusion
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
18 / 20
Whole Genome Comparisons
Fragment Chaining
Chaining with Overlaps
Conclusion
Conclusion
the chaining with proportional overlaps corresponds to an
extended definition of a maximum weighted independent set
including a tolerance notion ;
the algorithm in O(n log n + nm) is very efficient in practice ;
e.g., on a real case composed of 190000 fragments :
O(n2 ) algorithm : 34min
O(n log n + nm) algorithm : < 2min ;
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
19 / 20
Annexe
Support :
CoCoGEN project
http://www.lirmm.fr/~uricaru/CoCoGEN
Thank you for your attention !
Questions ?
Trapezoid Graphs, Independent Sets, and Whole Genome Comparisons
20 / 20
© Copyright 2026 Paperzz