Genomic Sorting with Length-Weighted Reversals

Computer Science Department
Technion – Israel Institute of Technology
Genomic Sorting with
Length-Weighted Reversals
Ron Y. Pinter
Technion
Steve Skiena
SUNY Stony Brook
Genome Rearrangement
• events
– duplication
– translocation
– reversal (inversion)
• occur primarily during reproduction
• allow large-scale genomic comparisons
Sorting by Reversals
• genome represented as a permutation on
1, 2, …, n
– n = # homologous genes among species
• assumptions
– can identify genes
– genes are distinct
• operation: reversal of a subsequence (of genes)
– models inversion (occurs during crossover)
• one of the permutations can be 1, 2, …, n
– appropriately relabel others
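The two operations on this slide can be sketched in a few lines. This is a minimal illustration, assuming 0-based Python lists; the helper names `reverse_segment` and `relabel` and the two small genomes are hypothetical and do not come from the talk.

```python
def reverse_segment(perm, i, j):
    """Return a copy of perm with the slice perm[i:j] reversed (0-based, j exclusive)."""
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def relabel(reference, other):
    """Rename genes so that `reference` reads 1, 2, ..., n, and apply the same renaming to `other`."""
    new_name = {gene: k for k, gene in enumerate(reference, start=1)}
    return [new_name[gene] for gene in other]


# Hypothetical gene orders of two species over the same 4 genes.
genome_a = [3, 1, 4, 2]
genome_b = [4, 1, 3, 2]

identity = relabel(genome_a, genome_a)    # genome_a becomes 1, 2, ..., n
target = relabel(genome_a, genome_b)      # genome_b expressed in the new labels
print(identity, target)
print(reverse_segment(target, 0, 3))      # a single reversal sorts this toy instance
```

After relabeling, comparing the two genomes reduces to sorting one permutation into 1, 2, …, n.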
Example
4 3 2 8 7 1 5 6 11 10 9
4 3 2 1 7 8 5 6 9 10 11
1 2 3 4 8 7 6 5 9 10 11
1 2 3 4 5 6 7 8 9 10 11
• 6 reversals
• in our model (for f(l) = l): cost = 18
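The example above can be replayed in code. The slide shows only the intermediate permutations, so the six individual reversals listed below are an assumption: they are one decomposition that reproduces each displayed row and matches the stated totals. Positions are 1-based and inclusive; `apply_reversals` is a hypothetical helper name.

```python
def apply_reversals(perm, reversals, f=lambda l: l):
    """Apply reversals given as 1-based inclusive (i, j) pairs; return the result and total cost."""
    perm = list(perm)
    cost = 0
    for i, j in reversals:
        perm[i - 1:j] = perm[i - 1:j][::-1]
        cost += f(j - i + 1)          # each reversal is charged f(length)
    return perm, cost


start = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]
# Assumed decomposition (not listed on the slide): lengths 3, 3, 4, 2, 2, 4.
steps = [(4, 6), (9, 11), (1, 4), (5, 6), (7, 8), (5, 8)]

result, linear_cost = apply_reversals(start, steps)                 # f(l) = l
_, unit_cost = apply_reversals(start, steps, f=lambda l: 1)         # f(l) = 1 counts reversals
print(result, linear_cost, unit_cost)
```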
Our Model
• unsigned
• cost of reversal of subsequence of length l is f(l)
• total sorting cost (or distance) is $\sum_j f(\mathrm{length}(s_j))$, where the $s_j$ are the reversed subsequences
Cost Functions
• additive
f(x+y) = f(x) + f(y)
• subadditive
f(x+y) < f(x) + f(y)
• superadditive
f(x+y) > f(x) + f(y)
• other
– e.g. bitonic
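The classes above can be checked numerically for a candidate cost function. The sketch below only tests the inequalities on a finite range of lengths, so it is a sanity check rather than a proof; `classify` and the particular bitonic function are hypothetical choices made for illustration.

```python
def classify(f, max_len=20):
    """Compare f(x+y) with f(x)+f(y) for all 1 <= x, y < max_len (finite check only)."""
    diffs = [f(x + y) - (f(x) + f(y))
             for x in range(1, max_len) for y in range(1, max_len)]
    if all(d == 0 for d in diffs):
        return "additive"
    if all(d < 0 for d in diffs):
        return "subadditive"
    if all(d > 0 for d in diffs):
        return "superadditive"
    return "other"


print(classify(lambda l: l))               # linear cost
print(classify(lambda l: 1))               # unit cost
print(classify(lambda l: l * l))           # quadratic cost
print(classify(lambda l: min(l, 40 - l)))  # an assumed bitonic shape: rises, then falls
```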
Problems
• algorithm to sort any permutation
– worst-case min cost
• approximate min cost for a given
permutation
Extremal Costs
• highly subadditive: e.g. unit cost, f(l) = 1
– NP-complete [Caprara, ’97]
– series of approximation ratios: 2, 1.75, 1.375
• highly superadditive: $f(l) > l^2$
– essentially bubblesort
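To illustrate the superadditive extreme: when long reversals are expensive, a natural strategy uses only reversals of length 2, which is bubblesort. The sketch below is an illustration under that assumption, not the analysis from the talk; `bubble_sort_by_reversals` is a hypothetical name. Each adjacent swap is one reversal of length 2, so the total cost is f(2) times the number of inversions.

```python
def bubble_sort_by_reversals(perm, f):
    """Sort using only reversals of length 2 (adjacent swaps); return result and total cost."""
    perm = list(perm)
    cost = 0
    for end in range(len(perm) - 1, 0, -1):
        for i in range(end):
            if perm[i] > perm[i + 1]:
                # Reversing the 2-element segment perm[i:i+2] swaps the pair.
                perm[i], perm[i + 1] = perm[i + 1], perm[i]
                cost += f(2)
    return perm, cost


sorted_perm, cost = bubble_sort_by_reversals([3, 1, 2], f=lambda l: l * l)
print(sorted_perm, cost)   # 2 inversions, each charged f(2) = 4
```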
Our Results
• additive cost function
– specifically f(l) = l
• QuickSort-like algorithm for worst-case
– complexity: $O(n \lg^2 n)$
• min cost approximation ratio of $O(\lg^2 n)$
MedianEject(a,b)
• find r maximal blocks of wrong-sided
elements with respect to median
• for lg r do: flip every other pair of blocks
of wrong-sided and adjacent
blocks
• move wrong-sided blocks to median
boundary
• reverse left and right blocks
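Only the first step above, finding the maximal blocks of wrong-sided elements, is sketched here. The slide does not define "wrong-sided" formally, so the code assumes it means an element that sits in the left half of the range but belongs to the upper half by value, or the reverse. The function name and the sample permutation are hypothetical.

```python
def wrong_sided_blocks(perm):
    """Return maximal runs (start, end), end exclusive, of elements on the wrong side of the median."""
    n = len(perm)
    half = n // 2
    median = sorted(perm)[half]      # values < median are assumed to belong in the left half
    wrong = [(v >= median) if pos < half else (v < median)
             for pos, v in enumerate(perm)]
    blocks, start = [], None
    for pos, is_wrong in enumerate(wrong):
        at_boundary = pos == half    # a block never spans the median boundary
        if start is not None and (not is_wrong or at_boundary):
            blocks.append((start, pos))
            start = None
        if is_wrong and start is None:
            start = pos
    if start is not None:
        blocks.append((start, n))
    return blocks


print(wrong_sided_blocks([1, 6, 7, 2, 5, 3, 4, 8]))
```

The number of blocks found this way plays the role of r in the later steps.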
Sample Run
complexity: O((b-a) lg r)
ReversalSort(a,b)
MedianEject (a,b);
ReversalSort (a, ⌊(b+a)/2⌋);
ReversalSort (⌈(b+a)/2⌉, b);
Complexity
$T(n) = 2 \cdot T(n/2) + O(f(n) \lg n) \Rightarrow T(n) = O(f(n) \lg^2 n)$
$= O(n \lg^2 n)$ for $f(n) \sim n$
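The recursive structure and the resulting bound can be tried out with a runnable sketch. Note that this is not the MedianEject procedure from the talk: the partition step is replaced by a simpler, well-known divide-and-conquer partition that also uses only reversals and costs O(n lg n) per call under f(l) = l, so the same recurrence applies. All function names are hypothetical.

```python
import math
import random


def partition(perm, lo, hi, pivot, lengths):
    """Using reversals only, move values < pivot to the front of perm[lo:hi].
    Returns the index of the first value >= pivot. Requires hi - lo >= 1."""
    if hi - lo == 1:
        return lo + 1 if perm[lo] < pivot else lo
    mid = (lo + hi) // 2
    left = partition(perm, lo, mid, pivot, lengths)     # perm[left:mid] holds large values
    right = partition(perm, mid, hi, pivot, lengths)    # perm[mid:right] holds small values
    if left < mid < right:
        # One reversal puts the small values in front of the large ones.
        perm[left:right] = perm[left:right][::-1]
        lengths.append(right - left)
    return left + (right - mid)


def reversal_sort(perm, lo, hi, lengths):
    """Sort perm[lo:hi] (distinct values) in place; reversal lengths are appended to `lengths`."""
    if hi - lo < 2:
        return
    pivot = sorted(perm[lo:hi])[(hi - lo) // 2]          # median value of the range
    split = partition(perm, lo, hi, pivot, lengths)
    reversal_sort(perm, lo, split, lengths)
    reversal_sort(perm, split, hi, lengths)


n = 64
perm = list(range(1, n + 1))
random.Random(0).shuffle(perm)
lengths = []
reversal_sort(perm, 0, n, lengths)
cost = sum(lengths)                                       # total cost under f(l) = l
bound = n * math.ceil(math.log2(n)) ** 2
print(cost, bound)
```

Each recursion level of `partition` reverses at most n elements in total, which is where the extra lg n factor in the recurrence comes from.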
Algorithmic Improvements
I simplify “short” phases
II merge 2 last steps of MedianEject when possible (2p+q vs. 3p+q, for blocks of lengths p, q, p)
III apply II recursively
Approximation Ratio
• M(p) is the maximal total distance between pairs of out-of-order elements
Lemma 4: min cost is $\Omega(M(p))$
but
Lemma 6: # of out-of-order elements < $3 \cdot M(p)$
+
Lemma 7: MedianEject touches only elements within linear range from out-of-order elements
yields:
• each round of MedianEject takes $O(M(p) \cdot \lg^2 n)$
• ReversalSort costs $O(M(p) \cdot \lg^2 n)$
• ReversalSort is at most $O(\lg^2 n)$ times optimal
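The last step of the argument can be written as a single chain. The constants $c_1, c_2 > 0$ are hypothetical names for the constants hidden in the $O$ and $\Omega$ bounds, and $\mathrm{OPT}(p)$ denotes the minimum sorting cost:

```latex
\[
\frac{\mathrm{cost}\bigl(\mathrm{ReversalSort}(p)\bigr)}{\mathrm{OPT}(p)}
\;\le\; \frac{c_1 \, M(p) \, \lg^2 n}{c_2 \, M(p)}
\;=\; \frac{c_1}{c_2} \, \lg^2 n
\;=\; O(\lg^2 n)
\]
```

The upper bound on the numerator is the ReversalSort cost bound and the lower bound on the denominator is Lemma 4, so $M(p)$ cancels.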
Bioinformatic “Validation”
• use our cost (= distance) to build phylogenetic trees
• 4 plants (chloroplastic genes): Cyanophora, Cyanidium, Guillardia, Porphyra
• consistent with [Martin et al., PNAS Sept ‘02]
• work in progress [M. Shoham]
Open Problems: Algorithmic
• weighted genes
• tighter approximation ratio
– close to O(lg n)
– can get to constant?
• other cost functions (incl. bitonic)
• the signed case
Open Problems: Modeling
• chromosomal ordering
• what is the right cost function?
– consider $\mathrm{cost}(l) = l^d$
• combine with constraint-based models
– restricted regions
– “undesired” reversal sequences
• deal with duplication and translocation events