Computer Science Department
Technion – Israel Institute of Technology

Genomic Sorting with Length-Weighted Reversals

Ron Y. Pinter, Technion
Steve Skiena, SUNY Stony Brook

Genome Rearrangement
• events
  – duplication
  – translocation
  – reversal (inversion)
• occur primarily during reproduction
• allow large-scale genomic comparisons

Sorting by Reversals
• genome represented as a permutation on 1, 2, …, n
  – n = # homologous genes among species
• assumptions
  – can identify genes
  – genes are distinct
• operation: reversal of a subsequence (of genes)
  – models inversion (occurs during crossover)
• one of the permutations can be 1, 2, …, n
  – appropriately relabel others

Example
  4 3 2 8 7 1 5 6 11 10 9
  4 3 2 1 7 8 5 6 9 10 11
  1 2 3 4 8 7 6 5 9 10 11
  1 2 3 4 5 6 7 8 9 10 11
• 6 reversals
• in our model (for f(l) = l): cost = 18

Our Model
• unsigned
• cost of reversal of subsequence of length l is f(l)
• total sorting cost (or distance) is Σ_j f(length(s_j)), where the s_j are the reversed subsequences

Cost Functions
• additive: f(x+y) = f(x) + f(y)
• subadditive: f(x+y) < f(x) + f(y)
• superadditive: f(x+y) > f(x) + f(y)
• other
  – e.g. bitonic

Problems
• algorithm to sort any permutation
  – worst-case min cost
• approximate min cost for a given permutation

Extremal Costs
• highly subadditive: e.g.
  unit cost, f(l) = 1
  – NP-complete [Caprara, ’97]
  – series of approximation ratios: 2, 1.75, 1.375
• highly superadditive: f(l) > l^2
  – essentially bubblesort

Our Results
• additive cost function
  – specifically f(l) = l
• QuickSort-like algorithm for worst-case
  – complexity: O(n lg^2 n)
• min cost approximation ratio of O(lg^2 n)

MedianEject(a,b)
• find r maximal blocks of wrong-sided elements with respect to median
• for lg r rounds do: flip every other pair of blocks of wrong-sided and adjacent blocks
• move wrong-sided blocks to median boundary
• reverse left and right blocks

Sample Run
• complexity: O((b-a) lg r)

ReversalSort(a,b)
  MedianEject(a, b);
  ReversalSort(a, (b+a)/2);
  ReversalSort((b+a)/2, b);
• Complexity
  – T(n) = 2 T(n/2) + O(f(n) lg n)
  – which gives O(f(n) lg^2 n) = O(n lg^2 n) for f(n) ~ n

Algorithmic Improvements
I   simplify “short” phases
II  merge the last 2 steps of MedianEject when possible (2p+q vs. 3p+q)
    [figure: adjacent blocks labeled p, q, p]
III apply II recursively

Approximation Ratio
• M(π) is the maximal total distance between pairs of out-of-order elements
• Lemma 4: min cost is Ω(M(π))
• but Lemma 6: # of out-of-order elements < 3 M(π)
• + Lemma 7: MedianEject touches only elements within linear range from out-of-order elements
• yields:
  – each round of MedianEject takes O(M(π) lg^2 n)
  – ReversalSort costs O(M(π) lg^2 n)
  – ReversalSort is at most O(lg^2 n) times optimal

Bioinformatic “Validation”
• use our cost (= distance) to build phylogenetic trees
  [figure: tree with leaves Cyanophora, Cyanidium, Guillardia, Porphyra]
• 4 plants (chloroplastic genes)
• consistent with [Martin et al., PNAS Sept ’02]
• work in progress [M. Shoham]

Open Problems: Algorithmic
• weighted genes
• tighter approximation ratio
  – close to O(lg n)
  – can get to constant?
• other cost functions (incl. bitonic)
• the signed case

Open Problems: Modeling
• chromosomal ordering
• what is the right cost function?
  – consider cost(l) = l^d
• combine with constant-based models
  – restricted regions
  – “undesired” reversal sequences
• deal with duplication and translocation events
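
Illustration: the cost model
The length-weighted cost model can be checked against the Example slide with a few lines of Python. This is only a minimal sketch: the function names `apply_reversal` and `sorting_cost`, the 0-based half-open index pairs, and the particular order of the six reversals are assumptions made for illustration and do not come from the slides. It applies a list of reversals to the starting permutation and sums f(length) over them.

```python
from typing import Callable, List, Sequence, Tuple


def apply_reversal(perm: Sequence[int], i: int, j: int) -> List[int]:
    """Return a copy of perm with positions i..j-1 reversed (0-based, j exclusive)."""
    perm = list(perm)
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def sorting_cost(
    perm: Sequence[int],
    reversals: Sequence[Tuple[int, int]],
    f: Callable[[int], int] = lambda l: l,  # additive cost f(l) = l by default
) -> Tuple[List[int], int]:
    """Apply the reversals in order; return (final permutation, total cost)."""
    current = list(perm)
    total = 0
    for i, j in reversals:
        current = apply_reversal(current, i, j)
        total += f(j - i)  # each reversal is charged f(length of reversed span)
    return current, total


# Starting permutation from the Example slide.
start = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]

# One possible ordering of the six reversals (assumed; the slide shows only
# the intermediate permutations, not the order within each step).
reversals = [(3, 6), (8, 11), (0, 4), (4, 6), (6, 8), (4, 8)]

final, cost = sorting_cost(start, reversals)
print(final)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(cost)   # 18, i.e. 3 + 3 + 4 + 2 + 2 + 4

# Under unit cost f(l) = 1 the same sequence costs one per reversal.
_, unit_cost = sorting_cost(start, reversals, f=lambda l: 1)
print(unit_cost)  # 6
```

Passing a different `f` (for example `lambda l: l ** 2`) shows how the same reversal sequence is priced under a superadditive cost.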