LEAP: A Generalization of the Landau-Vishkin Algorithm with Custom Gap Penalties

Hongyi Xin (1), Jeremie Kim (1), Sunny Nahar (1,4), Can Alkan (3) and Onur Mutlu (1,2)
(1) Carnegie Mellon University  (2) ETH Zürich  (3) Bilkent University  (4) Google

Manuscript on bioRxiv. We also have a poster.

Approximate String Matching Problem (ASM)
• A fundamental problem in bioinformatics
  • DNA and protein sequence mapping and comparison
• Computes a similarity score between two strings
  • Differences include mismatches as well as indels
• Common scoring schemes
  • Edit distance (mismatches and indels have the same penalty)
  • Affine gap penalties (opening a gap is penalized more than extending it)

The Landau-Vishkin (1989) Algorithm
[Figure: partial DP table for the example sequences, with the furthest node reachable at e marked on each diagonal]
• Dynamic-programming algorithm for edit distance
• Diagonally oriented
  • Find the furthest nodes reachable at edit distance e
  • Use the nodes at e to compute the furthest nodes at edit distance e+1
• Procedure (3 steps)
  • Pick the furthest starting position at e (initiation)
  • Traverse the diagonal until hitting a mismatch (elongation)
  • Notify the neighboring diagonals of new potential furthest starting positions at e+1 (termination)

Benefit of the Landau-Vishkin Algorithm
[Figure: full DP table for the example sequences, with the nodes that end each elongation marked]
• Compared against Ukkonen's banded algorithm
• Less bookkeeping
  • Only stores starting positions and elongation lengths
• Computes fewer nodes
  • Only the start and end positions in each diagonal
• Less compute per node
  • Elongation only checks for the next mismatch
  • Termination updates the furthest node for the next edit iteration
  • Only one termination per edit per diagonal
• Less work!
(although still O(k*m))

Limitations
• Landau-Vishkin was proposed for edit distance
• Its correctness for custom gap penalties has not been proven

Our Contribution
• We prove that Landau-Vishkin also works for custom gap penalties
  • Including affine gap penalties

Problem Setup
[Figure: DP table for the example sequences ACTTAGCACT and ACTAGAACTT]
• Convert the process of computing the DP table into traversing a graph
  • Nodes in the table → vertices
  • Horizontal, vertical and diagonal transitions → edges
  • Mismatches, indels → weights on the edges
• Goal: find a path from start to destination with minimum total edge weight

Problem Statement of LEAP
• A toad swims in a swimming pool with hurdles in it
[Figure: the toad's moves, labelled swim, stride and leap]

Three Theorems
• Delaying a leap in the path until the next hurdle does not add cost to the path
• There exists an optimal path where the toad only leaps before a hurdle
• The Landau-Vishkin algorithm finds such an optimal path

Pseudo-Proof 1
• The blue path has the same cost as the red path
• No change in leaps, no additional strides

Pseudo-Proof 2
• There must exist an optimal path where leaps occur right before hurdles
• Apply Theorem 1 iteratively

Pseudo-Proof 3
• Can be proved by induction
• Energy cost increases monotonically along the path
• Check out our paper/poster for further details!

Further Optimizations
• Elongation finds the next mismatch
• Elongation can be sped up using the De Bruijn sequence technique
  • Think of it as a hashing function
• No need to check for letter matches iteratively. Use bit-parallel operations instead!
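The bit-parallel elongation idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the LEAP-BV implementation: the 2-bit base encoding, the 32-bit word size, and the names `pack`, `matching_prefix_length` and `BIT_INDEX` are all hypothetical choices made for this example. The sketch XORs two packed words, isolates the lowest set bit with a two's-complement AND, and turns that bit into a position by multiplying with a 32-bit De Bruijn constant and looking up the top 5 bits.

```python
MASK32 = 0xFFFFFFFF
DEBRUIJN32 = 0x077CB531  # a commonly used 32-bit De Bruijn sequence B(2, 5)

# Multiplying by a power of two is a left shift, so the top 5 bits of
# (DEBRUIJN32 << i) are distinct for every i in 0..31 and act as a
# perfect hash of the bit position.
BIT_INDEX = [0] * 32
for i in range(32):
    BIT_INDEX[((DEBRUIJN32 << i) & MASK32) >> 27] = i

# Assumed encoding for this sketch: 2 bits per base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack up to 16 bases into one 32-bit word, first base in the low bits."""
    word = 0
    for i, base in enumerate(seq[:16]):
        word |= CODE[base] << (2 * i)
    return word


def matching_prefix_length(a, b):
    """Number of leading bases on which a and b agree (capped at 16)."""
    n = min(len(a), len(b), 16)
    diff = pack(a[:n]) ^ pack(b[:n])   # set bits mark mismatching bases
    if diff == 0:
        return n                        # no mismatch inside this word
    lowest = diff & -diff               # two's complement AND keeps the lowest set bit
    bit = BIT_INDEX[((lowest * DEBRUIJN32) & MASK32) >> 27]
    return bit // 2                     # 2 bits per base


print(matching_prefix_length("ACTTAGCACG", "ACTTCGCACG"))
```

The point of the lookup is that one XOR, one AND, one multiply and one table access replace a character-by-character loop over up to 16 bases. Longer diagonals would need several words, which this sketch leaves out.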
[Figure: bit-parallel elongation example on the sequence ACTAGAACTT. Operations labelled in the figure: XOR, shift, 2's complement, AND, multiplication by a De Bruijn sequence, and a table lookup that returns 4]

Results
• Two implementations
  • LEAP (without the De Bruijn sequence optimization)
  • LEAP-BV (with the optimization)
• Compared to 3 state-of-the-art implementations
  • Myers' bit-vector (SeqAn)
  • NW-SIMD
  • Canonical LV (equivalent to LEAP for Levenshtein distance)
• Takeaway: LEAP-BV attains as much as a 7.4x speedup over Myers' bit-vector and a 32x speedup over NW-SIMD

Conclusion
• We prove that the Landau-Vishkin algorithm can be extended to support custom gap penalties
• We further optimize the Landau-Vishkin method
• We achieve up to a 7.4x speedup over the state-of-the-art implementation

Acknowledgement
• Special thanks to:

Q&A
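Backup: the three-step Landau-Vishkin procedure (initiation, elongation, termination) can be sketched in Python for plain edit distance. This is a minimal illustration, not the authors' code, and it does not implement LEAP's custom gap penalties. The function name `lv_edit_distance` and the dictionary-based bookkeeping are assumptions made for this example. Termination is written as a pull: each diagonal reads the furthest positions of its neighbors from the previous edit level instead of being notified by them.

```python
def lv_edit_distance(a: str, b: str) -> int:
    """Edit distance between a and b using furthest-reaching diagonals."""
    m, n = len(a), len(b)
    target = n - m  # diagonal (j - i) that contains the end cell (m, n)

    def elongate(i: int, d: int) -> int:
        # Elongation: slide down diagonal d while the characters match.
        j = i + d
        while i < m and j < n and a[i] == b[j]:
            i += 1
            j += 1
        return i

    # furthest[d] = largest row i reachable on diagonal d within e edits
    furthest = {0: elongate(0, 0)}
    e = 0
    while furthest.get(target, -1) < m:
        e += 1
        nxt = {}
        # Only diagonals that intersect the table are considered.
        for d in range(max(-e, -m), min(e, n) + 1):
            starts = []
            if d in furthest:        # mismatch: stay on the same diagonal
                starts.append(furthest[d] + 1)
            if d + 1 in furthest:    # skip a[i]: arrive from diagonal d + 1
                starts.append(furthest[d + 1] + 1)
            if d - 1 in furthest:    # skip b[j]: arrive from diagonal d - 1
                starts.append(furthest[d - 1])
            # Initiation: furthest candidate start, clipped to the table.
            i = min(max(starts), m, n - d)
            nxt[d] = elongate(i, d)
        furthest = nxt
    return e


print(lv_edit_distance("kitten", "sitting"))
```

Each edit level stores one row index per diagonal, which mirrors the "less bookkeeping" point: the full DP table is never materialized.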