Manuscript on bioRxiv
We also have a Poster
LEAP: A Generalization of the Landau-Vishkin Algorithm with Custom Gap Penalties
Hongyi Xin1, Jeremie Kim1, Sunny Nahar1,4, Can Alkan3 and Onur Mutlu1,2
1Carnegie Mellon University
2ETH Zürich
3Bilkent University
4Google
Approximate String Matching Problem (ASM)
• A fundamental problem in Bioinformatics
• DNA and protein sequence mapping and comparisons
• Computes a similarity score between two strings
• Differences include mismatches as well as indels
• Common scoring schemes
• Edit-distance (mismatch and indels have same penalty score)
• Affine gapping (gap opening is more penalized than gap extension)
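The two scoring schemes can be compared on a single gapped alignment. The sketch below is illustrative only: the function name `alignment_cost`, the penalty values, and the convention that a gap of length L costs `gap_open + L * gap_extend` are assumptions, not taken from the slides.

```python
def alignment_cost(top, bottom, mismatch, gap_open, gap_extend):
    """Cost of a given gapped alignment ('-' marks a gap).

    Assumed convention: a gap of length L costs gap_open + L * gap_extend.
    Edit-distance is the special case mismatch=1, gap_open=0, gap_extend=1.
    """
    assert len(top) == len(bottom), "aligned rows must have equal length"
    cost = 0
    prev_gap = None  # row that carried a gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap = "top" if x == "-" else "bottom"
            if gap != prev_gap:      # a new gap starts here: pay the opening
                cost += gap_open
            cost += gap_extend       # every gap column pays the extension
            prev_gap = gap
        else:
            if x != y:
                cost += mismatch
            prev_gap = None
    return cost


top, bottom = "ACT--GCA", "ACTTAGGA"   # one gap of length 2, one mismatch
edit = alignment_cost(top, bottom, mismatch=1, gap_open=0, gap_extend=1)
affine = alignment_cost(top, bottom, mismatch=4, gap_open=6, gap_extend=1)
```

Under edit-distance the two gap columns and the mismatch each cost 1; under the assumed affine penalties the gap is charged once for opening plus once per column.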
The Landau-Vishkin (1989) Algorithm
[Figure: example dynamic-programming table]
• Dynamic-programming algorithm for edit-distance
• Diagonally oriented
• Find the furthest nodes reachable at edit-distance e
• Use the nodes at e to compute the furthest nodes at edit-distance e+1
• Procedure (3 steps)
• Pick the furthest starting position at e (initiation)
• Traverse diagonally until hitting a mismatch (elongation)
• Notify neighbor diagonals of new potential furthest starting positions at e+1 (termination)
[Figure legend: furthest node reachable at e]
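The three steps (initiation, elongation, termination) can be sketched for plain edit-distance as follows. This is a minimal illustration, not the authors' implementation; the function name `landau_vishkin`, the `max_edits` parameter and the dictionary-based bookkeeping are assumptions made for readability.

```python
def landau_vishkin(a, b, max_edits=None):
    """Edit distance between a and b, or None if it exceeds max_edits."""
    m, n = len(a), len(b)
    if max_edits is None:
        max_edits = max(m, n)
    target = n - m  # diagonal (column minus row) holding the end cell (m, n)

    def elongate(i, k):
        # Elongation: slide along diagonal k while the characters match.
        while i < m and i + k < n and a[i] == b[i + k]:
            i += 1
        return i

    # furthest[k] = furthest row reached on diagonal k at the current e
    furthest = {0: elongate(0, 0)}
    if furthest.get(target) == m:
        return 0
    for e in range(1, max_edits + 1):
        nxt = {}
        for k in range(max(-e, -m), min(e, n) + 1):
            # Initiation: collect the candidate starting rows handed over
            # by the terminations of edit level e-1.
            starts = []
            if k in furthest:          # mismatch: one diagonal step
                starts.append(furthest[k] + 1)
            if k - 1 in furthest:      # horizontal step from diagonal k-1
                starts.append(furthest[k - 1])
            if k + 1 in furthest:      # vertical step from diagonal k+1
                starts.append(furthest[k + 1] + 1)
            if not starts:
                continue
            i = min(max(starts), m, n - k)  # stay inside the table
            nxt[k] = elongate(i, k)
        furthest = nxt
        if furthest.get(target) == m:
            return e
    return None
```

Only one row index per diagonal is kept for each value of e, which is the bookkeeping the slide describes.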
Benefit of the Landau-Vishkin Algorithm
[Figure: example DP table for two sequences]
• Compared against Ukkonen’s banded algorithm
• Less bookkeeping
• Only stores starting positions and elongation length
• Computes fewer nodes
• Only the start and end positions in the diagonal
• Less compute per node
• Elongation only checks for the next mismatch
• Termination updates the furthest node for the next edit iteration
• Only one termination per edit per diagonal
• Less work overall (although the complexity is still O(k*m))
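For contrast, a banded dynamic program fills every cell within k diagonals of the main diagonal. The sketch below is a textbook banded edit-distance in the spirit of Ukkonen's cutoff, not a faithful reproduction of his algorithm; the name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(a, b, k):
    """Edit distance if it is at most k, else None (cells with |i-j| <= k only)."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return None
    INF = float("inf")
    # Row 0: only the first k+1 columns lie inside the band.
    prev = [j if j <= k else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [INF] * (n + 1)
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # match or mismatch
            dele = prev[j] + 1                          # vertical move
            ins = cur[j - 1] + 1                        # horizontal move
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[n] if prev[n] <= k else None
```

Every in-band cell evaluates three predecessors here, whereas the diagonal formulation touches a cell only while elongating across a run of matches.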
Limitations
• Landau-Vishkin was proposed for edit-distance
• Correctness for custom gap penalties has not been proven
Our Contribution
• Prove that Landau-Vishkin also works for custom gap penalties
• Including affine-gapping
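As a reference point for what "custom gap penalties" means, the standard three-state dynamic program for affine gaps (usually attributed to Gotoh) is sketched below. It is a full-table baseline, not LEAP itself; the function name, the default penalties, and the convention that a gap of length L costs `gap_open + L * gap_extend` are assumptions.

```python
def affine_gap_cost(a, b, mismatch=4, gap_open=6, gap_extend=1):
    """Minimum global alignment cost under affine gap penalties."""
    INF = float("inf")
    m, n = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap
    # Y: last column aligns b[j-1] with a gap
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    X = [[INF] * (n + 1) for _ in range(m + 1)]
    Y = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + j * gap_extend
    start = gap_open + gap_extend  # cost of the first column of a new gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + sub
            X[i][j] = min(M[i-1][j] + start, X[i-1][j] + gap_extend,
                          Y[i-1][j] + start)
            Y[i][j] = min(M[i][j-1] + start, Y[i][j-1] + gap_extend,
                          X[i][j-1] + start)
    return min(M[m][n], X[m][n], Y[m][n])
```

Setting `mismatch=1, gap_open=0, gap_extend=1` reduces this to ordinary edit-distance, which is the setting Landau-Vishkin was originally proposed for.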
Problem Setup
• Convert the process of computing the DP table into traversing a graph
• Nodes in the table → vertices
• Horizontal, vertical and diagonal transitions → edges
• Mismatches, indels → weights on the edges
• Goal: find a path from start to destination with minimum total edge weight
[Figure: example DP table viewed as a graph]
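The graph view can be made concrete with a generic shortest-path search over the table cells. The sketch below uses Dijkstra's algorithm purely to illustrate the formulation; it is not how LEAP or Landau-Vishkin traverse the graph, and the name `min_cost_path` and the default weights are assumptions.

```python
import heapq


def min_cost_path(a, b, mismatch=1, indel=1):
    """Cheapest path from cell (0, 0) to cell (len(a), len(b))."""
    m, n = len(a), len(b)
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]  # (cost so far, row, column)
    while heap:
        d, i, j = heapq.heappop(heap)
        if (i, j) == (m, n):
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        edges = []
        if i < m and j < n:  # diagonal edge: free on a match
            edges.append((i + 1, j + 1, 0 if a[i] == b[j] else mismatch))
        if i < m:            # vertical edge: indel
            edges.append((i + 1, j, indel))
        if j < n:            # horizontal edge: indel
            edges.append((i, j + 1, indel))
        for ni, nj, w in edges:
            nd = d + w
            if nd < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = nd
                heapq.heappush(heap, (nd, ni, nj))
    return None
```

With unit weights the result equals the edit-distance; changing `mismatch` and `indel` changes the edge weights without changing the graph.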
Problem Statement of LEAP
• Toad swims in a swimming pool with hurdles in it
[Figure labels: swim, stride, leap]
Three Theorems
• Delaying a leap in the path until the next hurdle does not add cost to the path
• There exists an optimal path where the toad only leaps before a hurdle
• The Landau-Vishkin algorithm finds such an optimal path
Pseudo-Proof 1
• Blue path has the same cost as the red path
• No change in leaps, no additional strides
Pseudo-Proof 2
• There must exist an optimal path where leaps occur right before hurdles
• Apply Theorem 1 iteratively
Pseudo-Proof 3
• Can be proved using induction
• Energy cost monotonically increases along the path
• Check out our paper/poster for further details!
Further Optimizations
• Elongation finds the next mismatch
• Elongation can be sped up using the de Bruijn sequence technique
• Think of it as a hashing function
• No need to iteratively check for letter matches. Use bit-parallel operations instead!
[Figure: bit-parallel elongation on two example sequences; operations labelled in the figure: XOR, shift, AND, 2’s complement, De Bruijn sequence lookup]
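One way to realise bit-parallel elongation is sketched below. The 2-bit base encoding, the 16-base word size, the function names, and the specific 32-bit de Bruijn constant `0x077CB531` are assumptions chosen for illustration; the slides only state that a de Bruijn sequence lookup replaces the per-letter comparison loop.

```python
DEBRUIJN32 = 0x077CB531  # a 32-bit de Bruijn sequence (assumed choice)
MASK32 = 0xFFFFFFFF

# TABLE maps the top 5 bits of (1 << i) * DEBRUIJN32 back to the bit index i.
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN32 << i) & MASK32) >> 27] = i

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack up to 16 bases, 2 bits each, first base in the lowest bits."""
    word = 0
    for i, base in enumerate(seq[:16]):
        word |= CODE[base] << (2 * i)
    return word


def matching_prefix_len(a, b):
    """Number of leading bases (at most 16) on which a and b agree."""
    n = min(len(a), len(b), 16)
    diff = pack(a[:n]) ^ pack(b[:n])           # XOR: nonzero where bases differ
    diff = (diff | (diff >> 1)) & 0x55555555   # shift + AND: one flag bit per base
    if diff == 0:
        return n
    lowest = diff & -diff                      # 2's complement + AND: lowest set bit
    bit = TABLE[((lowest * DEBRUIJN32) & MASK32) >> 27]  # de Bruijn lookup
    return bit // 2                            # two bits per base
```

The multiplication by the de Bruijn constant acts as a perfect hash from the isolated bit to its index, so a run of up to 16 bases is checked with a constant number of word operations.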
Results
• Two implementations
• LEAP (without De Bruijn sequence optimization)
• LEAP-BV (with optimization)
• Compared to 3 state-of-the-art implementations
• Myers’ bit-vector (SeqAn)
• NW-SIMD
• Canonical LV (equivalent to LEAP for Levenshtein distance)
Takeaway: LEAP-BV attains as much as 7.4x speedup against Myers’ bit-vector and 32x speedup against NW-SIMD
Conclusion
• We prove that the Landau-Vishkin algorithm can be extended to support custom gap penalties
• We further optimized the Landau-Vishkin method
• We achieved up to 7.4x speedup over the state-of-the-art implementation (Myers’ bit-vector)
Acknowledgement
Special thanks to:
Q&A