Semi-local longest common subsequences:
A superglue for string comparison

Alexander Tiskin
Department of Computer Science, University of Warwick
http://go.warwick.ac.uk/alextiskin

Alexander Tiskin (Warwick)    Semi-local LCS    1 / 156

1   Introduction
2   Matrix distance multiplication
3   Semi-local string comparison
4   The seaweed method
5   Periodic string comparison
6   Sparse string comparison
7   Compressed string comparison
8   Parallel string comparison
9   The transposition network method
10  Beyond semi-locality
Introduction

String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings

Standard types of string comparison:
    global: whole string vs whole string
    local: substrings vs substrings
    semi-local: whole string vs substrings; prefixes vs suffixes

Main focus of this work: semi-local string comparison
Closely related to approximate string matching: whole pattern string vs
substrings in text string
Introduction
Overview

Standard approach to string comparison: dynamic programming
Our approach: the algebra of unit-Monge matrices, a.k.a. seaweed braids
Can be used either for dynamic programming, or for divide-and-conquer

Divide-and-conquer is better for:
    comparing strings dynamically (truncation, concatenation)
    comparing compressed strings (e.g. LZ-compression)
    comparing strings in parallel

The "conquer" step is non-trivial, but not so much that it couldn't have
been discovered much earlier!

Introduction
Terminology and notation

The "squared paper" notation:
    integers {..., −2, −1, 0, 1, 2, ...}
    half-integers {..., −3/2, −1/2, 1/2, 3/2, 5/2, ...} =
        {..., (−2)⁺, (−1)⁺, 0⁺, 1⁺, 2⁺, ...}
    x⁻ = x − 1/2, x⁺ = x + 1/2

Planar dominance:
    (i, j) ≪ (i′, j′) iff i < i′ and j < j′ (the "above-left" relation)
    (i, j) ≷ (i′, j′) iff i > i′ and j < j′ (the "below-left" relation)
where "above/below" follow the matrix convention (not the Cartesian one!)
Introduction
Terminology and notation

A permutation matrix is a 0/1 matrix with exactly one nonzero per row
and per column:

    0 1 0
    1 0 0
    0 0 1

Given a matrix D, its distribution matrix D^Σ is made up of ≷-dominance
sums: D^Σ(i, j) = Σ_{î > i, ĵ < j} D(î, ĵ)

Given a matrix E, its density matrix E^□ is made up of quadrangle
differences: E^□(î, ĵ) = E(î⁻, ĵ⁺) − E(î⁻, ĵ⁻) − E(î⁺, ĵ⁺) + E(î⁺, ĵ⁻)

where D^Σ, E are indexed by integers, and D, E^□ by half-integers

    ( 0 1 0 )Σ   ( 0 1 2 3 )        ( 0 1 2 3 )□   ( 0 1 0 )
    ( 1 0 0 )  = ( 0 1 1 2 )        ( 0 1 1 2 )  = ( 1 0 0 )
    ( 0 0 1 )    ( 0 0 0 1 )        ( 0 0 0 1 )    ( 0 0 1 )
                 ( 0 0 0 0 )        ( 0 0 0 0 )

(D^Σ)^□ = D for all D

Matrix E is simple, if (E^□)^Σ = E; equivalently, if it has all zeros in
the left column and bottom row
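The two operators transcribe directly into code. A minimal sketch (function names are mine), verifying the worked example and the identity (D^Σ)^□ = D:

```python
# Distribution and density matrices on "squared paper": D is indexed by
# half-integers (entry D[i][j] sits at (i + 1/2, j + 1/2)), while its
# distribution matrix is indexed by integers 0..n.
def distribution(D):
    n = len(D)
    # D_sigma(i, j) = sum of D(i^, j^) over i^ > i, j^ < j
    return [[sum(D[k][l] for k in range(i, n) for l in range(j))
             for j in range(n + 1)] for i in range(n + 1)]

def density(E):
    n = len(E) - 1
    # E_box(i+1/2, j+1/2) = E(i, j+1) - E(i, j) - E(i+1, j+1) + E(i+1, j)
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(n)] for i in range(n)]

P = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
PS = distribution(P)
```

Running `density(PS)` recovers `P`, illustrating that density inverts distribution.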
Introduction
Terminology and notation

Matrix E is Monge, if E^□ is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph

[figure: a planar graph G with boundary nodes i, j; the Monge condition
reads blue + blue ≤ red + red]

G. Monge (1746–1818)
Introduction
Terminology and notation

Matrix E is unit-Monge, if E^□ is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph

Simple unit-Monge matrix: P^Σ, where P is a permutation matrix
P^Σ sometimes called the rank function of P

Seaweed matrix: P, used as an implicit representation of P^Σ

    ( 0 1 0 )Σ   ( 0 1 2 3 )
    ( 1 0 0 )  = ( 0 1 1 2 )
    ( 0 0 1 )    ( 0 0 0 1 )
                 ( 0 0 0 0 )

Introduction
Implicit unit-Monge matrices

Efficient P^Σ queries: range tree on nonzeros of P    [Bentley: 1980]
    binary search tree by i-coordinate
    under every node, binary search tree by j-coordinate

[figure: a range tree built on the nonzeros of a permutation matrix]
Introduction
Implicit unit-Monge matrices

Every node of the range tree represents a canonical range (rectangular
region), and stores its nonzero count
Overall, ≤ n log n canonical ranges are non-empty

Efficient P^Σ queries (contd.):
A P^Σ query is equivalent to ≷-dominance counting: how many nonzeros
are ≷-dominated by the query point?
Answer: sum up nonzero counts in ≤ log² n disjoint canonical ranges
Total size O(n log n), query time O(log² n)

There are asymptotically more efficient (but less practical) data
structures: total size O(n), query time O(log n / log log n)
[JáJá+: 2004; Chan, Pǎtraşcu: 2010]
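The canonical-range idea can be sketched with a merge-sort tree over the nonzeros of P: each node covers a canonical range of rows and stores the column values sorted, so a dominance count sums binary searches over O(log n) nodes. Names and layout below are illustrative, not the exact structure analysed above:

```python
import bisect

# Merge-sort tree over a permutation p: nonzeros at (k + 1/2, p[k] + 1/2).
def build(p):
    n, size = len(p), 1
    while size < n:
        size *= 2
    tree = [[] for _ in range(2 * size)]
    for k, col in enumerate(p):
        tree[size + k] = [col]
    for v in range(size - 1, 0, -1):
        tree[v] = sorted(tree[2 * v] + tree[2 * v + 1])
    return tree, size

def p_sigma(tree, size, i, j):
    # P^Sigma(i, j) = #{k : k + 1/2 > i and p[k] + 1/2 < j}
    #              = #{k : k >= i and p[k] < j}
    res, lo, hi = 0, size + i, 2 * size
    while lo < hi:
        if lo & 1:
            res += bisect.bisect_left(tree[lo], j)  # cols < j in this node
            lo += 1
        if hi & 1:
            hi -= 1
            res += bisect.bisect_left(tree[hi], j)
        lo //= 2
        hi //= 2
    return res
```

Each query touches O(log n) nodes and does one binary search per node, matching the O(log² n) query bound.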
Matrix distance multiplication
Seaweed braids

Distance algebra (a.k.a. (min, +) or tropical algebra):
addition ⊕ given by min, multiplication ⊙ given by +

Matrix ⊙-multiplication: A ⊙ B = C, where
C(i, k) = ⊕_j A(i, j) ⊙ B(j, k) = min_j (A(i, j) + B(j, k))

Intuition: gluing distances in a union of graphs
[figure: graphs G′ and G″ glued along a common boundary, with boundary
nodes i, j, k]

Matrix classes closed under ⊙-multiplication (for given n):
    general (integer, real) matrices ~ general weighted graphs
    Monge matrices ~ planar weighted graphs
    simple unit-Monge matrices ~ grid-like graphs

Implicit ⊙-multiplication: define P_A ⊙ P_B = P_C as P_A^Σ ⊙ P_B^Σ = P_C^Σ

The seaweed monoid T_n:
    simple unit-Monge matrices under ⊙
    equivalently, permutation (seaweed) matrices under ⊙
Also known as the 0-Hecke monoid of the symmetric group H_0(S_n)
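The defining formula is a one-liner. A minimal sketch of explicit (min, +) multiplication (the implicit unit-Monge version avoids materialising the matrices):

```python
# Matrix multiplication in the (min, +) distance algebra: a direct
# transcription of C(i, k) = min_j (A(i, j) + B(j, k)).
def dist_mul(A, B):
    n, p, m = len(A), len(B), len(B[0])
    assert all(len(row) == p for row in A)
    return [[min(A[i][j] + B[j][k] for j in range(p))
             for k in range(m)] for i in range(n)]
```

On distance matrices of two graphs glued along a common boundary, this computes the distances through the union, which is the intuition above.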
Matrix distance multiplication
Seaweed braids

P_A ⊙ P_B = P_C can be seen as combing of seaweed braids
[figure: the braids of P_A and P_B stacked and combed into P_C]

The seaweed monoid T_n:
    n! elements (permutations of size n)
    n − 1 generators g_1, g_2, ..., g_{n−1} (elementary crossings)
    Idempotence: g_i² = g_i for all i
    Far commutativity: g_i g_j = g_j g_i for j − i > 1
    Braid relations: g_i g_j g_i = g_j g_i g_j for j − i = 1

Identity: 1 ⊙ x = x        Zero: 0 ⊙ x = 0

        • · · ·                  · · · •
    1 = · • · ·              0 = · · • ·
        · · • ·                  · • · ·
        · · · •                  • · · ·

Related structures:
    positive braids: far comm; braid relations
    braids: g_i g_i⁻¹ = 1; far comm; braid relations
    Coxeter's presentation of S_n: g_i² = 1; far comm; braid relations
    locally free idempotent monoid: idem; far comm    [Vershik+: 2000]

Generalisations:
    general 0-Hecke monoids    [Fomin, Greene: 1998; Buch+: 2008]
    Coxeter monoids    [Tsaranov: 1990; Richardson, Springer: 1990]
    J-trivial monoids    [Denton+: 2011]
Matrix distance multiplication
Seaweed braids

Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)

T_3: 1, a = g1, b = g2; ab, ba; aba = 0

    aa → a    bb → b    aba → 0    bab → 0

T_4: 1, a = g1, b = g2, c = g3; ab, ac, ba, bc, cb, aba, abc, acb, bac,
bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba; abacba = 0

    aa → a    bb → b    cc → c    ca → ac
    bab → aba    cbc → bcb    cbac → bcba    abacba → 0

Easy to use, but not an efficient algorithm

Matrix ⊙-multiplication: running time

type                  time
general               O(n³)                          standard
                      O(n³ (log log n)³ / log² n)    [Chan: 2007]
Monge                 O(n²)                          via [Aggarwal+: 1987]
implicit unit-Monge   O(n^1.5)                       [T: 2006]
                      O(n log n)                     [T: 2010]
Matrix distance multiplication
Seaweed matrix multiplication

[figures: divide-and-conquer on seaweed matrices; P_A and P_B are split
into P_A,lo, P_A,hi and P_B,lo, P_B,hi, and P_C is assembled from
P_C,lo + P_C,hi]
Matrix distance multiplication
Seaweed matrix multiplication

Implicit unit-Monge matrix ⊙-multiplication: the algorithm

P_C^Σ(i, k) = min_j (P_A^Σ(i, j) + P_B^Σ(j, k))

Divide-and-conquer on the range of j: two subproblems of size n/2
    P_A,lo^Σ ⊙ P_B,lo^Σ = P_C,lo^Σ
    P_A,hi^Σ ⊙ P_B,hi^Σ = P_C,hi^Σ

Conquer: tracing a balanced path from bottom-left to top-right of P_C
Path invariant: equal number of ≪-dominated nonzeros in P_C,hi and
≪-dominating nonzeros in P_C,lo
All such nonzeros are "errors" in the output, must be removed
Must compensate by placing nonzeros on the path, time O(n)
Overall time O(n log n)

Matrix distance multiplication
Bruhat order

Permutation A is lower ("more sorted") than permutation B in the Bruhat
order (A ⪯ B), if B can be transformed to A by successive pairwise
sorting (equivalently, A to B by anti-sorting) of arbitrary pairs

Permutation matrices: P_A ⪯ P_B, if P_B can be transformed to P_A by
successive 2 × 2 submatrix sorting: ( 0 1 / 1 0 ) → ( 1 0 / 0 1 )

Plays an important role in group theory and algebraic geometry
(inclusion order of Schubert varieties)
Describes pivoting order in Gaussian elimination (LUP/Bruhat
decomposition)
Matrix distance multiplication
Bruhat order

Bruhat comparability: running time

O(n²)                    [Ehresmann: 1934; Proctor: 1982; Grigoriev: 1982]
O(n log n)               [T: 2013]
O(n log n / log log n)   [Gawrychowski: NEW]

Ehresmann's criterion (dot criterion, related to tableau criterion)
P_A ⪯ P_B iff P_A^Σ ≤ P_B^Σ elementwise

    ( 1 0 0 )Σ  ( 0 1 2 3 )               ( 0 1 0 )Σ  ( 0 1 2 3 )
    ( 0 0 1 ) = ( 0 0 1 2 )       ≤       ( 0 0 1 ) = ( 0 1 1 2 )
    ( 0 1 0 )   ( 0 0 1 1 )  elementwise  ( 1 0 0 )   ( 0 1 1 1 )
                ( 0 0 0 0 )               ( 0 0 0 0 )

    ( 1 0 0 )Σ  ( 0 1 2 3 )               ( 0 1 0 )Σ  ( 0 1 2 3 )
    ( 0 0 1 ) = ( 0 0 1 2 )       ≰       ( 1 0 0 ) = ( 0 1 1 2 )
    ( 0 1 0 )   ( 0 0 1 1 )  elementwise  ( 0 0 1 )   ( 0 0 0 1 )
                ( 0 0 0 0 )               ( 0 0 0 0 )

Time O(n²)
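Ehresmann's criterion translates directly into a brute-force check. A sketch (function names mine), with permutations in one-line notation, i.e. nonzeros at (k, p[k]):

```python
# Bruhat comparability via Ehresmann's criterion: A is below B iff every
# entry of A's distribution matrix is <= the matching entry of B's.
def dist_matrix(p):
    n = len(p)
    # below-left dominance sums: #{k : k >= i and p[k] < j}
    return [[sum(1 for k in range(i, n) if p[k] < j)
             for j in range(n + 1)] for i in range(n + 1)]

def bruhat_leq(pa, pb):
    da, db = dist_matrix(pa), dist_matrix(pb)
    return all(da[i][j] <= db[i][j]
               for i in range(len(da)) for j in range(len(da)))
```

As a sanity check, the identity permutation is the minimum of the order and the reversal is the maximum, while the two adjacent transpositions of S_3 are incomparable.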
Matrix distance multiplication
Bruhat order

Seaweed criterion
P_A ⪯ P_B iff P_A^R ⊙ P_B = Id^R, where P^R = clockwise rotation of
matrix P

Intuition: permutations represented by seaweed braids; P_A ⪯ P_B, iff
    no pair of seaweeds is crossed in P_A, while the "corresponding"
    pair is uncrossed in P_B
    equivalently, no pair is uncrossed in P_A^R, while the
    "corresponding" pair is uncrossed in P_B
    equivalently, P_A^R ⊙ P_B = Id^R

[example: a 3 × 3 pair P_A, P_B with P_A^R ⊙ P_B = Id^R, hence P_A ⪯ P_B,
shown both as matrices and as seaweed braids]

Time O(n log n) by seaweed matrix multiplication
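The seaweed criterion can be sketched with explicit distribution matrices (the fast version multiplies seaweed matrices implicitly). One caveat: the rotation orientation depends on the dominance convention; with the below-left sums used in this sketch, the anticlockwise rotation of P_A plays the role of P^R. All names are mine:

```python
# Bruhat comparability via the seaweed criterion, checked naively:
# compute P_A^R (.) P_B through distribution matrices and compare the
# resulting permutation with the rotated identity (the antidiagonal).
def dist(p):
    n = len(p)
    return [[sum(1 for k in range(i, n) if p[k] < j)
             for j in range(n + 1)] for i in range(n + 1)]

def minplus(A, B):
    r = range(len(A))
    return [[min(A[i][j] + B[j][k] for j in r) for k in r] for i in r]

def density(E):
    n = len(E) - 1
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(n)] for i in range(n)]

def rotate(p):
    # rotation of the permutation matrix of p (anticlockwise here,
    # matching this sketch's dominance convention)
    n = len(p)
    M = [[1 if p[i] == j else 0 for j in range(n)] for i in range(n)]
    R = [[M[j][n - 1 - i] for j in range(n)] for i in range(n)]
    return [row.index(1) for row in R]

def seaweed_leq(pa, pb):
    n = len(pa)
    C = density(minplus(dist(rotate(pa)), dist(pb)))
    idR = [[1 if i + j == n - 1 else 0 for j in range(n)] for i in range(n)]
    return C == idR
```

On S_3 this agrees pair-for-pair with Ehresmann's criterion.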
Matrix distance multiplication
Bruhat order

[example: a 3 × 3 pair P_A, P_B with P_A^R ⊙ P_B ≠ Id^R, hence P_A ⋠ P_B,
shown both as matrices and as seaweed braids]

Alternative solution: clever implementation of Ehresmann's criterion
[Gawrychowski: 2013]

The online partial sums problem: maintain array X[1 : n], subject to
    update(k, Δ): X[k] ← X[k] + Δ
    prefixsum(k): return Σ_{1 ≤ i ≤ k} X[i]

Query time:
    Θ(log n) in the semigroup or group model
    Θ(log n / log log n) in the RAM model on integers
    [Pǎtraşcu, Demaine: 2004]

Gives Bruhat comparability in time O(n log n / log log n) in the RAM model

Open problem: seaweed multiplication in time O(n log n / log log n)?
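A standard structure meeting the Θ(log n) group-model bound for online partial sums is the Fenwick (binary indexed) tree; a minimal sketch:

```python
# Online partial sums: update(k, delta) and prefixsum(k), both in
# O(log n) time, via a Fenwick (binary indexed) tree. 1-based indices.
class Fenwick:
    def __init__(self, n):
        self.t = [0] * (n + 1)

    def update(self, k, delta):          # X[k] += delta
        while k < len(self.t):
            self.t[k] += delta
            k += k & -k                  # jump to the next covering node

    def prefixsum(self, k):              # sum of X[1..k]
        s = 0
        while k > 0:
            s += self.t[k]
            k -= k & -k                  # strip the lowest set bit
        return s
```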
Semi-local string comparison
Semi-local LCS and edit distance

Consider strings (= sequences) over an alphabet of size σ

Contiguous substrings vs not necessarily contiguous subsequences
Special cases of substring: prefix, suffix

Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close

The longest common subsequence (LCS) score:
    length of the longest string that is a subsequence of both a and b
    equivalently, alignment score, where score(match) = 1 and
    score(mismatch) = 0

In biological terms, "loss-free alignment" (unlike efficient but "lossy"
BLAST)
Semi-local string comparison
Semi-local LCS and edit distance

LCS computation by dynamic programming

lcs(a, ∅) = 0        lcs(∅, b) = 0
lcs(aα, bβ) = max(lcs(aα, b), lcs(a, bβ))   if α ≠ β
lcs(aα, bβ) = lcs(a, b) + 1                 if α = β

The LCS problem
Give the LCS score for a vs b

LCS: running time

O(mn)                          [Wagner, Fischer: 1974]
O(mn / log² n), σ = O(1)       [Masek, Paterson: 1980; Crochemore+: 2003]
O(mn (log log n)² / log² n)    [Paterson, Dančík: 1994;
                                Bille, Farach-Colton: 2008]

Running time varies depending on the RAM model version
We assume word-RAM with word size log n (where it matters)

lcs(“DEFINE”, “DESIGN”) = 4

       ∗  D  E  S  I  G  N
    ∗  0  0  0  0  0  0  0
    D  0  1  1  1  1  1  1
    E  0  1  2  2  2  2  2
    F  0  1  2  2  2  2  2
    I  0  1  2  2  3  3  3
    N  0  1  2  2  3  3  4
    E  0  1  2  2  3  3  4

LCS(a, b) can be traced back through the dynamic programming table at no
extra asymptotic time cost
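The recurrence above, as a straightforward O(mn) dynamic program:

```python
# Plain LCS dynamic programming: dp[i][j] = LCS score of a[:i] vs b[:j].
def lcs(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1       # match: extend
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```

It reproduces the worked example, lcs("DEFINE", "DESIGN") = 4, as well as the scores used in the alignment-graph slides that follow.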
Alexander Tiskin (Warwick)
Semi-local LCS
Semi-local string comparison
Semi-local string comparison
Semi-local LCS and edit distance
Semi-local LCS and edit distance
LCS on the alignment graph (directed, acyclic)
B A A B C A B C A B A C A
B
A
A
B
C
B
C
A
LCS: dynamic programming
blue = 0
red = 1
38 / 156
[WF: 1974]
Sweep cells in any -compatible order
Cell update: time O(1)
Overall time O(mn)
score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8
LCS = highest-score path from top-left to bottom-right
Alexander Tiskin (Warwick)
Semi-local LCS
39 / 156
Alexander Tiskin (Warwick)
Semi-local LCS
40 / 156
Semi-local string comparison
Semi-local LCS and edit distance

LCS: micro-block dynamic programming    [MP: 1980; BF: 2008]

Sweep cells in micro-blocks, in any ≪-compatible order
Micro-block size:
    t = O(log n) when σ = O(1)
    t = O(log n / log log n) otherwise
Micro-block interface:
    O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
    O(t) small integers, each O(1) bits
Micro-block update: time O(1), by precomputing all possible interfaces
Overall time O(mn / log² n) when σ = O(1),
O(mn (log log n)² / log² n) otherwise

    ‘Begin at the beginning,’ the King said gravely, ‘and
    go on till you come to the end: then stop.’
                        L. Carroll, Alice in Wonderland

Dynamic programming: begins at empty strings,
proceeds by appending characters, then stops

What about:
    prepending/deleting characters (dynamic LCS)
    concatenating strings (LCS on compressed strings; parallel LCS)
    taking substrings (= local alignment)
Semi-local string comparison
Semi-local LCS and edit distance

Dynamic programming from both ends:
better by ×2, but still not good enough

    Is dynamic programming strictly necessary to
    solve sequence alignment problems?
        Eppstein+, Efficient algorithms for sequence analysis, 1991

The semi-local LCS problem
Give the (implicit) matrix of O((m + n)²) LCS scores:
    string-substring LCS: string a vs every substring of b
    prefix-suffix LCS: every prefix of a vs every suffix of b
    suffix-prefix LCS: every suffix of a vs every prefix of b
    substring-string LCS: every substring of a vs string b
Semi-local string comparison
Semi-local LCS and edit distance

Semi-local LCS on the alignment graph:
[figure: alignment graph for a = “BAABCBCA” vs b = “BAABCABCABACA”;
blue = 0, red = 1]

score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5
String-substring LCS: all highest-score top-to-bottom paths
Semi-local LCS: all highest-score boundary-to-boundary paths

Semi-local string comparison
Score matrices and seaweed matrices

The score matrix H

a = “BAABCBCA”
b = “BAABCABCABACA”

H(i, j) = score(a, b⟨i : j⟩)
H(i, j) = j − i if i > j
H(4, 11) = 5

[table: the score matrix H, rows indexed by i, columns by j; e.g. the
row i = 0 reads 1 2 3 4 5 6 6 7 8 8 8 8 8 for j = 1, ..., 13, and the
entries below the main diagonal follow H(i, j) = j − i]
Semi-local string comparison
Score matrices and seaweed matrices

Semi-local LCS: output representation and running time

size          query time
O(n²)         O(1)                           naive
O(m^1/2 n)    O(log n)    string-substring   [Schmidt: 1998; Alves+: 2005]
O(n)          O(n)        string-substring   [T: 2006]
O(n log n)    O(log² n)                      [T: 2006]
...or any 2D orthogonal range counting data structure

running time
O(mn²)                         trivial
O(mn)                          [Alves+: 2003]
O(mn)                          [Alves+: 2005]
O(mn / log^0.5 n)              [T: 2006]
O(mn (log log n)² / log² n)    [T: 2007]

The score matrix H and the seaweed matrix P

H(i, j): the number of matched characters for a vs substring b⟨i : j⟩
j − i − H(i, j): the number of unmatched characters

Properties of matrix j − i − H(i, j):
    simple unit-Monge
    therefore, = P^Σ, where P = −H^□ is a permutation matrix
P is the seaweed matrix, giving an implicit representation of H
Range tree for P: memory O(n log n), query time O(log² n)
Semi-local string comparison
Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P

a = “BAABCBCA”
b = “BAABCABCABACA”

H(i, j) = score(a, b⟨i : j⟩)
H(i, j) = j − i − P^Σ(i, j)
H(i, j) = j − i if i > j

H(4, 11) = 11 − 4 − P^Σ(4, 11) = 11 − 4 − 2 = 5

[table: the score matrix H as above; green cells mark the nonzeros
P(i, j) = 1 of the seaweed matrix; blue marks a difference of 0 in H,
red a difference of 1]
Semi-local string comparison
Score matrices and seaweed matrices

The (combed) seaweed braid in the alignment graph
[figure: combed seaweed braid for a = “BAABCBCA” vs b = “BAABCABCABACA”]

H(4, 11) = 11 − 4 − P^Σ(4, 11) = 11 − 4 − 2 = 5

Seaweed braid: a highly symmetric object (element of H_0(S_n))
P(i, j) = 1 corresponds to a seaweed running top i → bottom j
Also define seaweeds top → right, left → bottom, left → right;
they represent implicitly the semi-local LCS for each prefix of a vs b

Can be built by assembling subbraids: divide-and-conquer
Flexible approach to local alignment, compressed approximate matching,
parallel computation...
Semi-local string comparison
Weighted alignment

The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM), mismatches (wX) and gaps (wG)

LCS score: wM = 1, wX = wG = 0
Levenshtein score: wM = 2, wX = 1, wG = 0
Alignment score is rational, if wM, wX, wG are rational numbers

Edit distance: minimum cost to transform a into b by weighted character
edits (insertion, deletion, substitution)
Corresponds to weighted alignment score with wM = 0:
    insertion/deletion cost −wG
    substitution cost −wX

Weighted alignment graph for a, b
[figure: weighted alignment graph for a = “BAABCBCA” vs
b = “BAABCABCABACA”; blue = 0, red (solid) = 2, red (dotted) = 1]
Levenshtein(“BAABCBCA”, “CABCABA”) = 11

Reduction: rational-weighted semi-local alignment to semi-local LCS
Let wM = 1, wX = μ/ν, wG = 0
Reduction: ordinary alignment graph for blown-up a, b
[figure: blown-up alignment graph for “$B$A$A$B$C$B$C$A” vs
“$C$A$B$C$A$B$A”; blue = 0, red = 1 or 2]
Levenshtein(“BAABCBCA”, “CABCABA”) =
lcs(“$B$A$A$B$C$B$C$A”, “$C$A$B$C$A$B$A”) = 11
Increase ×ν² in complexity (can be reduced to ν)
The seaweed method
Seaweed combing

[figures: seaweed combing on the alignment graph of a = “BAABCBCA” vs
b = “BAABCABCABACA”, shown step by step]
The seaweed method
Seaweed combing

Semi-local LCS: seaweed combing    [T: 2006]

Initialise uncombed seaweed braid: mismatch cell = crossing
Sweep cells in any ≪-compatible order:
    match cell: skip (keep uncrossed)
    mismatch cell: comb (uncross) iff the same seaweed pair already
    crossed before
Cell update: time O(1)
Overall time O(mn)
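The combing sweep can be simulated directly. In the sketch below (names mine) the "already crossed" test is done with an explicit set of seaweed pairs; the real algorithm replaces this bookkeeping with an O(1) label comparison per cell. String-substring scores are then read off by dominance counting over the seaweeds that exit the bottom boundary:

```python
# Seaweed combing: at each cell the top-entering and left-entering
# seaweeds either cross (mismatch, first meeting) or both turn
# (match, or a pair that has already crossed).
def comb(a, b):
    m, n = len(a), len(b)
    top = list(range(n))            # seaweed entering the top of column j
    crossed = set()
    for i in range(m):
        left = ('L', i)             # seaweed entering row i from the left
        for j in range(n):
            pair = frozenset((top[j], left))
            if a[i] == b[j] or pair in crossed:
                top[j], left = left, top[j]   # uncrossed: both turn
            else:
                crossed.add(pair)             # mismatch: seaweeds cross
    return top                      # seaweed exiting the bottom of column j

def score(top, i, j):
    # string-substring LCS: H(i, j) = (j - i) minus the number of
    # seaweeds entering the top at position >= i and exiting the bottom
    # at position < j
    return (j - i) - sum(1 for c in range(j)
                         if isinstance(top[c], int) and top[c] >= i)
```

After one O(mn) sweep, every substring score of b against a is available from the same `top` array.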
Alexander Tiskin (Warwick)
The seaweed method
The seaweed method
Micro-block seaweed combing
Micro-block seaweed combing
B A A B C A B C A B A C A
Semi-local LCS
62 / 156
B A A B C A B C A B A C A
B
A
A
B
C
B
C
A
B
A
A
B
C
B
C
A
Alexander Tiskin (Warwick)
Semi-local LCS
63 / 156
Alexander Tiskin (Warwick)
Semi-local LCS
64 / 156
The seaweed method
Micro-block seaweed combing

Semi-local LCS: micro-block seaweed combing    [T: 2007]

Initialise uncombed seaweed braid: mismatch cell = crossing
Sweep cells in micro-blocks, in any ≪-compatible order
Micro-block size: t = O(log n / log log n)
Micro-block interface:
    O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
    O(t) integers, each O(log n) bits, can be reduced to O(log t) bits
Micro-block update: time O(1), by precomputing all possible interfaces
Overall time O(mn (log log n)² / log² n)

The seaweed method
Cyclic LCS

The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time

O(mn² / log n)                 naive
O(mn log m)                    [Maes: 1990]
O(mn)                          [Bunke, Bühler: 1993; Landau+: 1998;
                                Schmidt: 1998]
O(mn (log log n)² / log² n)    [T: 2007]
The seaweed method
Cyclic LCS

Cyclic LCS: the algorithm

Micro-block seaweed combing on a vs bb, time O(mn (log log n)² / log² n)
Make n string-substring LCS queries, time negligible

The seaweed method
Longest repeating subsequence

The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two
identical strings)

Longest repeating subsequence: running time

O(m³)                          naive
O(m²)                          [Kosowski: 2004]
O(m² (log log m)² / log² m)    [T: 2007]
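The cyclic LCS reduction (comb a vs bb, then one string-substring query per rotation) can be sketched with the same explicit combing simulation as before; this is a quadratic illustration, not the micro-block version:

```python
# Cyclic LCS: comb seaweeds for a vs b+b, then query H(i, i+n) for each
# rotation start i. The "already crossed" test uses an explicit set.
def comb(a, b):
    m, n = len(a), len(b)
    top = list(range(n))
    crossed = set()
    for i in range(m):
        left = ('L', i)
        for j in range(n):
            pair = frozenset((top[j], left))
            if a[i] == b[j] or pair in crossed:
                top[j], left = left, top[j]
            else:
                crossed.add(pair)
    return top

def cyclic_lcs(a, b):
    n = len(b)
    top = comb(a, b + b)
    def score(i, j):  # LCS of a vs (b + b)[i:j]
        return (j - i) - sum(1 for c in range(j)
                             if isinstance(top[c], int) and top[c] >= i)
    return max(score(i, i + n) for i in range(n))
```

One combing sweep serves all n rotations, which is exactly the point of the reduction.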
The seaweed method
Longest repeating subsequence

Longest repeating subsequence: the algorithm

Micro-block seaweed combing on a vs a, time O(m² (log log m)² / log² m)
Make m − 1 prefix-suffix LCS queries, time negligible

Open question: generalise to longest subsequence that is a cube, etc.

The seaweed method
Approximate matching

The approximate pattern matching problem
Give the substring closest to a by alignment score, starting at each
position in b
Assume rational alignment score

Approximate pattern matching: running time

O(mn)                          [Sellers: 1980]
O(mn / log n), σ = O(1)        via [Masek, Paterson: 1980]
O(mn (log log n)² / log² n)    via [Bille, Farach-Colton: 2008]
The seaweed method
Approximate matching

Approximate pattern matching: the algorithm

Micro-block seaweed combing on a vs b (with blow-up), time
O(mn (log log n)² / log² n)

The implicit semi-local edit score matrix: an anti-Monge matrix
    approximate pattern matching ~ row minima
Row minima in O(n) element queries    [Aggarwal+: 1987]
Each query in time O(log² n) using the range tree representation,
combined query time negligible

Overall running time O(mn (log log n)² / log² n), same as
[Bille, Farach-Colton: 2008]
Periodic string comparison
Wraparound seaweed combing

The periodic string-substring LCS problem
Give (implicit) LCS scores for a vs each substring of b = ...uuu... = u^±∞

Let u be of length p
May assume that every character of a occurs in u (otherwise delete it)
Only substrings of b of length ≤ mp matter (otherwise LCS score = m)

[figures: wraparound seaweed combing on the alignment graph of
a = “BAABCBCA” vs b = “BAABCABCABACA”, shown step by step]
Periodic string comparison
Wraparound seaweed combing

Periodic string-substring LCS: wraparound seaweed combing

Initialise uncombed seaweed braid: mismatch cell = crossing
Sweep cells row-by-row, in any ≪-compatible order: each row starts at a
match cell, and wraps around at the boundary
    match cell: skip (keep uncrossed)
    mismatch cell: comb (uncross) iff the same seaweed pair already
    crossed before, possibly in a different period
Cell update: time O(1)
Overall time O(mn)
String-substring LCS score: count seaweeds with multiplicities

The tandem LCS problem
Give LCS score for a vs b = u^k
We have n = kp; may assume k ≤ m

Tandem LCS: running time

O(mkp)         naive
O(m(k + p))    [Landau, Ziv-Ukelson: 2001]
O(mp)          [T: 2009]

Direct application of wraparound seaweed combing
Periodic string comparison
Wraparound seaweed combing

The tandem alignment problem
Give the substring closest to a in b = u^±∞ by alignment score among:
    global: substrings u^k of length kp across all k
    cyclic: substrings of length kp across all k
    local: substrings of any length

Tandem alignment: running time

O(m²p)         all       naive
O(mp)          global    [Myers, Miller: 1989]
O(mp log p)    cyclic    [Benson: 2005]
O(mp)          cyclic    [T: 2009]
O(mp)          local     [Myers, Miller: 1989]

Cyclic tandem alignment: the algorithm

Periodic seaweed combing for a vs b (with blow-up), time O(mp)
For each k ∈ [1 : m]:
    solve tandem LCS (under the given alignment score) for a vs u^k
    obtain scores for a vs p successive substrings of b of length kp by
    LCS batch query: time O(1) per substring
Running time O(mp)
Sparse string comparison
Semi-local LCS between permutations

The LCS problem on permutation strings (LCSP)
Give LCS score for a vs b; in each of a, b all characters distinct

Equivalent to:
    longest increasing subsequence (LIS) in a string
    maximum clique in a permutation graph
    maximum planar matching in an embedded bipartite graph

LCSP: running time

O(n log n)       implicit in [Erdös, Szekeres: 1935]
                 [Robinson: 1938; Knuth: 1970; Dijkstra: 1980]
O(n log log n)   unit-RAM    [Chang, Wang: 1992;
                              Bespamyatnikh, Segal: 2000]
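The LCSP-to-LIS equivalence gives a short O(n log n) algorithm: relabel b by each character's position in a, then take the longest increasing subsequence, here via patience sorting. A minimal sketch (names mine):

```python
import bisect

# Longest increasing subsequence by patience sorting: piles[k] holds the
# smallest possible tail of an increasing subsequence of length k + 1.
def lis_length(seq):
    piles = []
    for x in seq:
        k = bisect.bisect_left(piles, x)
        if k == len(piles):
            piles.append(x)       # start a new, longer subsequence
        else:
            piles[k] = x          # improve the tail of length k + 1
    return len(piles)

# LCS of two permutation strings = LIS of b relabelled by positions in a.
def lcsp(a, b):
    pos = {c: i for i, c in enumerate(a)}
    return lis_length([pos[c] for c in b if c in pos])
```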
Sparse string comparison
Semi-local LCS between permutations

Semi-local LCSP
Give semi-local LCS scores for a vs b; in each of a, b all characters
distinct

Equivalent to:
    longest increasing subsequence (LIS) in every substring of a string

Semi-local LCSP: running time

O(n² log n)                              naive
O(n²)                                    [Albert+: 2003; Chen+: 2005]
O(n^1.5 log n)   restricted              [Albert+: 2007]
O(n^1.5)         randomised, restricted  [T: 2006]
O(n log² n)                              [T: 2010]

[figures: the permutations a = “DEHCBAFG” vs b = “CFAEDHGB” as a grid of
nonzeros, with the divide-and-conquer steps of the algorithm]

Semi-local LCSP: the algorithm

Divide-and-conquer on the alignment graph
Divide graph (say) horizontally; two subproblems of effective size n/2
Conquer: seaweed matrix ⊙-multiplication, time O(n log n)
Overall time O(n log² n)
Sparse string comparison
Longest piecewise monotone subsequences

A k-increasing sequence: a concatenation of k increasing sequences
A k-modal sequence: a concatenation of k alternating increasing and decreasing sequences

The longest k-increasing (k-modal) subsequence problem
Give the longest k-increasing (k-modal) subsequence of string b

Main idea: LCS for id^k (respectively, (id id^R)^{k/2}) vs b

Longest k-increasing (k-modal) subsequence: running time
O(nk log n)    k-modal    [Demange+: 2007]
O(nk log n)    via [Hunt, Szymanski: 1977]
O(n log² n)    [T: 2010]

Longest k-increasing subsequence: algorithm A
Sparse LCS for id^k vs b: time O(nk log n)

Longest k-increasing subsequence: algorithm B
Compute seaweed matrix for id vs b: time O(n log² n)
Extract the three-way seaweed submatrix
Compute the three-way seaweed submatrix for id^k vs b by log k instances of seaweed matrix distance square/multiply: time log k · O(n log n) = O(n log² n)
Query the LCS score for id^k vs b: time negligible
Overall time O(n log² n)

Algorithm B faster than algorithm A for k ≥ log n
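The main idea above (longest k-increasing subsequence = LCS of id^k vs b) can be checked directly with a naive quadratic LCS; a minimal sketch, not from the slides:

```python
def lcs_length(a, b):
    """Classical dynamic-programming LCS, O(|a|*|b|)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def longest_k_increasing(b, k):
    """Longest k-increasing subsequence of a permutation b, via LCS with id^k."""
    ident = sorted(b)          # the identity permutation over b's alphabet
    return lcs_length(ident * k, b)
```

A sequence of distinct values is a subsequence of id^k exactly when it splits into at most k increasing runs, which is why the reduction works.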
Sparse string comparison
Maximum clique in a circle graph

The maximum clique problem in a circle graph
Given a circle with n chords, find a maximum-size subset of pairwise intersecting chords

Maximum clique in a circle graph: running time
exp(n)        naive
O(n³)         [Gavril: 1973]
O(n²)         [Rotem, Urrutia: 1981; Hsu: 1985]; [Masuda+: 1990; Apostolico+: 1992]
O(n^1.5)      [T: 2006]
O(n log² n)   [T: 2010]

[Figure: a circle with chords, and its interval model on the line]

Standard reduction to an interval model: cut the circle and lay it out on the line; chords become intervals (here drawn as square diagonals)
Chords intersect iff intervals overlap, i.e. intersect without containment
Maximum clique in a circle graph: the algorithm
Helly property: if a set of intervals intersect pairwise, then they all intersect at a common point
Compute the seaweed matrix, build a range tree: time O(n log² n)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering, pairwise overlapping segments by a prefix-suffix LCS query: time (2n + 1) · O(log² n) = O(n log² n)
Overall time O(n log² n) + O(n log² n) = O(n log² n)
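The point-scan structure of the algorithm can be illustrated without seaweeds: at a fixed common point, intervals covering it pairwise overlap without containment exactly when, sorted by left endpoint, their right endpoints increase, so the per-point query becomes an LIS. A minimal sketch (brute-force verifiable, not the slides' O(n log² n) method):

```python
from bisect import bisect_left

def crosses(c1, c2):
    """Two chords, given as endpoint pairs on the cut-open circle,
    intersect iff their intervals overlap without containment."""
    (a, b), (c, d) = sorted(c1), sorted(c2)
    return a < c < b < d or c < a < d < b

def max_clique_circle(chords):
    """Scan candidate common points (gaps between endpoints); at each point,
    the largest pairwise-crossing subset of covering intervals is an LIS
    of right endpoints, taken with left endpoints sorted."""
    best = 0
    points = sorted(p for ch in chords for p in ch)
    for p in [x + 0.5 for x in points]:
        covering = sorted((min(c), max(c)) for c in chords if min(c) < p < max(c))
        tails = []
        for _, r in covering:            # LIS over right endpoints
            k = bisect_left(tails, r)
            if k == len(tails):
                tails.append(r)
            else:
                tails[k] = r
        best = max(best, len(tails))
    return best
```

By the Helly property, every clique's intervals share a common point, so the scan cannot miss the maximum clique.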
Parameterised maximum clique in a circle graph
The maximum clique problem in a circle graph, sensitive e.g. to
the number e of edges
the size l of the maximum clique
the thickness d of the interval model (the maximum number of intervals covering a point)
We have l ≤ d ≤ n, e ≤ n²

Parameterised maximum clique in a circle graph: running time
O(n log n + e)               [Apostolico+: 1992]
O(n log n + n·l·log(n/l))    [Apostolico+: 1992]
O(n log n + n log² d)        NEW

Parameterised maximum clique in a circle graph: the algorithm
For each diagonal block of size d, compute the seaweed matrix, build a range tree: time (n/d) · O(d log² d) = O(n log² d)
Extend each diagonal block to a quadrant: time O(n log² d)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering, pairwise overlapping segments by a prefix-suffix LCS query: time O(n log² d)
Overall time O(n log² d)
Compressed string comparison
Grammar compression

Notation: pattern p of length m; text t of length n

A GC-string (grammar-compressed string) t is a straight-line program (context-free grammar) generating t = t_n̄ by n̄ assignments of the form
t_k = α, where α is an alphabet character
t_k = u v, where each of u, v is an alphabet character, or t_i for i < k
In general, n = O(2^n̄)

Example: Fibonacci string "ABAABABAABAAB"
t1 = A   t2 = t1 B   t3 = t2 t1   t4 = t3 t2   t5 = t4 t3   t6 = t5 t4
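The Fibonacci example above can be replayed directly; a minimal sketch of straight-line program expansion (illustration only, real GC-strings are never expanded since n may be exponential in n̄):

```python
def expand(slp):
    """Expand a straight-line program given as a list of (name, rule) pairs:
    a rule is either a single alphabet character, or a pair of symbols
    (alphabet characters or names of earlier rules)."""
    text = {}
    for name, rule in slp:
        if isinstance(rule, str):              # t_k = alpha
            text[name] = rule
        else:                                  # t_k = u v
            u, v = rule
            text[name] = text.get(u, u) + text.get(v, v)
    return text[name]                          # string generated by the last rule

# The Fibonacci-string grammar from the slide:
fib = [("t1", "A"), ("t2", ("t1", "B")), ("t3", ("t2", "t1")),
       ("t4", ("t3", "t2")), ("t5", ("t4", "t3")), ("t6", ("t5", "t4"))]
```

Each further rule t_{k+1} = t_k t_{k-1} roughly doubles the length, giving n = O(2^n̄).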
Compressed string comparison
Extended substring-string LCS on GC-strings

Grammar compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly)
Simplifying assumption: arithmetic up to n runs in O(1)
This assumption can be removed by careful index remapping

LCS: running time (r = m + n, r̄ = m̄ + n̄)
p      t      time
plain  plain  O(mn)                          [Wagner, Fischer: 1974]
plain  plain  O(mn / log² m)                 [Masek, Paterson: 1980]; [Crochemore+: 2003]
plain  GC     O(m³ n̄ + ...)     gen. CFG     [Myers: 1995]
plain  GC     O(m^1.5 n̄)       ext subs-s   [T: 2008]
plain  GC     O(m log m · n̄)   ext subs-s   [T: 2010]
GC     GC     NP-hard           R weights    [Lifshits: 2005]
GC     GC     O(r^1.2 r̄^1.4)                 [Hermelin+: 2009]
GC     GC     O(r log r · r̄)                 [T: 2010]
GC     GC     O(r log(r/r̄) · r̄)              [Hermelin+: 2010]
GC     GC     O(r log^{1/2}(r/r̄) · r̄)        [Gawrychowski: 2012]
Extended substring-string LCS (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of the seaweed matrix P_{p,t_k}, using matrix distance multiplication: time O(m log m · n̄)
Overall time O(m log m · n̄)

Subsequence recognition on GC-strings
The global subsequence recognition problem
Does pattern p appear in text t as a subsequence?

Global subsequence recognition: running time
p      t      time
plain  plain  O(n)       greedy
plain  GC     O(mn̄)      greedy
GC     GC     NP-hard    [Lifshits: 2005]
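The O(n) greedy bound for plain strings is the one-pass scan; a minimal sketch:

```python
def is_subsequence(p, t):
    """Greedy global subsequence recognition: scan t once, O(|t|).
    Each character of p is matched to the earliest available position in t."""
    it = iter(t)
    return all(ch in it for ch in p)   # 'ch in it' advances the iterator past ch
```

The greedy choice is safe: matching each pattern character as early as possible never rules out a later match.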
Subsequence recognition on GC-strings (contd.)

The local subsequence recognition problem
Find all minimally matching substrings of t with respect to p
A substring of t is matching, if p is a subsequence of it
A matching substring of t is minimally matching, if none of its proper substrings are matching

Local subsequence recognition: running time (+ output)
p      t      time
plain  plain  O(mn)              [Mannila+: 1995]
plain  plain  O(mn / log m)      [Das+: 1997]
plain  plain  O(c^m + n)         [Boasson+: 2001]
plain  plain  O(m + nσ)          [Troniček: 2001]
plain  GC     O(m² log m · n̄)    [Cégielski+: 2006]
plain  GC     O(m^1.5 n̄)         [T: 2008]
plain  GC     O(m log m · n̄)     [T: 2010]
plain  GC     O(m · n̄)           [Yamamoto+: 2011]
GC     GC     NP-hard            [Lifshits: 2005]
[Figure: seaweed diagram for p vs t, with boxes [i : j] between indices ı̂ and ̂, and the dominance chain of ≷-maximal seaweeds]

b⟨i : j⟩ is matching iff the box [i : j] is not pierced left-to-right
Determined by the dominance chain of ≷-maximal seaweeds
b⟨i : j⟩ is minimally matching iff (i, j) lies on the interleaved skeleton chain
Local subsequence recognition (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of the seaweed matrix P_{p,t_k}, using matrix distance multiplication: time O(m log m · n̄)
Given an assignment t = t′ t″, find by recursion
minimally matching substrings in t′
minimally matching substrings in t″
Then, find the dominance chain of ≷-maximal seaweeds in time n̄ · O(m) = O(mn̄)
Its skeleton chain gives the minimally matching substrings in t overlapping both t′ and t″
Overall time O(m log m · n̄) + O(mn̄) = O(m log m · n̄)

Threshold approximate matching
The threshold approximate matching problem
Find all matching substrings of t with respect to p, according to a threshold k
A substring of t is matching, if the edit distance between p and that substring is at most k
Threshold approximate matching: running time (+ output)
p      t      time
plain  plain  O(mn)                      [Sellers: 1980]
plain  plain  O(nk)                      [Landau, Vishkin: 1989]
plain  plain  O(m + n + nk⁴/m)           [Cole, Hariharan: 2002]
plain  GC     O(mn̄k²)                    [Kärkkäinen+: 2003]
plain  GC     O(mn̄k + n̄ log n)           [LV: 1989] via [Bille+: 2010]
plain  GC     O(mn̄ + n̄k⁴ + n̄ log n)      [CH: 2002] via [Bille+: 2010]
plain  GC     O(m log m · n̄)             [T: NEW]
GC     GC     NP-hard                    [Lifshits: 2005]

(Also many specialised variants for LZ compression)

Blow-up: weighted alignment on strings p, t of size m, n is equivalent to LCS on blown-up strings of size νm, νn

[Figure: seaweed diagram for threshold approximate matching, over the diagonal range −m to n]
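The diagonal-band idea behind the O(nk)-type bounds can be sketched with plain dynamic programming: when the distance is at most k, only cells within distance k of the main diagonal matter. A minimal illustration (the cited algorithms are far more refined):

```python
def edit_distance(a, b):
    """Full dynamic-programming edit distance (insert/delete/substitute), O(mn)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[n]

def edit_distance_banded(a, b, k):
    """Band version: only cells with |i - j| <= k are computed, O(nk);
    returns the exact distance if it is <= k, and k + 1 otherwise."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return k + 1
    INF = k + 1
    dp = {j: j for j in range(0, min(n, k) + 1)}
    for i in range(1, m + 1):
        ndp = {}
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if j == 0:
                best = i
            else:
                best = min(ndp.get(j - 1, INF) + 1,
                           dp.get(j, INF) + 1,
                           dp.get(j - 1, INF) + (a[i - 1] != b[j - 1]))
            ndp[j] = min(best, INF)
        dp = ndp
    return dp.get(n, INF)
```

Any alignment of cost at most k never leaves the band of width 2k + 1, so the truncation is lossless below the threshold.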
Threshold approximate matching (plain pattern, GC text): the algorithm
Algorithm structure similar to local subsequence recognition:
dominance chains are replaced by m × m submatrices
Extra ingredients:
the blow-up technique: reduction of edit distances to LCS scores
implicit matrix searching, replacing chain interleaving

Monge row minima ("SMAWK"): O(m)                  [Aggarwal+: 1987]
Implicit unit-Monge row minima: O(m log log m)    [T: 2012]
Implicit unit-Monge row minima: O(m)              [Gawrychowski: 2012]

Overall time O(m log m · n̄) + O(mn̄) = O(m log m · n̄)

[Figure: seaweed diagram with the m × m submatrix window over the diagonal range −m to n]
Parallel string comparison
Parallel LCS

The ordered 2D grid dag grid²(n):
nodes arranged in an n × n grid
edges directed top-to-bottom and left-to-right
≤ 2n inputs (to left/top borders)
≤ 2n outputs (from right/bottom borders)
size n², depth 2n − 1

Applications: triangular linear systems; discretised PDEs via Gauss-Seidel iteration (single step); 1D cellular automata; dynamic programming
Sequential work O(n²)
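The LCS score is a textbook instance of an ordered 2D grid dag computation: each cell depends only on its top and left neighbours. A minimal sketch with a rolling row:

```python
def lcs_grid(a, b):
    """LCS score by dynamic programming over the ordered 2D grid dag,
    sequential work O(n^2); only one row of the grid is kept."""
    m, n = len(a), len(b)
    row = [0] * (n + 1)
    for i in range(1, m + 1):
        prev_diag = row[0]
        for j in range(1, n + 1):
            prev_diag, row[j] = row[j], (prev_diag + 1 if a[i - 1] == b[j - 1]
                                         else max(row[j], row[j - 1]))
    return row[n]
```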
Bulk-Synchronous Parallel (BSP) computer    [Valiant: 1990]
A simple, realistic general-purpose parallel model

[Figure: p processor/memory pairs 0, 1, ..., p−1, connected by the communication environment (g, l)]

Contains:
p processors, each with local memory (1 time unit per operation)
a communication environment, including a network and an external memory (g time units per data unit communicated)
a barrier synchronisation mechanism (l time units per synchronisation)

Parallel ordered 2D grid computation
grid²(n) consists of a p × p grid of blocks, each isomorphic to grid²(n/p)
The blocks can be arranged into 2p − 1 anti-diagonal layers, with ≤ p independent blocks in each layer
Parallel ordered 2D grid computation (contd.)
The computation proceeds in 2p − 1 stages, each computing a layer of blocks. In a stage:
every block is assigned to a different processor (some processors may be idle)
the processor reads the 2n/p block inputs, computes the block, and writes back the 2n/p block outputs

comp: (2p − 1) · O((n/p)²) = O(n²/p)
comm: (2p − 1) · O(n/p) = O(n)
Overall (for n ≥ p): comp O(n²/p), comm O(n), sync O(p)

Parallel LCS computation
The 2D grid algorithm solves the LCS problem (and many others) by dynamic programming:
comp O(n²/p), comm O(n), sync O(p)
comm is not scalable (i.e. does not decrease with increasing p) :-(
Can scalable comm be achieved for the LCS problem?
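The 2p − 1 anti-diagonal stages above can be simulated sequentially; blocks within one layer touch disjoint data and could run on different processors. A minimal sketch:

```python
def lcs_wavefront(a, b, p):
    """LCS by a p x p grid of blocks processed in 2p - 1 anti-diagonal
    layers; blocks within a layer are independent of each other."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    rows = [(m * q) // p for q in range(p + 1)]     # block boundaries
    cols = [(n * q) // p for q in range(p + 1)]
    for layer in range(2 * p - 1):                  # one stage per layer
        for bi in range(max(0, layer - p + 1), min(p, layer + 1)):
            bj = layer - bi                         # block (bi, bj) is ready:
            for i in range(rows[bi] + 1, rows[bi + 1] + 1):   # its top and left
                for j in range(cols[bj] + 1, cols[bj + 1] + 1):  # neighbours are done
                    dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                                else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[m][n]
```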
Parallel LCS computation (contd.)
Solve the more general semi-local LCS problem:
each string vs all substrings of the other string
all prefixes of each string vs all suffixes of the other string

Divide-and-conquer on substrings of a, b: log p recursion levels
Each level assembles substring LCS from smaller ones by parallel seaweed multiplication
Base level: p semi-local LCS subproblems, each of size n/p^{1/2}
Sequential time still O(n²)

[Krusche, T: 2007]: comp O(n²/p), comm O((n log p)/p^{1/2}), sync O(log p)

Communication vs synchronisation tradeoff
Parallelising normal O(n log n) seaweed multiplication [Krusche, T: 2010]:
comp O(n²/p), comm O(n/p^{1/2}), sync O(log² p)
Special seaweed multiplication sacrifices some comp, comm for sync
Open problem: can we achieve comm O(n/p^{1/2}), sync O(log p)?
Parallel LCS on permutation strings (LCSP = LIS)
Sequential time O(n log n); no good way known to parallelise directly
Solution via the more general semi-local LCSP is no longer efficient: sequential time O(n log² n)
Divide-and-conquer on substrings/subsequences of a, b
Parallelising normal O(n log n) seaweed multiplication [Krusche, T: 2010]:
comp O((n log² n)/p), comm O((n log p)/p), sync O(log² p)

Open problem: can we achieve comp O((n log n)/p) for LCSP?
Can we achieve comm O(n/p), sync O(log p) for semi-local LCSP?
The transposition network method
Transposition networks

Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks are visualised by wire diagrams
Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks: number of comparators
merging    O(n log n)     [Batcher: 1968]
sorting    O(n log² n)    [Batcher: 1968]
sorting    O(n log n)     [Ajtai+: 1983]

Transposition network: all comparisons are between adjacent wires

Seaweed combing as a transposition network
Character mismatches correspond to comparators
Inputs are anti-sorted (sorted in reverse); each value traces a seaweed

[Figure: transposition network for a = "A B C A" vs b = "A C B C"; inputs −7, −5, −3, −1, +1, +3, +5, +7, outputs −7, −1, +3, −3, −5, +7, +5, +1]
Global LCS: transposition network with binary input

[Figure: the same network for a = "A B C A" vs b = "A C B C", with binary inputs: 0s entering at the top, 1s at the left]

Inputs still anti-sorted, but may not be distinct
A comparison between equal values is indeterminate

Parameterised string comparison
String comparison sensitive e.g. to
low similarity: small λ = LCS(a, b)
high similarity: small κ = dist_LCS(a, b) = m + n − 2λ
Can also use weighted alignment score or edit distance
Assume m = n, therefore κ = 2(n − λ)
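The identity κ = dist_LCS(a, b) = m + n − 2λ used above can be checked directly: the LCS distance is the edit distance with insertions and deletions only. A minimal sketch:

```python
def lcs_len(a, b):
    """Plain dynamic-programming LCS score (lambda)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def lcs_distance(a, b):
    """Indel-only edit distance (kappa): substitutions are not allowed."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            elif a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

Every unmatched character of a must be deleted and every unmatched character of b inserted, giving κ = (m − λ) + (n − λ).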
Parameterised string comparison (contd.)
Low-similarity comparison (small λ):
sparse set of matches, but may need to look at them all
preprocess the matches for fast searching, in time O(n log σ)
High-similarity comparison (small κ):
set of matches may be dense, but only a small subset needs to be looked at
no need to preprocess, linear search is OK
Flexible comparison: sensitive to both high and low similarity, e.g. by running both comparison types alongside each other

Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ):
O(nλ)    [Hirschberg: 1977]; [Apostolico, Guerra: 1985]; [Apostolico+: 1992]
High-similarity, no preprocessing:
O(nκ)    [Ukkonen: 1985]; [Myers: 1986]
Flexible:
O(λκ log n)    no preprocessing       [Myers: 1986; Wu+: 1990]
O(λκ)          after preprocessing    [Rick: 1995]
Parameterised string comparison: the waterfall algorithm
Low-similarity: O(nλ)
High-similarity: O(nκ)

[Figure: binary transposition network, with the 0s traced through the network in contiguous blocks and gaps]

Trace the 0s through the network in contiguous blocks and gaps

Dynamic string comparison
The dynamic LCS problem
Maintain the current LCS score under updates to one or both input strings
Both input strings are streams, updated on-line:
deleting characters at the left or right
prepending/appending characters at the left or right
Goal: linear time per update
O(n) per update of a (n = |b|); O(m) per update of b (m = |a|)
Assume for simplicity m ≈ n, i.e. m = Θ(n)
Dynamic LCS in linear time: update models
left        right
–           app+del    standard DP [Wagner, Fischer: 1974]
prep        app        [Landau+: 1998]; [Kim, Park: 2004]
prep        app        [Ishida+: 2005]
prep+del    app+del    [T: NEW]

Dynamic semi-local LCS in linear time: update models
left        right
prep+del    app+del    amortised    [T: NEW]

Bit-parallel string comparison
String comparison using standard instructions on machine words of size w

Bit-parallel string comparison: running time
O(mn/w)    [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]
Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s′, c′; match/mismatch flag µ

s  c  µ  |  s′ c′
0  0  0  |  0  0
1  0  0  |  1  0
0  1  0  |  1  0
1  1  0  |  1  1
0  0  1  |  0  0
1  0  1  |  0  1
0  1  1  |  1  0
1  1  1  |  1  1

2c′ + s′ ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)
S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ

High-similarity bit-parallel string comparison
κ = dist_LCS(a, b); assume κ odd, m = n
Waterfall algorithm within a diagonal band of width κ + 1: time O(nκ/w)
The band waterfall is supported from below by separator matches
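The word update S ← (S + (S ∧ M)) ∨ (S ∧ ¬M) above is the core of the cited O(mn/w) bit-parallel LCS algorithms; a minimal sketch, using Python's unbounded integers in place of w-bit machine words:

```python
def lcs_bitparallel(a, b):
    """Bit-parallel LCS via the word update S <- (S + (S & M)) | (S & ~M).
    One update processes a whole column of the binary transposition network."""
    m = len(a)
    mask = (1 << m) - 1
    match = {}
    for i, ch in enumerate(a):                 # match masks: one bit per position of a
        match[ch] = match.get(ch, 0) | (1 << i)
    S = mask                                   # all ones (anti-sorted binary input)
    for ch in b:
        M = match.get(ch, 0)
        S = ((S + (S & M)) | (S & ~M)) & mask  # carries implement the comparators
    return m - bin(S).count("1")               # LCS score = number of cleared bits
```

On real hardware, a string of length m uses ⌈m/w⌉ words per update, giving the O(mn/w) bound.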
High-similarity bit-parallel multi-string comparison: a vs b₀, ..., b_{r−1}
κᵢ = dist_LCS(a, bᵢ) ≤ κ for 0 ≤ i < r
Waterfalls within r diagonal bands of width κ + 1: time O(nrκ/w)
Each band's waterfall is supported from below by separator matches
Beyond semi-locality
Fixed-length substring LCS

Given a window length w, a w-window is any substring of length w

The window-substring LCS problem
Give the LCS score for every w-window of a vs every substring of b

Window-substring LCS: running time
O(mn³w)    naive
O(mnw)     multiple seaweed
O(mn)      [Krusche, T: 2010]

Window-substring LCS: the algorithm
Compute seaweed matrices for canonical substrings of a vs b recursively
Compute seaweed matrices for w-windows of a vs b, using the decomposition of each w-window into canonical substrings
Both stages use matrix distance multiplication; overall time O(mn)

The window-window LCS problem
Give the LCS score for every w-window of a vs every w-window of b

Window-window LCS: running time
O(mnw²)    naive
O(mnw)     repeated seaweed combing
O(mn)      [Krusche, T: 2010]

An implementation of alignment plots    [Krusche: 2010]
Based on the O(mnw) repeated seaweed combing approach, using some ideas from the optimal O(mn) algorithm: resulting time O(mnw^0.5)
Provides an LCS-scored alignment plot (by analogy with a Hamming-scored dot plot), a useful tool for studying weak conservation in the genome
Weighted alignment (match 1, mismatch 0, gap −0.5) via alignment dag blow-up: ×4 slowdown
C++, Intel assembly (x86, x86_64, MMX/SSE2 data parallelism)
SMP parallelism (currently two processors)
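For reference, the O(mnw²) naive baseline that the seaweed methods improve upon simply computes an LCS per window pair; a minimal sketch of the resulting alignment plot:

```python
def window_window_lcs(a, b, w):
    """Naive O(mnw^2) baseline: LCS score for every w-window of a
    vs every w-window of b, returned as a 2D alignment plot."""
    def lcs(x, y):
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(len(x)):
            for j in range(len(y)):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if x[i] == y[j]
                                    else max(dp[i + 1][j], dp[i][j + 1]))
        return dp[-1][-1]
    return [[lcs(a[i:i + w], b[j:j + w])
             for j in range(len(b) - w + 1)]
            for i in range(len(a) - w + 1)]
```

High-scoring cells of the plot mark window pairs with strong (possibly weak-homology) similarity.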
Fixed-length substring LCS: application
Krusche's implementation was used at the Warwick Systems Biology Centre to study conservation in plant non-coding DNA
More sensitive than BLAST and other heuristic (seed-and-extend) methods
Found hundreds of conserved non-coding regions between papaya (Carica papaya), poplar (Populus trichocarpa), thale cress (Arabidopsis thaliana) and grape (Vitis vinifera)
Single processor:
×10 speedup over a heavily optimised, bit-parallel naive algorithm
×7 speedup over biologists' heuristics
Two processors: extra ×2 speedup, near-perfect parallelism

[Slides show screenshots: the EARS (Evolutionary Analysis of Regulatory Sequences) web tool; Baxter et al., "Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants", The Plant Cell 24:3949-3965 (2012); and the Science Omega press article "Conserved DNA could prove crop science shortcut", noting that the conserved regions survive in species that have been evolving separately for around 100 million years]
Beyond semi-locality
Beyond semi-locality
Fixed-length substring LCS
Quasi-local LCS and spliced alignment
Publications:
Given m (possibly overlapping) prescribed substrings in a
[Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in
Plants. Plant Journal.
The quasi-local LCS problem
[Baxter+: 2012] Conserved Noncoding Sequences Highlight Shared
Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell.
Give LCS score for every prescribed substring of a vs every subsring of b
Web service: http://wsbc.warwick.ac.uk/ears
Quasi-local LCS: running time
Warwick press release (with video): http://www2.warwick.ac.uk/
newsandevents/pressreleases/discovery_of_100
O(m2 n)
O(mn log2 m)
repeated seaweed combing
[T: unpublished]
Future work: genomic repeats; genome-scale comparison
Alexander Tiskin (Warwick)
Semi-local LCS
145 / 156
Alexander Tiskin (Warwick)
Beyond semi-locality
Quasi-local LCS and spliced alignment

Quasi-local LCS: the algorithm
Compute seaweed matrices for canonical substrings of a vs b recursively
Compute seaweed matrices for prescribed substrings of a vs b, using the decomposition of each prescribed substring into canonical substrings
Both stages use matrix distance multiplication, time O(mn log^2 m)

Beyond semi-locality
Quasi-local LCS and spliced alignment

The spliced alignment problem
Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score
Describes gene assembly from candidate exons, given a known similar gene
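The quasi-local algorithm above relies on decomposing each prescribed substring into canonical substrings. A minimal sketch of that decomposition step, under the assumption that the canonical substrings are the dyadic intervals [k*2^j, (k+1)*2^j) (the function name and the half-open interval convention are illustrative, not from the slides):

```python
def dyadic_decomposition(lo, hi):
    """Split the half-open interval [lo, hi) into O(log(hi - lo))
    canonical dyadic intervals of the form [k * 2**j, (k + 1) * 2**j)."""
    pieces = []
    while lo < hi:
        size = lo & -lo  # largest power of two dividing lo (0 when lo == 0)
        if size == 0 or size > hi - lo:
            # fall back to the largest power of two that fits in [lo, hi)
            size = 1 << ((hi - lo).bit_length() - 1)
        pieces.append((lo, lo + size))
        lo += size
    return pieces
```

For example, dyadic_decomposition(3, 10) yields the three canonical pieces (3, 4), (4, 8), (8, 10), so a seaweed matrix for the substring a⟨3 : 10⟩ can be assembled by distance-multiplying the matrices of logarithmically many canonical substrings.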
Assume m = n; let s = total size of prescribed substrings

Spliced alignment: running time
O(ns) = O(n^3)    [Gelfand+: 1996]
O(n^2.5)          [Kent+: 2006]
O(n^2 log^2 n)    [T: unpublished]
O(n^2 log n)      [Sakai: 2009]
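Spliced alignment selects an optimal chain of non-overlapping candidate exons. As a toy illustration of the chaining component only (a deliberate simplification: the real problem scores the concatenated chain against b by alignment, not by adding independent per-exon scores), a weighted-interval-scheduling DP over hypothetical precomputed per-exon scores:

```python
import bisect

def best_chain(intervals):
    """Best total score over chains of non-overlapping intervals in a.
    intervals: list of (start, end, score) with half-open [start, end).
    Simplification: a chain's score is the sum of per-interval scores."""
    intervals = sorted(intervals, key=lambda t: t[1])  # sort by right endpoint
    ends = [e for _, e, _ in intervals]
    best = [0] * (len(intervals) + 1)  # best[i]: optimum over first i intervals
    for i, (s, e, w) in enumerate(intervals):
        # rightmost earlier interval finishing at or before this one starts
        j = bisect.bisect_right(ends, s, 0, i)
        best[i + 1] = max(best[i], best[j] + w)  # skip interval i, or take it
    return best[-1]
```

For instance, best_chain([(0, 3, 2), (2, 5, 4), (5, 8, 3)]) returns 7, chaining (2, 5) with (5, 8); the full spliced alignment algorithms above replace the additive scoring with alignment of the chained substrings against b.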
Beyond semi-locality
Length-optimised substring alignment

The length-optimised substring alignment problem
For every prefix of string a and every prefix of string b, give a pair of respective suffixes of these prefixes that have the highest alignment score against each other

Length-optimised substring alignment: running time
O(m^2 n^2)   [naive]
O(mn)        [Smith, Waterman: 1981]

Beyond semi-locality
Length-optimised substring alignment

Assume a fixed set of character alignment scores

Length-optimised substring alignment: the algorithm  [SW: 1981]
Sweep cells in any dominance-compatible order
Cell update: time O(1)
Overall time O(mn)

Length-optimised substring alignment maximises alignment score irrespective of length
May not always be the most relevant, e.g. for a particular prefix pair we may have:
suffixes' length 50, score 50
suffixes' length 100, score 51
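The O(mn) sweep is the classic Smith–Waterman recurrence: each cell takes the maximum of zero and O(1) values from its dominance predecessors. A minimal sketch, with illustrative character alignment scores (match +1, mismatch -1, gap -1 are assumptions, not the slides' parameters):

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Highest local alignment score over all pairs of substrings of a and b,
    swept in a dominance-compatible (row-by-row) order; time O(len(a)*len(b))."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # O(1) cell update from the three dominance predecessors,
            # clamped at 0 so an alignment may start at any cell
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

For example, smith_waterman("baab", "ab") returns 2, the score of aligning the substring "ab" of a against all of b.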
Beyond semi-locality
Length-optimised substring alignment

The bounded substring alignment problem
For every prefix of string a and every prefix of string b, give a pair of respective suffixes of these prefixes of length at most t that have the highest alignment score against each other

Normalised alignment score: score(a⟨i0 : i1⟩, b⟨j0 : j1⟩) / (i1 - i0 + j1 - j0 + L)

The normalised substring alignment problem
For every prefix of string a and every prefix of string b, give a pair of respective suffixes of these prefixes that have the highest normalised alignment score against each other

Both solved efficiently using fixed-length alignment as a subroutine
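The normalised score can be checked directly on the length-50/length-100 example above; the constant L = 10 below is an arbitrary illustrative choice, not a value from the slides:

```python
def normalised_score(score, i0, i1, j0, j1, L):
    """Normalised alignment score of a<i0 : i1> vs b<j0 : j1>:
    score / (i1 - i0 + j1 - j0 + L)."""
    return score / (i1 - i0 + j1 - j0 + L)

# The slides' example: two candidate suffix pairs for one fixed prefix pair.
short = normalised_score(50, 0, 50, 0, 50, L=10)    # suffixes' length 50, score 50
long_ = normalised_score(51, 0, 100, 0, 100, L=10)  # suffixes' length 100, score 51
# Plain alignment score prefers the longer pair (51 > 50);
# the normalised score prefers the shorter, denser one.
print(short > long_)  # → True
```

Larger L dampens the preference for short alignments, which is why it appears as a tunable additive constant in the denominator.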
Conclusions and open problems (1)

Implicit unit-Monge matrices:
equivalent to seaweed braids
distance multiplication in time O(n log n)

Semi-local LCS problem:
representation by implicit unit-Monge matrices
generalisation to rational alignment scores

Seaweed combing and micro-block speedup:
a simple algorithm for semi-local LCS
semi-local LCS in time O(mn), and even slightly o(mn)

Sparse string comparison:
fast LCS on permutation strings
fast max-clique in a circle graph

Conclusions and open problems (2)

Compressed string comparison:
LCS and approximate matching on plain pattern against GC-text

Parallel string comparison:
comm- and sync-efficient parallel LCS

Transposition networks:
simple interpretation of parameterised and bit-parallel LCS
dynamic LCS

Beyond semi-locality:
window alignment and sparse spliced alignment
Conclusions and open problems (3)

Open problems (theory):
real-weighted alignment?
max-clique in circle graph: O(n log^2 n) → O(n log n)?
compressed LCS: other compression models?
parallel LCS: comm per processor O(n log p / p^{1/2}) → O(n / p^{1/2})?
lower bounds?

Open problems (biological applications):
real-weighted alignment?
good pre-filtering (e.g. q-grams)?
application for multiple alignment? (NB: NP-hard)

Thank you