Pairwise alignment using shortest path algorithms

We will discuss:
• Edit graph
• Dijkstra’s algorithm
• A∗ algorithm (GDUS)
References
• Dijkstra algorithm in general:
– Cormen, Leiserson, Rivest: Introduction to algorithms, MIT Press, 1990.
ISBN 0-262-03141-8
– ...
• Dijkstra and A∗ algorithm for sequence alignment:
– Lecture by Knut Reinert in SS04, which in turn relies on:
– Reinert, Stoye, Will, An iterative method for faster sum-of-pairs multiple sequence alignment, Bioinformatics, 2000, Vol 16, no 9, pages 808-814.
– Stoye, Divide-and-Conquer Multiple Sequence Alignment, TR 97-02 Universität Bielefeld, 1997.
Edit graph
In principle, the edit graph is just another way to look at a dynamic programming
matrix:
• Each matrix entry corresponds to a vertex.
• Dependencies in the recursion correspond to (directed) edges.
In this representation, finding an optimal alignment w.r.t. distance corresponds to
finding a shortest path in a directed acyclic graph (DAG).
(Why is the edit digraph acyclic?
– because every edge leads from a vertex to one with a larger coordinate sum i + j. A cycle would mean an infinite loop in the DP recursion, so DP wouldn’t work.)
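The correspondence between DP dependencies and edges can be sketched in a few lines of Python. This is only an illustration: the function names `successors` and `unit_cost` and the unit costs are assumptions made for the example, not part of the lecture.

```python
from typing import Callable, Iterator, Tuple

Vertex = Tuple[int, int]


def successors(
    v: Vertex,
    x: str,
    y: str,
    sub: Callable[[str, str], int],
    gap: int,
) -> Iterator[Tuple[Vertex, int]]:
    """Yield (successor, edge cost) pairs of vertex v = (i, j) in the edit graph.

    Vertex (i, j) stands for the DP entry that aligns x[:i] with y[:j].
    """
    i, j = v
    m, n = len(x), len(y)
    if i < m and j < n:  # diagonal edge: align x[i] with y[j]
        yield (i + 1, j + 1), sub(x[i], y[j])
    if i < m:  # gap edge: x[i] against a gap
        yield (i + 1, j), gap
    if j < n:  # gap edge: y[j] against a gap
        yield (i, j + 1), gap


def unit_cost(a: str, b: str) -> int:
    """Illustrative substitution cost (assumption): 0 for a match, 1 otherwise."""
    return 0 if a == b else 1


# Every edge increases i + j, so no directed cycle can exist.
edges = list(successors((0, 0), "ANN", "CAN", unit_cost, 1))
print(edges)  # [((1, 1), 1), ((1, 0), 1), ((0, 1), 1)]
```

Because the successors are computed from the coordinates alone, the graph never has to be stored explicitly.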
Here is an edit graph with source s, sink t, and an optimal source-to-sink path s → t
in dashed and red:
[Figure: edit graph for the sequences ANN and CAN, with vertex coordinates 0 to 3 along each axis, source s and sink t.]
Don’t mind that the arrows are the other way round this time.
We denote the vertices by their coordinates, e.g., the source is s = (0, 0).
A path π from s = (0, 0) to t = (m, n) corresponds to an alignment. We consider
π as a set of edges. Each edge corresponds to a column of the alignment. The cost
of an alignment is the sum of the cost of all edges:
c(\pi) := \sum_{e \in \pi} c(e) .
The cost of an edge is given by the DP recursion:
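F(i, j) := \min \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + d \\ F(i, j-1) + d \end{cases}
That is, the diagonal edge from (i − 1, j − 1) to (i, j) has cost s(x_i, y_j), and the edges from (i − 1, j) and from (i, j − 1) to (i, j) each have cost d.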
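As a minimal sketch, the recursion can be evaluated row by row as follows. The function names and the cost values (match 0, mismatch 2, gap 3) are assumptions chosen for illustration.

```python
from typing import Callable, List


def dp_matrix(x: str, y: str, sub: Callable[[str, str], int], d: int) -> List[List[int]]:
    """Fill F so that F[i][j] is the cost of a shortest path from (0, 0) to (i, j)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # first column: only gap edges
        F[i][0] = i * d
    for j in range(1, n + 1):  # first row: only gap edges
        F[0][j] = j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = min(
                F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),  # diagonal edge
                F[i - 1][j] + d,                            # gap edge
                F[i][j - 1] + d,                            # gap edge
            )
    return F


def s(a: str, b: str) -> int:
    """Illustrative substitution cost (assumption): match 0, mismatch 2."""
    return 0 if a == b else 2


F = dp_matrix("MYM", "IMI", s, 3)
for row in F:
    print(row)
# [0, 3, 6, 9]
# [3, 2, 3, 6]
# [6, 5, 4, 5]
# [9, 8, 5, 6]
```

The bottom-right entry, here 6, is the cost c(π) of an optimal path from s to t.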
Recall that usually there are many orders in which the entries of a DP matrix can be
computed. Edit graphs are just the right level of abstraction to state this fact:
Computing the shortest path in a DAG (directed acyclic graph) can be done by
traversing the DAG in any topological order.
Recapitulation.
A topological order of a DAG is a labeling t : V → N of its nodes with pairwise distinct
labels such that t(u) < t(v) holds for all edges (u, v) ∈ E.
A topological order for a digraph G = (V, E) can be found, e.g., by a post-order
depth-first traversal starting from the sink in O(|V| + |E|) time.
Topological-Sort(G)
Input: DAG G
Output: topologically ordered list of vertices V
call DFS(G) to compute finishing times f(v) for each vertex v
as each vertex is finished, insert it onto the front of a linked list T
return T
[Figure: example DAG on the vertices a to j with integer edge weights, some of them negative.]
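The pseudocode above can be sketched in Python as follows. The adjacency-dictionary representation and the small example DAG are assumptions made for illustration (the DAG is not the one from the figure), and the input is assumed to be acyclic.

```python
from typing import Dict, Hashable, List


def topological_sort(adj: Dict[Hashable, List[Hashable]]) -> List[Hashable]:
    """Return the vertices of a DAG so that every edge (u, v) has u before v."""
    finished: List[Hashable] = []
    visited = set()

    def visit(u: Hashable) -> None:
        visited.add(u)
        for v in adj.get(u, []):
            if v not in visited:
                visit(v)
        finished.append(u)  # u is finished after all vertices reachable from it

    for u in adj:
        if u not in visited:
            visit(u)
    finished.reverse()  # same effect as inserting at the front of a list
    return finished


# Hypothetical example DAG.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order = topological_sort(dag)
print(order)  # ['a', 'c', 'b', 'd']
```

The recursive DFS is fine for small examples; for graphs as large as an edit graph an explicit stack would avoid Python's recursion limit.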
We can find shortest paths in DAGs by processing the nodes in topological order.
DAG-shortest-paths(G, w, s)
Input: DAG G, weights w, source s
Output: distance labels d and predecessors π
d(s) := 0; π(s) := nil
for each vertex v ∈ V \ {s} do
    d(v) := ∞
    π(v) := nil
for each vertex u taken in topologically sorted order do
    for each edge (u, v) ∈ E do
        if d(v) > d(u) + w(u, v) then
            d(v) := d(u) + w(u, v)
            π(v) := u
[Figure: the example DAG on the vertices a to j with its edge weights.]
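A minimal Python sketch of this procedure, assuming the graph is given as a dictionary that maps each vertex to a list of (successor, weight) pairs. The example DAG is hypothetical; it contains a negative weight to show that these are allowed here.

```python
import math
from typing import Dict, Hashable, List, Optional, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def dag_shortest_paths(adj: Graph, s: Hashable):
    """Return (d, pi): distance labels and predecessors for source s in a DAG."""
    # Topological order by depth-first search (reverse finishing order).
    order: List[Hashable] = []
    visited = set()

    def visit(u: Hashable) -> None:
        visited.add(u)
        for v, _ in adj.get(u, []):
            if v not in visited:
                visit(v)
        order.append(u)

    for u in adj:
        if u not in visited:
            visit(u)
    order.reverse()

    d: Dict[Hashable, float] = {u: math.inf for u in order}
    pi: Dict[Hashable, Optional[Hashable]] = {u: None for u in order}
    d[s] = 0
    for u in order:
        if d[u] == math.inf:
            continue  # u is not reachable from s
        for v, w in adj.get(u, []):
            if d[v] > d[u] + w:  # relax edge (u, v)
                d[v] = d[u] + w
                pi[v] = u
    return d, pi


# Hypothetical example with one negative edge weight.
dag = {"s": [("a", 2), ("b", 6)], "a": [("b", -3), ("t", 4)], "b": [("t", 1)], "t": []}
d, pi = dag_shortest_paths(dag, "s")
print(d)   # {'s': 0, 'a': 2, 'b': -1, 't': 0}
print(pi)  # {'s': None, 'a': 's', 'b': 'a', 't': 'b'}
```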
On the next slide we do this for the more realistic example (only initial graph shown,
shortest path computation at the blackboard).
[Figure: initial edit graph for the sequences HEAGAWGHEE and PAWHEAE with integer edge weights, some of them negative.]
However, we have already seen that constructing the whole graph would be the major
obstacle. Let us have a look at an algorithm that is more costly in theory, namely
Dijkstra’s algorithm for graphs with non-negative edge costs, which does not rely on
a topological order.
Dijkstra’s algorithm
As noted before, constructing the complete graph is much too expensive. But we
can adapt Dijkstra’s algorithm and construct the graph as needed, starting from the
source s.
Recapitulation.
A priority queue Q is an abstract data type (ADT) that maintains a set of elements
S, each with an associated value called a key.
Here, the elements are the nodes of the edit graph and their keys are the distance
labels d.
Operations supported:
• Q.insert(x, k)
inserts the element x with key k into Q
• Q.empty()
true iff the queue is empty
• Q.extract_min()
removes and returns the element in Q with the lowest key
• Q.decrease_key(x, k)
decreases the key of element x to the value k
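One way to obtain these four operations in Python is sketched below. Python's `heapq` module has no decrease-key operation, so this sketch substitutes a different technique: it pushes a new entry and skips outdated entries on extraction ("lazy deletion"). The class name `PriorityQueue` and its internals are assumptions for illustration.

```python
import heapq
import itertools


class PriorityQueue:
    """Priority queue with decrease_key, emulated on top of heapq by lazy deletion."""

    def __init__(self):
        self._heap = []                    # entries (key, tie-breaker, element)
        self._key = {}                     # current key of every element in the queue
        self._counter = itertools.count()  # tie-breaker, avoids comparing elements

    def insert(self, x, k):
        self._key[x] = k
        heapq.heappush(self._heap, (k, next(self._counter), x))

    def empty(self):
        return not self._key

    def extract_min(self):
        while True:
            k, _, x = heapq.heappop(self._heap)
            if self._key.get(x) == k:      # otherwise the entry is outdated
                del self._key[x]
                return x

    def decrease_key(self, x, k):
        if k < self._key[x]:
            self.insert(x, k)              # the old entry stays behind as outdated


Q = PriorityQueue()
Q.insert("a", 5)
Q.insert("b", 3)
Q.decrease_key("a", 1)
first, second = Q.extract_min(), Q.extract_min()
print(first, second, Q.empty())  # a b True
```

With lazy deletion the heap can hold several entries per element, so each operation costs O(log k) for k entries in the heap rather than O(log |V|); in exchange no handles into the heap are needed.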
Dijkstra(G, w, s, d, π)
{
    G = (V, E) : graph;
    Adj : V → P(V) : set of successor nodes, this implements E;
    w : E → N : edge weights;
    d : V → N : distance from s as computed by algorithm;
    π : V → V : predecessor;
    Q⟨V, N⟩ : priority queue;

    d[s] ← 0; ∀v ∈ V \ {s} : d[v] ← ∞;
    ∀v ∈ V : π[v] ← nil;
    Q.insert(s, d[s]);
    while (!Q.empty()) do
        u ← Q.extract_min();
        foreach (v ∈ Adj(u)) do
            if (d[v] = ∞) Q.insert(v, ∞); fi
            c ← d[u] + w(u, v);
            if (c < d[v])
                d[v] ← c;
                π[v] ← u;
                Q.decrease_key(v, d[v]);
            fi
        od
    od
}
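A sketch of how this can look for the edit graph in Python, constructing vertices only when they are reached. It deviates from the pseudocode in two labeled ways: instead of decrease_key it pushes a new heap entry and skips outdated ones, and it stops as soon as the sink is extracted. The function name and the costs (match 0, mismatch 2, gap 3) are assumptions for illustration.

```python
import heapq
import math


def dijkstra_alignment(x, y, sub, gap):
    """Shortest s-t path in the implicit edit graph of x and y.

    Returns (distance of t, predecessor map, number of extracted vertices).
    Assumes non-negative edge costs.
    """
    m, n = len(x), len(y)
    s, t = (0, 0), (m, n)
    d = {s: 0}        # tentative distances of the vertices reached so far
    pi = {s: None}    # predecessors, used for backtracing
    done = set()      # vertices already extracted
    heap = [(0, s)]
    while heap:
        du, u = heapq.heappop(heap)
        if u in done:
            continue  # outdated entry, u was extracted with a smaller key
        done.add(u)
        if u == t:
            break     # d[t] is final once t is extracted
        i, j = u
        succ = []     # Adj(u) follows from the grid structure
        if i < m and j < n:
            succ.append(((i + 1, j + 1), sub(x[i], y[j])))
        if i < m:
            succ.append(((i + 1, j), gap))
        if j < n:
            succ.append(((i, j + 1), gap))
        for v, w in succ:
            c = du + w
            if c < d.get(v, math.inf):
                d[v] = c
                pi[v] = u
                heapq.heappush(heap, (c, v))  # stands in for insert / decrease_key
    return d[t], pi, len(done)


def sub(a, b):
    """Illustrative substitution cost (assumption): match 0, mismatch 2."""
    return 0 if a == b else 2


dist, pi, extracted = dijkstra_alignment("MYM", "IMI", sub, 3)
print(dist)  # 6
same, _, extracted_same = dijkstra_alignment("ACGT", "ACGT", sub, 3)
print(same, extracted_same)  # 0 5
```

For the two identical sequences only the 5 vertices on the main diagonal are extracted, although the full edit graph has 25 vertices.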
• As in the usual Dijkstra algorithm, the priority queue Q holds the values
of the best paths found so far. The “final” values are stored in d. The shortest
path arborescence∗ is stored in π; it is used for backtracing.
• In each step we delete the node u with the minimal value from Q.
• This node is then expanded, which means that all neighboring nodes v of u
that are not already in Q are inserted into Q with the value d(u) + w(u, v).
We know Adj(u) because the graph has a simple grid structure.
• In case v is already in Q, we relax the edge (u, v): the value of v is decreased
to d(u) + w(u, v) if this is smaller than its current value.
∗ An arborescence is a directed forest in which every vertex has out-degree at most one.
Dijkstra’s algorithm guarantees that the value of the node u that is removed from
the priority queue Q equals the value of the shortest path from s to u.
Thus, when t is extracted, we can backtrace the shortest path and output the alignment.
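The backtracing step can be sketched as follows, assuming the predecessors are stored in a dictionary `pi` keyed by vertex coordinates. The function name and the hand-written predecessor map are hypothetical.

```python
def backtrace(pi, t, x, y):
    """Follow the predecessors from t back to the source and build the alignment."""
    row_x, row_y = [], []
    v = t
    while pi[v] is not None:
        u = pi[v]
        (i, j), (k, l) = u, v
        if k == i + 1 and l == j + 1:  # diagonal edge: x[i] aligned with y[j]
            row_x.append(x[i])
            row_y.append(y[j])
        elif k == i + 1:               # gap edge: x[i] against a gap
            row_x.append(x[i])
            row_y.append("-")
        else:                          # gap edge: y[j] against a gap
            row_x.append("-")
            row_y.append(y[j])
        v = u
    # The columns were collected from the sink backwards.
    return "".join(reversed(row_x)), "".join(reversed(row_y))


# Hypothetical predecessor map for the path (0,0) -> (1,0) -> (2,1) -> (3,2).
pi = {(0, 0): None, (1, 0): (0, 0), (2, 1): (1, 0), (3, 2): (2, 1)}
top, bottom = backtrace(pi, (3, 2), "CAN", "AN")
print(top)     # CAN
print(bottom)  # -AN
```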
The next slide shows the example again (only initial graph).
[Figure: initial edit graph for the sequences HEAGAWGHEE and PAWHEAE with non-negative edge weights (values between 0 and 19).]
Theorem. (Correctness)
Let G = (V, E) be a directed graph, s ∈ V and let w : E → N be a weight function
on its edges, w ≥ 0. Then the following holds: After v is extracted from the priority
queue, d[v] equals dist_w(s, v), the length of a shortest path from s to v, and one
such path is given by s, . . . , π(π(v)), π(v), v.
Proof. Blackboard.
Running time of Dijkstra’s algorithm
Dominated by at most |E| decrease_key and at most |V| extract_min operations.
With binary heap:
both take O(log |V|) time
total time: O((|V| + |E|) log |V|)
With Fibonacci heap:
decrease_key in amortized∗ O(1) time, extract_min in amortized O(log |V|) time
total time: O(|E| + |V| log |V|)
But: |E| = O(nm) → In theory, we have not improved the bounds. In practice,
however, only a small portion of the alignment graph needs to be constructed, so
the running time is often lower.
∗ Amortized complexity of O(1) means that we may not achieve an O(1) bound for each
individual call, but a sequence of k calls will take O(k) time in total.
A∗ algorithm, a.k.a. GDUS-Bounding
The so-called Goal-Directed Unidirectional Search (GDUS), also known as the A∗
algorithm, tries to direct the computation of the shortest path towards the sink t.
This is achieved by using a lower bound l(u, t) on the length of a shortest path from
u to t. First the cost of an edge (u, v) is redefined as follows:
c'(u, v) := c(u, v) + l(v, t) − l(u, t) .
Dijkstra’s algorithm is only correct for non-negative edge weights, thus l needs to
satisfy the consistency condition
c(u, v) ≥ l(u, t) − l(v, t)    ∀(u, v) ∈ E .
Then it is easy to show (blackboard) that the redefinition of the edge costs does not
change the (set of) optimal path(s) and that the new edge weights are non-negative.
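The blackboard argument is not part of these notes; one standard way to see both claims is the following telescoping sum, written for an arbitrary path from s to t (the vertex names v_0, ..., v_k are introduced here only for the derivation).

```latex
% Let \pi = (s = v_0, v_1, \dots, v_k = t) be any path from s to t.
\begin{align*}
c'(\pi) &= \sum_{i=1}^{k} \bigl( c(v_{i-1}, v_i) + l(v_i, t) - l(v_{i-1}, t) \bigr) \\
        &= \sum_{i=1}^{k} c(v_{i-1}, v_i) + l(v_k, t) - l(v_0, t) \\
        &= c(\pi) + l(t, t) - l(s, t) .
\end{align*}
% The term l(t, t) - l(s, t) does not depend on \pi, so c and c' rank all
% s-t paths in the same order. Non-negativity is the consistency condition itself:
%   c'(u, v) \ge 0  \iff  c(u, v) \ge l(u, t) - l(v, t) .
```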
A simple lower bound is
l((i, j), (m, n)) := |(m − n) − (i − j)| · d ,
because every path from (i, j) must end on the same diagonal as (m, n), and each gap
edge of cost d changes the diagonal by exactly one.
The better the lower bound, the more directed the search.
• In the extreme case, if l is tight, then we extract only the nodes on an optimal
path from the priority queue.
• In the other extreme, if l = 0, we get Dijkstra’s algorithm back.
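A sketch of the goal-directed search with this diagonal bound, under stated assumptions: substitution costs are non-negative, outdated heap entries are skipped instead of using decrease_key, and the heap is ordered by d(v) + l(v, t), which gives the same extraction order as Dijkstra on the redefined costs c'. The function name and the `use_bound` flag are hypothetical.

```python
import heapq
import math


def astar_alignment(x, y, sub, gap, use_bound=True):
    """Goal-directed shortest path in the implicit edit graph of x and y.

    Returns (distance of t, number of extracted vertices).
    With use_bound=False the bound is 0 and the search is plain Dijkstra.
    """
    m, n = len(x), len(y)
    s, t = (0, 0), (m, n)

    def bound(v):
        i, j = v
        # Each gap edge changes the diagonal i - j by exactly one.
        return abs((m - n) - (i - j)) * gap if use_bound else 0

    d = {s: 0}
    done = set()
    heap = [(bound(s), s)]
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue  # outdated entry
        done.add(u)
        if u == t:
            break
        i, j = u
        succ = []
        if i < m and j < n:
            succ.append(((i + 1, j + 1), sub(x[i], y[j])))
        if i < m:
            succ.append(((i + 1, j), gap))
        if j < n:
            succ.append(((i, j + 1), gap))
        for v, w in succ:
            c = d[u] + w
            if c < d.get(v, math.inf):
                d[v] = c
                heapq.heappush(heap, (c + bound(v), v))
    return d[t], len(done)


def sub(a, b):
    """Illustrative substitution cost (assumption): match 0, mismatch 2."""
    return 0 if a == b else 2


dist_astar, n_astar = astar_alignment("AAAA", "AA", sub, 3)
dist_plain, n_plain = astar_alignment("AAAA", "AA", sub, 3, use_bound=False)
print(dist_astar, dist_plain)  # 6 6
print(n_astar, n_plain)        # extracted vertices with and without the bound
```

Both variants return the same distance; only the number of extracted vertices differs.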
Next slides: example (initial graph and solution).
[Figures: the initial edit graph with redefined edge weights, and the solved graph with distance labels, for the sequences HEAGAWGHEE and PAWHEAE.]
Example: DP Matrix for global alignment
match=0, mismatch=2, space=3

          I   M   I   S   S   M   I   S   S   I   S   S   I   P   I
      0   3   6   9  12  15  18  21  24  27  30  33  36  39  45  48
 M    3   2   3   6   9  12  15  18  21  24  27  30  33  36  42  45
 Y    6   5   4   5   8  11  14  17  20  23  26  29  32  35  41  44
 M    9   8   5   6   7  10  11  14  17  20  23  26  29  32  38  41
 I   12   9   8   5   8   9  12  11  14  17  20  23  26  29  35  38
 S   15  12  11   8   5   8  11  14  11  14  17  20  23  26  32  35
 S   18  15  14  11   8   5   8  11  14  11  14  17  20  23  29  32
 I   21  18  17  14  11   8   7   8  11  14  11  14  17  20  26  29
 S   24  21  20  17  14  11  10   9   8  11  14  11  14  17  23  26
 A   27  24  23  20  17  14  13  12  11  10  13  14  13  16  22  25
 H   30  27  26  23  20  17  16  15  14  13  12  15  16  15  21  24
 I   33  30  29  26  23  20  19  16  17  16  13  14  17  16  20  21
 P   36  33  32  29  26  23  22  19  18  19  16  15  16  16  17  20
 P   39  36  35  32  29  26  25  22  21  20  19  18  17  19  16  19
 I   42  39  38  35  32  29  28  25  24  23  20  21  20  20  19  16
 E   45  42  41  38  35  32  31  28  27  26  23  22  23  19  22  19

[One column, between the columns I (39) and P (45), is missing in the source.]
Example: Needleman-Wunsch
[Figure: the full DP matrix for the two sequences with traceback arrows in every cell.]
16*17=272 nodes were used.
Example: Dijkstra
[Figure: the DP matrix for the same two sequences, with traceback arrows only in the cells reached by Dijkstra’s algorithm.]
165 nodes were inserted, 132 nodes were extracted.
Example: A∗ (GDUS)
[Figure: the DP matrix for the same two sequences, with traceback arrows only in the cells reached by the A∗ algorithm.]
106 nodes were inserted, 76 nodes were extracted.