Alignment graph

Dynamic programming:
one algorithmic key to
many biological locks
Mikhail Gelfand
RTCB, IITP, RA S and FBB, MSU
2010-2011
BIOINFORMATICS FOR BIOLOGISTS
Pavel Pevzner and Ron Shamir, eds.
(Cambridge University Press, 2011)
Ch. 4. DYNAMIC PROGRAMMING: ONE
ALGORITHMIC KEY FOR MANY
BIOLOGICAL LOCKS
Mikhail Gelfand
Research and Training Center “Bioinformatics” of the
Institute for Information Transmission Problems, RAS
and Faculty of Bioengineering and Bioinformatics,
M.V.Lomonosov Moscow State University
Alignment
Three (of many) alignments of two sequences.
Plus denotes a match; dot, a mismatch, minus, a gap.
(a) Two matches, five mismatches,
(b) three matches, one mismatch, two gaps of size three
(six indels, that is one-nucleotide insertions/deletions),
(c) four matches, two gaps of size three (six indels).
The number of alignments is large
# of alignments of two sequences of length N
~ (1+√2)2N+1√N
at N = 1000
# ≈ 10767
# of elementary particles in the Universe ≈ 1080
at N = 100
# ≈ 1076
assume 1 operation per alignment,
1012 operations per second
=> need 1057 years
=> we cannot consider them one by one
Gene recognition
Segmentation of a genomic fragment
into protein-coding and non-coding
regions
based on differences in statistical
properties of these regions
difficult in eukaryotes due to the existence
of introns, non-coding regions within
genes
Toy example
How many operations are needed to
calculate
∑i=1…m, j=1…n xi∙yj =
= x1∙y1 + x1∙y2 + … + x1∙yn +
+ x2∙y1 + x2∙y2 + … + x2∙yn +
+…+
+ xm∙y1 + xm∙y2 + … + xm∙yn
Naïve answer:
mn multiplications and mn–1 additions
but rewrite as…
(x1 + x2 + … + xm) ∙ (y1 + y2 + … + yn) =
= ∑i=1…m xi ∙ ∑j=1…n yj
and it becomes
m+n–2 additions and just 1 multiplication
Quiz
How many multiplications do we need to
calculate
x1y1 ∙ x1y2 ∙ … ∙ x1yn ∙ x2y1 ∙ x2y2 ∙ … ∙ x2yn ∙ … ∙
∙ xmy1 ∙ xmy2 ∙ … ∙ xmyn = ∏ i=1…m, j=1…n xiyj
if we are
(a)naïve?
(b) sophisticated?
(c) What if in addition to multiplication, we have
an operation “taking to the power”?
(d) if we may perform not only multiplication, but
also addition?
Lesson
Restructuring the order of
calculations using properties of
the data may sharply decrease
the number of operations
Graphs
Vertices/nodes: v1, v2, …, vn
Arcs /edges– directed pairs of vertices:
am(vi, vj)
multiple sources
and sinks
contains cycles
“bad” graphs and not graphs
multiple
arcs
loop
multiple
not a
undirected
components
graph
graph
(hanging
arc)
Sources, sinks, paths, cycles
Source is a vertex that is not an end vertex for any arc
Sink is a vertex that is not a start vertex for any arc.
Walk p of length N is an ordered set of N arcs
w = (a1, …, aN) such that the end vertex of arc
an = (bn, en) coincides with the start vertex of arc an+1,
en=bn+1, for all n = 1, …, N–1.
v1
v1
v2
v3
v1
v2
v2
v4
v5
v6
v3
v4
one source
and one sink
w=(a(v2,v1))
multiple sources
and sinks
no source
and sink
w=(a(v1,v3),a(v3,v2,),a(v2,v4,),a(v3,v4),
w=(a(v4,v5),a(v5,v3))
a(v3,v1), a(v1,v3))
Sources, sinks, paths, cycles
In a graph without loops and multiple arcs, each walk may
also be defined as an ordered set of vertices
w = (v1, …, vN+1) such that for each pair of adjacent
vertices vn, vn+1 there is an arc an = (vn, vn+1), n = 1, …, N.
v1
v1
v2
v3
v1
v2
v2
v4
v5
v6
v3
v4
one source
and one sink
multiple sources
and sinks
w=(v2,v1)
w=(v4,v5,v3)
no source
and sink
w=(v1,v3,v2,v4,v3,v1,v3)
Sources, sinks, paths, cycles
A path is a walk in which no arc is passed twice.
Cycle is a path in which the end vertex of the last arc aN
coincides with the start vertex of the first arc a1, eN=b1.
Acyclic graph contains no cycles.
Acyclic graph
Acyclic graph
Cyclic graph
v1
v1
v2
v3
v1
v2
v2
v4
v5
v6
v3
v4
one source
and one sink
multiple sources
and sinks
p=(v2,v1)
p=(v4,v5,v3)
no source
and sink
p=c=(v1,v3,v2,v4,v3,v1)
Quiz
(a) Draw all acyclic connected oriented
graphs with three vertices (up to vertex
labels).
(b) How many oriented graphs will there
be if we label vertices with symbols A, B
and C?
(c) Prove that in an acyclic graph there is
at least one source and at least one sink.
(d) Draw sinks and sources in the graphs
of (a).
Problem
Consider an acyclic graph with one
source and one sink. Assign each arc
with a number called a weight. For a
given path, its path score is defined as
the sum of the weights of its arcs.
Given a weighted acyclic graph, find the
highest scoring path from the sink to
the source.
Observation
If two subpaths P and Q end at the same vertex v,
and the score of P is larger than the score of Q,
then for all pairs of paths P* and Q* that start with
P and Q, respectively, and coincide after v, the
score of P* is higher than the score of Q*.
Hence, we do not need to consider all paths, as it is
sufficient to construct the highest scoring
subpath from the source to each vertex, finishing
Q
at the sink.
P
P>Q
v
P*,Q*
P* > Q*
Let’s do it for this graph
1
3
3
6
2
8
2
5
6
5
2
1
3
4
5
4
2
2
1
1
Step 1
Step 2
1
3
3
6
2
5
6
1
5
3
1
3
4
6
8
4
2
4
1
2
5
6
1
5
5
1
3
4
2
2
5
2
2
3
2
8
1
3
5
6
2
4
1
2
2
2
1
3
Step 3
Step 4
1
3
3
6
2
8
5
2
3
4
6
4
6
1
10
2
3
11
2
5
5
5
7
5 2
1
3
4
1
2
6
2
2
8
5
5 2
1
10
3
5
6
4
1
3
4
1
2
2
2
1
3
Step 5
Step 6
1
3
12
3
6
8
10
5
5
4
2
2
2
11
3
2
16
2
6
7
4
1
1
3
18
3
6
5
5 2
1
1
3
8
10
5
5
2
2
11
2
6
5
7
5 2
1
4
2
3
2
4
1
1
3
Step 7
Step 8
1
3
16
6
8
2
5
3
8
4
1
2
2
6
7
1
10
3
11
5
5
5
5
7
5 2
1
3
4
2
19
3
6
2
2
16
2
5 2
1
4
11
5
6
10
1
3
18
3
19
4
1
2
2
2
1
3
Step 9
20
1
3
6
8
2
5
3
6
8
7
2
4
1
2
1
10
3
11
2
5
6
5
5
1
3
4
2
19
5
2
2
1
3
16
2
5
1
4
11
5
6
10
20
3
19
3
16
Backtracing
5
7
2
4
1
2
2
2
1
3
Quiz
At what steps did we have more than
one vertex with all incoming arcs
processed?
Algorithm
Data types and definitions:
vertices: v, u, Source, Sink;
arcs: (v,u), a;
start vertex of arc a: Begining_vertex(a);
weight of arc (v,u): W(v,u);
path: BestPath; // defined as a set of arcs
the highest score of subpath ending at v: Score (v);
the highest score of subpath coming through (v,u) and ending at
u : Top_score (v,u);
the last arc of the highest scoring subpath ending at u:
Last_arc(u).
Initialize: for each vertex v: Score (v) := minus_infinity.
Forward process: while There are unprocessed vertices:
v := arbitrary unprocessed vertex with all incoming arcs processed;
for each arc (v,u): // consider all arcs starting at v
Top_score (v,u) := Score (v)+W(v,u);
if Top_score (v,u)>Score (u) // subpath coming through v is better than the
//current best subpath ending at u
then: // update the data for u
Score (u) := Top_score (v,u);
Last_acr (u) := (v,u);
endif;
(v,u) := processed_arc;
endfor;
v := processed_vertex;
endwhile.
Backtracing:
BestPath = empty_set; // initialize
v := Sink; // go from the sink backwards by marked arcs
until v=Source
Add Last_arc (v) to BestPath; // add the last arc of the best path ending at the
//current vertex
v := Beging_vertex (Last_arc(v)); // go to the start vertex of this arc
enduntil.
Output BestPath.
The number of operations
The limiting procedure is processing
vertices and adding arcs to paths, and
we consider each arc only once
Hence the number of operations is linear
in the number of arcs A: the run time of
the algorithm is O (A)
Greedy algorithm
Start at the source
and select the
highest-weighted
arc at each step.
3
6
2
8
2
5
6
13 < 20
It does not work.
1
3
5
2
1
3
4
5
4
2
2
1
1
Quiz
(a) Construct the simplest possible graph in which
the greedy algorithm yields the highest scoring
path.
(b) Construct a graph with three vertices in which
the greedy algorithm does not yield the highest
scoring path.
(c) Construct a graph with three vertices in which
the greedy algorithm does yield the highest
scoring path.
(d) Assign new weights to the arcs of the above
graph so that the greedy algorithm will yield the
highest scoring path.
Quiz cont’d
(e) Write an algorithm for construction of the path
with the maximum number of arcs and apply it to
the above graph.
Hint: do not change the algorithm, set proper arc
weights.
(f) Modify the maximum score algorithm so as to
construct the path with the minimal score and find
this path for the above graph.
(g) Provide a greedy algorithm for finding the path
of minimal score in a graph, and apply it to the
above graph.
(h) For the above graph, find the path with the
minimal number of arcs.
Lesson
The generic dynamic programming
algorithm may be applied to different
problems. The common feature of these
problems is that each one can be
decomposed into an ordered set of
smaller subproblems, and to solve a more
complex subproblem one needs to know
only the solutions of the simpler ones, but
not the entire set of possibilities.
Note
There exist path optimization problems that cannot be
solved by the dynamic programming.
Traveling salesman problem. Given a non-oriented graph
with weighted arcs, we need to construct the lowest
scoring path passing through all the vertices (the
salesman needs to visit all cities with travel time between
the cities given by the arc weights, while spending the
least amount of time traveling).
All cities need to be visited in a single trip => NP-complete
problem.
No efficient algorithms are known. Most computer scientists
believe that for all NP-complete problems the number of
operations required to provide an optimal solution is
exponential in the problem size.
Alignment
Given two symbol sequences (nucleotides or
amino acids) of lengths M and N, set a
correspondence between these sequences so
that some symbols are set in pairs, matching or
mismatching, whereas other symbols are
ignored (indels). The order of corresponding
symbols in the subsequences should coincide.
The alignment score is the sum of match premiums
r per matching pair minus the sum of mismatch
penalties p per mismatching pair and deletion
penalties q per ignored symbol.
The goal is to construct the highest scoring
alignment.
Quiz
What are the scores of the alignments
Reduction to the optimal path problem
Construct a graph.
Vertices correpond to pairs of positions
(endpoint of partial alignments).
Outcoming arcs (for each vertex) are of
three types:
• match (weight r ) or mismatch (weight(–p));
total M∙N arcs
• deletion in the 1st sequence (weight (–q));
total M∙(N+1) arcs
• deletion in the 2nd sequence (weight (–q);
total (M+1)∙N) arcs
Alignment graph
g
g
a
n
d
a
l
f
e
l
f
a
n
d
Alignment graph with weights
g
q
g
e
l
f
a
n
d
q
q
q
q
q
q
q
r
q
q
a
q
p
q
p
q
q
p
q
q
p
q
q
p
q
q
p
q
q
p
p
q
p
q
p
q
q
p
p
q
p
q
r
q
p
q
q
p
p
q
p
q
p
q
r
q
q
p
p
q
r
q
p
q
p
q
q
r
p
q
p
q
p
q
p
q
q
p
q
r
q
q
q
p
q
q
q
q
q
p
q
q
q
q
q
q
q
q
p
p
q
q
q
q
q
q
q
q
q
r
p
q
q
q
q
q
q
q
q
q
p
p
q
q
q
q
q
q
q
q
q
p
p
q
q
q
q
f
p
q
q
q
q
l
q
q
q
a
p
p
q
q
q
d
q
q
q
n
p
p
q
q
q
p
q
q
Paths for the three alignments
g
g
a
n
d
a
l
f
e
l
f
a
n
d
Variants
• Hanging-end alignment
(genome assembly)
– zero-weight arcs from the source to the
top and left “perimeter” and from the right
and bottom perimeter to the sink
• Local alignment
– zero-weight arcs from the source to all
internal vertices and from internal
vertices to the sink
Weights
• Amino-acid substitution weight matrices
– evolutionary
• PAM (sure alignment of closely related proteins, take
matrix to the power)
• BLOSUM (alignment of conservative regions in
distantly related proteins)
– based on physical and chemical properties of
residues
• Deletion penalty
– affine penalties (opening and extension
penalties)
• Structural alignment as the gold standard
Quiz
For the above alignments, assuming
match premium r=10, what
combinations of mismatch and
deletion penalties would yield optimal
alignments (a), (b), and (c)?
Multiple alignment
• triple  cubic graph
– etc
• for K sequences of length N requires
O(NK) operations
• soon becomes unworkable
• progressive alignment
– all pairwise alignments, distance matrices
– guide tree
– alignment of partial alignment
Lesson
Weights matter. The same
graph with differently
assigned arc weights
will yield different types
of alignment.
Gene recognition
Define a gene as a sequence fragment consisting of exons
and introns.
The boundaries between them are donor sites (between
exons and introns, usually GT) and acceptor sites
(between introns and exons, usually AG).
Each exon and intron is assigned a weight, measuring
coding affinity (respectively, non-coding affinity) of its
sequence.
The gene’s score is the sum of weights of constituent
exons and introns.
The goal is, given a sequence and a set of candidate donor
and acceptor sites, construct the highest-scoring exon–
intron structure for a gene.
Construct a graph
(a)
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
(b)
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
Complexity
Assume even distribution
of sites (leave out details)
=> O(L) vertices, O(L2) arcs
Can we do better?
It makes sense to assume that the segment weights
are additive (we assume that for exons anyhow).
Then we have just O(L) arcs
(a)
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
(b)
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
Quiz
There are two paths in the segment
graph that describe exon–intron
structures not represented in the
exon–intron graph. What are they?
What arcs need to be added to the
exon–intron graph to represent these
structures?
Lesson
Structure matters. The same problem
may be represented by different
graphs, and the conceptually simplest
representation is not necessarily the
most efficient one.
Return to the toy problem
calculate
the standard trick would not work because
x∙z + y∙z = (x + y) ∙ z (before) holds, but
(x+z) ∙ (y+z) = x∙y + z generally does not.
Quiz. When (x+z) ∙ (y+z) = x∙y + z ?
DP, generic statement.
1. Path weights
Let
be the operation of calculating the
path score S given arc weights W. We
require that the associative rule hold
Hence we can simply write
The path weight (former S(P) =
becomes
.
.
)
DP, generic statement.
2. Graph score
Let Ψ be the set of all paths and + the
operation of selecting the path. We
require that + possess the associative,
commutative rules for combining paths:
and
.
The graph score is define as
(for the optimal path problem )
DP, generic statement.
3. Transitivity
To use dynamic programming, we need
the distribution law
and
.
This is a generalization of the property
used for calculating the optimal path:
max (x + z, y + z) = max (x, y) + z.
DP, algorithm
Problem (physics of polymers)
Linear polymer chain of L+1 monomers k = 0, …, L.
Each monomer assumes N states σ(k) є {σi | i = 1, …, N}.
Energy of interactions between adjacent monomers is
defined by an N×N matrix ξ(σi,σj) (measured in the KT
units).
Chain conformation P is defined by the states of the
monomers {σ(0), σ(1), …, σ(L)}.
Exponent of energy: S(P) = exp (–E(P)) =
= ∏k=1…L exp (–ξ(σ(k–1),σ(k)).
Ψ is the set of all conformations.
Calculate the partition function of the set of all
conformations Ω = ∑PєΨ S(P).
Graph construction and reduction to DP
Vertices correspond to monomer states, so that their
number is (L+1)∙N+2 (two additional vertices are the
source and the sink, corresponding to the virtual
start and end of the chain).
Arcs link vertices corresponding to adjacent
monomers.
Arc weights are the interaction energies.
Paths through this graph exactly correspond to the
chain conformations.
is ordinary multiplication, and is addition
The path score is the product of arc weights.
The total graph score is the sum of these products.
Standard DP solves the problem.
Quiz
(a)How many operations shall we need?
(b) How many operations shall we need
if we calculate the partition function
directly?
(c) Provide an algorithm for calculating
the number of paths in a graph. Hint:
invent suitable arc weights and
reduce to the previous problem.
(d) What will Ω be if both and are the
operation of taking the maximum?
Problem
Calculate the minimum energy and the number of
conformations with the minimum energy.
Arc weights are pairs [1, ξ], with ξ as defined previously.
Path scores are pars [n, ε], where ε is the energy, and n is
the number of conformations having this energy.
When two systems are combined, the resulting energy is the
sum of the systems’ energies, whereas the number of
states is the product of the numbers of states. Hence
solves the problem.
Lesson
Generalizations are useful
Note
Not all problems that can be solved by
dynamic programming have a simple graph
representation. For example, reconstruction
of the secondary structure of a RNA
molecule given its sequence can be
decomposed into simpler, embedded
problems and can be solved by a variant of
dynamic programming algorithm, but in the
language of this paragraph it requires
slightly more complicated objects called
hypergraphs.
Спасибо
•
•
•
•
Mikhail Roytberg
Andrei Mironov
Anatoly Rubinov
Pavel Pevzner