Multi-alignment of Genomes

String Matching
String matching: definition of the problem (text, pattern)
The approach depends on what we have: the text or the patterns.
• Exact matching:
  • The patterns ---> Data structures for the patterns
    • 1 pattern ---> The algorithm depends on |p| and |Σ|
    • k patterns ---> The algorithm depends on k, |p| and |Σ|
    • Extensions
    • Regular Expressions
  • The text ----> Data structure for the text (suffix tree, ...)
• Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
• Probabilistic search:
  • Hidden Markov Models
2.2 Pairwise alignment
Given two DNA sequences
A (a1a2...an) and B (b1b2...bm) from the alphabet {a,c,t,g}
we say that A* and B* from {a,c,t,g,-} are aligned iff
i) A* and B* become A and B if gaps ( – ) are removed.
ii) |A*|=|B*|
iii) For all i, it is not possible that a*i = b*i = - (no column consists of two gaps).
MALIG (an example)
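Conditions i) to iii) can be checked mechanically. The following is a minimal sketch; the function name `is_alignment` and the constant `GAP` are illustrative choices, not part of the slides.

```python
GAP = "-"  # gap symbol, assumed to be the plain hyphen


def is_alignment(a_star, b_star, a, b):
    """Return True if (a_star, b_star) is an alignment of a and b."""
    # i) removing the gaps must give back the original sequences
    if a_star.replace(GAP, "") != a or b_star.replace(GAP, "") != b:
        return False
    # ii) both gapped rows must have the same length
    if len(a_star) != len(b_star):
        return False
    # iii) no column may consist of two gaps
    return all(not (x == GAP and y == GAP) for x, y in zip(a_star, b_star))


print(is_alignment("ac-t", "a-gt", "act", "agt"))
print(is_alignment("a-ct", "a-gt", "act", "agt"))
```

The second call fails condition iii), because its second column holds a gap in both rows.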
How many alignments of two sequences exist?
Which is the best alignment?
2.2 Number of alignments
Given two DNA sequences
A (a1a2...an) and B (b1b2...bm) there are:
#(a1a2...an, b1b2...bm) =
  #(a1a2...an-1, b1b2...bm)       those that end with (an,-)
+ #(a1a2...an, b1b2...bm-1)       those that end with (-,bm)
+ #(a1a2...an-1, b1b2...bm-1)     those that end with (an,bm)

Filling the table cell by cell, starting from #(a1,b1):

        -    b1    b2    b3
  -     1     1     1     1
  a1    1     3     5     7
  a2    1     5    13    25
  a3    1     7    25    63
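The recurrence can be evaluated directly by filling the table row by row. This is a small sketch; the function name `count_alignments` is an illustrative choice, and the first row and column are set to 1 because a sequence can be aligned with the empty sequence in exactly one way.

```python
def count_alignments(n, m):
    """Return the (n+1) x (m+1) table of alignment counts."""
    # Border cells: only one alignment exists against the empty sequence.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j]        # alignments ending with (ai, -)
                + table[i][j - 1]      # alignments ending with (-, bj)
                + table[i - 1][j - 1]  # alignments ending with (ai, bj)
            )
    return table


for row in count_alignments(3, 3):
    print(row)
```

For n = m = 3 the last cell is 63, which agrees with the table above.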
But what is the asymptotic value?
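2.2 Asymptotic value
As
\[ \#(a_1 a_2 \dots a_n,\; b_1 b_2 \dots b_n) > \sum_{k=0}^{n} \binom{n}{k}\binom{n}{k} = \binom{2n}{n} \]
and
\[ n! \sim n^n e^{-n} \quad \text{(Stirling approximation)} \]
then
\[ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{(2n)^{2n} e^{-2n}}{n^{2n} e^{-2n}} = 2^{2n}, \]
so the number of alignments of two sequences of length n is bounded below by a quantity that grows roughly like 2^{2n}.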
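The bound can be checked numerically for small n. The sketch below recomputes the number of alignments with the recurrence from the previous slides and compares it with the central binomial coefficient and with 2^{2n}; the function name `count_square` is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def count_square(n):
    """Number of alignments of two sequences of length n."""
    row = [1] * (n + 1)              # counts against the empty sequence
    for _ in range(n):
        new_row = [1]
        for j in range(1, n + 1):
            # left + up + diagonal neighbour
            new_row.append(new_row[j - 1] + row[j] + row[j - 1])
        row = new_row
    return row[n]


for n in (1, 2, 3, 5, 10, 20):
    central = comb(2 * n, n)
    # identity used on the slide: sum of squared binomials = C(2n, n)
    assert sum(comb(n, k) ** 2 for k in range(n + 1)) == central
    print(n, count_square(n), central, 4 ** n)
```

Each printed line shows n, the exact number of alignments, the lower bound C(2n, n), and 2^{2n} for comparison.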
2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg
c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg---
[Figure: a row of asterisks beneath the alignment marks the matching columns]
• Match: favorable
• Mismatch: unfavorable
• Gap: worst case
Then we assign a score for each case,
for example 1,-1,-2.
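With one score per case, the score of an alignment is the sum over its columns. A minimal sketch with the example values 1, -1 and -2 as defaults; the function name `score_alignment` is an illustrative choice.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # a gap in either row
        elif x == y:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


print(score_alignment("ac-t", "aggt"))
```

The example has two matches, one mismatch and one gap, so its score is 1 - 1 - 2 + 1 = -1.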
How can the best alignment be found?
2.2 Best alignment
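[Figure: dynamic-programming table with CTACTACTAC GT along the columns and A, C, T, G, A along the rows]
Each cell contains the score of the best alignment of the two prefixes that end at it.
For example, the cell in the second row and fifth column contains the score of the best
alignment of AC and CTACT.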
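A table of this kind can be filled with the standard dynamic-programming recurrence for global alignment (Needleman-Wunsch). The sketch below assumes the example scores 1, -1, -2 and reads the column sequence as CTACTACTACGT, treating the space in the slide as an extraction artifact; the function name `global_score_table` is illustrative.

```python
def global_score_table(a, b, match=1, mismatch=-1, gap=-2):
    """s[i][j] = score of the best alignment of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap          # a[:i] against the empty sequence
    for j in range(1, m + 1):
        s[0][j] = j * gap          # empty sequence against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                s[i - 1][j] + gap,      # align a[i-1] with a gap
                s[i][j - 1] + gap,      # align b[j-1] with a gap
            )
    return s


table = global_score_table("ACTGA", "CTACTACTACGT")
print(table[2][5])  # best alignment of AC and CTACT
```

The bottom-right cell, `table[-1][-1]`, holds the score of the best alignment of the two complete sequences.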
Best alignment
Given the maximum score,
how can the best alignment be found?
[Figure: score table with accaccacaccacaacgagcata … acctgagcgatat along the columns and a, c, c, …, t along the rows]
• Quadratic cost in space and time
• Practical for sequences of up to 10,000 bp in length
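One common way to recover the alignment is to walk back from the bottom-right cell, at each step choosing a neighbour from which the current score could have been obtained. The sketch below assumes the linear scores 1, -1, -2 and returns one optimal alignment (ties are broken in favour of the diagonal); the function name `global_align` is illustrative.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, row_a, row_b) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("AC", "CTACT")
print(score)
print(top)
print(bottom)
```

Both the table and the traceback use the full (n+1) x (m+1) matrix, which is where the quadratic cost in space comes from.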
Download alggen tool
2.2 Some slides revisited
We have developed the theory according to the
following principles:
1) Both sequences have a similar length (global).
2) The gap model is linear: k consecutive gaps
are penalized with k·(-2).
2.2 Semiglobal pairwise alignment
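Assume that we have sequences of different lengths:
[Figure: sequences S1 and S2, drawn with different lengths]
It is meaningless to introduce gaps until both
sequences have similar length.
The most probable alignment should be
[Figure: S1 and S2 aligned, with the unaligned ends labelled "Initial gaps" and "Final gaps"]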
How can these alignments be found?
2.2 Semiglobal pairwise alignment
Note that
[Figure: table with CTACTACTAC GT along the columns and A, C, T along the rows, annotated with the labels "Initial gaps" and "Final gaps"]
2.2 Semiglobal pairwise alignment
Given a cell
[Figure: table with CTACTACTAC GT along the columns and A, C, T along the rows; the cells of the first row are all 0]
The cell contains the score of the best
alignment of CTA with the empty sequence.
2.2 Semiglobal pairwise alignment
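[Figure: table with CTACTACTAC GT along the columns and A, C, T along the rows; first row 0 0 0 0 0 0 0…]
The contribution of the initial gaps is disregarded, then
[Figure: the same table, with 1, 2, 3 written next to the rows A, C, T]
But what happens with the final gaps?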
2.2 Semiglobal pairwise alignment
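[Figure: table with CTACTACTAC GT along the columns and A, C, T along the rows; first row 0 0 0 0 0 0 0…; 1, 2, 3 written next to the rows A, C, T]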
How does the algorithm search for the best alignment?
… by checking the last row for the best score.
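Both ideas, a first row of zeros for the initial gaps and a search over the last row for the final gaps, fit in a few lines. This is a sketch under the assumption that the short sequence indexes the rows, as in the slides, and that the scores are 1, -1, -2; the function name `semiglobal_score` is illustrative.

```python
def semiglobal_score(short, long_seq, match=1, mismatch=-1, gap=-2):
    """Best score of `short` against a region of `long_seq`.

    Returns (score, end_column); gaps before and after the aligned
    region of `long_seq` are not penalized.
    """
    n, m = len(short), len(long_seq)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 stays 0: free initial gaps
    for i in range(1, n + 1):
        s[i][0] = i * gap                        # gaps inside `short` still cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if short[i - 1] == long_seq[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    # Free final gaps: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: s[n][j])
    return s[n][end], end


print(semiglobal_score("ACT", "CTACTACTACGT"))
```

The returned column is the first position of the long sequence at which a best-scoring alignment of the short sequence ends.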
Practice with the alggen tool.
2.2 Affine-gap model score
Given the following alignments
that have the same score …
agtaccccgtag
agt-cc--gta-

agtaccccgtag
agt-c-c-gta-

agtaccccgtag
agt-c--cgta-

agtaccccgtag
agt--cc-gta-

agtaccccgtag
agt--c-cgta-

agtaccccgtag
agt---ccgta-
Which is the most reliable case
from a biological point of view?
2.2 Affine-gap model score
Then, how can we distinguish between
consecutive gaps and separated gaps?
agtaccccgtag
agt--c-cgta-

agtaccccgtag
agt---ccgta-

By penalizing the opening of a gap more heavily than its extension,
for instance with -10 and -0.5.
Then, the penalty of k consecutive gaps becomes
OG + (k-1) EG
which is an affine-gap function.
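The effect of the affine function on the two alignments above can be computed directly. The sketch below uses the example values OG = -10 and EG = -0.5 and only looks at the gaps of a single row; the function names are illustrative.

```python
from itertools import groupby


def affine_gap_penalty(k, opening=-10.0, extension=-0.5):
    """Penalty of k consecutive gaps: OG + (k - 1) * EG."""
    return opening + (k - 1) * extension if k > 0 else 0.0


def total_gap_penalty(row, opening=-10.0, extension=-0.5):
    """Sum the affine penalties of all runs of '-' in one gapped row."""
    total = 0.0
    for symbol, run in groupby(row):
        if symbol == "-":
            total += affine_gap_penalty(len(list(run)), opening, extension)
    return total


print(total_gap_penalty("agt--c-cgta-"))  # three separate runs of gaps
print(total_gap_penalty("agt---ccgta-"))  # two runs of gaps
```

Both rows contain four gaps, but the row with fewer gap openings receives the smaller penalty (-21.0 against -30.5).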
How is the best alignment found?
2.2 Affine-gap model score
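[Figure: dynamic-programming table with CTACTACTAC GT along the columns and A, C, T, G, A along the rows; arrows of two different sizes point into a cell]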
Smallest arrows: refer to the introduction of an opening gap.
Largest arrows: refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?
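A standard way to answer this question is Gotoh's algorithm, which keeps three tables instead of one so that each cell remembers whether the alignment reaching it ends in a gap. Whether the slides use this exact formulation is an assumption; the sketch computes the score only, and the names `M`, `X`, `Y` and `affine_global_score` are illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1,
                        opening=-10.0, extension=-0.5):
    """Best global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # M: last column aligns two letters; X: letter of a over a gap;
    # Y: gap over a letter of b.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = opening + (i - 1) * extension
    for j in range(1, m + 1):
        Y[0][j] = opening + (j - 1) * extension
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i - 1][j] + opening,    # open a new gap
                          Y[i - 1][j] + opening,    # open after a gap in the other row
                          X[i - 1][j] + extension)  # extend the current gap
            Y[i][j] = max(M[i][j - 1] + opening,
                          X[i][j - 1] + opening,
                          Y[i][j - 1] + extension)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("aaaa", "aa"))
```

In this formulation an extension step always comes from the same gap table in the neighbouring cell, while an opening step comes from one of the other two tables.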
2.2 Local alignment
Given two sequences, we can consider the alignments of all
their substrings…
…how can the best of them be found?
Two questions arise:
- how can the alignments be compared?
- how can the best one be selected?
2.2 Local alignment
[Figure: score table with accaccacaccacaacgagcata … acctgagcgatat along the columns and a, c, c, …, t along the rows]
Given a path
Imagine the graph of the scores:
can the best subalignments be detected?
…
It suffices to compare the value of each cell with zero!
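Comparing each cell with zero is the core of the Smith-Waterman local alignment algorithm: a cell is never allowed to drop below 0, so a new subalignment can start anywhere, and the best one ends at the cell with the highest value. A minimal sketch with the scores 1, -1, -2; the function name `local_alignment_score` is illustrative.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, (i, j)) where the best local alignment ends."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # borders are 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                0,                      # restart: drop a negative prefix
                s[i - 1][j - 1] + sub,
                s[i - 1][j] + gap,
                s[i][j - 1] + gap,
            )
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    return best, best_cell


print(local_alignment_score("ttacgtt", "ggacgcc"))
```

Tracing back from the returned cell until a 0 is reached would give the aligned substrings themselves.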