Sequence Comparison Algorithms - Department of Computer Science

Sequence Comparison Algorithms
Ellen Walker
Bioinformatics
Hiram College
The Problem
• We have two sequences that we want
to compare, based on edit distance
• Edit distance = minimum number of changes
needed to get from one string to the other
– Insertions
– Deletions
– Changes (replace one character with another)
Example
• LOVE => MONEY
• 1. Replace L by M
• 2. Replace V by N
• 3. Add Y at the end
• Resulting alignment (3 edits):
LOVE–
MONEY
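The definition can be checked with a short recursive sketch. The function name `edit_distance` and the use of memoization are choices made here for illustration, not part of the slides.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and changes turning a into b."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j  # insert the rest of b
        if j == len(b):
            return len(a) - i  # delete the rest of a
        change = 0 if a[i] == b[j] else 1
        return min(
            d(i + 1, j) + 1,           # delete a[i]
            d(i, j + 1) + 1,           # insert b[j]
            d(i + 1, j + 1) + change,  # keep or change a[i]
        )

    return d(0, 0)


print(edit_distance("LOVE", "MONEY"))  # 3
```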
Brute Force Solution
• Try all possible alignments between the
strings
• Looking at one string,
– Every possible shift (space before or after)
– Every possible gap (space within)
– Gaps of various lengths, bounded by the
size of the longest string
How many possibilities are there?
• Consider only single insertions:
• _ M _ O _ N _ E _ Y _
– There are N+1 places to insert, where N is
the length of the string
• At each place you have 2 choices
(insert or not)
– Therefore, just this subset already has 2^(N+1) possibilities
– So, brute force is exponential!
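To make the count concrete, the sketch below enumerates every way of placing at most one gap in each of the N+1 slots of MONEY. The helper name `insertion_patterns` is made up for this illustration.

```python
from itertools import product


def insertion_patterns(s: str):
    """Yield every string obtained by putting at most one '_' in each slot of s."""
    slots = len(s) + 1  # before each character, plus one after the last
    for choice in product([False, True], repeat=slots):
        out = []
        for k, ch in enumerate(s):
            if choice[k]:
                out.append("_")
            out.append(ch)
        if choice[-1]:
            out.append("_")
        yield "".join(out)


patterns = list(insertion_patterns("MONEY"))
print(len(patterns))  # 2 ** (5 + 1) = 64
```

Adding one character to the string doubles the count, which is why brute force stops being practical very quickly.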
Dynamic Programming
• Score possibilities in an alignment
matrix
• Value of any square in the matrix
depends on:
– Value above (if “vertical gap”)
– Value beside (if “horizontal gap”)
– Value diagonally above (if match or
mismatch)
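The three dependencies can be written as one recurrence. The symbols are assumptions introduced here: F(i,j) is the best score for aligning the first i characters of one string with the first j of the other, s(a_i, b_j) is the match or mismatch score, and g is the (negative) gap score.

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j) + g & \text{(vertical gap)} \\
F(i,\,j-1) + g & \text{(horizontal gap)} \\
F(i-1,\,j-1) + s(a_i, b_j) & \text{(match or mismatch)}
\end{cases}
\qquad F(i,0) = i\,g, \quad F(0,j) = j\,g
```

For edit distance the same recurrence applies with costs instead of scores and min in place of max.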
Global Alignment Matrix
(LOVE down the side, MONEY across the top; scores consistent with match +1, mismatch -1, gap -1)
          M    O    N    E    Y
     0   -1   -2   -3   -4   -5
L   -1   -1   -2   -3   -4   -5
O   -2   -2    0   -1   -2   -3
V   -3   -3   -1   -1   -2   -3
E   -4   -4   -2   -2    0   -1
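A global alignment matrix for LOVE and MONEY can be generated with a short function. This is a sketch in the style of Needleman-Wunsch; the scoring (match +1, mismatch -1, gap -1) and the name `global_matrix` are assumptions, since the slides do not state the scoring scheme.

```python
def global_matrix(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Fill the global alignment score matrix (scoring values are assumed)."""
    F = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        F[i][0] = i * gap  # first column: only vertical gaps
    for j in range(1, len(cols) + 1):
        F[0][j] = j * gap  # first row: only horizontal gaps
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            diag = F[i - 1][j - 1] + (match if rows[i - 1] == cols[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


for label, row in zip("-LOVE", global_matrix("LOVE", "MONEY")):
    print(label, " ".join(f"{v:3d}" for v in row))
```

The bottom-right cell holds the score of the best global alignment.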
Local Alignment Matrix
          M    O    N    E    Y
     0    0    0    0    0    0
L    0    0    0    0    0    0
O    0    0    1    0    0    0
V    0    0    0    0    0    0
E    0    0    0    0    1    0
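The local matrix differs from the global one in two ways: the borders are 0, and no cell is allowed to go below 0. A minimal Smith-Waterman style sketch, again assuming match +1, mismatch -1, gap -1 (the name `local_matrix` is hypothetical):

```python
def local_matrix(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (matrix, best score, cell of best score) for local alignment."""
    H = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            diag = H[i - 1][j - 1] + (match if rows[i - 1] == cols[j - 1] else mismatch)
            # The extra 0 lets an alignment start fresh at any cell.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell


H, best, best_cell = local_matrix("LOVE", "MONEY")
print(best, best_cell)  # 1 (2, 2): the single matching O
```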
Computing the Alignment Matrix
• For each square:
– Take the best of vertical gap, horizontal
gap, (mis)match score (minimum for edit distance,
maximum for alignment scores): O(1)
• There are N*M squares, where N and M
are the lengths of the strings
• Therefore, time and space are both
O(N*M) or (for short) O(N^2)
But, what is N?
• If we’re matching genomes, N is huge!
• N^2 is too much time and space!
• How can we save further?
Ordering the Computations
• Each cell can be computed when the
ones above, diagonally above, and to
the left are computed
– Left-to-right, top to bottom (row major)
– Top-to-bottom, left to right (column major)
– Across a diagonal wavefront
Saving Space: Row Major
• A row major computation really only needs
two rows (the one above, and the current
row).
• After each computation, the current row
becomes the row above
• Savings: space is O(N) instead of O(N^2)
• Cost: insufficient information for traceback
– Do a new alignment, limited to a region around the
result.
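A sketch of the two-row idea for the global score. The scoring values and the name `global_score_two_rows` are assumptions; only the final score is returned, which reflects the traceback cost noted above.

```python
def global_score_two_rows(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score using O(len(cols)) space instead of a full matrix."""
    prev = [j * gap for j in range(len(cols) + 1)]  # row above
    for i in range(1, len(rows) + 1):
        curr = [i * gap] + [0] * len(cols)  # current row
        for j in range(1, len(cols) + 1):
            diag = prev[j - 1] + (match if rows[i - 1] == cols[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the current row becomes the row above
    return prev[-1]


print(global_score_two_rows("LOVE", "MONEY"))  # -1
```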
Saving Time: Wavefront
• Use a parallel processor (effectively N
machines at a time)
• Each reverse diagonal is computed at once
• Time is now O(N), but cost is N processors
instead of 1
• A computer science theoretician would say “no
savings” (the total work is still O(N^2)), but if you’re
the one waiting, you might disagree!
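The wavefront order can be sketched without real parallel hardware: cells with the same i + j lie on one reverse diagonal and do not depend on each other, so each list below could be handed to separate processors. The function names and scoring values are assumptions for illustration; this version still runs on one processor.

```python
def wavefronts(n_rows: int, n_cols: int):
    """Yield lists of (i, j) cells; each list is one reverse diagonal (i + j = d)."""
    for d in range(2, n_rows + n_cols + 1):
        lo = max(1, d - n_cols)
        hi = min(n_rows, d - 1)
        yield [(i, d - i) for i in range(lo, hi + 1)]


def global_matrix_by_wavefront(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    F = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(len(rows) + 1):
        F[i][0] = i * gap
    for j in range(len(cols) + 1):
        F[0][j] = j * gap
    for front in wavefronts(len(rows), len(cols)):
        # Every cell in `front` needs only cells from the two previous fronts,
        # so the cells of one front are independent of each other.
        for i, j in front:
            diag = F[i - 1][j - 1] + (match if rows[i - 1] == cols[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


fronts = list(wavefronts(4, 5))
print(len(fronts), max(len(f) for f in fronts))  # 8 fronts, at most 4 cells each
```

With N processors the number of steps is the number of fronts (N + M - 1) rather than the number of cells (N * M).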
Saving More Time: Partial Search
• In local alignment, large areas have 0’s.
• Mismatches adjacent to 0’s are also 0’s.
• To get “reasonably large” values, you
need longer sequences (BLAST
“words”) in common
• So, only search near where there are
common subsequences
Finding Common Subsequences
• Pick a sequence length.
• For each subsequence of that length,
find all occurrences in each sequence
• If i is the index in one sequence and j is
the index in the other sequence, then fill
in the region of the alignment matrix
near (i, j)
• (i, j) is called the seed
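A minimal sketch of seed finding with exact matches of a fixed length k. The name `find_seeds` and the dictionary-based index are illustrative choices, not BLAST's actual data structures.

```python
from collections import defaultdict


def find_seeds(a: str, b: str, k: int):
    """Return sorted (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)  # every length-k word of a, with its positions
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


print(find_seeds("ACGTACGT", "TTACGA", 3))  # [(0, 2), (3, 1), (4, 2)]
```

Only the regions of the alignment matrix near these (i, j) pairs would then be filled in.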
BLAST’s Generalization
• Consider a threshold T and a sequence S
• The neighborhood of the sequence S is
all sequences that score T or better
against S
• BLAST uses neighborhoods to set
seeds (areas of the alignment matrix to
explore)
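A neighborhood can be enumerated by brute force for short words. The scoring function `toy_score` (+2 for a match, -1 for a mismatch on a DNA alphabet) is a hypothetical stand-in chosen for illustration; it is not BLOSUM62 or any scoring matrix used by BLAST.

```python
from itertools import product


def toy_score(x: str, y: str) -> int:
    """Hypothetical per-character score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def neighborhood(word: str, threshold: int, score=toy_score, alphabet: str = "ACGT"):
    """All words of the same length that score at least `threshold` against `word`."""
    def total(candidate):
        return sum(score(x, y) for x, y in zip(word, candidate))

    return ["".join(c) for c in product(alphabet, repeat=len(word)) if total(c) >= threshold]


print(len(neighborhood("ACG", 6)))  # 1: only ACG itself reaches 6
print(len(neighborhood("ACG", 3)))  # 10: ACG plus the 9 words with one mismatch
```

Lowering the threshold enlarges the neighborhood, which produces more seeds and more of the matrix to explore.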
Consequences of Choices
• Higher values of T are faster, but ignore more
potential matches
• Longer sequences are less common
– Smaller neighborhoods for a given T
– Fewer areas to search
– Greater likelihood of missing good alignments
T vs Sequence Size
• Longer sequences have higher
maximum scores (unless normalized)
• But, longer sequences (tend to) have
more likelihood of mismatches?
Too Many Seeds
• If we pick a sequence length and
threshold that are sufficiently sensitive,
we still might have too many seeds for
reasonable alignment times.
• Two-seed solution:
– Only consider areas of the table that
contain two seeds on the same diagonal, separated
by a limited distance
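The two-seed filter can be sketched by grouping seeds by diagonal (i - j) and keeping neighboring pairs that are close together. The name `two_hit_seeds` and the `max_distance` parameter are assumptions; real BLAST adds further conditions (for example, that the two hits do not overlap), which are omitted here.

```python
from collections import defaultdict


def two_hit_seeds(seeds, max_distance: int):
    """Return pairs of seeds on the same diagonal at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for i, j in seeds:
        by_diagonal[i - j].append(i)
    kept = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        # Compare each seed with the next one on the same diagonal.
        for left, right in zip(positions, positions[1:]):
            if 0 < right - left <= max_distance:
                kept.append(((left, left - diag), (right, right - diag)))
    return sorted(kept)


seeds = [(1, 1), (3, 3), (10, 2), (40, 40)]
print(two_hit_seeds(seeds, max_distance=5))  # [((1, 1), (3, 3))]
```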
Extending Alignments
• A seed region is a small alignment
• We want to “grow” the alignments
(especially if we can connect to
others!)
• To grow an alignment, use Smith-Waterman
to compute neighboring values
• Question: when to stop growing?
Score Changes During Growing
• As an alignment is extended, its score
changes
– Score increases when sub-matches
connect
– Score decreases when extended into
unrelated area
• Often score must decrease before
increasing!
When to Stop?
• Consider current score, compared to
maximum score so far
• When the current score gets sufficiently small
relative to the maximum, then stop
• This is another parameter with a tradeoff
(stop too soon and get smaller results, stop
too late and do useless work)
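The stopping rule can be sketched on a simplified case. Instead of Smith-Waterman, the example below extends a seed to the right along its diagonal without gaps, and stops once the score has fallen more than `x_drop` below the best score seen. The names `extend_right` and `x_drop`, and the +1/-1 scoring, are assumptions made for this illustration.

```python
def extend_right(a: str, b: str, i: int, j: int, k: int, x_drop: int,
                 match: int = 1, mismatch: int = -1):
    """Extend the exact length-k seed a[i:i+k] == b[j:j+k]; return (best score, best length)."""
    score = best = k * match
    length = best_len = k
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fallen too far below the maximum: stop growing
    return best, best_len


a = "ACGTACGTTTTT"
b = "ACGTTCGTAAAA"
print(extend_right(a, b, 0, 0, 4, x_drop=2))  # (6, 8): survives one mismatch, then grows
print(extend_right(a, b, 0, 0, 4, x_drop=0))  # (4, 4): stops at the first mismatch
```

The two calls show the tradeoff: a tolerant setting lets the score dip before it rises again, while a strict one stops too soon and reports the smaller result.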
One more “trick”
• Suppose that there is a “standard”
sequence that many people want to
align against
• Run the seeding algorithm with different
sequence lengths and thresholds and
save the resulting seed locations
• When someone does a search, the
seeding part has already been done
Offline vs. Online Algorithms
• Offline Algorithms
– Execute “standardized” part of algorithm in
advance, and save result
– This is like compilation of a program
• Online Algorithms
– Use the tables or databases you built offline to
answer a specific query
– This is like running a program
– User sees only time taken by Online Algorithm
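The split can be sketched for seeding against a standard sequence. The offline step indexes the reference once for several word lengths; the online step only performs lookups. The function names and the in-memory dictionary are assumptions; a real system would store the index on disk or in a database.

```python
from collections import defaultdict


def build_indexes(reference: str, word_lengths):
    """Offline: for each word length k, map every length-k word to its positions."""
    indexes = {}
    for k in word_lengths:
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        indexes[k] = dict(index)
    return indexes


def query_seeds(indexes, query: str, k: int):
    """Online: find (reference position, query position) seeds by lookup only."""
    index = indexes[k]
    return [(i, j)
            for j in range(len(query) - k + 1)
            for i in index.get(query[j:j + k], [])]


indexes = build_indexes("ACGTACGT", word_lengths=[2, 3])  # done once, in advance
print(query_seeds(indexes, "TTACGA", 3))  # [(3, 1), (0, 2), (4, 2)]
```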
Common Offline/Online Applications
• Web searching
– Offline: build indexes of sites vs. keywords
– Online: retrieve sites from the index
• Neural networks
– Offline: train the network on many
examples of the problem, set the weights
– Online: run the network once (with fixed
weights) on the specific example
Summary
• Smith-Waterman is exact, accurate, and
time-consuming (even though it uses
dynamic programming to get down to
O(N^2))
• BLAST speeds up the search process,
but is no longer exact, so it can miss
good alignments (even the best one!)
Using BLAST Well
• Importance of setting parameters
– Sequence length
– Score threshold
– Distance (for two-hit method)
– Stopping condition (for growing seeded
alignments)
Exercises
• Given the BLOSUM62 matrix at
http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
– What is the neighborhood of HID with
threshold 5? 10? 15?
• Create two random sequences of 20
bases each (flip two coins for each
base: HH=A, TT=T, HT=C, TH=G)