Sequence Comparison Algorithms
Ellen Walker
Bioinformatics
Hiram College

The Problem
• We have two sequences that we want to compare, based on edit distance
• Edit distance = the minimum number of changes needed to get from one string to the other
– Insertions
– Deletions
– Changes (substitutions)

Example
• LOVE => MONEY
• 1. Replace L by M
• 2. Replace V by N
• 3. Add Y at the end
LOVE–
MONEY

Brute Force Solution
• Try all possible alignments between the strings
• Looking at one string,
– Every possible shift (space before or after)
– Every possible gap (space within)
– Gaps of various lengths, bounded by the size of the longest string

How many possibilities are there?
• Consider only single insertions:
• _ M _ O _ N _ E _ Y _
– There are N+1 places to insert, where N is the length of the string
• At each place you have 2 choices (insert or not)
– Therefore, just this subset already has 2^(N+1) possibilities
– So, brute force is exponential!

Dynamic Programming
• Score possibilities in an alignment matrix
• Value of any square in the matrix depends on:
– Value above (if “vertical gap”)
– Value beside (if “horizontal gap”)
– Value diagonally above (if match or mismatch)

Global Alignment Matrix
[Figure: global alignment matrix for LOVE (rows) vs. MONEY (columns); top row initialized to 0, -1, -2, -3, -4, -5 and left column to 0, -1, -2, -3, -4, with arrows marking the traceback.]

Local Alignment Matrix
[Figure: local alignment matrix for LOVE (rows) vs. MONEY (columns); top row and left column are all 0.]

Computing the Alignment Matrix
• For each square:
– Take the best of vertical gap, horizontal gap, and (mis)match score (the minimum when counting edit costs, the maximum when using negative scores like those in the matrices above): O(1)
• There are N*M squares, where N and M are the lengths of the strings
• Therefore, time and space are both O(N*M) or (for short) O(N^2)

But, what is N?
• If we’re matching genomes, N is huge!
• N^2 is too much time and space!
• How can we save further?
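The matrix fill described in the Dynamic Programming slides can be sketched in a few lines of Python. This is only an illustration under assumed scoring (match = 0, mismatch = -1, gap = -1, so the final score is the negated edit distance); the function name `global_alignment_matrix` and its parameters are hypothetical and do not come from the slides.

```python
def global_alignment_matrix(a, b, match=0, mismatch=-1, gap=-1):
    """Fill a global alignment score matrix for strings a (rows) and b (columns).

    Scoring values are assumptions for illustration: with match=0 and
    mismatch=gap=-1, the bottom-right cell is minus the edit distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Row-major fill: each cell needs the cells above, beside, and diagonal.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical_gap = score[i - 1][j] + gap
            horizontal_gap = score[i][j - 1] + gap
            # Scores are negative costs here, so the best choice is the maximum.
            score[i][j] = max(diagonal, vertical_gap, horizontal_gap)
    return score


matrix = global_alignment_matrix("LOVE", "MONEY")
print(matrix[-1][-1])  # -3, matching the three edits in the LOVE => MONEY example
```

Each cell takes constant time, so the loops make the O(N*M) time and space cost visible directly.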
Ordering the Computations
• Each cell can be computed once the cells above, diagonally above, and to the left have been computed
– Left-to-right, top-to-bottom (row major)
– Top-to-bottom, left-to-right (column major)
– Across a diagonal wavefront

Saving Space: Row Major
• A row major computation really only needs two rows (the one above, and the current row).
• After each row is computed, the current row becomes the row above
• Savings: space is O(N) instead of O(N^2)
• Cost: Insufficient information for traceback
– Do a new alignment, limited to a region around the result.

Saving Time: Wavefront
• Use a parallel processor (effectively N machines at a time)
• All cells on a reverse diagonal (anti-diagonal) are computed at once
• Time is now O(N), but cost is N processors instead of 1
• A computer science theoretician would say “no savings”, but if you’re the one waiting, you might disagree!

Saving More Time: Partial Search
• In local alignment, large areas have 0’s.
• Mismatches adjacent to 0’s are also 0’s.
• To get “reasonably large” values, you need longer sequences (BLAST “words”) in common
• So, only search near where there are common subsequences

Finding Common Subsequences
• Pick a sequence length.
• For each subsequence of that length, find all occurrences in each sequence
• If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j)
• (i, j) is called the seed

BLAST’s Generalization
• Consider a threshold T and a sequence S
• The neighborhood of the sequence S is all sequences that score T or better against S
• BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore)

Consequences of Choices
• Higher T’s are faster, but ignore more potential matches
• Longer sequences are less common
– Smaller neighborhoods for a given T
– Fewer areas to search
– More likelihood of missing good alignments

T vs Sequence Size
• Longer sequences have higher maximum scores (unless normalized)
• But, longer sequences (tend to) have more likelihood of mismatches?

Too Many Seeds
• If we pick a sequence length and threshold that are sufficiently sensitive, we still might have too many seeds for reasonable alignment times.
• Two-seed solution:
– Only consider areas of the table that contain two seeds (diagonals) separated by a limited distance

Extending Alignments
• A seed region is a small alignment
• We want to “grow” the alignments (especially if we can connect to others!)
• To grow an alignment, use Smith-Waterman to compute neighboring values
• Question: when to stop growing?

Score Changes During Growing
• As an alignment is extended, its score changes
– Score increases when sub-matches connect
– Score decreases when extended into an unrelated area
• Often the score must decrease before increasing!

When to Stop?
• Consider the current score, compared to the maximum score so far
• When the current score gets sufficiently small relative to the maximum, then stop
• This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work)

One more “trick”
• Suppose that there is a “standard” sequence that many people want to align against
• Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations
• When someone does a search, the seeding part has already been done

Offline vs. Online Algorithms
• Offline Algorithms
– Execute “standardized” part of algorithm in advance, and save result
– This is like compilation of a program
• Online Algorithm
– Use the tables or databases you built offline to answer a specific query
– This is like running a program
– User sees only time taken by Online Algorithm

Common Offline/Online Applications
• Web searching
– Offline: build indexes of sites vs. keywords
– Online: retrieve sites from the index
• Neural networks
– Offline: train the network on many examples of the problem, set the weights
– Online: run the network once (with fixed weights) on the specific example

Summary
• Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2))
• BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!)

Using BLAST Well
• Importance of setting parameters
– Sequence length
– Score threshold
– Distance (for two-hit method)
– Stopping condition (for growing seeded alignments)

Exercises
• Given the BLOSUM62 matrix at http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
– What is the neighborhood of HID with threshold 5? 10? 15?
• Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)
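For readers who would rather simulate the coin flips, the following Python sketch generates two random 20-base sequences with the mapping from the last exercise, then finds seeds between them using the offline/online split from the earlier slides. It is a simplification, not BLAST itself: it seeds on exact word matches only (no neighborhoods or threshold T), and the names `random_sequence`, `build_word_index`, and `find_seeds`, along with the word length of 3, are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Coin-flip mapping taken from the exercise: two flips per base.
COIN_TO_BASE = {"HH": "A", "TT": "T", "HT": "C", "TH": "G"}


def random_sequence(length, rng):
    """Simulate two coin flips per base and translate them to A/C/G/T."""
    return "".join(
        COIN_TO_BASE[rng.choice("HT") + rng.choice("HT")] for _ in range(length)
    )


def build_word_index(sequence, word_length):
    """Offline step: map every word of the given length to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_length + 1):
        index[sequence[i:i + word_length]].append(i)
    return index


def find_seeds(index, query, word_length):
    """Online step: list (i, j) seeds where the same word starts at position i
    in the indexed sequence and position j in the query (exact matches only)."""
    seeds = []
    for j in range(len(query) - word_length + 1):
        for i in index.get(query[j:j + word_length], []):
            seeds.append((i, j))
    return seeds


rng = random.Random(0)  # fixed seed so repeated runs give the same sequences
subject = random_sequence(20, rng)
query = random_sequence(20, rng)

index = build_word_index(subject, word_length=3)  # done once, in advance
print(subject)
print(query)
print(find_seeds(index, query, word_length=3))    # done per query
```

Only the cells of the alignment matrix near each returned (i, j) pair would then need to be filled in. A longer word length produces fewer seeds and less work, at the risk of missing alignments.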