COMBINATORIAL PATTERN MATCHING
CSE/BIMM/BENG 181, SPRING 2010
Tuesday, May 4, 2010
SERGEI L KOSAKOVSKY POND [[email protected]]

OUTLINE: EXACT MATCHING
Tabulating patterns in long texts
  Short patterns (direct indexing)
  Longer patterns (hash tables)
Finding exact patterns in a text
  Brute force (run time)
  Efficient algorithms (pattern preprocessing)
    Single pattern: Knuth-Morris-Pratt
    Multiple patterns: Aho-Corasick algorithm
  Efficient algorithms (text preprocessing)
    Suffix trees
    Burrows-Wheeler Transform-based

OUTLINE: APPROXIMATE MATCHING
Algorithms for approximate pattern matching
Heuristics behind BLAST
Statistics behind BLAST
Alternatives to BLAST: BLAT, PatternHunter, etc.

STRING ENCODING
It is often necessary to index strings; a convenient way to do this is first to convert strings to integers. Given a string s of length n on alphabet A with c = |A| characters (coded 0..c-1), we can define a map code(s) → [0, ∞) as

  code(s) = s[1]·c^(n-1) + s[2]·c^(n-2) + ... + s[n-1]·c + s[n]

There are c^L different L-mers, but at most n-L+1 different L-mers in a text of length n.
Example (A=0, C=1, G=2, T=3):
  code(AGT) = 0·16 + 2·4 + 3 = 11
  code(ATA) = 0·16 + 3·4 + 0 = 12
  code(TGG) = 3·16 + 2·4 + 2 = 58

TABULATING SHORT PATTERNS
If L is small (e.g. 3 or 4), i.e. the total number of patterns is not too large and many of them are likely to be found in the input text, then we can use direct indexing to tabulate/locate strings efficiently.
The distribution of short strings in genetic sequences is biologically informative, e.g.
Synonymous codons (triplets of nucleotides, 64 patterns) are often used preferentially in organisms (transcriptional selection, secondary structure, etc.)
The distribution of short nucleotide k-mers (e.g.
L=4, 256 patterns) can be useful for detecting horizontal (from species to species) gene transfer and for gene finding
The location of short amino-acid strings (e.g. L=3, 8000 patterns) is useful for finding seeds for BLAST

SHORT PATTERN SCAN
Data: Alphabet A, text T, pattern length p
Result: Frequency of each pattern in the text
  R ← array(|A|^p);
  n ← len(T);
  for i := 1 to n-p+1 do
    R[code(T[i : i+p-1])] += 1;
  end
  return R;
Computing code(T[i : i+p-1]) is O(p) naively, but O(1) if each window's code is derived from the previous one.

TABULATING/LOCATING LONGER PATTERNS
Finding repeats/motifs: ATGGTCTAGGTCCTAGTGGTC
  Flanking sequences in genomic rearrangements
  Motifs: promoter regions, functional sites, immune targets
  Cellular immunity targets in pathogens (e.g. protein 9-mers)
There are too many patterns to store in an array, and even if we could, the array would be very sparse.
E.g. there are ~512,000,000,000 amino-acid 9-mers, but an average HIV-1 sequence (~3,000 aa long) contains at most ~3,000 unique 9-mers.

HASH TABLES
Hash tables allow us to efficiently (O(1) on average) store and retrieve a small subset of a large universe of records. Hash tables implement associative arrays (dictionaries) in a variety of languages (Python, Perl, etc.)
The universe (records): e.g.
the 512,000,000,000 amino-acid 9-mers
Hash function: record → hash key
  Note: because there are more records than array indices, this function is NOT one-to-one
The storage: hash table (array) << the size of the universe

A SIMPLE HASH FUNCTION
A reasonable hash function (on integer records i) is: i → i mod P
P is a prime number and also the natural size of the hash table
Hash keys range from 0 to P-1
If the records are uniformly distributed, so will be their hash keys
Example (P = 101):
  4-mer (256 possible) | Integer code | Hash key
  ACGT                 | 27           | 27
  CCCA                 | 148          | 47
  TGCC                 | 229          | 27  <- COLLISION with ACGT

COLLISIONS
Collisions are frequent even for lightly loaded hash tables
  load level α = (number of entries in hash table)/(table size)
The birthday paradox: what is the probability that two people out of a random group of n (<365) people share a birthday (in hash table terms, what is the probability of a collision if people = records and hash keys = birthdays)?

  P(n) = 1 − (1 − 1/365)(1 − 2/365)···(1 − (n−1)/365)

  n  | α     | P(n)
  10 | 0.027 | 0.117
  23 | 0.063 | 0.507
  50 | 0.137 | 0.97

DEALING WITH COLLISIONS
Several strategies exist to deal with collisions; the simplest one is chaining.
Each hash key is associated with a linked list of all records sharing that hash key, e.g. (with P = 101): Hash key 0 → CGCC →
AAAA → ∅; Hash key 1 → AAAC → ∅; Hash key 2 → ∅; ...
  4-mer (256 possible) | Integer code | Hash key
  AAAA                 | 0            | 0
  AAAC                 | 1            | 1
  CGCC                 | 101          | 0

HASH TABLE PERFORMANCE
Retrieving/storing a record in a hash table of size m with load factor α:
Worst case (all records have the same key): O(m)
Expected run time is O(1), assuming uniformly distributed records and hash keys:
  Record is not in the table: E_N = e^(−α) + α + O(1/m)
  Record is in the table: E_S = 1 + α/2 + O(1/m)
This is because the probability of having many collisions with the same key is quite low (even though the probability of SOME collision is high).

EXACT PATTERN MATCHING
Motivation: searching a database for a known pattern
Goal: find all occurrences of a pattern in a text
Input: pattern P = p[1]…p[n] and text T = t[1]…t[m] (n ≤ m)
Output: all positions 1 ≤ i ≤ m − n + 1 such that the n-letter substring T[i : i+n-1] starting at i matches the pattern P
Desired performance: O(n+m)

BRUTE FORCE PATTERN MATCHING
Data: Pattern P, text T
Result: The list of positions in T where P occurs
  n ← len(P);
  m ← len(T);
  for i := 1 to m-n+1 do
    if T[i : i+n-1] = P then output i;
  end
Each substring comparison can take from 1 to n (left-to-right) character comparisons.
Example. Text: GGCATC; pattern: GCAT
  i=1 (2 comparisons): mismatch at the 2nd character
  i=2 (4 comparisons): match
  i=3 (1 comparison): mismatch at the 1st character

BRUTE FORCE RUN TIME
Worst case: O(nm). This can be achieved, for example, by searching for P = AA...C in text T = AA...A, because each substring comparison takes exactly n steps.
Expected on random text: O(1).
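The brute-force scan above can be sketched in a few lines of Python (function and variable names are mine, not from the slides):

```python
def brute_force(pattern, text):
    """Report all 1-based positions where pattern occurs in text."""
    n, m = len(pattern), len(text)
    hits = []
    for i in range(m - n + 1):
        # left-to-right substring comparison; may stop after 1..n characters
        if text[i:i + n] == pattern:
            hits.append(i + 1)
    return hits

print(brute_force("GCAT", "GGCATC"))  # the slide's example: occurrence at position 2
```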
This is because a substring comparison takes on average (1 − q^n)/(1 − q) comparisons, where q = 1/(alphabet size).
For n = 20 and q = 1/4 (nucleotides), a substring comparison will take on average ≈ 4/3 operations. Genetic texts are not random, so the performance may degrade.

IMPROVING THE RUN TIME
The search pattern can be preprocessed in O(n) time to eliminate backtracking in the text and hence guarantee O(n+m) run time.
A variety of procedures, starting with the Knuth-Morris-Pratt algorithm in 1977, take this approach.
They make use of the observation that if a string comparison fails at pattern position i, then we can shift the pattern i−b(i) positions, where b(i) depends only on the pattern, and continue comparing at the same or the next position in the text, thus avoiding backtracking.
[Figure: on a mismatch, the pattern is SHIFTed along the text and the comparison resumes without moving backwards in the text.]
These types of algorithms are popular in text editors/mutable texts, because they do not require preprocessing of the (large) text.

EXACT MULTIPLE PATTERN MATCHING
The problem: given a dictionary of D patterns P1, P2, ..., PD (total length n) and text T, report all occurrences of every pattern in the text.
Arises, for instance, when one is comparing multiple patterns against a database.
Assuming an efficient implementation of individual pattern comparison, this problem can be solved in O(Dm+n) time by scanning the text D times.
Aho and Corasick (1975) showed how this can be done in O(m+n) time.
Uses the idea of a trie (from the word retrieval), or prefix trie.
Intuitively, we can reduce the amount of work by exploiting repetitions in the patterns.
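The prefix trie underlying Aho-Corasick can be sketched as follows: build a trie from the dictionary, then naively thread the text through it starting at each position (the dict-of-dicts representation and the helper names are my own, not from the slides):

```python
def build_trie(patterns):
    """Prefix trie as nested dicts; a '$' key marks a complete word."""
    root = {}
    for word in patterns:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def thread(trie, text, i):
    """Spell text[i:] from the root; collect every word whose terminal we pass."""
    node, found = trie, []
    for ch in text[i:] + "\0":   # sentinel so the final node is checked too
        if "$" in node:
            found.append(node["$"])
        if ch not in node:
            break
        node = node[ch]
    return found

trie = build_trie(["ape", "as", "ease"])
# threading 'appease' from 0-based positions 3 and 4 yields 'ease' and 'as'
print([thread(trie, "appease", i) for i in range(7)])
```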
PREFIX TRIE
Patterns: 'ape', 'as', 'ease'. Constructed in O(n) time, one word at a time.
[Figure: the trie grows as each word is added; nodes are numbered 1-8 in order of creation: root → a(1) → p(2) → e(3); a(1) → s(4); root → e(5) → a(6) → s(7) → e(8).]
Properties of a trie:
  Stores a set of words in a tree
  Each edge is labeled with a letter
  Each node is labeled with a state (order of creation)
  Any two edges sharing a parent node have distinct labels
  Each word can be spelled by tracing a path from the root to a leaf

SEARCHING TEXT FOR MULTIPLE PATTERNS USING A TRIE: THREADING
Suppose we want to search the text 'appease' for occurrences of the patterns 'ape', 'as' and 'ease', given their trie.
The naive way to do it is to thread the text (i.e. spell the word using tree edges from the root) starting at position i, until either:
  A leaf (or specially marked terminal node) is reached (a match has been found)
  Spelling cannot be completed (no match)
[Figure: threading 'appease' into the trie: i=1 gives no match, i=4 matches 'ease', i=5 matches 'as'.]
But we already knew this, because 'as' is a part of 'ease'! If we take advantage of this, there will be no need to backtrack in the text, and the algorithm will run in O(n+m).
The Aho-Corasick algorithm implements exactly this idea using a finite state automaton, starting with the trie and adding shortcuts.

SUFFIX TREES
A trie that is built on every suffix of a text T (length m), and that collapses all interior nodes with a single child, is called a suffix tree.
A very powerful data structure, e.g.
given a suffix tree and a pattern P (length n), all k occurrences of P in T can be found in O(n+k) time, i.e. independently of the size of the text (the text length figures into the construction cost of the tree instead).
A suffix tree can be built in linear time, O(m).

BUILDING A SUFFIX TREE
Example: 'bananas#'. It is convenient to terminate the text with a special character, so that no suffix is a prefix of another suffix (e.g. as in banana). This guarantees that spelling any suffix from the root will end at a leaf.
Construct the suffix tree by inserting suffixes from the longest to the shortest, in two phases per suffix:
  Phase 1: spell as much of the suffix from the root as possible.
  Phase 2: if stopped in the middle of an edge, break the edge, add a new branch, and spell the rest of the suffix along that branch. Label the leaf with the starting position of the suffix.
[Figure: the tree after inserting bananas#, ananas#, nanas#, anas#, nas#, as#, s# and #; leaves are labeled 1-8, and interior nodes N1, N2, N3 appear as edges are broken.]

SUFFIX TREE PROPERTIES
Exactly m leaves for text of size m (counting the terminator).
Each interior node has at least two children (except possibly the root); edges with the same parent spell substrings starting with different letters.
The size of the tree is O(m).
Can be constructed in O(m) time.
  This uses the observation that during construction, not every suffix has to be spelled all the way from the root (which would lead to quadratic time); suffix links can short-circuit the process.
Is also memory efficient (roughly ~5m × sizeof(long) bytes for text without too much difficulty).

MATCHING PATTERNS USING SUFFIX TREES
Consider the problem of finding the pattern 'an' in the text 'bananas#'.
Two matches: positions 2 and 4.
Thread the pattern onto the tree:
  Completely spelled: report the index of every leaf below the point where spelling stopped. This is because the pattern is a prefix of every suffix spelled by traversing the rest of the subtree.
  Incompletely spelled: no match.
Runs in O(n+k) time, where n is the length of the pattern and k is the number of matches.

FINDING LONGEST COMMON SUBSTRINGS USING SUFFIX TREES
Given two texts T and U, find the longest continuous substring that is common to both texts.
Can be done in O(len(T) + len(U)) time:
  Construct a suffix tree on T%U$ (with distinct terminators % and $).
  Find the deepest internal node whose children refer to suffixes starting both in T and in U.
E.g. T = 'ACGT', U = 'TCGA'.
[Figure: generalized suffix tree of ACGT%TCGA$, with leaves labeled by suffix start positions.]

SHORT READ MAPPING
Next generation sequencing (NGS) technologies (454, Solexa, SOLiD) generate gigabases of short (32-500 bp) reads per run.
A fundamental bioinformatics task in NGS analysis is to map all the reads to a reference genome: i.e.
find all the coordinates in the known genome where a given read is located.
ATGGTCTAGGTCCTAGTGGTC
It can take a LONG time to map 15,000,000 reads to a 3-gigabase genome!

BURROWS-WHEELER TRANSFORM BASED MAPPERS
In 1994, Burrows and Wheeler described a lossless text transformation (block sorting), which makes the text easily compressible and is the algorithmic basis of BZIP2.
Surprisingly, this transform is also very useful for finding all instances of a given (short) string in a large text, while using very little memory.
A number of NGS read mappers now use BWT-transformed reference genomes to accelerate mapping by several orders of magnitude.

BWT
Given an input text T = t[1]...t[N], we construct the N left-shift rotations of the input text, sort them lexicographically, and map the input text to the last column of the sorted rotations.
E.g. the input ABRACA is mapped to CARAAB:
  ROTATIONS  SORTED
  ABRACA     AABRAC
  BRACAA     ABRACA
  RACAAB     ACAABR
  ACAABR     BRACAA
  CAABRA     CAABRA
  AABRAC     RACAAB
Note: sorted rotations make it very easy to find all instances of a pattern in the text (this is also the idea behind suffix arrays).

WHY BOTHER?
The text output by BWT tends to contain runs of the same character and to be easily compressible by arithmetic, run-length or Huffman coders, e.g.
[Figure 1 from Burrows & Wheeler (1994): twenty consecutive sorted rotations of a version of their paper, shown together with the final character (L) of each rotation. Because the rotations shown all begin with "n", the final-character column contains long runs of the same few characters, which is what makes the output compressible.]

INVERSE BWT
The beauty of BWT is that knowing only the output and the position of the sorted row that contained the original string, the input can be reconstructed in no worse than O(N log N) time.
Step 1: reconstruct the first column of rotations (F) from the last column (L). To do so, we simply sort the characters in L.
Step 2: determine the mapping of predecessor characters and recover the input character by character, from the last one.
Predecessor characters: shift the rows of the sorted rotation matrix M to the right to obtain M'.
[Figure: M (the sorted rotations of ABRACA) and M' (each row of M cyclically shifted right by one character, i.e. the rotations sorted starting with the 2nd character); F and L mark the first and last columns.]
Both M and M' contain every rotation of the input text T, i.e. they are permutations of the same set of strings.
For each row i in M, the last character (L[i]) is the cyclic predecessor of the first character (F[i]) in the original text.
We wish to define a transformation, Z(i), that maps the i-th row of M' to the corresponding row in M (i.e. its cyclic predecessor), using the following observations:
  M is sorted lexicographically, which implies that all rows of M' beginning with the same character are also sorted lexicographically; for example, rows 1, 3, 4 (all begin with A).
  The row of the i-th occurrence of character 'X' in the last column of M corresponds to the row of the i-th occurrence of character 'X' in the first column of M'.
For ABRACA: Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
In the original string T, the character that preceded the i-th character of the last column L (the BWT output) is L[Z[i]].
INPUT: T = ABRACA; BWT(T) = L = CARAAB
For example, for R (i=2), the predecessor in T is L[Z[2]] = L[5] = B.
For B (i=5), it is L[Z[5]] = L[3] = A.
If we know the position of the last character of T in L, we can "unwind" the input by repeated application of Z. We can use the inverse of Z to generate the input string forwards instead.
Two key papers:
"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome" (Langmead et al.)
"Opportunistic Data Structures with Applications" (Paolo Ferragina and Giovanni Manzini)
Uses BWT and "opportunistic data structures" (i.e. data structures working directly on compressed data) to build a compressed index of a genome.
From the Bowtie abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes.
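The forward transform and the Z-based inversion from the preceding slides can be sketched as follows (the sort-all-rotations construction is for illustration only; real mappers build the index without materializing the rotations):

```python
def bwt(text):
    """Forward BWT: sort all left rotations; return the last column
    and the index of the sorted row holding the original text."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(r[-1] for r in rotations), rotations.index(text)

def inverse_bwt(last, row):
    """Invert using the slide's Z mapping: the i-th occurrence of a
    character in L corresponds to the i-th occurrence in F."""
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # stable sort: F from L
    z = [0] * n
    for f_row, l_row in enumerate(order):
        z[l_row] = f_row          # Z maps a row to its cyclic predecessor's row
    out, i = [], row              # start at the row holding the original text
    for _ in range(n):
        out.append(last[i])       # emit characters from last to first
        i = z[i]
    return "".join(reversed(out))

L, row = bwt("ABRACA")
print(L, row)               # CARAAB 1, as on the slides
print(inverse_bwt(L, row))  # ABRACA
```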
Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source: http://bowtie.cbcb.umd.edu.
From the Ferragina-Manzini abstract: the novelty resides in the careful combination of a compression algorithm, proposed by Burrows and Wheeler, with the structural properties of a well-known indexing tool, the suffix array; space occupancy is decreased when the input is compressible, at no significant slowdown in query performance.
Storage requirements for T = t[1]...t[N] are O(H_k(T)) + o(1) bits per character, where H_k(T) is the k-th order entropy of T (the bound holds for any fixed k).
Searching for the k occurrences of a pattern (length m) can be implemented in time O(m + k log^ε N), for any fixed ε > 0. If the data are uncompressible, the index is no worse than a suffix array in space or query time; on compressible data it improves on it.
Published: 4 March 2009. Genome Biology 2009, 10:R25 (doi:10.1186/gb-2009-10-3-r25). Received: 21 October 2008; Revised: 19 December 2008; Accepted: 4 March 2009. © 2009 Langmead et al.; licensee BioMed Central Ltd.; open access under the Creative Commons Attribution License.
HASHING VS BWT AND OPPORTUNISTIC DATA STRUCTURES
Table 1: Bowtie alignment performance versus SOAP and Maq (Langmead et al., Genome Biology 2009, 10:R25)
  Platform | Program     | CPU time       | Wall clock time | Reads mapped/hour (millions) | Peak virtual memory (MB) | Bowtie speed-up | Reads aligned (%)
  Server   | Bowtie -v 2 | 15 m 7 s       | 15 m 41 s       | 33.8 | 1,149  | -     | 67.4
  Server   | SOAP        | 91 h 57 m 35 s | 91 h 47 m 46 s  | 0.10 | 13,619 | 351×  | 67.3
  PC       | Bowtie      | 16 m 41 s      | 17 m 57 s       | 29.5 | 1,353  | -     | 71.9
  PC       | Maq         | 17 h 46 m 35 s | 17 h 53 m 7 s   | 0.49 | 804    | 59.8× | 74.7
  Server   | Bowtie      | 17 m 58 s      | 18 m 26 s       | 28.8 | 1,353  | -     | 71.9
  Server   | Maq         | 32 h 56 m 53 s | 32 h 58 m 39 s  | 0.27 | 804    | 107×  | 74.7
The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genomes project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions.
To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit.

INEXACT PATTERN MATCHING
Homologous biological sequences are unlikely to match exactly; evolution drives them apart with mutations, for example.
Exact algorithms (e.g. local alignments) are quadratic in time and too slow for comparing/searching large genomic sequences.
Pattern matching with errors is a fundamental problem in bioinformatics: finding homologs in a database. Well-performing heuristics are frequently used.

EXAMPLE: LONGEST COMMON SUBSTRING (LCS) IN INFLUENZA A VIRUS (IAV) H5N1 HEMAGGLUTININ (N=957 FROM 2005+)
[Figure: length of the LCS (0-80) plotted against the proportion of sequences required to contain it (1.0 down to 0.7).]
Suffix trees can be adapted to efficiently find the LCS shared by a given proportion of a set of sequences as well.
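For two sequences, the suffix-tree method described earlier runs in linear time; a much simpler (but quadratic) dynamic-programming sketch illustrates what is being computed (names are mine):

```python
def longest_common_substring(t, u):
    """O(len(t)*len(u)) DP: cur[j] holds the length of the longest common
    suffix of t[:i] and u[:j]; the LCS is the maximum over all (i, j)."""
    best, best_end = 0, 0
    prev = [0] * (len(u) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(u) + 1)
        for j in range(1, len(u) + 1):
            if t[i - 1] == u[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the diagonal run of matches
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return t[best_end - best:best_end]

print(longest_common_substring("ACGT", "TCGA"))  # CG
```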
The longest fully conserved nucleotide substring in viruses sampled in 2005 or later is merely 8 nucleotides long.
This poses significant challenges for even straightforward tasks, such as diagnostic probe design.

K-DIFFERENCES MATCHING
The k-mismatch problem: given a text T (length m), a pattern P (length n) and the maximum tolerable number of mismatches k, output all locations i in T where there are at most k differences between P and T[i : i+n-1].
The k-differences problem: a generalization that can also match characters to indels (cost 1).
Both can be easily solved in O(nm) time, by either brute force or dynamic programming.
Landau and Vishkin (1985) proposed an O(m+nk) time algorithm for the k-differences problem by combining dynamic programming with text and pattern preprocessing using suffix trees of T%P$.

QUERY MATCHING
If the pattern is long (e.g. a new gene sequence), it may be beneficial to look for substrings of the pattern that approximately match the reference (e.g. all genes in GenBank).

QUERY MATCHING
Approximately matching strings share some perfectly matching substrings (L-mers). Instead of searching for approximately matching strings (difficult, quadratic), search for perfectly matching substrings (easy, linear). Extend the obtained perfect matches to longer approximate matches that are locally optimal.
This is the idea behind probably the most important bioinformatics tool: the Basic Local Alignment Search Tool, BLAST (Altschul, Gish, Miller, Myers & Lipman, 1990).
Three primary questions:
  How to select L?
  How to extend the seed?
  How to confirm that the match is biologically relevant?
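The seed-finding step can be sketched with the direct-indexing/hashing machinery from earlier in the lecture: index every L-mer of the database sequence, then look up each L-mer of the query (a toy sketch with my own names; BLAST also scans a neighborhood of each word, which is omitted here):

```python
from collections import defaultdict

def find_seeds(query, db, L):
    """Return (query_pos, db_pos) pairs (1-based) of shared L-mers."""
    index = defaultdict(list)        # L-mer -> positions in db
    for j in range(len(db) - L + 1):
        index[db[j:j + L]].append(j + 1)
    seeds = []
    for i in range(len(query) - L + 1):
        for j in index.get(query[i:i + L], []):
            seeds.append((i + 1, j))
    return seeds

print(find_seeds("GGCATC", "TGCATA", 4))  # [(2, 2)]: the shared 4-mer GCAT
```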
EXAMPLE: BLAST KEYWORDS AND EXTENSION
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words, with neighborhood score threshold T = 13: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13; below threshold: GNK 12, GRK 11, GEK 11, GDK 11
Extension to a High-scoring Pair (HSP):
  Query: 22  VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             +++DN +G + IR L G+K I+ L+ E+ RG++K
  Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

SELECTING SEED SIZE L
If strings X and Y (each of length n) match with k < n mismatches, then the longest perfect match between them has at least ceil(n/(k+1)) characters.
This is easy to show by the following observation: if there are k+1 bins and k objects, then at least one of the bins will be empty. Partition the strings into k+1 equal-length substrings; at least one of them will have no mismatches.
In fact, the longest perfect match is expected to be quite a bit longer (at least if the mismatches are randomly distributed), e.g. about 40 for n = 100, k = 5 (the guaranteed minimum is 17).

SELECTING SEED SIZE L
Smaller L: seeds are easier to find, but performance and, importantly, specificity decrease, since two random sequences are more likely to share a short common substring.
Larger L: could miss many potential matches, leading to decreased sensitivity.
By default, BLAST uses L (w, word size) of 3 for protein sequences and 11 for nucleotide sequences. MEGABLAST (a faster version of BLAST for similar sequences) uses longer seeds.

HOW TO EXTEND THE MATCH?
Gapped local alignment (blastn)
Simple (gapless) extension (original BLAST)
Greedy X-drop alignment (MEGABLAST)
...
A tradeoff between speed and accuracy.

HOW TO SCORE MATCHES?

Biological sequences are not random: some letters are more frequent than others (e.g. in HIV-1, 40% of the genome is A), and some mismatches are more common than others in homologous sequences (e.g. due to selection, chemical properties of the residues, etc.), so they should be weighted differently.

BLAST introduces a weighting function on residues, δ(i,j), which assigns a score to a pair of residues. For nucleotides it is 5 for i=j and -4 otherwise. For proteins it is based on a large training dataset of homologous sequences (Point Accepted Mutation, PAM, matrices). PAM120 is roughly equivalent to the substitutions accumulated over 120 million years of evolution in an average protein.

[Figure: two amino-acid substitution score matrices, rows and columns A R N D C Q E G H I L K M F P S T W Y V; the second is an HIV-WITHIN matrix]

HOW TO COMPUTE SIGNIFICANCE?

Before a search is done, we need to decide what a good cutoff value H for a match is. It is determined by computing the probability that two random sequences will have at least one match scoring H or greater.
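The nucleotide scoring scheme above, together with the Karlin-Altschul parameter λ that drives the significance statistics on the following slides, can be sketched in Python. The uniform base frequencies and the bisection bracket are simplifying assumptions for illustration; BLAST derives λ (and K) from the actual scoring system and letter frequencies:

```python
import math

def delta(a, b, match=5, mismatch=-4):
    """BLAST's default nucleotide scores: +5 for identity, -4 otherwise."""
    return match if a == b else mismatch

def segment_score(t1, t2):
    """Score s(H) of a gapless segment pair: sum of delta over columns."""
    assert len(t1) == len(t2)
    return sum(delta(a, b) for a, b in zip(t1, t2))

def karlin_lambda(match=5, mismatch=-4, p=0.25, tol=1e-12):
    """Solve sum_{i,j} p_i q_j exp(lambda * delta(i,j)) = 1 for lambda > 0
    by bisection, assuming uniform base frequencies p_i = q_j = 1/4.

    Over a 4-letter alphabet there are 4 matching and 12 mismatching
    letter pairs, each pair occurring with probability p*p."""
    def f(lam):
        return (4 * p * p * math.exp(lam * match)
                + 12 * p * p * math.exp(lam * mismatch)) - 1.0
    lo, hi = 1e-9, 10.0  # f(lo) < 0: the expected per-column score is negative
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

print(segment_score("ACGTACGT", "ACGAACGT"))  # 7 matches, 1 mismatch: 31
print(round(karlin_lambda(), 3))              # about 0.192 for +5/-4
```

With λ in hand, the mean maximal score between random sequences of lengths n and m is approximately log(nm)/λ, and the expected number of HSPs above a raw score S′ is E = K·m·n·exp(−λS′), as the statistics slides describe.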
Uses Altschul-Dembo-Karlin statistics (1990-1991).

STATISTICS OF SCORES

Given a segment pair H between two sequences, comprised of r-character substrings T1 and T2, we compute the score of H as:

s(H) = Σ_{i=1}^{r} δ(T1[i], T2[i])

We are interested in how likely the maximal score over all segment pairs of two random sequences is to exceed some threshold X. Dembo and Karlin (1990) showed that the mean value of the maximum score between segment pairs of two random sequences (lengths n and m), assuming a few things about δ(i,j), is approximately

M ≈ log(nm)/λ*, where λ* solves Σ_{i,j} p_i q_j exp(λ* δ(i,j)) = 1

STATISTICS OF SCORES (CONT'D)

For biological sequences, high-scoring real matches should greatly exceed the random expectation, and the probability that this happens (x is the excess of the score over the mean M) is

Prob{s(H) > x + M} ≤ K* exp(−λ* x)

K* and λ* are expressions that depend on the scoring matrix and letter frequencies, and the distribution is similar to other extreme value distributions. One can show that the expected number of HSPs (high-scoring segment pairs) exceeding a threshold S′ is

E = K m n exp(−λ S′)

[Figure: simulation based on random and mutated sequences of length 2^17. First, the mean HSP score grows linearly with log(mn). Second, the number of scores exceeding the mean follows a Poisson distribution: the number of replicates scoring x points above the mean decays exponentially in x = score − M.]
E-VALUES

Because thresholds are determined by the algorithm internally, it is better to 'normalize' the result as follows:

Bit score: S = (λS′ − ln K) / ln 2
E-value: E = nm·2^{−S}
Poisson distribution for the number k of HSPs with scores ≥ S′: exp(−E)·E^k / k!
Probability of finding at least one: 1 − exp(−E)

TIMELINE

1970: Needleman-Wunsch global alignment algorithm
1981: Smith-Waterman local alignment algorithm
1985: FASTA
1990: BLAST (Basic Local Alignment Search Tool)
2000s: BLAST has become too slow for "genome vs. genome" comparisons, and new, faster algorithms evolve: BLAT, PatternHunter

BLAT VS. BLAST

BLAT (BLAST-Like Alignment Tool) uses the same idea as BLAST: locate short sequence hits and extend them (developed by J. Kent at UCSC). BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database. The index is stored in RAM, resulting in faster searches. Longer k-mers and greedier extensions are specifically designed for highly similar sequences (e.g. >95% nucleotide identity, >85% protein identity).

BLAT INDEXING

Here is an example with k = 3:
Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc
Index (3-mer → genome positions; multiple instances map to a single index entry!): aat 3; gac 12; cac 0, 9; tat 6;
cgc 15

cDNA (query sequence): aattctcac
3-mers (overlapping): aat att ttc tct ctc tca cac, at query positions 0 1 2 3 4 5 6
Hits (query position, genome position): aat (0, 3); cac (6, 0) and (6, 9)
Clump: cacAATtatCACgaccgc
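The indexing scheme from this example can be sketched as follows (a simplified illustration of BLAT's idea, not its actual implementation): non-overlapping k-mers of the database go into the index, and overlapping k-mers of the query are looked up against it.

```python
def build_index(genome, k=3):
    """Index the non-overlapping k-mers of the database, as BLAT does:
    maps each k-mer to a list of 0-based genome positions."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def find_hits(index, query, k=3):
    """Scan the overlapping k-mers of the query against the index,
    returning (query_position, genome_position) hit pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits

# The example from the slide:
genome = "cacaattatcacgaccgc"
index = build_index(genome)
print(index)                            # 'cac' maps to two positions: 0 and 9
print(find_hits(index, "aattctcac"))    # [(0, 3), (6, 0), (6, 9)]
```

Hits that fall on nearby diagonals (genome position minus query position) are then grouped into clumps such as cacAATtatCACgaccgc and extended into alignments.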