Combinatorial Pattern Matching
CSE/BIMM/BENG 181, SPRING 2010
Tuesday, May 4, 2010
SERGEI L KOSAKOVSKY POND [[email protected]]
Outline: Exact Matching
Tabulating patterns in long texts
Short patterns (direct indexing)
Longer patterns (hash tables)
Finding exact patterns in a text
Brute force (run time)
Efficient algorithms (pattern preprocessing)
Single pattern: Knuth-Morris-Pratt
Multiple patterns: Aho-Corasick algorithm
Efficient algorithms (text preprocessing)
Suffix trees
Burrows Wheeler Transform-based
Outline: Approximate Matching
Algorithms for approximate pattern matching
Heuristics behind BLAST
Statistics behind BLAST
Alternatives to BLAST: BLAT, PatternHunter etc.
String Encoding
It is often necessary to index strings; a convenient way to do this is
first to convert strings to integers.
Given a string s of length n on an alphabet A (characters coded 0..c−1, with c = |A|), we can define a map code(s) → [0, ∞) as

code(s) = s[1]·c^(n−1) + s[2]·c^(n−2) + ... + s[n−1]·c + s[n]

There are c^L different L-mers, but at most n − L + 1 different L-mers in a text of length n
Nucleotide encoding: A = 0, C = 1, G = 2, T = 3 (c = 4). Examples:
AGT: 0×16 + 2×4 + 3 = 11
ATA: 0×16 + 3×4 + 0 = 12
TGG: 3×16 + 2×4 + 2 = 58
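The encoding above can be sketched directly; the alphabet mapping follows the slide's A=0, C=1, G=2, T=3 convention, and Horner's rule evaluates the same polynomial.

```python
# A minimal sketch of the base-c string encoding described above.
DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def code(s, alphabet=DNA):
    c = len(alphabet)  # alphabet size (c = 4 for nucleotides)
    value = 0
    for ch in s:  # Horner's rule: value = s[1]*c^(n-1) + ... + s[n]
        value = value * c + alphabet[ch]
    return value
```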
Tabulating Short Patterns
If L is small (e.g. 3 or 4), i.e. the total number of patterns is not too large and many of them are likely to be found in the input text, then we can use direct indexing to tabulate/locate strings efficiently
The distribution of short strings in genetic sequences is biologically
informative, e.g.
Synonymous codons (triplets of nucleotides, 64 patterns) are often used
preferentially in organisms (transcriptional selection, secondary structure, etc)
The distribution of short nucleotide k-mers (e.g. L=4, 256 patterns) can be
useful for detecting horizontal (from species to species) gene transfer and gene
finding
The location of short amino-acid strings (e.g. L=3, 8000 patterns) is useful for
finding seeds for BLAST
Short Pattern Scan
Data: Alphabet A, Text T, pattern length p
Result: Frequency of each pattern in the text

R ← array(|A|^p);
n ← len(T);
for i := 1 to n − p + 1 do
    R[code(T[i : i + p − 1])] += 1;
end
return R;

Computing code(T[i : i + p − 1]) is O(p) if done naively, but O(1) if the previous window's code is used to compute the current one.
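The O(1) incremental update mentioned above can be sketched as follows: subtract the departing character's contribution, shift by the alphabet size, and add the new character.

```python
# A sketch of the tabulation scan with the O(1) rolling update of the
# window code (assumed mapping A=0, C=1, G=2, T=3).
DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def tabulate(text, p, alphabet=DNA):
    c = len(alphabet)
    counts = [0] * (c ** p)
    if len(text) < p:
        return counts
    # Code of the first window, computed naively once.
    w = 0
    for ch in text[:p]:
        w = w * c + alphabet[ch]
    counts[w] += 1
    high = c ** (p - 1)  # place value of the leading character
    for i in range(p, len(text)):
        # Drop the leading character, shift, append the new one: O(1).
        w = (w - alphabet[text[i - p]] * high) * c + alphabet[text[i]]
        counts[w] += 1
    return counts
```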
Tabulating/Locating Longer Patterns
Finding repeats/motifs: ATGGTCTAGGTCCTAGTGGTC
Flanking sequences in genomic rearrangements
Motifs: promoter regions, functional sites, immune targets
Cellular immunity targets in pathogens (e.g. protein 9 mers)
There are too many patterns to store in an array, and even if we could,
then the array would be very sparse
E.g. there are ~512,000,000,000 amino-acid 9-mers, but an average HIV-1 sequence (~3,000 aa long) contains at most ~3,000 unique 9-mers
Hash Tables
Hash tables allow us to efficiently (O(1) on average) store and retrieve a small subset of a large universe of records. They implement associative arrays (dictionaries) in a variety of languages (Python, Perl, etc.)
The universe (records):
e.g. 512,000,000,000 amino-acid 9-mers
Hash function: record ➝ hash key
Note: because there are more keys than array
indices, this function is NOT one to one
The storage:
Hash Table (array) << the size of the universe
A Simple Hash Function
A reasonable hash function on integer records i is i → i mod P, where P is a prime number that is also the natural size of the hash table. Hash keys range from 0 to P − 1. If the records are uniformly distributed, so will be their hash keys.
With P = 101:

4-mer (256 possible)  Integer code  Hash key
ACGT                  27            27
CCCA                  148           47
TGCC                  229           27  (collision with ACGT)
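A minimal sketch of the i mod P hash on L-mer integer codes, with P = 101 as in the example:

```python
# Hash an L-mer by computing its integer code, then reducing mod P.
P = 101
DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def hash_key(s):
    code = 0
    for ch in s:
        code = code * 4 + DNA[ch]
    return code % P
```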
Collisions
Collisions are frequent even for lightly loaded (sparsely populated) hash tables.
Load factor: α = (number of entries in the hash table) / (table size)
The birthday paradox: what is the probability that two people out of a random group of n (< 365) people share a birthday (in hash table terms, what is the probability of a collision if people = records and hash keys = birthdays)?

P(n) = 1 − (1 − 1/365) × (1 − 2/365) × ... × (1 − (n−1)/365)

n    α      P(n)
10   0.027  0.117
23   0.063  0.507
50   0.137  0.97
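The table's values can be checked numerically; this sketch multiplies out the survival probabilities:

```python
# Birthday-paradox collision probability: one minus the probability that
# all n records land on distinct hash keys (365 equally likely values).
def collision_prob(n, m=365):
    p_no_collision = 1.0
    for k in range(n):  # the k-th record must avoid k occupied keys
        p_no_collision *= (m - k) / m
    return 1.0 - p_no_collision
```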
Dealing with Collisions
Several strategies to deal with collisions: the simplest one is chaining
Each hash key is associated with a linked list of all records sharing the hash
key
Hash key 0 → CGCC → AAAA → ∅
Hash key 1 → AAAC → ∅
Hash key 2 → ...

4-mer (256 possible)  Integer code  Hash key
AAAA                  0             0
AAAC                  1             1
CGCC                  101           0
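A minimal chaining sketch, using Python lists as the linked chains; the class and method names are illustrative, not from the slides:

```python
# Chaining hash table for L-mers: each slot holds the chain of all
# records whose code mod P equals that slot's index.
P = 101
DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def code(s):
    v = 0
    for ch in s:
        v = v * 4 + DNA[ch]
    return v

class ChainedTable:
    def __init__(self, size=P):
        self.slots = [[] for _ in range(size)]

    def insert(self, s):
        chain = self.slots[code(s) % len(self.slots)]
        if s not in chain:  # store each distinct record once
            chain.append(s)

    def contains(self, s):
        return s in self.slots[code(s) % len(self.slots)]
```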
Hash Table Performance
Retrieving/storing a record in a hash table of size m with load factor α
Worst case - all records have the same key: O(m)
Expected run time is O (1), assuming uniformly distributed records and
hash keys
Record is not in the table: E_N = e^(−α) + α + O(1/m)
Record is in the table: E_S = 1 + α/2 + O(1/m)
This is because the probability of having many collisions with the same key is quite low (even though the probability of SOME collision is high)
Exact Pattern Matching
Motivation: Searching a database for a known pattern
Goal: Find all occurrences of a pattern in a text
Input: Pattern P = p[1]…p[n] and text T = t[1]…t[m] (n≤m)
Output: all positions 1 ≤ i ≤ m − n + 1 such that the n-letter
substring T[i : i + n − 1] starting at i matches the pattern P
Desired performance: O(n+m)
Brute Force Pattern Matching
Data: Pattern P, Text T
Result: The list of positions in T where P occurs

n ← len(P);
m ← len(T);
for i := 1 to m − n + 1 do
    if T[i : i + n − 1] = P then
        output i;
    end
end

The substring comparison can take from 1 to n (left-to-right) character comparisons.

Example. Text: GGCATC; pattern: GCAT.
i = 1: mismatch at the 2nd character (2 comparisons)
i = 2: full match (4 comparisons)
i = 3: mismatch at the 1st character (1 comparison)
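The pseudocode translates directly (0-based indexing here instead of the slide's 1-based):

```python
# Brute-force scan: try every alignment of the pattern against the text.
def brute_force_match(pattern, text):
    n, m = len(pattern), len(text)
    positions = []
    for i in range(m - n + 1):
        if text[i:i + n] == pattern:  # up to n character comparisons
            positions.append(i)
    return positions
```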
Brute Force Run Time
Worst case: O(nm). This can be achieved, for example, by searching
for P=AA...C in text T=AA...A, because each substring comparison
takes exactly n steps
Expected on random text: O(1) per position (O(m) overall). This is because the substring comparison takes on average (1 − q^n)/(1 − q) comparisons, where q = 1/(alphabet size).
For n = 20 and q = 1/4 (nucleotides), a substring comparison takes on average ≈ 4/3 operations.
Genetic texts are not random, so the performance may degrade.
Improving the Run Time
The search pattern can be preprocessed in O(n) time to eliminate
backtracking in the text and hence guarantee O(n+m) run time
A variety of procedures, starting with the Knuth-Morris-Pratt algorithm in 1977, take this approach. They make use of the observation that if a string comparison fails at pattern position i, then we can shift the pattern by i − b(i) positions, where b(i) depends only on the pattern, and continue comparing at the same or the next position in the text, thus avoiding backtracking.
These types of algorithms are popular in text editors/mutable texts, because they
do not require the preprocessing of (large) text
A C A A C G A C A C G A C C A C A A C A G C A A T G
A C G A C A C G A C A C A
SHIFT
A C A A C G A C A C G A C C A C A A C A G C A A T G
A C G A C A C G A C A C A
Exact Multiple Pattern Matching
The problem: given a dictionary of D patterns P1,P2,..., PD (total length n)
and text T report all occurrences of every pattern in the text.
Arises, for instance when one is comparing multiple patterns against a
database
Assuming an efficient implementation of individual pattern comparison,
this problem can be solved in O(Dm+n) time by scanning the text D times.
Aho and Corasick (1975) showed how this can be done efficiently in O(m+n)
time.
Uses the idea of a trie (from the word retrieval), or prefix trie
Intuitively, we can reduce the amount of work by exploiting repetitions in
the patterns.
Prefix Trie
Patterns: ‘ape’, ‘as’, ‘ease’. Constructed in O(n) time, one word at a time.

Properties of a trie:
Stores a set of words in a tree
Each edge is labeled with a letter
Any two edges sharing a parent node have distinct labels
Each node is labeled with a state (the order of its creation)
Each word can be spelled by tracing a path from the root to a leaf

[Figure: the trie built incrementally. Final trie: Root, a(1), p(2), e(3) spelling ‘ape’; s(4) under a(1) spelling ‘as’; e(5), a(6), s(7), e(8) spelling ‘ease’.]
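Trie construction can be sketched with nested dictionaries; the "$" end-of-word marker is an implementation choice, not from the slides:

```python
# Build a prefix trie one word at a time; nodes are dicts keyed by edge
# letter, with "$" marking nodes where a complete word ends.
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # follow or create the edge
        node["$"] = True  # terminal marker
    return root
```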
Searching Text for Multiple Patterns Using a Trie: Threading
Suppose we want to search the text ‘appease’ for the occurrences of
patterns ‘ape’, ‘as’ and ‘ease’, given their trie.
The naive way to do it is to thread (i.e. spell the word using tree
edges from the root) the text starting at position i, until either:
A leaf (or specially marked terminal node) is reached (a match has been
found)
Spelling cannot be completed (no match)
[Figure: threading the text ‘appease’ through the trie. i = 1: spelling fails after ‘ap’ (no match). i = 4: ‘ease’ is spelled to leaf 8 (match). i = 5: ‘as’ is spelled to leaf 4 (match).]
But we already knew this, because ‘as’ is a part of ‘ease’! If we take advantage of this, there will be no need to backtrack in the text, and the algorithm will run in O(n+m).
The Aho-Corasick algorithm implements exactly this idea using a finite state automaton, starting with the trie and adding shortcut links.
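The naive threading search can be sketched as follows (dict-of-dicts trie as above; names are illustrative). Note that it restarts from the root at every text position, which is exactly the backtracking that Aho-Corasick's shortcut links remove:

```python
# Naive threading: from each start position, spell along the trie and
# report every complete word ("$"-marked node) passed through.
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def thread_search(trie, text):
    matches = []  # (start position, word) pairs, 0-based
    for i in range(len(text)):
        node, j = trie, i
        while True:
            if "$" in node:
                matches.append((i, text[i:j]))
            if j == len(text) or text[j] not in node:
                break  # spelling cannot be continued
            node = node[text[j]]
            j += 1
    return matches
```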
Suffix Trees
A trie that is built on every suffix of a text T (length m), and
collapses all interior nodes that have a single child is called a suffix
tree.
A very powerful data structure: e.g. given a suffix tree and a pattern P (length n), all k occurrences of P in T can be found in O(n + k) time, i.e. independently of the size of the text (though the text size does figure into the cost of constructing the tree)
A suffix tree can be built in linear time O (m)
Building a Suffix Tree
Example ‘bananas#’. It is convenient to terminate the text with a
special character, so that no suffix is a prefix of another suffix (e.g. as
in banana). This guarantees that spelling any suffix from the root will
end at a leaf.
Construct the suffix tree in two phases from the longest to the
shortest suffix:
Phase 1: Spell as much of the suffix from the root as possible
Phase 2: If spelling stopped in the middle of an edge, break the edge and add a new branch; spell the rest of the suffix along that branch. Label the leaf with the starting position of the suffix.
BANANAS#, ANANAS#, NANAS#, ANAS#, NAS#, AS#, S# and #

[Figure: the suffix tree of ‘bananas#’ built one suffix at a time, longest first. Each new suffix is spelled from the root as far as possible; if spelling stops in the middle of an edge, the edge is broken and a new branch is added. Leaves are labeled 1 through 8 with the starting positions of their suffixes.]
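A naive quadratic-time sketch: insert every suffix of the terminated text into an uncompressed trie. A true suffix tree additionally collapses single-child paths into labeled edges, but the leaf labels come out the same:

```python
# Suffix trie of a terminated text: one root-to-leaf path per suffix,
# each leaf labeled with the 1-based starting position of its suffix.
def suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based starting position
    return root
```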
Suffix Tree Properties
Exactly m leaves for a text of size m (counting the terminator)
Each interior node has at least two children (except possibly the root); edges with the same parent spell substrings starting with different letters
The size of the tree is O(m)
Can be constructed in O(m) time. This uses the observation that during construction, not every suffix has to be spelled all the way from the root (which would lead to quadratic time); suffix links can short-circuit the process
Is also memory efficient (about ~5m × sizeof(long) bytes for text without too much difficulty)

[Figure: the suffix tree of ‘bananas#’.]
Matching Patterns Using Suffix Trees
Consider the problem of finding the pattern ‘an’ in the text ‘bananas#’. There are two matches: positions 2 and 4.

Thread the pattern onto the tree:
Completely spelled: report the index of every leaf below the point where spelling stopped. This works because the pattern is a prefix of every suffix spelled by traversing the rest of the subtree.
Incompletely spelled: no match.

Runs in O(n + k) time, where n is the length of the pattern and k is the number of matches.

[Figure: threading ‘an’ into the suffix tree of ‘bananas#’; the leaves below the stopping point are 2 and 4.]
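Threading a pattern and collecting the leaves below can be sketched on the uncompressed suffix trie (quadratic to build; a real suffix tree gives the O(n + k) query):

```python
# Thread a pattern from the root, then collect every leaf label in the
# subtree below the stopping point.
def suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def find(trie, pattern):
    node = trie
    for ch in pattern:  # spell the pattern
        if ch not in node:
            return []  # incompletely spelled: no match
        node = node[ch]
    leaves, stack = [], [node]
    while stack:  # every leaf below has the pattern as a prefix
        n = stack.pop()
        for key, child in n.items():
            if key == "leaf":
                leaves.append(child)
            else:
                stack.append(child)
    return sorted(leaves)
```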
Finding Longest Common Substrings Using Suffix Trees
Given two texts T and U, find the longest contiguous substring that is common to both texts. Can be done in O(len(T) + len(U)) time:
Construct a suffix tree on the concatenation T%U$ (with distinct terminators)
Find the deepest internal node whose children refer to suffixes starting in T and in U
E.g. T = ‘ACGT’, U = ‘TCGA’: the longest common substring is ‘CG’.

[Figure: the generalized suffix tree of ‘ACGT%TCGA$’; the deepest internal node with leaves from both texts spells ‘CG’.]
Short Read Mapping
Next generation sequencing (NGS) technologies (454, Solexa,
SOLiD) generate gigabases of short (32-500 bp) reads per run
A fundamental bioinformatics task in NGS analysis is to map all the
reads to a reference genome: i.e. find all the coordinates in the
known genome where a given read is located
ATGGTCTAGGTCCTAGTGGTC
Can take a LONG time to map 15,000,000 reads to a 3 gigabase
genome!
Burrows-Wheeler Transform Based Mappers
In 1994, Burrows and Wheeler described a lossless text
transformation (block sorter), which makes the text easily
compressible and is the algorithmic basis of BZIP2
Surprisingly, this transform is also very useful for finding all instances
of a given (short) string in a large text, while using very little memory
A number of NGS read mappers now use BWT transformed
reference genomes to accelerate mapping by several orders of
magnitude.
BWT
Given an input text T=t[1]...t[N], we construct N left-shift rotations
of the input text, sort them lexicographically, and map the input text
to the last column of the sorted rotations:
E.g. input ABRACA is mapped to CARAAB
Note: sorted rotations make it very easy to find all instances of a pattern in the text (this is also the idea behind suffix arrays)
ROTATIONS        SORTED
A B R A C A      A A B R A C
B R A C A A      A B R A C A
R A C A A B      A C A A B R
A C A A B R      B R A C A A
C A A B R A      C A A B R A
A A B R A C      R A C A A B
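The forward transform is a direct transcription of the rotate-sort-project description (fine for small inputs; practical implementations avoid materializing all rotations):

```python
# Forward BWT: form all left rotations, sort them lexicographically, and
# read off the last column. The row index of the original text is also
# returned, since it is needed for inversion.
def bwt(text):
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(rot[-1] for rot in rotations)
    return last, rotations.index(text)
```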
Why Bother?
The text output by BWT tends to contain runs of the same character
and be easily compressible by arithmetic, run-length or Huffman
coders, e.g.
[Figure 1 from Burrows and Wheeler: twenty consecutive sorted rotations of a version of their paper, shown together with the final character (L) of each rotation. All twenty rotations begin with ‘n’, and the final-character column, holding the characters that precede each ‘n’ in the text, reads a o o o o a a i i o a e i e e i i i o o: long runs drawn from a few characters, which is what makes the BWT output compressible.]
Inverse BWT
The beauty of BWT is that, knowing only the output and which sorted row contained the original string, the input can be reconstructed in no worse than O(N log N) time.
Step 1: reconstruct the first column of rotations (F) from the last column (L). To
do so, we simply sort the characters in L.
Step 2: determine the mapping of predecessor characters and recover the input
character by character from the last one
PREDECESSOR CHARACTERS

[Figure: the sorted rotation matrix M of ABRACA next to M', the matrix of the same rotations sorted starting with the 2nd character (equivalently, each row of M cyclically right-shifted). The first column of M' equals the last column L of M, and the last column of M' equals the first column F of M.]
Both M and M’ contain every rotation of input text T, i.e. permutations of the same set of strings.
For each row i in M, the last character (L[i]) is the cyclic predecessor of the first character (F[i]) in
the original text
We wish to define a transformation, Z(i), that maps the i-th row of M’ to the corresponding row in
M (i.e. its cyclic predecessor), using the following observations
M is sorted lexicographically, which implies that all rows of M’ beginning with the same character
are also sorted lexicographically, for example rows 1,3,4 (all begin with A).
The row of the i-th occurrence of character ‘X’ in the last column of M corresponds to the row of
the i-th occurrence of character ‘X’ in the first column of M’
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
In the original string T, the character that preceded the i-th
character of the last column L (BWT output) is L[Z[i]]
Input: T = ABRACA
BWT(T) = L = CARAAB
For example, for R (i=2), the predecessor in T is L[Z[2]] = L[5] = B
For B (i=5), it is L[Z[5]] = L[3] = A
If we know the position of the last character of T in L, we can
“unwind” the input by repeated application of Z.
Can use an inverse of Z to generate the input string forward
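Unwinding via Z can be sketched as follows; Z is computed by stably sorting the positions of L, which realizes the i-th-occurrence correspondence between the last column L and the first column F described above:

```python
# Inverse BWT via the Z (last-to-first) mapping: Z[i] is the row holding
# the cyclic predecessor of the character at row i; repeatedly applying Z
# from the row of the original string unwinds the text back to front.
def inverse_bwt(last, row):
    n = len(last)
    # Stable sort of positions by character: the i-th occurrence of a
    # character in L maps to its i-th occurrence in the sorted column F.
    order = sorted(range(n), key=lambda i: last[i])
    Z = [0] * n
    for f_pos, l_pos in enumerate(order):
        Z[l_pos] = f_pos
    chars = []
    i = row
    for _ in range(n):
        chars.append(last[i])
        i = Z[i]
    return "".join(reversed(chars))
```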
[Shown: title pages of two papers. Langmead et al., "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome" (Bowtie), Genome Biology 2009, 10:R25 (doi:10.1186/gb-2009-10-3-r25), published 4 March 2009, open access. Ferragina and Manzini, "Opportunistic Data Structures with Applications".]

Uses BWT and "opportunistic data structures" (i.e. data structures working directly on compressed data) to build a compressed index of a genome
Storage requirements for T = t[1]...t[N] are O(H_k(T)) + o(1) bits per character (the bound holds for any fixed k), where H_k is the k-th order entropy of the text
Searching for the k occurrences of a pattern P (length m) can be implemented in O(m + k log^ε N) time, for any fixed ε > 0

From the Bowtie abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).
Hashing vs BWT and Opportunistic Data Structures
Table 1 (Langmead et al., Genome Biology 2009, 10:R25): Bowtie alignment performance versus SOAP and Maq.

Platform  Program      CPU time        Wall clock time  Reads mapped per hour (millions)  Peak virtual memory footprint (MB)  Bowtie speed-up  Reads aligned (%)
Server    Bowtie -v 2  15 m 7 s        15 m 41 s        33.8                              1,149                               -                67.4
Server    SOAP         91 h 57 m 35 s  91 h 47 m 46 s   0.10                              13,619                              351×             67.3
PC        Bowtie       16 m 41 s       17 m 57 s        29.5                              1,353                               -                71.9
PC        Maq          17 h 46 m 35 s  17 h 53 m 7 s    0.49                              804                                 59.8×            74.7
Server    Bowtie       17 m 58 s       18 m 26 s        28.8                              1,353                               -                71.9
Server    Maq          32 h 56 m 53 s  32 h 58 m 39 s   0.27                              804                                 107×             74.7
The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National
Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was
used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie
was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values).
For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28
bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more
comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at
one time. CPU, central processing unit.
Inexact Pattern Matching
Homologous biological sequences are unlikely to match exactly: evolution drives them apart through mutation.
Exact algorithms (e.g. local alignments) are quadratic in time and are
too slow for comparing/searching large genomic sequences.
Pattern matching with errors is a fundamental problem in
bioinformatics – finding homologs in a database.
Well-performing heuristics are frequently used.
Example: Longest Common Substring (LCS) in Influenza A Virus (IAV) H5N1 Hemagglutinin (n = 957, from 2005+)
[Figure: length of the LCS (0 to 80) plotted against the proportion of sequences sharing it (1.0 down to 0.7).]
Suffix trees can be adapted to efficiently find the LCS shared by a given proportion of a set of sequences as well.
The longest fully conserved nucleotide substring in viruses sampled in 2005 or later is
merely 8 nucleotides long
This poses significant challenges for even straightforward tasks, such as diagnostic probe
design
K-Differences Matching
The k-mismatch problem: given a text T (length m), a pattern P
(length n) and the maximum tolerable number of mismatches k,
output all locations i in T where there are at most k differences
between P and T[i:i+n-1]
The k-differences problem: can also match characters to indels (cost
1) -- a generalization.
Both can be easily solved in O(nm) time, by either brute force or
dynamic programming
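The O(nm) brute force for the k-mismatch variant fits in a few lines (Hamming distance at each alignment):

```python
# Report every 0-based position where the pattern aligns with at most k
# character mismatches.
def k_mismatch(text, pattern, k):
    n = len(pattern)
    hits = []
    for i in range(len(text) - n + 1):
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + n]) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits
```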
Landau and Vishkin (1985) proposed an O(m+nk) time algorithm for
the k-differences problem by combining dynamic programming with
text and pattern preprocessing using suffix trees of T%P$.
Query Matching
If the pattern is long (e.g. a new gene sequence), it may be beneficial
to look for substrings of the pattern that approximately match the
reference (e.g. all genes in GenBank).
Query Matching
Approximately matching strings share some perfectly matching
substrings (L-mers).
Instead of searching for approximately matching strings (difficult,
quadratic) search for perfectly matching substrings (easy, linear).
Extend obtained perfect matches to obtain longer approximate
matches that are locally optimal.
This is the idea behind probably the most important bioinformatics tool: the Basic Local Alignment Search Tool, BLAST (Altschul, Gish, Miller, Myers and Lipman, 1990)
Three primary questions: How to select L? How to extend the seed?
How to confirm that the match is biologically relevant?
BLAST: Seed Selection and Extension

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword: GVK (score 18). Neighborhood words, with the neighborhood score threshold T = 13: GAK 16, GIK 16, GGK 14, GLK 13 (kept); GNK 12, GRK 11, GEK 11, GDK 11 (below threshold).

Extension produces a High-scoring Pair (HSP):
Query: 22  VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +     IR L     G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Selecting Seed Size L
If strings X and Y (each length n), match with k<n mismatches, then
the longest perfect match between them has at least ceil (n/(k+1))
characters.
Easy to show by the following observation: if there are k+1 bins and
k objects then at least one of the bins will be empty.
Partition the strings into k+1 equal length substrings -- at least one
of them will have no mismatches.
In fact, the longest perfect match is expected to be quite a bit longer (at least if the mismatches are randomly distributed): e.g. about 40 for n = 100, k = 5, compared with the guaranteed minimum of 17.
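The pigeonhole bound is easy to check empirically; this sketch plants k random mismatch positions and measures the longest exact run (the seed value is arbitrary):

```python
# Empirical check of the pigeonhole bound: with k mismatches in a length-n
# comparison, the longest exact run is at least ceil(n/(k+1)).
import math
import random

def longest_run(n, mismatch_positions):
    best = run = 0
    for i in range(n):
        run = 0 if i in mismatch_positions else run + 1
        best = max(best, run)
    return best

random.seed(181)
n, k = 100, 5
bound = math.ceil(n / (k + 1))  # = 17 for n = 100, k = 5
runs = [longest_run(n, set(random.sample(range(n), k))) for _ in range(2000)]
```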
Selecting Seed Size L
Smaller L: seeds are easier to find, but performance and, importantly, specificity decrease: two random sequences are more likely to share a short common substring
Larger L: could miss out many potential matches, leading to
decreased sensitivity.
By default BLAST uses L (w, word size) of 3 for protein sequences
and 11 for nucleotide sequences.
MEGABLAST (a faster version of BLAST for similar sequences)
uses longer seeds.
H OW TO EXTEND THE MATCH ?
Gapped local alignment (blastn)
Simple (gapless) extension (original BLAST)
Greedy X-drop alignment (MEGABLAST)
...
A tradeoff between speed and accuracy
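As an illustration of that tradeoff, here is a minimal sketch of gapless extension with an X-drop cutoff, in the spirit of the extension step above (a simplified illustration; the function and parameter names are mine, not BLAST's actual implementation):

```python
def extend_right(query, text, q, t, delta, x_drop):
    """Gapless extension to the right from a seed end at (q, t).

    Accumulates per-position scores delta(a, b) and stops once the
    running score falls more than x_drop below the best seen so far
    (the X-drop rule); returns (best_score, best_length).
    """
    best = score = 0
    best_len = i = 0
    while q + i < len(query) and t + i < len(text):
        score += delta(query[q + i], text[t + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score has dropped too far below the best: give up
    return best, best_len

# Toy delta: +5 for a match, -4 for a mismatch.
delta = lambda a, b: 5 if a == b else -4
print(extend_right("ACGTAC", "ACGTTT", 0, 0, delta, 6))  # → (20, 4)
```

A larger `x_drop` explores further past a dip in score (more accurate, slower); a smaller one gives up earlier (faster, may truncate good matches).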
H OW TO SCORE MATCHES ?
Biological sequences are not random
some letters are more frequent than others (e.g. in HIV-1 40%
of the genome is A)
some mismatches are more common than others in
homologous sequences (e.g. due to selection, chemical
properties of the residues, etc.) and should be weighted
differently.
BLAST introduces a weighting function δ(i, j), which
assigns a score to each pair of residues.
[Figure: 20×20 amino-acid substitution score matrix, labeled “HIV-WITHIN”]
For nucleotides it is 5 for i=j and -4 otherwise.
For proteins it is estimated from a large training dataset of homologous
sequences (Point Accepted Mutation, PAM, matrices). PAM120
corresponds to an evolutionary distance of 120 accepted point
mutations per 100 residues in an average protein
H OW TO COMPUTE SIGNIFICANCE ?
Before a search is done we need to decide what a good cutoff value
H for a match is.
It is determined by computing the probability that two random
sequences will have at least one match scoring H or greater.
Uses Altschul-Dembo-Karlin statistics (1990-1991)
S TATISTICS OF SCORES
Given a segment pair H between two sequences, comprised of r-character substrings T1 and T2, we compute the score of H as:
s(H) = Σ_{i=1}^{r} δ(T1[i], T2[i])
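This score is a direct sum over aligned positions; a minimal transcription, using the +5/−4 nucleotide δ from the previous slide as an example:

```python
def segment_score(t1, t2, delta):
    """Score of a segment pair H = (t1, t2):
    s(H) = sum over i of delta(t1[i], t2[i])."""
    assert len(t1) == len(t2)
    return sum(delta(a, b) for a, b in zip(t1, t2))

# Nucleotide delta: +5 on a match, -4 on a mismatch.
delta = lambda a, b: 5 if a == b else -4
print(segment_score("ACGT", "ACCT", delta))  # → 5 + 5 - 4 + 5 = 11
```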
We are interested in finding out how likely the maximal score for
any segment pair of two random sequences is to exceed some
threshold X
Dembo and Karlin (1990) showed that
The mean value for the maximum score between segment pairs of
two random sequences (of lengths n and m), assuming a few things about
δ(i, j), is approximately
M = log(nm)/λ*
where λ* is the unique positive solution of
Σ_{i,j} p_i q_j exp(λ δ(i, j)) = 1
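λ* has no closed form, but it can be found numerically by bisection, since the left-hand side of the equation crosses 1 exactly once for λ > 0 (when the expected score is negative). A minimal sketch; the uniform nucleotide frequencies and the +5/−4 scores are illustrative assumptions, not values from the slides:

```python
import math

def solve_lambda(pairs, lo=1e-6, hi=10.0, iters=100):
    """Bisection for the unique positive root of
    f(lam) = sum over (i,j) of p_i*q_j*exp(lam*delta_ij) - 1 = 0.

    `pairs` lists (p_i * q_j, delta(i, j)) for every letter pair.
    """
    f = lambda lam: sum(p * math.exp(lam * s) for p, s in pairs) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # f < 0 means the root lies to the right of mid.
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

# Uniform nucleotide frequencies (p_i = q_j = 1/4) with +5/-4 scores:
# 4 matching pairs score +5, 12 mismatching pairs score -4.
pairs = [(1 / 16, 5)] * 4 + [(1 / 16, -4)] * 12
lam = solve_lambda(pairs)
```

For these parameters λ* comes out at roughly 0.19; note the expected per-position score (4/16)·5 + (12/16)·(−4) = −1.75 is negative, as the theory requires.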
S TATISTICS OF SCORES ( CONT ’ D )
For biological sequences, high-scoring real matches should greatly exceed
the random expectation, and the probability that this happens (x is the
difference between the score and the mean) is
Prob{s(H) > x + M} ≤ K* exp(−λ* x)
K* and λ* are expressions that depend on the scoring matrix and letter
frequencies, and the distribution is similar to other extreme value
distributions.
One can show that the expected number of HSPs (high-scoring segment
pairs) exceeding the threshold S′ is
E = Kmn exp(−λS′)
[Figure: simulated HSP scores for pairs of random and mutated sequences.
Left: the mean HSP score grows linearly with log(mn).
Right: the number of scores exceeding the mean is supposed to follow a
Poisson distribution, i.e. decay exponentially as a function of
x = score − M_expected; in a simulation based on sequences of length 2^17,
the number of replicates scoring x points above the mean drops
exponentially as one moves away from the mean.]
E- VALUES
Because thresholds are determined by the algorithm internally, it is
better to ‘normalize’ the result as follows:
Bit score: S = (λS′ − log K) / log 2
E-value: E = nm 2^(−S)
Poisson distribution for the number k of HSPs with scores ≥ S:
exp(−E) E^k / k!
Probability of finding at least one: 1 − exp(−E)
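These conversions are simple to compute directly; a minimal sketch (the λ and K values below are made-up illustrations, not actual BLAST parameters):

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized (bit) score: S = (lambda*S' - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, m, n):
    """Expected number of HSPs scoring >= S: E = m*n*2^(-S)."""
    return m * n * 2.0 ** (-bits)

def p_value(E):
    """Probability of at least one such HSP (Poisson): 1 - e^(-E)."""
    return 1.0 - math.exp(-E)

# Illustrative parameters only (lam and K depend on the scoring
# matrix and letter frequencies).
bits = bit_score(raw_score=100, lam=0.19, K=0.1)
E = e_value(bits, m=1_000, n=1_000_000)
```

Note that because E is a Poisson mean, for small E the E-value and the probability of at least one hit are nearly equal.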
T IMELINE
1970: Needleman-Wunsch global alignment algorithm
1981: Smith-Waterman local alignment algorithm
1985: FASTA
1990: BLAST (basic local alignment search tool)
2000s: BLAST has become too slow for “genome vs. genome”
comparisons; new, faster algorithms evolve!
BLAT
PatternHunter
BLAT VS . BLAST
BLAT (BLAST-Like Alignment Tool): the same idea as BLAST, i.e. locate
short sequence hits and extend them (developed by J. Kent at UCSC)
BLAT builds an index of the database and scans linearly through the
query sequence, whereas BLAST builds an index of the query
sequence and then scans linearly through the database
The index is stored in RAM, resulting in faster searches
Longer k-mers and greedier extensions, specifically designed for
highly similar sequences (e.g. >95% identity for nucleotides, >85% for proteins)
BLAT INDEXING
Here is an example with k = 3:
Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc
Index:
aat → 3
cac → 0, 9
cgc → 15
gac → 12
tat → 6
Multiple instances map to a single index entry!
cDNA (query sequence): aattctcac
3-mers (overlapping, query positions 0–6): aat att ttc tct ctc tca cac
Hits (position of 3-mer in query, genome): aat → (0, 3); cac → (6, 0), (6, 9)
Clump: cacAATtatCACgaccgc
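The scheme above (non-overlapping genome k-mers in the index, overlapping query k-mers in the scan) can be sketched as follows; the function names are mine:

```python
from collections import defaultdict

def blat_index(genome, k):
    """Index the genome by NON-overlapping k-mers (BLAT style);
    the index is k times smaller than an all-k-mers index."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def blat_hits(index, query, k):
    """Scan OVERLAPPING k-mers of the query against the index;
    return (query_pos, genome_pos) hit pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits

index = blat_index("cacaattatcacgaccgc", 3)
hits = blat_hits(index, "aattctcac", 3)
print(hits)  # → [(0, 3), (6, 0), (6, 9)]
```

Nearby hits on the genome (here aat at 3 and cac at 9) are then grouped into clumps and extended, exactly as in the slide's example.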