String-Matching

      C  A  R  T
   0  1  2  3  4
H  1  1
E  2
A  3
T  4
Compare H to C. They are different, so the cost of this comparison is 1.
Looking at the neighboring cells (left, above, and diagonal), the best
distance so far is 0, as seen in the top-left cell.
Adding the cost of 1 to the best previous distance gives this cell a value
of 1 (which is 0 + 1).
Filling the remaining cells row by row gives the completed table:
      C  A  R  T
   0  1  2  3  4
H  1  1  2  3  4
E  2  2  2  3  4
A  3  3  2  3  4
T  4  4  3  3  3
The final distance is 3, the value in the bottom-right cell.
\[
M_{i,j} =
\begin{cases}
M_{i-1,j-1} & \text{if } A_i = B_j, \\
\min\left( M_{i-1,j-1},\; M_{i-1,j},\; M_{i,j-1} \right) + 1 & \text{if } A_i \neq B_j.
\end{cases}
\]
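The recurrence can be turned into a short table-filling routine. The following is a minimal Python sketch, not a definitive implementation; the function name `edit_distance` and the choice to keep the whole table in memory are assumptions made for illustration. The first row and column are initialized to 0, 1, 2, ... as in the table for H, E, A, T against C, A, R, T.

```python
def edit_distance(a: str, b: str) -> int:
    """Fill the table M row by row using the recurrence for M[i][j]."""
    rows, cols = len(a) + 1, len(b) + 1
    # M[i][j] = distance between the first i characters of a
    # and the first j characters of b.
    M = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        M[i][0] = i          # first column: 0, 1, 2, ...
    for j in range(cols):
        M[0][j] = j          # first row: 0, 1, 2, ...
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                # Characters match: copy the diagonal neighbor.
                M[i][j] = M[i - 1][j - 1]
            else:
                # Characters differ: best neighbor plus a cost of 1.
                M[i][j] = min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]) + 1
    return M[-1][-1]         # bottom-right cell


print(edit_distance("HEAT", "CART"))  # 3
```

Only the previous row is needed to compute the next one, so the table could be reduced to two rows if memory matters.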
The key idea in Boyer-Moore is that the comparison is done from
right to left, starting with the last character in the
pattern.
The first comparison is between X and C, which do not match.
But since X does not appear anywhere in the search pattern,
we can now rule out a match anywhere in the first 3
characters. So the skip value for X will be initialized to 3, the
length of the search pattern.
Again we start from the right by comparing B and C,
which again do not match.
However, this time B does occur within the search
pattern. The skip value for B will be 1 in order to line up
with the last B in the search pattern.
Traditionally, implementations of this algorithm
have created a 256-entry table to hold the skip
value for every possible byte value.
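A skip table of this kind might be built as follows. This is a sketch under assumptions: the pattern from the slides is not shown in the text, so `b"ABC"` is a hypothetical 3-character pattern chosen only because it reproduces the skip values quoted above (3 for X, 1 for B). The table is indexed by the text character aligned with the last pattern position, which is the Horspool simplification of Boyer-Moore rather than the full original algorithm.

```python
def build_skip_table(pattern: bytes) -> list[int]:
    """Return a 256-entry table: one skip value per possible byte."""
    m = len(pattern)
    # Characters that do not occur in the pattern skip the full length.
    table = [m] * 256
    # For every position except the last, record the distance from
    # that position to the end of the pattern. Later (right-most)
    # occurrences overwrite earlier ones.
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


table = build_skip_table(b"ABC")  # hypothetical pattern, for illustration
print(table[ord("X")], table[ord("B")])  # 3 1
```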
Example
pattern = "STING"
string = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"
Try to match the first m characters, comparing from the right.
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
This fails: G is compared with R. R does not occur in the pattern, so the pattern can slide past R.
     STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
Fails again.
The text character under the pattern's last position is S, which occurs in the pattern exactly once, so slide until the two S's line up.
         STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No C in the pattern. Slide past it.
              STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No space in the pattern. Slide past it.
                   STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No P in the pattern. Slide past it.
                        STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No O in the pattern. Slide past it.
                             STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
The text character under the pattern's last position is T. There is exactly one T in the pattern. Slide to align them.
                                STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
Match.
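The sliding procedure in this example can be traced in code. The sketch below is an illustration, not the slides' own implementation: it uses the Horspool simplification (shift chosen from the text character under the pattern's last position), and the function name `horspool_search` and the returned list of alignments are assumptions added for tracing.

```python
def horspool_search(pattern: str, text: str):
    """Return (match index or -1, list of alignments that were tried)."""
    m, n = len(pattern), len(text)
    # Skip value for each pattern character except the last one:
    # distance from its right-most occurrence to the pattern's end.
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    alignments = []
    pos = 0
    while pos <= n - m:
        alignments.append(pos)
        # Compare right to left, starting with the last pattern character.
        j = m - 1
        while j >= 0 and pattern[j] == text[pos + j]:
            j -= 1
        if j < 0:
            return pos, alignments
        # Characters absent from the pattern skip the full length m.
        pos += skip.get(text[pos + m - 1], m)
    return -1, alignments


text = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"
index, alignments = horspool_search("STING", text)
print(index)       # 32
print(alignments)  # [0, 5, 9, 14, 19, 24, 29, 32]
```

Each entry in `alignments` is the text offset at which the pattern was tried; the last one is the match.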
Complexity is O(n) on typical inputs, where n is the length of the text
and m is the length of the pattern. With the skip table alone, the worst
case can degrade to O(nm).
The execution time can actually be sub-linear:
The algorithm does not need to check every character of the text, but
skips over some of them. It checks the right-most character of each block
of m first; if that character does not occur in the pattern, the rest of
the block can be skipped.
Best-case performance is O(n/m). In the best case, only one in m
characters needs to be checked.
The algorithm actually works better, on average, with longer patterns (larger m).
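The best case can be checked by counting character comparisons. The sketch below assumes the Horspool-style skip rule and an artificial input (a text of 1000 copies of "a" and a pattern made only of "b"), chosen so that every block is skipped after one comparison; the function name `count_comparisons` is hypothetical.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count character comparisons made until a match or end of text."""
    m, n = len(pattern), len(text)
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[pos + j]:
                break
            j -= 1
        if j < 0:
            break  # match found; stop counting
        pos += skip.get(text[pos + m - 1], m)
    return comparisons


text = "a" * 1000
print(count_comparisons("b" * 5, text))   # 200, i.e. n/m with m = 5
print(count_comparisons("b" * 10, text))  # 100, i.e. n/m with m = 10
```

Doubling the pattern length halves the number of comparisons on this input, which is the sense in which longer patterns help.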
Text Editor, Digital Library and Search Engine:
Everyone who uses a text editor, a digital library, or a
search engine needs to find patterns in text.
The Boyer-Moore algorithm is implemented directly in
the search command of many text editors. The
longest common subsequence dynamic programming
algorithm is implemented in system commands that
test differences between files, such as diff.
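For reference, the longest common subsequence length can be computed with a table very similar to the edit-distance one. This is a minimal sketch, not the code used by any particular file-comparison command; the function name `lcs_length` is an assumption. It accepts any two sequences, so it works on strings or on lists of lines.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    # prev[j] = LCS length of the items of a seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)        # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop one item
        prev = cur
    return prev[-1]


print(lcs_length("HEAT", "CART"))  # 2 (the subsequence "AT")

old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "gamma", "delta"]
print(lcs_length(old_lines, new_lines))  # 2 lines are unchanged
```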
Multimedia and Computational Biology:
Computational biology: to find a close mutation.
Communications: to adjust for transmission noise.
Texts: to detect common typing errors.
Multimedia: to adjust for lossy compression,
occlusions, scaling, affine transformations, or
dimension loss.
DNA sequencing: The largest overlap heuristic for
finding the shortest common superstring.
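A rough sketch of the largest overlap heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap until one string remains. The function names and the sample fragments are hypothetical, and the result is a heuristic answer, not necessarily the shortest possible superstring.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Merge fragments greedily by largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments already contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(overlap("ACGTAC", "TACGGA"))                        # 3
print(greedy_superstring(["ACGTAC", "TACGGA", "GGATT"]))  # ACGTACGGATT
```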
Medical Texts:
The BMH (Boyer-Moore-Horspool) algorithm achieves the best overall
results when used with medical texts. This algorithm usually
performs at least twice as fast as the other algorithms
tested. The time performance of exact string pattern
matching can be greatly improved if an efficient
algorithm is used. Considering the growing amount of
text handled in the electronic patient record, it is worth
implementing this efficient algorithm.
Retrieving Music Pattern from Musical Database:
Retrieving musical notes from a musical database
requires string matching. There are four
similarity techniques for this: edit distance, Dice
similarity, Jaccard similarity, and cosine similarity. The
musical notes are retrieved by a QBE (query by
example) approach. The best scheme for this
problem is Levenshtein distance with Jaccard
similarity. This is an approximate music search
technique. As the Jaccard similarity performs excellently
in passing a query when a pitch change scenario is
Intrusion Detection:
Intrusion detection systems fall into two basic
categories:
1. signature-based intrusion detection systems and
2. anomaly detection systems.
Signature-based intrusion detection systems:
Intruders, like computer viruses, have signatures.
The system looks for data packets that contain any known
intrusion-related signatures or anomalies related to Internet
protocols.
Based upon a set of signatures and rules, the detection
system is able to find and log suspicious
activity and generate alerts.
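Signature matching of this kind can be sketched as a scan of each packet payload for a set of byte patterns. Everything specific here is hypothetical: the signature names, the byte strings, and the sample packets are invented for illustration and are not taken from any real rule set. A production system would use a multi-pattern matching algorithm rather than one substring test per signature.

```python
# Hypothetical signatures: name -> byte pattern to look for.
SIGNATURES = {
    "shell-probe": b"/bin/sh",
    "path-traversal": b"../../",
}


def scan_packet(payload: bytes) -> list[str]:
    """Return the names of all signatures found in the payload."""
    return [name for name, sig in SIGNATURES.items() if sig in payload]


# Hypothetical packet payloads.
packets = [
    b"GET /index.html HTTP/1.1",
    b"GET /../../etc/passwd HTTP/1.1",
]
for number, payload in enumerate(packets):
    hits = scan_packet(payload)
    if hits:
        # Log the suspicious activity and raise an alert.
        print(f"ALERT packet {number}: {hits}")
# prints: ALERT packet 1: ['path-traversal']
```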
Anomaly-based intrusion detection systems:
Anomaly-based intrusion detection usually depends on
packet anomalies present in protocol header parts.
String matching may become the performance bottleneck in deep
packet inspection.
Detecting Plagiarism:
Plagiarism detection is composed of structural and syntactic phases:
In the structural phase, documents are decomposed into
components by their syntax and compared at a coarse
level.
The structural mapping processes the decomposed
documents based on their syntax without actually
mapping at the word level.
The structural mapping can be applied in a hierarchical
way based on the structural organization of a
document.
In the syntactic phase, the matching algorithm uses a
heuristic look-ahead algorithm for matching
consecutive tokens with a verification patch.
Bioinformatics:
Approximate matching of a search pattern to a target
(called the “text” in string algorithms) is a fundamental
tool in molecular biology.
The pattern is often called the “query” and the text is
called a “sequence database”, but we will use “pattern”
and “text” to be consistent.
The importance of approximate matching is that
biological sequences change and evolve.
Related genes in different organisms, or even similar
genes within the same organism, most commonly have
similar, but not identical sequences.
Determining which sequences of known function are
most similar to a new gene of unknown function is
often the first step in finding out what the new gene
does.
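One common way to match a pattern approximately inside a longer text is to reuse the edit-distance table but set the whole first row to 0, so that a match may start at any text position. The sketch below illustrates that idea only; the slides do not say which method is meant, and the function name `best_approximate_match` and the short DNA-like strings are hypothetical.

```python
def best_approximate_match(pattern: str, text: str):
    """Return (smallest edit distance, end position in text) over all
    substrings of text, scanning the text one character at a time."""
    m = len(pattern)
    # column[i] = fewest edits to turn pattern[:i] into some substring
    # of the text ending at the current position.
    column = list(range(m + 1))
    best_distance, best_end = column[m], 0
    for j, text_char in enumerate(text, start=1):
        new_column = [0] * (m + 1)  # entry 0 stays 0: start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == text_char:
                new_column[i] = column[i - 1]
            else:
                new_column[i] = 1 + min(column[i - 1], column[i], new_column[i - 1])
        column = new_column
        if column[m] < best_distance:
            best_distance, best_end = column[m], j
    return best_distance, best_end


print(best_approximate_match("ACGT", "TTACGTGG"))     # (0, 6): exact occurrence
print(best_approximate_match("ACGT", "TTACCTGG")[0])  # 1: one substitution away
```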
Digital Forensics:
Digital forensic text string searches are designed to
search every byte of the digital evidence, at the
physical level, to locate specific text strings of interest
to the investigation.
Given the nature of the data sets typically encountered,
text string search results are extremely noisy, which
results in inordinately high levels of information
retrieval (IR) overhead and information overload.
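A physical-level search can be sketched as a scan of raw bytes that reports every offset where a string of interest occurs. The function name `find_all_offsets` and the sample evidence bytes are hypothetical; a real tool would read the evidence image in chunks and handle multiple text encodings.

```python
def find_all_offsets(data: bytes, needle: bytes) -> list[int]:
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one byte later so overlapping hits are also reported.
        start = data.find(needle, start + 1)
    return offsets


# Hypothetical raw evidence bytes.
evidence = b"\x00\x00password\xff\xffpassword\x00"
print(find_all_offsets(evidence, b"password"))  # [2, 12]
```

Every hit is reported regardless of context, which is one reason such results tend to be noisy.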
Text Mining:
1. Information extraction,
2. topic tracking,
3. content summarization,
4. information visualization,
5. question answering,
6. concept linkage,
7. text categorization/classification, and
8. text clustering.
Video Retrieval:
A string-based video retrieval method first converts the
unstructured video into a curve: the characteristic curve of
the key frame sequence is extracted, and its feature string
is marked.
Approximate string matching is then applied to the
feature string to retrieve video quickly.
Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
http://shaunwagner.com/writings_computer_levenshtein.html
A fast string searching algorithm, R. S. Boyer and J. S. Moore, Communications of the ACM, vol. 20 (10), pp. 762-772.