Scaled Pattern Matching

Notation: (p.#, s.#) means (PDF page number, slide number as written on the slides).
(p.1, s.0.1)
Motivation: Searching for templates in aerial photographs
Input: An aerial photo/image
Template: The pattern we are looking for, for example a tank
Task: Find the locations where the template appears in the image
The problems ahead of us:
1. Rotation: what if the occurrence is rotated relative to the template we are looking for?
2. Error: what if part of the occurrence differs from the template (a partial match)?
3. Size: what if the occurrence is scaled relative to the template we are looking for?
(p.4, s.3)
If exact matching is not required (errors are tolerated), algorithms based on suffix trees and
LCA can deal with the problems of local error and orientation by approximation.
(p.7)
Let us look at the problem of digitizing newspaper stories from the point of view of
size only (we are not looking for matches with errors or rotated matches).
We keep a dictionary of fonts and search for their appearances in all sizes.
(p.8, s.6)
Problem: The problem is inherently inexact.
What if the appearance is 1.5 times bigger? What is half a pixel?
Solution until now: natural scales only.
Consider 1, 2, 3, 4, 5, … as the only scales we are looking for (discrete scales).
(p.9, s.6b)
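Definition:
Text of size $n \times n$
Pattern of size $m \times m$
Text:
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}$$
Pattern:
$$\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix}$$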
Find all occurrences of the pattern in the text in all discrete sizes.
(p.10, s.5-6)
Our problem: Discrete Exact Scaled Matching
Input: a text T and a pattern P, both grids over the symbols X and O.
[Figure: the grids of the text T and the pattern P; the cell layout did not survive extraction.]
In the example above the pattern matches at scales 1, 2 and 3.
Example of scaling by 3:
Z U T
Y V S
X W R
becomes
Z Z Z U U U T T T
Z Z Z U U U T T T
Z Z Z U U U T T T
Y Y Y V V V S S S
Y Y Y V V V S S S
Y Y Y V V V S S S
X X X W W W R R R
X X X W W W R R R
X X X W W W R R R
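Discrete scaling can be sketched in a few lines of Python. This is only a naive illustration under our own assumptions (grids stored as lists of lists, hypothetical function names `scale_up` and `scaled_occurrences`, 0-based positions); it tries every scale and every position and is not the linear-time algorithm mentioned in these notes.

```python
def scale_up(pattern, s):
    """Replace every cell of the grid by an s x s block of the same symbol."""
    return [[cell for cell in row for _ in range(s)]
            for row in pattern for _ in range(s)]


def occurs_at(text, pat, r, c):
    """True if grid pat appears in grid text with its top-left corner at (r, c)."""
    k = len(pat)
    return all(text[r + i][c:c + k] == pat[i] for i in range(k))


def scaled_occurrences(text, pattern):
    """Naive search: return (scale, row, col) for every discrete scaled match."""
    n, m = len(text), len(pattern)
    hits = []
    for s in range(1, n // m + 1):      # beyond n // m the pattern exceeds the text
        pat = scale_up(pattern, s)
        k = m * s
        for r in range(n - k + 1):
            for c in range(n - k + 1):
                if occurs_at(text, pat, r, c):
                    hits.append((s, r, c))
    return hits


if __name__ == "__main__":
    small = [list("ZUT"), list("YVS"), list("XWR")]
    for row in scale_up(small, 3):
        print(" ".join(row))
```

Running the script prints the 9 x 9 grid obtained by scaling the 3 x 3 example by 3.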
There is a linear, $O(n^2)$, algorithm that finds scaled matches in the dictionary problem
(AC-96 (p.6, s.4)). It is linear because the text has $n^2$ cells.
(p.11, s.9)
Idea:
Fix a scale $s$ and divide every text dimension into $n/s$ parts. Every square has size
$s \times s$, and there are $n^2/s^2$ squares in the text.
There is a constant amount of work for each square ($s$-block).
(p.12, s.10)
Time to search for the pattern scaled by $s$: $O(n^2/s^2)$.
How many scales do we need to check? How many valid ones are there?
The highest scale we need to check is $\lfloor n/m \rfloor$; beyond that the scaled pattern exceeds the text.
Total time for searching the pattern at any scale:
$$\sum_{s=1}^{\lfloor n/m \rfloor} \frac{n^2}{s^2} = n^2 \sum_{s=1}^{\lfloor n/m \rfloor} \frac{1}{s^2} = O(n^2).$$
The series $\sum_{s \ge 1} \frac{1}{s^2}$ converges to a constant.
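As a quick numeric sanity check of this bound, the sketch below sums the per-scale cost $n^2/s^2$ (taking the hidden constant as 1, which is our assumption) and compares it with $n^2 \cdot \pi^2/6$, the value the full series converges to. The function name `total_work` is hypothetical.

```python
import math


def total_work(n, m):
    """Sum of n^2 / s^2 over all scales s = 1 .. floor(n / m)."""
    return sum(n * n / (s * s) for s in range(1, n // m + 1))


if __name__ == "__main__":
    n, m = 1000, 10
    bound = n * n * math.pi ** 2 / 6   # limit of n^2 * sum(1 / s^2)
    print(total_work(n, m) <= bound)
```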
(p.13, s.11)
Problem: Real scales are an open problem even for strings…
How do we define scaling in one dimension?
Let us look at the pattern 'aabcccbb'.
Scaled by 2 it looks like 'aaaabbccccccbbbb': every item is doubled.
But what about scaling by 1.5? What will the pattern look like then?
Every run whose scaled length is not an integer is truncated to an integer (a
version with rounding is also possible).
Scaled by 1.5 the pattern looks like 'aaabccccbbb': a 'half' 'b' (from the left b
run) and a 'half' 'c' were truncated.
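The truncation rule can be sketched as follows. The function name `scale_string` is our own, and exact fractions are used instead of floats so that the floor is never affected by rounding; this is an illustration of the rule, not code from the slides.

```python
import math
from fractions import Fraction
from itertools import groupby


def scale_string(s, alpha):
    """Scale every maximal run of s by alpha, truncating the length to an integer."""
    alpha = Fraction(alpha)
    return "".join(ch * math.floor(alpha * len(list(run)))
                   for ch, run in groupby(s))


if __name__ == "__main__":
    print(scale_string("aabcccbb", 2))                # aaaabbccccccbbbb
    print(scale_string("aabcccbb", Fraction(3, 2)))   # aaabccccbbb
```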
(p.14, s.12)
FORMALLY:
Denote by $a^r$ the run $\underbrace{aa \cdots a}_{r \text{ times}}$ of a single element $a$.
We treat a maximal run of an item as one instance of that item.
PROBLEM DEFINITION 1:
Input: Pattern $P = a_1^{r_1} a_2^{r_2} \cdots a_j^{r_j}$, where $|P| = \sum_i r_i = m$
Text $T$, where $|T| = n$
Output: All text locations where $a_1^{c_1} a_2^{\lfloor \alpha r_2 \rfloor} \cdots a_{j-1}^{\lfloor \alpha r_{j-1} \rfloor} a_j^{c_j}$
appears for some $\alpha \ge 1$, $\alpha \in \mathbb{R}$, $c_1 \ge \lfloor \alpha r_1 \rfloor$, $c_j \ge \lfloor \alpha r_j \rfloor$
This definition admits occurrences of the scaled pattern that are preceded by extra copies of $a_1$
and followed by extra copies of $a_j$, beyond the respective truncated counts.
(p.15, s.13)
Remark: $\alpha \ge 1$ means we only scale up.
Reason: we need to avoid conceptual problems of loss of resolution.
The conceptual problem is that from "far enough" away everything looks the
same, and we cannot determine mismatches.
Indeed, without the restriction $\alpha \ge 1$, for every scale $k < \frac{1}{m}$ our definition would give
a match at every text location, because every run length would be truncated to 0.
(p.16, s.14)
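PROBLEM DEFINITION 2 (SIMPLIFIED DEFINITION):
Look for $a_1^{\lfloor \alpha r_1 \rfloor} a_2^{\lfloor \alpha r_2 \rfloor} \cdots a_{j-1}^{\lfloor \alpha r_{j-1} \rfloor} a_j^{\lfloor \alpha r_j \rfloor}$ in the text
Example: $P = aabcccbbbb$, $\alpha = \frac{3}{2}$:
$a^{\lfloor \frac{3}{2} \cdot 2 \rfloor} b^{\lfloor \frac{3}{2} \cdot 1 \rfloor} c^{\lfloor \frac{3}{2} \cdot 3 \rfloor} b^{\lfloor \frac{3}{2} \cdot 4 \rfloor} = a^3 b^1 c^4 b^6$
d aaa b cccc bbbbbb e: in this text there is a match by definitions 1 and 2.
d aaa b cccc bbbbbbbb e: in this text there is a match by definition 1 but not by definition 2.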
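The difference between the two definitions can be checked mechanically for a fixed scale. The sketch below is an illustration under our own assumptions: the function names `runs` and `occurs` are hypothetical, the scale $\alpha \ge 1$ is given rather than searched for, and positions are reported as 0-based run indices in the text.

```python
import math
from fractions import Fraction
from itertools import groupby


def runs(s):
    """Run-length encode s as a list of (symbol, length) pairs."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]


def occurs(text, pattern, alpha, definition):
    """0-based run indices of text where pattern scaled by alpha occurs.

    definition == 1: first and last runs may be longer than required.
    definition == 2: every run length must match exactly.
    """
    alpha = Fraction(alpha)
    want = [(ch, math.floor(alpha * r)) for ch, r in runs(pattern)]
    have = runs(text)
    last = len(want) - 1
    hits = []
    for i in range(len(have) - last):
        ok = True
        for k, (ch, need) in enumerate(want):
            sym, got = have[i + k]
            if sym != ch:
                ok = False
            elif definition == 1 and k in (0, last):
                ok = ok and got >= need
            else:
                ok = ok and got == need
        if ok:
            hits.append(i)
    return hits


if __name__ == "__main__":
    p, a = "aabcccbbbb", Fraction(3, 2)
    print(occurs("daaabccccbbbbbbe", p, a, 1), occurs("daaabccccbbbbbbe", p, a, 2))
    print(occurs("daaabccccbbbbbbbbe", p, a, 1), occurs("daaabccccbbbbbbbbe", p, a, 2))
```

The second text has eight trailing b's instead of six, so only definition 1 accepts it.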
(p.17, s.15)
WHY ARE THE DEFINITIONS EQUIVALENT:
Split the text and the pattern into
a symbol part $T^S$, $P^S$
and a length part $T^L$, $P^L$
Example:
$P = aabcccbbbb$
$P^S = abcb$
$P^L = 2134$
$T = daaabccccbbbbbbe$
$T^S = dabcbe$
$T^L = 131461$
Time for the split: $O(n + m)$
Finding $P^S$ in $T^S$: $O(n + m)$ (e.g. KMP)
The hard part: finding $P^L$ in $T^L$
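The split into a symbol part and a length part is a run-length encoding. A minimal sketch (the function name `split_runs` is our own):

```python
from itertools import groupby


def split_runs(s):
    """Return (symbol part, length part) of the run-length encoding of s."""
    symbols, lengths = [], []
    for ch, group in groupby(s):
        symbols.append(ch)
        lengths.append(len(list(group)))
    return "".join(symbols), lengths


if __name__ == "__main__":
    print(split_runs("aabcccbbbb"))          # ('abcb', [2, 1, 3, 4])
    print(split_runs("daaabccccbbbbbbe"))    # ('dabcbe', [1, 3, 1, 4, 6, 1])
```

The split makes a single pass over the input, which agrees with the linear time stated for it.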
(p.18, s.16)
Claim:
Solving definition 2 in time $O(f(n))$ $\Rightarrow$ solving definition 1 in time $O(f(n))$
Why?
1. Find $a_2^{r_2} \cdots a_{j-1}^{r_{j-1}}$ by definition 2, i.e. find a match of the $m - 2$ inner items.
Time: $O(f(n))$
2. For each match, verify in constant time that
the 1st and last symbols of the pattern match in $T^S$ and $T^L$.
Time for verifying: $O(n)$, since there can be at most $n$ matches.
Total time: $O(f(n) + n) = O(f(n))$, as $f(n)$ is at least linear.
(p.19, s.17)
Naive Algorithm for Matching $P^L$ in $T^L$:
Recall that $t$ and $p$ denote numbers in $T^L$ and $P^L$ respectively.
Each maximal run of an item in the pattern and in the text is represented by its length in $P^L$ and $T^L$.
We are looking for a scale $\alpha$ such that every pattern value, scaled by $\alpha$,
equals the text value aligned with it.
For every aligned position we find the interval in which $\alpha$ must lie.
For each text location, place the pattern starting at that location and calculate the interval
$\left[ \frac{t}{p}, \frac{t+1}{p} \right)$ for each resulting <text, pattern> pair.
This is the interval of possible scales $\alpha$ because:
$\alpha < \frac{t}{p} \Rightarrow \alpha p < t \Rightarrow \lfloor \alpha p \rfloor < t$, so there is no match with that $\alpha$;
$\alpha \ge \frac{t+1}{p} \Rightarrow \alpha p \ge t + 1 \Rightarrow \lfloor \alpha p \rfloor \ge t + 1 > t$, so there is no match with that $\alpha$.
(p.20, s.18)
If the intersection of all intervals $\ne \emptyset$ then $\exists$ a match.
Example:
[Table: $P^L$ aligned against $T^L$ (text indices 1 to 8) with the interval of each <text, pattern> pair; the cell values did not survive extraction.]
Location 1: the intersection of the first intervals is already $\emptyset$. No need to check the other pairs.
Location 2: $[2, 3) \cap [2, 2\tfrac{1}{2}) \cap [2\tfrac{1}{3}, 2\tfrac{2}{3}) \cap [2, 2\tfrac{1}{2}) = [2\tfrac{1}{3}, 2\tfrac{1}{2}) \ne \emptyset$, so there is a match at location 2.
Time: $O(mn)$
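The naive algorithm can be sketched as below. The function name `naive_length_matches` is hypothetical, locations are reported 1-based, and the sample lists in the usage part are illustrative values of our own, not the table from the slides. The restriction $\alpha \ge 1$ is enforced by starting the lower bound at 1.

```python
from fractions import Fraction


def naive_length_matches(p_len, t_len):
    """Return (location, low, high) for each location where some alpha in
    [low, high) with alpha >= 1 scales p_len onto t_len. O(mn) time."""
    m = len(p_len)
    found = []
    for i in range(len(t_len) - m + 1):
        low, high = Fraction(1), None          # alpha >= 1: scale up only
        for p, t in zip(p_len, t_len[i:i + m]):
            low = max(low, Fraction(t, p))     # floor(alpha * p) >= t
            upper = Fraction(t + 1, p)         # floor(alpha * p) <  t + 1
            high = upper if high is None else min(high, upper)
            if low >= high:                    # empty intersection, stop early
                break
        else:
            found.append((i + 1, low, high))
    return found


if __name__ == "__main__":
    # Illustrative (assumed) length sequences.
    print(naive_length_matches([1, 2, 3, 2], [4, 2, 4, 7, 4]))
```

For these assumed values the only match is at location 2, with $\alpha \in [\tfrac{7}{3}, \tfrac{5}{2})$.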
(p.21, s.19)
Improvement: Parameterized Matching
Introduced: Baker 1994
Motivation: detecting "copied" code
$P$ m-matches $T$ at location $i$ if $\exists$ a bijection $\pi : \Sigma \to \Sigma$ such that
$\pi(P) = \pi(p_1) \pi(p_2) \cdots \pi(p_m) = t_i t_{i+1} \cdots t_{i+m-1}$
(p.22, s.20)
Example:
$P = abaccbba$
$T = badadbbaadcd$
At location 3 we have an m-match: $a \to d$, $b \to a$, $c \to b$
Claim (AFM-94): For $\Sigma$ that can be sorted in linear time (e.g. $\Sigma = \{1, \dots, n\}$),
parameterized matching can be done in time $O(n)$
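To make the definition concrete, here is a naive $O(nm)$ check for parameterized matches. It is only an illustration of the definition, not the linear-time algorithm of the claim; the function name `parameterized_matches` is our own and locations are 1-based.

```python
def parameterized_matches(pattern, text):
    """1-based locations where pattern m-matches text (naive check).

    A window m-matches if the symbol correspondence pattern -> text is
    one-to-one, which we track with a forward and a backward map.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        forward, backward = {}, {}
        for p, t in zip(pattern, text[i:i + m]):
            if forward.setdefault(p, t) != t or backward.setdefault(t, p) != p:
                break                      # mapping is not a consistent bijection
        else:
            hits.append(i + 1)
    return hits


if __name__ == "__main__":
    print(parameterized_matches("abaccbba", "badadbbaadcd"))   # [3]
```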
(p.23, s.21)
Lemma
$\exists\, \alpha \in \mathbb{R}$, $\alpha \ge 1$, for which $P^L$ matches $T^L$ at location $i$ scaled by $\alpha$,
only if $P^L$ m-matches $T^L$ at $i$
Proof
Assume $P^L$ does not m-match $T^L$ at location $i$. We look at the possible reasons
for this m-mismatch and show that each rules out a scaled match, which proves the lemma.
Situation (i)
$T^L$:  $a$    $c \ne a$
$P^L$:  $b$    $b$
W.L.O.G. $c \ge a + 1$
Let us check whether a scaled match is possible. We check it for
the smallest possible $c$ (the closest numerator, since the denominator is the same),
which gives the best chance for a scaled match. This is $c = a + 1$.
$\left[ \frac{a}{b}, \frac{a+1}{b} \right) \cap \left[ \frac{a+1}{b}, \frac{a+2}{b} \right) = \emptyset$
(p.24, s.22)
Situation (ii)
$T^L$:  $a$    $a$
$P^L$:  $b$    $c \ne b$
W.L.O.G. $c \ge b + 1$
Let us check whether a scaled match is possible. We check it for
the smallest possible $c$ (the closest denominator, since the numerator is the same),
which gives the best chance for a scaled match (the denominator as near to $b$
as possible). This is $c = b + 1$.
$\left[ \frac{a}{b}, \frac{a+1}{b} \right) \cap \left[ \frac{a}{b+1}, \frac{a+1}{b+1} \right)$
The intersection is non-empty only if
$\frac{a+1}{b+1} > \frac{a}{b} \Rightarrow ab + b > ab + a \Rightarrow b > a$
But this can never happen when we only scale up and never scale down, i.e. with
$\alpha \ge 1$, since then $a \ge b$.
(p.25, s.23)
Algorithm for Real Scaled String Matching
Let $p_{i_1}, p_{i_2}, \dots, p_{i_l}$ be the different numbers in $P^L$
1. m-match $P^L$ in $T^L$
2. For each match, check the intersection of the intervals between
$p_{i_1}, \dots, p_{i_l}$ and the corresponding symbols in $T^L$
End
Example
$P^L$ = 2 3 2 3 2, so $p_{i_1} = 2$, $p_{i_2} = 3$
Note: it does not matter which symbol the first 2 stands for and which symbol
another 2 stands for; the numbers are needed only to generate the intervals whose
intersection is checked (the second step of the algorithm).
Index:  1 2 3 4 5 6 7  8 9  10 11 12
$T^L$:  5 6 5 6 5 6 10 6 10 6  10 7
m-match (index no.) | Scale match (intervals)
1 | $[2\tfrac{1}{2}, 3) \cap [2, 2\tfrac{1}{3}) = \emptyset$
2 | $[3, 3\tfrac{1}{2}) \cap [1\tfrac{2}{3}, 2) = \emptyset$
6 | $[3, 3\tfrac{1}{2}) \cap [3\tfrac{1}{3}, 3\tfrac{2}{3}) = [3\tfrac{1}{3}, 3\tfrac{1}{2})$, a scaled match
7 | $[5, 5\tfrac{1}{2}) \cap [2, 2\tfrac{1}{3}) = \emptyset$
(p.26, s.26)
Important Fact
$\sum_{j=1}^{l} p_{i_j} \le m$
So there are at most $O(\sqrt{m})$ different $p_{i_j}$'s, since $l$ distinct positive integers sum to at least $\frac{l(l+1)}{2}$.
Algorithm Time
$O(n)$: parameterized matching (for $\Sigma = \{1, \dots, n\}$, as claimed in (p.22, s.20))
$O(\sqrt{m})$: verification of interval intersection for each location of a parameterized
match (no more than $O(\sqrt{m})$ different $p_{i_j}$'s to check)
Total: $O(n \sqrt{m})$
(p.27, s.27)
TIGHTER ANALYSIS:
$\exists$ a limit on the number of possible m-matches
Lemma:
Let $|P| = m$, $|T| = n$, and let
$p_{i_1}, \dots, p_{i_l}$ be the different numbers in $P^L$.
Then $\exists$ at most $\frac{2n}{l}$ m-matches of $P^L$ in $T^L$
MEANING:
Since verification, as seen above, is $O(l)$ per m-match,
the lemma implies verification time $O\left( \frac{2n}{l} \cdot l \right) = O(n)$
(p.28, s.28)
Proof:
Recall that each $p_{i_j}$ denotes the first occurrence in $P^L$ of a given number (a run
length). Any later occurrence of the same number, for any symbol, is
not counted as a $p_{i_j}$ again.
Now let us look at a place in the text where there is an m-match. The marked positions of $P^L$ are
the first appearances of $p_{i_1}, \dots, p_{i_l}$:
$P^L$:  …  $p_{i_1}$  …  $p_{i_2}$  …  $p_{i_l}$  …
$T^L$:  …  $a_1$  …  $a_2$  …  $a_l$  …
Every $a_i$ is the text number aligned with $p_{i_i}$. Since we know
that there is an m-match at this position and all the $p_{i_j}$'s are different, all the $a_i$'s must be
different; otherwise there would be no m-match at that position. The sum of these $a_i$'s is
$\sum_{i=1}^{l} a_i \ge \frac{l^2}{2}$
because $l$ distinct positive integers sum to at least $1 + 2 + \cdots + l = \frac{l(l+1)}{2}$.
(p.29, s.29)
Let $x$ be the total number of m-matches in the text.
Our goal is to bound $x$ in terms of $n$.
We find the bound by a counting argument:
The sum of all text elements that match first occurrences of the $p_{i_j}$'s in the pattern, taken over all m-matches, is $\ge \frac{x l^2}{2}$.
This is clear if nothing is counted twice (overlaps left aside).
BUT: this sum also counts overlapping matches. An m-match can start in the
middle of another m-match, so some of
these text elements may be counted more than once.
HOW MANY OVERLAPS CAN THERE BE?
(p.30, s.30)
Each text location is counted by at most $l$ m-matches (there are only $l$ tracked pattern positions, and the $a_i$'s of
one m-match are all different).
Dividing by the maximum possible overlap gives
Total count without overlaps $\ge \frac{x l^2}{2} \cdot \frac{1}{l} = \frac{x l}{2}$
Since the elements of $T^L$ sum to at most $n$:
$\frac{x l}{2} \le n$
which gives a limit on the maximum number of parameterized matches in the text: $x \le \frac{2n}{l}$