
A* Search
A* (pronounced "A star") is a best-first graph
search algorithm that finds the least-cost path from
a given initial node to one goal node out of one or
more possible goals.
Definitions
A* uses a distance-plus-estimate function, denoted f(x), to
determine the order in which the search visits nodes in the tree
induced by the search. It is the sum of two functions,
f(x) = g(x) + h(x):
• the path-cost function g(x), the cost of the path from the
start node to the current node, and
• an admissible heuristic estimate h(x) of the distance from
the current node to the goal.
An admissible h(x) must never overestimate the distance to
the goal. For an application like routing, h(x) might be the
straight-line distance to the goal, since that is physically the
smallest possible distance between any two points.
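The ordering by f(x) = g(x) + h(x) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `a_star`, the four-point road map, and the `straight_line` heuristic are all made up for the example, and h is assumed to be admissible.

```python
import heapq

def a_star(start, goals, neighbors, h):
    """Return (cost, path) of a least-cost path from start to any node in goals.

    neighbors(x) yields (y, edge_cost) pairs; h(x) is assumed admissible.
    """
    open_heap = [(h(start), 0, start)]        # entries are (f, g, node)
    best_g = {start: 0}                       # cheapest known g(x) per node
    parent = {start: None}
    while open_heap:
        f, g, x = heapq.heappop(open_heap)    # node with the smallest f = g + h
        if g > best_g[x]:
            continue                          # stale entry: a cheaper path exists
        if x in goals:
            path = []
            while x is not None:              # walk parent links back to start
                path.append(x)
                x = parent[x]
            return g, path[::-1]
        for y, cost in neighbors(x):
            new_g = g + cost
            if new_g < best_g.get(y, float("inf")):
                best_g[y] = new_g
                parent[y] = x
                heapq.heappush(open_heap, (new_g + h(y), new_g, y))
    return None                               # no goal is reachable

# Hypothetical routing example: points in the plane, h = straight-line distance.
coords = {"S": (0, 0), "A": (1, 0), "B": (1, 1), "G": (2, 1)}
roads = {"S": [("A", 1.0), ("B", 1.5)], "A": [("G", 1.6)],
         "B": [("G", 1.0)], "G": []}

def straight_line(x, goal="G"):
    (x1, y1), (x2, y2) = coords[x], coords[goal]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

cost, path = a_star("S", {"G"}, lambda x: roads[x], straight_line)
print(cost, path)
```

Every road in the example is at least as long as the straight line between its endpoints, which is what keeps the heuristic admissible here.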
An A* algorithm for Edit Distance
Edit Distance DE(X,Y) measures how close string X is to string Y.
DE(X,Y) is the cost of the minimum-cost transformation t : X → Y,
where t is a sequence of operations (insertion, equal substitution,
unequal substitution, and deletion). The cost of t is the sum of the
operation costs, where each operation costs 1 except for equal
substitution, which costs 0.
[Figure: the strings ABBAC and BAACA aligned by a sequence of edit operations]
The cost of this transformation is 3, which happens to be minimal.
Dynamic Programming Solution
(an O(MN) solution)
Decomposition : Last Operation is Delete, Substitute, or Insert
Atomic Problems : X prefix or Y prefix empty
Table :
Rows 0 .. M for X prefix characters,
Columns 0 .. N for Y prefix characters
Table Entry : DE(Xi, Yj)
Composition : δ = cost(Substitution) = 1 if xi != yj and 0 otherwise.
DE(Xi, Yj) = min{ DE(Xi-1, Yj) + 1,
DE(Xi-1, Yj-1) + δ,
DE(Xi, Yj-1) + 1 }
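The table-filling scheme above can be sketched in a few lines. The function name `edit_distance` is chosen for the example, and the base cases (an empty prefix costs as many insertions or deletions as the other prefix has characters) are the usual ones for this recurrence rather than something stated on the slide.

```python
def edit_distance(x, y):
    """Fill the (M+1) x (N+1) table; d[i][j] is the distance between the
    i-character prefix of x and the j-character prefix of y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                      # atomic problem: Y prefix empty, i deletions
    for j in range(1, n + 1):
        d[0][j] = j                      # atomic problem: X prefix empty, j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,             # last operation: deletion
                          d[i - 1][j - 1] + delta,     # last operation: substitution
                          d[i][j - 1] + 1)             # last operation: insertion
    return d[m][n]

print(edit_distance("abbab", "bbaba"))
```

Each of the (M+1)(N+1) entries is computed once from three neighbours, which is where the O(MN) running time comes from.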
Edit Distance as a Shortest Path Problem
Define a transformation graph GXY = (V,E) as follows:
• The set V of nodes (vertices) = {0 .. M} × {0 .. N}, where
node np,q represents the state of having transformed a p-length
prefix of X into a q-length prefix of Y.
• The set E of edges represents the operations of
• deletion, connecting node np,q to np+1,q with length 1
• substitution, connecting node np,q to np+1,q+1 with
length 0 or 1 depending on whether xp+1 = yq+1 or not
• insertion, connecting node np,q to np,q+1 with length 1
The start and goal nodes are n0,0 and nM,N.
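To make the graph view concrete, the sketch below generates the edges leaving a node on demand and runs a plain uniform-cost (Dijkstra) search from n0,0 to nM,N. The names `successors` and `shortest_path_length` are illustrative only. Python strings are 0-indexed, so `x[p]` plays the role of the (p+1)th character of X.

```python
import heapq

def successors(x, y, p, q):
    """Yield ((p', q'), length) for each edge leaving node n_{p,q}."""
    m, n = len(x), len(y)
    if p < m:
        yield (p + 1, q), 1                              # deletion
    if q < n:
        yield (p, q + 1), 1                              # insertion
    if p < m and q < n:
        yield (p + 1, q + 1), 0 if x[p] == y[q] else 1   # substitution

def shortest_path_length(x, y):
    """Uniform-cost search from n_{0,0} to n_{M,N}."""
    goal = (len(x), len(y))
    dist = {(0, 0): 0}
    heap = [(0, (0, 0))]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist[node]:
            continue                                     # stale heap entry
        for nxt, w in successors(x, y, *node):
            if d + w < dist.get(nxt, float("inf")):
                dist[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))

print(shortest_path_length("abbab", "bbaba"))
```

Because the edges are produced lazily, a search that is guided well never has to build the full (M+1)(N+1) node set.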
Introduction
Edit Distance – Based on Single-Character Edit Operations
• Insertion : ε → a
• Inserts an “a” into the target without affecting the source;
cost = 1
• Equal Substitution : a → a
• Substitutes an “a” into the target for an “a” in the source;
cost = 0
• Unequal Substitution : a → b
• Substitutes a “b” into the target for an “a” in the source;
cost = 1
• Deletion : a → ε
• Deletes an “a” from the source without affecting the target;
cost = 1
Example of a Transformation Graph
The vertices of the transformation graph T correspond to prefix pairs
of X and Y. The edges of T are directed and correspond to the
single-character edit operations that transform one prefix pair into another.
•X = abbab
•Y = bbaba
DE(X,Y) = cost of the shortest path from
the start vertex to the goal vertex = 2
A Frequency-Based Lower Bound Function h
• Let Xi be the suffix of X beginning with the ith character, and let Yj be
similarly defined.
• If X = abbab and Y = bbaba, then X2 = bbab and Y2 = baba.
• Excess(X2,Y2,a) = 0
• Def(X2,Y2,a) = 1
• Excess(X2,Y2,b) = 1
• Def(X2,Y2,b) = 0
• Excess(X2,Y2) is the sum of excesses over the alphabet and
Def(X2,Y2) is the sum of deficiencies.
h(X2,Y2) = max{Excess(X2,Y2), Def(X2,Y2)} = max{1, 1} = 1 is a lower
bound to the length of the shortest path from the vertex to the goal.
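A small sketch of h on a pair of suffixes, using character counts. The helper names `excess_and_deficiency` and `h` are chosen for the example; `collections.Counter` subtraction is used because it keeps only positive differences, which matches the "if greater, else 0" form of excess and deficiency.

```python
from collections import Counter

def excess_and_deficiency(x_suffix, y_suffix):
    """Per-character excess and deficiency of x_suffix relative to y_suffix."""
    fx, fy = Counter(x_suffix), Counter(y_suffix)
    excess = fx - fy          # characters x_suffix has too many of
    deficiency = fy - fx      # characters x_suffix has too few of
    return excess, deficiency

def h(x_suffix, y_suffix):
    """max{total excess, total deficiency}, a lower bound on the remaining cost."""
    excess, deficiency = excess_and_deficiency(x_suffix, y_suffix)
    return max(sum(excess.values()), sum(deficiency.values()))

X, Y = "abbab", "bbaba"
X2, Y2 = X[1:], Y[1:]         # suffixes beginning with the 2nd character
print(h(X2, Y2))
```

The bound is cheap to evaluate but ignores character order, so it can be 0 for two different strings that are permutations of each other, as with the full strings X and Y here.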
Classification and Strings
Applications of Edit Distance
• DNA analysis
• Classification of heartbeats
• Handwriting recognition
• Spelling correction
• Error correction of variable-length codes
• Speech recognition
Discrete Directional Alphabet
Mapping EKG’s to Strings
Classification as Path Problem
• LB(Start,Goal-1) = 0
• LB(Start,Goal-2) = 3
Lower Bounds to Edit Distance
Lower Bound Based on Frequency
Let fa(X) and fa(Y) be the frequencies of a in X and Y.
Define Ex(a,X,Y) = fa(X) – fa(Y) if fa(X) > fa(Y) else 0
Define Def(a,X,Y) = fa(Y) – fa(X) if fa(Y) > fa(X) else 0
For any a, both Ex(a,X,Y) ≤ D(X,Y) and Def(a,X,Y) ≤ D(X,Y).
For distinct a and b, Ex(a,X,Y) + Ex(b,X,Y) ≤ D(X,Y).
max { Σa Ex(a,X,Y), Σa Def(a,X,Y) } ≤ D(X,Y)
LB(i,j,X,Y), computed for the ith suffix of X and the jth
suffix of Y, is a lower bound to the remaining distance after
having computed the edit distance for the ith and jth
prefixes of X and Y.
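One way to see why the frequency bound holds is to track how a single edit operation can change the total excess and the total deficiency. The symbols E, F, Z and the counts k_del, k_ins, k_sub below are introduced only for this argument.

```latex
% Z is the working string, starting at X and ending at Y.
\[
E(Z) = \sum_a \mathrm{Ex}(a,Z,Y), \qquad F(Z) = \sum_a \mathrm{Def}(a,Z,Y).
\]
% Effect of one operation on Z:
%   deletion (cost 1):             one count f_a(Z) drops by 1, so E falls by at most 1
%   insertion (cost 1):            one count f_a(Z) rises by 1, so F falls by at most 1
%   unequal substitution (cost 1): f_a(Z) drops and f_b(Z) rises, so E and F each fall by at most 1
%   equal substitution (cost 0):   no count changes, so E and F are unchanged
% Since E(Y) = F(Y) = 0, a transformation with k_del deletions, k_ins insertions
% and k_sub unequal substitutions must satisfy
\[
E(X) \le k_{\mathrm{del}} + k_{\mathrm{sub}}, \qquad
F(X) \le k_{\mathrm{ins}} + k_{\mathrm{sub}},
\]
\[
\max\{E(X),\, F(X)\} \;\le\; k_{\mathrm{del}} + k_{\mathrm{ins}} + k_{\mathrm{sub}} \;=\; \text{cost of the transformation}.
\]
```

Taking the minimum-cost transformation on the right-hand side gives the bound on D(X,Y).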
• Since X has a deficiency of 1 b with Y1 as a target, 1 is a
lower bound to D(X,Y1).
• Since X has a deficiency of 2 a’s with Y2 as a target and an
excess of 1 b, 2 is a lower bound to D(X,Y2).
• Since X has a deficiency of 3 b’s with Y3 as a target and an
excess of 2 a’s, 3 is a lower bound to D(X,Y3).
• Consequently the initial vertices of the 3 transformation
graphs are organized into a priority queue as shown to the left.
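The ordering of the initial vertices can be reproduced with a binary heap. The strings below are hypothetical: the slide's X, Y1, Y2, and Y3 are in a figure that is not reproduced here, so `x` and `targets` were picked only to have the same excesses and deficiencies as the bullets above. The name `lower_bound` is likewise chosen for the example.

```python
import heapq
from collections import Counter

def lower_bound(x, y):
    """max{total excess, total deficiency} of x relative to target y."""
    fx, fy = Counter(x), Counter(y)
    return max(sum((fx - fy).values()), sum((fy - fx).values()))

# Hypothetical strings (not from the slide), chosen to match its counts:
x = "aabb"
targets = {"Y1": "aabbb",   # x is short 1 b
           "Y2": "aaaab",   # x is short 2 a's and has 1 extra b
           "Y3": "bbbbb"}   # x is short 3 b's and has 2 extra a's

# One queue entry per transformation graph, keyed by the bound at its start vertex.
queue = [(lower_bound(x, y), name) for name, y in targets.items()]
heapq.heapify(queue)
order = [heapq.heappop(queue) for _ in range(len(targets))]
print(order)
```

The target with the smallest bound is examined first, and the others are touched only if the search in the first graph becomes more expensive than their bounds.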
A* Search for Closest Target
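f = g + h
The search keeps track of the last operation, since an
insertion need not be followed by a deletion, or vice versa:
such a pair can always be replaced by a single substitution
of no greater cost.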
• Finds a distance of 1 to Y1 in 3 steps.
• Y1 must be a closest goal since bnd + dist (that is, h + g)
is minimized.
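A minimal sketch of a search over several transformation graphs at once, under stated assumptions: one shared priority queue ordered by f = g + h, the frequency bound as h, and the last operation stored in each state so that an insertion is never expanded right after a deletion or the reverse. The function names and the example strings are hypothetical, and the expansion count it reports need not match the step count quoted on the slide.

```python
import heapq
from collections import Counter

def lower_bound(x, y):
    """Frequency-based bound: max{total excess, total deficiency}."""
    fx, fy = Counter(x), Counter(y)
    return max(sum((fx - fy).values()), sum((fy - fx).values()))

def closest_target(x, targets):
    """Return (distance, index of a closest target, number of expansions)."""
    m = len(x)
    # Heap entries: (f, g, target index k, p, q, last operation).
    heap = [(lower_bound(x, y), 0, k, 0, 0, "") for k, y in enumerate(targets)]
    heapq.heapify(heap)
    best_g = {}                                  # cheapest g at which a state was expanded
    expansions = 0
    while heap:
        f, g, k, p, q, last = heapq.heappop(heap)
        y = targets[k]
        if p == m and q == len(y):
            return g, k, expansions              # goal popped with minimal f = g
        state = (k, p, q, last)
        if best_g.get(state, float("inf")) <= g:
            continue                             # already expanded at least as cheaply
        best_g[state] = g
        expansions += 1
        moves = []
        if p < m and q < len(y):
            moves.append((p + 1, q + 1, 0 if x[p] == y[q] else 1, "sub"))
        if p < m and last != "ins":              # no deletion right after an insertion
            moves.append((p + 1, q, 1, "del"))
        if q < len(y) and last != "del":         # no insertion right after a deletion
            moves.append((p, q + 1, 1, "ins"))
        for p2, q2, cost, op in moves:
            g2 = g + cost
            h2 = lower_bound(x[p2:], y[q2:])     # bound on the remaining suffixes
            heapq.heappush(heap, (g2 + h2, g2, k, p2, q2, op))
    return None

# Hypothetical example strings (not from the slide).
distance, index, expansions = closest_target("aabb", ["aabbb", "aaaab", "bbbbb"])
print(distance, index)
```

In this example the start vertices of the second and third graphs stay in the queue with f values of 2 and 3 and are never expanded, because a goal with f = 1 is popped first.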