4up - Faculty of Computer Science - Free University of Bozen

Outline
Approximation: Theory and Algorithms
1
Complexity of the Tree Edit Distance Algorithm
Basic Complexity Formula
Collapsed Depth and Key Roots
Complexity Results
2
Search Space Reduction for the Tree Edit Distance
Approximate Join
Lower Bound: Tree Traversal
Upper Bound: Constrained Edit Distance
3
Conclusion
Edit Distance Complexity, Upper and Lower Bounds
Nikolaus Augsten
Free University of Bozen-Bolzano
Faculty of Computer Science
DIS
Unit 9 – Mai 8, 2009
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Complexity of the Tree Edit Distance Algorithm
Unit 9 – Mai 8, 2009
1 / 33
Basic Complexity Formula
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Complexity of the Tree Edit Distance Algorithm
Notation
Unit 9 – Mai 8, 2009
2 / 33
Basic Complexity Formula
forest-dist: Time Complexity
forest-dist(i , j, l1 , l2 , td)
fd[l1 [i ] − 1..i , l2 [j] − 1..j] : empty array;
fd[l1 [i ] − 1, l2 [j] − 1] = 0;
for di = l1 [i ] to i do fd[di , l2 [j] − 1] = fd[di − 1, l2 [j] − 1] + ωdel ;
for dj = l2 [j] to j do fd[l1 [i ] − 1, dj ] = fd[l1 [i ] − 1, dj − 1] + ωins ;
for di = l1 [i ] to i do
for dj = l2 [j] to j do
if l [di ] = l [i ] and l [dj ] = l [j] then
fd[di , dj ] = min(. . .);
td[di , dj ] = f [di , dj ];
else fd[di , dj ] = min(. . .);
Notation:
|T| is the number of nodes in T
depth(v) is the number of ancestors of v (including v itself)
depth(T) is the maximum depth of a node in T
leaves(T) is the set of leaves of T
t(i) is the subtree rooted in node i
Input nodes are i and j.
They are root nodes of subtrees t1 (i ) and t2 (j).
The nested loop is executed |t1 (i )| × |t2 (j)| times.
⇒ Time complexity O(|t1 (i )| × |t2 (j)|)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
4 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
5 / 33
Complexity of the Tree Edit Distance Algorithm
Basic Complexity Formula
Complexity of the Tree Edit Distance Algorithm
tree-edit-dist: Time Complexity
Collapsed Depth and Key Roots
Collapsed Depth
tree-edit-dist(T1 , T2 )
td[1..|T1 |, 1..|T2 |] : empty array for tree distances;
l1 = lmld(root(T1 )); kr1 = kr(l1 , |leaves(T1 )|);
l2 = lmld(root(T2 )); kr2 = kr(l2 , |leaves(T2 )|);
for x = 1 to |kr1 | do
for y = 1 to |kr2 | do
forest-dist(kr1 [x], kr2 [y ], l1 , l2 , td);
Definition: The collapsed depth of a node v in T is
cdepth(v) = |anc(v) ∩ kr (T)|,
i.e., the number of ancestors of v (including v itself) that are key
roots.
Computing l1/2 and kr1/2 is linear, O(|T1 | + |T2 |)
Main loop executes forest-dist() |kr1 | × |kr2 | times.
Complexity:
X X
X
X
|t1 (i )| × |t2 (j)| =
|t1 (i )| ×
|t2 (j)|
i ∈kr1 j∈kr2
i ∈kr1
j∈kr2
The following lemmas help us to reformulate this expression.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Complexity of the Tree Edit Distance Algorithm
Unit 9 – Mai 8, 2009
6 / 33
Collapsed Depth and Key Roots
Nikolaus Augsten (DIS)
Complexity of the Tree Edit Distance Algorithm
Collapsed Depth
Unit 9 – Mai 8, 2009
7 / 33
Collapsed Depth and Key Roots
Collapsed Depth
Now we can rewrite the complexity formula:
Lemma (Collapsed Depth)
For a tree T with key roots kr (T)
X
|t(k)| =
k∈kr (T)
X
|T|
X
cdepth(k)
k=1
X
j∈kr2
|t2 (j)| =
|T1 |
X
cdepth(i ) ×
i =1
|T2 |
X
cdepth(j)
j=1
cdepth(T) ≥ cdepth(k) for a node k of T, thus
|T1 |
X
A node i of T is counted whenever it appears in a subtree t(k).
Node i is in the subtree t(k) iff k is the ancestor of i.
Only the subtrees of key roots are considered.
|T2 |
X
cdepth(j) ≤ |T1 ||T2 |cdepth(T1 )cdepth(T2 )
j=1
Two obvious upper bounds for the collapsed depth:
the tree depth: cdepth(T) ≤ depth(T)
the number of key roots: cdepth(T) ≤ |kr (T)|
cdepth(i ) is the number of ancestor key roots of i (definition of
collapsed depth).
Unit 9 – Mai 8, 2009
cdepth(i ) ×
i =1
Thus a node i is counted once for each ancestor key root.
Approximation: Theory and Algorithms
|t1 (i )| ×
i ∈kr1
Proof.
Consider the left-hand formula:
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
We show that the number of key roots matches the number of leaves.
8 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
9 / 33
Complexity of the Tree Edit Distance Algorithm
Collapsed Depth and Key Roots
Complexity of the Tree Edit Distance Algorithm
Complexity Results
Number of Key Roots
Complexity of the Tree Edit Distance Algorithm
Lemma (Number of Key Roots)
Theorem (Complexity of the Tree Edit Distance Algorithm)
The number of key roots of a tree is equal to the number of leaves:
Let D1 and D2 denote the depth, L1 and L2 the number of leave nodes,
and N1 and N2 the total number of nodes of two trees T1 and T2 ,
respectively.
|kr (T)| = |leaves(T)|
Proof.
We show that l () is a bijection from the key roots kr (T) to the leaves(T):
(a) Injection – for any i , j ∈ kr (T), i 6= j ⇒ l (i ) 6= l (j) :
If i > j and l (i ) = l (j), j can not be a key root by definition.
Analogous rational hold for j > i .
(b) Surjection – Each leaf x has a key root i ∈ kr (T) such that l (i ) = x:
If there is no node i > x with l (i ) = l (x), then by definition x itself is
a key root (l (x) = x is always true). Otherwise i is the key root of x.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Complexity of the Tree Edit Distance Algorithm
Unit 9 – Mai 8, 2009
10 / 33
Complexity Results
(1) The runtime of the tree edit distance algorithm is
O(N1 N2 min(D1 , L1 ) min(D2 , L2 )).
(2) Let N = max(N1 , N2 ). For full, balanced, binary trees the runtime is
O(N 2 log2 N).
(3) In the worst case min(D, L) = O(N) and the runtime is O(N 4 ).
(4) The algorithm needs O(N1 N2 ) space.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Complexity of the Tree Edit Distance Algorithm
Proof of the Complexity Theorem
Unit 9 – Mai 8, 2009
11 / 33
Complexity Results
Recent Improvements of the Complexity
Proof.
(1) Runtime (general formula): We have shown before, that the
complexity is O(|T1 ||T2 |cdepth(T1 ) cdepth(T2 )). As
cdepth(T) ≤ |kr (T)| = |leaves(T)| (see definition of cdepth(T) and
previous lemma) and cdepth(T) ≤ depth(T) (follows from the
definition of cdepth(T)), if follows that
cdepth(T) ≤ min(depth(T), |leaves(T)|).
(2) Full, balanced, binary trees: In this case depth(T) = O(log(|T|)).
(3) Worst case: A full binary tree (i.e., each node has zero or two
children) where each non-leaf nodes has at least one leaf child:
min(depth(T), |leaves(T)|) = O(|T|).
Klein [Kle98] improves the worst case for the runtime to
O(|T1 |2 |T2 | log(|T2 |), thus from O(N 4 ) to O(N 3 log(N)).
Dulucq and Touzet [DT03] also give an O(N 3 log(N)) algorithm.
Demaine et al. [DMRW07] give an O(N 3 ) algorithm. They show that
the algorithm is optimal among decomposition algorithms (algorithms
as in [ZS89, Kle98, DT03]), i.e., the lower bound is also Ω(N 3 ).
(4) Space: The size of the tree distance matrix td is |T1 | × |T2 |. In each
call of forest-dist() we need a matrix of size O(|T1 | × |T2 |), which is
freed when we exit the subroutine.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
12 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
13 / 33
Search Space Reduction for the Tree Edit Distance
Approximate Join
Search Space Reduction for the Tree Edit Distance
Definition: Approximate Join
Approximate Join Algorithm
approxJoin(S1 , S2 )
Definition (Approximate Join)
Given two sets of trees, S1 and S2 , and a distance threshold τ , let
δt (Ti , Tj ) be a function that assesses the edit distance between two trees
Ti ∈ S1 and Tj ∈ S2 . The approximate join operation between two sets of
trees reports in the output all pairs of trees (Ti , Tj ) ∈ S1 × S2 such that
δt (Ti , Tj ) ≤ τ .
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Approximate Join
Unit 9 – Mai 8, 2009
15 / 33
Lower Bound: Tree Traversal
for each Ti ∈ S1 do
for each Tj ∈ S2 do
if upperBound(Ti , Tj ) ≤ τ then
output(Ti , Tj )
if lowerBound(Ti , Tj ) ≤ τ then
if δt (Ti , Tj ) ≤ τ then
output(Ti , Tj )
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Preorder and Postorder Traversal Strings
Unit 9 – Mai 8, 2009
16 / 33
Lower Bound: Tree Traversal
Notes: Traversal Strings and Tree Inequality
Each node label is a single character of an alphabet Σ.
Traversal Strings:
pre(T) is the string of T’s node labels in preorder
post(T) is the string of T’s node labels in postorder
If the traversal strings of two trees are equal, the trees can still be
different:
Lemma (Tree Inequality)
T1
a
Let pre(T1 ) and pre(T2 ) be the preorder strings, and post(T1 ) and
post(T2 ) be the postorder strings of two trees T1 and T2 , respectively.
Then
b
pre(T1 ) 6= pre(T2 ) or post(T1 ) 6= post(T2 ) ⇒ T1 6= T2
6=
a
pre(T1 ) = aba
T2
a
b
a
= pre(T2 ) = aba
Proof.
The inversion of the argument is obviously true:
T1 = T2 ⇒ pre(T1 ) = pre(T2 ) and post(T1 ) = post(T2 )
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
17 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
18 / 33
Search Space Reduction for the Tree Edit Distance
Lower Bound: Tree Traversal
Search Space Reduction for the Tree Edit Distance
Lower Bound
Lower Bound: Tree Traversal
Illustration for the Lower Bound Proof (Preorder)
Theorem (Lower Bound)
If the trees are at tree edit distance k, then the string edit distance
between their preorder or postorder traversals is at most k.
ins(v, p, k, m)
→
←
del (v)
p
Proof.
Tree operations map to string operations (illustration on next slide):
Insertion (ins(v, p, k, m)): Let t1 . . . tf be the subtrees rooted in the
children of p. Then the preorder traversal of the subtree rooted in p is
p pre(t1 ) . . . pre(tk−1 )pre(tk ) . . . pre(tm )pre(tm+1 ) . . . pre(tf ).
t1
...
tk−1 tk
...
tm tm+1
...
tf
t1
ppre(t1 ) . . . pre(tk−1 ) v pre(tk ) . . . pre(tm )pre(tm+1 ) . . . pre(tf ).
The string distance is 1. Analog rationale for postorder.
...
v
tk−1
...
tk
p pre(t1 ) . . . pre(tk−1 )
pre(tk ) . . . pre(tm )
pre(tm+1 ) . . . pre(tf )
Inserting v moves the subtrees k to m:
p
tm+1
...
tf
tm
ppre(t1 ) . . . pre(tk−1 )
v pre(tk ) . . . pre(tm )
pre(tm+1 ) . . . pre(tf )
Deletion: Inverse of insertion.
Rename: With node rename a single string character is renamed.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Unit 9 – Mai 8, 2009
19 / 33
Lower Bound: Tree Traversal
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Lower Bound
Unit 9 – Mai 8, 2009
20 / 33
Lower Bound: Tree Traversal
Example: Traversal String Lower Bound
T1
From the lower bound theorem it follows that
T2
f6
max(δs (pre(T1 ), pre(T2 )), δs (post(T1 ), post(T2 ))) ≤ δt (T1 , T2 )
where δs and δt are the string and the tree edit distance, respectively.
The string edit distance can be computed faster:
e5
d4
a1
f6
e5
c4
c3
d3
a1
b2
b2
2
string edit distance runtime: O(n )
tree edit distance runtime: O(n3 )
pre(T1 ) = fdacbe
post(T1 ) = abcdef
Approximate join: match all trees with δt (T1 , T2 ) ≤ τ
if max(δs (pre(T1 ), pre(T2 )), δs (post(T1 ), post(T2 ))) > τ
then δt (T1 , T2 ) > τ
thus we do not have to compute the expensive tree edit distance
pre(T2 ) = fcdabe
post(T2 ) = abdcef
δs (pre(T1 ), pre(T2 )) = 2
δs (post(T1 ), post(T2 )) = 2
δt (T1 , T2 ) = 2
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
21 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
22 / 33
Search Space Reduction for the Tree Edit Distance
Lower Bound: Tree Traversal
Search Space Reduction for the Tree Edit Distance
Example: Traversal String Lower Bound
Edit Mapping
Recall the definition of the edit mapping:
The string distances of preorder and postorder may be different.
Definition (Edit Mapping)
The string distances and the tree distance may be different.
T1
T2
a
a
a
b
b
c
a
c
pre(T1 ) = abac
post(T1 ) = bcaa
An edit mapping M between T1 and T2 is a set of node pairs that satisfy
the following conditions:
(1) (a, b) ∈ M ⇒ a ∈ N(T1 ), b ∈ N(T2 )
(2) for any two pairs (a, b) and (x, y) of M:
(i) a = x ⇔ b = y (one-to-one condition)
(ii) a is to the left of x1 ⇔ b is to the left of y
(order condition)
(iii) a is an ancestor of x ⇔ b is an ancestor of y
(ancestor condition)
pre(T2 ) = abac
post(T2 ) = acba
δs (pre(T1 ), pre(T2 )) = 0
δs (post(T1 ), post(T2 )) = 2
(3) Optional: a = root(T1 ) and b = root(T1 ) ⇒ (a, b) ∈ M
(forbid deleting the root node)
δt (T1 , T2 ) = 3
1
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Upper Bound: Constrained Edit Distance
Unit 9 – Mai 8, 2009
23 / 33
Upper Bound: Constrained Edit Distance
i.e., a precedes x in both preorder and postorder
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Constrained Edit Distance
Unit 9 – Mai 8, 2009
24 / 33
Upper Bound: Constrained Edit Distance
Example: Constrained Edit Distance
T1
a
We compute a special case of the edit distance to get a faster
algorithm.
T2
a
c
b
lca(a, b) is the lowest common ancestor of a and b.
Additional requirement on the mapping M:
d
(4) for any pairs (a1 , b1 ), (a2 , b2 ), (x, y) of M:
lca(a1 , a2 ) is a proper ancestor of x
⇔
lca(b1 , b2 ) is a proper ancestor of y.
e
f
d
g
h
e
i
f
g
Constrained edit distance (dashed lines): δc (T1 , T2 ) = 5
constrained mapping Mc = {(a, a), (d, d), (c, i), (f , f )(g , g )}
edit sequence: ren(c, i), del(b), del(e), ins(h), ins(e)
Intuition: Two distinct subtrees of T1 are mapped to two distinct
subtrees of T2 .
Unconstrained edit distance (dotted lines): δt (T1 , T2 ) = 3
mapping Mt = {(a, a), (d, d), (e, e), (c, i), (f , f )(g , g )}
edit sequence: ren(c, i), del(b), ins(h)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
25 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
26 / 33
Search Space Reduction for the Tree Edit Distance
Upper Bound: Constrained Edit Distance
Search Space Reduction for the Tree Edit Distance
Example: Constrained Edit Distance
T1
a
a
c
e
Complexity of the Constrained Edit Distance
T2
b
d
f
d
g
Theorem (Complexity of the Constrained Edit Distance)
h
e
Let T1 and T2 be two trees with |T1 | and |T2 | nodes, respectively. There
is an algorithm that computes the constrained edit distance between T1
and T2 with runtime
O(|T1 ||T2 |).
i
f
g
Proof.
(e, e) violates the 4th condition of the constrained mapping:
See [Zha95, GJK+ 02].
lca(e, f ) in T1 is a
a is a proper ancestor of d in T1
assume (e, e), (f , f ), (d, d) ∈ Mc
lca(e, f ) in T2 is h
h is not a proper ancestor of d in T2
Nikolaus Augsten (DIS)
Upper Bound: Constrained Edit Distance
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Unit 9 – Mai 8, 2009
27 / 33
Upper Bound: Constrained Edit Distance
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Search Space Reduction for the Tree Edit Distance
Constrained Edit Distance: Upper Bound
Unit 9 – Mai 8, 2009
28 / 33
Upper Bound: Constrained Edit Distance
Use of the Upper Bound
Theorem (Upper Bound)
The constrained edit distance can be computed faster:
Let T1 and T2 be two trees, let δt (T1 , T2 ) be the unconstrained and
δc (T1 , T2 ) be the constrained tree edit distance, respectively. Then
constrained edit distance runtime: O(n2 )
unconstrained edit distance runtime: O(n3 )
Approximate join: match all trees with δt (T1 , T2 ) ≤ τ
δt (T1 , T2 ) ≤ δc (T1 , T2 )
if δc (T1 , T2 ) ≤ τ then also δt (T1 , T2 ) ≤ τ .
thus we do not have to compute the expensive tree edit distance
Proof.
See [GJK+ 02].
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
29 / 33
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
30 / 33
Conclusion
Conclusion
Summary
What’s Next?
Tree Edit Distance Complexity
Search Space Reduction
Reference Sets (Upper and Lower Bound)
Binary Branch Distance (Lower Bound)
Lower Bound: Traversal Strings
Upper Bound: Constrained Edit Distance
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
32 / 33
Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren
Weimann.
An optimal decomposition algorithm for tree edit distance.
In Proceedings of the 34th International Colloquium on Automata,
Languages and Programming (ICALP 2007), volume 4596 of Lecture
Notes in Computer Science, pages 146–157, Wroclaw, Poland, July
2007. Springer.
Serge Dulucq and Hélène Touzet.
Analysis of tree edit distance algorithms.
In Combinatorial Pattern Matching (CPM 2003), volume 2676 of
Lecture Notes in Computer Science, pages 83–95, Berlin, 2003.
Springer.
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
33 / 33
In Proceedings of the ACM SIGMOD International Conference on
Management of Data, pages 287–298, Madison, Wisconsin, 2002.
ACM Press.
Philip N. Klein.
Computing the edit-distance between unrooted ordered trees.
In Proceedings of the 6th European Symposium on Algorithms,
volume 1461 of Lecture Notes in Computer Science, pages 91–102,
Venice, Italy, 1998. Springer.
Kaizhong Zhang.
Algorithms for the constrained editing distance between ordered
labeled trees and related problems.
Pattern Recognition, 28(3):463–474, 1995.
Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and
Ting Yu.
Approximate XML joins.
Nikolaus Augsten (DIS)
Nikolaus Augsten (DIS)
33 / 33
Kaizhong Zhang and Dennis Shasha.
Simple fast algorithms for the editing distance between trees and
related problems.
SIAM Journal on Computing, 18(6):1245–1262, 1989.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 9 – Mai 8, 2009
33 / 33

Download Report

4up - Faculty of Computer Science - Free University of Bozen

Paperzz.com

Your Paperzz