Approximation: Theory and Algorithms Outline Binary Tree Example

Outline
Approximation: Theory and Algorithms
1
Binary Branch Distance
Binary Representation of a Tree
Binary Branches
Lower Bound for the Edit Distance
Complexity
2
The pq-Gram Distance
Definition
Computing the pq-Grams
Fanout Weighting and Lower Bound
Experiments
3
Conclusion
Binary Branches and pq-Grams
Nikolaus Augsten
Free University of Bozen-Bolzano
Faculty of Computer Science
DIS
Unit 11 – May 22, 2009
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
1 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Representation of a Tree
Binary Branch Distance
Binary Tree
Unit 11 – May 22, 2009
2 / 50
Binary Representation of a Tree
Example: Binary Tree
Two different binary trees: TB = (N, El , Er )
In a binary tree
TB1 = ({a, b, c, d, e, f , g }, {(a, b), (b, c), (d, e), (e, f )}, {(a, d), (e, g )})
TB2 = ({a, b, c, d, e, f , g }, {(a, b), (b, c), (e, f )}, {(a, d), (d, e), (e, g )})
each node has at most two children;
left child and right child are distinguished:
a node can have a right child without having a left child;
a
TB1
b
Notation: TB = (N, El , Er )
TB denotes a binary tree
N are the nodes of the binary tree
El and Er are the edges to the left and right children, respectively
c
d
e
binary tree
each node has exactly zero or two children.
b
6=
f g
A full binary tree:
Full binary tree:
a
TB2
c
e
f g
a
b
c h
d
d
e
i
f g
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
4 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
5 / 50
Binary Branch Distance
Binary Representation of a Tree
Binary Branch Distance
Binary Representation of a Tree
Example: Binary Tree Transformation
Represent tree T as a binary tree:
Binary tree transformation:
(i) link all neighboring siblings in a tree with edges
(ii) delete all parent-child edges except the edge to the first child
→
T
a
Transformation maintains
label information
structure information
b
b
binary representation of T
a
e
b
c d c d
Original tree can be reconstructed from the binary tree:
c
a left edge represents a parent-child relationships in the original tree
a right edges represents a right-sibling relationship in the original tree
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Binary Representation of a Tree
Unit 11 – May 22, 2009
6 / 50
c
d
e
d
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Representation of a Tree
Binary Branch Distance
Normalized Binary Tree Representation
b
Unit 11 – May 22, 2009
7 / 50
Binary Representation of a Tree
Example: Normalized Binary Tree
Transforming T to the normalized binary tree B(T):
We extend the binary tree with null nodes ǫ as follows:
→
T
a
a null node for each missing left child of a non-null node
a null node for each missing right child of a non-null node
Note: Leaf nodes get two null-children.
The resulting normalized binary representation
b
b
B(T)
a
e
c d c d
is a full binary tree
all non-null nodes have two children
all leaves are null-nodes (and all null-nodes are leaves)
ǫ
b
c
ǫ
b
d
ǫ ǫ ǫ
c
e
d
ǫ ǫ
ǫ ǫ
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
8 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
9 / 50
Binary Branch Distance
Binary Branches
Binary Branch Distance
Binary Branch
Binary Branches
Binary Branches of Trees and Datasets
A binary branch BiB(v) is
a subtree of the normalized binary tree B(T)
consisting of a non-null node v and its two children
Binary branch sets:
Example:
BiB(a) = ({a, b, ǫ}, {(a, b)}, {(a, ǫ)})
BiB(d) = ({d, ǫ1 , ǫ2 }, {(d, ǫ1 )}, {(d, ǫ2)})
BiB(T) is the
S set of all binary branches of B(T)
BiB(S) = T∈S BiB(T) is the set of all binary branches of dataset S
1
Binary branches can be serialized as strings:
a
c
ǫ
BiB(v) = ({v, a, b}, {(v, a)}, {(v, b)}) → λ(v) ◦ λ(a) ◦ λ(b)
we can sort these strings (ǫ > λ(v) for all non-null nodes v)
ǫ
b
Note:
b
c
d
ǫ ǫ ǫ
nodes are unique in the tree, thus binary branches are unique
labels are not unique, thus the serialized binary branches are not unique
e
ǫ ǫ
d
ǫ1 ǫ2
1
Although the two null nodes have identical labels (ǫ), they are different nodes. We
emphasize this by showing their IDs in subscript.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
10 / 50
Binary Branches
c1
ǫ2
T2
a
b
c4
e
d3
ǫ
ǫ ǫ 5 d6 ǫ ǫ
ǫ ǫ
BiB(c1 ) 6= BiB(c4 ):
ǫ
Binary Branches
Binary Branch Vector
c
ǫ
d
e
11 / 50
ǫ
b
c
ǫ
Unit 11 – May 22, 2009
a
ǫ
b
Approximation: Theory and Algorithms
Binary Branch Distance
Example: Binary Branches of Trees and Datasets
T1
Nikolaus Augsten (DIS)
b
ǫ
ǫ
d
The binary branch vector BBV (T)
is a representation of the binary branch set BiB(T)
e
Construction of the binary branch vector BBV (T):
ǫ ǫ
serialize and sort all binary branches of BiB(S)
bi is the i-th serialized binary branch in sort order
BBV (T)[i] is the number of binary branches in B(T) that serialize to bi
ǫ ǫ
BiB(c1 ) = ({c1 , ǫ2 , d3 }, {(c1 , ǫ2 )}, {(c1 , d3 )})
BiB(c4 ) = ({c4 , ǫ5 , d6 }, {(c4 , ǫ5 )}, {(c4 , d6 )})
Note: BBV (T)[i ] is zero if bi does not appear in BiB(T)
Serialization of both, BiB(c1 ) and BiB(c2 ), is identical: ’cǫd’
Sorted set of serialized strings of BiB(S), where S = {T1 , T2 }:
(abǫ, bcb, bcc, bce, beǫ, cǫd, dǫb, dǫe, dǫǫ, eǫǫ)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
12 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
13 / 50
Binary Branch Distance
Binary Branches
Binary Branch Distance
Example: Binary Branches of Trees and Datasets
T1
c
ǫ
T2
a
d
ǫ ǫ ǫ
d
ǫ
e
ǫ ǫ
ǫ
ǫ ǫ
c
ǫ
d
e
S = {T1 , T2 } is the data set
ǫ
b
c
b
c
Binary Branch Distance [YKT05]
a
ǫ
b
Lower Bound for the Edit Distance
b
ǫ
ǫ
d
Definition (Binary Branch Distance)
e
Let BBV (T) = (b1 , . . . , bk ) and BBV (T′ ) = (b1′ , . . . , bk′ ) be binary branch
vectors of trees T and T′ , respectively. The binary branch distance of T
and T′ is
k
X
|bi − bi′ |.
δB (T, T′ ) =
ǫ ǫ
ǫ ǫ
BiBsort (S) is the sorted set of serialized strings of BiB(S)
i =1
BBV (T) is the binary branch vector of T
the sorted set of serialized strings and the binary branch vectors are:
BiBsort (S) abǫ bcb bcc bce beǫ cǫd dǫb dǫe dǫǫ eǫǫ
1 1 0 1 0 2 0 0 2 1
BBV (T1 )
BBV (T2 )
1 0 1 0 1 2 1 1 0 2
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
14 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Lower Bound for the Edit Distance
Binary Branch Distance
Example: Binary Branch Distance
Unit 11 – May 22, 2009
15 / 50
Lower Bound for the Edit Distance
Example: Binary Branch Distance
The normalized binary tree representations are:
We compute the binary branch distance between T1 and T2 :
T1
a
e
b
b
c d c d
Nikolaus Augsten (DIS)
B(T1 )
a
a
T2
a
c d e
b
c d b
e
Approximation: Theory and Algorithms
B(T2 )
c
ǫ
ǫ
b
d
ǫ ǫ ǫ
b
c
d
ǫ ǫ
Unit 11 – May 22, 2009
16 / 50
Nikolaus Augsten (DIS)
c
e
ǫ ǫ
ǫ
ǫ
ǫ
b
c
ǫ
d
b
ǫ
e
ǫ ǫ
Approximation: Theory and Algorithms
ǫ
d
e
ǫ ǫ
Unit 11 – May 22, 2009
17 / 50
Binary Branch Distance
Lower Bound for the Edit Distance
Binary Branch Distance
Example: Binary Branch Distance
The binary branch vectors of T1 and T2
BiBsort (S) abǫ bcb bcc bce beǫ cǫd
1 1 0 1 0 2
BBV (T1 )
BBV (T2 )
1 0 1 0 1 2
Lower Bound Lemma
Theorem (Lower Bound)
are:
dǫb dǫe dǫǫ eǫǫ
0 0 2 1
1
1
0
Let T and T′ be two trees. If the tree edit distance between T and T′ is
δt (T, T′ ), then the binary branch distance between them satisfies
2
The binary branch distance is
P10
δB (T1 , T2 ) =
i =1 |b1,i − b2,i |
= |1 − 1| + |1 − 0| + |0 − 1| + |1 − 0| + |0 − 1|+
|2 − 2| + |0 − 1| + |0 − 1| + |2 − 0| + |1 − 2|
= 9,
where b1,i and b2,i are the i -th dimension of the vectors BBV (T1 )
and BBV (T2 ), respectively.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
18 / 50
δB (T, T′ ) ≤ 5 × δt (T, T′ ).
Proof (Sketch — Full Proof in [YKT05]).
Each node v appears at most in two binary branches.
Rename: Renaming a node causes at most two binary branches in
each tree to mismatch. The sum is 4.
Similar rational for insert and its complementary operation delete (at
most 5 binary branches mismatch).
Nikolaus Augsten (DIS)
Lower Bound for the Edit Distance
Unit 11 – May 22, 2009
19 / 50
Lower Bound for the Edit Distance
Proof Sketch: Illustration for Insert
transform T1 to T2 : ren(c, x)
a
a
g
c
x g
b
b
e f
e f
binary trees B(T1 ) and B(T2 )
a
a
ǫ
ǫ
b
b
ǫ
ǫ
c
x
g
g
e
e
ǫ
ǫ ǫ
ǫ
ǫ ǫ
f
f
ǫ ǫ
ǫ ǫ
Two binary branches (bǫc, ceg ) exist only in B(T1 )
Two binary branches (bǫx, xeg ) exist only in B(T2 )
δt (T1 , T2 ) = 1 (1 rename)
δB (T1 , T2 ) = 4 (4 binary branches different)
Approximation: Theory and Algorithms
Approximation: Theory and Algorithms
Binary Branch Distance
Proof Sketch: Illustration for Rename
Nikolaus Augsten (DIS)
Lower Bound for the Edit Distance
Unit 11 – May 22, 2009
transform T1 to T2 : ins(x, a, 2, 3)
a
a
g
e
x g
b
f
b
e f
binary trees B(T1 ) and B(T2 )
a
a
ǫ
ǫ
b
b
ǫ
ǫ
e
x
g
ǫ
e
f
g
ǫ
ǫ
ǫ ǫ
f
ǫ ǫ
ǫ ǫ
Two binary branches (bǫe, f ǫg ) exist only in B(T1 )
Tree binary branches (bǫx , f ǫǫ, xeg ) exist only in B(T2 )
δt (T1 , T2 ) = 1 (1 insertion)
δB (T1 , T2 ) = 5 (5 binary branches different)
20 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
21 / 50
Binary Branch Distance
Lower Bound for the Edit Distance
Binary Branch Distance
Proof Sketch
Complexity
Complexity: Binary Branch Vector
Given a set S with N trees of average size n.
Construction of the binary branch vectors BBV (T) for all T ∈ S:
In general it can be shown that
1. compute the binary branches of all trees in S, BiB(S):
O(Nn) time and space (traverse all trees in S)
2. sort serialized binary branches of BiB(S) and store them in BiBsort (S):
O(Nn log(Nn)) time and O(Nn) space
3. construct BBV (T):
Rename changes at most 4 binary branches
Insert changes at most 5 binary branches
Delete changes at most 5 binary branches
Each edit operation changes at most 5 binary branches, thus
′
(a) traverse all trees again and compute their binary branches:
O(Nn) time and space
(b) for each binary branch find position i in BiBsort (S):
O(Nn log Nn) time (binary search in V for Nn binary branches)
(c) BBV (T)[i] is incremented: O(1)
′
δB (T, T ) ≤ 5 × δt (T, T ).
The overall complexity is O(Nn log Nn) time and O(Nn) space
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
22 / 50
Nikolaus Augsten (DIS)
Complexity
Approximation: Theory and Algorithms
Binary Branch Distance
Unit 11 – May 22, 2009
23 / 50
Complexity
Complexity: Distance
Note: Improvement using a hash function:
we assume a hash function that maps the O(Nn) binary branches to
O(Nn) buckets without collision
we do not sort BiB(S)
position i in the vector BBV (T) is computed using the hash function
O(Nn) time (instead of O(Nn log(Nn))) and O(Nn) space
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
the binary branch vectors are of size O(Nn)
computing the distance has O(Nn) time complexity
(subtracting two binary branch vectors)
Complexity for a set that contains only two trees (N = 2):
In the following we consider only the sort-algorithm with
O(Nn log(Nn)) runtime.
Nikolaus Augsten (DIS)
Given the binary branch vectors of a set of N trees (tree size n).
Computing the distance between two of the N trees:
constructing the binary branch vectors: O(n log n) time, O(n) space
computing the distance: O(n) time and space
overall complexity: O(n log n) time, O(n) space
24 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
25 / 50
The pq-Gram Distance
Definition
The pq-Gram Distance
pq-Grams – Intuition
Definition
pq-Grams Pattern
The shape of a pq-gram (p = 2, q = 3):
•
stem
•
q-Grams for strings:
split string into substrings (q-grams) of length q
strings with many common substrings are similar
• • •
base
pq-Grams for trees:
split tree into small subtrees (pq-grams) of the same shape
trees with many common subtrees are similar
anchor node
p nodes (anchor node and p−1 ancestors) form the stem
q nodes (q consecutive children of the anchor node) form the base
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
The pq-Gram Distance
Unit 11 – May 22, 2009
27 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Definition
The pq-Gram Distance
pq-Extended Tree
Unit 11 – May 22, 2009
28 / 50
Definition
Example: Extened Tree
An example tree T and its extended tree Tpq (p = 2, q = 3):
Problem: How can we split the following tree T into 2, 3-grams?
a
b
T
a
c
d e
Solution: Extend tree T with dummy nodes (•):
b
p−1 ancestors to the root node
q−1 children before the first and after the last child of each non-leaf
q children for each leaf
d e
The result is the pq-extened tree Tpq .
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
c
2, 3-extended tree T2,3
•
a
• • b
c • •
••• • • d
e • •
••• •••
Unit 11 – May 22, 2009
29 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
30 / 50
The pq-Gram Distance
Definition
The pq-Gram Distance
Definition: pq-Gram
Label-Tuples
Definition (pq-Gram)
Let p > 0, q > 0. A pq-gram, g , of a tree T with anchor node a ∈ N(T) is a
subtree of the extended tree Tpq that is composed of the following nodes:
p nodes ap−1 , . . . , a1 , a (ai is the ancestor of node a at distance i )
Linear encoding of a pq-gram g with anchor node a:
(traverse pq-gram in preorder)
ap−1
...
=
a1
q nodes ci , . . . , ci +q−1 (ci is the i -th child of node a)
Base: the nodes ci , . . . , ci +q−1 form the base of the pq-gram g .
λ(g ) = (λ(v1 ), . . . , λ(vp+q ))
Definition (pq-Gram Profile)
The pq-gram profile, PT , of a tree T is the set of all its pq-grams.
Approximation: Theory and Algorithms
The pq-Gram Distance
(ap−1 , . . . , a1 , a, ci , . . . , ci +q−1 )
a
.
.
. ci +q−1
ci
Label-tuple: tuple of the pq-grams node labels
Stem: the nodes ap−1 , . . . , a1 , a form the stem of the pq-gram g .
Nikolaus Augsten (DIS)
Definition
Unit 11 – May 22, 2009
for the pq-gram g = (v1 , . . . , vp+q ).
31 / 50
Nikolaus Augsten (DIS)
Definition
Approximation: Theory and Algorithms
The pq-Gram Distance
pq-Gram Index
Unit 11 – May 22, 2009
32 / 50
Definition
The pq-Gram Distance
Definition (pq-Gram Distance)
The pq-gram distance between two trees, T and T′ , is defined as
Definition (pq-Gram Index)
Let T be a tree with profile PT , p > 0, q > 0. The pq-gram index, I, of
tree T is the bag of all label-tuples of T,
]
λ(g )
I(T) =
δg (T, T′ ) = |I(T) ⊎ I(T′ )| − 2|I(T) C I(T′ )|
Metric normalization to [0..1]: δg′ (T, T′ ) =
Pseudo-metric properties hold for normalization [ABG]:
g ∈PT
✓ self-identity: x = y ⇐
/ ⇒ δg (x, y ) = 0
✓ symmetry: δg (x, y ) = δg (y , x)
✓ triangle inequality: δg (x, z) ≤ δg (x, y ) + δg (y , z)
Note:
pq-grams are unique within a tree
but: different pq-grams may yield identical label-tuples
thus the pq-gram index may contain duplicates
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
δg (T,T′ )
|I(T)⊎I(T′ )|−|I(T)CI(T′ )|
Unit 11 – May 22, 2009
Different trees may have identical indexes:
a
a
b b
b b
c
c
33 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
34 / 50
The pq-Gram Distance
Computing the pq-Grams
The pq-Gram Distance
Main Memory Algorithm for a Single Tree (I)
Computing the pq-Grams
Main Memory Algorithm for a Single Tree (II)
createIndex(T, r, I, stem, p, q)
stem := shift(stem, λ(r))
base: shift register of size q (filled with *)
Input of createIndex(T, r, I, stem, p, q):
a subtree of T rooted in r
the pq-gram index I computed so far
the stem stem of r’s parent
the parameters p and q
if r is a leaf then
I := I ∪ {stem ◦ base}
else
for each child c (from left to right) of r do
base := shift(base, λ(c))
I := I ∪ {stem ◦ base}
I :=createIndex(T, c, I, stem, p, q)
for k := 1 to q − 1
base := shift(base, *)
I := I ∪ {stem ◦ base}
Output of createIndex(T, r, I, stem, p, q):
pq-gram index including
the input index I
the pq-gram index of r and all its descendants
i.e., the pq-grams (label-tuples) with anchor node r or a descendant of r
return I
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
The pq-Gram Distance
Unit 11 – May 22, 2009
35 / 50
Nikolaus Augsten (DIS)
Computing the pq-Grams
Approximation: Theory and Algorithms
The pq-Gram Distance
Main Memory Algorithm for a Single Tree (III)
Unit 11 – May 22, 2009
36 / 50
Computing the pq-Grams
Complexity of the pq-Gram Index Algorithm
pq-Gram-Index(T, p, q) computes the pq-gram index for a
complete tree T:
Theorem (pq-Gram Index Complexity)
pq-Gram-Index(T, p, q)
The pq-gram index of a tree T with size |T| can be computed in O(|T|)
time.
stem: shift register of size p (filled with *)
I: empty index
I = createIndex(T, root(T), I, stem, p, q)
return I
Proof.
Each recursive call of createIndex () processes one node, and each node is
processed exactly once.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
37 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
38 / 50
The pq-Gram Distance
Computing the pq-Grams
The pq-Gram Distance
Size of the pq-Gram Index
Fanout Weighting and Lower Bound
Motivation: Unit Cost Model Not Always Intuitive
Theorem (Size of the pq-Gram Index)
Let T be a tree with l leaves and i non-leaves. The size of the pq-gram
index of T is
|I pq (T)| = 2l + qi − 1.
a
δt = 2
a
b c
Proof.
1. We count all pq-grams whose leftmost leaf is a dummy node: Each leaf
is the anchor node of exactly one pq-gram whose leftmost leaf is a
dummys node, giving l pq-grams. Each non-leaf is the anchor of q − 1
pq-grams whose leftmost leaf is a dummy, giving i (q − 1) pq-grams.
2. We count all pq-grams whose leftmost leaf is not a dummy node: Each
node of the tree except the root is the leftmost leaf of exactly one
pq-gram, giving l + i − 1 pq-grams.
δt = 2
a
b d h i k f g
b c
d e f g
d e f
h i
h i k
Unit cost edit distance:
no difference between leaves and non-leaves
may lead to non-intuitive results
Conclusion: Non-leafs should have more weight than leafs.
Overall number of pq-grams: l + i (q − 1) + (l + i − 1) = 2l + qi − 1.
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
The pq-Gram Distance
Unit 11 – May 22, 2009
39 / 50
Fanout Weighting and Lower Bound
N(T′ )
Fanout Weighting and Lower Bound
leaf changes have small cost (c = 1 in the example)
non-leaf changes cost proportional to the node fanout
a
a
Delete: α(w → ǫ) = f + c
δt = 2
δf = 2
b c
Insert: α(ǫ → w′ ) = f ′ + c
Rename: α(w → w′ ) = (f + f ′ )/2 + c
d e f
Cost of changing a non-leaf node: proportional to its fanout.
h i
Cost of changing a leaf node: constant c.
Approximation: Theory and Algorithms
40 / 50
Fanout-Weighted Tree Edit Distance:
w′
Let T and
be two trees, w ∈
a node with fanout f ,
∈
node with fanout f ′ , c > 0 a constant. The fanout weighted tree edit
distance, δf = (T, T′ ), between T and T′ is defined as the tree edit
distance with the following costs for the edit operations:
Nikolaus Augsten (DIS)
Unit 11 – May 22, 2009
Example: Fanout-Weighted Tree Edit Distance
Definition (Fanout Weighted Tree Edit Distance)
N(T′ )
Approximation: Theory and Algorithms
The pq-Gram Distance
Fanout Weighted Tree Edit Distance
T′
Nikolaus Augsten (DIS)
Unit 11 – May 22, 2009
41 / 50
Nikolaus Augsten (DIS)
a
δt = 2
δf = 9
b c
a
b d h i k f g
d e f g
h i k
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
42 / 50
The pq-Gram Distance
Fanout Weighting and Lower Bound
The pq-Gram Distance
pq-Gram Distance Lower Bound
Experiments
Sensitivity to Structure Change — Leaf
Cost of leaf change → depends only on q
Experiment:
Proof.
See [ABG] (ACM Transactions on Database Systems).
vary p
1
vary q
1
4,3-grams
3,3-grams
2,3-grams
1,3-grams
0.8
norm pq-gram distance
δg (T, T′ )
≤ δf (T, T′ ).
2
delete leaf nodes
measure normalized pq-gram distance
norm pq-gram distance
Theorem
Let p = 1 and c ≥ max(2q − 1, 2) be the cost of changing a leaf node.
The pq-gram distance provides a lower bound for the fanout weighted tree
edit distance, i.e., for any two trees, T and T′ ,
0.6
0.4
0.2
0
2,4-grams
2,3-grams
2,2-grams
2,1-grams
0.8
0.6
0.4
0.2
0
0
2
4
6 8 10 12 14 16 18 20
number of deletions
0
2
4
6 8 10 12 14 16 18 20
number of deletions
(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
The pq-Gram Distance
Unit 11 – May 22, 2009
43 / 50
The pq-Gram Distance
vary p
25
0.4
0.2
0
20
0.4
0.2
15
10
0
0
2
4
6 8 10 12 14 16 18 20
number of deletions
3,4-gram dist
2,3-gram dist
1,2-gram dist
0.6
time [sec]
0.6
2,4-grams
2,3-grams
2,2-grams
2,1-grams
0.8
Experiments
compute pq-gram distance for varying p and q
vary tree size: up 106 nodes
measure wall clock time
vary q
norm pq-gram distance
norm pq-gram distance
0.8
44 / 50
Scalability independent of p and q.
Experiment: For pair of trees
delete non-leaf nodes
measure normalized pq-gram distance
1
Unit 11 – May 22, 2009
Influence of p and q on Scalability
Cost for non-leaf change → controlled by p
Experiment:
4,3-grams
3,3-grams
2,3-grams
1,3-grams
Approximation: Theory and Algorithms
Experiments
Sensitivity to Structure Change — Non-Leaf
1
Nikolaus Augsten (DIS)
0
2
4
6
8
5
10 12 14 16 18 20
number of deletions
0
0
(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
45 / 50
Nikolaus Augsten (DIS)
100000
200000 300000 400000
number of nodes (n)
Approximation: Theory and Algorithms
500000
Unit 11 – May 22, 2009
46 / 50
The pq-Gram Distance
Experiments
Conclusion
Scalability to Large Trees
Summary
pq-gram distance → scalable to large trees
compare with edit distance
Experiment: For pair of trees
compute tree edit distance and pq-gram distance
vary tree size: up 5 × 105 nodes
measure wall clock time
600
Binary Branch Distance
lower bound of the unit cost tree edit distance
complexity O(n log n) time
pq-Gram Distance
edit dist
2,3-gram dist
500
lower bound for the fanout weighted tree edit distance
trees are represented by small subtrees
similar trees have many common subtrees
complexity O(n log n) time
time [sec]
400
300
200
100
0
0
100000
200000
300000
400000
500000
number of nodes (n)
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
47 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
49 / 50
Conclusion
What’s Next?
Nikolaus Augsten, Michael Böhlen, and Johann Gamper.
Approximate matching of hierarchical data using pq-grams.
ACM Transactions on Database Systems (TODS).
To appear.
Rui Yang, Panos Kalnis, and Anthony K. H. Tung.
Similarity evaluation on tree-structured data.
In Proceedings of the ACM SIGMOD International Conference on
Management of Data, pages 754–765, Baltimore, Maryland, USA,
June 2005. ACM Press.
Distances for Data Centric XML:
unordered tree edit distance
windowed pq-gram distance
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
50 / 50
Nikolaus Augsten (DIS)
Approximation: Theory and Algorithms
Unit 11 – May 22, 2009
50 / 50