XPath Node Selection over
Grammar-Compressed Trees
Sebastian Maneth
University of Edinburgh
with Tom Sebastian (INRIA Lille)
XPath Node Selection
Given: XPath query Q and tree T
Question: worst-case (wc) time to evaluate Q over T?
Full XPath 1.0: O( |Q|^2 |T|^4 ) [Gottlob, Koch 2003]
Wadler Fragment: O( |Q|^2 |T|^2 ) [Wadler 1999]
Core XPath: O( |Q| |T| ) [Gottlob, Koch 2003]
 For Core XPath: selecting tree automata, marking automata, query automata, etc.
Here: forward Core XPath
Query1 = //a/*/b//c/d
[Figure: example tree with nodes labeled a, b, c, d, e, f]
 Translate to DFA [Green, Gupta, Miklau, Onizuka, Suciu TODS 2004]
 Similar to “KMP-automata”
 CAVE: exponential in the number of consecutive *’s!
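The translation of a forward path query into a DFA can be sketched as follows. This is a minimal illustration, not the construction of the cited paper: the DFA is built lazily by subset construction, the step encoding (`"desc"` for `//`, `"child"` for `/`), the function names, and the example tree are all assumptions made for this sketch. Each DFA state is a set of step positions, which is why runs of consecutive `*` steps can blow up the number of states.

```python
from functools import lru_cache

# Query1 = //a/*/b//c/d as (axis, node test) steps; "desc" stands for "//".
QUERY1 = (("desc", "a"), ("child", "*"), ("child", "b"),
          ("desc", "c"), ("child", "d"))

def make_dfa(steps):
    """Return (start state, transition function) of a lazily built DFA."""
    n = len(steps)

    @lru_cache(maxsize=None)
    def delta(state, label):
        # state = set of step indices that may be matched at the current node
        nxt, selected = set(), False
        for i in state:
            axis, test = steps[i]
            if axis == "desc":          # "//": step i may also match deeper
                nxt.add(i)
            if test in ("*", label):    # step i matches at this node
                if i + 1 == n:
                    selected = True
                else:
                    nxt.add(i + 1)
        return frozenset(nxt), selected

    return frozenset({0}), delta

def select(tree, steps):
    """Pre-order numbers (from 1) of the nodes selected by the path query."""
    start, delta = make_dfa(steps)
    result, number = [], 0
    stack = [(tree, start)]
    while stack:
        (label, children), state = stack.pop()
        number += 1
        nxt, selected = delta(state, label)
        if selected:
            result.append(number)
        for child in reversed(children):   # leftmost child is popped first
            stack.append((child, nxt))
    return result

# Hypothetical example tree, written as (label, [children]).
TREE = ("a", [("e", [("b", [("c", [("d", [])]),
                            ("f", [("c", [("d", []), ("e", [])])])])]),
              ("b", [])])
print(select(TREE, QUERY1))   # [5, 8]
```

Both `d` nodes are selected: one via a/e/b/c/d and one via a/e/b/f/c/d, where `c` is a proper descendant of `b`.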
Query1 = //a/*/b//c/d
We use “deterministic selecting top-down tree automata” (DST automata)
[Figure: DST automaton for Query1 = //a/*/b//c/d; legend: marking transition, default rule]
DST automata
 run over the binary tree encoding (first-child / next-sibling)
 “marking transitions” indicate nodes to select
 can also express the following-sibling axis of XPath:
   build a DFA for any sequence of following-sibling steps
   change state only on the second component (next-sibling)
   (on the first component, first-child, go to the Fail state)
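A DST automaton over the first-child/next-sibling encoding can be run in a single pre-order pass, visiting each node once. The sketch below is an assumption-laden illustration, not the TinyT implementation: it hard-codes a small automaton for the simpler query `//a/b`, with hypothetical state names `q0` (parent is not an `a`) and `q1` (parent is an `a`), explicit rules of the form (state, label) -> (first-child state, next-sibling state, mark), and one default rule per state.

```python
def encode(forest):
    """First-child/next-sibling encoding of a list of (label, [children]) trees.
    A binary node is (label, first_child, next_sibling); None means 'no node'."""
    node = None
    for label, children in reversed(forest):
        node = (label, encode(children), node)
    return node

# Hypothetical DST automaton for //a/b.
RULES = {
    ("q0", "a"): ("q1", "q0", False),
    ("q1", "a"): ("q1", "q1", False),
    ("q1", "b"): ("q0", "q1", True),    # marking transition: select this node
}
DEFAULT = {                              # default rule of each state
    "q0": ("q0", "q0", False),
    "q1": ("q0", "q1", False),
}

def run(root, start="q0"):
    """Return pre-order numbers (from 1) of marked nodes; one visit per node."""
    selected, number = [], 0
    stack = [(root, start)]
    while stack:
        node, q = stack.pop()
        if node is None:
            continue
        number += 1
        label, first_child, next_sibling = node
        q_fc, q_ns, mark = RULES.get((q, label), DEFAULT[q])
        if mark:
            selected.append(number)
        stack.append((next_sibling, q_ns))   # visited after the first-child subtree
        stack.append((first_child, q_fc))
    return selected

# Hypothetical document a(b, c(b), a(b)); pre-order: a1 b2 c3 b4 a5 b6.
DOC = ("a", [("b", []), ("c", [("b", [])]), ("a", [("b", [])])])
print(run(encode([DOC])))   # [2, 6]
```

Pre-order on the binary encoding coincides with document order, so the numbers reported are the usual pre-order numbers of the unranked tree.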
Given:
DST automaton A, tree T
Question: wc-time to evaluate A over T?
 simply O( |A| + |T| )
Here: T given as SLT grammar [Busatto Maneth 2004]
 Straight-line Linear context-free Tree grammars
Grammars
Regular string: A -> abaB
Regular tree: NTs at leaves
Context-free: A -> aBCd
cf tree: NTs at internal nodes!
[Figures: example derivation trees for each of the four grammar types]
Production of a context-free tree grammar:
B(y1, y2, y3) -> tree over NTs and terminals
[Figure: example production with parameters y1, y2, y3]
y1, y2, … are “context parameters”
LINEAR = each y_i occurs at most once in the RHS
STRAIGHT-LINE = the grammar produces only one tree
Examples
A0 -> A1(A1(e))
A1(y) -> A2(A2(y))
A2(y) -> A3(A3(y))
…
An(y) -> a(y)
monadic tree a(a( … a(e) …)) of height 2^n
B0 -> f(B1,B1)
B1 -> f(B2,B2)
…
Bn -> f(g,g)
full binary tree of height n
regular tree grammar = a DAG
 Maximal compression of SLT grammars: exponential (as for DAGs)
 In practice (XML), SLT grammars compress much better than DAGs
[Lohrey, Maneth, Mennicke 2013] “TreeRePair compression algorithm”:
on average, shrinks an XML tree to 2.8% of its edges (vs. 13% for the DAG)
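The exponential compression of the monadic example above can be checked with a small sketch. The term representation (a parameter y_i is the integer i, every other term is a pair of symbol and argument tuple) and all function names are assumptions of this sketch. `node_count` works on the grammar only and assumes every parameter occurs exactly once in its right-hand side; `expand` actually decompresses and is only meant for tiny n.

```python
def t(symbol, *args):
    """Build a term; parameters y_i are written as the plain integer i."""
    return (symbol, args)

def monadic_grammar(n):
    """A0 -> A1(A1(e)), Ai(y) -> Ai+1(Ai+1(y)) for 0 < i < n, An(y) -> a(y)."""
    g = {"A0": t("A1", t("A1", t("e")))}
    for i in range(1, n):
        g[f"A{i}"] = t(f"A{i + 1}", t(f"A{i + 1}", 1))
    g[f"A{n}"] = t("a", 1)
    return g

def node_count(grammar, start):
    """Number of nodes of val(G), computed without decompressing."""
    memo = {}

    def nt_size(nt):
        if nt not in memo:
            memo[nt] = term_size(grammar[nt])
        return memo[nt]

    def term_size(term):
        if isinstance(term, int):        # a parameter contributes no node
            return 0
        symbol, args = term
        own = nt_size(symbol) if symbol in grammar else 1
        return own + sum(term_size(a) for a in args)

    return nt_size(start)

def expand(grammar, term, env=()):
    """Decompress a term into a (label, children) tree; only for small trees."""
    if isinstance(term, int):
        return env[term - 1]
    symbol, args = term
    args = tuple(expand(grammar, a, env) for a in args)
    if symbol in grammar:
        return expand(grammar, grammar[symbol], args)
    return (symbol, args)

print(node_count(monadic_grammar(20), "A0"))   # 1048577 = 2**20 a-nodes + 1 e-node
```

A grammar with 21 rules thus stands for a tree with more than a million nodes.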
Given:
DST automaton A, SLT grammar G
Question: wc-time to evaluate A over G?
[Lohrey, Maneth TCS 2006]
 in terms of big-O worst-case complexity??
Main Results
1) Compute |A(val(G))| in time O(|A||G|).
(COUNT)
2) Compute the list of pre-order numbers of result nodes
in time O(|A||G| + |A(val(G))|).
(MATERIALIZE)
3) Compute the serialization S of all result subtrees (as a string)
in time O(|A||G| + |S|).
(SERIALIZE)
Idea 1)  go bottom-up through the grammar
 for every state q and NT A, compute the states q1, …, qk reached at the
context parameters y1, …, yk AND the corresponding number N of selected nodes
 when we meet an NT during this computation, reuse its already computed
information (q1, q2, …, qk, N).
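The counting idea can be sketched on ranked terms as follows. This is an illustration under stated assumptions, not the authors' code: parameters are integers, terms are (symbol, args) pairs, every parameter occurs exactly once in its right-hand side, and the automaton is given as a hypothetical function `delta(q, label, rank)` returning the child states and a mark flag. The table is filled by memoised recursion, which computes the same (q1, …, qk, N) summaries as a bottom-up pass, once per pair of state and NT.

```python
def t(symbol, *args):
    """Build a term; parameters y_i are written as the plain integer i."""
    return (symbol, args)

def count_selected(grammar, delta, start_nt, q0):
    summary = {}   # (q, NT) -> (N, states reached at y1..yk)

    def run_nt(q, nt):
        if (q, nt) not in summary:
            params = {}
            n = run_term(q, grammar[nt], params)
            k = max(params, default=0)
            states = tuple(params[i] for i in range(1, k + 1))
            summary[(q, nt)] = (n, states)
        return summary[(q, nt)]

    def run_term(q, term, params):
        if isinstance(term, int):          # parameter y_i: record the state reaching it
            params[term] = q
            return 0
        symbol, args = term
        if symbol in grammar:              # nonterminal: reuse its summary
            n, child_states = run_nt(q, symbol)
        else:                              # terminal: one automaton transition
            child_states, mark = delta(q, symbol, len(args))
            n = 1 if mark else 0
        return n + sum(run_term(qi, a, params)
                       for qi, a in zip(child_states, args))

    return run_nt(q0, start_nt)[0]

# Hypothetical automaton: select every a-node whose parent is an a-node.
def delta(q, label, rank):
    child = "under_a" if label == "a" else "top"
    return (child,) * rank, (q == "under_a" and label == "a")

# Monadic grammar from the examples: val(G) = a(a(...a(e)...)) with 2**n a-nodes.
n = 30
G = {"A0": t("A1", t("A1", t("e")))}
for i in range(1, n):
    G[f"A{i}"] = t(f"A{i + 1}", t(f"A{i + 1}", 1))
G[f"A{n}"] = t("a", 1)

print(count_selected(G, delta, "A0", "top"))   # 1073741823 = 2**30 - 1
```

The tree has about a billion nodes, yet only 2 states times 31 nonterminals are ever evaluated.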
Main Results
2) Compute the list of pre-order numbers of result nodes
in time O(|A||G| + |A(val(G))|).
(MATERIALIZE)
Idea 2)  in time O(|A||G|) produce a relabeling grammar G’ (by Idea 1)
 Now two passes over G’:
1. bottom-up: compute offsets of selected nodes for every NT A
   offset = (c, o)
   c = chunk number, e.g. for an NT with parameters y1, y2, y3:
   c=0: nodes on path from root to y1
   c=1: nodes on path from y1 to y2
   ..
   c=3: nodes on path from y3 to root
   o = node number within a “chunk” c
2. top-down: pre-order chunk-wise traversal
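The chunk sizes that the two passes rely on can be computed bottom-up, as in the sketch below. It covers only the sizes, not the offsets of selected nodes, and it assumes (these are assumptions of the sketch) that parameters are integers, terms are (symbol, args) pairs, and the parameters y1, …, yk of each rule occur exactly once and in this order in pre-order. An NT with k parameters gets k+1 chunks: the number of nodes visited in pre-order before y1, between consecutive parameters, and after yk.

```python
def t(symbol, *args):
    """Build a term; parameters y_i are written as the plain integer i."""
    return (symbol, args)

def chunk_sizes(grammar):
    """Map every NT to the list of its k+1 pre-order chunk sizes."""
    memo = {}

    def of_nt(nt):
        if nt not in memo:
            chunks = [0]
            walk(grammar[nt], chunks)
            memo[nt] = chunks
        return memo[nt]

    def walk(term, chunks):
        if isinstance(term, int):        # parameter: close this chunk, open the next
            chunks.append(0)
            return
        symbol, args = term
        if symbol in grammar:            # nonterminal: splice in its own chunks
            inner = of_nt(symbol)
            chunks[-1] += inner[0]
            for arg, after in zip(args, inner[1:]):
                walk(arg, chunks)
                chunks[-1] += after
        else:                            # terminal: one node in the current chunk
            chunks[-1] += 1
            for arg in args:
                walk(arg, chunks)

    return {nt: of_nt(nt) for nt in grammar}

# Hypothetical grammar: C(y1, y2) -> f(a(y1), g(y2, e)),  S -> C(b, C(c, d)).
G = {
    "C": t("f", t("a", 1), t("g", 2, t("e"))),
    "S": t("C", t("b"), t("C", t("c"), t("d"))),
}
print(chunk_sizes(G))   # {'C': [2, 1, 1], 'S': [11]}
```

With these sizes, a top-down traversal can skip a whole chunk of an NT whenever the pre-order number it is looking for lies beyond that chunk.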
Main Results
3) Compute the serialization S of all result subtrees (as a string)
in time O(|A||G| + |S|).
(SERIALIZE)
Similar to 2)
 offset = (c, o) where o is a sequence of opening and closing TAGs
 during the pre-order chunk-wise traversal, produce a sequence S’ of TAGs
containing only result subtrees, and a list of pointers to the roots of *all* result subtrees
 Follow the list of pointers and produce the final serialization S
Main Results
All of 1), 2), and 3) have been implemented in “TinyT”,
and it runs amazingly fast.
 Bring the grammar into CNF
 Use only 64 bits for one grammar rule
 Store for each NT all its chunk sizes
XMark benchmark: can store 8 XML nodes in less than one bit!
 Very impressive running times; please see:
[ S. Maneth and T. Sebastian,
“Fast and Tiny Structural Self-Indexes for XML” CoRR abs/1012.5696 (2010) ]
 In KB !!
New Results (not in TinyT)
Serialization: O(|A||G| + |S|)
 CAVE: |S| can be quadratic in |val(G)| (e.g. //* )
Question: Can we find a small SLP (straight-line string grammar) producing S?
1) Given an SLT grammar G and a subset M of the nodes of val(G),
an SLP for S of size O(|G|m) can be constructed. (m=|M|)
Thus, O(|val(G)|m) improves to O(|G|m)
2) Given a DAG (0-SLT grammar) D and a subset M of the nodes of val(D),
an SLP P for S of size O(|D| + m) can be constructed. (m=|M|)
New Results (not in TinyT)
1) Given an SLT grammar G and a subset M of the nodes of val(G)
an SLP for S of size O(|G|m) can be constructed. (m=|M|)
Idea:  Note that the nodes in M are given as pre-order numbers.
 First, we precompute the size of each chunk of each NT.
Let u be a pre-order number in M.
Starting with the initial rule, e.g., S -> A(B(D), E), we determine
the NT in rhs(S) that produces u, e.g., “B”.
 We proceed with the sentential form B(D).
We finally obtain a sentential form T such that val(T) = val(G)/u.
The size of T is bounded by |G|.
Add a new production S_u -> T and add S_u to the rhs of a new start NT.
Finally, turn the resulting SLT grammar into a traversal SLP (in linear time).
New Results (not in TinyT)
2) Given a DAG (0-SLT grammar) D and a subset M of the nodes of val(D),
an SLP P for S of size O(|D| + m) can be constructed. (m=|M|)
Much easier:
 Bring the DAG into “node normal form” (= every RHS contains exactly
one terminal symbol)
(does not change the size of the grammar, in terms of edges!)
 Every subtree is represented by a unique nonterminal U
 New start production Z -> U1 U2 … Um, where Uk is the NT
corresponding to the k-th result subtree
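The DAG case can be sketched in a few lines. The data layout is an assumption of this sketch: the DAG is already in node normal form and given as a dictionary from NT to (label, list of child NTs), the result nodes are given directly as the NTs U1, …, Um of their subtrees, and the start symbol name `Z` is assumed not to clash with any NT. Each DAG rule becomes one string rule (opening tag, child NTs, closing tag), so the SLP has size O(|D| + m).

```python
def dag_to_slp(dag, result_nts, start="Z"):
    """Build an SLP whose start symbol derives the serialization S."""
    slp = {nt: [f"<{label}>", *children, f"</{label}>"]
           for nt, (label, children) in dag.items()}
    slp[start] = list(result_nts)        # Z -> U1 U2 ... Um
    return slp

def expand(slp, symbol):
    """Decompress one SLP symbol into its string; only for checking small cases."""
    if symbol not in slp:                # terminal symbol, i.e. a tag
        return symbol
    return "".join(expand(slp, s) for s in slp[symbol])

# Hypothetical DAG in node normal form: B0 -> f(B1, B1), B1 -> f(B2, B2), B2 -> g.
DAG = {
    "B0": ("f", ["B1", "B1"]),
    "B1": ("f", ["B2", "B2"]),
    "B2": ("g", []),
}
SLP = dag_to_slp(DAG, ["B1", "B2"])      # result subtrees rooted at a B1- and a B2-node
print(expand(SLP, "Z"))                  # <f><g></g><g></g></f><g></g>
print(sum(len(rhs) for rhs in SLP.values()))   # 12 symbols in total
```

The SLP is never expanded in practice; `expand` is included only to confirm that `Z` derives the intended string.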
Open Questions
Large gap between
O(|G’|m) for SLT grammars and
O(|D| + m) for DAGs
 Improve to O(|G’| + m*h*k)
(h = height of G’, k=rank of G’)
 Are there G’ for which smallest SLP is of that size?
Filters in XPath?
Using a DBU automaton B_i for each filter, we get
O(|A| (|B1| |B2| … |Bn|)^{k+1} |G| )
for the size of a relabeling grammar.
 Can it be improved?
END
Thank you for your attention!