XPath Node Selection over
Grammar-Compressed Trees
Sebastian Maneth
University of Edinburgh
with Tom Sebastian (INRIA Lille)
XPath Node Selection
Given:
XPath query Q and tree T
Question: wc-time to evaluate Q over T?
Full XPath 1.0: O( |Q|^2 |T|^4 ) [Gottlob, Koch 2003]
Wadler Fragment: O( |Q|^2 |T|^2 ) [Wadler 1999]
Core XPath: O( |Q| |T| ) [Gottlob, Koch 2003]
For Core XPath: selecting tree automata, marking automata, query automata, etc.
Here: forward Core XPath
Query1 = //a/*/b//c/d
[Figure: example tree with node labels a, b, c, d, e, f]
Translate to DFA [Green, Gupta, Miklau, Onizuka, Suciu TODS 2004]
Similar to “KMP-automata”
CAVE: exponential in the number of consecutive *’s!
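The blow-up caused by consecutive *’s can be observed with a plain subset construction. The sketch below is only an illustration under stated assumptions: the NFA shape for //a/*^k/b, the three-letter alphabet, and the function name dfa_size are made up for this example and are not taken from the cited paper.

```python
def dfa_size(k):
    """Count reachable DFA states for the label paths matched by //a/*^k/b.

    Assumed NFA: state 0 loops on every label, 0 -a-> 1,
    i -any-> i+1 for 1 <= i <= k, and k+1 -b-> k+2 (accepting).
    """
    alphabet = ("a", "b", "other")

    def step(states, label):
        nxt = {0}                          # state 0 has a self-loop on all labels
        for s in states:
            if s == 0 and label == "a":
                nxt.add(1)
            elif 1 <= s <= k:              # one wildcard step
                nxt.add(s + 1)
            elif s == k + 1 and label == "b":
                nxt.add(k + 2)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                            # standard subset construction
        cur = todo.pop()
        for label in alphabet:
            nxt = step(cur, label)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


for k in range(6):
    print(k, dfa_size(k))
```

The DFA has to remember which of the last k+1 labels were a, so the number of reachable states is at least 2^(k+1).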
We use “deterministic selecting top-down tree automata” (DST automata)
[Figure: DST automaton for Query1 = //a/*/b//c/d, with marking transitions and default rules]
DST automata
run over the binary tree encoding (first-child / next-sibling)
“marking transitions” indicate nodes to select
can also express the following-sibling axis of XPath:
build a DFA for any sequence of following-sibling steps
change state only on the second component (next-sibling)
(on the first component (first-child), go to the Fail-state)
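A minimal sketch of how such an automaton can be run over the first-child / next-sibling encoding. The query //a/b, the state names "top" and "in_a", and the helper names fcns and run_dst are assumptions chosen for this illustration; they are not the automaton for Query1.

```python
def fcns(forest):
    """Encode a list of unranked trees (label, children) as a binary tree
    (label, first_child, next_sibling); None is the empty tree."""
    if not forest:
        return None
    (label, kids), rest = forest[0], forest[1:]
    return (label, fcns(kids), fcns(rest))


def run_dst(delta, q0, root):
    """Run a deterministic top-down automaton and return the pre-order
    numbers of the nodes hit by a marking transition."""
    selected, stack, pre = [], [(root, q0)], 0
    while stack:
        node, q = stack.pop()
        if node is None:
            continue
        label, first_child, next_sibling = node
        mark, q_fc, q_ns = delta(q, label)
        if mark:
            selected.append(pre)
        pre += 1
        stack.append((next_sibling, q_ns))   # visited after the whole child subtree
        stack.append((first_child, q_fc))
    return selected


def child_b_of_a(q, label):
    """Hypothetical transitions for //a/b: state 'in_a' means 'my parent is an a'."""
    mark = (q == "in_a" and label == "b")
    q_first_child = "in_a" if label == "a" else "top"
    q_next_sibling = q                       # siblings share the same parent
    return mark, q_first_child, q_next_sibling


tree = ("a", [("b", []), ("c", [("b", [])]), ("a", [("b", [])])])
print(run_dst(child_b_of_a, "top", fcns([tree])))   # [1, 5]
```

Each node is visited once and each transition is a constant-time lookup, which matches the O( |A| + |T| ) bound for uncompressed trees.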
Given:
DST automaton A, tree T
Question: wc-time to evaluate A over T?
simply O( |A| + |T| )
Here: T given as SLT grammar [Busatto Maneth 2004]
Straight-line Linear context-free Tree grammars
Grammars
Regular string: A -> abaB
Regular tree: NTs at leaves
Context-free: A -> aBCd
cf tree: NTs at internal nodes!
[Figures: example derivation trees for each grammar type]
Production of a context-free tree grammar:
Tree over NTs and Terminals
y1, y2, … “context-parameters”
LINEAR = each y_i occurs at most once in RHS
STRAIGHT-LINE = grammar produces only one tree
[Figure: example production with context-parameters y1, y2, y3]
Examples
A0 -> A1(A1(e))
A1(y) -> A2(A2(y))
A2(y) -> A3(A3(y))
…
An(y) -> a(y)
monadic tree a(a( … a(e) …)) of height 2^n
B0 -> f(B1,B1)
B1 -> f(B2,B2)
…
Bn -> f(g,g)
full binary tree of height n
regular tree grammar = a DAG
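The first example can be checked by expanding the grammar. This is a minimal sketch; the term encoding (tuples tagged "t", "nt", "y") and the helper names are assumptions made for this illustration.

```python
def T(label, *kids):          # terminal node
    return ("t", label, kids)

def N(name, *args):           # nonterminal applied to argument trees
    return ("nt", name, args)

Y = ("y", 0)                  # the single context-parameter y


def monadic_grammar(n):
    """A0 -> A1(A1(e)), Ai(y) -> Ai+1(Ai+1(y)) for 1 <= i < n, An(y) -> a(y)."""
    rules = {"A0": N("A1", N("A1", T("e")))}
    for i in range(1, n):
        rules[f"A{i}"] = N(f"A{i+1}", N(f"A{i+1}", Y))
    rules[f"A{n}"] = T("a", Y)
    return rules


def expand(rules, term, args=()):
    """Fully expand a term into a (label, children) tree."""
    kind = term[0]
    if kind == "y":
        return args[term[1]]
    if kind == "t":
        return (term[1], [expand(rules, c, args) for c in term[2]])
    actual = [expand(rules, a, args) for a in term[2]]
    return expand(rules, rules[term[1]], actual)


def height(tree):
    label, kids = tree
    return 1 + max(map(height, kids)) if kids else 0


rules = monadic_grammar(3)
print(len(rules), height(expand(rules, rules["A0"])))   # 4 rules, height 8
```

The grammar grows linearly in n while the tree it produces grows as 2^n. Full expansion is only feasible for small n.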
Maximal compression of SLT grammars: exponential (as for DAGs)
In practice (XML), SLT grammars compress much better than DAGs
[Lohrey,Maneth,Mennicke 2013] “TreeRePair compression algorithm”:
on average: shrink XML tree to 2.8% of its edges (vs. 13% for the DAG)
Given:
DST automaton A, SLT grammar G
Question: wc-time to evaluate A over G?
[Lohrey,Maneth TCS2006]
In terms of big-O worst-case complexity?
Main Results
1) Compute |A(val(G))| in time O(|A||G|).
(COUNT)
2) Compute list of pre-order numbers of result nodes
in time O(|A||G| + |A(val(G))|).
(MATERIALIZE)
3) Compute serialization S of all result subtrees (as string)
in time O(|A||G| + |S|).
(SERIALIZE)
Idea 1) go bottom-up through the grammar
for every state q and NT A, compute the states q1, …, qk reached at the
context-parameters y1, …, yk AND the corresponding number N of selected nodes
when we meet an NT during this computation, we reuse its stored
information (q1, q2, …, qk, N).
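Idea 1 can be sketched as follows. This is an illustration under assumptions, not the TinyT implementation: the term encoding, the step function signature, and the two-state automaton every_second_a are hypothetical, and the summaries are computed on demand with memoization rather than in an explicit bottom-up loop (each pair (q, A) is still processed once).

```python
from functools import lru_cache

def T(label, *kids): return ("t", label, kids)    # terminal node
def N(name, *args):  return ("nt", name, args)    # nonterminal with arguments
def Y(i):            return ("y", i)              # context-parameter number i


def make_summary(rules, step):
    """rules: NT -> (rank, rhs); step(q, label, arity) -> (marked, child_states)."""

    @lru_cache(maxsize=None)
    def summary(q, nt):
        rank, rhs = rules[nt]
        at_params = [None] * rank          # state reached at each parameter
        count = run(q, rhs, at_params)
        return count, tuple(at_params)     # the tuple (N, (q1, ..., qk))

    def run(q, term, at_params):
        kind = term[0]
        if kind == "y":
            at_params[term[1]] = q
            return 0
        if kind == "t":
            marked, kid_states = step(q, term[1], len(term[2]))
            return int(marked) + sum(
                run(s, kid, at_params) for s, kid in zip(kid_states, term[2]))
        count, states = summary(q, term[1])    # reuse the stored information
        for s, arg in zip(states, term[2]):
            if s is not None:                  # unused parameter: argument is dropped
                count += run(s, arg, at_params)
        return count

    return summary


def every_second_a(q, label, arity):
    """Hypothetical automaton: toggles between states 0 and 1, marks a in state 1."""
    return (q == 1 and label == "a"), [1 - q] * arity


def monadic_grammar(n):
    rules = {"A0": (0, N("A1", N("A1", T("e"))))}
    for i in range(1, n):
        rules[f"A{i}"] = (1, N(f"A{i+1}", N(f"A{i+1}", Y(0))))
    rules[f"A{n}"] = (1, T("a", Y(0)))
    return rules


count, _ = make_summary(monadic_grammar(30), every_second_a)(0, "A0")
print(count)   # 2**29 selected nodes, without expanding the tree
```

The work is proportional to the number of (state, NT) pairs times the rule size, in line with the O(|A||G|) bound, even though the tree here has more than 2^30 nodes.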
Main Results
2) Compute list of pre-order numbers of result nodes
in time O(|A||G| + |A(val(G))|).
(MATERIALIZE)
Idea 2) in time O(|A||G|) produce a relabeling grammar G’ (by Idea 1)
Now two passes over G’
1. bottom-up compute offsets of selected nodes for every NT A
offset = (c, o)
c = chunk number, e.g. for an NT with parameters y1, y2, y3:
c=0: nodes on path from root to y1
c=1: nodes on path from y1 to y2
..
c=3: nodes on path from y3 to root
o = node number within a “chunk” c
2. top-down pre-order chunk-wise traversal
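Both passes depend on knowing how many nodes each chunk of each NT contributes. A minimal sketch of that bookkeeping, assuming a chunk is the run of nodes visited in pre-order between two consecutive parameters and that y1, …, yk occur in this order in every RHS; the term encoding and the example rules C and S are made up for this illustration.

```python
def T(label, *kids): return ("t", label, kids)    # terminal node
def N(name, *args):  return ("nt", name, args)    # nonterminal with arguments
def Y(i):            return ("y", i)              # context-parameter number i


def chunk_sizes(rules, order):
    """rules: NT -> (rank, rhs); order: NTs listed bottom-up (callees first).
    Returns NT -> [size of chunk 0, ..., size of chunk k]."""
    sizes = {}
    for nt in order:
        rank, rhs = rules[nt]
        chunks = [0]

        def walk(term):
            kind = term[0]
            if kind == "y":
                chunks.append(0)               # a parameter closes the current chunk
            elif kind == "t":
                chunks[-1] += 1
                for kid in term[2]:
                    walk(kid)
            else:
                inner = sizes[term[1]]         # already computed: bottom-up order
                chunks[-1] += inner[0]
                for arg, after in zip(term[2], inner[1:]):
                    walk(arg)
                    chunks[-1] += after        # nodes of the NT that follow this argument

        walk(rhs)
        sizes[nt] = chunks
    return sizes


rules = {
    # C(y1, y2) -> f(a(y1), g(y2, b))
    "C": (2, T("f", T("a", Y(0)), T("g", Y(1), T("b")))),
    # S -> C(c, C(d, e))
    "S": (0, N("C", T("c"), N("C", T("d"), T("e")))),
}
print(chunk_sizes(rules, ["C", "S"]))   # {'C': [2, 1, 1], 'S': [11]}
```

With these sizes, the pre-order number of the root of the i-th argument of an NT is the current position plus the sizes of the chunks and arguments before it, so the traversal can skip over material it does not need to visit.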
Main Results
3) Compute serialization S of all result subtrees (as string)
in time O(|A||G| + |S|).
(SERIALIZE)
Similar to 2)
offset = (c, o) where o is a sequence of opening and closing TAGs
during the pre-order chunk-wise traversal, produce a sequence S’ of TAGs
containing only result subtrees, and a list of pointers to the roots of *all* result subtrees
Follow the list of pointers and produce the final serialization S
Main Results
All of 1), 2), and 3) have been implemented in “TinyT”
and run amazingly fast.
Bring grammar into CNF
Use only 64 bits for one grammar rule
Store for each NT all its chunk sizes
XMark benchmark: can store 8 XML nodes in less than one bit!
Very impressive running times; please see:
[S. Maneth and T. Sebastian, “Fast and Tiny Structural Self-Indexes for XML”, CoRR abs/1012.5696 (2010)]
[Table not reproduced; values are in KB]
New Results (not in TinyT)
Serialization: O(|A||G| + |S|)
CAVE: |S| can be quadratic in |val(G)| (e.g. //* )
Question: Can we find a small SLP (straight-line string grammar) producing S?
1) Given an SLT grammar G and a subset M of the nodes of val(G),
an SLP for S of size O(|G|m) can be constructed. (m=|M|)
Thus, O(|val(G)|m) improves to O(|G|m)
2) Given a DAG (0-SLT grammar) D and a subset M of the nodes of val(D),
an SLP P for S of size O(|D| + m) can be constructed. (m=|M|)
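The quadratic size of S for //* can be checked on a monadic tree. A small worked example, assuming every node contributes one opening and one closing tag; the function name is made up for this illustration.

```python
def total_tags_for_all_subtrees(n):
    """Number of tags in S when //* selects every node of the monadic
    tree a(a(...a...)) with n nodes."""
    # The node at depth d roots a subtree with n - d nodes,
    # and each of those nodes contributes two tags.
    return sum(2 * (n - d) for d in range(n))


for n in (4, 100, 1000):
    print(n, total_tags_for_all_subtrees(n))   # equals n * (n + 1)
```

The sum is 2 * (1 + 2 + … + n) = n(n+1), so |S| grows quadratically in the number of nodes of the tree.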
New Results (not in TinyT)
1) Given an SLT grammar G and a subset M of the nodes of val(G)
an SLP for S of size O(|G|m) can be constructed. (m=|M|)
Idea: Note that the nodes in M are given as pre-order numbers.
First, we precompute the size of each chunk of each NT
Let u be a pre-order number in M
starting with the initial rule, e.g., S -> A(B(D), E), we determine
the NT in rhs(S) that produces u. E.g., “B”.
we proceed with the sentential form B(D).
We finally obtain a sentential form T such that val(T) = val(G)/u.
The size of T is bounded by |G|.
Add a new production S_u -> T and add S_u to the rhs of a new start NT.
Finally, turn the resulting SLT grammar into a traversal SLP (in linear time).
New Results (not in TinyT)
2) Given a DAG (0-SLT grammar) D and a subset M of the nodes of val(D),
an SLP P for S of size O(|D| + m) can be constructed. (m=|M|)
Much easier:
Bring the DAG into “node normal form” (= every RHS contains exactly
one terminal symbol)
(does not change the size of the grammar, in terms of edges!)
Every subtree is represented by a unique nonterminal U
New start production Z -> U1 U2 … Um where Uk is the NT
corresponding to the correct result subtree
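The DAG construction can be sketched in a few lines. Assumptions of this illustration: the DAG is already in node normal form and given as a dictionary, the NT for each result node is already known, tags are written in XML style, and "Z" is not used as an NT name in the DAG.

```python
def result_slp(dag, result_nts):
    """dag: NT -> (label, [child NTs]), in node normal form.
    Returns an SLP as a dict: name -> list of symbols (tags or NT names)."""
    slp = {u: [f"<{label}>", *kids, f"</{label}>"]
           for u, (label, kids) in dag.items()}
    slp["Z"] = list(result_nts)        # new start production Z -> U1 U2 ... Um
    return slp


def expand(slp, symbol):
    """Expand an SLP symbol into the string it produces (for checking only)."""
    if symbol not in slp:
        return symbol                  # a tag, i.e. a terminal of the SLP
    return "".join(expand(slp, s) for s in slp[symbol])


# DAG for the tree f(g,g): the two g-leaves share one nonterminal G.
dag = {"F": ("f", ["G", "G"]), "G": ("g", [])}
slp = result_slp(dag, ["F", "G"])      # results: the root and one g-leaf
print(expand(slp, "Z"))                # <f><g></g><g></g></f><g></g>
```

Each DAG rule becomes one SLP rule with two extra tag symbols, and the start rule has length m, which gives the O(|D| + m) size.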
Open Questions
Large gap between
O(|G’|m) for SLT grammars and
O(|D| + m) for DAGs
Improve to O(|G’| + m*h*k)
(h = height of G’, k=rank of G’)
Are there G’ for which smallest SLP is of that size?
Filters in XPath?
Using a DBU automaton B_i for each filter, we get
O(|A| (|B1| |B2| … |Bn|)^{k+1} |G| )
for the size of a relabeling grammar.
Can it be improved?
Thank you for your attention!