On the Memory Requirements of XPath Evaluation over XML Streams
Ziv Bar-Yossef
Marcus Fontoura
Vanja Josifovski
IBM Almaden Research Center
Preliminaries: XML
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
[Figure: the document as a tree. root x0 has child conference x1; conference x1 has children name x2 (PODS), speaker x3, and speaker x6; speaker x3 has children name x4 (Josifovski) and paper_cnt x5 (1); speaker x6 has children name x7 (Fagin) and paper_cnt x8 (3).]
Preliminaries: XPath 1.0
/conference[name = PODS]/speaker[paper_cnt > 1]/name
[Figure: the query as a tree (root, conference, name = PODS, speaker, paper_cnt > 1, name) next to the document tree with nodes x0 to x8.]
Result: { x7 }
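As a rough illustration of what the query selects, the following Python sketch evaluates it by hand over the example document with the standard xml.etree.ElementTree module. The function name evaluate_example_query is hypothetical, and the two predicates are hard-coded rather than parsed from the XPath string. This is a non-streaming reference evaluation, not the algorithm discussed in the talk.

```python
import xml.etree.ElementTree as ET

DOC = """<conference>
  <name> PODS </name>
  <speaker><name> Josifovski </name><paper_cnt> 1 </paper_cnt></speaker>
  <speaker><name> Fagin </name><paper_cnt> 3 </paper_cnt></speaker>
</conference>"""


def evaluate_example_query(xml_text):
    """Hand-coded /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    conference = ET.fromstring(xml_text)
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference element.
    names = [(n.text or "").strip() for n in conference.findall("name")]
    if "PODS" not in names:
        return []
    results = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker element.
        counts = [float(p.text) for p in speaker.findall("paper_cnt")]
        if any(c > 1 for c in counts):
            results.extend((n.text or "").strip() for n in speaker.findall("name"))
    return results


selected = evaluate_example_query(DOC)
print(selected)
```

Only the name element of the second speaker is selected, which corresponds to node x7 in the figure.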
XML Streams
XML stream: XML document arriving as a one-way stream
Why XML streams?
• For transferring XML between systems
• For efficient access to large XML documents
Critical resources:
• Memory
• Processing time
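To make the one-way access model concrete, here is a minimal sketch that consumes a document in arbitrary chunks and sees only start and end events, in order, with no way to go back. It uses Python's standard xml.etree.ElementTree.XMLPullParser; the function name stream_events and the chunk boundaries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def stream_events(chunks):
    """Yield (event, tag) pairs as the document arrives chunk by chunk."""
    parser = ET.XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            yield event, elem.tag
            if event == "end":
                # Drop the element's text and children once it is closed.
                elem.clear()


# Chunk boundaries are arbitrary and may fall in the middle of a text node.
chunks = [
    "<conference><name>PO",
    "DS</name><speaker>",
    "<name>Fagin</name></speaker></conference>",
]
events = list(stream_events(chunks))
for event, tag in events:
    print(event, tag)
```

A streaming algorithm in this setting has to decide what to remember between events, which is where the memory cost studied in the talk comes from.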
Streaming XML Algorithms
XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
TurboXPath [Josifovski, Fontoura, and Barta 04]
…
Our Results
• Space lower bounds for evaluating XPath on XML streams
• A streaming XML algorithm
  • Matches the lower bounds on a large fragment of the language
  • Uses space sub-linear in the query size rather than exponential in the query size
Related Work
• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
Data Complexity [Vardi 82]
(Q, D): evaluation function in which both the query Q and the document D are inputs.
Q(D): evaluation function of a fixed query Q on an input document D.
Data complexity of Q: complexity of the best algorithm for Q on the worst-case D.
Worst-case data complexity: max_Q (data complexity of Q).
We characterize the data complexity of Q separately for each Q (not just for the worst-case one).
XPath Fragment
1. Queries are subsumption-free
[Figure: two example queries. Not subsumption-free: root, conference, with predicates name = PODS and name != SIGMOD. Subsumption-free: root, conference, with predicate name != SIGMOD.]
XPath Fragment (cont.)
2. Queries are univariate
[Figure: two example queries. Not univariate: root, conference, with predicate paper_cnt < author_cnt. Univariate: root, conference, with predicates paper_cnt < 30 and author_cnt > 30.]
XPath Fragment (cont.)
3. Queries consist of conjunctions only
4. Queries are “star-restricted”
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
[Figure: example query tree: root, conference, name = PODS, speaker, paper_cnt > 1, name.]
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
Document Recursion Depth
Definition:
recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
[Figure: query Q (root, //part, with children name and number) and document D, a tree with nodes x0 to x7 containing nested part elements, name elements (Refrigerator, Compressor), and number elements (values include 4, 12, and 456).]
Theorem 2: For all queries Q in the fragment that have at least one “//” node,
stream-space(Q) = Ω(recDepth_Q(D)).
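A simplified way to see recursion depth: for a query whose only “//” step is //part directly under the root, every part element path-matches that query node, so the recursion depth is the largest number of part elements on a single root-to-leaf path. The sketch below computes that quantity. The function name rec_depth_for_tag and the sample document are assumptions (the document is a plausible nested-parts example, not an exact copy of the figure), and general path matching is not implemented.

```python
import xml.etree.ElementTree as ET


def rec_depth_for_tag(elem, tag, count=0):
    """Max number of `tag` elements on any root-to-leaf path below elem."""
    if elem.tag == tag:
        count += 1
    # A leaf contributes its own count; inner nodes take the best child path.
    return max([count] + [rec_depth_for_tag(c, tag, count) for c in elem])


# Hypothetical document with one part nested inside another.
doc = ET.fromstring(
    "<root>"
    "<part><name>Refrigerator</name>"
    "<part><name>Compressor</name><number>12</number></part>"
    "<number>456</number></part>"
    "</root>"
)
depth_of_recursion = rec_depth_for_tag(doc, "part")
print(depth_of_recursion)
```

Here two part elements lie on one path, so the value is 2; a document with no nested part elements would give 1.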
Document Depth
Definition:
depth(D): length of the longest root-to-leaf path.
[Figure: document D, a tree with nodes x0 to x7 containing nested part elements, name elements (Refrigerator, Compressor), and number elements (values include 4, 12, and 456).]
Theorem 3: For all queries Q in the fragment that have at least one “/” node,
stream-space(Q) = Ω(log depth(D)).
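One intuition for a logarithmic dependence on depth is that a streaming evaluator typically has to know the level of the current element, and a level counter for a document of depth d takes about log2(d) bits. The sketch below only illustrates that counter; it is not the lower-bound argument. The function name stream_depth and the sample chunks are assumptions, and depth is counted in element nesting levels.

```python
import xml.etree.ElementTree as ET


def stream_depth(chunks):
    """Track the current nesting level with a single counter; return the max."""
    parser = ET.XMLPullParser(events=("start", "end"))
    level = 0
    max_level = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                level += 1
                max_level = max(max_level, level)
            else:
                level -= 1
                elem.clear()  # closed elements are no longer needed
    return max_level


chunks = [
    "<root><part><name>Refrigerator</name>",
    "<part><name>Compressor</name></part></part></root>",
]
max_level = stream_depth(chunks)
counter_bits = max_level.bit_length()
print(max_level, counter_bits)
```

For this input the deepest element sits at level 4, and a counter able to hold the value 4 needs 3 bits.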
New algorithm
Theorem 4(a):
For all queries Q in “Univariate XPath”:
Space: O(|Q| recDepth(D) log depth(D)).
Time: O(|D| |Q| recDepth(D)).
Theorem 4(b):
For all queries Q in a subset of our fragment and for
non-recursive documents D,
Space: O(FrontierSize(Q) log depth(D)).
Time: O(|D| FrontierSize(Q)).
Proof of Theorem 1
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
Fragment:
• “subsumption-free”
• “univariate”
• Conjunctions only
• “star-restricted”
[Figure: example query tree: root, conference, name = PODS, speaker, paper_cnt > 1, name.]
Critical Document
Definition: Document D is critical for query Q, if:
(1) D matches Q.
(2) If we remove from D any node, it no longer matches Q.
[Figure: query Q (root, conference, name = PODS, speaker, paper_cnt > 1, name) and document D (root x0; conference x1; name x2 = PODS; speaker x3 with name x4 = Josifovski and paper_cnt x5 = 1; speaker x6 with name x7 = Fagin and paper_cnt x8 = 3).]
Main Lemmas
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
Theorem 1 follows from the two lemmas:
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D such that
FrontierSize(D) = FrontierSize(Q).
One-way Communication Complexity
f: X × Y → Z
[Figure: Alice holds x and sends a single message m to Bob; Bob holds y and outputs f(x, y).]
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.
Reduction
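A: streaming algorithm for Q using space S.
D = (α, β): document partitioned into a prefix α and a suffix β.
[Figure: Alice runs A on the prefix α and sends state_A(α) to Bob; Bob resumes A from state_A(α) on the suffix β and outputs Q(D).]
Theorem: stream-space(Q) ≥ CC(Q)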
Fooling Set Technique
Partitioned document: D = (α, β), where α is the document prefix and β is the document suffix.
Definition
A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents (α, β) and (α', β') in T, either (α, β') does not match Q or (α', β) does not match Q.
Theorem: For any fooling set T,
CC(Q) = Ω(log |T|).
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FS(D)).
[Figure: query Q (root, conference, name = PODS, speaker, paper_cnt > 1, name) and document D (root x0; conference x1; name x2 = PODS; speaker x3 with name x4 = Fagin and paper_cnt x5 = 3).]
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
[Figure: query Q and the partitioned document D_S built from D (nodes x0 to x5).]
Proof of Lemma 1 (cont)
Claim: { D_S : S ⊆ Frontier(D) } is a fooling set.
Hence stream-space(Q) ≥ log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, we need: either D_ST or D_TS does not match Q (D_ST is the prefix of D_S followed by the suffix of D_T).
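The claim can be checked by brute force on a toy abstraction of the construction. Assume, as a simplification not spelled out on the slides, that the prefix of D_S carries exactly the frontier nodes in S, the suffix carries the remaining frontier nodes, and a document matches Q exactly when every frontier node appears somewhere. The names FRONTIER, matches, and is_fooling_set are hypothetical.

```python
import math
from itertools import combinations

# Largest frontier of the example document D (name, name, paper_cnt).
FRONTIER = frozenset({"x2", "x4", "x5"})


def matches(prefix, suffix):
    """Toy match test: all frontier nodes must be present."""
    return FRONTIER <= (prefix | suffix)


def all_subsets(items):
    ordered = sorted(items)
    return [frozenset(c)
            for r in range(len(ordered) + 1)
            for c in combinations(ordered, r)]


def is_fooling_set(family):
    """Check both conditions of the fooling-set definition."""
    if not all(matches(p, s) for p, s in family):
        return False
    for (p1, s1), (p2, s2) in combinations(family, 2):
        # At least one of the two mixed documents must fail to match.
        if matches(p1, s2) and matches(p2, s1):
            return False
    return True


# D_S = (prefix carrying S, suffix carrying Frontier(D) minus S).
family = [(S, FRONTIER - S) for S in all_subsets(FRONTIER)]
ok = is_fooling_set(family)
bound_bits = math.log2(len(family))
print(ok, len(family), bound_bits)
```

With three frontier nodes there are 8 partitioned documents, so the bound is log2(8) = 3 = FS(D). Mixing the prefix for T = { x4, x5 } with the suffix for S = { x2, x5 } leaves x2 out, which is why that mixed document fails to match.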
Proof of Claim (example)
T = { x4, x5 }, S = { x2, x5 }
[Figure: documents D_T and D_S and the mixed document D_TS. In D_TS the conference name (x2, PODS) is missing, so D_TS does not match Q.]
Algorithm
• Uses the query as an NFA
• Based on three global data structures
  • Pointer array
  • Validation array
  • Level array
• Matches the lower bounds for a fragment of XPath.
Algorithm Example Run
Input XML
<a>
<c>c1</c>
<b>b1</b>
</a>
...
Query: /a[b and c]
Query nodes: $ u0, /a u1, /b u2, /c u3
[Figure sequence: the pointer array, validation array, and level array after each input event. Initially the pointer array has one entry, a, with validation F and level 1. After <a>, entries for b and c are added (indexes 0 and 1), each with validation F and level 2. After </c>, the validation entry for c becomes T. After </b>, the validation entry for b becomes T. After </a>, the validation entry for a becomes T and the algorithm returns TRUE.]
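The run above can be mimicked with a much simplified sketch. It is not the pointer/validation/level-array algorithm from the talk: it keeps one frame per open element, handles only conjunctive queries with child steps and no value predicates, and records, like a validation flag, which children of each candidate query node have been matched. The QNode class and the function stream_matches are hypothetical.

```python
import xml.etree.ElementTree as ET


class QNode:
    """Node of a conjunctive, child-axis-only query tree (hypothetical)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def stream_matches(chunks, query_root):
    """Return True if the streamed document matches the boolean query."""
    parser = ET.XMLPullParser(events=("start", "end"))
    # One frame per open element: (query node, ids of its validated children).
    stack = [[(query_root, set())]]
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                # Candidate query nodes: children of the parent's candidates.
                stack.append([(child, set())
                              for qnode, _ in stack[-1]
                              for child in qnode.children
                              if child.name == elem.tag])
            else:
                for qnode, validated in stack.pop():
                    if len(validated) == len(qnode.children):
                        # Element fully matches qnode: flag it in the parent.
                        for parent, parent_validated in stack[-1]:
                            if any(c is qnode for c in parent.children):
                                parent_validated.add(id(qnode))
                elem.clear()
    root, validated = stack[0][0]
    return len(validated) == len(root.children)


# Query /a[b and c], with "$" as the virtual root u0.
query = QNode("$", [QNode("a", [QNode("b"), QNode("c")])])
result = stream_matches(["<a><c>c1</c>", "<b>b1</b></a>"], query)
print(result)
```

On the example input the flags for c and then b are set when their end tags arrive, and the flag for a is set at </a>, mirroring the order of the T entries in the run. The stack here grows with document depth, so this sketch makes no attempt to meet the space bounds stated earlier.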
Conclusion: Our Contributions
• Space lower bounds on the instance data complexity of XPath on XML streams:
  1. In terms of Query Frontier Size
  2. In terms of Document Recursion Depth
  3. In terms of Document Depth
• A streaming XML algorithm
  • Matches the lower bounds on a fragment of the language
  • Does not use finite-state automata
XPath 1.0
/conference/name
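[Figure: query Q ($ u0, /C u1, /N u2) and document D ($ x0; C x1; N x2 = PODS; S x3 with N x4 = Josifovski and P x5 = 1; S x6 with N x7 = Fagin and P x8 = 3), where C = conference, N = name, S = speaker, P = paper_cnt.]
Result: { x2 }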
XPath 1.0
/conference//name
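[Figure: query Q ($ u0, /C u1, //N u2) and document D ($ x0; C x1; N x2 = PODS; S x3 with N x4 = Josifovski and P x5 = 1; S x6 with N x7 = Fagin and P x8 = 3), where C = conference, N = name, S = speaker, P = paper_cnt.]
Result: { x2, x4, x7 }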
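The difference between the child step (/) and the descendant step (//) in these two queries can be reproduced with Python's standard xml.etree.ElementTree, which supports both forms relative to an element. The variable names below are illustrative, and the document is the running conference example.

```python
import xml.etree.ElementTree as ET

DOC = """<conference>
  <name> PODS </name>
  <speaker><name> Josifovski </name><paper_cnt> 1 </paper_cnt></speaker>
  <speaker><name> Fagin </name><paper_cnt> 3 </paper_cnt></speaker>
</conference>"""

conference = ET.fromstring(DOC)

# /conference/name: name elements that are direct children of conference.
child_names = [n.text.strip() for n in conference.findall("name")]

# /conference//name: name elements anywhere below conference.
descendant_names = [n.text.strip() for n in conference.findall(".//name")]

print(child_names)
print(descendant_names)
```

The child step returns only the conference name (node x2), while the descendant step also returns the two speaker names (nodes x4 and x7), in document order.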
Reduction
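A: S-space streaming algorithm for Q.
r ≥ 1: integer.
[Figure: the document D is split into segments that alternate between Alice and Bob; the algorithm state (s0 to s6 in the example, r = 6) is passed back and forth, and Q(D) is output at the end.]
Theorem: S ≥ CC(Qr) / r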