On the Memory Requirements of XPath Evaluation over XML Streams
Ziv Bar-Yossef
Marcus Fontoura
Vanja Josifovski
IBM Almaden Research Center
Preliminaries: XML
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
[Figure: the document tree. root x0; conference x1; name x2 (PODS); speaker x3 with name x4 (Josifovski) and paper_cnt x5 (1); speaker x6 with name x7 (Fagin) and paper_cnt x8 (3).]
Preliminaries: XPath 1.0
/conference[name = PODS]/speaker[paper_cnt > 1]/name
Result: { x7 }
[Figure: the query tree (root, conference with predicate name = PODS, speaker with predicate paper_cnt > 1, output node name) next to the document tree with nodes x0 to x8. The only selected node is x7, the name of speaker Fagin.]
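To make the semantics of the example query concrete, here is a minimal sketch that evaluates it on the example document with Python's standard `xml.etree.ElementTree`. The helper names `text` and `evaluate` are illustrative and not from the talk. The predicates are coded by hand, and the whole document is loaded into memory, so this shows what the query selects, not how a streaming algorithm would compute it.

```python
import xml.etree.ElementTree as ET

DOC = """
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
"""

def text(elem):
    """Element text with surrounding whitespace stripped."""
    return (elem.text or "").strip()

def evaluate(conference):
    """Hand-coded /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    if conference.tag != "conference":
        return []
    # Predicate on conference: some name child equals PODS.
    if not any(text(n) == "PODS" for n in conference.findall("name")):
        return []
    result = []
    for speaker in conference.findall("speaker"):
        # Predicate on speaker: some paper_cnt child is greater than 1.
        if any(float(text(p)) > 1 for p in speaker.findall("paper_cnt")):
            result.extend(speaker.findall("name"))
    return result

selected = [text(n) for n in evaluate(ET.fromstring(DOC))]
print(selected)
```

Only the name of the second speaker is selected, which corresponds to node x7 in the figure.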
XML Streams
XML stream: XML document arriving as a one-way stream
Why XML streams?
• For transferring XML between systems
• For efficient access to large XML documents
Critical resources:
• Memory
• Processing time
Streaming XML Algorithms
XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
TurboXPath [Josifovski, Fontoura, and Barta 04]
…
Our Results
Space lower bounds for evaluating XPath on
XML streams
A streaming XML algorithm
Matches the lower bounds on a large fragment
of the language
Uses space linear in the query size rather
than exponential in the query size
Related Work
Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03],
[Segoufin 03]
Space complexity of XPath evaluation over streams of
indexed XML data [Choi, Mahoui, Wood 03]
Space complexity of select-project-join queries over
relational data streams [Arasu et al 02]
Data Complexity [Vardi 82]
(Q,D): evaluation as a function of both the query Q and the document D.
Q(D): evaluation of a fixed query Q as a function of the document D.
Data complexity of Q:
Complexity of the best algorithm
for Q on the worst-case D.
Worst-case data complexity: max_Q (data complexity of Q).
We characterize the data complexity of each query Q
separately (not just that of the worst-case query).
XPath Fragment
1. Queries are subsumption-free
Not subsumption-free: /conference[name = PODS and name != SIGMOD]
(the predicate name = PODS already implies name != SIGMOD)
Subsumption-free: /conference[name != SIGMOD]
XPath Fragment (cont.)
2. Queries are univariate
Not univariate: /conference[paper_cnt < author_cnt]
Univariate: /conference[paper_cnt < 30 and author_cnt > 30]
XPath Fragment (cont.)
3. Queries consist of conjunctions only
4. Queries are “star-restricted”
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
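The frontier definition can be checked mechanically. The following is a small sketch, assuming the query is given as a tree of hypothetical `QNode` objects (a name introduced here for illustration). It computes the frontier at each node as the node, its siblings, and the siblings of its ancestors, and returns the largest size.

```python
from dataclasses import dataclass, field

@dataclass
class QNode:
    label: str
    children: list = field(default_factory=list)

def frontier_size(root):
    """Size of the largest frontier over all nodes of the query tree."""
    best = 1  # frontier at the root is just the root itself

    def visit(node, ancestor_siblings):
        nonlocal best
        for child in node.children:
            # child, plus its siblings, plus the siblings of its ancestors
            size = ancestor_siblings + len(node.children)
            best = max(best, size)
            # one level down, child is an ancestor: drop it, keep its siblings
            visit(child, size - 1)

    visit(root, 0)
    return best

# /conference[name = PODS]/speaker[paper_cnt > 1]/name
query = QNode("root", [
    QNode("conference", [
        QNode("name = PODS"),
        QNode("speaker", [QNode("paper_cnt > 1"), QNode("name")]),
    ]),
])
print(frontier_size(query))
```

For the example query the largest frontier is the one at either child of speaker: that child, its sibling, and the conference-level name = PODS node, so the sketch reports 3.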
Document Recursion Depth
Definition:
recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
[Figure: query Q (root, //part with children name and number) and a document D of nested part elements with name and number children (nodes x0 to x7).]
Theorem 2: For all queries Q in the fragment that have at least one “//” node,
stream-space(Q) = Ω(recDepth_Q(D)).
Document Depth
Definition:
depth(D): length of the longest root-to-leaf path.
[Figure: document D of nested part elements with name and number children (nodes x0 to x7).]
Theorem 3: For all queries Q in the
fragment that have at least one “/” node,
stream-space(Q) = Ω(log depth(D)).
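As an illustration of where a log depth(D) term comes from, the sketch below tracks the current nesting level of a stream with a single counter, using Python's standard `xml.sax`. The counter never exceeds depth(D), so it fits in O(log depth(D)) bits. This is only an illustration of level tracking, not the lower-bound argument. The document `DOC` is made up for the example, and depth is counted here in nested elements, which is an assumption about the counting convention.

```python
import xml.sax

class DepthHandler(xml.sax.ContentHandler):
    """One-pass depth computation with two integer counters."""

    def __init__(self):
        super().__init__()
        self.current = 0   # nesting level of the element being read
        self.deepest = 0   # largest level seen so far

    def startElement(self, name, attrs):
        self.current += 1
        self.deepest = max(self.deepest, self.current)

    def endElement(self, name):
        self.current -= 1

def stream_depth(xml_bytes):
    handler = DepthHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.deepest

# Illustrative document with three nested part elements.
DOC = (b"<part><name>Refrigerator</name>"
       b"<part><name>Compressor</name>"
       b"<part><number>456</number></part></part></part>")
print(stream_depth(DOC))
```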
New algorithm
Theorem 4(a):
For all queries Q in “Univariate XPath”:
Space: O(|Q| recDepth(D) log depth(D)).
Time: O(|D| |Q| recDepth(D)).
Theorem 4(b):
For all queries Q in a subset of our fragment and for
non-recursive documents D,
Space: O(FrontierSize(Q) log depth(D)).
Time: O(|D| FrontierSize(Q)).
Proof of Theorem 1
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
Fragment:
• “subsumption-free”
• “univariate”
• Conjunctions only
• “star-restricted”
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
Critical Document
Definition: Document D is critical for query Q, if:
(1) D matches Q.
(2) Removing any node from D yields a document that no longer matches Q.
[Figure: query Q = /conference[name = PODS]/speaker[paper_cnt > 1]/name next to the example document D with nodes x0 to x8.]
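The definition of a critical document can be tested by brute force on small inputs. The sketch below assumes that "removing a node" means deleting an element together with its subtree, and that a document "matches" the example query when the result is non-empty. Both are interpretations made for this illustration, and `matches` and `is_critical` are hypothetical helper names.

```python
import copy
import xml.etree.ElementTree as ET

def matches(conf):
    """Non-empty result for /conference[name = PODS]/speaker[paper_cnt > 1]/name?"""
    if conf.tag != "conference":
        return False
    if not any((n.text or "").strip() == "PODS" for n in conf.findall("name")):
        return False
    for speaker in conf.findall("speaker"):
        counts = [float((p.text or "0").strip()) for p in speaker.findall("paper_cnt")]
        if any(c > 1 for c in counts) and speaker.findall("name"):
            return True
    return False

def is_critical(root):
    """D matches Q, and deleting any single element below the root breaks the match."""
    if not matches(root):
        return False
    total = len(list(root.iter()))
    for i in range(1, total):            # index 0 is the document element itself
        trial = copy.deepcopy(root)
        nodes = list(trial.iter())
        victim = nodes[i]
        parent = next(p for p in nodes if victim in list(p))
        parent.remove(victim)            # removes the element and its subtree
        if matches(trial):
            return False
    return True

small = ET.fromstring(
    "<conference><name>PODS</name>"
    "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>")
full = ET.fromstring(
    "<conference><name>PODS</name>"
    "<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
    "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>")
print(is_critical(small), is_critical(full))
```

Under these assumptions the document with only the Fagin speaker is critical, while the full example document is not, because the Josifovski speaker can be removed without affecting the match.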
Main Lemmas
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
The proof combines two lemmas:
Lemma 1: For all queries Q in the fragment and
any critical document D for Q,
stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment,
there is a critical document D such that
FrontierSize(D) = FrontierSize(Q).
One-way Communication Complexity
f: X × Y → Z
Alice holds x and sends a single message m to Bob; Bob holds y and outputs f(x,y).
CC(f) = number of communication bits used by the
best protocol on the worst-case choice of inputs.
Reduction
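A : streaming algorithm for Q using space S
D = (α, β): document split into a prefix α (held by Alice) and a suffix β (held by Bob)
Alice runs A on α and sends state_A(α), the state of A after reading α, to Bob.
Bob resumes A from state_A(α) on β and outputs Q(D).
Theorem: stream-space(Q) >= CC(Q)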
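The reduction can be sketched in a few lines: any streaming algorithm yields a one-way protocol whose only message is the algorithm's state at the point where the document is split. The toy algorithm `SeenBothTags` and the function `one_way_protocol` below are hypothetical names introduced for illustration. The toy algorithm is far simpler than an XPath evaluator and stands in for A only to show the mechanics.

```python
class SeenBothTags:
    """Toy streaming algorithm: did the token stream contain both <b> and <c>?"""

    def __init__(self, state=(False, False)):
        self.seen_b, self.seen_c = state

    def feed(self, token):
        if token == "<b>":
            self.seen_b = True
        elif token == "<c>":
            self.seen_c = True

    def state(self):
        return (self.seen_b, self.seen_c)   # the entire memory of the algorithm

    def result(self):
        return self.seen_b and self.seen_c

def one_way_protocol(prefix, suffix):
    """Alice runs the algorithm on the prefix; Bob resumes it on the suffix."""
    alice = SeenBothTags()
    for token in prefix:
        alice.feed(token)
    message = alice.state()                 # the only communication: A's state
    bob = SeenBothTags(message)
    for token in suffix:
        bob.feed(token)
    return bob.result(), message

answer, message = one_way_protocol(["<a>", "<c>", "</c>"], ["<b>", "</b>", "</a>"])
print(answer, message)
```

Because the message is exactly the state, its length is bounded by the space S of the streaming algorithm, which is why a communication lower bound carries over to space.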
Fooling Set Technique
Partitioned document: D = (α, β), where α is the document prefix and β is the document suffix.
Definition
A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents (α1, β1), (α2, β2) in T, either (α1, β2) does not
match Q or (α2, β1) does not match Q.
Theorem: For any fooling set T,
CC(Q) = Ω(log |T|).
Proof of Lemma 1
Lemma 1: For all queries Q in the
fragment and any critical document D for Q,
stream-space(Q) = Ω(FS(D)).
[Figure: query Q = /conference[name = PODS]/speaker[paper_cnt > 1]/name and a critical document D: root x0; conference x1; name x2 (PODS); speaker x3 with name x4 (Fagin) and paper_cnt x5 (3).]
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned
document D_S.
Example: S = { x2, x5 }
[Figure: query Q and the partitioned document D_S over nodes x0 to x5.]
Proof of Lemma 1 (cont)
Claim: { D_S : S ⊆ Frontier(D) } is a fooling set.
Hence stream-space(Q) >= log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_ST or D_TS does not match Q, where D_ST combines the prefix of D_S with the suffix of D_T.
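The claim can be checked exhaustively for the running example under a simplified set model. The modelling assumptions, which are not spelled out on the slides, are that the prefix of a document built from a subset S holds exactly the frontier nodes in S, that its suffix holds the remaining frontier nodes, and that, by criticality, a document matches Q only if no frontier node is missing. The frontier { x2, x4, x5 } is read off the example document.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})

def subsets(items):
    """All subsets of items, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def crossover_matches(prefix_of, suffix_of):
    """Prefix of the document for prefix_of glued to the suffix of the one for suffix_of.

    The glued document contains the nodes of prefix_of plus the frontier nodes
    outside suffix_of; it matches Q only if every frontier node is present.
    """
    present = prefix_of | (FRONTIER - suffix_of)
    return present == FRONTIER

docs = subsets(FRONTIER)
is_fooling_set = (
    all(crossover_matches(s, s) for s in docs)                    # condition 1
    and all(not crossover_matches(s, t) or not crossover_matches(t, s)
            for s in docs for t in docs if s != t)                # condition 2
)
print(is_fooling_set, len(docs), log2(len(docs)))
```

With three frontier nodes there are 2^3 = 8 documents in the set, so the bound log |T| equals 3, the frontier size. For T = { x4, x5 } and S = { x2, x5 }, the crossover that takes its prefix from T and its suffix from S lacks x2, the conference name.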
Proof of Claim (example)
T = { x4, x5 }
S = { x2, x5 }
[Figure: documents D_T and D_S and the crossover document D_TS, each over nodes x0 to x5.]
In D_TS the conference name (x2) is missing, so D_TS does not match Q.
Algorithm
Uses the query as an NFA
Based on three global data structures
Pointer array
Validation array
Level array
Matches the lower bounds for a fragment of
XPath.
Algorithm Example Run
Input XML
<a>
<c>c1</c>
<b>b1</b>
</a>
...
Query: /a[b and c]
Query nodes: $ u0, /a u1, /b u2, /c u3
Initial state, each array with one entry:
• Pointer array: a
• Validation array: F
• Level array: 1
Algorithm Example Run (cont.)
Array entries (validation flag, level) after each stream event, for the same input and the query /a[b and c]:
• <a>: a (F, 1); b (F, 2) at index 0; c (F, 2) at index 1
• <c>: no change
• </c>: c becomes (T, 2)
• <b>: no change
• </b>: b becomes (T, 2)
• </a>: a becomes (T, 1); return TRUE
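The example run can be mimicked with a much simpler one-pass check. The sketch below is not the pointer, validation, and level array algorithm from the talk. It is a hand-written evaluator for the single Boolean query /a[b and c], built on Python's standard `xml.sax`, that keeps one validation flag per predicate child and a level counter. The class name `ABandC` and the function `run` are illustrative.

```python
import xml.sax

class ABandC(xml.sax.ContentHandler):
    """Simplified one-pass check for the Boolean query /a[b and c]."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.valid = {"b": False, "c": False}   # one flag per predicate child
        self.matched = False

    def startElement(self, name, attrs):
        self.level += 1

    def endElement(self, name):
        if self.level == 2 and name in self.valid:
            # The flag flips when the child element closes.
            self.valid[name] = True
        elif self.level == 1 and name == "a":
            # On </a>, the query node a is validated if both children were.
            self.matched = all(self.valid.values())
        self.level -= 1

def run(xml_bytes):
    handler = ABandC()
    xml.sax.parseString(xml_bytes, handler)
    return handler.matched

print(run(b"<a><c>c1</c><b>b1</b></a>"))
```

The memory used is two flags and one counter, independent of the document length, which is the kind of behaviour the trace illustrates for this small query.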
Conclusion: Our Contributions
Space lower bounds on the instance data
complexity of XPath on XML streams:
1. In terms of Query Frontier Size
2. In terms of Document Recursion Depth
3. In terms of Document Depth
A streaming XML algorithm
Matches the lower bounds on a fragment of the
language
Does not use finite-state automata
XPath 1.0
/conference/name
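Result: { x2 }
[Figure: document D ($ x0; conference x1; name x2 = PODS; speaker x3 with name x4 = Josifovski and paper_cnt x5 = 1; speaker x6 with name x7 = Fagin and paper_cnt x8 = 3) and query Q ($ u0, /conference u1, /name u2).]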
XPath 1.0
/conference//name
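Result: { x2, x4, x7 }
[Figure: document D ($ x0; conference x1; name x2 = PODS; speaker x3 with name x4 = Josifovski and paper_cnt x5 = 1; speaker x6 with name x7 = Fagin and paper_cnt x8 = 3) and query Q ($ u0, /conference u1, //name u2).]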
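The difference between the child axis (/) and the descendant axis (//) can be reproduced with the limited XPath subset of Python's standard `xml.etree.ElementTree`. In this sketch the paths are written relative to the conference element, because ElementTree evaluates paths relative to the element on which `findall` is called.

```python
import xml.etree.ElementTree as ET

DOC = """
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
"""

conference = ET.fromstring(DOC)

# /conference/name: name elements that are direct children of conference
child_names = [n.text.strip() for n in conference.findall("./name")]

# /conference//name: name elements anywhere below conference
descendant_names = [n.text.strip() for n in conference.findall(".//name")]

print(child_names)
print(descendant_names)
```

The first list corresponds to { x2 } and the second to { x2, x4, x7 } in the figures.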
Reduction
A : S-space streaming algorithm for Q.
r ≥ 1: integer.
[Figure: protocol between Alice and Bob over a document D split into segments s0 to s6 (r = 6); the output is Q(D).]
Theorem: S ≥ CC(Qr) / r