
Containment of Nested XML Queries
Xin (Luna) Dong, Alon Halevy, Igor Tatarinov
University of Washington
Query Containment
• The most fundamental relationship between a pair of queries
• Query Q is contained in Q’ if:
  – For any database D, Q(D) is a subset of Q’(D)
Applications of Query Containment
• Semantic caching
• Reasoning about contents of data sources in data integration
• Verification of integrity constraints
• Verification of knowledge bases
• Determining queries independent of updates
• Query answering using views
• Query processing in PDMS

XML Query Containment in Peer Data Management System (PDMS)
[Figure: a PDMS with peers Stanford, UW, UPenn and Berkeley, mappings MWS, MBW, MPW and MSB, and queries QS, QW, QP, QB1 and QB2]
• Answering queries using views to extract remote data
• Removing redundant queries to enhance performance
[Tatarinov and Halevy, SIGMOD 2004]
Query Containment: Relational vs. XML
                        Relational                    XML
Input D                 Sets of tuples                An XML instance tree
Output Q(D)             A set of tuples               An XML instance tree
Instance containment    Q(D) ⊆ Q’(D)                  Q(D) ⊑ Q’(D)
                        – Subset                      – Tree homomorphism
Query containment       Q ⊆ Q’                        Q ⊑ Q’
                        – for every input D,          – for every input D,
                          Q(D) ⊆ Q’(D)                  Q(D) ⊑ Q’(D)
Example – An XML Instance
D:
<project>
  <member>Alice</member>
</project>
<project>
  <member>Bob</member>
</project>
As a tree: project(member(Alice)), project(member(Bob))
Example – An XML Query
Q:
for $x in /project return
  <group>{
    for $y in $x/member
    return
      <name>{
        where $y="Alice"
        return <Alice/>
        where $y="Bob"
        return <Bob/>
      }</name>
  }</group>
D: project(member(Alice)), project(member(Bob))
Q(D): group(name(Alice)), group(name(Bob))
Example – Another XML Query
Q’:
for $x in /project return
  <group>{
    for $y in /project/member
    return
      <name>{
        where $y="Alice"
        return <Alice/>
        where $y="Bob"
        return <Bob/>
      }</name>
  }</group>
D: project(member(Alice)), project(member(Bob))
Q’(D): group(name(Alice), name(Bob))
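The two example queries differ only in the scope of the inner loop. A minimal Python sketch of both evaluations on D follows. The representation is an assumption made for illustration, not something the slides define: D is a list of projects, each a list of member names, trees are (label, children) tuples, and identical sibling subtrees are merged, which matches the single group drawn for Q’(D). The names leaf_for, dedupe, eval_q and eval_q_prime are hypothetical.

```python
# Trees are (label, children) pairs; children is a tuple so trees are hashable.

def leaf_for(member):
    # Mirrors the two "where ... return" branches of the slide queries.
    return tuple((m, ()) for m in ("Alice", "Bob") if member == m)

def dedupe(trees):
    # Assumption: identical sibling subtrees are merged (set semantics).
    return tuple(dict.fromkeys(trees))

def eval_q(db):
    # Q: the inner loop ranges over the members of the current project only.
    groups = []
    for project in db:
        names = [("name", leaf_for(m)) for m in project]
        groups.append(("group", dedupe(names)))
    return dedupe(groups)

def eval_q_prime(db):
    # Q': the inner loop ranges over the members of every project.
    all_members = [m for project in db for m in project]
    groups = []
    for _project in db:
        names = [("name", leaf_for(m)) for m in all_members]
        groups.append(("group", dedupe(names)))
    return dedupe(groups)

# D from the example: two projects with one member each.
D = [["Alice"], ["Bob"]]
answer_q = eval_q(D)
answer_q_prime = eval_q_prime(D)
```

Under these assumptions answer_q holds two group trees with one name each, while answer_q_prime holds a single group tree with both names.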
Example – Tree Homomorphism and Query Containment
Q(D): group(name(Alice)), group(name(Bob))
Q’(D): group(name(Alice), name(Bob))
• Q(D) ⊑ Q’(D): holds. Each group of Q(D) maps into the single group of Q’(D).
• Q’(D) ⊑ Q(D): does not hold (X). No group of Q(D) has both a name with Alice and a name with Bob.
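The homomorphism test between two answer trees can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: trees are (label, children) pairs, a virtual "root" node (an assumption) holds the top-level group elements, and the names contained and tree are hypothetical.

```python
def contained(t1, t2):
    """True if t1 maps homomorphically into t2: the root labels are equal and
    every child subtree of t1 is contained in some child subtree of t2."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return False
    return all(any(contained(c1, c2) for c2 in kids2) for c1 in kids1)

def tree(label, *kids):
    return (label, list(kids))

# Q(D): two groups, each with one name.
q_answer = tree("root",
                tree("group", tree("name", tree("Alice"))),
                tree("group", tree("name", tree("Bob"))))
# Q'(D): one group with both names.
q_prime_answer = tree("root",
                      tree("group",
                           tree("name", tree("Alice")),
                           tree("name", tree("Bob"))))

forward = contained(q_answer, q_prime_answer)    # Q(D) into Q'(D)
backward = contained(q_prime_answer, q_answer)   # Q'(D) into Q(D)
print(forward, backward)  # prints: True False
```

Several nodes of the first tree may map to the same node of the second, which is why both groups of Q(D) can map into the one group of Q’(D).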
Query Containment Problem
• From answer containment to query containment
  – Q(D) ⊑ Q’(D) for every D ⇒ Q ⊑ Q’
  – Q’(D) ⋢ Q(D) for some D ⇒ Q’ ⋢ Q
• Our problems
  – Given queries Q and Q’, decide whether Q ⊑ Q’
  – The complexity of query containment
Previous Work (I)
• Relational query containment
  – Conjunctive queries [Chandra and Merlin, STOC 1977]
  – Acyclic queries [Yannakakis, VLDB 1981]
  – Queries with union [Sagiv and Yannakakis, JACM 1980]
  – Queries with negation [Levy and Sagiv, VLDB 1993]
  – Queries with arithmetic comparisons [Klug, JACM 1988]
  – Recursive queries [Shmueli, 1993], [Chaudhuri and Vardi, 1992]
  – Queries over bags [Ioannidis and Ramakrishnan, 1995]
Previous Work (II)
• XML query containment – two new challenges
  – XPath containment
    • With *, // and […] [Miklau and Suciu, PODS 2002]
    • With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
    • Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
  – Nested query containment
Containment Cannot be Determined Solely by Comparing XPath Components
Q:
for $g in /group
where $g/gname/text() = "database"
return
  <area>{
    for $p in $g/person return
      <person>
        <name>{$p/text()}</name>
        {for $q in $g/paper
         where $q/author/text() = $p/text()
         return
           <paper>{$q/title/text()}</paper>}
      </person>
  }</area>
Q’:
for $g in /group return
  <area>{
    for $p in $g/person return
      <person>
        <name>{$p/text()}</name>
        <group>{$g/gname/text()}</group>
        {for $q in $g/paper
         where $q/author/text() = $p/text()
         return
           <paper>{$q/title/text()}</paper>}
      </person>
  }</area>
Previous Work (II)
• XML query containment – two new challenges
  – XPath containment
    • With *, // and […] [Miklau and Suciu, PODS 2002]
    • With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
    • Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
  – Nested query containment
    • Complex object query containment [Levy and Suciu, PODS 1997]
• Containment of nested XML queries has not been fully studied
Our Focus: Nested XML Queries
• Returned tag constants
• Conjunctive – no two sibling query blocks return the same tag
• XPath:
  – HAVE
    • Child axis (/)
    • Wildcards (*)
    • Branches ([…])
  – NOT HAVE
    • Descendant axis (//)
    • Arithmetic comparison
    • Union
• Here, XPath containment is in PTIME
Complexity Result (I)
                      Depth fixed      Depth arbitrary
Fanout = 1            PTIME            PTIME
Fanout arbitrary      coNP-complete    in coNEXPTIME
Complexity Result (II)
Query type    No tag variables   With tag variables   With unions     With neg        With //         With equijoin on tags   With arith comp
Unnested      PTIME              PTIME                coNP-complete   coNP-complete   coNP-complete   NP-complete             Π2P-complete
Fanout=1      PTIME              PTIME                coNP-complete   coNP-complete   coNP-complete   NP-complete             Π2P-complete
Fixed depth   coNP-complete      coNP-complete        coNP-complete   coNP-complete   coNP-complete   Π2P-complete            Π2P-complete
General       in coNEXPTIME
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          PTIME            PTIME
      Fanout arbitrary    coNP-complete    in coNEXPTIME
• Query containment in practice
• Relaxing the assumptions
• Conclusions
Deciding Q ⊑ Q’?
• How to find a property for an infinite number of input XML instances
• Standard technique
  – Find a finite set of input representatives – Canonical Databases
  – Relational query: each canonical database is a minimal input to generate the answer template
• XML query answers have an infinite number of shapes
  – Find a finite set of answer templates – Canonical Answers
Answer Shapes Determined by the Head Tree
Q’:
for $x in /project return
  <group>{
    for $y in /project/member return
      <name>{
        where $y="Alice"
        return <Alice/>
        where $y="Bob"
        return <Bob/>
      }</name>
  }</group>
Head Tree: group(name(Alice, Bob))
[Figure: answer shapes group; group(name); group(name(Alice)); group(name(Bob))]
An Additional Candidate Answer
Head Tree: group(name(Alice, Bob))
[Figure: the answer shapes group; group(name); group(name(Alice)); group(name(Bob)), plus the additional candidate group(name(Alice), name(Bob))]
Why Consider the Additional Case
Head Tree: group(name(Alice, Bob))
D: project(member(Alice)), project(member(Bob))
Q’(D): group(name(Alice), name(Bob))
Q(D): group(name(Alice)), group(name(Bob))
What can Serve as Canonical Answers?
• Prefix subtrees of the head tree?
  – necessary but not sufficient
• Trees contained in the head tree?
  – necessary and sufficient
  – but, too many and too complex
A Head Tree can Have Many Trees Contained in it
Head Tree: group(name(Alice, Bob))
[Figure: several trees contained in the head tree, each a group with one or more name children carrying Alice and/or Bob]
What can Serve as Canonical Answers?
• Prefix subtrees of the head tree?
  – necessary but not sufficient
• Trees contained in the head tree?
  – necessary and sufficient
  – but, too many and too complex
• Our solution: consider only minimal trees that are contained in the head tree
Canonical Answer
• A minimal XML instance: no two sibling subtrees where one is contained in the other
• Canonical Answer: a minimal XML instance contained in the head tree
[Figure: example trees built from group, name, Alice and Bob nodes]
• Every answer A of query Q corresponds to a unique canonical answer CA, s.t. A ⊑ CA, CA ⊑ A
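The step from an arbitrary answer to its minimal form can be sketched as follows. This is a hedged illustration of the definition above, not the authors' code: trees are (label, children) pairs, contained is the homomorphism test, and minimize (a hypothetical name) drops every sibling subtree that is contained in another sibling.

```python
def contained(t1, t2):
    """Homomorphism test: equal root labels, and every child subtree of t1
    is contained in some child subtree of t2."""
    if t1[0] != t2[0]:
        return False
    return all(any(contained(c1, c2) for c2 in t2[1]) for c1 in t1[1])

def minimize(t):
    """Return an equivalent tree in which no sibling subtree is contained
    in another sibling subtree."""
    label, kids = t
    kept = []
    for kid in (minimize(k) for k in kids):
        if any(contained(kid, other) for other in kept):
            continue  # redundant: an already kept sibling covers it
        # Drop previously kept siblings that the new subtree covers.
        kept = [other for other in kept if not contained(other, kid)]
        kept.append(kid)
    return (label, kept)

# A non-minimal answer: two of the three name subtrees are redundant.
answer = ("group", [("name", [("Alice", [])]),
                    ("name", [("Alice", []), ("Bob", [])]),
                    ("name", [])])
canonical = minimize(answer)
```

For this input the result is the single tree group(name(Alice, Bob)), and the answer and its minimal form are contained in each other, as the last bullet states.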
Canonical Database
• Canonical Database: DB_CA
  – The minimal XML instance to generate CA
for $x in /project return
  <group>{
    for $y in /project/member return
      <name>{
        where $y="Alice"
        return <Alice/>
        where $y="Bob"
        return <Bob/>
      }</name>
  }</group>
CA: group(name(Alice), name(Bob))
DB: [Figure: an XML instance with three project nodes, two member nodes, and leaves Alice and Bob]
Sound and Complete Conditions for Nested Query Containment
Theorem 1. Q ⊑ Q’ if and only if for every canonical database DB of Q, Q(DB) ⊑ Q’(DB)
Theorem 2. Q ⊑ Q’ if and only if for every canonical answer CA of Q,
  – CA is a canonical answer of Q’
  – DB’_CA ⊑ DB_CA
Query Containment Algorithm
• Algorithm:
  for every canonical answer CA of Q do
    1. check whether CA is a canonical answer of Q’
    2. generate DB_CA and DB’_CA
    3. check DB’_CA ⊑ DB_CA
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          ?                ?
      Fanout arbitrary    ?                ?
• Query containment in practice
• Relaxing the assumptions
• Conclusions
Query Containment Algorithm
• Algorithm:
  for every canonical answer CA of Q do
    1. check whether CA is a canonical answer of Q’
    2. generate DB_CA and DB’_CA
    3. check DB’_CA ⊑ DB_CA
• Polynomial in the size and number of canonical answers
  – What are the sizes of canonical answers?
  – What is the number of canonical answers?
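The control flow of the algorithm can be sketched as a loop over canonical answers. Everything concrete here is an assumption for illustration: the slides do not say how canonical answers or canonical databases are produced, so they are passed in as callables, and the direction of the final check (the canonical database of Q’ contained in that of Q) is a reading of the slide. The toy data below only exercises the control flow.

```python
def query_contained(canonical_answers, is_answer_of_q_prime,
                    db_for_q, db_for_q_prime, tree_contained):
    """Return True when every canonical answer CA of Q passes both checks."""
    for ca in canonical_answers:
        # Step 1: CA must also be a canonical answer of Q'.
        if not is_answer_of_q_prime(ca):
            return False
        # Steps 2 and 3: build both canonical databases and compare them.
        if not tree_contained(db_for_q_prime(ca), db_for_q(ca)):
            return False
    return True

# Toy stand-ins (hypothetical): answers are strings, databases are sets,
# and containment is plain subset.
toy_answers = ["group", "group/name"]
toy_db_q = {"group": {"project"}, "group/name": {"project", "member"}}
toy_db_q_prime = {"group": {"project"}, "group/name": {"project", "member"}}

result_ok = query_contained(toy_answers, lambda ca: ca in toy_db_q_prime,
                            toy_db_q.get, toy_db_q_prime.get,
                            lambda a, b: a <= b)
result_fail = query_contained(toy_answers, lambda ca: ca == "group",
                              toy_db_q.get, toy_db_q_prime.get,
                              lambda a, b: a <= b)
```

The loop makes one pass per canonical answer, so the total cost depends on how many canonical answers there are and how large each one is.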
Containment of XML Queries with Fanout 1
E.g. d=3 – the depth; m=1 – the maximum fanout
for $x in /project return
  <group>{for $y in /project/member return
    <name>{where $y="Alice"
      return <Alice/>
    }</name>
  }</group>

Canonical Answers and Complexity
[Figure: canonical answers group; group(name); group(name(Alice))]
• Number: the depth of the query
• Size: bounded by the depth of the query
• Complexity: O(d·|Q|·|Q’|)
• Theorem: Testing containment of XML queries with fanout 1 is in PTIME
• Nesting with fanout 1 does not increase complexity
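With fanout 1 the head tree is a single path, so its minimal contained trees are just the prefixes of that path. The sketch below enumerates them for the d=3 example. It assumes the head tree is given as a list of labels from root to leaf, and chain_prefixes is a hypothetical helper name.

```python
def chain_prefixes(labels):
    """Return every root-anchored prefix of a single-path head tree,
    each as a nested (label, children) pair."""
    answers = []
    for depth in range(1, len(labels) + 1):
        tree = None
        # Build the chain bottom-up from the deepest label of this prefix.
        for label in reversed(labels[:depth]):
            tree = (label, [] if tree is None else [tree])
        answers.append(tree)
    return answers

head_path = ["group", "name", "Alice"]  # head tree of the d=3 example
answers = chain_prefixes(head_path)
```

The list has d entries and the largest has d nodes, which matches the bounds on the number and size of canonical answers stated above.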
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          PTIME            PTIME
      Fanout arbitrary    ?                ?
• Query containment in practice
• Relaxing the assumptions
• Conclusions
Containment of XML Queries with Arbitrary Fanout
• E.g. d=4 – the depth; m=3 – the maximum fanout
[Figure: canonical answers drawn as trees over child labels 1, 2 and 3]
• Canonical answers – complexity
  [Formulas for Number and Size lost in extraction; only the exponents d-1, d-2 and d-1 survive]
• Theorem: Testing containment of XML queries with depth 2 and arbitrary fanout is coNP-hard
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          PTIME            PTIME
      Fanout arbitrary    coNP-hard        coNP-hard        (NOT TIGHT)
• Query containment in practice
• Relaxing the assumptions
• Conclusions
Effect of the Depth on Containment of XML Queries
• Insight: Kernel Canonical Answer
  – The root node has a single child
  – In any subtree, a path pattern is repeated no more than cd times.
    d – query depth
    c – #(maximum path steps in a query block)
• The size of kernel canonical answers
  – Polynomial in the query size
  – Exponential in the query depth
• Theorem:
  – Testing containment of XML queries with fixed depth is coNP-complete
  – Testing containment of XML queries with arbitrary depth is in coNEXPTIME
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          PTIME            PTIME
      Fanout arbitrary    coNP-complete    in coNEXPTIME
• Query containment in practice
• Relaxing the assumptions
• Conclusions
Containment Checking in Practice
• Analyze element cardinality to reduce the number of canonical answers for containment checking
Q:
for $g in /group
where $g/gname/text() = "database"
return
  <area>{
    for $p in $g/person return
      <person>
        <name>{$p/text()}</name>
        {for $q in $g/paper
         where $q/author/text() = $p/text()
         return
           <paper>{$q/title/text()}</paper>}
      </person>
  }</area>

Q’:
for $g in /group return
  <area>{
    for $p in $g/person return
      <person>
        <name>{$p/text()}</name>
        <group>{$g/gname/text()}</group>
        {for $q in $g/paper
         where $q/author/text() = $p/text()
         return
           <paper>{$q/title/text()}</paper>}
      </person>
  }</area>
#canonical answers – originally: 71 → after analysis: 2
Roadmap
• Introduction and problem definition
• Containment of a subset of XML queries
  – Query containment is decidable
  – Complexity by fanout and depth:
                          Depth fixed      Depth arbitrary
      Fanout = 1          PTIME            PTIME
      Fanout arbitrary    coNP-complete    in coNEXPTIME
• Query containment in practice
• Relaxing the assumptions
• Conclusions
An Example Query that Returns Tag Variables
for $x in dbGrp return
  <result>{
    for $y in $x/proj return
      <group>{
        for $u in $y/member return
          <name> $u/text() </name>
        for $v in $y/paper return
          <pub> $v/text() </pub>
      }</group>
  }</result>
Deciding Query Containment
• Leverage previous results – simulation mapping [Levy and Suciu, PODS’97]
• Check query simulation mapping for every canonical answer
• Complexity
  – Simulation mapping can be checked in time polynomial in the query size
  – The complexity of checking containment does not increase
Other Extensions
Query type    No tag variables   With tag variables   With unions     With neg        With //         With equijoin on tags   With arith comp
Unnested      PTIME              PTIME                coNP-complete   coNP-complete   coNP-complete   NP-complete             Π2P-complete
Fanout=1      PTIME              PTIME                coNP-complete   coNP-complete   coNP-complete   NP-complete             Π2P-complete
Fixed depth   coNP-complete      coNP-complete        coNP-complete   coNP-complete   coNP-complete   Π2P-complete            Π2P-complete
General       in coNEXPTIME
Conclusions
• Contributions
  – A sound and complete condition for containment of nested XML queries
  – Detailed complexity analysis
• Future work
  – Fill in the open gap of complexity in case of queries with arbitrary fanout and arbitrary nesting depth
  – Evaluate and optimize the containment algorithm with element cardinality analysis
  – Answering nested XML queries using views
Containment of Nested XML Queries
@VLDB 2004
Xin (Luna) Dong, Alon Halevy, Igor Tatarinov
University of Washington
www.cs.washington.edu/homes/lunadong