Document

XML Query Reformulation
Val Tannen
University of Pennsylvania
Joint work with Alin Deutsch, UC San Diego
and in part with Lucian Popa, IBM Almaden
NTUA
April 17, 2003
1
Data Exchange Between Businesses Using XML
[Figure: a hospital, an insurance company, and a pharmaceutical company each hold proprietary data and expose published data; the published data is exchanged between them as XML.]
XML?
<drug>                    (opening tag)
<name>aspirin</name>      (text)
<price>$4</price>
<notes>
<side-effects>upset stomach</side-effects>
<maker>Bayer</maker>
</notes>
</drug>                   (matching closing tag)
The same document as a tagged tree:
drug
  name: “aspirin”
  price: “$4”
  notes
    side-effects: “upset stomach”
    maker: “Bayer”
A Simple Publishing Scenario
client
virtual data
<study>
<case>
<diag>migraine</diag>
<!-- patient name is hidden -->
<drug>aspirin</drug>
<usage>2/day</usage>
</case>
<case>
<diag>allergy</diag>
<drug>cortisone</drug>
<usage>3/day</usage>
</case>
</study>
The published data is virtual; the proprietary data is stored.
client query: XQuery (XML query language, standard (draft))
reformulation: SQL
correspondence between published and proprietary data: expressed by publishing query (view)
proprietary data:
prescription
usage | drug | name
2/day | aspirin | John
3/day | cortisone | Jane
patient
name | diagnosis
John | migraine
Jane | allergy
How to express the view?
View = query which, if executed,
would produce the virtual data
How to “compose” the client
query with the view,
obtaining the reformulation?
The General Problem of Query Reformulation
client query Q(P), posed against schema P
? reformulated query X(S), posed against schema S
schema correspondence relates P and S
Given query Q(P), find query(ies) X(S) returning same answer (soundness),
whenever such X(S) exists (completeness)
Applications of Query Reformulation
• data publishing: public schema (P) / storage schema (S); we just saw it
• data integration: global schema (P) / local schema (S)
• schema evolution: old schema (P) / new schema (S)
• data security: illustrated next
An Application: Data Security
client
(patient,ailment)
intrusive query
I(P)
(patient, physician)
+
(physician, ailment)
public schema
P
schema
correspondence
query E(S)
(exposes secret data correlation)
proprietary
schema S
Want to be sure that there is no I(P) returning same answer as E(S)
Only possible if Completeness Property holds!
More Complicated Data Publishing:
Mixed And Redundant Storage (MARS)
public schema
schema correspondence
published XML
(virtual)
view of proprietary
data
may hide information
storage schema
proprietary
relational data
cached queries
partial relational
storage of XML
proprietary XML
data
redundant data
materialized views,
indexes
initial configuration
after tuning
An Example With Tuning
[Figure: XML data over (drug, price, notes) and (diagnosis, drug); a rel DB holding (drug, price) and (drug, usage, diagnosis); a relational DB holding (drug, usage, name) and (name, diagnosis).]
Redundancy Enables Multiple Reformulations
client query: “find how much each treatment costs”
[Figure: three reformulations R1, R2, R3 drawn over XML data (drug, price, notes) and (diagnosis, drug), a Rel DB holding (drug, price) and (drug, usage, diagnosis), and a Relational DB holding (drug, usage, name) and (name, diagnosis).]
Some reformulations are potentially cheaper to execute than others.
Want to find an “optimal” one!
Schema Correspondence Expressible in XQuery
The DB administrator must be able to specify the correspondence.
XML
XQuery
XQuery
XML
XQuery
XML
encode
XML
XML
rel DB
XQuery
XML
encode
relational DB
Can use XQuery, fixing any of the common encodings of relational tables in XML.
XQuery?
for $d in document/drug,          (binding part)
    $m in $d//maker
return <producedBy>$m/text()</producedBy>          (tagging template)
Evaluated on the tree
drug
  name: “aspirin”
  price: “$4”
  notes
    side-effects: “upset stomach”
    maker: “Bayer”
Result should contain
<producedBy>Bayer</producedBy>
// (descendant) is the transitive closure of / (child)
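The for/return query on this slide can be mimicked in Python with the standard xml.etree.ElementTree module. This is only a sketch: it assumes the aspirin document is the whole input, and the helper name produced_by is hypothetical, not part of MARS.

```python
import xml.etree.ElementTree as ET

DOC = """<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>"""

def produced_by(xml_text):
    root = ET.fromstring(xml_text)
    # Binding part: $d ranges over drug elements, $m over maker descendants of $d.
    drugs = [root] if root.tag == "drug" else root.findall("drug")
    bindings = [(d, m) for d in drugs for m in d.iter("maker")]
    # Tagging part: fill each binding of $m into the <producedBy> template.
    out = []
    for _, m in bindings:
        element = ET.Element("producedBy")
        element.text = m.text
        out.append(ET.tostring(element, encoding="unicode"))
    return out

print(produced_by(DOC))
```

The two list-building steps mirror the split between the binding part and the tagging template.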
Approach:
XQuery Reformulation Reduced to Relational Reformulation
client XQuery
relational
queries
C&B
schema
correspondence
relational
constraints
Mappings ()
as XQueries
XML integrity
constraints
reformulated
queries
GReX
built-in relational constraints
capture XML data model
= compilation
GReX: Generic Relational encoding of XML
reformulated queries
(multiple solutions)
XQuery Semantics
XML data model is a tagged tree.
XQueries compute in two stages:
Variable binding stage: navigation in XML tree, binds variables to nodes, text, tags, etc.
for $d in document/drug,
    $m in $d//maker
Tagging stage: output of new XML, by filling in variable bindings into a tagging template
return <producedBy>$m/text()</producedBy>
Example document:
<drug>                    (“$d” is bound here)
<name>aspirin</name>
<price>$4</price>
<notes>
<side-effects>upset stomach</side-effects>
<maker>Bayer</maker>      (“$m” is bound here)
</notes>
</drug>
Compiling the Binding Part of XQueries to Relational Queries
XBind query =
binding part of XQuery
(returns a relation:
tuples of variable bindings)
Relational query over
child(x,y) , tag(x,t) ,desc(x,y) , Root (r), etc.
Example:
for
$d in document(“drugs.xml”)/drug, $m in $d//maker
return
“$d” “$m”
a relational “conjunctive” query
compiles to
P($d,$m) :-
Root(r) , child(r,$d) , tag($d,“drug”) ,
desc($d,x) , child(x,$m) , tag($m,“maker”)
But not all models of this schema correspond to the intended model; need GReX !
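As an illustration, the compiled conjunctive query P($d,$m) can be evaluated directly over a relational encoding of the aspirin document. The node ids and the helper names below are hypothetical, and desc is computed here as the reflexive-transitive closure of child, which is the intended model rather than something the schema enforces.

```python
from itertools import product

# Hypothetical node ids: 0 = document root, 1 = drug, 2 = name, 3 = price,
# 4 = notes, 5 = side-effects, 6 = maker
child = {(0, 1), (1, 2), (1, 3), (1, 4), (4, 5), (4, 6)}
tag = {(1, "drug"), (2, "name"), (3, "price"), (4, "notes"),
       (5, "side-effects"), (6, "maker")}
root = {0}
nodes = range(7)

def reflexive_transitive_closure(edges, nodes):
    closure = {(n, n) for n in nodes} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

desc = reflexive_transitive_closure(child, nodes)

def xbind():
    # P($d,$m) :- Root(r), child(r,$d), tag($d,"drug"),
    #             desc($d,x), child(x,$m), tag($m,"maker")
    return {(d, m)
            for r in root
            for (r2, d) in child if r2 == r and (d, "drug") in tag
            for (d2, x) in desc if d2 == d
            for (x2, m) in child if x2 == x and (m, "maker") in tag}

print(xbind())
```

The single answer pairs the drug node with the maker node, matching the bindings of $d and $m.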
Sample Constraints from GReX
•
Relationship between child and descendant navigation:
∀x∀y [ child(x,y) → desc(x,y) ]                  desc contains child
∀x [ el(x) → desc(x,x) ]                          desc is reflexive
∀x∀y∀z [ desc(x,y) ∧ desc(y,z) → desc(x,z) ]      desc is transitive
These do not capture transitive closure completely,
nor is it possible to do it in first-order logic; STILL...
•
Tagged tree structure of XML:
∀r∀x [ root(r) ∧ desc(x,r) → x = r ]              root has no ancestors
∀x∀y∀z [ child(x,z) ∧ child(y,z) → x = y ]        at most one parent
More Constraints from GReX
(someTag)  ∀x [ el(x) → ∃t tag(x,t) ]                        every element has a tag
(oneTag)   ∀x∀t1∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ]      one tag per element
(noLoop)   ∀x∀y [ desc(x,y) ∧ desc(y,x) → x = y ]            no non-trivial cycles
(noShare)  ∀x∀y∀u∀v [ child(x,u) ∧ child(x,v) ∧
           desc(u,y) ∧ desc(v,y) → u = v ]                   unique path between elements
(inLine)   ∀x∀y∀u [ desc(x,u) ∧ desc(y,u) →
           x = y ∨ desc(x,y) ∨ desc(y,x) ]                   ancestors of an element are collinear
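A small checker makes the role of these constraints concrete. The sketch below tests only a subset of GReX (desc contains child, transitivity, oneTag, noLoop, at most one parent, inLine) on hand-built instances; the function name grex_violations and the node ids are hypothetical. The last instance shows why the constraints only approximate the intended model: it satisfies every checked constraint although desc relates node 1 to a node 9 that no chain of child steps reaches.

```python
def grex_violations(child, desc, tag):
    """Names of the checked GReX constraints that the instance violates."""
    bad = []
    if not child <= desc:
        bad.append("desc contains child")
    if any((x, z) not in desc for (x, y) in desc for (y2, z) in desc if y == y2):
        bad.append("desc is transitive")
    if any(t1 != t2 for (x, t1) in tag for (x2, t2) in tag if x == x2):
        bad.append("oneTag")
    if any(x != y for (x, y) in desc if (y, x) in desc):
        bad.append("noLoop")
    if any(x != y for (x, z) in child for (y, z2) in child if z == z2):
        bad.append("oneParent")
    if any(x != y and (x, y) not in desc and (y, x) not in desc
           for (x, u) in desc for (y, u2) in desc if u == u2):
        bad.append("inLine")
    return bad

# Hypothetical ids: 1 = drug, 4 = notes, 6 = maker
nodes = [1, 4, 6]
tag = {(1, "drug"), (4, "notes"), (6, "maker")}
child = {(1, 4), (4, 6)}
desc = {(n, n) for n in nodes} | child | {(1, 6)}

tree_ok = grex_violations(child, desc, tag)
two_parents = grex_violations(child | {(1, 6)}, desc, tag)
phantom = grex_violations(child, desc | {(9, 9), (1, 9)}, tag)
print(tree_ok, two_parents, phantom)
```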
Which Reformulations Do We Find This Way?
client XQuery
relational
queries
C&B
schema
correspondence
relational
constraints
Mappings ()
as XQueries
XML integrity
constraints
reformulated
queries
GReX
built-in constraints
capture XML data model
= compilation
reformulated queries
(multiple solutions)
all of them?
Restrictions on XQuery
Main restriction: no aggregates (to be investigated)
Leaving out aggregates, most common queries can be processed.
Minor restrictions:
no user-defined functions (of course!)
limited use of negation (or else the problem becomes undecidable)
limited use of document order (to be investigated)
no navigation to parent or wildcard child (of unspecified tag)
(unintuitive, but we can show that this needs another algorithm,
unless NP = Π^p_2)
The Reduction is Sound and Complete
For the restricted XQuery fragment,
Given:
- XBind query B, compiled to a relational query c(B)
- schema correspondence C given by XQueries, compiled to set of constraints c(C)
Relative Completeness Theorem:
R is a minimal reformulation of B under C
iff
c(R) is a minimal reformulation of c(B) under c(C) and GReX
R can be computed from c(R)
All of them are found by C&B.
A Glimpse at the Chase:
Transforming Queries Using Constraints
A query: ‘ find data satisfying condition “A” ‘
Q: A
A constraint: ‘ whenever the data satisfies condition “A”, it also satisfies “B” ‘
A → B
A chase step:
Q: A   becomes   Q1: A ∧ B
The chase: repeatedly applying chase steps until no new conditions can be added
In general, Q and Q1 are not equivalent, but in all DBs satisfying the constraint, they are!
Theory of the chase: 20 years old, deep and rich, due to
Beeri, Maier, Mendelzon, Sagiv, Vardi, Yannakakis and others!
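In the propositional picture used on this slide, a query is a set of conditions and a constraint adds conclusions whenever its premises are present. A minimal sketch of the chase under that simplification follows; the function name chase and the representation of constraints as (premise, conclusion) pairs of sets are assumptions made for illustration, and real chase steps work on relational atoms with variables.

```python
def chase(conditions, constraints):
    """Apply chase steps until no constraint adds a new condition."""
    result = set(conditions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            # A step applies when the premise holds and adds something new.
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return frozenset(result)

constraints = [(frozenset({"A"}), frozenset({"B"}))]   # A -> B
q1 = chase({"A"}, constraints)
print(sorted(q1))
```

Chasing Q: A with A → B yields Q1: A ∧ B, and a second pass finds no applicable step, so the loop stops.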
How Do We Use the Chase?
Capturing Relational Views With Constraints
Let the schema correspondence be the view:
‘ retrieve the data satisfying conditions “A” and “B” ‘
V: A ∧ B
V stands for condition: “data appears in result of V”
Capture the definition with constraints (first-order logic statements)
A ∧ B → V      all data satisfying “A” and “B” “appears in result of V”
V → A ∧ B      all data “appearing in V” satisfies “A” and “B”
Chase & Backchase
First chase:
Q: A   --[ A → B ]-->   Q1: A ∧ B   --[ A ∧ B → V ]-->   Q2: A ∧ B ∧ V
Next inspect all subqueries (“syntactic pieces”) of the chase result Q2:
SQ: V
It turns out that SQ is equivalent to Q
The equivalence is checked again using the chase (backwards):
SQ: V   --[ V → A ∧ B ]-->   Q2: A ∧ B ∧ V
Presence of constraint A → B allows reformulation
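The chase and backchase phases on this slide can be sketched in the same propositional simplification, where a query is a set of conditions. The names chase and chase_and_backchase and the storage parameter are hypothetical. A candidate subquery is accepted when back-chasing it recovers every condition of Q; the other direction holds automatically here because candidates are taken from the chase of Q.

```python
from itertools import combinations

def chase(conditions, constraints):
    result = set(conditions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return result

def chase_and_backchase(query, constraints, storage):
    universal = chase(query, constraints)      # chase phase: universal plan
    candidates = sorted(universal & storage)   # only conditions over the storage schema
    solutions = []
    for size in range(1, len(candidates) + 1):
        for sub in combinations(candidates, size):
            # backchase phase: does the subquery imply all of Q?
            if set(query) <= chase(sub, constraints):
                solutions.append(set(sub))
    return solutions

view_constraints = [({"A", "B"}, {"V"}), ({"V"}, {"A", "B"})]
with_a_implies_b = [({"A"}, {"B"})] + view_constraints

print(chase_and_backchase({"A"}, with_a_implies_b, {"V"}))
print(chase_and_backchase({"A"}, view_constraints, {"V"}))
```

With A → B present the search returns the subquery V; without it the chase never reaches V and no reformulation is found, which is the point made on the slide.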
General C&B Algorithm
(joint work with Lucian Popa, IBM Almaden)
Let P be a (public) schema, S a (proprietary) schema, and
C a set of constraints (e.g., on P, on S, and/or on P & S).
Assume some terminating chasing sequence
Q(P) → ... → U(P + S)          (U is the universal plan)
solutions X(S) = subqueries of U,
posed against S, equivalent to Q
Completeness Theorem [Deutsch&T.]:
Any scan-minimal reformulation of Q under C is a subquery of U
Two Sets of Experiments
•
Synthetic queries
reformulation time as function of query “complexity”
XML analog of relational “star” queries, increasing number of joins
can very complex queries still be reformulated in a practical amount of time ?
•
“Realistic” queries from the XML Benchmark Project [http://monetdb.cwi.nl/xml]
The Queries: 20 queries designed to exercise interesting features of XQuery
The Schema correspondence: views in both directions
compiles to about 200 constraints!
Much more than in typical relational schemas!
Experiments with Synthetic Queries
Number of joins (number of corners in the star)
Experiments with Benchmark Queries
Reformulation times must be understood in conjunction with execution times
(eg., tens of seconds for Q10)
Summary of Contributions
MARS, a system for XQuery reformulation,
- with mixed and redundant storage, under integrity constraints.
- complex schema correspondence (views in both directions)
Showed practical relevance of C&B method (feasible and worthwhile)
A completeness result for a significant fragment of XQuery and a large
class of schema correspondences. The method remains sound for the full language.
A reduction between minimal reformulation and query equivalence, and
we gave matching lower bounds showing our chase-based decision procedure is
asymptotically optimal for the fragment considered.
Why XML?
The relational data model is still the dominant concept in databases.
All data can be coded into tables.
(For that matter, into (Gödel) numbers too!)
Artificial coding makes life harder for query programmers.
Result: less productivity, more bugs.
XML is much more flexible. It is also “self-describing”, i.e., no
need a priori for types/schemas (but this is sometimes a bad idea).
It came from the document community (tagged text)
and was cheered by industry gurus. So we have to live with it.
(Although one can imagine better data models…)
Making It Work
Chase: each chase step is similar to evaluation of a recursive Datalog rule on a
symbolic database built from the query
→ we borrowed classical query processing techniques
• compiling constraints to join tree
• joins implemented as hash-joins
• pushing selections into joins
Backchase: size of search space is O(2^u), u = size of universal plan
We found criteria for pruning this space.
1.
Cost-independent: prune subqueries that
- do not correspond to legal XML queries
  (perform contiguous navigation steps starting from the root)
- contain redundant descendant navigation steps
  (x child-of y, y child-of z, x descendant-of z)
bottom-up exploration of subqueries:
first all performing 1 navigation step,
next all performing 2 navigation steps, etc.
typical size reduction: 2^100 → 300
2.
A cost-based pruning strategy parameterized by costing model
- finds optimal reformulation for any monotonic cost model
- cost models for XML are still under research
- heuristic cost model: cost is number of table scans/XML navigation steps performed
- amenable to experimenting with other cost models
Benefit of Reformulation For Execution Time
saved time = original query execution - time to reformulate - execution of reformulation
[Figure: saved time (s) plotted against number of major joins per query (3 to 7), with one curve per number of elements in document.]
Benefit increases with increasing complexity of query
and increasing database size
More Results for Benchmark Queries
reformulation times (with redundancy and optimization)
[Figure: bar chart of time (s), from 0 to 5, for queries Q1 to Q20; each bar shows time to first reformulation, delta to best reformulation, and delta to finish search.]
For redundancy: materialized the XBind query for each query
(particular case of Access Support Relation)
Time to find first reformulation is essentially the same as in the absence of redundancy.
Additional time spent only for finding optimal one.
Related Work:
Data Integration As Particular Case of MARS Applications
Global As View (GAV): X = Q o CR
reformulation by composition-with-views
TSIMMIS, SilkRoute, XPeranto
Local As View (LAV): Q = X o CR
reformulation by rewriting-with-views
Information Manifold, STORED [with Fernandez and Suciu in SIGMOD’99], Agora
MARS: combined effect of rewriting+composition
[Figure: in each case the query Q is posed against P (global schema), the reformulation X against S (local schema), and CR is the correspondence between them.]
Future Work Directions
•
Short-Term:
- tuning of C&B implementation for further speedup
- XML-specific strategies for pruning the backchase stage
- in particular, finding a good cost model to perform cost-based pruning
•
Medium-Term:
- Applying C&B to Data Security
- Applications to Adaptive Distributed Query Optimization
•
Long Term:
- a unified framework for integrating data from various, heterogenous sources going
beyond classical databases (XML/relational/LDAP + web forms + web services)
Application 3: Schema Evolution (e.g. Caching)
Goal: support existing client applications even after changing the schema
client
reformulated query
X (N)
old query
Q (O)
old schema
O
new schema
schema
correspondence
N
could be O extended
with cached results
Find X(N) returning same answer as Q(O)
A Source of Redundancy: Relational Storage of XML
catalog
drug
drug
name
notes
“aspirin”
name
price
“$4”
“cortisone”
notes
price
“$50”
highly unstructured
public data
relational
view
(lossy)
redundant storage
Drugs
name | price
aspirin | $4
cortisone | $50
Containment Under Integrity Constraints
Decision procedure for containment is based on chasing with constraints from GReX.
Natural extension to XML integrity constraints.
Some results:
•
Containment of well-behaved XPath/XBind queries under bounded simple XML integrity
constraints (SXICs) is decidable (used in relative completeness theorem).
•
Even modest use of unboundedness makes the problem undecidable.
•
Corollary: containment under bounded SXICs and DTDs is undecidable.
•
Containment under DTDs only is an open problem, but we have a PSPACE lower bound.
See proposal for details.
LDAP
The Architecture of Our Solution
client XQuery
tagging template
defined next
XBind queries
Mappings ()
as XQueries
relational
queries
C&B
relational
constraints
schema
correspondence
rel/XML
encodings
XML integrity
constraints
reformulated
queries
GReX
built-in XML data
model constraints
not shown here
= compilation
reformulated queries
(multiple solutions)
GReX: Generic Relational encoding of XML,
used internally to partially capture
the intended model
Problem:
•
XML/MARS XQuery Reformulation
•
schema correspondence given by views in both directions
•
multiple solutions
Tool: Algorithm for reformulation
of relational queries under relational constraints
Chase & Backchase (C&B)
introduced in [VLDB’99 with L. Popa and V. Tannen]
evaluated in [SIGMOD’00 with L. Popa, A. Sahuguet and V. Tannen]
Capturing Relational Views With Constraints
Let the schema correspondence be a view defined as the relational conjunctive query
V(x,z) :- A(x,y), B(y,z)
Capture the definition with constraints,
(cV) ∀x∀y∀z [ A(x,y) ∧ B(y,z) → V(x,z) ]
result of query defining the view is included in V
(bV) ∀x∀z [ V(x,z) → ∃y A(x,y) ∧ B(y,z) ]
V is included in result of query defining view
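The two inclusion constraints can be checked on a concrete instance. In this sketch the relations A and B, their contents, and the function names are hypothetical; cV holds when the join of A and B is contained in V, and bV holds when V is contained in that join.

```python
def view(A, B):
    # V(x,z) :- A(x,y), B(y,z)
    return {(x, z) for (x, y) in A for (y2, z) in B if y == y2}

def holds_cV(A, B, V):
    # every tuple produced by the defining query appears in V
    return view(A, B) <= V

def holds_bV(A, B, V):
    # every tuple of V is produced by the defining query
    return V <= view(A, B)

A = {(1, "a"), (2, "b")}
B = {("a", 10), ("b", 20), ("c", 30)}
V = view(A, B)

print(V)
print(holds_cV(A, B, V), holds_bV(A, B, V))
```

Only an instance where V is exactly the view result satisfies both constraints: dropping a tuple from V breaks cV, and adding a stray tuple breaks bV.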
Partially capturing the XML model
Partially, because some features cannot fully be captured with constraints:
•
descendant is the transitive closure of child, but this is not FO-definable
•
neither is the “treeness” property
our solution:
add a set of constraints GReX to approximate intended models
it turns out that capturing descendant helps in capturing treeness
then, we define a significant XQuery fragment (we call it well-behaved)
that cannot distinguish between intended and approximate models
Constraints in GReX (2): the tagged tree structure of XML
(topRoot)   ∀r∀x [ root(r) ∧ desc(x,r) → x = r ]              root has no ancestors
(oneTag)    ∀x∀t1∀t2 [ tag(x,t1) ∧ tag(x,t2) → t1 = t2 ]      one tag per element
(noLoop)    ∀x∀y [ desc(x,y) ∧ desc(y,x) → x = y ]            no non-trivial cycles
(oneParent) ∀x∀y∀z [ child(x,z) ∧ child(y,z) → x = y ]        at most one parent
(noShare)   ∀x∀y∀u∀v [ child(x,u) ∧ child(x,v) ∧
            desc(u,y) ∧ desc(v,y) → u = v ]                   unique path between elements
(inLine)    ∀x∀y∀u [ desc(x,u) ∧ desc(y,u) →
            x = y ∨ desc(x,y) ∨ desc(y,x) ]                   ancestors of an element are collinear
XQuery Restrictions
What it allows:
composition of navigation steps,
navigation axes: self, (named)child, descendant, ancestor, idrefs
qualifiers:
path, string  path, “and”, “or”, path equality/inequality
where clause:
disjunction, path equality/inequality,
existential quantification
What it rules out:
user-defined functions,
range, before predicates,
aggregates, arbitrary negation, universal quantification,
concatenation (,)
navigation to parent (..) or to child of unspecified name (*)
C&B Completeness
Let C be a set of constraints (relates public schema P and proprietary schema S)
•
C-minimal query:
removing any of its relational atoms produces non-equivalent query under C
•
Q1 is a subquery of Q2:
Q1 is isomorphic to a “piece” of Q2
U(P + S)
Universal plan
Q(P)
SUBQUERIES
solutions X(S) = subqueries of U,
posed against S, equivalent to Q
Completeness Theorem: Any C-minimal reformulation of Q is a subquery of U
A Completeness Result for Our Solution
Given:
- well-behaved XBind query B
compiled to a relational query c(B)
- schema correspondence M given by well-behaved XQueries (in both directions),
compiled to set of relational constraints c(M)
- bounded XML integrity constraints XIC,
compiled to set of relational constraints c(XIC)
a class of XML integrity constraints, see [KRDB’01]
Relative Completeness Theorem: for any R
R is a (M+XIC)-minimal reformulation of B
iff
c(R) is a (GReX ∪ c(M) ∪ c(XIC))-minimal reformulation of c(B)
R can be computed from c(R)
All of them are found by C&B.
Corollary:
completeness of reformulation algorithm for XBind queries
Capturing XML Semantics
client XQuery
relational
queries
C&B
relational
constraints
schema
correspondence
Mappings ()
as XQueries
XML integrity
constraints
reformulated
queries
GReX
built-in constraints
capture XML data model
= compilation
reformulated queries
(multiple solutions)
Summary of Constraints Used in C&B Phase
•
Built-in constraints in GReX
•
Relational views compile to inclusion constraints
•
XQuery views
– their XBind queries compile to inclusion constraints as for relational views
– their return clause compiles to several decorrelated queries, each captured
with constraints
– the XML template in the return clause compiles to several Skolem and copy
functions, each compiled to constraints
•
Integrity constraints
– XML constraints compile to relational constraints
– relational schema constraints
Are the Restrictions Justified?
Our completeness result holds for well-behaved XQueries, under bounded
XML integrity constraints.
What about reformulating
•
XQueries with parent and wildcard child navigation?
•
Under other XML integrity constraints?
•
Even under full-fledged DTDs?
For such extensions, we make a deeper study of equivalence, which is an
even simpler problem in reformulation.
The equivalence checker is invoked as black-box algorithm during C&B.
XBind (includes XPath) Fragments
Equivalence
PTIME
navigation axes: self, (named)child, descendant
simple
well-behaved
path concatenation, attribute values
qualifiers: path, string  path, “and”
+ join on attribute variables
+ any or all (!) of the following:
. disjunction
. ancestor navigation
. path equality
NP-complete
Π^p_2-complete
. wildcard child (*) navigation
+ parent, preceding(following)-sibling
In Π^p_2
Containment for the “well-behaved” fragment of XBind/XPath
Theorem
B1 , B2 XBind/XPath queries from our “well-behaved” fragment
c(B1) , c(B2) their relational compilation
B1 is equivalent to B2
iff
c(B1) is equivalent to c(B2) under GReX
decidable in Π^p_2 using chase
This result about containment is used in the relative completeness theorem
Extensions of the “NP” fragment: Π^p_2 fragments
any or all (!) of the following make equivalence Π^p_2-complete:
• disjunction
unsurprising: conjunctive queries+union already Π^p_2-complete [SY’80]
• ancestor navigation
translate ancestor away introducing union: /a/b/ancestor = /[a/b] ∪ /a[b]
• path equality qualifier
can simulate ancestor:
//.[.//.==/p]/s = /p/ancestor/s
• wildcard child navigation
Not well-behaved, but we have a different decision procedure
union introduced by interaction of * with //:
//a = /a ∪ /*//a
Experimental Setup: Started From the XML Benchmark
Used the official XML Benchmark Project [http://monetdb.cwi.nl/xml]
The application domain: an online auctioning application.
The published schema:
a DTD given by the XML Benchmark Project
Data is partially nicely structured.
The Queries:
20 queries designed to exercise interesting features of XQuery
What We Added to the XML Benchmark Setup
The mixed storage schema:
relationally: person, item, open auction, closed auction, etc.
unstructured part: annotations on items
The redundancy:
materialized the XBind query for each query
(particular case of Access Support Relation)
The mappings:
in both directions: relations  XML, XML  XML
It all compiles to about 200 constraints !
Much more than in typical relational schemas!
Had to change original implementation [SIGMOD’00] to scale.
Related Work
Publishing systems
Schema mapping proprietary relational → published XML: SilkRoute, Xperanto
reformulation by composition-with-views.
Schema mapping published XML → proprietary relational: STORED, Agora
reformulation by rewriting-with-views
Information Integration
TSIMMIS (composition-w-views), Information Manifold (rewriting-w-views)
Containment
Miklau and Suciu, smaller fragment of XPath (they too find that * is “naughty”)
[FLS, CGLV] - conjunctive regular path queries
Amer-Yahia and Srivastava - minimization of tree pattern queries
Containment under integrity constraints
XML keys [BDFHT]; description logics [CGL];
Query Reformulation in Data Publishing
partner/client
client query Q(P)
? reformulated query
X(S)
(not directly executable)
public schema P
(virtual data)
schema = interface against which
queries are formulated
publishing query
(may hide some proprietary data)
proprietary storage schema S
(materialized data)
Find X(S) returning same answer as Q(P)
Compiling the Binding Part of XQueries to Relational Queries
XBind query = binding stage of XQuery
(returns a relation: tuples of variable bindings)
Relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.
Navigation in XQueries = relational join of tables child, tag, etc.
But, over arbitrary DBs with this schema, the relational translation of
Root  desc  desc
is not equivalent to that of
Root  desc
must communicate to the C&B that desc table is transitive
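A tiny instance shows the difference. In the sketch below the two helper functions are hypothetical names for the relational translations of one and two descendant steps from the root. Over an arbitrary desc table the answers differ; once desc is transitive and reflexive, as the GReX constraints require, they coincide (transitivity gives one inclusion, reflexivity the other).

```python
def root_desc(root, desc):
    # nodes reached from the root by one desc step
    return {y for r in root for (x, y) in desc if x == r}

def root_desc_desc(root, desc):
    # nodes reached from the root by two desc steps
    return {z for y in root_desc(root, desc) for (y2, z) in desc if y2 == y}

root = {0}
arbitrary = {(0, 1), (1, 2)}   # a legal table, but not a descendant relation
closed = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 2), (0, 2)}

print(root_desc(root, arbitrary), root_desc_desc(root, arbitrary))
print(root_desc(root, closed), root_desc_desc(root, closed))
```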
The Challenge for “Reformulation on MARS”
To find the reformulations efficiently, we need to
•
reason with schema correspondence
•
efficiently construct the search space for reformulations
- must contain all reformulations (for completeness)
•
explore search space
- exhaustively (for security applications)
- maybe trading optimality of reformulation for search speed
(for optimization purposes)
Contributions
•
A novel algorithm for reformulation of relational queries under relational constraints
– Chase & Backchase
•
A declarative semantics for most of XQuery
•
A reformulation algorithm for XQuery
Uses this semantics and exploits C&B
[VLDB’99 with Popa and Tannen]
[SIGMOD’00 with Popa, Sahuguet and Tannen]
–practical (feasible and worthwhile)
–complete for “most” of XQuery
–optimal (we show lower bounds for various XQuery fragments: KRDB’01, DBPL’01)
•
MARS: a system for XQuery reformulation over Mixed And Redundant Storage
–constructs and represents search space efficiently
–cost-based exploration strategy parameterized by traditional costing module
–finds first reformulation fast
•
Experimental evaluation: time to first reformulation, simple cost
Compiling Client XQueries
client XQuery
relational
queries
C&B
relational
constraints
schema
correspondence
Mappings ()
as XQueries
XML integrity
constraints
reformulated
queries
GReX
built-in constraints
capture XML data model
= compilation
reformulated queries
(multiple solutions)
Capturing the Schema Correspondence
client XQuery
relational
queries
C&B
relational
constraints
schema
correspondence
Mappings ()
as XQueries
XML integrity
constraints
reformulated
queries
GReX
built-in constraints
capture XML data model
= compilation
reformulated queries
(multiple solutions)
Major Obstacles in Compiling Schema Mappings to Constraints
Schema correspondence given by XQueries. As opposed to relational queries,
•
XQueries have nested, correlated subqueries in return clause
•
XQueries create new elements
•
XQueries return deep, recursive copies of input XML trees
(solution not shown)
Compiling Nested Subqueries: Decorrelation
the query
for $p in doc(“foo.xml”)//person
return <res>$p/phone/text()</res>
is short for the nested query
for $p in doc(“foo.xml”)//person
return <res>for $t in $p/phone/text()
            return $t
       </res>
compile XBind parts to two decorrelated relational queries (shown here in Datalog syntax):
Bouter(p) :- Root(r), desc(r,x), child(x,p), tag(p,”person”)
Binner(p,t) :- Bouter(p), child(p,n), tag(n,”phone”), text(n,t)
capture each with two inclusion constraints, as done in original C&B method
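The two decorrelated queries can be run over a relational encoding of a small foo.xml. Everything about the instance below (node ids, the phone numbers, the extra name element) is made up for illustration, and desc is computed as the reflexive-transitive closure of child.

```python
def closure(child, nodes):
    """Reflexive-transitive closure of child, i.e. the intended desc."""
    desc = {(n, n) for n in nodes} | set(child)
    while True:
        new = {(a, d) for (a, b) in desc for (c, d) in desc if b == c} - desc
        if not new:
            return desc
        desc |= new

# Hypothetical encoding: 0 = root, 1 and 3 = person, 2 and 4 = phone, 5 = name
Root = {0}
child = {(0, 1), (1, 2), (0, 3), (3, 4), (3, 5)}
tag = {(1, "person"), (2, "phone"), (3, "person"), (4, "phone"), (5, "name")}
text = {(2, "555-0100"), (4, "555-0199"), (5, "Jane")}
desc = closure(child, range(6))

# Bouter(p) :- Root(r), desc(r,x), child(x,p), tag(p,"person")
Bouter = {p for r in Root
            for (r2, x) in desc if r2 == r
            for (x2, p) in child if x2 == x and (p, "person") in tag}

# Binner(p,t) :- Bouter(p), child(p,n), tag(n,"phone"), text(n,t)
Binner = {(p, t) for p in Bouter
                 for (p2, n) in child if p2 == p and (n, "phone") in tag
                 for (n2, t) in text if n2 == n}

print(sorted(Bouter), sorted(Binner))
```

Binner reuses Bouter instead of re-navigating from the root, which is what makes the inner query independent of the nesting in the return clause.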
Capturing Creation of New Elements
for
$p in doc(“foo.xml”)//person
return
<res>$p/phone/text()</res>
For each binding of $p, a distinct <res>-element is constructed.
set of bindings for $p, Bouter   --F-->   <res>-elements in result      (F is an injective function)
Capture F by the relation G representing its graph, and the constraints:
∀p∀r1∀r2 [ G(p,r1) ∧ G(p,r2) → r1=r2 ]      ( r = F(p) )
∀p1∀p2∀r [ G(p1,r) ∧ G(p2,r) → p1=p2 ]      ( F is injective )
∀p∀r [ G(p,r) → Bouter(p) ]                 (F’s domain is included in Bouter)
∀p [ Bouter(p) → ∃r G(p,r) ]                (Bouter is included in F’s domain)
F is the Skolem function that validates this last constraint
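The four constraints on G can be checked mechanically. In this sketch the builder assigns fresh identifiers to the constructed <res>-elements; the names build_G and check_G, the identifiers starting at 100, and the sample Bouter are all hypothetical.

```python
from itertools import count

def build_G(bouter, fresh):
    """Graph of F: one fresh <res>-element id per binding of $p."""
    return {(p, next(fresh)) for p in sorted(bouter)}

def check_G(G, bouter):
    functional = all(r1 == r2 for (p1, r1) in G for (p2, r2) in G if p1 == p2)
    injective = all(p1 == p2 for (p1, r1) in G for (p2, r2) in G if r1 == r2)
    domain_in_bouter = all(p in bouter for (p, _) in G)
    bouter_in_domain = all(any(p == q for (q, _) in G) for p in bouter)
    return functional, injective, domain_in_bouter, bouter_in_domain

Bouter = {1, 3}
G = build_G(Bouter, count(100))
print(sorted(G), check_G(G, Bouter))

# Two bindings sharing one <res>-element would violate injectivity.
shared = {(1, 100), (3, 100)}
print(check_G(shared, Bouter))
```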
Stratified-Witness Constraints
(with L.P.)
Full dependencies: no existential quantifier. The chase always terminates.
Beyond this? Given a set C of dependencies, define the chase flow graph:
Nodes correspond to relation components: an R of arity 3 produces 3 nodes.
An edge is drawn from the i’th component of R to the j’th component of S iff R appears
on the left side and S appears on the right side of the implication of some dependency.
The edge is labeled ∃ if the corresponding variable in S is existentially
quantified. C is stratified-witness if there is no cycle with an ∃-labeled edge.
Proposition
The chase with stratified-witness constraints always terminates.
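The definition translates into a small graph test. The sketch below follows the slide's wording literally: every component of a left-side relation gets an edge to every component of a right-side relation, and the edge is existential when the target position holds an existentially quantified variable. The dependency encoding (premise atoms, conclusion atoms, set of existential variables) and all function names are assumptions made for illustration.

```python
def flow_graph(deps):
    edges = set()   # (source node, target node, is_existential)
    for premise, conclusion, exvars in deps:
        for (R, rvars) in premise:
            for (S, svars) in conclusion:
                for i in range(len(rvars)):
                    for j, v in enumerate(svars):
                        edges.add(((R, i), (S, j), v in exvars))
    return edges

def reachable(edges, start):
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for (a, b, _) in edges:
            if a == n and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_stratified_witness(deps):
    edges = flow_graph(deps)
    # An existential edge a -> b lies on a cycle iff a is reachable from b.
    return not any(ex and a in reachable(edges, b) for (a, b, ex) in edges)

d1 = ([("R", ("x", "y"))], [("S", ("y", "z"))], {"z"})   # R(x,y) -> exists z S(y,z)
d2 = ([("S", ("x", "y"))], [("R", ("y", "z"))], {"z"})   # S(x,y) -> exists z R(y,z)
trans = ([("desc", ("x", "y")), ("desc", ("y", "z"))],
         [("desc", ("x", "z"))], set())                   # a full dependency

print(is_stratified_witness([d1]),
      is_stratified_witness([d1, d2]),
      is_stratified_witness([trans]))
```

The pair d1, d2 is rejected, and chasing with it would indeed keep inventing new witnesses; the transitivity rule has cycles but no existential edge, so it passes.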
(Relational) Conjunctive Queries
Q(x,z) :- R(x,y,z) , R(y,x,u) , S(z,u)
select r1.A , s.A
from R r1 , R r2 , S s
where r1.A=r2.B and r1.B=r2.A and
r1.C=s.A and r2.C=s.B
notation: r stands for r1 , … , rn
queries:
select O(r)
from R r
where C(r)
(Relational) Dependencies a.k.a Integrity Constraints
(rR) [ B(r)  (sS) C(r,s) ]
B and C are conjunctions of equalities, as in where clause
example:
(r1R)(r2R) [r1.E= r2.E 
(sR) s.D= r1.D  s.E= r1.E  s.F= r2.F ]
NTUA
April 17, 2003
69
Query Containment and Dependencies
Q1 := select O1(r1) from R1 r1 where C1(r1)
Q2 := select O2(r2) from R2 r2 where C2(r2)
define cont(Q1,Q2) as
(∀r1∈R1) [ C1(r1) →
(∃r2∈R2) C2(r2) ∧ O1(r1)=O2(r2) ]
we have, in each instance
Q1 ⊆ Q2 iff cont(Q1,Q2)
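For conjunctive queries without dependencies, containment can be decided by the classical homomorphism test: freeze the variables of Q1 into a canonical instance and look for a mapping of Q2's variables into it that preserves the head. The sketch below uses that standard technique rather than the chase machinery of the talk; queries are written in Datalog style as (head, body), and the function name contained_in and the example queries are hypothetical.

```python
from itertools import product

def contained_in(q1, q2):
    """True iff q1 is contained in q2, for safe conjunctive queries (head, body)."""
    head1, body1 = q1
    head2, body2 = q2
    facts = set(body1)   # canonical instance of q1: variables frozen as constants
    terms = sorted({t for (_, args) in body1 for t in args})
    vars2 = sorted({t for (_, args) in body2 for t in args} | set(head2))
    for image in product(terms, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        head_ok = tuple(h[v] for v in head2) == tuple(head1)
        body_ok = all((rel, tuple(h[a] for a in args)) in facts
                      for (rel, args) in body2)
        if head_ok and body_ok:
            return True
    return False

# Q1(x) :- R(x,y), R(y,x)        Q2(x) :- R(x,y)
q1 = (("x",), [("R", ("x", "y")), ("R", ("y", "x"))])
q2 = (("x",), [("R", ("x", "y"))])

print(contained_in(q1, q2), contained_in(q2, q1))
```

The search is exponential in the number of variables of the second query, which is acceptable for a sketch but not for queries of realistic size.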
And Viceversa
d := (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]
front(d) = select r from R r where B(r)
back(d) = select r from R r , S s where B(r) ∧ C(r,s)
we have, in each instance
d holds iff front(d) ⊆ back(d)
Chase Step
d := (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]
Q: select O(r) from R r where B(r)
   --[ d ]-->
Q’: select O(r) from R r, S s where B(r) ∧ C(r,s)
basic fact: if Q --[ d ]--> Q’ then Q =d Q’
the chase step is applicable if Q’ is not trivially equivalent to Q
(for example, we cannot chase Q’ with d ! )
Using the Chase
basic fact: if chase step of Q with d is not applicable
then Inst(Q) ⊨ d
( canonical instance Inst(Q) built from query Q )
Basic Theorem
D set of dependencies
Q1 → ... → chaseD(Q1) terminating chase sequence
(no more applicable steps)
Then:
Q1 ⊆D Q2 iff chaseD(Q1) ⊆ Q2
Reformulation with Views
a view is just a query:
V := select O(r) from R r where C(r)
Reformulation of query Q(R) with view V :
finding X(R,V) such that Q(R) =V X(R,V)
One View =Two Dependencies
V := select O(r) from R r where C(r)
the “chase-in” dependency:
cV := (∀r∈R) [ C(r) → (∃x∈V) x=O(r) ]
the “backchase” dependency:
bV := (∀x∈V)(∃r∈R) [ C(r) ∧ x=O(r) ]
It turns out that
if rewritings of Q with V exist then such a
rewriting can be obtained by chasing Q with cV
The Chase and Backchase (C&B) Algorithm
(joint work with Lucian Popa, IBM Almaden)
The chase with cV always terminates.
The search space for rewritings of Q with V consists
of the subqueries of chasecV(Q).
( S is a subquery:
injective homomorphism from S to chasecV(Q) )
Keep only subqueries such that S =V chasecV(Q)
This can be checked by (back!)chasing with cV, bV
(also terminating)
Preliminary Completeness Result for C&B
(with L.P.)
Theorem
Any scan-minimal reformulation of Q with V
is a subquery of chasecV(Q).
scan-minimal:
no scan (from item) can be removed
without compromising equivalence with Q.
Fewer scans means faster execution under most cost models.
Additional Integrity Constraints
In general the storage schema contains integrity constraints
that restrict its class of instances (models). This may extend
the set of reformulation solutions!
Let C be a set of dependencies
Reformulating query Q(R) with view V under C :
finding X(R,V) such that Q(R) =V,C X(R,V).
That’s the same as reformulating Q under C + cV + bV
Can we still use the chase?