XML Query Reformulation
Val Tannen, University of Pennsylvania
Joint work with Alin Deutsch, UC San Diego, and in part with Lucian Popa, IBM Almaden
NTUA, April 17, 2003

Data Exchange Between Businesses Using XML
[Diagram: an insurance company, a pharmaceutical company and a hospital each keep proprietary data and expose published data; the published data is exchanged as XML.]

XML?
<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>
Labels: opening tag (<drug>), text (aspirin), matching closing tag (</drug>).
[Tree view: drug has children name (“aspirin”), price (“$4”) and notes; notes has children side-effects (“upset stomach”) and maker (“Bayer”).]

A Simple Publishing Scenario
Published data (virtual data seen by the client; the patient name is hidden):
<study>
  <case>
    <diag>migraine</diag>
    <drug>aspirin</drug>
    <usage>2/day</usage>
  </case>
  <case>
    <diag>allergy</diag>
    <drug>cortisone</drug>
    <usage>3/day</usage>
  </case>
</study>
Proprietary data:
prescription
  usage   drug        name
  2/day   aspirin     John
  3/day   cortisone   Jane
patient
  name    diagnosis
  John    migraine
  Jane    allergy
The client query is written in XQuery, the XML query language standard (draft); its reformulation is written in SQL.
The correspondence between published and proprietary data is expressed by a publishing query (view).
How to express the view? View = query which, if executed, would produce the virtual data.
How to “compose” the client query with the view, obtaining the reformulation?

The General Problem of Query Reformulation
client query Q(P) ?
reformulated query X(S)
[Diagram: schema P and schema S, related by a schema correspondence.]
Given query Q(P), find query(ies) X(S)
- returning same answer (soundness)
- whenever such X(S) exists (completeness)

Applications of Query Reformulation
• data publishing (we just saw it): public schema P / storage schema S
• data integration: global schema P / local schema S
• schema evolution: old schema P / new schema S
• data security: illustrated next

An Application: Data Security
[Diagram labels: client; (patient, ailment); intrusive query I(P); (patient, physician) + (physician, ailment); public schema P; schema correspondence; query E(S) (exposes secret data correlation); proprietary schema S.]
Want to be sure that there is no I(P) returning same answer as E(S).
Only possible if Completeness Property holds!

More Complicated Data Publishing: Mixed And Redundant Storage (MARS)
public schema: published XML, a (virtual) view of proprietary data; may hide information
schema correspondence
storage schema:
- initial configuration: proprietary relational data, proprietary XML data
- after tuning: redundant data, such as cached queries, partial relational storage of XML, materialized views, indexes

An Example With Tuning
[Diagram: published XML documents (drug, price, notes) and (diagnosis, drug); storage consisting of an XML document, a relational DB with tables (drug, price) and (drug, usage, diagnosis), and a relational DB with tables (drug, usage, name) and (name, diagnosis).]

Redundancy Enables Multiple Reformulations
client query: “find how much each treatment costs”
[Diagram: the same configuration as in the tuning example, with three alternative reformulations R1, R2, R3 over different parts of the storage.]
Some reformulations are potentially cheaper to execute than others. Want to find an “optimal” one!

Schema Correspondence Expressible in XQuery
The DB administrator must be able to specify the correspondence.
[Diagram: XQuery mappings between XML documents, and between XML documents and relational DBs whose tables are encoded as XML.]
Can use XQuery, fixing any of the common encodings of relational tables in XML.

XQuery?
for $d in document/drug,
    $m in $d//maker                          (binding part)
return <producedBy>$m/text()</producedBy>    (tagging template)
// (descendant) is the transitive closure of / (child).
[Tree view of the drug document: name (“aspirin”), price (“$4”), notes with side-effects (“upset stomach”) and maker (“Bayer”).]
Result should contain <producedBy>Bayer</producedBy>

Approach: XQuery Reformulation Reduced to Relational Reformulation
[Diagram: the client XQuery compiles to relational queries; the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational constraints; GReX adds built-in relational constraints that capture the XML data model; C&B takes the relational queries and constraints and produces reformulated queries (multiple solutions).]
GReX: Generic Relational encoding of XML

XQuery Semantics
XML data model is a tagged tree.
XQueries compute in two stages:
- variable binding stage: navigation in the XML tree binds variables to nodes, text, tags, etc.
- tagging stage: output of new XML, by filling in variable bindings into a tagging template
for $d in document/drug,
    $m in $d//maker
return <producedBy>$m/text()</producedBy>
[Figure: the drug document from the “XML?” slide, shown as text and as a tree, with a table of bindings for “$d” and “$m”.]

Compiling the Binding Part of XQueries to Relational Queries
XBind query = binding part of XQuery (returns a relation: tuples of variable bindings)
Relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.
Example:
for $d in document("drugs.xml")/drug,
    $m in $d//maker
return “$d” “$m”
compiles to a relational “conjunctive” query
P($d,$m) :- Root(r), child(r,$d), tag($d,"drug"), desc($d,x), child(x,$m), tag($m,"maker")
But not all models of this schema correspond to the intended model; need GReX!

Sample Constraints from GReX
• Relationship between child and descendant navigation:
\( \forall x \forall y\, [\, child(x,y) \rightarrow desc(x,y) \,] \)    desc contains child
\( \forall x\, [\, el(x) \rightarrow desc(x,x) \,] \)    desc is reflexive
\( \forall x \forall y \forall z\, [\, desc(x,y) \wedge desc(y,z) \rightarrow desc(x,z) \,] \)    desc is transitive
These do not capture transitive closure completely, nor is it possible to do it in first-order logic; STILL...
• Tagged tree structure of XML:
\( \forall r \forall x\, [\, root(r) \wedge desc(x,r) \rightarrow x = r \,] \)    root has no ancestors
\( \forall x \forall y \forall z\, [\, child(x,z) \wedge child(y,z) \rightarrow x = y \,] \)    at most one parent

More Constraints from GReX
(someTag) \( \forall x\, [\, el(x) \rightarrow \exists t\; tag(x,t) \,] \)    every element has a tag
(oneTag) \( \forall x \forall t_1 \forall t_2\, [\, tag(x,t_1) \wedge tag(x,t_2) \rightarrow t_1 = t_2 \,] \)    one tag per element
(noLoop) \( \forall x \forall y\, [\, desc(x,y) \wedge desc(y,x) \rightarrow x = y \,] \)    no non-trivial cycles
(noShare) \( \forall x \forall y \forall u \forall v\, [\, child(x,u) \wedge child(x,v) \wedge desc(u,y) \wedge desc(v,y) \rightarrow u = v \,] \)    unique path between elements
(inLine) \( \forall x \forall y \forall u\, [\, desc(x,u) \wedge desc(y,u) \rightarrow x = y \vee desc(x,y) \vee desc(y,x) \,] \)    ancestors of an element are collinear

Which Reformulations Do We Find This Way?
[Diagram: as on the “Approach” slide. The client XQuery, the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational queries and relational constraints; GReX built-in constraints capture the XML data model; C&B produces reformulated queries (multiple solutions).]
all of them?

Restrictions on XQuery
Main restriction: no aggregates (to be investigated)
Leaving out aggregates, most common queries can be processed.
Minor restrictions:
- no user-defined functions (of course!)
- limited use of negation (or else the problem becomes undecidable)
- limited use of document order (to be investigated)
- no navigation to parent or wildcard child (of unspecified tag) (unintuitive, but we can show that this needs another algorithm, unless \( NP = \Pi^p_2 \))

The Reduction is Sound and Complete
For the restricted XQuery fragment, given:
- XBind query B compiled to a relational query c(B)
- schema correspondence C given by XQueries compiled to set of constraints c(C)
Relative Completeness Theorem: R is a minimal reformulation of B under C iff c(R) is a minimal reformulation of c(B) under c(C) and GReX.
R can be computed from c(R).
All of them are found by C&B.

A Glimpse at the Chase: Transforming Queries Using Constraints
A query: ‘find data satisfying condition “A”’
  Q: A
A constraint: ‘whenever the data satisfies condition “A”, it also satisfies “B”’
  \( A \rightarrow B \)
A chase step:
  Q: A  becomes  Q1: \( A \wedge B \)
The chase: repeatedly applying chase steps until no new conditions can be added.
In general, Q and Q1 are not equivalent, but in all DBs satisfying the constraint, they are!
Theory of the chase: 20 years old, deep and rich, due to Beeri, Maier, Mendelzon, Sagiv, Vardi, Yannakakis and others!

How Do We Use the Chase?
Capturing Relational Views With Constraints
Let the schema correspondence be the view: ‘retrieve the data satisfying conditions “A” and “B”’
  V: \( A \wedge B \)
V stands for condition: “data appears in result of V”
Capture the definition with constraints (first-order logic statements):
  \( A \wedge B \rightarrow V \)    all data satisfying “A” and “B” “appears in result of V”
  \( V \rightarrow A \wedge B \)    all data “appearing in V” satisfies “A” and “B”

Chase & Backchase
First chase:
  Q: A
  chase with \( A \rightarrow B \) gives Q1: \( A \wedge B \)
  chase with \( A \wedge B \rightarrow V \) gives Q2: \( A \wedge B \wedge V \)
Next inspect all subqueries (“syntactic pieces”) of the chase result Q2:
  SQ: V
It turns out that SQ is equivalent to Q.
The equivalence is checked again using the chase (backwards):
  SQ: V
  chase with \( V \rightarrow A \wedge B \) gives Q2: \( A \wedge B \wedge V \)
Presence of constraint \( A \rightarrow B \) allows reformulation.

General C&B Algorithm (joint work with Lucian Popa, IBM Almaden)
Let P be the (public) schema, S the (proprietary) schema, and C a set of constraints (e.g., on P, on S and/or on P & S).
Assume some terminating chasing sequence from Q(P) to U(P + S), the universal plan.
Solutions X(S) = subqueries of U, posed against S, equivalent to Q.
Completeness Theorem [Deutsch & T.]: Any scan-minimal reformulation of Q under C is a subquery of U.

Two Sets of Experiments
• Synthetic queries
  - reformulation time as function of query “complexity”
  - XML analog of relational “star” queries, increasing number of joins
  - can very complex queries still be reformulated in a practical amount of time?
• “Realistic” queries from the XML Benchmark Project [http://monetdb.cwi.nl/xml]
  - The Queries: 20 queries designed to exercise interesting features of XQuery
  - The Schema correspondence: views in both directions; compiles to about 200 constraints! Much more than in typical relational schemas!
Experiments with Synthetic Queries
[Plot; x-axis: number of joins (number of corners in the star).]

Experiments with Benchmark Queries
[Plot of reformulation times.]
Reformulation times must be understood in conjunction with execution times (e.g., tens of seconds for Q10).

Summary of Contributions
MARS, a system for XQuery reformulation,
- with mixed and redundant storage, under integrity constraints
- complex schema correspondence (views in both directions)
Showed practical relevance of C&B method (feasible and worthwhile).
A completeness result for a significant fragment of XQuery and a large class of schema correspondences. The method remains sound for the full language.
A reduction between minimal reformulation and query equivalence, and we gave matching lower bounds showing our chase-based decision procedure is asymptotically optimal for the fragment considered.

Why XML?
The relational data model is still the dominant concept in databases.
All data can be coded into tables. (For that matter, into (Gödel) numbers too!)
Artificial coding makes life harder for query programmers. Result: less productivity, more bugs.
XML is much more flexible. It is also “self-describing”, i.e., no need a priori for types/schemas (but this is sometimes a bad idea).
It came from the document community (tagged text) and was cheered by industry gurus.
So we have to live with it. (Although one can imagine better data models…)

Making It Work
Chase: each chase step is similar to evaluation of a recursive Datalog rule on a symbolic database built from the query.
We borrowed classical query processing techniques:
• compiling constraints to join tree
• joins implemented as hash-joins
• pushing selections into joins
Backchase: size of search space is O(2^u), u = size of universal plan.
We found criteria for pruning this space.
1. Cost-independent: prune subqueries that
- do not correspond to legal XML queries
  (perform contiguous navigation steps starting from the root)
- contain redundant descendant navigation steps
  (x child-of y, y child-of z, x descendant-of z)
Typical size reduction: from 2^100 to 300.
Bottom-up exploration of subqueries: first all performing 1 navigation step, next all performing 2 navigation steps, etc.
2. A cost-based pruning strategy parameterized by costing model
- finds optimal reformulation for any monotonic cost model
- cost models for XML are still under research
- heuristic cost model: cost is number of table scans/XML navigation steps performed
- amenable to experimenting with other cost models

Benefit of Reformulation For Execution Time
[Plot: saved time (s) against number of major joins per query (3 to 7), one curve per number of elements in the document. Saved time is labeled: original query execution - time to reformulate - execution of reformulation.]
Benefit increases with increasing complexity of query and increasing database size.

More Results for Benchmark Queries
[Bar chart: reformulation times (with redundancy and optimization), time (s) for queries Q1 to Q20, each bar split into time to first reformulation, delta to best reformulation, and delta to finish search.]
For redundancy: materialized the XBind query for each query (particular case of Access Support Relation).
Time to find first reformulation is essentially the same as in the absence of redundancy.
Additional time spent only for finding optimal one.
Related Work: Data Integration As Particular Case of MARS Applications
[Diagram: three settings, each relating a query Q over the global schema P to a reformulation X over the local schema S through correspondences CR.]
Global As View (GAV): X = Q o CR
  reformulation by composition-with-views
  TSIMMIS, SilkRoute, XPeranto
Local As View (LAV): Q = X o CR
  reformulation by rewriting-with-views
  Information Manifold, STORED [with Fernandez and Suciu in SIGMOD’99], Agora
MARS: combined effect of rewriting + composition

Future Work Directions
• Short-Term:
- tuning of C&B implementation for further speedup
- XML-specific strategies for pruning the backchase stage
- in particular, finding a good cost model to perform cost-based pruning
• Medium-Term:
- Applying C&B to Data Security
- Applications to Adaptive Distributed Query Optimization
• Long-Term:
- a unified framework for integrating data from various, heterogeneous sources going beyond classical databases (XML/relational/LDAP + web forms + web services)

Application 3: Schema Evolution (e.g. Caching)
Goal: support existing client applications even after changing the schema
[Diagram: the client's old query Q(O) against old schema O; a schema correspondence to new schema N; reformulated query X(N).]
N could be O extended with cached results.
Find X(N) returning same answer as Q(O).

A Source of Redundancy: Relational Storage of XML
[Tree view: catalog with two drug children; the first has name (“aspirin”), notes and price (“$4”); the second has name (“cortisone”), notes and price (“$50”).]
highly unstructured public data; relational view (lossy); redundant storage
Drugs
  name        price
  aspirin     $4
  cortisone   $50

Containment Under Integrity Constraints
Decision procedure for containment is based on chasing with constraints from GReX.
Natural extension to XML integrity constraints.
Some results:
• Containment of well-behaved XPath/XBind queries under bounded simple XML integrity constraints (SXICs) is decidable (used in relative completeness theorem).
• Even modest use of unboundedness makes the problem undecidable.
• Corollary: containment under bounded SXICs and DTDs is undecidable.
• Containment under DTDs only is an open problem, but we have a PSPACE lower bound.
See proposal for details.

LDAP

The Architecture of Our Solution
[Diagram: the client XQuery splits into a tagging template and XBind queries (defined next); the XBind queries compile to relational queries; the schema correspondence (mappings as XQueries, rel/XML encodings) and the XML integrity constraints compile to relational constraints; GReX contributes built-in XML data model constraints (not shown here); C&B produces reformulated queries (multiple solutions).]
GReX: Generic Relational encoding of XML, used internally to partially capture the intended model

Problem:
• XML/MARS XQuery Reformulation
• schema correspondence given by views in both directions
• multiple solutions
Tool: Algorithm for reformulation of relational queries under relational constraints
Chase & Backchase (C&B)
introduced in [VLDB’99 with L. Popa and V. Tannen]
evaluated in [SIGMOD’00 with L. Popa, A. Sahuguet and V.
Tannen]

Capturing Relational Views With Constraints
Let the schema correspondence be a view defined as the relational conjunctive query
V(x,z) :- A(x,y), B(y,z)
Capture the definition with constraints:
(cV) \( \forall x \forall y \forall z\, [\, A(x,y) \wedge B(y,z) \rightarrow V(x,z) \,] \)    result of query defining the view is included in V
(bV) \( \forall x \forall z\, [\, V(x,z) \rightarrow \exists y\; A(x,y) \wedge B(y,z) \,] \)    V is included in result of query defining the view

Partially capturing the XML model
Partially, because some features cannot fully be captured with constraints:
• descendant is the transitive closure of child, but this is not FO-definable
• neither is the “treeness” property
Our solution: add a set of constraints GReX to approximate intended models.
It turns out that capturing descendant helps in capturing treeness.
Then, we define a significant XQuery fragment (we call it well-behaved) that cannot distinguish between intended and approximate models.

Constraints in GReX (2): the tagged tree structure of XML
(topRoot) \( \forall r \forall x\, [\, root(r) \wedge desc(x,r) \rightarrow x = r \,] \)    root has no ancestors
(oneTag) \( \forall x \forall t_1 \forall t_2\, [\, tag(x,t_1) \wedge tag(x,t_2) \rightarrow t_1 = t_2 \,] \)    one tag per element
(noLoop) \( \forall x \forall y\, [\, desc(x,y) \wedge desc(y,x) \rightarrow x = y \,] \)    no non-trivial cycles
(oneParent) \( \forall x \forall y \forall z\, [\, child(x,z) \wedge child(y,z) \rightarrow x = y \,] \)    at most one parent
(noShare) \( \forall x \forall y \forall u \forall v\, [\, child(x,u) \wedge child(x,v) \wedge desc(u,y) \wedge desc(v,y) \rightarrow u = v \,] \)    unique path between elements
(inLine) \( \forall x \forall y \forall u\, [\, desc(x,u) \wedge desc(y,u) \rightarrow x = y \vee desc(x,y) \vee desc(y,x) \,] \)    ancestors of an element are collinear

XQuery Restrictions
What it allows:
- composition of navigation steps
- navigation axes: self, (named) child, descendant, ancestor, idrefs
- qualifiers: path, string path, “and”, “or”, path equality/inequality
- where clause: disjunction, path equality/inequality, existential quantification
What it rules out:
- user-defined functions, range, before predicates, aggregates, arbitrary negation, universal quantification, concatenation (,)
- navigation to parent (..)
  or to child of unspecified name (*)

C&B Completeness
Let C be a set of constraints (relates public schema P and proprietary schema S).
• C-minimal query: removing any of its relational atoms produces a non-equivalent query under C
• Q1 is a subquery of Q2: Q1 is isomorphic to a “piece” of Q2
Chasing Q(P) yields the universal plan U(P + S).
Solutions X(S) = subqueries of U, posed against S, equivalent to Q.
Completeness Theorem: Any C-minimal reformulation of Q is a subquery of U.

A Completeness Result for Our Solution
Given:
- well-behaved XBind query B compiled to a relational query c(B)
- schema correspondence M given by well-behaved XQueries (in both directions), compiled to set of relational constraints c(M)
- bounded XML integrity constraints XIC (a class of XML integrity constraints, see [KRDB’01]), compiled to set of relational constraints c(XIC)
Relative Completeness Theorem: for any R,
R is a (M + XIC)-minimal reformulation of B iff c(R) is a \( (GReX \cup c(M) \cup c(XIC)) \)-minimal reformulation of c(B).
All of them are found by C&B.
Corollary: completeness of reformulation algorithm for XBind queries
R can be computed from c(R).

Capturing XML Semantics
[Diagram: as on the “Approach” slide. The client XQuery, the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational queries and relational constraints; GReX built-in constraints capture the XML data model; C&B produces reformulated queries (multiple solutions).]

Summary of Constraints Used in C&B Phase
• Built-in constraints in GReX
• Relational views compile to inclusion constraints
• XQuery views
  - their XBind queries compile to inclusion constraints as for relational views
  - their return clause compiles to several decorrelated queries, each captured with constraints
  - the XML template in the return clause compiles to several Skolem and copy functions, each compiled to constraints
• Integrity constraints
  - XML constraints compile to relational constraints
  - relational schema constraints

Are the Restrictions Justified?
Our completeness result holds for well-behaved XQueries, under bounded XML integrity constraints.
What about reformulating
• XQueries with parent and wildcard child navigation?
• Under other XML integrity constraints?
• Even under full-fledged DTDs?
For such extensions, we make a deeper study of equivalence, which is an even simpler problem in reformulation.
The equivalence checker is invoked as black-box algorithm during C&B.

XBind (includes XPath) Fragments
[Diagram: nested fragments labeled “simple” and “well-behaved”, annotated with the complexity of equivalence: PTIME, NP-complete, \( \Pi^p_2 \)-complete.]
navigation axes: self, (named) child, descendant
path concatenation, attribute values
qualifiers: path, string path, “and”
+ join on attribute variables
+ any or all (!) of the following:
  . disjunction
  . ancestor navigation
  . path equality
  . wildcard child (*) navigation
+ parent, preceding(following)-sibling: in \( \Pi^p_2 \)

Containment for the “well-behaved” fragment of XBind/XPath
Theorem
B1, B2: XBind/XPath queries from our “well-behaved” fragment
c(B1), c(B2): their relational compilation
B1 is equivalent to B2 iff c(B1) is equivalent to c(B2) under GReX (decidable in \( \Pi^p_2 \) using chase)
This result about containment is used in the relative completeness theorem.

Extensions of the “NP” fragment: \( \Pi^p_2 \) fragments
Any or all (!) of the following make equivalence \( \Pi^p_2 \)-complete:
• disjunction
  unsurprising: conjunctive queries + union already \( \Pi^p_2 \)-complete [SY’80]
• ancestor navigation
  translate ancestor away introducing union: /a/b/ancestor = /[a/b] ∪ /a[b]
• path equality
  qualifier can simulate ancestor: //.[.//.==/p]/s = /p/ancestor/s
Not well-behaved, but we have a different decision procedure:
• wildcard child navigation
  union introduced by interaction of * and //: //a = /a ∪ /*//a

Experimental Setup: Started From the XML Benchmark
Used the official XML Benchmark Project [http://monetdb.cwi.nl/xml]
The application domain: an online auctioning application.
The published schema: a DTD given by the XML Benchmark Project. Data is partially nicely structured.
The Queries: 20 queries designed to exercise interesting features of XQuery

What We Added to the XML Benchmark Setup
The mixed storage schema:
- relationally: person, item, open auction, closed auction, etc.
- unstructured part: annotations on items
The redundancy: materialized the XBind query for each query (particular case of Access Support Relation)
The mappings: in both directions, between relations and XML and between XML and XML
It all compiles to about 200 constraints! Much more than in typical relational schemas!
Had to change original implementation [SIGMOD’00] to scale.
Related Work
Publishing systems
- Schema mapping from proprietary relational to published XML: SilkRoute, XPeranto; reformulation by composition-with-views.
- Schema mapping from published XML to proprietary relational: STORED, Agora; reformulation by rewriting-with-views.
Information Integration
- TSIMMIS (composition-w-views), Information Manifold (rewriting-w-views)
Containment
- Miklau and Suciu: smaller fragment of XPath (they too find that * is “naughty”)
- [FLS, CGLV]: conjunctive regular path queries
- Amer-Yahia and Srivastava: minimization of tree pattern queries
Containment under integrity constraints
- XML keys [BDFHT]; description logics [CGL]

Query Reformulation in Data Publishing
[Diagram: a partner/client poses client query Q(P) against the public schema P (virtual data); a publishing query (which may hide some proprietary data) relates P to the proprietary storage schema S (materialized data); the reformulated query X(S) is posed against S.]
Q(P) is not directly executable.
schema = interface against which queries are formulated
Find X(S) returning same answer as Q(P).

Compiling the Binding Part of XQueries to Relational Queries
XBind query = binding stage of XQuery (returns a relation: tuples of variable bindings)
Relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.
Navigation in XQueries corresponds to relational join of tables child, tag, etc.
But, over arbitrary DBs with this schema, the relational translation of \( Root \bowtie desc \bowtie desc \) is not equivalent to that of \( Root \bowtie desc \):
must communicate to the C&B that the desc table is transitive.

The Challenge for “Reformulation on MARS”
To find the reformulations efficiently, we need to
• reason with schema correspondence
• efficiently construct the search space for reformulations
  - must contain all reformulations (for completeness)
• explore search space
  - exhaustively (for security applications)
  - maybe trading optimality of reformulation for search speed (for optimization purposes)

Contributions
• A novel algorithm for reformulation of relational queries under relational constraints
  - Chase & Backchase [VLDB’99 with Popa and Tannen] [SIGMOD’00 with Popa, Sahuguet and Tannen]
• A declarative semantics for most of XQuery
• A reformulation algorithm for XQuery. Uses this semantics and exploits C&B
  - practical (feasible and worthwhile)
  - complete for “most” of XQuery
  - optimal (we show lower bounds for various XQuery fragments: KRDB’01, DBPL’01)
• MARS: a system for XQuery reformulation over Mixed And Redundant Storage
  - constructs and represents search space efficiently
  - cost-based exploration strategy parameterized by traditional costing module
  - finds first reformulation fast
• Experimental evaluation: time to first reformulation, simple cost

Compiling Client XQueries
[Diagram: as on the “Approach” slide. The client XQuery, the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational queries and relational constraints; GReX built-in constraints capture the XML data model; C&B produces reformulated queries (multiple solutions).]

Capturing the Schema Correspondence
Diagram labels, as on the “Approach” slide: client XQuery; relational queries; C&B; relational constraints; schema correspondence; mappings as XQueries; XML integrity constraints; GReX built-in constraints capture XML data model; compilation; reformulated queries
(multiple solutions)

Major Obstacles in Compiling Schema Mappings to Constraints
Schema correspondence given by XQueries. As opposed to relational queries,
• XQueries have nested, correlated subqueries in return clause
• XQueries create new elements
• XQueries return deep, recursive copies of input XML trees (solution not shown)

Compiling Nested Subqueries: Decorrelation
The query
for $p in doc("foo.xml")//person
return <res>$p/phone/text()</res>
is short for the nested query
for $p in doc("foo.xml")//person
return <res>for $t in $p/phone/text()
            return $t </res>
Compile XBind parts to two decorrelated relational queries (shown here in Datalog syntax):
Bouter(p) :- Root(r), desc(r,x), child(x,p), tag(p,"person")
Binner(p,t) :- Bouter(p), child(p,n), tag(n,"phone"), text(n,t)
Capture each with two inclusion constraints, as done in original C&B method.

Capturing Creation of New Elements
for $p in doc("foo.xml")//person
return <res>$p/phone/text()</res>
For each binding of $p, a distinct <res>-element is constructed.
F: injective function from the set of bindings for $p (Bouter) to the <res>-elements in the result
Capture F by the relation G representing its graph, and the constraints:
\( \forall p \forall r_1 \forall r_2\, [\, G(p,r_1) \wedge G(p,r_2) \rightarrow r_1 = r_2 \,] \)    ( r = F(p) )
\( \forall p_1 \forall p_2 \forall r\, [\, G(p_1,r) \wedge G(p_2,r) \rightarrow p_1 = p_2 \,] \)    ( F is injective )
\( \forall p \forall r\, [\, G(p,r) \rightarrow Bouter(p) \,] \)    ( F’s domain is included in Bouter )
\( \forall p\, [\, Bouter(p) \rightarrow \exists r\; G(p,r) \,] \)    ( Bouter is included in F’s domain )
F is the Skolem function that validates this last constraint.

Stratified-Witness Constraints (with L.P.)
Full dependencies: no existential quantifier. The chase always terminates. Beyond this?
Given set C of dependencies, define the chase flow graph:
Nodes correspond to relation components: an R of arity 3 produces 3 nodes.
Edges are drawn between the i’th component of R and the j’th component of S iff R appears on the left side and S appears on the right side of the implication of some dependency.
The edge is labeled \( \exists \) if the corresponding variable in S is existentially quantified.
C is stratified-witness if there is no cycle with an \( \exists \)-labeled edge.
Proposition: The chase with stratified-witness constraints always terminates.

(Relational) Conjunctive Queries
Q(x,z) :- R(x,y,z), R(y,x,u), S(z,u)
select r1.A, s.A
from R r1, R r2, S s
where r1.A=r2.B and r1.B=r2.A and r1.C=s.A and r2.C=s.B
notation: r stands for r1, … , rn
queries:
select O(r)
from R r
where C(r)

(Relational) Dependencies a.k.a. Integrity Constraints
\( (\forall r \in R)\, [\, B(r) \rightarrow (\exists s \in S)\; C(r,s) \,] \)
B and C are conjunctions of equalities, as in the where clause.
example:
\( (\forall r_1 \in R)(\forall r_2 \in R)\, [\, r_1.E = r_2.E \rightarrow (\exists s \in R)\; s.D = r_1.D \wedge s.E = r_1.E \wedge s.F = r_2.F \,] \)

Query Containment and Dependencies
Q1: select O1(r1) from R1 r1 where C1(r1)
Q2: select O2(r2) from R2 r2 where C2(r2)
define cont(Q1,Q2) as
\( (\forall r_1 \in R_1)\, [\, C_1(r_1) \rightarrow (\exists r_2 \in R_2)\; C_2(r_2) \wedge O_1(r_1) = O_2(r_2) \,] \)
we have, in each instance, \( Q_1 \subseteq Q_2 \) iff cont(Q1,Q2)

And Viceversa
d: \( (\forall r \in R)\, [\, B(r) \rightarrow (\exists s \in S)\; C(r,s) \,] \)
front(d) = select r from R r where B(r)
back(d) = select r from R r, S s where B(r) and C(r,s)
we have, in each instance, d iff \( front(d) \subseteq back(d) \)

Chase Step
d: \( (\forall r \in R)\, [\, B(r) \rightarrow (\exists s \in S)\; C(r,s) \,] \)
Q: select O(r) from R r where B(r)
chasing with d gives
Q’: select O(r) from R r, S s where B(r) and C(r,s)
basic fact: if Q chases to Q’ with d, then \( Q =_d Q' \)
The chase step is applicable if Q’ is not trivially equivalent to Q (for example, we cannot chase Q’ with d!).

Using the Chase
basic fact: if the chase step of Q with d is not applicable, then \( Inst(Q) \models d \)
( canonical instance Inst(Q) built from query Q )
Basic Theorem
D: set of dependencies
Q1 → ... →
chaseD(Q1): terminating chase sequence (no more applicable steps)
Then: \( Q_1 \subseteq_D Q_2 \) iff \( chase_D(Q_1) \subseteq Q_2 \)

Reformulation with Views
a view is just a query:
V: select O(r) from R r where C(r)
Reformulation of query Q(R) with view V: finding X(R,V) such that \( Q(R) =_V X(R,V) \)

One View = Two Dependencies
V: select O(r) from R r where C(r)
the “chase-in” dependency:
cV: \( (\forall r \in R)\, [\, C(r) \rightarrow (\exists x \in V)\; x = O(r) \,] \)
the “backchase” dependency:
bV: \( (\forall x \in V)\, [\, (\exists r \in R)\; C(r) \wedge x = O(r) \,] \)
It turns out that if rewritings of Q with V exist then such a rewriting can be obtained by chasing Q with cV.

The Chase and Backchase (C&B) Algorithm (joint work with Lucian Popa, IBM Almaden)
The chase with cV always terminates.
The search space for rewritings of Q with V consists of the subqueries of \( chase_{cV}(Q) \).
( S is a subquery: injective homomorphism from S to \( chase_{cV}(Q) \) )
Keep only subqueries such that \( S \subseteq_V chase_{cV}(Q) \).
This can be checked by (back!)chasing with cV, bV (also terminating).

Preliminary Completeness Result for C&B (with L.P.)
Theorem: Any scan-minimal reformulation of Q with V is a subquery of \( chase_{cV}(Q) \).
scan-minimal: no scan (from item) can be removed without compromising equivalence with Q.
Fewer scans means faster execution under most cost models.

Additional Integrity Constraints
In general the storage schema contains integrity constraints that restrict its class of instances (models).
This may extend the set of reformulation solutions!
Let C be a set of dependencies.
Reformulating query Q(R) with view V under C: finding X(R,V) such that \( Q(R) =_{V,C} X(R,V) \).
That’s the same as reformulating Q under C + cV + bV.
Can we still use the chase?
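The chase, the two view dependencies and the backchase described on the preceding slides can be tried out on a toy scale. What follows is only a minimal sketch, not the MARS implementation: it assumes conjunctive queries written as Datalog-style atoms over variables only (no constants), dependencies without equalities in their conclusions, and a hypothetical `max_steps` guard in place of a real termination argument. All function names (`homomorphisms`, `chase`, `contained_in`, `reformulations`) are invented for this illustration. The example uses the view V(x,z) :- A(x,y), B(y,z) with its dependencies cV and bV.

```python
from itertools import combinations, count

_fresh = count()  # supplies fresh variable names for existential witnesses


def homomorphisms(atoms, facts, h=None):
    """Yield every extension of h mapping each atom in `atoms` onto an atom in `facts`."""
    h = dict(h or {})
    if not atoms:
        yield h
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for frel, fargs in list(facts):
        if frel != rel or len(fargs) != len(args):
            continue
        g = dict(h)
        if all(g.setdefault(a, f) == f for a, f in zip(args, fargs)):
            yield from homomorphisms(rest, facts, g)


def find_trigger(body, deps):
    """Return (h, conclusion) for some applicable chase step, or None."""
    for premise, conclusion in deps:
        for h in homomorphisms(premise, body):
            # Applicable only if the conclusion is not already satisfied under h.
            if next(homomorphisms(conclusion, body, h), None) is None:
                return h, conclusion
    return None


def chase(body, deps, max_steps=100):
    """Apply chase steps until none is applicable (assumed to terminate)."""
    body = set(body)
    for _ in range(max_steps):
        trigger = find_trigger(body, deps)
        if trigger is None:
            return body
        h, conclusion = dict(trigger[0]), trigger[1]
        for rel, args in conclusion:
            for a in args:
                if a not in h:  # existential variable: invent a fresh witness
                    h[a] = f"_n{next(_fresh)}"
            body.add((rel, tuple(h[a] for a in args)))
    raise RuntimeError("chase did not terminate within max_steps")


def contained_in(q1, q2, deps):
    """Q1 is contained in Q2 under deps iff Q2 maps into chase(Q1), head to head."""
    (head1, body1), (head2, body2) = q1, q2
    h0 = {}
    if not all(h0.setdefault(a, b) == b for a, b in zip(head2, head1)):
        return False
    return next(homomorphisms(list(body2), chase(body1, deps), h0), None) is not None


def reformulations(q, deps, storage):
    """Backchase: minimal subqueries of the universal plan over `storage`, equivalent to q."""
    head, body = q
    universal = chase(body, deps)
    candidates = sorted(a for a in universal if a[0] in storage)
    found = []
    for k in range(1, len(candidates) + 1):
        for sub in combinations(candidates, k):
            if any(set(f) <= set(sub) for f in found):
                continue  # a smaller reformulation is already contained in this one
            covers_head = set(head) <= {v for _, args in sub for v in args}
            # Q is contained in every subquery of its own chase, so one direction suffices.
            if covers_head and contained_in((head, sub), q, deps):
                found.append(sub)
                yield head, sub


# View V(x,z) :- A(x,y), B(y,z), captured by its two dependencies.
cV = ([("A", ("x", "y")), ("B", ("y", "z"))], [("V", ("x", "z"))])
bV = ([("V", ("x", "z"))], [("A", ("x", "y")), ("B", ("y", "z"))])
q = (("x", "z"), [("A", ("x", "y")), ("B", ("y", "z"))])

result = list(reformulations(q, [cV, bV], storage={"V"}))
without_bV = list(reformulations(q, [cV], storage={"V"}))
```

With both dependencies the sketch finds the single reformulation Q(x,z) :- V(x,z). With cV alone the subquery V(x,z) is still in the universal plan but the backchase cannot prove it equivalent to Q, so nothing is returned. The enumeration of all subsets is exponential, which is the O(2^u) search space that the pruning criteria on the “Making It Work” slide are meant to tame.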