slides - Penn Engineering - University of Pennsylvania

Local-as-View
Data Integration
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
October 16, 2008
Administrivia
Next week:
Please read Section 3, 6, 7 in the Deshpande et al.
survey
Midterm assignment will be handed out Thursday,
with a one week clock
2
Recall: Kinds of Schema Mappings
Global As View (GAV):
x M(x)  y1y2 R1(x1y1)  R2(x2y2)  c(x1y1x2y2)
where x  x1x2…xn
Local As View (LAV):
x,y,z MR(x,y)  MS(y,z)  c(xyz)  w R1(x1y1z1w)
where x1y1z1  xyz
Global-Local As View (GLAV), aka Tuple-Generating
Dependencies (TGDs):
 xyz MR(x,y)  MS(y,z)  c1(xyz)  y1y2 R1(w1y1)  R2(w2y2) 
c2(xyzy1y2)  …
where w1w2  xyz
Query is of the form Q(x) :- MR(x1,y1), MS(x2,y2), …
3
Global-As-View Mappings
 Very easy to implement – doesn’t require any new
logic on the part of a regular DBMS engine
 For instance, Starburst QGM rewrites would work
 But some drawbacks – primarily that:
 We don’t have a mechanism to describe when a source
contains only a subset of the data in the mediated schema
 e.g., “All books from this source are of type textbook”
 The mediated schema often needs to change as we add
sources – it is somewhat “brittle” because it’s defined in
terms of sources
4
An Alternate Approach:
Local-As-View
When you integrate something, you have some conceptual
model of the integrated domain
 Define that as a basic frame of reference, everything else as a view
over it
 “Local as View” using mappings that are conjunctive queries
May have overlapping/incomplete sources
 Define each source as the subset of a query over the mediated
schema – the “open world assumption”
 We can use selection or join predicates to specify that a source
contains a range of values:
ComputerBooks(…)  Books(Title, …, Subj), Subj = “Computers”
5
The Local-as-View Model
The basic model is the following:
 “Local” sources are views over the mediated schema
 Sources have the data – mediated schema is virtual
 Sources may not have all the data from the domain –
“open-world assumption”
The system must use the sources (views) to answer
queries over the mediated schema
6
Answering Queries Using Views
Assumption: conjunctive queries, set semantics
 Suppose we have a mediated schema:
show(ID, title, year, genre), rating(ID, stars, source)
 A conjunctive query might be:
q(t) :- show(i, t, y, g), rating(i, 5, s)
Recall intuitions about this class of queries:
 Adding a conjunct to a query (e.g., t = 1997) removes
answers from the result but never adds any
 Any conjunctive query with at least the same constraints
& conjuncts will give valid answers
7
Why This Class of Mappings & Queries?
 Abiteboul & Duschka showed the data complexity of
answering queries using views with OWA:
views
queries
CQ
CQ!=
PQ
datalog
FO
CQ
PTIME co-NP PTIME
PTIME
undec
CQ!=
PTIME co-NP PTIME
PTIME
undec
PQ
co-NP co-NP co-NP
co-NP
undec
datalog co-NP undec co-NP
undec
undec
undec
undec
FO
undec undec undec
 Note that the common “inflationary semantics” version of
Datalog must terminate in PTIME, even with recursion
8
Query Answering
Suppose we have the query:
q(t) :- show(i, t, y, g), rating(i, 5, s)
and sources:
5star(i) :- show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) :- show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) :- show(i, t, y, g)
critics(i,r, s) :- rating(i, r, s)
goodMovies(t,y) :- show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source
mappings – but they’re in the wrong direction!
9
Query Answering
Suppose we have the query:
q(t) :- show(i, t, y, g), rating(i, 5, s)
and sources:
5star(i)  show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r)  show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g)  show(i, t, y, g)
critics(i,r, s)  rating(i, r, s)
goodMovies(t,y)  show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source
mappings – but they’re in the wrong direction!
10
Inverse Rules
We can take every mapping definition and “invert” it, though
sometimes we may have insufficient information:
If
5star(i)  show(i, t, y, g), rating(i, 5, s)
then it is also the case that:
show(i,??? ,??? ,??? ,???)  5star(i)
and we can write this as a datalog rule:
show(i,-,- ,-,-) :- 5star(i)
But how to handle the absence of attributes?
 We know that there must be AT LEAST one instance of ??? for each
attribute for each show ID
 So we might simply insert a NULL and define that NULL means
“unknown” (as opposed to “missing”)…
11
But NULLs Lose Information
Suppose we take these rules and ask for:
q(t) :- show(i, t, y, g), rating(i, 5, s)
If we look at the rule:
goodMovies(t,y)  show(i, t, y, “drama”), rating(i, 5, s), y = 1997
“By inspection,” q(t)  goodMovies(t,y)
But if apply our inversion procedure, we get:
show(i, t, y, g)  goodMovies(t,y), i = NULL, g = “drama”, y = 1997
rating(i, r, s)  goodMovies(t,y), i = NULL, r = 5, s = NULL
We need “a special NULL” so we can figure out which IDs and
ratings match up
12
The Solution: “Skolem Functions”
Skolem functions:
 Conceptual “perfect” hash functions
 Each function returns a unique, deterministic value for
each combination of input values
 Every function returns a non-overlapping set of values
(Skolem function F will never return a value that matches
any of Skolem function G’s values)
Skolem functions won’t ever be part of the answer set
or the computation – it doesn’t produce real values
 They’re just a way of logically generating “special NULLs”
13
Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and
execute them in a Datalog interpreter
 Really the same process as GAV unfolding!
 In the previous query, we expand with all combinations of
expansions of show and of rating – every possible way of
combining and cross-correlating info from different
sources
 Then discard unsatisfiable rewritings via unification,
i.e., substituting in constants from the query for variables
in the view
 Finally, execute the union of all satisfiable rewritings
14
Pros & Cons of Inverse Rules
 Works even with recursive queries, binding
patterns, FDs on schemas
 Generally, they take view definitions, split them, and
re-join them to produce answers
 Not very efficient
 No treatment of <, > predicates
 Can we do better? i.e., be more efficient?
15
The Bucket Algorithm
 Given a query Q with relations and predicates
 Create a bucket for each subgoal in Q
 Iterate over each view (source mapping)
 If source includes bucket’s subgoal:
 Create mapping between q’s vars and the view’s var at the
same position
 If satisfiable with substitutions, add to bucket
 Do cross-product of buckets, see if result is contained
(exptime, but queries are probably relatively small)
 For each result, do a containment check to make sure the
rewriting is contained within the query
16
Let’s Try a Bucket Example
Query
q(t) :- show(i, t, y, g), rating(i, 5, s)
Sources
5star(i)  show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r)  show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g)  show(i, t, y, g)
critics(i,r,s)  rating(i, r, s)
goodMovies(t,y)  show(i, t, y, “drama”), rating(i, 5, s), y = 1997
good98(t,y)  show(i, t, y, “drama”), rating(i, 5, s), y = 1998
17
Populating the Buckets
show(i,t,y,g)
rating(i,5,s)
5star(i)
5star(i)
TVguide(t,y,g,r)
TVguide(t,y,g,r)
movieInfo(i,t,y,g)
critics(i,r,s)
goodMovies(t,y) goodMovies(t,y)
good98(t,y)
good98(t,y)
18
Evaluation
 On the board…
19
Example of Containment Testing
Suppose we have two queries:
q1(t) :- show(i, t, y, g), rating(i, 5, s) , y = 1997
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)
Intuitively, q1 must contain the same or fewer answers vs. q2:
 It has all of the same conditions, except one extra conjunction
(i.e., it’s more restricted)
 There’s no union or any other way it can add more data
We can say that q2 contains q1 because this holds for any
instance of our mediated schema
20
Checking Containment
via Canonical Databases
 To test for q1 µ q2:
 Create a “canonical DB” that contains a tuple for each
subgoal in q1
 Execute q2 over it
 If q2 returns a tuple that matches the head of q1, then q1
µ q2
(This is an NP-complete algorithm in the size of the query.
Testing for full first-order logic queries is undecidable!!!)
 Let’s see this for our example…
21
Example Canonical DB
q1(t) :- show(i, t, 1997, g), rating(i, 5, s)
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)
show
i t 1997 g
rating
i 5 s
Need to get tuple <t> in executing q2 over this database
What if q2 didn’t ask for g = drama?
22
Buckets, Rev. 2: The MiniCon
Algorithm
 A “much smarter” bucket algorithm:
 In many cases, we don’t need to perform the crossproduct of all items in all buckets
 Eliminates the need for the containment check
 This – and the Chase & Backchase strategy of
Tannen et al – are the two methods most used in
virtual data integration today
 The chase procedure (though not the backchase) can be
achieved using a Datalog program that looks much like the
inverse rules!
23
Minicon Descriptions (MCDs)
 Basically, a modification to the bucket approach
 “head homomorphism” – defines what variables must be
equated by the query
 Variable-substituted version of the subgoals
 Mapping of variable names
 Info about what’s covered
 Property 1:
 If a variable occurs in the head of a query, then there must
be a corresponding variable in the head of the MCD view
 If a variable participates in a join predicate in the query,
then it must be in the head of the view
24
MCD Construction
For each subgoal of the query
For each subgoal of each view
Choose the least restrictive head homomorphism to match the
subgoal of the query
If we can find a way of mapping the variables, then add MCD for
each possible “maximal” extension of the mapping that satisfies
Property 1
25
MCDs for Our Example
q(t) :- show(i, t, y, g),
rating(i, r, s), r = 5
5star(i)  show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r)  show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g)  show(i, t, y, g)
critics(i,r,s)  rating(i, r, s)
goodMovies(t,y)  show(i, t, 1997, “drama”), rating(i, 5, s)
good98(t,y)  show(i, t, 1998, “drama”), rating(i, 5, s)
view
h.h.
mapping
goals sat.
5star(i)
ii
ii
2
TVguide(t,y,g,r)
tt, yy, gg
tt, yy, gg, rr
1,2
movieInfo(i,t,y,g)
ii, tt, yy, gg
ii, tt, yy, gg
1
critics(i,r,s)
ii, rr, ss
ii, rr, ss
2
goodMovies(t,y)
tt,yy
tt, yy
1,2
good98(t,y)
tt,yy
tt, yy
1,2
26
Combining MCDs
 Now look for ways of combining pairwise disjoint
subsets of the goals
 Greatly reduces the number of candidates!
 Also proven to be correct without the use of a
containment check
 Variations need to be made for:
 Constants in general (I sneaked those in)
 “Semi-interval” predicates (x <= c)
 Note that full-blown inequality predicates are co-NP-hard in the
size of the data, so they don’t work
27
MiniCon and LAV Summary
 The state-of-the-art for AQUV in the relational world of data
integration
 It’s been extended to support “conjunctive XQuery” as well
 Scales to large numbers of views, which we need in LAV data
integration
 Chase & Backchase by Tannen et al.
 A procedure that has very close connections to inverse rules
 Slightly more general in some ways – but:
 Typically produces equivalent rewritings, not maximally contained ones
 Not always polynomial in the size of the data (though for some useful
settings it is!!!)
28