
Containment of Relational Queries with Annotation Propagation
Wang-Chiew Tan
University of California, Santa Cruz
Annotation Management System

A system that is able to propagate meta-data that is
associated with a piece of data along with the data as
the data is being moved around
[Figure: data annotated with a1 and a2 passes through a transformation; the output carries a1 and a2.]
Main feature:
 To trace the provenance and flow of data
Tracing the Provenance and Flow of Data
[Figure: two chained transformations. Annotations a1, a2 (from sources annotated a1, a2, a3) and b1, b2, b3 are carried along to the final output.]
Other Applications


Keep information that cannot be otherwise stored in the current database design
Highlight wrong data
 Erroneous data may be copied but the comment that it is wrong goes along with it
Security
 Annotate security level of data items
Quality metric
 Annotate quality level of data items
Main Question

Are the annotated outcomes the same for
equivalent queries?

Why this question?
 A query optimizer rewrites a query. Will the rewritten query have the same annotation propagation behavior?
A Simple Example
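Given two relation schemas: R(A,B), S(B,C)

SELECT *
FROM R NATURAL JOIN S

versus

SELECT r.A, r.B, s.C
FROM R r, S s
WHERE r.B = s.B

Annotations are written as superscripts, e.g. 2^a is the value 2 carrying annotation a.

R (A, B):           1  2^a
S (B, C):           2^b  3
Result1 (A, B, C):  1  2^{a,b}  3
Result2 (A, B, C):  1  2^a  3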
In a More Concise Notation
Ans(x,y,z) :- R(x,y), S(y,z)
{ x → 1, y → 2^{a,b}, z → 3 }
Ans(x,y,z) :- R(x,y), S(y’,z), y = y’
{ x → 1, y → 2^a, y’ → 2^b, z → 3 }
 A location is a triple (R, t, A)
 Annotations of values that reside in different locations
but are bound to the same variable are unioned together
Ans(y) :- R(x,y)
Ans(y) :- S(y,z)
Ans(2^{a,b})
 Annotations that belong to the same output location are
unioned together
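The two union rules above can be sketched in a few lines of Python. This is an illustration only, not the system described in these slides: the function name `evaluate`, the `equalities` parameter, and the encoding of a cell as a `(value, set_of_annotations)` pair are all assumptions made for the example. It evaluates a conjunctive query by brute force and reproduces the two answers of the simple example.

```python
from itertools import product

def evaluate(head, body, db, equalities=()):
    """Evaluate a conjunctive query while propagating annotations.

    head: tuple of variable names
    body: list of (relation_name, tuple_of_variable_names)
    db: dict relation_name -> list of rows; each cell is (value, annotations)
    equalities: explicit tests such as y = y2, which compare values
                but do not merge annotations (hypothetical encoding)
    Returns dict: answer values -> list of annotation sets, one per head column.
    """
    answers = {}
    # Try every way of picking one stored row per subgoal.
    for rows in product(*(db[rel] for rel, _ in body)):
        binding, annots, consistent = {}, {}, True
        for (_, variables), row in zip(body, rows):
            for var, (value, cell_annots) in zip(variables, row):
                if binding.setdefault(var, value) != value:
                    consistent = False
                # Rule 1: same variable, different locations -> union.
                annots.setdefault(var, set()).update(cell_annots)
        if not consistent:
            continue
        if not all(binding[l] == binding[r] for l, r in equalities):
            continue
        key = tuple(binding[v] for v in head)
        out = answers.setdefault(key, [set() for _ in head])
        # Rule 2: same output location -> union.
        for column, var in zip(out, head):
            column |= annots[var]
    return answers

# R(A,B) holds (1, 2^a); S(B,C) holds (2^b, 3).
db = {"R": [((1, set()), (2, {"a"}))],
      "S": [((2, {"b"}), (3, set()))]}

natural = evaluate(("x", "y", "z"),
                   [("R", ("x", "y")), ("S", ("y", "z"))], db)
explicit = evaluate(("x", "y", "z"),
                    [("R", ("x", "y")), ("S", ("y2", "z"))], db,
                    equalities=[("y", "y2")])
```

Here `natural` carries both a and b on the B column, while `explicit` carries only a, matching Result1 and Result2.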
More Examples
Q1: Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z)
Q2: Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z)

R:
1^a  2  3
1^b  4  5^c
1    8  4
8    9  5^d

First answer (Q1): Ans(1^{a,b}, 5^c)
Second answer (Q2): Ans(1^b, 5^{c,d})
A sufficient condition for annotation containment
Theorem If Q1 and Q2 are equivalent and Q1 is minimal,
then Q1 is annotation-contained in Q2

Intuition of proof:
 If Q1 is minimal, then no proper subquery of Q1 is equivalent to Q1
 The minimal query of Q2 is isomorphic to Q1 up to variable renaming. Assume that they are identical.
 Any valuation ν for Q1 can be simulated by a valuation ν ∘ h for Q2 that carries annotations in the same way as ν does for Q1 (h is the homomorphism from Q2 to its minimal subquery)
Is the sufficient condition too strong?

Is it true that if Q1 is equivalent to Q2, then Q1 is annotation-contained in Q2?
 Answer: No.
Is it true that if Q1 is contained in Q2 and Q1 is minimal, then Q1 is annotation-contained in Q2?
 Answer: No.
Q1: Ans(x) :- R(x, y), S(x, y)
Q2: Ans(x) :- R(x, y)

R:
1^a  2
1^b  3

S:
1^c  2

Q1 returns Ans(1^{a,c}); Q2 returns Ans(1^{a,b})
 Both Q1 and Q2 are minimal queries, but neither is annotation-contained in the other
Necessary and Sufficient condition?
Q1: H(… x …) :- … S(… x …) …
 x occurs at the ith column of the head and at the jth column of the pth subgoal
Q2: H(… y …) :- … S(… y …) …
 y occurs at the ith column of the head and at the jth column of the qth subgoal
h(y) = x, h maps the qth subgoal of Q2 to the pth subgoal of Q1
If Q1 carries an annotation of the jth column of some S-tuple to the output, there is a way for Q2 to simulate this behavior via homomorphism h
A necessary and sufficient condition for annotation-containment via homomorphisms
Theorem Q1 is annotation-contained in Q2 iff for every
distinguished variable x that occurs at the ith column in
the head and jth column of the pth subgoal in the body of
Q1, there exists a homomorphism h from Q2 to Q1 such
that
 h maps the body of Q2 into the body of Q1 and the head of Q2 to the head of Q1
 Let the qth subgoal of Q2 be the preimage of the pth subgoal of Q1 under h. The variable that occurs at the jth column of the qth subgoal of Q2 is identical to the variable that occurs at the ith column in the head of Q2
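The condition in the theorem can be checked by brute force, as in the sketch below. The function names and the encoding of a query as `(head_vars, body)` are assumptions made for this illustration, and the sketch handles only queries without constants. It enumerates every assignment of Q2 subgoals to Q1 subgoals, keeps the ones that form a homomorphism, and then looks for a suitable witness for each occurrence of a distinguished variable in Q1.

```python
from itertools import product

def homomorphisms(q2, q1):
    """Yield (h, targets) for each homomorphism from q2 to q1.

    A query is (head_vars, body), body being a list of (relation, vars).
    targets[q] is the index of the q1 subgoal that subgoal q of q2 maps to.
    """
    (head2, body2), (head1, body1) = q2, q1
    if len(head2) != len(head1):
        return
    for targets in product(range(len(body1)), repeat=len(body2)):
        # The head of q2 must map to the head of q1, column by column.
        pairs = list(zip(head2, head1))
        for (rel2, vars2), p in zip(body2, targets):
            rel1, vars1 = body1[p]
            if rel2 != rel1 or len(vars2) != len(vars1):
                pairs = None
                break
            pairs.extend(zip(vars2, vars1))
        if pairs is None:
            continue
        h = {}
        # Keep the assignment only if it defines a function on variables.
        if all(h.setdefault(v2, v1) == v1 for v2, v1 in pairs):
            yield h, targets

def annotation_contained(q1, q2):
    """Check the theorem's condition for 'q1 is annotation-contained in q2'."""
    (head1, body1), (head2, body2) = q1, q2
    homs = list(homomorphisms(q2, q1))
    for i, x in enumerate(head1):
        for p, (_, vars1) in enumerate(body1):
            for j, var in enumerate(vars1):
                if var != x:
                    continue
                # Need some h and some subgoal q of q2 mapped to subgoal p
                # whose jth variable is the ith head variable of q2.
                if not any(targets[q] == p and body2[q][1][j] == head2[i]
                           for _, targets in homs
                           for q in range(len(body2))):
                    return False
    return True

# Ans(x) :- R(x,y), R(x,z)   versus   Ans(x) :- R(x,y)
two_r = (("x",), [("R", ("x", "y")), ("R", ("x", "z"))])
one_r = (("x",), [("R", ("x", "y"))])
# Ans(x) :- R(x,y), S(x,y)
r_and_s = (("x",), [("R", ("x", "y")), ("S", ("x", "y"))])
```

With these inputs, `annotation_contained(two_r, one_r)` holds even though each of the two homomorphisms from `one_r` to `two_r` covers only one subgoal, and neither `r_and_s` nor `one_r` is annotation-contained in the other. The search is exponential in the number of subgoals.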
Can a single homomorphism do the job?
Q1: Ans(x) :- R(x,y), R(x,z)
Q2: Ans(x) :- R(x,y)

Every homomorphism from Q2 to Q1 maps the
body of Q2 to only one subgoal of Q1
Complexity of Annotation-Containment
Proposition It is NP-complete to decide if Q1 is
annotation-contained in Q2
Propagating annotations back
If we wish to attach an annotation on a
piece of data in the output, on which
source data should we attach an
annotation?
 The user should be given the choice
 Alert the user of a side-effect-free
annotation when there is one

Annotation Placement Problem

Given the source database, the query, and the output data that we wish to annotate, it is
DP-hard to decide if there is a side-effect-free annotation
Upper-bound is not DP
 Conjecture: in a class slightly above DP
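
For small inputs the placement question can be explored by exhaustive search, as sketched below. The sketch assumes that a side-effect-free annotation is one placed on a single source location that propagates to the chosen output location and to no other; the slides do not spell out this definition. The function names and the encoding of locations as `(relation, row_index, column)` and `(answer_tuple, head_column)` are hypothetical.

```python
from itertools import product

def reached(source, head, body, db):
    """Output locations (answer tuple, head column) that an annotation
    placed at source = (relation, row_index, column) propagates to."""
    hits = set()
    indexed = [list(enumerate(db[rel])) for rel, _ in body]
    # Try every way of picking one stored row per subgoal.
    for rows in product(*indexed):
        binding, marked, ok = {}, set(), True
        for (rel, variables), (row_index, row) in zip(body, rows):
            for column, (var, value) in enumerate(zip(variables, row)):
                if binding.setdefault(var, value) != value:
                    ok = False
                if (rel, row_index, column) == source:
                    marked.add(var)  # this variable picks up the annotation
        if ok:
            answer = tuple(binding[v] for v in head)
            hits.update((answer, i) for i, v in enumerate(head) if v in marked)
    return hits

def side_effect_free_sources(target, head, body, db):
    """Source locations whose annotation reaches target and nothing else."""
    return [(rel, r, c)
            for rel, rows in db.items()
            for r, row in enumerate(rows)
            for c in range(len(row))
            if reached((rel, r, c), head, body, db) == {target}]

# Ans(x,z) :- R(x,y), S(y,z) over a small unannotated database.
head = ("x", "z")
body = [("R", ("x", "y")), ("S", ("y", "z"))]
db = {"R": [(1, 2)], "S": [(2, 3), (2, 4)]}

for_x = side_effect_free_sources(((1, 3), 0), head, body, db)
for_z = side_effect_free_sources(((1, 3), 1), head, body, db)
```

In this example annotating the 1 in R also marks the answer (1, 4), so `for_x` is empty, whereas the 3 in S is a side-effect-free choice for the second column. The search is polynomial in the database for a fixed query but exponential in the number of subgoals.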

Related Work
Idea is not new though annotations were never explicitly stated as provenance-based:
Wang & Madnick [VLDB 90], Lee, Bressan & Madnick [WIDM 98],
Bernstein & Bergstraesser [IEEE Data Eng. 99]
 Annotations of Web Documents
 Annotations on genomic sequences

Open Issues





Are there polynomial-time algorithms for deciding annotation-containment for the class of queries with bounded treewidth?
Is query minimization Church-Rosser?
Exact complexity of the annotation placement problem?
Annotation and propagation for XML data
Relationship between annotation-containment and containment of conjunctive queries under bag semantics
Open Issues (contd)
Annotation propagation semantics other than those based on provenance?
 Querying the annotations?

Other results that do not carry over

Query Minimization
 We can no longer minimize a query and preserve annotation-equivalence by discarding one subgoal at a time

Answering Queries using Views
 Some classical results no longer hold
 [LMSS95] if a query Q has p subgoals and a query Q’
is a complete minimal rewriting of Q using a set of
views V, then Q’ has at most p subgoals
Example
Q: A(x) :- R(x,z,v), R(x,u,z), R(x,z’,t), R(x,s,z’)
Q’: A(x) :- R(x,u,z), R(x,z’,t), R(x,s,z’)
Qmin: A(x) :- R(x,z’,t), R(x,s,z’)

R:
1^{a1}  2  3
1^{a2}  3  2
1^{a3}  4  5
1^{a4}  4  6