Datalog

Linh Anh Nguyen
Institute of Informatics
University of Warsaw
(Based on slides by Chomicki, Suda, Nutt, Eiter and Faber)
What is Datalog?
Datalog is a basic query language for deductive databases.
It uses monotonic rules to define intensional predicates from
extensional ones, and has tractable (PTIME) data complexity.
Datalog is a restricted form of definite logic programs.
Its extensions with negation form non-monotonic rules, for
which different semantics have been well-studied.
Datalog is a rule language of knowledge representation.
It has applications to the Semantic Web.
Example: Metro Database Instance
link
line   station        next station
4      St.Germain     Odeon
4      Odeon          St.Michel
4      St.Michel      Chatelet
1      Chatelet       Louvres
1      Louvres        Palais-Royal
1      Palais-Royal   Tuileries
1      Tuileries      Concorde
Datalog program for the first query:
reach(X, Y) ← link(L, X, Y)
reach(X, Y) ← link(L, X, Z), reach(Z, Y)
answer(X) ← reach(’Odeon’, X)
Note: recursive definition.
Intuitively, if the part on the right of “←” is true, the rule “fires”
and the atom on the left of “←” is concluded.
Syntax of Datalog
A term is either a constant or a variable.
An atom is an atomic formula of the form R(t),
where R is an n-ary predicate and t is a vector of n terms.
A Datalog rule is an expression of the form A ← B1, ..., Bn,
where n ≥ 0, A, B1, ..., Bn are atoms, and
every variable occurring in A also occurs in B1, ..., Bn (“safety”).
A is called the head of the rule.
{B1, ..., Bn} is called the body of the rule.
A rule with n = 0 is called a fact; by safety, its head contains no variables.
The rule symbol “←” is often also written as “:-”.
A Datalog program is a finite set of Datalog rules.
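The safety condition above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding in which variables are strings starting with an uppercase letter and constants are written in lowercase; the names `Atom`, `Rule` and `is_safe` are introduced here for illustration only.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    pred: str
    args: tuple  # terms; assumption: variables start with an uppercase letter

class Rule(NamedTuple):
    head: Atom
    body: tuple  # tuple of Atom; empty when n = 0

def is_var(term: str) -> bool:
    return term[:1].isupper()

def variables(atom: Atom) -> set:
    return {t for t in atom.args if is_var(t)}

def is_safe(rule: Rule) -> bool:
    """Every variable of the head must occur in some body atom."""
    body_vars = set()
    for b in rule.body:
        body_vars |= variables(b)
    return variables(rule.head) <= body_vars

# reach(X, Y) <- link(L, X, Y) is safe.
safe_rule = Rule(Atom("reach", ("X", "Y")), (Atom("link", ("L", "X", "Y")),))
# reach(X, Y) <- link(L, X, Z) is not: Y does not occur in the body.
unsafe_rule = Rule(Atom("reach", ("X", "Y")), (Atom("link", ("L", "X", "Z")),))

print(is_safe(safe_rule), is_safe(unsafe_rule))
```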
Datalog Programs
Let P be a Datalog program.
An extensional predicate of P is a predicate occurring only
in rule bodies of P.
An intensional predicate of P is a predicate occurring in the
head of some rule in P.
The extensional schema of P, edb(P), consists of all
extensional predicates of P.
The intensional schema of P, idb(P), consists of all
intensional predicates of P.
The schema of P, sch(P), is the union of edb(P) and idb(P).
The Metro Example /1
Datalog program P on metro database schema:
reach(X, Y) ← link(L, X, Y)
reach(X, Y) ← link(L, X, Z), reach(Z, Y)
answer(X) ← reach(’Odeon’, X).
Here,
edb(P) = {link}
idb(P) = {reach, answer}
sch(P) = {link, reach, answer}.
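As a small illustration, the three schemas of the metro program can be computed directly from the definitions. This is only a sketch: each rule is reduced to its head predicate and the list of its body predicates (arguments are omitted), which is an encoding chosen here for brevity.

```python
# Each rule is (head predicate, body predicates); arguments are omitted.
P = [
    ("reach", ["link"]),
    ("reach", ["link", "reach"]),
    ("answer", ["reach"]),
]

def idb(program):
    """Predicates occurring in the head of some rule."""
    return {head for head, _ in program}

def edb(program):
    """Predicates occurring only in rule bodies."""
    body_preds = {p for _, body in program for p in body}
    return body_preds - idb(program)

def sch(program):
    return edb(program) | idb(program)

print(edb(P), idb(P), sch(P))
```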
Active Domain and Instantiation
Let P be a Datalog program and I be a database instance of
edb(P).
The active domain of P with respect to I, denoted as
adom(P, I), is the set of constants occurring in P and I.
Let ν be a substitution mapping each variable to a constant
of adom(P, I). The instantiation of a rule r w.r.t. ν is the
rule obtained from r by replacing each variable x with ν(x).
The Metro Example /2
Extensional database instance I :
link
line   station        next station
4      St.Germain     Odeon
4      Odeon          St.Michel
4      St.Michel      Chatelet
1      Chatelet       Louvres
1      Louvres        Palais-Royal
1      Palais-Royal   Tuileries
1      Tuileries      Concorde
Datalog program P :
reach(X, Y) ← link(L, X, Y)
reach(X, Y) ← link(L, X, Z), reach(Z, Y)
answer(X) ← reach(’Odeon’, X)
adom(P, I) = {4, 1, St.Germain, Odeon, St.Michel, Chatelet,
Louvres, Palais-Royal, Tuileries, Concorde}
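The active domain and the instantiations of a rule can be enumerated as in the sketch below. The encoding is an assumption made for illustration: atoms are pairs (predicate, argument tuple), variables are instances of a hypothetical marker class `Var`, and line numbers are stored as strings.

```python
from itertools import product

class Var(str):
    """Marker type for variables; every other term is a constant."""

X, Y, Z, L = Var("X"), Var("Y"), Var("Z"), Var("L")

I = {
    ("link", ("4", "St.Germain", "Odeon")),
    ("link", ("4", "Odeon", "St.Michel")),
    ("link", ("4", "St.Michel", "Chatelet")),
    ("link", ("1", "Chatelet", "Louvres")),
    ("link", ("1", "Louvres", "Palais-Royal")),
    ("link", ("1", "Palais-Royal", "Tuileries")),
    ("link", ("1", "Tuileries", "Concorde")),
}
P = [
    (("reach", (X, Y)), [("link", (L, X, Y))]),
    (("reach", (X, Y)), [("link", (L, X, Z)), ("reach", (Z, Y))]),
    (("answer", (X,)), [("reach", ("Odeon", X))]),
]

def adom(program, instance):
    """Constants occurring in the program or in the instance."""
    consts = {t for _, args in instance for t in args}
    for head, body in program:
        for _, args in [head, *body]:
            consts |= {t for t in args if not isinstance(t, Var)}
    return consts

def instantiations(rule, domain):
    """Yield the instantiation of the rule for every substitution over the domain."""
    head, body = rule
    vs = sorted({t for _, args in [head, *body] for t in args if isinstance(t, Var)})
    for values in product(sorted(domain), repeat=len(vs)):
        nu = dict(zip(vs, values))
        ground = lambda atom: (atom[0], tuple(nu[t] if isinstance(t, Var) else t
                                              for t in atom[1]))
        yield ground(head), [ground(b) for b in body]

D = adom(P, I)
print(len(D))
```

A rule with k distinct variables has |adom(P, I)|^k instantiations, which is why practical evaluators join body atoms against stored facts instead of enumerating all substitutions.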
The Least Model Semantics
Let P be a Datalog program and I be an instance of edb(P).
Each rule (A ← B1, ..., Bn) of P is treated as the formula
(B1 ∧ ... ∧ Bn → A), where variables are universally quantified.
Each tuple of a relation of I is treated as an atom.
Thus, P ∪ I is a logical theory.
A ground atom is an atom without variables.
The least Herbrand model of P containing I, denoted P(I), is
the set of all ground atoms that are logical consequences of P ∪ I.
The Metro Example /3
Extensional database instance I :
link
line   station        next station
4      St.Germain     Odeon
4      Odeon          St.Michel
4      St.Michel      Chatelet
1      Chatelet       Louvres
1      Louvres        Palais-Royal
1      Palais-Royal   Tuileries
1      Tuileries      Concorde
Datalog program P :
reach(X, Y) ← link(L, X, Y)
reach(X, Y) ← link(L, X, Z), reach(Z, Y)
answer(X) ← reach(’Odeon’, X)
P(I) = I ∪ {reach(St.Germain, Odeon), ...,
reach(St.Germain, Concorde), ...,
answer(St.Michel), ..., answer(Concorde)}
The Least Model Semantics: Properties
Properties
If A ∈ P(I) then
the predicate of A belongs to sch(P)
the terms occurring in A are constants from adom(P, I)
if the predicate of A is an extensional predicate then A ∈ I.
Definition
A Herbrand model of P ∪ I is a set of ground atoms using
predicates from sch(P) and constants from adom(P, I) that forms
a model of P ∪ I (i.e. contains I and satisfies all the rules of P).
Property
P(I) ⊆ M for any Herbrand model M of P ∪ I.
Fixpoint Semantics
If all facts in I hold, which other facts must hold after firing the
rules in P?
Approach
Define an immediate consequence operator TP,I (K) on
database instances K.
Start with K = ∅.
Apply TP,I to obtain a new instance:
Knew := TP,I (K) = I ∪ new facts.
Iterate until nothing new can be produced.
The result yields the semantics.
Immediate Consequence Operator
Let P be a Datalog program, I be a database instance of edb(P).
A ground substitution ν : var (P) → adom(P, I) is a function
mapping every variable of P to a constant occurring in P ∪ I.
A ground instance of a rule r w.r.t. ν is the rule obtained from
r by replacing every variable x with ν(x).
Let K be a database instance of sch(P).
A fact R(t) is an immediate consequence of K and P if either
R ∈ edb(P) and R(t) ∈ I, or
there exists a ground instance (A ← B1 , . . . , Bn ) of a rule in P
such that A = R(t) and {B1 , . . . , Bn } ⊆ K.
TP,I (K) := {A | A is an immediate consequence of K and P}
Example
Consider
P = {reachable(a)
reachable(Y) ← link(X, Y), reachable(X)}
I = {link(a, b), link(b, c)}
Then,
K1 = ∅
K2 = TP,I (K1 ) = I ∪ {reachable(a)}
K3 = TP,I (K2 ) = K2 ∪ {reachable(b)}
K4 = TP,I (K3 ) = K3 ∪ {reachable(c)}
K5 = TP,I (K4 ) = K4
Thus, K4 is a fixpoint of TP,I .
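The iteration above can be reproduced with a small implementation of the immediate consequence operator. This is a sketch under assumptions: atoms are pairs (predicate, argument tuple), variables are instances of a hypothetical marker class `Var`, and rule bodies are matched against the facts of K instead of enumerating all ground substitutions (for safe rules both approaches derive the same heads).

```python
class Var(str):
    """Marker type for variables; every other term is a constant."""

def match(atom, fact, nu):
    """Extend nu so that atom under nu equals the ground fact; None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    nu = dict(nu)
    for t, c in zip(args, fargs):
        if isinstance(t, Var):
            if nu.setdefault(t, c) != c:
                return None
        elif t != c:
            return None
    return nu

def T(P, I, K):
    """Facts of I plus the heads of all rule instances whose body lies in K."""
    result = set(I)
    for head, body in P:
        substitutions = [{}]
        for b in body:
            substitutions = [nu2 for nu in substitutions for fact in K
                             if (nu2 := match(b, fact, nu)) is not None]
        for nu in substitutions:
            pred, args = head
            result.add((pred, tuple(nu[t] if isinstance(t, Var) else t for t in args)))
    return result

X, Y = Var("X"), Var("Y")
P = [
    (("reachable", ("a",)), []),
    (("reachable", (Y,)), [("link", (X, Y)), ("reachable", (X,))]),
]
I = {("link", ("a", "b")), ("link", ("b", "c"))}

# K[0] plays the role of K1, K[1] of K2, and so on.
K = [set()]
while len(K) < 2 or K[-1] != K[-2]:
    K.append(T(P, I, K[-1]))

for i, k in enumerate(K, start=1):
    print(f"K{i} =", sorted(k))
```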
Immediate Consequence Operator: Properties
Definition
K is called a fixpoint of TP,I if TP,I (K) = K.
Proposition
For every Datalog program P we have:
the operator TP,I is monotonic,
that is, K ⊆ K′ implies TP,I (K) ⊆ TP,I (K′);
for any database instance K of sch(P) we have:
K is a model of P ∪ I if and only if TP,I (K) ⊆ K
if K is a fixpoint of TP,I then K is a model of P.
Fact
Since TP,I is monotonic, it has the least fixpoint, denoted lfp(TP,I ).
Fixpoint Semantics: Properties
For a Datalog program P and an instance I of edb(P), define:
TP,I ↑ 0 = ∅
TP,I ↑ (n + 1) = TP,I (TP,I ↑ n)
TP,I ↑ ω = ⋃_{n=0}^{ω} TP,I ↑ n.
We have that TP,I ↑ 0 ⊆ TP,I ↑ 1 ⊆ . . . ⊆ TP,I ↑ n ⊆ . . . ⊆ TP,I ↑ ω.
As P and I are finite, TP,I ↑ ω is also finite, and there exists n such
that TP,I ↑ ω = TP,I ↑ n.
Theorem
For every Datalog program P and every instance I of edb(P),
lfp(TP,I ) = TP,I ↑ ω (a way to compute lfp(TP,I ))
lfp(TP,I ) = P(I) (the fixpoint semantics coincides with the
least model semantics).
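The theorem suggests a direct way to compute P(I): apply the operator repeatedly, starting from the empty set, until nothing changes. The sketch below does this for the metro program, specialised by hand to the reach relation (relations are plain Python sets of tuples; the function name `step` is an illustrative choice, not part of the slides).

```python
link = {
    ("4", "St.Germain", "Odeon"),
    ("4", "Odeon", "St.Michel"),
    ("4", "St.Michel", "Chatelet"),
    ("1", "Chatelet", "Louvres"),
    ("1", "Louvres", "Palais-Royal"),
    ("1", "Palais-Royal", "Tuileries"),
    ("1", "Tuileries", "Concorde"),
}

def step(reach):
    """One application of the two reach rules to the current reach facts."""
    base = {(x, y) for (_, x, y) in link}                 # reach(X,Y) <- link(L,X,Y)
    rec = {(x, y) for (_, x, z) in link
           for (z2, y) in reach if z == z2}               # reach(X,Y) <- link(L,X,Z), reach(Z,Y)
    return base | rec

reach = set()
while True:
    new = step(reach)
    if new == reach:        # fixpoint reached
        break
    reach = new

# answer(X) <- reach('Odeon', X)
answer = {y for (x, y) in reach if x == "Odeon"}
print(sorted(answer))
```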
Data Complexity
Data complexity of a query language is measured with respect to
the extensional database, while assuming that the query is fixed.
In the case of Datalog:
the query is a Datalog program P, which may contain a
special rule like answer(X) ← reach(’Odeon’, X)
the data complexity is the complexity of computing P(I),
measured w.r.t. the size of edb instance I.
Theorem
The data complexity of Datalog is in PTIME.
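The bound behind the theorem can be sketched as follows. The symbols are introduced here for illustration only: m is the number of predicates in sch(P), a is their maximal arity, v is the maximal number of variables in a rule of P, and d = |adom(P, I)|. Since P is fixed, m, a and v are constants, and d is at most the size of I plus a constant.

```latex
|P(I)| \le m \cdot d^{a},
\qquad
\#\{\text{applications of } T_{P,I} \text{ until the fixpoint}\} \le m \cdot d^{a} + 1,
\qquad
\#\{\text{ground instances of one rule}\} \le d^{v}
```

Every application before the fixpoint adds at least one ground atom, and each application inspects polynomially many ground rule instances, so the whole computation is polynomial in the size of I.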
Top-Down Query Evaluation /1
Evaluation of a ground goal A (w.r.t. P and I) :
A succeeds immediately if there is a fact A in P ∪ I
A succeeds if there is a rule A′ ← B1, ..., Bn in P and a
ground substitution ν such that
A′ν = A
the evaluation of B1ν, ..., Bnν all succeed.
Failure: backtracking (substitutions are undone).
Infinite looping: tabulation (some systems, e.g. XSB).
Top-Down Query Evaluation /2
Evaluation of a non-ground goal A :
unify A with a fact or a rule head (after renaming apart)
propagate the substitutions to the body of the rule
evaluate the body.
Two goals unify if there is a substitution that maps both to the
same goal. A substitution is a most general unifier (mgu) of two
goals if all other unifying substitutions can be obtained from it by
composition. Only mgu’s are used for the method.
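Because Datalog terms are only constants and variables, computing an mgu is simple and needs no occurs check. A minimal sketch, assuming atoms are pairs (predicate, argument tuple) and variables are instances of a hypothetical marker class `Var`:

```python
class Var(str):
    """Marker type for variables; every other term is a constant."""

def walk(t, s):
    """Follow variable bindings in s until a constant or an unbound variable is reached."""
    while isinstance(t, Var) and t in s:
        t = s[t]
    return t

def mgu(a, b):
    """Return a most general unifier of two atoms as a dict, or None if none exists."""
    (p, args1), (q, args2) = a, b
    if p != q or len(args1) != len(args2):
        return None
    s = {}
    for t1, t2 in zip(args1, args2):
        t1, t2 = walk(t1, s), walk(t2, s)
        if t1 == t2 and isinstance(t1, Var) == isinstance(t2, Var):
            continue                      # already identical
        if isinstance(t1, Var):
            s[t1] = t2
        elif isinstance(t2, Var):
            s[t2] = t1
        else:
            return None                   # two different constants
    return s

def apply(atom, s):
    pred, args = atom
    return (pred, tuple(walk(t, s) for t in args))

X, X1, Y1 = Var("X"), Var("X1"), Var("Y1")
goal = ("reach", (X, "e"))
head = ("reach", (X1, Y1))               # rule head after renaming apart
s = mgu(goal, head)
print(s, apply(goal, s) == apply(head, s))
```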
Example
Extensional database instance I :
{link(a, b), link(a, c), link(c, d), link(d, e)}
Datalog program P :
reach(X, Y) ← link(X, Y)
reach(X, Y) ← link(X, Z), reach(Z, Y)
answer(X) ← link(a, X), reach(X, e)
Goal:
← answer(X)
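The goal above can be evaluated with a small tuple-at-a-time, depth-first procedure in the style of SLD-resolution. The sketch below is an illustration under assumptions, not the method of any particular system: it uses a hypothetical `Var` marker class, renames rules apart with a counter, and has no tabulation, so it terminates here only because the link relation is acyclic.

```python
from itertools import count

class Var(str):
    """Marker type for variables; every other term is a constant."""

X, Y, Z = Var("X"), Var("Y"), Var("Z")
facts = [("link", ("a", "b")), ("link", ("a", "c")),
         ("link", ("c", "d")), ("link", ("d", "e"))]
rules = [
    (("reach", (X, Y)), [("link", (X, Y))]),
    (("reach", (X, Y)), [("link", (X, Z)), ("reach", (Z, Y))]),
    (("answer", (X,)), [("link", ("a", X)), ("reach", (X, "e"))]),
]
fresh = count()

def walk(t, s):
    while isinstance(t, Var) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    """Extend substitution s to a unifier of atoms a and b, or return None."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None
    s = dict(s)
    for t1, t2 in zip(a[1], b[1]):
        t1, t2 = walk(t1, s), walk(t2, s)
        if isinstance(t1, Var):
            if not (isinstance(t2, Var) and t1 == t2):
                s[t1] = t2
        elif isinstance(t2, Var):
            s[t2] = t1
        elif t1 != t2:
            return None
    return s

def rename(rule):
    """Rename the variables of a rule apart by attaching a fresh suffix."""
    n = next(fresh)
    r = lambda atom: (atom[0], tuple(Var(f"{t}_{n}") if isinstance(t, Var) else t
                                     for t in atom[1]))
    head, body = rule
    return r(head), [r(b) for b in body]

def solve(goals, s):
    """Yield every substitution under which all goals succeed."""
    if not goals:
        yield s
        return
    first, rest = goals[0], goals[1:]
    for fact in facts:
        s2 = unify(first, fact, s)
        if s2 is not None:
            yield from solve(rest, s2)
    for rule in rules:
        head, body = rename(rule)
        s2 = unify(first, head, s)
        if s2 is not None:
            yield from solve(body + rest, s2)

answers = {walk(X, s) for s in solve([("answer", (X,))], {})}
print(answers)
```

Backtracking is implicit here: each branch works on its own copy of the substitution, so a failed branch leaves no bindings behind.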
Top-Down Query Evaluation: Properties
The evaluation method is based on:
SLD-resolution of logic programming
tabulation.
The method is sound and complete in the sense that
it returns only correct answers
it returns all correct answers.
In practice, it is done set-at-a-time (instead of tuple-at-a-time)
in order to reduce the number of accesses to secondary storage.
Bottom-Up Query Evaluation
The naive method computes TP,I ↑ ω by using the set-at-a-time
technique. One of the drawbacks of this method is that many facts
are derived many times, at different iterations.
The semi-naive method improves the naive method by avoiding
redundant computation and by exploiting dependencies between
the intensional predicates. The drawback of this method is that it
computes the whole set P(I), while we are usually interested only
in computing P(I)(answer) for some special predicate answer.
The advanced method improves the semi-naive method by:
simulating the goal-driven (top-down) method to rewrite P
into P′ such that it concentrates only on computing predicate
answer and P′(I)(answer) = P(I)(answer)
(the transformation is called magic-set transformation)
using the semi-naive method to compute P′(I)(answer).
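The idea of the semi-naive method can be sketched for a binary reach relation: in each round the recursive rule is joined only with the facts that were new in the previous round (the "delta"), so no derivation is repeated from old facts alone. This sketch is specialised by hand to the two reach rules over a binary link relation and does not cover the magic-set transformation; the function name `semi_naive` is an illustrative choice.

```python
link = {("a", "b"), ("a", "c"), ("c", "d"), ("d", "e")}

def semi_naive(link):
    """Compute reach for: reach(X,Y) <- link(X,Y); reach(X,Y) <- link(X,Z), reach(Z,Y)."""
    reach = set(link)      # first round: the non-recursive rule
    delta = set(link)      # facts that are new in the current round
    while delta:
        # Join link only with the new reach facts.
        derived = {(x, y) for (x, z) in link for (z2, y) in delta if z == z2}
        delta = derived - reach
        reach |= delta
    return reach

reach = semi_naive(link)
print(sorted(reach))
```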
Introducing Negation
Datalog¬ allows negation to occur in front of atoms in rules.
Rules with negative literals in the body:
A ← B1 , . . . , Bn , ¬C1 , . . . , ¬Cm
Example: forbear(X, Y) ← anc(X, Y), ¬parent(X, Y)
Extended safety: A variable occurring in a negative literal
must also occur in a positive literal of the same rule body.
Semantics:
A negative literal ¬A is satisfied in a Herbrand interpretation I
(which is a set of ground atoms) if A ∉ I.
For non-ground rules consider all instantiations.
Problem: A set P of Datalog¬ rules may have more than one
minimal Herbrand model!
Open vs. Closed World Assumption
Closed World Assumption (CWA)
What is not implied by a program is false.
Open World Assumption (OWA)
What is not implied by a program is unknown.
Scope
traditional database applications: CWA
information integration: OWA or CWA
Problems with Negation
Not every program has a clear logical meaning (due to the
interaction of negation with recursion).
Bottom-up evaluation does not always produce an intuitive
result.
Example:
p ← ¬q
q ← ¬p
Implicit Quantification
Example: bachelor(X) ← male(X), ¬married(X, Y)
Does it mean:
(1) X is a bachelor if X is a male and X is not married to anyone,
(2) or X is a bachelor if X is a male and X is not married to
everyone?
Logically:
(1) ∀X [bachelor(X) ← male(X) ∧ ¬∃Y married(X, Y)]
(2) ∀X [bachelor(X) ← male(X) ∧ ∃Y ¬married(X, Y)]
The proper logical reading is 2, because it is equivalent to:
∀X ∀Y [bachelor(X) ← male(X) ∧ ¬married(X, Y)]
If the reading 1 is desired, replace the rule by:
bachelor(X) ← male(X), ¬husband(X)
husband(X) ← married(X, Y)
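The difference between the two readings shows up on a tiny instance. In the sketch below the people, the married relation and the domain are made-up data for illustration; Y ranges over the whole domain.

```python
male = {"adam", "bob"}
married = {("adam", "eve")}
domain = {"adam", "bob", "eve"}

# Reading 1: X is not married to anyone.
reading1 = {x for x in male if not any((x, y) in married for y in domain)}

# Reading 2: there is some Y that X is not married to.
reading2 = {x for x in male if any((x, y) not in married for y in domain)}

# The rewritten program: husband(X) <- married(X, Y); bachelor(X) <- male(X), not husband(X).
husband = {x for (x, _) in married}
bachelor = {x for x in male if x not in husband}

print(reading1, reading2, bachelor)
```

Under reading 2 adam counts as a bachelor because he is not married to bob, which is rarely the intended meaning; the rewritten program yields reading 1.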
Stratified Datalog¬ Programs
The dependency graph pdg (P) of a Datalog¬ program P :
vertices: predicates of P
edges:
a positive edge (p, q) if there is a rule in P in which p appears
in a positive literal in the body and q appears in the head
a negative edge (p, q) if there is a rule in P in which p
appears in a negative literal in the body and q appears in the
head.
A Datalog¬ program P is stratified if no cycle in its dependency
graph pdg (P) contains a negative edge.
Stratifications
A stratification of P is a mapping s from the set of predicates of
P to natural numbers such that:
if a positive edge (p, q) is in pdg (P), then s(p) ≤ s(q)
if a negative edge (p, q) is in pdg (P), then s(p) < s(q).
There is a polynomial-time algorithm that
determines whether a program is stratified and,
if it is, finds a stratification for it.
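One simple polynomial-time procedure (a sketch, not necessarily the algorithm the slides have in mind) starts with s(p) = 0 for every predicate and repeatedly raises s(q) until both edge conditions hold. In a stratified program with n predicates no stratum number needs to exceed n - 1, so reaching n signals a cycle through a negative edge. Edges are encoded here as triples (p, q, negative), an encoding chosen for illustration.

```python
def stratify(predicates, edges):
    """Return a stratification as a dict, or None if the program is not stratified."""
    s = {p: 0 for p in predicates}
    changed = True
    while changed:
        changed = False
        for p, q, negative in edges:
            need = s[p] + 1 if negative else s[p]   # s(p) < s(q) or s(p) <= s(q)
            if s[q] < need:
                s[q] = need
                changed = True
                if need >= len(predicates):
                    return None                     # a cycle contains a negative edge
    return s

# forbear(X, Y) <- anc(X, Y), not parent(X, Y)
ok = stratify({"anc", "parent", "forbear"},
              [("anc", "forbear", False), ("parent", "forbear", True)])

# p <- not q ; q <- not p
bad = stratify({"p", "q"}, [("q", "p", True), ("p", "q", True)])

print(ok, bad)
```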
Stratified Datalog¬ : Bottom-Up Query Evaluation
Let P be a stratified Datalog¬ program.
Bottom-Up Query Evaluation
1. Compute a stratification of the program P.
2. Partition P into P1, ..., Pn, each Pi consisting of all and only
   rules whose head belongs to a single stratum.
3. Evaluate bottom-up P1, ..., Pn (in that order):
   find the substitutions to the positive literals first
   use negative literals only as tests
   ¬A succeeds if A is not in the result of the lower strata.
The result of bottom-up evaluation:
does not depend on the stratification
can be semantically characterized in various ways.
Example
Let
I = {R1 (a), R2 (a), R3 (a), R4 (a),
R(b), R3 (b), R4 (b),
R2 (c), R3 (c), R4 (c)}
and let P be the following Datalog¬ program:
S(x) ← R1 (x), ¬R(x)
T (x) ← R2 (x), ¬R(x)
U(x) ← R3 (x), ¬T (x)
V (x) ← R4 (x), ¬S(x), ¬U(x).
Stratify P and evaluate P(I) using the stratified semantics.
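One possible worked solution, as a sketch: the extensional predicates R, R1, ..., R4 can be placed in stratum 0, S and T in stratum 1, U in stratum 2 and V in stratum 3. Evaluating the strata in that order, with each negative literal used as a test against already completed relations, amounts to the following set computations (all predicates are unary, so relations are sets of constants).

```python
# Extensional instance I.
R1 = {"a"}
R2 = {"a", "c"}
R3 = {"a", "b", "c"}
R4 = {"a", "b", "c"}
R = {"b"}

# Stratum 1: S(x) <- R1(x), not R(x);  T(x) <- R2(x), not R(x)
S = {x for x in R1 if x not in R}
T = {x for x in R2 if x not in R}

# Stratum 2: U(x) <- R3(x), not T(x)
U = {x for x in R3 if x not in T}

# Stratum 3: V(x) <- R4(x), not S(x), not U(x)
V = {x for x in R4 if x not in S and x not in U}

print(S, T, U, V)
```

P(I) then consists of I together with the derived facts for S, T, U and V.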
Expressiveness and Data Complexity
Expressiveness
There are queries not expressible in relational algebra but
expressible in Datalog:
e.g., transitive closure of a binary relation.
There are queries not expressible in Datalog but expressible in
relational algebra:
e.g., set difference (nonmonotonic query).
Every relational algebra query can be expressed in stratified
Datalog¬ .
Data Complexity
The data complexity of stratified Datalog¬ is in PTIME.
Recursion in SQL3
General Form
WITH R AS <definition of R> <query to R>
If R is recursively defined, it should be preceded by RECURSIVE.
Example
WITH RECURSIVE Anc(Upper, Lower) AS (
    SELECT * FROM Parent
    UNION
    SELECT P.Upper, A.Lower
    FROM Parent AS P, Anc AS A
    WHERE P.Lower = A.Upper
)
SELECT Anc.Upper
FROM Anc
WHERE Anc.Lower = 'Dave';
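The ancestor query can be tried out with Python's built-in sqlite3 module, since SQLite supports WITH RECURSIVE. The rows inserted into Parent below are made-up sample data, and the CTE is written in the parenthesised form that SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Parent(Upper TEXT, Lower TEXT)")
# Hypothetical sample data: Alice -> Bob -> Carol -> Dave.
conn.executemany(
    "INSERT INTO Parent VALUES (?, ?)",
    [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")],
)

rows = conn.execute("""
    WITH RECURSIVE Anc(Upper, Lower) AS (
        SELECT Upper, Lower FROM Parent
        UNION
        SELECT P.Upper, A.Lower
        FROM Parent AS P, Anc AS A
        WHERE P.Lower = A.Upper
    )
    SELECT Anc.Upper FROM Anc WHERE Anc.Lower = 'Dave'
""").fetchall()

ancestors = {row[0] for row in rows}
print(sorted(ancestors))
conn.close()
```

The query corresponds to the Datalog rules anc(U, L) ← parent(U, L) and anc(U, L) ← parent(U, M), anc(M, L), with UNION providing the set semantics.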
Recursion in SQL3 (cont’d)
Example
-- Standard SQL gives the RECURSIVE keyword once, directly after WITH.
WITH RECURSIVE
  Sib(x, y) AS (
    SELECT p1.child, p2.child
    FROM Par p1, Par p2
    WHERE p1.parent = p2.parent
  ),
  Cousin(x, y) AS (
    SELECT * FROM Sib
    UNION
    SELECT p1.child, p2.child
    FROM Par p1, Par p2, Cousin
    WHERE p1.parent = Cousin.x AND p2.parent = Cousin.y
  )
SELECT y
FROM Cousin
WHERE x = 'Sally';
Recursion in SQL3 (cont’d)
Mutual recursion:
more than one relation can be defined simultaneously.
Linear recursion:
each definition can have only one occurrence of a relation
mutually recursive with the relation being defined.
Negation and recursion:
if EXCEPT is used, the definitions should be stratified.