Implementation of a Database System with Boolean Algebra
Constraints
by
Andras Salamon
A THESIS
Presented to the Faculty of
The Graduate College at the University of Nebraska
In Partial Fulfillment of Requirements
For the Degree of Master of Science
Major: Computer Science
Under the Supervision of Professor Peter Z. Revesz
Lincoln, Nebraska
May, 1998
Implementation of a Database System with Boolean Algebra Constraints
Andras Salamon, M.S.
University of Nebraska, 1998
Advisor: Peter Z. Revesz
This thesis describes an implementation of a constraint database system with
constraints over a Boolean algebra of sets. The system allows, within the input
database as well as in the queries, equality, subset-equality, and monotone inequality
constraints between Boolean algebra terms built up using the operators of union,
intersection, and complement. Hence the new system extends the earlier DISCO
system, which only allowed equality and subset-equality constraints between Boolean
algebra variables and constants.
The new system allows Datalog with Boolean Algebra constraints as the query language. The implementation includes an extension of Naive and Semi-Naive evaluation
methods for Datalog programs and algebraic optimization techniques for relational
algebra formulas.
The thesis also includes three example applications of the new system, in the areas
of family tree genealogy, genome map assembly, and two-player game analysis. In
each of these three cases the optimization provides a significant improvement in the
running time of the queries.
ACKNOWLEDGMENTS
I would like to express my deepest appreciation to my advisor Dr. Peter Revesz
for his guidance and encouragement on this project. I also thank Dr. Jean-Camille
Birget and Dr. Spyros Magliveras for serving on my advisory committee and for their
time.
I would like to dedicate my thesis to the memory of my grandfathers Karoly
Nemeth and Sandor Salamon.
Contents
1 Datalog with Boolean Algebra Constraints
  1.1 Boolean Algebra
  1.2 Syntax of Datalog Queries with Boolean Constraints
  1.3 Quantifier elimination
      1.3.1 Elimination method for equality constraints
      1.3.2 Elimination method for precedence and monotone inequality constraints
  1.4 Naive and Semi-Naive evaluation methods
      1.4.1 Naive method
      1.4.2 Semi-Naive method
      1.4.3 EVAL function
      1.4.4 EVAL_INCR function

2 Implementation
  2.1 The implemented Boolean Algebra
      2.1.1 Sets
  2.2 Implemented quantifier elimination methods
  2.3 Hardware and Software
  2.4 Java program
      2.4.1 Packages

3 Relational Algebra Formulas
  3.1 Relational Algebra Formulas
      3.1.1 Converting a Rule
      3.1.2 Converting a Relation
  3.2 Optimization of Relational Algebra Formulas
  3.3 Algebraic Manipulation
      3.3.1 Laws
      3.3.2 Principles
      3.3.3 The Algorithm
  3.4 Calculating multiple joins

4 User's Manual
  4.1 Switches
  4.2 Commands
  4.3 Input file format

5 Examples
  5.1 Ancestors example
      5.1.1 The problem
      5.1.2 The input database
      5.1.3 The Datalog program
      5.1.4 The output database
      5.1.5 Execution complexity
      5.1.6 Run-time results
  5.2 Genome Map
      5.2.1 The problem
      5.2.2 Solution
      5.2.3 Concrete examples
  5.3 Unavoidable Sets
      5.3.1 Problem
      5.3.2 Solution
      5.3.3 Example

6 Multisets
  6.1 Introduction
  6.2 Extension of the program
  6.3 An example
      6.3.1 The problem
      6.3.2 The solution
      6.3.3 Concrete example

7 Further Work
Introduction
Although relational database systems are at this moment the most widespread
database systems, constraint databases are a promising approach to replace them.

Constraint databases can contain quantifier-free first-order formulas. With the
help of these formulas, constraint databases are able to express more than traditional
relational databases; for instance, one constraint tuple can represent an infinite number
of traditional tuples.
Constraint database systems can be categorized according to the type of constraint
they allow. Some well-known constraint types are, for instance, linear constraints,
polynomial constraints, and integer gap constraints.

In this system the constraints are Boolean constraints; hence the name of the
system is Datalog with Boolean Constraints. This system extends the possibilities of
a previous system (Datalog with Integer Set COnstraints, DISCO), which was
implemented in the department. The DISCO system allows only equality and
subset-equality constraints between Boolean algebra variables and constants, while
this system allows subset-equality, equality, and monotone inequality constraints
between Boolean algebra terms. A description of the DISCO system can be found in [2].
First there is a theoretical overview (Chapter 1), based on [6]; then a chapter
about the current implementation (Chapter 2): what kind of Boolean algebra is
implemented, and the main structure of the program.

Because one really important part of the program is related to relational algebra,
Chapter 3 describes the relational algebra formulas: how to store, convert, and
optimize these formulas.

The name of the implemented program is GreenCoat; Chapter 4 gives information
about the user interface of the program. The predecessor of this program was
implemented in the previous semester by Song Liu and me.

Chapter 5 describes some examples and tries to demonstrate the possibilities of the
system.

The system also supports the use of some multiset operators. Chapter 6 contains
more information about the multisets.
Chapter 1
Datalog with Boolean Algebra
Constraints
In this chapter I give an overview of Datalog with Boolean Algebra Constraints.
This language was introduced by Kanellakis et al. [4] and extended by Peter Z. Revesz [6].
First I present the basic definitions necessary to understand the concept; later I
define the syntax.
1.1 Boolean Algebra
The following definition is taken from [6]; more information about Boolean
algebras can be found in [1].

A Boolean algebra is a sextuple (δ, ∧, ∨, ′, 0, 1), where δ is the domain set,
∧ and ∨ are binary operators (∧ : δ × δ → δ, ∨ : δ × δ → δ), ′ is a unary operator
(′ : δ → δ), and 0 and 1 are two special elements of the domain (0 ∈ δ, 1 ∈ δ). They are
also called the zero and identity elements. Every Boolean algebra satisfies the following
axioms (∀x, y, z ∈ δ):
x ∨ y = y ∨ x                                x ∧ y = y ∧ x
x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z)              x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z)
x ∨ x′ = 1                                   x ∧ x′ = 0
x ∨ 0 = x                                    x ∧ 1 = x
0 ≠ 1
Boolean term: All the elements of δ (including 0 and 1) are Boolean terms. All
the elements of V (the set of variables), and all the elements of C, where C is the set of
constants (except 0 and 1), are Boolean terms. If t1 and t2 are both Boolean terms,
then t1 ∨ t2, t1 ∧ t2, and t1′ are also Boolean terms.

Precedence constraint: If a constraint has the form x ∧ y′ = 0
(where x, y ∈ δ ∪ V ∪ C), then we call this constraint a precedence constraint and
denote it by x ≤ y.

Monotone Boolean function: A Boolean function g is monotone if
∀xi ≤ yi (1 ≤ i ≤ n): g(x1, …, xn) ≤ g(y1, …, yn).

Monotone inequality constraint: g(x1, …, xn) ≠ 0 is a monotone inequality
constraint if g is a monotone Boolean function.

The following is a well-known fact for Boolean terms:

Proposition 1.1 Every Boolean term t can be converted to disjunctive normal
form (DNF):

t(z1, z2, …, zn) = ⋁_{ā ∈ {0,1}^n} (t(a1, a2, …, an) ∧ z1^{a1} ∧ z2^{a2} ∧ … ∧ zn^{an})

where z^0 denotes z′, and z^1 denotes z.
1.2 Syntax of Datalog Queries with Boolean Constraints
The following basic definitions can also be found in [6]. Every Datalog program
contains a set of facts (constraint tuples) and a set of rules. The facts can be seen as
special rules as well. The general form of a fact is:

R(x1, …, xk) :— f(x1, …, xk) = 0, g1(x1, …, xk) ≠ 0, …, gl(x1, …, xk) ≠ 0.

where f and the gi (1 ≤ i ≤ l) are Boolean terms.

The general form of a rule is:

R(x1, …, xk) :— R1(x1,1, …, x1,k1), …, Rn(xn,1, …, xn,kn), f(x̄) = 0,
                g1(x̄) ≠ 0, …, gl(x̄) ≠ 0.

where R, R1, …, Rn are relation symbols (not necessarily distinct symbols), each
xi,j ∈ δ ∪ V ∪ C, x̄ is the set of variables in the rule, and f and the gi (1 ≤ i ≤ l)
are Boolean terms.

It is not a real restriction that one side of each constraint is always 0, since
(f = g) ⇔ (((f ∧ g′) ∨ (f′ ∧ g)) = 0); hence we can convert every constraint to this
form. Without loss of generality we can also assume that we have only one equality
constraint, because (f1 = 0, …, fn = 0) ⇔ ((f1 ∨ … ∨ fn) = 0); therefore we can
combine several equality constraints into one constraint.
1.3 Quantifier elimination

A quantifier elimination method is an equivalence between an existentially quantified
formula and a quantifier-free formula.

Quantifier elimination is used for variables that occur on the right-hand side of a rule
but do not occur on the left-hand side.

Three elimination methods are described in [6]. The correctness proofs of
the elimination methods can also be found in that article.
1.3.1 Elimination method for equality constraints
The first elimination method [6, Lemma 2.2] (which originates with George Boole)
can be used for equality constraints:

∃x (f(x, y1, …, yk) = 0)  ⇔  f(0, y1, …, yk) ∧ f(1, y1, …, yk) = 0
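As a quick sanity check (my own illustration, not part of the thesis), the equivalence can be verified by brute force over the two-element Boolean algebra {0, 1}, here for the term f(x, y) = (x ∧ y′) ∨ (x′ ∧ y), i.e. exclusive-or:

```java
public class BooleElimination {
    // f(x, y) = (x AND NOT y) OR (NOT x AND y), i.e. exclusive-or
    static boolean f(boolean x, boolean y) { return x ^ y; }

    // left-hand side: is there some x with f(x, y) = 0 (false)?
    static boolean existsXZero(boolean y) {
        return !f(false, y) || !f(true, y);
    }

    // right-hand side after elimination: f(0, y) AND f(1, y) = 0
    static boolean eliminated(boolean y) {
        return !(f(false, y) && f(true, y));
    }

    public static void main(String[] args) {
        for (boolean y : new boolean[]{false, true})
            if (existsXZero(y) != eliminated(y))
                throw new AssertionError("equivalence fails at y = " + y);
        System.out.println("equivalence holds for both values of y");
    }
}
```

The same check works for any term f over {0, 1}; for richer Boolean algebras (such as the sets of Chapter 2) the lemma still substitutes the constants 0 and 1 for the eliminated variable.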
1.3.2 Elimination method for precedence and monotone inequality constraints
The other method [6, Lemma 2.4] can be used for precedence (x ≤ y) and monotone
inequality constraints:

∃x ( z1 ≤ x, …, zm ≤ x,
     x ≤ y1, …, x ≤ yk,
     w1 ≤ u1, …, ws ≤ us,
     g1(x, v1,1, …, v1,n1) ≠ 0,
     …,
     gl(x, vl,1, …, vl,nl) ≠ 0 )

is equivalent to:

z1 ≤ y1, …, z1 ≤ yk, …, zm ≤ y1, …, zm ≤ yk,
w1 ≤ u1, …, ws ≤ us,
g1((y1 ∧ y2 ∧ … ∧ yk), v1,1, …, v1,n1) ≠ 0,
…,
gl((y1 ∧ y2 ∧ … ∧ yk), vl,1, …, vl,nl) ≠ 0

where the zi, yi, wi, ui, and vi,j are variables or constants. Although they are not
necessarily distinct symbols, they are all different from x.
1.4 Naive and Semi-Naive evaluation methods

If the input Datalog program does not contain recursive rules, then the evaluation
of the program is simple: it is enough to evaluate every rule only once, using
a standard algorithm for ordering the rules. This algorithm is described, for
example, in [8]. If the input program contains recursive rules, then the rules have to
be evaluated more than once.
1.4.1 Naive method
Example 1.1 The example is taken from [5]. We have an input database which
contains parents-children pairs, and our goal is to find the ancestors of a specific
person. The Datalog program is the following (there is a longer description of this
example in Section 5.1):
children(P, C) :- P={"parent1", "parent2"}, C={"person", "brother"}.
children(P, C) :- P={"gparent1", "gparent2"}, C={"parent1", "uncle"}.
children(P, C) :- P={"gparent3", "gparent4"}, C={"parent2", "aunt"}.
children(P, C) :- P={"ggparent1", "ggparent2"}, C={"ggparent3"}.
AAncestor(P) :- children(P,C), {"person"} <= C.
AAncestor(P) :- children(P,C), AAncestor(P2), C /\ P2 != @.
After evaluating all of the rules once, we get the parents (parent1, parent2) of
the specific person. If we evaluate the rules again, we get the parents and the
grandparents (gparent1, gparent2, gparent3, gparent4) of the person. After the third
evaluation we get the great-grandparents (ggparent1, ggparent2) too. After the
fourth evaluation the method does not give new ancestors, hence we can stop.
The previously used algorithm is called the Naive method. The pseudo-code of the
method is the following (taken from [8]):

for i := 1 to m do
    Pi := ∅;
repeat
    for i := 1 to m do
        Qi := Pi;
    for i := 1 to m do
        Pi := EVAL(i, Q1, …, Qm);
until Pi = Qi for all i (1 ≤ i ≤ m);
Here Pi is the set of tuples of the ith relation, and Qi is the set of tuples of the ith
relation in the previous step. At the beginning we erase all tuples. Then we repeat the
steps of the algorithm until the results of the last two steps are identical. During one
step we first store the tuples (Qi := Pi), then calculate the new tuples using the tuples
calculated in the previous steps. The calculation is done by the EVAL function (Section
1.4.3). In the function, i denotes the index of the current relation, and Q1, …, Qm
denote the tuples of the relations, which can be used by EVAL.
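The loop above can be sketched in Java, the implementation language of the system. The code below is my own minimal illustration with a single recursive reachability relation, not the thesis's evaluation package:

```java
import java.util.*;

public class NaiveDemo {
    // fixed base facts: the edges of a small graph
    static final int[][] EDGES = {{1, 2}, {2, 3}, {3, 4}};

    // EVAL stand-in: recompute reach from the previous iteration's tuples
    static Set<List<Integer>> eval(Set<List<Integer>> reach) {
        Set<List<Integer>> out = new HashSet<>();
        for (int[] e : EDGES)                  // reach(a,b) :- edge(a,b).
            out.add(List.of(e[0], e[1]));
        for (int[] e : EDGES)                  // reach(a,c) :- edge(a,b), reach(b,c).
            for (List<Integer> t : reach)
                if (t.get(0) == e[1]) out.add(List.of(e[0], t.get(1)));
        return out;
    }

    // the Naive loop: iterate EVAL until two consecutive results are identical
    static Set<List<Integer>> naive() {
        Set<List<Integer>> p = new HashSet<>(), q;
        do { q = p; p = eval(q); } while (!p.equals(q));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(naive().size() + " reachable pairs");
    }
}
```

On this graph the fixpoint is reached after four calls to eval, and each call recomputes every pair found so far.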
1.4.2 Semi-Naive method
The main disadvantage of the Naive method is that it recalculates the same tuples in
every iteration. In the previous example, during the first step the algorithm calculates
the parents; during the second step the parents and the grandparents; during the
third and fourth steps the parents, grandparents, and great-grandparents. Therefore
the algorithm calculates the parents four times. If the number of steps is greater
(and in a real application it is several times greater), then this disadvantage is also
greater. The main idea of the Semi-Naive method is to omit these recalculations.

If during the calculation we use only old tuples (tuples which were calculated
before the previous step), then we only recalculate older tuples. Therefore, if we
want to calculate new tuples, we should use at least one new tuple (a tuple which was
calculated during the previous step).

Naturally the first step is an exception, because there are no new tuples before
the first step; hence the first steps of the Semi-Naive and Naive methods are identical.
The pseudo-code of the Semi-Naive evaluation (also taken from [8]):

for i := 1 to m do begin
    Pi := EVAL(i, ∅, …, ∅);
    ΔPi := Pi;
end;
repeat
    for i := 1 to m do
        Qi := ΔPi;
    for i := 1 to m do begin
        ΔPi := EVAL_INCR(i, P1, …, Pm, Q1, …, Qm);
        ΔPi := ΔPi − Pi;
    end;
    for i := 1 to m do
        Pi := Pi ∪ ΔPi;
until ΔPi = ∅ for all i (1 ≤ i ≤ m);
Here Pi denotes the tuples of the ith relation, ΔPi the new tuples in the current step,
and Qi the new tuples in the previous step. In the first step we use
the EVAL function to calculate the tuples. Then we repeat the steps of the algorithm
until there are no new tuples in the last step. In a step we first store the new tuples
(Qi := ΔPi), then calculate candidate tuples using the EVAL_INCR function (Section
1.4.4). After this we check whether the candidate tuples are really new (ΔPi := ΔPi − Pi).
At the end of the step we update the value of Pi (Pi := Pi ∪ ΔPi). The EVAL_INCR
function has more parameters than the EVAL function, because the EVAL_INCR function
needs not only all the tuples, but the new tuples as well.
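Under the same toy reachability setting as before (again my own sketch, not the thesis's code), the delta bookkeeping of this loop looks like this:

```java
import java.util.*;

public class SemiNaiveDemo {
    static final int[][] EDGES = {{1, 2}, {2, 3}, {3, 4}};

    // EVAL_INCR stand-in: apply the recursive rule with the delta tuples only
    static Set<List<Integer>> evalIncr(Set<List<Integer>> delta) {
        Set<List<Integer>> out = new HashSet<>();
        for (int[] e : EDGES)                  // reach(a,c) :- edge(a,b), ΔReach(b,c).
            for (List<Integer> t : delta)
                if (t.get(0) == e[1]) out.add(List.of(e[0], t.get(1)));
        return out;
    }

    static Set<List<Integer>> semiNaive() {
        Set<List<Integer>> all = new HashSet<>();           // P: all tuples so far
        for (int[] e : EDGES) all.add(List.of(e[0], e[1])); // first step: EVAL on empty sets
        Set<List<Integer>> delta = new HashSet<>(all);      // ΔP := P
        while (!delta.isEmpty()) {
            Set<List<Integer>> next = evalIncr(delta);
            next.removeAll(all);                // ΔP := ΔP − P (keep only new tuples)
            all.addAll(next);                   // P := P ∪ ΔP
            delta = next;
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(semiNaive().size() + " reachable pairs");
    }
}
```

The recursive rule is only ever joined against the previous step's delta, which is exactly the saving over the Naive loop.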
1.4.3 EVAL function

This function calculates new tuples from the previously known tuples. Every relation
is converted to relational algebra formulas (Section 3.1), and these formulas are
optimized (Section 3.2). By using these formulas, the EVAL function can easily
calculate the new tuples. Every formula is a tree, and the leaves of the formulas are
the relations. If we substitute the relations with the tuples of the relations and execute
the relational algebra operators in the nodes, then the root of the tree will contain
the new tuples.
1.4.4 EVAL_INCR function

This function is similar to the EVAL function. The difference is that EVAL_INCR should
use at least one new tuple during the calculation. To achieve this, we clone each
rule as many times as relations occur in the right-hand side of the rule; in the ith
clone we put a Δ before the ith relation. In Example 1.1 the clones of the rules of
relation AAncestor are:
AAncestor(P) :- Δchildren(P,C), {"person"} <= C.
AAncestor(P) :- Δchildren(P,C), AAncestor(P2), C /\ P2 != @.
AAncestor(P) :- children(P,C), ΔAAncestor(P2), C /\ P2 != @.
Instead of the original rules, we convert and optimize these cloned rules to relational
algebra formulas. With this we have the possibility to calculate EVAL_INCR.

There is another possibility to simplify these rules. In Example 1.1 the children
relation has only facts, hence Δchildren is always empty. Therefore all the rules
which contain Δchildren can be eliminated. If we eliminate these rules, we only have
one rule left:

AAncestor(P) :- children(P,C), ΔAAncestor(P2), C /\ P2 != @.

More generally, if all the rules of a relation R are facts, then we can eliminate
every rule which contains ΔR in it.

Although it is not implemented in the system, sometimes we can eliminate other
rules too.
Example 1.2 Assume that we have mother and father relations in our input
database, and our goal is to find the ancestors of a particular person. First we can
define a parent relation, and after that the solution is the same as in Example 1.1.
Here is a part of the program:
mother(M,C) :- M == {"mother"}, C == {"child1", "child2"}.
father(F,C) :- F == {"father"}, C == {"child1", "child2"}.
parent(P,C) :- mother(P,C).
parent(P,C) :- father(P,C).
...
Although the parent relation has two rules, and neither of them is a fact, we
calculate the tuples of the parent relation at the beginning of the evaluation, and
after this step no new tuples will be added to this relation. Hence there is only one
step in which Δparent is not empty. Therefore it would be possible, for some of the
relations, to calculate the number of steps after which their Δ will always be empty,
and eliminate the appropriate rules then.
Chapter 2
Implementation
2.1 The implemented Boolean Algebra
Chapter 1 gave a small overview of the basics of Datalog with Boolean Constraints.
All the definitions and lemmas work with every possible Boolean algebra. Although
most of the system likewise works with every possible Boolean algebra, only one
Boolean algebra is implemented.
2.1.1 Sets
In this Boolean algebra, δ contains sets of integers and strings. Because of
storage restrictions, δ contains only the finite sets and the sets whose complement is
finite. This is not a strict restriction: because the length of the input files is finite,
the user can define only such sets, and all the operators are closed over them. The
Boolean algebra operators are defined in the following way: ∧ is ∩, ∨ is ∪, and ′ is
set complement.

We should also define the 0 and 1 elements: 0 = ∅ = {}, and 1 = {}′ = the complete
set = the set of all integers and strings.
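A minimal sketch of such a finite/co-finite representation (my own illustration; the thesis's actual class is ElementFSet in the storage package, whose internals are not shown here) stores either the elements of a finite set or the excluded elements of a co-finite one, so intersection, union, and complement stay closed:

```java
import java.util.*;

public class FSet {
    final Set<String> elems;        // the explicitly listed elements
    final boolean complemented;     // true: the set is everything EXCEPT elems

    FSet(Set<String> e, boolean c) { elems = e; complemented = c; }

    static FSet of(String... xs) { return new FSet(new HashSet<>(Arrays.asList(xs)), false); }

    boolean contains(String x) { return elems.contains(x) ^ complemented; }

    boolean isEmpty() { return !complemented && elems.isEmpty(); }

    FSet complement() { return new FSet(elems, !complemented); }   // ' stays closed

    FSet intersect(FSet o) {                           // the ∧ operator
        if (!complemented && !o.complemented) {        // finite ∩ finite
            Set<String> i = new HashSet<>(elems); i.retainAll(o.elems);
            return new FSet(i, false);
        }
        if (complemented && o.complemented) {          // co-finite ∩ co-finite
            Set<String> u = new HashSet<>(elems); u.addAll(o.elems);
            return new FSet(u, true);
        }
        FSet fin = complemented ? o : this, cof = complemented ? this : o;
        Set<String> d = new HashSet<>(fin.elems); d.removeAll(cof.elems);
        return new FSet(d, false);                     // finite ∩ co-finite is finite
    }

    FSet union(FSet o) {                               // the ∨ operator, via De Morgan
        return complement().intersect(o.complement()).complement();
    }

    public static void main(String[] args) {
        FSet a = FSet.of("1", "person");
        System.out.println(a.union(a.complement()).contains("anything")); // 1, the complete set
    }
}
```

Here ∅ is FSet.of() and the 1 element is FSet.of().complement(), matching the definitions above.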
2.2 Implemented quantifier elimination methods

In [6] P. Revesz describes three elimination methods. One of them works only with
atomless Boolean algebras. Because the implemented Boolean algebra of Section
2.1.1 is not atomless, that method is not implemented. The other two methods
(described earlier in Sections 1.3.1 and 1.3.2) are implemented in this system. The
elimination method described in Section 1.3.2 can be used only if all the inequality
constraints are monotone. The system does not check monotonicity; it assumes that
all the inequality constraints are monotone inequality constraints.
2.3 Hardware and Software
The system is implemented in the Java language. Originally the system was implemented
under IRIX 6.2, using JDK 1.0.2 (hardware: 4-CPU SGI R10000), although it was
also tested under Windows NT (hardware: Pentium 200, Pentium 133, Pentium II 267)
using Microsoft Visual J++ 1.1. Because one of the main properties of the Java language
is portability, the system should work on most well-known systems even without
recompilation.
The parser was implemented using Java Compiler Compiler (JavaCC), Version
0.7pre3.
2.4 Java program
2.4.1 Packages
The Java language supports the use of packages (collections of similar classes). The
packages of the system and their functions can be seen in Table 2.1.

    Name             Function
    storage          Storage of Datalog programs
    relalg           Storage of Relational Algebra operators
    relalg.optimize  RA optimization methods
    elimination      Elimination methods
    evaluation       Naive & Semi-Naive evaluation
    parser           The parser
    util             Miscellaneous classes

Table 2.1: Packages of the Java program
relalg package

This package and its subpackage relalg.optimize contain classes related to relational
algebra formulas. Chapter 3 contains more information about relational algebra
formulas. The optimization methods (Section 3.2) are implemented in the
relalg.optimize package.

elimination package
Quantifier elimination methods (Section 1.3) are implemented in this package. Because
two methods are implemented, and the system does not know in advance which
one can be used, a new quantifier elimination method is implemented. This method
is only a container of other quantifier elimination methods (right now the two methods):
it tries to execute the first method and, if that is not possible, the following one,
until one of the methods is successful or none of them is.
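The container idea can be sketched like this (interface and class names are my own; the thesis does not list the real class names of the elimination package):

```java
import java.util.*;

public class ContainerDemo {
    interface EliminationMethod {
        // returns the rewritten rule, or null if the method is not applicable
        String eliminate(String rule);
    }

    static class ContainerElimination implements EliminationMethod {
        private final List<EliminationMethod> methods;
        ContainerElimination(List<EliminationMethod> ms) { methods = ms; }
        public String eliminate(String rule) {
            for (EliminationMethod m : methods) {
                String out = m.eliminate(rule);
                if (out != null) return out;   // first applicable method wins
            }
            return null;                       // no method was applicable
        }
    }

    // toy applicability tests standing in for the two real methods
    static final EliminationMethod ROUTER = new ContainerElimination(List.of(
            r -> !r.contains("!=") ? "equality:" + r : null,   // Section 1.3.1
            r -> r.contains("!=") ? "monotone:" + r : null));  // Section 1.3.2

    public static void main(String[] args) {
        System.out.println(ROUTER.eliminate("f(x,y) = 0"));
        System.out.println(ROUTER.eliminate("g(x,v) != 0"));
    }
}
```

The container itself implements the same interface as its members, so evaluation code never needs to know which concrete method handled a constraint.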
evaluation package

This package implements the generic code of the Naive (Section 1.4.1) and Semi-Naive
(Section 1.4.2) evaluation methods. The eval and eval_incr functions are also defined
in this package.
parser package

The function of this package is to parse the input files and the user commands.
Chapter 4 contains more information about the user commands and the input file
format. The Java source files in this package are created by JavaCC from a grammar
description file (.jj).
storage package

This package stores the Datalog programs. The hierarchy among the classes can be
seen in Figure 2.1. This is not a superclass-subclass hierarchy: every class shown in
the figure contains one or more instances of the classes shown below it. At the
top of this hierarchy there is the Database class, which contains our database. A
database is a set of Relations. Every relation has one or more Rules. Every rule
has a head, which is represented by a RelationTitle, and a body. A body can contain
other relation names (RelationTitles) and Constraints. A Constraint can be an
equality or an inequality constraint; every constraint has the form 'Boolean term'
= 0 or 'Boolean term' ≠ 0. A Boolean term is represented by a Term. Because every
Boolean term can be transformed to disjunctive normal form (DNF), every term is
stored as an array of basic conjunctions (Conjunction). A basic conjunction is a
conjunction of literals, which can be stored as an array of Literals. A literal can be
20
Database
Relation
Rule
Constraint
RelationTitle
Term
Conjunction
Literal
Variable
Element
Constant
Figure 2.1: The hierarchy among the classes of storage package
a variable (Variable), an element of δ (Element), or a constant (Constant). The
constants are not implemented in the current version of the system, but it is worth
mentioning the possibility of integrating constants into the system.
Element is an abstract class; it can represent the elements of all possible Boolean
algebras. It defines the necessary methods which have to be implemented to represent
a concrete Boolean algebra. ElementSet is a subclass of Element; it can store the
elements of all possible set-typed Boolean algebras. The only non-abstract subclass
of ElementSet is ElementFSet, which implements the finite/co-finite sets of Section
2.1.1. Figure 2.2 shows the superclass-subclass hierarchy among these classes. An
angled rectangle indicates that a class is not abstract, while an oval shape indicates
that the class is abstract.
Element
  ElementSet
    ElementFSet

Figure 2.2: The subclasses of Element
Chapter 3
Relational Algebra Formulas
3.1 Relational Algebra Formulas
Because relational algebra formulas are described in several textbooks, for example
[8], in this part I describe only the differences between general relational algebra
and the relational algebra formulas used in this program.

In this system there are four relational algebra operators: join (⋈), project (π),
union (∪), and select (σ). Because cross-product (×) can be seen as a special join, in
this system join represents both of them. The other main difference is that join and
union are originally binary operators, so the number of operands is always two; in
this system the number of operands can be two or more. For instance, if we want
to represent the join of four relations, we originally need three join operators, while
the new system needs only one.

The system stores a relational algebra formula as a tree. This makes it easy to change
the formula and to represent an operation that has more than two operands. It is
also very useful when we want to visualize a formula.
The input from the user contains Datalog rules, hence it is necessary to convert
these rules to relational algebra formulas. A conversion method is described in
Ullman's book [8, Chapter 3]. However, that method works with pure Datalog rules,
hence it needs to be extended.
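The tree representation with n-ary operator nodes can be sketched as follows (a minimal illustration with my own class names; the thesis's relalg package is not reproduced here):

```java
import java.util.*;

public class RaTreeDemo {
    static abstract class Node { }

    static class Relation extends Node {
        final String name; final List<String> attrs;
        Relation(String name, String... attrs) { this.name = name; this.attrs = List.of(attrs); }
        public String toString() { return name + "(" + String.join(",", attrs) + ")"; }
    }

    static class Join extends Node {               // n-ary: two OR MORE operands
        final List<Node> operands;
        Join(Node... ops) {
            if (ops.length < 2) throw new IllegalArgumentException("join needs >= 2 operands");
            operands = List.of(ops);
        }
        public String toString() { return "JOIN" + operands; }
    }

    static class Select extends Node {
        final String condition; final Node child;
        Select(String condition, Node child) { this.condition = condition; this.child = child; }
        public String toString() { return "SELECT[" + condition + "](" + child + ")"; }
    }

    public static void main(String[] args) {
        // the join of three relations is a single node, not two nested binary joins
        Node f = new Select("z != {1,2,3}",
                new Join(new Relation("A", "x", "z"),
                         new Relation("B", "z", "y"),
                         new Relation("D", "y")));
        System.out.println(f);
    }
}
```

Because an operator node holds a list of children, adding one more operand to a join is a list append rather than a tree restructuring, which is what makes the optimization steps of Section 3.3 convenient to express.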
3.1.1 Converting a Rule
A general Datalog with Boolean algebra rule has the following form:

R(X1, …, Xk) :— Q1(Y1,1, …, Y1,l1), …, Qm(Ym,1, …, Ym,lm), σ1, …, σn

where m ≥ 0 is the number of relations on the right-hand side and n ≥ 0 is the number
of selections on the right-hand side. Although either m or n can be zero, they cannot
be zero at the same time (n + m > 0).

If m > 1, then first we need to join the relations on the right-hand side. After that
we can issue the selections one after the other. (It would be possible to combine the
selections into one selection, but the optimization method works better if we do not
combine them.)

If {Y1,1, …, Y1,l1, …, Ym,1, …, Ym,lm} ⊆ {X1, …, Xk}, then the right-hand side
contains only variables which can be found in the left-hand side, therefore it is not
necessary to use projection. Otherwise we need a projection (π_{X1,…,Xk}) as well.
3.1.2 Converting a Relation
First the algorithm converts all the rules of the relation. If the number of rules
is greater than one, then the algorithm connects the resulting formulas with a union.

Example 3.1 If relation R has the following two Datalog rules:

R(x,y) :- C(x,y).
R(x,y) :- A(x,z), B(z,y), D(y), z != {1,2,3}.
24
U
Π X,Y
C( X, Y )
σ
Z != { 1, 2, 3 }
A( X, Z )
B( Z, Y )
D( Y )
Figure 3.1: The formula of Relation R
then the algorithm rst converts the rst rule, and we get the formula: C (x; y). Next
the second rule is converted yielding: x;y (z!= 1;2;3 (A(x; z) 1 B (z; y) 1 D(y))),
and nally the two formulas are joined together with a union operator. C (x; y) [
x;y (z!= 1;2;3 (A(x; z) 1 B (z; y) 1 D(y))) (see Figure 3.1).
f
f
g
g
3.2 Optimization of Relational Algebra Formulas
Although after converting Datalog rules to relational algebra formulas we are already
able to use the formulas, it is better to optimize them first. Using optimization
methods, we calculate a new formula from our original formula. The new formula
should be equivalent to the original one (if we evaluate it, the result should be the
same), and it should be evaluated faster. No algorithm can improve all formulas;
usually the optimization algorithms improve the majority of the formulas, and leave
unaltered or even worsen some formulas.

There are many optimization methods. The algebraic manipulation method used
in our system is described in the next section.
3.3 Algebraic Manipulation
This method is also described in [8]. In this method we use certain equations
between formulas, also called laws. First we give a list of these laws, and then an
algorithm which can change the original formula using these laws. After the changes,
the new formula can usually be evaluated faster than the original one.

We have to optimize only a subset of the possible formulas, because our formulas
originate from Datalog programs.
3.3.1 Laws
We use the following laws from [8]:

1. Commutative law for joins:
   R1 ⋈ R2 ≡ R2 ⋈ R1

2. Associative law for joins:
   (R1 ⋈ R2) ⋈ R3 ≡ R1 ⋈ (R2 ⋈ R3)

3. Cascade of projections:
   π_{A1,…,Ak}(π_{B1,…,Bl}(R)) ≡ π_{A1,…,Ak}(R)   if {A1,…,Ak} ⊆ {B1,…,Bl}

4. Cascade of selections:
   σ_{F1}(σ_{F2}(R)) ≡ σ_{F1 ∧ F2}(R) ≡ σ_{F2}(σ_{F1}(R))

5. Commuting selections and projections:
   If the set of attributes in condition F is a subset of {A1,…,Ak}:
   π_{A1,…,Ak}(σ_F(R)) ≡ σ_F(π_{A1,…,Ak}(R))
   If the set of attributes in F is {Ai1,…,Aim} ∪ {B1,…,Bl}:
   π_{A1,…,Ak}(σ_F(R)) ≡ π_{A1,…,Ak}(σ_F(π_{A1,…,Ak,B1,…,Bl}(R)))

6. Commuting selection with join:
   If all the attributes of F are attributes of R1:
   σ_F(R1 ⋈ R2) ≡ σ_F(R1) ⋈ R2
   If F = F1 ∧ F2, the attributes of F1 are only in R1, and the attributes of F2
   are only in R2:
   σ_F(R1 ⋈ R2) ≡ σ_{F1}(R1) ⋈ σ_{F2}(R2)
   If F = F1 ∧ F2, the attributes of F1 are only in R1, but the attributes of F2
   are in both R1 and R2:
   σ_F(R1 ⋈ R2) ≡ σ_{F2}(σ_{F1}(R1) ⋈ R2)

7. Commuting a projection with a join:
   If {A1,…,Ak} = {B1,…,Bl} ∪ {C1,…,Cm}, where the Bi are attributes of R1
   and the Ci are attributes of R2:
   π_{A1,…,Ak}(R1 ⋈ R2) ≡ π_{B1,…,Bl}(R1) ⋈ π_{C1,…,Cm}(R2)
3.3.2 Principles
There are three main principles of algebraic query optimization:

1. Perform selections as early as possible.
2. Perform projections as early as possible.
3. Combine sequences of unary operations.
3.3.3 The Algorithm
The steps of the algorithm:

1. For each selection, use laws (4)-(6) to move the selection down the tree.
2. Move projections down using laws (3) and (7); if possible, delete projections.
3. Use law (4) to combine cascades of selections into one selection.
Move selections down
In this step our goal is to move selections as far down as possible. Originally we have a
set of selections (S1, S2, …, Sk) and a set of relations (R1, R2, …, Rl). In the original
execution order, we connect the relations with a join operator (if there is more than
one relation) and then calculate the selections one after the other.

During the optimization, we first check which relations and selections have common
variables. Let Vi be the set of relations which have common variables with Si.
More formally:

Vi = {Rn | (variables in Rn) ∩ (variables in Si) ≠ ∅}

A selection Si can be executed once the join of all the relations mentioned in Vi has
been calculated. The join can contain other relations as well.

If Vi is empty or contains all the relations, then the selection is executed only after
we join all the relations; in that case we should first find the place of the other
selections. If ∃i ∀j: Vi ⊆ Vj, then Si will be executed before all the other selections.
If such an index i does not exist, then the program chooses any index which has a
small Vi. Next we modify the sets Vj (j ≠ i):

Vj := (Vj \ Vi) ∪ {Si}   if Vi ∩ Vj ≠ ∅
Vj := Vj                 if Vi ∩ Vj = ∅

As we can see, after this step Vj can contain not only relations but selections as well.

The previously described procedure is one step of the optimization. This step is
repeated until all the selections have been placed. If there are some relations which are
not used during the optimization (no selection contains any variable of the relation),
then a final join connects these relations and the selections.
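One step of this selection-placement procedure can be sketched in Python. This is an illustrative model only: relations and selections are reduced to names and variable-dependency sets, and `push_down_step` is a hypothetical name, not a function of the actual system.

```python
def push_down_step(selections, V):
    """Choose the next selection to place and update the V_j sets.

    selections: names of the selections not yet placed.
    V: maps each selection name to the set of relations (and
       already placed selections) whose join it depends on.
    """
    # Prefer an S_i whose V_i is a subset of every other V_j;
    # otherwise fall back to a selection with a smallest V_i.
    chosen = None
    for s in selections:
        if all(V[s] <= V[t] for t in selections):
            chosen = s
            break
    if chosen is None:
        chosen = min(selections, key=lambda s: len(V[s]))
    # If V_j intersects V_i, the shared relations are replaced
    # by the placed selection itself; otherwise V_j is unchanged.
    for t in selections:
        if t != chosen and V[t] & V[chosen]:
            V[t] = (V[t] - V[chosen]) | {chosen}
    selections.remove(chosen)
    return chosen
```

Repeating this step on the V_i sets of Example 3.2 reproduces the placement order S2, S3, S4, S1 derived there.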
[Figure 3.2: The formula before the optimization — the projection Π_{X,Y} over the cascade of selections σ_{X∩Y≠∅}, σ_{Y∩{4}=∅}, σ_{X∩{6,8}=∅}, σ_{X∩{2,4}=∅} over the join of A(X,Z), B(V,Y), C(X), D(V,Y).]
Example 3.2 Assume we have the following Datalog program:

R(X,Y) :- A(X,Z), B(V,Y), C(X), D(V,Y), X ∩ {2,4} == ∅,
          X ∩ {6,8} == ∅, Y ∩ {4} == ∅, X ∩ Y != ∅.

The corresponding relational algebra formula is (see Figure 3.2):

π_{x,y}(σ_{x∩y≠∅}(σ_{y∩{4}=∅}(σ_{x∩{6,8}=∅}(σ_{x∩{2,4}=∅}(A(x,z) ⋈ B(v,y) ⋈ C(x) ⋈ D(v,y))))))

The selections σ_{y∩{4}=∅}, σ_{x∩{6,8}=∅}, and σ_{x∩{2,4}=∅} each contain only one variable: y, x, and x respectively. The variables in σ_{x∩y≠∅} are x and y. σ_{x∩y≠∅} has common variables
with relations A, B, C, and D, therefore V1 = {A,B,C,D}. Similarly, V2 = {B,D}, V3 = {A,C}, V4 = {A,C}.

There exists no Vi which is a subset of all the other Vj's, hence the algorithm chooses one selection which has the smallest Vi. In this example the algorithm can choose V2, V3, or V4; assume that it chooses V2. As we issue S2, we get the following formula: σ_{y∩{4}=∅}(B(v,y) ⋈ D(v,y)).

The new values of V1, V3, and V4 are: V1 = {A,B,C,D} \ {B,D} ∪ {S2} = {A,C,S2}; V3 and V4 are unchanged because V3 and V4 have no common variables with S2 (V3 = {A,C}, V4 = {A,C}).

Now V3 ⊆ V1 and V3 ⊆ V4, hence we can issue S3 and get σ_{x∩{6,8}=∅}(C(x) ⋈ A(x,z)). The new values of V1 and V4 are: V1 = {A,C,S2} \ {A,C} ∪ {S3} = {S2,S3}, V4 = {A,C} \ {A,C} ∪ {S3} = {S3}.

Now V4 ⊆ V1, so the algorithm can issue S4, and we get σ_{x∩{2,4}=∅}(σ_{x∩{6,8}=∅}(C(x) ⋈ A(x,z))).

Finally we issue S1, and get π_{x,y}(σ_{x∩y≠∅}(σ_{x∩{2,4}=∅}(σ_{x∩{6,8}=∅}(C(x) ⋈ A(x,z))) ⋈ σ_{y∩{4}=∅}(B(v,y) ⋈ D(v,y)))) (Figure 3.3).
Moving Projections Down
In this step our goal is to move the projections as far down as possible. Originally we have a projection, below it possibly some selections, and finally a join (Figure 3.4). If it is possible, we evaluate the projection before the join. Usually this is not possible, but we can at least eliminate some of the variables before the join.

Denote by PV the set of variables in the projection, by SV the set of variables in the selections, and by V[i] the variables of the ith branch of the join. All the
[Figure 3.3: The formula after the first step of optimization — Π_{X,Y} and σ_{X∩Y≠∅} at the top, over the join of σ_{X∩{2,4}=∅}(σ_{X∩{6,8}=∅}(C(X) ⋈ A(X,Z))) and σ_{Y∩{4}=∅}(B(V,Y) ⋈ D(V,Y)).]
variables in PV or SV cannot be eliminated before the selections.

If a variable occurs only in one branch, and the variable is not in PV or in SV, then this variable can easily be eliminated before the join. For instance, if our original formula is π_{(x)}(A(x) ⋈ B(x,y)) as shown in Figure 3.5, then y occurs only in the second branch, therefore we can eliminate y before the join. The optimized formula is A(x) ⋈ π_{(x)}(B(x,y)), as shown in Figure 3.6.

If there exists no variable which occurs only in one branch, then the algorithm chooses a variable which occurs in the fewest branches. If all the variables occur in all the branches, then the algorithm cannot eliminate any variables. If more than
[Figure 3.4: The variables of the formula — a projection over PV, a cascade of selections over SV, and a join whose branches B1, ..., Bl have variable sets V1, ..., Vl.]
one variable occurs in the same branches, then the algorithm eliminates these variables at the same time.

After this step the branches of the join have changed, therefore the algorithm should recalculate the values of the V[i]'s. This recalculation is similar to the recalculation during the first step (moving selections down) of the algorithm. The algorithm runs until we cannot find any more eliminable variables.

Because one branch of the join can contain other join operators, after the algorithm moves a projection below a join, we should check whether it is possible to move the projection below the other joins as well. To achieve this the algorithm calls itself recursively. The new instance of the algorithm works only on a subtree (subformula) of the original tree (formula).
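The local-variable elimination described above can be sketched as follows. This is an illustrative model only, in which each join branch is represented by its variable set and `needed` stands for PV ∪ SV:

```python
def project_branches(branches, needed):
    """Project each join branch down to the variables that are
    needed above the join or shared with another branch; all
    other variables are local and are eliminated before the join.

    branches: one set of variables per join branch.
    needed: the variables required above the join (PV union SV).
    """
    result = []
    for i, b in enumerate(branches):
        # Variables occurring in any other branch of this join.
        others = set().union(*(o for j, o in enumerate(branches) if j != i))
        # A variable survives if it is needed above the join or
        # occurs in another branch (i.e. it is a join variable).
        result.append(b & (needed | others))
    return result
```

For the upper join of Example 3.3, project_branches([{X,Z}, {V,Y}], {X,Y}) eliminates the local variables Z and V, leaving {X} and {Y} — exactly the projections Π_X and Π_Y issued below the join.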
Example 3.3 After the rst step of the optimization we got the formula shown in
[Figure 3.5: Optimization input — Π_X over the join of A(X) and B(X,Y).]
[Figure 3.6: Optimization output — A(X) joined with Π_X(B(X,Y)).]
Figure 3.3. The variables of the join are X and Y (PV = {X,Y}); the variables in the selection are also X and Y (SV = {X,Y}). PV ∪ SV = {X,Y}, therefore we cannot eliminate X and Y. The upper join has two branches. The variables in the first branch are X and Z, therefore V[1] = {X,Z}. Similarly V[2] = {V,Y}. Z and V are local variables, because Z occurs only in the first and V occurs only in the second branch. Therefore we can eliminate Z from the first branch before the join, and V from the second one. The variables in the first branch are {X,Z}; if we eliminate Z, only X remains, hence we should issue a π_{(X)} below the join. In the same way we should issue π_{(Y)} below the second branch. Our new formula is shown in Figure 3.7.

Finally the algorithm tries to move the projections below the other joins as well. In the second branch, Y cannot be eliminated, because Y is a variable in the projection (and in the selection as well). V cannot be eliminated because it occurs in all the branches. In the first branch X is ineliminable, because X is a variable in the projection. Z is a local variable of the second branch of the inner join, hence it can be eliminated before that join. Therefore the algorithm can move the projection below the join, and
[Figure 3.7: The formula during the second step of optimization — σ_{X∩Y≠∅} over the join of Π_X(σ_{X∩{2,4}=∅}(σ_{X∩{6,8}=∅}(C(X) ⋈ A(X,Z)))) and Π_Y(σ_{Y∩{4}=∅}(B(V,Y) ⋈ D(V,Y))).]
we get our new formula, which is shown in Figure 3.8.
Connecting Selections
This is the final and the easiest step of the optimization. In this step, the algorithm combines cascades of selections into one selection. The algorithm simply checks every edge in the tree, and if both vertices of the edge are selections, then it connects the two vertices and erases the edge.

Example 3.4 In our example (Figure 3.8) there is only one pair of selections which can be connected: σ_{X∩{2,4}=∅} and σ_{X∩{6,8}=∅}. After this step, we get our final formula, which can be seen in Figure 3.9.
[Figure 3.8: The formula after the second step of optimization — the projection Π_X has been moved below the inner join, onto the A(X,Z) branch.]
3.4 Calculating multiple joins
When we have to join more than two relations, the simplest way to join them is to choose one tuple from each relation, create a new tuple, and write the result to the output relation. For instance, suppose we have four relations A, B, C, D, and we want to calculate

A(X,Y) ⋈ B(Y,Z) ⋈ C(Z,V) ⋈ D(V,W)

Assume that there are 100 tuples in each relation. In this case we have to create 100^4 = 10^8 tuples.
[Figure 3.9: The formula after the optimization — the cascaded selections σ_{X∩{2,4}=∅} and σ_{X∩{6,8}=∅} have been combined into a single selection.]
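The edge-contraction in this final step can be sketched directly on a tree of operator nodes. The minimal `Node` type below is hypothetical, not the system's actual data structure:

```python
class Node:
    def __init__(self, kind, conditions=None, children=None):
        self.kind = kind                    # 'select', 'join', 'relation'
        self.conditions = conditions or []  # selection conditions, if any
        self.children = children or []

def merge_selections(node):
    """Combine cascades of selections into single selection nodes."""
    for i, child in enumerate(node.children):
        merge_selections(child)
        # A selection sitting directly above another selection:
        # pull the child's conditions up and splice the child out.
        if node.kind == 'select' and child.kind == 'select':
            node.conditions += child.conditions
            node.children[i:i + 1] = child.children
    return node
```

On the formula of Figure 3.8 this merges σ_{X∩{2,4}=∅} and σ_{X∩{6,8}=∅} into the single combined selection of Figure 3.9.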
A better solution is to join two relations, then join the third to the result of the previous join, and finally join the fourth to the last result. For instance, if we calculate A(X,Y) ⋈ B(Y,Z), then the size of the result is usually less than 100^2 = 10^4. Let us assume that the results always contain 100 tuples. In this case we have to calculate 3 · 100^2 = 3 · 10^4 tuples, which is less than 1 percent of the original calculation.

Of course we do not know the size of the result relation before we create it. If the two relations have no common variables, then the join is a cross-product, so the size of the result relation is the product of the sizes of the original relations. In other cases we only know that the size of the result is not greater than the product of the sizes of the original relations.

If the algorithm is able to estimate the size of the resulting relation, then it is possible to join the relations in a good order, and thereby decrease the cost of a multiple join. The good order can be determined using dynamic programming. Because the algebraic manipulation considerably decreases the occurrence of multiple joins, and makes it more difficult to estimate the size of the resulting relation, this system does not change the execution order of the joins, but simply executes the joins from left to right.
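The left-to-right strategy is a simple fold over the list of relations. A sketch follows, with relations modeled as lists of variable-to-value dicts; the names `nested_loop_join` and `multi_join` are illustrative, not taken from the system:

```python
def nested_loop_join(r1, r2):
    """Join two relations given as lists of {variable: value} dicts."""
    out = []
    for t1 in r1:
        for t2 in r2:
            # Tuples combine when they agree on all shared variables;
            # with no shared variables this degenerates to a cross-product.
            if all(t1[v] == t2[v] for v in t1.keys() & t2.keys()):
                out.append({**t1, **t2})
    return out

def multi_join(relations):
    """Join the relations strictly left to right, as the system does."""
    result = relations[0]
    for r in relations[1:]:
        result = nested_loop_join(result, r)
    return result
```

Each pairwise join materializes its intermediate result before the next join starts, which is the source of the 3 · 10^4 versus 10^8 difference computed above.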
Chapter 4
User's Manual
The user interface of the program is a character based interface. (One student in the department is working on a Graphical User Interface.)

After starting the program, the system waits for a user command. After the user enters the command, the system waits for the next command. This process is finished when the user exits from the system.

With the help of some commands the user can change the values of the switches; with the other commands the user can ask the program to give information or execute a process. First I describe the switches and later the other commands.

The user is also able to give a new rule or fact to the system. The general form of the rules and facts is described in Section 1.2. There are some differences:

The user can enter more than one equality constraint.

The user can enter constraints in which neither side of the constraint is zero.

The user can use not only ∪ and ∩, but complement (') as well.

Because the keyboard does not contain symbols like ∪, ∩, or the other operators,
Mathematical      In the system
∪                 \/, and
∩                 /\, or
R' (complement)   not( R ), R'
⊆                 <=, =<, [=
⊇                 =>, >=, =]
=                 =, ==
≠                 !=, <>
∅                 @, ZERO

Table 4.1: Operators

the user should use other symbols instead. Table 4.1 shows the symbols which can be used in the system. Usually more than one symbol can be used; they are separated by commas.
4.1 Switches
Most of the switches have two possible values: they are either turned on or off. All the possible values should be typed in lowercase; in the examples the capital letters show the default value of the switch.
time on|OFF.
If the switch is turned on, then after each evaluation the program displays the time used during the evaluation. It displays not only the total time, but some partial times (for instance, the time used by the different relational operators) as well.

All the time values are in seconds, and they are real (wall-clock) seconds, not CPU seconds.
tempfile on|OFF.
If the switch is turned on, then after each step of the Naive (Section 1.4.1) or Semi-Naive (Section 1.4.2) evaluation, the program prints the derived tuples into a temporary file ('temp.bld'). This file is not overwritten by the following steps, hence the user can analyse the results after each step. Although the file contains the results of several steps, each result has the same format as the input file, therefore it is possible to use different parts of this file as an input file and continue an interrupted execution.
trace on|OFF.
If the switch is turned on, then the program displays the internal representation of a rule after a new rule is added to the system.
optimize ON|off.
If the switch is turned on, then the program uses relational algebra optimization (Section 3.2); if not, then the program uses the original formula. The value of the switch should be changed before loading the input file in order to take effect.
method old|naive|SEMINAIVE.
With this switch the user can choose between the implemented evaluation methods. The Naive (Section 1.4.1) and Semi-Naive (Section 1.4.2) methods work with both non-recursive and recursive queries, but the 'old' method works only with non-recursive queries.
4.2 Commands
exit|bye.
The command exits the program.
load|consult 'filename'.
The command loads the file with the given filename, and reads the queries from the file. The format of the input file is described in Section 4.3.
formula "R" [onto 'filename'].
The command displays the relational algebra formula of "R" which is used by the Naive evaluation (Section 1.4.1). If the user specifies a file name, then the formula will be printed to the file; otherwise it will be printed to the screen.
formuladelta "R" [onto 'filename'].
The command displays the relational algebra formula of "R" which is used by the Semi-Naive evaluation (Section 1.4.2). If the user specifies a file name, then the formula will be printed to the file; otherwise it will be printed to the screen.
display ["R"] [onto 'filename'].
If the user does not specify a relation name, then the program prints the names and the arities of all relations.

If the user specifies a relation name, then the program prints the rules of the relation.

Similarly to the formula and formuladelta commands, the user can name a file; otherwise the result of the command will be printed to the screen.
displaydf [onto 'filename'].
The command prints all the derived facts in the database.

Similarly to the previous commands, the user can name a file; otherwise the result of the command will be printed to the screen.
memory.
The command displays the total and the free memory used by the system.
clear.
The command erases all the relations and rules from the database.
R(E1, ..., En)?

With this command the user can ask for the derived facts of a relation. Each Ei can be either a variable or a constant. One variable can occur more than once. The first time this command is used, the system calculates the derived facts using one of the evaluation methods. Later the system uses the derived facts stored in memory, hence the answer will be faster.

Note: This is the only command which ends with a '?' instead of a '.'.
4.3 Input file format
The input file starts with a line containing 'begin' and ends with a line containing 'end'. Between these lines are the rule definitions.
The input file may contain empty lines, one-line comments (after // as in C++ or Java), and multi-line comments (between /* and */ as in C, C++, or Java).
If a line would be too long, it can be split into several lines; each line but the last should end with a '\' character. In the examples of this thesis we do not use the '\' character.
Chapter 5
Examples
5.1 Ancestors example
5.1.1 The problem
The ancestors example appeared in [5]. In the input database we store information about parents and their children. Our goal is to calculate all the ancestors of one particular person.
5.1.2 The input database
Pure Datalog
In pure Datalog one of the easiest ways is to use the children relation. One tuple contains one parent and one child. For instance, if a husband and wife have three children, child1, child2, child3, then our input database is the following:
children( husband, child1 ).
children( husband, child2 ).
children( husband, child3 ).
children( wife, child1 ).
children( wife, child2 ).
children( wife, child3 ).
New system
In the new system, we need only one tuple to represent the previous example:
children( { husband, wife }, {child1, child2, child3} ).
Comparison
The previous example shows that in the new system we need fewer tuples to store the same data. It can also be seen that the tuples in the new system are more complex than the tuples in pure Datalog. In the example we needed only one tuple instead of six. More generally, if a couple has k children, then pure Datalog needs 2k tuples, in contrast to the new system, which needs only one.
5.1.3 The Datalog program
Pure Datalog
AAncestor(P) :- children(P, {"person"}).
AAncestor(P) :- children(P, C), AAncestor(C).
New system
AAncestor(P) :- children(P,C), {"person"} <= C.
AAncestor(P) :- children(P,C), AAncestor(P2), C /\ P2 != @.
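The semantics of these two set-based rules can be mirrored in Python with frozensets, using a Semi-Naive style loop that joins only against newly derived tuples. This is a sketch of the semantics, not of the system's implementation:

```python
def ancestors(children, person):
    """children: list of (parents, kids) pairs of frozensets.
    Returns the derived AAncestor facts (parent sets)."""
    # Rule 1: parent sets whose kids include person.
    derived = {p for p, kids in children if person in kids}
    delta = set(derived)
    # Rule 2: parent sets whose kids intersect a newly derived
    # parent set (C /\ P2 != @), iterated to a fixpoint.
    while delta:
        delta = {p for p, kids in children
                 for p2 in delta if kids & p2} - derived
        derived |= delta
    return derived
```

For the single-tuple database above, the first step finds {husband, wife}, and each later iteration climbs one generation of the family tree.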
Comparison
The program contains two rules in both systems. The rules are very similar, although the rules of the new system are slightly more complex.
5.1.4 The output database
Pure Datalog
In the pure Datalog every tuple in the output relation represents one of the ancestors.
New system
In the new system every tuple in the output relation represents two ancestors.
Comparison
The output of the new system contains half as many tuples as that of pure Datalog.
5.1.5 Execution complexity
In this comparison I assume that both systems use the Semi-Naive or the Naive evaluation.
Pure Datalog
At the first step, the system finds the parents of person using the first rule. The system should check all the children tuples, and find those in which person is the child.

Later we need to use the second rule, hence we need to evaluate a join.
New system
At the first step, the system finds the parents of person using the first rule. The system should check all the children tuples, and find those in which person is one of the children.

Later we need to use the second rule, hence we need to evaluate a join and a selection.
Comparison
The first step is almost identical. Because the number of tuples is smaller in the new system, the program should check fewer tuples. On the other hand, the tuples are more complex in the new system, hence checking one tuple is more time-consuming. If we analyze one step, then we can find real advantages.

If our goal is to find the ancestors of child3, then in the original system we should check all six tuples. Checking here means evaluating child1 == child3, child2 == child3, child3 == child3. We should evaluate each of them twice, because each child occurs in two tuples.

In the new system, we should check only one tuple. Checking here means that we should calculate the intersection of {child3} and {child1, child2, child3}. That means we should evaluate child1 == child3, child2 == child3, child3 == child3. In this case we need to evaluate these only once. Finally we should check whether the intersection is empty or not.

In the later steps, both systems evaluate a join; the second one also evaluates a selection. If in the pure Datalog system we denote the number of tuples in relations children and AAncestor by Cp and Ap respectively, then the number of tuples in the
[Figure 5.1: A family tree — the nodes include person together with wife, son, daughter, son-in-law, daughter-in-law, grandson, granddaughter, parents, grandparents, great-grandparents, a brother, aunts, uncles, and cousins.]
new system are Cn = Cp/2 and An = Ap/(2k), where k is the average number of children. Therefore the number of basic operations during the join is Ap · Cp in the old system and An · Cn = (Ap · Cp)/(4k) in the new system.

If we are using Semi-Naive evaluation (Section 1.4.2), then the algorithm uses only the new tuples, hence the number of basic operations is Ap · Cp in the old system and An · Cn = (Ap · Cp)/(4k) in the new system, now counted over the new tuples only.

Similarly to the first step, one basic step is more complex in the second system, but there is still an advantage to using the new system. Usually the cost of the selection is much less than the cost of the join, hence it is not a problem that in the new system we need a selection as well.
5.1.6 Run-time results
Figure 5.1 shows a family tree which was used for test purposes. The squared rectangles contain the names of the men; the oval shaped rectangles contain the names of the women. The program makes no difference between the two sexes; the shapes only help in understanding the family tree.

Table 5.1 shows the running times. During the evaluation, the program calculated not only the ancestors of person, but the ancestors of everybody. Because the optimization of relational algebra formulas does not change the formulas in this example, the optimization has no effect on the running time (the small differences are only because of the inaccuracy of time measurement). As can be seen in the table, the Semi-Naive evaluation is approximately eight times faster than the Naive evaluation, hence the Semi-Naive evaluation is a great improvement.
5.2 Genome Map
5.2.1 The problem
The following genome map problem is described in [7].
Deoxyribonucleic acid (DNA) is a sequence of nucleotides. There are four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T).

Random substrings of a given DNA are called clones. Clones may overlap each other. It is possible to cut a DNA string into clones with so-called restriction enzymes. After cutting, we lose all information about the order of the clones. Each clone can be analysed further: with various enzymes the clones can be digested, and we can measure the fragments after the digestion. To eliminate measurement errors we can round the lengths of the fragments.

As input we have a set of clones (c1, ..., cn) of a DNA string, and the lengths of the fragments of each clone.

The goal is to find the original order of the clones. This problem is NP-complete. However, there exist different heuristics which make it possible to solve the problem.

In this example we have an order of clones, and the task is to decide whether it is a possible order of the clones or not.
5.2.2 Solution
The idea for the algorithm is described in [7].
Further restrictions
To apply this solution we need some further restrictions on the input database:

No clone contains any other clone.

No clone contains two different fragments with the same length. (It is still possible that two different clones have different fragments with the same length.)

There exists k such that each fragment is contained in at most k clones.

If (c1, ..., cn) is the correct order of the clones, then ∀i (1 ≤ i < n): ci overlaps ci+1.
Automaton
Because every fragment is contained in at most k clones, it is enough to analyze k+1 clones at the same time. We call k+1 adjacent clones a window. At the beginning the window contains the first k+1 clones, but during the solution this window will shift right. We denote the clones in the window A1, ..., Ak+1.

During the solution we will change the values of A1, ..., Ak+1. The values change if we shift the window, or if we pick fragments from the clones. We pick fragments from several clones (A1, ..., Al, where 1 ≤ l ≤ k) at the same time. If a clone must contain the fragments that we pick next, then we call the clone active. If Aj is active, then ∀i (i < j): Ai is also active, hence one number is enough to store the set of active clones.

We create a non-deterministic automaton to solve the problem. The automaton contains k+2 states, where S0 is the initial state, H is the halt state, and Si is the state which represents that i clones are active.

Now we need to define the transitions of the automaton.

If Ai ⊆ Ai+1, then we cannot pick a fragment from the first i clones which is not in Ai+1, therefore Ai+1 can be declared an active clone too.

If A1 = ∅, then we can shift the window right.

If there are fragments which are in the active clones and not in the first non-active clone, then we can pick these fragments.

Figure 5.2 shows the automaton for k = 5. This automaton is taken from [7]. Because this system supports more Boolean operators than the DISCO system which was used in [7], the description of the edges of this automaton is simpler than in [7].
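Two of the edge operations have a direct set reading, sketched below: picking the fragment set J from a clone (the pick(J, A, B) rule of the input file forces B = A \ J with J ⊆ A), and shifting the window when the first clone is exhausted. Illustrative Python only, not the system's code:

```python
def pick(J, A):
    """Solve pick(J, A, B), i.e. A == B | J and B & J == set():
    this forces B = A - J and requires J <= A."""
    assert J <= A, "can only pick fragments that occur in the clone"
    return A - J

def shift_window(window, next_clone):
    """Edge condition A1 == set(): drop the exhausted first clone
    and let the next clone enter the window on the right."""
    assert window[0] == set()
    return window[1:] + [next_clone]
```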
[Figure 5.2: The non-deterministic automaton for k = 5 — states S0 ("init window"), S1, ..., S5, and the halt state H ("window empty"). The edges carry pick conditions of the form J == (A1 ∩ ... ∩ Ai) \ Ai+1 with updates Aj = Aj \ J, activation conditions Ai ⊆ Ai+1, and window-shift conditions A1 == ∅.]
[Figure 5.3: The clones of the smaller example — the clones C1–C7 in their correct order C3, C4, C2, C5, C1, C6, C7, over the fragment lengths 15, 10, 5, 8, 20, 25, 30, 35, 15, 5, 10.]
5.2.3 Concrete examples
Example with input size n=11, k=3
This is a small example with k = 3. Figure 5.3 shows the clones with their fragments and also shows the correct order of the clones.
To represent this example we need the following part of the input file:

clone(N,X) :- N == {1},  X == {25,30,35,15}.
clone(N,X) :- N == {2},  X == {5,8,20,25}.
clone(N,X) :- N == {3},  X == {15,10,5}.
clone(N,X) :- N == {4},  X == {10,5,8,20}.
clone(N,X) :- N == {5},  X == {20,25,30,35}.
clone(N,X) :- N == {6},  X == {35,15,5}.
clone(N,X) :- N == {7},  X == {15,5,10}.
clone(N,X) :- N == {99}, X == @.

firstClone(N) :- N == {3}.

nextClone(N1,N2) :- N1 == {3},  N2 == {4}.
nextClone(N1,N2) :- N1 == {4},  N2 == {2}.
nextClone(N1,N2) :- N1 == {2},  N2 == {5}.
nextClone(N1,N2) :- N1 == {5},  N2 == {1}.
nextClone(N1,N2) :- N1 == {1},  N2 == {6}.
nextClone(N1,N2) :- N1 == {6},  N2 == {7}.
nextClone(N1,N2) :- N1 == {7},  N2 == {99}.
nextClone(N1,N2) :- N1 == {99}, N2 == {99}.
To implement the automaton we need the second part of the input file:
pick(J, A, B) :- A = B \/ J, B /\ J ==@.
S1(L, A1, A2, A3, A4) :- L == @, firstClone(s1), clone(s1, A1),
nextClone(s1,s2), clone(s2, A2),
nextClone(s2,s3), clone(s3, A3),
nextClone(s3,s4), clone(s4, A4).
Problem                      Naive                        Semi-Naive
                             Without opt.   With opt.     Without opt.   With opt.
Genome map (n = 11, k = 3)   1810           1146          103            98
Genome map (n = 16, k = 5)   —              —             1008           932
Ancestor                     59.4           59.6          7.6            7.5
Unavoidable sets             30.1           253.9         19.2           68.9
Multiset                     1197           1181          61             61

Table 5.1: Test results (Pentium II 267 MHz)
// i -> i+1
S2(L, A1, A2, A3, A4) :- L == @, S1(J, A1, A2, A3, A4), A1 <= A2.
S3(L, A1, A2, A3, A4) :- L == @, S2(J, A1, A2, A3, A4), A2 <= A3.

// i+1 -> i
S1(L, A2, A3, A4, A5) :- L == @, S2(J, A1, A2, A3, A4), A1 == @,
    clone(c1, A4), nextClone(c1, c2), clone(c2, A5).
S2(L, A2, A3, A4, A5) :- L == @, S3(J, A1, A2, A3, A4), A1 == @,
    clone(c1, A4), nextClone(c1, c2), clone(c2, A5).

// i -> i
S1(J, B1, A2, A3, A4) :- S1(JJ,A1,A2,A3,A4), J<=A1, J/\A2=@,
    pick(J,A1,B1).
S2(J, B1, B2, A3, A4) :- S2(JJ,A1,A2,A3,A4), J<=A1\/A2, J/\A3=@,
    pick(J,A1,B1), pick(J,A2,B2).
S3(J, B1, B2, B3, A4) :- S3(JJ,A1,A2,A3,A4), J<=A1\/A2\/A3,
    J/\A4=@, pick(J,A1,B1), pick(J,A2,B2), pick(J,A3,B3).

GOOD(X) :- S1(J, @, @, @, @).
I measured the evaluation time of this example in four different situations. Table 5.1 shows the results. All the numbers are real seconds, not CPU seconds; the test
[Figure 5.4: The clones of the bigger example — the clones C1–C9 in their correct order C8, C6, C2, C5, C1, C4, C3, C7, C9, over the fragment lengths 35, 8, 25, 40, 20, 15, 35, 10, 5, 25, 30, 20, 45, 50, 15, 40.]
was running under WinNT (Pentium II 267 MHz). As can be seen, the Semi-Naive evaluation is a great improvement: it needs only 5.69 percent (optimization off) or 8.55 percent (optimization on) of the time of the Naive evaluation. The optimization has a remarkable effect with Naive evaluation (we save 36 percent of the time), and a slight effect with Semi-Naive evaluation (4.8 percent). The reason for the small effect is that the original rules are already rather optimized; there is not much room to optimize them further. However, it has to be mentioned that the time spent on the optimization itself (less than 1 second) is much smaller than even this small saving, therefore the optimization is useful.
Example with input size n=16, k=5
The original example described in [7] was also tested in this system. The clones and the fragments are shown in Figure 5.4. Table 5.1 shows the test results of this example too.
5.3 Unavoidable Sets
5.3.1 Problem
The problem originates from two-player games, such as chess. We can assign different labels to different positions. In chess we can use labels like: white wins, black wins, draw. It is possible that a position has no label assigned. If we define more labels, then it is also possible that more than one label is assigned to a position. For instance, if we define labels like 'white has a queen', 'white has a rook', 'white has a bishop', then if white has two rooks and a queen, and no bishops, the first two labels are assigned to the position.

Assume that white wants to reach a position which has a specific label, and black wants to avoid it. We can build a graph which contains the possible positions: the current position is the root, and there is a directed edge between two positions if one player can move from one position to the other. If the players take turns alternately, then this graph is bipartite (each position contains the name of the player who moves next). We assume that the graph is acyclic. In chess this really is the case, because if the same position occurs three times, then the game is a draw.
5.3.2 Solution
We can assign labels to the leaves. To calculate the labels for the other nodes, we do
the following. If black has to move, then we calculate the intersection of the labels
[Figure 5.5: The acyclic graph — vertices 1–16, with the sets {3,7}, {3}, {3,4}, {3,4,5,6}, {4,6,7} assigned to the leaves 12–16.]
assigned to the children of the node, because black wants to avoid the label. If white has to move, then we calculate the union of the labels assigned to the children of the node, because white wants to reach the label. If we assign labels one after the other, then after finitely many steps we reach the root, because the graph is acyclic.

Mathematically the problem is the following: We have a directed acyclic bipartite graph. Let A and B be the two disjoint sets of vertices. The graph has a special vertex whose in-degree is zero; we call this vertex the root. Let us suppose that the root is in A. Sets are assigned to the leaves. If the sets of all the children of a vertex are already defined, then we can assign a set to the vertex: if the vertex is in A, then we assign the union of the sets of the children; if the vertex is in B, then the intersection of the sets of the children. The goal is to find the set which is assigned to the root.
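The computation can be sketched as a recursive bottom-up pass over the DAG (illustrative Python; for large graphs the recursion should be memoized, since a DAG vertex can be reached along several paths):

```python
from functools import reduce

def label(vertex, children, leaf_sets, in_a=True):
    """Set assigned to `vertex`: the union of the children's sets
    for vertices in A (white to move), the intersection for B.
    children: dict vertex -> list of child vertices.
    leaf_sets: dict leaf vertex -> assigned set."""
    if vertex in leaf_sets:
        return leaf_sets[vertex]
    sets = [label(c, children, leaf_sets, not in_a)
            for c in children[vertex]]
    return reduce(set.union if in_a else set.intersection, sets)
```

Running this on the graph of Figure 5.5, with root 1 in A, yields {3, 4}, matching the result of the Datalog program of Section 5.3.3.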
5.3.3 Example
In this example the graph and the labels assigned to the leaves are shown in Figure
5.5.
To store the structure of the graph, we need the first part of the input file:
left(P,C)  :- P == {1},  C == {2}.
right(P,C) :- P == {1},  C == {3}.
left(P,C)  :- P == {2},  C == {4}.
right(P,C) :- P == {2},  C == {5}.
left(P,C)  :- P == {3},  C == {5}.
right(P,C) :- P == {3},  C == {6}.
left(P,C)  :- P == {4},  C == {7}.
right(P,C) :- P == {4},  C == {8}.
left(P,C)  :- P == {5},  C == {8}.
right(P,C) :- P == {5},  C == {9}.
left(P,C)  :- P == {6},  C == {10}.
right(P,C) :- P == {6},  C == {11}.
left(P,C)  :- P == {7},  C == {12}.
right(P,C) :- P == {7},  C == {13}.
left(P,C)  :- P == {8},  C == {12}.
right(P,C) :- P == {8},  C == {14}.
left(P,C)  :- P == {9},  C == {14}.
right(P,C) :- P == {9},  C == {15}.
left(P,C)  :- P == {10}, C == {14}.
right(P,C) :- P == {10}, C == {15}.
left(P,C)  :- P == {11}, C == {15}.
right(P,C) :- P == {11}, C == {16}.
To store the sets of the leaves we need the following part:
white(X, S) :- X == {12}, S == {3,7}.
white(X, S) :- X == {13}, S == {3}.
white(X, S) :- X == {14}, S == {3,4}.
white(X, S) :- X == {15}, S == {3,4,5,6}.
white(X, S) :- X == {16}, S == {4,6,7}.
The part which calculates the new sets contains only three rules:
black(X, S) :- left(X, L),right(X, R),white(L, W1),white(R, W2), S=W1/\W2.
white(X, S) :- left(X, L),right(X, R),black(L, B1),black(R, B2), S=B1\/B2.
un(S) :- white(X, S), X == {1}.
If we give this input file to the system, we get the result that the set of unavoidable labels is {3,4}. Table 5.1 shows the time used during evaluation. This example also shows the advantage of the Semi-Naive evaluation; because the number of iterations is relatively small in this example, the effect is not too big. This example also shows that the optimization method may worsen a formula. This is very rare; the problem is that in this example the right-hand side of the black and white rules contains four relations and a selection, and while in most cases it is useful to evaluate selections before joins, in this special case evaluating the join before the selection would have been better.
Chapter 6
Multisets
6.1 Introduction
A natural extension of the program is using multisets instead of sets. Multisets are data structures similar to sets; the difference is that a set can contain an element at most once, while a multiset can contain several copies of an element. Unfortunately, multisets do not form a Boolean algebra. However, multisets can be implemented using a limited set of multiset operators and applying some further restrictions.
6.2 Extension of the program
We allow the following multiset operators:

V1 <= V2

Where V1 and V2 are multiset variables. With this operator we can check
whether one multiset variable is a subset of another multiset variable or not.

V = M

Where V is a multiset variable, and M is a concrete multiset. With this operator
we can change the value of the multiset variable.

V1 = V2 - V3

Where V1, V2, and V3 are multiset variables. The value of V1 is calculated using
the already known values of V2 and V3.

Although we allow only these three operators, some other operators can be expressed
with them. Table 6.1 shows operators which can be expressed using
the basic operators. In the table the Vi's are multiset variables and E is a multiset constant.

V1 == V2      V1 <= V2, V2 <= V1
V <= E        V2 = E, V <= V2
V >= E        V2 = E, V2 <= V
V == {}       V2 = {}, V <= V2
V == E        V2 = E, V <= V2, V2 <= V

Table 6.1: Other operators which can be expressed

All the multiset variables and constants are denoted with a '*' sign in the input
file. There are also other restrictions related to multisets:

- At most one multiset variable in every relation.
- Only the first variable can be a multiset.
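The semantics of the three basic operators, and of the derived ones in Table 6.1, can
be illustrated with Python's collections.Counter (the function names below are my
own, not part of the system):

```python
from collections import Counter

def mincl(a, b):
    """Multiset inclusion a <= b: every element occurs in b at least
    as many times as in a."""
    return all(b[k] >= n for k, n in a.items())

def mdiff(a, b):
    """Multiset difference a - b (counts never drop below zero)."""
    return a - b          # Counter subtraction is exactly this

# The three basic operators:
v2 = Counter({1: 2, 3: 1})             # V = M  (assign a concrete multiset)
v1 = Counter({1: 1})
print(mincl(v1, v2))                   # V1 <= V2  -> True
print(mdiff(v2, v1))                   # V1 = V2 - V3

# A derived operator from Table 6.1: V == E is two inclusion checks.
e = Counter({1: 2, 3: 1})
print(mincl(v2, e) and mincl(e, v2))   # -> True
```

Equality reduces to inclusion in both directions, exactly as the table states.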
6.3 An example
6.3.1 The problem
The next example is also related to genome maps, and is described in [5]. In the
previous genome map example we cut the DNA with one enzyme to get clones, and
then used another enzyme to digest the clones. In this example we start with
two enzymes. We cut the original DNA with one of them to get the so-called
row clones, and cut the original DNA with the other enzyme to get the so-called
column clones. Neither the row clones nor the column clones can overlap each other.
After this we use the same enzyme to digest both the row and the column clones.
6.3.2 The solution
Because of the genetic difference, we know the first row clone and the first column
clone. The structure of the row and column clones implies that one of these two first
clones is a subset of the other one. Assume that the first column clone is a subset of
the first row clone. We also know that at the beginning of the DNA there are the
fragments of this column clone (which are fragments of the first row clone as well),
followed by those fragments of the first row clone which are not in the first column
clone. Let S be the set of these remaining fragments. The algorithm should find
another column clone which is either a subset of S or a superset of S. If the column
clone is a subset of S, it means that we still have more fragments from the row clones
than from the column clones, hence we need to find another column clone. If the
column clone is a superset of S, it means that we have more fragments from the
column clones than from the row clones, hence we need to find a row clone now. The
difference of the column clone and S becomes the new value of S. We can repeat this
step until S equals the empty set, which means that we have found the same fragments
in both the row and the column clones. S will be empty at the end of the DNA,
although there is a small possibility that S will become empty before that.
The Datalog with Boolean constraints program which solves this problem:
down(*SS,UC,UR) :- initialc(*SS,UC,UR).
right(*SS,UC,UR) :- initialr(*SS,UC,UR).
down(*SS,UC,URR) :- down(*S,UC,UR), row(*R,M), *R <= *S,
*SS = *S - *R, pick(M,UR, URR).
down(*SS,UCC,UR) :- right(*S,UC,UR), column(*C,N), *S <= *C,
*SS = *C - *S, pick(N,UC,UCC).
right(*SS,UCC,UR) :- right(*S,UC,UR), column(*C,N), *C <= *S,
*SS = *S - *C, pick(N,UC,UCC).
right(*SS,UC,URR) :- down(*S,UC,UR), row(*R,M), *S <= *R,
*SS = *R - *S, pick(M,UR,URR).
halt(X) :- down(*X,@,@), *Y == *@, *X <= *Y.
halt(X) :- right(*X,@,@), *Y == *@, *X <= *Y.
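The same alternating procedure can also be sketched imperatively. The following
Python sketch is my own illustration, not part of the system; the clone data at the
bottom is a small hypothetical example, and Counter plays the role of a multiset:

```python
from collections import Counter

def subset(a, b):
    """Multiset inclusion a <= b."""
    return all(b[k] >= n for k, n in a.items())

def assemble(rows, cols):
    """Order the row and column clones, assuming rows[0] and cols[0]
    are the known first clones and cols[0] <= rows[0]."""
    order_r, order_c = [0], [0]
    unused_r = list(range(1, len(rows)))
    unused_c = list(range(1, len(cols)))
    S = rows[0] - cols[0]      # surplus fragments from the row side
    need_col = True            # row side is ahead: look for a column clone
    while S:
        clones = cols if need_col else rows
        unused = unused_c if need_col else unused_r
        order = order_c if need_col else order_r
        for i in unused:
            if subset(S, clones[i]):     # superset: the other side takes over
                S = clones[i] - S
                unused.remove(i); order.append(i)
                need_col = not need_col
                break
            if subset(clones[i], S):     # subset: the same side continues
                S = S - clones[i]
                unused.remove(i); order.append(i)
                break
        else:
            raise ValueError("no consistent extension found")
    return order_r, order_c

# Hypothetical data: fragment multisets of three row and three column clones.
rows = [Counter([1, 2]), Counter([3, 4, 5]), Counter([6])]
cols = [Counter([1]), Counter([2, 3]), Counter([4, 5, 6])]
print(assemble(rows, cols))    # -> ([0, 1, 2], [0, 1, 2])
```

The loop terminates when S becomes empty, which corresponds to the halt rules of
the Datalog program.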
6.3.3 Concrete example
In this example the DNA contains 28 fragments, ten row clones and nine column clones.
Figure 6.1 shows the fragments of the DNA string and the row and column clones.
Table 6.2 shows the process of the solution. The first column shows the expression
for the new value of S, and the second column shows the new value of S. The third
and fourth columns show the unused row and column clones.
Table 5.1 shows the execution time of this example. The Semi-Naive evaluation
[Figure: the DNA fragment lengths in order, with the row clones R1-R10 and the column clones C1-C9 marked over them.]

Figure 6.1: The correct order of clones
New S expression | New S          | Unused row clones              | Unused column clones
C3               | 5, 13          | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10  | 1, 2, 4, 5, 6, 7, 8, 9
R9 - S           | 13, 18         | 1, 2, 3, 4, 5, 6, 7, 8, 10     | 1, 2, 4, 5, 6, 7, 8, 9
S - C1           | 13             | 1, 2, 3, 4, 5, 6, 7, 8, 10     | 2, 4, 5, 6, 7, 8, 9
C6 - S           | 7, 8           | 1, 2, 3, 4, 5, 6, 7, 8, 10     | 2, 4, 5, 7, 8, 9
R8 - S           | 17             | 1, 2, 3, 4, 5, 6, 7, 10        | 2, 4, 5, 7, 8, 9
C8 - S           | 8, 9, 14, 27   | 1, 2, 3, 4, 5, 6, 7, 10        | 2, 4, 5, 7, 9
S - R3           | 14, 27         | 1, 2, 4, 5, 6, 7, 10           | 2, 4, 5, 7, 9
S - R1           | 27             | 2, 4, 5, 6, 7, 10              | 2, 4, 5, 7, 9
R4 - S           | 65             | 2, 5, 6, 7, 10                 | 2, 4, 5, 7, 9
C4 - S           | 8              | 2, 5, 6, 7, 10                 | 2, 5, 7, 9
R6 - S           | 4, 7           | 2, 5, 7, 10                    | 2, 5, 7, 9
C7 - S           | 4, 9, 12       | 2, 5, 7, 10                    | 2, 5, 9
S - R2           | 12             | 5, 7, 10                       | 2, 5, 9
R10 - S          | 10, 10, 11, 12 | 5, 7                           | 2, 5, 9
C9 - S           | 10             | 5, 7                           | 2, 5
R7 - S           | 5, 28          | 5                              | 2, 5
C5 - S           | 4              | 5                              | 2
R5 - S           | 5, 10          | -                              | 2
C2 - S           | -              | -                              | -

Table 6.2: The solution
is a great improvement in this example too (it needs only about 5% of the time
necessary for the Naive evaluation). However, the optimization of Relational Algebra
formulas has no significant effect on the execution time of this query.
Chapter 7
Further Work
GUI: Currently the program has a character-based interface, which can make it
difficult for the user to handle the program. The biggest disadvantage of the
character-based interface is that sometimes it is too difficult to interpret the
results, because the result is only a set of formulas.
Currently a student (Song Liu) in the Department is working on a Graphical
User Interface which will make the results easier to understand.

Approximation: Every quantifier elimination method has some restrictions on the
input database, and sometimes we cannot use any quantifier elimination method.
In these cases approximation may help: when we cannot compute the exact
quantifier-free formula, we can instead create a formula which approximates the
result. A possible way of approximation is described in [3].

Multiset: Chapter 6 describes an extension of the system which makes it
possible to use multisets. Only a small subset of multiset operators is used in
this extension; it would be possible to implement more multiset operators.
Indexing: Indexing can improve the speed of database systems. However, while
indexing is not too difficult in traditional relational database systems, it is much
more difficult in constraint database systems. A good indexing method would
improve the speed of this system.
Bibliography
[1] B. H. Arnold, Logic and Boolean Algebra, Prentice-Hall, Inc. (1962).

[2] J. Byon, P. Z. Revesz, DISCO: A constraint database system with sets, Proc.
Workshop on Constraint Databases and Applications, pages 68-83, Springer-Verlag,
Berlin/Heidelberg/New York (1995).

[3] R. Helm, K. Marriott, M. Odersky, Spatial Query Optimization: From Boolean
Constraints to Range Queries, Journal of Computer and System Sciences 51:197-201
(1995).

[4] P. C. Kanellakis, G. M. Kuper, P. Z. Revesz, Constraint query languages, Journal
of Computer and System Sciences 51:26-52 (1995).

[5] P. Z. Revesz, CSE 913 (Advanced Database Systems) class material, University
of Nebraska-Lincoln, Spring 1998.

[6] P. Z. Revesz, The Evaluation and the Computational Complexity of Datalog
Queries of Boolean Constraint Databases, International Journal of Algebra and
Computation, to appear.

[7] P. Z. Revesz, Refining Restriction Enzyme Genome Maps, Constraints 2:361-375
(1997).

[8] J. D. Ullman, Principles of Database and Knowledge-Base Systems, Computer
Science Press (1988).