Adding Magic to an Optimising Datalog Compiler
Damien Sereni, Pavel Avgustinov and Oege de Moor
Semmle Limited
Magdalen Centre, Oxford Science Park, Robert Robinson Avenue, Oxford OX4 4GA, UK
{damien,pavel,oege}@semmle.com
ABSTRACT
The magic-sets transformation is a useful technique for dramatically improving the performance of complex queries, but it has been observed that this transformation can also drastically reduce the performance of some queries. Successful implementations of magic in previous work require integration with the database optimiser to make appropriate decisions to guide the transformation (the sideways information-passing strategy, or SIPS).
This paper reports on the addition of the magic-sets transformation to a fully automatic optimising compiler from Datalog to SQL with no support from the database optimiser. We present an algorithm for making a good choice of SIPS using heuristics based on the sizes of relations. To achieve this, we define an abstract interpretation of Datalog programs to estimate the sizes of relations in the program.
The effectiveness of our technique is evaluated over a substantial set of over a hundred queries, and in the context of the other optimisations performed by our compiler. It is shown that using the SIPS chosen by our algorithm, query performance is often significantly improved, as expected, but more importantly performance is never significantly degraded on queries that cannot benefit from magic.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Query Processing; H.2.3 [Database Management]: Languages—Query Languages

General Terms
Performance, Algorithms, Languages

Keywords
Magic sets, Datalog, Query optimisation

Notice: The technology described in this paper is proprietary; US and other patents pending. For licensing information, write to [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.

1. INTRODUCTION
The magic-sets transformation [2, 3] is a well-known technique for optimising complex queries which may involve subqueries and recursion, often yielding impressive improvements in query times [19]. However, when used inappropriately, magic-sets can degrade performance equally significantly. This makes it difficult to adopt magic-sets as a general-purpose optimisation in a deductive query engine.
There has been a substantial amount of work attacking this problem, leading to implementations of magic-sets such as the IBM DB/2 implementation based on Starburst [20] and the cost-based optimisation technique of Seshadri et al. [26]. These methods require modification of the database optimiser, or at least access to information about the query plans it produces.
This paper describes an algorithm for implementing magic-sets as a transformation of Datalog programs, independently of the database backend used to execute the programs. The context of this work is the efficient implementation of a variant of object-oriented Datalog called .QL, within a commercial system [6, 24]. The architecture of the system is as follows. First, a .QL program is translated to a pure Datalog intermediate representation. Optimisation passes transform this Datalog program, and the resulting optimised program is then translated to SQL. Finally, the SQL program can be executed on a number of databases — currently, Microsoft SQL Server [17], PostgreSQL [21] and H2 [10] are supported.
The optimisations that we shall describe, and in particular the magic transformation, operate purely on the intermediate Datalog representation. As a result, the precise nature of the .QL language is irrelevant to this paper. However, there are a number of constraints imposed by the setting in which we work:
• .QL queries are not written by expert logic programmers, and benefit from large reusable libraries supplied
by us, which can be applied in many different contexts.
Hand-optimisation of the program is therefore not an
option.
• The Datalog programs that our system optimises are
automatically generated from an object-oriented language. This often leads to unnatural Datalog programs, and thus our optimisations must be robust, in
the sense that they do not degrade performance on
such programs.
• As we target different database systems, and must run
on unmodified databases, we cannot rely on knowledge
of the database optimiser to guide our own optimisations.
• The Datalog optimiser must make automatic choices
of transformations to apply to a particular program,
and we cannot rely on the user enabling or disabling
magic-sets for particular queries.
Contributions. Our requirements for the implementation of magic-sets are the following:
• The transformation should never degrade performance significantly,
• It should yield performance improvements in the average case, and
• It should give large performance improvements on queries that can make use of context-sensitive information.
Achieving these goals, in particular the first, requires new techniques to choose a suitable order of evaluation (often called a sideways information-passing strategy). The contributions of this paper are:
• We present an algorithm for implementing magic-sets using heuristics based on the sizes of relations to choose evaluation order and apply magic only where necessary,
• We introduce a novel method for estimating the sizes of relations defined by Datalog formulas as an abstract interpretation of Datalog programs, and
• We evaluate the impact of this transformation on performance over a large suite of queries on substantial datasets, and assess the interactions between magic-sets and other optimisations performed by our system.

2. PRELIMINARIES
The .QL language is irrelevant to this paper, since the optimisation passes that our system implements operate solely on the intermediate representation of .QL queries and libraries as Datalog programs. More precisely, we use a variant of safe Datalog with stratified recursion, extended with arbitrary logical operators (that is, disjunctions and negations can be used and nested freely) and aggregates. The main difference between this representation and other presentations of Datalog is that our language is typed. While a complete description of our type inference algorithm (related to, but a refinement of, the type system of Henriksson and Maluszyński [12]) is beyond the scope of this paper, we give a very brief description of the subset of the type system that will be used in what follows.
Individual columns in the extensional database schema are annotated with types, written with an @ sign. For instance, a schema definition p(X : @a, Y : @b) defines a relation whose first column is of type @a and whose second column is of type @b. Subtype relationships can be defined among such types. Each type can be used as a unary predicate: @a(X) holds of all values of type @a.
Types for intensional relations can be inferred, or types can be used to restrict the range of relations. A definition of the form

q(X : @a, Y : @b) :- E.

is defined to be equivalent to the following:

q(X, Y) :- @a(X), @b(Y), E.

That is, the types in the definition of q restrict the range of values of X and Y. These elementary types only form the basis of our type system, but are all that will be required for the remainder of the paper.

• hasSubtype(SUPER : @type, SUB : @type)
  Type SUB is a direct subtype of type SUPER
• method(M : @method, C : @class)
  M is a method defined in class C
• hasName(E : @element, N : string)
  Program element E (a type or a method) has name N

Figure 1: Example System: Simplified Extensional Database
3. APPLYING MAGIC
The magic-sets transformation [3] is a well-known optimisation aiming to eliminate redundant work in bottom-up
evaluation of Datalog programs. This optimisation has been
shown to yield very substantial improvements in query times
[19]. Magic-sets can however substantially degrade performance when inappropriately applied. Our experience implementing known magic-sets strategies suggested that this
was a significant problem, motivating this work. We will
briefly describe the magic-sets transformation before illustrating some of the problems that we have encountered with
the naive implementation.
3.1 Magic-Sets
Throughout this paper we shall use the example of simple
queries over Java source code to illustrate our techniques.
Such queries are useful for software quality assessment. The
examples below are simplified versions of Datalog programs
produced from .QL queries. Figure 1 describes the extensional database predicates used in these examples, together
with their informal meaning. The contents of the database
are strongly simplified variants of our representation of Java
programs.
The following Datalog program finds all subtypes of type
"Cloneable" that do not contain a method named "clone"¹:
query(C : @class) :-
    @type(Cloneable),
    N = "Cloneable",
    hasName(Cloneable, N),
    hasSubtypePlus(Cloneable, C),
    not(M = "clone", declaresMethod(C, M)).

hasSubtypePlus(SUPER : @type, SUB : @type) :-
    hasSubtype(SUPER, SUB)
    ; (hasSubtypePlus(SUPER, MID), hasSubtype(MID, SUB)).

declaresMethod(C : @class, Name : string) :-
    method(M, C), hasName(M, Name).
¹Java programmers will recognise the results of this query as violating a Java style rule.
Note that semicolons are used to represent disjunction.
This is highly inefficient under bottom-up evaluation of Datalog. The transitive closure of the subtype relation is computed, and is likely to be very large, but only those classes that are subtypes of Cloneable are needed. Furthermore, the declaresMethod predicate computes the names of all methods declared by all classes, even though only methods called "clone" in subtypes of Cloneable are of interest.
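To make the waste concrete, here is a small sketch (ours, not part of the compiler; the type hierarchy is invented) contrasting the full bottom-up closure with a closure seeded at a single type, which is the effect the magic transformation aims for:

```python
# A sketch (ours, not the system described here) contrasting naive
# bottom-up evaluation of hasSubtypePlus with evaluation restricted to
# one seed type, anticipating the effect of magic. Toy hierarchy.

hasSubtype = {("Object", "Cloneable"), ("Object", "String"),
              ("Cloneable", "A"), ("A", "B"), ("String", "C")}

def full_closure(edges):
    # Naive bottom-up: derives EVERY (SUPER, SUB) pair in the closure.
    closure = set(edges)
    while True:
        new = {(sup, sub) for sup, mid in closure
               for mid2, sub in edges if mid == mid2} - closure
        if not new:
            return closure
        closure |= new

def seeded_closure(edges, seed):
    # Magic-style: only pairs whose SUPER is the seed are ever derived.
    closure = {(sup, sub) for sup, sub in edges if sup == seed}
    while True:
        new = {(seed, sub) for _, mid in closure
               for mid2, sub in edges if mid == mid2} - closure
        if not new:
            return closure
        closure |= new

full = full_closure(hasSubtype)
magic = seeded_closure(hasSubtype, "Cloneable")
print(len(full), len(magic))  # → 9 2
```

Even on this toy hierarchy the seeded evaluation derives a fraction of the facts; on a real class hierarchy the gap is far larger.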
The magic-sets transformation restricts intensional predicates to just those tuples that are necessary for the computation. This is achieved by pushing the context in which
each predicate is used into its definition. This results in the
following program:
query^f(C : @class) :-
    @type(Cloneable),
    N = "Cloneable",
    hasName(Cloneable, N),
    hasSubtypePlus^bf(Cloneable, C),
    not(M = "clone", declaresMethod^bb(C, M)).

m_hasSubtypePlus^bf(SUPER : @type) :-
    N = "Cloneable",
    hasName(SUPER, N).

hasSubtypePlus^bf(SUPER : @type, SUB : @type) :-
    (m_hasSubtypePlus^bf(SUPER),
     hasSubtype(SUPER, SUB))
    ; (m_hasSubtypePlus^bf(SUPER),
       hasSubtypePlus^bf(SUPER, MID),
       hasSubtype(MID, SUB)).

m_declaresMethod^bb(C : @class, Name : string) :-
    hasSubtypePlus^bf(Cloneable, C),
    Name = "clone".

declaresMethod^bb(C : @class, Name : string) :-
    m_declaresMethod^bb(C, Name),
    method(M, C),
    hasName(M, Name).
The magic-sets transformation changes the program in two
ways. First, each literal is adorned to indicate which of its
variables should be considered bound by its context: for
instance, the bf adornment on hasSubtypePlus indicates
that its first argument is bound, and its second is free. The
context of each predicate is further isolated into a relation
(the magic predicate) that is then conjoined with it. The
context of a predicate consists of a subset
of the literals appearing before the predicate, wherever it is
used. It is easy to see that the transformed program is more
efficient (despite having more joins), as the adorned relations
are substantially smaller than the originals: Only subtypes
of Cloneable and methods named clone are considered.
The magic transformation relies on the choice of a sideways information-passing strategy (SIPS). This determines
the context (or magic set) of each literal, which is a (possibly
proper) subset of those formulas that appear before the literal. For instance, in the above example, @type(Cloneable)
is not taken to be part of the context of hasSubtypePlus, because it does not restrict the range of values of the Cloneable
variable.
As the context of a reference to a predicate only includes
expressions appearing before it, the order of expressions
within a conjunction is crucial. Consider the following reordering of the query from our running example:
query(C : @class) :-
    @type(Cloneable),
    not(declaresMethod(C, M), M = "clone"),
    hasSubtypePlus(Cloneable, C),
    hasName(Cloneable, N),
    N = "Cloneable".
With this order, the declaresMethod predicate can no longer
benefit from the information that C is a subtype of Cloneable,
while the context of hasSubtypePlus does not record that
the name of Cloneable is “Cloneable”. This order therefore
does not produce good magic sets, making the optimisation
useless, or indeed detrimental due to the increased number of joins. As an example, consider the hasSubtypePlus
predicate. With the above order, a naive implementation
would split this into several predicates with different variable boundedness:
hasSubtypePlus^fb(SUPER : @type, SUB : @type) :-
    (not(M = "clone", declaresMethod(SUB, M)),
     hasSubtype(SUPER, SUB))
    ; (not(M = "clone", declaresMethod(SUB, M)),
       hasSubtypePlus^ff(SUPER, MID),
       hasSubtype(MID, SUB)).
With this implementation, the context is largely useless,
as the entire hasSubtypePlus relation (with both variables
free) must be computed anyway, leading to wasted work.
The Need for a SIPS. Our example highlights the need
for a good choice of SIPS in general, but is this a problem
in practice? We initially implemented the magic-sets transformation without an automatic procedure for reordering
subgoals, and thus applied the transformation in the order
in which the subgoals were presented. On our test query set
of 111 queries (described in more detail in Section 6), the
results are damning: the entire set of queries runs 6.7 times
slower with the magic-sets transformation enabled. To find
the problem, let us break down the queries into the following (somewhat arbitrary) categories, based on the impact of
magic on each query:
    1–4x faster with magic    33 queries
    Up to 2x slower           55 queries
    2–10x slower              13 queries
    > 10x slower              10 queries
The results show that the magic transformation can be effective (it is an improvement on 30% of the queries), but that it
can equally be extremely detrimental, as 9% of queries run
more than ten times slower with magic. The goal of choosing
a SIPS is essentially to make the optimisation more robust,
as well as to improve its effectiveness on those queries which
it optimises. The choice of SIPS consists of:
1. Choosing the order of subgoals in conjunctions,
2. Choosing the magic set of each subgoal, as a subset of
the subgoals appearing before it, and
3. Choosing which of the variables of each subgoal should
be considered bound.
In the next section, we shall describe our strategy for achieving these goals.
3.2 Choosing a SIPS
Aim. The goal of the choice of SIPS is to maximise the
impact of the magic transformation, i.e. to minimise the
cost of evaluating the transformed program. However, as
is common in optimisation problems, evaluating all possible choices of SIPS is infeasible due to the large number of
strategies — for a conjunction of n predicates, there are n!
orders, and for each order the magic set of each predicate
must further be picked. As a result, we follow the standard
strategy of using a heuristic to find a good order.
A complicating factor is the fact that we target a number
of different database backends, and therefore that the cost
of evaluating a relation cannot be estimated precisely. This
cost varies greatly with the database optimiser, and thus is
different across all backends; in addition, it is of course not
possible to modify the database optimiser in this setting, in contrast to other approaches [26, 20]. As a result of the impossibility of predicting the exact cost of evaluating relations,
we will use the size of relations as a cost measure. While
size is an inexact measure of cost for any given backend, it
is the only possible measure carrying over several database
implementations, and yields good results in practice. We
will relegate the definition of this estimate of relation size
to the next section, and in the remainder of this section assume that we have a way to estimate the size of the relation
defined by any formula.
An Example. We shall first illustrate our algorithm informally with an example, namely the predicate query in our
running example, shown below in an inappropriate order:
1  query(C : @class) :-
2      @type(Cloneable),
3      not(declaresMethod(C, M),
4          M = "clone"),
5      hasSubtypePlus(Cloneable, C),
6      hasName(Cloneable, N),
7      N = "Cloneable".
The initial order of conjuncts in this query is ignored by the
algorithm, and the process is greedy: we create the resulting conjunction one conjunct at a time. At each step, we
pick the conjunct that represents the smallest relation, in
context, as outlined informally below:
1. The first conjunct in the result is the equality constraint on N (line 7), as this defines a relation of size
one.
2. The second conjunct is the hasName literal (line 6).
This is the smallest conjunct, given conjunct 7, namely
that the value of N is fixed to “Cloneable”. The choice
of this conjunct illustrates the use of context in our
algorithm: as conjunct 7 has already been chosen in
the output formula, we can assume that this holds for
the remainder of the conjunction.
3. The next conjunct is the instance of @type on line 2. Given the context of conjuncts 6 and 7, Cloneable is restricted to a small set of values, so this is the next smallest conjunct.
4. The fourth conjunct is now hasSubtypePlus (line 5). Again given the context (fixing Cloneable to a small set of values), this is a small relation, although it is not a small relation in general.
5. Finally, there is only one conjunct left, found on lines 3–4. This itself contains a nested conjunction, which is reordered according to the same procedure, trivially giving the order placing the equality constraint (line 4) first.
This example illustrates our general approach: the smallest
relation is picked at each stage, taking context information
into account to determine the sizes of relations. The order
arrived at by our algorithm is slightly different from the
original order of our running example, but leads to the same
magic sets and so is equivalent. The resulting order is shown
below:
1  query(C : @class) :-
2      N = "Cloneable",
3      hasName(Cloneable, N),
4      @type(Cloneable),
5      hasSubtypePlus(Cloneable, C),
6      not(M = "clone",
7          declaresMethod(C, M)).
The same approach can be used to compute magic sets. Recall that the magic set of a literal L is a subset of the conjuncts appearing before L, with the aim of reducing the size
of the relation denoted by L. This is again a greedy process,
and we walk the potential magic set of a literal (that is, all
formulas appearing before it) from left to right, determining which formulas to include in the magic set. For each
formula φ in the potential magic set, we aim to determine
whether φ constrains L sufficiently, and only wish to include
it in the magic set when this is the case. As before we use
contextual information to determine this. The context of φ
has two parts:
• Any formulas appearing to the left of φ that have already been included in the magic set of L, and
• Any formulas appearing to the right of φ (but before
L). Here, we are making an optimistic assumption:
when choosing whether or not to include φ, we assume
that anything we have not yet visited will be included.
Let us illustrate this with our running example, to compute
the magic set of the hasSubtypePlus literal (line 5). The
potential magic set of this literal contains the formulas on
lines 2-4, which we traverse from left to right:
1. The first formula (line 2) is constraining, as it indirectly restricts the range of the Cloneable variable.
Here our optimistic assumption that all predicates are
included in the magic set is essential, as this formula
is only constraining given the predicate on line 3.
2. The next formula (line 3) is likewise constraining. Given
that the equality constraint on Cloneable (line 2) was
included in the magic set, this reduces the size of the
relation denoted by the hasSubtypePlus literal.
3. The third formula, @type(Cloneable), is not constraining. It is in fact implied by the use of Cloneable in
hasSubtypePlus, which is detected by our analysis.
As a result, this predicate is not included in the magic
set.
The resulting magic set is precisely the set of formulas shown
previously, so our procedure yields the right result for this
particular example.
Algorithm. Having informally described our algorithm on a
specific example, let us present the algorithm more precisely.
We assume that we are working on a single conjunction φ1 ∧
· · ·∧φn , and that we must both reorder this conjunction and
determine the magic set of each literal among the φi .
The example has illustrated a key element of the algorithm: we repeatedly use the estimate of the size of a relation in context to make the required choices. While the
discussion of our size estimation forms Section 4, let us assume that the estimate of the size of a formula φ is given by
Size(φ). We need one further operation, to evaluate the size
of φ in a context ψ. This is merely the size of the conjunction of φ and ψ, restricted to those variables which appear
free in φ, and is written as:
Size(φ | ψ) = Size(∃x1 · · · xk (φ ∧ ψ)),
where {x1, . . . , xk} = fv(ψ) \ fv(φ) and fv(φ) is the set of free variables of φ.
It is necessary to quantify out variables that are not free in
φ, as shown by the following example: clearly, Size(A = 1 | p(B)) = Size(A = 1) for any predicate p, as p(B) does not constrain A. However, simply using conjunction would be incorrect, as Size(A = 1 ∧ p(B)) = Size(p(B)). Our definition of Size(∃B (A = 1 ∧ p(B))) produces the right result.
It is now straightforward to define the reordering phase of
the algorithm. This is shown in Figure 2.
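As an illustration, the greedy reordering loop can be rendered directly in code. The sketch below is ours: the conjunct labels, variable sets and base sizes are invented for the running example, and the size estimator is a crude stand-in for Size(φ | context) that simply halves a conjunct's size for every already-chosen conjunct it shares a variable with:

```python
# A sketch (ours, not the paper's implementation) of the greedy
# reordering loop of Figure 2. Conjuncts are (label, vars, base_size)
# triples; sizes and variable sets are invented.

def reorder(conjuncts, global_context, size_in_context):
    remaining = list(conjuncts)
    result = []
    context = list(global_context)
    while remaining:
        # Pick the conjunct with the smallest estimated size in context.
        phi = min(remaining, key=lambda f: size_in_context(f, context))
        context.append(phi)
        remaining.remove(phi)
        result.append(phi)
    return result

conjuncts = [
    ("@type(T)",                {"T"},      5),
    ("not declaresMethod(C,M)", {"C", "M"}, 1000),
    ("hasSubtypePlus(T,C)",     {"T", "C"}, 50),
    ("hasName(T,N)",            {"T", "N"}, 2),
    ('N = "Cloneable"',         {"N"},      1),
]

def size_in_context(phi, context):
    # Crude stand-in for Size(phi | context): every already-chosen
    # conjunct sharing a variable with phi halves its estimated size.
    _, vars_, size = phi
    for _, cvars, _ in context:
        if vars_ & cvars:
            size /= 2
    return size

order = [label for label, _, _ in reorder(conjuncts, [], size_in_context)]
print(order)
```

With these invented numbers the greedy loop reproduces the order arrived at in the worked example: the equality constraint first, then hasName, @type, hasSubtypePlus, and the negation last.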
MagicSet(L, candidates, globalContext)
    result ← true
    remaining ← candidates
    context ← globalContext
    while remaining ≠ [] do
        φ ← head(remaining)
        remaining ← tail(remaining)
        newContext ← context ∧ ⋀ remaining
        if IsConstraining(φ, L, newContext)
            result ← result ∧ φ
            context ← context ∧ φ
    return result

IsConstraining(φ, L, Context) =
    Size(L | φ ∧ Context) < THRESHOLD · Size(L | Context)

Figure 3: Determining the Magic Set
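The selection loop of Figure 3 can be sketched as follows. This is our illustration only: the THRESHOLD value and the toy size table are invented, tuned so that the two name constraints are accepted into the magic set and @type(Cloneable) is rejected, as in the running example:

```python
# A sketch (ours) of the magic-set selection loop of Figure 3.
# THRESHOLD and the size table are invented toy values.

THRESHOLD = 0.5

def magic_set(L, candidates, global_context, size):
    result = []
    remaining = list(candidates)
    context = list(global_context)
    while remaining:
        phi = remaining.pop(0)
        # Optimistic context: everything already accepted plus all
        # formulas not yet visited.
        new_context = context + remaining
        if size(L, [phi] + new_context) < THRESHOLD * size(L, new_context):
            result.append(phi)
            context.append(phi)
    return result

def size(L, context):
    # Toy estimator: once both name constraints are present, the
    # literal L is pinned down to subtypes of a single type.
    ctx = set(context)
    if {'N="Cloneable"', 'hasName(T,N)'} <= ctx:
        return 10
    if 'N="Cloneable"' in ctx or 'hasName(T,N)' in ctx:
        return 1000
    return 30000

chosen = magic_set('hasSubtypePlus(T,C)',
                   ['N="Cloneable"', 'hasName(T,N)', '@type(T)'], [], size)
print(chosen)  # → ['N="Cloneable"', 'hasName(T,N)']
```

The @type conjunct is rejected because, with the name constraints present on both sides of the comparison, adding it does not shrink the estimate below the threshold.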
Reorder(ψ1, . . . , ψn, globalContext)
    remaining ← {ψ1, . . . , ψn}
    result ← []
    context ← globalContext
    while remaining ≠ ∅ do
        minSize ← min{Size(φ | context) | φ ∈ remaining}
        φ ← the formula making minSize minimal
        context ← context ∧ φ
        remaining ← remaining \ {φ}
        result ← result ++ [φ]
    return result

Figure 2: Reordering Algorithm

Once the conjunction has been reordered, the magic set of each literal must be determined. As described previously, this involves walking the prefix (candidate set) from left to right, choosing formulas which reduce the size of the literal. The only subtlety is the nature of the context: it is optimistic in the sense that it contains all formulas which have not been rejected yet, even unseen formulas. This yields the algorithm shown in Figure 3.
Note that both the reordering and magic-set algorithms take a global context. The global context is simply the conjunction of all formulas that are known to be true throughout the conjunction (independently of its order). It is necessary for two reasons: first, when dealing with nested conjunctions such as φ ∧ ¬(ψ ∧ ξ) (where φ is the global context of ψ ∧ ξ); and secondly to record the adornment of the clause being reordered.
The final step of the algorithm is to determine the bound positions of each literal. This is the set of argument positions that should be considered bound by the magic set of the literal. Naturally only arguments that actually appear in the magic set can be considered, but taking all arguments that appear in the magic set leads to too coarse an approximation. Consider the following artificial example:

p(X : @type, Y : @type) :- X = 1, @type(Y).
q(X : @type, Y : @type) :- X = Y.
query(X, Y) :- p(X, Y), q(X, Y).

Then p(X, Y) is constraining for q(X, Y), as it reduces it to a relation of size 1. However, it suffices to consider X bound for this to be the case — p(X, Y) does not constrain Y in any useful way. The fields of the magic predicate are just those variables which are deemed to be bound, so it is clear that as few fields as possible should be taken to be bound to minimise the size of the magic predicate. Figure 4 gives the pseudocode for this procedure. The algorithm starts from the set of free variables of the magic set, and removes variables whenever doing so does not reduce the impact of the magic.
BoundVariables(Magic, L)
    result ← FreeVariables(Magic)
    repeat
        for each X in result do
            if Size(L | ∃X.Magic) = Size(L | Magic)
                result ← result \ {X}
    until result does not change
    return result

Figure 4: Computing Bound Variables
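A direct rendering of this minimisation is below. The sketch is ours; the size model is a toy standing in for Size(L | Magic), mirroring the p/q example above, where binding X alone already reduces q(X, Y) to one tuple:

```python
# A sketch (ours) of Figure 4's bound-variable minimisation.

def bound_variables(magic_vars, size_given):
    # size_given(bound) estimates the size of L restricted by the magic
    # set with only the variables in `bound` considered bound.
    result = set(magic_vars)
    changed = True
    while changed:
        changed = False
        for x in sorted(result):
            # Dropping x is safe if it does not weaken the restriction.
            if size_given(result - {x}) == size_given(result):
                result = result - {x}
                changed = True
    return result

def size_given(bound):
    # Toy model of the p/q example: binding X pins q down to one tuple.
    return 1 if "X" in bound else 100

print(bound_variables({"X", "Y"}, size_given))  # → {'X'}
```

Y is dropped because quantifying it out leaves the size estimate unchanged, so only X becomes a field of the magic predicate.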
The algorithm of Figure 4 concludes our description of the choice of SIPS: given any conjunction, we have seen how this can be reordered to maximise the impact of magic, how the formulas in the magic set can be chosen, and finally which variables of the result are bound (giving rise to the b/f splits of clauses). A number of further cleaning-up operations are performed — for instance, the magic-sets transformation can introduce unsafe recursion [1], which must be handled appropriately. The details are relatively straightforward and thus omitted.
4. ESTIMATING RELATION SIZES
In Section 3, we showed how the magic transformation can be guided by estimates of the sizes of relations defined by formulas, as a backend-independent estimate of the cost of evaluation. This took the form of a function Size(φ) estimating the size of a formula φ. In this section, we complete the picture by defining an abstract interpretation of Datalog programs producing this size measure.
4.1 Dependencies
The idea behind the size estimation is to track dependencies between fields of relations, in addition to estimates of
the sizes of individual fields. As an example, consider the
hasName(E, N) relation, part of the extensional database,
which gives the name of each program element. Then hasName
is a function, so given the value of E, the value of N is uniquely
determined — a program element only has one name. We
can go further, however, by observing that on typical programs, the name of an element goes a long way towards
identifying this element. In this case, there are just three
elements, on average, for each name². This information can
be encoded in the following dependency graph for hasName:
    E (184767 values) --1--> N (63233 values)
    N (63233 values) --3--> E (184767 values)
The interpretation of this dependency graph is as follows: the E field of hasName takes 184767 distinct values, while its N field takes 63233 values. However, for each value of E there is only one value of N on average (in fact, there is only ever one), while for any value of N there are on average 3 values of E.
As another example, the dependency graph for hasSubtype
is shown below:
    SUPER (4501 values) --5.9--> SUB (30769 values)
    SUB (30769 values) --1.1--> SUPER (4501 values)
In this case, there are 30769 types that are subtypes of other
types (in Java, only Object is not a subtype of another type),
but only 4501 types that have a subtype. Furthermore, there
are just 1.1 supertypes on average for each type (multiple
supertypes can only arise through interfaces). Finally, there
are on average 5.9 direct subtypes for each type that is a
supertype.
In more generality, we can define dependency graphs as
follows:
Definition 1. A dependency graph for a relation with columns X1, . . . , Xn is a triple (Σ, G, Π), where:
1. Σ : {X1, . . . , Xn} → R assigns a size estimate Σ(X) to each field X,
2. G is a set of arrows of the form X --α--> Y (read "there are α values of Y for each value of X"), and
3. Π is a set of equality constraints of the form Xi = Xj.

²The size values are obtained from our representation for the JFreeChart [13] program.
The examples have already illustrated the first two components of dependency graphs. The third component is used to
track equality between fields: for instance, the dependency
graph of a literal p(X , X ) is simply the graph of p(X , Y )
with the added equality constraint X = Y .
A graph (Σ, G, Π) is intended to represent relations R with fields (X1, . . . , Xn) where:
1. |π_Xi(R)| = Σ(Xi) for each i, where projection operates on sets (not multisets),
2. Whenever Xi --α--> Xj ∈ G, then α = avg_x |π_Xj(σ_Xi=x(R))|, and
3. Whenever Xi = Xj ∈ Π, then the values of Xi and Xj are equal in any tuple in R.
This should hold of the graphs defined for extensional relations. The dependency graphs constructed by our analysis for intensional predicates are however approximations,
and properties (1) and (2) only hold approximately for such
generated dependency graphs. Property (3), on the other
hand, can be computed soundly and so holds exactly for intensional predicates. Despite the imprecision introduced in
the analysis, its results are good enough to be used to guide
the magic transformation, as we shall see later.
4.2 Dependency Graphs for Extensional Relations
The first step in computing dependency graphs for all formulas in the program (and in particular for intensional predicates) is to compute graphs for the extensional relations in
the database. This is achieved via a combination of analysis
of the data and domain-specific annotations in the database
schema. For each relation R(X1 , . . . , Xn ) in the database,
the size Σ(Xi ) is simply measured for each column Xi . Note
that Σ(Xi) is the size of the i-th column viewed as a set,
so this may be different for each column, but can be computed by a relatively inexpensive query. There are of course
no equality constraints on extensional relations, as such a
constraint would imply that the relation had two identical
columns.
Dependency information is deduced automatically in an
important special case, namely when one of the fields in the
dependency is a key for the relation. Suppose that X is a key,
and let Y be another field, with Σ(X ) = A and Σ(Y ) = B .
Then the value of Y is a function, f say, of the value of
X . Furthermore, the number of distinct values of X for a
particular value y of Y is just |f⁻¹(y)|. As the sets f⁻¹(y) for distinct y are disjoint, we have that A = Σ_y |f⁻¹(y)|, where the sum ranges over B values of y. Hence the average value of |f⁻¹(y)| is just A/B. This leads to the following observation:

    Whenever X is a key, then for all Y, the dependencies X --1--> Y and Y --α--> X, where α = Σ(X)/Σ(Y), can be deduced.
This is the source of the dependency N --3--> E in the hasName relation: as there are 63233 names for 184767 elements (and each element has exactly one name), there must be on average 184767/63233 ≈ 3 elements for each name.
Further dependencies can be deduced transitively: if X --α--> Y and Y --β--> Z, then naturally X --αβ--> Z.
Finally, other dependencies X --α--> Y, where neither X
nor Y is a function of the other, and where this cannot be
deduced by transitivity, must be added manually as an annotation. This could of course be measured, but annotation
is straightforward given domain knowledge, and the computation of these dependency factors is expensive.
In our example, the only annotation on hasName is that E
is a key, while hasSubtype is given the dependencies shown
above, obtained from measurements on typical programs.
4.3 Computing Dependency Graphs
Given the dependency graphs for extensional relations,
we now aim to compute the dependency graph of each formula in the program (giving, in particular, the dependency
graphs of intensional predicates). This is achieved by an abstract interpretation [4]: the Datalog program is evaluated
in a nonstandard semantics, where the values are replaced
with dependency graphs. In order to compute the graphs,
it therefore suffices to describe the effect of each operator
of relational algebra on dependency graphs, which we detail
below.
Each operation in the abstract interpretation is pessimistic:
whenever it is not possible to find the exact size of a relation
(without access to the data), the worst-case is assumed, so
the largest possible relation is chosen. Our analysis is therefore likely to overestimate relation sizes, though this is not
a substantial problem in our tests.
Throughout this section, we write [[φ]] for the inferred dependency graph of a formula φ, and fix two formulas φ1 and
φ2 , with [[φ1 ]] = (Σ1 , G1 , Π1 ) and [[φ2 ]] = (Σ2 , G2 , Π2 ).
Intersection. To define [[φ1 ∧ φ2 ]] = (Σ, G, Π), we make the
following observations. First, no column in φ1 ∧ φ2 is larger
than the corresponding column in φ1 or φ2 (for simplicity,
we assume that both relations have the same fields). The
worst-case scenario is that one relation is a subset of the
other, so the size of a field X is min(Σ1 (X ), Σ2 (X )).
Furthermore, as each tuple in φ1 ∧ φ2 lies in both relations, any constraints imposed by G1 or G2 still hold in the intersection. The same is true of equality constraints. As a result, we can define:

  Σ(X) = min(Σ1(X), Σ2(X))
  G    = G1 ∪ G2
  Π    = Π1 ∪ Π2
As a refinement, note that G1 ∪ G2 can contain redundant edges. For instance, if X −1→ Y ∈ G1 and X −2→ Y ∈ G2, then G contains seemingly inconsistent information. However, recall that both [[φ1]] and [[φ2]] are pessimistic approximations. Hence it is safe to pick the best of the two dependencies, in this case X −1→ Y, as this is closer to the correct estimate.
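A minimal sketch of the intersection rule, using a hypothetical (sizes, edges, equalities) triple of our own devising to represent a dependency graph:

```python
# Sketch of the abstract intersection rule [[phi1 ^ phi2]].
# A graph is (sizes, edges, eqs): column sizes, dependency labels, equalities.

def intersect(g1, g2):
    sizes1, edges1, eqs1 = g1
    sizes2, edges2, eqs2 = g2
    # Worst case: one relation is a subset of the other.
    sizes = {x: min(sizes1[x], sizes2[x]) for x in sizes1}
    # Every constraint of either input still holds; on a duplicate edge,
    # keep the smaller label, since both inputs are pessimistic overestimates.
    edges = dict(edges1)
    for e, a in edges2.items():
        edges[e] = min(a, edges[e]) if e in edges else a
    eqs = eqs1 | eqs2
    return sizes, edges, eqs

g1 = ({"X": 100, "Y": 50}, {("X", "Y"): 1.0}, set())
g2 = ({"X": 80,  "Y": 60}, {("X", "Y"): 2.0}, {("X", "Y")})
sizes, edges, eqs = intersect(g1, g2)
print(sizes, edges[("X", "Y")])  # {'X': 80, 'Y': 50} 1.0
```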
Union. The definition of [[φ1 ∨ φ2 ]] = (Σ, G, Π) is slightly
trickier. To compute the sizes of columns, we once more
make a pessimistic assumption and consider corresponding
columns in the input relations disjoint, so the size of a column is just the sum of its sizes in φ1 and φ2 . The equality
constraints that hold in the union are just those that hold
in both input relations:
  Σ(X) = Σ1(X) + Σ2(X)
  Π    = Π1 ∩ Π2
Now consider field dependencies. A dependency X −γ→ Y can only lie in the result if there are corresponding dependencies X −α1→ Y and X −α2→ Y in the input relations. Let us estimate the value γ of this dependency. To simplify notation, let Ri be the relation between X and Y defined by φi (that is, Ri is defined by πX,Y(φi)), and likewise R = R1 ∪ R2. Then γ is the average over all x ∈ dom R of |R(x)|. Pick such an x. Then the probability that x ∈ dom Ri is |dom Ri|/|dom R| ≈ Σi(X)/Σ(X). In this case, |Ri(x)| ≈ αi. If x ∉ dom Ri, then |Ri(x)| = 0. Hence in general we find that |Ri(x)| ≈ αi · Σi(X)/Σ(X). In the worst case, R1(x) and R2(x) are disjoint, and thus we define:

  γ = (Σ1(X)/Σ(X)) · α1 + (Σ2(X)/Σ(X)) · α2
The interpretation of this rule is natural: the dependency
in the union is the sum of the dependencies in φ1 and φ2 ,
per our worst-case assumption, but this sum is weighted by
the relative sizes of the columns. This guarantees that the
dependencies for a union P ∪ Q, where Q is much smaller
than P , are dominated by P .
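The size-weighted rule can be sketched in the same hypothetical triple representation; the example below shows a large P dominating a much smaller Q:

```python
# Sketch of the abstract union rule [[phi1 v phi2]].

def union(g1, g2):
    sizes1, edges1, eqs1 = g1
    sizes2, edges2, eqs2 = g2
    # Pessimistic: corresponding columns are assumed disjoint.
    sizes = {x: sizes1[x] + sizes2[x] for x in sizes1}
    # A dependency survives only if present in both inputs; its label is
    # the size-weighted sum  gamma = (S1/S)*a1 + (S2/S)*a2.
    edges = {}
    for (x, y), a1 in edges1.items():
        if (x, y) in edges2:
            a2 = edges2[(x, y)]
            s = sizes[x]
            edges[(x, y)] = (sizes1[x] / s) * a1 + (sizes2[x] / s) * a2
    return sizes, edges, eqs1 & eqs2

# A large relation P and a much smaller Q: P's dependency dominates.
p = ({"X": 9000, "Y": 5000}, {("X", "Y"): 2.0}, set())
q = ({"X": 1000, "Y": 500},  {("X", "Y"): 8.0}, set())
sizes, edges, eqs = union(p, q)
print(round(edges[("X", "Y")], 1))  # 2.6, much closer to P's 2.0 than Q's 8.0
```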
Negation. The treatment of negation is straightforward, as
there is no way in our abstract domain to record a negative
dependency. As a result, the dependency graph of a negated
formula is just the empty graph.
Cartesian Product. The dependency graph [[φ1 × φ2 ]] =
(Σ, G, Π) is simply the union of the graphs for φ1 and φ2 ,
which are independent as the sets of fields of the relations
in the cartesian product are disjoint:

  Σ(X) = Σ1(X) if X is a field of φ1, and Σ2(X) if X is a field of φ2
  G    = G1 ∪ G2
  Π    = Π1 ∪ Π2
Projection. The dependency graph for a projection πS (φ)
is simply the graph obtained from [[φ]] by removing any fields
not in S. The only subtlety is that any transitive dependencies must be kept. For instance, suppose that [[φ]] contains dependencies X −α→ Y and Y −β→ Z. Then this implies the transitive edge X −αβ→ Z, which must be recorded in the graph [[πS(φ)]], even if Y is projected out. The formal definition of this operation is unilluminating and thus omitted.
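A sketch of this rule, again over the hypothetical triple representation: the graph is closed transitively before fields are dropped, so dependencies through projected-out fields survive.

```python
# Sketch of projection: add transitive edges before dropping fields, so
# X -a-> Y, Y -b-> Z survives as X -(a*b)-> Z when Y is projected out.

def project(graph, keep):
    sizes, edges, eqs = graph
    # Close transitively first (smallest label wins), then restrict.
    closed = dict(edges)
    changed = True
    while changed:
        changed = False
        for (x, y), a in list(closed.items()):
            for (y2, z), b in list(closed.items()):
                if y == y2 and x != z:
                    if a * b < closed.get((x, z), float("inf")):
                        closed[(x, z)] = a * b
                        changed = True
    return ({x: s for x, s in sizes.items() if x in keep},
            {(x, z): a for (x, z), a in closed.items()
             if x in keep and z in keep},
            {e for e in eqs if e[0] in keep and e[1] in keep})

g = ({"X": 100, "Y": 50, "Z": 10},
     {("X", "Y"): 2.0, ("Y", "Z"): 3.0}, set())
sizes, edges, _ = project(g, {"X", "Z"})
print(edges[("X", "Z")])  # 6.0: the transitive edge survives without Y
```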
Selection with Field Equality. The graph [[σX=Y(φ)]] = (Σ, G′, Π′) is given by: Π′ = Π ∪ {X = Y} and G′ = G ∪ {X −1→ Y, Y −1→ X} to reflect the fact that the X and Y columns are equal.
columns are equal. The equality constraints in Π are used
to keep track of equalities across several operations, and are
applied to the graph after each operation, merging nodes for
equal fields in the dependency graphs.
Selection with Constant Equality. The graph [[σX=c(φ)]] = (Σ′, G, Π) is identical to the graph [[φ]] = (Σ, G, Π), but is adjusted to set the size of the X column to one, as X is a constant in the resulting relation. That is, Σ′ = Σ ⊕ {X ↦ 1}.
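Both selection rules are simple to sketch under the same hypothetical representation (the merging of nodes for equal fields, applied after each operation, is omitted here):

```python
# Sketch of the two selection rules on (sizes, edges, eqs) triples.

def select_field_eq(graph, x, y):
    """sigma_{X=Y}: record X = Y and unit dependencies in both directions."""
    sizes, edges, eqs = graph
    edges = dict(edges)
    edges[(x, y)] = 1.0
    edges[(y, x)] = 1.0
    return sizes, edges, eqs | {(x, y)}

def select_const_eq(graph, x):
    """sigma_{X=c}: X becomes a single constant, so Sigma(X) = 1."""
    sizes, edges, eqs = graph
    return {**sizes, x: 1}, edges, eqs

g = ({"X": 100, "Y": 50}, {}, set())
_, edges, eqs = select_field_eq(g, "X", "Y")
sizes2, _, _ = select_const_eq(g, "X")
print(edges[("X", "Y")], sizes2["X"])  # 1.0 1
```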
4.4 Recursive Programs
The nonstandard interpretation that we have presented
above applies to recursive programs, as well as nonrecursive
ones, and recursion can be evaluated using the usual stratified least fixpoint computation³. As such, there is no direct
need to handle recursion in a special way. However, consider the hasSubtypePlus relation defined previously (and
now written hS + for space reasons). Its definition can be
written very simply as the recursive relational algebra expression:
  hS+ = hS ∪ hS ; hS+
where ; is relational composition. This definition unfortunately yields poor results under our analysis. Consider for
instance the relation between a type and its subtype, in the
first few steps of the fixpoint computation:

  hS                SUPER −5.9→ SUB
  hS ∪ hS²          SUPER −40.7→ SUB
  hS ∪ hS² ∪ hS³    SUPER −252→ SUB
It is easy to see that the dependencies get more and more
tenuous, until some arbitrary bound is reached (to guarantee
termination), and the dependency information is forgotten.
As a result, our analysis yields no dependencies for hS + as
it stands.
This should not be considered a defect of the analysis.
In effect, the analysis result is that the transitive closure of
the graph represented by hS may be the complete graph on
the same set of nodes — which may indeed occur for some
relations (that is, the transitive closure may be a complete
graph even if the original relation is not), so without further
information about the data we cannot refine the approximation.
While necessary, this imprecision is very problematic, and prohibits optimisation in many common cases. This is because the transitive closures of many important relations (for instance the subtype relation or the child relation in an AST) are far from being complete graphs — in fact, experimental evidence shows that each type that has a subtype has on average only 15 transitive subtypes. A great deal of information is therefore lost in the analysis here.
In order to remedy this problem we introduce another annotation on extensional relations. The reason that the transitive closure of hasSubtype is relatively small is that long
chains through the graph of this relation are rare (that is,
the depth of the subtype hierarchy is limited). In fact there
are almost no chains of length more than three, and thus
these can be disregarded in the analysis. In our system, this
is encoded by introducing a well-founded order on elements
together with a maximal depth:
  depth(<sub) = 3
  hasSubtype(X, Y) ⟹ Y <sub X
The well-founded order is used in the analysis to eliminate paths of length greater than three, leading to an approximation that more closely matches the data. Several independent well-founded orders can be defined, and for queries on Java code we use two orders: the subtype order shown above, and an order reflecting the average depth of the AST.

³The abstract domain has infinite height, so this requires the use of a widening operator [4] to guarantee termination. Effectively, this imposes a maximum size on any column of a relation.
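As a rough illustration, the following simplified model of our own tracks only the SUPER-to-SUB label through the fixpoint. It replaces the size-weighted union rule by a plain sum, so the step-two value matches the 40.7 above exactly while the step-three value only approximates the 252; the point is the contrast between bounded and unbounded iteration:

```python
# Simplified model of the dependency label for hS+ = hS u (hS ; hS+).
# Composition multiplies labels; here union merely adds them (the full
# rule weights by column sizes, which this sketch ignores).

def closure_label(step, max_depth=None, iterations=10):
    label = step                      # first approximation: hS itself
    for depth in range(2, iterations + 2):
        if max_depth is not None and depth > max_depth:
            break                     # well-founded order: no longer chains
        label = step + step * label   # hS u (hS ; previous approximation)
    return label

print(round(closure_label(5.9, max_depth=2), 2))  # 40.71: step two above
print(round(closure_label(5.9, max_depth=3), 1))  # 246.1: a finite estimate
print(closure_label(5.9) > 1e6)                   # True: unbounded blow-up
```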
4.5 Estimating Size
The analysis that we have defined thus far constructs a
dependency graph for each formula in the program. However, we have not quite reached our goal of estimating the
size of the relation defined by each formula. Fortunately,
there is a surprisingly simple way of computing the size of
a formula from its dependency graph. In the first step, we
transform the graph by adding a root node •, and adding
an edge from • to each field X labelled with Σ(X ). These
edges encode the sizes of columns, as shown below in the
example of hasSubtype:
  •  −4051→  SUPER       •  −30769→  SUB
  SUPER  −5.9→  SUB      SUB  −1.1→  SUPER
Let us first make the following observation. Say the resulting
graph is G. Then:
Claim 1. If G is a tree, then the estimated size for the
relation is the product of all the edge labels in the tree.
Justification. Suppose that there are k subtrees T1 ,
. . . , Tk of the root, and that the label of the edge from • to
the root Xi of Ti is Ni . Then Size(Ti ) represents the size
of the relation projected on the fields appearing in Ti , for
each value of Xi . As there are Ni values of Xi , the size of
the projection is Ni Size(Ti ). Furthermore, there are no dependencies between the sets of fields in distinct subtrees, so
the relation must be considered to be a Cartesian product of
its components. Its estimated size is therefore Πi Ni Size(Ti ).
By induction, the size of the relation represented by the tree
is the product of its labels.
The dependency graph for hasSubtype shown above is
however not a tree. There are two ways of estimating the
size of this relation:
• There are 4051 choices for SUPER, and 5.9 values of
SUB for each, so there are 4051 × 5.9 ≈ 23900 tuples;
or
• Following the other path (30769 choices for SUB and
1.1 supertypes for each), there are 30769 × 1.1 ≈ 33846
tuples.
These different potential values reflect the inaccuracies in
our analysis. However, we have always chosen pessimistic
assumptions throughout the construction. Thus any edge
in the dependency graph must be considered to be an overapproximation (i.e. denoting a larger column). We must
therefore choose the smallest of these approximations, as
this is the closest approximation to the real value.
This gives the following result:
Claim 2. The estimated size of a relation with dependency graph G is the weight of the minimum spanning tree of G, where edge weights are composed with multiplication rather than addition.
Justification. Any subgraph of G gives an overestimate
of the size of the relation, as a subgraph has fewer dependencies to restrict the size of the relation. Hence by Claim
1, the weight of each spanning tree (with multiplication) is
an overestimate of the size of the relation. The minimum
spanning tree is thus the best of these approximations.
The minimum spanning tree of a directed graph can be
computed using Edmonds’ algorithm [7], a generalisation of
which remains applicable when multiplication, rather than
addition, is used to compute the weight of a tree [9]. The
algorithm is efficient, with complexity O(n log n) on sparse
graphs of size n [8]. This gives an effective procedure for
estimating the size of any formula in the program, completing our description of the implementation of magic-sets in
Section 3.
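Claims 1 and 2 can be checked on the hasSubtype example with a brute-force search over spanning arborescences. The compiler would use Edmonds' algorithm; this sketch, including its edge-list representation, is purely illustrative:

```python
from itertools import product

# Sketch of Claim 2: estimate relation size as the minimum-weight spanning
# arborescence rooted at the virtual node "*", composing edge labels with
# multiplication. Brute force is fine for these tiny graphs.

def estimate_size(edges, root="*"):
    nodes = ({y for (x, y, w) in edges} |
             {x for (x, y, w) in edges if x != root})
    in_edges = {n: [(x, y, w) for (x, y, w) in edges if y == n]
                for n in nodes}
    best = None
    for choice in product(*in_edges.values()):
        # One incoming edge per node; it is an arborescence iff every
        # node's parent chain reaches the root (no cycles).
        parent = {y: x for (x, y, w) in choice}
        if all(reaches_root(parent, n, root) for n in nodes):
            weight = 1.0
            for (_, _, w) in choice:
                weight *= w
            if best is None or weight < best:
                best = weight
    return best

def reaches_root(parent, n, root):
    seen = set()
    while n != root:
        if n in seen:
            return False
        seen.add(n)
        n = parent[n]
    return True

# The hasSubtype graph: column sizes 4051 (SUPER) and 30769 (SUB),
# dependencies SUPER -5.9-> SUB and SUB -1.1-> SUPER.
edges = [("*", "SUPER", 4051), ("*", "SUB", 30769),
         ("SUPER", "SUB", 5.9), ("SUB", "SUPER", 1.1)]
print(round(estimate_size(edges), 1))  # 23900.9: the smaller estimate above
```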
4.6 Discussion
Our dependency analysis does not consider the data in
relations (apart from the annotations it reads in) and makes
worst-case assumptions whenever necessary. While this can
lead to imprecision in general, this is mitigated by the use of
types in our system, not illustrated above for space reasons.
Each field is given a type (for instance, @type), as described
above, which allows several refinements to the basic analysis:
1. Each type has a size, given by the number of elements
of that type in the database. This can be used in the
analysis: a column of type T can have size at most the
size of T .
2. Types can be tested for disjointness — for instance,
@type and string are known to be disjoint. This again
improves the construction of dependency graphs, in
particular in the union rule.
Despite this, the dependency analysis remains an approximation, and it is always possible to construct examples that
make it inaccurate. This is common to all such analyses and
should not be unduly worrying. To evaluate the analysis
more precisely, we compared the measured and estimated
relation sizes for the 197 intermediate relations created in
the evaluation of our benchmark suite (on the JHotDraw
dataset, described in Section 6). We exclude extensional relations, as their size is known, and the query predicate of
each program, as its size in never used. The relative error
of the estimated size is shown below:
  Relative error   0–0.25   0.25–0.5   0.5–1   1–2   2–5   >5    Invalid
  Proportion       35%      8%         5%      6%    10%   12%   25%
The results highlight both good and bad results of the size
analysis. First, the size of 48% of all relations is found to
within a factor of two, showing that the size analysis is highly
effective for a substantial subset of relations. There is however a nontrivial set of relations (22%) with relatively poor
results, due to a variety of factors, in particular the fact that
our analysis assumes that data is uniformly distributed. The
last column, Invalid (25%), represents relations using special features such as aggregates and arithmetic operations
which are not handled in the analysis, and thus are given an
infinitely large size by the analysis as a default. These results demonstrate that it is possible to obtain accurate size
results with minimal information about the data, and furthermore justify our approach, while showing that further
improvements are possible.
Finally, we rely on the database schema designer to provide annotations from domain-specific knowledge to guide
the analysis, although some values (in particular individual
column sizes) are measured. While the information encoded
in annotations could, for the most part, be measured, this
would be prohibitively expensive. The annotation burden
has proved light, as for a total of 167 fields there are 20
dependency annotations and 46 order annotations, and the
entire database schema, complete with annotations, is available [25].
5. OTHER OPTIMISATIONS
The magic-sets transformation, together with our size-based heuristics, is an effective optimisation for Datalog programs, as we will show in the next section. However, this
optimisation is performed in the context of all the other
transformations that our compiler performs. It is therefore
crucial to examine the interaction between these optimisations. To this end, we offer a brief outline of the main optimisation techniques that are used to improve performance.
Type-Based Optimisations. A set of crucial optimisations
performed by our compiler uses type information to optimise programs. Recall that we use a type inference procedure for Datalog, which refines the types of Henriksson
and Maluszyński [12] with information about dependencies
between fields. This type system is described in a technical report [5], but we are chiefly concerned with its use in
optimising programs here and so do not give a detailed description. Consider as an example the following program:
declaresMember(E : @type, M : @member) :-
    declaresMethod(E, M) ;
    declaresField(E, M) ;
    declaresConstructor(E, M) ;
    declaresType(E, M).

query(A : @type, B : @type) :-
    declaresMember(A, C),
    hasSubtype(C, B).
The precise nature of the extensional predicates used here matters little, but the intuitive meaning should be clear: the query finds all subtypes of nested types, using the library predicate declaresMember. This predicate is declared in full generality in the library, but here is only used in a context where M is a type. The type inference algorithm deduces that all disjuncts except the last in the definition of declaresMember are empty in this context, and the program can be rewritten as:
declaresMember(E : @type, M : @type) :-
    declaresType(E, M).

query(A : @type, B : @type) :-
    declaresMember(A, C),
    hasSubtype(C, B).
This example illustrates context-sensitive type specialisation,
simply called type specialisation in the sequel. This finds the
type with which a predicate is used, and restricts it appropriately by eliminating empty disjuncts. We will evaluate
this optimisation in combination with magic in the next section. It should be noted that a weaker, intra-query (contextinsensitive) variant of this optimisation is always performed,
but in our results type specialisation is taken to mean the
context-sensitive version.
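The effect of type specialisation on the example can be sketched as follows. The type lattice is reduced to a single hypothetical disjointness fact, and the name @member_nontype is invented for illustration; the compiler's actual representation differs:

```python
# Sketch of context-sensitive type specialisation: disjuncts whose declared
# result type is disjoint from the calling context are eliminated.

DISJOINT = {frozenset({"@type", "@member_nontype"})}  # hypothetical lattice

def specialise(disjuncts, context_type):
    """Keep only disjuncts whose result type can overlap the context."""
    return [(name, t) for (name, t) in disjuncts
            if frozenset({t, context_type}) not in DISJOINT]

disjuncts = [("declaresMethod",      "@member_nontype"),
             ("declaresField",       "@member_nontype"),
             ("declaresConstructor", "@member_nontype"),
             ("declaresType",        "@type")]

# In the query, M is used as a @type, so only declaresType survives.
print(specialise(disjuncts, "@type"))  # [('declaresType', '@type')]
```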
Types are further used to eliminate redundant joins. For instance, in the expression

    hasSubtype(C, B), @type(C)

the literal @type(C) is removed, as it is implied by the preceding hasSubtype(C, B). Together, type specialisation and this type erasure drastically simplify Datalog programs.

Caching. The other main technique that is used to improve query times is caching. This is not a Datalog transformation, but rather exploits commonalities between queries. Queries written in .QL use a vast domain-specific library of classes (in our example, this library defines abstractions over Java programs). In the Datalog program this translates into a number of intensional predicates that are common between several programs, though of course each program uses only a small subset of these predicates. Caching avoids the recomputation of these intermediate relations by storing their results from one query to the next.

Caching is effective at reducing the total time it takes to execute a group of queries. However, there is a clear tension between caching and other optimisations: magic-sets and type specialisation create specialised versions of intensional predicates, reducing the commonality between successive queries. It is thus important to investigate this interaction, as shown in our experimental results.

Other Optimisations. Finally, a number of simple other transformations are performed by the optimiser, and have a substantial impact on query time. Inlining avoids the creation of many intermediate tables whenever this does not lead to duplicated work. Constant propagation can be seen as a restricted form of magic, rewriting predicates that are called in a context in which one or more of their arguments are constant. Finally, logical simplifications are performed on the program, for instance p ∨ (p ∧ q) = p. We will not give detailed performance measurements for these optimisations, but it is worth noting that magic-sets and type specialisation operate on a program that has already been simplified considerably by the simpler passes.

6. EXPERIMENTAL RESULTS

Our optimisations were evaluated on the suite of currently 111 queries over Java programs that are included in the free version of our Java analysis tool. The .QL sources for these queries are publicly available [23]; they were not selected for this paper, but rather designed to be of use to most programmers.

The queries break down into two sets, run against different datasets. A set of 74 queries is run over the JHotDraw database, representing the open-source Java program of the same name, together with the Java libraries. A further 37 queries are run over Bonita (again from the eponymous program), as these queries are specific to the J2EE framework used by Bonita. These datasets are substantial, with the following table sizes:

  Dataset     Tables   Largest table (rows)   Total rows
  JHotDraw    48       1254881                2755885
  Bonita      48       2107438                4614460

Figure 5: Total Query Time (111 Queries) [bar chart: total time in seconds for PGSQL and MSSQL under each combination of optimisations (None, Magic, TS, Magic+TS), with caching off and caching on]

All queries were run on two different database backends: PostgreSQL 8.2 running on Debian Linux 2.6 and Microsoft
SQL Server 2005 running on Windows Server 2003. Both
ran on a quad-core 2.66GHz Intel Xeon machine (64-bit)
with 16GB of RAM.
We measure the impact of the three main optimisations
performed by the system: magic-sets, (context-sensitive)
type specialisation and caching. The latter two optimisations are presented so that the impact of magic-sets may
be compared to their impact, and to assess the interaction of these optimisations. All other optimisations (such
as constant propagation, context-insensitive type specialisation and inlining) are always enabled.
Total Time. Figure 5 gives the total running time of the
suite of 111 queries on PostgreSQL (PGSQL) and Microsoft
SQL Server (MSSQL). The suite was run once for each combination of the optimisations, resetting any caches kept between runs to guarantee their independence. The graph is
split into two parts: on the left, caching is disabled, while the
times with caching enabled are given on the right-hand side.
Labels correspond to optimisations being applied, where
“TS” represents type specialisation.
The results of Figure 5 demonstrate first of all that magic-sets yields a substantial speedup (1.65x faster under PGSQL,
1.62x faster under MSSQL). Type specialisation also emerges
as an even more effective optimisation, with speedups of
3.13x and 2.71x. Finally, it is pleasing that these optimisations are essentially independent: applying both yields a
speedup of 5x (4.9x with MSSQL), close to the product of
their individual speedups (5.2x and 4.4x respectively).
Our results further support the claim that the optimisation is essentially backend-independent, as the results for
PostgreSQL and SQL Server are practically identical (though
SQL Server is faster overall, the impact of optimisations
is the same). However, magic-sets appears to be effective
only for substantial datasets: similar experiments with small
datasets under H2 (as a limited in-memory database, H2
cannot handle the amount of data used here) showed little
or no improvement from magic.
As suggested previously, caching is extremely effective on
such large sets of queries, showing that there is a great
deal of commonality between queries in the set. Furthermore, applying one of magic-sets or type specialisation worsens performance, due to the fact that these optimisations
specialise intensional predicates which could otherwise have
been shared between queries. Together, however, the transformations yield a significant improvement (27% for PGSQL, 22% for MSSQL) even over the cached run.

Figure 6: Magic Speedup: per-query (PGSQL) [per-query speedup factors, sorted; y-axis from 0 to 15]
Individual Query Times. The results of Figure 5 are encouraging, but as we have discussed previously, part of the motivation of this work was to avoid incurring a performance penalty on any individual query, not just to achieve an average-case improvement.
Figure 6 shows the speedup provided by the magic transformation for each individual query, on PostgreSQL. A speedup
of two indicates that the query ran twice as fast, while a
value less than one indicates a slowdown. Caching is disabled, as it is impossible to measure individual query times
reliably in the presence of caching, but all other optimisations are enabled. That is, we measure the specific impact
of magic in the presence of all optimisations performed by
the system, including type specialisation. Results are sorted
by the speedup value.
The results in Figure 6 show that only two queries are substantially penalised by magic-sets: the first query suffers a slowdown of 18%, and the second a slowdown of 6%. This is a result of our size heuristics: a good order
of formulas is chosen, and the magic transformation is only
applied where it is clearly beneficial, avoiding performance
hits. However, this does not affect the usefulness of magic
on queries that make use of context information: out of the
111 queries, 34 run at least twice as fast, while 10 run five
times as fast with magic enabled. The median speedup is
1.6x, which is consistent with the total query times reported
previously. The results for SQL Server for individual queries
were roughly in line with the results for PostgreSQL, as in the
case of total time.
7. RELATED WORK
The magic-sets transformation was introduced as a highly
effective implementation strategy for Datalog programs [2].
It is known that magic-sets is optimal in the sense that it
requires as few facts to be evaluated as possible [3]. Mumick
et al [19] give experimental results to justify the power of
the magic-sets technique in practice, and further note that
magic-sets can degrade performance in some cases.
The most widespread implementation of magic-sets to date
is likely the implementation in IBM DB2, based on its implementation in Starburst [20]. This implementation is careful
to ensure that magic never degrades performance by only
performing magic if the resulting query is predicted to be
shorter, but as a result may miss opportunities to apply
magic. The order produced by the database optimiser is
used as the order for the magic transformation, so there is
no specific reordering heuristic.
The cost-based optimisation strategy for magic-sets described by Seshadri et al [26] is closest to our own approach.
A key difference is that unlike our approach, this relies on
interaction with the database optimiser to choose a good
SIPS (observing that no fixed choice of SIPS is optimal
for all queries). Indeed, an important contribution of this
work is to show how magic rewriting can be integrated in
a cost-based optimisation framework. However, as a result
this is not applicable to our setting, as we implement magic
as a transformation on Datalog programs without reference
to the database backend. Seshadri et al give experimental results based on variations on a single query that show
speedups of up to 6x, consistent with our findings, and slight
slowdowns in the worst case. By contrast with [26], our
benchmarks consist of a large variety of useful queries.
The NAIL! system features an algorithm for choosing the
order of subgoals [18], which is related to our reordering
algorithm. This is based on a “most-bound-first” heuristic, favouring subgoals with the most bound variables. Our
reordering procedure indirectly applies a similar heuristic
(most-bound subgoals are small, and hence moved to the
front), but allows finer control thanks to the ability to take
relation sizes into account. In particular, our approach allows the fine-grained selection of formulas in the magic set
(Figure 3), and allows the set of bound variables to be computed more precisely (Figure 4).
CORAL [22] applies the magic-sets transformation using
the order in which subgoals were written by the programmer.
Annotations may be introduced to control applicability of
magic-sets. This system offers the finest control over the magic
transformation, but requires a great deal of expertise from
the programmer to wield effectively.
It should be noted that the use of Datalog for program
queries has become increasingly widespread [11, 15, 28], giving rise to a substantial library of highly complex queries.
It is often remarked that magic-sets can be of great use
for such queries, and indeed systems such as bddbddb [28]
with this target application implement magic-sets. However,
our system is the first to date to implement reordering and
heuristics to guarantee robustness of performance under this
transformation, which is crucial to allow the use of libraries
of abstractions.
Our approach to estimating the sizes of relations in Datalog programs differentiates itself from earlier work by the use
of dependency information between distinct columns. In our
datasets this has proved crucial in producing accurate size
estimates. Krishnamurthy et al [14] describe cost-based optimisation of deductive queries (without recursion), and take
into account join strategies and other measures beyond size,
but cannot make use of domain-specific information as in our
approach. Swami [27] likewise shows that cardinality may be
used with other measures to optimise deductive queries, but
does not detail how the cardinality of complex expressions
might be computed. Our analysis of recursive queries using
well-founded orders is effective enough, but requires a small
number of annotations from the database schema designer.
Lipton and Naughton [16] give a method for estimating the
sizes of some recursive queries more precisely without annotations, using sample data from the relation instead, which
we do not have access to in our setting.
8. CONCLUSION
We have described a novel algorithm for applying the
magic-sets transformation effectively on Datalog programs.
The central issue is the choice of a sideways information-passing strategy (SIPS), and we show that estimates of the
sizes of relations represented by Datalog formulas can be
used to make this choice. The size estimate is obtained from
an abstract interpretation of the Datalog program to track
dependencies between values of distinct columns. This requires some annotations on the database schema, though the
annotation effort is kept minimal, together with inexpensive
and automatic analysis of the database contents. Sizes of
relations can be derived from the dependency information
uncovered by our analysis with good results.
The use of our SIPS algorithm was evaluated on the set
of 111 queries that ship with our Java analysis tool [24] and
on datasets representing substantial programs. The average
performance improvement due to magic over all queries was
a factor of 1.6, while some individual queries ran up to 5–
10 times faster. A crucial requirement was that the choice
of SIPS should guarantee that magic-sets never degraded
performance significantly, which was achieved as the worst slowdown experienced (on one query) was 18%. These figures reflect the use of magic-sets within an optimising compiler performing many other optimisations.
Our results demonstrate that the SIPS algorithm presented here makes magic-sets both effective and robust, and
thus can be applied automatically in a compiler with no need
for expertise on the part of the query writer. An important
direction of future work is the use of the dependency analysis defined here to guide other optimisations on Datalog
programs, leading to a cost-based Datalog optimisation engine independent of the database backend that programs are
eventually executed on.
9. REPEATABILITY ASSESSMENT RESULT
Figures 5 and 6 have been verified by the SIGMOD repeatability committee.
10. REFERENCES
[1] I. Balbin, G. S. Port, K. Ramamohanarao, and
K. Meenakshi. Efficient bottom-up computation of
queries on stratified databases. Journal of Logic
Programming, pages 195–344, November 1991.
[2] François Bancilhon, David Maier, Yehoshua Sagiv,
and Jeffrey D. Ullman. Magic sets and other strange
ways to implement logic programs. In SIGMOD
Conference, pages 1–16. ACM, 1986.
[3] Catriel Beeri and Raghu Ramakrishnan. On the power
of magic. In Symposium on Principles of Database
Systems (PODS), pages 269–284, 1987.
[4] Patrick Cousot and Radhia Cousot. Abstract
interpretation: A unified lattice model for static
analysis of programs by construction or approximation
of fixpoints. In Symposium on Principles of
Programming Languages (POPL), pages 238–252.
ACM Press, 1977.
[5] Oege de Moor, Damien Sereni, Pavel Avgustinov, and
Mathieu Verbaere. Type inference for datalog and its
application to query optimisation. In Maurizio
Lenzerini, editor, Proceedings of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems, June 2008. To appear.
[6] Oege de Moor, Damien Sereni, Mathieu Verbaere,
Elnar Hajiyev, Pavel Avgustinov, Torbjörn Ekman,
Neil Ongkingco, and Julian Tibble. .QL:
Object-oriented queries made easy. In Generative and
Transformational Techniques for Software Engineering
(GTTSE ’07), 2007.
[7] Jack Edmonds. Optimum branchings. Journal of
Research of the National Bureau of Standards — B.
Mathematics and Mathematical Physics,
71B(4):233–240, October–December 1967.
[8] H. N. Gabow, Z. Galil, T. Spencer, and R. E. Tarjan.
Efficient algorithms for finding minimum spanning
trees in undirected and directed graphs.
Combinatorica, 6(2):109–122, 1986.
[9] Leonidas Georgiadis. Arborescence optimization
problems solvable by Edmonds' algorithm. Theoretical
Computer Science, 301(1–3):427–437, May 2003.
[10] H2 Database Engine. Website with documentation
and downloads. http://www.h2database.com, 2007.
[11] Elnar Hajiyev, Mathieu Verbaere, and Oege de Moor.
CodeQuest: scalable source code queries with Datalog.
In Proceedings of ECOOP, volume 4067 of LNCS,
pages 2–27. Springer, 2006.
[12] Jakob Henriksson and Jan Małuszyński. Static
type-checking of Datalog with ontologies. In Principles
and Practice of Web Reasoning, volume 3208 of
LNCS, pages 76–89, 2004.
[13] JFreeChart. Website with documentation and
downloads. http://www.jfree.org/jfreechart/,
2007.
[14] Ravi Krishnamurthy, Haran Boral, and Carlo Zaniolo.
Optimization of nonrecursive queries. In VLDB, pages
128–137, 1986.
[15] Monica S. Lam, John Whaley, V. Benjamin Livshits,
Michael C. Martin, Dzintars Avots, Michael Carbin,
and Christopher Unkel. Context-sensitive program
analysis as database queries. In Symposium on
Principles of Database Systems (PODS), pages 1–12.
ACM Press, 2005.
[16] Richard J. Lipton and Jeffrey F. Naughton.
Estimating the size of generalized transitive closures.
In Peter M. G. Apers and Gio Wiederhold, editors,
Proceedings of the Fifteenth International Conference
on Very Large Data Bases, pages 165–171. Morgan
Kaufmann, 1989.
[17] Microsoft. SQL server website with documentation
and downloads. http://www.microsoft.com/sql,
2007.
[18] Katherine A. Morris. An algorithm for ordering
subgoals in NAIL! In Symposium on Principles of
Database Systems (PODS), pages 82–88, New York,
NY, USA, 1988. ACM.
[19] Inderpal Singh Mumick, Sheldon J. Finkelstein, Hamid
Pirahesh, and Raghu Ramakrishnan. Magic is relevant.
In SIGMOD Conference, pages 247–258, 1990.
[20] Inderpal Singh Mumick and Hamid Pirahesh.
Implementation of magic-sets in a relational database
system. In SIGMOD Conference, pages 103–114, 1994.
[21] PostgreSQL. Documentation and downloads.
http://www.postgresql.org, 2007.
[22] Raghu Ramakrishnan, Divesh Srivastava,
S. Sudarshan, and Praveen Seshadri. The CORAL
deductive system. VLDB J., 3(2):161–210, 1994.
[23] Semmle Ltd. Collection of .QL queries used in
benchmarks, 2007. http:
//semmle.com/benchmarks/defaultqueries.tar.gz.
[24] Semmle Ltd. Company website with free downloads,
documentation, and discussion forums.
http://semmle.com, 2007.
[25] Semmle Ltd. Database schema with annotations, 2007.
http:
//semmle.com/benchmarks/semmlecode.dbscheme.
[26] Praveen Seshadri, Joseph M. Hellerstein, Hamid
Pirahesh, T. Y. Cliff Leung, Raghu Ramakrishnan,
Divesh Srivastava, Peter J. Stuckey, and S. Sudarshan.
Cost-based optimization for magic: Algebra and
implementation. In SIGMOD Conference, pages
435–446, 1996.
[27] A. Swami. Optimization of large join queries:
combining heuristics and combinatorial techniques. In
SIGMOD Conference, pages 367–376, New York, NY,
USA, 1989. ACM.
[28] John Whaley, Dzintars Avots, Michael Carbin, and
Monica S. Lam. Using Datalog and binary decision
diagrams for program analysis. In Proceedings of
APLAS, volume 3780 of LNCS, pages 97–118.
Springer, 2005.