A Formal Framework for Database Sampling

Jesús Bisbal
Department of Technology, Universitat Pompeu Fabra, Spain
Jane Grimson
Department of Computer Science, Trinity College Dublin, Ireland
David Bell
School of Information and Software Engineering, University of Ulster, United
Kingdom
Abstract
Database sampling is commonly used in applications like data mining and approximate query evaluation in order to achieve a compromise between the accuracy of
the results and the computational cost of the process. The authors have recently
proposed the use of database sampling in the context of populating a prototype
database, that is, a database used to support the development of data-intensive applications. Existing methods for constructing prototype databases commonly populate the resulting database with synthetic data values. A more realistic approach
is to sample a database so that the resulting sample satisfies a predefined set of
integrity constraints. The resulting database, with domain-relevant data values and
semantics, is expected to better support the software development process. This
paper presents a formal study of database sampling. A Denotational Semantics description of database sampling is first discussed. Then the paper characterises the
types of integrity constraints that must be considered during sampling. Lastly, the
sampling strategy presented here is applied to improve the data quality of a (legacy)
database. In this context, database sampling is used to incrementally identify the
set of tuples which are the cause of inconsistencies in the database, and therefore
should be the ones to be addressed by the data cleaning process.
Key words: data cleaning, database sampling, integrity constraints, legacy
systems, denotational semantics, undecidability
Preprint submitted to Elsevier Science
8 March 2005
1 Introduction
Database sampling is commonly used in a wide range of application areas
when using the entire database is not cost-effective or necessary. Examples
include data mining [16] and approximate query evaluation [19]. Recently,
database sampling has also been proposed [9,6] as a mechanism to populate a
Prototype Database, that is, any database used to support the development of
data-intensive applications (e.g. requirements elicitation, software testing, experimentation with design alternatives). This approach is justified by the fact
that increasingly more software development projects consist of extending or
enhancing existing systems, as opposed to developing new ones [3,12]. Legacy
information systems migration [8] and Web-enabling existing systems are examples where operational data can be expected to be available at development
time, and hence could be used to populate a prototype database.
To the best of the authors’ knowledge the vast majority of existing approaches
to database sampling use some type of random distribution to select the data
that must be included in the resulting database, termed here Sample Database. The research reported in this paper has followed a different approach by
selecting the data items so that the resulting Sample Database satisfies predefined criteria, generally the satisfaction of sets of integrity constraints. The
Sample Database which results from this process is referred to here as Consistent Sample Database. A preliminary study of Consistent Database Sampling
considering a set of integrity constraint types commonly used in practice (e.g.
inclusion dependencies, cardinality constraints, subset dependencies) was presented in [9]. The particular case of sampling with functional dependencies
was analysed in [6]. Database sampling was described in [5] abstracting the
details of particular integrity constraint types. The design of a generic consistent database sampling tool was also reported in [5]. The present paper
develops a formal framework where Consistent Database Sampling can be
precisely defined. It should be noted that random sampling is a particular
case of Database Sampling within this framework.
The remainder of this paper is organised as follows. The next section reviews
the existing literature on database sampling in the context of database prototyping. Section 3 then introduces some background concepts used throughout
the paper. Section 4 presents a semantic description of Consistent Database
Sampling using the Denotational Semantics formalism. These semantics will
define how Consistent Database Sampling can be used as a data cleaning
strategy. Section 5 defines and characterises a particular type of integrity constraint, referred to as Sampling-Irrelevant, which can be ignored during sampling as this cannot lead to an inconsistent Sample Database. The final section
summarises the paper and indicates the future directions of this research.
2 Related Work
In the context of data-intensive applications prototyping, existing approaches
can be classified into those for database performance evaluation, and those
for requirements analysis. The main goal of the former is to generate large
amounts of data, without including specific semantics. The objective of the
latter is to produce a database which is highly similar to the database it models, and therefore the focus is on the semantic contents of the resulting database.
Most of the existing approaches to database prototyping populate the database
with synthetic data values, not necessarily related to the application domain at
hand (e.g. 'name1', 'name2', ... as values for an attribute representing persons' names).
Gray [14] reports on a database prototyping method for database performance
evaluation. The data values are strictly randomly generated using several probability distributions. The approach presented in [4] is also concerned with
database performance evaluation. It shows that generic database benchmarking tests prove unsatisfactory for high-performance databases because their
underlying schemes are too simplistic and the data volumes being considered too small.¹ The paper by Bates [4] proposes an alternative, more realistic, benchmark model suitable for large parallel databases, as well as a toolkit used to generate database prototypes based on more realistic requirements, in terms of workload,
semantic information, and data values.
Some approaches have been proposed to build prototype databases with significant semantic information, always in terms of the set of integrity constraints
being satisfied by the resulting database. The general mechanism for populating these databases involves inserting data values into the database and then
testing-and-repairing this data by adding/deleting data items so that it eventually meets the specified constraints. Most existing solutions deal with a very
reduced set of constraint types, typically functional dependencies, referential
integrity constraints, or inclusion dependencies. Noble [22] describes one such
method, considering referential integrity constraints and functional dependencies, and populating the database mainly with synthetic values, although the
user can also enter a list of values to be used as the domain for an attribute.
A notable exception is found in [21], where a subset of First-Order Logic (FOL) is used to define the set of constraints that the generated test data must meet, thus allowing complex constraints to be defined.

¹ Benchmarks have been updated since this paper was published. The latest version of the benchmark referred to, TPC-C version 5 (at http://www.tpc.org/), has been effective since 26 February 2001. This new version increases the data volumes according to technological advances. However, the business model still enforces a significantly simpler set of integrity constraints than that used in [4], so the claim of excessive simplicity made by Bates can still be considered to hold.
A different database prototyping approach is presented in [28]. This method
firstly checks for the consistency of an Extended Entity-Relationship (EER)
schema defined for the database being prototyped, considering cardinality constraints only. Once the design has been proved consistent, a Test Database is
generated. To guide the generation process, a so-called general Dependency
Graph is created from the EER diagram. This graph represents the set of referential integrity constraints that must hold in the database and which is used
to define a partial order between the entities of the EER diagram. The test
data generation process populates the database entities following this partial
order. Löhr-Richter [18] further develops the test data generation step used
by this method, and recognises the need for additional information in order
to generate relevant data for the application domain.
Tucherman [25] presents a database design tool which prototypes a database
also based on an Entity-Relationship Schema. Special attention is paid in this
contribution to what are called restrict/propagate rules, that is, to enforcing
referential integrity constraints. When an operation is going to violate one such
constraint, the system can be instructed to either block (restrict) the operation
so that such violation does not occur or to propagate it to the associated tables
by deleting or inserting tuples as required. No explicit reference is made as to
how the resulting database would be populated for prototyping, although it
identifies the possibility of interacting with the user when new insertions are
required, as the user may be the only source for appropriate domain-relevant
data values.
Mannila [20] describes a mechanism for efficiently populating a database relation so that it satisfies exactly a predefined set of functional dependencies.
Such a relation can be used to assist the database designer in identifying
the appropriate set of functional dependencies which the database must satisfy. Since this relation satisfies all the required dependencies and no other
dependency, it can be seen as an alternative representation for the dependencies themselves. A relation generated using this method is expected to expose
missing or undesirable functional dependencies, and the designer can use it
to iteratively refine the database design until it contains only the required
dependencies.
The reader is referred to [7] for a more extended review of existing literature on
database prototyping, including an evaluation framework where they can be
analysed and compared. Finally, it must be noted that a vast literature exists
on database sampling for a large diversity of applications other than database
prototyping (e.g. data mining, approximate query processing, query optimization) [19,16,23,10]. All these approaches rely on randomly selecting the data items to be included in the sample. For the purposes of this paper, they can be classified into the same category, namely the construction of sample databases with limited semantic information, and therefore they are not considered appropriate to support the information system development process. This is why the related work presented here does not include an extensive review of this line of work, as it is tangential to the research being reported.
3 Background
Section 3.1 introduces the notation used throughout the paper. The reader
familiar with standard terminology for the relational model [1] may skip this
section, as it is presented here for completeness only. Section 3.2 provides a
brief introduction to the field of Denotational Semantics. The objective is to
outline and justify the steps involved when providing a Denotational Semantic
description, as will be done in Section 4. Refer to [24,2] for a more extended
treatment of this topic.
3.1 Relational Database Model
Assume there is a set of attributes U, each with an associated domain. The domain for attribute A is denoted by DOMA. DOM denotes the union of all domains, DOM = ∪A∈U DOMA.
A relation schema R over U is a subset of U. A database schema D =
{R1, ..., Rn} is a set of relation schemas. A relation r over the relation schema R is a set of R-tuples, where an R-tuple is a mapping from the attributes of R to their domains. R-tuples may also be referred to simply as tuples. A database instance is a set of database relations. Only finite database instances are considered here. The relation r to which an R-tuple belongs will always be clear
from the context. The value of an attribute, A ∈ U, for a particular R-tuple,
t1 , is denoted t1 (A).
An integrity constraint is an assertion about a database instance. A database
instance I satisfies an integrity constraint σ, denoted I |= σ, if the assertion
is true for this instance [13]. Here, integrity constraints are assumed to be
Relational Calculus expressions [1], that is, first-order-logic formulas without
functions and without free variables. Integrity constraints may also be referred to simply as constraints.
3.2 Denotational Semantics and λ-Calculus
Denotational Semantics is a formal method initially introduced to define the
semantics of programming languages unambiguously. It describes how a syntactic object (i.e. a program written according to a specified syntax) can be
mapped into some abstract value. The semantics of the space of abstract values is considered defined and, from it, the semantics of a program is also
defined. As programs are considered to represent (or denote) a function, a
denotational semantics of a programming language maps programs written in
that language into the functions they denote.
The formalism used to define these mappings is the λ-Calculus [15]. This
is a formal system for the study of functions, their definition, application
and manipulation. An expression in this formalism, a λ-expression, such as
λx1 , . . . , xn .E denotes a function that takes argument values v1 , . . . , vn and
returns the result of evaluating expression E with all xi ’s replaced by the
corresponding vi ’s, 1 ≤ i ≤ n. In this formalism, functions are treated as
first-class objects and can, themselves, be arguments to other functions or
the result of a function. In the example above, expression E may itself be
another λ-expression. As an example consider λx.λy.xy, which denotes a function that takes x and returns the function g(y) = xy; applied to two arguments in turn, it is equivalent to f(x, y) = xy.
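For readers more used to programming notation, this nesting of λ-abstractions corresponds to returning a function as a result. The following is a minimal illustrative Python sketch (not part of the paper), reading the body xy as the application of x to y:

# λx.λy.xy as nested lambdas: a function of x whose result is itself a function of y.
f = lambda x: lambda y: x(y)

double = lambda n: 2 * n
print(f(double)(21))   # 42: f(double) returns a function, which is then applied to 21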
The λ-expressions used in denotational semantics are usually recursive, that
is, the function being defined is used in its own definition. As an example, the
factorial function could be written in λ-notation as
Fact := λn. if n = 0 then 1 else n × Fact(n − 1)
Recursive definitions describe an equation which must be solved in order to
understand what they represent. In the previous example the equation must
be solved for F act. Like any other equation, recursive definitions may have
one, none, or many solutions. A definition with more than one solution is
ambiguous as it is not clear what it stands for. In contrast, if it has no solution
it cannot be said to stand for anything at all. A theory [24] has been developed
to ensure that such definitions have exactly one natural solution. According
to the theory underlying Denotational Semantics, these equations must be
defined over domains that have the structure of a Complete Lattice. A domain
D is a Complete Lattice if:
(1) A partial order, ⊑, can be defined between its elements. ⊑ defines a partial order if it is:
(a) Transitive: if x ⊑ y and y ⊑ z then x ⊑ z.
(b) Antisymmetric: if x ⊑ y and y ⊑ x then x = y.
(c) Reflexive: ∀x ∈ D, x ⊑ x.
(2) There is a least element, ⊥ (bottom), according to this partial order: ∀x ∈ D, ⊥ ⊑ x.
(3) Each ascending chain x1 ⊑ x2 ⊑ x3 ⊑ ··· has a least upper bound, denoted by x*. An element, u ∈ D, is an upper bound of the chain if it satisfies xi ⊑ u for each element xi in the chain. An element x* is a least upper bound if x* ⊑ u for each upper bound u.
A complete lattice is denoted by a pair ⟨D, ⊑D⟩, where D is the domain of elements and ⊑D a partial order defined between the elements of this domain. It can be shown [2] that given two complete lattices ⟨D1, ⊑D1⟩ and ⟨D2, ⊑D2⟩, their direct product ⟨D1 × D2, ⊑D1×D2⟩ can also be defined as having a complete lattice structure. Similar results can be obtained for ⟨D1 → D2, ⊑D1→D2⟩ [2].
All these results will be used later in Section 4. The natural solution that is
shown to always exist if complete lattices are used as domains in λ-expressions
is defined as the least fixed point of the function defined by a λ-expression.
A fixed point of function f (x) is a value y such that f (y) = y. It is a least
fixed point if it is the smallest of all fixed points according to the partial order
defined in the domain of x.
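To make the fixed-point reading concrete, the following illustrative Python sketch (not from the paper) iterates the functional associated with the Fact definition above, starting from the everywhere-undefined function that plays the role of ⊥:

def F(fact):
    # One unfolding of the recursive definition, built on the previous approximation.
    return lambda n: 1 if n == 0 else n * fact(n - 1)

def bottom(n):
    # The least element of the function domain: defined on no input.
    raise ValueError("undefined")

approx = bottom
for _ in range(10):          # the i-th approximation is defined for n = 0 .. i-1
    approx = F(approx)

print(approx(5))             # 120: the approximations converge towards Fact, the least fixed point of F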
Denotational semantics has also been applied in contexts not related to programming languages. For example, Widom [27] used this formalism to define
the semantics of the Starburst rule system. In this case the semantics denote
a mapping from the set of database instances, together with a set of rules and
operations on the database, into the set of database instances. These semantics indicate how the initial instance is changed due to the set of operations
and the set of rules. The result of this denotational semantics is a database
instance, which implies that it is understood what a database instance is, and
the objective is to define what the combination of database instance, rules and
operations means.
A similar idea will be explored in Section 4. The objective is to give a formal
semantic description of Consistent Database Sampling. In order to do so, the
domains involved in this description must be shown to be Complete Lattices.
Only then is it possible to ensure that the λ-expressions that will be defined
are consistent and not ambiguous, that is, they will have one natural solution.
This solution will be the intended meaning for Consistent Database Sampling.
The semantics will map sets of database instances and integrity constraints
into sets of database instances. These semantics will indicate how a Sample
from the initial instance can be extracted according to a set of constraints. As
before, the semantics of a database instance is understood, and the objective
is to define what it means, for sampling, to have a database instance which
satisfies a set of constraints.
4 Semantics of Consistent Database Sampling
The formal semantics for Consistent Database Sampling presented here complements the informal developments of [9,5,6]. Briefly, these papers provided
algorithmic descriptions of how to enforce the chain of insertions that take
place when sampling a database according to, for example, a set of integrity
constraints. That is, the fact that the inclusion of one tuple in the Sample
Database may require the inclusion of other tuples before the resulting Sample Database reaches a state consistent with the specified set of constraints.
These insertions may, in turn, require additional insertions. It is well known
that updates may need to be propagated to keep any database in a consistent
state after any data manipulation [11]. The same applies to the construction
of a Sample Database. This section formalises the solutions that were adopted
in [9,5,6]. It is believed this description will be very useful in the context of
understanding and reasoning about database sampling. In particular, it clearly
identifies what components of a sampling strategy appear in all types of sampling (e.g. the need for an initial selection), and which ones are considered
specific to one concrete strategy (e.g. the use of a set of functional dependencies to evaluate the consistency of the resulting sample, or the use of random
sampling). The framework this semantic description provides can also be used
as a benchmark reference for sampling tools, such as the one reported in [5].
An additional requirement commonly addressed when constructing prototype
databases is concerned with the size of the resulting database (e.g. see [20]
in Section 2). A small prototype database is more likely to reveal missing
or undesirable constraints and hence will support the data-intensive application development better than if a large database is produced. The semantics
of Consistent Database Sampling described in the next sections will explicitly represent the existence of several alternative ways of selecting the appropriate Sample Database size (see functions Satisfy, Choose, and Select in Section 4.3).
As briefly outlined in Section 3.2, Denotational Semantics was initially developed as a formalism to define the semantics of programming languages precisely
and unambiguously. In this context, a denotational semantics for a conventional programming language is defined as a Meaning Function that takes any
program in the language and produces the input-output function computed
by that program [24]. Similarly, a denotational semantics for Consistent Database Sampling is defined here as a Meaning Function that takes any set of
integrity constraints and produces a function that maps a database instance
into a new database instance which is a Sample of the first one, consistent
with the input set of constraints.
In order to develop this formal description, the next section defines the domains that are used by the Meaning Function of Section 4.4. The definition of
this Meaning Function is simplified using a set of supporting functions defined
in Section 4.3.
4.1 Domains
The following are the domains used by the functions defined in the next sections; they are based on the standard definitions given in Section 3.1.
• Let DB be the domain of database instances.
If I is a database instance, then I = {t1 , . . . , tn } where each ti is a tuple.
Values in a database instance are taken from domain DOM.
• Let T be the domain of tuples in a database.
If a database schema is composed of attributes {A1, ..., An} then T = DOMA1 × ··· × DOMAn. The domain of database instances can be identified with the powerset of the domain of tuples, DB = P{T}. However, to simplify the notation, both domains will be denoted differently. Without loss of generality, it can be assumed that each tuple ti includes a unique identifier which also identifies ti's relation [11].
• Let IC be the domain of integrity constraints.
As in Section 3.1, integrity constraints are assumed to be Relational Calculus expressions, that is, first-order-logic formulas without functions and
without free variables.
4.2 Domains as Complete Lattices
The domains defined in the previous section will be used in the Denotational
Semantic description of Consistent Database Sampling. According to the theory of Denotational Semantics, these domains must be proved to have a structure of Complete Lattices in order to ensure that the semantics defined below
are consistent (see Section 3.2). For each of the domains defined above, its
partial order, least element and least upper bound will be given.
Theorem 4.1 The domain T of tuples is a complete lattice.
Proof. A partial order, ⊑DOMAi, a least element, ⊥DOMAi, and a least upper bound can be assumed for each basic domain DOMAi [2] (e.g. integer, real, character). This theorem follows directly from the well-known fact that a tuple of complete lattices is also a complete lattice [2].
Theorem 4.2 The domain DB of database instances is a complete lattice.
Proof. In the previous section a database instance was identified with a set
of tuples, DB = P{T }. This theorem follows directly from the well-known
fact that the powerset of a complete lattice is also a complete lattice [2].
Theorem 4.3 The domain IC of integrity constraints is a complete lattice.
Proof. A partial order between integrity constraints can be defined as logical implication, since a constraint is an assertion (a Relational Calculus formula, see Section 3.1) about a database instance. Let σ1, σ2 ∈ IC:

σ1 ⊑IC σ2 iff ∀I ∈ DB, I |= σ1 → I |= σ2.

Its least element is ⊥IC = false.

A least upper bound for a chain of constraints can also be defined based on their logical nature. Take the least upper bound of σ1 ⊑IC σ2 ⊑IC σ3 ⊑IC ··· as being σ* = σ1 ∨ σ2 ∨ σ3 ∨ ···.

With the above definitions, ⟨IC, ⊑IC⟩ is a complete lattice, as follows directly from the properties of logical implication.
4.3 Supporting Functions
The following functions have been used to simplify the denotation of Consistent Database Sampling. Function definitions are given using the λ-Calculus
[15]. Refer to Section 3.2 for a brief introduction.
Function ToBeSatisfied takes a database instance and a set of integrity constraints. It considers the database instance to be the current state of the
Sample Database being extracted, denoted by s. This function returns the set
of integrity constraints yet to be satisfied by the Sample Database.
ToBeSatisfied : DB × P(IC) → P(IC)    (1)
ToBeSatisfied = λ s, {ic1, ..., icn}. {ici | s ⊭ ici}    (2)
Function Satisfy adds a set of tuple(s) needed in the Sample Database in
order to satisfy one particular integrity constraint. It takes a database instance that represents the operational database being sampled, denoted by d,
another database instance that represents the current Sample, denoted by s,
and finally the integrity constraint to be satisfied, denoted by c. It returns
the Sample enlarged with additional tuples required to satisfy the specified
constraint. Note that there could be several ways of consistently enlarging
the current Sample Database. Function Select represents the selection of one
such possible enlargement, and is defined below.
Satisfy : DB × DB × IC → DB    (3)
Satisfy = λ d, s, c. s ∪ Select({s' | s' ⊂ d \ s ∧ s ∪ s' |= c ∧ {s'' | s'' ⊂ s' ∧ s ∪ s'' |= c} = {Ø}})    (4)
This definition considers all sets of tuples not yet in the Sample (s' ⊂ d \ s) that can be used to satisfy constraint c (s ∪ s' |= c). No subset of this set of tuples should also be eligible since, as was explained at the beginning of Section 4, a small Sample Database is generally preferred to a larger one. Then one of these sets of tuples is selected to become part of the Sample Database.
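To make these definitions concrete, the following is a small brute-force Python sketch. It is illustrative only and not part of the formal development: database instances are represented as Python sets of tuples, each integrity constraint as a Boolean predicate over such a set, and the names to_be_satisfied, satisfy and subsets are choices made for this sketch rather than anything prescribed above.

from itertools import chain, combinations

def to_be_satisfied(s, constraints):
    # Eq. (2): the constraints not yet satisfied by the current sample s.
    return [c for c in constraints if not c(s)]

def subsets(tuples):
    # All subsets of a set of tuples, enumerated smallest first.
    ts = list(tuples)
    return chain.from_iterable(combinations(ts, k) for k in range(len(ts) + 1))

def satisfy(d, s, c):
    # Eq. (4): enlarge the sample s with a set of tuples taken from d \ s so that
    # constraint c holds, such that no proper subset of the added tuples would also do.
    candidates = [set(s1) for s1 in subsets(d - s) if c(s | set(s1))]
    minimal = [s1 for s1 in candidates
               if not any(s2 < s1 and c(s | s2) for s2 in candidates)]
    if not minimal:
        return s                           # c cannot be satisfied from d at all
    return s | min(minimal, key=len)       # plays the role of Select: a smallest enlargement

Note that the sketch fixes Select to "one of the smallest enlargements", whereas the formal treatment deliberately leaves Select, Choose and SelectTuple as parameters of the sampling strategy. The exhaustive enumeration of candidate subsets is exponential and is only meant to mirror Eqs. (2) and (4) literally, not to suggest an implementation strategy.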
The following functions are left undefined and their semantics are taken as
given. They determine different parameters that must be considered during
Consistent Database Sampling. For example, the chain of insertions that takes
place during consistent sampling must be triggered by an initial insertion.
Several criteria can be used to perform this initial insertion, e.g. randomly,
or such that it leads to a small sample (refer to [9,6] for concrete examples).
Another parameter that determines the concrete final Sample Database that
results from this process is concerned with choosing between tuples when
several are possible (which, once again, may refer to the size of the resulting
database).
SelectTuple selects one tuple from a database based on the set of integrity constraints, i.e. the initial selection which will trigger a chain of insertions.

SelectTuple : DB × P(IC) → T    (5)
At this point it can be illustrated how random sampling is a particular case
of Consistent Database Sampling in the framework being defined here. If no
constraints are to be considered to evaluate the consistency of the sample,
function Satisfy would always return the current sample, without enlarging
it. If function SelectTuple is defined as the implementation of a random
variable of the desired probability distribution, and since IC = Ø, a random
sample would result when applying the semantics defined below. The sample
would contain strictly the set of tuples selected by function SelectTuple.
Alternatively, if IC 6= Ø, an initial selection will be performed randomly,
but it will then be consistently completed in order to arrive at a Consistent
Sample.
Choose selects one integrity constraint out of a set.

Choose : P(IC) → IC    (6)

Finally, function Select selects one out of all the ways of consistently completing a Sample (e.g. the smallest one, as discussed at the beginning of Section 4).

Select : P(DB) → DB    (7)
4.4 Meaning Function
The semantics of Database Sampling is denoted here by meaning function S.
This function takes a set of integrity constraints and returns another function
which takes an operational database and returns either a Sample of it, or
it does not terminate, denoted by ⊥ (bottom). The case in which it may
not terminate stems from the fact that the operational database itself may
not be consistent. In this case, it may not always be possible to extract a Consistent Sample Database from such a database. The semantics of database sampling from an inconsistent operational database are left undefined in this section. Section 4.5 will then provide a semantic description for the case of an inconsistent operational database. As will be seen, these semantics can, in fact, be seen as a strategy for incrementally detecting the inconsistencies in the operational database, and thus could be used as a data cleaning mechanism.
S : P(IC) → DB → DB ∪ {⊥}    (8)
S[[{ic1, ic2, ..., icn}]] = λ d. Let t = SelectTuple(d, {ic1, ..., icn}) in
                                 S'({ic1, ..., icn})(d, {t})    (9)
(where 'in' refers to the scope of binding of t)
The definition of the meaning function S indicates how, in Consistent Database Sampling, an initial tuple, t, is selected and inserted into the Sample
Database. Then this single-tuple Sample is consistently completed, that is,
function S' is called. S' adds the necessary tuples into the Sample in order to
satisfy a set of constraints. This function takes a set of integrity constraints
and returns the Least Fixed Point of a function that takes an operational
database and the current Sample Database (which may not be consistent)
and returns either a Consistent Sample Database or it does not terminate.
S' : P(IC) → DB × DB → DB ∪ {⊥}    (10)
S' = λ {ic1, ic2, ..., icn}. Least-Fixed-Point( λ F.
        λ d, s. Let R = ToBeSatisfied(s, {ic1, ic2, ..., icn}) in
                if R = Ø then s else F(d, Satisfy(d, s, Choose(R))) )    (11)
If all constraints are satisfied, then the Sample is already consistent and the
process terminates. Otherwise, it firstly selects one constraint out of all those
which are not yet satisfied in the Sample, consistently completes the current
Sample according to this individual constraint and then repeats the process,
that is, its semantics are defined as its least fixed point.
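Continuing the same illustrative Python encoding (sets of tuples, constraints as predicates, and the to_be_satisfied and satisfy helpers sketched in Section 4.3), the recursive completion performed by S and S' can be transcribed as the loop below. The name select_tuple and the purely random initial selection are assumptions of the sketch; where the formal semantics yields ⊥, the sketch simply returns None.

import random

def select_tuple(d, constraints):
    # SelectTuple (Eq. 5): here simply a random initial selection, one of the
    # options discussed in Section 4.3.
    return random.choice(list(d))

def sample(d, constraints):
    # Meaning function S (Eqs. 8-9): select an initial tuple, then complete it
    # consistently, mirroring the least-fixed-point computation of S' (Eqs. 10-11).
    s = {select_tuple(d, constraints)}
    while True:
        remaining = to_be_satisfied(s, constraints)
        if not remaining:
            return s                    # all constraints hold: a Consistent Sample
        c = remaining[0]                # plays the role of Choose (Eq. 6)
        enlarged = satisfy(d, s, c)
        if enlarged == s:
            return None                 # c cannot be satisfied from d: the ⊥ case
        s = enlarged

With an empty constraint set the loop simply returns the initial random selection, in line with the remark above that random sampling is a particular case of this framework.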
Consistent Database Sampling as described above can be seen as one initial
insertion followed by consistent completion of the database in order to satisfy
the predefined set of constraints. Once the Sample Database reaches a consistent state the whole process may be repeated again, if due to the requirements
of the application at hand the Sample Database must be of larger size. Therefore, this approach to Consistent Database Sampling views the operational
database as being composed of several independent and consistent portions
of data (each one could be randomly selected out of all existing portions, as
outlined by the description of SelectTuple in Section 4.3). This same view is
exploited in Section 4.5 in order to improve the data quality of a (legacy)
database.
4.5 Data Cleaning through Consistent Sampling
The semantics for Consistent Database Sampling given in the previous section
were left undefined in the case of an inconsistent operational database. This
case was identified when a given insertion in the Sample Database could not be
made consistent with the set of integrity constraints used for sampling because
the database being sampled is not consistent with this set of constraints.
This section provides a new denotational semantic description for Consistent
Database Sampling, which is defined even if the source database is inconsistent. It must be noted that the objective here is significantly different from that in
the previous section. The semantics provided above could extract a Consistent
Sample Database even if the source database was, as a whole, inconsistent.
This is the case because the objective was to extract any Consistent Sample
(e.g. as small as possible) from the database. This objective was justified in
order to support the development of data-intensive applications.
In contrast, in this section the goal is to use Consistent Sampling as a mechanism to improve the data quality of a database [26]. The semantics provided
here will identify the largest consistent subset of the source database. In turn
this will also identify the inconsistencies found in the database.
For the purposes of this section, the supporting function Satisfy presented
in Section 4.3 is modified so that it considers the consistency of the Sample
according to a set of constraints, as opposed to one single constraint. Function Satisfy' takes as parameters the database being sampled, d, the current
Sample (possibly inconsistent), s, the subset of the source database already
shown to be consistent, sC , and a set of integrity constraints, {ic1 , . . . , icn }.
It enlarges Sample s with tuples from d not yet in sC so that it satisfies the
given set of integrity constraints.
Satisfy' : DB × DB × DB × P(IC) → DB    (12)
Satisfy' = λ d, sC, s, {ic1, ..., icn}.
    Let p = Select({s' | s' ⊂ d \ (sC ∪ s) ∧ {sC ∪ s ∪ s'} |= {ic1, ..., icn} ∧
                         {s'' | s'' ⊂ s' ∧ {sC ∪ s ∪ s''} |= {ic1, ..., icn}} = {Ø}}) in
    if p = Ø then Ø else s ∪ p    (13)
Besides the difference in the set of parameters, function Satisfy' is defined analogously to Satisfy (Eq. 4 in Section 4.3) and its interpretation is not repeated here.
Finally, the semantics for Consistent Database Sampling with a potentially inconsistent database is represented by meaning function Q (for Data Quality).
This function takes as parameters the set of integrity constraints to be used to
evaluate the consistency of the samples, and the database being sampled, d.
Q : P(IC) → DB → DB × DB → DB    (14)
Q[[{ic1, ..., icn}]] = λ d. Least-Fixed-Point( λ F.
    λ sC, sN. if d \ (sC ∪ sN) = Ø then sC else
              Let t = SelectTuple(d \ (sC ∪ sN), {ic1, ..., icn}),
                  l = Satisfy'(d, sC, {t}, {ic1, ..., icn}) in
              if l = Ø then F(sC, sN ∪ {t}) else F(sC ∪ l, sN) )    (15)
It returns the Least Fixed Point of a function with two parameters, the subset
of the source database which has already been shown to be consistent, sC ,
and the subset of the database which contains some inconsistencies, sN . If
the whole source database has already been partitioned into consistent and
inconsistent subsets, i.e. d = sC ∪ sN , the process terminates, and the result
is the consistent subset of the database. Otherwise, it selects one tuple from
the database which still has not been considered, d \ (sC ∪ sN ). If this selection
can be made consistent, a consistent sample which contains it is added to sC ,
i.e. sC ∪ l, otherwise this tuple becomes part of the inconsistent subset of the
Source Database, sN ∪ {t}. It should be noted how in the case of a consistent
source database these semantics would define sC = d and sN = Ø.
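Under the same illustrative Python encoding as the earlier sketches (and reusing the subsets and select_tuple helpers), the partitioning behaviour of Q can be sketched as follows. The names satisfy_all and clean are assumptions of the sketch, and None stands in for the Ø failure result of Satisfy'.

def satisfy_all(d, s_c, s, constraints):
    # Analogue of Satisfy' (Eqs. 12-13): a smallest enlargement of s, taken from
    # d \ (s_c ∪ s), such that s_c ∪ s plus the enlargement satisfies every
    # constraint; None signals that no such enlargement exists.
    for s1 in subsets(d - (s_c | s)):          # subsets() enumerates smallest first
        if all(c(s_c | s | set(s1)) for c in constraints):
            return s | set(s1)
    return None

def clean(d, constraints):
    # Meaning function Q (Eqs. 14-15): partition d into a consistent subset s_c
    # and a subset s_n of tuples around which no consistent sample can be built.
    s_c, s_n = set(), set()
    while d - (s_c | s_n):
        t = select_tuple(d - (s_c | s_n), constraints)
        l = satisfy_all(d, s_c, {t}, constraints)
        if l is None:
            s_n.add(t)              # t is part of an inconsistency: a data cleaning target
        else:
            s_c |= l                # the consistent enlargement joins the consistent part
    return s_c, s_n

For a consistent source database the sketch terminates with s_c = d and s_n empty, matching the observation above.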
From a more practical point of view, the semantics described above would
incrementally identify the set of tuples with which a consistent sample cannot
be built. These tuples can be seen as the cause of inconsistencies in the database, and should be the ones to be addressed by the data cleaning process.
The actions to be performed in order to resolve these inconsistencies would
depend on the database at hand. Two possible options would be to interact
with the user to repair or discard individual tuples, and to build a database
of inconsistencies (i.e. sN as denoted above) to be resolved at a later stage.
Independently of which method is used to repair individual tuples, Consistent
Database Sampling would be central to the data cleaning process. Section 6
will further discuss this line of work.
5 Sampling-Relevant Integrity Constraints
When consistently sampling from a database not all integrity constraints need
to be considered. For some types of constraints, even if they are excluded
from the sampling process, the resulting Sample Database will still satisfy
them, assuming that the operational database being sampled is consistent
according to these same constraints. Examples of such integrity constraints include maximal cardinality constraints (e.g. in a database storing information about a school, a teacher may not be allowed to teach more than three courses). This type of constraint has been termed here Sampling-Irrelevant. If only
those constraints that are actually relevant to the consistency of the resulting
Sample Database are considered, the sampling process can be expected to be
more efficient. The representation of the set of constraints to be considered
should also be simpler as it needs to express a more limited set of constraint
types (refer to [9,6] for concrete examples of mechanisms to represent integrity
constraints appropriate for sampling). This, in turn, would also simplify the
time-consuming data cleaning process described in Section 4.5.
Integrity constraints which are not Sampling-Irrelevant are referred to as
Sampling-Relevant. The objective in this section is to characterise this kind of integrity constraint.

From the above discussion, a Sampling-Irrelevant Integrity Constraint can be defined as follows.
Definition 5.1 σ is a Sampling-Irrelevant Integrity Constraint iff
∀I ∈ DB, ∀t ∈ I, I |= σ → I \ {t} |= σ
(16)
Here, I is a database instance and t a tuple in this database. This definition
states that if operational database I is consistent with constraint σ, then it
will still be consistent after deleting a tuple from it. This definition views a
Sample Database as the result of deleting, from the operational database, all
tuples which are not in the Sample Database. Although this view is useful for
the developments of this section, it is likely to be impractical due to the size
of an operational database compared to that of a Sample Database.
Negating the expression in the previous definition results in a definition for
Sampling-Relevant Integrity Constraints.
Definition 5.2 σ is a Sampling-Relevant Integrity Constraint iff
∃I ∈ DB, ∃t ∈ I, I |= σ ∧ I \ {t} ⊭ σ
(17)
In this definition, I is a database instance and t a tuple in this instance.
I \ {t} denotes a database instance that results from removing tuple t from
I. According to this definition, if σ is Sampling-Relevant there must be a
database instance I and a tuple t in this database that witness the relevance
of σ for sampling, that is, deleting t from I results in a database which does
not satisfy constraint σ.
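As a small illustration of Definition 5.2 (all data and names invented for this example, using the same predicate encoding as the Python sketches of Section 4), the following fragment exhibits a witness (I, t) for a referential-integrity-style constraint:

# Tuples carry a relation tag, as assumed in Section 4.1.
I = {("course", "c1", "t1"),          # course c1 is taught by teacher t1
     ("teacher", "t1")}

def referential(db):
    # Every course must refer to an existing teacher (an inclusion dependency).
    teachers = {row[1] for row in db if row[0] == "teacher"}
    return all(row[2] in teachers for row in db if row[0] == "course")

t = ("teacher", "t1")
print(referential(I))            # True:  I satisfies the constraint
print(referential(I - {t}))      # False: I \ {t} violates it, so it is Sampling-Relevant

Deleting the teacher tuple leaves a dangling course, so this constraint is Sampling-Relevant; a maximal cardinality constraint such as the one in the school example above would, by contrast, survive any deletion.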
5.1 Characterisation of Sampling-Relevance
From Definition 5.2 it is possible to deduce the following sufficient, but not
necessary, condition for an integrity constraint to be Sampling-Relevant.
Theorem 5.3 Given any integrity constraint σ,
Ø ⊭ σ → σ is Sampling-Relevant
(18)
Proof. As stated in Section 3.1, only finite databases are considered here.
Theorem 5.3 follows directly from this fact.
Consider a Sampling-Irrelevant integrity constraint σ. Let I0 = I be the initial database instance (the operational database), which is assumed to satisfy σ (see the beginning of Section 5). Since σ is Sampling-Irrelevant, according to Definition 5.1, any tuple t0 can be removed from it and the resulting database instance, I1 = I0 \ {t0}, will still satisfy σ, I1 |= σ. Repeating this process for any other tuple, ti, leads to increasingly smaller database instances, Ii+1 = Ii \ {ti}, all of them satisfying σ, Ii+1 |= σ. Since I is finite, eventually some database instance Ii+1 will be empty, Ii+1 = Ø. This proves that

σ is Sampling-Irrelevant → Ø |= σ

which is equivalent to what was to be proved.
This theorem states that if a constraint is not satisfied by an empty database,
then it must be Sampling-Relevant. It is possible, however, for a Sampling-Relevant integrity constraint to be satisfied by an empty database (see Section
6).
5.2 Non-Decidability of Sampling-Relevance
This section gives a proof of undecidability for Sampling-Relevance of integrity
constraints. The motivation given in Section 5 to investigate the Sampling-Relevance of integrity constraints was to simplify the set of constraints being considered during sampling. The practical implication of the proof given here is that this goal is, in general, not achievable. Since it cannot be decided whether a constraint is Sampling-Relevant, all constraints must be considered. Section
6 will briefly describe how this problem could be addressed.
One standard technique for proving that a problem is undecidable is by reducing a well-known undecidable problem to it [17]. The well-known undecidable
problem used here is the Domain-independence of Relational Calculus Queries
[1].
A Relational Calculus Query q is an expression of the form q = {x1, ..., xn | ϕ}, where all variables xi are free in ϕ and ϕ is a first-order-logic formula without functions. The result of a query q over database instance I, denoted q(I), is the set of tuples ⟨x1, ..., xn⟩ for which ϕ is true. When interpreting a Relational Calculus Query it must be specified over which domain the variables xi
range. Different answers may be obtained depending on the domains being
considered. For this reason a query is interpreted relative to a particular domain. Thus qdi (I) denotes the result of query q for instance I if the domain of
interpretation is di . It must be noted that the domain of interpretation must
always be a superset of the set of values that appear in the database instance
I (known as its active domain). In practice only those queries that always
produce the same result, independently of the domain of interpretation, are
allowed. This set of queries is referred to as domain-independent. Therefore, a domain-independent query can be defined as follows.
Definition 5.4 q is a Domain-Independent Relational Calculus Query iff
∀d1 , d2 ⊂ DOM, ∀I ∈ DB, qd1 (I) = qd2 (I)
(19)
It is known that if the entire expressive power of Relational Calculus is allowed when defining a query, domain-independence is not decidable ([1], p.
125). Since only domain-independent queries are of interest, several syntactic restrictions have been developed to ensure the domain-independence of
queries. Refer to [1] for a more extended treatment of Relational Calculus
Queries, justification for the need of domain-independent queries, and some
syntactic restrictions (e.g. range-restricted) used to deal with their undecidability. For the purposes of this paper only its definition and its undecidability
result are needed.
Negating Definition 5.4 results in the definition of Domain-Dependent queries,
which will be used in the proof presented here. It must be noted that if a
problem is undecidable its complement must also be undecidable [17].
Definition 5.5 q is a Domain-Dependent Relational Calculus Query iff
∃d1 , d2 ⊂ DOM, ∃I ∈ DB, qd1 (I) ≠ qd2 (I)
(20)
In order to simplify the notation, in the following development formulas will
be given without explicitly stating the domain of variables. This should not
cause confusion as variables with the same names as above will range over the
same domains.
Theorem 5.6 Sampling-Relevance is an undecidable property of integrity constraints.
Proof. Let q = {x1, ..., xn | ϕ} be a Relational Calculus Query. An integrity constraint σq can be constructed from q as in Eq. 21.

σq = ∃d1, d2 ∃t' ( t' ∈ qd1(I) ∧ t' ∉ qd2(I) )    (21)
The objective is to prove that σq is Sampling-Relevant if and only if q is
Domain-Dependent.
(1) σq is Sampling-Relevant → q is Domain-Dependent
Since σq is Sampling-Relevant there must be some database instance I and tuple t that witness this fact, I |= σq ∧ I \ {t} ⊭ σq. It suffices to consider only I |= σq. By construction of σq such database instance I witnesses the domain-dependence of q: there is a tuple t' which is in qd1(I) but not in qd2(I), and so qd1(I) ≠ qd2(I) as required.
(2) q is Domain-Dependent → σq is Sampling-Relevant
There is an instance I which witnesses the domain-dependence of q: ∃t' ( t' ∈ qd1(I) ∧ t' ∉ qd2(I) ). Note that t' is a tuple which results from values of tuples in I. Let Iw be a database instance that results from I by deleting, for each such witnessing tuple t' but one, one of the tuples from which t' takes its values (as in Theorem 5.3, only finite databases are considered). For example, assume that tuples t'1 and t'2 are the only tuples that witness the domain-dependence of q. Further, assume that they take values from tuples t1, t2, t3 and t3, t4 respectively. If, for example, t1 is deleted from I, there will remain only one tuple, t'2, which will witness the domain-dependence of q. This tuple takes its values from t3 and t4. Denote by tw any of these two tuples.
It can be seen how Iw and tw, as constructed above, witness the Sampling-Relevance of σq:
(a) Iw |= σq
There is at least one tuple in Iw that witnesses the domain-dependence of q (e.g. in the example above, they would have been t3 and t4): ∃t' ( t' ∈ qd1(Iw) ∧ t' ∉ qd2(Iw) ). This makes σq true, as required.
(b) Iw \ {tw} ⊭ σq
There is one and only one t' such that t' ∈ qd1(Iw) ∧ t' ∉ qd2(Iw), and this tuple takes its values from tw. Therefore Iw \ {tw} ⊭ σq, as required.
6 Conclusions and Future Work
The use of prototype databases to support the development of data-intensive
applications has been investigated for some years. Recently, it has been proposed that existing operational data should be used to populate such a database,
resulting in a more realistic prototype. The term Consistent Database Sampling is used to refer to the process of populating a prototype database with
data from an existing operational database. This approach is justified by the
fact that increasingly more software development projects consist of extending or enhancing existing systems, as opposed to developing new ones, and
therefore operational data can be expected to be available to support these
projects.
This paper first provided a formal description for Consistent Database Sampling, using the formalism of Denotational Semantics. Two different semantics
have been given. The first one aims at building a small sample to support the
development of data-intensive applications. The second one focuses on improving the data quality of a possibly inconsistent database. After that, a particular type of integrity constraint, referred to as Sampling-Relevant, has been investigated. A characterisation of this type of constraint has been developed. It has been shown that Sampling-Relevance is an undecidable property of integrity constraints.
Future work could focus on Theorem 5.3, which stated that if a constraint is
not satisfied by an empty database, then it must be Sampling-Relevant. As
outlined in Section 5 it is possible, however, for a Sampling-Relevant integrity
constraint to be satisfied by an empty database. This raises the question of how
interesting this property is in practice, that is, how many Sampling-Relevant
integrity constraints are actually not satisfied by an empty database. Finding
a more practical characterisation of Sampling-Relevant integrity constraints
is left as an area for future research.
Another line for future research is concerned with the undecidability of Sampling-Relevance. A common approach to dealing with undecidability is to search for syntactic restrictions that ensure Sampling-Relevance. For example, the domain-independence of relational calculus queries described in Section 5 is also undecidable, and some syntactic restrictions of first-order-logic (e.g. range-restricted,
safe) have been proposed to ensure that a query which satisfies these syntactic
(therefore, decidable) restrictions is domain-independent. A similar approach
could be investigated for Sampling-Relevant integrity constraints.
Finally, further research is also needed in order to exploit the possibilities that
consistent sampling can offer during the data cleaning process. In particular,
the current sample itself could provide the context necessary to automatically
repair a given tuple, from values already in the sample. This approach could
define a framework for (semi)automated data cleaning.
References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] L. Allison. A Practical Introduction to Denotational Semantics. Cambridge
University Press, 1986.
[3] D. Avison and G. Fitzgerald. Information Systems Development: Methodologies,
Techniques, and Tools. McGraw-Hill, 2003.
[4] C. Bates, I. Jelly, and J. Kerridge. Modelling test data for performance evaluation
of large parallel database machines. Distributed and Parallel Databases, 4(1):5–
23, January 1996.
[5] J. Bisbal and J. Grimson. Generalising the consistent database sampling
process. In B. Sanchez, N. Nada, A. Rashid, T. Arndt, and M. Sanchez, editors,
Proceedings of the Joint meeting of the 4th World Multiconference on Systemics,
Cybernetics and Informatics (SCI’2000) and the 6th International Conference
on Information Systems Analysis and Synthesis (ISAS’2000), volume II Information Systems Development, pages 11–16, Caracas, Venezuela, July 2000.
International Institute of Informatics and Systemics (IIIS).
[6] J. Bisbal and J. Grimson. Database sampling with functional dependencies.
Information and Software Technology, 43(10):607–615, August 2001.
[7] J. Bisbal and J. Grimson. Consistent database sampling as a database
prototyping approach. Journal of Software Maintenance and Evolution:
Research and Practice, 14(6):447–459, December 2002.
[8] J. Bisbal, D. Lawless, B. Wu, and J. Grimson. Legacy information systems:
Issues and directions. IEEE Software, 16(5):103–111, September/October 1999.
[9] J. Bisbal, B. Wu, D. Lawless, and J. Grimson.
Building consistent
sample databases to support information system evolution and migration.
In G. Quirchmayr, E. Schweighofer, and T. J.M. Bench-Capon, editors,
Proceedings of the 9th International Conference on Database and Expert
Systems Applications (DEXA’98), volume 1460 of Lecture Notes in Computer
Science, pages 196–205, Heidelberg, Germany, 1998. Springer-Verlag.
[10] N. Bruno and S. Chaudhuri. Exploiting statistics on query expressions for
optimization. In Proceedings of the International Conference on Management
of Data (SIGMOD 2002), New York, USA, 2002. ACM Press.
[11] C. J. Date. An Introduction to Database Systems. Addison-Wesley, seventh
edition, 1999.
[12] C. Ghezzi, M. Jazayeri, and D. Mandrioli. Fundamentals of Software Engineering. Prentice-Hall, 1991.
[13] P. Godfrey, J. Grant, J. Gryz, and J. Minker. Integrity constraints: Semantics
and applications. In J. Chomicki and G. Saake, editors, Logics for Databases and
Information Systems, chapter 9, pages 265–306. Kluwer Academic Publishers,
1998.
[14] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger.
Quickly generating billion-record synthetic databases. In Proceedings of the
International Conference on Management of Data (SIGMOD 1994), New York,
USA, 1994. ACM Press.
[15] R. J. Hindley and J. P. Seldin. Introduction to Combinators and λ-Calculus.
Cambridge University Press, 1986.
[16] J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In
Proceedings of the 1994 ACM SIGMOD-SIGACT Symposium on Principles of
Database Systems (PODS’94), pages 77–85, New York, USA, 1994. ACM Press.
[17] H.R. Lewis and C.H. Papadimitriou. Elements of the Theory of Computation.
Prentice-Hall, second edition, 1998.
[18] P. Lohr-Richter and A. Zamperoni. Validating database components of software
systems. Technical Report 94–24, Leiden University, Department of Computer
Science, 1994.
[19] G. Luo, C.J. Ellmann, P.J. Haas, and J.F. Naughton. A scalable hash ripple
join algorithm. In Proceedings of the International Conference on Management
of Data (SIGMOD 2002), New York, USA, 2002. ACM Press.
[20] H. Mannila and K.-J. Räihä. Small Armstrong relations for database design. In
Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles
of Database Systems (PODS’85), pages 245–250, New York, USA, 1985. ACM
Press.
[21] A. Neufeld, G. Moerkotte, and P. C. Lockemann. Generating consistent test
data: Restricting the search space by a generator formula. VLDB Journal,
2(2):173–213, April 1993.
[22] H. Noble. The automatic generation of test data for a relational database.
Information Systems, 8(2):79–86, 1983.
[23] F. Olken. Random Sampling from Databases. PhD thesis, University of California, April 1993.
[24] J. Stoy. Denotational Semantics: The Scott-Strachey Approach to Programming Language Theory. MIT Press, 1977.
[25] L. Tucherman, M.A. Casanova, and A.L. Furtado. The CHRIS consultant - a tool
for database design and rapid prototyping. Information Systems, 15(2):187–195,
1990.
[26] R.Y. Wang, V.C. Storey, and C.P. Firth. A framework for analysis of data
quality research. IEEE Transactions on Knowledge and Data Engineering,
7(4):623–640, August 1995.
[27] J. Widom. A denotational semantics for the Starburst production rule language.
SIGMOD Record, 21(3):4–9, September 1992.
[28] A. Zamperoni and P. Lohr-Richter. Enhancing the quality of conceptual
database specifications through validation.
In Proceedings of the 12th
International Conference on Entity-Relationship Approach (ER’93), volume 823
of Lecture Notes in Computer Science, pages 85–98, Heidelberg, Germany, 1993.
Springer-Verlag.