A Generalisation of Entity and Referential Integrity in Relational

A Generalisation of Entity and Referential Integrity in
Relational Databases
Mark Levene
Department of Computer Science
University College London
Gower Street
London WC1E 6BT, U.K.
Email: [email protected]
Phone: 0171 419 3684
Fax: 0171 387 1397
Corresponding author: George Loizou
Department of Computer Science
Birkbeck College
Malet Street
London WC1E 7HX, U.K.
Email: [email protected]
November 17, 2000
Abstract
Entity and referential integrity are the most fundamental constraints that any relational database should satisfy. We re-examine these fundamental constraints in the
context of incomplete relations, which may have null values of the types “value exists
but is unknown” and “value does not exist”. We argue that in practice the restrictions
that these constraints impose on the occurrences of null values in relations are too strict.
We justify a generalisation of the said constraints wherein we use key families, which are
collections of attribute sets of a relation schema, rather than keys, and foreign key families which are collections of pairs of attribute sets of two relation schemas, rather than
foreign keys. Intuitively, a key family is satisfied in an incomplete relation if each one of
its tuples is uniquely identifiable on the union of the attribute sets of the key family, in
all possible worlds of the incomplete relation, and, in addition, is distinguishable from all
the other tuples in the incomplete relation by its nonnull values on some element in the
key family. Our proposal can be viewed as an extension of Thalheim’s key set, which only
deals with null values of type “value exists but is unknown”. The intuition behind the
satisfaction of a foreign key family in an incomplete database is that each pair (Fi , Ki ) of
attribute sets in the foreign key family takes the foreign key attribute values over Fi of
a tuple in one incomplete relation referencing the key attribute values over Ki of a tuple
in another incomplete relation. Such referencing is defined only in the case when the
foreign key attribute values do not have any null values of the type “value does not exist”;
we insist that the referencing be defined for at least one such pair. We also investigate
some combinatorial properties of key families, and show that they are comparable to the
standard combinatorial properties of keys.
1
Introduction
The notions of entity integrity arising from the choice of a primary key and referential integrity
arising from the choice of foreign keys are the most fundamental constraints in a relational
database [Cod79, Cod90].
1
A primary key, which is a set of attributes designated by the user, is satisfied in a relation,
if each tuple in the relation is uniquely identified by the primary key values and, in addition,
the primary key must be a minimal set of attributes for which this uniqueness property holds.
We quote from Codd’s definition of entity integrity [Cod90]:
“No component of a primary key is allowed to have a missing value of any type”.
Thus entity integrity can be viewed as a restriction on the occurrence of null values in a
relation which may be incomplete. To our knowledge there has been no attempt to date, apart
from [Tha89], to re-examine the validity of entity integrity, despite the fact that there is a
growing literature on incomplete information in relational databases (see [Mai83, AHV95]; for
alternative treatments to dealing with incomplete information we refer the reader to [LL99,
Chapter 5]). We think that this is mainly due to the fact that in most of the theory of
relational database design it is implicitly assumed that relations do not contain null values,
in which case entity integrity is enforced by default. Moreover, support for null values in
current relational database management systems is still in its infancy. Our generalisation of
entity integrity can be viewed as an extension of Thalheim’s formalism [Tha89] to the case
when relations may contain occurrences of both categories of null value defined below.
A foreign key, which is a set of attributes in one relation schema referencing the primary
key attributes of another relation schema, is satisfied in a database, if whenever the foreign
key values of a tuple in the referencing relation do not contain null values, then these values
are the primary values of some tuple in the referenced relation. Referential integrity enforces
the satisfaction of such foreign key constraints [Dat86, HR96]. We continue to quote from
Codd’s definition of entity integrity [Cod90]:
“No component of a foreign key is allowed to have an I-marked value”,
where an I-marked value in our terminology is a null value of the type “value does not exist”.
Thus foreign key values also restrict the occurrence of null values in a relation which may be
incomplete. A recent formalisation of the notion of foreign key, which caters for incomplete
relations, can be found in [LL97].
Missing information generally falls into two categories: applicable information, for example, the name of a person may be unknown, and inapplicable information, for example, a
person may not have a spouse. In both cases the missing information can be modelled by special values, called null values, which act as place holders for the information that is missing.
Varied interpretations of null values within these two categories were listed in [ANS75].
In order to model these two categories of missing information we add to the database
domains two types of null value:
1) “value at present exists but is unknown” [Cod79], which is denoted in the database by
the distinguished value unk, and
2) “value does not exist”, or equivalently, “value is inapplicable” [LL86], which is denoted
in the database by the distinguished value dne.
We note that dne is very useful when filling in forms where some of the categories in the
form may be filled in as inapplicable. As opposed to unk, the null value dne does not arise
due to incompleteness of information. Despite this fact dne cannot be treated as just another
2
nonnull value; for example, we can record the fact that a person is unmarried by having dne
as their spouse attribute value but we cannot say that two unmarried people have the same
spouse. Moreover, entity integrity forbids primary key values to be null values of any type,
and referential integrity forbids foreign key values to be of the null type dne.
To motivate our approach we give four examples showing that entity and referential integrity are too strict in practice. Consider the relation, say r, shown in Table 1 over a relation
schema containing the two attributes NAME and ADDRESS, and assume that {NAME, ADDRESS} is the primary key of R. It can easily be seen that r violates entity integrity. Despite
this fact all the tuples in r are uniquely identifiable, since the problematic third tuple is the
only tuple having the name Sue Jones. Thus in all possible worlds, say s, of r, where the
occurrence of unk is replaced by a nonnull value, {NAME, ADDRESS} uniquely identifies
each tuple in s, and also there is a one-to-one correspondence between the tuples of r and
s. So as long as each tuple in a relation r is uniquely identifiable as a distinct entity by
the nonnull portion of its primary key values, we can consider the relation to satisfy entity
integrity.
NAME
John Smith
John Smith
Sue Jones
ADDRESS
Hampstead Way
Harold Rd
unk
Table 1: A relation showing that entity integrity is too strict
As a second motivating example consider a relation schema R containing the attributes
NAME, ADDRESS and DOB (date of birth), and assume that {NAME, ADDRESS, DOB}
is the primary key of R. It can easily be seen that the relation, say r, over R shown in Table 2
violates entity integrity. Despite this fact all the tuples in r are uniquely identifiable by the
nonnull portion of their primary key values. As in the previous example, in all possible worlds,
say s, of r, where each occurrence of unk is replaced by a nonnull value, {NAME, ADDRESS,
DOB} uniquely identifies each tuple in s, and also there is a one-to-one correspondence
between the tuples of r and s.
NAME
John Smith
Sue Jones
unk
ADDRESS
unk
Harold Rd
Hampstead Way
DOB
13/6/95
unk
17/12/96
Table 2: Another relation showing that entity integrity is too strict
As a third motivating example for generalising entity integrity, consider a relation schema
S containing the attributes SS# (social security number), P# (passport number) and NAME,
and assume that SS# and P# are candidate keys of S. It is evident that the relation, say s,
over S, shown in Table 3, violates entity integrity, since either SS# or P# is the primary key
of S. Despite this fact, due to the semantics of dne, the first and second tuples of s represent
distinct entities, since s is the only possible world of itself. Thus if, at some later stage, John
3
Smith of the first tuple acquires a P# it cannot be 2, and if, at some later stage, John Smith
of the second tuple acquires a SS# it cannot be 1. This would not have been the case had
the null values in s been of type unk, since then there would be a possible world of s in which
both the tuples in s represent the same entity. Moreover, in this example every tuple in s
is distinguishable by nonnull values either by SS# or P#, since each value of SS# or P# is
unique.
SS#
1
dne
P#
dne
2
NAME
John Smith
John Smith
Table 3: Yet another relation showing that entity integrity is too strict
As a final example, we motivate the generalisation of referential integrity, by considering
a relation schema R containing the attributes DEPT NAME, MGR SS# and MGR P#, and
assuming that MGR SS# is a foreign key for R referencing the candidate key SS# for S,
and MGR P# is a foreign key for R referencing the candidate key P# for S. The need for
alternative foreign keys, one for each candidate key, is due to the null values, as we cannot be
sure whether SS# or P# is the primary key of S. It is evident that the database d = {r, s},
where r over R is shown in Table 4 and s over S is shown in Table 3, violates referential
integrity. Despite this fact, the MGR SS# value of the first tuple in r, i.e. <1>, references
the SS# value of the first tuple in s, and the MGR P# value of the second tuple in r, i.e.
<unk>, references the P# value of the second tuple in s. Thus for each tuple in r, whenever
the values of a foreign key do not have an occurrence of dne, then these values reference the
corresponding candidate key values of a tuple in s. Moreover, for each tuple in the referencing
relation, there is at least one foreign key whose values do not include dne.
DEPT NAME
Computing
Mathematics
MGR SS#
1
dne
MGR P#
dne
unk
Table 4: A relation demonstrating that referential integrity is too strict
The approach we take in order to solve the problems raised in the above examples is to
relax the notions of a key and foreign key thus allowing us to generalise entity integrity and
referential integrity.
Given a relation r over a relation schema R, we define a key family K to be a family of
subsets of R. A key family K is satisfied in the relation r, if the set of all attributes appearing
in the elements of K is a key in all possible worlds of r, and, in addition, each tuple t ∈ r
is distinguishable from all the other tuples in r by the nonnull values of the attributes of
some Ki ∈ K. As an example, the key family {{NAME}, {ADDRESS}} is satisfied in the
relation r shown in Table 1. This key family is also a minimal key family in r, since it has
no redundant element and no element in it has redundant attributes. As another example,
the key family {{NAME, ADDRESS}, {NAME, DOB}, {ADDRESS, DOB}} is satisfied in
the relation r shown in Table 2. This key family is not minimal in r, since the key family
4
{{NAME}, {ADDRESS}, {DOB}} is satisfied in r; however, it is minimal in the relation
resulting from adding the tuple <John Smith, Harold Rd, 17/12/96> to r. Finally, the key
family {{SS#}, {P#}} is a minimal key family in the relation shown in Table 3.
Our generalisation of entity integrity is now evident. We define a primary key family to
be a key family which is designated by the user. Given a primary key family K, a relation r
satisfies generalised entity integrity if K is a minimal key family in r.
Suppose that a database schema contains the relation schemas R and S, and that d is a
database over this database schema containing two relations r over R and s over S. A foreign
key family F for R is a family of pairs, such that for each pair (Fi , Ki ) ∈ F, Fi is a subset of
R, Ki is a subset of S, and Fi and Ki have the same cardinalities; the set of all Ki such that
(Fi , Ki ) ∈ F, say K, is called the key family for S referenced by F. The foreign key family
F is satisfied in the database d, if K is a key family in s, for all tuples t ∈ r there is some
(Fi , Ki ) ∈ F such that the Fi values of t do not have an occurrence of dne, and for all tuples
t ∈ r, for all (Fi , Ki ) ∈ F, if the Fi values of t do not have an occurrence of dne, then there is
some tuple u ∈ s such that the Fi values of t reference the Ki values of s. As an example, the
foreign key family {(MGR SS#, SS#), (MGR P#, P#)} is satisfied in d = {r, s}, where r is
the relation shown in Table 4 and s is the relation shown in Table 3. We note the fact that
the MGR P# value of the second tuple of r is unk does not cause a violation of the foreign
key family.
Our generalisation of referential integrity is now evident. Given a foreign key family F
for R referencing a primary key family K of S, a database d containing relations over R and
S satisfies generalised referential integrity if F is a foreign key family in d.
The layout of the rest of the paper is as follows. In Section 2 we formalise the notion of
incomplete relations. In Section 3 we formalise the fundamental notions of distinguishability
and identifiability, which are the necessary ingredients of our generalisation of the concept
of a key. In Section 4 we formalise the concepts of key family and minimal key family in a
relation and as a result we suggest a generalised definition of entity integrity. In Section 5
we study the relationship between the concepts of key family and Thalheim’s key set. In
Section 6 we investigate some combinatorial problems relating to key families and minimal
key families. In Section 7 we generalise the notion of foreign key to that of a foreign key
family and as a result we suggest a generalised definition of referential integrity. Finally, in
Section 8 we give our concluding remarks.
2
Relations that model incomplete information
Herein we formalise the notion of an incomplete relation, which allows us to model incomplete
information of the form “value at present exists but is unknown” and of the form “value does
not exist”, or equivalently, “value is inapplicable”.
We use the notation |S| to denote the cardinality of a set S. A family of sets is a collection
of sets indexed by the set of natural numbers I = {1, 2, . . . , n}. If S is a subset of T we write
S ⊆ T and if S is a proper subset of T we write S ⊂ T. We often denote the singleton {A}
simply by A, when no ambiguity arises, and the union of two sets S, T, i.e. S ∪ T, simply
by ST. The nonempty power set of a set S is denoted by P(S). Finally, we will refer to the
cardinality of some standard encoding [GJ79] of S as the size of S.
5
Definition 2.1 (Relation and database) A relation schema R is a finite set of attributes
and a database schema R is a finite collection {R1 , R2 , . . . , Rn } of relation schemas.
We assume a countably infinite domain of constants, Dom, containing the distinguished
constants unk and dne, denoting the null values unknown and does not exist, respectively.
A tuple over R is a total mapping t from R into Dom such that ∀Ai ∈ R, t(Ai ) ∈ Dom.
The projection of a tuple t over R onto a set of attributes Y ⊆ R, denoted by t[Y], is the
restriction of t to Y. The projection t[Y] is said to be complete, if ∀Ai ∈ Y, t[Ai ] 6= unk, and
6 dne. Given a tuple t, we let t ↓
t[Y] is said to be nonnull if, in addition, ∀Ai ∈ Y, t[Ai ] =
denote the maximal subset Y of R such that t[Y] is nonnull.
An incomplete relation (or simply a relation) r over R is a finite set of tuples over R. A
relation whose tuples are all complete is called a complete relation. An incomplete database
(or simply a database) d over R is a finite collection {r1 , r2 , . . . , rn } of relations such that for
each i ∈ I, ri ∈ d is a relation over Ri ∈ R. A complete database is a database whose relations
are all complete.
Our justification for allowing occurrences of dne in a complete relation is that, although
dne is a distinguished value, it does not arise due to incompleteness of information. For
example, given a relation schema PERSON having an attribute SPOUSE and a relation r
over PERSON, the fact that for a tuple t ∈ r we have t[SPOUSE] = dne does not signify any
incompleteness of information.
From now on we let R be a relation schema and r be a relation over R.
Definition 2.2 (Less informative constants and tuples) We define a partial order in
Dom, denoted by v, as follows:
u v v if and only if u = v or (u = unk and v 6= dne), where u, v ∈ Dom.
We extend v to be a partial order in the set of tuples over R as follows: where t1 and t2
are tuples over R, t1 is less informative than t2 , written t1 v t2 , if ∀Ai ∈ R, t1 [Ai ] v t2 [Ai ].
Informally, POSS(r) is the set of all complete relations, which are more informative than
r, that is, they do not contain nulls of type unk.
Definition 2.3 (The set of possible worlds of a relation) The set of all possible worlds
relative to a relation r (or the set of all complete relations emanating from r), denoted by
POSS(r), is defined by
POSS(r) = {s | s is a relation over R and there exists a total and onto mapping
f : r → s such that ∀t ∈ r, t v f (t) and f (t) is complete}.
3
Distinguishability and identifiability
The notions of distinguishability and identifiability are fundamental to our generalisations
of the notions of key and foreign key. Distinguishability of a tuple by a set of attributes
X implies that the tuple is nonnull on X and that we can distinguish it from other tuples,
regardless of whether they have null values or not. Identifiability of a relation on a set of
attributes X implies that in all possible worlds s of r, each tuple of r is uniquely identifiable
by the X values of some tuple in s. The formal definitions are now given.
6
Definition 3.1 (Distinguishable tuple) A tuple t ∈ r is distinguishable by a set of attributes X, with X ⊆ R, if
1) t[X] is nonnull, and
2) there does not exist a tuple t0 ∈ r − {t}, with t[X] = t0 [X].
Definition 3.2 (Identifiable relation) A relation r over R is called an identifiable relation
on a set of attributes X ⊆ R, if for all complete relations s ∈ POSS(r), the cardinality of the
projection of s onto X is equal to the cardinality of r, i.e. |πX (s)|=|r|.
The next lemma characterises identifiable relations in terms of inequalities between pairs
of tuples in these relations.
Lemma 3.1 A relation r over R is identifiable on X ⊆ R if and only if for all pairs of distinct
tuples t1 , t2 ∈ r, there exists an attribute Ai ∈ X such that t1 [Ai ] and t2 [Ai ] are complete and
t1 [Ai ] =
6 t2 [Ai ]. 2
An alternative characterisation of identifiable relations is given in the next corollary.
Corollary 3.2 A relation r over R is identifiable on X ⊆ R if and only if for all pairs of
distinct tuples t1 , t2 ∈ r, t1 [X] 6v t2 [X]. 2
4
Key families and entity integrity in incomplete relations
Herein we define the notions of key family and minimal key family being satisfied in a relation.
We show that the notion of a minimal key family in a relation is faithful to the standard notion
of minimal key in a relation, and, in the absence of dne in relations, is a precise generalisation
of the notion of a key in a complete relation. (See [Mai83] for the concepts of faithfulness and
preciseness.) Finally, we define generalised entity integrity using the concept of a primary
key family.
Definition 4.1 (Key family) A key family K for R (or simply a key family, if R is understood from context) is a family K = {K1 , K2 , . . . , Kn } having n ≥ 1 subsets of R.
Given a key family K, a relation must be identifiable on the union of the elements Ki ∈ K.
Due to the possible presence of dne in relations, we need the further condition that each tuple
in a relation be distinguishable by at least one element Ki ∈ K.
Definition 4.2 (Satisfaction of a key family) A relation r over R satisfies a key family
K (or alternatively, K is a key family in r), written r |≈ K, if the following two conditions
are satisfied:
1) r is identifiable on
S
i∈I
Ki , and
2) every tuple in r is distinguishable by some set of attributes Ki ∈ K.
7
The reader can verify that for the boundary cases of a relation, say r, containing no tuples
or a single tuple, the key family consisting of the empty set, i.e. {∅}, is always satisfied in
r. To avoid these special cases we assume for the rest of the paper that relations contain at
least two tuples. When r contains two or more tuples, then {∅} is not satisfied in r.
We further observe that when at least some key family K is satisfied in r, where r =
{t1 , t2 , . . . , tm }, then {t1 ↓, t2 ↓, . . . , tm ↓} is also a key family in r. This key family is the
coarsest possible key family in r, which motivates the next definition.
Definition 4.3 (Minimal key family) A key family K is said to be minimal in r (or alternatively, K is a minimal key family in r), if
1) r |≈ K,
2) for no proper subset K0 ⊂ K, does r satisfy K0 , i.e. the cardinality of K is minimal,
and
3) for no proper subset K 0 i ⊂ Ki , with Ki ∈ K, does r satisfy (K − {Ki }) ∪ {K 0 i }, i.e. the
cardinalities of all the elements in K are minimal.
The proof of the following proposition, which provides justification for key families, is
straightforward.
Proposition 4.1 The following two statements are true:
1) The notion of a minimal key family is faithful to the notion of a minimal key in a relation
r, in the sense that if K = {K} is a minimal key family in r, then K is a minimal key
in r.
2) In the absence of dne in relations, the notion of a key family is a precise generalisation
of the notion of a key in a complete relation, in the sense that if K is a key family in r,
S
then for all complete relations s ∈ POSS(r), i∈I Ki is a key in s. 2
We observe that we cannot strengthen part (2) of Proposition 4.1 to say that, in the
absence of dne in relations, the notion of a minimal key family is a precise generalisation of
the notion of a minimal key. As a counterexample, {{NAME}, {ADDRESS}} is a minimal
key family in the relation shown in Table 1. However, if we replace the occurrence of unk in
the third tuple by Asmuns Hill, then {NAME, ADDRESS} is a key but not a minimal key
in the resulting complete relation. Moreover, as the relation shown in Table 3 demonstrates,
in the presence of dne, a key may be undefined according to the standard definition of a key,
assuming that the values of at least one of the candidate keys cannot be null. We are now
ready to generalise entity integrity.
Definition 4.4 (Generalised entity integrity) A primary key family is a key family, which
is designated by the user. Given a primary key family K, a relation r satisfies generalised
entity integrity, if K is a minimal key family in r.
We observe that the notion of generalised entity integrity reduces to the standard notion
of entity integrity when K is a singleton.
8
5
A comparison to Thalheim’s notion of a key set
We now compare our proposal of key family and minimal key family to Thalheim’s proposal
[Tha89] of key set and minimal key set. Since Thalheim’s approach was defined only for the
case when incomplete relations are allowed to have null values of type unk, we will assume in
this section that incomplete relations do not have nulls of type dne.
Definition 5.1 (Thalheim’s key set) A key set K for R (or simply a key set if R is understood from context) is a family of singleton attributes from R. A relation r over R satisfies
S
a key set K = {{A1 }, {A2 }, . . . , {An }}, if r is identifiable on X, where X = i∈I {Ai }.
A key set K is minimal in r if for no proper subset K0 ⊂ K does r satisfy the key set K0 .
The next proposition shows that key sets and key families are defined on exactly the same
class of relations.
Proposition 5.1 Given a relation r over R, some key set K is satisfied in r if and only if
S
S
some key family K0 is satisfied in r, where K∈K K = K 0 ∈K0 K 0 .
S
Proof. If. Suppose that some key family K0 is satisfied in r. Let X = K 0 ∈K0 K 0 , where X =
{A1 , A2 , . . . , Aq }. It follows that the key set {{A1 }, {A2 }, . . . , {Aq }} is also satisfied in r.
S
Only if. Suppose that some key set K is satisfied in r. Let X = K∈K K, where X =
{A1 , A2 , . . . , Aq }, and s = πX (r), where s = {t1 , t2 , . . . , tm }. The result follows, since the key
family {t1 ↓, t2 ↓, . . . , tm ↓} is also satisfied in r. 2
When we fix K, then the concepts of key set and key family are incomparable. Firstly,
{{NAME}, {ADDRESS}} is a key set in r but not a key family in r, where r is the relation
shown in Table 1, together with the occurrence of unk in the third tuple replaced by Harold
Rd. Secondly, {{NAME, ADDRESS}, {NAME, DOB}, {ADDRESS, DOB}} is a key family
in r but, since the elements of the key family are not singletons, not a key set in r, where r
is the relation shown in Table 2.
We observe that when we allow null values of type dne in relations, then distinguishability,
as opposed to identifiability, is also important. For example, if we treat dne as just another
nonnull value, then {{SS#}} and {{P#}} will be minimal key sets for the relation, say r,
shown in Table 3, which is contrary to Codd’s assertion that tuples should be distinguishable
by their nonnull values. On the other hand, in this case there is one minimal key family in r,
namely {{SS#}, {P#}}.
6
Combinatorial problems relating to key families
Herein we investigate some combinatorial properties of key families and show that the problem
of finding a minimal key family in a relation can be solved in polynomial time. There remain
the problems of deciding whether there exists a minimal key family in a relation which either
has an element of cardinality no greater than some natural number k or is itself of cardinality
no greater than k; these problems are both NP-complete. We also show that a relation can
be constructed having P(R) as its minimal key family.
Proposition 6.1 Given a relation r over R the problem of finding whether there exists a key
family K such r |≈ K can be solved in polynomial time in the sizes of r and R.
9
Proof. By Proposition 5.1 to test for identifiability it is sufficient to test whether {{A1 }, {A2 }, . . . , {An }}
is a key set in r, where R = {A1 , A2 , . . . , An }, treating dne as a nonnull value. If this test
is positive, then we can test distinguishability by testing whether each tuple t ∈ r is distinguishable by some X ∈ {t1 ↓, t2 ↓, . . . , tm ↓}, where r = {t1 , t2 , . . . , tm }. By Lemma 3.1 and
Definition 3.1 both these tests can be computed in polynomial time in the sizes of r and R,
which concludes the result. 2
Proposition 6.2 Given a relation r over R that satisfies at least one key family for R, the
problem of finding a key family K, such that K is a minimal key family in r, can be solved
in polynomial time in the sizes of r and R.
Proof. By Proposition 5.1 K = {K1 , K2 , . . . , Kn } is a key family in r, where r = {t1 , t2 , . . . , tn }
and for all i ∈ I, Ki = ti ↓. We note that, by Lemma 3.1 and Definition 3.1, r |≈ K can be
checked in polynomial time in the sizes of r and R.
We next minimise the cardinality of each element in K as follows. For each Ki ∈ K,
remove any attribute Aj ∈ Ki , if r |≈ (K − {Ki }) ∪ {Ki − {Aj }}; such an attribute Aj is
called a redundant attribute in Ki ∈ K with respect to r. We repeat the process of eliminating
redundant attributes Aj in Ki ∈ K with respect to r until no further changes to K are possible.
Since the cardinality of K is at most |r| and the cardinality of each Ki ∈ K is at most |R|,
this step takes polynomial time in the sizes of r and R.
In the final step we check if K is of minimal cardinality by removing from K any element
Ki if r |≈ K − {Ki }; such a set Ki of attributes is called redundant in K with respect to r.
We repeat this process of eliminating from K redundant elements in K with respect to r until
no further changes to K are possible. Since the cardinality of K is at most |r|, this step also
takes polynomial time in the sizes of r and R.
This concludes the proof, since K now satisfies the three conditions of Definition 4.3.
2
Proposition 6.3 The problem of deciding whether a relation r over R satisfies a key family
having an element of cardinality less than or equal to some natural number k is NP-complete.
Proof. To show that the problem is in NP we simply guess a key family K for R and then
check, in polynomial time in the sizes of r and R, whether the cardinality of some K ∈ K is
less than or equal to k and whether r |≈ K.
To show that the problem is NP-hard, a polynomial-time transformation from the problem
of whether a relation r, whose tuples are all nonnull, has a key of cardinality k or less, which
is known to be NP-complete [DT88] (cf. [LO78]), is trivial. In particular, if r is a relation
over R, whose tuples are all nonnull, then, trivially, r |≈ {R}, and thus the result follows.
2
Proposition 6.4 The problem of deciding whether a relation r over R satisfies a key family
of cardinality less than or equal to some natural number k is NP-complete.
Proof. To show that the problem is in NP we simply guess a key family K for R and then
check in polynomial time in the sizes of r and R, whether K has k or less elements and
whether r |≈ K.
10
To show that the problem is NP-hard a polynomial-time transformation from the vertex
cover problem of cardinality k or less, which is known to be NP-complete [GJ79], is given.
Given a graph (N, E) we construct a relation schema R having the same cardinality, say
n, as N such that each node in N corresponds to an attribute in R; for the rest of the proof we
do not distinguish between the nodes in N and their corresponding representative attributes
in R. In addition, we construct a complete relation r over R having the same cardinality as
E such that for each ei = {A1i , A2i } ∈ E we add the following tuple ti to r: ti [Ai1 ] = ti [A2i ] = i
and ti [R −{A1i , A2i }] = < dne, . . . , dne >. It remains to show that r satisfies a key family of
cardinality k or less if and only if (N, E) has a vertex cover of cardinality k or less.
We observe that by the construction of r the key family {{A1 }, . . . , {An }} is satisfied
in r, since each tuple ti ∈ r is distinguishable by both A1i and A2i , where the attribute set
S
{A1i , A2i } ⊆ R corresponds to the edge ei ∈ E; moreover, r is identifiable on i∈I {Ai }. (We
assume without loss of generality that (N, E) is connected and has no loops.) It remains to
show that (N, E) has a vertex cover whose cardinality is less than or equal to k if and only if
r satisfies a key family whose cardinality is less than or equal to k.
If (N, E) has a vertex cover, say V, whose cardinality is not greater than k, then, by the
construction of r, V is a key family that is satisfied in r. Conversely, if r satisfies a key family
K, whose cardinality is not greater than k, then by the above observation we can assume
without loss of generality that the elements of K are singletons and thus, by the construction
of r, K is a vertex cover of (N, E). 2
The next proposition shows that we can construct a relation over R such that it has a key
family of cardinality 2n − 1, where |R|= n, which is a minimal key family in this relation.
Proposition 6.5 There is a relation r over R such that P(R) is a minimal key family in r.
Proof. For each X ∈ P(R) we construct a complete relation rX as follows, with the proviso
that the nonnull values in rX are distinct from the nonnull values in any other relation rY ,
where Y ∈ P(R)−{X}. If X is a singleton, say {A}, then rA = {t}, where t[A] is nonnull and
t[R−{A}] = < dne, . . . , dne >. Otherwise, if X = {A1 , . . . , An }, with n > 1, then rX contains
n + 1 tuples t0 , t1 , . . . , tn such that t0 [X] = < 0, . . . , 0 > and t0 [R−X] = < dne, . . . , dne >,
and for all i ∈ I, ti [Ai ] = < 1 >, ti [X−{Ai }] = < 0, . . . , 0 > and ti [R−X] = < dne, . . . , dne >.
It can now be verified that r =
7
S
X∈P(R) rX
is the desired relation.
2
Foreign key families and generalised referential integrity
In the context of incomplete relations the fundamental constraint of referential integrity
[Cod79, Dat86, HR96] also needs to be re-evaluated. Herein we define the notion of foreign key family being satisfied in a database. We show that the notion of a foreign key family
in a database is faithful to the standard notion of foreign key in a database, and, in the absence of dne in relations, is a precise generalisation of the notion of a foreign key in a complete
database. Finally, we define generalised referential integrity using the concepts of foreign key
family and primary key family.
Definition 7.1 (Foreign key family) Let R and S be relation schemas in a database schema
R. Then a foreign key family F for R referencing a key family K for S (or simply a foreign key family F, if R and K are understood from context) is a family of pairs, F =
11
{(F1 , K1 ), (F2 , K2 ), . . . , (Fn , Kn )}, where for all i ∈ I, Fi ⊆ R, Ki ⊆ S, and |Fi |=|Ki |;
K = {K1 , K2 , . . . , Kn } is called the key family for S referenced by F.
Definition 7.2 (Satisfaction of a foreign key family) A database d over R containing
relations r over R and s over S, where R, S ∈ R, satisfies a foreign key family F (or alternatively, F is a foreign key family in d), written d |≈ F, if the following three conditions are
satisfied:
1) s |≈ K, where K is the key family for S referenced by F,
2) for all t ∈ r, there is some (Fi , Ki ) ∈ F such that for all A ∈ Fi , t[A] =
6 dne, and
3) for all t ∈ r, for all (Fi , Ki ) ∈ F, if for all A ∈ Fi , t[A] 6= dne, then there is some u ∈ s
such that t[Fi ] v u[Ki ].
The first condition in Definition 7.2 guarantees that K is a key family in s. The second
condition guarantees that the foreign key family is well defined for at least one pair (Fi , Ki ) ∈
F. Finally, the third condition guarantees that whenever the projection of a tuple t ∈ r onto
Fi does not have occurrences of dne, then it is less informative than the projection onto Ki of
some tuple u ∈ s. This is the standard requirement of a foreign key constraint, namely that
only entities which exist are referenced.
We note that given a database d and a foreign key family F, it can easily be tested in
polynomial time in the sizes of d and R whether F is a foreign key family in d over R or not.
The proof of the following proposition, which provides justification for foreign key families,
is straightforward. It is assumed that R, S ∈ R.
Proposition 7.1 The following two statements are true:
1) The notion of a foreign key family is faithful to the notion of a foreign key in a database
d containing relations r over R and s over S, in the sense that if F = {(F, K)} is a
foreign key family in d, then F is a foreign key for R in r referencing the key K for S
in s.
2) In the absence of dne in relations, the notion of a foreign key family is a precise generalisation of the notion of a foreign key in a complete database, in the sense that if F is a
foreign key family in d, containing relations r over R and s over S, then for all complete
relations r0 ∈ POSS(r), there exists a complete relation s0 ∈ POSS(s) such that for all
S
S
t0 ∈ r0 , there exists u0 ∈ s0 such that t0 [X] = u0 [Y], where X = i∈I Fi and Y = i∈I Ki .
2
As is the case for Proposition 4.1 we cannot strengthen the statements of Proposition 7.1.
We are now ready to generalise referential integrity.
Definition 7.3 (Generalised referential integrity) Let F be a foreign key family for R
referencing a primary key family K of S. A database d over R containing relations r over R
and s over S, where R, S ∈ R, satisfies generalised referential integrity, if F is a foreign key
family in d.
We observe that the notion of generalised referential integrity reduces to the standard
notion of referential integrity when F is a singleton.
12
8
Concluding remarks
Entity and referential integrity in their classical form were shown to be overly restrictive for
practical applications and as a result we proposed a generalisation of them. Our proposal
caters for the two types of null value, unk and dne, which should be supported by any modern
relational DBMS [Cod90]. We have generalised entity and referential integrity via the notions
of a key family being satisfied in an incomplete relation and a foreign key family being satisfied
in an incomplete database. The intuition behind this generalisation is based on the concepts
of distinguishability and identifiability. In order to uniquely identify a tuple in a relation we
need to be able to distinguish this tuple from all the other tuples in the relation, and, in
addition, its identification must be established in all possible worlds. The classical notions of
entity and referential integrity are special cases of our generalisation, so upgrading a DBMS
to provide this facility would be an upwards compatible feature.
The support of generalised entity and referential integrity will ease the task of the database
practitioner in situations such as those demonstrated via the examples given in the introduction. In the absence of such support database designers will usually apply the following
solution, involving the addition of a surrogate [Dat92] as an additional attribute to the relation schema. A surrogate is an attribute with no intrinsic meaning whose values are nonnull
and unique for each tuple in a relation. The values of a surrogate are generated either by
the database system (if it supports such a mechanism) or by an application program, and
should be concealed from the user, since they are not real-world identifiers and thus have
no meaning to the user. Although such a solution is viable, there are various overheads in
maintaining surrogates (see [Dat92] for a discussion on the adverse effects of surrogates on
database design, and on queries and updates to the database). We stipulate that entity and
referential integrity should be generalised even in the presence of surrogates. Consider, for
example, the relation, say r over R, shown in Table 3. If we add a surrogate attribute to R,
then it would be possible to have in r a tuple of the form <sno, unk, dne, unk>, where sno
is a surrogate value. In such a case when a tuple is null on both SS# and P#, users have
no meaningful way to uniquely identify the tuple. Thus surrogates do not solve the problem
that users may have of being able to identify in incomplete relations one tuple from another
by some set of meaningful values.
References
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
Reading, Ma., 1995.
[ANS75] ANSI/X3/SPARC. Study group on database management systems, interim report.
Bulletin of ACM SIGFIDET, 7(2), 1975.
[Cod79] E.F. Codd. Extending the database relational model to capture more meaning.
ACM Transactions on Database Systems, 4:397–434, 1979.
[Cod90] E.F. Codd. The Relational Model for Database Management: Version 2. AddisonWesley, Reading, Ma., 1990.
[Dat86]
C.J. Date. Referential integrity. In Relational Database: Selected Writings, pages
41–63. Addison-Wesley, Reading, Ma., 1986.
13
[Dat92]
C.J. Date. Composite keys. In Relational Database Writings 1989-1991, pages
467–474. Addison-Wesley, Reading, Ma., 1992.
[DT88]
J. Demetrovics and V.D. Thi. Relations and minimal keys. Acta Cybernetica,
8:279–285, 1988.
[GJ79]
M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman, New York, 1979.
[HR96]
T. Härder and J. Reinert. Access path support for referential integrity in SQL2.
The VLDB Journal, 5:196–214, 1996.
[LL86]
N. Lerat and W. Lipski Jr. Nonapplicable nulls. Theoretical Computer Science,
46:67–82, 1986.
[LL97]
M. Levene and G. Loizou. Null inclusion dependencies in relational databases.
Information and Computation, 136:67–108, 1997.
[LL99]
M. Levene and G. Loizou. A Guided Tour of Relational Databases and Beyond.
Springer-Verlag, London, 1999.
[LO78]
C.L. Lucchesi and S.L. Osborn. Candidate keys for relations. Journal of Computer
and System Sciences, 17:270–279, 1978.
[Mai83]
D. Maier. The Theory of Relational Databases. Computer Science Press, Rockville,
Md., 1983.
[Tha89] B. Thalheim. On semantic issues connected with keys in relational databases permitting null values. Journal of Information Processing Cybernetics, 25:11–20, 1989.
14