Entity relationship for relational database

Contents

Foreword to the second edition
Introduction
Overview and positioning
  Requirement collection
  Conceptual design
  Logical design
  Physical design
  Do not isolate data design
Basic objects
  Entity type, entity
  Attribute, attribute value
  Relationship type, relationship
  ISA
  ID
  Composite entity type
  Summary
  An example
  Another example (Multivalued attributes)
  A more comprehensive example
Correct ER diagram
Canonical mapping of ER diagrams into relation schemes
  Key
  Primary key
  Set of foreign/primary attribute pairs
  Conditions for the keys of a relationship type
  Conditions for the keys of an ID-dependent entity type
  Conditions for the keys of an ISA-dependent entity type
  Canonical mapping of a correct ER diagram into a relation scheme
Why entity relationship?
  1) Normalization poses no problem
  2) Referential integrity guidelines
  3) Avoiding NULLs
  4) Semantics support by the relational database system
More examples
A few design guidelines
Foreword to the second edition
Swiss Re Zurich has a long tradition of using relational databases. In 1983
we had the opportunity of taking part in a worldwide IBM early support
programme to test DB2, IBM’s mainframe relational database system, two
years before its general availability. DB2 is short for “data base two”. (The
corresponding “data base one” from IBM was IMS, developed in the late
1960s as an inventory tracking system for the U.S. moon landing effort and
still in widespread use.) The primary focus of “data base two” and relational
systems in general was and still is to provide the programmer and the user
with the simplest possible data structuring paradigm: the table, with rows
and columns. This paradigm of the relational database was invented by
E. F. Codd in the early 1970s, then at IBM, and is the only paradigm of
information technology for which a sound mathematical theory existed
before the running systems. This paradigm has generated a considerable
amount of interest, both in the theoretical and the practical worlds of information technology.
Swiss Re Zurich is one of the contributors to the further development of
this theory. (Swiss Re Zurich is the only company outside the information
technology sector that is mentioned in E. F. Codd’s 1990 book on the relational model). Also, we have a huge amount of practical experience with
relational databases. In that 1983 early support programme, a lot of users
inside and outside the IT department took the time and trouble to try out
the new system. A big database was created for the LIFE department and
filled with data from the operational systems, and the users could run their
queries against it. (This was the first instance of what we call a data warehouse today.) Other users could create their own databases and datasets and
“play” with them.
Thus more and more users became enthusiastic about the new database
paradigm, and two years later, we initiated our first operational application
systems that relied on the new database.
One of the first systems that took full advantage of the new table paradigm
was our business information system (BIS), which is probably the first big
application in the commercial world. It is still one of only a few constructed
on a pattern of relational database design that we call the “accordion principle”.
What does this mean?
One big disadvantage of most application systems in use is the fact that they
cannot cope with the need for new attributes or fields arising from changing
business needs.
Thus, the user “misuses” existing attributes to capture new information,
usually by creating a new coding system for the old attribute. Though this
makes it much more difficult to query for reports, users do it anyway
because it would be too time-consuming to have the IT department add a
new attribute and change all the affected programs.
Now, with the accordion principle, the attributes of certain important entities
are themselves data entries in a table, and can therefore be created dynamically,
without changing any single program. (A big application system has hundreds
of programs.)
In that way, a system can cope much better with changing business needs.
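The accordion principle itself is described in the examples at the end of the brochure; as a rough, hypothetical sketch only (table and column names are invented here, and SQLite is used for brevity), "attributes as data entries in a table" might look like this:

```python
import sqlite3

# Hypothetical sketch of the "accordion principle": attribute definitions
# are rows in a table, so a new attribute is an insert, not a schema change.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ATTRIBUTE_DEF (ATTR_NAME TEXT PRIMARY KEY);
    CREATE TABLE ATTR_VALUE (
        ENTITY_ID INTEGER NOT NULL,
        ATTR_NAME TEXT NOT NULL REFERENCES ATTRIBUTE_DEF,
        VALUE TEXT,
        PRIMARY KEY (ENTITY_ID, ATTR_NAME));
""")
# A changing business need: just insert a new attribute definition.
con.execute("INSERT INTO ATTRIBUTE_DEF VALUES ('RISK_CLASS')")
con.execute("INSERT INTO ATTR_VALUE VALUES (1, 'RISK_CLASS', 'A')")
row = con.execute(
    "SELECT VALUE FROM ATTR_VALUE "
    "WHERE ENTITY_ID=1 AND ATTR_NAME='RISK_CLASS'").fetchone()
print(row[0])  # -> A
```

A new business attribute is a row inserted at run time; none of the hundreds of application programs needs a schema change.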
It must be stressed that at Swiss Re Zurich, the primary impulse for such
abstract data design came from the user side, whereas the IT department
provided only the mathematical structuring. The accordion principle is
described in the collection of examples at the end of this brochure.
The accordion principle is not the only very abstract data design principle
that has found its way into the Swiss Re Zurich application systems. Even
today, in a time of object-oriented programming and client/server systems,
it makes sense to give careful attention to data design, because the data structures of operational application systems usually survive for a long time – at
least in terms of data processing. By the way, we have been very successful
both in doing object-oriented programming and in producing client/server
systems. We have been able to define an architectural framework capable of
combining the benefits of true client/server programming with the need to
process large amounts of data – data, moreover, that is handled by a multiuser community that surely does not want inconsistent or lost data.
As of September 1997, our DB2 systems contain more than 2000 tables in
operational applications, and about 25,000 tables for individual data processing (IDP). The number of tables for IDP might be too high, but very often
the user wants his own versioning and history-keeping concept. But the focus
of systematic data design lies primarily on the operational applications. This
huge amount of individual data processing spread over a big user community,
accompanied by a long tradition in the area, might be the reason that the
recent appearance of the buzzword “data warehouse” did not revolutionize
our reporting structures as it seemed to in certain companies.
As system design and implementation is also a “socio-cultural” process that
sometimes can be observed to go through chaotic phases, it is important to
use well-defined formal languages to describe the results of the design wherever possible. Experience with formal design languages has shown that it
is not appropriate to try to formalise everything, but data structures remain
an ideal focal point of strong formalisation.
The introduction to the first edition of this brochure (1991) claimed that the
entity relationship language presented here is capable of being formalized
mathematically. This hypothesis was proven by Wolfgang Jastrowski in his
thesis at the Swiss Federal Institute of Technology (ETH Zurich) in 1994
(“Eine formale Beschreibung des Entity Relationship Modells”). Generally,
the entity relationship languages presented in textbooks do not possess this
characteristic; they usually present a few examples but leave you to your own
devices when your application is more complex. (The BIS system mentioned
above, for example, has about 250 tables).
Of course, the benefits of using a good design language are not restricted to
DB2. In fact, we support four different relational database systems.
Let me make a few comments on the challenge posed by problems related to
managing data systems:
The most important principle is to trust your own people more than any
vendor representative who is promising you a data-handling rose garden. The
second important thing is to give your programmers and users a chance really
to understand the way the technical systems work, instead of burying them
under checklists of technical procedures. No other principles are nearly as
important.
Examples of vendor promises can be found in the data management buzzwords that have come and gone in the last few years (although sometimes
university institutes also promote ideas of purely theoretical interest). All
companies have survived the “complete data repository”, for instance. (In the
sense of offering a complete description of all data, it does not work.) “Full
CASE data programming” does not work, “reverse data engineering” does not
work, nor does the “companywide data model”, and so on. At the moment,
the term “data warehouse” (derived from IBM’s 1991 term “information
warehouse”) is still active. However, it will become successful only if people
are aware that a few new tools will not free them from dealing with the
old challenge of handling data and data structures properly. A special example
in the warehouse discussion is the OLAP pattern, invented by E. F. Codd in
1993. You will find a description of the typical OLAP data design pattern
in the examples section of this brochure.
In the introduction to the first edition, it was said that entity relationship is a
language and not a method, and that design has aspects of an art. While it
is still true that the main ingredients for data design are pad and pencil, a
blackboard and the heads of the discussion partners, experience nevertheless
shows that there are a few design guidelines worth following. These are
sketched out in a new section. The rest of the brochure has been left
unchanged, because any change in the definition of syntax or semantics of
the entity relationship language presented here would lower its quality.
Last but not least it should be mentioned that there is also an educational
reason for using an inherently mathematically structured data design language
for this brochure. The benefit that IT people can add to a company’s information management – besides helping the user to master the technology,
hopefully – lies in helping him to structure his information. And structuring
skills can be developed.
Zurich, November 1997
Introduction
Entity Relationship is a language for conceptual database design, originally
proposed by P. P. Chen in 1976. It consists of graphical symbols (the syntax),
which are supposed to have some meaning (the semantics).
Since 1976 a tremendous amount of literature about Entity Relationship (ER)
has appeared, describing many ER dialects, model applications and the like.
Most of these ER dialects lack an underlying mathematical foundation. This
fact is not very disturbing when only small textbook examples are involved,
but can have great consequences when one has to design large systems of, say,
a hundred or so entity types.
Another reason why there should be a mathematical definition of the design
language syntax and semantics is the fact that one has to map the design into
the structure of a database management system for implementation.
There seems to be a widespread belief that designing data structures independently of the structure of the database management system makes sense.
Taken as a general guideline, the usefulness of this principle can hardly be
denied. But it is a fact that every design language favors a certain structure
of data representation. In the case of the original ER language and its most
popular dialects this favored structure is the network database management
system.
In the case of relational database management systems, this bias towards the
network model leads to unwanted consequences – namely, that the designer
has to worry about the intricacies of a rather complicated theory of normalization in relational systems; sometimes he feels that the information content
of the original design is lost after mapping and normalization. The situation
becomes even worse if he wants to use a very advantageous feature of modern
relational systems, referential integrity.
This brochure describes a version of ER that was developed by and is used at
Swiss Re. (Since 1985 all new applications have been designed with relational
data structures only.)
Among the benefits of our favored version of ER are the following:
– mathematical definition of syntax and semantics (the designer does not
have to occupy himself – or herself – with mathematics);
– no problems with relational normalization;
– no problems with referential integrity;
– helps avoid the classical problems arising with missing information;
– all semantics expressible in the language can be enforced by the relational
system;
– the end user who has discussed the information content of the data
structures with the EDP department in terms of the conceptual design language will recognize these structures when he encounters them again in the
relational data, a fact which is of growing importance as individual data
processing becomes more and more widespread;
– notions of correct diagram and canonical mapping into relations are
defined, and these can be used as design guidelines.
The language described in this text is very simple, stripped of unnecessary
extras, and should be understood by anyone who has a little experience with
relational databases. There are “mathematical notes” added so that a reader
with mathematical interests can write down the underlying formalisms if he
wishes, but no knowledge of mathematics is necessary to understand the language. Special emphasis is laid on examples.
It must be stressed that ER is only a language, not a method. It is desirable
and possible to fully understand such a language, but it will never become a
method that can tell us how to map users’ perceptions of the reality of data
into conceptual data structures. Data design will remain an art.
Overview and positioning
To understand the wider context of ER in systems design, consider the following matrix:
                      Data        Operations  Constraint
                      dictionary  dictionary  dictionary
Requirement design        •           •           •
Conceptual design         •           •           •
Logical design            •           •           •
Physical design           •           •           •
One can imagine the whole design process as consisting of phases, from
requirement design down to physical design, although it cannot proceed in
such a simple linear fashion. In our context it is not important whether we
understand requirement design as the process of collecting requirements, or
as a possible result of such a collection; this also holds for the other design
phases.
Therefore it makes sense to talk of the operations dictionary for conceptual
design, the data dictionary for physical design, and so on.
ER is a language for the data dictionary for conceptual design, which means
that it covers a very small part of the whole picture.
Let us now trace the data dictionary column of the above matrix with the
help of a simple example. This “walk” down the data dictionary column is
somewhat artificial, because in reality physical data design is not only dependent on logical data design, but also on logical operations and constraints
design. Similar remarks would be appropriate for physical operations design,
logical data design and so on.
Requirement collection
Information must be kept about persons, companies and the employment of
persons by companies. Any person can be employed by at most one company. For persons, we want to keep family and given names and home addresses, and for companies, names and business addresses. In the world we want
to model, a person is uniquely identified by the combination of family and
given names, and a company by its name. In the following we shall refer to
family name simply as “name” and to given name as “forename”.
Conceptual design
A possible mapping of the requirements into an ER diagram is:
[fig. 1: entity types PERSON (attributes NAME, FORENAME, ADDRESS) and COMPANY (attributes NAME, ADDRESS); relationship type EMPLOYMENT (attributes PNAME, PFORENAME, CNAME) with an arrow labelled m to PERSON and an arrow labelled 1 to COMPANY]
PERSON and COMPANY are entity types, EMPLOYMENT a relationship
type. NAME, FORENAME and ADDRESS are attributes of PERSON.
The arrows mean existence constraints: every employment, i.e. every relationship of type EMPLOYMENT, is existentially dependent on one person
and on one company. In other words, before the employment <p,c> can be
added, the person p and the company c must already be there.
The label 1 at the arrow from EMPLOYMENT to COMPANY must be read
in the context of all entity types on which EMPLOYMENT is existentially
dependent. In our case, the label 1 means that for every entity p of type
PERSON there is at most one entity c of type COMPANY such that the pair
<p,c> is a relationship of type EMPLOYMENT. “At most one” means, of
course, zero or one.
The label m at the arrow from EMPLOYMENT to PERSON means “no
condition”, i.e. no cardinality constraint: a company can have zero, one or many
persons employed.
Logical design
The logical level in our case is the relational model, which is more or less
standardized in the literature. Normalization theory concerns this level (but,
as we shall see, it will cause us no trouble).
A possible mapping of the above ER diagram into a relational model might
produce the following:
relation schemes
PERSON {NAME,FORENAME,ADDRESS}
PrimaryKey {NAME,FORENAME}
COMPANY {NAME,ADDRESS}
PrimaryKey {NAME}
EMPLOYMENT {PNAME,PFORENAME,CNAME}
Key {PNAME,PFORENAME}
together with the inclusion dependencies
EMPLOYMENT {PNAME,PFORENAME} =<
PERSON {NAME,FORENAME}
EMPLOYMENT {CNAME} =< COMPANY {NAME}
The primary keys in PERSON and COMPANY are defined because these
relations are the targets of inclusion dependencies, and the key in EMPLOYMENT is a consequence of the label 1 at the arrow from EMPLOYMENT
to COMPANY.
Note that a relation can have any number of keys (exact definitions will be
given in the main expository section, not in this rather informal overview),
and that exactly one of the keys of a relation must be declared as primary if
this relation is the target of an inclusion dependency.
Inclusion dependencies, together with some operations constraints like “delete
propagation”, for example, have been designated as referential integrity constraints by E. F. Codd, the “father” of the relational model.
Possible tables for these schemes and dependencies could be the following
(Note the difference between relation and table: in a relation we have no
ordering of either columns or lines.):
PERSON
NAME      FORENAME   ADDRESS
Meier     Hans       Zurich
Schmidt   Bruno      Geneva
Lardi     Ursula     St.Gallen
Benz      Martha     Lucerne

COMPANY
NAME       ADDRESS
Hasler AG  St.Gallen
SR AG      Zurich
FIDEX      Oerlikon

EMPLOYMENT
PNAME     PFORENAME  CNAME
Meier     Hans       Hasler AG
Schmidt   Bruno      SR AG
Lardi     Ursula     SR AG
An entry <Meier,Ursula,FIDEX> in the table EMPLOYMENT would violate
the inclusion dependency
EMPLOYMENT {PNAME,PFORENAME} =<
PERSON {NAME,FORENAME}.
An additional entry <Meier,Hans,SR AG> would violate
Key {PNAME,PFORENAME}.
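The schemes, keys and inclusion dependencies above can be tried out directly. The following sketch uses SQLite instead of DB2 (an assumption made here for portability; the types are simplified), expressing the inclusion dependencies as foreign keys and the key as a unique constraint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
con.executescript("""
    CREATE TABLE PERSON (
        NAME TEXT NOT NULL, FORENAME TEXT NOT NULL, ADDRESS TEXT,
        PRIMARY KEY (NAME, FORENAME));
    CREATE TABLE COMPANY (NAME TEXT NOT NULL PRIMARY KEY, ADDRESS TEXT);
    CREATE TABLE EMPLOYMENT (
        PNAME TEXT NOT NULL, PFORENAME TEXT NOT NULL, CNAME TEXT NOT NULL,
        UNIQUE (PNAME, PFORENAME),  -- Key {PNAME,PFORENAME}
        FOREIGN KEY (PNAME, PFORENAME) REFERENCES PERSON,
        FOREIGN KEY (CNAME) REFERENCES COMPANY);
""")
con.execute("INSERT INTO PERSON VALUES ('Meier', 'Hans', 'Zurich')")
con.execute("INSERT INTO COMPANY VALUES ('Hasler AG', 'St.Gallen')")
con.execute("INSERT INTO COMPANY VALUES ('SR AG', 'Zurich')")
con.execute("INSERT INTO EMPLOYMENT VALUES ('Meier', 'Hans', 'Hasler AG')")

violations = []
# <Meier,Ursula,...>: no such person, violates the inclusion dependency.
try:
    con.execute("INSERT INTO EMPLOYMENT VALUES ('Meier', 'Ursula', 'Hasler AG')")
except sqlite3.IntegrityError:
    violations.append("inclusion dependency")
# <Meier,Hans,SR AG>: violates Key {PNAME,PFORENAME}.
try:
    con.execute("INSERT INTO EMPLOYMENT VALUES ('Meier', 'Hans', 'SR AG')")
except sqlite3.IntegrityError:
    violations.append("key")
print(violations)  # -> ['inclusion dependency', 'key']
```

Both violating entries are rejected by the database system itself, with no application code involved.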
Physical design
One should not confuse a possible data dictionary physical design entry for
tables and the like as it exists, for instance, in the systems catalog of modern
relational database management systems with the act of creating such an
entry. Nevertheless we present here the physical design in terms of such
creation.
create table PERSON
  (NAME       character(20)  not null,
   FORENAME   character(20)  not null,
   ADDRESS    character(30),
   primary key (NAME,FORENAME) )

create unique index PERSON on PERSON
  (NAME,FORENAME)

create table COMPANY
  (NAME       character(20)  not null,
   ADDRESS    character(30),
   primary key (NAME) )

create unique index COMPANY on COMPANY
  (NAME)

create table EMPLOYMENT
  (PNAME      character(20)  not null,
   PFORENAME  character(20)  not null,
   CNAME      character(20)  not null,
   foreign key (PNAME,PFORENAME) references PERSON,
   foreign key (CNAME) references COMPANY )

create unique index EMPLOYMENT1 on EMPLOYMENT
  (PNAME,PFORENAME)

create index EMPLOYMENT2 on EMPLOYMENT
  (CNAME)
The index EMPLOYMENT2 is not a consequence of the process of mapping
from logical to physical design. It is there to stress the fact that the physical
design is not uniquely determined by the logical data design. In reality it
could stem from the fact that the operation
given a company name
search for all employees of that company
is a heavily used operation which must have very good performance.
There are other aspects of the physical data design not further mentioned
here that cannot be derived from the logical design of the data structures only:
file sizes, distribution over disk space, clustering of data, and so on. Even
concurrency or data reorganization aspects can have an influence on the physical design. (Remember that giving more free space will lessen the need for
reorganization in the case of tables with several million rows.)
Observe that it is only on this physical level that the ordering of table columns
has to be specified. This ordering will be determined by data distribution and
access path considerations; but this subject will not be pursued here.
Do not isolate data design
Remember: this is an introductory overview. Before we go on to more exact
definitions, let us consider another simple example, which should remind us
that it is not possible to design data without taking the operations and constraints into account.
We want to collect countries into groups and keep information about these
groups together with periods of validity. We consider two different variants:
Variant 1, DATA:
BELONGING {COUNTRY,GROUP,FROMTIME,TOTIME}
Variant 2, DATA:
BELONGING {COUNTRY,GROUP,FROMTIME}
For all times from a fixed start time t0, a unique relation of belonging should
be defined. This leads to the following constraints:
Variant 1, CONSTRAINTS:
1) for all <country,group,fromtime,totime> in BELONGING, fromtime
must be smaller than totime;
2) if <country,group1,fromtime1,totime1> is in BELONGING, and <country,group2,fromtime2, totime2> is in BELONGING, then totime1
<=fromtime2 or totime2<=fromtime1 (non-overlapping);
3) for every time>=t0 and every country there exists a group, fromtime and
totime such that fromtime<=time<totime and
<country,group,fromtime,totime> is in BELONGING (no holes).
Variant 2, CONSTRAINTS:
1) for every country there is a group and a fromtime such that fromtime<=t0
and <country,group,fromtime> is in BELONGING.
Variant 1 has more complicated constraints than variant 2. But what about
operations? Consider a typical operation:
Given a country c and a time t, we want to know the group to which
the country belongs at that time.
Variant 1, OPERATION:
select GROUP from BELONGING
where COUNTRY=c and FROMTIME<=t
and t<TOTIME
Variant 2, OPERATION:
select GROUP from BELONGING B
where COUNTRY=c and FROMTIME<=t
and not exists
(select * from BELONGING
where COUNTRY=c
and B.FROMTIME<FROMTIME
and FROMTIME<=t)
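On sample data, both variants return the same group. The sketch below runs both queries in SQLite (an assumption made here; the column GROUP is also renamed GRP because GROUP is a reserved word in SQL, and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# GROUP is a reserved word in SQL, so the column is called GRP here.
con.executescript("""
    CREATE TABLE BELONGING1 (COUNTRY TEXT, GRP TEXT, FROMTIME INT, TOTIME INT);
    CREATE TABLE BELONGING2 (COUNTRY TEXT, GRP TEXT, FROMTIME INT);
""")
# Country CH belongs to group A during [0,10) and to group B from time 10 on.
con.executemany("INSERT INTO BELONGING1 VALUES (?,?,?,?)",
                [("CH", "A", 0, 10), ("CH", "B", 10, 99)])
con.executemany("INSERT INTO BELONGING2 VALUES (?,?,?)",
                [("CH", "A", 0), ("CH", "B", 10)])
c, t = "CH", 12
# Variant 1: a simple range predicate.
v1 = con.execute(
    "SELECT GRP FROM BELONGING1 "
    "WHERE COUNTRY=? AND FROMTIME<=? AND ?<TOTIME", (c, t, t)).fetchone()[0]
# Variant 2: the latest FROMTIME not after t, via NOT EXISTS.
v2 = con.execute(
    "SELECT GRP FROM BELONGING2 B "
    "WHERE COUNTRY=? AND FROMTIME<=? AND NOT EXISTS "
    "  (SELECT * FROM BELONGING2 "
    "   WHERE COUNTRY=? AND B.FROMTIME<FROMTIME AND FROMTIME<=?)",
    (c, t, c, t)).fetchone()[0]
print(v1, v2)  # both variants agree
```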
We see that with operations, the situation is contrary to that of the constraints.
Should we take variant 1 or variant 2?
More complicated constraints mean more programming or more administrative
effort. One should take into consideration how often updates have to be made
and by how many people. (If the groups are marketing regions, then BELONGING probably must be updated once a year, but if the groups are classes of
weather conditions, then more frequent updates are to be expected.)
More complicated operations mean less efficient performance (which can be
significant if the table is queried by operational programs a few hundred
times a day). In the case of individual data processing, complicated operations demand that the user know his query language very well.
Besides comparing the pros and cons of the two variants as they currently
appear, one must usually also make guesses about the situation in the future.
This booklet gives no answers to questions like this. It only describes a language. As mentioned above, data design will remain an art.
Basic objects
This chapter defines the syntax and semantics of the basic objects and of
some connection types.
The syntax will consist of the graphical symbols and allowed compositions of
these symbols.
The semantics of an ER diagram is defined by the mapping of the diagram
into relations. A full definition of the semantics can only be given after the
chapter on correct ER diagrams; thus the semantic definitions in the present
chapter have a somewhat introductory and provisional character.
The undefined notions of the language are
entity type, entity
relationship type, relationship
attribute, attribute value.
Note that we say “attribute, attribute value” instead of “attribute type,
attribute”, which would be more systematic (historical reasons).
Why are these notions undefined? Because it is impossible to give definitions
of the type “an entity type is when...”.
One should define concrete entity types as clearly as possible, for instance
“the entity type CONTRACT covers all kinds of contracts the company has
made with third parties”, or “the entity type PERSON covers all (natural)
persons with whom the company has a contract of employment”.
Thus, one can define
“the entity type E1 is...”
“the entity type E2 is...”,
and so on.
but one cannot define
“an entity type is...”.
We can discover an analogy with geometry. The student is never told what a
point or a line is; he just has to learn to work with them. Point and line are
undefined notions of elementary geometry.
Entity type, entity
Back to the entity types PERSON and COMPANY.
The entity “John Smith” is of entity type PERSON.
The entity “Swiss Re” is of entity type COMPANY.
[fig. 2: the entity types PERSON and COMPANY, each drawn as a rectangle]
The graphic notation for an entity type is a rectangle.
Attribute, attribute value
Entity types have attributes, for instance
NAME is an attribute of the entity type PERSON.
FORENAME is an attribute of the entity type PERSON.
BIRTHDATE is an attribute of the entity type PERSON.

“Smith” is an attribute value of the attribute NAME.
“John” is an attribute value of the attribute FORENAME.
“79/03/29” is an attribute value of the attribute BIRTHDATE.
[fig. 3: entity type PERSON with attributes NAME, BIRTHDATE and FORENAME drawn as ovals]
The graphic notation for an attribute is an oval (circle, ellipse) connected with
the corresponding entity type by a line.
Attribute names must be unique within one and the same entity type.
Relationship type, relationship
Relationship types can be defined among entity types (a relationship type
“connects” at least two entity types).
As an example, let the entity types STAFF and PROJECT be given. Then a
relationship type COOPERATION can be defined which is existentially
dependent on STAFF and on PROJECT; this means that a cooperation <s,p>
can only exist if there exist entities s and p of types STAFF and PROJECT
respectively. To exist means, of course, that there is a corresponding entry in
the database.
[fig. 4: relationship type COOPERATION with an arrow labelled m to STAFF and an arrow labelled 1 to PROJECT]
The graphic notation for a relationship type is a diamond with arrows connecting it to the entity types on which it depends.
Each arrow has a label, “1” or “m”, with the following meaning:
The label “1” at the arrow from COOPERATION to PROJECT means
that for each entity s of type STAFF there is at most one entity p of type
PROJECT such that the pair <s,p> forms a relation of type COOPERATION.
The label “m” at the arrow from COOPERATION to STAFF means no condition.
Here is another example, where the relationship type is dependent on three
entity types:
[fig. 5: relationship type TRADERELATION with attribute INTRADESINCE; arrows labelled 1 to COMPANY (attribute COMPANYNAME), 1 to COUNTRY (attribute COUNTRYNAME) and m to PRODUCT (attribute PRODUCTNAME)]
Fig. 5 could be a database of a Department of Commerce, where information
is stored about which company exports which product to which country.
If f is an entity of type COMPANY, c an entity of type COUNTRY and p
an entity of type PRODUCT, then the combination <f,c,p> might or might
not be a relation of type TRADERELATION. If the relation <f,c,p> exists,
then it is existentially dependent on the entities f, c and p, which is indicated
by the arrows.
The label “1” at the arrow from TRADERELATION to COMPANY means
that for each pair c, p of entities of types COUNTRY and PRODUCT,
respectively, there exists at most one entity f of type COMPANY such that
<f,c,p> is a relation of type TRADERELATION.
The label “1” at the arrow from TRADERELATION to COUNTRY means
that for each pair f, p of entities of types COMPANY and PRODUCT,
respectively, there exists at most one entity c of type COUNTRY such that
<f,c,p> is a relation of type TRADERELATION.
The label “m” at the arrow from TRADERELATION to PRODUCT means
no condition. (To say “at most many” means the same as to say nothing.)
More generally, if the relationship type R is existentially dependent on the
entity types E1, E2, ..., En and only on these, then a label “1” at the arrow
from R to E1 means that for every combination of entities e2, ..., en of types
E2, ..., En, respectively, there is at most one entity e1 of type E1 such that
<e1, e2, ..., en> forms a relation of type R. (This applies analogously to a
label “1” at other arrows.)
A more exact definition of the mapping of ER diagrams into relations will
be given in a later chapter. Bypassing the logical stage of relations for the
moment, (part of ) the physical design for the diagram of fig. 5 might appear
as follows:
create table COMPANY
  (COMPANYNAME   character(20)  not null,
   ... (additional attributes)
   primary key (COMPANYNAME) );

create unique index COMPANY on COMPANY
  (COMPANYNAME);

create table COUNTRY
  (COUNTRYNAME   character(20)  not null,
   ... (additional attributes)
   primary key (COUNTRYNAME) );

create unique index COUNTRY on COUNTRY
  (COUNTRYNAME);

create table PRODUCT
  (PRODUCTNAME   character(20)  not null,
   ... (additional attributes)
   primary key (PRODUCTNAME) );

create unique index PRODUCT on PRODUCT
  (PRODUCTNAME);

create table TRADERELATION
  (COUNTRYNAME   character(20)  not null,
   COMPANYNAME   character(20)  not null,
   PRODUCTNAME   character(20)  not null,
   INTRADESINCE  date,
   ... (additional attributes)
   foreign key (COUNTRYNAME) references COUNTRY,
   foreign key (COMPANYNAME) references COMPANY,
   foreign key (PRODUCTNAME) references PRODUCT );

create unique index TRADERELATION1 on TRADERELATION
  (COMPANYNAME,PRODUCTNAME);

create unique index TRADERELATION2 on TRADERELATION
  (COUNTRYNAME,PRODUCTNAME);
The index TRADERELATION1 will guarantee the semantics of the label
“1” at the arrow from TRADERELATION to COUNTRY.
The index TRADERELATION2 will guarantee the semantics of the label
“1” at the arrow from TRADERELATION to COMPANY.
Note that besides the foreign key attributes, a relationship type can have
additional attributes.
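How a unique index guarantees a label “1” can be sketched in SQLite (an assumption made here; the trade data is invented): TRADERELATION2 rejects a second company exporting the same product to the same country.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TRADERELATION (
    COUNTRYNAME TEXT NOT NULL,
    COMPANYNAME TEXT NOT NULL,
    PRODUCTNAME TEXT NOT NULL)""")
# TRADERELATION2 stands for the label "1" at the arrow to COMPANY:
# per (country, product) there is at most one company.
con.execute("""CREATE UNIQUE INDEX TRADERELATION2 ON TRADERELATION
    (COUNTRYNAME, PRODUCTNAME)""")
con.execute("INSERT INTO TRADERELATION VALUES ('Egypt', 'Hasler AG', 'Wire')")
try:
    # A second company for the same country and product contradicts the "1".
    con.execute("INSERT INTO TRADERELATION VALUES ('Egypt', 'SR AG', 'Wire')")
    ok = False
except sqlite3.IntegrityError:
    ok = True
print(ok)  # -> True
```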
As another example consider fig. 6.
[fig. 6: relationship type CATEGORIZATION with an arrow labelled m to BOOK and an arrow labelled 1 to FIELD]
Every book can be assigned to at most one field.
And another example (fig. 7).
[fig. 7: relationship type RECOMMENDATION with arrows labelled m to BOOK, 1 to TEACHER, m to CLASS and 1 to SUBJECT]
Any book in any subject is recommended to any class by at most one teacher.
ISA
Generalization/specialization is captured by the notion of ISA, which connects two entity types.
[fig. 8: entity type BUSINESSPARTNER (attributes B#, ADDRESS) with ISA arrows from CUSTOMER (attributes C#, TOTALAMOUNT) and SUPPLIER (attributes S#, TOTALAMOUNT)]
The arrow, which as usual means existential dependency, this time has the
label “ISA” (and will be mapped, as all arrows are, into a referential integrity
constraint).
A customer can only be added if he already exists as a BUSINESSPARTNER
(of course both can be added in the same transaction, but the businesspartner
must be added first).
Again bypassing the logical design stage, the above diagram could be mapped
to the following (physical) tables:
create table BUSINESSPARTNER
  (B#       integer        not null,
   ADDRESS  character(20),
   ... (additional attributes)
   primary key (B#) );

create unique index BUSINESSPARTNER on BUSINESSPARTNER
  (B#);

create table CUSTOMER
  (C#           integer  not null,
   TOTALAMOUNT  real,
   ... (additional attributes)
   foreign key (C#) references BUSINESSPARTNER );

create unique index CUSTOMER on CUSTOMER
  (C#);

create table SUPPLIER
  (S#           integer  not null,
   TOTALAMOUNT  real,
   ... (additional attributes)
   foreign key (S#) references BUSINESSPARTNER );

create unique index SUPPLIER on SUPPLIER
  (S#).
Note that as usual the arrows (without labels) are mapped into the referential
integrity constraints:
foreign key (C#) references BUSINESSPARTNER
and
foreign key (S#) references BUSINESSPARTNER.
The labels “ISA” are mapped into key conditions: C# must be a key of
CUSTOMER and S# must be a key of SUPPLIER.
This might be a little confusing: because of the referential integrity constraints, C# and S# are foreign keys, and in general a foreign key is not a key.
(Rather, a foreign key corresponds to “a key in a foreign region”.)
In a situation like the one in fig. 8, one usually designs the common attributes
at the generalization entity type, although this is not a must. One could for
instance have the ADDRESS attribute repeated at CUSTOMER (controlled
redundancy for performance reasons if one intends, for example, to design a
very often-used query on CUSTOMER which orders by ADDRESS). One
could also have ADDRESS only at CUSTOMER and at SUPPLIER.
ID
A hierarchical connection between entity types is represented by an ID connection:
[fig. 9: CONTRACT (C#) with ID-dependent entity type PARTOFCONTRACT (C#, P#)]
“ID” derives from IDentification:
It is typical for a hierarchical connection that an entity of the dependent level
is uniquely identified “in a natural way” only within its parent entity.
As usual the arrow will be mapped into a referential integrity constraint (the
label “ID” is mapped into a key condition for the dependent entity type).
The diagram could be mapped into the following tables:
create table CONTRACT
   (C#           integer        not null,
    ... (additional attributes)
    primary key (C#) );

create unique index CONTRACT on CONTRACT (C#);

create table PARTOFCONTRACT
   (C#           integer        not null,
    P#           integer        not null,
    ... (additional attributes)
    foreign key (C#) references CONTRACT );

create unique index PARTOFCONTRACT on PARTOFCONTRACT (C#,P#);
The attribute C# of PARTOFCONTRACT is a foreign key which references
(the attribute C# of ) CONTRACT. This amounts to the arrow without the
label.
The label “ID” is mapped into a key condition: besides the foreign key
attribute C#, there must be a further attribute of PARTOFCONTRACT,
here P#, such that the set {C#,P#} is a key of PARTOFCONTRACT. (This
is the logical level formulation; on the physical level, the key condition is,
of course, mapped into a unique index.)
So, in loose terms, “ID” means that there must be a key of the dependent
entity type which properly contains the primary key of the parent entity
type.
Note that a primary key of an entity type is nothing but a key which is designated as primary. In general, an entity type can of course have several keys,
but at most one of them can be designated as primary. (The referential
integrity construct “foreign key” in SQL references the parent table, not
attributes thereof.)
It should be clear by now that a table (which corresponds to an entity type)
must have a primary key defined as soon as the entity type has an incoming
arrow.
As a further example consider fig. 10.
[fig. 10: chain of ID arrows: STREET (C, R, P, S) -> PLACE (C, R, P) -> REGION (C, R) -> COUNTRY (C)]
Since COUNTRY has an incoming arrow, it must have a primary key. If we
suppose that the attribute C characterizes a unique number or name, then
C is a key which we designate as primary. (It would be more exact to say that
the set {C} is a key.)
The attribute C of REGION is the foreign key referencing COUNTRY, and
must be there because of the arrow.
Now, the label “ID” at the arrow forces us to give an additional attribute,
R, of REGION, such that {C,R} is a key of REGION.
Because REGION has an incoming arrow it must have a primary key. We
designate the key {C,R} of REGION as primary.
The attribute set {C,R} of PLACE is the foreign key referencing REGION
and must be there because of the arrow.
Now, the label “ID” at the arrow forces us to give an additional attribute,
P, of PLACE, such that {C,R,P} is a key of PLACE.
Because PLACE has an incoming arrow it must have a primary key. We designate the key {C,R,P} of PLACE as primary.
The attribute set {C,R,P} of STREET is the foreign key referencing PLACE,
and must be there because of the arrow.
Now the label “ID” at the arrow forces us to give an additional attribute,
S, of STREET, such that {C,R,P,S} is a key of STREET.
Note that since STREET has no incoming arrow we do not have to designate
a key of STREET as primary.
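Bypassing the logical design stage once more, fig. 10 could be mapped to the following tables (a sketch; the column type integer is an assumption, and the attribute lists are abbreviated as before):

```sql
create table COUNTRY
   (C   integer   not null,
    ... (additional attributes)
    primary key (C) );

create unique index COUNTRY on COUNTRY (C);

create table REGION
   (C   integer   not null,
    R   integer   not null,
    ... (additional attributes)
    primary key (C,R),
    foreign key (C) references COUNTRY );

create unique index REGION on REGION (C,R);

create table PLACE
   (C   integer   not null,
    R   integer   not null,
    P   integer   not null,
    ... (additional attributes)
    primary key (C,R,P),
    foreign key (C,R) references REGION );

create unique index PLACE on PLACE (C,R,P);

create table STREET
   (C   integer   not null,
    R   integer   not null,
    P   integer   not null,
    S   integer   not null,
    ... (additional attributes)
    foreign key (C,R,P) references PLACE );

create unique index STREET on STREET (C,R,P,S);
```

Note that STREET, having no incoming arrow, declares no primary key; its key {C,R,P,S} is still captured by a unique index.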
Composite entity type
A composite entity type originates from a relationship type which is to
receive an incoming arrow (in which case a primary key will have to be defined).
An example:
[fig. 11: relationship type COURSE connecting COURSETYPE (m) and LOCATION (m)]
becomes
[fig. 12: as fig. 11, but COURSE is now a composite entity type (diamond enclosed in a rectangle), with BASICCOURSE ISA-dependent on COURSE]
In the graphic notation, the diamond signifying a relationship type is enclosed
by a rectangle as soon as the relationship type is transformed into a composite entity type.
The meaning of the arrows with their labels does not change.
The motivation behind this construct is of an aesthetic nature and can be
demonstrated with the above example.
The ISA connection says that a basic course is a course. Now if BASICCOURSE were an entity type and COURSE a relationship type, this would
mean that an entity (the basic course) would be a relationship (the course).
But if an entity can be a relationship, then the rules of handling diagrams
become more complicated (the “metamodel” becomes more complicated).
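If, for illustration, we assume primary keys {CT#} for COURSETYPE and {L#} for LOCATION (both attribute names invented), the composite entity type COURSE and its ISA-dependent BASICCOURSE could be mapped as follows:

```sql
create table COURSE
   (CT#   integer   not null,
    L#    integer   not null,
    ... (additional attributes)
    primary key (CT#,L#),
    foreign key (CT#) references COURSETYPE,
    foreign key (L#) references LOCATION );

create unique index COURSE on COURSE (CT#,L#);

create table BASICCOURSE
   (CT#   integer   not null,
    L#    integer   not null,
    ... (additional attributes)
    foreign key (CT#,L#) references COURSE );

create unique index BASICCOURSE on BASICCOURSE (CT#,L#);
```

COURSE must declare a primary key precisely because it has an incoming (ISA) arrow.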
Summary
The rules sketched above for mapping diagrams into tables (which will be
defined more exactly in a later chapter) are very simple and, in loose terms,
as follows:
rectangles, diamonds    ->   tables (with corresponding attributes)
arrows                  ->   referential integrity constraints
labels of arrows        ->   keys (unique indexes)

An example
Compare the two possibilities of data structuring for real estate shown in the
following figures.
[fig. 13: DWELLINGUNIT ID-dependent on PROPERTY, PROPERTY ID-dependent on HOUSINGESTATE; relationship type BELONGS connecting DWELLINGUNIT (m) and OWNER (1)]
[fig. 14: HOUSINGESTATE, PROPERTY and DWELLINGUNIT all ISA-dependent on OBJECT; relationship type BELONGS connecting OBJECT (m) and OWNER (1); CONTAINS1 connecting HOUSINGESTATE (1) and PROPERTY (m); CONTAINS2 connecting PROPERTY (1) and DWELLINGUNIT (m)]
Which structure is better?
This cannot be said definitely. The answer depends on questions like:
– What kinds of operations and constraints are expected?
– Are operations/constraints to be programmed or maintained by
administrative procedures (generic query tool and operational rules)?
– What is the probability that the data structure will have to be extended
(for instance, insertion of an entity type FLOOR between
DWELLINGUNIT and PROPERTY)?
– Where do the data come from (inserted piece by piece or loaded as a
whole from some other system)?
– How is the diagram embedded in its neighborhood? (Other connections
than that to the entity type OWNER are possible.)
Another example
(Multivalued attributes)
Consider fig. 15.
[fig. 15: entity type BOOK with attributes ISBNNUMBER, AUTHOR, TITLE and SUBJECT]
We have an entity type BOOK with attributes ISBNNUMBER, AUTHOR,
TITLE, and SUBJECT.
Now we realize that it should be possible to assign more than one subject to
a book, so that SUBJECT takes on the character of a multivalued attribute.
A possible solution to this problem is shown in fig. 16, where SUBJECT has
become an entity type. Are other solutions possible?
[fig. 16: relationship type CATEGORIZATION connecting BOOK (m; attributes ISBNNUMBER, AUTHOR, TITLE) and SUBJECT (m)]
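Bypassing the logical design stage again, the structure of fig. 16 could be mapped to tables like these (the attribute name SUBJECTNAME and all column types are assumptions):

```sql
create table BOOK
   (ISBNNUMBER   character(13)   not null,
    AUTHOR       character(40),
    TITLE        character(60),
    primary key (ISBNNUMBER) );

create unique index BOOK on BOOK (ISBNNUMBER);

create table SUBJECT
   (SUBJECTNAME  character(20)   not null,
    primary key (SUBJECTNAME) );

create unique index SUBJECT on SUBJECT (SUBJECTNAME);

create table CATEGORIZATION
   (ISBNNUMBER   character(13)   not null,
    SUBJECTNAME  character(20)   not null,
    foreign key (ISBNNUMBER) references BOOK,
    foreign key (SUBJECTNAME) references SUBJECT );

create unique index CATEGORIZATION on CATEGORIZATION (ISBNNUMBER,SUBJECTNAME);
```

The multivalued attribute has disappeared: a book with three subjects is simply represented by three rows in CATEGORIZATION.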
A more comprehensive example
Consider fig. 17.
[fig. 17: library schema. Entity types: MEMBER (M#, ADDRESS, STATUS); LOGICALBOOK (ISBN, TITLE, AUTHOR, PUBLICATIONYEAR); PHYSBOOK (ISBN, COPYNUMBER), ID-dependent on LOGICALBOOK; COAUTHORBOOK (ISBN), ISA-dependent on LOGICALBOOK; COAUTHOR (NAME); SUBJECT (NAME). Relationship types: RESERVATION (M#, ISBN, RESERVATIONDATE) connecting MEMBER (m) and LOGICALBOOK (m); LOAN (M#, ISBN, COPYNUMBER, DATEOFLOAN, RETURNDUEDATE) connecting MEMBER (1) and PHYSBOOK (m); CATEGORIZATION (ISBN, SUBJECTNAME) connecting LOGICALBOOK (m) and SUBJECT (m); COAUTHORSHIP (ISBN, COAUTHORNAME) connecting COAUTHORBOOK (m) and COAUTHOR (m)]
Here we have a library with members, books, reservations and loans. Notice
how we differentiate between a “logical book” and its “physical copies” – perhaps several – that can be present in the library.
The ISA-dependent entity type COAUTHORBOOK looks as if it had been
designed much later (as an extension). Alternatives?
Keys:
(ISBN is probably not really a key..)

MEMBER:          {M#}
LOGICALBOOK:     {ISBN}
PHYSBOOK:        {ISBN,COPYNUMBER}
RESERVATION:     {M#,ISBN}
LOAN:            {ISBN,COPYNUMBER}
COAUTHORBOOK:    {ISBN}
COAUTHOR:        {NAME}
COAUTHORSHIP:    {ISBN,COAUTHORNAME}
SUBJECT:         {NAME}
CATEGORIZATION:  {ISBN,SUBJECTNAME}
Notice that by naming the attributes accordingly it is possible to indicate
which attribute combinations will form foreign/primary key pairs (for example {<LOAN.ISBN,PHYSBOOK.ISBN>, <LOAN.COPYNUMBER,PHYSBOOK.COPYNUMBER>} or {<CATEGORIZATION.SUBJECTNAME,
SUBJECT.NAME>}, and so on).
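The table for LOAN, for instance, could look as follows (column types are assumptions); note how the two attribute pairs of the arrow towards PHYSBOOK reappear as one composite foreign key:

```sql
create table LOAN
   (M#             integer         not null,
    ISBN           character(13)   not null,
    COPYNUMBER     integer         not null,
    DATEOFLOAN     date,
    RETURNDUEDATE  date,
    foreign key (M#) references MEMBER,
    foreign key (ISBN,COPYNUMBER) references PHYSBOOK );

create unique index LOAN on LOAN (ISBN,COPYNUMBER);
```

LOAN is a relationship type without an incoming arrow, so it needs no primary key; its key {ISBN,COPYNUMBER} is enforced by the unique index.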
Note that it is not necessary to draw the attributes into the diagram in a first
stage. Compare fig. 17 with fig. 18, which shows only the "top object types",
ie the entity and relationship types.
[fig. 18: the object types of fig. 17 without attributes: MEMBER, RESERVATION, LOAN, PHYSBOOK, LOGICALBOOK (with ID arrow from PHYSBOOK), SUBJECT, CATEGORIZATION, COAUTHORBOOK (with ISA arrow to LOGICALBOOK), COAUTHOR and COAUTHORSHIP]
As an additional exercise, change the diagram in such a way that information
can be kept about books which have an editor and several authors (of several
articles).
As already stated, a data design does not make much sense without formulation of operations and constraints. An example of a constraint could be “a
loan can only be added if there is no reservation on the corresponding logical
book”. Please take a few minutes to think about this and further constraints
and operations.
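Such a constraint cannot be expressed in the diagram; it has to be programmed. As a sketch, with :m, :isbn and :copy standing for host variables of the inserting program, and CURRENT DATE assumed to be available as in DB2:

```sql
-- a loan may only be added if there is no reservation
-- on the corresponding logical book:
select count(*)
from   RESERVATION
where  ISBN = :isbn;

-- only if the count is zero, and within the same transaction:
insert into LOAN (M#, ISBN, COPYNUMBER, DATEOFLOAN)
values (:m, :isbn, :copy, current date);
```

Both statements must run in the same transaction; otherwise a reservation could slip in between the check and the insert.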
Correct ER diagram
Not every collection of rectangles, diamonds, etc. can be considered as a
“correct” ER diagram. Consider fig. 19.
[fig. 19: three nested sets, C inside B inside A]
A is the set of all possible ER diagrams (every point is a diagram). In A there
are diagrams which either cannot be mapped into relations in a meaningful
way or where possible results of such a mapping have unwanted qualities.
There are very sophisticated theories about the separation of a subset B of A
such that the effects mentioned above can be controlled. One possible definition of B is contained in a dissertation by V. M. Markowitz (Haifa 1987) but
this definition consists of two pages of mathematical formulas.
For practical use we need a definition which is very easily checked in actual
diagrams. This means that the subset B has to be further restricted to a subset
C of B. This seems to be a disadvantage, but besides giving an easy definition
of “correct”, it also leads to an easy definition of the (recommended) mapping
into relations, and, on another level, it forces the designer to give preference
to simple structures.
Now for the definition of correctness for ER diagrams.
Remember what we already said about the mapping of diagrams into relations. As an example take the case of two entity types E1 and E2, connected
by an ISA arrow pointing from E2 to E1. Taking for granted that we already
know how to map E1, we could easily define how to map E2 (and the arrow
with its label). This is a recursive definition which can be applied in the case
of the other object types as well.
Thus it seems very natural to define the correctness of diagrams recursively as
well. We define the empty diagram as being correct and give six (meta-)operations which, applied to a correct diagram, again produce a correct diagram.
The six (meta-)operations are the following:

1) Define independent entity type:
   assumption: none (of course the names must be unique)
   result:     new (named) rectangle

2) Define relationship type:
   assumption: E1,E2,...,En given rectangles or diamond rectangles
               (n at least two)
   result:     new diamond R with arrows pointing from R to all the
               Ej (arrows labeled with "1" or "m")

3) Define attribute:
   assumption: F a rectangle or diamond or diamond rectangle
   result:     new oval connected with F

4) Transform relationship type to composite entity type:
   assumption: D a diamond
   result:     D becomes a diamond rectangle

5) Define ID-dependent entity type:
   assumption: F a rectangle or diamond rectangle
   result:     new rectangle with ID arrow pointing to F

6) Define ISA-dependent entity type:
   assumption: F a rectangle or diamond rectangle
   result:     new rectangle with ISA arrow pointing to F
In a more formal exposition of the theory, one would also formalize the
reverse (meta-)operation to each of the above meta-operations (“for each
ADD the corresponding REMOVE”).
A correct ER diagram is an ER diagram which can be drawn from the empty
diagram by application of only the (meta-)operations sketched above; and by
the same token, a given ER diagram is correct if it can be transformed into
the empty diagram by application of the corresponding reverse (meta-)operations only.
A correct ER diagram has diverse properties; for example, it contains no
closed cycles of arrows. (Why?) Another property is that if an object type has
an outgoing ISA or ID arrow, then this is the only outgoing arrow of the
type considered.
Note that this definition of correctness for ER diagrams is purely syntactical
and has nothing to do with the “mapping of reality into diagrams”. Note also
that the definition “correct diagram” is not a prohibition against drawing
incorrect ones. It is just a guideline, which, properly followed, can guarantee
a lot of benefits, and if not followed, will just burden the designer with more
responsibilities. We will come back to this point.
Canonical mapping of ER diagrams into relation schemes
As sketched earlier, we differentiate between conceptual, logical and physical
design. The practitioner usually maps the conceptual design directly into
physical design, as we have done with a few examples.
For a reader who has followed the discussion up to now it should be clear
how conceptual design can be mapped into physical design (“create table”,
etc). The purpose of this chapter is to sketch a more formal definition of the
“natural” (we call it CANONICAL) mapping of ER diagrams into relation
schemes. This is of course a mapping of conceptual into logical design.
The reason why we chose the logical design as the target of our canonical
mapping (and not the physical design, as is usually done in practice) is that
this makes formulation simpler. Remember also that one by-product of such
a mapping process is an exact semantical definition of ER diagrams.
This canonical mapping will only be defined for correct ER diagrams (and
of course, recursively, corresponding to the recursive definition of correct
diagrams).
We first have to define the three notions
key
primary key
set of foreign/primary attribute pairs
for certain object types.
It is not relevant whether these notions are considered as part of the ER
model, or whether one says that an ER diagram has to be enriched by keys,
primary keys and sets of foreign/primary attribute pairs before being mapped
into a relation scheme. Practice shows that a designer usually defines these
notions “in the last stage”, which makes sense: a change, for instance, in the
primary key of an independent entity type can affect large parts of the
diagram.
Now for the definition of these notions: (In the following passage, “object
type” means entity type or relationship type or composite entity type.)
Key
A key is a (non-empty) set of attributes of an object type which characterizes
an entity or a relationship uniquely, and which is minimal with respect to
this property.
Any object type can have several keys which also may overlap (but of course
the minimality condition prevents one key from being contained properly in
another), and every object type should have at least one key.
Note that a key is whatever the designer says it is. The real world has no
keys.
Note also that "key" is a "syntactical notion" which should not be defined in
terms of the actual contents of tables. If, for instance, we design an entity
type PERSON for a company and say that the attribute set {NAME,FORENAME}
is a key, it might well be that the actual data are such that {NAME} alone
would already have been unique. If, despite that, the designer declares
{NAME,FORENAME}, and not {NAME}, to be the key, then he merely thinks that
some day a person might enter the company who has the same name as a
person who is already there.
These remarks underline the relative unimportance of the minimality condition in the definition of keys. We just do not dispense with it to be in conformance with the traditional definition of key in the theory of relational
database systems. (In the dependency theory of relational databases, the minimality condition is interesting because normalization algorithms usually have
steps where functional dependencies with “minimal left hand sides with
respect to the given set of functional dependencies” are sought).
Primary key
A primary key is nothing but a key designated by the designer as primary. If
an entity type or a composite entity type has one or more incoming arrows,
then it must have a primary key. For a relationship type there is no need to
designate any of its keys as primary.
One can formalize the theory of relational databases with functional and
inclusion dependencies without the notion of “primary key”. We use this
notion here because it is part of an ANSI standard for SQL.
Primary key attributes in a diagram can be underlined.
Set of foreign/primary attribute pairs
Remember: the attributes of a key are not ordered in the conceptual and
logical design stages. Therefore we speak of sets of attributes. By the same
token, we speak of sets of attribute pairs in the context of foreign/primary
key connections.
As an example consider the case of PERSON, COMPANY and EMPLOYMENT sketched in the introductory chapter. For PERSON we can choose
the set {NAME,FORENAME} as primary key and for COMPANY the set
{NAME}.
With EMPLOYMENT we have two foreign keys, {PNAME,PFORENAME}
referencing PERSON, and {CNAME} referencing COMPANY. Consider the
case of {PNAME,PFORENAME} which relates to {NAME,FORENAME}.
Due to the chosen nomenclature, we can see that PNAME relates to NAME
and that PFORENAME relates to FORENAME.
But in general we cannot rely on names. Therefore we specify two attribute
pairs,
<EMPLOYMENT.PNAME, PERSON.NAME> and <EMPLOYMENT.PFORENAME, PERSON.FORENAME>,
which together constitute the existence dependency symbolized by the arrow
from EMPLOYMENT to PERSON.
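In SQL the set of attribute pairs collapses into one composite foreign key clause; note that SQL then fixes an ordering of the attributes which the conceptual and logical levels do not have (column types are assumptions):

```sql
create table EMPLOYMENT
   (PNAME      character(20)   not null,
    PFORENAME  character(20)   not null,
    CNAME      character(20)   not null,
    ... (additional attributes)
    foreign key (PNAME,PFORENAME) references PERSON,
    foreign key (CNAME) references COMPANY );
```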
To be a little bit more formal:
Let the object type E be given with the attributes
...,Aj,... and the entity type F with the attributes
...,Bj,..., as well as an arrow from E to F.
Then the arrow corresponds to a set of attribute pairs
{<E.A1,F.B1 >,<E.A2,F.B2>,...,<E.Ak,F.Bk>},
where the set {B1,B2,...,Bk} is the primary key of F and the set
{A1,A2,...,Ak} is the foreign key of E referencing F.
If an object type has several foreign keys (in a correct diagram this is only
possible for relationship types and composite entity types), then we assume
that they do not overlap. This assumption will guarantee that our canonical
mapping into relation schemes delivers Boyce-Codd normalized relations.
One can weaken this assumption considerably, but it is the purpose of this
booklet to present a simple theory, not a complex one; so for examples that
deviate from the non-overlapping assumption we will simply verify case by
case that the corresponding relations are in Boyce-Codd normal form.
It only remains to formulate the conditions for the keys of the different object
types.
Conditions for the keys of a relationship type
(same as for composite entity type)

Let the relationship R be existentially dependent on the (composite) entity
types E1,E2,...,En (n>=2). Then we have n arrows from R to the Ej which
correspond to n sets of foreign/primary attribute pairs. To simplify the
notation, we assume that each of these n sets consists of only one element
(each of the primary keys of the Ej has only one attribute):
{<R.A1,E1.A1>}
{<R.A2,E2.A2>}
...
{<R.An,En.An>}
We must then consider two cases:
case 1: Each of the n arrows has the label “m”. Then the set {A1,A2,...,An}
must be a key of R.
case 2: At least one of the arrows has the label “1”. Then for each k such
that the arrow from R to Ek has the label “1”, the set
{A1,A2,...,Ak-1,Ak+1,...,An} (Ak missing) must be a key of R.
Of course the designer can choose additional keys besides those that are forced
upon him by the labels of the arrows as long as the corresponding attribute
sets do not overlap with foreign key attribute sets (though this condition
could also be weakened: see above).
Conditions for the keys of an ID-dependent entity type
Let E be an entity type which is ID-dependent from an entity type F; the set
{B1,B2,...,Bk} the primary key of F; and {<E.A1,F.B1>,<E.A2,F.B2>,...,
<E.Ak,F.Bk>} the set of foreign/primary attribute pairs that corresponds to
the arrow from E to F.
Then there must be a key of E which contains the set {A1,A2,...,Ak} properly:
that is, it should contain the set {A1,...,Ak} plus at least one additional
attribute.
Conditions for the keys of an ISA-dependent entity type
Let E be an entity type which is ISA-dependent from an entity type F; the
set {B1,B2,...,Bk} the primary key of F; and {<E.A1,F.B1 >,<E.A2,F.B2>,
...,<E.Ak,F.Bk>} the set of foreign/primary attribute pairs that corresponds to
the arrow from E to F.
Then the set {A1,A2,...,Ak} must be a key of E.
Canonical mapping of a correct ER diagram into a relation scheme
Suppose a correct ER diagram is enriched by key conditions as described
above (note that the description has recursive character corresponding to the
recursive definition of the correctness of a diagram).
Then the mapping is very easily described:
Every rectangle, diamond and diamond rectangle maps to a relation with the
corresponding attributes; every arrow becomes a foreign/primary key connection; key maps to key; and primary key becomes primary key.
The only functional dependencies we get on the logical level of relations are
the ones defined by the keys. Therefore the result of the mapping is in BoyceCodd normal form.
As already mentioned it is common practice to map the diagrams directly
into physical design. But it is helpful to keep the “virtual intermediate stage”
of logical design in mind in the form of relations, because problems (and
solutions) that have nothing to do with the semantics of the ER diagram can
be recognized as such more easily. (Think of the ordering of attributes, the
question of collecting the tables into files, how much free space to choose in
the files, etc.)
Why entity relationship?
This chapter gives a few reasons why entity relationship as presented here is
better than binary ER, and also better than design in the relational data
model. To understand the chapter, some knowledge in the theory of relational databases and their normalization is required.
Following are a few remarks on each of the following points:
1) normalization poses no problem;
2) referential integrity guidelines;
3) avoiding NULLs;
4) semantics support by the relational database system.
1) Normalization poses no problem

Normalization of relational databases was introduced by Codd in the early
seventies with the intention of avoiding certain update anomalies. In the seventies and early eighties (when the subject became very popular under the name
“dependency theory”) numerous articles and textbooks were written on the
subject. We should make a few remarks here on the following points:
a) normalization theory is too difficult;
b) target should be Boyce Codd normal form, not only third NF;
c) inclusion dependencies cannot be properly integrated;
d) information representation normalization versus data representation
normalization.
a) normalization theory is too difficult
Properly pursued, normalization theory must distinguish very sharply
between a syntactical and a semantical side. The syntax can be sketched as
consisting of purely abstract, functional dependencies and a set of rules on
how to derive new functional dependencies out of old ones, whereas the
semantics consists of models, eg of relations with their dynamically changing
contents. Most textbooks for beginners are not very clear on this point.
Algorithms for normalization must live entirely within syntax. The first such
algorithm was published by Bernstein in 1976 (it was erroneous). Since then,
numerous algorithms for diverse normal forms have been published, as well
as corrections. (The most recent correction to a normalization algorithm we
are aware of dates from 1990.)
One cannot avoid the impression that even for theoreticians the subject is
not very easy.
With our version of ER, the designer does not have to bother with dependency theory.
b) target should be Boyce Codd normal form, not only third NF
The aim of normalization is to avoid update anomalies, as already mentioned.
In our context, an anomaly produced by an update destroys the validity of a
functional dependency. (We will consider multivalued dependencies in an
example later on.) Such anomalies are best avoided if the database management system guarantees the validity of all functional dependencies over time.
This can be achieved by designing the relations in Boyce Codd normal form.
Note that the popular third normal form does not suffice. There the programmer and the individual data processing user still have to guarantee the
validity of functional dependencies themselves.
The point here is: the theory says that the third normal form can always be
achieved, but not Boyce Codd normal form. Here our ER helps: canonical
mapping always delivers Boyce Codd normal form.
c) inclusion dependencies cannot be properly integrated
Inclusion dependencies are referential integrity constraints deprived of operational aspects. Referential integrity was presented by Codd in the late seventies.
Now if one tries to integrate functional and inclusion dependencies, one very
soon faces serious problems. Imagine a relation containing two attributes which
together form the target of an inclusion dependency, but which are to be
split apart for normalization reasons. Should we normalize or keep the referential integrity intact?
But this is only the beginning. Remember that normalization algorithms must
deal with functional dependencies which may or may not be consequences of
sets of functional dependencies. Now if we add inclusion dependencies, and
ask which dependencies follow from a set of dependencies (functional and
inclusion), the question becomes undecidable (Vardi 1984). This means that
we cannot expect useful algorithms for normalization which properly integrate
inclusion dependencies.
Our ER also helps us here: we get Boyce Codd normalized relations with
fully integrated inclusion dependencies.
d) information representation normalization versus data representation
normalization
Data modelling means, first of all, building a conceptual model; then this
conceptual model is mapped into a relational model. However, it should not
be necessary to normalize the relational model after the mapping, because the
information content of the conceptual model might be lost.
Let's consider an example:
[fig. 20: relationship type COURSE connecting SUBJECT (SNAME; label m), CLASS (CLASSNO; label m) and TEACHER (TNAME; label 1)]
Imagine that the conceptual model of fig. 20 is mapped into the relations:
SUBJECT {SNAME}                  PrimaryKey {SNAME}
CLASS {CLASSNO}                  PrimaryKey {CLASSNO}
TEACHER {TNAME}                  PrimaryKey {TNAME}
COURSE {SNAME,CLASSNO,TNAME}     Key {SNAME,CLASSNO}
Now suppose we get from somewhere the additional functional dependency:
CLASSNO -> TNAME.
This functional dependency could express the fact that for each class there is
(at most) one teacher somehow responsible.
First we look at what happens if we apply normalization theory. The relation
COURSE is not normalized and theory says that it should be split into two
parts, for instance:
C1 {CLASSNO,TNAME}
Key {CLASSNO}
and
C2 {SNAME,CLASSNO}
Key {SNAME,CLASSNO}
(What meaningful names could be given to the two new relations?) This is
data representation normalization.
Now for information representation normalization we go back to fig. 20 and
ask what the "responsibility functional dependency" CLASSNO -> TNAME
has to do with COURSE.
One could imagine several possibilities. We will consider just two.
• The FD has nothing to do with COURSE and means only that classes can
have teachers assigned who supervise them. This case could be modeled as
in fig. 21.
[fig. 21: as fig. 20, with an additional relationship type SUPERVISES connecting CLASS (m) and TEACHER (1)]
• Classes have a single teacher assigned to them, who carries out all functions,
including the teaching of all the courses. It is probably better to model this
case as in fig. 22.
[fig. 22: relationship type COURSE connecting SUBJECT (m) and CLASS (m); relationship type ASSIGNMENT connecting CLASS (m) and TEACHER (1)]
The example also shows that with “relational normalization” the attributes
get an undeserved global character.
The next example shows that relational normalization which relies solely on
functional dependencies lacks the necessary expressibility.
Consider the entity types EMPLOYEE with (key-) attribute EMPNO and
PROJECT with (key-)attribute PROJNO. Let's assume that every employee
works in at most one project and supervises at most one project. Both work
and supervision give rise to the same functional dependency
EMPNO -> PROJNO.
But we should model two different (partial) functions, and this cannot be
expressed in terms of functional dependencies, but only on the conceptual
level as in fig. 23.
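On the relational level the two partial functions simply become two separate tables; as a sketch (column types are assumptions):

```sql
create table WORKS
   (EMPNO    integer   not null,
    PROJNO   integer   not null,
    foreign key (EMPNO) references EMPLOYEE,
    foreign key (PROJNO) references PROJECT );

create unique index WORKS on WORKS (EMPNO);

create table SUPERVISES
   (EMPNO    integer   not null,
    PROJNO   integer   not null,
    foreign key (EMPNO) references EMPLOYEE,
    foreign key (PROJNO) references PROJECT );

create unique index SUPERVISES on SUPERVISES (EMPNO);
```

The unique index on EMPNO alone captures "at most one project per employee" separately for each of the two roles.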
[fig. 23: relationship types WORKS and SUPERVISES, each connecting EMPLOYEE (EMPNO; label m) and PROJECT (PROJNO; label 1)]

2) Referential integrity guidelines
Referential integrity in DB2 is too difficult to understand. An example:
If a foreign key F={A,B} consists of more than one attribute, then the
definition says that a tuple is NULL on F if it is NULL on A or on B
(or both). This means that today one can add a tuple whose projection
on F is <A,B>=<a,NULL> (the system checks only non-NULL values
against referential integrity constraints).
But if tomorrow we want to update the tuple <a,NULL> to <a,b>, then
the system checks the referential integrity constraint, because the
tuple is no longer NULL on F. Now what if there is no tuple in the
corresponding parent table with the values <a,b>? Was just the update
from <a,NULL> to <a,b> wrong, or was the insertion of <a,NULL> wrong
already?
Transaction logic becomes complicated.
There is a more severe restriction in relational systems which concerns the
easy handling of arbitrary referential integrity constraints.
The philosophy of relational data is a philosophy of tuple sets, but for
performance reasons the philosophy of referential integrity checks is tuple
orientated. This means that the system guarantees the integrity of referential
constraints not before and after transactions but before and after each tuple
insert. This is not the same thing; in particular, if one has self-referential
constraints on a single table, the outcome of a transaction can depend on the
order of processing, on which the user has, of course, no influence.
Things are now very complicated. An indication of this is that the DB2
referential integrity usage guide is a book of more than two hundred pages.
This situation calls for guidelines.
Our version of ER provides such guidelines. Only the simplest and safest
cases of referential integrity are possible results of the recommended mapping
of a correct ER diagram into relations.
3) Avoiding NULLs
A frequently mentioned advantage of the relational model is its mathematical
foundation. However, while it is clearly preferable for a technical system to
have a mathematical background, only the NULL-free version of the relational
model has a generally accepted mathematical foundation.
NULLs are too difficult for theoreticians. Proposals have appeared in the
theoretical literature on how to overcome some of the difficulties confronting
the user, but these have proven logically infeasible.
NULLs are too difficult for system implementers. SQL with DB2 has serious
semantical mistakes in the context of NULL.
NULLs are too difficult for users too. Thinking in a 3-valued logic or in a
3-fold valuation system is not natural. Users seldom understand why the query

   select * from T where A<17 or A>=17

sometimes fails to deliver all rows of the table T. Users typically lose
information with joins in the presence of NULLs.
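A small illustration (standard SQL behaviour):

```sql
create table T (A integer);

insert into T values (16);
insert into T values (17);
insert into T values (NULL);

select * from T where A < 17 or A >= 17;
-- delivers only the rows with A = 16 and A = 17: for the NULL row
-- both comparisons evaluate to UNKNOWN, so the row is not delivered
```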
ER helps avoid NULLs. With the recommended mapping into relations one
gets relations with no NULLs at "dangerous attributes" (attributes in keys
or foreign keys).
Of course, avoiding NULLs means in general having more relations, but this
is in general an advantage, because the system has more access paths to
choose among, and because queries tend to become much simpler in the
absence of NULLs.
4) Semantics support by the relational database system
As we know, we not only have to design data, but also operations and constraints. Operations and constraints have their counterpart in programs or in
administrative rules.
It is impossible to build a graphical design language which covers the full logic
of operations and constraints: therefore we have to draw a borderline. (Some
constraints can be caught by the graphical language, but most of them cannot.)
Now, with our version of ER this borderline is very clear and easy to handle:
all semantics which can be expressed by ER can be guaranteed by the database management system (if it has referential integrity support). The programmer only has to occupy himself with constraints he writes down separately, not with constraints contained in the semantics of the graphical design
language.
Note that this is not the case with “binary entity relationship”: here the programmer also has to guarantee some constraints which can be expressed by
the language (for instance a one-to-one correspondence between entities).
More examples
Example 1
Compare the structure of ID and ISA with a relationship type.
[fig. 24: entity type E, ID-dependent on entity type F]
is “equivalent” to
[fig. 25: relationship type R between E (m) and F (1)]
together with the constraint “for all e in E there is an f in F such that <e,f> is in R”
and
[fig. 26: entity type E, ISA-dependent on entity type F]
is “equivalent” to
[fig. 27: relationship type R between E (1) and F (1)]
together with the constraint “for all e in E there is an f in F such that <e,f> is in R”.
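Under the canonical mapping, the constraint “for all e in E there is an f in F” is precisely what referential integrity delivers: the relation for E carries F's key as a NOT NULL foreign key (for ID, E's key extends F's key). A minimal sketch of the ISA case, using SQLite from Python with an assumed key column FKEY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
    CREATE TABLE F (FKEY TEXT PRIMARY KEY);
    -- ISA (figs. 26/27): E shares F's primary key.  NOT NULL plus the foreign
    -- key make the DBMS guarantee "for all e in E there is an f in F".
    CREATE TABLE E (
        FKEY TEXT NOT NULL PRIMARY KEY REFERENCES F(FKEY)
    );
""")
conn.execute("INSERT INTO F VALUES ('f1')")
conn.execute("INSERT INTO E VALUES ('f1')")      # accepted: f1 exists in F
try:
    conn.execute("INSERT INTO E VALUES ('f2')")  # no such f in F
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```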
Example 2
Let the entity types STOCK and ARTICLE be given (which means that we
have several stocks, each of type STOCK), as well as several relationship types
between STOCK and ARTICLE, all of the same structure (for instance R1,
R2, R3 and R4 with the meanings “in stock”, “catalogued”, “out of stock”
and “ordered”).
[fig. 28: four m:m relationship types R1, R2, R3 and R4 between STOCK and ARTICLE]
Under which circumstances does it make sense to replace the diagram of fig.
28 by the one of fig. 29? (Consider extendibility, operations and constraints.)
[fig. 29: a single relationship type R connecting STOCK (m), ARTICLE (m) and STATUS]
A few design guidelines
Since the 1980s, attempts have been made to automate data design. Programs were developed that took English sentences of a certain, simplified
grammatical pattern as input and produced entity relationship diagrams as
output. To develop such a program is a good exercise for a student, but of
course it is no more than child’s play.
Designing data not only maps users’ perceptions of reality into digital media,
but also creates new realities, new types of workflow, new kinds of jobs (more
specialised or less, depending on the application), or even new organisational
structures. Therefore designing data can have psychological, sociological or
“political” aspects (political if a center of power is not directly definable, which
is almost always the case). Intuition and tactical sensitivity are as important as
the application of systematic truth-finding.
There is no royal road to the right data structure. (We would even accept the
servants’ entrance if there were one, but very often the problem is that there
is more than one “right” data structure.) Nevertheless, experience shows that
there are a few guidelines worth keeping in mind. The following list is by no
means complete.
Begin with the independent entity types and their characterizing keys. They
are called independent because their corresponding entities are not dependent on any other data objects. This is an old guideline used by Chris Date,
the famous interpreter of the relational world. He called those entities “kernel entities”. (In our graphic language, they are the boxes that have no outgoing arrows.) Thus, if you have to judge a given design, focus first on these
independent types. The rest depends existentially on them.
Formulate as exactly as possible what the entity type stands for. Watch closely
for any time dependency (a time dimension attribute in a characterizing key).
There can be many time dimensions: for instance, accident year, development
year, business year, time of estimate, etc. Observe that it makes no sense to
have such a thing as an entity in time: in other words, if a new entity = <old
entity, time>, then the old entity and the new entity are different things. If
an entity has a history, then it is good advice to consider the entity itself as
being dependent on a time dimension. (This need not be the case, but the history time dimension should be designed explicitly in any event.)
Try to make simple packages that are as independent as possible. Characteristic
of bad design is a big system where some data entity or other is connected
(directly or indirectly) to everything else. Remember the “correct diagram”
notion defined in this brochure. Although it is only a syntactical construct
that guarantees painless mapping of the diagram into relations, experience
shows that if this notion of correctness is violated, then usually also the
intended semantics is questionable or even wrong.
Differentiate as clearly as possible between primary data and control data.
(Control data is data that is also in users’ access, but serves the purpose of
controlling the primary data: for instance the definition of profit centers and
its versioning, or user-defined authorisation data, etc.) For control data, the
challenge of the constraints is bound to be neglected. As you have learned in
this brochure, data usually cannot exist without corresponding constraints
and there are two alternatives for coping with them: constraint enforcement
by program logic – let’s call it “electronic” constraint enforcement – and
constraint enforcement by administrative rule. For control data there is a
temptation to enforce constraints by means of administrative rules, because
this is faster to implement. But the application system continues to live,
and neighbour systems begin to rely on the constraint. Since there is some natural organisational turnover over time, and constraint enforcement by administrative rule relies heavily on people's minds, there is some danger of losing the enforcement. Thus it can be good advice to enforce constraints electronically in the first place, even if it takes some implementation resources.
Another point with respect to constraints is the following: Beginning data
designers tend to map as many constraints as possible directly into data structures. The constraint “for all entities a1 there exists an entity b2”, for example, could be mapped into data structure; but you should be absolutely sure
that no user will ever present you at a later date with a single exception: an entity a1 that has no corresponding entity b2. (“It could only happen once in
a million years…” are famous last words.) If there is any doubt, map the
constraint into program logic (not into the data structure).
Be happy if the user “does not know what he wants” (a complaint often heard
in IT departments), because that forces you to design abstract and generalisable structures.
There is another fact of life that should invite you to design abstract structures. If an application system is new, then the semantics of the (not yet
existing) data can (and should) be defined formally. But as the system continues to live (or rather, the user lives with the system), it can be observed that
the semantics changes. In other words, in an application system in use, the
complete meaning of the data is only defined in the user’s head (a contract
today is not the same thing as it was ten years ago). If you have abstract
structures, the user has some freedom to interpret them according to changing business needs. (This sounds easy, but I know that it is not.) By the way,
the fact that the complete semantics can only be defined by the user is the
deeper reason why the universal repository concept failed.
Sometimes the ideal data structure and the ideal module or class structure
conflict. Then ask an oracle (not the database system)! Seriously, try to imagine how the data and module or class structure will develop in the future.
The more general and flexible solution is usually the better one.
Do not infer the data model directly from object models: inferred models
as such tend to be either process-oriented or function-oriented. They are not
sufficiently abstract, and carry superfluous relationship types. Do make data,
process, and object models separately and in parallel, if they are all necessary,
and let the matching process be conducted by the brains of the people
involved.
In client/server applications, try also to realise the idea of encapsulation and
abstract data types by keeping as much of the data as possible at the
central site, packed in modules which do not commit. (Transaction boundaries
should be defined by the client, not by the server that offers abstract data
types for data handling.)
It takes more time to achieve simple, clear structures than complicated ones.
Since the second NATO science committee conference on software engineering in 1968, where the term “software crisis” was invented, software engineers
have blamed themselves for not being able to automate software construction
to a greater degree. But don’t forget that, though many bridges have been
built, there is still no automatic method for constructing bridges, and I doubt
that there will ever be one, because technology is constantly changing.
So what about design tools? Use them with extreme caution!
Example 3 (Accordion principle)
Let the entity type E be given with “mandatory” attributes A1, A2 and A3,
and with “optional” attributes B1, B2, .., Bn. The values of the attributes A1,
A2 and A3 are defined for every entity of type E, but for every such entity
only a subset of the attributes B1, B2, .., Bn have a defined value. For
instance, for the entity e1 (of type E) the values of B3, B5 and B8 are
defined, for e2 the values of B2, B3, B4 and B5 are defined, and for e3 none
of the B attributes has a defined value, and so on.
[fig. 30: entity type E with attributes A1, A2, A3 and B1, B2, …, Bn]
Now the accordion consists of contracting in the horizontal and tearing apart
in the vertical: The names of the attributes B1, B2, .., Bn, a “horizontal line”,
become entities of a new type (EATTR), which means rows of a table, a “vertical line”.
[fig. 31: m:m relationship type EATTRVALUES (attribute VALUE) between E (attributes A1, A2, A3) and EATTR]
A typical tuple of the relation EATTRVALUES would be <a1,B3,b> (A1 being the primary key of E), which means that the entity of type E with the primary key value a1 has, possibly among others, a defined value for the attribute B3, namely b.
This solution not only has the advantage of great flexibility (one can go as far as allowing the user to define possible new B attributes “beyond Bn” himself), but also operations typically become simpler. Imagine in the case of fig.
30 how complicated a query would be for the operation “given entity e of
type E, return all attribute values of the B attributes that are defined”. This
would be an almost impossible task.
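A minimal sketch of the accordion of fig. 31, using SQLite from Python (column names such as NAME are assumptions for illustration). The operation “given entity e of type E, return all defined B attribute values”, nearly impossible against fig. 30, becomes a single SELECT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE E (A1 TEXT PRIMARY KEY, A2 TEXT, A3 TEXT);
    CREATE TABLE EATTR (NAME TEXT PRIMARY KEY);   -- the former B attribute names
    CREATE TABLE EATTRVALUES (
        A1    TEXT REFERENCES E(A1),
        NAME  TEXT REFERENCES EATTR(NAME),
        VALUE TEXT,
        PRIMARY KEY (A1, NAME)                    -- one value per entity/attribute
    );
""")
conn.executemany("INSERT INTO EATTR VALUES (?)",
                 [("B2",), ("B3",), ("B5",), ("B8",)])
conn.execute("INSERT INTO E VALUES ('e1', 'x', 'y')")
conn.executemany("INSERT INTO EATTRVALUES VALUES (?, ?, ?)",
                 [("e1", "B3", "b"), ("e1", "B5", "c"), ("e1", "B8", "d")])

rows = conn.execute(
    "SELECT NAME, VALUE FROM EATTRVALUES WHERE A1 = ? ORDER BY NAME", ("e1",)
).fetchall()
print(rows)  # [('B3', 'b'), ('B5', 'c'), ('B8', 'd')]
```

New B attributes “beyond Bn” are simply new rows in EATTR; no schema change is needed.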
Example 4 (Roles)
The entity relationship language as presented here is role-free. We have refrained from formulating roles because the theory becomes much more complicated with the addition of roles and because they are seldom really needed.
In situations where the introduction of roles suggests itself, a deviation from
canonical mapping is recommended as in the following typical example:
[fig. 32: PERSON (attribute NAME); MAN and WOMAN ISA-dependent on PERSON; 1:1 relationship type MARRIED (attributes MNAME, WNAME) between MAN and WOMAN]
In canonical mapping, every rectangle, diamond or diamond-rectangle
becomes a relation. The deviation in this case would be to have only the relations corresponding to PERSON and to MARRIED, together with the “role
information” that the attribute MARRIED.MNAME references PERSON
“in the role of MAN” and the attribute MARRIED.WNAME references
PERSON “in the role of WOMAN”.
Of course this “role information” must be taken into account in the tasks of
“programming and administration around the MARRIED relation”.
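The recommended deviation can be sketched as follows (SQLite from Python): only PERSON and MARRIED become relations, and the role information lives in the two foreign keys. The details beyond fig. 32 are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE PERSON (NAME TEXT PRIMARY KEY);
    CREATE TABLE MARRIED (
        MNAME TEXT NOT NULL REFERENCES PERSON(NAME),  -- PERSON "in the role of MAN"
        WNAME TEXT NOT NULL REFERENCES PERSON(NAME),  -- PERSON "in the role of WOMAN"
        PRIMARY KEY (MNAME),
        UNIQUE (WNAME)        -- together with the primary key: a 1:1 relationship
    );
""")
conn.executemany("INSERT INTO PERSON VALUES (?)", [("Adam",), ("Eve",)])
conn.execute("INSERT INTO MARRIED VALUES ('Adam', 'Eve')")
# Referential integrity and the 1:1 structure are guaranteed by the DBMS;
# that MNAME actually names a man is exactly the "role information" that
# remains a matter of programming and administration around MARRIED.
```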
Another typical example in this context is the parts explosion problem (bill
of materials problem) with SUPERPART and SUBPART as ISA-dependent
on PART, and with CONTAINS as relationship between SUPERPART and
SUBPART to capture the structure on the set of parts.
Example 5
Compare the two versions in figs. 33 and 34.
[fig. 33: ternary relationship type COMPOSITION connecting SUPPLIER (1), PART (m) and PRODUCT (m)]
[fig. 34: relationship type SUPPLIESPART between SUPPLIER (1) and PART (m), and relationship type COMPOSITION between PART (m) and PRODUCT (m)]
Which of the following statements are true in which version?
1) A product can consist of several parts.
2) A part always comes from the same supplier.
3) If a certain part is needed for a certain product, then it always comes from
the same supplier.
4) A key of COMPOSITION consists of “part” and “product” (which of
course means the corresponding foreign key attributes).
5) Supplier/part connections can be captured independently of products that
contain the parts.
Example 6
Compare the two versions in figs. 35 and 36.
[fig. 35: relationship type SUPPLIES between SUPPLIER (label 1) and PRODUCT (label 2)]
[fig. 36: relationship type SUPPLIES between SUPPLIER (label 1) and PART (label 2), and m:m relationship type CONSISTSOF between PART and PRODUCT]
In the first comparison, choose label1 = 1 and label2 = m, and vice versa for
the second comparison. What do they have in common, what is different?
Example 7
Let fig. 37 be given,
S#
T#
m
SUPPLIER
S#
P#
m
COMPOSITION
m
V#
PART
T#
P#
VERSION
V#
ID
P#
fig. 37
PRODUCT
together with the following constraint:
from version to version of the same product, the pair <part, supplier>
changes. “The pair changes” means “do not use the same part any
longer” or “change the supplier” or both.
This constraint is equivalent to the functional dependency P#, T#, S# → V# among the attributes of COMPOSITION. Why?
Now we have the crucial question:
Should this constraint be implemented program-controlled or ensured by the
database management system? (By the way, it is not certain that such a constraint will be recognized as being just a functional dependency.)
Suppose we do not trust the art of our programming, so we want the DBMS
to control the constraint. Make suggestions as to how this could be achieved.
Be open to all ideas ranging from changes in the diagram to deviations from
the canonical mapping.
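One suggestion, as a sketch only (SQLite from Python; PNO, VNO, TNO and SNO stand in for P#, V#, T# and S#): the functional dependency P#, T#, S# → V# says that each <product, part, supplier> combination may occur in at most one version, so it can be handed to the DBMS as a uniqueness constraint on COMPOSITION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE COMPOSITION (
        PNO TEXT NOT NULL,          -- P#, the product
        VNO TEXT NOT NULL,          -- V#, the version
        TNO TEXT NOT NULL,          -- T#, the part
        SNO TEXT NOT NULL,          -- S#, the supplier
        PRIMARY KEY (PNO, VNO, TNO, SNO),
        UNIQUE (PNO, TNO, SNO)      -- enforces P#, T#, S# -> V#
    )
""")
conn.execute("INSERT INTO COMPOSITION VALUES ('p1', 'v1', 't1', 's1')")
try:
    # the same product/part/supplier combination in a second version
    conn.execute("INSERT INTO COMPOSITION VALUES ('p1', 'v2', 't1', 's1')")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True -- the DBMS rejects the violation of the dependency
```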
Example 8 (View integration)
View integration is the technical term for the merging of diagrams.
Let the two views of fig. 38 and fig. 39 be given.
[fig. 38: entity type EMPLOYEE with key E# and attributes CLUB1, CLUB2, CLUB3]
[fig. 39: m:m relationship type MEMBERSHIP between CLUB and MEMBER (key M#, attribute BELONGSTOEMPLOYEE)]
The purpose of the diagram of fig. 38 is to capture information on which
employee is in which club. It might be an old view designed at a time when
nobody could imagine that there would ever be more than the three clubs
corresponding to the attributes.
Let’s suppose the statutes of the company clubs were changed later so that
relatives of employees could also become members. (Thus, for every member
there must be a relative who is employed in the company.) Then the diagram
of fig. 39 was designed (by someone else who was not interested in employees but only in club members).
Now the question arises as to how the two diagrams could be merged into
one: that is, how the two views could be integrated.
One (of many) possibilities is drawn in fig. 40.
[fig. 40: m:m relationship type MEMBERSHIP between CLUB and PERSON; EMPLOYEE and RELATIVE ISA-dependent on PERSON; relationship type FAMILYRELATION between RELATIVE (m) and EMPLOYEE (1)]
Could this merge have been achieved by an automatic view integration tool?
Example 9 (View integration)
As a second example of view integration, consider the following two figures:
[fig. 41: relationship type RELATED1 between EVENT (1) and CLAIMSADVICE (m), and m:m relationship type RELATED2 between EVENT and CLAIMSCODE]
[fig. 42: m:m relationship type RELATED3 between CLAIMSADVICE and CLAIMSCODE]
The first idea for integration is of course to complete the triangle as in fig.
43. (An automatic tool would probably do just that if the names of the corresponding entity types of the two diagrams were spelled exactly the same way.)
[fig. 43: RELATED1 between EVENT (1) and CLAIMSADVICE (m), m:m RELATED2 between EVENT and CLAIMSCODE, and m:m RELATED3 between CLAIMSADVICE and CLAIMSCODE]
But now the question arises as to what the relationship types RELATED1,
RELATED2 and RELATED3 have to do with one another. Is any one of
them redundant?
To any EVENT e a set of claims codes is assigned via RELATED1 followed by RELATED3, say the set C13(e). Via the other path from EVENT to CLAIMSCODE (via RELATED2) another set of claims codes, say C2(e), is assigned to our entity e of type EVENT.
What have the sets C13(e) and C2(e) to do with one another?
Are they equal, is one of them contained in the other, or is there an even
more complicated condition connecting the two?
Let us consider the case that C13(e) = C2(e) for all e of type EVENT.
(Remember two sets are also equal if both are empty.) This makes sense if we
intend to class events with claims codes on the basis of clients’ claims advices
(instead of on the basis of information from newspapers, for example). In
this case the integrated diagram would look like fig. 43 without RELATED2.
Of course, many other possibilities are also imaginable.
Example 10 (Fourth normal form)
To understand this example the reader should have some knowledge of the so-called fourth normal form and multivalued dependencies. Up to now, only violations of the third normal form (BCNF) have been discussed, because only functional dependencies and the question of whether the DBMS should guarantee their validity were considered.
As an example of multivalued dependencies consider fig. 44,
[fig. 44: m:m:m relationship type TRADE connecting COMPANY (key C#), PRODUCT (key P#) and COUNTRY (key D#)]
together with a constraint which is a multivalued dependency described as
follows:
let the relations that correspond to the given object types be COMPANY{C#}, PRODUCT{P#}, COUNTRY{D#}, and TRADE{C#,P#,D#}.
The relation TRADE violates the fourth normal form if the designer wants
to shape his constraint in such a way that the data which will be in TRADE
fulfill the following condition:
there are two relations, M{C#,P#} and E{C#,D#}, conceived such that,
for all times, the relation TRADE is the (natural) join of the relations
M and E (over C#).
A syntactical version of this constraint is the multivalued dependency C# →→ P# (or, equivalently, C# →→ D#).
What should the ER designer do in such a situation?
There is a certain hope that the constraint will be formulated during the
“conceptual design phase”, and also that the designer will realize that his
constraint puts him on the track of two relationship types which correspond
to the above-mentioned (“virtual”) relations M and E. He then will (hopefully) forget about the constraint as formulated above, and merely change
his diagram (as in fig. 45)
[fig. 45: m:m relationship type M between COMPANY (C#) and PRODUCT (P#), and m:m relationship type E between COMPANY and COUNTRY (D#)]
Of course the designer will have to think about the information content of
the new relationship types M and E (probably Make and Export).
What can we learn from this example?
We can just hope that the designer realizes what he is doing when he writes
down a constraint which would have to be mapped into a multivalued
dependency. Of course the formulation can be misleading. Another formulation of the same constraint would be:
for all c, p and d,
if there is a d2 such that <c,p,d2> is in TRADE, and
there is a p2 such that <c,p2,d> is in TRADE
then also <c,p,d> is in TRADE.
However, the likelihood that the designer will formulate such a constraint by
accident without realizing that he could map it into the data design is
extremely small.
All this shows us that the whole problem is not very serious.
The situation with the fifth normal form and join dependencies is the same.
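The second formulation of the constraint can be checked mechanically, and its equivalence with “TRADE is the join of M and E” can be illustrated on small sets. A Python sketch (function names and sample data are invented for illustration):

```python
def satisfies_mvd(trade):
    """The closure condition quoted above (the multivalued dependency
    C# ->-> P#): if <c,p,d2> and <c,p2,d> are in TRADE, so is <c,p,d>."""
    s = set(trade)
    return all((c, p, d) in s
               for (c, p, _) in s
               for (c2, _, d) in s
               if c == c2)

def join_of_projections(trade):
    """Natural join over C# of the projections M{C#,P#} and E{C#,D#}."""
    m = {(c, p) for (c, p, _) in trade}
    e = {(c, d) for (c, _, d) in trade}
    return {(c, p, d) for (c, p) in m for (c2, d) in e if c == c2}

good = {("c1", "p1", "d1"), ("c1", "p1", "d2"),
        ("c1", "p2", "d1"), ("c1", "p2", "d2")}
bad = {("c1", "p1", "d1"), ("c1", "p2", "d2")}

print(satisfies_mvd(good), join_of_projections(good) == good)  # True True
print(satisfies_mvd(bad), join_of_projections(bad) == bad)     # False False
```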
Example 11 (The OLAP warehouse)
[fig. 46: star join schema: fact relationship type F (attribute V) connected (m) to each of the dimension entity types A, B, C, D and E]
Fig. 46 shows a typical data structure pattern for OLAP (an abbreviation for
“online analytical processing”, a term invented by E. F. Codd in 1993), often
called a “star join schema”.
It consists of dimensions: in the example, the entity types A, B, C, D and E.
(There can be fewer than five, but usually there are many more.) Then there is
a fact relationship type (“fact table”) F connected to the dimensions such that
the set of all foreign key attributes of F corresponding to the dimensions is a
key of F. Fact table F has one or more additional attributes for the measured
fact values (in the example there is one, with the name V), that for the typical
proper OLAP pattern should be additive in the sense defined below. (We will
not discuss deviations from full additivity here, as we only want to sketch the
kernel OLAP pattern with the simple drilling capabilities typically supported
by specialised tools).
Let us consider the sets M of tuples <a,b,c,d,e> from the combinations of
dimensional key values that make sense from the information content point
of view. For each of these sets M, the sum of corresponding values v (ie the
values such that <a,b,c,d,e,v> is in F and <a,b,c,d,e> is in M) must both
make sense and reflect the set M. There may or may not be a tuple in F that
represents M, but where this is the case, its corresponding value must be
equal to the sum mentioned above. Usually, however – and especially if
OLAP tools are being used – those sums will not belong to the fact table.
(They might be internally precalculated, invisible for the fact table administrator.) Special cases of such sets M are the “parallels to the dimensions”
(groupings on a subset of the dimensions with fixed values in the non-grouped
dimensions). As an example, take a, b and c fixed from A, B and C, and
define M as the set of all <a,b,c,d1,e1> such that d1 and e1 are from D and
E. If N is another parallel containing M as a subset, then N is a drill-up of
M, and M is a drill-down of N.
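Additivity and drilling along parallels can be sketched on a tiny fact table. A Python sketch (dimension values and fact values are invented; dimensions are identified by position in the key tuple):

```python
# Fact table F over the dimensions A, B, C, D, E with one additive value V.
F = {
    ("a1", "b1", "c1", "d1", "e1"): 10.0,
    ("a1", "b1", "c1", "d1", "e2"): 5.0,
    ("a1", "b1", "c1", "d2", "e1"): 2.5,
    ("a1", "b1", "c2", "d1", "e1"): 4.0,
}

def parallel_sum(fact, fixed):
    """Sum V over the parallel defined by fixing some dimensions
    (fixed maps dimension position -> value) and grouping over the rest."""
    return sum(v for key, v in fact.items()
               if all(key[i] == val for i, val in fixed.items()))

# M: a, b and c fixed; N: only a and b fixed.  M is a subset of N, so
# summing over N is a drill-up of M, and M a drill-down of N.
m_total = parallel_sum(F, {0: "a1", 1: "b1", 2: "c1"})
n_total = parallel_sum(F, {0: "a1", 1: "b1"})
print(m_total)  # 17.5
print(n_total)  # 21.5
```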
This additivity constraint is sometimes difficult to achieve, especially if
some of the dimensions contain natural orderings which, first, are not total
(meaning that there are attribute pairs in the ordering whose elements are
not comparable) and second, which can be used in the definitions of sets M.
As an example, suppose that the dimension A stands for geographical unit
and has attributes REGION, DISTRICT, COUNTRY and TERRITORY. In
such a case, the usual geographical (topographical) meaning imposes a natural
ordering, and it is just as natural to define sets M in terms of this ordering.
However, territories might overlap countries, and vice versa – so TERRITORY
is not comparable to COUNTRY. In such cases it is important to have the
smallest possible elements from which all others can be built. Here, for example, REGION might be a candidate: districts, countries and territories would
all have to be built up from regions. REGION would then have to be the
primary key of the dimension A, and the fact table would contain only basic
facts according to the geographical dimension that could be summed up
without the danger of “counting values twice” (or several times).
An example for nonlinear natural ordering that is even easier to understand
would be a time dimension with attributes DAY, WEEK, MONTH and
YEAR. Snowflaking (ie adding more data objects around the dimension entity
types) does not help, as it makes it even more complicated to define the
ordering. One could try, rather, to take more than the primary key attribute
of the dimension into the fact table, but then the drills must be defined
explicitly, as certain drills that are theoretically possible would no longer
make sense.
The world is full of complicated orderings, and in OLAP it is important to
restrict oneself from the beginning to a few of those that can be clearly
defined (almost mathematically), and that do not change every other month.
There are tools for OLAP processing that precalculate most of the interesting
sums, and this gives fantastic drilling performance, either up or down. However, the price paid is the huge amount of preprocessing required: the
resources needed are exponential in the number of dimensions.
The dimensions should be as orthogonal as possible: that is, their values
should be independent, especially from values in the other dimensions. If
they are not, the database designer would probably change the schema as he
has learned in this brochure. For instance, if there is a functional dependency
between the data of two dimensions, he could merge the two into one; however, if this functional dependency exists only for part of the data in the two
dimensions, any redesign that copes with the dependency would destroy the
star join schema.
Another interesting observation with respect to real-life data is the fact that
data is seldom equally distributed among dimensions (ie sparsities exist).
Though this means that it can be a technical challenge to find a performing
access path, this is not our concern here: the point is that the user sooner or
later calls for a categorisation of dimensional values. As an example, take
CONTRACT as a dimension: the user will want information according to
single contract for half of them, but he will want to see the other half of the
contracts only in total (whether other dimensions are totalised or detailed).
Of course, one could take “category of contract” as the dimension instead of
“contract”; but what if the user, who in most cases wants only totals, then
wants – though perhaps only in very rare cases – the details of an uninteresting
contract? The star join schema can be a very narrow corset.
We have learned how to avoid NULLs in database design, but in the star join
schema you might be confronted with the need to add another dimension
which is not complete: that is, where the dimensional values are not defined
or not known for part of the facts. Then you will have a classical NULL-value
problem, which in this case will have consequences right up to the user interface. It is interesting to observe how the OLAP tool vendors cope with that
challenge.
To sum up, if your data fits ideally into the star join schema, then by all
means, use the fantastic tools that are available for presenting this data to the
user. However, in other cases, you are well advised to not forget how to do
classical individual data processing with a powerful language like SQL on a
relational database.
Dr Hanswalter Buff
Hanswalter Buff joined Swiss Re in
1982 after studies at the University
of St. Gall (economics) and the
Swiss Federal Institute of Technology (ETH) in Zurich (predominantly
mathematics, taking a doctorate in
mathematical logic). After a year
in the Group Planning department,
he left the company to complete a
teacher’s certificate and continue
his studies in information science.
Hanswalter Buff joined Swiss Re
again in 1984, this time the Group
section of the IT department,
where he helped support German
life insurance companies in their
efforts to comply with the German
supervisory authorities by programming models for portfolio evaluations. Two years later, he joined
the systems programming team
for databases; since then he has
been occupied with the subject of
data in one form or another. During
these years of professional activity,
Hanswalter Buff has published
several articles on mathematical
logic and database theory; he also
has lectured on data management
for several years at the University of St. Gallen.
© 1998
Swiss Reinsurance Company
Title: Entity relationship for
relational database
Author: Hanswalter Buff
Translation by: Swiss Re Language
Services
Produced by: Marketing
Communications, Swiss Re
Cover illustration: Markus Galizinski,
Zurich
Additional copies of this brochure,
as well as an overview of Swiss
Re’s other publications (Swiss Re
Publishing – our expertise for your
benefit) can be ordered from:
Swiss Reinsurance Company
Mythenquai 50/60
P.O. Box
CH-8022 Zurich
Telephone +41 1 285 21 21
Fax +41 1 285 20 23
E-mail [email protected]
Internet http://www.swissre.com
5/98, 2000 en