
Chapter 2
Conceptual Simplicity
This chapter reviews the practical need for conceptual simplicity in computer software,
what is meant by it, how one can achieve it, how the relational model already provides it,
and how in principle different kinds of container type might be added to the relational
model.
2.1 The Practical Importance of Simplicity
As computer hardware has become more powerful and provides ever more resources -
e.g. more memory, better screen displays - software applications, whether for the PC or to
be run on powerful servers, have grown ever bigger and more complicated to take
advantage of these resources. For example, “By 1992, the word processing program
Microsoft Word had 311 commands. ... Five years later in 1997, that same word
processing program, Microsoft Word, had 1,033 commands”. See pages 80 and 81 of
[Norm99]. Schneier on pages 357 and 358 of [Schn00] has tables which show that the
number of lines of source code in the Microsoft Windows operating system rose from 3
million lines (Windows 3.1 in 1992) to an estimated 35-60 million (Windows 2000 in
2000), while the number of system calls in operating systems generally rose from 33 (in
Unix 1st edition in 1971) to 3,433 (in Windows NT 4.0 SP3 in 1999). Such escalation of
size and complexity in commercial software generally continues to this day.
Does this continued escalation matter ?
Up to the 1980s and 1990s, it made sense to use the ever-increasing resources provided
by newer, cheaper computer hardware to develop the hitherto more limited software
applications into more effective versions for the user, regardless of the extra software size
and complexity needed to achieve this. However, is this still true in the 21st century ?
There is now increasing general concern about 'bloated software', 'bloatware' and 'code
bloat', so much so that these terms [1] have become established and can be referenced, say
on Wikipedia [Bloa06]. In general these terms refer to programs that appear to use more
computer hardware resources than is commensurate with the benefits received by the
program user.
Related concern exists among commercial and business users of computer applications.
For example, consider the following quotes from commercial IT managers reported in the
computing press :
“If projects or programmes are overly complex, there is a good chance they are
simply wrong”. (Brian Jones, ex Global CIO Allied Domecq, then of IBM).
[Jone06].
[1] Other terms are also used to denote this phenomenon, or specific aspects of it, such as 'creeping featurism', 'creeping featuritis' and 'second system effects'.
“Complexity leads to design problems & greater risk of error”. (Martyn Thomas, ex-Praxis MD, Formal Methods specialist). [Thom06].
Comparable views from computing academics have been expressed since 1995.
According to Niklaus Wirth, “Software's girth has surpassed its functionality, largely
because hardware advances make this possible. ... software can be developed with a
fraction of the memory capacity & processor power usually required, without sacrificing
flexibility, functionality or user convenience.” [Wirth95].
Software complexity has been singled out as a particular problem with respect to
achieving effective computer security. Ferguson and Schneier state that “There are no
complex systems that are secure. Complexity is the worst enemy of security, and it
almost always comes in the form of features or options.” - see page 5 of [NFBS03]. The
reason they give for this is as follows. Typically a computer application has many
different options. Together they create a huge number of different possibilities. To
ensure that the application works correctly, all these possibilities should be tested.
However in practice the number is so huge that it is only practicable to test the most
commonly occurring combinations of options, thereby leaving many possible unfound
bugs, which in turn lead to security flaws. There are standard ways of coping with such
situations, such as the use of modular software and the application of orthogonality in the
design, but “Unfortunately, we see very little of it in real-world systems.” - page 5 again
of [NFBS03].
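Ferguson and Schneier's combinatorial argument can be illustrated with some simple arithmetic. The figures below are assumptions chosen for illustration (the option counts and testing rate are not taken from [NFBS03]); they merely show how quickly the number of configurations outgrows any realistic testing budget.

```python
# Illustrative arithmetic only; the figures are assumed, not from [NFBS03].

def configurations(n_options: int) -> int:
    """Number of distinct settings of n independent on/off options."""
    return 2 ** n_options

for n in (10, 30, 50):
    print(n, configurations(n))

# Even at one test per microsecond, 2**50 configurations take decades,
# so in practice only the commonly occurring combinations get tested.
years = configurations(50) / 1_000_000 / (365 * 24 * 3600)
print(round(years))  # roughly 36 years
```

Since each independent option doubles the number of cases, the testing effort grows exponentially while the development effort grows only linearly, which is the heart of the argument.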
Nevertheless, one might assume that complexity, at least in the shape of an ever larger
variety of options and features, is a good thing for the user. However increasingly there
are concerns that not only is this assumption false but that a large range of options is
counter-productive. Donald Norman puts it very succinctly. “The result is technology-driven,
feature-laden products. Each new release touts a new set of features. ... Seldom
are the customer's real needs addressed, ... The notion that a product with fewer
features might be more usable, more functional, and superior for the needs of the
customer is considered blasphemous.” See page 25 of [Norm99], a book which promotes
the 'Information Appliance', the antithesis of the current kind of personal computer
application. The reason for the inadequacy of current applications is that “Design by
feature-lists is fundamentally wrong. Lists of features miss the interconnected nature of
tasks.” - see page 207 of [Norm99].
Norman in [Norm99] suggests how the current situation has arisen, using the work of
Moore [Moor91] and Christensen [Chri97] as his basis. A new technology starts out
delivering less than customers really require, although customers still buy it since it
satisfies needs unsatisfiable by any other means. Higher performance versions of the
technology are developed over time to meet the unsatisfied demands. Eventually
improvements to the technology reach the point where customers' needs are substantially
satisfied. From then on, while the initial customers may still appreciate ever more
advanced technology, new customers, who will eventually form the bulk of a mass
market, prefer convenience, ease of use, reliability, and low cost; they are not interested
in the technology per se, they want solutions that simplify their lives [2]. Norman argues
that this product development cycle applies to computer applications, which now need to
evolve towards simple mass market solutions and away from high tech products. See
chapter 3 and ff. of [Norm99] [3].
Another cause suggested for overly complex software arises from the first version of a
software product invariably being a prototype. As Brooks points out in [BrPr82], “The
management question, therefore, is not whether to build a pilot system and throw it
away. You will do that.” [4] Thus the product needs enhancement to make it useable. In
order to enhance it as quickly as possible, rather than learning from the prototype how to
revise or develop a new underlying architecture and design, accretions and extra
complexity are inserted into the prototype itself, and this results in a more complex
product. Jamie Zawinski, who helped develop the Mozilla [Netscape 1.0] browser, stated
that marketing demands left no time to refine the browser into a smaller, more elegant
product that delivered the same functionality in a simpler way - see [ZaBl06]. The
marketing necessity for speed is not unusual. Being first to market, or at least not too
late, is often very important for the commercial success of many kinds of software
product [5]. Lou Gerstner, IBM's chairman, summed this up with : “All large companies
know today that speed and being early to market are often more important than being
right.” - see [Gers98].
However it arises, clearly software complexity has now become a problem. Yet software
does not have to imply complexity :
“We have reduced the cost of running our IT operation by 40% ... Also the quality
has improved by two or three times. We have done that by simplifying and
standardising how we run the [IT] technology.” Furthermore this has had a
beneficial effect on the business processes supported by the IT, by encouraging
further business process simplification. “ 'IT has been an enormous leader in
simplifying BA's business,' said Corby” (British Airways' CIO). “... if you are going
to automate, you have to understand the process and simplify it. Complexity will kill
you and slow you down.” See [Corb06].
The next chapter reviews object-relational DBs to demonstrate the extent of their
complexity with respect to different kinds of container types; but first the remainder of
this chapter reviews the nature of the simplicity aimed for and the means of achieving it.
[2] The development of the motor car illustrates the changes in product and market. The early versions were difficult to drive, with complex controls to manage the ignition timing, fuel supply to the engine, etc, so that they only appealed to a small market of car enthusiasts. Modern cars have automated these problems out of existence, so that they are now comparatively easy to drive and appeal to a mass market.
[3] Norman emphasises applications for PCs, which ought to evolve into 'Information Appliances', where customers no longer need to continually update their software to obtain ever larger feature sets; not all these aspects are germane to the thesis.
[4] 'Pilot' is the term Brooks uses for 'prototype' here. It derives from the term used in some other branches of engineering for a prototype, e.g. a 'pilot plant' in chemical engineering.
[5] For example, see pages 116-118 of [Norm99] on 'infrastructure products'.
2.2 The Simplicity Required
The crucial importance of a mental model to a good software product is underlined by the
fact that among the 'Universal Principles of Design' for general product design listed in
[LiHB03] is the 'Mental Model' principle : “People understand and interact with systems
and environments based on mental representations developed from experience.”
In this context, 'mental model' and 'conceptual model' are synonyms. The latter term is
the one normally used in this thesis.
In 'The Design of Everyday Things' [Norm98], page 53, Norman includes a good
conceptual model among his four principles for good product design : “Consistency in the
presentation of operations and results and a coherent, consistent system image.”
'System Image' is the term used for the conceptual model actually presented to the user
by the product when in use [6]. For specifically software products, in [Norm99] pages
175-179, Norman emphasises the importance of the conceptual model in designing them :
“The use of a good conceptual model is .. fundamental to good design .. .”
“Good designers present explicit conceptual models for users. If they don't,
users create their own mental models, which are apt to be defective and lead
them astray.”
“Start with a simple, cohesive conceptual model and use it to direct all aspects
of the design. The details of implementation flow naturally from the
conceptual model.”
“The model has to be coherent, understandable, and sufficiently cohesive that
it covers the major operations of the system. ... It is successful if the users
can then use the system in ways the developers never imagined. Above all, the
user should be able to discover and learn how to use it with the minimum of
effort. In the ideal case, no manual would be required.”
Thus the kind of simplicity required of a software product is that of its conceptual
model.
It does not necessarily follow that the software implementation is simpler, or that the
software architecture, the data structures and/or algorithms will be simpler. A simple
conceptual model may well lead to a simple implementation. Where that happens, it is
all to the good. The security concerns expressed above involve both aspects, so it is
preferable if both are simplified. Yet it is not necessarily so. A simpler conceptual
model for the user may require the transfer of complexity to the implementation, so to
speak, in that the implementation must become more complex in order to handle
complications that the user formerly handled. A DBMS that presents the user with a
relational conceptual model presents a simpler model than a DBMS that presents a
network conceptual model, since those aspects of the physical implementation are
removed from the user's conceptual model. Some of those aspects transferred are
automated - e.g. query optimisation - while others are transferred to an interface for the
Database Administrator (= DBA) who handles them - e.g. selecting a file type for the
storage of a relation's data.
[6] In fact on pages 189-190 of [Norm98], Norman refers to three conceptual models : “the design model, the user's model, and the system image. The design model is the conceptualisation that the designer has in mind. The user's model is what the user develops to explain the operation of the system. Ideally, the user's model and the design model are equivalent. However, the user and designer communicate only through the system itself ... Thus the system image is critical.”
The desire for simplicity has been ubiquitous for centuries, as the following quotes from
different eras and subject areas illustrate :
“Everything should be made as simple as possible, but not simpler.” Albert
Einstein (physicist) - [Eins79].
“Entities should not be multiplied without necessity.” Ockham's Razor [7] - page 142 of [LiHB03].
“The aim of science is always to reduce complexity to simplicity.” William
James (psychologist) - [Jame90].
“Throughout the history of engineering a principle seems to emerge : great
engineering is simple engineering.” (emphasis in the original). James Martin
(computing consultant and writer) - [Mart75].
“The ability to simplify means to eliminate the unnecessary so that the
necessary may speak.” Hans Hoffman (painter) - [Hoff67].
“Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away.” [8] Antoine de Saint-Exupéry (pilot and
writer) - [StEx39].
It is noticeable that what these sayings have in common is the removal of that which is
unnecessary, and the recognition that there is an irreducible minimum.
This irreducible minimum should not be simplistic, i.e. provide limited functionality, and
consequently require the user to work hard to achieve anything complicated with the
system. This would be a minimal system rather than a simple one [9]. Hence the
importance of Brooks' point that it is the ratio of functionality to complexity that matters,
because the aim is to make the application easy to use.
Given that a multiplicity of options and features is a common problem, it is also
important that simplification is not treated merely as the removal of some of these options
or features. Typically the Pareto Effect applies to an application - 80% of users use only
20% of the options [10]. Therefore one could 'simplify' a product by removing 80% of the
[7] Also written as 'Occam's Razor' (by application of the razor ?), and known by other names, such as the 'Law of Parsimony'. Commonly attributed to William of Ockham, a 14th century English Franciscan friar and logician, who purportedly used it frequently, although it does not appear in any of his surviving writings.
[8] Translated from the French : “Il semble que la perfection soit atteinte non quand il n'y a plus rien à ajouter, mais quand il n'y a plus rien à retrancher”.
[9] Note that the converse error is also sometimes made, that a powerful system must of necessity be complex.
[10] Also known as the Pareto Distribution, both being the result of the Pareto Principle (also known as the Pareto Rule). Note that the precise percentages are not important in this argument.
options with little loss to most users. While this may sometimes be effective, suppose
different users want different '20% sets' of options. A 'lite' version of the product with
only 20% of the options would be of little use to most users as it wouldn't have the
options that they need. Zawinski noted that Mozilla has a large number of options
because different users use different options, and that a browser with few options would
be inadequate for most users [11]. In this thesis, the aim is not to lose any functionality; so
removing options and features, and thereby leaving a diminished total functionality,
would not be acceptable.
Finally one cannot assume that a simple conceptual model will be obvious and/or
intuitive. As Donald Norman points out on page 182 of [Norm99], “Good design is not
necessarily self-explanatory : some tasks are inherently too complex. The notion that
good design can render anything 'intuitively obvious' is false. In fact, intuition is simply
a state of subconscious knowledge that comes about after extended practice and
experience.” Furthermore “Difficult tasks will always have to be taught. The trick is to
ensure that the technology is not part of the difficulty. Devices for complex tasks must of
themselves be complex, but they can still be easy to use if the devices are properly
designed so that they fit naturally into the task. When this is done, learn the task and you
know the device.” In other words, the conceptual model must reflect the innate nature of
the application situation, so that in dealing with the software product, the user is dealing
directly with their problem, and the software product itself becomes 'invisible'.
In a similar vein, and with regard to the conceptual model underpinning programming
languages, Petre in chapter 2.1 of [HGSG90], in a section headed “Obstructions to coding
: how programming languages get in the way”, notes that “Although a programming
language is unlikely to contribute directly to a solution, it may obstruct solution, even
contributing to errors or oversights.”
So a simple conceptual model could be said to be one that maps simply and directly onto
the problem domain, and hence is 'invisible' to the user.
Hence the conceptual simplification of different kinds of container type is just the kind of
software quality improvement that one should aim for.
2.3 Simplification Principles from Programming Languages
The idea of conceptual integrity has become well established, to the extent that it has
been applied to other topics which involve a collection of concepts. Flater in [Flat03]
applies the idea to the integration of data schemas arising from different information
systems that must work together : “Compromised conceptual integrity results in
'semantic faults' which are commonly blamed for latent integration bugs.” To integrate
the schemas, he uses a logical notation that incorporates belief and time. In a different
kind of example, [ECLKK02] considers strategies in course curriculum development to
ensure the conceptual integrity of the different aspects of the resulting course.
[11] There are various Internet sites which report Zawinski's views; see for example [Zawi05] or via 'bloatware' on [Bloa06].
Nevertheless it is useful to see if the idea of conceptual integrity can be amplified and/or
related to the criteria, principles and guidelines that have been put forward in the
literature to guide the design of programming languages, since the ultimate goal is a
simplified conceptual model for a relational DB programming language. Although it
may seem a big jump from software products in general to programming languages in
general, note that Bentley suggests in [Bent88] that “a language is any mechanism to
express intent, and the input to many programs can be viewed profitably as statements in
a language”. Note also that Brooks in [BrCI82] does not differentiate between
programming language compilers, operating systems and end-user applications. Indeed
he refers to 'computer systems' in general, which is clearly meant to correspond to any
kind of 'software product'.
References [Bent88], [Horo84] and [MacL87] have been used to provide a technical
source, and references [Wein71] and [HGSG90] to provide an input from the psychology
of programming. Together the references provide a number of criteria by which to judge
languages that cover not only the semantics of programming languages, but also their
syntax, implementation, environment and application area(s). Since only the conceptual
model of a relational DB programming language is of interest here, only the criteria
pertinent to the semantics have been abstracted [12]. The remainder are ignored.
Although each reference has its own terminology and approach, the criteria apposite to
semantics that a good language will meet can be summarised as :
Parsimony. This means having as small a number of concepts as possible, or
alternatively removing as many unnecessary concepts as possible.
In practice some judgement is needed to decide what a concept is, because
a group of detailed concepts can be 'chunked together' to form one concept at
a higher level of abstraction; e.g. is a relation one concept or does it comprise
a set of concepts such as “a relational value is a set of tuple values, a tuple
value is a set of attribute values, all the values in one attribute are of the same
scalar type, etc [13] ?” “In psychology, this information processing ability of
human beings that combines several small units into one large unit, which is
just as easy to handle as its individual parts, is called chunking.” - see pages
224 and ff. of [Wein71], where as long ago as 1971, Weinberg recognised the
possibility of exploiting this psychological ability to achieve what he called
'compactness' in programming.
Therefore one needs some consistency in the levels of abstraction used
when considering the concepts comprising a conceptual model.
Simplicity. This appears in many guises in the literature. Here it is taken to
mean that each of the concepts is as semantically simple and straightforward
as possible. Terseness is sometimes used to describe it, because something
that is not simple can rarely be described tersely. Sometimes the term
'straightforwardness' is used, to indicate that involuted and unexpected
[12] The principles in fact apply more widely, especially to syntax and environment, but that is ignored here.
[13] Note that this example does not constitute a complete specification of the 'relation' concept. Its purpose is to illustrate 'chunking'.
concepts should not occur, even if they can be regarded as simple in
themselves - see [BrCI82].
Elegance is often associated with simplicity. Petre in chapter 2.1 of
[HGSG90] reports that elegance of expression is considered important by
expert programmers; e.g. a quote given from Hoare is “I have regarded it as
the highest goal of programming language design to enable good ideas to be
elegantly expressed.” Again “Experts appreciate an uncluttered notation. ...
'ugliness' matters.” [14]
Physicists and some mathematicians have long suggested that true
physical theories and good mathematical proofs are always elegant. As long
ago as 1981, Dijkstra was quoted in [Dijk81] as follows : “elegance is a
strong factor in whether a mathematical proof is understandable” and “this
principle can be applied to programming notation”. Dijkstra also found that
“there was so much agreement (among mathematicians) about what
constituted elegance. It turned out that the major characteristics were brevity
and what I would call soberness - an economy in the use of nomenclature”.
So elegance is not as dependent on individual taste as might be imagined.
Clearly chunking is relevant again. Related concepts chunked together
will provide simplicity at a higher level of abstraction; unrelated concepts
chunked together yield complication.
Generality. “The criteria of generality argues that related concepts should
be unified into a single framework.” - page 39 of [Horo84]. There should be
no exceptions to a general rule, with all applications of the concept being an
instance of the general rule. One could consider this as chunking together a
set of related detailed concepts to ensure that they fit together consistently to
form a single concept at a higher level of abstraction.
From the opposite point of view, there should be no artificial constraints
or limitations, e.g. no minimum or maximum constraint on the number of
attributes or tuples in a relation. [Morr81] expresses this well : “Most
languages are too big and intellectually unmanageable. The problem arises
in part because the language is too restrictive; the number of rules needed to
define a language increases when a general rule has additional rules attached
to constrain its use in certain cases. (Ironically, these additional rules usually
make the language less powerful).”
Orthogonality. This states that every concept should be independent of all
the other concepts, and that there is a general and consistent way of combining
them together. Thus concepts can be combined with each other in any
arbitrary way. “In a truly orthogonal language, a small set of basic facilities
may be combined without arbitrary restrictions according to systematic
rules.” - see page 105 of [HGSG90].
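This closure-based orthogonality can be sketched in a few lines. The following is a hypothetical miniature in Python, not RAQUEL or Tutorial D syntax; the names restrict, project and the tuple encoding are assumptions of the sketch. Its point is only that, because every operator maps relations to relations, any output may feed any input without special-case rules.

```python
# Hypothetical miniature relational algebra (illustrative names only).
# Every operator takes relations and returns a relation, so results may
# be combined arbitrarily - the closure property behind orthogonality.

Tup = frozenset   # a tuple: immutable set of (attribute, value) pairs
Rel = frozenset   # a relation: immutable set of tuples

def restrict(r, pred):
    """Keep only the tuples for which pred holds."""
    return Rel(t for t in r if pred(dict(t)))

def project(r, attrs):
    """Keep only the named attributes of each tuple."""
    return Rel(Tup((a, v) for a, v in t if a in attrs) for t in r)

emp = Rel({
    Tup({("name", "Ann"), ("dept", "IT"), ("salary", 40)}),
    Tup({("name", "Bob"), ("dept", "HR"), ("salary", 30)}),
})

# Operators compose without restriction because each result is a relation.
well_paid = project(restrict(emp, lambda t: t["salary"] > 35), {"name"})
print(well_paid == Rel({Tup({("name", "Ann")})}))  # True
```

Note that no rule forbids nesting project inside restrict or vice versa; the systematic combination rule is simply function application over relation values.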
[14] Tractinsky successfully repeated a Japanese experiment that demonstrated that an aesthetically elegant bank ATM was easier to use than an ugly one - see [Trac97] and [TrKI00]. See also [Norm04] on this subject.
Uniformity. This is also known as consistency or regularity. It means that
similar things should be done in similar ways and have similar meanings.
“The same things should be done in the same way wherever they occur” - page 219 of [Wein71].
Weinberg in [Wein71] has several useful contributions on uniformity.
Lack of uniformity in some parts of a language can create the fear of such a
lack in other parts of the language : “The more “covert categories” - things
that you cannot do or say - there are, the more one expects other such covert
categories in the language. Even if the restrictions are in another part of the
language, they may affect the actual usage of a part without such
restrictions.” - page 220 of [Wein71].
Uniformity is also conducive to naturalness. “One way of achieving
naturalness ... [is] through uniformity, but uniformity only applies to those
programmers who have some experience with the language.” - page 232 of
[Wein71]. Such experience relates to the intuitiveness described by Norman;
see the earlier quote from him.
Uniformity could be regarded as a meta principle to the extent that it is a
policy of treating all concepts in the same way. In this light, it also appears as
a general product design principle in [LiHB03] : “The usability of a system is
improved when similar parts are expressed in similar ways.”
The five criteria are not unrelated. For example, keeping related concepts together and
unrelated ones separate at a higher level of abstraction facilitates both simplicity and
generality. Parsimony and simplicity are related through elegance. Morrison in [Morr81]
points out : “Power through simplicity, simplicity through generality, should be the
guiding principle”.
The criteria should be applied at a consistent level of abstraction. For example, at the
conceptual level of a relation as a whole (as opposed to the constituent concepts that
make up a relation, such as tuples and attributes) consideration of a relational algebra
operator means considering how it applies to an entire relation. At the next level of
abstraction down, one can (say) relate attributes that are parameters of the operator to
attributes that are part of the relation.
The important thing is that these five criteria are clearly consistent with Brooks' concern
for ease of use and the criterion of conceptual integrity. Uniformity is directly specified
by Brooks in his definition of conceptual integrity : “Every part must reflect the same
philosophies and the same balancing of desiderata.” Parsimony and simplicity directly
make a programming language easier to use; generality increases its functionality.
Orthogonality achieves simplicity by eliminating the need for special rules that prohibit
or constrain combinations of concepts; orthogonality also achieves greater functionality
by allowing all the concepts to be combined together without let or hindrance. The
combinatorial ability not only provides an opportunity for creative problem solving but
also provides one means (generality being the other) whereby a simple language can
express great functionality. As van Wijngaarden put it in [Wijn71] : “Orthogonal design
maximises expressive power while avoiding deleterious superfluities.”
Thus these five criteria can be used to amplify the criterion of conceptual integrity.
2.4 The Relational Conceptual Model
It is assumed that the reader is familiar with relational databases in general. Nevertheless
the key concepts of the relational model are now summarised so that thereafter it can be
reviewed to ascertain its simplicity and conceptual integrity.
The relational model is taken to be that which has evolved from Codd's original
publicly available paper [Codd70]. The model has been extended, refined, and its
logical consequences developed since its inception in 1969, particularly by Codd and
Date, and more recently by Date and Darwen. Its focus is a mathematical model of a
relational database, that a user can interact with and manipulate logically, i.e. it is a
conceptual model. The justification for the relational model is that it is of great practical
value. Compared to SQL, inconsistencies have been removed, its constructs are
orthogonal, and there are no ad hoc limitations. The model is much simpler yet more
powerful than its SQL counterpart.
Its evolution is noted in useful papers of Codd's such as [Codd71], [Codd72Co],
[Codd72Nr], [CoDa74] which introduced the concept of 'essentiality', [Codd74],
[Codd81] and [Codd88]. Stonebraker in [Ston94] suggested four evolutionary versions :
Codd's original paper [Codd70],
Codd's 1981 Turing Award paper [Codd82],
Codd's brief summary in [Codd85Re] and [Codd85Ru],
Codd's book [Codd90].
[Date01] gives a concise review and summary of the evolution, [Date05] gives a more
comprehensive description, summarised in chapter 8, and [Date06Dic] is a dictionary of
the terminology and concepts of the relational model.
Further developments are recorded in C. J. Date's series of 'Relational Database
Writings' - [Date86], [DaWa90], [DaDa92], [Date95Wr] and [DaDM97]. They
culminate in Date and Darwen's 'Third Manifesto' proposals for relational databases -
[DaDa98], [DaDa00] and [DaDa07].
Codd showed in [Codd90] how the relational model could be used to support a sort of
'formalised E-R model', but this is eschewed here in favour of Merrett's approach -
[MeRe84] and particularly [MeMc84] - which views the relational model as a simple
formalism which can be applied to any suitable semantic situation. 'The Third
Manifesto' (= TTM) proposals also take the 'simpler' view; for this reason, TTM also
excludes 'nulls' so that 2-valued logic may be retained. [McGo94] and [Pasc00] by
McGoveran and Pascal respectively present the practical benefits of the simpler model [15].
Later in the thesis, a specific syntax is needed to express the concepts of the model. The
RAQUEL notation will be used for this, although this does not preclude another syntactic
notation being used instead to describe the same relational model [16]. However RAQUEL
[15] For example, [Pasc00] applies this relational logic to solve effectively such practical problems as duplicate tuples, entity supertypes and subtypes, and data hierarchies.
[16] Indeed the syntax and semantics of RAQUEL itself are kept separate, so that its current syntax, composed of traditional linear text, could be supplemented by an alternative 2- or 3-dimensional graphic version. Furthermore the relational model expressed by RAQUEL is intended eventually to be identical
does have certain features that contribute to a simpler construction of the relational
model, as will be seen later.
At the highest level of abstraction, the relational model comprises four concepts :
1. A relation as a container of data;
2. An open-ended set of scalar data types;
3. An open-ended set of relational algebra operators [17];
4. An open-ended set of relational assignments.
These four concepts are now examined in more detail. This is done by considering how
they are made up of concepts at a lower level of conceptual abstraction. It will be seen
that the lower level conceptual abstractions are in turn made up of yet lower level
conceptual abstractions, and so on.
A Relation. The concept of a relation consists of two related concepts at a lower level of abstraction, a relational value (henceforth abbreviated to 'relvalue') and a relational variable (henceforth abbreviated to 'relvar')^18:

A relvalue is a container of scalar values. It consists of a mathematical set of tuples, each of which consists of a mathematical set^19 of attribute values. The tuples constituting a relvalue all have the same set of named attributes, with every attribute having a declared data type. Each tuple contains a value of the specified type for each attribute^20. Subject to these constraints, a relvalue may have any cardinality and degree.

If an attribute's values are themselves relvalues, then from the viewpoint of the relvalue containing the attribute, the attribute relvalues are perceived as scalars, i.e. each relvalue has been 'enclosed' to become a scalar; it will need to be 'disclosed'^21 to reveal its structure and its own attribute values^22. Such nesting of relvalues within relvalues can be continued ad infinitum^23.
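The set-theoretic definitions above can be sketched in code. The following is a minimal illustration only (Python is chosen arbitrarily; the `relvalue` helper is hypothetical and is not RAQUEL syntax): a relvalue is modelled as a frozenset of tuples, each tuple itself a frozenset of (attribute-name, value) pairs, so that the absence of ordering and of duplicates falls out of the data structure directly, and an enclosed relvalue can serve as a single scalar attribute value.

```python
# A relvalue modelled as a frozenset of tuples; each tuple is a
# frozenset of (attribute-name, value) pairs, so neither tuples nor
# attributes carry any ordering, and duplicates vanish automatically.
def relvalue(*tuples):
    return frozenset(frozenset(t.items()) for t in tuples)

employees = relvalue(
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Ann"},   # duplicate tuple: absorbed by the set
)
assert len(employees) == 2      # cardinality 2, degree 2

# Nesting: an enclosed relvalue is itself hashable, so it can appear
# as a single scalar attribute value in a containing relvalue.
depts = relvalue({"dept": "Sales", "staff": employees})
```

Because a frozenset is hashable, the nesting works to any depth without special machinery, which mirrors the "ad infinitum" point above.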
^17 It would be reasonable to use relational calculus instead of relational algebra. Nevertheless algebra is arbitrarily chosen. It is considered easier to derive and explain complex manipulations of relations via algebra, because algebra expressions lend themselves easily to being built up piecemeal.

^18 The term 'relation' is used henceforth for terseness to refer to what is permitted by the situation concerned to be either a relvalue or a relvar.

^19 It is emphasised that the set is mathematical rather than some other, possibly more vaguely-defined kind of set, because of the set properties that consequently apply, i.e. the set has no ordering or structure, no duplicates, and may be empty. However from now on, reference to a 'set' in the thesis should be taken to mean a 'mathematical set' unless otherwise specified.

^20 Consequently a NULL - i.e. the absence of a value - cannot be an attribute value.

^21 The terms 'enclose' and 'disclose' are taken from APL. In APL they have precise definitions that are the exact analogue with nested arrays of what is required for nested relvalues, and so the same terms with corresponding definitions are used here. 'Nested' is only used as a general, indicative term as there appears to be no universally agreed definition for it in the literature.

^22 This is proposed by Date and Darwen - see pages 152-3 of [Date04Int]. Its rationale is that since an attribute can have any data type, there is nothing to prevent it from having a relational type (henceforth abbreviated to 'reltype'). Such an attribute therefore has nested relvalues, but at the level of abstraction of the containing relvalue, a nested relvalue is enclosed to become a single scalar value, i.e. its internal structure and the values it contains are not visible. So the containing relvalue is in First Normal Form, and relational algebra operators applied to the containing relvalue continue to function as normal on it. There have been other kinds of proposals to allow relvalues to be nested as attribute values. A notable early proposal was that of Roth, Korth and Silberschatz - see [RoKo88] - who derived an extension of the relational model that allowed nesting, but in such a way that the structure and contents of the nested relvalues were not enclosed but visible at the level of the containing relvalue, i.e. the latter's attribute values were non-scalar. As a consequence, relational algebra operators had to be amended to cope with "Non First Normal Form" (= NF2) relvalues, as the containing relvalues were known.

If a nested relvalue is disclosed, a level of enclosure is removed, and the nested relvalue's attribute values are brought up to the level of abstraction of the containing relvalue such that the nested relvalue's attributes replace the nested relvalue attribute.

Just as operators pertaining to a scalar attribute type may be applied to scalar values in an attribute, so may relational algebra operators be applied to nested relvalues in an attribute. Many algebra operators take parameters that reference the attributes of their operands. When such operators are applied to nested relvalues, the parameter is referencing attributes of the nested relvalues, i.e. it is referencing one level of enclosure down, without any explicit disclosure of enclosed levels of abstraction. Furthermore, some algebra operators permit expressions or statements as parameters, e.g. Restrict and Extend. If such an operator is applied to an attribute's nested relvalues, then the expression/statement parameter can itself include relational algebra operators that apply to nested relvalues, which may themselves include operators with expression/statement parameters applying to nested relvalues, and so on without artificial limit. This provides an alternative strategy for manipulating nested relvalues down to any depth of enclosure without the use of disclosure.
A relvar is a named variable whose value is a relvalue that may change over time. As each tuple is unique within a relvalue, there is at least one candidate key in a relvar. If no such key is specified, all the attributes must be treated as the one and only candidate key. Typically one or more subsets of the attributes are specified as candidate keys, in order to better represent the real-world situation in question.

Relvars are either real relvars (commonly referred to as 'base relations' in the literature), which are abstractions of stored data, or virtual (or derived) relvars (commonly referred to as 'views' in the literature), whose value at any moment is the value at that moment of the relational algebra expression which defines that relvar.

Note that the concepts of relvalue and relvar each consist of several more concepts at a yet lower level of abstraction.
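The two requirements on a candidate key, uniqueness and irreducibility, can be sketched as follows (an illustrative Python fragment; the helper names `is_unique` and `is_candidate_key` are hypothetical, not part of RAQUEL):

```python
from itertools import combinations

# rel: a collection of dicts standing for tuples; key: attribute names.
def is_unique(rel, key):
    # Uniqueness: projecting onto the key attributes loses no tuples.
    return len({tuple(t[a] for a in key) for t in rel}) == len(rel)

def is_candidate_key(rel, key):
    # A candidate key is unique and irreducible: no proper subset of
    # its attributes is itself unique.
    return is_unique(rel, key) and not any(
        is_unique(rel, sub)
        for r in range(1, len(key))
        for sub in combinations(key, r)
    )

rel = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ann"}]
assert is_candidate_key(rel, ("id",))             # unique and minimal
assert not is_candidate_key(rel, ("id", "name"))  # unique but reducible
```

The default described above, treating all the attributes as the one and only candidate key, corresponds to passing the full attribute set as `key`; it is always unique, but usually reducible.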
Scalar Data Types. The data type of every attribute of a relvalue is a scalar type, except
when the attribute holds enclosed relvalues. This applies recursively, whatever the level
of nesting.
^23 The whole point of a relational container is that one 'can see inside it' and perceive its structure and the individual scalar values it contains. The whole point of a scalar is that one 'cannot see inside it' and cannot perceive any structure or component values. In designing a relational DB, part of the design is determining how to apportion the DB data into relational containers, and within each relational container, whether each attribute should contain genuine scalar values or enclosed relvalues, i.e. what levels of abstraction are most helpful with respect to the real-world situation represented.

An analogy may help. In chemistry, atoms are the fundamental objects, and chemical reactions concern how atoms are formed and re-formed into molecules. In physics, protons, neutrons and electrons are the fundamental objects, and atomic reactions concern how these form and re-form into atoms. In particle physics, quarks and leptons are the fundamental objects, and particle reactions concern how these form and re-form protons, neutrons and electrons. One must choose the appropriate level of abstraction for the topic of concern.
Because scalar types are orthogonal to relations, there is no limit to the set of permissible scalar data types that can be used to define attribute values^24. The scalar values of a data type can be arbitrarily complex; for example a scalar value could be a photo, a video recording or a piece of music. A scalar type could be defined via an object class. A scalar data type may be built into the DBMS, plugged into it as an 'optional extra', or derived by the user in some way from a pre-existing scalar type.

A scalar type consists of a permissible set of values. It also has a set of scalar operators associated with it, which take one or more values of that type as operands and/or return a value of that type. In RAQUEL, scalar operators are prohibited from having side effects when they execute; this is to achieve simplicity by making them consistent in this respect with relational algebra operators.

Yet again, the concept of a scalar data type consists of several concepts at a lower level of abstraction.
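The idea of a scalar type as a permissible set of values together with a set of side-effect-free operators can be sketched as follows (illustrative Python only; the `ScalarType` bundle and the `Temperature` example are hypothetical, not RAQUEL definitions):

```python
from dataclasses import dataclass
from typing import Callable

# A scalar type bundles a membership test (the permissible set of
# values) with its associated operators.
@dataclass(frozen=True)
class ScalarType:
    name: str
    contains: Callable[[object], bool]
    ops: dict

# Operators are pure functions: they return new values and neither
# mutate state nor perform I/O, i.e. they have no side effects.
Temperature = ScalarType(
    name="Temperature",
    contains=lambda v: isinstance(v, float) and v >= -273.15,
    ops={"add": lambda a, b: a + b},
)

assert Temperature.contains(20.0)
assert not Temperature.contains(-300.0)   # below absolute zero
assert Temperature.ops["add"](20.0, 1.5) == 21.5
```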
Relational Algebra Operators. Relational algebra utilises an open-ended set of algebra operators. An operator is either monadic or dyadic, i.e. it takes either one or two operands^25. An operand must be a relvalue, expressed as either a literal relvalue, a relvar or a relational algebra expression. Every operator returns a single relvalue, whose candidate key(s), attribute names and attribute types are derived from the operand(s). Thus the operators form a closed system under the algebra; expressions of arbitrary complexity may be written using the operators.

In RAQUEL, the operators that compare relvalues return truth values which are represented as zero-attribute relvalues^26, thereby maintaining closure and simplifying the algebra overall^27.
Unlike the open-ended set of scalar data types, which is determined by the needs
of a particular DB, the open-ended set of algebra operators is determined by the language
designer; this includes whether provision is made for the language user to define new
operators. Enclose and Disclose operators will be needed to support nesting.
The concept of algebra operators comprises monadic and dyadic categories, and
in each of these, at a yet lower level of abstraction, there are the concepts that define the
operators of that category.
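Closure, and the representation of truth values as the two zero-attribute relvalues, can be given a rough sketch (illustrative Python, not RAQUEL; relvalues are modelled as frozensets of attribute/value pairs, and the operator names are hypothetical):

```python
# Monadic Restrict: keeps the tuples satisfying a predicate; the
# result is again a relvalue, so expressions compose freely.
def restrict(rel, pred):
    return frozenset(t for t in rel if pred(dict(t)))

# Monadic Project: keeps only the named attributes of each tuple.
def project(rel, attrs):
    return frozenset(
        frozenset((a, v) for a, v in t if a in attrs) for t in rel
    )

# The two zero-attribute relvalues standing for true and false:
DEE = frozenset({frozenset()})   # one 0-tuple  -> true
DUM = frozenset()                # no tuples    -> false

def rel_eq(r, s):
    # A comparison also returns a relvalue, preserving closure.
    return DEE if r == s else DUM

r = frozenset({frozenset({("id", 1)}), frozenset({("id", 2)})})
big = restrict(r, lambda t: t["id"] > 1)
assert project(big, {"id"}) == frozenset({frozenset({("id", 2)})})
assert rel_eq(r, r) == DEE
```

Because every operator, including comparison, yields a relvalue, the sketch stays within the 1-sorted universe of footnote 27.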
Relational Assignments. At the very least, an assignment is needed to give a relvar a
new relvalue. For simplicity and ease of use, a number of assignments are desirable, for
example to insert, amend and delete tuples in relvars, and to retrieve relvalues from a DB.
^24 The truth data type, consisting of the values true and false, must always be available to the DBMS, even if it is never used for an attribute in a DB, since the DBMS must be able to evaluate truth-valued expressions pertaining to attribute values in order to execute algebra operators whose definitions involve such expressions.

^25 In principle, there can be operators that are niladic or take more than two operands. No such operators currently exist in RAQUEL. TTM has a triadic version of Divide.

^26 There are only 2 possible such relvalues: one has the 0-tuple and the other has no tuples. (There cannot be multiple 0-tuples, as they would be replicas of each other.) They represent the truth values true and false respectively. See pages 153-154 of [Date04Int] for further details.

^27 Otherwise one would need a 2-sorted universe, of relvalues and scalar values. This generalisation retains a 1-sorted universe, of relvalues only.
Thus RAQUEL provides a (potentially open-ended) set of such value assignments^28, and consequently value assignment is not a 'single instance concept' as it is in an application programming language^29. In this respect RAQUEL is more akin to SQL, which does have several kinds of statements corresponding to such assignments.

However page 193 of [Date04Int] affirms that algebra expressions can be used for a variety of purposes and lists some examples. While some relate to value assignments, others relate to constraints of various kinds, such as integrity constraints or access constraints. A generalisation of assignment to include integrity constraints was developed by Livingstone and Gharib - see [Livi95] and [Ghar97] - and successfully implemented in an APL interpreter. The APL integrity assignment assigned a set of values to a variable as its set of permissible values, i.e. its data type, rather than its value. The same concept, but further generalised to handle a whole range of constraints via a (potentially open-ended) set of integrity assignments, is applied in RAQUEL^30 so that the non-value assignment purposes given in [Date04Int] can be provided for in a simple but generally uniform way. For example, such assignments include one to generate a reltype and assign it to a relvar.
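The view that the specialised value assignments are shorthands reducible to a single underlying value assignment (footnote 28) can be sketched as follows (illustrative Python; the `Relvar` class and its method names are hypothetical, not RAQUEL syntax):

```python
class Relvar:
    """A named variable holding a relvalue (a frozenset of tuples)."""
    def __init__(self, name):
        self.name, self.value = name, frozenset()

    def assign(self, relvalue):
        # The one fundamental operation: replace the whole relvalue.
        self.value = frozenset(relvalue)

    # The specialised assignments below are shorthands: each reduces
    # to a single plain value assignment of a derived relvalue.
    def insert(self, delta):
        self.assign(self.value | frozenset(delta))   # R := R union delta

    def delete(self, pred):
        self.assign(t for t in self.value if not pred(dict(t)))

emp = Relvar("emp")
emp.insert({frozenset({("id", 1)})})
emp.insert({frozenset({("id", 2)})})
emp.delete(lambda t: t["id"] == 1)
assert emp.value == frozenset({frozenset({("id", 2)})})
```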
Like the open-ended set of algebra operators, the open-ended set of assignments is determined by the language designer, and includes whether provision is made for the definition of new assignments.

Since RAQUEL has two categories of assignments, those that make relvalue assignments to relvars, and those that make integrity assignments to relvars^31, the concept of relational assignments comprises two lower level concepts, one per category. In turn each of these comprises the concepts at a yet lower level of abstraction that define the assignments of that category.

^28 However they can be thought of as shorthands for more complex statements that utilise only a traditional value assignment.

^29 This contrasts with 'textbook relational algebra'. For example, the latest editions of two well established DB textbooks, [ElNa07] and [CoBe02], use value assignment in the course of a discussion of algebra operators, but do not explicitly discuss assignment, and tacitly assume algebra is of use only for retrievals.

^30 Some overviews of the relational model include 'integrity constraints on relvars' as a high level conceptual component of it. This was omitted above. It can now be seen that this component is provided in RAQUEL by assigning suitable relational algebra expressions to relvars as integrity constraints.

^31 Actually there is a third class, which binds relvars to their storage mechanisms, but this is irrelevant to the conceptual relational model.

2.5 The Simplicity of the Relational Conceptual Model

The relational model is now reviewed to demonstrate its simplicity and conceptual integrity.

The model is summarised graphically in figure 2.1. The figure is derived using the ideas put forward by Hsi in [HsiI05] - see the appendix in section 2.7 for a summary of his approach. Hsi states that a computing application has an 'ontology', which he defines as being "its theory of the real world" - page 4 of [HsiI05]. The ontology is formed from the 'concepts' that compose it. 'Ontological excavation' is used to identify the concepts, which are then modelled as a semantic network; this is similar to an ER model or UML
class diagram, except that attributes of entities are shown separately from the entities
themselves. Having got what is essentially a graph structure, various numerical measures
of the graph are taken in order to obtain a measure of how well the concepts integrate
together. This general idea is used here, except that concepts or „entities‟ are not
assumed to have attributes. Instead concepts at the same level of abstraction are linked
by edges in the graph; the more detailed concepts at a lower level of abstraction, which
together make up one concept at a higher level of abstraction, form a graph at the lower
level that is expressed as a single node in the graph at the higher level. A graph at a
lower level is portrayed in a „bubble‟ that forms a node at a higher level. There is no
limit to the number of levels of abstraction that are permitted. Figure 2.1. demonstrates
graphically the conceptual simplicity of the relational model.
Consider the model with respect to the five programming language criteria:
Parsimony. There are only four concepts at the highest level of abstraction, and each
of these is made up of a very small number of constituent concepts at the next
level of abstraction down, and so on.
Simplicity. From the highest level of abstraction downwards, the concepts are
simple, particularly for someone used to imperative programming languages.
Relations are logically a simple kind of data container, and the operators and
assignments follow on in a straightforward way from them. Assignments are
conceptually quite different from operators; so for clarity and to avoid confusion,
they are treated quite differently in the language. The phenomenon of 'psychological inhibition' would arise otherwise. "It is the similarity between the languages which causes the inhibition." "It might be better, when identity is not possible, to make the two more clearly dissimilar." - see [Wein71].
Generality. None of the concepts have any artificial constraints or limitations at any
level of abstraction. As opposed to the single value assignment of application programming languages, value assignments are generalised to provide a range of uniform ways of changing relvars' values; and a commensurate range of uniform assignments is provided for changing relvars' integrity constraints.
Orthogonality. There is complete orthogonality between all four concepts at the
highest level of abstraction. Any scalar data types can be used with any relvar.
Any operators can be used with any assignments to any relvar.
Uniformity. Reltypes are treated in a corresponding way to scalar types, both having
variables, literal values, operators and assignments. Uniformity is also applied in
achieving generalisations, as noted above.
Thus the relational conceptual model is consistent with the simplicity and conceptual
integrity proposed by Brooks.
As many of the previous references about the relational model have stated, the reason for
this achievement is that the following four design strategies have been employed in
creating the model:
1. The level of abstraction has been raised to be as high as possible. Only those
logical aspects that are germane to the handling of data in a DB are included.
2. The principle of 'Essentiality' is applied, to prune out all concepts that are not logically essential. If n different ways are used in a logical model to represent information, then n sets of operators and assignments, one set per way, are required in the model. The larger the value of n, the greater the complexity of the model. Yet if only m of the n ways are essential, i.e. m < n, only the functionality conferred by the m ways is attained. In the case of the relational model, m = n = 1: there is only one kind of data object, viz. the relation, and hence only one set of operators and one set of assignments are needed to handle relations. Note that essentiality is not the same as a high level of abstraction. One could choose to have additional concepts at a high level of abstraction.
3. The relational model is purely a logical model, with its implementation being entirely excluded from it. One might argue that this follows from having the highest level of abstraction possible, but in fact it doesn't always follow, and it is very important in practice to ensure that there are no implementation aspects in the model. As an extra benefit, it also allows a greater variety of implementation options to be made available.
4. The relational model is a formal, mathematical model. Although not heretofore explicitly discussed, Codd based the relational model on mathematical set theory and its application to relations. As with the application of formal methods to the development of application programs, the advantage of this is that the mathematics can be used to specify much more precisely what the conceptual model actually is. It can be mathematically manipulated, proved and investigated, so that eventually a final version of the model can be proved relatively bug-free. (Nothing is ever perfect!)
Although the design strategies are different, it can be seen that they are mutually
supportive or at least related to a degree. For example, using a mathematical foundation
is conducive to both excluding implementation aspects and to raising the level of
abstraction to the highest possible level.
2.6 Adding Different Kinds of Container Type to the Relational Model
The thesis aim is to add a full set of different kinds of container type to the relational model. This affects the first of the four high level conceptual abstractions of the relational model. However, since in principle each additional container type needs its own operators and assignments, it also affects the third and fourth conceptual abstractions. Only the concept of an open-ended set of scalar data types is unaffected, because this must apply equally well to all container types in order to provide conceptual integrity over the full set of kinds of container type.
In order to achieve as much simplicity and conceptual integrity as possible, the design
strategy of excluding implementation concerns should be applied. If any kind of
container type includes its physical implementation in its definition, then its level of
abstraction should be raised to eliminate the implementation and yield a simpler concept
of data container cum operators and assignments. This maintains consistency with the
relational model and allows for the possibility of providing multiple implementation
options. If possible, the concept should be so derived as to achieve as much conceptual simplicity, generality and uniformity as possible when combined with the relational model.
To achieve the maximum of conceptual integrity, it is important to aim for parsimony, by
minimising the number of kinds of container type actually added. This cannot be done
by ignoring those that might be infrequently used, because the aim is to provide the
functionality of the full set. Of course, the use of defaults is acceptable, as this does not
actually exclude anything from the logical model, it merely provides a form of shorthand
for commonly occurring statements or parts of statements. However the design principle
of essentiality is relevant here. If two or more kinds of container type are variations on
the same theme or overlap in concept, then if possible it is desirable to derive one
essential kind of container type that includes the two or more as special cases, and replace
them with the single essential kind. This needs to be done in such a way that it also
achieves conceptual simplicity, generality and uniformity.
It can also be useful to raise the level of abstraction, as this can help in viewing related
kinds of container type as special cases of one underlying container type. This is
analogous to the approach used in physics to unify fundamental forces. At one time,
magnetism and electricity were considered to be two entirely different forces. Later it
was realised that they are two special cases of one force called electromagnetism^32.
Likewise physicists are currently trying to unify the four fundamental forces of
electromagnetism, the strong nuclear force, the weak nuclear force, and gravity. It is
suggested that this will come about by viewing them as four special cases of a force
viewed in 10 dimensions, and it is taken for granted that elegance will be integral to the
result^33. Note that the elimination of implementation concerns may increase the
possibility of conceptually related kinds of container type arising, and hence increase the
scope for applying essentiality.
New kinds of container type will need to be mathematically specified, not only for
consistency with the relational model, but also to ensure the concepts are well-defined.
Finally all new kinds of container type must be able to be used in an orthogonal fashion
with the existing relational kind of container type. This is not just because it makes it
easier to add them, as there are a minimal number of points where integration of old and new
must be achieved, but also because it provides greater functionality. Mathematical
specifications of the new kinds of container type should help ensure that orthogonality is
attained.
When the thesis aim is achieved, the extended relational model so produced will be seen
to have markedly greater conceptual simplicity compared to an object-relational SQL
with all the currently proposed kinds of container type incorporated into it.
^32 For example, see [Gamo62] for a brief overview of this development, from Oersted's discovery to Faraday's experiments and Maxwell's equations.

^33 For example, see [Kaku99] for an overview of this.
2.7 Appendix
Analysing the Conceptual Integrity of Computing Applications
Through Ontological Excavations and Analysis
A computing application has an ontology, which is its theory of that part of the real world relevant to the application. The concepts that form the ontology determine and structure the application's features.
Usefulness is the extent to which an application succeeds in assisting users to achieve their goals, relative to the amount of effort required to apply the features.

Usability is the amount of effort required to use a feature to achieve a goal.

A useful application with poor usability enables users to achieve their goals, but with great difficulty. An application with little usefulness may be very usable but doesn't help users achieve their goals.
The features of an application must be determined by and conform to its ontology, strictly speaking the ontology as perceived by the application's users.
The degree to which the application ontology matches the users‟ ontology determines its
conceptual fitness. Thus conceptual fitness determines usefulness.
An ontology with conceptual integrity will have conceptual coherence. Conceptual
coherence measures the degree to which an application‟s concepts are tightly related.
An application has a central core of concepts that are essential to its ontology, and these can be identified by analysis. Inessential concepts either exist to support core concepts or are peripheral; peripheral concepts reduce conceptual coherence.
Ontological excavation is used to identify the concepts in an application and model them
as an ontology expressed as a semantic network. (This is similar to an ER model or UML
class diagram, but it is preferred to show attributes of entities separately from the entities,
to improve clarity).
Ontological analysis is used to measure an ontology's conceptual coherence. This analysis involves getting measures of distance between different concepts in the ontology/semantic network (= number of edges between nodes). There are a number of measures that can be calculated for each node in the (graphical) ontology. Betweenness Centrality best identifies core concepts. A geodesic is the shortest route between 2 nodes. Conceptual coherence is derived from the average length of all geodesics between pairs of reachable nodes; as this results in more incoherence having greater values, the inverse of this average is actually used to measure coherence, multiplied by 100.
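The coherence calculation just described, the inverse of the average geodesic length scaled by 100, can be sketched as follows (a minimal Python illustration; the example graphs and function names are hypothetical, not taken from [HsiI05]):

```python
from collections import deque

# adj: adjacency lists of an undirected concept graph (the semantic network).
def geodesic_lengths(adj):
    # BFS from every node yields the shortest path (geodesic) length
    # from that node to every other reachable node.
    lengths = []
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        lengths += [d for node, d in dist.items() if node != src]
    return lengths

def coherence(adj):
    # Inverse of the mean geodesic length, multiplied by 100, so that
    # a more tightly-knit ontology scores higher.
    g = geodesic_lengths(adj)
    return 100 * len(g) / sum(g)

path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}                # a 4-node chain
clique = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}
assert coherence(clique) == 100.0       # every pair adjacent: mean geodesic 1
assert coherence(clique) > coherence(path)
```

A fully connected concept graph scores the maximum of 100, while the chain scores lower because some concepts are only reachable through intermediaries.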
A use case silhouetting method is used to measure the amount of ontological coverage of
typical uses of the application. Each use case uses certain concepts in the ontology. For
a set of cases, a count of each of the concepts used is made, possibly weighted by the
concept‟s importance. This shows how well the application‟s ontology matches the
user‟s ontology, and hence to what extent the user finds the application useful.
Conceptual coherence is a first approximation to conceptual integrity (but lacks the functionality aspect of Brooks' ratio).