Chapter 2

Conceptual Simplicity

This chapter reviews the practical need for conceptual simplicity in computer software, what is meant by it, how one can achieve it, how the relational model already provides it, and how in principle different kinds of container type might be added to the relational model.

2.1 The Practical Importance of Simplicity

As computer hardware has become more powerful and provides ever more resources (e.g. more memory, better screen displays), software applications, whether for the PC or to be run on powerful servers, have grown ever bigger and more complicated to take advantage of these resources. For example, "By 1992, the word processing program Microsoft Word had 311 commands. ... Five years later in 1997, that same word processing program, Microsoft Word, had 1,033 commands". See pages 80 and 81 of [Norm99]. Schneier, on pages 357 and 358 of [Schn00], has tables which show that the number of lines of source code in the Microsoft Windows operating system rose from 3 million (Windows 3.1 in 1992) to an estimated 35-60 million (Windows 2000 in 2000), while the number of system calls in operating systems generally rose from 33 (in Unix 1st edition in 1971) to 3,433 (in Windows NT 4.0 SP3 in 1999). Such escalation of size and complexity in commercial software generally continues to this day.

Does this continued escalation matter? Up to the 1980s and 1990s, it made sense to use the ever-increasing resources provided by newer, cheaper computer hardware to develop the hitherto more limited software applications into more effective versions for the user, regardless of the extra software size and complexity needed to achieve this. However, is this still true in the 21st century? There is now increasing general concern about 'bloated software', 'bloatware' and 'code bloat', so much so that these terms [1] have become established and can be referenced, say on Wikipedia [Bloa06].
In general these terms refer to programs that appear to use more computer hardware resources than is commensurate with the benefits received by the program user. Related concern exists among commercial and business users of computer applications. For example, consider the following quotes from commercial IT managers reported in the computing press:

"If projects or programmes are overly complex, there is a good chance they are simply wrong." (Brian Jones, ex-Global CIO Allied Domecq, then of IBM). [Jone06].

"Complexity leads to design problems & greater risk of error." (Martyn Thomas, ex-Praxis MD, Formal Methods specialist). [Thom06].

[1] Other terms are also used to denote this phenomenon, or specific aspects of it, such as 'creeping featurism', 'creeping featuritis' and 'second system effects'.

Comparable views from computing academics have been expressed since 1995. According to Niklaus Wirth, "Software's girth has surpassed its functionality, largely because hardware advances make this possible. ... software can be developed with a fraction of the memory capacity & processor power usually required, without sacrificing flexibility, functionality or user convenience." [Wirth95].

Software complexity has been singled out as a particular problem with respect to achieving effective computer security. Ferguson and Schneier state that "There are no complex systems that are secure. Complexity is the worst enemy of security, and it almost always comes in the form of features or options." - see page 5 of [NFBS03]. The reason they give for this is as follows. Typically a computer application has many different options. Together they create a huge number of different possibilities. To ensure that the application works correctly, all these possibilities should be tested.
However in practice the number is so huge that it is only practicable to test the most commonly occurring combinations of options, thereby leaving many possible unfound bugs, which in turn lead to security flaws. There are standard ways of coping with such situations, such as the use of modular software and the application of orthogonality in the design, but "Unfortunately, we see very little of it in real-world systems." - page 5 again of [NFBS03].

Nevertheless, one might assume that complexity, at least in the shape of an ever larger variety of options and features, is a good thing for the user. However there are increasing concerns not only that this assumption is false but that a large range of options is counter-productive. Donald Norman puts it very succinctly: "The result is technology-driven, feature-laden products. Each new release touts a new set of features. ... Seldom are the customer's real needs addressed, ... The notion that a product with fewer features might be more usable, more functional, and superior for the needs of the customer is considered blasphemous." See page 25 of [Norm99], a book which promotes the 'Information Appliance', the antithesis of the current kind of personal computer application. The reason for the inadequacy of current applications is that "Design by feature-lists is fundamentally wrong. Lists of features miss the interconnected nature of tasks." - see page 207 of [Norm99].

Norman in [Norm99] suggests how the current situation has arisen, using the work of Moore [Moor91] and Christensen [Chri97] as his basis. A new technology starts out delivering less than customers really require, although customers still buy it since it satisfies needs unsatisfiable by any other means. Higher performance versions of the technology are developed over time to meet the unsatisfied demands. Eventually improvements to the technology reach the point where customers' needs are substantially satisfied.
From then on, while the initial customers may still appreciate ever more advanced technology, new customers, who will eventually form the bulk of a mass market, prefer convenience, ease of use, reliability, and low cost; they are not interested in the technology per se, they want solutions that simplify their lives [2]. Norman argues that this product development cycle applies to computer applications, which now need to evolve towards simple mass market solutions and away from high tech products. See chapter 3 and ff. of [Norm99] [3].

Another suggested cause of overly complex software arises from the first version of a software product invariably being a prototype. As Brooks points out in [BrPr82], "The management question, therefore, is not whether to build a pilot system and throw it away. You will do that." [4] Thus the product needs enhancement to make it useable. In order to enhance it as quickly as possible, rather than learning from the prototype how to revise or develop a new underlying architecture and design, accretions and extra complexity are inserted into the prototype itself, and this results in a more complex product. Jamie Zawinski, who helped develop the Mozilla [Netscape 1.0] browser, stated that marketing demands left no time to refine the browser into a smaller, more elegant product that delivered the same functionality in a simpler way - see [ZaBl06]. The marketing necessity for speed is not unusual. Being first to market, or at least not too late, is often very important for the commercial success of many kinds of software product [5]. Lou Gerstner, IBM's chairman, summed this up with: "All large companies know today that speed and being early to market are often more important than being right." - see [Gers98].

However it arises, software complexity has clearly now become a problem. Yet software does not have to imply complexity: "We have reduced the cost of running our IT operation by 40% ... Also the quality has improved by two or three times.
We have done that by simplifying and standardising how we run the [IT] technology." Furthermore this has had a beneficial effect on the business processes supported by the IT, by encouraging further business process simplification. " 'IT has been an enormous leader in simplifying BA's business,' said Corby" (British Airways' CIO). "... if you are going to automate, you have to understand the process and simplify it. Complexity will kill you and slow you down." See [Corb06].

The next chapter reviews object-relational DBs to demonstrate the extent of their complexity with respect to different kinds of container types; but first the remainder of this chapter reviews the nature of the simplicity aimed for and the means of achieving it.

[2] The development of the motor car illustrates the changes in product and market. The early versions were difficult to drive, with complex controls to manage the ignition timing, fuel supply to the engine, etc, so that they only appealed to a small market of car enthusiasts. Modern cars have automated these problems out of existence, so that they are now comparatively easy to drive and appeal to a mass market.
[3] Norman emphasises applications for PCs, which ought to evolve into 'Information Appliances', where customers no longer need to continually update their software to obtain ever larger feature sets; not all these aspects are germane to the thesis.
[4] 'Pilot' is the term Brooks uses for 'prototype' here. It derives from the term used in some other branches of engineering for a prototype, e.g. a 'pilot plant' in chemical engineering.
[5] For example, see pages 116-118 of [Norm99] on 'infrastructure products'.
2.2 The Simplicity Required

The crucial importance of a mental model to a good software product is underlined by the fact that among the 'Universal Principles of Design' for general product design listed in [LiHB03] is the 'Mental Model' principle: "People understand and interact with systems and environments based on mental representations developed from experience." In this context, 'mental model' and 'conceptual model' are synonyms. The latter term is the one normally used in this thesis.

In 'The Design of Everyday Things' [Norm98], page 53, Norman includes a good conceptual model among his four principles for good product design: "Consistency in the presentation of operations and results and a coherent, consistent system image." 'System Image' is the term used for the conceptual model actually presented to the user by the product when in use [6]. For specifically software products, in [Norm99] pages 175-179, Norman emphasises the importance of the conceptual model in designing them:

"The use of a good conceptual model is ... fundamental to good design ..."

"Good designers present explicit conceptual models for users. If they don't, users create their own mental models, which are apt to be defective and lead them astray."

"Start with a simple, cohesive conceptual model and use it to direct all aspects of the design. The details of implementation flow naturally from the conceptual model."

"The model has to be coherent, understandable, and sufficiently cohesive that it covers the major operations of the system. ... It is successful if the users can then use the system in ways the developers never imagined. Above all, the user should be able to discover and learn how to use it with the minimum of effort. In the ideal case, no manual would be required."

Thus the kind of simplicity required of a software product is that of its conceptual model.
It does not necessarily follow that the software implementation is simpler, or that the software architecture, the data structures and/or algorithms will be simpler. A simple conceptual model may well lead to a simple implementation, and where that happens, it is all to the good; the security concerns expressed above involve both aspects, so it is preferable if both are simplified. Yet it is not necessarily so. A simpler conceptual model for the user may require the transfer of complexity to the implementation, so to speak, in that the implementation must become more complex in order to handle complications that the user formerly handled. A DBMS that presents the user with a relational conceptual model presents a simpler model than a DBMS that presents a network conceptual model, since those aspects of the physical implementation are removed from the user's conceptual model. Some of those aspects transferred are automated - e.g. query optimisation - while others are transferred to an interface for the Database Administrator (= DBA) who handles them - e.g. selecting a file type for the storage of a relation's data.

[6] In fact on pages 189-190 of [Norm98], Norman refers to three conceptual models: "the design model, the user's model, and the system image. The design model is the conceptualisation that the designer has in mind. The user's model is what the user develops to explain the operation of the system. Ideally, the user's model and the design model are equivalent. However, the user and designer communicate only through the system itself ... Thus the system image is critical."

The desire for simplicity has been ubiquitous for centuries, as the following quotes from different eras and subject areas illustrate:

"Everything should be made as simple as possible, but not simpler." Albert Einstein (physicist) - [Eins79].

"Entities should not be multiplied without necessity." Ockham's Razor [7] - page 142 of [LiHB03].
"The aim of science is always to reduce complexity to simplicity." William James (psychologist) - [Jame90].

"Throughout the history of engineering a principle seems to emerge: great engineering is simple engineering." (emphasis in the original). James Martin (computing consultant and writer) - [Mart75].

"The ability to simplify means to eliminate the unnecessary so that the necessary may speak." Hans Hoffman (painter) - [Hoff67].

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." [8] Antoine de Saint-Exupéry (pilot and writer) - [StEx39].

It is noticeable that what these sayings have in common is the removal of that which is unnecessary, and the recognition that there is an irreducible minimum. This irreducible minimum should not be simplistic, i.e. provide limited functionality and consequently require the user to work hard to achieve anything complicated with the system. This would be a minimal system rather than a simple one [9]. Hence the importance of Brooks' point that it is the ratio of functionality to complexity that matters, because the aim is to make the application easy to use.

Given that a multiplicity of options and features is a common problem, it is also important that simplification is not treated merely as the removal of some of these options or features. Typically the Pareto Effect applies to an application - 80% of users use only 20% of the options [10]. Therefore one could 'simplify' a product by removing 80% of the

[7] Also written as 'Occam's Razor' (by application of the razor?), and known by other names, such as the 'Law of Parsimony'. Commonly attributed to William of Ockham, a 14th century English Franciscan friar and logician, who purportedly used it frequently, although it does not appear in any of his surviving writings.
[8] Translated from the French: "Il semble que la perfection soit atteinte non quand il n'y a plus rien à ajouter, mais quand il n'y a plus rien à retrancher".
[9] Note that the converse error is also sometimes made, that a powerful system must of necessity be complex.
[10] Also known as the Pareto Distribution, both being the result of the Pareto Principle (also known as the Pareto Rule). Note that the precise percentages are not important in this argument.

options with little loss to most users. While this may sometimes be effective, suppose different users want different '20% sets' of options. A 'lite' version of the product with only 20% of the options would be of little use to most users as it wouldn't have the options that they need. Zawinski noted that Mozilla has a large number of options because different users use different options, and that a browser with few options would be inadequate for most users [11]. In this thesis, the aim is not to lose any functionality; so removing options and features, and thereby leaving a diminished total functionality, would not be acceptable.

Finally one cannot assume that a simple conceptual model will be obvious and/or intuitive. As Donald Norman points out on page 182 of [Norm99], "Good design is not necessarily self-explanatory: some tasks are inherently too complex. The notion that good design can render anything 'intuitively obvious' is false. In fact, intuition is simply a state of subconscious knowledge that comes about after extended practice and experience." Furthermore "Difficult tasks will always have to be taught. The trick is to ensure that the technology is not part of the difficulty. Devices for complex tasks must of themselves be complex, but they can still be easy to use if the devices are properly designed so that they fit naturally into the task. When this is done, learn the task and you know the device." In other words, the conceptual model must reflect the innate nature of the application situation, so that in dealing with the software product, the user is dealing directly with their problem, and the software product itself becomes 'invisible'.
In a similar vein, and with regard to the conceptual model underpinning programming languages, Petre in chapter 2.1 of [HGSG90], in a section headed "Obstructions to coding: how programming languages get in the way", notes that "Although a programming language is unlikely to contribute directly to a solution, it may obstruct solution, even contributing to errors or oversights." So a simple conceptual model could be said to be one that maps simply and directly onto the problem domain, and hence is 'invisible' to the user. Hence the conceptual simplification of different kinds of container type is just the kind of software quality improvement that one should aim for.

[11] There are various Internet sites which report Zawinski's views; see for example [Zawi05] or via 'bloatware' on [Bloa06].

2.3 Simplification Principles from Programming Languages

The idea of conceptual integrity has become well established, to the extent that it has been applied to other topics which involve a collection of concepts. Flater in [Flat03] applies the idea to the integration of data schemas arising from different information systems that must work together: "Compromised conceptual integrity results in 'semantic faults' which are commonly blamed for latent integration bugs." To integrate the schemas, he uses a logical notation that incorporates belief and time. In a different kind of example, [ECLKK02] considers strategies in course curriculum development to ensure the conceptual integrity of the different aspects of the resulting course.

Nevertheless it is useful to see if the idea of conceptual integrity can be amplified and/or related to the criteria, principles and guidelines that have been put forward in the literature to guide the design of programming languages, since the ultimate goal is a simplified conceptual model for a relational DB programming language.
Although it may seem a big jump from software products in general to programming languages in general, note that Bentley suggests in [Bent88] that "a language is any mechanism to express intent, and the input to many programs can be viewed profitably as statements in a language". Note also that Brooks in [BrCI82] does not differentiate between programming language compilers, operating systems and end-user applications. Indeed he refers to 'computer systems' in general, which is clearly meant to correspond to any kind of 'software product'.

References [Bent88], [Horo84] and [MacL87] have been used to provide a technical source, and references [Wein71] and [HGSG90] to provide an input from the psychology of programming. Together the references provide a number of criteria by which to judge languages, covering not only the semantics of programming languages but also their syntax, implementation, environment and application area(s). Since only the conceptual model of a relational DB programming language is of interest here, only the criteria pertinent to the semantics have been abstracted [12]. The remainder are ignored. Although each reference has its own terminology and approach, the criteria apposite to semantics that a good language will meet can be summarised as:

Parsimony. This means having as small a number of concepts as possible, or alternatively removing as many unnecessary concepts as possible. In practice some judgement is needed to decide what a concept is, because a group of detailed concepts can be 'chunked together' to form one concept at a higher level of abstraction; e.g.
is a relation one concept, or does it comprise a set of concepts such as "a relational value is a set of tuple values, a tuple value is a set of attribute values, all the values in one attribute are of the same scalar type, etc." [13]? "In psychology, this information processing ability of human beings that combines several small units into one large unit, which is just as easy to handle as its individual parts, is called chunking." - see pages 224 and ff. of [Wein71], where as long ago as 1971, Weinberg recognised the possibility of exploiting this psychological ability to achieve what he called 'compactness' in programming. Therefore one needs some consistency in the levels of abstraction used when considering the concepts comprising a conceptual model.

[12] The principles in fact apply more widely, especially to syntax and environment, but that is ignored here.
[13] Note that this example does not constitute a complete specification of the 'relation' concept. Its purpose is to illustrate 'chunking'.

Simplicity. This appears in many guises in the literature. Here it is taken to mean that each of the concepts is as semantically simple and straightforward as possible. Terseness is sometimes used to describe it, because something that is not simple can rarely be described tersely. Sometimes the term 'straightforwardness' is used, to indicate that involuted and unexpected concepts should not occur, even if they can be regarded as simple in themselves - see [BrCI82].

Elegance is often associated with simplicity. Petre in chapter 2.1 of [HGSG90] reports that elegance of expression is considered important by expert programmers; e.g. a quote given from Hoare is "I have regarded it as the highest goal of programming language design to enable good ideas to be elegantly expressed." Again "Experts appreciate an uncluttered notation. ...
'ugliness' matters." [14] Physicists and some mathematicians have long suggested that true physical theories and good mathematical proofs are always elegant. As long ago as 1981, Dijkstra was quoted in [Dijk81] as follows: "elegance is a strong factor in whether a mathematical proof is understandable" and "this principle can be applied to programming notation". Dijkstra also found that "there was so much agreement (among mathematicians) about what constituted elegance. It turned out that the major characteristics were brevity and what I would call soberness - an economy in the use of nomenclature". So elegance is not as dependent on individual taste as might be imagined. Clearly chunking is relevant again. Related concepts chunked together will provide simplicity at a higher level of abstraction; unrelated concepts chunked together yield complication.

Generality. "The criteria of generality argues that related concepts should be unified into a single framework." - page 39 of [Horo84]. There should be no exceptions to a general rule, with all applications of the concept being an instance of the general rule. One could consider this as chunking together a set of related detailed concepts to ensure that they fit together consistently to form a single concept at a higher level of abstraction. From the opposite point of view, there should be no artificial constraints or limitations, e.g. no minimum or maximum constraint on the number of attributes or tuples in a relation. [Morr81] expresses this well: "Most languages are too big and intellectually unmanageable. The problem arises in part because the language is too restrictive; the number of rules needed to define a language increases when a general rule has additional rules attached to constrain its use in certain cases. (Ironically, these additional rules usually make the language less powerful)."

Orthogonality.
This states that every concept should be independent of all the other concepts, and that there is a general and consistent way of combining them together. Thus concepts can be combined with each other in any arbitrary way. "In a truly orthogonal language, a small set of basic facilities may be combined without arbitrary restrictions according to systematic rules." - see page 105 of [HGSG90].

[14] Tractinsky successfully repeated a Japanese experiment that demonstrated that an aesthetically elegant bank ATM was easier to use than an ugly one - see [Trac97] and [TrKI00]. See also [Norm04] on this subject.

Uniformity. This is also known as consistency or regularity. It means that similar things should be done in similar ways and have similar meanings. "The same things should be done in the same way wherever they occur" - page 219 of [Wein71]. Weinberg in [Wein71] has several useful contributions on uniformity. Lack of uniformity in some parts of a language can create the fear of such a lack in other parts of the language: "The more "covert categories" - things that you cannot do or say - there are, the more one expects other such covert categories in the language. Even if the restrictions are in another part of the language, they may affect the actual usage of a part without such restrictions." - page 220 of [Wein71]. Uniformity is also conducive to naturalness. "One way of achieving naturalness ... [is] through uniformity, but uniformity only applies to those programmers who have some experience with the language." - page 232 of [Wein71]. Such experience relates to the intuitiveness described by Norman; see the earlier quote from him. Uniformity could be regarded as a meta principle to the extent that it is a policy of treating all concepts in the same way. In this light, it also appears as a general product design principle in [LiHB03]: "The usability of a system is improved when similar parts are expressed in similar ways."

The five criteria are not unrelated.
For example, keeping related concepts together and unrelated ones separate at a higher level of abstraction facilitates both simplicity and generality. Parsimony and simplicity are related through elegance. Morrison in [Morr81] points out: "Power through simplicity, simplicity through generality, should be the guiding principle".

The criteria should be applied at a consistent level of abstraction. For example, at the conceptual level of a relation as a whole (as opposed to the constituent concepts that make up a relation, such as tuples and attributes), consideration of a relational algebra operator means considering how it applies to an entire relation. At the next level of abstraction down, one can (say) relate attributes that are parameters of the operator to attributes that are part of the relation.

The important thing is that these five criteria are clearly consistent with Brooks' concern for ease of use and the criterion of conceptual integrity. Uniformity is directly specified by Brooks in his definition of conceptual integrity: "Every part must reflect the same philosophies and the same balancing of desiderata." Parsimony and simplicity directly make a programming language easier to use; generality increases its functionality. Orthogonality achieves simplicity by eliminating the need for special rules that prohibit or constrain combinations of concepts; orthogonality also achieves greater functionality by allowing all the concepts to be combined together without let or hindrance. The combinatorial ability not only provides an opportunity for creative problem solving but also provides one means (generality being the other) whereby a simple language can express great functionality. As van Wijngaarden put it in [Wijn71]: "Orthogonal design maximises expressive power while avoiding deleterious superfluities." Thus these five criteria can be used to amplify the criterion of conceptual integrity.
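The orthogonality criterion can be made concrete with a small sketch. The following Python fragment is illustrative only (the relation representation and the operator names `restrict` and `project` are assumptions of this sketch, not part of any cited language): relations are modelled as mathematical sets of tuples, and because every operator both consumes and produces such a relation, the operators compose in any order without special rules.

```python
# A relvalue is modelled as a frozenset of tuples, each tuple being a
# frozenset of (attribute, value) pairs - a mathematical set of
# mathematical sets, so ordering and duplicates vanish automatically.

def relation(*rows):
    """Build a relvalue from dicts mapping attribute -> value."""
    return frozenset(frozenset(r.items()) for r in rows)

def restrict(rel, pred):
    """Keep only the tuples satisfying pred (relational selection)."""
    return frozenset(t for t in rel if pred(dict(t)))

def project(rel, attrs):
    """Keep only the named attributes of each tuple."""
    return frozenset(
        frozenset((a, v) for a, v in t if a in attrs) for t in rel
    )

emp = relation(
    {"name": "Ann", "dept": "IT", "salary": 40},
    {"name": "Bob", "dept": "HR", "salary": 30},
    {"name": "Cyd", "dept": "IT", "salary": 35},
)

# Orthogonality: any operator applies to any relation, including the
# result of another operator, so both compositions below are legal and,
# after a final projection, equivalent.
a = project(restrict(emp, lambda t: t["dept"] == "IT"), {"name"})
b = restrict(project(emp, {"name", "dept"}), lambda t: t["dept"] == "IT")
print(a)
print(project(b, {"name"}) == a)
```

The closure property (every result is again a relation) is what removes the need for the "additional rules" that Morrison criticises: no combination of these operators needs a special case.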
2.4 The Relational Conceptual Model

It is assumed that the reader is familiar with relational databases in general. Nevertheless the key concepts of the relational model are now summarised so that thereafter it can be reviewed to ascertain its simplicity and conceptual integrity. The relational model is taken to be that which has evolved from Codd's original publicly available paper [Codd70]. The model has been extended, refined, and its logical consequences developed since its inception in 1969, particularly by Codd and Date, and more recently by Date and Darwen. Its focus is a mathematical model of a relational database that a user can interact with and manipulate logically, i.e. it is a conceptual model. The justification for the relational model is that it is of great practical value. Compared to SQL, inconsistencies have been removed, its constructs are orthogonal, and there are no ad hoc limitations. The model is much simpler yet more powerful than its SQL counterpart.

Its evolution is noted in useful papers of Codd's such as [Codd71], [Codd72Co], [Codd72Nr], [CoDa74] which introduced the concept of 'essentiality', [Codd74], [Codd81] and [Codd88]. Stonebraker in [Ston94] suggested four evolutionary versions: Codd's original paper [Codd70]; Codd's 1981 Turing Award paper [Codd82]; Codd's brief summary in [Codd85Re] and [Codd85Ru]; and Codd's book [Codd90]. [Date01] gives a concise review and summary of the evolution, [Date05] gives a more comprehensive description, summarised in its chapter 8, and [Date06Dic] is a dictionary of the terminology and concepts of the relational model. Further developments are recorded in C. J. Date's series of 'Relational Database Writings' - [Date86], [DaWa90], [DaDa92], [Date95Wr] and [DaDM97]. They culminate in Date and Darwen's 'Third Manifesto' proposals for relational databases - [DaDa98], [DaDa00] and [DaDa07].
Codd showed in [Codd90] how the relational model could be used to support a sort of 'formalised E-R model', but this is eschewed here in favour of Merrett's approach - [MeRe84] and particularly [MeMc84] - which views the relational model as a simple formalism that can be applied to any suitable semantic situation. 'The Third Manifesto' (= TTM) proposals also take the 'simpler' view; for this reason, TTM also excludes 'nulls' so that 2-valued logic may be retained. [McGo94] and [Pasc00], by McGoveran and Pascal respectively, present the practical benefits of the simpler model [15].

Later in the thesis, a specific syntax is needed to express the concepts of the model. The RAQUEL notation will be used for this, although this does not preclude another syntactic notation being used instead to describe the same relational model [16]. However RAQUEL does have certain features that contribute to a simpler construction of the relational model, as will be seen later.

[15] For example, [Pasc00] applies this relational logic to solve effectively such practical problems as duplicate tuples, entity supertypes and subtypes, and data hierarchies.
[16] Indeed the syntax and semantics of RAQUEL itself are kept separate, so that its current syntax, composed of traditional linear text, could be supplemented by an alternative 2- or 3-dimensional graphic version. Furthermore the relational model expressed by RAQUEL is intended eventually to be identical

At the highest level of abstraction, the relational model comprises four concepts:
1. A relation as a container of data;
2. An open-ended set of scalar data types;
3. An open-ended set of relational algebra operators [17];
4. An open-ended set of relational assignments.

These four concepts are now examined in more detail. This is done by considering how they are made up of concepts at a lower level of conceptual abstraction.
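As a concrete illustration of the four top-level concepts, the following Python sketch (an assumption of this example, not RAQUEL or Tutorial D syntax) models a relation as a container, a declared scalar type per attribute, one algebra operator, and relational assignment to a relvar.

```python
# Illustrative sketch of the four top-level concepts of the relational model.
# A relvalue is a frozenset of tuples; a heading maps attribute names to
# scalar types.

heading = {"id": int, "name": str}          # 2. scalar data types

def relvalue(heading, rows):
    """1. A relation as a container of data, type-checked per attribute."""
    for row in rows:
        assert set(row) == set(heading)
        for attr, value in row.items():
            assert isinstance(value, heading[attr])
    return frozenset(frozenset(r.items()) for r in rows)

def union(r1, r2):
    """3. One relational algebra operator: set union of same-heading relations."""
    return r1 | r2

# 4. Relational assignment: a relvar is a named variable holding a
# relvalue that may change over time.
emp = relvalue(heading, [{"id": 1, "name": "Ann"}])
emp = union(emp, relvalue(heading, [{"id": 2, "name": "Bob"}]))
print(len(emp))   # the relvar now holds a relvalue of cardinality 2
```

The open-endedness of concepts 2-4 corresponds, in this sketch, to being free to add further headings, operators and assignments without disturbing the relation concept itself.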
It will be seen that the lower level conceptual abstractions are in turn made up of yet lower level conceptual abstractions, and so on. A Relation. The concept of a relation consists of two related concepts at a lower level of abstraction, a relational value (henceforth abbreviated to „relvalue‟) and a relational variable (henceforth abbreviated to „relvar‟) 18 : A relvalue is a container of scalar values. It consists of a mathematical set of tuples, each of which consists of a mathematical set19 of attribute values. The tuples constituting a relvalue all have the same set of named attributes, with every attribute having a declared data type. Each tuple contains a value of the specified type for each attribute20. Subject to these constraints, a relvalue may have any sized cardinality and degree. If an attribute‟s values are themselves relvalues, then from the viewpoint of the relvalue containing the attribute, the attribute relvalues are perceived as scalars, i.e. each relvalue has been „enclosed‟ to become a scalar; it will need to be „disclosed‟21 to reveal its structure and its own attribute values 22. Such nesting of relvalues within relvalues can be continued ad infinitum23. 17 18 19 20 21 22 to that expressed by the conceptual language D; currently RAQUEL expresses only a subset of D. Date and Darwen use the language Tutorial D in their publications to express the concepts of the language D. It would be reasonable to use relational calculus instead of relational algebra. Nevertheless algebra is arbitrarily chosen. It is considered easier to derive and explain complex manipulations of relations via algebra, because algebra expressions lend themselves easily to being built up piece-meal. The term „relation‟ is used henceforth for terseness to refer to what is permitted by the situation concerned to be either a relvalue or a relvar. 
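The definition of a relvalue above (a mathematical set of tuples over named, typed attributes, with no ordering and no duplicates) can be illustrated with a small sketch. This is a hypothetical Python illustration, not RAQUEL syntax; the function and attribute names are invented for the example.

```python
# Hypothetical sketch of a relvalue: an immutable set of tuples over a
# fixed heading of named, typed attributes. Not RAQUEL syntax.

def make_relvalue(heading, rows):
    """heading: dict mapping attribute name -> Python type.
    rows: iterable of dicts mapping attribute name -> value."""
    tuples = set()
    for row in rows:
        if set(row) != set(heading):
            raise ValueError("tuple attributes must match the heading")
        for name, value in row.items():
            if not isinstance(value, heading[name]):
                raise TypeError(f"{name} must have type {heading[name].__name__}")
        # a mathematical set: unordered, and duplicates collapse automatically
        tuples.add(frozenset(row.items()))
    return frozenset(tuples)

heading = {"emp_id": int, "name": str}
rv = make_relvalue(heading, [
    {"emp_id": 1, "name": "Ada"},
    {"emp_id": 2, "name": "Alan"},
    {"emp_id": 1, "name": "Ada"},   # duplicate tuple: collapses, set semantics
])
assert len(rv) == 2                  # cardinality 2, degree 2
```

Because the representation is a set, the duplicate tuple collapses automatically and the empty relvalue is simply the empty set, matching the set properties noted in the footnotes.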
A relvar is a named variable whose value is a relvalue that may change over time. As each tuple is unique within a relvalue, there is at least one candidate key in a relvar. If no such key is specified, all the attributes together must be treated as the one and only candidate key. Typically one or more subsets of the attributes are specified as candidate keys, in order to better represent the real-world situation in question. Relvars are either real relvars (commonly referred to as 'base relations' in the literature), which are abstractions of stored data, or virtual (or derived) relvars, whose value at any moment is the value at that moment of the relational algebra expression which defines that relvar (commonly referred to as 'views' in the literature). Note that the concepts of relvalue and relvar each consist of several more concepts at a yet lower level of abstraction.

Scalar Data Types. The data type of every attribute of a relvalue is a scalar type, except when the attribute holds enclosed relvalues. This applies recursively, whatever the level of nesting. Because scalar types are orthogonal to relations, there is no limit to the set of permissible scalar data types that can be used to define attribute values [24]. The scalar values of a data type can be arbitrarily complex; for example a scalar value could be a photo, a video recording or a piece of music. A scalar type could be defined via an object class. A scalar data type may be built into the DBMS, plugged into it as an 'optional extra', or derived by the user in some way from a pre-existing scalar type. A scalar type consists of a permissible set of values. It also has a set of scalar operators associated with it, which take one or more values of that type as operands and/or return a value of that type. In RAQUEL, scalar operators are prohibited from having side effects when they execute; this is to achieve simplicity by making them consistent in this respect with relational algebra operators. Yet again, the concept of a scalar data type consists of several concepts at a lower level of abstraction.

Relational Algebra Operators. Relational algebra utilises an open-ended set of algebra operators. An operator is either monadic or dyadic, i.e. it takes either one or two operands [25]. An operand must be a relvalue, expressed as either a literal relvalue, a relvar or a relational algebra expression. Every operator returns a single relvalue, whose candidate key(s), attribute names and attribute types are derived from the operand(s). Thus the operators form a closed system under the algebra; expressions of arbitrary complexity may be written using the operators. In RAQUEL, the operators that compare relvalues return truth values which are represented as zero-attribute relvalues [26], thereby maintaining closure and simplifying the algebra overall [27].

[Footnote 19: It is emphasised that the set is mathematical rather than some other, possibly more vaguely defined, kind of set, because of the set properties that consequently apply, i.e. the set has no ordering or structure, no duplicates, and may be empty. From now on, reference to a 'set' in the thesis should be taken to mean a 'mathematical set' unless otherwise specified.]
[Footnote 20: Consequently a NULL - i.e. the absence of a value - cannot be an attribute value.]
[Footnote 21: The terms 'enclose' and 'disclose' are taken from APL. In APL they have precise definitions that are the exact analogue, with nested arrays, of what is required for nested relvalues, and so the same terms with corresponding definitions are used here. 'Nested' is only used as a general, indicative term, as there appears to be no universally agreed definition for it in the literature.]
[Footnote 22: This is proposed by Date and Darwen - see pages 152-3 of [Date04Int]. Its rationale is that since an attribute can have any data type, there is nothing to prevent it from having a relational type (henceforth abbreviated to 'reltype'). Such an attribute therefore has nested relvalues, but at the level of abstraction of the containing relvalue, a nested relvalue is enclosed to become a single scalar value, i.e. its internal structure and the values it contains are not visible. So the containing relvalue is in First Normal Form, and relational algebra operators applied to the containing relvalue continue to function as normal on it. There have been other kinds of proposals to allow relvalues to be nested as attribute values. A notable early proposal was that of Roth, Korth and Silberschatz - see [Roko88] - who derived an extension of the relational model that allowed nesting, but in such a way that the structure and contents of the nested relvalues were not enclosed but visible at the level of the containing relvalue, i.e. the latter's attribute values were non-scalar. As a consequence, relational algebra operators had to be amended to cope with "Non First Normal Form" (= NF2) relvalues, as the containing relvalues were known. If a nested relvalue is disclosed, a level of enclosure is removed, and the nested relvalue's attribute values are brought up to the level of abstraction of the containing relvalue, such that the nested relvalue's attributes replace the nested-relvalue attribute. Just as operators pertaining to a scalar attribute type may be applied to scalar values in an attribute, so may relational algebra operators be applied to nested relvalues in an attribute. Many algebra operators take parameters that reference the attributes of their operands. When such operators are applied to nested relvalues, the parameter references attributes of the nested relvalues, i.e. it references one level of enclosure down, without any explicit disclosure of enclosed levels of abstraction. Furthermore, some algebra operators permit expressions or statements as parameters, e.g. Restrict and Extend. If such an operator is applied to an attribute's nested relvalues, then the expression/statement parameter can itself include relational algebra operators that apply to nested relvalues, which may themselves include operators with expression/statement parameters applying to nested relvalues, and so on without artificial limit. This provides an alternative strategy for manipulating nested relvalues down to any depth of enclosure without the use of disclosure.]
[Footnote 23: The whole point of a relational container is that one 'can see inside it' and perceive its structure and the individual scalar values it contains. The whole point of a scalar is that one 'cannot see inside it' and cannot perceive any structure or component values. In designing a relational DB, part of the design is determining how to apportion the DB data into relational containers, and within each relational container, whether each attribute should contain genuine scalar values or enclosed relvalues, i.e. what levels of abstraction are most helpful with respect to the real-world situation represented. An analogy may help. In chemistry, atoms are the fundamental objects, and chemical reactions concern how atoms are formed and re-formed into molecules. In physics, protons, neutrons and electrons are the fundamental objects, and atomic reactions concern how these form and re-form into atoms. In particle physics, quarks and leptons are the fundamental objects, and particle reactions concern how these form and re-form protons, neutrons and electrons. One must choose the appropriate level of abstraction for the topic of concern.]
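The closure property described above for the algebra operators, including the representation of truth values as zero-attribute relvalues, can be sketched in the same hypothetical style. This is Python for illustration only; the operator signatures are assumptions, not RAQUEL definitions.

```python
# Illustrative sketches of two algebra operators over a relvalue
# represented as a frozenset of frozenset-of-(name, value) pairs.

def project(relvalue, attrs):
    """Project: keep only the named attributes; the result is again a
    relvalue, so any duplicates that arise collapse automatically."""
    return frozenset(
        frozenset((a, v) for a, v in t if a in attrs) for t in relvalue
    )

def restrict(relvalue, predicate):
    """Restrict: keep only the tuples satisfying the predicate."""
    return frozenset(t for t in relvalue if predicate(dict(t)))

# The two zero-attribute relvalues standing for the truth values:
DEE = frozenset({frozenset()})   # contains the one 0-tuple -> true
DUM = frozenset()                # contains no tuples       -> false

def rel_eq(r, s):
    """Comparison returns a relvalue, not a host-language boolean,
    so the algebra stays closed over a 1-sorted universe of relvalues."""
    return DEE if r == s else DUM

emp = frozenset({
    frozenset({("emp_id", 1), ("dept", "IT")}),
    frozenset({("emp_id", 2), ("dept", "IT")}),
    frozenset({("emp_id", 3), ("dept", "HR")}),
})
depts = project(emp, {"dept"})                       # closure: a relvalue
it_only = restrict(emp, lambda t: t["dept"] == "IT")
assert len(depts) == 2                               # duplicates collapsed
```

Because every operator both consumes and produces relvalues, arbitrarily complex expressions such as `project(restrict(emp, ...), {"dept"})` compose without any special cases.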
Unlike the open-ended set of scalar data types, which is determined by the needs of a particular DB, the open-ended set of algebra operators is determined by the language designer; this includes whether provision is made for the language user to define new operators. Enclose and Disclose operators will be needed to support nesting. The concept of algebra operators comprises monadic and dyadic categories, and in each of these, at a yet lower level of abstraction, there are the concepts that define the operators of that category.

Relational Assignments. At the very least, an assignment is needed to give a relvar a new relvalue. For simplicity and ease of use, a number of assignments are desirable, for example to insert, amend and delete tuples in relvars, and to retrieve relvalues from a DB. Thus RAQUEL provides a (potentially open-ended) set of such value assignments [28], and consequently value assignment is not a 'single instance concept' as it is in an application programming language [29].

[Footnote 24: The truth data type, consisting of the values true and false, must always be available to the DBMS, even if it is never used for an attribute in a DB, since the DBMS must be able to evaluate truth-valued expressions pertaining to attribute values in order to execute algebra operators whose definitions involve such expressions.]
[Footnote 25: In principle, there can be operators that are niladic or take more than two operands. No such operators currently exist in RAQUEL. TTM has a triadic version of Divide.]
[Footnote 26: There are only two possible such relvalues: one has the 0-tuple and the other has no tuples. (There cannot be multiple 0-tuples, as they would be replicas of each other.) They represent the truth values true and false respectively. See pages 153-154 of [Date04Int] for further details.]
[Footnote 27: Otherwise one would need a 2-sorted universe, of relvalues and scalar values. This generalisation retains a 1-sorted universe, of relvalues only.]
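The view of the various value assignments as shorthands over one fundamental assignment of a relvalue to a relvar can be sketched as follows. This is a hypothetical Python illustration, not RAQUEL; the class and method names are invented for the example.

```python
# Hypothetical sketch: insert and delete assignments expressed as
# shorthands over a single underlying value assignment to a relvar.

class Relvar:
    def __init__(self, relvalue=frozenset()):
        self.value = relvalue                 # the current relvalue

    def assign(self, relvalue):
        """The one fundamental assignment: give the relvar a new relvalue."""
        self.value = relvalue

    def insert(self, delta):
        """Shorthand for: R := R UNION delta."""
        self.assign(self.value | delta)

    def delete(self, predicate):
        """Shorthand for: R := R MINUS the tuples matching the predicate."""
        self.assign(frozenset(t for t in self.value
                              if not predicate(dict(t))))

r = Relvar()
r.insert(frozenset({frozenset({("id", 1)}), frozenset({("id", 2)})}))
r.delete(lambda t: t["id"] == 1)
assert r.value == frozenset({frozenset({("id", 2)})})
```

The design point the sketch illustrates is that the extra assignments add convenience without adding a new kind of state change: each reduces to a plain value assignment.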
In this respect RAQUEL is more akin to SQL, which does have several kinds of statements corresponding to such assignments. However page 193 of [Date04Int] affirms that algebra expressions can be used for a variety of purposes and lists some examples. While some relate to value assignments, others relate to constraints of various kinds, such as integrity constraints or access constraints. A generalisation of assignment to include integrity constraints was developed by Livingstone and Gharib - see [Livi95] and [Ghar97] - and successfully implemented in an APL interpreter. The APL integrity assignment assigned a set of values to a variable as its set of permissible values, i.e. its data type, rather than as its value. The same concept, but further generalised to handle a whole range of constraints via a (potentially open-ended) set of integrity assignments, is applied in RAQUEL [30], so that the non-value assignment purposes given in [Date04Int] can be provided for in a simple but generally uniform way. For example, such assignments include one to generate a reltype and assign it to a relvar. Like the open-ended set of algebra operators, the open-ended set of assignments is determined by the language designer, and this includes whether provision is made for the definition of new assignments. Since RAQUEL has two categories of assignments, those that make relvalue assignments to relvars and those that make integrity assignments to relvars [31], the concept of relational assignments comprises two lower level concepts, one per category. In turn each of these comprises the concepts at a yet lower level of abstraction that define the assignments of that category.

2.5 The Simplicity of the Relational Conceptual Model

The relational model is now reviewed to demonstrate its simplicity and conceptual integrity. The model is summarised graphically in figure 2.1. The figure is derived using the ideas put forward by Hsi in [HsiI05] - see the appendix in section 2.7 for a summary of his approach.
Hsi states that a computing application has an 'ontology', which he defines as being "its theory of the real world" - page 4 of [HsiI05]. The ontology is formed from the 'concepts' that compose it. 'Ontological excavation' is used to identify the concepts, which are then modelled as a semantic network; this is similar to an ER model or UML class diagram, except that attributes of entities are shown separately from the entities themselves. Having got what is essentially a graph structure, various numerical measures of the graph are taken in order to obtain a measure of how well the concepts integrate together. This general idea is used here, except that concepts or 'entities' are not assumed to have attributes. Instead concepts at the same level of abstraction are linked by edges in the graph; the more detailed concepts at a lower level of abstraction, which together make up one concept at a higher level of abstraction, form a graph at the lower level that is expressed as a single node in the graph at the higher level.

[Footnote 28: However they can be thought of as shorthands for more complex statements that utilise only a traditional value assignment.]
[Footnote 29: This contrasts with 'textbook relational algebra'. For example, the latest editions of two well-established DB textbooks, [ElNa07] and [CoBe02], use value assignment in the course of a discussion of algebra operators, but do not explicitly discuss assignment, and tacitly assume the algebra is of use only for retrievals.]
[Footnote 30: Some overviews of the relational model include 'integrity constraints on relvars' as a high-level conceptual component of it. This was omitted above. It can now be seen that this component is provided in RAQUEL by assigning suitable relational algebra expressions to relvars as integrity constraints.]
[Footnote 31: Actually there is a third class, which binds relvars to their storage mechanisms, but this is irrelevant to the conceptual relational model.]
A graph at a lower level is portrayed in a 'bubble' that forms a node at a higher level. There is no limit to the number of levels of abstraction that are permitted. Figure 2.1 demonstrates graphically the conceptual simplicity of the relational model. Consider the model with respect to the five programming language criteria:

Parsimony. There are only four concepts at the highest level of abstraction, and each of these is made up of a very small number of constituent concepts at the next level of abstraction down, and so on.

Simplicity. From the highest level of abstraction downwards, the concepts are simple, particularly for someone used to imperative programming languages. Relations are logically a simple kind of data container, and the operators and assignments follow on in a straightforward way from them. Assignments are conceptually quite different from operators; so for clarity and to avoid confusion, they are treated quite differently in the language. The phenomenon of 'psychological inhibition' would arise otherwise: "It is the similarity between the languages which causes the inhibition." "It might be better, when identity is not possible, to make the two more clearly dissimilar." - see [Wein71].

Generality. None of the concepts have any artificial constraints or limitations at any level of abstraction. As opposed to the single value assignment of application programming languages, value assignments are generalised to provide a range of uniform ways of changing relvars' values; and a commensurate range of uniform assignments is provided for changing relvars' integrity constraints.

Orthogonality. There is complete orthogonality between all four concepts at the highest level of abstraction. Any scalar data types can be used with any relvar. Any operators can be used with any assignments to any relvar.

Uniformity. Reltypes are treated in a corresponding way to scalar types, both having variables, literal values, operators and assignments.
Uniformity is also applied in achieving generalisations, as noted above. Thus the relational conceptual model is consistent with the simplicity and conceptual integrity proposed by Brooks. As many of the previous references about the relational model have stated, the reason for this achievement is that the following four design strategies have been employed in creating the model:

1. The level of abstraction has been raised to be as high as possible. Only those logical aspects that are germane to the handling of data in a DB are included.

2. The principle of 'Essentiality' is applied, to prune out all concepts that are not logically essential. If n different ways are used in a logical model to represent information, then n sets of operators and assignments, one set per way, are required in the model. The larger the value of n, the greater the complexity of the model. Yet if only m of the n ways are essential, i.e. m < n, only the functionality conferred by the m ways is attained. In the case of the relational model, m = 1 = n. There is only one kind of data object, viz. the relation, and hence only one set of operators and one set of assignments is needed to handle relations. Note that essentiality is not the same as a high level of abstraction: one could choose to have additional concepts at a high level of abstraction.

3. The relational model is purely a logical model, with its implementation being entirely excluded from it. One might argue that this follows from having the highest level of abstraction possible, but in fact it doesn't always follow, and it is very important in practice to ensure that there are no implementation aspects in the model. As an extra benefit, it also allows a greater variety of implementation options to be made available.

4. The relational model is a formal, mathematical model. Although not heretofore explicitly discussed, Codd based the relational model on mathematical set theory and its application to relations.
As with the application of formal methods to the development of application programs, the advantage of this is that the mathematics can be used to specify much more precisely what the conceptual model actually is. It can be mathematically manipulated, proved and investigated, so that eventually a final version of the model can be proved that is relatively bug-free. (Nothing is ever perfect!) Although the design strategies are different, it can be seen that they are mutually supportive, or at least related to a degree. For example, using a mathematical foundation is conducive both to excluding implementation aspects and to raising the level of abstraction to the highest possible level.

2.6 Adding Different Kinds of Container Type to the Relational Model

The thesis aim is to add a full set of different kinds of container type to the relational model. This affects the first of the four high level conceptual abstractions of the relational model. However, since in principle each additional container type needs its own operators and assignments, it also affects the third and fourth conceptual abstractions. Only the concept of an open-ended set of scalar data types is unaffected, because this must apply equally well to all container types in order to provide conceptual integrity over the full set of kinds of container type. In order to achieve as much simplicity and conceptual integrity as possible, the design strategy of excluding implementation concerns should be applied. If any kind of container type includes its physical implementation in its definition, then its level of abstraction should be raised to eliminate the implementation and yield a simpler concept of data container cum operators and assignments. This maintains consistency with the relational model and allows for the possibility of providing multiple implementation options.
If possible, the concept should be so derived as to achieve as much conceptual simplicity, generality and uniformity as possible when combined with the relational model. To achieve the maximum of conceptual integrity, it is important to aim for parsimony, by minimising the number of kinds of container type actually added. This cannot be done by ignoring those that might be infrequently used, because the aim is to provide the functionality of the full set. Of course, the use of defaults is acceptable, as this does not actually exclude anything from the logical model; it merely provides a form of shorthand for commonly occurring statements or parts of statements. However the design principle of essentiality is relevant here. If two or more kinds of container type are variations on the same theme or overlap in concept, then if possible it is desirable to derive one essential kind of container type that includes the two or more as special cases, and to replace them with the single essential kind. This needs to be done in such a way that it also achieves conceptual simplicity, generality and uniformity. It can also be useful to raise the level of abstraction, as this can help in viewing related kinds of container type as special cases of one underlying container type. This is analogous to the approach used in physics to unify fundamental forces. At one time, magnetism and electricity were considered to be two entirely different forces. Later it was realised that they are two special cases of one force called electromagnetism [32]. Likewise physicists are currently trying to unify the four fundamental forces of electromagnetism, the strong nuclear force, the weak nuclear force, and gravity. It is suggested that this will come about by viewing them as four special cases of a force viewed in 10 dimensions, and it is taken for granted that elegance will be integral to the result [33].
Note that the elimination of implementation concerns may increase the possibility of conceptually related kinds of container type arising, and hence increase the scope for applying essentiality. New kinds of container type will need to be mathematically specified, not only for consistency with the relational model, but also to ensure the concepts are well-defined. Finally, all new kinds of container type must be able to be used in an orthogonal fashion with the existing relational kind of container type. This is not just because it makes them easier to add, as there are then a minimal number of points where integration of old and new must be achieved, but also because it provides greater functionality. Mathematical specifications of the new kinds of container type should help ensure that orthogonality is attained. When the thesis aim is achieved, the extended relational model so produced will be seen to have markedly greater conceptual simplicity compared to an object-relational SQL with all the currently proposed kinds of container type incorporated into it.

[Footnote 32: For example, see [Gamo62] for a brief overview of this development, from Oersted's discovery to Faraday's experiments and Maxwell's equations.]
[Footnote 33: For example, see [Kaku99] for an overview of this.]

2.7 Appendix: Analysing the Conceptual Integrity of Computing Applications Through Ontological Excavations and Analysis

A computing application has an ontology, which is its theory of that part of the real world relevant to the application. The concepts that form the ontology determine and structure the application's features. Usefulness is the extent to which an application succeeds in assisting users to achieve their goals, relative to the amount of effort required to apply the features. Usability is the amount of effort required to use a feature to achieve a goal. A useful application with poor usability enables users to achieve their goals, but with great difficulty.
An application with little usefulness may be very usable but doesn't help users achieve their goals. The features of an application must be determined by and conform to its ontology, strictly speaking the ontology as perceived by the application's users. The degree to which the application ontology matches the users' ontology determines its conceptual fitness. Thus conceptual fitness determines usefulness. An ontology with conceptual integrity will have conceptual coherence. Conceptual coherence measures the degree to which an application's concepts are tightly related. An application has a central core of concepts that are essential to its ontology, and these can be identified by analysis. Inessential concepts either exist to support core concepts or are peripheral; peripheral concepts reduce conceptual coherence. Ontological excavation is used to identify the concepts in an application and model them as an ontology expressed as a semantic network. (This is similar to an ER model or UML class diagram, but it is preferred to show attributes of entities separately from the entities, to improve clarity.) Ontological analysis is used to measure an ontology's conceptual coherence. This analysis involves obtaining measures of distance between different concepts in the ontology/semantic network (i.e. the number of edges between nodes). There are a number of measures that can be calculated for each node in the (graphical) ontology; Betweenness Centrality best identifies core concepts. A geodesic is the shortest route between two nodes. Conceptual coherence is derived from the average length of all geodesics between pairs of reachable nodes; since greater values of this average indicate more incoherence, its inverse, multiplied by 100, is actually used to measure coherence. A use case silhouetting method is used to measure the amount of ontological coverage of typical uses of the application. Each use case uses certain concepts in the ontology.
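The coherence measure just described, the inverse of the average geodesic length scaled by 100, can be sketched with a breadth-first search over an adjacency-list graph. This is a hypothetical Python illustration of the arithmetic, not Hsi's tooling.

```python
# Sketch of the coherence measure: average geodesic (shortest-path) length
# over all ordered pairs of reachable nodes, inverted and scaled by 100,
# so that tightly related ontologies score higher.
from collections import deque

def geodesics_from(graph, start):
    """BFS shortest-path lengths from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def conceptual_coherence(graph):
    lengths = []
    for node in graph:
        for other, d in geodesics_from(graph, node).items():
            if other != node:
                lengths.append(d)
    # 100 / (average geodesic length)
    return 100 * len(lengths) / sum(lengths)

# A tightly linked triangle of concepts scores higher than a chain of three.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
assert conceptual_coherence(triangle) > conceptual_coherence(chain)
```

In the triangle every pair of concepts is one edge apart, giving a coherence of 100, whereas the chain's longer average geodesic lowers its score, which matches the intuition that peripheral, loosely connected concepts reduce coherence.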
For a set of cases, a count of each of the concepts used is made, possibly weighted by the concept's importance. This shows how well the application's ontology matches the user's ontology, and hence to what extent the user finds the application useful. Conceptual coherence is a first approximation to conceptual integrity (but lacks the functionality aspect of Brooks' ratio).