Building a Reason-able Bioinformatics Ontology Using OIL

Building a Reason-able Bioinformatics Ontology Using OIL
Robert Stevens, Ian Horrocks, Carole Goble and Sean Bechhofer
Department of Computer Science
University of Manchester
Oxford Road
Manchester, M13 9PL
United Kingdom
[email protected]
Abstract
Ontologies will play an important role in bioinformatics, as they do in other disciplines, where they
will provide a source of precisely defined terms that
can be communicated across people and applications. The Ontology Inference Layer (OIL), is an
ontology language that has an easy to use frame
feel, yet at the same time allows users to exploit
the full power of an expressive description logic.
OilEd, an editor for OIL, uses reasoning to support ontology design, facilitating the development
of ontologies that are both more detailed and more
accurate.
This paper presents a bioinformatics ontology
building case study using OilEd to highlight the
features of the combination of a frame representation and an expressive description logic.
1
Introduction
Ontologies have become an increasingly important research
topic. This is chiefly a result of their usefulness in a range
of application domains [van Heijst et al., 1997; McGuinness,
1998; Uschold and Grüninger, 1996] including bioinformatics [Stevens et al., 2001].
Biologists have long had a culture of recording and sharing
information. This information has traditionally been stored
in natural language form and latterly in natural language annotated databases. There has been a recognition that if these
information resources are to continue to play their central role
in bioinformatics, they have to become machine understandable.
The automation of tasks depends on elevating the status of
the information in resources from machine-readable to something we might call machine-understandable. The natural language annotation of a bioinformatics resource can be processed computationally, but using the knowledge contained
in the natural language annotation is difficult. The key idea
is to have data in these resources defined and linked in such
a way that its meaning is explicitly interpretable by software
processes rather than just being implicitly interpretable by humans.
To realise this goal, it will be necessary to annotate bioinformatics resources with metadata (i.e., data describing their
content/functionality). Ontologies are a useful mechanism to
provide metadata for various resources. However, such annotations will be of limited value to automated processes unless
they share a common understanding as to their meaning. Ontologies, can help to meet this requirement by providing a
“representation of a shared conceptualisation of a particular
domain” that can be communicated across people and applications [Gruber, 1993].
There have been several attempts to develop bioinformatics
ontologies to exploit this biological information. The Gene
Ontology (GO) [The Gene Ontology Consortium, 2000] is
a controlled vocabulary for annotating gene products for
molecular functions, the biological processes in which it is
involved and the cellular locations in which it is found. EcoCyc [Karp et al., 2000] has used an ontology to specify a database schema for the E. coli. metabolism, signal trandsduction
etc. RiboWeb [Altman et al., 1999] also uses an ontology to
describe its data, but also guide its users through analysis of
their data. Finally, TAMBIS (Transparent Access to Multiple
bioinformatics Information Sources) [Baker et al., 1998] uses
an ontology to allow users to query bioinformatics databases.
Each of these uses a different knowledge representation system from phrases in GO; to frame based systems in EcoCyc
and RiboWeb to a description logic in TAMBIS.
Phrase based vocabularies have the advantage of being easily accessible, but suffer from difficulties in consistency and
maintenance. It is common for mistakes to be made in phrase
based vocabularies, especially in the maintenance of multiple hierarchies. Frame-based systems have the advantage of
an easily accessible and intuitive modelling style, reminisient
of an object view of the world (a frame is a class and the
slots are attributes. The frame encapsulates the properties of
the instances). Such systems, like phrase based vocabularies,
are esentially hand-crafted and can suffer from inconsistencies and logical mistakes. The well defined semantics and
reasoning support of DLs allow logically consistent ontologies to be maintained. Concepts can be defined in terms of
their properties and the reasoning used to classify the concepts based upon those descriptions. When an concept expression is unsatisfyable in terms of the rest of the model, the
reasoning support can inform the modeller of his or her mistake. Description logic based ontologies avoid the problems
of the hand-crafted ontologies, but suffer from the complexity
of the modelling style.
These considerations have led to the development of
OIL [Fensel et al., 2000], an ontology language that extends
a frame-based like view with a much richer set of modelling
primitives1 . OIL has a frame-like syntax, which facilitates
tool building, yet can be mapped onto an expressive description logic (DL), which facilitates the provision of reasoning
services. Thus a modeller is offered the best of both worlds in
both development and deployment of an ontology. OilEd is
an ontology editing tool for OIL (and DAML+OIL) that exploits both these features in order to provide a familiar and intuitive style of user interface with the added benefit of reasoning support. Its main novelty lies in the extension of the frame
editor paradigm to deal with a very expressive language, and
the use of a highly optimised DL reasoning engine to provide
sound and complete, yet still empirically tractable reasoning
services.
Reasoning with terms from deployed ontologies will be
valuable in many bioinformatics applications. The most obvious is in formulating, processing and answering queries over
bioinformatics databases. There is also a great potential in
using reasoning during analysis of novel biological entities.
The reasoning support offered by OIL is also extremely
valuable at the ontology design phase, where it can be used to
detect logically inconsistent classes and to discover implicit
subclass relations. This encourages a more descriptive approach to ontology design, with the reasoner being used to
infer part of the subsumption lattice (see the case study presented in Section 4); the resulting ontologies contain fewer
errors of consistency, yet provide more detailed descriptions
that can be exploited by automated processes in the deployed
ontologies. Finally, reasoning is of particular benefit when
ontologies are large and/or multiply authored, and also facilitates ontology sharing, merging and integration [McGuinness
et al., 2000]; considerations that will be particularly important in the distributed bioinformatics environment.
The modeller, however, is not forced to use reasoning support OilEd can be used to construct hierarchies of terms unadorned by descriptions of properties. In fact, the ontology
development described in Section 4, uses a cyclic, two stage
approach to ontology development. For a given portion of the
ontology, a simple hierarchy of terms is hand-crafted. In the
next stage, properties are described for the necessary and sufficient conditions to be a member of that class or concept. The
reasoner can then be used to check the descriptions for logical
consistency and offer any inferred knowledge (unknown subsumptions) that have been found. The cycle is then repeated
for this and other portions of the ontology.
This paper concentrates on this ontology design use of
OIL, rather than the use of reasoning in the deployed ontologies. A case study taken from the domain of bioinformatics
will be used to highlight the development and management
facilities
afforded by the combination of the frame-like syntax of
OIL with the expressive power of description logics. First,
1
A similar ontology language called DAML has been developed
as part of the DARPA DAML project [Hendler and McGuinness,
2001]. These two languages are soon to be merged under the name
DAML+OIL.
the
principal features of the language (Section 2) and its associated editor
(Section 3) will be described. Section 4 describes the development of an ontology of molecular biology and bioinformatics using OilEd.
2
Oil and DAML+OIL
The development of OIL resulted from efforts to combine the
best features of frame and DL based knowledge representation systems, while at the same time maximising compatibility with emerging web standards. These standards, such as
RDFS [Brickley and Guha, 2000], make it easier to use ontologies consistently across the web. The intention was to
design a language that was intuitive to human users, and yet
provided adequate expressive power for realistic applications
(many early DLs failed on this second count—see [Doyle and
Patil, 1991]).
The resulting language combines a familiar frame like
syntax (derived in part from the OKBC-lite knowledge
model [Chaudhri et al., 1998]), with the power and flexibility
of a DL (i.e., boolean connectives, unlimited nesting of class
elements, transitive and inverse slots, general axioms, etc.).
The language is defined as an extension of RDFS, thereby
making OIL ontologies (partially) accessible to any “RDFSaware” application.
The frame syntax is less daunting to ontologists/domain
experts than a DL style syntax, and it facilitates a modelling style in which ontologies can start out simple (in
terms of their descriptive content) and are gradually extended, both as the design itself is refined and as users become more familiar with the language’s advanced features
(see Section 4). The frame paradigm also facilitates the
construction and adaptation of tools, e.g., the OntoEdit and
Protégé editors and the Chimaera integration tool are all being
adapted to use OIL/DAML+OIL [Staab and Maedche, 2000;
Grosso et al., 1999; McGuinness et al., 2000].
On the other hand, basing the language on an underlying
mapping to a very expressive DL (SHIQ) provides a well defined semantics and a clear understanding of its formal properties, in particular that the class subsumption/satisfiability
problem is decidable and has worst case ExpTime complexity [Horrocks et al., 1999]. The mapping also provides a
mechanism for the provision of practical reasoning services
by exploiting implemented DL systems, e.g., the FaCT system [Horrocks, 2000].
OIL extends standard frame languages in a number of directions. One of the key ideas is that an anonymous class
description, or even boolean combinations of class descriptions, can occur anywhere that a class name would ordinarily be used, e.g., in slot constraints and in the list of superclasses. For example, in Figure 1 (which uses OIL’s “human
readable” presentation syntax, rather than the more verbose
RDFS serialisation), a herbivore is described as an animal
that eats only plants or part-of plants. Points to note
are that universally quantified (value-type) and existentially
quantified (has-value) slot constraints are clearly differentiated, and that the constraint on the eats slot is a disjunction,
one of whose components is an anonymous class description
(in this case, just a single slot constraint). In addition, it is
asserted that the part-of slot is transitive, and that its inverse
is the slot has-part. Further details of the language will be
given in Section 3, and a complete specification can be found
in [Fensel et al., 2000].
slot-def part-of
subslot-of structural-relation
inverse has-part
properties transitive
class-def defined herbivore
subclass-of animal
slot-constraint eats
value-type plant or
slot-constraint part-of has-value plant
Figure 1: OIL language example
3
OilEd
OilEd is a simple ontology editor that supports the construction of OIL-based ontologies. The basic design has been
heavily influenced by similar tools such as Protégé [Grosso et
al., 1999] and OntoEdit [Staab and Maedche, 2000], but
OilEd extends these approaches in a number of ways, notably
through an extension of expressive power and the use of a
reasoner.
3.1 OilEd Functionality
Basic functionality allows the definition and description of
classes, slots, individuals and axioms within an ontology.
In general, editing functions are provided through graphical means—mouse driven drop down menus, toolbars and
buttons. We will not provide a detailed description of the
graphical user interface here, as it is relatively standard (see
Figure 2, which provides a screen shot of the editors class definition panel). Instead, we will discuss the novel functionality
offered by the tool.
Frame Descriptions The central component used throughout
OilEd is the notion of a frame description. This consists of a
collection of superclasses along with a list of slot constraints.
For example, a class called hydrolase has constraints including catalyses hydrolysis, describing one of the properties of
a hydrolase to be the promotion of the reaction called hydrolysis. This is similar to other frame systems. Where OilEd differs, however, is that wherever a class name can appear, a recursively defined, anonymous frame description can be used.
For example, a gene is has-name gene-name or part-of
gene-name – indicating that a gene may be found using its
name or part of its name. In addition, arbitrary boolean combinations of frames or classes (using and, or and not) can
also appear. This is in contrast to conventional frame systems, where in general, slot constraints and superclasses must
be class names.
As well as being able to assert individuals as slot fillers,
several types of constraints on slot fillers can be asserted
Figure 2: OilEd Class Panel
(these kinds of constraint are sometimes called facets).
These include value-type restrictions (all fillers must be
of a particular class), has-value restrictions (there must be
at least one filler of a particular class), and explicit cardinality restrictions (e.g., at most three fillers of a given
class). For instance, it is possible to exactly describe
that a G-protein coupled receptor has to have seven and
only seven transmembrane regions – otherwise it is not a
G-protein coupled receptor. Each constraint has a clearly
defined meaning, removing the confusion present in some
frame systems, where, for example, it is not always clear
whether the semantics of a slot-constraint should be interpreted as a universal or existential quantification.
Class Definitions A class definition specifies the class name,
along with an optional frame description (see above) and
a specification of whether the class is defined or primitive.
If defined, the class is taken to be equivalent to the given
description (necessary and sufficient conditions). If primitive, the class is taken to be an explicit subclass of the
given description (necessary conditions). In the specification of the OIL language, classes can have multiple definitions. In OilEd, this is not allowed—classes must have a single definition—but the same effect can be achieved through
the use of equivalence axioms as discussed below.
Slot Definitions A slot definition gives the name of the slot
and allows additional properties of the slot to be asserted, e.g.,
the names of any superslots or inverses. If r is a superslot of
s, then any two objects related via s must also be related via
r (i.e., s(a, b) → r(a, b)); if r is an inverse of s, then a is related to b via s iff b is related to a via r (i.e., s(a, b) ↔ r(b, a)).
Domain and range restrictions on a slot can also be specified.
For example, we can constrain the relationship parent to have
both domain and range person, asserting that only persons
can have, and be, parents. As with class descriptions, the
domain and range restrictions can be arbitrary class expressions such as anonymous frames or boolean combinations of
classes or frames, again extending the expressivity of traditional frame editors. Note that in this context, the domain and
range restrictions are global, and apply to every occurrence
of the slot, whether explicit or implicit.
A slot r can also be asserted to be transitive (i.e., r(a, b) and
r(b, c) → r(a, c)), functional (i.e., r(a, b) and r(a, c) → b =
c) or symmetric (i.e., r(a, b) → r(b, a)).
All assertions made about slots are used by the reasoner,
and may induce hierarchical relationships between classes,
e.g., as a result of domain and range restrictions (see Section 4).
Axioms Another area where the expressive power
of OIL/OilEd exceeds that of traditional frame languages/editors is in the kinds of axiom that can be used to
assert facts about classes and their relationships. As well as
standard class definitions (which are really a restricted form
of subsumption/equivalence axiom), OilEd axioms can also
be used to assert the disjointness or equivalence of classes
(with the expected semantics) along with coverings. A
covering asserts that every instance of the covered class must
also be an instance of at least one of the covering classes. In
addition, coverings can be said to be disjoint, in which case
every instance of the covered class must be an instance of
exactly one of the covering classes.
Again, these axioms are not restricted to class names, but
can involve arbitrary class expressions (anonymous frames
or boolean combinations). This is a very powerful feature,
and is one of the main reasons for the high complexity of the
underlying decision problem. These axioms, especially the
disjointness axiom, are quite heavily used in the case study
ontology. It is useful to state explicitly that, for instance,
something cannot be both an element and a compound.
Individuals Limited functionality is provided to support the
introduction and description of individuals—the intention
within OilEd is that such individuals are for use within class
descriptions, rather than supporting the production of large
existential knowledge bases (it is supposed that RDF/RDFS
will be used directly for this purpose). As a (non-biological)
example, we may wish to define the class of Italians as being
all those Persons who were born in Italy, where Italy is not
a class but an individual. The example ontology in Section 4
does not use any individuals. It might, however, be possible to use them to describe individual chemicals within the
ontology.
Concrete Datatypes Concrete datatypes (string and integers), along with expressions concerning concrete datatypes
(such as min, max or ranges) can also be used within class
descriptions. However, the FaCT reasoner does not support reasoning over concrete datatypes, and at present OilEd
simply ignores concrete datatype restrictions when reasoning
about ontologies. The theory underlying concrete datatypes
is, however, well understood [Baader and Hanschke, 1991],
and work is in progress to extend the FaCT reasoner with
support for concrete datatypes. These data types are used in
the description of atom in the example ontology (Section 4).
3.2 Reasoning
The editor can be requested to verify an ontology using the
FaCT reasoner. When verification is requested, the ontology is translated into an equivalent SHIQ (or SHF) knowledge base and sent to the reasoner for classification [Decker et
al., 2000]. OilEd then queries the classified knowledge base,
checking for inconsistent classes and implicit subsumption
Figure 3: Hierarchy pre-classification
Figure 4: Hierarchy post-classification
relationships. The results are reported to the user by highlighting inconsistent classes and rearranging the class hierarchy display to reflect any changes discovered. FaCT/OilEd
does not provide any explanation of its inferences, although
this would clearly be useful in ontology design [McGuinness
and Borgida, 1995].
Figures 3 and 4 show the effects of classification on (part
of) the hierarchy derived from the TAMBIS ontology (see
Section 4). When verifying the ontology, a number of new
subsumption relationships are discovered (due to the class
definitions in the model).
In particular we can see that, after verification,
holoenzyme is not only an enzyme, but also a holoprotein,
and that metal-ion and small-molecule are both subclasses
of cofactor. Note that if the reasoning is not employed, and
if the extended expressiveness and advanced features are not
used, OilEd will still function as a simple frame editor.
4
Case Study: the TAMBIS Ontology
The role of ontologies in bioinformatics has become prominent in the last few years. Much of biology works by applying
prior knowledge to an unknown entity. The complex biolog-
Figure 5: Definitions pre-classification
ical data stored in bioinformatics databases requires knowledge to specify and constrain values held in that database.
Ontologies are also used as a mechanism for expressing and
sharing community knowledge, to define common vocabularies (e.g., for database annotations), and to support intelligent querying over multiple databases [Baker et al., 1999;
Stevens et al., 2001].
TAMBIS (Transparent Access to Multiple Bioinformatics
Information Sources) is a mediation system that uses an ontology to enable biologists to ask questions over multiple external databases using a common query interface. The ontology is central to the TAMBIS system: it provides a model
over which queries can be formed, it drives the query formulation interface, it indexes the middleware wrappers of the
component sources, and it supports the query rewriting process [Goble et al., 2001]. The TAMBIS ontology (TaO) covers the principal concepts of molecular biology and bioinformatics: macromolecules; their motifs, their structure, function, cellular location and the processes in which they act.
It is an ontology intended for retrieval purposes rather than
hypothesis generation, so it is broad and shallow rather than
deep and narrow [Baker et al., 1999].
The TaO was originally modelled in the G RAIL DL [Rector et al., 1997]. It was subsequently migrated to OIL in order
to (a) exploit OIL’s high expressivity, maintaining a better fidelity with biological knowledge as it is currently perceived;
(b) use reasoning support when building and evolving complex ontologies where the knowledge is dynamic and shifting;
and (c) be able to deliver the TaO as a conventional frame ontology (with all subsumptions made explicit), thus making it
accessible to a wider range of (legacy) applications and collaborators.
The approach to developing the ontology was directly influenced by the range of expressivity that OIL affords, and
the capabilities of OilEd itself, particularly its reasoning facilities. The modelling philosophy was to be descriptive, i.e.,
to model properties and allow as much as possible of the subsumption lattice to be inferred by the reasoner.
The design methodology was to first construct a basic
framework of primitive foundation classes and slots, working
both top down and bottom up, mainly using explicitly stated
superclasses. This was a cyclic activity, with portions of the
TaO being described primitively, then in the more descriptive
fashion.
In each cycle, the reasoner is used to classify the ontology. The classification can then be viewed (with and without inferred subsumptions) to check the classification against
the ontologist’s knowledge. The editor allows concepts to
be found by name, so recently constructed concepts can be
viewed in their context. Logically inconsistent concept expressions (those equivalent to bottom) are higlighted for easy
identification of badly formed expressions.
The initial model was very “tree-like”, i.e., there were very
few classes with multiple superclasses. The primitive portions of the ontology were then incrementally extended and
refined by adding new classes, elaborating slot fillers and constraints, and “upgrading” to defined classes wherever possible, so that class specifications became steadily more detailed and faithful to the application. This process was guided
by subsumption reasoning—when elaborating or changing
classes, the reasoner could be used to check consistency and
to show the impact on the class hierarchy.
As each cycle of extension of concept definitions ends, the
modeller is able to view the use of primitive and defined concepts across the ontology. This view ‘zooms’ out from the ontology, showing the lattice as dots and arcs, with the dots differentiated according to their being defined or primitive. This
enables the modeller to see areas of definition and where definition is lacking. Building-block concepts, that are not central to the use of the ontology, will in all likelyhood remain
primitive, but it is useful to spot where definition is lacking;
as definition increases the fidelity and justification for the ontology. For instance, the macromolecules within the TaO are
highly defined, but the properties, used in the definition of
more central concepts remain primitive.
Figure 6: Definitions post-classification
Figures 5 and 6 illustrate this (using a subset of the complete ontology). Figure 5 shows the distrubution of defined
concepts throughout the hierarchy before classification2 . Defined concepts are signified using a darker colour, and we can
see that the hierarchy has a very flat structure. In Figure 6,
we see the situation after classification. The defined concepts
have now been organised into a subsumption hierarchy based
on their definitions.
Figure 7 shows a (greatly simplified) fragment of the TaO
(using OIL’s presentation syntax) that we will use to illustrate
this methodology.3
Bioinformatics is the study and analysis of molecular biology – the functions and processes of the products of an
organism’s genes. The knowledge about molecular biology
is contained within numerous data banks and analysis tools.
An ontology of bioinformatics therefore needs to support two
2
The hierarchies are generated using OilEd’s export functionality, which produces graphs for rendering by AT&T’s Graphviz software
3
The complete ontology can be found at http://img.cs.
man.ac.uk/stevens/tambis-oil.html
class-def protein
class-def defined holoprotein
subclass-of protein
slot-constraint binds
has-value prosthetic-group
class-def defined enzyme
subclass-of protein
slot-constraint catalyses
has-value reaction
class-def defined holoenzyme
subclass-of enzyme
slot-constraint binds has-value prosthetic-group
class-def defined cofactor
subclass-of (metal-ion or small-molecule)
disjoint metal-ion small-molecule
Figure 7: Simplified fragment of TAMBIS ontology
domains: First, the domain of molecular biology – the chemicals and higher-order chemical structures within a cell and
second, to reflect the nature and content of bioinformatics resources.
The TaO was built with both a top-down and bottom-up
strategy. A general domain framework was provided into
which more detailed molecular biological and bioinformatics
concepts could be fitted. As well as this approach, a solid conceptual foundation about chemicals and their structure and
behaviour was built. Basic chemicals and their properties are
used to describe the more complex biological molecules of
interest to bioinformatics, so this is an appropriate approach
both from a straightforward content, as well as a modelling,
point of view. This involved the description of the different kinds of chemicals (ions, atoms and molecules etc.); their
structure, reactions, function and processes in which they act.
This general foundation was then used to give the subsequent
detailed description of the salient molecular biological concepts that form the bottom-up placement of defined concepts.
The core of the TaO is a description of basic chemical concepts. The various kinds of chemical are defined as children
of the concept chemical. These include:
atom The building block of all chemicals. A chemical’s behaviour is defined by the number of protons it contains,
i.e., its atomic number. Therefore, atom is defined as:
class-def defined atom
subclass-of chemical
slot-constraint atomic-number
cardinality 1
value-type integer
has-value (min 1)
So, atoms may only have one atomic number,
which must be an integer greater than or equal to
1. The concepts metal-atom, nonmetal-atom and
metalloid-atom are defined to be atoms with the
physicochemicalproperty of either metal nonmetal
or metalloid respectively.
The concept of carbon has been defined as a kind of
atom with atomic number six and the physicochemical
property of non-metal. This description of the concept
carbon enables it to be automatically placed as a kind
of nonmetal-atom. Several other, biologically relevant,
atom types have been included in the TaO.
ion An ion is simply a chemical with an electrical charge. It
is defined as:
class-def defined ion
subclass-of chemical
slot-constraint has-charge
has-value (not 0)
The slot constraint describes that an ion must have an
electrical charge and it can only be an electrical charge.
It also describes that the value for this charge can be a
positive or negative number, but not zero. it would be
possible to capture chemical reality further by specifying
a minimum cardinality of one – that is, a chemical must
have at least one charge to be an ion, but may have more
than one charge (a molecule could, for instance, contain
both a positive and negative charge).
the chemical ion has two asserted children: cation and
anion. Defining cation as a chemical with charge
greater-than 0 enables the classifier to place it correctly
as a kind of ion. An equivalence axiom can be used to
state that cation is a synonym of positive-ion.
Now,divalent-cation (a chemical with two positive
charges) can be defined by adding further properties to
this slot constraint: That
the filler for has-charge is equal 2, that is, has positive
two charges on the chemical.
element An element is a kind of chemical containing only
one kind of atom. OIL has the expressive power to
constrain the slot atom-type to be equal to only one.
Adding the slot constraint atom-type with the value one
to atom would also classify atom as an element.
compound A compound is a chemical containing more
than one kind of atom. The slot constraint used for
element (above) is altered so that the constraint indicates that at least two kinds of atom must be present in
this kind of chemical.
molecule A molecule is a kind of chemical containing atoms linked by covalent bonds.
The
concept covalent-bond was described as a kind
of chemical-structure and used to fill the slot
contains-bond, with the has-value restriction. So, there
must be a covalent bond present for it to be classed as a
molecule, but other kinds of bond may be present – exactly capturing what we understand of basic chemicals.
Two principal features of the ontology development arise
from this chemical core:
1. The need for a framework of primitive concepts, such
as metal and properties such as has-charge. These
can be used to develop the core of defined concepts at
the centre of the TaO. Primitive concepts, as well as
those such as chemical itself, are placed within a simple upper level ontology containing physical, mental,
substance, structure, function and process. These
are extended by their obvious conjunctive forms, e.g.,
physical-structure.
2. The ability to rapidly extend this chemicals core to another layer of defined chemical concepts, all of which
used the previously defined concepts.
The next “layer” of chemical descriptions included:
molecular-compound A chemical containing covalent
bonds and more than one type of atom.
elemental-molecule A chemical, such as oxygen (O2 ),
that contains covalent bonds and only one kind of atom.
metal-ion A kind of atom with an electrical charge.
ionic-compound A kind of chemical containing more than
one kind of atom and has an electrical charge.
All these and more were simply defined to be the conjunction of two concepts. For example:
class-def defined metal-ion
subclass-of metal, ion
class-def defined divalent-cation
subclass-of chemical
slot-constraint has-charge
has-value (equal 2)
A concept divalent-zinc-cation can then simply be defined as:
class-def defined divalent-zinc-cation
subclass-of zinc
slot-constraint has-charge
has-value (equal 2)
These descriptions of chemicals can be reinforced with the
use of axioms. It is not possible to be both an element and
a compound, so these two concepts are described as disjoint.
This means that if a concept were to be defined with properties of both an element and a compound, it would be found
to be inconsistent by the reasoner. Such strict definitions help
maintain the consistency and biological thoroughness of the
ontology.
An organic-molecular-compound is a molecular compound that contains at least one carbon atom. This, however,
is not sufficient to define an organic molecular compound.
Carbon dioxide (CO2 ) is a molecular compound containing
carbon, but is not organic. Thus the property of containing
carbon is only a necessary condition for being an organic
molecular compound. Again, the ability to be exact with concept descriptions allows the ontology to match chemical and
biological knowledge closely and prevent conceptualisations
being made that contradict domain knowledge.
Bioinformatics is mainly concerned with organic
macromolecular-compounds.
Thus, organic molecular compound was split into the biologically useful distinctions of macromolecular-compound and
small-molecular-compound. the distinction is one of
size and a protein, for example, of over 100 Daltons is
usually said to be a macromolecule. Unfortunately the
boundary is more complex, a smaller molecule can still
be “macro”, depending on its context. For this reason,
sufficiency conditions were not used in the definition. Useful
small organic molecules were simply asserted as primitive
concepts underneath small-organic-molecular-compound.
These include nucleotide, amino-acid and others useful in
describing the properties of biological concepts.
For
the
purposes
of
the
TaO,
macromolecular-compounds
are
polymers
of
small-organic-molecular-compounds and are defined as
such. Thus, protein is defined as a polymer of amino-acid;
nucleic-acid as a polymer of nucleotide and polysaccaride
as a polymer of saccaride. A macromolecule can only be
a polymer of one kind of small molecule, so the value-type
restriction is used in the slot constraint. It is only possible to
be one of these molecules, so the disjoint axiom is used on
these macromolecules.
As most of bioinformatics concentrates on the analysis and
description of nucleic acids and proteins, much of the TaO’s
description concentrates in this area. DNA and RNA are both
nucleic acids formed from different kinds of nucleotide.
Describing DNA slot-constraint value-type has-value
deoxy-nucleotide, allows the classifier to correctly place it
as a kind of nucleic-acid and capture that DNA can only be
a polymer of the deoxy- form of a nucleotide and some of
the nucleotide have to be present. The various different kinds
of DNA and RNA are distinguished by their function and/or
cellular location. Again, as before, other parts of the TaO
are used to describe these properties of biological concepts.
For example, genomic-dna is dna that is found on a nuclear
chromosome, chloroplast chromosome, or mitochondrial
chromosome. The slot constraint uses or in the filler class
expression to describe this:
slot-constraint part-of
cardinality 1
value-type
(nuclear-chromosome or
mitochondrial-chromosome or
chloroplast-chromosome)).
The TaO contains a rich partonomy. The cellular structures, in particular, use the part-of slot and its transitive property to build up this partonomy. For instance,
nuclear-chromosome is part-of the nucleus, which itself
is part-of the cell. Thus, a nuclear-chromosome is part-of
the cell.
These biological-structures and associated partonomy are
part of the TaO. Not only are they used in building some of
the descriptions of bio-concepts, but are also part of the description of the content of bioinformatics resources.
In the initial description of kinds of protein, holoprotein,
enzyme and holoenzyme were originally primitive classes,
with no slot constraints, and an explicitly asserted class hierarchy: holoprotein and enzyme were subclasses of protein,
and holoenzyme was a subclass of enzyme.
During the extension and refinement phase, the properties
of the various classes were described in more detail: it was
asserted that a holoprotein binds a prosthetic-group, that
an enzyme catalyses a reaction, and that a holoenzyme
binds a prosthetic-group. Several of the classes were also
upgraded to being defined when their description constituted
both necessary and sufficient conditions for class membership, e.g., a protein is a holoprotein if and only if it binds a
prosthetic-group.
Enzyme was removed from the superclass list and replaced with protein; then holoenzyme’s properties were described in more detail using slot constraints—in particular, it
was asserted that a holoenzyme catalyses a reaction and
binds a prosthetic-group. This allows the reasoner to infer
not only the subclass relationship w.r.t. enzyme, but also additional subclass relationships w.r.t. holoprotein, and in particular that holoenzyme is a subclass of holoprotein. This
latter relationship could have been missed if the ontology had
been hand crafted.
The extension and refinement phase also included the addition of axioms asserting disjointness, equality and covering,
further enhancing the accuracy of the model. Referring again
to Figure 7, our biologist initially asserted that cofactor was a
subclass of both metal-ion and small-molecule (a common
confusion over the semantics of ‘and’ and ‘or’) rather than
being either a metal-ion or a small-molecule. Subsequently,
when it was asserted that metal-ion and small-molecule are
disjoint, the reasoner inferred that cofactor was logically inconsistent, and the mistake was rectified. Modelling mistakes
such as these litter bioontologies crafted by hand.
There are two kinds of cofactor – coenzyme and
prosthetic-group. A coenzyme can be either a small
molecule or metal ion and binds loosely to a protein. A prosthetic group, on the other hand, is a kind of cofactor that binds
tightly to a protein, but can only be a small molecule. Again,
OIL is expressive enough to capture these distinctions accurately.
has-value covalent-bond
value-type covalent-bond
class-def defined hydrolysis
subclass-of reaction
slot-constraint hydrolysis-of
has-value covalent-bond
value-type covalent-bond
class-def defined lyase
subclass-of protein
slot-constraint catalyses
has-value lysis
value-type lysis
class-def defined hydrolase
subclass-of protein
slot-constraint catalyses
has-value hydrolysis
value-type hydrolysis
A lyase is a protein that catalyses lysis. A hydrolase is
a protein that catalyses hydrolysis. As the slot hierarchy describes hydrolysis-of being a subslot of lysis-of, hydrolysis
is a child of lysis and consequently, hydrolase is a child of
lyase.
Other advantages derived from the use of OilEd included:
• The frame-like look and feel of OilEd, and the frame
approach of the OIL language, made ontology development much less daunting to our biologist than writing
SHIQ logic expressions would have been.
• Clipboard facilities provided by OilEd allowed
(parts of) frames to be copied and pasted, making it easy to experiment with new definitions and
to maintain a consistent modelling style.
E.g.,
coenzymeA-requiring-oxidoreductase was built by
copying nad-requiring-oxidoreductase and changing the constraint on the binds slot from nad to
coenzymeA. The reasoner then automatically migrated
the class from being a subclass of holoenzyme to being
a subclass of coenzyme-requiring-enzyme.
• Class definitions can be as simple as possible yet as complex as necessary. Parts of the TaO are simply primitive
frames and slots; other parts are very elaborate and exploit the full expressive power of the OIL language.
• In TAMBIS, the ontology is managed by an ontology
server that makes full use of the class definitions, e.g.,
to classify user generated query classes. However, being
able to deliver a static “snapshot” of the ontology in the
form of an RDFS taxonomy has proved extremely convenient when working with collaborators who are building ontologies that are in fact simple taxonomies, such
as the Gene Ontology [Ashburner et al., 2000].
class-def defined prosthetic-group
subclass-of cofactor and (not metal-ion)
slot-constraint binds-tightly
has-value protein
The slot hierarchy was also used to induce the classification of types of enzyme. For example, reaction (used in the
definition of enzyme) has a child lysis. Lysis is the breaking
of a covalent bond and hydrolysis is breaking of a covalent
bond with water. These two reactions are defined using the
following slot definitions:
slot-def lysis-of
domain reaction
range covalent-bond
slot-def hydrolysis-of
subslot-of lysis-of
class-def defined lysis
subclass-of reaction
slot-constraint lysis-of
5
Conclusion
Ontologies are useful in a range of applications, where they
provide a source of precisely defined terms that can be communicated across people and applications. We have used as
an example, the initial development of a molecular biology
and bioinformatics ontology. Examples from this case study
have been used to demonstrate the utility of OIL’s integration
of features from frame and DL languages. It can be seen from
the case study that OIL can support a cyclical ontology development, where incremental moves are made from a primitive,
asserted taxonomy to one where concepts are rich with properties. These properties can be used to add richness to the
ontology (from inferred knowledge), as well as ensuring the
logical consistency and satisfiability of the ontology. Thus,
the use of reasoning can be seen to be important for the design
and management of ontologies during their development.
OilEd is a prototype development environment for OIL, designed to test and demonstrate novel ideas, and it still lacks
many features that would be required of a fully-fledged ontology development environment, e.g., it provides no support
for versioning, or for working with multiple ontologies. It is
likely that during the development of the TaO that other, or
fragments of other, ontologies will be imported into the TaO.
Moreover, the reasoning support provided by the FaCT system is incomplete for OIL extended with concrete datatypes
and individuals, and does not include additional services such
as explanation. Thus, the definitions used for atom and the
charge on ions is not used in constructing the classification.
Explanation has potential use in both the development and
use of a bio-ontology. During development, it will obviously
be useful to have explanations of why a concept was unsatisfiable according to the current model. It is a goal for a bioontology, such as TaO, to be used in the analysis of novel
biological macromolecules. Certain bioinformatics analyses
can describe the properties of such molecules. If these could
be cast in terms of the TaO, novel concepts generated by such
analyses could be classified in the TaO. the use of explanation
could significantly guide the use of such analyses.
During this case study, we have presented OIL and OilEd,
an ontology editor that has an easy to use frame interface, yet
at the same time allows users to exploit the full power of an
expressive ontology language (OIL/DAML+OIL). We have
also shown how OilEd uses reasoning to support ontology design and maintenance, and presented a case study illustrating
how this facility can be used to develop ontologies that describe their domains in more detail and with greater fidelity.
Acknowledgements: Robert Stevens is supported by BBSRC/EPSRC grant 4/B1012090 and Sean Bechhofer is supported by EPSRC grant GR/M75426.
References
[Baker et al., 1998] P.G. Baker, A. Brass, S. Bechhofer,
C. Goble, N. Paton, and R. Stevens. TAMBIS: Transparent
Access to Multiple Bioinformatics Information Sources.
An Overview. In Proceedings of the Sixth International
Conference on Intelligent Systems for Molecular Biology,
pages 25–34. AAAI Press, June 28-July 1, 1998 1998.
[Baker et al., 1999] P. Baker et al. An ontology for bioinformatics applications. Bioinformatics, 15(6):510–520, 1999.
[Brickley and Guha, 2000] D. Brickley and V.R. Guha. Resource description framework schema specification 1.0.
W3C Candidate Recommendation, 2000. http://www.
w3.org/TR/rdf-schema.
[Chaudhri et al., 1998] V. K. Chaudhri et al. OKBC: A programmatic foundation for knowledge base interoperability.
In Proc. of AAAI-98, 1998.
[Decker et al., 2000] S. Decker et al. Knowledge representation on the web. In Proc. of DL 2000, pages 89–98, 2000.
[Doyle and Patil, 1991] J. Doyle and R. Patil. Two theses of
knowledge representation. Artificial Intelligence, 48:261–
297, 1991.
[Fensel et al., 2000] D. Fensel et al. OIL in a nutshell. In
Proc. of EKAW-2000, LNAI, 2000.
[Goble et al., 2001] C. Goble et al. Transparent access to
multiple bioinformatics information sources. IBM Systems
Journal, 40(2), 2001.
[Grosso et al., 1999] W. E. Grosso et al. Knowledge modeling at the millennium (the design and evolution of protégé2000). In Proc. of KAW99, 1999.
[Gruber, 1993] T. R. Gruber. Towards principles for the design of ontologies used for knowledge sharing. In Proc. of
Int. Workshop on Formal Ontology, 1993.
[Hendler and McGuinness, 2001] J. Hendler and D. L.
McGuinness. The DARPA agent markup language. IEEE
Intelligent Systems, jan 2001.
[Horrocks et al., 1999] I. Horrocks, U. Sattler, and S. Tobies.
Practical reasoning for expressive description logics. In
Proc. of LPAR’99, pages 161–180, 1999.
[Horrocks, 2000] I. Horrocks. Benchmark analysis with fact.
In Proc. TABLEAUX 2000, pages 62–66, 2000.
[Karp et al., 2000] P.D. Karp, M. Riley, M. Saier, I.T.
Paulsen, S.M. Paley, and A. Pellegrini-Toole. The EcoCyc
and MetaCyc Databases. Nucleic Acids Research, 28:56–
59, 2000.
[Altman et al., 1999] R. Altman, M. Bada, X.J. Chai,
M. Whirl Carillo, R.O. Chen, and N.F. Abernethy. RiboWeb: An Ontology-Based System for Collaborative
Molecular Biology. IEEE Intelligent Systems, 14(5):68–
76, 1999.
[McGuinness and Borgida, 1995] D.
McGuinness
and
A. Borgida. Explaining subsumption in description logics.
In Proc. of IJCAI-95, pages 816–821, 1995.
[Ashburner et al., 2000] M. Ashburner et al. Gene ontology: Tool for the unification of biology. Nature Genetics,
25:25–29, 2000.
[McGuinness et al., 2000] D. L. McGuinness, R. Fikes,
J. Rice, and S. Wilder. An environment for merging and
testing large ontologies. In Proc. of KR-00, 2000.
[Baader and Hanschke, 1991] F. Baader and P. Hanschke. A
scheme for integrating concrete domains into concept languages. In Proc. of IJCAI-91, pages 452–457, 1991.
[McGuinness, 1998] D. L. McGuinness. Ontological issues
for knowledge-enhanced search. In Proc. of FOIS-98,
1998.
[Rector et al., 1997] A. Rector et al. The G RAIL concept
modelling language for medical terminology. Artificial Intelligence in Medicine, 9:139–171, 1997.
[Staab and Maedche, 2000] S. Staab and A. Maedche. Ontology engineering beyond the modeling of concepts and
relations. In Proc. of the ECAI’2000 Workshop on Application of Ontologies and Problem-Solving Methods, 2000.
[Stevens et al., 2001] R. Stevens, C. A. Goble, and S. Bechhofer.
Ontology-based knowledge representation for
bioinformatics. Briefings in Bioinformatics, 2001.
[The Gene Ontology Consortium, 2000] The Gene Ontology Consortium. Gene Ontology: Tool for the Unification
of Biology. Nature Genetics, 25:25–29, 2000.
[Uschold and Grüninger, 1996] M.
Uschold
and
M. Grüninger. Ontologies: Principles, methods and
applications. K. Eng. Review, 11(2):93–136, 1996.
[van Heijst et al., 1997] G. van Heijst, A. Schreiber, and
B. Wielinga. Using explicit ontologies in KBS development. Int. J. of Human-Computer Studies, 46(2/3):183–
292, 1997.