Stealthy annotation of experimental biology by spreadsheets

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
Published online 10 October 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.2941
SPECIAL ISSUE PAPER
Stealthy annotation of experimental biology by spreadsheets
Katherine Wolstencroft1,*,†, Stuart Owen1, Matthew Horridge1, Simon Jupp1, Olga Krebs2,
Jacky Snoep3, Franco du Preez3, Wolfgang Mueller2, Robert Stevens1 and Carole Goble1
1
School of Computer Science, University of Manchester, Oxford Road, Manchester, UK
2
HITS gGmbH, Heidelberg, Germany
3
Manchester Interdisciplinary Biocentre, University of Manchester, UK
SUMMARY
The increase in volume and complexity of biological data has led to increased requirements to reuse that data.
Consistent and accurate metadata is essential for this task, creating new challenges in semantic data annotation
and in the constriction of terminologies and ontologies used for annotation. The BioSharing community are
developing standards and terminologies for annotation, which have been adopted across bioinformatics, but the real
challenge is to make these standards accessible to laboratory scientists. Widespread adoption requires the provision
of tools to assist scientists whilst reducing the complexities of working with semantics. This paper describes
unobtrusive ‘stealthy’ methods for collecting standards compliant, semantically annotated data and for contributing
to ontologies used for those annotations. Spreadsheets are ubiquitous in laboratory data management. Our spreadsheet-based RightField tool enables scientists to structure information and select ontology terms for annotation within
spreadsheets, producing high quality, consistent data without changing common working practices. Furthermore, our
Populous spreadsheet tool proves effective for gathering domain knowledge in the form of Web Ontology Language
(OWL) ontologies. Such a corpus of structured and semantically enriched knowledge can be extracted in Resource
Description Framework (RDF), providing further means for searching across the content and contributing to Open
Linked Data (http://linkeddata.org/). Copyright © 2012 John Wiley & Sons, Ltd.
Received 4 April 2011; Revised 1 September 2012; Accepted 10 September 2012
KEY WORDS:
semantic annotation; data curation; ontology
1. INTRODUCTION
Howe et al. in [1] make the compelling case that although databases have become important avenues for
publishing biological data, the curation increasingly lags behind data generation in funding, development
and recognition. The interpretation and reuse of data rely on it being annotated with rich, standardised
metadata. However, the labour cost of metadata construction, data curation and annotation is high: it is
a time-consuming and awkward process, which is undervalued and provides little reward for the skilled
data curators that undertake it [2, 3].
In an effort to drive-up the quality of curation at the point of data capture, various biological science
communities have proposed minimum information models and community ontologies that should be
used for structuring and describing these metadata. The models set community-wide expectations
for the least information required to understand and interpret experimentally gathered data.
Recommendations for the use of specific terminologies and ontologies clarify the information required
and the choice of terminologies for annotation, and encourage widespread compliance [4].
*Correspondence to: Katherine Wolstencroft, School of Computer Science, University of Manchester, Oxford Road,
Manchester, UK.
†
E-mail: [email protected]
Copyright © 2012 John Wiley & Sons, Ltd.
468
K. WOLSTENCROFT ET AL.
Minimum Information about a Biological or Biomedical Investigation (MIBBI) [5] is an umbrella
organisation describing minimum information models for over 30 different biological fields or
experimental technologies. The approach is a pragmatic one, with many of the models little more than
checklists. Many MIBBI checklists recommend the use of specific terminologies and ontologies for
annotation. These ontologies are published in the BioPortal [6], a community repository for biological
ontologies. Table I shows some minimum information models and corresponding ontologies commonly
used in the life sciences.
Data management specialists routinely use these MIBBI checklists and related terminologies to capture
and standardise the structure of data for sharing or for deposition into public repositories. However, the
uptake of such standards by experimental biologists is much less common. One significant reason for
this difference is the ease with which the metadata standards can be applied to data. The existence of
tools that assist with the process can increase compliance significantly. For example, Minimum
Information about a Microarray Experiment (MIAME) was initially implemented as The MicroArray
Gene Expression Markup Language (MAGE-ML), an (Extensible Markup Language) XML-based
schema. Users were required to submit data in this format to ArrayExpress (http://www.ebi.ac.uk/
arrayexpress/) or GEO (Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/) in order to
publish work relating to it. In practice, this was only feasible for laboratories with dedicated
bioinformatics support. The community developed MicroArray Gene Expression Tab (MAGE-TAB)
[9], a tabular alternative that allows scientists to prepare MIAME compliant data from within Microsoft
Excel spreadsheets to lower the barrier of adoption of the standard. This more straightforward process
requires no prior knowledge of MAGE or XML and enables experimentalists to use the tools that they
already use to record their data and are highly proficient experts in. MAGE-TAB demonstrates the
importance of providing tools and interfaces as well as metadata standards and schemas [10].
There is a similar situation in the development of the terminologies or ontologies that supply the
metadata for these annotation tasks. For a good introduction to ontologies in biology, see [11]; here, we
briefly set the context for the paper. Ontologies are used to define the entities and the relationships
between those entities [12]. The labels used for the concepts describing the domain form a controlled
vocabulary for that domain. Ontologies are now widely used in this way within biology; they provide
metadata for describing the data. Ontologies come in a variety of forms, from simple trees or
polyhierarchies, through to highly axiomatised ontologies amenable to automated reasoning.
In biology, two languages for authoring ontologies are widely used: Web Ontology Language (OWL)
and Open Biomedical Ontology (OBO). The OWL (http://www.w3.org/2007/OWL) is widely used for
authoring ontologies as a collection of axioms, each of which has a precise semantics that make such
ontologies amenable to automated reasoners. OWL and its ontologies are hard to understand and use; the
tools for developing OWL ontologies, such as Protégé (http://protege.stanford.edu) are best used in the
hands of specialists. The OBO (http://www.geneontology.org/GO.format.obo-1_2.shtml) format is a
common format for biological ontologies. OBO is less complex than OWL and may therefore be easier
to understand, but the large number of OBO ontologies available and the large numbers of terms within
them, make the task of exploration difficult. Although ontologies play an important role in many fields,
including biology, they are best used through tools that present them in a form compatible with everyday
working practices of users. For example, ontologies, irrespective of their level of axiomatisation, are
usually used by biologists as taxonomies of vocabulary terms and they are best presented as such.
Our aim is to make the collection, validation and construction of accurate and precise metadata a
simple task: without using special tools, without the need for our laboratory experimentalist to be
exposed to rich and complex ontologies and with the tools and techniques they are already familiar
with. As the most common data collection and exchange file is the Microsoft Excel spreadsheet, we
use standard Excel spreadsheets.
• To enable ontology-enriched data collection. We have developed the RightField ontology
annotation and information management desktop application (http://www.rightfield.org.uk) [13].
RightField adds constrained ontology term selection to Excel spreadsheets, addressing issues
such as multiple ontology loading, display and browsing, multiple ontology formalisms, ontology
evolution and ontology provenance, off-line working, cross-platform support, Excel application
legacy, ontology selection depths and nesting, and format import–export.
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
Copyright © 2012 John Wiley & Sons, Ltd.
Systems Biology Model Simulation
Systems Biology Models
Interaction experiments
Microarray
Proteomics
Data
MIMIX: Minimum Information about a Molecular
Interaction Experiment
MIRIAM: Minimum Information Required In the Annotation of
biochemical Models
MIASE: Minimum Information About a Simulation Experiment
MIAME: Minimum Information about a Microarray Experiment
MIAPE: Minimum Information about a Proteomics Experiment
MIBBI Model
KISAO: Kinetic Simulation Algorithm Ontology
SBO: Systems Biology Ontology
MO: MGED Ontology
MS: mass spectrometry; MOD: Protein Modification;
SEP: sample processing and separation
MI: protein–protein interaction
Ontologies
Table I. Examples of standards and ontologies for data annotation. Other ontologies are more generally applicable and are used across the domain. For example, Ontology
for Biomedical Investigations (OBI) [7] can be used to describe experimental properties, and the Gene Ontology [8] can be used to describe functional properties of genes
and gene products.
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
469
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
470
K. WOLSTENCROFT ET AL.
RightField and the spreadsheets it produces are already in use in a large Systems Biology programme
incorporating over 300 scientists from 120 institutes. It has significantly improved the ability for
laboratory experimentalists to self-curate their data from a familiar environment and a lower barrier
of entry to the process of semantic annotation, helping to ensure the quality of data annotation and
supporting standards compliance. When data is required to comply with particular standards in
order to be published, scientists already have it preformatted. Although developed for Systems
Biology, the software is generic. For example, the Gas and Oil industry’s instrument data sheets
are spreadsheet-based and are in need of a tool such as Rightfield with the added capability of a
local dialect to core term mapping functions.
• To enable ontology content collection. We have developed the Populous ontology population
desktop application (http://www.populous.org.uk) [14]. Populous extends RightField by adding
support for the creation and population of ontology design patterns. Specifying good ontology
design patterns in a language such as OWL can be difficult and requires knowledge of OWL
syntax, OWL semantics, knowledge representation and the ontology authoring tools. The task
of populating the design pattern however, requires only an understanding of the pattern and
sufficient domain knowledge in order to populate it. The Populous extension to RightField
provides support for domain experts in populating design pattern templates using a familiar
spreadsheet style interface. Once the template is populated, Populous uses the Ontology PreProcessing Language (OPPL) [15, 16] to map the data in the columns to variables within the
design pattern to generate axioms for the growing ontology. Populous has been successfully used
in the e-LICO (http://www.e-lico.eu) project for the construction of an ontology describing the
Kidney and Urinary Pathway (KUP) [17]. The KUP ontology (KUPO) is used to annotate
experimental data held in the KUP Knowledge-base (KUPKB). The challenge was to engage
the biologists in the ontology’s development, whilst shielding them from details of the ontology
itself. Experienced ontology developers defined the set of design patterns for KUPO and
generated a set of templates for the biologists using RightField. The content for KUPO was
populated entirely by biologists with little or no previous knowledge in ontology development.
The community can now submit new experiments to the KUPKB using RightField-generated
spreadsheets that contain metadata coming from KUPO.
To enable information model design. We combine RightField and Populous. The overall structure
for the KUPO was developed by bio-ontologists, and Populous was used to populate the patterns
implied by the KUPO. Portions of the KUPO can then be used within a RightField spreadsheet to
help biologists provide content to the KUPKB. As the biologists used the KUP-orientated
RightField data submission form, gaps in the KUPO will be spotted; Populous can be returned
to update the KUPO. The KUPO provides an information model for placing into RightField; as
the RightField templates are used, the information model can change. The Populous extension
to RighField then allows the same biologists to expand the ontology that forms the information
model for the RightField generated spreadsheets.
• RightField and Populous are so-called Data Ramps [18], tools that lower the barriers to using and
manipulating data. These tools are not Excel plugins; instead, they are Excel spreadsheet
generators. The spreadsheets are plain Excel spreadsheets that use only the standard features of
Excel to produce drop-down lists of allowed cell values. In this way, the features previously outlined are ensured. The tabulation and constraints within spreadsheets generated by RightField
means that data collected using RightField templates is highly structured and can be exported
and stored in relational databases, XML or RDF. As all, this is achieved using only Excel features
in an environment familiar to target users, these highly structured data are produced by stealth,
avoiding the need to develop, install and train users with a bespoke tool. RightField and Populous
are important components towards quality-assured Linked Data. Linked data is a method for
exposing, sharing and connecting pieces of data, information and knowledge on the Semantic
Web using unique identifiers and common formats. These tools automatically produce semantically annotated life science data that could be exported to the Linked Data cloud in a structured
format such as RDF.
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
471
RightField and Populous are open source Java applications distributed under the Berkeley Software
Distribution (BSD) licence and available from http://www.rightfield.org.uk and http://www.populous.
org.uk, respectively.
The paper is organised as follows. In Section 2, we present the motivation for RightField and
demonstrate how it is used for semantically enriched and validated data annotation against projectdefined information model templates by the pan-European SysMO Systems Biology Consortium. In
Section 3, we present the motivation for Populous and demonstrate how it is used to gather content and
populate ontologies against pre-defined patterns by the European Union e-LICO Project. We also
propose how RightField and Populous can be used to define and validate new minimum information
models. In Section 4, we reflect on our experiences of the tools in the field, including shortcomings, and
set out our future plans with respect to Linked Data. Related work is referred to throughout the sections.
2. RIGHTFIELD: ONTOLOGY-ENRICHED DATA COLLECTION BY STEALTH
A classic example of the challenges of accurate yet onerous scientific data collection is demonstrated by the
SysMO consortium (Systems Biology of Microorganisms http://www.sysmo.net). This consortium is a panEuropean research initiative to record and describe the dynamic molecular processes occurring in
microorganisms and to present these processes in the form of computerised mathematical models. The
diversity of experiments performed across the 13 multi-partner projects varies widely. Its scientists typically
perform a mixture of high-throughput ’omics experiments, such as microarray analysis or proteomics, as
well as traditional molecular biology and enzyme reaction kinetics. Despite this diversity, the consortium
aims to pool its research capacities and know-how, sharing data, methods, models and results within the
consortium and with the rest of the Systems Biology community. It has developed, through a specially
funded SysMO-DB project [19], a suite of resources and a web-based platform, the SEEK [20], to provide
a data sharing environment for scientists within the consortium, a dissemination mechanism to share
research with the wider community and a gateway to other commonly used community resources. The
most common data collection and exchange format is the Microsoft Excel spreadsheet.
There is no additional curation resource outside the projects so data curation must be handled at source by
the scientists. The laboratory scientists in SysMO (some 300 drawn from over 120 institutes across Europe)
have little experience in metadata management and semantics but understand the value of reusing data. The
time they have to spend on data curation is limited, so they will only adopt community standards if there are
clear advantages in doing so. The range of MIBBI models and accompanying ontologies they are expected to
comply with is wide, confusing and daunting; the metadata models are complicated and impractical,
requiring familiarity with XML formats, ontology languages and formats, and the annotation of typically
over 100 metadata elements. Often, multiple ontologies are needed for the annotation of a single file.
The SysMO-DB project has improved ‘curation at source’ by leveraging the widespread use of, and
familiarity with, spreadsheets. Experimentalists collect data against standard Excel spreadsheet
templates based on variations of the MIBBI community models. Terms from ontologies are exposed
as simple dropdown cell selection lists, shielding the scientist from understanding and browsing any
ontologies and guiding them into only choosing valid terms.
To enable expert bioinformaticians to make these templates, we have developed the RightField
ontology annotation and information management desktop application for adding constrained ontology
term selection to Excel spreadsheets. Thus, SEEK users are encouraged to upload spreadsheets that are
RightField enabled.
RightField is an administrator’s tool. The workflow is a three-stage process.
1. An expert bioinformatician or data manager prepares data collection templates using standard Excel.
These are imported into RightField. Ontologies are accessed and browsed through the BioPortal, a
repository of biological ontologies (http://bioportal.bioontology.org/) or from their local file system
(Figure 1(a)). The tool supports ontologies in OBO (http://www.geneontology.org/GO.format.
obo-1_2.shtml), OWL and RDFS formats and simple RDF vocabularies. Ontology terms are
selected and bound to the cells. Terms can be all the subclasses from a chosen class, all the individuals
(instances), or a combination of both (Figure 1(b)). Individual cells or whole columns or rows can be
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
472
K. WOLSTENCROFT ET AL.
Figure 1. A screenshot of RightField showing a loaded spreadsheet template and the process of marking
cells with ranges of ontology terms from the Microarray Gene Expression Data (MGED) ontology [21].
marked with the required ranges of ontology terms (Figure 1(c)). A spreadsheet can be annotated with
terms from multiple ontologies. The full Internationalised Resource Identifiers (IRIs) and associated
labels for these ontology elements are retained and discreetly stored within hidden sheets in the spreadsheet file. This affords a full provenance track concerning the origins and versions of the ontology
terms used in the annotation. The spreadsheets are saved in native Excel format.
2. A laboratory experimentalist opens the RightField enabled spreadsheet in standard Excel or
OpenOffice (Figure 2). No plugins are required, and there are no macros, visual basic code or
platform-specific libraries needed for its use. No connection to any other service is required: the
spreadsheet is self-contained as terms are embedded within its file. The spreadsheet behaves as
usual; however, marked-up cells (Figure 2(a)) are restricted to selections from a simple drop-down
list of terms using the term labels (Figure 2(b)). The ontologies have been bridged and flattened into
validated vocabulary pick-lists, so no in-depth knowledge of the ontologies or the reporting
standards are required. By selecting a term, the IRI of the ontology term is attached to the cell for
that data sheet. Once marked-up and saved, the spreadsheet data can link back to the original
ontology. The default behaviour is to retain this information in a hidden state to shield the user from
complexities of the underlying semantics. However, a feature to toggle between term labels and
identifiers has subsequently been included at the request of the expert bioinformatician and the
curious experimentalist.
3. The marked up datasheets now have embedded annotation bound to the IRIs of ontology terms.
When viewed using Excel or OpenOffice, these appear as labels (Figure 2(b)). However, for
semantically capable applications, these encapsulated ontology IRIs are available for extraction
and processing alongside the data they annotate. This has the added benefit that the data is ready
for publishing into the Linked Data cloud.
Figure 3 describes the RightField architecture. RightField combines the use of the Apache POI
library to read and manipulate spreadsheets and the [22] OWL Application Programming Interfac
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
473
Figure 2. Excel spreadsheet using a Rightfield template.
Figure 3. The RightField implementation architecture.
(API) to read and process ontology files. In order to achieve the seamless integration with Excel,
RightField augments an existing Microsoft Excel Workbook with the data as follows. For each
distinct set of ontology terms selected by the informatician, RightField adds a ‘very hidden’
worksheet to the workbook. Each worksheet records each term in the set as a tuple containing the
human readable term name, its unique identifier (IRI) and the most specific ontology that ‘defines’
the term. If the associated defining ontology was obtained from the internet or from the BioPortal
ontology repository, then RightField records the ontology document IRI, both for versioning
purposes, and so that the original ontology can automatically be reloaded if the workbook is reopened in RightField at some point in the future. Having embedded each constraining term set as a
very hidden worksheet, RightField then adds Excel Data Validation constraint information to each
of the marked up cells, or ranges of cells, in the user worksheets. The Data Validation functionality
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
474
K. WOLSTENCROFT ET AL.
used is taken directly from Excel. Using out-of-the-box functionality such as this is one of the reasons
why RightField-enabled spreadsheets can be opened in Excel without the need for any special Excel
Add-Ins or platform specific native code — both of which would decrease the robustness of the
solution. In essence, the Excel Data Validation functionality links the ranges of cells in the very
hidden worksheets to the human readable names for the ontology terms in the user worksheet.
Given this markup ‘schema’, it is easy to map back from the values of cells that correspond to
human readable term names to the IDs of the corresponding terms and the ontologies that define them.
During the design of the markup ‘schema’, the possibility of embedding complete ontologies into
Excel Workbooks was considered. The motivation for this was that the exact copy of the ontology
that defines the constraining ontology terms would then be preserved for future use and extraction,
thus providing a robust provenance solution. With regard to providing such a solution, it is not
difficult to imagine how an ontology could be encoded into a grid that could then be written into an
Excel worksheet. However, some bio-ontologies contain hundreds of thousands of axioms that
define hundreds of thousands of terms. Since older versions Excel, but still widely used, impose
rather strict size limitations on the amount of data that can be stored in a worksheet, this solution
was discounted at an early stage.
2.1. Virtue in simplicity
The simplicity of this approach—fixing and embedding terms within hidden sheets—is a significant
feature to match the operational requirements of our target users (experimentalists and informaticians).
• Ontology stability: The choices of terms are stable once the spreadsheet is established. This ensures
that a series of experiments can be annotated with the same versions of the same ontologies. If an
ontology needs to be updated, the spreadsheet must be actively uploaded and modified within
RightField.
• Offline working: Linkage to the ontology server is only required at the time the template is
created. This enables experimentalists to add or view data offline and applications to view or process
(with limitations) the data without a connection to an ontology server.
• Cross platform support: Our experimentalists use Apple OS and Windows, often old versions of
Excel or OpenOffice and do not want plug-ins. Thus macros, visual basic code or platform-specific
libraries are out of bounds.
• Standard Tools: Several metadata collection efforts have sought to take advantage of the spreadsheet
metaphor by developing bespoke ‘spreadsheet like’ ‘data disposition tools’. However, our experimentalists wanted to use the mainstream, off the shelf tools they were already familiar with.
• Distribution: The self-contained and plain nature of the resulting datasheets means that they can
be stored, distributed and shared amongst scientists as normal, through email, shared file stores,
USB sticks and so on.
• Ontology flattening: Life science ontologies come in several formats (OBO, OWL, RDFS, SKOS)
and are often broad and deep trees. Some, such as the Gene Ontology, have tens of thousands of
terms. It is easy to become overwhelmed, especially as one data sheet requires multiple ontologies.
By picking out the relevant terms for a particular context, we have merged and flattened these
complex structures. A side-effect is the exposure of shallow ontologies with many subclasses and
instances. This can result in long drop-down boxes, making the annotation process more difficult
and prone to inaccuracies. Auto-complete functions for marked up cells can help relieve this problem
but it actually highlights a wider issue with the structure of some of the ontologies. A community
standard that specifies that users should select from over 50 terms, for example, may indicate that
the ontology requires further development in that area. Defining more specific subclasses and
splitting any instances between them would reduce this problem much more effectively and would
benefit other users of the ontology.
2.2. RightField in practice
In the SysMO consortium, spreadsheets are prepared to conform to the ‘Just Enough Results Model’
(JERM). The JERM is the SysMO-DB minimum information model. It describes what type
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
475
of experiment was performed, who performed it and what was measured. It defines and describes
the relationships between SysMO assets (i.e. different types of data, mathematical models and
experimental methods) and allows JERM representations to be mapped to MIBBI models
where available.
Experiments in SysMO are related to one another using the ISA-TAB format [23]. Investigations,
Studies, Assays Tab (ISA-TAB) is a tabular representation for describing how experiments (Assays)
are grouped into wider Studies and Investigations. It allows multiple ’omics investigations to be
linked together and described and provides a format to submit data into existing databases, such as
ArrayExpress [24] and PRIDE [25]. This is a typical use for RightField. Research groups performing
transcriptomics, metabolomics and proteomics experiments can use RightField-enabled templates to
capture and therefore integrate the results from these multiple studies. The templates can ensure that
biological entities in each data set (genes, metabolites and proteins etc.) are named consistently and
that each data set is compliant to the appropriate standard (e.g. MIAME, Minimum Information
About a Cellular Assay and Minimum Information about a Proteomics Experiment, respectively).
The combination of embedding ontology term selection and standardising the content and format of
spreadsheets provides a simple mechanism to ensure data is consistently annotated and compliant with
community standards, ‘by stealth’. The aim is to make it easier to use appropriate ontology terms than
not to do so. The result for the SysMO consortium is a growing corpus of experimental data that is
easier to compare and explore. RightField-enabled templates for transcriptomics data are currently
the most widely used due to the more stringent reporting standards for transcriptomics publications
compared with other areas. A collection of RightField-enabled templates developed for SysMO can
be found here: http://www.sysmo-db.org/rightfield/templates.
The next step for SysMO-DB is the exploitation of the emerging corpus of RightField annotated
data. By extracting and storing the data in RDF, based on specifications and mappings to the
JERM ontology (http://bioportal.bioontology.org/ontologies/45133), we are able to provide further
mechanisms for searching across the content of spreadsheets and publishing SysMO data as Open
Linked Data [26]. Compliance to community ontologies means that the system can already use Web
Services from the BioPortal for term lookup and visualisation.
2.3. Related work
The idea of using spreadsheets as a data collection mechanism is not novel. The developers of ISAInfrastructure tools have produced a suite of tools to design and validate ISA compliant data [27].
Ontology terms can also be added via an Ontology Lookup Service [28] and a BioPortal Plugin.
The ISA Creator GUI has the look and feel of a spreadsheet but can only perform some of the same
functions and is not a standard tool. Scientists must download and install the software and learn a
new environment, and more of the complexities of the ontologies are exposed making it unsuitable
for experimentalists. Other ontology annotation tools in the life sciences include Phenote, which
assists with the annotation of biological phenotypes (http://phenote.org), and the PRIDE Proteome
Harvest Spreadsheet submission tool (http://www.ebi.ac.uk/pride/proteomeharvest/), which assists
with the annotation and submission of proteomics data to the PRIDE public repository. These are
powerful annotation tools for specific biological disciplines, and are not generically applicable.
3. POPULOUS: ONTOLOGY DEVELOPMENT BY STEALTH
The Populous tool [29] is an extension to RightField that changes the focus from data annotation to
ontology development.
Developing logic-based ontologies in a language such as OWL or OBO is a specialised task, but
gathering the domain knowledge that forms the content of the ontology requires input from the domain
specialist [30]. Current ontology authoring tools, such as Protégé, require an understanding of OWL
semantics and ontology construction, as well as the need for a consistent application of axiom patterns
for related types of entities. These requirements suggest a separation of domain knowledge, and
axiomatization in constructing OWL ontologies is needed. The tabulation of entities and the attributes
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
476
K. WOLSTENCROFT ET AL.
that describe them is a common practice in developing OWL ontologies [31, 32]. Much of ontology’s
content can be authored as a set of axiom patterns or templates. Spreadsheets are commonly used
to tabulate the information that is used to populate these patterns. For example, the Ontology for
Biomedical Investigation [33] describes many forms of assay used in an investigation; each assay is
described in terms of its analyte, evaluent and unit. Quick Term Templates [32] provide a form style
table into which contributors to the ontology can add values that will form the ontology’s content when
combined with the relationships in the pattern associated with the template. As the details for each
assay can be easily tabulated by a domain expert and the axioms for the ontology can be readily
generated from this tabulated knowledge, a barrier to ontology development is overcome. This
separation allows the domain experts that know about the assays to describe them in terms of other
parts of OBI without writing complex OWL axioms and doing so in a consistent style across a large
number of assay types. Using a spreadsheet-based approach enables domain experts to use familiar
tools to achieve the complex task of ontology building.
Populous is a tool for gathering content and populating ontologies. It builds on the RightField codebase
and makes use of the ontology loading and spreadsheet display functionality. Like RightField, it is also
built on the premise that a tabular, template view is a natural way of gathering data. In this case, it is
used to separate the processes of gathering content for the ontology from the process of ontology
conceptualisation. The structure and axioms from the ontology are shielded from the user, who only
sees the entities related to one another in the spreadsheet table. Consequently, Populous is most useful
when a repeating ontology design pattern emerges that needs to be populated en-mass.
Consider the OBO cell type ontology [34] that contains a basic design pattern stating that every cell
type has a nucleation, where the nucleation is one of the anucleate, nucleate, bi-nucleate or
multinucleate. Populous can be used to create a simple template where column A captures the cell
type and column B captures the nucleation value. RightField can be used to restrict the allowable
values in column A to named cell types and column B to the set of valid nucleation terms. This
template can either be exported or modified in Microsoft Excel, or users can choose to work directly
inside the Populous RightField extension. Using Populous directly offers some additional benefits
over standalone Excel, such as the ability to auto-complete and search for terms using wildcard
string matches from within a spreadsheet cell. Figure 4 shows a populated template as it appears in
the Populous extension to RightField. Data that matches terms in the validation list are highlighted
in green, whereas incorrect or unknown terms are shown in red. The user is free to choose between
whether they want to work only in Excel documents or if they want to work on the same files
directly in Populous to benefit from some of the advance features Populous offers.
The template approach lowers the entry level for a user wishing to contribute to the ontology. Once the
template is populated, the information can be translated into an OWL representation using the OPPL that
comes packaged with the Populous extension. OPPL allows ontology design patterns to be specified in
terms of OWL axioms that include variables for the differentia. Each row in the table represents a
single instantiation of the pattern. The Populous OPPL wizard guides the user through the instantiation
of a particular design pattern by binding columns in the table to variables in the axiom. The output of
this process is a new OWL ontology along with a reproducible workflow that details how the ontology
was built. Capturing knowledge using constrained templates improves the consistency with which the
templates are populated. This reduces the amount of time required to validate the data and offers a way
to automate the workflow for ontology construction. Building ontologies in this way means we capture
the provenance of the ontology’s construction, such as how each axiom was constructed and who
generated the content. This separation of content from the axiom patterns offers new possibilities for
refactoring of ontologies. Ontology developers can explore different conceptualisation from the same
data captured in the template. A linked data application may only require a simple taxonomy from the
data that could be represented in SKOS but if classification and reasoning were important, the design
patterns could be substituted to build a more expressive ontology that both form the same template.
3.1. Populous in practice
RightField and the Populous extension are being used in the e-LICO project to develop ontologies
relating to the KUP. The KUPO describes the kidney’s cells, anatomy and diseases and is being
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
477
Figure 4. Populous, an ontology content population extension to RightField.
used to capture metadata about experiments relating to renal physiology and disease. The KUPO reuses many existing ontologies to build a much more detailed model of the kidney. Several design
patterns for the KUPO were specified in templates built using RightField. Biologists with little or no
prior knowledge about ontology development populated these templates using Populous. The
designed patterns were expressed in OPPL and used to generate the final KUPO. The templates,
OPPL scripts and generated ontologies can be found at http://www.e-lico.eu/kupo.
Initial evaluations showed that RightField-enabled spreadsheets produced datasets with greater
consistency of annotation and compliance with standards. It is possible to over-ride the term selection
and annotate cells with plain text, but users did not typically do so. Therefore, the annotation was
consistent with the allowed terms from the ontology. If a suitable term could not be found, the default
behaviour was to leave the parent class as annotation, which also produces consistently annotated data.
For ontology, users are not able to contribute new term suggestions without entering a curation and
review process, so this behaviour is desirable, but for local ontologies (e.g. the SysMO-JERM and the
KUPO), this mechanism could be used to identify and collect new terms by adding the Populous
validation extension into RightField.
In Populous, if users add terms to marked-up cells that do not feature in the range of terms from the
chosen ontology, Populous will highlight these. It could mean that these terms are errors, or it could
mean that the relevant ontology needs to be extended with new terms. In each case, the tool assists
with the curation process, identifying areas that require further investigation. In the development of
KUPO, the biologists identified several gaps in the existing anatomy and cell type ontologies. Over
200 new terms can now be extracted from the populated template for candidate submission to the
respective ontology development projects.
3.2. Related work
Protégé 4 ExcelImporter plugin and the MappingMaster plugin for Protégé 3 [31] generate OWL
axioms. This and RDF conversion tools such as Excel2RDF (http://www.mindswap.org/~rreck/
excel2rdf.shtml) XLWrap [35] and RDF123 [36] focus on the transformation of spreadsheet content
but pay little attention to how the data are collected in those spreadsheets. When generating
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
478
K. WOLSTENCROFT ET AL.
ontologies, in this way, a large portion of time is taken to ‘clean up’ and validate the populated
spreadsheets. RightField and the Populous extension enable the validation of input data at the source
and bridge the gap between the population on transformation of these spreadsheets. The template to
which users of spreadsheets in this form have to comply is implicit, rather than being exposed as a
collection of constrained entries to which the user complies. These constraints make it easier for the
user to find the terms they need for annotating data and make the task of transformation easier; all in
a stealthy form by using the very tools that the users deal with in their everyday tasks.
4. DISCUSSION
Many experimental biologists have no interest or experience in the use of ontologies but understand the
value of semantic annotation and quality metadata when attempting to reuse data from others. For
projects such as SysMO and e-LICO, the overall aims of the project include sharing and dissemination.
The more functionality that can be embedded in tools already in use, the more likely that semantic
annotation will be considered part of the data management process. There are already a plethora of
terminologies and information models available to support the consistent annotation of data, but
providing such schemas is only the first step. It should not be an extra task for the data providers to
describe their data in such a way that it can be interpreted and reused. This means that simple
methods for adding semantics need to be embedded in the data management process.
RightField and Populous are flexible and extensible curation tools that make semantic annotation
easier and more accessible to researchers: hence, annotation ‘by stealth’. They bridge the gap
between informaticians and data producers as data can be annotated accurately and at source by the
laboratory scientists who can also contribute terms to ontologies when needed. The advantages are
the following:
•
•
•
•
Reduction in errors and the time it takes to annotate the data to comply with community standards;
Greater understanding through choice restrictions of annotation terms to a small and manageable set;
Greater validation to ensure data annotation quality and provide more feedback during annotation; and
Community intelligence for information model and ontology developers to incrementally improve the
coverage and structure of the metadata standards they propose. The combination of the tools provides a
convenient platform for prototyping and piloting data collection standards by trying them out to collect
data, designing models through use. A similar approach was pioneered by the Pedro toolkit [37].
RightField and Populous use a simple, yet powerful and generic approach. The tools are already in use
in large biological initiatives, but there is nothing in the applications that make them specific to the
biological domain. RightField is already under evaluation for use for datasets in the oil and gas
industry. Any domain with heterogeneous data that requires annotation to common schemas and
ontologies could use them.
Excel2RDF, XLWrap and RDF123 enable the generation of RDF statements from spreadsheet data.
RightField and Populous lay the foundations for the structured extraction of data and knowledge and
present the possibility of contributing to Open Linked Data. Ongoing research will focus on
extracting spreadsheet data into RDF for storage and comparison to fully exploit the large corpus of
data that is being generated in the SysMO consortium and in the wider systems biology community.
ACKNOWLEDGEMENTS
This work was funded by the UK’s BBSRC award BBG0102181 and EU/FP7/ICT-2007.4.4 e-LICO project.
REFERENCES
1. Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, Twigger
S, White O, Rhee SY. Big data: the future of biocuration. Nature 2008; 455:47–50.
2. Bateman A. Curators of the world unite: the International Society of Biocuration. Bioinformatics 2010; 26:991.
3. St Pierre S, McQuilton P. Inside FlyBase: biocuration as a career. Fly (Austin) 2009;3:112–114.
4. Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, AmaralZettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S,
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
STEALTHY ANNOTATION OF EXPERIMENTAL BIOLOGY BY SPREADSHEETS
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
479
de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo CT, Forster MJ, Gaudet P, Gilbert J, Goble C, Griffin JL, Jacob
D, Kleinjans J, Harland L, Haug K, Hermjakob H, Ho Sui SJ, Laederach A, Liang S, Marshall S, McGrath A, Merrill
E, Reilly D, Roux M, Shamu CE, Shang CA, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios
I, Hide W. Toward interoperable bioscience data. Nature Genetics 2012; 44:121–126.
Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A,
Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM,
Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, LeebensMack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S,
Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, Scheuermann
RH, Schober D, Smith B, Snape J, Stoeckert CJ Jr, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S.
Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.
Nature Biotechnology 2008; 26:889–896.
Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, Jonquet C, Rubin DL, Storey MA, Chute CG, Musen
MA. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 2009;
37:W170–173.
Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P,
Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J, OBI consortium. Modeling biomedical experimental processes with OBI. Journal of Biomedical Semantics 2010; 1(Suppl 1):S7.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT,
Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM,
Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics
2000; 25:25–29.
Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E, Irizarry RA, Liu J, Maier DS, Miller M,
Petersen K, Quackenbush J, Sherlock G, Stoeckert CJ Jr, White J, Whetzel PL, Wymore F, Parkinson H, Sarkans U,
Ball CA, Brazma A. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC
Bioinformatics 2006; 7:489.
Rayner TF, Rezwan FI, Lukk M, Bradley XZ, Farne A, Holloway E, Malone J, Williams E, Parkinson H. MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 2009; 25:279–280.
Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics 2006;
7:257–274.
Rector A. Modularisation of domain ontologies implemented in description logics and related formalisms including
owl. Proceedings of the 2nd International Conference on Knowledge Capture, Sanibel Island, USA, 2003.
Wolstencroft K, Owen S, Horridge M, Krebs O, Mueller W, Snoep JL, du Preez F, Goble C. RightField: embedding
ontology annotation in spreadsheets. Bioinformatics 2011; 27:2021–2022.
Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Wolstencroft K, Stevens R. Populous: a tool for building
OWL ontologies from templates. BMC Bioinformatics 2012; 13(Suppl 1):S5.
Egana M, Rector A, Stevens R, Antezana A. Applying ontology design patterns in bio-ontologies. Knowledge Engineering: Practice and Patterns, Proceedings 2008; 5268:7–16.
Iannone L, Rector A, Stevens R. Embedding knowledge patterns into OWL. Semantic Web: Research and Applications
2009; 5554:218–232.
Jupp S, Klein J, Schanstra J, Stevens R. Developing a kidney and urinary pathway knowledge base. Journal of Biomedical
Semantics 2011; 2(Suppl 2):S7.
Atkinson M, De Roure D, van Hemert J, Michaelides D. Shaping ramps for data-intensive research. UK e-Science All
Hands Meeting, Cardiff, UK, 2010.
Müller W, Krebs O, Rojas I, Wolstencroft K, Owen S, Alexejevs S, Goble C, Snoep J. SysMO-DB: just
enough exchange for systems biology data and models. Microsoft Research eScience Workshop, Pittsburgh,
USA, 2009.
Wolstencroft K, Owen S, du Preez F, Krebs O, Mueller W, Goble C, Snoep JL. The SEEK: a platform for sharing
data and models in systems biology. Methods in Enzymology 2011; 500:629–655.
Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra
P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr. The MGED ontology: a resource for semantics-based description
of microarray experiments. Bioinformatics 2006; 22:866–873.
Horridge M, Bechhofer S. The OWL API: A Java API for OWL ontologies. Semantic Web 2010. DOI:
10.1.1.163.7035
Sansone SA, Rocca-Serra P, Brandizi M, Brazma A, Field D, Fostel J, Garrow AG, Gilbert J, Goodsaid F, Hardy N,
Jones P, Lister A, Miller M, Morrison N, Rayner T, Sklyar N, Taylor C, Tong W, Warner G, Wiemann S, Members
of the RSBI Working Group. The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?”
Omics 2008; 12:143–149.
Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway
E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Brazma A, Sansone S.
ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Research 2005; 33:
D553–555.
Vizcaino JA, Reisinger F, Côté R, Martens L. PRIDE: data submission and analysis. Current Protocols in Protein
Science 2010, Chapter 25: Unit 25 24.
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe
480
K. WOLSTENCROFT ET AL.
26. Wolstencroft K, Owen S, Goble C, Nguyen Q, Krebs O, Müller W. RightField: semantic enrichment of systems
biology data using spreadsheets. IEEE International Conference on e-Science, Chicago, USA, 2012.
27. Rocca-Serra P, Brandizi M, Maguire E, Sklyar N, Taylor C, Begley K, Field D, Harris S, Hide W, Hofmann O, Neumann
S, Sterk P, Tong W, Sansone SA. ISA software suite: supporting standards-compliant experimental annotation and
enabling curation at the community level. Bioinformatics 2010; b26:2354–2356.
28. Cote R, Reisinger F, Martens L, Barsnes H, Vizcaino JA, Hermjakob H. The Ontology Lookup Service: bigger and
better, Nucleic Acids Research 2010; 38:W155–160.
29. Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Stevens R, Wolstencroft K. A tool for populating
ontology templates. Semantic Web Applications and Tools for Life Sciences (SWAT4LS), Berlin, 2010.
30. Suarez-Figueroa MC, Gomez-Perez A. NeOn methodology for building ontology networks: a scenario-based
methodology. Proceedings of the International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES
(S3T 2009), 2009.
31. O’Connor MJ, Halaschek-Wiener C, Musen MA. Mapping master: a flexible approach for mapping spreadsheets to OWL.
Proceedings of the 9th International Semantic Web Conference on The Semantic Web, Shanghai, China, 2010; 194–208.
32. Rocca-Serra P, Ruttenberg A, O’Connor MJ, Whetzel T, Schober D, Greenbaum J, Courtot M, Sansone SA,
Scheurmann R, Peters B. Overcoming the ontology enrichment bottleneck with quick term templates. International
Conference on Biomedical Ontology, 2009.
33. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC,
Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U,
Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment
(MIAME)-toward standards for microarray data. Nature Genetics 2001; 29:365–371.
34. Bard J, Rhee SY, Ashburner M. An ontology for cell types, Genome Biology 2005; 6:R21.
35. Langegger A, Woss W. XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. Semantic Web - ISWC
2009 Proceedings 2009; 5823:359–374.
36. Han LS, Finin T, Parr C, Sachs J, Joshi A. RDF123: from spreadsheets to RDF. Semantic Web - ISWC. Lecture Notes
in Computer Science 2008; 5318:451–466.
37. Garwood K, McLaughlin T, Garwood C, Joens S, Morrison N, Taylor CF, Carroll K, Evans C, Whetton AD, Hart S,
Stead D, Yin Z, Brown AJ, Hesketh A, Chater K, Hansson L, Mewissen M, Ghazal P, Howard J, Lilley KS, Gaskell
SJ, Brass A, Hubbard SJ, Oliver SG, Paton NW. PEDRo: a database for storing, searching and disseminating experimental proteomics data, BMC Genomics 2004; 5:68.
Copyright © 2012 John Wiley & Sons, Ltd.
Concurrency Computat.: Pract. Exper. 2013; 25:467–480
DOI: 10.1002/cpe

Download Report

Stealthy annotation of experimental biology by spreadsheets

Paperzz.com

Your Paperzz