Grant Agreement No. 601043
EU Project No: 601043 (Integrated Project (IP))
DIACHRON
Managing the Evolution and Preservation of the Data Web DIACHRON
Dissemination level: PU
Type of Document: O
Contractual date of delivery: M31
Actual Date of Delivery: M33
Deliverable Number: D9.3
Deliverable Name: Software prototype: Scientific Linked Data
Deliverable Leader: EMBL
Work package(s): WP9
Status & version: Final 1.2
Number of pages: XX
WP contributing to the deliverable: 9
WP / Task responsible:
Coordinator (name / contact): Helen Parkinson / [email protected]
Other Contributors: Simon Jupp, Tony Burdett, Olga Vrousgou
EC Project Officer: Federico Milani
Keywords: scientific linked data, ontology evolution
Abstract (few lines): This is the first evaluation of the DIACHRON platform for scientific linked data
Document Control
Title: Software prototype: DIACHRON Platform adaptation for Scientific Linked Data: design, implementation
Project: DIACHRON
URL: http://www.diachron-fp7.eu
Type: PROTOTYPE
Authors: Simon Jupp, Tony Burdett, Olga Vrousgou, Helen Parkinson
Origin:

Amendment History
| Ver. | Date | Author(s) | Description / Comments |
|---|---|---|---|
| 1 | 15/12/2015 | Olga Vrousgou, Simon Jupp | Initial version |
| 1.1 | 17/12/2015 | Helen Parkinson, Tony Burdett | Internal EMBL revision |
| 1.2 | 28/12/2015 | | Final revision with comments from partner reviews (UEDIN, BROX, INTRASOFT) |
TABLE OF CONTENTS
1 INTRODUCTION
1.1 SCIENTIFIC PILOT APPLICATION - UPDATED
1.2 DATA FLOW AND INFRASTRUCTURE
2 RESULTS
2.1 IMPLEMENTATION ISSUES
3 FUTURE WORK
3.1 CONCLUSION
4 APPENDIX A
1 INTRODUCTION
In this deliverable we report on Task 9.3, the updated documentation and implementation for the
scientific linked data pilot. In deliverable 9.2 we described the deployment of DIACHRON services for use
in a pilot around the Experimental Factor Ontology (EFO) Web application. The success of this pilot
motivated us to develop a new version of the EBI Ontology Lookup Service 1 (OLS) that included
DIACHRON functionality for ontology archiving and change detection over a wide range of biomedical
vocabularies.
The new OLS provides services for searching and visualising over 140 biomedical ontologies and also
provides an API for programmatic access. These ontologies are typically published as files on the Web, and
the release strategies for these files vary between ontologies. The OLS system aims to provide a central
point of access to ontologies for the semantic annotation of data. In deliverable 9.2, we described the
scenario where ontologies are continuously evolving alongside the data. Biological databases such as the
Gene Expression Atlas2 and the Biosamples Database3 use tools like Zooma4 to automatically map data to
ontologies, often in bulk mode before a major database release. This process introduces a dependency
between the annotated data and the snapshot of ontologies taken on that day. Over time the ontologies and
data become out of sync until the annotations are updated and this can have downstream consequences on
services built over those data.
The new OLS was designed to be simple and relatively lightweight and comprises two primary ontology
indexes that support search and browsing across ontologies. We provide a SOLR index for searching
ontology terms and use the Neo4j graph database to store the ontology structure. There would be a
considerable cost and overhead associated with storing every version of every ontology in our SOLR and
Neo4j indexes, so instead we only store the latest version of each ontology. We use DIACHRON as the
primary archiving system inside the new OLS, so every ontology version we see gets archived in the
DIACHRON platform. By using DIACHRON as an archive for the ontologies in OLS we can exploit the
change detection capabilities of DIACHRON to only provide long term storage of ontology changes, rather
than complete versions of each ontology. The DIACHRON component of the new OLS allows us to offer
novel services to our community of users around ontology term evolution.
Many database curation systems use the OLS to find terms that are used to annotate biological data. One of
the most important aspects of change relating to ontologies is the creation, deletion and editing of term
information. Being able to track additions to an ontology is useful to demonstrate activity, as ontology activity
is an important metric when considering which ontologies to use for data annotation. Ontologies that change
very little or have had no major changes for a long time may suggest the ontology is no longer actively
maintained and should not be used for data annotation. Ontology best practices [3] encourage developers to
avoid deleting ontology classes, and favour a term deprecation or obsoletion strategy so that terms remain in
the ontology or the term URIs remain resolvable. Using DIACHRON we are able to offer the functionality
demonstrated in the EFO pilot (deliverable 9.2) to all of the relevant ontologies used in biology.

1 Beta release of the new Ontology Lookup Service - http://www.ebi.ac.uk/ols/beta. The current version
of OLS is located at http://www.ebi.ac.uk/ols
2 Expression Atlas - https://www.ebi.ac.uk/gxa/home
3 Biosamples Database - https://www.ebi.ac.uk/biosamples
4 Zooma - http://www.ebi.ac.uk/spot/zooma

For this deliverable we have re-developed the OLS system to incorporate DIACHRON functionality. With
over 140 ontologies, representing over 3.5 million ontology terms, this pilot will evaluate the scalability of
the DIACHRON platform in a real-world application setting. The new OLS is currently in beta and is being
tested with our user community. Here we will report the design of the new pilot application and present the
current state of the application and detail any implementation issues.
1.1 SCIENTIFIC PILOT APPLICATION - UPDATED
OLS now provides the following new functionality:
1. An ontology loader that detects when external ontologies have changed
2. A search engine for querying the ontologies
3. Web services for programmatic access to the ontologies
4. A user interface for exploring and visualising the ontologies
The new OLS system is available at http://www.ebi.ac.uk/ols/beta.
OLS monitors external ontologies for new releases. Every night a process downloads each ontology from its
external location in the ontology’s native OBO or OWL format, computes an MD5 checksum of the downloaded
file, and compares it with the previous checksum to determine whether the file has been updated. Any change
in the file triggers a reload of the ontology into the OLS indexes.
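
The checksum comparison can be illustrated with a minimal sketch. It assumes the ontology file has already been downloaded to a local path and that the previous checksum is stored somewhere by the caller; the function names are hypothetical and are not taken from the OLS code base.

```python
import hashlib


def md5_checksum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, previous_checksum):
    """Compare the current checksum of `path` with the stored one.

    Returns (changed, current_checksum). A missing previous checksum
    (None) counts as a change, so a first download always triggers a load.
    """
    current = md5_checksum(path)
    return current != previous_checksum, current
```

A checksum only signals that the bytes differ; a file that was re-serialised without any semantic change would still trigger a reload under this scheme.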
OLS provides REST API access to the OLS ontologies. We extended the metadata we collect about each
ontology in the OLS system so that it now includes basic metadata such as ontology title, description and
author attribution along with configuration information, such as the predicate used for term synonyms or
definitions. The ontology configuration also includes a ‘loaded’ field that provides a timestamp for when the
ontology was last successfully loaded into the system. This additional ontology metadata provided by OLS is
used by our DIACHRON implementation to detect new ontology versions and automatically configure the
change detection components.
There are several steps that need to be executed before an ontology from OLS can be archived into
DIACHRON and for change detection to be performed:
1. Not all OLS ontologies are available in RDF. Some ontologies required by OLS are only published in
the Open Biomedical Ontology (OBO) format. A translation to OWL-RDF must be done first using
the Java OWL API.
2. Previously we used the HermiT OWL-DL reasoner to classify the ontology to compute any inferred
RDF triples and remove any redundant triples from the archive. Not all ontologies in OLS can be
classified efficiently with HermiT, so an enhancement to the Ontology Archiver in DIACHRON was
required to specify which reasoner to use. Information about the optional reasoner to use for a
particular ontology is readily available from the OLS API.
3. Previously for EFO, we were able to specify change detection for term labels, synonyms, definitions
and obsolete classes according to the predicates used by EFO. These were predefined changes in
DIACHRON as simple and complex changes for the EFO use case. However, not all ontologies in
OLS use the same predicates for label, synonyms, definitions and obsolete classes as EFO. We
wanted to execute change detection at the level of individual ontologies, and therefore needed to treat
each ontology as an independent DIACHRONIC dataset with its own set of associated changes.
Rather than specify complex changes for each ontology, we use the OLS API ontology configuration
data to dynamically define complex changes for a particular ontology when it is first archived.
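
As an illustration of step 3, the sketch below expands a per-ontology configuration into parameterised change definitions. The configuration keys (`label_predicate`, `synonym_predicates`, `definition_predicates`) and the output structure are assumptions made for this example; they are not the actual OLS API field names or the DIACHRON changes schema.

```python
def build_change_definitions(config):
    """Expand an ontology configuration into parameterised change definitions.

    `config` is a hypothetical dictionary; its key names are illustrative only.
    Each predicate yields one "Add" and one "Delete" definition.
    """
    roles = {
        "Label": [config["label_predicate"]],
        "Synonym": config.get("synonym_predicates", []),
        "Definition": config.get("definition_predicates", []),
    }
    definitions = []
    for role, predicates in roles.items():
        for predicate in predicates:
            for action, delta in (("Add", "added"), ("Delete", "removed")):
                definitions.append({
                    "name": f"{action} {role}",
                    "delta": delta,  # which side of the delta to match
                    "pattern": ("?A", predicate, "?literal"),
                })
    return definitions


# Example: an ontology that uses skos:prefLabel for labels.
mamo_like = {
    "label_predicate": "skos:prefLabel",
    "synonym_predicates": ["oboInOwl:hasExactSynonym"],
}
for definition in build_change_definitions(mamo_like):
    print(definition["name"], definition["pattern"])
```

Generating the definitions from configuration keeps the conceptual change names identical across ontologies while the matched predicate varies.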
We have developed an application of the DIACHRON platform that uses the OLS API to crawl for
ontologies that have been loaded into the system and automatically archive these ontologies into
DIACHRON. Our application uses data from the ontology configuration to automatically create a changes
schema in the DIACHRON changes ontology. This application monitors the OLS API for changes and will
archive new ontology versions and perform change detection between the latest and previous version of an
ontology automatically.
Table 1 shows an updated overview of the ontology changes monitored by DIACHRON for the OLS
application. This table has been updated from Table 1 in deliverable 9.2, which was primarily concerned
with changes in EFO. The revised table shows that some of the changes in the model are now
parameterized depending on a particular ontology’s configuration.
| Change with parameters | Description | Changes in model |
|---|---|---|
| Add Type Label | Detect when a new label is added to class | added: (<A>, <label predicate>, Literal) |
| Add Class | Detect when a new class is added to the ontology | added: (<A>, rdf:type, owl:Class) |
| Add Superclass | Detect when a new subclass axiom between named classes is added to the ontology | added: (<A>, rdfs:subClassOf, <B>) |
| Add Synonym | Detect when a synonym is added to a term in EFO | added: (<A>, <synonym predicate>, Literal) |
| Add Definition | Detect when a definition is added to a class | added: (<A>, <definition predicate>, Literal) |
| Mark As Obsolete | Detect when a term is made obsolete in the ontology | added: (<A>, rdfs:subClassOf, oboInOwl:ObsoleteClass) OR (<A>, owl:deprecated, Literal) |
| Delete Definition | Detect when a definition is deleted from a class | removed: (<A>, <definition predicate>, Literal) |
| Delete Synonym | Detect when a synonym is deleted from a term in EFO | removed: (<A>, <synonym predicate>, Literal) |
| Delete Label | Detect when a label is removed from a class | removed: (<A>, <label predicate>, Literal) |
| Delete Superclass | Detect when a subclass axiom between named classes is removed from the ontology | removed: (<A>, rdfs:subClassOf, <B>) |
| Delete Type Class | Detect when a class is removed from the ontology | removed: (<A>, rdf:type, owl:Class) |
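
The triple patterns in the table can be read as rules over the set difference between two versions. The following sketch is not the DIACHRON change detector; it is a simplified in-memory illustration that assumes triples are plain `(subject, predicate, object)` tuples of prefixed names and uses shortened change names. "Mark As Obsolete" is left out of this sketch.

```python
RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"
SUBCLASS_OF = "rdfs:subClassOf"


def classify_delta(old_triples, new_triples, label_predicate,
                   synonym_predicates=(), definition_predicates=()):
    """Classify added and removed triples into named change types.

    Returns a list of (change name, subject, object) tuples. Triples that
    match none of the patterns are ignored.
    """
    added = set(new_triples) - set(old_triples)
    removed = set(old_triples) - set(new_triples)
    changes = []
    for delta, verb in ((added, "Add"), (removed, "Delete")):
        for subject, predicate, obj in sorted(delta):
            if predicate == RDF_TYPE and obj == OWL_CLASS:
                kind = "Class"
            elif predicate == SUBCLASS_OF:
                kind = "Superclass"
            elif predicate == label_predicate:
                kind = "Label"
            elif predicate in synonym_predicates:
                kind = "Synonym"
            elif predicate in definition_predicates:
                kind = "Definition"
            else:
                continue
            changes.append((f"{verb} {kind}", subject, obj))
    return changes
```

Passing the label, synonym and definition predicates as arguments mirrors the parameterisation in the table: the rules stay fixed while the predicates come from each ontology's configuration.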
One of the challenges in extending this pilot to work with many ontologies is how to specify a number of
changes that are conceptually the same but may be implemented differently across ontology datasets. For
instance, DIACHRON changes originally specified an “Add Label” change as a simple (or basic) change that
is hard coded into the system. This change assumed that all label changes could be detected by the addition
of a new RDF triple where the predicate was rdfs:label from the RDFS vocabulary. When looking at the
ontologies in OLS we found ontologies, such as Mamo (http://www.ebi.ac.uk/ols/beta/ontologies/mamo),
that do not use rdfs:label but instead favour skos:prefLabel as their label predicate. In this scenario
we would like DIACHRON to still report the addition of label as a single type of change, at least at the
conceptual level, even though specific implementation at the vocabulary level is different.
A similar situation exists with the implementation of the “class obsoletion” in the ontologies in OLS.
Previously for EFO we defined “Mark as Obsolete” as a complex change that could be primarily detected by
the addition of an RDF triple where the predicate was rdfs:subClassOf and the object of the triple was
oboInOwl:ObsoleteClass. Many of the ontologies from the OBO foundry follow a term obsoletion strategy,
but the implementation of this differs from that of EFO. Standard practice for OBO ontologies is to use the
owl:deprecated annotation property to indicate obsoletion of classes. We currently have to preconfigure our
DIACHRON application to create the appropriate type of “mark as obsolete” complex changes depending on
the ontology.
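
The two obsoletion strategies described above can be sketched as a lookup from a per-ontology setting to a triple pattern. The strategy names and the representation of the owl:deprecated value as the string "true" are assumptions for this example only.

```python
# Hypothetical strategy names mapped to (predicate, object) patterns.
OBSOLETE_PATTERNS = {
    "obsolete_superclass": ("rdfs:subClassOf", "oboInOwl:ObsoleteClass"),
    "owl_deprecated": ("owl:deprecated", "true"),
}


def marked_obsolete(added_triples, strategy):
    """Return the sorted subjects newly marked obsolete under `strategy`.

    `added_triples` holds the (subject, predicate, object) tuples that were
    added between two versions of an ontology.
    """
    predicate, obj = OBSOLETE_PATTERNS[strategy]
    return sorted(s for s, p, o in added_triples if p == predicate and o == obj)
```

Both strategies report the same conceptual change, so a caller only needs to know which pattern a given ontology follows.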
1.2 DATA FLOW AND INFRASTRUCTURE
Figure 1. The program flow of the OLS crawler. The crawler is scheduled to run every night, pushing
new ontologies to DIACHRON, archiving new ontology releases, and running the change detection
service for updated ontologies.
Figure 1 shows the execution flow for our DIACHRON application. For each ontology in OLS, the
DIACHRON application does the following:
1. Using the OLS REST API, get the latest version that has been indexed in OLS.
2. Get the latest version that has been archived in the archiver.
3. If it is the first time that this ontology will be archived create a DIACHRON dataset id. We need to
know the dataset id before we can upload the “diachronised” dataset.
4. If it is the first time this ontology has been archived, define the ontology’s complex changes and
store them in the Virtuoso server.
5. If there is a new version of the ontology, download it from the file location specified by the
OLS API.
6. The ontology is downloaded in native OWL or OBO format. Convert this to DIACHRON RDF using
the archive service.
7. The “diachronised” version of the ontology is uploaded to the archive.
8. If there is a new version of the ontology, get its complex changes and see if the properties used have
been changed. If so, then the complex changes for this ontology are updated.
9. Get the latest stored version of the ontology from the archiver and run the change detection between
the old and the new version.
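
The steps above can be sketched as a single pass over the ontologies. This is a simplified illustration and not the Java crawler itself: the OLS REST API and the DIACHRON archive are replaced by in-memory stand-ins, the download, conversion and upload steps are collapsed into recording the version, and all names are hypothetical.

```python
def crawl_once(ols_ontologies, archive, detect_changes):
    """Run one pass of the crawler.

    ols_ontologies: list of dicts with 'id', 'version' and 'config'
        (stand-in for what the OLS REST API would return).
    archive: dict mapping ontology id to a record (stand-in for the archive).
    detect_changes: callable(old_version, new_version) standing in for the
        change detection service.
    Returns a list of (ontology id, action) tuples describing what was done.
    """
    actions = []
    for onto in ols_ontologies:
        oid, latest = onto["id"], onto["version"]            # step 1
        record = archive.get(oid)                            # step 2
        if record is None:                                   # steps 3 and 4
            record = {
                "dataset_id": f"dataset-{len(archive) + 1}",
                "config": dict(onto["config"]),
                "versions": [],
            }
            archive[oid] = record
            actions.append((oid, "created dataset"))
        previous = record["versions"][-1] if record["versions"] else None
        if latest == previous:
            continue                                         # nothing new
        record["versions"].append(latest)                    # steps 5 to 7
        actions.append((oid, f"archived {latest}"))
        if previous is not None:
            if record["config"] != onto["config"]:           # step 8
                record["config"] = dict(onto["config"])
                actions.append((oid, "updated complex changes"))
            detect_changes(previous, latest)                 # step 9
            actions.append((oid, f"detected changes {previous} -> {latest}"))
    return actions
```

Because the pass is idempotent when nothing has changed, it can be scheduled nightly without producing duplicate archive entries.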
EMBL-EBI uses virtualized environments for deploying software. EMBL-EBI also has a large computing
cluster running a Load Sharing Facility (LSF), where batch workloads and data-intensive applications can
run. The DIACHRON integration layer delivered in 6.2 provides a service layer for interacting with
individual DIACHRON components, such as archiving services and change detection. The DIACHRON
Archiving service is the core component of the DIACHRON platform and provides services for both
archiving and retrieving diachronic datasets. The Archive module is currently using the OpenLink Virtuoso
database (v7.2.2.1) for persistent storage. All of the DIACHRON services are being developed as Java Web
applications that can be deployed in any Java servlet container. EMBL-EBI is restricted to using Apache
Tomcat for running Java web applications. Figure 2 shows the two dedicated Tomcat applications that are
currently deployed for the pilot applications. One Tomcat sits outside the firewall and provides public access
to the DIACHRON integration layer; the second Tomcat sits behind the firewall but is visible to the
integration layer application.
Figure 2. EMBL-EBI infrastructure for DIACHRON pilot
2 RESULTS
We recently ran our updated pilot application against the full set of ontologies in OLS. Using the current
DIACHRON archiving implementation, we were able to successfully parse and archive 75 ontologies out of
142. There can be several reasons why the remaining 67 ontologies might have failed to archive. Issues
already encountered are missing ontology files in owl:import statements or ontologies that are unsatisfiable
when classified with a reasoner. We have also encountered some instability and scaling problems when
archiving large ontologies, such as the NCBI taxonomy that contains over 1.3 million terms and nearly 10
million RDF triples.
The total execution time for archiving all 75 ontologies in parallel on our compute cluster was approximately
20 minutes. This process now runs as a nightly cron job that checks for new ontologies in OLS, archives
them, and runs change detection over new ontology versions as they become available.
In deliverable 9.2, we presented a pilot application that was integrated into the EFO website and could
be used to view changes for the EFO. In this deliverable we can now show the same visualisations for all
ontologies in OLS that can be successfully archived with DIACHRON. We have extended our JavaScript
library to support visualising changes in all the ontologies loaded into the OLS DIACHRON platform.
Figure 3 outlines the infrastructure being deployed for the user interface components, allowing the new OLS
website to use our JavaScript library to generate an interactive visualisation of changes for every ontology
loaded.
Figure 3. The OLS architecture showing how the Web application communicates with the public DIACHRON
integration layer services to generate the visualisation of changes.
Figure 4 is a screenshot from the new OLS, showing changes for the popular Gene Ontology (GO) over a
two-week period. This view alone is extremely informative about the development of GO. We can see that
at least 10 and up to 40 new terms are added in each release. We also see that the 2015-12-01 release
included some changes to labels and synonyms, and that some terms were obsoleted. The bar gives us a
visualisation of the evolution of GO, which will be a useful feature of the new OLS website for our users.
Figure 4. Pilot application updated to work with the Gene Ontology
2.1 IMPLEMENTATION ISSUES
In extending our pilot application and the development of new OLS we encountered several issues with the
DIACHRON platform that we have worked with the platform developers to resolve.
1. Through the Integration Layer and the Change Detector APIs, we could not specify the Dataset_Uri
to be used when creating the change schema (the definition of complex changes) or when running
complex change detection. A Dataset_Uri option was added to the API at both the Integration
Layer and the Change Detection service (the ‘Change Detection Implementation’ and ‘Complex Change
Implementation’ services).
2. We required the ability to use different reasoners before the OWL to DIACHRON model conversion
step in the Archive service. The latter services were updated to support different OWL reasoners and
a new optional parameter was added to the REST API to specify the type of reasoner required.
3. OWL ontologies have a huge amount of information encoded in the RDF file. Early on in the project
we added the ability to filter on the predicates of the RDF triples that get
archived. We did this so we could use DIACHRON to focus only on the triples in the ontology we care
about for our application use cases. The ability to specify filters was previously hard coded into the
Java API but this has now been exposed via the REST API.
4. After creating a new diachronic dataset we could not access the dataset instance id through the
Integration Layer API endpoint (/webresources/archivingService/query) without restarting Tomcat due
to caching issues. This bug has been fixed.
5. After archiving several versions of a dataset, Tomcat would halt and need a restart, due to
connections that were left open. This bug has been fixed in the Archiver.
6. After archiving a new version, the metadata of the stored version (versionNumber, recordSet) could
not be retrieved through the API endpoint (/diachron/archive/templates) unless Tomcat was
restarted. The cache issue has been fixed in the Archiver.
7. There remains no service in the integration layer for archiving datasets. We are currently required to
archive datasets directly via the archiving layer. As the archiving layer has no security authentication
(handled by the integration layer) we cannot expose our archiving service on a public server.
Archiving is currently done behind our firewall, but the resulting data can still be queried via our
public integration layer services.
3 FUTURE WORK
• Enhance OLS archiving code to support the full set of 142 ontologies.
• Migrate JavaScript components to production OLS web application.
• Production deployment of new OLS due March 2016.
3.1 CONCLUSION
We have presented a new Ontology Lookup Service that has been developed using DIACHRON components
for archiving and change detection over multiple biomedical ontologies. We have shown that the
DIACHRON platform now provides a generic infrastructure for archiving OWL ontologies and includes
services for detecting a range of changes within those ontologies. There remain some challenges in scaling
this application up before we can move it to our production server. Once available we will be in a position
to evaluate the new system with our large user base.
4 APPENDIX A
Workshop paper submitted to EDBT DIACHRON workshop.
Biomedical Ontology Evolution in the EMBL-EBI Ontology Lookup Service

Olga Vrousgou, Tony Burdett, Helen Parkinson, Simon Jupp
European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory,
Cambridge, United Kingdom

ABSTRACT
As ontologies are playing an increasingly important role in data
annotation on today’s Semantic Web, there is a need to observe
their constant evolution. At EMBL-EBI ontology preservation and
curation is of growing value, and new tools are being developed to
help undertake these tasks. One of these tools is the Ontology
Lookup Service, which holds 140 public bio-medical ontologies
that are updated when new ontology versions become available. To
track changes in these ontologies we have utilized DIACHRON, a
platform for tracking the evolution of RDF documents. With
DIACHRON in OLS we can track and store the changes between
ontology releases and view the differences through a graphical
interface.
CCS Concepts
• Web Ontology Language (OWL) • Ontologies
Keywords
Ontologies, OWL, Data evolution
1. INTRODUCTION
Data integration is intrinsic to how modern research is undertaken
in areas such as genomics, drug development and personalised
medicine. To better enable this integration a large number of
biomedical ontologies have been developed by the community to
provide common metadata vocabularies. There are now several
hundred biomedical ontologies in widespread use that describe
concepts such as genes, molecules, drugs and diseases. This
amounts to millions of terms that are interconnected via
relationships that naturally form a graph of biological
knowledge.
The European Molecular Biology Laboratory - European
Bioinformatics Institute (EMBL-EBI) is a major provider of
bioinformatics services and is involved in the preservation of data
and the curation of data so that it can be served back to the
community in novel and useful ways. A large part of the curation
and added value offered by EMBL-EBI is via the semantic
annotation of data with ontologies. Ontologies can make data
more interoperable and can be used to enhance search[1] and
visualization[2] applications over data.
One of the challenges in working with highly interconnected data
is dealing with elements that change. Biomedical ontologies aim
to represent the state of knowledge in biology, but this is
constantly changing as new biology is discovered. As ontologies
develop to describe all domains of biology they must be regularly
updated with new terms or refinement of existing terms to stay
relevant with the data they describe. Tracking the changes within
the ontology alongside the changes in the data is especially
challenging. Typically the ontologies are developed independently
of the data they annotate, so when ontologies change, it can be
difficult to propagate that change to all the datasets that were
described with those ontologies.
The DIACHRON project1 has been developing technology for
monitoring the evolution of data on the Web. Dataset versions can
be archived in the DIACHRON system and there are components
for detecting and reporting on changes in the data [3]. The
DIACHRON platform is able to archive and monitor changes in
data expressed in the W3C Resource Description Framework2
(RDF) [4], thus making it suitable for archiving ontologies
expressed in the W3C Web Ontology Language3 (OWL) that can
be serialised in RDF.
The EMBL-EBI has recently developed a new version of the
Ontology Lookup Service4 (OLS) that includes DIACHRON
functionality for archiving and executing change detection over
ontologies. In this paper we present the design and
implementation of the DIACHRON functionality adapted for the
OLS system and evaluate the DIACHRON functionality for
monitoring ontology evolution.

1 http://www.diachron-fp7.eu
2 http://www.w3.org/RDF/
3 http://www.w3.org/OWL/
4 http://www.ebi.ac.uk/ols/beta/

Table 1. Abstracted ontology changes detected by DIACHRON
2. REQUIREMENTS
OLS provides access to over 140 bio-medical ontologies.
Ontology documents typically reside on the Web and the OLS
system monitors these documents for updated versions, and
automatically loads new ontologies into the system when a new
version is released. OLS provides services for searching and
visualising these ontologies and also provides an API for
programmatic access. Many database curation systems use the
OLS to find terms that are used in annotating biological data. This
process introduces a dependency on ontologies from the annotated
data, so any changes in the ontologies can have downstream
consequences on the data and any services built over those data.
OLS requires the ability to track terms seen in ontologies over
time to better assist applications that rely on these ontologies.
There are a large number of changes that one might consider
tracking within an ontology. One of the most important aspects of
change relating to ontologies is the creation, deletion and editing
of term (or class level) information. Being able to track additions
to an ontology is also a useful metric of ontology activity.
Ontologies that haven’t been updated for a long time may suggest
the ontology is no longer actively maintained. Ontology best
practices[5] encourage developers to avoid deleting ontology
classes, and rather use a deprecation or obsoletion strategy so that
terms remain in the ontology or at a minimum the term URIs
remain resolvable. This is important for applications that use third
party ontologies for data annotation and rely on those terms
resolving on the Web. In practice there are still many reasons why
an ontology term may get deleted, or a URI may be refactored, so
old URIs are frequently no longer accessible.
Deletions in OWL, when viewed at the RDF triple level, are
typically associated with a kind of edit. For example, moving a
class in the ontology class hierarchy would typically be detected
at the RDF triple level as a removal of a triple with an
rdfs:subClassOf predicate with the addition of a new triple with
the same subject and predicate, but a different object. The ability
to detect such changes in an ontology is useful as many edits
involving the rdfs:subClassOf predicate may suggest an ontology
is undergoing a major rearrangement and as such could have a
potential impact on any application that makes an assumption
about the structure of the hierarchy. There are also other types of
non-logical edits, such as to term metadata that may include the
editing of a term label or addition of new synonyms or definitions
that might indicate a refinement or change to the interpretation of
a term.
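
A class move of the kind described above can be recognised by pairing removed and added rdfs:subClassOf triples that share a subject. The sketch below assumes the delta between two versions is already available as two collections of `(subject, predicate, object)` tuples; it is an illustration and not the DIACHRON complex change definition.

```python
from collections import defaultdict

SUBCLASS_OF = "rdfs:subClassOf"


def detect_class_moves(added, removed):
    """Pair removed and added rdfs:subClassOf triples that share a subject.

    Returns {subject: (old_parents, new_parents)} for every class that lost
    at least one parent and gained at least one parent between versions.
    """
    old_parents, new_parents = defaultdict(set), defaultdict(set)
    for subject, predicate, obj in removed:
        if predicate == SUBCLASS_OF:
            old_parents[subject].add(obj)
    for subject, predicate, obj in added:
        if predicate == SUBCLASS_OF:
            new_parents[subject].add(obj)
    moved = old_parents.keys() & new_parents.keys()
    return {s: (sorted(old_parents[s]), sorted(new_parents[s])) for s in moved}
```

Counting the entries in the result gives a rough indicator of how much of the hierarchy was rearranged in a release.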
Table 1 outlines the major ontology changes that the OLS
application is concerned with detecting. These changes are
restricted to those that can be expressed at the level of RDF
triples.
| Ontology change | Change triplet |
|---|---|
| Addition of new class | Addition of ?x rdf:type owl:Class |
| Removal of class | Removal of ?x rdf:type owl:Class |
| Edit label | Addition of ?x <label property> <OWL literal>; Removal of ?x <label property> <OWL literal> |
| Edit synonym | Addition of ?x <synonym property> <OWL literal>; Removal of ?x <synonym property> <OWL literal> |
| Edit definition | Addition of ?x <definition property> <OWL literal>; Removal of ?x <definition property> <OWL literal> |
| Class move | Addition of ?x rdfs:subClassOf ?owlClass; Removal of ?x rdfs:subClassOf ?owlClass |
| Class obsoletion | Addition of ?x owl:deprecated “true” |
2.1 DIACHRON services
The DIACHRON platform is comprised of several components
that are accessible via Web services. The OLS application is
currently utilizing services from three of the components, namely,
the DIACHRON archive, the change detection, and the integration
layer.
• The archive service is responsible for converting the
ontology versions represented in OWL to the
DIACHRON RDF model. Once converted, the
“diachronised” RDF can be uploaded into the archive
via the archiving API.
• The change detection service is responsible for the
calculation of changes between two ontology releases.
There are two types of changes, simple and complex.
Simple changes capture all the additions and deletions
of all terms in an ontology and are predefined. Complex
changes are user defined changes based on specified
collections of simple changes.
• The integration layer is designed to provide an
abstraction over the underlying services and handle
security and mediation of services via a single point of
entry to the DIACHRON platform.
3. METHOD
Tracking ontology evolution is the primary use case for
DIACHRON within EMBL-EBI. OLS currently deals with
tracking external ontologies for changes and indexing any new
releases. OLS provides a REST API to the available ontologies
that includes information about when an ontology was last
updated. We have developed a Java application that crawls the
OLS API for each new ontology release, retrieves the latest
ontology version and archives it into the DIACHRON platform.
As new versions become available via the OLS API, the
DIACHRON platform will archive the newer version and run
change detection between versions.
For a delta of changes to be produced between the new release
and the previous version of an ontology, the latest versions must
always be archived in a uniform way. To adapt
DIACHRON for use in OLS, we need to configure the change
types in Table 1 to reflect how certain features are implemented in
a particular ontology. For instance, most ontologies have a notion
of synonyms, but the predicates used for synonyms vary between
ontologies. We would like to use DIACHRON to detect changes
such as “synonym edit” at a conceptual level, without worrying
about implementation details. The deltas between ontologies need
to be stored and presented in a user-friendly way in the OLS
application.
The new OLS platform provides an extensive list of API
endpoints that give developers easy access to its
underlying data. Using this API we are able to retrieve
information about the ontologies stored in OLS, such as
the location from which each ontology is downloaded, the
latest version that was indexed into OLS, and
the predicates used for labels, synonyms and definitions.
DIACHRON provides an ontology for describing changes
according to the underlying DIACHRON model. There is also an
API for creating change types for a given dataset, which is exposed
as a Web service in DIACHRON. The OLS crawler reads the ontology
configuration from the OLS API, creates a new dataset and
specifies DIACHRON changes using the DIACHRON service
API. The flow for adding OLS data to DIACHRON is as follows
(also depicted in Figure 1).
For each ontology in OLS:
1. From the OLS API, get the latest version that has been
indexed in OLS.
2. Get the latest version that has been archived in the
archiver.
3. If this is the first time the ontology will be archived,
create a DIACHRON dataset id. We need to know the
dataset id before we can upload the “diachronised”
dataset.
4. If this is the first time the ontology has been archived,
define the ontology’s complex changes and store them
via the changes API.
5. If there is a new version of the ontology, download it
from the file location provided by the OLS API.
6. The ontology is downloaded in native OWL or OBO
format, then submitted to the DIACHRON archive
service.
7. If this is a new version of an ontology that has been
previously archived, change detection between the new
and old versions is executed using the changes
API. The results of the change detection are also stored
in the archive and made available via the integration
layer.
Figure 1. Program flow of the OLS crawler. The crawler is
scheduled to run every night, pushing new ontologies to
DIACHRON, archiving new ontology releases, and running
the change detection service for updated ontologies.
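The per-ontology flow of the crawler can be summarised in a short sketch. Everything here is an assumption made for illustration: InMemoryDiachron is a hypothetical stand-in for the archive and changes services, the ontology record is a plain dictionary rather than an OLS API response, and the download and OWL conversion steps are omitted.

```python
class InMemoryDiachron:
    """Hypothetical stand-in for the DIACHRON archive and changes services."""

    def __init__(self):
        self.datasets = {}         # ontology id -> archived versions, oldest first
        self.complex_changes = {}  # ontology id -> change configuration
        self.deltas = []           # (ontology id, old version, new version)

    def latest_archived(self, ontology_id):
        versions = self.datasets.get(ontology_id)
        return versions[-1] if versions else None


def crawl(ontology, diachron):
    """Process one ontology record (hypothetical keys: id, version, config)."""
    oid = ontology["id"]
    latest_indexed = ontology["version"]             # latest version indexed in OLS
    latest_archived = diachron.latest_archived(oid)  # latest version in the archive

    if oid not in diachron.datasets:
        # First time this ontology is seen: create the dataset and
        # register its complex changes before any upload.
        diachron.datasets[oid] = []
        diachron.complex_changes[oid] = ontology["config"]

    if latest_indexed == latest_archived:
        return "unchanged"

    # Download and submission to the archive service would happen here.
    diachron.datasets[oid].append(latest_indexed)

    if latest_archived is not None:
        # A previous version exists, so change detection can run.
        diachron.deltas.append((oid, latest_archived, latest_indexed))
        return "archived and diffed"
    return "archived"
```

Calling crawl once per ontology on a nightly schedule reproduces the behaviour described in the caption: the first run only archives, and later runs also record a delta when the indexed version differs from the archived one.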
As mentioned above, different complex changes must be created
for each ontology, because ontologies use different
properties to describe a term. We therefore define a set
of complex changes for each ontology, based on the properties
defined in the OLS API. It is also essential to update the complex
changes if a term property changes from one version to another. The
main challenge we face here is that a complex change may not
only use different properties in different ontologies, but may also
have an entirely different implementation. For example, when a
term is marked as obsolete in EFO, it is moved to be a subclass of
the “ObsoleteClass” class. Other ontologies, such as the Gene
Ontology, use the owl:deprecated property to indicate term
obsoletion. We have added configuration to our
OLS crawling application that specifies the obsolete class strategy
used by a particular ontology. This requires a priori knowledge of
the strategy used by an ontology, and the additional configuration
must be maintained to ensure that it remains in sync with each
ontology.
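A minimal sketch of such a per-ontology obsoletion configuration is given below. The enum, the mapping keys and the returned pattern strings are hypothetical and chosen for illustration; they do not reproduce the configuration format of the OLS crawler or the DIACHRON changes API.

```python
from enum import Enum


class ObsoletionStrategy(Enum):
    SUBCLASS_OF_OBSOLETE_CLASS = "subclass"  # strategy described for EFO
    OWL_DEPRECATED = "deprecated"            # strategy described for the Gene Ontology


# Hypothetical configuration, maintained by hand for each ontology.
STRATEGIES = {
    "efo": ObsoletionStrategy.SUBCLASS_OF_OBSOLETE_CLASS,
    "go": ObsoletionStrategy.OWL_DEPRECATED,
}


def obsoletion_pattern(ontology_id):
    """Return the added triple pattern that signals a class obsoletion."""
    strategy = STRATEGIES[ontology_id]
    if strategy is ObsoletionStrategy.SUBCLASS_OF_OBSOLETE_CLASS:
        # "ObsoleteClass" is a placeholder for the ontology's obsolete class IRI.
        return ("?x", "rdfs:subClassOf", "ObsoleteClass")
    return ("?x", "owl:deprecated", '"true"')
```

An ontology missing from the mapping raises a KeyError, which makes the need for a priori knowledge explicit rather than silently guessing a strategy.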
4. RESULTS
Changes detected between ontology versions are stored in the
DIACHRON platform according to the DIACHRON changes
ontology schema [6]. These changes can be queried directly with
SPARQL or through a REST API endpoint provided by
DIACHRON, which returns a JSON representation of the changes. This
changes API is used by OLS to visualize DIACHRON changes.
Figure 2 shows an overview of changes in the Experimental
Factor Ontology over a 12-month period (EFO version 2.50
through 2.59). This visualisation gives a broad overview of
changes in the ontology, color-coded according to the
change types defined in Table 1. This provides a useful visual
history of the evolution of EFO. We can easily see
that in general the ontology changes very little between releases,
with only a handful of new terms added in each release.
However, there are two very clear spikes in 2.56 and 2.57, where
a large number of deletions were followed in the next version by a
large number of additions. This spike exposes a problem with the
2.56 release, in which many terms disappeared; the problem was
quickly fixed in the 2.57 release. Such a dramatic change is a
possible indicator that 2.56 should not be used and that any application
relying on it should update its EFO version.
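The kind of spike described above could also be flagged automatically. The sketch below is only an illustration of the idea: the function name, the median-based threshold and the factor of 10 are assumptions, and the input is a plain mapping from release label to a change count rather than the output of the DIACHRON changes API.

```python
from statistics import median


def flag_spikes(counts, factor=10):
    """Return releases whose change count exceeds factor times the median.

    counts maps a release label to the number of changes of one type
    (for example deletions) detected for that release.
    """
    baseline = max(median(counts.values()), 1)  # avoid a zero threshold
    return [release for release, n in counts.items() if n > factor * baseline]
```

With invented counts such as {"r1": 5, "r2": 7, "r3": 400, "r4": 6}, only "r3" is flagged. Using the median rather than the mean keeps the baseline from being inflated by the spike itself.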
Figure 3 shows how we can use the visualization to drill down
into specific changes for a given release. We have also connected
the visualization component to the JIRA tracking system for
EFO, so changes can be viewed alongside tickets that
were closed for that release. This feature allows users to see
changes in the context of the work that was performed for that
release.
Figure 2. EFO changes from release 2.50 - 2.59.
Figure 3. Reported changes from EFO release 2.56.
5. CONCLUSION
We have presented an update to the EMBL-EBI Ontology Lookup
Service to include DIACHRON-based change tracking of
ontologies. The DIACHRON platform uses the OLS API to detect
when new ontology versions have been indexed and archives new
versions in the DIACHRON archive. DIACHRON changes are
configured according to properties outlined in the OLS ontology
configuration, so that a range of abstracted common changes can
be detected between all of the ontologies in OLS. We have
developed user interface components to assist users in searching
and browsing ontology changes.
This work presents a pragmatic and scalable approach to change
detection in ontologies. Computing true change detection in OWL
ontologies is still an area of active research. Ontology change
detection at the level of RDF triples alone is limited to fairly
trivial changes in the ontology. However, we have illustrated that
this is sufficient to highlight some key aspects of change within
ontologies, and this may be sufficient for the users of these
ontologies. We are not attempting to report on all ontological
changes, but instead require a system that tracks terms and basic
term metadata over time. In this scenario DIACHRON provides
us with a convenient platform for tracking ontology evolution at
this level.
The addition of DIACHRON functionality to OLS provides the
OLS user community with a novel mechanism to access ontology
changes via the DIACHRON API. As the amount of biological
data annotated to ontology terms continues to grow, it is vital that
tools are in place to assist database developers and curators in
tracking the evolution of these vocabularies. The DIACHRON
functionality added to OLS now provides a platform for
monitoring these changes and can be exploited in downstream
services relating to data concurrency, consistency and integrity,
and may also be used to facilitate an aspect of data repair relating
to data-to-ontology annotation.
6. ACKNOWLEDGMENTS
We would like to acknowledge all members of the DIACHRON
consortium for their contribution to this work. In particular we
would like to thank Christos Pateritsas, Yannis Roussakis,
Hasapis Panagiotis, Marios Meimaris and Giorgos Flouris for
their contribution to components of the DIACHRON platform
utilized by the OLS application.
7. REFERENCES
[1] Petryszak, R. et al. 2013. Expression Atlas update – a
database of gene and transcript expression from microarray
and sequencing-based functional genomics experiments. In
Nucleic Acids Research, DOI=10.1093/nar/gkt1270.
[2] Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins
H, Klemm A, Flicek P, Manolio T, Hindorff L, and
Parkinson H. 2014. The NHGRI GWAS Catalog, a curated
resource of SNP-trait associations. In Nucleic Acids
Research, Vol. 42 (Database issue): D1001-D1006.
[3] Auer S., et al. 2012. Diachronic linked data: towards
long-term preservation of structured interrelated information. In
Proc. of the First International Workshop on Open Data
(WOD '12). ACM, New York, NY, USA, 31-39.
[4] Papavassiliou V., et al. 2009. On Detecting High-Level
Changes in RDF/S KBs. In Proceedings of the 8th International
Semantic Web Conference, October 25-29.
[5] Smith B., et al. 2007. The OBO Foundry: coordinated
evolution of ontologies to support biomedical data
integration. In Nature Biotechnology 25, 1251-1255.
DOI=10.1038/nbt1346.
[6] Roussakis Y., Chrysakis I., Stefanidis K., Flouris G.,
Stavrakas Y. 2015. A Flexible Framework for Understanding
the Dynamics of Evolving RDF Datasets. In International
Semantic Web Conference (1) 2015: 495-512.