Grant Agreement No. 601043

EU Project No: 601043 (Integrated Project (IP))
DIACHRON: Managing the Evolution and Preservation of the Data Web

Dissemination level: PU
Type of Document: O
Contractual date of delivery: M31
Actual Date of Delivery: M33
Deliverable Number: D9.3
Deliverable Name: Software prototype: Scientific Linked Data
Deliverable Leader: EMBL
Work package(s): WP9
Status & version: Final 1.2
Number of pages: XX
WP contributing to the deliverable: 9
WP / Task responsible Coordinator (name / contact): Helen Parkinson
Other Contributors: Simon Jupp, Tony Burdett, Olga Vrousgou
EC Project Officer: Federico Milani
Keywords: scientific linked data, ontology evolution
Abstract (few lines): This is the first evaluation of the DIACHRON platform for scientific linked data

Document Control
Title: Software prototype: DIACHRON Platform adaptation for Scientific Linked Data: design, implementation
Project: DIACHRON
URL: http://www.diachron-fp7.eu
Type: PROTOTYPE
Authors: Simon Jupp, Tony Burdett, Olga Vrousgou, Helen Parkinson
Origin:

Amendment History
Ver. | Date | Author(s) | Description / Comments
1 | 15/12/2015 | Olga Vrousgou, Simon Jupp | Initial version
1.1 | 17/12/2015 | Helen Parkinson, Tony Burdett | Internal EMBL revision
1.2 | 28/12/2015 | | Final revision with comments from partner reviews (UEDIN, BROX, INTRASOFT)

TABLE OF CONTENTS
1 INTRODUCTION
1.1 SCIENTIFIC PILOT APPLICATION - UPDATED
1.2 DATA FLOW AND INFRASTRUCTURE
2 RESULTS
2.1 IMPLEMENTATION ISSUES
3 FUTURE WORK
3.1 CONCLUSION
4 APPENDIX A

1 INTRODUCTION

In this deliverable we report on Task 9.3, the updated documentation and implementation of the scientific linked data pilot. In deliverable 9.2 we described the deployment of DIACHRON services for use in a pilot around the Experimental Factor Ontology (EFO) Web application. The success of this pilot motivated us to develop a new version of the EBI Ontology Lookup Service¹ (OLS) that includes DIACHRON functionality for ontology archiving and change detection over a wide range of biomedical vocabularies.

The new OLS provides services for searching and visualising over 140 biomedical ontologies and also provides an API for programmatic access. These ontologies are typically published as files on the Web, and the release strategies for these files vary between ontologies. The OLS system aims to provide a central point of access to ontologies for the semantic annotation of data. In deliverable 9.2, we described the scenario where ontologies are continuously evolving alongside the data.
Biological databases such as the Gene Expression Atlas² and the Biosamples Database³ use tools like Zooma⁴ to automatically map data to ontologies, often in bulk mode before a major database release. This process introduces a dependency between the annotated data and the snapshot of ontologies taken on that day. Over time the ontologies and data become out of sync until the annotations are updated, and this can have downstream consequences on services built over those data.

The new OLS was designed to be simple and relatively lightweight and comprises two primary ontology indexes that support search and browsing across ontologies. We provide a SOLR index for searching ontology terms and use the Neo4j graph database to store the ontology structure. There would be a considerable cost and overhead associated with storing every version of every ontology in our SOLR and Neo4j indexes, so instead we store only the latest version of each ontology. We use DIACHRON as the primary archiving system inside the new OLS, so every ontology version we see gets archived in the DIACHRON platform. By using DIACHRON as an archive for the ontologies in OLS we can exploit its change detection capabilities to provide long-term storage of ontology changes only, rather than complete versions of each ontology.

The DIACHRON component of the new OLS allows us to offer novel services to our community of users around ontology term evolution. Many database curation systems use the OLS to find terms that are used to annotate biological data. One of the most important aspects of change relating to ontologies is the creation, deletion and editing of term information. Being able to track additions to an ontology is useful to demonstrate activity, as ontology activity is an important metric when considering which ontologies to use for data annotation. An ontology that changes very little, or has had no major changes for a long time, may no longer be actively maintained and should not be used for data annotation. Ontology best practices [3] encourage developers to avoid deleting ontology classes, and favour a term deprecation or obsoletion strategy so that terms remain in the ontology or the term URIs remain resolvable. Using DIACHRON we are able to offer the functionality demonstrated in the EFO pilot (deliverable 9.2) to all of the relevant ontologies used in biology.

For this deliverable we have re-developed the OLS system to incorporate DIACHRON functionality. With over 140 ontologies, representing over 3.5 million ontology terms, this pilot will evaluate the scalability of the DIACHRON platform in a real-world application setting. The new OLS is currently in beta and is being tested with our user community. Here we report the design of the new pilot application, present its current state and detail the implementation issues encountered.

¹ Beta release of the new Ontology Lookup Service - http://www.ebi.ac.uk/ols/beta. The current version of OLS is located at http://www.ebi.ac.uk/ols
² Expression Atlas - https://www.ebi.ac.uk/gxa/home
³ Biosamples Database - https://www.ebi.ac.uk/biosamples
⁴ Zooma - http://www.ebi.ac.uk/spot/zooma

1.1 SCIENTIFIC PILOT APPLICATION - UPDATED

OLS now provides the following new functionality:

1. An ontology loader that detects when external ontologies have changed
2. A search engine for querying the ontologies
3. Web services for programmatic access to the ontologies
4. A user interface for exploring and visualising the ontologies

The new OLS system is available at http://www.ebi.ac.uk/ols/beta.

OLS monitors external ontologies for new releases by downloading the ontology from the external location every night in the ontology’s native OBO or OWL format.
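The nightly release check can be sketched as a checksum comparison. The example below is a minimal illustration only, assuming an MD5 digest computed over the downloaded bytes; `has_new_release`, the `stored` dictionary and the ontology id are hypothetical names and do not come from the OLS code base.

```python
import hashlib
from typing import Dict


def md5_checksum(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of a downloaded ontology file."""
    return hashlib.md5(content).hexdigest()


def has_new_release(ontology_id: str, content: bytes, stored: Dict[str, str]) -> bool:
    """Compare the new checksum with the previous one and record the new value.

    `stored` maps ontology ids to the checksum seen on the previous run; it is
    a hypothetical stand-in for wherever a loader would persist this state.
    """
    checksum = md5_checksum(content)
    changed = stored.get(ontology_id) != checksum
    stored[ontology_id] = checksum
    return changed


checksums: Dict[str, str] = {}
first_run = has_new_release("efo", b"ontology release 1", checksums)  # nothing stored yet
unchanged = has_new_release("efo", b"ontology release 1", checksums)  # same bytes as before
updated = has_new_release("efo", b"ontology release 2", checksums)    # content differs
```

In this sketch a first download always counts as a change, which matches the idea that any difference from the stored state triggers a reload.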
We use the MD5 hashing algorithm to create a checksum for each downloaded file. The nightly process compares the new checksum with the previous one to determine whether the file has been updated. Any change in the file triggers a reload of the ontology into the OLS indexes.

OLS provides REST API access to the OLS ontologies. We extended the metadata we collect about each ontology in the OLS system so that it now includes basic metadata such as ontology title, description and author attribution along with configuration information, such as the predicate used for term synonyms or definitions. The ontology configuration also includes a ‘loaded’ field that provides a timestamp for when the ontology was last successfully loaded into the system. This additional ontology metadata provided by OLS is used by our DIACHRON implementation to detect new ontology versions and automatically configure the change detection components.

Several steps need to be executed before an ontology from OLS can be archived into DIACHRON and change detection can be performed:

1. Not all OLS ontologies are available in RDF. Some ontologies required by OLS are only published in the Open Biomedical Ontology (OBO) format. A translation to OWL-RDF must be done first using the Java OWL API.

2. Previously we used the HermiT OWL-DL reasoner to classify the ontology to compute any inferred RDF triples and remove any redundant triples from the archive. Not all ontologies in OLS can be classified efficiently with HermiT, so an enhancement to the Ontology Archiver in DIACHRON was required to specify which reasoner to use. Information about the optional reasoner to use for a particular ontology is readily available from the OLS API.

3. Previously for EFO, we were able to specify change detection for term labels, synonyms, definitions and obsolete classes according to the predicates used by EFO. These were predefined in DIACHRON as simple and complex changes for the EFO use case. However, not all ontologies in OLS use the same predicates for labels, synonyms, definitions and obsolete classes as EFO. We wanted to execute change detection at the level of individual ontologies, and therefore needed to treat each ontology as an independent DIACHRONIC dataset with its own set of associated changes. Rather than specify complex changes for each ontology manually, we use the OLS API ontology configuration data to dynamically define complex changes for a particular ontology when it is first archived.

We have developed an application of the DIACHRON platform that uses the OLS API to crawl for ontologies that have been loaded into the system and automatically archive these ontologies into DIACHRON. Our application uses data from the ontology configuration to automatically create a changes schema in the DIACHRON changes ontology. This application monitors the OLS API for changes, archives new ontology versions and automatically performs change detection between the latest and previous version of an ontology.

Table 1 shows an updated overview of the ontology changes monitored by DIACHRON for the OLS application. This table updates Table 1 in deliverable 9.2, which was primarily concerned with changes in EFO. The revised table shows that some of the changes in the model are now parameterized depending on a particular ontology’s configuration.
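The parameterisation described above can be illustrated with a short sketch. It assumes that each ontology's configuration supplies one predicate each for labels, synonyms and definitions; the `OntologyConfig` class, the `change_patterns` function and the change names are hypothetical and do not reproduce the DIACHRON changes ontology or its API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
OBO_EXACT_SYNONYM = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"
OBO_DEFINITION = "http://purl.obolibrary.org/obo/IAO_0000115"


@dataclass(frozen=True)
class OntologyConfig:
    """Hypothetical subset of the per-ontology configuration."""
    label_predicate: str
    synonym_predicate: str
    definition_predicate: str


def change_patterns(config: OntologyConfig) -> Dict[str, Tuple[str, str]]:
    """Map each conceptual change to (delta kind, predicate) for one ontology."""
    concepts = {
        "Label": config.label_predicate,
        "Synonym": config.synonym_predicate,
        "Definition": config.definition_predicate,
    }
    patterns: Dict[str, Tuple[str, str]] = {}
    for concept, predicate in concepts.items():
        patterns[f"Add {concept}"] = ("added", predicate)
        patterns[f"Delete {concept}"] = ("removed", predicate)
    return patterns


# Two illustrative configurations that differ only in the label predicate.
rdfs_style = OntologyConfig(RDFS_LABEL, OBO_EXACT_SYNONYM, OBO_DEFINITION)
skos_style = OntologyConfig(SKOS_PREF_LABEL, OBO_EXACT_SYNONYM, OBO_DEFINITION)
```

Both configurations yield the same set of conceptual change names, while the predicate behind each name differs, which is the property the parameterised model is meant to provide.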
Table 1. Ontology changes monitored by DIACHRON for the OLS application

| Change with parameters | Description | Changes in model |
|---|---|---|
| Add Type Label | Detect when a new label is added to class | added: (<A>, <label predicate>, Literal) |
| Add Class | Detect when a new class is added to the ontology | added: (<A>, rdf:type, owl:Class) |
| Add Superclass | Detect when a new subclass axiom between named classes is added to the ontology | added: (<A>, rdfs:subClassOf, <B>) |
| Add Synonym | Detect when a synonym is added to a term in EFO | added: (<A>, <synonym predicate>, Literal) |
| Add Definition | Detect when a definition is added to a class | added: (<A>, <definition predicate>, Literal) |
| Mark As Obsolete | Detect when a term is made obsolete in the ontology | added: (<A>, rdfs:subClassOf, oboInOwl:ObsoleteClass) OR (<A>, owl:deprecated, Literal) |
| Delete Definition | Detect when a definition is deleted from a class | removed: (<A>, <definition predicate>, Literal) |
| Delete Synonym | Detect when a synonym is deleted from a term in EFO | removed: (<A>, <synonym predicate>, Literal) |
| Delete Label | Detect when a label is removed from a class | removed: (<A>, <label predicate>, Literal) |
| Delete Superclass | Detect when a subclass axiom between named classes is removed from the ontology | removed: (<A>, rdfs:subClassOf, <B>) |
| Delete Type Class | Detect when a class is removed from the ontology | removed: (<A>, rdf:type, owl:Class) |

One of the challenges in extending this pilot to work with many ontologies is how to specify a number of changes that are conceptually the same but may be implemented differently across ontology datasets. For instance, DIACHRON originally specified an “Add Label” change as a simple (or basic) change that is hard coded into the system. This change assumed that all label changes could be detected by the addition of a new RDF triple where the predicate was rdfs:label from the RDFS vocabulary.
When looking at the ontologies in OLS we found ontologies, such as Mamo (http://www.ebi.ac.uk/ols/beta/ontologies/mamo), that do not use rdfs:label but instead favour skos:prefLabel as their label predicate. In this scenario we would like DIACHRON to still report the addition of a label as a single type of change, at least at the conceptual level, even though the specific implementation at the vocabulary level is different.

A similar situation exists with the implementation of “class obsoletion” in the ontologies in OLS. Previously for EFO we defined “Mark as Obsolete” as a complex change that could be primarily detected by the addition of an RDF triple where the predicate was rdfs:subClassOf and the object of the triple was oboInOwl:ObsoleteClass. Many of the ontologies from the OBO foundry follow a term obsoletion strategy, but the implementation of this differs from that of EFO. Standard practice for OBO ontologies is to use the owl:deprecated annotation property to indicate obsoletion of classes. We currently have to preconfigure our DIACHRON application to create the appropriate type of “mark as obsolete” complex changes depending on the ontology.

1.2 DATA FLOW AND INFRASTRUCTURE

Figure 1. The program flow of the OLS crawler. The crawler is scheduled to run every night, pushing new ontologies to DIACHRON, archiving new ontology releases, and running the change detection service for updated ontologies.

Figure 1 shows the execution flow for our DIACHRON application. For each ontology in OLS, the DIACHRON application does the following:

1. Using the OLS REST API, get the latest version that has been indexed in OLS.
2. Get the latest version that has been archived in the archiver.
3. If it is the first time that this ontology will be archived, create a DIACHRON dataset id. We need to know the dataset id before we can upload the “diachronised” dataset.
4. If it is the first time this ontology has been archived, define the ontology’s complex changes and store them in the Virtuoso server.
5. If there is a new version of the ontology, download it from the file location given by the OLS API.
6. The ontology is downloaded in native OWL or OBO format. Convert this to DIACHRON RDF using the archive service.
7. The “diachronised” version of the ontology is uploaded to the archive.
8. If there is a new version of the ontology, get its complex changes and see if the properties used have been changed. If so, then the complex changes for this ontology are updated.
9. Get the latest stored version of the ontology from the archiver and run the change detection between the old and the new version.

EMBL-EBI uses virtualized environments for deploying software. EMBL-EBI also has a large computing cluster running a Load Sharing Facility (LSF), where batch workloads and data-intensive applications can run. The DIACHRON integration layer delivered in 6.2 provides a service layer for interacting with individual DIACHRON components, such as archiving services and change detection. The DIACHRON Archiving service is the core component of the DIACHRON platform and provides services for both archiving and retrieving diachronic datasets. The Archive module is currently using the OpenLink Virtuoso database (v7.2.2.1) for persistent storage.

All of the DIACHRON services are being developed as Java Web applications that can be deployed in any Java servlet container. EMBL-EBI is restricted to using Apache Tomcat for running Java web applications. Figure 2 shows the two dedicated Tomcat applications that are currently deployed for the pilot applications. One Tomcat sits outside the firewall and provides public access to the DIACHRON integration layer; the second Tomcat sits behind the firewall but is visible to the integration layer application.

Figure 2. EMBL-EBI infrastructure for DIACHRON pilot

2 RESULTS

We recently ran our updated pilot application against the full set of ontologies in OLS. Using the current DIACHRON archiving implementation, we were able to successfully parse and archive 75 ontologies out of 142. There can be several reasons why the remaining 67 ontologies failed to archive. Issues already encountered include ontology files referenced in owl:imports statements that are missing, and ontologies that are unsatisfiable when classified with a reasoner. We have also encountered some instability and scaling problems when archiving large ontologies, such as the NCBI taxonomy, which contains over 1.3 million terms and nearly 10 million RDF triples. The total execution time for archiving all 75 ontologies in parallel on our compute cluster was approximately 20 minutes. This process now runs as a nightly cron job that checks for new ontologies in OLS, archives them and runs change detection over them as they become available.

In deliverable 9.2, we presented a pilot application that was integrated into the EFO website and could be used to view changes for the EFO. In this deliverable we can now show the same visualisations for all ontologies in OLS that can be successfully archived with DIACHRON. We have extended our JavaScript library to support visualising changes in all the ontologies loaded into the OLS DIACHRON platform. Figure 3 outlines the infrastructure being deployed for the user interface components, allowing the new OLS website to use our JavaScript library to generate an interactive visualisation of changes for every ontology loaded.

Figure 3. The OLS architecture showing how the Web application communicates with the public DIACHRON integration layer services to generate the visualisation of changes.

Figure 4 is a screenshot from the new OLS, showing changes for the popular Gene Ontology (GO) over a two week period.
This view alone is extremely informative about the development of GO. We can see that between 10 and 40 new terms are added in each release. We also see that the 2015-12-01 release included some changes to labels and synonyms, and that some terms were obsoleted. The bar gives us a visualisation of the evolution of GO, which will be a useful feature of the new OLS website for our users.

Figure 4. Pilot application updated to work with the Gene Ontology

2.1 IMPLEMENTATION ISSUES

In extending our pilot application and developing the new OLS we encountered several issues with the DIACHRON platform that we have worked with the platform developers to resolve.

1. Through the Integration Layer and the Change detector APIs, we could not specify the Dataset_Uri to be used for the change schema to be created (definition of complex changes) and for it to be used for the complex change detection. A Dataset_Uri option was added to the API both at the Integration layer and the Change Detection service (‘Change Detection Implementation’ and ‘Complex Change Implementation’ services).
2. We required the ability to use different reasoners before the OWL to DIACHRON model conversion step in the Archive service. The Archive service was updated to support different OWL reasoners and a new optional parameter was added to the REST API to specify the type of reasoner required.
3. OWL ontologies have a huge amount of information encoded in the RDF file. Early on in the project we added a filter on the predicates of the RDF triples that get archived. We did this so we could use DIACHRON to focus only on the triples in the ontology that we care about for our application use cases. The ability to specify filters was previously hard coded into the Java API but has now been exposed via the REST API.
4. After creating a new diachronic dataset we could not access the dataset instance id through the Integration Layer API endpoint (/webresources/archivingService/query) without restarting Tomcat, due to caching issues. This bug has been fixed.
5. After archiving several versions of a dataset, Tomcat would halt and need a restart, due to connections that were left open. This bug has been fixed in the Archiver.
6. After archiving a new version, the metadata of the stored version (versionNumber, recordSet) could not be retrieved through the API endpoint (/diachron/archive/templates) unless Tomcat was restarted. The cache issue has been fixed in the Archiver.
7. There remains no service in the integration layer for archiving datasets. We are currently required to archive datasets directly via the archiving layer. As the archiving layer has no security authentication (this is handled by the integration layer) we cannot expose our archiving service on a public server. Archiving is currently done behind our firewall, but the resulting data can still be queried via our public integration layer services.

3 FUTURE WORK

• Enhance OLS archiving code to support the full set of 142 ontologies.
• Migrate JavaScript components to the production OLS web application.
• Production deployment of the new OLS, due March 2016.

3.1 CONCLUSION

We have presented a new Ontology Lookup Service that has been developed using DIACHRON components for archiving and change detection over multiple biomedical ontologies. We have shown that the DIACHRON platform now provides a generic infrastructure for archiving OWL ontologies and includes services for detecting a range of changes within those ontologies. There remain some challenges in scaling this application up before we can move it to our production server. Once it is available we will be in a position to evaluate the new system with our large user base.

4 APPENDIX A

Workshop paper submitted to EDBT DIACHRON workshop.
Biomedical Ontology Evolution in the EMBL-EBI Ontology Lookup Service

Olga Vrousgou, Tony Burdett, Helen Parkinson, Simon Jupp
European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Cambridge, United Kingdom

ABSTRACT

As ontologies are playing an increasingly important role in data annotation on today’s Semantic Web, there is a need to observe their constant evolution. At EMBL-EBI ontology preservation and curation is of growing value, and new tools are being developed to help undertake these tasks. One of these tools is the Ontology Lookup Service, which holds 140 public bio-medical ontologies that are updated when new ontology versions become available. To track changes in these ontologies we have utilized DIACHRON, a platform for tracking the evolution of RDF documents. With DIACHRON in OLS we can track and store the changes between ontology releases and view the differences through a graphical interface.

CCS Concepts
• Web Ontology Language (OWL) • Ontologies

Keywords
Ontologies, OWL, Data evolution

1. INTRODUCTION

Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed by the community to provide common metadata vocabularies. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases.
This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biological knowledge.

The European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI) is a major provider of bioinformatics services and is involved in the preservation and curation of data so that it can be served back to the community in novel and useful ways. A large part of the curation and added value offered by EMBL-EBI is via the semantic annotation of data with ontologies. Ontologies can make data more interoperable and can be used to enhance search [1] and visualization [2] applications over data.

One of the challenges in working with highly interconnected data is dealing with elements that change. Biomedical ontologies aim to represent the state of knowledge in biology, but this is constantly changing as new biology is discovered. As ontologies develop to describe all domains of biology they must be regularly updated with new terms or refinements of existing terms to stay relevant to the data they describe. Tracking the changes within the ontology alongside the changes in the data is especially challenging. Typically the ontologies are developed independently of the data they annotate, so when ontologies change, it can be difficult to propagate that change to all the datasets that were described with those ontologies.

The DIACHRON project¹ has been developing technology for monitoring the evolution of data on the Web. Dataset versions can be archived in the DIACHRON system and there are components for detecting and reporting on changes in the data [3]. The DIACHRON platform is able to archive and monitor changes in data expressed in the W3C Resource Description Framework² (RDF) [4], thus making it suitable for archiving ontologies expressed in the W3C Web Ontology Language³ (OWL) that can be serialised in RDF.
EMBL-EBI has recently developed a new version of the Ontology Lookup Service⁴ (OLS) that includes DIACHRON functionality for archiving and executing change detection over ontologies. In this paper we present the design and implementation of the DIACHRON functionality adapted for the OLS system and evaluate the DIACHRON functionality for monitoring ontology evolution.

¹ http://www.diachron-fp7.eu
² http://www.w3.org/RDF/
³ http://www.w3.org/OWL/
⁴ http://www.ebi.ac.uk/ols/beta/

Table 1. Abstracted ontology changes detected by DIACHRON

2. REQUIREMENTS

OLS provides access to over 140 bio-medical ontologies. Ontology documents typically reside on the Web and the OLS system monitors these documents for updated versions, and automatically loads new ontologies into the system when a new version is released. OLS provides services for searching and visualising these ontologies and also provides an API for programmatic access. Many database curation systems use the OLS to find terms that are used in annotating biological data. This process introduces a dependency of the annotated data on the ontologies, so any changes in the ontologies can have downstream consequences on the data and any services built over those data. OLS requires the ability to track terms seen in ontologies over time to better assist applications that rely on these ontologies.

There are a large number of changes that one might consider tracking within an ontology. One of the most important aspects of change relating to ontologies is the creation, deletion and editing of term (or class level) information. Being able to track additions to an ontology is also a useful metric of ontology activity. An ontology that has not been updated for a long time may no longer be actively maintained.
Ontology best practices [5] encourage developers to avoid deleting ontology classes, and rather use a deprecation or obsoletion strategy so that terms remain in the ontology or at a minimum the term URIs remain resolvable. This is important for applications that use third party ontologies for data annotation and rely on those terms resolving on the Web. In practice there are still many reasons why an ontology term may get deleted, or a URI may be refactored, so old URIs are frequently no longer accessible.

Deletions in OWL, when viewed at the RDF triple level, are typically associated with a kind of edit. For example, moving a class in the ontology class hierarchy would typically be detected at the RDF triple level as the removal of a triple with an rdfs:subClassOf predicate together with the addition of a new triple with the same subject and predicate, but a different object. The ability to detect such changes in an ontology is useful, as many edits involving the rdfs:subClassOf predicate may suggest an ontology is undergoing a major rearrangement, which could have an impact on any application that makes an assumption about the structure of the hierarchy. There are also other types of non-logical edits, such as edits to term metadata. These may include the editing of a term label or the addition of new synonyms or definitions, and might indicate a refinement or change to the interpretation of a term.

Table 1 outlines the major ontology changes that the OLS application is concerned with detecting. These changes are restricted to those that can be expressed at the level of RDF triples.
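The class move example above can be made concrete with a small sketch. It is an illustration only, assuming each ontology version is held as an in-memory set of (subject, predicate, object) strings; the `delta` and `class_moves` functions and the `ex:` identifiers are hypothetical and are not how the DIACHRON change detector is implemented.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
SUBCLASS_OF = "rdfs:subClassOf"


def delta(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return the (added, removed) triples between two ontology versions."""
    return new - old, old - new


def class_moves(old: Set[Triple], new: Set[Triple]) -> List[Tuple[str, str, str]]:
    """Report (class, old parent, new parent) where a subclass triple was replaced."""
    added, removed = delta(old, new)
    moves = []
    for subject, predicate, old_parent in sorted(removed):
        if predicate != SUBCLASS_OF:
            continue
        # A move pairs a removed subclass triple with an added one on the same subject.
        for s, p, new_parent in sorted(added):
            if s == subject and p == SUBCLASS_OF:
                moves.append((subject, old_parent, new_parent))
    return moves


v1 = {
    ("ex:A", "rdf:type", "owl:Class"),
    ("ex:B", "rdf:type", "owl:Class"),
    ("ex:C", "rdf:type", "owl:Class"),
    ("ex:A", SUBCLASS_OF, "ex:B"),
}
v2 = {
    ("ex:A", "rdf:type", "owl:Class"),
    ("ex:B", "rdf:type", "owl:Class"),
    ("ex:C", "rdf:type", "owl:Class"),
    ("ex:D", "rdf:type", "owl:Class"),
    ("ex:A", SUBCLASS_OF, "ex:C"),  # ex:A moved from ex:B to ex:C
    ("ex:D", SUBCLASS_OF, "ex:B"),  # new class, not a move
}
moves = class_moves(v1, v2)
```

For a class with several parents this simple pairing would report every removed and added parent combination, so a real detector would need a more careful matching rule.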
| Ontology change | Change triplet |
|---|---|
| Addition of new class | Addition of ?x rdf:type owl:Class |
| Removal of class | Removal of ?x rdf:type owl:Class |
| Edit label | Addition of ?x <label property> <OWL literal>; Removal of ?x <label property> <OWL literal> |
| Edit synonym | Addition of ?x <synonym property> <OWL literal>; Removal of ?x <synonym property> <OWL literal> |
| Edit definition | Addition of ?x <definition property> <OWL literal>; Removal of ?x <definition property> <OWL literal> |
| Class move | Addition of ?x rdfs:subClassOf ?owlClass; Removal of ?x rdfs:subClassOf ?owlClass |
| Class obsoletion | Addition of ?x owl:deprecated “true” |

2.1 DIACHRON services

The DIACHRON platform comprises several components that are accessible via Web services. The OLS application currently uses services from three of the components, namely the DIACHRON archive, the change detection service and the integration layer.

• The archive service is responsible for converting the ontology versions represented in OWL to the DIACHRON RDF model. Once converted, the “diachronised” RDF can be uploaded into the archive via the archiving API.
• The change detection service is responsible for the calculation of changes between two ontology releases. There are two types of changes, simple and complex. Simple changes capture all the additions and deletions of all terms in an ontology and are predefined. Complex changes are user defined changes based on specified collections of simple changes.
• The integration layer is designed to provide an abstraction over the underlying services and handle security and mediation of services via a single point of entry to the DIACHRON platform.

3. METHOD

Tracking ontology evolution is the primary use case for DIACHRON within EMBL-EBI. OLS currently deals with tracking external ontologies for changes and indexing any new releases. OLS provides a REST API to the available ontologies that includes information about when an ontology was last updated.
We have developed a Java application that crawls the OLS API for each new ontology release, retrieves the latest ontology version and archives it into the DIACHRON platform. As new versions become available via the OLS API, the DIACHRON platform will archive the newer version and run change detection between versions.

For a delta of changes to be produced between the new release and the old version of an ontology, the latest versions must always be archived in a uniform way. To adapt DIACHRON for use in OLS, we need to configure the change types in Table 1 to reflect how certain features are implemented in a particular ontology. For instance, most ontologies have a notion of synonyms, but the predicates used for synonyms vary between ontologies. We would like to use DIACHRON to detect changes, such as “synonym edit”, at a conceptual level, without worrying about implementation details. The deltas between ontologies need to be stored and presented in a user-friendly way in the OLS application.

The new OLS platform provides an extensive list of API endpoints that give developers easy access to its underlying data. Using this API we are able to retrieve information about the ontologies stored in OLS, such as the location from which each ontology is downloaded, the latest version that was indexed into OLS, and the predicates used for labels, synonyms and definitions. DIACHRON provides an ontology for describing changes according to the underlying DIACHRON model. There is also an API for creating change types for a given dataset that is exposed as a Web service in DIACHRON. The OLS crawler reads ontology configuration from the OLS API, creates a new dataset and specifies DIACHRON changes using the DIACHRON service API.

The flow for adding OLS data to DIACHRON is as follows (also depicted in Figure 1). For each ontology in OLS:
1. From the OLS API, get the latest version that has been indexed in OLS.
2. Get the latest version that has been archived in the archiver.
3. If it is the first time that this ontology will be archived, create a DIACHRON dataset id. We need to know the dataset id before we can upload the “diachronised” dataset.
4. If it is the first time this ontology has been archived, define the ontology’s complex changes and store them via the changes API.
5. If there is a new version of the ontology, download it from the file location that is provided by the OLS API.
6. The ontology is downloaded in native OWL or OBO format and then submitted to the DIACHRON archive service.
7. If this is a new version of an ontology that has been previously archived, change detection between the new and old versions will be executed using the changes API. The results of the change detection are also stored in the archive and made available via the integration layer.

Figure 1. Program flow of the OLS crawler.

The crawler is scheduled to run every night, pushing new ontologies to DIACHRON, archiving new ontology releases, and running the change detection service for updated ontologies.

As mentioned above, different complex changes need to be created for each ontology. Many ontologies use different properties to describe a term, so a set of complex changes must be defined for each ontology, based on the properties defined in the OLS API. It is also essential to update the complex changes if a term property changes from one version to another. The main challenge we face here is that a complex change may not only have different properties between ontologies, but may also have an entirely different implementation. For example, when a term is marked as obsolete in EFO, it is moved to be a subclass of the “ObsoleteClass” class. Other ontologies, such as the Gene Ontology, use the owl:deprecated property to indicate term obsoletion.
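The per-ontology variation described above can be pictured as a small configuration object that is expanded into the change triplet patterns of Table 1. The following is a minimal sketch only: the class, enum and method names are hypothetical and are not taken from the OLS crawler or the DIACHRON changes API, and the patterns are emitted as plain strings rather than in whatever format the changes API actually expects.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the OLS crawler's real code): holds the
 * per-ontology settings and expands them into Table 1 style patterns.
 */
final class OntologyChangeConfig {

    /** The two obsoletion conventions mentioned in the text. */
    enum ObsoletionStrategy { DEPRECATED_FLAG, OBSOLETE_PARENT_CLASS }

    private final String labelProperty;
    private final String synonymProperty;
    private final String definitionProperty;
    private final ObsoletionStrategy obsoletionStrategy;
    private final String obsoleteParentClass; // only read for OBSOLETE_PARENT_CLASS

    OntologyChangeConfig(String labelProperty, String synonymProperty,
                         String definitionProperty,
                         ObsoletionStrategy obsoletionStrategy,
                         String obsoleteParentClass) {
        this.labelProperty = labelProperty;
        this.synonymProperty = synonymProperty;
        this.definitionProperty = definitionProperty;
        this.obsoletionStrategy = obsoletionStrategy;
        this.obsoleteParentClass = obsoleteParentClass;
    }

    /** An edit is an addition plus a removal on the same property. */
    private static String edit(String name, String property) {
        return "Edit " + name + ": Addition of ?x " + property + " <OWL literal>; "
                + "Removal of ?x " + property + " <OWL literal>";
    }

    /** Returns one pattern per change type in Table 1, in table order. */
    List<String> changePatterns() {
        List<String> patterns = new ArrayList<>();
        patterns.add("Addition of new class: Addition of ?x rdf:type owl:Class");
        patterns.add("Removal of class: Removal of ?x rdf:type owl:Class");
        patterns.add(edit("label", labelProperty));
        patterns.add(edit("synonym", synonymProperty));
        patterns.add(edit("definition", definitionProperty));
        patterns.add("Class move: Addition of ?x rdfs:subClassOf ?owlClass; "
                + "Removal of ?x rdfs:subClassOf ?owlClass");
        // The obsoletion pattern depends on the strategy the ontology uses.
        if (obsoletionStrategy == ObsoletionStrategy.DEPRECATED_FLAG) {
            patterns.add("Class obsoletion: Addition of ?x owl:deprecated \"true\"");
        } else {
            patterns.add("Class obsoletion: Addition of ?x rdfs:subClassOf "
                    + obsoleteParentClass);
        }
        return patterns;
    }
}
```

An ontology that follows the parent-class convention would be configured with OBSOLETE_PARENT_CLASS and the IRI of its obsolete parent, while one that uses owl:deprecated would pass DEPRECATED_FLAG. Note that under the parent-class convention an obsoletion also matches the generic class move pattern, so a real configuration would presumably need to decide which of the two takes precedence.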
We have defined additional configuration in our OLS crawling application that specifies the obsolete class strategy used by a particular ontology. This requires a priori knowledge of the strategy used by an ontology, and this additional configuration must be maintained to ensure it remains in sync with each ontology.

4. RESULTS
Changes detected between ontologies are stored in the DIACHRON platform according to the DIACHRON changes ontology schema [6]. These changes can be queried directly with SPARQL or through a REST API endpoint that DIACHRON provides to access a JSON representation of the changes. This changes API is used by OLS to visualize DIACHRON changes.

Figure 2 shows an overview of changes in the Experimental Factor Ontology over a 12-month period (EFO version 2.50 through 2.59). This visualisation gives a broad overview of changes in the ontology, color-coded according to the change types defined in Table 1, and provides a useful visual history of the evolution of EFO. We can easily see that in general the ontology changes very little between releases, with only a handful of new terms added in each release. However, there are two very clear spikes in 2.56 and 2.57, where a large number of deletions were followed in the next version by a large number of additions. These spikes expose a problem with the 2.56 release, in which many terms disappeared; this was quickly fixed in the 2.57 release. Such a dramatic change is a possible indicator that 2.56 should not be used and that any application relying on it should update its EFO version.

Figure 3 shows how we can use the visualization to drill down into specific changes for a given release. We have also connected the visualization component to the JIRA tracking system for the EFO ontology so that changes can be viewed alongside tickets that were closed in that release. This feature allows users to see changes in the context of the work that was performed for that release.
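The pattern described above, a release with many deletions followed by a release with many additions, could also be flagged automatically from per-release change counts. The sketch below is illustrative only: the class and method names are hypothetical, the threshold is an arbitrary parameter, and the counts would have to come from the DIACHRON changes API, which is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: flags a release as suspicious when it deletes at
 * least `threshold` items and the next release adds at least as many.
 */
final class ReleaseSpikeDetector {

    /** Addition and deletion counts for one release (assumed inputs). */
    static final class ReleaseSummary {
        final String version;
        final int additions;
        final int deletions;

        ReleaseSummary(String version, int additions, int deletions) {
            this.version = version;
            this.additions = additions;
            this.deletions = deletions;
        }
    }

    /**
     * Releases must be ordered oldest first. The last release is never
     * flagged because it has no successor to compare against.
     */
    static List<String> flagSuspicious(List<ReleaseSummary> releases, int threshold) {
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i + 1 < releases.size(); i++) {
            ReleaseSummary current = releases.get(i);
            ReleaseSummary next = releases.get(i + 1);
            if (current.deletions >= threshold && next.additions >= threshold) {
                flagged.add(current.version);
            }
        }
        return flagged;
    }
}
```

Requiring the follow-up spike of additions keeps a legitimate large clean-up release from being flagged on its own, at the cost of only detecting a bad release once its successor has been published.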
Figure 2. EFO changes from release 2.50 - 2.59.

Figure 3. Reported changes from EFO release 2.56.

5. CONCLUSION
We have presented an update to the EMBL-EBI Ontology Lookup Service to include DIACHRON-based change tracking of ontologies. The DIACHRON platform uses the OLS API to detect when new ontology versions have been indexed and archives new versions in the DIACHRON archive. DIACHRON changes are configured according to properties outlined in the OLS ontology configuration, so that a range of abstracted common changes can be detected between all of the ontologies in OLS. We have developed user interface components to assist users in searching and browsing ontology changes.

This work presents a pragmatic and scalable approach to change detection in ontologies. Computing true change detection in OWL ontologies is still an area of active research. Ontology change detection at the level of RDF triples alone is limited to fairly trivial changes in the ontology. However, we have illustrated that this is sufficient to highlight some key aspects of change within ontologies, and this may be sufficient for the users of these ontologies. We are not attempting to report on all ontological changes, but instead require a system that tracks terms and basic term metadata over time. In this scenario DIACHRON provides us with a convenient platform for tracking ontology evolution at this level.

The addition of DIACHRON functionality to OLS provides the OLS user community with a novel mechanism to access ontology changes via the DIACHRON API. As the amount of biological data annotated to ontology terms continues to grow, it is vital that tools are in place to assist database developers and curators in tracking the evolution of these vocabularies. The DIACHRON functionality added to OLS now provides a platform for monitoring these changes and can be exploited in downstream services relating to data concurrency, consistency and integrity, and may also be used to facilitate an aspect of data repair relating to data-to-ontology annotation.

6. ACKNOWLEDGMENTS
We would like to acknowledge all members of the DIACHRON consortium for their contribution to this work. In particular we would like to thank Christos Pateritsas, Yannis Roussakis, Hasapis Panagiotis, Marios Meimaris and Giorgos Flouris for their contribution to components of the DIACHRON platform utilized by the OLS application.

7. REFERENCES
[1] Petryszak, R. et al. 2013. Expression Atlas update – a database of gene and transcript expression from microarray and sequencing-based functional genomics experiments. In Nucleic Acids Research, DOI=10.1093/nar/gkt1270.
[2] Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. In Nucleic Acids Research, Vol. 42 (Database issue): D1001-D1006.
[3] Auer S., et al. 2012. Diachronic linked data: towards long-term preservation of structured interrelated information. In Proc. of the First International Workshop on Open Data (WOD '12). ACM, New York, NY, USA, 31-39.
[4] Papavassiliou V., et al. 2009. On Detecting High-Level Changes in RDF/S KBs. In Proceedings of the 8th International Semantic Web Conference, October 25-29.
[5] Barry Smith et al. 2007. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. In Nature Biotechnology 25, 1251-1255. DOI=10.1038/nbt1346.
[6] Roussakis Y., Chrysakis I., Stefanidis K., Flouris G., Stavrakas Y. 2015. A Flexible Framework for Understanding the Dynamics of Evolving RDF Datasets. In International Semantic Web Conference (1) 2015: 495-512.