Entity retrieval in the biomedical ontologies domain
Daniela Oliveira
Insight Centre for Data Analytics, NUI Galway, Ireland
Background
Research in the biomedical domain is generating massive amounts of data that feature
thousands of terminologies (concepts and properties), which are described differently even
when they overlap. These differences can stem from naming conventions, textual descriptions,
synonyms, and the granularity of these cancer terminologies; precisely identifying the
ontological term that best describes a given cancer term remains an open research problem.
Ontology search services such as BioPortal and the Ontology Lookup Service (OLS) contain over
500 ontologies covering domains as distinct as zebrafish anatomy and epidemiology. However,
they often return large, vague, or loosely related result sets for a given term. This issue
hinders ontology reuse and leads to situations where the same concept has different
definitions or labels.
Finding the right ontology class for a specific term can be arduous, since few search
engines provide keyword-based search over ontologies. For example, a non-expert user who
wants to annotate a database with ontology terms, to make it interoperable with other
resources, needs to find the right ontology to describe its concepts. However, to the best
of my knowledge, no system exists that automatically provides such a user with an
unequivocal solution.
Most current approaches take full advantage of neither the complex semantics of ontologies
nor the possibility of ranking ontologies by relevance in order to retrieve the best possible
results. Therefore, we argue that biomedical resources could benefit from the development
of a set of algorithms to specifically search biomedical ontologies, using keywords, and taking
advantage of the semantic and structural complexity of ontologies. This system could then be
utilised in the integration of resources not currently annotated with ontologies, to increase
the interoperability within the biomedical data domain. The purpose of this work is to
investigate whether it is possible to create algorithms that can search ontologies, using
keywords, and return the top-k concepts that correspond to that search.
Methodology
The ontology search process can be divided into two steps: (1) lexical matching, where the
keywords are lexically matched with the labels and synonyms of a collection of ontologies and
(2) ranking the results according to the relevance of the concept and the ontologies.
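The two steps above can be sketched as follows. This is only an illustrative outline, not the system under development: the Concept record, the example IRIs and ontology names, and the three-level scoring rule (exact label match, then exact synonym match, then partial match) are all assumptions introduced here as a stand-in for a real relevance model.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical, minimal view of an ontology class."""
    iri: str
    ontology: str
    label: str
    synonyms: list[str] = field(default_factory=list)


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so matching ignores case and spacing."""
    return " ".join(text.lower().split())


def lexical_match(keyword: str, concepts: list[Concept]) -> list[Concept]:
    """Step 1: keep concepts whose label or any synonym contains the keyword."""
    needle = normalise(keyword)
    return [
        c for c in concepts
        if any(needle in normalise(name) for name in [c.label, *c.synonyms])
    ]


def rank(keyword: str, candidates: list[Concept], top_k: int = 3) -> list[Concept]:
    """Step 2: order candidates by a placeholder relevance score (an assumption)."""
    needle = normalise(keyword)

    def score(c: Concept) -> int:
        if normalise(c.label) == needle:
            return 2  # exact label match
        if any(normalise(s) == needle for s in c.synonyms):
            return 1  # exact synonym match
        return 0      # partial match only

    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Made-up example data; IRIs and ontology names are placeholders.
concepts = [
    Concept("ex:1", "ONT_A", "Lung carcinoma", ["lung cancer"]),
    Concept("ex:2", "ONT_B", "Lung cancer"),
    Concept("ex:3", "ONT_A", "Small cell lung cancer"),
    Concept("ex:4", "ONT_C", "Zebrafish fin"),
]
hits = rank("lung cancer", lexical_match("lung cancer", concepts))
print([c.iri for c in hits])
```

In this toy collection the exact label match is ranked first, the synonym match second, and the partial match last; a real ranking step would replace the placeholder score with a model such as the one discussed next.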
For the first step, I am using the BioPortal [1] and Ontology Lookup Service (OLS) [2] search
engines to obtain results for each keyword. By delegating this step to existing services, I can
focus first on the second step, where I am in the early phase of adapting the Best Match 25 (BM25) [3] algorithm
to ontology search. BM25 ranks documents according to their relevance in relation to a given
search query. A variation of the BM25 algorithm, BM25F [4], is used for structured
documents. In this variation the term frequency is weighted by a boost factor, which depends
on the field of the keyword, e.g. if the term is in a title, it is weighted differently than if it is in
the abstract.
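To make the field weighting concrete, the following is a minimal sketch of a simple BM25F-style scorer applied to ontology classes. It is not the adaptation under development: the field names (label, synonyms, definition), the boost values, the shared parameters B and K1, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical field boosts and parameters; values are illustrative, not tuned.
FIELD_BOOST = {"label": 3.0, "synonyms": 2.0, "definition": 1.0}
B = 0.75   # length-normalisation strength, shared by all fields here
K1 = 1.2   # term-frequency saturation


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25f_rank(query: str, docs: dict[str, dict[str, str]], top_k: int = 3):
    """Return the top_k (doc_id, score) pairs for the query, best first."""
    if not docs:
        return []
    # Per-document, per-field token counts.
    counts = {
        doc_id: {f: Counter(tokenize(fields.get(f, ""))) for f in FIELD_BOOST}
        for doc_id, fields in docs.items()
    }
    n_docs = len(docs)
    # Average field length over the collection, used for length normalisation.
    avg_len = {
        f: sum(sum(c[f].values()) for c in counts.values()) / n_docs
        for f in FIELD_BOOST
    }

    scores = {}
    for doc_id, per_field in counts.items():
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: documents containing the term in any field.
            df = sum(any(term in c[f] for f in FIELD_BOOST) for c in counts.values())
            # Non-negative IDF variant (an assumption; other forms exist).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Boosted, length-normalised pseudo term frequency across fields.
            tf = 0.0
            for f, boost in FIELD_BOOST.items():
                length = sum(per_field[f].values())
                ratio = length / avg_len[f] if avg_len[f] else 1.0
                tf += boost * per_field[f][term] / (1 - B + B * ratio)
            score += idf * tf / (K1 + tf)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up example classes; identifiers and text are placeholders.
docs = {
    "ex:melanoma": {"label": "melanoma", "synonyms": "malignant melanoma",
                    "definition": "a malignant tumour of melanocytes"},
    "ex:skin": {"label": "skin neoplasm", "synonyms": "skin tumour",
                "definition": "a tumour of the skin such as melanoma"},
    "ex:fin": {"label": "fin", "synonyms": "", "definition": "zebrafish appendage"},
}
ranked = bm25f_rank("melanoma", docs)
print(ranked)
```

With these assumed boosts, a class whose label matches the keyword outranks one that only mentions it in its definition, which is the behaviour the field weighting is meant to produce.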
To test the results, I am using a collection of different keywords from a set of cancer-related
terminologies. These keywords originated from the extraction of terms from cancer
repositories such as COSMIC [5].
Future work
BioOpener is a project which aims to find, access and aggregate multiple biomedical
repositories to integrate different resources. Currently we have 6399 terms extracted from
the cancer repositories used in the BioOpener project. A future application of this work would
be to find terms in ontologies that match the ones extracted from the repositories.
We will need to create an evaluation method to test the algorithms. We also intend to further
explore tools such as BioPortal’s Annotator and Recommender to check if there is any
possibility for integration.
In the future, we will try different information retrieval algorithms and explore the capabilities
of indexing applications such as Solr.
Bibliography
1.
Noy, N., Shah, N.H., Whetzel, P.L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D.L.,
Storey, M.-A., Chute, C.G., Musen, M.A.: BioPortal: Ontologies and integrated data
resources at the click of a mouse. CEUR Workshop Proc. 833, 292–293 (2011).
2.
Jupp, S., Burdett, T., Malone, J., Leroy, C., Pearce, M., McMurry, J., Parkinson, H.: A new
ontology lookup service at EMBL-EBI. In: CEUR Workshop Proceedings. pp. 118–119
(2015).
3.
Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and
Beyond. Found. Trends Inf. Retr. 3, 333–389 (2009).
4.
Pérez-Agüera, J.R., Arroyo, J., Greenberg, J., Iglesias, J.P., Fresno, V.: Using BM25F for
semantic search. In: Proceedings of the 3rd International Semantic Search Workshop. pp. 1–8 (2010).
5.
Forbes, S.A., Beare, D., Gunasekaran, P., Leung, K., Bindal, N., Boutselakis, H., Ding, M.,
Bamford, S., Cole, C., Ward, S., Kok, C.Y., Jia, M., De, T., Teague, J.W., Stratton, M.R.,
McDermott, U., Campbell, P.J.: COSMIC: Exploring the world’s knowledge of somatic
mutations in human cancer. Nucleic Acids Res. (2015).