Learning Formal Definitions for Biomedical Concepts

Master Thesis

Alina Petrova
Affiliation: European Master’s program in Computational Logic
Technische Universität Dresden
Free University of Bozen-Bolzano
Supervisor: Prof. Michael Schroeder
Teacher in charge: Dr. George Tsatsaronis
Abstract

Ontologies play a major role in life sciences, enabling a number of applications, from new data
integration to knowledge verification. Obtaining formalized knowledge from unstructured data is
especially relevant for the biomedical domain, since the amount of textual biomedical data has been
growing exponentially. The aim of this thesis is to develop a method of creating formal definitions
for biomedical concepts using textual information from scientific literature (PubMed abstracts),
encyclopedias (Wikipedia), controlled vocabularies (MeSH) and the Web. The knowledge
representation formalism of choice is Description Logic as it allows for integrating the newly
acquired axioms in existing biomedical ontologies (e.g. SNOMED) as well as for automated
reasoning on top of them. The work is specifically focused on extracting non-taxonomic relations
and their instances from natural language texts. It encompasses the analysis, description,
implementation and evaluation of the supervised relation extraction pipeline and sets the scene for
unsupervised relation extraction, proposing a novel algorithm for relation discovery via semantic
clustering.
Acknowledgements

I would like to thank my supervisor Prof. Michael Schroeder for giving me the opportunity to work
and write my thesis in his group, for his guidance, insightful comments and the vision of the project.
I am grateful to everybody from the Bioinformatics group and especially to the “text miners” Dr. George
Tsatsaronis, Maria Kissa and Daniel Eisinger as well as to Norhan Mahfouz and Janine Roy who
helped me immensely with the thesis and made my stay in the group very special.
I would like to thank the EMCL master program and its organizers who gave me an amazing
opportunity to study abroad, introduced me to the world of science and allowed me to shape my
studies in the best way possible.
Many thanks to all my friends who supported me through the three years of my studies, both here
and back home. To my groupmates for all the trips, jokes, discussions, late lunches and long nights
over a drink.
A special, biggest thank you to my parents and my granny without whose constant love, support
and approval these three years of my life would have never happened.
And finally, I would like to thank Sergey who was always there for me.
Table of Contents

Abstract ................................................................................................................................................ 2
Acknowledgements .............................................................................................................................. 3
1. Introduction and Motivation......................................................................................................... 7
1.1 The growth of biomedical literature and the benefits of its formalization................................. 7
1.2 Two examples of biomedical knowledge formalization ............................................................ 8
1.3 The task of formal definition generation.................................................................................... 9
1.3.1 What is formal definition generation? ................................................................................ 9
1.3.2 Why is formal definition generation important? ............................................................... 10
1.3.3 Is formal definition generation feasible? A case study ..................................................... 10
1.4 Objectives and Outline ............................................................................................................. 12
2. Background................................................................................................................................... 14
2.1 What is a definition? ................................................................................................................ 14
2.2 What is Ontology Generation? ................................................................................................. 15
2.2.2 Ontology learning and Definition generation ....................................................................... 16
2.3 Biomedical Knowledge Resources .......................................................................................... 17
2.3.1 SNOMED CT .................................................................................................................... 18
2.3.2 UMLS................................................................................................................................ 20
2.3.3 MeSH ................................................................................................................................ 22
2.4 Description Logics ................................................................................................................... 23
2.4.1 Basic DL constructors ....................................................................................................... 24
2.4.2 From triples to Description Logic formulas ...................................................................... 25
3. Related work on relation extraction ........................................................................................... 26
3.1 Relation extraction for the general domain .............................................................................. 26
3.1.1 Relation extraction and the types of linguistic processing ................................................ 27
3.1.2 Relation extraction and different types of learning ........................................................... 28
3.1.3 Generating semantic representations ................................................................................ 31
3.2 Biomedical extraction .............................................................................................................. 32
4. Non-taxonomic relation extraction using SNOMED CT ontology .......................................... 36
4.1. Dataset generation ................................................................................................................... 37
4.2 Feature extraction ..................................................................................................................... 37
4.3 Relation classification .............................................................................................................. 39
4.4 Discussion ................................................................................................................................ 40
5. Formal Definition Generation pipeline ...................................................................................... 41
5.1. Overview of the pipeline ......................................................................................................... 41
5.2 Annotation of biomedical texts with ontology concepts ..................................................... 44
5.2.1 Introduction to the process of annotation and related work .............................................. 44
5.2.2 The Attribute Alignment Annotator .................................................................................. 45
5.2.3 The Extended Annotator ................................................................................................... 45
5.2.4 Implementation ................................................................................................................. 48
5.2.5 Evaluation ......................................................................................................................... 49
5.2.6 Runtime Assessment ......................................................................................................... 50
5.2.7 Summary of contributions and conclusions ...................................................................... 51
5.2.8 Future work ....................................................................................................................... 51
5.3 Parser for Relation Extraction.............................................................................................. 52
5.3.1 Various types of the definitional structure of a sentence .................................................. 53
5.3.2 The structure of definitions in MeSH ............................................................................... 54
5.3.3 Functionality of the parser ................................................................................................ 55
5.3.4 Manual evaluation of the parser ........................................................................................ 57
5.3.5 Future improvements of the parser ................................................................................... 59
5.4 Learning Relational Labels ................................................................................................... 60
5.4.1 Choosing the classifier ...................................................................................................... 61
5.4.2 Choosing the features ........................................................................................................ 61
5.4.3 Choosing the set of relations ............................................................................................. 64
6. Evaluation ..................................................................................................................................... 66
6.1 SemRep: biomedical relation extraction system ...................................................................... 66
6.1.1 SemRep relation extraction component ............................................................................ 67
6.1.2 SemRep Gold Standard corpus ......................................................................................... 67
6.2 Experiments ............................................................................................................................. 69
6.2.1 Results ............................................................................................................................... 69
6.2.2 Improvement of the classification ..................................................................................... 69
6.2.3 Comparison with SemRep ................................................................................................ 71
7. Unsupervised Relation Extraction .............................................................................................. 73
7.1 From relation classification to unsupervised relation clustering ............................................. 73
7.2 Relation construction via semantic clustering ......................................................................... 76
7.2.1 Semantic clustering: assumptions and use cases............................................................... 76
7.2.2 Semantic clustering of lexical elements ............................................................................ 78
7.2.3 The DBSCAN algorithm and its hierarchical extension ................................................... 80
7.3 Preliminary evaluation of the method ...................................................................................... 81
8. Future work .................................................................................................................................. 85
9. Conclusions ................................................................................................................................... 87
Appendix A ........................................................................................................................................ 89
Appendix B ........................................................................................................................................ 92
Appendix C ........................................................................................................................................ 93
Appendix D ........................................................................................................................................ 99
References ........................................................................................................................................ 100
1. Introduction and Motivation

Formalization of biomedical knowledge has long been an area of active research. Existing
biomedical knowledge resources vary considerably in terms of their formalization principles, from
databases and data collections (e.g. MEDLINE1), to taxonomies and controlled vocabularies (e.g.
MeSH2), to proper ontologies with rich formal semantics (e.g. SNOMED3). They also vary greatly
with respect to the sub-domains and areas they cover, as well as to their size, age, ways of
maintaining and integrating new knowledge, etc. Formally representing biomedical knowledge
can bridge the gap between existing resources and enrich them, as well as help process the newly
generated knowledge that comes in abundance and is publicly accessible.
1.1 The growth of biomedical literature and the benefits of its formalization

Research in life sciences is characterized by the exponential growth of the published scientific
materials: articles, patents, technical reports etc. MEDLINE1, one of the biggest bibliographic
databases for biomedicine, currently contains more than 23 million articles. On average, around
15,000 new articles are added each week. Figure 1 illustrates the growth rate of
MEDLINE over the past half a century [Tsatsaronis et al. 2013].
To handle such a large amount of information, multiple initiatives have been launched for the
purpose of organizing biomedical knowledge formally, e.g. using ontologies [Bodenreider et al.
2006]. An ontology is a complex formal structure that can be decomposed into a set of logical
axioms that state different relations between formal concepts. Together the axioms model the state
of affairs in a domain. With the advances in Description Logics (DL) the process of designing,
implementing and maintaining large-scale ontologies has been considerably facilitated [Baader et al.
2003]. In fact, DL has become the most widely used formalism underlying ontologies. Several
well-known biomedical ontologies, such as GALEN [Rector et al. 2006] or SNOMED CT [SNOMED
CT User Guide] are based on DL. SNOMED CT has adopted the lightweight description logic
EL++ that allows for tractable reasoning.
There are several benefits of formal knowledge representation. First of all, an ontology can be
viewed as a conceptualization of some domain, thus it provides a common language for the
scientific community with which communication between researchers can be facilitated. Secondly,
formalization of entities enables efficient information integration; already existing knowledge about
the entity can be aggregated from multiple resources, and the new knowledge can be easily
integrated so that it is not lost or left unnoticed. Thirdly, formal knowledge representation makes it
possible to automatize a number of crucial tasks that deal with information processing: efficient
search, validation and reasoning. Finally, formal representation can support knowledge
visualization, which itself can bring about further insights about the domain, i.e., facilitate
knowledge discovery.

1 MEDLINE: http://www.ncbi.nlm.nih.gov/pubmed
2 MeSH: http://www.nlm.nih.gov/mesh/
3 SNOMED CT: http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html
Figure 1. Growth of the MEDLINE bibliographical database.
1.2 Two examples of biomedical knowledge formalization

In this section we present two recent works in which the application of formal ontologies to
biomedical knowledge produced interesting results that demonstrate the usefulness and the potential
of knowledge formalization in the biomedical domain.
In [Rubin et al. 2006] the authors used the Foundational Model of Anatomy (FMA) ontology as the
main knowledge resource. FMA is an ontology of human anatomy that consistently describes more
than 70,000 concepts of anatomic structures [Rosse et al. 2003]. The main relation in FMA that
links concepts with each other is the part-of relation. The project deals with penetrating injuries.
Images of injuries are annotated with anatomic concepts, thus disambiguating the visible regions of
the body. Then the spatial information from the ontology is used to predict the possible internal
damages caused by the penetrating injury. The predictions are made via logical reasoning over the
ontology using the case constraints from the image. The project was conducted by the U.S. Defense
Advanced Research Projects Agency (DARPA) and is of life-saving importance.
[King et al. 2009] ran an even larger-scale and more ambitious project. Its aim was to create a
“robot scientist” – a robot that conducts independent research, that is, sets a hypothesis, tests it
experimentally by designing and running the experiments and finally reasons about the acquired
data by interpreting the results, all on its own. The developed robot is called Adam. Apart from
sophisticated hardware (Adam occupies 15 m²) to physically run the experiments, and an elaborate
reasoning component to draw valid conclusions from observation, it has a rich knowledge base on
the backbone. The knowledge resources play a crucial role in the design of Adam as they are used
at all stages of the research process. In the experiment described in [King et al. 2009a], Adam was
provided with a general biomedical database as well as with a formal model of yeast metabolism.
The experiment proved successful: Adam autonomously generated functional genomics hypotheses
and validated them experimentally, thus becoming the first machine that made a scientific discovery
without human intervention.
The two works described above illustrate the huge range of applications that formal knowledge
resources can have in life sciences as well as their unbounded potential. Not only do they help
sustain the ever-growing collection of already published results, but they can also lead to
knowledge discovery through formal reasoning.
1.3 The task of formal definition generation

1.3.1 What is formal definition generation?
Formal definition generation (FDG) is a type of knowledge modeling that translates a natural
language definition into a formal representation using some formal language notation. FDG can be
viewed as the automatic acquisition of complex axioms for an ontology. Chapter 2.2 describes the
connection between concept definitions and axioms in more detail. Unlike taxonomy acquisition,
which seeks to identify parent-child relations in text and is usually based on simple patterns
[Wächter 2010], definition generation typically focuses on highly expressive axioms containing
various logical connectives and non-taxonomic relation instances.
Formal definition generation can be illustrated by an example: a natural language sentence with a
classic definitional structure A is a type of B that has a specific property C is translated into a
formal representation A ≡ B ⊓ ∃hasProperty.C. The formalism of choice here is Description Logic.
Some definitions can be rewritten into a formal language in a quite straightforward way:

Acenocoumarol: a coumarin that is used as an anticoagulant.

This definition is taken from the MeSH controlled vocabulary (see Chapter 2.3). If we assume that
Acenocoumarol, Coumarin and Anticoagulant are valid biomedical concepts, the definition can be
encoded by means of a simple description logic language in the following way:

Acenocoumarol ≡ Coumarin ⊓ ∃used_as.Anticoagulant
The encoding is very simple since there exists an almost perfect one-to-one correspondence between
the lexical items in the definition and the elements of the formal syntax. However, this is not the
case for the majority of the sentences. FDG does not boil down to a mere re-writing of textual
definitions using a different notation; instead, it is a complex task that requires thorough analysis
and understanding of utterances and their constituents.
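For a sentence that does fit a rigid template, the straightforward case can be sketched in a few lines of code. The following toy Python sketch is purely illustrative and is not part of the thesis pipeline; the regular expression and the role name `used_as` are simplifying assumptions that only cover the Acenocoumarol-style template above:

```python
import re

# Toy pattern for definitions of the form "X: a Y that is used as a Z".
# This covers only one definitional template and is an illustrative assumption.
PATTERN = re.compile(
    r"^(?P<definiendum>[\w ]+):\s*an? (?P<genus>\w+) that is used as an? (?P<filler>\w+)$",
    re.IGNORECASE,
)

def to_dl(definition):
    """Return a DL axiom string for a simple genus-differentia definition,
    or None if the sentence does not fit the toy template."""
    match = PATTERN.match(definition.strip())
    if match is None:
        return None
    definiendum = match.group("definiendum").strip().capitalize()
    genus = match.group("genus").capitalize()
    filler = match.group("filler").capitalize()
    return f"{definiendum} ≡ {genus} ⊓ ∃used_as.{filler}"

print(to_dl("Acenocoumarol: a coumarin that is used as an anticoagulant"))
# → Acenocoumarol ≡ Coumarin ⊓ ∃used_as.Anticoagulant
```

Such hand-written templates break as soon as the wording deviates, which is exactly the limitation discussed in the remainder of this section.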
Below are examples of definitions that are far more difficult to process. Note that these
definitions were not constructed artificially to be particularly difficult or lengthy. They are taken
from a widely used biomedical resource MeSH:
Acetolactate Synthase:
A flavoprotein enzyme that catalyzes the formation of acetolactate from 2 moles of pyruvate in the
biosynthesis of valine and the formation of acetohydroxybutyrate from pyruvate and alpha-ketobutyrate in the biosynthesis of isoleucine.
Lissamine Green Dyes:
Green dyes containing ammonium and aryl sulfonate moieties that facilitate the visualization of
tissues, if given intravenously.
Even definitions for which finding a formal representation appears to be trivial may in fact contain
various pitfalls. How exactly should the following definition be formalized?
Acepromazin is a phenothiazine that is used in the treatment of psychoses.
Should the treatment correspond to an independent concept that is linked to acepromazin by the
used_in relation? Or should it rather correspond to the relation treats that takes as arguments
psychosis and phenothiazine, and ultimately acepromazin? The answer to this question is not
obvious and is heavily dependent on the way one chooses to model the knowledge.
1.3.2 Why is formal definition generation important?
FDG is a direct step towards semi-automatic or even automatic ontology generation: the so-called
defining sentences, i.e., those containing a definition for a concept, are a perfect source for a
knowledge base. Definitions are reasonably easy to access in large amounts: they can be
accumulated from vocabularies, encyclopedias and glossaries, or mined automatically from text
collections using pattern matching (phrases like is a, is a type of, is a kind of etc. are typical signs of
a definition) and some post-filtering [Wächter 2010]. By means of FDG textual definitions are
translated into a formal representation of choice, e.g. description logics, and can either be provided
as suggestions to an ontology engineer or be incorporated directly as axioms into the ontology
under construction.
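The cue-phrase mining mentioned above can be illustrated with a minimal sketch. The cue list and the post-filter below are toy assumptions for the sake of a runnable example, not the actual approach of [Wächter 2010]:

```python
import re

# Cue phrases that typically signal a defining sentence ("is a", "is a type of", ...).
DEFINITION_CUES = re.compile(
    r"\b(is|are) (a|an|the) (type|kind|class|form) of\b|\b(is|are) (a|an)\b"
)

def looks_like_definition(sentence):
    """Heuristically decide whether a sentence is a defining sentence."""
    if DEFINITION_CUES.search(sentence) is None:
        return False
    # Crude post-filter: a definiendum usually opens the sentence, so discard
    # sentences starting with a pronoun or conjunction.
    first = sentence.split()[0].lower()
    return first not in {"it", "this", "that", "and", "but", "they"}

sentences = [
    "Acenocoumarol is a coumarin that is used as an anticoagulant.",
    "It is a good day for an experiment.",
    "The samples were stored at -80 degrees.",
]
print([s for s in sentences if looks_like_definition(s)])
```

Only the first sentence survives the filter; real mining pipelines combine such patterns with much richer post-filtering.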
An ontology can be enhanced in multiple ways using formal definitions:
- a new ontology can be built from scratch;
- an existing inexpressive ontology can be enriched with new non-taxonomic relations and
relation instances;
- an existing expressive ontology can be validated: it may contain mistakes in the way it
models the knowledge, and the formalized definitions can be used as input to detect
inconsistencies and errors in the axiomatization, e.g. in the form of clashing relation
instances.
1.3.3 Is formal definition generation feasible? A case study
Utterances in natural language vary considerably in terms of syntactic and semantic complexity.
The question is: how do we go from unstructured text to a structured representation?
Previous attempts to automatically construct highly expressive ontologies are not numerous. One of
them is presented in [Völker et al. 2007]. The authors focus on automatic acquisition of ontology
axioms. The formalism of choice is SHOIN, a very expressive DL that is able to model negation,
conjunction, disjunction, number restrictions etc.
The developed system LExO (Learning Expressive Ontologies) is based on full syntactic parsing of
a sentence. The dependency tree is transformed into DL formulas through a chain of hand-written
syntactic rules that take into account parts of speech, sentence positions, tree positions and syntactic
roles of all words. The rules cover a broad set of syntactic structures, such as relative clauses,
prepositional, noun and verbal phrases, to name a few.
Below are examples of the resulting formalizations (the details of DL syntax are given in Chapter 2):

1) Data: Facts that result from measurements or observations.
Data ≡ (Fact ⊓ ∃result_from.(Measurement ⊔ Observation))

2) A currency is a unit of exchange, facilitating the transfer of goods and services.
Currency ≡ (Unit ⊓ ∃of.Exchange ⊓ ∃facilitate.(Transfer ⊓ ∃of.(Good ⊓ Service)))

3) Vector: An organism which carries or transmits a pathogen.
Vector ≡ (Organism ⊓ (carry ⊔ ∃transmit.Pathogen))
As can be seen from the examples, some of the sentences, like #1, are processed quite
successfully; in fact, they are short and can be directly translated into DL, since there is an
unambiguous correspondence between the lexical tokens and the elements of the formula (concept
names, role names, connectives). Meanwhile the quality of the formalization for more complicated
sentences, e.g. #2 and #3, is quite debatable. The mistakes range from parsing errors (the verb carry
is recognized as a concept name) to questionable modeling decisions (should unit of exchange be
split into a concept and a binary relation instance? should the preposition of be treated as a role
name? what formal semantics does it have?).
The system has several major issues in formalizing definitional sentences. They stem from the
rule-based nature of LExO: natural language is very versatile and the same idea can be expressed in
a possibly infinite number of ways, and it is obviously not possible to cover them all with
hand-crafted rules. To resolve these issues we address the task of formal definition generation by
applying state-of-the-art machine learning (ML) techniques and relying only partially on the
syntactic structure of a sentence. We adopt a hybrid approach of integrating both machine learning
and pattern matching defined over the parse tree, and we try to minimize the impact of the latter on
the overall system, utilizing the patterns only at certain processing steps where we find them useful.
The modeling of ontological relations in LExO is of particular interest for us. The relation instances
are extracted from all substrings that are located between the two atomic concept mentions, and the
relation label is just a textual form of this substring, sometimes slightly modified. The choice of
argument concepts is manually determined. This approach raises several problems. The first one is
the unnecessary split of a concept into two or more. For example, unit of exchange should probably
be kept as a single concept Exchange_Unit. The problem can be solved if the concept identification
process takes into account an existing concept terminology, i.e., it is semantically enriched. This
can be done through semantic indexing, which is discussed in detail in Chapter 5.2.
The other problem is that the string labels do not bear any semantics either. The relation labels are
taken verbatim and are not normalized to any semantic types. The result is that, firstly, the number of
distinct relations is potentially huge, and secondly, there is a many-to-many correspondence
between textual realizations of relations and their semantic invariants: on the one hand, the relations
result_from and caused_by are treated as two different ontology roles, and on the other hand, the
relation labeled as of can have multiple meanings, since the corresponding preposition is
polysemous. The possible meanings of of are location, inclusion, functional or temporal
correspondence, to name a few. Algorithms for the semantic normalization of relational strings
could be encoded into LExO manually, but this is a very tedious task that would still fail to cover
all lexical variants for a given relation, as well as all the relations relevant to the domain. A
potentially effective solution to this problem lies in the use of machine learning: the strings can be
clustered semantically to determine the semantics of different relation types (unsupervised ML),
and they can also be classified so that the correct types are assigned to each string (supervised ML).
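A minimal sketch can illustrate the unsupervised direction. The token-level Jaccard similarity, the threshold and the greedy single-pass strategy below are illustrative assumptions; the actual method, based on semantic rather than purely lexical similarity, is developed in Chapter 7:

```python
# Toy clustering of relation-label strings by lexical overlap.

def jaccard(a, b):
    """Token-level Jaccard similarity between two relation labels."""
    tokens_a, tokens_b = set(a.split("_")), set(b.split("_"))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def cluster_labels(labels, threshold=0.3):
    """Greedy single-pass clustering: attach a label to the first cluster
    whose representative (first member) is similar enough, else open a new one."""
    clusters = []
    for label in labels:
        for cluster in clusters:
            if jaccard(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters

labels = ["result_from", "results_from", "caused_by", "caused_by_the",
          "located_in", "found_in"]
print(cluster_labels(labels))
# → [['result_from', 'results_from'], ['caused_by', 'caused_by_the'],
#    ['located_in', 'found_in']]
```

Note that lexical overlap would still keep synonymous labels such as result_from and caused_by apart, which is precisely why semantic clustering is needed.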
To sum up, formal definition generation is a difficult task that requires a full range of natural
language processing steps to be performed in order to get accurate and meaningful definition
formulas. The ambiguity and richness of natural languages are the main obstacles for the task;
however, previous attempts to acquire ontology axioms from text demonstrated promising results
that unlock the huge potential of this formalization approach. We aim at addressing the FDG task in
such a way that the output formulas are semantically enriched and thus are suitable for various
reasoning tasks. The domain of choice is life sciences.
1.4 Objectives and Outline

The thesis pursues two main objectives:
1) to explore the task of automatic generation of formal definitions from natural texts. This includes
identifying necessary text processing activities, splitting the original task into consecutive steps,
investigating possible methodologies for every step, unveiling the difficulties and typical mistakes
and analyzing potential applications and future directions of the task. The work is particularly
focused on non-taxonomic relations and addresses relation extraction from two different angles:
relation instance extraction and identification of domain-relevant relations. Different techniques of
supervised relation extraction are investigated; in particular, rule-based and machine learning-based
relation extraction approaches are compared. In addition, the work explores unsupervised relation
extraction, proposing a novel algorithm for semantic relation discovery and giving the complete
roadmap for the unsupervised relation extraction process.
2) to create a pipeline that transforms textual definitions of biomedical concepts into logical
representation. The system uses existing biomedical resources and text mining tools; however, it is
not strictly dependent on the specific choice of resources and can be freely customized, which
makes transferring it to other domains possible. The developed approach incorporates semantic
analysis at various stages of the transformation process in the form of semantic indexing, relation
classification and semantic clustering. To the best of our knowledge, this is the first system that
generates formal biomedical definitions and the first system that performs axiom generation using
the set of semantic relations that was generated in an unsupervised way.
The current work has been done as part of the DFG-funded research unit Hybrid Reasoning for
Intelligent Systems (HYBRID), project B1. The project focuses on the automatic generation of
description logic-based biomedical ontologies. The work is based on the ongoing collaboration
between the Bioinformatics group of BIOTEC, TU Dresden led by Prof. Dr. Michael Schroeder and
the Automata Theory group of the Faculty of Informatics, TU Dresden headed by Prof. Dr.-Ing.
Franz Baader.
The structure of the thesis is as follows: Chapter 2 contains background knowledge with respect to
the task of formal definition generation and introduces relevant resources and formalisms. An
overview of related work on relation extraction, both in the general and in the biomedical domain, which is
the cornerstone for the task at hand, is given in Chapter 3. Chapter 4 presents our previous work on
biomedical relation extraction which attempts to model three relations relevant to biomedical
definitions using the SNOMED CT formal ontology. Chapter 5 describes the three key steps in
definition formalization, namely concept annotation, relation classification and definition
construction, giving the detailed specification of methods and algorithms. Evaluation and discussion
of the methods as well as the analysis of typical mistakes of the definition generation pipeline and
its components are given in Chapter 6. Chapter 7 explores the possibilities of unsupervised relation
extraction and proposes an algorithm for the induction of relevant relation types from text. Finally,
the thesis is summarized in Chapter 8 with conclusions and future work.
2. Background
This chapter introduces the notions, methods and resources relevant for the task of biomedical
knowledge formalization and definition extraction. On a more conceptual level, we will discuss
what a definition is and how it is related to the process of ontology learning; these issues are
covered by Sections 2.1 and 2.2. Moving to the implementation aspect, in Section 2.3 we will
discuss the available resources that represent the biomedical knowledge in a formalized way:
taxonomies, ontologies, controlled vocabularies. Section 2.4 gives an overview of Description
Logics, a family of logics that is commonly used as a knowledge representation formalism for
ontologies.
2.1 What is a definition?
The Merriam-Webster4 dictionary has the following entry for the word “definition”:
• an explanation of the meaning of a word, phrase, etc; a statement that defines a word, phrase,
etc.;
• a statement that describes what something is;
• a clear or perfect example of a person or thing.
In essence, a definition is a statement that explains the meaning of some term – definiendum – using
a set of other terms – definiens. In the example above “definition” is the definiendum that has three
definientia.
There are two major types of definitions: intensional and extensional definition5. An intensional
definition gives necessary and sufficient properties that apply to all objects of the definiendum class.
An extensional definition, in contrast, defines a concept by enumerating all the objects of the class.
The most widely used type of intensional definitions is the so-called genus-differentia definition6. It
is a two-fold statement: the first part, genus, specifies a broad class to which the defined concept
belongs; the second part, differentia, distinguishes the definiendum from the other concepts of the
same genus. The structure of this type of definition can be captured by a formula A is a B that has
property C. For example, a triangle (A) is a plane figure (B) that has three sides and three vertices
(C). The terms genus and differentia were introduced by Aristotle.
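The genus-differentia structure maps directly onto the Description Logic notation introduced later in Section 2.4: the genus becomes a named superconcept and the differentia a restriction. As an illustrative sketch (using a formula that reappears in Section 2.4), the definition “a disease is an abnormal condition that affects the body” can be written as:

Disease ≡ Abnormal_condition ⊓ ∃affects.Body

Here Abnormal_condition is the genus and ∃affects.Body the differentia.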
The genus-differentia definition is a very convenient way of formalizing knowledge, as it
explicitly contains the information about super- and subclasses from which taxonomies may be
extracted. It also lists different properties and relations to other concepts. Together taxonomic and
non-taxonomic relations from multiple definitions can be organized into an ontology, a knowledge
base or another formal representation, with possibilities to automatically validate, search and reason
on top of it.
4 www.merriam-webster.com
5 http://en.wikipedia.org/wiki/Definition
6 http://en.wikipedia.org/wiki/Genus-differentia_definition
2.2 What is Ontology Generation?
Ontology generation, or ontology learning, is the task of acquiring formal domain knowledge from
data. The term was introduced in [Mädche et al. 2001], which defines an ontology as a data schema
that provides a controlled vocabulary and formal semantics for concepts and for relations between
concepts.
The source data can be structured (e.g. schemata), semi-structured (e.g. in XML format) or
unstructured (text). Due to the complexity of the task, ontology learning from text is performed
either automatically or semi-automatically. In the semi-automatic scenario, the system suggests
relevant concepts, relations and relational instances (“a concept A is linked to a concept B by a
relation R”) to the domain expert, and it is up to the expert to add the suggested information to
the ontology. Ontology learning can be seen as a reverse engineering task, since the expert has some
knowledge about the domain while writing texts, and the aim of the task is to reconstruct this
knowledge.
The basic ontology components are concepts, relationships and axioms. Concepts (classes,
categories) are abstractions of groups of objects. A concept can be defined intensionally, through
the common properties of the objects it represents, or extensionally, by enumerating the objects
themselves. Examples of concepts: Disease, Anti-inflammatory Drug, Mammal etc.
Concepts can be organized into a hierarchy by a subsumption relation; the subsuming concept is
called superconcept, the one that is subsumed is a subconcept. The subsumption, or is_a, is a
taxonomic relation. An ontology may also contain non-taxonomic relations (or relationships) that
specify arbitrary links between concepts. A relation can take multiple arguments, binary relations
being the most widespread. A relation has a domain and a range which are defined the same way as
for functions [Paley 1966]. For a binary relation, domain and range are equivalent to the types of
arguments that the relation can take. Examples of relations are: causes, located_in, part_of etc.
Axioms encode inference information about concepts and relations. They are used for reasoning
purposes, to derive new knowledge, not explicitly present in the ontology. Class axioms are
statements about concepts and their relations to other concepts. Axioms form complex concept
descriptions on top of simpler concepts and assign names to them.
2.2.1 Steps of ontology learning
Ontology learning is a complex task. [Cimiano06] addresses ontology learning as a complex 8-step
process:
− get relevant terminology (term extraction)
− identify synonyms (synonym discovery)
− form concepts (concept extraction)
− organize concepts hierarchically (concept hierarchy induction)
− define relations, their domain and range (relation extraction)
− organize relations hierarchically (relation hierarchy induction)
− define axioms (schemata instantiation and axiom learning)
Figure 2. Ontology Learning layer cake by [Cimiano06].
During the term extraction we collect lexical units (terms, phrases etc.) relevant for the domain. In
the course of the synonym discovery we group together terms that belong to the same concept.
These can be strict synonyms, lexical variants of terms, abbreviations, as well as terms that are not
fully equivalent in meaning, but share common semantics and can potentially belong to the same
class. Concept extraction provides an intensional and/or an extensional definition for every concept.
Additionally typical patterns of expression in text can be enumerated.
By inducing the concept hierarchy we organize concepts into a semi-upper lattice, a reflexive,
anti-symmetric and transitive structure with a top element in which every two elements have a
unique least common subsumer. There may be more than one hierarchy in an ontology, e.g. MeSH
consists of 16 distinct hierarchies (trees).
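To make the semi-upper lattice property concrete, the following minimal sketch (with a hypothetical toy hierarchy, not taken from any real resource) computes the least common subsumer of two concepts by walking up their is_a chains:

```python
# Minimal sketch (toy hierarchy, not taken from any real resource) of the
# semi-upper lattice property: any two concepts have a unique least common
# subsumer, found by walking up the is_a chains towards the top element.
parent = {                      # child -> parent; "Thing" is the top element
    "Disease": "Thing",
    "Lung_Disease": "Disease",
    "Pulmonary_Atelectasis": "Lung_Disease",
    "Bronchitis": "Lung_Disease",
    "Drug": "Thing",
}

def ancestors(concept):
    """Reflexive chain of ancestors from the concept up to the top."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def least_common_subsumer(a, b):
    """Most specific concept that subsumes both a and b."""
    up_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in up_a)

print(least_common_subsumer("Pulmonary_Atelectasis", "Bronchitis"))  # Lung_Disease
```

Because the structure has a top element, the search is guaranteed to terminate with a common subsumer for any pair of concepts.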
The relation extraction task can be further split into several subtasks: identifying a relevant set of
relations and labeling them, finding typical lexical ways of expression for every relation, defining
their argument types (i.e. domain and range). More information on relation extraction will be given
in Chapter 3. After learning relations separately they can be organized into a hierarchical order.
The last step in ontology learning concerns axioms. Axiom schemata instantiation task formalizes
concept properties such as equivalence or disjointness and relation properties such as symmetry,
transitivity, reflexivity etc. General axiom learning captures formally complex concept descriptions
composed of previously learned concepts and relations.
2.2.2 Ontology learning and Definition generation
An ontology describes a domain theory which can be decomposed into a set of axioms. An ontology
is an axiomatization of definitions for concepts and relations [Cimiano06]. A well-known logical
formalism for ontologies is Description Logic (DL, see Section 2.4). Terminological axioms in DL
are definitions of concepts – complex structures built on top of previously defined concepts. In fact,
terminological axioms of DL that have a concept name on the left-hand side of an equation and
some assertion on its right-hand side are called definitions. Thus the task of extracting formal
definitions from text can be viewed as a type of ontology learning.
In practice, several ontology learning tasks can be tackled by the same text mining procedures.
Terms and their synonyms can be identified in text and normalized to concepts by the concept
annotators (see Chapter 5.2). Relation-oriented tasks, including concept hierarchy induction
(which in essence boils down to identifying taxonomic relation instances), can be treated by a
plethora of techniques which are covered in Chapter 3. Finally, syntactic and semantic parsing can
help construct complex axioms from raw text using identified concepts and relations.
Figure 3. Ontology learning tasks grouped by text mining procedures. Note that some tasks, e.g. relation
hierarchy induction, may be omitted.
Research advances in ontology generation differ from layer to layer. In particular, for the
biomedical domain a lot of effort has been put into creating high-performing annotators [Aronson 2006]
[Tsatsaronis et al. 2012]. Taxonomic relation extraction is a relatively easy task that can be
efficiently solved by the use of lexico-syntactic patterns [Wächter et al. 2011] [Velardi et al. 2013].
Much less has been done in the domain of non-taxonomic biomedical relation extraction. Hence,
our work will primarily focus on the latter task.
2.3 Biomedical Knowledge Resources
There exists a wide range of different biomedical knowledge resources: databases, networks,
vocabularies, taxonomies, ontologies etc. [Baclawski et al. 2005]. The latter three play a major role
for researchers as they store and organize the scientific vocabulary, they align synonyms and
abbreviations with the main term, they group terms into concepts and explicitly specify relations
between different concepts. Such information, organized formally, is of great value as it forms the
common language for the domain community: it facilitates human and computer interaction, search,
hypothesis validation, reasoning, knowledge discovery and integration etc. Ontologies and
thesauri can also be a source for definitions, both in textual and in formal representation, which
makes them highly relevant for the task of formal definition generation.
The construction of biomedical ontologies has long been an area of active research. Table 1 gives
an overview of the six largest and most famous ontologies for the biomedical domain, stating for each
ontology its size, year of the first release, the main purpose of its use (research or production
purposes) and which type of definition is given for concepts.
Ontology     # concepts   Year   Research/Production   Definitions
UMLS         1,000,000    1986   R, P                  textual, triples
SNOMED CT    300,000      1965   P                     formal
FMA          75,000       1995   P                     triples
GO           42,000       1998   P                     textual
GALEN        29,000       1991   R                     formal
MeSH         25,000       1963   P                     textual
Table 1. An overview of six widely used biomedical ontologies.
Although these ontologies are quite mature, some of them being as old as 60+ years, only two of
them are fully formalized (SNOMED CT and GALEN) and among them only SNOMED CT is used
for production. This makes the need for the formalization of biomedical knowledge apparent:
even the existing resources are far from being fully formalized, and the upcoming
knowledge from new papers, patents and articles is not integrated.
Section 2.3 gives an overview of three widely used biomedical knowledge resources: SNOMED
CT, UMLS and MeSH. All three are manually curated and are of substantial size. We now move
from more structured to less structured resources. We first introduce SNOMED CT, a fully
formalized domain ontology, then we proceed with UMLS which is an aggregation and alignment
of multiple ontologies that also contains an upper domain ontology (Semantic Network), and then
we conclude with MeSH, a controlled vocabulary in which the concepts are defined informally via
textual definitions and are only ordered taxonomically.
2.3.1 SNOMED CT
SNOMED CT is a formal medical ontology that describes concepts such as body structures,
disorders, organisms, findings, procedures and so on. In total it has more than 311,000 active
concepts. All concepts in SNOMED CT are organized into acyclic taxonomies that express parent-child dependencies. The 19 top level concepts are:
Clinical finding
Specimen
Situation with explicit context
Procedure
Special concept
Staging and scales
Observable entity
SNOMED Model Component
Physical object
Body structure
Physical force
Qualifier value
Organism
Event
Record artifact
Substance
Social context
Pharmaceutical or biologic product
Environment or geographical location
Table 2. The top-level concepts of the SNOMED CT hierarchy.
Apart from the “is a” relation, SNOMED CT contains 56 different attribute relationships7. A
concept is defined by the set of its relationships to other concepts. Concepts have at least one
relationship (“is a”), but many of them are also linked by attribute relationships.
7 These are the relationships that have at least one occurrence in SNOMED CT. We have also found
mentions of 9 more relationships (Episodicity, Moved to, Severity etc.) that have no instances in the
current version of SNOMED CT.
Example: Pulmonary Atelectasis (disorder)8
Figure 4. The concept Pulmonary Atelectasis as described in SNOMED CT.
Attributes are organized into a flat hierarchy, with the exception of three groups that have a simple
two-level hierarchy. Below are the top 10 most frequent relationships in SNOMED CT:
• Finding site
• Method
• Associated morphology
• Direct procedure site
• Has active ingredient
• Causative agent
• Has dose form
• Interprets
• Indirect procedure site
• Direct morphology
As can be seen from the list above, the semantics of a relationship does not always follow
straightforwardly from its name. The reason is that SNOMED CT is a machine-processable
resource that was not designed to be intuitive for humans. The underlying structure of SNOMED
CT is based on formal logics, more specifically on a subset of the lightweight Description Logic
EL++. As will be shown in Chapter 4, the formal modeling of the biomedical domain used in
SNOMED CT is also not fully compatible with the modeling that a human specialist may construct,
which is reflected in the attempts to align SNOMED CT with natural language texts.
SNOMED CT is maintained by the International Health Terminology Standards Development
Organization (IHTSDO). It has been gaining popularity as a reference vocabulary in clinical
research [Rich et al. 2006]. The main reasons for that are its size (SNOMED CT claims to be the
most comprehensive clinical healthcare terminology in the world and it is one of the biggest
biomedical ontologies out there) as well as its quality (it is a manually curated resource). SNOMED
CT is of particular relevance for the work described in this thesis as it incorporates a rich set of non-taxonomic relations, thus having rich expressivity.
8 Screenshot taken from the SNOMED CT Browser: http://www.medicalclassifications.com/SNOMEDbrowser/
2.3.2 UMLS
UMLS (Unified Medical Language System) is a meta-resource that accumulates data from over 100
controlled vocabularies, ontologies and thesauri and provides consistent mappings from one
terminology to another [Bodenreider 2004]. Included are both SNOMED CT and MeSH, as well as
the Gene Ontology etc. It is developed by the US National Library of Medicine (NLM) [Unified
Medical Language System] and is the biggest terminological resource for biomedicine. The 2013
release covers almost 3 million concepts and 11.5 million concept names9.
Figure 5. UMLS structure [UMLS Reference Manual 2009].
UMLS was designed for the purpose of data integration serving as an interlink between existing
heterogeneous resources. It also semantically organizes the terminology by merging synonyms into
joint concepts. UMLS is a rich knowledge resource that enables knowledge understanding and is
used as a backbone for intelligent systems operating in the biomedical domain. It has numerous
applications, e.g. information search [Pratt 1997], semantic annotation10 [Aronson 2006], knowledge
representation [Baclawski et al. 2000] etc.
UMLS consists of Knowledge Sources and auxiliary software tools (implementation resources).
There are three knowledge sources in UMLS: the Metathesaurus, the Semantic Network and the
SPECIALIST Lexicon.
The Metathesaurus is the main database containing the meta-concepts and their mappings to
synonymous concepts from different resources. Every concept has a Concept Unique Identifier
(CUI) associated with one or more external concept IDs. Apart from the CUI, an entry of the
Metathesaurus may include synonyms and terminological counterparts, a definition, relationships
with other concepts and other supplementary information. Relations between concepts are mostly
inherited from the source ontologies and thesauri, although some relations were introduced in the
Metathesaurus itself. Concepts are also assigned one or more semantic types from the Semantic Network.
The Semantic Network [Schulze-Kremer et al. 2004] is an upper ontology for the biomedical
domain which forms the top level of the UMLS concept hierarchy. It has 133 semantic types and 54
semantic relations. Types and relations are very broad and are used for high-level categorization
and interlinking of concepts.
9 http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/statistics.html
10 http://metamap.nlm.nih.gov/
Semantic types are organized into a hierarchy by the "is_a" relation. The maximum depth of the
hierarchy is 8, but it should be noted that the degree of granularity varies across the Semantic
Network. The semantic type is always assigned to a concept at the most specific level available. The
two main categories of types are Entity and Event. Below is a branch of the type hierarchy that has
type Human as a leaf:
• Entity
  ◦ Physical Object
    ▪ Organism
      • Eukaryote
        ◦ Animal
          ▪ Vertebrate
            • Mammal
              ◦ Human
Semantic relations are very general in nature: they link semantic types, but they may not always hold
between the instances of the corresponding types. For example, according to the Semantic Network,
Clinical Drug CAUSES Disease or Syndrome, but this relation does not hold between Aspirin and
Cancer.
Relations are also organized into a hierarchy. The top level divides the relations into taxonomic
("isa") and non-taxonomic ("associated_with") ones. The latter branch further into 5 major
categories. Below is the tree of the Semantic Network relations with the two upper levels specified:
− isa
− associated_with
  − physically_related_to
    − ... (8 relations)
  − spatially_related_to
    − ... (4 relations)
  − temporally_related_to
    − ... (2 relations)
  − functionally_related_to
    − ... (20 relations)
  − conceptually_related_to
    − ... (13 relations)
The SPECIALIST Lexicon is a set of lexical entries with spelling, abbreviations, acronyms, part
of speech and other information about a subset of core terms from the Metathesaurus as well as
about common English words. It is used in Natural Language Processing applications.
UMLS is an extremely valuable resource as it enables the transition between various taxonomies
and ontologies. In particular, UMLS is relevant for the current work as with its help concepts from
textual resources (like MeSH) can be mapped to those from highly formalized resources (like
SNOMED CT). In addition, the Semantic Network from UMLS describes general concept types
and relations relevant for the domain, and the types are assigned to semantic classes for more
specialized concepts via the Metathesaurus. Therefore, they may be useful for relation extraction.
2.3.3 MeSH
MeSH (Medical Subject Headings)11 is a controlled vocabulary in the form of a thesaurus. Like
UMLS, it is developed by the NLM. It serves as a source for biomedical semantic indexing12.
MeSH is a hierarchically ordered collection of entries (descriptors). The 2013 version13 contains
26,853 entries covering more than 214,000 terms. Every descriptor has a unique identifier (UI),
positions in the hierarchy trees, a head term and its possible variants, a definition and other
information. All descriptors are organized into 16 distinct semantic trees; every descriptor belongs
to at least one tree. The MeSH trees are14:
• Anatomy [A]
• Organisms [B]
• Diseases [C]
• Chemicals and Drugs [D]
• Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
• Psychiatry and Psychology [F]
• Biological Sciences [G]
• Physical Sciences [H]
• Anthropology, Education, Sociology and Social Phenomena [I]
• Technology and Food and Beverages [J]
• Humanities [K]
• Information Science [L]
• Persons [M]
• Health Care [N]
• Publication Characteristics [V]
• Geographic Locations [Z]
Table 3 represents the main fields of the descriptor for Pulmonary Atelectasis. The accompanying
definition contains information on the nature of the term (“absence of air”), its location in the
body (“lung”) and its possible causes (“airway obstruction, lung compression,...”). As can be seen
from Figure 4, this information partly overlaps with the information about the same condition encoded
in SNOMED CT. However, the MeSH definition is more detailed and easier to comprehend. This is
not only because textual definitions contain unstructured data (as opposed to structured entries in
SNOMED CT), but also because the knowledge modeling itself is human-oriented: although the MeSH
tree layout is made for automatic search and indexing, the result of the search is supposed to be
satisfying and convenient for biomedical specialists.
Example: Pulmonary Atelectasis
MeSH Heading               Pulmonary Atelectasis
Unique ID                  D001261
Tree Number                C08.381.730
Definition (“Scope Note”)  Absence of air in the entire or part of a lung, such as an incompletely inflated
                           neonate lung or a collapsed adult lung. Pulmonary atelectasis can be caused by
                           airway obstruction, lung compression, fibrotic contraction, or other factors.
11 http://www.nlm.nih.gov/mesh/
12 http://www.gopubmed.org/web/gopubmed/
13 http://www.nlm.nih.gov/pubs/factsheets/mesh.html
14 http://www.nlm.nih.gov/mesh/trees.html
The position of the term Pulmonary Atelectasis in the MeSH hierarchy:
• Diseases [C]
  ◦ Respiratory Tract Diseases [C08]
    ▪ Lung Diseases [C08.381]
      • Pulmonary Atelectasis [C08.381.730]
Table 3. The concept Pulmonary Atelectasis as described by MeSH15.
MeSH is a manually curated resource. In particular, the definitions that accompany every descriptor
are composed by domain specialists based on medical encyclopedias and reference books or taken as
textual quotations from the latter. As such, MeSH suggests itself as an appropriate resource for
textual definitions due to its trustworthiness and broad coverage.
2.4 Description Logics
This work addresses the problem of translating textual definitions into a formal representation. In
the previous section we have discussed the existing biomedical resources, including MeSH, which
is a controlled vocabulary with English definitions supplied for all the terms covered. This section
will give an overview of Description Logics, the formalism of choice for the task of formal
definition generation.
Description Logics (DL) are a family of logics that serve as knowledge representation languages for
domain knowledge. They have formal semantics that enable reasoning and inference of new (not
explicitly stated) knowledge. DL provides a logical formalism for ontologies and the Semantic
Web.
Various DLs differ in their expressive power captured by the set of language constructors. The more
expressive the DL is, the more computationally complex the inference problems become in the
worst case. DLs with very rich expressivity may even be undecidable, i.e. there exists no algorithm
terminating in finite time that can decide whether a formula F is formally deducible from a set of
formulas G. Thus, every DL is based on a trade-off between its expressive power and its complexity.
Basic building blocks of a DL are individuals (constants), concepts (unary predicates) and roles
(binary predicates) [Baader et al. 2003], which can intuitively be perceived as objects, classes of
objects and relations between them. The semantics of concepts and roles is defined by an
interpretation. An interpretation I consists of a non-empty set of elements ∆I (the domain) and an
interpretation function ·I that maps every concept C to a subset CI of the domain and every role R
to a relation RI ⊆ ∆I × ∆I. Examples of biomedical concepts and roles are:
Disease, Drug
treats, causes
The alphabet of a DL consists of constants, variables, unary and binary predicates, syntactic
15 MeSH Browser: http://www.nlm.nih.gov/mesh/2013/mesh_browser/MBrowser.html
constructors and special symbols. Constructors are used to build complex concept and role
descriptions:
Drug ⊓ ∃treats.Lung_Disease
DL statements can be divided into two components: TBox and ABox. Together the TBox and ABox
form a domain knowledge base. The TBox (“terminological component”) is the vocabulary of a
knowledge base; it contains concept and role descriptions – axioms. Terminological axioms can
have two forms, inclusions and equalities:
C ⊑ D    or    C ≡ D,
where C, D are concepts. Axioms for roles are defined likewise. The semantics of axioms is quite
intuitive: if C ≡ D under some interpretation I, then the set CI is equal to the set DI; if C ⊑ D under
some I, then CI is a subset of DI.
An equality with a concept name on the left-hand side is a definition. An inclusion with a concept
name on the left-hand side is a specification. Classical terminologies consist only of definitions,
while generalized TBoxes may contain specifications as well. Thus, a set of axioms is a generalized
terminology if the left-hand side of each axiom is a concept name and every concept name occurs
at most once on a left-hand side [Baader et al. 2003].
Disease ≡ Abnormal_condition ⊓ ∃affects.Body
Drug ≡ Substance ⊓ ∃affects.Body
Toxic_Drug ≡ Drug ⊓ Toxic_Substance
The ABox is the component that contains assertions, or facts. It describes the state of affairs of a
domain by assigning properties to individuals using the terminological vocabulary from the TBox.
Individuals are passed as arguments to the predicates, i.e. concepts and roles:
Disease(Lung_cancer)
treats(Amoxicillin, Bronchitis)
2.4.1 Basic DL constructors
We will look in detail into the syntax and semantics of several concept constructors common for
many DLs [Krötzsch et al. 2012]. These constructors will be used for the current translation of
textual definitions into the DL notation, although more complex notation that captures the semantics
of definitions in more detail can later be used as well.
1. conjunction: B ≡ C ⊓ D
B is the set of individuals that are both C and D.
BI = CI ∩ DI
2. disjunction: B ≡ C ⊔ D
B is the set of individuals that are either C or D.
BI = CI ∪ DI
3. negation: C ≡ ¬D
The complement of D is the set of all individuals minus the ones that are D.
CI = ∆I \ DI
4. existential restriction: C ≡ ∃R.D
CI = {c | there exists d ∈ ∆I such that (c,d) ∈ RI and d ∈ DI}
C is the set of individuals that participate in the role R with some instance of D as the second
argument.
Ex: D ≡ ∃causes.Bronchitis; D is the set of domain entities that can be the cause of bronchitis.
5. universal restriction: C ≡ ∀R.D
CI = {c | for all d ∈ ∆I, (c,d) ∈ RI implies d ∈ DI}
C is the set of individuals for which every second argument of the role R, if any, is in D.
Ex: D ≡ ∀causes.Bronchitis; D is the set of domain entities that cause only bronchitis (if they
cause anything at all).
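The set-theoretic semantics of these five constructors can be made concrete over a small finite interpretation. The sketch below is purely illustrative; the domain and the extensions of the concepts and the role are invented:

```python
# Illustrative sketch (not from the thesis): the set-theoretic semantics of the
# five basic constructors, evaluated over a small hypothetical interpretation.
DOMAIN = {"aspirin", "penicillin", "bronchitis", "flu"}
DRUG = {"aspirin", "penicillin"}          # Drug^I, a subset of the domain
DISEASE = {"bronchitis", "flu"}           # Disease^I
TREATS = {("aspirin", "flu"),             # treats^I, a set of pairs
          ("penicillin", "bronchitis")}

def conj(c, d):    # B ≡ C ⊓ D   =>  B^I = C^I ∩ D^I
    return c & d

def disj(c, d):    # B ≡ C ⊔ D   =>  B^I = C^I ∪ D^I
    return c | d

def neg(c):        # C ≡ ¬D      =>  C^I = ∆^I \ D^I
    return DOMAIN - c

def exists(r, d):  # C ≡ ∃R.D: elements with at least one R-successor in D
    return {x for x in DOMAIN
            if any((x, y) in r and y in d for y in DOMAIN)}

def forall(r, d):  # C ≡ ∀R.D: elements all of whose R-successors are in D
    return {x for x in DOMAIN
            if all((x, y) not in r or y in d for y in DOMAIN)}

# Drug ⊓ ∃treats.Disease: drugs that treat at least one disease
print(sorted(conj(DRUG, exists(TREATS, DISEASE))))  # ['aspirin', 'penicillin']
```

Note that the universal restriction holds vacuously for elements with no R-successors, which is why ∀treats.Disease evaluates to the whole domain in this toy interpretation.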
2.4.2 From triples to Description Logic formulas
In the scope of this work we will be using the DL notation for concepts, relations (roles) and
constructors, as well as a simplified version of it. In the chapters that follow, whenever we write
C = relation(D), the existential restriction is assumed: C ≡ ∃relation.D.
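As a small sketch of this convention (the helper function is ours, not part of the thesis pipeline), an extracted triple can be rendered as a DL axiom:

```python
# Hypothetical helper (the function name is ours, not from the thesis): render
# an extracted triple under the convention that C = relation(D) abbreviates
# the existential restriction with relation as the role.
def triple_to_axiom(subject, relation, obj):
    return f"{subject} ≡ ∃{relation}.{obj}"

print(triple_to_axiom("Pulmonary_Atelectasis", "finding_site", "Lung"))
# Pulmonary_Atelectasis ≡ ∃finding_site.Lung
```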
One last comment: in an arbitrary knowledge base, specific genes, diseases and drugs may be treated
as individuals (constants) belonging to the concepts Gene, Disease and Drug. However, in biomedical
ontologies like SNOMED CT or MeSH they are regarded as separate concepts subsumed by Gene,
Disease and Drug, respectively. We will stick to the latter way of axiomatization.
3. Related work on relation extraction
Relation extraction is an area of Information Extraction. It was first introduced as a separate
subfield during the Sixth Message Understanding Conference (MUC-6) in 1995, in the Template
Element evaluation task. It was further developed within the Automatic Content Extraction (ACE)
meetings: the task of Relation Detection and Characterization (RDC) was set in 2002. Early
steps in relation extraction boiled down to event extraction, i.e. the detection of specific templates in
the text that involve certain classes of objects, e.g. persons, organizations etc.
Relation extraction (RE) is the task of detecting and classifying semantic relationships that hold
between different entities. As follows from this definition, the task of relation extraction can
have two interpretations: according to the first interpretation, one needs to detect that there is some
relation between entities, without stating what kind of relation it is. Following the second
interpretation, RE is the process of finding the related concepts and stating the type of the relation.
In the area of relation extraction there is a certain degree of freedom with respect to the terminology.
The terms “relation”, “relationship” and “role” are usually used interchangeably. The very term
“relation extraction” is somewhat ambiguous: although it may as well mean the detection of
relevant types of relations in the text, usually it is used as a shortcut for “relational instance
extraction”, i.e. the extraction of individual instances of a particular relation, which have specific
first and second (subject and object) arguments. To overcome this ambiguity, the term “triple” is
sometimes used instead of “relational instance”.
Relation extraction can be performed both on structured and unstructured data. We are mostly
interested in the relation extraction from text; in this case both entities and relations are encoded
into a textual representation.
3.1 Relation extraction for the general domain
If we adopt the first interpretation of the relation extraction task and understand RE simply as
detecting the dependencies between concepts without labeling the relation, the task can be
completed by exploiting the co-occurrences of concepts. There are different statistical measures
used for that: pointwise mutual information (PMI), Pearson correlation, Chi-square, TF-IDF etc.
The co-occurrence approach is quite straightforward, as it is based on a simple assumption that two
entities should be related if they occur together in texts. The obvious disadvantage of such a
statistical approach is that it lacks the semantics of the established connections. One cannot identify
which kind of relation is there and whether there exists a direct relation at all (“correlation does not
imply causation”).
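As an illustration of the co-occurrence approach, PMI can be estimated from document counts; the counts below are invented, not real corpus statistics:

```python
# Illustrative sketch of the co-occurrence approach (counts are invented, not
# real corpus statistics): pointwise mutual information estimated from
# document frequencies.
import math

def pmi(n_xy, n_x, n_y, n_total):
    """PMI(x, y) = log2(P(x, y) / (P(x) * P(y))): how much more often the two
    terms co-occur (n_xy documents) than expected if they were independent."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the two terms co-occur 25 times more often than chance
# would predict, which suggests some (still unlabeled) relation between them.
print(round(pmi(n_xy=50, n_x=100, n_y=200, n_total=10_000), 2))  # 4.64
```

A high positive PMI indicates that the co-occurrence is unlikely to be accidental, but, as discussed above, it says nothing about the type of the relation.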
If we adopt a stricter definition of the RE task, i.e. that it requires identifying the relation label, we
have to rely on lexical, syntactic and/or semantic properties of the textual content.
3.1.1 Relation extraction and the types of linguistic processing
3.1.1.1 Lexical patterns
Most early work on this task was rule-based and involved the construction of lexical patterns for each
relation. Lexical patterns require only simple linguistic processing: sentence splitting, tokenization,
lemmatization, and sometimes part-of-speech tagging.
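A minimal sketch of such a lexical pattern follows; the regular expression and the example sentence are our own illustration, not rules from the systems cited below:

```python
# A minimal sketch of a lexical pattern (our own illustration, not a rule from
# the cited systems): a simple "X is a Y" pattern extracting a hyponymy triple.
import re

IS_A_PATTERN = re.compile(r"(\w[\w ]*?) is a (\w[\w ]*)")

def extract_is_a(sentence):
    """Return (hyponym, 'is_a', hypernym) if the pattern matches, else None."""
    m = IS_A_PATTERN.search(sentence)
    if m:
        return (m.group(1).strip(), "is_a", m.group(2).strip())
    return None

print(extract_is_a("Bronchitis is a lung disease"))
# ('Bronchitis', 'is_a', 'lung disease')
```

Real pattern-based systems use many such patterns per relation and typically operate on lemmatized, sentence-split input rather than raw strings.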
[Ruiz-Casado et al. 2005] learn textual patterns for hyponymy, hypernymy, holonymy and
meronymy using the articles from the Simple English version of Wikipedia16 and its taxonomy of
concepts: whenever two concepts that are explicitly linked by the taxonomy as belonging to one of
the four target relations are spotted in the text, the textual context is analyzed and the pattern is
extracted.
[Sánchez et al. 2012] construct simple linguistic patterns and use them as queries to Web search
engines in order to retrieve statistical evidence of a certain property of a relational instance. Given
an instance of a non-taxonomic relation, the method checks whether the relation is transitive,
symmetrical, reflexive, inverse, functional or inverse functional.
[Chklovski et al. 2004] build 35 surface patterns to define the semantic relations between verbs.
They argue that verbs are the main means of expressing relations, and defining meta-relations on
top of existing non-taxonomic relations can considerably enrich the modeling. The meta-relations
are strength (Xed and even Yed), similarity (X i.e. Y), antonymy (either X or Y), enablement (to X by
Ying) and temporal happens-before relation (to X and eventually to Y).
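The surface patterns of [Chklovski et al. 2004] can be instantiated as literal query strings; the following sketch fills the X/Y slots with illustrative verb pairs (naive "+ed" inflection, which only works for regular verbs):

```python
# Surface patterns from [Chklovski et al. 2004], with X/Y slots.
META_PATTERNS = {
    "strength": "{x}ed and even {y}ed",
    "enablement": "to {x} by {y}ing",
    "happens-before": "to {x} and eventually to {y}",
}

def instantiate(meta_relation, x, y):
    """Fill the X/Y slots of a meta-relation pattern with two verbs."""
    return META_PATTERNS[meta_relation].format(x=x, y=y)

query = instantiate("strength", "wound", "kill")
```

Such instantiated strings can then be matched against a corpus (or sent as search-engine queries) to gather evidence that the meta-relation holds between the two verbs.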
[Wu et al. 2002] compose extraction rules that include lexical tokens together with part-of-speech
tags.
3.1.1.2 Syntactic patterns
Certain systems explore the syntactic structure of the source text. The motivation behind it is that
the semantic relations that hold between the two concepts should be reflected by syntactic
dependencies of these concepts. [Baclawski et al. 2005] state that a simple sentence represents a
simple fact, the subject and the object being the concepts and the predicate being the relation. Many
triple extraction systems use simple syntactic patterns to extract relational instances, or triples
[Krestel et al. 2010]. They require more involved processing steps, like dependency or constituency
parsing.
The Learning by Reading system [Hovy et al. 2011] extracts propositions from syntactic structures of
type Subject – Predicate – Object. For the arguments of a relation, i.e. for subjects and objects, the
lexical items are generalized to classes (the classes themselves are automatically derived from the
corpus). The predicates remain in their lexical form. To the best of our knowledge, no relation
hierarchies or grouping are used.
Never-Ending Language Learning (NELL), and specifically its component OntExt [Mohamed et al.
2011], derives new relations relevant for each pair of categories using a co-occurrence matrix over
lexical items. Namely, textual contexts between two category instances are extracted, only those
16 http://en.wikipedia.org/wiki/Main_Page
with a specific syntactic structure are kept, and the co-occurrence matrix of contexts is built over the
whole corpus.
3.1.1.3 Semantic patterns
Some systems incorporate semantic information into the extraction process. The entities and
potential relation mentions that were annotated in the text are assigned more general semantic
classes. If a combination of the semantic types of the argument concepts and the type of the relation
matches a certain pattern (which is induced from an existing ontology, is pre-defined manually or
appears with a high frequency), the underlying lexical relation is extracted. [Flati et al. 2013] extract
semantic predicates using semantic classes of argument concepts adopted from Wikipedia. [Dahab
et al. 2008] integrate top level ontologies to semantically parse the input text and to generate
semantic patterns of concepts and relations. In [Hovy et al. 2011] the semantic classes are
constructed by the system itself. [Exner et al. 2012] use semantic parsing to annotate sentences
with semantic roles and to form relational instances from them. [Fan et al. 2010] use predefined
verb frame structures to extract relations.
3.1.2 Relation extraction and different types of learning
The task of extracting relations can be done in a supervised way, e.g. using hand-written patterns, in
an unsupervised way, or using a mixed approach, e.g. bootstrapping when an initial seed of relation
instances is used. The latter is sometimes called “weakly supervised learning”. The development of
the relation extraction area of information extraction started from supervised approaches and has
been moving towards semi- and unsupervised ones. Modern RE systems operate on web-scale
data, extracting small chunks of information from redundant sources. Repeating lexical
patterns then form relation instances.
3.1.2.1 Supervised RE
Traditional relation extraction encompasses supervised learning techniques. [Mohamed et al. 2011]
state that traditional RE requires “the user to specify information about the relations to be learned”.
The information about the relations can be encoded in two ways:
• for every relation the set of corresponding patterns is manually tailored;
• relational instances are annotated in the text corpus, and the patterns are acquired explicitly
(based on frequent sequences of word tokens) or implicitly (using machine learning).
The new relational instances are extracted by pattern-matching or by running a trained machine
learning model over the input texts.
The supervised approach usually gives high precision of the retrieved relation instances. The main
disadvantage of the approach follows from the traditional trade-off between precision and recall:
the precision of the results retrieved using fixed patterns is usually quite high and can go over 90%,
but high recall is not guaranteed. Another drawback is that constructing the patterns manually
can be quite a tedious and time-consuming task. And, very importantly, in the case of traditional RE
one can extract only the relations listed beforehand; it is not possible to extract new relations without
repeating the pattern learning process from the very beginning. Thus, the approach is not scalable.
To bring supervised RE to a larger scale, texts are sometimes aligned with existing formalized
domain resources. An example of such an approach is presented in Chapter 4. [Riedel et al. 2013]
merged schemas and patterns from heterogeneous resources into a big matrix-like structure and
merged schemas and patterns from heterogeneous resources into a big matrix-like structure and
used the collaborative filtering techniques from the domain of recommender systems to draw
asymmetric implications between the schemas.
3.1.2.2 Semi-supervised RE
Semi-supervised learning of relations usually has a core of annotated material from which the
learning is initiated, and then the process of extraction proceeds in an unsupervised manner. The
approach combines the advantages of both supervised and unsupervised techniques: on the one
hand, some prior knowledge is given to the system to improve the performance; on the other hand,
there is no need for large volumes of annotated material, which are usually hard to get.
The NELL system mentioned above is built via bootstrapping [Carlson et al. 2010]. It starts with an
initial ontology (a form of prior knowledge) that contains some categories, relations and relational
instances. The ontology helps build the first set of patterns, which are then used to populate the
categories of the ontology and to extract new facts, which are in turn used to retrain the extraction
system and to learn yet more facts, etc.
Snowball is another classical example of bootstrapped relation extraction [Agichtein et al. 2000]. The
extraction patterns are built from a few training examples and the relational triples are extracted.
The process of extraction is iterative, and at every iteration step the patterns and triples are
automatically evaluated and filtered. The patterns use named-entity tags as classes of
arguments and the relational strings as the containers of relations. An example of a pattern is
<Organization>'s headquarters in <Location>.
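Matching such a pattern over NE-tagged text can be sketched as follows; the XML-style tagging scheme and the example sentence are assumed purely for illustration:

```python
import re

# A Snowball-style surface pattern: NE tags serve as argument classes,
# the literal string between them as the relation context.
PATTERN = re.compile(
    r"<ORGANIZATION>(?P<org>.+?)</ORGANIZATION>'s headquarters in "
    r"<LOCATION>(?P<loc>.+?)</LOCATION>"
)

def extract_headquarters(tagged_sentence):
    """Return (organization, location) pairs matched by the pattern."""
    return [(m.group("org"), m.group("loc"))
            for m in PATTERN.finditer(tagged_sentence)]

triples = extract_headquarters(
    "<ORGANIZATION>Microsoft</ORGANIZATION>'s headquarters in "
    "<LOCATION>Redmond</LOCATION> employ thousands."
)
```

In Snowball itself such matches are scored and filtered at each iteration, and confident matches feed the induction of new patterns.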
3.1.2.3 Unsupervised RE
The main distinctive feature of unsupervised relation extraction systems is that they do not use any
assisting information during learning: they are not provided with the seed examples, or background
expressive ontologies, or manually constructed patterns. The learning is performed purely from the
input data. The unsupervised RE approach aims at working with very big text corpora in order to
handle the volumes of data that are impossible to process by hand. The use of Big Data has its
advantages, e.g. an extracted piece of information can be verified statistically over the whole corpus,
but it also imposes certain constraints on the extraction procedures: as little linguistic processing as
possible should be involved, since things like syntactic or semantic parsing are computationally
expensive and cannot be performed on billions of texts.
The work by [Hasegawa et al. 2004] presents an unsupervised method for extracting relations
between named entities (NE) from large corpora of unstructured texts. Texts do not have any
linguistic annotation apart from named entity tags. The relation is defined “broadly as an affiliation,
role, location, part-whole, social relationship and so on between a pair of entities”. The process is
two-step: first the relation instances are extracted and grouped together, forming distinct relation
types, and then each type is assigned a label. Notably, all pairs of named entities appearing within a
certain proximity of each other are collected; then multiple occurrences of NE pairs are grouped
together and strings in between two entities (contexts) are clustered; after that cosine similarities
between contexts of different pairs are measured, and similar pairs form relation clusters. Each
cluster is then labeled using most frequent common words in the cluster contexts. The work is based
on two major assumptions: (a) the information about the relation is contained in the textual strings
that are located between the two NE mentions and (b) there is one dominating relation between
named entities that form a pair; otherwise the pair is not clustered at all.
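The two-step procedure of [Hasegawa et al. 2004] can be sketched as follows. The contexts, the similarity threshold and the greedy single-pass grouping below are illustrative simplifications of their hierarchical clustering:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster_pairs(pair_contexts, threshold=0.5):
    """Group NE pairs by the cosine similarity of the word strings
    found between the two entities (their contexts)."""
    vectors = {pair: Counter(" ".join(ctxs).split())
               for pair, ctxs in pair_contexts.items()}
    clusters = []
    for pair, vec in vectors.items():
        for cluster in clusters:
            if any(cosine(vec, vectors[p]) >= threshold for p in cluster):
                cluster.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters

contexts = {
    ("IBM", "Armonk"): ["is headquartered in", "headquartered in"],
    ("Google", "Mountain View"): ["is headquartered in"],
    ("Einstein", "relativity"): ["developed the theory of"],
}
clusters = cluster_pairs(contexts)
```

Here the two headquarters pairs fall into one cluster, which would then be labeled with its most frequent common context words (e.g. "headquartered").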
The work by [Zhang et al. 2005] also takes into consideration pairs of named entities, but the
learning process utilizes a tree-based similarity calculated over shallow syntactic parse trees of
sentences. The use of parse trees overcomes certain problems of [Hasegawa et al. 2004], and this is
reflected by the higher F-score. First, the performance is improved by the choice of the tree-based
similarity metric: “a similarity function over parse trees is proposed to capture much larger
feature spaces instead of simple word features”. Secondly, two entities forming a pair can possibly
be involved in more than one relation.
One of the popular unsupervised RE approaches is the so-called Open Information Extraction. It
is a domain-independent paradigm that uses web-scale input corpora of texts. OIE systems tend to
extract as many triples as possible, but the triples are not always well-formed or concrete. Thus, OIE tends
to have lower precision as compared to general unsupervised systems.
[Banko et al. 2007] are the pioneers of Open Information Extraction. Their system TextRunner
works in three steps. First, a deep linguistic analysis is performed over a small corpus of texts. The
system itself separates the parsed triples into positive and negative ones. The triples are used for
training a machine learning RE model. Secondly, the model classifies the rest of the corpus
(millions of sentences) and extracts positive triples. The extraction is done in one pass over the
corpus and does not involve the deep processing any more. Lastly, newly extracted triples are
assigned a confidence score based on the frequency count of the triple. The system is completely
unsupervised, taking raw texts as input and outputting relational triples. Unfortunately, only 1 million
out of 11 million high-confidence triples were evaluated as concrete rather than abstract, underspecified facts,
e.g. Einstein – derived – the Theory of Relativity versus Einstein – derived – theory.
To increase the rate of informative triples, [Fader et al. 2011] developed the system ReVerb. It uses
part-of-speech tagging for all sentences to be processed and imposes certain lexical and syntactic
constraints on the triples. The performance of ReVerb is considerably higher than that of
TextRunner (the area under the precision-recall curve is doubled), but this comes at the time and
space cost of tagging.
Recently, research in the area of unsupervised RE has moved towards more sophisticated machine learning
algorithms. In particular, generative models have proved to be very effective for the task.
[Alfonseca et al. 2012] use a background knowledge base (Freebase17) to disambiguate the
textual entities and the dependency parsing to extract the relational context from the text.
Disambiguated entities are grouped into pairs with respect to the knowledge base and are
accompanied with the corresponding context. Finally, a hierarchical topic model is used to extract
relations and patterns.
[Yao et al. 2011] experiment with different formulations of the Latent Dirichlet Allocation (LDA) model
for the semi-supervised RE task.
17 http://www.freebase.com/
3.1.3 Generating semantic representations
There is a class of works that do not restrict themselves to triple extraction, but rather try to get the
formal representation of the whole sentence, or text. The most common choices of the
representation are triples (pairs of argument concepts linked by a certain binary relation) and
axioms (in formalisms like Description Logic, RDF etc.). Both forms of representing knowledge are
suitable for the automatic ontology generation. While triple extraction implies formalizing the input
text only partially (i.e. the chunks of text that are not covered by extracted triples are left out) [Hovy
et al. 2011] [Mohamed et al. 2011], the axiomatization approaches aim at translating the whole
sentences into formal language notation [Völker et al. 2007] [Augestein et al. 2012].
These works lie in between the triple extraction approaches and the formal definition generation
ones and are of particular interest to us. Typical choices of formalisms and frameworks are
Description Logic, Resource Description Framework (RDF), Discourse Representation Theory
formalism (DRT) etc.
The system LExO (Learning Expressive Ontologies) presented in [Völker et al. 2007] has already
been discussed in Chapter 1.3.3. It translates natural language definitions into description logic
formulas in order to enrich inexpressive ontologies. The approach is based on full syntactic parsing
of a sentence. The dependency tree is transformed into OWL DL formulas through a chain of hand-written
syntactic rules that take into account parts of speech, sentence positions, tree positions and
syntactic roles of all words. The rules cover a broad set of syntactic structures, such as relative
clauses and prepositional, noun and verbal phrases, to name a few.
Another system, Boxer, uses DRT as the output formalism [Bos 2008] [Curran et al. 2007]. The
elements of DRT model the semantics of a text in terms of entities (discourse referents) and
relations between them (conditions). Discourse referents are the domain of a DRS. They represent
entities or events and serve as variables for conditions. Basic conditions can be unary or binary; the
former express a property of a discourse referent, are represented by nouns, adjectives, adverbs and
verbs and are labeled by lemmas or predefined classes; the latter express relations between two
referents and are represented by prepositions and predefined verb roles.
The system LODifier [Augenstein et al. 2012] is built on top of Boxer. It uses Boxer to produce
DRT structures for the input text and then translates them into RDF. Conditions from the Boxer
output are labeled either with predefined classes (e.g. event, person, agent) or with lemmas (e.g.
programming language, inspect). In both cases the labels are mapped to the RDF version of the
WordNet semantic network to provide URIs for relations.
All three works share two main drawbacks. Firstly, the transformation from one format to
another is rule-based, and the sets of rules are not exhaustive due to the high ambiguity and richness of
natural languages. Hence not all definitions are adequately transformed from text to formulas. The
second drawback of the approaches is that they are unable to generalize from the textual
representations, i.e. the output definitions use terms and few classes instead of concepts and
relations.
3.2 Biomedical extraction
The majority of research works on biomedical relation extraction focus on the relations between
specific concept types: genes, proteins, diseases and drugs. Identifying related genes, drugs,
proteins and diseases has huge potential for drug discovery and drug repositioning.
Heterogeneous pieces of information are mined from various textual sources and assembled
together in a form of ontologies, semantic networks, knowledge bases or other knowledge
representation structures. These structures can be analyzed and reasoned upon so that new
knowledge not explicitly stated in the source documents can be unveiled. The researchers can then
formulate pharmacological hypotheses that can be validated in the lab and potentially lead to new
types of drugs, new targets for already known drugs or drugs for diseases that do not yet have
efficient treatments.
One famous example of merging several pieces of information across different publications into a
hypothetical pathway that was validated in clinical trials and led to a drug repositioning is the study
of Swanson [Swanson 1986]. Swanson discovered that fish oil can effectively treat Raynaud's
syndrome, and this discovery was in fact literature-based: he took two observations already reported
in the literature, and connected them to form a new observation without running any biomedical
tests and experiments. The first observation stated that Raynaud's syndrome is connected to
increased blood viscosity. According to the second observation, fish oil decreases blood viscosity.
The resulting hypothesis was to treat Raynaud's syndrome with fish oil. The treatment was proven
to be effective.
Relation extraction in the biomedical domain adopts the methodologies of general relation
extraction.
1) One of the most common approaches is to use lexico-syntactic patterns. A set of relevant
relations is manually designed by domain experts, and every relation is assigned to a set of textual
patterns that are also constructed manually or extracted automatically from texts.
[Huang et al. 2004] extract protein-protein interactions using lexical patterns. Patterns are mined
through the dynamic alignment of relevant sentences that mention the interaction. Both the
precision and the recall of the system reach 80%.
[Xu et al. 2013] use a simple pattern-based approach to extract drug-disease relation instances from
MEDLINE abstracts. The patterns are not complicated (e.g. «DRUG-induced DISEASE»), thus the
approach exhibits a typical bias towards high precision at the expense of low recall: 90.4%
precision and 13.1% recall. However, the majority of extracted instances do not yet exist in a
structured way in biomedical databases, which proves the usefulness of the approach.
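The «DRUG-induced DISEASE» pattern can be sketched as a simple regular-expression match; the tiny drug and disease lexicons below are illustrative stand-ins for dictionary-based entity recognition:

```python
import re

# Illustrative lexicons; a real system would use annotator output
# or a terminology such as MeSH.
DRUGS = {"cisplatin", "heparin"}
DISEASES = {"nephrotoxicity", "thrombocytopenia"}

def extract_induced(sentence):
    """Return (drug, disease) pairs matched by the
    '<drug>-induced <disease>' surface pattern."""
    pairs = []
    for m in re.finditer(r"(\w+)-induced (\w+)", sentence.lower()):
        drug, disease = m.group(1), m.group(2)
        if drug in DRUGS and disease in DISEASES:
            pairs.append((drug, disease))
    return pairs

pairs = extract_induced(
    "Cisplatin-induced nephrotoxicity is a common dose-limiting toxicity.")
```

The sketch makes the precision/recall bias tangible: matches are almost always correct, but any paraphrase ("nephrotoxicity caused by cisplatin") is missed.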
The majority of works on pattern-based relation extraction rely on hand-crafted templates whose
construction is a laborious task. Even in cases where the patterns are built automatically, the
approach lacks the ability to extract relations that are not explicitly stated in the text, i.e. when the relation
is not properly mentioned by a verb, a deverbative noun etc., or when the two interlinked entities are
located too far from each other in the text, so that the pattern cannot cover them.
2) Another common relation extraction approach uses co-occurrence information. The idea behind
it is quite intuitive: entities occurring in the same sentence significantly often should be related
[Coulet et al. 2010]. The drawback of the approach is that co-occurrence information per se
cannot capture the type of relation present, i.e. what the formal semantics of the relation is.
However, it can efficiently identify potential relations and relation instances that may be examined
with other NLP techniques afterwards.
[Zhu et al. 2005] study the relations between different genes and chemical compounds or drugs. The
authors are not interested in pinpointing the specific type of relation, but rather in finding the
entities that may be somehow related, which makes the co-occurrence approach a proper choice.
The relation scores are calculated using a probabilistic model that combines several co-occurrence
datasets.
[Wu et al. 2012a] calculate co-occurrence scores for pairs of genes and drugs on different
segmentation levels: those of sentences, abstracts and phrases. Several co-occurrence measures
were utilized, including standard options such as mutual information, Chi-square and term
frequencies, as well as more advanced metrics based on Latent Dirichlet Allocation model.
[Cimino et al. 1993] occupy themselves with extracting drug-disease relations. Co-occurrence
patterns were extracted from the keyword section of MEDLINE publications. The patterns were
transformed into relational pattern-matching rules, e.g. “Disease caused by Chemical”, and over
2500 facts matching those rules were subsequently extracted from the literature.
[Lee et al. 2004] address the task of finding treatment relation instances between drug and
disease names. Sentences containing frequent drug-disease name pairs were filtered out based on the
co-occurrence information and then were used as a source for manual and automatic pattern extraction.
3) An alternative approach to extracting biomedical relations is to use machine learning techniques.
Firstly, the source text is annotated with biomedical concepts; secondly, sentences or phrases are
labeled with relations using external knowledge resources, manual annotation or exploiting the
concept types. Finally, a model is trained to discriminate between instances of different classes, i.e.
relations.
[Chun et al. 2006] focus on the extraction of gene-disease relations from manually annotated
MEDLINE abstracts that describe either the pathophysiology or therapeutic significance of a gene, or
the use of a gene as a marker for possible diagnosis and disease risks. Incorporating a NER pre-filtering
step for gene and disease names, the classification yields 78.5% precision and
87.1% recall.
[Craven et al. 1999] automatically populate a biomedical knowledge base from text. The work
focuses on extracting relations between proteins and their locations (tissues, cell types, sub-cellular
structures), associated diseases or drugs they interact with. The classification is done using strings
containing instances of target relations and using grammatical patterns on the phrase level.
[Chang et al. 2004] classify gene-drug co-occurrences into 5 manually defined categories using a
set of relationships from PharmGKB and the Maximum Entropy algorithm.
[Airola et al. 2008] focus on protein-protein interaction extraction and utilize a graph-kernel-based
learning algorithm, reaching an F-score of 56.4%.
[Rosario et al. 2004] use neural networks and generative graphical models to automatically learn 7
different relations between drug and disease entities: cures, prevents, treats, has side effect etc.
Machine learning appears to be a promising approach to relation extraction which does not require
the tedious work of pattern construction and is able to generalize. However, previous works on
biomedical RE using classifiers are confined to small sets of relations between specific entities. Little
work has been done for more comprehensive collections of relations that can be found in multiple
subfields of life sciences.
4) The last approach towards relation extraction that has been gaining popularity incorporates deep
syntactic parsing. By exploring the parse trees of target sentences the entities that are related to
each other can be identified with high precision. Usually the entities are linked into pairs according
to the rules defined over the parse trees, and the recall of the extraction process depends on the
diversity and flexibility of the rules.
[Rindflesch et al. 2000] extract formal assertions about drug-target relations relevant for cancer
treatment from text. Gene and cell mentions are extracted from parsed noun phrases using the
UMLS Metathesaurus and an automatically constructed list of auxiliary names; these mentions
serve as arguments for predications. The authors single out three main issues that need to be tackled
for an efficient predication generation, namely coordination («drug A and drug B inhibit gene C»),
anaphora («they inhibit gene C») and underspecified reference («the drugs inhibit gene C»).
[Ramakrishnan et al. 2006] formulate the task as the extraction of explicit as well as implicit
relations between known entities in text. The relations to be learned are taken from the UMLS
Semantic Network (see Chapter 2.2). After a sentence is annotated with MeSH concepts and
UMLS relation names (or their synonymous names, also provided by UMLS), they are combined
into RDF triples following the predefined rules.
[Tari et al. 2009] build biological interaction networks from gene-drug, gene-disease and protein-protein
relation instances mined from MEDLINE abstracts. Patterns like «GENE _ associated with
DRUG» or «DRUG _ inhibits GENE» («_» is used as a wildcard symbol) are matched against
sentence parse trees instead of a simple text representation, which yields higher recall of the patterns.
[Coulet et al. 2010] automatically identify and extract predications relevant for the domain of
pharmacogenomics. Commonly occurring relations are extracted from syntactic parses of
MEDLINE abstracts in the following way: the two recognized entities are linked by a parse subtree,
and a relational triple is extracted if the root of the subtree is a verb or a deverbative form (called
«the normalized verb» by the authors). The resulting «raw» relations are then normalized and
mapped to a smaller set of relations and are integrated into an ontology.
The set of normalized relations and the hierarchy of the latter was constructed manually. Firstly,
lexical items from raw relationships were lemmatized, and four frequency lists were constructed:
most common relational terms, and most common entities that modified gene, drug and phenotype
names (e.g. «dose» is a modifier of the drug name «warfarine» in the «warfarine dose» noun phrase).
Secondly, synonymous elements from each list were grouped together manually, forming ontology
roles and concepts. Lastly, role and concept hierarchies were manually defined, shaping the output
ontology. The ontology was then encoded in OWL.
[Fabian 2012] adapts the algorithm from [Coulet et al. 2010] to extract subject-object relation
instances of drug-disease, drug-target and disease-target interactions. Instances were classified as
quantified strong, strong, relational, weak or unknown depending on the additional modifiers of the
verb phrases, like adverbs (e.g. «significantly» is a strong modifier, whereas «weakly» is not) and
auxiliary verbs («may» or «could» are markers of weak relationships). Manually constructed lists of
strong and weak modifiers serve for the aforementioned classification. Finally, a model separating
informative sentences (i.e. containing relevant relations) from non-informative ones was trained
using a logistic regression classifier.
The system described in [Fabian 2012] is an example of mixing several approaches. While machine
learning and deep parsing, possibly in combination with each other, appear to be the most
promising approaches to the relation extraction task, there is a rationale to additionally
incorporate patterns and rules as components of a relation extraction system, for subtasks where
high precision is the main goal. Co-occurrence can also be integrated into the pipeline, e.g. for
pre-filtering purposes.
To conclude, there exist several NLP systems that extract relations from biomedical texts. However,
their common drawback is that they tend to target only specific relations. While this can be a
reasonable approach given every particular case, as the relations being extracted fit to the task at
hand, the approach cannot be generalized to other tasks, domains and purposes. Thus, an efficient
system that can be adapted to any set of biomedical relations will be of great value.
4. Non-taxonomic relation extraction using SNOMED CT ontology
References:
G. Tsatsaronis, A. Petrova, M. Kissa, Y. Ma, F. Distel, F. Baader and M. Schroeder. Learning
Formal Definitions for Biomedical Concepts. In Proceedings of OWL : Experiences and Directions
Workshop 2012 (ESWC OWLED’13), to appear.
For our first attempt to generate formal definitions for biomedical concepts we took the SNOMED
CT ontology (see Chapter 2.3.1) as a gold standard. The motivation is the following:
SNOMED CT is a fully formalized resource, and its formal semantics can be used to build
information extraction models to process new unstructured data.
Ontology extraction from text is a multi-step task. Several initiatives of biomedical ontology
extraction have already addressed this problem at different levels: biomedical annotators are
responsible for concept extraction, while taxonomy extraction is done by [Wächter et al. 2011]
[Fabian et al. 2012]. On the other hand, few attempts have been made in the area of non-taxonomic
biomedical relation extraction. Hence we focused on non-taxonomic relations.
The approach is based on the assumption that the set of relations relevant for a given domain
remains relatively stable and does not differ drastically from source to source; in contrast, the set of
concepts as well as the set of relation instances increases constantly. To facilitate the addition of
information about the new concepts, we address the following problem: for a given input sentence
in a natural language that is annotated with two SNOMED CT concepts, decide whether the
sentence contains a relation between the two concepts and if it does, identify the relation.
Taxonomic relation instances (“A is a B”) are omitted.
More formally, we can express this problem as a multi-class classification problem. Let C be the
class label, i.e. any relation R contained in SNOMED CT. Each learning instance is a sentence S
annotated with SNOMED CT concepts, for which a set of features has been computed. If S is a
sentence which describes a relation Ri between the two SNOMED CT concepts, then S is a positive
instance for this relation, and hence C = Ri in this case; otherwise S is a negative learning instance
of Ri.
The methodology pipeline consists of three steps: (i) create a dataset with labeled instances from
which the relations can be learned, (ii) represent the instances as feature vectors, and (iii) using machine
learning algorithms, train a classification model that can recognize any of the labeled relations in an
unseen input sentence.
4.1. Dataset generation
In order to obtain high-quality sentences that describe relations between two SNOMED CT
concepts, we first need to select the sentences that contain both concepts. The text corpus of
choice is the collection of MeSH definitions: since they are composed manually by medical experts,
they constitute precise, scientifically valid sentences of high quality. MeSH definitions
must be annotated with SNOMED CT concepts, and those containing both concepts from concept
pairs must be collected.
The first step is to obtain a mapping between MeSH and SNOMED CT concepts. Such a mapping
exists in the UMLS (see Chapter 2.3.2). UMLS defines a Concept Unique Identifier (CUI) for each
of the UMLS concepts. Each CUI may be associated with one or more concepts from external
knowledge resources, including MeSH and SNOMED CT. Analyzing this association, we extracted
the CUIs that are associated with both a MeSH and a SNOMED CT concept, which is interpreted as
a mapping between the two concepts. Using the latest UMLS version (2012AB), we obtained a total
of 21,461 mappings.
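The CUI-based mapping step can be sketched as follows. The sketch assumes the standard layout of the UMLS MRCONSO.RRF file (pipe-delimited, CUI in field 0, source vocabulary abbreviation SAB in field 11, source code in field 13) and the source abbreviations MSH and SNOMEDCT; the sample rows below are synthetic stand-ins for real MRCONSO lines:

```python
from collections import defaultdict

def mesh_snomed_mappings(mrconso_lines, mesh_sab="MSH", snomed_sab="SNOMEDCT"):
    """Collect CUIs whose UMLS entries include both a MeSH and a SNOMED CT
    atom; each such CUI is read as a MeSH <-> SNOMED CT concept mapping.
    Assumes MRCONSO.RRF layout: pipe-delimited, CUI in field 0,
    source vocabulary (SAB) in field 11, source code in field 13."""
    sources = defaultdict(set)
    codes = defaultdict(dict)
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        cui, sab, code = fields[0], fields[11], fields[13]
        sources[cui].add(sab)
        codes[cui][sab] = code
    return {cui: (codes[cui][mesh_sab], codes[cui][snomed_sab])
            for cui, sabs in sources.items()
            if mesh_sab in sabs and snomed_sab in sabs}

def row(cui, sab, code):
    """Build a synthetic MRCONSO-like line with only the fields we read."""
    fields = [""] * 18
    fields[0], fields[11], fields[13] = cui, sab, code
    return "|".join(fields)

sample = [
    row("C0004096", "MSH", "D001249"),        # asthma, MeSH atom
    row("C0004096", "SNOMEDCT", "195967001"),  # asthma, SNOMED CT atom
    row("C0011849", "MSH", "D003920"),         # diabetes mellitus, MeSH only
]
mappings = mesh_snomed_mappings(sample)
```

Run over the full 2012AB MRCONSO file, this kind of scan is what yields the 21,461 mappings reported above.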
Next, we annotated MeSH definitions with SNOMED CT concepts. For the annotation we used two
different tools: (a) MetaMap18, which can annotate any text with UMLS concepts, and (b)
SnomedAnnotator, developed in-house, which can annotate any text with SNOMED CT concepts.
Annotations produced by MetaMap were translated into SNOMED CT concepts using the
mappings from UMLS. The two annotators were used sequentially to provide a broader coverage of
annotations; hence, we considered the union of the annotations from the two tools.
We focused on three widely populated relations in SNOMED CT, namely Associated Morphology
(AM), Causative Agent (CA) and Finding Site (FS). After we filtered definitions that do not contain
both concepts from concept pairs that are linked by one of these relations, there were 424 MeSH
definitions remaining. The details of the dataset are summarized in Table 4.
Role                     # instances   # word occurrences   avg. # words   # distinct words
Associated Morphology    121           938                  7.75           433
Causative Agent          95            723                  7.61           218
Finding Site             208           1,550                7.45           547

Table 4. Description of the produced dataset. The dataset contains 424 instances from three SNOMED CT roles:
Associated Morphology, Causative Agent and Finding Site.
4.2 Feature extraction

Using the dataset described above, we generated the features with which the instances were
represented for the learning process. For the feature engineering, we use three approaches: (i) Bag
of Words, (ii) Word ngrams, and (iii) Character ngrams. The three approaches are described below
and summarized with an example in Table 5. In all three approaches, the annotated
sentences are split in such a way that the words that occur between the two concepts can be
isolated and processed. We rely on two main hypotheses:
18 MetaMap annotator: http://metamap.nlm.nih.gov/
• each relation Ri has a characteristic way of being expressed in natural language text;
• it is expressed in lexical items that occur between the two concepts.

All three representations have a default feature weight equal to the value of 1 if the feature occurs
in the text, or 0 otherwise. We also expand these representations to their weighted versions, i.e.,
instead of a boolean representation of the features, real values are used.
Annotated sentence   “Baritosis/Baritosis_(disorder) is pneumoconiosis caused by barium
                     dust/Barium_Dust_(substance).”
SNOMED CT relation   Baritosis_(disorder) – Causative_agent – Barium_Dust_(substance)
Alignment            left type | between-words | right type:
                     disorder “is pneumoconiosis caused by” substance
BoW                  {is, pneumoconiosis, caused, by}
Word ngrams          {is, pneumoconiosis, caused, by, is pneumoconiosis, pneumoconiosis
                     caused, caused by}
Character ngrams     {i, s, ␣, p, n, e, u, m, o, c, a, d, b, y, is, s␣, ␣p, pn, ne, eu, um, mo, oc,
                     co, on, ni, io, os, si, ␣c, ca, au, us, se, ed, d␣, ␣b, by}

Table 5. Text alignment and example of an instance representation using boolean feature values. For the ngram
representations a value of n = 2 is used; ␣ denotes the space character.
Bag of Words (BoW) Representation: The representation of text following the Bag of Words
model has been used traditionally both in the fields of information retrieval and text mining
[Baeza-Yates et al. 1999]. According to the BoW representation, a text string is the unordered set of all
unique words in it. Each distinct term constitutes a dimension of the collection; thus, the BoW
feature space comprises the union of all unique terms appearing in all text definitions. Each instance
I (in our case the text between the two concepts) can be represented as a feature vector, the value of
each feature being 0 or 1 (boolean representation), depending on whether a term occurs in the
instance (1) or not (0).
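A minimal sketch of the boolean BoW representation described above (the function name is illustrative; a plain whitespace tokenizer is assumed):

```python
def bow_features(between_strings):
    """Boolean Bag-of-Words vectors for the strings between two concepts.

    The feature space is the union of all unique tokens over the dataset;
    each instance becomes a 0/1 vector indicating token presence.
    """
    vocab = sorted({tok for s in between_strings for tok in s.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for s in between_strings:
        vec = [0] * len(vocab)
        for tok in set(s.split()):
            vec[index[tok]] = 1
        vectors.append(vec)
    return vocab, vectors
```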
Word ngrams Representation: We can expand the BoW representation in order to represent each
instance with all the possible word ngrams occurring in I. For the extraction of the word ngrams we
are using a sliding window of search in the ordered words of the input text. The size of the window
may vary depending on the value of n. Note that this representation includes at least all features of
the BoW representation; in fact, if n = 1, the word 1-gram representation is reduced to the BoW
representation. Regarding the weight of each feature, in the simple (unweighted) version, we use a
boolean representation, as previously.
Character ngrams Representation: In an analogy to representing instances at a word ngram level,
we can also represent instances at the character ngram level. Given I and a value for the parameter
n, we now examine I as an ordered series of characters instead of words. For the extraction of the
character ngrams, as in the case of word ngrams, we are using a sliding window of size n, and we
do not exclude space characters in order to capture patterns across token boundaries. Again the
weight of each feature in the simple (unweighted) version follows a boolean representation, as
previously.
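Both sliding-window extractions can be sketched as follows (illustrative helper names; k-grams for all k up to n are collected, so the word version subsumes BoW for n = 1):

```python
def word_ngrams(text, n):
    """All word k-grams for k = 1..n, via a sliding window over the tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def char_ngrams(text, n):
    """All character k-grams for k = 1..n; space characters are kept so that
    patterns crossing token boundaries are captured."""
    return [text[i:i + k]
            for k in range(1, n + 1)
            for i in range(len(text) - k + 1)]
```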
Weighted Feature Representations: In all three feature representations of instances
mentioned above we have assumed a boolean representation for the feature values. Ideally, we
would like to have a real value for each feature, acting as a weight that would differentiate the flat
contribution of the boolean representation. For this purpose, we utilize the dataset and define a
global weight for each feature, which is always computed on the part of the dataset kept for
training. A local weight is not a realistic option, as the MeSH definition sentences are usually short,
significantly shorter than text passages or documents. Hence, for each feature Xi of any of the three
representations we define a weight vi = P(Xi), where P(Xi) is the probability of occurrence of
feature Xi in the training corpus. However, since the training corpus contains instances from several
relations Ri (class labels), it is important to discriminate the probabilities of the features'
occurrences per relation (class). Hence, to create the weighted representations of instances, we
create for each feature Xi three real-valued features, one per relation, the weight of each being the
probability of the feature's occurrence in the respective relation; each instance in the weighted
version may thus be represented with 3·|X| features, where |X| is the number of original features.
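The per-class expansion can be sketched as follows (a minimal sketch; instances are assumed to be given as lists of their extracted features):

```python
from collections import Counter, defaultdict

def per_class_weights(instances, labels):
    """Per-class probabilistic feature weights: for each feature X and each
    relation R, P(X | R) is the fraction of training instances of class R
    in which X occurs. Each boolean feature thus expands into one
    real-valued feature per relation."""
    class_sizes = Counter(labels)
    counts = defaultdict(Counter)  # relation -> feature -> #instances containing it
    for feats, label in zip(instances, labels):
        counts[label].update(set(feats))
    return {rel: {f: c / class_sizes[rel] for f, c in feat_counts.items()}
            for rel, feat_counts in counts.items()}
```

Each boolean feature Xi thus contributes one real-valued feature per relation, weighted by P(Xi | R) estimated on the training split.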
Combined Feature Representations: A final consideration is the representation of the instances
using the union of all the features that were described in each case. This combination can show
whether the synergy of word ngrams and character ngrams may provide better predictive power for
the extraction of roles from unstructured text. Naturally, the combined representation can be
utilized both for the weighted and the unweighted versions.
4.3 Relation classification

During the classification step we compared four different state-of-the-art supervised algorithms,
namely Logistic Regression (LR), Support Vector Machines (SVM), Multinomial Naive Bayes
(NB) and Random Forests (RF). For the evaluation we apply 10-fold cross validation, and for
performance measurement we report the overall accuracy, precision, recall and F-measure per
relation (AM, CA, and FS), and macro-averaged precision, recall and F-measure over all roles. The
results are reported in Appendix A for the unweighted and weighted versions of the instance
representations, respectively. All classifiers were used in their Weka toolkit implementation19.
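The classifiers themselves were run inside Weka; purely for illustration, the per-relation and macro-averaged measures used in the comparison can be computed as in the following sketch (not the Weka implementation):

```python
def macro_f_measure(gold, predicted, labels):
    """Per-relation precision/recall/F1 plus their macro-averages.

    gold and predicted are parallel lists of relation labels; for each
    relation we count true positives, false positives and false negatives.
    """
    scores = {}
    for rel in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == rel)
        fp = sum(1 for g, p in zip(gold, predicted) if p == rel and g != rel)
        fn = sum(1 for g, p in zip(gold, predicted) if g == rel and p != rel)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[rel] = (prec, rec, f1)
    macro = tuple(sum(s[i] for s in scores.values()) / len(labels)
                  for i in range(3))
    return scores, macro
```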
We analyzed the results from several perspectives:
Comparing the roles: The easiest role to learn is CA (82.9% F-measure), the second easiest is FS
(80.3% F-measure), AM is relatively hard to learn (65.8% F-measure). CA strings have very typical
patterns that are easy to model with ngrams: caused by, resulting from, due to, induced by, exposure
to.
Comparing feature representations: Character ngrams, in particular of size 3, tend to outperform
other types of representation. We assume that the reason behind it is that they can detect several
textual aspects simultaneously: word order, word forms, word groups. Combining character and
token ngrams did not improve the performance, meaning that more elaborate ways of feature
combination are needed to exploit the cumulative power of the features.
Comparing classifiers: In most cases SVM gives the best results, with LR being the second best. In
some cases, when character ngrams were used as the representation, NB simply selected the
majority class (the FS role).
Comparing boolean and weighted versions: Weighted representations tend to produce slightly
better results, although the difference in numbers is subtle and inconsistent, which brings us to the
conclusion that the current weighting scheme is not discriminative enough to differentiate the
predictive power of separate roles.
19 Weka machine learning toolkit: http://www.cs.waikato.ac.nz/ml/weka/
The best setting over all experiments is the weighted version of character trigrams used as input for
SVM: overall accuracy reaches 75.71%, macro-average F-measure rises up to 74.91%. F-measure
results for single roles are 79.45% for FS, 65.52% for AM and 79.78% for CA.
4.4 Discussion

The experiments described above show that learning relational instances from text is a generally
feasible task. In particular, our underlying assumption #1 that relations have typical ways of being
expressed in text holds. This implies that lexical features extracted from relational strings are
valuable and should be used in further experiments.
However, the performance of relation classification that we have reached so far is not high enough
to claim the problem solved. The mistakes in classification come from various sources:
• incorrect choice of sources (the same relations can be modeled differently in textual and in
formalized sources, thus we have to choose a corpus and an ontology that are compatible
with respect to the domain modeling);
• incorrect or insufficient annotation of concepts (the annotator can miss crucial concepts or
recognize them incorrectly, which results in incorrect string extraction and concept pair
formation);
• incorrect choice of relational strings (i.e., violation of the underlying assumption #2: the
relation mention can be located outside of the string in between the two concepts, or can
neighbor another relation mention, which leads to noise in the learning dataset);
• insufficient set of features for the learning process (lexical ways of relation expression may
be overlapping across several features, thus other types of features should be incorporated
into the learning process).
Table 6. Overview of the proposed approach for the biomedical relation extraction.

Text source              MeSH definitions
Relation set R source    SNOMED CT
Annotators               MetaMap, in-house SnomedAnnotator
Feature sources          text of a definition
Feature representations  BoW, token and character ngrams, combined
Weighting schemes        boolean, probabilistic weight-per-class
ML classifiers           SVMs, Log. Reg., Random Forests, Naïve Bayes
String extraction        string between the two concepts
Mistakes made in earlier steps of the experiments are propagated to later steps and accumulate
in the classification phase, resulting in an F-measure that does not exceed 80%. Thus, an efficient
solution for the task at hand should be a pipeline that improves the performance of every component
of the experiment. Table 6 summarizes the choice of resources, features and algorithms for every
step of the experiment. In the upcoming chapter we describe a pipeline for formal
definition generation with non-taxonomic relation extraction as its key component, keeping this
table in mind and improving every step of the experiments.
5. Formal Definition Generation pipeline

This chapter describes the pipeline for formal definition generation (FDG) from text. The core step
in FDG is non-taxonomic relation extraction: not only do expressive relation instances account for
the largest part of the definition formulas, but they also require tasks such as concept annotation and
taxonomy detection as preprocessing steps. Hence, the FDG pipeline in essence tackles the task of
relation extraction.
Chapter 5 is structured as follows: in section 5.1 we give an overview of the approach to biomedical
relation extraction and compare it to the one used in our previous experiments (see Chapter 4).
We illustrate the overview with an example of a biomedical definition, which is transformed into a
Description Logic formula in a stepwise manner. The rest of the chapter gives a detailed description
of every step of the translation process: section 5.2 covers the semantic annotation, section 5.3
describes the triple extraction from the definition parse tree, and section 5.4 is dedicated to the triple
classification problem.
5.1. Overview of the pipeline

We base the current approach on the previous set of experiments on biomedical relation extraction (RE)
described in Chapter 4. Certain steps of the pipeline are modified to achieve better performance
based on the methodological conclusions reached in the Discussion section of Chapter 4, whereas for
other steps the winning configurations from the previous experiments are kept and the alternatives are
abandoned. The comparison of the two pipelines is given in Table 7.
                         Old approach                       New approach
Text source              MeSH definitions                   MeSH definitions
Relation set R source    SNOMED CT                          SemRep, UMLS
Annotators               MetaMap, SnomedAnnotator           Extended Annotator
Feature sources          text of a definition               text of a definition + concept types
Feature representations  BoW, token and character           character ngrams
                         ngrams, combination
Weighting schemes        boolean, per-class weights         boolean
Classification algorithm SVMs, RFs, LogReg, Naïve Bayes     SVMs
String extraction        string between the two concepts    advanced parsing of relational strings

Table 7. A comparison of the two relation extraction approaches.
The components of the pipeline that have been modified:
• We change the background ontology serving as a reference for formal relations from
SNOMED CT to the UMLS Semantic Network. SNOMED CT models biomedical
knowledge for machine-processing purposes, thus its modeling is not fully compatible with
the way biomedical relations appear in natural language texts. The Semantic Network is a
human-oriented resource: it defines concepts and relations that can be directly found in texts.
The SemRep system and its set of relations, which is used for the evaluation (Chapter 6), takes
the same human-oriented approach; in fact, it uses the concept types and partly inherits the
relations of the Semantic Network.
• We modify the annotation process by introducing the Extended Annotator (Chapter 5.2). The
annotation process now takes into account the phrasal parsing of the input sentences.
• We introduce a new relation extraction methodology which takes into account the syntactic
parse tree of a definition (Chapter 5.3). Syntactic parsing proves to be useful with respect to
the RE task (see Chapter 3, subsection Biomedical Relation Extraction). The Stanford Parser
is widely used for the task at hand [Coulet et al. 2010] [Fabian 2012].
• We added a new class of features to the learning process: the semantic types of concepts
(Chapter 5.4).
Selecting the best performing parameters for the pipeline:
• We choose Support Vector Machines (SVM) as the classification algorithm. It appeared to
be the best performing classifier for our formulation of the RE task. Moreover, kernel
methods, in particular applied to parse trees, proved to be a useful tool for relation extraction
[Bunescu et al. 2006] and tend to outperform feature-based methods [Zelenko et al. 2003].
• Character ngrams of size up to 3 will be used as the lexical features extracted from relational
strings. Trigrams are a common choice for language modeling [Rosenfeld 2000], and since
they proved to be as efficient as 4-grams, we choose the smaller ngrams, as they yield a
considerably smaller feature set.
• We choose the boolean weighting scheme for the features, since the per-class probabilistic
weighting of features did not prove to be considerably better than the unweighted scheme, but
it expanded the feature set considerably (by n times, where n is the number of distinct
relations).
We illustrate the full pipeline of formal definition generation using a MeSH definition that will be
the running example for this chapter:
Concept: Tremor (D014202)
Definition: cyclical movement of a body part that can represent
either a physiologic process or a manifestation of disease.

input sentence:
Tremor – cyclical movement of a body part that can represent
either a physiologic process or a manifestation of disease.

syntactic parsing:
… (the constituency parse tree of the sentence)

semantic annotation:
Tremor [D014202] – cyclical movement [D009068] of a body part [my term]
that can represent either a physiologic process [D010829] or a
manifestation of disease [D004194].

triple extraction and triple classification:
ISA (Tremor, Movement)
“of” (Movement, Body Part)
“that can represent” (Tremor, Physiological Process)
“manifestation of” (Tremor, Disease)

formula generation:
Tremor ≡ ( Movement ⊓ ∃relatesTo.Body_Part ) ⊓
∃represents.(Physiological_Process ⊔ Disease)
Now let us look into the first step of the formal definition generation pipeline – the semantic
annotation of sentences with biomedical concepts.
5.2 Annotation of biomedical texts with ontology concepts

5.2.1 Introduction to the process of annotation and related work

Given an input text and an ontology that describes the domain, concept annotation, also called
semantic indexing or concept recognition, is the task of finding mentions of ontology concepts in
text and mapping the corresponding lexical tokens to concepts. Typically, biomedical concept
annotators aim at recognizing textual occurrences of diseases, drugs, genes, body parts, species and,
in principle, any other conceptual entity that exists in the input ontology.
Biomedical concepts may be seen as a specific kind of named entities that may contain common
(i.e., not proper) nouns. Thus, biomedical concept annotation can be viewed as a named entity
recognition (NER) task, and methods and techniques developed for classical NER are applicable for
biomedical concept recognition (dictionary-based, grammar-based, alignment-based and statistical
approaches). However, the approaches must be adapted to the particularities of biomedical concept
names. For example, protein names are highly ambiguous: one protein can have several names,
abbreviations and variants, and can be mapped to more than one protein concept. In parallel, a
specific concept may appear with more than one textual label (synonyms). Hence, the widely
known problems of polysemy and synonymy that accompany almost every text mining task are
aspects that need to be addressed also by biomedical annotators.
There are multiple biomedical annotators available online. One of the most widely used is
MetaMap [Aronson 2006]. It is a dictionary-based system that indexes biomedical text with UMLS
concepts [Unified Medical Language System]. It is used as the foundation of the Medical Text
Indexer20 (MTI), which assists PubMed curators in annotating MEDLINE abstracts. Other
annotation tools are: NCBO Annotator21 [Jonquet et al. 2009], Mgrep [Dai et al. 2008] and the
Attribute Alignment Annotator [Delfs et al. 2004].
20 http://metamap.nlm.nih.gov/
21 NCBO Annotator: http://bioportal.bioontology.org/annotator
Biomedical annotators serve as the basis of various more complex data mining and information
extraction tasks and are essential for the performance of any biomedical intelligent system. They are
used for semantic search [Delfs et al. 2004], ontology generation [Wächter et al. 2011], interaction
extraction [Fabian 2012], etc. A number of tasks closely related to biomedical indexing are
supported by BioCreAtIvE22, a series of challenges that evaluate systems extracting biologically
relevant information from the literature. The two main directions of BioCreAtIvE are the
identification of biomedical entities and the detection of entity-fact associations in text.
5.2.2 The Attribute Alignment Annotator
The Attribute Alignment Annotator23, referenced hereafter as the in-house annotator or the
Annotator, was developed by [Doms 2008] as an indexer for the GoPubMed search engine. It is
based on the Smith-Waterman sequence alignment algorithm [Smith et al. 2008] and recognizes
terms from MeSH and the Gene Ontology in a given text passage.
The annotator first pre-processes both the ontology terms and the text snippet by tokenizing them,
removing the stop words and stemming the remaining terms. Then the term stems are mapped onto
the text stems using the local sequence alignment algorithm [Smith et al. 2008]. Insertions,
deletions and gaps are penalized. The Information Value (IV) of terms is also taken into account
during the alignment process. The IV is calculated over the whole ontology and is based on the
frequency of a term in the ontology's vocabulary. The more frequent, and thus common, a word is,
the less informative it becomes. Hence, frequent terms are assigned a low IV and vice versa.
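A possible sketch of such a frequency-based IV; the negative-log weighting below is an assumption for illustration, and the exact formula of [Doms 2008] may differ:

```python
import math
from collections import Counter

def information_values(ontology_terms):
    """Frequency-based Information Value over the ontology vocabulary:
    frequent words get a low IV, rare words a high one (here via a
    simple negative-log relative frequency)."""
    words = [w for term in ontology_terms for w in term.lower().split()]
    freq = Counter(words)
    total = len(words)
    return {w: -math.log(c / total) for w, c in freq.items()}
```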
The Annotator is a suitable tool for the task of formal definition generation: it maps terms to MeSH
concepts, handling synonyms and term variants, thus covering the three lower levels of the ontology
learning layer cake [Cimiano06]. However, it has several drawbacks, such as missing or ambiguous
annotations, that need to be resolved before the annotator can be integrated into the definition
generation pipeline. Hence, the Extended Annotator was created as an improved version of the
original annotator; it is explained in detail in the following section.
5.2.3 The Extended Annotator
5.2.3.1 Motivation
The Extended Annotator (EA) was built as an enhancement of the original annotator. It is built on
top of the annotator and uses its string matching and concept recognition functionality. Thus, EA
can be seen as an extension of the Attribute Alignment Annotator.
The role of the annotator in the definition generation process is to provide a full and unambiguous
set of concepts recognized in the input text; the concepts are then used by the parser to construct
triples of the form: concept_A – relation_R – concept_B. The original annotator's output is
22 The BioCreAtIvE challenge: http://biocreative.sourceforge.net/
23 http://www.gopubmed.org/web/annotate/
problematic in one main way: the annotator can produce overlapping annotations. The reason is that
it uses an unsupervised string alignment method which does not take into account any context, i.e.,
the terms that appear adjacent to the annotated term in the sentence. Instead, whenever a string
sequence matches one of the terms (or its variants) from MeSH, an annotation is produced. The
annotation process can be thought of as a sliding window of varying length that goes through the
text and collects all sequences of terms that match some term. As a result, compound terms
consisting of a head term and several modifying terms ("cell-mediated immunity") may be assigned
several annotations for different subsets of the constituting terms:

«cell» → cells;
«mediated» → negotiating, lymphokines;
«immunity» → immunity, immunization;
«cell mediated immunity» → cellular immunity.
Example 1 shows the full annotation of the definition of the concept T-Lymphocytes produced by
the original annotator and the extended annotator. The original annotation is not suitable for the task
of formal definition generation, since it is highly ambiguous, whereas we would like to have a
straightforward, one-to-one correspondence between lexical mentions of concepts and their
annotated labels. The EA annotation has only two entries, both of which are correct. In particular,
«cell mediated immunity» is labeled with Immunity, Cellular, the concept with the broadest
coverage.
Example 1: T-Lymphocytes (D013601)
Lymphocytes responsible for cell-mediated immunity.
Annotator:
“lymphocytes”: Lymphokines; D008222.
“lymphocytes”: Lymphocytes; D008214.
“cell”: Cells; D002477.
“immunity”: Immunization; D007114.
“immunity”: Immunity; D007109.
“mediated”: Negotiating; D017008.
“mediated”: Lymphokines; D008222.
“cell-mediated immunity”: Immunity, Cellular; D007111.
EA:
“lymphocytes”: Lymphocytes; D008214.
“cell mediated immunity”: Immunity, Cellular; D007111.
In Example 2, «anticholesteremic agent» is annotated with the concept Anticholesteremic Agents
twice: in the first case the whole string is annotated, while in the second case only the adjective is
taken into consideration. The annotation covering the longer substring is correct. The string «sterol
biosynthesis» is simultaneously annotated with two different concepts, Sterol biosynthetic process
and Biosynthetic process. The first concept is more specific and is expressed in the text by a
longer string.
Example 2: trans-1,4-Bis(2-chlorobenzaminomethyl)cyclohexane Dihydrochloride (D001371)
An anticholesteremic agent that inhibits sterol biosynthesis in animals.
Annotator:
“anticholesteremic”: Anticholesteremic Agents; D000924.
“anticholesteremic agent”: Anticholesteremic Agents; D000924.
“sterol”: Sterols; D013261.
“animals”: Animals; D000818.
“sterol biosynthesis”: Sterol biosynthetic process; GO0016126.
“biosynthesis”: Biosynthetic process; GO0009058.
EA:
“an anticholesteremic agent”: Anticholesteremic Agents; D000924.
“sterol biosynthesis”: Sterols, D013261.
“animals”: Animals; D000818.
An ambiguous set of annotations is not an ideal input for the triple generation. Thus, only one
annotation (the best fitting) should be chosen from all alternatives for each concept to be passed to
the next component of the definition generation pipeline. We assume that the best fitting annotation
is the one with the broadest coverage, i.e., the one that is expressed in the text by the longest
substring.
5.2.3.2 Method description
The underlying idea of the Extended Annotator is that concepts are expressed in text by noun
phrases (NPs). If we annotate every NP separately and choose the most suitable annotation label, a
full and unambiguous set of annotations will be obtained.
The EA takes into account the syntactic structure of the input text. It traverses the parse tree and
collects all “flat” noun phrases, i.e., those not having smaller NPs linked by a preposition as child
nodes. Every NP is then passed separately as an input to the annotator, and either of the two
cases described below is possible:
− if a NP is annotated with exactly one MeSH concept, the concept is assigned to the NP and
the annotator proceeds with the next NP;
− if a NP gets more than one annotation, the annotator calls the filtering method that selects
the best fitting annotation to be assigned to the NP. As explained in the motivation,
the annotation that covers the longest substring is chosen. The original annotator outputs a
range for every annotation, specifying its starting and ending position in the text, and the
annotation with the largest value of the range (the ending point minus the starting
point) is selected as the final output of the EA for a given NP.
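The NP-wise annotation loop with longest-range filtering can be sketched as follows (the `annotate` callback stands in for the original annotator and is an assumption of this sketch; candidate annotations are given as (start, end, label) triples):

```python
def annotate_nps(noun_phrases, annotate):
    """EA core loop (sketch): annotate each flat NP separately; if several
    candidate annotations come back, keep the one with the longest range
    (ending point minus starting point)."""
    result = {}
    for np in noun_phrases:
        candidates = annotate(np)  # list of (start, end, concept_label)
        if candidates:
            result[np] = max(candidates, key=lambda a: a[1] - a[0])
    return result
```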
5.2.3.3 Extended Vocabulary
The annotation procedure described previously forms the core algorithm of the Extended Annotator,
which can be used for a broad range of purposes and which was evaluated on the BioASQ corpus (see
Chapter 5.2.5 Evaluation). However, several subsequent additions were made to fit the EA better to
the task of formal definition generation.
It was noticed that certain terms with general semantics, like structure, system or phase, are not
recognized by the annotator. The reason is that they are missing in the underlying ontology, i.e.,
MeSH. Nevertheless, they appear quite frequently in MeSH definitions, especially when specifying
the parent concept of the term to be defined. The most striking example is the term «disorder».
MeSH has a rich taxonomy of various disorders, which can be seen if one queries the MeSH
browser with this term. However, the term itself has no corresponding parent concept Disorder, as
it is absent from MeSH. This is problematic for definition generation, because multiple definienda
lack the annotation of the parent concept.
Example 3: Muscular Disorders, Atrophic (D020966)
Disorders characterized by an abnormal reduction in muscle volume due to a decrease in the size or number of muscle
fibers.
EA:
Disorders: Disorder; my_term.
muscle volume: Muscles; D009132.
muscle fibers: Muscle Fibers (chosen); D0184
The solution to this problem is the usage of an extended vocabulary. It is an extra set of terms,
collected automatically from the MeSH definitions, which is used for annotation in case a NP
remains without annotations after all the steps of the EA have been applied. If a term from the
extended vocabulary is recognized in the NP string, it is treated like a concept and is used during the
definition generation. So far, no formal semantics have been assigned to these additional concepts,
thus they cannot participate in the reasoning process, but they are useful for constructing the formal
representation of a definition. Assigning semantics to the additional terms can be done manually or
semi-automatically and is an aspect that does not fall within the current work; it is, however, an
interesting aspect that may be studied as part of future work in this direction.
Currently, there are 35 terms and phrases in the extended vocabulary. It contains very general terms
like process, phenomenon or potential, common biomedical terms like disorder, abnormality or
insufficiency, as well as more specific terms like cortex, sinus or duct. The terms were collected in
the following way: token uni-, bi- and trigrams were collected from the MeSH main headings, i.e.,
the terms for which definitions are sought. The ngrams that occurred 10 or more times were kept. The
threshold of 10 occurrences was chosen manually: sequences with lower frequencies were very
domain-specific and constituted the majority of all extracted ngrams. The same steps were
performed for the first NPs of MeSH definitions, which supposedly act as parent concepts, with a
threshold of 100 occurrences. The two lists were merged, and the terms that have corresponding
concepts in MeSH were filtered out. Finally, the terms genus, family, order, species and class were
manually removed from the list, as they are used in patterns for taxonomic role extraction (see
Chapter 5.3). The resulting list formed the extended vocabulary. Its full version is given in
Appendix B.
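The ngram collection step can be sketched as follows (illustrative function name; the thesis used a threshold of 10 for MeSH headings and 100 for the first NPs of definitions):

```python
from collections import Counter

def frequent_ngrams(terms, max_n=3, threshold=10):
    """Collect token uni-, bi- and trigrams over a list of terms and keep
    those occurring at least `threshold` times."""
    counts = Counter()
    for term in terms:
        tokens = term.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {ng for ng, c in counts.items() if c >= threshold}
```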
Two main modifications were added to the EA for definition generation:
• phrases given in brackets were removed;
• only the first sentence of a definition is parsed.
The motivation behind these steps is quite intuitive: definitions should contain only the essential
information about the definiendum. As a rule, the main content is given in the first sentence, and all
consecutive sentences further elaborate on it and are not crucial. The brackets indicate that the
information they contain is of an additional character and is not strictly necessary.
5.2.4 Implementation

The Extended Annotator was implemented in the Java programming language using the Eclipse IDE
(Juno Service Release 1). The architecture of the annotator depends heavily on the syntactic parse
trees of the input text generated by a third-party parser. In the current implementation the Stanford
Parser24 is used [Klein et al. 2003]. The parser outputs syntactic trees per sentence, labeled with
Penn Treebank25 tags for parts of speech, phrases and clauses. The Stanford Parser is integrated into
the Stanford CoreNLP toolkit26, which also provides essential natural language analysis steps such
as sentence splitting, tokenisation, lemmatisation and part-of-speech tagging. Version 3.2.0 of the
CoreNLP package was used.
The current distribution of the Stanford Parser allows the user to choose between statistical models that
24 Stanford Parser: http://nlp.stanford.edu/software/lex-parser.shtml
25 Penn Treebank tags: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
26 Stanford CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml
were trained either on general purpose or on biomedical texts. Both types of models were used on
our dataset of MeSH definitions for the preliminary evaluation. Although the dataset is
biomedically related, the general purpose models yielded more accurate parse trees. A possible
explanation is that for the task at hand the definitional structure of the sentences is more important
than their biomedical focus, and the general models handle it better. The high importance of the
structure of the definitions shows that the current approach towards formal definition generation
can potentially be generalized to other domains, which makes it more valuable.
5.2.5 Evaluation
The evaluation of the Extended Annotator was performed on data provided by BioASQ. BioASQ27 is
an online challenge on biomedical question answering (QA) and semantic indexing [Tsatsaronis et
al. 2012]. BioASQ's objectives are to advance the state of the art in large-scale biomedical question
answering and semantic indexing, and to establish QA benchmarks as reference data to encourage
further competitions in the domain.
Question Answering is the task of finding relevant information in heterogeneous resources and
presenting it in a concise manner, given an information need expressed in natural language.
Semantic indexing is the task of annotating text segments with ontological concepts; it facilitates QA
and can be seen as a pre-processing step for more complex NLP tasks.
BioASQ has been running several tasks since March 2013. Task 1A is called “Large-scale
online biomedical semantic indexing”. It consists of a series of batches, each of them containing
hundreds of unclassified PubMed28 documents. Manual annotation of the documents by PubMed
curators is provided separately and serves as a ‘gold standard’ for the evaluation. The annotation is
done using MeSH headings.
Below is an example of an input PubMed document and its gold standard annotation:
Input:
"pmid": 23847785,
"title": “The role of transcription-independent damage signals in the initiation of epithelial wound healing.”,
“abstract”: “Wound healing is an essential biological process that comprises sequential steps aimed at restoring
the architecture and function of damaged cells and tissues. This process begins with conserved damage signals,
such as Ca2+, hydrogen peroxide (H2O2) and ATP, that diffuse through epithelial tissues and initiate
immediate gene transcription-independent cellular effects, including cell shape changes, the formation of
functional actomyosin structures and the recruitment of immune cells. These events integrate the ensuing
transcription of specific wound response genes that further advance the wound healing response. The
immediate importance of transcription-independent damage signals illustrates that healing a wound begins as
soon as damage occurs.”
Output:
"pmid": 23847785,
"labels":["D014947","D014945“,"D006861"]
The BioASQ data suit the purpose of the annotator evaluation perfectly, as they are of high quality
and were designed for the task at hand. In addition, they provide a way to compare our annotator
with other systems, since BioASQ is a competition open to the public, and the performance results
are published on the BioASQ website27. Finally, BioASQ uses the same knowledge resource as our
annotator, i.e., MeSH.
27 BioASQ challenge: http://bioasq.org/
28 PubMed: http://www.ncbi.nlm.nih.gov/pubmed
We evaluated the Extended Annotator (EA) on the data from batch #1 of Task 1A. It contains 6
separate data sets, ranging in size from 790 to 6562 documents per set; the total number of
documents is 16763. We used data sets 1-4 and 6 (data set #5 was not available for the
evaluation). We annotated the data with the EA and matched the output against the gold standard,
measuring micro-averaged precision, recall and F-measure. We compared our performance with
that of the original in-house annotator (AAA in the results), which was the basis for the
development of the EA. The results of this evaluation follow in Table 8.
Test set #   Annotator   Precision   Recall   F1-score
1            AAA         0.2138      0.2704   0.2388
1            EA          0.2692      0.4375   0.3333
2            AAA         0.2507      0.2605   0.2555
2            EA          0.2758      0.4171   0.3321
3            AAA         0.2557      0.2998   0.2760
3            EA          0.3099      0.4557   0.3689
4            AAA         0.2348      0.2733   0.2526
4            EA          0.2798      0.4471   0.3442
6            AAA         0.2617      0.3022   0.2805
6            EA          0.2984      0.4640   0.3632
Average      AAA         0.2449      0.2826   0.2622
Average      EA          0.2866      0.4443   0.3483
Table 8. The comparative performance of the in-house annotator (AAA) and the Extended Annotator (EA).
Semantic indexing is a challenging task; in machine learning terms it can be formulated as large-scale
multi-label classification. It is not surprising that the state-of-the-art results are not striking: the
top performing systems barely reach a 50% F1 score, occasionally beating the baseline of around
55% F1. The situation is aggravated by the hierarchical nature of the annotation classes.
The EA scored around the average compared to the other systems. However, a consistent improvement
in the scores of the EA compared to the in-house annotator can be observed. In particular, the EA
yields an 8-point gain in F1 score, improving precision by 4 points and recall by 16 points (absolute).
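The micro-averaged measures used above pool the true positives, false positives and false negatives over all documents before computing the ratios, rather than averaging per-document scores. A minimal sketch of the three measures (illustrative code, not the actual evaluation script):

```java
import java.util.*;

public class MicroAveragedScores {

    // Micro-averaged precision, recall and F1 over a collection of
    // documents: TP, FP and FN counts are pooled across all documents
    // before the ratios are computed.  Returns {precision, recall, f1}.
    public static double[] score(List<Set<String>> gold, List<Set<String>> predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < gold.size(); i++) {
            for (String label : predicted.get(i)) {
                if (gold.get(i).contains(label)) tp++; else fp++;
            }
            for (String label : gold.get(i)) {
                if (!predicted.get(i).contains(label)) fn++;
            }
        }
        double p = tp / (double) (tp + fp);
        double r = tp / (double) (tp + fn);
        double f1 = 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }
}
```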
5.2.6 Runtime Assessment
The runtime of the EA was evaluated on PubMed abstracts and titles provided by the BioASQ as a
Batch 1 dataset. Table 9 illustrates the annotator performance on 10 different abstracts listing the
number of sentences in an abstract and its title, the number of recognized MeSH entities and the
runtime in seconds. The runtime assessment tests were run on a MacBook Air with Inter Core2 Duo
processor @ 1.6 GHz, 4 GB RAM and Mac Os X 10.6.8 x86_64 operating system. As the
annotator uses the Stanford Parser extensively, and the parser initialization tend to be timeconsuming (around half a minute), we excluded it from the runtime statistics, since the parser
should be initialized only once and them an arbitrary number of sentences/texts can be parsed.
Abstract #              # sentences   # MeSH terms   Runtime (sec.)
1                       11            29             34.1
2                       10            16             18.07
3                       6             13             8.74
4                       10            19             13.18
5                       8             23             20.4
6                       15            21             34.62
7                       9             13             20.6
8                       5             15             9.79
9                       8             26             9.41
10                      11            34             15.49
Average per abstract:   9.3           20.9           18.44
Average per sentence:   -             2.25           1.98
Table 9. The Extended Annotator runtime statistics.
The annotation speed is approximately 2 seconds per sentence, and 18 seconds per PubMed abstract
on the local machine used.
5.2.7 Summary of contributions and conclusions
The Extended Annotator, which recognizes the textual occurrences of MeSH terms, was designed,
implemented and evaluated. It was built heuristically, improving on the performance of the Attribute
Alignment Annotator developed in-house. The EA has several modules (NP collection, NP annotation,
annotation selection, extra vocabulary usage) that can be further modified independently of one
another. The EA uses three main external components: the syntactic parser, the string alignment indexer
and the biomedical ontology. Currently these components are the Stanford Parser, the Attribute
Alignment Annotator and MeSH, respectively. However, the EA can be adapted to other parsers (e.g.
ClearNLP), indexers (e.g. MetaMap) and ontologies (e.g. SNOMED CT), possibly generalizing to
other domains. It is useful both as a stand-alone tool and as part of various text mining pipelines, such as
Ontology Generation, Definition Formalization, Biomedical Interaction Extraction and many more.
5.2.8 Future work
The current implementation of the Extended Annotator was designed for offline pre-processing
purposes and thus is not optimized for speed. However, there is potential to make it more
efficient and to integrate it into an online annotation service, which is also discussed in
Section 6. In addition, the formal semantics of the concepts that stem from the extended
vocabulary should be manually defined, so that subsequent reasoning can be performed.
5.3 Parser for Relation Extraction
After a text string is annotated with biomedical concepts, the next step is to group these concepts into
relational instances and to form a preliminary structure of the formal definition. This is the task of
the Relation Extraction Parser (hereafter referred to as “the parser”).
The parser takes as input a textual definition pre-annotated with biomedical concepts by the
Extended Annotator as well as its syntactic parse tree and produces structures of the form
concept_A – relational_string – concept_B which we call unlabeled triples.
Tremor – cyclical movement of a body part that can represent either a physiologic process or a
manifestation of disease.
Annotations: Tremor – D014202; movement – D009068; body part – term from the extended
vocabulary; physiologic process – D010829; disease – D004194.
Unlabeled triples:
ISA (Tremor, Movement)
“of” (Movement, Body Part)
“that can represent” (Tremor, Physiological Process)
“manifestation of” (Tremor, Disease)
5.3.1 Various types of the definitional structure of a sentence
The classical structure of a definition follows the pattern “A is a B that has property C”. However, the
string of text containing the definitional information does not necessarily follow this formula
precisely. In particular, the definition may not explicitly contain the “is-a” copula. Instead,
definitional statements can be formulated in multiple ways.
There are several ways in which a definition can be introduced in texts; these are studied in the
context of automatic definition detection and extraction. [Westerhout et al. 2008] manually
investigated 330 natural language definitions and outlined 5 common definition forms (the original
names are kept):
• type “to be”
the definiendum is separated from the body of the definition by an explicit copula;
Ex: A definition is a statement that explains the meaning of a term.
(source: http://en.wikipedia.org/wiki/Definition)
• type “verb”
some notional verb is used instead of a copula;
Ex: A definition denotes a statement that explains the meaning of a term.
• type “punctuation”
the definition body is introduced by a punctuation mark;
Ex: Definition: a statement that explains the meaning of a term.
• type “layout”
the definiendum is structurally separated from the body by spaces, new lines etc.;
Ex: Definition
A statement that explains the meaning of a term.
• type “pronoun”
the definiendum is not present in the definition, but it is referred to by a relative or a
demonstrative pronoun;
Ex: … a definition. The latter is a statement that explains the meaning of a term.
The distribution of definition types among the 330 annotated sentences is the following:
Type          Number (percentage)
to be         84 (25.5%)
verb          99 (30%)
punctuation   46 (13.9%)
layout        7 (2.1%)
pronoun       46 (13.9%)
other         48 (14.5%)
Table 10: Distribution of definition types from [Westerhout et al. 2008].
We will adopt the aforementioned definition classification and further on will refer to the types of
definitional structures by the names “to be”, “verb”, “punctuation”, “layout” and “pronoun”.
We draw a distinction between two tasks: definition detection and definition processing. While the
former is the identification of definitional sentences in heterogeneous text corpora [Wächter 2010],
the latter recognizes the structure of a definition and transforms it into a formal encoding.
Since our parser was created to address the second task, it does not perform the definition detection
and is given textual definitions as input. However, it can differentiate between definition types
while performing the first step of the parsing, i.e. identifying the head and the body of the
definition (definiendum and definiens). Currently the parser can process definitions of types “to be”,
“punctuation”, “layout” and some cases of “verb” (the list of verbs and constructions is easily
extensible). Hence, it can be used for various corpora and text resources, e.g., MeSH definitions
(“layout” type), web data, scientific publications etc. It can be integrated into a formal definition
generation pipeline preceded by a definition extraction (see [Wächter 2010]) and concept annotation
steps (see Chapter 5.2).
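The type recognition performed in that first parsing step can be approximated with surface heuristics. The sketch below is an illustration only (the class name, regular expressions and verb list are ours, and the real parser works on parse trees rather than raw strings); it classifies a definition string into the Westerhout types covered here:

```java
import java.util.*;

public class DefinitionTypeDetector {

    // Notional verbs that may replace the copula (list easily extensible).
    private static final Set<String> DEFINING_VERBS =
            new HashSet<>(Arrays.asList("denotes", "means", "describes", "designates"));

    // Rough surface heuristics for four of the definition types of
    // [Westerhout et al. 2008]; everything else falls into "other".
    public static String detect(String definition) {
        String d = definition.trim();
        if (d.contains("\n")) return "layout";                    // head on its own line
        if (d.matches("[^:]{1,60}:\\s.*")) return "punctuation";  // "Definition: a ..."
        if (d.matches(".*\\b(is|are)\\s+(a|an|the)\\b.*")) return "to be";
        for (String verb : DEFINING_VERBS)
            if (d.matches(".*\\b" + verb + "\\b.*")) return "verb";
        return "other";
    }
}
```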
5.3.2 The structure of definitions in MeSH
The way the definition is introduced in the text is not the only point in which a randomly chosen
textual definition can diverge from the classical schema “A is a B that has property C”. On the
contrary, a definition can exhibit multiple alterations with respect to the classical structure, which
makes the parsing of definitions a non-trivial task.
Let us illustrate the possible deviations in the structure of a definition using the MeSH corpus as the
source of examples, since (a) within the scope of this thesis we work primarily with MeSH definitions
and (b) they are manually curated by domain experts and are not artificial in their structure.
MeSH contains 25843 distinct definitions; the average length of a MeSH definition is 30 words
(203 characters). Here are the most common violations of the definitional structure that can be found
in MeSH texts:
1) the parent term is missing
Ex: Cryosurgery – the use of freezing as a special surgical technique to destroy or excise tissue.
We would argue that “the use” is not a parent term, but rather a relation mention expressed by an
adverbial noun.
2) the differentia is missing, i.e. the specification of the distinctive “property C” is not mentioned
Ex: Lavandula – a plant genus of the LAMIACEAE family.
This is a very common type of structure for MeSH definitions: it specifies the position of the
definiendum in the taxonomic hierarchy, but does not explain how the defined concept should be
distinguished from other concepts in the same taxonomic slot.
3) the differentia is too verbose, i.e. it gives more information than is needed in order to differentiate
the head concept from its siblings
Ex: French Guiana – a French overseas department on the northeast coast of South America. Its
capital is Cayenne. It was first settled by the French in 1604. Early development was hindered
because of the presence of a penal colony. The name of the country and the capital are variants of
Guyana, possibly from the native Indian Guarani guai (born) + ana (kin), implying a united and
interrelated race of people.
This type of definition is typical for certain areas covered by MeSH, for example, geography. This
fact serves as motivation to restrict ourselves to certain MeSH trees (see Section 5.3.4). Another
motivation has to do with the annotation: for certain domains the concentration of annotated
concepts per definition is considerably lower than for other domains.
5.3.3 Functionality of the parser
Given as input the textual representation of a definition, its syntactic parse tree and a list of
biomedical concepts which occur in the definition linked to their textual mentions, the parser
performs the following processing steps:
1) recognize the type of the definition (see Section 5.3.1), find its head and body
2) select the first sentence of the definition
To avoid definitions whose differentia is too verbose, we restrict ourselves to parsing only the first
sentence of each definition. We assume that all the essential (necessary and sufficient) information
is contained in the first sentence, while the sentences that follow further explain its content and thus
do not form part of the definition in its classical interpretation.
3) detect the parent term (the genus), if it is present
At this step we rely on the information provided by the ontology used for the semantic annotation:
if the term that appears first in the definition is not recognized by the annotator, i.e. it is not
considered a concept by the ontology, then it belongs to the relational string of some triple
(example 3a); otherwise it is the parent concept (example 3b).
Example 3a: Abdominal Wall – the outer margins of the abdomen, extending from the
osteocartilaginous thoracic cage to the pelvis.
Example 3b: Cattle Diseases – diseases of domestic cattle of the genus bos.
4) group coordinated concepts into conjunctive or disjunctive sets
Detecting coordination is one of the important issues in predication extraction listed by
[Rindflesch et al. 2000]. Coordinated concepts are organized into sets with one representative
concept. Whenever this concept participates in a triple, the rest of the concepts automatically
form triples as well, using the same relational string and the same concept as the second argument,
e.g.:
Vesicular stomatitis Indiana virus - the type species of vesiculovirus causing a disease
symptomatically similar to foot-and-mouth disease in cattle, horses, and pigs.
Foot-and-Mouth Disease — «in» — Cattle
Foot-and-Mouth Disease — «in» — Horses
Foot-and-Mouth Disease — «in» — Swine
Triples for coordinated concepts will further be transformed into conjunctions and disjunctions of
concepts in the DL notation of the definition formula.
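The expansion of a coordinated set into triples can be sketched as follows (a simplified illustration with our own class names: the relational string and the first argument of the representative concept's triple are copied to every member of the coordinated set):

```java
import java.util.*;

public class CoordinationExpander {

    public static class Triple {
        public final String head, relation, tail;
        public Triple(String head, String relation, String tail) {
            this.head = head; this.relation = relation; this.tail = tail;
        }
    }

    // Once the representative concept of a coordinated set takes part
    // in a triple as the second argument, the remaining members form
    // triples with the same relational string and the same first argument.
    public static List<Triple> expand(Triple representative, List<String> coordinatedSet) {
        List<Triple> triples = new ArrayList<>();
        for (String concept : coordinatedSet) {
            triples.add(new Triple(representative.head, representative.relation, concept));
        }
        return triples;
    }
}
```

For the "Vesicular stomatitis Indiana virus" example above, expanding the triple (Foot-and-Mouth Disease, "in", Cattle) over the coordinated set {Cattle, Horses, Swine} yields the three triples shown.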
5) organize the concepts into concept pairs
This is the key step in definition parsing, as it shapes the resulting triples. The process depends heavily
on the syntactic structure of the sentence. One straightforward way of linking concepts
together would be to follow the dependency paths across the syntactic tree and to link every concept
with the nearest dominating one (and to link the top concept with the head concept). However,
while parsing the definition, we would like to collect as much information about the head term as
possible. For this reason we link annotated concepts with the head term whenever it is possible and
does not violate common sense. In fact, for the majority of the triples both ways of constructing
them (i.e. combining concepts either with the dominant NP or with the main term) are possible and
comprehensible. For example, for the following definition:
Classical Lissencephalies – disorders comprising a spectrum of brain malformations
representing the paradigm of a diffuse neuronal migration disorder,
it makes sense to link Brain malformations both to Disorders and to Classical Lissencephalies,
and to link Diffuse neuronal migration disorder either to Brain malformations or to Classical
Lissencephalies. Indeed, if Classical Lissencephalies is a malformation that represents a diffuse
neuronal migration disorder, we can infer that it itself represents this type of disorder. Thus, by
linking concepts occurring in the definition directly with the main term we skip this inference step.
There is a specific syntactic construction for which we allow the concepts to be combined with the
dominant concepts: Noun Phrase + preposition + Noun Phrase.
Anterior Thalamic Nuclei – three nuclei located beneath the dorsal surface of the most
rostral part of the thalamus.
Anterior Thalamic Nuclei — “located beneath” — Surface
Surface — “of” — Part
Part — “of” — Thalamus
6) extract relational strings
To finish the formation of unlabeled triples, we need to accompany the concept pairs with relational
strings that contain the mention of the respective relation in the text. The intuitive approach is to take
the string located between the two concepts of the pair. We used this approach in
our previous experiments on learning SNOMED CT relations (see Chapter 4). It has two major
disadvantages, though. First of all, the assumption that the relation mention is positioned
between the mentions of the relation arguments is too strong. The string between two concepts
expresses the respective relation in the majority of cases, but not in all of them. This holds for example 6a
(Simplexvirus — “causes” — Vesicular lesions), but it does not hold for example 6b (Condition —
“has low” — Serum protein level), where the string between the two concepts is “in which”.
Example 6a: Cercopithecine Herpesvirus 1 – a species of simplexvirus that causes vesicular
lesions of the mouth in monkeys.
Example 6b: Hypoproteinemia – a condition in which total serum protein level is below the
normal range.
The second major problem is that the string between two concepts may be too long and may contain
other concepts and relations that do not participate in the current relation instance. In example 6c
we can distinguish two relations: Peptide hormones — “produced by” — Neurons and Peptide
hormones — “produced by neurons of various regions in” — Hypothalamus. The string between
the concepts of the second triple includes the string from the first triple, thus containing two
relation mentions simultaneously.
Example 6c: Hypothalamic Hormones – peptide hormones produced by neurons of various
regions in the hypothalamus.
To avoid such mistakes, we extract only the substring between the current concept and the
preceding one, independently of the position of the second concept:
Peptide hormones — “produced by” — Neurons
Peptide hormones — “of various regions in” — Hypothalamus.
In case the definition has specific syntactic constructions (e.g. “PREP + which”) that locate the
relation mention to the right of the concept, the relational string is concatenated with the tokens
from the dependency path that corresponds to this construction:
Tooth, Nonvital – tooth from which the dental pulp has been removed or is necrotic.
Tooth, Nonvital — IS_A — Tooth
Tooth, Nonvital — “from which has been removed or is necrotic” — Dental pulp
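The substring-to-the-preceding-concept rule can be sketched as follows (a simplification: here the extracted string is paired with the immediately preceding concept, whereas the parser may pair it with the head term instead; the class names and offset handling are ours):

```java
import java.util.*;

public class RelationalStringExtractor {

    public static class Mention {
        public final String concept;
        public final int begin, end;   // character offsets in the sentence
        public Mention(String concept, int begin, int end) {
            this.concept = concept; this.begin = begin; this.end = end;
        }
    }

    // For each concept mention after the first, the relational string is
    // only the text between the *preceding* mention and the current one,
    // regardless of where the pair concept is located in the sentence.
    // Returns {concept_A, relational string, concept_B} triples.
    public static List<String[]> extract(String sentence, List<Mention> mentions) {
        List<String[]> triples = new ArrayList<>();
        for (int i = 1; i < mentions.size(); i++) {
            Mention prev = mentions.get(i - 1), cur = mentions.get(i);
            String rel = sentence.substring(prev.end, cur.begin).trim();
            triples.add(new String[] {prev.concept, rel, cur.concept});
        }
        return triples;
    }
}
```

On example 6c this yields "produced by" for the Neurons triple and "of various regions in the" for the Hypothalamus triple, avoiding the overlap of relation mentions described above.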
7) detect specific taxonomic relations
This step of the parsing is domain-dependent. As we have seen in Section 5.3.2, the differentia may
be completely missing from the definition, with only the information about the position of the
concept in the biological taxonomy given. There exist 8 principal taxonomic ranks: domain, kingdom,
phylum/division, class, order, family, genus and species. In definitions these terms introduce the
IS_A relation at a specific level of the taxonomy. We would like the parser to recognize such
instances of the IS_A relation in relational strings:
Cuphea – a plant genus of the family lythraceae.
Cuphea — IS_A (BELONGS_TO_GENUS) — Plant
Cuphea — IS_A (BELONGS_TO_FAMILY) — Lythraceae
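Recognizing rank keywords in relational strings can be sketched as follows (an illustrative heuristic with our own class and label names; in the pipeline this check is integrated into the parser itself):

```java
import java.util.*;
import java.util.regex.Pattern;

public class TaxonomicRankDetector {

    // The 8 principal taxonomic ranks (phylum/division are alternatives).
    private static final List<String> RANKS = Arrays.asList(
            "domain", "kingdom", "phylum", "division", "class",
            "order", "family", "genus", "species");

    // If a relational string mentions a taxonomic rank, the triple encodes
    // an IS_A relation at that level of the taxonomy; returns a label such
    // as "BELONGS_TO_GENUS", or null when no rank keyword is found.
    public static String detectRank(String relationalString) {
        String lower = relationalString.toLowerCase();
        for (String rank : RANKS) {
            if (Pattern.compile("\\b" + rank + "\\b").matcher(lower).find()) {
                return "BELONGS_TO_" + rank.toUpperCase();
            }
        }
        return null;
    }
}
```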
8) detect negation
Negation is detected in the relational string using simple patterns. If the string contains tokens
like not or other than, the triple is considered to be negated. This information is useful and should
be propagated to the formula generation step.
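A minimal sketch of this pattern check (the marker list is illustrative and easily extensible):

```java
import java.util.*;

public class NegationDetector {

    // Simple surface patterns signalling a negated triple.
    private static final List<String> NEGATION_MARKERS =
            Arrays.asList("not", "other than", "without", "except");

    // Padding with spaces gives a cheap whole-token match.
    public static boolean isNegated(String relationalString) {
        String padded = " " + relationalString.toLowerCase() + " ";
        for (String marker : NEGATION_MARKERS) {
            if (padded.contains(" " + marker + " ")) return true;
        }
        return false;
    }
}
```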
After the parsing is completed, the parser outputs unlabeled triples of the form concept_A —
relational string — concept_B, possibly accompanied by a NEGATION mark. The number of
triples for a definition depends directly on the number of annotated concepts.
5.3.4 Manual evaluation of the parser
As a preliminary evaluation of the parser, we ran it over a small corpus of definitions and
manually annotated the generated triples as correct or incorrect. MeSH served as the source of textual
definitions to be parsed. We restricted ourselves to four core MeSH trees:
A – Anatomy,
B – Organisms,
C – Diseases,
G – Biological Sciences.
From these four trees we randomly selected 40 concepts and collected the corresponding definitions.
The definitions were first annotated with MeSH concepts by the Extended Annotator and then
processed by the parser, and relational triples for the first sentence of every definition were
generated. The definitions and the respective triples are listed in Appendix C.
The triples were evaluated as follows: we marked a triple as correct if the two concepts serving as
arguments of the relation were chosen correctly and the relational string was also parsed correctly (i.e.
it does not miss anything). If either of the two conditions was violated, the triple was considered
incorrect. It should be noted that all annotations were assumed to be correct, i.e. to have the right
labels and to form the full set of annotations. This was done to evaluate the performance of the parser
independently of the larger definition generation pipeline.
For the 40 randomly selected MeSH definitions, annotated with 147 concepts from MeSH and from the
extended vocabulary, the parser generated 110 triples. 98 triples were manually labeled as correct;
only 11 triples (10%) were incorrect. In particular, for 32 definitions out of 40 (80%) all triples were
generated correctly.
Three examples below illustrate the types of mistakes for which a triple can be labeled as incorrect.
Example 1 demonstrates incorrectly chosen arguments for triples: instead of linking the
Immunoglobulin heavy-chain gene with Recombination and the latter with Immunoglobulin Class
Switching, the parser combined both Recombination and Immunoglobulin Class Switching with the
head concept.
Examples 2 and 3 relate to the same definition of Subacute Thyroiditis. In the second example the
relational string “and an enlarged damaged gland containing” does not include the substring
“characterized by”, which is essential for the triple as it expresses the core information about the
characterized_by relation. In the third example the relational string “spontaneously remitting” was
extracted correctly, since it is indeed the string that links the two concepts Subacute Thyroiditis and
Inflammatory Condition together in the definition. However, it is not recognized as conveying
the taxonomic relation, hence it is not labeled as IS_A.
Example of an incorrect triple #1:
Immunoglobulin Switch Region (D007134): a site located in the introns at the 5' end of each constant region segment of
a immunoglobulin heavy-chain gene where recombination occur during immunoglobulin class switching.
Triples: Immunoglobulin Switch Region – “where” – Genetic Recombination
Immunoglobulin Switch Region – “occur during” - Immunoglobulin Class Switching
Example of an incorrect triple #2:
Thyroiditis, Subacute (D013968): spontaneously remitting inflammatory condition of the thyroid gland, characterized
by fever; muscle weakness; sore throat; severe thyroid pain; and an enlarged damaged gland containing giant cells.
Triple: Thyroiditis, Subacute – “and an enlarged damaged gland containing” – Giant Cells
Example of an incorrect triple #3:
Thyroiditis, Subacute (D013968): spontaneously remitting inflammatory condition of the thyroid gland, characterized
by fever; muscle weakness; sore throat; severe thyroid pain; and an enlarged damaged gland containing giant cells.
Triple: Thyroiditis, Subacute – “spontaneously remitting” – Condition
In general, there is a higher chance of the parser producing incorrect triples when the input
sentence is longer and contains more annotated concepts. This observation is quite intuitive, since
longer sentences usually mean more complex syntactic structures and more involved semantics.
5.3.5 Future improvements of the parser
The developed parser proves to be suitable for the task at hand. However, it implements only one
possible strategy of triple generation, and it can be improved in two main ways:
• better precision
The parser is prone to occasional mistakes when linking concepts into concept pairs and selecting
the relational strings for the unlabeled triples. This master thesis does not focus on questions of
statistical syntactic parsing, hence the parsing procedure is not highly evolved. However, we can
foresee the following improvements if further analysis of the syntactic structure of definitions is
performed:
• better concept pair formation
More accurate assembly of concepts into pairs can help avoid mistakes like the one in Example #1
above. This can be achieved by improving the triple generation strategies as well as the underlying
syntactic parser, which is used as early as the concept annotation phase and produces the parse
tree of the definition.
• better relational string extraction
Relational string extraction can also profit from an in-depth analysis of the parse tree. For example,
verbs with several prepositional groups attached to them should be included in the relational strings
for all concepts appearing in those prepositional groups:
Abdominal Wall - the outer margins of the abdomen, extending from the osteocartilaginous
thoracic cage to the pelvis.
Abdominal Wall — «extending from» — Thoracic cage
Abdominal Wall — «extending to» — Pelvis
• differentiation between constructions in passive and active voice
Being able to recognize whether the relational string expresses a passive or an active relation is an
important functionality of the parser, especially if the next step of the pipeline – string labeling
– is performed using machine learning techniques. For example, triples of the form A – “causes” –
B and A – “is caused by” – B may both be labeled as causative; however, they express opposite
relation directions: in the first triple the cause is the first argument, whereas in the second triple
it is the second argument.
• better recall
Better recall of the parser can be achieved if the parser is modified to perform the following tasks:
− anaphora resolution
Second Primary Neoplasms – abnormal growths of tissue that follow a previous neoplasm but are
not metastases of the latter.
If proper anaphora resolution were performed, the parser would recognize an additional triple
Metastases – “of” – (previous) Neoplasm using the reference expressed by “the latter”.
− extraction of relations with arity > 2
Guanosine Diphosphate Sugars – esters formed between the aldehydic carbon of sugars and the
terminal phosphate of guanosine diphosphate.
The preposition between invokes a verbal frame with three positions: A is between B and C.
− quantitative information extraction
Bromotrichloromethane – a potent liver poison. In rats, bromotrichloromethane produces about
three times the degree of liver microsomal lipid peroxidation as does carbon tetrachloride.
At present, the parser is not able to capture the quantitative information given in the definition.
However, if formally encoded, this type of knowledge could be of high value, as it would enable
hybrid reasoning over the knowledge sources (i.e. a combination of qualitative and quantitative
reasoning).
The list of possible improvements given above is by no means exhaustive. In general, the more potent
and accurate the parser is, the richer the encoding we are able to obtain, and thus the more complex
the reasoning tasks that can be performed.
5.4 Learning Relational Labels
The last step of the formal definition generation pipeline takes as input the unlabeled triples
generated by the parser and substitutes the relational strings with relation labels, reducing the
relation instances to invariants of some domain-specific relation:
Concept A – “relational string” – Concept B
Concept A – relation label – Concept B
For our running example of the Tremor definition, the labeling step yields the following triples (note
that for this example the triples were labeled manually, using SemRep relations that we found
suitable for the given relational strings):
ISA (Tremor, Movement)
“of” (Movement, Body Part)
“that can represent” (Tremor, Physiological Process)
“manifestation of” (Tremor, Disease)

ISA (Tremor, Movement)
LOCATION_OF (Movement, Body Part)
MANIFESTATION_OF (Tremor, Physiological Process)
MANIFESTATION_OF (Tremor, Disease)
Labeling the text strings with relation names is an instance of the text classification task. We adopt
the same problem formulation of relation labeling that we used in the previous set of experiments
(see Chapter 4). Relational instances, or triples, represent the training/testing examples and form the
learning corpus. Every instance is represented as a set of features and passed to a machine learning
algorithm.
Text classification is a supervised machine learning task, which means the learning instances are
labeled with a class – a relation name in our case. The model is then trained using the labeled
instances and is used for the classification of new instances that do not yet have a label. We assume
that a relational instance corresponds to precisely one relation; thus the task of relation labeling is
single-label classification. Unsupervised relation extraction will be covered in Chapter 7.
The three key decisions that we need to make while learning relational labels are a) choosing the
classification algorithm, b) choosing the features that are relevant for the task, and c) choosing the
source of class labels, i.e. relations. Let us discuss the three choices in turn.
5.4.1 Choosing the classifier
Support Vector Machines (SVM) is a linear classification algorithm that automatically builds a
hyperplane separating the instances of different classes in such a way that the margin (the distance
between the hyperplane and the nearest instances) is maximized.
SVM yielded the highest performance in our previous experiments on relation labeling (see Chapter
4), compared to Random Forests (a decision tree algorithm), Logistic Regression and Naïve Bayes.
The superior performance of SVM relative to other classification algorithms has been
repeatedly demonstrated in previous work on various textual datasets. A study by [Lewis et al. 1996]
shows that SVM, Naïve Bayes and k-Nearest Neighbor are among the best performing classifiers. A
later work by [Mladenić et al. 2004] evaluates several classifiers on the Reuters news datasets,
showing that SVM tends to outperform other algorithms, including Naïve Bayes.
Although SVM is a linear classifier, it can be efficiently applied to linearly non-separable data: a
kernel function can map the input instances into a new feature space where they can be separated by
a hyperplane (the kernel trick). We tried SVM with different kernels, but the linear kernel gave the
best results. We therefore settled on the SVM with a linear kernel for the final label classification
setting.
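As an illustration only (not the thesis implementation), a linear-kernel SVM over character-trigram features can be set up with scikit-learn; the strings, labels and variable names below are toy assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy relational strings with relation labels (hypothetical examples).
strings = ["caused by", "is caused by", "located in", "found in the",
           "used to treat", "treats the"]
labels = ["causes", "causes", "location_of", "location_of",
          "treats", "treats"]

# Boolean character-trigram features fed into a linear-kernel SVM.
classifier = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3), binary=True),
    LinearSVC(),
)
classifier.fit(strings, labels)

# An unseen string sharing many trigrams with the "causes" examples.
print(classifier.predict(["was caused by"])[0])
```

The pipeline object bundles vectorization and classification, so new relational strings can be classified directly without separate feature extraction code.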
5.4.2 Choosing the features
Relation triples cannot be used for classification directly: they need to be formally represented
as an ordered set of features. Features can represent words or phrases in a text, the presence or
absence of some element, or any property of an object in general. Feature construction is a
non-trivial and highly important task. Features should be chosen so that they represent instance
parameters that are relevant for learning; irrelevant features should be avoided, since they act as
noise for a classifier; redundant features are undesirable as well, as they expand the feature space
without adding to the performance. Years of studies of machine learning algorithms have led to
considerable advances in algorithm performance, both in terms of accuracy and efficiency.
One can argue that now it is not the choice of an algorithm that matters most, but the choice of the
features. We use two types of features for the relation instance labeling: lexical features and concept
types.
5.4.2.1 Lexical features
Lexical features represent the relational string that is extracted for every triple from the definition.
Similar to the classification algorithm, we adopt the methodology of constructing lexical features
from our preceding experiments. We tried classical features used in the vector space modeling of
textual data, namely ngrams. Both tokens and characters were used as building blocks for ngrams,
and character ngrams appeared to be slightly more effective. Hence, character trigrams are our
final choice for the relation labeling at this stage of the research.
The choice of ngrams as the main lexical features is based on two main advantages:
1) ngrams have proved very useful for the classification of very short texts [Mladenić et
al. 2003], and the relational strings extracted from single sentences are extremely short;
2) ngrams implicitly capture a wide variety of information about a string. In particular,
character ngrams can reflect word order, lemmas, stems and grammatical forms of words,
and important morphemes, to name a few. All this information comes at little cost: no
sophisticated linguistic analysis is required for ngram extraction.
Obviously, a single ngram does not play a big role in the labeling process, but several
ngrams of the same string taken together can yield a strong signal for a particular class. For
example, the set of trigrams {cau, aus, use, sed, "ed ", "d b", " by"} extracted from the
relational string caused by captures not only the stem of the verb, but also the fact that it is
used in the passive voice and is followed by the preposition by, which reflects a very
common lexical pattern for the causative relation.
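Character trigram extraction itself is a one-liner; the following minimal sketch (function name is illustrative, not from the thesis code) reproduces the example above:

```python
# Overlapping windows of three characters over the relational string.
def char_trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Seven trigrams covering the verb stem, the passive ending and "by".
print(sorted(char_trigrams("caused by")))
```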
Ngrams can be used to encode textual strings in many different ways. We have not tried all the
possible options, but some of them seem promising for the task at hand and could be used in future
work. A widely used modification of classical ngrams is so-called soft ngrams: ngrams that do not
require elements (tokens, lemmas, stems, characters etc.) to be directly adjacent to each other. Soft
ngrams allow insertions and deletions of elements and are sometimes weighted according to the edit
distance between two ngrams [Sammut et al. 2011].
A similar type is loose ngrams. They are built from the sentence preserving the original order of the
lexical elements in the text, but afterwards the order of the elements inside each ngram is ignored.
Another common type is phrase-based ngrams. These are extracted specifically from noun phrases,
verb phrases, prepositional phrases etc. They rely on the syntactic analysis of a sentence and have a
more targeted nature compared to simple ngrams, which are extracted
regardless of the grammatical properties of the elements involved. In Chapter 7, when discussing
unsupervised relation extraction, we will come back to ngrams extracted from verbs, verb phrases
and their derivatives.
Every type of ngram – classical, soft, loose, phrase-based etc. – has its own advantages and
disadvantages. By modifying the way an ngram is built and compared to other ngrams, one captures
some linguistic information with the ngram encoding while losing other information. Which type of
ngram is more suitable for a particular task is an interesting engineering question that can be
answered empirically.
Special attention should be paid to the weighting of the features. Previously we tried defining
separate feature weights for every class (relation), but this did not prove much more effective than
an unweighted methodology, so for the present formulation of the classification task we stick to the
Boolean weighting scheme. One weighting principle that could potentially work for relational
strings depends on the position of an ngram inside the string: ngrams that appear closer to the right
end of the string are upweighted. The assumption is that lexical items that are adjacent in the text to
the second argument of the relation instance are more likely to bear relevant relational information.
For example, the triple A is used as a treatment for B contains two tokens, namely used and
treatment, that serve as triggers for the classifier, but the latter is more important as it carries the
more predicative information.
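The positional upweighting idea could be sketched as follows; this is a hypothetical linear scheme (the experiments above used Boolean weights), and the function name is made up for illustration:

```python
# Hypothetical linear positional weighting: tokens nearer the right end
# of the relational string receive higher weights. Duplicate tokens
# would overwrite each other in this simplistic sketch.
def positional_weights(tokens):
    n = len(tokens)
    return {tok: (i + 1) / n for i, tok in enumerate(tokens)}

weights = positional_weights("is used as a treatment for".split())
print(weights["used"], weights["treatment"])  # "treatment" outweighs "used"
```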
5.4.2.2 Concept type features
While lexical features reflect the relation per se, concept features focus on the relation arguments.
The motivation behind the use of concept types is quite intuitive: every relation has a domain and a
range. In other words, it can take only certain types of concepts as its arguments. If we include the
concept types into the feature representation of instances, we impose explicit constraints on the
arguments of every instance, and from them the classifier will be able to learn implicit patterns of
concept types for every particular relation. Such patterns will be of great help in the classification
process:
a triple:              A “is in some relation with” B
without concept types: A – relation R1 – B
                       A – relation R2 – B
                       both are candidates!
with concept types:    A → type At, B → type Bt
                       R1 ⊆ At × Bt
                       R2 ⊆ Ct × Dt
                       only the first relation matches
If every concept type is assigned a distinct ID and every concept has at least one concept type, the
relation triple can be encoded into a feature vector using the lexical features described above plus
the two concept type IDs.
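A minimal sketch of such an encoding, with a tiny trigram vocabulary and made-up type IDs (neither is the actual feature space of the experiments):

```python
# Encode a relational triple as Boolean trigram features plus the two
# concept type IDs of the arguments (IDs below are hypothetical).
def encode_triple(rel_string, type_a, type_b, trigram_vocab):
    trigrams = {rel_string[i:i + 3] for i in range(len(rel_string) - 2)}
    lexical = [1 if g in trigrams else 0 for g in trigram_vocab]
    return lexical + [type_a, type_b]

vocab = ["cau", "aus", "use", "loc", "tre"]
vector = encode_triple("caused by", type_a=7, type_b=12, trigram_vocab=vocab)
print(vector)  # lexical part first, then the two type IDs
```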
The only question that remains is how to construct the list of all possible concept types and how to
assign types to concepts. There are two main solutions:
1) use an existing resource
The UMLS Semantic Network (see Chapter 2.3.2) contains a manually built set of concept types
relevant for the biomedical domain. Types are assigned to all the concepts of the Metathesaurus,
thus the type information is easily accessible. The disadvantage of this approach is that the domain
modeling offered by UMLS may not be compatible with the modeling of the relations, i.e. the types
may not correspond to the domain and range of the relations and thus will not form valid patterns of
type pairs. An example of such a case is given in Chapter 6.
It is often the case that a concept is assigned more than one type. If we prefer to label concepts in
triples with precisely one type, we can either select the type randomly or, if the types are organized
into a hierarchy, take the most specific type among the candidates, or the most general type, or their
most specific common parent.
2) use the top concepts as types
Another approach is to use the taxonomic structure of the ontology that lies behind the semantic
annotation of the definitions and for which the relations to be learnt are defined. If the taxonomy
forms a single tree, then its first n levels can be taken as concept types. If there are several
independent trees, the top concepts of the trees can serve as types. For example, MeSH has 16
taxonomic trees and SNOMED CT has 19 top concepts. These can be used directly as types, and
every concept from the ontology is automatically assigned a type.
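A minimal sketch of this second option, walking up a toy child-to-parent taxonomy until a top concept is reached (the concept names and map are illustrative, not taken from MeSH or SNOMED CT):

```python
# Toy taxonomy as a child -> parent map; None marks a top concept.
parents = {
    "Tremor": "Movement Disorder",
    "Movement Disorder": "Disease",
    "Disease": None,
}

def concept_type(concept, parents):
    # Walk up the hierarchy; the top concept serves as the type.
    while parents.get(concept) is not None:
        concept = parents[concept]
    return concept

print(concept_type("Tremor", parents))
```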
Concept type features are novel with respect to our previous experiments. They are one of the main
improvements of the pipeline at this stage of work. Their importance is demonstrated in the
Evaluation chapter of this thesis.
5.4.3 Choosing the set of relations
The last decision point of the relation classification is the choice of the relation set. In the
supervised scenario, relations should be taken from an existing resource that contains a set of
unique relations along with annotated relation instances. The beauty of the approaches that are
based on machine learning is their scalability: as long as we have a corpus of annotated triples,
training the classifier and producing the models for relations is a matter of minutes. In principle, any
set of relations and any corpus can be used for training.
However, the choice of the resources should not be random. As the previous experiments have
shown, it is very important to combine textual and formalized sources that share the same or at
least a similar modeling of the domain. Previously we used SNOMED CT formal definitions together
with MeSH textual definitions (see Chapter 4). We aligned SNOMED CT formulas with MeSH
texts, extracted the triples, labeled them with formal relations and performed the learning. Although
only three relations were chosen for the experiment, and both sources are manually curated so the
relation labels of the triples were trustworthy, the results were not very impressive: the F-measure
reached only 75%. The reason for this performance lies in the different nature of the resources we
used: MeSH is human-oriented while SNOMED CT is machine-processable. Their ways of
modeling biomedical knowledge are not compatible: the corresponding definitions from the two
ontologies rarely contain the same relations, which resulted in a very small dataset compared to the
sizes of the ontologies; and the information that overlaps in both definitions is often expressed in
completely different ways. For example, the SNOMED CT relation Associated morphology is rarely
explicitly present as a relation in a MeSH definition; rather, its semantics is folded into a complex
concept name.
While choosing the set of relations, one should always keep in mind which kind of texts the trained
relation models will be applied to. Since we are interested in formalizing textual definitions from
resources that have been constructed by humans and for humans, we need to use relations that were
also created for human understanding and which reflect the way humans and not machines view the
domain. In the next chapter we will use, for evaluation purposes, a resource called SemRep that
satisfies this compatibility condition.
6. Evaluation

This chapter is dedicated to the evaluation of the learning method which we propose in Section 5.4.
The thesis has a specific focus on the extraction of non-taxonomic relations, therefore we give an
extensive evaluation and analysis of the supervised relation extraction method in Chapter 6 and we
discuss the prospective unsupervised methods for relation extraction in Chapter 7. The evaluation of
the annotator and the triple extraction parser are given in the respective chapters.
One of the key ingredients of relation classification is to integrate resources that are compatible
with respect to the domain modeling (see Chapter 5.4.3). Therefore, for the training of relation
models and for the subsequent evaluation tests we need a text corpus and a set of predefined
relations that have similar ways of representing biomedical concepts and links between them. A
system called SemRep fulfills this requirement: it contains a corpus of biomedical scientific texts as
well as a set of 30 relations defined on them. This makes certain SemRep components ideal for the
evaluation of our system.
6.1 SemRep: biomedical relation extraction system

SemRep is a rule-based biomedical relation extraction system created by the National Library of
Medicine as part of the Semantic Knowledge Representation project (SKR)29. SemRep consists of
three main components: a) a relation extraction component that can be run online 30 , b)
SemMedDB31, a database of predications extracted automatically from biomedical literature by the
system, and c) SemRep Gold Standard Annotation, a corpus of 500 MEDLINE sentences manually
annotated with predications32.
SemRep has a range of possible applications. The relational instances extracted by the system form
a database that can serve as an auxiliary tool for reasoning tasks such as hypothesis generation,
literature discovery, intelligent decision making etc. [Kilicoglu et al. 2012]. [Hristovski et al. 2013]
propose drug repositioning and repurposing by generating pharmacological hypotheses based on the
SemMedDB relation repository. [Hristovski et al. 2010] combine relation instances with microarray
data in order to facilitate result analysis and novel hypothesis generation.
SemRep is a suitable reference for the evaluation of our triple generation system:
• the relation extraction component addresses the same task as our pipeline, thus the two
systems can be given the same text corpus as input and the results are directly comparable;
• SemRep is a rule-based system whereas our pipeline relies on machine learning techniques, so
the comparison of two methodologically opposite systems can be of great interest;
29 SKR project: http://skr.nlm.nih.gov/
30 interactive mode of SemRep: http://skr.nlm.nih.gov/interactive/semrep.shtml
31 SemMedDB: http://skr3.nlm.nih.gov/SemMedDB/
32 SemRep Gold Standard Annotation: http://skr.nlm.nih.gov/SemRepGold/
• SemRep gold standard corpus can be used for training and testing of our pipeline; the fact
that gold standard triples are already annotated with semantic types is a further advantage;
• SemRep is the official relation system of NLM, thus it can be considered the state of the
art of biomedical relation extraction.
Let us look into the relation extraction component and into the gold standard corpus in more detail.
6.1.1 SemRep relation extraction component
SemRep positions itself as the first step towards semantic interpretation of biomedical texts. Its
relation extraction mechanism is based on shallow parsing and the annotation of concepts with
UMLS Semantic Network semantic types. An input sentence is labeled with part-of-speech tags,
and simple noun phrases are identified and mapped to UMLS concepts from the Metathesaurus.
Concepts are then mapped to semantic types, and recurring patterns of the form type A – does
something with – type B are taken as rules for a particular semantic relation.
There are 30 different relations defined in the SemRep predication extraction system. All but three
relations can also appear in negated form. The relations belong to the important biomedical
subdomains of clinical medicine, substance interactions, genetic etiology of diseases and
pharmacogenomics. A full list of SemRep relations is given in Appendix D. The developers of
SemRep acknowledge that there are possibly many more relations that are relevant for the domain
and that are present in texts. Nevertheless, they confine themselves to the 30 relations listed above.
The expansion of the relation set used in the system is a tedious process, as new hand-crafted rules
have to be created and added to the system, which brings us to one of our main motivations for
choosing machine learning as the relation extraction technique: machine learning algorithms learn
the underlying relation patterns, or rules, themselves and do not require their explicit specification.
Most of the SemRep relations have very general names. This is particularly true for the top
occurring relations: process_of, location_of, part_of, affects, treats etc. It should be noted, though,
that the relations are often narrower than their names suggest: they may be used in a restricted
sense only. For example, process_of can only have disorders as its subject, and affects
only takes processes as objects. This observation backs up the use of concept types as key features
in learning relational instances from text.
6.1.2 SemRep Gold Standard corpus
[Kilicoglu et al. 2011] describes in detail the process of creating the gold standard corpus. It
contains 500 sentences extracted randomly from 308 MEDLINE abstracts. The sentences were
manually annotated by two independent experts, and the third expert finalized the annotation of
1364 relational triples. The resulting triples contain 26 out of 30 unique relations covered by the
system. Figure 6 illustrates the distribution of relation types in the gold standard corpus:
Figure 6. The distribution of semantic relations in the SemRep gold standard corpus.
The relations are not distributed evenly across the corpus; in fact, the 10 most frequent relations
account for more than 80% of all relation instances. This is a very important fact, as it implies that
most effort should be focused on learning a small number of relations, which facilitates the task in
general, since multi-class classifiers tend to perform worse as the number of classes increases.
The top 10 relations among 1364 relational instances are:
• process_of – 239 instances
• location_of – 215 instances
• part_of – 177 instances
• treats – 126 instances
• isa – 110 instances
• affects – 98 instances
• causes – 62 instances
• interacts_with – 49 instances
• uses – 41 instances
• administered_to – 34 instances
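The coverage claim can be checked directly from the counts above:

```python
# Instance counts of the top 10 relations, as listed above.
top_10 = {
    "process_of": 239, "location_of": 215, "part_of": 177, "treats": 126,
    "isa": 110, "affects": 98, "causes": 62, "interacts_with": 49,
    "uses": 41, "administered_to": 34,
}
total_instances = 1364  # relational triples in the gold standard corpus

coverage = sum(top_10.values()) / total_instances
print(round(coverage, 3))  # well above the 80% mark
```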
The gold standard corpus heavily influenced the final choice of patterns for the relation extraction
component. It should be kept in mind that when we run SemRep over the gold standard, the
performance of SemRep is prone to overfitting.
68
6.2 Experiments

For the evaluation of the supervised relation extraction method we used the gold standard corpus of
SemRep. The relational triples were taken from the corpus as they are: the concepts and their
semantic types taken from the UMLS Semantic Network, as well as the relational strings and, most
importantly, the relation labels assigned manually by the domain experts. The semantic types of the
argument concepts are used directly as features, and the relational strings are encoded into
character trigrams – the lexical features of the dataset.
6.2.1 Results
The resulting dataset was used for training and testing of the non-taxonomic relation SVM classifier
using 10-fold cross-validation. The tests were run for the top 5 relations (process_of, location_of,
part_of, treats, isa), the top 10 relations (adding affects, causes, interacts_with, uses and
administered_to) and all 26 relations. Table 11 presents the performance results as well as the
absolute and relative sizes of the three datasets:
                   F-measure   Size
top 5 relations    94%         860  (63.4%)
top 10 relations   89.1%       1144 (84.3%)
all relations      82.7%       1357 (100%)
Table 11. The performance of multi-class relational classifier across three different datasets. The size of each
dataset is specified by the absolute number of instances and by the percentage of instances covered by the
respective set of relations.
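As a side note, the 10-fold partitioning can be illustrated with a minimal stdlib sketch that splits the 1357 instances into near-equal folds (the actual experiments presumably relied on a library implementation of cross-validation; the function name is made up):

```python
# Round-robin assignment of instance indices to k near-equal folds.
def kfold_indices(n, k=10):
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

folds = kfold_indices(1357)
print([len(fold) for fold in folds])  # fold sizes differ by at most one
```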
The performance of the classifier decreases with the number of unique relations it seeks to model.
This tendency is typical for multi-class classification: the more classes are learnt, the more difficult
the task becomes. One reason for this behavior is that feature values start to overlap between
classes when the number of classes is considerably large, e.g. 26 relations in our case.
The F-measure for the full set of relations is not very high: only 82.7%. However, as shown in
Chapter 6.1.2, the relations are not equally distributed across the corpus, and learning only a subset
of them can be performed with considerably higher performance while still covering the majority of
all relation occurrences in texts. In particular, the 94% F-measure for the top 5 relations is
comparable with the state of the art of biomedical relation extraction (more on this in Chapter 6.2.3).
6.2.2 Improvement of the classification
The performance rates of the current experiment improve considerably on the results that we
achieved while learning SNOMED CT relations (see Chapter 4). The classification of 26 SemRep
relations is approximately 7% more accurate than the 3-class learning of SNOMED CT relations,
but the improvement is especially striking when we compare the latter with the classification of the
top 5 most frequent SemRep relations (the numbers are rounded):
Before: 424 instances, top 3 relations, 75%
After: 860 instances, top 5 relations, 94%
1144 instances, top 10 relations, 89%
1357 instances, all 26 relations, 83%
The increase in performance by almost 20% makes it extremely interesting to trace the origins of
such improvement. In particular, we would like to know whether the two main modifications of the
experiments’ setup, i.e. concept type features and the use of consistently modeled resources, are
responsible for the improvement. We answer this question experimentally.
6.2.2.1 Are concept types important?
To determine the importance of concept type features, we removed the lexical features from the
instance representation while learning the SemRep relations and examined the performance of
concept types as the sole source of information about the instances. We compared the use of
concept types with the use of the full feature set as well as with the baseline. We define the baseline
as the permanent choice of the majority class: the relation process_of is the most frequent one in the
dataset, it occurs 239 times, so it appears to be the most probable relation for every single instance
if no other information about the instance is given. By always choosing process_of as the class label
of an instance, we perform baseline learning. Here are the results of all the three settings:
                         top 5   top 10   all relations
ngrams + concept types   94%     89.1%    82.7%
only concept types       93.5%   79.2%    65.5%
baseline                 27.8%   20.9%    17.6%
Table 12. Classification of SemRep relations using different sets of features.
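The majority-class baseline can be sketched as follows; for brevity only three relations are included here, and plain accuracy rather than F-measure is computed:

```python
from collections import Counter

# Gold labels for a toy subset of the corpus (counts as reported above).
gold = ["process_of"] * 239 + ["location_of"] * 215 + ["part_of"] * 177

# Baseline: always predict the most frequent relation.
majority = Counter(gold).most_common(1)[0][0]
accuracy = sum(label == majority for label in gold) / len(gold)
print(majority, round(accuracy, 3))
```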
The results are somewhat mixed. On the one hand, concept types appear to be useful features for the
relation classification task, since on their own they yield a performance that is much higher than the
baseline. This is expected when the SemRep gold standard corpus is used, since many SemRep
relations are used in a restricted sense and can take only certain classes of concepts as arguments.
On the other hand, the value of concept type features decreases as the number of distinct relations
grows. The reason is the following: the semantic types of the two concepts that are relational
arguments form a pattern typical for a particular relation. When the number of relations is small, it
is highly probable that such patterns do not overlap across relations, given a significant number of
semantic types (e.g., 133 semantic types in the UMLS Semantic Network). In this case machine
learning boils down to simple pattern matching. As new relations are added to the dataset, the
patterns begin to overlap, fully or partially, and the learning becomes more complicated.
From the importance of concept types for the classification we can also deduce the importance of
the lexical features (ngrams) that were left out. While the influence of concept types can be seen by
comparing rows two and three of Table 12, the influence of ngrams is the difference between rows
one and two. Ngrams start playing an important role in the classification when the relation set
grows and concept types alone become insufficient to determine the relation types. In fact, there is a
trade-off between the two types of features: when one feature type starts passing a weaker signal to
the classifier, it is compensated by the other, and vice versa.
6.2.2.2 Is consistent modeling important?
In order to support the use of resources with similar domain modeling, we add concept type
features to the dataset of SNOMED CT relations, using two different sources of semantic types.
First, we use SNOMED CT top concepts as semantic types, thus keeping the innate modeling of the
ontology. Second, we use UMLS semantic types, linking them to SNOMED CT concepts via the
Metathesaurus:
Baseline (only ngrams) – 75%
SNOMED CT types + ngrams – 99.1%
UMLS types + ngrams – 73.9%
Semantic types of the same nature as the relations to be classified boosted the performance to
almost 100%, which means that the constraints imposed by the types unambiguously distinguish the
three relations Finding Site, Associated Morphology and Causative Agent. However, such
constraints were formed successfully only by types from the same ontology. In contrast, the UMLS
semantic types were not useful: not only did they not increase the performance, they even slightly
deteriorated it, acting as noise for the classifier.
To sum up, both concept types and consistent modeling are essential for the successful relation
classification. It should be kept in mind that the modeling should coincide for the textual corpus, for
the source of relations and for the source of concept types.
6.2.3 Comparison with SemRep
The final step of the evaluation is to compare the performance of our learning method with that of
SemRep. The comparison is very insightful, since:
a) SemRep is the official relation extraction system of the National Library of Medicine and
can be considered the state of the art;
b) it is a rule-based system, whereas our method uses machine learning. ML methods have
several major advantages over rule-based systems, which we summarize below, and if a ML
system can perform at least as well as the rule-based one, this can be considered a gain in
performance.
                   my method   SemRep
top 5 relations    94%         95.7%
top 10 relations   89.1%       94.8%
all relations      82.7%       94.1%
Table 13. Comparison of F-measure rates of SemRep and my learning method.
Table 13 shows that SemRep tends to classify relations better than our method. Only for the top 5
relations does our method perform on a par with SemRep, but this can already be considered a
success, since these relations cover most of the corpus.
However, the level of top 10 relations is the most important one: it demonstrates that we are able to
perform the classification with a very high performance score of almost 90% while still handling
the absolute majority of relational instances and exploiting the computational advantages of our
approach. We argue further that the system's tendency to show high performance on core sets of
relations will be preserved for other relation sets. The claim is based on the
observation that natural language relations have a Pareto-like distribution: the minority of relational
elements (verbs, predicates, relations) account for the majority of the occurrences of these elements.
On the lexical level the observation is backed by the distribution of verbs and verbal forms in
speech and texts. On the semantic level it is illustrated by the distribution of relations, as in the
SemMedDB repository of relational instances.
The computational advantages of our method are:
- the system is scalable to any number and any set of relations, whereas SemRep and
analogous systems are relation-dependent;
- the training phase takes on the order of minutes, given an annotated corpus, while systems
based on hand-crafted rules require months to construct relevant patterns.
At this point one could notice an important flaw in the argumentation: the proposed system is
indeed fast if a set of annotated examples already exists. But what if the training corpus is not
available? Then the time dedicated to the acquisition of such a corpus is excluded from the training
phase, and the temporal comparison with SemRep is not really valid, because for the latter the
pattern construction phase is exactly the corpus construction. This is true indeed, and to avoid such
a trap in the reasoning, one needs to develop a way of automatically acquiring training examples.
The structure of the learning process described above sets the scene for unsupervised relation
extraction, which we introduce in the next chapter.
7. Unsupervised Relation Extraction

Our first experiments were based on the idea that we can use the formal resources that already exist:
if we automatically align formal definitions from an OWL encoded ontology with the textual data,
the resulting classification models can then be applied to new texts, and the domain will be easily
formalized (see Chapter 4). However, we encountered the problem of modeling: the way knowledge
is modeled in fully formalized, machine-oriented resources is incompatible with the way knowledge
is encoded in natural language texts (see Chapter 5.4). We therefore switched to relations that were
designed for human usage. This, however, meant that in order to train a relation classifier one needs
a corpus manually annotated with relation instances, since the texts can no longer be aligned with
existing formalizations. While there are corpora that contain relational annotations (e.g. the SemRep
Gold Standard corpus), they are not scalable: they are confined to a specific set of relations. In order
to expand or modify the set of relations that we seek to model,
the process of manual annotation has to be started anew. This is a classical bottleneck of supervised
machine learning methods. Hence, an unsupervised relation extraction could be of great value for
the formal definition generation task and could have a huge impact on the development of the
biomedical domain. This chapter gives an overview of a novel methodology that selects relations
that are relevant for the domain and extracts the instances of these relations in an unsupervised
manner, using only the terminology of relevant concepts. The methodology is highly applicable to
the biomedical domain, since terminology bases and taxonomies for biomedicine and its various
branches have already been constructed. However, the method could be generalized further to other
domains and tasks.
7.1 From relation classification to unsupervised relation clustering

In this section we are going to illustrate how the currently supervised approach of relation labeling
can be transformed into an unsupervised one. Both types of features (i.e. lexical features and
concept types) will continue playing the crucial role for the learning. In addition, lexical features
will be used in the core component of the unsupervised approach as the input for the semantic
processing.
1) We started with the baseline approach, which includes the annotation of concepts and the
supervised labeling of the relations. As discussed in Chapter 5.2, there is ongoing
research on the automatic annotation of texts with biomedical concepts, so in what follows we consider
the annotation step to be automatic. The labeling step can be done either manually or by aligning
texts with formalized resources.
Necessary resources: a concept taxonomy; a set of existing relations.
term A – relational string – term B
concept A – UMLS relation – concept B
2) Then we integrated concept type features into the representation of relational instances. The
concept types are defined in the same resource that is used for the annotation, be it an ontology or a
metathesaurus. Thus, the acquisition of concept types from concepts is also an automatic step. At
this point we are still dependent on the set of relations. How can we get away from them?
Necessary resources: a concept taxonomy; a set of existing relations.
term A – relational string – term B
concept A – relational string – concept B
concept type A – UMLS relation – concept type B
3) We abandon the UMLS set of relations and start extracting verbs and verbal forms from the
relational strings. The verbs are normalized to their lemmas. The instances that do not contain a
verbal form in their strings are put aside for a moment.
The motivation behind this step is that “verbs are the primary vehicle for describing events and
expressing relations between entities” [Chklovski et al. 2004]. The verbs can give us an intuition
about the relation that is encoded in the string. However, verbs from a single instance do not give
us much information about the relation: we know the vocabulary meaning of a verb, but we have no
information whatsoever about its position in the system of relations we aim to extract. For that we
need a general picture of all verbs from all instances.
Necessary resources: a concept taxonomy.
term A – relational string – term B
concept A – relational string – concept B
concept type A – verb – concept type B
4) We collect and lemmatize verbs from all relational instances and group them together according
to their semantics into n clusters, each representing a relation that is relevant for the given corpus,
i.e. given domain. The information about the meaning of verbs is taken from a semantic network
like WordNet [Miller 1995]. WordNet is a general-domain resource, thus the method can be applied
to any domain as long as we have a concept taxonomy for it.
The semantic clustering of verbs gives us the set of relations. What remains is to label the
instances with the new relations. For instances that contain a verb in their relational string the step
is straightforward: we label them with the cluster to which the verb is assigned. The rest of the
instances are classified using a machine learning algorithm that is trained on the labeled instances,
using concept types and n-grams as features.
There are two non-trivial cases that should be handled carefully:
a) a verb has more than one meaning
Many English verbs are highly ambiguous, and this is particularly true for the most frequent
verbs. However, the domain sets semantic restrictions on the verbs, and they tend to be used
in far fewer senses than in general-purpose texts. For now we assume that
every verb is assigned to exactly one cluster. Relaxations of this assumption are left for
future work.
b) a relational string contains more than one verb
The second assumption that we make for the current formulation of the method is that the
classification of instances is single-label: every instance is assigned exactly one relation. In
case verbs from different clusters are present in the relational string, several solutions are
possible: take the more frequent verb; take the verb that corresponds to the more frequent
relation; take the verb that is located closer to the right end of the string (see Chapter 5.1), etc.
The most effective strategy should be determined empirically.
Necessary resources: a concept hierarchy; a semantic network.
term A – relational string – term B
concept A – relational string – concept B
concept type A – verb – concept type B
Thus, we extract relational instances in an unsupervised way. We propose a three-step pipeline. Firstly,
we induce relevant relation types using the semantics of the occurring verbs. Secondly, we label the
instances that contain verbal forms with the corresponding clusters. Thirdly, we perform self-supervised
learning, using the already classified instances as training data and running the model on the
remaining instances in a bootstrap manner.
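The labeling step of this pipeline can be sketched in a few lines of Python. The verb clusters and their labels below are hypothetical stand-ins for the output of the semantic clustering step, and plain lowercasing stands in for a real lemmatizer:

```python
# Hypothetical verb clusters, standing in for the output of the semantic
# clustering step (step 1 of the pipeline).
VERB_CLUSTERS = {
    "causes": {"cause", "induce", "produce"},
    "inhibits": {"limit", "inhibit", "reduce", "regulate"},
}

def label_instance(relational_string, lemmatize=str.lower):
    """Step 2: label an instance by the cluster of the first clustered verb
    in its relational string; None means the instance is deferred to the
    self-supervised classifier of step 3 (not sketched here).
    NOTE: str.lower is a toy stand-in for a real lemmatizer."""
    for token in relational_string.split():
        lemma = lemmatize(token)
        for cluster_label, verbs in VERB_CLUSTERS.items():
            if lemma in verbs:
                return cluster_label
    return None

direct = label_instance("may induce apoptosis in")   # labeled immediately
deferred = label_instance("is a marker of")          # left for the classifier
```

A real implementation would lemmatize inflected forms (e.g. "induces" to "induce") and resolve the tie-breaking cases discussed under b) above when verbs from different clusters co-occur.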
7.2 Relation construction via semantic clustering

Having laid out the unsupervised relation extraction pipeline, we now give a more
elaborate motivation and background for the approach, as well as the details of its
implementation. In other words, we will answer two questions: why should it work, and how
should it work?
7.2.1 Semantic clustering: assumptions and use cases
The idea of verbs as the main carriers of relational meaning is well established. Many relation
extraction systems, both for the general and for the biomedical domain, extract relational instances
from simple Subject – Predicate – Object structures, where the predicate corresponds to the relation
and is often represented as a verb.
[Baclawski et al. 2005] argue that concepts and relations are typically represented in texts by nouns
and verbs, respectively. [Schulte im Walde 2000] corroborates this idea, stating that verbs
usually represent actions or relations between concepts, which is why semantic clustering of verbs
tends to help extract instances of a specific relation.
These observations constitute our motivation to take verbs as the main linguistic items that convey
relationship semantics and to derive relations by clustering the verbs' meanings.
Further analysis of the related literature brought us to a work by [Coulet et al. 2010] which is also
based on the idea of mapping “raw” relational instances that share common semantics to a smaller
set of “normalized” relations. The authors admit that semantic normalization may involve a certain
degree of simplification and loss of information; however, simplified modeling may in turn
enable more robust reasoning and other types of automated processing tasks.
In [Coulet et al. 2010] the authors manually identify the most common relations in the domain of
pharmacogenomics and organize them hierarchically. They use two separate corpora, one for the
construction of the backbone ontology and another one for the extraction of relational instances
with respect to this ontology. In the first corpus all the terms from relational strings are collected,
lemmatized and grouped into sets based on synonymy, again manually. The underlying idea is that
the same relation (ontology role) may be expressed in the text in multiple ways by means of
synonyms or near-synonyms. The resulting relations are then used to label the new instances
extracted from the second corpus. The correspondence between the raw and the normalized
relations is many-to-one.
The ontology of pharmacogenomics relations has been created from the top 200 most frequent
relation types (lemmatized raw relations). The full list of relation types is not available, but the
authors present the top 30 most frequent relation types. Notably, 29 of them are verbs, which backs
our idea of using verbal forms to induce relations. The top relation types are: associate, increase,
inhibit, induce, metabolize, involve, reduce, catalyze, cause, affect, decrease, show, express, relate,
use, correlate, influence, determine etc.
The number of relation types is confined to 200 so that manual ontology construction is feasible
(the reported construction time is 4 hours). However, it was experimentally shown that these 200
types cover 80% of all relational instances. This supports our strategy of focusing on the core
relations.
The resulting ontology contains 76 relations, obtained by manually merging synonymous types
together (e.g. associate and relate). Thus, the average number of synonymous relation types per
relation is 2.63. The top 15 relations are:
• associated_with
• demonstrates
• increases
• reduces
• studies
• inhibits
• influences
• causes
• includes
• metabolizes
• uses
• induces
• produces
• affects
• determines
The approach by [Coulet et al. 2010] has two important underlying principles:
− relational strings with similar meaning can be grouped together and directly transformed
into relations;
− only the statistically significant relations should be considered first. The statistical co-occurrence
of biomedical concepts was shown to correlate with manually defined relations in another
work, by [Liu et al. 2012].
The beauty of the approach is that the relations that are relevant for the domain emerge from the
textual corpus; the definition of relations is data-driven. However, the implementation is completely
manual, which not only makes the approach unscalable, but also contradicts the idea of being
data-driven. The key step of the approach is to reduce synonymous relation instances to the
underlying conceptual relations. Prior steps, i.e. relational string extraction and lemmatization, are
automatic. The step of labeling instances with newly defined relations is automatic as well. What
remains to be automated is the semantic grouping of relational types (without loss of generality we
can think of them as verbs).
We propose a solution to this problem relying on the fact that the semantic similarity between two
terms can be calculated automatically. Hence, the relations can be formed in an unsupervised way.
We extract the verbs occurring in the relational strings generated by the system and group the verbs
semantically, each verb cluster constituting a relation.
7.2.2 Semantic clustering of lexical elements
In order to cluster verbs according to their semantics, we need the following components:
1) the source of semantic information about the verbs
WordNet [Miller 1995] is a semantic network for the English language. The nodes of WordNet are
called synsets; they represent fine-grained semantic units and are populated with synonymous or
near-synonymous meanings of words. WordNet contains 117,659 synsets covering 155,287 words
and 206,941 word-sense pairs33. Synsets are organized into hierarchies by an ISA relation. Synsets
of particular parts of speech also have specific semantic relations. Semantic relations for verbs are:
• hypernymy: the ISA relation between two verbs holds if one verb describes a more general
class of actions, e.g. to see is a hypernym of to perceive;
• troponymy: troponyms specify a particular manner in which the action is performed, e.g. to rush
is a troponym of to run;
• entailment: one action entails another if the former is not possible without the latter,
e.g. to snore entails to sleep;
• coordinate terms: two verbs are coordinated if they share a common hypernym, e.g. to
lisp and to yell are coordinated.
WordNet can be queried with a specific word, returning all synsets that contain at least one meaning
of the word. It is the biggest existing semantic network; it is general-domain and has very broad
coverage. WordNet is thus a perfect match for the task of measuring semantic similarities between
lexical items. In fact, considerable research has been carried out on how the graph structure of
WordNet can be used to determine how (dis-)similar two words are. Semantic similarities have
been used in various applications, such as word sense disambiguation [Li et al. 1995], text
classification and clustering [Tsatsaronis et al. 2009], information retrieval etc.
2) the measure of semantic similarity
Similarity measures utilize both hierarchical and non-hierarchical links in the semantic network to
determine how far the two meanings or lexemes are from each other and to convert this distance
into a similarity score. The simplest measure is the path length between the two nodes in the
network. The similarity score is then inversely proportional to the number of nodes in the shortest
path between the nodes. A more sophisticated measure, Wu & Palmer, takes into account the depth of the
two nodes in the network taxonomy together with the depth of their least common subsumer. The Resnik
and Lin measures use the notion of information content (IC). The information content of
a node is calculated either from the frequency statistics of the node elements in a background
corpus or as a function of the number of hyponyms of the node [Seco et al. 2004]. Other well-known
similarity measures are Leacock & Chodorow, Jiang & Conrath, Adapted Lesk, Hirst & St-Onge etc.
33 http://en.wikipedia.org/wiki/WordNet
All the similarity measures mentioned above are implemented in a Perl package
WordNet::Similarity34 [Pedersen et al. 2004]. It is a very well-known module of semantic similarity
and relatedness measures and has extensions to other programming languages and resources
(UMLS::Similarity). I used the Java re-implementation of WordNet::Similarity, the WS4J package35,
where all the measures are defined the same way as in the original package. I also used the
wordsimilarity package36.
After a series of preliminary tests that we ran on different pairs of words using the Web interface of
the WordNet::Similarity package37, we found that the Lin measure [Lin 1998] performs very well
compared to the other measures (for polysemous verbs we consider the pair of synsets that
produces the highest Lin score). Hence, in the current setting of the verb clustering procedure we use
Lin as the semantic measure of choice. Our results are in agreement with previous studies of
various semantic measures, which reported Lin scores to have one of the highest correlations with
human judgments [Seco et al. 2004]. The Lin similarity of two synsets is calculated as follows:
Lin(synset1, synset2) = 2*IC(lcs) / (IC(synset1) + IC(synset2)) ,
where IC is the information content score and lcs is the least common subsumer synset of the input
synsets. The WordNet::Similarity implementation of the IC uses the values that were precomputed
over a number of corpora, including the British National Corpus38 and the Brown Corpus39. The
resulting Lin values lie in the range [0,1], with higher scores corresponding to greater similarity
between the synsets.
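As an illustration, the Lin score can be computed directly from corpus probabilities. The probability values below are invented for the example, not taken from the actual precomputed IC files:

```python
import math

def ic(p):
    """Information content of a synset with corpus probability p: IC = -log p."""
    return -math.log(p)

def lin(p1, p2, p_lcs):
    """Lin similarity from the probabilities of two synsets and of their
    least common subsumer: 2*IC(lcs) / (IC(s1) + IC(s2))."""
    return 2 * ic(p_lcs) / (ic(p1) + ic(p2))

# Two rare synsets under a ten-times-more-frequent subsumer (toy numbers).
score = lin(p1=0.001, p2=0.002, p_lcs=0.01)   # ≈ 0.70
```

Identical synsets give a score of 1 (the least common subsumer is the synset itself), and the score decays toward 0 as the subsumer becomes more general, since IC(lcs) approaches 0.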
3) the clustering algorithm
The final component is the clustering algorithm (CA) that groups verbs into relational clusters based
on the semantic similarity of verbs which we calculate over WordNet using the Lin measure.
The main principle of cluster analysis is to group similar elements such that the elements in one
cluster are more similar to each other than to the elements of other clusters. Clustering differs from
classification in that the elements are not assigned gold-standard labels; moreover, the algorithm is
not aware of the categories to be learned. The CA does not learn from the data; it explores the data.
There is a considerable number of different clustering algorithms. They can be divided into two
groups: flat vs. hierarchical algorithms. A flat CA simply groups elements into clusters, but the
relations between the clusters are unknown [Manning et al. 1999]. A hierarchical CA builds up bigger
clusters on top of the smaller ones, organizing them into an ordered structure. Full hierarchical
clustering produces a tree where the leaves are individual elements and every non-terminal node is
an intermediate cluster. The root of the tree is the final cluster comprising all elements of the dataset.
If one is not interested in getting the full hierarchy of clusters, a stopping condition can be
formulated upon which the cluster analysis terminates. Hierarchical clustering can be performed
bottom-up starting from single-element clusters (agglomerative) or top-down starting from a single
dataset cluster (divisive).
34 http://wn-similarity.sourceforge.net/
35 http://code.google.com/p/ws4j/
36 http://code.google.com/p/wordsimilarity/
37 http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
38 http://www.natcorp.ox.ac.uk/
39 http://icame.uib.no/brown/bcm.html
In principle, both flat and hierarchical CA can be used in the task of extracting relations from verb
occurrences. The main constraint is that the algorithm should not require the final number of
clusters as an input parameter, since one part of our task is to determine the appropriate number of
relations for a domain.
7.2.3 The DBSCAN algorithm and its hierarchical extension
One of the algorithms that fits this constraint is DBSCAN [Ester et al. 1996]. It is a powerful
density-based algorithm that is among the most used and cited clustering algorithms, according to
Microsoft Academic Search40. Density-based algorithms define a cluster to be a region of a data
space which is characterized by high density of objects. Clusters are separated from each other by
regions with low object density [Kriegel et al. 2011]. DBSCAN assumes the clusters to be of
arbitrary shape. It also allows points to be noisy, i.e. not part of any cluster.
DBSCAN groups elements of the dataset, starting from an arbitrary element and merging into a
cluster elements that a) lie not further than a certain distance from each other and b) have a
certain density of other elements around them. The key notions of the DBSCAN algorithm are
density reachability, direct density reachability and density connection:
• a is directly density reachable from b if a lies within distance eps of b and b has
at least minPts other elements in its eps-neighborhood;
• a is density reachable from b if there exists a sequence of elements b, c1, ..., cn, a that forms a
chain of directly density reachable elements;
• a is density connected to b if there exists an element c from which both a and b are density
reachable.
Thus, all elements that are density connected belong to the same cluster. A simplified version of
the algorithm's pseudocode, as compared to [Ester et al. 1996], is given on the corresponding
Wikipedia page41. In order to determine which elements are density connected, DBSCAN requires
two input parameters: the maximal distance eps within which to search for density-reachable
neighbors, and the minimal number of elements minPts needed to continue the cluster expansion.
For the verb clustering task eps can be thought of as the minimal similarity score between two verbs.
The minPts parameter is trickier to set.
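A minimal sketch of classical DBSCAN, adapted to our setting: the input is a symmetric matrix of pairwise verb similarities, and eps is read as the minimal similarity required for two verbs to count as neighbors. The matrix values below are invented for illustration:

```python
def neighbors(i, sim, eps):
    """Indices of elements whose similarity to element i is at least eps."""
    return [j for j in range(len(sim)) if j != i and sim[i][j] >= eps]

def dbscan(sim, eps, min_pts):
    """Classical DBSCAN over a symmetric similarity matrix.
    Returns one cluster id per element; -1 marks noise."""
    labels = [None] * len(sim)
    cluster = -1
    for i in range(len(sim)):
        if labels[i] is not None:
            continue
        if len(neighbors(i, sim, eps)) < min_pts:
            labels[i] = -1               # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = neighbors(i, sim, eps)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster      # former noise claimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j, sim, eps)) >= min_pts:   # j is a core point: expand
                seeds.extend(neighbors(j, sim, eps))
    return labels

# Toy similarities: two tight verb pairs and one isolated verb.
sim = [[1.0, 0.9, 0.1, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.9, 0.1],
       [0.1, 0.1, 0.9, 1.0, 0.1],
       [0.1, 0.1, 0.1, 0.1, 1.0]]
labels = dbscan(sim, eps=0.5, min_pts=1)   # → [0, 0, 1, 1, -1]
```

With minPts = 1 the two tight pairs form clusters 0 and 1 while the isolated verb is labeled as noise; raising minPts to 2 leaves every element without enough neighbors, so all are marked as noise.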
DBSCAN appears to be a highly suitable algorithm for the task of verb clustering and subsequent
relation extraction: it recognizes noisy data, does not constrain the shape of the clusters and does
not require the number of clusters to be given as input. However, two things could
possibly be improved, namely:
− a hierarchical version of DBSCAN would give us insight into the taxonomy of new
relations;
− the influence of the minPts and eps input parameters could be reduced.
40 http://academic.research.microsoft.com
41 http://en.wikipedia.org/wiki/DBSCAN
We have implemented a hierarchical version of DBSCAN, taking the original formulation as a
starting point (we have used our own implementation of classical DBSCAN). The updated
algorithm starts with an arbitrarily low eps, thus splitting the verbs into general semantic groups.
Then for every group the algorithm is called iteratively with an increased eps, and more fine-grained
clusters emerge at every iteration. The algorithm stops upon convergence, i.e. when the clusters
are no longer split into subclusters. The current implementation also provides the functionality for
the algorithm to stop when a predefined number of iterations has been completed.
Below is the sketch of the new version of DBSCAN:
1. eps is set to an arbitrarily small number; e.g. for the Lin semantic measure, scores below
0.4 reflect very distant relatedness.
2. minPts is set to 1. The value of 1 signifies that there is at least one neighbor for a given
element, and an attempt to exclude it from the “noise” category and to assign it to a cluster
is motivated.
3. Classical DBSCAN is run over the full dataset; some elements are grouped into broad
clusters, the remaining elements form the “noise” cluster.
4. For every cluster we calculate a new value for eps by taking into account the smallest and
the biggest distance between a pair of cluster members:
eps := (minDistance + maxDistance) * c
We leverage the extreme values of the range of pairwise distances inside a cluster with a
coefficient c, which is currently set to 0.66. Through the sequence of iterative calls of the
algorithm we gradually increase eps and the subclusters reach a homogeneous density such
that further splitting is not performed.
5. The procedure is iterated until a stopping condition is met.
The two key features of the new algorithm are the iterative calls of the original algorithm and the
dynamic update of eps. The influence of the original input parameters on the cluster formation is
minimized, since the clustering is now more dependent on the update coefficient c. Two input
parameters are substituted by a single parameter. Iterative splitting of clusters yields a partial
hierarchy of resulting relational clusters.
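The sketch above can be condensed into a few lines of Python. Two simplifications, both assumptions of this sketch rather than properties of the thesis implementation: since minPts = 1, a single DBSCAN pass reduces to connected components of the graph linking pairs whose distance is within eps, and eps is stated here in distance terms (e.g. 1 − Lin similarity), under which the update rule eps := (minDistance + maxDistance) * c leaves homogeneous clusters intact while splitting heterogeneous ones. The verb distances are invented:

```python
import itertools

C = 0.66  # eps update coefficient (step 4)

def components(items, dist, eps):
    """One DBSCAN pass with minPts = 1: connected components of the
    'distance <= eps' graph over the given items."""
    comps, seen = [], set()
    for start in items:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(w for w in items
                         if w not in comp and dist[frozenset((v, w))] <= eps)
        seen |= comp
        comps.append(sorted(comp))
    return comps

def refine(items, dist, max_depth=5):
    """Recursively split a cluster, recomputing eps from its own pairwise
    distances (step 4); recursion stops when the cluster no longer splits."""
    if len(items) < 2 or max_depth == 0:
        return [sorted(items)]
    dists = [dist[frozenset(p)] for p in itertools.combinations(items, 2)]
    eps = (min(dists) + max(dists)) * C   # eps := (minDistance + maxDistance) * c
    comps = components(items, dist, eps)
    if len(comps) == 1:
        return [sorted(items)]            # converged: homogeneous density
    return [c for comp in comps for c in refine(comp, dist, max_depth - 1)]

# Toy distances (1 - similarity): two tight verb pairs plus an outlier.
verbs = ["cause", "induce", "limit", "inhibit", "use"]
dist = {}
for v, w in itertools.combinations(verbs, 2):
    if {v, w} in ({"cause", "induce"}, {"limit", "inhibit"}):
        dist[frozenset((v, w))] = 0.1
    elif "use" in (v, w):
        dist[frozenset((v, w))] = 0.9
    else:
        dist[frozenset((v, w))] = 0.8

broad = components(verbs, dist, eps=0.85)        # step 3: initial broad clusters
clusters = [c for comp in broad for c in refine(comp, dist)]
```

In this toy run the initial generous eps lumps the four related verbs into one broad cluster and isolates the outlier; the recursive pass then recomputes eps as (0.1 + 0.8) * 0.66 ≈ 0.59, which cuts the cross-pair links and leaves the two tight pairs as final clusters.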
7.3 Preliminary evaluation of the method

In this section we present an experiment on the semantic clustering of verbs that serves as a
proof of concept for the method proposed in Chapter 7.
Firstly, we constructed a gold standard corpus of textual relations. We used a corpus of 100
randomly chosen MeSH definitions in order to test the semantic clustering of verbal forms. The
definitions were manually processed by two human annotators42: all the verb occurrences were
semantically analyzed and grouped into semantically coherent sets that represent fine-grained
semantic relations. Each set was mapped to a particular relation from the UMLS Semantic
Network.
42 I would like to thank Maria Kissa for assisting me in the manual construction of the reference
clusters as a domain expert.
From all unique verbs in the corpus a set of 50 verbs that are among the most frequent in MeSH
was chosen. We ran the iterative DBSCAN using the Lin similarity measure over the selected verbs and
analyzed the resulting clusters, comparing them with the gold standard semantic groups. The
comparison is two-fold: first, we would like to evaluate how well the automatic clustering of lexical
items corresponds to the human grouping; second, we would like to demonstrate that the
unsupervised clusters correspond to existing ontological relations.
Below is the result of clustering of 50 verbs with iterative DBSCAN. Results of the intermediate
iterations are provided.
Input verbs: prevail, predominate, classify, mediate, transmit, convey, label, mark, cause, result, induce, lead, include,
consist, bind, attach, act, limit, emerge, divide, attenuate, confuse, infect, split, prove, regulate, present, catalyze, know,
identify, initiate, accomplish, block, follow, carry, distinguish, inhibit, recognize, provide, produce, see, interfere,
weaken, resemble, reduce, separate, form, express, use, alleviate.
Iteration # 0
• cause, lead, classify, convey, initiate, inhibit, carry, follow, limit, prove, identify, label, mark, reduce, include, present, distinguish, split, regulate, result, express, recognize, weaken, transmit, attenuate, separate, induce, divide, know, see, attach, bind, produce
• form, block
• prevail, predominate
• noise: provide, catalyze, alleviate, consist, emerge, mediate, accomplish, resemble, use, interfere, infect, confuse, act
Iteration # 1
• use, act
• noise: accomplish, provide, consist, resemble, emerge, interfere, mediate, catalyze, infect, confuse, alleviate
• result, lead
• express, convey, initiate, carry, transmit, include
• classify, separate, divide, label, mark, distinguish, split
• recognize, know
• attach, bind
• form, block
• prevail, predominate
• noise: cause, weaken, inhibit, follow, limit, attenuate, prove, induce, identify, reduce, present, see, produce, regulate
Iteration # 2
• resemble, consist
• use, act
• noise: accomplish, mediate, provide, infect, catalyze, confuse, alleviate, emerge, interfere
• limit, reduce, inhibit, regulate
• attenuate, weaken
• identify, see
• result, lead
• transmit, convey, initiate, express, carry, include
• classify, separate, distinguish, split, mark, label, divide
• recognize, know
• attach, bind
• form, block
• prevail, predominate
• noise: prove, cause, induce, present, produce, follow
Iteration # 3
• mediate, interfere
• resemble, consist
• use, act
• noise: accomplish, provide, catalyze, infect, confuse, alleviate, emerge
• cause, induce, produce
• limit, inhibit, reduce, regulate
• attenuate, weaken
• identify, see
• result, lead
• transmit, convey, initiate, express, carry, include
• classify, separate, distinguish, divide, label, mark, split
• recognize, know
• attach, bind
• form, block
• prevail, predominate
• noise: present, prove, follow
Now let us compare the clusters from the last iteration of the algorithm with corresponding manual
clusters. Table 14 aligns the verbs that were grouped automatically with the gold standard verb
clusters (verbs from the noise category are omitted). Underlined verbs from the first column are
those that do not fit the corresponding manual cluster.
Automatic clusters | Manual clusters | UMLS relation
attach, bind | attach, bind, join, link | connected_to
attenuate, weaken | attenuate, reduce, weaken | affects
result, lead | account, cause, induce, afflict, lead, result | result_of
prevail, predominate | predominate, prevail, dominate | conceptually_related_to
use, act | work, act, serve, play, use (as in "used as") | uses
cause, induce, produce | account, cause, induce, afflict, lead, result | causes, produces
limit, inhibit, reduce, regulate | limit, inhibit | affects
classify, separate, distinguish, divide, label, mark, split | [mark, label] [cleave, divide, separate, differentiate, split] | analyzes
identify, see | recognize, identify, confuse, observe, see | indicates
mediate, interfere | interfere | complicates
transmit, convey, initiate, express, carry, include | [carry, transmit, mediate, convey, direct, transplant] [express] | functionally_related_to
recognize, know | know, think, study, show, prove | analyzes
form, block | prevent, block | prevents
resemble, consist | resemble, be alike | conceptually_related_to
Table 14. Automatically generated verb clusters and their counterparts in the gold standard corpus and in
the UMLS upper ontology.
The table clearly demonstrates a considerable overlap in the way verbs are grouped together by
human annotators and automatically using the semantic network relatedness. What is even more
important, the table exhibits a strong correlation between the unsupervised clusters and the existing
ontology relations. This brings us to the conclusion that the unsupervised method we proposed
has great potential for identifying relevant semantic relations from textual data and should be
explored and evaluated on a large scale in the future.
We have proposed a novel methodology for unsupervised relation extraction. It is based on the idea
of inducing domain-relevant relations from the textual occurrences of verbs and verbal forms. The
core component of semantic clustering was described in detail, giving the general outline as well as
the suggestions for the choice of resources and similarity measures. It was implemented in the Java
language and evaluated on biomedical data.
Though the approach was primarily designed for the biomedical domain, it relies on a domain-independent semantic network, and thus can be generalized to other domains and areas of knowledge.
8. Future work

Formal definition generation is a very ambitious, multi-disciplinary task with huge potential. It can
be tackled from different perspectives: logical, text mining, modeling, even philosophical. The
number of things that can be done for the task is large, and in this chapter we would like to pinpoint
some of them.
The improvements and modifications of the approach of formal definition generation can nominally
be divided into high-level changes and those that relate more to the implementation. The high-level
modifications appear as answers to the questions: what exactly do we want to formalize in a
definition? how sophisticated should the formalization be? what information is relevant for the
modeling? what information can be ignored?
All these questions can be reformulated in the following way: which formalism is suitable for the
formal definition encoding? At the current stage of the research we are using Description Logics. In
DL the concepts and the relations between them are specified explicitly, thus the core information
about the concept to be defined is easily encoded. The choice of DL is also partly motivated by the
fact that there already exist rich biomedical ontologies, like SNOMED CT, which use Description
Logic as the underlying formal language.
However, nothing prevents us from switching to other logics and formalisms, if they are a
convenient means of representing certain information. Let us look at the definition of Arthritis taken
from MeSH:
Arthritis is a form of joint disorder that results from joint inflammation. When bone surfaces
become less well protected by cartilage, bone may be exposed and damaged.
The first sentence seems to be easily translated into DL notation, as there is a direct correspondence
between lexical items of the definitions and the elements of the formula:
Arthritis is a form of joint disorder that results from joint inflammation.
Arthritis ≡ Joint_Disorder ⊓ ∃results_from.Joint_Inflammation
However, if we proceed to the second sentence, the modeling becomes less intuitive:
When bone surfaces become less well protected by cartilage, bone may be exposed and
damaged.
- should we model temporal relations (when bone surfaces…)?
- should we specifically define sequences of actions (surfaces become less protected)?
- should we capture modality (bone may be exposed)?
- how should we quantify the intensity of a relation (less well protected)?
There are no correct answers to these and other similar questions. But it is definitely an interesting
task for the future work to try different formalisms for the definition generation.
Improvements on the implementation level of the formal definition generation concern different
text mining components that can be integrated into the formalization pipeline. They have partly
been covered in the Future work section of Chapter 5.3. In particular, the extraction of unlabeled
triples from text is a crucial step in the formula construction. We have tried different approaches for
the triple extraction: the naïve extraction of strings in between the two concepts, as well as more
involved approaches like the use of dependency paths or semantic roles. Yet all of them proved to
yield poor results (in terms of precision and/or recall), and the implementation of the component is
currently using rules defined manually over syntactic parse trees. These rules could be improved.
The success of the text-to-formalism transformation relies heavily on how accurately the input text
is processed. In particular, the use of state-of-the-art annotators instead of the in-house tools would
improve the output formulas considerably. The semantic annotation would also achieve higher
recall if the anaphora resolution was performed, so that the entities that are mentioned relative to
other entities in the text are correctly linked together and form valid relational triples.
A special place in the discussion of future work and the perspectives of formal definition
generation should be given to unsupervised methods. This work sketched a scenario of
unsupervised relation extraction that uses only the concept taxonomy and a general-purpose
semantic network (WordNet). The details of the implementation are still to be defined. The scenario
is domain-independent; however, it relies on taxonomies that may not exist for some domains.
For such domains, a prior step of manual or automatic taxonomy extraction is required.
Another open question of unsupervised relation extraction is how to evaluate the relations that
emerge from the semantic clustering of verbs and verbal forms. The evaluation can obviously be
done manually by domain experts, but in that case it could be highly subjective and thus of limited
use. Another way of relation evaluation that we envision is to construct a many-to-many
mapping of new relations to existing sets of relations via their relational instances. However, this
solution requires already defined relational instances for the domain, which are not always
available. Thus, new ways of cluster evaluation should be devised.
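One way to realize the envisioned many-to-many mapping is to compare the instance sets (concept pairs) of each discovered cluster with those of each known relation, for example via Jaccard overlap, keeping every pairing above a threshold so the mapping remains many-to-many. The sketch below is an assumption about how this could look; the names, data and threshold are illustrative:

```python
def map_clusters(clusters, known, threshold=0.3):
    """Many-to-many mapping of discovered relation clusters to known
    relations via Jaccard overlap of their instance (concept-pair) sets."""
    mapping = {}
    for c_name, c_pairs in clusters.items():
        for r_name, r_pairs in known.items():
            inter = len(c_pairs & r_pairs)
            union = len(c_pairs | r_pairs)
            if union and inter / union >= threshold:
                mapping.setdefault(c_name, []).append(r_name)
    return mapping

clusters = {"cluster_1": {("virus", "fever"), ("bacteria", "anaplasmosis")}}
known = {"causes": {("virus", "fever"), ("mutation", "disease")},
         "treats": {("drug", "fever")}}
print(map_clusters(clusters, known))
# {'cluster_1': ['causes']}
```

A cluster that overlaps several known relations would simply be mapped to all of them, which is exactly the many-to-many behavior described above.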
The last point we would like to mention is the triples-to-formula step. We reformulated the task of
formal definition generation in terms of triple extraction and triple labeling, tackling it from
the text mining perspective. When the triples are extracted and assigned relations, they can be
directly integrated into an ontology or another formal knowledge base. However, if we want a
well-defined formula as the output, the triples-to-formula step has to be implemented.
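As a minimal illustration of what such a step could do, the labeled triples of a concept can be folded into a single EL-style definition, i.e. a named parent (from IS_A triples) conjoined with existential restrictions, rendered here in a Manchester-like syntax. This is only a sketch under the assumption of simple conjunctive definitions; the concept FemurFracture and the relation finding_site (echoing SNOMED's Finding site) are hypothetical examples:

```python
def triples_to_formula(concept, triples):
    """Fold labeled triples (relation, filler) for one concept into an
    EL-style definition: Concept EquivalentTo Parent and (r some B) ...
    IS_A triples become named parents; the rest become restrictions."""
    parents = [b for r, b in triples if r == "IS_A"]
    restrictions = [f"({r} some {b})" for r, b in triples if r != "IS_A"]
    right = " and ".join(parents + restrictions)
    return f"{concept} EquivalentTo {right}"

triples = [("IS_A", "Fracture"), ("finding_site", "Femur")]
print(triples_to_formula("FemurFracture", triples))
# FemurFracture EquivalentTo Fracture and (finding_site some Femur)
```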
9. Conclusions

In this work we addressed a novel problem of generating formal definitions from textual
descriptions of biomedical concepts. We performed a thorough review of existing methods, tools,
pipelines and resources that are designed for similar or related tasks (e.g. relation extraction) and
can be adapted to the formal definition extraction.
Formal definition generation is a complex task. We approached it from a text mining perspective
and split it into several consecutive steps. For every step we did the following:
- we studied the existing methodologies and adapted them to the task;
- in cases when the existing methodologies did not show promising results, we proposed and
implemented novel methods;
- we implemented every step in a flexible, component-based manner, so that the external
resources we rely on (parsers, semantic annotators, knowledge bases etc.) can be easily
interchanged;
- we evaluated every step on the relevant data sources, and pinpointed typical mistakes and
ways to eliminate them.
We focused in particular on non-taxonomic relation extraction, as expressive relations carry
the core information about the concepts to be defined. Relation extraction can be done
either in a supervised or in an unsupervised manner. For the supervised relation extraction we
performed thorough feature engineering; the final features used in the pipeline are lexical
features and semantic concept types. We showed that concept types are crucial features for
relation classification, as they significantly boosted the performance of the learner. In addition, we
analyzed the performance of various machine learning algorithms, showing that Support Vector
Machines tend to outperform the other classifiers in relation labeling.
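The supervised step can be approximated with a standard pipeline: a bag-of-words representation of the lexical features, concatenated with the semantic types of the two concepts, fed into a linear SVM. The data and the TYPE_* feature names below are illustrative, not taken from the thesis experiments:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each instance: the relation string plus the semantic types of both
# concepts, merged into one token stream for a single vectorizer.
X = ["caused by TYPE_A:bacterium TYPE_B:disease",
     "located in TYPE_A:disorder TYPE_B:body_part",
     "results from TYPE_A:bacterium TYPE_B:disease",
     "found in TYPE_A:disorder TYPE_B:body_part"]
y = ["Causative_agent", "Finding_site", "Causative_agent", "Finding_site"]

clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), LinearSVC())
clf.fit(X, y)
print(clf.predict(["arising in TYPE_A:disorder TYPE_B:body_part"]))
```

Dropping the TYPE_* tokens from such a setup typically degrades accuracy, mirroring the observation that concept types are the crucial features.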
The machine learning approach that we propose is scalable and domain-agnostic: whenever a
corpus with annotated relation instances is available, new rules and models are learnt automatically.
For a small number of distinct relations the approach proved to perform on a par with state-of-the-art
rule-based relation extraction systems such as SemRep. This implies that if the most frequent
relations are selected for learning, the resulting model can effectively extract the majority of
relation instances from texts.
Furthermore, we presented an architecture for unsupervised relation extraction. It does not
require the input corpus to be pre-annotated with relations; instead, it derives relevant relations from
the semantics of verbal forms in the corpus and annotates the corpus with relational triples using
bootstrapping. Thus it can generalize to domains other than biomedicine and can be applied to
a wide range of tasks, from domain-specific ontology generation to personal knowledge base
curation. As its core step, the unsupervised relation learning uses semantic clustering over a
general-domain semantic network (WordNet). The use of semantic networks for learning relations
from scratch is a novel technique, which we designed and implemented.
At this stage of the research, the formal definition generation pipeline can be used as a standalone
tool for concept formalization, and it can also be integrated into ontology learning tools (e.g.
Dog4Dag) as a semi-automatic assistant to domain experts.
Appendix A

The tables below contain the performance metrics for the experiments of classifying three
SNOMED CT relations (Finding site FS, Associated morphology AM and Causative agent CA)
using MeSH definitions as source texts. The details of the experiments are given in Chapter 4.
Both tables list Precision, Recall and F-measure for every relation, for every classification
algorithm, for every feature representation, as well as overall Accuracy and macro-averaged
Precision, Recall and F-measure. Results in Table A.1 are calculated using the Boolean weighting
of the features, while for Table A.2 the per-class probabilistic weighting is used.

Table A.1. Performance metrics for the unweighted feature representation: overall and per relation, per
algorithm, per representation.
Table A.2. Performance metrics for the weighted feature representation: overall and per relation, per algorithm,
per representation.
Appendix B

Extended vocabulary list for a semantic annotator that uses MeSH as the underlying ontology.
4) disorder
5) loss
6) abnormality
7) insufficiency
8) deficiency
9) presence
10) potential
11) resistance
12) phenomenon
13) physiological phenomena
14) process
15) cycle
16) phase
17) condition
18) result
19) chromosomes human pair
20) pair
21) magnoliopsida
22) dysplasia
23) sequence
24) receptor
25) duct
26) tract
27) cortex
28) DNA
29) RNA
30) nerve
31) sinus
32) body
33) system
34) region
35) structure
36) surface
37) part
38) fracture
Appendix C

Table C. Manual evaluation of the triple extraction parser. The triples are extracted from 40 randomly selected
MeSH definitions. For each definition, the extracted triples are listed as: Concept A; relation string; Concept B,
"annotated text"; evaluation (yes/no).

Prevotella nigrescens (D045242): a species of gram-negative bacteria in the family prevotellaceae.
- Prevotella nigrescens (D045242); IS_A; Gram-Negative Bacteria (D006090), "gram-negative bacteria" [yes]

Tooth, Nonvital (D019553): a tooth from which the dental pulp has been removed or is necrotic.
- Tooth, Nonvital (D019553); IS_A; Tooth (D014070), "a tooth" [yes]
- Tooth, Nonvital (D019553); "from which has been removed or is necrotic"; Dental Pulp (D003782), "the dental pulp" [yes]

Abdominal Wall (D034861): the outer margins of the abdomen, extending from the osteocartilaginous thoracic cage to the pelvis.
- Abdominal Wall (D034861); "the outer margins of"; Abdomen (D000005), "the abdomen" [yes]
- Abdominal Wall (D034861); "extending from"; Thorax (D013909), "the osteocartilaginous thoracic cage" [yes]
- Abdominal Wall (D034861); "to"; Pelvis (D010388), "the pelvis" [yes]

Thyroiditis, Subacute (D013968): spontaneously remitting inflammatory condition of the thyroid gland, characterized by fever; muscle weakness; sore throat; severe thyroid pain; and an enlarged damaged gland containing giant cells.
- Thyroiditis, Subacute (D013968); "spontaneously remitting"; Condition (my-term), "inflammatory condition" [no]
- Condition (my-term), "inflammatory condition"; "of"; Thyroid Gland (D013961), "the thyroid gland" [yes]
- Thyroiditis, Subacute (D013968); "characterized by"; Fever (D005334), "fever" [yes]
- Thyroiditis, Subacute (D013968); "characterized by"; Muscle Weakness (D018908), "muscle weakness" [yes]
- Thyroiditis, Subacute (D013968); "characterized by"; Pharyngitis (D010612), "sore throat" [yes]
- Thyroiditis, Subacute (D013968); "characterized by"; Thyroiditis (D013966), "severe thyroid pain" [yes]
- Thyroiditis, Subacute (D013968); "and an enlarged damaged gland containing"; Giant Cells (D015726), "giant cells" [no]

Classical Lissencephalies and Subcortical Band Heterotopias (D054221): disorders comprising a spectrum of brain malformations representing the paradigm of a diffuse neuronal migration disorder.
- Classical Lissencephalies and Subcortical Band Heterotopias (D054221); IS_A; Disorder (my-term), "disorders" [yes]
- Classical Lissencephalies and Subcortical Band Heterotopias (D054221); "comprising a spectrum of"; Brain (D001921), "brain malformations" [yes]
- Classical Lissencephalies and Subcortical Band Heterotopias (D054221); "representing the paradigm of"; Neuronal Migration Disorders (D054081), "a diffuse neuronal migration disorder" [yes]

Neoplastic Syndromes, Hereditary (D009386): the condition of a pattern of malignancies within a family, but not every individual's necessarily having the same neoplasm.
- Neoplastic Syndromes, Hereditary (D009386); IS_A; Condition (my-term), "the condition" [yes]
- Neoplastic Syndromes, Hereditary (D009386); "of a pattern of malignancies within a family, but not every individual's necessarily having"; Neoplasms (D009369), "the same neoplasm" [yes]

Contracture (D003286): prolonged shortening of the muscle or other soft tissue around a joint, preventing movement of the joint.
- Contracture (D003286); "prolonged shortening of"; Muscles (D009132), "the muscle" [yes]
- Contracture (D003286); "prolonged shortening of"; Tissues (D014024), "other soft tissue" [yes]
- Muscles (D009132), "the muscle"; "around"; Joints (D007596), "a joint" [yes]
- Tissues (D014024), "other soft tissue"; "around"; Joints (D007596), "a joint" [yes]
- Contracture (D003286); "preventing"; Movement (D009068), "movement" [yes]
- Movement (D009068), "movement"; "of"; Joints (D007596), "a joint" [yes]

T-Lymphocytes (D013601): lymphocytes responsible for cell-mediated immunity.
- T-Lymphocytes (D013601); IS_A; Lymphocytes (D008214), "lymphocytes" [yes]
- T-Lymphocytes (D013601); "responsible for"; Immunity, Cellular (chosen) (D007111), "cell-mediated immunity" [yes]

Critical Illness (D016638): a disease or state in which death is possible or imminent.
- Critical Illness (D016638); IS_A; Disease (D004194), "disease" [yes]
- Critical Illness (D016638); "or state in"; Death (D003643), "which death" [no]

Morning Sickness (D048968): symptoms of nausea and vomiting in pregnant women that usually occur in the morning during the first 2 to 3 months of pregnancy.
- Morning Sickness (D048968); "symptoms of"; Nausea (D009325), "nausea" [yes]
- Morning Sickness (D048968); "symptoms of"; Vomiting (D014839), "vomiting" [yes]
- Nausea (D009325), "nausea"; "in"; Pregnant Women (D037841), "pregnant women" [yes]
- Vomiting (D014839), "vomiting"; "in"; Pregnant Women (D037841), "pregnant women" [yes]
- Morning Sickness (D048968); "that usually occur in the morning during the first 2 to 3 months of"; Pregnancy (D011247), "pregnancy" [yes]

Keratoacanthoma (D007636): a benign, non-neoplastic, usually self-limiting epithelial lesion closely resembling squamous cell carcinoma clinically and histopathologically.
- Keratoacanthoma (D007636); "a benign, non-neoplastic, usually self-limiting epithelial lesion closely resembling"; Carcinoma, Squamous Cell (D002294), "squamous cell carcinoma" [yes]

DNA Breaks (D053960): interruptions in the sugar-phosphate backbone of dna.
- DNA Breaks (D053960); "interruptions in the sugar-phosphate backbone of"; DNA (my-term), "dna" [yes]

Alchemilla (D031982): a plant genus of the family rosaceae.
- Alchemilla (D031982); IS_A; Plants (D010944), "a plant genus" [yes]
- Alchemilla (D031982); IS_A; Rosaceae (D027824), "the family rosaceae" [yes]

Tremor (D014202): cyclical movement of a body part that can represent either a physiologic process or a manifestation of disease.
- Tremor (D014202); IS_A; Movement (D009068), "cyclical movement" [yes]
- Movement (D009068), "cyclical movement"; "of"; Part (my-term), "a body part" [yes]
- Tremor (D014202); "that can represent either"; Physiological Processes (D010829), "a physiologic process" [yes]
- Tremor (D014202); "or a manifestation of"; Disease (D004194), "disease" [yes]

Dermatitis, Occupational (D009783): a recurrent contact dermatitis caused by substances found in the work place.
- Dermatitis, Occupational (D009783); IS_A; Dermatitis, Contact (D003877), "a recurrent contact dermatitis" [yes]
- Dermatitis, Occupational (D009783); "caused by substances found in"; Workplace (D017132), "the work place" [yes]

Immunoglobulin Switch Region (D007134): a site located in the introns at the 5' end of each constant region segment of a immunoglobulin heavy-chain gene where recombination occur during immunoglobulin class switching.
- Immunoglobulin Switch Region (D007134); "a site located in"; Introns (D007438), "the introns" [yes]
- Immunoglobulin Switch Region (D007134); "at the 5' end of each constant region segment of"; Immunoglobulins (D007136), "a immunoglobulin heavy-chain gene" [yes]
- Immunoglobulin Switch Region (D007134); "where"; Recombination, Genetic (D011995), "recombination" [no]
- Immunoglobulin Switch Region (D007134); "occur during"; Immunoglobulin Class Switching (D017578), "immunoglobulin class switching" [no]

Astronomical Phenomena (D055580): aggregates of matter in outer space, such as stars, planets, comets, etc. and the properties and processes they undergo.
- Astronomical Phenomena (D055580); "aggregates of matter in outer space, such as stars, planets, comets, etc. and the properties and"; Process (my-term), "processes" [yes]

Codon, Nonsense (D018389): an amino acid-specifying codon that has been converted to a stop codon by mutation.
- Codon, Nonsense (D018389); IS_A; Codon (D003062), "an amino acid-specifying codon" [yes]
- Codon, Nonsense (D018389); "that has been converted to"; Codon, Terminator (D018388), "a stop codon" [yes]
- Codon, Terminator (D018388), "a stop codon"; "by"; Mutation (D009154), "mutation" [no]

Taste Disorders (D013651): conditions characterized by an alteration in gustatory function or perception.
- Taste Disorders (D013651); IS_A; Condition (my-term), "conditions" [yes]
- Taste Disorders (D013651); "characterized by an alteration in gustatory function or"; Perception (D010465), "perception" [yes]

Dermacentor (D003870): a widely distributed genus of ticks, in the family ixodidae, including a number that infest humans and other mammals.
- Dermacentor (D003870); IS_A; Ticks (D013987), "ticks" [yes]
- Dermacentor (D003870); IS_A; Ixodidae (D026863), "the family ixodidae" [yes]
- Dermacentor (D003870); "including a number that infest"; Humans (D006801), "humans" [yes]
- Dermacentor (D003870); "including a number that infest"; Mammals (D008322), "other mammals" [yes]

Bambusa (D031723): a plant genus of the family poaceae.
- Bambusa (D031723); IS_A; Plants (D010944), "a plant genus" [yes]
- Bambusa (D031723); IS_A; Poaceae (D006109), "the family poaceae" [yes]

Anaplasma ovis (D042323): a species of gram-negative bacteria producing mild to severe anaplasmosis in sheep and goats, and mild or inapparent infections in deer and cattle.
- Anaplasma ovis (D042323); IS_A; Gram-Negative Bacteria (D006090), "gram-negative bacteria" [yes]
- Anaplasma ovis (D042323); "producing mild to"; Anaplasmosis (D000712), "severe anaplasmosis" [yes]
- Anaplasmosis (D000712), "severe anaplasmosis"; "in"; Sheep (D012756), "sheep" [yes]
- Anaplasmosis (D000712), "severe anaplasmosis"; "in"; Goats (D006041), "goats" [yes]
- Anaplasmosis (D000712), "severe anaplasmosis"; "and"; Infection (D007239), "mild or inapparent infections" [no]
- Infection (D007239), "mild or inapparent infections"; "in"; Deer (D003670), "deer" [yes]
- Infection (D007239), "mild or inapparent infections"; "in"; Cattle (D002417), "cattle" [yes]

Postmenopause (D017698): the physiological period following the menopause, the permanent cessation of the menstrual life.
- Postmenopause (D017698); "the physiological period following"; Menopause (D008593), "the menopause" [yes]

Cuphea (D031562): a plant genus of the family lythraceae.
- Cuphea (D031562); IS_A; Plants (D010944), "a plant genus" [yes]
- Cuphea (D031562); IS_A; Lythraceae (D029561), "the family lythraceae" [yes]

Postoperative Hemorrhage (D019106): hemorrhage following any surgical procedure.
- Postoperative Hemorrhage (D019106); IS_A; Hemorrhage (D006470), "hemorrhage" [yes]
- Hemorrhage (D006470), "hemorrhage"; "following"; Methods (D008722), "any surgical procedure" [yes]

Esthesioneuroblastoma, Olfactory (D018304): a malignant olfactory neuroblastoma arising from the olfactory epithelium of the superior nasal cavity and cribriform plate.
- Esthesioneuroblastoma, Olfactory (D018304); IS_A; Esthesioneuroblastoma, Olfactory (D018304), "a malignant olfactory neuroblastoma" [yes]
- Esthesioneuroblastoma, Olfactory (D018304); "arising from"; Olfactory Mucosa (D009831), "the olfactory epithelium" [yes]
- Olfactory Mucosa (D009831), "the olfactory epithelium"; "of"; Nasal Cavity (D009296), "the superior nasal cavity" [yes]
- Olfactory Mucosa (D009831), "the olfactory epithelium"; "of"; Ethmoid Bone (D005004), "cribriform plate" [yes]

Eupenicillium (D055324): a genus of endophytic, ascomycetous mold in the family trichocomaceae, order eurotiales.
- Eupenicillium (D055324); IS_A; Fungi (D005658), "mold" [yes]
- Eupenicillium (D055324); IS_A; Eurotiales (D032641), "order eurotiales" [yes]

Sandfly fever Naples virus (D029301): a species in the genus phlebovirus causing phlebotomus fever, an influenza-like illness.
- Sandfly fever Naples virus (D029301); "a species in"; Phlebovirus (D016856), "the genus phlebovirus" [yes]
- Sandfly fever Naples virus (D029301); "causing"; Phlebotomus Fever (D010217), "phlebotomus fever" [yes]

Cattle Diseases (D002418): diseases of domestic cattle of the genus bos.
- Cattle Diseases (D002418); IS_A; Disease (D004194), "diseases" [yes]
- Disease (D004194), "diseases"; "of"; Cattle (D002417), "domestic cattle" [yes]

Amanita (D000545): a genus of fungi of the family agaricaceae, order agaricales; most species are poisonous.
- Amanita (D000545); IS_A; Fungi (D005658), "fungi" [yes]
- Amanita (D000545); IS_A; Agaricales (D000363), "the family agaricaceae" [yes]
- Amanita (D000545); IS_A; Agaricales (D000363), "order agaricales" [yes]

Mechanical Phenomena (D055595): the properties and processes of materials that affect their behavior under force.
- Mechanical Phenomena (D055595); "the properties and"; Process (my-term), "processes" [no]
- Mechanical Phenomena (D055595); "of materials that affect"; Behavior (D001519), "their behavior" [yes]

Pierre Robin Syndrome (D010855): congenital malformation characterized by micrognathia, glossoptosis and cleft palate.
- Pierre Robin Syndrome (D010855); "congenital malformation characterized by micrognathia, glossoptosis and"; Cleft Palate (D002972), "cleft palate" [yes]

Viral Load (D019562): the quantity of measurable virus in a body fluid.
- Viral Load (D019562); "the quantity of"; Research Design (D012107), "measurable virus" [yes]
- Research Design (D012107), "measurable virus"; "in"; Body Fluids (D001826), "a body fluid" [yes]

Hydrastis (D039321): a plant genus of the family ranunculaceae.
- Hydrastis (D039321); IS_A; Plants (D010944), "a plant genus" [yes]
- Hydrastis (D039321); IS_A; Ranunculaceae (D029626), "the family ranunculaceae" [yes]

Encephalitis Viruses, Japanese (D018349): a subgroup of the genus flavivirus which comprises a number of viral species that are the etiologic agents of human encephalitis in many different geographical regions.
- Encephalitis Viruses, Japanese (D018349); IS_A; Flavivirus (D005416), "the genus flavivirus" [yes]
- Encephalitis Viruses, Japanese (D018349); "of"; Flavivirus (D005416), "the genus flavivirus" [no]
- Encephalitis Viruses, Japanese (D018349); "which comprises a number of"; Viruses (D014780), "viral species" [yes]
- Encephalitis Viruses, Japanese (D018349); "that are the etiologic agents of"; Encephalitis (D004660), "human encephalitis" [yes]
- Encephalitis (D004660), "human encephalitis"; "in"; Region (my-term), "many different geographical regions" [yes]

Drug Resistance, Bacterial (D024881): the ability of bacteria to resist or to become tolerant to chemotherapeutic agents, antimicrobial agents, or antibiotics.
- Drug Resistance, Bacterial (D024881); IS_A; Aptitude (D001076), "the ability" [yes]
- Aptitude (D001076), "the ability"; "of"; Bacteria (D001419), "bacteria" [yes]
- Drug Resistance, Bacterial (D024881); "to resist or to become tolerant to chemotherapeutic agents,"; Anti-Infective Agents (D000890), "antimicrobial agents" [yes]
- Drug Resistance, Bacterial (D024881); "to resist or to become tolerant to chemotherapeutic agents,"; Anti-Bacterial Agents (D000900), "antibiotics" [yes]

RNA 3' Polyadenylation Signals (D039104): sequences found near the 3' end of messenger rna that direct the cleavage and addition of multiple adenine nucleotides to the 3' end of mrna.
- RNA 3' Polyadenylation Signals (D039104); IS_A; Sequence (my-term), "sequences" [yes]
- RNA 3' Polyadenylation Signals (D039104); "found near the 3' end of messenger rna that direct the cleavage and addition of"; Adenine Nucleotides (chosen) (D000227), "multiple adenine nucleotides" [no]
- RNA 3' Polyadenylation Signals (D039104); "to the 3' end of"; RNA, Messenger (D012333), "mrna" [no]

Lung, Hyperlucent (D019568): a lung with reduced markings on its chest radiograph and increased areas of transradiancy.
- Lung, Hyperlucent (D019568); IS_A; Lung (D008168), "a lung" [yes]
- Lung, Hyperlucent (D019568); "with reduced markings on"; Thorax (D013909), "its chest radiograph" [yes]

Anterior Thalamic Nuclei (D020643): three nuclei located beneath the dorsal surface of the most rostral part of the thalamus.
- Anterior Thalamic Nuclei (D020643); "three nuclei located beneath"; Surface (my-term), "the dorsal surface" [yes]
- Surface (my-term), "the dorsal surface"; "of"; Part (my-term), "the most rostral part" [yes]
- Part (my-term), "the most rostral part"; "of"; Thalamus (D013788), "the thalamus" [yes]

Rete Testis (D012152): the network of channels formed at the termination of the straight seminiferous tubules in the mediastinum testis.
- Rete Testis (D012152); "the network of channels formed at the termination of"; Seminiferous Tubules (D012671), "the straight seminiferous tubules" [yes]
- Seminiferous Tubules (D012671), "the straight seminiferous tubules"; "in"; Mediastinum (D008482), "the mediastinum testis" [yes]
Appendix D

A list of SemRep relations and their negated forms.
• administered_to, neg_administered_to
• affects, neg_affects
• associated_with, neg_associated_with
• augments, neg_augments
• causes, neg_causes
• coexists_with, neg_coexists_with
• compared_with
• complicates, neg_complicates
• converts_to, neg_converts_to
• diagnoses, neg_diagnoses
• disrupts, neg_disrupts
• higher_than, neg_higher_than
• inhibits, neg_inhibits
• interacts_with, neg_interacts_with
• location_of, neg_location_of
• lower_than, neg_lower_than
• manifestation_of, neg_manifestation_of
• method_of, neg_method_of
• occurs_in, neg_occurs_in
• part_of, neg_part_of
• precedes, neg_precedes
• predisposes, neg_predisposes
• prevents, neg_prevents
• process_of, neg_process_of
• produces, neg_produces
• same_as
• stimulates, neg_stimulates
• than
• treats, neg_treats
• uses, neg_uses
References

E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the
5th ACM International Conference on Digital Libraries, 2000.
C. Ahlers, M. Fiszman, D. Demner-Fushman, F.-M. Lang and T. Rindflesch. Extracting semantic predications from
MEDLINE citations for pharmacogenomics. Pacific Symposium on Biocomputing, 12, pp. 209-220, 2007.
A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter and T. Salakoski. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics, 9:S2, 2008.
E. Alfonseca, K. Filippova, J.-Y. Delort, G. Garrido. Pattern learning for relation extraction with a hierarchical topic
model. In ACL'12, pp. 54-59, 2012.
A. R. Aronson. MetaMap: Mapping Text to the UMLS Metathesaurus. Bethesda, MD: NLM, NIH, DHHS (2006).
I. Augenstein, S. Pad'o and S. Rudolph. LODifier: Generating Linked Data from Unstructured
Text. In ESWC'12, pp. 210-214, 2012.
F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider. The Description Logic Handbook: Theory,
Implementation, Applications. Cambridge University Press, Cambridge, UK, 2003.
K. Baclawski, J. Cigna, M. M. Kokar, P. Mager, and B. Indurkhya. Knowledge representation and indexing using the
Unified Medical Language System. In Pacific Symposium on Biocomputing, vol. 5, pp. 490-501.
K. Baclawski and T. Niu. Ontologies for Bioinformatics (Computational Molecular Biology). The MIT Press; ISBN: 0-262-02591-4; 2005.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead and O. Etzioni. Open Information Extraction from the Web. In
IJCAI 2007.
O. Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids
Research, 32(Database issue): D267-D270, Jan 2004.
O. Bodenreider and R. Stevens. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics,
7(3):256‒274, 2006.
J. Bos. Wide-coverage semantic analysis with Boxer. In STEP, pp. 277-286, 2008.
R. Bunescu, R. Ge, R. Kate, E. Marcotte, R. Mooney, A. Ramani, Y. Wong. Comparative Experiments on Learning
Information Extractors for Proteins and their Interactions. Artificial Intelligence in Medicine 2005, 33(2):139-155.
R. Bunescu, R. Mooney. Subsequence Kernels for Relation Extraction. In Advances in Neural Information Processing
Systems 18 MIT Press; 2006:171-178.
A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr. and T. Mitchell. Toward an architecture for Never-Ending Language Learning. In AAAI 2010.
J. T. Chang, R. B. Altman. Extracting and characterizing gene-drug relationships from the literature.
Pharmacogenetics, 14(9):577-86, 2004.
T. Chklovski and P. Pantel. VerbOcean:Mining the Web for Fine-Grained Semantic Verb Relations. In EMNLP 2004.
H.-W. Chun, Y. Tsuruoka, J.-D. Kim, R. Shiba, N. Nagata, T. Hishiki, J. Tsujii. Extraction of Gene-Disease Relations
from Medline Using Domain Dictionaries and Machine Learning. Pacific Symposium on Biocomputing, 2006.
J. Cimino, G. Barnett. Automatic knowledge acquisition from MEDLINE. Methods of Information in Medicine, 14:120‒
130, 1993.
A. Coulet, N. H. Shah, Y. Garten, M. Musen, and R. B. Altman. Using text to build semantic networks for
pharmacogenomics. Journal of Biomedical Informatics, 43(6):1009‒19, 2010.
M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 77-86, 1999.
J. Curran, S. Clark and J. Bos. Linguistically motivated large-scale NLP with C&C and Boxer. In ACL, pp. 33-36, 2007.
M. Y. Dahab, H. A. Hassan, A. Rafea. TextOntoEx: Automatic ontology construction from natural English text. In
Expert Systems with Applications, 34, pp. 1474-1480, 2008.
M. Dai, N. H. Shah, W. Xuan, M. A. Musen, S. J. Watson, B. D. Athey, F. Meng et al. An Efficient Solution for
Mapping Free Text to Ontology Terms. AMIA Summit on Translational Bioinformatics. San Francisco, CA 2008.
R. Delfs, A. Doms, A. Kozlenkov, M. Schroeder. GoPubMed: ontology-based literature search applied to
GeneOntology and PubMed. Proceedings of German Bioinformatics Conference; Bielefeld, Germany: LNBI Springer;
2004. pp. 169‒178.
H. Dietze. GoWeb: Semantic Search and Browsing for the Life Sciences. PhD thesis. Technische Universität Dresden,
Germany. 2010.
A. Doms. GoPubMed: Ontology-based literature search for the life sciences. PhD thesis. Technische Universität
Dresden, Germany. 2010.
M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases
with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 226–231, 1996.
P. Exner and P. Nugues. Entity extraction: from unstructured text to DBpedia RDF triples. In ISWC 2012.
G. Fabian. A Text Mining Approach to the Validation and Completion of Drug-Target-Disease Networks. Diplom
thesis. Technische Universität Dresden, Germany. 2012.
G. Fabian, T. Wächter, M. Schroeder. Extending ontologies by finding siblings using set expansion techniques.
Bioinformatics 28(12), pp. 292-300. 2012.
A. Fader, S. Soderland and O. Etzioni. Identifying Relations for Open Information Extraction. In EMNLP'11, 2011.
J. Fan, D. Ferrucci, D. Gondek, A. Kalyanpur. PRISMATIC: Inducing knowledge from a large scale lexicalized relation
resource. In NAACL HLT 2010.
T. Flati and R. Navigli. SPred: Large-scale Harvesting of Semantic Predicates. In ACL 2013.
T. Hasegawa, S. Sekine, and R. Grishman. Discovering Relations among Named Entities from Large Corpora. In ACL
2004.
D. Hovy, C. Zhang, E. Hovy, and A. Peñas. Unsupervised discovery of domain-specific knowledge from text. In HLT
'11, pp. 1466-1475, 2011.
D. Hristovski, A. Kastrin, B. Peterlin and T. Rindflesch. Combining semantic relations and DNA microarray data for
novel hypothesis generation. Linking literature, information, and knowledge for biology, pp. 53-61. Springer Berlin
Heidelberg, 2010.
D. Hristovski, T. Rindflesch and B. Peterlin. Using literature-based discovery to identify novel therapeutic approaches.
Cardiovascular & Hematological Agents in Medicinal Chemistry, 11(1), pp. 14-24, 2013.
M. Huang, X. Zhu, Y. Hao, D. G. Payan, K. Qu and M. Li. Discovering patterns to extract protein-protein interactions
from full texts. Bioinformatics 20 (18), pp. 3604‒3612, 2004.
C. Jonquet, N. H. Shah, M. A. Musen. The Open Biomedical Annotator. AMIA Summit on Translational
Bioinformatics, p. 56-60, March 2009, San Francisco, CA, USA.
H. Kilicoglu, G. Rosemblat, M. Fiszman and T. Rindflesch. Constructing a semantic predication gold standard from the
biomedical literature. BMC Bioinformatics, 12:486, 2011.
H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat and T. Rindflesch. SemMedDB: a PubMed-scale repository of
biomedical semantic predications. Bioinformatics, 28 (23), pp. 3158‒3160, 2012.
R. D. King, J. J. Rowland, W. Aubrey, M. Liakata, M. Markham, L. N. Soldatova, K. E. Whelan, A. Clare, M. Young,
A. Sparkes, S. G. Oliver, and P. Pir. The robot scientist Adam. IEEE Computer, 42(8):46‒54, 2009.
R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N.
Soldatova, A. Sparkes, K. E. Whelan, A. Clare. The Automation of Science. Science 324 (5923): 85‒89, 2009.
D. Klein and C. D. Manning. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for
Computational Linguistics, pp. 423-430, 2003.
R. Krestel, R. Witte and S. Bergler. Predicate-Argument EXtractor (PAX). In New Challenges for NLP Frameworks,
2010.
H.-P. Kriegel, P. Kroeger, J. Sander, A. Zimek. Density-based clustering. Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery 1.3, pp. 231-240, 2011.
M. Krötzsch, F. Simančík, I. Horrocks. A Description Logic Primer. CoRR abs/1201.4089. 2012.
C. Lee, C. Khoo, J. Na. Automatic identification of treatment relations for medical ontology learning: An exploratory
study. Advances in Knowledge Organization 2004 (9), pp. 245‒250, 2004.
D. D. Lewis, R. E. Schapire, J. P. Callan and R. Papka. Training algorithms for linear text classifiers. In
Proceedings of the 19th annual international ACM SIGIR conference on research and development in information
retrieval SIGIR-1996, pp. 298-306, 1996.
X. Li, S. Szpakowicz and S. Matwin. A WordNet-based Algorithm for Word Sense Disambiguation. In IJCAI '95, pp.
1368–1374, 1995.
D. Lin. An information-theoretic definition of similarity. In ICML'98, pp. 296–304, 1998.
Y. Liu, R. Bill, M. Fiszman, T. Rindflesch, T. Pedersen, G. Melton, S. Pakhomov. Using SemRep to Label Semantic
Relations Extracted from Clinical Text. In AMIA Annual Symposium Proceedings, Vol. 2012, p. 587, 2012.
A. Mädche, S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2): 72-79, 2001.
C. Manning and H. Schütze. Foundations of statistical natural language processing. Vol. 999. Cambridge: MIT press,
1999.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM 38.11, pp. 39-41, 1995.
D. Mladenić and M. Grobelnik. Feature selection on hierarchy of web documents. Journal of Decision Support Systems,
35, pp. 45-87, 2003.
D. Mladenić, J. Brank, M. Grobelnik and N. Milic-Frayling. Feature selection using linear classifier weights:
Interaction with classification models. In Proceedings of the twenty-seventh annual international ACM SIGIR
conference on research and development in information retrieval SIGIR-2004, pp. 234-241, 2004.
T. P. Mohamed, E. R. Hruschka, Jr., and T. M. Mitchell. Discovering relations between noun categories. In EMNLP '11,
pp. 1447-1455, 2011.
H. Paley. Abstract algebra. Holt, Rinehart and Winston, 1966.
T. Pedersen, S. Patwardhan, J. Michelizzi. WordNet::Similarity - Measuring the Relatedness of Concepts. In
Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pp. 1024-1025, 2004.
W. Pratt. Dynamic organization of search results using the UMLS. Proceedings of the AMIA Annual Fall Symposium.
American Medical Informatics Association, 1997.
C. Ramakrishnan, K. J. Kochut and A. P. Sheth. A Framework for Schema-Driven Relationship Discovery from
Unstructured Text. In Proceedings of the 5th International Semantic Web Conference, pp. 583‒596, 2006.
A. L. Rector and J. Rogers. Ontological and practical issues in using a description logic to represent medical concept
systems: Experience from GALEN. In Reasoning Web, pp. 197‒231, 2006.
R. L. Richesson, J. E. Andrews, and J. P. Krischer. Use of SNOMED CT to represent clinical research data: A semantic
characterization of data items on case report forms in vasculitis research. Journal of the American Medical Informatics
Association, 13(5):536‒546, 2006.
S. Riedel, L. Yao, A. McCallum, B. M. Marlin. Relation extraction with matrix factorization and universal schemas. In
NAACL HLT'13, pp. 74-84, 2013.
T. C. Rindflesch, L. Tanabe, J. N. Weinstein and L. Hunter. EDGAR: Extraction of drugs, genes, and relations from the
biomedical literature. In Proceedings of Pacific Symposium on Biocomputing, pp. 514‒525, 2000.
B. Rosario, M.A. Hearst. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting
on Association For Computational Linguistics, pp. 430‒430. Association for Computational Linguistics, 2004.
R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE 88.8,
pp. 1270-1278, 2000.
C. Rosse, J. L. Mejino, Jr. A reference ontology for biomedical informatics: the Foundational Model of Anatomy.
Journal of Biomedical Informatics, 36(6):478-500, 2003.
D. L. Rubin, O. Dameron, Y. Bashir, D. Grossman, P. Dev, and M. A. Musen. Using ontologies linked with geometric
models to reason about penetrating injuries. Artificial Intelligence in Medicine, 37(3):167‒176, 2006.
M. Ruiz-Casado, E. Alfonseca, P. Castells. Automatic extraction of semantic relationships for WordNet by means of
pattern learning from Wikipedia. In NLDB'05, pp. 67-79, 2005.
C. Sammut, G. Webb (Eds.). Encyclopedia of Machine Learning. Springer, 2011.
D. Sánchez, A. Moreno, and L. Del Vasto-Terrientes. Learning relation axioms from text: An automatic Web-based
approach. In Expert Systems with Applications, 39, pp. 5792-5805, 2012.
N. Seco, T. Veale and J. Hayes. An intrinsic information content metric for semantic similarity in WordNet. ECAI, Vol.
16, 2004.
D. R. Swanson. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and
Medicine, 30(1):7-18, 1986.
S. Schulte im Walde. Clustering Verbs Semantically According to their Alternation Behaviour. In Proceedings of the
18th International Conference on Computational Linguistics (COLING), pp. 747–753, 2000.
S. Schulze-Kremer, B. Smith, and A. Kumar. Revising the UMLS semantic network. Medinfo, pp. 1700-1704, 2004.
T. F. Smith, M. S. Waterman, and W. M. Fitch. Comparative Biosequence Metrics. Journal of Molecular Evolution, 18(1):
38-46, 1981.
SNOMED Clinical Terms, http://www.ihtsdo.org/snomed-ct
SNOMED CT User Guide. July 2013 International Release (US English), http://www.snomed.org/ug.pdf (latest access
9.10.2013).
L. Tari, J. Hakenberg, G. Gonzalez, and C. Baral. Querying parse tree database of medline text to synthesize user-specific biomolecular networks. In Proceedings of Pacific Symposium on Biocomputing, pp. 87-98, 2009.
G. Tsatsaronis, I. Varlamis, M. Vazirgiannis and K. Nørvåg. Omiotis: A Thesaurus-based Measure of Text Relatedness.
ECML PKDD'09, pp. 742-745, 2009.
G. Tsatsaronis, M. Schroeder, G. Paliouras, Y. Almirantis, I. Androutsopoulos, E. Gaussier, P. Gallinari, T. Artieres, M.
R. Alvers, M. Zschunke, et al., BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question
Answering, 2012 AAAI Fall Symposium Series, 2012.
G. Tsatsaronis, I. Varlamis, N. Kanhabua and K. Nørvåg. Temporal Classifiers for Predicting the Expansion of Medical
Subject Headings, The 14th International Conference on Intelligent Text Processing and Computational Linguistics
(CICLing 2013), March 2013, Samos, Greece.
Unified Medical Language System, http://www.nlm.nih.gov/research/umls/
UMLS® Reference Manual [Internet]. Bethesda (MD): National Library of Medicine (US); 2009 Sep. Available from:
http://www.ncbi.nlm.nih.gov/books/NBK9676/
P. Velardi, S. Faralli, R. Navigli. OntoLearn reloaded: a graph-based algorithm for taxonomy induction. Computational
Linguistics, 39(3), 2013.
J. Völker, P. Hitzler and P. Cimiano. Acquisition of OWL DL axioms from lexical resources. In ESWC, pages 670-685,
2007.
T. Wächter. 2010. Semi-automated Ontology Generation for Biocuration and Semantic Search. PhD thesis. Technische
Universität Dresden, Germany.
T. Wächter, G. Fabian, and M. Schroeder. DOG4DAG: semi-automated ontology generation in obo-edit and protégé.
Proceedings of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences. ACM,
2011.
E. Westerhout and P. Monachesi. Creating glossaries using pattern-based and machine learning techniques. In
Proceedings of the 7th International Conference on Language Resources and Evaluation, 2008.
S.-H. Wu and W.-L. Hsu. SOAT: a semi-automatic domain ontology acquisition tool from Chinese corpus. In 19th
International Conference on Computational Linguistics, 2002.
Y. Wu, M. Liu, W. J. Zheng, Z. Zhao, and H. Xu. Ranking gene-drug relationships in biomedical literature using latent
dirichlet allocation. In Proceedings of Pacific Symposium on Biocomputing, pp. 422‒433, 2012.
R. Xu, Q. Wang. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug
repurposing. BMC Bioinformatics, 14:181, 2013.
L. Yao, A. Haghighi, S. Riedel, A. McCallum. Structured relation discovery using generative models. In EMNLP'11,
pp. 1456-1466, 2011.
D. Zelenko, C. Aone, A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research
2003, 3:1083-1106.
M. Zhang, J. Su, D. Wang, G. Zhou, and C.L. Tan. Discovering Relations Between Named Entities from a Large Raw
Corpus Using Tree Similarity-Based Clustering. In IJCNLP 2005.
S. Zhu, Y. Okuno, G. Tsujimoto, and H. Mamitsuka. A probabilistic model for mining implicit chemical compound-gene relations from literature. Bioinformatics, 21 Suppl 2:245‒51, 2005.