
Briefings in Bioinformatics, Vol. 15, No. 2, 327-340
Advance Access published on 18 December 2012
doi:10.1093/bib/bbs084
A survey on annotation tools for the
biomedical literature
Mariana Neves and Ulf Leser
Submitted: 7th September 2012; Received (in revised form): 5th November 2012
Abstract
New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora.
Such corpora, commonly called gold standards, are important for learning patterns or models during the training
phase, for evaluating and comparing the performance of algorithms, and also for better understanding the sought-for
information by means of examples. Gold standards depend on human understanding and manual annotation of
natural language text. This process is very time-consuming and expensive because it requires high intellectual
effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks
for developing novel text mining methods. This situation has led to the development of tools that support humans in
annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include
visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools implementing
some of these functionalities is available. In this survey, we present a comprehensive overview of tools
for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected
for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by
hands-on experience whenever possible. Our survey shows that current tools can support many of the tasks in biomedical
text annotation in a satisfying manner, but also that no tool can be considered a true comprehensive
solution.
Keywords: annotation tools; curation tools; gold standard corpora; text mining
INTRODUCTION
In text mining, the development of new applications
crucially depends on the existence of pre-annotated
texts, called gold standard corpora (GSC). GSC are
important for evaluating the performance of tools,
for learning patterns or models in supervised information extraction and for an unbiased comparison of
tools [1, 2]. This is particularly true in knowledge-rich domains such as the life sciences, where texts deal
with millions of entities whose naming often does
not follow any regularities or conventions [3] and
where relationships between entities may differ in
highly subtle ways [4]. These properties imply that
approaches purely built on background knowledge,
such as dictionaries for entity names or language
patterns specific for the intended relationships,
often lead to insufficient recall [5].
A GSC is a set of documents where all mentions of
facts of interest have been manually annotated by a
human expert. GSC essentially provide examples for
algorithms which learn commonalities between the
annotated facts and then try to generalize these
commonalities into rules that also find novel facts [6].
This approach has proven to be highly successful in
many biomedical text mining tasks [7–9], leading to
an increasing interest in biomedical GSC. Examples at
the entity level are corpora for detecting gene names
[10], species names [11] or names of chemical entities
[12]; at the relationship level, examples are corpora
for studying protein–protein interactions (PPIs) [13],
Corresponding author. Mariana Neves, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25,
12489 Berlin, Germany. Tel: +49-30-2093-5486; Fax: +49-30-2093-3045; E-mail: [email protected]
Mariana Neves has a PhD in computer science from the Universidad Complutense de Madrid, Spain. She is currently working as a
post-doc researcher in biomedical text mining in the group of Prof. Leser, Humboldt-Universität zu Berlin, Germany.
Ulf Leser holds the Chair of Knowledge Management in Bioinformatics at the Department for Computer Science,
Humboldt-Universität zu Berlin, Germany.
© The Author 2012. Published by Oxford University Press. For Permissions, please email: [email protected]
gene regulatory relationships [14], complex events
[15] or drug–drug interactions [16]. A comprehensive
collection of corpora for studying relationships
between biomedical entities can be found at http://
corpora.informatik.hu-berlin.de.
The manual annotation of text documents is a
time-consuming and error-prone task, as it requires
a high intellectual effort from domain experts, such
as biologists or pharmacists [17]. Annotation of comprehensive corpora frequently involves a team of
annotators, often with slightly different opinions
about the facts under study, leading to inconsistent
annotations. Comprehensive annotation guidelines
are therefore important, i.e. documents which
convey a precise definition of the facts to be annotated. The development of such guidelines is also a
complex undertaking that often proceeds in an incremental fashion, which in turn necessitates the
re-annotation of texts [17].
re-annotation of texts [17]. This situation makes the
creation of large GSC a project that often turns out
to be much more expensive and much more
involved than expected at the start. However, the
creation of GSC is a task that can be well supported
by computational tools. Supportive tasks include
graphical interfaces for highlighting and tagging
texts, management of text collections, assignment
of texts to annotators, definition of annotation
schemes (classes of entities and relationships that
may be annotated) or export of the result into various formats. If available, existing text mining tools
can be integrated to provide pre-tagged input texts.
This situation led to the development of a number
of tools that support experts in annotating texts.
These tools range from simple word processors with
auto-completion functionalities (e.g. Notepad++)
through standalone tools with graphical user interfaces
[18-20] to web-based solutions suitable for large distributed annotation projects [21-24]. Tools vary
widely in their functionalities. For instance, some
tools allow arbitrary annotation schemes [20] while
others support only a fixed scheme [25, 26]; some
accept multiple input and/or create multiple output
formats [27] while others are bound to a single format
[24]; some focus on linguistic text properties [19]
while others excel in domain-specific mentions [21].
Despite the importance of such tools for text
mining, there exist very few previous surveys on
them. The study by Uren et al. [28] focuses on the
Semantic Web and does not include most of the tools
covered in this work. Dipper et al. [29] provide a
survey of five XML-based annotation tools focusing
on linguistic annotations. To the best of our knowledge, a survey specifically focusing on the needs of
the biomedical domain does not exist yet.
One important distinction to make is that between annotation and curation. By annotation, we
mean the complete tagging of a given text with
respect to some intended facts [17], usually for the
construction of a gold standard. Sometimes, also facts
that might be irrelevant for the intended scope
should be annotated. For instance, if a GSC should
be created for PPIs, also mentions of proteins not
taking part in an interaction should be tagged, as
this provides important negative information for
training and evaluation. By curation, in turn, we
mean the analysis of a given document with respect
to some information being sought for. For instance,
curated PPI databases scan documents to extract information about protein interactions [30]. Curating a
document does not always imply tagging at the text
level but sometimes only at the document level.
Furthermore, many PPIs mentioned in the text might
be ignored because they do not meet the curators' quality standards (e.g. hypothetical or negated
PPIs). Curation can be supported by text mining [31]
and other tools [32, 33], but the requirements for the
latter are quite different.
This survey focuses on annotating texts; however,
in the ‘Discussion’ section, we will also briefly discuss
how suitable the evaluated tools are for curation. Altogether, we considered almost 30 tools,
13 of which were selected for an in-depth analysis.
We excluded tools which are not freely available or
that focus only on linguistic annotations. Furthermore, we
only considered tools which had been applied at least
once in the biomedical domain. The 13 tools meeting these constraints were analyzed in detail with
respect to 35 predefined criteria, including issues of
documentation and extensibility, supported formats
and platforms, implemented functionality and popularity. Furthermore, we locally installed and tested
most of the 13 tools to be able to report on difficulty
of installation and the general (naturally somewhat
subjective) look-and-feel. We also discuss which
tools are more suitable for specific situations, for
instance which of them support the annotation of full
texts and of relationships in addition to entities.
MATERIALS AND METHODS
We screened PubMed and Google Scholar, checked
corpora publications and used the Google search engine
to create a list of tools potentially useful for
annotating texts. All candidate tools were evaluated
against the selection criteria presented in the
'Selection of tools' section. Tools matching these requirements
were assessed in detail using the criteria described
in the 'Criteria for selected tools' section.
Selection of tools
There is a variety of tools available for the annotation
of textual documents. In this survey, we focus on tools
for manual annotation of semantic facts in biomedical
documents and, optionally, for supporting database
curation [34]. Therefore, a tool should allow manual
annotation of texts. Manual annotation may be at any
level, e.g. document, snippet, sentence or mention.
Tools should support annotation of named entities
and, optionally, the annotation of relationships.
We put no constraints regarding supported formats.
Finally, tools should be freely available; this, for
instance, excluded ODIN [25].
We focused on general purpose tools for the
manual annotation of biomedical documents,
as evidenced by at least one report on the successful
application of the tool for creating a gold standard or
curating biomedical data. This, for instance,
excluded tools with a sole linguistic purpose or
which have been used for named-entity annotation
in other domains, such as Ellogon [35], PalinkA [36],
SALTO [37], Slate [38] or the UAM CorpusTool [39]. We
also excluded tools which have been used for linguistic annotation of biomedical documents (but not
for biomedical facts). For instance, in [40], biomedical documents belonging to the Genia corpus [10]
were annotated with discourse relations using
the PDTB tool. Regarding the tools which have
been used for biomedical corpus annotation, we
only excluded CADIXE (http://caderige.imag.fr/Cadixe/),
used in [41, 42], which is only available
on demand, because we did not receive an answer
from the tool's developers.
Tools which focus on annotations of Web
documents have not been considered either, such
as Domeo [43]. General purpose curation tools
which do not allow the manual annotation of
texts, such as Textpresso [33] and PaperBrowser
[44], are also not the focus of this survey. Finally,
we did not further consider tools that were
developed for a single specific task without
apparent abilities to be adapted to other tasks, such
as [32, 45], GOAnnotator [46] and PreBIND [47].
We tried to install all tools fulfilling these requirements. Tools for which we did not succeed in
this task were still considered, albeit without the
hands-on experiences; their download functionality,
however, had to be working. We give our opinion regarding the ease of installation in the detailed
evaluation of each tool in Section 2 of the
Supplementary Data.
Criteria for selected tools
We identified 13 tools meeting our requirements.
For those, we performed an in-depth evaluation
based on published features and hands-on experiences. Tools were evaluated with respect to a list
of 35 criteria grouped into the following categories
(cf. Table 1): basics, publication, system properties,
data and functionalities.
Basics criteria assess the quality of the documentation and the existence of support (e.g. mailing lists
and forums).
Publication criteria encompass the year of the last publication, the number of citations (according to Google
Scholar) and the abundance of papers describing
applications of the tool to biomedical texts.
System properties include availability of code,
supported operating systems, general architecture
(stand-alone, web-based, plug-in) and license.
The data criteria address the formats supported
by the tool for the input documents, the input of pre-annotations, the output of the annotations and the annotation scheme.
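As an illustration of the difference between these output formats, consider the following minimal Python sketch (the tag and entity names are invented for illustration and do not correspond to the schema of any particular tool), which converts a sentence with in-line XML annotations into stand-off annotations, i.e. character offsets paired with entity types:

import xml.etree.ElementTree as ET

# A sentence annotated in-line: entity mentions are wrapped in XML tags.
inline = ('<sentence><protein>TP53</protein> interacts with '
          '<protein>MDM2</protein>.</sentence>')

root = ET.fromstring(inline)
standoff, offset = [], len(root.text or '')
for child in root:
    mention = child.text or ''
    # Stand-off annotation: (start, end, type, covered text).
    standoff.append((offset, offset + len(mention), child.tag, mention))
    offset += len(mention) + len(child.tail or '')

for start, end, etype, text in standoff:
    print(start, end, etype, text)  # e.g. '0 4 protein TP53'

In-line formats embed the annotations within the text itself, whereas stand-off formats keep the text unchanged and store annotations separately by character (or token) offsets; the latter makes overlapping annotations (cf. C19) much easier to represent.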
Finally, the functionalities group contains features
that support the annotation process, like the smallest
unit of annotation (character or token), built-in
biomedical named-entity recognition, support for
fast annotation (shortcut keys), pre-annotations,
ontologies or inter-annotator agreement.
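For instance, a simple exact-match variant of inter-annotator agreement (cf. C34) can be computed as precision, recall and F-score of one annotator's stand-off annotations against another's. The following is a minimal sketch (illustrative code, not taken from any of the evaluated tools):

def agreement(ann_a, ann_b):
    # Annotations are (start, end, type) triples; exact-match comparison.
    a, b = set(ann_a), set(ann_b)
    overlap = len(a & b)
    precision = overlap / len(a) if a else 0.0
    recall = overlap / len(b) if b else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

annotator1 = [(0, 4, 'protein'), (20, 24, 'protein')]
annotator2 = [(0, 4, 'protein'), (25, 29, 'protein')]
print(agreement(annotator1, annotator2))  # (0.5, 0.5, 0.5)

Chance-corrected measures such as Cohen's kappa are offered by some of the tools as well.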
RESULTS
We screened the scientific literature for descriptions
of tools that support the annotation of biomedical
corpora. Tools were selected for an in-depth analysis
if they are freely available, support a sufficiently general
class of annotation tasks and have been previously
applied to biomedical texts. Thirteen tools met these
criteria: @Note, Argo, Bionotate, Brat, Callisto, Djangology,
GATE Teamware, Knowtator, MMAX2, MyMiner, Semantator,
WordFreak and XConc Suite. Wherever possible and feasible,
we also installed the tools locally and experimented with
their functionality and user interface. Installation
succeeded with one exception: GATE Teamware.
Table 1: Criteria used for the evaluation of annotation tools

Category          | Criteria                                                                             | Code
Basics            | Worth looking at the tool                                                            | C1
Basics            | Quality of the documentation                                                         | C2
Basics            | Existence of support (forum, mailing list)                                           | C3
Publication       | Year of the last publication                                                         | C4
Publication       | Number of references in Google Scholar                                               | C5
Publication       | Number of references related to corpora development                                  | C6
System properties | Type of installation (web, stand-alone, plug-in)                                     | C7
System properties | Supported operating systems                                                          | C8
System properties | Availability of source code                                                          | C9
System properties | Dependence on external software                                                      | C10
System properties | Date of the last version                                                             | C11
System properties | Easiness of installation                                                             | C12
System properties | License of the tool                                                                  | C13
Data              | Input format for the documents                                                       | C14
Data              | Input format for the annotations                                                     | C15
Data              | Format of the annotation schema                                                      | C16
Data              | Output format for the annotations (in-line, stand-off)                               | C17
Functionalities   | Linguistic level of annotations (token, character)                                   | C18
Functionalities   | Allowance of overlapping annotations                                                 | C19
Functionalities   | Possibility of using only the keyboard                                               | C20
Functionalities   | Automatic selection of a token when clicked                                          | C21
Functionalities   | Support for fast annotation                                                          | C22
Functionalities   | Allowance of comments on the annotation                                              | C23
Functionalities   | Ability to annotate relationships                                                    | C24
Functionalities   | Support for ontologies and terminologies                                             | C25
Functionalities   | Allowance of document collections                                                    | C26
Functionalities   | Support for full text                                                                | C27
Functionalities   | Allowance of saving documents partially                                              | C28
Functionalities   | Ability to highlight or bookmark parts of the text                                   | C29
Functionalities   | Pre-processing the text (sentence splitting, tokenization, part-of-speech tagging)   | C30
Functionalities   | Built-in biomedical named-entities recognition                                       | C31
Functionalities   | Easiness to import pre-annotations                                                   | C32
Functionalities   | Assignment of the documents to the annotators                                        | C33
Functionalities   | Inter-annotator agreement                                                            | C34
Functionalities   | Queries to a document or collection                                                  | C35
In the following, we present each of these tools
in detail and highlight their strengths and weaknesses.
The main technical features of each tool are summarized in Table 2, while the functional ones are
shown in Table 3. Tools are presented in alphabetical order. A more detailed description of each
tool can be found in Section 2 of the Supplementary
Data.
@Note, University of Minho (Portugal) and University of Vigo (Spain)
@Note [48] is a workbench supporting article
retrieval, journal crawling, PDF-to-text conversion,
pre-processing and annotation of documents. The
tool has been used for annotating microbial cellular
responses [49] and the stringent response of Escherichia
coli [50]. It is not available as open source, although
plug-ins may be added to it through the AIBench
framework. It only allows the annotation of named
entities, and a limited set of dictionaries is included to
perform manual or automatic annotation. It is
pre-configured for the annotation of 14 biological
entity classes and can be extended to support arbitrary
entity types. Document collections are constructed
through queries to Medline, and full texts can be
annotated only if the corresponding PDF is found on the Web.
Argo, National Centre for Text Mining (NaCTeM), University of Manchester (UK)
Argo [24] is a web-based curation tool that has been
used for annotating interactions between marine
drugs and enzymes [24].
Table 2: Comparison of all tools according to selected technical criteria

Tool       | Year | Ref. | Corp.  | Install. | Src. | Depend.           | Scheme       | Input    | Output      | License
@Note      | 2009 | 14   | 2a     | SA       | -    | Java, MySQL       | -            | TXT, PDF | XML IL      | -
Argo       | 2012 | 1    | 1a     | Web      | -    | Web browser       | UIMA         | TXT      | -           | NaCTeM
Bionotate  | 2009 | 12   | 2a     | Web      | yes  | Apache            | XML          | TXT      | XML SO      | GNU GPL
Brat       | 2012 | 4    | 3 (2a) | Web      | yes  | Apache, Python    | TXT          | TXT      | TXT SO      | CC BY 3.0
Callisto   | 2004 | 16   | 3      | SA       | yes  | Java              | DTD          | TXT      | XML SO      | MITRE Co.
Djangology | 2010 | 1    | 1a     | Web      | yes  | Django, Python    | GUI          | DB       | DB          | -
GATE       | 2010 | 8    | 3      | SA, Web  | yes  | Java, Perl, MySQL | GUI          | TXT      | XML IL, DB  | GNU AGPL
Knowtator  | 2006 | 62   | 9 (1a) | Plug-in  | yes  | Protégé Frames    | Ontology     | TXT, XML | XML SO      | Mozilla Public License 1.1
MMAX2      | 2006 | 87   | 2      | SA       | yes  | Java              | XML          | TXT, XML | XML SO      | -
MyMiner    | 2012 | 0    | 1a     | Web      | -    | Web browser       | GUI          | TXT, PDF | TXT IL      | -
Semantator | 2011 | 0    | 2a     | Plug-in  | -    | Protégé OWL       | OWL ontology | TXT      | RDF, XML SO | -
WordFreak  | 2003 | 35   | 7      | SA       | yes  | Java              | Java classes | TXT, XML | XML SO      | Mozilla Public License 1.1
XConc      | -    | -    | 4a     | Plug-in  | yes  | Eclipse           | DTD          | XML      | XML IL      | GNU LGPL

Abbreviations and corresponding criteria are as follows. 'Year': year when the tool was first published (cf. C4). 'Ref.': number of references found for
the publication (as of August 2012) (cf. C5). 'Corp.': number of publications referring to applications in biomedical annotation (cf. C6); 'a' indicates
that the publication is from the developers of the tool. 'Install.': type of installation, i.e. plug-in, stand-alone (SA) or web-based (cf. C7). 'Src.':
source code availability (cf. C9); 'yes' indicates available, '-' indicates not available. 'Depend.': dependencies on external software (cf. C10).
'Scheme': format of the annotation schema (cf. C16). 'Input': format for the input text (cf. C14). 'Output': export format for the annotations, i.e.
file, database (DB), in-line (IL) or stand-off (SO) (cf. C17). 'License': name of the license under which the tool is available (cf. C13).
It is similar to U-Compare [51] in that users can build their
own annotation workflow using UIMA components. Argo is curation oriented but allows manual
annotation of text, although full texts might not be
visualized properly. The tool only supports annotation of named entities, and the annotation schema is
limited to the entities included in the UIMA type
systems. Therefore, its use might not be intuitive for
those unfamiliar with UIMA. Analysis engines are
available to pre-process documents, and automatic
extraction is performed using, for instance, ABNER, the GENIA
tagger and OscarMER.
Bionotate, University of Granada (Spain) and Harvard Medical School (USA)
Bionotate [52] is a web-based tool aimed at
annotating snippets in a collaborative way. The
authors report on a pilot case study for the annotation of PPIs and gene-disease relationships [52], and Bionotate
has also been used as part of the Genome-Environment-Trait Evidence (GET-Evidence)
system [53]. It is an open source tool and features
a very simple installation. The annotation scheme is
composed of predefined entity types and so-called
questions encoding relationships. The existence of a
relationship is expressed by the answer to the question, which is limited to a single one per snippet.
Therefore, Bionotate might not be suitable for the
annotation of an unlimited number of entities or
relationships in a text. The latest version also allows
integration of named-entity recognition services
[54] and assignment of ontology terms to curated
entities [55] (not yet available).
Brat, University of Tokyo (Japan) and University of Manchester (UK)
Brat [21] is a web-based tool for text annotation
which has been used in a variety of projects, including cancer biology, epigenetics and post-translational
modifications (EPI) and the BioNLP 2011 Shared
Tasks [21]. It has also been employed for annotating
stem cell-related documents [56] and anatomical
terms [57]. Documents are displayed graphically
by sentences in a very appealing manner; interactive
annotation of entities and relationships is intuitive
and well supported by appropriate shortcut keys.
Pre-annotations can be loaded easily using a plain-text
stand-off format, and tools accessible as web
services can be integrated for automatic text
pre-processing and annotation. On the downside,
Brat has difficulties with the annotation of full
texts: to work with the tool, we had to split all
documents into parts of at most 50 sentences each [56].
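To illustrate the import format, a brat stand-off annotation file (.ann) accompanies the raw text file and lists one annotation per line: entity lines carry an identifier, a type and character offsets, and relation lines connect entity identifiers. In the following simplified example, the types 'Protein' and 'Interaction' are scheme-specific names defined by the project, not built-in (fields are tab-separated in real files):

T1    Protein 0 4    TP53
T2    Protein 20 24    MDM2
R1    Interaction Arg1:T1 Arg2:T2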
Table 3: Comparison of all tools according to selected functional criteria

[The table rates each of the 13 tools on a scale of Poor / Fine / Good / Very good for each of the functional criteria explained below; for relationships, pre-annotations and inter-annotator agreement, it also notes the mechanism offered (e.g. connectors, slot filling, questions or a check-box matrix; XML or database import; P/R/F, kappa, consensus or side-by-side comparison).]

Abbreviations and corresponding criteria are as follows. 'Speed': speed of performing annotations; it considers the criteria fast annotation (cf. C22), built-in biomedical NER (cf. C31) and pre-annotations (cf. C32). 'Full': support for full text (cf. C27). 'Relat.': possibility of annotating relationships (cf. C24). 'Pre-pros.': integrated pre-processing of the text (e.g. sentence splitting and tokenization) (cf. C30). 'NER': built-in automatic extraction of biomedical named entities (cf. C31). 'Pre-Ann': easiness to import pre-annotations (cf. C32). 'Ontol.': support for ontologies (cf. C25). 'IAA': support for inter-annotator agreement, given as precision (P), recall (R) and F-score (F) (cf. C34).
Callisto, MITRE Corporation (USA)
Callisto [58] is a stand-alone Java text annotation
framework, and customized versions of it have been
used for annotating a variety of corpora, including
ITI TXM [59], temporal clinical data [60] and also
clinical data in [61]. The annotation scheme is configured by developing a custom plug-in based on a
DTD file; therefore, users should be familiar with
this format. We failed to annotate relationships
properly, probably owing to errors in our DTD, and
the system lacks examples of DTDs supporting relationships. Callisto is rather popular in the biomedical
domain and supports the annotation of both named
entities and relationships.
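For readers unfamiliar with DTDs, an annotation scheme of this general kind might declare entity elements and their attributes roughly as follows (a generic, invented example for illustration only; the exact structure Callisto expects from its plug-ins may differ):

<!ELEMENT protein (#PCDATA)>
<!ATTLIST protein id ID #REQUIRED>
<!ELEMENT interaction EMPTY>
<!ATTLIST interaction agent IDREF #REQUIRED
                      target IDREF #REQUIRED>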
Djangology, DePaul University (USA) and Northwestern University (USA)
Djangology [22] is a web application for collaborative and distributed annotation of documents. It was
originally developed for the annotation of named entities
in medical studies of trauma, shock and sepsis conditions. The tool is based on the Django web framework, and annotations are saved in a database
(PostgreSQL, MySQL, Oracle and SQLite are supported). Annotations and documents can easily be
loaded into the system, also through a direct connection
to the database. Annotation of relationships is not
supported. The tool allows annotation by multiple annotators, who are easily managed through the web
interface. Djangology is capable of computing
inter-annotator agreement and features a helpful
side-by-side comparison of documents annotated
by distinct annotators.

GATE Teamware, University of Sheffield (UK)
GATE Teamware is a web-based collaborative annotation and curation environment [23] based on the
GATE infrastructure [18]. The system has been
used for annotating documents related to fungal
enzymes [62], lignocellulose research [63] and for
the OT (OrganismTagger) corpus [64]. It allows
GATE-based routines for pre-processing and pre-annotation. Annotation schemes are configured
using ontologies (relationships are modeled as properties in the ontology). The tool supports multiple
projects with multiple members through the flexible definition and assignment of roles to users. It also offers
monitoring of progress and generation of statistics in
real time. The complexity of the tool probably renders it unsuitable for small projects.
Knowtator, Mayo Clinic (USA)
Knowtator [20] is a general purpose annotation tool
which runs in Protégé Frames (http://protege.stanford.edu/overview/protege-frames.html). It is certainly one of the most popular tools and was used
by a number of biomedical annotation projects,
such as the CRAFT corpus [65], annotation of electronic patient records in Swedish [66], clinical entities
and relationships in the CLEF corpus [67], semantic
analysis of PubMed queries [68, 69], concept analysis
[70], radiology reports [71], temporal expressions
in medical narratives [72] and drug-drug interactions [73]. Annotation schemes are defined as ontologies and may include hierarchical relationships.
Relationships are defined using slot filling, including
co-references and relationships constrained to certain entity types. In our hands-on experiments,
importing pre-annotations did not work satisfactorily,
especially if some manual annotations had already been
included in the document. Knowtator includes a consensus module to perform inter-annotator agreement
and to create a consensus gold standard corpus.
MMAX2, EML Research (Germany)
MMAX2 [19] is a general purpose annotation tool
supporting the creation, browsing, visualization and
querying of annotations. It has been integrated into
the JANE active learning framework [74] and, as
such, was used for annotating corpora related to
gene regulation [14] and hematopoietic stem cell
transplantation [75]. Input documents, annotation
schemes and stand-off annotations are represented
in XML format, and the tool allows annotation of entities
and relationships, as well as the computation of
inter-annotator agreement. We found usage of the tool
rather unintuitive and had difficulties configuring
and using MMAX2 for the annotation of relationships.
MyMiner, Monash University (Australia), Spanish National Cancer Research Center (Spain), Université de la Méditerranée (France), Indian Institute of Technology (India)
MyMiner [76] is a web-based tool supporting a range
of tasks in text curation, such as document labeling,
annotation of entities and binary relationships and entity normalization. It participated in the BioCreative
III IAT [77] and has been used for curation of PPIs
[78] and for assisting manual document classification
in the BioCreative III ACT [79]. Pre-annotation can
be performed using ABNER and LINNAEUS, and the
system can be configured for manual annotation of
other entity types. Relationships are annotated by
means of a matrix of the distinct mentions found
in the text. While this is suitable for data curation, it is
problematic for annotation, as sometimes two entities interact in one part of a document but not in
another. Additionally, a simple inter-annotator
agreement functionality is provided for showing differences between annotations.
Semantator, Mayo Clinic (USA) and Lehigh University (USA)
Semantator [80] is a Protégé OWL (http://protege.stanford.edu/overview/protege-owl.html) plug-in
which allows manual and semi-automatic annotation
of texts and was used for the annotation of temporal data
in clinical narratives [81]. It is similar to Knowtator
(cf. above) in that the annotation schema is defined
by a hierarchical ontology, but Semantator has the
advantage of supporting OWL ontologies. Although
the documentation claims support for the annotation of
co-references and relationships, we were not
able to use this functionality during our
hands-on experiments. For pre-annotation, users
can choose among tools from the BioPortal and
the Mayo Clinic's clinical Text Analysis and
Knowledge Extraction System (cTAKES).
WordFreak, University of Pennsylvania (USA)
WordFreak [27] is a Java-based open source annotation tool built around a highly modular plug-in
architecture which can be used to integrate tools
for pre-annotation. It is one of the most popular
tools and has been used in several projects, such as
semantic role labeling [82], the GREC corpus
[83, 84], SNPs [85], histone modifications [86]
and semantic frames [87]. Annotation
schemes have to be specified as plug-ins, which
means that adapting the tool might require a certain amount of
Java programming. It is not clear whether annotation
of complex relationships, such as biological events,
can be handled by the tool. We installed and tested
WordFreak successfully, but refrained from testing its
abilities for adaptation.
XConc Suite, University of Tokyo (Japan) and University of Manchester (UK)
XConc Suite [10] is a general purpose annotation
tool provided as an Eclipse plug-in. It has been
used for annotating events in the GENIA corpus
[15], meta-knowledge, also in the GENIA corpus
[88], for curating pathways [89] and for the HANAPIN
corpus [90]. The tool allows importing ontologies in
OWL format and is equipped with an ontology
viewer. It allows the annotation of named entities as
well as of complex relationships. However, we found
the annotation of complex events rather cumbersome, as events are shown below a sentence instead
of being displayed in the text itself, which might also
limit the tool's suitability for annotating full texts.
DISCUSSION
We compared 13 tools for annotating biomedical
texts with respect to a set of 35 criteria. Our evaluation showed that no single tool supports all use cases
with equal robustness. Instead, all
tools have their pros and cons, making them more or
less suitable depending on the task at hand. In the
following, we highlight the strengths of different tools by
discussing their suitability for common use cases in
biomedical annotation. An overview of the features
supported by each tool is shown in Figure 1.
Annotation versus curation
Most of the tools described in this survey were developed for annotation purposes. Exceptions are
@Note, Argo and Bionotate, which were originally
intended to support curation. Therefore, they contain functionality that is typically not in focus during
annotation, such as document classification (cf. [34]
for examples of biocuration workflows). Of the
other tools, only MyMiner provides a component
for document categorization, which can be helpful
for document triage [34]. Selection of texts is otherwise left to the users. To enable annotation, all tools
present large portions of the texts to be curated and
let the user select and then annotate tokens. An exception is Bionotate, which only shows text snippets
(cf. criterion C27). This makes the annotation of
longer texts somewhat painful, since they need to be
broken into smaller pieces, which in turn hinders a
quick consideration of the context of a fact.
Supported annotations
Three of the 13 tools only support annotation of
named entities, namely @Note, Argo and
Djangology. This is probably a severe limitation for
many projects, as annotation of relationships has
become a de facto standard in biomedicine [13, 14,
91]. The way annotation of relationships is supported varies considerably between the tools (cf. criterion C24). Bionotate poses questions to the user,
while MyMiner generates a matrix of possible relationships, which implies that annotation at the mention level is not possible. Callisto, GATE Teamware,
Knowtator, Semantator and WordFreak use slot filling, which we found very hard to use in large-scale
annotation, as no visualization inside the text is provided. Only Brat, MMAX2 and XConc Suite show
visual links between the entities directly on the text,
although their use in MMAX2 is not straightforward
and their visualization in XConc is confusing.

Figure 1: Overview of the functionalities provided by each tool. The criteria shown, with their corresponding codes, are: 'Relationships' (cf. C24), 'Inter-Annotator Agreement' (cf. C34), 'Full Text' (cf. C27), 'Web-based' (cf. C7), 'Text mining pipeline' (cf. C32), 'Ontologies' (cf. C25) and 'Built-in Bio-NER' (cf. C31).
Pre-annotation
Clearly, manual annotation can benefit substantially
from pre-annotation. If a comprehensive and high-quality pre-annotation is available, the task of the
annotator changes from careful reading of texts to
confirmation or rejection of system-suggested annotations. However, one has to keep in mind that such
an approach is only useful if pre-annotations are of
extremely high quality, which still is wishful thinking
for all but a few entity types and way beyond the
state of the art for relationships. @Note, Argo,
MyMiner and Semantator come along with a
named-entity module for the recognition of certain
biological entity classes (cf. criterion C31). The
GATE framework also contains a set of NER components which can be integrated into GATE
Teamware. Finally, Brat allows integration of the
output of automatic text annotation and
pre-processing tools through web services. Some of
the other tools, i.e. Bionotate, Brat, Knowtator,
WordFreak and XConc Suite, allow importing
pre-annotated texts, but leave it to the users to compute such pre-annotations with tools of their choice
(cf. criteria C14, C17, C32). This can be seen as an
advantage for experienced users who want to
use their favorite tools, or as a disadvantage for
the less experienced users who are not capable of performing independent pre-annotation. In our tests,
we could add pre-annotations without problems for
Bionotate, Brat and Djangology, and it might as well
be feasible using the XML format supported by
XConc Suite. In contrast, manipulating the input
file failed in the case of Callisto, although others have
reported using it for the validation of
pre-annotations [60]. Knowtator provides
an import functionality but warns users of some
risks when using it; indeed, we experienced severe
problems while trying it.
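As a sketch of what such independent pre-annotation can look like, the following naive Python dictionary matcher emits brat-style stand-off lines that could then be imported into a tool (illustrative only; a real project would rather use a proper NER system such as ABNER or LINNAEUS instead of exact string matching):

import re

lexicon = {'TP53': 'Protein', 'MDM2': 'Protein'}  # toy dictionary
text = 'TP53 interacts with MDM2 under stress conditions.'

lines, counter = [], 0
for name, etype in lexicon.items():
    for match in re.finditer(re.escape(name), text):
        counter += 1
        # One stand-off line per match: id, type, offsets, covered text.
        lines.append('T%d\t%s %d %d\t%s' % (counter, etype,
                     match.start(), match.end(), match.group()))
print('\n'.join(lines))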
Simple versus comprehensive
Biomedical annotation may be carried out by a range
of different users. Some users are proficient in computer usage and experienced in working with natural
language texts. They often seek a tool with rich
functionality and usually accept the necessity to
invest a certain amount of work into adapting or configuring the tool to specific needs. Other users are
less experienced and prefer tools which are simple
to use and provide all functionality out of the box.
Table 3 and Figure 1 indicate that GATE,
Knowtator and Semantator are the most comprehensive tools in terms of functionality. They support
all standard formats (cf. criteria C14, C15, C16
and C17), are able to integrate ontologies to define
the annotation vocabulary (cf. criterion C25), allow
annotation of entities and relationships (cf. criterion
C24) and also adequately support projects with many
annotators (cf. criteria C33 and C34). In addition,
having comprehensive annotation guidelines
is important; however, supporting their creation
is beyond the scope of typical annotation tools.
Regarding simplicity, we recommend Brat,
Bionotate, Djangology, Knowtator and MyMiner.
These tools are simple to install (cf. criterion C12),
can be configured easily (cf. criterion C1) and provide good documentation (cf. criterion C2)
or self-explanatory examples. Still, defining a proper
annotation schema was not simple with any of the
tools.
Popularity and support
One way of assessing tools is their recent popularity.
This may be estimated from citation counts or from
concrete reports on recent usage for the annotation of
biomedical corpora. MMAX2 (87), Knowtator (62),
WordFreak (35) and Callisto (16) lead in terms of
citations [counts according to Google Scholar, as
of August 2012] (cf. criterion C5). When taking into
account their usage in the biomedical domain, we
found nine reports for Knowtator, seven for
WordFreak, four for XConc, three each for Callisto, Brat
and GATE and only two for MMAX2 (cf. criterion
C6). However, tools which have been published only
recently, like Brat and MyMiner, certainly have a
disadvantage in both these criteria. Another important factor is the quality of documentation and the
liveliness of a tool, reflected in ongoing development and active user support by mailing lists or
forums (cf. criteria C2 and C3). Some of the tools,
especially Callisto, Knowtator and WordFreak, seem
not to be maintained anymore, and their last versions
date from more than a couple of years ago. Callisto's
website even became unavailable recently. Questions
posted to the forums of Knowtator, MMAX2 and
WordFreak have also not been answered by the
developers within the last months. In contrast,
Brat, GATE, Knowtator, MyMiner, Semantator
and XConc provide very good documentation, and
particularly Brat and GATE have an active developer
base.
Support for full texts
The use of full texts is a crucial issue in biomedical
research, as studies have shown that the differences
between results obtained only from abstracts and those
obtained from full texts are significant [92, 93]. All stand-alone
tools, as well as Djangology and WordFreak,
support visualization and annotation of full texts, although sometimes the text loses its formatting when
imported in XML or plain text format (cf.
criterion C27). We found XConc Suite to be less
suitable for full texts, as it displays relationships below
each sentence, which is confusing when working
with long documents. For Bionotate or Brat, texts
must be broken into smaller pieces to be handled
adequately. In our experience with Brat for the
CellFinder corpus [56], we had to split all documents
into parts of at most 50 sentences each. Full
texts are also not correctly displayed in Argo and
MyMiner, whose text areas seem to be limited to a
fixed size. Finally, @Note and MyMiner claim to
support PDF documents, a functionality which we did
not assess in this survey (cf. criterion C14).
Software architecture and availability of code
An important difference between tools is their architecture and the platforms they support. All
tools we considered are generally platform independent (cf. criterion C8). However, as some run as
plug-ins or extensions of other, often highly complex systems (GATE, Protégé or Eclipse), installation
may be involved and require the pre-installation
of other packages (cf. criterion C10). About half of
the tools run in a web environment, namely Argo,
Bionotate, Brat, Djangology, GATE Teamware and
MyMiner (cf. criterion C7). Argo and MyMiner can
be used without any installation, which implies that
annotated texts are, in the first place, hosted on an external
server. This might be considered a problem
when confidentiality or licensing is an issue.
In our evaluation, we also put a special focus on
extensibility, i.e. the ability of a tool to be used
for annotating facts other than the pre-configured ones
(cf. criterion C16). However, in some cases, users
might want to perform more radical changes, for
instance to change the user interface or to support
other visualization paradigms. In such cases, availability of the source code is a prerequisite (cf. criterion
C9). Source code is (currently) not available for
@Note, Argo, MyMiner and Semantator.
SUPPLEMENTARY DATA
Supplementary data are available online at http://
bib.oxfordjournals.org/.
Key Points
- There exist a multitude of tools for annotating biomedical corpora, which are an important resource in text mining.
- None of these tools satisfies all wishes and needs, but comprehensive and easy-to-use solutions exist for many use cases.
- Most of the tools are released as open source, which may compensate for deficiencies in functionality and facilitates integration with other systems or frameworks.
- Only few tools allow the use of ontologies, especially standard formats such as OBO and OWL.
- The number of tools offered as web applications increases. These completely relieve users from software installation and maintenance, but store annotations on remote servers.
Acknowledgements
We are thankful for the valuable suggestions, comments and
corrections provided by Philippe Thomas and Michael Weidlich.
CONCLUSIONS
In this work, we surveyed 13 tools suitable for the
manual annotation of biomedical texts. These tools
exhibit a large and diverse set of features and differ
substantially in many aspects, such as the basic architecture (web-based or stand-alone), the types of facts
that can be annotated (only fixed entities, configurable entities, relationships), their support for larger
annotation projects (inter-annotator agreement), the maturity of the software, the support from the developers and the handling of pre-annotations. We presented
a brief description of each tool, evaluated them according to a comprehensive set of predefined criteria, reported on hands-on experiences for most of
them and discussed them with regard to common situations in biomedical annotation.
Our study shows that there is no perfect tool
which would be recommendable in every situation.
Nonetheless, we believe that as long as the requirements of a project are not too special, researchers
aiming to annotate a biomedical corpus should be
able to find a tool which complies with their needs.
However, non-standard requirements will in all cases
require adaptation at the code level. Overall, we
hope that this survey raises awareness of what is
already out there and helps to diminish the resources
going into costly development of functionality
that is already available for free.
FUNDING
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) [LE 1428/3-1].
References
1. Krallinger M, Morgan A, Smith L, et al. Evaluation of
text-mining systems for biology: overview of the second
BioCreative community challenge. Genome Biol 2008;9:S1.
2. Kim J-D, Nguyen N, Wang Y, et al. The genia event and
protein coreference tasks of the BioNLP shared task 2011.
BMC Bioinformatics 2012;13:S1.
3. Leser U, Hakenberg J. What makes a gene name? Named
entity recognition in the biomedical literature. Brief
Bioinform 2005;6:357–69.
4. Ananiadou S, Pyysalo S, Tsujii J, et al. Event extraction
for systems biology by text mining the literature. Trends
Biotechnol 2010;28:381–90.
5. Zhou D, He Y. Extracting interactions between proteins
from the literature. J Biomed Inform 2008;41:393–407.
6. Sutton C, Mccallum A. Introduction to Conditional Random
Fields for Relational Learning. Cambridge, MA, USA: MIT
Press, 2006.
7. Leaman R, Gonzalez G. BANNER: an executable survey of
advances in biomedical named entity recognition. In:
Proceedings of the Pacific Symposium of Biocomputing 2008;
p. 652–63.
8. Tikk D, Thomas P, Palaga P, et al. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput Biol 2010;6:e1000837.
9. Björne J, Ginter F, Pyysalo S, et al. Complex event extraction at PubMed scale. Bioinformatics 2010;26:i382–90.
10. Kim J-D, Ohta T, Tateisi Y, et al. Genia corpus—a semantically annotated corpus for bio-textmining. Bioinformatics
2003;19:i180–2.
338
Neves and Leser
11. Gerner M, Nenadic G, Bergman C. LINNAEUS: a species
name identification system for biomedical literature. BMC
Bioinformatics 2010;11:85.
12. Klinger R, Kolářik C, Fluck J, et al. Detection of IUPAC
and IUPAC-like chemical names. Bioinformatics 2008;24:
i268–76.
13. Pyysalo S, Airola A, Heimonen J, et al. Comparative
analysis of five protein–protein interaction corpora. BMC
Bioinformatics 2008;9:S6.
14. Buyko E, Beisswanger E, Hahn U. The GeneReg corpus
for gene expression regulation events: an overview of
the corpus and its in-domain and out-of-domain interoperability. In: Proceedings of the Seventh
International Conference on Language Resources and Evaluation
(LREC'10), Valletta, Malta, 2010. European Language
Resources Association (ELRA).
15. Kim J-D, Ohta T, Tsujii J. Corpus annotation for mining
biomedical events from literature. BMC Bioinformatics 2008;
9:10.
16. Segura-Bedmar I, Martínez P, de Pablo-Sanchez C.
Extracting drug–drug interactions from biomedical texts.
BMC Bioinformatics 2010;11:P9.
17. Cohen KB, Ogren PV. Corpus design for biomedical natural language processing. In: Proceedings of the ACL-ISMB
Workshop on Linking Biological Literature, Ontologies and
Databases 2005; p. 38–45.
18. Cunningham H, Maynard D, Bontcheva K, et al. Text
Processing with GATE (Version 6) 2011.
19. Müller C, Strube M. Multi-level annotation of linguistic
data with MMAX2. In: Braun S, Kohn K, Mukherjee J
(eds). Corpus Technology and Language Pedagogy: New Resources,
New Tools, New Methods. Frankfurt a.M., Germany: Peter
Lang, 2006, 197–214.
20. Ogren PV. Knowtator: a Protégé plug-in for annotated
corpus construction. In: Proceedings of the 2006 Conference of
the North American Chapter of the Association for Computational
Linguistics on Human Language Technology, 2006. p. 273–5.
Morristown, NJ, USA: Association for Computational
Linguistics.
21. Stenetorp P, Pyysalo S, Topić G, et al. Brat: a web-based
tool for nlp-assisted text annotation. In: Proceedings of
the Demonstrations at the 13th Conference of the European
Chapter of the Association for Computational Linguistics, 2012.
p.102–7. Avignon: Association for Computational
Linguistics.
22. Apostolova E, Neilan S, An G, et al. Djangology: a
light-weight web-based tool for distributed collaborative
text annotation. In: Proceedings of the
Seventh Conference on International Language Resources and
Evaluation (LREC'10), Valletta, Malta, 2010. European
Language Resources Association (ELRA).
23. Bontcheva K, Cunningham H, Roberts I, et al. Web-based
collaborative corpus annotation: requirements and a framework implementation. In: Proceedings of ‘‘New Challenges for
NLP Frameworks’’ workshop at Language Resources and
Evaluation Conference (LREC) 2010.
24. Rak R, Rowley A, Black W, et al. Argo: an integrative,
interactive, text mining-based workbench supporting curation. Database 2012;2012:bas010.
25. Rinaldi F, Clematide S, Schneider G, et al. ODIN: an
advanced interface for the curation of biomedical literature.
In: BioCuration 2010. Tokyo, Japan.
26. Liakata M, Claire Q, Soldatova LN. Semantic annotation of
papers: interface & enrichment tool (sapient). In: Proceedings
of the BioNLP 2009 Workshop, Boulder, Colorado, 2009.
p.193–200.
27. Morton T, LaCivita J. WordFreak: an open tool for linguistic
annotation. In: NAACL-Demonstrations ’03 Proceedings of the
2003 Conference of the North American Chapter of the Association
for Computational Linguistics on Human Language Technology:
Demonstrations, 2003. Vol. 4.
28. Uren V, Cimiano P, Iria J, et al. Semantic annotation for
knowledge management: requirements and a survey of the
state of the art. J Web Semantics 2006;4:14–28.
29. Dipper S, Götze M, Stede M. Simple annotation tools for
complex annotation tasks: an evaluation. In: Proceedings of the
LREC Workshop on XML-based Richly Annotated Corpora
(LREC’04), Lisbon, Portugal, 2004. European Language
Resources Association (ELRA).
30. Hermjakob H, Montecchi Palazzi L, Lewington C, et al.
IntAct: an open source molecular interaction database.
Nucleic Acids Res 2004;32:D452–5.
31. Winnenburg R, Wächter T, Plake C, et al. Facts from text:
can text mining help to scale-up high-quality manual
curation of gene products with ontologies? Brief Bioinform
2008;9:466–78.
32. Wiegers T, Davis A, Cohen KB, et al. Text mining and
manual curation of chemical-gene-disease networks for
the comparative toxicogenomics database (CTD). BMC
Bioinformatics 2009;10:326.
33. Müller H-M, Kenny EE, Sternberg PW. Textpresso: an
ontology-based information retrieval and extraction
system for biological literature. PLoS Biol 2004;2:e309.
34. Hirschman L, Burns GAPC, Krallinger M, et al.
Text mining for the biocuration workflow. Database 2012;
2012:bas020.
35. Petasis G, Karkaletsis V, Paliouras G, et al. Using the Ellogon
natural language engineering infrastructure. In: Proceedings
of the Workshop on Balkan Language Resources and Tools, 1st Balkan
Conference in Informatics (BCI 2003), Thessaloniki, Greece, 2003.
36. Orăsan C. PALinkA: a highly customisable tool for discourse
annotation. In: Proceedings of the 4th SIGdial Workshop on
Discourse and Dialogue, ACL03, 2003. p.39–43.
37. Burchardt A, Erk K, Frank A, et al. SALTO: a versatile
multi-level annotation tool. In: Proceedings of LREC-2006,
Genoa, Italy, 2006.
38. Kaplan D, Iida R, Tokunaga T. Slat 2.0: corpus construction and annotation process management. In: Proceedings of
the 16th Annual Meeting of The Association for Natural Language
Processing (NLP2010), 2010. p.510–3.
39. O’Donnell M. Demonstration of the UAM corpus tool for
text and image annotation. In: Proceedings of the 46th Annual
Meeting of theAssociation for Computational Linguistics on Human
Language Technologies: Demo Session, HLT-Demonstrations ’08,
2008. p.13–6. Association for Computational Linguistics,
Stroudsburg, PA, USA.
40. Prasad R, McRoy S, Frid N, et al. The biomedical discourse
relation bank. BMC Bioinformatics 2011;12:188.
41. Bossy R, Jourde J, Manine A-P, et al. Bionlp shared task—
the bacteria track. BMC Bioinformatics 2012;13:S3.
42. Nédellec C. Learning language in logic–genic interaction
extraction challenge. In: Proceedings of the Learning Language
in Logic 2005 Workshop at the International Conference on
Machine Learning, 2005.
Survey on annotation tools for the biomedical literature
43. Ciccarese P, Ocana M, Clark T. Open semantic annotation
of scientific publications using domeo. J Biomed Semantics
2012;3:S1.
44. Karamanis N, Lewin I, Seal R, et al. Integrating natural
language processing with FlyBase curation. In:
Proceedings of PSB 2007, 2007. p. 245–56.
45. Karamanis N, Seal R, Lewin I, et al. Natural language
processing in aid of flybase curators. BMC Bioinformatics
2008;9:193.
46. Couto F, Silva M, Lee V, et al. GOAnnotator: linking
protein go annotations to evidence text. J Biomed Discov
Collab 2006;1:19.
47. Donaldson I, Martin J, de Bruijn B, et al. Prebind and textomy—mining the biomedical literature for protein–protein
interactions using a support vector machine. BMC
Bioinformatics 2003;4:11.
48. Lourenco A, Carreira R, Carneiro S, et al. @note: a workbench for biomedical text mining. J Biomed Inform 2009;42:
710–20.
49. Carreira R, Carneiro S, Pereira R, et al. Semantic annotation
of biological concepts interplaying microbial cellular
responses. BMC Bioinformatics 2011;12:460.
50. Carneiro S, Lourenco A, Ferreira E, et al. Stringent response
of Escherichia coli: revisiting the bibliome using literature
mining. Microb Inform Exp 2011;1:14.
51. Kano Y, Baumgartner WA, McCrohon L, et al. U-Compare:
share and compare text mining tools with UIMA.
Bioinformatics 2009;25:1997–8.
52. Cano C, Monaghan T, Blanco A, et al. Collaborative
text-annotation resource for disease-centered relation
extraction from biomedical text. J Biomed Inform 2009;42:
967–77.
53. Ball MP, Thakuria JV, Zaranek AW, et al. A public resource
facilitating clinical use of genomes. Proc Natl Acad Sci USA
2012;109:11920–7.
54. Cano C, Labarga A, Blanco A, et al. Collaborative
semi-automatic annotation of the biomedical literature. In:
11th International Conference on Intelligent Systems Design and
Applications (ISDA 2011). Cordoba, Spain, 2011.
55. Cano C, Labarga A, Blanco A, et al. Social and semantic web
technologies for the text-to-knowledge translation process
in biomedicine. In: Biomedical Engineering, Trends, Research and
Technologies. InTech, 2011.
56. Neves M, Damaschun A, Kurtz A, et al. Annotating and
evaluating text for stem cell research. In: Proceedings of the
Third Workshop on Building and Evaluation Resources for
Biomedical Text Mining (BioTxtM 2012) at Language Resources
and Evaluation (LREC), Istanbul,Turkey, 2012. p.16–23.
57. Ohta T, Pyysalo S, Tsujii J, et al. Open-domain anatomical
entity mention detection. In: Proceedings of ACL 2012
Workshop on Detecting Structure in Scholarly Discourse (DSSD),
2012. p.27–36.
58. Day D, McHenry C, Kozierok R, et al. Callisto: a configurable annotation workbench. In: Proceedings of Language
Resources and Evaluation Conference (LREC), 2004.
59. Alex B, Grover C, Haddow B, et al. The ITI TXM corpora:
tissue expressions and protein–protein interactions. In:
Proceedings of the Workshop on Building and Evaluating Resources for
Biomedical Text Mining, LREC 2008, Marrakesh, Morocco, 2008.
60. Harkema H, Setzer A, Gaizauskas R, et al. Mining and
modelling temporal clinical data. In: Cox S (ed).
Proceedings of the 4th UK e-Science All Hands Meeting,
Nottingham, UK, 2005.
61. Zhang Y, Patrick J. Extracting semantics in a clinical
scenario. In: Roddick JF, Warren JR (eds). Australasian
Workshop on Health Knowledge Management and Discovery
(HKMD 2007), Vol. 68 of CRPIT, 2007. p. 241–7. ACS,
Ballarat, Australia.
62. Meurs M-J, Murphy C, Naderi N, et al. Towards evaluating
the impact of semantic support for curating the fungus scientific literature. In: The 3rd Canadian Semantic Web
Symposium (CSWS2011), Vancouver, British Columbia, 2011.
63. Meurs M-J, Murphy C, Morgenstern I, et al. Semantic
text mining for lignocellulose research. In: Proceedings of the
ACM Fifth International Workshop on Data and Text Mining in
Biomedical Informatics, DTMBIO '11, 2011. p. 35–42. ACM,
New York, NY.
64. Naderi N, Kappler T, Baker CJO, et al. OrganismTagger:
detection, normalization and grounding of organism entities
in biomedical documents. Bioinformatics 2011;27:2721–9.
65. Bada M, Eckert M, Evans D, et al. Concept annotation in
the CRAFT corpus. BMC Bioinformatics 2012;13:161.
66. Velupillai S, Dalianis H, Hassel M, et al. Developing a standard for de-identifying electronic patient records written in
Swedish: precision, recall and F-measure in a manual and
computerized annotation trial. Int J Med Inform 2009;78:
e19–26.
67. Roberts A, Gaizauskas R, Hepple M, et al. The CLEF corpus:
semantic annotation of clinical text. AMIA Annu Symp Proc
2007;625–9.
68. Dogan RI, Murray GC, Névéol A, et al. Understanding
PubMed user search behavior through log analysis.
Database 2009;2009:bap018.
69. Névéol A, Dogan RI, Lu Z. Semi-automatic semantic
annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform 2011;44:310–8.
70. Hunter L, Lu Z, Firby J, et al. OpenDMAP: an open source,
ontology-driven concept analysis engine, with applications
to capturing knowledge regarding protein transport, protein
interactions and cell-type-specific gene expression. BMC
Bioinformatics 2008;9:78.
71. Rubin D, Wang D, Chambers D, et al. Natural language
processing for lines and devices in portable chest X-rays.
AMIA Annu Symp Proc 2010;692–6.
72. Reeves RM, Ong FR, Matheny ME, et al. Detecting
temporal expressions in medical narratives. Int J Med Inform
2012; doi:10.1016/j.ijmedinf.2012.04.006 (advance access
publication 15 May 2012).
73. Boyce R, Gardner G, Harkema H. Using natural language
processing to extract drug–drug interaction information
from package inserts. In: BioNLP: Proceedings of the 2012
Workshop on Biomedical Natural Language Processing, 2012.
p. 206–13. Association for Computational Linguistics,
Montréal, Canada.
74. Tomanek K, Wermter J, Hahn U. Efficient annotation with
the Jena ANnotation Environment (JANE). In: Proceedings
of the Linguistic Annotation Workshop, LAW '07, 2007. p. 9–16.
Association for Computational Linguistics, Stroudsburg,
PA, USA.
75. Hahn U, Beisswanger E, Buyko E, et al. Semantic annotations for biology: a corpus development initiative at the Jena
University Language & Information Engineering (JULIE)
Lab. In: Proceedings of the Sixth International Conference on
Language Resources and Evaluation (LREC'08), 2008. European
Language Resources Association (ELRA), Marrakech,
Morocco.
76. Salgado D, Krallinger M, Depaule M, et al. MyMiner: a
web application for computer-assisted biocuration and text
annotation. Bioinformatics 2012;28:2285–7.
77. Arighi C, Lu Z, Krallinger M, et al. BioCreative III interactive task: an overview. BMC Bioinformatics 2011;12:S4.
78. Krallinger M, Leitner F, Vazquez M, et al. How to link
ontologies and protein–protein interactions to literature:
text-mining approaches and the BioCreative experience.
Database 2012;2012:bas017.
79. Chatr-aryamontri A, Winter A, Perfetto L, et al.
Benchmarking of the 2010 BioCreative Challenge III
text-mining competition by the BioGRID and MINT interaction databases. BMC Bioinformatics 2011;12:S8.
80. Song D, Chute CG, Tao C. Semantator: a semi-automatic
semantic annotation tool for clinical narratives. In: 10th
International Semantic Web Conference (ISWC 2011), 2011.
81. Tao C, Wongsuphasawat K, Clark K, et al. Towards event
sequence representation, reasoning and visualization for
EHR data. In: Proceedings of the 2nd ACM SIGHIT
International Health Informatics Symposium, IHI '12, 2012.
p. 801–6. ACM, New York, NY, USA.
82. Chou W-C, Tsai RT-H, Su Y-S, et al. A semi-automatic
method for annotating a biomedical proposition bank. In:
LAC '06 Proceedings of the Workshop on Frontiers in Linguistically
Annotated Corpora, 2006.
83. Thompson P, Iqbal SA, McNaught J, et al. Construction
of an annotated corpus to support biomedical information
extraction. BMC Bioinformatics 2009;10:349.
84. Sasaki Y, Thompson P, Cotter P, et al. Event frame extraction based on a gene regulation corpus. In: Proceedings of
the 22nd International Conference on Computational Linguistics,
Vol. 1, COLING '08, 2008. p. 761–8. Association for
Computational Linguistics, Stroudsburg, PA, USA.
85. Klinger R, Furlong LI, Friedrich CM, et al. Identifying
gene-specific variations in biomedical text. J Bioinform
Comput Biol 2007;5:1277–96.
86. Kolářik C, Klinger R, Hofmann-Apitius M. Identification
of histone modifications in biomedical text for supporting epigenomic research. BMC Bioinformatics 2009;
10(Suppl 1):S28.
87. Thompson P, Cotter P, McNaught J, et al. Building a
bio-event annotated corpus for the acquisition of semantic
frames from biomedical corpora. In: Proceedings of the Sixth
International Language Resources and Evaluation (LREC'08),
2008. European Language Resources Association (ELRA),
Marrakech, Morocco.
88. Thompson P, Nawaz R, McNaught J, et al. Enriching a
biomedical event corpus with meta-knowledge annotation.
BMC Bioinformatics 2011;12:393.
89. Oda K, Kim J-D, Ohta T, et al. New challenges for text
mining: mapping between text and manually curated pathways. BMC Bioinformatics 2008;9:S5.
90. Batista-Navarro RT, Ananiadou S. Building a coreference-annotated
corpus from the domain of biochemistry. In:
Proceedings of BioNLP 2011 Workshop, BioNLP '11, 2011.
p. 83–91. Association for Computational Linguistics,
Stroudsburg, PA, USA.
91. Kim J-D, Ohta T, Pyysalo S, et al. Extracting biomolecular
events from literature: the BioNLP'09 shared task. Comput Intell
2011;27:513–40.
92. Cohen KB, Johnson H, Verspoor K, et al. The structural and
content aspects of abstracts versus bodies of full text journal
articles are different. BMC Bioinformatics 2010;11:492.
93. Kim J-D, Wang Y, Takagi T, et al. Overview of genia event
task in bionlp shared task 2011. In: Proceedings of BioNLP
Shared Task 2011 Workshop, 2011. p.7–15. Association for
Computational Linguistics, Portland, OR, USA.