Evaluation of gene/protein name recognition programs

Masters in Proteomics and Bioinformatics
Written by: Sanaa Chtioui
Directors: Anne-Lise Veuthey
University of Geneva
2008/2009
Summary
Identification of gene and protein names in biomedical text is a challenging task
because the corresponding nomenclature has evolved over time. This has led to multiple
synonyms for individual genes and proteins, as well as names that may be ambiguous
with other gene names or with general English words. The Gene List Task of the
BioCreAtIvE challenge evaluation enables comparison of systems addressing the
problem of protein and gene name identification on common benchmark data.
Here we evaluate gene and protein name recognition programs.
The evaluation was first performed against two manually annotated corpora specifically
tailored for gene and protein name identification (GENETAG and YAPEX), and then
against the “Variants corpus”, which contains gene and protein names, synonyms and a set
of mutation-related keywords (variants, polymorphisms, mutations…). The Variants corpus
was used to check whether gene and protein names can be associated with mutations and
whether synonyms can be related to the same gene or protein.
The evaluation was based on standard performance measures (precision and recall).
GAPSCORE achieved 81% precision and 63% recall against the GENETAG corpus, and
AIIAGMT achieved 97% precision and 98% recall against the YAPEX data set. Against the
Variants corpus, AIIAGMT achieved 75% precision and 75% recall, and GAPSCORE
achieved 72% precision and 88% recall.
To improve the precision of gene and protein name extraction, we combined the programs
by taking the intersection of their predictions: the intersection filters out false positives
and therefore increases precision, at the expense of recall.
Background
We are developing text mining tools for information extraction of variants. In
particular, we are working on an automatic procedure able to track newly published articles
concerning variants and to extract new relevant information. This requires the
development of methods able to map a protein described in a paper to the corresponding
protein database entry. In the field of information extraction, many programs have been
developed for the task of protein and gene name recognition in biological texts. These
programs are necessary to link information found in an article to a specific biological
entity. We propose to evaluate these tools and to integrate the most suitable ones into the
information retrieval pipeline for variants.
Table of contents
Summary
Background
1. Introduction
I. The importance of variants in the context of our research
II. Text mining and gene/protein name extraction
III. BioCreAtIvE and the evaluation of the results of text mining processes
IV. Individual program description
2. Methods
I- Evaluation of gene/protein name recognition programs
1. Corpus description
1.1 GENETAG corpus
1.2 YAPEX corpus
1.3 Variants corpus
2. Evaluation of gene/protein name recognition programs
3. Evaluation of GAPSCORE program
4. Evaluation of AIIAGMT program
II. Application: Variants extraction
3. Results
3.1. GAPSCORE
3.1.1 Performance of GAPSCORE on GENETAG corpus
3.1.2 Performance of GAPSCORE on Variants corpus
3.2. AIIAGMT
3.2.1 Performance of AIIAGMT on GENETAG and YAPEX corpora
3.2.2 Performance of AIIAGMT on Variants corpus
3.3. Improving the precision of extraction
3.3.1 Improving the precision of gene and protein names extraction on
GENETAG corpus
3.3.2 Improving the precision of gene and protein names extraction on
Variants corpus
3.4 Application: Variants extraction
4. Discussion
5. Conclusion
Appendix
Bibliography
1. Introduction
New high-throughput technologies have accelerated the accumulation of biological
knowledge. However, much of this knowledge is still stored as written natural
language text, and the traditional information retrieval framework, which relies on
keyword-based approaches, cannot cope with this information overload. For this reason,
scientists have turned to text mining (TM) techniques, which enable
them to collect, maintain, interpret, curate and discover the knowledge needed for
research. In particular, the biomedical literature on protein point mutations in human
genetic disorders, and on polymorphisms in genes that serve as markers for these
disorders, is growing constantly and provides a tremendous amount of biological data,
including variants, mutations, polymorphisms, etc. These polymorphic positions (or
single nucleotide polymorphisms (SNPs)) in the human genome indicate a
permanent change in the DNA sequence of a gene where a nucleotide is replaced by
another one.
The goal of this work is to identify gene and protein names in text using
text mining methods that have been developed to efficiently locate, retrieve
and manage relevant information on genes and proteins. These methods are evaluated
first against two manually annotated corpora using the standard evaluation metrics of
precision and recall, and then against a “Variants corpus”, which contains sentences from
abstracts on known polymorphisms in Swiss-Prot. This corpus was retrieved by submitting
PubMed queries using gene or protein names and a list of mutation-related keywords
(mutations, polymorphisms, variants).
I. The importance of variants in the context of our research
In biology, the identification of terms corresponding to biological substances (e.g.
genes and proteins) is a necessary step in extracting biological information (e.g. variants,
SNPs, mutations, etc.). Because scientists believe that SNPs may predispose people to
disease and even influence their response to drug regimens, this study focuses on
SNPs.
A Single Nucleotide Polymorphism or SNP is a small genetic change, or variation,
that can occur within a person's DNA sequence. The genetic code is specified by the four
nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). SNP
variation occurs when a single nucleotide, such as an A, replaces one of the other three
nucleotide letters C, G, or T. An example of a SNP is the alteration of the DNA segment
AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a
"T". On average, SNPs occur in the human population more than 1 percent of the time.
Because only about 3 to 5 percent of a person's DNA sequence codes for the production
of proteins, most SNPs are found outside of "coding sequences". SNPs found within a
coding sequence are of particular interest to researchers because they are more likely to
alter the biological function of a protein. Because of the recent advances in technology,
coupled with the unique ability of these genetic variations to facilitate gene identification,
there has been a recent flurry of SNP discovery and detection [32].
II. Text mining and gene/protein names extraction
Life science research is characterized by the production of large and
heterogeneous collections of biological data, including gene or protein sequences,
mutations, polymorphisms, variants etc. Therefore, several methods based on text mining
have been developed to identify gene and protein names in natural language texts.
Text mining has been defined as “the discovery by computer of new, previously
unknown, information by automatically extracting information from different written
resources”. The first manual text-mining approaches surfaced in the mid-1980s, but
technological advances have enabled the field to advance swiftly during the past decade.
Text mining is an interdisciplinary field which draws on information retrieval, data
mining, machine learning, statistics, and computational linguistics. Recently, text mining
has received attention in many areas, such as biomedical text mining (also known as
BioNLP), a rather recent research field at the intersection of natural language
processing (NLP), bioinformatics, medical informatics and computational linguistics.
NLP allows natural language texts to be processed by computer in order to access their
meaning. It can analyze (parse) natural language using lexical resources (dictionaries),
where words are organized into groups according to a grammar (syntactic level) and a
semantic layer assigns meaning to these words or groups of words [35]. A major
focus of machine learning research is to automatically produce (induce) models, such as
rules and patterns, from data to ‘understand’ and ‘learn’ text structure.
The activity of finding documents that answer an information need is called
information retrieval (IR). Another branch of text mining, used in
our study, is information extraction (IE), which extracts information from text without
requiring the end user of the information to read the text.
The aim of this work is the evaluation of gene and protein name recognition
programs. These methods are inspired by text mining techniques using rule-based
systems, which use rules that describe common naming structures for certain term
classes, based on morphological, orthographic and syntactic characteristics. We used the
standard evaluation metrics of precision (the fraction of predicted names that are
correct, which decreases each time a word is falsely identified as a gene or protein
name) and recall (the fraction of annotated names that are found, which decreases each
time a gene or protein name is missed) to evaluate the chosen programs. The results are
compared against the annotations of two manually annotated ‘gold standard’ corpora,
GENETAG and YAPEX, and then against a “Variants corpus”.
III. BioCreAtIvE and the evaluation of the results of text mining
processes
When evaluating text mining, it is essential that the tasks be relevant
to the biology community, especially the identification of biomedical entities (named
entity recognition, NER) such as genes and proteins [36]. Two main biological tasks have
been used for text mining evaluation challenges: document retrieval and biological
database curation [35]. Recent challenge evaluations for text mining in biology include
BioCreAtIvE [34] and BioNLP (biomedical text mining) [37].
BioCreAtIvE [13] (Critical Assessment of Information Extraction systems in
Biology) is a community-wide effort for evaluating text mining and information
extraction systems applied to the life science (biology) literature.
Two main tasks were posed at the first BioCreAtIvE challenge (carried out in 2003): the
first task [20] was concerned with the identification of gene mentions in text and linking
protein database entries to abstracts, and the second task [20] was related to the
extraction of human gene product annotations with GO terms (i.e. text passages
supporting those annotations). The goal of the first BioCreAtIvE Workshop was to
provide a set of common challenge evaluation tasks to assess the state of the art for text
mining applied to biological problems. The assessment focused among others on the
extraction of gene or protein names from text, and their mapping into standardized gene
identifiers for three model organism databases (fly, mouse, yeast). Overall, 27 groups
participated in the assessment, including 18 for gene/protein name extraction. The results
for gene/protein name extraction showed that four of the participating groups
were able to extract general gene names from sentences of MEDLINE abstracts at over
80% balanced precision and recall [20].
The most recent BioCreative challenge took place in 2006. 'BioCreative II' [11]
comprised three tasks: gene mention (GM), which focused on finding the
mentions of genes and proteins in sentences drawn from MEDLINE abstracts; gene
normalization (GN), which involved producing a list of the EntrezGene identifiers
for all the human genes and proteins mentioned in a collection of MEDLINE abstracts;
and protein-protein interaction (PPI), which involved identifying protein-protein
interactions from full-text papers [11].
The performance of gene mention systems has increased since the first BioCreative, and
when multiple systems are combined, the combined BioCreative II systems
achieve over 90% balanced precision and recall [33].
The data sets produced by this contest serve as a Gold Standard training and test
set to evaluate and train Bio-NER tools and annotation extraction tools [33]. In 2003, a
corpus of 20,000 sentences was selected and annotated for training and testing purposes
[11]. For BioCreative II, there were 15,000 training sentences and 5,000 blind test
sentences [33].
In this work we evaluated gene and protein name recognition programs
against two manually annotated corpora: the BioCreative Task 1A (Gene name
Identification) data set, now known as the GENETAG corpus (2005) [18] [9], and the
YAPEX (2002) data set.
IV. Individual program description
A brief description of some gene/protein name recognition systems is
given in table 0. However, our study was restricted to three particular programs:
GAPSCORE, AIIAGMT, and the meta-server.
GAPSCORE is a method that scans text and identifies the names of genes and
proteins [17]. This tool has many potential applications, including allowing users to
search and index documents by genes of interest and analyzing the scientific literature for
genes of interest [17]. GAPSCORE scores gene and protein names in written natural
language text based on a statistical model of gene names that quantifies their
appearance, morphology and context; it uses a machine learning-based approach (Support
Vector Machines (SVMs)). Since GAPSCORE does not distinguish between genes and
proteins, it uses ‘gene’ generically to mean both [1]. The algorithm is accessible from the
web at http://bionlp.stanford.edu/GAPSCORE/ [17], and consists of five steps [1]: (1)
TOKENIZE: it splits the document into sentences and words (a word being a string of
alphanumeric characters); (2) FILTER: it removes from consideration any word that is
clearly not a gene name, i.e. words that are not nouns, adjectives, participles, proper nouns or
foreign words, and it discards numbers, Roman numerals (I–X), Greek letters and amino acids; (3)
SCORE: it scores words using a machine learning classifier; (4) EXTEND: it extends
each word to the full gene name; and (5) MATCH ABBREVIATION: finally, it scores
abbreviations of the gene names identified [1]. Web services [17] allow
computer programs to access GAPSCORE running on other machines; they currently
provide an XML-RPC (a remote procedure calling protocol that works over the Internet)
interface to the algorithms [14]. The server can respond to two queries: find
abbreviations and find gene and protein names (cf Appendix 2).
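The first two steps can be illustrated with a short Python sketch. It is not the GAPSCORE implementation: the word lists are assumptions made for the example, the function names are hypothetical, and the part-of-speech test of the FILTER step is left out because it would require a tagger.

```python
import re

# Assumed word lists for illustration; the lists used by GAPSCORE are not given here.
ROMAN_NUMERALS = {"I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"}
GREEK_LETTERS = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}
AMINO_ACIDS = {
    "alanine", "arginine", "asparagine", "aspartate", "cysteine", "glutamate",
    "glutamine", "glycine", "histidine", "isoleucine", "leucine", "lysine",
    "methionine", "phenylalanine", "proline", "serine", "threonine",
    "tryptophan", "tyrosine", "valine",
}


def tokenize(sentence):
    """TOKENIZE (simplified): a word is a string of alphanumeric characters."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def is_clearly_not_gene(word):
    """FILTER (lexical part only): numbers, Roman numerals, Greek letters, amino acids."""
    if word.isdigit() or word in ROMAN_NUMERALS:
        return True
    lowered = word.lower()
    return lowered in GREEK_LETTERS or lowered in AMINO_ACIDS


def candidate_words(sentence):
    """Words that survive the lexical filter and would go on to the SCORE step."""
    return [word for word in tokenize(sentence) if not is_clearly_not_gene(word)]
```

Words such as "and" survive this sketch because the part-of-speech filter is omitted; in the algorithm described above they would be removed before scoring.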
AIIAGMT (Adaptive Internet Intelligent Agents laboratory's Gene Mention
Tagger) [19]: This named entity recognition system ranked 2nd and achieved an
F-score (the harmonic mean of precision and recall) of 86.83% in the gene mention tagging
task of the second BioCreative challenge. It was developed by the Adaptive Internet Intelligent
Agents (AIIA) Laboratory (http://aiia.iis.sinica.edu.tw/), Institute of Information Science,
Academia Sinica, Taiwan and I-Fang Chung's Laboratory, Institute of Biomedical
Informatics, National Yang-Ming University, Taiwan. The online service is released
under the Creative Commons Attribution 2.5 License. The Web Services version of
AIIAGMT allows a computer program (particularly in Perl) to call it from a remote site.
AIIAGMT was the best performing system based on conditional
random fields (CRFs) in the second BioCreative challenge evaluation [11]. Its key
features include a rich feature set (e.g. PartOfSpeech, Hyphen …) [12], unification of
bidirectional parsing models (forward and backward parsing) [12], a dictionary-based
filtering post-process, and high performance (especially in precision, up to
0.8930 in the final task evaluation). Several feature types were selected, including
character n-grams (window size 2 to 4) and morphological and orthographic features, while
some widely used features, such as stop words, prefixes and suffixes, were excluded.
Besides these widely used features, a set of domain-specific features was also selected,
including abbreviations of biological chemical compounds (for instance DNA, RNA, amino
acids), compounds that co-occur with relevant site information, and so on, in order to
decrease false positives among terms with a gene mention-like morphology.
BioCreative MetaServer (BCMS http://bcms.bioinfo.cnio.es/) [15] is the first
meta-service for information extraction in molecular biology [6]. This prototype platform
is a joint effort of 13 research groups (FIG.0) and provides automatically generated
annotations for PubMed/Medline abstracts. This platform is to be regarded as a
distributed system requesting, retrieving and unifying textual annotations, and delivering
these data to the user at different levels of granularity. BCMS can be divided into three
main units [6]:
• A static collection of text (a set of approximately 22,800 PubMed abstracts used
in the BioCreative II challenge).
• A set of active servers providing annotations for text (FIG.1) upon request; these
annotation servers (AS) only interact with the meta-server and not directly with each
user.
• A meta-server providing the combined data, namely both the annotations and the
corresponding text. Therefore, users indirectly communicate with the annotation servers,
using the meta-server as proxy.
The data can be provided by two different means: via a web browser or via the
XML-RPC protocol [6].
Currently, the service provides four types of annotations: gene/protein mention
(GM), which locates positions in the text that are detected as gene or protein names;
gene/protein normalization (GN), which detects which genes or proteins are mentioned,
assigning sequence database identifiers to the text; taxon classification, which identifies the
organisms to which the text pertains, together with a confidence score, providing an ID
for the National Center for Biotechnology Information (NCBI) taxonomic database; and
protein-protein interaction (PPI), which classifies whether the text contains PPI
information and assigns a confidence score to the classification [6].
FIG.0: The 13 research groups of BCMS
Programs | Type of methodology | Program availability | Program language | Input | System requirement
ProMiner [3] | rule-based approach (dictionary approach) | www.scai.fraunhofer.de/prominer.html; not an open source software, bilateral agreement | Perl | Abstract | UNIX™ / Linux, Microsoft Windows™, IBM UIMA software
PowerBioE [4] | machine learning approach: HMM, SVM | http://textmining.i2r.a-star.edu.sg/NLS/demo.htm | - | Abstract | Not available
NLProt | Support Vector Machines (SVMs) | Internet server & http://cubic.bioc.columbia.edu/services/nlprot/ | - | Abstract | not accessible
ABNER [8] | statistical machine learning system (CRFs) | http://www.cs.wisc.edu/~bsettles/abner/ | Java API | Abstract | Java 2 (J2SE), processor (500MHz+), 256MB+ of RAM, Java SDK 1.4, MALLET 0.3.1, and Jlex
BioTaggerGM [5] | powerful machine learning frameworks and system combination | not available | - | - | -
GAPSCORE [1] | machine learning-based approach | http://bionlp.stanford.edu/GAPSCORE | Python, Perl, Java | Abstract | XML-RPC program
AIIAGMT [19] | statistical machine learning system (CRFs) | http://aiia.iis.sinica.edu.tw/index.php | Perl, Java | Abstract | Perl 5.8.0 or later, Frontier::Client, XML-RPC program
ABGene | rule-based and/or pattern matching methods | ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/ | Perl, C++ | Abstract | Linux versions work using Slackware 8.0 and glibc2.2.x, or 3.4.0
LingPipe [22] | machine learning approach: HMM | http://alias-i.com/lingpipe/ | Java | Abstract | Apache Jakarta Lucene TF/IDF search engine (version 1.3) and the Alias-i LingPipe tokenizer and named entity annotator (version 1.0.6)
KEX [23] | rule-based approach (dictionary approach) | http://www.hgc.inc.utokyo.ac.jp/service/tooldoc/KeX/intro.html | Perl 5 | Abstract | Solaris, dec and irix; C compiler (preferably gcc), and Perl version 5
BANNER [24] | machine learning system (CRFs) | http://banner.sourceforge.net/ | Java | Abstract | latest version of the Mallet toolkit (version 0.4)
Whatizit [25] | rule-based approach | http://www.ebi.ac.uk/webservices/whatizit | Java | Abstract | SOAP, WSDL
table 0: A brief description of some systems of gene/protein name recognition.
2. Methods
I- Evaluation of gene/protein name recognition programs
Evaluating the performance of gene and protein name recognition programs is
impossible without a standardized test corpus. In this work, we used two manually
annotated corpora specifically tailored for gene and protein name identification,
GENETAG and YAPEX, as gold standards. Then, we evaluated the
chosen programs against the Variants corpus to confirm their performance and to make sure
that their good results are not due to their having been adjusted or trained on the gold
standards, which were manually annotated before the release of the chosen programs.
1. Corpus description:
1.1 GENETAG corpus
GENETAG is a corpus of 20,000 MEDLINE sentences [9] (342,574 words) for
gene/protein named entity recognition (NER). 15,000 GENETAG sentences were used
for the BioCreAtIvE Task 1A competition. The GENETAG corpus is freely available at
ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz [18] and is used in this
study as a gold standard. This gold standard has been tokenized [18], i.e. broken up
into words, numbers, and punctuation, generally called tokens, each of which consists
of a string of characters without white space. In this process, hyphens and punctuation
often receive special treatment based on rules such as [9]: do not split on hyphens, do not
split on single quotation marks, do not split on commas, and do not split on parentheses
and brackets [26]. The gene/protein names in the GENETAG corpus were manually
annotated and are indicated by the tag "NEWGENE"; the tag "NEWGENE1" is used to differentiate
overlapping names (when two gene mentions are immediately next to each other in the
text) [18]. Every sentence begins with a number that serves as an identifier.
Example:
95229799480 Cervicovaginal/JJ foetal/NEWGENE fibronectin/NEWGENE in/IN the/DT
prediction/NN of/IN preterm/JJ labour/NN in/IN a/DT low-risk/JJ population/NN ./.
19306602252 Insulin-like/NEWGENE growth/NEWGENE factor/NEWGENE1/NEWGENE (/(
IGF-1/NEWGENE )/SYM in/IN burn/NN patients/NNS ./.
The following POS (Part-of-Speech) tags have been used [26] to label each token:
JJ: Adjective
IN: Preposition or subordinating conjunction
DT: Determiner
NN: Noun, singular common
. : A period
SYM: Symbol
NNS: Noun, plural common.
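The tagged format shown above can be read back into gene mentions with a few lines of Python. This is a minimal sketch that assumes every token has the form word/TAG and that a mention is a run of consecutive tokens sharing the same NEWGENE or NEWGENE1 tag; the function names are illustrative and not part of the GENETAG distribution.

```python
def parse_genetag_line(line):
    """Split '<id> word/TAG word/TAG ...' into the identifier and (word, tag) pairs."""
    sentence_id, *items = line.split()
    # rsplit on the last slash so that tokens such as './.' or '(/(' are handled.
    tokens = [tuple(item.rsplit("/", 1)) for item in items]
    return sentence_id, tokens


def gene_mentions(tokens):
    """Group runs of consecutive tokens that carry the same NEWGENE* tag."""
    mentions, run, run_tag = [], [], None
    for word, tag in tokens:
        if tag.startswith("NEWGENE") and tag == run_tag:
            run.append(word)
            continue
        if run:
            mentions.append(" ".join(run))
        if tag.startswith("NEWGENE"):
            run, run_tag = [word], tag
        else:
            run, run_tag = [], None
    if run:
        mentions.append(" ".join(run))
    return mentions


EXAMPLE = (
    "95229799480 Cervicovaginal/JJ foetal/NEWGENE fibronectin/NEWGENE in/IN the/DT "
    "prediction/NN of/IN preterm/JJ labour/NN in/IN a/DT low-risk/JJ population/NN ./."
)
sentence_id, tokens = parse_genetag_line(EXAMPLE)
print(sentence_id, gene_mentions(tokens))
```

For the first example sentence this yields the identifier 95229799480 and the single mention "foetal fibronectin".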
1.2 YAPEX corpus
The YAPEX text collection [27] is publicly available at
http://www.sics.se/humle/projects/prothalt/, and consists of a training set of 99 abstracts
from MEDLINE related to protein binding, and a test set of 101 abstracts, of which 48
are relevant to protein binding and the rest were chosen randomly from the GENIA
corpus [28]. The gene/protein names in all these abstracts (45,143 words) were
manually annotated by domain experts connected to the YAPEX project, and are
indicated by the XML tag <Protname>… </Protname>. In this study, we used this
corpus as a gold standard to further evaluate the chosen programs in order to confirm their
performance. Example:
<PubmedArticle>
<MedlineID>21294781</MedlineID>
<PMID>11401507</PMID>
<ArticleTitle>Molecular dissection of the <Protname>importin beta1</Protname>-recognized
nuclear targeting signal of <Protname>parathyroid hormone-related
protein</Protname>.</ArticleTitle>
<AbstractText>Produced by various types of solid tumors, <Protname>parathyroid hormone-related protein</Protname> (<Protname>PTHrP</Protname>) is the causative agent of humoral
hypercalcemia of malignancy.</AbstractText>
</PubmedArticle>
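The annotated names can be pulled out of such a record with Python's standard XML parser. This is a minimal sketch that assumes each record is well-formed XML, as in the example above; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

RECORD = """
<PubmedArticle>
<MedlineID>21294781</MedlineID>
<PMID>11401507</PMID>
<ArticleTitle>Molecular dissection of the <Protname>importin beta1</Protname>-recognized
nuclear targeting signal of <Protname>parathyroid hormone-related
protein</Protname>.</ArticleTitle>
<AbstractText>Produced by various types of solid tumors, <Protname>parathyroid
hormone-related protein</Protname> (<Protname>PTHrP</Protname>) is the causative
agent of humoral hypercalcemia of malignancy.</AbstractText>
</PubmedArticle>
"""


def protein_names(record_xml):
    """Return the text of every <Protname> element, in document order."""
    root = ET.fromstring(record_xml.strip())
    names = []
    for element in root.iter("Protname"):
        text = "".join(element.itertext())
        names.append(" ".join(text.split()))  # collapse line breaks inside a name
    return names


print(protein_names(RECORD))
```

Whitespace is collapsed because an annotated name can run across a line break in the corpus file.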
1.3 Variants corpus
The Variants corpus is composed of 27,451 sentences from abstracts describing
known polymorphisms in Swiss-Prot. These sentences were recovered using an
automatic information retrieval method that was applied to the 5,664 proteins with
variants. For each protein of interest (called the “target” in this study), a
PubMed query was submitted to retrieve related articles using gene or protein names,
synonyms (from GPSDB), and a list of mutation-related keywords. This automatic
information retrieval method was based on the use of regular expressions (patterns) and
rules for the detection and validation of mutations. The variant terms in these articles
match patterns corresponding to the different observed notations mentioning the
position of single amino acid polymorphisms (SAPs) [30]. The SAP detection procedure
was used to recover relevant documents. Example:
Q9Y6Y9-1 | 18424732 | As predicted from the MD-2 structure , the P157S mutation had little or
no effect on MD-2 function . ...
Q9Y6R1-1 | 12444017 | [...a limited number of rat pancreatic duct cells. To examine the effects
of pRTA-related mutations, R342S and R554H, on pNBC-1 function, we performed functional
analysis and found that both mutants had...]
In the first sentence, MD-2 is a synonym of the target (Q9Y6Y9: Lymphocyte
antigen 96), i.e. the protein of interest for which the PubMed query was submitted to
retrieve related articles, and the mutation P157S refers to the target.
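A line of the Variants corpus can be split into its three fields and scanned for mutation mentions as follows. This is a hedged sketch: the regular expression covers only the one-letter notation seen in the examples (such as P157S), whereas the patterns of [30] cover more notations, and the field names are chosen here for illustration.

```python
import re

AMINO_ACID_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
# One-letter amino acid, position, one-letter amino acid (e.g. P157S).
SAP_PATTERN = re.compile(rf"\b[{AMINO_ACID_LETTERS}]\d+[{AMINO_ACID_LETTERS}]\b")


def parse_variant_line(line):
    """Split 'entry | PMID | sentence' and list the SAP-like tokens in the sentence."""
    entry, pmid, sentence = (part.strip() for part in line.split("|", 2))
    return {
        "entry": entry,
        "pmid": pmid,
        "sentence": sentence,
        "mutations": SAP_PATTERN.findall(sentence),
    }


LINE = (
    "Q9Y6Y9-1 | 18424732 | As predicted from the MD-2 structure , the P157S "
    "mutation had little or no effect on MD-2 function ."
)
print(parse_variant_line(LINE)["mutations"])
```

The split is limited to the first two separators so that a vertical bar inside the sentence itself would not break the parsing.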
This corpus is used to further evaluate the performance of the selected programs
(GAPSCORE and AIIAGMT), and to make sure that these programs were not
adjusted or trained on the gold standards. In this case, the evaluation was based on the
performance measures (precision and recall) computed on 245 randomly selected and
manually checked sentences.
2. Evaluation of gene/protein name recognition programs
We evaluated the performance of the algorithms (GAPSCORE and AIIAGMT)
against the GENETAG and YAPEX gold standards. For the comparison, we used the annotations
of the manually annotated corpora GENETAG and YAPEX, available from their web
sites. Every string identified by a run (GAPSCORE or AIIAGMT) is considered either a
true positive or a false positive. If the string matches a NEWGENE annotation in
GENETAG or a Protname annotation in YAPEX, it is counted as a true
positive; otherwise (there is no match), the string is counted as a false positive. If none of the
annotations of a gene given in the corpus is found by the prediction programs, then the
gene is counted as a false negative. A run is scored by counting the true positives (TP),
false positives (FP), and false negatives (FN).
We quantified the performance of the algorithms using recall and precision.
Precision = correctly predicted gene names / predictions [1]
or, equivalently,
Precision = TP / (TP + FP)
Recall = correctly predicted gene names / gene names [1]
or, equivalently,
Recall = TP / (TP + FN)
We assessed the performance of the algorithms on gene name matches. A
predicted gene name is counted as a match if it is identical to the corresponding name in
the gold standard or if it merely overlaps the name in the gold standard. Moreover, if two
predicted genes overlap the same multi-word gene name, both of them are considered correct.
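The counting rules above can be written down as a short Python sketch. It assumes that predictions and gold annotations are given as (start, end) character spans with an exclusive end, which is a representation chosen for this example rather than the format used by the evaluation scripts of this study.

```python
def overlaps(span_a, span_b):
    """True if two (start, end) spans share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def score_run(predicted, gold):
    """Count TP, FP and FN with overlap matching and derive precision and recall."""
    # Every prediction overlapping a gold name is a true positive, so two
    # predictions overlapping the same multi-word name both count as correct.
    tp = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    fp = len(predicted) - tp
    # A gold name that no prediction overlaps is a false negative.
    fn = sum(1 for g in gold if not any(overlaps(g, p) for p in predicted))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


# Gold: one two-word name at characters 27-45; predictions: its second word
# (34-45) and an unrelated word (87-90).
print(score_run(predicted=[(34, 45), (87, 90)], gold=[(27, 45)]))
```

In this toy case the first prediction overlaps the gold name (TP = 1), the second does not (FP = 1), and no gold name is missed (FN = 0), giving a precision of 0.5 and a recall of 1.0.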
3. Evaluation of GAPSCORE program
We first searched for gene and protein names in short texts by submitting some
sentences directly to the GAPSCORE server (web interface).
We then ran GAPSCORE on the GENETAG test set by accessing it from a computer
program (Perl) using XML-RPC (cf Appendix 1), and compared the performance
of the method against the GENETAG corpus annotations.
Since the GAPSCORE algorithm produces scores (0-1), we calculated the recall
and precision at a range of score cutoffs starting from 0.05, and then plotted the resulting
curve, which illustrates the tradeoff between recall and precision (Fig.1). We chose a
strict cutoff because our application (gene and protein name recognition) requires high
precision.
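The cutoff sweep can be sketched in Python as follows. The representation is an assumption made for the example: each prediction is a pair (score, index of the gold name it overlaps, or None for a false positive), and the 0.05 step between cutoffs is likewise assumed.

```python
def precision_recall_at(predictions, n_gold, cutoff):
    """Precision and recall when only predictions scoring >= cutoff are kept."""
    kept = [(score, gold) for score, gold in predictions if score >= cutoff]
    tp = sum(1 for _, gold in kept if gold is not None)
    fp = len(kept) - tp
    found = {gold for _, gold in kept if gold is not None}
    fn = n_gold - len(found)
    precision = tp / (tp + fp) if kept else 0.0  # convention when nothing is kept
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(predictions, n_gold, cutoffs):
    """One (cutoff, precision, recall) point per cutoff."""
    return [(c, *precision_recall_at(predictions, n_gold, c)) for c in cutoffs]


# Toy data: one gold name (index 0) and three scored predictions.
PREDICTIONS = [(1.0, 0), (0.065996, None), (0.061132, None)]
CUTOFFS = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

for cutoff, precision, recall in pr_curve(PREDICTIONS, n_gold=1, cutoffs=CUTOFFS):
    print(f"{cutoff:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the cutoff discards the low-scoring false positives first, which is why precision rises along the curve while recall can only stay level or fall.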
To confirm its performance and to make sure that the GAPSCORE program was not
adjusted or trained against the gold standard (GENETAG), we manually compared its
predictions against the Variants corpus. In this case, the evaluation was based on the
performance measures (precision and recall) computed on 245 randomly selected and
manually checked sentences.
4. Evaluation of AIIAGMT program
In order to see what AIIAGMT tags as genes and proteins, we used the Web
Services version of the gene mention tagger, which allows our computer program (Perl) to
call AIIAGMT from a remote site. The module AIIA::GMT, which is an XML-RPC client
of the web-service server (AIIA gene mention tagger), was used to accomplish this
task.
We compared the performance of the method against the annotations of the two
manually annotated corpora GENETAG and YAPEX, and then against the Variants
corpus to confirm its performance and to make sure that the AIIAGMT program was not
adjusted or trained against the gold standards. The evaluation against the Variants corpus
was based on the performance measures computed on 245 randomly selected and
manually checked sentences.
II. Application: Variants extraction
Yip YL et al. [30] describe an automatic method to retrieve
variant-related articles that might be useful to update Swiss-Prot information. The aim of
this application (variants extraction) is to improve the precision of this
automatic method by combining the two gene and protein name recognition programs
evaluated in this study. To do so, we ran AIIAGMT and GAPSCORE on the
Variants corpus (which is the output of the pipeline for variant information update), then
we created a Perl program to match the results (program predictions) with a list of gene
and protein synonyms (called target), which is a collection of gene and protein synonyms
extracted from GPSDB (Gene and Protein Synonym Database) [21]. If there is a match,
the sentence is counted as a matching sentence; otherwise, it is counted as a missing
sentence.
In order to verify whether the mutations in the Variants corpus actually refer to the
gene/protein of interest (“target”), we manually evaluated one hundred matching sentences
(with target) and one hundred missing sentences (missing target), randomly selected.
We also evaluated the intersection of the two programs (AIIAGMT, GAPSCORE) on
the 245 sentences previously used (sections 3 and 4).
In the matching sentences, if the mutation refers to the target, the sentence is counted as
correct. If the mutation refers to another gene or protein, the sentence is counted as
incorrect.
In the missing sentences, if the mutation refers to the target, the sentence is counted as
correct. If the mutation refers to another gene or protein, the sentence is counted as
incorrect.
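The intersection of the two programs mentioned above can be sketched as follows. The matching rule is an assumption: a name is kept only when both programs predict exactly the same string, ignoring case, whereas a span-overlap rule would be an equally plausible choice. The second list in the example is hypothetical and serves only to illustrate the filtering.

```python
def intersect_predictions(first, second):
    """Keep the names from `first` that `second` also predicted (case-insensitive)."""
    second_keys = {name.lower() for name in second}
    return [name for name in first if name.lower() in second_keys]


program_a = ["H90N mutation", "TK2 gene"]
program_b = ["TK2 gene", "mtDNA"]  # hypothetical output, for illustration only
print(intersect_predictions(program_a, program_b))
```

A false positive made by only one program, such as "H90N mutation" here, is dropped, which raises precision. Any true name found by only one program is dropped as well, which is the cost in recall.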
Example: Synonyms of the same entity (from GPSDB).
O00142 2.7.1.21
O00142 EC 2.7.1.21
O00142 Mt-TK
O00142 thymidine kinase 2, mitochondrial
O00142 Thymidine kinase 2, mitochondrial
O00142 THYMIDINE KINASE, MITOCHONDRIAL
O00142 TK2
The AIIAGMT predictions corresponding to the same entry:
O00142-1 | 17951082 | We report an unusual case of IMM , homozygous for the <GENE>H90N
mutation</GENE> in the <GENE>TK2 gene</GENE> but unlike other cases with the same
mutation , does not demonstrate mtDNA depletion . ...
In this example, two genes were predicted by AIIAGMT but only one (TK2 gene) is a
true positive (the H90N mutation is a false positive). TK2 is then matched against
the list of synonyms corresponding to the same Swiss-Prot entry.
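The matching step for this example can be sketched in Python. This is not the Perl program used in the study. The rule applied here is an assumption: a prediction matches when a synonym occurs in it as a whole word or phrase, ignoring case, which is what lets "TK2 gene" match the synonym "TK2".

```python
import re

GENE_TAG = re.compile(r"<GENE>(.*?)</GENE>", re.DOTALL)

SYNONYMS = [
    "2.7.1.21",
    "EC 2.7.1.21",
    "Mt-TK",
    "thymidine kinase 2, mitochondrial",
    "THYMIDINE KINASE, MITOCHONDRIAL",
    "TK2",
]

SENTENCE = (
    "We report an unusual case of IMM , homozygous for the <GENE>H90N mutation</GENE> "
    "in the <GENE>TK2 gene</GENE> but unlike other cases with the same mutation , "
    "does not demonstrate mtDNA depletion ."
)


def tagged_genes(text):
    """Return the strings enclosed in <GENE>...</GENE> tags."""
    return [" ".join(match.split()) for match in GENE_TAG.findall(text)]


def matches_target(prediction, synonyms):
    """True if any synonym occurs in the prediction as a whole word or phrase."""
    for synonym in synonyms:
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        if re.search(pattern, prediction, flags=re.IGNORECASE):
            return True
    return False


def is_matching_sentence(text, synonyms):
    """A sentence matches when at least one tagged gene matches the target."""
    return any(matches_target(gene, synonyms) for gene in tagged_genes(text))


print(tagged_genes(SENTENCE), is_matching_sentence(SENTENCE, SYNONYMS))
```

Synonyms are escaped before being used in the pattern because some of them, such as EC numbers, contain characters that are special in regular expressions.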
3. Results
3.1- GAPSCORE
We have submitted the following sentence to the server: “We observed an
increase in mitogen-activated protein kinase (MAPK) activity”. The search for gene and
protein names in this text returns the following predictions with their scores.
No. | Gene or Protein Name | Quality (Score)
1 | MAPK | Excellent (1.00)
2 | mitogen-activated protein kinase | Excellent (1.00)
3 | increase | Poor (0.06)
4 | We | Poor (0.00)
In this example, GAPSCORE predicts four gene and protein names, not all of which are
correct. “mitogen-activated protein kinase” and “MAPK” are true positives with an
excellent score (1.00); “increase” and “We” are false recognitions, but
they have poor scores (0.06 and 0.00).
3.1.1 Performance of GAPSCORE on GENETAG corpus
When we ran GAPSCORE on the GENETAG test set, the server responded to each
query by searching for gene and protein names in the submitted string and returning an
array of the names found. Each element of the returned array is itself an array of:
[string name, int start, int end, double score]. Example: for this sentence
95229799480 Cervicovaginal/JJ foetal/NEWGENE fibronectin/NEWGENE in/IN the/DT
prediction/NN of/IN preterm/JJ labour/NN in/IN a/DT low-risk/JJ population/NN ./.
We obtained the following result:
Gene/Protein name    onset    offset    SCORE
fibronectin          34       45        1
low                  87       90        0.065996
prediction           53       63        0.061132
Cervicovaginal       12       26        1.00E-300
foetal               27       33        1.00E-300
preterm              67       74        1.00E-300
GAPSCORE predicts gene and protein names, their positions (onset and offset) and their
scores. A name can receive the highest possible score (e.g. fibronectin: score 1).
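The onset and offset values can be checked against the input string. The short Python sketch below assumes that the string submitted to the server was the sentence with its identifier, without the part-of-speech tags and with tokens separated by single spaces; under that assumption the reported positions behave as 0-based indexes with an exclusive end. The function name `check_offsets` is hypothetical.

```python
# Assumed input string: identifier plus untagged tokens separated by single spaces
sentence = (
    "95229799480 Cervicovaginal foetal fibronectin in the prediction "
    "of preterm labour in a low-risk population ."
)

# (name, onset, offset, score) as reported for this sentence
predictions = [
    ("fibronectin", 34, 45, 1.0),
    ("low", 87, 90, 0.065996),
    ("prediction", 53, 63, 0.061132),
    ("Cervicovaginal", 12, 26, 1.00e-300),
    ("foetal", 27, 33, 1.00e-300),
    ("preterm", 67, 74, 1.00e-300),
]

def check_offsets(text, preds):
    """Return the names whose [onset, offset) slice does not reproduce the name."""
    return [name for name, start, end, _ in preds if text[start:end] != name]

mismatches = check_offsets(sentence, predictions)
print(mismatches)
```

An empty list means that every slice `sentence[onset:offset]` reproduces the predicted name.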
GAPSCORE achieved a different recall and precision at each score cutoff, starting from
0.05. The resulting curve illustrates the tradeoff between recall and precision (Fig.1).
[Figure: "GS: Precision vs Recall"; x-axis: Recall (0 to 1), y-axis: Precision (0 to 1)]
Fig.1: Performance of GAPSCORE on GENETAG.
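How such a curve is obtained can be sketched as follows. This is a simplified illustration, not the scoring procedure used for Fig.1: it compares single words rather than full mentions, it reuses the single example sentence above as toy data, and the function names are hypothetical. The gold set assumes that the words tagged NEWGENE in GENETAG ("foetal", "fibronectin") are the correct answers.

```python
def precision_recall(predicted, gold):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN), computed on sets."""
    tp = len(predicted & gold)
    # Convention assumed here: an empty prediction set has precision 1.0
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall

def sweep(scored_predictions, gold, cutoffs):
    """Keep predictions scoring at least each cutoff and report (cutoff, P, R)."""
    curve = []
    for cutoff in cutoffs:
        kept = {name for name, score in scored_predictions if score >= cutoff}
        curve.append((cutoff, *precision_recall(kept, gold)))
    return curve

# Toy data: word-level predictions for the example sentence
scored = [
    ("fibronectin", 1.0), ("low", 0.065996), ("prediction", 0.061132),
    ("Cervicovaginal", 1e-300), ("foetal", 1e-300), ("preterm", 1e-300),
]
gold = {"foetal", "fibronectin"}

curve = sweep(scored, gold, [0.0, 0.05, 0.2])
for cutoff, p, r in curve:
    print(f"cutoff={cutoff:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy sentence, raising the cutoff removes low-scoring false positives and so raises precision, while recall can only stay the same or fall.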
To identify gene and protein names reliably in the text, we chose a
strict score cutoff of 0.2 (shown on the curve) because this application (gene and
protein name recognition) requires high precision. At this cutoff the precision is 81%,
which means that GAPSCORE produced few false positives. The corresponding
recall is 63% (tab.1), which means that the program produced a fair number of false negatives.
In [1] (Chang et al., 2004), the evaluation of GAPSCORE against the
YAPEX data set gave good results (83.3% recall and 81.5% precision for partial
matches, and 58.5% recall and 56.7% precision for exact matches). Despite this
good result, we evaluated GAPSCORE against the GENETAG data set to make
sure that the program had not been adjusted or trained against the YAPEX data set.
3.1.2 Performance of GAPSCORE on Variants corpus
To confirm its performance and to make sure that GAPSCORE had not been
adjusted or trained against the gold standard (GENETAG), we manually compared its
predictions with the Variants corpus. We manually evaluated precision and recall
on 245 randomly selected sentences. Against the Variants corpus, GAPSCORE
achieved 72% precision and 88% recall (tab.2).
This result might suggest that GAPSCORE was adjusted or trained
on the gold standards, since its precision is 9 percentage points lower than on the two
annotated corpora (YAPEX and GENETAG), while its recall is higher (by 25 points
against GENETAG and 5 points against YAPEX).
3.2- AIIAGMT
3.2.1 Performance of AIIAGMT on GENETAG and
YAPEX corpora
When we ran AIIAGMT on the GENETAG and YAPEX test sets, the server
returned an array reference containing all the entities recognized in our input, together
with their position information.
e.g.: 95229799480 Cervicovaginal foetal fibronectin in the prediction of preterm labour in a low-risk population
Onset    Gene/Protein name
27       Foetal fibronectin
In this example, AIIAGMT predicts “foetal fibronectin” as a gene/protein name at
position 27.
We compared the predictions of AIIAGMT with the annotations of the
GENETAG and YAPEX corpora. AIIAGMT achieved 91% precision and 84%
recall against the GENETAG data set, and 97% precision and 98% recall against the
YAPEX data set (tab.1). This means that AIIAGMT produced very few false
positives and very few false negatives.
                 AIIAGMT                  GAPSCORE
                 Precision    Recall      Precision    Recall
YAPEX            97%          98%         81.5%        83.3%
GENETAG (GT)     91%          84%         81%          63%
tab.1: Performance of AIIAGMT and GAPSCORE on YAPEX and GENETAG corpus
Both programs perform well, but on the two corpora AIIAGMT achieved better
precision (97% on YAPEX and 91% on GENETAG) than GAPSCORE (81.5% and 81%),
and better recall (98% and 84% versus 83.3% and 63%) (tab.1).
We evaluated AIIAGMT against the two corpora, GENETAG and YAPEX, to confirm
its performance. On both data sets, AIIAGMT achieved good precision and recall.
To make sure that AIIAGMT had not been adjusted or trained against the two manually
annotated corpora, a further evaluation against the Variants corpus was required.
3.2.2 Performance of AIIAGMT on Variants corpus
To verify that AIIAGMT had not been adjusted or trained against the two manually
annotated corpora (the gold standards), knowing that the GENETAG and YAPEX corpora
were annotated before its release, we ran AIIAGMT on the Variants corpus (not
annotated). In this case, the server returned an array reference containing all the
entities recognized in the Variants corpus, together with their position information.
The AIIAGMT predictions resulted in 24585 tagged sentences out of the 27451
sentences of the Variants corpus (a sentence is considered tagged if at least one gene
or protein name is predicted in it). The XML tag <GENE>…</GENE> was
used to label each gene/protein name predicted by the program, for example:
Q9Y4H2-1 | 17051426 | ...We examine the 19 CA repeat of the <GENE>IGF1 gene</GENE>,
the -202 C > A <GENE>IGFBP3</GENE>, the G972R IRS, and the G1057D IRS2
polymorphisms among 1,175 non-Hispanic white NHW and 576 Hispanic newly diagnosed
breast ...
In this example, two genes (IGF1 gene and IGFBP3) were found by the program and both
are true positives. ‘IRS’ and ‘IRS2’ are false negatives.
Our evaluation was based on 245 randomly selected and manually reviewed
sentences. Against the Variants corpus, AIIAGMT achieved 75% precision and
75% recall (tab.2), which means that 75% of the predicted gene and protein names are
relevant and 75% of the relevant names were predicted.
             Precision    Recall
GAPSCORE     72%          88%
AIIAGMT      75%          75%
tab.2: Performance of AIIAGMT and GAPSCORE on Variants corpus
These results confirm that AIIAGMT achieves the good precision and recall
required for our application, i.e. recognition of gene and protein names in text. Outside
the manually annotated corpora, AIIAGMT still performs well (75%),
but not well enough to exclude that it was adjusted or trained on the gold standards, since
its precision is 22 percentage points lower than on the YAPEX data set and 16 points
lower than on the GENETAG corpus.
3.3 Improving the precision of extraction
3.3.1 Improving the precision of gene and protein names
extraction on GENETAG corpus:
Given the results of AIIAGMT and GAPSCORE against
the Variants corpus, we combined the two programs by union and by intersection.
We considered the intersection to check whether the precision of gene and protein name
extraction could be improved at the cost of a small reduction in recall, whereas the
union can improve recall at the cost of a reduction in precision.
The intersection AIIAGMT-GAPSCORE achieved 97% precision and
80% recall, and the union AIIAGMT-GAPSCORE achieved 79% precision and 90%
recall (tab.3).
                                  Precision    Recall
Intersection AIIAGMT-GAPSCORE     97%          80%
Union AIIAGMT-GAPSCORE            79%          90%
Tab.3: Performance of the intersection and union of AIIAGMT-GAPSCORE on GENETAG corpus
Since the union of both programs produced a large number of false positives,
precision dropped by 2 percentage points relative to GAPSCORE alone and by
12 points relative to AIIAGMT alone. Recall increased by 27 points relative to
GAPSCORE and by 6 points relative to AIIAGMT.
The intersection AIIAGMT-GAPSCORE improved precision by 6 points relative to
AIIAGMT and by 16 points relative to GAPSCORE, because the
intersection filters out false positives. Recall increased by 17 points relative to
GAPSCORE and decreased by 4 points relative to AIIAGMT.
It is therefore the intersection of the two programs that improves the precision of
gene and protein name extraction, by filtering out false recognitions.
3.3.2 Improving the precision of gene and protein names
extraction on Variants corpus
To improve the precision of gene and protein name extraction on the Variants
corpus, we considered the intersection of the results of the two programs on
245 sentences of the Variants corpus. We also applied the union, since it could improve
recall at the cost of a small reduction in precision.
The intersection AIIAGMT-GAPSCORE achieved 96% precision
and 70% recall, and the union AIIAGMT-GAPSCORE achieved 64% precision and
79% recall (tab.4).
                                  Precision    Recall
Intersection AIIAGMT-GAPSCORE     96%          70%
Union AIIAGMT-GAPSCORE            64%          79%
Tab.4: Performance of the intersection and union of AIIAGMT-GAPSCORE on Variants corpus
Since the union of the programs produced a large number of false positives,
precision dropped by 8 percentage points relative to GAPSCORE alone and by
11 points relative to AIIAGMT alone. Recall decreased by 9 points relative to
GAPSCORE and increased by 4 points relative to AIIAGMT.
The intersection AIIAGMT-GAPSCORE improved precision by 21 points relative to
AIIAGMT and by 24 points relative to GAPSCORE, because the
intersection filters out false recognitions and therefore leaves few false positives.
Recall decreased by 18 points relative to GAPSCORE and by 5 points relative to
AIIAGMT.
These results confirm that the intersection can improve the precision of gene and protein
names extraction at the cost of a small reduction in recall.
3.4- Application: Variant extraction
To verify whether the mutations in the Variants corpus refer to the ‘target’ gene or
protein (the protein of interest for which the PubMed query was submitted to retrieve the
related articles) or to another gene or protein, we manually checked 194 matching
sentences and 121 missing sentences, randomly selected from the intersection of the
programs previously evaluated (AIIAGMT and GAPSCORE).
As seen in section 3.2.2, of the 27451 sentences of the Variants corpus, only 24585
were tagged with at least one gene/protein name (AIIAGMT prediction). These
24585 sentences were matched against a list of gene and protein synonyms of the protein
of interest (‘target’). If a string identified by AIIAGMT matches
one synonym in the list, it is considered a target and is tagged with
the XML tag <target>…</target>. Of the 24585 tagged sentences, 12867 sentences
match the list of synonyms (matching sentences) and 11698 sentences have no match
(missing sentences). Example:
Q04771-1 | 17572636 | Functional modeling of the
<GENE><target>ACVR1</target></GENE>R206H mutation in <target>FOP</target> . ...
The List of synonyms of the same entity (from GPSDB):
Q04771 Activin receptor type I
Q04771 ACTRI
Q04771 ACTR-I
Q04771 ACVR1
Q04771 ALK2
Q04771 ALK-2
Q04771 EC 2.7.11.30
Q04771 FOP
Q04771 hydroxyalkyl-protein kinase
Q04771 serine/threonine-protein kinase receptor R1
In this example, AIIAGMT predicted ACVR1 as a gene/protein name (tagged with
<GENE>…</GENE>), and it is indeed a synonym of the entity (tagged with <target>
…</target>); FOP was not predicted by the program even though it is a synonym. So
ACVR1 is a true positive and FOP is a false negative.
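The tagging of targets can be sketched in Python as follows. This is a hedged illustration rather than the implementation used here: the function names are hypothetical, matching is assumed to be case-sensitive and restricted to whole tokens, every occurrence of a synonym in the sentence is wrapped in <target>…</target> (as in the example, where FOP is tagged although it was not predicted), and a sentence is assumed to be a matching sentence when a <target> appears inside a <GENE> element.

```python
import re

def build_target_pattern(synonyms):
    """Compile one pattern matching any synonym as a whole token (longest first)."""
    if not synonyms:
        raise ValueError("at least one synonym is required")
    ordered = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

def tag_targets(sentence, pattern):
    """Wrap every synonym occurrence in <target>...</target>."""
    return pattern.sub(lambda m: "<target>" + m.group(0) + "</target>", sentence)

def is_matching_sentence(tagged_sentence):
    """Assumed rule: matching if some predicted <GENE> span contains a target."""
    spans = re.findall(r"<GENE>(.*?)</GENE>", tagged_sentence)
    return any("<target>" in span for span in spans)

synonyms = [
    "Activin receptor type I", "ACTRI", "ACTR-I", "ACVR1", "ALK2", "ALK-2",
    "EC 2.7.11.30", "FOP", "hydroxyalkyl-protein kinase",
    "serine/threonine-protein kinase receptor R1",
]
pattern = build_target_pattern(synonyms)
sentence = "Functional modeling of the <GENE>ACVR1</GENE>R206H mutation in FOP ."
tagged = tag_targets(sentence, pattern)
print(tagged)
print(is_matching_sentence(tagged))
```

A sentence in which a synonym occurs only outside the predicted <GENE> spans would, under this assumed rule, still be counted as a missing sentence.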
For practical reasons, GAPSCORE was not run on the whole Variants corpus. We
considered only the results for the 245 sentences previously obtained (section 3.1.2), and
applied the intersection of AIIAGMT and GAPSCORE to these sentences. This
intersection resulted in 116 sentences, of which 94 were matching sentences and
21 were missing sentences.
To try to improve the precision of the automatic method used to retrieve variants-related
articles, we randomly selected 194 matching sentences and 121 missing
sentences from the intersection of both programs (AIIAGMT and GAPSCORE
predictions), because the intersection improves the precision of information extraction.
These two subsets of sentences were manually checked by verifying whether the
mutation refers to the gene or protein of interest (‘target’). This evaluation yielded four
categories of sentences according to their correctness (tab.5):
Category of sentence          Correct (number    Incorrect    Total number
                              of sentences)      (%)          of sentences
With target                   133                0            133
With target and gene          61                 0            61
Without target nor gene       113                0            113
Without target, with gene     5                  37.5         8
Tab.5: category of sentences in Variants corpus.
In the missing sentences (without target) that contain another gene, 62.5% (5/8*100)
are correct, which means that the mutations refer to the target gene/protein even though
another gene or protein is present in the sentence (5 sentences). If the mutations refer to
another gene or protein, the sentence is counted as incorrect (37.5%).
Correct sentence: Example
Q04771 -1 | 18979151 | of this metamorphogene are beginning to provide deep insight into a
highly conserved signaling pathway that regulates tissue stability following
morphogenesis , and that when damaged at a highly specific locus c.617G > A ; R206H ,
and triggered. …
In this example, the mutation R206H refers to the gene/protein of interest Q04771
(Activin receptor type-1).
Incorrect sentence: Example
P21802 -1 | 15942838 | ...and an elevated <GENE>transferrin</GENE> saturation 96
%, but a negative test for <GENE>HFE gene</GENE> mutations such as
<GENE>C282Y</GENE> and H63D. FINDINGS: Using the mini-laparascopic
technique we diagnosed a smallnodular liver...
In this example, the mutations C282Y and H63D refer to the HFE gene, which is a true
positive but is not a synonym of the gene/protein of interest P21802 (Fibroblast growth
factor receptor 2). ‘transferrin’ is also a true positive, but ‘C282Y’ is a false
positive.
In the matching sentences, 31% of the correct sentences (61 of 194) contain another gene
or protein; the sentence is nevertheless counted as correct because the mutation refers to
the gene/protein of interest. Example: correct sentence
P07949 -1 | 11438491 | ... severely impaired the <GENE>phospholipase C-gamma</GENE> signaling pathway in SK-N-MC cells. S765P, R873Q, F893L, R897Q,
and E921K mutations resulted in a complete loss of the
<GENE><target>RET</target> kinase</GENE> activity. The P973L...
In this example, the mutations S765P, R873Q, F893L, R897Q, and E921K refer to the
target gene P07949 (Proto-oncogene tyrosine-protein kinase receptor ret) despite the
presence of the ‘phospholipase C-gamma’ gene, which is a true positive.
These results show the correctness of sentences in the Variants corpus, which is needed
to decide whether they should be kept in or eliminated from the ModSNP database. In this
data set, 37.5% of the missing sentences that contain a gene are incorrect, which means
that the mutation refers to another gene and the protein of interest is absent.
4. Discussion
In this work, we set out to evaluate gene and protein name recognition
programs in order to select the best one or to combine them, so as to improve the
precision of gene and protein name extraction at the cost of a small reduction in recall.
For the evaluation, we used the two metrics commonly used in gene and protein name
identification, precision and recall, which measure the two types of mistakes (false
positives and false negatives) that can be made during gene name identification. In this
study we did not distinguish between genes and proteins; we used ‘gene’ or ‘protein’
generically to mean both.
The evaluation was performed against two manually annotated corpora
specifically tailored for gene and protein name identification (GENETAG and YAPEX).
Both programs performed well: 81% precision for GAPSCORE and
91% precision for AIIAGMT against the GENETAG corpus. This good performance
may be due to the programs having been trained or adjusted on the gold standards,
since the two corpora were manually annotated before the two programs were
developed. We therefore evaluated the programs against another corpus which is not
annotated: the ‘Variants corpus’. This evaluation gave 72% precision for
GAPSCORE and 75% precision for AIIAGMT, which could suggest
that the programs were adjusted or trained on the gold standards, since precision
decreased by 9 percentage points for GAPSCORE and by 16 points for AIIAGMT relative
to the GENETAG corpus. The lower precision of GAPSCORE (72%) and AIIAGMT (75%)
is probably explained by the fact that the two systems do not deal with ambiguity, other
than filtering out the most ambiguous terms, so some false positives remain. The 75%
recall of AIIAGMT means that it produced more false
negatives than GAPSCORE (88% recall).
To improve the precision of gene and protein name extraction at the cost of a
small reduction in recall, we considered the intersection of the two programs (AIIAGMT
and GAPSCORE) on the GENETAG corpus and on the Variants corpus, since the
intersection reduces the number of false recognitions of gene and protein names. The
union, in contrast, can enhance recall at the cost of a reduction in precision, since it can
produce a large number of false positives. By combining the two programs via
intersection, we improved the precision of gene/protein name extraction.
Combining programs to improve the precision of gene and protein name
extraction is also the aim of the BioCreative MetaServer (BCMS), a prototype
platform of 13 research groups that provides combined data from the
annotations of the thirteen servers. However, the BioCreative MetaServer works on a
static collection of text (a set of approximately 22,800 PubMed abstracts used in the
BioCreative II challenge), which serves as its internal database. For this reason, we
could not use the BCMS to identify gene and protein names in newly published
articles. In the near future, the BioCreative MetaServer could nevertheless become a good
program for gene and protein name recognition with good precision.
The aim of these evaluations was to use gene and protein name recognition
programs in other applications such as variant extraction. These programs are necessary
to link information found in variants-related articles (variants, mutations and
polymorphisms) to a specific biological entity.
In this study, we tried to improve the performance of the automatic retrieval
approach for the extraction of new polymorphisms by applying the intersection of the two
programs, since it can enhance information extraction by filtering out false
recognitions. To do so, we verified whether the mutations in the Variants corpus refer to
the gene or protein of interest (target) or to another gene.
These evaluations found 100% correct sentences among the matching sentences,
which means that in all of them the mutations refer to the target,
despite the presence of another gene/protein in 61 sentences. These correct sentences
will be stored in tables of the ModSNP database.
In the missing sentences, the protein of interest (‘target’) is not mentioned. Among
those that contain another gene, 62.5% have mutations referring to the target despite the
presence of the other gene, and 37.5% are
incorrect (the mutations refer to the other gene, which is present). However, this result
(37.5% incorrect sentences) is not statistically significant because we evaluated only a
small data set (194 matching sentences and 121 missing sentences, of which only 8
contain another gene). Consequently, we cannot derive a
general rule about the importance of identifying another gene (other than
the target) in the incorrect sentences. To obtain a more significant result, the data set
should be expanded before deciding whether the incorrect sentences will be kept in or
eliminated from the ModSNP database.
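One way to make the small-sample caveat concrete is to attach a confidence interval to the proportion of incorrect sentences. The Python sketch below uses the Wilson score interval, which is a standard choice for small samples but was not part of the original analysis; the counts (3 incorrect sentences out of 8) are taken from Tab.5, and the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# 8 missing sentences contain another gene; 5 are correct, so 3 are incorrect (Tab.5)
low, high = wilson_interval(3, 8)
print(f"incorrect: 3/8 = {3 / 8:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

With only 8 sentences the interval is very wide (roughly 14% to 69%), which supports the view that the 37.5% figure cannot yet be generalised.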
5. Conclusion
We evaluated two gene and protein name recognition programs (GAPSCORE and
AIIAGMT). The evaluation was performed against two manually annotated corpora
(gold standards: GENETAG and YAPEX), on which both programs performed well.
To confirm the performance of the chosen programs and to make sure that they
had not been adjusted or trained on the gold standards, we also evaluated them against
the Variants corpus. In this case, performance was somewhat lower, which
might suggest that the programs were adjusted or trained on the gold standards.
To improve the precision of gene and protein name extraction at the cost of a small
reduction in recall, we applied the intersection of the two programs, since the intersection
filters out false recognitions. The union, in contrast, could enhance recall at the cost
of a reduction in precision.
To try to improve the precision of the automatic method used to retrieve variants-related
articles, we applied the intersection of the programs (GAPSCORE and AIIAGMT) to the
Variants corpus. This application yielded 100% correct sentences among the matching
sentences and 37.5% incorrect sentences among the missing sentences that contain another gene.
Appendix
1. XML-RPC:
XML-RPC is a Remote Procedure Calling protocol that works over the Internet.
An XML-RPC message is an HTTP-POST request. The body of the request is in XML. A
procedure executes on the server and the value it returns is also formatted in XML.
Procedure parameters can be scalars, numbers, strings, dates, etc.; and can also be complex record
and list structures.
2. BioNLP Web Services:
The URI for XML-RPC server is: http://bionlp.stanford.edu/xmlrpc
The server can respond to two queries:
1- find_abbreviations
INPUT: string
OUTPUT: array
This function will search for the abbreviations in a string and return an array of the abbreviations
found. Each element of the returned array is itself an array of:
[string long form, string abbreviation, double score]
2- find_gene_and_protein_names
INPUT: string
OUTPUT: array
This function will search for the gene and protein names in a string and return an array of the
names found. Each element of the returned array is itself an array of:
[string name, int start, int end, double score]
start and end are indexes into the input string that describe where the name was found. The
indexes begin at 0, and the end index is exclusive.
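A call to this service can be sketched with Python's standard `xmlrpc.client` module. The URI and the method name `find_gene_and_protein_names` are those given above; everything else (the helper names, the `min_score` parameter) is an illustrative assumption, and the server may no longer be reachable, so the live call is left commented out.

```python
import xmlrpc.client

# URI as given in this appendix; the service may no longer respond
SERVER_URI = "http://bionlp.stanford.edu/xmlrpc"

def build_request(text):
    """Serialise the call as the XML body that would be sent by HTTP POST."""
    return xmlrpc.client.dumps((text,), methodname="find_gene_and_protein_names")

def find_names(server, text, min_score=0.0):
    """Call the service and keep [name, start, end, score] hits above min_score."""
    hits = server.find_gene_and_protein_names(text)
    return [(name, start, end, score)
            for name, start, end, score in hits
            if score >= min_score]

text = "We observed an increase in mitogen-activated protein kinase (MAPK) activity"
print(build_request(text))

# To contact the live service (untested assumption that it is still online):
# server = xmlrpc.client.ServerProxy(SERVER_URI)
# for name, start, end, score in find_names(server, text, min_score=0.2):
#     print(name, start, end, score, text[start:end])
```

Because `find_names` only requires an object exposing `find_gene_and_protein_names`, it can be exercised with a stand-in object when the server is unavailable.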
Bibliography
Articles:
1. GAPSCORE: finding gene and protein names one word at a time. Chang JT, Schütze H,
Altman RB. Department of Genetics, Stanford Medical Center, 300 Pasteur Drive, Lane L 301,
Mail Code 5120, Stanford, CA 94305-5120, USA.
2. Lynette Hirschmann, Martin Krallinger, and Alfonso Valencia, editors. Proceedings of the
Second BioCreative Challenge Evaluation Workshop. Centro Nacional de Investigaciones
Oncologicas, CNIO, Madrid, Spain, 2007.
3. Hanisch, D., Fundel, K., Mevissen, H., Zimmer, R., Fluck, J.: ProMiner: Rule based protein
and gene entity recognition. BMC Bioinformatics, 6(Suppl 1):S14, 2005.
4. PowerBioNE: Recognizing names in biomedical texts: a machine learning approach. by:
G Zhou, J Zhang, J Su, D Shen, C Tan. Bioinformatics, Vol. 20, No. 7. (1 May 2004), pp. 1178-1190
5. BioTagger-GM: A Gene/Protein Name Recognition System. Manabu Torii,
Zhangzhi Hu, Cathy H. Wu and Hongfang Liu
6. Introducing meta-services for biomedical information extraction. Florian Leitner, Martin
Krallinger, Carlos Rodriguez-Penagos, Jörg Hakenberg, Conrad Plake, Cheng-Ju Kuo,
Chun-Nan Hsu, Genome Biology 2008, 9(Suppl 2):S6
7. Tetratricopeptide-motif-mediated interaction of FANCG with recombination proteins
XRCC3 and BRCA2. Hussain S, Wilson JB, Blom E, Thompson LH, Sung P, Gordon SM,
Kupfer GM, Joenje H, Mathew CG, Jones NJ.Department of Medical and Molecular Genetics,
King's College London School of Medicine at Guy's Hospital, London SE1 9RT, UK.
8. Settles B: ABNER: an open source tool for automatically tagging genes, proteins and
other entity names in text. Bioinformatics 2005, 21:3191-3192
9. GENETAG: a tagged corpus for gene/protein named entity recognition Lorraine Tanabe,
Natalie Xie, Lynne H Thom, Wayne Matten, W John Wilbur BMC Bioinformatics 2005, 6(Suppl
1):S3 (24 May 2005)
10. The BioCreAtIvE - Critical Assessment for Information Extraction in Biology Challenge
Genome Biology – Biology for the post-genomic era 2008 volume 9 supplement 2
11. Smith L, Tanabe LK, Johnson nee Ando R, Kuo C-J, Chung I-F, Hsu CN, Lin Y-S, Klinger
R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A,
Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RT-H, Dai H-J, Liu F, Chen Y, Sun C,
Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, et al.: Overview of
BioCreative II gene mention recognition. Genome Biology 2008, 9(Suppl 2):S2.
12. Kuo CJ, Chang YM, Huang HS, Lin KT, Yang BH, Lin YS, Hsu CN, Chung IF: Rich
feature set, unification of bidirectional parsing and dictionary filtering for high F-score
gene mention tagging.
In Proceedings of the Second BioCreative Challenge Workshop Madrid, Spain. CNIO; 200
13. BioCreative Homepage [http://biocreative.sourceforge.net/]
14. XML-RPC Specification [http://www.xmlrpc.com/]
15. BioCreative MetaServer [http://bcms.bioinfo.cnio.es/]
16. BioCreative XML-RPC MetaService [http://bcms.bioinfo.cnio.es/xmlrpc/]
17. GAPSCORE (Chang et al., 2004) - http://bionlp.stanford.edu/GAPSCORE/
18. GENETAG ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz
19. AIIAGMT: http://aiia.iis.sinica.edu.tw/
20. Kinoshita S, Cohen KB, Ogren PV, Hunter L: BioCreAtIvE Task1A: entity identification
with a stochastic tagger. BMC Bioinformatics 2005, 6(suppl 1):S4.
21. Pillet V, Zehnder M, Seewald AK, Veuthey AL, Petrak J: GPSDB: a new database for
synonyms expansion of gene and protein names. Bioinformatics 2005, 21:1743-1744.
22. Carpenter, B. Phrasal queries with LingPipe and Lucene: ad hoc genomics text retrieval;
NIST Special Publication: SP 500-261 The Thirteenth Text Retrieval Conference; TREC.
2004.
23. Yoshida M, Fukuda K, Takagi T. PNAD-CSS: a workbench for constructing a protein
name abbreviation dictionary.Bioinformatics. 2000 Feb;16(2):169-75
24. Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical
named entity recognition. Pac Symp Biocomput. 2008; 652-663
25. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. Text processing
through Web services: calling Whatizit. Bioinformatics. 2008 Jan 15;24(2):296-8. Epub 2007
Nov 15.
26. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd
printing) Beatrice Santorini June 1990
27. Franzén,K., Eriksson,G., Olsson,F., Asker,L., Liden,P. and Coster,J. (2002) Protein names
and how to find them. Int. J. Med. Inform., 67, 49–61.
28. Ohta,T., Tateisi,Y., Mima,H. and Tsujii,J. (2002) Genia corpus: an annotated research
abstract corpus in molecular biology domain. In Proceedings of the Human Language
Technology Conference
29. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The
Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure
information on human protein variants. Hum Mutat. 2004 May. 23(5):464-70.
30. Yip YL, et al, RETRIEVING MUTATION-SPECIFIC INFORMATION FOR HUMAN
PROTEINS IN UNIPROT /SWISS-PROT KNOWLEDGEBASE
31. GPSDB: http://expasy.org/cgi-bin/gpsdb/form
32. SNPs: http://www.ncbi.nlm.nih.gov/About/primer/snps.html
33. Krallinger et al; Linking genes to literature: text mining, information extraction, and
retrieval applications for biology
34. Hirschman, L. et al. (2005) Overview of BioCreAtIvE: critical assessment of information
extraction for biology. BMC Bioinformatics 6, S1
35. Text mining and its potential applications in systems biology. Sophia Ananiadou,
Douglas B. Kell and Jun-ichi Tsujii
36. Hirschman, L. and Blaschke, C. (2006) Evaluation of text mining in biology. In Text
Mining for Biology and Biomedicine (Ananiadou, S. and McNaught, J., eds), pp. 213–245,
Artech House
37. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
38. K. Bretonnel Cohen, et al, Empirical data on corpus design and usage in biomedical
natural language processing