Masters in Proteomics and Bioinformatics
Evaluation of gene/protein name recognition programs
Written by: Sanaa Chtioui
Directors: Anne-Lise Veuthey
University of Geneva 2008/2009

Summary
Identification of gene and protein names in biomedical text is a challenging task because the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data. Here we evaluate gene and protein name recognition programs. The evaluation was performed first against two manually annotated corpora specifically tailored for gene and protein name identification (GENETAG and YAPEX), and then against the “Variants corpus”, which contains gene and protein names, synonyms and a set of mutation-related keywords (variants, polymorphisms, mutations…). The Variants corpus was used to associate gene and protein names with mutations and to relate synonyms to the same gene or protein. The evaluation was based on two performance measures, precision and recall. GAPSCORE achieved 81% precision and 63% recall against the GENETAG corpus, and AIIAGMT achieved 97% precision and 98% recall against the YAPEX data set. Against the Variants corpus, AIIAGMT achieved 75% precision and 75% recall, and GAPSCORE achieved 72% precision and 88% recall. To improve the precision of gene and protein name extraction, we combined the programs by taking the intersection of their predictions: the intersection filters out false positives and therefore increases precision, but at the expense of recall.

Background
We are developing text mining tools for information extraction of variants.
In particular, we are working on an automatic procedure able to track newly published articles concerning variants and to extract new relevant information. This requires the development of methods able to map a protein described in a paper to the corresponding protein database entry. In the field of information extraction, many programs have been developed for the task of protein and gene name recognition in biological texts. These programs are necessary to link information found in an article to a specific biological entity. We propose to evaluate these tools and to implement the most suitable one in the information retrieval pipeline for variants.

Table of contents
Summary
Background
1. Introduction
I. The importance of variants in the context of our research
II. Text mining and gene/protein name extraction
III. BioCreAtIvE and the evaluation of the results of text mining processes
IV. Individual program description
2. Methods
I- Evaluation of gene/protein name recognition programs
1. Corpus description
1.1 GENETAG corpus
1.2 YAPEX corpus
1.3 Variants corpus
2. Evaluation of gene/protein name recognition programs
3. Evaluation of GAPSCORE program
4. Evaluation of AIIAGMT program
II. Application: Variants extraction
3. Results
3.1. GAPSCORE
3.1.1 Performance of GAPSCORE on GENETAG corpus
3.1.2 Performance of GAPSCORE on Variants corpus
3.2. AIIAGMT
3.2.1 Performance of AIIAGMT on GENETAG and YAPEX corpora
3.2.2 Performance of AIIAGMT on Variants corpus
3.3. Improving the precision of extraction
3.3.1 Improving the precision of gene and protein names extraction on GENETAG corpus
3.3.2 Improving the precision of gene and protein names extraction on Variants corpus
3.4 Application: Variants extraction
4. Discussion
5. Conclusion
Appendix
Bibliography

1. Introduction
New high-throughput technologies have accelerated the accumulation of knowledge about biological data.
However, much knowledge is still stored as written natural language text, and the traditional information retrieval framework, which relies on keyword-based approaches, cannot cope with this information overload. For this reason, scientists have focused their attention on text mining (TM) techniques, which enable them to collect, maintain, interpret, curate and discover the knowledge needed for research. Reports of protein point mutations in the biomedical literature on human genetic disorders, and of polymorphisms in genes that are markers for these disorders, are constantly growing and provide a tremendous amount of biological data, including variants, mutations, polymorphisms, etc. These polymorphic positions (single nucleotide polymorphisms, SNPs) in the human genome indicate a permanent change in the DNA sequence of a gene in which one nucleotide is replaced by another.

The goal of this proposal is to identify gene and protein names in text by using text mining methods that have been developed to efficiently locate, retrieve and manage relevant information on genes and proteins. These methods will be evaluated first against two manually annotated corpora using the standard evaluation metrics of precision and recall, and then against a “Variants corpus”. This corpus contains sentences of abstracts on known polymorphisms in Swiss-Prot, retrieved by submitting PubMed queries that combine gene or protein names with a list of mutation-related keywords (mutations, polymorphisms, variants).

I. The importance of variants in the context of our research
In biology, the identification of terms corresponding to biological substances (e.g. genes and proteins) is a necessary step in extracting biological information (e.g. variants, SNPs, mutations, etc.). Because scientists believe that SNPs may predispose people to disease and even influence their response to drug regimens, this study focuses on SNPs.
A Single Nucleotide Polymorphism, or SNP, is a small genetic change, or variation, that can occur within a person's DNA sequence. The genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters C, G, or T. An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T". On average, SNPs occur in the human population more than 1 percent of the time. Because only about 3 to 5 percent of a person's DNA sequence codes for the production of proteins, most SNPs are found outside of "coding sequences". SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. Recent advances in technology, coupled with the unique ability of these genetic variations to facilitate gene identification, have led to a flurry of SNP discovery and detection [32].

II. Text mining and gene/protein name extraction
Life science research is characterized by the production of large and heterogeneous collections of biological data, including gene or protein sequences, mutations, polymorphisms, variants, etc. Several text mining methods have therefore been developed to identify gene and protein names in natural language texts. Text mining has been defined as “the discovery by computer of new, previously unknown, information by automatically extracting information from different written resources”. The first manual text-mining approaches surfaced in the mid-1980s, but technological advances have enabled the field to progress swiftly during the past decade. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics.
Recently, text mining has received attention in many areas, such as biomedical text mining (also known as BioNLP), a rather recent research field at the intersection of natural language processing (NLP), bioinformatics, medical informatics and computational linguistics. NLP allows natural language texts to be processed by computer in order to access their meaning. It can analyze (parse) natural language using lexical resources (dictionaries), where words are organized into groups according to a grammar (syntactic level) and a semantic layer assigns meaning to these words or groups of words [35]. A major focus of machine learning research is to automatically produce (induce) models, such as rules and patterns, from data in order to ‘understand’ and ‘learn’ text structure. The activity of finding documents that answer an information need is called information retrieval (IR). Another branch of text mining, used in our study, is information extraction (IE), which extracts information from text without requiring the end user of the information to read the text.

The aim of this work is the evaluation of gene and protein name recognition programs. These methods are inspired by text mining techniques using rule-based systems, which use rules that describe common naming structures for certain term classes, based on morphological, orthographic and syntactic characteristics. To evaluate the chosen programs we used the standard metrics of precision and recall. Precision is the fraction of predicted names that are correct, so it decreases each time a word is falsely identified as a gene or protein name. Recall is the fraction of annotated gene or protein names that are found, so it decreases each time a gene or protein name is missed. The results will be compared against the annotations of two manually annotated ‘gold standard’ corpora, GENETAG and YAPEX, and then against a “Variants corpus”.

III. BioCreAtIvE and the evaluation of the results of text mining processes
When text mining tasks are evaluated, it is imperative that they be relevant to the biology community, especially for the identification of biomedical entities (named entity recognition, NER) such as genes and proteins [36]. Two main biological tasks have been used for text mining evaluation challenges: document retrieval and biological database curation [35]. Recent challenge evaluations for text mining in biology include BioCreAtIvE [34] and BioNLP (Biomedical text mining) [37].

BioCreAtIvE [13] (Critical Assessment of Information Extraction systems in Biology) is a community-wide effort for evaluating text mining and information extraction systems applied to the life science (biology) literature. Two main tasks were posed at the first BioCreAtIvE challenge (carried out in 2003): the first task [20] was concerned with the identification of gene mentions in text and with linking protein database entries to abstracts, and the second task [20] was related to the extraction of human gene product annotations with GO terms (i.e. text passages supporting those annotations). The goal of the first BioCreAtIvE Workshop was to provide a set of common challenge evaluation tasks to assess the state of the art for text mining applied to biological problems. The assessment focused, among other things, on the extraction of gene or protein names from text and on their mapping to standardized gene identifiers for three model organism databases (fly, mouse, yeast). Overall, 27 groups participated in the assessment, including 18 for gene/protein name extraction. The results for gene/protein name extraction showed that 4 of the participating groups were able to extract general gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall [20]. The last BioCreative challenge took place in 2006.
There were three tasks in 'BioCreative II' [11]: the gene mention (GM) task, which focused on finding the mentions of genes and proteins in sentences drawn from MEDLINE abstracts; the gene normalization (GN) task, which involved producing a list of the EntrezGene identifiers for all the human genes and proteins mentioned in a collection of MEDLINE abstracts; and the protein-protein interaction (PPI) task, which involved identifying protein-protein interactions from full text papers [11]. The performance of gene mention systems has increased since the first BioCreative, and when multiple systems are combined, the combined BioCreative II systems achieve over 90% balanced precision and recall [33]. The data sets produced by this contest serve as a gold standard training and test set to evaluate and train Bio-NER tools and annotation extraction tools [33]. In 2003, a corpus of 20,000 sentences was selected and annotated for training and testing purposes [11]. For BioCreative II, there were 15,000 training sentences and 5,000 blind test sentences [33].

In this work we evaluated gene and protein name recognition programs against two manually annotated corpora: the BioCreative Task 1A (Gene name Identification) data set, now known as the GENETAG corpus (2005) [18] [9], and the YAPEX (2002) data set.

IV. Individual program description
A brief description of some gene/protein name recognition systems is given in table 0. However, our study was restricted to a few particular programs: GAPSCORE, AIIAGMT, and the meta-server.

GAPSCORE is a method that scans text and identifies the names of genes and proteins [17]. This tool has many potential applications, including allowing users to search and index documents by genes of interest and analyzing the scientific literature for genes of interest [17].
GAPSCORE scores gene and protein names in written natural language text based on a statistical model of gene names that quantifies their appearance, morphology and context; it uses a machine learning-based approach (Support Vector Machines, SVMs). Since GAPSCORE does not distinguish between genes and proteins, it uses ‘gene’ generically to mean both [1]. The algorithm is accessible from the web at http://bionlp.stanford.edu/GAPSCORE/ [17], and consists of five steps [1]:
(1) TOKENIZE: it splits the document into sentences and words (a word being a string of alphanumeric characters);
(2) FILTER: it removes from consideration any word that is clearly not a gene name: words that are not nouns, adjectives, participles, proper nouns or foreign words, as well as numbers, roman numerals (I–X), Greek letters and amino acids;
(3) SCORE: it scores words using a machine learning classifier;
(4) EXTEND: it extends each word to the full gene name;
(5) MATCH ABBREVIATION: finally, it scores abbreviations of the gene names identified [1].
Web services [17] allow computer programs to access GAPSCORE running on other machines; they currently provide an XML-RPC interface (a remote procedure calling protocol that works over the Internet) to the algorithms [14]. The server can respond to two queries: find abbreviations, and find gene and protein names (cf Appendix 2).

AIIAGMT (Adaptive Internet Intelligent Agents laboratory's Gene Mention Tagger) [19] is a named entity recognition system that ranked 2nd in the gene mention tagging task of the second BioCreative challenge, with an F-score (the harmonic mean of precision and recall) of 86.83%. It was developed by the Adaptive Internet Intelligent Agents (AIIA) Laboratory (http://aiia.iis.sinica.edu.tw/), Institute of Information Science, Academia Sinica, Taiwan, and I-Fang Chung's Laboratory, Institute of Biomedical Informatics, National Yang-Ming University, Taiwan.
The online service is released under the Creative Commons Attribution 2.5 License. The Web Services version of AIIAGMT allows a computer program (particularly in Perl) to call it from a remote site. The AIIAGMT program was the best performing system based on conditional random fields (CRFs) in the second BioCreative challenge evaluation [11]. Its key features include a rich feature set (e.g. PartOfSpeech, Hyphen …) [12], unification of bidirectional parsing models (forward and backward parsing) [12], a dictionary-based filtering post-process, and high performance (especially in precision, up to 0.8930 in the final task evaluation). Several feature types are selected, including character n-grams (window size 2 to 4) and morphological and orthographic features, while some widely used features, such as stop words, prefixes and suffixes, are excluded. In addition to these widely used features, a set of domain-specific features is used to decrease false positives among terms with a gene mention-like morphology, including abbreviations of biological chemical compounds (for instance DNA, RNA, amino acids), compounds that co-occurred with relevant site information, and so on.

BioCreative MetaServer (BCMS, http://bcms.bioinfo.cnio.es/) [15] is the first meta-service for information extraction in molecular biology [6]. This prototype platform is a joint effort of 13 research groups (FIG.0) and provides automatically generated annotations for PubMed/Medline abstracts. The platform is to be regarded as a distributed system that requests, retrieves and unifies textual annotations, and delivers these data to the user at different levels of granularity. BCMS can be divided into three main units [6]:
• A static collection of text (a set of approximately 22,800 PubMed abstracts used in the BioCreative II challenge).
• A set of active servers providing annotations for text (FIG.1) upon request; these annotation servers (AS) only interact with the meta-server and not directly with each user.
• A meta-server providing the combined data, namely both the annotations and the corresponding text.
Users therefore communicate with the annotation servers indirectly, using the meta-server as a proxy. The data can be provided by two different means: via a web browser or by the XML-RPC protocol [6]. Currently, the service provides four types of annotations [6]:
• gene/protein mention (GM), which locates positions in the text that are detected as gene or protein names;
• gene/protein normalization (GN), which detects which genes or proteins are mentioned and assigns sequence database identifiers to the text;
• taxon classification, which identifies the organisms to which the text pertains, together with a confidence score, providing an ID for the National Center for Biotechnology Information (NCBI) taxonomic database;
• protein-protein interaction (PPI), which classifies whether the text contains PPI information and assigns a confidence score to the classification.

FIG.0: The 13 research groups of BCMS

Programs ProMiner [3] PowerBioE [4] NLProt type of methodology rule-based approach (dictionary approach) machine learning approach: HMM, SVM Support Vector Machines (SVMs) program availability www.scai.fraunhofer.de/prominer.html, Not an open source software bilateral agreement http://textmining.i2r.a-star.edu.sg/NLS/demo.htm Internet server & http://cubic.bioc.columbia.edu/services/nlprot/ program language Perl Input Abstract UNIX™ / Linux, Microsoft Windows™, IBM UIMA software - Abstract Not available - Abstract Abstract ABNER [8] BioTaggerGM [5] GAPSCORE [1] statistical machine learning system (CRFs) Powerful machine learning frameworks and system combination machine learning-based approach http://www.cs.wisc.edu/~bsettles/abner/ Java API - not available http://bionlp.stanford.edu/GAPSCORE Python, Perl, Java. - AIIAGMT [19] ABGene LingPipe [22] http://aiia.iis.sinica.edu.tw/index.php Perl, Java Abstract rule-based and/or pattern matching methods machine learning approach: HMM ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/ Perl, C++ http://alias-i.com/lingpipe/ Java not accessible Java 2 (J2SE), processor (500MHz+), 256MB+ of RAM, Java SDK 1.4, MALLET 0.3.1, and Jlex - Abstract Abstract statistical machine learning system (CRFs) System requirement Abstract XML-RPC program Perl 5.8.0 or later, Frontier::Client, XML-RPC program linux versions work using Slackware 8.0 and glibc2.2.x, or 3.4.0. Apache Jakarta Lucene TF/IDF search engine (version 1.3) and the Alias-i LingPipe tokenizer and named entity annotator (version 1.0.6). KEX [23] BANNER [24] Whatizit [25] rule-based approach (dictionary approach) http://www.hgc.inc.utokyo.ac.jp/service/tooldoc/KeX/intro.html. machine learning system (CRFs) http://banner.sourceforge.net/ Perl 5 Abstract Java Abstract rule-based approach Java Abstract http://www.ebi.ac.uk/webservices/whatizit Solaris, dec and irix. C compiler (preferably gcc), and Perl version 5. latest version of the Mallet toolkit (version 0.4) SOAP, WSDL
table 0: A brief description of some systems of gene/protein name recognition.

2. Methods
I- Evaluation of gene/protein name recognition programs
Evaluating the performance of gene and protein name recognition programs is impossible without a standardized test corpus.
In this work, we used two manually annotated corpora specifically tailored for gene and protein name identification, GENETAG and YAPEX, as gold standards for this task. We then evaluated the chosen programs against the Variants corpus to confirm their performance and to make sure that their good results are not due to the programs having been adjusted or trained on the gold standards, which were manually annotated before the release of the chosen programs.

1. Corpus description
1.1 GENETAG corpus
GENETAG is a corpus of 20,000 MEDLINE sentences [9] (342,574 words) for gene/protein named entity recognition (NER). 15,000 GENETAG sentences were used for the BioCreAtIvE Task 1A competition. GENETAG is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz [18] and is used in this study as a gold standard. The gold standard has been tokenized [18], that is, broken up into words, numbers and punctuation marks, generally called tokens, each consisting of a string of characters without white space. In this process, hyphens and punctuation often receive special treatment based on rules such as [9]: do not split on hyphens, do not split on single quotation marks, do not split on commas, and do not split on parentheses and brackets [26]. The gene/protein names in the GENETAG corpus were manually annotated and are indicated by the tag "NEWGENE"; the tag "NEWGENE1" is used to differentiate overlapping names (when two gene mentions are immediately next to each other in the text) [18]. Every sentence begins with a number that serves as an identifier.
Example:
95229799480 Cervicovaginal/JJ foetal/NEWGENE fibronectin/NEWGENE in/IN the/DT prediction/NN of/IN preterm/JJ labour/NN in/IN a/DT low-risk/JJ population/NN ./.
19306602252 Insulin-like/NEWGENE growth/NEWGENE factor/NEWGENE1/NEWGENE (/( IGF-1/NEWGENE )/SYM in/IN burn/NN patients/NNS ./.
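To illustrate how the tagged format in the two example sentences can be read, here is a minimal Python sketch. It is not the program used in this study (which was written in Perl). The function names `split_token` and `parse_genetag_line` are hypothetical, and the sketch makes a simplifying assumption: consecutive tokens carrying a NEWGENE-type tag are merged into one mention, so the special role of NEWGENE1 in separating adjacent names is not handled.

```python
def split_token(item):
    """Separate a 'token/TAG' item into the token and a gene flag."""
    # The tag is whatever follows the last slash, so "./." gives (".", ".").
    token, _, tag = item.rpartition("/")
    is_gene = tag.startswith("NEWGENE")
    # Items such as "factor/NEWGENE1/NEWGENE" carry two tags; peel both off.
    while "/" in token and token.rpartition("/")[2].startswith("NEWGENE"):
        token = token.rpartition("/")[0]
    return token, is_gene


def parse_genetag_line(line):
    """Return (sentence identifier, list of gene/protein mentions)."""
    sentence_id, *items = line.split()
    mentions, current = [], []
    for item in items:
        token, is_gene = split_token(item)
        if is_gene:
            current.append(token)          # still inside a mention
        elif current:
            mentions.append(" ".join(current))  # mention just ended
            current = []
    if current:
        mentions.append(" ".join(current))
    return sentence_id, mentions


example = ("19306602252 Insulin-like/NEWGENE growth/NEWGENE "
           "factor/NEWGENE1/NEWGENE (/( IGF-1/NEWGENE )/SYM "
           "in/IN burn/NN patients/NNS ./.")
print(parse_genetag_line(example))
# ('19306602252', ['Insulin-like growth factor', 'IGF-1'])
```

Peeling tags from the right-hand side keeps tokens that themselves contain a slash intact, which matters because the tokenization rules avoid splitting on most punctuation.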
The following POS (Part-of-Speech) tags have been used [26] to label the tokens:
JJ: Adjective
IN: Preposition or subordinating conjunction
DT: Determiner
NN: Noun, singular common
. : A period
SYM: Symbol
NNS: Noun, plural common

1.2 YAPEX corpus
The YAPEX text collection [27] is publicly available at http://www.sics.se/humle/projects/prothalt/. It consists of a training set of 99 abstracts from MEDLINE related to protein binding, and a test set of 101 abstracts, of which 48 are relevant to protein binding and the rest were chosen randomly from the GENIA corpus [28]. The gene/protein names in all these abstracts (45,143 words) were manually annotated by domain experts connected to the YAPEX project, and are indicated by the XML tag <Protname>… </Protname>. In this study, we used this corpus as a gold standard to further evaluate the chosen programs and confirm their performance.
Example:
<PubmedArticle>
<MedlineID>21294781</MedlineID>
<PMID>11401507</PMID>
<ArticleTitle>Molecular dissection of the <Protname>importin beta1</Protname>-recognized nuclear targeting signal of <Protname>parathyroid hormone-related protein</Protname>.</ArticleTitle>
<AbstractText>Produced by various types of solid tumors, <Protname>parathyroid hormone-related protein</Protname> (<Protname>PTHrP</Protname>) is the causative agent of humoral hypercalcemia of malignancy.</AbstractText>
</PubmedArticle>

1.3 Variants corpus
The Variants corpus is composed of 27451 sentences of abstracts describing known polymorphisms in Swiss-Prot. These sentences were recovered by an automatic information retrieval method applied to the 5664 proteins with variants. For each protein of interest, called the “target” in this study, a PubMed query was submitted to retrieve related articles using gene or protein names, synonyms (from GPSDB), and a list of mutation-related keywords.
This automatic information retrieval method was based on the use of regular expressions (patterns) and rules for the detection and validation of mutations. The variant terms in these articles match the patterns corresponding to the different observed notations mentioning the position of single amino acid polymorphisms (SAP) [30]. The SAP detection procedure has been used to recover relevant documents.
Example:
Q9Y6Y9-1 | 18424732 | As predicted from the MD-2 structure , the P157S mutation had little or no effect on MD-2 function . ...
Q9Y6R1-1 | 12444017 | [...a limited number of rat pancreatic duct cells. To examine the effects of pRTA-related mutations, R342S and R554H, on pNBC-1 function, we performed functional analysis and found that both mutants had...]
In the first sentence, MD-2 is a synonym of the target (Q9Y6Y9: Lymphocyte antigen 96), that is, the protein of interest for which the PubMed query was submitted to retrieve related articles, and the mutation P157S refers to the target.

This corpus is used to further evaluate the performance of the selected programs (GAPSCORE and AIIAGMT), and to make sure that these programs were not adjusted or trained on the gold standards. In this case, the evaluation was based on the performance measures (precision and recall) computed on 245 randomly selected and manually checked sentences.

2. Evaluation of gene/protein name recognition programs
We evaluated the performance of the algorithms (GAPSCORE and AIIAGMT) against the GENETAG and YAPEX gold standards. For the comparison, we used the annotations of the manually annotated corpora GENETAG and YAPEX, available from their web sites. Every string identified by a run (GAPSCORE or AIIAGMT) is considered either a true positive or a false positive. If the string matches a NEWGENE annotation in GENETAG or a Protname annotation in YAPEX, it is counted as a true positive; otherwise (there is no match), the string is counted as a false positive.
If none of the annotations of a gene given in the corpus is found by the prediction programs, the gene is counted as a false negative. A run is scored by counting the true positives (TP), false positives (FP), and false negatives (FN). We quantified the performance of the algorithms using recall and precision.
Precision = correctly predicted gene names / predictions [1], or Precision = TP / (TP + FP)
Recall = correctly predicted gene names / gene names [1], or Recall = TP / (TP + FN)
We assessed the performance of the algorithms on gene name matches. The predicted gene name can either be identical to the corresponding name in the gold standard, or only needs to overlap the name in the gold standard. If two predicted genes overlap the same multi-word gene name, both of them are considered correct.

3. Evaluation of GAPSCORE program
We first searched for gene and protein names in short texts by submitting a few sentences directly to the GAPSCORE server (web interface). We then ran GAPSCORE on the GENETAG test set by calling it from computer programs (Perl) using XML-RPC (cf Appendix 1), and compared the performance of the method against the GENETAG corpus annotations. Since the GAPSCORE algorithm produces scores between 0 and 1, we calculated recall and precision at a series of score cutoffs starting from 0.05 and plotted the resulting curve, which illustrates the tradeoff between recall and precision (Fig.1). We chose a strict cutoff for our application (gene and protein name recognition), which requires high precision.

To confirm its performance and to make sure that the GAPSCORE program was not adjusted or trained against the gold standard (GENETAG), we manually compared its performance against the Variants corpus. In this case, the evaluation was based on the performance measures (precision and recall) computed on 245 randomly selected and manually checked sentences.

4. Evaluation of AIIAGMT program
In order to see what AIIAGMT tags as genes and proteins, we used the Web Services version of the gene mention tagger, which allows our computer program (Perl) to call AIIAGMT from a remote site. The module AIIA::GMT, an XML-RPC client of the AIIA gene mention tagger web-service server, was used to accomplish this task. We compared the performance of the method against the annotations of the two manually annotated corpora GENETAG and YAPEX, and then against the Variants corpus to confirm its performance and to make sure that the AIIAGMT program was not adjusted or trained against the gold standards. The evaluation against the Variants corpus was based on the performance measures computed on 245 randomly selected and manually checked sentences.

II. Application: Variants extraction
In [30], Yip YL et al. describe an automatic method to retrieve variant-related articles that might be useful to update Swiss-Prot information. The aim of this application (variants extraction) is to improve the precision of this automatic method by combining the two gene and protein name recognition programs already evaluated in this study. To do so, we ran AIIAGMT and GAPSCORE on the Variants corpus (which is the output of the pipeline for variant information update). We then created a Perl program to match the results (the programs' predictions) with a list of gene and protein synonyms of the target, extracted from GPSDB (Gene and Protein Synonym Database) [21]. If there is a match, the sentence is counted as a matching sentence; otherwise, it is counted as a missing sentence. In order to verify whether the mutations in the Variants corpus actually refer to the gene/protein of interest (the “target”), we manually evaluated one hundred matching sentences (with target) and one hundred missing sentences (missing target), randomly selected.
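The matching step can be sketched as follows. This is an illustrative Python sketch, not the Perl program used in the study, and the matching rule is an assumption: a prediction is taken to match when the tokens of a synonym occur, case-insensitively and in order, inside the predicted name. The function names `normalize`, `contains_tokens` and `sentence_matches_target` are hypothetical.

```python
import re


def normalize(name):
    """Lower-case a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def contains_tokens(haystack, needle):
    """True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def sentence_matches_target(predictions, synonyms):
    """True if any predicted name contains any synonym of the target."""
    synonym_tokens = [normalize(s) for s in synonyms]
    return any(
        contains_tokens(normalize(p), s)
        for p in predictions
        for s in synonym_tokens
    )


# GPSDB synonyms of Swiss-Prot entry O00142 and two predicted names.
synonyms = ["2.7.1.21", "EC 2.7.1.21", "Mt-TK",
            "thymidine kinase 2, mitochondrial", "TK2"]
print(sentence_matches_target(["H90N mutation", "TK2 gene"], synonyms))  # True
print(sentence_matches_target(["H90N mutation"], synonyms))              # False
```

Comparing token sequences rather than raw substrings avoids accidental matches inside longer words; a stricter or looser rule would shift sentences between the matching and missing groups.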
We also evaluated the intersection of the two programs (AIIAGMT, GAPSCORE) on the 245 sentences previously used (sections 3 and 4). In the matching sentences, if the mutation refers to the target, the sentence is counted as a correct one. If the mutation refers to another gene or protein, the sentence is counted as an incorrect one. In the missing sentences, if the mutation refers to the target, the sentence is counted as a correct one. If the mutation refers to another gene or protein, the sentence is counted as an incorrect one.

Example: Synonyms of the same entity (from GPSDB).
O00142 2.7.1.21
O00142 EC 2.7.1.21
O00142 Mt-TK
O00142 thymidine kinase 2, mitochondrial
O00142 Thymidine kinase 2, mitochondrial
O00142 THYMIDINE KINASE, MITOCHONDRIAL
O00142 TK2
The AIIAGMT predictions corresponding to the same entry:
O00142-1 | 17951082 | We report an unusual case of IMM , homozygous for the <GENE>H90N mutation</GENE> in the <GENE>TK2 gene</GENE> but unlike other cases with the same mutation , does not demonstrate mtDNA depletion . ...
In this example, two genes were predicted by AIIAGMT but only one (TK2 gene) is a true positive (the H90N mutation is a false positive). TK2 is therefore matched with the list of synonyms corresponding to the same Swiss-Prot entry.

3. Results
3.1- GAPSCORE
We submitted the following sentence to the server: “We observed an increase in mitogen-activated protein kinase (MAPK) activity”. The search for gene and protein names in this text returns the following predictions with their scores.

     Gene or Protein Name                Quality (Score)
1    MAPK                                Excellent (1.00)
2    mitogen-activated protein kinase    Excellent (1.00)
3    increase                            Poor (0.06)
4    We                                  Poor (0.00)

In this example, the GAPSCORE program predicts four gene and protein names, not all of which are correct. “mitogen-activated protein kinase” and “MAPK” are true positives with an excellent score (1.00); “increase” and “We” are false recognitions, but they have a poor score (0.06 and 0.00).
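Because every prediction carries a score, false recognitions such as “increase” and “We” can be discarded by applying a score cutoff. The short Python sketch below illustrates this with the four predictions of the example. The function name `filter_by_score` is hypothetical, and treating the cutoff as inclusive (score greater than or equal to the cutoff) is an assumption.

```python
def filter_by_score(predictions, cutoff):
    """Keep the names whose score reaches the cutoff (assumed inclusive)."""
    return [name for name, score in predictions if score >= cutoff]


# The four GAPSCORE predictions returned for the example sentence.
predictions = [
    ("MAPK", 1.00),
    ("mitogen-activated protein kinase", 1.00),
    ("increase", 0.06),
    ("We", 0.00),
]

print(filter_by_score(predictions, 0.2))
# ['MAPK', 'mitogen-activated protein kinase']
print(filter_by_score(predictions, 0.05))
# ['MAPK', 'mitogen-activated protein kinase', 'increase']
```

Raising the cutoff removes low-scoring false positives and so favours precision; lowering it keeps more candidates and so favours recall.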
3.1.1 Performance of GAPSCORE on GENETAG corpus

When we ran GAPSCORE on the GENETAG test set, the server responded to each query by searching the input string for gene and protein names and returning an array of the names found. Each element of the returned array is itself an array of: [string name, int start, int end, double score].

Example: for this sentence

95229799480 Cervicovaginal/JJ foetal/NEWGENE fibronectin/NEWGENE in/IN the/DT prediction/NN of/IN preterm/JJ labour/NN in/IN a/DT low-risk/JJ population/NN ./.

we obtained the following result:

Gene/Protein name    Onset    Offset    Score
fibronectin          34       45        1
low                  87       90        0.065996
prediction           53       63        0.061132
Cervicovaginal       12       26        1.00E-300
foetal               27       33        1.00E-300
preterm              67       74        1.00E-300

GAPSCORE returns the predicted gene and protein names, their positions (onset and offset) and their scores. A name can receive the highest possible score (e.g. fibronectin: score 1). GAPSCORE achieved a different recall and precision at each score cutoff (varied from 0.05). The resulting curve illustrates the tradeoff between recall and precision (Fig.1).

[Figure: “GS: Precision vs Recall”; x-axis: Recall (0 to 1), y-axis: Precision (0 to 1)]
Fig.1: Performance of GAPSCORE on GENETAG.

In order to identify gene and protein names in the text effectively, we chose a strict cutoff of 0.2 (shown on the curve) because this application (gene and protein name recognition) requires high precision. At this cutoff GAPSCORE reached a precision of 81%, which means that it produced few false positives, and a recall of 63% (tab.1), which means that it produced a fair number of false negatives. In [1] (Chang et al., 2003), the evaluation of GAPSCORE against the YAPEX data set gave good results (83.3% recall and 81.5% precision for partial matches; 58.5% recall and 56.7% precision for exact matches).
Despite this good result, we evaluated GAPSCORE against the GENETAG data set to make sure that the program had not been adjusted or trained against the YAPEX data set.

3.1.2 Performance of GAPSCORE on Variants corpus

To confirm its performance and to make sure that GAPSCORE had not been adjusted or trained against the gold standard (GENETAG), we manually assessed its performance on the Variants corpus, evaluating precision and recall on 245 randomly selected sentences. Against the Variants corpus, GAPSCORE achieved 72% precision and 88% recall (tab.2). This result might suggest that GAPSCORE was adjusted or trained on the gold standards, since its precision was about 9 percentage points lower than on the two annotated corpora (YAPEX and GENETAG), while its recall was higher (by 25 points compared with GENETAG and by 5 points compared with YAPEX).

3.2- AIIAGMT

3.2.1 Performance of AIIAGMT on GENETAG and YAPEX corpora

When we ran AIIAGMT on the GENETAG and YAPEX test sets, the server returned an array reference containing all the entities recognized in our input, with their position information, e.g.:

95229799480 Cervicovaginal foetal fibronectin in the prediction of preterm labour in a low-risk population

Onset    Gene/Protein name
27       Foetal fibronectin

In this example AIIAGMT predicts “foetal fibronectin” as a gene/protein name at position 27. We compared the AIIAGMT predictions with the annotations of the GENETAG and YAPEX corpora. AIIAGMT achieved 91% precision and 84% recall against the GENETAG data set, and 97% precision and 98% recall against the YAPEX data set (tab.1). This means that AIIAGMT produced very few false positives and very few false negatives.

             AIIAGMT                 GAPSCORE
             Precision    Recall     Precision    Recall
YAPEX        97%          98%        81.5%        83.3%
GENETAG      91%          84%        81%          63%

tab.1: Performance of AIIAGMT and GAPSCORE on YAPEX and GENETAG corpora

Both programs are successful, but on both corpora AIIAGMT achieved better precision (97% on YAPEX and 91% on GENETAG) than GAPSCORE (81.5% and 81%), and better recall as well (tab.1). We evaluated AIIAGMT against the two corpora, GENETAG and YAPEX, to confirm its performance. On both data sets, AIIAGMT achieved good precision and recall. To make sure that AIIAGMT had not been adjusted or trained against the two manually annotated corpora, a further evaluation against the Variants corpus was required.

3.2.2 Performance of AIIAGMT on Variants corpus

The GENETAG and YAPEX corpora were annotated before the release of AIIAGMT. To verify that the program had not been adjusted or trained against these two manually annotated corpora (the gold standards), we ran AIIAGMT on the Variants corpus (not annotated). In this case, the server returns an array reference containing all the entities recognized in the Variants corpus, with their position information. The AIIAGMT predictions resulted in 24585 tagged sentences out of the 27451 sentences of the Variants corpus (a sentence is considered tagged as soon as at least one gene or protein name is predicted in it). The XML tag <GENE>…</GENE> was used to label each gene/protein name predicted by the program. Example:

Q9Y4H2-1 | 17051426 | ...We examine the 19 CA repeat of the <GENE>IGF1 gene</GENE>, the -202 C > A <GENE>IGFBP3</GENE>, the G972R IRS, and the G1057D IRS2 polymorphisms among 1,175 non-Hispanic white NHW and 576 Hispanic newly diagnosed breast ...

In this example, two genes (IGF1 gene and IGFBP3) were found by the program and both are true positives. ‘IRS’ and ‘IRS2’ are false negatives.

Our evaluation was based on the performance measures computed on 245 randomly selected and manually reviewed sentences.
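How precision and recall follow from the manual review of a single sentence can be sketched as below. This is a minimal illustration, not the procedure actually used: precision_recall is a hypothetical helper, names are compared as exact strings (an assumption; the manual review may have been more lenient), and the reference set is taken from the example above, in which IRS and IRS2 are the names the program missed.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted names against a
    manually established reference set, using exact string comparison."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)   # predicted and correct
    false_pos = len(predicted - gold)  # predicted but not a gene/protein
    false_neg = len(gold - predicted)  # gene/protein the program missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall


# Sentence Q9Y4H2-1 | 17051426 from the example above.
predicted_names = {"IGF1 gene", "IGFBP3"}             # tagged with <GENE>
reference_names = {"IGF1 gene", "IGFBP3", "IRS", "IRS2"}

precision, recall = precision_recall(predicted_names, reference_names)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For this sentence both predictions are correct (precision 2/2), but only two of the four names were found (recall 2/4). The figures reported for the 245 sentences aggregate such counts; how they were aggregated is not specified here.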
Against the Variants corpus, AIIAGMT achieved 75% precision and 75% recall, which means that 75% of the predicted gene and protein names are relevant and that 75% of the relevant names were found (tab.2).

             GAPSCORE    AIIAGMT
Precision    72%         75%
Recall       88%         75%

tab.2: Performance of AIIAGMT and GAPSCORE on Variants corpus

These results confirm that AIIAGMT achieves the good precision and recall required for our application, i.e. the recognition of gene and protein names in text. Outside the manually annotated corpora, AIIAGMT still performed well (75%), but this is not enough to exclude that it was adjusted or trained on the gold standards, since its precision was 22 percentage points lower than on the YAPEX data set and 16 points lower than on the GENETAG corpus.

3.3 Improving the precision of extraction

3.3.1 Improving the precision of gene and protein names extraction on GENETAG corpus

Given the results obtained by AIIAGMT and GAPSCORE against the Variants corpus, we combined the two programs by union and by intersection. We considered the intersection in order to check for a possible improvement in the precision of gene and protein name extraction at the cost of a small reduction in recall; the union, conversely, can improve recall at the cost of a reduction in precision. The intersection AIIAGMT-GAPSCORE reached 97% precision and 80% recall, and the union AIIAGMT-GAPSCORE 79% precision and 90% recall (tab.3).

             Intersection AIIAGMT-GAPSCORE    Union AIIAGMT-GAPSCORE
Precision    97%                              79%
Recall       80%                              90%

Tab.3: Performance of the intersection and union of AIIAGMT-GAPSCORE on GENETAG corpus

Since the union of both programs produced a large number of false positives, precision fell by 2 percentage points relative to GAPSCORE alone and by 12 points relative to AIIAGMT alone. Recall rose by 27 points relative to GAPSCORE and by 6 points relative to AIIAGMT.
The intersection AIIAGMT-GAPSCORE improved the precision of gene and protein name extraction by 6 percentage points relative to AIIAGMT and by 16 points relative to GAPSCORE, because the intersection filters out false positives. Recall rose by 17 points relative to GAPSCORE and fell by 4 points relative to AIIAGMT. It is therefore the intersection of both programs that improves the precision of gene and protein name extraction, by filtering out false recognitions.

3.3.2 Improving the precision of gene and protein names extraction on Variants corpus

To improve the precision of gene and protein name extraction on the Variants corpus, we considered the intersection of the results of the two programs on 245 sentences of the Variants corpus. We also applied the union, since it could improve recall at the cost of a small reduction in precision. The intersection AIIAGMT-GAPSCORE reached 96% precision and 70% recall, and the union AIIAGMT-GAPSCORE 64% precision and 79% recall (tab.4).

             Intersection AIIAGMT-GAPSCORE    Union AIIAGMT-GAPSCORE
Precision    96%                              64%
Recall       70%                              79%

Tab.4: Performance of the intersection and union of AIIAGMT-GAPSCORE on Variants corpus

Since the union of the programs produced a large number of false positives, precision fell by 8 percentage points relative to GAPSCORE and by 11 points relative to AIIAGMT. Recall fell by 9 points relative to GAPSCORE and rose by 4 points relative to AIIAGMT. The intersection AIIAGMT-GAPSCORE improved precision by 21 points relative to AIIAGMT and by 24 points relative to GAPSCORE, because filtering out the false recognitions leaves only a small number of false positives. Recall fell by 18 points relative to GAPSCORE and by 5 points relative to AIIAGMT. These results confirm that the intersection can improve the precision of gene and protein name extraction at the cost of a reduction in recall.
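The effect of combining two taggers by intersection and by union can be sketched as follows. The data below are invented toy sets chosen only to show the mechanism; they are not results from this evaluation. The names tagger_a, tagger_b and precision_recall are hypothetical, and predictions are compared by name only, ignoring positions.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of a set of predicted names against gold."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: the reference names and the output of two taggers.
gold = {"TK2", "HFE", "RET", "IGFBP3"}
tagger_a = {"TK2", "HFE", "RET", "H90N mutation"}
tagger_b = {"TK2", "HFE", "IGFBP3", "increase", "We"}

combined = {
    "tagger A": tagger_a,
    "tagger B": tagger_b,
    "intersection": tagger_a & tagger_b,  # names predicted by both taggers
    "union": tagger_a | tagger_b,         # names predicted by at least one
}

scores = {label: precision_recall(names, gold) for label, names in combined.items()}
for label, (precision, recall) in scores.items():
    print(f"{label:12s} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the intersection keeps only names both taggers agree on, so every false positive disappears but two true names are lost; the union recovers every true name but accumulates the false positives of both taggers.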
3.4- Application: Variant extraction

To verify whether the mutations in the Variants corpus refer to the ‘target’ gene or protein (the protein of interest for which the PubMed query was submitted to retrieve the related articles) or to another gene or protein, we manually checked 194 matching sentences and 121 missing sentences, randomly selected from the intersection of the programs previously evaluated (AIIAGMT and GAPSCORE). As seen in section 3.2.2, of the 27451 sentences of the Variants corpus, only 24585 were tagged with at least one gene/protein name (AIIAGMT prediction). These 24585 sentences were matched against a list of gene and protein synonyms of the protein of interest (‘target’). If a string identified by AIIAGMT matches one synonym in the list, it is considered a target and is tagged with the XML tag <target>…</target>. Of the 24585 tagged sentences, only 12867 match the list of synonyms (matching sentences), and 11698 have no match (missing sentences).

Example:

Q04771-1 | 17572636 | Functional modeling of the <GENE><target>ACVR1</target></GENE>R206H mutation in <target>FOP</target> . ...

The list of synonyms of the same entity (from GPSDB):

Q04771 Activin receptor type I
Q04771 ACTRI
Q04771 ACTR-I
Q04771 ACVR1
Q04771 ALK2
Q04771 ALK-2
Q04771 EC 2.7.11.30
Q04771 FOP
Q04771 hydroxyalkyl-protein kinase
Q04771 serine/threonine-protein kinase receptor R1

In this example, AIIAGMT predicted ACVR1 as a gene/protein name (tagged by <GENE>…</GENE>), and it is indeed a synonym of the target entity (tagged by <target>…</target>). FOP was not predicted by the program even though it is a synonym. So ACVR1 is a true positive and FOP is a false negative.

For practical reasons, GAPSCORE was not run on the whole Variants corpus.
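The matching of AIIAGMT-tagged sentences against the synonym list, described above, can be sketched as follows. This is only an illustration under stated assumptions: tag_targets and is_matching_sentence are hypothetical names, the synonyms are those listed above for Q04771, and matching is done on the sentence text with case-sensitive, whole-token comparison, which may differ from the procedure actually used.

```python
import re

# Synonyms of the target Q04771, copied from the GPSDB list above.
SYNONYMS_Q04771 = [
    "Activin receptor type I", "ACTRI", "ACTR-I", "ACVR1", "ALK2", "ALK-2",
    "EC 2.7.11.30", "FOP", "hydroxyalkyl-protein kinase",
    "serine/threonine-protein kinase receptor R1",
]


def tag_targets(sentence, synonyms):
    """Wrap every whole-token occurrence of a synonym in <target>...</target>."""
    if not synonyms:
        return sentence
    # Longest synonyms first, so that a longer name wins over a shorter one
    # that would match at the same position.
    ordered = sorted(set(synonyms), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # The lookarounds reject matches glued to other letters or digits.
    pattern = re.compile(r"(?<![A-Za-z0-9])(" + alternation + r")(?![A-Za-z0-9])")
    return pattern.sub(r"<target>\1</target>", sentence)


def is_matching_sentence(tagged_sentence):
    """A sentence is 'matching' if at least one target was tagged in it."""
    return "<target>" in tagged_sentence


aiiagmt_output = ("Functional modeling of the <GENE>ACVR1</GENE> R206H "
                  "mutation in FOP .")
tagged = tag_targets(aiiagmt_output, SYNONYMS_Q04771)
print(tagged)
print(is_matching_sentence(tagged))
```

A sentence in which no synonym is found is left unchanged and falls into the missing sentences.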
For GAPSCORE, we considered only the results for the 245 sentences previously evaluated (section 3.1.2), and applied the intersection of AIIAGMT and GAPSCORE to these sentences. This intersection yielded 116 sentences, of which 94 were matching sentences and 21 were missing sentences. To try to improve the precision of the automatic method used to retrieve variant-related articles, we randomly selected 194 matching sentences and 121 missing sentences from the intersection of both programs (AIIAGMT and GAPSCORE predictions), because the intersection improves the precision of information extraction. These two subsets of sentences were checked manually by verifying whether the mutation refers to the gene or protein of interest (‘target’). This evaluation distinguished four categories of sentences; their correctness is summarised below (tab.5):

Category of sentence         Correct (number of sentences)    Incorrect (%)    Total number of sentences
With target                  133                              0                133
With target and gene         61                               0                61
Without target nor gene      113                              0                113
Without target, with gene    5                                37.5             8

Tab.5: Categories of sentences in Variants corpus.

In the missing sentences (without target) that contain another gene, 62.5% (5 of 8) are correct, which means that the mutations refer to the target gene/protein even though another gene or protein is present in the sentence. If the mutations refer to another gene or protein, the sentence is counted as incorrect (37.5%).

Correct sentence: Example

Q04771-1 | 18979151 | of this metamorphogene are beginning to provide deep insight into a highly conserved signaling pathway that regulates tissue stability following morphogenesis , and that when damaged at a highly specific locus c.617G > A ; R206H , and triggered. …

In this example, the mutation R206H refers to the gene/protein of interest Q04771 (Activin receptor type-1).
Incorrect sentence: Example

P21802-1 | 15942838 | ...and an elevated <GENE>transferrin</GENE> saturation 96 %, but a negative test for <GENE>HFE gene</GENE> mutations such as <GENE>C282Y</GENE> and H63D. FINDINGS: Using the mini-laparascopic technique we diagnosed a smallnodular liver...

In this example, the mutations C282Y and H63D refer to the HFE gene, which is a true positive but is not a synonym of the gene/protein of interest P21802 (Fibroblast growth factor receptor 2). ‘transferrin’ is also a true positive, but ‘C282Y’ is a false positive.

In the matching sentences, all of which are correct, 31% (61 of 194) also contain another gene or protein; the sentence is nevertheless counted as correct because the mutation refers to the gene/protein of interest.

Example: correct sentence

P07949-1 | 11438491 | ... severely impaired the <GENE>phospholipase Cgamma</GENE> signaling pathway in SK-N-MC cells. S765P, R873Q, F893L, R897Q, and E921K mutations resulted in a complete loss of the <GENE><target>RET</target> kinase</GENE> activity. The P973L...

In this example, the mutations S765P, R873Q, F893L, R897Q, and E921K refer to the target gene P07949 (Proto-oncogene tyrosine-protein kinase receptor ret) despite the presence of the ‘phospholipase C-gamma’ gene, which is a true positive.

These results show how correct the sentences in the Variants corpus are, so that we can decide whether they are to be kept in or eliminated from the ModSNP database. Among the missing sentences that contain a gene, 37.5% are incorrect, which means that the mutation refers to another gene and the protein of interest is absent from the sentence.

4. Discussion

In this work, we proposed to evaluate gene and protein name recognition programs in order to select the best one among them, or to combine them to improve the precision of gene and protein name extraction at the cost of a small reduction in recall.
To evaluate the programs, we used the two metrics most often used in gene and protein name identification, precision and recall, which measure the two types of mistakes (false positives and false negatives) that can be made during gene name identification. In this study we did not distinguish between genes and proteins; we used ‘gene’ or ‘protein’ generically to mean both. The evaluation was performed against two manually annotated corpora specifically tailored for gene and protein name identification (GENETAG and YAPEX). Both programs performed well: 81% precision for GAPSCORE and 91% precision for AIIAGMT against the GENETAG corpus. This good performance may be due to the fact that both programs were trained or adjusted on the gold standards, since the two corpora were manually annotated before the two programs were developed. We therefore evaluated the programs against another corpus which is not annotated: the ‘Variants corpus’. This evaluation gave 72% precision for GAPSCORE and 75% precision for AIIAGMT, which could suggest that the programs were adjusted or trained on the gold standards, since precision decreased by 9 percentage points for GAPSCORE and by 16 points for AIIAGMT compared with the GENETAG corpus. The precision of GAPSCORE (72%) and AIIAGMT (75%) appears to be due to the fact that the two systems do not deal with ambiguity, other than filtering out the most ambiguous terms; this means that some false positives remained. The 75% recall of AIIAGMT means that it produced more false negatives than GAPSCORE (88% recall). In order to improve the precision of gene and protein name extraction at the cost of a small reduction in recall, we considered the intersection of both programs (AIIAGMT and GAPSCORE) on the GENETAG corpus and on the Variants corpus, since the intersection reduces the number of false recognitions of gene and protein names.
The union, by contrast, can enhance recall at the cost of a reduction in precision, since it can produce a large number of false positives. So, by combining the two programs via intersection, we improved the precision of gene/protein name extraction. Combining programs in order to improve the precision of gene and protein name extraction is in fact the aim of the BioCreative MetaServer (BCMS), a prototype platform through which 13 research groups provide combined data corresponding to the different annotations of their thirteen servers. However, the BioCreative MetaServer works on a static collection of text (a set of approximately 22,800 PubMed abstracts used in the BioCreative II challenge), which serves as its internal database. For this reason, we could not use the BCMS to identify gene and protein names in newly published articles. In the near future, the BioCreative MetaServer could nevertheless become a good program for gene and protein name recognition with good extraction precision. The aim of these evaluations was to use gene and protein name recognition programs in other applications, such as variant extraction. These programs are necessary to link information found in variant-related articles (variants, mutations and polymorphisms) to a specific biological entity. In this study, we tried to improve the performance of the automatic retrieval approach for the extraction of new polymorphisms by applying the intersection of the two programs, since it can enhance information extraction by filtering out false recognitions. To do so, we verified whether the mutations in the Variants corpus refer to the gene or protein of interest (target) or to another gene. This evaluation found 100% correct sentences among the matching sentences, which means that in all of them the mutations refer to the target, even when another gene/protein is present (61 sentences).
These correct sentences will be stored in tables of the ModSNP database. The missing sentences do not contain the protein of interest (‘target’); among those that contain another gene, 62.5% nevertheless have mutations referring to the target, and 37.5% are incorrect (the mutations refer to the other gene that is present). However, this result (37.5% of incorrect sentences) is not statistically significant because we evaluated only a small data set (194 matching sentences and 121 missing sentences). Consequently, we cannot draw a general rule concerning the importance of identifying another gene (other than the target) in the incorrect sentences. To obtain a more significant result, the data set can be expanded in order to decide whether the incorrect sentences will be kept in or eliminated from the ModSNP database.

5. Conclusion

We evaluated two gene and protein name recognition programs (GAPSCORE and AIIAGMT). The evaluation was performed against two manually annotated corpora (gold standards: GENETAG and YAPEX), on which the programs performed well. To confirm the performance of the chosen programs and to make sure that they had not been adjusted or trained on the gold standards, we also evaluated them against the Variants corpus. In this case, performance was somewhat lower, which might suggest that the programs were adjusted or trained on the gold standards. To improve the precision of gene and protein name extraction at the cost of a small reduction in recall, we applied the intersection of the two programs, since the intersection filters out false recognitions. The union, by contrast, could enhance recall at the cost of a reduction in precision. To try to improve the precision of the automatic method used to retrieve variant-related articles, we applied the intersection of the programs (GAPSCORE and AIIAGMT) on the Variants corpus.
This application yielded 100% correct sentences among the matching sentences and 37.5% incorrect sentences among the missing sentences.

Appendix

1. XML-RPC:

XML-RPC is a Remote Procedure Calling protocol that works over the Internet. An XML-RPC message is an HTTP-POST request. The body of the request is in XML. A procedure executes on the server and the value it returns is also formatted in XML. Procedure parameters can be scalars, numbers, strings, dates, etc., and can also be complex record and list structures.

2. BioNLP Web Services:

The URI for the XML-RPC server is: http://bionlp.stanford.edu/xmlrpc

The server can respond to two queries:

1- find_abbreviations
INPUT: string
OUTPUT: array
This function will search for the abbreviations in a string and return an array of the abbreviations found. Each element of the returned array is itself an array of: [string long form, string abbreviation, double score]

2- find_gene_and_protein_names
INPUT: string
OUTPUT: array
This function will search for the gene and protein names in a string and return an array of the names found. Each element of the returned array is itself an array of: [string name, int start, int end, double score]
start and end are indexes into the input string that describe where the name was found. The indexes begin at 0, and the end index is exclusive.

Bibliography

Articles:

1. Chang JT, Schütze H, Altman RB. GAPSCORE: finding gene and protein names one word at a time. Department of Genetics, Stanford Medical Center, 300 Pasteur Drive, Lane L 301, Mail Code 5120, Stanford, CA 94305-5120, USA.
2. Hirschman L, Krallinger M, Valencia A, editors. Proceedings of the Second BioCreative Challenge Evaluation Workshop. Centro Nacional de Investigaciones Oncologicas, CNIO, Madrid, Spain, 2007.
3. Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J. ProMiner: Rule based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14.
4. Zhou G, Zhang J, Su J, Shen D, Tan C. PowerBioNE: Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 2004, 20(7):1178-1190.
5. Torii M, Hu Z, Wu CH, Liu H. BioTagger-GM: A Gene/Protein Name Recognition System.
6. Leitner F, Krallinger M, Rodriguez-Penagos C, Hakenberg J, Plake C, Kuo C-J, Hsu C-N, et al. Introducing meta-services for biomedical information extraction. Genome Biology 2008, 9(Suppl 2):S6.
7. Hussain S, Wilson JB, Blom E, Thompson LH, Sung P, Gordon SM, Kupfer GM, Joenje H, Mathew CG, Jones NJ. Tetratricopeptide-motif-mediated interaction of FANCG with recombination proteins XRCC3 and BRCA2. Department of Medical and Molecular Genetics, King's College London School of Medicine at Guy's Hospital, London SE1 9RT, UK.
8. Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21:3191-3192.
9. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S3 (24 May 2005).
10. The BioCreAtIvE - Critical Assessment for Information Extraction in Biology Challenge. Genome Biology - Biology for the post-genomic era 2008, volume 9, supplement 2.
11. Smith L, Tanabe LK, Johnson nee Ando R, Kuo C-J, Chung I-F, Hsu CN, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RT-H, Dai H-J, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, et al. Overview of BioCreative II gene mention recognition. Genome Biology 2008, 9(Suppl 2):S2.
12. Kuo CJ, Chang YM, Huang HS, Lin KT, Yang BH, Lin YS, Hsu CN, Chung IF. Rich feature set, unification of bidirectional parsing and dictionary filtering for high F-score gene mention tagging. In Proceedings of the Second BioCreative Challenge Workshop, Madrid, Spain. CNIO; 200
13. BioCreative Homepage [http://biocreative.sourceforge.net/]
14. XML-RPC Specification [http://www.xmlrpc.com/]
15. BioCreative MetaServer [http://bcms.bioinfo.cnio.es/]
16. BioCreative XML-RPC MetaService [http://bcms.bioinfo.cnio.es/xmlrpc/]
17. GAPSCORE (Chang et al., 2004) [http://bionlp.stanford.edu/GAPSCORE/]
18. GENETAG [ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz]
19. AIIAGMT [http://aiia.iis.sinica.edu.tw/]
20. Kinoshita S, Cohen KB, Ogren PV, Hunter L. BioCreAtIvE Task1A: entity identification with a stochastic tagger. BMC Bioinformatics 2005, 6(Suppl 1):S4.
21. Pillet V, Zehnder M, Seewald AK, Veuthey AL, Petrak J. GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 2005, 21:1743-1744.
22. Carpenter B. Phrasal queries with LingPipe and Lucene: ad hoc genomics text retrieval. NIST Special Publication SP 500-261, The Thirteenth Text Retrieval Conference (TREC), 2004.
23. Yoshida M, Fukuda K, Takagi T. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics 2000 Feb;16(2):169-75.
24. Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput 2008;652-663.
25. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. Text processing through Web services: calling Whatizit. Bioinformatics 2008 Jan 15;24(2):296-8. Epub 2007 Nov 15.
26. Santorini B. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd printing). June 1990.
27. Franzén K, Eriksson G, Olsson F, Asker L, Liden P, Coster J. (2002) Protein names and how to find them. Int. J. Med. Inform., 67, 49–61.
28. Ohta T, Tateisi Y, Mima H, Tsujii J. (2002) Genia corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Human Language Technology Conference.
29. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70.
30. Yip YL, et al. Retrieving mutation-specific information for human proteins in UniProt/Swiss-Prot Knowledgebase.
31. GPSDB [http://expasy.org/cgi-bin/gpsdb/form]
32. SNPs [http://www.ncbi.nlm.nih.gov/About/primer/snps.html]
33. Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology.
34. Hirschman L, et al. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6, S1.
35. Ananiadou S, Kell DB, Tsujii J. Text mining and its potential applications in systems biology.
36. Hirschman L, Blaschke C. (2006) Evaluation of text mining in biology. In Text Mining for Biology and Biomedicine (Ananiadou S, McNaught J, eds), pp. 213–245, Artech House.
37. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
38. Cohen KB, et al. Empirical data on corpus design and usage in biomedical natural language processing.