BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 7 2005, pages 1227–1236
doi:10.1093/bioinformatics/bti084

Data and text mining

Automatic extraction of gene/protein biological functions from biomedical text

Asako Koike1,2,∗, Yoshiki Niwa2 and Toshihisa Takagi1

1 Department of Computational Biology, Graduate School of Frontier Science, The University of Tokyo, Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba 277-8561, Japan and 2 Central Research Laboratory, Hitachi Ltd., 1-280 Higashi-koigakubo, Kokubunji City, Tokyo 185-8601, Japan

Received on April 16, 2004; revised on September 21, 2004; accepted on October 5, 2004
Advance Access publication October 27, 2004

ABSTRACT

Motivation: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of information related to gene functions from text has been increasing.

Results: We have developed a method for automatically extracting the biological process functions of genes/proteins/families based on Gene Ontology (GO) from text using a shallow parser and sentence structure analysis techniques. When the gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of the gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques.
A preliminary experiment demonstrated that our method has an estimated recall of 54–64% with a precision of 91–94% for functions actually described in abstracts. When applied to PUBMED, it extracted over 190 000 gene–GO relationships and 150 000 family–GO relationships for major eukaryotes.

Availability: The extracted gene functions are available at http://prime.ontology.ims.u-tokyo.ac.jp

Contact: [email protected]

∗ To whom correspondence should be addressed.

INTRODUCTION

With the development of high-throughput methods such as the yeast two-hybrid method, mass spectrometry and genome sequencing, an enormous number of experimental results covering various genes can be quickly obtained. Although all relevant results reported so far should be considered when interpreting experimental data, retrieving them from PUBMED abstracts and/or full papers and studying them is an overwhelming task for a single researcher. A promising approach to overcoming this problem is the use of natural language processing (NLP) to automatically extract and mine the information. The advantages of NLP in the biomedical field have been demonstrated for gene/protein name recognition (Fukuda et al., 1998; Collier et al., 2000; Tanabe and Wilbur, 2002), protein–protein interactions (Friedman et al., 2001; Koike et al., 2003), and general event extraction (Yakushiji et al., 2001; Rindflesch et al., 2000; Humphrey et al., 2000). In addition, to clarify the definitions of the classes (concepts) that have been generated with the rapid advancements in the biomedical field, and to define the relationships between those classes, efforts have been made to construct ontologies manually (Ashburner et al., 2000; disease ontology, http://diseaseontology.sourceforge.net/; IMGT ontology, http://imgt.cines.fr/textes/IMGTindxn/ontology.html) and automatically (Blaschke and Valencia, 2002).
Gene Ontology (GO) (Ashburner et al., 2000), the most widely used ontology, consists of biological process, molecular function and cellular component ontologies. Several preliminary studies have been made on automatically annotating genes, proteins and families with the corresponding GO-ID, which is assigned to each defined term (class), using only abstracts and/or sequence information (Schug et al., 2002; Xie et al., 2002; Raychaudhuri et al., 2002; Nenadic et al., 2003). Shared evaluations such as the KDD cup (http://www.biostat.wisc.edu/~craven/kddcup/), TREC (http://trec.nist.gov/) and BioCreAtIvE (http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html) have been held to evaluate information extraction (IE) systems on common tasks such as gene name recognition and GO-based gene function annotation, to clarify common problems, and to accelerate IE progress in the biomedical field. Biomedical domain-specific problems in text mining have become apparent, and various techniques for solving them have been proposed.

We have developed a method for automatically assigning the GO-ID of a biological process to each gene and protein using natural language techniques. It uses shallow parsing and sentence structure analysis to extract the ACTOR and OBJECT relationships, so detailed gene functional annotations are possible, at least in theory. GO terms form a controlled vocabulary, so some of them appear infrequently in abstracts. Terms (words and multiword terms) with meanings similar or related to those of GO terms are gathered semi-automatically using co-occurrence and collocation similarity with GO terms to enable recognition of the functional terms. Furthermore, rule-based term generation, including morphological and syntactic term variations, is used to complement these semi-automatic term-gathering methods.

This paper is organized as follows. In the following section, we introduce related work and compare our method with related ones.
The functional terminology generation method and the gene–function relationship extraction method are explained in ‘Systems and Methods’. The precision and recall rate of our extraction system are presented in ‘Results and Discussion’. We also discuss the causes of errors. We conclude with a summary.

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Related work

The extraction of relationships between classes has been well studied (Yakushiji et al., 2001; Rindflesch et al., 2000; Humphrey et al., 2000). For example, Yakushiji et al. (2001) demonstrated biological event extraction using a full parser. Rindflesch et al. (2000) developed an information extraction (general extraction) system for drugs and genes relevant to cancer using a shallow parser and UMLS (http://www.nlm.nih.gov/research/umls/). For the identification of GO-IDs, methods using a combination of sequence information with text information (Xie et al., 2002) and methods using only text information have been proposed (Raychaudhuri et al., 2002; Nenadic et al., 2003; Kim and Park, 2004). Raychaudhuri et al. (2002) demonstrated that abstracts can be classified into major GO-ID classes using words in the abstracts and machine learning, such as the maximum entropy, Naive Bayes or nearest-neighbor method. Nenadic et al. (2003) classified genes and proteins into major GO-ID groups using words (ignoring collocations) appearing in abstracts and a support vector machine. Unfortunately, the receiver operating characteristic (ROC) curves of some class IDs, which did not have a sufficient number of abstracts, were quite poor. Since the annotated corpus for detailed GO-IDs is not yet sufficiently large, the IDs used in both methods are limited to high-level classes, such as ‘signal transduction’.
Although biologists would usually like to know the evidence for an automatic extraction result at the abstract or sentence level, neither method clearly shows it. There have been preliminary reports on sentence-level annotation using a gene–function relationship extraction method based on syntactic dependency (Kim and Park, 2004) and using one based on sentence similarity using Naive Bayes (Chiang and Yu, 2003). In BioCreAtIvE, which uses full texts, a sliding window approach was used to capture gene–function descriptions spanning multiple sentences (Krallinger and Padron, 2004), and query expansions and derivationally related term expansions were used to achieve wide recognition of gene ontology terms (Krymolowski et al., 2004). Since the information extraction of gene–function relationships is quite difficult, further efforts are required.

Functional annotation is difficult for two reasons: functional terms are expressed in a wide variety of ways, and some functions cannot be correctly extracted without considering the ACTOR–OBJECT relationship (relying on the simple co-occurrence of gene and functional terms leads to errors). In our approach, functional terms are gathered using various methods in order to address the first problem as much as possible, and sentence structure analysis is used to address the second problem.

SYSTEMS AND METHODS

The overview of our method for automatically extracting the gene function is shown in Figure 1. The following sections describe the method for gathering/generating the terms for each GO-ID and the method for extracting the information and assigning the GO-ID using the gathered terms.

Augmentation of functional terms

In principle, the terms for the biological process of GO are used to assign the GO-ID to the gene. However, the number of these terms is insufficient for automatic extraction (at least at the sentence level) in terms of the recall.
The terms are controlled vocabularies, and some are not frequently used in the abstracts. Furthermore, each GO class can be expressed using various terms instead of a defined GO term (e.g. ‘GO0006915: apoptosis’ can be replaced with ‘apoptotic process’ in most cases). We thus gather terms with the same, similar or related meanings semi-automatically based on: (1) related terms having a high co-occurrence score with GO terms, (2) similar terms having similar collocations with GO terms, (3) enzyme name extraction by pattern matching, (4) rule-based generation of syntactic/semantic variations and (5) verb–technical term combination variations.

[Fig. 1. Overview of the gene function extraction process. Terminology preparing process: augmentation of functional terms based on GO; construction of gene/protein/family name dictionaries. Function extraction process: Step 1: gene/protein/family name and functional term recognition; Step 2: shallow parsing, noun phrase bracketing, sentence structure analysis; Step 3: ACTOR–OBJECT relationships extraction; Step 4: GO-ID assignment to genes (type-1 extraction); Step 5: keyword search in OBJECT; Step 6: GO-ID assignment to genes (type-2 extraction).]

Here, ‘similar meaning’ and ‘relatedness’ are distinguished. Words with a ‘similar meaning’ always belong to the same (meaning) class, whereas words that are ‘related’ do not necessarily belong to the same class. For example, ‘metamorphosis’ and ‘metabolism’ are similar-meaning words. They belong to the same class, ‘metabolic system’. In contrast, ‘chaperone’ and ‘protein folding’ are related words. A ‘chaperone’ is a protein that helps the folding of other proteins. ‘Protein folding’ is a protein action. They do not belong to the same class. A function can be expressed using related words and/or similar-meaning words: ‘Protein-A helps the protein folding of Protein-B’; ‘Protein-A is a chaperone of Protein-B’. Both sentences indicate the same function; the GO term ‘GO0006457: protein folding’ is assigned to Protein-A.
Basically, related terms tend to appear in the same abstracts, while similar-meaning terms tend to have similar collocations (local contexts). Accordingly, related terms (words and collocations) and similar-meaning terms are semi-automatically gathered using co-occurrences and collocations as follows.

Related terms having high co-occurrence score with GO terms

The terms that co-occur with statistically significant frequency in the abstracts are extracted as follows. Assume we have a large-scale text database, D. For a GO function term F, the co-occurrence score of an arbitrary term T with F can be measured using various formulae, among which the simplest is the ratio between the density of T in the texts containing F and the density of T in the whole database D. Although there are more sophisticated formulae, they are difficult to use when comparing the co-occurrence scores of terms whose frequencies differ greatly. Therefore, we first classify all candidate terms (i.e. terms appearing at least once in the texts containing F) into several frequency classes and take the ones with the highest score from each class. The summarized candidate terms calculated for each organism and each frequency are shown in HTML format, and terms with a meaning similar to that of the query term are selected by biologists (PhD holders or PhD students).

Similarity of terms having similar collocations with GO terms

As mentioned above, the similarity of terms is measured by the similarity of their collocations. For each term T, its profiling text is defined as the set of all collocations of T (paired with their frequencies) in the database. As for the types of collocations, we adopted simpler ones such as np (noun phrase)-vp (verb phrase), vp-np, vp-prep (preposition)-np, np-vp-np, np-vp-prep-np and np-prep-np. The search for similar terms is then done by applying a similar text search technique, the vector space model (Salton et al., 1975). The procedure for making the profiling text of a term is shown in Figure 2. After shallow parsing of the whole text, all collocation patterns (see above) are extracted. The expansion process and the sorting and indexing processes are then applied to obtain the indexing of all terms by their collocations.

[Fig. 2. Procedure for collocation similarity calculation. MEDLINE abstracts are shallow-parsed and chunked; relations of the forms np-vp-np, vp-prep-np and np-prep-np are extracted, expanded into collocations, and then sorted and indexed by term with their frequencies. For the example relation Pitx2 / activate / Gad1 promoter, the expansion yields entries such as ‘Pitx2: *-[activate]-Gad1_promoter’, ‘promoter: Pitx2-[activate]-*’ and ‘Gad1_promoter: Pitx2-[activate]-*’.]

The similarity is defined by the following equations, which are known as SMART (Singhal et al., 1996):

\[ \mathrm{sim}(X, q) = \frac{\mathrm{Avr}\left[\rho(c_i)\,\nu(c_i \mid q)\,\nu(c_i \mid X)\right]}{L + \kappa\,[\mathrm{dlen}(X) - L]} \tag{1} \]

\[ \rho(c_i) = \log\left[1 + \frac{N}{\mathrm{df}(c_i)}\right] \tag{2} \]

\[ \nu(c_i \mid X) = \frac{1 + \log[\mathrm{tf}(c_i \mid X)]}{1 + \log[\mathrm{tf}(\cdot \mid X)]} \tag{3} \]

where X is a similar term candidate and q is the query GO term, dlen(X) is the number of different collocations in the profile of term X, L is the average of dlen(X) over all terms, κ is a slope constant, which is set to 0.2, and ci is the i-th collocation of query term q. The weight of each collocation ρ(ci) is defined by Equation (2), where df(ci) is the number of terms whose profile texts contain ci and N is the total number of terms.
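As an illustration only, Equations (1)–(3) could be computed along the following lines. This is a minimal sketch under several assumptions that the text does not state: each profile is a dictionary mapping a collocation to its frequency, logarithms are natural, ν is taken as 0 for a collocation absent from a profile, and Avr is read as the mean over the collocations of the query term. The toy profiles and the function names are hypothetical.

```python
import math

# Hypothetical toy profiles: term -> {collocation: frequency}
profiles = {
    "apoptosis": {"*-[induce]": 4, "[inhibit]-*": 2},
    "cell death": {"*-[induce]": 3, "[inhibit]-*": 1, "[cause]-*": 1},
    "promoter": {"[activate]-*": 5},
}

def rho(c, profiles):
    """Equation (2): collocation weight log(1 + N / df(c))."""
    n_terms = len(profiles)
    df = sum(1 for prof in profiles.values() if c in prof)
    return math.log(1.0 + n_terms / df) if df else 0.0

def nu(c, profile):
    """Equation (3): damped frequency of collocation c within one profile.
    Assumption: nu is 0 when c does not occur in the profile."""
    tf = profile.get(c, 0)
    if tf == 0:
        return 0.0
    avg_tf = sum(profile.values()) / len(profile)  # tf(.|X)
    return (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf))

def sim(x, q, profiles, kappa=0.2):
    """Equation (1): similarity of candidate term x to query term q."""
    px, pq = profiles[x], profiles[q]
    avg_dlen = sum(len(p) for p in profiles.values()) / len(profiles)  # L
    # Assumption: Avr[...] is the mean over the collocations of q
    products = [rho(c, profiles) * nu(c, pq) * nu(c, px) for c in pq]
    avr = sum(products) / len(products)
    return avr / (avg_dlen + kappa * (len(px) - avg_dlen))
```

With these toy profiles, `sim("cell death", "apoptosis", profiles)` is positive because the two terms share collocations, whereas `sim("promoter", "apoptosis", profiles)` is zero because they share none.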
The weight of significance of each collocation ci with respect to term X is given by Equation (3), where tf(ci|X) is the frequency of collocation ci in the profile of X, and tf(.|X) is the average of the frequencies over the collocations constituting the profile of X. The summarized candidate terms calculated for each organism and each frequency are shown in HTML format, and terms with a meaning similar to that of the query term are selected by the same biologists (PhD holders or PhD students).

Enzyme name extraction by pattern matching

Most functions, including metabolism, catabolism and synthesis, are expressed using an enzyme name. To compensate for the weakness of the vocabularies extracted using the two methods described above, enzyme names ending with ‘ase’ are also extracted from the abstracts covering one year. For example, for the ‘GTP metabolism’ function, ‘GTP cyclohydrolase’, ‘GTP hydrolase’, ‘GTPase’ and ‘GTP guanylyltransferase’ are extracted as enzyme names related to ‘GTP metabolism’. However, some enzyme terms that end in ‘ase’ are not related to these functions. For example, the function of ‘permease’ belongs to ‘transport’. These unrelated terms are removed from the collected vocabularies semi-automatically.

Rule-based generation of syntactic/semantic variations

Syntactic variations such as ‘folding of protein’ for ‘protein folding’ are automatically generated. Furthermore, semantically similar/related terms (metabolism→metabolic, metastasis, metamorphosis, reducer, reduction) and derivationally related terms (apoptosis→apoptotic) of a GO term, or of a GO term consisting of a single word, are gathered using UMLS (for derivationally related terms), WordNet (http://www.cogsci.princeton.edu/~wn/) (for both) and expert knowledge (for both). Errors are generated in some automatic conversions. For example, the terms ‘transport’ and ‘exchange’ are used interchangeably in ‘ion transport/exchange’, but not in ‘nuclear transport’.
Accordingly, only conservative conversion terms are provided. Functional term variations are generated using these similar/related terms. When the same term is automatically generated for multiple GO-IDs, the superclass ID (higher concept class ID) is used. Hyponym terms (lower class terms; e.g. ‘phosphatidylinositol’ is a hyponym of ‘phospholipid’) are also gathered from the MeSH terms (http://www.nlm.nih.gov/mesh/meshhome.html).

Verb–technical term combination variations

Some functions such as ‘regulation’, ‘transport’ and ‘synthesis’ are frequently expressed by the combination of a verb and technical terms. A predefined verb is combined with one or more technical terms. For example, ‘GO0006846: acetate transport’ is assigned to ACTOR when the verb is ‘transport’, ‘locate’, ‘localize’, ‘translocate’, ‘import’ or ‘export’ and ‘acetate’ is included in OBJECT. Furthermore, some functions can be determined based on the combination of a verb and an OBJECT or based only on the verb. ‘GO0004672: protein kinase activity’ is assigned when the verb is ‘phosphorylate’ and the ACTOR and OBJECT include a protein name. If the OBJECT does not include a protein name, the ACTOR may be a kinase (for compounds). When the verb is ‘palmitoylate’, ‘GO0018318: protein amino acid palmitoylation’ is assigned to ACTOR without investigating terms in the OBJECT. These verb–technical term combination variations are semi-automatically produced.

By applying the first two methods to about 190 major GO terms, we gathered about 3000 terms. Of these, fewer than 30% were extracted by both method 1 (co-occurrence) and method 2 (collocations). That is, these methods compensate for each other’s weaknesses. By using all five methods, we gathered about 240 000 terms. (There were about 10 000 original GO terms.)

Extraction of relations between genes and gene functions

The biological function of each gene was annotated using the following procedure, which is illustrated in Figure 1.
The example sentence is shown in Figure 3. The steps are as follows.

Step 1. Recognition of gene/protein/family names and GO functional terms

The gene name recognition method is described elsewhere (Koike and Takagi, 2004). Briefly, gene name recognition is carried out using the GENA gene name dictionary (http://gena.ontology.ims.u-tokyo.ac.jp/search/servlet/gena) and family name dictionary (http://marine.ims.u-tokyo.ac.jp:8080/Dict/family), which were constructed based on major database entries. In our system, a protein name that does not specify the gene locus is treated as a family name. For example, since ‘14-3-3’ does not specify the gene locus (‘14-3-3 alpha’, ‘14-3-3 beta’, etc.), it is registered as a protein family name. The variations in gene name were generated based on these dictionaries and were quickly searched against abstracts using a purpose-built trie with many heuristics, such as replacing special characters with spaces, searching inside and outside parentheses separately [e.g. mitogen-activated protein kinase (MAPK) 1→mitogen-activated protein kinase 1 + MAPK1], and expanding continuous expressions (e.g. GATA-4/5/6→GATA4, GATA5, GATA6). After gene/protein/family name recognition, ambiguities in gene names, especially in abbreviated names [e.g. TAK1 is the abbreviated synonym for MAP3K7 (mitogen-activated protein kinase kinase kinase 7) and NR2C2 (nuclear receptor subfamily 2, group C, member 2)], were resolved using a full-name/abbreviation pair search and a keyword search. Finally, the existence of multiple expressions for the same gene was checked [e.g. multiple-name expression HAP1 (CYP1) in Saccharomyces cerevisiae: HAP1 is the gene name of YLR256W and YPL101w, but the second name CYP1 specifies this gene as YLR256W]. In our method, precision and recall were over 90% for the major eukaryotes (Koike and Takagi, 2004).
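The name-variation heuristics mentioned above can be illustrated with a short sketch. It covers only three of them (special-character replacement, parenthesis splitting and expansion of continuous expressions) and does not attempt the trie search or the disambiguation steps. The function names and regular expressions are hypothetical and are not taken from the authors' system.

```python
import re

def normalize(name):
    """Replace runs of special characters with a single space."""
    return re.sub(r"[^0-9A-Za-z]+", " ", name).strip()

def split_parenthesis(name):
    """Split 'long form (ABBR) suffix' into a long and an abbreviated variant."""
    m = re.fullmatch(r"(.+?)\s*\(([^()]+)\)\s*(.*)", name)
    if not m:
        return [name]
    outside, inside, suffix = m.groups()
    long_form = f"{outside} {suffix}".strip()
    short_form = f"{inside}{suffix}"
    return [long_form, short_form]

def expand_continuous(name):
    """Expand a slash-continued name such as 'GATA-4/5/6' into single names."""
    m = re.fullmatch(r"(.*?)[-\s]?(\d+(?:/\d+)+)", name)
    if not m:
        return [name]
    stem, numbers = m.group(1), m.group(2).split("/")
    return [f"{stem}{n}" for n in numbers]
```

For instance, `expand_continuous("GATA-4/5/6")` returns `["GATA4", "GATA5", "GATA6"]`, while a family name without a slash, such as `"14-3-3"`, is returned unchanged.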
The recognition of functional terms was also quickly done over all abstracts using a trie that allows for trivial term variation (replacement of special characters with a space).

Step 2. Shallow parsing, noun phrase bracketing and sentence structure analysis

Shallow parsing was done for sentences with gene name IDs using FDG-Lite (http://www.connexor.com/). After noun phrase bracketing using dependency/syntactic tags and morphological tags, parentheses, coordinate clauses, subordinate clauses, etc. were analyzed using various standard rules. FDG-Lite, developed by Voutilainen et al. at the University of Helsinki, gives the base form, dependency/syntactic tags and morphological tags. When a determiner, adverbial and adjective modifiers, coordinating conjunction, participle, noun and pronoun are contiguous, they are regarded as a noun phrase. Recognizing the boundary of a noun phrase that includes a coordinating conjunction or comma requires additional checks: the number of coordinating conjunctions before the target coordinating conjunction, whether or not a ‘past_participle_modifier’ is located after the target coordinating conjunction, whether the verb is before or after the target coordinating conjunction, and whether or not the target coordinating conjunction is in a subordinate phrase or adverbial phrase beginning with an interrogative. In principle, the noun phrase preceding the predicate verb is regarded as the subject, and the noun phrase or prepositional phrase immediately following the predicate verb is regarded as the object. Certain rules are used for complicated sentence structures, such as coordinate-conjunction and insertion-phrase structures. For example (ignoring adverb phrases and prepositions for simplicity):

NP1 verb1 NP2 coordinating_conjunction verb2 NP3 → The subject of verb2 is NP1.
NP1, Verb1-ing NP2, Verb2 NP3 → The subject of verb1 and verb2 is NP1.
NP1, NP2 verb1 NP3, verb2 NP4 → The span from NP2 to NP3 is an insertion phrase; the subject of verb2 is NP1.
NP1 verb1 (predefined verb, such as belong, consist, encode) NP2 [relative pronoun] verb2 NP3 → The subjects of verb2 are NP1 and NP2.
NP1 verb1 to-infinitive verb2 NP2 → The subject of verb2 is NP1.

In noun phrases including a ‘modifier_of_noun:past_participle’ and ‘participle’, the subject and object inside the phrase are also extracted:

NP1 verb-ing NP2 → NP1 verb NP2
NP1 verb-en NP2 → NP2 verb NP1,

where NP is a noun phrase without a prepositional phrase.

Simple anaphora (coreference of a term or phrase with its antecedent) resolution was also tried. When a pronoun appeared after a relative pronoun, the previously appearing gene name was assigned after checking for singular/plural consistency. For example, in the sentence ‘In S.cerevisiae, OAC is in the inner mitochondrial membranes, and deletion of its gene greatly reduces transport of oxaloacetate sulfate’, our program resolves ‘its gene’ to ‘OAC’.

Step 3. ACTOR–OBJECT relationships extraction

The gene–function relationships are extracted when they are expressed in ACTOR–OBJECT relationships with predefined verbs or in modification relationships. Here, ACTOR (agent) means the doer of action and OBJECT means the receiver of action (a higher concept than the ‘object’ of subject–object). Basically, the relationship is extracted only when the ACTOR is a gene name and the OBJECT is a gene function. For some verbs, such as ‘require’, the reverse relation is extracted. We use the terms ACTOR and OBJECT because the relationships between ACTOR/OBJECT and gene name/function are not affected by the choice of passive or active voice, whereas subject–object relationships are (in most cases, the subject is the protein and the object is its function in the active voice, while the opposite holds true in the passive voice). The extraction patterns are roughly summarized in Table 1.
In each sentence, only the gene function extracted using the corresponding pattern is highlighted. The kinds of verbs were predefined. As shown in Table 1, the ACTOR and OBJECT extraction patterns were not limited to subject–object relationships. The gene and its function can be expressed in a modification relationship, subject–complement relationship, subject–adverb relationship and so on.

[Fig. 3. Steps in sentence analysis.]

Step 4. GO-ID assignment to genes (type-1 extraction)

After extraction of the gene–function relationship, the system checks whether it is negative or affirmative and whether or not it is a contingent fact (including ‘investigate’, ‘test’, ‘examine’, ‘study’, ‘design’ and ‘predicate’). (The negative and contingent facts are also stored in the database PRIME with marks. However, in the following discussion, these relationships are not used.) For the verb–technical term combinations (as described in ‘Augmentation of functional terms’), the verb is confirmed to be the predefined one. For some terms, it is difficult to determine from one sentence whether the assigned function is appropriate. For example, ‘GO0006350: transcription’ is defined as ‘the synthesis of either RNA on a template of DNA or DNA on a template of RNA’ by the Gene Ontology Consortium. In many contexts, the ACTOR of ‘transcription’ is simply the protein activator. Accordingly, the GO-ID is accepted only when at least one keyword such as ‘zinc-finger’, ‘Pol_I’, ‘Pol_II’, ‘Pol_III’ or ‘TFIIB’ appears in the same abstract. Finally, the gene-ID and GO-ID relationship is output.

Step 5. Keyword search in object/complement (type-2 extraction)

In the example sentences shown in Figure 3, a complete GO term is expressed in each sentence. However, if the sentence includes an expression such as ‘chromosome III segregation’, the same ID cannot be assigned.

Table 1.
Example extraction patterns

Each entry gives the sentence type followed by an example. [ ] represents a noun phrase.

Pattern: Basic type (a gene name and its function appear in different noun phrases connected by a verb phrase)

NP-VP-NP: Smith and Mitchell (1989) found that [overexpression of <gene>IME1</gene>] induced [an <GO>early meiotic event (recombination)</GO> in rich medium], but later meiotic events did not occur (i.e., they detected no spore formation).

NP-VP-PP: [Either pheromonal activation or an <gene>scg1</gene> null mutation] relieves the negative control and leads to [an <GO>arrest of cell growth</GO> in the G1 phase of the <GO>cell cycle</GO>].

NP-VP-NP PREP Verb-ing NP: Many cancer cells protect [themselves against <GO>apoptosis</GO>] by [activating <gene>nuclear factor-kappaB (NF-kappaB)/Rel</gene>, a transcription factor that] helps in cell survival.

NP-VP-NP (relative pronoun)-VP: [The <gene>Kar3 protein</gene> from Saccharomyces cerevisiae] is [a minus end-directed kinesin family member that] is involved in [both <GO>nuclear fusion</GO>, or <GO>karyogamy</GO>, and <GO>mitosis</GO>].

NP-to infinitive: [The ability of recombinant <gene>TIA-1</gene>] to induce [<GO>DNA fragmentation</GO> in permeabilized cells] suggested that this protein is the granule component responsible for inducing apoptosis in cytolytic lymphocyte (CTL) targets.

NP-VP-to-infinitive: [<gene>p53</gene>] is required to induce [<GO>programmed cell death apoptosis</GO>].

NP-VP-NP/PP-PP: [The <GO>aromatic carboxylic acids</GO>] are converted to the corresponding vinyl derivatives by [<gene>Pad1</gene>].

Pattern: Modification inside of NP

NP-verb/EN-PP: [Eight children (5 living, 3 deceased) with severe hereditary nonspherocytic <GO>hemolytic anemia</GO> caused by <gene>glucose phosphate isomerase</gene> deficiency] have been observed in two Kentucky and Indiana families.

NP of NP: Furthermore, the level of Ime1 depends on [the <GO>kinase activity</GO> of <gene>Ime2</gene>].

NP Verb-ing NP: A unique 892-base pair cDNA was cloned that prevented [the <GO>programmed cell death response</GO> following <gene>IL-3</gene> deprivation] by causing antisense suppression of an endogenous 2.4-kilobase (kb) mRNA.

NP by NP: We also found that <gene>JNK</gene> activation by <GO>UV irradiation</GO>.

NP NP: Saccharomyces cerevisiae Bpt1p is an ATP-binding cassette (ABC) protein that belongs to the MRP subfamily and is [a close homologue of the glutathione conjugate (GS conjugate) <GO>transporter</GO> <gene>Ycf1p</gene>].

Pattern: Adverb phrase (clause)

during NP: In summary, our analyses of embryonic and adult mice demonstrate [that two different <gene>AP-2 transcription factors</gene>] are specifically expressed during [<GO>differentiation</GO> of many neural, epidermal and urogenital tissues].

Pattern: Verb–protein

N (including protein)-Verb-NP (including protein): Although <gene>Cdc15</gene> phosphorylated [<gene>Dbf2</gene>, <complex>Dbf2-Mob1</complex>, and <complex>Dbf2(S374A/T544A)-Mob1</complex>], the pattern of phosphate incorporation into Dbf2 was substantially altered by either the S374A T544A mutations or omission of Mob1.

Pattern: Only verb

NP verb: Here, evidence is presented that [the <gene>Ras2 protein</gene> of Saccharomyces cerevisiae] is palmitoylated by [a <gene>Ras protein acyltransferase (Ras PAT)</gene> encoded by the <gene>ERF2</gene> and <gene>ERF4</gene> genes].

While the resolution of collocation variants has been well studied in NLP (Jacquemin and Royaute, 1994), it is still a challenging task. Here, we tried a simple keyword search. The score for each word making up a functional term (TermScore[i], the score of the i-th word) was defined as

\[ \mathrm{TermScore}[i] = \frac{1}{1 + \log(\mathrm{frequency}_i + 1)} \]

where ‘frequency’ is the frequency of appearance in abstracts over 2 years. The sum score of each collocation (SumScore) was calculated. The score for each collocation with the given keywords (CollScore) was defined as

\[ \mathrm{CollScore} = \frac{\sum_{j \in \text{given keywords}} \mathrm{TermScore}[j]}{\mathrm{SumScore}} \]
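The keyword scoring just described can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes natural logarithms, takes SumScore to be the sum of the TermScore values over all words of the functional term, and uses invented word frequencies. The function names are hypothetical.

```python
import math

def term_score(frequency):
    """TermScore = 1 / [1 + log(frequency + 1)]; rarer words score higher."""
    return 1.0 / (1.0 + math.log(frequency + 1))

def coll_score(term_words, given_keywords, word_freq):
    """CollScore: share of a functional term's total score that is covered
    by the keywords found in the OBJECT."""
    scores = {w: term_score(word_freq.get(w, 0)) for w in term_words}
    sum_score = sum(scores.values())  # SumScore (assumed: sum over all words)
    matched = sum(s for w, s in scores.items() if w in given_keywords)
    return matched / sum_score

# Hypothetical word frequencies, for illustration only
word_freq = {"mitotic": 500, "chromosome": 2000, "segregation": 300}
object_words = {"chromosome", "III", "segregation"}

full = coll_score(["chromosome", "segregation"], object_words, word_freq)
partial = coll_score(["mitotic", "chromosome", "segregation"], object_words, word_freq)
```

Here `full` equals 1 because every word of the functional term occurs in the OBJECT, whereas `partial` is lower because ‘mitotic’ is missing. Since rare words carry more weight, a missing rare word lowers the score more than a missing common one.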
When the top CollScore exceeded 0.75, the corresponding GO-ID was accepted. The threshold of 0.75 was determined by using about 100 training abstract sets.

Step 6. GO-ID assignment to genes (type-2 extraction)

Step 6 is the same as Step 4, but for type-2 extractions.

RESULTS AND DISCUSSION

Evaluation method

To evaluate the performance of our extraction method, we used the same abstracts used for GO-term annotation in the SGD database (http://www.yeastgenome.org/) as an S.cerevisiae test set and those for GO annotation (GOA; Camon et al., 2003) as a Homo sapiens test set. These annotated data include GO evidence codes. Since they include GO-IDs assigned based on ‘sequence similarity’, we used only the abstracts with evidence code ‘IDA: inferred from direct assay’. Furthermore, since the SGD and GOA annotations were done using full papers and the biological functions are not described in some abstracts, we used abstracts that included the corresponding gene names. In total, we used 510 abstracts (726 gene–function relationships) for S.cerevisiae and 202 abstracts (226 gene–function relationships) for H.sapiens. The recall [= true_positive/(true_positive + false_negative)] was calculated using these abstracts and annotated relationships. Whether each gene–function relationship could be extracted from the corresponding abstract was used as the evaluation metric. Since not all relationships described in each paper were extracted in SGD and GOA (probably because the annotators’ primary queries to PUBMED are gene names, so information on other genes is not necessarily extracted), two kinds of precision [= true_positive/(true_positive + false_positive)] were calculated. One was calculated using the SGD/GOA annotation as the gold standard (type-1 and type-2). The other was calculated based on 100 randomly selected gene–function relationships extracted using each method (type-1 and type-2) for S.cerevisiae and H.sapiens.
In the calculation based on the 100 randomly selected relationships, whether the assigned GO-ID was appropriate was determined using the same criteria (criteria-1, -2 or -3). In the following section, the precision and recall for each method and the causes of the false-positive and false-negative errors are discussed.

Table 2. Recall rate using type-1 extraction
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    18.2 (29.8)%    35.3 (51.2)%    42.4 (54.4)%
H.sapiens       19.5 (28.7)%    36.3 (50.8)%    43.4 (57.1)%
Numbers in parentheses show the expected recall rate when the target of manual information extraction is also limited to abstract information (without body text). Criteria-1 represents a complete match between our assigned GO-ID and the SGD/GOA-assigned GO-ID. Criteria-2 represents a relationship within two higher or lower classes. For example, the class hierarchy ‘GO0009987: cellular process’→‘GO0008219: cell death’→‘GO0012501: programmed cell death’→‘GO0006915: apoptosis’→‘GO0042981: regulation of apoptosis’→‘GO0043065: positive regulation of apoptosis’ is defined based on the Gene Ontology group. The assignment of GO0006915 instead of GO0043065 is allowed in criteria-2. In criteria-3, all superclasses/subclasses except the highest class are allowed. (In this example, ‘GO0009987: cellular process’ belongs to the highest class.) In criteria-3, the assignment of ‘GO0008219: cell death’ instead of ‘GO0043065: positive regulation of apoptosis’ is accepted.

Table 3. Precision using type-1 extraction
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    54 (14.3)%      78 (34.5)%      94.0 (43.4)%
H.sapiens       48 (15.4)%      70 (36.5)%      91.0 (51.2)%
Numbers in parentheses show the precision when the GOA and SGD annotations were used as the gold standard. Numbers in front of the parentheses show the actual precision obtained by manually checking the extracted gene–function relationships.

Precision and recall

Precision and recall for type-1 extractions
Tables 2 and 3 show the results of the type-1 extractions. Since the GO has a hierarchical structure, a superclass or subclass ID was assigned in some cases. The definitions of criteria-1, -2 and -3 are given in the notes to Tables 2 and 3. Although only abstract information is used for gene/protein function extraction in this method, some GOA/SGD annotations are not described in the abstracts but only in the body text. Therefore, we investigated whether each of 100 randomly selected GOA/SGD annotations was written in the abstract in an in-depth class description (criteria-1), in a higher class description (criteria-2, -3), or neither. The numbers in parentheses in Table 2 are the recall estimated using these values. For example, for S.cerevisiae, 61% of the annotations were written in an in-depth class description (complete match with the SGD annotation) in the abstracts. Therefore, 18.2/0.61 = 29.8% is the estimated recall.

As shown in Table 2, the recall rate was low. About 5% of our results were written in a lower (more detailed) class description than the SGD/GOA annotations for both organisms. The superclass ID annotations produced by our method were due to adverbs being ignored and to insufficient vocabulary for hyponyms. For example, in the sentence ‘Inhibition of angiogenesis by recombinant human platelet factor-4 and related peptides’, ‘GO0001525: angiogenesis’ was assigned by our method instead of ‘GO0016525: negative regulation of angiogenesis’. Classifying the adverbs, verbs and adjectives should enable more detailed annotations.

When only the co-occurrence of a GO-ID and a gene-ID in the same sentence was used for the annotation, the recall rates (criteria-3) for S.cerevisiae and H.sapiens were 63 and 55%, respectively. However, the precisions were lower than 50% because the ACTOR–OBJECT relationship was not considered. For example, ‘GO0008152: metabolism’ was erroneously assigned to ‘protein-B’ in the sentence ‘protein-A is involved in the metabolism of protein B’. In this case, the precision may differ greatly among the vocabularies prepared for each GO-ID.

The causes of the low recall rate even for criteria-3 are as follows (in order of frequency).

Incomplete vocabularies. In many cases, hyponyms were used in the abstract. For example, ‘in response to phorbol ester’ is used in one abstract to represent the function ‘response to organic substance’. The knowledge that ‘phorbol ester’ is a hyponym of ‘organic substance’ is required. Although the MeSH terms described above and the UMLS hierarchy were used in some cases, the resolution of a broad class is difficult.

Function not described by a pattern in Table 1. The function expressions varied. For example, ‘We purified Tip1p from a glucanase extract of yeast cell walls and analyzed the sugar chain involved in the cell wall linkage’ describes the function of ‘GO0007047: cell wall organization and biogenesis’. Although ‘purify’ was registered as a predefined verb, ‘purify NP (gene-name) from NP (apparatus name)’ was not provided for the ‘organization’ process. Furthermore, some relationships are difficult to extract. For example, ‘The main physiological roles of Odc1p and Odc2p are probably to supply 2-oxoadipate and 2-oxoglutarate from the mitochondrial matrix to the cytosol where they are used in the biosynthesis of lysine and glutamate, respectively, and in lysine catabolism’ implicitly indicates the functions of Odc1 and Odc2 as ‘GO0006839: mitochondrial transport’. Although these patterns and predefined verbs can be added, the task is never ending.

Function not written in one sentence. In some cases, the function is written over multiple sentences, and a pronoun is used in a subsequent sentence. Although some trials using multiple sentences have been reported (Krallinger and Padron, 2004), resolving this situation without degrading precision appears difficult.
Parser errors and structure errors. There were only a few false negatives due to parser and structure errors, so few sentence structures had to be analyzed. The false negatives were due to incorrect recognition of the start point of prepositional phrases, incorrect recognition of ‘modifier_of_noun: past_participle’ instead of ‘main_verb: past_tense’, and unresolved anaphora.

In Table 3, the numbers in parentheses were calculated using the SGD and GOA annotations as the gold standard, while the numbers before the parentheses represent the precision based on manual checking of 100 randomly selected gene–function relationships. Although deep class annotation (criteria-1) seems to be difficult, the precision for criteria-3 seems to be sufficient for practical use. The causes of the errors are as follows (in descending order of importance):

Gray zone errors. Many errors occurred in the semantically gray zone. For example, for the sentence ‘IGIF has been found to enhance the production of interferon-gamma (IFN-gamma) and granulocyte/macrophage colony-stimulating factor (GM-CSF) while inhibiting the production of IL-10 in concanavalin A (Con A)-stimulated PBMC’, the assignment of the function ‘GO0042091: IL-10 biosynthesis’ to ‘colony-stimulating factor (GM-CSF)’ is in the gray zone (the extraction pattern is ‘adverb phrase’ in Table 1). This sentence implicitly indicates the relation but does not state the function explicitly. While these errors can be eliminated by limiting the number of extraction relationship patterns, doing so would reduce the recall.

Loose verb conditions. The limitation on verbs in the verb–technical term combinations is quite loose in the ‘catabolism’, ‘synthesis’, ‘biosynthesis’, ‘organization’, ‘biogenesis’ and ‘metabolism’ combinations. This causes false positives.
For example, for ‘Saccharomyces cerevisiae possess two Escherichia coli endonuclease III homologs, NTG1 and NTG2, whose gene products function in the base excision repair pathway and initiate removal of a variety of oxidized pyrimidines from DNA’, ‘GO0006221: pyrimidine biosynthesis’ and ‘GO0006281: DNA repair’ were assigned to NTG1 and NTG2. The former is a false positive. An additional screening device is needed for such terms.

Parser errors and sentence structure analysis errors. There were only a few parser and sentence structure analysis errors in our small test sets. Some errors were observed in the modification relationships in ‘NP preposition NP’ and in the recognition of long names.

Gene name recognition errors. Some names are shared by multiple genes. Although the resolution of this ambiguity by abbreviation–full name matching and keyword searching was tried, there were still some failures, as described elsewhere (Koike and Takagi, 2004).

Precision and recall for type-2 extractions
To compensate for the lack of various expressions for biological functions, ‘key word match’ instead of ‘complete term match’ was also tried. The results are summarized in Tables 4 and 5. Compared to the rates for the type-1 extractions, the recall rate was slightly higher, and the precision was slightly lower. The increase in the recall rate is only slight because the keyword search was done within the noun phrase, ignoring the adverb. For example, for ‘Swi5 and Ace2 are cell cycle-regulated transcription factors that activate expression of early G(1)-specific genes in Saccharomyces cerevisiae’, our program extracted the relations <Swi5 and Ace2> be <cell cycle-regulated transcription factors> and <Swi5 and Ace2> be-activate <expression of early G(1)-specific genes>. A key word search was done in each noun phrase, and only the hypernym ‘GO0007049: cell cycle’ was assigned, although ‘GO0000114: G1-specific transcription in mitotic cell cycle’ can be assigned to the Swi5 and Ace2 genes by using all bold key words. To raise the recall rate, the score threshold was lowered to 0.75. However, some false positives are attributable to this lower threshold. For example, ‘GO0015919: peroxisomal membrane transport’ was given by only the keywords ‘peroxisomal membrane’. In this case, function assignment without considering ‘transport’ or an appropriate verb or other nouns caused the false positive.

Table 4. Recall rate using type-2 extraction (datasets including type-1 hits)
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    20.7 (33.9)%    37.2 (53.9)%    45.9 (58.8)%
H.sapiens       20.8 (30.6)%    38.9 (54.0)%    48.7 (64.1)%

Table 5. Precision using type-2 extraction (datasets including type-1 hits)
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    49 (13.9)%      69 (29.5)%      93.9 (37.4)%
H.sapiens       40 (11.8)%      60 (30.2)%      90.6 (43.7)%

Function extraction for each organism
We applied our method to all the abstracts with each MeSH term (homo sapiens, mice, rats, drosophila melanogaster, caenorhabditis elegans and saccharomyces cerevisiae). The results are summarized in Table 6. We also extracted the family name–function relationships. Many of them (>80%) were not yet registered in the major databases such as LocusLink, RGD, GDB, SGD, Flybase and WormBase. These results are searchable at http://prime.ontology.ims.u-tokyo.ac.jp.

When all abstracts were used, the recall rate of gene–function relationships in the previous test set of S.cerevisiae was 32.4% for criteria-1 and 69.2% for criteria-3. Those of H.sapiens were 31.0% for criteria-1 and 70.8% for criteria-3. Since the precision and recall rate at each gene–function relationship level (fact level) should differ from those at the abstract level, the precision was recalculated at the fact level and is summarized in Table 7 (family name–function relationships are not included). Here, ‘fact level’ is used to mean whether the extracted gene–function relationship is correct or not.
When multiple evidential sentences are extracted for one gene–function relationship from multiple abstracts, the gene–function relationship is regarded as correct, i.e. a fact, if at least one sentence is correct. The difference in precision among organisms in Table 7 is mainly due to the difference in precision of gene/protein name recognition for each organism. The precision at the gene–function relationship level in Table 7 is slightly lower than that at the abstract level in Tables 4 and 5. Erroneously extracted gene–function relationships were supported by only one or two evidential sentences. The precision could be increased to some extent by discarding the gene–function relationships with few evidential sentences.

Table 6. Function extractions for each organism
Organism          No. of abstracts    Protein (family) kinds    Extracted function gene (non-red) + family (non-red)
S.cerevisiae      51 646              2568 (1563)               22 323 (11 122) + 16 145 (8200)
C.elegans         5 170               907 (534)                 3602 (2282) + 1891 (1324)
D.melanogaster    19 331              1294 (533)                4829 (3201) + 2040 (1376)
M.musculus        666 098             5985 (2736)               174 194 (50 757) + 140 992 (34 524)
R.norvegicus      1 035 237           3254 (2257)               142 047 (38 554) + 153 434 (37 024)
H.sapiens         8 219 949           7268 (3609)               337 945 (83 713) + 351 255 (67 540)

Table 7. Precision of gene–function relationship level
Organism          Criteria-3 (%)
S.cerevisiae      86
C.elegans         91
D.melanogaster    81
M.musculus        80
R.norvegicus      86
H.sapiens         87

CONCLUSION
We have developed an information extraction system that uses natural language techniques to assign GO IDs to each gene/protein/family found in abstracts. In this system, each sentence is shallowly parsed, and the ACTOR–OBJECT relationships are extracted using rule-based sentence structure analysis. When gene names and their functional terms are described in ACTOR–OBJECT relationships with predefined verb or modification relationships, the corresponding GO-IDs are assigned to the gene/protein/family.
The gene/protein/family names are quickly recognized by the devised trie, which is constructed based on the GENA gene name dictionary and family name dictionary to extract ambiguous gene names that do not specify unique gene names. For wide recognition of the gene/protein functions, the functional terms are semi-automatically gathered based on GO using co-occurrence in the same abstract and the collocation similarities of the terms. The terms related to a GO term are mainly gathered using the first method, and the similar-meaning terms are mainly gathered using the second method. Additional hyponyms are gathered using the MeSH hierarchy, and semantic/syntactic variations of the gathered terms are generated using rule-based methods.

In a preliminary experiment, our system had a recall rate of about 42–49% [criteria-3 in Table 2 and Table 4], with 91–94% precision [criteria-3 in Table 3 and Table 5] for both S.cerevisiae and H.sapiens at the abstract level. Considering the percentage of functions actually described in the test set abstracts, the recall with NLP was even higher [54–64%: criteria-3 in Table 2 and Table 4]. The precision of our method is higher than that of the simple co-occurrence of a gene and a functional term in a sentence (<50%), since the ACTOR and OBJECT relationships are considered. Further, when all NCBI abstracts are used, the recall rate increases to about 70%, and the precision drops to about 86–87% for both organisms at the fact level (gene–function relationship level). Although this evaluation allowed superclass identification (instead of detailed class identification), the annotated GO class level seems to be sufficiently useful. Many of the false negatives and superclass recognitions (instead of detailed class recognition) were due to a lack of biological function terms. Some vocabularies are difficult to find using an automatic process, while some are detectable with our term-finding system, which uses co-occurrence and collocation similarities.
Expanding the number of biological functional terms in our system should increase the recall.

Application of this method to abstracts using each major eukaryote MeSH term resulted in the extraction of over 190 000 non-redundant gene/protein GO-ID relationships and 150 000 family name GO-ID relationships for S.cerevisiae, C.elegans, D.melanogaster, M.musculus, R.norvegicus and H.sapiens. Many of the extracted biological functions had not been extracted by any major database or consortium. The results are open to the public in the PRIME database (http://prime.ontology.ims.u-tokyo.ac.jp:8081/).

ACKNOWLEDGEMENTS
We thank the reviewers for their helpful suggestions and references. This work is supported in part by a grant-in-aid for scientific research on priority area genome information science, from the Japanese Ministry of Education, Culture, Sports, Science and Technology.

REFERENCES
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Blaschke,C. and Valencia,A. (2002) Automatic ontology construction from the literature. Genome Inform., 13, 201–213.
Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N., Oinn,T., Maslen,J., Cox,A. and Apweiler,R. (2003) The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13, 662–672.
Chiang,J.-H. and Yu,H.-C. (2003) MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics, 19, 1417–1422.
Collier,N., Nobata,C. and Tsujii,J. (2000) Comparison between Tagged Corpora for the Named Entity Task. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, pp. 201–207.
Friedman,C., Kra,P., Yu,H., Krauthammer,M. and Rzhetsky,A. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17(Suppl. 1), S74–S82.
Fukuda,K., Tsunoda,T., Tamura,A. and Takagi,T. (1998) Toward information extraction: identifying protein names from biological papers. Proceedings of the Pacific Symposium on Biocomputing, pp. 705–716.
Humphrey,K., Demetriou,G. and Gaizauskas,R. (2000) Two applications of information extraction to biological science journal articles. Enzyme Interact. Protein Struct., Proceedings of the Pacific Symposium on Biocomputing, Hawaii, USA, pp. 505–516.
Jacquemin,C. and Royaute,J. (1994) Retrieving terms and their variants in a lexicalised unification-based framework. Proceedings of SIGIR, pp. 132–141.
Kim,J.-J. and Park,J.C. (2004) Annotation of gene products in the literature with gene ontology terms using syntactic dependencies. Lect. Notes Artifi. Intell. (in press).
Koike,A. and Takagi,T. (2004) Proceedings of HLT/NAACL BioLINK Workshop, pp. 9–16.
Koike,A., Kobayashi,Y. and Takagi,T. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13, 1231–1243.
Krallinger,M. and Padron,M.M. (2004) Prediction of GO annotation by combining entity specific sentence sliding window profiles. Proceedings of BioCreAtIvE., Granada, Spain.
Krymolowski,Y., Alex,B. and Leidner,J.L. (2004) BioCreative Task 2.1: The Edinburgh–Stanford System. Proceedings of BioCreAtIvE.
Nenadic,G., Rice,S., Spasic,I., Ananiadou,S. and Stapley,B. (2003) Selecting text features for gene name classification: from documents to terms. Proceedings of the ACL Workshop on Natural Language Processing in Biomedicine. Sapporo, Japan, pp. 121–128.
Raychaudhuri,S., Chang,J., Sutphin,P. and Altman,R. (2002) Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res., 12, 203–214.
Rindflesch,T.C., Tanabe,L., Weinstein,J.N. and Hunter,L. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Proceedings of Pacific Symposium on Bioinformatics, Hawaii, USA, pp. 514–525.
Salton,G., Wong,A. and Yang,C.S. (1975) A vector space model for automatic indexing. Commun. ACM, 18, 613–620.
Schug,J., Diskin,S., Mazzarelli,J., Brunk,B.P. and Stoeckert,Jr,C.J. (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res., 12, 648–655.
Singhal,A., Buckley,C. and Mitra,M. (1996) Pivoted document length normalization. In Proceedings of ACM SIGIR’96, Zurich, Switzerland, pp. 21–29.
Tanabe,L. and Wilbur,W.J. (2002) Tagging gene and protein names in biomedical text. Bioinformatics, 18, 1124–1132.
Yakushiji,A., Tateishi,Y., Miyano,Y. and Tsujii,J. (2001) Event extraction from biological papers using a full parser. Proceedings of Pacific Symposium on Bioinformatics, Hawaii, USA, pp. 408–419.
Xie,H., Wasserman,A., Levine,Z., Novik,A., Grebinskiy,V., Shoshan,A. and Mintz,L. (2002) Large-scale protein annotation through gene ontology. Genome Res., 12, 785–794.