BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 7 2005, pages 1227–1236
doi:10.1093/bioinformatics/bti084
Data and text mining
Automatic extraction of gene/protein biological functions
from biomedical text
Asako Koike1,2,∗, Yoshiki Niwa2 and Toshihisa Takagi1
1 Department of Computational Biology, Graduate School of Frontier Science, The University of Tokyo,
Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba 277-8561, Japan and 2 Central Research
Laboratory, Hitachi Ltd., 1-280 Higashi-koigakubo, Kokubunji City, Tokyo 185-8601, Japan
Received on April 16, 2004; revised on September 21, 2004; accepted on October 5, 2004
Advance Access publication October 27, 2004
ABSTRACT
Motivation: With the rapid advancement of biomedical science and
the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become
critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently,
the demand for automatic extraction of information related to gene
functions from text has been increasing.
Results: We have developed a method for automatically extracting
the biological process functions of genes/proteins/families based on
Gene Ontology (GO) from text using a shallow parser and sentence
structure analysis techniques. When the gene/protein/family names
and their functions are described in ACTOR (doer of action) and
OBJECT (receiver of action) relationships, the corresponding GO-IDs
are assigned to the genes/proteins/families. The gene/protein/family
names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of the
gene/protein/family functions, we semi-automatically gather functional
terms based on GO using co-occurrence, collocation similarities and
rule-based techniques. A preliminary experiment demonstrated that
our method has an estimated recall of 54–64% with a precision of
91–94% for functions actually described in abstracts. When applied
to PUBMED, it extracted over 190 000 gene–GO relationships and
150 000 family–GO relationships for major eukaryotes.
Availability: The extracted gene functions are available at
http://prime.ontology.ims.u-tokyo.ac.jp
Contact: [email protected]
INTRODUCTION
With the development of high-throughput methods such as the yeast
two-hybrid method, mass spectrometry and genome sequencing, an
enormous number of experimental results covering various genes
can be obtained quickly. Although all relevant results reported so far
should be considered when interpreting experimental data, retrieving
them from PUBMED abstracts and/or full papers and studying them
is an overwhelming task for a single researcher.
A promising approach to overcoming this problem is the use
of natural language processing (NLP) to automatically extract and
mine the information. The advantages of NLP in the biomedical
field have been demonstrated for gene/protein name recognition
(Fukuda et al., 1998; Collier et al., 2000; Tanabe and Wilbur,
2002), protein–protein interactions (Friedman et al., 2001; Koike
et al., 2003), and general event extraction (Yakushiji et al., 2001;
Rindflesch et al., 2000; Humphrey et al., 2000). In addition,
to clarify the definitions of classes (concepts) and formalize the
relationships between classes that have been generated with the
rapid advancements in the biomedical field, efforts have been
made to construct ontologies manually (Ashburner et al., 2000;
disease ontology, http://diseaseontology.sourceforge.net/; IMGT
ontology, http://imgt.cines.fr/textes/IMGTindxn/ontology.html) and
automatically (Blaschke and Valencia, 2002). Gene Ontology (GO)
(Ashburner et al., 2000), the most widely used ontology, consists
of biological process, molecular function and cellular component
ontologies. Several preliminary studies have been made on automatically annotating genes, proteins and families with the corresponding
GO-ID, which is assigned to each defined term (class), using only
abstracts and/or sequence information (Schug et al., 2002; Xie et al.,
2002; Raychaudhuri et al., 2002; Nenadic et al., 2003).
∗To whom correspondence should be addressed.
To evaluate information extraction (IE) systems on common tasks such as gene name recognition and gene function
annotation based on Gene Ontology, to clarify common problems, and to accelerate IE progress in the biomedical field, the
KDD cup (http://www.biostat.wisc.edu/~craven/kddcup/), TREC
(http://trec.nist.gov/) and BioCreAtIvE (http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html)
have been held. Biomedical domain-specific problems in text mining have become apparent, and various
techniques for solving them have been proposed.
We have developed a method for automatically assigning the GO-ID of a biological process to each gene and protein using natural
language techniques. It uses shallow parsing and sentence structure analysis to extract the ACTOR and OBJECT relationships, so
detailed gene functional annotations are possible, at least in theory.
Because GO is a controlled vocabulary, some of its terms
rarely appear in abstracts. Terms (words and multiword terms) with meanings similar or related to those of GO terms
are therefore gathered semi-automatically using co-occurrence and collocation similarity with GO terms to enable recognition of the functional
terms. Furthermore, rule-based term generation, including morphological and syntactic term variations, is used to complement the
semi-automatic term-gathering methods mentioned above.
This paper is organized as follows. In the following section, we introduce related work and compare our method with
related ones. The functional terminology generation method and
the gene–function relationship extraction method are explained in
‘Systems and Methods’. The precision and recall of our extraction system are presented in ‘Results and Discussion’, where we also
discuss the causes of errors. We conclude with a summary.
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Related work
The extraction of relationships between classes has been well studied (Yakushiji et al., 2001; Rindflesch et al., 2000; Humphrey et al.,
2000). For example, Yakushiji et al. demonstrated biological event
extraction using a full-parser (Yakushiji et al., 2001). Rindflesch
et al. (2000) developed an information extraction (general extraction) system for drugs and genes relevant to cancer using a shallow
parser and UMLS (http://www.nlm.nih.gov/research/umls/). For the
identification of GO-IDs, methods using a combination of sequence
information with text information (Xie et al., 2002) and methods
using only text information have been proposed (Raychaudhuri et al.,
2002; Nenadic et al., 2003; Kim and Park, 2004).
Raychaudhuri et al. (2002) demonstrated that abstracts can be
classified into major GO-ID classes using words in the abstracts
and machine learning, such as the maximum entropy, Naive Bayes
or nearest-neighbor method. Nenadic et al. (2003) classified genes
and proteins into major GO-ID groups using words (ignoring collocations) appearing in abstracts and a support vector machine.
Unfortunately, the receiver operating characteristic (ROC) curves of
some class IDs, which did not have a sufficient number of abstracts,
were quite low. Since the annotated corpus for detailed GO-IDs is not
yet sufficiently large, the IDs used in both methods are limited to
high-level classes, such as ‘signal transduction’. Although biologists would
usually like to see the evidence for an automatic extraction result
at the abstract or sentence level, neither method clearly shows
it. There have been preliminary reports on sentence-level annotation using a gene–function relationship extraction method based on
syntactic dependency (Kim and Park, 2004) and using one based
on sentence similarity using Naive Bayes (Chiang and Yu, 2003).
In BioCreAtIvE, which uses full texts, a sliding window approach
was used to capture gene–function descriptions spanning multiple sentences (Krallinger and Padron, 2004), and query expansions and
derivationally related term expansions were used to achieve wide
recognition of Gene Ontology terms (Krymolowski et al., 2004). Since
the extraction of gene–function relationships is quite
difficult, further efforts are required.
Functional annotation is difficult for two reasons: functional terms are expressed in a wide variety of
ways, and some functions cannot be correctly
extracted without considering the ACTOR–OBJECT relationship
(relying on the simple co-occurrence of gene and functional terms leads to
errors). In our approach, functional terms are gathered using various
methods in order to address the first problem as much as possible, and
sentence structure analysis is used to address the second problem.
SYSTEMS AND METHODS
An overview of our method for automatically extracting gene functions
is shown in Figure 1. The following sections describe the method for gathering/generating the terms for each GO-ID and the method for extracting the
information and assigning the GO-ID using the gathered terms.
Fig. 1. Overview of the gene function extraction process.
Terminology preparing process: augmentation of functional terms based on GO; construction of gene/protein/family name dictionaries.
Function extraction process:
Step 1: Gene/protein/family name and functional term recognition
Step 2: Shallow parsing, noun phrase bracketing, sentence structure analysis
Step 3: ACTOR–OBJECT relationships extraction
Step 4: GO-ID assignment to genes (type-1 extraction)
Step 5: Keyword search in OBJECT
Step 6: GO-ID assignment to genes (type-2 extraction)
Augmentation of functional terms
In principle, the terms for the biological process of GO are used to assign the
GO-ID to the gene. However, the number of these terms is insufficient for
automatic extraction (at least at the sentence level) in terms of recall. The
terms are controlled vocabularies, and some are not frequently used in
abstracts. Furthermore, each GO class can be expressed using various terms
instead of the defined GO term (e.g. ‘GO0006915: apoptosis’ can be replaced
with ‘apoptotic process’ in most cases). We thus semi-automatically gather terms with the same or similar
meanings and related terms based on:
(1) related terms having a high co-occurrence score with GO terms,
(2) similar terms having similar collocations with GO terms,
(3) enzyme name extraction by pattern matching,
(4) rule-based generation of syntactic/semantic variations and
(5) verb–technical term combination variations.
Here, ‘similar meaning’ and ‘relatedness’ are distinguished. Words with a
‘similar meaning’ always belong to the same (meaning) class, whereas words
that are ‘related’ do not necessarily belong to the same class. For example,
‘metamorphosis’ and ‘metabolism’ are similar-meaning words: they belong
to the same class, ‘metabolic system’. In contrast, ‘chaperone’ and
‘protein folding’ are related words. A ‘chaperone’ is a protein that helps the
folding of other proteins, whereas ‘protein folding’ is a protein action, so they do
not belong to the same class. A function can be expressed using related
words and/or similar-meaning words: ‘Protein-A helps the protein folding of
Protein-B’; ‘Protein-A is a chaperone of Protein-B’. Both sentences indicate the same function, and the GO term ‘GO0006457: protein folding’ is assigned
to Protein-A. Basically, related terms tend to appear in the same abstracts,
while similar-meaning terms tend to have similar collocations (local contexts).
Accordingly, related terms (words and collocations) and similar-meaning
terms are semi-automatically gathered using co-occurrences and collocations
as follows.
Related terms having a high co-occurrence score with GO terms. The
terms that co-occur with statistically significant frequency in the abstracts are
extracted as follows. Assume we have a large-scale text database, D. For a
GO function term F, the co-occurrence score of an arbitrary term T with F
can be measured using various formulae, among which the simplest is the
ratio between the density of T in the texts containing F and the density of T in
the whole database D. Although there are more sophisticated formulae, they
are difficult to use when comparing the co-occurrence scores of terms whose
frequencies differ greatly. Therefore, we first classify all candidate terms
(i.e. terms appearing at least once in the texts containing F) into several
frequency classes and take the ones with the highest score from each class.
The summarized candidate terms calculated for each organism and each
frequency class are shown in HTML format, and terms with a meaning similar to that
of the query term are selected by biologists (PhD holders or PhD students).
Fig. 2. Procedure for collocation similarity calculation. The text set (MEDLINE abstracts) is shallow parsed and chunked; np-vp-np, vp-prep-np and np-prep-np collocations are extracted in Rel-Args form (e.g. R=np-vp-np; 1: Pitx2; 2: activate; 3: Gad1 / promoter), expanded into an unsorted index (e.g. Pitx2 *-[activate]-Gad1_promoter; Gad1_promoter Pitx2-[activate]-*), and then sorted and indexed with their frequencies to give the final index.
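The density-ratio score described above can be illustrated with a minimal sketch. It assumes that each abstract is reduced to the set of terms it contains and that 'density' means the fraction of texts containing the term; the function name `cooccurrence_score` and the toy database are hypothetical and are not part of the original system.

```python
from typing import List, Set


def cooccurrence_score(docs: List[Set[str]], go_term: str, candidate: str) -> float:
    """Density of `candidate` in texts containing `go_term`, divided by its
    density in the whole database (density = fraction of texts, an assumption)."""
    texts_with_f = [d for d in docs if go_term in d]
    if not docs or not texts_with_f:
        return 0.0
    density_in_f = sum(candidate in d for d in texts_with_f) / len(texts_with_f)
    density_in_db = sum(candidate in d for d in docs) / len(docs)
    return density_in_f / density_in_db if density_in_db else 0.0


# Hypothetical toy database: each abstract reduced to the set of terms it contains.
toy_db = [
    {"apoptosis", "caspase"},
    {"apoptosis", "caspase", "p53"},
    {"transport", "permease"},
    {"transport", "p53"},
]
caspase_score = cooccurrence_score(toy_db, "apoptosis", "caspase")
p53_score = cooccurrence_score(toy_db, "apoptosis", "p53")
```

In this toy example 'caspase' is twice as dense in the 'apoptosis' texts as in the whole database, whereas 'p53' is equally dense in both, so only the former would rank as a candidate related term.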
Similarity of terms having similar collocations with GO terms. As
mentioned above, the similarity of terms is measured by the similarity of
their collocations. For each term T, its profiling text is defined as the set of
all collocations of T (paired with their frequencies) in the database. As for
the types of collocations, we adopted simple ones such as np (noun phrase)-vp (verb phrase), vp-np, vp-prep (preposition)-np, np-vp-np, np-vp-prep-np
and np-prep-np. The search for similar terms is then done by applying a
similar-text search technique, the vector space model (Salton et al., 1975).
The procedure for making the profiling text of a term is shown in Figure 2.
After shallow parsing of the whole text, all collocation patterns (see above)
are extracted. The expansion process and the sorting and indexing processes
are then applied to obtain the index of all terms by their collocations.
The similarity is defined by the following equations, which are known as
SMART (Singhal et al., 1996):
$$\mathrm{sim}(X, q) = \frac{\mathrm{Avr}\left[\rho(c_i)\,\nu(c_i|q)\,\nu(c_i|X)\right]}{L + \kappa\,[\mathrm{dlen}(X) - L]} \tag{1}$$
$$\rho(c_i) = \log\left(1 + \frac{N}{\mathrm{df}(c_i)}\right) \tag{2}$$
$$\nu(c_i|X) = \frac{1 + \log[\mathrm{tf}(c_i|X)]}{1 + \log[\mathrm{tf}(\cdot|X)]} \tag{3}$$
where X is a similar term candidate and q is the query GO term, dlen(X) is
the number of different collocations in the profile of term X, L is the average
of dlen(X) over all terms, κ is a slope constant, which is set to 0.2, and c_i is
the i-th collocation of query term q. The weight of each collocation ρ(c_i) is
defined by Equation (2), where df(c_i) is the number of terms whose profile
texts contain c_i and N is the total number of terms. The weight of significance
of each collocation c_i with respect to term X is given by Equation (3), where
tf(c_i|X) is the frequency of collocation c_i in the profile of X, and tf(·|X) is the
average of the frequencies over the collocations constituting the profile of
X. The summarized candidate terms calculated for each organism and each
frequency are shown in html style, and terms with a meaning similar to that
of the query term are selected by the same biologists (PhD holders or PhD
students).
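Equations (1)–(3) can be sketched in a few lines of code. This is only an illustration under stated assumptions: profiles are plain dictionaries mapping a collocation to its frequency, 'Avr' is taken to be the mean over the collocations of the query term q, natural logarithms are used, and a collocation absent from a profile contributes a weight of zero. The names `nu`, `rho`, `sim` and the toy profiles are hypothetical.

```python
import math
from typing import Dict

Profile = Dict[str, int]  # collocation -> frequency in the term's profile


def nu(colloc: str, profile: Profile) -> float:
    """Equation (3); returns 0 when the collocation is absent from the profile."""
    tf = profile.get(colloc, 0)
    if tf == 0:
        return 0.0
    mean_tf = sum(profile.values()) / len(profile)
    return (1 + math.log(tf)) / (1 + math.log(mean_tf))


def rho(colloc: str, profiles: Dict[str, Profile]) -> float:
    """Equation (2): log(1 + N / df(c_i))."""
    df = sum(colloc in p for p in profiles.values())
    return math.log(1 + len(profiles) / df) if df else 0.0


def sim(x: str, q: str, profiles: Dict[str, Profile], kappa: float = 0.2) -> float:
    """Equation (1), averaging over the collocations of q (assumed non-empty)."""
    q_prof, x_prof = profiles[q], profiles[x]
    mean_dlen = sum(len(p) for p in profiles.values()) / len(profiles)
    weights = [rho(c, profiles) * nu(c, q_prof) * nu(c, x_prof) for c in q_prof]
    return (sum(weights) / len(weights)) / (mean_dlen + kappa * (len(x_prof) - mean_dlen))


# Hypothetical profiles, for illustration only.
profiles = {
    "apoptosis": {"*-[induce]": 4, "[inhibit]-*": 2},
    "cell death": {"*-[induce]": 3, "[inhibit]-*": 1},
    "transport": {"[mediate]-*": 5},
}
similar = sim("cell death", "apoptosis", profiles)
unrelated = sim("transport", "apoptosis", profiles)
```

With these toy profiles, 'cell death' shares both collocations with the query and receives a positive score, while 'transport' shares none and scores zero.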
Enzyme name extraction by pattern matching. Most functions, including metabolism, catabolism and synthesis, are expressed using an enzyme
name. To compensate for the weakness of the vocabularies extracted using
the two methods described above, enzyme names ending with ‘ase’ are
also extracted from one year of abstracts. For example, for
the ‘GTP metabolism’ function, ‘GTP cyclohydrolase’, ‘GTP hydrolase’,
‘GTPase’ and ‘GTP guanylyltransferase’ are extracted as enzyme names
related to ‘GTP metabolism’. However, some enzyme terms that end in
‘ase’ are not related to these functions. For example, the function of ‘permease’ belongs to ‘transport’. These unrelated terms are removed from the
collected vocabularies semi-automatically.
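A pattern of this kind can be sketched with a regular expression. The sketch below is an assumption about how such matching might look, not the original implementation: the function name `extract_enzyme_names` is hypothetical, and the `unrelated` set is a simple stand-in for the semi-automatic removal step described above.

```python
import re
from typing import FrozenSet, List


def extract_enzyme_names(text: str, substrate: str,
                         unrelated: FrozenSet[str] = frozenset()) -> List[str]:
    """Find '<substrate> ...ase' and '<substrate>ase' names in `text`,
    dropping names whose last word is listed in `unrelated`."""
    pattern = re.compile(rf"\b{re.escape(substrate)}[ -]?[A-Za-z]*ase\b")
    names = {m.group(0) for m in pattern.finditer(text)}
    return sorted(n for n in names if n.split()[-1].lower() not in unrelated)


# Hypothetical sentence, for illustration only.
text = ("GTP cyclohydrolase and GTPase activities were measured, "
        "whereas the GTP permease was unaffected.")
enzymes = extract_enzyme_names(text, "GTP", unrelated=frozenset({"permease"}))
```

Here 'GTP cyclohydrolase' and 'GTPase' are kept as candidates for 'GTP metabolism', while 'GTP permease' is filtered out because 'permease' belongs to 'transport'.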
Rule-based generation of syntactic/semantic variations. Syntactic
variations such as ‘folding of protein’ for ‘protein folding’ are automatically generated. Furthermore, semantically similar/related terms
(metabolism→metabolic, metastasis, metamorphosis, reducer, reduction)
and derivationally related terms (apoptosis→apoptotic) of a GO term or of a
single word constituting a GO term are gathered using UMLS (for derivationally
related terms), WordNet (http://www.cogsci.princeton.edu/~wn/) (for both)
and expert knowledge (for both). Some automatic
conversions generate errors. For example, the terms ‘transport’ and ‘exchange’ are used similarly
in ‘ion transport/exchange’, but not in ‘nuclear transport’. Accordingly,
only conservative conversion terms are provided. Functional term variations are
generated using these similar/related terms. When the same term is automatically generated for multiple GO-IDs, the superclass ID (higher concept class
ID) is used. Hyponym terms (lower class terms, e.g. ‘phosphatidylinositol’
is a hyponym of ‘phospholipid’) are also gathered from the MeSH terms
(http://www.nlm.nih.gov/mesh/meshhome.html).
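The two simplest kinds of variation mentioned above can be sketched as follows. This is a deliberately small illustration: it handles only two-word terms for the syntactic rule, and the `DERIVATIONS` table is a hypothetical stand-in for the pairs that the paper obtains from UMLS, WordNet and expert knowledge.

```python
from typing import Dict, List, Set

# Hypothetical derivation table (stand-in for UMLS/WordNet/expert pairs).
DERIVATIONS: Dict[str, List[str]] = {"apoptosis": ["apoptotic"]}


def term_variants(term: str) -> Set[str]:
    """Return the term together with simple syntactic/derivational variants."""
    words = term.split()
    variants = {term}
    if len(words) == 2:
        # Syntactic variation: 'protein folding' -> 'folding of protein'
        modifier, head = words
        variants.add(f"{head} of {modifier}")
    if len(words) == 1:
        # Derivationally related single-word variants
        variants.update(DERIVATIONS.get(term, []))
    return variants


folding_variants = term_variants("protein folding")
apoptosis_variants = term_variants("apoptosis")
```

A real rule set would need the conservative restrictions described in the text, since blind substitution (for example 'transport' for 'exchange') produces wrong variants in some contexts.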
Verb–technical term combination variations. Some functions such as
‘regulation’, ‘transport’ and ‘synthesis’ are frequently expressed by the combination of a verb and technical terms. A predefined verb is combined with
one or more technical terms. For example, ‘GO0006846: acetate transport’
is assigned to the ACTOR when the verb is ‘transport’, ‘locate’, ‘localize’,
‘translocate’, ‘import’ or ‘export’ and ‘acetate’ is included in the OBJECT. Furthermore, some functions can be determined based on the combination of a
verb and an OBJECT or based only on the verb. ‘GO0004672: protein kinase
activity’ is assigned when the verb is ‘phosphorylate’ and both the ACTOR and
OBJECT include a protein name. If the OBJECT does not include a protein name, the ACTOR may be a kinase (for compounds). When the verb is
‘palmitoylate’, ‘GO0018318: protein amino acid palmitoylation’ is assigned
to the ACTOR without investigating terms in the OBJECT. These verb–technical
term combination variations are semi-automatically produced.
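The verb–technical term rules quoted above lend themselves to a small rule table. The following sketch encodes only the 'acetate transport' and 'palmitoylate' examples from the text; the `RULES` layout and the function `assign_go` are hypothetical, and verbs are assumed to be given in base form, as supplied by the shallow parser.

```python
from typing import List, Optional, Set, Tuple

TRANSPORT_VERBS = {"transport", "locate", "localize", "translocate", "import", "export"}

# Each rule: (allowed verbs, keyword required in the OBJECT or None, GO label)
RULES: List[Tuple[Set[str], Optional[str], str]] = [
    (TRANSPORT_VERBS, "acetate", "GO0006846: acetate transport"),
    ({"palmitoylate"}, None, "GO0018318: protein amino acid palmitoylation"),
]


def assign_go(verb: str, object_words: List[str]) -> List[str]:
    """Return the GO labels assigned to the ACTOR for a verb and its OBJECT words."""
    hits = []
    for verbs, keyword, go_label in RULES:
        if verb in verbs and (keyword is None or keyword in object_words):
            hits.append(go_label)
    return hits


transport_hit = assign_go("export", ["acetate", "across", "the", "membrane"])
verb_only_hit = assign_go("palmitoylate", ["Ras2", "protein"])
no_hit = assign_go("bind", ["acetate"])
```

The second rule has no keyword, mirroring the case in which the function is determined by the verb alone.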
By applying the first two methods to about 190 major GO terms, we
gathered about 3000 terms. Of these, fewer than 30% were extracted by both
method 1 (co-occurrence) and method 2 (collocations). That is, these
methods compensate for each other’s weaknesses. By using all five methods, we gathered about 240 000 terms. (There were about 10 000 original GO
terms.)
Extraction of relations between genes and gene functions
The biological function of each gene was annotated using the following procedure, which is illustrated in Figure 1. The example sentence is shown in
Figure 3. The steps are as follows.
Step 1. Recognition of gene/protein/family names and GO
functional terms. The gene name recognition method is described elsewhere (Koike and Takagi, 2004). Briefly, gene name recognition is carried out using the GENA gene name dictionary (http://gena.ontology.ims.u-tokyo.ac.jp/search/servlet/gena)
and family name dictionary (http://marine.ims.u-tokyo.ac.jp:8080/Dict/family), which were constructed based on major
database entries. In our system, a protein name that does not specify the gene
locus is treated as a family name. For example, since ‘14-3-3’ does not specify
the gene locus (‘14-3-3 alpha’, ‘14-3-3 beta’, etc.), it is registered as a protein
family name. The variations in gene names were generated based on these dictionaries and were quickly searched against abstracts using a specially devised trie with
many heuristics, such as replacing special characters with spaces, searching
inside and outside parentheses separately [e.g. mitogen-activated protein
kinase (MAPK) 1→mitogen-activated kinase 1 + MAPK1], and expanding continuous expressions (e.g. GATA-4/5/6→GATA4, GATA5, GATA6). After
gene/protein/family name recognition, ambiguities in gene names, especially in abbreviated names [e.g. TAK1 is the abbreviated synonym for both
MAP3K7 (mitogen-activated protein kinase kinase kinase 7) and NR2C2
(nuclear receptor subfamily 2, group C, member 2)], were resolved using a full-name/abbreviation
pair search and a keyword search. Finally, the existence
of multiple expressions for the same gene was checked [e.g. for the multiple-name
expression HAP1 (CYP1) in Saccharomyces cerevisiae, HAP1 is a gene
name of both YLR256W and YPL101w, but the second name CYP1 specifies the
gene as YLR256W]. With our method, precision and recall were over 90% for
the major eukaryotes (Koike and Takagi, 2004).
The recognition of functional terms was also done quickly over all
abstracts using a trie that takes trivial term variation into account (replacement of special
characters with a space).
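Two of the name-variation heuristics mentioned in Step 1 can be sketched as follows. These are illustrative guesses at how such normalization might be written, not the dictionary-and-trie implementation of the original system; both function names are hypothetical.

```python
import re
from typing import List


def replace_special_characters(name: str) -> str:
    """Replace runs of non-alphanumeric characters with a single space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", name).strip()


def expand_continuous(name: str) -> List[str]:
    """Expand a continuous expression such as 'GATA-4/5/6' into separate names."""
    match = re.fullmatch(r"([A-Za-z]+)[- ]?(\d+(?:/\d+)+)", name)
    if match is None:
        return [name]  # not a continuous expression; leave unchanged
    stem, numbers = match.groups()
    return [stem + n for n in numbers.split("/")]


gata_names = expand_continuous("GATA-4/5/6")
normalized = replace_special_characters("NF-kappaB/Rel")
```

Names that do not match the continuous-expression pattern, such as 'TAK1', are returned unchanged.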
Step 2. Shallow parsing, noun phrase bracketing and sentence structure analysis. Shallow parsing was done for sentences with gene name
IDs using FDG-Lite (http://www.connexor.com/). After noun phrase bracketing using dependency/syntactic tags and morphological tags, parentheses,
coordinate clauses, subordinate clauses, etc. were analyzed using various
standard rules.
FDG-Lite, developed by Voutilainen et al. at the University of Helsinki,
gives the base form, dependency/syntactic tags and morphological tags.
When determiners, adverbial and adjectival modifiers, coordinating conjunctions, participles, nouns and pronouns are contiguous, they are regarded as a
noun phrase. Recognizing the boundaries of noun phrases that include a coordinating conjunction or comma requires additional checks. To determine the boundary, we check the
number of coordinating conjunctions before the target coordinating conjunction, whether a ‘past_participle_modifier’ is located after the target
coordinating conjunction, whether the verb is before or after the target coordinating conjunction, and whether the target coordinating
conjunction is in a subordinate phrase or an adverbial phrase beginning with
an interrogative.
In principle, the noun phrase preceding the predicate verb is regarded as the
subject, and the noun phrase or prepositional phrase immediately following the predicate
verb is regarded as the object. Certain rules are used for complicated sentence
structures, such as coordinate-conjunction and insertion-phrase structures.
For example (ignoring adverb phrases and prepositions for simplicity):
NP1 verb1 NP2 coordinating_conjunction verb2 NP3 → The subject of verb2 is NP1.
NP1, Verb1-ing NP2, Verb2 NP3 → The subject of verb1 and verb2 is NP1.
NP1, NP2 verb1 NP3, verb2 NP4 → ‘NP2 verb1 NP3’ is an insertion phrase; the subject of verb2 is NP1.
NP1 verb1 (predefined verb, such as belong, consist, encode) NP2 [relative pronoun] verb2 NP3 → The subjects of verb2 are NP1 and NP2.
NP1 verb1 to-infinitive verb2 NP2 → The subject of verb2 is NP1.
In noun phrases including a ‘modifier_of_noun:past_participle’ and ‘participle’, the subject and object inside the phrase are also extracted:
NP1 verb-ing NP2 → NP1 verb NP2
NP1 verb-en NP2 → NP2 verb NP1,
where NP is a noun phrase without a prepositional phrase.
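The two participle rules at the end of the list can be written as a small function. This sketch assumes that the parser has already supplied the base form of the verb and whether the participle is a present ('ing') or past ('en') form; the function name and the `Triple` type are hypothetical.

```python
from typing import Tuple

Triple = Tuple[str, str, str]  # (subject, verb base form, object)


def relation_inside_np(np1: str, verb: str, form: str, np2: str) -> Triple:
    """Apply 'NP1 verb-ing NP2 -> NP1 verb NP2' and 'NP1 verb-en NP2 -> NP2 verb NP1'."""
    if form == "ing":
        return (np1, verb, np2)
    if form == "en":
        return (np2, verb, np1)
    raise ValueError(f"unsupported participle form: {form}")


# Placeholder phrases, for illustration only.
active = relation_inside_np("Protein-A", "activate", "ing", "Protein-B")
passive = relation_inside_np("Protein-A", "activate", "en", "Protein-B")
```

The past-participle case swaps the two noun phrases, so that the doer of the action always ends up in the subject position of the extracted triple.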
Simple anaphora resolution (resolving the coreference of a term or phrase with its antecedent) was also tried. When a pronoun appeared after a relative pronoun,
the previously appearing gene name was assigned to it after checking for singular/plural consistency. For example, in the sentence ‘In S.cerevisiae, OAC
is in the inner mitochondrial membranes, and deletion of its gene greatly
reduces transport of oxaloacetate sulfate’, our program resolves ‘its gene’
to ‘OAC’.
Step 3. ACTOR–OBJECT relationships extraction. The gene–function
relationships are extracted when they are expressed in ACTOR–OBJECT
relationships with predefined verbs or in modification relationships. Here,
ACTOR (agent) means the doer of the action and OBJECT means the receiver
of the action (a higher concept than the ‘object’ of subject–object). Basically, the relationship is
extracted only when the ACTOR is a gene name and the OBJECT is a gene function.
For some verbs, such as ‘require’, the reverse relation is extracted.
We use these terms because the relationships between ACTOR/OBJECT and gene
name/function are not affected by voice, whereas
subject–object relationships are (in most cases, the subject is the protein
and the object is its function in the active voice, while the opposite holds true in the
passive voice).
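The voice-independent mapping from subject/object to ACTOR/OBJECT can be sketched as follows. The sketch assumes that the parser reports the grammatical subject, the base form of the verb, the grammatical object (or by-agent) and a passive flag; the function `to_actor_object` and the `REVERSED_VERBS` set (containing only the 'require' example from the text) are hypothetical.

```python
from typing import Dict

REVERSED_VERBS = {"require"}  # verbs for which the reverse relation is extracted


def to_actor_object(subject: str, verb: str, obj: str, passive: bool = False) -> Dict[str, str]:
    """Map a subject-verb-object triple to ACTOR/OBJECT roles regardless of voice."""
    actor, target = (obj, subject) if passive else (subject, obj)
    if verb in REVERSED_VERBS:
        actor, target = target, actor
    return {"ACTOR": actor, "verb": verb, "OBJECT": target}


# 'IME1 induces an early meiotic event' (active voice)
active = to_actor_object("IME1", "induce", "early meiotic event")
# 'Apoptosis is induced by p53' (passive voice)
passive = to_actor_object("apoptosis", "induce", "p53", passive=True)
# 'Apoptosis requires p53' (reverse relation)
reverse = to_actor_object("apoptosis", "require", "p53")
```

In all three cases the gene ends up as the ACTOR and the function as the OBJECT, which is the condition under which a relationship is extracted.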
The extraction patterns are roughly summarized in Table 1. In each sentence, only the gene function extracted using the corresponding pattern is
highlighted. The kinds of verbs were predefined. As shown in Table 1, the
ACTOR and OBJECT extraction patterns were not limited to subject–object
relationships. The gene and its function can be expressed in a modification
relationship, subject–complement relationship, subject–adverb relationship
and so on.
Fig. 3. Steps in sentence analysis.
Step 4. GO-ID assignment to genes (type-1 extraction). After extraction of the gene–function relationship, we check whether it is negative or affirmative
and whether it is a contingent fact (indicated by verbs including ‘investigate’, ‘test’, ‘examine’, ‘study’, ‘design’ and ‘predicate’) or not. (The negative
and contingent facts are also stored in the PRIME database with marks.
However, these relationships are not used in the following discussion.)
For the verb–technical term combinations (as described in ‘Augmentation
of functional terms’), the verb is confirmed to be a predefined one. For
some terms, it is difficult to determine from one sentence whether the assigned function is
appropriate. For example, ‘GO0006350: transcription’ is defined as ‘the synthesis of either RNA on a template of DNA
or DNA on a template of RNA’ by the Gene Ontology Consortium. In
many contexts, however, the ACTOR of ‘transcription’ is simply a protein activator. Accordingly, the GO-ID is accepted only when at least one keyword such as ‘zinc-finger’,
‘Pol_I’, ‘Pol_II’, ‘Pol_III’ or ‘TFIIB’ appears in the same abstract. Finally, the gene-ID and GO-ID relationship is
output.
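The checks in Step 4 can be sketched as two small filters. This is a simplified illustration: the contingent verbs and transcription keywords are copied from the text, whereas the negation cue list is a hypothetical placeholder, and the sentence is assumed to be given as a list of base forms.

```python
from typing import Iterable, Set

CONTINGENT_VERBS = {"investigate", "test", "examine", "study", "design", "predicate"}
NEGATION_CUES = {"not", "no", "never"}  # hypothetical cue list, not from the paper
TRANSCRIPTION_KEYWORDS = {"zinc-finger", "Pol_I", "Pol_II", "Pol_III", "TFIIB"}


def relation_status(base_forms: Iterable[str]) -> str:
    """Label a sentence as 'negative', 'contingent' or 'affirmative'."""
    words = set(base_forms)
    if words & NEGATION_CUES:
        return "negative"
    if words & CONTINGENT_VERBS:
        return "contingent"
    return "affirmative"


def accept_transcription(abstract_terms: Set[str]) -> bool:
    """Accept 'GO0006350: transcription' only if a supporting keyword is in the abstract."""
    return bool(abstract_terms & TRANSCRIPTION_KEYWORDS)


status_a = relation_status(["p53", "induce", "apoptosis"])
status_b = relation_status(["we", "investigate", "whether", "p53", "induce", "apoptosis"])
status_c = relation_status(["p53", "do", "not", "induce", "apoptosis"])
```

Only relationships labeled 'affirmative' would be passed on to the GO-ID assignment; the others would be stored with marks, as described above.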
Step 5. Keyword search in object/complement (type-2 extraction).
In the example sentences shown in Figure 3, a complete GO term is
expressed in each sentence. However, if the sentence includes an expression such as ‘chromosome III segregation’, the corresponding ID cannot be assigned by exact term matching.
Table 1. Example extraction patterns
[ ] represents a noun phrase.

Pattern: Basic type (a gene name and its function appear in different noun phrases connected by a verb phrase)

Sentence type: NP-VP-NP
Example: Smith and Mitchell (1989) found that [overexpression of <gene>IME1</gene>] induced [an <GO>early meiotic event (recombination)</GO> in rich medium], but later meiotic events did not occur (i.e., they detected no spore formation).

Sentence type: NP-VP-PP
Example: [Either pheromonal activation or an <gene>scg1</gene> null mutation] relieves the negative control and leads to [an <GO>arrest of cell growth</GO> in the G1 phase of the <GO>cell cycle</GO>].

Sentence type: NP-VP-NP PREP Verb-ing NP
Example: Many cancer cells protect [themselves against <GO>apoptosis</GO>] by [activating <gene>nuclear factor-kappaB (NF-kappaB)/Rel</gene>, a transcription factor that] helps in cell survival.

Sentence type: NP-VP-NP (relative pronoun)-VP
Example: [The <gene>Kar3 protein</gene> from Saccharomyces cerevisiae] is [a minus end-directed kinesin family member that] is involved in [both <GO>nuclear fusion</GO>, or <GO>karyogamy</GO>, and <GO>mitosis</GO>].

Sentence type: NP-to infinitive
Example: [The ability of recombinant <gene>TIA-1</gene>] to induce [<GO>DNA fragmentation</GO> in permeabilized cells] suggested that this protein is the granule component responsible for inducing apoptosis in cytolytic lymphocyte (CTL) targets.

Sentence type: NP-VP-to-infinitive
Example: [<gene>p53</gene>] is required to induce [<GO>programmed cell death apoptosis</GO>].

Sentence type: NP-VP-NP/PP-PP
Example: [The <GO>aromatic carboxylic acids</GO>] are converted to the corresponding vinyl derivatives by [<gene>Pad1</gene>].

Pattern: Modification inside of NP

Sentence type: NP-verb/EN-PP
Example: [Eight children (5 living, 3 deceased) with severe hereditary nonspherocytic <GO>hemolytic anemia</GO> caused by <gene>glucose phosphate isomerase</gene> deficiency] have been observed in two Kentucky and Indiana families.

Sentence type: NP of NP
Example: Furthermore, the level of Ime1 depends on [the <GO>kinase activity</GO> of <gene>Ime2</gene>].

Sentence type: NP Verb-ing NP
Example: A unique 892-base pair cDNA was cloned that prevented [the <GO>programmed cell death response</GO> following <gene>IL-3</gene> deprivation] by causing antisense suppression of an endogenous 2.4-kilobase (kb) mRNA.

Sentence type: NP by NP
Example: We also found that <gene>JNK</gene> activation by <GO>UV irradiation</GO>.

Sentence type: NP NP
Example: Saccharomyces cerevisiae Bpt1p is an ATP-binding cassette (ABC) protein that belongs to the MRP subfamily and is [a close homologue of the glutathione conjugate (GS conjugate) <GO>transporter</GO> <gene>Ycf1p</gene>].

Pattern: Adverb phrase (clause)

Sentence type: during NP
Example: In summary, our analyses of embryonic and adult mice demonstrate [that two different <gene>AP-2 transcription factors</gene>] are specifically expressed during [<GO>differentiation</GO> of many neural, epidermal and urogenital tissues].

Pattern: Verb-protein

Sentence type: N (including protein)-Verb-NP (including protein)
Example: Although <gene>Cdc15</gene> phosphorylated [<gene>Dbf2</gene>, <complex>Dbf2-Mob1</complex>, and <complex>Dbf2(S374A/T544A)-Mob1</complex>], the pattern of phosphate incorporation into Dbf2 was substantially altered by either the S374A T544A mutations or omission of Mob1.

Pattern: Only verb

Sentence type: NP verb
Example: Here, evidence is presented that [the <gene>Ras2 protein</gene> of Saccharomyces cerevisiae] is palmitoylated by [a <gene>Ras protein acyltransferase (Ras PAT)</gene> encoded by the <gene>ERF2</gene> and <gene>ERF4</gene> genes].
While the resolution of collocation variants has been well studied in NLP
(Jacquemin and Royaute, 1994), it is still a challenging task. Here, we
tried a simple keyword search. The score of each word constituting a
functional term (TermScore[i], the i-th term score) was defined as
$$\mathrm{TermScore}[i] = \frac{1}{1 + \log(\mathrm{frequency}_i + 1)},$$
where frequency_i is the frequency of appearance of the i-th word in
abstracts over 2 years. The sum of the scores of the words in each collocation (SumScore) was
calculated. The score of each collocation with the given keywords (CollScore) was
defined as
$$\mathrm{CollScore} = \frac{\sum_{j \in \text{given keywords}} \mathrm{TermScore}[j]}{\mathrm{SumScore}}.$$
When the top CollScore was over 0.75, the corresponding GO-ID was accepted. The threshold of 0.75 was determined by using about 100 learning
abstract sets.
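The keyword-search score can be illustrated with a short sketch. It rests on one reading of the definitions above: SumScore is taken over all words of a candidate functional term, and the 'given keywords' are those words of the term that are found in the OBJECT. Natural logarithms are assumed, and the word frequencies below are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def term_score(frequency: int) -> float:
    """TermScore = 1 / (1 + log(frequency + 1)); rarer words score higher."""
    return 1.0 / (1.0 + math.log(frequency + 1))


def coll_score(term_words: Iterable[str], found: Set[str], freq: Dict[str, int]) -> float:
    """CollScore = (sum of TermScore over matched words) / SumScore."""
    words = list(term_words)
    sum_score = sum(term_score(freq.get(w, 0)) for w in words)
    matched = sum(term_score(freq.get(w, 0)) for w in words if w in found)
    return matched / sum_score


THRESHOLD = 0.75
freq = {"chromosome": 1000, "segregation": 50, "mitotic": 100}  # hypothetical counts
object_words = {"chromosome", "III", "segregation"}

full = coll_score(["chromosome", "segregation"], object_words, freq)
partial = coll_score(["mitotic", "chromosome", "segregation"], object_words, freq)
```

For the expression 'chromosome III segregation', every word of the candidate term 'chromosome segregation' is matched, so its score exceeds the threshold, whereas a candidate with an unmatched word ('mitotic') falls below it in this toy example.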
Step 6. GO-ID assignment to genes (type-2 extraction). Step 6 is the
same as Step 4, but for type-2 extractions.
RESULTS AND DISCUSSION
Evaluation method
To evaluate the performance of our extraction function, we used
the same abstracts used for GO-term annotation in the SGD
database (http://www.yeastgenome.org/) as an S.cerevisiae test set
and those for GO annotation (GOA; Camon et al., 2003) as
an Homo sapiens test set. These annotated data include GO
evidence codes. Since they include GO-IDs assigned based on
‘sequence similarity’, we used only the abstracts with evidence code
‘IDA:inferred from direct assay’. Furthermore, since the SGD and
GOA annotations were done using full papers and the biological
functions are not described in some abstracts, we used abstracts
that included the corresponding gene names. In total, we used
510 abstracts (726 gene–function relationships) for S.cerevisiae and
202 abstracts (226 gene–function relationships) for H.sapiens. The
recall [= true_positive/(true_positive + false_negative)] was calculated using these abstracts and annotated relationships. Whether each
gene–function relationship could be extracted from the corresponding abstract was used as the evaluation metric. Since SGD and GOA do not contain all the relationships described in each paper (probably because the annotators’ primary PubMed queries are gene
names, so information on other genes is not necessarily extracted), two
kinds of precisions [=true_positive/(true_positive+false_positive)]
were calculated. One was calculated using SGD/GOA annotation
as the gold standard (type-1 and type-2). The other was calculated based on 100 randomly selected gene–function relationships
extracted using each method (type-1 and type-2) for S.cerevisiae
and H.sapiens. In this calculation, whether the assigned GO-ID was
appropriate or not was determined using the same criteria, criteria-1,
-2 or -3. In the following section, the precision and recall for each
method and the causes of the false-positive and false-negative errors
are discussed.
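The recall and precision definitions above can be illustrated with a short Python sketch. It assumes that extracted and gold-standard relationships are represented as (gene, GO-ID) pairs; this representation, the function name, and the toy identifiers are hypothetical and serve only to show the arithmetic.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall over sets of (gene, GO-ID) pairs."""
    extracted, gold = set(extracted), set(gold)
    true_positive = len(extracted & gold)
    false_positive = len(extracted - gold)
    false_negative = len(gold - extracted)
    precision = (true_positive / (true_positive + false_positive)
                 if extracted else 0.0)
    recall = (true_positive / (true_positive + false_negative)
              if gold else 0.0)
    return precision, recall


# Toy data with made-up gene names and GO-IDs.
gold = {("geneA", "GO:1"), ("geneB", "GO:2"), ("geneD", "GO:4"), ("geneE", "GO:5")}
extracted = {("geneA", "GO:1"), ("geneB", "GO:2"), ("geneC", "GO:3")}
precision, recall = precision_recall(extracted, gold)
print(precision, recall)
```

When the gold standard is incomplete, as noted for SGD and GOA, correct extractions that are missing from it count as false positives in this calculation, which is why a second precision based on manual checking is also reported.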
Table 2. Recall rate using type-1 extraction
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    18.2 (29.8)%    35.3 (51.2)%    42.4 (54.4)%
H.sapiens       19.5 (28.7)%    36.3 (50.8)%    43.4 (57.1)%
Numbers in parentheses show the expected recall rate when the target of manual information
extraction is also limited to abstract information
(without body text). Criteria-1 represents the complete match of our assigned
GO-ID and the SGD/GOA-assigned GO-ID. Criteria-2 represents a relationship within two higher or lower classes. For example, the class hierarchy
‘GO0009987: cellular process’→‘GO0008219: cell death’→‘GO0012501: programmed cell death’→‘GO0006915: apoptosis’→‘GO0042981: regulation of
apoptosis’→‘GO0043065: positive regulation of apoptosis’ is defined by the Gene
Ontology group. The assignment of GO0006915 instead of GO0043065 is allowed in
criteria-2. In criteria-3, all superclasses/subclasses except the highest class are allowed.
(In this example, ‘GO0009987: cellular process’ belongs to the highest class.) In
criteria-3, the assignment of ‘GO0008219: cell death’ instead of ‘GO0043065: positive
regulation of apoptosis’ is accepted.
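The three matching criteria can be sketched as follows. This is a simplified illustration: it treats the example hierarchy from the table notes as a single linear chain, whereas the real GO is a graph in which a term can have several parents. The function name is hypothetical, and excluding the highest class under criteria-2 as well as criteria-3 is an assumption, since the text states the exclusion only for criteria-3.

```python
# Example hierarchy from the table notes, ordered from the highest class down.
EXAMPLE_CHAIN = [
    "GO0009987",  # cellular process (highest class)
    "GO0008219",  # cell death
    "GO0012501",  # programmed cell death
    "GO0006915",  # apoptosis
    "GO0042981",  # regulation of apoptosis
    "GO0043065",  # positive regulation of apoptosis
]


def strictest_criterion(assigned, reference, chain=EXAMPLE_CHAIN):
    """Return 1, 2 or 3 for the strictest criterion satisfied, else None."""
    if assigned == reference:
        return 1  # criteria-1: complete match
    if assigned not in chain or reference not in chain:
        return None  # not in a superclass/subclass relationship
    if assigned == chain[0]:
        return None  # the highest class is never accepted (assumed for criteria-2)
    distance = abs(chain.index(assigned) - chain.index(reference))
    return 2 if distance <= 2 else 3


print(strictest_criterion("GO0006915", "GO0043065"))  # 2
print(strictest_criterion("GO0008219", "GO0043065"))  # 3
```

A match under a stricter criterion also counts under the looser ones, so recall and precision can only stay the same or rise from criteria-1 to criteria-3.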
Table 3. Precision using type-1 extraction
Precision and recall
Precision and recall for type-1 extractions. Tables 2 and 3 show
the results of the type-1 extractions. Since the GO has a hierarchical
structure, a superclass or subclass ID was assigned in some cases. The
definitions of criteria-1, -2 and -3 are given in the notes to Tables 2 and 3. Although only abstract information is used
for gene/protein function extraction in this method, some GOA/SGD annotations
are not described in the abstracts but only in the body text.
Therefore, we investigated whether each of 100 randomly selected
GOA/SGD annotations was written in the abstract as an in-depth class description
(criteria-1), as a higher class description (criteria-2, -3), or neither.
The numbers in parentheses in Table 2 are the recall rates estimated using
these values. For example, for S.cerevisiae, 61% of the annotations
were written in an in-depth class description (complete match with
SGD annotation) in the abstracts. Therefore, 18.2/0.61 = 29.8% is
the estimated recall.
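The correction used for the parenthesized values can be written as a one-line calculation. The sketch below reproduces the worked S.cerevisiae example from the text; the function name is hypothetical.

```python
def estimated_recall(observed_recall_pct, fraction_in_abstracts):
    """Rescale observed recall by the fraction of annotations that are
    actually described in the abstracts (estimated from a manual sample)."""
    if not 0 < fraction_in_abstracts <= 1:
        raise ValueError("fraction_in_abstracts must be in (0, 1]")
    return observed_recall_pct / fraction_in_abstracts


# S.cerevisiae, criteria-1: 18.2% observed, 61% of annotations in abstracts.
print(round(estimated_recall(18.2, 0.61), 1))  # 29.8
```

The estimate treats the 100 sampled annotations as representative of the whole test set, so it carries the sampling uncertainty of that manual check.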
As shown in Table 2, the recall rate was low. About 5% of our
results were written in a lower (more detailed) class description than
the SGD/GOA annotations for both organisms. The superclass ID
annotations produced by our method were due to adverbs being ignored
and to insufficient vocabularies for hyponyms. For example, in the
sentence ‘Inhibition of angiogenesis by recombinant human platelet
factor-4 and related peptides’, ‘GO0001525: angiogenesis’ instead
of ‘GO0016525: negative regulation of angiogenesis’ was assigned
by our method. Classifying the adverbs, verbs and adjectives should
enable more detailed annotations. When only the co-occurrence of
GO-ID and gene-ID in the same sentence was used for the annotation, the recall rates (criteria-3) of S.cerevisiae and H.sapiens were
63 and 55%, respectively. However, the precisions were lower than
50% due to the ACTOR–OBJECT relationship not being considered.
For example, the GO-assignment of ‘GO0008152:metabolism’ to
‘protein-B’ was erroneously done in the sentence ‘protein-A is
involved in the metabolism of protein B’. In this case, the precision
may differ greatly among the vocabularies prepared for each GO-ID.
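The difference between sentence-level co-occurrence and ACTOR–OBJECT extraction can be shown with a toy example built on the sentence quoted above. The dictionary layout and function names are hypothetical, and the relation is written by hand rather than produced by a parser, so this only illustrates why co-occurrence yields the false positive.

```python
# Hand-annotated version of the example sentence (no parser involved).
sentence = {
    "text": "protein-A is involved in the metabolism of protein B",
    "genes": ["protein-A", "protein-B"],
    "go_ids": ["GO0008152"],  # metabolism
    "relations": [
        {"actor": "protein-A", "go_id": "GO0008152", "object": "protein-B"},
    ],
}


def cooccurrence_assign(sent):
    """Assign every GO-ID in the sentence to every gene in the sentence."""
    return {(gene, go_id) for gene in sent["genes"] for go_id in sent["go_ids"]}


def actor_object_assign(sent):
    """Assign a GO-ID only to the gene in the ACTOR role of a relation."""
    return {(rel["actor"], rel["go_id"])
            for rel in sent["relations"] if rel["actor"] in sent["genes"]}


print(sorted(cooccurrence_assign(sentence)))
print(sorted(actor_object_assign(sentence)))
```

Co-occurrence returns both genes, including the erroneous assignment to 'protein-B', while the role-based variant keeps only 'protein-A'.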
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    54 (14.3)%      78 (34.5)%      94.0 (43.4)%
H.sapiens       48 (15.4)%      70 (36.5)%      91.0 (51.2)%
In Table 3, numbers in parentheses show the precision when GOA and SGD annotations were used
as the gold standard. Numbers in front of the parentheses show the actual precision obtained by
manually checking the extracted gene–function relationships.
The causes of the low recall rate even for criteria-3 are as follows
(in order of frequency).
Incomplete vocabularies. In many cases, hyponyms were used
in the abstract. For example, ‘in response to phorbol ester’ is used in
one abstract to represent the function ‘response to organic substance’.
The knowledge that ‘phorbol ester’ is a hyponym of ‘organic substance’ is required. Although the MeSH terms described above and
the UMLS hierarchy were used in some cases, the resolution of a
broad class is difficult.
Function not described by a pattern in Table 1. The function
expressions varied. For example, ‘We purified Tip1p from a
glucanase extract of yeast cell walls and analyzed the sugar
chain involved in the cell wall linkage’ describes the function of
‘GO0007047: cell wall organization and biogenesis’. Although
‘purify’ was registered as a predefined verb, ‘purify NP (gene-name)
from NP (apparatus name)’ was not provided for the ‘organization’
process. Furthermore, some relationships are difficult to extract. For
example, ‘The main physiological roles of Odc1p and Odc2p are
probably to supply 2-oxoadipate and 2-oxoglutarate from the mitochondrial matrix to the cytosol where they are used in the biosynthesis
of lysine and glutamate, respectively, and in lysine catabolism’ implicitly indicates the functions of Odc1 and Odc2 as ‘GO0006839:
mitochondrial transport’. Although these patterns and predefined
verbs can be added, the task is never ending.
Function not written in one sentence. In some cases, the function is written over multiple sentences, and a pronoun is used in a
subsequent sentence. Although some trials using multiple sentences
have been reported (Krallinger and Padron, 2004), resolving this
situation without degrading precision appears difficult.
Parser errors and structure errors. There were only a few
false negatives due to parser and structure errors, so few sentence
structures had to be analyzed. The false negatives were due to
incorrect recognition of the start point of prepositional phrases, incorrect recognition of ‘modifier_of_noun: past_participle’ instead of
‘main_verb: past_tense’, and unresolved anaphora.
In Table 3, the numbers in parentheses were calculated using SGD
and GOA annotations as the gold standard, while the numbers before
the parentheses represent the precision based on manual checking of
100 randomly selected gene–function relationships. Although the
deep class annotation (criteria-1) seems to be difficult, the precision
for criteria-3 seems to be sufficient for practical use. The causes of
the errors are as follows (in descending order of importance):
Gray zone errors. Many errors occurred in the semantically
gray zone. For example, for the sentence ‘IGIF has been found
to enhance the production of interferon-gamma (IFN-gamma) and
granulocyte/macrophage colony-stimulating factor (GM-CSF) while
inhibiting the production of IL-10 in concanavalin A (Con A)-stimulated PBMC’, the assignment of the function ‘GO0042091:
IL-10 biosynthesis’ to ‘colony-stimulating factor (GM-CSF)’ is in
the gray zone (extraction pattern is ‘adverb phrase’ in Table 1). This
sentence implicitly indicates the relation but does not explicitly state the function. While these errors can be eliminated by limiting the
number of extraction relationship patterns, doing so would reduce
the recall.
Loose verb conditions. The limitation on verbs in the verb–
technical term combinations is quite loose in the ‘catabolism’,
‘synthesis’, ‘biosynthesis’, ‘organization’, ‘biogenesis’ and ‘metabolism’ combinations. This causes false positives. For example, for
‘Saccharomyces cerevisiae possess two Escherichia coli endonuclease III homologs, NTG1 and NTG2, whose gene products function
in the base excision repair pathway and initiate removal of a variety
of oxidized pyrimidines from DNA’, ‘GO0006221: pyrimidine biosynthesis’ and ‘GO0006281: DNA repair’ were assigned to NTG1
and NTG2. The former is a false positive. An additional screening
device is needed for such terms.
Parser errors and sentence structure analysis errors. There were
only a few parser and sentence structure analysis errors in our small
test sets. Some errors were observed in the modification relationships
in ‘NP preposition NP’ and in the recognition of long names.
Gene name recognition errors. Some names are shared by multiple genes. Although the resolution of this ambiguity by
abbreviation–full name matching and keyword searching was tried,
there were still some failures, as described elsewhere (Koike and
Takagi, 2004).
Precision and recall for type-2 extractions. To compensate for the
lack of various expressions for biological functions, ‘key word
match’ instead of ‘complete term match’ was also tried. The results
are summarized in Tables 4 and 5. Compared to the rates for the type-1 extractions, the recall rate was slightly higher, and the precision
was slightly lower. The slight increase in the recall rate is because
the keyword search was done within the noun phrase, ignoring the
adverb. For example, for ‘Swi5 and Ace2 are cell cycle-regulated
transcription factors that activate expression of early G(1)-specific
genes in Saccharomyces cerevisiae’, our program extracted the
relations <Swi5 and Ace2> be <cell cycle-regulated transcription
factors> and <Swi5 and Ace2> be-activate <expression of early G(1)-specific genes>. A key word search was done in each noun phrase, and
only the hypernym ‘GO0007049: cell cycle’ was assigned, although
‘GO0000114: G1-specific transcription in mitotic cell cycle’ can
be assigned to Swi5 and Ace2 genes by using all bold key words. To
raise the recall rate, the score threshold was lowered to 0.75. However, some false positives are attributable to this lower threshold. For
example, ‘GO0015919: peroxisomal membrane transport’ was assigned
on the basis of only the keywords ‘peroxisomal membrane’. In this case, function
assignment without considering ‘transport’ or an appropriate verb or
other nouns caused the false positive.
Table 4. Recall rate using type-2 extraction (datasets including Type-1 hits)
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    20.7 (33.9)%    37.2 (53.9)%    45.9 (58.8)%
H.sapiens       20.8 (30.6)%    38.9 (54.0)%    48.7 (64.1)%
Table 5. Precision using type-2 extraction (datasets including Type-1 hits)
Organism        Criteria-1      Criteria-2      Criteria-3
S.cerevisiae    49 (13.9)%      69 (29.5)%      93.9 (37.4)%
H.sapiens       40 (11.8)%      60 (30.2)%      90.6 (43.7)%
Function extraction for each organism
We applied our method to all the abstracts with each MeSH term
(homo sapiens, mice, rats, drosophila melanogaster, caenorhabditis
elegans and saccharomyces cerevisiae). The results are summarized in Table 6. We also extracted the family name–function
relationships. Many of them (>80%) were not yet registered
in the major databases such as LocusLink, RGD, GDB, SGD,
Flybase and WormBase. These results are searchable at http://prime.ontology.ims.u-tokyo.ac.jp.
When all abstracts were used, the recall rate of gene–function
relationships in the previous test set of S.cerevisiae was 32.4% for
criteria-1 and 69.2% for criteria-3. Those of H.sapiens were 31.0%
for criteria-1 and 70.8% for criteria-3. Since the precision and recall
rate at each gene–function relationship level (fact level) should differ
from that at an abstract level, the precision was recalculated at the
fact level and is summarized in Table 7 (family name–function relationships are not included). Here, ‘fact level’ is used to mean whether
the extracted gene–function relationship is correct or not. When
multiple evidential sentences are extracted for one gene–function
relationship from multiple abstracts, if at least one sentence is correct,
the gene–function relationship is regarded as correct, i.e. a fact.
The difference in precision among organisms in Table 7 is mainly
due to the difference in precision of gene/protein name recognition
for each organism. The precision for the gene–function relationship
level in Table 7 is slightly lower than that for the abstract level in
Tables 4 and 5. Erroneously extracted gene–function relationships
were supported by only one or two evidential sentences. The precision
could be increased to some extent by discarding the gene–function
relationships with few evidential sentences.
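The fact-level evaluation and the suggested filtering by number of evidential sentences can be sketched as follows. The input format (one tuple per evidential sentence, with a manually judged correctness flag), the function name, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def fact_level_precision(evidence, min_sentences=1):
    """Precision over gene-function relationships (facts).

    evidence: iterable of (gene, go_id, sentence_is_correct) tuples.
    A fact counts as correct if at least one of its sentences is correct.
    Facts with fewer than min_sentences evidential sentences are discarded.
    """
    by_fact = defaultdict(list)
    for gene, go_id, is_correct in evidence:
        by_fact[(gene, go_id)].append(is_correct)
    kept = [flags for flags in by_fact.values() if len(flags) >= min_sentences]
    if not kept:
        return None  # nothing left to evaluate
    return sum(1 for flags in kept if any(flags)) / len(kept)


# Made-up evidence: two well-supported facts and one single-sentence error.
evidence = [
    ("g1", "GO:a", False), ("g1", "GO:a", True), ("g1", "GO:a", True),
    ("g2", "GO:b", False),
    ("g3", "GO:c", True), ("g3", "GO:c", True), ("g3", "GO:c", False),
]
print(fact_level_precision(evidence))                   # all facts
print(fact_level_precision(evidence, min_sentences=3))  # filtered
```

Raising min_sentences trades recall for precision, since correct relationships that are mentioned only once are discarded together with the errors.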
Table 6. Function extractions for each organism
Organism          No. of abstracts    Protein (family) kinds    Extracted function gene (non-red) + family (non-red)
S.cerevisiae      51 646              2568 (1563)               22 323 (11 122) + 16 145 (8200)
C.elegans         5 170               907 (534)                 3602 (2282) + 1891 (1324)
D.melanogaster    19 331              1294 (533)                4829 (3201) + 2040 (1376)
M.musculus        666 098             5985 (2736)               174 194 (50 757) + 140 992 (34 524)
R.norvegicus      1 035 237           3254 (2257)               142 047 (38 554) + 153 434 (37 024)
H.sapiens         8 219 949           7268 (3609)               337 945 (83 713) + 351 255 (67 540)
Table 7. Precision of gene–function relationship level
Organism          Criteria-3 (%)
S.cerevisiae      86
C.elegans         91
D.melanogaster    81
M.musculus        80
R.norvegicus      86
H.sapiens         87
CONCLUSION
We have developed an information extraction system that uses natural
language techniques to assign GO IDs to each gene/protein/family
found in abstracts. In this system, each sentence is shallowly
parsed, and the ACTOR–OBJECT relationships are extracted
using rule-based sentence structure analysis. When gene names
and their functional terms are described in ACTOR–OBJECT
relationships with predefined verb or modification relationships, the
corresponding GO-IDs are assigned to the gene/protein/family. The
gene/protein/family names are quickly recognized by the devised
trie, which is constructed based on the GENA gene name dictionary and family name dictionary to extract ambiguous gene names
that do not specify unique gene names. For wide recognition of the
gene/protein functions, the functional terms are semi-automatically
gathered based on GO using co-occurrence in the same abstract
and the collocation similarities of the terms. The terms related to
a GO term are mainly gathered using the first method, and the
similar-meaning terms are mainly gathered using the second method.
Additional hyponyms are gathered using an MeSH hierarchy, and
semantic/syntactic variations of the gathered terms are generated
using rule-based methods.
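Dictionary lookup with a trie, as mentioned above for name recognition, can be sketched generically as below. This is not the authors' devised trie or the GENA dictionary format: the class, the word-level tokenization, the greedy longest-match strategy, and the entry identifiers are all assumptions. A name that maps to several identifiers models an ambiguous gene name.

```python
_END = object()  # sentinel key marking the end of a dictionary name


class NameTrie:
    """Token-level trie for longest-match dictionary lookup."""

    def __init__(self):
        self.root = {}

    def add(self, name, entry_id):
        node = self.root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node.setdefault(_END, set()).add(entry_id)

    def longest_matches(self, tokens):
        """Scan left to right; return (start, end, ids) for each longest hit."""
        tokens = [t.lower() for t in tokens]
        hits, i = [], 0
        while i < len(tokens):
            node, best, j = self.root, None, i
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if _END in node:
                    best = (i, j, frozenset(node[_END]))
            if best:
                hits.append(best)
                i = best[1]  # continue after the matched name
            else:
                i += 1
        return hits


trie = NameTrie()
trie.add("platelet factor 4", "ID:1")   # hypothetical dictionary entries
trie.add("platelet factor", "ID:2")
trie.add("Tip1p", "ID:3")
trie.add("Tip1p", "ID:4")               # same name, second gene: ambiguous
hits = trie.longest_matches("recombinant human platelet factor 4 and Tip1p".split())
print(hits)
```

Lookup time depends on the length of the text and of the matched names rather than on the size of the dictionary, which is the usual reason for choosing a trie for this kind of task.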
In a preliminary experiment, our system had a recall rate of about
42–49% [criteria-3 in Tables 2 and 4], with 91–94% precision [criteria-3 in Tables 3 and 5] for both S.cerevisiae and
H.sapiens at an abstract level. Considering the percentage of functions actually described in the test set abstracts, the recall with NLP
was even higher [54–64%: criteria-3 in Tables 2 and 4]. The
precision of our method is higher than the simple co-occurrence rate
of gene and functional term (<50%) in a sentence, since the ACTOR
and OBJECT relationships are considered. Further, when all NCBI
abstracts are used, the recall rate increases to about 70%, and the
precision drops to about 86–87% for both organisms at the fact level
(gene–function relationship level). Although this evaluation allowed
superclass identification (instead of detailed class identification), the
annotated GO class level seems to be sufficiently useful. Many of
the false negatives and superclass recognitions (instead of detailed
class recognition) were due to a lack of biological function terms.
Some vocabularies are difficult to find using an automatic process,
while some are detectable with our term-finding system, which uses
co-occurrence and collocation similarities. Expanding the number of
biological functional terms in our system should increase the recall.
Application of this method to abstracts using each major eukaryote
MeSH term resulted in the extraction of over 190 000 non-redundant
gene/protein GO-ID relationships and 150 000 family name GO-ID relationships for S.cerevisiae, C.elegans, D.melanogaster,
M.musculus, R.norvegicus and H.sapiens. Many biological functions that were not extracted by a major database or consortium were
extracted. The results are open to the public in the PRIME database
(http://prime.ontology.ims.u-tokyo.ac.jp:8081/).
ACKNOWLEDGEMENTS
We thank the reviewers for their helpful suggestions and references.
This work is supported in part by a grant-in-aid for scientific research
on the priority area ‘genome information science’ from the Japanese
Ministry of Education, Culture, Sports, Science and Technology.
REFERENCES
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P.,
Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the
unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Blaschke,C. and Valencia,A. (2002) Automatic ontology construction from the literature.
Genome Inform., 13, 201–213.
Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N.,
Oinn,T., Maslen,J., Cox,A. and Apweiler,R. (2003) The gene ontology annotation
(GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro.
Genome Res., 13, 662–672.
Chiang,J.-H. and Yu,H.-C. (2003) MeKE: discovering the functions of gene products
from biomedical literature via sentence alignment. Bioinformatics, 19, 1417–1422.
Collier,N., Nobata,C. and Tsujii,J. (2000) Comparison between Tagged Corpora for
the Named Entity Task. In Proceedings of the 18th International Conference on
Computational Linguistics, Saarbrücken, Germany, pp. 201–207.
Friedman,C., Kra,P., Yu,H., Krauthammer,M. and Rzhetsky,A. (2001) GENIES: a
natural-language processing system for the extraction of molecular pathways from
journal articles. Bioinformatics, 17(Suppl. 1), S74–S82.
Fukuda,K., Tsunoda,T., Tamura,A. and Takagi,T. (1998) Toward information extraction: identifying protein names from biological papers. Proceedings of the Pacific
Symposium on Biocomputing, pp. 705–716.
Humphrey,K., Demetriou,G. and Gaizauskas,R. (2000) Two applications of information
extraction to biological science journal articles. Enzyme Interact. Protein Struct.,
Proceedings of the Pacific Symposium on Biocomputing, Hawaii, USA, pp. 505–516.
Jacquemin,C. and Royaute,J. (1994) Retrieving terms and their variants in a lexicalised
unification-based framework. Proceedings of SIGIR, pp. 132–141.
Kim,J.-J. and Park,J.C. (2004) Annotation of gene products in the literature with gene
ontology terms using syntactic dependencies. Lect. Notes Artifi. Intell. (in press).
Koike,A. and Takagi,T. (2004) Proceedings of HLT/NAACL BioLINK Workshop,
pp. 9–16.
Koike,A., Kobayashi,Y. and Takagi,T. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13,
1231–1243.
Krallinger,M. and Padron,M.M. (2004) Prediction of GO annotation by combining entity
specific sentence sliding window profiles. Proceedings of BioCreAtIvE., Granada,
Spain.
Krymolowski,Y., Alex,B. and Leidner,J.L. (2004) BioCreative Task 2.1: The Edinburgh–
Stanford System. Proceedings of BioCreAtIvE.
Nenadic,G., Rice,S., Spasic,I., Ananiadou,S. and Stapley,B. (2003) Selecting text features for gene name classification: from documents to terms. Proceedings of the
ACL Workshop on Natural Language Processing in Biomedicine. Sapporo, Japan,
pp. 121–128.
Raychaudhuri,S., Chang,J., Sutphin,P. and Altman,R. (2002) Associating genes with
gene ontology codes using a maximum entropy analysis of biomedical literature.
Genome Res., 12, 203–214.
Rindflesch,T.C., Tanabe,L., Weinstein,J.N. and Hunter,L. (2000) EDGAR: extraction
of drugs, genes and relations from the biomedical literature. Proceedings of the Pacific
Symposium on Biocomputing, Hawaii, USA, pp. 514–525.
Salton,G., Wong,A. and Yang,C.S. (1975) A vector space model for automatic indexing.
Commun. ACM, 18, 613–620.
Schug,J., Diskin,S., Mazzarelli,J., Brunk,B.P. and Stoeckert,Jr,C.J. (2002) Predicting
gene ontology functions from ProDom and CDD protein domains. Genome Res., 12,
648–655.
Singhal,A., Buckley,C. and Mitra,M. (1996) Pivoted document length normalization. In
Proceedings of ACM SIGIR’96, Zurich, Switzerland, pp. 21–29.
Tanabe,L. and Wilbur,W.J. (2002) Tagging gene and protein names in biomedical text.
Bioinformatics, 18, 1124–1132.
Yakushiji,A., Tateishi,Y., Miyano,Y. and Tsujii,J. (2001) Event extraction from biological papers using a full parser. Proceedings of the Pacific Symposium on Biocomputing,
Hawaii, USA, pp. 408–419.
Xie,H., Wasserman,A., Levine,Z., Novik,A., Grebinskiy,V., Shoshan,A. and Mintz,L.
(2002) Large-scale protein annotation through gene ontology. Genome Res., 12,
785–794.