Automated extraction of disease-gene relationships from MEDLINE

Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
2005
Automated extraction of disease-gene relationships
from MEDLINE
Jennifer R. Paine
Recommended Citation
Paine, Jennifer R., "Automated extraction of disease-gene relationships from MEDLINE" (2005). Thesis. Rochester Institute of
Technology. Accessed from
Automated Extraction of Disease-Gene Relationships from MEDLINE
by
Jennifer R. Paine
A thesis submitted to the faculty of Rochester Institute of Technology in partial fulfillment of the requirements for the degree of Masters of Science in the Department of Biological Sciences.
Rochester Institute of Technology
2005
Approved by:
Dr. Debra Burhans
Dr. Jun Xu
Dr. David Lawlor
Dr. Gary Skuse
Thesis/Dissertation Author Permission Statement
Title of thesis or dissertation: Automated Extraction of Disease-Gene Relationships from MEDLINE
Name of author: Jennifer R. Paine
Degree: Master of Science
Program: Bioinformatics
College: College of Science
I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
Print Reproduction Permission Granted:
I, Jennifer R. Paine, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.
Signature of Author: Jennifer R. Paine
Date: 5-6-2005
Print Reproduction Permission Denied:
I, ________, hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.
Signature of Author: ________ Date: ________
Abstract
The increasing amount of scientific information available in the form of biomedical literature is beginning to bring about a need for the development of tools to extract information automatically from these sources. One segment of information of particular interest to researchers is the linkage information between genes and diseases. These linkages can help researchers interpret large-scale genomics studies as well as make logical connections between gene expression levels and certain phenotypes. To make the finding, collecting and storage of this information practical, automated methods of information extraction are required. In this paper, I propose a method for the automated extraction and database storage of linkages between genes and diseases from MEDLINE text using a combination of term co-occurrence and natural language processing techniques. This method incorporates pre-defined lexicons for genes and diseases, tokenization, statistically-driven part-of-speech tagging and chunking, as well as template matching based on a set of training templates to find relationship-containing statements in the MEDLINE text. Results of an experiment on a test set of 50 abstracts demonstrate that this method can be applied with success to extract disease:gene relationships from MEDLINE text, giving a precision of 97% and a recall of between 51% and 78%.
List of Figures
Figure 1. Study Design Flowchart
Acknowledgments
The following thesis, while an individual work, benefitted from the direction and knowledge of several people. My Thesis Chair, Dr. Debra Burhans, along with my Thesis Advisor, Dr. Jun Xu, both provided a breadth and depth of knowledge in the field that was invaluable in the construction of this project. In addition, my Thesis Advisors Dr. David Lawlor and Dr. Gary Skuse provided sources of external feedback and guidance that allowed me to complete this project on schedule. Each of my advisors brought to this project valuable insights that both broadened my knowledge of the field and challenged my thinking, making the finished product one of value and utility.
In addition to the academic and technical help, I also received equally important help from family and friends. My fiance, Dan Bushnell, provided constant support throughout this work. Also present was the constant support and encouragement from my parents, Ronald and Eleanor Paine, enabling me to persevere through many challenges and finally obtain the degree.
Finally, I would like to thank everyone in the Corporate Functions Biotechnology division of Procter and Gamble for all of their ongoing help and support. Their comments and insights created an interesting project with many opportunities for future improvement and use.
Table of Contents
Copyright Release Form
Abstract
List of Figures
Acknowledgements
Introduction
Materials and Methods
Results
Discussion
Conclusions
References
Appendices
Appendix A: Code
Appendix B: Training Sentences
Appendix C: Templates
Appendix D: Additional Sentences Found in Test Set of Abstracts
Appendix E: Relationships Found in Test Set
Introduction
Technical progress in biology, particularly in the field of genomics, has brought about research on the scale of the genome. Whereas a single experiment in the past may have addressed only one or two genes of interest, new techniques produce information regarding tens of thousands of genes with a single experiment. A typical microarray experiment can yield data for hundreds of genes with interesting expression changes. How these changes affect an organism with regard to disease is of great interest to researchers. In order to determine whether or not genes have a documented linkage to a disease, scientists currently manually search existing biomedical literature. This is a difficult, if not impossible, task given the vastness of currently available information sources. Understanding disease conditions can not only speed the progression of treatment studies, but can also provide valuable information regarding the function of certain genes and the pathways in which they participate.
The MEDLINE database is one of the leading sources of information used in biology research. According to the MEDLINE Fact Sheet on the World Wide Web, the database has a history of over 13 million references and expands at a tremendous rate. Since 2002 the rate of references added world-wide has increased to an astounding 3,500 total references per day. In 2004 alone, more than 571,000 references were uploaded to this enormous database (1). Developing a tool to automatically search and extract relevant information from this literature is not only desirable, but is an essential step in progressing large-scale genomic research.
Automated literature mining is not a new field of study. For many years, linguists have studied methods for gaining useful information in an automated fashion from documents like newspapers, company reports, patent filings, and websites. In fact, the first linguistics program started as early as the 1940's (2). Only recently has this field begun to extend into biomedical abstracts, as researchers try to navigate the constant flow of new information.
The extraction of useful information from unstructured text is commonly referred to in the field as "natural language processing". Natural language processing is typically performed in several steps. First, a source of unstructured text is established that contains the desired information. If needed, filtering can be performed to narrow the text to that which is most likely to contain something interesting. Next, the text is broken into smaller, manageable pieces through a process called tokenization. Typically, this is first done at the sentence level using periods as delimiters. The sentences are then further tokenized at the word level and part-of-speech tagged. Part-of-speech tagging involves annotating words with a tag that defines their syntactic context within a sentence. The seven most common tags are: article, noun, verb, adjective, preposition, number, and proper noun. After a word context is determined, it is then possible to parse words or group them into meaningful phrases and use templates to search for information like relationships.
Co-occurrence is one commonly used method for determining word relationships, and finds its roots at the beginnings of this field of linguistics. As J.R. Firth, a famous linguist, stated in the late 50's: "You shall know a word by the company it keeps" (2). Co-occurrence-based methods, however, are limited by an inability to detect any relational information between the terms.
Prior work has already been done to accomplish this same task of finding relationships between genes and diseases (3). One project employs pre-defined gene and disease lexicons in conjunction with a few defined templates to match syntactic structures, as well as a controlled verb list. This method allowed for detection of relationships stated in the literature. The limitation of this study, however, stemmed from a strong reliance on 'complete' lexicons of terms and synonyms as well as assumptions that certain sentence structures denote relationships (3).
Ray and Craven (4) used statistical methods to accomplish this gene-disease relationship extraction. In their experiment, several hundred pre-tagged sentences discussing gene-disease relationships were used as positive examples to train Hidden Markov Models while thousands of other sentences served as negative examples. A limitation of this method is the likely lack of detection for gene or disease names not found in the training set.
To help overcome these limitations, lists of terms and synonyms for both genes and diseases were developed and used in this project. In addition, templates to match relationships within sentences were developed from a training set of sentences in which relationships between genes and diseases were stated. The primary objective of this project was to develop an extraction system for retrieving gene:disease relationships from MEDLINE text using co-occurrence-based Natural Language Processing techniques. Far-reaching benefits include easier and more thorough interpretation of microarray gene experiments.
Figure 1. Study Design Flowchart. [Flowchart steps: build lexicons (disease name lexicon, gene name lexicon); find 100 sentences containing a stated gene:disease relationship; assemble their abstracts and pre-tag the gene and disease names; part-of-speech tag sentences and eliminate sentences without co-occurrence; chunk into phrases; develop phrase templates using 50 of the sentences; template matching; relationship flat file; determination of precision and recall using the remaining 50 sentences.]
To reach a final product of a collection of likely gene:disease relationships, this process was followed. First, a disease name lexicon was built using data contained in publicly available databases. Next, a subset of a larger gene lexicon was obtained (to allow for faster processing for proof of concept). These two lexicons were used to search Medline for 100 sentences (each in a different abstract to allow an equal number of sentences and abstracts) containing a stated gene:disease relationship. Each of these sentences' abstracts was then downloaded as a plain-text file and used in the remaining steps. The 100 abstracts' gene and disease names were pre-tagged for easy retrieval in later steps and then the abstracts were tagged with parts of speech. Following part-of-speech tagging, the abstracts were chunked into nonoverlapping phrases. At this time, the original 100 sentences containing relationships were re-identified and the first (arbitrary division) 50 sentences' chunks (training set) were used to develop templates, which were used to retrieve the relationships from the test set. Following template development, the tool was run on the test set and precision and recall were determined. A positive result was one where the sentence retrieved by the tool contained a disease:gene relationship. A negative result was one where the sentence retrieved contained co-occurrence but no stated relationship between the gene and disease.
Materials and Methods
Lexicons
Two lexicons were required for this project: a gene name and synonym lexicon and a disease name and synonym lexicon. The gene name and synonym lexicon was already assembled at the start of this project (5). Each record in the gene lexicon has a unique identifier, an acronym for the gene, and a name or synonym for the gene. Typically, the acronym for the gene is the accepted symbol for the gene according to the Human Genome Organization (HUGO), and it is used as the key for its synonyms. To allow rapid use and evaluation of the tool for proof of concept, a subset of the gene lexicon was used. The chosen subset was an arbitrarily shortened gene lexicon; the gene lexicon used for this project contains 4,840 entries for 717 unique genes.
The disease lexicon was built from a subset of the National Library of Medicine's Medical Subjects Headings (MeSH) and the diseases list from the Office of Rare Diseases Website of the National Institutes of Health (NIH) (6,7). MeSH is the National Library of Medicine's controlled list of terms. It is structured as a hierarchical list of terms with general terms as nodes and the most specific terms as leaves. For instance, a general level term might be "Diseases" and a more specific term might be "Alzheimer's Disease". According to the MeSH Subject Headings Fact Sheet, there were 22,568 terms in MeSH at the time of this manuscript (8). For the purpose of this project, terms under the 'Diseases' head term were used to extract information from MeSH. Often, headings which were more specific were made synonyms of a general key term in the lexicon.
The Office of Rare Diseases list was also used for the generation of the disease lexicon. The Office of Rare Diseases is an organization within the National Institutes of Health (NIH) that was established in 1993, coordinating research on rare diseases. On their website, the Office of Rare Diseases defines a 'rare' disease as one "affecting fewer than 200,000 persons in the United States" (9). This data source contained a listing of approximately 6,000 diseases but did not contain synonyms for those disease names. This Office of Rare Diseases list was compared to the lexicon generated from the MeSH terms, and the terms unique to the Office of Rare Diseases list were added to the lexicon.
Each record in the disease lexicon has three attributes: a unique numerical identifier, a key disease name, and a synonym for this key term. The unique numerical identifier is an arbitrary number given to each key term and has no biological meaning. The resulting disease lexicon used for this project contains 9,896 records and includes 9,025 unique disease terms.
Training/Testing Corpora
After the lexicons were built, they were used to search a local copy of the MEDLINE (10) database for abstracts with sentences containing both a gene and a disease name and a stated relationship. One-hundred such sentences were collected from the MEDLINE database (see Appendix B) and their respective abstracts were obtained to be used as a training/testing set to develop the tool. Fifty sentences containing well-defined relationships and their respective abstracts were used as a training set to develop the tool's functionality. The remaining 50 sentences and their respective abstracts were used as a testing set to gauge the effectiveness and accuracy of the tool. After the sentences were chosen, their respective abstracts were used for processing. This method also facilitated demonstration and evaluation of the tool's ability to identify other sentences containing relationships. It should be noted that the tool was expected not only to return the original 50 sentences of the test set, but also to identify other sentences containing clearly stated gene-disease relationships.
In this project, sentences containing both a gene and a disease name and a clearly stated relationship are considered relationship-containing sentences and should be returned by the tool as positive results. An example of such a sentence is this: "Hypocalcemia was associated with a markedly reduced insulin secretory response and normal tissue insulin sensitivity while hypercalcemia was associated with a normal insulin response and reduced tissue sensitivity". Sentences with co-occurrence of a disease and gene name without a stated relationship, or with a negatively stated relationship, are considered non-relationship-containing sentences and should not be returned by the tool as positive results. An example of a non-relationship-containing sentence due to the presence of a negative relationship is this: "Finally, PTH does not appear to be an insulin antagonist and has no apparent effect on either glucose- or tolbutamide-stimulated insulin release in animals with dietary-induced secondary hyperparathyroidism".
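The lexicon records and the notion of co-occurrence described above can be illustrated with a short Python sketch. It is not the thesis code (which was written in Perl and appears in Appendix A). The record layout follows the three attributes named for the disease lexicon, while the class name, function names, sample entries, and the choice of case-insensitive whole-word matching are assumptions made only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconRecord:
    identifier: int  # arbitrary unique number with no biological meaning
    key_term: str    # key disease name (or gene symbol)
    synonym: str     # one synonym of the key term per record


def key_terms_in(sentence, lexicon):
    """Return the key terms whose key name or synonym occurs in the sentence."""
    found = set()
    for record in lexicon:
        for name in (record.key_term, record.synonym):
            # Whole-word, case-insensitive match (an illustrative assumption).
            pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(record.key_term)
    return found


def has_cooccurrence(sentence, gene_lexicon, disease_lexicon):
    """True when at least one gene term and one disease term share the sentence."""
    return bool(key_terms_in(sentence, gene_lexicon)) and bool(
        key_terms_in(sentence, disease_lexicon)
    )


# Hypothetical sample records, not taken from the thesis lexicons.
GENES = [LexiconRecord(1, "INS", "insulin")]
DISEASES = [LexiconRecord(1, "hypocalcemia", "hypocalcaemia")]
```

Co-occurrence alone does not separate positive from negative examples: both example sentences quoted above contain a gene and a disease name, which is why the later template step is needed.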
Pre-Tagging
The gene and disease names were pre-tagged within the abstracts to ease finding them in later steps. This was done using an exact string matching algorithm implemented in Perl (see Appendix A). The algorithm searches for the occurrence of a term from the lexicon in the text and appends to it a user-assigned prefix and suffix. Genes and diseases were assigned different prefix-suffix pairs so they could be easily retrieved and separated in the final steps of information extraction. The pre-tagged abstracts were then ready for the first stage of natural language processing.
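A minimal Python sketch of this kind of pre-tagging is given below. It is not the Perl implementation from Appendix A. The marker pairs `YYY`/`ZZZ` for genes and `QQQ`/`VVV` for diseases are assumptions inferred from the tagged sentences shown in the Results section, and the function name and the whole-word boundary rule are hypothetical.

```python
import re


def pretag(text, terms, prefix, suffix):
    """Wrap every exact, whole-word occurrence of a lexicon term in prefix/suffix."""
    if not terms:
        return text
    # Longest terms first so a multi-word term wins over a shorter term inside it.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"(?<![A-Za-z0-9])(" + alternation + r")(?![A-Za-z0-9])")
    # A single pass over the text avoids tagging inside an already tagged term.
    return pattern.sub(lambda m: prefix + m.group(1) + suffix, text)


sentence = "Stimulation of growth hormone incretion by ins caused hypoglycemia in children"
tagged = pretag(sentence, ["ins"], "YYY", "ZZZ")            # gene markers (assumed)
tagged = pretag(tagged, ["hypoglycemia"], "QQQ", "VVV")     # disease markers (assumed)
```

Because the match is exact, a lexicon entry spelled with a hyphen will not tag the same name written without one; matching is only as good as the agreement between lexicon spelling and document spelling.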
Tokenization and Part-of-Speech Tagging
The first step in the process of extracting gene:disease relationships from the abstracts was to tokenize the abstracts into sentences and then words. Following tokenization, the words were then tagged with their appropriate part-of-speech tags. There are two basic varieties of part-of-speech taggers: rule-based and probabilistic.
Rule-based taggers are based on contextual rule sets that are painstakingly defined by an expert in the field. Rule-based taggers have been proven effective, and are 'transportable' to different kinds of text. An example of a rule-based part-of-speech tagger is the Brill tagger written by Eric Brill (11). Brill's rule-based tagger works by developing rules based on a pre-tagged corpus. Initial tags are assigned based on the most likely tag for a word (ignoring its context) according to the tagged training corpus. If a capitalized word is found that was not contained in the training text, it is assigned a tag of 'proper noun'. If a lowercase word is found that was not in the training text, it is assigned a tag based on the last three letters of the word. For instance, if the word is "bulbous", it is assigned the 'adjective' tag because 'adjective' is the most common tag for words ending in '-ous'. After this initial tagging is completed, the tagger then applies sets of rules based on context to refine the results. These rules are adjusted based on 'guesses' in an iterative process. The Brill tagger is commonly used and has been shown to be accurate over a variety of documents. However, the Brill tagger in its native state was not specifically trained for biomedical text, and in one study it performed at an accuracy of 87% when applied to a testing set of 1,000 MEDLINE sentences (12).
Probabilistic taggers, the second type of part-of-speech tagger, tend to be highly accurate if trained on the same genre of text for which they will be used. Probabilistic taggers commonly employ a statistical model that assigns tags based on information from manually pre-tagged training data and the tags of previous words. MedPost is a publicly available probabilistic part-of-speech tagger from the Computational Biology Branch of the National Center for Biotechnology Information (Lister Hill National Center for Biomedical Communications). It is a probabilistic tagger that uses Hidden Markov Models to determine tags for words. It was trained on 5,700 manually tagged sentences from MEDLINE abstracts. When tested on the same subset of 1,000 sentences selected randomly from the MEDLINE database as the Brill tagger, MedPost performed with an accuracy of approximately 97% compared to the Brill's 87% (12). Due to the importance of part-of-speech tagging accuracy to downstream processing, the MedPost tagger was implemented (12). For this project, MedPost was chosen due to its accuracy with biomedical text, public availability, and ease of use.
The MedPost tagger requires abstract input to be formatted in a specific way. This format requires addition of an identifier (a period followed by a capital letter designating the abstract section) to the beginning of each major section of a PubMed abstract: the title line, the abstract, and the ending line of the abstract. This formatting was implemented in Perl. Often, the changes to the original text are subtle and the output is not visually dramatic. The resulting abstracts were used as input for MedPost tagging (see Appendix A).
The MedPost tagger begins by tokenizing the abstracts. In this initial tokenization, the abstracts are broken down into sentences using periods as delimiters. Following this, the sentences are further broken into words separated by space delimiters according to the Penn Treebank format (13). Penn Treebank format is a format that was developed by the Computer Information Science Department at the University of Pennsylvania. In Penn Treebank tokenization, sentences are broken into word-level tokens (with contractions broken into their components), which can then be used as input for a part-of-speech tagger. This tokenization is implemented in MedPost using Perl regular expressions. MedPost then sends the tokens to a stochastic tagger for tagging. There, a Hidden Markov Model is applied where transition probabilities are estimated from tag bigram frequencies in the training set, and where output probabilities are calculated from the training lexicon. This lexicon is a list of 10,000 of the most frequently occurring words in MEDLINE that includes manually entered possible tags for each lexicon term (12). For words in the lexicon, output probabilities assume equal probabilities for each of the allowed part-of-speech tags. For unknown words, output probability estimates are based on word orthography, such as capitalization. The Hidden Markov Model algorithm calculates the most likely tag based on the probability of a given sequence of tags occurring, using the transitional probabilities.
The output for MedPost can be customized by the user and defaults to special MedPost tags. For the purpose of this project, however, the less extensive and more commonly used Penn Treebank (13) tag output was chosen to improve portability of the output into downstream processes. MedPost output is structured as a flat-file database with each sentence as a record and each identifier as an alpha-numeric ID that includes the PubMed ID number as well as the position of the sentence in the original abstract. The output format was altered using Perl in order to serve as input for the chunker (see Appendix A). However, the information was maintained and carried on to the next steps. Only sentences with co-occurrence of disease and gene names were used as input for the following steps.
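The tokenization and sentence-identifier ideas in this section can be sketched in Python as follows. This is a simplified illustration and not MedPost's own tokenizer: the sentence splitter is naive about abbreviations, the word tokenizer only approximates Penn Treebank conventions, and the `PMID-position` identifier format is a hypothetical choice standing in for the alpha-numeric ID described above.

```python
import re


def split_sentences(abstract):
    """Split on sentence-final punctuation followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [part for part in parts if part]


def tokenize_words(sentence):
    """Approximate Penn Treebank tokens: split off punctuation and n't contractions."""
    sentence = re.sub(r"(\w)n't\b", r"\1 n't", sentence)
    return re.findall(r"n't|\w+(?:-\w+)*|[^\w\s]", sentence)


def sentence_records(pmid, abstract):
    """One flat-file style record per sentence: (identifier, sentence, tokens)."""
    records = []
    for position, sentence in enumerate(split_sentences(abstract), start=1):
        identifier = f"{pmid}-{position}"  # hypothetical ID format
        records.append((identifier, sentence, tokenize_words(sentence)))
    return records


records = sentence_records(
    "12345",  # made-up PubMed ID used only for this example
    "Plasma YYYrenZZZ activity was increased. Controls were healthy.",
)
```

Keeping the identifier with every record is what lets a relationship found later be traced back to its abstract and sentence position.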
Chunking (Parsing)
Parsing is the next stage of the information extraction process. Parsing is "determining the complete syntactic structure of a sentence or string of symbols in a language" (3). Full parsing outputs a tree in which words are leaves and internal nodes are syntactic structures (nouns, verbs, noun phrases, verb phrases, etc). This kind of parsing is computationally expensive and can produce several different resulting trees. A good alternative to full parsing is shallow parsing, or chunking. Chunking breaks sentences into non-overlapping phrases so that syntactically related words are grouped. For instance, a group of tokens might be grouped into a verb phrase or a noun phrase. Shallow parsing is useful in that it is much faster than generating the entire parse tree and is also more robust. This grouping also allows easier template design for finding specific patterns in the sentences, so that relationship information can be extracted using customized templates. For example, the information contained in the chunked sentence fragment '[heart expressed gene A]NP [inhibits]VP [constitutively expressed gene B]NP' could be identified using a simplified template that would recognize 'NP-VP-NP'. Accounting for the descriptive terms 'heart' and 'constitutively expressed' is not needed in the template because chunking compresses them into part of the 'Noun Phrase'.
For this project, the publicly available Yam-Cha (14) chunking software was utilized to produce syntactic chunks. The chunks produced from the training abstracts were first used to develop templates, and then the chunks from the testing set were scanned with those templates to retrieve gene:disease relationships.
The Yam-Cha chunk annotator uses support vector machines to classify chunks of text into syntactic phrases. Support vector machines are common in natural language processing and have been shown to have excellent classification accuracy (15). However, using this complex statistical method causes speed of processing to be slow. To increase speed, Yam-Cha applies a variation called Basket Mining. This methodology is statistically complicated, but in effect it simply allows classification (with the maximum margin between the two classifications) based on many dimensions, or features (16).
The output of the chunker is a text file containing one token, or word, per line along with its corresponding part-of-speech tag and chunk representation. Chunk representation consists of the chunk tag ('N' for noun, 'V' for verb, etc.) followed by I, O, or B. 'I' indicates that the current token is inside a chunk. 'O' indicates that the token is outside of any chunk. 'B' indicates that the token is the beginning of a chunk which immediately follows another chunk (15).
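To make the I/O/B representation concrete, the following Python sketch groups tagged tokens into chunks. The label spelling (`I-NP`, `B-NP`, `O`) is an assumption borrowed from the common CoNLL-style convention, since the exact layout of the chunker output file is not reproduced here, and the function name is hypothetical.

```python
def group_chunks(rows):
    """Group (token, pos_tag, chunk_label) rows into (chunk_type, text) pairs.

    A new chunk starts when the label begins with 'B', or when an 'I' label
    follows a token that is outside any chunk or inside a different chunk type.
    """
    chunks = []
    previous_type = None
    for token, _pos_tag, label in rows:
        if label == "O":
            previous_type = None
            continue
        state, chunk_type = label.split("-", 1)
        if state == "B" or chunk_type != previous_type:
            chunks.append((chunk_type, [token]))
        else:
            chunks[-1][1].append(token)
        previous_type = chunk_type
    return [(chunk_type, " ".join(tokens)) for chunk_type, tokens in chunks]


rows = [
    ("SHH", "NNP", "I-NP"),
    ("causes", "VBZ", "I-VP"),
    ("cancer", "NN", "I-NP"),
    (".", ".", "O"),
]
chunks = group_chunks(rows)
```

The `B` state only matters when two chunks of the same type are adjacent; without it, the two phrases would be merged into one.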
Templates
After initial chunking, the chunks were reduced to flat strings of phrase tags (ex: NP-VP-PP-NP) using the chunk identifiers (see Appendix A). Templates were then developed according to chunk representation and verb usage (see Appendix C). For instance, if a training sentence read "SHH causes cancer", its corresponding chunk would be NP-VP-NP. The template design would take into account that the gene name is part of one of the noun phrases, the disease name is part of the other, and the verb "causes" is part of the verb phrase. These templates were then used as 'assembly instructions' to search the testing sentences for relationship information.
The relationships found in the testing set were exported to flat-file format where each record contains a gene:disease relationship as well as a unique identifier that links that relationship back to its original abstract. This flat-file was then compared to the list of the 50 known relationship-containing sentences of the test set of sentences as well as all stated relationships within the entire test set of abstracts. The precision and recall of the tool were then determined.
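The template search can be sketched in Python as a scan for a template's tag sequence inside a sentence's chunk tags. This is a simplified illustration, not the thesis implementation: it ignores the verb-usage criteria that the actual templates in Appendix C take into account, and the marker strings (`YYY`/`ZZZ`, `QQQ`/`VVV`), function name, and record layout are assumptions.

```python
import re

GENE = re.compile(r"YYY(\S+?)ZZZ")     # assumed gene pre-tag markers
DISEASE = re.compile(r"QQQ(\S+?)VVV")  # assumed disease pre-tag markers


def match_templates(chunks, templates):
    """Return the first template whose tag sequence covers a gene and a disease.

    chunks: list of (chunk_type, text) pairs for one sentence.
    templates: phrase-tag strings such as "NP-VP-NP".
    """
    tags = [chunk_type for chunk_type, _text in chunks]
    for template in templates:
        wanted = template.split("-")
        width = len(wanted)
        for start in range(len(tags) - width + 1):
            if tags[start:start + width] != wanted:
                continue
            span = " ".join(text for _type, text in chunks[start:start + width])
            gene = GENE.search(span)
            disease = DISEASE.search(span)
            if gene and disease:
                return {
                    "template": template,
                    "gene": gene.group(1),
                    "disease": disease.group(1),
                }
    return None


sentence_chunks = [
    ("NP", "Stimulation"), ("PP", "of"), ("NP", "growth hormone incretion"),
    ("PP", "by"), ("NP", "YYYinsZZZ"), ("VP", "caused"),
    ("NP", "QQQhypoglycemiaVVV"), ("PP", "in"), ("NP", "children"),
]
hit = match_templates(sentence_chunks, ["NP-VP-NP"])
```

Requiring both markers inside the matched span is what distinguishes a template hit from a sentence that merely has the right phrase structure.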
Results
Of the 50 relationship-containing sentences in the training set and used to develop templates to match relationship statements, two sentences could not be retrieved successfully using the resulting templates. Of those two sentences, one was an abstract title containing only a single noun phrase. The remaining sentence contained complex phrases that would require either the development of a very large template (that could potentially match unwanted phrases) or the use of full parsing techniques rather than chunking. Based on this, the training set error was 4%.
For the test set, of the 50 relationship-containing sentences (with their corresponding whole abstracts) input into the tool, 34 sentences were retrieved successfully by the templates developed from the training data. This gives a recall of 68%. Precision was not calculated on these sentences due to the lack of non-positive-relationship-containing sentences that were input into the tool as the test set of sentences. Below are the 50 relationship-containing sentences used for testing. Sentences that were successfully retrieved are highlighted in light gray, with the portion of the sentence that was retrieved highlighted in darker gray, and are followed by the template that was retrieved in parenthesis. For those that were not retrieved, the sentence itself is followed by a short description of why the sentence was not retrieved.
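The figures quoted above follow from the standard definitions of recall and precision, shown here as a small Python sketch. The counts 34 of 50 and 2 of 50 come from the text; the function names are illustrative only.

```python
def recall(true_positives, false_negatives):
    """Fraction of relationship-containing sentences that the tool retrieved."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Fraction of retrieved sentences that really contain a relationship."""
    return true_positives / (true_positives + false_positives)


# 34 of the 50 known test sentences were retrieved, so 16 were missed.
test_recall = recall(34, 50 - 34)

# 2 of the 50 training sentences could not be retrieved by any template.
training_error = 2 / 50
```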
1
.
[Plasma YYYrenZZZ aetivity]Np [increased] vp [more]ADvp [than]^ [twofold]^
[io\\oSM^wlQ(^&MMhM^W^Ms
ADVP-PP-NP-PP-NP,
was not found
This
sentence
in the training
's template
design,
NP-VP-
set and therefore was not an
included template (see Appendix C).
2-
[QQQhypocalcemiaVWJNP [was associated^ [with]PP [a markedly
reduced
YYYinsZZZ sensitivity]}^
[was
[QQQhypercalcemiaVWjNp
associated]^
[whilejsBAR
[with]PP [a normal
YYYinsZZZ secretory
response and normal
tissue
YYYinsZZZ response]]^ [reduced tissue sensitivity^
(NP-VP-PP-NP)
ighe concentrations were_sigfrfficantly_elevated in about half of the patients
with acute Guillain Barre_Syndrome and tends to^falfig patien^withclinical
Improvement. This sentence contains a disease term, 'guillain-barre syndrome that,
'
in the lexicon,
lexicon
the
was not
in the
synonym contained a
disease
never
having
same format as
it
hyphen). And so
was written
in the document (the
the sentence was never processed
due
been pre-tagged.
be rjeJardgdlYp. {Msbae
[Other
fonnslNp^ofJr^
[causedJ^Xbyfelinapprppri^
secretioii]j4 This
13
sentence
's template
to
NP- VP-SBAR-
design,
not an
VP-PP-NP,
was not found
set and therefore was
in the training
included template (see Appendix C).
5.
0!Upr [J 5Jm [ofjpp [12,patierrJs]i^IwithJpp [active Q^mS^idqsisY^lm
sentence
's template
training set
and
design, NP-VP-PP-NP-PP-NP-PP-NP,
therefore was
not an
was
This
not found in the
included template (see Appendix Cj.
[In]PP [both cases]^ [the frequency]^ [of]PP [urogenital QQQtumorVWV [in]PP
[rats]w [was increased^ [as]PP [a result^ [of]PP [YYYngfbZZZ administration]^
[at]pp [the apparent expense^ [of]PP [neural QQQtumorWV]^ (NP-PP-NP-PP-NP)
[Porcine
crystalline
YYYinsZZZ]NP [0.1 U
fi^hyppglycemiayWi^iOnlpp [all
NP-NP-NP-VP-NP,
was not found
SYM
kgW^Sgj
subjects]^ This
in the training
sentence
set and
's
template
design,
therefore was not an
included template (see Appendix C).
[This]NP [may or may not]VP [indicate]VP [a roleJNP [for]PP [renjup [in]PP [the
[ofjpp [spontaneous hypertensionJNp. (NP-PP-NP-PP-NP)
9.
cause]KP
[The antigenicity]^ [of]Pp [ins]t,.P [is]vp [the caiise];^ [of]PP [the side ef
[ins therapy]}^ [such as]pP [ins allergy]^ [lipoatrophy]j^L[i.fJpp.Bh site]M>.[of
[injection
and
insulin_resistance]NP. This
sentence's template design, NP-VP-NP-PP-NP-PP-NP-PP-NP-NP-PP-NP-PP-NP, was not found
in the training set and therefore was not an included template (see Appendix C).
10.
[The minor components]NP [of]PP [YYYgloblZZZ]NP [which] [are increased]VP [in]PP
[subjects]NP [with]PP [QQQdiabetes-mellitusVVV]NP [YYYgloblZZZ A1a-c]NP [were
measured]VP [in]PP [identical twins]NP [concordant]ADJP [discordant]ADJP [for]SBAR
[diabetes]NP [to determine]VP [whether]SBAR [the observed increases]NP [represent]VP [a
genetically determined abnormality]NP (NP-NP-VP-PP-NP-PP-NP)
11.
[Plasma YYYrenZZZ activity]NP [was significantly increased]VP [in]PP [children]NP
[with]PP [QQQkwashiorkorVVV and marasmus]NP [compared]PP [with]PP [healthy
children]NP [in]PP [children]NP [who] [died]VP [compared]PP [with]PP [survivors]NP
(NP-VP-PP-NP-PP-NP)
12.
[G6PD
hillbrowV [a
new
variant]^
[of]PP [YYYgepdZZZ]]^ [associated]w [with]PP
(NP-VP-PP-NP)
[drug-induced haemolytic QQQanemiaVW]NP
13.
[In]PP [patients]^ [managed]w [conservativelyjAovp [there]^ [was]VP [QQQglucoseintoleranceVvVjNp [associated] yp [with]PP [a diminished early YYYinsZZZ
response]^ [to]PP [glucoseV [suggesting]VP [inadequate nutrition]Np [in]PP [the
periodJNf.
[between]PP [the QQQoverdoseWVjNP [the
glucose
tolerance test]Nj> (NP-
VP-PP-NP)
14.
[Stimulation])^ [of]PP [growth hormone incretionj^ [by]PP [YYYinsZZZJMp [caused]vp
[QQQhypoglycemiaVW]NP [in]PP [children^ [with]PP [delayed growth^ (NP-VP-
NP)
15.
[The
drop]w [in]PP [haptoglobin levels],^ [indicates] Vp [that]SBAR
QQQischemiaVW]Np [may be induced]vp [by]PP
[renal
[a disturbance]^ [in]PP
[YYYgloblZZZ breakdown]NP (NP-VP-PP-NP-PP-NP)
16.
[The attacksjw [of]PP [QQQhemoglobinuriaWVJNp [were associated]vp [with]PP [the
appearance])^ [of]PP [an unstable YYYgloblZZZjNp [in]PP [red cellsV (NP-VP-PP-
NP-PP-NP)
17
.
tQQQprotelnuriaVW levels]^ [were significantly associated^? [with]pp
[QQQhematuriaVyV QQQbacteriuriaW^and reduced GFK\ti&J&h>?
dependence; arid Q(^hypertensiQrjy3TVj#
[lejukpcytumfe
,
sentence was a
poorly
relationship between a
between
other genes and
it does
not contain a
Th is
direct
'
disease. However, 'ins dependence was tagged as
not retrieved was due to the lack of the ability to detect
gene and a
the reason it was
a gene and so
relationships
chosen example sentence as
a
disease
and a gene when
they
are separated
in the text
by
diseases.
18.
[Plasma YYYrenZZZ activity]Np [was elevated] yp [in]PP [moderate
QQQacidosisVW]Np [inducedjvp [by]PP [5 % carbon dioxide inhalation^ [from]PP
[37.5 +- 8.8 ng SYM mlV [to]PP [52.8 +- 7.0 ng SYM mlV (NP-VP-PP-NP)
19.
[We]Np [have presented]vp [the case histofyJKp [ofjpp [a patient]^ [vvith]PP [unilateral
OQQpyelonephritisVVV]^ [elevated peripheraUcenous YYYrenZZZ]>,T [an obvious
cause]NP [of]PP [QQQhypertensionVVV]NP This sentence's template design, NP-NP-PP-NP,
was not found in the training set and therefore was not an included template
(see Appendix C).
20.
[The results],^ [support]VP [the concept^ [that]SBAR [the
aldo-sterone
system^
YYYrenZZZ-YYYagtZZZ-
[may be involved] vp [in]pp [primary QQQhypertensionVWjMp
(NP-VP-PP-NP)
21.
[A low parotid YYYigaZZZ secretion rate]NP [is associated]VP [with]PP [high
susceptibility]NP [to]PP [QQQdental-cariesVVV]NP [perhaps]ADVP [reflecting]VP [inferior
resistance] [to]PP [QQQdental-plaqueVVV formation]NP (NP-VP-PP-NP-PP-NP)
22.
[Slowly]ADVP [acting mechanisms]NP [probably initiated]VP [by] ... [of]PP
[YYYrenZZZ]NP [may be]VP [responsible]ADJP [for]PP [the QQQhypertensionVVV]NP
This sentence's template design, NP-VP-ADJP-PP-NP, was not found in the training
set and therefore was not an included template (see Appendix C).
23.
[Five cases] [with]PP [Soothill type QQQiga-deficiencyVVV] [associated]VP
[with]PP [high YYYigheZZZ levels] (NP-VP-PP-NP)
24.
[It]NP [may be]VP [one] [of]PP [the mechanisms] [of]PP [QQQarrhythmiaVVV
development]NP [caused]VP [by]PP [YYYbdkZZZ] (NP-VP-PP-NP)
25.
[Dopamine, 5-HT, GABA and YYYbdkZZZ] [caused]VP [QQQbradycardiaVVV
abolishable] [by]PP [vagotomy or atropine treatment] (NP-VP-NP)
26.
[These findings] [emphasize]VP [the importance] [of]PP [QQQliver-diseasesVVV]
[as]PP [a significant cause] [of]PP [serum YYYigheZZZ elevation] (NP-PP-NP-PP-NP)
27.
[When]ADVP [changes] [in]PP [body weight] [rectal temperature] [plasma
glucose] [plasma cholesterol] [plasma butanol-extractable iodine ( BEI] [in]PP
[these rats] [were compared]VP [with]PP [the YYYinsZZZ secretory responses]
[it] [was]VP [evident]ADJP [that]SBAR [experimental QQQhyperthyroidismVVV]
[results]VP [in]PP [decreased YYYinsZZZ release] [whereas]PP [experimental
QQQhypothyroidismVVV] [induces]VP [increased YYYinsZZZ secretion] [from]PP
[the pancreas] (NP-PP-NP)
28.
[A one-year-old girl] [with]PP [malignant QQQhypertensionVVV] [had]VP
[increased levels] [of]PP [plasma YYYrenZZZ activity and YYYagtZZZ II
concentration] [in]PP [peripheral blood] [in]PP [blood] [from]PP [the affected
kidneys] [as]PP [compared]PP [with]PP [that] [from]PP [the contralateral kidney]
(NP-VP-NP-PP-NP-PP-NP)
29.
[The
YYYigheZZZ] [oftehJAovp [was elevated] vp [in]PP [patients] [who]
[systemic
[hadjvp
QgQvasculitisyVV] [wjth]pp [respiratory tract involvement]
serum
[particularly]Apvp [those] [with]pp [QO^churg-s1xahss-^dromeVyV__
|5p35vasculitisVyV
and
[a clue]
(^Qwegeners-gr^^
[tojpp. [the pathogenesis] [in]pp [this group] [of]PP [patientsj This sentence 's
template desigtt, NP-ADVP-VP-PP-NP-NP-VP-NP, was not found in the training
and therefore was not an
30.
31
.
32.
data] [demonstrate^ [that]SBAR [QQQhypophosphatemiaVW] [is
associatedjvp [with]PP [an augmented glucose-stimulated YYYinsZZZ release]
[without]PP [any effect] [on]PP [tolbutamide-stimulated YYYinsZZZ release]
VP-PP-NP)
[These
serum
[Infection
,
YYYigheZZZ
QQQdermatitisVy!V
sentence
set and therefore was not an
's
,
mcreasedl(XYigheZZZ]
template
design, NP-NP,
[impaired
was not found
neutrophil
in the training
included template (see Appendix CJ.
[YYYinsZZZ injection] [at]PP [birth] [caused]yp
[QQQhypoglycemiaVW] [suppression] [of]PP [levels] [pfjpp [certain amino
kcids]NP [Inhibition] [pflpp [conversion]; [oflppI!4C substrates]. [intplEp
[glucose] This sentence 's template design, NP-PP-NP-VP-NP, was not found in the
[By]PP [contrast]
training set
34.
(NP-
levels] [were elevated]vp [in]PP [all patients] [with]PP
[QQQaspergillosisVW] [also]ADvp [in]PP [some other forms] [of]PP
[bronchopulmonary QQQaspergillosisVW] [thusJAovp [limiting]^ [the diagnostic
value] [ofJpp [total serum YYYigheZZZ determination] [in]PP [this type] [of]PP
[pulmonary mycotic infection] (NP-VP-PP-NP-PP-NP)
[Total
chemotaxis] This
33.
set
included template (see Appendix C).
and therefore was not an
included template (see Appendix CJ.
[[TYYaglZ/KV5Il^^
'
s
'.ransl] This
training
sentence
's
template
set and therefore was not an
design, NP-NP-PP-NP-NP,
was not found
included template (see Appendix Cj.
in the
35.
[Characterization] [ofjpp [antibodies] [to]pp. [the^Y^X:msZ^jec.ejtorJia
causel^toflpj.
[QQQdja^s^mQNP [in]^ [man] Jto sendee 'j /ew/7/are rfej/gw,
NP-NP-PP-NP, was not found in the training set and therefore was not an included
template (see Appendix Cj.
36.
[His course] [was further complicated]VP [by]PP [QQQhypertensionVW]
[associated]yp [with]PP [elevated plasma YYYrenZZZ levels] [without]PP
[evidence] [of]PP [QQQnephritisVW] (NP-VP-PP-NP)
37.
[A salt-QQQwasting-syndromeVW] [associated] vp [with]PP [high plasma
YYYrenZZZ activity] [inappropriately low aldosterone levels] [was observedjvp
[among]PP [eight Jewish families] [from]PP [Iran] (NP-VP-PP-NP)
38.
[Solitary renal cysts] [may cause]VP [YYYrenZZZ hypersecretion] [with]PP
[associated QQQhypertensionVVV] [by]PP [compressing surrounding tissue]
[and]PP [by]PP [distortion] [of]PP [renal vessels] (NP-PP-NP)
39.
[Altered YYYinsZZZ receptors] [may be]VP [responsible]ADJP [for]PP [the pronounced
QQQinsulin-resistanceVVV] [the decreased synthesis] [of]PP [triglycerides]
[in]PP [congenital generalized QQQlipodystrophyVVV] This sentence's template
design, NP-VP-ADJP-PP-NP, was not found in the training set and therefore was not
an included template (see Appendix C).
40.
[Two new YYYg6pdZZZ variants] [associated]VP [with]PP [congenital nonspherocytic
QQQanemiaVVV] [found]VP [in]PP [Japan] [GD] [Tokushima and GD] [Tokyo]
(NP-VP-PP-NP)
41.
[YYYg6pdZZZ Long Prairie] [is]VP [an interesting new YYYg6pdZZZ variant]
[that] [demonstrates]VP [that]SBAR [chronic QQQhemolysisVVV] [can be
associated]VP [with]PP [modestly decreased YYYg6pdZZZ activity] [despite]PP
[normal sensitivity] [to]PP [inhibition] [by]PP [NADPH] (NP-VP-PP-NP)
42.
[Successful immunosuppressive therapy] [in]PP [QQQdiabetesVW]
[by]PP [anti-YYYinsZZZ receptor autoantibodies] (NP-VP-PP-NP)
43
[Macroamjdaserhia'and
acute
[causedjvp
C&QpancreajitisJ^^
NP-VP-NP[with]pP [af^^^^^^W^X^^^ This sentence 's template design,
PP-NP, was not found in the training set and therefore was not an included template
(see Appendix CJ.
44.
[We] [conclude]w [that]SBAR [YYYsomatostatinZZZ] [causedjvp [only transient
QQQhypoglycemiaVW] [in]PP [normal subjects] [that]SBAR
[QQQhyperglycemiaVW] [eventuallyJADvp [developes]vp [as]PP [a consequence]
[of)PP [YYYinsZZZ deficiency] (NP-VP-NP)
45.
[It] [can be concluded]vp [from]PP [the results] [that]SBAR [YYYagtZZZ II] [is
involved]vp [in]PP [the pathogenesis] [of]PP [QQQhypertensionVW] [and]PP [in]PP
[some cases] [of]PP [QQQhypertensionVW accompanying QQQchronic-renalfailureVW] (NP-VP-PP-NP-PP-NP)
46.
[It] [was concluded]VP [that]SBAR [QQQhypertensionVVV] [associated]VP [with]PP
[low YYYagtZZZ II concentration] [by]PP [implication " low-YYYrenZZZ
QQQhypertensionVVV "] [is]VP [a condition] [separate]VP [from]PP
[QQQessential-hypertensionVVV] (NP-VP-PP-NP)
47.
[QQQacanthosis-nigricansVW] [associated]vp [with]PP
syndromeVW
transl]
48.
YYYinsZZZ-resistent diabetes
and
[QQQstein-leventhal-
[author] ['s
aminoaciduria]
(NP-VP-PP-NP)
[Deformability] [of]PP [erythrocytes] [of]PP [a patient] [with]PP [chronic
QQQanemiaVW] [causedjvp [by]pp [a YYYg6pdZZZ variant (
YYYg6pdZZZ Hamburg] [in]PP [red cells] [was studied]vp (NP-VP-PP-NP)
nonspherocytic
49.
[YYYinsZZZ] [increased]^ [QQQcarcinomaVW] [in]PP [substrate-depleted
bladders] [although]SBAR [the increase] [in]PP [QQQcarcinomaVW] [was]vp
[lessjAD^ [( P] [lessJAovp [than]PP [0.01] [than]PP [in]PP [nonsubstrate-depleted
bladders] (NP-VP-NP)
50.
[Patients] [with]PP [QQQhodgkins-diseaseVVV] [had]VP [significantly increased
serum YYYigheZZZ concentrations] (NP-VP-NP)
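
The template labels shown in parentheses after each test sentence (for example NP-VP-NP) can be read as a contiguous run of chunk tags that must contain both a tagged gene and a tagged disease. The following is a minimal Python sketch of that idea, not the thesis implementation (which is the Perl script in Appendix A). The function names, the (tag, text) pair representation, and the omission of the verb-list check are assumptions made for illustration; only the YYY...ZZZ and QQQ...VVV pre-tags come from the text.

```python
import re

# Pre-tags as they appear in the test sentences: YYYgeneZZZ and QQQdiseaseVVV.
GENE_TAG = re.compile(r"YYY.+?ZZZ")
DISEASE_TAG = re.compile(r"QQQ.+?VVV")


def find_matches(chunk_tags, template_pieces):
    """Return every start index where template_pieces occurs contiguously."""
    n, m = len(chunk_tags), len(template_pieces)
    return [i for i in range(n - m + 1) if chunk_tags[i:i + m] == template_pieces]


def extract(chunks, template):
    """Yield matched text spans that mention both a gene and a disease.

    chunks   -- list of (tag, text) pairs, e.g. ("NP", "QQQanemiaVVV")
    template -- hyphen-separated chunk tags, e.g. "NP-VP-PP-NP"
    """
    pieces = template.split("-")
    tags = [tag for tag, _ in chunks]
    for start in find_matches(tags, pieces):
        window = chunks[start:start + len(pieces)]
        text = " ".join(chunk_text for _, chunk_text in window)
        if GENE_TAG.search(text) and DISEASE_TAG.search(text):
            yield text


if __name__ == "__main__":
    # Test sentence 50, written as (tag, text) pairs.
    sentence_50 = [
        ("NP", "Patients"),
        ("PP", "with"),
        ("NP", "QQQhodgkins-diseaseVVV"),
        ("VP", "had"),
        ("NP", "significantly increased serum YYYigheZZZ concentrations"),
    ]
    for hit in extract(sentence_50, "NP-VP-NP"):
        print(hit)
```

In this sketch the NP-PP-NP window at the start of sentence 50 is rejected because it contains a disease tag but no gene tag, which mirrors the co-occurrence requirement described in the text.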
Of the 12 templates used for searching, 7 returned sentences with relationships from the test set of data. The NP-VP-PP-NP template was the most useful, returning 15 out of the total 34 sentences from the 50 test set sentences (see Appendix C). Of the sentences not retrieved by this method, one was due to a missing disease lexicon term, one was due to the inability of this method to recognize nested relationships, and all other un-retrieved sentences were due to lack of a suitable template to recognize the relationship statement within the sentence. Two such templates, NP-NP-PP-NP and NP-VP-ADJP-PP-NP, if added to the templates, each would retrieve two additional sentences from the test set, indicating that more relationships can be found with additional fine-tuning of the template set. Often, though, the templates needed to retrieve relationships from a sentence would be too complex to be effective in retrieving relationships without sacrificing precision. Templates were kept somewhat short (no greater than seven phrases in length) in the interest of maintaining precision.

When full abstracts from the test set are taken into consideration, the recall estimate changes and precision can be calculated. Within the test abstracts, there were a total of 121 sentences with co-occurrence between a disease name and a gene name. Of those 121 sentences, 9 sentences contained co-occurrence but did not contain a positive relationship between them, leaving a total of 112 positive relationship-containing sentences within the test set of abstracts. Of those, 57 were successfully retrieved using template searching, and a new recall estimate is approximately 51%.

Processing the 50 test set abstracts using the tool returned a total of 59 sentences (see Appendix D). This gives a precision of approximately 97%. However, it should be noted that the estimated precision of 97% is based on a test set of data specifically chosen to contain positive relationships. Only 9 of the sentences in the 50 abstracts contained co-occurrence without a positive relationship. Therefore, precision could be expected to be less than 97% if random MEDLINE text were used.

An assumption of this method is that useful information can be drawn from MEDLINE text describing relationships between diseases and genes. There exists a significant amount of redundancy in the relationships in MEDLINE. Considering this, it is possible that even though not all sentences containing positive relationships are retrieved, a significant number of the relationships themselves are still retrieved. Using this measure, the 50 test sentences contained a total of 37 unique gene:disease relationships. Of those, 29 were successfully retrieved. This gives a gene:disease relationship recall estimate of 78%. When the full abstracts are considered, there were a total of 46 unique gene:disease relationships. Of those, 32 were successfully retrieved. Considering this redundancy increases the recall to 70% (see Appendix E).

However, one of the primary limitations, as can be seen from the sentences that were not retrieved, is the template collection.
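
The percentages quoted in this section follow directly from the counts given alongside them. The short Python sketch below recomputes them as a check on the arithmetic. The helper name `percent` and the dictionary labels are hypothetical. Reading the precision figure as 57 positive sentences out of 59 returned is an assumption inferred from the counts, as is reading 34 of 50 as the sentence-level recall on the test sentences.

```python
def percent(hits, total):
    """Return hits/total as a percentage; reject a non-positive total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


# (hits, total) pairs taken from the counts reported in the text.
estimates = {
    "sentence recall, 50 test sentences": (34, 50),
    "sentence recall, full abstracts": (57, 112),
    # Assumption: 57 of the 59 returned sentences held a positive relationship.
    "sentence precision, full abstracts": (57, 59),
    "unique gene:disease recall, test sentences": (29, 37),
    "unique gene:disease recall, full abstracts": (32, 46),
}

for label, (hits, total) in estimates.items():
    print(f"{label}: {hits}/{total} = {percent(hits, total):.0f}%")
```

Rounded to whole percentages, the last four ratios correspond to the approximately 51%, 97%, 78% and 70% figures reported above.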
Discussion
Future improvements to this
limitation
is
a
based
of a co-occurrence
difficult
requirement to
authors refer
to genes
satisfy
system
disease
names
these objects, there remains no need
A
instead
of
second
full
information
limitation to
parsing.
lack
It
full
parser
Finally,
relates
to the
another
training
on
the
work
in the
et al.
in 1998
methods section above that
containing
verbs
in
limitation
to
help
develop the
templates.
due to
based
by
In this experiment, many
improving
a
training
the
training
This
set
objects of
most prevalent of
sentences of
sentences
was
It
set.
largely
can
due
be
recall.
interest in
those studied.
bordering verb
interest
from
through adding
improve the tool's
system relied on these seven verbs
methods, extracted the most probable objects involved
information. One
Michael Collins (17).
information between
highlight
chunking
full parsing is
on recognition of noun phrases
function) to
use of
that of the templates themselves. This
from too limited
interactions have been
list. Their
regulate, encode, signal, and
recognize
relationships, to be lost in
their content were not.
in this study that
to extract
is the
retain more syntactic
written
mentioned earlier was
gathered
a system
a short
be trained to
the speed of processing continually
with
is the Collins Parser
has been done
(18) built
automated
such as nested
sentence,
full parsing
to apply
protein
implement
when
template use, may cause some
set used to
biomedical text. Gene:
in the text
ease of
although
additional, syntactically varied, sentences would significantly
Similar
that occurs
software can
This
terms.
allowing for
explored
data
to
of
project
to a lack of templates to retrieve them stemming
based
is
lexicon
for lexicons.
the test set that were expected to be retrieved
assumed
complete
very
the primary
for this
was mentioned
being
a
of standardization
in the text. If the
computationally intensive. Though this is true,
possible
for
First,
several ways.
the method that was used
Chunking,
increasing, it may be feasible
need
possible solution to this
about the syntactic structure of a
the compression.
be implemented in
is the
given the
diseases. A
and
recognition of gene and
tool can
Sekimizu
phrases
(activate, bind, interact,
and
then using statistical
in the implicit interaction. Though this
approach
back
of
limited
relevant
the need
system
of
designed
by
Medicine in the National Institutes
of
similar system was
that concentrated on
this
objective of
recall of
72%
paper,
was
system,
looking
design
was to
Highlight,
interactions
Highlight did
had
heavily
on
the
a precision
short verb
(depending
lexicon to
on
bring
the verb used)
and
its
collaboration with a
Genome Center
at
tagging
et al.
was
likely to
at
This
system
had
discussed in this
phrases:
limited
was also quite
interact with,
associate
and relied on statistical methods to
This tool's
design
to protein:
with,
and
determine
recall was
bind to.
whether the
approximately
77%.
a
et al. of
the Computer Science Department
preprocessor, parser,
tokenization and
was
Markup Language (XML)
a
SRI International in Cambridge (20). This
contain a protein term.
to this project
rules and external
in 2000
done in
to
tagger component,
were
identify
sources
to
recovery
the
objects of
interest
designed
The basis
or
of
the
proteins, then
In their
lexicons.
Tool,
the
system
genes and
relevant relationships.
not use
and
system employed a
component.
first tagged for
however, they did
identify
Queens College
Columbia
at
(21). This
that applied Basic Local Alignment Search
knowledge
of
language processing
articles
and error
in that terms
order
Informatics
a natural
pathway information from journal
created a plug-in
entities.
and one much closer to the project
team at the Department of Medical
parsing
was a simpler system
tested.
only three
lexicon
Friedman,
component,
the National
containing 'bind'. The
verb phrase
between
Highlight, however,
construction was similar
group
that enclosed a
Columbia, developed GENIES,
extract molecular
term
system,
list for
verb
precision was
In 2001, Carol
when
and colleagues at
Health in 1999 (19). This
binding relationships
79%
and contained
interest
Rindfiesch
centered on co-occurrence of noun phrases and specific template
not use a protein
noun phrase of
58%
find
Thomas
by
find information. The
protein
noun phrases
more complex
developed
called
for
and a precision of
A slightly
to
relied
approximately 68% to 83%.
Library
in
lexicon, it
an object
information. When tested, this
Another
to
for
extensible
Instead,
the
BLAST, techniques,
within the text.
This
eliminated the need
manner as opposed
interactions
in that it
to
be
for
lexicon
a complete
to the
chunking
employed the use of
full-text
extensive than previous works at
Parsing was
described
method
The GENIES
retrieved.
of terms.
earlier.
project also
This
differed from the
for
precision
a more complete
allowed complex chains of
Its
articles rather than abstracts.
125. The
done in
also
other studies mentioned
verb
list
was also more
this program was 96% and the
recall was
63%.
Lada Adamic
California,
and
her team
tackled the problem
Their
approach was simple
in the
text to
biology.
with
of
determine
They
their technique
studies
finding
in that it
a statistical
Their
order
to
used the rate of co-occurrence
probability that the two
of use of
argument against
described
expensive and the
earlier
NLP
is
in the literature in 2002 (22).
between
disease
a
system could then
lexicons that tend
to
limit the terms
in this
the processing techniques used
itself requires many
steps
and a gene
in the
fully
be
Again,
actual nature of the connection.
that the computational power needed to
process
Palo Alto,
and
were somehow connected
discovered using the
determine the
in the lack
lay
Laboratories
gene-disease connections
propose that the connections
NLP templates in
retrieved.
of
the HP
of researchers at
parse
the power
that can
article and
associated
be
in those
relationship is too
that each contribute additional sources
of error and therefore produce sub-adequate results.
The primary
frequency
certain
assumption of
in the text. If that
disease,
a connection
The team first
the Adamic team
gene occurs with a greater
between the
gene and
gathered gene symbols
then performed a search of the MEDLINE
whether
the
expression
from
frequency
expressing
on a particular
more
related to a
likely.
information.
They
official gene acronyms and recorded
statistical methods
The team then focused
is
in text that is
several public sources of
database for the
Disambiguation using
were not gene-related.
that a gene should occur with a certain
the disease
returned abstracts/titles contained words
pathway.
was
a particular
helped to
disease
disease
or gene
remove acronyms
and compared
that
the rate of
occurrence of
co-occurring
both
articles
that
on a single
disease
segment of articles, recall and precision were
genes with
mentioned
the disease
and
those that
did
not.
Due to the team's focus
not calculated.
internal
that met their
take
into
being
However,
their method was able to
statistical criteria.
Though
A
use of a simple acronym
limited
a
number of relevant relationships
does
lexicon
similar simple co-occurrence approach was
limited this
also
recently
of
used
Botany
and
of
Oklahoma in
University
of
Texas Southwestern Medical Center (23). Though their
relationships
using
quite similar to
implicit relationship
an
the approach
of
objects
from
be determined
public
data
method at
A
text,
was
and relationships assumed.
Lussier
of
by
Michael Cantor
and
MRREL
based
on
MRCOC.
of the
at
at the
the
find
objective was to
gathered
name/synonym
weight
their
They
the
novel
search was
frequency.
which co
the names
They
to
for
the
lexicons. Then, the
importance
of
the co
Using this
co-occurrence
69%.
Beth Israel Medical
alternative ways
disease concept, they
and
logic to
into
Olivier Bodeneider
Language Processing. This group first
sources:
them
recall of this method was
It is included here to demonstrate
each
Internal Medicine
to retrieving gene-disease relationships, although not
Columbia University,
UMLS. For
fuzzy
the co-occurrences
the sentence-level, the
developed
Microbiology
diseases for
such as genes and
sources and assembled
unique approach
the
approach, their initial relationship
'objects'
team searched the co-occurrence using
occurrences and scored
network
of
et al. at
the Adamic team.
The Wren group first defined
occurrence would
The Department
group's results.
by Wren,
University
not
the type of relationship
not elaborate on
Advanced Center for Genome Technology's Department
collaboration with
it does
this approach is simple and elegant,
consideration negative relationships and
described. The
find
of the
Center,
from biomedical
Indra Sarkar
National Institutes
derive the information
selected a set of concepts related
to
aside
of
and
Yves
Health (24).
from Natural
disease from the
then obtained related concepts
from two
public
Gene
Ontology
terms related to
then obtained a subset of
data
those
concepts.
Due to the
was then possible
existence of experimental mappings of genes to
to retrieve a
list
of genes related to the
concept-searching, the team circumvented the
precision and recall were
this approach
the
Gene
widely
is reasonable,
Ontology
variable
a current
mappings.
need
(1% to
limitation
Global
for
disease
concept.
a complete gene
100%), depending
to
its
precision was
usefulness
30%
on
may
UMLS
Using
concepts,
it
this method of
lexicon. However,
disease
lay
and recall was
in
concept.
Though
the incompleteness of
8.8%
overall.
Conclusions
In conclusion, the project described here demonstrates the usefulness of combining a simple co-occurrence based method with some natural language processing techniques including chunking and template matching using controlled lexicons for retrieving relevant gene-disease relationships from Medline abstracts. The precision and recall of this tool, based on a literature review, compare favorably with previous studies.

The use of this method to create an updatable database of gene-disease relationships will allow researchers interested in certain genes/processes/diseases to quickly link their topic of interest to other relevant literature-drawn information as well as make reasonable implicit linkages for disease relationships and networks. The resulting database will also allow another dimension of analysis to be drawn from large gene lists like those generated from microarray experiments, enabling disease "themes" to be generated for these large datasets. With the suggested improvements mentioned above, it is believed that the information generated from the use of this tool can be of great utility in the quickly growing field of biotechnology and bioinformatics.
Reference List
1. MEDLINE Fact Sheet. NCBI Website. 2-19-2005. National Library of Medicine - National Institutes of Health. 3-4-2005.
2. Lee, L. "I'm sorry Dave. I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. In Computer Science: Reflections on the Field, Reflections from the Field; Committee on Fundamentals of Computer Science, National Research Council, editors; Joseph Henry Press: 2004; pp. 111-118.
3. Shatkay, H.; Feldman, R. Journal of Computational Biology 2003, 10(6), 821-855.
4. Ray, S.; Craven, M. Representing Sentence Structure in Hidden Markov Models for Information Extraction; 2001.
5. Fulmer, A. and Zhao, S. Letter to Paine J, [PG Internal Communication], Nov. 15, 2004.
6. Groft, S. Genetic and Rare Diseases Information Center - Office of Rare Diseases. Office of Rare Diseases Website. 2-10-2005. National Institutes of Health. 11-12-2004.
7. Nelson, S. Medical Subject Headings. National Library of Medicine: 2004.
8. Nelson, S. Medical Subject Headings (MESH) Fact Sheet. NCBI Website. 2-12-2004. National Library of Medicine - National Institutes of Health. 3-4-2005.
9. Groft, S. About ORD - Office of Rare Diseases. Office of Rare Diseases Website. 2-1[illegible]-2005. National Institutes of Health. 3-4-2005.
10. Entrez Pubmed. NCBI Website. National Library of Medicine - National Institutes of Health. 11-12-2004.
11. Brill, E. A Simple Rule-Based Part of Speech Tagger; 1992.
12. Smith, L.; Rindflesch, T.; Wilbur, W. J. Bioinformatics 2004, 20(14), 2320-2321.
13. Marcus, M. Penn Treebank Project. 2-2-1999.
14. Kudo, T.; Matsumoto, Y. Fast methods for kernel-based text analysis; 2003; pp. 24-31.
15. Kudo, T.; Matsumoto, Y. Chunking with Support Vector Machines. http://chasen.org/~taku/software/yamcha/ 2001.
16. Pradhan, S.; Hacioglu, K.; Ward, W.; Martin, J.; Jurafsky, D. Shallow Semantic Parsing using Support Vector Machines; Boston, MA, 2004.
17. Collins, M. Computational Linguistics 2003, 29(4), 589-637.
18. Sekimizu, T.; Park, H. S.; Tsujii, J. Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts; 1998; pp. 62-71.
19. Rindflesch, T. Mining molecular binding terminology from biomedical text; 1999; pp. 127-131.
20. Thomas, J.; Milward, D.; Ouzounis, C.; Pulman, S.; Carroll, M. Automatic Extraction of Protein Interactions from Scientific Abstracts; 2000; pp. 541-552.
21. Friedman, C.; Kra, P.; Yu, H.; Krauthammer, M.; Rzhetsky, A. Bioinformatics 2001, 17 Suppl 1, S74-S82.
22. Adamic, L. A.; Wilkinson, D.; Huberman, B.; Adar, E. A Literature Based Method for Identifying Gene-Disease Connections; IEEE: Stanford, California, 2002; pp. 109-117.
23. Wren, J. D.; Bekeredjian, R.; Stewart, J. A.; Shohet, R. V.; Garner, H. R. Bioinformatics 2004, 20(3), 389-398.
24. Cantor, M. N.; Sarkar, I. N.; Bodenreider, O.; Lussier, Y. A. Genestrace: phenomic knowledge discovery via structured terminology; 2005; pp. 103-114.
Appendix A: Code
Pre-tagging (tag.pl)
#!/usr/bin/local/perl -w
#
# Script to tag occurrences of terms from a list in a sentence input file
# with prefixes/suffixes.
#
# Jennifer Paine 2005
#
# usage: perl tag.pl sentence_file term_lexicon prefix suffix

# read in the inputs from the command line
my $sentencefile = $ARGV[0]; chomp $sentencefile;
my $termfile = $ARGV[1]; chomp $termfile;
my $prefix = $ARGV[2]; chomp $prefix;
my $suffix = $ARGV[3]; chomp $suffix;
my @terms;

# read in the term file (one tab-separated record per line)
open TERMS, $termfile or die "Cannot read from term file: $termfile!";
while (<TERMS>) {
    chomp $_;
    my @record = split(/\t/, $_);
    my $reference = \@record;
    push (@terms, $reference);
}
close TERMS;

# go through the sentences and check to see if any terms match - if so, replace
# the found term with the term identifier and tag the term with the tags that
# were introduced
open SENTENCES, $sentencefile or die "Cannot read sentences from: $sentencefile!";
while (<SENTENCES>) {
    chomp $_;
    my $sentence = $_;
    my $acronym;
    my $synonym;
    # iterate through terms
    foreach (@terms) {
        $acronym = lc($_->[1]);
        $synonym = lc($_->[2]);
        $sentence =~ s/\b$synonym\b/$prefix$acronym$suffix/ig;
    }
    # only sentences that received at least one tag are printed
    if ($sentence =~ /$prefix.*$suffix/) {
        print "$sentence\n";
    }
}
close SENTENCES;
Conversion to ITAME Format for MedPost (makeITAME.pl)
#!/usr/bin/local/perl -w
#
# Script to change medline abstract file into ITAME format
#
# Jennifer Paine 1/7/2005
#
# usage: perl makeITAME.pl abstractfilename

while (<>) {
    chomp $_;
    my $identifier;
    my $title;
    my $abstract;
    # record with identifier, title and abstract
    if (/(\d+)\t(.*)\t(.*)$/) {
        $identifier = $1;
        $title = $2;
        $abstract = $3;
        print ".I$identifier\n.T$title\n.A$abstract\n.E\n";
    }
    # record with identifier and title only
    elsif (/(\d+)\t(.*)$/) {
        $identifier = $1;
        $title = $2;
        print ".I$identifier\n.T$title\n.E\n";
    }
}
Get Co-occurring Sentences
#!/usr/bin/local/perl -w
use strict;
# script to get only the sentences w/ co-occurrence in them from a
# medpost output file
# usage: perl get_co-occurrence.pl filename

my $filename = $ARGV[0];
chomp $filename;
my $prefix1 = "YYY";    # gene tag prefix
my $suffix1 = "ZZZ";    # gene tag suffix
my $prefix2 = "QQQ";    # disease tag prefix
my $suffix2 = "VVV";    # disease tag suffix
open FILE, $filename or die "Cannot read $filename!";
my $id;
while (<FILE>) {
    my $line = $_;
    chomp $line;
    # remember the most recent sentence identifier
    if ($line =~ /^(P\d+.\d+)$/) {
        $id = $1;
    }
    # print only lines that carry both a gene tag and a disease tag
    if ($line =~ /$prefix1.*$suffix1/) {
        if ($line =~ /$prefix2.*$suffix2/) {
            print "$id\n$line\n";
        }
    }
}
Format MedPost Output for Yamcha (format.pl)
#!/usr/bin/local/perl -w
#
# Script to format the output of MEDPOST so that it can be put into the Yamcha
# chunker - also exports the ids
#
# Jennifer Paine 1/7/2005
#
# usage: perl format.pl medpostfile

my $infile = $ARGV[0];
my $idfilename = $infile;
$idfilename =~ s/\..*/.ids/;
open ID, ">$idfilename" or die "Cannot open id output file!";
open IN, $infile or die "Cannot read in file!";
while (<IN>) {
    chomp;
    # identifier lines go to the id file
    if (/^P\d+/) {
        print ID "$_\n";
    }
    # sentence lines are written one token per line, token and tag separated by a space
    else {
        s/\s+$//;
        s/ /\n/g;
        s/\// /g;
        #s/(\. \.)$/$1 0/g;
        #s/([^\w] [^\w])/$1 0/g;
        print $_, "\n\n";
    }
}
close IN;
close ID;
Reduce Chunks to Flat Format (reduce-chunk.pl)
#!/usr/bin/local/perl -w
#
# Script to reduce the chunks into 'sentences of tags' and re-assign the IDs.
#
# usage: perl reduce_chunk.pl chunkfile idfile

use strict;

my $chunkfile = $ARGV[0];
my $idfile = $ARGV[1];

# read the id file into an array
open IDFILE, $idfile or die "Cannot read the ID file: $idfile!";
my @ids = <IDFILE>;
close IDFILE;

# collapse the chunks based on the chunk identifiers and re-assign the id number
open CHUNKFILE, $chunkfile or die "Cannot read chunkfile: $chunkfile!";
my @phrase;
my @sentence;
while (<CHUNKFILE>) {
    my $tokenline = $_;
    my $token;
    my $chunkid;
    # check if chunkid
    if ($tokenline =~ /^(.*)\t.*\t(.*)$/) {
        $token = $1;
        $chunkid = $2;
        # if it's a beginner, add its tag to the chain and add the token
        if ($chunkid =~ /^B-(\S+)/) {
            push (@phrase, $1);
            push (@sentence, "///$token");
        }
        elsif ($chunkid =~ /^O/) {
        }
        # otherwise, assemble this token and go to next
        else {
            push (@sentence, $token);
            next;
        }
    }
    # if there is a blank line, print and reset everything
    if ($tokenline =~ /^$/) {
        my $idnow = shift (@ids);
        chomp $idnow;
        #print "\n"
        print "$idnow\t";
        print join("-", @phrase) . "\t";
        print join(" ", @sentence) . "\n";
        @phrase = ();
        @sentence = ();
    }
}
close CHUNKFILE;
Get Relationships from Chunks (check-relationships.pl)
#!/usr/bin/local/perl -w
#
# Script to use templates to get relationships from the abstracts that have been tagged,
# MedPost-tagged, Chunked, and Reduced to have the format ID\tChunks\tChunkSplit text
#
# usage: perl get-Relationships.pl chunked-reduced-file verblist
use strict;

my $infile = $ARGV[0];      # chunked-reduced file (co-occurrence only)
my $verbfile = $ARGV[1];    # lexicon file with acceptable verbs
my $sentence_count = 0;

# read verbs into array
open VERBS, $verbfile or die "Cannot read verb file!";
my @verbs = <VERBS>;
close VERBS;

######### Templates ################
my @template = qw(NP-VP-NP
    NP-PP-NP
    NP-VP-PP-NP
    NP-ADJP-PP-NP
    NP-PP-NP-PP-NP
    NP-VP-PP-NP-PP-NP
    NP-PP-NP-VP-PP-NP
    NP-NP-VP-PP-NP-PP-NP
    NP-VP-NP-PP-NP-PP-NP
    NP-NP-VP-PP-NP
    NP-ADVP-VP-NP
    NP-SBAR-NP-PP-NP
);
####################################

open IN, $infile or die "Cannot read $infile!";
SENTENCE:
while (<IN>) {
    my $line = $_;
    chomp $line;
    my $PMID;
    my $chunk_line;
    my $chunks_part;
    my @chunk_reps;
    my @chunks;
    my @tmp = split(/\t/, $line);
    ($PMID, $chunk_line, $chunks_part) = split(/\t/, $line);
    # chunk representations as an array
    @chunk_reps = split("-", $chunk_line);
    my $chunk_length = scalar(@chunk_reps);
    # chunks themselves as an array
    $chunks_part =~ s/^\/\/\///;
    @chunks = split(/\s+\/\/\//, $chunks_part);

    TEMPLATE:
    foreach my $template (@template) {
        chomp $template;
        my @template_pieces = split("-", $template);
        my $template_length = scalar(@template_pieces);
        # find all the matches in the chunked sentences and return them
        # (positions) in an array
        my @matches = &find_matches(\@chunk_reps, $chunk_length,
            \@template_pieces, $template_length);
\t
#
for
MATCH:
each
match,
it
for its
Smatch ( ^matches)
check
foreach my
chomp Smatch;
@output
( ) ;
my Spiece_id
0; literates
my
contents
template
the
vs
(
=
=
my
Sdisease_found
my
Sgene_found
0;
=
TEMPLATE_PIECE:
0;
=
with
the
template
Sdetermine
if
noun
has
been
disease
#determine
if
noun
has
been
gene
the
chunk
foreach my Spiece (@template_pieces)
{
chomp Spiece;
Schunks [Smatch+Spiece_id]
my Smatchmg_chunk
# add 1 because chunks start at 1
=
#
if
#
either
#
and
#
check
#
and
if
the
is
piece
a
mark
it
if
a
or
it
otherwise,
-
if
just
neither,
it's
a
verb,
add
it
to
the
chunk
{
(Smatching_chunk=~/YYY.+?ZZZ/)
=
i;
!Smatching_chunk=-/
elsif
contains
{
Sgene_found
)
it
if
list
the
against
it's
see
phrase,
disease
gene
(Spiece=~/NP/)
if
noun
;
Sdisease_f ound
.
{
+?WV/)
1;
=
)
(@ output,
push
Smatching_chunk)
;
Spiece_id++;
TEMPLATE_PIECE;
next
)
(Spiece=~/\bVP\b/)
elsif
{
Sverb (@verbs )
foreach my
(
chomp Sverb;
if (Smatching_chunk
=~
/Sverb/i)
{
"HHHSverb"
(("output,
Spiece_id++;
push
."WWW");
TEMPLATE_PIECE;
next
}
MATCH;
next
else
{
Smatching_chunk)
(@output,
Spiece_id++;
push
TEMPLATE
next
if
(Sdisease_found
==
1
PIECE;
Sgene_found
4
;
==1)
{
Ssentence_count++;
#print
"Original
#print
"Template:
Uprint
"Match
#print
"Match:
my
Soutput
my
Sgene;
my
Sdisease;
my
if
Sverb;
=
(Soutput
Sgene
Stemplate\n"
;
Smatch\n";
Position:
"
.
join("///
join("
=~
=
Sline\n";
Sentence
/YYY(
",
@output);
.+?)
ZZZ/)
SI;
}
if
(Soutput
=~
Sdisease
/(.+?)VW/)
SI;
{
=
}
if
(Soutput
Sverb
=~
=
A6
/HHH(.+?)WWW/)
SI;
",
@output)
"\n\n'
.
}
if
(Sverb)
{
print
"Ssentence_count\tSPMID\t (Sverb)
)
\tSgene\tSdisease\n"
;
{
else
print
\t$gene\t$disease\n"
"Ssentence_count\tSPMID\t ( )
;
}
next
print
"Relationships:
close
IN;
##
Sub
Find Match
SENTENCE;
Ssentence_count\n'
Positions
##########################################################
#
Subroutine
to
find
and
return
the
match
positions
that
are
found
in
the
chunked
#
sentences.
#
their
Accepts
respective
as
input:
sentence
and
template
chunks,
along
with
lengths.
#
###############################################################################
#####
sub
my
find_matches
{
Stemplate_chunks
(Ssentence_chunks,
,
Schunk_length,
Stemplate_length)
my @match_locations ;
for (my Si=0; Si<Schunk_length-Stemplate_length+l ;
my
Sj
while
0
=
@_;
(
;
(Ssentence_chunks->
if
Si+A
=
(Sj
==
[Si+S j ]
eq
Stemplate_length-1)
push
(@match_locations,
Stemplate_chunks-> [S
j] )
{
{
Si);
Si++;
Sj
}
else
=
0;
{
Sj++;
}
}
}
return
@match_locations;
}
###############################################################################
#######
A7
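The template search performed by find_matches amounts to locating every position at which the template's chunk types occur contiguously in the sentence's chunk types. The sketch below restates that idea with a plain sliding window. The name find_template and the sample chunk string are hypothetical, and this is an illustration under those assumptions rather than the thesis routine.

```perl
use strict;
use warnings;

# Hypothetical restatement of the template search (not the thesis code):
# return every start index at which @$template occurs contiguously
# in @$chunks.
sub find_template {
    my ($chunks, $template) = @_;
    my @starts;
    my $last = @$chunks - @$template;    # last index where the template still fits
    START: for my $i (0 .. $last) {
        for my $j (0 .. $#$template) {
            # mismatch at any piece: try the next start position
            next START if $chunks->[$i + $j] ne $template->[$j];
        }
        push @starts, $i;                # all pieces matched at this position
    }
    return @starts;
}

# Illustrative chunk string only.
my @chunks   = split /-/, 'NP-VP-PP-NP-PP-NP';
my @template = split /-/, 'NP-PP-NP';
my @starts   = find_template(\@chunks, \@template);
print "match positions: @starts\n";      # prints "match positions: 3"
```

Bounding the start index by the difference in lengths keeps every comparison inside the array, so no undefined elements are compared when warnings are enabled.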
Appendix B. Training Sentences
Sentences for
1
which templates were not effective are
Episodic hypertension
.
associated with positive ren assays after renal transplantation.
2.
[Malignant diabetic keto-acidosis
3.
[hematuria due
4.
[The importance
5.
globl
6.
[Renal hemorrhage
7.
[Reduced
8.
Gastric
9.
globl
10.
[proteinuria
1 1
deficiency
.
12.
increased
to
A'2 abnormality
cancer
by
caused
by
by
of
3
cases)].
ins.
thalassemia_minor in
a
Greek
woman.
a presumable rise of plau activity].
and associated with
hyperglycemia].
hypoglycemia.
an unstable protein associated with chronic
caused
(apropos
as the cause of N-monomethylacetamide
containing ins
Gun Hill:
sensitization
plau activity.
associated with
caused
ins
ins
by
caused
liver in hypokalemia
of the
effect of
in bold print.
hemolysis.
agt electrophoretic study],
of erythrocyte g6pd as a cause of jaundice
[glycogenosis_type_ii (Pompe's disease)
in India.
associated with amyl and
hyaluronidase
deficiency].
13. [A case of encephalopathy after hypoglycemia caused by ins shock therapy, using Sakel's method].
14. hematuria associated with globl C-Harlem: a sickling globl variant.
15. Sacroiliac gout associated with globl E and hypersplenism.
16. [proteinuria caused by agt and its prevention by abolishing the hypertensive response with diuretics].
17.
[Case
1 8.
[Biosynthesis
1 9.
anemia caused
20.
Acquired
21
A
.
of chronic myeloleukemia with anemia caused
of
ins
by
g6pd
stress caused
Carswell,
anemia associated with
by
pathological unstable globl].
burns].
a new variant.
iga
anti-e.
new g6pd variant associated with chronic non-spherocytic
negro
bronchial
[Action
23.
Immunoglobulin A
24.
[Genetic
of
and
(B2)
glycopeptides on apnea caused
(iga)
anemia
in
a
fetal
by bdk].
associated glomerulonephritis.
hematological study
persistence of
[A2'
haemolytic
family.
22.
25.
during
by
of a
globl associated with
globl associated with
from Ghana suffering from hereditary
beta-thalassemia and hemoglobinosis S].
family
beta-thalassemia
and
hereditary
persistence of
fetal
globl.
26. deficiency of serpinc1 3 activity associated with hereditary thrombosis tendency.
27. g6pd Manchester: a new variant associated with chronic nonspherocytic anemia.
28. [proteinuria caused by agt blood protein clearance].
29.
Juvenile hypertension
30.
G6PD
31
cyanosis was
.
Heian,
Thus ighe
in
33.
overproduction of ren within a renal segment.
found in Japan.
a g6pd variant associated with anemia
apparently due
by
methemoglobin
32.
by
caused
to anoxia associated with conversion of globl to
acetaminophen or
its
metabolites.
may_be synthesized within nasa Npolyps of atopic
may have
atopic patients
The pancreatic
significantly
a
dose-dependent
acinar cell necrosis was
elevated serum amyl
level
nonatopic patients.
and was associated with a
in
and reduction
and the polyps
patients,
different etiology from those in
amyl
activity in the
pancreatic tissue.
34. Allergic asthma is caused by antigen reaction with ighe fixed to mast cells of the bronchi.
35.
The ability
of
large doses
of exogenous agt
microscopic myocardial necrosis
36.
Experimentally,
in
It
mediation of
would appear
and
patients,
Thus, it
.
in
plasma
is_characterized_by excess
by excess sodium with
level
increased
dose
increased
was
during
was
by
measured, the
of norepinephrine
iga levels may
potentiation
(2 mug)
hemorrhage.
urinary ins
hypoglycemia.
by documenting that
periods of
that iga levels of whole saliva and serum are_elevated
that salivary
prove useful
in
distinguishing
in
and
oral cancer
patients with
disease.
in the treatment
of
These findings indicated that
One
one
the other
appears that treatment of non-liquefaction of semen with amyl may_be a
useful aid
41
confirmed.
level
the hypoglycemia was confirmed
possible recurrent
40.
and
to sympathetic stimulation and a high
total extractable ins
39.
form)
which the plasma ren
occurred at the time that the ren
ins
hypertension;
(vasoconstrictor
additional experiments
of responses
38.
has been
(volume form).
reduced ren
In
to cause widespread multifocal
rabbit
there are two models of
ren with reduced sodium
37.
in the
"nonresponder"
Abnormal
43.
The
probably
renal vein
(serpincl
serpincl
results
agt
levels
hypertension in the
caused
of plasma ren
activity
"responders,"
suggestive of
hypertension.
angiotensinogenic
42.
had
infertility.
imply that
"Budapest")
methylprednisolone
as a cause of a
familial thrombophilia.
hypertension in the
rat may_be
in
part agt
dependent.
44. Elevation of serum iga associated with depression of cell mediated immunity may_be characteristic of patients with nasopharyngeal_carcinoma.
45. These findings reconfirm that the earliest clinically recognizable state of diabetes_mellitus is_characterized_by an impaired initial ins secretory response to glycemic stimulus.
46.
These
observations suggest that the
due to
an
increased
hypocalcemia
47.
development
calca, but instead
may_be associated with a
Biochemically,
of eln and
secretion of
the lesions of medial
increased
of
hypocalcemia
they
diminished
suggest
at parturition
in
not
prepartal secretion of calca.
sclerosis were_associated with
amounts of collagen
is
that parturient
arterial walls.
decreased
amounts
48.
Large
amounts of
Wilms'
with
49.
Constriction
persistent
In
of one renal
hypertension
decrease in
50.
circulating
ren
apparently
artery in the
is often
which
serum potassium and
comparison with the
thyrotoxicosis and
level
big
can cause
hypertension in
patients
tumor.
healthy,
displayed
of protein-bound
presence of the opposite
associated with
increase in
ren
activity
water
was
iodine,
can produce
plasma ren
activity,
intake.
increased in
a positive correlation with
tachycardia and the
kidney
increase in
patients with
the severity
degree
of
loss
of
the disease the
of weight.
Appendix C. Templates

Template                 Returned Sentences (of 50 Test)
NP-VP-NP                 5
NP-PP-NP                 3
NP-VP-PP-NP              15
NP-ADJP-PP-NP            0
NP-PP-NP-PP-NP           4
NP-VP-PP-NP-PP-NP        5
NP-PP-NP-VP-PP-NP        0
NP-NP-VP-PP-NP-PP-NP     1
NP-VP-NP-PP-NP-PP-NP     1
NP-NP-VP-PP-NP           0
NP-ADVP-VP-NP            0
NP-SBAR-NP-PP-NP         0

Appendix D. Additional Sentences Found in Test Set
1. P01144426A01 The role of the ren-agt system in the maintenance of arterial pressure following hemorrhage was studied in conscious dogs.
2. P01156038T01 Observations of the role of body fluid volumes and plasma ren activity in the management of hypertension.
3. P01175013A04 In contrast a 25 % reduction in latent period was brought about by ngfb for tumor appearance in BD-IX rats receiving 90 mug SYM g ENU.
4. P01237795T01 globl components in diabetes-mellitus studies in identical twins.
5. P00810900T01 Plasma ren activity in oedematous and marasmic children with protein energy malnutrition.
6. P01239060T01 acute-renal-failure after the infusion of globl solutions with without red cell ghosts in rabbits author 's transl.
7. P01255923T01 Plasma ren activity in acute acidosis.
8. P01255923A01 Plasma ren activity in acute acidosis the effect of hexamethonium bromide was studied.
9. P01255923A09 These findings may suggest that the elevation of plasma ren activity in acute acidosis induced by carbon dioxide inhalation is independent from sympathetic stimulation.
10. P00769527A15 The results support the concept that the ren-agt-aldosterone system may be involved in primary hypertension.
11. P00057338T01 ren SYM agt system in hypertension after traumatic renal-artery thrombosis.
12. P00057338A07 The delayed onset of hypertension despite early activation of the ren SYM agt SYM aldosterone axis accords with the course of events observed in experimentally induced hypertension in rats suggests that several weeks even months are required for hypertension to develop after sudden renal-artery occlusion in man.
13. P01084811A03 Increased ighe levels in patients with liver-diseases occurred in the absence of eosinophilia clinical evidence of atopy other known causes of ighe elevation.
14. P00939701T01 ren-agt system in an infant with malignant hypertension.
15. P00939701A02 Unilateral nephrectomy was followed by resolution of the hypertension and normalization of the activity of the ren-agt system.
16. P00956371A16 Finally PTH does not appear to be an ins antagonist has no apparent effect on glucose- or tolbutamide-stimulated ins release in animals with dietary-induced hyperparathyroidism.
17. P00786083T01 ighe antibodies in bronchopulmonary aspergillosis.
18. P00983079A02 The dangers of agt in triggering off acute-renal-failure are illustrated by a case report in which this drug was administered to a comatose patient with hypovolaemic hypotension following barbiturate self-poisoning.
19. P01008056A01 Two new variants of g6pd ( G6PD ) deficiency associated with chronic nonspherocytic anemia were discovered in Japan.
20. P00012846A06 Although increased sensitivity to inhibition by NADPH has been postulated to decrease intracellular enzyme activity resulting in enhanced susceptibility to hemolysis in certain g6pd variants with only moderately decreased enzyme activity an alternative mechanism of hemolysis possibly enzymatic thermolability exists in g6pd Long Prairie.
21. P00833253A01 A 45-year-old non-obesity female patient with no previous history of ins administration was found to have extreme insulin-resistance abnormally high plasma immunoreactive ins in the absence of anti-ins antibodies in the serum.
22. P01013696T01 [ Effect of the agt antagonist saralasin , l-sar-8-ala-agt II on the blood pressure in secondary hypertension.
23. P00837134T01 agt II in essential-hypertension.
24. P00404320A02 Patients with common variable hypogammaglobulinemia ataxia telangiectasia selective iga-deficiency , thymoma and hypogammaglobulinemia had significantly decreased mean serum ighe concentrations.
25. P00404320A05 Patients with the wiskott-aldrich-syndrome had a significantly elevated mean serum ighe concentration.
26. P00404320A13 Finally patients with protein-losing enteropathy familial hypercatabolic hypoproteinemia had normal ighe concentrations associated with normal ighe metabolic parameters.

Relationships retrieved are marked with "X" in last column.

DISEASE                         GENE            FOUND BY TOOL
acanthosis-nigricans            ins             X
Acidosis                        ren             X
acute-renal-failure             globl           X
acute-renal-failure             agt             X
Anemia                          g6pd            X
Arrhythmia                      bdk             X
Aspergillosis                   ighe            X
Bradycardia                     bdk             X
Carcinoma                       ins             X
dental-caries                   iga             X
diabetes-mellitus               globl           X
diabetes-mellitus               ins             X
glucose-intolerance             ins             X
Hemoglobinuria                  globl           X
Hemorrhage                      ren             X
hodgkins-disease                ighe            X
Hypertension                    ren             X
Hypertension                    agt             X
Hyperthyroidism                 ins             X
Hypocalcemia                    ins             X
Hypoglycemia                    ins             X
Hypophosphatemia                ins             X
Hypoproteinemia                 ighe            X
Hypothyroidism                  ins             X
iga-deficiency                  ighe            X
Ischemia                        globl           X
Kwashiorkor                     ren             X
liver-diseases                  ighe            X
Malnutrition                    ren             X
salt-wasting-syndrome           ren             X
Tumor                           ngfb            X
wiskott-aldrich-syndrome        ighe            X
Bacteriuria                     ins
chronic-lymphocytic-leukemia    ighe
Dermatitis                      ighe
Hemolysis                       g6pd
Hemorrhage                      agt
Hypercalcemia                   ins
Hyperglycemia                   somatostatin
insulin-resistance              ins
Lipodystrophy                   ins
multiple-sclerosis              ighe
Pancreatitis                    amyl
Pyelonephritis                  ren
Sarcoidosis                     agt
Vasculitis                      ighe