Rochester Institute of Technology
RIT Scholar Works
Theses, Thesis/Dissertation Collections
2005

Automated extraction of disease-gene relationships from MEDLINE
Jennifer R. Paine

Recommended Citation
Paine, Jennifer R., "Automated extraction of disease-gene relationships from MEDLINE" (2005). Thesis. Rochester Institute of Technology.

Automated Extraction of Disease-Gene Relationships from MEDLINE

by

Jennifer R. Paine

A thesis submitted to the faculty of Rochester Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in the Department of Biological Sciences.

Rochester Institute of Technology
2005

Approved by:
Dr. Debra Burhans
Dr. Jun Xu
Dr. David Lawlor
Dr. Gary Skuse

Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation: Automated Extraction of Disease-Gene Relationships from MEDLINE
Name of author: Jennifer R. Paine
Degree: Master of Science
Program: Bioinformatics
College: College of Science

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:
I, Jennifer R. Paine, hereby grant permission to the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.
Signature of Author: Jennifer R. Paine    Date: 5-6-2005

Print Reproduction Permission Denied:
I, ________, hereby deny permission to the RIT library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.
Signature of Author: ________    Date: ________

Abstract

The increasing amount of scientific biomedical literature is beginning to bring about a need for the development of tools to automatically extract information from these sources. One segment of information of particular interest to researchers is the linkage between genes and diseases. These linkages can help researchers interpret large-scale genomics studies as well as make logical connections between gene expression levels and certain phenotypes. To make the finding and collecting of this information practical, automated methods of extraction and storage are required. In this paper, I propose a method for the automated extraction of linkages between genes and diseases from MEDLINE text using a combination of term co-occurrence and natural language processing techniques. This method incorporates pre-defined lexicons for genes and diseases, tokenization, statistically-driven part-of-speech tagging and chunking, as well as template matching based on a set of training templates to find relationship-containing statements in the MEDLINE text. Results of an experiment on a test set of 50 abstracts demonstrate that this method can be applied with success to extract disease:gene relationships from MEDLINE text, giving a precision of 97% and a recall of between 51% and 78%.

List of Figures

Figure 1, Study Design Flowchart

Acknowledgments

The following thesis, while an individual work, benefitted from the direction and knowledge of several people. My Thesis Chair, Dr. Debra Burhans, along with Thesis Advisor, Dr. Jun Xu, both provided sources of external knowledge in the field that was invaluable in the construction of this project. In addition, my Thesis Advisors Dr. David Lawlor and Dr. Gary Skuse provided feedback and guidance that allowed me to complete this project on schedule. Each of my advisors brought to this project valuable insights that broadened the breadth and depth of my knowledge of the field and challenged my thinking, making the finished product one of both value and utility.

In addition to the academic and technical help, I also received equally important help from family and friends. My fiance, Dan Bushnell, provided constant support throughout this work. Also present was the constant support and encouragement from my parents, Ronald and Eleanor Paine, enabling me to persevere through many challenges and finally obtain the degree.

Finally, I would like to thank everyone in the Corporate Functions Biotechnology division of Procter & Gamble for all of their ongoing help and support. Their comments and insights created an interesting project with many opportunities for future improvement and use.
Table of Contents

Copyright Release Form
Abstract
List of Figures
Acknowledgements
Introduction
Materials and Methods
Results
Discussion
Conclusions
References
Appendices
Appendix A: Code
Appendix B: Training Sentences
Appendix C: Templates
Appendix D: Additional Sentences Found in Test Set
Appendix E: Relationships Found in Test Set of Abstracts

Introduction

Technical progress in the field of genomics has brought about new techniques that allow the large-scale study of genes. Whereas in the past a single experiment might produce information regarding only one or two genes of interest, researchers can now study thousands of genes in a single experiment. A typical microarray experiment can yield tens or hundreds of genes with interesting expression changes. How these changes affect an organism with regard to disease is of great interest to scientists. Understanding the function of certain genes and the pathways in which they participate can provide valuable information regarding a disease and can also speed the progression of treatment. In order to determine whether or not genes have a documented linkage to a disease, researchers currently manually search existing biomedical literature. This is a difficult, if not impossible, task given the vastness of currently available information sources.

The MEDLINE database, one of the leading sources of biomedical literature world-wide, contains over 13 million references and expands at a tremendous rate. According to the MEDLINE Fact Sheet on the World Wide Web, since 2002 the rate of references added to the database has increased to an astounding 3,500 references per day! In 2004 alone, more than 571,000 total references were uploaded to this enormous database (1). Developing a tool to automatically search and extract relevant information from this literature is not only desirable, but is an essential step in progressing into large-scale genomic research.

Automated literature mining is not a new field of study. For many years, linguists have studied methods for gaining useful information in an automated fashion from documents like newspapers, company reports, patent filings, biomedical abstracts, and websites. In fact, the first linguistics program in the United States was started as early as the 1940's (2). Only recently has this field begun to extend to biology, particularly in the field of genomics, as researchers try to navigate the constant flow of new information.

The extraction of useful information from unstructured literature is commonly referred to as "natural language processing". Natural language processing is typically broken into several steps. First, a source of unstructured text is established that contains the desired information. If needed, some form of filtering can be performed to narrow the text to that which is likely to contain something interesting. Next, the text is broken into smaller, more manageable pieces through a process called tokenization. The text is most commonly tokenized first at the sentence level and then at the word level using delimiters such as periods. Part-of-speech tagging involves annotating words with a tag that defines their context within a sentence. The seven most common tags are: article, noun, verb, adjective, preposition, number, and proper noun. After the part-of-speech is determined, it is then possible to parse words into syntactic structures, or group them into meaningful phrases, and use templates to search for information like relationships.

Prior work has already been done to accomplish this same task of finding relationships between genes and diseases. For example, Ray and Craven used statistical methods to identify relationships between genes and diseases (3). In their experiment, several hundred pre-tagged sentences discussing genes and diseases were used as positive examples to train Hidden Markov Models, while thousands of other sentences served as negative examples.

The most commonly used method for gene-disease relationship extraction in the literature is co-occurrence. Co-occurrence finds its roots at the beginnings of the field of linguistics. As J.R. Firth, a famous relational linguist, stated in the late 50's: "You shall know a word by the company it keeps" (2). Co-occurrence-based methods, however, are limited by their strong reliance on pre-defined lexicons, resulting in an inability to detect any gene or disease names not found in the lexicons (4). The limitation lies in the reliance on 'complete' lists of terms and synonyms as well as assumptions that certain terms and sentence structures denote relationships between entities (3). To help overcome these limitations, lexicons for both genes and diseases were developed and used in this project. In addition, templates to match relationships within sentences were generated from a training set of sentences in which relationships between genes and diseases were stated.

The primary objective of this project was to develop an extraction system for retrieving gene:disease relationship information from MEDLINE text using co-occurrence-based Natural Language Processing techniques. Far-reaching benefits include easier and more thorough interpretation of microarray gene experiments.

Figure 1. Study Design Flowchart. (Flowchart steps: build lexicons, namely a disease name lexicon and a gene name lexicon; find 100 sentences containing a stated gene:disease relationship; assemble their abstracts and pre-tag the disease and gene names; part-of-speech tag sentences and eliminate sentences without co-occurrence; chunk into phrases; develop phrase templates using 50 of the sentences; template matching; relationship flat file; determination of precision and recall using the remaining 50 sentences.)

To reach a final product of a collection of likely gene:disease relationships, this process was followed. First, a disease name lexicon was built using data contained in publicly available databases. Next, a subset of a larger gene lexicon was obtained (to allow for faster processing for proof of concept). These two lexicons were used to search MEDLINE for 100 sentences (each in a different abstract, to allow an equal number of sentences and abstracts) containing a stated gene:disease relationship. Each of these sentences' abstracts was then downloaded as a plain-text file and used in the remaining steps. The gene and disease names in the 100 abstracts were pre-tagged for easy retrieval in later steps and then the abstracts were tagged with parts of speech. Following part-of-speech tagging, the abstracts were chunked into nonoverlapping phrases. At this time, the original 100 sentences containing relationships were re-identified and the chunks of the first 50 sentences (an arbitrary division; the training set) were used to develop templates, which were used to retrieve the relationships from the test set. Following template development, the tool was run on the test set and precision and recall were determined. A positive result was one where the sentence retrieved by the tool contained a disease:gene relationship. A negative result was one where the sentence retrieved contained co-occurrence but no stated relationship between the gene and disease.
Materials and Methods

Lexicons

Two lexicons were required for this project: a gene name and synonym lexicon and a disease name and synonym lexicon. A gene name and synonym lexicon had already been assembled at the start of this project (5). To allow rapid use and evaluation of the tool for proof of concept, the gene lexicon was arbitrarily shortened to include 717 unique genes, and this subset, containing 4,840 entries, was used. Each record has a unique identifier, a key term, and a synonym. Typically, the accepted symbol for the gene according to the Human Genome Organization (HUGO) was used as the key term, and the synonym was the gene's name or an acronym for the gene.

The disease lexicon was built from the National Library of Medicine's Medical Subject Headings (MeSH) and the Rare Diseases list from the Office of Rare Diseases of the National Institutes of Health (NIH) (6,7). MeSH is the National Library of Medicine's controlled list of terms. It is structured as a hierarchical list with general terms as nodes and the most specific terms as leaves. For instance, a general term might be "Diseases" and a more specific term might be "Alzheimer's Disease". According to the MeSH Fact Sheet, there were 22,568 headings in MeSH at the time of this manuscript (8). To extract disease information from MeSH, the terms listed under the 'Diseases' heading were used. Often, the more specific terms under a general head term were made synonyms for that key term for the purpose of this project.

The Office of Rare Diseases list was also used as one source for the disease lexicon. The Office of Rare Diseases is an organization within the NIH that was established in 1993 for the purpose of coordinating research on rare diseases. On their website, the Office of Rare Diseases defines a 'rare' disease as one "affecting fewer than 200,000 persons in the United States" (9). This data source contained a listing of approximately 6,000 diseases but did not contain synonyms for those diseases. The Office of Rare Diseases list was compared to the terms from MeSH, and the unique terms were added to the lexicon.

The resulting disease lexicon generated for this project contains 9,896 entries. This includes 9,025 unique disease terms. Each record in the disease lexicon has three attributes: a unique numerical identifier, a key disease name, and a synonym for this key term. The unique numerical identifier is an arbitrary number given to each record and has no biological meaning.

Training/Testing Corpora

After the lexicons were built, they were used to search a local copy of the MEDLINE database (10) for sentences containing co-occurrence of both a gene and a disease term. One-hundred such sentences that contained a clearly stated gene-disease relationship were collected, and their respective abstracts were obtained. Fifty of the sentences and their respective abstracts were used as a training set to develop the tool (see Appendix B), and the remaining 50 sentences with their abstracts were used as a test set to gauge the effectiveness and accuracy of the tool. Using whole abstracts for training and testing allowed evaluation of the tool's ability not only to return the original 50 relationship-containing sentences, but also to identify other sentences containing well-defined relationships, and facilitated demonstration of the tool's functionality.

In this project, sentences containing a disease name and a gene name with a stated relationship are considered positive results and should be returned by the tool. An example of such a relationship-containing sentence is this: "Hypocalcemia was associated with a markedly reduced insulin secretory response and normal tissue insulin sensitivity while hypercalcemia was associated with a normal insulin response and reduced tissue sensitivity". Sentences containing co-occurrence of a gene name and a disease name without a stated relationship, or with a negatively stated relationship, are considered non-relationship-containing sentences and should not be returned by the tool as positive results. An example of a sentence that is non-relationship-containing due to the presence of a negative relationship is this: "Finally, PTH does not appear to be an insulin antagonist and has no apparent effect on either glucose- or tolbutamide-stimulated insulin release in animals with dietary-induced secondary hyperparathyroidism".

Pre-Tagging

The gene and disease names were pre-tagged within the abstracts to ease finding them in later steps. This was done using an exact string matching algorithm implemented in Perl (see Appendix A). The algorithm searches for the occurrence of a lexicon term in the text and appends to it a user-assigned prefix and suffix. Genes and diseases were assigned different prefix-suffix pairs so they could be easily retrieved and separated in the final steps of information extraction. The pre-tagged abstracts were then ready for the first stage of natural language processing.
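The pre-tagging step described under Pre-Tagging can be illustrated with a minimal Python sketch. It is not the Perl implementation from Appendix A, which is not reproduced here. The record layout (identifier, key term, synonym) follows the lexicon description above; the function name `pre_tag`, the sample lexicon rows, the case-insensitive whole-word matching, and the replacement of a matched synonym by its key term are all assumptions made for illustration. The prefix-suffix pairs YYY...ZZZ (genes) and QQQ...VVV (diseases) are modeled on the tagged sentences shown in the Results.

```python
import re
from typing import List, NamedTuple


class LexiconRecord(NamedTuple):
    """One lexicon row: arbitrary identifier, key term, and one synonym."""
    identifier: int
    key_term: str
    synonym: str


def pre_tag(text: str, lexicon: List[LexiconRecord], prefix: str, suffix: str) -> str:
    """Wrap each exact occurrence of a lexicon synonym in prefix/suffix.

    Assumption (not from the thesis): the matched synonym is replaced by
    its key term so that later steps see one token per entity.
    """
    if not lexicon:
        return text
    key_by_synonym = {r.synonym.lower(): r.key_term for r in lexicon}
    # Longest synonyms first, so "diabetes mellitus" wins over "diabetes".
    alternatives = sorted(key_by_synonym, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: prefix + key_by_synonym[m.group(1).lower()] + suffix, text
    )


# Hypothetical lexicon rows, for illustration only.
GENES = [LexiconRecord(1, "ins", "insulin"), LexiconRecord(2, "ren", "renin")]
DISEASES = [
    LexiconRecord(1, "diabetes-mellitus", "diabetes mellitus"),
    LexiconRecord(2, "hypoglycemia", "hypoglycemia"),
]


def pre_tag_abstract(text: str) -> str:
    """Tag genes, then diseases, with different prefix-suffix pairs."""
    return pre_tag(pre_tag(text, GENES, "YYY", "ZZZ"), DISEASES, "QQQ", "VVV")
```

Because matching is exact, a term written differently from its lexicon entry (for example with or without a hyphen) is not tagged, which is the failure mode reported for one of the test sentences in the Results.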
Tokenization and Part-of-Speech Tagging

The first step in the process of extracting gene:disease relationships from the abstracts was to tokenize the abstracts by sentences and then words. Following tokenization, the words were then tagged with their appropriate part-of-speech tags. There are two basic varieties of part-of-speech taggers: rule-based and probabilistic.

Rule-based taggers are based on contextual rule sets that are painstakingly defined by an expert and written into the tagger. Rule-based taggers are commonly used in the field, have been proven effective, and are 'transportable' to different kinds of text. An example of a rule-based part-of-speech tagger is the Brill tagger by Eric Brill (11). Brill's rule-based tagger works by developing rules based on a pre-tagged corpus. Initial tags are assigned based on the most likely tag for the word (ignoring its context) according to the tagged training corpus. If a word was not found in the training text and is capitalized, it is assigned a tag of 'proper noun'. If a lowercase word was not found in the training text, it is assigned a tag based on its last three letters. For instance, if the word is "bulbous", it is assigned the 'adjective' tag because 'adjective' is the most common tag for words ending in '-ous'. After this initial tagging is completed, the tagger then applies sets of rules based on context to refine the results. These rules are adjusted in an iterative process. The Brill tagger is commonly used and has been shown to be very accurate if trained on the same genre of text to which it is applied. However, the Brill tagger performed at an accuracy of 87% in one study when applied to a testing set of 1,000 MEDLINE sentences, text of a kind that was not contained in its training documents (12).

Probabilistic taggers, the second type of part-of-speech tagger, tend to be highly effective over the text for which they are specifically trained. Probabilistic taggers commonly employ a statistical model that 'guesses' the tag of a word based on information from manually pre-tagged training data and on the tags of previous words. For this project, a publicly available probabilistic part-of-speech tagger entitled MedPost was implemented (12). MedPost was developed at the Computational Biology Branch of the National Center for Biotechnology Information and the Lister Hill National Center for Biomedical Communications. It is a probabilistic tagger that uses Hidden Markov Models to determine tags for words, and it is specifically trained for biomedical text: it was trained on 5,700 manually tagged sentences selected randomly from the MEDLINE database. When tested on the same subset of approximately 1,000 sentences as the Brill tagger, MedPost performed with an accuracy of 97% compared to Brill's 87% (12). Due to the importance of accurate part-of-speech tagging to downstream processing, MedPost was chosen for this project due to its accuracy with MEDLINE text, public availability, and ease of use.

The MedPost tagger requires input to be formatted in a specific way. This format requires the addition of an identifier (a period and a capital letter) at the beginning of each major section of a PubMed abstract: the identifier line, the title, the abstract, and the ending line. This formatting was implemented in Perl (see Appendix A), and the resulting abstracts were used as input for MedPost tagging.

The MedPost tagger begins by tokenizing the abstract using regular expressions. Tokenization is performed according to the Penn Treebank format (13). Penn Treebank format is a tokenization format that was developed by the Computer and Information Science Department of the University of Pennsylvania. In Penn Treebank tokenization, the abstracts are broken down into sentences, and the sentences are broken into words separated by spaces (with contractions further broken into their components), which can then be used as input for a part-of-speech tagger. Often, the changes to the original text are subtle and not visually dramatic.

Following this initial tokenization, MedPost sends the tokens to a stochastic tagger. There, a Hidden Markov Model is applied (12). The Hidden Markov Model is based on a manually entered lexicon that includes a list of the 10,000 most frequently occurring words in MEDLINE along with the allowed part-of-speech tags for each lexicon term. Transition probabilities are estimated from tag bigram frequencies in the training set. Output probabilities for words in the lexicon assume equal probabilities for the possible tags, whereas output probabilities for unknown words are estimated based on word orthography, such as capitalization. The algorithm then calculates the most likely tag sequence for a given sequence of tokens.

The output for MedPost can be customized by the user. The defaults of the MedPost output use the special MedPost tag set. For the purpose of this project, however, the less extensive and more commonly used Penn Treebank tag set (13) was chosen to improve portability of the output into downstream processes. The MedPost output format was altered using Perl to be structured as a flat-file database with each sentence as a record and each identifier as an alpha-numeric ID that includes the PubMed ID number as well as the position of the sentence in the original abstract, in order that this information was maintained (see Appendix A). Only sentences with co-occurrence of disease and gene names were used as input for the chunker and carried on to the next steps.
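The Hidden Markov Model tagging described in this section can be illustrated with a toy bigram tagger. The sketch below is not MedPost: the three-tag tag set, every probability value, and the names `viterbi`, `TAGS`, `START_P`, `TRANS_P`, and `EMIT_P` are hypothetical and chosen only to show how transition probabilities (tag-to-tag) and output probabilities (tag-to-word) combine to select the most likely tag sequence. Unlike the MedPost model described above, it uses explicit emission probabilities and a single small constant for unknown words.

```python
import math
from typing import Dict, List

# Hypothetical toy model; a real tagger estimates these from tagged text.
TAGS = ["NN", "VBD", "IN"]
START_P = {"NN": 0.6, "VBD": 0.1, "IN": 0.3}
TRANS_P = {
    "NN": {"NN": 0.2, "VBD": 0.5, "IN": 0.3},
    "VBD": {"NN": 0.6, "VBD": 0.05, "IN": 0.35},
    "IN": {"NN": 0.85, "VBD": 0.05, "IN": 0.1},
}
EMIT_P = {
    "NN": {"insulin": 0.4, "hypoglycemia": 0.4, "children": 0.2},
    "VBD": {"caused": 1.0},
    "IN": {"in": 1.0},
}


def viterbi(
    tokens: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    unknown_p: float = 1e-6,
) -> List[str]:
    """Return the most probable tag sequence under a bigram HMM."""
    if not tokens:
        return []
    # scores[i][t]: log-probability of the best path ending in tag t at token i.
    # back[i][t]: the tag at token i - 1 on that best path.
    scores = [
        {
            t: math.log(start_p[t]) + math.log(emit_p[t].get(tokens[0], unknown_p))
            for t in tags
        }
    ]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(tokens)):
        scores.append({})
        back.append({})
        for t in tags:
            prev = max(
                tags, key=lambda p: scores[i - 1][p] + math.log(trans_p[p][t])
            )
            scores[i][t] = (
                scores[i - 1][prev]
                + math.log(trans_p[prev][t])
                + math.log(emit_p[t].get(tokens[i], unknown_p))
            )
            back[i][t] = prev
    # Trace the best path backwards from the best final tag.
    last = max(tags, key=lambda t: scores[-1][t])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    path.reverse()
    return path
```

Log-probabilities are summed rather than multiplying raw probabilities so that long sentences do not underflow.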
Chunking (Parsing)

Parsing is the next stage of the extraction process. A good description of parsing is that it is "determining the complete syntactic structure of a sentence or string of symbols in a language" (3). Full parsing outputs a tree in which nodes are syntactic structures (noun phrases, verb phrases, etc.) and leaves are words (nouns, verbs, etc). Full parsing is computationally expensive and can produce several parse trees. An alternative to full parsing is shallow parsing, or chunking. Chunking breaks sentences into non-overlapping phrases so that syntactically related words are grouped. For instance, a group of tokens might be grouped into a verb phrase or a noun phrase. This kind of grouping is much faster than generating the entire parse tree and is also more robust.

Chunking allows easier template design for finding relationship information because the complete syntactic description is not needed. The resulting phrases allow specific patterns to be identified using customized templates. For example, the information contained in the chunked sentence fragment '[heart expressed gene A]NP [inhibits]VP [constitutively expressed gene B]NP' could be extracted using a simplified template that would recognize 'NP-VP-NP'. This is useful because, rather than accounting for the different descriptive terms 'heart expressed' and 'constitutively expressed', chunking compresses them into part of the 'Noun Phrase'.

For this project, the publicly available Yam-Cha (14) chunking software was utilized to produce syntactic chunks. The chunks produced from the training abstracts were first used to develop templates, then chunks of text from the testing set were scanned with those templates to retrieve gene:disease relationships.

The Yam-Cha chunk annotator uses support vector machines to classify natural language tokens into syntactic phrases. Support vector machines are common in natural language processing and have been shown to have excellent classification accuracy (15). Simply, this method allows classification based on many dimensions, or features, with the effect of a maximum margin between the two classifications (16). However, this method is statistically complicated and complex, which causes processing to be slow. To increase speed, Yam-Cha applies a variation called Basket Mining.

The output of the chunker is a text file containing one token, or word, per line along with its corresponding part-of-speech tag and chunk representation. Chunk representation consists of the chunk type ('N' for noun, 'V' for verb, etc.) followed by I, O, or B. 'I' indicates that the current token is inside a chunk. 'O' indicates that the token is outside of any chunk. 'B' indicates that the token is the beginning of a chunk which immediately follows another chunk (15). After initial chunking, the chunks were reduced to strings of phrase tags (ex: NP-VP-PP-NP) using the chunk representation (see Appendix A).

Templates

Templates were then developed according to the chunk representation and verb usage of the training sentences (see Appendix C). For instance, if a training sentence read "SHH causes cancer", its corresponding chunk representation would be NP-VP-NP. The template design would take into account that the gene name is part of one of the noun phrases, the disease name is part of the other, and that the verb "causes" is part of the verb phrase. These templates were then used to search the testing sentences for relationship information.
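The reduction of chunker output to a phrase-tag string and the matching of that string against templates can be sketched as follows. This Python sketch is illustrative only and is not the Appendix A code. It assumes chunk tags written as `B-NP`, `I-NP`, and `O`, and the entity markers YYY...ZZZ (gene) and QQQ...VVV (disease) seen in the Results. The function names and the matching rule (a contiguous run of chunks whose tags equal the template and which contains both a tagged gene and a tagged disease) are assumptions, simpler than the verb-aware templates of Appendix C.

```python
import re
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]  # (phrase tag, chunk text)

GENE = re.compile(r"YYY(\S+?)ZZZ")      # assumed gene prefix/suffix
DISEASE = re.compile(r"QQQ(\S+?)VVV")   # assumed disease prefix/suffix


def group_chunks(rows: List[Tuple[str, str]]) -> List[Chunk]:
    """Group (token, chunk_tag) rows into non-overlapping chunks."""
    grouped: List[Tuple[str, List[str]]] = []
    previous_type = None
    for token, tag in rows:
        if tag == "O":                   # token outside any chunk
            previous_type = None
            continue
        position, phrase = tag.split("-", 1)
        if position == "B" or phrase != previous_type:
            grouped.append((phrase, [token]))   # start a new chunk
        else:
            grouped[-1][1].append(token)        # continue the current chunk
        previous_type = phrase
    return [(phrase, " ".join(tokens)) for phrase, tokens in grouped]


def phrase_string(chunks: List[Chunk]) -> str:
    """Reduce chunks to a string of phrase tags such as 'NP-VP-PP-NP'."""
    return "-".join(phrase for phrase, _ in chunks)


def find_relationship(
    chunks: List[Chunk], templates: List[str]
) -> Optional[Tuple[str, str, str]]:
    """Return (gene, disease, template) for the first matching template."""
    tags = [phrase for phrase, _ in chunks]
    for template in templates:
        wanted = template.split("-")
        for start in range(len(chunks) - len(wanted) + 1):
            if tags[start:start + len(wanted)] != wanted:
                continue
            window = " ".join(text for _, text in chunks[start:start + len(wanted)])
            gene, disease = GENE.search(window), DISEASE.search(window)
            if gene and disease:
                return gene.group(1), disease.group(1), template
    return None
```

For the training example "SHH causes cancer", pre-tagged as `YYYshhZZZ causes QQQcancerVVV`, the chunks reduce to `NP-VP-NP`, and a template list containing `NP-VP-NP` yields the gene key `shh` and the disease key `cancer`.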
The relationships found in the testing set were exported to flat-file format where each record contains a gene:disease relationship as well as a unique identifier that links that relationship back to its original abstract. This flat-file was then compared to the list of the 50 known relationship-containing sentences of the test set as well as all stated relationships within the entire test set of abstracts. The precision and recall of the tool were then determined.

Results

Of the 50 relationship-containing sentences used to develop templates to match relationship statements, two training sentences could not be successfully retrieved using the resulting templates. Of those two sentences, one was an abstract title containing only a single noun phrase. The remaining sentence contained complex phrases that would require either the development of a very large template (that could potentially match unwanted phrases) or the use of full parsing techniques rather than chunking. Based on this, the training set error was 4%.

For the test set, the 50 relationship-containing sentences (with their corresponding whole abstracts) were input into the tool and matched using the templates developed from the training data. Of the 50 sentences, 34 were successfully retrieved by the tool. This gives a recall of 68%. Precision was not calculated for this set of sentences due to the lack of non-positive-relationship-containing sentences input into the tool.

Below are the sentences used for testing. For those that were successfully retrieved, the portion of the sentence that was retrieved is highlighted in light gray and is followed by the template in parenthesis. Sentences that were not successfully retrieved are highlighted in darker gray and followed by a short description of why the sentence was not retrieved.
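The arithmetic behind the training set error and the recall reported in this section can be written out as a short sketch. The counts (2 of 50 training sentences missed, 34 of 50 test sentences retrieved) come from the text; the function names are hypothetical. A `precision` helper is included only to show the standard definition and is not evaluated on thesis data here.

```python
def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of relationship-containing sentences that the tool retrieved."""
    return relevant_retrieved / total_relevant


def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of retrieved sentences that truly contain a relationship."""
    return relevant_retrieved / total_retrieved


def error_rate(missed: int, total: int) -> float:
    """Fraction of sentences the templates failed to retrieve."""
    return missed / total


training_error = error_rate(2, 50)   # two training sentences not retrieved
test_recall = recall(34, 50)         # 34 of 50 test sentences retrieved
print(f"training set error: {training_error:.0%}, test set recall: {test_recall:.0%}")
```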
[Plasma YYYrenZZZ activity]NP [increased]VP [more]ADVP [than]PP [twofold]NP [io\\oSM^wlQ(^&MMhM^W^Ms
This sentence's template design, NP-VP-ADVP-PP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

2. [QQQhypocalcemiaVVV]NP [was associated]VP [with]PP [a markedly reduced YYYinsZZZ sensitivity]NP [while]SBAR [QQQhypercalcemiaVVV]NP [was associated]VP [with]PP [a normal YYYinsZZZ secretory response and normal tissue YYYinsZZZ response]NP [reduced tissue sensitivity]NP (NP-VP-PP-NP)

3. ighe concentrations were significantly elevated in about half of the patients with acute Guillain Barre Syndrome and tends to fall in patients with clinical improvement.
This sentence contains a disease term, 'guillain-barre syndrome', that was not in the lexicon in the same format as it was written in the document (the lexicon synonym contained a hyphen). And so the sentence was never processed due to never having been pre-tagged.

4. be rjeJardgdlYp. {Msbae [Other fonnslNp^ofJr^ [causedJ^Xbyfelinapprppri^ secretioii]j4
This sentence's template design, NP-VP-SBAR-VP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

5. 0!Upr [J 5Jm [ofjpp [12,patierrJs]i^IwithJpp [active Q^mS^idqsisY^lm
This sentence's template design, NP-VP-PP-NP-PP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

6. [In]PP [both cases]NP [the frequency]NP [of]PP [urogenital QQQtumorVVV]NP [in]PP [rats]NP [was increased]VP [as]PP [a result]NP [of]PP [YYYngfbZZZ administration]NP [at]PP [the apparent expense]NP [of]PP [neural QQQtumorVVV]NP (NP-PP-NP-PP-NP)

7. [Porcine crystalline YYYinsZZZ]NP [0.1 U SYM kgW^Sgj fi^hyppglycemiayWi^iOnlpp [all subjects]NP
This sentence's template design, NP-NP-NP-VP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

8. [This]NP [may or may not]VP [indicate]VP [a role]NP [for]PP [ren]NP [in]PP [the cause]NP [of]PP [spontaneous hypertension]NP (NP-PP-NP-PP-NP)

9. [The antigenicity]NP [of]PP [ins]NP [is]VP [the cause]NP [of]PP [the side ef [ins therapy]NP [such as]PP [ins allergy]NP [lipoatrophy]NP [at]PP [the site]NP [of]PP [injection and insulin_resistance]NP
This sentence's template design, NP-VP-NP-PP-NP-PP-NP-PP-NP-NP-PP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

10. [The minor components]NP [of]PP [YYYgloblZZZ]NP [which]NP [are increased]VP [in]PP [subjects]NP [with]PP [QQQdiabetes-mellitusVVV]NP [YYYgloblZZZ Ala-c]NP [were measured]VP [in]PP [identical twins]NP [concordant]ADJP [discordant]ADJP [for]SBAR [diabetes]NP [to determine]VP [whether]SBAR [the observed increases]NP [represent]VP [a genetically determined abnormality]NP (NP-NP-VP-PP-NP-PP-NP)

11. [Plasma YYYrenZZZ activity]NP [was significantly increased]VP [in]PP [children]NP [with]PP [QQQkwashiorkorVVV and marasmus]NP [compared]PP [with]PP [healthy children]NP [in]PP [children]NP [who]NP [died]VP [compared]PP [with]PP [survivors]NP (NP-VP-PP-NP-PP-NP)

12. [G6PD hillbrow]NP [a new variant]NP [of]PP [YYYg6pdZZZ]NP [associated]VP [with]PP [drug-induced haemolytic QQQanemiaVVV]NP (NP-VP-PP-NP)

13. [In]PP [patients]NP [managed]VP [conservatively]ADVP [there]NP [was]VP [QQQglucose-intoleranceVVV]NP [associated]VP [with]PP [a diminished early YYYinsZZZ response]NP [to]PP [glucose]NP [suggesting]VP [inadequate nutrition]NP [in]PP [the period]NP [between]PP [the QQQoverdoseVVV]NP [the glucose tolerance test]NP (NP-VP-PP-NP)

14. [Stimulation]NP [of]PP [growth hormone incretion]NP [by]PP [YYYinsZZZ]NP [caused]VP [QQQhypoglycemiaVVV]NP [in]PP [children]NP [with]PP [delayed growth]NP (NP-VP-NP)

15. [The drop]NP [in]PP [haptoglobin levels]NP [indicates]VP [that]SBAR [renal QQQischemiaVVV]NP [may be induced]VP [by]PP [a disturbance]NP [in]PP [YYYgloblZZZ breakdown]NP (NP-VP-PP-NP-PP-NP)

16. [The attacks]NP [of]PP [QQQhemoglobinuriaVVV]NP [were associated]VP [with]PP [the appearance]NP [of]PP [an unstable YYYgloblZZZ]NP [in]PP [red cells]NP (NP-VP-PP-NP-PP-NP)

17. [QQQproteinuriaVVV levels]NP [were significantly associated]VP [with]PP [QQQhematuriaVVV QQQbacteriuriaVVV and reduced GFK\ti&J&h>? dependence; arid Q(^hypertensiQrjy3TVj# [lejukpcytumfe ,
This sentence was a poorly chosen example sentence as it does not contain a direct relationship between a gene and a disease. However, 'ins dependence' was tagged as a gene and so the reason it was not retrieved was due to the lack of the ability to detect relationships between a disease and a gene when they are separated in the text by other genes and diseases.

18. [Plasma YYYrenZZZ activity]NP [was elevated]VP [in]PP [moderate QQQacidosisVVV]NP [induced]VP [by]PP [5 % carbon dioxide inhalation]NP [from]PP [37.5 +- 8.8 ng SYM ml]NP [to]PP [52.8 +- 7.0 ng SYM ml]NP (NP-VP-PP-NP)

19. [We]NP [have presented]VP [the case history]NP [of]PP [a patient]NP [with]PP [unilateral QQQpyelonephritisVVV]NP [elevated peripheral venous YYYrenZZZ]NP [an obvious cause]NP [of]PP [QQQhypertensionVVV]NP
This sentence's template design, NP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

20. [The results]NP [support]VP [the concept]NP [that]SBAR [the YYYrenZZZ-YYYagtZZZ-aldosterone system]NP [may be involved]VP [in]PP [primary QQQhypertensionVVV]NP (NP-VP-PP-NP)

21. [A low parotid YYYigaZZZ secretion rate]NP [is associated]VP [with]PP [high susceptibility]NP [to]PP [QQQdental-cariesVVV]NP [perhaps]ADVP [reflecting]VP [inferior resistance] [to]PP [QQQdental-plaqueVVV formation]NP (NP-VP-PP-NP-PP-NP)

22. [Slowly]ADVP [acting mechanisms]NP [probably initiated]VP [by] of]PP [YYYrenZZZ]NP [may be]VP [responsible]ADJP [for]PP [the QQQhypertensionVVV]NP
This sentence's template design, NP-VP-ADJP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

23. [Five cases] [with]PP [Soothill type QQQiga-deficiencyVVV] [associated]VP [with]PP [high YYYigheZZZ levels] (NP-VP-PP-NP)

24. [It]NP [may be]VP [one] [of]PP [the mechanisms] [of]PP [QQQarrhythmiaVVV development]NP [caused]VP [by]PP [YYYbdkZZZ] (NP-VP-PP-NP)

25. [Dopamine , 5-HT , GABA and YYYbdkZZZ] [caused]VP [QQQbradycardiaVVV abolishable] [by]PP [vagotomy or atropine treatment] (NP-VP-NP)

26. [These findings] [emphasize]VP [the importance] [of]PP [QQQliver-diseasesVVV] [as]PP [a significant cause] [of]PP [serum YYYigheZZZ elevation] (NP-PP-NP-PP-NP)

27. [When]ADVP [changes] [in]PP [body weight] [rectal temperature] [plasma glucose] [plasma cholesterol] [plasma butanol-extractable iodine ( BEI] [in]PP [these rats] [were compared]VP [with]PP [the YYYinsZZZ secretory responses] [it] [was]VP [evident]ADJP [that]SBAR [experimental QQQhyperthyroidismVVV] [results]VP [in]PP [decreased YYYinsZZZ release] [whereas]PP [experimental QQQhypothyroidismVVV] [induces]VP [increased YYYinsZZZ secretion] [from]PP [the pancreas] (NP-PP-NP)

28. [A one-year-old girl] [with]PP [malignant QQQhypertensionVVV] [had]VP [increased levels] [of]PP [plasma YYYrenZZZ activity and YYYagtZZZ II concentration] [in]PP [peripheral blood] [in]PP [blood] [from]PP [the affected kidneys] [as]PP [compared]PP [with]PP [that] [from]PP [the contralateral kidney] (NP-VP-NP-PP-NP-PP-NP)

29. [The serum YYYigheZZZ] [often]ADVP [was elevated]VP [in]PP [patients] [who] [had]VP [systemic QQQvasculitisVVV] [with]PP [respiratory tract involvement] [particularly]ADVP [those] [with]PP [QQQchurg-strauss-syndromeVVV QQQvasculitisVVV and (^Qwegeners-gr^^ [a clue] [to]PP [the pathogenesis] [in]PP [this group] [of]PP [patients]
This sentence's template design, NP-ADVP-VP-PP-NP-NP-VP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

30. [These data] [demonstrate]VP [that]SBAR [QQQhypophosphatemiaVVV] [is associated]VP [with]PP [an augmented glucose-stimulated YYYinsZZZ release] [without]PP [any effect] [on]PP [tolbutamide-stimulated YYYinsZZZ release] (NP-VP-PP-NP)

31. [Infection , QQQdermatitisVVV , increased YYYigheZZZ] [impaired neutrophil chemotaxis]
This sentence's template design, NP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

32. [YYYinsZZZ injection] [at]PP [birth] [caused]VP [QQQhypoglycemiaVVV] [suppression] [of]PP [levels] [of]PP [certain amino acids]NP [Inhibition] [of]PP [conversion] [of]PP [14C substrates] [into]PP [glucose]
This sentence's template design, NP-PP-NP-VP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

33. [By]PP [contrast] [Total serum YYYigheZZZ levels] [were elevated]VP [in]PP [all patients] [with]PP [QQQaspergillosisVVV] [also]ADVP [in]PP [some other forms] [of]PP [bronchopulmonary QQQaspergillosisVVV] [thus]ADVP [limiting]VP [the diagnostic value] [of]PP [total serum YYYigheZZZ determination] [in]PP [this type] [of]PP [pulmonary mycotic infection] (NP-VP-PP-NP-PP-NP)

34. [[TYYaglZ/KV5Il^^ ' s '.ransl]
This sentence's template design, NP-NP-PP-NP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

35. [Characterization] [of]PP [antibodies] [to]PP [the^Y^X:msZ^jec.ejtorJia causel^toflpj. [QQQdja^s^mQNP [in]PP [man]
This sentence's template design, NP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

36. [His course] [was further complicated]VP [by]PP [QQQhypertensionVVV] [associated]VP [with]PP [elevated plasma YYYrenZZZ levels] [without]PP [evidence] [of]PP [QQQnephritisVVV] (NP-VP-PP-NP)

37. [A salt-QQQwasting-syndromeVVV] [associated]VP [with]PP [high plasma YYYrenZZZ activity] [inappropriately low aldosterone levels] [was observed]VP [among]PP [eight Jewish families] [from]PP [Iran] (NP-VP-PP-NP)

38. [Solitary renal cysts] [may cause]VP [YYYrenZZZ hypersecretion] [with]PP [associated QQQhypertensionVVV] [by]PP [compressing surrounding tissue] [and]PP [by]PP [distortion] [of]PP [renal vessels] (NP-PP-NP)

39. [Altered YYYinsZZZ receptors] [may be]VP [responsible]ADJP [for]PP [the pronounced QQQinsulin-resistanceVVV] [the decreased synthesis] [of]PP [triglycerides] [in]PP [congenital generalized ^C^hr^ys^pphy^iSOiS
This sentence's template design, NP-VP-ADJP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

40. [Two new YYYg6pdZZZ variants] [associated]VP [with]PP [congenital nonspherocytic QQQanemiaVVV] [found]VP [in]PP [Japan] [GD] [Tokushima and GD] [Tokyo] (NP-VP-PP-NP)

41. [YYYg6pdZZZ Long Prairie] [is]VP [an interesting new YYYg6pdZZZ variant] [that] [demonstrates]VP [that]SBAR [chronic QQQhemolysisVVV] [can be associated]VP [with]PP [modestly decreased YYYg6pdZZZ activity] [despite]PP [normal sensitivity] [to]PP [inhibition] [by]PP [NADPH] (NP-VP-PP-NP)

42. [Successful immunosuppressive therapy] [in]PP [QQQdiabetesVVV] [caused]VP [by]PP [anti-YYYinsZZZ receptor autoantibodies] (NP-VP-PP-NP)

43. [Macroamjdaserhia'and acute C&QpancreajitisJ^^ [with]PP [af^^^^^^W^X^^^
This sentence's template design, NP-VP-NP-PP-NP, was not found in the training set and therefore was not an included template (see Appendix C).

44. [We] [conclude]VP [that]SBAR [YYYsomatostatinZZZ] [caused]VP [only transient QQQhypoglycemiaVVV] [in]PP [normal subjects] [that]SBAR [QQQhyperglycemiaVVV] [eventually]ADVP [developes]VP [as]PP [a consequence] [of]PP [YYYinsZZZ deficiency] (NP-VP-NP)

45.
[It] [can be concluded]VP [from]PP [the results] [that]SBAR [YYYagtZZZ II] [is involved]VP [in]PP [the pathogenesis] [of]PP [QQQhypertensionVVV] [and]PP [in]PP [some cases] [of]PP [QQQhypertensionVVV accompanying QQQchronic-renal-failureVVV] (NP-VP-PP-NP-PP-NP)

46. [It] [was concluded]VP [that]SBAR [QQQhypertensionVVV] [associated]VP [with]PP [low YYYagtZZZ II concentration] [by]PP [implication " low-YYYrenZZZ QQQhypertensionVVV "] [is]VP [a condition] [separate]VP [from]PP [QQQessential-hypertensionVVV] (NP-VP-PP-NP)

47. [QQQacanthosis-nigricansVVV] [associated]VP [with]PP [QQQstein-leventhal-syndromeVVV YYYinsZZZ-resistent diabetes and aminoaciduria] [author] ['s transl] (NP-VP-PP-NP)

48. [Deformability] [of]PP [erythrocytes] [of]PP [a patient] [with]PP [chronic nonspherocytic QQQanemiaVVV] [caused]VP [by]PP [a YYYg6pdZZZ variant ( YYYg6pdZZZ Hamburg] [in]PP [red cells] [was studied]VP (NP-VP-PP-NP)

49. [YYYinsZZZ] [increased]VP [QQQcarcinomaVVV] [in]PP [substrate-depleted bladders] [although]SBAR [the increase] [in]PP [QQQcarcinomaVVV] [was]VP [less]ADJP [( P] [less]ADVP [than]PP [0.01] [than]PP [in]PP [nonsubstrate-depleted bladders] (NP-VP-NP)

50. [Patients] [with]PP [QQQhodgkins-diseaseVVV] [had]VP [significantly increased serum YYYigheZZZ concentrations] (NP-VP-NP)

Of the 12 templates used for searching, 7 were effective in retrieving relationships from the test set of data. The NP-VP-PP-NP template was the most useful, returning 15 out of the total 34 sentences retrieved from the 50 test set sentences (see Appendix C). Of the sentences not retrieved from the test set by this method, one was due to a missing term from the disease lexicon, one was due to the inability of this method to recognize nested relationships, and all other un-retrieved sentences were due to lack of a suitable template to retrieve the relationship statement within the sentence. Two such templates, NP-NP-PP-NP and NP-VP-ADJP-PP-NP, if added to the template set, would each retrieve two more sentences with relationships, indicating that additional relationships can be found by fine-tuning the templates without sacrificing precision. Often, though, a sentence would be too complex to recognize. Templates were kept somewhat short (no greater than seven phrases in length) due to the need for maintaining the precision.

When the full test set abstracts are taken into consideration, precision can be calculated. Within the test set of abstracts, there were a total of 121 sentences with co-occurrence between a disease name and a gene name. Of those 121 sentences, 9 contained co-occurrence without a positive relationship between them, leaving a total of 112 positive relationship-containing sentences within the test set of abstracts. Processing the test set abstracts using the tool returned a total of 59 sentences. Of those, 57 contained a positive relationship. This gives a precision of approximately 97% and, based on the 112 positive relationship-containing sentences, a recall of 51% (see Appendix D).

It should be noted that the estimated precision of approximately 97% is based on a test set of abstracts that were specifically chosen to contain primarily positive relationships. Precision could be expected to be less than 97% if a random collection of MEDLINE text were used. However, as can be seen, there exists a significant amount of useful information that can be drawn from MEDLINE describing relationships between diseases and genes.

An assumption of this method is that there exists a significant amount of redundancy in relationships in MEDLINE. Considering this, it is possible that even though not all sentences containing relationships are retrieved, a significant number of the relationships themselves are still retrieved. The 50 test sentences contained a total of 37 unique gene:disease relationships. Of those, 29 were successfully retrieved. This measure gives a recall estimate of 78%. When the full test set abstracts are considered, there were a total of 46 unique gene:disease relationships. Of those, 32 were successfully retrieved. This gives a new recall estimate of approximately 70% (see Appendix E).

Discussion

Future improvements to this tool can be implemented in several ways. First, the primary limitation of a co-occurrence based system is the need for lexicons for the recognition of gene and disease names. A complete lexicon is assumed, which is a very difficult requirement to satisfy, the primary reason for this being a lack of standardization in how authors refer to genes and diseases in the text.

A second limitation is the use of chunking instead of full parsing. Chunking, although allowing for ease of template use, may cause some information about the syntactic structure of a sentence, such as nested relationships, to be lost in the compression. A possible solution to this is to implement full parsing to retain more syntactic information. It was mentioned in the methods section above that full parsing is computationally intensive. Though this is true, with the speed of processing continually increasing, it may be feasible to apply full parsing to biomedical text. One full parser that can be trained is the Collins Parser written by Michael Collins (17).

Finally, another limitation relates to the templates themselves. This limitation is largely due to the training set of sentences used to help develop the templates. In this experiment, many of the sentences in the test set that were expected to be retrieved were not. This was due to a lack of templates to retrieve them, stemming from too limited a training set. It can be assumed that improving the training set through adding additional, syntactically varied, sentences would significantly improve the tool's recall.

Similar work in the automated extraction of information between objects of interest from text has been explored. Sekimizu et al. in 1998 (18) built a system to extract gene:protein interactions based on a short verb list. Their system relied on the seven most prevalent verbs in the text studied (activate, bind, interact, regulate, encode, signal, and function) to recognize relationships, and then, using statistical methods, extracted the most probable objects involved in the interaction based on recognition of noun phrases bordering the verb. Though this approach relied heavily on the short verb list, it had a precision of approximately 68% to 83% (depending on the verb used).

A slightly more complex system was designed by Thomas Rindflesch and colleagues at the National Library of Medicine in the National Institutes of Health in 1999 (19). This system concentrated on binding relationships and centered on co-occurrence of noun phrases and a specific template containing 'bind'. When tested, this system had a recall of 72% and a precision of 79%.

Another similar system, and one much closer to the project discussed in this paper, was Highlight, designed in 2000 by Thomas et al. of the Computer Science Department in collaboration with a Genome Center at SRI International in Cambridge (20). This tool's design was limited to protein:protein interactions, and its verb list was also quite limited, containing only three verb phrases: interact with, associate with, and bind to. Highlight did not use a protein lexicon; it relied on statistical methods to determine whether a noun phrase was likely to contain a protein term. When tested, this system had a recall of 58% and a precision of approximately 77%.

In 2001, Carol Friedman and colleagues at the Department of Medical Informatics at Columbia and at Queens College developed GENIES, a natural language processing system to extract molecular pathway information from journal articles (21). The basis of the system employed a preprocessor, tagger, parser, and error recovery component. In their system, the text was first tagged for genes and proteins; however, they did not use lexicons. Instead, the team created a plug-in that applied Basic Local Alignment Search Tool (BLAST) techniques, rules, and external knowledge sources to identify the objects of interest within the text. This eliminated the need for a complete lexicon of terms. Parsing was done in a more complete manner as opposed to the chunking method described earlier. This allowed complex chains of interactions to be retrieved. The GENIES project also differed from the other studies mentioned in that it employed the use of full-text articles rather than abstracts. Its verb list was also more extensive than previous works at 125. The precision for this program was 96% and the recall was 63%.
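For reference, the precision and recall figures quoted for this tool and for the systems above follow the standard information-retrieval definitions. The worked values below are a sketch that assumes the counts reported in the Results pair up as shown: 57 correct sentences among the 59 returned, out of 112 relationship-containing sentences, with 29 of 37 and 32 of 46 unique relationships retrieved. The symbols TP, FP and FN (true positives, false positives, false negatives) are introduced here for illustration only.

```latex
\[
  \text{precision} = \frac{TP}{TP + FP},
  \qquad
  \text{recall} = \frac{TP}{TP + FN}
\]
\[
  \text{precision} = \frac{57}{59} \approx 0.97,
  \qquad
  \text{recall}_{\text{sentences}} = \frac{57}{112} \approx 0.51
\]
\[
  \text{recall}_{\text{unique, 50 test sentences}} = \frac{29}{37} \approx 0.78,
  \qquad
  \text{recall}_{\text{unique, full abstracts}} = \frac{32}{46} \approx 0.70
\]
```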
Lada Adamic and her team of researchers at the HP Laboratories in Palo Alto, California, tackled the problem of finding gene-disease connections in the literature in 2002 (22). Their approach was simple in that it used the rate of co-occurrence between a disease and a gene in the text to determine a statistical probability that the two were somehow connected in biology. They propose that the connections discovered using their technique could then be fully described with the use of NLP templates in order to determine the actual nature of the connection. Their argument against the use of NLP, as in the studies described earlier, is that the computational power needed to fully parse an article is too expensive, that lexicons tend to limit the terms and associated relationships that can be retrieved, and that the processing itself requires many steps that each contribute additional sources of error and therefore produce sub-adequate results.

The primary assumption of the Adamic team was that a gene should occur with a certain frequency in the text. If that gene occurs with a greater frequency in text that is related to a certain disease, a connection between the gene and the disease is more likely. The team first gathered gene symbols from several public sources of information. They then performed a search of the MEDLINE database for the official gene acronyms and recorded whether the returned abstracts/titles contained words expressing a particular disease or gene expression pathway. Disambiguation using statistical methods helped to remove acronyms that were not gene-related. The team then focused on a particular disease and compared the rate of occurrence of co-occurring genes with both articles that mentioned the disease and those that did not. Due to the team's focus on a single disease segment of articles, recall and precision were not calculated.
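The frequency comparison at the heart of this approach can be sketched in a few lines of Perl. This is only an illustration under assumed inputs, not the Adamic team's actual implementation: the `%abstracts` toy corpus, the gene symbol, the disease keyword, and the `cooccurrence_rates` helper are all hypothetical, and a plain comparison of two rates stands in for their statistical test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy corpus: abstract id => lower-cased text.
my %abstracts = (
    1 => "ins levels were reduced in patients with diabetes",
    2 => "ren activity was elevated in hypertension",
    3 => "ins resistance is common in diabetes",
    4 => "ins secretion in healthy subjects",
    5 => "g6pd deficiency and anemia",
);

# Return the fraction of abstracts mentioning $gene, computed separately for
# abstracts that mention $disease and for abstracts that do not.
sub cooccurrence_rates {
    my ($gene, $disease, $docs) = @_;
    my ($with, $with_hits, $without, $without_hits) = (0, 0, 0, 0);
    foreach my $text (values %$docs) {
        my $has_gene = ($text =~ /\b\Q$gene\E\b/) ? 1 : 0;
        if ($text =~ /\b\Q$disease\E\b/) {
            $with++;
            $with_hits += $has_gene;
        }
        else {
            $without++;
            $without_hits += $has_gene;
        }
    }
    my $rate_with    = $with    ? $with_hits / $with       : 0;
    my $rate_without = $without ? $without_hits / $without : 0;
    return ($rate_with, $rate_without);
}

my ($in_disease, $elsewhere) = cooccurrence_rates("ins", "diabetes", \%abstracts);
printf "ins in diabetes abstracts: %.2f, elsewhere: %.2f\n", $in_disease, $elsewhere;
```

On this toy corpus the script prints `ins in diabetes abstracts: 1.00, elsewhere: 0.33`. A gene whose rate is much higher inside the disease-related abstracts than outside them is the kind of candidate connection such a method would flag for closer inspection.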
However, the Adamic team's method was able to find a number of relevant relationships that met their internal statistical criteria. Though this approach is simple and elegant, it does not take into consideration negative relationships and does not elaborate on the type of relationship being described. The use of a simple acronym lexicon also limited the group's results.

A similar simple co-occurrence approach was also recently used by Wren et al. at the Advanced Center for Genome Technology's Department of Botany and Microbiology at the University of Oklahoma in collaboration with the Department of Internal Medicine at the University of Texas Southwestern Medical Center (23). Though their objective was to find novel implicit relationships using co-occurrence, their initial approach was quite similar to the approach of the Adamic team. The Wren group first defined the relationship 'objects', such as genes and diseases, for which co-occurrence would be determined. They gathered the names/synonyms for the objects from public data sources and assembled them into lexicons. Then, the team searched the text for co-occurrences of the objects at the sentence level and scored the co-occurrences using fuzzy logic to weight the importance of the co-occurrence based on frequency. Using this co-occurrence method, a network of objects and relationships was assumed. The recall of this method was 69%.

A unique approach to retrieving gene-disease relationships, although not from biomedical text, was developed by Michael Cantor, Indra Sarkar, Olivier Bodenreider and Yves Lussier of Columbia University, Beth Israel Medical Center, and the National Institutes of Health (24). It is included here to demonstrate alternative ways to derive the information aside from Natural Language Processing. This group first selected a set of concepts related to each disease from the UMLS. For each disease concept, they then obtained related concepts from two public data sources: MRREL and MRCOC. They then obtained a subset of Gene Ontology terms related to those concepts. Due to the existence of experimental mappings of genes to Gene Ontology concepts, it was then possible to retrieve a list of genes related to the disease concept. Using this method of concept-searching, the team circumvented the need for a complete gene lexicon. However, precision and recall were widely variable (1% to 100%), depending on disease concept. Global precision was 30% and recall was 8.8% overall. Though this approach is reasonable, a current limitation to its usefulness may lie in the incompleteness of the UMLS to Gene Ontology mappings.

Conclusions

In conclusion, the method described here demonstrates the usefulness of combining simple co-occurrence with some natural language processing techniques, including chunking and template matching using controlled lexicons, for retrieving relevant gene-disease relationships from MEDLINE abstracts. The precision and recall of this tool, based on a literature review, compare favorably with previous studies. The use of this method to create an updatable database of gene-disease relationships will allow researchers interested in certain genes/processes/diseases to quickly link their topic of interest to other relevant literature-drawn information as well as make reasonable implicit relationships and networks. The resulting database will also allow for disease "themes" to be drawn from large gene lists like those generated from microarray experiments, enabling another dimension of analysis for these large datasets. With the improvements suggested above, it is believed that the use of this tool can be of great utility in the quickly growing field of bioinformatics.

Reference List

1. MEDLINE Fact Sheet. NCBI Website. 2-19-2005. National Library of Medicine - National Institutes of Health. 3-4-2005.
2. Lee, L. "I'm sorry Dave. I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. In Computer Science: Reflections on the Field, Reflections from the Field; Committee on Fundamentals of Computer Science, National Research Council, editors; Joseph Henry Press: 2004; pp. 111-118.
3. Shatkay, H.; Feldman, R. Journal of Computational Biology 2003, 10(6), 821-855.
4. Ray, S.; Craven, M. Representing Sentence Structure in Hidden Markov Models for Information Extraction; 2001.
5. Fulmer, A. and Zhao, S. Letter to Paine J, [PG Internal Communication], Nov. 15, 2004.
6. Groft, S. Genetic and Rare Diseases Information Center - Office of Rare Diseases. Office of Rare Diseases Website. 2-10-2005. National Institutes of Health. 11-12-2004.
7. Nelson, S. Medical Subject Headings. National Library of Medicine: 2004.
8. Nelson, S. Medical Subject Headings (MESH) - Fact Sheet. NCBI Website. 2-12-2004. National Library of Medicine - National Institutes of Health. 3-4-2005.
9. Groft, S. About ORD - Office of Rare Diseases. Office of Rare Diseases Website. 2-1!-2005. National Institutes of Health. 3-4-2005.
10. Entrez Pubmed. NCBI Website. National Library of Medicine - National Institutes of Health. 11-12-2004.
11. Brill, E. A Simple Rule-Based Part of Speech Tagger; 1992.
12. Smith, L.; Rindflesch, T.; Wilbur, W. J. Bioinformatics 2004, 20(14), 2320-2321.
13. Marcus, M. Penn Treebank Project. 2-2-1999.
14. Kudo, T.; Matsumoto, Y. Fast methods for kernel-based text analysis; 2003; pp. 24-31.
15. Kudo, T.; Matsumoto, Y. Chunking with Support Vector Machines. 2001. http://chasen.org/~taku/software/yamcha/
16. Pradhan, S.; Hacioglu, K.; Ward, W.; Martin, J.; Jurafsky, D. Shallow Semantic Parsing using Support Vector Machines; Boston, MA, 2004.
17. Collins, M. Computational Linguistics 2003, 29(4), 589-637.
18. Sekimizu, T.; Park, H. S.; Tsujii, J. Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts; 1998; pp. 62-71.
19. Rindflesch, T. Mining molecular binding terminology from biomedical text; 1999; pp. 127-131.
20. Thomas, J.; Milward, D.; Ouzounis, C.; Pulman, S.; Carroll, M. Automatic Extraction of Protein Interactions from Scientific Abstracts; 2000; pp. 541-552.
21. Friedman, C.; Kra, P.; Yu, H.; Krauthammer, M.; Rzhetsky, A. Bioinformatics 2001, 17 Suppl 1, S74-S82.
22. Adamic, L. A.; Wilkinson, D.; Huberman, B.; Adar, E. A Literature Based Method for Identifying Gene-Disease Connections; IEEE: Stanford, California, 2002; pp. 109-117.
23. Wren, J. D.; Bekeredjian, R.; Stewart, J. A.; Shohet, R. V.; Garner, H. R. Bioinformatics 2004, 20(3), 389-398.
24. Cantor, M. N.; Sarkar, I. N.; Bodenreider, O.; Lussier, Y. A. Genestrace: phenomic knowledge discovery via structured terminology; 2005; pp. 103-114.

Appendix A: Code

Pre-tagging (tag.pl)

#!/usr/bin/local/perl -w
#
# Script to tag occurrences of terms from a list in a sentence file with prefixes/suffixes.
#
# Jennifer Paine 2005
#
# usage: perl tag.pl sentence_file term_lexicon prefix suffix

# read in the inputs from the command line
my $sentencefile = $ARGV[0]; chomp $sentencefile;
my $termfile = $ARGV[1]; chomp $termfile;
my $prefix = $ARGV[2]; chomp $prefix;
my $suffix = $ARGV[3]; chomp $suffix;
my @terms;

# read in the term file
open TERMS, $termfile or die "Cannot read from term file: $termfile!";
while (<TERMS>) {
    chomp $_;
    my @record = split(/\t/, $_);
    my $reference = \@record;
    push (@terms, $reference);
}
close TERMS;

# go through the sentences and check to see if any terms match - if so, replace the found
# term with the identifier term and tag the term with the tags that were introduced
open SENTENCES, $sentencefile or die "Cannot read the sentences from: $sentencefile!";
while (<SENTENCES>) {
    chomp $_;
    my $sentence = $_;
    my $acronym;
    my $synonym;
    # iterate through terms
    foreach (@terms) {
        $acronym = lc($_->[1]);
        $synonym = lc($_->[2]);
        $sentence =~ s/\b$synonym\b/$prefix$acronym$suffix/ig;
    }
    # only print sentences in which at least one term was tagged
    if ($sentence =~ /$prefix.*$suffix/) {
        print "$sentence\n";
    }
}
close SENTENCES;

Conversion to ITAME Format for MedPost (makeITAME.pl)

#!/usr/bin/local/perl -w
#
# Script to change medline abstract file into ITAME format
#
# Jennifer Paine 1/7/2005
#
# usage: perl makeITAME.pl abstractfilename

while (<>) {
    chomp $_;
    my $identifier;
    my $title;
    my $abstract;
    # id, title and abstract are all present
    if (/(\d+)\t(.*)\t(.*)$/) {
        $identifier = $1;
        $title = $2;
        $abstract = $3;
        print ".I$identifier\n.T$title\n.A$abstract\n.E\n";
    }
    # only id and title are present
    elsif (/(\d+)\t(.*)$/) {
        $identifier = $1;
        $title = $2;
        print ".I$identifier\n.T$title\n.E\n";
    }
}

Get Co-occurring Sentences (get_co-occurrence.pl)

#!/usr/bin/local/perl -w
use strict;

# script to get only the sentences w/ co-occurrence in them from a
# medpost output file
# usage: perl get_co-occurrence.pl filename

my $filename = $ARGV[0]; chomp $filename;
my $prefix1 = "YYY";    # gene tag prefix
my $suffix1 = "ZZZ";    # gene tag suffix
my $prefix2 = "QQQ";    # disease tag prefix
my $suffix2 = "VVV";    # disease tag suffix

open FILE, $filename or die "Cannot read $filename!";
my $id;
while (<FILE>) {
    my $line = $_;
    chomp $line;
    # remember the most recent sentence id line
    if ($line =~ /^(P\d+.\d+)$/) {
        $id = $1;
    }
    # keep only lines that contain both a gene tag and a disease tag
    if ($line =~ /$prefix1.*$suffix1/) {
        if ($line =~ /$prefix2.*$suffix2/) {
            print "$id\n$line\n";
        }
    }
}

Format MedPost Output for Yamcha (format.pl)

#!/usr/bin/local/perl -w
#
# Script to format the output of MEDPOST so that it can be put into the Yamcha
# chunker - also exports the ids
#
# Jennifer Paine 1/7/2005
#
# usage: perl format.pl medpostfile

my $infile = $ARGV[0];
my $idfilename = $infile;
$idfilename =~ s/\..*/.ids/;

open ID, ">$idfilename" or die "Cannot open id file!";
open IN, $infile or die "Cannot read in output file!";

while (<IN>) {
    chomp;
    # id lines go to the id file
    if (/^P\d+/) {
        print ID "$_\n";
    }
    # sentence lines are written one token per line, token and tag separated
    else {
        s/\s+$//;
        s/ /\n/g;
        s/\// /g;
        #s/(\. \.)$/$1 0/g;
        #s/([\w] [\w])/$1 0/g;
        print $_, "\n\n";
    }
}

close IN;
close ID;

Reduce Chunks to Flat Format (reduce-chunk.pl)

#!/usr/bin/local/perl -w
#
# Script to reduce the chunks into 'sentences of tags' and re-assign the IDs.
#
# usage: perl reduce_chunk.pl chunkfile idfile
use strict;

my $chunkfile = $ARGV[0];
my $idfile = $ARGV[1];

# read the id file into an array
open IDFILE, $idfile or die "Cannot read the ID file: $idfile!";
my @ids = <IDFILE>;
close IDFILE;

# collapse the chunks based on the chunk identifiers and re-assign the id number
open CHUNKFILE, $chunkfile or die "Cannot read chunkfile: $chunkfile!";
my @phrase;
my @sentence;

while (<CHUNKFILE>) {
    my $tokenline = $_;
    my $token;
    my $chunkid;
    # check if chunkid
    if ($tokenline =~ /(.*)\t.*\t(.*)/) {
        $token = $1;
        $chunkid = $2;
        # if it's a beginner, add its tag and go to next
        # otherwise, add this token to the chain
        if ($chunkid =~ /^B-(\S+)/) {
            push (@phrase, $1);
            push (@sentence, "///$token");
        }
        elsif ($chunkid =~ /^O/) {
            next;
        }
        else {
            push (@sentence, $token);
            next;
        }
    }
    # if there is a blank line, assemble and print sentence and reset everything
    if ($tokenline =~ /^$/) {
        #print "\n";
        my $idnow = shift (@ids);
        chomp $idnow;
        print "$idnow\t";
        print join("-", @phrase) . "\t";
        print join(" ", @sentence) . "\n";
        @phrase = ();
        @sentence = ();
    }
}
close CHUNKFILE;

Get Relationships from Chunks and Output (check-relationships.pl)

#!/usr/bin/local/perl -w
#
# Script to use templates to get relationships from the abstracts that have been tagged,
# MedPost-tagged, Chunked, and Reduced to have the format ID\tChunkSplit\tChunks
#
# usage: perl get-Relationships.pl chunked-reduced-file verblist
use strict;

my $infile = $ARGV[0];      # chunked-reduced file (co-occurrence only)
my $verbfile = $ARGV[1];    # lexicon file with acceptable verbs
my $sentence_count = 0;

# read verbs into array
open VERBS, $verbfile or die "Cannot read verb file!";
my @verbs = <VERBS>;
close VERBS;

################ Templates ################
my @template = qw (NP-VP-NP NP-PP-NP NP-VP-PP-NP NP-ADJP-PP-NP NP-PP-NP-PP-NP
                   NP-VP-PP-NP-PP-NP NP-PP-NP-VP-PP-NP NP-NP-VP-PP-NP-PP-NP
                   NP-VP-NP-PP-NP-PP-NP NP-NP-VP-PP-NP NP-ADVP-VP-NP NP-SBAR-NP-PP-NP);
###########################################

open IN, $infile or die "Cannot read $infile!";

SENTENCE: while (<IN>) {
    my $line = $_;
    chomp $line;
    my $PMID;
    my $chunk_line;
    my $chunks_part;
    my @chunk_reps;
    my @chunks;
    my @tmp;

    ($PMID, $chunk_line, $chunks_part) = split(/\t/, $line);
    # chunk representations as an array
    @chunk_reps = split("-", $chunk_line);
    my $chunk_length = scalar(@chunk_reps);
    # chunks themselves as an array
    $chunks_part =~ s/^\/\/\///;
    @chunks = split(/\s+\/\/\//, $chunks_part);

    TEMPLATE: foreach my $template (@template) {
        my @template_pieces = split("-", $template);
        my $template_length = scalar(@template_pieces);

        # find all the match positions in the chunked sentences and return them in an array
        my @matches = &find_matches(\@chunk_reps, $chunk_length, \@template_pieces, $template_length);

        # for each match, check it for its contents
        MATCH: foreach my $match (@matches) {
            chomp $match;
            my @output = ();
            my $piece_id = 0;        # iterates the template vs the chunk
            my $disease_found = 0;   # determine if disease noun has been found
            my $gene_found = 0;      # determine if gene noun has been found

            TEMPLATE_PIECE: foreach my $piece (@template_pieces) {
                chomp $piece;
                my $matching_chunk = $chunks[$match+$piece_id];
                # if the piece is a noun phrase, see if it contains either a gene or a disease
                # and mark it if it does - otherwise, if it's a verb, check it against the list
                # and if neither, just add it to the chunk
                if ($piece =~ /NP/) {
                    if ($matching_chunk =~ /YYY.+?ZZZ/) {
                        $gene_found = 1;
                    }
                    elsif ($matching_chunk =~ /QQQ.+?VVV/) {
                        $disease_found = 1;
                    }
                    push (@output, $matching_chunk);
                    $piece_id++;
                    next TEMPLATE_PIECE;
                }
                elsif ($piece =~ /\bVP\b/) {
                    foreach my $verb (@verbs) {
                        chomp $verb;
                        if ($matching_chunk =~ /$verb/i) {
                            push (@output, "HHH$verb" . "WWW");
                            $piece_id++;
                            next TEMPLATE_PIECE;
                        }
                    }
                    # no acceptable verb in this verb phrase, so abandon this match
                    next MATCH;
                }
                else {
                    push (@output, $matching_chunk);
                    $piece_id++;
                    next TEMPLATE_PIECE;
                }
            }

            if ($disease_found == 1 && $gene_found == 1) {
                $sentence_count++;
                #print "Original Sentence $line\n";
                #print "Template: $template\n";
                #print "Match Position: $match\n";
                #print "Match: " . join("/// ", @output) . "\n\n";
                my $output = join(" ", @output);
                my $gene;
                my $disease;
                my $verb;
                if ($output =~ /YYY(.+?)ZZZ/) {
                    $gene = $1;
                }
                if ($output =~ /QQQ(.+?)VVV/) {
                    $disease = $1;
                }
                if ($output =~ /HHH(.+?)WWW/) {
                    $verb = $1;
                }
                if ($verb) {
                    print "$sentence_count\t$PMID\t($verb)\t$gene\t$disease\n";
                }
                else {
                    print "$sentence_count\t$PMID\t()\t$gene\t$disease\n";
                }
                next SENTENCE;
            }
        }
    }
}
print "Relationships: $sentence_count\n";
close IN;

## Sub Find Match Positions ##########################################################
# Subroutine to find and return the match positions that are found in the chunked
# sentences. Accepts as input: sentence and template chunks, along with their
# respective lengths.
#
###############################################################################
sub find_matches {
    my ($sentence_chunks, $chunk_length, $template_chunks, $template_length) = @_;
    my @match_locations;
    for (my $i = 0; $i < $chunk_length - $template_length + 1; $i++) {
        my $j = 0;
        while ($sentence_chunks->[$i+$j] eq $template_chunks->[$j]) {
            if ($j == $template_length - 1) {
                push(@match_locations, $i);
                $i++;
                $j = 0;
            }
            else {
                $j++;
            }
        }
    }
    return @match_locations;
}
###############################################################################

Appendix B. Training Sentences

Sentences for which templates were not effective are in bold print.

1. Episodic hypertension associated with positive ren assays after renal transplantation.
2. [Malignant diabetic keto-acidosis 3. [hematuria due 4. [The importance 5. globl 6. [Renal hemorrhage 7. [Reduced 8. Gastric 9. globl 10. [proteinuria 11. deficiency 12. increased to A'2 abnormality cancer by caused by by of 3 cases)]. ins. thalassemia_minor in a Greek woman. a presumable rise of plau activity]. and associated with hyperglycemia]. hypoglycemia. an unstable protein associated with chronic caused (apropos as the cause of N-monomethylacetamide containing ins Gun Hill: sensitization plau activity. associated with caused ins ins by caused liver in hypokalemia of the effect of hemolysis. agt electrophoretic study], of erythrocyte g6pd as a cause of jaundice [glycogenosis_type_ii (Pompe's disease) in India. associated with amyl and hyaluronidase deficiency]. 13. [A case of Sakel's after encephalopathy hypoglycemia 14. hematuria associated with globl 15. Sacroiliac gout associated with globl 16. caused by ins shock therapy, using method]. [proteinuria by caused agt its C-Harlem: E and prevention by a sickling globl variant. hypersplenism. abolishing the hypertensive response with diuretics]. 17. [Case 18. [Biosynthesis 19. anemia caused 20. Acquired 21. A
of chronic myeloleukemia with anemia caused of ins by g6pd stress caused Carswell, anemia associated with by pathological unstable globl]. burns]. a new variant. iga anti-e. new g6pd variant associated with chronic non-spherocytic negro bronchial [Action 23. Immunoglobulin A 24. [Genetic of and (B2) glycopeptides on apnea caused (iga) anemia in a fetal by bdk]. associated glomerulonephritis. hematological study persistence of [A2' haemolytic family. 22. 25. during by of a globl associated with globl associated with from Ghana suffering from hereditary beta-thalassemia and hemoglobinosis S]. family beta-thalassemia and hereditary persistence of fetal globl. 26. deficiency 27. g6pd 28. [proteinuria of serpincl Manchester: 3 activity associated with hereditary thrombosis tendency. a new variant associated with chronic nonspherocytic anemia. caused by agt blood protein clearance]. Bl 29. Juvenile hypertension 30. G6PD 31 cyanosis was . Heian, Thus ighe in 33. overproduction of ren within a renal segment. found in Japan. a g6pd variant associated with anemia apparently due by methemoglobin 32. by caused to anoxia associated with conversion of globl to acetaminophen or its metabolites. may_be synthesized within nasa Npolyps of atopic may have atopic patients The pancreatic significantly a dose-dependent acinar cell necrosis was elevated serum amyl level nonatopic patients. and was associated with a in and reduction and the polyps patients, different etiology from those in amyl activity in the pancreatic tissue. 34. Allergic is asthma by caused ighe fixed to antigen reaction with mast cells of the bronchi. 35. The ability of large doses of exogenous agt microscopic myocardial necrosis 36. Experimentally, in It mediation of would appear and patients, Thus, it . in plasma is_characterized_by excess by excess sodium with level increased dose increased was during was by measured, the of norepinephrine iga levels may potentiation (2 mug) hemorrhage. urinary ins hypoglycemia. 
by documenting that periods of that iga levels of whole saliva and serum are_elevated that salivary prove useful in distinguishing in and oral cancer patients with disease. in the treatment of These findings indicated that One one the other appears that treatment of non-liquefaction of semen with amyl may_be a useful aid 41 confirmed. level the hypoglycemia was confirmed possible recurrent 40. and to sympathetic stimulation and a high total extractable ins 39. form) which the plasma ren occurred at the time that the ren ins hypertension; (vasoconstrictor additional experiments of responses 38. has been (volume form). reduced ren In to cause widespread multifocal rabbit there are two models of ren with reduced sodium 37. in the "nonresponder" Abnormal 43. The probably renal vein (serpincl serpincl results agt levels hypertension in the caused of plasma ren activity "responders," suggestive of hypertension. angiotensinogenic 42. had infertility. imply that "Budapest") methylprednisolone as a cause of a familial thrombophilia. hypertension in the rat may_be in part agt dependent. 44. Elevation of serum iga associated with depression of cell mediated immunity may_be characteristic of patients with nasopharyngeal_carcinoma. 45. Thesefindings reconfirm diabetesmellitus that the earliest clinically ischaracterizedby an recognizable state impaired initial ins secretory of response to glycemic stimulus. 46. These observations suggest that the due to an increased hypocalcemia 47. development calca, but instead may_be associated with a Biochemically, of eln and secretion of the lesions of medial increased of hypocalcemia they diminished suggest at parturition B2 in not prepartal secretion of calca. sclerosis were_associated with amounts of collagen is that parturient arterial walls. decreased amounts 48. Large amounts of Wilms' with 49. Constriction persistent In of one renal hypertension decrease in 50. 
circulating ren apparently artery in the is often which serum potassium and comparison with the thyrotoxicosis and level big can cause hypertension in patients tumor. healthy, displayed of protein-bound presence of the opposite associated with increase in ren activity water was iodine, can produce plasma ren activity, intake. increased in a positive correlation with tachycardia and the kidney increase in patients with the severity degree of loss of the disease the of weight.

Appendix C. Templates

Template                  Returned Sentences (of 50 Test)
NP-VP-NP                  5
NP-PP-NP                  3
NP-VP-PP-NP               15
NP-ADJP-PP-NP             0
NP-PP-NP-PP-NP            4
NP-VP-PP-NP-PP-NP         5
NP-PP-NP-VP-PP-NP         0
NP-NP-VP-PP-NP-PP-NP      1
NP-VP-NP-PP-NP-PP-NP      1
NP-NP-VP-PP-NP            0
NP-ADVP-VP-NP             0
NP-SBAR-NP-PP-NP          0

Appendix D. Additional Sentences Found in Test Set

1. P01144426A01 The role of the ren-agt system in maintenance of arterial pressure following hemorrhage was studied in the conscious dogs.
2. P01156038T01 Observations of the role of body fluid volumes and plasma ren activity in the management of hypertension.
3. P01175013A04 In contrast a 25 % reduction in latent period for tumor appearance was brough about by ngfb in BD-IX rats receiving 90 mug SYM g ENU.
4. P01237795T01 globl components in diabetes-mellitus studies in identical twins.
5. P00810900T01 Plasma ren activity in oedematous and marasmic children with protein energy malnutrition.
6. P01239060T01 acute-renal-failure in the rabbits after infusion of globl solutions with without red cell ghosts author 's transl.
7. P01255923T01 Plasma ren activity in acute acidosis.
8. P01255923A01 Plasma ren activity in acute acidosis the effect of hexamethonium bromide was studied.
9. P01255923A09 These findings may suggest that the elevation of plasma ren activity in acute acidosis induced by carbon dioxide inhalation is independent from sympathetic stimulation.
10. P00769527A15 The results support the concept that the ren-agt-aldo-sterone system may be involved in primary hypertension.
11. P00057338T01 ren SYM agt system in hypertension after traumatic renal-artery thrombosis.
12. P00057338A07 The delayed onset of hypertension despite early activation of the ren SYM agt SYM aldosterone axis accords with the course of events observed in experimentally induced hypertension in rats suggests that several weeks even months are required for hypertension to develop after sudden renal-artery occlusion in man.
13. P01084811A03 Increased ighe levels occurred in patients with liver-diseases in the absence of eosinophilia clinical evidence of atopy other known causes of ighe elevation.
14. P00939701T01 ren-agt system in an infant with malignant hypertension.
15. P00939701A02 Unilateral nephrectomy was followed by resolution of the hypertension and normalization of the activity of the ren-agt system.
16. P00956371A16 Finally PTH does not appear to be an ins release antagonist has no apparent effect on glucose- or tolbutamide-stimulated ins in animals with dietary-induced hyperparathyroidism.
17. P00786083T01 ighe antibodies in bronchopulmonary aspergillosis.
18. P00983079A02 The dangers of agt in triggering off acute-renal-failure are illustrated by a case report in which this drug was administered to a comatose patient with hypovolaemic hypotension following barbiturate self-poisoning.
19. P01008056A01 Two new variants of g6pd ( G6PD ) deficiency associated with chronic nonspherocytic anemia were discovered in Japan.
20. P00012846A06 Although increased sensitivity to inhibition by NADPH has been postulated to decrease intracellular enzyme activity resulting in enhanced susceptibility to hemolysis in certain g6pd variants with only moderately decreased enzyme activity an alternative mechanism of hemolysis possibly enzymatic thermolability exists in g6pd Long Prairie.
21. P00833253A01 A 45-year-old non-obesity female patient with no previous history of ins administration was found to have extreme insulin-resistance abnormally high plasma immunoreactive ins in the absence of anti-ins antibodies in the serum.
22. P01013696T01 [ Effect of the agt antagonist saralasin , l-sar-8-ala-agt II on the blood pressure in secondary hypertension.
23. P00837134T01 agt II in essential-hypertension.
24. P00404320A02 Patients with common variable hypogammaglobulinemia , thymoma and hypogammaglobulinemia ataxia telangiectasia selective iga-deficiency had significantly decreased mean serum ighe concentrations.
25. P00404320A05 Patients with the wiskott-aldrich-syndrome had a significantly elevated mean serum ighe concentration.
26. P00404320A13 Finally patients with protein-losing enteropathy familial hypercatabolic hypoproteinemia had normal ighe concentrations associated with normal ighe metabolic parameters.

Relationships retrieved are marked with "X" in last column.

DISEASE                          GENE           FOUND BY TOOL
acanthosis-nigricans             ins            X
Acidosis                         ren            X
acute-renal-failure              globl          X
acute-renal-failure              agt            X
Anemia                           g6pd           X
Arrhythmia                       bdk            X
Aspergillosis                    ighe           X
Bradycardia                      bdk            X
Carcinoma                        ins            X
dental-caries                    iga            X
diabetes-mellitus                globl          X
diabetes-mellitus                ins            X
glucose-intolerance              ins            X
Hemoglobinuria                   globl          X
Hemorrhage                       ren            X
hodgkins-disease                 ighe           X
Hypertension                     ren            X
Hypertension                     agt            X
Hyperthyroidism                  ins            X
Hypocalcemia                     ins            X
Hypoglycemia                     ins            X
Hypophosphatemia                 ins            X
Hypoproteinemia                  ighe           X
Hypothyroidism                   ins            X
iga-deficiency                   ighe           X
Ischemia                         globl          X
Kwashiorkor                      ren            X
liver-diseases                   ighe           X
Malnutrition                     ren            X
salt-wasting-syndrome            ren            X
Tumor                            ngfb           X
wiskott-aldrich-syndrome         ighe           X
Bacteriuria                      ins
chronic-lymphocytic-leukemia     ighe
Dermatitis                       ighe
Hemolysis                        g6pd
Hemorrhage                       agt
Hypercalcemia                    ins
Hyperglycemia                    somatostatin
insulin-resistance               ins
Lipodystrophy                    ins
multiple-sclerosis               ighe
Pancreatitis                     amyl
Pyelonephritis                   ren
Sarcoidosis                      agt
Vasculitis                       ighe