ENVIRONMENTS
Identification of environment descriptive terms in text

Evangelos Pafilis
Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC),
Hellenic Centre for Marine Research (HCMR), Heraklion, Crete, Greece
http://epafilis.info

GSC15 – NIH – 23rd Apr 2013

Literature Mining
Processing text to extract facts of interest

terrestrial, aquatic, lagoon, coral reef, marine sediments, freshwater, soil

Sequence records (e.g. isolation source field)
Literature (e.g. PubMed abstracts)
Collections of species descriptions (e.g. “Biology Description”, “Ecology”, “Habitat”)

Large Scale Analysis
Image from http://notallbits.files.wordpress.com

Large Scale Analysis: SPEED
Image from http://notallbits.files.wordpress.com

ENVIRONMENTS
http://environments.hcmr.gr
● Dictionary based
● Environment Ontology
● Fast performance
  ● 4000 PubMed abstracts / second *
● Based on the SPECIES name recognition tagger (minor revision, PLOS ONE)
*: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM to evaluate SPECIES performance (processing a set of 536,052 abstracts)

EnvO: source of environment descriptor names and synonyms
http://environmentontology.org
~1600 terms
Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany

ENVIRONMENTS – Improving Accuracy
● Increasing matches in text
  ● Orthographic variation supported, e.g.
    freshwater, fresh water, and fresh-water
  ● Case-insensitive matching
  ● Synonym selection algorithm in case of ambiguity
  ● Synonym generation to reflect the way environment descriptive terms are mentioned in text (EnvO specific)

    Action | Example
    Add a variant in which non-informative words have been removed | epipelagic zone → epipelagic; estuarine biome → estuarine
    Pluralization | sediment → sediments

● Preventing overmatching (i.e. avoiding additional false positives)
  ● “Stopword list” (e.g. spring, well, range)

Scope
Not included: species, tissues, foods

Hands-on workshop
From signals to environmentally tagged sequences
Annotate microbial sequences based on environmental terms in the related literature
Hands-on workshops (hackathons):
  27-29 Sept 2012, HCMR
  coming: 10-13 June 2013, HCMR
Dr. C. Quince, Dr. U. Ijaz (Uni of Glasgow)
Dr. R. Kottman, J. Schnetzer (MM-MPI), Dr. P. Buttigieg (AWI)
Dr. A. Oulas, C. Pavloudi (HCMR), S. Berger (HITS)
ACTION ES1103

Approach

ENVIRONMENTS and Encyclopedia of Life
• Encyclopedia of Life (EOL) - Rubenstein Competition 2013
  http://www.eol.org
  • Over 3 million taxa
  • “One-stop-shop”
  • “Habitat” and “Ecology”
  • “Taxon Biology” and “Distribution”
• “ENVIRONMENTS: Discovering habitat terms in EOL Contents”
  March 2013 – October 2013
  http://environments-eol.blogspot.gr/
• Species – environmental context associations derived from mining the EOL's species pages, responding to the call wishes of John Horne (coral reef associated marine invertebrates in the Indo-Pacific) and Mark Steer (support of student/citizen science projects)

Next Steps
● ENVIRONMENTS
  ● Accuracy benchmark
  ● ENVIRONMENTS corpus (based on EOL)
● From signals to environmentally tagged sequences
  ● Performance optimization (e.g.
    parallelization of sequence analysis and relevant literature retrieval)
  ● Dedicated service
  ● Result visualization and exploration
  ● Experiment with more sequence types

Next Steps
Co-mention based analysis
Based on processing PubMed, Apr 2013

Summary
Dictionary-based approach: environment descriptive term identification
  EnvO as the name and synonym source
Large scale: from single document to literature repositories
  Fast performing software
From signals to environmentally tagged sequences
  Annotate microbial sequences with environment descriptive terms from related literature
ENVIRONMENTS and EOL
  Processing the EOL taxon pages to extract descriptions of their environmental context

Digging-out Information
Photo by Dr Chatzinikolaou E
http://hartpurylrc.files.wordpress.com

Acknowledgements
Thank You!
HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
Uni Glasgow: Christopher Quince, Umer Ijaz
MM-MPI: Dr. R. Kottman, J. Schnetzer
AWI: Dr. P. Buttigieg
HITS: S. Berger
Funding:
  EU FP7 REGPOT MARBIGEN (grant no: 264089)
  Novo Nordisk Foundation Center for Protein Research
Amvrakikos Lagoons, May 2011
ACTION ES1103
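
Appendix: illustrative sketch of dictionary-based tagging

The steps listed under "ENVIRONMENTS – Improving Accuracy" (orthographic variation, case-insensitive matching, synonym generation, stopword list) can be illustrated with a small Python sketch. This is not the ENVIRONMENTS tagger itself. The term identifiers (T1 to T5), the set of "non-informative" words, the naive "add s" pluralization, and the function names are all assumptions made for illustration; only the example terms and stopwords come from the slides. Synonym selection in case of ambiguity is not covered here.

```python
import re

# Hypothetical mini-dictionary; in practice names and synonyms would come from EnvO.
# The identifiers T1..T5 are placeholders, not EnvO identifiers.
TERMS = {
    "T1": ["fresh water"],
    "T2": ["epipelagic zone"],
    "T3": ["estuarine biome"],
    "T4": ["sediment"],
    "T5": ["spring"],
}
NON_INFORMATIVE = {"zone", "biome"}      # assumed list of non-informative words
STOPWORDS = {"spring", "well", "range"}  # examples given on the slide


def normalize(text):
    """Lowercase and drop spaces/hyphens so orthographic variants share one key."""
    return re.sub(r"[\s-]+", "", text.lower())


def variants(name):
    """Generate synonyms: non-informative words removed, plus naive plurals."""
    out = {name}
    words = name.split()
    trimmed = [w for w in words if w not in NON_INFORMATIVE]
    if trimmed and len(trimmed) < len(words):
        out.add(" ".join(trimmed))
    out |= {v + "s" for v in out if not v.endswith("s")}
    return out


def build_tagger(terms):
    """Return (lookup index, compiled regex) for all non-stopword variants."""
    index, patterns = {}, set()
    for term_id, names in terms.items():
        for name in names:
            name = name.lower()
            if name in STOPWORDS:  # prevent overmatching on ambiguous words
                continue
            for v in variants(name):
                index.setdefault(normalize(v), term_id)
                # Allow a space, a hyphen, or nothing between the words.
                patterns.add(r"[\s-]?".join(re.escape(w) for w in v.split()))
    # Longest patterns first so that e.g. "sediments" wins over "sediment".
    alternation = "|".join(sorted(patterns, key=len, reverse=True))
    return index, re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def tag(text, index, regex):
    """Return (start, end, matched text, term id) for every dictionary hit."""
    return [
        (m.start(), m.end(), m.group(0), index[normalize(m.group(0))])
        for m in regex.finditer(text)
    ]


index, regex = build_tagger(TERMS)
for hit in tag("Fresh-water sediments and the epipelagic were sampled in spring.",
               index, regex):
    print(hit)
```

In this sketch "Fresh-water", "sediments" and "epipelagic" are tagged, while "spring" is skipped because it is on the stopword list. Compiling all variants into a single alternation keeps the example short; a tagger aiming at thousands of abstracts per second would likely need a more specialized matching structure.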