ENVIRONMENTS
Identification of environment descriptive terms in text
Evangelos Pafilis
Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC),
Hellenic Centre for Marine Research (HCMR), Heraklion Crete, Greece
http://epafilis.info
GSC15 – NIH – 23rd Apr 2013
Literature Mining
Processing text to extract facts of interest
terrestrial, aquatic, lagoon,
coral reef, marine sediments,
freshwater, soil
Sequence records (e.g. isolation source field)
Literature (e.g. PubMed abstracts)
Collections of Species descriptions
(e.g. “Biology Description”, “Ecology”, “Habitat”)
Large Scale Analysis
SPEED
Image from http://notallbits.files.wordpress.com
ENVIRONMENTS
http://environments.hcmr.gr
● Dictionary based
● Environment Ontology
● Fast performance
  ● 4000 PubMed abstracts / second *
● Based on the SPECIES name recognition tagger (minor revision, PLOS ONE)
*: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM, used to
evaluate SPECIES performance (processing a set of 536,052 abstracts)
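As a rough illustration of what "dictionary based" means here, the sketch below looks up word windows in a small in-memory dictionary, longest window first. It is not the SPECIES or ENVIRONMENTS implementation: the `DICTIONARY` entries, the placeholder identifiers, and the `tag` function are all assumptions made for this example.

```python
import re

# Hypothetical mini-dictionary: surface form -> identifier.
# The identifiers are placeholders, not real EnvO accessions.
DICTIONARY = {
    "coral reef": "TERM:coral_reef",
    "marine sediment": "TERM:marine_sediment",
    "sediment": "TERM:sediment",
    "lagoon": "TERM:lagoon",
    "soil": "TERM:soil",
}

# Longest dictionary name, in words; bounds the window size tried below.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)


def tag(text):
    """Return (start, end, matched_text, identifier) for each dictionary hit."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest window first so "marine sediment" wins over "sediment".
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = " ".join(text[s:e].lower() for s, e in tokens[i:i + n])
            if candidate in DICTIONARY:
                hits.append((start, end, text[start:end], DICTIONARY[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits
```

Each window costs one hash lookup, which is one reason a dictionary approach can scale to large collections; the throughput quoted above refers to the actual tagger, not to this sketch.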
EnvO: source of environment descriptor
names and synonyms
http://environmentontology.org
~1600 terms
Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany
ENVIRONMENTS – Improving Accuracy
● Increasing matches in text
  ● Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
  ● Case-insensitive matching
  ● Synonym selection algorithm in case of ambiguity
  ● Synonym generation to reflect the way environment descriptive terms are mentioned in text (EnvO specific)
    Action: add a variant in which non-informative words have been removed
    Example: epipelagic zone → epipelagic, estuarine biome → estuarine
    Action: pluralization
    Example: sediment → sediments
● Preventing overmatching (i.e. avoiding increased FP)
  ● “stopword list” (e.g. spring, well, range)
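The steps on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's code: the `NON_INFORMATIVE` and `STOPWORDS` sets contain only the examples from the slide, the pluralization rule is deliberately naive, and the function names `variants` and `match_key` are hypothetical.

```python
import re

# Assumed lists for illustration; the tool's actual lists are not given here.
NON_INFORMATIVE = {"zone", "biome"}
STOPWORDS = {"spring", "well", "range"}


def variants(name):
    """Generate lower-cased surface variants for one dictionary name."""
    base = name.lower()
    forms = {base}
    # Naive pluralization of the full name, e.g. "sediment" -> "sediments".
    if not base.endswith("s"):
        forms.add(base + "s")
    # Variant without non-informative words, e.g. "epipelagic zone" -> "epipelagic".
    words = base.split()
    kept = [w for w in words if w not in NON_INFORMATIVE]
    if kept and kept != words:
        forms.add(" ".join(kept))
    # Drop variants on the stopword list to limit false positives.
    return {f for f in forms if f not in STOPWORDS}


def match_key(text):
    """Case- and spacing-insensitive key: hyphens and whitespace are ignored."""
    return re.sub(r"[-\s]+", "", text.lower())
```

With these definitions, `variants("epipelagic zone")` yields `epipelagic zone`, `epipelagic zones`, and `epipelagic`, while `match_key` maps `freshwater`, `fresh water`, and `Fresh-water` to the same key. A real system would need a more careful plural rule and a curated stopword list.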
Scope
Not included:
species
tissues
foods
Hands-on workshop
From signals to environmentally tagged sequences
Annotate microbial sequences based on environmental terms in the related literature
Hands-on workshops (hackathons): 27-29 Sept 2012, HCMR
Coming: 10-13 June 2013, HCMR
Dr. C. Quince, Dr. U. Ijaz (Uni of Glasgow)
Dr. R. Kottmann, J. Schnetzer (MM-MPI), Dr. P. Buttigieg (AWI)
Dr. A. Oulas, C. Pavloudi (HCMR), S. Berger (HITS)
ACTION ES1103
Approach
ENVIRONMENTS and Encyclopedia of Life
• Encyclopedia of Life (EOL) - Rubenstein Competition 2013
  http://www.eol.org
  • Over 3 million taxa
  • “one-stop-shop”
  • “Habitat” and “Ecology”
  • “Taxon Biology” and “Distribution”
• “ENVIRONMENTS: Discovering habitat terms in EOL Contents”
  March 2013 – October 2013, http://environments-eol.blogspot.gr/
• Species – environmental context associations derived from mining EOL's species pages,
  in response to the call wishes by John Horne (coral reef associated marine invertebrates
  in the Indo-Pacific) and Mark Steer (support of student/citizen science projects)
Next Steps
● ENVIRONMENTS
  ● Accuracy benchmark
  ● ENVIRONMENTS corpus (based on EOL)
● From signals to environmentally tagged sequences
  ● Performance optimization
    (e.g. parallelization of sequence analysis and relevant literature retrieval)
  ● Dedicated service
  ● Result visualization and exploration
  ● Experiment with more sequence types
Next Steps

Co-mention based analysis
Based on processing PubMed, Apr 2013
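The slide does not say how co-mentions are scored, so the following is only a minimal sketch of the simplest variant: counting, for each (species, environment) pair, the number of documents that mention both. The function name `co_mentions` and its input shape (one pair of tag sets per abstract, produced by taggers run beforehand) are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def co_mentions(documents):
    """Count, per (species, environment) pair, the documents mentioning both.

    `documents` is an iterable of (species_set, environment_set) pairs,
    i.e. tagger output per abstract (assumed given, not computed here).
    """
    counts = Counter()
    for species, environments in documents:
        # Sets guarantee each pair is counted at most once per document.
        counts.update(product(sorted(species), sorted(environments)))
    return counts
```

Raw counts favour frequently mentioned terms; a production analysis would typically normalize them against how often each term occurs on its own.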
Summary
● Dictionary-based approach: environment descriptive term identification
● EnvO as the name and synonym source
● Large scale: from single document to literature repositories
  ● Fast performing software
● From signals to environmentally tagged sequences
  ● Annotate microbial sequences with environment descriptive terms from related literature
● ENVIRONMENTS and EOL
  ● Processing the EOL taxon pages to extract descriptions of their environmental context
Digging-out Information
Photo by Dr Chatzinikolaou E
http://hartpurylrc.files.wordpress.com
Acknowledgements
Thank You!
HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou
Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
Uni Glasgow: Christopher Quince, Umer Ijaz
MM-MPI: Dr. R. Kottmann, J. Schnetzer
AWI: Dr P. Buttigieg, HITS: S. Berger
Funding: EU FP7 REGPOT MARBIGEN (grant no: 264089)
Novo Nordisk Foundation Center for Protein Research
Amvrakikos Lagoons, May 2011
ACTION ES1103