Chemical Information Mining: Possibilities and Pitfalls

CICAG: “Scientific Text
and Data Mining”
Burlington House
20 May 2009
Neil Stutchbury
Topics
Challenges in finding scientific terms in text
and converting chemical names to structures
Results of a user survey on requirements and
perceptions
Reflections on the future possibilities of
information mining
Acknowledgements
Royal Society of Chemistry
David James
Richard Kidd
Colin Batchelor
Unilever Centre for Molecular Science
Informatics, University of Cambridge
Prof Bobby Glen
Peter Murray-Rust
The Challenge
Named Entity Recognition (NER)
Identifying chemical terms in text,
correctly disambiguating them and
classifying them
Name to Structure Conversion (N2S)
Converting trivial, trade, IUPAC names to
chemical structures
NER Classification*
Chemicals (CM)
Exact, Class, Part
Reactions (RN)
Chemical Adjectives (CJ)
Chemical Prefixes (CPR)
Enzymes (ASE)
* Peter Corbett, Colin Batchelor and Simone Teufel (2007), Annotation of chemical named entities, BioNLP 2007:
Biological, translational, and clinical language processing, Prague, Czech Republic, pp. 57--64.
Chemicals (CM)
CM:Exact
CM:Class
Pyridine was methylated…
The substituted acetylene, 1,3-bis-(tert-butyldimethylsilanyloxy)-5-ethynylbenzene, can be derived…
The pyridine, 2, was treated with…
A retrosynthetic analysis for (±)-moracins O and P …
CM:Part
The nitrogen atom in a pyridine ring
The ethyl group
Example
During the course of our studies of thermal decarboxylative Claisen
rearrangement (dCr) reactions of allylic α-tosylesters, 3–9 we have
developed dCr reactions of allylic tosylmalonate esters 1. These
substrates may be converted into the decarboxylated
rearrangement products at room temperature using both the
KOAc–N,O-bis(trimethylsilyl)acetamide (BSA) reagent system
developed initially,3,7 and modified conditions involving treatment
of 1 with DBU and tert-butyldimethylsilyl triflate (TBDMSOTf).4
Further experimentation showed that malonate substrates 2
possessing two allylic ester groups underwent highly regioselective,
room-temperature mono-dCr reactions, in which the allylic group
possessing the more electron-rich R substituent rearranged
preferentially9 (Scheme 1).
Annotation key: Reaction, CM:Exact, CM:Class, CM:Part
NER Challenges (1/2)
Abbreviations
THF (tetrahydrofuran or tetrahydrofolate?)
The cyclic phosphate moiety of 8-nitro-cGMP
Molecular formulae
CH3COOCH2CH3 or AcOEt
Cp2Zr(py)(Me3SiC≡CSiMe3)
Part of speech ambiguity: “tosylates” - Noun or Verb?
NER Challenges (2/2)
Spaces
“Sodium metabisulphite” is marked up as “sodium” and “metabisulphite”
Ambiguous words
“Ion spray voltage”
Ambiguous elements
“In this paper…”
“Bifunctional catalysts… lead to…”
“The lead compound, 1, …”
“The author Y. Li”
“240V” (voltage, or Pyraclofos, an insecticide?)
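The ambiguity examples above can be reproduced with a deliberately naive matcher. The following is a minimal sketch, not any vendor's algorithm: the function name `naive_element_hits` and the small token set are illustrative assumptions. It shows how context-free matching of element symbols and names flags ordinary English words.

```python
import re

# Illustrative subset of element symbols and names (an assumption for this sketch);
# a real annotator would cover the whole periodic table and use context.
ELEMENT_TOKENS = {"In", "Y", "Li", "V", "lead", "sodium"}


def naive_element_hits(sentence):
    """Return every alphabetic token that equals an element symbol or name.

    No context is used, so "In this paper" is treated as indium.
    """
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t in ELEMENT_TOKENS]


# Fragments quoted on the slide.
examples = [
    "In this paper",
    "Bifunctional catalysts lead to",
    "The lead compound, 1,",
    "The author Y. Li",
    "240V",
]

for text in examples:
    print(f"{text!r} -> {naive_element_hits(text)}")
```

Every fragment produces at least one false positive, which is why disambiguation needs the surrounding context (part of speech, neighbouring words, units) and cannot rely on a dictionary alone.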
Name to Structure
Assumes correct classification of Chemical
(CM:Exact, Class or Part)
Trade names
Common names
Natural products
Abbreviations
Molecular formulae
IUPAC names
N2S Challenges (1/2)
26% of chemical names in the literature are unacceptable and
cannot be converted into structures (GA Eller, Molecules 2006,
11, 915-928)
Ambiguous names
“Diaminocyclohexane” (1,2 or 1,3; stereochemistry?)
Spelling mistakes
Mehtyl
Numbered compounds
“…malonate substrates 2”
“The absolute configuration of 3g…”, 3g: R= 4-Cl-Ph
Split names
“3-Fluoro- and 3-chloro-substituted pyridine”
See: http://www.acdlabs.com/download/app/name/quality_in_patents.pdf
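One way to handle the spelling-mistake case is fuzzy matching against a vocabulary of known name fragments. The sketch below uses Python's standard `difflib`; the fragment list, the function name `suggest_fragment` and the 0.8 cutoff are assumptions chosen for illustration, not part of any tool mentioned here.

```python
import difflib

# Hypothetical mini-vocabulary of name fragments; a real converter would use
# a full nomenclature lexicon.
KNOWN_FRAGMENTS = ["methyl", "ethyl", "propyl", "butyl", "phenyl"]


def suggest_fragment(token, cutoff=0.8):
    """Return the closest known fragment, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_FRAGMENTS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(suggest_fragment("Mehtyl"))
print(suggest_fragment("qqqq"))
```

A high cutoff matters in chemistry because many valid fragments differ by one letter (methyl and ethyl, for example), so an overly tolerant corrector would silently change the structure.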
N2S Challenges (2/2)
Complex Naming Conventions:
Acetone
dimethyl ketone
Propanone
2-propanone
propan-2-one
(CH3)2C=O
1,6-diisocyanatohexane
Hexane-1,6-diyl diisocyanate
1,6-hexanediisocyanate
1,6-hexanediyl diisocyanate
1,6-Hexamethylene diisocyanate
1,6-Hexylene diisocyanate
Hexamethylene diisocyanate
OCN(CH2)6NCO
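The point of the two synonym lists is that many names must resolve to one structure. A minimal sketch of that idea is a normalised lookup table, shown below. The SMILES strings are the standard ones for acetone and hexamethylene diisocyanate and are not taken from the slide; the names `NAME_GROUPS`, `normalise` and `name_to_smiles` are illustrative. Real name-to-structure tools parse nomenclature grammatically and do not rely on a dictionary alone.

```python
# Synonyms from the slide, grouped under one SMILES string per compound.
NAME_GROUPS = {
    "CC(C)=O": [
        "Acetone",
        "dimethyl ketone",
        "Propanone",
        "2-propanone",
        "propan-2-one",
        "(CH3)2C=O",
    ],
    "O=C=NCCCCCCN=C=O": [
        "1,6-diisocyanatohexane",
        "Hexane-1,6-diyl diisocyanate",
        "1,6-hexanediisocyanate",
        "1,6-hexanediyl diisocyanate",
        "1,6-Hexamethylene diisocyanate",
        "1,6-Hexylene diisocyanate",
        "Hexamethylene diisocyanate",
    ],
}


def normalise(name):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


# Flat lookup: normalised synonym -> SMILES.
LOOKUP = {
    normalise(name): smiles
    for smiles, names in NAME_GROUPS.items()
    for name in names
}


def name_to_smiles(name):
    """Return the SMILES for a listed synonym, or None if the name is unknown."""
    return LOOKUP.get(normalise(name))


print(name_to_smiles("Propan-2-one"))
print(name_to_smiles("hexamethylene  diisocyanate"))
print(name_to_smiles("Diaminocyclohexane"))
```

The last call returns `None`: an ambiguous name such as “diaminocyclohexane” has no single structure, so a lookup should refuse to guess.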
NER and N2S Products
Vendor | Product | NER | N2S
Univ Cambridge | OSCAR (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007) | Yes | Yes
ChemAxon | Chemicalize.org | No | Yes
InfoChem | Annotator and ICN2S | Yes | Yes
ChemMantis | SureChem (NER) with ACD/Name; ACD/NTS Batch (N2S) | Yes | Yes
CambridgeSoft | Name=Struct | No | Yes
OpenEye | LexiChem TK | No | Yes
Accelrys | ChemMining | Yes | No
TEMIS/MDL | Chemical Entity Relationships Skill Cartridge | Yes | Yes
MPirics | Chemical Content Recognition | Yes | Yes
Survey
Requirements for chemical information mining tools
How well are those requirements met?
Perceptions of the vendor landscape
Sent to RSC CICAG and Molecular Modelling Group,
PRISM, PRIME, and posted on Chemical Information
Sources Discussion Forum
65 responses received, from Pharma, Academia,
Vendors, and Publishers
Importance and Effectiveness of
Searching Different Sources
Q1a: Importance of searching different sources
[Bar chart: number of responses (scale 0 to 45) rating each source as Critical requirement, Important, Nice to have or Not required]
Sources:
In-house document and report collections
Published scientific literature (full articles)
Patent literature
Web-based content
Chemical supplier catalogues
University theses
Q1b: How well can you search different sources
[Bar chart: number of responses (scale 0 to 60) rating each source as Fully delivered, Partially delivered or Not possible to do it]
Sources:
In-house document and report collections
Published scientific literature (full articles)
Patent literature
Web-based content
Chemical supplier catalogues
University theses
Requirements: Importance
and How Well Met
Q2a: Importance of user requirements
[Bar chart: number of responses (scale 0 to 45) rating each requirement as Critical requirement, Important, Nice to have or Not required]
Requirements:
Search documents by chemical structure or substructure
Search documents by chemical similarity
Search documents containing Markush structures and correctly return hits on specific exemplars of the general structure
Search documents for chemical reactions (by name or structure)
Search documents for chemical or biological name (even if it is known by multiple synonyms or is an ambiguous abbreviation)
Search for biological entities and properties as well as chemical terms
Grouping search results by common concept (eg grouping documents, which have a common substructure, by disease)
Search for citations
Ability to hover over a chemical name in a document and see its structure pop up
Ability to abstract synthetic routes automatically from in-house chemistry reports to create an in-house chemical…
Q2b: How well are the user requirements met?
[Bar chart: number of responses (scale 0 to 60) rating each requirement as Fully delivered, Partially delivered or Not possible to do it]
Requirements:
Search documents by chemical structure or substructure
Search documents by chemical similarity
Search documents containing Markush structures and correctly return hits on specific exemplars of the general structure
Search documents for chemical reactions (by name or structure)
Search documents for chemical or biological name (even if it is known by multiple synonyms or is an ambiguous abbreviation)
Search for biological entities and properties as well as chemical terms
Grouping search results by common concept (eg grouping documents, which have a common substructure, by disease)
Search for citations
Ability to hover over a chemical name in a document and see its structure pop up
Ability to abstract synthetic routes automatically from in-house chemistry reports to create an in-house chemical…
Current Deployments
24% of respondents said they had
deployed chemical information mining
tools in their organisation
28% had not yet done so,
but had plans
46% had no plans to do so
Overall Perceptions
Q10: Overall how well are mining needs met today?
[Bar chart: number of responses (scale 0 to 40) for each answer]
Answers:
Not well met at all; need better solutions
Partially met, but there are gaps
Some products meet most requirements
Plenty of choice of good solutions
Don't know
Vendor Landscape
83% of respondents felt there was plenty of room for
further innovation in this area
Opportunity exists to link chemical information
mining tools with other systems, such as eLN,
chemical inventories and biological databases
No obviously dominant vendor: MDL/Symyx,
CambridgeSoft, ACD/Labs, CAS/SciFinder, OpenEye
and ChemAxon all cited
61% said this technology would continue to be
delivered through niche vendors;
39% said enterprise search vendors would pick it up
Reflections on Text Mining
Automating the mark up process
Annotating at the authoring stage
Integrating chemistry with biology
Measuring Quality of Mark Up
Quality of mark up can be measured by precision and recall
Precision (P) = TP/(TP+FP)
Recall (R) = TP/(TP+FN)
F1 = 2PR/(P+R)
[Venn diagram showing the regions TP, FP, FN and TN]
Blue area is all the correct chemical names in document
Green area is all the text in document which is not chemical names
Oval area is the names found by each algorithm
TP = True Positives, ie words correctly assigned as chemical names
FP = False Positives, ie normal English words incorrectly assigned chemical names
FN = False Negatives, ie words that should have been assigned as chemical names but were missed
TN = True Negatives, ie normal English words that should not have been assigned as chemical names and weren’t
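The quality measures can be computed directly from the counts. The sketch below uses the standard definitions: precision is the share of found names that are correct, recall is the share of correct names that were found, and F1 is their harmonic mean. The function names and the small example sets are illustrative assumptions, not survey or test data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def score_markup(gold_names, found_names):
    """Score an annotator's output against a manually checked set of names."""
    gold, found = set(gold_names), set(found_names)
    tp = len(gold & found)   # correctly found names
    fp = len(found - gold)   # ordinary words wrongly marked up
    fn = len(gold - found)   # chemical names that were missed
    return precision_recall_f1(tp, fp, fn)


# Made-up example: four correct names, the annotator finds two of them
# plus one false hit ("lead").
gold = {"pyridine", "DBU", "BSA", "KOAc"}
found = {"pyridine", "DBU", "lead"}
p, r, f1 = score_markup(gold, found)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

True negatives do not appear in any of the three formulae, which is convenient: in running text the number of ordinary words is very large and would otherwise swamp the result.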
Automating Mark Up
Tested OSCAR, Chemicalize, ChemMantis and
Name=Struct on:
Eight RSC articles (223 structures)
Four compound lists totalling 7000 names
Results checked manually
Suggested targets for an automated annotator:
Assumes compound references are ignored
P>90% and R>80% (F1>85%)
Filter out trivial names, element symbols, etc
Convert >90% names to structures
Overall target (NER/N2S): >75%
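The arithmetic behind these targets can be checked in a few lines. The F1 figure follows from the formula F1 = 2PR/(P+R). Treating the overall target as the product of the NER F1 and the N2S conversion rate is an assumption made for this sketch; the slide does not say how the overall figure was derived.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


ner_precision = 0.90   # target: P > 90%
ner_recall = 0.80      # target: R > 80%
n2s_rate = 0.90        # target: convert > 90% of names to structures

ner_f1 = f1_score(ner_precision, ner_recall)

# Assumption: a name must be both recognised and converted, so the two
# stages are combined by simple multiplication.
overall = ner_f1 * n2s_rate

print(f"NER F1 at the target boundary: {ner_f1:.3f}")
print(f"Combined NER/N2S figure:       {overall:.3f}")
```

Under that assumption the boundary values give an F1 of about 85% and a combined figure just above 75%, which is consistent with the targets listed.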
Annotating While Authoring
Retrospective annotation is required for
legacy repositories
Authors should have authoring tools which
annotate on the fly:
Built-in chemical structure editors
Links to ontologies and vocabularies
Submitted article contains XML text and
marked up content
Needs to be part of the standard authoring
environment (ie Microsoft Office)
Integrating Chemistry with Biology
Excellent ontologically-based search tools for
the BioMed literature exist
PubMed
GoPubMed from Transinsight
Sofia from Biowisdom
Accelrys
NaCTeM suite
….
Limited integration with chemistry
www.GoPubMed.com
Search results are
sorted into
categories taken
from GO and MeSH
Get statistics on your hits
Shows the
number of
articles by topic
and author…
Get statistics on your hits
Shows charts of
how interest in this
topic has grown over
the years, and
where the major
centres of research
are.
Top chart shows
number of articles
published per year
and weight (impact
factor)
Vision
Multiple repositories of data and documents
(internal and external)
Domain-relevant ontologies
Linking gene expression to protein function
to chemical properties
Searchable by chemical structure, sequence,
name, biological term, concept etc
Navigable by scientific concept
Enabled through open standards (eg OWL,
RDF, SPARQL, InChI etc)
Conclusions
Chemical semantics is complex and
fraught with linguistic challenges, but
can open up the science in scientific
literature
More work required to develop the tools
Future is an integrated chemical and
biological information environment:
the “Scientific Semantic Web”