Chemical Information Mining: Possibilities and Pitfalls
CICAG: “Scientific Text and Data Mining”
Burlington House, 20 May 2009
Neil Stutchbury

Topics
  Challenges in finding scientific terms in text and converting chemical names to structures
  Results of a user survey on requirements and perceptions
  Reflections on the future possibilities of information mining

Acknowledgements
  Royal Society of Chemistry: David James, Richard Kidd, Colin Batchelor
  Unilever Centre for Molecular Science Informatics, University of Cambridge: Prof Bobby Glen, Peter Murray-Rust

The Challenge
  Named Entity Recognition (NER): identifying chemical terms in text, correctly disambiguating them and classifying them
  Name to Structure Conversion (N2S): converting trivial, trade and IUPAC names to chemical structures

NER Classification*
  Chemicals (CM): Exact, Class, Part
  Reactions (RN)
  Chemical Adjectives (CJ)
  Chemical Prefixes (CPR)
  Enzymes (ASE)
  * Peter Corbett, Colin Batchelor and Simone Teufel (2007), Annotation of chemical named entities, BioNLP 2007: Biological, translational, and clinical language processing, Prague, Czech Republic, pp. 57-64.

Chemicals (CM)
  CM:Exact and CM:Class
    Pyridine was methylated…
    The substituted acetylene, 1,3-bis-(tertbutyldimethylsilanyloxy)-5-ethynylbenzene can be derived…
    The pyridine, 2, was treated with…
    A retrosynthetic analysis for (±)-moracins O and P…
  CM:Part
    The nitrogen atom in a pyridine ring
    The ethyl group

Example
During the course of our studies of thermal decarboxylative Claisen rearrangement (dCr) reactions of allylic α-tosylesters,3–9 we have developed dCr reactions of allylic tosylmalonate esters 1.
These substrates may be converted into the decarboxylated rearrangement products at room temperature using both the KOAc–N,O-bis(trimethylsilyl)acetamide (BSA) reagent system developed initially,3,7 and modified conditions involving treatment of 1 with DBU and tert-butyldimethylsilyl triflate (TBDMSOTf).4 Further experimentation showed that malonate substrates 2 possessing two allylic ester groups underwent highly regioselective, room-temperature mono-dCr reactions, in which the allylic group possessing the more electron-rich R substituent rearranged preferentially9 (Scheme 1).

Example (annotated)
  [The same passage is repeated with its entities marked up; the highlighting is not recoverable from the text. Legend: Reaction, CM:Exact, CM:Class, CM:Part]

NER Challenges (1/2)
  Abbreviations
    THF (tetrahydrofuran or tetrahydrofolate?)
    The cyclic phosphate moiety of 8-nitro-cGMP
  Molecular formulae
    CH3COOCH2CH3 or AcOEt
    Cp2Zr(py)(Me3SiC≡CSiMe3)
  Part-of-speech ambiguity
    “tosylates”: noun or verb?

NER Challenges (2/2)
  Categories: spaces, ambiguous words, ambiguous elements
  Examples:
    “Sodium metabisulphite” is marked up as “sodium” and “metabisulphite”
    “Ion spray voltage”
    “In this paper…”
    “Bifunctional catalysts… lead to…”
    “The lead compound, 1, …”
    “The author Y. Li”
    “240V”: voltage, or Pyraclofos, an insecticide

Name to Structure
  Assumes correct classification of Chemical (CM:Exact, Class or Part)
  Trade names
  Common names
  Natural products
  Abbreviations
  Molecular formulae
  IUPAC names

N2S Challenges (1/2)
  26% of chemical names in the literature are unacceptable and cannot be converted into structures (GA Eller, Molecules 2006, 11, 915-928)
  Ambiguous names
    “Diaminocyclohexane” (1,2 or 1,3; stereochemistry?)
  Spelling mistakes
    “Mehtyl”
  Numbered compounds
    “…malonate substrates 2”
    “The absolute configuration of 3g…”, 3g: R = 4-Cl-Ph
  Split names
    “3-Fluoro- and 3-chloro-substituted pyridine”
  See: http://www.acdlabs.com/download/app/name/quality_in_patents.pdf

N2S Challenges (2/2)
  Complex naming conventions: one structure, many names
    Acetone, dimethyl ketone, propanone, 2-propanone, propan-2-one, (CH3)2C=O
    1,6-diisocyanatohexane, hexane-1,6-diyl diisocyanate, 1,6-hexanediisocyanate, 1,6-hexanediyl diisocyanate, 1,6-hexamethylene diisocyanate, 1,6-hexylene diisocyanate, hexamethylene diisocyanate

NER and N2S Products
  Vendor | Product | NER | N2S
  Univ Cambridge | OSCAR (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007) | Yes | Yes
  ChemAxon | Chemicalize.org | No | Yes
  InfoChem | Annotator and ICN2S | Yes | Yes
  ChemMantis | SureChem (NER) with ACD/Name; ACD/NTS Batch (N2S) | Yes | Yes
  CambridgeSoft | Name=Struct | No | Yes
  OpenEye | LexiChem TK | No | Yes
  Accelrys | ChemMining | Yes | No
  TEMIS/MDL | Chemical Entity Relationships Skill Cartridge | Yes | Yes
  MPirics | Chemical Content Recognition | Yes | Yes

Survey
  Requirements for chemical information mining tools
  How well are those requirements met?
  Perceptions of the vendor landscape
  Sent to RSC CICAG and Molecular Modelling Group, PRISM, PRIME, and posted on Chemical Information Sources Discussion Forum
  65 responses received, from Pharma, Academia, Vendors, and Publishers

Importance and Effectiveness of Searching Different Sources
  Q1a: Importance of searching different sources
    [Bar chart; values not recoverable from the text. Rating scale: Critical requirement, Important, Nice to have, Not required. Sources: in-house document and report collections; published scientific literature (full articles); patent literature; web-based content; chemical supplier catalogues; university theses]
  Q1b: How well can you search different sources
    [Bar chart; values not recoverable from the text. Rating scale: Fully delivered, Partially delivered, Not possible to do it. Same six sources as Q1a]

Requirements: Importance and How Well Met
  Q2a: Importance of user requirements
    [Bar chart; values not recoverable from the text. Rating scale: Critical requirement, Important, Nice to have, Not required. Requirements:]
      Search documents by chemical structure or substructure
      Search documents by chemical similarity
      Search documents containing Markush structures and correctly return hits on specific exemplars of the general structure
      Search documents for chemical reactions (by name or structure)
      Search documents for chemical or biological name (even if it is known by multiple synonyms or is an ambiguous abbreviation)
      Search for biological entities and properties as well as chemical terms
      Grouping search results by common concept (eg grouping documents which have a common substructure, by disease)
      Search for citations
      Ability to hover over a chemical name in a document and automatically see its structure pop up
      Ability to abstract synthetic routes from in-house chemistry reports to create an in-house chemical… [label truncated in source]
  Q2b: How well are the user requirements met?
    [Bar chart; values not recoverable from the text. Rating scale: Fully delivered, Partially delivered, Not possible to do it. Requirements:]
      Search documents by chemical structure or substructure
      Search documents by chemical similarity
      Search documents containing Markush structures and correctly return hits on specific exemplars of the general structure
      Search documents for chemical reactions (by name or structure)
      Search documents for chemical or biological name (even if it is known by multiple synonyms or is an ambiguous abbreviation)
      Search for biological entities and properties as well as chemical terms
      Grouping search results by common concept (eg grouping documents which have a common substructure, by disease)
      Search for citations
      Ability to hover over a chemical name in a document and automatically see its structure pop up
      Ability to abstract synthetic routes from in-house chemistry reports to create an in-house chemical… [label truncated in source]

Current Deployments
  24% of respondents said they had deployed chemical information mining tools in their organisation
  28% had not yet done so, but had plans
  46% had no plans to do so

Overall Perceptions
  Q10: Overall how well are mining needs met today?
    [Bar chart; values not recoverable from the text. Response options: Not well met at all, need better solutions; Partially met, but there are gaps; Some products meet most requirements; Plenty of choice of good solutions; Don't know]

Vendor Landscape
  83% of respondents felt there was plenty of room for further innovation in this area
  Opportunity exists to link chemical information mining tools with other systems, such as eLN, chemical inventories and biological databases
  No obviously dominant vendor: MDL/Symyx, CambridgeSoft, ACD/Labs, CAS/SciFinder, OpenEye and ChemAxon all cited
  61% said this technology would continue to be delivered through niche vendors; 39% said enterprise search vendors would pick it up

Reflections on Text Mining
  Automating the mark up process
  Annotating at the authoring stage
  Integrating chemistry with biology

Measuring Quality of Mark Up
  Quality of mark up can be measured by precision and recall
    Precision (P) = TP/(TP+FP)
    Recall (R) = TP/(TP+FN)
    F1 = 2PR/(P+R)
  Diagram: the blue area is all the correct chemical names in the document; the green area is all the text in the document which is not chemical names; the oval is the set of names found by each algorithm. The regions are labelled TP, FP, FN and TN.
    TP = True Positives, ie words correctly assigned as chemical names
    FP = False Positives, ie normal English words incorrectly assigned as chemical names
    FN = False Negatives, ie words that should have been assigned as chemical names but were missed
    TN = True Negatives, ie normal English words that should not have been assigned as chemical names and weren’t

Automating Mark Up
  Tested OSCAR, Chemicalize, ChemMantis and Name=Struct on:
    Eight RSC articles (223 structures)
    Four compound lists totalling 7000 names
    Results checked manually
  Suggested targets for an automated annotator:
    Assumes compound references are ignored
    P>90% and R>80% (F1>85%)
    Filter out trivial names, element symbols, etc
    Convert >90% of names to structures
    Overall target (NER/N2S): >75%

Annotating While Authoring
  Retrospective annotation is required for legacy repositories
  Authors should have authoring tools which annotate on the fly:
    In-built chemical structure editors
    Links to ontologies and vocabularies
    Submitted article contains XML text and marked up content
  Needs to be part of the standard authoring environment (ie Microsoft Office)

Integrating Chemistry with Biology
  Excellent ontologically-based search tools for the BioMed literature exist
    PubMed
    GoPubMed from Transinsight
    Sofia from Biowisdom
    Accelrys
    NaCTeM suite
    …
  Limited integration with chemistry

www.GoPubMed.com
  Search results are sorted into categories taken from GO and MeSH

Get statistics on your hits
  Shows the number of articles by topic and author…
  Shows charts of how interest in this topic has grown over the years, and where the major centres of research are. The top chart shows the number of articles published per year and weight (impact factor)

Vision
  Multiple repositories of data and documents (internal and external)
  Domain-relevant ontologies
  Linking gene expression to protein function to chemical properties
  Searchable by chemical structure, sequence, name, biological term, concept etc
  Navigable by scientific concept
  Enabled through open standards (eg OWL, RDF, SPARQL, InChI etc)

Conclusions
  Chemical semantics is complex and fraught with linguistic challenges, but can open up the science in scientific literature
  More work is required to develop the tools
  The future is an integrated chemical and biological information environment: the “Scientific Semantic Web”
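
Worked sketch: scoring mark up quality
The precision, recall and F1 measures from the “Measuring Quality of Mark Up” slide can be computed in a few lines, using the standard definitions P = TP/(TP+FP) and R = TP/(TP+FN). The sketch below is illustrative only and is not how any of the tested tools were evaluated: the function names, the set-based matching of whole names, and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkupScore:
    precision: float
    recall: float
    f1: float


def score_markup(gold: set[str], found: set[str]) -> MarkupScore:
    """Score one annotator run against a manually checked gold set.

    Assumption: names are compared as exact strings; a real evaluation
    would compare annotated text spans, not just distinct names.
    """
    tp = len(gold & found)   # chemical names correctly found
    fp = len(found - gold)   # ordinary words wrongly marked as chemical names
    fn = len(gold - found)   # chemical names the annotator missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return MarkupScore(precision, recall, f1)


def meets_targets(score: MarkupScore,
                  min_p: float = 0.90,
                  min_r: float = 0.80,
                  min_f1: float = 0.85) -> bool:
    """Check the suggested annotator targets (P>90%, R>80%, F1>85%)."""
    return (score.precision > min_p
            and score.recall > min_r
            and score.f1 > min_f1)


# Hypothetical data: five gold names; the annotator finds four of them
# and wrongly marks the English verb "lead" as a chemical.
gold = {"pyridine", "DBU", "TBDMSOTf", "KOAc", "BSA"}
found = {"pyridine", "DBU", "TBDMSOTf", "KOAc", "lead"}
score = score_markup(gold, found)
print(score, meets_targets(score))
```

In this made-up run TP = 4, FP = 1 and FN = 1, so precision, recall and F1 all come out at 0.8 and the targets are not met. Note also that at exactly P = 0.90 and R = 0.80 the F1 value is about 0.847, so F1>85% is a slightly stricter condition than the two individual thresholds.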