Chemistry-Enriched Patent Curation
Semi-automatic analysis and elaboration of patents

ChemAxon UGM 2015, Budapest, 20 May 2015

Árpád Figyelmesi
ChemAxon

Matthias Negri, PhD
Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG

Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow - Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook

Negri Matthias, ChemAxon UGM 2015

Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
   - as ONE image file (tables with chemistry and bioactivity data)
   - as chemistry-only image files embedded within table tags
5. Markush structures/formulas with R-groups

Currently NO commercial solution covers all these cases.
Most of the cases are considered in the patent curation workflow (Markush/R-group formulas are recognized and stored separately).

Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files as a NEW source for text-mining
3. Extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/substructure frequency in compound sets of patents
5. …

Semi-automatic Patent Curation Workflow
Overview - current state
2 parallel branches starting from the same INPUT:
- KNIME branch: SLOWER and memory-intensive, BUT higher quality, more control, and IUPAC-enriched XML
- ChemCC branch: FASTER, but LESS informative/flexible; ChemCC as the (near-)future perspective
I2E API via KNIME: batch indexing, text-mining and (relational) data retrieval

Linked tools/technologies
1. KNIME/XPATH
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming, Molconverter, Structure Checker, Standardizer, …)
4. Text/data-mining - Linguamatics I2E (+ I2E Chemistry)
5. Optical Structure Recognition - Keymodule CLiDE Batch

ChemCurator (ChemCC)
Computer-aided chemical data extraction
- English, Chinese and Japanese N2S
- Markush Editor
- Structure Checker
- Hit visualization
- Third-party OSR technologies

Árpád Figyelmesi, ChemAxon UGM 2015

ChemCurator (ChemCC)
Name to Structure
- Support for many nomenclatures (common, drug names, …)
- IUPAC names
- Custom dictionaries
- English (2008), Chinese (2013), Japanese (2014)

ChemCurator (ChemCC)
Compound Extraction View
- Compound list
- Project explorer
- Annotated document
- Selected structures

ChemCurator (ChemCC)
Markush Extraction View
- Project explorer
- Markush editor
- Annotated document
- Structure checker
- Selected structures
- Example structures

ChemCurator (ChemCC)
General Document Curation
- Extract Markush structures from patents
- Extract specific structures: journal articles, company reports, patent examples
- Structure extraction wizards: exclude fragments, chemical elements, etc.

ChemCurator (ChemCC)
Integration & Information Sharing
Other ChemAxon products:
- Direct IJC schema connection
- Project sharing function
- Accessible from Plexus, IJC, etc.
Third-party tools:
- Standard file formats
- Export functions
- Easily processable projects

Semi-automatic Patent Curation Workflow
a) Input sources and b) bibliographic data
a) Input sources
- Files with lists of patent IDs
- XML collection
- …
b) Retrieval of bibliographic information and attachment data
- Family ID, patent references, expiration date, etc.
- Attachment files: MOL/CDX (US patents only), TIF files, …
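As a rough illustration of step b), the attachment files of a patent can be sorted by type so that MOL, CDX and TIF files are each routed to their own conversion step later on. This is only a minimal Python sketch: the function name, the file names and the extension spellings are assumptions made for the example, not part of the workflow shown in the slides.

```python
from collections import defaultdict
from pathlib import PurePath

# Attachment types named on the slide; the extension spellings are assumptions.
KNOWN_TYPES = {".mol": "MOL", ".cdx": "CDX", ".tif": "TIF", ".tiff": "TIF"}


def group_attachments(filenames):
    """Group attachment file names by type, ignoring unknown extensions."""
    groups = defaultdict(list)
    for name in filenames:
        kind = KNOWN_TYPES.get(PurePath(name).suffix.lower())
        if kind is not None:
            groups[kind].append(name)
    return dict(groups)


# Hypothetical attachment names for a single patent.
files = ["C00001.MOL", "C00001.CDX", "C00001.TIF", "D00001.png"]
print(group_attachments(files))
```

Files with an unrecognized extension are simply skipped here; a real workflow might log them instead.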

Semi-automatic Patent Curation Workflow
c) Chemistry retrieval/extraction/filtering
1. ChemCurator branch
- Data retrieval (XML, attachments) from the IFI Claims Direct BI-server
- ChemCurator project creation/sharing/annotation -> html output
- Chemistry extraction with name2structure/document2structure -> sdf output
- Generation of a pre-annotated patent set stored as ChemCC projects
Faster, but lower quality in the chemistry extraction process

Semi-automatic Patent Curation Workflow
c) Chemistry retrieval/extraction/filtering
2. KNIME branch
- OCR-error CLEAN-UP in KNIME -> improved chemistry recognition
- MOL/CDX/TIF
- Standardizer, Structure Checker -> filter formulas, solvents, R-groups
Higher quality and more control in the chemistry extraction process

Semi-automatic Patent Curation Workflow
c) Chemistry retrieval/extraction/filtering
2. KNIME branch
- MOL -> IUPAC
- CDX -> IUPAC
- TIFF (via CLiDE) -> IUPAC

Semi-automatic Patent Curation Workflow
c) Chemistry retrieval/extraction/filtering
2. KNIME - Chemistry "Normalization"
Merging and comparison of the converted chemistry output of MOL/CDX/TIF, with 2 "quality" checks:
- IUPAC string length (the output order of chemicals can differ in multiple-molecule image/multiMOL files)
- OCR correction ("dictionary" based)
Steps: clean up, "normalize" and merge the IUPAC names.
If NO IUPAC name is available, the image name is set instead (within KNIME).
Set up a relation between each TIFF/attachment file and
1. (one or more) IUPAC name(s)
2. a position/section in the text/document

Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
Replacement: <chemistry> vs IUPAC -> IUPAC-enriched XML
Chemistry present as text is recognized and extracted either
- via text-mining (I2E Chemistry, with d2s working behind the scenes), or
- within KNIME/ChemCC using annotate/molconvert

Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
[Figure: patent text with OCR errors in chemical names next to the TIF, CDX and MOL attachments; the marked text chunk is replaced by the IUPAC name]

Semi-automatic Patent Curation Workflow
e) Bioactivity/tabular data extraction with KNIME/XPATH
XPATH/XML parsing and extraction of:
- Tables
- Rows - XML tags & strings
- Entries - XML tags & strings

Semi-automatic Patent Curation Workflow
f) Text-/data-mining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E
I2E API/text-mining: indexing -> pre-defined queries -> results retrieval, saved as SDF files (KNIME)
Text-mining retrieves (chemistry-related) information:
- Example Nr.
- Bioactivity data from tables
- Claims, regions where chemistry appears in patents
- Genes, diseases

Semi-automatic Patent Curation Workflow
f) Bioactivity data using I2E multi-queries - 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. - IUPAC
2. Example Nr. - Bioactivity data
Result columns: Example Nr. | IUPAC | Bioactivity
[Figure: for comparison, the same chemistry in the PDF, as image and as table]

Semi-automatic Patent Curation Workflow
g) Visualize data-/text-mining results in ChemCC
SDF file imported into the ChemCC project + automatic mapping to existing chemistry

Lessons learned, weak-points, limitations
1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs the ChemCC branch
- Chemistry check/normalization across 3 input sources -> improved quality
- Improved chemistry recall: ALL images (incl. tables and drawings)
- More filtering options in the KNIME workflow than in ChemCurator alone
- IUPAC-enriched XML as a new source for I2E
Advantages of ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)
- Faster
- Image processing using CLiDE is already incorporated with naming

Lessons learned, weak-points, limitations
2. No full automation of the workflow, due to the lack of homogeneity in patent data (US vs WO, EP, etc.)
- Missing attachment files
- No tables present in XML
- Error rate in chemistry recognition (OPSIN vs n2s/d2s)
- …
NEEDS: different workflows/branches, patent-file clean-up (OCR)
3. Time-consuming and computationally expensive process

Outlook
1. KNIME workflow
- Add new data fields to chemicals: BI-internal codes, genes, targets, etc.
- Use the ChemCC html output as a source for text-mining
- Ontology mapping
- Expand the workflow by including other sources (internal PDF, literature full text)
- Use KNIME to interconnect with BI-internal workflows, DBs, etc.
-> chemistry-linked information in a patent DB
-> improved (semantic) search

Outlook
2. ChemCurator
- Improved n2s
- New command-line functions
- Complex-phrase requests from the IFI server
- Improved SDF import
- Preprocessing wizards

Thank You!
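
As a supplement to step e), the Tables / Rows / Entries extraction can be pictured with a few lines of XPath-style parsing. The sketch below uses Python's standard library rather than the KNIME XPath node used in the workflow. The tag names (`table`, `row`, `entry`) and the sample values are invented for illustration and will differ between patent XML sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical patent fragment; tag names and values are made up and only
# mirror the Tables / Rows / Entries levels named on the slide.
SAMPLE_XML = """
<description>
  <table>
    <row><entry>Example Nr.</entry><entry>IC50 (nM)</entry></row>
    <row><entry>1</entry><entry>12</entry></row>
    <row><entry>2</entry><entry>340</entry></row>
  </table>
</description>
"""


def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text.strip())
    tables = []
    for table in root.findall(".//table"):
        rows = []
        for row in table.findall(".//row"):
            # itertext() also collects text nested in inline tags inside a cell.
            cells = ["".join(e.itertext()).strip() for e in row.findall("./entry")]
            rows.append(cells)
        tables.append(rows)
    return tables


for table in extract_tables(SAMPLE_XML):
    for row in table:
        print(row)
```

Keeping each row as a list of strings preserves the link between an example number and its bioactivity value, which is the relation the I2E multi-queries in step f) rely on.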