Chemistry-Enriched Patent Curation

semi-automatic analysis and elaboration of patents
ChemAxon UGM 2015, Budapest, 20 May 2015
Árpád Figyelmesi
ChemAxon
Matthias Negri, PhD
Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG
Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow – Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook
Negri Matthias, ChemAxon UGM 2015
Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
– as ONE image file (tables with chemistry and bioactivity data)
– as chemistry-only image files embedded within table tags
5. Markush structures/formulas with R-groups
Currently NO commercial solution covers all these cases.
→ Most of the cases are handled in the patent curation workflow
(Markush/R-group formulas are recognized and stored separately).
Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files as NEW source for text-mining
3. Extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/substructure frequency in compound sets of patents
5. …
Semi-automatic Patent Curation Workflow
Overview – current state
INPUT feeds 2 parallel branches:
- KNIME branch: SLOWER & memory-intensive, BUT higher quality, more control & IUPAC-enriched XML
- ChemCC branch: FASTER, BUT less informative/flexible; ChemCC as the (near-)future perspective
I2E API in KNIME – batch indexing, text-mining and (relational) data retrieval
Linked tools/technologies
1. KNIME/XPath
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming, Molconverter, Structure Checker, Standardizer, …)
4. Text/data-mining – Linguamatics I2E (+ I2E Chemistry)
5. Optical Structure Recognition – Keymodule CLiDE Batch
ChemCurator (ChemCC)
Computer-aided chemical data extraction
- English, Chinese and Japanese N2S (name-to-structure)
- Markush Editor
- Structure Checker
- Hit visualization
- Third-party OSR technologies
Árpád Figyelmesi, ChemAxon UGM 2015
ChemCurator (ChemCC)
Name to Structure
- Support for many nomenclatures (common names, drug names, …)
- IUPAC names
- Custom dictionaries
- English (2008)
- Chinese (2013)
- Japanese (2014)
ChemCurator (ChemCC)
Compound Extraction View
[Screenshot panels: Compound list, Project explorer, Annotated document, Selected structures]
ChemCurator (ChemCC)
Markush Extraction View
[Screenshot panels: Project explorer, Markush editor, Annotated document, Structure checker, Selected structures, Example structures]
ChemCurator (ChemCC)
General Document Curation
- Extract Markush structures from patents
- Extract specific structures from:
  - Journal articles
  - Company reports
  - Patent examples
- Structure extraction wizards
  - Exclude fragments, chemical elements, etc.
ChemCurator (ChemCC)
Integration & Information Sharing
Other ChemAxon products:
- Direct IJC schema connection
- Project sharing function
- Accessible from Plexus, IJC, etc.
Third-party tools:
- Standard file formats
- Export functions
- Easily processable projects
Semi-automatic Patent Curation Workflow
a) input sources and b) bibliographic data
a) Input sources
- files with lists of patent IDs
- XML collection
- …
b) Retrieval of bibliographic information and attachment data
- family ID, patent references, expiration date, etc.
- attachment files: MOL/CDX (US patents only), TIF files
- …
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
1. ChemCurator branch
- data retrieval (XML, attachments) from the IFI Claims Direct BI server
- ChemCurator project creation/sharing/annotation → HTML output
- chemistry extraction with name2structure/document2structure → SDF output
- generation of a pre-annotated patent set stored as ChemCC projects
→ Faster, but lower quality in the chemistry extraction process
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME branch
- OCR-error CLEAN-UP in KNIME → improved chemistry recognition
- MOL/CDX/TIF - Standardizer, Structure Checker → filter formulas, solvents, R-groups
→ Higher quality and more control in the chemistry extraction process
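The OCR clean-up step can be pictured as a dictionary-based substitution applied to chemical names before name-to-structure conversion. The sketch below is a minimal illustration in plain Python, not the KNIME nodes used in the workflow; the correction table and the function name `clean_ocr_name` are hypothetical.

```python
import re

# Hypothetical correction table: OCR-damaged fragment -> intended fragment.
# A real table would be curated from errors observed in the patent corpus.
OCR_FIXES = {
    "rnethyl": "methyl",   # "m" read as "rn"
    "ch1oro": "chloro",    # "l" read as "1"
    "pheny1": "phenyl",
    "f1uoro": "fluoro",
}

def clean_ocr_name(name: str) -> str:
    """Return a single chemical name with known OCR errors fixed."""
    cleaned = name
    for wrong, right in OCR_FIXES.items():
        cleaned = cleaned.replace(wrong, right)
    # OCR often inserts spaces around the hyphens and commas of locants.
    cleaned = re.sub(r"\s*([-,])\s*", r"\1", cleaned)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", cleaned).strip()
```

The function is meant for one name at a time: removing spaces around commas would also glue together a comma-separated list of names, so lists would need to be split first.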
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME branch
- MOL → IUPAC
- CDX → IUPAC
- TIFF (via CLiDE) → IUPAC
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME - Chemistry “Normalization”
Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks:
- IUPAC name
- string length (the output order of chemicals can differ in multi-molecule image/multi-MOL files)
- OCR correction (“dictionary”-based)
Workflow steps: Clean-Up IUPAC, “Normalize” IUPAC names, Merge IUPAC
If NO IUPAC name is available → the IMG name is set
→ (within KNIME) set up a relation between each TIFF/attachment file
1. to (one or more) IUPAC name(s)
2. to a position/section in the text/document
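The merge-and-compare logic of this normalization step can be sketched as follows. This is a simplified illustration in plain Python, assuming each attachment has already been converted to zero or more IUPAC names per source; the data layout, the source priority, the status labels and the names `normalize_name` and `merge_names` are assumptions, not the actual KNIME implementation.

```python
from typing import Dict, List

# Assumed priority when sources disagree: drawn structures before OCR'd images.
SOURCE_PRIORITY = ["MOL", "CDX", "TIF"]

def normalize_name(name: str) -> str:
    """Lower-case and drop whitespace so equivalent names compare equal."""
    return "".join(name.lower().split())

def merge_names(image_id: str, names_by_source: Dict[str, List[str]]) -> dict:
    """Merge per-source IUPAC names for one attachment and flag agreement."""
    # Sort the normalized names: multi-molecule files may list the same
    # compounds in a different order depending on the source.
    normalized = {
        src: sorted(normalize_name(n) for n in names)
        for src, names in names_by_source.items() if names
    }
    if not normalized:
        # No source produced a name: fall back to the image name.
        return {"image": image_id, "names": [image_id], "status": "no_iupac"}
    variants = list(normalized.values())
    same_names = all(v == variants[0] for v in variants)
    # Second check: total string length, a cheap hint for near-identical names.
    same_length = len({sum(len(n) for n in v) for v in variants}) == 1
    best = next((s for s in SOURCE_PRIORITY if s in normalized),
                next(iter(normalized)))
    if same_names:
        status = "agree"
    elif same_length:
        status = "same_length"
    else:
        status = "conflict"
    return {"image": image_id, "names": names_by_source[best], "status": status}
```

Records flagged `same_length` or `conflict` would be the natural candidates for manual review in a semi-automatic workflow.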
Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
Replacement: <chemistry> tags vs IUPAC names (IUPAC-enriched XML)
Chemistry present as text is recognized and extracted either via
- text-mining (I2E Chemistry, with d2s working behind the scenes), or
- within KNIME/ChemCC using annotate/molconvert
Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
[Figure: patent XML excerpt showing OCR errors in chemical names and the TIF, CDX and MOL attachment references; this text chunk is replaced by the IUPAC name]
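The replacement of attachment references by IUPAC names can be sketched with Python's standard `xml.etree.ElementTree`. Only the `<chemistry>` element name comes from the slides; the `<img file="…">` child, the added `source` attribute and the function name `enrich_with_iupac` are assumptions made for illustration, and real patent XML may be structured differently.

```python
import xml.etree.ElementTree as ET

def enrich_with_iupac(xml_text, names_by_file):
    """Replace the image reference inside each <chemistry> element with names."""
    root = ET.fromstring(xml_text)
    # Materialize the matches first because the loop modifies the tree.
    for chem in list(root.iter("chemistry")):
        img = chem.find("img")  # assumed child element holding the file name
        file_name = img.get("file") if img is not None else None
        if file_name is None:
            continue  # nothing to map, leave the element untouched
        names = names_by_file.get(file_name, [])
        # Fall back to the image name when no IUPAC name is available.
        label = "; ".join(names) if names else file_name
        for child in list(chem):
            chem.remove(child)
        chem.text = label
        chem.set("source", file_name)  # keep the link to the original file
    return ET.tostring(root, encoding="unicode")
```

Because only the children of `<chemistry>` are removed, the text following the element is preserved, so each name stays at the position in the document where the image originally appeared.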
Semi-automatic Patent Curation Workflow
e) Bioactivity/tabular data extraction with KNIME/XPATH
XPath/XML parsing and extraction of:
- Tables
- Rows - XML tags & strings
- Entries - XML tags & strings
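The table, row and entry extraction listed above can be sketched outside KNIME with the limited XPath support in Python's standard library. The element names `table`, `row` and `entry` are assumed here (a CALS-like layout); the actual tag names in a given patent XML collection may differ, and `extract_tables` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    # ".//table" matches tables at any depth below the root element.
    for table in root.findall(".//table"):
        rows = []
        for row in table.findall(".//row"):
            cells = []
            for entry in row.findall("entry"):
                # Join text from nested tags (bold, subscripts, ...) and
                # collapse whitespace left over from the markup.
                text = "".join(entry.itertext())
                cells.append(" ".join(text.split()))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Keeping rows as plain lists makes it straightforward to pair an example number in one column with a bioactivity value in another in a later step.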
Semi-automatic Patent Curation Workflow
f) Text-/datamining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E API/text-mining:
- indexing
- pre-defined queries
- results retrieval
- results saved as SDF files (KNIME)
Information retrieved by text-mining (chemistry-related):
- Example Nr.
- Bioactivity data from tables
- Claims, regions where chemistry appears in patents
- Genes, diseases
Semi-automatic Patent Curation Workflow
f) Bioactivity Data using I2E multi-queries – 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. – IUPAC
2. Example Nr. – Bioactivity data
[Result table columns: Example Nr. | IUPAC | Bioactivity]
[For comparison, the same chemistry in the PDF: shown as image and as table]
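The two-step approach amounts to a join on the example number: one query links example numbers to IUPAC names, the other links example numbers to bioactivity values. The sketch below shows that join in plain Python. It does not call the I2E API; the dictionary keys (`example`, `iupac`, `activity`) and both function names are hypothetical.

```python
import re

def example_key(label):
    """Reduce labels such as 'Example 12' or 'Ex. 3a' to a join key ('12', '3a')."""
    match = re.search(r"\d+[a-z]?", label.lower())
    return match.group(0) if match else label.strip().lower()

def join_by_example(iupac_hits, activity_hits):
    """Attach the IUPAC name found for an example to each bioactivity hit."""
    names = {}
    for hit in iupac_hits:
        # Keep the first name seen for each example number.
        names.setdefault(example_key(hit["example"]), hit["iupac"])
    joined = []
    for hit in activity_hits:
        key = example_key(hit["example"])
        joined.append({
            "example": key,
            "iupac": names.get(key),  # None when no name was found
            "activity": hit["activity"],
        })
    return joined
```

Rows whose `iupac` value is `None` mark examples with bioactivity data but no recognized structure, which is useful for spotting gaps in chemistry recall.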
Semi-automatic Patent Curation Workflow
g) Visualize data-/text-mining results in ChemCC
- SDF file imported into the ChemCC project + automatic mapping to existing chemistry
Lessons learned, weak-points, limitations
1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs the ChemCC branch
- chemistry check/normalization – 3 input sources → improved quality
- improved chemistry recall - ALL images (incl. tables and drawings)
- more filtering options in the KNIME workflow than with ChemCurator only
- IUPAC-enriched XML as new source for I2E
Advantages of ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)
- faster
- image processing using CLiDE is already incorporated with naming
Lessons learned, weak-points, limitations
2. No full automation of the workflow due to the lack of homogeneity in patent data (US vs WO, EP, etc.)
- Missing attachment files
- No tables present in XML
- Error rate in chemistry recognition (OPSIN vs n2s/d2s)
- …
→ NEEDS: different workflows/branches, patent-file clean-up (OCR)
3. Time- and computational-resource-consuming process
Outlook
1. KNIME Workflow
- Add new data fields to chemicals: BI-internal codes, genes, targets, etc.
- Use the ChemCC HTML output as a source for text-mining
- Ontology mapping
- Expand the workflow by including other sources (internal PDFs, literature full-text)
- Use KNIME to interconnect with BI-internal workflows, DBs, etc.
→ chemistry-linked information in a patent DB → improved (semantic) search
Outlook
2. ChemCurator
- Improved n2s
- New command-line functions
- Complex-phrase requests from the IFI server
- Improved SDF import
- Preprocessing wizards
Thank you!