Comprehensive DWPI℠ structure searching using DCR and DWPIM on STN®
Brian Larner, IP & Science, Thomson Reuters
Robert Austin, Senior STN Trainer, FIZ Karlsruhe
19 May 2016
AGENDA
• Introduction to DWPI chemical structure indexing
• DCR – The Derwent Chemistry Resource
– What is it?
– Search example
• The Derwent Markush Resource (DWPIM)
– What is it?
– Structures indexed
– Search example
• Advanced topics
– Substance descriptors
– Roles
– Polymers, Inorganics, Phthalocyanines & Metallocenes
WHAT IS CHEMICAL INDEXING?
• Markush structural indexing for generic structures
– Created for all Markush Structures that meet the criteria
for being indexed
– Also created to cover generic disclosures described only
in words (eg cleaning solution containing a 2-8C alcohol)
• DCR indexing for specific compounds
– Created for any specific compounds mentioned in the
patent
– Some compounds may be covered in a Markush structure if
system limits are exceeded
• Fragmentation coding auto-generated from the
above
DWPI structure databases on STN
• DWPIM: > 1.9 M structures
• DWPI: > 3.2 M patents
• DCR: > 2.5 M structures
• Cross-file operators between the structure files and DWPI: SUBX, REFX
Each structure has a unique Markush Compound Number (MCN) or
DCR number (DCR) which is used as the basis of the cross-file search.
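As a rough illustration of why a shared structure number makes the cross-file search possible, the sketch below treats the crossover as a simple lookup. It is a conceptual model only, not STN command syntax; the function name `cross_file_lookup`, the record layout and all identifiers such as `MCN-1` are hypothetical.

```python
def cross_file_lookup(structure_numbers, dwpi_records):
    """Return DWPI records that carry at least one of the given structure numbers."""
    wanted = set(structure_numbers)
    return [rec for rec in dwpi_records if wanted & rec["structure_numbers"]]


# Hypothetical answer set from a structure search (MCN and DCR numbers).
structure_hits = {"MCN-1", "DCR-7"}

# Hypothetical bibliographic records, each listing the structure numbers indexed for it.
dwpi_records = [
    {"an": "REC-A", "structure_numbers": {"MCN-1"}, "abstract": "antiviral thiophene derivatives"},
    {"an": "REC-B", "structure_numbers": {"DCR-9"}, "abstract": "antiviral composition"},
    {"an": "REC-C", "structure_numbers": {"DCR-7"}, "abstract": "dye intermediate"},
]

# Step 1: cross over from structures to references.
references = cross_file_lookup(structure_hits, dwpi_records)

# Step 2: narrow the references with a technology term.
antiviral_refs = [rec for rec in references if "antiviral" in rec["abstract"]]

print([rec["an"] for rec in references])
print([rec["an"] for rec in antiviral_refs])
```

The point of the sketch is the order of operations: the structure search produces numbers, and only those numbers are carried into the bibliographic file.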
CHEMICAL STRUCTURE INDEXING IN
DWPI - CRITERIA
• To receive structural indexing a DWPI record must
meet the following criteria
– Classified in Sections B, C and / or E
– From a major country*
* See list on next slide
DWPI COUNTRY COVERAGE FOR
CHEMICAL INDEXING
WHAT IS INDEXED
• Compounds claimed to be new
• Compounds produced by a new process
• Compounds having a new use
• Components of compositions
• Novel catalysts and known specific catalysts
• Specific reagents and starting materials in production
processes (DCR only)
• Materials detected, detecting agents, detection media
• Materials recovered or purified in new ways
• Materials removed and removing agents
DERWENT CHEMISTRY RESOURCE (DCR)
• This is a database of specific chemical
substances mentioned in patents
• They are also organised into families of
closely related compounds as follows
• basic compound
• salts, isotopes, mixtures, isomers
• Substance records include structure
diagrams and substance data, e.g.
• IUPAC-name, synonyms
• molecular formula, molecular weight
DERWENT CHEMISTRY RESOURCE (DCR)
• The DCR numbers are associated with the
relevant fragmentation codes for the
substance so they can be searched in
conjunction with non-structural
fragmentation codes if desired
• They also have roles associated with them
(e.g. produced, detected) so that you can
limit your answers by the role of the
compound
BENEFITS OF DWPI INDEXING - REAL
EXAMPLE
• Search on Diclofenac or its most common
synonyms (Voltarol or Voltaren) using keywords in the
DWPI title & abstract: 3530 documents found
• Search on Diclofenac via the DCR record: 3105
records found
– 447 of these were not found by the keyword search
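The three counts on this slide imply a few further figures that are not stated. The arithmetic below is a small sketch of that derivation; it assumes the keyword "documents" and the DCR "records" are counted in the same units, which the slide does not say explicitly.

```python
# Counts quoted on the slide.
keyword_hits = 3530   # found via keywords in the DWPI title and abstract
dcr_hits = 3105       # found via the DCR record
dcr_only = 447        # DCR hits that the keyword search missed

# Derived figures (simple set arithmetic, not stated on the slide).
overlap = dcr_hits - dcr_only           # found by both searches
keyword_only = keyword_hits - overlap   # found by the keyword search alone
combined = keyword_hits + dcr_only      # union of the two answer sets

print(overlap, keyword_only, combined)
```

Under that assumption, running both searches together gives a larger answer set than either search alone.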
SOME INVENTIONS FOUND ONLY BY THE
KEYWORD SEARCH ARE LESS RELEVANT
BUT THE ONES FOUND ONLY BY DCR ARE
HIGHLY RELEVANT
DCR COVERAGE
• DCR records are only created for patents that are
classified in at least one of the following CPI sections
• B (Pharmaceuticals)
• C (Agrochemicals)
• E (General Chemistry)
• In addition, existing DCR records are cited when the
substances they relate to are mentioned in the DWPI
abstracts for patents classified in Sections D, F, G, J and
K
• DCR numbers are auto-generated from the specific
compound codes in polymer indexing and added to the
indexing
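The first two coverage rules above can be read as a small decision on the CPI sections of a patent. The sketch below is one way to express it; the function name `dcr_treatment` and the return labels are hypothetical, and the polymer-indexing route in the last bullet is deliberately not modelled.

```python
# CPI sections for which new DCR records are created.
CREATE_SECTIONS = frozenset("BCE")
# CPI sections for which existing DCR records are only cited.
CITE_SECTIONS = frozenset("DFGJK")


def dcr_treatment(sections):
    """Classify how DCR indexing applies to a patent with the given CPI sections."""
    present = {s.upper() for s in sections}
    if present & CREATE_SECTIONS:
        return "create"          # new DCR records may be created
    if present & CITE_SECTIONS:
        return "cite-existing"   # only existing DCR records are cited
    return "none"                # neither rule on the slide applies


print(dcr_treatment({"B", "D"}))
print(dcr_treatment({"F"}))
```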
DCR COVERAGE BY COMPOUND TYPE
• Ordinary organic compounds (eg ethanol, ibuprofen)
• Inorganic compounds (eg Sodium chloride, ammonia)
• Complexes and organometallics (eg ferrocene, Copper
phthalocyanine, diethyl magnesium bromide)
• Peptides with 10 or fewer amino acids
• Proteins and other natural polymers with well defined
names*
• Synthetic Polymers from a standard list of around 340
commonly occurring ones
• Plant, animal & microbial extracts*
*these records do not contain structures
WHAT IS NOT COVERED
• Generic classes of compounds
• These are covered by other forms of chemical structure
indexing in DWPI eg fragmentation coding
• Synthetic polymers other than the ones in the predefined list of around 340
• These are covered by polymer indexing
• Any compound of ambiguous structure
• This could be those with ill defined ratios of ions or
components
• Or ones with ambiguous names where we cannot be sure of
the correct structure
New STN workflow is oriented around projects
To create a project,
click the
icon.
Projects allow you to:
• Easily return to previous work
• Reuse common queries
• Update searches with the
most current information
The new STN interface puts query, history and results at
your fingertips
Structure Editor
Query Builder panel
History panel
Results panel
Prepare structure queries using the structure editor
Click OK to add the
query to the
structures tab of
the history panel.
Search the structure query and review structures
Automatic Cross File
Search is set ON.
Click on any
structure to
enlarge (zoom).
Crossover with REFX and review hit structures in DWPI
Use the REFX operator to
retrieve corresponding
DWPI references (L2).
The structure search (L1) is
combined with technology
terms in DWPI (L2).
Hit structures with hit highlighting
are included in DWPI full view.
DERWENT MARKUSH RESOURCE ON STN
• Approximately 1.9 million structures from around
780,000 patents
• Covers 33 patent issuing authorities (as basic
patent country)
• Can be searched in conjunction with DCR,
MARPAT® and CAS REGISTRY on STN
– In most cases using the same structure query
– Gives the most comprehensive chemical structure search
possible
TYPES OF STRUCTURES INDEXED
• Non-polymeric organic molecules
• Organometallic compounds
• Inorganic structures
– Simple inorganic molecules
– Extended structures such as clays, zeolites and
heteropolyacids
• Partially defined structures
• Polymeric structures
– Only for pharmaceutical and agrochemical patents
– Includes peptides as well as synthetic polymers
MARKUSH COVERAGE IN OLDER
PATENTS
• Prior to the introduction of DCR in 1999/2000 the
policy was different
• Both specific and generic structures were covered
by Markush structures, often as part of the same
structure
• Some commonly occurring compounds were
indexed using Derwent Compound Numbers
– It was the analyst's choice whether to use these or
combine them into a Markush
– These have now been converted to DCR records and can
be found by a DCR search
– But they are still included in the Derwent Markush Resource
ORGANIC MOLECULES IN THE DERWENT
MARKUSH RESOURCE
• Generally speaking they are indexed as shown in
the Patent
• Counter ions are sometimes ignored (but not in the
example below)
Derwent Markush Resource Version
In the patent
WHY THE CORE STRUCTURE CAN DIFFER
FROM THE ONE DRAWN IN THE PATENT
• Indexing conventions
– Keto-enol tautomerism (keto form is the preferred one)
– Amidine normalisation (amidine/guanidine groups have
normalised bonds not single and double bonds)
• Use of DWPI Markush terminology and shortcuts
– Use of Superatom terms (CHK, ARY etc.) & shortcuts (CO2,
SO3 etc.)
• Allowing for variable attachments
– Replace all the parts of the structure where the attachment
can be made by a variable group
• Allowing for exceptions mentioned in the patent
– For example where at least one of R1 & R2 is not H
• Allowing for system limits
– Means sometimes one structure is split into 2 or more
SUPERATOMS AND THEIR MEANING
(ORGANIC)
Superatom | Definition | STN query node
CHK | Fully saturated alkyl chain | Ak
CHE | Carbon chain containing at least one double bond (no triple bonds) | Ak
CHY | Carbon chain containing at least one triple bond (optionally with double bonds) | Ak
CYC | Non-aromatic carbocyclic ring | Cb
ARY | Carbocyclic ring system containing at least one benzene ring or quinoid variant | Cb
HEA | 5 membered ring with 2 double bonds or 6 membered ring with 3 double bonds | Hy
HET | Any mononuclear heterocyclic ring other than HEA | Hy
HEF | Fused heterocyclic ring system | Hy
See also: DWPIM Reference Manual, Table 3, Page 18.
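The table above is effectively a lookup from DWPIM superatom to STN query node, and reading it in reverse shows which superatoms a generic query node can retrieve. A minimal sketch follows; the dictionary and function names are hypothetical, and the mapping contains only the eight organic superatoms listed on the slide.

```python
# Organic superatoms and the STN query node listed for each on the slide.
SUPERATOM_TO_STN_NODE = {
    "CHK": "Ak",
    "CHE": "Ak",
    "CHY": "Ak",
    "CYC": "Cb",
    "ARY": "Cb",
    "HEA": "Hy",
    "HET": "Hy",
    "HEF": "Hy",
}


def superatoms_for(node):
    """Return the superatoms that map to the given STN query node, sorted by name."""
    return sorted(sa for sa, n in SUPERATOM_TO_STN_NODE.items() if n == node)


for node in ("Ak", "Cb", "Hy"):
    print(node, superatoms_for(node))
```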
SUPERATOMS AND THEIR MEANING
(INORGANIC OR NON-SPECIFIC)
Superatom | Definition | STN query node
HAL | Halogen excluding At | X
AMX | Alkali(ne earth) metal | M
A35 | Group 3 to 5 metal | M
TRM | Transition metal | M
LAN | Lanthanide (excluding Lanthanum) | M
ACT | Actinide or other trans-uranic metal | M
MX | Unspecified metal | M
XX | Unspecified group but not hydrogen, mostly used for unspecified substituent groups | -
UNK | Unspecified group (no longer used but may be present in some older structures) | -
See also: DWPIM Reference Manual, Tables 4-6, Pages 19-20.
SUPERATOMS USED FOR DISPLAY ONLY
Superatom | Definition
ACY | Acyl group (derived from any organic acid, not just carboxylic)
DYE | Undefined dye chromophore
PRT | Protecting group
PEG | Polymer end group
POL | Polymer group
Please note Derwent Superatoms will become directly searchable in a
subsequent release of the Derwent Markush resource on New STN
See also: DWPIM Reference Manual, Table 6, Page 20.
ATTRIBUTES
• Attributes can be applied to Superatoms to restrict the
scope of the group they describe.
• For carbon chain Superatoms (CHK, CHE, CHY) we
have the following
– Describing chain length LOW (1-6C), MID (7-10C) & HI
(>10C)
– Describing chain structure STR (Straight) & BRA (Branched)
• For ring Superatoms we have the following
– Type of ring system - MON (Monocyclic) & FU (Fused)
– Degree of saturation – SAT (Saturated) & UNS
(Unsaturated)
– MON & FU are not applied to HEA
– SAT & UNS are not applied to HEA and ARY
See also: DWPIM Reference Manual, Table 18, Page 62.
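The attribute rules above can be sketched as two small functions: one that assigns the chain-length attribute from a carbon count, and one that lists which ring attributes may apply to a ring superatom. The function names are hypothetical, and the ring function applies only the two exclusions stated on this slide; any further restrictions in the Reference Manual are not modelled.

```python
def chain_length_attribute(carbons):
    """Chain-length attribute for CHK, CHE or CHY: LOW (1-6C), MID (7-10C), HI (>10C)."""
    if carbons < 1:
        raise ValueError("a carbon chain needs at least one carbon")
    if carbons <= 6:
        return "LOW"
    if carbons <= 10:
        return "MID"
    return "HI"


# Ring superatoms taken from the organic superatom table.
RING_SUPERATOMS = {"CYC", "ARY", "HEA", "HET", "HEF"}


def ring_attributes(superatom):
    """Ring attributes that may be applied, using only the exclusions on the slide."""
    if superatom not in RING_SUPERATOMS:
        raise ValueError(f"{superatom} is not a ring superatom")
    allowed = set()
    if superatom != "HEA":                 # MON and FU are not applied to HEA
        allowed |= {"MON", "FU"}
    if superatom not in {"HEA", "ARY"}:    # SAT and UNS are not applied to HEA and ARY
        allowed |= {"SAT", "UNS"}
    return allowed


print(chain_length_attribute(8), sorted(ring_attributes("ARY")))
```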
Search example
Search Query:
1 ‒ Thiophene: ML = Atom
2 ‒ Carbocycle (Cb): ML = Atom Class
3 ‒ Alkyl (Ak): ML = Class
4 ‒ Heterocycle (Hy): ML = Atom Class
Default settings.
[lock icon] = No further substitution on Ak (Locked).
STN variable query nodes retrieve DWPIM generic nodes
STN variable query node (ML = Class) | DWPIM retrieved generic nodes
Ak | CHK, CHE, CHY
Cb | ARY, CYC
Hy | HEA, HET, HEF
STN node attributes retrieve DWPIM indexed attributes
STN node attributes, e.g. Ak | DWPIM retrieved attributes
DWPIM alkyl (no limitation), ML = Class | CHK, CHE, CHY
DWPIM alkyl (low), ML = Class | CHK, CHKLOW
DWPIM alkyl (straight), ML = Class | CHK, CHKSTR
STN variable query nodes retrieve DWPIM generic nodes
[Figure: STN search query alongside typical DWPIM assembled hits, with query nodes 1-4 labelled on each structure]
STN query nodes with Match Level Class retrieve corresponding generic
and specific nodes in DWPIM.
DWPIM attributes are also accessible,
e.g. MON = monocyclic, FUS = Fused.
Prepare structure queries using the structure editor
Cb and Hy nodes have been set to Class
match. Changes from defaults are indicated
with an asterisk. This has no effect on DCR.
Right click on a node to change
Attributes, e.g. Match Level.
Block substitution with the
lock atoms tool.
Click OK to add the
query to the
structures tab of
the history panel.
Search the structure query and review structures
Click on a Markush compound number of
interest for detailed display views (next).
Assembled structures with hit highlighting.
Click on any
structure to
enlarge (zoom).
Automatic Cross File
Search is set ON.
DWPIM detailed display – Brief view
Unassembled
DWPIM Markush
base structure.
Hit fragments are combined to
form the assembled structure.
Query relevant G-groups (G2, etc.).
Hit fragments are
highlighted.
Detailed display allows you to choose a preferred view
• Brief – unassembled hit Markush base structure with
complete hit G-groups related to the query
‒ Hit fragments within hit G-groups are highlighted
• Full – unassembled hit Markush base structure with all
G-groups, including those not related to the query
‒ Hit fragments within hit G-groups are highlighted
Crossover with REFX and review hit structures in DWPI
Use the REFX operator to
retrieve corresponding
DWPI references (L2).
The structure search (L1)
is combined with terms
for antiviral in DWPI (L2).
SUBSTANCE DESCRIPTORS (FILE
SEGMENTS IN MMS TERMINOLOGY)
• These are assigned to all Markush structures
• You can use them to filter your results
• There are three types of substance descriptor
– Technology related – define the technology area the
structure relates to
– Structure related – define the type of structure the
Markush describes
– Miscellaneous – identifies a Markush which contains
structures which are components of a composition
• At least one technology related Substance
Descriptor and at least one structure related
Substance Descriptor is applied to each Markush
SUBSTANCE DESCRIPTORS RELATING TO
STRUCTURE
Substance Descriptor | Definition
C | Co-ordination complex (includes metallocenes)
F | Any polymer not covered by P or N
L | Oligomer (Precise definition depends on structure type)
M | Alloy (Section B/C patents only)
N | Natural polymer (Section B/C patents only)
P | Polypeptide (3-10 amino acids only)
V | Ordinary organic compound (not a salt)
W | Extended inorganic structures (eg zeolites)
Z | Organic salt (at least one ion is organic)
1 | Record derived from DCN database
7 | Simple Inorganic compound
See also: DWPIM Reference Manual, Table 17, Page 56
OTHER SUBSTANCE DESCRIPTORS
Substance descriptor | Definition
A | Patent is classified in CPI Section A*
B | Patent is classified in CPI Section B and/or C
E | Patent is classified in CPI Section E
Y | Substances indexed form part of a mixture
*Patent must also have a B, C and/or E class to receive Markush indexing
See also: DWPIM Reference Manual, Table 17, Page 56
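The rule that every Markush carries at least one technology related and at least one structure related Substance Descriptor can be sketched as a simple check, for example when sanity-checking a filter. The grouping below assumes that A, B and E are the technology related codes, Y is the miscellaneous code, and every code in the structure table (including 1 and 7) counts as structure related; the function name `has_required_descriptors` is hypothetical.

```python
# Assumed grouping of the codes shown in the two descriptor tables.
TECHNOLOGY = {"A", "B", "E"}
STRUCTURE = {"C", "F", "L", "M", "N", "P", "V", "W", "Z", "1", "7"}
MISCELLANEOUS = {"Y"}


def has_required_descriptors(codes):
    """True if the set holds at least one technology and one structure descriptor."""
    codes = set(codes)
    unknown = codes - TECHNOLOGY - STRUCTURE - MISCELLANEOUS
    if unknown:
        raise ValueError(f"unknown substance descriptor(s): {sorted(unknown)}")
    return bool(codes & TECHNOLOGY) and bool(codes & STRUCTURE)


print(has_required_descriptors({"B", "V"}))       # pharmaceutical, ordinary organic
print(has_required_descriptors({"E", "C", "Y"}))  # general chemistry, complex, mixture
```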
POLYMER OR OLIGOMER
Substance | Substance descriptors | BC definition | E definition
Oligopeptide | VP | 3 amino acids | 3 amino acids
Polypeptide | P | >=4 amino acids | >=4 amino acids
Oligosaccharide | L | 3-6 sugar units | 3-9 sugar units
Polysaccharide | N | >=7 sugar units | >=10 sugar units*
Other oligomer | L | 3-8 repeat units | 3-9 repeat units
Other polymer | F | >=9 repeat units | >=10 repeat units*
• BC definition refers to the definition used when indexing pharmaceutical and
agrochemical patents (Sections B and / or C)
• E definition refers to the definition used when indexing general chemistry
patents (Section E)
• If a patent is classified in Section E as well as Section B and / or Section
C the BC definitions are used
* Not indexed unless part of a dye molecule
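The thresholds in this table, together with the rule that BC definitions take precedence, can be sketched as a lookup. The function name `polymer_descriptor`, the `kind` labels and the handling of fewer than 3 units (returned as `None`, since the table does not cover that range) are assumptions made for illustration; the dye-molecule footnote is not modelled.

```python
BC_SECTIONS = {"B", "C"}

# Per scheme and structure kind: (oligomer code, minimum units for polymer, polymer code).
RULES = {
    "BC": {"peptide": ("VP", 4, "P"), "saccharide": ("L", 7, "N"), "other": ("L", 9, "F")},
    "E": {"peptide": ("VP", 4, "P"), "saccharide": ("L", 10, "N"), "other": ("L", 10, "F")},
}


def polymer_descriptor(kind, units, sections):
    """Substance descriptor for an oligomer or polymer with the given number of units."""
    if units < 3:
        return None  # below the range covered by the table
    # BC definitions are used whenever Section B and/or C is present.
    scheme = "BC" if BC_SECTIONS & set(sections) else "E"
    oligomer_code, polymer_min, polymer_code = RULES[scheme][kind]
    return polymer_code if units >= polymer_min else oligomer_code


# The same 8-unit saccharide is an oligomer under E but a polymer under BC.
print(polymer_descriptor("saccharide", 8, {"E"}))
print(polymer_descriptor("saccharide", 8, {"B", "E"}))
```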
FILTER BY SUBSTANCE DESCRIPTOR
ROLES OF MARKUSH RECORDS
Role | Definition
A | Compound is analyzed or detected
C | Catalyst
D | Detecting agent
M | Component of a mixture (at least 2 components have been indexed)
N | New compound
P | Compound is produced or purified
Q | Compound defined in terms of starting materials
R | Removing or purifying agent
U | New use of compound
X | Compound is removed
See also: DWPIM Reference Manual, Table 15, Page 55.
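Limiting an answer set by role amounts to keeping only the records that carry one of the wanted role codes. The sketch below illustrates that idea on in-memory data; it is not STN search syntax, and the function name `filter_by_role`, the record layout and the record identifiers are hypothetical.

```python
# Role codes and definitions as listed in the table.
ROLES = {
    "A": "Compound is analyzed or detected",
    "C": "Catalyst",
    "D": "Detecting agent",
    "M": "Component of a mixture",
    "N": "New compound",
    "P": "Compound is produced or purified",
    "Q": "Compound defined in terms of starting materials",
    "R": "Removing or purifying agent",
    "U": "New use of compound",
    "X": "Compound is removed",
}


def filter_by_role(records, wanted):
    """Keep records that carry at least one of the wanted role codes."""
    wanted = set(wanted)
    unknown = wanted - set(ROLES)
    if unknown:
        raise ValueError(f"unknown role code(s): {sorted(unknown)}")
    return [rec for rec in records if wanted & set(rec["roles"])]


# Hypothetical Markush records with their assigned roles.
records = [
    {"id": "example-1", "roles": {"N", "P"}},
    {"id": "example-2", "roles": {"M"}},
    {"id": "example-3", "roles": {"U"}},
]

new_or_new_use = filter_by_role(records, {"N", "U"})
print([rec["id"] for rec in new_or_new_use])
```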
POLYMERS
• Only for Pharmaceutical (B)
and agrochemical (C)
patents
• Addition polymers are
typically indexed based on
the monomers with Role Q
assigned
• Condensation polymers are
typically indexed based on
the Structural Repeat Unit
(SRU) with Substance
Descriptor F assigned
(polysiloxane example)
...
INORGANIC STRUCTURES
• Salts are drawn as discrete ions with charges
added whenever they are shown or can be easily
deduced
• More complex structures are indexed by listing
each element present as a separate entity with
zero valency
• Compounds formed entirely of non metallic
elements are mostly shown with covalent bonds in
much the same way as for organics
PHTHALOCYANINES
• These are drawn fully normalized
• The central metal atom used to be bonded to all 4
N atoms but now (since 2000) it is disconnected
METALLOCENES
• Are indexed with the cyclopentadienyl or other π bonded
ligands shown disconnected from the metal atom
• The valency on the metal is reduced by 1 for each bond to a
cyclopentadienyl ring
– For example Ti in titanocene dichloride is shown as 2 valent (a
+2 charge would be placed on the Ti atom)
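The valency rule above is a one-line subtraction, sketched below with the titanocene dichloride example from the slide. The function name `indexed_metal_valency` is hypothetical, and the starting valency of 4 for Ti in titanocene dichloride is standard chemistry rather than something stated on the slide.

```python
def indexed_metal_valency(formal_valency, cp_rings):
    """Valency shown on the metal: reduced by 1 for each bond to a cyclopentadienyl ring."""
    if cp_rings < 0:
        raise ValueError("number of cyclopentadienyl rings cannot be negative")
    valency = formal_valency - cp_rings
    if valency < 0:
        raise ValueError("more cyclopentadienyl rings than the metal valency allows")
    return valency


# Titanocene dichloride: Ti(IV) with two cyclopentadienyl rings.
print(indexed_metal_valency(formal_valency=4, cp_rings=2))
```

When drawing a metallocene query, the charge placed on the metal should follow this reduced valency rather than the formal oxidation state.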
THANK YOU!
Customer Service
For subscriptions, pricing and renewals
http://ip-science.thomsonreuters.com/support/
Technical Support
For access, content, searching, troubleshooting
and technical issues.
http://ip-science.thomsonreuters.com/techsupport
Training
For Thomson Innovation training options.
http://ip.thomsonreuters.com/training/ti/
Contact Us
US, Canada & Latin America
Phone: +1 800 336 4474
[email protected]
Europe, Middle East and Africa
Tel: +44 (0)20 7433 4000
[email protected]
Japan
Phone: +81 3 5218 6500
[email protected]
Asia Pacific (Singapore office)
Phone: +65 6411 6888
ts.support.asia@thomsonreuters.com
Metallocene search example
Hint: bond values are adjusted to normalized,
because cyclopentadienyl rings are indexed
with normalized bonds in DWPIM.
Click OK to add the
query to the
structures tab of
the history panel.
Metallocene search example
Automatic Cross File
Search is set ON.
Resources
• DWPIM Reference Manual (new STN Sign In required)
https://www.stn.org/help/stn/en/dwpim_manual.pdf
• Recorded Events
http://www.stn-international.com/recorded_events.html
‒ Derwent Markush Resource (DWPIM) on STN
‒ DWPIM vs. MMS
‒ Unified Markush Search on new STN
‒ Structure Searching on new STN
For more information …
CAS
[email protected]
Support and Training:
www.cas.org
FIZ Karlsruhe
[email protected]
Support and Training:
www.stn-international.de