ChemAxon European UGM - June 2009, Budapest
Complementarity between Public
and Commercial Databases of
Bioactive Compounds
Sorel Muresan
AstraZeneca R&D Mölndal
DECS Global Compound Sciences
Driver – explosion in SAR knowledgebases
• Chemical information landscape changing fast
  • ChemSpider 20M, PubChem 18M
  • GVKBio, Wombat (commercial products that include biological annotations)
  • ChemNavigator 30M, eMolecules 8M (compounds from suppliers)
  • Patent corpus?
• Impact of large-scale MedChem, patent & public data sources
  • >11M SAR points in GVKBio
  • >800 bioassays in PubChem
  • Public-domain tox and ADR datasets
• Massive information source for competitive intelligence
Entity Relationships:
in vitro activity-to-compound-to-protein mappings
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGG
APLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQ
GYYVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHR
YYQRQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSI
PHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEI
ARPDDSLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSE
VLASVGGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIV
RVEINGQDLKMDCKEYNYDKSIVDSGTTNLRLPKKVFE
AAVKSIKAASSTEKFPDGFWLGEQLVCWQAGTTPWNI
FPVISLYLMGEVTNQSFRITILPQQYLRPVEDVATSQDD
CYKFAISQSSTGTVMGAVIMEGFYVVFDRARKRIGFAV
SACHVHDEFRTAAVEGPFVTLDMEDCGYNIPQTDESTL
MTIAYVMAAICALFMLPLCLMVCQWRCLRCLRQQHD
DFADDISLLK
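[Diagram: unstructured data from documents is turned, by expert extraction, into structured entries in relational databases. Linked entities: Document, Assay, Result, Compound and protein Sequence (an example sequence is shown above).]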
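The entity relationships in the diagram can be illustrated with a small data model. This is only a sketch: the class names follow the diagram labels, but every field name, identifier and activity value below is a hypothetical placeholder, not the schema of any of the databases discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    compound_id: str
    smiles: str


@dataclass(frozen=True)
class Protein:
    protein_id: str   # e.g. a Swiss-Prot or Entrez Gene identifier
    sequence: str


@dataclass(frozen=True)
class Document:
    document_id: str
    kind: str         # "journal" or "patent"


@dataclass(frozen=True)
class Assay:
    assay_id: str
    document: Document
    protein: Protein


@dataclass(frozen=True)
class Result:
    assay: Assay
    compound: Compound
    activity_type: str
    value_nm: float


def compounds_for_protein(results, protein_id):
    """Ids of compounds with at least one result against the given protein."""
    return {
        r.compound.compound_id
        for r in results
        if r.assay.protein.protein_id == protein_id
    }


# Hypothetical example data (identifiers and activity values are made up).
protein = Protein("P1", "MAQALPWLLLWMGAGVLPAHG")
assay = Assay("A1", Document("D1", "journal"), protein)
results = [
    Result(assay, Compound("C1", "CCO"), "IC50", 120.0),
    Result(assay, Compound("C2", "c1ccccc1"), "IC50", 45.0),
]
print(sorted(compounds_for_protein(results, "P1")))
```

Walking Result to Assay to Protein is what gives the activity-to-compound-to-protein mapping; walking Assay to Document gives the document counts used later.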
Comparing Selected Commercial and Public Bioactive Compound Sources:
2006 and 2008
• In-house structure standardisation to define unique compound content and facilitate comparisons between databases
• Selected databases with
  • Unrestricted download
  • Established utility
  • Compound-to-protein and other types of bioactive links
  • Subsetting options
• Compare protein mappings, document counts and the effect of filtration on compound counts
• Produce an all-vs-all overlap matrix
• Review overlap and content differences for 2008
• Compare between 2006 and 2008
• Selected Venn-type comparisons
• Compare selected larger merges
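An all-vs-all overlap matrix of the kind listed above can be sketched with plain set intersections. This is an illustration only: it assumes each source has already been reduced to a set of structure keys, and the source names and keys in the example are hypothetical.

```python
from itertools import combinations_with_replacement


def overlap_matrix(sources):
    """Map each unordered pair of source names to their shared key count.

    `sources` maps a source name to a set of structure keys.
    The entry (name, name) is simply the size of that source.
    """
    matrix = {}
    for a, b in combinations_with_replacement(sorted(sources), 2):
        matrix[(a, b)] = len(sources[a] & sources[b])
    return matrix


# Hypothetical toy sources; real inputs would be sets of molecular hashcodes.
toy_sources = {
    "A": {"k1", "k2", "k3"},
    "B": {"k2", "k3", "k4"},
    "C": {"k5"},
}
for pair, shared in overlap_matrix(toy_sources).items():
    print(pair, shared)
```

Only the upper triangle is computed because overlap is symmetric.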
Databases of Bioactive Compounds
Public
Commercial
Structural standardisation
1. Conversion of all sources to nonisomeric SMILES
2. Removal of fragments, such as mixtures, counter ions or water
3. Neutralisation of remaining charges
4. Derivation of a canonical tautomer using LEATHERFACE, an in-house molecular editor based on SMARTS rules
5. Generation of unique molecular hashcodes
6. Retain unique structures by comparing molecular hashcodes
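Steps 5 and 6 can be sketched as hashing a standardised form and keeping one record per hash. The standardiser below is a deliberately crude stand-in (it only keeps the longest "."-separated SMILES fragment) and is not chemistry-aware; it does not reproduce LEATHERFACE or the in-house hashcode software. The function names, the use of SHA-1, and the reading of "filtration reduction" as the percentage change in compound count are all assumptions made for illustration.

```python
import hashlib


def toy_standardise(smiles):
    """Crude stand-in for steps 1-4: keep the longest '.'-separated fragment."""
    return max(smiles.strip().split("."), key=len)


def hashcode(standard_form):
    """Step 5 (assumed): a fixed-length key for a standardised structure."""
    return hashlib.sha1(standard_form.encode("utf-8")).hexdigest()


def unique_structures(records, standardise=toy_standardise):
    """Step 6: keep the first record seen for each hashcode.

    `records` is an iterable of (source_id, smiles) pairs.
    """
    unique = {}
    for source_id, smiles in records:
        unique.setdefault(hashcode(standardise(smiles)), (source_id, smiles))
    return unique


def filtration_reduction(raw_count, unique_count):
    """Percentage change in compound count after filtration (assumed definition)."""
    return 100.0 * (unique_count - raw_count) / raw_count


# Hypothetical records: the second is a salt form of the first.
records = [("a1", "CCO"), ("a2", "CCO.Cl"), ("a3", "c1ccccc1")]
unique = unique_structures(records)
print(len(unique), round(filtration_reduction(len(records), len(unique))))
```

Comparing hashcodes rather than structures keeps the comparison between databases a simple set operation.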
Compound Numbers for Sources and Subsets

Dataset                 Filtered cpds   Filtration reduction
GVKBIO                  2,054,151       -8%
GVKBIO journals         658,198         -8%
GVKBIO patents          1,484,218       -7%
GVKBIO DD               3,675           -4%
GVKBIO CCD              8,864           -1%
GVKBIO BACE1            5,228           -11%
GVKBIO BACE1 journals   389             -6%
GVKBIO BACE1 patents    4,901           -11%
WOMBAT                  180,856         -18%
PubChem                 14,965,539      -23%
PubChem Prous           4,652           -2%
PubChem PDB             5,706           -8%
PubChem actives         7,472           -3%
PubChem pharmacol       5,311           -63%
PubChem MLSMR           233,284         -1%
PubChem BindingDB       24,203          -4%
PubChem ChEBI           7,428           -31%
DrugBank all            4,545           -7%
DrugBank approved       1,341           -3%
DrugBank experimental   2,999           -6%
DNP                     144,383         -26%
MDDR                    176,600         -4%
MDDR launched           1,435           -5%

Compound, Document and Protein Ratios

Database or subset   Documents   Protein ID type   Total proteins   Human proteins   Cpds per protein   Cpds per document
GVKBIO               87,747      Entrez Gene       3,292            1,468            604                22
GVKBIO journals      51,810      Entrez Gene       2,660            1,146            239                12
GVKBIO patents       35,937      Entrez Gene       1,765            952              815                40
GVKBIO DD            26,825      Entrez Gene       733              339              5                  0.14
GVKBIO CCD           27,286      Entrez Gene       1,224            610              7.2                0.32
WOMBAT               10,205      Swiss-Prot        1,979            1,095            91                 18
DrugBank             n/a         Swiss-Prot        1,625            1,356            2.8                n/a
PubChem BioAssay     n/a         RefSeq            72               n/a              n/a                n/a
PubChem PDB          n/a         RefSeq            818              n/a              14                 n/a
BindingDB            1,142       Swiss-Prot        297              97               112                19
DNP                  7,765       n/a               n/a              n/a              n/a                18
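As a worked example of the compounds-per-document ratio, dividing the filtered compound count by the document count reproduces the slide values for GVKBIO DD and GVKBIO CCD. This is a sketch under the assumption that the ratio is a simple quotient; the slides do not state which compound counts were used, so other rows may not come out exactly as shown.

```python
def compounds_per_document(filtered_compounds, documents):
    """Simple quotient; assumed, not confirmed, to be the slide's definition."""
    return filtered_compounds / documents


# (filtered compounds, documents) taken from the two tables above.
rows = {
    "GVKBIO DD": (3_675, 26_825),
    "GVKBIO CCD": (8_864, 27_286),
}
ratios = {
    name: round(compounds_per_document(cpds, docs), 2)
    for name, (cpds, docs) in rows.items()
}
print(ratios)
```

Ratios below 1 for the drug databases reflect many documents describing the same small set of compounds, whereas patents carry tens of compounds per document.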
Document Counts
[Bar chart] GVKBIO journals 51,810; GVKBIO patents 35,937; GVKBIO CCD 27,286; GVKBIO DD 26,825; WOMBAT 10,205; DNP 7,765; BindingDB 1,142

Compounds per Document
[Bar chart] GVKBIO patents 40; BindingDB 19; DNP 18; WOMBAT 18; GVKBIO journals 12; GVKBIO CCD 0.32; GVKBIO DD 0.14
Protein Counts
[Bar chart, all proteins / human proteins] GVKBIO 3,292 / 1,468; GVKBIO journals 2,660 / 1,146; GVKBIO patents 1,765 / 952; GVKBIO DD 733 / 339; GVKBIO CCD 1,224 / 610; WOMBAT 1,979 / 1,095; DrugBank 1,625 / 1,356; PubChem BioAssay 72 (all); PubChem PDB 818 (all); BindingDB 297 / 97

Compounds per Protein
[Bar chart] GVKBIO patents 815; GVKBIO journals 239; BindingDB 112; WOMBAT 91; PubChem PDB 14; GVKBIO CCD 7; GVKBIO DD 5; DrugBank 3
Pairwise Comparison Matrix: 23 X 23
(excerpt of 7 sources; diagonal = filtered compound count, off-diagonal = compounds in common)

                   GVKBIO      Journals   Patents     DD      CCD     WOMBAT    PubChem
GVKBIO             2,054,151   658,198    1,484,218   2,847   6,178   171,178   925,845
GVKBIO Journals                658,198    88,265      2,779   5,492   169,734   361,192
GVKBIO Patents                            1,484,218   1,404   3,149   45,564    633,115
GVKBIO DD                                             3,675   33      1,060     3,513
GVKBIO CCD                                                    8,864   2,652     7,925
WOMBAT                                                                180,856   133,124
PubChem                                                                         14,965,539
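The matrix entries can be sanity-checked with inclusion-exclusion: the GVKBIO total should equal journals plus patents minus their overlap. A minimal check using the numbers from the matrix (the function name is just an illustrative choice):

```python
def union_size(size_a, size_b, overlap):
    """Inclusion-exclusion for two sets: |A or B| = |A| + |B| - |A and B|."""
    return size_a + size_b - overlap


gvk_journals = 658_198
gvk_patents = 1_484_218
journals_and_patents = 88_265

gvk_total = union_size(gvk_journals, gvk_patents, journals_and_patents)
print(gvk_total)  # 2054151, the GVKBIO diagonal entry
```

The small journal-patent overlap (88,265 compounds) shows that the two GVKBIO extractions are largely disjoint.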
Coverage of Commercial Databases by PubChem
[Bar chart: % overlap with PubChem] MDDR launched 96; GVKBIO DD 96; GVKBIO CCD 89; WOMBAT 73; MDDR 64; DNP 57; GVKBIO journals 54; BACE1 journals 43; GVKBIO patents 42; BACE1 patents 38
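The coverage percentages appear to be the share of each source's filtered compounds that are also found in PubChem. A short sketch of that calculation, using overlap and size values from the pairwise matrix; the formula is an assumption, since the slides do not spell it out, and the results come out close to the plotted 73, 89 and 42.

```python
def coverage_percent(overlap_with_pubchem, source_size):
    """Share of a source's compounds also present in PubChem, in percent."""
    return 100.0 * overlap_with_pubchem / source_size


# (overlap with PubChem, filtered compound count) from the pairwise matrix.
pairs = {
    "WOMBAT": (133_124, 180_856),
    "GVKBIO CCD": (7_925, 8_864),
    "GVKBIO patents": (633_115, 1_484_218),
}
coverage = {name: coverage_percent(o, n) for name, (o, n) in pairs.items()}
for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}%")
```

Coverage is asymmetric: it is normalised by the commercial source's size, not by PubChem's, so a small source can be almost fully covered while contributing little to PubChem overall.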
DBs vs MLSMR (233,284)
[Bar chart: number of compounds overlapping with MLSMR] GVKBIO journals 5,672; PubChem actives 5,062; GVKBIO patents 2,908; WOMBAT 1,943; PubChem pharmacology 1,544; DNP 1,485; GVKBIO DD 1,147; MDDR 1,101; DrugBank 965; ChEBI 881; DrugBank approved 773; PubChem Prous 700; MDDR launched 662; BindingDB 445; PubChem PDB 424; GVKBIO CCD 258; DrugBank experimental 202; BACE1 patents 3; BACE1 journals 3
(Chart annotation: PubChem actives, 7,472 compounds in total)
Comparison of Journal Extractions
2008
Document ratios GVKBIO : WOMBAT : BindingDB = 50 : 9 : 1
GVKBio vs WOMBAT vs PubChem
2006
2008
Comparison of Approved Drug Collections
2006
2008
Public vs Commercial Total Merges
Significant public developments
ChEMBL / StARlite 31
• Compounds: 440,055
• Assay points: 1,936,969
• Papers: 26,299
• Protein targets: 3,512
• Human protein targets: 1,644
Also includes CandiStore and DrugStore
Minimum Information about a Bioactive Entity (MIABE)
• Guidelines for publication of data describing small molecule interactions with targets
• Available for public comment at http://www.psidev.info/
PubChem screening relevance
Exploiting Annotated Data
[Diagram components: New Patent/Literature of Interest; Integrated Search & Analytics; Search Tools; Predictive tools & models; Analyses; Visualisation tools; other labels: buy, service, target/compounds]
Uses shown in the diagram:
• Selectivity and activity optimisation
• Evaluate fast follower opportunities (patent busting)
• Rapid access to current knowledge
• Avoid redrawing of published structures
• Virtual screening (compound prioritisation for screening)
• Develop predictive models (QSARs, pharmacophore models)
• Exploration of chemical & biological space
“The big merge” requires:
• A common set of chemistry rules applied carefully & consistently across databases
• A common set of biology rules applied carefully & consistently across databases
Conclusions
• Filtration and normalisation facilitate rigorous comparisons
• Both shared and unique content can provide value
• Based on content per se the pendulum is swinging in the public
direction
• Patent compound coverage is increasing in PubChem
• Public and commercial sources offer different linking and mining
functionality
• Journal and patent compound-assay-protein mapping is covered
at a larger scale by commercial databases
• Public sources have essential complementarity to commercial
ones for the exploration of bioactive chemical space
References and Acknowledgments
Reference: “Complementarity between public and commercial
databases: new opportunities in medicinal chemistry informatics”
Chris Southan, Péter Várkonyi and Sorel Muresan, Current Topics In
Medicinal Chemistry, 2007, 7(15), 1502-8
Reference: “Quantitative Assessment of the Expanding
Complementarity between Public and Commercial Databases of
Bioactive Compounds” Chris Southan, Péter Várkonyi and Sorel
Muresan, submitted to J. Cheminformatics
Thanks to: Tudor Oprea from Sunset Molecular for WOMBAT data,
Steve Bryant, Paul Thiessen, Yanli Wang for PubChem advice, Jens
Sadowski and Niklas Blomberg from AstraZeneca R&D Mölndal for
the molecular hashcode software and helpful discussions.