ChemAxon European UGM - June 2009, Budapest

Complementarity between Public and Commercial Databases of Bioactive Compounds
Sorel Muresan
AstraZeneca R&D Mölndal
DECS Global Compound Sciences

Driver: explosion in SAR knowledgebases
• Chemical information landscape changing fast
  • ChemSpider 20M, PubChem 18M
  • GVKBio, WOMBAT (commercial products, include biological annotations)
  • ChemNavigator 30M, eMolecules 8M (compounds from suppliers)
  • Patent corpus?
• Impact of large scale MedChem, patent & public datasources
  • >11M SAR points in GVKBio
  • >800 bioassays in PubChem
  • Public domain tox and ADR datasets
• Massive information source for competitive intelligence

Entity Relationships: in vitro activity-to-compound-to-protein mappings
[Diagram: expert extraction converts unstructured data from documents into structured entries in relational databases, linking Document, Assay, Result, Compound and Sequence. The slide shows an example protein sequence.]

Comparing Selected Commercial and Public Bioactive Compound Sources: 2006 and 2008
• In-house structure standardisation to define unique compound content and facilitate comparisons between databases
• Selected databases with
  • Unrestricted download
  • Established utility
  • Compound-to-protein and other types of bioactive links
  • Subsetting options
• Compare protein mappings, document counts and the effect of filtration on compound counts
• Produce an all-vs-all overlap matrix
• Review overlap and content differences for 2008
• Compare between 2006 and 2008
  • Selected Venn-type comparisons
  • Compare selected larger merges

Databases of Bioactive Compounds
[Figure: overview of the public and commercial databases]

Structural standardisation
1. Conversion of all sources to nonisomeric SMILES
2. Removal of fragments, such as mixtures, counter ions or water
3. Neutralisation of remaining charges
4. Derivation of a canonical tautomer using LEATHERFACE, an in-house molecular editor based on SMARTS rules
5. Generation of unique molecular hashcodes
6. Retain unique structures by comparing molecular hashcodes

Compound Numbers for Sources and Subsets

Dataset                  Filtered cpds   Filtration reduction
GVKBIO                       2,054,151    -8%
GVKBIO journals                658,198    -8%
GVKBIO patents               1,484,218    -7%
GVKBIO DD                        3,675    -4%
GVKBIO CCD                       8,864    -1%
GVKBIO BACE1                     5,228   -11%
GVKBIO BACE1 journals              389    -6%
GVKBIO BACE1 patents             4,901   -11%
WOMBAT                         180,856   -18%
PubChem                     14,965,539   -23%
PubChem Prous                    4,652    -2%
PubChem PDB                      5,706    -8%
PubChem actives                  7,472    -3%
PubChem pharmacol                5,311   -63%
PubChem MLSMR                  233,284    -1%
PubChem BindingDB               24,203    -4%
PubChem ChEBI                    7,428   -31%
DrugBank all                     4,545    -7%
DrugBank approved                1,341    -3%
DrugBank experimental            2,999    -6%
DNP                            144,383   -26%
MDDR                           176,600    -4%
MDDR launched                    1,435    -5%

Compound Document and Protein Ratios

Database or subset   Documents   Protein ID type   Total proteins   Human proteins   Cpds-per-protein   Cpds-per-document
GVKBIO                  87,747   Entrez Gene                3,292            1,468                604                  22
GVKBIO journals         51,810   Entrez Gene                2,660            1,146                239                  12
GVKBIO patents          35,937   Entrez Gene                1,765              952                815                  40
GVKBIO DD               26,825   Entrez Gene                  733              339                  5                0.14
GVKBIO CCD              27,286   Entrez Gene                1,224              610                7.2                0.32
WOMBAT                  10,205   Swiss-Prot                 1,979            1,095                 91                  18
DrugBank                  n.a.   Swiss-Prot                 1,625            1,356                2.8                 n/a
PubChem BioAssay          n.a.   RefSeq                        72              n/a                n/a                 n/a
PubChem PDB               n.a.   RefSeq                       818              n/a                 14                 n/a
BindingDB                1,142   Swiss-Prot                   297               97                112                  19
DNP                      7,765   n/a                          n/a              n/a                n/a                  18

Document Counts
[Bar chart: number of documents per database; values as in the ratios table]

Compounds per Document
[Bar chart: compounds-per-document per database; values as in the ratios table]

Protein Counts
[Bar chart: all proteins and human proteins per database; values as in the ratios table]

Compounds per Protein
[Bar chart: compounds-per-protein per database; values as in the ratios table]

Pairwise Comparison Matrix: 23 X 23
(Extract; diagonal entries are the unique compound counts of each source, off-diagonal entries are shared compounds.)

                  GVKBIO      GVKBIO journals   GVKBIO patents   GVKBIO DD   GVKBIO CCD   WOMBAT    PubChem
GVKBIO            2,054,151   658,198           1,484,218        2,847       6,178        171,178   925,845
GVKBIO journals               658,198           88,265           2,779       5,492        169,734   361,192
GVKBIO patents                                  1,484,218        1,404       3,149        45,564    633,115
GVKBIO DD                                                        3,675       33           1,060     3,513
GVKBIO CCD                                                                   8,864        2,652     7,925
WOMBAT                                                                                    180,856   133,124
PubChem                                                                                             14,965,539

Coverage of Commercial Databases by PubChem
(% overlap with PubChem)
MDDR launched      96
GVKBIO DD          96
GVKBIO CCD         89
WOMBAT             73
MDDR               64
DNP                57
GVKBIO journals    54
BACE1 journals     43
GVKBIO patents     42
BACE1 patents      38

DBs vs MLSMR (233,284)
(Number of compounds overlapping with MLSMR; callout on slide: PubChem actives 7,472)
GVKBIO journals         5,672
PubChem actives         5,062
GVKBIO patents          2,908
WOMBAT                  1,943
PubChem pharmacology    1,544
DNP                     1,485
GVKBIO DD               1,147
MDDR                    1,101
DrugBank                  965
ChEBI                     881
DrugBank approved         773
PubChem Prous             700
MDDR launched             662
BindingDB                 445
PubChem PDB               424
GVKBIO CCD                258
DrugBank experimental     202
BACE1 patents               3
BACE1 journals              3

Comparison of Journal Extractions 2008
Document ratios GVK:WOM:BDb 50:9:1

GVKBio vs WOMBAT vs PubChem
[Venn diagrams: 2006 and 2008]

Comparison of Approved Drug Collections
[Venn diagrams: 2006 and 2008]

Public vs Commercial Total Merges
[Figure]

Significant public developments
ChEMBL StARlite 31
• Compounds: 440,055
• Assay points: 1,936,969
• Papers: 26,299
• Protein targets: 3,512
• Human protein targets: 1,644
Also includes CandiStore and DrugStore
Minimum Information about a Bioactive Entity (MIABE)
• Guidelines for publication of data describing small molecule interactions with targets
• Available for public comment at http://www.psidev.info/
PubChem screening relevance

Exploiting Annotated Data
New Patent/Literature of Interest
• Selectivity and activity optimisation
• Evaluate fast follower opportunities (patent busting)
• Rapid access to current knowledge
• Avoid redrawing of published structures
• Virtual screening (compound prioritisation for screening)
• Develop predictive models (QSARs, pharmacophore models)
• Exploration of chemical & biological space

Integrated Search & Analytics
“The big merge” requires:
• A common set of chemistry rules applied carefully & consistently across databases
• A common set of biology rules applied carefully & consistently across databases
[Diagram: “The big merge” (target/compounds) feeding search tools, predictive tools & models, analyses and visualisation tools]

Conclusions
• Filtration and normalisation facilitate rigorous comparisons
• Both shared and unique content can provide value
• Based on content per se the pendulum is swinging in the public direction
• Patent compound coverage is increasing in PubChem
• Public and commercial sources offer different linking and mining functionality
• Journal and patent compound-assay-protein mapping is covered at a larger scale by commercial databases
• Public sources have essential complementarity to commercial ones for the exploration of bioactive chemical space

References and Acknowledgments
Reference: “Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics”, Chris Southan, Péter Várkonyi and Sorel Muresan, Current Topics in Medicinal Chemistry, 2007, 7(15), 1502-8
Reference: “Quantitative Assessment of the Expanding Complementarity between Public and Commercial Databases of Bioactive Compounds”, Chris Southan, Péter Várkonyi and Sorel Muresan, submitted to J. Cheminformatics
Thanks to: Tudor Oprea from Sunset Molecular for WOMBAT data; Steve Bryant, Paul Thiessen and Yanli Wang for PubChem advice; Jens Sadowski and Niklas Blomberg from AstraZeneca R&D Mölndal for the molecular hashcode software and helpful discussions.
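
Worked example: overlap matrix and PubChem coverage
The all-vs-all overlap matrix and the PubChem coverage percentages could be derived roughly as sketched below. This is a minimal illustration, not the authors' software: it assumes each database has already been standardised and reduced to a set of molecular hashcodes, and the function names and toy hashcodes are hypothetical. The shared and total compound counts in `slide_counts` are copied from the pairwise comparison matrix in the slides.

```python
def overlap_matrix(sources):
    """Count shared unique structures for every ordered pair of databases.

    `sources` maps a database name to a set of molecular hashcodes
    (assumed: one hashcode per standardised, unique structure).
    The diagonal entry (a, a) is simply the size of database a.
    """
    names = list(sources)
    return {(a, b): len(sources[a] & sources[b]) for a in names for b in names}


def coverage_percent(shared, total):
    """Percentage of a database's unique compounds also found in another source."""
    return 100.0 * shared / total


# Toy input with made-up hashcodes, only to show the shape of the data.
toy_sources = {
    "A": {"h1", "h2", "h3"},
    "B": {"h2", "h3", "h4", "h5"},
}
toy_matrix = overlap_matrix(toy_sources)

# (shared with PubChem, total unique compounds), taken from the slide matrix.
slide_counts = {
    "GVKBIO DD": (3_513, 3_675),
    "GVKBIO CCD": (7_925, 8_864),
    "WOMBAT": (133_124, 180_856),
    "GVKBIO journals": (361_192, 658_198),
    "GVKBIO patents": (633_115, 1_484_218),
}
coverage = {
    name: coverage_percent(shared, total)
    for name, (shared, total) in slide_counts.items()
}

for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}% of compounds also in PubChem")
```

The computed percentages fall within about one percentage point of the values plotted in the coverage chart; the small differences presumably reflect how the chart values were rounded.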