Compound biological signatures facilitate phenotypic screening and target elucidation

Alvaro Cortes Cabrera1, Daniel Lucena Agell2, Mariano Redondo-Horcajo2, Isabel Barasoain2, Fernando Diaz2, Bernhard Fasching3, Paula Petrone1

1 Pharma Research & Early Development Informatics (pREDi), Roche Innovation Center Basel, Basel, Switzerland.
2 Laboratory of Microtubule Stabilizing Agents, Department of Physico-chemical Biology, Center for Biological Research, CSIC, Madrid, Spain.
3 Medicinal Chemistry, Pharma Research & Early Development (pRED), Roche Innovation Center Basel, Basel, Switzerland.

SUPPORTING INFORMATION

1. Compound biological fingerprints from HTS
2. Biosimilarity metric: Correction for size-dependency
3. Background distribution for aggregated biosimilarity
4. Chemical SEA and comparisons with BIOSEA
5. Model for colchicine site binders

1. Compound biological fingerprints from HTS

We built our high-throughput screening (HTS) fingerprints (HTSFP) from public assay data deposited in PubChem1,2, similarly to previous work in the field3. We prioritized HTS campaigns that met two requirements: (1) they tested the largest numbers of compounds, and (2) they were deposited by one of three institutions, namely The Scripps Research Institute, the Burnham Center for Chemical Genomics and the NIH Chemical Genomics Center. The rationale for these requirements is to maximize consistency in both screening technologies and compound collections, which helps to reduce noise and missing data in the fingerprints. PubChem provided 168 screening campaigns with more than 300,000 compounds tested, of which the 95 most populated screens were used to build our HTSFP dataset. For each assay, we calculated the average response and the standard deviation and used them to transform all results to Z-Score values. After building the HTSFP dataset, all molecules that had been evaluated in fewer than 12 assays were filtered out, because such poorly populated fingerprints cannot be used with BIOSEA.
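The per-assay normalization and the sparsity filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the paper: the function names and the nested-dictionary layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def zscore_assay(readouts):
    """Convert one assay's raw readouts {compound_id: value} to Z-Scores.

    Assumption: population standard deviation (pstdev); the source does not
    state whether the population or the sample estimator was used.
    """
    mu = mean(readouts.values())
    sd = pstdev(readouts.values())
    if sd == 0:
        # A constant assay carries no information; map every readout to 0.
        return {cid: 0.0 for cid in readouts}
    return {cid: (value - mu) / sd for cid, value in readouts.items()}


def build_fingerprints(assays, min_assays=12):
    """Build {compound_id: {assay_id: z}} from {assay_id: {compound_id: raw}}.

    Compounds evaluated in fewer than `min_assays` assays are dropped,
    mirroring the filter described in the text.
    """
    fingerprints = {}
    for assay_id, readouts in assays.items():
        for cid, z in zscore_assay(readouts).items():
            fingerprints.setdefault(cid, {})[assay_id] = z
    return {cid: fp for cid, fp in fingerprints.items() if len(fp) >= min_assays}
```

Missing readouts are simply absent from a compound's dictionary, so the matrix completeness can be computed later as the number of stored values divided by (compounds × assays).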
An in-house version of the HTSFP based on Roche proprietary data sets (HTSFP-Roche) was compiled and processed in a similar fashion. HTSFP-Roche was used for the hit expansion protocol that was part of the NKCC1 program. The 107 assays in HTSFP-Roche were selected according to the following criteria: (1) a significant part of the library was used in the primary screen, and (2) a follow-up dose-response study was carried out whose HTS hits were consistent with those of the primary screen.

Considering each HTSFP data set as a matrix of N compounds (rows) by M assays (columns), we can calculate the amount of missing readouts. HTSFP comprises a total of 365,231 molecules and 95 assays, with a data completeness of 92% of the full matrix. HTSFP-Roche spans a total of 1,277,548 molecules by 107 assays, with a data completeness of 46%. This difference in the number of missing points is due to the continuous evolution of the Roche compound library, whereas for HTSFP special care was taken to include only results from institutes that shared the same library to a certain extent.

2. Biosimilarity metric: Correction for size-dependency

We describe the similarity of two HTSFP with a derived version of the mutual information metric (I):

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) \qquad [1]$$

where X and Y are the HTSFP of two compounds, x and y are the elements of those fingerprints, p(x,y) is their joint probability and p(x), p(y) are the marginal probabilities. To reduce the number of arbitrary parameters4, we employ the method proposed by Kraskov et al.5, which is based on entropy estimates from k-nearest neighbor distances. Since I(X;Y) has different upper bounds depending on the size and content of the HTSFP being compared, we have chosen the scheme of Strehl and Ghosh6 to produce a normalized mutual information (NMI):

$$NMI(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}} \qquad [2]$$

where I(X;Y) is the raw mutual information, and H(X) and H(Y) are the individual entropies of the HTSFP of the two molecules.
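To make Equations 1 and 2 concrete, the sketch below evaluates them with a simple plug-in (histogram) estimator on discretized Z-Scores. This is a deliberately swapped-in technique: it is not the k-nearest-neighbor estimator of Kraskov et al. used in this work, and the bin edges (±2) and all function names are illustrative assumptions. It only shows how I(X;Y), H(X) and H(Y) combine into the NMI.

```python
from collections import Counter
from math import log, sqrt


def discretize(values, edges=(-2.0, 2.0)):
    """Map Z-Scores to bin indices 0/1/2 (below, between, above the edges).

    The edges are illustrative only; binning is exactly the kind of
    arbitrary parameter that the kNN estimator is meant to avoid.
    """
    return [sum(v > e for e in edges) for v in values]


def entropy(labels):
    """Plug-in Shannon entropy H of a sequence of discrete labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Plug-in estimate of Eq. 1 for two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        c / n * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def nmi(xs, ys):
    """Eq. 2: I(X;Y) / sqrt(H(X) * H(Y)); defined as 0 if either entropy is 0."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(xs, ys) / sqrt(hx * hy)
```

With this normalization, two identical label sequences give an NMI of 1, and two statistically independent sequences give 0, regardless of how many distinct labels they contain.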
The NMI can only be calculated for the HTSFP components that are common to both fingerprints, and a reduced HTSFP (HTSFP-r) can be introduced for this purpose. The definition of a reduced fingerprint HTSFP-r introduces into the NMI a dependency on the number of assays used. Different solutions have been proposed: Riniker et al.3 completed the HTSFP with a constant value of 0 wherever a value was missing; Wassermann et al.7 proposed a transformation of the metric (i.e. the Pearson correlation coefficient) into a frequency value, using thousands of comparisons in a reference set as a background. Building on this last idea, we performed one million NMI calculations between randomly chosen molecules within the HTSFP dataset. Those comparisons included all fingerprint sizes between n = 13 (a requirement of our kNN mutual information estimation method with k = 10) and a maximum of n = 95 (HTSFP) or n = 107 (HTSFP-Roche). Histograms of the random comparisons were built for each number of assays in common, n = 13, 14, 15, ..., 95.

We observed that those results follow half-normal distributions with a parameter µ of 0 (as expected for NMI values from random comparisons of unrelated compounds) and σ parameters that depend on the number of assays compared, n, a dependency that is directly related to sample size effects on the estimation of I8. If the sampling is low, that is, if only a few assay readouts are available, the error in the entropy estimation is higher and the NMI is therefore overestimated. When more assays are available, the estimation of the NMI is more accurate and its value is closer to 0 for the comparison of two random, unrelated HTSFP, which results in a lower σ parameter and a narrower distribution of NMI values. σ(n) follows a power-law behavior that can be parametrized and used to define the Z-Score of the NMI (ZNMI):

$$ZNMI(X;Y) = \frac{I(X;Y)}{\sigma(n)\,\sqrt{H(X)\,H(Y)}} \qquad [3]$$

3. Background distribution for aggregated biosimilarity

The background parametrization of BIOSEA is based on the distribution of similarities between random sets of molecule HTSFP. First, pairs of random HTSFP sets (of sizes m and n) are generated, each with a number of elements k in the range of interest (between k = 1 and k = 100). The two sets of each pair are compared in a pairwise manner using ZNMI and the values are added up. The process is repeated 200 times for each final number of comparisons, l = m × n, and the mean (µ) and standard deviation (σ) are calculated. The µ and σ values are plotted against the number of comparisons and fitted to power laws (Supplementary Table 2), which are used to estimate the parameters µ(l) and σ(l) for any given number of comparisons l in the definition of the Z-Score of aggregated values (ZAG):

$$ZAG = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} ZNMI(X_i;Y_j) - \mu(l)}{\sigma(l)} \qquad [4]$$

where X and Y are the two sets of HTSFP. The ZAG distribution is then fitted to a Generalized Extreme Value Distribution (GEVD) and its location, scale and shape parameters are estimated. The resulting GEVD is used to estimate the probability of obtaining a given ZAG at random, PZ(ZAG > zAG). An expectation value (e-value = PZ × Nsets) is obtained by multiplying this probability by the number of sets used (Nsets = 265, the number of targets in our database).

Minimal and optimal ZNMI thresholds for pairs of HTSFP were explored by scanning values in the range 0.5 to 4.5 in increments of 0.1 and evaluating, for each proposed threshold, the goodness of fit (χ2) of both the Extreme Value and the Gaussian distributions and the ratio of their fitting errors (Supplementary Figure 4). The optimal ZNMI threshold, which is dataset specific, was obtained by selecting the best ratio of Gaussian to GEVD fitting errors (Supplementary Figure 4), while the minimal value was determined as the first threshold for which the GEVD provided a better fit than the Gaussian distribution.
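The chain from pairwise ZNMI values to an e-value (Equations 3 and 4 followed by the GEVD tail probability) can be sketched as below. Several points are assumptions rather than statements from the text: the function names are hypothetical, the power-law coefficients for σ(n) are not reported here and must be supplied by the caller, the threshold is assumed to act as in SEA (only pairwise scores at or above it are summed), and the GEVD is written in the common parametrization in which a positive shape gives a heavy upper tail.

```python
from math import exp, expm1


def power_law(a, b, x):
    """y = a * x**b, the functional form used for the background fits."""
    return a * x ** b


def znmi(nmi_value, n_assays, sigma_ab):
    """Eq. 3: NMI divided by sigma(n); sigma_ab = (a, b) must be supplied."""
    return nmi_value / power_law(*sigma_ab, n_assays)


def zag(znmi_values, mu_ab, sigma_ab, threshold=None):
    """Eq. 4 for the l = m * n pairwise scores of two HTSFP sets.

    Assumption: scores below `threshold` are excluded from the sum while
    l still counts all comparisons.
    """
    l = len(znmi_values)
    raw = sum(v for v in znmi_values if threshold is None or v >= threshold)
    return (raw - power_law(*mu_ab, l)) / power_law(*sigma_ab, l)


def gev_sf(z, loc, scale, shape):
    """P(Z > z) for a GEVD (assumed convention: shape > 0 is heavy-tailed)."""
    t = (z - loc) / scale
    if shape == 0:
        x = exp(-t)
    else:
        base = 1 + shape * t
        if base <= 0:
            # z lies outside the support: below it (P = 1) or above it (P = 0).
            return 1.0 if shape > 0 else 0.0
        x = base ** (-1 / shape)
    return -expm1(-x)  # equals 1 - exp(-x), accurate for very small x


def e_value(z, gev_params, n_sets=265):
    """Expectation value: tail probability times the number of target sets."""
    return gev_sf(z, *gev_params) * n_sets


# Example with the "Optimal (4.0)" parameters as read from Supplementary
# Tables 2 and 3; the four pairwise scores are hypothetical.
scores = [4.2, 5.1, 0.3, 6.0]
z_example = zag(scores, mu_ab=(5.00e-3, 0.99), sigma_ab=(1.06e-1, 5.77e-1),
                threshold=4.0)
e_example = e_value(z_example, gev_params=(-4.52e-1, 7.21e-1, 5.07e-2))
```

Using `expm1` keeps the tail probability meaningful for large ZAG values, where `1 - exp(-x)` would lose precision.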
Thresholds are reported in Supplementary Table 3.

4. Chemical SEA and comparisons with BIOSEA

Following the original description by Keiser et al.9, a version of SEA was implemented using the chemical structures and targets in the target identification database (Methods section in the Main Article), the Tanimoto coefficient as the similarity metric and RDKit Morgan fingerprints10 (radius = 2). The background parametrization of SEA was carried out analogously to that of the BIOSEA method for consistency in the comparison, and an optimal cutoff was determined (Tc = 0.29). SEA target predictions for the 711 drugs were made as described for BIOSEA, using a ZAG of 8.0 as a cutoff. This resulted in a total of 1708 predictions, of which 494 could be verified against the literature or database annotations; 324 of these were positive (66%). In comparison, the percentage of literature-confirmed positive predictions for BIOSEA is 50%. Only 51 SEA predictions were shared with BIOSEA, of which 28 were positively confirmed. Thirteen shared positive predictions were pharmacophorically related to at least one reference compound in the target training set, in agreement with the high number of SEA predictions that can be traced to 2D or 3D chemical similarity to compounds in the training set (80% of SEA positive predictions). This comparison shows that BIOSEA is able to identify relationships in a manner complementary to SEA, reaching beyond chemically obvious predictions.

5. Model for colchicine site binders

3D models of the CT1, CT3 and CT4 compounds were generated with the software CORINA11 from their SMILES strings extracted from PubChem. Conformational analysis was performed using ALFA12 to generate a maximum of 500 diverse conformers per molecule.
3D coordinates of the colchicine-thiol derivative (2-mercapto-N-[1,2,3,10-tetramethoxy-9-oxo-5,6,7,9-tetrahydro-benzo[a]heptalen-7-yl]acetamide)13 were extracted from its tubulin complex (PDB code 1SA0) and used as the reference to superimpose compound conformations with an in-house clone of ROCS14. The best superimposition according to the TanimotoCombo score was selected for each compound.

Supplementary Table 1. Top 10 most highly informative assays (PubChem Assay ID) for Tubulin active compounds. Targets in bold are cell-based assays.

| Assay | CT1 | CT2 | CT3 | CT4 | CT5 |
|---|---|---|---|---|---|
| 1 | GDH-TPI (588335) | Procaspase-3 (463141) | MDM2/MDMX (485346) | CHRM1 (588814) | Gli-Sufu (588413) |
| 2 | NCOA1 (588354) | Procaspase-7 (463210) | PKA-R1A (504707) | SENP7 (434973) | NFkb (588850) |
| 3 | APAF-1 (489031) | MC4R (540295) | MC4R (540295) | APOBEC3A (493011) | CHRM1 (588852) |
| 4 | DAGLB (504411) | TSHR (504810) | Peg3 (588405) | APOBEC3G (493012) | OPRM1-OPRD1 (504326) |
| 5 | DRD2 (485347) | TSHR (504812) | STEP (588621) | MC4R (540308) | CTSL-1 (1906) |
| 6 | Procaspase-3 (463141) | Bacterial death (E. coli) (449728) | Giardia lamblia (540267) | TLR9-MyD88 (504734) | FXN (540364) |
| 7 | SIAE (1053197) | SCP-1 (493091) | NFkb (435022) | Tim23 (463212) | Giardia lamblia (540267) |
| 8 | HKDC1 (493187) | Abl1-Rin1 (588664) | NFkb (435003) | PAFAH2 (492956) | STEP (588621) |
| 9 | Fam108B (1947) | ATG4B (504462) | NFkb (588850) | Vanilloid Receptor 1 (540277) | NTR1 (493036) |
| 10 | HTR5A (504634) | HCV core dimerization (1899) | Apoptotic arm of the Unfolded Protein response (449763) | SCP-1 (493091) | Apoptotic arm of the Unfolded Protein response (449763) |

Supplementary Table 2. Aggregated biosimilarity fitting parameters for the optimal and minimal ZNMI thresholds (e.g. $\mu = a x^{b}$).

| ZNMI threshold | Minimal (2.6) | Optimal (4.0) |
|---|---|---|
| Mean coefficient (a) | 3.58E-02 | 5.00E-03 |
| Mean exponent (b) | 0.99 | 0.99 |
| Mean Pearson r2 | 0.9989 | 0.9941 |
| Standard deviation coefficient (a) | 1.24E-01 | 1.06E-01 |
| Standard deviation exponent (b) | 7.06E-01 | 5.77E-01 |
| Standard deviation r2 | 0.9841 | 0.9389 |

Supplementary Table 3. Extreme Value Distribution and Gaussian parameters for the optimal and minimal ZNMI thresholds.

| ZNMI threshold | Minimal (2.6) | Optimal (4.0) |
|---|---|---|
| EVD location | -4.49E-01 | -4.52E-01 |
| EVD scale | 9.14E-01 | 7.21E-01 |
| EVD shape | -7.67E-02 | 5.07E-02 |
| EVD Pearson r2 | 0.9897 | 0.9988 |
| EVD χ2 | 6.80E-02 | 6.9E-03 |
| Gaussian µ | -1.05E-01 | -2.58E-01 |
| Gaussian σ | 9.73E-01 | 8.08E-01 |
| Gaussian Pearson r2 | 0.9943 | 0.9716 |
| Gaussian χ2 | 1.16E-01 | 1.59E-01 |

Supplementary Table 4. Assay panel used for target validation, available from CEREP/Eurofins (http://www.cerep.fr/).

| Assay | Target type | Source | Substrate / Stimulant | Measured component | Incubation | Detection method | Reference compound | Description |
|---|---|---|---|---|---|---|---|---|
| MAO-A | Functional | Human placenta | Kynuramine | 4-hydroxyquinoline | 30 min/30°C | Photometry | Clorgyline | Weyler, W. and Salach, J.I. (1985) J. Biol. Chem., 260: 13199-13207. |
| AChE | Functional | Human recombinant (HEK-293 cells) | Acetylthiocholine | 5-thio-2-nitrobenzoic acid | 30 min/RT | Photometry | Galanthamine | Ellman, G.L. et al. (1961) Biochem. Pharmacol., 7: 88-95. |
| CYP1A2 | Functional | Human recombinant CYP | CEC (5 µM) | | 30 min/37°C | Fluorimetry | Furafylline | Crespi, C.L. et al. (1997) Anal. Biochem., 248: 188-190. |
| NET | Functional | Rat hypothalamus synaptosomes | | [3H]NE incorporation into synaptosomes | 20 min/37°C | Scintillation counting | Protriptyline | Perovic, S. and Muller, W.E.G. (1995) Arzneim-Forsch. Drug Res., 45: 1145-1148. |
| A2B (agonism) | Functional | Human recombinant (HEK-293 cells) | | cAMP | 10 min/37°C | Homogeneous Time Resolved Fluorescence | NECA | Cooper, J. et al. (1997) Brit. J. Pharmacol., 122: 546-550. |
| A2B (antagonism) | Functional | Human recombinant (HEK-293 cells) | NECA | cAMP | 10 min/37°C | Homogeneous Time Resolved Fluorescence | XAC | Cooper, J. et al. (1997) Brit. J. Pharmacol., 122: 546-550. |
| GR | Binding | IM-9 cells (cytosol) | [3H]dexamethasone | | 6 hr/4°C | Scintillation counting | Dexamethasone | Clark, A.F. et al. (1996) Invest. Ophthalmol. Vis. Sci., 37: 805-813. |
| COX1 | Functional | Human recombinant (Sf9 cells) | Arachidonic acid | PGE2 | 5 min/RT | Homogeneous Time Resolved Fluorescence | Diclofenac | Glaser, K. et al. (1995) Eur. J. Pharmacol., 281: 107-111. |
| COX2 | Functional | Human recombinant (Sf9 cells) | Arachidonic acid | PGE2 | 5 min/RT | Homogeneous Time Resolved Fluorescence | NS 398 | Glaser, K. et al. (1995) Eur. J. Pharmacol., 281: 107-111. |
| A3 (agonism) | Functional | Human recombinant (CHO cells) | | cAMP | 20 min/37°C | Homogeneous Time Resolved Fluorescence | IB-MECA | Yates, L. et al. (2006) Auton. Autacoid Pharmacol., 26: 191-200. |
| A3 (antagonism) | Functional | Human recombinant (CHO cells) | IB-MECA | cAMP | 20 min/37°C | Homogeneous Time Resolved Fluorescence | MRS 1220 | Yates, L. et al. (2006) Auton. Autacoid Pharmacol., 26: 191-200. |
| Alpha-2B | Binding | Human recombinant (CHO cells) | [3H]RX 821002 | | 60 min/RT | Scintillation counting | Yohimbine | Devedjian, J.C. et al. (1994) Eur. J. Pharmacol., 252: 43-49. |
| Alpha-1D | Binding | | [3H]prazosin | | 60 min/RT | Scintillation counting | Prazosin | Kenny, B.A. et al. (1995) Brit. J. Pharmacol., 115: 981-986. |
| A2A (agonism) | Functional | PC12 cells (endogenous) | | cAMP | 10 min/RT | Homogeneous Time Resolved Fluorescence | NECA | Gao, Z. et al. (2001) J. Pharmacol. Exp. Ther., 298: 209-218. |
| A2A (antagonism) | Functional | PC12 cells (endogenous) | NECA | cAMP | 10 min/RT | Homogeneous Time Resolved Fluorescence | ZM 241385 | Gao, Z. et al. (2001) J. Pharmacol. Exp. Ther., 298: 209-218. |
| H1 (agonism) | Functional | Human recombinant (HEK-293 cells) | | intracellular [Ca2+] | RT | Fluorimetry | Histamine | Miller, T.R. et al. (1999) J. Biomol. Screen., 4: 249-258. |
| H1 (antagonism) | Functional | Human recombinant (HEK-293 cells) | Histamine | intracellular [Ca2+] | RT | Fluorimetry | Pyrilamine | Miller, T.R. et al. (1999) J. Biomol. Screen., 4: 249-258. |
| CA2 | Functional | Human erythrocytes | 4-nitrophenyl acetate | 4-nitrophenol | 20 min/RT | Photometry | Acetazolamide | Iyer, R. et al. (2006) J. Biomol. Screen., 11: 782-791. |
| CLK1 | Functional | Human recombinant | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 30 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Muraki, M. et al. (2004) J. Biol. Chem., 279: 24246-24254. |
| A1 (agonism) | Functional | Human recombinant (CHO cells) | | Impedance | 28°C | Cellular dielectric spectroscopy | CPA | Cordeaux, Y. et al. (2004) Brit. J. Pharmacol., 143: 705-714. |
| A1 (antagonism) | Functional | Human recombinant (CHO cells) | CPA | Impedance | 28°C | Cellular dielectric spectroscopy | DPCPX | Cordeaux, Y. et al. (2004) Brit. J. Pharmacol., 143: 705-714. |
| GSK3a | Functional | Human recombinant | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 60 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Meijer, L. et al. (2003) Chem. Biol., 10: 1255-1266. |
| JNK3 | Functional | Human recombinant (E. coli) | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 30 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Lisnock, J.M. et al. (2000) Biochemistry, 39: 3141-3148. |

Supplementary Figure 1. CT5-Tubulin binding competition results.

Supplementary Figure 2. CT3-Tubulin binding competition results.

Supplementary Figure 3. HTSFP information analysis for tubulin hits and their biological reference compounds. Colored parts represent assays that are shared across hits and references.
Black boxes denote the 10 most highly informative assays (HIA) for each compound. Assay names and their classification into cell-based or biochemical are given in Supplementary Table 1.

Supplementary Figure 4. Extreme Value Distribution (red) and Gaussian distribution (black) fits (χ2) versus the ZNMI threshold (left), and ratio of residuals of the Gaussian distribution and Extreme Value Distribution fits versus the ZNMI threshold (right).

Supplementary Figure 5. Dose response curves for the 5 unreported positive compounds in BIOSEA target predictions.

REFERENCES

1 Wang, Y. et al. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research 37, W623-W633 (2009).
2 Wang, Y. et al. PubChem BioAssay: 2014 update. Nucleic acids research 42, D1075-D1082 (2014).
3 Riniker, S., Wang, Y., Jenkins, J. L. & Landrum, G. A. Using Information from Historical High-Throughput Screens to Predict Active Compounds. Journal of chemical information and modeling 54, 1880-1891 (2014).
4 Fraser, A. M. & Swinney, H. L. Independent coordinates for strange attractors from mutual information. Physical review A 33, 1134 (1986).
5 Kraskov, A., Stögbauer, H. & Grassberger, P. Estimating mutual information. Physical review E 69, 066138 (2004).
6 Strehl, A. & Ghosh, J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research 3, 583-617 (2003).
7 Wassermann, A. M. et al. A Screening Pattern Recognition Method Finds New and Divergent Targets for Drugs and Natural Products. ACS chemical biology (2014).
8 Gil, M., Fernández, M. & Martinez, I. The choice of sample size in estimating mutual information. Applied Mathematics and Computation 27, 201-216 (1988).
9 Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175-181 (2009).
10 Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org. Accessed 26/02/2016.
11 Corina v. 3.491 (Molecular Networks, 2013).
12 Klett, J., Cortés-Cabrera, Á., Gil-Redondo, R., Gago, F. & Morreale, A. ALFA: Automatic ligand flexibility assignment. Journal of chemical information and modeling 54, 314-323 (2014).
13 Ravelli, R. B. et al. Insight into tubulin regulation from a complex with colchicine and a stathmin-like domain. Nature 428, 198-202 (2004).
14 Rush, T. S., Grant, J. A., Mosyak, L. & Nicholls, A. A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. Journal of medicinal chemistry 48, 1489-1495 (2005).