Compound biological signatures facilitate phenotypic screening and target
elucidation
Alvaro Cortes Cabrera1, Daniel Lucena Agell2, Mariano Redondo-Horcajo2, Isabel Barasoain2,
Fernando Diaz2, Bernhard Fasching3, Paula Petrone1
1 Pharma Research & Early Development Informatics (pREDi), Roche Innovation Center Basel, Basel, Switzerland.
2 Laboratory of microtubule stabilizing agents, Department of Physico-chemical Biology, Center for Biological Research, CSIC, Madrid, Spain.
3 Medicinal Chemistry, Pharma Research & Early Development (pRED), Roche Innovation Center Basel, Basel, Switzerland.
SUPPORTING INFORMATION
1. Compound biological fingerprints from HTS
2. Biosimilarity metric: Correction for size-dependency
3. Background distribution for aggregated biosimilarity
4. Chemical SEA and comparisons with BIOSEA
5. Model for colchicine site binders
1. Compound biological fingerprints from HTS
We built our high-throughput screening fingerprints (HTSFP) from public assay data deposited in
PubChem1,2, following previous work in the field3. We prioritized HTS campaigns that met two
requirements: (1) they tested the largest numbers of compounds, and (2) they were deposited by one of
three institutions, namely The Scripps Research Institute, the Burnham Center for Chemical Genomics
and the NIH Chemical Genomics Center. The rationale for these requirements is to maximize
consistency in both screening technologies and compound collections, which helps to reduce noise and
missing data in the fingerprints. PubChem provided 168 screening campaigns with more than 300,000
compounds tested, of which the 95 most populated screens were used to build our HTSFP dataset. For
each assay, we calculated the mean response and the standard deviation and transformed all results
to Z-score values. After building the HTSFP dataset, all molecules that were evaluated in fewer than
12 assays were filtered out to exclude poorly populated fingerprints that cannot be used with
BIOSEA.
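The per-assay standardization and the fingerprint filter described above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in this work: the matrix layout (rows are compounds, columns are assays, None marks a missing readout), the function names and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

MIN_ASSAYS = 12  # from the text: molecules tested in fewer assays are removed


def zscore_columns(matrix):
    """Return a copy of the matrix with every assay column converted to Z-scores.

    Missing readouts (None) are left untouched. The population standard
    deviation is used here, which is an assumption.
    """
    result = [list(row) for row in matrix]
    n_assays = len(matrix[0])
    for j in range(n_assays):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # assay without any readout
        mu = mean(observed)
        sd = pstdev(observed)
        for row in result:
            if row[j] is not None:
                # A constant assay carries no information: map it to 0.
                row[j] = (row[j] - mu) / sd if sd > 0 else 0.0
    return result


def drop_sparse(matrix, min_assays=MIN_ASSAYS):
    """Keep only fingerprints with at least min_assays measured readouts."""
    return [row for row in matrix
            if sum(value is not None for value in row) >= min_assays]


# Hypothetical toy data: 3 compounds x 2 assays.
toy = [[1.0, None], [2.0, 5.0], [3.0, 7.0]]
toy_z = zscore_columns(toy)
toy_kept = drop_sparse(toy_z, min_assays=2)  # small cutoff for the toy case only
```

Standardizing before filtering keeps the assay statistics independent of which compounds are later discarded; the order is a design choice of this sketch.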
An in-house version of the HTSFP based on Roche proprietary data sets (HTSFP-Roche) was
compiled and processed in a similar fashion. HTSFP-Roche was used for the hit expansion protocol
that was part of the NKCC1 program. The 107 assays in HTSFP-Roche were selected according
to the following criteria: (1) a significant part of the library was used in the primary screen, and
(2) a follow-up dose-response study was carried out and gave results consistent with the primary
screen in terms of HTS hits.
Considering each HTSFP data set as a matrix of N compounds (rows) by M assays (columns), we can
quantify the fraction of missing readouts. HTSFP comprises a total of 365,231 molecules and 95
assays, with a data completeness of 92% of the full matrix. HTSFP-Roche spans a total of 1,277,548
molecules by 107 assays, with a data completeness of 46%. This difference in the number of missing
points is due to the continuous evolution of the Roche compound library, whereas for HTSFP
special care was taken to include only results from institutes that shared the same library to a certain
extent.
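As a small illustration of the completeness figures above, the fraction of filled cells in an N by M matrix can be computed as below. The function names and the use of None for a missing readout are assumptions of this sketch. With the sizes quoted in the text, the full matrices hold 365,231 x 95 and 1,277,548 x 107 cells, so the reported completeness values correspond to roughly 32 million and 63 million available readouts, respectively.

```python
def completeness(matrix):
    """Fraction of cells that hold a readout (None marks a missing value)."""
    total = sum(len(row) for row in matrix)
    filled = sum(value is not None for row in matrix for value in row)
    return filled / total


def n_cells(n_compounds, n_assays):
    """Size of the full compound-by-assay matrix."""
    return n_compounds * n_assays


# Matrix sizes quoted in the text.
htsfp_cells = n_cells(365_231, 95)
roche_cells = n_cells(1_277_548, 107)

# Approximate number of available readouts implied by the reported completeness.
htsfp_readouts = round(htsfp_cells * 0.92)
roche_readouts = round(roche_cells * 0.46)
```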
2. Biosimilarity metric: Correction for size-dependency
We describe the similarity of two HTSFP with a derived version of the mutual information metric (I):
$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$ [1]
where X and Y are HTSFP of two compounds, and x and y are the elements of such fingerprints
respectively. To reduce the number of arbitrary parameters4, we employ the method proposed by
Kraskov et al.5 based on entropy estimates from k-nearest neighbor distances. Since I(X;Y) has
different upper bounds depending on the size and content of the HTSFP being compared, we have
chosen the scheme of Strehl and Ghosh6 to produce a normalized mutual information (NMI):
$$NMI(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}$$ [2]
where I(X;Y) is the raw mutual information, and H(X) and H(Y) are the individual entropies of the
HTSFP of the two molecules. NMI can only be computed over the HTSFP components that are common
to both fingerprints, and a reduced HTSFP (HTSFP-r) is introduced for this purpose. Using a reduced
fingerprint makes NMI dependent on the number of assays compared. Different solutions have been
proposed: Riniker et al.3 completed the HTSFP with a constant value of 0 wherever a value was
missing; Wassermann et al.7 proposed a transformation of the metric (i.e., the Pearson correlation
coefficient) to a frequency value, using thousands of comparisons in a reference set as a background.
Building on this last idea, we performed one million NMI calculations between randomly chosen
molecules within the HTSFP dataset. These comparisons covered all fingerprint sizes from n = 13 (a
requirement of our kNN mutual information estimation method with k = 10) to a maximum of n = 95
(HTSFP) or n = 107 (HTSFP-Roche). Histograms of the random comparisons were built for each
number of assays in common, n = 13, 14, 15, ..., 95. We observed that these results follow
half-normal distributions with a parameter µ of 0 (as expected for NMI values from random
comparisons of unrelated compounds) and σ parameters that depend on the number of assays
compared, n. This dependence is directly related to sample size effects on the estimation of I8. If
the sampling is low, that is, only a few assay readouts are available, the error in the entropy
estimation is higher and NMI is thus overestimated. When more assays are available, the estimation of
NMI is more accurate and the value is closer to 0 for the comparison of two random, unrelated HTSFP,
which results in a lower σ parameter and a narrower distribution of NMI values. σ(n) follows a
power-law behavior that can be parametrized and used to define the Z-score of the NMI (ZNMI):
$$Z_{NMI}(X;Y) = \frac{I(X;Y)/\sigma(n)}{\sqrt{H(X)\,H(Y)}}$$ [3]
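The size correction can be illustrated with a short sketch. It assumes that random NMI values have already been grouped by the number of shared assays n. It estimates σ for each group with the maximum-likelihood formula for a half-normal distribution located at 0 (a substitute for the histogram fits described above), fits σ(n) = a n^b by linear regression in log-log space, and rescales an NMI value into ZNMI. The function names and the synthetic numbers are hypothetical; the fitted σ(n) parameters of this work are not reproduced here.

```python
import math


def half_normal_sigma(values):
    """Maximum-likelihood scale of a half-normal located at 0: sqrt(mean(x^2))."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def fit_power_law(ns, sigmas):
    """Least-squares fit of log(sigma) = log(a) + b * log(n). Returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in sigmas]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


def znmi(nmi, n_common, a, b):
    """Z-score of an NMI value given the number of shared assays (mean is 0)."""
    return nmi / (a * n_common ** b)


# Synthetic background: sigma values generated from a known power law,
# one per fingerprint size n = 13 ... 95 (illustrative numbers only).
sizes = list(range(13, 96))
synthetic_sigmas = [2.0 * n ** -0.5 for n in sizes]
a_fit, b_fit = fit_power_law(sizes, synthetic_sigmas)

# The same raw NMI is less surprising when only a few assays are shared.
z_small_overlap = znmi(0.5, 13, a_fit, b_fit)
z_large_overlap = znmi(0.5, 95, a_fit, b_fit)
```

Because σ(n) decreases with n, a given NMI value receives a higher ZNMI when it is supported by more shared assays, which is the intended effect of the correction.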
3. Background distribution for aggregated biosimilarity
The background parametrization of BIOSEA is based on the distribution of similarities between
random sets of molecule HTSFPs. First, pairs of random HTSFP sets (of sizes m and n) are
generated, each with a number of elements in the range of interest (between k = 1 and k = 100). The
two sets of each pair are compared in a pairwise manner using ZNMI and the values are added up. The
process is repeated 200 times for each total number of comparisons, l = m × n, and the mean (µ) and
the standard deviation (σ) of the sums are calculated. The µ and σ values are plotted against the
number of comparisons and fitted to power laws (Supplementary Table 2), which are used to estimate
the parameters σ(l) and µ(l) for any given number of comparisons l in the definition of the Z-score of
aggregated values (ZAG):
$$Z_{AG} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} Z_{NMI}(X_i; Y_j) - \mu(l)}{\sigma(l)}$$ [4]
where X and Y are the two sets of HTSFP. The ZAG distribution is then fitted to a Generalized Extreme
Value Distribution (GEVD) and the location, scale and shape parameters are estimated. The resulting
GEVD is used to estimate the probability of obtaining a given random ZAG (l), PZ (ZAG > zAG). An
expectancy value (e-value = PZ x Nsets) is obtained by multiplying this probability by the number of
sets used (Nsets = 265, the number of targets in our database).
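The chain from pairwise ZNMI values to an e-value can be sketched as follows. The sketch makes several assumptions that should be checked against the actual implementation: the optional threshold argument (summing only ZNMI values at or above a cutoff, by analogy with SEA) is not stated in equation 4; the GEVD is written in the convention whose cumulative distribution is exp(-(1 + ξt)^(-1/ξ)) with t = (z - location)/scale, and the sign convention of the shape parameter used in this work is not known; all function names are hypothetical.

```python
import math

N_SETS = 265  # number of targets in the database (from the text)


def power_law(l, a, b):
    """Background parameter as a power law of the number of comparisons l."""
    return a * l ** b


def zag(znmi_values, mean_ab, sd_ab, threshold=None):
    """Z-score of the aggregated biosimilarity between two fingerprint sets.

    znmi_values holds all l = m * n pairwise ZNMI values. mean_ab and sd_ab
    are the (a, b) power-law parameters for mu(l) and sigma(l).
    If a threshold is given, only values at or above it are summed (assumption).
    """
    l = len(znmi_values)
    if threshold is None:
        raw = sum(znmi_values)
    else:
        raw = sum(z for z in znmi_values if z >= threshold)
    return (raw - power_law(l, *mean_ab)) / power_law(l, *sd_ab)


def gev_survival(z, location, scale, shape):
    """P(Z > z) for a generalized extreme value distribution."""
    t = (z - location) / scale
    if abs(shape) < 1e-12:
        return -math.expm1(-math.exp(-t))  # Gumbel limit
    arg = 1.0 + shape * t
    if arg <= 0.0:
        # Outside the support: below the lower bound (shape > 0) every draw
        # exceeds z; above the upper bound (shape < 0) none does.
        return 1.0 if shape > 0 else 0.0
    return -math.expm1(-(arg ** (-1.0 / shape)))


def e_value(p, n_sets=N_SETS):
    """Expected number of sets reaching this score by chance."""
    return p * n_sets


# Hypothetical example: 4 pairwise values and made-up background parameters.
score = zag([1.0, 2.0, 3.0, 4.0], mean_ab=(1.0, 1.0), sd_ab=(2.0, 0.5))
expectation = e_value(gev_survival(score, 0.0, 1.0, 0.0))
```

In practice the (a, b) pairs would come from Supplementary Table 2 and the location, scale and shape parameters from Supplementary Table 3, after confirming the shape sign convention.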
Minimal and optimal thresholds for ZNMI between pairs of HTSFP were explored by scanning values
in the range 0.5 to 4.5 in increments of 0.1 and evaluating, against the proposed ZNMI threshold
values, the goodness of fit (χ2) and the ratio of fitting errors for the Extreme Value and Gaussian
distributions (Supplementary Figure 4). The optimal ZNMI threshold, which is dataset specific, was
obtained by selecting the best ratio of Gaussian to GEVD fitting errors (Supplementary Figure 4),
while the minimal value was determined as the first threshold that provided a better fit for the EVD
than for a Gaussian distribution. Thresholds are reported in Supplementary Table 3.
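The selection rule can be written compactly, assuming that the fitting errors of both distributions have already been computed for every candidate threshold. Two points are assumptions of this sketch: "best ratio" is read as the largest Gaussian-to-GEVD error ratio, and the error lists below are invented numbers used only to exercise the function.

```python
def threshold_grid(start=0.5, stop=4.5, step=0.1):
    """Candidate ZNMI thresholds, 0.5 to 4.5 in increments of 0.1."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n_steps + 1)]


def select_thresholds(thresholds, err_gevd, err_gauss):
    """Return (minimal, optimal) thresholds from per-threshold fitting errors.

    minimal: first threshold where the GEVD fits better than the Gaussian.
    optimal: threshold with the largest Gaussian/GEVD error ratio (assumption).
    GEVD errors are assumed to be strictly positive.
    """
    rows = list(zip(thresholds, err_gevd, err_gauss))
    minimal = next((t for t, e, g in rows if e < g), None)
    optimal = max(rows, key=lambda row: row[2] / row[1])[0]
    return minimal, optimal


# Invented errors for three candidate thresholds.
example = select_thresholds([1.0, 2.0, 3.0], [0.5, 0.2, 0.1], [0.4, 0.3, 0.5])
```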
4. Chemical SEA and comparisons with BIOSEA
Following the original description by Keiser et al.9, a version of SEA was implemented using the
chemical structures and targets in the target identification database (Methods section in the Main
Article), RDKit Morgan fingerprints10 (radius = 2) and the Tanimoto coefficient as the similarity
metric. Background parametrization of SEA was carried out analogously to the BIOSEA method
for consistency in the comparison, and an optimal cutoff was determined (Tc = 0.29). SEA target
predictions for the 711 drugs were made as described for BIOSEA using a ZAG of 8.0 as a cutoff,
resulting in a total of 1708 predictions, of which 494 could be checked against the literature or
database annotations; 324 of these were positive (66%). In comparison, the percentage of
literature-confirmed positive predictions for BIOSEA is 50%. Only 51 SEA predictions were shared
with BIOSEA, of which 28 were positively confirmed. Thirteen shared positive predictions were
pharmacophorically related to at least one reference compound in the target training set, in
agreement with the high proportion of SEA positive predictions that can be traced to 2D or 3D
chemical similarity to compounds in the training set (80%). This comparison shows that BIOSEA is
able to identify relationships in a complementary manner to SEA, reaching beyond chemically obvious
predictions.
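For readers unfamiliar with the chemical counterpart, the raw SEA score between a query set and a target set can be sketched as the sum of pairwise Tanimoto coefficients at or above the cutoff. To keep the example self-contained, fingerprints are represented as plain Python sets of "on" bit indices rather than RDKit objects, and the bit sets shown are made up; the function names are hypothetical.

```python
TC_CUTOFF = 0.29  # optimal Tanimoto cutoff reported in the text


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sea_raw_score(query_set, target_set, cutoff=TC_CUTOFF):
    """Sum of pairwise Tanimoto coefficients that reach the cutoff."""
    total = 0.0
    for query in query_set:
        for reference in target_set:
            similarity = tanimoto(query, reference)
            if similarity >= cutoff:
                total += similarity
    return total


# Made-up fingerprints: one query compound against two reference ligands.
query_compounds = [{1, 2, 3}]
target_ligands = [{2, 3, 4}, {7, 8, 9}]
raw_score = sea_raw_score(query_compounds, target_ligands)
```

The raw score would then be converted to a ZAG with the same background procedure used for BIOSEA.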
5. Model for colchicine site binders
3D models of the CT1, CT3 and CT4 compounds were generated from their SMILES strings, extracted
from PubChem, using the software CORINA11. Conformational analysis was performed using ALFA12
to generate a maximum of 500 diverse conformers per molecule. 3D coordinates of the colchicine-thiol
derivative
(2-mercapto-N-[1,2,3,10-tetramethoxy-9-oxo-5,6,7,9-tetrahydro-benzo[a]heptalen-7-yl]
acetamide)13 were extracted from its tubulin complex (PDB code 1SA0) and used as the reference to
superimpose compound conformations with an in-house clone of ROCS14. The best superimposition
according to the TanimotoCombo score was selected for each compound.
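The final selection step can be illustrated as below. In ROCS the TanimotoCombo score is the sum of the shape Tanimoto and the color (chemical feature) Tanimoto, so it ranges from 0 to 2; whether the in-house clone reports it in exactly this way is an assumption, and the conformer records and scores are invented for the example.

```python
def tanimoto_combo(shape_tanimoto, color_tanimoto):
    """ROCS-style combined score: shape overlap plus color (feature) overlap."""
    return shape_tanimoto + color_tanimoto


def best_superimposition(poses):
    """Pick the conformer pose with the highest TanimotoCombo score."""
    return max(poses, key=lambda pose: tanimoto_combo(pose["shape"], pose["color"]))


# Invented overlay results for three conformers of one compound.
poses = [
    {"conformer": 1, "shape": 0.62, "color": 0.21},
    {"conformer": 2, "shape": 0.70, "color": 0.35},
    {"conformer": 3, "shape": 0.74, "color": 0.18},
]
best = best_superimposition(poses)
```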
Supplementary Table 1. Top 10 most highly informative assays (PubChem Assay ID) for Tubulin
active compounds. Targets in bold are cell-based assays.
Compound | CT1 | CT2 | CT3 | CT4 | CT5
Assay 1 | GDH-TPI (588335) | Procaspase-3 (463141) | MDM2/MDMX (485346) | CHRM1 (588814) | Gli-Sufu (588413)
Assay 2 | NCOA1 (588354) | Procaspase-7 (463210) | PKA-R1A (504707) | SENP7 (434973) | NFkb (588850)
Assay 3 | APAF-1 (489031) | MC4R (540295) | MC4R (540295) | APOBEC3A (493011) | CHRM1 (588852)
Assay 4 | DAGLB (504411) | TSHR (504810) | Peg3 (588405) | APOBEC3G (493012) | OPRM1-OPRD1 (504326)
Assay 5 | DRD2 (485347) | TSHR (504812) | STEP (588621) | MC4R (540308) | CTSL-1 (1906)
Assay 6 | Procaspase-3 (463141) | Bacterial death (E. coli) (449728) | Giardia lamblia (540267) | TLR9-MyD88 (504734) | FXN (540364)
Assay 7 | SIAE (1053197) | SCP-1 (493091) | NFkb (435022) | Tim23 (463212) | Giardia lamblia (540267)
Assay 8 | HKDC1 (493187) | Abl1-Rin1 (588664) | NFkb (435003) | PAFAH2 (492956) | STEP (588621)
Assay 9 | Fam108B (1947) | ATG4B (504462) | NFkb (588850) | Vanilloid Receptor 1 (540277) | NTR1 (493036)
Assay 10 | HTR5A (504634) | HCV core dimerization (1899) | Apoptotic arm of the Unfolded Protein response (449763) | SCP-1 (493091) | Apoptotic arm of the Unfolded Protein response (449763)
Supplementary Table 2. Aggregated biosimilarity fitting parameters for the optimal and minimal ZNMI
thresholds (e.g. µ = ax^b)
ZNMI threshold | Minimal (2.6) | Optimal (4.0)
Mean coefficient (a) | 3.58E-02 | 5.00E-03
Mean exponent (b) | 0.99 | 0.99
Mean Pearson r2 | 0.9989 | 0.9941
Standard deviation coefficient (a) | 1.24E-01 | 1.06E-01
Standard deviation exponent (b) | 7.06E-01 | 5.77E-01
Standard deviation r2 | 0.9841 | 0.9389
Supplementary Table 3. Extreme Value Distribution and Gaussian parameters for the optimal and
minimal ZNMI thresholds.
ZNMI threshold | Minimal (2.6) | Optimal (4.0)
EVD location | -4.49E-01 | -4.52E-01
EVD scale | 9.14E-01 | 7.21E-01
EVD shape | -7.67E-02 | 5.07E-02
EVD Pearson r2 | 0.9897 | 0.9988
EVD χ2 | 6.80E-02 | 6.9E-03
Gaussian µ | -1.05E-01 | -2.58E-01
Gaussian σ | 9.73E-01 | 8.08E-01
Gaussian Pearson r2 | 0.9943 | 0.9716
Gaussian χ2 | 1.16E-01 | 1.59E-01
Supplementary Table 4. Assay panel used for target validation available from CEREP/Eurofins (http://www.cerep.fr/).
Assay Target | Type | Source | Substrate / Stimulant | Measured component | Incubation | Detection method | Reference compound | Description
MAO-A | Functional | Human placenta | Kynuramine | 4-hydroxyquinoline | 30 min/30°C | Photometry | Clorgyline | Weyler, W. and Salach, J.I. (1985) J. Biol. Chem., 260: 13199-13207.
AChE | Functional | Human recombinant (HEK-293 cells) | Acetylthiocholine | 5-thio-2-nitrobenzoic acid | 30 min/RT | Photometry | Galanthamine | Ellman, G.L et al. (1961) Biochem. Pharmacol., 7: 88-95.
CYP1A2 | Functional | Human recombinant CYP | CEC (5 µM) | | 30 min/37°C | Fluorimetry | Furafylline | Crespi, C.L et al. (1997) Anal. Biochem., 248: 188-190.
NET | Functional | Rat hypothalamus synaptosomes | | [3H]NE incorporation into synaptosomes | 20 min/37°C | Scintillation counting | Protriptyline | Perovic, S. and Muller, W.E.G. (1995) Arzneim-Forsch. Drug Res., 45: 1145-1148.
A2B (agonism) | Functional | Human recombinant (HEK-293 cells) | | cAMP | 10 min/37°C | Homogeneous Time Resolved Fluorescence | NECA | Cooper, J et al. (1997) Brit. J. Pharmacol., 122: 546-550.
A2B (antagonism) | Functional | Human recombinant (HEK-293 cells) | NECA | cAMP | 10 min/37°C | Homogeneous Time Resolved Fluorescence | XAC | Cooper, J et al. (1997) Brit. J. Pharmacol., 122: 546-550.
GR | Binding | IM-9 cells (cytosol) | [3H]dexamethasone | | 6 hr/4°C | Scintillation counting | Dexamethasone | Clark, A.F et al. (1996) Invest. Ophthalmol. Vis. Sci., 37: 805-813.
COX1 | Functional | Human recombinant (Sf9 cells) | Arachidonic acid | PGE2 | 5 min/RT | Homogeneous Time Resolved Fluorescence | Diclofenac | Glaser, K et al. (1995) Eur. J. Pharmacol., 281: 107-111.
COX2 | Functional | Human recombinant (Sf9 cells) | Arachidonic acid | PGE2 | 5 min/RT | Homogeneous Time Resolved Fluorescence | NS 398 | Glaser, K et al. (1995) Eur. J. Pharmacol., 281: 107-111.
A3 (agonism) | Functional | Human recombinant (CHO cells) | | cAMP | 20 min/37°C | Homogeneous Time Resolved Fluorescence | IB-MECA | Yates, L et al. (2006) Auton. Autacoid Pharmacol., 26: 191-200.
A3 (antagonism) | Functional | Human recombinant (CHO cells) | IB-MECA | cAMP | 20 min/37°C | Homogeneous Time Resolved Fluorescence | MRS 1220 | Yates, L et al. (2006) Auton. Autacoid Pharmacol., 26: 191-200.
Alpha-2B | Binding | Human recombinant (CHO cells) | [3H]RX 821002 | | 60 min/RT | Scintillation counting | Yohimbine | Devedjian, J.C et al. (1994) Eur. J. Pharmacol., 252: 43-49.
Alpha-1D | Binding | | [3H]prazosin | | 60 min/RT | Scintillation counting | Prazosin | Kenny, B.A et al. (1995) Brit. J. Pharmacol., 115: 981-986.
A2A (agonism) | Functional | PC12 cells (endogenous) | | cAMP | 10 min/RT | Homogeneous Time Resolved Fluorescence | NECA | Gao, Z et al. (2001) J. Pharmacol. Exp. Ther., 298: 209-218.
A2A (antagonism) | Functional | PC12 cells (endogenous) | NECA | cAMP | 10 min/RT | Homogeneous Time Resolved Fluorescence | ZM 241385 | Gao, Z et al. (2001) J. Pharmacol. Exp. Ther., 298: 209-218.
H1 (agonism) | Functional | Human recombinant (HEK-293 cells) | | intracellular [Ca2+] | RT | Fluorimetry | Histamine | Miller, T.R et al. (1999) J. Biomol. Screen., 4: 249-258.
H1 (antagonism) | Functional | Human recombinant (HEK-293 cells) | Histamine | intracellular [Ca2+] | RT | Fluorimetry | Pyrilamine | Miller, T.R et al. (1999) J. Biomol. Screen., 4: 249-258.
CA2 | Functional | Human erythrocytes | 4-nitrophenyl acetate | 4-nitrophenol | 20 min/RT | Photometry | Acetazolamide | Iyer, R et al. (2006) J. Biomol. Screen., 11: 782-791.
CLK1 | Functional | Human recombinant | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 30 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Muraki, M et al. (2004) J. Biol. Chem., 279: 24246-24254.
A1 (agonism) | Functional | Human recombinant (CHO cells) | | Impedance | 28°C | Cellular dielectric spectroscopy | CPA | Cordeaux, Y et al. (2004) Brit. J. Pharmacol., 143: 705-714.
A1 (antagonism) | Functional | Human recombinant (CHO cells) | CPA | Impedance | 28°C | Cellular dielectric spectroscopy | DPCPX | Cordeaux, Y et al. (2004) Brit. J. Pharmacol., 143: 705-714.
GSK3a | Functional | Human recombinant | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 60 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Meijer, L et al. (2003) Chem. Biol., 10: 1255-1266.
JNK3 | Functional | Human recombinant (E. coli) | ATP + Ulight-CFFKNIVTPRTPPPSQGK-amide | phospho-Ulight-CFFKNIVTPRTPPPSQGK-amide | 30 min/RT | Time-resolved fluorescence resonance energy transfer (LANCE) | Staurosporine | Lisnock, J.M et al. (2000) Biochemistry, 39: 3141-3148.
Supplementary Figure 1. CT5-Tubulin binding competition results.
Supplementary Figure 2. CT3-Tubulin binding competition results.
Supplementary Figure 3. HTSFP information analysis for tubulin hits and their biological reference
compounds. Colored parts represent assays that are shared across hits and references. Black boxes
denote the 10 most highly informative assays (HIA) for each compound. Assay names and
classification into cell-based or biochemical are found in Supplementary Table 1.
Supplementary Figure 4. Extreme Value Distribution (red) and Gaussian distribution (black) fits (χ2)
versus the ZNMI threshold (left) and ratio of residuals of Gaussian distribution and Extreme Value
Distribution fits versus the ZNMI threshold (right).
Supplementary Figure 5. Dose response curves for the 5 unreported positive compounds in BIOSEA
target predictions.
REFERENCES
1 Wang, Y. et al. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research 37, W623-W633 (2009).
2 Wang, Y. et al. PubChem BioAssay: 2014 update. Nucleic acids research 42, D1075-D1082 (2014).
3 Riniker, S., Wang, Y., Jenkins, J. L. & Landrum, G. A. Using Information from Historical High-Throughput Screens to Predict Active Compounds. Journal of chemical information and modeling 54, 1880-1891 (2014).
4 Fraser, A. M. & Swinney, H. L. Independent coordinates for strange attractors from mutual information. Physical review A 33, 1134 (1986).
5 Kraskov, A., Stögbauer, H. & Grassberger, P. Estimating mutual information. Physical review E 69, 066138 (2004).
6 Strehl, A. & Ghosh, J. Cluster ensembles---a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research 3, 583-617 (2003).
7 Wassermann, A. M. et al. A Screening Pattern Recognition Method Finds New and Divergent Targets for Drugs and Natural Products. ACS chemical biology (2014).
8 Gil, M., Fernández, M. & Martinez, I. The choice of sample size in estimating mutual information. Applied Mathematics and Computation 27, 201-216 (1988).
9 Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175-181 (2009).
10 Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org. Accessed 26/02/2016 3, 2012 (2016).
11 Corina v. 3.491 (Molecular Networks, 2013).
12 Klett, J., Cortés-Cabrera, A. l., Gil-Redondo, R., Gago, F. & Morreale, A. ALFA: Automatic ligand flexibility assignment. Journal of chemical information and modeling 54, 314-323 (2014).
13 Ravelli, R. B. et al. Insight into tubulin regulation from a complex with colchicine and a stathmin-like domain. Nature 428, 198-202 (2004).
14 Rush, T. S., Grant, J. A., Mosyak, L. & Nicholls, A. A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. Journal of medicinal chemistry 48, 1489-1495 (2005).