ARTICLE
DOI: 10.1177/1087057103260590
Enrichment of Extremely Noisy High-Throughput
Screening Data Using a Naïve Bayes Classifier
MEIR GLICK, ANTHONY E. KLON, PIERRE ACKLIN, and JOHN W. DAVIES
The noise level of a high-throughput screening (HTS) experiment depends on various factors such as the quality and robustness of the assay itself and the quality of the robotic platform. Screening of compound mixtures is noisier than screening single compounds per well. A classification model based on naïve Bayes (NB) may be used to enrich such data. The authors studied the ability of the NB classifier to prioritize noisy primary HTS data of compound mixtures (5 compounds/well) in 4
campaigns in which the percentage of noise presumed to be inactive compounds ranged between 81% and 91%. The top 10%
of the compounds suggested by the classifier captured between 26% and 45% of the active compounds. These results are reasonable and useful, considering the poor quality of the training set and the short computing time that is needed to build and deploy the classifier. (Journal of Biomolecular Screening 2004:32-36)
Key words: high-throughput screening, compound mixtures, molecular similarity, extended-connectivity fingerprints, naïve
Bayes
Novartis Institute for Biomedical Research, Cambridge, MA.

Received Jun 13, 2003, and in revised form Aug 13, 2003. Accepted for publication Sep 5, 2003.

INTRODUCTION

NAÏVE BAYES (NB) IS A SIMPLE CLASSIFICATION method based on the Bayes rule for conditional probability.1 Bayes rule states that the probability of event A, given that event B has occurred, P(A|B), is the following:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},$$

where P(A) and P(B) denote the probabilities of events A and B, respectively. NB may be used in high-throughput screening (HTS) data analysis as a classifier between “active” and “inactive” compounds, guided by the frequency of occurrence of various molecular descriptors, D, as shown in the following equation:2

$$P(\mathrm{Active} \mid D) = \frac{P(D \mid \mathrm{Active})\,P(\mathrm{Active})}{P(D)}.$$

NB classification relies on 2 basic assumptions. First, the descriptors in the training samples are equally important. Second, the method “naïvely” (hence “naïve” Bayes) assumes independence of the descriptors from each other.3 Therefore, their combined probability is obtained by multiplying the individual probabilities, so

$$P(\mathrm{Active} \mid D_1, D_2, \ldots, D_n) = \frac{P(D_1 \mid \mathrm{Active}) \times P(D_2 \mid \mathrm{Active}) \times \cdots \times P(D_n \mid \mathrm{Active})\,P(\mathrm{Active})}{P(D_1, D_2, \ldots, D_n)}.$$
In practice, these assumptions are often violated, and full independence is rarely observed. Studies on HTS data show, however,
that NB is very robust to violations of these assumptions.4 NB is
also known to be tolerant toward noise.4 Indeed, HTS data might
be extremely noisy when screening in mixtures.5 In our case, the
compound collection was screened in mixtures of 5 compounds/
well. Given that a compound is active, the mixture that contains
this compound is likely to give a signal. The probability that any of
the remaining 4 compounds in the same well will also be active is
remote because the hit rate in an HTS campaign is low (ca. 0.1%).
To pinpoint which of the 5 compounds is the active one, each of the
compounds is retested as 1 compound/well. In this article, we refer
to the remaining 4 compounds as “noise presumed to be inactive
compounds.”
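As a rough back-of-the-envelope check on this argument (our illustration, assuming the activities of randomly pooled compounds are independent), with a hit rate of about 0.1%, the probability that at least one of the other 4 compounds in an active well is itself active is

$$1 - (1 - 0.001)^{4} \approx 0.004,$$

that is, roughly 0.4%, so deconvolution by retesting each compound individually almost always attributes the well's signal to a single compound.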
Assessment of NB noise tolerance when the HTS data are extremely noisy is therefore highly desirable. The scientific literature
contains empirical evaluation of various ensemble methods.6
However, the tolerance of NB toward noise in HTS data is unclear
because the performance of NB is determined by how well its underlying assumptions describe the activity of the compounds in a
given HTS campaign. Here we studied the ability of NB to enrich
compound-mixtures data in 4 HTS campaigns. We also tested the
accuracy of an NB classifier in mining other compound collections
for additional “active” compounds when the classifier was trained
on a data set containing an increasing amount of noise presumed
to be inactive compounds (up to 99% of the training set).
METHODS
We used the extended-connectivity fingerprint (ECFP) implementation in Pipeline Pilot2 in all the test cases in this study. The
ECFPs are a new class of fingerprints for molecular characterization, developed by the SciTegic group,7 that rely on the Morgan algorithm.8 Fingerprints normally represent the molecule as a collection of substructures such as functional groups. In contrast, the
ECFP features correspond to the presence of an exact structure
with limited specified attachment points. To generate the fingerprints, the program assigns an initial code to each atom. Then, each
atom code is updated in an iterative manner to reflect the codes of
each atom’s neighbors. When the desired neighborhood size is
reached, the process is complete, and the set of all features is returned as the fingerprint. ECFPs can represent a much larger set of
features than many fingerprints and contain a significant number
of different structural units that are crucial for the molecular comparison among the compounds. To accommodate this amount of
information, a hashing scheme into a space of 2³² feature codes is
used.
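As an illustration of the idea (not the authors' code; the paper used Pipeline Pilot's ECFP implementation), the openly available Morgan fingerprints in RDKit follow the same iterative atom-code update and yield hashed circular-substructure features:

```python
# Illustrative stand-in for ECFPs: RDKit's sparse Morgan fingerprints.
# A radius of 2 corresponds roughly to ECFP_4-style neighborhoods.
from rdkit import Chem
from rdkit.Chem import AllChem

def circular_features(smiles: str, radius: int = 2) -> set:
    """Return the set of hashed circular-substructure feature IDs for a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprint(mol, radius)  # sparse fingerprint with hashed feature IDs
    return set(fp.GetNonzeroElements().keys())

# Example: number of distinct circular features for aspirin
print(len(circular_features("CC(=O)Oc1ccccc1C(=O)O")))
```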
The Bayesian models were generated and deployed using Pipeline Pilot, in which each bin contains the number of occurrences of
a given hashed fingerprint bit. The normalized probability was
then calculated to provide a final contribution of the feature to the
total relative estimate.
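A minimal sketch of the kind of scoring this implies (our illustration, assuming binary presence/absence of features and simple Laplace smoothing; the exact normalization used by Pipeline Pilot is not reproduced here) is shown below:

```python
# Sketch: naive Bayes scoring over fingerprint features.
# Each compound is represented as a set of hashed feature IDs (e.g., from ECFP-like fingerprints).
import math
from collections import Counter

def train_nb(active_fps, inactive_fps):
    """Count how often each feature occurs in the active and inactive training compounds."""
    active_counts = Counter(f for fp in active_fps for f in fp)
    inactive_counts = Counter(f for fp in inactive_fps for f in fp)
    return active_counts, inactive_counts, len(active_fps), len(inactive_fps)

def nb_score(fp, model):
    """Higher score = more 'active-like'. Laplace smoothing (+1) avoids log(0) for rare features."""
    active_counts, inactive_counts, n_act, n_inact = model
    score = 0.0
    for f in fp:
        p_act = (active_counts.get(f, 0) + 1) / (n_act + 2)
        p_inact = (inactive_counts.get(f, 0) + 1) / (n_inact + 2)
        score += math.log(p_act / p_inact)  # independence assumption: contributions simply add
    return score

# Usage sketch: rank the screened compounds by nb_score and cherry-pick the top fraction.
# model = train_nb(active_fps, inactive_fps)
# ranked = sorted(library_fps, key=lambda fp: nb_score(fp, model), reverse=True)
```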
RESULTS
Can the NB classifier prioritize primary data of compound
mixtures?
Advances in combinatorial and parallel chemistries, robotics,
and miniaturization enable screening of large compound collections at an increasing rate. With the soaring costs of a typical
screen, devising time-efficient and cost-effective strategies is
highly desirable. By prioritizing the primary data, one can avoid
picking “false positives” and pinpoint “false negatives,” hence reducing the overall cost of the screen and increasing its yield.
We conducted a retrospective study on 4 in-house HTS campaigns. For each drug target, roughly 300,000 compounds were
screened in mixtures of 5 compounds/well. The compounds were
randomly assigned to these mixtures. Issues such as intermolecular chemical reactivity, cell toxicity, similarity, diversity,
and redundancy were disregarded. The number of “hits” in the primary screen ranged significantly between 811 and 2792 (Table 1).
It should be noted that a few of the wells had fewer than 5 compounds (but more than 1). The Z’ factors for these screens ranged
between 0.55 in screen III and 0.87 in screen IV. All the primary
“hits” were deconvoluted; the compounds in the active wells were
then screened again (“cherry-picked” for secondary screening) as
1 compound/well. The percentage of noise presumed to be inactive
compounds ranged between 81% and 91% among the 4 campaigns (Table 1). We defined all the compounds in the active wells
from the primary screen (including the noise presumed to be inactive compounds) as “active.” These active compounds and an additional number of “inactive” compounds, which ranged between
124,133 and 189,910, were used to construct a Bayesian model for
an active compound for each of the campaigns. The Bayesian
model was then employed to prioritize the compounds from the
primary screen for each of the 4 campaigns. The results were compared to the secondary screen (1 compound/well) in terms of enrichment and receiver operating characteristic (ROC) curves.9 An
enrichment curve attempts to answer whether testing compounds
in a particular order is beneficial. To generate such a plot, the
Bayesian classifier was used to order the samples. The curve is a
cumulative count of the number of active compounds found when
testing the compounds in such order. Enrichment and ROC curves
are closely related. ROC plots are a robust method for understanding the utility of a model or score in classifying data samples. A
ROC curve describes the trade-off between sensitivity and specificity. Sensitivity is defined as the ability of the model to avoid
“false negatives,” whereas specificity is the ability to avoid “false
positives.” An ideal classifier would have an area under the curve of 1, whereas a random classifier would have a value of
0.5. Roughly, an area from 0.7 to 0.8 is considered reasonable, 0.8
to 0.9 is good, and 0.9 to 1 is excellent. In 3 of the 4 cases, the quality of
the model was reasonable: the area under the ROC curve was
0.75 or higher. In terms of enrichment, the top 10% of compounds
suggested by the Bayesian model captured between 26% and 45%
(Table 1) of the confirmed “hits.”
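For readers who want to reproduce this kind of retrospective evaluation on their own data, a short sketch (assuming numpy and scikit-learn are available; the toy data below are synthetic and included only to make the example runnable) of enrichment in the top-ranked fraction and the ROC area is:

```python
# Sketch: enrichment at a top fraction and ROC area from classifier scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def enrichment_at(scores, labels, fraction=0.10):
    """Fraction of all true actives recovered in the top `fraction` of the score-ranked list."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)      # 1 = confirmed active, 0 = inactive
    n_top = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(-scores)[:n_top]       # highest scores first
    return labels[top_idx].sum() / labels.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = (rng.random(1000) < 0.1).astype(int)       # ~10% confirmed actives (toy data)
    scores = labels * 1.0 + rng.normal(0.0, 1.0, 1000)  # noisy scores; actives tend to score higher
    print("actives captured in top 10%:", enrichment_at(scores, labels, 0.10))
    print("ROC AUC:", roc_auc_score(labels, scores))
```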
Can an NB classifier based on extremely noisy data be used to
identify additional compounds that were not tested in the
campaign?
Screening of external compound collections is routine in the
drug discovery process. Prioritizing these compound collections at
an early stage is therefore desirable. The ability of an NB classifier
to prioritize primary data of compound mixtures implies success in
mining and prioritizing other compound collections, even if the
training of the classifier is based on highly noisy primary data.
Our strategy involved leaving the NB algorithm unmodified
and altering the training examples to introduce factors believed to
be important—in this case, increasing the amount of noise presumed to be inactive compounds. Two HTS experiments were
studied. In the first one, around 300,000 compounds were tested
for their inhibitory activity. These compounds were purchased
from external vendors based on their drug likeness10 and chemical
diversity. Other compounds were selected from in-house lead optimization projects. In the primary screening, compounds were
tested as 1 compound/well at a concentration of 10 µM. Compounds at –49% change were considered hits and were tested again
at 10 µM. The Z’ factor for this screen was 0.84.
Table 1. Prioritization of Compound Mixtures Using Naïve Bayes

Drug Target | Number of “Hits” Taken from the Primary Screen (5 Compounds/Well) | Number of Confirmed “Hits” after Retesting the Compounds as 1 Compound/Well | Enrichment Factor in the First 10% | Area under the ROC Curve
I   | 1826 | 347 | 3.8 | 0.87
II  | 2792 | 363 | 4.5 | 0.83
III | 2584 | 258 | 2.6 | 0.61
IV  |  811 |  73 | 2.7 | 0.75

The first 2 columns describe preparation of the Bayesian models; the remaining columns describe deployment of the models. ROC, receiver operating characteristic.
FIG. 1. The enrichment (percent actives captured in the first 1%) values when using Bayesian models based on an increasing amount of noise (0%-99%) for 2 assays, compared with a random model. (a) Antagonist assay. (b) Agonist assay.
The percent change is the difference between the signal of the sample well and the median of the signals obtained from the blank wells on the
same plate, divided by the difference between the median signal of
the control wells (wells with the target protein but without the test
compound) and the median signal of the blank wells that are located on the same plate. From these, 242 validated hits were identified with IC50 values ranging between 0.02 and 23.3 µM. The 242
compounds were divided at random into 2 equal-sized groups of
121. We generated a Bayesian model based on 121 “actives” from
the first group and 12,100 “inactive” compounds picked at random
from the experiment.
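Written out (a restatement of the prose definition above, with the factor of 100 converting the ratio to a percentage), the percent change for a sample well is

$$\%\,\mathrm{change} = 100 \times \frac{\mathrm{signal}_{\mathrm{sample}} - \mathrm{median}(\mathrm{signal}_{\mathrm{blank}})}{\mathrm{median}(\mathrm{signal}_{\mathrm{control}}) - \mathrm{median}(\mathrm{signal}_{\mathrm{blank}})},$$

where the blank and control medians are taken over wells on the same plate.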
We then tested the ability of the Bayesian model to pinpoint the
other 121 actives, to which an additional 84,416 inactive compounds (that were not used for the training) were added as “decoys.” We studied the enrichment in terms of percent actives captured in the first 1% (845 compounds). We focused on the first 1%
because it is not practical to retest (“cherry-pick”) tens of thousands of compounds. Instead, the aim of the Bayesian model here
was to identify a small number of active compounds inside a large
compound collection.
We then started generating noise at increasing amounts for up to
99% of the training compounds. The noise was created by picking
at random inactive compounds from the training set and labeling
them as active. New Bayesian models were then created from these
increasingly noisy data. Each Bayesian model was evaluated on
the same test set. The enrichment values when using Bayesian
models, based on an increasing amount of noise, are shown in Figure 1a. The training set without any noise yielded the best Bayesian
model (25% of the actives captured in the first 1% of the compounds). Strikingly, the precision of the Bayesian model was preserved even as increasing levels of noise were introduced (up to
80%). At 70% noise, the model was nearly as good as the
one created without any noise (identification of 20% of the actives
vs. 25%).
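One plausible reading of this noise-injection procedure, as a sketch (the paper gives no code; the function and variable names are ours, and taking the noise fraction relative to the “active” labels is our assumption):

```python
# Sketch: relabel randomly chosen inactive training compounds as "active"
# so that a target fraction of the "active" training labels is noise.
import random

def inject_noise(active_ids, inactive_ids, noise_fraction, seed=0):
    """Return a noisy 'active' label set containing the true actives plus
    enough randomly picked inactives that ~noise_fraction of the set is mislabeled."""
    rng = random.Random(seed)
    # Solve n_noise / (n_true + n_noise) = noise_fraction for n_noise.
    n_noise = int(round(noise_fraction * len(active_ids) / max(1e-9, 1.0 - noise_fraction)))
    n_noise = min(n_noise, len(inactive_ids))
    mislabeled = rng.sample(list(inactive_ids), n_noise)
    return list(active_ids) + mislabeled

# e.g., 70% noise around 121 true actives requires roughly 282 mislabeled inactives:
# noisy_actives = inject_noise(true_active_ids, inactive_pool_ids, 0.70)
```

A new Bayesian model would then be trained on the noisy “active” set and evaluated on the unchanged test set, as described in the text.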
The second HTS experiment was an agonist assay in which the
same compound collection was screened. In the primary screening, compounds were tested as 1 compound/well at a concentration of 10 µM. Compounds at 10% change or greater (10% to 90%
agonism) were considered hits and were tested again in duplicates
at 10 µM. The Z’ factor for this screen was 0.80. From these, 170
validated hits were identified with EC50 values ranging between
0.2 and 69.9 µM. The 170 validated hits were divided at random
into 2 equal-sized groups of 85 compounds. We generated a
Bayesian model from 85 actives in the first group and 8500 inactive compounds picked at random from the experiment. We then
used the model in an attempt to identify the second group of 85 actives, with an additional 100,000 inactive compounds (that
were not part of the training set) used as decoys. Again, we
studied the enrichment in the first 1% (1002 compounds) when using Bayesian models based on an increasing amount of noise. The
results are depicted in Figure 1b. The noise-free training set
yielded the Bayesian model with the highest precision (25% enrichment). Unlike the previous case (the antagonist assay), the accuracy of the Bayesian model decayed with the increasing amount
of noise, and at 40%, it was ineffective.
To get an insight into the difference between the 2 experiments,
we derived a similarity matrix between the training and test sets for
each of the 2 HTS campaigns. The feature set was the same as that
used for the classifier (ECFPs). Figure 2a shows the number of active compounds at increasing similarity levels, divided by the number of cells in the similarity matrix (i.e., percent similar compounds). The data are displayed in this “normalized” manner
because the number of compounds used for the training and test
sets was different between the 2 HTS campaigns (85 vs. 121). Figure 2b depicts the similarity between the active compounds used
for the training and the inactive compounds that were used as decoys, divided by the number of cells in the similarity matrix. Figure
2c shows the similarity between the compounds used for the training set and themselves. The compounds on the diagonal (duplicates) were not included in the matrix. In the antagonist assay, the
compounds used for training are more similar to the test set. However, they are also more similar to the decoy set. In both cases, there
is some structural redundancy between the training sets and themselves, indicated by the percentage of compounds with a high
value of the Tanimoto coefficient. The differences in the
intercompound similarity distributions in Figure 2a indicate that
the compounds in the training and test sets had a higher degree of
similarity in the antagonist assay. This may account for the
differences in sensitivity to noise.
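A sketch of how such a similarity matrix and the binned “percent similar compounds” values can be derived from the same feature sets (illustrative only; not the authors' scripts):

```python
# Sketch: pairwise Tanimoto similarities between two collections of fingerprint feature sets,
# binned as in Figure 2 (bins of width 0.05 from 0.60 to 1.00).
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A ∩ B| / |A ∪ B|."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def binned_similarity_fraction(fps_a, fps_b, edges=None):
    """Fraction of similarity-matrix cells falling in each Tanimoto bin
    (the normalization by the number of cells used in Figure 2)."""
    if edges is None:
        edges = np.arange(0.60, 1.0001, 0.05)
    sims = np.array([tanimoto(x, y) for x in fps_a for y in fps_b])
    counts, _ = np.histogram(sims, bins=edges)
    return counts / sims.size

# For the training-set-vs-itself comparison (Fig. 2c), the diagonal (self-similarity = 1)
# would be excluded before binning, as noted in the text.
```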
FIG. 2. The percentage of compounds for 2 high-throughput screening campaigns at increasing similarity levels (binned Tanimoto coefficient, 0.60 to 1). (a) The active compounds in the training and test sets. (b) The active and inactive compounds in the training set. (c) The active compounds in the training set and themselves.
DISCUSSION
One way of reducing the high costs and increasing the throughput of an HTS campaign is to use mixtures in which 2 or more
compounds are present in the same well. In the case of 5 compounds/well, the primary screening is 5 times faster. The compound management and plate supply is more efficient, and fewer
reagents are needed. When screening in mixtures of 5 compounds/
well, around 80% of the compounds identified in the primary
screening are noise that is presumed to be inactive. In practice, the
amount of noise is higher than 80%, and in 2 of 4 experiments, it
was around 90%, although the final percentage of DMSO in each
of the 4 screens was the same. This may be due to various experimental artifacts such as cell toxicity or perhaps chemical reactions
that might occur between compounds in the same mixture. Indeed, mixture design11 may be one way to improve the quality of
the experiment. The significant amount of noise poses a serious
problem when one needs to build a mathematical model based on
the primary data. We have shown that an NB classifier can be used
to prioritize such data. The results were still reasonable, even when
the amount of noise was around 90%. The area under the ROC
curve and the enrichment curves show that the model tends to be
more accurate as the amount of noise decreases; the predictions for
81% and 87% noise are significantly better than the cases of 90%
and 91% noise.
An NB classifier is a more attractive avenue than deriving an
interwell similarity matrix and prioritizing the compounds in a decreasing Tanimoto coefficient.5 The number of compounds that
can be included in the similarity matrix is a prohibiting factor in
terms of computing time. Using NB is a more efficient approach
because the computing time grows linearly with the number of compounds and enables the prioritization of large compound collections (1 million and more) within
minutes on a desktop PC.
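To put the scaling argument in rough numbers (our estimate, not from the paper): a full intercompound similarity matrix over $N$ compounds requires on the order of $N^2/2$ Tanimoto evaluations, whereas NB scoring requires on the order of $N\bar{f}$ operations, where $\bar{f}$ is the average number of features per compound. For $N = 10^6$ and $\bar{f}$ of a few hundred, that is roughly $5 \times 10^{11}$ pairwise comparisons versus a few times $10^{8}$ feature look-ups.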
ECFPs represent a much larger and sparser set of features than
is common for other fingerprints. The virtual size of the fingerprint is 4 billion (4,000,000,000) different features. In practice,
only a small subset of those features is present for a given molecule. The use of ECFPs resulted in a large number of bins when
using the NB (on the order of thousands). With a diverse compound collection such as in this study, the noise is represented by descriptors
with a low frequency, and therefore the signal-to-noise ratio
is amplified. This may explain the tolerance of NB that was
demonstrated in this work.
We believe that the similarities between the training sets and
themselves (Fig. 2c) were enough to yield significant frequencies
of descriptors that encoded for true active compounds. A similarity
analysis cannot always foresee the NB classifier’s degree of tolerance. The precision of the Bayesian model was preserved in the
first case (Fig. 1a), unlike the second one (Fig. 1b), in which it decayed rapidly. Only retrospectively, once all the “actives” were
known, did the higher degree of similarity between the training and
test sets for the antagonist assay (Fig. 2a) rationalize the results.
Identification of additional compounds from external compound
collections may be more difficult than prioritization of primary
screening data. In the latter case, the model is constructed and employed on the same compounds. In the former, the model is used to
prioritize new compounds. The success of the classifier therefore
depends not only on the amount of noise in the training set but also
on the degree of similarity between the training set and the new
compound collection, as indicated by the results in Figure 1b.
A compound is considered active in the primary screen when its
activity lies at or beyond the set threshold. Assigning this threshold, however, is not straightforward.12 The primary screen measures efficacy, usually in terms of percent inhibition, and not potency (IC50 or EC50 values). One would like, however, to capture
the maximal number of potent compounds based on the primary
data. One way of achieving this goal is cherry-picking additional
compounds based on a Bayesian model. Such a mathematical
model needs to handle as much noise as possible without losing its precision. Regrettably, a priori assessment of NB
noise tolerance is a challenging task because such a problem in machine learning is inherently empirical. Nevertheless, we showed
that NB can be used as a classifier in most of the test cases in this
work, even when significant noise was introduced. It also can be
argued that the best approach to enrich “noisy HTS data” is not
post facto application of statistical methods but up-front screen
quality improvement and switching to a single compound per well.
REFERENCES
1. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. Berlin: Springer, 2001.
2. Pipeline Pilot, SciTegic, Inc., 9665 Chesapeake Drive, Suite 401, San Diego, CA 92123.
3. Insightful Corporation: Insightful Miner 2 User's Guide. Seattle, WA: Insightful Corporation, 2001.
4. Labute P: Binary QSAR: a new method for the determination of quantitative structure activity relationships. Pac Symp Biocomput 1999:444-455.
5. Glick M, Klon AE, Acklin P, Davies JW: Prioritization of high throughput screening data of compound mixtures using molecular similarity. Mol Phys 2003;101:1325-1328.
6. Opitz D, Maclin R: Popular ensemble methods: an empirical study. J Artif Intell Res 1999;11:169-198.
7. Rogers D, Hahn M: High-throughput data analysis: extended-connectivity fingerprints: a high-dimensional descriptor for molecular data analysis. In preparation.
8. Morgan HL: The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service. J Chem Doc 1965;5:107-112.
9. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. New York: Morgan Kaufmann, 1999.
10. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001;46:3-26.
11. Hann M, Hudson B, Lewell X, Lifely R, Miller L, Ramsden N: Strategic pooling of compounds for high-throughput screening. J Chem Inf Comput Sci 1999;39:897-902.
12. Zhang JH, Chung TD, Oldenburg KR: Confirmation of primary active substances from high throughput screening of chemical and biological populations: a statistical approach and practical considerations. J Comb Chem 2000;2:258-265.
Address reprint requests to:
Meir Glick
Novartis Institute for Biomedical Research
100 Technology Square
Cambridge, MA 02139
E-mail: [email protected]