Molecular Similarity in Medicinal Chemistry

Perspective
pubs.acs.org/jmc
Miniperspective
Gerald Maggiora,*,†,‡ Martin Vogt,§ Dagmar Stumpfe,§ and Jürgen Bajorath*,§
† College of Pharmacy and BIO5 Institute, University of Arizona, 1295 North Martin, P.O. Box 210202, Tucson, Arizona 85721, United States
‡ Translational Genomics Research Institute, 445 North Fifth Street, Phoenix, Arizona 85004, United States
§ Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
ABSTRACT: Similarity is a subjective and multifaceted concept, regardless of whether
compounds or any other objects are considered. Despite its intrinsically subjective nature,
attempts to quantify the similarity of compounds have a long history in chemical informatics
and drug discovery. Many computational methods employ similarity measures to identify new
compounds for pharmaceutical research. However, chemoinformaticians and medicinal
chemists typically perceive similarity in different ways. Similarity methods and numerical
readouts of similarity calculations are probably among the most misunderstood computational
approaches in medicinal chemistry. Herein, we evaluate different similarity concepts, highlight
key aspects of molecular similarity analysis, and address some potential misunderstandings. In
addition, a number of practical aspects concerning similarity calculations are discussed.
■
INTRODUCTION
Molecular similarity is one of the most heavily explored and
exploited concepts in chemical informatics and is also a central
theme in medicinal chemistry.1−3 Many computational
similarity methods have been (and continue to be)
introduced.1,2 Why do we apparently care so much about
similarity in the molecular world? Simply put, comparing
compounds and their properties, especially activity, is one of
the most frequent exercises in chemical and pharmaceutical
research but often for rather different reasons. In medicinal
chemistry, questions are asked such as the following: Can a
similar follow-up candidate compound be identified for a
liability-associated lead? Is a candidate too similar to a
competitor’s compound to establish an intellectual property
position? How can we complement our compound collection
with different (i.e., dissimilar) compounds? Providing answers
to these and other questions requires the assessment of
similarity (or dissimilarity) in one way or another.
As will be discussed throughout this review, three basic
components are required to construct suitable computational
measures of molecular similarity: (1) a representation whose
components encode the molecular and/or chemical features
relevant for similarity assessment, (2) a potential weighting of
representation features, and (3) a similarity function (also
called a similarity coefficient) that combines the information
contained in the representations to yield an appropriate
similarity value. This value usually lies between ‘0’ and ‘1’, where
‘1’ results from the complete identity of the molecular
representations (but not necessarily the compounds). Representation
features typically are different types of molecular
descriptors. A weighting scheme will be required if contributions
of these features should be differently prioritized for
similarity assessment (otherwise, if all selected features should
be equally considered, no weighting is required).
© 2013 American Chemical Society
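The three components named above (representation, optional weighting, similarity function) can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the feature names are hypothetical, and the similarity function, which returns the weighted share of common features among all features present, is only one of many possible choices.

```python
from typing import Dict, Iterable, Optional


def represent(features: Iterable[str]) -> Dict[str, int]:
    """Component 1: encode a compound as a binary feature dictionary."""
    return {name: 1 for name in features}


def similarity(rep_a: Dict[str, int], rep_b: Dict[str, int],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Components 2 and 3: optional feature weights and a similarity function.

    Returns the weighted share of common features among all features present
    in either representation; features without a weight default to 1.0.
    """
    weights = weights or {}
    common = rep_a.keys() & rep_b.keys()
    present = rep_a.keys() | rep_b.keys()
    w_common = sum(weights.get(f, 1.0) for f in common)
    w_present = sum(weights.get(f, 1.0) for f in present)
    if w_present == 0:
        return 1.0  # nothing to compare: treated as identical (a convention)
    return w_common / w_present


# Hypothetical feature names, chosen only for illustration.
cpd_a = represent(["phenyl", "amide", "pyridine"])
cpd_b = represent(["phenyl", "amide", "furan"])

unweighted = similarity(cpd_a, cpd_b)                # 2 shared of 4 features
weighted = similarity(cpd_a, cpd_b, {"amide": 2.0})  # amide counts twice
identical = similarity(cpd_a, dict(cpd_a))           # identical representations
```

Because the value depends only on the representations, two different compounds that happen to map to the same feature dictionary would also receive a value of 1.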
Applications in chemical informatics that involve systematic
comparisons of compounds and the quantification of their
similarity provide a stimulating intellectual setting for method
development. Quantitative readouts of similarity are also of
practical relevance in, for example, the identification of new
candidate compounds on the basis of known actives via virtual
screening,4,5 for which similarity searching is one of the most
popular approaches.6,7
Why is similarity assessment a complicated problem? Two
compounds that share a common substructure can be detected
unambiguously, or all compounds sharing this substructure can
be retrieved from a compound database. However, as illustrated
in Figure 1, it cannot be said with certainty if two compounds
are similar to each other, what their degree of similarity might
be and how similarity should be assessed. In this case, the catch
is that it is difficult to rationalize relationships that are
principally subjective in nature. First and foremost, similarity,
like beauty, is more or less in the eye of the beholder. The
difficulty of the problem increases further when attempting to
describe similarity relationships in a formally consistent manner
and to quantify them with the aid of computational methods, as
further detailed below. Although similarity is difficult to
rationalize and quantify, computational decision support in
similarity assessment is nevertheless often requested in
medicinal chemistry; unfortunately, it fails more often than
not. Why is this so? Herein, different similarity concepts and
computational approaches for similarity assessment are
discussed. In addition, an attempt is made to rationalize why
there is often a discrepancy between computational and
medicinal chemical views of similarity and to address some
common misunderstandings. Finally, the use and interpretation
of similarity calculations in the practice of medicinal chemistry
are discussed.
Figure 1. Similarity perception and concepts. Two exemplary vascular endothelial growth factor receptor 2 ligands are shown, and different ways to
assess their similarity are illustrated.
Received: September 12, 2013
Published: October 23, 2013
dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204
■
DO SIMILAR STRUCTURES HAVE SIMILAR
PROPERTIES?
In the context of a seminal book publication8 that appeared in
the early 1990s when molecular similarity analysis first became
popular, the similarity property principle (SPP) emerged, which
stated that similar compounds should have similar properties,
the most frequently studied property being biological activity.
Although this fundamental principle sounds simple enough, it is
very difficult to capture methodologically. At the heart of the
problem is the requirement to clearly define and consistently
account for similarity. As illustrated in Figure 2, compounds
that might not be considered similar often share similar activity
(horizontal compound relationship) or other property values.
In contrast, compounds that likely would be considered very
similar might not share similar activity (vertical compound relationship),
clearly illustrating the limitations of the SPP. Structure−activity
relationship (SAR) discontinuity, i.e., small chemical modifications
that lead to significant changes in biological activity,
represents a major limitation of the SPP. The extreme form of
SAR discontinuity is provided by “activity cliffs”.9−11
A key aspect associated with the SPP that strongly influences
nearly all considerations of similarity in chemical informatics
and medicinal chemistry is that molecular similarity values are
rarely of interest per se. Rather, they are used as a basis for
correlating similarity, however assessed, with compound-dependent
properties such as biological activity. Despite its
fundamental importance, this aspect is surprisingly often not
considered in computational similarity analysis.
Figure 2. Similarity versus activity. Three vascular endothelial growth factor receptor 2 ligands are shown that represent different (vertical vs
horizontal) similarity−activity (potency) relationships.
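The vertical relationship in Figure 2 (high similarity, large potency difference) is the pattern behind SAR discontinuity and activity cliffs. The following Python sketch flags such pairs in a small, invented data set. The compound names, feature sets, potency values, and both thresholds (`sim_threshold`, `potency_gap`) are assumptions for illustration and are not taken from the text; similarity is computed here simply as the share of common features among all features of a pair.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_fraction(fa: Set[str], fb: Set[str]) -> float:
    """Share of common features among all features present in either set."""
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


def find_cliff_pairs(features: Dict[str, Set[str]],
                     potency: Dict[str, float],
                     sim_threshold: float = 0.8,
                     potency_gap: float = 2.0) -> List[Tuple[str, str]]:
    """Return pairs that are similar in structure but far apart in potency.

    potency is assumed to be on a logarithmic scale (e.g., pIC50), so a gap
    of 2.0 corresponds to a 100-fold difference.
    """
    pairs = []
    for a, b in combinations(sorted(features), 2):
        similar = shared_fraction(features[a], features[b]) >= sim_threshold
        distant = abs(potency[a] - potency[b]) >= potency_gap
        if similar and distant:
            pairs.append((a, b))
    return pairs


# Invented data: cpd1 and cpd2 differ by one feature but by 2.6 log units;
# cpd3 shares no features with cpd1 yet has nearly the same potency.
features = {
    "cpd1": {"f1", "f2", "f3", "f4", "f5"},
    "cpd2": {"f1", "f2", "f3", "f4", "f5", "f6"},
    "cpd3": {"f7", "f8", "f9"},
}
potency = {"cpd1": 8.5, "cpd2": 5.9, "cpd3": 8.4}

cliffs = find_cliff_pairs(features, potency)
```

In this toy set, the pair cpd1/cpd3 plays the role of the horizontal relationship: dissimilar by this measure, yet nearly equipotent.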
■
SIMILARITY HAS MANY DIFFERENT MEANINGS
It is evident that similarity is a widely used concept that is of
relevance for recognizing and organizing all components of the
physical environment as well as many other aspects of life.
However, even in the more narrowly confined molecular world,
similarity may have many different meanings or interpretations
depending on our individual perspective. Hence, if the ultimate
aim is to formally describe similarity in a consistent manner
despite its intrinsic limitations, it is of critical importance to first
distinguish between different similarity criteria and concepts, as
illustrated in Figure 1.
Chemical or Molecular Similarity? Although the terms
chemical and molecular similarity are often used synonymously,
this may not be entirely accurate. Chemical similarity is based
primarily on the physicochemical characteristics of compounds
(e.g., solubility, boiling point, log P, molecular weight, electron
densities, dipole moments, etc.) while molecular similarity
focuses primarily on the structural features (e.g., shared
substructures, ring systems, topologies, etc.) of compounds
and their representation. Physicochemical properties and
structural features are typically accounted for by different
types of descriptors. Such descriptors are generally defined as
mathematical functions or models of chemical properties or
molecular structure. For chemical similarity assessment,
reaction information and different functional groups can also
be considered. In the current work, the focus is more on
molecular than chemical similarity.
2D versus 3D Similarity. Similarity can be evaluated on
the basis of 2D and 3D molecular representations. 2D similarity
methods rely on information deduced from molecular graphs.
Direct graph comparisons12 and graph similarity calculations
are computationally demanding and not widely applied in
molecular similarity analysis at present. By contrast, molecular
descriptors that capture graph information such as fragment13
or topological atom environment fingerprints14 are very
popular. Fingerprints are generally defined as bit string13 or
feature set14 representations of molecular structure and
properties. Such molecular representations can be efficiently
compared computationally, thus enabling similarity calculations
on a large scale. Because compounds are inherently three-dimensional and their molecular conformations have generally
higher information content than their corresponding molecular
graphs, one might anticipate that 3D similarity, which involves
the comparison of molecular conformations and associated
properties,15,16 should be generally preferred to 2D similarity.
However, this is not the case for two principal reasons. First,
chemists are trained on the basis of molecular graphs (i.e., 2D
structural representations) and in general are more comfortable
with basing their considerations on graphs than on the 3D
structures of compounds. Molecular graphs typically used by
chemists often also contain conformational and stereochemical
information. Second, given the uncertainties associated with
identifying biologically active conformations in vast conformational ensembles of test compounds, 2D approaches are
typically more robust, despite their relative simplicity, and
often yield superior results in SAR analysis and activity
prediction.17,18 Many current similarity methods preferentially
utilize 2D molecular representations; most, however, do not
contain any stereochemical information, which limits their
ability to properly treat enantiomeric compounds. Since such
compounds have identical atom connectivity, their similarity
values will be unity if stereoinsensitive molecular representations are used. Furthermore, as will be discussed below in
detail, similarity calculations on the basis of 2D molecular
representations have a number of other intrinsic limitations.
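The statement that enantiomers obtain a similarity value of unity under stereoinsensitive representations can be illustrated with a deliberately crude Python sketch. The molecule encoding (a list of element/stereo-label pairs plus a bond list), the bonded-pair features, and the labels "(R)" and "(S)" are all assumptions invented for this example; real fingerprints are far richer.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, str]   # (element symbol, stereo label or "")
Bond = Tuple[int, int]   # indices into the atom list


def pair_features(atoms: List[Atom], bonds: List[Bond],
                  use_stereo: bool) -> Set[str]:
    """Set of bonded atom-pair features, with or without stereo labels."""
    labels = [el + (tag if use_stereo else "") for el, tag in atoms]
    return {"-".join(sorted((labels[i], labels[j]))) for i, j in bonds}


def shared_fraction(fa: Set[str], fb: Set[str]) -> float:
    """Share of common features among all features present in either set."""
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


# Toy heavy-atom skeleton with one stereocenter (atom 0); both enantiomers
# have identical connectivity and differ only in the stereo label.
bonds = [(0, 1), (0, 2), (0, 3), (3, 4), (3, 5)]
atoms_r = [("C", "(R)"), ("N", ""), ("C", ""), ("C", ""), ("O", ""), ("O", "")]
atoms_s = [("C", "(S)"), ("N", ""), ("C", ""), ("C", ""), ("O", ""), ("O", "")]

sim_insensitive = shared_fraction(pair_features(atoms_r, bonds, False),
                                  pair_features(atoms_s, bonds, False))
sim_sensitive = shared_fraction(pair_features(atoms_r, bonds, True),
                                pair_features(atoms_s, bonds, True))
```

With stereo labels ignored, the two feature sets coincide and the value is 1; once the labels enter the features, the value drops below 1. How far it drops depends entirely on the chosen encoding.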
In the following, we will base our discussion of similarity
calculations and similarity measures on 2D approaches, in
particular, fingerprint similarity searching, for several reasons.
As pointed out above, chemists are generally more familiar with
2D than 3D representations of compounds and consider
similarity mostly on the basis of 2D molecular graphs.
Furthermore, many of the conclusions drawn from the analysis
of simple similarity searching readily apply to more complex
similarity methods. In this context, our preference for 2D
similarity assessment should not be interpreted as a disregard of
3D similarity concepts and methods. Given the medicinal
chemistry focus of our presentation, we mostly adhere to 2D
similarity considerations herein.
Molecular versus Biological Similarity. Another similarity concept that requires consideration is the biological
similarity of compounds, which departs from the conceptual
framework of the SPP. Instead, the usual structural or
physicochemical property descriptors are replaced by the
activities of the compounds against a panel of reference targets,
generally proteins, that provide “biological signatures”19,20
analogous to the structure- or property-based representations
extensively discussed herein. In this case, the activity profiles
corresponding to the biological signatures of the compounds
are compared using an appropriate similarity function as a
measure of pairwise similarity, irrespective of the structural
features of the compounds. Hence, in this case, biological
similarity is assessed in target space rather than chemical space.
For SAR analysis and medicinal chemistry programs,
biological similarity is generally more difficult to implement
than structure- or property-based representations because
specific activity values might not be available for compounds
of interest.
In addition to their use as molecular similarity measures,
biological signatures can also provide an approximate measure
of compound promiscuity.21 For example, summing the
individual values in a binary biological signature (active = 1
or inactive = 0) yields the number of targets against which the
associated compound exhibits activity.
Global versus Local Similarity. A very important criterion
for similarity analysis is distinguishing between global and local
similarity views. For example, the comparison of pharmacophore
models in drug design focuses only on selected atoms,
groups, or functionalities that are known or hypothesized to be
responsible for activity. This represents a local view of
similarity, in contrast to the more global view typically found
in chemical informatics, where compounds are considered in
their entirety. In the latter case, the calculated property or
structural descriptors typically used to compute molecular
similarities are generally derived from structural information
associated with entire compounds. For example, if we translate
the structural information of a compound into a fragment
fingerprint, a global molecular representation is obtained. This
whole-compound view of similarity is characteristic of the
perspective of chemoinformaticians.
Medicinal Chemistry Perspective. In addition to local
and global views, however, special attention must also be paid
to a medicinal chemist’s perspective in this context. Consider,
for example, the set of well-known cyclooxygenase (COX)
inhibitors compared in Figure 3. All of these inhibitors are
approved drugs except lumiracoxib, which lost its United States
approval in 2007. If we apply a whole-compound view,
compounds such as the ibuprofen enantiomers, ibuprofen and
paracetamol, or diclofenac and lumiracoxib, appear visibly
similar. From a medicinal chemistry point of view, however, this
assessment may not be generally agreed upon since small
chemical differences can lead to important changes in specificity
profiles (e.g., diclofenac vs lumiracoxib) or compounds
containing different functional groups can be synthesized or
derivatized in different ways (e.g., ibuprofen vs paracetamol).
Hence, a medicinal chemist’s view of similarity might again be
more local in nature and/or take chemical reaction information
directly into account. Moreover, these COX inhibitors are
involved in highly complex similarity−activity relationships that
also cannot easily be separated from a medicinal chemistry
perspective. For example, the (R)-(−)-enantiomers of
ibuprofen and naproxen are inactive, but under physiological
conditions the (R)-(−)-enantiomer of ibuprofen is converted
into the active (S)-(+)-enantiomer by the enzyme
2-arylpropionyl-CoA epimerase. Furthermore, paracetamol and
lumiracoxib are selective for COX-2, but the other inhibitors
are active against both COX-1 and COX-2, the former activity
giving rise to gastrointestinal side effects. Moreover, naproxen
alone is also active against hormone-sensitive lipase.
Figure 3. Complex similarity relationships. Cyclooxygenase (COX) inhibitors and their activity profiles are compared. HSL stands for hormone-sensitive lipase.
Such examples illustrate that considerations of chemical and
functional criteria might readily alter the perception of global
molecular resemblance. Clearly, such similarity considerations
fall into a gray zone, as they are influenced by subjective criteria
as well as the experience of the investigator, and hence, there is
no generally accepted way to judge such similarity relationships.
Accordingly, relations between the cognitive and computational
aspects of molecular similarity are discussed in more detail in
the following section.
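The binary biological signatures described under Molecular versus Biological Similarity lend themselves to a short Python sketch: summing a signature gives the promiscuity count, and two signatures can be compared in target space without any reference to structure. The five-target panel and both activity profiles are invented for illustration, and the similarity function used (targets hit by both compounds divided by targets hit by either) is one possible choice rather than a prescribed one.

```python
from typing import Sequence


def promiscuity(signature: Sequence[int]) -> int:
    """Number of panel targets against which the compound is active."""
    return sum(signature)


def signature_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Targets hit by both compounds divided by targets hit by either."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must refer to the same target panel")
    both = sum(1 for x, y in zip(sig_a, sig_b) if x and y)
    either = sum(1 for x, y in zip(sig_a, sig_b) if x or y)
    # Two compounds inactive against the whole panel have identical
    # signatures; returning 1.0 here is a convention, not a necessity.
    return both / either if either else 1.0


# Invented profiles over a hypothetical panel of five targets
# (1 = active, 0 = inactive).
sig_x = [1, 1, 0, 0, 1]
sig_y = [1, 0, 0, 0, 1]

hits_x = promiscuity(sig_x)
hits_y = promiscuity(sig_y)
bio_sim = signature_similarity(sig_x, sig_y)
```

The sketch also makes the practical limitation visible: every position in a signature requires a measured activity, so a compound that has not been tested against the full panel cannot be placed in target space.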
Figure 4. Similarity assessment through pattern recognition. Exemplary computer- and human-based pattern recognition processes for similarity
assessment are illustrated.
■
COGNITIVE VERSUS COMPUTATIONAL ASPECTS
OF MOLECULAR SIMILARITY
While similarity as perceived by trained medicinal chemists is
decidedly not the same as similarity obtained by computational
means, there are some aspects of the two that are comparable.
For example, in both cases, some type of symbolic
representation is required to characterize the structural
information of the compounds being compared, although in
the former case the representation is not explicitly stated.
Regardless of their details, however, both types of symbolic
representation must make molecular information comprehensible in such a way that structural/feature patterns can be
identified and recognized. In general terms, pattern recognition
refers to the ability to detect recurrent themes, organization
principles, relationships, and rules in large data sets,22 an
essential requirement for decision making by humans as well as
for computational learning.22,23
The identification of patterns within data forms a basis for
classification and directly applies to our molecular world. More
than anything else, the recognition of molecular patterns, based
on human or computational exploration, provides a basis for
arriving at decisions as to whether two compounds are similar
to each other or not. Since data complexity generally scales with
the number of patterns that can be discovered, it quickly
becomes impossible for humans to consider them in a
comprehensive manner. Therefore, humans intuitively, and
often unconsciously, reduce patterns to simpler ones that
contain the essential feature(s) of the original pattern. But
unlike applications of computational pattern recognition, the
precise nature of these key patterns in human pattern
recognition is unknown. For instance, to cross a road safely,
we need to recognize patterns associated with moving objects
and/or engine noise but are not required to understand which
type of car or motorbike is approaching. This intuitive
reductionist approach to pattern recognition is clearly reflected
by decision-making by medicinal chemists, as further discussed
below.
Selecting key patterns regardless of whether they are
mathematically defined or expressed in terms of vague
conscious or subconscious mental constructs is the most
crucial element in any assessment of molecular similarity. The
key patterns used by humans or computers will generally vary
from individual to individual or from algorithm to algorithm, a
situation that most likely will yield results with varying degrees
of agreement for the same set of data. This follows because the
representations used by humans and by computers, which most
likely are significantly different, are crucial components in
determining what can be understood about relationships of
objects to each other, whether they are physical objects,
concepts, ideas, or compounds. Despite the common search for
key patterns, the use of representations to determine similarity
in machine computation compared to human perception of
similarity by medicinal chemists differs significantly,2 as
schematically illustrated in Figure 4.
In the case of machine computation, algorithms have been
developed for constructing suitable representations of the
structural information in compounds and for evaluating
similarity functions or coefficients associated with these
representations.4,24,25 However, since there is no unique or
invariant way to represent molecular and chemical information,
constructing representations suitable for a given task or goal
depends on what the task or goal is.
As noted earlier and discussed further below, mathematical
functions that are designed to reflect the degree of molecular
similarity typically yield values that lie on the unit interval [0, 1]
of the real line. But as is also discussed below, the form of these
functions also influences the similarity values because they
usually differ even when identical representations are used,
although in some cases they are linearly or monotonically
related.2
Role of Chemical Intuition and Experience. Although
well-defined, computed values may not account for the degree
of similarity in a way that is consistent with the perceptions of
medicinal chemists because human perception of similarity is a
much more complicated, varied, and subtle task (vide supra).
Moreover, the “cognitive algorithms” by which medicinal
chemists perceive similarity are largely unknown, although
some recent work has begun to address this question.26−28
These studies clearly show that chemical intuition and
experience play major roles in decision making in medicinal
chemistry. Surprisingly, there is typically little consensus
between experienced medicinal chemists in judging preferred
compounds and assessing favorable or unfavorable molecular
features.26−28 Furthermore, it has been shown that perception
of molecular structures is strongly context-dependent; i.e.,
depending on the order in which we view compounds and how
they are grouped, different conclusions are drawn.27 This points
to a potential advantage of computational similarity assessment
because compound representations or patterns are constant
and context-independent. It has also been shown that medicinal
chemists often have difficulties comprehending the nature and
meaning of the parameters they might have considered and the
scientific criteria upon which decisions on compounds are
based.28 Medicinal chemists typically base their compound
decisions on very few patterns or parameters, fewer than they
believe,28 a fact that clearly reflects the pattern-reduction
approach referred to above. Decision parameters generally
result from feature reduction and pattern reduction, which also
provides a foundation of machine learning approaches.22,23
Computational methods such as neural networks29 or support
vector machines30 are essentially designed for pattern-based
similarity assessment, which requires training data, the use of
which also renders these computational modeling efforts
context-dependent. The resulting computational models have
the often cited “black box character”, which means that they
cannot be interpreted in chemical terms. In some ways, this
provides an interesting analogy to medicinal chemists who do
not realize upon which parameters their compound decisions
might be based.28
Although it may not be possible to rationalize our judgments,
we are typically more content with our own decisions than
those obtained computationally that, in many cases, can be
difficult to interpret. Accordingly, machine learning methods
such as decision trees31 or emerging chemical patterns32 are
often favored in practice because they yield interpretable
patterns, even though they may be based on rather abstract
representations of molecular and chemical information. In light
of the above, it is clear that judgments of molecular similarity
can be influenced by a number of cognitive aspects. Lastly, with
regard to the SPP, it should be re-emphasized that mere
assessment of molecular similarity is generally not the ultimate
goal. Rather, in many cases, it is the identification of similar
compounds that, based on the SPP, are presumed to have
similar properties (especially biological activities) to known
reference or target compounds. This adds additional layers of
complexity to our perception of similarity and can further
complicate our judgments.
Similarity Coefficients. The question then arises as to
whether it is reasonable to assume that any “rationalization” of
similarity, or that any consistent computational representation
and comparison of compounds that yields a numerical readout,
will increase our own consensus and be superior to subjective
decisions. The Tanimoto coefficient (Tc)24,33 is introduced to
help answer this and related questions and to provide an
illustration of how molecular similarity can be quantified.
Although it may not be the best procedure, it is by far the most
popular and, because of its ease of implementation and speed, is
in widespread use today in chemical informatics and computational
medicinal chemistry. As detailed in the sequel, a variety
of other similarity measures,24,25 most of which did not
originate in chemical informatics but in other scientific fields
(such as statistics, ecology, and psychology), have also been
used to compare specific molecular representations.
The Tanimoto coefficient is generally defined by
$$\mathrm{Tc}(A,B) = \frac{c}{a + b - c} \tag{1}$$
where a and b are the number of features present in
compounds A and B, respectively, and c is the number of
features shared by A and B. Hence, Tc quantifies the ratio of the
number of features common to A and B to the total number of
features present in A or B, where the c term in the denominator
corrects for double counting of the shared features.
Another perhaps more intuitive way to interpret Tanimoto
similarity is based on an alternative form of the denominator in
eq 1, i.e.,
$$a + b - c = (a - c) + (b - c) + c \tag{2}$$
Here the terms (a − c) and (b − c) are the number of features
unique to A or B, respectively. Substituting eq 2 into eq 1 yields
the numerically equivalent form of Tc,
$$\mathrm{Tc}(A,B) = \frac{c}{(a - c) + (b - c) + c} \tag{3}$$
Dividing numerator and denominator by (a − c) + (b − c)
gives
$$\mathrm{Tc}(A,B) = \frac{R(a,b,c)}{1 + R(a,b,c)} \tag{4}$$
where
$$R(a,b,c) = \frac{c}{(a - c) + (b - c)} \tag{5}$$
which can be interpreted as the ratio of the number of features
shared by A and B to the number of their unique features.
As A and B become more similar, the number of shared
features approaches the number of features in A and B (i.e., c →
a,b) and the number of unique features in both compounds
approaches zero (i.e., (a − c) → 0 and (b − c) → 0) because in
the limit the number of shared features and number of features
in A and B become equal (i.e., a = b = c). Thus, their ratio goes
to infinity (i.e., R(a,b,c) → ∞), which in the limit gives
Tc(A,B) = 1. Conversely, as A and B become less similar, the
number of shared features approaches zero and consequently
all of the features of A and B are unique, and thus, the ratio of
these features also goes to zero (i.e., c → 0, (a − c) → a, (b −
c) → b, and R(a,b,c) → 0); thus, in the limit, Tc(A,B) = 0. In
the intermediate region where the number of shared features is
greater than zero but less than the lesser of the number of
features in A and B (i.e., 0 < c < min(a,b)) and where the
number of unique features is less than the total number of
possible features (i.e., (a − c) + (b − c) < a + b), the Tanimoto
similarity will lie between the extremes of the unit interval of
the real line, i.e., 0 < Tc(A,B) < 1. One way to think about this
is to note that as the number of shared features between two
compounds increases, their number of unique features must
correspondingly decrease. Thus, there is interplay between the
number of shared features and the number of unique features
exemplified by their ratio R(a,b,c).
The calculation of Tanimoto similarity is typically based on
representations called “molecular fingerprints”,4,6,7 which can
be viewed as classical sets or binary vectors whose elements
have values of “1” or “0” corresponding, respectively, to the
presence or absence of specific features (e.g., molecular
fragments). In some cases, elements with value “1” are called
“on-bits” and those with value “0” are called “off-bits”, hence,
the description of molecular fingerprints as “bit strings” or “bit
vectors”. Note that the molecular fingerprints described above
do not account for multiple occurrences of the different
features, only whether they occur at least once in a given
compound. However, feature counts can be added to
fingerprints by using integer values to represent features
instead of a binary format. Fingerprints of different design and
complexity are available,7 as further discussed below. For
similarity searching, fingerprints are among the original and to
this date most popular descriptors.
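As a minimal illustration of eq 1 applied to bit-string fingerprints, the following Python sketch counts the on-bits a, b, and c and evaluates Tc, and then recomputes the same value through the ratio R(a,b,c) of eqs 4 and 5. The two 10-bit fingerprints are invented, and the value returned for two fingerprints without any on-bits is a convention assumed here, since eq 1 is undefined in that case.

```python
def on_bit_counts(fp_a: str, fp_b: str):
    """Return (a, b, c): on-bits in A, on-bits in B, and shared on-bits."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    return a, b, c


def tanimoto(a: int, b: int, c: int) -> float:
    """Eq 1: shared features over all features present in A or B."""
    denominator = a + b - c
    return c / denominator if denominator else 1.0  # empty case: convention


def tanimoto_from_ratio(a: int, b: int, c: int) -> float:
    """Eqs 4 and 5: Tc expressed through R = shared / unique features."""
    unique = (a - c) + (b - c)
    if unique == 0:
        return 1.0  # R grows without bound as the unique features vanish
    r = c / unique
    return r / (1 + r)


# Invented 10-bit fingerprints for two hypothetical compounds.
fp_a = "1101001010"
fp_b = "1100011000"

a, b, c = on_bit_counts(fp_a, fp_b)
tc = tanimoto(a, b, c)
tc_ratio = tanimoto_from_ratio(a, b, c)
```

Here a = 5, b = 4, and c = 3, so both routes give Tc = 0.5: three shared features against three unique ones, i.e., R = 1.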
Dissimilarity can be quantified in a complementary manner
such that small values indicate similarity and large values
dissimilarity. Accordingly, a dissimilarity measure can be
derived from the Tc by taking the appropriate complement
known as the Soergel distance (Sg),24 i.e.,
$$\mathrm{Sg}(A,B) = 1 - \mathrm{Tc}(A,B) = 1 - \frac{c}{a + b - c} \tag{6}$$
that can be rewritten as
$$\mathrm{Sg}(A,B) = 1 - \frac{c}{a + b - c} = \frac{(a + b - c) - c}{a + b - c} = \frac{(a - c) + (b - c)}{a + b - c} \tag{7}$$
The Tversky coefficient (Tv) for a reference compound A and a
database compound B is given by
$$\mathrm{Tv}_{\alpha,\beta}(A,B) = \frac{c}{\alpha(a - c) + \beta(b - c) + c} \tag{8}$$
The denominator is closely related to that given for Tanimoto
similarity in eq 3 except for the two parameters α and β that
weight the number of features unique to A or B, (a − c) and (b
− c), respectively. As defined by Tversky,34 α and β are nonnegative.
In chemical informatics and computational medicinal
chemistry applications, these parameters are typically chosen to
lie within the unit interval [0, 1] of the real line. In either case,
zero and unity bound the value of Tv. The larger α is compared
to β, the more weight is put on the unique features of reference
compound A and the less on database compound B and vice
versa. Thus, in the case of Tv, whose values also range from 0
to 1, the similarity values change as the two weights vary. This
makes it possible to study the relative importance of common
and unique features for compound ranking with respect to the
reference and database compounds.
As discussed further below, the weighting scheme can be
applied to introduce asymmetry into similarity calculations. For
the special case α = β = 1, where the unique features of both
compounds are weighted equally, Tv is identical to Tc. In the
case where α = β = 0.5, Tv is identical to the Dice coefficient
(Dc)24
$$\mathrm{Dc}(A,B) = \frac{c}{\tfrac{1}{2}(a + b)} \tag{9}$$
written here in a form that clearly shows that the denominator
is the arithmetic mean of the number of features in A and B.
Since (1/2)(a + b) ≤ (a + b) − c, it follows that Tc(A,B) ≤
Dc(A,B), as illustrated by the distributions depicted in Figure 6.
Both similarity coefficients are symmetric, since the similarity
of A with respect to B is the same as the similarity of B with
respect to A. In fact, any Tv in which α = β yields a symmetric
similarity coefficient such that $\mathrm{Tv}_{\alpha=\beta}(A,B) = \mathrm{Tv}_{\alpha=\beta}(B,A)$.
Tversky similarity coefficients with two unequal weighting
factors (α ≠ β) are, on the other hand, asymmetric, their degree
of asymmetry depending on the relative magnitudes of the
weighting factors.
Similarity coefficients can be classified according to their
compound ranking characteristics. Coefficients that always
produce the same ranking of compounds, although their
absolute similarity values might differ, are said to be monotonic.
For example, $\mathrm{Tv}_{\alpha,\beta}(A,B)$ and $\mathrm{Tv}_{\alpha',\beta'}(A,B)$ are monotonically
related if the parameters have the same ratio so that α′ = kα
and β′ = kβ. These coefficients can be converted into each
other by the monotonic function
$$\mathrm{Tv}_{\alpha',\beta'}(A,B) = \frac{1}{\dfrac{k}{\mathrm{Tv}_{\alpha,\beta}(A,B)} + 1 - k} \tag{10}$$
which can be verified by elementary algebraic transformations.
Thus, normalization of the parameters imposes no restriction
on the ranking and, hence, the generality of Tv.
In the following, the sum of the weighting parameter values is
normalized to unity (i.e., α + β = 1), so that β = 1 − α and eq 8
becomes
$$\mathrm{Tv}_{\alpha}(A,B) = \frac{c}{\alpha(a - c) + (1 - \alpha)(b - c) + c} \tag{11}$$
Thus, Tversky similarity now only depends on the single
parameter α. Note that differences in the numerical distribution
of the normalized Tv and Tc are to a large extent due to the
fact that the Tc corresponds to a non-normalized Tv under the
condition α + β = 2. Furthermore, as clearly shown in Figure 6,
$$\mathrm{Tv}_{\alpha=1,\beta=1}(A,B) = \mathrm{Tc}(A,B) \le \mathrm{Tv}_{\alpha=1/2,\beta=1/2}(A,B) = \mathrm{Dc}(A,B) \tag{12}$$
An extreme form of Tv occurs when the reference compound A
is weighted (α = 1) and the database compound is not (β = 0),
in which case eq 8 becomes
$$\mathrm{Tv}_{\alpha=1,\beta=0}(A,B) = \frac{c}{a} \tag{13}$$
In this case, the Tversky similarity coefficient provides a
measure of how similar A is to B, which can be interpreted as
the fraction of the features in the reference compound A that
are matched by database compound B. Interchanging the values
of the weighting factors so that now α = 0 and β = 1 places the
entire weighting on the database compound B and gives
$$\mathrm{Tv}_{\alpha=0,\beta=1}(A,B) = \frac{c}{b} \tag{14}$$
which in this case can be interpreted as the fraction of the
database compound B that is similar to the reference
compound A. These two forms of Tv represent extreme
forms of Tversky similarity coefficients.
restricted to unity, i.e., α + β = 1. Replacing β in eq 8 by 1 − α
yields
c
Tvα(A, B) =
α(a − c) + (1 − α)(b − c) + c
c
=
αa + (1 − α)b
(11)
As noted above, the denominators in eqs 1 and 3, a + b − c and
(a − c) + (b − c) + c, respectively, represent the number of
features that occur in either A or B, and the Tc can then be
rationalized as the percentage of shared features, whereas the
Soergel distance corresponds to the percentage of features
unique to A or B given by (a − c) and (b − c), respectively.
Another similarity measure that is growing in usage is the
Tversky coefficient (Tv),34 which is given by
Tvα , β(A, B) =
1
k
Tvα , β(A, B)
(9)
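The relationships among the Tanimoto, Soergel, Dice, and Tversky coefficients described in this section can be checked numerically. The following is a minimal sketch, assuming that each compound is represented as a non-empty set of feature identifiers; the function names and the two example feature sets are hypothetical and chosen only for illustration.

```python
import math

def counts(A, B):
    """Return (a, b, c): features in A, features in B, features shared by both."""
    A, B = set(A), set(B)
    return len(A), len(B), len(A & B)

def tanimoto(A, B):
    a, b, c = counts(A, B)
    return c / (a + b - c)

def soergel(A, B):
    # Dissimilarity as the complement of Tanimoto similarity.
    return 1.0 - tanimoto(A, B)

def dice(A, B):
    a, b, c = counts(A, B)
    return c / (0.5 * (a + b))

def tversky(A, B, alpha, beta):
    a, b, c = counts(A, B)
    return c / (alpha * (a - c) + beta * (b - c) + c)

# Hypothetical feature sets: a = 6, b = 8, c = 3.
A = {1, 2, 3, 4, 5, 6}
B = {4, 5, 6, 7, 8, 9, 10, 11}

tc, dc = tanimoto(A, B), dice(A, B)
print(f"Tc = {tc:.3f}, Sg = {soergel(A, B):.3f}, Dc = {dc:.3f}")

# Special cases of the Tversky coefficient.
assert math.isclose(tversky(A, B, 1.0, 1.0), tc)      # alpha = beta = 1   -> Tc
assert math.isclose(tversky(A, B, 0.5, 0.5), dc)      # alpha = beta = 0.5 -> Dc
assert math.isclose(tversky(A, B, 1.0, 0.0), 3 / 6)   # c / a
assert math.isclose(tversky(A, B, 0.0, 1.0), 3 / 8)   # c / b

# Rescaling both weights by k changes values but not rankings:
# Tv' = 1 / (k / Tv + 1 - k), here with k = 2 mapping Dc onto Tc.
k = 2.0
assert math.isclose(1.0 / (k / dc + 1.0 - k), tc)
```

The last assertion applies the conversion between Tversky coefficients whose weights share the same ratio, here turning the Dice value (α = β = 0.5) into the Tanimoto value (α = β = 1).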
3192
dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204
Journal of Medicinal Chemistry
Perspective
intuitive perspective. Tversky similarity, which originated in
psychology (not informatics), is conceptually based on a
number of asymmetric characteristics that are associated with
human perceptions of similarity. An example given by Tversky
involves a comparison of Korea and China; the similarity of
Korea to China is usually considered to be greater than the
similarity of China to Korea. This view, which is rather general,
suggests that relative size, however accounted for, has a
significant influence on the perceived asymmetry of the
similarity of entities, including compounds, when compared
by humans. Moreover, this can also be interpreted in terms of
eqs 13 and 14, since the “fraction” of Korea that is similar to
China is definitely not the same as the “fraction” of China that
is similar to Korea. Often it is not considered that the Tversky
similarity coefficient is parametrized to account for asymmetric
aspects of similarity by capturing the asymmetric characteristics
inherent in many different types of objects under comparison.
To understand, in light of the above, how human perception
of the similarity of two compounds might be asymmetric, it is
necessary to distinguish the compounds being compared. Let us
consider an ordered pair in which A is a reference compound
and B a database compound. If the reference A is a small
compound and a substructure of a larger compound, A is rather
similar to B. This follows because A is a close match to a part of
B. However, if the situation is reversed, i.e., B is now used as the
reference and A is the database compound, the similarity will be
lower because most of B differs from A. This is a molecular
example of the size effect described above in the case of the
perceived asymmetric similarity comparisons of Korea and
China. Equations 13 and 14 and the accompanying discussion
fully support this analysis.
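As a worked illustration with hypothetical feature counts (not taken from the study), suppose that A has a = 4 features, all of which also occur in B, and that B has b = 12 features, so that c = 4. Weighting only the reference compound (α = 1, β = 0) then gives:

```latex
\begin{align*}
\mathrm{Tv}_{\alpha=1,\beta=0}(A,B) &= \frac{c}{a} = \frac{4}{4} = 1
  && \text{(A as reference)} \\
\mathrm{Tv}_{\alpha=1,\beta=0}(B,A) &= \frac{c}{b} = \frac{4}{12} \approx 0.33
  && \text{(B as reference)} \\
\mathrm{Tc}(A,B) = \mathrm{Tc}(B,A) &= \frac{c}{a+b-c} = \frac{4}{12} \approx 0.33
  && \text{(symmetric)}
\end{align*}
```

With these assumed counts, the small compound A is fully matched by B, whereas only one-third of B is matched by A, and the symmetric Tc cannot distinguish the two directions of comparison.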
In Tc calculations, this perceived asymmetric similarity
relationship is not reflected, but Tv calculations offer this
possibility as a consequence of appropriate weighting.
Importantly, perceived relative size-dependent asymmetric
similarity is distinct from representation-dependent molecular
size or complexity effects mentioned above, which systematically bias similarity calculations by producing large values for
larger and topologically more complex compounds.
Human Perception. The assessment of similarity on the
basis of human perception is considerably more complicated
than reflected by the examples given above because a number
of other conscious and subconscious factors also play a role.
For example, a key factor in similarity assessment is the ability
of humans, in general, and medicinal chemists, in particular, to
intuitively reduce the complexity of the problem at hand (vide
supra). This need to reduce complexity largely depends on the
fact, as pointed out by numerous psychologists, that humans
can only hold a relatively small number of things in their
working memory at any point in time.40,41 Working memory is
that part of memory that actively holds multiple pieces of
transitory information that can be manipulated by verbal and
nonverbal tasks, such as reasoning and comprehension, and
makes the results of these tasks available for further
information-processing. In the case of medicinal chemists this
means that only structural features perceived to be most
essential, or some simplified representation of them, might be
retained and considered for similarity assessment, very
consistent with the results obtained by Kutchukian et al.,28
indicating the partly unconscious use of only one or two
chemical parameters by medicinal chemists in compound
evaluation and decision making. Understanding these criteria,
which will undoubtedly differ from medicinal chemist to
Increasing molecular size or complexity generally leads to
increasing fingerprint bit densities, which are defined for a given
compound A as
$$\rho_{\mathrm{FP}}(A) = \frac{\text{number of on-bits}}{\text{total number of fingerprint bits}} \tag{15}$$
Such increases in the bit density ρFP(A) have a statistical
tendency to yield higher similarity values for larger compounds,35 a well-known complication in similarity searching7
and a cause of apparent asymmetry in distributions of similarity
values.36 Molecular complexity effects can be balanced or
eliminated in different ways, for example, by equally taking into
account bits that are set on or off in similarity calculations37,38
or by combining binary fingerprint representations with their
complements, i.e., adding the complement to the original bit
string, thereby producing a constant fingerprint bit density for
compounds of any size.39
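The effect of appending the complement can be sketched in a few lines. The example below assumes fingerprints stored as lists of 0/1 values; the two bit strings are hypothetical and only meant to contrast a sparsely and a densely populated fingerprint.

```python
def bit_density(bits):
    """Fraction of on-bits in a binary fingerprint given as a list of 0/1."""
    return sum(bits) / len(bits)

def with_complement(bits):
    """Append the bitwise complement to the original bit string."""
    return bits + [1 - x for x in bits]

small = [1, 0, 0, 0, 0, 0, 1, 0]   # hypothetical small compound
large = [1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical larger compound

for name, fp in (("small", small), ("large", large)):
    print(name, bit_density(fp), bit_density(with_complement(fp)))
```

Whatever the original density, the combined bit string always has exactly half of its bits set on, which is the constant bit density referred to above.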
Calculating Tanimoto, Tversky, or Dice similarity has an
assumed advantage that numerical values can now be used to
distinguish similarity relationships in a consistent manner. How
does this numerical approach from chemical informatics relate
to, and perhaps influence, the more subjective assessment of
similarity in medicinal chemistry? Are calculated similarity
values suitable to replace chemical intuition and judgment?
Computed versus Intuitive Similarity. There are a
number of issues that arise when comparing computed
similarity values with those assigned by medicinal chemists.
One issue is that the similarity scale employed by medicinal
chemists is not uniform. The following argument, which
depends on the complementary nature of similarity and
dissimilarity, illustrates this point. In computations the degree
of dissimilarity is typically taken as the complement of
similarity:
dissimilarity = 1 − similarity
Hence, the more dissimilar two compounds are, the less similar
they are to each other and vice versa. Importantly, such
complementary behavior between computed similarity and
dissimilarity values does not, however, apply in the case of
human perception. For example, humans can better assess
similarity the more similar compared objects are to each other.
By contrast, as objects become less and less similar, a point is
reached where it is generally difficult for humans to assess their
degree of similarity or dissimilarity. Recall that in the former
case one is dealing with features that are common to both
compounds, whereas in the latter case one is dealing with features
that are unique to each of the compounds. This follows from
the basic psychophysics of human perception because it is
easier for humans to make comparative judgments of objects
with common features than between objects whose features are
unique.
Since computed similarity values do not suffer from these
problems, a divergence between human perceptions and
computed values of similarity likely arises. In most cases, this
is not a problem for medicinal chemists who typically want to
synthesize and test compounds that are similar to known
actives. Then, high calculated similarity values have an intuitive
meaning. However, as similarity values decrease, the
boundaries between similarity and dissimilarity become rather
diffuse, and one is often unable to interpret such values.
The question of symmetry vs asymmetry of similarity, as
formally discussed above, should also be considered from an
Figure 5. Frequency of fingerprint features. The relative frequency of occurrence of the 150 most frequent features of (a) MACCS and (b) ECFP4 is
calculated for a random subset of 1 million ZINC database compounds.
using their in-house fingerprints and sets of active compounds,
that a Tc value of 0.85 reflected a high probability that two
compounds shared the same activity.43 For more than 15 years,
this Tc value has propagated in the literature as a general
threshold for bioactivity and has been applied in many practical
applications, although the value is not reliable when other
molecular representations are used for similarity calculations.4,7,44 Neighborhood behavior and calculated similarity
values are strongly dependent on chosen molecular representations and similarity measures.4 While this is generally well-known, it is often underappreciated in medicinal chemistry
even today. The often-observed use of putative Tc threshold
values of biological activity reflects common misunderstandings
of similarity calculations. In the following, we present and
discuss exemplary similarity calculations to highlight several
characteristic features.
Fingerprints of Different Design. In the following, two
conceptually different fingerprints are compared that are
popular in computational medicinal chemistry. The molecular
access system (MACCS) fingerprint,13 also termed MACCS
structural keys, is a prototypic fragment-based fingerprint that
consists of 166 structural fragments with 1−10 non-hydrogen
atoms and is one of the original and most popular similarity
search tools.6,7 Its design is simple. Each bit position is assigned
to one particular structural fragment or key and its presence or
absence in a compound is detected.
By contrast, we use the extended connectivity fingerprint
(ECFP) with bond diameter four (ECFP4) that currently is
one of the most popular fingerprints for similarity searching.14
ECFPs account for the local bond topologies, which describe
the connectivity of atoms in the neighborhood of each non-hydrogen atom in a molecule. The size of the neighborhood
depends on the so-called bond diameter given by the maximum
number of bonds considered. The ECFP design is much more
complex than MACCS because many different atom environment features can be generated. Different from MACCS,
ECFP4 consists of sets of compound-specific features whose
overlap is quantified as a measure of molecular similarity.
Although many different atom environments can in principle
exist, feature sets derived for individual compounds are often
relatively small (e.g., containing less than 100 features),
depending on their topology.
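The difference between the two designs can be made concrete with a toy example. The sketch below is a deliberate simplification and not the actual MACCS or ECFP algorithm: the three structural keys are hypothetical, molecules are given as heavy-atom lists with bond index pairs (bond orders are ignored), and atom environments are stored as nested tuples rather than hashed integers.

```python
# Hypothetical structural keys (NOT the real MACCS definitions): every
# compound is mapped onto the same fixed-length bit vector.
KEYS = [
    lambda atoms, bonds: "O" in atoms,      # contains oxygen
    lambda atoms, bonds: "N" in atoms,      # contains nitrogen
    lambda atoms, bonds: len(atoms) >= 4,   # at least four heavy atoms
]

def keyed_fingerprint(atoms, bonds):
    return [int(key(atoms, bonds)) for key in KEYS]

def atom_environments(atoms, bonds, max_radius=2):
    """Collect circular atom environments up to max_radius bonds
    (radius 2 corresponds to a bond diameter of 4)."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Radius 0: element symbol and heavy-atom degree of each atom.
    ids = {i: (atoms[i], len(neighbors[i])) for i in neighbors}
    features = set(ids.values())
    for _ in range(max_radius):
        # Grow each environment by one bond using the neighbors' identifiers.
        ids = {i: (ids[i], tuple(sorted(ids[j] for j in neighbors[i])))
               for i in neighbors}
        features.update(ids.values())
    return features

ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

print(keyed_fingerprint(*ethanol), keyed_fingerprint(*propanol))
env_e, env_p = atom_environments(*ethanol), atom_environments(*propanol)
shared = env_e & env_p
print(len(env_e), len(env_p), len(shared))
```

In this sketch the keyed fingerprints always have three positions, whereas the environment sets differ in size from compound to compound and are compared through their overlap.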
Similarity Value Distributions. Although the definition of
Tc yields an interpretable value as “the percentage of
fingerprint features shared between two compounds”, it is
very difficult to judge whether a given Tc value indicates the
medicinal chemist, is a nontrivial task. Thus, computed
similarity values and judgments by medicinal chemists are
both influenced by dependencies on molecular size and
complexity, but the effect is much more pronounced and
difficult to predict in the case of medicinal chemists’
assessments of similarity. The inconsistency of humans when
confronted with complex decision tasks42 is well reflected by
generally observed changes in medicinal chemists’ judgment
about the quality of the same compounds when presented in
different orders (vide supra).27 It is evident that medicinal
chemists are often left with conscious or subconscious
“impressions”, which they fold into their assessments of
similarity in some implicit way, being intuitively aware of the
complexity of the problem at hand, which then automatically
leads to a reductionist approach in decision making. It is
therefore not surprising that similarity calculations are attractive
in medicinal chemistry because they reduce complex molecular
comparisons to a simple numerical readout. Then, however, the
key question becomes what such computed values actually
mean.
■
CHARACTERISTICS OF SIMILARITY
CALCULATIONS
In the following section, we highlight opportunities and
limitations of similarity calculations in light of the above
discussion. Thereby, we evaluate the apparent attractiveness of
numerical similarity measures as a complement, or replacement,
of human perception and study relationships between
calculated and perceived similarities.
Similarity Property Principle Revisited. A critically
important aspect to realize is that most similarity methods do
not explicitly take biological activity into account. Thus,
similarity values generally reflect the similarity of chosen
molecular representations. Yet this is hardly of interest in
medicinal chemistry. Instead, chemoinformaticians and medicinal chemists typically attempt to bridge between calculated
similarity and biological activity, well in accord with the SPP
discussed above. In fact, the key question asked in this context
typically is “Which Tc value reliably indicates that compound B
has the same activity as reference compound A?” In other
words, “How similar must A and B be to have the same
activity?” This is the major attraction of reducing complex
similarity relationships to simple numbers and the source of
some profound misunderstandings of similarity calculation.
The 0.85 Myth. In a seminal study quantifying chemical
neighborhood behavior, investigators from Tripos established,
Figure 6. Similarity coefficient distributions. Distributions of similarity values resulting from 10 million comparisons of randomly chosen ZINC
compounds are reported for the Tanimoto and Dice coefficients and the (a) MACCS and (b) ECFP4 fingerprints.
Figure 7. Comparison of similarity coefficients. For two thrombin inhibitors, Dice, Tanimoto, and Tversky coefficients are compared using MACCS
and ECFP4. Tversky similarity calculations were carried out using different parameter settings.
presence of “significant similarity” or not. This is the case
because the coefficient value does not tell us anything about the
specific features under comparison. For instance, many
MACCS bit positions refer to structural features that are
often found in compounds, whereas ECFP4 systematically
encodes atom environments, many of which are infrequently
found in compound data sets. For this reason, ECFP4 Tc values
are generally smaller than MACCS Tc values. This difference in
feature frequencies is illustrated in Figure 5 that reports the
relative frequencies of the 150 most frequently detected
MACCS and ECFP4 features in 1 000 000 compounds
randomly selected from the ZINC (version 12) database.45
MACCS and ECFP4 fingerprints were calculated with the
Molecular Operating Environment (MOE).46 Overall the
ZINC subset contained 183 476 different ECFP4 features, but
only 632 of these features occurred in more than 1% of the
compounds. Considering the sparseness of most ECFP4
features, it is not surprising that some molecules that are
structurally similar contain a significant number of unique
features. Importantly, the differences in feature distribution
between MACCS and ECFP4 lead to very different
distributions of similarity coefficient values. To illustrate these
differences 10 000 000 similarity values were calculated for
randomly chosen pairs of ZINC compounds. The results are
shown for MACCS and ECFP4 Tc and Dc calculations in
Figure 6, where it is clear that the Dc distributions are shifted
toward higher values and are less symmetrical than the
comparable Tc distributions. These effects are due to the
normalization (α + β = 1) of the Dc and can be rationalized
based on the discussions associated with eqs 9 and 11. Similar
effects are, in general, observed for Tv, yielding distributions
very similar to those of the Dc, regardless of the value of the
parameter α. The figure shows that different combinations of
fingerprints and similarity coefficients produce different
similarity value distributions, further emphasizing the critically
important point that calculated similarity has no absolute
meaning.
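The dependence of similarity value distributions on fingerprint design can be reproduced qualitatively with synthetic data. The following sketch assumes uniformly random feature sets, which do not reflect the skewed feature frequencies of real MACCS or ECFP4 fingerprints; the feature universe sizes and the number of on-features are arbitrary illustrative settings.

```python
import random
from statistics import mean

def tanimoto(A, B):
    c = len(A & B)
    return c / (len(A) + len(B) - c)

def dice(A, B):
    return 2 * len(A & B) / (len(A) + len(B))

def random_fingerprint(rng, n_features, n_on):
    """Draw n_on distinct features from a universe of n_features."""
    return frozenset(rng.sample(range(n_features), n_on))

rng = random.Random(0)  # fixed seed so that the run is reproducible

# Assumed toy settings: a short, densely populated fingerprint versus a
# sparse feature set drawn from a very large feature universe.
settings = {"dense": (166, 50), "sparse": (100_000, 50)}
results = {}
for name, (n_features, n_on) in settings.items():
    pairs = [(random_fingerprint(rng, n_features, n_on),
              random_fingerprint(rng, n_features, n_on))
             for _ in range(2000)]
    tc = [tanimoto(A, B) for A, B in pairs]
    dc = [dice(A, B) for A, B in pairs]
    results[name] = (mean(tc), mean(dc))
    print(f"{name}: mean Tc = {mean(tc):.3f}, mean Dc = {mean(dc):.3f}")
```

Even in this simplified setting, the same coefficient yields very different typical values for the two representations, and Dc is never smaller than Tc for the same pair.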
Figure 8. Similarity searching using different fingerprints and similarity coefficients. By use of compound A from Figure 7 as a reference, similarity
values were calculated for 1 million ZINC compounds and 25 thrombin inhibitors (including compound B from Figure 7) using the Tanimoto and
Tversky (α = 0.1 and α = 0.9) coefficients and the (a) MACCS and (b) ECFP4 fingerprints. The similarity coefficient is plotted as a function of the
rank (reported on a logarithmic scale). The positions of the 25 thrombin inhibitors are marked on each curve.
searched and ranked using different similarity coefficients. In
Figure 8, the ranks are displayed on the x-axis from low to high
ranks on a logarithmic scale. On the y-axis, the corresponding
coefficient values are reported. For each similarity coefficient,
the position of the 25 thrombin inhibitors is marked. The
graphs illustrate that compound ranks significantly vary
depending on the coefficient and representation used. In this
example, MACCS in combination with Tv and α = 0.9 yields
the largest number of thrombin inhibitors within the top 1000
database compounds (corresponding to 0.1% of the screened
database). However, it is stressed that no general conclusions
about the relative performance of individual coefficients and
fingerprints can be drawn from a single example given the
strong compound class dependence of similarity calculations
(vide infra).
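How the choice of α changes a ranking can be illustrated with two hypothetical database compounds that have the same Tc value relative to a reference. The sketch below uses the normalized Tversky coefficient (α + β = 1) on feature sets; all feature sets and names are invented for illustration.

```python
def tversky(ref, db, alpha):
    """Normalized Tversky similarity: c / (alpha * a + (1 - alpha) * b)."""
    c = len(ref & db)
    return c / (alpha * len(ref) + (1 - alpha) * len(db))

def tanimoto(ref, db):
    c = len(ref & db)
    return c / (len(ref) + len(db) - c)

reference = set(range(10))  # a = 10 hypothetical features
database = {
    # 8 reference features plus 6 additional ones: b = 14, c = 8
    "superstructure-like": set(range(8)) | set(range(100, 106)),
    # 5 reference features and nothing else: b = 5, c = 5
    "substructure-like": set(range(5)),
}

def rank(alpha):
    """Database compound names in order of decreasing Tversky similarity."""
    return sorted(database,
                  key=lambda name: tversky(reference, database[name], alpha),
                  reverse=True)

for name, fp in database.items():
    print(name, "Tc =", tanimoto(reference, fp))
print("alpha = 0.9:", rank(0.9))
print("alpha = 0.1:", rank(0.1))
```

Both compounds have Tc = 0.5, yet α = 0.9 places the compound containing most of the reference features first, whereas α = 0.1 favors the compound with few features of its own.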
Similarity Threshold Values. Considering the global
distributions of similarity values, it is of interest to derive
threshold values that indicate a statistically significant level of
similarity. Significance analysis of similarity values can be used,
for instance, to determine if similarities between compounds
sharing a property like biological activity might simply occur by
chance or if compound similarity is likely to be associated with
the shared property. For this purpose, conventional p-values
can be calculated. For example, a Tc threshold value at a
significance level of p = 0.01 would indicate a probability of 1%
that the Tc value calculated for two randomly chosen
compounds meets or exceeds the threshold. Threshold values
can be estimated from the distribution of a large sample of
similarity values obtained by randomly selecting pairs of
compounds and calculating their similarity coefficient. The
cumulative distribution function F(t) of the values then relates
a similarity value t to the fraction of similarity values less than or
equal to t, and the significance is given by p = 1 − F(t). If such
threshold values are generally applicable in the context of
similarity searching, i.e., if a similarity value exceeding a
threshold value is a rare event and thus indicates significant
similarity, they must be largely independent of the selected
reference compound. It is emphasized at this point that only
calculated similarity values and their statistics are considered;
accounting for compound activity according to the SPP is
addressed in the next subsection.
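The estimation of a threshold value from a sample of similarity values can be sketched as follows. This is a minimal illustration, assuming that the sample is simply a list of precomputed similarity values; the helper names are hypothetical, and the evenly spaced sample stands in for values obtained from randomly chosen compound pairs.

```python
from bisect import bisect_right

def p_value(sorted_values, t):
    """Fraction of sampled similarity values exceeding t, i.e., 1 - F(t)."""
    n = len(sorted_values)
    return (n - bisect_right(sorted_values, t)) / n

def threshold_for(values, p):
    """Smallest sampled similarity value t for which 1 - F(t) <= p."""
    ordered = sorted(values)
    for t in ordered:
        if p_value(ordered, t) <= p:
            return t

# Stand-in sample: 100 evenly spaced values from 0.00 to 0.99.
sample = [i / 100 for i in range(100)]
print(threshold_for(sample, 0.01), threshold_for(sample, 0.05))
```

For a real analysis the sample would be replaced by similarity values for a large number of random compound pairs, computed with one specific fingerprint and coefficient.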
Although the global distribution of Tv values does not
significantly depend on the settings of α, this parameter
determines how similarity relative to a given reference molecule
is perceived. If more weight is put on features (bit settings) of
the reference molecule (i.e., if α > 0.5), different similarity
relationships evolve. Compounds that contain most of the
reference features plus some additional ones are considered to
be more similar to the reference molecule than compounds that
contain fewer of the reference features but also fewer additional
features, although the percentage of shared features might be
the same for both molecules. How different representations and
similarity coefficients affect computed similarity values is
illustrated in Figure 7, using two exemplary thrombin inhibitors
taken from the ChEMBL (version 15)47 database. Both
molecules contain more ECFP4 features than MACCS features,
but the number of shared features is lower for ECFP4, as
expected on the basis of the feature distributions discussed
above. Consequently, the different coefficients produce
significantly lower similarity values for ECFP4. Dc is increased
compared to Tc as shown in the discussions related to eqs 9
and 12. Because Dc is identical to normalized Tv with α = 1/2, its
value can be numerically compared to the asymmetrical Tv
values with parameters α = 0.1 and α = 0.9, respectively. It can
be observed that Tv decreases for α = 0.1 and increases for α =
0.9. In the first case, more weight is put on the features
exclusive to molecule B, and in the second case, less weight is
put on these features. Thus, the influence of these features on
the similarity value is either increasing or decreasing compared
to Dc. Changing α has an effect on computed similarity values.
More importantly, however, the parameter also influences how
similarity is perceived in a search when database compounds
are ranked in the order of decreasing similarity to a reference
molecule. Here, the absolute value of similarity is not of
interest, especially if the value cannot be interpreted in a
meaningful way. Rather, the rank positions of compounds with
the desired properties determine the usefulness of a similarity
coefficient. Figure 8 illustrates the effect that the choice of
different similarity coefficients has on the ranking of
compounds in a similarity search. Molecule A in Figure 7 was
taken as a reference, and 1 000 000 ZINC compounds together
with molecule B and 24 other thrombin inhibitors were
Figure 9. Threshold values of similarity coefficients versus significance levels. Cumulative distribution functions were generated for different
similarity coefficients and two fingerprints by selecting 100 random reference compounds from the ZINC database and calculating the similarity to
the remaining ZINC compounds from the subset of selected compounds according to Figure 5. The graphs on the left show the median as well as
first and third quartile cumulative distribution function F(t) derived from the 100 sampled distributions. On the right, threshold values (y-axis) are
shown depending on different levels of significance (x-axis) on a logarithmic scale. The median threshold values as well as the first and third quartile
threshold values are reported: (a) MACCS and Tc; (b) MACCS and Tv(α=0.9); (c) ECFP4 and Tc; (d) ECFP4 and Tv(α=0.9).
To illustrate the influence of different reference compounds
on similarity calculations, search profiles were generated for 100
compounds randomly chosen from the ZINC database. In each
case, their similarity to all remaining ZINC compounds in a
ZINC sample was calculated. Not surprisingly, search profiles
for individual reference compounds generally differed. On the
Figure 10. Similarity value distributions for active compounds. The distribution of similarity coefficient values for compounds sharing the same
activity is reported for 10 exemplary compound activity classes taken from the ChEMBL database. The boxplot representations provide quartile and
median similarity values for all compound comparisons. The whiskers represent the most extreme data points within the 1.5 interquartile range for
the lower and upper quartiles, respectively. Data points falling outside this range are not shown. On the x-axis, the ChEMBL target identifiers (Ids)
are provided for each class: 11, thrombin; 43, β-2 adrenergic receptor; 72, dopamine D2 receptor; 86, monoamine oxidase A; 194, coagulation factor
X; 214, muscarinic acetylcholine receptor M4; 10 498, cathepsin L; 11 003, melanocortin receptor 3; 11 060, carbonic anhydrase VII; 11 627, acyl
coenzyme cholesterol acyltransferase.
Do Activity-Relevant Similarity Threshold Values
Exist? Although the above considerations highlight the
principal limitations of similarity calculations from a statistical
point of view, they do not consider similarity from the
perspective of a medicinal chemist. In this case, the SPP takes
center stage and raises the issue of whether calculated
similarities can serve as indicators of activity similarity. This
directly relates to the “0.85 myth” discussed above and
represents one of the most important applications of
quantitative molecular similarity analysis in medicinal chemistry.
To address this question, similarity calculations must be
carried out for compounds having different specific activities.
Therefore, 10 exemplary compound activity classes were taken
from ChEMBL (version 15).47 Each compound was required to
have a pKi value of at least 7 for its designated target (thus
limiting the analysis to potent compounds with available high-confidence activity measurements). Similarity values were then
calculated for all pairs of compounds sharing the same activity.
The results of these calculations are reported in Figure 10.
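The pairwise analysis within an activity class can be sketched as shown below. The two classes are hypothetical feature sets constructed to represent a series of close analogues and a structurally diverse class; they are not derived from ChEMBL data, and the summary mirrors only the quartile and median values of a boxplot.

```python
from itertools import combinations
from statistics import median, quantiles

def tanimoto(A, B):
    c = len(A & B)
    return c / (len(A) + len(B) - c)

def class_similarity_summary(fingerprints):
    """First quartile, median, and third quartile of all pairwise Tc values."""
    values = [tanimoto(A, B) for A, B in combinations(fingerprints, 2)]
    q1, _, q3 = quantiles(values, n=4)
    return q1, median(values), q3

# Hypothetical activity classes.
analogue_series = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6},
                   {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}]
diverse_class = [{1, 2, 3, 4, 5}, {1, 6, 7, 8, 9},
                 {2, 6, 10, 11, 12}, {3, 7, 10, 13, 14}]

for name, cls in (("analogues", analogue_series), ("diverse", diverse_class)):
    q1, med, q3 = class_similarity_summary(cls)
    print(f"{name}: Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}")
```

Although the compounds in each hypothetical class would share the same activity, the median intra-class similarity differs markedly between the two classes.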
Regardless of the fingerprint representations and similarity
coefficients used, the observed similarity value distributions for
active compounds strongly depended on the compound activity
class. For example, median MACCS Tc values varied from ∼0.3
to ∼0.75 depending on the class. As shown in Figure 9a, a
MACCS Tc threshold value of ∼0.65 corresponds to a
statistically significant similarity at the level of p = 0.01. It
follows that most compounds active against a given target
basis of these profiles, a significance level (p-value), given by
the fraction of database compounds whose similarity values
with respect to the reference compound exceed the given
threshold, was assigned to every reference compound for each
threshold value in the range 0−1. This yielded 100 curves
relating threshold values to p-values. Figure 9 reports
cumulative distribution functions and threshold values as a
function of the significance level for different similarity
coefficients with respect to the MACCS and ECFP4 fingerprints. The graphs on the left depict the median as well as first
and third quartile sampled cumulative distribution functions,
while the graphs on the right report the Tc threshold value as a
function of the p-value. These graphs are obtained from the
cumulative distribution function by exchanging the x- and y-axis
and by plotting the p-values on a logarithmic scale in order to
enhance the visual resolution for low p-values indicating high
significance. Shown are the median threshold values and the
interquartile ranges of the thresholds obtained from the original
100 curves. From the curves, it is apparent that statistically
significant similarity threshold values strongly depend on the
fingerprint representation and the similarity coefficient that are
used. In addition, there are large variations in threshold value
depending on the reference compounds. Thus, although
threshold values might be associated with statistically significant
similarity, without taking activity into account, they are not
transferable and are associated with large margins of error, due
to the dependence on reference compounds, as illustrated in
Figure 9.
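The variation of threshold values across reference compounds can be explored with a small simulation. The sketch below assumes a synthetic database of uniformly random feature sets rather than ZINC compounds; the database size, feature universe, number of references, and the simple order-statistic estimate of the threshold are all illustrative choices.

```python
import random
from statistics import quantiles

def tanimoto(A, B):
    c = len(A & B)
    return c / (len(A) + len(B) - c)

def threshold_at(values, p):
    """Similarity value exceeded by roughly a fraction p of the sample."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, int((1 - p) * len(ordered)))
    return ordered[k]

rng = random.Random(1)  # fixed seed so that the run is reproducible

# Synthetic database: 2000 feature sets of varying size (10 to 60 features).
database = [frozenset(rng.sample(range(200), rng.randint(10, 60)))
            for _ in range(2000)]

# One threshold per reference compound at a significance level of p = 0.01.
thresholds = []
for i in rng.sample(range(len(database)), 20):
    values = [tanimoto(database[i], fp)
              for j, fp in enumerate(database) if j != i]
    thresholds.append(threshold_at(values, p=0.01))

q1, med, q3 = quantiles(thresholds, n=4)
print(f"Tc threshold at p = 0.01: median {med:.3f}, quartiles {q1:.3f} to {q3:.3f}")
print(f"range over references: {min(thresholds):.3f} to {max(thresholds):.3f}")
```

The spread between the smallest and largest per-reference threshold gives a rough impression of how strongly a nominally significant similarity value depends on the chosen reference.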
all combinations of fingerprints and similarity coefficients.
These findings illustrate that generally applicable similarity
threshold values as a potential indicator of activity similarity do
not exist. Such values also cannot be derived with any certainty
for individual compound classes, as revealed by the variability of
similarity values and lack of general statistical significance.
■
PRACTICAL CONSIDERATIONS
Calculated similarities, regardless of how we perceive them
from a medicinal chemistry perspective, strongly depend on the
compound classes under study as well as the molecular
representations (descriptors) and similarity measures used.48,49
If multiple reference compounds are employed, the results of
similarity calculations must be combined in some way,
typically through the application of data fusion techniques,50
which further complicates matters. The results discussed above
illustrate that calculated similarity values do not enable us to
relate molecular and activity similarity in a meaningful way to
each other and that it is impossible at present to establish
generally applicable threshold values indicating that two
compounds share the same activity. Does all this mean that
similarity calculations have no utility in medicinal chemistry?
The answer is no. The key issue is to understand what similarity
calculations can and cannot provide. As long as one believes
that the magnitude of computed similarity measures has
Figure 11. Average Tc threshold values for scaffold recall rates. For
MACCS (blue) and ECFP4 (red), the average Tc threshold value
required to achieve a specified scaffold recall rate is reported. The
variations of these Tc values across all trials are reported as error bars
for recall rates of 25%, 50%, and 75%. Numbers next to the error bars
give the median database selection set size for which the recall rate is
achieved. The figure was adapted from ref 53.
yielded similarity values that varied greatly and were not
statistically significant. Equivalent observations were made for
Figure 12. Early enrichment of active compounds with different scaffolds. Two exemplary reference compounds and a set of active compounds
having different scaffolds are shown that were found in the 100 top-ranked database compounds (individual ranks are reported). At the top, κ opioid
receptor ligands are shown and at the bottom human immunodeficiency virus type 1 protease inhibitors.
3199
dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204
Journal of Medicinal Chemistry
Perspective
Figure 13. Compound ranks in virtual screening. Results from virtual screening trials are shown leading to the identification of new inhibitors of the
Sec-7 domain of cytohesins (Secin 16, 87, and 144). Two reference compounds are shown. For each of the three hits, rank positions are reported for
four alternative search strategies including support vector machine (SVM) calculations with two fingerprints (FP 1 and FP 2) as descriptors as well as
similarity searching with two fingerprints using a single reference compound (FP 1, reference 1 and FP 3, reference 2).
elements corresponds to a high probability that reference and
database compounds share the same activity.
In contrast to pharmacophore-based searching, fingerprint
similarity searching, which is based on a whole-molecule
assessment of similarity, does not require pharmacophore
hypotheses or specific knowledge about activity-relevant
features of compounds. It is applicable when very little is
known except the activities of the reference compound(s) used
in the search. No activity information associated with specific
substructural features in the fingerprints is required, only the
assumption that the SPP is applicable. Importantly, similarity
searching produces a ranking of database compounds in the
order of decreasing computed similarity values relative to the
reference compound(s). In this case, absolute similarity values
are not relevant except on a relative scale for ranking of
compounds.
A database ranking starts with compounds that are most
similar to the reference compound(s), typically closely related
analogues, and as we proceed further down the ranking,
database compounds become increasingly dissimilar but might
nonetheless be active. In a study designed to assess the scaffold
immediate implications for activity and that their values scale, in
one way or another, with a probability of activity, little can be
expected. Meaningful applications of similarity calculations can,
however, be considered if one is aware of these limitations.
Computed Similarities on a Relative Scale. One of the
major applications of similarity analysis is ligand-based virtual
screening, where one or more active reference compounds are
used to search databases to identify other compounds with
similar structures and, by the SPP, hopefully, with similar
activities.4−7 Such searches can be carried out on the basis of
local or global similarity methods. For example, pharmacophore
searching51 is based on local similarity and attempts to identify
all database compounds that match a predefined pharmacophore query, regardless of the remaining substructures. Such
calculations can be carried out to identify structurally diverse
compounds having similar activities, a procedure commonly
referred to as scaffold hopping.52 The horizontal compound
relationship in Figure 2 represents an example of a scaffold hop.
Pharmacophore searching typically produces a “pass−fail”
readout and identifies a set of compounds that match the
query. It is assumed that close resemblance of pharmacophore
3200
dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204
Journal of Medicinal Chemistry
Perspective
hopping potential of similarity searching,53 it has been shown
that Tc threshold values cannot be determined that indicate a
significant enrichment of structurally diverse active compounds
in database rankings. Figure 11 summarizes search results for
MACCS and ECFP4 over different activity classes.53 Average
Tc values are reported for the fraction of active database
compound series with distinct scaffolds for which at least one
active compound was detected. Error bars are shown for Tc
values at which compounds represented by 25%, 50%, or 75%
of all “active scaffolds” were detected. The numbers at the error
bars indicate the median ranks of these active compounds. For
example, to detect active compounds for 25% of all available
scaffolds, ∼1% (5,488) of all database compounds had to be
selected on average for MACCS and ∼0.5% (2,360) for
ECFP4. The large error bars indicate that it was not possible to
define Tc threshold values for the retrieval of structurally
diverse active compounds across different compound classes. In
essentially all calculations, however, a few active compounds
with scaffolds different from the reference compounds were
found at relatively high rank positions, as shown in Figure 12.
Thus, the calculations show that scaffold hops can be detected,
although large numbers of other database compounds had to be
selected to achieve a significant scaffold recall of 25% or more.
These findings illustrate the resolution limits of whole-molecule
similarity searching. Nevertheless, similarity searching is
relevant for many practical applications.
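The rank-based view of similarity searching and the notion of scaffold recall can be illustrated with a minimal Python sketch. It assumes that fingerprints are available as sets of "on" bit positions; the bit sets, compound names, and scaffold labels below are hypothetical toy data and do not correspond to MACCS or ECFP4 output or to the data of ref 53. The helper names (rank_database, selection_size_for_recall) are introduced here for illustration only.

```python
from math import ceil

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def rank_database(reference, database):
    """Return (name, Tc) pairs sorted by decreasing similarity to the reference."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def selection_size_for_recall(ranking, active_scaffold, recall):
    """Smallest number of top-ranked compounds covering `recall` of all active scaffolds.

    `active_scaffold` maps each active compound name to a scaffold label.
    Returns None if the ranking never reaches the requested recall.
    """
    needed = ceil(recall * len(set(active_scaffold.values())))
    seen = set()
    for size, (name, _tc) in enumerate(ranking, start=1):
        if name in active_scaffold:
            seen.add(active_scaffold[name])
        if len(seen) >= needed:
            return size
    return None

# Hypothetical toy data: one reference and five database compounds.
reference = {1, 2, 3, 4, 5, 6}
database = {
    "analog": {1, 2, 3, 4, 5, 7},
    "hop_a": {1, 2, 3, 8, 9, 10},
    "decoy_1": {1, 2, 11, 12, 13, 14, 15},
    "hop_b": {1, 16, 17, 18, 19},
    "decoy_2": {20, 21, 22},
}
active_scaffold = {"analog": "S0", "hop_a": "S1", "hop_b": "S2"}

ranking = rank_database(reference, database)
size_50 = selection_size_for_recall(ranking, active_scaffold, 0.50)
size_75 = selection_size_for_recall(ranking, active_scaffold, 0.75)
```

In this toy ranking the closely related analogue appears first, whereas covering 75% of the active scaffolds requires selecting four of the five compounds. Only the order of the Tc values, not their magnitude, enters the recall calculation.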
The attractiveness of similarity-based compound rankings in
medicinal chemistry is that they provide a continuum of
compound similarity relationships that can be intuitively
assessed. Although we do not know precisely where active
compounds with different scaffolds might be found in
similarity-based ranked lists, inspecting the rankings enables
compound selection on the basis of chemical intuition and
experience. In this case, the chemical informatics and medicinal
chemistry perspectives meet.
Figure 13 shows the results of a practical virtual screening
application54 that exemplifies the opportunities of similarity
searching. The study was designed to identify new inhibitors of
cytohesins,55 a family of small guanine nucleotide exchange
factors, by virtually screening a large compound database
containing 3.7 million compounds. Three newly discovered
structurally diverse inhibitors54 and their database ranks
produced by four related yet distinct search strategies are
reported. The positions of the inhibitors in the database
rankings show a remarkable spread. Two of these active
compounds were highly ranked by one search strategy (ranks 7
and 35, respectively) but vanished in the database background
when the others were applied. The highest rank obtained for
the third inhibitor was 354, and this compound could only be
selected on the basis of visual inspection of rankings and
intuition because a total of only 145 compounds taken from
different rankings were experimentally tested.54
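When several rankings are available, as for the four search strategies in Figure 13, they may also be combined by data fusion.50 The following Python sketch shows one possible rank-based fusion scheme. It is not the selection procedure of ref 54, where compounds were chosen by inspection of the rankings; the function name fuse_ranks, the rankings, and the compound identifiers are hypothetical.

```python
def fuse_ranks(rankings, rule=min):
    """Combine several rankings (lists of compound ids, best first) into one.

    Each compound receives `rule` applied to its 1-based rank positions;
    a compound missing from a ranking is assigned a rank one past its end.
    """
    compounds = set().union(*rankings)
    fused = {}
    for cid in compounds:
        ranks = [r.index(cid) + 1 if cid in r else len(r) + 1 for r in rankings]
        fused[cid] = rule(ranks)
    # Sort by fused value; ties are broken alphabetically for reproducibility.
    return sorted(fused, key=lambda cid: (fused[cid], cid))

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Hypothetical rankings from three search strategies (best compound first).
strategy_1 = ["hit", "a", "b", "c", "d"]
strategy_2 = ["a", "b", "c", "d", "hit"]
strategy_3 = ["b", "a", "c", "d", "hit"]
rankings = [strategy_1, strategy_2, strategy_3]

best_rank = fuse_ranks(rankings, rule=min)    # keeps any top-ranked compound
mean_rank = fuse_ranks(rankings, rule=mean)   # rewards consistent ranking
```

With these toy rankings, best-rank fusion retains the compound that only one strategy ranks highly, whereas mean-rank fusion moves it down the list. This mirrors the observation that a hit can be prominent in one ranking and vanish in the others.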
Nearest Neighbor Analysis. Another application of
similarity calculations that is relevant to medicinal chemistry
and is also independent of absolute similarity values is the
mapping of the chemical neighborhood of compounds.
Similarity calculations can easily retrieve the k-nearest
neighbors, i.e., the k most similar compounds, to a given compound
from any collection.50 The similarity radius, i.e., the range of
similarity values considered with respect to a specific reference
compound, can be easily adjusted, thereby increasing or
decreasing the number of compounds for inspection. Such
nearest neighbor calculations enable chemical interpretation of
limited numbers of similarity relationships and are useful, for
example, in support of hit expansion studies or in the
generation of focused compound libraries. Since the mapping
of chemical neighborhoods does not require sophisticated
molecular representations, simple fragment-based fingerprints
can be used effectively.
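Both views of a chemical neighborhood, a fixed number of neighbors and an adjustable similarity radius, can be sketched in a few lines of Python. The sketch assumes fingerprints represented as sets of "on" bit positions; the function names and the toy collection are hypothetical.

```python
import heapq

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def k_nearest_neighbors(query, collection, k):
    """Return the k most similar compounds as (name, Tc), most similar first."""
    scored = ((tanimoto(query, fp), name) for name, fp in collection.items())
    return [(name, tc) for tc, name in heapq.nlargest(k, scored)]

def neighbors_within_radius(query, collection, min_tc):
    """Return all compounds with Tc >= min_tc, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in collection.items()]
    hits = [item for item in scored if item[1] >= min_tc]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical toy collection of simple fragment-type fingerprints.
query = {1, 2, 3, 4}
collection = {
    "n1": {1, 2, 3, 4, 5},
    "n2": {1, 2, 3},
    "n3": {1, 2, 6, 7},
    "n4": {8, 9},
}

two_nearest = k_nearest_neighbors(query, collection, k=2)
wide = neighbors_within_radius(query, collection, min_tc=0.3)
narrow = neighbors_within_radius(query, collection, min_tc=0.7)
```

Lowering min_tc widens the neighborhood (three compounds at 0.3 in this toy case) and raising it narrows the neighborhood (two compounds at 0.7), so the number of compounds presented for inspection can be tuned directly.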
Rendering Fingerprint Calculations Comparable.
Although similarity threshold values of activity do not exist, it
is possible to determine corresponding Tc or other related
coefficient values for different fingerprints that are met or
exceeded by the same proportion of compound pairs in large
databases. For example, in systematic similarity-based search
calculations on 128 compound data sets taken from ChEMBL,
12% of all possible compound pairs reached or exceeded a
MACCS Tc value of 0.70.56 The same proportion of compound
pairs was obtained for an ECFP4 Tc of 0.31, thus establishing
an approximate correspondence of these Tc values for the two
fingerprints. Following this approach, it is possible to map
corresponding Tc values for different fingerprints that select the
same percentage of compound pairs.56 Such correspondences
depend to some extent on the composition of the compound
collection under study. Figure 14 reports correspondence
between MACCS Tc and ECFP4 Tc values established on the
basis of the randomly sampled distributions shown in Figure 6.
Selected points on the curve are highlighted that correspond to
certain fractions of compound comparisons meeting or
exceeding the corresponding MACCS Tc (x-axis) or ECFP4
Tc (y-axis) values. The curve illustrates the representation
dependence of similarity values. Furthermore, it provides a
guideline for assessing the significance of similarity values
obtained for a newly introduced fingerprint on the basis of a
standard fingerprint (such as MACCS) with which many
investigators are familiar.
Figure 14. Corresponding Tc values for MACCS and ECFP4.
Distributions of MACCS and ECFP4 Tc values were determined by
conducting 10 million comparisons between randomly selected ZINC
compounds (according to Figure 6). Correspondence between
MACCS and ECFP4 Tc values was established by relating those Tc
values to each other that were met or exceeded by the same percentage
of comparisons (indicated as labeled points on the curve).
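The matching of Tc values by equal proportions of comparisons can be sketched in Python as follows. The two lists stand in for Tc distributions obtained with two fingerprints over the same set of comparisons; all numbers are hypothetical toy values chosen only so that one fingerprint yields systematically higher Tc values than the other, and the function names are introduced here for illustration.

```python
def exceedance_fraction(values, threshold):
    """Fraction of comparisons whose Tc meets or exceeds the threshold."""
    return sum(v >= threshold for v in values) / len(values)

def corresponding_threshold(values_a, threshold_a, values_b):
    """Tc threshold for fingerprint B that corresponds to `threshold_a` for A.

    The returned value is met or exceeded by approximately the same
    proportion of comparisons in `values_b` as `threshold_a` is in
    `values_a`. Returns None if that proportion corresponds to fewer
    than one comparison in `values_b`.
    """
    n_exceed = sum(v >= threshold_a for v in values_a)
    k = round(n_exceed * len(values_b) / len(values_a))
    if k < 1:
        return None
    # The k-th largest value of B is reached by at least k comparisons.
    return sorted(values_b, reverse=True)[k - 1]

# Hypothetical Tc values for ten comparisons (not taken from Figure 14).
fp_a_tc = [0.91, 0.78, 0.70, 0.64, 0.57, 0.49, 0.42, 0.36, 0.29, 0.21]
fp_b_tc = [0.52, 0.37, 0.28, 0.24, 0.20, 0.17, 0.14, 0.11, 0.09, 0.06]

matched = corresponding_threshold(fp_a_tc, 0.70, fp_b_tc)
```

Here a threshold of 0.70 for the first fingerprint is reached by 30% of the comparisons, and the matched threshold for the second fingerprint is the value reached by the same 30%. With real data, the mapping would depend on the compound collection from which the comparisons are drawn, as noted above.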
Dissimilarity Selection. The selection of compounds that
are most dissimilar to those of an existing collection has a long
history in compound acquisition in the pharmaceutical
industry.57,58 It is also a meaningful application of similarity/
dissimilarity calculations. In this case, the interest is in the
extreme values of a distribution of similarity values, not the
largest ones as in the case of nearest neighbor analysis but
rather the smallest ones because of the complementary
relationship between similarities and dissimilarities. Different
algorithms have been produced for dissimilarity selection.57,58
Regardless of their specific details, many of these methods are
based upon pairwise similarity calculations of library and
external candidate compounds.
■
CONCLUDING REMARKS
The present review provides an overview of the foundations of
molecular similarity analysis and describes a number of different
similarity-based concepts relevant to medicinal chemistry. As is
well-known, the principal difficulty associated with similarity
analysis is that similarity itself is an inherently subjective
concept so that absolute standards do not exist. Nonetheless, a
wide variety of computational approaches have been developed
in an attempt to account for molecular similarity in a formally
consistent and unbiased manner. Although this may be a
daunting task, it remains a critically important endeavor
because of the power that the concept of molecular similarity
brings to the practice of chemistry in general and to medicinal
chemistry in particular. Long before computational methods for
treating molecular similarity were developed, chemists
employed similarity in a number of areas of chemistry, a
particularly noteworthy example being the development of the
periodic table.59
The similarity concept provides a framework, albeit an
imperfect one, for assessing the similarity of compounds, which
is one of the central tasks in medicinal chemistry. Since an
individual’s capacity to judge similarity relationships is limited
to fairly small numbers of relatively simple compounds,
computational approaches are indeed essential in modern
medicinal chemistry, despite their limitations. This raises a key
issue, namely, how medicinal chemists perceive molecular
similarity and how this perception relates to similarity evaluated
computationally. A brief discussion is provided here describing
some of the cognitive aspects of similarity perception and its
strong association with human pattern recognition and
reduction because they affect the subjective decisions of
medicinal chemists. Clearly, similarity considerations strongly
influence which compounds are made, and these compounds
then essentially reflect our views of similarity. This might often
limit the spectrum of compounds that are considered and
prevent the exploration of chemically unusual ones that fall
outside our similarity perception. On the other hand, for many
therapeutic targets there is a large number of structurally
diverse active compounds available,60 a knowledge base that is
more often considered in chemical informatics than in medicinal
chemistry.
Given the medicinal chemistry focus of our presentation, we
have based our methodological considerations on 2D similarity
calculations. However, from a computational perspective, 3D
similarity methods are of course equally relevant.61,62
Regardless of the methods used, however, 3D similarity
assessment in drug design remains affected by the uncertainties
associated with extrapolating from computed to often unknown
bioactive compound conformations.
Without doubt, similarity is often viewed differently in
chemical informatics and medicinal chemistry. This is
exemplified by global and local comparisons of compounds.
We have rationalized that similarity relationships might
fundamentally change depending on whether one applies a
whole-molecule view, as is often done in chemical informatics,
or focuses on pharmacophores or functional groups (i.e., local
molecular information), as is typically the case in medicinal
chemistry. Furthermore, the modeling of activity landscapes,
which integrates compound similarity and activity relationships,63 often leads to rather different interpretations by
computational and medicinal chemists when calculated
similarity values are used. Moreover, attempts to interpret
calculated similarity values and differences between them in
structural terms might often cause confusion. Similarity
calculations are nevertheless of considerable interest in
medicinal chemistry. In addition, because human assessment
depends significantly on the knowledge and experience of
medicinal chemists, it is not surprising that calculated similarity
values are often seen as an attractive means of decision support.
However, we also note that many similarity search and
benchmark studies reported in the computational literature
lack proper statistical assessment, which complicates the
comparison and interpretation of calculated similarity values.
Probably the largest conceptual roadblock to computational
similarity analysis is that the quantitation of chemical or
molecular similarity is generally not of interest per se but rather
the extrapolation from calculated similarity values to other
molecular properties, in particular, biological activity. There are
no well-defined relationships between calculated similarity and
activity similarity and no similarity threshold values that reliably
indicate whether a test compound shares the activity of a
reference compound, a situation that is further confounded by
the presence, albeit rare, of activity cliffs.10,11 These issues
frequently give rise to misunderstandings in medicinal
chemistry. Moreover, similarity calculations are strongly
dependent on compound classes, molecular representations,
and similarity measures, which complicates their interpretation
and practical application. If one is aware of these caveats,
computational similarity analysis provides a number of
meaningful and useful medicinal chemistry-relevant applications. For example, similarity calculations often aid in
compound selection if the focus is not on the absolute
magnitude of the similarity values but rather on their relative
magnitudes, which determines the ranking of compounds and is
decidedly more robust to differences in similarity values that
arise from the use of different similarity measures.
Despite current limitations, computational similarity analysis
has its place in drug development, if applied in a considerate
manner, to complement and further expand medicinal chemists’
perception(s) of molecular similarity. For fundamental reasons,
it is not possible to eliminate subjective elements from
similarity assessments, which puts strong emphasis on the
careful interpretation of computational results. The development of computational similarity methods with reduced
compound class dependence will be an important topic for
future research. In addition, the exploration of new concepts to
account for biological similarity of small compounds will be
equally attractive.
■
AUTHOR INFORMATION
Corresponding Authors
*G.M.: phone, 520-405-4736; e-mail, gerry.maggiora@gmail.com.
*J.B.: phone, 49-228-2699-306; e-mail, [email protected].
de.
Notes
The authors declare no competing financial interest.
Biographies
Gerald Maggiora studied chemistry and biophysics at the University
of California at Davis, earning a Ph.D. in biophysics. He spent more
than 20 years as Professor of Chemistry and Biochemistry, University
of Kansas, and Professor of Pharmaceutical Sciences, University of
Arizona. He spent an equal amount of time in the pharmaceutical
industry as a Director of Computer-Aided Drug Discovery and Senior
Research Scientist. His interests include molecular and mathematical
modeling, scientific applications of computer-aided decision making,
drug design, and applications of fuzzy mathematics and rough set
theory to biological and medical problems. For more than 2 decades
he has focused on chemical informatics and molecular similarity. In
2008 he received the Herman Skolnik Award, Division of Chemical
Information of the American Chemical Society.
Martin Vogt studied mathematics and computer science at the
University of Bonn, Germany, and holds a degree in computer science.
He currently is a Research Associate in the Department of Life Science
Informatics at the University of Bonn where he also completed his
doctoral thesis on Bayesian methods for virtual screening under the
guidance of Prof. Jürgen Bajorath. Previously, he was employed at the
Fraunhofer Institute for Applied Information Technology (FIT) where
he worked on image recognition algorithms for bioinformatics
applications. His research interests include algorithmic method
development in chemoinformatics, especially focusing on data mining
and machine learning methods.
Dagmar Stumpfe studied biology at the University of Bonn, Germany.
In 2006, she joined the Department of Life Science Informatics at the
University of Bonn headed by Prof. Jürgen Bajorath for her Ph.D.
thesis, where she worked on methods for computer-aided chemical
biology with a focus on the exploration of compound selectivity. Since
2009, Dagmar has been working as a Postdoctoral Fellow in the
department, and her current research interests include computational
chemical biology and large-scale structure−activity relationship
analysis.
Jürgen Bajorath studied biochemistry at the Free University, Berlin.
Beginning with postdoctoral studies in San Diego, CA, he spent more
than 15 years in the United States. He currently is Professor and Chair
of Life Science Informatics at the University of Bonn, Germany. He is
also an Affiliate Professor in the Department of Biological Structure at
the University of Washington, Seattle. His research interests include
drug discovery, computer-aided medicinal chemistry and chemical
biology, and chemoinformatics (http://www.lifescienceinformatics.uni-bonn.de).
■
ACKNOWLEDGMENTS
D.S. is supported by Sonderforschungsbereich 704 of the
Deutsche Forschungsgemeinschaft.
■
ABBREVIATIONS USED
COX, cyclooxygenase; Dc, Dice coefficient; ECFP, extended
connectivity fingerprint; FP, fingerprint; HSL, hormone-sensitive lipase; MACCS, molecular access system; SAR,
structure−activity relationship; Sg, Soergel distance; SPP,
similarity property principle; SVM, support vector machine;
Tc, Tanimoto coefficient; Tv, Tversky coefficient; 2D, two-dimensional; 3D, three-dimensional
■
REFERENCES
(1) Bender, A.; Glen, R. C. Molecular Similarity: A Key Technique in
Molecular Informatics. Org. Biomol. Chem. 2004, 2, 3204−3218.
(2) Medina-Franco, J. L.; Maggiora, G. M. Molecular Similarity
Analysis. In Chemoinformatics for Drug Discovery; Bajorath, J., Ed.; John
Wiley and Sons: Hoboken, NJ, in press.
(3) Kubinyi, H. Similarity and Dissimilarity: A Medicinal Chemist’s
View. Perspect. Drug Discovery Des. 1998, 9−11, 225−232.
(4) Eckert, H.; Bajorath, J. Molecular Similarity Analysis in Virtual
Screening: Foundations, Limitations and Novel Approaches. Drug
Discovery Today 2007, 12, 225−233.
(5) Koeppen, H. Virtual Screening: What Does It Give Us? Curr.
Opin. Drug Discovery Dev. 2009, 12, 397−407.
(6) Willett, P. Similarity-Based Virtual Screening Using 2D
Fingerprints. Drug Discovery Today 2006, 11, 1046−1053.
(7) Stumpfe, D.; Bajorath, J. Similarity Searching. Wiley Interdiscip.
Rev.: Comput. Mol. Sci. 2011, 1, 260−282.
(8) Johnson, M.; Maggiora, G. M., Eds. Concepts and Applications of
Molecular Similarity; John Wiley & Sons: New York, 1990.
(9) Maggiora, G. M. On Outliers and Activity Cliffs: Why QSAR
Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535−1535.
(10) Stumpfe, D.; Bajorath, J. Exploring Activity Cliffs in Medicinal
Chemistry. J. Med. Chem. 2012, 55, 2932−2942.
(11) Stumpfe, D.; Hu, Y.; Dimova, D.; Bajorath, J. Recent Progress in
Understanding Activity Cliffs and their Utility in Medicinal Chemistry.
J. Med. Chem. [Online early access]. DOI: 10.1021/jm401120g.
Published Online: Aug 27, 2013.
(12) Raymond, J. W.; Willett, P. Maximum Common Subgraph
Isomorphism Algorithms for the Matching of Chemical Structures. J.
Comput.-Aided Mol. Des. 2002, 16, 521−533.
(13) MACCS Structural Keys; Accelrys: San Diego, CA.
(14) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J.
Chem. Inf. Model. 2010, 50, 742−754.
(15) Good, A. C.; Richards, W. G. Explicit Calculation of 3D
Molecular Similarity. Perspect. Drug Discovery Des. 1998, 9−11, 321−
338.
(16) Rush, T. S.; Grant, J. A.; Mosyak, L.; Nicholls, A. A Shape-Based
3-D Scaffold Hopping Method and Its Application to a Bacterial
Protein−Protein Interaction. J. Med. Chem. 2005, 48, 1489−1495.
(17) Brown, R. D.; Martin, Y. C. The Information Content of 2D and
3D Structural Descriptors Relevant to Ligand−Receptor Binding. J.
Chem. Inf. Model. 1997, 37, 1−9.
(18) McGaughey, G. B.; Sheridan, R. P.; Bayly, C. I.; Culberson, J. C.;
Kreatsoulas, C.; Lindsley, S.; Maiorov, V.; Truchon, J.-F.; Cornell, W.
D. Comparison of Topological, Shape, and Docking Methods in
Virtual Screening. J. Chem. Inf. Model. 2007, 47, 1504−1519.
(19) Fliri, A.; Loging, W.; Thadeio, P. F.; Volkmann, R. Biological
Spectra Analysis: Linking Biological Activity Profiles to Molecular
Structure. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 261−266.
(20) Petrone, P. M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian,
P.; Cornett, A.; Deng, Z.; Davies, J. W.; Jenkins, J. L.; Glick, M.
Rethinking Molecular Similarity: Comparing Compounds on the Basis
of Biological Activity. ACS Chem. Biol. 2012, 7, 1399−1409.
(21) Hu, Y.; Bajorath, J. Compound Promiscuity: What Can We
Learn from Current Data? Drug Discovery Today 2013, 18, 644−650.
(22) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification;
Wiley: New York, 2001.
(23) Bishop, C. M. Pattern Recognition and Machine Learning;
Springer: Berlin, 2006.
(24) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity
Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983−996.
(25) Maggiora, G. M.; Shanmugasundaram, V. Molecular Similarity
Measures. Methods Mol. Biol. 2004, 275, 1−50.
(26) Takaoka, Y.; Endo, Y.; Yamanobe, S.; Kakinuma, H.; Okubo, T.;
Shimazaki, Y.; Ota, T.; Sumiya, S.; Yoshikawa, K. Development of a
Method for Evaluating Drug-likeness and Ease of Synthesis Using a
Dataset in Which Compounds Are Assigned Scores Based on
Chemists’ Intuition. J. Chem. Inf. Comput. Sci. 2003, 43, 1269−1275.
(27) Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V.
Assessment of the Consistency of Medicinal Chemists in Reviewing
Sets of Compounds. J. Med. Chem. 2004, 47, 4891−4896.
(28) Kutchukian, P. S.; Vasilyeva, N. Y.; Xu, J.; Lindvall, M. K.;
Dillon, M. P.; Glick, M.; Cooley, J. D.; Brooijmans, N. Inside the Mind
of a Medicinal Chemist: The Role of Human Bias in Compound
Prioritization during Drug Discovery. PLoS One 2012, 7, e48476.
(29) Gasteiger, J.; Teckentrup, A.; Terfloth, L.; Spycher, S. Neural
Networks as Data Mining Tools in Drug Design. J. Phys. Org. Chem.
2003, 16, 232−245.
(30) Burges, C. J. C. A Tutorial on Support Vector Machines for
Pattern Recognition. Data Min. Knowl. Discovery 1998, 2, 121−167.
(31) Rusinko, A., III; Farmen, M. W.; Lambert, C. G.; Brown, P. L.;
Young, S. S. Analysis of a Large Structure/Biological Activity Data Set
Using Recursive Partitioning. J. Chem. Inf. Comput. Sci. 1999, 39,
1017−1026.
(32) Auer, J.; Bajorath, J. Emerging Chemical Patterns: A New
Methodology for Molecular Classification and Compound Selection. J.
Chem. Inf. Model. 2006, 46, 2502−2514.
(33) Tanimoto, T. T. IBM Internal Report; IBM Corporation:
Armonk, NY, Nov 17, 1957.
(34) Tversky, A. Features of Similarity. Psychol. Rev. 1977, 84, 327−
352.
(35) Flower, D. R. On the Properties of Bit String-Based Measures of
Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379−386.
(36) Wang, Y.; Eckert, H.; Bajorath, J. Apparent Asymmetry in
Fingerprint Similarity Searching Is a Direct Consequence of
Differences in Bit Densities and Molecular Size. ChemMedChem
2007, 2, 1037−1042.
(37) Fligner, M.; Verducci, J.; Blower, P. A Modification of the
Jaccard−Tanimoto Similarity Index for Diverse Selection of Chemical
Compounds Using Binary Strings. Technometrics 2002, 44, 110−119.
(38) Wang, Y.; Bajorath, J. Advanced Fingerprint Methods for
Similarity Searching: Balancing Molecular Size Effects. Comb. Chem.
High Throughput Screening 2010, 13, 220−228.
(39) Nisius, B.; Bajorath, J. Rendering Conventional Molecular
Fingerprints for Virtual Screening Independent of Molecular
Complexity and Size Effects. ChemMedChem 2010, 5, 859−868.
(40) Becker, J. T.; Morris, R. G. Working Memory(s). Brain Cognit.
1999, 41, 1−8.
(41) Cowan, N. What Are the Differences between Long-Term,
Short-Term, and Working Memory? Prog. Brain Res. 2008, 169, 323−
338.
(42) Hodgetts, C. J.; Hahn, U. Similarity-Based Asymmetries in
Perceptual Matching. Acta Psychol. 2012, 139, 291−299.
(43) Patterson, D. E.; Cramer, R. D.; Ferguson, A. M.; Clark, R. D.;
Weinberger, L. E. Neighborhood Behavior: A Useful Concept for
Validation of Molecular Diversity Descriptors. J. Med. Chem. 1996, 39,
3049−3059.
(44) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally
Similar Compounds Have Similar Biological Activity? J. Med. Chem.
2002, 45, 4350−4358.
(45) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.;
Coleman, R. G. ZINC: A Free Tool To Discover Chemistry for
Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(46) Molecular Operating Environment (MOE); Chemical Computing
Group Inc.: Montreal, Quebec, Canada.
(47) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.;
Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.;
Overington, J. P. ChEMBL: A Large-Scale Bioactivity Database for
Drug Discovery. Nucleic Acids Res. 2011, 40, D1100−D1107.
(48) Bender, A. How Similar Are Those Molecules After All? Use
Two Descriptors and You Will Have Three Different Answers. Expert
Opin. Drug Discovery 2010, 5, 1141−1151.
(49) Sheridan, R. P. Similarity Searching: When Is Complexity
Justified? Expert Opin. Drug Discovery 2007, 2, 423−430.
(50) Willett, P. Combinations of Similarity Rankings Using Data
Fusion. J. Chem. Inf. Model. 2013, 53, 1−10.
(51) Mason, J. S.; Good, A. C.; Martin, E. J. 3-D Pharmacophores in
Drug Discovery. Curr. Pharm. Des. 2001, 7, 567−597.
(52) Renner, S.; Schneider, G. Scaffold-Hopping Potential of Ligand-Based Similarity Concepts. ChemMedChem 2006, 1, 181−185.
(53) Vogt, M.; Stumpfe, D.; Geppert, H.; Bajorath, J. Scaffold
Hopping Using Two-Dimensional Fingerprints: True Potential, Black
Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. J.
Med. Chem. 2010, 53, 5707−5715.
(54) Stumpfe, D.; Bill, A.; Novak, N.; Loch, G.; Blockus, H.; Geppert,
H.; Becker, T.; Hoch, M.; Schmitz, A.; Kolanus, W.; Famulok, M.;
Bajorath, J. Targeting Multi-Functional Proteins by Virtual Screening:
Structurally Diverse Cytohesin Inhibitors with Differentiated Biological Functions. ACS Chem. Biol. 2010, 5, 839−849.
(55) Kolanus, W. Guanine Nucleotide Exchange Factors of the
Cytohesin Family and Their Roles in Signal Transduction. Immunol.
Rev. 2007, 218, 102−113.
(56) Dimova, D.; Stumpfe, D.; Bajorath, J. Quantifying the
Fingerprint Descriptor Dependence of Structure−Activity Relationship Information on a Large Scale. J. Chem. Inf. Model. 2013, 53,
2275−2281.
(57) Lajiness, M. S. Dissimilarity-Based Compound Selection
Techniques. Perspect. Drug Discovery Des. 1997, 7−8, 65−84.
(58) Gillet, V. J. Diversity Selection Algorithms. Wiley Interdiscip.
Rev.: Comput. Mol. Sci. 2011, 1, 580−589.
(59) Rouvray, D. H. The Evolution of the Concept of Molecular
Similarity. In Concepts and Applications of Molecular Similarity;
Johnson, M., Maggiora, G. M., Eds.; John Wiley & Sons: New York,
1990; pp 15−42.
(60) Hu, Y.; Bajorath, J. Global Assessment of Scaffold Hopping
Potential for Current Pharmaceutical Targets. Med. Chem. Commun.
2010, 1, 339−344.
(61) Moffat, K.; Gillet, V. J.; Whittle, M.; Bravi, G.; Leach, A. R. A
Comparison of Field-Based Similarity Searching Methods: CatShape,
FBSS, and ROCS. J. Chem. Inf. Model. 2008, 48, 719−729.
(62) Tresadern, G.; Bemporad, D. Modeling Approaches for Ligand-Based 3D Similarity. Future Med. Chem. 2010, 2, 1547−1561.
(63) Wassermann, A. M.; Wawer, M.; Bajorath, J. Activity Landscape
Representations for Structure−Activity Relationship Analysis. J. Med.
Chem. 2010, 53, 8209−8223.