QSAR model as a random event: A case of rat toxicity

Bioorganic & Medicinal Chemistry 23 (2015) 1223–1230
Alla P. Toropova a, Andrey A. Toropov a,⇑, Emilio Benfenati a, Danuta Leszczynska b, Jerzy Leszczynski c
a IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
b Interdisciplinary Nanotoxicity Center, Department of Civil and Environmental Engineering, Jackson State University, 1325 Lynch St, Jackson, MS 39217-0510, USA
c Interdisciplinary Nanotoxicity Center, Department of Chemistry and Biochemistry, Jackson State University, 1400 J.R. Lynch Street, PO Box 17910, Jackson, MS 39217, USA
Article info
Article history:
Received 19 November 2014
Revised 29 January 2015
Accepted 30 January 2015
Available online 7 February 2015
Keywords:
QSAR
Validation
Domain of applicability
Oral rat toxicity
CORAL software
Abstract
Quantitative structure-property/activity relationships (QSPRs/QSARs) can be used to predict the physicochemical and/or biochemical behavior of substances which have not been studied experimentally. Typically, predicted values for chemicals in the training set are accurate, since these chemicals were used to build the model. QSPR/QSAR models must therefore be validated before they are used in practice. Unfortunately, the majority of the suggested approaches to the validation of QSPR/QSAR models are based on the geometrical features of clusters of data points in the plot of experimental versus calculated values of an endpoint. We believe these geometrical criteria are more useful if they are analyzed for several splits into the training and test sets. In this way, one can estimate the reproducibility of the model across various splits and better evaluate model reliability. The probability of the correct prediction of an endpoint for the external validation set (over the series of the above-mentioned splits) provides a useful way to evaluate the domain of applicability of the model.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Quantitative structure-property/activity relationships (QSPRs/QSARs) are a tool to predict the physicochemical and/or biochemical behaviour of substances which have not been studied experimentally. The prediction is made by means of correlations between descriptors and endpoints.1–10 Various software tools are available to build QSPR/QSAR models,11–15 but there is no standard procedure for the validation of these models. According to the Guidance document on the validation of QSAR models developed by the Organisation for Economic Co-operation and Development (OECD), the validation of a model is an important element of a QSPR/QSAR analysis.16 Criteria for the validation of QSPR/QSAR models have been suggested during the last decade.17–21 However, all suggested criteria of the predictability of QSPR/QSAR models are in fact based on the analysis of the geometry of data points in the plot of experimental versus calculated values of an endpoint, and typically are obtained for one particular distribution of the compounds into the training and test sets.17–21
We deem that the split into the training and test sets has a significant influence on the statistical quality of a model for both the training and the test sets. Taking this into account, one should study several splits to estimate the predictability of the approach used. It should be noted that the split also influences the configuration of the domain of applicability, in accordance with the size and variety of the compounds in the training set.
⇑ Corresponding author.
E-mail address: [email protected] (A.A. Toropov).
One can preliminarily define the domain of applicability as a list of limitations on the molecular architecture.22 Further, the final definition of the domain of applicability can be made according to the molecular descriptors involved in building the model.22,23 As a rule, however, there are some exceptions to the domain of applicability defined in this manner, because there may be molecules with suitable values of the descriptors but with too large a difference between the experimental and calculated values of the endpoint.23
Under such circumstances, the empirical probability of the domain of applicability becomes an attractive alternative to the classical definition of the domain of applicability. The empirical probability of the domain of applicability is the probability that a compound selected by chance from the external set (not from the training set) will have a satisfactory difference between the experimental and calculated values of the endpoint.
The aims of the present study are (i) to build models for toxicity in rat for a group of random splits using the CORAL software;15,24–30 and (ii) to estimate (for these models) the influence of the distribution into the training and test sets upon the probability that an arbitrary substance (from the validation set) is in the domain of applicability.
A. P. Toropova et al. / Bioorg. Med. Chem. 23 (2015) 1223–1230
2. Method
2.1. Data
The lethal rat toxicity data (in mg/kg) were taken from the United States National Library of Medicine.31 The predicted endpoint is the negative logarithm of the lethal dose (pLD50). In total, 525 compounds were selected according to the following principles: (i) the compounds contain only the chemical elements H, C, N, O, S, and P; (ii) the compounds do not have aromatic fragments. Filtering according to these principles is, in fact, the preliminary definition of the domain of applicability. A group of random distributions of all 525 compounds into the sub-training set, calibration set, test set, and validation set has been studied as the basis for building a model for pLD50. The splits were selected according to the following principles: (i) the range of the endpoint is approximately the same for each sub-set; (ii) the splits are random; and (iii) the splits are not identical (Table 1). The SMILES (simplified molecular input-line entry system32–34) notation has been used as the representation of the molecular structure to build the QSAR models suggested in this work. The SMILES notations were generated with ACD/ChemSketch software.35
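The random distribution of compounds into the four sets, and the check that two splits are not identical, can be illustrated with a minimal Python sketch. It assumes that each compound is assigned independently with the probabilities 0.4, 0.4, 0.1, and 0.1 reported in the Results section, and that the identity of a set for two splits is the number of shared compounds divided by the mean size of the two sets, in percent (the measure tabulated in Table 1). The function names `random_split` and `identity` are hypothetical, and the requirement that the endpoint range be similar in every sub-set is not implemented here.

```python
import random

SETS = ("sub-training", "calibration", "test", "validation")


def random_split(compound_ids, probs=(0.4, 0.4, 0.1, 0.1), seed=None):
    """Assign every compound to one of the four sets at random."""
    rng = random.Random(seed)
    split = {name: set() for name in SETS}
    for cid in compound_ids:
        # Draw one set name with the given probabilities.
        name = rng.choices(SETS, weights=probs, k=1)[0]
        split[name].add(cid)
    return split


def identity(split_i, split_j, set_name):
    """Percentage of shared compounds in one set for two splits."""
    a, b = split_i[set_name], split_j[set_name]
    mean_size = 0.5 * (len(a) + len(b))
    if mean_size == 0:
        return 0.0  # both sets empty: nothing to compare
    return len(a & b) / mean_size * 100.0


if __name__ == "__main__":
    split_a = random_split(range(525), seed=1)
    split_b = random_split(range(525), seed=2)
    for name in SETS:
        print(name, round(identity(split_a, split_b, name), 1))
```

With independent random assignment, the expected identity is roughly 40% for the two large sets and roughly 10% for the two small ones, which is the order of magnitude seen in Table 1.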
2.2. Descriptors
The optimal descriptor (D) is a mathematical function of so-called correlation weights (CW). The correlation weights are numerical coefficients associated with various molecular features extracted from the SMILES notations. In other words, the one-variable models examined in this study are based on descriptors of correlation weights (DCW). The optimal descriptors used to build models of pLD50 are calculated as follows:

\[
\mathrm{DCW}(\mathrm{Threshold}, N_{\mathrm{epoch}}) = \sum \mathrm{CW}(S_k) + \sum \mathrm{CW}(SS_k) + \sum \mathrm{CW}(SSS_k) + \mathrm{CW}(\mathrm{BOND}) + \mathrm{CW}(\mathrm{NOSP}) + \mathrm{CW}(\mathrm{PAIR}) \tag{1}
\]

where Threshold is a coefficient that separates the molecular features into two categories: (i) rare features, which are noise from the statistical point of view (these attributes are not involved in building the model; their correlation weights are equal to zero); and (ii) features which are not rare and which can consequently serve as a basis for building a model; Nepoch is the number of iterations of the Monte Carlo optimization (one iteration is the process of modifying all correlation weights involved in building the model); Sk, SSk, and SSSk are local SMILES attributes involving one, two, or three symbols from the SMILES notation. Generally, SMILES elements represented by two or three symbols (e.g., ‘Cl’, ‘Br’, ‘%12’,32–34 etc.) can also be used to build a QSAR model,25–30 but the SMILES examined in this work contain no such elements. BOND, NOSP, and PAIR are global descriptors extracted from SMILES.36 BOND is a mathematical function of the presence in the molecular structure of double (‘=’), triple (‘#’), and stereochemical (‘@’) bonds; NOSP is a mathematical function of the presence of nitrogen (‘N’), oxygen (‘O’), sulphur (‘S’), and phosphorus (‘P’); PAIR is a mathematical function of the simultaneous presence of pairs of SMILES elements, such as the above-mentioned ‘=’, ‘#’, ‘@’, ‘N’, ‘O’, ‘S’, and ‘P’. Table 2 contains an example of the translation of SMILES into the list of attributes.

Table 1
The identity (%) obtained for pairs of random splits

Split  Set           Split A  Split B  Split C  Split D  Split E
A      Sub-training  100      40.9     40.9     39.4     38.9
A      Calibration   100      40.6     33.8     37.4     40.6
A      Test          100      18.9     15.1     9.4      12.8
A      Validation    100      16.1     15.8     7.9      13.3
B      Sub-training  -        100      44.9     40.7     38.7
B      Calibration   -        100      40.1     43.3     39.3
B      Test          -        100      14.2     12.0     11.8
B      Validation    -        100      14.5     9.8      12.1
C      Sub-training  -        -        100      38.3     42.1
C      Calibration   -        -        100      34.0     40.6
C      Test          -        -        100      14.2     15.2
C      Validation    -        -        100      14.3     18.9
D      Sub-training  -        -        -        100      35.1
D      Calibration   -        -        -        100      37.1
D      Test          -        -        -        100      15.1
D      Validation    -        -        -        100      15.3
E      Sub-training  -        -        -        -        100
E      Calibration   -        -        -        -        100
E      Test          -        -        -        -        100
E      Validation    -        -        -        -        100

\[
\mathrm{Identity}_{i,j} = \frac{N_{\mathrm{set}\,i,j}}{0.5\,(N_{\mathrm{set}\,i} + N_{\mathrm{set}\,j})} \times 100\%
\]

where set = (sub-training, calibration, test, and validation); Nset i,j is the number of identical substances in the set for the i-th and j-th splits; Nset i and Nset j are the numbers of substances in the set for the i-th and j-th splits, respectively.

2.3. The balance of correlations
QSAR models were built with the CORAL software using the following steps (Fig. 1):
1. Translation of all SMILES into lists of attributes A(i,j);
2. Calculation (by the Monte Carlo method) of the correlation weights of A(i,j) which give a maximum of the target function24 defined as

\[
TF = R + R' - |R - R'| \times \mathrm{Const} \tag{2}
\]

where R and R′ are the correlation coefficients between the experimental and calculated values of the endpoint for the sub-training and the calibration sets, respectively; Const is an empirical constant equal to 0.01;
3. Calculation with Eq. (1) of the optimal descriptors for all compounds;
4. Calculation (by the least squares method) of a one-variable model for the endpoint using the compounds of the sub-training set;
5. Estimation of the statistical quality of the model for the training sub-system and for the test sub-system.
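Steps 1 and 3 above (translation of a SMILES string into attributes and summation of their correlation weights) can be sketched in Python as follows. This is an illustration under stated assumptions, not the CORAL implementation: SMILES are split into single characters (sufficient for the compounds described here, which contain no two-symbol elements), a closing bracket is treated as an opening one, and each pair or triple is stored in the lexicographically larger of its two reading directions, which reproduces the ordering seen in Table 2. The global attributes BOND, NOSP, and PAIR are omitted, and the function names and the weights in the usage example are hypothetical.

```python
def tokenize(smiles):
    """Split a SMILES string into one-character symbols; ')' is folded into '('."""
    return ["(" if ch == ")" else ch for ch in smiles]


def canonical(tokens):
    """Use one orientation for a run of symbols, so 'C=' and '=C' coincide."""
    forward = tuple(tokens)
    backward = tuple(reversed(tokens))
    return max(forward, backward)


def local_attributes(smiles):
    """Return the Sk, SSk and SSSk attributes as tuples of 1, 2 and 3 symbols."""
    tokens = tokenize(smiles)
    attributes = []
    for size in (1, 2, 3):
        for start in range(len(tokens) - size + 1):
            attributes.append(canonical(tokens[start:start + size]))
    return attributes


def dcw(smiles, correlation_weights):
    """Sum the correlation weights of all local attributes.

    Attributes missing from the dictionary (for example rare ones that fall
    below the threshold) contribute zero.
    """
    return sum(correlation_weights.get(a, 0.0) for a in local_attributes(smiles))


if __name__ == "__main__":
    weights = {("C",): 0.5, ("C", "C"): 0.25}  # made-up values for illustration
    print(dcw("O=C(O)CCCCCCCC", weights))
```

For the SMILES of Table 2 the sketch yields 14 one-symbol, 13 two-symbol, and 12 three-symbol attributes, the same counts as in the table.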
The statistical quality of the model for the test set is a mathematical function of two parameters: the threshold and the number of epochs of the Monte Carlo optimization. The selection of the preferable threshold (T*) and the preferable number of epochs (N*) is described in the literature.36
The scheme of the QSAR modeling is based on four subsets of compounds: the sub-training set; the calibration set; the test set; and the validation set. The sub-training set plays the role of the developer of the model, because the correlation weights should first of all give the maximum of the correlation coefficient between the optimal descriptor and the endpoint for this set. The calibration set plays the role of the critic, because the data on the compounds of this set are used to avoid overtraining (or at least to reduce its probability). The test set plays the role of the estimator of the model for various thresholds and numbers of epochs of the Monte Carlo optimization.37 Finally, the validation set is not involved in building the model and is the ‘main’ validator of the model.
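The interplay of ‘developer’ and ‘critic’ is expressed by the target function of the balance of correlations, TF = R + R′ − |R − R′| × Const, where R and R′ are the correlation coefficients for the sub-training and calibration sets and Const = 0.01. A minimal Python sketch of this function is given below; the helper names are hypothetical, and the Monte Carlo search that maximises TF is not shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def target_function(dcw_sub, y_sub, dcw_cal, y_cal, const=0.01):
    """TF = R + R' - |R - R'| * Const.

    R  : correlation between descriptor and endpoint, sub-training set
    R' : the same correlation for the calibration set
    The penalty term discourages weights that fit the sub-training set
    much better than the calibration set.
    """
    r = pearson_r(dcw_sub, y_sub)
    r_prime = pearson_r(dcw_cal, y_cal)
    return r + r_prime - abs(r - r_prime) * const
```

A candidate set of correlation weights would be scored by computing the descriptor for both sets and passing the values to `target_function`; the larger the value, the better the balance.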
The probability of the presence of outliers is, as a rule, not equal to zero. We believe that this probability is a complex mathematical function of the distribution of compounds into the above-mentioned four sets. Consequently, the realistic reliability of a QSAR model should be estimated from the analysis of several splits into the described four sets.
The balance of correlations (steps 1–5 above) is an alternative to the traditional scheme of building a QSAR model,24 where calculations are based on three sets, namely, the training set, the test (calibration) set, and the external validation set. In the work,24 the comparison of the balance of correlations and the traditional scheme has shown that the predictive potential of the balance of correlations is better than that of the traditional scheme. In this work, this comparison has also been carried out for five different splits. The balance of correlations builds a model using four sets, namely, the sub-training set (developer of the model); the calibration set (critic of the model); the test set (preliminary estimator of the predictive potential of the model); and the validation set (final estimator of the predictive potential of the model). In the traditional scheme the training set consists of the sub-training set together with the calibration set, that is, the above-mentioned ‘critic of the model’ is absent. The target function in the case of the traditional scheme is the correlation coefficient between the optimal descriptor and the endpoint for the training set.24

Table 2
An example of the translation of SMILES [O=C(O)CCCCCCCC] into molecular attributes (A); CW(A) is the correlation weight of an attribute, used in calculations with Eq. (1); ‘NOSP01000000’ is the indicator of the presence of oxygen; ‘BOND10000000’ is the indicator of the presence of double bonds; ‘++++O---B2==’ is the indicator of the simultaneous presence of oxygen and a double bond; DCW(2,17) = 9.58150 (split A, Eq. (5))

Structural attribute (A)  CW(A)   Number of A in the set
                                  Sub-training  Calibration  Test
Sk
O...........              1.1835  157           161          48
=...........              1.1300  146           145          35
C...........              0.3780  201           203          56
(...........              0.5585  149           147          40
O...........              1.1835  157           161          48
(...........              0.5585  149           147          40
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
C...........              0.3780  201           203          56
SSk
O...=.......              0.9405  118           113          33
C...=.......              1.3175  81            94           22
C...(.......              0.1835  145           144          40
O...(.......              1.1270  107           108          33
O...(.......              1.1270  107           108          33
C...(.......              0.1835  145           144          40
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
C...C.......              0.2530  163           175          44
SSSk
O...=...C...              2.0585  42            43           11
=...C...(...              1.6280  37            52           12
O...(...C...              2.0575  61            69           19
(...O...(...              0.6240  7             9            5
O...(...C...              2.0575  61            69           19
C...C...(...              0.1260  95            99           28
C...C...C...              0.1270  75            78           21
C...C...C...              0.1270  75            78           21
C...C...C...              0.1270  75            78           21
C...C...C...              0.1270  75            78           21
C...C...C...              0.1270  75            78           21
C...C...C...              0.1270  75            78           21
Global attributes
NOSP01000000              2.6835  73            72           28
BOND10000000              0.5615  141           141          35
++++O---B2==              0.9960  130           135          35

2.4. ‘Lucky’ and ‘unlucky’ splits
The statement ‘There are no outliers in the test set’ can be formulated mathematically as follows:

\[
|R_{\mathrm{test}} - \bar{R}| < \varepsilon = 0.01 \quad \mathrm{OR} \quad |s_{\mathrm{test}} - \bar{s}| < \varepsilon = 0.01 \tag{3}
\]

where \bar{R} = (R_sub-training + R_calibration)/2 and \bar{s} = (s_sub-training + s_calibration)/2; R_sub-training and R_calibration are the correlation coefficients between the experimental and calculated values of the endpoint for the sub-training set and the calibration set, respectively; s_sub-training and s_calibration are the root-mean-square errors for the sub-training set and the calibration set, respectively.
There are two categories of outliers for a QSPR/QSAR model:
The first category. Compounds which are characterized by a large difference between the experimental and calculated values of the endpoint even if they are placed in the sub-training set or the calibration set.
The second category. Compounds which are characterized by a large difference between the experimental and calculated values of the endpoint if they are placed in the test set or the validation set. However, if they are placed in the sub-training set or in the calibration set, they are characterized by a medium difference between the experimental and calculated values of the endpoint.
Outliers of the first category are possible causes of overtraining. According to Ref. 38, during the optimization of the correlation weights there exists a number of epochs of the Monte Carlo optimization that gives the maximum of the correlation coefficient between the optimal descriptor and the endpoint for the test set. We deem that beyond this number the optimization starts to improve the model with respect to outliers of the first category and, as a result, the statistical quality for the external test set decreases. That is why one should define the preferable number of epochs of the optimization and stop the process at this point.36,38
Outliers of the second category can be used to define the probability of the domain of applicability, P(DA). This can be done by the following algorithm.
1. Noutlier := 0;
2. Find the compound in the test set whose removal gives the maximal increase of the correlation coefficient Rtest between the experimental and calculated endpoint values for the test set, and remove it;
3. If, after this reduction of the test set, the substitution of Rtest and stest into Eq. (3) gives at least one true inequality, the process should be stopped; if not, Noutlier := Noutlier + 1 and step 2 is repeated.
As a result, one obtains the number of outliers (Noutlier) of the second category, and the probability of the domain of applicability can be calculated as follows:

\[
P(\mathrm{DA}) = \left(1 - \frac{N_{\mathrm{outlier}}}{N_{\mathrm{test}}}\right) \times 100\% \tag{4}
\]

where Ntest is the number of compounds in the test set before the reduction.
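The outlier-counting procedure and Eq. (4) can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the source: the stopping criterion of Eq. (3) is tested before each removal, so that a test set which already satisfies it gives Noutlier = 0, and s is computed as the root-mean-square error between the observed and calculated values. The reference values `r_ref` and `s_ref` stand for the means of R and s over the sub-training and calibration sets; all function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, calculated):
    """Root-mean-square error between observed and calculated values."""
    n = len(observed)
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(observed, calculated)) / n)


def count_outliers(observed, calculated, r_ref, s_ref, eps=0.01):
    """Remove compounds one by one until the criterion of Eq. (3) holds."""
    obs, calc = list(observed), list(calculated)
    n_outlier = 0
    while len(obs) > 3:  # keep enough points for a meaningful correlation
        r_ok = abs(pearson_r(obs, calc) - r_ref) < eps
        s_ok = abs(rmse(obs, calc) - s_ref) < eps
        if r_ok or s_ok:
            break
        # Pick the compound whose removal raises the correlation the most.
        best = max(
            range(len(obs)),
            key=lambda i: pearson_r(obs[:i] + obs[i + 1:], calc[:i] + calc[i + 1:]),
        )
        del obs[best]
        del calc[best]
        n_outlier += 1
    return n_outlier


def p_da(n_outlier, n_test):
    """Probability of the domain of applicability, Eq. (4), in percent."""
    return (1.0 - n_outlier / n_test) * 100.0
```

For example, `p_da(17, 61)` gives about 72.1%, the value listed for the first ‘unlucky’ split in Table 3.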
Using these criteria, one can (i) define P(DA) for a group of splits; (ii) select ‘lucky’ splits (Noutlier = 0); and (iii) compare the values of the criteria for various distributions of the data into the above-described sets.
The calculation of the probability that an unknown compound is an outlier is available on the web-site.15 However, in advance (i) a model should be built; and (ii) the unknown compound should be represented by SMILES generated by ACD/ChemSketch software.35

Figure 1. The scheme of building a QSPR/QSAR model with the CORAL software.
3. Results and discussion
We have examined fifteen splits into the sub-training, calibration, test, and validation sets: five are ‘lucky’ splits and ten are ‘unlucky’ splits. The statistical characteristics of the QSAR models calculated with the preferable values of the threshold (T*) and the number of epochs of the Monte Carlo optimization (N*)36 for the five ‘lucky’ splits, characterized by P(DA) = 100%, are the following:

3.1. Split A
3.1.1. Balance of correlations

\[
\mathrm{pLD}_{50} = 2.4251\,(\pm 0.0031) + 0.1282\,(\pm 0.0004) \times \mathrm{DCW}(2,17) \tag{5}
\]

n = 202, r² = 0.7390, q² = 0.7332, s = 0.535, F = 566 (sub-training set).
n = 203, r² = 0.7128, s = 0.527 (calibration set).
n = 56, r² = 0.7888, s = 0.530, R²m = 0.710, R̄²m = 0.607, ΔR²m = 0.207 (test set).
n = 64, r² = 0.8057, s = 0.447 (validation set).

3.1.2. Traditional scheme

\[
\mathrm{pLD}_{50} = 3.3200\,(\pm 0.0014) + 0.0891\,(\pm 0.0001) \times \mathrm{DCW}(3,14) \tag{6}
\]

n = 405, r² = 0.7198, q² = 0.7170, s = 0.534, F = 1035 (training set = sub-training set + calibration set).
n = 56, r² = 0.7522, s = 0.588, R²m = 0.629, R̄²m = 0.491, ΔR²m = 0.271 (test set).
n = 64, r² = 0.7727, s = 0.493 (validation set).

3.2. Split B
3.2.1. Balance of correlations

\[
\mathrm{pLD}_{50} = 2.8240\,(\pm 0.0028) + 0.0867\,(\pm 0.0003) \times \mathrm{DCW}(4,13) \tag{7}
\]

n = 209, r² = 0.6907, q² = 0.6843, s = 0.593, F = 462 (sub-training set).
n = 206, r² = 0.6924, s = 0.572 (calibration set).
n = 50, r² = 0.8849, s = 0.434, R²m = 0.695, R̄²m = 0.604, ΔR²m = 0.182 (test set).
n = 60, r² = 0.6841, s = 0.457 (validation set).

3.2.2. Traditional scheme

\[
\mathrm{pLD}_{50} = 3.2847\,(\pm 0.0014) + 0.0931\,(\pm 0.0001) \times \mathrm{DCW}(4,18) \tag{8}
\]

n = 415, r² = 0.7220, q² = 0.7193, s = 0.550, F = 1073 (training set = sub-training set + calibration set).
n = 50, r² = 0.8870, s = 0.462, R²m = 0.665, R̄²m = 0.561, ΔR²m = 0.209 (test set).
n = 60, r² = 0.6632, s = 0.467 (validation set).

Figure 2. Graphical representation of QSAR models for splits A, B, C, D, and E calculated with Eqs. 5, 7, 9, 11 and 13, respectively.
3.3. Split C
3.3.1. Balance of correlations

\[
\mathrm{pLD}_{50} = 2.3713\,(\pm 0.0028) + 0.1034\,(\pm 0.0003) \times \mathrm{DCW}(2,18) \tag{9}
\]

n = 219, r² = 0.7287, q² = 0.7240, s = 0.552, F = 441 (sub-training set).
n = 193, r² = 0.7281, s = 0.548 (calibration set).
n = 63, r² = 0.7183, s = 0.474, R²m = 0.663, R̄²m = 0.611, ΔR²m = 0.103 (test set).
n = 50, r² = 0.7678, s = 0.500 (validation set).

3.3.2. Traditional scheme

\[
\mathrm{pLD}_{50} = 3.2911\,(\pm 0.0014) + 0.1112\,(\pm 0.0002) \times \mathrm{DCW}(3,19) \tag{10}
\]

n = 412, r² = 0.7371, q² = 0.7346, s = 0.534, F = 1150 (training set = sub-training set + calibration set).
n = 63, r² = 0.6948, s = 0.495, R²m = 0.650, R̄²m = 0.580, ΔR²m = 0.141 (test set).
n = 50, r² = 0.7033, s = 0.583 (validation set).

3.4. Split D
3.4.1. Balance of correlations

\[
\mathrm{pLD}_{50} = 2.9421\,(\pm 0.0029) + 0.0988\,(\pm 0.0003) \times \mathrm{DCW}(2,10) \tag{11}
\]

n = 194, r² = 0.7374, q² = 0.7322, s = 0.566, F = 539 (sub-training set).
n = 219, r² = 0.7218, s = 0.556 (calibration set).
n = 50, r² = 0.7612, s = 0.452, R²m = 0.755, R̄²m = 0.653, ΔR²m = 0.202 (test set).
n = 62, r² = 0.5460, s = 0.535 (validation set).

3.4.2. Traditional scheme

\[
\mathrm{pLD}_{50} = 3.2226\,(\pm 0.0014) + 0.0809\,(\pm 0.0001) \times \mathrm{DCW}(4,9) \tag{12}
\]

n = 413, r² = 0.7017, q² = 0.6989, s = 0.580, F = 967 (training set = sub-training set + calibration set).
n = 50, r² = 0.7552, s = 0.469, R²m = 0.674, R̄²m = 0.553, ΔR²m = 0.243 (test set).
n = 62, r² = 0.5357, s = 0.545 (validation set).

3.5. Split E
3.5.1. Balance of correlations

\[
\mathrm{pLD}_{50} = 2.3027\,(\pm 0.0032) + 0.1010\,(\pm 0.0003) \times \mathrm{DCW}(2,18) \tag{13}
\]

n = 199, r² = 0.7365, q² = 0.7310, s = 0.545, F = 551 (sub-training set).
n = 201, r² = 0.6966, s = 0.560 (calibration set).
n = 69, r² = 0.8183, s = 0.400, R²m = 0.797, R̄²m = 0.742, ΔR²m = 0.111 (test set).
n = 56, r² = 0.8253, s = 0.489 (validation set).

3.5.2. Traditional scheme

\[
\mathrm{pLD}_{50} = 3.1906\,(\pm 0.0014) + 0.0897\,(\pm 0.0001) \times \mathrm{DCW}(2,9) \tag{14}
\]

n = 400, r² = 0.7068, q² = 0.7038, s = 0.559, F = 959 (training set = sub-training set + calibration set).
n = 69, r² = 0.7793, s = 0.413, R²m = 0.718, R̄²m = 0.691, ΔR²m = 0.053 (test set).
n = 56, r² = 0.8024, s = 0.505 (validation set).

In Eqs. 5–14, n is the number of compounds in the set; r is the correlation coefficient; q is the cross-validated correlation coefficient; s is the root-mean-square error; F is the Fisher F-ratio; R²m (should be >0.5), R̄²m (should be >0.5), and ΔR²m (should be <0.2) are metrics of predictability suggested by Roy et al.18,19 The R²m metric is calculated from the correlations between the observed (x) and predicted (y) values with the intercept taken into account (r²) and without the intercept (r₀²):

\[
R_m^2 = R_m^2(x, y) = r^2 \left(1 - \sqrt{|r^2 - r_0^2|}\right) \tag{15}
\]

where x is the vector of observed and y is the vector of predicted pLD50 values for the test set;

\[
\overline{R_m^2} = \frac{R_m^2(x, y) + R_m^2(y, x)}{2} \tag{16}
\]

\[
\Delta R_m^2 = R_m^2(x, y) - R_m^2(y, x) \tag{17}
\]

The distribution of the 525 compounds into the sub-training, the calibration, the test, and the validation sets (the compounds of the validation set are not ‘visible’ during the building of the models) is random, with probabilities 0.4, 0.4, 0.1, and 0.1, respectively. Figure 2 shows the graphical representation of the models calculated by means of the balance of correlations with Eqs. 5, 7, 9, 11 and 13.
The traditional scheme gives the models calculated with Eqs. 6, 8, 10, 12 and 14. The comparison of the statistical characteristics of the models calculated with the balance of correlations and the models calculated by the traditional scheme confirms that the predictive potential of the models calculated with the balance of correlations is better.24

Table 3
The statistical characteristics of QSAR models for toxicity in rat, the case of the distribution into the sub-training set (40%), the calibration set (40%), the test set (10%), and the validation set (10%)

#        R²sub-train     R²calibration   Noutlier   Ntest       P(DA)
1        0.76            0.76            17         61          72.1%
2        0.77            0.77            7          45          89.6%
3        0.78            0.79            15         51          70.6%
4        0.77            0.77            6          62          90.3%
5        0.76            0.76            4          46          91.3%
6        0.76            0.76            12         57          78.9%
7        0.77            0.77            11         59          81.3%
8        0.79            0.79            3          47          93.6%
9        0.76            0.76            9          61          85.2%
10       0.77            0.77            9          48          81.3%
Average  0.769 ± 0.0094  0.770 ± 0.0111  9.3 ± 4.3  53.7 ± 6.6  83.4 ± 7.6

P(DA) is the probability of the domain of applicability for the test set calculated with Eq. (4); Noutlier and Ntest are the number of outliers and the total number of compounds in the test set, respectively.

Table 4
Identifiers of various distributions into the sub-training, calibration, test, and validation sets (percentage of compounds in each set)

Identifier  Sub-training set  Calibration set  Test set  Validation set
30,501,010  30                50               10        10
50,301,010  50                30               10        10
40,401,010  40                40               10        10
20,402,020  20                40               20        20
40,202,020  40                20               20        20
30,302,020  30                30               20        20
10,303,030  10                30               30        30
30,103,030  30                10               30        30
20,203,030  20                20               30        30

Table 5
The statistical characteristics of models of toxicity in rat (pLD50) for the data taken from work44

Set              n    r²      s      Correct classification rate*
Split 1, DCW(3,14)
Training set     231  0.7745  0.505  0.7850
Calibration set  68   0.6524  0.504  0.8750
Validation set   69   0.6177  0.491  0.9000
Split 2, DCW(3,14)
Training set     228  0.7569  0.524  0.7994
Calibration set  69   0.7544  0.428  0.7272
Validation set   71   0.6161  0.493  0.8750
Split 3, DCW(3,9)
Training set     225  0.6862  0.607  0.7972
Calibration set  72   0.7399  0.394  0.9445
Validation set   71   0.6351  0.476  0.8333

* Correct classification rate = 0.5 (sensitivity + specificity).

Figure 3. The plot of the number of outliers in the test set versus the percentage of satisfactory predictions (in the test set) for different distributions (Table 4) into the sub-training set, the calibration set, the test set, and the validation set. E.g., the code ‘30,501,010’ means 30% of the compounds placed in the sub-training set; 50% in the calibration set; 10% in the test set; and 10% in the validation set. The dots are the results of a few runs of the Monte Carlo optimization with thresholds of 2, 3, and 4. Each line (slope) corresponds to the percentage of compounds in the validation set (these are indicated by a grey background).

Table 3 contains the statistical quality of QSAR models for a group of ‘unlucky’ splits with the same distribution. One can see (Table 3) that ‘unlucky’ splits have poorer statistical characteristics than ‘lucky’ splits, but even for them the empirical probability of the domain of applicability is far from zero (in Table 3 this probability ranges from 0.71 to 0.94).
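The R²m metrics of Eqs. 15–17, reported above for every test set, can be illustrated with a short Python sketch. The way r₀² is obtained is not spelled out in the text, so the sketch assumes the usual coefficient of determination of a least-squares line forced through the origin; it also takes the absolute value of the difference in Eq. (17), which is an assumption made so that the value can be compared directly with the threshold of 0.2. All function names are hypothetical.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation (regression with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def r_squared_origin(x, y):
    """Coefficient of determination for y = k * x (no intercept); an assumed convention."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    my = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def r2m(x, y):
    """Eq. (15): r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    r2 = r_squared(x, y)
    r02 = r_squared_origin(x, y)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


def r2m_summary(observed, predicted):
    """Return R2m(x, y), the mean of Eq. (16) and the difference of Eq. (17)."""
    forward = r2m(observed, predicted)
    reverse = r2m(predicted, observed)
    return {
        "r2m": forward,
        "r2m_mean": 0.5 * (forward + reverse),
        "r2m_delta": abs(forward - reverse),  # absolute value is an assumption
    }
```

A model would pass the thresholds quoted in the text if `r2m` and `r2m_mean` exceed 0.5 and `r2m_delta` is below 0.2.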
The analysis of various distributions of chemicals into the sub-training set, calibration set, test set, and validation set is represented in Figure 3. One can see (Fig. 3 and Table 4) that an increase in the number of compounds in the sub-training and calibration sets improves the predictability of these models. The situation is similar to that of the QSPR model for water solubility.37
QSAR models for toxicity in rat have been described in the literature.36,39–42 The comparison of the statistical quality of the models for toxicity in rat calculated with Eqs. 5–14 with that of the models described in Ref. 36 indicates a paradoxical situation: the correlation coefficients for Eqs. 5–14 are higher than the correlation coefficients of the similar models described in Ref. 36, whereas the root-mean-square errors of the models calculated with Eqs. 5–14 are larger than those of the models described in Ref. 36. This paradox is probably the result of the difference between the assortment of compounds studied here and the compounds studied in Ref. 36.
The comparison of the statistical quality of the QSAR models for toxicity in rat described in Ref. 39 has shown that (i) the DRAGON-generated descriptors and the optimal descriptors give a similar quality of prediction for this endpoint; (ii) an increase in the number of descriptors involved in building a QSAR can lead to overtraining; and (iii) the statistical quality of the best QSAR suggested in Ref. 39 is n = 14, r² = 0.9154 (test set). In other words, that correlation coefficient is higher than the correlation coefficients of Eqs. 5–14; however, the number of compounds in the test set for the QSAR described in Ref. 39 is considerably smaller. The statistical quality of the QSAR models for toxicity in rat suggested in Ref. 40 is characterized by ntest = 325 and r²test = 0.47–0.50. A QSAR model for toxicity in rat of inorganic substances is characterized by n = 10, r² = 0.80.41
The comparison of the models calculated by means of Eqs. 5–14 with the above-mentioned QSAR models taken from the literature36,39–41 indicates that the suggested approach can be useful for the QSAR analysis of toxicity in rat.
The analysis of only one model with a ‘lucky’ split could suggest that our approach, that is, Eqs. 5–14, gives an ‘ideal’ model whose statistical quality for external sets is reliable from the point of view of the criteria calculated with Eq. (3). However, the examination of groups of splits has shown that this approach can give ideal results (for a ‘lucky’ split) as well as non-ideal results (for an ‘unlucky’ split).
We believe that the analysis of models obtained with various splits into the training set and test set (analogous to the analysis described in Ref. 42) should become the standard of veritable validation for all approaches used to build QSPR/QSAR models.
According to work,43 the precision of the experimental determination of the LD50 for rats is approximately ±15 mg/kg or approximately one log unit. The experimental data collected in Ref. 31 are extracted from a group of reliable sources. Consequently, the QSAR models calculated with Eqs. 5–14 can be regarded as robust.
In addition, the suggested approach has been checked with data on toxicity in rat taken from work,44 where 369 compounds are examined. The pLD50 for cis-1,2-dichloroethylene is given there as 999.00. Apparently, this is a misprint, and this compound is not considered in this study. The other 368 compounds were randomly distributed three times into the training set, the calibration set, and the validation set. The correct classification rate is the statistical criterion used in the work.44 Table 5 contains the statistical characteristics of the models calculated by the suggested approach. The average value of the criterion for external validation sets (n ≈ 70) obtained in Ref. 44 using only chemical descriptors is equal to 0.81. One can see that the average correct classification rate on the validation set for the models calculated with the described optimal descriptors is equal to 0.869 (Table 5); in other words, the models calculated with optimal descriptors are better than the above-mentioned models.44
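The correct classification rate used in this comparison is defined in the footnote of Table 5 as 0.5 (sensitivity + specificity). A minimal Python sketch of this criterion is given below; how the continuous pLD50 values are turned into classes is not described here, so the sketch starts from the counts of a two-class confusion matrix, and the function name is hypothetical.

```python
def correct_classification_rate(tp, fn, tn, fp):
    """0.5 * (sensitivity + specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of positives recognised
    specificity = tn / (tn + fp)  # fraction of negatives recognised
    return 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    # Validation-set values of the three splits in Table 5.
    validation_rates = [0.9000, 0.8750, 0.8333]
    print(round(sum(validation_rates) / len(validation_rates), 3))
```

Averaging the three validation-set values of Table 5 in this way reproduces the figure of 0.869 quoted in the text.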
The Supplementary materials section contains the technical details related to the described models; additional information is available on the Internet.15
4. Conclusions
There are ‘lucky’ and ‘unlucky’ splits into the sub-training set,
calibration set, test set, and validation set. A robust estimation
of a QSAR model should therefore be based on a group of different random
splits. The domain of applicability can be represented by the mathematical expectation of the probability that a compound selected
by chance in the test set is not an ‘outlier’ (Table 3). Increasing the
number of compounds in the sub-training set and in the calibration
set increases this probability (Fig. 3).
The predictive potential of models calculated by means of the balance of correlations (Ref. 24), based on the distribution of available data
into the sub-training set, calibration set, test set, and external
validation set, is better than the predictive potential of models calculated with the traditional scheme, based on the distribution of available
data into the training set, test set, and external validation set.
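The idea of treating a model as a random event over many splits can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CORAL implementation: the set fractions are arbitrary, `flag_outliers` is a hypothetical user-supplied callable that builds a model on a split and returns the test-set compounds it regards as outliers, and the outlier criterion itself is left to that callable.

```python
import random
from collections import defaultdict

SET_NAMES = ("sub_training", "calibration", "test", "validation")


def random_split(compound_ids, fractions=(0.25, 0.25, 0.25, 0.25), seed=None):
    """Randomly distribute compounds into the four sets (fractions are illustrative)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    ids = list(compound_ids)
    rng.shuffle(ids)
    split, start, cumulative = {}, 0, 0.0
    for i, (name, frac) in enumerate(zip(SET_NAMES, fractions)):
        cumulative += frac
        # The last set takes the remainder so each compound is assigned exactly once.
        end = len(ids) if i == len(SET_NAMES) - 1 else round(cumulative * len(ids))
        split[name] = ids[start:end]
        start = end
    return split


def non_outlier_probability(compound_ids, flag_outliers, n_splits=10, seed=0):
    """Per compound: fraction of its test-set appearances in which it was not an outlier.

    flag_outliers(split) is a hypothetical callable returning the ids of
    test-set compounds judged to be outliers for the model built on that split.
    """
    in_test = defaultdict(int)
    not_outlier = defaultdict(int)
    for k in range(n_splits):
        split = random_split(compound_ids, seed=seed + k)
        outliers = set(flag_outliers(split))
        for cid in split["test"]:
            in_test[cid] += 1
            if cid not in outliers:
                not_outlier[cid] += 1
    # Compounds that never fell into a test set have no estimate and are omitted.
    return {cid: not_outlier[cid] / in_test[cid] for cid in in_test}
```

Averaging the returned values over compounds gives one number per group of splits, which can be compared between groups that use different set sizes.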
Acknowledgements
The authors are grateful for the contribution of the EU project
PROSIL, funded under the LIFE program (project LIFE12 ENV/IT/000154),
and the National Science Foundation (NSF/CREST HRD0833178 and EPSCoR Award #: 362492-190200-01/NSFEPS090378).
Supplementary data
Supplementary data associated with this article can be found, in
the online version, at http://dx.doi.org/10.1016/j.bmc.2015.01.055.
References and notes
1. Furtula, B.; Gutman, I. J. Chemom. 2011, 25, 87.
2. Hollas, B.; Gutman, I.; Trinajstić, N. Croat. Chem. Acta 2005, 78, 489.
3. Toropov, A. A.; Toropova, A. P.; Gutman, I. Croat. Chem. Acta 2005, 78, 503.
4. Ibezim, E.; Duchowicz, P. R.; Ortiz, E. V.; Castro, E. A. Chemom. Intell. Lab. Syst. 2012, 110, 81.
5. Garro Martinez, J. C.; Duchowicz, P. R.; Estrada, M. R.; Castro, G. N.; Castro, E. A. Int. J. Mol. Sci. 2011, 12, 9354.
6. García, J.; Duchowicz, P. R.; Rozas, M. F.; Caram, J. A.; Mirífico, M. V.; Fernández, F. M.; Castro, E. A. J. Mol. Graph. Model. 2011, 31, 10.
7. Mullen, L. M. A.; Duchowicz, P. R.; Castro, E. A. Chemom. Intell. Lab. Syst. 2011, 107, 269.
8. Mouchlis, V. D.; Melagraki, G.; Mavromoustakos, T.; Kollias, G.; Afantitis, A. J. Chem. Inf. Model. 2012, 52, 711.
9. Melagraki, G.; Afantitis, A. Curr. Med. Chem. 2011, 18, 2612.
10. Afantitis, A.; Melagraki, G.; Koutentis, P. A.; Sarimveis, H.; Kollias, G. Eur. J. Med. Chem. 2011, 46, 497.
11. CAESAR, http://www.caesar-project.eu, Accessed October 12, 2014.
12. BCFBAF (EPISUITE), http://www.epa.gov/opptintr/exposure/pubs/episuite.htm,
Accessed October 12, 2014.
13. TEST, http://www.epa.gov/nrmrl/std/qsar/qsar.html, Accessed October 12,
2014.
14. CHEMPROP, http://www.ufz.de/index.php?en=10684, Accessed October 12,
2014.
15. CORAL, http://www.insilico.eu/coral, Accessed January 27, 2015.
16. Guidance document on the validation of (quantitative) structure-activity
relationships [(Q)SAR] models, http://www.oecd.org/dataoecd/55/35/38130292.pdf,
Accessed May 30, 2007.
17. Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269.
18. Roy, K.; Mitra, I. Mini-Rev. Med. Chem. 2012, 12, 491.
19. Roy, K.; Mitra, I.; Kar, S.; Ojha, P. K.; Das, R. N.; Kabir, H. J. Chem. Inf. Model.
2012, 52, 396.
20. Toropova, A. P.; Toropov, A. A.; Rallo, R.; Leszczynska, D.; Leszczynski, J.
Ecotoxicol. Environ. Saf. 2015, 112, 39.
21. Toropova, A. P.; Toropov, A. A.; Veselinović, J. B.; Miljković, F. N.; Veselinović, A.
M. Eur. J. Med. Chem. 2014, 77, 298.
22. Melagraki, G.; Afantitis, A.; Sarimveis, H.; Koutentis, P. A.; Kollias, G.; Igglessi-Markopoulou, O. Mol. Divers. 2009, 13, 301.
23. Toropova, A. P.; Toropov, A. A. Eur. J. Pharm. Sci. 2014, 52, 21.
24. Toropov, A. A.; Rasulev, B. F.; Leszczynski, J. Bioorg. Med. Chem. 2008, 16, 5999.
25. Toropov, A. A.; Toropova, A. P.; Benfenati, E. Eur. J. Med. Chem. 2009, 44, 2544.
26. Toropov, A. A.; Toropova, A. P.; Benfenati, E.; Leszczynska, D.; Leszczynski, J. J.
Comput. Chem. 2010, 31, 381.
27. Toropov, A. A.; Toropova, A. P.; Benfenati, E.; Manganaro, A. Mol. Divers. 2009,
13, 367.
28. Toropov, A. A.; Toropova, A. P.; Benfenati, E. Eur. J. Med. Chem. 2010, 45, 3581.
29. Toropov, A. A.; Toropova, A. P.; Raska, I.; Benfenati, E. Eur. J. Med. Chem. 2010,
45, 1639.
30. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.;
Leszczynski, J. Cent. Eur. J. Chem. 2011, 9, 846.
31. United States National Library of Medicine, http://toxnet.nlm.nih.gov/, Accessed October
12, 2014.
32. Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31.
33. Weininger, D.; Weininger, A.; Weininger, J. L. J. Chem. Inf. Comput. Sci. 1989, 29,
97.
34. Weininger, D. J. Chem. Inf. Comput. Sci. 1990, 30, 237.
35. ACD/ChemSketch Freeware, version 11.00, Advanced Chemistry Development,
Inc., Toronto, ON, Canada, www.acdlabs.com, 2011.
36. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.;
Leszczynski, J. J. Comput. Chem. 2011, 32, 2727.
37. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.;
Leszczynski, J. Chem. Phys. Lett. 2012, 542, 134.
38. Toropov, A. A.; Toropova, A. P.; Rasulev, B. F.; Benfenati, E.; Gini, G.;
Leszczynska, D.; Leszczynski, J. J. Comput. Chem. 2012, 33, 1902.
39. Toropov, A. A.; Rasulev, B. F.; Leszczynski, J. QSAR Comb. Sci. 2007, 26, 686.
40. Lagunin, A.; Zakharov, A.; Filimonov, D.; Poroikov, V. Mol. Inf. 2011, 30, 241.
41. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G. Cent. Eur. J. Chem. 2011, 9,
75.
42. Roy, K.; Mitra, I.; Ojha, P. K.; Kar, S.; Das, R. N.; Kabir, H. Chemom. Intell. Lab. Syst.
2012, 118, 200.
43. Pohjanvirta, R.; Unkila, M.; Lindén, J.; Tuomisto, J. T.; Tuomisto, J. Eur. J. Pharmacol.:
Environ. Toxicol. Pharmacol. 1995, 293, 341.
44. Sedykh, A.; Zhu, H.; Tang, H.; Zhang, L.; Richard, A.; Rusyn, I.; Tropsha, A.
Environ. Health Perspect. 2011, 119, 364.