Bioorganic & Medicinal Chemistry 23 (2015) 1223–1230
journal homepage: www.elsevier.com/locate/bmc

QSAR model as a random event: A case of rat toxicity

Alla P. Toropova a, Andrey A. Toropov a,⇑, Emilio Benfenati a, Danuta Leszczynska b, Jerzy Leszczynski c

a IRCCS - Istituto di Ricerche Farmacologiche Mario Negri, 20156, Via La Masa 19, Milano, Italy
b Interdisciplinary Nanotoxicity Center, Department of Civil and Environmental Engineering, Jackson State University, 1325 Lynch St, Jackson, MS 39217-0510, USA
c Interdisciplinary Nanotoxicity Center, Department of Chemistry and Biochemistry, Jackson State University, 1400 J.R. Lynch Street, PO Box 17910, Jackson, MS 39217, USA

Article info
Article history: Received 19 November 2014; Revised 29 January 2015; Accepted 30 January 2015; Available online 7 February 2015
Keywords: QSAR; Validation; Domain of applicability; Oral rat toxicity; CORAL software

Abstract
Quantitative structure-property/activity relationships (QSPRs/QSARs) can be used to predict the physicochemical and/or biochemical behavior of substances which have not been studied experimentally. Predicted values for chemicals in the training set are typically accurate, since these chemicals were used to build the model. QSPR/QSAR models must therefore be validated before they are used in practice. Unfortunately, the majority of the suggested approaches to the validation of QSPR/QSAR models are based on the geometrical features of clusters of data points in the plot of experimental versus calculated values of an endpoint. We believe these geometrical criteria can be more useful if they are analyzed for several splits into the training and test sets. In this way, one can estimate the reproducibility of the model over various splits and better evaluate model reliability.
The probability of the correct prediction of an endpoint for the external validation set (in the series of the above-mentioned splits) can provide a useful way to evaluate the domain of applicability of the model.
© 2015 Elsevier Ltd. All rights reserved.

⇑ Corresponding author. E-mail address: [email protected] (A.A. Toropov).
http://dx.doi.org/10.1016/j.bmc.2015.01.055

1. Introduction

Quantitative structure-property/activity relationships (QSPRs/QSARs) are a tool to predict the physicochemical and/or biochemical behaviour of various substances which have not been studied experimentally. The prediction is made by means of correlations between descriptors and endpoints [1–10]. Various software tools are available to build QSPR/QSAR models [11–15], but there is no standard procedure for the validation of QSPR/QSAR models. According to the Guidance document on the validation of QSAR models developed by the Organisation for Economic Co-operation and Development (OECD), the validation of a model is an important element of a QSPR/QSAR analysis [16]. Criteria for the validation of QSPR/QSAR models have been suggested during the last decade [17–21]. However, all the suggested criteria of the predictability of QSPR/QSAR models are based on the analysis of the geometry of data points in the plot of experimental versus calculated values of an endpoint, and they are typically obtained for the compounds in the training and test sets [17–21]. We consider that the split into the training and test sets has a significant influence on the statistical quality of a model for both the training and the test sets. Taking this into account, one should study several splits to estimate the predictability of the approach used. It should be noted that the split also influences the configuration of the domain of applicability, in conformity with the size and variety of compounds in the training set.
One can preliminarily define the domain of applicability as a list of limitations on the molecular architecture [22]. The final definition of the domain of applicability can then be given in terms of the molecular descriptors involved in building the model [22,23]. As a rule, however, there are exceptions to the domain of applicability defined in this manner, because there may be molecules with suitable values of the descriptors but with too large a difference between the experimental and calculated values of the endpoint [23]. Under such circumstances, the empirical probability of the domain of applicability becomes an attractive alternative to the classical definition of the domain of applicability. The empirical probability of the domain of applicability is the probability that a compound selected by chance from the external set (not from the training set) will have a satisfactory difference between the experimental and calculated values of the endpoint. The aims of the present study are (i) building models for toxicity in rat for a group of random splits using the CORAL software [15,24–30]; and (ii) estimating, for these models, the influence of the distribution into the training and test sets upon the probability that an arbitrary substance (from the validation set) is in the domain of applicability.

2. Method

2.1. Data

The lethal rat toxicity data (in mg/kg) were taken from the United States National Library of Medicine [31]. The predicted endpoint is the negative logarithm of the lethal dose (pLD50). In total, 525 compounds were selected according to the following principles: (i) the compounds contain only the chemical elements H, C, N, O, S, and P; (ii) the compounds do not have aromatic fragments. Filtration according to these principles is, in fact, the preliminary definition of the domain of applicability.
A group of random distributions of all 525 compounds into the sub-training set, calibration set, test set, and validation set has been studied as the basis for building models for pLD50. The splits were selected according to the following principles: (i) the range of the endpoint is approximately the same for each sub-set; (ii) the splits are random; and (iii) the splits are not identical (Table 1).

Table 1. The identity (%) obtained for pairs of random splits

Pair of splits   Sub-training   Calibration   Test   Validation
A, B             40.9           40.6          18.9   16.1
A, C             40.9           33.8          15.1   15.8
A, D             39.4           37.4           9.4    7.9
A, E             38.9           40.6          12.8   13.3
B, C             44.9           40.1          14.2   14.5
B, D             40.7           43.3          12.0    9.8
B, E             38.7           39.3          11.8   12.1
C, D             38.3           34.0          14.2   14.3
C, E             42.1           40.6          15.2   18.9
D, E             35.1           37.1          15.1   15.3

The identity of each split with itself is 100 for all four sets. The identity is defined as

$$\mathrm{Identity}_{i,j} = \frac{N^{set}_{i,j}}{0.5\,(N^{set}_{i} + N^{set}_{j})} \times 100\%$$

where set = (sub-training, calibration, test, and validation); $N^{set}_{i,j}$ is the number of identical substances in the set for the i-th and j-th splits; $N^{set}_{i}$ and $N^{set}_{j}$ are the numbers of substances in the set for the i-th and j-th splits, respectively.

2.2. Descriptors

The SMILES (simplified molecular input-line entry system [32–34]) notation has been used as the representation of the molecular structure to build the QSAR models suggested in this work. The SMILES notations were generated with ACD/ChemSketch software [35]. The optimal descriptor (D) is a mathematical function of so-called correlation weights (CW). The correlation weights are numerical coefficients associated with various molecular features extracted from the SMILES notations. In other words, the one-variable models examined in this study are based on descriptors of correlation weights (DCW). The optimal descriptors used to build models of the pLD50 are calculated as follows:

$$DCW(Threshold, N_{epoch}) = \sum CW(S_k) + \sum CW(SS_k) + \sum CW(SSS_k) + CW(BOND) + CW(NOSP) + CW(PAIR) \tag{1}$$

where Threshold is a coefficient that separates the molecular features into two categories: (i) rare features, which are noise from the statistical point of view (these attributes are not involved in building the model; their correlation weights are equal to zero); and (ii) features which are not rare and which consequently can serve as a basis for building a model. N_epoch is the number of iterations of the Monte Carlo optimization (one iteration is the process of modification of all correlation weights involved in building the model). S_k, SS_k, and SSS_k are local SMILES attributes involving one, two, or three elements of the SMILES notation. In general, SMILES elements represented by two or three symbols (e.g., 'Cl', 'Br', '%12' [32–34]) can also be used to build a QSAR model [25–30], but the SMILES examined in this work contain no such elements. BOND, NOSP, and PAIR are global descriptors extracted from SMILES [36]. BOND is a mathematical function of the presence in the molecular structure of double ('='), triple ('#'), and stereochemical ('@') bonds; NOSP is a mathematical function of the presence of nitrogen ('N'), oxygen ('O'), sulphur ('S'), and phosphorus ('P'); PAIR is a mathematical function of the simultaneous presence of pairs of SMILES elements, such as the above-mentioned '=', '#', '@', 'N', 'O', 'S', and 'P'. Table 2 contains an example of the translation of SMILES into the list of attributes.

2.3. The balance of correlations

QSAR models were built with the CORAL software using the following steps (Fig. 1):

1. Translation of all SMILES into lists of attributes A(i,j);
2. Calculation (by the Monte Carlo method) of the correlation weights of A(i,j) which give a maximum of the target function [24] defined as

$$TF = R + R' - |R - R'| \times Const \tag{2}$$

where R and R' are the correlation coefficients between experimental and calculated values of an endpoint for the sub-training and the calibration sets, respectively; Const is an empirical constant = 0.01;
3. Calculation with Eq. (1) of the optimal descriptors for all compounds;
4. Calculation (by the least squares method) of a one-variable model for the endpoint using the compounds of the sub-training set;
5. Estimation of the statistical quality of the model for the training sub-system and for the test sub-system.

The statistical quality of the model for the test set is a mathematical function of two parameters: the Threshold and the number of epochs of the Monte Carlo optimization. The selection of the preferable threshold (T*) and the preferable number of epochs (N*) is described in the literature [36].

The scheme of the QSAR modeling is based on four subsets of compounds: the sub-training set, the calibration set, the test set, and the validation set. The sub-training set plays the role of the developer of the model, because the correlation weights should first of all give a maximum of the correlation coefficient between the optimal descriptor and the endpoint for this set. The calibration set plays the role of the critic, because the data on the compounds of this set are used to avoid overtraining (or at least to reduce its probability). The test set plays the role of the estimator of the model for various thresholds and numbers of epochs of the Monte Carlo optimization [37]. Finally, the validation set is not involved in building the model and is the 'main' validator of the model.

The probability of the presence of outliers is, as a rule, not equal to zero. We believe that this probability is a complex mathematical function of the distribution of compounds into the above-mentioned four sets. Consequently, the realistic reliability of a QSAR model should be estimated by analysing several splits into the described four sets.
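The descriptor of Eq. (1) and the target function of Eq. (2) can be illustrated with a minimal Python sketch. It is not the CORAL implementation: the function names are hypothetical, the treatment of ')' as '(' and the orientation rule for two- and three-symbol attributes are assumptions inferred from the attribute list of Table 2, every SMILES element is assumed to be a single character (as stated for the SMILES of this work), and the global attributes (BOND, NOSP, PAIR) are simply passed in as precomputed keys.

```python
def smiles_attributes(smiles):
    """Split a SMILES string into one-, two- and three-symbol attributes."""
    # Assumption (read off Table 2): ')' is coded as '(' and each element is one character.
    symbols = ["(" if ch == ")" else ch for ch in smiles]

    def orient(chunk):
        # Assumption (inferred from Table 2): a chunk and its mirror image
        # denote the same attribute, so one fixed orientation is kept.
        return chunk if chunk[0] >= chunk[-1] else chunk[::-1]

    s1 = list(symbols)
    s2 = [orient(a + b) for a, b in zip(symbols, symbols[1:])]
    s3 = [orient(a + b + c) for a, b, c in zip(symbols, symbols[1:], symbols[2:])]
    return s1, s2, s3


def dcw(smiles, cw, global_attributes=()):
    """Eq. (1): sum of the correlation weights of all attributes of one SMILES.

    cw maps attribute -> correlation weight; attributes missing from cw
    (rare attributes below the threshold) contribute zero.
    """
    s1, s2, s3 = smiles_attributes(smiles)
    local = sum(cw.get(a, 0.0) for a in s1 + s2 + s3)
    return local + sum(cw.get(g, 0.0) for g in global_attributes)


def target_function(r_sub, r_cal, const=0.01):
    """Eq. (2): balance of correlations between sub-training and calibration sets."""
    return r_sub + r_cal - abs(r_sub - r_cal) * const


s1, s2, s3 = smiles_attributes("O=C(O)CCCCCCCC")
print(len(s1), len(s2), len(s3))  # 14 13 12, the attribute counts listed in Table 2
```

In a Monte Carlo run, `target_function` would be re-evaluated after each trial modification of a correlation weight, and the modification would be kept only if the value increases; that loop is omitted here.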
The balance of correlations (steps 1–5 above) is an alternative to the traditional scheme of building a QSAR model [24], where calculations are based on three sets, namely, the training set, the test (calibration) set, and the external validation set. In Ref. [24], the comparison of the balance of correlations and the traditional scheme has shown that the predictive potential of the balance of correlations is better than that of the traditional scheme. In this work, this comparison has also been carried out for five different splits. The balance of correlations builds a model using four sets, namely, the sub-training set (developer of the model); the calibration set (critic of the model); the test set (preliminary estimator of the predictive potential of the model); and the validation set (final estimator of the predictive potential of the model). The traditional scheme means that the training set consists of the sub-training set together with the calibration set; that is, the above-mentioned 'critic of the model' is absent in the traditional scheme. The target function in the traditional scheme is the correlation coefficient between the optimal descriptor and the endpoint for the training set [24].

Table 2. An example of the translation of SMILES [O=C(O)CCCCCCCC] into molecular attributes (A); CW(A) is the correlation weight of the attribute used in calculations with Eq. (1); 'NOSP01000000' is the indicator of the presence of oxygen; 'BOND10000000' is the indicator of the presence of double bonds; '++++O---B2==' is the indicator of the simultaneous presence of oxygen and a double bond; DCW(2, 17) = 9.58150 (split A, Eq. (5))

Structural attribute (A)   CW(A)    Number of A in       Number of A in     Number of A in
                                    sub-training set     calibration set    test set
Sk
O.........                 1.1835   157                  161                48
=.........                 1.1300   146                  145                35
C.........                 0.3780   201                  203                56
(.........                 0.5585   149                  147                40
O.........                 1.1835   157                  161                48
(.........                 0.5585   149                  147                40
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
C.........                 0.3780   201                  203                56
SSk
O...=......                0.9405   118                  113                33
C...=......                1.3175   81                   94                 22
C...(......                0.1835   145                  144                40
O...(......                1.1270   107                  108                33
O...(......                1.1270   107                  108                33
C...(......                0.1835   145                  144                40
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
C...C......                0.2530   163                  175                44
SSSk
O...=...C...               2.0585   42                   43                 11
=...C...(...               1.6280   37                   52                 12
O...(...C...               2.0575   61                   69                 19
(...O...(...               0.6240   7                    9                  5
O...(...C...               2.0575   61                   69                 19
C...C...(...               0.1260   95                   99                 28
C...C...C...               0.1270   75                   78                 21
C...C...C...               0.1270   75                   78                 21
C...C...C...               0.1270   75                   78                 21
C...C...C...               0.1270   75                   78                 21
C...C...C...               0.1270   75                   78                 21
C...C...C...               0.1270   75                   78                 21
Global attributes
NOSP01000000               2.6835   73                   72                 28
BOND10000000               0.5615   141                  141                35
++++O---B2==               0.9960   130                  135                35

2.4. 'Lucky' and 'unlucky' splits

The statement 'There are no outliers in the test set' can be formulated mathematically as follows:

$$|R_{test} - \bar{R}| < \varepsilon = 0.01 \quad \text{OR} \quad |s_{test} - \bar{s}| < \varepsilon = 0.01 \tag{3}$$

where $\bar{R} = (R_{sub\text{-}training} + R_{calibration})/2$ and $\bar{s} = (s_{sub\text{-}training} + s_{calibration})/2$; R_sub-training and R_calibration are the correlation coefficients between experimental and calculated values of an endpoint for the sub-training set and the calibration set, respectively; s_sub-training and s_calibration are the root-mean-square errors for the sub-training set and the calibration set, respectively.

There are two categories of outliers for a QSPR/QSAR model.

The first category. Compounds which are characterized by a large difference between the experimental and calculated values of the endpoint even if they are placed in the sub-training set or the calibration set.

The second category. Compounds which are characterized by a large difference between the experimental and calculated values of the endpoint if they are placed in the test set or the validation set. However, if they are placed in the sub-training set or in the calibration set, they are characterized by a medium difference between the experimental and calculated values of the endpoint.

Outliers of the first category are possible causes of overtraining. According to Ref. [38], in the process of the optimization of correlation weights there exists a number of epochs of the Monte Carlo optimization that gives the maximum of the correlation coefficient between the optimal descriptor and the endpoint for the test set. We consider that beyond this number the optimization starts to improve the model with respect to outliers of the first category and, as a result, the statistical quality for the external test set decreases. That is why one should define the preferable number of epochs of the optimization and stop the process at this moment [36,38].

Outliers of the second category can be used to define the probability of the domain of applicability, P(DA). This can be done with the following algorithm.

1. N_outlier := 0;
2. Find the compound in the test set whose removal gives the maximal increase of the correlation coefficient R_test between the experimental and calculated endpoint values for the test set, and remove it;
3. If, after this reduction of the test set, substitution of R_test and s_test into Eq. (3) gives at least one true inequality, the process is stopped; if not, N_outlier := N_outlier + 1 and step 2 is repeated.
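The outlier-counting procedure above can be sketched in Python as follows. This is an illustrative reading, not the CORAL code: the function names are hypothetical, the criterion of Eq. (3) is checked before each removal (so that a test set which already satisfies it yields N_outlier = 0, which is an assumption about how steps 2 and 3 interlock), s is taken as the plain root-mean-square error, and a lower bound on the size of the reduced test set is added as a safeguard. The last function converts the count into a percentage in the way the paper defines P(DA) in Eq. (4).

```python
import math


def pearson(x, y):
    """Correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def rmse(x, y):
    """Root-mean-square error (assumed definition of s)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def no_outliers(obs, calc, r_bar, s_bar, eps=0.01):
    """Eq. (3): at least one of the two inequalities must hold."""
    return abs(pearson(obs, calc) - r_bar) < eps or abs(rmse(obs, calc) - s_bar) < eps


def count_outliers(obs, calc, r_bar, s_bar, eps=0.01, min_size=3):
    """Remove, one at a time, the compound whose removal raises R_test the most."""
    obs, calc = list(obs), list(calc)
    n_outlier = 0
    # min_size is a safeguard added for this sketch, not part of the source algorithm.
    while len(obs) > min_size and not no_outliers(obs, calc, r_bar, s_bar, eps):
        def r_without(i):
            return pearson(obs[:i] + obs[i + 1:], calc[:i] + calc[i + 1:])

        worst = max(range(len(obs)), key=r_without)
        del obs[worst], calc[worst]
        n_outlier += 1
    return n_outlier


def p_da(n_outlier, n_test):
    """Probability of the domain of applicability, in percent (Eq. (4))."""
    return (1.0 - n_outlier / n_test) * 100.0
```

Here `r_bar` and `s_bar` are the averages of the sub-training and calibration values defined under Eq. (3). For instance, `p_da(17, 61)` evaluates to about 72.1, the value listed for the first 'unlucky' split in Table 3.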
As a result, one obtains the number of outliers (N_outlier) of the second category, and the probability of the domain of applicability can be calculated as follows:

$$P(DA) = \left(1 - \frac{N_{outlier}}{N_{test}}\right) \times 100\% \tag{4}$$

where N_test is the number of compounds in the test set before the reduction. Using these criteria, one can (i) define P(DA) for a group of splits; (ii) select 'lucky' splits (N_outlier = 0); and (iii) compare values of the criteria for various distributions of data into the above-described sets. The calculation of the probability that an unknown compound is an outlier is available on the web site [15]. However, in advance (i) a model should be built; and (ii) the unknown compound should be represented by SMILES generated by ACD/ChemSketch software [35].

Figure 1. The scheme of building a QSPR/QSAR model with the CORAL software.

3. Results and discussion

We have examined fifteen splits into the sub-training, calibration, test, and validation sets: five are 'lucky' splits and ten are 'unlucky' splits. The statistical characteristics of the QSAR models calculated with the preferable values of the threshold (T*) and the number of epochs of the Monte Carlo optimization (N*) [36] for the five 'lucky' splits characterized by P(DA) = 100% are the following.

3.1. Split A

3.1.1. Balance of correlations

$$pLD50 = 2.4251(0.0031) + 0.1282(0.0004) \times DCW(2,17) \tag{5}$$

n = 202, r² = 0.7390, q² = 0.7332, s = 0.535, F = 566 (sub-training set).
n = 203, r² = 0.7128, s = 0.527 (calibration set).
n = 56, r² = 0.7888, s = 0.530, $R_m^2$ = 0.710, $\overline{R_m^2}$ = 0.607, $\Delta R_m^2$ = 0.207 (test set).
n = 64, r² = 0.8057, s = 0.447 (validation set).

3.1.2. Traditional scheme

$$pLD50 = 3.3200(0.0014) + 0.0891(0.0001) \times DCW(3,14) \tag{6}$$

n = 405, r² = 0.7198, q² = 0.7170, s = 0.534, F = 1035 (training set = sub-training set + calibration set).
n = 56, r² = 0.7522, s = 0.588, $R_m^2$ = 0.629, $\overline{R_m^2}$ = 0.491, $\Delta R_m^2$ = 0.271 (test set).
n = 64, r² = 0.7727, s = 0.493 (validation set).

3.2. Split B

3.2.1. Balance of correlations

$$pLD50 = 2.8240(0.0028) + 0.0867(0.0003) \times DCW(4,13) \tag{7}$$

n = 209, r² = 0.6907, q² = 0.6843, s = 0.593, F = 462 (sub-training set).
n = 206, r² = 0.6924, s = 0.572 (calibration set).
n = 50, r² = 0.8849, s = 0.434, $R_m^2$ = 0.695, $\overline{R_m^2}$ = 0.604, $\Delta R_m^2$ = 0.182 (test set).
n = 60, r² = 0.6841, s = 0.457 (validation set).

3.2.2. Traditional scheme

$$pLD50 = 3.2847(0.0014) + 0.0931(0.0001) \times DCW(4,18) \tag{8}$$

n = 415, r² = 0.7220, q² = 0.7193, s = 0.550, F = 1073 (training set = sub-training set + calibration set).
n = 50, r² = 0.8870, s = 0.462, $R_m^2$ = 0.665, $\overline{R_m^2}$ = 0.561, $\Delta R_m^2$ = 0.209 (test set).
n = 60, r² = 0.6632, s = 0.467 (validation set).

3.3. Split C

3.3.1. Balance of correlations

$$pLD50 = 2.3713(0.0028) + 0.1034(0.0003) \times DCW(2,18) \tag{9}$$

n = 219, r² = 0.7287, q² = 0.7240, s = 0.552, F = 441 (sub-training set).
n = 193, r² = 0.7281, s = 0.548 (calibration set).
n = 63, r² = 0.7183, s = 0.474, $R_m^2$ = 0.663, $\overline{R_m^2}$ = 0.611, $\Delta R_m^2$ = 0.103 (test set).
n = 50, r² = 0.7678, s = 0.500 (validation set).

3.3.2. Traditional scheme

$$pLD50 = 3.2911(0.0014) + 0.1112(0.0002) \times DCW(3,19) \tag{10}$$

n = 412, r² = 0.7371, q² = 0.7346, s = 0.534, F = 1150 (training set = sub-training set + calibration set).
n = 63, r² = 0.6948, s = 0.495, $R_m^2$ = 0.650, $\overline{R_m^2}$ = 0.580, $\Delta R_m^2$ = 0.141 (test set).
n = 50, r² = 0.7033, s = 0.583 (validation set).

3.4. Split D

3.4.1. Balance of correlations

$$pLD50 = 2.9421(0.0029) + 0.0988(0.0003) \times DCW(2,10) \tag{11}$$

n = 194, r² = 0.7374, q² = 0.7322, s = 0.566, F = 539 (sub-training set).
n = 219, r² = 0.7218, s = 0.556 (calibration set).
n = 50, r² = 0.7612, s = 0.452, $R_m^2$ = 0.755, $\overline{R_m^2}$ = 0.653, $\Delta R_m^2$ = 0.202 (test set).
n = 62, r² = 0.5460, s = 0.535 (validation set).

3.4.2. Traditional scheme

$$pLD50 = 3.2226(0.0014) + 0.0809(0.0001) \times DCW(4,9) \tag{12}$$

n = 413, r² = 0.7017, q² = 0.6989, s = 0.580, F = 967 (training set = sub-training set + calibration set).
n = 50, r² = 0.7552, s = 0.469, $R_m^2$ = 0.674, $\overline{R_m^2}$ = 0.553, $\Delta R_m^2$ = 0.243 (test set).
n = 62, r² = 0.5357, s = 0.545 (validation set).

3.5. Split E

3.5.1. Balance of correlations

$$pLD50 = 2.3027(0.0032) + 0.1010(0.0003) \times DCW(2,18) \tag{13}$$

n = 199, r² = 0.7365, q² = 0.7310, s = 0.545, F = 551 (sub-training set).
n = 201, r² = 0.6966, s = 0.560 (calibration set).
n = 69, r² = 0.8183, s = 0.400, $R_m^2$ = 0.797, $\overline{R_m^2}$ = 0.742, $\Delta R_m^2$ = 0.111 (test set).
n = 56, r² = 0.8253, s = 0.489 (validation set).

3.5.2. Traditional scheme

$$pLD50 = 3.1906(0.0014) + 0.0897(0.0001) \times DCW(2,9) \tag{14}$$

n = 400, r² = 0.7068, q² = 0.7038, s = 0.559, F = 959 (training set = sub-training set + calibration set).
n = 69, r² = 0.7793, s = 0.413, $R_m^2$ = 0.718, $\overline{R_m^2}$ = 0.691, $\Delta R_m^2$ = 0.053 (test set).
n = 56, r² = 0.8024, s = 0.505 (validation set).

In Eqs. 5–14, n is the number of compounds in the set; r is the correlation coefficient; q is the cross-validated correlation coefficient; s is the root-mean-square error; F is the Fisher F-ratio; $R_m^2$ (should be >0.5), $\overline{R_m^2}$ (should be >0.5), and $\Delta R_m^2$ (should be <0.2) are metrics of predictability suggested by Roy et al. [18,19]. The $R_m^2$ metric is calculated from the correlations between the observed (x) and predicted (y) values with the intercept taken into account ($r^2$) and without the intercept ($r_0^2$):

$$R_m^2 = R_m^2(x, y) = r^2 \left(1 - \sqrt{|r^2 - r_0^2|}\right) \tag{15}$$

where x is the vector of observed and y is the vector of predicted pLD50 for the test set;

$$\overline{R_m^2} = \frac{R_m^2(x, y) + R_m^2(y, x)}{2} \tag{16}$$

$$\Delta R_m^2 = |R_m^2(x, y) - R_m^2(y, x)| \tag{17}$$

Figure 2. Graphical representation of QSAR models for splits A, B, C, D, and E calculated with Eqs. 5, 7, 9, 11 and 13, respectively.

Table 3. The statistical characteristics of QSAR models for toxicity in rat, the case of the distribution into the sub-training set (40%), the calibration set (40%), the test set (10%), and the validation set (10%)

#         R² sub-train      R² calibration    N_outlier    N_test        P(DA), %
1         0.76              0.76              17           61            72.1
2         0.77              0.77              7            45            89.6
3         0.78              0.79              15           51            70.6
4         0.77              0.77              6            62            90.3
5         0.76              0.76              4            46            91.3
6         0.76              0.76              12           57            78.9
7         0.77              0.77              11           59            81.3
8         0.79              0.79              3            47            93.6
9         0.76              0.76              9            61            85.2
10        0.77              0.77              9            48            81.3
Average   0.769 ± 0.0094    0.770 ± 0.0111    9.3 ± 4.3    53.7 ± 6.6    83.4 ± 7.6

P(DA) is the probability of the domain of applicability for the test set calculated with Eq. (4); N_outlier and N_test are the number of outliers and the total number of compounds in the test set, respectively.

Table 4. Identifiers of various distributions into the sub-training, calibration, test, and validation sets (percentage of compounds in each set)

Identifier    Sub-training set   Calibration set   Test set   Validation set
30,501,010    30                 50                10         10
50,301,010    50                 30                10         10
40,401,010    40                 40                10         10
20,402,020    20                 40                20         20
40,202,020    40                 20                20         20
30,302,020    30                 30                20         20
10,303,030    10                 30                30         30
30,103,030    30                 10                30         30
20,203,030    20                 20                30         30

Table 5. The statistical characteristics of models of toxicity in rat (pLD50) for the data taken from Ref. [44]

Set                  n     r²       s       Correct classification rate*
Split 1, DCW(3,14)
Training set         231   0.7745   0.505   0.7850
Calibration set      68    0.6524   0.504   0.8750
Validation set       69    0.6177   0.491   0.9000
Split 2, DCW(3,14)
Training set         228   0.7569   0.524   0.7994
Calibration set      69    0.7544   0.428   0.7272
Validation set       71    0.6161   0.493   0.8750
Split 3, DCW(3,9)
Training set         225   0.6862   0.607   0.7972
Calibration set      72    0.7399   0.394   0.9445
Validation set       71    0.6351   0.476   0.8333

* Correct classification rate = 0.5 (sensitivity + specificity).

Figure 3. The plot of the number of outliers in the test set versus the percentage of satisfactory predictions (in the test set) for different distributions (Table 4) into the sub-training set, the calibration set, the test set, and the validation set. E.g., the code '30,501,010' means 30% of compounds placed in the sub-training set; 50% of compounds placed in the calibration set; 10% of compounds placed in the test set; and 10% of compounds placed in the validation set. The dots are results of a few runs of the Monte Carlo optimization with thresholds of 2, 3, and 4. Each line (slope) corresponds to the percentage of compounds in the validation set (these are indicated by grey background).

The distribution of the 525 compounds into the sub-training, the calibration, the test, and the validation set (the compounds of the validation set are not 'visible' when the models are built) is random, with probabilities 0.4, 0.4, 0.1, and 0.1, respectively. Figure 2 shows the graphical representation of the models calculated by means of the balance of correlations with Eqs. 5, 7, 9, 11 and 13. The traditional scheme gives the models calculated with Eqs. 6, 8, 10, 12 and 14. The comparison of the statistical characteristics of the models calculated with the balance of correlations and the models calculated by the traditional scheme confirms that the predictive potential of the models calculated with the balance of correlations is better [24]. Table 3 contains the statistical quality of QSAR models for a group of 'unlucky' splits with the same distribution. One can see (Table 3) that 'unlucky' splits have poorer statistical characteristics than 'lucky' splits, but even for them the empirical probability of the domain of applicability is not zero (in Table 3 this probability ranges from approximately 0.71 to 0.94). The analysis of various distributions of chemicals into the sub-training set, calibration set, test set, and validation set is represented in Figure 3. One can see (Fig.
3 and Table 4) that increasing the number of compounds in the sub-training and calibration sets improves the predictability of these models. The situation is similar to that of the QSPR model for water solubility [37]. QSAR models for toxicity in rat have been described in the literature [36,39–42]. The comparison of the statistical quality of the models for toxicity in rat calculated with Eqs. 5–14 and the models described in Ref. [36] indicates a paradoxical situation: the correlation coefficients for Eqs. 5–14 are higher than the correlation coefficients of similar models described in Ref. [36], whereas the root-mean-square errors of the models calculated with Eqs. 5–14 are larger than the root-mean-square errors of the models described in Ref. [36]. This paradox is probably the result of the difference between the set of compounds studied here and the set of compounds studied in Ref. [36]. The comparison of the statistical quality of the QSAR models for toxicity in rat described in Ref. [39] has shown that (i) the DRAGON-generated descriptors and the optimal descriptors give a similar quality of prediction for this endpoint; (ii) an increase in the number of descriptors involved in building a QSAR can lead to overtraining; and (iii) the statistical quality of the best QSAR suggested in Ref. [39] is n = 14, r² = 0.9154 (test set). In other words, that correlation coefficient is higher than the correlation coefficients of Eqs. 5–14; however, the number of compounds in the test set for the QSAR described in Ref. [39] is considerably smaller. The statistical quality of the QSAR models for toxicity in rat suggested in Ref. [40] is characterized by n_test = 325 and r²_test = 0.47–0.50. A QSAR model for toxicity in rat of inorganic substances is characterized by n = 10, r² = 0.80 [41]. The comparison of the models calculated by means of Eqs. 5–14 with the above-mentioned QSAR models taken from the literature [36,39–41] indicates that the suggested approach can be useful for the QSAR analysis of toxicity in rat.
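For readers who wish to reproduce the predictability metrics of Eqs. (15)–(17) quoted with Eqs. 5–14, a minimal Python sketch is given here. The function names are hypothetical, and the way $r_0^2$ is obtained (least-squares regression of the second argument on the first, forced through the origin) is an assumption of this sketch, since the text does not spell it out.

```python
import math


def r2_with_intercept(x, y):
    """Squared correlation coefficient of y versus x (ordinary regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def r2_through_origin(x, y):
    """r0^2 for the regression y = k*x without intercept (assumed definition)."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    my = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def r2m(x, y):
    """Eq. (15): R2m(x, y) = r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    r2 = r2_with_intercept(x, y)
    r02 = r2_through_origin(x, y)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


def roy_metrics(observed, predicted):
    """Return R2m(x, y), its average with R2m(y, x) (Eq. 16) and their gap (Eq. 17)."""
    forward = r2m(observed, predicted)
    reverse = r2m(predicted, observed)
    return forward, (forward + reverse) / 2.0, abs(forward - reverse)
```

Calling `roy_metrics(observed, predicted)` on the observed and calculated pLD50 values of a test set returns the three numbers in the order in which they are reported above; the thresholds quoted in the text (>0.5, >0.5, <0.2) can then be applied directly.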
The analysis of only one model with a 'lucky' split could suggest that our approach, that is, Eqs. 5–14, gives an 'ideal' model whose statistical quality for external sets is reliable from the point of view of the criteria calculated with Eq. (3). However, the examination of groups of splits has shown that this approach can give ideal results (for a 'lucky' split) and non-ideal results (for an 'unlucky' split). We believe that the analysis of models obtained with various splits into the training set and the test set (analogous to the analysis described in Ref. [42]) should become the standard of true validation for all approaches used to build QSPR/QSAR models. According to Ref. [43], the precision of the experimental determination of the LD50 for rats is approximately ±15 mg/kg or approximately one log unit. The experimental data collected in Ref. [31] are extracted from a group of reliable sources. Consequently, the QSAR models calculated with Eqs. 5–14 can be regarded as robust. In addition, the suggested approach has been checked with data on toxicity in rat taken from Ref. [44], where 369 compounds are examined. The pLD50 for cis-1,2-dichloroethylene is given there as 999.00. Apparently, this is a misprint, and this compound is not considered in this study. The other 368 compounds were randomly distributed three times into the training set, the calibration set, and the validation set. The correct classification rate is the statistical criterion used in Ref. [44]. Table 5 contains the statistical characteristics of the models calculated by the suggested approach. The average value of the criterion for external validation sets (n ≈ 70) obtained in Ref. [44] using only chemical descriptors is equal to 0.81.
One can see that the average correct classification rate on the validation set for the models calculated with the described optimal descriptors is equal to 0.869 (Table 5); in other words, the models calculated with optimal descriptors are better than the above-mentioned models [44]. The Supplementary materials section contains the technical details related to the described models; additional information is available on the Internet [15].

4. Conclusions

There are 'lucky' and 'unlucky' splits into the sub-training set, calibration set, test set, and validation set. The robust estimation of a QSAR model should be done with a group of various random splits. The domain of applicability can be represented by the mathematical expectation of the probability that a compound selected by chance from the test set is not an outlier (Table 3). An increase in the number of compounds in the sub-training set and in the calibration set leads to an increase of this probability (Fig. 3). The predictive potential of models calculated by means of the balance of correlations [24] (based on the distribution of available data into the sub-training set, calibration set, test set, and external validation set) is better than the predictive potential of models calculated with the traditional scheme (based on the distribution of available data into the training set, test set, and external validation set).

Acknowledgements

The authors are grateful for the contribution of the EU project PROSIL funded under the LIFE program (project LIFE12 ENV/IT/000154), and the National Science Foundation (NSF/CREST HRD0833178, and EPSCoR Award #:362492-190200-01/NSFEPS090378).

Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.bmc.2015.01.055.

References and notes

1. Furtula, B.; Gutman, I. J. Chemom. 2011, 25, 87.
2. Hollas, B.; Gutman, I.; Trinajstić, N. Croat. Chem. Acta 2005, 78, 489.
3. Toropov, A. A.; Toropova, A. P.; Gutman, I. Croat. Chem. Acta 2005, 78, 503.
4. Ibezim, E.; Duchowicz, P. R.; Ortiz, E. V.; Castro, E. A. Chemom. Intell. Lab. 2012, 110, 81.
5. Garro Martinez, J. C.; Duchowicz, P. R.; Estrada, M. R.; Castro, G. N.; Castro, E. A. Int. J. Mol. Sci. 2011, 12, 9354.
6. García, J.; Duchowicz, P. R.; Rozas, M. F.; Caram, J. A.; Mirífico, M. V.; Fernández, F. M.; Castro, E. A. J. Mol. Graph. Model. 2011, 31, 10.
7. Mullen, L. M. A.; Duchowicz, P. R.; Castro, E. A. Chemometr. Intell. Lab. Syst. 2011, 107, 269.
8. Mouchlis, V. D.; Melagraki, G.; Mavromoustakos, T.; Kollias, G.; Afantitis, A. J. Chem. Inf. Model. 2012, 52, 711.
9. Melagraki, G.; Afantitis, A. Curr. Med. Chem. 2011, 18, 2612.
10. Afantitis, A.; Melagraki, G.; Koutentis, P. A.; Sarimveis, H.; Kollias, G. Eur. J. Med. Chem. 2011, 46, 497.
11. CAESAR, http://www.caesar-project.eu, Accessed October 12, 2014.
12. BCFBAF (EPISUITE), http://www.epa.gov/opptintr/exposure/pubs/episuite.htm, Accessed October 12, 2014.
13. TEST, http://www.epa.gov/nrmrl/std/qsar/qsar.html, Accessed October 12, 2014.
14. CHEMPROP, http://www.ufz.de/index.php?en=10684, Accessed October 12, 2014.
15. CORAL, http://www.insilico.eu/coral, Accessed January 27, 2015.
16. Guidance document on the validation of (quantitative) structure-activity relationships [(Q)SAR] models, http://www.oecd.org/dataoecd/55/35/38130292.pdf, Accessed May 30, 2007.
17. Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269.
18. Roy, K.; Mitra, I. Mini-Rev. Med. Chem. 2012, 12, 491.
19. Roy, K.; Mitra, I.; Kar, S.; Ojha, P. K.; Das, R. N.; Kabir, H. J. Chem. Inf. Model. 2012, 52, 396.
20. Toropova, A. P.; Toropov, A. A.; Rallo, R.; Leszczynska, D.; Leszczynski, J. Ecotoxicol. Environ. Saf. 2015, 112, 39.
21. Toropova, A. P.; Toropov, A. A.; Veselinović, J. B.; Miljković, F. N.; Veselinović, A. M. Eur. J. Med. Chem. 2014, 77, 298.
22. Melagraki, G.; Afantitis, A.; Sarimveis, H.; Koutentis, P. A.; Kollias, G.; Igglessi-Markopoulou, O. Mol. Divers 2009, 13, 301.
23. Toropova, A. P.; Toropov, A. A. Eur. J. Pharm. Sci. 2014, 52, 21.
24. Toropov, A. A.; Rasulev, B. F.; Leszczynski, J. Bioorg. Med. Chem. 2008, 16, 5999.
25. Toropov, A. A.; Toropova, A. P.; Benfenati, E. Eur. J. Med. Chem. 2009, 44, 2544.
26. Toropov, A. A.; Toropova, A. P.; Benfenati, E.; Leszczynska, D.; Leszczynski, J. J. Comput. Chem. 2010, 31, 381.
27. Toropov, A. A.; Toropova, A. P.; Benfenati, E.; Manganaro, A. Mol. Divers 2009, 13, 367.
28. Toropov, A. A.; Toropova, A. P.; Benfenati, E. Eur. J. Med. Chem. 2010, 45, 3581.
29. Toropov, A. A.; Toropova, A. P.; Raska, I.; Benfenati, E. Eur. J. Med. Chem. 2010, 45, 1639.
30. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.; Leszczynski, J. Cent. Eur. J. Chem. 2011, 9, 846.
31. United States National Library of Medicine, http://toxnet.nlm.nih.gov/, Accessed October 12, 2014.
32. Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31.
33. Weininger, D.; Weininger, A.; Weininger, J. L. J. Chem. Inf. Comput. Sci. 1989, 29, 97.
34. Weininger, D. J. Chem. Inf. Comput. Sci. 1990, 30, 237.
35. ACD/ChemSketch Freeware, version 11.00, Advanced Chemistry Development, Inc., Toronto, ON, Canada, www.acdlabs.com, 2011.
36. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.; Leszczynski, J. J. Comput. Chem. 2011, 32, 2727.
37. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G.; Leszczynska, D.; Leszczynski, J. Chem. Phys. Lett. 2012, 542, 134.
38. Toropov, A. A.; Toropova, A. P.; Rasulev, B. F.; Benfenati, E.; Gini, G.; Leszczynska, D.; Leszczynski, J. J. Comput. Chem. 2012, 33, 1902.
39. Toropov, A. A.; Rasulev, B. F.; Leszczynski, J. QSAR Comb. Sci. 2007, 26, 686.
40. Lagunin, A.; Zakharov, A.; Filimonov, D.; Poroikov, V. Mol. Inf. 2011, 30, 241.
41. Toropova, A. P.; Toropov, A. A.; Benfenati, E.; Gini, G. Cent. Eur. J. Chem. 2011, 9, 75.
42. Roy, K.; Mitra, I.; Ojha, P. K.; Kar, S.; Das, R. N.; Kabir, H.
Chemomr. Intell. Lab 2012, 118, 200. 43. Pohjanvirta, R.; Unkila, M.; Lindén, J.; Tuomisto, J. T.; Tuomisto, J. Eur. J. Pharm: Environ. Toxicol. Pharmacol. 1995, 293, 341. 44. Sedykh, A.; Zhu, H.; Tang, H.; Zhang, L.; Richard, A.; Rusyn, I.; Tropsha, A. Environ. Health Perspect. 2011, 119, 364.