Indian Journal of Chemical Technology
Vol. 23, January 2016, pp. 9-21

Application of support vector machine in QSAR study of triazolyl thiophenes as cyclin dependent kinase-5 inhibitors for their anti-Alzheimer activity

Zahra Garkani-Nejad1,* & Abouzar Ghanbari2
1 Chemistry Department, Faculty of Science, Shahid Bahonar University, Kerman, Iran
2 Chemistry Department, Faculty of Science, Vali-e-Asr University, Rafsanjan, Iran
E-mail: [email protected]

Received 7 November 2013; accepted 13 September 2015

Quantitative structure-activity relationship (QSAR) models are mathematical equations that relate chemical structures to biological activities. A series of triazolyl thiophenes that act as cyclin dependent kinase 5 (cdk5/p25) inhibitors has been selected to establish QSAR models. In this work, several chemometric methods are applied to model and predict the anti-Alzheimer activity of these compounds using descriptors calculated from the molecular structures. First, the stepwise multiple linear regression (MLR) method is used to select the descriptors responsible for the anti-Alzheimer activity of these compounds. Then support vector regression (SVR) and partial least squares (PLS) are used to construct nonlinear and linear QSAR models, respectively. The results demonstrate that the SVR model offers powerful prediction capabilities.

Keywords: Cdk5/p25 inhibitors, Triazolyl thiophene, Alzheimer disease, Quantitative structure-activity relationship (QSAR), Support vector regression (SVR)

Cyclin dependent kinase 5 (CDK5) plays an essential role in the development of the central nervous system during mammalian embryogenesis [1]. In the adult, CDK5 is required for the maintenance of neuronal architecture. Its deregulation has profound cytotoxic effects and has been implicated in the development of neurodegenerative diseases such as Alzheimer's disease and amyotrophic lateral sclerosis [2].
Cyclin dependent kinase 5 (CDK5) is a member of a family of proline-directed serine/threonine kinases [3]. It regulates a variety of processes in developing and adult neurons and plays a central role in neuronal migration during the development of the central nervous system [4]. Excessive upregulation of CDK5 by its truncated activators contributes to neurodegeneration by altering the phosphorylation state of cytosolic and cytoskeletal proteins, and increased CDK5 activity has been implicated in Alzheimer's disease (AD), amyotrophic lateral sclerosis (ALS), Parkinson's disease, Niemann-Pick type C disease, and ischemia. Postmortem brain analysis of AD patients reveals extensive formation of neurofibrillary tau protein tangles and amyloid plaques. The serine/threonine kinase CDK5, together with its cofactor p25, is thought to hyperphosphorylate tau, leading to the formation of paired helical filaments and the deposition of cytotoxic neurofibrillary tangles, and thus to contribute to neurodegenerative disorders such as Alzheimer's disease, Parkinson's disease, stroke, and Huntington's disease. CDK5 also phosphorylates the dopamine- and cyclic AMP-regulated phosphoprotein (DARPP-32) at threonine 75, indicating a role in dopaminergic neurotransmission. Inhibition of the anomalous CDK5/p25 complex is therefore a viable approach to treating Alzheimer's disease by preventing tau hyperphosphorylation and neurofibrillary tangle formation.

QSAR studies are useful techniques in the drug design process [5]. QSAR models that can predict the biological activities of new compounds are invaluable tools for increasing the efficiency of new syntheses [6]. In general, QSAR modeling involves data set preparation, generation of molecular descriptors, derivation of statistical models, model training and validation using a training data set, and model testing on a test data set.
The main goal of the present work was to develop different models using linear and nonlinear chemometric approaches for predicting the anti-Alzheimer activity of triazolyl thiophene derivatives. Stepwise multiple linear regression (MLR) was used to select the descriptors responsible for the anti-Alzheimer activity of these compounds. Then support vector regression (SVR) and partial least squares (PLS) were used to construct nonlinear and linear QSAR models, respectively. Finally, the results obtained using the three methods were compared.

Experimental Section

Data set
All 124 methyl linked cyclohexyl thiophenes with triazole as CDK5 inhibitors, together with their experimental IC50 values, were collected from the literature [7]. The IC50 values were converted to the corresponding log IC50 values and used as the dependent variable. The log IC50 values provide a broad and homogeneous data set for the QSAR study. The QSAR models were generated using a training set of 88 molecules and a test set of 36 molecules. Training and test sets were selected randomly such that a wide range of activity in the data set was included. The structures of the different groups of triazolyl thiophenes are shown in Tables 1 and 2.

Table 1 ― Different groups of triazolyl thiophenes used in modeling (general structures of compound groups 1 to 9)

Table 2 ― Studied triazolyl thiophenes with different substituents

No.   Compd.  R1             R2          R3       Log IC50
1     1a      NHCOCH2Cl      -           -        1.63
2     1b      NHCOCH3        -           -        2.60
3     1c      NHCOC6H5       -           -        2.80
4     1d      NHCH2CH2COOH   -           -        2.43
5     2a      NHCOCH2Cl      -4-ClC6H4   -        2.54
6     2b      NHCOCH3        -4-ClC6H4   -        2.66
7     2c      NHCOC6H5       -4-ClC6H4   -        2.26
8     2d      NHCH2CH2COOH   -4-ClC6H4   -        2.66
9     2e      NHCOCH2Cl      -C6H5       -        2.88
10    2f      NHCOCH3        -C6H5       -        2.76
11    2g      NHCOC6H5       -C6H5       -        2.95
12    2h      NHCH2CH2COOH   -C6H5       -        2.83
13    2i      NHCOCH2Cl      -CH3        -        2.84
14    2j      NHCOCH3        -CH3        -        2.77
15    2k      NHCOC6H5       -CH3        -        2.79
16    2l      NHCH2CH2COOH   -CH3        -        2.64
17    3a      NHCOCH2Cl      -4-ClC6H4   -        2.76
18    3b      NHCOCH3        -4-ClC6H4   -        2.62
19    3c      NHCOC6H5       -4-ClC6H4   -        2.74
20    3d      NHCH2CH2COOH   -4-ClC6H4   -        2.63
21    3e      NHCOCH2Cl      -C6H5       -        2.82
22    3f      NHCOCH3        -C6H5       -        2.59
23    3g      NHCOC6H5       -C6H5       -        2.69
24    3h      NHCH2CH2COOH   -C6H5       -        3.69
25    3i      NHCOCH2Cl      -CH3        -        1.79
26    3j      NHCOCH3        -CH3        -        2.66
27    3k      NHCOC6H5       -CH3        -        2.75
28    3l      NHCH2CH2COOH   -CH3        -        2.81
29    4a      NHCOCH2Cl      -4-ClC6H4   -        1.76
30    4b      NHCOCH3        -4-ClC6H4   -        2.62
31    4c      NHCOC6H5       -4-ClC6H4   -        2.74
32    4d      NHCH2CH2COOH   -4-ClC6H4   -        2.62
33    4e      NHCOCH2Cl      -C6H5       -        2.82
34    4f      NHCOCH3        -C6H5       -        2.60
35    4g      NHCOC6H5       -C6H5       -        2.69
36    4h      NHCH2CH2COOH   -C6H5       -        3.69
37    4i      NHCOCH2Cl      -CH3        -        1.78
38    4j      NHCOCH3        -CH3        -        2.66
39    4k      NHCOC6H5       -CH3        -        2.75
40    4l      NHCH2CH2COOH   -CH3        -        2.81
41    5a      NHCOCH2Cl      -4-ClC6H4   -        2.65
42    5b      NHCOCH3        -4-ClC6H4   -        2.59
43    5c      NHCOC6H5       -4-ClC6H4   -        2.47
44    5d      NHCH2CH2COOH   -4-ClC6H4   -        2.73
45    5e      NHCOCH2Cl      -C6H5       -        2.83
46    5f      NHCOCH3        -C6H5       -        2.76
47    5g      NHCOC6H5       -C6H5       -        2.55
48    5h      NHCH2CH2COOH   -C6H5       -        2.81
49    5i      NHCOCH2Cl      -CH3        -        2.84
50    5j      NHCOCH3        -CH3        -        2.59
51    5k      NHCOC6H5       -CH3        -        2.75
52    5l      NHCH2CH2COOH   -CH3        -        2.78
53    6a      NHCOCH2Cl      -4-ClC6H4   -        1.59
54    6b      NHCOCH3        -4-ClC6H4   -        1.81
55    6c      NHCOC6H5       -4-ClC6H4   -        2.37
56    6d      NHCH2CH2COOH   -4-ClC6H4   -        2.30
57    6e      NHCOCH2Cl      -C6H5       -        1.57
58    6f      NHCOCH3        -C6H5       -        2.28
59    6g      NHCOC6H5       -C6H5       -        2.42
60    6h      NHCH2CH2COOH   -C6H5       -        2.66
61    6i      NHCOCH2Cl      -CH3        -        1.79
62    6j      NHCOCH3        -CH3        -        2.40
63    6k      NHCOC6H5       -CH3        -        2.26
64    6l      NHCH2CH2COOH   -CH3        -        2.97
65    7a      NHCOCH2Cl      -4-ClC6H4   -CH3     2.67
66    7aa     NHCOC6H5       -CH3        -CH3     2.81
67    7ab     NHCH2CH2COOH   -CH3        -CH3     2.56
68    7ac     NHCOCH2Cl      -CH3        -C6H5    2.56
69    7ad     NHCOCH3        -CH3        -C6H5    2.64
70    7ae     NHCOC6H5       -CH3        -C6H5    2.61
71    7af     NHCH2CH2COOH   -CH3        -C6H5    2.68
72    7ag     NHCOCH2Cl      -CH3        -CH2Cl   2.66
73    7ah     NHCOCH3        -CH3        -CH2Cl   2.68
74    7ai     NHCOC6H5       -CH3        -CH2Cl   2.55
75    7aj     NHCH2CH2COOH   -CH3        -CH2Cl   2.42
76    7b      NHCOCH3        -4-ClC6H4   -CH3     2.81
77    7c      NHCOC6H5       -4-ClC6H4   -CH3     2.50
78    7d      NHCH2CH2COOH   -4-ClC6H4   -CH3     2.55
79    7e      NHCOCH2Cl      -4-ClC6H4   -C6H5    2.79
80    7f      NHCOCH3        -4-ClC6H4   -C6H5    2.72
81    7g      NHCOC6H5       -4-ClC6H4   -C6H5    2.68
82    7h      NHCH2CH2COOH   -4-ClC6H4   -C6H5    2.43
83    7i      NHCOCH2Cl      -4-ClC6H4   -CH2Cl   2.36
84    7j      NHCOCH3        -4-ClC6H4   -CH2Cl   2.41
85    7k      NHCOC6H5       -4-ClC6H4   -CH2Cl   2.53
86    7l      NHCH2CH2COOH   -4-ClC6H4   -CH2Cl   2.27
87    7m      NHCOCH2Cl      -C6H5       -CH3     2.54
88    7n      NHCOCH3        -C6H5       -CH3     2.74
89    7o      NHCOC6H5       -C6H5       -CH3     2.75
90    7p      NHCH2CH2COOH   -C6H5       -CH3     2.67
91    7q      NHCOCH2Cl      -C6H5       -C6H5    2.67
92    7r      NHCOCH3        -C6H5       -C6H5    2.62
93    7s      NHCOC6H5       -C6H5       -C6H5    2.67
94    7t      NHCH2CH2COOH   -C6H5       -C6H5    2.64
95    7u      NHCOCH2Cl      -C6H5       -CH2Cl   2.73
96    7v      NHCOCH3        -C6H5       -CH2Cl   2.81
97    7w      NHCOC6H5       -C6H5       -CH2Cl   2.74
98    7x      NHCH2CH2COOH   -C6H5       -CH2Cl   2.56
99    7y      NHCOCH2Cl      -CH3        -CH3     2.66
100   7z      NHCOCH3        -CH3        -CH3     2.51
101   8a      NHCOCH2Cl      -4-ClC6H4   -        1.54
102   8b      NHCOCH3        -4-ClC6H4   -        2.33
103   8c      NHCOC6H5       -4-ClC6H4   -        2.34
104   8d      NHCH2CH2COOH   -4-ClC6H4   -        2.41
105   8e      NHCOCH2Cl      -C6H5       -        1.51
106   8f      NHCOCH3        -C6H5       -        2.59
107   8g      NHCOC6H5       -C6H5       -        2.64
108   8h      NHCH2CH2COOH   -C6H5       -        2.43
109   8i      NHCOCH2Cl      -CH3        -        2.53
110   8j      NHCOCH3        -CH3        -        2.66
111   8k      NHCOC6H5       -CH3        -        2.67
112   8l      NHCH2CH2COOH   -CH3        -        2.73
113   9a      NHCOCH2Cl      -4-ClC6H4   -        3.46
114   9b      NHCOCH3        -4-ClC6H4   -        3.80
115   9c      NHCOC6H5       -4-ClC6H4   -        3.52
116   9d      NHCH2CH2COOH   -4-ClC6H4   -        3.77
117   9e      NHCOCH2Cl      -C6H5       -        3.80
118   9f      NHCOCH3        -C6H5       -        3.81
119   9g      NHCOC6H5       -C6H5       -        3.74
120   9h      NHCH2CH2COOH   -C6H5       -        3.70
121   9i      NHCOCH2Cl      -CH3        -        3.66
122   9j      NHCOCH3        -CH3        -        3.70
123   9k      NHCOC6H5       -CH3        -        3.79
124   9l      NHCH2CH2COOH   -CH3        -        3.79

Descriptor calculation and selection
The first step in obtaining a QSAR model is to encode the structural features of the molecules as molecular descriptors. The molecular descriptors used to search for the best model were calculated using the Dragon program [8] on the basis of minimum-energy molecular geometries optimized in the HyperChem package [9] with the AM1 semi-empirical method. After the calculation of the molecular descriptors, those that stayed constant for all molecules, or that were equal for more than 90% of the molecules or had less than 10% non-zero values, were excluded.

In this study, a nonlinear feature mapping technique (SVR) has been used to construct the QSAR models. Since this method cannot itself select the most significant descriptors from the pool of calculated molecular descriptors, the stepwise multiple linear regression (MLR) variable subset selection method was used to select the most relevant descriptors from the pool of 256 descriptors. These descriptors were then used as inputs of the SVR.
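The descriptor pre-filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 90% cut-off is read here as "drop a descriptor when a single value accounts for more than 90% of the molecules", the function name `filter_descriptors` and the toy descriptor pool are hypothetical, and the actual filtering was done by the authors' software, not by this code.

```python
from collections import Counter

def filter_descriptors(columns, max_equal_fraction=0.90):
    """Drop constant and near-constant descriptor columns.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    A column is dropped when its most frequent value covers more than
    max_equal_fraction of the molecules (a constant column covers 100%).
    The 0.90 cut-off is an assumed reading of the rule in the text.
    """
    kept = {}
    for name, values in columns.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) <= max_equal_fraction:
            kept[name] = values
    return kept

# Hypothetical toy pool for 20 molecules (values are illustrative only).
pool = {
    "const": [1.0] * 20,                    # constant for all molecules
    "mostly_zero": [0.0] * 19 + [2.5],      # 95% of the values are equal
    "varied": [float(i) for i in range(20)],
}
kept = filter_descriptors(pool)
print(sorted(kept))  # ['varied']
```

Only the surviving columns would then be passed on to the stepwise MLR selection.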
Regression analysis

Multiple linear regression (MLR)
Multiple linear regression is used to develop a linear model of the property of interest [10], which takes the form below:

$$ Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \tag{1} $$

where $Y$ is the property, that is, the dependent variable, $X_1$ to $X_n$ are the specific descriptors, $b_1$ to $b_n$ are the coefficients of those descriptors, and $b_0$ is the intercept of the equation.

In this study, a stepwise multiple linear regression procedure was used for model development. For regression analysis, the data set was divided into a training set and a test set. The molecules included in these sets were selected randomly. The training set, consisting of 88 molecules, was used for model generation with the SPSS software package [11]. The test set, consisting of 36 molecules, was used to evaluate the generated models. The stepwise procedure produces many MLR models, from which the best one has to be chosen. The best MLR model is the one with a high R value, a low standard error, the smallest number of descriptors and high predictive ability. The selected model is presented in Table 3. Eight descriptors were chosen as the optimum number of parameters in this model. The correlation between the selected descriptors is shown in Table 4. As can be seen from this correlation matrix, there is no significant correlation between the selected descriptors. Calculated values of log IC50 for the training and test sets using the MLR model are shown in Tables 5 and 6.

Partial least squares (PLS)
PLS is a linear modeling technique in which the information in the descriptor matrix X is projected onto a small number of underlying ("latent") variables called PLS components. The matrix Y is simultaneously used in estimating the latent variables in X that are most relevant for predicting the Y variables.
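To make the idea of a latent variable concrete, the following sketch computes a single PLS component for one response, in the style of the NIPALS PLS1 algorithm: the weight vector is the direction in descriptor space with maximal covariance with y, the scores are the projections of the centered descriptors onto it, and y is then regressed on the scores. This is a minimal illustration, not the MINITAB implementation used in the paper; the function name `pls_one_component` and the toy data are hypothetical, and it assumes that y is correlated with at least one descriptor (otherwise the weight vector has zero length).

```python
import math

def pls_one_component(X, y):
    """First PLS component for a single response (NIPALS-style PLS1 step)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    yc = [v - y_mean for v in y]                                # centered y

    # Weight vector: proportional to X^T y, normalized to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: projection of each molecule onto the latent variable.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regression of y on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x_new):
        score = sum((x_new[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return w, t, q, predict

# Hypothetical toy data: 4 molecules, 2 descriptors.
X = [[1.0, 0.5], [2.0, 1.5], [3.0, 1.0], [4.0, 2.5]]
y = [2.1, 2.6, 2.9, 3.7]
w, t, q, predict = pls_one_component(X, y)
print(w, q, predict([2.5, 1.2]))
```

Further components would be obtained by deflating X and y with the first component and repeating the same step.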
In the present work, PLS modeling was performed using MINITAB [12]. For regression analysis, the data set was separated into training and test sets (Tables 5 and 6). The number of significant factors for the PLS algorithm was determined by cross-validation. In leave-one-out cross-validation, one sample was kept out of the calibration and used for prediction. The process was repeated so that each sample was kept out once. The predicted values of the left-out samples were then compared with the observed values using the prediction error sum of squares (PRESS). PRESS was calculated each time a new principal component (PC) was added to the model. The optimum number of PLS factors is the one that minimizes PRESS. Figure 1 shows the plot of PRESS versus the number of factors for the PLS model. The best PLS model contained 5 factors. Calculated values of log IC50 for the training and test sets using the PLS model are shown in Tables 5 and 6.

Table 3 ― Details of the descriptors in the MLR model

Symbol   Descriptor type              Regression coefficient   Standard deviation   Descriptor definition
MATS2e   2D-autocorrelation           20.198                   ±1.665               Moran autocorrelation - lag 2 / weighted by atomic Sanderson electronegativities
IC1      Topological                  1.822                    ±0.260               Information content index (neighborhood symmetry of 1-order)
GATS4p   2D-autocorrelation           -0.966                   ±1.324               Geary autocorrelation - lag 4 / weighted by atomic polarizabilities
Mor03m   3D-MoRSE                     -0.04                    ±0.015               Signal 3 / weighted by atomic masses
GATS2e   2D-autocorrelation           8.940                    ±1.190               Geary autocorrelation - lag 2 / weighted by atomic Sanderson electronegativities
VEA2     Topological                  -8.182                   ±1.923               Average eigenvector coefficient sum from adjacency matrix
MATS2p   2D-autocorrelation           6.128                    ±1.298               Moran autocorrelation - lag 2 / weighted by atomic polarizabilities
JGT      Galvez topological indices   4.122                    ±1.281               Global topological charge index

Table 4 ― Correlation matrix of descriptors selected using MLR model

         MATS2e   IC1     GATS4p   Mor03m   GATS2e   VEA2    MATS2p   JGT
MATS2e   1
IC1      0.293    1
GATS4p   0.056    0.127   1
Mor03m   0.124    0.316   0.099    1
GATS2e   0.595    0.21    0.242    0.055    1
VEA2     0.269    0.168   0.080    0.497    0.047    1
MATS2p   0.367    0.050   0.319    0.302    0.067    0.525   1
JGT      0.106    0.559   0.071    0.321    0.203    0.438   0.538    1

Table 5 ― Experimental and predicted log IC50 values by MLR, PLS and SVR models for training set

No.   Exp.   Calc. MLR   Calc. PLS   Calc. SVR
1     1.63   1.80        1.87        1.68
2     2.59   2.64        2.37        2.54
3     2.79   2.61        2.60        2.75
5     2.53   2.29        2.45        2.49
6     2.66   2.72        2.63        2.71
7     2.26   2.31        2.58        2.31
8     2.65   2.64        2.54        2.65
9     2.87   2.74        2.61        2.82
10    2.75   2.99        2.89        2.79
11    2.94   2.94        2.82        2.90
12    2.82   2.63        2.66        2.83
14    2.76   2.77        2.81        2.71
17    2.53   2.33        3.05        2.58
18    2.56   2.59        2.45        2.61
19    2.79   2.58        2.58        2.74
20    2.21   2.46        2.30        2.26
21    2.82   2.73        2.69        2.77
23    2.84   2.95        2.73        2.79
24    2.76   2.92        2.91        2.71
25    2.92   2.70        3.10        2.87
26    2.92   2.98        2.96        2.87
29    1.76   2.10        2.15        1.81
30    2.61   2.46        2.47        2.56
31    2.74   2.42        2.69        2.69
32    2.62   2.43        2.51        2.57
33    2.81   2.34        2.46        2.77
34    2.59   2.76        2.69        2.64
37    1.78   2.38        2.45        1.83
38    2.66   2.74        2.57        2.61
40    2.80   2.68        2.73        2.75
41    2.64   2.43        2.60        2.60
42    2.58   2.59        2.56        2.64
44    2.73   2.52        2.66        2.68
45    2.83   2.67        2.73        2.78
47    2.54   2.91        2.81        2.59
50    2.58   2.74        2.72        2.64
51    2.75   2.76        2.95        2.80
53    1.59   1.85        1.77        1.64
54    1.80   2.06        2.16        1.85
55    2.37   2.27        2.35        2.42
57    1.56   2.26        2.08        1.61
58    2.28   2.37        2.15        2.23
59    2.42   2.22        2.31        2.44
62    2.39   2.21        2.23        2.45
64    2.96   2.40        2.39        2.92
65    2.67   2.77        2.46        2.62
66    2.80   2.82        2.58        2.75
67    2.56   2.81        2.53        2.61
68    2.55   2.67        2.60        2.57
70    2.61   2.79        2.75        2.66
71    2.67   2.69        2.72        2.71
72    2.65   2.42        2.49        2.60
73    2.67   2.53        2.71        2.72
74    2.55   2.56        2.79        2.60
80    2.72   2.76        2.58        2.67
81    2.68   2.70        2.56        2.69
82    2.43   2.18        2.43        2.40
84    2.41   2.74        2.62        2.46
85    2.53   2.54        2.77        2.57
86    2.26   2.57        2.40        2.32
88    2.73   2.85        2.84        2.72
89    2.75   2.69        2.71        2.70
90    2.66   2.60        2.55        2.62
91    2.67   2.49        2.56        2.62
92    2.61   2.63        2.74        2.66
94    2.63   2.87        2.67        2.59
96    2.80   2.74        2.57        2.76
97    2.73   2.67        2.87        2.69
99    2.65   2.51        2.60        2.62
101   1.54   2.04        2.24        1.59
102   2.32   2.14        2.08        2.27
103   2.33   2.26        2.33        2.38
104   2.41   2.63        2.58        2.46
105   1.50   1.88        2.03        1.55
106   2.58   2.48        2.50        2.53
107   2.63   2.56        2.34        2.58
108   2.42   2.40        2.54        2.45
109   2.52   2.35        2.39        2.47
111   2.66   2.76        2.69        2.62
112   2.72   2.50        2.55        2.71
114   3.80   3.72        3.59        3.75
115   3.51   3.60        3.43        3.46
117   3.79   3.59        3.53        3.74
119   3.73   3.81        3.71        3.69
120   3.69   3.83        3.82        3.64
121   3.60   3.45        3.64        3.55
122   3.71   3.97        3.90        3.66
124   3.79   3.70        3.77        3.74

Support vector machine (SVM)
The support vector machine (SVM) was first introduced by Vapnik [13]. There are two main categories of support vector machines: support vector classification (SVC) and support vector regression (SVR). SVM is a learning system that uses a high-dimensional feature space. Support vector regression (SVR) is the most common application form of SVM [14]. An overview of the basic ideas underlying support vector (SV) machines for regression and function estimation has been given by Basak and coworkers [15]. In SVR, the basic idea is to map the data X into a higher-dimensional feature space F via a nonlinear mapping Φ and then to perform linear regression in this space [16,17]. Regression approximation therefore addresses the problem of estimating a function from a given data set.
SVR has recently emerged as an alternative and highly effective means of solving the nonlinear regression problem. SVR has been quite successful in both academic and industrial settings owing to its many attractive features and promising generalization performance.

Table 6 ― Experimental and predicted log IC50 values by MLR, PLS and SVR models for test set

No.   Exp.   Calc. MLR   Calc. PLS   Calc. SVR
4     2.11   2.43        2.38        2.62
13    2.24   2.84        2.67        2.46
15    2.98   2.79        3.06        2.81
16    2.68   2.64        2.21        2.86
22    3.29   2.81        3.18        2.74
27    3.45   2.89        3.20        2.73
28    2.44   2.86        2.85        2.73
35    2.79   3.69        2.68        2.77
36    2.85   1.78        2.89        2.69
39    2.88   2.75        2.97        2.76
43    2.59   2.47        2.60        2.61
46    2.81   2.76        2.71        2.70
48    2.75   2.81        2.80        2.72
49    2.32   2.84        2.59        2.51
52    2.62   2.86        2.71        2.77
56    2.34   2.30        2.24        2.42
60    2.48   2.66        2.42        2.56
61    1.96   1.78        1.91        2.15
63    2.44   2.25        2.29        2.44
69    2.74   2.64        2.77        2.78
75    2.66   2.42        2.64        2.72
76    2.77   2.81        2.72        2.68
77    2.48   2.50        2.38        2.48
78    2.67   2.55        2.53        2.55
79    2.14   2.79        2.59        2.42
83    2.41   2.36        2.35        2.26
87    2.82   2.54        2.83        2.61
93    2.46   2.67        2.39        2.62
95    2.57   2.72        2.85        2.45
98    2.83   2.56        2.69        2.55
100   2.62   2.51        2.56        2.65
110   2.37   2.66        2.50        2.69
113   3.23   3.46        3.34        3.35
116   3.50   3.77        3.34        3.06
118   4.12   3.81        3.76        3.85
123   3.93   3.79        3.91        3.44

In the present work, the SVR evaluations were carried out using the SVM toolbox in the CLEMENTINE software [18]. The data set was divided into training and test sets (Tables 5 and 6), and the descriptors that entered the MLR model were used as inputs. Next, the kernel function, which represents the sample distribution in the mapping space, has to be chosen. In this work, the RBF (radial basis function) kernel was chosen. The RBF SVR method fits a nonlinear function to the data, again aiming for maximum flatness. The RBF kernel is also often called a Gaussian kernel, since the kernel function has the same form as the Gaussian distribution function.
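As an illustration of the Gaussian form of the RBF kernel, the sketch below evaluates K(x, z) = exp(-gamma * ||x - z||^2) and the usual kernel expansion of a trained SVR, f(x) = sum_k c_k K(s_k, x) + b. The kernel width `gamma`, the support vectors, the dual coefficients and the bias are all hypothetical values chosen for the example; the paper does not report them, and the CLEMENTINE toolbox may parameterize the kernel differently.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Evaluate the SVR decision function as a weighted sum of kernels."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical trained model with two support vectors in a 2-descriptor space.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.8, -0.3]   # assumed values, not taken from the paper
bias = 2.5                 # assumed value, not taken from the paper

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
print(svr_predict([0.5, 0.5], support_vectors, dual_coefs, bias))
```

The kernel equals 1 for identical inputs and decays toward 0 as the descriptor vectors move apart, which is what lets a linear regression in feature space act as a nonlinear fit in the original descriptor space.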
The overall performance of the SVM was evaluated in terms of the root-mean-square error (RMSE), which was calculated from the following equation:

$$ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n_s} (y_i - y_0)^2}{n_s}} \tag{2} $$

In this equation, $y_i$ is the desired output, $y_0$ is the value predicted by the model, and $n_s$ is the number of molecules in the data set. The predictive power of the SVM models developed on the selected training sets was estimated by predicting an external test set and was also examined by a cross-validation test, for which the statistical parameter $Q^2$ was calculated as follows:

$$ Q^2 = 1 - \frac{\sum (y_0 - y_i)^2}{\sum (y_i - y_m)^2} \tag{3} $$

In the above expressions, $y_m$ is the mean of the dependent variable.

Fig. 1 ― PRESS versus the number of factors for the PLS model

Table 7 ― The statistical parameters of different constructed QSAR models

        Training set                 Test set
Model   RMSE    R       F            RMSE    R       F
SVR     0.046   0.996   1192.00      0.260   0.804   61.950
PLS     0.214   0.900   331.88       0.235   0.830   88.980
MLR     0.216   0.890   332.849      0.311   0.490   32.089

Calculated values of log IC50 for the training and test sets using the SVR model are shown in Tables 5 and 6. The statistical parameters of this model for the training and test sets are shown in Table 7.

Results and Discussion
The main objective of this research is to develop QSAR models for predicting the biological activity of triazolyl thiophene derivatives. In the first stage, a multiple linear regression (MLR) model was developed. The variables selected in this model are MATS2e, MATS2p, GATS4p, GATS2e, IC1, Mor03m, VEA2, and JGT. These descriptors can be categorized as topological, two-dimensional autocorrelation, three-dimensional MoRSE and charge index descriptors. Details are given in Table 3. Based on the correlation matrix (Table 4), there is no significant correlation between the selected descriptors: the largest pairwise value is 0.595, whereas the collinearity threshold in QSAR/QSPR studies is usually taken as 0.9, i.e. descriptors with R > 0.9 are regarded as collinear.
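The two statistics of Eqs. (2) and (3) are straightforward to compute. The sketch below is a plain-Python illustration with hypothetical function names `rmse` and `q2`; for a cross-validated Q2 the predicted values passed in should be the predictions obtained for the left-out molecules. The example call uses the first five rows of Table 6 (experimental versus SVR values) purely as sample input.

```python
import math

def rmse(y_obs, y_pred):
    """Root-mean-square error, Eq. (2)."""
    n = len(y_obs)
    return math.sqrt(sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred)) / n)

def q2(y_obs, y_pred):
    """Q2 = 1 - PRESS / total sum of squares about the mean, Eq. (3)."""
    y_mean = sum(y_obs) / len(y_obs)
    press = sum((yp - yo) ** 2 for yo, yp in zip(y_obs, y_pred))
    total = sum((yo - y_mean) ** 2 for yo in y_obs)
    return 1.0 - press / total

# First five test-set rows of Table 6: experimental and SVR-calculated log IC50.
exp_values = [2.11, 2.24, 2.98, 2.68, 3.29]
svr_values = [2.62, 2.46, 2.81, 2.86, 2.74]
print(rmse(exp_values, svr_values), q2(exp_values, svr_values))
```

A perfect model gives RMSE = 0 and Q2 = 1, while a model that always predicts the mean of the observed values gives Q2 = 0.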
The purpose of implementing this model was to determine a linear relationship between the descriptors and the anti-Alzheimer activity of triazolyl thiophene derivatives. The statistical parameters of this model for the training and test sets are shown in Table 7. Besides demonstrating statistical significance, QSAR models should also provide useful chemical insight into the mechanism of inhibitory activity. For this reason, an interpretation of the QSAR results is provided below. By interpreting the descriptors contained in the model, it is possible to gain some insight into the factors related to the inhibitory activity of triazolyl thiophene inhibitors.

IC1 stands for information content index (neighborhood symmetry of order 1) [19]. IC1 is a measure of structural complexity per vertex in the multigraph of the molecule and is related to its equivalent atoms. Two atoms (vertices of the multigraph) are topologically equivalent when the corresponding neighborhoods of the rth order (in this case 1) are the same. The equivalence classes are then formed by equivalent atoms [20]. According to the model, it is favorable to increase the value of IC1 to improve the activity of the molecule.

Four of the eight selected descriptors belong to the 2D autocorrelation descriptors: GATS2e, GATS4p, MATS2e, and MATS2p. The 2D autocorrelation descriptors have been successfully employed by Fernandez et al. [21,22]. In these descriptors, the atoms represent a set of discrete points in space, and the atomic property and function are evaluated at those points. The symbol for each autocorrelation descriptor is followed by two indices, d and w, where d represents the lag and w represents the weight. For example, GATS2e is the Geary autocorrelation descriptor of lag 2 weighted by electronegativity. The lag is defined as the topological distance between pairs of atoms.
The topological distance between a pair of atoms (i, j) is given by the ij entry of the topological level matrix. The lag can take any value from the set {0, 1, 2, 3, 4, 5, 6, 7, 8}. The weight can be m (relative atomic mass), p (polarizability), e (Sanderson electronegativity), or v (van der Waals volume). Relative mass is defined as the ratio of the atomic mass of an atom to that of carbon. Similarly, the other three weights p, e, and v are scaled by their corresponding values for carbon. The coefficient of GATS4p is negative while that of GATS2e is positive, which indicates that log IC50 is inversely related to GATS4p and directly related to GATS2e. The physico-chemical property (weight) is relative polarizability for GATS4p and Sanderson electronegativity for GATS2e. The variables MATS2e and MATS2p belong to the Moran autocorrelation descriptors, which are calculated from the molecular graph and yield values in the interval [-1, 1] [23]. MATS2e (Moran autocorrelation - lag 2 / weighted by atomic Sanderson electronegativities) and MATS2p (Moran autocorrelation - lag 2 / weighted by atomic polarizabilities) show a positive correlation with the activity. A positive autocorrelation corresponds to positive values of the descriptor, whereas a negative autocorrelation produces negative values.

3D-MoRSE descriptors are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical scattering curves [24,25]. One of the eight independent variables of the predictive model, Mor03m, belongs to this family of descriptors. The variety of atomic properties weighting the descriptors and the range of scattering angles encode additional information about the 3D structure of the molecules in the data set.
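The Moran and Geary autocorrelation descriptors discussed above can be illustrated on a small molecular graph. The sketch below follows the commonly cited textbook definitions (Moran: average product of mean-centered weights over atom pairs at the given lag, divided by the variance with denominator A; Geary: half the average squared difference over the same pairs, divided by the variance with denominator A - 1). It is an assumption that Dragon uses exactly these normalizations; the function names, the four-membered ring and the atomic weights are hypothetical, and returning 0 when no atom pair exists at the requested lag (or when all weights are equal) is a convention chosen for this example.

```python
from collections import deque

def topological_distances(adjacency):
    """All-pairs shortest path lengths (number of bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if dist[start][neighbor] is None:
                    dist[start][neighbor] = dist[start][atom] + 1
                    queue.append(neighbor)
    return dist

def moran_geary(adjacency, weights, lag):
    """Return (Moran, Geary) autocorrelation for the given topological lag."""
    n = len(weights)
    mean = sum(weights) / n
    dist = topological_distances(adjacency)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and dist[i][j] == lag]
    variance_sum = sum((w - mean) ** 2 for w in weights)
    if not pairs or variance_sum == 0:
        return 0.0, 0.0  # convention assumed for this sketch
    moran_num = sum((weights[i] - mean) * (weights[j] - mean)
                    for i, j in pairs) / len(pairs)
    geary_num = sum((weights[i] - weights[j]) ** 2
                    for i, j in pairs) / (2 * len(pairs))
    return moran_num / (variance_sum / n), geary_num / (variance_sum / (n - 1))

# Hypothetical four-membered ring with alternating atomic weights.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
weights = [1.0, -1.0, 1.0, -1.0]
print(moran_geary(ring, weights, lag=1))  # bonded atoms differ: Moran is -1
print(moran_geary(ring, weights, lag=2))  # atoms two bonds apart match: Moran is +1
```

In this toy ring, neighbors always carry opposite weights, so the lag-1 Moran value is negative, while atoms two bonds apart carry identical weights and give a positive lag-2 value, in line with the sign interpretation given in the text.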
In order to examine the relative importance and the contribution of each descriptor in the model, the mean effect (MF) was calculated for each descriptor using the equation below:

$$ MF_j = \frac{\beta_j \sum_{i=1}^{n} d_{ij}}{\sum_{j=1}^{m} \beta_j \sum_{i=1}^{n} d_{ij}} \tag{4} $$

where $MF_j$ is the mean effect of descriptor j, $\beta_j$ is the coefficient of descriptor j, $d_{ij}$ is the value of descriptor j for molecule i, m is the number of descriptors in the model, and n is the number of molecules. The MF value indicates the relative importance of a descriptor compared with the other descriptors in the model. Figure 2 shows the mean effect of each variable in the MLR model. As can be seen from this figure, IC1, GATS2e and MATS2e are the most important parameters affecting the biological activity of the molecules, and log IC50 increases as these descriptors increase.

Fig. 2 ― Mean effects of descriptors in MLR model

To further analyze the stability of the model, it was tested for chance correlation. The dependent variable vector (log IC50) was randomly shuffled and a new QSAR model was developed using the original independent variable matrix. The new QSAR models are expected to have low R values and high RMSE values. Several random shuffles of the Y vector were performed and the results are shown in Table 8. The R and RMSE values indicate that the results obtained for the MLR model are not due to chance correlation or to structural dependency of the training set. The predicted MLR values of log IC50 are plotted versus their experimental values in Fig. 3.

For the sake of comparison, a PLS analysis was also performed using all 256 variables. Tables 5 and 6 show the results obtained using the PLS model. The statistical parameters of the PLS model for the training and test sets are shown in Table 7. This table shows that the PLS model performs similarly to the MLR model on the training set and better on the test set (RMSE 0.235 versus 0.311; R 0.830 versus 0.490).
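Returning to the mean effect of Eq. (4), the calculation can be sketched as follows. The function name `mean_effects`, the two coefficients and the small descriptor matrix are hypothetical and serve only to show the arithmetic; the descriptor values of the real data set are not listed in the paper, so Fig. 2 cannot be reproduced from this example.

```python
def mean_effects(coefficients, descriptor_matrix):
    """Mean effect of each descriptor, Eq. (4).

    coefficients: list of regression coefficients, one per descriptor.
    descriptor_matrix: list of rows (one per molecule), each a list of
    descriptor values in the same order as the coefficients.
    """
    column_sums = [sum(row[j] for row in descriptor_matrix)
                   for j in range(len(coefficients))]
    contributions = [b * s for b, s in zip(coefficients, column_sums)]
    total = sum(contributions)
    return [c / total for c in contributions]

# Hypothetical model with two descriptors and two molecules.
betas = [2.0, -1.0]
descriptors = [[1.0, 1.0],
               [2.0, 1.0]]
print(mean_effects(betas, descriptors))  # [1.5, -0.5]
```

By construction the mean effects sum to 1, and the sign of each value shows whether the descriptor pushes the predicted log IC50 up or down relative to the other descriptors.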
The predicted PLS values of log IC50 are plotted versus their experimental values in Fig. 4.

Table 8 ― Regression coefficient (R) and RMSE values for Y-randomization tests

Model No.   R       RMSE
1           0.120   0.352
2           0.224   0.245
3           0.192   0.420
4           0.088   0.521
5           0.095   0.640
6           0.056   0.410
7           0.074   0.520
8           0.105   0.397
9           0.231   0.860
10          0.151   0.731

Fig. 3 ― Plot of predicted log IC50 versus experimental values using MLR model

Fig. 4 ― Plot of predicted log IC50 versus experimental values using PLS model

Fig. 5 ― Plot of predicted log IC50 versus experimental values using SVR model

As a second step, we investigated the nonlinear characteristics of the activity parameter using a nonlinear modeling method. The descriptors selected by MLR were used to construct a nonlinear model with the SVM technique. In SVM modeling, the kernel function, which represents the sample distribution in the mapping space, has to be determined first. In this work, the RBF (radial basis function) kernel was chosen because of its good general performance. The next step in the construction of the SVM model was the optimization of its parameters, ε and C. The SVM parameters were optimized by systematically changing their values in the training step and calculating the RMSE of the model. The ε-insensitivity prevents the entire training set from meeting the boundary conditions and so allows for sparsity in the solution of the dual formulation. Choosing an appropriate value of ε is therefore a critical step. To find an optimal value for ε, the RMSE of SVM models with different ε values was calculated. The optimal value of ε was 0.05. The other parameter is the regularization parameter C, which controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient stress will be placed on fitting the training data.
On the other hand, if C is too large, the SVM model will overfit the training data. To find an optimal value of C, the RMSE of SVM models with different C values was calculated. The results showed that the variation of RMSE for C > 20 is small; therefore C = 30 was selected as the optimal value. After optimization of the SVM parameters, the model was used to calculate the log IC50 of the training and test sets. The statistical parameters of this model are RMSE = 0.046, R = 0.996 and F = 1192 for the training set, and RMSE = 0.26, R = 0.804 and F = 61.95 for the test set. A leave-one-out cross-validation test was also carried out to evaluate the predictive power of the SVM model, giving Q2 = R2cv = 0.797. The predicted SVM values of log IC50 are plotted versus their experimental values in Fig. 5. The correlation coefficient of this plot indicates the reliability of the SVM model. For comparison, the statistical parameters of the models are shown in Table 7. It can be seen from Table 7 that the R value improved from 0.890 for the MLR model to 0.996 for the SVM model for the training set. This means that the nonlinear model is able to account for about 99.2% of the variance (R2 = 0.996^2 ≈ 0.992) of the anti-Alzheimer activity of triazolyl thiophenes in the training set.

Conclusion
In the present work, QSAR models based on the MLR, PLS and SVM modeling techniques have been developed for predicting the inhibitory activity of methyl linked cyclohexyl thiophenes with triazole from their molecular structure. The results obtained using stepwise multiple linear regression show that the most significant descriptors in the MLR model can be categorized as topological, two-dimensional autocorrelation, three-dimensional MoRSE and charge index descriptors. Four of the eight selected descriptors are weighted by electronegativity and polarizability. This indicates that the two physico-chemical properties electronegativity and polarizability are very important in the design of new drugs.
For comparison, the three methods MLR, PLS and SVM were used to construct linear and nonlinear quantitative relationships for the data set. The results show that the SVM model gives the best fit, with the highest R and the lowest RMSE for the training set (Table 7). It can therefore reasonably be concluded that the proposed model can be expected to predict the activity of new cyclohexyl thiophene derivatives, or of other compounds of this class whose experimental values are unknown.

References
1 Ahn J S, Radhakrishnan M L, Mapelli M, Choi S, Tidor B, Cuny G D, Musacchio A, Yeh L A & Kosik K S, Chemistry & Biology, 12 (2005) 811.
2 Mapelli M & Musacchio A, Neurosignals, 12 (2003) 164.
3 Mapelli M, Massimiliano L, Crovace C, Seeliger M A, Tsai L H, Meijer L & Musacchio A, J Med Chem, 48 (2005) 671.
4 Gupta A & Tsai L H, Neurosignals, 12 (2003) 173.
5 Van De Waterbeemd H, The history of drug research: From Hansch to the present, (Wiley Online Library) 11 (1992) 200.
6 Van De Waterbeemd H, Nature Reviews Drug Discovery, 2 (2003) 192.
7 Shiradkar M, Thomas J, Kanase V & Dighe R, Euro J Med Chem, 46 (2011) 2066.
8 DRAGON software, Version 3 (2003).
9 HYPER CHEM software, Version 7.0 (2002).
10 Afantitis A, Melagraki G, Sarimveis H, Koutentis P A, Igglessi-Markopoulou O & Kollias G, Mol Div, 14 (2010) 225.
11 SPSS software, Version 16 (2008).
12 MINITAB software, Version 15.1 (2008).
13 Vapnik V N, The nature of statistical learning theory, (Springer-Verlag New York Inc), 2000.
14 Burges C J C, Data Mining and Knowledge Discovery, 2 (1998) 121.
15 Basak D, Pal S & Patranabis D, Neural Information Processing - Letters and Reviews, 11 (2007) 203.
16 Vapnik V N, Golowich S & Smola A, Advances in Neural Information Processing Systems, MIT Press, 1997.
17 Fatemi M H & Gharaghani S, Bio Med Chem, 15 (2007) 7746.
18 CLEMENTINE software, Version 12.0 (2008).
19 Bonchev D, Information theoretic indices for characterization of chemical structures, (Research Studies Press, Chichester, UK), 1983.
20 Todeschini R & Consonni V, Handbook of Molecular Descriptors, (Wiley-VCH, Weinheim), 2000.
21 Fernandez M, Tundidor-Camba A & Caballero J, J Chem Info Mod, 45 (2005) 1884.
22 Caballero J, Tundidor-Camba A & Fernández M, QSAR & Comb Sci, 26 (2006) 27.
23 Moreau G & Broto P, Nouv J Chim, 4 (1980) 359.
24 Gasteiger J, Sadowski J, Schuur J, Selzer P, Steinhauer L & Steinhauer V, J Chem Info Comp Sci, 36 (1996) 1030.
25 Schuur J H, Selzer P & Gasteiger J, J Chem Info Comp Sci, 36 (1996) 334.