Indian Journal of Chemical Technology
Vol. 23, January 2016, pp. 9-21
Application of support vector machine in QSAR study of triazolyl thiophenes as
cyclin dependent kinase-5 inhibitors for their anti-alzheimer activity
Zahra Garkani-Nejad1,* & Abouzar Ghanbari2
1 Chemistry Department, Faculty of Science, Shahid Bahonar University, Kerman, Iran
2 Chemistry Department, Faculty of Science, Vali-e-Asr University, Rafsanjan, Iran
E-mail: [email protected]
Received 7 November 2013; accepted 13 September 2015
Quantitative structure-activity relationship (QSAR) models are mathematical equations that relate
chemical structures to biological activities. A series of triazolyl thiophenes acting as cyclin dependent kinase 5
(cdk5/p25) inhibitors has been selected to establish QSAR models. In this work, several chemometric methods are applied
to model and predict the anti-Alzheimer activity of these compounds using descriptors calculated from the
molecular structures. First, the stepwise multiple linear regression (MLR) method is used to select the descriptors
responsible for the anti-Alzheimer activity of these compounds. Then support vector regression (SVR) and partial least
squares (PLS) are used to construct nonlinear and linear quantitative structure–activity relationship models, respectively.
The results demonstrate that the SVR model offers powerful prediction capabilities.
Keywords: Cdk5/p25 inhibitors, Triazolyl thiophene, Alzheimer disease, Quantitative structure-activity relationship
(QSAR), Support vector regression (SVR)
Cyclin dependent kinase 5 (CDK5) plays an essential
role in the development of the central nervous system
during mammalian embryogenesis1. In the adult,
CDK5 is required for the maintenance of neuronal
architecture. Its deregulation has profound cytotoxic
effects and has been implicated in the development of
neurodegenerative diseases such as Alzheimer’s
disease and amyotrophic lateral sclerosis2. CDK5 is a
member of a family of proline-directed
serine/threonine kinases3. It
regulates a variety of processes in developing adult
neurons. CDK5 plays a central role in neuronal
migration during the development of the central
nervous system4. Excessive upregulation of CDK5 by
the truncated activators contributes to neurodegeneration
by altering the phosphorylation state of
cytosolic and cytoskeletal proteins, and increased
CDK5 activity has been implicated in Alzheimer’s
disease (AD), amyotrophic lateral sclerosis (ALS),
Parkinson’s disease, Niemann-Pick type C disease,
and ischemia. Postmortem brain analysis of AD
patients reveals extensive formation of neurofibrillary
tau protein tangles and amyloid plaques. The
serine/threonine kinase CDK5, along with its cofactor
p25, is thought to hyperphosphorylate tau,
leading to the formation of paired helical filaments
and the deposition of cytotoxic neurofibrillary tangles, and
is thus implicated in neurodegenerative disorders such
as Alzheimer’s disease, Parkinson’s disease, stroke, and
Huntington’s disease. CDK5 also phosphorylates
dopamine- and cyclic AMP-regulated phosphoprotein
(DARPP-32) at threonine 75, indicating its role
in dopaminergic neurotransmission. Inhibition of the
anomalous CDK5/p25 complex is, therefore, a viable
approach for treating Alzheimer’s disease by preventing
tau hyperphosphorylation and neurofibrillary tangle
formation.
QSAR studies are useful techniques in the drug
design process5. QSAR models that are
able to predict the biological activities of new
compounds are invaluable tools for increasing the
efficiency of new syntheses6. In general, QSAR
modeling involves dataset preparation, generation of
molecular descriptors, derivation of statistical models,
model training and validation using a training dataset,
and model testing on a test dataset.
The main goal of the present work was to develop
different models using linear and nonlinear
chemometric approaches for predicting the
anti-Alzheimer activity of triazolyl thiophene derivatives.
The stepwise multiple linear regression (MLR) method
was used to select the descriptors responsible
for the anti-Alzheimer activity of these compounds.
Then support vector regression (SVR) and partial
least squares (PLS) were used to construct
nonlinear and linear quantitative structure–activity
relationship models. Finally, the results obtained using
the three methods were compared.
Experimental Section
Data set
A set of 124 methyl-linked cyclohexyl thiophenes
with triazole, acting as CDK5 inhibitors, together with their
experimental IC50 values, was collected from the literature7. The
IC50 values were converted to the corresponding log
IC50 values and used as the dependent variable. The log IC50
values provide a broad and homogeneous data set for the
QSAR study. The QSAR models were generated
using a training set of 88 molecules and a test
set of 36 molecules. The training and test sets
were selected randomly such that a wide range of
activity in the data set was included. The structures of the
different groups of triazolyl thiophenes are shown in
Tables 1 and 2.
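The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the fixed random seed and the purely random shuffle are assumptions, and the sketch does not enforce the stated requirement that both sets cover a wide activity range.

```python
import math
import random

def to_log_ic50(ic50_values):
    """Convert raw IC50 values to log10(IC50), the dependent variable."""
    return [math.log10(v) for v in ic50_values]

def random_split(n_total=124, n_train=88, seed=0):
    """Randomly assign compound numbers 1..n_total to training and test sets.

    The seed is a hypothetical choice that only makes the split repeatable.
    """
    rng = random.Random(seed)
    ids = list(range(1, n_total + 1))
    rng.shuffle(ids)
    return sorted(ids[:n_train]), sorted(ids[n_train:])

train_ids, test_ids = random_split()
print(len(train_ids), len(test_ids))  # 88 36
```

In practice one would inspect the activity range of each set after splitting and reshuffle if either set misses the extremes.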
Table 1 ― Different groups of triazolyl thiophenes used in modeling
[Columns: No., Compound; structure drawings of compound groups 1 to 9]
NEJAD & GHANBARI: QSAR STUDY OF TRIAZOLYL THIOPHENES AS CYCLIN DEPENDENT KINASE-5 INHIBITORS
Table 2 ― Studied triazolyl thiophenes with different substituents
No.  Compd.  R1            R2         R3      Log IC50
1    1a      NHCOCH2Cl     -          -       1.63
2    1b      NHCOCH3       -          -       2.60
3    1c      NHCOC6H5      -          -       2.80
4    1d      NHCH2CH2COOH  -          -       2.43
5    2a      NHCOCH2Cl     -4-ClC6H4  -       2.54
6    2b      NHCOCH3       -4-ClC6H4  -       2.66
7    2c      NHCOC6H5      -4-ClC6H4  -       2.26
8    2d      NHCH2CH2COOH  -4-ClC6H4  -       2.66
9    2e      NHCOCH2Cl     -C6H5      -       2.88
10   2f      NHCOCH3       -C6H5      -       2.76
11   2g      NHCOC6H5      -C6H5      -       2.95
12   2h      NHCH2CH2COOH  -C6H5      -       2.83
13   2i      NHCOCH2Cl     -CH3       -       2.84
14   2j      NHCOCH3       -CH3       -       2.77
15   2k      NHCOC6H5      -CH3       -       2.79
16   2l      NHCH2CH2COOH  -CH3       -       2.64
17   3a      NHCOCH2Cl     -4-ClC6H4  -       2.76
18   3b      NHCOCH3       -4-ClC6H4  -       2.62
19   3c      NHCOC6H5      -4-ClC6H4  -       2.74
20   3d      NHCH2CH2COOH  -4-ClC6H4  -       2.63
21   3e      NHCOCH2Cl     -C6H5      -       2.82
22   3f      NHCOCH3       -C6H5      -       2.59
23   3g      NHCOC6H5      -C6H5      -       2.69
24   3h      NHCH2CH2COOH  -C6H5      -       3.69
25   3i      NHCOCH2Cl     -CH3       -       1.79
26   3j      NHCOCH3       -CH3       -       2.66
27   3k      NHCOC6H5      -CH3       -       2.75
28   3l      NHCH2CH2COOH  -CH3       -       2.81
29   4a      NHCOCH2Cl     -4-ClC6H4  -       1.76
30   4b      NHCOCH3       -4-ClC6H4  -       2.62
31   4c      NHCOC6H5      -4-ClC6H4  -       2.74
32   4d      NHCH2CH2COOH  -4-ClC6H4  -       2.62
33   4e      NHCOCH2Cl     -C6H5      -       2.82
34   4f      NHCOCH3       -C6H5      -       2.60
35   4g      NHCOC6H5      -C6H5      -       2.69
36   4h      NHCH2CH2COOH  -C6H5      -       3.69
37   4i      NHCOCH2Cl     -CH3       -       1.78
38   4j      NHCOCH3       -CH3       -       2.66
39   4k      NHCOC6H5      -CH3       -       2.75
40   4l      NHCH2CH2COOH  -CH3       -       2.81
41   5a      NHCOCH2Cl     -4-ClC6H4  -       2.65
42   5b      NHCOCH3       -4-ClC6H4  -       2.59
43   5c      NHCOC6H5      -4-ClC6H4  -       2.47
44   5d      NHCH2CH2COOH  -4-ClC6H4  -       2.73
45   5e      NHCOCH2Cl     -C6H5      -       2.83
46   5f      NHCOCH3       -C6H5      -       2.76
47   5g      NHCOC6H5      -C6H5      -       2.55
Table 2 ― Studied triazolyl thiophenes with different substituents (contd.)
No.  Compd.  R1            R2         R3      Log IC50
48   5h      NHCH2CH2COOH  -C6H5      -       2.81
49   5i      NHCOCH2Cl     -CH3       -       2.84
50   5j      NHCOCH3       -CH3       -       2.59
51   5k      NHCOC6H5      -CH3       -       2.75
52   5l      NHCH2CH2COOH  -CH3       -       2.78
53   6a      NHCOCH2Cl     -4-ClC6H4  -       1.59
54   6b      NHCOCH3       -4-ClC6H4  -       1.81
55   6c      NHCOC6H5      -4-ClC6H4  -       2.37
56   6d      NHCH2CH2COOH  -4-ClC6H4  -       2.30
57   6e      NHCOCH2Cl     -C6H5      -       1.57
58   6f      NHCOCH3       -C6H5      -       2.28
59   6g      NHCOC6H5      -C6H5      -       2.42
60   6h      NHCH2CH2COOH  -C6H5      -       2.66
61   6i      NHCOCH2Cl     -CH3       -       1.79
62   6j      NHCOCH3       -CH3       -       2.40
63   6k      NHCOC6H5      -CH3       -       2.26
64   6l      NHCH2CH2COOH  -CH3       -       2.97
65   7a      NHCOCH2Cl     -4-ClC6H4  -CH3    2.67
66   7aa     NHCOC6H5      -CH3       -CH3    2.81
67   7ab     NHCH2CH2COOH  -CH3       -CH3    2.56
68   7ac     NHCOCH2Cl     -CH3       -C6H5   2.56
69   7ad     NHCOCH3       -CH3       -C6H5   2.64
70   7ae     NHCOC6H5      -CH3       -C6H5   2.61
71   7af     NHCH2CH2COOH  -CH3       -C6H5   2.68
72   7ag     NHCOCH2Cl     -CH3       -CH2Cl  2.66
73   7ah     NHCOCH3       -CH3       -CH2Cl  2.68
74   7ai     NHCOC6H5      -CH3       -CH2Cl  2.55
75   7aj     NHCH2CH2COOH  -CH3       -CH2Cl  2.42
76   7b      NHCOCH3       -4-ClC6H4  -CH3    2.81
77   7c      NHCOC6H5      -4-ClC6H4  -CH3    2.50
78   7d      NHCH2CH2COOH  -4-ClC6H4  -CH3    2.55
79   7e      NHCOCH2Cl     -4-ClC6H4  -C6H5   2.79
80   7f      NHCOCH3       -4-ClC6H4  -C6H5   2.72
81   7g      NHCOC6H5      -4-ClC6H4  -C6H5   2.68
82   7h      NHCH2CH2COOH  -4-ClC6H4  -C6H5   2.43
83   7i      NHCOCH2Cl     -4-ClC6H4  -CH2Cl  2.36
84   7j      NHCOCH3       -4-ClC6H4  -CH2Cl  2.41
85   7k      NHCOC6H5      -4-ClC6H4  -CH2Cl  2.53
86   7l      NHCH2CH2COOH  -4-ClC6H4  -CH2Cl  2.27
87   7m      NHCOCH2Cl     -C6H5      -CH3    2.54
88   7n      NHCOCH3       -C6H5      -CH3    2.74
89   7o      NHCOC6H5      -C6H5      -CH3    2.75
90   7p      NHCH2CH2COOH  -C6H5      -CH3    2.67
91   7q      NHCOCH2Cl     -C6H5      -C6H5   2.67
92   7r      NHCOCH3       -C6H5      -C6H5   2.62
93   7s      NHCOC6H5      -C6H5      -C6H5   2.67
94   7t      NHCH2CH2COOH  -C6H5      -C6H5   2.64
Table 2 ― Studied triazolyl thiophenes with different substituents (contd.)
No.  Compd.  R1            R2         R3      Log IC50
95   7u      NHCOCH2Cl     -C6H5      -CH2Cl  2.73
96   7v      NHCOCH3       -C6H5      -CH2Cl  2.81
97   7w      NHCOC6H5      -C6H5      -CH2Cl  2.74
98   7x      NHCH2CH2COOH  -C6H5      -CH2Cl  2.56
99   7y      NHCOCH2Cl     -CH3       -CH3    2.66
100  7z      NHCOCH3       -CH3       -CH3    2.51
101  8a      NHCOCH2Cl     -4-ClC6H4  -       1.54
102  8b      NHCOCH3       -4-ClC6H4  -       2.33
103  8c      NHCOC6H5      -4-ClC6H4  -       2.34
104  8d      NHCH2CH2COOH  -4-ClC6H4  -       2.41
105  8e      NHCOCH2Cl     -C6H5      -       1.51
106  8f      NHCOCH3       -C6H5      -       2.59
107  8g      NHCOC6H5      -C6H5      -       2.64
108  8h      NHCH2CH2COOH  -C6H5      -       2.43
109  8i      NHCOCH2Cl     -CH3       -       2.53
110  8j      NHCOCH3       -CH3       -       2.66
111  8k      NHCOC6H5      -CH3       -       2.67
112  8l      NHCH2CH2COOH  -CH3       -       2.73
113  9a      NHCOCH2Cl     -4-ClC6H4  -       3.46
114  9b      NHCOCH3       -4-ClC6H4  -       3.80
115  9c      NHCOC6H5      -4-ClC6H4  -       3.52
116  9d      NHCH2CH2COOH  -4-ClC6H4  -       3.77
117  9e      NHCOCH2Cl     -C6H5      -       3.80
118  9f      NHCOCH3       -C6H5      -       3.81
119  9g      NHCOC6H5      -C6H5      -       3.74
120  9h      NHCH2CH2COOH  -C6H5      -       3.70
121  9i      NHCOCH2Cl     -CH3       -       3.66
122  9j      NHCOCH3       -CH3       -       3.70
123  9k      NHCOC6H5      -CH3       -       3.79
124  9l      NHCH2CH2COOH  -CH3       -       3.79
Descriptors calculation and selection
The first step in obtaining a QSAR model was to
encode the structural features of the molecules as
molecular descriptors. The molecular
descriptors used to search for the best model were
calculated using the Dragon program8 on the basis of
the minimum-energy molecular geometries, which were
optimized in the Hyperchem package9 with the
AM1 semi-empirical method. After the calculation of
the molecular descriptors, those that stayed constant
for all molecules, had more than 90% equal values, or
had fewer than 10% non-zero values were excluded. In
this study, a nonlinear feature mapping technique
(SVR) was used for constructing the QSAR
models. Since this method cannot itself select
the more significant descriptors from the pool of
calculated molecular descriptors, the stepwise multiple
linear regression (MLR) variable subset selection
method was used for the selection of the most
relevant descriptors from the pool of 256 descriptors.
These descriptors would be used as inputs of the
SVR.
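The descriptor pre-filtering step described in this section can be sketched as follows. The thresholds follow one reading of the text (drop constant columns, columns where more than 90% of the values are equal, and columns with fewer than 10% non-zero values); the function names and the toy descriptor table are hypothetical and are not taken from the Dragon output.

```python
def keep_descriptor(values, max_equal_frac=0.90, min_nonzero_frac=0.10):
    """Return True if a descriptor column passes the pre-filtering rules."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    if len(counts) == 1:                        # constant for all molecules
        return False
    if max(counts.values()) / n > max_equal_frac:   # >90% equal values
        return False
    nonzero = sum(1 for v in values if v != 0)
    if nonzero / n < min_nonzero_frac:          # <10% non-zero values
        return False
    return True

def filter_descriptors(table):
    """table maps descriptor name -> list of values (one per molecule)."""
    return {name: col for name, col in table.items() if keep_descriptor(col)}

# Toy data for 20 hypothetical molecules.
demo = {
    "constant": [1.0] * 20,
    "mostly_zero": [0.0] * 19 + [2.5],
    "varied": [0.1 * i for i in range(20)],
}
kept = filter_descriptors(demo)
print(sorted(kept))  # ['varied']
```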
Regression analysis
Multiple linear regression (MLR)
Multiple linear regression is used to develop the
linear model of the property of interest10, which takes
the form below:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \qquad (1)$$
where Y is the property, that is, the dependent
variable, X1–Xn represent the specific descriptors,
while b1–bn represent the coefficients of those
descriptors, and b0 is the intercept of the equation.
In this study, a stepwise multiple linear regression
procedure was used for model development. For
regression analysis, the data set was divided into
training and test sets. The molecules
included in these sets were selected randomly. The
training set, consisting of 88 molecules, was used for
model generation with the SPSS software
package11. The test set, consisting of 36 molecules, was
used to evaluate the generated models. The stepwise
multiple regression procedure yields many MLR models,
from which the best one has to be chosen.
The best multiple linear
regression (MLR) model is the one that has a high R
value, a low standard error, the smallest number of
descriptors and a high predictive ability. The best
selected model is presented in Table 3. Eight
descriptors were chosen as the optimum
number of parameters in this model. The correlation
between the selected descriptors is shown in Table 4. As
can be seen from the correlation matrix (Table 4),
there is no significant correlation between the
selected descriptors. Calculated values of log IC50
for the training and test sets using the MLR model are
shown in Tables 5 and 6.
Partial least square (PLS)
PLS is a linear modeling technique in which
information in the descriptor matrix X is projected
onto a small number of underlying ("latent")
variables called PLS components. The matrix Y is
simultaneously used in estimating the "latent"
variables in X that will be most relevant for predicting
the Y variables.
In the present work, the PLS modeling
was performed using MINITAB12. For regression
analysis, the data set was separated into two groups:
training and test sets (Tables 5 and 6). The number of
significant factors for the PLS algorithm was
determined using the cross-validation method. With
cross-validation, one sample was left out of the
calibration (leave-one-out) and used for prediction. The
process was repeated so that each of the samples was
left out once. The predicted values of the left-out
samples were then compared to the observed values
using the prediction error sum of squares (PRESS). The
PRESS obtained in the cross-validation was
calculated each time that a new principal component
Table 3 ― Details of the descriptors in the MLR model
Symbol   Descriptor type             Regression coefficient  Standard deviation  Descriptor definition
MATS2e   2D-autocorrelation          20.198                  ±1.665              Moran autocorrelation - lag 2 / weighted by atomic Sanderson electronegativities
IC1      Topological                 1.822                   ±0.260              Information content index (neighborhood symmetry of 1-order)
GATS4p   2D-autocorrelation          -0.966                  ±1.324              Geary autocorrelation - lag 4 / weighted by atomic polarizabilities
Mor03m   3D-MoRSE                    -0.04                   ±0.015              Signal 3 / weighted by atomic masses
GATS2e   2D-autocorrelation          8.940                   ±1.190              Geary autocorrelation - lag 2 / weighted by atomic Sanderson electronegativities
VEA2     Topological                 -8.182                  ±1.923              Average eigenvector coefficient sum from adjacency matrix
MATS2p   2D-autocorrelation          6.128                   ±1.298              Moran autocorrelation - lag 2 / weighted by atomic polarizabilities
JGT      Galvez topological indices  4.122                   ±1.281              Global topological charge index
Table 4 ― Correlation matrix of descriptors selected using MLR model
        MATS2e  IC1     GATS4p  Mor03m  GATS2e  VEA2    MATS2p  JGT
MATS2e  1
IC1     0.293   1
GATS4p  0.056   0.127   1
Mor03m  0.124   0.316   0.099   1
GATS2e  0.595   0.21    0.242   0.055   1
VEA2    0.269   0.168   0.080   0.497   0.047   1
MATS2p  0.367   0.050   0.319   0.302   0.067   0.525   1
JGT     0.106   0.559   0.071   0.321   0.203   0.438   0.538   1
Table 5 ― Experimental and predicted log IC50 values by MLR, PLS and SVR models for training set
No.   Exp.   Calc. MLR  Calc. PLS  Calc. SVR
1     1.63   1.80       1.87       1.68
2     2.59   2.64       2.37       2.54
3     2.79   2.61       2.60       2.75
5     2.53   2.29       2.45       2.49
6     2.66   2.72       2.63       2.71
7     2.26   2.31       2.58       2.31
8     2.65   2.64       2.54       2.65
9     2.87   2.74       2.61       2.82
10    2.75   2.99       2.89       2.79
11    2.94   2.94       2.82       2.90
12    2.82   2.63       2.66       2.83
14    2.76   2.77       2.81       2.71
17    2.53   2.33       3.05       2.58
18    2.56   2.59       2.45       2.61
19    2.79   2.58       2.58       2.74
20    2.21   2.46       2.30       2.26
21    2.82   2.73       2.69       2.77
23    2.84   2.95       2.73       2.79
24    2.76   2.92       2.91       2.71
25    2.92   2.70       3.10       2.87
26    2.92   2.98       2.96       2.87
29    1.76   2.1        2.15       1.81
30    2.61   2.46       2.47       2.56
31    2.74   2.42       2.69       2.69
32    2.62   2.43       2.51       2.57
33    2.81   2.34       2.46       2.77
34    2.59   2.76       2.69       2.64
37    1.78   2.38       2.45       1.83
38    2.66   2.74       2.57       2.61
40    2.80   2.68       2.73       2.75
41    2.64   2.43       2.60       2.60
42    2.58   2.59       2.56       2.64
44    2.73   2.52       2.66       2.68
45    2.83   2.67       2.73       2.78
47    2.54   2.91       2.81       2.59
50    2.58   2.74       2.72       2.64
51    2.75   2.76       2.95       2.80
53    1.59   1.85       1.77       1.64
54    1.80   2.06       2.16       1.85
55    2.37   2.27       2.35       2.42
57    1.56   2.26       2.08       1.61
58    2.28   2.37       2.15       2.23
59    2.42   2.22       2.31       2.44
62    2.39   2.21       2.23       2.45
64    2.96   2.40       2.39       2.92
65    2.67   2.77       2.46       2.62
66    2.80   2.82       2.58       2.75
67    2.56   2.81       2.53       2.61
Table 5 ― Experimental and predicted log IC50 values by MLR, PLS and SVR models for training set (contd.)
No.   Exp.   Calc. MLR  Calc. PLS  Calc. SVR
68    2.55   2.67       2.60       2.57
70    2.61   2.79       2.75       2.66
71    2.67   2.69       2.72       2.71
72    2.65   2.42       2.49       2.60
73    2.67   2.53       2.71       2.72
74    2.55   2.56       2.79       2.60
80    2.72   2.76       2.58       2.67
81    2.68   2.70       2.56       2.69
82    2.43   2.18       2.43       2.40
84    2.41   2.74       2.62       2.46
85    2.53   2.54       2.77       2.57
86    2.26   2.57       2.40       2.32
88    2.73   2.85       2.84       2.72
89    2.75   2.69       2.71       2.70
90    2.66   2.60       2.55       2.62
91    2.67   2.49       2.56       2.62
92    2.61   2.63       2.74       2.66
94    2.63   2.87       2.67       2.59
96    2.80   2.74       2.57       2.76
97    2.73   2.67       2.87       2.69
99    2.65   2.51       2.60       2.62
101   1.54   2.04       2.24       1.59
102   2.32   2.14       2.08       2.27
103   2.33   2.26       2.33       2.38
104   2.41   2.63       2.58       2.46
105   1.50   1.88       2.03       1.55
106   2.58   2.48       2.50       2.53
107   2.63   2.56       2.34       2.58
108   2.42   2.40       2.54       2.45
109   2.52   2.35       2.39       2.47
111   2.66   2.76       2.69       2.62
112   2.72   2.50       2.55       2.71
114   3.80   3.72       3.59       3.75
115   3.51   3.60       3.43       3.46
117   3.79   3.59       3.53       3.74
119   3.73   3.81       3.71       3.69
120   3.69   3.83       3.82       3.64
121   3.60   3.45       3.64       3.55
122   3.71   3.97       3.90       3.66
124   3.79   3.70       3.77       3.74
(PC) was added to the model. The optimum number
of PLS factors is the one that minimizes PRESS.
Figure 1 shows the plot of PRESS versus the number
of factors for the PLS model. The best PLS model
contained 5 factors. Calculated values of log IC50 for the
training and test sets using the PLS model are shown in
Tables 5 and 6.
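The leave-one-out PRESS criterion used to choose the number of PLS factors can be sketched generically. The example below is an illustration under stated assumptions: it does not implement PLS itself, and the two stand-in models (a mean predictor and a one-variable least-squares line) as well as the toy data are hypothetical. Only the cross-validation loop and the "pick the candidate with the smallest PRESS" rule mirror the text.

```python
def loo_press(fit, X, y):
    """Leave-one-out PRESS: fit(X_train, y_train) must return predict(x)."""
    press = 0.0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]      # leave sample i out
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train)
        press += (y[i] - predict(X[i])) ** 2
    return press

def fit_mean(xs, ys):
    """Stand-in model 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Stand-in model 2: one-variable ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

# Toy data lying exactly on y = 2x + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

candidates = {"mean": fit_mean, "line": fit_line}
press = {name: loo_press(fit, X, y) for name, fit in candidates.items()}
best = min(press, key=press.get)   # candidate with the smallest PRESS
print(best)
```

With a real PLS implementation, the candidates would be models with 1, 2, 3, ... latent factors instead of the two stand-ins.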
Support vector machine (SVM)
Support vector machine (SVM) was first
introduced by Vapnik13. There are two main
categories of support vector machines: support vector
classification (SVC) and support vector regression
(SVR). SVM is a learning system that uses a
high-dimensional feature space. Support vector regression
(SVR) is the most common application form of SVM14.
An overview of the basic ideas underlying support
vector (SV) machines for regression and function
estimation has been given by Basak and coworkers15.
In support vector regression (SVR), the basic idea
is to map the data X into a higher-dimensional feature
space F via a nonlinear mapping Φ and then to do
linear regression in this space16,17. Regression
approximation thus addresses the problem of
estimating a function from a given data set. SVR
has recently emerged as an alternative and highly
effective means of solving the nonlinear regression
problem. SVR has been quite successful in both
academic and industrial settings owing to its many
attractive features and promising generalization
performance.
Table 6 ― Experimental and predicted log IC50 values by MLR, PLS and SVR models for test set
No.   Exp.   Calc. MLR  Calc. PLS  Calc. SVR
4     2.11   2.43       2.38       2.62
13    2.24   2.84       2.67       2.46
15    2.98   2.79       3.06       2.81
16    2.68   2.64       2.21       2.86
22    3.29   2.81       3.18       2.74
27    3.45   2.89       3.20       2.73
28    2.44   2.86       2.85       2.73
35    2.79   3.69       2.68       2.77
36    2.85   1.78       2.89       2.69
39    2.88   2.75       2.97       2.76
43    2.59   2.47       2.60       2.61
46    2.81   2.76       2.71       2.70
48    2.75   2.81       2.80       2.72
49    2.32   2.84       2.59       2.51
52    2.62   2.86       2.71       2.77
56    2.34   2.30       2.24       2.42
60    2.48   2.66       2.42       2.56
61    1.96   1.78       1.91       2.15
63    2.44   2.25       2.29       2.44
69    2.74   2.64       2.77       2.78
75    2.66   2.42       2.64       2.72
76    2.77   2.81       2.72       2.68
77    2.48   2.50       2.38       2.48
78    2.67   2.55       2.53       2.55
79    2.14   2.79       2.59       2.42
83    2.41   2.36       2.35       2.26
87    2.82   2.54       2.83       2.61
93    2.46   2.67       2.39       2.62
95    2.57   2.72       2.85       2.45
98    2.83   2.56       2.69       2.55
100   2.62   2.51       2.56       2.65
110   2.37   2.66       2.50       2.69
113   3.23   3.46       3.34       3.35
116   3.50   3.77       3.34       3.06
118   4.12   3.81       3.76       3.85
123   3.93   3.79       3.91       3.44
In the present work, the SVR evaluations were
carried out using the SVM toolbox in the CLEMENTINE
software18. The data set was divided into two groups,
training and test sets (Tables 5 and 6), and the
descriptors entered in the MLR model were used as inputs.
Next, the kernel function, which represents the sample
distribution in the mapping space, had to be
determined. In this work, the RBF (radial basis
function) kernel was chosen. The RBF SVR method
fits a nonlinear function onto the data, again aiming
for maximum flatness. The RBF kernel is also often
called a Gaussian kernel, since the kernel function has
the same form as the Gaussian distribution function.
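For reference, the RBF (Gaussian) kernel mentioned above has the standard form K(x, z) = exp(-γ‖x − z‖²). The sketch below is a minimal illustration; the width parameter `gamma` and the example descriptor vectors are assumptions, since the text does not report the kernel width used in CLEMENTINE.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two descriptor vectors.

    gamma is a hypothetical width parameter; larger values make the
    kernel decay faster with distance.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two made-up descriptor vectors.
a = [0.2, 1.5, -0.3]
b = [0.1, 1.0, 0.4]
print(rbf_kernel(a, a))   # identical points give 1.0
print(rbf_kernel(a, b))   # decays toward 0 as points move apart
```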
The overall performance of SVM was evaluated in
terms of root-mean-square error which was calculated
from the following equation:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n_s} (y_i - y_0)^2}{n_s}} \qquad (2)$$
In this equation yi is the desired output, y0 is the
predicted value by model, and ns is the number of
molecules in data set.
The predictive power of the SVM models
developed on the selected training sets was estimated from
the prediction of an external test set and was also
examined by a cross-validation test, calculating the
statistical parameter Q2 as follows:
$$Q^2 = 1 - \frac{\sum (y_0 - y_i)^2}{\sum (y_i - y_m)^2} \qquad (3)$$
In the above expression, ym is the mean of the
dependent variable.
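The two validation statistics, RMSE (Eq. 2) and Q2 (Eq. 3), can be computed as in the following sketch. The function names are hypothetical; the sample values are the first five test-set entries of Table 6 (experimental and SVR-calculated log IC50), used only to show the call pattern.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error, Eq. (2)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def q2(observed, predicted):
    """Cross-validated Q2, Eq. (3): 1 - residual SS / total SS about the mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# First five test compounds of Table 6 (Exp. and Calc. SVR columns).
exp_vals = [2.11, 2.24, 2.98, 2.68, 3.29]
svr_vals = [2.62, 2.46, 2.81, 2.86, 2.74]
print(rmse(exp_vals, svr_vals))
print(q2(exp_vals, svr_vals))
```

For a true Q2, `predicted` should hold the leave-one-out predictions, i.e. each value predicted by a model trained without that compound.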
Fig. 1 ― PRESS versus the number of factors for the PLS model
Table 7 ― The statistical parameters of different constructed QSAR models
         Training                    Test
Model    RMSE    R       F           RMSE    R       F
SVR      0.046   0.996   1192.00     0.260   0.804   61.950
PLS      0.214   0.900   331.88      0.235   0.830   88.980
MLR      0.216   0.890   332.849     0.311   0.490   32.089
Calculated values of log IC50 for training and test
sets using SVR model are shown in Tables 5 and 6.
The statistical parameters of this model for training
and test sets are shown in Table 7.
Results and Discussion
The main objective of this research is to develop
QSAR models for predicting the biological activity of
triazolyl thiophene derivatives. In the first stage, a
multiple linear regression (MLR) model is developed. The
selected variables in this model are MATS2e,
MATS2p, GATS4p, GATS2e, IC1, Mor03m, VEA2,
and JGT. These descriptors can be categorized as
topological, two-dimensional autocorrelation,
3D-MoRSE and charge index descriptors. Details are
given in Table 3. Based on the correlation
matrix (Table 4), it can be concluded that there is no
significant correlation between the selected
descriptors. The collinearity threshold in QSAR/QSPR
studies is usually taken as 0.9, i.e. descriptors with
R > 0.9 are considered collinear.
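The collinearity check against the 0.9 threshold can be illustrated with the correlation values reported in Table 4. This is a small sketch: the variable names are hypothetical and the matrix is entered by hand as a lower triangle in the descriptor order of Table 4.

```python
names = ["MATS2e", "IC1", "GATS4p", "Mor03m", "GATS2e", "VEA2", "MATS2p", "JGT"]

# Lower triangle of the Table 4 correlation matrix (row i, columns 0..i).
lower = [
    [1],
    [0.293, 1],
    [0.056, 0.127, 1],
    [0.124, 0.316, 0.099, 1],
    [0.595, 0.21, 0.242, 0.055, 1],
    [0.269, 0.168, 0.080, 0.497, 0.047, 1],
    [0.367, 0.050, 0.319, 0.302, 0.067, 0.525, 1],
    [0.106, 0.559, 0.071, 0.321, 0.203, 0.438, 0.538, 1],
]

THRESHOLD = 0.9  # collinearity threshold quoted in the text

# Collect every off-diagonal pair with its correlation.
pairs = [
    (names[i], names[j], lower[i][j])
    for i in range(len(names))
    for j in range(i)
]
collinear = [p for p in pairs if abs(p[2]) > THRESHOLD]
strongest = max(pairs, key=lambda p: abs(p[2]))

print(collinear)   # [] : no pair exceeds the threshold
print(strongest)   # ('GATS2e', 'MATS2e', 0.595)
```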
The purpose of implementing this model was
to determine a linear relationship between the
descriptors and the anti-Alzheimer activity of
triazolyl thiophene derivatives. The statistical
parameters of this model for the training and test sets are
shown in Table 7.
Besides demonstrating statistical significance,
QSAR models should also provide useful chemical
insights for mechanism of inhibitory activity. For this
reason, an acceptable interpretation of the QSAR
results is provided below. By interpreting the
descriptors contained in the model, it is possible to
gain some insights into factors which are related to
the inhibitory activity of triazolyl thiophene
inhibitors.
IC1 stands for information content index
(neighborhood symmetry of order 1)19. IC1 is a
measure of structural complexity per vertex in the
multigraph of the molecules and is related to their
equivalent atoms. Two atoms (vertices of the
multigraph) are topologically equivalent when the
corresponding neighborhoods of the rth order (in this
case 1) are the same. The equivalence classes are then
formed by equivalent atoms20. According to the
model, it is favorable to increase the value of IC1 to
improve the activity of the molecule.
Four of the eight selected descriptors belong to the
2D autocorrelation descriptors: GATS2e, GATS4p,
MATS2e, and MATS2p. The 2D autocorrelation
descriptors have been successfully employed by
Fernandez et al.21,22. In these descriptors the atoms
represent a set of discrete points in space, and the
atomic property and function are evaluated at those
points. The symbol for each of the autocorrelation
descriptors is followed by two indices, d and w, where
d represents the lag and w represents the weight. For
example, GATS2e means the Geary autocorrelation
descriptor of lag 2 which is weighted by
electronegativity. The lag is defined as the topological
distance between pairs of atoms. The topological
distance between a pair of atoms (i, j) is given in the ij
entry in the topological level matrix. The lag can have
any value from the set {0, 1, 2, 3, 4, 5, 6, 7, 8}.
The weight can be m (relative atomic mass), p
(polarizability), e (Sanderson electronegativity), or v
(van der Waals volume). Relative mass is defined as
the ratio of the atomic mass of an atom to that of carbon.
Similarly, the other three weights p, e, and v are
scaled by their corresponding values for carbon.
GATS4p has a negative coefficient while GATS2e has a
positive one, which indicates that log IC50 is
inversely related to the GATS4p descriptor and
directly related to the GATS2e descriptor. The
physico-chemical property (weight) for GATS4p is
relative polarizability and for GATS2e is Sanderson
electronegativity.
Variables MATS2e and MATS2p belong to the
Moran autocorrelation descriptors, which are calculated from
the molecular graph and yield values in the interval
[-1, 1]23. MATS2e (Moran autocorrelation - lag
2 / weighted by atomic Sanderson electronegativities)
and MATS2p (Moran autocorrelation - lag 2 / weighted
by atomic polarizabilities) are positively
correlated with log IC50. A positive
autocorrelation corresponds to positive values of the
descriptor, whereas a negative autocorrelation
produces negative values.
3D-MoRSE descriptors are based on the idea of
obtaining information from 3D atomic coordinates by
the transform used in electron diffraction studies for
preparing theoretical scattering curves24,25. One of the
eight independent variables of the predictive model,
Mor03m, belongs to this family of descriptors. The
variety of atomic properties weighting the descriptors
and the range of scattering angles encode more
information about the 3D structure of the molecules in
the data.
In order to examine the relative importance, as well
as the contribution of each descriptor in the model,
the value of the mean effect (MF) was calculated for
each descriptor. This calculation was performed with
the equation below:
$$MF_j = \frac{\beta_j \sum_{i=1}^{n} d_{ij}}{\sum_{j}^{m} \beta_j \sum_{i}^{n} d_{ij}} \qquad (4)$$
MFj represents the mean effect for the considered
descriptor j, βj is the coefficient of the descriptor j, dij
stands for the value of the target descriptors for each
molecule and, eventually, m is the descriptor number
in the model. The MF value indicates the relative
importance of a descriptor, compared with the other
descriptors in the model.
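The mean effect of Eq. (4) can be computed as in the sketch below. It is an illustration only: the function name, the two coefficients and the small descriptor matrix are made-up values chosen so the arithmetic is easy to follow, not data from the MLR model.

```python
def mean_effects(coefficients, descriptor_matrix):
    """Mean effect MF_j of each descriptor, following Eq. (4).

    coefficients: regression coefficients beta_j, one per descriptor.
    descriptor_matrix: rows are molecules, columns are descriptors (d_ij).
    """
    m = len(coefficients)
    # beta_j times the column sum of descriptor j over all molecules
    weighted = [
        coefficients[j] * sum(row[j] for row in descriptor_matrix)
        for j in range(m)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical example: two descriptors, three molecules.
betas = [2.0, -1.0]
d = [
    [1.0, 1.0],
    [2.0, 1.0],
    [3.0, 2.0],
]
mf = mean_effects(betas, d)
print(mf)  # [1.5, -0.5]
```

By construction the mean effects sum to 1, and the sign of each MF value follows the sign of its weighted contribution.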
Figure 2 shows the mean effect of each variable in
the MLR model. As can be seen from this figure, IC1,
GATS2e and MATS2e are the most important
parameters affecting the biological activity of the
molecules, and log IC50 values increase as these
descriptors increase.
Fig. 2 ― Mean effects of descriptors in MLR model
To further analyze the model stability, the
obtained model was tested for chance correlation. The
dependent variable vector (log IC50) was randomly
shuffled and a new QSAR model was developed
using the original independent variable matrix. The
new QSAR model is expected to have a low R value
and a high RMSE value. Several random shuffles of the
Y vector were performed and the results are shown in
Table 8. The R and RMSE values indicate that the
results obtained for the MLR model are not due to a
chance correlation or structural dependency of the
training set. The predicted MLR values of log IC50
were plotted versus their experimental values in Fig. 3.
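The Y-randomization test can be sketched as follows. To keep the example self-contained it uses a single descriptor, for which the R of a least-squares fit equals the absolute Pearson correlation; the eight-descriptor MLR model of the paper is not reproduced. The function names, the seed and the toy data are assumptions.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def y_randomization(x, y, n_trials=10, seed=0):
    """Shuffle y, 'refit' the one-descriptor model, and record |R| each time."""
    rng = random.Random(seed)      # hypothetical seed, for repeatability only
    rs = []
    for _ in range(n_trials):
        shuffled = y[:]            # copy so the original order is preserved
        rng.shuffle(shuffled)
        rs.append(abs(pearson_r(x, shuffled)))
    return rs

# Toy data with a perfect linear relation between descriptor and response.
x = [float(i) for i in range(30)]
y = [2.0 * v + 1.0 for v in x]

true_r = abs(pearson_r(x, y))
random_rs = y_randomization(x, y)
print(true_r)
print(sum(random_rs) / len(random_rs))
```

A model that reflects a real structure-activity relationship should give shuffled R values well below the R of the unshuffled model, as in Table 8.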
For the sake of comparison, a PLS analysis was
also performed using all 256 variables. Tables 5 and 6
show the results obtained using the PLS model. The
statistical parameters of the PLS model for the training and
test sets are shown in Table 7. This table shows that
the results of the PLS model are superior to
those of the MLR model. The predicted PLS values of
log IC50 were plotted versus their experimental values
in Fig. 4.
Table 8 ― Regression coefficient (R) and RMSE values for Y-randomization tests
Model No.  R      RMSE
1          0.120  0.352
2          0.224  0.245
3          0.192  0.420
4          0.088  0.521
5          0.095  0.640
6          0.056  0.410
7          0.074  0.520
8          0.105  0.397
9          0.231  0.860
10         0.151  0.731
Fig. 3 ― Plot of predicted log IC50 versus experimental values using MLR model
Fig. 4 ― Plot of predicted log IC50 versus experimental values using PLS model
Fig. 5 ― Plot of predicted log IC50 versus experimental values using SVR model
As a second step, the nonlinear characteristics of the
activity parameter were investigated using a nonlinear
modeling method. The descriptors selected
using MLR were used to construct a nonlinear model
with the SVM technique.
In the SVM modeling, the kernel function, which
represents the sample distribution in the mapping
space, must be determined first. In this work, the
RBF (radial basis function) kernel was chosen
because it had good general performance. The next
step in the construction of the SVM model was the
optimization of its parameters, ε and C. The
optimization of the SVM parameters was performed by
systematically changing their values in the training step
and calculating the RMSE of the model.
ε-insensitivity prevents the entire training set from meeting
the boundary conditions and so allows for sparsity
in the solution of the dual formulation.
Choosing an appropriate value of ε is therefore a critical step.
To find an optimal value for ε, the RMSE of SVM
models with different ε values was calculated. The
optimal value of ε was 0.05.
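The role of ε can be made concrete with the standard ε-insensitive loss used in SVR: residuals smaller than ε cost nothing, so points inside the "tube" do not become support vectors. The sketch below uses ε = 0.05 from the text; the function names and the sample values are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.05):
    """Standard SVR loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def count_outside_tube(observed, predicted, epsilon=0.05):
    """Number of samples whose residual exceeds epsilon."""
    return sum(
        1 for o, p in zip(observed, predicted)
        if epsilon_insensitive_loss(o, p, epsilon) > 0.0
    )

# Hypothetical observed and predicted log IC50 values.
observed = [2.60, 2.60, 1.80]
predicted = [2.63, 2.80, 1.79]
print([epsilon_insensitive_loss(o, p) for o, p in zip(observed, predicted)])
print(count_outside_tube(observed, predicted))  # 1
```

A larger ε widens the tube and leaves fewer points contributing to the solution, which is why the RMSE was scanned over several ε values.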
The other parameter is the regularization parameter
C, which controls the trade-off between maximizing the
margin and minimizing the training error. If C is too
small, insufficient stress is placed on fitting
the training data. If C is too large, on the other hand,
the SVM model will overfit the training data.
To find an optimal value of C, the RMSE of SVM
models with different C values was calculated. The
results revealed that the variation of RMSE
for C > 20 is small; therefore C = 30 was selected as
the optimal value of C.
After the SVM parameters were optimized, the model was used to
calculate the log IC50 of the training and test sets. The
statistical parameters of this model are RMSE = 0.046,
R = 0.996, and F = 1192 for the training set, and
RMSE = 0.26, R = 0.804, and F = 61.95 for the test set.
The leave-one-out cross-validation test was also
carried out to evaluate the prediction power
of the obtained SVM model. The calculated value is
Q2 = Rcv2 = 0.797 for the cross-validation test.
For comparison, the statistical parameters of the
models are shown in Table 7. It can be seen from
Table 7 that the R value improved from 0.890 for
the MLR model to 0.996 for the SVM model for the
training set. This means that the nonlinear model
accounts for about 99% of the variance (R2 ≈ 0.992) in the
anti-Alzheimer activity of the triazolyl thiophenes. The
predicted SVM values of log IC50 were plotted
versus their experimental values in Fig. 5. The
correlation coefficient of this plot indicates the
reliability of the SVM model.
Conclusion
In the present work, QSAR models based on the
modeling techniques of MLR, PLS and SVM
have been developed for predicting the inhibitory
activity of methyl-linked cyclohexyl thiophenes with
triazole from their molecular structure. The results
obtained using stepwise multiple linear regression
show that the more significant descriptors in the
MLR model can be categorized as topological,
two-dimensional autocorrelation, 3D-MoRSE
and charge index descriptors.
Four of the eight selected descriptors are weighted by
electronegativity or polarizability. This indicates that
the two physico-chemical properties of electronegativity
and polarizability are very important in the design of new
drugs.
For comparison, the three methods MLR, PLS and
SVM were used to construct linear and nonlinear
quantitative relationships for the data set. The results
show that the SVM model is clearly superior on the
training set. It can thus be reasonably concluded that
the proposed model would be expected to predict
the activity of new cyclohexyl thiophenes or
of other organic compounds for which experimental
values are unknown.
References
1 Ahn J S, Radhakrishnan M L, Mapelli M, Choi S, Tidor B,
Cuny G D, Musacchio A, Yeh L A & Kosik K S, Chemistry
& biology, 12 (2005) 811
2 Mapelli M & Musacchio A, Neurosignals, 12 (2003) 164
3 Mapelli M, Massimiliano L, Crovace C, Seeliger M A, Tsai L H,
Meijer L & Musacchio A, J Med Chem, 48 (2005) 671
4 Gupta A & Tsai L H, Neurosignals, 12 (2003) 173
5 Van De Waterbeemd H, The history of drug research:
From Hansch to the present, (Wiley Online Library) 11
(1992) 200
6 Van De Waterbeemd H, Nature Reviews Drug Discovery, 2
(2003) 192
7 Shiradkar M, Thomas J, Kanase V & Dighe R, Euro J Med
Chem, 46 (2011) 2066
8 DRAGON, software, Version 3 (2003)
9 HYPER CHEM software, Version 7.0 (2002)
10 Afantitis A, Melagraki G, Sarimveis H, Koutentis P A,
Igglessi-Markopoulou O & Kollias G, Mol Div, 14 (2010) 225
11 SPSS software, Version 16 (2008)
12 MINITAB software, Version 15.1 (2008)
13 Vapnik V N, The nature of statistical learning theory,
(Springer-Verlag New York Inc), 2000
14 Burges C J C, Data Mining and Knowledge Discovery, 2
(1998) 121
15 Basak D, Pal S & Patranabis D, Neural Information
Processing–Letters and Reviews, 11 (2007) 203
16 Vapnik V N, Golowich S & Smola A, Advances in Neural
Information Processing Systems, MIT Press, 1997
17 Fatemi M H & Gharaghani S, Bio Med Chem, 15 (2007) 7746
18 CLEMENTINE software, version 12.0 (2008)
19 Bonchev D, Information theoretic indices for characterization
of chemical structures, (Research Studies, Press Chichester,
UK), 1983
20 Todeschini R & Consonni V, Handbook of Molecular
Descriptors, (Wiley–VCH, Weinheim), 2000
21 Fernandez M, Tundidor-Camba A & Caballero J, J Chem
Info Mod, 45 (2005) 1884
22 Caballero J, Tundidor‐Camba A & Fernández M, QSAR &
Comb Sci, 26 (2006) 27
23 Moreau G & Broto P, Nouv J Chim, 4 (1980) 359
24 Gasteiger J, Sadowski J, Schuur J, Selzer P, Steinhauer L &
Steinhauer V, J Chem Info Comp Sci, 36 (1996) 1030
25 Schuur J H, Selzer P & Gasteiger J, J Chem Info Comp Sci,
36 (1996) 334