Environ Chem Lett
DOI 10.1007/s10311-009-0251-9
ORIGINAL PAPER
A quantitative structure–retention relationship study
for prediction of chromatographic relative retention
time of chlorinated monoterpenes
Jahan B. Ghasemi • Sh. Ahmadi • S. D. Brown
Received: 4 September 2008 / Accepted: 28 September 2009
Springer-Verlag 2009
Abstract A novel quantitative structure–retention relationship model has been developed for the gas chromatographic relative retention times (tR) of 67 polychlorinated monoterpene congeners on a non-polar column. The relative retention times of these compounds were modeled as a function of theoretically derived descriptors by principal component and partial least squares regressions. Optimal training sets were chosen efficiently with a Kohonen self-organizing map. A genetic algorithm was used to select the variables that resulted in the best-fitted models. Appropriate models with low standard errors and high correlation coefficients were obtained. The Wiener index, Balaban index, and ideal gas thermal capacity are examples of the descriptors that affect the retention times of polychlorinated monoterpenes.
Keywords PLS · PCR · Quantitative structure–retention relationship · Chlorinated monoterpenes · SOM · GAPLS
J. B. Ghasemi (✉)
Chemistry Department, Faculty of Sciences,
K.N. Toosi University of Technology, Tehran, Iran
e-mail: [email protected]
Sh. Ahmadi
Chemistry Department, Faculty of Science, Razi University,
Kermanshah, Iran
S. D. Brown
Chemistry and Biochemistry Department,
University of Delaware, Newark, DE 19716, USA
Introduction
Toxaphene is a complex mixture of chlorinated bornanes that
was at one time the most heavily used pesticide in the United
States until it was banned in 1982, because of its persistence,
toxicity, and bioaccumulative potential. A number of countries, however, may continue to use toxaphene (Voldner
and Li 1995). Toxaphene is an insecticidal mixture, which
is produced by the controlled chlorination of camphene
(2,2-dimethyl-3-methylene-bicyclo[2.2.1]heptane) under UV
light (Saleh 1991). Toxaphene is an analytically challenging mixture because of the hundreds of congeners often present in a sample, the presence of numerous stereoisomers, the limited availability of authentic standards of individual congeners, and the potential for interferences from other compounds (Braekevelt et al. 2001).
The quantitative structure–retention relationship (QSRR) of solutes is a useful method for studying the retention mechanism in gas chromatography and is an important topic in chromatographic thermodynamics. In the last two decades, QSRRs have often been applied to: (1) predict retention for a new solute; (2) identify the most informative structural descriptors (regarding properties); (3) gain insight into the molecular mechanism of separation operating in a given chromatographic system; (4) evaluate complex physicochemical properties of analytes; and (5) predict relative biological activities. QSRRs have been developed for a number of halogenated organic contaminant groups (Ong and Hites 1991; Nevalainen et al. 1994; Hale et al. 1985; Hackenberg et al. 2003). For polybrominated diphenyl ethers (PBDEs), Rayne and Ikonomou presented several predictive equations with five or seven simple parameters, including the numbers of ortho, meta, and para substituents, as well as molecular mass and calculated ionization potential and dipole moment, used as independent variables (Rayne and Ikonomou 2003).
Principal components regression (PCR) (Wold et al. 1987) and partial least squares (PLS) (Haaland and Thomas 1988) are methods that can be useful in dealing with an unfavorable variable-to-object ratio and with collinearity. When some of the variables do not contain relevant information, however, these latent variable methods are unable to fully eliminate the influence of those variables. Many quantitative structure activity/property relationship (QSAR/QSPR) studies and methods (Ghasemi et al. 2007a, b; Ghasemi and Saaidpour 2007) therefore select a subset of the variables used in PLS modeling. Genetic algorithms (GAs) have been successfully applied to feature selection in regression analysis (Leardi 1994). Moreover, an approach combining GA with PLS (GAPLS) has also been proposed for variable selection in QSAR and QSPR studies (Ghasemi and Ahmadi 2007; Leardi and Gonzalez 1998). The purpose of variable selection is to select the variables that contribute significantly to prediction and to discard the other variables by means of a fitness function. The hybrid method, which integrates GA as a powerful optimization tool and PLS as a robust statistical tool, can lead to a significant improvement in QSAR and QSPR models.
Optimal training sets were chosen with a Kohonen self-organising map (SOM) (Melssen et al. 2007). The Kohonen learning algorithm is described in Kohonen's paper (Kohonen 1997).
The purpose of the present study is to investigate the relationship between the gas chromatographic relative retention times of 67 polychlorinated monoterpenes (PCMTs) and their molecular descriptors.
Materials and methods
The GC temperature program was as follows: 160 °C (2 min), then 20 °C/min to 280 °C, 10 min isothermal.
Descriptor generation
All calculations were run on a Toshiba computer with a Pentium IV CPU and Windows XP as the operating system. ChemOffice 2005 molecular modeling software, ver. 9, supplied by Cambridge Software Company (http://www.cambridgesoft.com/software/ChemOffice), Cambridge, MA, USA, 2005, was employed for optimization of the molecular structures and calculation of the descriptors. The molecular structures of the data set were sketched using the ChemDraw Ultra module of this software. The sketched structures were exported to the Chem3D module to create their 3D structures. Each molecule was "cleaned up", energy minimization was performed using Allinger's MM2 force field, and further geometry optimization of the 3D structure was done using the semiempirical AM1 (Austin Model 1) Hamiltonian and PM3 methods with default settings. The lower energy conformers obtained by this procedure were then fed into an Excel spreadsheet for calculation of the structural molecular descriptors with the add-in ChemSAR module, developed by Cambridge Software Company, Cambridge, MA, USA, 2005. A total of 42 descriptors were generated; they fall into different categories, including the physicochemical, thermodynamic, electronic, and spatial descriptors available in the 'Analyze' option of the Chem3D package. The calculated descriptors account for four important properties of the molecules (physicochemical, thermodynamic, electronic, and steric), as they represent the possible molecular interactions that determine the retention times of the studied molecules (Table 2).
Data set
GC relative retention times (RRTs, tR) of 67 PCMTs were taken from the literature (Nikiforov et al. 2000). They consisted of 45 chlorinated bornanes, 13 chlorinated bornenes, 8 chlorinated dihydrocamphenes, and 1 chlorinated camphene. Relative retention times (RRTs) were determined against 2,2,5,5,8,9,9,10,10-nonachlorobornane (item number 41 in Table 1). The compounds are grouped by their carbon skeleton structure and the number of chlorine atoms in the molecule. The structures of the PCMT skeletons and the numbering of the carbon atoms are shown in Fig. 1. The GC conditions reported by Nikiforov et al. (2000) were as follows: GC, Varian 3700; injector, Gerstel split/splitless at 250 °C, splitless injection; column, DB-5 (ca. 53 m, 0.25 mm i.d., film thickness 0.1 μm); detector, electron capture detector at 300 °C; carrier gas, nitrogen at a flow rate of 1.33 ml/min; make-up gas, nitrogen at 40 ml/min.
Splitting training-test sets
To obtain compounds for external validation, the available set of chemicals was split into a training set and an external prediction set by two different methods. First, random selection (RS) was employed. In this method, the data are split into a training set and a prediction set through activity sampling: the chemicals are ordered by ascending experimental value, the most and the least active are selected, and chemicals are taken from the ordered set into the prediction set, which is used for external validation after model development. The second method of splitting the data set used the Kohonen self-organising map (SOM) technique. A SOM package for MATLAB is available at http://www.cis.hut.fi/projects/somtoolbox/. The first partition was performed by an 8 × 8 Kohonen network, shown in Fig. 2. The input variables were the 35 descriptors for the 67 molecules.
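The activity-sampling idea described above can be illustrated with a minimal Python sketch. The function name `activity_split`, the `step` parameter, and the rule of holding out every `step`-th compound of the ordered list while keeping the two extremes in the training set are assumptions made for illustration; the text does not state the exact rule used. The RRT values are those of compounds 1 to 9 in Table 1.

```python
def activity_split(values, step=4):
    """Split sample indices into (training, prediction) by activity sampling.

    Assumptions (not taken from the paper): the least and most active
    samples stay in the training set, and every `step`-th sample of the
    remaining ordered list is held out for external prediction.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    training = [order[0], order[-1]]  # extremes are kept for calibration
    prediction = []
    for rank, idx in enumerate(order[1:-1]):
        if rank % step == step - 1:
            prediction.append(idx)
        else:
            training.append(idx)
    return sorted(training), sorted(prediction)


# RRT values of compounds 1-9 (Table 1); list index 0 is compound 1
rrt = [0.6887, 0.7181, 0.7667, 0.7791, 0.7639, 0.7756, 0.7875, 0.7921, 0.8319]
train_idx, pred_idx = activity_split(rrt, step=3)
print("training:", train_idx)
print("prediction:", pred_idx)
```

Because the held-out compounds are spread evenly along the ordered response, the prediction set spans the same RRT range as the training set without extrapolating beyond it.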
Table 1 The RRT data of PCMTs and the corresponding values predicted by the PCR and PLS models
No. | Chemical name | RRT | PCR (1) | PCR (2) | PCR (3) | PLS (1) | PLS (2) | PLS (3)
1c,d | 2-exo,5-endo,9,9,10-pentachlorobornane (PeCB) | 0.6887 | 0.7944 | 0.7518 | 0.7509 | 0.7914 | 0.7579 | 0.7484
2c,d | 2,2,5,5,10,10-hexachlorobornane (HxCB) | 0.7181 | 0.7466 | 0.7361 | 0.7342 | 0.7408 | 0.7443 | 0.7416
3c,b | 2-exo,3-endo,6-exo,8,9,10-HxCB | 0.7667 | 0.7292 | 0.7656 | 0.7779 | 0.7389 | 0.7625 | 0.7796
4a,d | 2-endo,3-exo,6-exo,8,9,10-HxCB | 0.7791 | 0.7292 | 0.7656 | 0.7779 | 0.7389 | 0.7625 | 0.7796
5c,d | 2-exo,3-endo,5-exo,9,9,10,10-heptachlorobornane (HpCB) | 0.7639 | 0.7599 | 0.7611 | 0.7518 | 0.768 | 0.7615 | 0.7489
6c,b | 2-exo,5,5,9,9,10,10-HpCB | 0.7756 | 0.8104 | 0.7784 | 0.7878 | 0.8158 | 0.7992 | 0.788
7c,b | 2-endo,3-exo,5-endo,6-exo,8,9,10-HpCB | 0.7875 | 0.7994 | 0.7845 | 0.7821 | 0.7886 | 0.7796 | 0.7854
8c,d | 2,2,5,5,10,10,10-HpCB | 0.7921 | 0.7918 | 0.7888 | 0.791 | 0.7933 | 0.7916 | 0.7921
9c,d | 2-exo,3-endo,5-exo,8,9,10,10-HpCB | 0.8319 | 0.798 | 0.7951 | 0.7973 | 0.7994 | 0.7978 | 0.7983
10c,d | 2-exo,5,5,8,9,10,10-HpCB | 0.8359 | 0.8042 | 0.8015 | 0.8035 | 0.8055 | 0.804 | 0.8044
11c,d | 2-exo,2,2,5,5,8,9,10-HpCB | 0.8423 | 0.8103 | 0.8079 | 0.8098 | 0.8116 | 0.8102 | 0.8106
12c,d | 2,2,5-endo,6-exo,8,9,10-HpCB | 0.8432 | 0.8165 | 0.8142 | 0.816 | 0.8177 | 0.8164 | 0.8168
13a,d | 2-exo,3-endo,6-exo,8,9,10,10-HpCB | 0.8468 | 0.8227 | 0.8206 | 0.8222 | 0.8237 | 0.8227 | 0.823
14c,d | 2-exo,3-endo,5-exo,6-exo,8,9,10-HpCB | 0.8472 | 0.8289 | 0.827 | 0.8285 | 0.8298 | 0.8289 | 0.8292
15a,b |  | 0.8521 | 0.8351 | 0.8333 | 0.8347 | 0.8359 | 0.8351 | 0.8353
16a,d | 2-exo,3-endo,6-endo,8,9,10,10-HpCB | 0.8592 | 0.8413 | 0.8397 | 0.841 | 0.842 | 0.8413 | 0.8415
17b,c | 2-exo,5-exo,6-endo,8,9,10,10-HpCB | 0.8757 | 0.8475 | 0.8461 | 0.8472 | 0.8481 | 0.8475 | 0.8477
18b,c | 2,2,5-exo,8,9,10,10-HpCB | 0.8765 | 0.8536 | 0.8524 | 0.8534 | 0.8542 | 0.8537 | 0.8539
19c,d | 2-endo,3-exo,5-endo,6-exo,8,8,10,10-octachlorobornane (OCB) | 0.8161 | 0.8598 | 0.8588 | 0.8597 | 0.8603 | 0.8599 | 0.8601
20c,d | 2-exo,3-exo,5-endo,8,9,10,10-HpCB | 0.8282 | 0.866 | 0.8652 | 0.8659 | 0.8664 | 0.8661 | 0.8662
21b,c | 2,2,5,5,9,9,10,10-OCB | 0.8689 | 0.8722 | 0.8715 | 0.8722 | 0.8725 | 0.8723 | 0.8724
22b,c | 2,2,3-exo,5-endo,6-exo,8,9,10-OCB | 0.8799 | 0.8784 | 0.8779 | 0.8784 | 0.8786 | 0.8785 | 0.8786
23b,c | 2-endo,3-exo,5-endo,6-exo,8,9,10,10-OCB | 0.8825 | 0.8846 | 0.8843 | 0.8846 | 0.8847 | 0.8847 | 0.8848
24a,d | 2-exo,3-endo,5-exo,8,9,9,10,10-OCB | 0.8857 | 0.8908 | 0.8906 | 0.8909 | 0.8907 | 0.8909 | 0.891
25a,d | 2,2,5-endo,6-exo,8,8,9,10-OCB | 0.8918 | 0.8969 | 0.897 | 0.8971 | 0.8968 | 0.8971 | 0.8971
26a,d | 2-endo,3-exo,5-endo,6-exo,8,8,9,10-OCB | 0.8918 | 0.9031 | 0.9034 | 0.9034 | 0.9029 | 0.9033 | 0.9033
27b,c | 2-exo,5,5,8,9,9,10,10-OCB | 0.8986 | 0.9093 | 0.9097 | 0.9096 | 0.909 | 0.9096 | 0.9095
28c,d | 2,2,5-endo,6-exo,8,9,10,10-OCB | 0.9232 | 0.9155 | 0.9161 | 0.9158 | 0.9151 | 0.9158 | 0.9157
29a,b | 2,2,5,5,8,9,10,10-OCB | 0.9347 | 0.9217 | 0.9225 | 0.9221 | 0.9212 | 0.922 | 0.9219
30c,d | 2-endo,3-exo,6-exo,8,8,9,10,10-OCB | 0.9388 | 0.9279 | 0.9289 | 0.9283 | 0.9273 | 0.9282 | 0.928
0.9342
c,d
31
2,2,5-endo,6-exo,8,9,9,10-OCB
2,2,5,5,6-exo,8,9,10-OCB
0.9438 0.9341
0.9352
0.9346
0.9334
0.9344
32c,d | 2,2,3-exo,6-exo,8,9,10,10-OCB | 0.9451 | 0.9402 | 0.9416 | 0.9408 | 0.9395 | 0.9406 | 0.9404
33c,d | 2-endo,3-exo,5-endo,6-exo,8,8,9,10,10-nonachlorobornane (NCB) | 0.9239 | 0.9464 | 0.948 | 0.947 | 0.9456 | 0.9468 | 0.9466
34c,d | 2-exo,3,3,5-exo,6-endo,9,9,10,10-NCB | 0.9262 | 0.9526 | 0.9543 | 0.9533 | 0.9517 | 0.953 | 0.9528
35a,d | 2,2,3-exo,5,5,9,9,10,10-NCB | 0.9394 | 0.9738 | 0.9534 | 0.9236 | 0.9612 | 0.9287 | 0.9299
36a,d | 2,2,3-exo,5-endo,6-exo,8,9,10,10-NCB | 0.9575 | 0.9589 | 0.9343 | 0.9281 | 0.9489 | 0.9331 | 0.9349
37c,d | 2-exo,3,3,5-exo,6-endo,8,9,10,10-NCB | 0.9575 | 0.9778 | 0.9506 | 0.9391 | 0.9657 | 0.9602 | 0.9435
38c,d | 2,2,5-endo,6-exo,8,8,9,10,10-NCB | 0.9618 | 0.941 | 0.9407 | 0.9527 | 0.9564 | 0.9385 | 0.9547
39c,d | 2,2,3-exo,5,5,8,9,10,10-NCB | 0.9719 | 1.0002 | 0.9804 | 0.9745 | 1.0038 | 0.9869 | 0.9821
40a,d | 2,2,5-endo,6-exo,8,9,9,10,10-NCB | 0.977 | 0.9975 | 0.9172 | 0.9527 | 0.9929 | 0.9494 | 0.9547
41c,d | 2,2,5,5,8,9,9,10,10-NCB | 1 | 0.9956 | 0.9842 | 1.0072 | 1.0161 | 0.9985 | 1.0102
42c,d | 2,2,3-exo,5-endo,6-exo,8,9,9,10,10-decachlorobornane (DCB) | 1.0458 | 1.0619 | 1.0746 | 1.0678 | 1.0552 | 1.0733 | 1.0715
43b,c | 2-exo,3,3,5-exo,6-endo,8,9,9,10,10-DCB | 1.0892 | 1.0656 | 1.0773 | 1.0792 | 1.0589 | 1.0569 | 1.0805
44a,d | 2-exo,3,3,6,6,8,8,9,10,10-DCB | 1.1133 | 1.0916 | 1.0914 | 1.1035 | 1.0904 | 1.0853 | 1.1103
45c,d | 2,2,3-exo,5,5,8,9,9,10,10-DCB | 1.1163 | 1.0686 | 1.1367 | 1.1147 | 1.0796 | 1.1206 | 1.1192
0.655
0.7388
0.6976
0.6045
0.7125
0.7043
0.6541
0.687
0.7018
0.6403
0.6853
a,b
46
5-endo,6-exo,8,9,10-pentachlorobornene (PeCB-en)
47b,c 2,6-exo,9,9,10,10-hexachlorobornene (HxCB-en)
0.5904
0.6567 0.744
Table 1 continued
No.
Chemical name
RRT
PCR (1) PCR (2) PCR (3) PLS (1) PLS (2) PLS (3)
48c,d | 5,5,9,9,10,10-HxCB-en | 0.6867 | 0.7 | 0.7014 | 0.6916 | 0.6944 | 0.6848 | 0.6947
49c,d 5,5,8,9,10,10-HxCB-en
0.711
0.7391
0.7355
0.7417
0.7462
0.7416
50b,c 2,5-endo,8,9,10,10-HxCB-en
0.7439 0.6899
0.7505
0.7506
0.7015
0.7412
0.7487
0.7521
c,d
51
0.7433 0.7749
0.7612
0.7518
0.7674
0.7454
52c,d | 2,5-endo,6-exo,8,9,10,10-HpCB-en | 0.7588 | 0.7555 | 0.7671 | 0.7512 | 0.7445 | 0.746 | 0.7516
53a,d | 2,5,5,8,9,10,10-HpCB-en | 0.7733 | 0.7821 | 0.8075 | 0.8012 | 0.7948 | 0.809 | 0.803
54c,d | 5,5,8,9,9,10,10-HpCB-en | 0.7756 | 0.767 | 0.7607 | 0.7596 | 0.7766 | 0.7584 | 0.7626
55c,d | 3,5-exo,6-endo,8,9,10,10-HpCB-en | 0.7881 | 0.78 | 0.7742 | 0.7656 | 0.7709 | 0.7939 | 0.7642
c,d
56
2,5,5,9,9,10,10-heptachlorobornene (HpCB-en)
0.738
0.8458 0.862
0.8299
0.8253
0.8417
0.8525
0.8204
57a,d 2,5-endo,6-exo,8,9,9,10,10-OCB-en
3,5-exo,6-endo,8,9,9,10,10-octachlorobornene (OCB-en)
0.8637 0.8117
0.8189
0.8105
0.8045
0.7997
0.8073
58c,d | 2,5,5,8,9,9,10,10-OCB-en | 0.8711 | 0.8625 | 0.8523 | 0.861 | 0.8671 | 0.8645 | 0.8592
59c,d | 2-exo,5,5,6-exo-tetrachloro,3-exo-chloromethyl,2-endo-dichloromethyl,3-endo-methylnorbornane | 0.8066 | 0.8507 | 0.8289 | 0.8247 | 0.8441 | 0.823 | 0.8172
60c,d | 3-exo,5,5,6-exo-tetrachloro,2-exo-chloromethyl,3-endo-dichloromethyl,2-endo-methylnorbornane | 0.8335 | 0.8028 | 0.848 | 0.8168 | 0.8096 | 0.8247 | 0.8092
61a,d | 3-exo,5,5,6-exo-tetrachloro,2-exo,3-endo-bis(dichloromethyl),2-endo-methylnorbornane | 0.9313 | 0.8968 | 0.9015 | 0.8881 | 0.8945 | 0.8655 | 0.8754
62c,d | 3-exo,5,5,6-exo-tetrachloro,2,2-bis(chloromethyl),3-endo-dichloromethylnorbornane | 0.9422 | 0.924 | 0.9386 | 0.9386 | 0.9385 | 0.9468 | 0.9272
63c,d | 2-exo,5-endo,6-exo-trichloro,3-endo-chloromethyl,2-endo,3-exo-bis(dichloromethyl)norbornane | 0.9752 | 0.9605 | 0.9763 | 0.9794 | 0.9582 | 0.9707 | 0.983
64a,b | 3-exo,5,5,6-exo-tetrachloro,2-endo-chloromethyl,2-exo,3-endo-bis(dichloromethyl)norbornane | 1.0248 | 1.0387 | 1.0269 | 1.0463 | 1.0409 | 1.014 | 1.0306
65b,c | 3-exo,5,5,6-exo-tetrachloro,2-exo-chloromethyl,2-endo,3-endo-bis(dichloromethyl)norbornane | 1.0374 | 1.0699 | 1.0129 | 1.0463 | 1.0646 | 1.015 | 1.0306
66a,d | 3-endo,5,5,6-exo-tetrachloro,2-endo-chloromethyl,2-exo,3-endo-bis(dichloromethyl)norbornane | 1.0416 | 1.0387 | 1.0268 | 1.0463 | 1.0409 | 1.014 | 1.0306
67b,c | 2,2,3-exo-trichloro,5-endo-chloromethyl,6-(E)-chloromethylene,5-exo-dichloromethylnorbornane | 0.8073 | 0.7269 | 0.8046 | 0.8042 | 0.6894 | 0.8129 | 0.8028
a: The compounds used in the SOM prediction set
b: The compounds used in the RS prediction set
c: The compounds used in the calibration set for SOM
d: The compounds used in the calibration set for RS
Variable selection and model building
A total of 42 descriptors were initially calculated by ChemSAR for the entire data set of 67 compounds. The number of descriptors was reduced to 35 by eliminating those deemed insignificant (i.e., those whose one-parameter correlation coefficient with the property is less than 0.1). The samples were split into calibration and prediction sets (49 calibration samples, 18 prediction samples) by SOM and by random sampling (RS), and the data were preprocessed by autoscaling. Genetic-algorithm-based partial least squares regression (GAPLS), following the procedure discussed in Leardi's article and our previous article (Ghasemi and Ahmadi 2007), was then applied to the training data set. Because each GA run gives a slightly different model, ten GA runs were performed on the data set, with the goal of verifying the robustness of the predictive ability and of the set of selected variables. If some variables are present in only one model, it can be concluded that they were selected just by chance, and they can therefore be disregarded in the final model. Finally, we selected 11 variables among five runs (Table 3), and these variables were used as input for PLS_Toolbox (version 7.0, MathWorks, Inc.) regression software. PCR and PLS regression (PLS_Toolbox, version 2.1, Eigenvector, USA) and other calculations were performed in the MATLAB environment.
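Two of the screening steps described above, the correlation filter and the consensus over repeated GA runs, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' MATLAB procedure: the function names, the `min_runs` parameter, the constant `const` column, and the three example GA runs are all hypothetical. The MW and RRT values are those of compounds 1 to 5 in Table 2.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, y, threshold=0.1):
    """Keep descriptors whose |r| with the property is at least `threshold`."""
    return [name for name, col in columns.items()
            if abs(pearson_r(col, y)) >= threshold]


def consensus_variables(runs, min_runs=2):
    """Keep variables selected in at least `min_runs` GA runs (assumed rule)."""
    counts = Counter(v for run in runs for v in set(run))
    return sorted(v for v, c in counts.items() if c >= min_runs)


# MW and RRT of compounds 1-5 (Table 2); "const" is a made-up constant column
rrt = [0.6887, 0.7181, 0.7667, 0.7791, 0.7639]
columns = {
    "MW": [310.4752, 344.9203, 344.9203, 344.9203, 379.3653],
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
}
kept = filter_descriptors(columns, rrt)

# Hypothetical outcomes of three GA runs (not the paper's actual runs)
ga_runs = [["WIndx", "BIndx", "CP"], ["WIndx", "CP", "LogP"], ["BIndx", "WIndx", "MR"]]
selected = consensus_variables(ga_runs)
print(kept, selected)
```

Variables that appear in a single run (here `LogP` and `MR`) are dropped, which mirrors the argument that a variable selected only once was probably selected by chance.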
Validation of the models
Validation is a crucial aspect of any QSRR model, as it demonstrates the predictive ability of the model obtained. Golbraikh and Tropsha (2002) suggested that for a QSPR model to have high predictive power, the following conditions should be fulfilled: (1) a high cross-validated $Q^2$ value; (2) a correlation coefficient $R$ between the predicted and observed activities of compounds from an external test set close to 1; (3) at least one (but better both) of the correlation coefficients for regression through the origin (predicted versus observed activities, or observed versus predicted activities) should be close to $R^2$; (4) at least one slope of the regression lines through the origin should be close to 1. In this article, the cross-validated $R^2$ (i.e., $Q^2$) was assessed, and a high value of $Q^2$ (more than 0.5) can be regarded as evidence of the high predictive ability of the model. $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SS}$, where PRESS is the predicted residual sum of squares of the deleted data in the cross-validation, calculated as $\sum (\text{predicted value} - \text{experimental value})^2$, and SS is the sum of squares of Y corrected for the mean, calculated as $\sum (\text{experimental value} - \text{mean experimental value})^2$. However, a high value of $Q^2$ appears to be a necessary but not sufficient condition for a model to have high predictive power. Adequate prediction of an external set, i.e., a test set, is therefore necessary in addition to $Q^2$. In all cases, $Q^2$ and the RMSE of the calibration set (RMSEC) were calculated to assess the quality of the models. An external test set was chosen to verify the PLSR model, and the RMSE of the external test set (RMSEP) was also examined.
Fig. 1 The structure of polychlorinated monoterpenes: bornane, bornene, 2,2,3-trimethylnorbornane (dihydrocamphene), and 2,2-dimethyl-3-methylenenorbornane (camphene)
Results and discussion
RS and SOM data splitting method, PLS and PCR modeling
The PLS and PCR analyses were carried out using all 35 variables on the RRTs of the 49 calibration PCMTs selected with RS and with SOM, for a comparative study of the SOM selection. The training and external test sets of PLS (1) and PCR (1) were generated by RS data splitting, whereas those of PCR (2) and PLS (2) were created by SOM data splitting. Prior to the PLS analysis, the data set was autoscaled to unit variance to give each variable equal importance. Leave-one-out cross-validation was used to determine the number of significant PLS components. The optimal number of PCs, modeling useful information but avoiding overfitting, was determined by the method of Haaland and Thomas (Haaland and Thomas 1988; Ghasemi et al. 2003). Table 4 lists the statistical parameters of the models that were generated. The numbers of principal components for PCR (1) and PCR (2) and of latent variables for PLS (1) and PLS (2) were 10, 12, 6, and 9, respectively. The RMSEC values for PCR (1), PCR (2), PLS (1), and PLS (2) were 0.029, 0.022, 0.027, and 0.021, respectively. These results indicate that SOM data splitting improves the data modeling, and that the PLS models achieve comparable results with fewer LVs than the PCs needed for PCR.
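The leave-one-out statistics used here (PRESS, the cross-validated Q2, and RMSE) can be sketched as follows. To keep the example self-contained, a one-variable least-squares line stands in for the PCR and PLS models; that substitution, and all function names, are assumptions for illustration only. The WIndx and RRT values are those of compounds 1 to 7 in Table 2.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def loo_predictions(x, y):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds


def q2(y, y_cv):
    """Q2 = 1 - PRESS / SS, with SS taken about the mean of y."""
    mean_y = sum(y) / len(y)
    press = sum((p - o) ** 2 for p, o in zip(y_cv, y))
    ss = sum((o - mean_y) ** 2 for o in y)
    return 1 - press / ss


def rmse(y, y_pred):
    """Root mean square error between observed and predicted values."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(y_pred, y)) / len(y))


# Wiener index and RRT of compounds 1-7 (Table 2)
windx = [295.0, 338.0, 349.0, 349.0, 405.0, 405.0, 404.0]
rrt = [0.6887, 0.7181, 0.7667, 0.7791, 0.7639, 0.7756, 0.7875]

cv = loo_predictions(windx, rrt)
slope, intercept = fit_line(windx, rrt)
fitted = [slope * w + intercept for w in windx]
print("Q2 (LOO):", q2(rrt, cv))
print("RMSEC:", rmse(rrt, fitted))
```

In a full analysis the same loop would be repeated for each candidate number of components, and the number giving the lowest PRESS (or the smallest number not significantly worse) would be retained.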
Variable selection, PLS and PCR modeling
PCR and PLS regression have advantages over ordinary
multiple linear regression (MLR) since they can deal with
collinearity in the predictor variables. Also, the PLS model
converges to an MLR solution if the number of extracted
LVs is equal to the rank of the sample variable space.
In order to estimate the real predictive power of genetic
algorithm-based variable selection, PLS and PCR were
applied to the SOM calibration set with 35 descriptors
(PCR (2) and PLS (2)) and 11 GAPLS selected descriptors
(PCR (3) and PLS (3)). Leave-one-out cross-validation was
used to determine the number of significant latent variables
(LVs) and significant principal components (PCs) for PLS
and PCR. The number of LVs and PCs for PCR (2), PCR
(3), PLS (2), and PLS (3) were 12, 8, 9, and 7, respectively.
The RMSEC for PCR (2), PCR (3), PLS (2), and PLS (3)
were 0.022, 0.020, 0.021, and 0.020, respectively. Therefore, the PCR (3) and PLS (3) models with fewer
descriptors than PCR (2) and PLS (2) were considered to
be the best QSRR models. As Fig. 3b and d indicate, the correlation between predicted and observed values for PCR (3) and PLS (3) was good (R = 0.971 and 0.973, respectively), and the
Table 2 Molecular descriptors selected to construct the QSRR model and experimental relative retention times of PCMTs
TC
CP
LogP
MR
MW
NRBO
ShapA
WIndx
RRT
46.52
295
0.6887
23.61
338
0.7181
29.32
349
0.7667
14.0625
29.32
349
0.7791
15.05882
12.51
405
0.7639
379.3653 2
15.05882
14.73
405
0.7756
379.3653 3
15.05882
9.68
404
0.7875
No. BIndx
CLSC
1
46712
15
830.7015 254.9191 4.2773
67.7133
310.4752 2
13.06667
2
60718
16
806.0936 263.4179 4.7322
74.1476
344.9203 1
14.0625
3
62604
16
875.1198 269.9637 4.537
71.7992
344.9203 3
14.0625
4
62604
16
875.1198 269.9637 4.537
71.7992
344.9203 3
5
81802
17
863.3268 284.3654 4.8285
77.0817
379.3653 2
6
81805
17
849.2226 279.7424 4.7719
78.4251
7
81550
17
895.4147 285.5226 4.947
76.0581
8
91954.43 17.57143
894.8005 292.3937 5.001557
79.41373 399.0482 2.857143 15.62801
9
97878.93 17.89286
904.3935 297.353
80.80201 410.1199 3
5.086557
15.94821
G
2.942857 437.4286 0.7921
-2.23893
455.8929 0.8319
-7.42071
474.3571 0.8359
10
103803.4
18.21429
913.9865 302.3123 5.171557
82.19029 421.1915 3.142857 16.2684
11
12
109727.9
115652.4
18.53571
18.85714
923.5796 307.2717 5.256557
933.1726 312.231 5.341557
83.57858 432.2631 3.285714 16.58859
84.96686 443.3347 3.428571 16.90879
-12.6025
-17.7843
492.8214 0.8423
511.2857 0.8432
13
121576.9
19.17857
942.7656 317.1903 5.426557
86.35514 454.4064 3.571429 17.22898
-22.9661
529.75
14
127501.4
19.5
952.3586 322.1496 5.511557
87.74342 465.478
3.714286 17.54918
-28.1479
548.2143 0.8472
15
133425.9
19.82143
961.9516 327.109
89.1317
476.5496 3.857143 17.86937
-33.3296
566.6786 0.8521
16
139350.4
20.14286
971.5447 332.0683 5.681557
90.51999 487.6212 4
18.18957
-38.5114
585.1429 0.8592
17
145274.9
20.46429
981.1377 337.0276 5.766557
91.90827 498.6929 4.142857 18.50976
-43.6932
603.6071 0.8757
18
151199.4
20.78571
990.7307 341.9869 5.851557
93.29655 509.7645 4.285714 18.82995
-48.875
622.0714 0.8765
19
157123.9
21.10714 1000.324
346.9463 5.936557
94.68483 520.8361 4.428571 19.15015
-54.0568
640.5357 0.8161
20
163048.4
21.42857 1009.917
351.9056 6.021557
96.07311 531.9077 4.571429 19.47034
-59.2386
659
21
168972.9
21.75
356.8649 6.106557
97.4614
-64.4204
677.4643 0.8689
22
174897.4
22.07143 1029.103
361.8242 6.191557
98.84968 554.051
-69.6021
695.9286 0.8799
23
180821.9
22.39286 1038.696
366.7835 6.276557 100.238
565.1226 5
20.43093
-74.7839
714.3929 0.8825
24
186746.4
22.71429 1048.289
371.7429 6.361557 101.6262
576.1943 5.142857 20.75112
-79.9657
732.8571 0.8857
25
192670.9
23.03571 1057.882
376.7022 6.446557 103.0145
587.2659 5.285714 21.07131
-85.1475
751.3214 0.8918
26
198595.4
23.35714 1067.475
381.6615 6.531557 104.4028
598.3375 5.428571 21.39151
-90.3293
769.7857 0.8918
27
28
204519.9
210444.4
23.67857 1077.068
24
1086.661
386.6208 6.616557 105.7911
391.5802 6.701557 107.1794
609.4091 5.571429 21.7117
620.4808 5.714286 22.0319
-95.5111
-100.693
788.25
0.8986
806.7143 0.9232
29
216368.9
24.32143 1096.254
396.5395 6.786557 108.5677
631.5524 5.857143 22.35209 -105.875
825.1786 0.9347
30
222293.4
24.64286 1105.847
401.4988 6.871557 109.9559
642.624
22.67229 -111.056
843.6429 0.9388
31
228217.9
24.96429 1115.44
406.4581 6.956557 111.3442
653.6956 6.142857 22.99248 -116.238
862.1071 0.9438
32
234142.4
25.28571 1125.033
411.4175 7.041557 112.7325
664.7673 6.285714 23.31268 -121.42
880.5714 0.9451
33
240066.9
25.60714 1134.626
416.3768 7.126557 114.1208
675.8389 6.428571 23.63287 -126.602
899.0357 0.9239
34
245991.4
25.92857 1144.219
421.3361 7.211557 115.5091
686.9105 6.571429 23.95306 -131.784
917.5
0.9262
35
132490
19
876.2405 306.2372 5.7945
87.9687
448.2555 2
17.05263
-22.33
528
0.9394
36
132201
19
911.8295 310.3459 5.7026
86.6037
448.2555 3
17.05263
-22.11
527
0.9575
37
132948
19
911.8295 310.3459 5.7878
86.3969
448.2555 3
17.05263
-22.11
530
0.9575
38
132976
19
901.5455 308.6744 5.4338
87.4543
448.2555 3
17.05263
-16.84
530
0.9618
39
132944
19
897.9521 305.723
5.7312
87.7403
448.2555 3
17.05263
-19.89
530
0.9719
40
132976
19
901.5455 308.6744 5.4338
87.4543
448.2555 3
17.05263
-16.84
530
0.977
41
133955
19
887.656
304.0515 5.2902
88.8531
448.2555 3
17.05263
-14.62
534
1
42
43
166404
167229
20
20
921.8726 324.2333 5.9308
921.8726 324.2333 6.016
91.6578
91.451
482.7005 3
482.7005 3
18.05
18.05
-36.48
-36.48
600
603
1.0458
1.0892
44
166412
20
907.9524 319.6104 5.8742
93.0012
482.7005 3
18.05
-34.26
600
1.1133
45
167225
20
907.9524 319.6104 5.9594
92.7944
482.7005 3
18.05
-34.26
603
1.1163
46
46701
15
859.5637 246.6665 4.17
68.2943
308.4593 3
13.06667
78.92
295
0.655
47
61786
16
837.0197 260.7434 3.9848
73.9573
342.9044 2
14.0625
60.19
344
0.6567
48
62492
16
831.0656 256.4452 4.3179
74.9756
342.9044 2
14.0625
64.33
348
0.6867
1019.51
5.596557
542.9794 4.714286 19.79054
4.857143 20.11073
6
0.8468
0.8282
Table 2 continued
WIndx
RRT
66.77
350
0.711
62.63
350
0.7439
15.05882
42.77
405
0.7433
377.3495 3
15.05882
42.99
404
0.7588
377.3495 3
15.05882
45.21
407
0.7733
79.8013
377.3495 3
15.05882
52.4
408
0.7756
78.12
377.3495 3
15.05882
42.99
407
0.7881
890.2511 289.6753 4.4039
83.1741
411.7945 3
16.05556
28.62
470
0.8458
18
890.2511 289.6753 4.4909
83.1187
411.7945 3
16.05556
28.62
467
0.8637
18
876.3033 285.0524 4.3473
84.5175
411.7945 3
16.05556
30.84
470
0.8711
83441
17
851.0618 281.4139 5.0636
77.4298
379.3653 2
15.05882
9.46
413
0.8066
60
83245
17
851.0618 281.4139 5.2358
77.1676
379.3653 2
15.05882
9.46
412
0.8335
61
62
107959
108368
18
18
861.1545 295.3013 5.464
882.951 294.787 5.4007
82.2217
81.9933
413.8104 2
413.8104 3
16.05556
16.05556
-4.91
-2.47
478
480
0.9313
0.9422
TC
CP
LogP
MR
MW
NRBO
ShapA
No. BIndx
CLSC
49
62818
16
855.5327 255.931
4.2546
74.7472
342.9044 3
14.0625
50
62801
16
859.8115 260.2291 3.8527
73.8057
342.9044 3
14.0625
51
81805
17
844.0739 271.6793 4.1824
79.6918
377.3495 2
52
81578
17
880.198
4.2627
78.0646
53
82172
17
866.2775 271.1651 4.1191
79.4634
54
82410
17
865.599
269.8184 4.4828
55
82168
17
880.198
275.788
56
106089
18
57
105430
58
106092
59
275.788
4.1757
G
63
65703
16
851.4055 263.8721 4.3106
72.3873
365.3388 2
14.0625
9.58
366
0.9752
64
138004
19
893.0126 308.6744 5.6289
87.0474
448.2555 3
17.05263
-16.84
550
1.0248
65
138004
19
893.0126 308.6744 5.6289
87.0474
448.2555 3
17.05263
-16.84
550
1.0374
66
138004
19
893.0126 308.6744 5.6289
87.0474
448.2555 3
17.05263
-16.84
550
1.0416
67
50027
15
819.8987 238.5759 3.9697
68.5138
328.8778 1
13.06667
77.12
316
0.8073
Fig. 2 Occupation of the 8 × 8 Kohonen top-map
slope and intercept were 0.906, 0.929 and 0.057, 0.077,
respectively. The residuals (Fig. 3e, f) are randomly distributed and almost symmetrically fill the negative and
positive half-planes, so we conclude there was no systematic error.
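The diagnostics quoted above (correlation coefficient, slope, intercept, and the sign balance of the residuals) can be computed as in the sketch below. It uses only the first nine compounds of Table 1 with their PLS (3) predictions, so the numbers it prints are illustrative and will not reproduce the statistics reported for the full data set. The function names are hypothetical, and the slope through the origin is included because it appears in the Golbraikh and Tropsha criteria.

```python
from math import sqrt

# Observed RRTs and PLS (3) predictions for compounds 1-9, copied from Table 1
observed = [0.6887, 0.7181, 0.7667, 0.7791, 0.7639, 0.7756, 0.7875, 0.7921, 0.8319]
predicted = [0.7484, 0.7416, 0.7796, 0.7796, 0.7489, 0.788, 0.7854, 0.7921, 0.7983]


def ols(x, y):
    """Slope, intercept, and Pearson R of the least-squares line y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / sqrt(sxx * syy)


def slope_through_origin(x, y):
    """Slope k of the regression line y = k * x forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


slope, intercept, r = ols(observed, predicted)
k = slope_through_origin(observed, predicted)

# Residuals as predicted minus observed; a balanced sign count suggests
# no systematic over- or under-prediction
residuals = [p - o for o, p in zip(observed, predicted)]
n_positive = sum(1 for e in residuals if e > 0)
n_negative = sum(1 for e in residuals if e < 0)
print(slope, intercept, r, k, n_positive, n_negative)
```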
To explore further which descriptors (i.e., X-variables) can be eliminated from the analysis, we can look at the regression coefficients for the centered and scaled data. In the absence of knowledge about the relative importance of the variables, the standard multivariate approach is to (1) center them by subtracting their averages, and (2) scale each variable to unit variance by dividing it by its standard deviation, so-called auto-scaling. This corresponds to giving each variable the same weight and the same prior importance in the analysis (Wold et al. 2001). In PLS modeling, X-variables may be important for the modeling of Y; X-variables with small coefficients (in terms of absolute value) make a small contribution to the response prediction. However, a variable may also be important for the modeling of X, which can be identified by large loadings. An alternative statistic is the Variable Importance for Projection (VIP), which summarizes the importance of a variable for both Y and X (Wold 1994). The coefficients for centered and scaled data and the VIP values were calculated and compared (Table 3). According to Wold (1994), if a variable has a relatively small coefficient (in absolute value) and a small value of VIP, it is a prime candidate for deletion. Wold considers a value less than 0.8 to be "small" for the VIP. We find that all 11 variables selected by GAPLS are important for PLS regression.
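The auto-scaling step and the deletion rule described above can be sketched as follows. The coefficients and VIP values are copied from the PLS (3) columns of Table 3. The text gives no numerical definition of a "relatively small" coefficient, so `COEF_CUTOFF` is a hypothetical value chosen only for illustration, as are the function names; the use of the sample standard deviation (n - 1 denominator) is also an assumption.

```python
from math import sqrt


def autoscale(column):
    """Center a variable on its mean and scale it to unit variance.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column]


# name: (standardized PLS (3) coefficient, VIP), copied from Table 3
table3 = {
    "BIndx": (3.240, 2.888), "CLSC": (-1.474, 2.728), "TC": (-1.459, 2.906),
    "CP": (-4.385, 2.816), "LogP": (-0.474, 2.635), "MR": (-1.881, 2.753),
    "MW": (2.181, 2.666), "NRBO": (1.082, 2.702), "ShpA": (-1.468, 2.727),
    "G": (-2.431, 2.718), "WIndx": (2.843, 2.818),
}

VIP_CUTOFF = 0.8   # Wold's threshold, as quoted in the text
COEF_CUTOFF = 0.5  # hypothetical threshold for a "relatively small" coefficient


def deletion_candidates(stats, vip_cutoff, coef_cutoff):
    """Variables with both a small |coefficient| and a small VIP."""
    return [name for name, (coef, vip) in stats.items()
            if vip < vip_cutoff and abs(coef) < coef_cutoff]


candidates = deletion_candidates(table3, VIP_CUTOFF, COEF_CUTOFF)
print(candidates)
```

Every VIP in Table 3 exceeds 0.8, so no descriptor is flagged whatever coefficient cutoff is chosen, which is consistent with the conclusion that all 11 variables are retained.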
Table 3 lists the standardized regression coefficients of the PCR (3) and PLS (3) models. The standardized regression coefficients reveal the significance of each individual descriptor in the regression model. As Table 3 shows, the effects of ideal gas thermal capacity and the Balaban index on the RRTs of the PCMTs are more significant than those of the other descriptors in the PCR and PLS models,
Table 3 The descriptors selected by GAPLS, the PCR and PLS standardized regression coefficients (bs), and VIPs for PLS regression
Descriptor name | Abbreviation | Descriptor type | PCR (3) bs | PLS (3) bs | PLS (3) VIP
Balaban index | BIndx | Steric | 2.854 | 3.240 | 2.888
Cluster count | CLSC | Steric | -1.492 | -1.474 | 2.728
Critical temperature | TC | Thermodynamic | -2.057 | -1.459 | 2.906
Ideal gas thermal capacity | CP | Thermodynamic | -3.781 | -4.385 | 2.816
Log P | LogP | Thermodynamic | -0.511 | -0.474 | 2.635
Molecular refractivity | MR | Thermodynamic | -2.568 | -1.881 | 2.753
Molecular weight | MW | Steric | 2.244 | 2.181 | 2.666
Number of rotatable bonds | NRBO | Steric | 1.220 | 1.082 | 2.702
Shape attribute | ShpA | Steric | -1.489 | -1.468 | 2.727
Standard Gibbs free energy | G | Thermodynamic | -2.205 | -2.431 | 2.718
Wiener index | WIndx | Steric | 3.993 | 2.843 | 2.818
respectively. The orders of significance of the other descriptors in the PCR and PLS models are WIndx > BIndx > MR > MW > TC > G > CLSC > ShapA > NRBO > LogP and BIndx > WIndx > G > MW > MR > CLSC > ShapA > TC > NRBO > LogP, respectively.
Interpretation of descriptors
Chromatographic retention results from the solvation and
partitioning of individual compounds in a stationary phase.
The solubility is affected by intermolecular forces which
include hydrogen bonding, ion–dipole, dipole–dipole,
dipole-induced dipole interactions, etc. Solubility is related
to the descriptors accounting for the counts of atoms,
functional groups or charged moieties, surface areas of
donor and acceptor groups, hydrophobicity, polarizability,
and conformation flexibility. Chromatographic partition, in
turn, results from adsorption–desorption of compounds on
the stationary phase, during which a change in degree of
freedom of the molecule occurs. The entropic effects also
play important roles. As a result, the chromatographic
retention depends mainly on the electronic properties, and
the size and shape of the solute molecule.
The Wiener index is the most effective descriptor in the
PCR and PLS models. The Wiener index (Wiener 1947) is
the first topological index used in chemistry. It is the sum
of distances between all pairs of vertices in a
hydrogen-suppressed molecular graph. The correlations between
the RRTs of organic compounds and the Wiener index show that
the magnitude of intermolecular interactions between the
compounds and the stationary phase is related to the degree
of branching of the molecules. The Wiener index, which
represents molecular “steric-branching”, increases as the
number of chlorine substituents increases, since it is defined
as the half-sum of all topological distances in the
hydrogen-depleted graph representing the skeleton of the molecule.
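The definition above can be illustrated with a minimal Python sketch. It assumes the hydrogen-suppressed graph is supplied as a plain adjacency list and that every bond counts as distance 1; the function names and the three small example molecules are illustrative only and are not taken from the software used in this study.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological distances (bond counts) between all vertex pairs, by BFS."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def wiener_index(adjacency):
    """Half-sum of all entries of the topological distance matrix."""
    return sum(map(sum, distance_matrix(adjacency))) // 2


# Hydrogen-suppressed graphs; vertex i is bonded to the vertices in adjacency[i].
n_butane = [[1], [0, 2], [1, 3], [2]]                 # C-C-C-C
isobutane = [[1, 2, 3], [0], [0], [0]]                # branched C4
chlorobutane = [[1], [0, 2], [1, 3], [2, 4], [3]]     # C-C-C-C-Cl

print(wiener_index(n_butane), wiener_index(isobutane), wiener_index(chlorobutane))
```

In this toy case the branched isomer gives a smaller index than the linear one, and adding a chlorine vertex increases the index, which is the trend described in the text.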
The Balaban index, another topological descriptor, characterizes the branching pattern within a molecule (Balaban
1981). This type of descriptor is evaluated from various
combinations and weightings of the vertices (atoms) and
edges (bonds) of the molecular graph (chemical structure).
Its positive coefficient indicates that the presence of chlorine substituents increases retention time. The negative coefficients of the ideal gas thermal capacity descriptor suggest
that an increase in this parameter would reduce the RRTs
of the molecules. The positive coefficients associated with the
number of rotatable bonds (NRBO) and molecular weight
(MW) descriptors suggest that an increase in either
would increase the retention time. In general, the
rotatable bond descriptor reflects the flexibility of the
molecule. For example, an increase in the rotational
degrees of freedom may result in a larger diffusional cross
section, thereby influencing permeability adversely. On the
other hand, such flexibility could decrease the crystallinity
of the solute, resulting in improved aqueous solubility and
enhanced absorption (Lu et al. 2004). The MR descriptor is
directly proportional to the polarizability (Hansch and Leo
1995), as expressed by $\mathrm{MR} = 4\pi N_A \alpha / 3$, where $N_A$ is the
Avogadro constant and $\alpha$ is the polarizability. MR values
are based on molecular fragments, the contributions of which add to give the overall MR (Hansch and Leo 1995).
In QSRR, molecular parameters are used to predict retention times in gas–solid interactions. Polarizability, $\alpha$, is a
key factor in intermolecular interactions and also in non-specific molecule–stationary phase interactions. The negative sign of the MR descriptor in the models shows that
increasing the polarizability of the molecules causes a decrease
in the RRT. The critical temperature is an important fundamental physical property of organic compounds. It is
defined as the temperature above which a gas cannot be
liquefied by pressure alone, and it has a negative effect on the RRT. The
Fig. 3 Predicted and residual versus experimental relative retention times for PCMTs by PLS (3) and PCR (3) models. Panels a–d plot predicted RRT against observed RRT, with fitted lines y = 0.961x + 0.034 and y = 0.962x + 0.033 (R = 0.981 and 0.980) and, for the PLS (3) and PCR (3) test sets in panels b and d, y = 0.906x + 0.077 (R = 0.971) and y = 0.929x + 0.057 (R = 0.973). Panels e and f plot residuals against observed RRT for the training and test sets.
standard Gibbs free energy is negatively correlated with
RRT. Log P is the logarithm of the partition coefficient of a molecule in an n-octanol/water system and reflects its
hydrophobicity. The retention time decreases
as the hydrophobicity of the compounds increases. The
other descriptors, shape attribute and cluster count,
are topological indices that affect intermolecular and molecule–stationary phase interactions; both have a negative
effect on the RRT of PCMTs.
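As an illustration of how the Balaban index discussed above is obtained from the vertices and edges of the molecular graph, the following Python sketch evaluates the unweighted form J = m/(μ + 1) Σ (s_i s_j)^(-1/2), where the sum runs over the edges, s_i is the distance sum of vertex i, m is the number of edges and μ = m − n + 1 is the number of rings. It assumes a connected hydrogen-suppressed graph with no heteroatom or bond-order weighting, so values from descriptor software that applies such weightings to chlorinated compounds may differ. All names and example graphs are illustrative.

```python
import math
from collections import deque


def distance_sums(adjacency):
    """Sum of topological distances from each vertex to all others (BFS)."""
    sums = []
    for start in range(len(adjacency)):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        sums.append(sum(seen.values()))
    return sums


def balaban_index(adjacency):
    """Unweighted Balaban J for a connected hydrogen-suppressed graph."""
    n = len(adjacency)
    # Each undirected edge is listed once, as (i, j) with i < j.
    edges = [(i, j) for i in range(n) for j in adjacency[i] if i < j]
    m = len(edges)
    mu = m - n + 1  # cyclomatic number (ring count) of a connected graph
    s = distance_sums(adjacency)
    return m / (mu + 1) * sum(1.0 / math.sqrt(s[i] * s[j]) for i, j in edges)


n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1, 2, 3], [0], [0], [0]]
cyclohexane = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]

for name, graph in [("n-butane", n_butane), ("isobutane", isobutane),
                    ("cyclohexane", cyclohexane)]:
    print(name, round(balaban_index(graph), 4))
```

The branched isomer gives a larger J than the linear one, consistent with the index characterizing the branching pattern.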
Validation of the model
The correlations between observed and predicted RRT for the
PLS (3) and PCR (3) models were acceptable, with R = 0.971 and
0.973, intercepts of 0.077 and 0.057, and RMSEP values of 0.026 and
0.025, respectively (Fig. 3b, d, Table 4). The data presented in Table 4 indicate that the PLS (3) and PCR (3)
models have good statistical quality with low prediction
errors, while the PLS model uses fewer latent variables (7 LVs versus 8 PCs).
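The quantities quoted above (slope, intercept, R and RMSEP of a predicted-versus-observed plot) can be computed as in the following Python sketch. It assumes the usual definitions, namely an ordinary least-squares line of predicted on observed values and RMSEP as the root mean squared difference between predicted and observed values over the test set. The function name and the four data points are hypothetical and are not the PCMT data.

```python
import math


def fit_statistics(observed, predicted):
    """Slope, intercept, correlation coefficient R and RMSE of predicted vs observed."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    sxx = sum((x - mean_obs) ** 2 for x in observed)
    syy = sum((y - mean_pred) ** 2 for y in predicted)
    sxy = sum((x - mean_obs) * (y - mean_pred)
              for x, y in zip(observed, predicted))
    slope = sxy / sxx
    intercept = mean_pred - slope * mean_obs
    r = sxy / math.sqrt(sxx * syy)
    # On a test set this root mean squared error is the RMSEP.
    rmse = math.sqrt(sum((y - x) ** 2
                         for x, y in zip(observed, predicted)) / n)
    return {"slope": slope, "intercept": intercept, "R": r, "RMSE": rmse}


# Hypothetical observed and predicted RRTs for four test compounds.
observed = [0.60, 0.80, 1.00, 1.20]
predicted = [0.62, 0.80, 0.98, 1.16]
stats = fit_statistics(observed, predicted)
print(stats)
```

A slope below 1 combined with a positive intercept, as in the test-set lines of Fig. 3, means that low RRTs are slightly over-predicted and high RRTs slightly under-predicted.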
Table 4 Statistical results of comparing regression models
Model    Data splitting method  No. of descriptors  No. of LVs or PCs  RMSEC  RMSEP  Q2     R2cal  R2test
PCR (1)  RS                     35                  10                 0.350  0.650  0.894  0.960  0.920
PCR (2)  SOM                    11                  12                 0.022  0.032  0.898  0.959  0.919
PCR (3)  SOM                    11                  8                  0.020  0.025  0.971  0.962  0.947
PLS (1)  RS                     35                  6                  0.027  0.038  0.872  0.928  0.917
PLS (2)  SOM                    11                  9                  0.021  0.032  0.935  0.958  0.929
PLS (3)  SOM                    11                  7                  0.020  0.026  0.969  0.960  0.943
Conclusion
A QSRR study was conducted on 67 PCMTs using PCR
and PLS procedures. More than 40 theoretically derived
descriptors were calculated for each molecule. The data were
split into training and test sets with a SOM, and the best set of
calculated descriptors was selected by a genetic algorithm.
Both methods gave models of good statistical quality. The high
test-set R2 values (0.947 and 0.943 for PCR and PLS,
respectively) and low prediction errors (RMSEP of 0.025 and 0.026 for
PCR and PLS, respectively) confirm the good
predictive ability of both models.
References
Balaban AT (1981) Highly discriminating distance based topological
index. Chem Phys Lett 89:399–404
Braekevelt E, Tomy GT, Stern GA (2001) Comparison of an
individual congener standard and a technical mixture for the
quantification of toxaphene in environmental matrices by
HRGC/ECNI-HRMS. Environ Sci Technol 35:3513–3518
Ghasemi J, Ahmadi Sh (2007) Combination of genetic algorithm and
partial least squares for cloud point prediction of nonionic
surfactants from molecular structures. Ann di Chim 97:69–83
Ghasemi J, Saaidpour S (2007) QSPR prediction of aqueous solubility
of drug-like organic compounds. Chem Pharm Bull 55:669–674
Ghasemi J, Ahmadi Sh, Torkestani K (2003) Simultaneous determination of copper, nickel, cobalt and zinc using zincon as a
metallochromic indicator with partial least squares. Anal Chim
Acta 487:181–188
Ghasemi J, Asadpour S, Abdolmaleki A (2007a) Prediction of GC/
ECD retention times of chlorinated pesticides, herbicides, and
organohalides by multivariate chemometrics methods. Anal
Chim Acta 588:200
Ghasemi J, Saaidpour S, Brown SD (2007b) QSPR study for
estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis. J Mol
Struct 805:27–32
Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graphics Model
20:269–276
Haaland DM, Thomas EV (1988) Partial least-squares methods for
spectral analyses, relation to other quantitative calibration
methods and the extraction of qualitative information. Anal
Chem 60:1193–1202
Hackenberg R, Schutz A, Ballschmiter K (2003) High-resolution gas
chromatography retention data as basis for the estimation of
KOW values using PCB congeners as secondary standards.
Environ Sci Technol 37:2274–2279
Hale MD, Hileman FD, Mazer T, Shell TL, Noble RW, Brooks JJ
(1985) Mathematical modeling of temperature programmed
capillary gas chromatographic retention indexes for polychlorinated dibenzofurans. Anal Chem 57:640–648
Hansch C, Leo A (1995) Exploring QSAR: fundamentals and
applications in chemistry and biology. American Chemical
Society, Washington, DC
Kohonen T (1997) Self-organizing maps, 2nd edn. Springer-Verlag,
Berlin
Leardi R (1994) Application of a genetic algorithm to feature
selection under full validation conditions and to outlier detection. J Chemometr 8:65–79
Leardi R, Gonzalez AL (1998) Genetic algorithms applied to feature
selection in PLS regression: how and when to use them.
Chemometr Intell Lab Syst 41:195–207
Lu JJ, Crimin K, Goodwin JT, Crivori P, Orrenius C, Xing L,
Tandler PJ, Vidmar TJ, Amore BM, Wilson AGE, Stouten
PFW, Burton PS (2004) Influence of molecular flexibility and
polar surface area metrics on oral bioavailability in the rat.
J Med Chem 47:6104–6107
Melssen W, Üstün B, Buydens L (2007) SOMPLS: a supervised self-organising map-partial least squares algorithm for multivariate
regression problems. Chemometr Intell Lab Syst 86:102–120
Nevalainen T, Koistinen J, Nurmela P (1994) Structure verification,
and chromatographic relative retention times for polychlorinated
diphenyl ethers. Environ Sci Technol 28:1341–1347
Nikiforov V, Karavan V, Miltsov SA (2000) Relative retention times
of chlorinated monoterpenes. Chemosphere 41:467–472
Ong VS, Hites RA (1991) Relationship between gas chromatographic
retention … properties of four compound classes. Anal Chem
63:2829–2834
Rayne S, Ikonomou MG (2003) Predicting gas chromatographic
retention times for the 209 polybrominated diphenyl ether
congeners. J Chromatogr A 1016:235–248
Saleh MA (1991) Toxaphene: chemistry, biochemistry, toxicity and
environmental fate. Rev Environ Cont Toxicol 118:1–85
Voldner EC, Li YF (1995) Global usage of selected persistent
organochlorines. Sci Total Environ 160/161:201–210
Wiener H (1947) Structural determination of paraffin boiling points.
J Am Chem Soc 69:17–20
Wold S (1994) QSAR: chemometric methods in molecular designs.
Methods and principles in medicinal chemistry. Verlag-Chemie,
Weinheim, Germany
Wold S, Esbensen K, Geladi P (1987) Principal component analysis.
Chemometr Intell Lab Syst 2:37–52
Wold S, Sjostrom M, Eriksson L (2001) PLS-regression: a basic tool
of chemometrics. Chemometr Intell Lab Syst 58:109–130