Environ Chem Lett
DOI 10.1007/s10311-009-0251-9

ORIGINAL PAPER

A quantitative structure–retention relationship study for prediction of chromatographic relative retention time of chlorinated monoterpenes

Jahan B. Ghasemi • Sh. Ahmadi • S. D. Brown

Received: 4 September 2008 / Accepted: 28 September 2009
© Springer-Verlag 2009

Abstract A novel quantitative structure–retention relationship model has been developed for the gas chromatographic relative retention times (tR) of 67 polychlorinated monoterpene congeners on a non-polar column. The relative retention times of these compounds were modeled as a function of theoretically derived descriptors by principal component regression and partial least squares regression. Optimal training sets were chosen efficiently with a Kohonen self-organizing map. A genetic algorithm was used to select the variables that gave the best-fitted models. Appropriate models with low standard errors and high correlation coefficients were obtained. The Wiener index, the Balaban index, and the ideal gas thermal capacity are examples of descriptors that affect the retention times of polychlorinated monoterpenes.

Keywords PLS, PCR, Quantitative structure–retention relationship, Chlorinated monoterpenes, SOM, GAPLS

J. B. Ghasemi (corresponding author)
Chemistry Department, Faculty of Sciences, K.N. Toosi University of Technology, Tehran, Iran
e-mail: [email protected]

Sh. Ahmadi
Chemistry Department, Faculty of Science, Razi University, Kermanshah, Iran

S. D. Brown
Chemistry and Biochemistry Department, University of Delaware, Newark, DE 19716, USA

Introduction

Toxaphene is a complex mixture of chlorinated bornanes that was once the most heavily used pesticide in the United States. It was banned there in 1982 because of its persistence, toxicity, and bioaccumulative potential. A number of countries, however, may continue to use toxaphene (Voldner and Li 1995).
Toxaphene is an insecticidal mixture produced by the controlled chlorination of camphene (2,2-dimethyl-3-methylene-bicyclo[2.2.1]heptane) under UV light (Saleh 1991). It is analytically challenging because of the hundreds of congeners present in a sample, the presence of numerous stereoisomers, the limited availability of authentic standards of individual congeners, and the potential for interferences from other compounds (Braekevelt et al. 2001).

The quantitative structure–retention relationship (QSRR) of solutes is a useful tool for discussing the retention mechanism in gas chromatography and is an important topic in chromatographic thermodynamics. In the last two decades QSRRs have often been applied to: (1) predict retention for a new solute; (2) identify the most informative structural descriptors (regarding properties); (3) gain insight into the molecular mechanism of separation operating in a given chromatographic system; (4) evaluate complex physicochemical properties of analytes; and (5) predict relative biological activities. QSRRs have been developed for a number of halogenated organic contaminant groups (Ong and Hites 1991; Nevalainen et al. 1994; Hale et al. 1985; Hackenberg et al. 2003). For polybrominated diphenyl ethers (PBDEs), Rayne and Ikonomou presented several predictive equations with five or seven simple parameters as independent variables, including the numbers of ortho, meta, and para substituents, the molecular mass, and the calculated ionization potential and dipole moment (Rayne and Ikonomou 2003).

Principal components regression (PCR) (Wold et al. 1987) and partial least squares (PLS) (Haaland and Thomas 1988) are useful for dealing with an unfavorable variable-to-object ratio (more variables than objects) and with collinearity.
When some of the variables do not contain relevant information, these latent variable methods are unable to fully eliminate the influence of those variables. Many quantitative structure–activity/property relationship (QSAR/QSPR) studies and methods (Ghasemi et al. 2007a, b; Ghasemi and Saaidpour 2007) therefore select only part of the variables for PLS modeling. Genetic algorithms (GA) have been successfully applied to feature selection in regression analysis (Leardi 1994). Moreover, an approach combining GA with PLS (GAPLS) has been proposed for variable selection in QSAR and QSPR studies (Ghasemi and Ahmadi 2007; Leardi and Gonzalez 1998). The purpose of variable selection is to keep the variables that contribute significantly to prediction and to discard the others according to a fitness function. The hybrid method, which integrates GA as a powerful optimization tool and PLS as a robust statistical tool, can lead to a significant improvement in QSAR and QSPR models. The optimal training sets were chosen with a Kohonen self-organising map (SOM) (Melssen et al. 2007); the Kohonen learning algorithm is described in Kohonen's paper (Kohonen 1997).

The purpose of the present study is to investigate the relationship between the gas chromatographic relative retention times of 67 polychlorinated monoterpenes (PCMTs) and their molecular descriptors.

Materials and methods

Data set

GC relative retention times (RRTs, tR) of 67 PCMTs were taken from the literature (Nikiforov et al. 2000). They consisted of 45 chlorinated bornanes, 13 chlorinated bornenes, 8 chlorinated dihydrocamphenes, and 1 chlorinated camphene. The RRTs were determined against 2,2,5,5,8,9,9,10,10-nonachlorobornane (item number 41 in Table 1). The compounds are grouped by their carbon skeleton structure and the number of chlorine atoms in the molecule. The structures of the PCMT skeletons and the numbering of the carbon atoms are shown in Fig. 1.

The GC conditions reported by Nikiforov et al. (2000) were as follows. GC: Varian 3700; injector: Gerstel split/splitless at 250 °C, splitless injection; column: DB-5 (ca. 53 m, 0.25 mm i.d., film thickness 0.1 µm); detector: electron capture detector at 300 °C; carrier gas: nitrogen at a flow rate of 1.33 ml/min; make-up gas: nitrogen at 40 ml/min. The temperature program was as follows: 160 °C (2 min), then 20 °C/min to 280 °C, 10 min isothermal.

Descriptor generation

All calculations were run on a Toshiba computer with a Pentium IV CPU and Windows XP as the operating system. ChemOffice 2005 molecular modeling software ver. 9, supplied by Cambridge Software Company (http://www.cambridgesoft.com/software/ChemOffice), Cambridge, MA, USA 2005, was employed for optimization of the molecular structures and calculation of the descriptors. The molecular structures of the data set were sketched using the ChemDraw Ultra module of this software and exported to the Chem3D module to create their 3D structures. Each molecule was "cleaned up" and energy minimization was performed using Allinger's MM2 force field; further geometry optimization of the 3D structure was done using the semiempirical AM1 (Austin Model 1) Hamiltonian and PM3 methods with default settings. The lower-energy conformers obtained by this procedure were then fed into an Excel spreadsheet for calculation of the structural molecular descriptors with the ChemSAR add-in module, developed by Cambridge Software Company, Cambridge, MA, USA 2005.

In total, 42 descriptors were generated. They fall into the physicochemical, thermodynamic, electronic, and spatial categories available in the 'Analyze' option of the Chem3D package. The calculated descriptors account for four important properties of the molecules (physicochemical, thermodynamic, electronic, and steric), as they represent the possible molecular interactions that determine the retention times of the studied molecules (Table 2).

Splitting training-test sets

In order to obtain compounds for external validation, the available set of chemicals was split into a training set and an external prediction set by two different methods. First, random selection (RS) was employed. In this method, the data are split into a training set and a prediction set through activity sampling: the chemicals are ordered by ascending experimental value, the most and the least active are selected, and chemicals taken from the ordered set are placed in the prediction set, which is used for external validation after model development.

The second method for splitting the data set used the Kohonen self-organising map (SOM) technique. A SOM package for MATLAB is available at http://www.cis.hut.fi/projects/somtoolbox/. The first partition was performed by an 8 × 8 Kohonen network, shown in Fig. 2. The input variables were the 35 descriptors for the 67 molecules.

Table 1 The RRT data of PCMTs and the corresponding predicted values by PCR and PLS models
No.
Chemical name RRT 1c,d 2-exo,5-endo,9,9,10-pentachloroborane(PeCB) 0.6887 0.7944 0.7518 0.7509 0.7914 0.7579 0.7484 2c,d 2,2,5,5,10,10-hexachlorobornane(HxCB) 0.7181 0.7466 0.7361 0.7342 0.7408 0.7443 0.7416 3c,b 2-exo,3-endo,6-exo,8,9,10-HxCB 0.7667 0.7292 0.7656 0.7779 0.7389 0.7625 0.7796 a,d 4 2-endo,3-exo,6-exo,8,9,10-HxCB 0.7791 0.7292 0.7656 0.7779 0.7389 0.7625 0.7796 5c,d 2-exo,3-endo,5-exo,9,9,10,10-heptachlorobornane(HpCB) 0.7639 0.7599 0.7611 0.7518 0.768 0.7615 0.7489 6c,b 2-exo,5,5,9,9,10,10-HpCB 0.7756 0.8104 0.7784 0.7878 0.8158 0.7992 0.788 7c,b 2-endo,3-exo,5-endo,6-exo,8,9,10-HpCB 0.7875 0.7994 0.7845 0.7821 0.7886 0.7796 0.7854 8c,d 2,2,5,5,10,10,10-HpCB 0.7921 0.7918 0.7888 0.791 0.7933 0.7916 0.7921 9c,d 2-exo,3-endo,5-exo,8,9,10,10- HpCB 0.8319 0.798 0.7951 0.7973 0.7994 0.7978 0.7983 2-exo,5,5,8,9,10,10- HpCB 0.8359 0.8042 0.8015 0.8035 0.8055 0.804 0.8044 0.8423 0.8103 0.8432 0.8165 0.8079 0.8142 0.8098 0.816 0.8116 0.8177 0.8102 0.8164 0.8106 0.8168 c,d 10 11c,d 2-exo,2,2,5,5,8,9,10-HpCB 12c,d 2,2,5-endo,6-exo,8,9,10- HpCB PCR (1) PCR (2) PCR (3) PLS (1) PLS (2) PLS (3) 13a,d 2-exo,3-endo,6-exo,8,9,10,10-HpCB 0.8468 0.8227 0.8206 0.8222 0.8237 0.8227 0.823 14c,d 2-exo,3-endo,5-exo,6-exo,8,9,10-HpCB 0.8472 0.8289 0.827 0.8285 0.8298 0.8289 0.8292 a,b 15 0.8521 0.8351 0.8333 0.8347 0.8359 0.8351 0.8353 16a,d 2-exo,3-endo,6-endo,8,9,10,10-HpCB 0.8592 0.8413 0.8397 0.841 0.842 0.8413 0.8415 17b,c 2-exo,5-exo,6-endo,8,9,10,10-HpCB 0.8757 0.8475 0.8461 0.8472 0.8481 0.8475 0.8477 18b,c 2,2,5-exo,8,9,10,10-HpCB 0.8765 0.8536 0.8524 0.8534 0.8542 0.8537 0.8539 19c,d 2-endo,3-exo,5-endo,6-exo,8,8,10,10-octachlorobornane (OCB) 0.8161 0.8598 0.8588 0.8597 0.8603 0.8599 0.8601 c,d 20 2-exo,3-exo,5-endo,8,9,10,10-HpCB 0.8282 0.866 0.8652 0.8659 0.8664 0.8661 0.8662 21b,c 2,2,5,5,9,9,10,10-OCB 0.8689 0.8722 0.8715 0.8722 0.8725 0.8723 0.8724 22b,c 2,2,3-exo,5-endo,6-exo,8,9,10-OCB 0.8799 0.8784 0.8779 0.8784 0.8786 0.8785 0.8786 23b,c 
2-endo,3-exo,5-endo,6-exo,8,9,10,10-OCB 0.8825 0.8846 0.8843 0.8846 0.8847 0.8847 0.8848 24a,d 2-exo,3-endo,5-exo,8,9,9,10,10-OCB 0.8857 0.8908 0.8906 0.8909 0.8907 0.8909 0.891 25a,d 2,2,5-endo,6-exo,8,8,9,10-OCB 0.8918 0.8969 0.897 0.8971 0.8968 0.8971 0.8971 a,d 26 2-endo,3-exo,5-endo,6-exo,8,8,9,10-OCB 0.8918 0.9031 0.9034 0.9034 0.9029 0.9033 0.9033 27b,c 2-exo,5,5,8,9,9,10,10-OCB 28c,d 2,2,5-endo,6-exo,8,9,10,10-OCB 0.8986 0.9093 0.9232 0.9155 0.9097 0.9161 0.9096 0.9158 0.909 0.9151 0.9096 0.9158 0.9095 0.9157 29a,b 2,2,5,5,8,9,10,10-OCB 0.9347 0.9217 0.9225 0.9221 0.9212 0.922 0.9219 30c,d 2-endo,3-exo,6-exo,8,8,9,10,10-OCB 0.9388 0.9279 0.9289 0.9283 0.9273 0.9282 0.928 0.9342 c,d 31 2,2,5-endo,6-exo,8,9,9,10-OCB 2,2,5,5,6-exo,8,9,10-OCB 0.9438 0.9341 0.9352 0.9346 0.9334 0.9344 32c,d 2,2,3-exo,6-exo,8,9,10,10-OCB 0.9451 0.9402 0.9416 0.9408 0.9395 0.9406 0.9404 33c,d 2-endo,3-exo,5-endo,6-exo,8,8,9,10,10-nonachlorobornane 0.9239 0.9464 0.948 0.947 0.9456 0.9468 0.9466 0.9262 0.9526 0.9543 0.9533 0.9517 0.953 0.9528 (NCB) 34c,d 2-exo,3,3,5-exo,6-endo,9,9,10,10-NCB a,d 35 0.9394 0.9738 0.9534 0.9236 0.9612 0.9287 0.9299 36a,d 2,2,3-exo,5-endo,6-exo,8,9,10,10-NCB 2,2,3-exo,5,5,9,9,10,10-NCB 0.9575 0.9589 0.9343 0.9281 0.9489 0.9331 0.9349 37c,d 2-exo,3,3,5-exo,6-endo,8,9,10,10-NCB 0.9575 0.9778 0.9506 0.9391 0.9657 0.9602 0.9435 38c,d 2,2,5-endo,6-exo,8,8,9,10,10-NCB 0.9618 0.941 0.9407 0.9527 0.9564 0.9385 0.9547 39c,d 2,2,3-exo,5,5,8,9,10,10-NCB 0.9719 1.0002 0.9804 0.9745 1.0038 0.9869 0.9821 40a,d 2,2,5-endo,6-exo,8,9,9,10,10-NCB 0.977 0.9975 0.9172 0.9527 0.9929 0.9494 0.9547 1 0.9956 1.0458 1.0619 0.9842 1.0746 1.0072 1.0678 1.0161 1.0552 0.9985 1.0733 1.0102 1.0715 c,d 41 2,2,5,5,8,9,9,10,10-NCB 42c,d 2,2,3-exo,5-endo,6-exo,8,9,9,10,10-decachlorobornane (DCB) 43b,c 2-exo,3,3,5-exo,6-endo,8,9,9,10,10-DCB 1.0892 1.0656 1.0773 1.0792 1.0589 1.0569 1.0805 44a,d 2-exo,3,3,6,6,8,8,9,10,10-DCB 1.1133 1.0916 1.0914 1.1035 1.0904 1.0853 1.1103 45c,d 
2,2,3-exo,5,5,8,9,9,10,10-DCB 1.1163 1.0686 1.1367 1.1147 1.0796 1.1206 1.1192 0.655 0.7388 0.6976 0.6045 0.7125 0.7043 0.6541 0.687 0.7018 0.6403 0.6853 a,b 46 5-endo,6-exo,8,9,10-pentachlorobornene (PeCB-en) 47b,c 2,6-exo,9,9,10,10-hexachlorobornene (HxCB-en) 0.5904 0.6567 0.744 123 Environ Chem Lett Table 1 continued No. Chemical name RRT PCR (1) PCR (2) PCR (3) PLS (1) PLS (2) PLS (3) 48c,d 5,5,9,9,10,10-HxCB-en 0.6867 0.7 0.7014 0.6916 0.6944 0.6848 0.6947 49c,d 5,5,8,9,10,10-HxCB-en 0.711 0.7391 0.7355 0.7417 0.7462 0.7416 50b,c 2,5-endo,8,9,10,10-HxCB-en 0.7439 0.6899 0.7505 0.7506 0.7015 0.7412 0.7487 0.7521 c,d 51 0.7433 0.7749 0.7612 0.7518 0.7674 0.7454 52c,d 2,5-endo,6-exo,8,9,10,10-HpCB-en 0.7588 0.7555 0.7671 0.7512 0.7445 0.746 0.7516 53a,d 2,5,5,8,9,10,10-HpCB-en 0.7733 0.7821 0.8075 0.8012 0.7948 0.809 0.803 54c,d 5,5,8,9,9,10,10-HpCB-en 0.7756 0.767 0.7607 0.7596 0.7766 0.7584 0.7626 55c,d 3,5-exo,6-endo,8,9,10,10-HpCB-en 0.7881 0.78 0.7742 0.7656 0.7709 0.7939 0.7642 c,d 56 2,5,5,9,9,10,10-heptachlorobornene (HpCB-en) 0.738 0.8458 0.862 0.8299 0.8253 0.8417 0.8525 0.8204 57a,d 2,5-endo,6-exo,8,9,9,10,10-OCB-en 3,5-exo,6-endo,8,9,9,10,10-octachlorobornene (OCB-en) 0.8637 0.8117 0.8189 0.8105 0.8045 0.7997 0.8073 58c,d 2,5,5,8,9,9,10,10-OCB-en 0.8711 0.8625 0.8523 0.861 0.8671 0.8645 0.8592 59c,d 2-exo,5,5,6-exo-tetrachloro,3-exo-chloromethyl,2-endodichloromethyl, 3-endo-methylnorbornane 0.8066 0.8507 0.8289 0.8247 0.8441 0.823 0.8172 60c,d 3-exo,5,5,6-exo-tetrachloro,2-exo-hloromethyl,3-endodichloromethyl, 2-endo-methylnorbornane 0.8335 0.8028 0.848 0.8168 0.8096 0.8247 0.8092 61a,d 3-exo,5,5,6-exo-tetrachloro,2-exo,3-endo-bis(dichloromethyl),2endo-methylnorbornane 0.9313 0.8968 0.9015 0.8881 0.8945 0.8655 0.8754 62c,d 3-exo,5,5,6-exo-tetrachloro,2,2-bis(chloromethyl),3-endodichloromethylnorbornane 0.9422 0.924 0.9386 0.9386 0.9385 0.9468 0.9272 63c,d 2-exo,5-endo,6-exo-trichloro,3-endo-chloromethyl,2-endo,3exo-bis(dichloromethyl)norbornane 0.9752 
0.9605 0.9763 0.9794 0.9582 0.9707 0.983
64a,b 3-exo,5,5,6-exo-tetrachloro,2-endo-chloromethyl,2-exo,3-endo-bis(dichloromethyl)norbornane 1.0248 1.0387 1.0269 1.0463 1.0409 1.014 1.0306
65b,c 3-exo,5,5,6-exo-tetrachloro,2-exo-chloromethyl,2-endo,3-endo-bis(dichloromethyl)norbornane 1.0374 1.0699 1.0129 1.0463 1.0646 1.015 1.0306
66a,d 3-endo,5,5,6-exo-tetrachloro,2-endo-chloromethyl,2-exo,3-endo-bis(dichloromethyl)norbornane 1.0416 1.0387 1.0268 1.0463 1.0409 1.014 1.0306
67b,c 2,2,3-exo-trichloro,5-endo-chloromethyl,6-(E)-chloromethylene,5-exo-dichloromethylnorbornane 0.8073 0.7269 0.8046 0.8042 0.6894 0.8129 0.8028
a The compounds used in the SOM prediction set
b The compounds used in the RS prediction set
c The compounds used in the calibration set for SOM
d The compounds used in the calibration set for RS

Variable selection and model building

A total of 42 descriptors were initially calculated by ChemSAR for the entire data set of 67 compounds. This number was reduced to 35 by eliminating the descriptors that were deemed insignificant (i.e., those whose one-parameter correlation coefficient with the property was less than 0.1). The samples were split into calibration and prediction sets (49 calibration samples, 18 prediction samples) by SOM and by random sampling (RS), and the data were preprocessed by autoscaling. Genetic-algorithm-based partial least squares regression (GAPLS), following the procedure discussed in Leardi's article and in our previous article (Ghasemi and Ahmadi 2007), was then applied to the training data set. Because each GA run gives a slightly different model, ten GA runs were performed on the data set, with the goal of verifying the robustness of the predictive ability and of the set of selected variables. If a variable is present in only one model, it can be concluded that it was selected just by chance, and it can therefore be disregarded in the final model.
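The preprocessing and consensus steps just described (dropping descriptors with a correlation below 0.1, autoscaling, and discarding variables that appear in only one GA run) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB procedure: the function names, the toy data, and the use of the absolute value of the Pearson coefficient are all choices made here for illustration.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, y, threshold=0.1):
    """Keep descriptor columns whose |r| with the response is >= threshold."""
    return {name: col for name, col in columns.items()
            if abs(pearson_r(col, y)) >= threshold}


def autoscale(col):
    """Mean-center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]


def consensus_variables(runs, min_count=2):
    """Variables selected in at least min_count of the GA runs."""
    counts = Counter(name for run in runs for name in set(run))
    return sorted(name for name, c in counts.items() if c >= min_count)


# Toy data, hypothetical and not taken from Table 2.
y = [0.70, 0.75, 0.80, 0.90, 1.00]
columns = {
    "size": [295, 338, 349, 405, 534],   # tracks y closely
    "noise": [0, 1, -2, 1, 0],           # nearly uncorrelated with y
}
kept = filter_descriptors(columns, y)
scaled = {name: autoscale(col) for name, col in kept.items()}

# Hypothetical outcome of three GA runs: "c" appears only once and is dropped.
selected = consensus_variables([["a", "b"], ["a", "c"], ["a", "b"]])
```

The threshold of 0.1 and the "more than one run" rule mirror the text; how the original software treats the sign of the correlation is not stated in the source.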
Finally, we selected 11 variables among five runs (Table 3), and these variables were used as input for the regression software in MATLAB (version 7.0, MathWorks, Inc.). PCR and PLS regression (PLS_Toolbox, version 2.1, Eigenvector, USA) and other calculations were performed in the MATLAB environment.

Fig. 1 The structure of polychlorinated monoterpenes (skeletons shown: bornane; bornene; 2,2,3-trimethylnorbornane (dihydrocamphene); 2,2-dimethyl-3-methylene norbornane (camphene))

Validation of the models

Validation is a crucial aspect of any QSRR model, as it demonstrates the predictive ability of the model obtained. Golbraikh and Tropsha (2002) suggested that for a QSPR model to have high predictive power, the following conditions should be fulfilled: (1) a high cross-validated $Q^2$ value; (2) a correlation coefficient $R$ between the predicted and observed activities of compounds from an external test set close to 1; (3) at least one (but better both) of the correlation coefficients for regression through the origin (predicted versus observed activities, or observed versus predicted activities) close to $R^2$; (4) at least one slope of the regression lines through the origin close to 1.

In this article, the cross-validated $R^2$ (i.e., $Q^2$) was assessed, and a high value of $Q^2$ (greater than 0.5) can be regarded as evidence of the high predictive ability of the model. It is defined as $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SS}$, where $\mathrm{PRESS} = \sum_i (\hat{y}_i - y_i)^2$ is the predicted residual sum of squares of the deleted data in the cross-validation (predicted value minus experimental value, squared), and $\mathrm{SS} = \sum_i (y_i - \bar{y})^2$ is the sum of squares of Y corrected for the mean (experimental value minus mean experimental value, squared). However, a high value of $Q^2$ appears to be a necessary but not sufficient condition for a model to have high predictive power, so adequate prediction of an external set, i.e., a test set, is necessary in addition to $Q^2$. In all cases, $Q^2$ and the RMSE of the calibration set (RMSEC) were calculated to assess the quality of the models. An external test set was chosen to verify the PLSR model, and the RMSE of the external test set (RMSEP) was also examined.

Results and discussion

RS and SOM data splitting method, PLS and PCR modeling

The PLS and PCR analyses were carried out using all 35 variables on the RRTs of the 49 calibration PCMTs selected with RS and with SOM, for a comparative study of the SOM selection. The training and external test sets of PLS (1) and PCR (1) were generated by RS data splitting, whereas those of PCR (2) and PLS (2) were created by SOM data splitting. Prior to the PLS analysis, the data set was autoscaled to unit variance to give each variable equal importance. Leave-one-out cross-validation was used to determine the number of significant PLS components. The optimal number of PCs, which models the useful information while avoiding overfitting, was determined by the method of Haaland and Thomas (Haaland and Thomas 1988; Ghasemi et al. 2003). Table 4 gives the statistical parameters of the models that were generated. The numbers of latent variables and principal components for PCR (1), PCR (2), PLS (1), and PLS (2) were 10, 12, 6, and 9, respectively. The RMSEC values for PCR (1), PCR (2), PLS (1), and PLS (2) were 0.029, 0.022, 0.027, and 0.021, respectively. These results indicate that SOM data splitting improves the data modeling and that the PLS models reach comparable results with fewer LVs than the PCs needed for PCR.

Variable selection, PLS and PCR modeling

PCR and PLS regression have advantages over ordinary multiple linear regression (MLR) because they can deal with collinearity in the predictor variables. Also, the PLS model converges to an MLR solution if the number of extracted LVs is equal to the rank of the sample variable space.
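The leave-one-out statistics used for validation (Q2 = 1 - PRESS/SS and the RMSE values) can be sketched in a few lines. To stay self-contained, the sketch below substitutes a one-predictor least-squares line for the PLS and PCR models of the paper, so it only illustrates how PRESS, SS, and RMSE are accumulated; the function names and the toy data are assumptions introduced here.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS/SS for the one-predictor stand-in model."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without sample i, then predict the left-out sample.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (slope * x[i] + intercept - y[i]) ** 2
    mean_y = sum(y) / len(y)
    ss = sum((v - mean_y) ** 2 for v in y)  # sum of squares about the mean
    return 1.0 - press / ss


def rmse(observed, predicted):
    """Root-mean-square error, usable as RMSEC or RMSEP depending on the set."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Toy data, hypothetical and unrelated to the PCMT data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
q2 = q2_loo(x, y)
```

With a multivariate model, only `fit_line` and the prediction step would change; the accumulation of PRESS over the left-out samples stays the same.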
In order to estimate the real predictive power of genetic-algorithm-based variable selection, PLS and PCR were applied to the SOM calibration set with 35 descriptors (PCR (2) and PLS (2)) and with the 11 GAPLS-selected descriptors (PCR (3) and PLS (3)). Leave-one-out cross-validation was used to determine the number of significant latent variables (LVs) for PLS and significant principal components (PCs) for PCR. The numbers of LVs and PCs for PCR (2), PCR (3), PLS (2), and PLS (3) were 12, 8, 9, and 7, respectively. The RMSEC values for PCR (2), PCR (3), PLS (2), and PLS (3) were 0.022, 0.020, 0.021, and 0.020, respectively. Therefore, the PCR (3) and PLS (3) models, which use fewer descriptors than PCR (2) and PLS (2), were considered to be the best QSRR models. As Fig. 3b and d indicate, the correlation between predicted and observed values for PLS (3) and PCR (3) was good (R = 0.971 and 0.973, respectively).

Table 2 Molecular descriptors selected to construct the QSRR model and experimental relative retention times of PCMTs
TC CP LogP MR MW NRBO ShapA WIndx RRT 46.52 295 0.6887 23.61 338 0.7181 29.32 349 0.7667 14.0625 29.32 349 0.7791 15.05882 12.51 405 0.7639 379.3653 2 15.05882 14.73 405 0.7756 379.3653 3 15.05882 9.68 404 0.7875 No.
BIndx CLSC 1 46712 15 830.7015 254.9191 4.2773 67.7133 310.4752 2 13.06667 2 60718 16 806.0936 263.4179 4.7322 74.1476 344.9203 1 14.0625 3 62604 16 875.1198 269.9637 4.537 71.7992 344.9203 3 14.0625 4 62604 16 875.1198 269.9637 4.537 71.7992 344.9203 3 5 81802 17 863.3268 284.3654 4.8285 77.0817 379.3653 2 6 81805 17 849.2226 279.7424 4.7719 78.4251 7 81550 17 895.4147 285.5226 4.947 76.0581 8 91954.43 17.57143 894.8005 292.3937 5.001557 79.41373 399.0482 2.857143 15.62801 9 97878.93 17.89286 904.3935 297.353 80.80201 410.1199 3 5.086557 15.94821 G 2.942857 437.4286 0.7921 -2.23893 455.8929 0.8319 -7.42071 474.3571 0.8359 10 103803.4 18.21429 913.9865 302.3123 5.171557 82.19029 421.1915 3.142857 16.2684 11 12 109727.9 115652.4 18.53571 18.85714 923.5796 307.2717 5.256557 933.1726 312.231 5.341557 83.57858 432.2631 3.285714 16.58859 84.96686 443.3347 3.428571 16.90879 -12.6025 -17.7843 492.8214 0.8423 511.2857 0.8432 13 121576.9 19.17857 942.7656 317.1903 5.426557 86.35514 454.4064 3.571429 17.22898 -22.9661 529.75 14 127501.4 19.5 952.3586 322.1496 5.511557 87.74342 465.478 3.714286 17.54918 -28.1479 548.2143 0.8472 15 133425.9 19.82143 961.9516 327.109 89.1317 476.5496 3.857143 17.86937 -33.3296 566.6786 0.8521 16 139350.4 20.14286 971.5447 332.0683 5.681557 90.51999 487.6212 4 18.18957 -38.5114 585.1429 0.8592 17 145274.9 20.46429 981.1377 337.0276 5.766557 91.90827 498.6929 4.142857 18.50976 -43.6932 603.6071 0.8757 18 151199.4 20.78571 990.7307 341.9869 5.851557 93.29655 509.7645 4.285714 18.82995 -48.875 622.0714 0.8765 19 157123.9 21.10714 1000.324 346.9463 5.936557 94.68483 520.8361 4.428571 19.15015 -54.0568 640.5357 0.8161 20 163048.4 21.42857 1009.917 351.9056 6.021557 96.07311 531.9077 4.571429 19.47034 -59.2386 659 21 168972.9 21.75 356.8649 6.106557 97.4614 -64.4204 677.4643 0.8689 22 174897.4 22.07143 1029.103 361.8242 6.191557 98.84968 554.051 -69.6021 695.9286 0.8799 23 180821.9 22.39286 1038.696 366.7835 6.276557 100.238 565.1226 5 20.43093 
-74.7839 714.3929 0.8825 24 186746.4 22.71429 1048.289 371.7429 6.361557 101.6262 576.1943 5.142857 20.75112 -79.9657 732.8571 0.8857 25 192670.9 23.03571 1057.882 376.7022 6.446557 103.0145 587.2659 5.285714 21.07131 -85.1475 751.3214 0.8918 26 198595.4 23.35714 1067.475 381.6615 6.531557 104.4028 598.3375 5.428571 21.39151 -90.3293 769.7857 0.8918 27 28 204519.9 210444.4 23.67857 1077.068 24 1086.661 386.6208 6.616557 105.7911 391.5802 6.701557 107.1794 609.4091 5.571429 21.7117 620.4808 5.714286 22.0319 -95.5111 -100.693 788.25 0.8986 806.7143 0.9232 29 216368.9 24.32143 1096.254 396.5395 6.786557 108.5677 631.5524 5.857143 22.35209 -105.875 825.1786 0.9347 30 222293.4 24.64286 1105.847 401.4988 6.871557 109.9559 642.624 22.67229 -111.056 843.6429 0.9388 31 228217.9 24.96429 1115.44 406.4581 6.956557 111.3442 653.6956 6.142857 22.99248 -116.238 862.1071 0.9438 32 234142.4 25.28571 1125.033 411.4175 7.041557 112.7325 664.7673 6.285714 23.31268 -121.42 880.5714 0.9451 33 240066.9 25.60714 1134.626 416.3768 7.126557 114.1208 675.8389 6.428571 23.63287 -126.602 899.0357 0.9239 34 245991.4 25.92857 1144.219 421.3361 7.211557 115.5091 686.9105 6.571429 23.95306 -131.784 917.5 0.9262 35 132490 19 876.2405 306.2372 5.7945 87.9687 448.2555 2 17.05263 -22.33 528 0.9394 36 132201 19 911.8295 310.3459 5.7026 86.6037 448.2555 3 17.05263 -22.11 527 0.9575 37 132948 19 911.8295 310.3459 5.7878 86.3969 448.2555 3 17.05263 -22.11 530 0.9575 38 132976 19 901.5455 308.6744 5.4338 87.4543 448.2555 3 17.05263 -16.84 530 0.9618 39 132944 19 897.9521 305.723 5.7312 87.7403 448.2555 3 17.05263 -19.89 530 0.9719 40 132976 19 901.5455 308.6744 5.4338 87.4543 448.2555 3 17.05263 -16.84 530 0.977 41 133955 19 887.656 304.0515 5.2902 88.8531 448.2555 3 17.05263 -14.62 534 1 42 43 166404 167229 20 20 921.8726 324.2333 5.9308 921.8726 324.2333 6.016 91.6578 91.451 482.7005 3 482.7005 3 18.05 18.05 -36.48 -36.48 600 603 1.0458 1.0892 44 166412 20 907.9524 319.6104 5.8742 93.0012 482.7005 3 
18.05 -34.26 600 1.1133 45 167225 20 907.9524 319.6104 5.9594 92.7944 482.7005 3 18.05 -34.26 603 1.1163 46 46701 15 859.5637 246.6665 4.17 68.2943 308.4593 3 13.06667 78.92 295 0.655 47 61786 16 837.0197 260.7434 3.9848 73.9573 342.9044 2 14.0625 60.19 344 0.6567 48 62492 16 831.0656 256.4452 4.3179 74.9756 342.9044 2 14.0625 64.33 348 0.6867 123 1019.51 5.596557 542.9794 4.714286 19.79054 4.857143 20.11073 6 0.8468 0.8282 Environ Chem Lett Table 2 continued WIndx RRT 66.77 350 0.711 62.63 350 0.7439 15.05882 42.77 405 0.7433 377.3495 3 15.05882 42.99 404 0.7588 377.3495 3 15.05882 45.21 407 0.7733 79.8013 377.3495 3 15.05882 52.4 408 0.7756 78.12 377.3495 3 15.05882 42.99 407 0.7881 890.2511 289.6753 4.4039 83.1741 411.7945 3 16.05556 28.62 470 0.8458 18 890.2511 289.6753 4.4909 83.1187 411.7945 3 16.05556 28.62 467 0.8637 18 876.3033 285.0524 4.3473 84.5175 411.7945 3 16.05556 30.84 470 0.8711 83441 17 851.0618 281.4139 5.0636 77.4298 379.3653 2 15.05882 9.46 413 0.8066 60 83245 17 851.0618 281.4139 5.2358 77.1676 379.3653 2 15.05882 9.46 412 0.8335 61 62 107959 108368 18 18 861.1545 295.3013 5.464 882.951 294.787 5.4007 82.2217 81.9933 413.8104 2 413.8104 3 16.05556 16.05556 -4.91 -2.47 478 480 0.9313 0.9422 TC CP LogP MR MW NRBO ShapA No. 
BIndx CLSC 49 62818 16 855.5327 255.931 4.2546 74.7472 342.9044 3 14.0625 50 62801 16 859.8115 260.2291 3.8527 73.8057 342.9044 3 14.0625 51 81805 17 844.0739 271.6793 4.1824 79.6918 377.3495 2 52 81578 17 880.198 4.2627 78.0646 53 82172 17 866.2775 271.1651 4.1191 79.4634 54 82410 17 865.599 269.8184 4.4828 55 82168 17 880.198 275.788 56 106089 18 57 105430 58 106092 59 275.788 4.1757 G 63 65703 16 851.4055 263.8721 4.3106 72.3873 365.3388 2 14.0625 9.58 366 0.9752 64 138004 19 893.0126 308.6744 5.6289 87.0474 448.2555 3 17.05263 -16.84 550 1.0248 65 138004 19 893.0126 308.6744 5.6289 87.0474 448.2555 3 17.05263 -16.84 550 1.0374 66 138004 19 893.0126 308.6744 5.6289 87.0474 448.2555 3 17.05263 -16.84 550 1.0416 67 50027 15 819.8987 238.5759 3.9697 68.5138 328.8778 1 13.06667 77.12 316 0.8073

Fig. 2 Occupation of the 8 × 8 Kohonen top-map

The slopes of the predicted-versus-observed regression lines (Fig. 3b, d) were 0.906 and 0.929, and the intercepts were 0.077 and 0.057, respectively. The residuals (Fig. 3e, f) are randomly distributed and fill the negative and positive half-planes almost symmetrically, so we conclude that there was no systematic error.

To explore further which descriptors (i.e., X-variables) can be eliminated from the analysis, we can look at the regression coefficients for the centered and scaled data. In the absence of knowledge about the relative importance of the variables, the standard multivariate approach is to (1) center them by subtracting their averages, and (2) scale each variable to unit variance by dividing it by its standard deviation, so-called autoscaling. This corresponds to giving each variable the same weight and the same prior importance in the analysis (Wold et al. 2001). In PLS modeling, X-variables may be important for the modeling of Y. X-variables with small coefficients (in terms of absolute value) make a small contribution to the response prediction. However, a variable may also be important for the modeling of X, which can be identified by large loadings.
An alternative statistic is the Variable Importance for Projection (VIP), which summarizes the importance of a variable for both Y and X (Wold 1994). The coefficients for the centered and scaled data and the VIP values were calculated and compared (Table 3). According to Wold (1994), if a variable has a relatively small coefficient (in absolute value) and a small value of VIP, it is a prime candidate for deletion. Wold considers a value less than 0.8 to be "small" for the VIP. We find that all 11 variables selected by GAPLS are important for PLS regression.

Table 3 lists the standardized regression coefficients of the PCR (3) and PLS (3) models. The standardized regression coefficients reveal the significance of an individual descriptor in the regression model. As Table 3 shows, the effects of the ideal gas thermal capacity and the Balaban index on the RRT of the PCMTs are more significant than those of the other descriptors in the PCR and PLS models, respectively. The orders of significance of the other descriptors in the PCR and PLS models are WIndx > BIndx > MR > MW > TC > G > CLSC > ShapA > NRBO > LogP and BIndx > WIndx > G > MW > MR > CLSC > ShapA > TC > NRBO > LogP, respectively.

Table 3 The descriptors selected by GAPLS, the PCR and PLS regression coefficients (bs), and VIPs for PLS regression

Descriptor name              Abbreviation  Descriptor type  PCR (3) bs  PLS (3) bs  PLS (3) VIP
Balaban index                BIndx         Steric            2.854       3.240      2.888
Cluster count                CLSC          Steric           -1.492      -1.474      2.728
Critical temperature         TC            Thermodynamic    -2.057      -1.459      2.906
Ideal gas thermal capacity   CP            Thermodynamic    -3.781      -4.385      2.816
Log P                        LogP          Thermodynamic    -0.511      -0.474      2.635
Molecular refractivity       MR            Thermodynamic    -2.568      -1.881      2.753
Molecular weight             MW            Steric            2.244       2.181      2.666
Number of rotatable bonds    NRBO          Steric            1.220       1.082      2.702
Shape attribute              ShapA         Steric           -1.489      -1.468      2.727
Standard Gibbs free energy   G             Thermodynamic    -2.205      -2.431      2.718
Wiener index                 WIndx         Steric            3.993       2.843      2.818
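For reference, the VIP statistic discussed above is commonly written in the PLS literature in the form below. This is the general textbook definition rather than an equation quoted from the present paper, and the symbols are introduced here for illustration: $p$ is the number of X-variables, $A$ the number of latent variables, $w_{ja}$ the PLS weight of variable $j$ in component $a$, and $\mathrm{SSY}_a$ the sum of squares of Y explained by component $a$.

```latex
\mathrm{VIP}_j = \sqrt{\, p \,
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^2}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} }
```

In this form a variable scores highly when it carries large weights in the components that explain most of the variance in Y, which is why VIP is read alongside the regression coefficients rather than instead of them.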
Interpretation of descriptors

Chromatographic retention results from the solvation and partitioning of individual compounds in a stationary phase. The solubility is affected by intermolecular forces, which include hydrogen bonding and ion–dipole, dipole–dipole, and dipole-induced dipole interactions. Solubility is related to the descriptors accounting for the counts of atoms, functional groups or charged moieties, the surface areas of donor and acceptor groups, hydrophobicity, polarizability, and conformational flexibility. Chromatographic partition, in turn, results from adsorption–desorption of compounds on the stationary phase, during which the degrees of freedom of the molecule change. Entropic effects also play important roles. As a result, the chromatographic retention depends mainly on the electronic properties and on the size and shape of the solute molecule.

The Wiener index is the most effective descriptor in the PCR and PLS models. The Wiener index (Wiener 1947) was the first topological index used in chemistry. It is the sum of the distances between all pairs of vertices in a hydrogen-suppressed molecular graph. The correlations between the RRTs of organic compounds and the Wiener index show that the magnitude of the intermolecular interactions between the compounds and the stationary phase is related to the degree of branching of the molecules. The Wiener index, which represents molecular "steric branching", increases as the number of chlorine substituents increases, since it is defined as half the sum of all topological distances in the hydrogen-depleted graph representing the skeleton of the molecule.

The Balaban index, another topological descriptor, characterizes the branching pattern within a molecule (Balaban 1981).
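The two topological indices can be made concrete with a short sketch that works on a hydrogen-suppressed graph given as an adjacency list. It assumes unweighted graphs with integer atom labels and uses the usual graph-theoretical definitions (Wiener: half the sum of all shortest-path distances; Balaban: J = m/(μ+1) · Σ over edges of 1/√(s_u·s_v), with s the distance sums, m the number of edges, and μ the cyclomatic number). Commercial packages such as the one used in the paper may treat heteroatoms or multiple bonds differently, so the values need not match Table 2.

```python
from collections import deque
from math import sqrt


def distance_matrix(adjacency):
    """All-pairs topological distances via breadth-first search."""
    dist = {}
    for start in adjacency:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[start] = seen
    return dist


def wiener_index(adjacency):
    """Half the sum of all entries of the distance matrix."""
    dist = distance_matrix(adjacency)
    return sum(d for row in dist.values() for d in row.values()) // 2


def balaban_index(adjacency):
    """Balaban J index of a connected, unweighted graph."""
    dist = distance_matrix(adjacency)
    dsum = {v: sum(row.values()) for v, row in dist.items()}  # distance sums
    edges = [(u, v) for u in adjacency for v in adjacency[u] if u < v]
    m = len(edges)
    mu = m - len(adjacency) + 1  # cyclomatic number (number of rings)
    return m / (mu + 1) * sum(1 / sqrt(dsum[u] * dsum[v]) for u, v in edges)


# Carbon skeletons of the two butane isomers (illustrative, not PCMTs).
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(n_butane), wiener_index(isobutane))
print(balaban_index(n_butane), balaban_index(isobutane))
```

The branched isomer has the smaller Wiener index and the larger Balaban index, which is the sense in which both descriptors encode branching.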
Descriptors of this type are evaluated from various combinations and weightings of the vertices (atoms) and edges (bonds) of the molecular graph (chemical structure). The positive coefficient of the Balaban index indicates that the presence of chlorine substituents increases retention time. The negative coefficients of the ideal gas thermal capacity descriptor suggest that an increase in this parameter would reduce the RRTs of the molecules. The positive coefficients of the number of rotatable bonds (NRBO) and molecular weight (MW) descriptors suggest that an increase in either would increase the retention time. In general, the rotatable bond descriptor reflects the flexibility of the molecule. For example, an increase in the rotational degrees of freedom may result in a larger diffusional cross section, thereby influencing permeability adversely. On the other hand, such flexibility could decrease the crystallinity of the solute, resulting in improved aqueous solubility and enhanced absorption (Lu et al. 2004).

The MR descriptor is directly proportional to the polarizability (Hansch and Leo 1995), as expressed by $\mathrm{MR} = 4\pi N_A \alpha / 3$, where $N_A$ is the Avogadro constant and $\alpha$ is the polarizability. MR values are based on molecular fragments, whose contributions add to give the overall MR (Hansch and Leo 1995). In QSRR, molecular parameters are used to predict retention times in gas–solid interactions. Polarizability, $\alpha$, is a key factor in intermolecular interactions and also in nonspecific molecule–stationary phase interactions. The negative sign of the MR coefficient in the models shows that increasing the polarizability of the molecules decreases the RRT.

The critical temperature is an important fundamental physical property of organic compounds. It is defined as the temperature above which a gas cannot be liquefied, and it has a negative effect on the RRT. The standard Gibbs free energy is negatively correlated with RRT. Log P is the logarithm of the partition coefficient of the molecules in an n-octanol/water system and reflects their hydrophobicity. The retention time decreases as the hydrophobicity of the compounds increases. The other descriptors, such as shape attribute and cluster count, are topological indices that affect intermolecular and molecule–stationary phase interactions; both have a negative effect on the RRT of the PCMTs.

Fig. 3 Predicted and residual versus experimental relative retention times for PCMTs by PLS (3) and PCR (3) models. Panels (a)–(d): predicted RRT versus observed RRT; panels (e) and (f): residuals versus observed RRT for the training and test sets

Validation of the model

The correlations between observed and predicted RRT for PLS (3) and PCR (3) were acceptable, with R = 0.971 and 0.973, intercepts of 0.077 and 0.057, and RMSEP = 0.026 and 0.025, respectively (Fig. 3b, d, Table 4). The data presented in Table 4 indicate that the PLS (3) and PCR (3) models have good statistical quality with low prediction errors, while the PLS model uses fewer latent variables.

Table 4 Statistical results of comparing regression models

Model    Data splitting method  No. of descriptors  No. of LVs or PCs  RMSEC  RMSEP  Q2     R2cal  R2test
PCR (1)  RS                     35                  10                 0.350  0.650  0.894  0.960  0.920
PCR (2)  SOM                    11                  12                 0.022  0.032  0.898  0.959  0.919
PCR (3)  SOM                    11                  8                  0.020  0.025  0.971  0.962  0.947
PLS (1)  RS                     35                  6                  0.027  0.038  0.872  0.928  0.917
PLS (2)  SOM                    11                  9                  0.021  0.032  0.935  0.958  0.929
PLS (3)  SOM                    11                  7                  0.020  0.026  0.969  0.960  0.943

Conclusion

A QSRR study was conducted on 67 PCMTs using PCR and PLS procedures. More than 40 theoretically derived descriptors were calculated for each molecule. The data were split into training and test sets with SOM, and the best set of calculated descriptors was selected by a genetic algorithm. Both methods have good statistical quality. The high squared correlation coefficients for the test set (0.947 and 0.943 for PCR and PLS, respectively) and the low prediction errors (RMSEP of 0.025 and 0.026 for PCR and PLS, respectively) confirm the good predictive ability of both models.

References

Balaban AT (1981) Highly discriminating distance based topological index. Chem Phys Lett 89:399–404
Braekevelt E, Tomy GT, Stern GA (2001) Comparison of an individual congener standard and a technical mixture for the quantification of toxaphene in environmental matrices by HRGC/ECNI-HRMS. Environ Sci Technol 35:3513–3518
Ghasemi J, Ahmadi Sh (2007) Combination of genetic algorithm and partial least squares for cloud point prediction of nonionic surfactants from molecular structures. Ann di Chim 97:69–83
Ghasemi J, Saaidpour S (2007) QSPR prediction of aqueous solubility of drug-like organic compounds. Chem Pharm Bull 55:669–674
Ghasemi J, Ahmadi Sh, Torkestani K (2003) Simultaneous determination of copper, nickel, cobalt and zinc using zincon as a metallochromic indicator with partial least squares. Anal Chim Acta 487:181–188
Ghasemi J, Asadpour S, Abdolmaleki A (2007a) Prediction of GC/ECD retention times of chlorinated pesticides, herbicides, and organohalides by multivariate chemometrics methods.
Anal Chim Acta 588:200
Ghasemi J, Saaidpour S, Brown SD (2007b) QSPR study for estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis. J Mol Struct 805:27–32
Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graphics Model 20:269–276
Haaland DM, Thomas EV (1988) Partial least-squares methods for spectral analyses, relation to other quantitative calibration methods and the extraction of qualitative information. Anal Chem 60:1193–1202
Hackenberg R, Schutz A, Ballschmiter K (2003) High-resolution gas chromatography retention data as basis for the estimation of KOW values using PCB congeners as secondary standards. Environ Sci Technol 37:2274–2279
Hale MD, Hileman FD, Mazer T, Shell TL, Noble RW, Brooks JJ (1985) Mathematical modeling of temperature programmed capillary gas chromatographic retention indexes for polychlorinated dibenzofurans. Anal Chem 57:640–648
Hansch C, Leo A (1995) Exploring QSAR: fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
Kohonen T (1997) Self-organizing maps, 2nd edn. Springer-Verlag, Berlin
Leardi R (1994) Application of a genetic algorithm to feature selection under full validation conditions and to outlier detection. J Chemometr 8:65–79
Leardi R, Gonzalez AL (1998) Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometr Intell Lab Syst 41:195–207
Lu JJ, Crimin K, Goodwin JT, Crivori P, Orrenius C, Xing L, Tandler PJ, Vidmar TJ, Amore BM, Wilson AGE, Stouten PFW, Burton PS (2004) Influence of molecular flexibility and polar surface area metrics on oral bioavailability in the rat. J Med Chem 47:6104–6107
Melssen W, Üstün B, Buydens L (2007) SOMPLS: a supervised self-organising map-partial least squares algorithm for multivariate regression problems. Chemometr Intell Lab Syst 86:102–120
Nevalainen T, Koistinen J, Nurmela P (1994) Structure verification, and chromatographic relative retention times for polychlorinated diphenyl ethers. Environ Sci Technol 28:1341–1347
Nikiforov V, Karavan V, Miltsov SA (2000) Relative retention times of chlorinated monoterpenes. Chemosphere 41:467–472
Ong VS, Hites RA (1991) Relationship between gas chromatographic retention … properties of four compound classes. Anal Chem 63:2829–2834
Rayne S, Ikonomou MG (2003) Predicting gas chromatographic retention times for the 209 polybrominated diphenyl ether congeners. J Chromatogr A 1016:235–248
Saleh MA (1991) Toxaphene: chemistry, biochemistry, toxicity and environmental fate. Rev Environ Contam Toxicol 118:1–85
Voldner EC, Li YF (1995) Global usage of selected persistent organochlorines. Sci Total Environ 160/161:201–210
Wiener H (1947) Structural determination of paraffin boiling points. J Am Chem Soc 69:17–20
Wold S (1994) QSAR: chemometric methods in molecular designs. Methods and principles in medicinal chemistry. Verlag Chemie, Weinheim, Germany
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemometr Intell Lab Syst 2:37–52
Wold S, Sjostrom M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab Syst 58:109–130