Annali di Chimica, 97, 2007, p. 69, by Società Chimica Italiana

COMBINATION OF GENETIC ALGORITHM AND PARTIAL LEAST SQUARES FOR CLOUD POINT PREDICTION OF NONIONIC SURFACTANTS FROM MOLECULAR STRUCTURES

Jahanbakhsh GHASEMI(°), Shahin AHMADI

Chemistry Department, Faculty of Sciences, Razi University, Kermanshah, Iran

Summary - Quantitative structure–property relationship (QSPR) analysis has been applied to a series of pure nonionic surfactants containing linear alkyl, cyclic alkyl, and alkyl phenyl ethoxylates. The cloud point of these compounds was modeled as a function of theoretically derived descriptors by multiple linear regression (MLR) and partial least squares (PLS) regression. In this study, a genetic algorithm (GA) was applied as the variable selection method in the QSPR analysis. The results indicate that the GA is a very effective variable selection approach for QSPR analysis. Comparison of the two regression methods showed that PLS has better prediction ability than MLR.

INTRODUCTION

The most common nonionic surfactants are those based on ethylene oxide, referred to as ethoxylated surfactants [1-3]. Several classes can be distinguished: alcohol ethoxylates, alkyl phenol ethoxylates, fatty acid ethoxylates, monoalkanolamide ethoxylates, sorbitan ester ethoxylates, fatty amine ethoxylates and ethylene oxide–propylene oxide copolymers (sometimes referred to as polymeric surfactants). Another important class of nonionics is the multihydroxy products such as glycol esters, glycerol (and polyglycerol) esters, glucosides (and polyglucosides) and sucrose esters. Amine oxides and sulphinyl surfactants represent nonionics with a small head group. Nonionic polyoxyethylene alkyl ethers are ethylene oxide adducts of linear alcohols. They were first synthesized in the early 1930s [4]. Since then, their wetting, detergency, and foaming properties, and their general usefulness as nonionic surfactants, have made them widely used in industrial and household products.
Polyoxyethylene alkyl ethers have the general formula CnH2n+1(OCH2CH2)mOH. They will be referred to as CnEm, with n indicating the number of carbons in the alkyl chain and m the number of ethylene oxide units in the hydrophilic moiety. Polyoxyethylene alkyl ethers are often selected as surfactants for analytical separations because they do not absorb UV light, they are only slightly sensitive to ions, and they offer a wide range of polarities. Many aqueous solutions of nonionic surfactants separate into two phases on heating above a particular temperature, the clouding temperature. The homogeneous solution becomes turbid (cloudy) and separates into a water-rich phase and an organic surfactant-rich phase. A solute dissolved in the homogeneous solution at low temperature may partition preferentially into one phase or the other. Cloud-point extraction is commonly used to extract and/or concentrate apolar solutes in the surfactant-rich phase obtained at elevated temperature [5-7]. For this purpose, the cloud-point temperature and the molar volume of the nonionic surfactant used are needed.

(°) Corresponding author. Fax: (+98831)8369572. E-mail: [email protected]

Quantitative structure-activity relationship (QSAR) studies attempt to correlate structural molecular descriptors with functions (i.e. physicochemical properties, biological activities, toxicity etc.) for a set of similar compounds by means of statistical methods. As a result, a simple mathematical relationship is established:

Function = f (structural molecular or fragment properties)

QSAR techniques range from chemical measurements and biological assays to statistical techniques and the interpretation of results [8, 9]. Three main methods are often applied in QSAR research:
(1) Stepwise regression (SR) can select important variables [10]. However, SR is limited in the number of selected variables.
(2) Artificial neural networks (ANN), which are widely used in nonlinear QSAR modeling, have achieved substantial positive results [11]. However, the back-propagation algorithm has its shortcomings, namely overfitting.
(3) Partial least squares (PLS) [12-15], which is based on the fundamentals of factor analysis, is applied where there are many variables but not enough samples or observations. PLS has been applied with great success in many fields of applied science [16]. In chemometrics, it is one of the favored methods of analysis.

Genetic algorithms (GA) were developed to mimic some of the processes observed in natural evolution and provide an efficient strategy for searching for global optima. Genetic algorithms have been successfully applied to feature selection in regression analysis [17-21]. Moreover, an approach combining GA with PLS (GAPLS) has also been proposed for variable selection in QSAR and QSPR studies [22-24]. The purpose of variable selection is to select the variables that contribute significantly to prediction and to discard the other variables by means of a fitness function. The hybrid method, which integrates GA as a powerful optimization tool and PLS as a robust statistical tool, can lead to a significant improvement in QSAR and QSPR models.

THEORY

GAPLS

The GA algorithm is described in [17, 18, 38] and is implemented in Matlab 4. Reference [38] covers the details of the settings required in the software, summarized here in TABLE 3. GAPLS is a hybrid approach that combines GA [25] as a powerful optimization method with PLS [12-15] as a robust statistical method for variable selection. In GAPLS, a combination of variables corresponds to a chromosome in GA, and the internal predictivity of the derived PLS model corresponds to its fitness. GAPLS consists of three basic steps.
(1) An initial population of chromosomes is created. Each chromosome is a binary bit string in which each bit represents the presence or absence of a variable.
(2) The fitness of each chromosome in the population is evaluated by the internal predictivity of PLS.
(3) The population of chromosomes for the next generation is reproduced. Three operations on the chromosomes, i.e. selection, cross-over and mutation, are carried out in this step.
In the overall scheme, steps 2 and 3 are repeated until the designated number of generations is reached.

Cross validation is used during the GA procedure. In this case, the data are split into five deletion groups. The deletion groups are created by taking every fifth sample (e.g. 1, 6, 11, 16, ... as one group; 2, 7, 12, 17, ... as the next group). The data are sorted by the cloud point of the component of interest before the deletion groups are formed, so that the deletion groups sample the data space uniformly. The cross validation then consists of five steps, each time leaving out one group, building a model on the other four, and predicting the left-out group. The number of components is selected according to the minimum value of the root mean square error of cross validation (RMSECV). The validation data are not used at all until the GA procedure is complete and a final model is selected. The validation data are then predicted using this final model. The last step of the GA is a stepwise approach in which the variables are entered according to the smoothed value of their frequencies of selection, each time computing the % cross-validated explained variance and the RMSECV.
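
The deletion-group cross validation described above can be sketched in a few lines of Python. This is a minimal illustration, not the genpls implementation: the function names `deletion_groups` and `rmsecv` are hypothetical, and a mean-only predictor stands in for the PLS model so that the example stays self-contained. The toy response values are the first ten cloud points of TABLE 1.

```python
import math

def deletion_groups(y, n_groups=5):
    """Sort sample indices by response, then deal them into n_groups groups
    so that every n_groups-th sorted sample lands in the same group."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    return [order[g::n_groups] for g in range(n_groups)]

def rmsecv(y, groups, fit, predict):
    """Leave each group out in turn, fit on the remaining samples,
    predict the held-out samples and pool the squared errors."""
    sq_err = 0.0
    for held_out in groups:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit(train)
        for i in held_out:
            sq_err += (y[i] - predict(model, i)) ** 2
    return math.sqrt(sq_err / len(y))

# Toy data: first ten cloud points (deg C) from TABLE 1.
y = [44.5, 0.0, 40.5, 63.8, 75.0, 83.0, 27.6, 7.0, 38.5, 58.6]
groups = deletion_groups(y)

# Stand-in model (assumption): predict the training-set mean instead of PLS.
fit_mean = lambda train: sum(y[i] for i in train) / len(train)
predict_mean = lambda model, i: model

print(groups)
print(rmsecv(y, groups, fit_mean, predict_mean))
```

Sorting before dealing the samples into groups is what makes each group span the full range of cloud points; replacing `fit_mean` and `predict_mean` with a PLS fit would give the RMSECV used as the GA response.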
A crucial point in the previous algorithm is the choice of the number of variables to be taken into account (selecting the solution corresponding to the global maximum often leads to overfitting). This decision is usually made by visually inspecting the plot of the % cross-validated explained variance (or of the RMSECV) versus the number of variables in the model, looking for the number of variables beyond which no "significant" increase of the response (decrease in RMSECV) takes place. This analysis requires some time, and the decision about the selected model can involve a high degree of subjectivity. To have a sounder statistical approach, the following algorithm has been implemented:
• detect the global minimum of RMSECV;
• by using an F-test (P < 0.1, d.f. = number of samples in the training set − 1, both in the numerator and in the denominator), select a "threshold value" corresponding to the highest RMSECV that is not significantly different from the global minimum;
• look for the solution with the lowest number of variables having an RMSECV lower than the "threshold value".
In this way, the most parsimonious model among all the models that are not significantly different from the global optimum is selected. This modification generally leads to models having a slightly better root mean square error of prediction (RMSEP).

Because each GA run gives a slightly different model, several GA runs are performed on each data set, with the goal of verifying the robustness of the predictive ability and of the selected variables. As a rule of thumb, it has been found that the performance of the algorithm decreases when more than 200 variables are used [24]. This is because a higher variables/objects ratio increases the risk of overfitting and because the size of the search domain becomes too great.
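
The three-step selection rule above can be written compactly. The sketch below is hedged: it assumes the F-test is applied to the ratio of squared RMSECV values, so that the threshold equals the minimum RMSECV multiplied by the square root of the critical F value. The function name `parsimonious_model`, the toy RMSECV curve and the value `f_crit=1.4` are illustrative assumptions only; the critical value should be taken from an F table for the actual degrees of freedom.

```python
import math

def parsimonious_model(rmsecv_by_nvars, f_crit):
    """rmsecv_by_nvars maps number of variables -> RMSECV.
    Returns (smallest acceptable number of variables, threshold)."""
    rmse_min = min(rmsecv_by_nvars.values())       # step 1: global minimum
    threshold = rmse_min * math.sqrt(f_crit)       # step 2: F-test threshold (assumed form)
    # step 3: fewest variables whose RMSECV does not exceed the threshold
    # (<= is used so that the global minimum itself always qualifies)
    n_vars = min(n for n, r in rmsecv_by_nvars.items() if r <= threshold)
    return n_vars, threshold

# Hypothetical RMSECV curve; f_crit is an illustrative value, not a looked-up one.
curve = {1: 20.0, 2: 12.0, 3: 8.0, 4: 6.5, 5: 6.1, 6: 6.0, 7: 6.2}
n_vars, threshold = parsimonious_model(curve, f_crit=1.4)
print(n_vars, round(threshold, 2))
```

In this toy curve the global minimum lies at six variables, but the four-variable model already falls under the threshold and would be the one retained.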
In QSAR studies, it is important to obtain a model containing as few variables as possible, because this leads to a simple and interpretable model. Therefore, the quality of a chromosome is determined both by the internal predictivity it gives and by the number of variables it uses. In order to increase the quality of the chromosomes in the population, the best chromosome using a given number of variables is protected unless a chromosome with a lower number of variables gives better internal predictivity. The protected chromosomes in the final population of the GA can be regarded as the important combinations of variables.

Stepwise multiple linear regression

Multiple linear regression (MLR) is an extension of the classical regression method to more than one dimension [44]. MLR calculates a QSAR equation by performing standard multivariable regression calculations using multiple variables in a single equation. Stepwise multiple linear regression is a commonly used variant of MLR. In this case a multiple-term linear equation is also produced, but not all independent variables are used. The variables are added to the equation one at a time and a new regression is performed after each addition. The new term is retained only if the equation passes a test for significance. This regression method is especially useful when the number of variables is large and when the key descriptors are not known. Usually, molecular descriptor matrices cannot be used directly as independent variables in MLR analysis because of their lack of homogeneity and the high correlation between descriptors; in addition, the number of descriptors is often larger than the number of compounds, and some of them may be redundant. Thus, prior to the MLR analysis, a reduction of variables is normally necessary in order to obtain a concentrated set of significant, mutually uncorrelated underlying variables while losing the minimum amount of information.

MATERIAL AND METHODS

Equipment

A Pentium IV personal computer with the Windows XP operating system was used.
GA variable selection was carried out with the genpls Matlab code (written by R. Leardi). Backward multiple regression analysis and partial least squares were performed with the XLSTAT 2006 version 2006.2 add-in software (XLSTAT company). Equations were justified by the correlation coefficient (R), the standard error of estimates (SE), the significance index for analysis of variance (F), and the significance level (p) [43]. The contribution of each parameter to the inhibition variations was calculated from the square of the correlation coefficient. For the calculation of molecular descriptors, the HyperChem version 7.5 [26] and Dragon (Milano Chemometrics group, version 3.0) [27] software packages were used.

Cloud point data and descriptor generation

Cloud point data were collected from several literature sources [28-31] (TABLE 1). This data set represents a wide variety of structures in the hydrophobic domain, including linear alkyl, branched alkyl, cyclic alkyl, linear alkylphenyl, and branched alkylphenyl ethoxylates. Cloud point values were found for 68 different structures at a surfactant percentage of 1% w/w.

TABLE 1. - Cloud point temperature of several non-ionic surfactants of the alkyl ethoxylate class for a surfactant percentage of 1% w/w and the corresponding values predicted by the MLR and PLS models.

| No. | Chemical formula | Symbol of surfactant | Cloud point (°C) | Predicted, MLR (°C) | Predicted, PLS (°C) |
|---|---|---|---|---|---|
| 1 | C4H9(OC2H4)1OH | C4E1 | 44.5 | 41.37 | 42.199 |
| 2 | C6H13(OC2H4)2OH | C6E2 | 0.0 | 10.13 | 9.40 |
| 3 | C6H13(OC2H4)3OH | C6E3 | 40.5 | 43.55 | 43.63 |
| 4 P | C6H13(OC2H4)4OH | C6E4 | 63.8 | 54.32 | 52.25 |
| 5 | C6H13(OC2H4)5OH | C6E5 | 75.0 | 76.06 | 76.26 |
| 6 | C6H13(OC2H4)6OH | C6E6 | 83.0 | 83.25 | 83.39 |
| 7 | C7H15(OC2H4)3OH | C7E3 | 27.6 | 17.68 | 16.37 |
| 8 P | C8H17(OC2H4)3OH | C8E3 | 7.0 | 0.05 | 12.24 |
| 9 | C8H17(OC2H4)4OH | C8E4 | 38.5 | 34.61 | 35.09 |
| 10 | C8H17(OC2H4)5OH | C8E5 | 58.6 | 60.06 | 59.61 |
| 11 | C8H17(OC2H4)6OH | C8E6 | 72.5 | 65.74 | 66.46 |
| 12 P | C8H17(OC2H4)8OH | C8E8 | 96.0 | 88.21 | 85.45 |
| 13 | C8H17(OC2H4)9OH | C8E9 | 100.0 | 101.87 | 97.94 |
| 14 | C9H19(OC2H4)4OH | C9E4 | 32.0 | 34.14 | 31.79 |
| 15 | C9H19(OC2H4)5OH | C9E5 | 55.0 | 49.71 | 50.55 |
| 16 | C9H19(OC2H4)6OH | C9E6 | 75.0 | 72.30 | 72.78 |
| 17 | C10H21(OC2H4)4OH | C10E4 | 19.7 | 20.92 | 19.14 |
| 18 | C10H21(OC2H4)5OH | C10E5 | 41.6 | 33.88 | 37.35 |
| 19 | C10H21(OC2H4)6OH | C10E6 | 60.3 | 54.56 | 54.37 |
| 20 | C10H21(OC2H4)8OH | C10E8 | 84.5 | 82.05 | 82.17 |
| 21 | C10H21(OC2H4)10OH | C10E10 | 95.0 | 103.81 | 101.57 |
| 22 | C11H23(OC2H4)4OH | C11E4 | 10.5 | 20.75 | 18.94 |
| 23 | C11H23(OC2H4)5OH | C11E5 | 37.0 | 32.71 | 32.70 |
| 24 | C11H23(OC2H4)6OH | C11E6 | 57.5 | 53.34 | 53.59 |
| 25 | C11H23(OC2H4)8OH | C11E8 | 82.0 | 82.61 | 83.31 |
| 26 P | C12H25(OC2H4)4OH | C12E4 | 6.0 | 9.51 | 23.98 |
| 27 | C12H25(OC2H4)5OH | C12E5 | 28.9 | 35.603 | 34.36 |
| 28 P | C12H25(OC2H4)6OH | C12E6 | 51.0 | 46.22 | 49.98 |
| 29 | C12H25(OC2H4)7OH | C12E7 | 64.7 | 65.82 | 66.98 |
| 30 | C12H25(OC2H4)8OH | C12E8 | 77.9 | 75.65 | 77.61 |
| 31 | C12H25(OC2H4)9OH | C12E9 | 87.8 | 91.31 | 90.94 |
| 32 | C12H25(OC2H4)10OH | C12E10 | 95.5 | 94.80 | 94.03 |
| 33 | C12H25(OC2H4)11OH | C12E11 | 100.3 | 106.87 | 104.07 |
| 34 | C13H27(OC2H4)5OH | C13E5 | 27.0 | 27.22 | 26.80 |
| 35 | C13H27(OC2H4)6OH | C13E6 | 42.0 | 46.25 | 46.30 |
| 36 P | C13H27(OC2H4)8OH | C13E8 | 72.5 | 77.01 | 71.10 |
| 37 | C14H29(OC2H4)5OH | C14E5 | 20.0 | 27.14 | 25.21 |
| 38 | C14H29(OC2H4)6OH | C14E6 | 42.3 | 38.58 | 39.59 |
| 39 P | C14H29(OC2H4)7OH | C14E7 | 57.6 | 57.70 | 56.72 |
| 40 | C14H29(OC2H4)8OH | C14E8 | 70.5 | 68.33 | 70.32 |
| 41 | C15H31(OC2H4)6OH | C15E6 | 37.5 | 37.91 | 37.65 |
| 42 | C15H31(OC2H4)8OH | C15E8 | 66.0 | 66.54 | 67.09 |
| 43 | C16H33(OC2H4)6OH | C16E6 | 35.5 | 28.99 | 29.51 |
| 44 | C16H33(OC2H4)7OH | C16E7 | 54.0 | 48.24 | 48.49 |
| 45 P | C16H33(OC2H4)8OH | C16E8 | 65.0 | 55.93 | 59.06 |
| 46 | C16H33(OC2H4)9OH | C16E9 | 75.0 | 69.68 | 70.53 |
| 47 P | C16H33(OC2H4)10OH | C16E10 | 66.0 | 75.68 | 64.08 |
| 48 | C16H33(OC2H4)12OH | C16E12 | 92.0 | 98.31 | 97.89 |
| 49 | (C2H5)2CHCH2(OC2H4)6OH | IC6E6 | 78.0 | 77.90 | 82.36 |
| 50 | (C4H9)2CHCH2(OC2H4)6OH | IC10E6 | 27.0 | 33.68 | 29.64 |
| 51 P | c-(C12H24)(OC2H4)9OH | XC12E9 | 75.0 | 73.16 | 78.58 |
| 52 P | (C6H13)2CH(OC2H4)9OH | IC13E9 | 35.0 | 58.95 | 48.09 |
| 53 | (C4H9)3CH(OC2H4)9OH | TC13E9 | 34.0 | 33.12 | 28.51 |
| 54 | (C5H11)3CH(OC2H4)12OH | TC16E12 | 48.0 | 49.54 | 54.29 |
| 55 | c-(C16H32)(OC2H4)11OH | XC16E11 | 80.0 | 76.73 | 74.94 |
| 56 | t-C8H17-C6H4(OC2H4)9OH | TC8PE9 | 64.3 | 56.76 | 60.64 |
| 57 | n-C9H19-C6H4-(OC2H4)8OH | NC09PE8 | 34.0 | 43.87 | 41.81 |
| 58 | n-C9H19-C6H4-(OC2H4)9OH | NC09PE9 | 56.0 | 53.46 | 53.18 |
| 59 | n-C9H19-C6H4-(OC2H4)10OH | NC09PE10 | 75.0 | 65.77 | 63.95 |
| 60 | n-C9H19-C6H4-(OC2H4)12OH | NC09PE12 | 87.0 | 80.85 | 81.43 |
| 61 | n-C9H19-C6H4-(OC2H4)13OH | NC09PE13 | 89.0 | 90.41 | 89.07 |
| 62 | n-C12H25-C6H4-(OC2H4)9OH | NC12PE9 | 33.0 | 46.61 | 48.32 |
| 63 P | n-C12H25-C6H4-(OC2H4)11OH | NC12PE11 | 50.0 | 69.31 | 63.44 |
| 64 | C8H17-C6H8(C2H4O)6CH2CH2OH | CA-620 | 22.0 | 29.08 | 27.85 |
| 65 | C9H19-C6H8(C2H4O)8CH2CH2OH | CO-630 | 54.0 | 51.91 | 53.75 |
| 66 P | C7H15-CH[O(C2H4O)3CH3]2 | AGM-7(3) | 34.0 | 50.57 | 49.01 |
| 67 | C11H23-CH[O(C2H4O)3CH3]2 | AGM-11(3) | 30.0 | 30.43 | 31.34 |
| 68 | C13H27-CH[O(C2H4O)3CH3]2 | AGM-13(3) | 29.0 | 25.05 | 26.48 |

P: compounds used in the prediction set; all other compounds were used in the calibration set.

Molecular descriptors define the molecular structure and physicochemical properties of molecules by a single number. A wide variety of descriptors have been reported for use in QSAR analysis [32-37].
Here, 634 descriptors from 9 classes of Dragon descriptors (constitutional, topological, molecular walk counts, aromaticity indices, geometrical, WHIM descriptors, functional group, empirical and properties descriptors) were generated for each compound (TABLE 2). First, 219 constant and near-constant variables were excluded; then 305 descriptors were excluded using a pairwise correlation threshold of 0.98. The remaining 110 descriptors were used for GA variable selection.

Genetic algorithm

The samples were randomly assigned to the calibration and prediction sets (55 calibration samples, 13 prediction samples) and the data were preprocessed by autoscaling. TABLE 3 shows the parameters of the GA model. Because each GA run gives a slightly different model, ten GA runs are performed on the data set, with the goal of verifying the robustness of the predictive ability and of the selected variables. If some variables are present in only one model, it can be concluded that they have been selected just by chance, and they can therefore be disregarded in the final model (TABLE 4). Finally, we selected 12 variables among 5 runs (1, 5, 9, 22, 35, 36, 37, 48, 51, 57, 80, and 110) and used these variables for PLS regression (TABLES 4, 5).

TABLE 2. - The calculated descriptors used in this study

| Descriptor type | Molecular descriptors | No. of descriptors |
|---|---|---|
| Constitutional | Molecular weight, no. of atoms, no. of non-H atoms, no. of bonds, no. of heteroatoms, no. of multiple bonds, no. of aromatic bonds, no. of functional groups (hydroxy, amine, aldehyde, carbonyl, nitro, nitroso, ...), no. of rings, no. of circuits, no. of H-bond donors, no. of H-bond acceptors, chemical composition | 47 |
| Topological indices | Molecular size index, molecular connectivity indices, information contents, Kier shape indices, path/walk Randic shape indices, Zagreb indices, Schultz indices, Balaban J index, Wiener indices | 266 |
| Molecular walk counts | Molecular walk counts of order 1-10, total walk count, self-returning walk counts of order 1-10 | 21 |
| Aromaticity indices | Harmonic oscillator model of aromaticity index, Jug RC index, aromaticity, HOMA total | 4 |
| Geometrical | 3D-Wiener index, average geometrical distances, molecular eccentricity, spherocity, average shape profile index | 70 |
| WHIM descriptors | Unweighted size, shape, symmetry and accessibility directional indices; size, shape, symmetry and accessibility directional indices weighted by atomic polarizability, atomic Sanderson electronegativity or atomic van der Waals volume; total size, shape, symmetry and accessibility indices | 99 |
| Functional group | Numbers of different types of carbons, number of allenes groups, number of esters (aliphatic or aromatic), number of amides, number of different functional groups, number of CH3R, number of CR4, number of different halogens attached to different types of carbons, number of PX3, number of PR3 and ... | 121 |
| Empirical descriptors | Unsaturation index, hydrophilic factor, aromatic ratio | 3 |
| Properties | Molar refractivity, polar surface area, LogP | 3 |

TABLE 3. - Parameters of the GA

| Parameter | Value |
|---|---|
| Population size | 30 chromosomes (on average, five variables per chromosome in the original population) |
| Regression method | PLS |
| Response | Cross-validated % explained variance (five deletion groups; the number of components is determined by cross-validation) |
| Maximum number of variables selected in the same chromosome | 30 |
| Probability of mutation | 1% |
| Maximum number of components | The optimal number of components determined by cross-validation on the model containing all the variables (no higher than 15) |
| Number of runs | 100, with backward elimination after every 100th evaluation and at the end (if the number of evaluations is not a multiple of 100) |

TABLE 4. - Selected variables at the various runs of GA

| No. of run | Best variable subset | Maximum cross validation (3 components) |
|---|---|---|
| 1 | 9, 48, 36, 80, 22, 51, 57, 110, 37, 5 | 90.736 |
| 2 | 9, 36, 48, 80, 51, 57, 110, 37, 9, 5, 22, 54, 1 | 90.5049 |
| 3 | 9, 57, 36, 48, 51, 37, 80, 110, 5, 1, 22, 107, 35 | 91.1659 |
| 4 | 9, 51, 36, 80, 57, 48, 37, 1, 5, 22, 110 | 90.9489 |
| 5 | 9, 51, 57, 36, 80, 48, 110, 37, 1, 5, 22, 35 | 91.0364 |

RESULTS AND DISCUSSION

MLR modeling

Several empirical relationships between structure and cloud point can be found in the literature. Gu and Sjöblom have demonstrated a linear relationship between the cloud point and the logarithm of the ethylene oxide number (EO#) for alkyl ethoxylates, alkylphenyl ethoxylates, and methyl capped alkyl ethoxylate esters, as well as a linear relationship between the cloud point and the alkyl carbon number (C#) for linear alkyl ethoxylates. The relationship for the linear alkyl ethoxylates is [40]:

$$\mathrm{CP} = (87.1 \pm 3.3)\,\log \mathrm{EO\#} - (5.78 \pm 0.38)\,\mathrm{C\#} - (40.7 \pm 5.2) \qquad (1)$$

$R^2 = 0.943$, $F = 355$, $s^2 = 40.4$, $N = 46$.

This work was followed up with a more complete analysis of a set of 62 surfactants by P.D.T. Huibers and coworkers [42]. They used topological descriptors, which make it possible to account for branching and ring structure in alkanes.
These can be applied to the prediction of cloud point, to account for the influence of variations in hydrophobic structures. Regressions were calculated using different combinations of 41 topological descriptors.

$$\mathrm{CP} = (-264 \pm 17) + (86.1 \pm 3.0)\,\log \mathrm{EO\#} + (8.02 \pm 0.78)\,{}^{3}\kappa - (1284 \pm 86)\,{}^{0}\mathrm{ABIC} - (14.26 \pm 0.73)\,{}^{1}\mathrm{SIC} \qquad (2)$$

$R^2 = 0.937$, $F = 211$, $s^2 = 42.3$, $N = 62$.

where EO# is the number of ethylene oxide residues, ${}^{3}\kappa$ is the third order Kier shape index for the hydrophobic tail, ${}^{0}\mathrm{ABIC}$ is the zeroth order average bonding information content of the tail, and ${}^{1}\mathrm{SIC}$ is the first order structural information content of the tail.

TABLE 5. - 12 descriptors selected with GA-PLS

| Descriptor no. | Descriptor symbol | Meaning | Descriptor type |
|---|---|---|---|
| 1 | AMW | Average molecular weight | Constitutional descriptors |
| 5 | Ms | Mean electrotopological state | Constitutional descriptors |
| 9 | AAC | Mean information index on atomic composition | Topological descriptors |
| 22 | MAXDP | Maximal electrotopological positive variation | Topological descriptors |
| 35 | PW4 | Path/walk 4 - Randic shape index | Topological descriptors |
| 36 | PW5 | Path/walk 5 - Randic shape index | Topological descriptors |
| 37 | PJI2 | 2D Petitjean shape index | Topological descriptors |
| 48 | IC2 | Information content index (neighborhood symmetry of 2-order) | Topological descriptors |
| 51 | IC4 | Information content index (neighborhood symmetry of 4-order) | Topological descriptors |
| 57 | VEA1 | Eigenvector coefficient sum from adjacency matrix | Topological descriptors |
| 80 | E1u | 1st component accessibility directional WHIM index / unweighted | WHIM descriptors |
| 110 | MLOGP | Moriguchi octanol-water partition coefficient (logP) | Properties |

After variable selection with GAPLS, 13 descriptors (the 12 descriptors of TABLE 5 together with logEO#, as listed in TABLE 8) were selected for final modeling, and multivariate regressions were calculated using different combinations of these 13 descriptors.
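
The rule used to arrive at the 12 GA-selected descriptors (variables that occur in only one run are treated as chance selections and dropped) can be checked directly against the five subsets listed in TABLE 4. The short sketch below is an illustration of that rule, not part of the genpls code; the variable names are arbitrary.

```python
from collections import Counter

# Best variable subsets of the five GA runs, copied from TABLE 4.
runs = {
    1: [9, 48, 36, 80, 22, 51, 57, 110, 37, 5],
    2: [9, 36, 48, 80, 51, 57, 110, 37, 9, 5, 22, 54, 1],
    3: [9, 57, 36, 48, 51, 37, 80, 110, 5, 1, 22, 107, 35],
    4: [9, 51, 36, 80, 57, 48, 37, 1, 5, 22, 110],
    5: [9, 51, 57, 36, 80, 48, 110, 37, 1, 5, 22, 35],
}

# Count in how many runs each variable appears (set() ignores repeats within a run).
counts = Counter(v for subset in runs.values() for v in set(subset))

selected = sorted(v for v, c in counts.items() if c > 1)   # kept: present in 2+ runs
dropped = sorted(v for v, c in counts.items() if c == 1)   # discarded: present once

print(selected)  # [1, 5, 9, 22, 35, 36, 37, 48, 51, 57, 80, 110]
print(dropped)   # [54, 107]
```

The retained set coincides with the 12 variables reported in the text; only variables 54 and 107 are discarded.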
The equation of the best regression for the 68 surfactants is:

$$\mathrm{CP} = (-237.30 \pm 76.42) + (95.09 \pm 10.14)\,\log \mathrm{EO\#} - (107.30 \pm 25.60)\,\mathrm{AMW} + (974.55 \pm 137.26)\,\mathrm{AAC} - (51.09 \pm 12.42)\,\mathrm{MAXDP} - (1337.21 \pm 354.39)\,\mathrm{PW4} - (72.88 \pm 22.18)\,\mathrm{PJI2} - (25.90 \pm 4.40)\,\mathrm{IC4} + (85.88 \pm 23.22)\,\mathrm{E1u} \qquad (3)$$

$R^2 = 0.955$, $F = 122.7$, $s^2 = 34.5$, $N = 68$, $p < 0.0001$.

Eq. (3) has eight variables. EO# is the number of ethylene oxide residues. AMW belongs to the constitutional descriptor class and represents the average molecular weight. AAC, MAXDP, PW4, PJI2 and IC4 are topological descriptors and represent the mean information index on atomic composition, the maximal electrotopological positive variation, the path/walk 4 - Randic shape index, the 2D Petitjean shape index, and the information content index (neighborhood symmetry of 4-order), respectively. E1u is a WHIM descriptor and denotes the 1st component accessibility directional WHIM index (unweighted). The positive sign of logEO# shows that increasing the number of ethylene oxide residues increases the cloud point, whereas increasing the average molecular weight decreases the cloud point. AAC has a positive effect on the cloud point. The other topological descriptors, such as MAXDP, PW4 and IC4, have a negative effect on the cloud point. The E1u WHIM descriptor has a positive effect on the cloud point. The standardized regression coefficient reveals the significance of an individual descriptor in the regression model. TABLE 6 and FIG. 2 show that the effect of AAC, the mean information index on atomic composition, is more significant than that of the other descriptors. The order of significance of the other descriptors is logEO# > AMW > PW4 > IC4 > MAXDP > E1u > PJI2.

FIGURE 1. - Plot of the predicted cloud point (°C) by MLR and PLS as a function of the experimental cloud point (°C) for the surfactants in the prediction set.
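
The agreement shown in Figure 1 can be condensed into a root mean square error of prediction (RMSEP) for each model. The sketch below recomputes it from the experimental and predicted cloud points of TABLE 1, assuming that the prediction set consists of the 13 rows carrying the P mark (read here as samples 4, 8, 12, 26, 28, 36, 39, 45, 47, 51, 52, 63 and 66). The helper name `rmse` is hypothetical.

```python
import math

# (sample no., experimental CP, MLR prediction, PLS prediction) in deg C,
# for the 13 prediction-set samples of TABLE 1 (assumed reading of the P marks).
prediction_set = [
    (4, 63.8, 54.32, 52.25),
    (8, 7.0, 0.05, 12.24),
    (12, 96.0, 88.21, 85.45),
    (26, 6.0, 9.51, 23.98),
    (28, 51.0, 46.22, 49.98),
    (36, 72.5, 77.01, 71.10),
    (39, 57.6, 57.70, 56.72),
    (45, 65.0, 55.93, 59.06),
    (47, 66.0, 75.68, 64.08),
    (51, 75.0, 73.16, 78.58),
    (52, 35.0, 58.95, 48.09),
    (63, 50.0, 69.31, 63.44),
    (66, 34.0, 50.57, 49.01),
]

def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

observed = [row[1] for row in prediction_set]
rmsep_mlr = rmse(observed, [row[2] for row in prediction_set])
rmsep_pls = rmse(observed, [row[3] for row in prediction_set])

print(round(rmsep_mlr, 2))  # 11.28
print(round(rmsep_pls, 2))  # 9.72
```

Both values agree with the RMSEP entries reported for the MLR and PLS models, which supports the assumed reading of the prediction-set marks.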
PLS modeling

The PLS regression was run on the calibration data matrix containing the 13 descriptors selected by the procedure discussed in sections 2.2 and 3.1, and the PLS model was extracted from these data. The PLS estimates of the regression coefficients (standardized and nonstandardized) for these descriptors are given in TABLE 8; all 13 selected descriptors enter as first-order terms. The number of factors that gave the best results in the PLS modeling was found to be eight. The values of the cloud point predicted by the PLS model are shown in TABLE 1.

TABLE 6. - The backward elimination MLR model results

| Source | b | Sb | bs |
|---|---|---|---|
| Intercept | -237.30 | 76.42 | |
| logEO | 95.09 | 10.142 | 0.75 |
| AMW | -107.30 | 25.60 | -0.71 |
| AAC | 974.55 | 137.26 | 0.991 |
| MAXDP | -51.09 | 12.42 | -0.23 |
| PW4 | -1337.21 | 354.40 | -0.33 |
| PJI2 | -72.88 | 22.18 | -0.13 |
| IC4 | -25.91 | 4.41 | -0.29 |
| E1u | 85.88 | 23.22 | 0.16 |

TABLE 7. - Statistical results of comparing regression models

| Model | RMSEC | RMSEP | R2 | a | b |
|---|---|---|---|---|---|
| PLS | 5.04 | 9.72 | 0.96 | 17.03 | 0.73 |
| MLR | 5.87 | 11.28 | 0.95 | 9.17 | 0.88 |

TABLE 8. - The partial least squares regression coefficients

| Variable | Standardized coefficients | Standard deviation |
|---|---|---|
| logEO | 1.008 | 0.656 |
| AMW | -1.342 | 0.338 |
| Ms | -0.107 | 0.208 |
| AAC | 1.636 | 0.362 |
| MAXDP | -0.214 | 0.249 |
| PW4 | -0.348 | 0.265 |
| PW5 | -0.200 | 0.418 |
| PJI2 | -0.100 | 0.089 |
| IC2 | 0.251 | 0.336 |
| IC4 | -0.445 | 0.230 |
| VEA1 | -0.104 | 0.237 |
| E1u | 0.177 | 0.091 |
| MLOGP | 0.188 | 0.245 |

FIG. 3 displays the VIP (Variable Importance for the Projection) of each explanatory variable on the first component. This makes it possible to identify quickly which explanatory variables contribute the most to the model. On the first component, Ms, MAXDP, PW4, PJI2 and PW5 have a low influence on the model. FIG. 4 also indicates that the coefficients of AMW, AAC, IC4 and E1u are significantly different from zero, so these descriptors influence the model more than the others.
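
The text does not state how the VIP values were computed. For orientation, the definition commonly used for PLS with a single response is given below; whether the software applied exactly this form is an assumption. Here $p$ is the number of explanatory variables (13 in this model), $A$ the number of components considered, $w_{ja}$ the weight of variable $j$ on component $a$, and $\mathrm{SSY}_a$ the sum of squares of the response explained by component $a$.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\, p \;
  \frac{\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a \,
        \left( \frac{w_{ja}}{\lVert \mathbf{w}_a \rVert} \right)^{2}}
       {\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a} }
\qquad\text{and, for } A = 1:\qquad
\mathrm{VIP}_j \;=\; \sqrt{p}\;\frac{\lvert w_{j1} \rvert}{\lVert \mathbf{w}_1 \rVert}
```

Under this definition the squared VIP values average to one over the variables, which is why a VIP near or above 1 is usually read as above-average importance and a value well below 1 as low influence.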
FIGURE 2. - The confidence interval of standardized coefficients for the MLR model (confidence level 95%). [Plot: standardized coefficients versus variable.]

FIGURE 3. - The plot of VIP for the different variables. [Plot: VIP versus variable.]

FIG. 1 shows the cloud point predicted by each of the regression models plotted against the experimental cloud point. The statistical parameters calculated for the MLR and PLS models are given in TABLE 7. The table contains five statistical parameters: the root mean square error of calibration (RMSEC), the root mean square error of prediction (RMSEP), the intercept (a) and slope (b) of the plot of the predicted cloud point (°C) against the experimental cloud point (°C), and the square of the correlation coefficient (R2). The data presented in TABLE 7 indicate that the PLS and MLR models both have good statistical quality with low prediction error, while the errors obtained by the PLS model are lower. The PLS model uses a higher number of descriptors, which allows it to extract more structural information from the descriptors and results in a lower prediction error. The high correlation coefficient and the closeness of the slope to 1 in the PLS model reveal a satisfactory agreement between the predicted and the experimental values.

FIGURE 4. - The confidence interval of standardized coefficients for the PLS model (confidence level 95%). [Plot: standardized coefficients versus variable.]

CONCLUSION

The QSAR studies on a series of nonionic surfactants have been carried out by combining GA and PLS.
For each compound, 634 descriptors from 9 classes of Dragon descriptors were calculated, and the best subset of the calculated descriptors was then selected by the Dragon software and the genetic algorithm. This paper shows that GA can select variables that provide good solutions in terms of both predictive ability and interpretability. The high correlation coefficients (0.95 and 0.96 for MLR and PLS, respectively) and low prediction errors (11.28 and 9.72 for MLR and PLS, respectively) confirm the good predictive ability of both models.

Received May 5th, 2006

Acknowledgements - The authors would like to thank Professor Riccardo Leardi of the University of Genoa for providing the genpls Matlab code.

REFERENCES

1) M. J. Schick (ed.): Nonionic Surfactants, Marcel Dekker, New York, 1966.
2) M. J. Schick (ed.): Nonionic Surfactants: Physical Chemistry, Marcel Dekker, New York, 1987.
3) N. Schonfeldt: Surface Active Ethylene Oxide Adducts, Pergamon Press, Oxford, 1970.
4) C. Schöller, M. Wittwer, German Patents P. 605,973, P. 667,744, P. 694,178 to BASF, 30 November 1930.
5) W. L. Hinze, E. Pramauro, CRC Crit. Rev. Anal. Chem., 24, 133 (1993).
6) H. Tani, T. Kamidate, H. Watanabe, J. Chromatogr. A, 786, 229 (1997).
7) F. H. Quina, W. L. Hinze, Ind. Eng. Chem. Res., 38, 4150 (1999).
8) K. Roy, J. T. Leonard, Bioorg. Med. Chem., 13, 2967 (2005).
9) S. D. Brown, S. T. Sum, F. Despagne, Anal. Chem., 68, 21R (1996).
10) G.-D. Hu, R.-C. Zheng, The Method of Multiple Data Analysis, Nan Kai University Publishing House, Tianjin, 1983.
11) T. Aoyama, J. Med. Chem., 33(9), 2583 (1990).
12) B. Kowalski, R. Gerlach, in: K. G. Joreskog, H. Wold (Eds.), Systems Under Indirect Observation, North Holland, Amsterdam, 1982, pp. 191-209.
13) R.-Q. Yu, Introduction to Chemometrics, Human Education Publishing House, Changsha, 1992.
14) B. S. Dayal, J. F. MacGregor, J. Chemometrics, 11, 73 (1997).
15) P. Geladi, B. R. Kowalski, Anal. Chim. Acta, 185, 1 (1986).
16) B. K. Lavine, Chemometrics, Anal. Chem., 72, 91R (2000).
17) R. Leardi, R. Boggia, M. Terrile, J. Chemometrics, 6, 267 (1992).
18) R. Leardi, J. Chemometrics, 8, 65 (1994).
19) H. Kubinyi, Quant. Struct.-Act. Relat., 13, 285 (1994).
20) H. Kubinyi, Quant. Struct.-Act. Relat., 13, 393 (1994).
21) D. Rogers, A. J. Hopfinger, J. Chem. Inf. Comput. Sci., 34, 854 (1994).
22) H. Kubinyi, J. Chemometrics, 10, 119 (1996).
23) K. Hasegawa, Y. Miyashita, K. Funatsu, J. Chem. Inf. Comput. Sci., 37, 306 (1997).
24) R. Leardi, A. L. González, Chemom. Intell. Lab. Syst., 41, 195 (1998).
25) D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
26) Hypercube, http://www.hyper.com.
27) R. Todeschini, Milano Chemometrics and QSAR Group, http://www.disat.unimib.it/vhm/.
28) M. J. Rosen, Surfactants and Interfacial Phenomena, 2nd ed., Wiley, New York, 1989.
29) N. M. van Os, J. R. Haak, L. A. M. Rupert, Physico-Chemical Properties of Selected Anionic Cationic and Nonionic Surfactants, Elsevier, Amsterdam, 1993.
30) W. L. Hinze, E. Pramauro, CRC Crit. Rev. Anal. Chem., 24, 133 (1993).
31) A. Berthod, S. Tomer, J. G. Dorsey, Talanta, 55, 69 (2001).
32) R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Germany, 2000.
33) L. B. Kier, L. H. Hall, Molecular Connectivity in Structure–Activity Analysis, RSP-Wiley, Chichester, UK, 1986.
34) E.v. Kostantinora, J. Chem. Inf. Comp. Sci., 36, 54 (1997).
35) G. Rucker, C. Rucker, J. Chem. Inf. Comp. Sci., 33, 683 (1993).
36) J. Galvez, R. Garcia, M. T. Salabert, R. Soler, J. Chem. Inf. Comp. Sci., 34, 520 (1994).
37) P. Broto, G. Moreau, C. Vandicke, Eur. J. Med. Chem., 19, 66 (1984).
38) R. Leardi, J. Chemometrics, 14, 643 (2000).
39) S. Wold, Technometrics, 24, 397 (1978).
40) T. Gu, J. Sjöblom, Colloids Surf., 4, 39 (1992).
41) H. Martens, T. Naes, Multivariate Calibration, Wiley, New York, 1989, p. 111.
42) P. D. T. Huibers, D. O. Shah, A. R. Katritzky, J. Colloid Interface Sci., 193, 132 (1997).
43) XLSTAT 2006 version 2006.2 Add-in software (XLSTAT company). http://www.xlstat.com.
44) R. H. Myers, Classical and Modern Regression with Applications, PWS-KENT Publishing Company, Boston, 1990.