Combination of Genetic Algorithm and Partial Least Squares for

Annali di Chimica, 97, 2007, by Società Chimica Italiana
COMBINATION OF GENETIC ALGORITHM AND PARTIAL LEAST SQUARES FOR CLOUD POINT
PREDICTION OF NONIONIC SURFACTANTS FROM MOLECULAR STRUCTURES
Jahanbakhsh GHASEMI(°), Shahin AHMADI
Chemistry Department, Faculty of Sciences, Razi University, Kermanshah, Iran
Summary - Quantitative structure–property relationship (QSPR) analysis was applied
to a series of pure nonionic surfactants containing linear alkyl, cyclic alkyl, and
alkyl phenyl ethoxylates. The cloud point of these compounds was modeled as a function of
theoretically derived descriptors by multiple linear regression (MLR) and partial least
squares (PLS) regression. A genetic algorithm (GA) was applied as the variable selection
method. The results indicate that the GA is a very effective variable selection approach
for QSPR analysis. Comparison of the two regression methods showed that PLS has better
prediction ability than MLR.
INTRODUCTION
The most common nonionic surfactants are those based on ethylene oxide, referred to as
ethoxylated surfactants.1-3 Several classes can be distinguished: alcohol ethoxylates, alkyl phenol
ethoxylates, fatty acid ethoxylates, monoalkanolamide ethoxylates, sorbitan ester ethoxylates, fatty
amine ethoxylates and ethylene oxide–propylene oxide copolymers (sometimes referred to as
polymeric surfactants). Another important class of nonionics is the multihydroxy products such as
glycol esters, glycerol (and polyglycerol) esters, glucosides (and polyglucosides) and sucrose esters.
Amine oxides and sulphinyl surfactants represent nonionics with a small head group. Nonionic
polyoxyethylene alkyl ethers are ethylene oxide adducts of linear alcohols. They were first
synthesized in the early 1930s.4 Since then, their wetting, detergency, and foaming properties, and
general usefulness as nonionic surfactants have led to their wide use in industrial and household
products. Polyoxyethylene alkyl ethers have the general formula CnH2n+1(OCH2CH2)mOH. They
will be referred to as CnEm, with n indicating the number of carbons in the alkyl chain and m the
number of ethylene oxide units in the hydrophilic moiety.
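As a small illustration of the CnEm shorthand, the sketch below expands a symbol into the general formula given above and counts its atoms. The function names are hypothetical and are not part of the original work.

```python
def cnem_formula(n: int, m: int) -> str:
    """Expand the shorthand CnEm into CnH2n+1(OCH2CH2)mOH."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive integers")
    return f"C{n}H{2 * n + 1}(OCH2CH2){m}OH"


def cnem_atom_counts(n: int, m: int) -> dict:
    """Count C, H and O atoms in CnH2n+1(OCH2CH2)mOH."""
    return {
        "C": n + 2 * m,                 # alkyl carbons + two per ethylene oxide unit
        "H": (2 * n + 1) + 4 * m + 1,   # alkyl H + four per EO unit + hydroxyl H
        "O": m + 1,                     # one ether O per EO unit + hydroxyl O
    }


print(cnem_formula(12, 6))      # C12H25(OCH2CH2)6OH
print(cnem_atom_counts(12, 6))  # {'C': 24, 'H': 50, 'O': 7}
```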
Polyoxyethylene alkyl ethers are often selected as surfactants for analytical separations because
they do not absorb UV light, they are largely insensitive to ions and they offer a wide range of
polarity. Many aqueous solutions of nonionic surfactants separate into two phases on heating above a
particular temperature: the clouding temperature. The homogeneous solution becomes turbid
(cloudy) and separates into a water-rich phase and an organic surfactant-rich phase. A solute
dissolved in the homogeneous solution at low temperature may partition preferentially in one phase
(°)
Corresponding author. Fax: ( +98831)8369572. E-mail: [email protected]
GHASEMI and coworker
or the other. Cloud-point extraction is commonly used to extract and/or to concentrate apolar
solutes in the surfactant-rich phase obtained at elevated temperature.5-7 The cloud-point temperature
and the molar volume of the nonionic surfactant used are needed.
Quantitative structure-activity relationships (QSAR) attempt to correlate structural molecular
descriptors with functions (i.e. physicochemical properties, biological activities, toxicity etc.) for a
set of similar compounds, by means of statistical methods. As a result, a simple mathematical
relationship is established:
Function = f (structural molecular or fragment properties)
QSAR techniques range from chemical measurements and biological assays to statistical
techniques and the interpretation of results. 8, 9
There are three main methods which are often applied in QSAR research as follows.
(1) Stepwise regression (SR) can select important variables. 10 However, SR is limited in the
number of selected variables.
(2) Artificial neural network (ANN), which is widely used in QSAR nonlinear modeling, has
achieved substantial positive results. 11 However, the algorithm of back-propagation has its
shortcomings, namely overfitting.
(3) Partial least squares (PLS), 12-15 which is based on factor analysis fundamentals, is applied
where there are many variables but not enough samples or observations. PLS has been applied to
many fields of applied sciences with great success 16. In chemometrics, it is one of the favored
methods of analysis.
Genetic algorithms (GA) were developed to mimic some of the processes observed in natural
evolution and are an efficient strategy for searching for the global optimum of a problem.
Genetic algorithms have been successfully applied to feature selection in regression analysis.17-21
Moreover, an approach combining GA with PLS (GAPLS) has been proposed for variable selection in
QSAR and QSPR studies.22-24 The purpose of variable selection is to retain the variables that
contribute significantly to prediction and to discard the others, as judged by a fitness function. The
hybrid method, which integrates GA as a powerful optimization tool and PLS as a robust statistical
tool, can lead to a significant improvement in QSAR and QSPR models.
THEORY
GAPLS
The GA algorithm is described in (17, 18, 38), and is implemented in Matlab 4. Reference
(38) covers the details of the settings required in the software, here summarized in TABLE 3.
GAPLS is a hybrid approach that combines GA 25 as a powerful optimization method
with PLS 12-15 as a robust statistical method for variable selection. In GAPLS, a combination of
variables corresponds to a chromosome in the GA, and the internal predictivity of the PLS model
derived from it corresponds to the fitness of that chromosome. GAPLS consists of three basic steps.
(1) An initial population of chromosomes is created. Each chromosome is a binary bit string in
which each bit represents the presence or absence of a variable. (2) The fitness of each chromosome
in the population is evaluated by the internal predictivity of PLS. (3) The population of
chromosomes for the next generation is reproduced. Three operations, i.e., selection, cross-over and
mutation of chromosomes, are performed in this step. Steps 2 and 3 are repeated until the
designated number of generations is reached.
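The three steps can be sketched as a short loop. This is a minimal illustration and not the genpls code used in this work: the fitness function is passed in as a callable (in GAPLS it would be the cross-validated predictivity of a PLS model built on the selected variables), the tournament selection, single-point cross-over and the retention of the best chromosome are simplifying assumptions, and `toy_fitness` is a hypothetical stand-in with no chemical meaning.

```python
import random


def run_ga(n_vars, fitness, pop_size=30, n_generations=50,
           p_mutation=0.01, init_vars=5, seed=0):
    """Toy GA over binary chromosomes; fitness(chromosome) -> float, higher is better."""
    rng = random.Random(seed)

    def random_chromosome():
        # Step 1: each bit marks whether a variable is included.
        p = init_vars / n_vars
        return [1 if rng.random() < p else 0 for _ in range(n_vars)]

    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(n_generations):
        # Step 2: evaluate the fitness of every chromosome.
        scores = [fitness(c) for c in population]

        def select():
            # Binary tournament selection (an assumption of this sketch).
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        # Step 3: selection, cross-over and mutation build the next generation.
        next_population = [best[:]]  # keep the best chromosome found so far
        while len(next_population) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_vars)  # single-point cross-over
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population + [best], key=fitness)
    return best


# Hypothetical fitness: reward three "informative" variables, penalize the rest.
informative = {1, 4, 7}


def toy_fitness(chromosome):
    selected = {i for i, g in enumerate(chromosome) if g}
    return len(selected & informative) - 0.1 * len(selected - informative)


best = run_ga(n_vars=20, fitness=toy_fitness)
print([i for i, g in enumerate(best) if g])
```

The defaults for population size (30), mutation probability (1%) and the average of five variables per initial chromosome follow TABLE 3; the number of generations is arbitrary.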
Cross validation is used during the GA procedure. In this case, the data are split into five
deletion groups. The deletion groups are created by taking every fifth sample (e.g. 1, 6, 11, 16,… as
one group; 2, 7, 12, 17,… as the next group). The data are sorted by the cloud point of the
component of interest before deletion groups are formed in order that the deletion groups uniformly
Cloud point Prediction of Nonionic surfactants from Molecular Structures
sample from the data space. The cross validation then proceeds in five steps, each time leaving out
one group, building a model on the other four, and then predicting the left-out group. The number of
components is selected according to the minimum value of the root mean square error of cross
validation (RMSECV). The validation data are not used at all until the GA procedure is complete
and a final model is selected. The validation data are then predicted using this final model.
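The construction of the deletion groups can be sketched as follows: samples are sorted by the response and every fifth sample goes into the same group. The function names are hypothetical, and the twelve cloud points are simply the first twelve entries of TABLE 1, used here only as sample input.

```python
def deletion_groups(y, n_groups=5):
    """Return n_groups lists of sample indices: sort by response, then take every n-th."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    return [order[g::n_groups] for g in range(n_groups)]


def cv_splits(y, n_groups=5):
    """Yield (train_indices, left_out_indices) for each deletion group."""
    groups = deletion_groups(y, n_groups)
    for g, left_out in enumerate(groups):
        train = [i for h, grp in enumerate(groups) if h != g for i in grp]
        yield train, left_out


cloud_points = [44.5, 0.0, 40.5, 63.8, 75.0, 83.0, 27.6, 7.0, 38.5, 58.6, 72.5, 96.0]
for train, left_out in cv_splits(cloud_points):
    print("left out:", left_out, "| training samples:", len(train))
```

Because the groups are drawn from the sorted response, each one spans the full range of cloud points, which is the stated purpose of sorting before the groups are formed.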
The last step of the GA is a stepwise approach in which the variables are entered according
to the smoothed value of the frequencies of selection, each time computing the % cross-validated
explained variance and the RMSECV. A crucial point in the previous algorithm is the detection of
the number of variables to be taken into account (the selection of the solution corresponding to the
global maximum often leads to overfitting); this decision is usually made by visually inspecting the
plot of the % cross-validated explained variance (or of the RMSECV) versus the number of
variables in the model, looking for the number of variables beyond which no “significant” increase
of the response (decrease in RMSECV) takes place. Of course, this analysis requires some time and
the decision about the selected model can sometimes involve a high degree of subjectivity. To
have a sounder statistical approach, the following algorithm has been implemented:
• detect the global minimum of RMSECV;
• by using an F-test (P < 0.1, d.f. = number of samples in the training set − 1, both in numerator and
in the denominator) select a “threshold value” corresponding to the highest RMSECV being not
significantly different from the global minimum;
• look for the solution with the lowest number of variables having a RMSECV lower than the
“threshold value”.
In such a way, the most parsimonious model among all the models being not significantly different
from the global optimum is selected. This modification generally leads to models having a slightly
better root mean square error of prediction (RMSEP). Because each GA gives a slightly different
model, some GA runs are performed on each data set, with the goal of verifying the robustness of
the predictive ability and of the selected variables.
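The threshold rule can be sketched as below. This is an illustration under two stated assumptions: the F-test is taken to compare squared errors, so the threshold on RMSECV is the global minimum multiplied by the square root of the critical F value; and the critical value itself is supplied by the caller (from a table or a statistics package) rather than computed. The RMSECV values and the critical value 1.42 are hypothetical placeholders, not results from this study.

```python
import math


def parsimonious_model(rmsecv_by_n_vars, f_critical):
    """Pick the smallest model whose RMSECV is below the F-test threshold.

    rmsecv_by_n_vars: dict {number of variables: RMSECV}
    f_critical: critical F value for the chosen P level and degrees of freedom
                (looked up elsewhere; not computed here).
    """
    if f_critical <= 1.0:
        raise ValueError("critical F value must be greater than 1")
    rmsecv_min = min(rmsecv_by_n_vars.values())
    # Assumption: the F-test compares squared errors, so the threshold on
    # RMSECV itself is the global minimum times sqrt(F_critical).
    threshold = rmsecv_min * math.sqrt(f_critical)
    n_vars = min(n for n, r in rmsecv_by_n_vars.items() if r < threshold)
    return n_vars, threshold


# Hypothetical RMSECV profile versus number of variables.
rmsecv = {3: 9.0, 5: 7.2, 8: 6.1, 10: 5.9, 12: 5.8, 13: 5.85}
n_vars, threshold = parsimonious_model(rmsecv, f_critical=1.42)
print(n_vars, round(threshold, 2))  # 8 6.91
```

In this example the global minimum is reached with 12 variables, but the 8-variable model is not significantly worse and is therefore preferred.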
As a rule of thumb, it has been found that the performance of the algorithm decreases when
>200 variables are used.24 This is due to the fact that a higher variables/objects ratio increases the
risk of overfitting and also due to the fact that the size of the search domain becomes too great.
In QSAR studies, it is important to obtain a model containing as few variables as possible
because this will lead to a simple and interpretable model. Therefore, the quality of a chromosome
is determined by both the internal predictivity it gives and the number of variables it uses. In order
to increase quality of chromosomes in the population, the best chromosome using the same number
of variables is protected unless a chromosome with a lower number of variables gives better internal
predictivity. The protected chromosomes in the final population of GA can be regarded as the
important combinations of variables.
Stepwise multiple linear regression
Multiple linear regression (MLR) is an extension of the classical regression method to
more than one dimension.44 MLR calculates a QSAR equation by performing standard multivariable
regression calculations using multiple variables in a single equation. Stepwise multiple linear
regression is a commonly used variant of MLR. In this case, a multiple-term linear equation is
also produced, but not all independent variables are used. Variables are added to the equation one
at a time and a new regression is performed after each addition. The new term is retained only if
the equation passes a test for significance. This regression method is especially useful when the
number of variables is large and when the key descriptors are not known.
Usually, molecular descriptor matrices cannot be used directly as independent variables in
the MLR analysis because of their lack of homogeneity and the high correlation between descriptors;
in addition, the number of descriptors is larger than the number of compounds and some of them may
be redundant. Thus, prior to the MLR analysis, a reduction of variables is normally necessary in
order to obtain a concentrated set of significant underlying variables that are not correlated with
one another, while losing the minimum amount of information.
MATERIAL AND METHODS
Equipment
A Pentium IV personal computer with the Windows XP operating system was used. GA
variable selection was performed with the genpls Matlab code (written by R. Leardi).
Backward multiple regression analysis and partial least squares were performed with the XLSTAT
2006 version 2006.2 Add-in software (XLSTAT company). Equations were evaluated by the
correlation coefficient (R), the standard error of estimates (SE), the significance index for analysis
of variance (F), and the significance level (p).43 The contribution of each parameter to the inhibition
variations was calculated from the square of the correlation coefficient. For the calculation of
molecular descriptors, the HyperChem version 7.5 26 and Dragon (Milano Chemometrics group,
version 3.0) 27 software packages were used.
Cloud point data and descriptor generation
Cloud point data were collected from several literature sources 28-31 (TABLE 1). This set of data
represents a wide variety of structures in the hydrophobic domain, including linear alkyl, branched
alkyl, cyclic alkyl, linear alkylphenyl, and branched alkylphenyl ethoxylates. Cloud point values
were found for 68 different structures at a surfactant concentration of 1% w/w.
TABLE 1. - Cloud point temperature of several non-ionic surfactants of the alkyl
ethoxylate class for a surfactant percentage of 1% w/w and the
corresponding predicted values by MLR and PLS models.

No.   Chemical formula              Symbol of    Cloud point   Predicted cloud point (°C)
                                    surfactant   (°C)          MLR       PLS
1     C4H9(OC2H4)1OH                C4E1         44.5          41.37     42.199
2     C6H13(OC2H4)2OH               C6E2         0.0           10.13     9.40
3     C6H13(OC2H4)3OH               C6E3         40.5          43.55     43.63
4 P   C6H13(OC2H4)4OH               C6E4         63.8          54.32     52.25
5     C6H13(OC2H4)5OH               C6E5         75.0          76.06     76.26
6     C6H13(OC2H4)6OH               C6E6         83.0          83.25     83.39
7     C7H15(OC2H4)3OH               C7E3         27.6          17.68     16.37
8 P   C8H17(OC2H4)3OH               C8E3         7.0           0.05      12.24
9     C8H17(OC2H4)4OH               C8E4         38.5          34.61     35.09
10    C8H17(OC2H4)5OH               C8E5         58.6          60.06     59.61
11    C8H17(OC2H4)6OH               C8E6         72.5          65.74     66.46
12 P  C8H17(OC2H4)8OH               C8E8         96.0          88.21     85.45
13    C8H17(OC2H4)9OH               C8E9         100.0         101.87    97.94
14    C9H19(OC2H4)4OH               C9E4         32.0          34.14     31.79
15    C9H19(OC2H4)5OH               C9E5         55.0          49.71     50.55
16    C9H19(OC2H4)6OH               C9E6         75.0          72.30     72.78
17    C10H21(OC2H4)4OH              C10E4        19.7          20.92     19.14
18    C10H21(OC2H4)5OH              C10E5        41.6          33.88     37.35
19    C10H21(OC2H4)6OH              C10E6        60.3          54.56     54.37
20    C10H21(OC2H4)8OH              C10E8        84.5          82.05     82.17
21    C10H21(OC2H4)10OH             C10E10       95.0          103.81    101.57
22    C11H23(OC2H4)4OH              C11E4        10.5          20.75     18.94
23    C11H23(OC2H4)5OH              C11E5        37.0          32.71     32.70
24    C11H23(OC2H4)6OH              C11E6        57.5          53.34     53.59
25    C11H23(OC2H4)8OH              C11E8        82.0          82.61     83.31
26 P  C12H25(OC2H4)4OH              C12E4        6.0           9.51      23.98
27    C12H25(OC2H4)5OH              C12E5        28.9          35.603    34.36
28 P  C12H25(OC2H4)6OH              C12E6        51.0          46.22     49.98
29    C12H25(OC2H4)7OH              C12E7        64.7          65.82     66.98
30    C12H25(OC2H4)8OH              C12E8        77.9          75.65     77.61
31    C12H25(OC2H4)9OH              C12E9        87.8          91.31     90.94
32    C12H25(OC2H4)10OH             C12E10       95.5          94.80     94.03
33    C12H25(OC2H4)11OH             C12E11       100.3         106.87    104.07
34    C13H27(OC2H4)5OH              C13E5        27.0          27.22     26.80
35    C13H27(OC2H4)6OH              C13E6        42.0          46.25     46.30
36 P  C13H27(OC2H4)8OH              C13E8        72.5          77.01     71.10
37    C14H29(OC2H4)5OH              C14E5        20.0          27.14     25.21
38    C14H29(OC2H4)6OH              C14E6        42.3          38.58     39.59
39 P  C14H29(OC2H4)7OH              C14E7        57.6          57.70     56.72
40    C14H29(OC2H4)8OH              C14E8        70.5          68.33     70.32
41    C15H31(OC2H4)6OH              C15E6        37.5          37.91     37.65
42    C15H31(OC2H4)8OH              C15E8        66.0          66.54     67.09
43    C16H33(OC2H4)6OH              C16E6        35.5          28.99     29.51
44    C16H33(OC2H4)7OH              C16E7        54.0          48.24     48.49
45 P  C16H33(OC2H4)8OH              C16E8        65.0          55.93     59.06
46    C16H33(OC2H4)9OH              C16E9        75.0          69.68     70.53
47 P  C16H33(OC2H4)10OH             C16E10       66.0          75.68     64.08
48    C16H33(OC2H4)12OH             C16E12       92.0          98.31     97.89
49    (C2H5)2CHCH2(OC2H4)6OH        IC6E6        78.0          77.90     82.36
50    (C4H9)2CHCH2(OC2H4)6OH        IC10E6       27.0          33.68     29.64
51 P  c-(C12H24)(OC2H4)9OH          XC12E9       75.0          73.16     78.58
52 P  (C6H13)2CH(OC2H4)9OH          IC13E9       35.0          58.95     48.09
53    (C4H9)3CH(OC2H4)9OH           TC13E9       34.0          33.12     28.51
54    (C5H11)3CH(OC2H4)12OH         TC16E12      48.0          49.54     54.29
55    c-(C16H32)(OC2H4)11OH         XC16E11      80.0          76.73     74.94
56    t-C8H17-C6H4(OC2H4)9OH        TC8PE9       64.3          56.76     60.64
57    n-C9H19-C6H4-(OC2H4)8OH       NC09PE8      34.0          43.87     41.81
58    n-C9H19-C6H4-(OC2H4)9OH       NC09PE9      56.0          53.46     53.18
59    n-C9H19-C6H4-(OC2H4)10OH      NC09PE10     75.0          65.77     63.95
60    n-C9H19-C6H4-(OC2H4)12OH      NC09PE12     87.0          80.85     81.43
61    n-C9H19-C6H4-(OC2H4)13OH      NC09PE13     89.0          90.41     89.07
62    n-C12H25-C6H4-(OC2H4)9OH      NC12PE9      33.0          46.61     48.32
63 P  n-C12H25-C6H4-(OC2H4)11OH     NC12PE11     50.0          69.31     63.44
64    C8H17-C6H8(C2H4O)6CH2CH2OH    CA-620       22.0          29.08     27.85
65    C9H19-C6H8(C2H4O)8CH2CH2OH    CO-630       54.0          51.91     53.75
66 P  C7H15-CH[O(C2H4O)3CH3]2       AGM-7(3)     34.0          50.57     49.01
67    C11H23-CH[O(C2H4O)3CH3]2      AGM-11(3)    30.0          30.43     31.34
68    C13H27-CH[O(C2H4O)3CH3]2      AGM-13(3)    29.0          25.05     26.48

P: compounds used in the prediction set; the others were used in the calibration set.
Molecular descriptors encode the molecular structure and physicochemical properties of
molecules as single numbers. A wide variety of descriptors have been reported for use in QSAR
analysis 32-37. Here, 634 descriptors from 9 classes of Dragon descriptors (constitutional,
topological, molecular walk counts, aromaticity indices, geometrical, WHIM descriptors, functional
group, empirical and properties descriptors) were generated for each compound (TABLE 2). First,
219 constant and near-constant variables were excluded; then 305 descriptors were excluded on the
basis of pair correlations (threshold 0.98). The 110 remaining descriptors were used for
GA variable selection.
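The two filtering passes (634 − 219 − 305 = 110 descriptors) can be sketched as follows. This is an illustrative reading, not the procedure actually run in Dragon: the near-constant criterion (a minimum standard deviation), the rule of keeping the first descriptor of each highly correlated pair, and the descriptor names and values in the example are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, min_std=1e-6, max_corr=0.98):
    """columns: dict {name: list of values}. Returns the names kept, in input order."""
    def std(v):
        m = sum(v) / len(v)
        return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

    # Pass 1: drop constant and near-constant descriptors.
    varying = [name for name, v in columns.items() if std(v) > min_std]

    # Pass 2: drop a descriptor if it is highly correlated with one already kept.
    kept = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for five compounds.
columns = {
    "nC": [4, 6, 8, 10, 12],
    "MW_like": [8.1, 12.0, 16.1, 20.0, 24.1],  # almost proportional to nC
    "nRings": [0, 0, 0, 0, 0],                 # constant
    "nEO": [1, 5, 2, 8, 3],
}
print(filter_descriptors(columns))  # ['nC', 'nEO']
```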
Genetic algorithm
The samples were randomly assigned to the calibration and prediction sets (55 calibration
samples, 13 prediction samples) and the data were preprocessed by autoscaling. TABLE 3 shows the
parameters of the GA model. Because each GA run gives a slightly different model, ten GA runs were
performed on the data set, with the goal of verifying the robustness of the predictive ability and of
the selected variables. If some variables are present in only one model, it can be concluded that they
have been selected just by chance, and they can therefore be disregarded in the final model
(TABLE 4). Finally, we selected 12 variables from 5 runs (1, 5, 9, 22, 35, 36, 37, 48, 51, 57, 80,
and 110) and used these variables for PLS regression (TABLES 4 and 5).
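Autoscaling means centering each descriptor on its mean and dividing by its standard deviation. A minimal sketch is given below; the use of the sample standard deviation (n − 1) and the reuse of calibration-set statistics for the prediction set are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import math


def autoscale_fit(column):
    """Return the mean and sample standard deviation of one descriptor column."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    return mean, math.sqrt(var)


def autoscale_apply(column, mean, std):
    """Center and scale a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]


calibration = [2.0, 4.0, 6.0]   # hypothetical descriptor values
prediction = [8.0]

mean, std = autoscale_fit(calibration)
print(autoscale_apply(calibration, mean, std))  # [-1.0, 0.0, 1.0]
print(autoscale_apply(prediction, mean, std))   # [2.0]
```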
TABLE 2. - The calculated descriptors used in this study

Descriptor type        No. of        Molecular descriptors
                       descriptors
Constitutional         47            Molecular weight, no. of atoms, no. of non-H atoms,
                                     no. of bonds, no. of heteroatoms, no. of multiple bonds,
                                     no. of aromatic bonds, no. of functional groups
                                     (hydroxy, amine, aldehyde, carbonyl, nitro, nitroso, ...),
                                     no. of rings, no. of circuits, no. of H-bond donors,
                                     no. of H-bond acceptors, chemical composition
Topological indices    266           molecular size index, molecular connectivity indices,
                                     information contents, Kier shape indices, path/walk-Randic
                                     shape indices, Zagreb indices, Schultz indices,
                                     Balaban J index, Wiener indices, information contents
Molecular walk counts  21            molecular walk counts of order 1-10, total walk count,
                                     self-returning of order 1-10
Aromaticity indices    4             harmonic oscillator model of aromaticity index, Jug RC
                                     index, aromaticity, HOMA total
Geometrical            70            3D-Wiener index, average geometrical distances,
                                     molecular eccentricity, spherocity, average shape
                                     profile index
WHIM descriptors       99            unweighted size, shape, symmetry and accessibility
                                     directional indices; size, shape, symmetry and
                                     accessibility directional indices weighted by atomic
                                     polarizability, atomic Sanderson electronegativity or
                                     atomic van der Waals volume; total size, shape,
                                     symmetry and accessibility indices
Functional group       121           numbers of different types of carbons, number of
                                     allenes groups, number of esters (aliphatic or
                                     aromatic), number of amides, number of different
                                     functional groups, number of CH3R, number of CR4,
                                     number of different halogens attached to different types
                                     of carbons, number of PX3, number of PR3 and ...
Empirical descriptors  3             Unsaturation index, hydrophilic factor, aromatic ratio
Properties             3             molar refractivity, polar surface area, LogP
TABLE 3. - Parameters of the GA

Population size                   30 chromosomes (on average, five variables per
                                  chromosome in the original population)
Regression method                 PLS
Response                          cross-validated % explained variance (five deletion
                                  groups; the number of components is determined by
                                  cross-validation)
Maximum number of variables       30
selected in the same chromosome
Probability of mutation           1%
Maximum number of components      the optimal number of components determined by
                                  cross-validation on the model containing all the
                                  variables (no higher than 15)
Number of runs                    100, backward elimination after every 100th evaluation
                                  and at the end (if the number of evaluations is not a
                                  multiple of 100)
TABLE 4. - Selected variables at the various runs of GA

No. of   Best variable subset                                    Maximum cross validation
run                                                              (3 components)
1        9, 48, 36, 80, 22, 51, 57, 110, 37, 5                   90.736
2        9, 36, 48, 80, 51, 57, 110, 37, 9, 5, 22, 54, 1         90.5049
3        9, 57, 36, 48, 51, 37, 80, 110, 5, 1, 22, 107, 35       91.1659
4        9, 51, 36, 80, 57, 48, 37, 1, 5, 22, 110                90.9489
5        9, 51, 57, 36, 80, 48, 110, 37, 1, 5, 22, 35            91.0364
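The rule of discarding variables that appear in only one run can be checked directly against the subsets of TABLE 4. The short sketch below counts, for each variable, the number of runs in which it was selected and keeps those selected at least twice; it recovers the 12 variables listed in the text, with variables 54 and 107 dropped.

```python
from collections import Counter

# Best variable subsets of the five GA runs, copied from TABLE 4.
runs = {
    1: [9, 48, 36, 80, 22, 51, 57, 110, 37, 5],
    2: [9, 36, 48, 80, 51, 57, 110, 37, 9, 5, 22, 54, 1],
    3: [9, 57, 36, 48, 51, 37, 80, 110, 5, 1, 22, 107, 35],
    4: [9, 51, 36, 80, 57, 48, 37, 1, 5, 22, 110],
    5: [9, 51, 57, 36, 80, 48, 110, 37, 1, 5, 22, 35],
}

# Count in how many runs each variable appears (duplicates within a run count once).
counts = Counter(v for subset in runs.values() for v in set(subset))
retained = sorted(v for v, c in counts.items() if c >= 2)
dropped = sorted(v for v, c in counts.items() if c == 1)

print(retained)  # [1, 5, 9, 22, 35, 36, 37, 48, 51, 57, 80, 110]
print(dropped)   # [54, 107]
```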
RESULTS AND DISCUSSION
MLR modeling
Several empirical relationships between structure and cloud point can be found in the literature.
Gu and Sjöblom have demonstrated a linear relationship between the cloud point and the logarithm
of the ethylene oxide number (EO#) for alkyl ethoxylates, alkylphenyl ethoxylates, and methyl
capped alkyl ethoxylate esters, as well as a linear relationship between the cloud point and the alkyl
carbon number (C#) for linear alkyl ethoxylates. The relationship for the linear alkyl ethoxylates
is 40:
$$CP = (87.1 \pm 3.3)\,\log \mathrm{EO\#} - (5.78 \pm 0.38)\,\mathrm{C\#} - (40.7 \pm 5.2) \qquad (1)$$
$$R^2 = 0.943,\quad F = 355,\quad s^2 = 40.4,\quad N = 46$$
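As a worked example, Eq. (1) can be evaluated for a single surfactant. The sketch rests on two assumptions that the text does not state: the coefficient 5.78 multiplies the alkyl carbon number C#, and "log" is read as the natural logarithm, which is the reading that gives values close to TABLE 1 (a base-10 logarithm gives a negative cloud point for the same input). Only the central values of the coefficients are used.

```python
import math


def cloud_point_linear(eo_number, carbon_number):
    """Central-value estimate of the cloud point (deg C) from Eq. (1).

    Assumptions of this sketch: natural logarithm, and the 5.78 coefficient
    applied to the alkyl carbon number.
    """
    return 87.1 * math.log(eo_number) - 5.78 * carbon_number - 40.7


# C12E6: TABLE 1 lists an experimental cloud point of 51.0 deg C.
print(round(cloud_point_linear(6, 12), 1))   # 46.0
# C12E10: TABLE 1 lists 95.5 deg C.
print(round(cloud_point_linear(10, 12), 1))  # 90.5
```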
This work was followed up with a more complete analysis of a set of 62 surfactants by P.D.T.
Huibers and coworkers 42. They used topological descriptors to account for
branching and ring structure in alkanes. These can be applied to the prediction of cloud point, to
account for the influence of variations in hydrophobic structures. Regressions were calculated using
different combinations of 41 topological descriptors.
$$CP = (-264 \pm 17) + (86.1 \pm 3.0)\,\log \mathrm{EO\#} + (8.02 \pm 0.78)\,{}^{3}\kappa - (1284 \pm 86)\,{}^{0}\mathrm{ABIC} - (14.26 \pm 0.73)\,{}^{1}\mathrm{SIC} \qquad (2)$$
$$R^2 = 0.937,\quad F = 211,\quad s^2 = 42.3,\quad N = 62$$
TABLE 5. - 12 descriptors selected with GA-PLS

Descriptor  Descriptor  Meaning                                          Descriptor type
no.         symbol
1           AMW         Average molecular weight                         Constitutional descriptors
5           Ms          Mean electrotopological state                    Constitutional descriptors
9           AAC         Mean information index on atomic composition     Topological descriptors
22          MAXDP       Maximal electrotopological positive variation    Topological descriptors
35          PW4         Path/walk 4 - Randic shape index                 Topological descriptors
36          PW5         Path/walk 5 - Randic shape index                 Topological descriptors
37          PJI2        2D Petitjean shape index                         Topological descriptors
48          IC2         Information content index                        Topological descriptors
                        (neighborhood symmetry of 2-order)
51          IC4         Information content index                        Topological descriptors
                        (neighborhood symmetry of 4-order)
57          VEA1        Eigenvector coefficient sum from                 Topological descriptors
                        adjacency matrix
80          E1u         1st component accessibility directional          WHIM descriptors
                        WHIM index / unweighted
110         MLOGP       Moriguchi octanol-water partition                Properties
                        coefficient (logP)
In Eq. (2), EO# is the number of ethylene oxide residues, 3κ is the third order Kier shape index for
the hydrophobic tail, 0ABIC is the zeroth order average bonding information content of the tail, and
1SIC is the first order structural information content of the tail.
After variable selection with GAPLS, 13 descriptors (the 12 descriptors of TABLE 5 together
with logEO#, as listed in TABLE 8) were selected for final modeling, and multivariate regressions
were calculated using different combinations of these 13 descriptors. The best regression equation
for the 68 surfactants is:
$$CP = (-237.30 \pm 76.42) + (95.09 \pm 10.14)\,\log \mathrm{EO\#} - (107.30 \pm 25.60)\,\mathrm{AMW} + (974.55 \pm 137.26)\,\mathrm{AAC}
 - (51.09 \pm 12.42)\,\mathrm{MAXDP} - (1337.21 \pm 354.39)\,\mathrm{PW4} - (72.88 \pm 22.18)\,\mathrm{PJI2} - (25.90 \pm 4.40)\,\mathrm{IC4}
 + (85.88 \pm 23.22)\,\mathrm{E1u} \qquad (3)$$
$$R^2 = 0.955,\quad F = 122.7,\quad s^2 = 34.5,\quad N = 68,\quad p < 0.0001$$
Eq. (3) has eight variables. EO# is the number of ethylene oxide residues; AMW belongs to the
constitutional descriptor class and represents the average molecular weight; AAC, MAXDP, PW4,
PJI2 and IC4 are topological descriptors and represent the mean information index on atomic
composition, the maximal electrotopological positive variation, the path/walk 4 Randic shape index,
the 2D Petitjean shape index, and the information content index (neighborhood symmetry of 4-order),
respectively; and E1u is a WHIM descriptor, the 1st component accessibility directional WHIM
index (unweighted). The positive sign of the logEO# term shows that the cloud point increases with
the number of ethylene oxide residues, whereas it decreases with increasing average molecular
weight. AAC has a positive effect on the cloud point. The other topological descriptors, such as
MAXDP, PW4 and IC4, have a negative effect on the cloud point. The E1u WHIM descriptor has a
positive effect on the cloud point.
The standardized regression coefficient reveals the significance of an individual descriptor
in the regression model. TABLE 6 and FIG. 2 show that the effect of AAC, the mean
information index on atomic composition, is more significant than that of the other descriptors. The
order of significance of the other descriptors is logEO#>AMW>PW4>IC4>MAXDP>E1u>PJI2.
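The stated order can be reproduced by sorting the standardized coefficients (bs) of TABLE 6 by absolute value, as in the short sketch below. The values are copied from TABLE 6; the intercept has no standardized coefficient and is omitted.

```python
# Standardized coefficients (bs) from TABLE 6.
bs = {
    "logEO": 0.75, "AMW": -0.71, "AAC": 0.991, "MAXDP": -0.23,
    "PW4": -0.33, "PJI2": -0.13, "IC4": -0.29, "E1u": 0.16,
}

# Rank descriptors by the magnitude of their standardized coefficient.
ranking = sorted(bs, key=lambda name: abs(bs[name]), reverse=True)
print(" > ".join(ranking))
# AAC > logEO > AMW > PW4 > IC4 > MAXDP > E1u > PJI2
```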
[Scatter plot: predicted cloud point (°C) versus experimental cloud point (°C); series: MLR, PLS]
FIGURE 1. - The plot of the predicted cloud point (°C) by MLR and PLS as a function of
experimental cloud point (°C) for the surfactants involved in the prediction set.
PLS modeling
The PLS regression was run on the calibration data matrix containing the 13 descriptors selected
by the procedure discussed in sections 2.2 and 3.1, and the PLS model was extracted from these data.
The PLS estimates of the regression coefficients (standardized and nonstandardized) for these
descriptors are given in TABLE 8; all 13 selected descriptors enter as first-order terms. The number
of factors that gave the best results in the PLS modeling was found to be eight. The cloud points
predicted by the PLS model are shown in TABLE 1.
TABLE 6. - The backward elimination MLR model results

Source      b           Sb        bs
Intercept   -237.30     76.42
logEO       95.09       10.142    0.75
AMW         -107.30     25.60     -0.71
AAC         974.55      137.26    0.991
MAXDP       -51.09      12.42     -0.23
PW4         -1337.21    354.40    -0.33
PJI2        -72.88      22.18     -0.13
IC4         -25.91      4.41      -0.29
E1u         85.88       23.22     0.16
TABLE 7. - Statistical results of comparing regression models

Model   RMSEC   RMSEP   R2     a       b
PLS     5.04    9.72    0.96   17.03   0.73
MLR     5.87    11.28   0.95   9.17    0.88
TABLE 8. - The partial least squares regression coefficients

Variable   Standardized    Standard
           coefficients    deviation
logEO      1.008           0.656
AMW        -1.342          0.338
Ms         -0.107          0.208
AAC        1.636           0.362
MAXDP      -0.214          0.249
PW4        -0.348          0.265
PW5        -0.200          0.418
PJI2       -0.100          0.089
IC2        0.251           0.336
IC4        -0.445          0.230
VEA1       -0.104          0.237
E1u        0.177           0.091
MLOGP      0.188           0.245
FIG. 3 displays the VIP (Variable Importance for the Projection) of each explanatory
variable on the first component. This makes it possible to identify quickly the explanatory
variables that contribute most to the model. On the first component, Ms, MAXDP,
PW4, PJI2 and PW5 have a low influence on the model. FIG. 4 also indicates that the coefficients
of AMW, AAC, IC4 and E1u are significantly different from zero, so these descriptors influence the
model more than the others.
[Bar chart: standardized coefficients (y axis) by variable (x axis)]
FIGURE 2. - The confidence interval of standardized coefficients for MLR model (confidence level
is 95%).
[Bar chart: VIP (y axis) by variable (x axis)]
FIGURE 3. - The plot of VIP for different variables.
FIG. 1 shows the cloud point predicted by the regression models plotted against the experimental
cloud point. The statistical parameters calculated for the MLR and PLS models are given in
TABLE 7. This table lists five statistical parameters: the root mean square error of calibration
(RMSEC), the root mean square error of prediction (RMSEP), the intercept (a) and slope (b) of the
plot of the predicted cloud point (°C) against the experimental cloud point (°C), and the square of
the correlation coefficient (R2). The data in TABLE 7 indicate that both the PLS and MLR models
have good statistical quality with low prediction error, and that the errors obtained by the PLS
model are lower. The PLS model uses a higher number of descriptors, which allows it to extract
more structural information from the descriptors and results in a lower prediction error. The high
correlation coefficient and the closeness of the slope to 1 in the PLS model reveal a satisfactory
agreement between the predicted and the experimental values.
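The RMSEP values of TABLE 7 can be recomputed from TABLE 1. The sketch below assumes that the 13 prediction-set compounds are nos. 4, 8, 12, 26, 28, 36, 39, 45, 47, 51, 52, 63 and 66 (a reading of the P marks in TABLE 1) and that the error sum is divided by the number of samples; with these assumptions the tabulated RMSEP of both models is reproduced. The function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def fit_line(x, y):
    """Least-squares intercept a and slope b of y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


# Prediction-set rows of TABLE 1 (nos. 4, 8, 12, 26, 28, 36, 39, 45, 47, 51, 52, 63, 66).
experimental = [63.8, 7.0, 96.0, 6.0, 51.0, 72.5, 57.6, 65.0, 66.0, 75.0, 35.0, 50.0, 34.0]
mlr = [54.32, 0.05, 88.21, 9.51, 46.22, 77.01, 57.70, 55.93, 75.68, 73.16, 58.95, 69.31, 50.57]
pls = [52.25, 12.24, 85.45, 23.98, 49.98, 71.10, 56.72, 59.06, 64.08, 78.58, 48.09, 63.44, 49.01]

print(round(rmse(experimental, mlr), 2))  # 11.28
print(round(rmse(experimental, pls), 2))  # 9.72

# Intercept and slope of predicted versus experimental cloud point (PLS model).
a, b = fit_line(experimental, pls)
print(round(a, 2), round(b, 2))
```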
[Bar chart: standardized coefficients (y axis) by variable (x axis)]
FIGURE 4. - The confidence interval of standardized coefficients for PLS model (confidence level
is 95%).
CONCLUSION
The QSAR studies on a series of nonionic surfactants have been carried out by combining
GA and PLS. For each compound, 634 descriptors from 9 classes of Dragon descriptors were
calculated, and the best set of descriptors was then selected by the Dragon software and the genetic
algorithm. This paper shows that GA can select variables that provide good solutions in terms of
both predictive ability and interpretability. The high R2 values (0.95 and 0.96 for MLR and PLS,
respectively) and low prediction errors (11.28 and 9.72 for MLR and PLS, respectively)
confirm the good predictive ability of both models.
Received May 5th, 2006
Acknowledgements - The authors would like to thank Professor Riccardo Leardi of University of
Genoa for providing genpls Matlab code program.
REFERENCES
1) M. J. Schick (ed.): Nonionic Surfactants, Marcel Dekker, New York, 1966.
2) M. J. Schick (ed.): Nonionic Surfactants: Physical Chemistry, Marcel Dekker, New York, 1987.
3) N. Schonfeldt: Surface Active Ethylene Oxide Adducts, Pergamon Press, Oxford, 1970.
4) C. Schöller, M. Wittwer, German Patents P. 605,973, P. 667,744, P. 694,178 to BASF, 30 November 1930.
5) W. L. Hinze, E. Pramauro, CRC Crit. Rev. Anal. Chem., 24, 133 (1993).
6) H. Tani, T. Kamidate, H. Watanabe, J. Chromatogr. A, 786, 229 (1997).
7) F. H. Quina, W. L. Hinze, Ind. Eng. Chem. Res., 38, 4150 (1999).
8) K. Roy, J. T. Leonard, Bioorg. Med. Chem., 13, 2967 (2005).
9) S. D. Brown, S. T. Sum, F. Despagne, Anal. Chem., 68, 21R (1996).
10) G.-D. Hu, R.-C. Zheng, The Method of Multiple Data Analysis, Nan Kai University Publishing House, Tianjin, 1983.
11) T. Aoyama, J. Med. Chem., 33(9), 2583 (1990).
12) B. Kowalski, R. Gerlach, in: K. G. Joreskog, H. Wold (Eds.), Systems Under Indirect Observation, North Holland, Amsterdam, 1982, pp. 191-209.
13) R.-Q. Yu, Introduction to Chemometrics, Hunan Education Publishing House, Changsha, 1992.
14) B. S. Dayal, J. F. MacGregor, J. Chemometrics, 11, 73 (1997).
15) P. Geladi, B. R. Kowalski, Anal. Chim. Acta, 185, 1 (1986).
16) B. K. Lavine, Chemometrics, Anal. Chem., 72, 91R (2000).
17) R. Leardi, R. Boggia, M. Terrile, J. Chemometrics, 6, 267 (1992).
18) R. Leardi, J. Chemometrics, 8, 65 (1994).
19) H. Kubinyi, Quant. Struct.-Act. Relat., 13, 285 (1994).
20) H. Kubinyi, Quant. Struct.-Act. Relat., 13, 393 (1994).
21) D. Rogers, A. J. Hopfinger, J. Chem. Inf. Comput. Sci., 34, 854 (1994).
22) H. Kubinyi, J. Chemometrics, 10, 119 (1996).
23) K. Hasegawa, Y. Miyashita, K. Funatsu, J. Chem. Inf. Comput. Sci., 37, 306 (1997).
24) R. Leardi, A. L. González, Chemom. Intell. Lab. Syst., 41, 195 (1998).
25) D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
26) Hypercube, http://www.hyper.com.
27) R. Todeschini, Milano Chemometrics and QSAR Group, http://www.disat.unimib.it/vhm/.
28) M. J. Rosen, Surfactants and Interfacial Phenomena, 2nd ed., Wiley, New York, 1989.
29) N. M. van Os, J. R. Haak, L. A. M. Rupert, Physico-Chemical Properties of Selected Anionic, Cationic and Nonionic Surfactants, Elsevier, Amsterdam, 1993.
30) W. L. Hinze, E. Pramauro, CRC Crit. Rev. Anal. Chem., 24, 133 (1993).
31) A. Berthod, S. Tomer, J. G. Dorsey, Talanta, 55, 69 (2001).
32) R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Germany, 2000.
33) L. B. Kier, L. H. Hall, Molecular Connectivity in Structure–Activity Analysis, RSP-Wiley, Chichester, UK, 1986.
34) E. v. Kostantinora, J. Chem. Inf. Comp. Sci., 36, 54 (1997).
35) G. Rucker, C. Rucker, J. Chem. Inf. Comp. Sci., 33, 683 (1993).
36) J. Galvez, R. Garcia, M. T. Salabert, R. Soler, J. Chem. Inf. Comp. Sci., 34, 520 (1994).
37) P. Broto, G. Moreau, C. Vandicke, Eur. J. Med. Chem., 19, 66 (1984).
38) R. Leardi, J. Chemometrics, 14, 643 (2000).
39) S. Wold, Technometrics, 24, 397 (1978).
40) T. Gu and Sjöblom, J. Colloids Surf., 4, 39 (1992).
41) H. Martens, T. Naes, Multivariate Calibration, Wiley, New York, 1989, p. 111.
42) P. D. T. Huibers, D. O. Shah, A. R. Katritzky, J. Colloid Interface Sci., 193, 132 (1997).
43) XLSTAT 2006 version 2006.2 Add-in software (XLSTAT company). http://www.xlstat.com.
44) R. H. Myers, Classical and Modern Regression with Applications, PWS-KENT Publishing Company, Boston, 1990.