Chemometrics and Intelligent Laboratory Systems 65 (2003) 257–279
www.elsevier.com/locate/chemometrics

Comparison of principal components regression and partial least squares regression through generic simulations of complex mixtures

Peter D. Wentzell *, Lorenzo Vega Montoto
Department of Chemistry, Trace Analysis Research Centre, Dalhousie University, Halifax, Nova Scotia, Canada B3H 4J3

Received 17 June 2002; received in revised form 29 October 2002; accepted 31 October 2002

Abstract

Two of the most widely employed multivariate calibration methods, principal components regression (PCR) and partial least squares regression (PLS), are compared using simulation studies of complex chemical mixtures which contain a large number of components. Details of the complex mixture model, including concentration distributions and spectral characteristics, are presented. Results from the application of PCR and PLS are presented, showing how the prediction errors and number of latent variables (NLV) used vary with the relative abundance of mixture components. Simulation parameters varied include the distribution of mean concentrations, spectral correlation, noise level, number of mixture components, number of calibration samples, and the maximum number of latent variables available. In all cases, except when artificial constraints were placed on the number of latent variables retained, no significant differences were reported in the prediction errors reported by PCR and PLS. PLS almost always required fewer latent variables than PCR, but this did not appear to influence predictive ability.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Principal components regression; Partial least squares regression; Multivariate calibration; Comparison; Simulation; Complex mixtures

1. Introduction

Multivariate calibration, especially first-order calibration, has become an indispensable part of modern analytical chemistry.
For as long as multivariate calibration has been employed, researchers have sought to develop better techniques for building calibration models. Countless strategies have evolved over the years in an attempt to improve on existing methods. Although some of these methods are clearly better than others under a given set of circumstances, there is no single "best" approach to calibration. Nevertheless, a number of techniques have withstood the test of time and have become regarded as standards of multivariate calibration for whatever reasons: tradition, simplicity, reliability, versatility or a host of other factors. Among these techniques are principal components regression (PCR) and partial least squares regression (PLS). These two techniques are similar in many ways and the theoretical relationship between them has been treated extensively in the literature [1–4]. Naturally, discussions often arise as to the relative merits of these two approaches when applied to chemical data and no clear answer has yet surfaced.

* Corresponding author. Tel.: +1-902-494-3708; fax: +1-902494-1310. E-mail address: [email protected] (P.D. Wentzell).
0169-7439/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0169-7439(02)00138-7

Table 1
Summary of the literature review
(Columns: Reference; Topic; Conclusions and quotations)

[7] Theoretical-Simulations [3] Theoretical [4] Theoretical-Simulations [8] Simulations [9] Theoretical [10] UV-Vis; ternary mixtures; first and second derivative FT-IR; Fat, Protein, Lactose PCR and PLS gave similar optimal prediction results. PLS uses fewer LVs than PCR. "PLS is one of a continuum of variations similar to PCR making specific implicit assumptions about the importance of each principal component for describing the dependant variable. It is inherently no better than those methods formed by selecting individual eigenvectors." PLS often reached its minimal mean squared error of prediction using fewer factors than PCR. "PCR and PLS are very similar" ... "PLS seems to predict better than PCR in the cases when there are random linear baselines or independently varying major spectral components which overlap with the spectral features of the analyte." "It has been proven that the intuitive inequality R²_PLS ≥ R²_PCR holds for any given dimensionality ... it should be emphasized that a better goodness of fit, as expressed by R², does not necessarily transcend into superior prediction performance." PLS was similar to PCR in all cases, even when derivatives were used. [11] [12] IR; Blood serum constituents; PCR, PLS and PLS-ANN [13] ATR-FTIR; styrene-butadiene in a polymer; PLS, PCR and MLR [14] NIRS; Forage production; PCR, PLS and restricted PCR [15] UV-Vis; mixtures of up to five cations NIRS; pharmaceutical tablets; PLS and variants of PCR with selection factors [16] [17] [18] [19] [20] [21] [22] UV-Vis; phenols; MLR, PCR and PLS Fluorescence; amino acid derivatives; PCR and PLS NIR; wheat and blood samples; PCR, PLS UV-Vis and HPLC; pesticides UV-Vis and HPLC; pesticides [23] Fluorescence; PAHs; 10 components UV-Vis; PAHs [24] UV-Vis; colorants PLS and PCR performed in a similar way. Both models used the same number of LVs to construct the model for the prediction step. PLS, PCR and PLS-ANN gave similar results. In presence of small nonlinearity effects, ANNs are not necessary. "PLS and PCR yielded very similar results for the prediction of every constituent studied." PLS worked better than PCR and MLR when the original spectra were used. "The best result is obtained with PLS" ... "Once again this experimental example supports the theoretical view that PLS is more often superior to PCR and MLR for this FTIR application." PLS, PCR and RPCR were similar when the correct number of PCs were chosen. "The advantages of PLSR is that the method reaches the minimum prediction error with fewer components included than are included for PCR and RPCR." PLS and PCR gave similar performance for all systems. PLS uses fewer LVs than PCR. PLS and PCR top-down procedures were almost the same in terms of predictive ability. Some pre-processing tools were used in trying to improve the results. "Whether PCR with selection of PCs is better than PLS or viceversa cannot be decided really from these data. In this case, the quality of the model is similar, but the PCR model with the selection of PCs is simpler." PCR and PLS gave the same results. Using the pre-processing tools, MLR gave better results than PCR and PLS. Stated that PLS works better than PCR but gave no numerical results. Only presented plots where it was difficult to observe any differences. PLS and PCR have similar optimal prediction ability. PCR, PLS1 and PLS2 produce good results. All exhibit similar performance. "PLS1 seems to predict better than PLS-2 and PCR in cases when there are multiple components with overlapping peaks." PLS and PCR gave almost the same results. PLS was stated to perform slightly better than PCR but this conclusion was not statistically validated. "The particular data set is predicted well by PLS and PCR ... although ... in all cases PLS outperforms PCR." "... no significant difference was observed in the precision of prediction between PCR and PLS methods."

[25] UV-Vis; ternary mixture of pesticides [26] Voltammetry; Zn, Cu, Cd and Pb mixture; MLR, PCR and PLS [27] Cyclic voltammetry; tryptophan [28] Theoretical-Practical; NIR; wheat samples and gasoline samples [29] [30] NIR; acrylic fibres UV-Vis, FIA; amines in foods; PCR and PLS NIR; variety of data sets PLS and PCR provide similar results in terms of RMSEP using the same number of LVs. Reported PLS was better than PCR for predictions, but actually, PLS was better for one element (Pb). For the others, PLS and PCR performed equally well. "The predictions using PLS were considered to be superior to both MLR and PCR as the PLS modeling structure finds the latent variables that describe both the variation in the x- and y-block data ..." PLS and NL-PLS worked better than PCR but the differences were not significant. It seemed that PLS works better when peak shifts appear. "... PCR is often too restrictive (less variance but more bias) and PLS is too flexible (more variance but less bias). Said another way, PCR tends to stop too soon in including additional eigenvectors, while PLS goes too far and uses too many eigenvectors. It should be noted that it is not necessarily PCR and PLS that stop too soon or go to far respectively, but it can be the operator, the one building the model." PLS and PCR had the same prediction ability in this system. PLS1, PLS2 and PCR gave similar prediction errors. [31] "It has been shown that a correct application of many calibration methods (MLR to the selected variables, PCR, PLS and NN) to a linear problem yields results of similar quality."
The purpose of this paper is to attempt to illuminate some of the similarities and differences in the performance of these two techniques from a different perspective, namely through the simulation of complex mixtures. Historically, PCR predates PLS, with the latter appearing in the chemical literature around 1983 (for a recent historical perspective, see Refs. [5,6]). Since its introduction, however, PLS appears by most accounts to have become the method of choice among chemists. The reasons for this are not entirely clear, but a number of perceived advantages have been cited in the literature. These include: (1) PLS should predict better because correlations with the y variable are sought in determining the scores, (2) PLS requires fewer latent variables than PCR and should therefore be more parsimonious, (3) PLS loadings are more readily interpreted, and (4) PLS should handle nonlinearities better than PCR. In this paper, we focus on the first item in this list and attempt to address the question ‘‘Does PLS predict better than PCR?’’. In trying to answer this question, we will specifically focus on the prediction of concentrations from multichannel data intended to emulate spectroscopic responses and created assuming a linear causal model. It is clear that this is a restrictive case and conclusions will be by no means general, but even this limited study requires a model of some complexity. Likewise, we recognize that there are a large number of calibration methods available, which may be equally valid for comparison, but only PCR and PLS have been chosen to keep the scope of the study manageable. Initially, a review of the literature was undertaken to obtain a clear picture of what kinds of studies had already been done to compare these two methods and what conclusions had been drawn. Table 1 is a summary of the results from this literature survey. 
Given the abundance of papers on multivariate calibration and the limitations of search parameters, some work will undoubtedly have been missed in this survey, but the results should reflect a reasonable cross-section of literature conclusions. In each case, the nature of the study is indicated and the major conclusions of the work are summarized, with a direct quotation from the paper where it is useful. The results of the study were somewhat surprising. Given the clear popularity of PLS over PCR among chemists, we had expected to see many instances where the superior predictive performance of PLS was evident. While there were a few cases which indicated that PLS gave better results than PCR, a greater number of studies indicated no real difference in performance. Additionally, there were no theoretical studies which suggested that one method should predict better than the other. The conclusions of this ‘‘meta-analysis’’ would appear to run counter to the popular view of the inherent predictive advantages of PLS. However, a difficulty with these studies is that most are isolated cases, so it is difficult to form a clear picture. Comparative studies of PCR and PLS carried out so far can be classified into four categories: (1) theoretical studies, (2) experimental studies on simple mixtures, (3) simulation studies on simple mixtures, and (4) experimental studies on complex mixtures. Theoretical studies based on the underlying mathematics of PCR and PLS have proven very useful for elucidating the relationship between PCR and PLS, but have not provided a clear indication of where one method may be advantageous over the other. Experimental studies on well-defined mixtures consisting of a few components have also been carried out.
Such mixtures, which are often synthetically prepared from a previously determined design matrix, will be referred to as ‘‘simple mixtures’’ in this work. These studies are often inconclusive, since they represent only one particular piece of a bigger puzzle and are limited in the range of variables that can be explored and the statistical inferences that can be made. Simulations of simple systems (for example, using simple Gaussians to represent spectra) can alleviate both of these problems by expanding the ways that variables can be manipulated and allowing Monte-Carlo methods to be employed. One disadvantage, however, is that experimental variables which could be important in distinguishing method performance in real systems (e.g. noise heteroscedasticity and correlation, nonlinearities, temperature changes) may not be accurately represented in simulations. In our experience, the application of PCR and PLS to simple systems with well-defined chemical rank gives no real difference in performance. More challenging and potentially more useful applications involve real chemical systems of much greater complexity. Such systems, which will be referred to as ‘‘complex mixtures’’ in this work, are more often at the heart of practical applications of multivariate calibration (e.g. octane number in gasoline, protein in wheat flour). The complexity of these systems arises from the fact that they can easily contain thousands of chemical constituents and therefore often have no well-defined chemical rank. The concentrations of chemical components can range from major constituents to trace levels and will vary between samples in a natural design with confounding inter-component correlations. Under these conditions, differences in the predictive performance of PCR and PLS may become apparent.
Indeed, such complex systems have been studied experimentally and comparisons of performance have been made, but again it is difficult to draw broad conclusions from a limited number of isolated experimental results. The objective of the work presented here was to introduce a fifth domain of comparative study, specifically the simulation of complex mixtures. The central hypothesis is that differences between PCR and PLS become more pronounced for complex mixtures and that these differences will be more apparent if we are able to carry out a large number of simulations while modifying important experimental variables. Of course, it is impossible to provide a completely accurate simulation of complex chemical systems such as a petroleum or food product because of the number of unknown parameters involved. The strategy developed here for complex mixture simulation is quite generic and will be limited to some degree in the extent to which it can model real mixtures. Therefore, a secondary objective of this work is to provide the rationale behind the simulation methods employed and to illustrate the performance of the calibration methods as it pertains to certain kinds of changes in the simulation parameters, such as mixture complexity and spectral characteristics.

2. Simulation of complex mixtures

The simulations carried out in this study were intended to provide an approximation to real systems which contain a large number of chemical components. Of course, such simulations must always fall short of reality since it is impossible to know the complex relationships among chemical species in a mixture. Nevertheless, it was hoped that conducting such a study would provide insights into the calibration process and perhaps into the model of complex chemical mixtures itself. Some basic assumptions were first made about the behavior of complex mixtures.
The most fundamental of these was that the matrix of measurements follows a linearly additive model:

$$\mathbf{D} = \mathbf{C}\mathbf{S}^{T} + \mathbf{E} \qquad (1)$$

Here, D represents the m × n matrix of instrument responses for m samples collected over n channels, C is the m × p matrix of concentrations for the p components in the m mixtures, S represents the n × p matrix of sensitivity factors for the components present, and E is the m × n matrix of residual errors (noise). For simplicity and without loss of generality, it will be assumed throughout this work that the responses are mixture spectra of some kind, so that S holds the pure component spectra. Note that this model is no different than the one usually assumed in multivariate calibration, but for complex mixtures, the number of components (p) can be quite large, often exceeding m and n. Although this technically leads to an underdetermined system with no unique solution, we postulate that the spectra for most real systems are dominated by relatively few species, which include the quantifiable components of interest. Thus, estimation of concentration by least squares methods such as PCR and PLS will involve the classic variance/bias tradeoff. It is also true that there will likely be some strong cases of correlation (or anticorrelation) among the concentrations of many components in C when a natural calibration design is employed. Since little is known about the distribution of concentrations in real mixtures, let alone their covariance structure, we have made no attempt to model this covariance here.

Even with the simple model given in Eq. (1), there are clearly a large number of mixture variables that can be manipulated. These include: (1) the number of chemical components, (2) the concentration distribution of the components, (3) the concentration distribution for a single component, (4) the degree of correlation of the pure component spectra, (5) the number of wavelength channels, (6) the level and structure of noise in the spectral measurements, and (7) the level of uncertainty in the reference concentrations. The following sections describe how these simulation parameters were synthesized to obtain a reasonable approximation to real systems. Some of the approaches draw from earlier work, which simulated complex mixtures for chromatographic studies [32], but new principles need to be introduced here because of the differences in the two scenarios.

2.1. Distribution of concentrations

One can imagine that a complex mixture such as a petroleum product or food sample could contain thousands of individual components that elicit an instrument response. The concentrations of these components are expected to vary from sample to sample and two questions need to be addressed concerning their distribution: (1) for a given component, how does the concentration vary from sample to sample? and (2) for a given sample, what is the distribution of concentrations of individual components? The answers to these questions are critical in the design of a simulation model for complex mixtures since they affect the relative contributions of analyte and interferences to the overall signal and the range of concentrations available for calibration. Very little information is available in the literature concerning the distribution of the concentrations of different components in a complex mixture. In order to generalize the discussion beyond a single sample, we will consider the question of the distribution of the mean concentrations of different components, where the mean is calculated over all samples. One can imagine two extremes of a mixture consisting of Ncomp components. In one case, the mean concentrations of all components may be nearly the same, resulting in a very narrow distribution. This might be observed, for example, for a set of synthetic calibration mixtures in a designed experiment.
At the other extreme, a mixture might contain a large number of components at low concentration and a few dominant components at higher concentrations. This has some intuitive appeal for natural mixtures and some experimental evidence in the literature has indicated exponential-like distributions, which reflect these characteristics [33,34]. The use of an exponential distribution of mean concentrations was explored in this work, but in the end, it was considered to be too restrictive. A true exponential distribution has a standard deviation equal to its mean, removing the ability to change the shape of the distribution independently of the mean. Instead, the log-normal distribution was chosen since it allowed for a range of situations to be explored. Fig. 1 compares the exponential distribution to the log-normal distribution for a variety of variance values for the latter. Since the log-normal distribution is not symmetric, the median is a better indicator of central tendency than the mean, so the cases presented all have a median of unity. Note that when the variance is small, the limiting case of a narrow distribution of mean concentrations is represented. As the variance increases, the distribution becomes more exponential-like. At a variance of unity, the distribution is quite similar to the exponential distribution, but it lacks the contributions of compounds at very low concentrations, which are likely to be below the detection limit in any case. Because of these characteristics, the log-normal distribution was used to model the mean concentrations of mixture components in this work. Once the distribution of mean component concentrations in a mixture has been determined, it is necessary to define how the concentrations of the individual components distribute around the means from sample to sample.
Experimental results reporting the distribution of a particular component in a series of mixtures are relatively common in the literature, particularly in the fields of environmental and clinical chemistry (see, for example, Refs. [35–37]). Both normal and log-normal distributions are observed, with the latter being reported more frequently in an informal literature survey. In the simulations reported here, a normal distribution about the mean concentration was employed primarily because of its greater simplicity; that is, there was no need to describe the skewness of the distribution. For each component, the standard deviation of the distribution about the mean was taken to be 30% of the mean concentration, i.e. all components had the same relative spread. Although this value is somewhat arbitrary, it was considered reasonable for a calibration experiment involving a natural design.

Based on these assumptions, the generation of the component concentration matrix, C, in Eq. (1) was carried out as follows. First, Eq. (2) was used to generate a (Ncomp × 1) vector of mean concentrations, c_mean, drawn from a log-normal distribution with a median concentration of c_med and a standard deviation (on the logarithmic scale) of σ_c:

$$\mathbf{c}_{mean} = \exp\left[\ln c_{med} + \sigma_{c}\,\mathbf{z}\right] \qquad (2)$$

In this equation, z is a vector (Ncomp × 1) of random values drawn from a normal distribution with a mean of zero and a variance of unity (N(0,1)). Once the vector of mean concentrations for all of the components is obtained in this way, the actual concentrations used to generate the simulated data via Eq. (1) are obtained by assuming a normal distribution with a relative standard deviation (RSD) of 30%. Mathematically, this is represented in Eq. (3):

$$\mathbf{C} = \mathbf{1}_{m}\mathbf{c}_{mean}^{T} + (0.3)\,\mathbf{1}_{m}\mathbf{c}_{mean}^{T} \ast \mathbf{Z} \qquad (3)$$

Here, 1_m designates an (m × 1) vector of ones where m is the number of mixtures, Z is an (m × Ncomp) matrix of N(0,1) random numbers, and "*" indicates the Hadamard or element-by-element product of two matrices. Any negative values produced in this calculation are set to zero.

Fig. 1. Comparison of exponential and log-normal concentration distributions.

2.2. Pure component spectra

The spectroscopic (or other response) profiles of the pure components in a complex mixture will obviously have an important impact on the ability to quantitate a particular analyte. The sensitivity of the calibration for the analyte will depend on both the magnitude and shape of its spectrum relative to the interferents in the mixture. In this work, the magnitude of the spectral response is assumed to be embedded in the distribution of the concentrations; that is, the concentration distribution is assumed to reflect a composite of the absolute concentration and the magnitude of the pure spectral response. For this reason, only the shape of the pure component spectra is considered as a factor in the simulations. The key element here is the correlation, or angle, between the spectral vector of interest (the analyte) and the spectra of other components (interferents) in the mixture. In this regard, one can imagine a broad scope of situations, ranging from cases where spectra are nearly orthogonal to one another (as in the case of mass spectrometry) to cases where spectra are highly overlapped and share a small spectral angle (as in UV-Visible absorption spectroscopy in solution). The intent of this study was to simulate these different situations. Unfortunately, however, no a priori information is available regarding the distribution of spectral angles in complex mixtures. In the absence of such knowledge, two approaches were devised for this work.
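The concentration-generation steps of Section 2.1 (Eqs. (2) and (3)) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `simulate_concentrations` and its parameter names are hypothetical, and it assumes that σ_c scales the N(0,1) draw on the logarithmic scale so that the median of the mean concentrations is c_med.

```python
import numpy as np

def simulate_concentrations(n_comp, m, c_med=1.0, sigma_c=1.0, rsd=0.3, rng=None):
    """Sketch of Eqs. (2) and (3): log-normal mean concentrations, then
    normally distributed sample concentrations with a fixed relative spread."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. (2): log-normal mean concentrations (assumed form; median = c_med).
    z = rng.standard_normal(n_comp)
    c_mean = np.exp(np.log(c_med) + sigma_c * z)
    # Eq. (3): C = 1_m c_mean^T + rsd * (1_m c_mean^T) * Z (element-wise product).
    base = np.outer(np.ones(m), c_mean)
    Z = rng.standard_normal((m, n_comp))
    C = base + rsd * base * Z
    # Any negative concentrations are set to zero, as stated in the text.
    C = np.maximum(C, 0.0)
    return c_mean, C

# Example: 25 mixtures of 80 components, log-scale variance of unity.
c_mean, C = simulate_concentrations(n_comp=80, m=25, rng=np.random.default_rng(1))
```

With a 30% RSD, negative draws (more than about 3.3 standard deviations below the mean) are rare, so the truncation at zero should alter only a small fraction of the entries.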
The first approach was adopted from an earlier simulation study conducted in conjunction with window target-testing factor analysis [32]. In this approach, pure component spectra are constructed in such a way that they exhibit the same mutual correlation, or spectral angle, which can be designated in advance. (Note that spectral similarity can be indicated by either the correlation coefficient or angle between the spectral vectors, but in this work the latter is used more often since it permits a better visualization of the relationship between vectors in the calibration space.) The intention here is to represent changes in spectral similarity, which can be expressed in terms of a mean spectral angle, by varying the magnitude of the fixed angle between all spectra. While this approach results in a system that is well-defined, it has a number of important drawbacks. First, the pure component spectra generated do not resemble spectra that are typically obtained in a chemical system. This is illustrated in Fig. 2, which shows spectra for three components of a 20-component mixture. While this does not represent a mathematical impediment, the resulting mixture spectra are not as aesthetically pleasing as they might be. A second drawback is that the number of spectral channels employed must be greater than or equal to the number of components in the mixture, placing a restriction on the simulation of complex mixtures. Perhaps the biggest disadvantage of this approach, however, is that it does not reflect the distribution of spectral similarity that one might expect to exist naturally in real mixtures; that is, one would expect to observe some pure component spectra which are quite similar and others which are very different. To address the problems of the fixed angle approach, an alternative strategy was employed.
Spectra were generated by first creating an (n × Ncomp) matrix of random numbers drawn from a standard normal distribution (N(0,1)), where n is the number of spectral channels. The columns of this matrix were then smoothed repeatedly until the mean angle among the columns was close to a prespecified desired value. (A 41-point moving average filter with wrapping at the edges was used for this, with a reduction in the filter width on the final pass to obtain the desired angle.) Following normalization of each column to unit length, this matrix was used as the pure component spectra, S, in Eq. (1). When no filtering is used, the average angle between spectra should be close to 90° (correlation coefficient goes to 0). As the amount of smoothing is increased, the mean angle decreases and the distribution of angles becomes more diffuse. This is illustrated in Fig. 3, which compares representative spectra and distributions of angles and correlation coefficients for three different cases involving 100 components and 100 channels. It should be noted that the distributions and mean angles are based on all possible unique pairs of spectral vectors. The three cases illustrated might be considered representative of spectroscopic methods with inherently different information content (e.g. MS, IR and UV-Vis).

Fig. 2. Representative simulated spectra from a 20-component mixture using a fixed mutual angle criterion with an angle of 30°.

Fig. 3. Representative spectra (right-hand panels) from continuous angle simulation method. Three different mean angles (85°, 45°, and 25°) are represented. The left-hand panels show the distribution of spectral angles and correlation coefficients in each case.
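The continuous-angle procedure described above (random columns, repeated wrapped smoothing, then normalization) might be sketched as follows. Several details are assumptions made for illustration rather than facts from the text: the function names are hypothetical, the angle between two spectra is taken as the acute angle (computed from the absolute value of the cosine), and the "reduction in the filter width on the final pass" is interpreted as a search over odd widths up to 41 for the one that lands closest to the target.

```python
import numpy as np

def mean_acute_angle(S):
    """Mean acute angle (degrees) over all unique pairs of columns of S."""
    U = S / np.linalg.norm(S, axis=0)
    cosines = np.abs(U.T @ U)[np.triu_indices(S.shape[1], k=1)]
    return float(np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0))).mean())

def wrapped_moving_average(X, width):
    """Moving average (odd width) down each column with wrap-around edges."""
    half = width // 2
    return sum(np.roll(X, k, axis=0) for k in range(-half, half + 1)) / (2 * half + 1)

def simulate_spectra(n, n_comp, target_deg, width=41, max_passes=500, rng=None):
    """Smooth N(0,1) columns until their mean angle is close to target_deg."""
    rng = np.random.default_rng() if rng is None else rng
    S = rng.standard_normal((n, n_comp))
    for _ in range(max_passes):
        if mean_acute_angle(S) <= target_deg:
            break
        trial = wrapped_moving_average(S, width)
        if mean_acute_angle(trial) < target_deg:
            # Final pass (assumed interpretation): pick the odd filter width
            # whose result is closest to the target angle, then stop.
            candidates = (wrapped_moving_average(S, w) for w in range(1, width + 1, 2))
            S = min(candidates, key=lambda T: abs(mean_acute_angle(T) - target_deg))
            break
        S = trial
    return S / np.linalg.norm(S, axis=0)   # unit-length columns, used as S in Eq. (1)

# Example: 100 channels, 30 components, mean angle near 45 degrees.
S = simulate_spectra(100, 30, 45.0, rng=np.random.default_rng(0))
```

Because the moving average preserves each column's mean while attenuating everything else, heavy smoothing drives the columns toward a common direction, which is why the mean angle falls as more passes are applied.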
Unlike the spectra generated from the fixed angle method, these spectra exhibit varying degrees of correlation away from the mean value and are therefore considered to be more consistent with real mixtures, although the extent to which this is true cannot really be known.

2.3. Noise characteristics

Having defined C and S in Eq. (1), it remains to define the spectral measurement noise matrix, E, in order to generate the simulated data. The noise characteristics of measurements are critical to the success of multivariate calibration, not only in terms of magnitude of the noise, but also its correlation structure, as has been demonstrated in the literature [38–41]. Since there have been no differences observed between PCR and PLS in the presence of correlated noise [41] and both methods are designed to perform optimally in the presence of uniform, uncorrelated, normally distributed noise, this was the error structure assumed here. This simple assumption also removed the need to specify heteroscedasticity and correlation structure in the noise as additional parameters in the simulation. However, the magnitude of the noise in relation to the signal, normally expressed as the signal-to-noise ratio (S/N), is a very important factor. The question of how to express S/N for the simulations carried out is a difficult one, since the objective was to compare different conditions (number of components, spectral shape, concentration distributions) under "equivalent" noise conditions. For univariate data, the S/N for a group of samples can be defined fairly easily, for example, as the mean of calibration responses divided by the standard deviation of the measurements. For multivariate calibration, S/N is usually related to the multivariate sensitivity for a component, SEN_i, divided by the measurement standard deviation [42].
Unfortunately, this definition would be inappropriate here since the sensitivity, which is the length of the net analyte signal vector (NASi) at unit concentration, is different for each component. Furthermore, the idea of the simulations P.D. Wentzell, L. Vega Montoto / Chemometrics and Intelligent Laboratory Systems 65 (2003) 257–279 is to observe changes in sensitivity (through the prediction errors) under different conditions, so fixing the conditions so that they all exhibited the same SEN/N would not be useful. Likewise, employing a fixed noise level for all simulations would not be realistic either, since the magnitude of the total signal changes with conditions (e.g. a mixture with 80 components will give a larger total signal than a mixture with 10 components). What is needed is a way to reflect the univariate S/ N for multichannel data. Often for multichannel data (e.g. chromatography), the S/N is expressed using the peak maximum, but that will not give consistent results here since the spectral shapes can change significantly. Instead, the noise level for these simulations was determined by first obtaining an ‘‘average’’ spectrum, calculated by summing the products of the mean concentration of each component by its corresponding spectrum, i.e. Smean = Scmean. This was then used to calculated an rms averaged signal over the number of channels used, a fraction of which was taken to be the noise as shown in Eq. (4). rnoise 1 ¼ ðS=NÞ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 NScmean N n ð4Þ The S/N is specified prior to a simulation and values of 10, 100, and 1000 were employed in this work. This approach was found to give consistent results under changing simulation conditions and could be intuitively related to realistic measurement conditions. Once rnoise was defined, it was simply multiplied by an (m n) matrix of N(0,1) random numbers to give E. 2.4. 
Other variables

In addition to the variables already discussed, a number of other parameters could be modified in the simulations. A summary of these is given in Table 2 along with the conditions used in this study and the values employed under "benchmark" conditions (see Section 4.1). The number of wavelength channels used was fixed at 100 for all simulations. Although many real instruments have more channels than this, this was seen as a compromise that was sufficient to represent the spectral variability while keeping computation time manageable. In terms of the number of components present in the mixtures, the complexity was varied over three levels (10, 30 and 80). The number of calibration samples was also considered at three levels, but the number of samples used for validation and prediction was fixed at a relatively high value of 100. The validation set was used to determine the number of latent variables to be used and the prediction set was used to evaluate the prediction errors. A large number of samples was used in each case to minimize the variability associated with these estimates. Several levels were also employed for the spectral angle, S/N, and variance of the log-normal concentration distribution. In all of the studies reported here, except those noted in Section 4.8, there were no errors in the reference concentrations used for the calibration, validation and prediction.

Table 2
Summary of the different system variables and values used in the simulations

Variable                                  Values                            Benchmark
Variance of concentration distribution    0.01, 0.1, 1 and 10               1
Spectrum type                             Fixed and Continuous              Continuous
Spectral angles                           10°, 30° and 80° (Fixed)          85° (Continuous)
                                          25°, 45° and 85° (Continuous)
Number of wavelength channels             100                               100
S/N                                       1000, 100 and 10                  1000
Number of components                      10, 30 and 80                     30
Number of calibration samples             15, 25 and 50                     25
Number of validation samples              100                               100
Number of prediction samples              100                               100
Number of LVs used                        80% and 100% of maximum           100% of maximum

3. Experimental

3.1.
General procedure

The objective of these studies was to compare the quality of prediction of PCR and PLS for complex mixtures under a variety of conditions. Prediction errors were used as the principal figure of merit. A complicating factor is that, under a given set of conditions, the predictive ability changes greatly depending on the component under consideration, with major constituents being predicted better than minor ones. Furthermore, in order to obtain a statistically valid sampling, it was necessary to carry out replicate runs under each set of conditions, with the same parameters (e.g. concentration distribution), but different realizations of those parameters (e.g. cmean). Thus, each replicate run had a different profile of component concentrations. For these reasons, it was necessary to compile prediction error information for each component in each replicate run and then find some way to summarize this information in a useful way. In addition, because the number of latent variables used is sometimes cited as an important distinction between PCR and PLS, this variable also needed to be compiled.

Fig. 4. Flowchart of procedure used for simulations.

The algorithm to carry out the simulations and produce this information under a given set of conditions is shown in Fig. 4. First, the system parameters (see Table 2) are selected and used to generate the concentration, spectral and error matrices as previously described. These are then combined as given in Eq. (1) to give the calibration, validation and prediction data. For each component in the mixture, a calibration model is built using PCR and PLS-1, the number of latent variables is selected using the validation set, and the prediction errors are assessed using the prediction set.
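The noise step of the data generation described above follows Eq. (4) of Section 2.3. A minimal sketch in Python (the original programs were written in Matlab) is given here for illustration. It assumes that S is stored as n channels by p components, and the function names and toy values are hypothetical rather than taken from the authors' code.

```python
import math
import random


def noise_sigma(S, c_mean, snr):
    """Noise standard deviation following Eq. (4).

    S      : pure component spectra, n channels x p components (assumed layout)
    c_mean : mean concentration of each of the p components
    snr    : requested signal-to-noise ratio (S/N)
    """
    # "Average" spectrum s_mean = S c_mean, one value per channel.
    s_mean = [sum(s * c for s, c in zip(row, c_mean)) for row in S]
    n = len(s_mean)
    # rms signal over the n channels, then divided by S/N.
    rms = math.sqrt(sum(v * v for v in s_mean) / n)
    return rms / snr


def noise_matrix(sigma, m, n, rng):
    """m x n matrix E: sigma multiplied by N(0,1) random numbers."""
    return [[sigma * rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


# Toy example: 4 channels, 2 components (illustrative values only).
S = [[1.0, 0.0],
     [2.0, 1.0],
     [0.0, 3.0],
     [1.0, 1.0]]
c_mean = [2.0, 1.0]
sigma = noise_sigma(S, c_mean, snr=100)
E = noise_matrix(sigma, m=25, n=len(S), rng=random.Random(0))
```

Because the noise level is tied to the rms of the average spectrum, adding components or changing spectral shapes rescales the noise automatically, which is the property the text asks of an "equivalent" S/N.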
In this case, predictive ability was evaluated using the relative root-mean-squared error of prediction (RRMSEP), defined by Eq. (5).

$$\mathrm{RRMSEP}_j = \frac{\sqrt{\sum_{i=1}^{N_{\mathrm{pred}}} (c_{ij} - \hat{c}_{ij})^{2} / N_{\mathrm{pred}}}}{\mathrm{range}_j} \qquad (5)$$

In this equation, cij and ĉij are the true and predicted concentrations of component j in mixture i of the prediction set, respectively, Npred is the number of prediction samples, and rangej is the range of concentrations for component j. To avoid the statistical variations caused by using a range based on actual simulated concentration values, a theoretical range equivalent to ±2σc was employed in Eq. (5). Given that the RSD of concentration values was set to 30% in the simulation, this was equivalent to a denominator of (1.2cmean) in Eq. (5). By using the relative prediction error in this way, predictive abilities for components with different concentrations could be easily compared. Returning to the flowchart in Fig. 4, once the RRMSEP and optimum number of latent variables (NLV) were calculated for each component by each method (PCR and PLS), these results were stored together with the mean concentration. This process was then repeated Nrep times. For each repetition, a new realization of the concentration distribution and pure component spectra was used to ensure statistically meaningful results. The number of repetitions depended on the number of components in the mixture since the objective was to obtain a similar total number of components, typically 5000, for each set of conditions (i.e. Nrep = 5000/Ncomp). In order to examine the predictive ability of PCR and PLS as a function of the relative abundance of components in a mixture (i.e. major vs. minor components) and to do this across different concentration distributions, it was necessary to first sort the resulting data from lowest to highest mean concentration.
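As an illustration of Eq. (5) and of the replicate count, the following Python sketch computes the RRMSEP with the theoretical range of 4 × RSD × cmean (1.2cmean for a 30% RSD). The function names are hypothetical, and rounding Nrep to the nearest integer is an assumption, chosen because it reproduces the 167 runs quoted for 30 components.

```python
import math


def rrmsep(c_true, c_pred, c_mean, rsd=0.30):
    """Relative RMSEP for one component, following Eq. (5).

    The denominator is the theoretical +/- 2 sigma_c range,
    i.e. 4 * rsd * c_mean (1.2 * c_mean when rsd = 0.30).
    """
    n_pred = len(c_true)
    rmsep = math.sqrt(sum((t - p) ** 2 for t, p in zip(c_true, c_pred)) / n_pred)
    return rmsep / (4.0 * rsd * c_mean)


def n_replicates(n_comp, target_total=5000):
    """Replicate runs giving roughly target_total component results.

    Rounding to the nearest integer is an assumption, not stated in the text.
    """
    return round(target_total / n_comp)


# Toy example: every prediction is off by 0.1 for a component with c_mean = 1.
example = rrmsep([1.0, 1.2, 0.8, 1.0], [1.1, 1.1, 0.9, 0.9], c_mean=1.0)
```

In the toy example the RMSEP is 0.1, so the relative error is 0.1/1.2, or roughly 8%.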
Once sorted, the data were then divided into 20 bins containing equal numbers of prediction results (250 in the case of 5000 data points). Within each bin, the means of the RRMSEP and the NLV were calculated along with their standard deviations. These were then displayed in the form of comparator and difference plots, which are discussed in greater detail in Section 4.

3.2. Computational aspects

All the calculations performed in this work were carried out on a Sun Ultra 60 workstation with 2 × 300 MHz processors and 512 MB of RAM and a 700-MHz Pentium-III PC with 128 MB of RAM. All programs were written in-house using Matlab 6.0 (The MathWorks, Natick, MA).

4. Results and discussion

4.1. Benchmark system

In order to compare the relative effects of changes to the simulation parameters, one set of parameters (indicated in Table 2) was chosen as a benchmark. Changes in one variable at a time from these conditions were then examined and these results are presented in the sections that follow to illustrate the effect of the changes on calibration performance. In this section, the results for the benchmark system itself are described to gain a general appreciation of the characteristics of comparator and difference plots as they relate to the simulation of multivariate calibration in complex mixtures. The results for the analysis of the benchmark system are summarized in the four panels of Fig. 5. As noted in Section 3.1, the x-axis on all of these plots is divided into 20 bins, which relate to the mean concentration of the components represented in that bin. It is important to note that this is a percentile representation. For example, the first bin represents the bottom 5% of mean concentrations obtained by pooling the results of 167 repeat simulations of mixtures containing 30 components each.
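The sorting and binning step of Section 3.1 can be sketched as follows. This is an illustrative Python version, not the authors' Matlab code: the record layout (mean concentration, RRMSEP, NLV), the use of the sample standard deviation, and labelling each bin by its upper percentile edge are assumptions.

```python
import math


def bin_results(records, n_bins=20):
    """Sort (mean_conc, rrmsep, nlv) records by mean concentration and
    summarise them in n_bins bins of (nearly) equal size.

    Each bin must hold at least two records for the standard deviation.
    """
    ordered = sorted(records, key=lambda r: r[0])
    total = len(ordered)
    summary = []
    for b in range(n_bins):
        # Integer bin edges keep the bin sizes as equal as possible.
        chunk = ordered[b * total // n_bins:(b + 1) * total // n_bins]
        row = {"percentile": 100.0 * (b + 1) / n_bins}
        for name, col in (("rrmsep", 1), ("nlv", 2)):
            vals = [r[col] for r in chunk]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
            row[name + "_mean"] = mean
            row[name + "_sd"] = math.sqrt(var)
        summary.append(row)
    return summary


# Toy data: 40 pooled results, so each of the 20 bins holds 2 records.
records = [(float(i), 1.0 / (i + 1), i % 5) for i in range(40)]
summary = bin_results(records)
```

With 5000 pooled results and 20 bins, each bin would hold 250 records, as in the text.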
As such, it represents the most minor components of the mixtures, whereas the last bin represents the dominant components present. Note also that the scale is not linear in concentration; that is, the 50th percentile does not necessarily correspond to one-half of the concentration at the 100th percentile, since this depends on the concentration distribution used. The scale is monotonic, however, and can be used to infer relative magnitudes.

Fig. 5. Simulation results for benchmark system showing: (a) comparator plot for relative prediction errors, (b) comparator plot for number of latent variables, (c) difference plot for relative prediction errors, and (d) difference plot for number of latent variables.

Fig. 5a and b are called comparator plots in this work and show the mean values of the relative prediction errors and number of latent variables within each concentration bin. The error bars shown reflect the standard deviation of the mean values in each case. Fig. 5a is fairly typical for comparator plots of the RRMSEP. It will be noted first that, in terms of prediction errors, there is essentially no difference in the results obtained by PCR and PLS. This was the case for all of the comparisons made in this work, except for the special case described in Section 4.7. For minor components in the mixture, the left side of Fig. 5a exhibits a small plateau that levels off between 25% and 30% RRMSEP. This feature, which is present to a greater or lesser degree on many such comparator plots, shows the region where the model has essentially no predictive ability. It is easily shown [43] that this error level is consistent with a prediction based simply on the mean concentration, so it is clear that the calibration is useless in this region.
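The plateau level can be checked with a short calculation. The sketch below assumes that the model simply returns the mean concentration of component j for every sample and uses the theoretical range of ±2σc from Eq. (5), where σc is the standard deviation of that component's concentration across samples.

```latex
\hat{c}_{ij} = \bar{c}_j
\;\Rightarrow\;
\mathrm{RMSEP}_j \approx \sigma_c ,
\qquad
\mathrm{RRMSEP}_j \approx \frac{\sigma_c}{4\sigma_c} = 0.25
```

If the mean is itself estimated from Ncal calibration samples, the expected squared error is inflated by a factor of (1 + 1/Ncal), which gives about 25.5% for Ncal = 25. This is in line with the lower edge of the observed plateau.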
In the intermediate region after the plateau, the RRMSEP decreases monotonically as the real predictive ability of the models improves. The point at which the predictive ability of the model becomes practically useful is a subjective matter, but the relative prediction errors reach their smallest values for the major components at the right hand side of the figure. Fig. 5a tells us what we might anticipate intuitively—that in a complex mixture with a skewed concentration distribution (relatively few major components and more minor components), it is possible to quantitate the dominant species, but difficult or impossible to determine lesser ones. In this system, which is technically underdetermined (number of calibration samples < number of components), the minor components essentially act as noise in the determination of the major ones. While these observations are not unexpected, they do provide some confidence that the simulation model is behaving as it should. Fig. 5b shows the behavior of the optimum number of latent variables as a function of concentration percentile and again is fairly typical of the plots observed throughout this study. In the low concentration range, where the calibration models are essentially useless, both PCR and PLS use only a few latent variables. This is not surprising since very little information is needed for predictions based on a mean value. As the concentration percentile increases and the predictive ability of the models improves, however, the number of latent variables used increases quickly, reaching a plateau consistent with a multiple linear regression (MLR) model when most or all of the latent variables are used. Generally, throughout these studies, PLS required fewer latent variables than PCR, a result anticipated from literature reports, but this factor also seemed to have no bearing on the relative predictive abilities of the two methods. This observation is discussed in greater detail in Section 4.7. Fig. 
5c and d, which are referred to as difference plots in this work, show the difference between PCR and PLS in the comparator plots to amplify any distinction between the two methods. It is clear from Fig. 5c that differences in the predictive abilities of these two methods are not significant. This was generally the case throughout this work, so in the interest of saving space, difference plots will only be presented if a difference in the prediction errors was observed. Differences in the number of latent variables are almost always observed, but these are usually apparent from the comparator plot. Two additional figures of merit have also been employed in this work to more succinctly summarize the information in the RRMSEP comparator plots. The first is the region above which the calibration models become quantitatively useful, which will be designated with the symbol Q. This point is somewhat arbitrary, but in this work, we have chosen the concentration percentile where the RRMSEP drops to 0.1 or 10%, designated as Q10. A cutoff of 10% is arguably somewhat high for practical quantitation, but the purpose is to make comparisons under different conditions, and lower values would lead to more cases where Q is undefined (insufficient predictive ability over the entire range) and therefore uninformative. To give an indication of the best predictive ability, the %RRMSEP at the highest concentration percentile is reported as a second figure of merit and designated with the symbol E100. For the benchmark system studied here, Q10 = 46 (i.e. the species with concentrations above the 46th percentile can be predicted with some quantitative reliability) and E100 = 0.8% (i.e. the dominant species can be predicted with 0.8% RRMSEP).

4.2.
Effects of spectral correlation and spectrum type

Since the performance of multivariate calibration is highly dependent on the selectivity of the responses, the effects of the parameters associated with the simulated spectra will be described first. The benchmark system employs pure component spectra with a continuous distribution of angles and a median angle of 85°. This means that relatively little smoothing has been performed on the original randomized spectra. Fig. 6 compares the predictive performance of PCR and PLS on the benchmark system for continuous angle spectra as the median spectral angle changes. As anticipated, the quality of the predictions decreases with a decrease in the median spectral angle. In fact, the models for median spectral angles of 25° and 45° are essentially useless for all components in the mixture, as reflected by the high RRMSEP values and the fact that the optimum number of latent variables remains small throughout the range of components. What has happened here is that the small plateau corresponding to poor predictions on the left-hand side of the plot for the benchmark system (top panel) has extended further to the right due to the reduced selectivity for individual components. A small downturn is noticed at the right-hand side of both cases involving smaller angles, indicating some improvement in predictive ability, but not to the extent that it is very useful. This is not surprising since, in the continuous angle systems, there will be a significant number of spectra with high correlations and this will increase as the median angle decreases. That is not to say that useful quantitative results could not be obtained from systems with small median angles, but other system parameters would have to change. For example, the number of components would have to decrease, the S/N would have to increase, or the concentration distribution would have to become more skewed to see an overall improvement in predictive ability.
The situation is a little different for the fixed angle spectra, where all of the pure component spectra have the same mutual correlation. Comparator plots for these cases are shown in Fig. 7. In these cases as well, the calibration models are not as good overall when the spectral angle is decreased, but the effect is more subtle. There is a monotonic shift of the RRMSEP curves to the right as the angle decreases, extending the region of poor prediction and diminishing the region of good quantitation. The effects, however, are much less dramatic than for the case of the continuous angle distribution.

Fig. 6. Comparator plots showing the effects of mean spectral angle for continuous angle spectra: θ = 85° (a and b, benchmark system), 45° (c and d), and 25° (e and f).

Fig. 7. Comparator plots showing the effects of mean spectral angle for fixed angle spectra: θ = 80° (a and b), 30° (c and d), and 10° (e and f).

These observations may be more clearly indicated in the performance table presented as Table 3, which provides Q10 and E100 values for these systems. In the case of fixed angles, the table indicates that, at 80°, 60% of the most abundant components can be reasonably predicted (i.e. 100 − 40), while this number drops to 44% at an angle of 10°. From these studies, two conclusions can be reached: (1) the correlation of pure component spectra is an important factor in the success of multivariate calibration, and (2) the distribution of spectral angles is more important than the median value in determining calibration success. The former point is fairly obvious, so confirmation through simulation is satisfying. The latter point is less apparent, but not surprising since, regardless of the number of components present, it takes only one highly correlated spectrum to severely degrade the sensitivity for an analyte.
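The two figures of merit used in the performance tables, Q10 and E100, can be extracted from binned comparator-plot data along the following lines. This Python sketch is illustrative only: the function name is hypothetical, and the linear interpolation between bins is an assumption, since the text does not state how non-bin values such as Q10 = 46 were obtained.

```python
def q10_e100(percentiles, rrmsep, cutoff=0.10):
    """Return (Q10, E100) from binned mean RRMSEP values.

    Q10  : percentile at which the RRMSEP first falls to `cutoff`,
           linearly interpolated between bins (an assumption);
           0.0 if the first bin is already at or below the cutoff,
           None if the cutoff is never reached.
    E100 : %RRMSEP in the highest-concentration bin.
    """
    e100 = 100.0 * rrmsep[-1]
    if cutoff >= rrmsep[0]:
        return 0.0, e100
    for k in range(1, len(rrmsep)):
        if cutoff >= rrmsep[k]:
            # Fraction of the way from bin k-1 to bin k where the cutoff is hit.
            frac = (rrmsep[k - 1] - cutoff) / (rrmsep[k - 1] - rrmsep[k])
            q = percentiles[k - 1] + frac * (percentiles[k] - percentiles[k - 1])
            return q, e100
    return None, e100


# Toy curve with four bins (illustrative values, not simulation results).
q10, e100 = q10_e100([25.0, 50.0, 75.0, 100.0], [0.26, 0.14, 0.06, 0.01])
quantifiable = 100.0 - q10   # percentage of components above Q10
```

For the toy curve the cutoff is crossed halfway between the 50th and 75th percentiles, so Q10 is 62.5 and 37.5% of the components count as quantifiable, by the same arithmetic as the "100 − 40" quoted above.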
With respect to other parameters examined in the simulations that are described in the following sections, results for the fixed angle and continuous angle systems were qualitatively similar. Therefore, only the results for the continuous angle systems (which, in our opinion, should bear a stronger resemblance to reality) will be reported.

Table 3
Performance table for effect of the mean spectral angle and the spectrum type (continuous and fixed angle distributions, 30 components, σc = 1, S/N = 1000)

Case                   Q10 (Percentile)   E100 (%RRMSEP)
Continuous, θ = 25°    –                  24
Continuous, θ = 45°    –                  16
Continuous, θ = 85°    46                 0.8
Fixed, θ = 10°         56                 0.6
Fixed, θ = 30°         44                 0.6
Fixed, θ = 80°         40                 0.2

4.3. Effects of mean concentration distribution

In a synthetic multivariate calibration experiment involving only a few components with carefully designed mixtures, we would expect a fairly narrow distribution of mean concentrations. In other words, the experiment would likely be designed such that the mean response from each analyte was approximately the same. In real applications, where the samples are derived from complex natural systems, there are likely to be a much larger number of components to which the instrument responds. Even imagining that the pure spectral response could be obtained for each of these, which is unlikely, the concept of finding the contravariant vector for calibration of a particular analyte becomes irrelevant because: (a) there may be an insufficient number of channels to mathematically determine this vector, and (b) the sensitivity for the analyte of interest would become vanishingly small due to the requirement of orthogonality to all of the other spectra.
However, it is still possible to determine some analytes in such mixtures because: (a) a large number of components may be present at such low levels relative to the analyte that they can be ignored with limited consequences, and (b) the concentrations of many components may be highly correlated, permitting quantitation of minor components via their correlation with major ones. Both of these characteristics result in an effective reduction in the apparent number of components in the mixtures and lead to the classical variance/bias tradeoff. In this work, we examined the effects of the distribution of mean concentrations, but not the effects of concentration correlations, which are more difficult to simulate in the absence of a priori information. All of the mean concentration distributions employed in this work had a median concentration of unity, but with different levels of skewness as expressed by the variance of the log-normal distribution. The higher the variance, the greater the skewness and the larger the proportion of minor or trace components relative to the major ones. Values ranging from σc = 0.01, where the distribution is very sharp and all components have essentially the same mean concentration, to σc = 10, where the distribution is highly skewed, were examined. However, only the range between 0.1 and 3.2 is reported here since that was sufficient to examine the limiting effects. The benchmark system, with σc = 1, has a distribution which is substantially skewed, but not extreme (see Fig. 1). Fig. 8 shows comparator plots illustrating the effects of changes in the mean concentration distribution on prediction when all other parameters are maintained at benchmark conditions. When σc = 0.1, the mean concentrations in the mixture are fairly uniform, so no components are present at insignificant levels and every component can potentially interfere in the determination of every other one.
As a result, the RRMSEP values are similar for all components and somewhat better than predictions based on mean values alone, but still reflect poor predictive ability overall. In this instance, the transition region between regions of very poor predictions and reasonably good predictions has effectively expanded to occupy the full range of concentrations and the plateaus associated with the two extremes do not appear. This is as we would expect, since all of the components are present at approximately the same concentration levels and therefore should exhibit similar prediction errors. As other conditions in the simulation are changed for this case, we expect the quality of predictions to change overall, but remain similar for all components. Significant improvements will be seen at the point where the number of components becomes equal to the number of calibration samples, since the present system is underdetermined (see Sections 4.5 and 4.6).

Fig. 8. Comparator plots showing the effects of log-normal mean concentration distribution: σc = 0.1 (a and b), 0.32 (c and d), 1.0 (e and f, benchmark conditions), and 3.2 (g and h).

As the concentration distribution becomes more skewed, Fig. 8 shows that there is increased differentiation between regions of good and bad predictive ability. This means that the plateaus in the two limiting regions become more extended and the transition between them becomes sharper. This is expected, since increasing differences between the major and minor components improves the ability to determine the former at the expense of the latter. However, the proportion of components that can be reliably determined decreases as the distribution becomes more skewed. This is because, for very skewed distributions, the proportion of major components is quite small, although they can be predicted quite well.
This is also clear from the figures of merit presented in Table 4. Note that in going from σc = 0.32 to σc = 1, Q10 decreases from 73 to 46 (indicating an increase in the number of components that can be accurately predicted), but then increases again to 62 for σc = 3.2. On the other hand, E100 decreases monotonically, indicating a continuous improvement in predictive ability for the most dominant components as the distribution becomes more skewed.

Table 4
Performance table for effect of the concentration distribution (continuous (θ = 85°) and fixed angle (θ = 80°) spectral distributions, 30 components, 25 calibration samples, S/N = 1000)

Case                    Q10 (Percentile)   E100 (%RRMSEP)
Continuous, σc = 0.1    100                10
Continuous, σc = 0.32   73                 5.6
Continuous, σc = 1      46                 0.8
Continuous, σc = 3.2    65                 0.0
Fixed, σc = 0.1         53                 7.2
Fixed, σc = 0.32        50                 3.2
Fixed, σc = 1           40                 0.2
Fixed, σc = 3.2         62                 0.0

4.4. Effects of signal-to-noise ratio

For the benchmark system used in this study, the S/N (as defined in Section 2.3) was set to 1000. This represents relatively small (ca. 0.1% RSD) but not unreasonable errors in the measured responses. It was important to have a relatively large S/N for the benchmark system so that the effects of measurement noise on prediction errors would not obscure the effects of other variables being examined. For simplicity, the studies here assumed a homoscedastic, Gaussian noise structure with no errors in the reference concentrations. As the S/N of the responses decreases, one would expect the performance of the multivariate calibration models to become progressively worse, and indeed this is the case, as indicated in the comparator plots of Fig. 9 and the performance table presented as Table 5. With decreasing S/N, the proportion of components that can be reliably predicted decreases, with the Q10 values becoming progressively larger. Likewise, the most dominant components are not predicted as well, as indicated by a progressive increase in E100 values.
However, the changes are not as dramatic as one might expect. It is still possible to quantitate some components in the mixture with a S/N of only 10 and the changes in performance between S/N = 100 and S/N = 1000 are fairly subtle. This suggests that measurement noise is secondary to some of the other characteristics of the mixture in determining predictive ability at this level.

Fig. 9. Comparator plots showing the effects of S/N: S/N = 1000 (a and b, benchmark conditions), 100 (c and d), and 10 (e and f).

4.5. Effects of the number of mixture components

In complex mixtures, it is anticipated that the predictive ability of the multivariate calibration model will diminish overall as the number of components in the mixture increases, since this increases the likelihood of spectral interferences. This is the case in the studies carried out in this work, as demonstrated by the comparator plots in Fig. 10. When there are only 10 components in the mixtures, excellent predictions are obtained for all components, although there is more than an order of magnitude difference in the relative prediction errors for the least and most abundant components. The magnitudes of these errors will, of course, be dependent on other system parameters, such as the S/N and the degree of spectral correlation. With 30 components (the benchmark system), 54% of the components can be determined with an RRMSEP of 10% or better, while this number drops to 8% when there are 80 components in the mixture. Note that this represents a reduction not only in the percentage of components that can be quantitated, but also the absolute number.
One of the reasons for the sharp contrast in performance between 10 and 30 components is, of course, the fact that 25 calibration samples are used in the benchmark system and, therefore, the system with 10 components is overdetermined while that with 30 components is underdetermined. This is also reflected by the fact that the number of latent variables used stabilizes at around 10 for both PCR and PLS in the 10-component system. Because of these conditions, it is possible to develop good models for all of the components in the 10-component system, while only the major components can be reliably modeled in the 30-component system. Therefore, the number of calibration samples employed can play an important role and this factor is examined in the next section.

Table 5
Performance table for effect of the signal-to-noise ratio (continuous (θ = 85°) and fixed angle (θ = 80°) spectral distributions, 30 components, 25 calibration samples, σc = 1)

Case                     Q10 (Percentile)   E100 (%RRMSEP)
Continuous, S/N = 1000   46                 0.8
Continuous, S/N = 100    53                 1.1
Continuous, S/N = 10     85                 3.4
Fixed, S/N = 1000        40                 0.2
Fixed, S/N = 100         50                 0.6
Fixed, S/N = 10          92                 4.7

Fig. 10. Comparator plots showing the effects of the number of mixture components: Ncomp = 10 (a and b), 30 (c and d, benchmark conditions), and 80 (e and f).

4.6. Effects of the number of calibration samples

An increase in the number of calibration samples should result in a general improvement in the performance of multivariate calibration models for two main reasons: (1) the availability of a larger number of observations allows better modeling of the correlations between responses and analyte concentrations and therefore more reliable models, and (2) for complex mixtures, it becomes less likely that we will have a mathematically underdetermined system. These effects are illustrated in Fig. 11, which shows comparator plots using 15, 25 and 50 calibration samples for the benchmark conditions.

Fig. 11. Comparator plots showing the effects of the number of calibration samples: Ncal = 15 (a and b), 25 (c and d, benchmark conditions), and 50 (e and f).

With only 15 calibration samples, predictive performance is degraded relative to the benchmark system (25 samples), allowing only the upper 20% of components to be determined rather than 54%. There is a dramatic change when 50 calibration samples are used such that, under these circumstances, all 30 components can be reasonably well quantitated (i.e. Q10 = 0%), with prediction errors ranging from 2.2% to 0.04%. The reason for this is that the system becomes overdetermined when the number of calibration samples is greater than the number of components, so it becomes possible to fully model the spectral subspace. This is confirmed by examining the plots of the number of latent variables which, for the case of 50 calibration samples, is consistently around 30 for both PCR and PLS, despite the ability to go higher. Also note that the number of latent variables used by PCR is only about one more than is used by PLS in this case. The fact that a system has more calibration samples than components does not guarantee reliable predictions for all components, since this ultimately relates to sensitivity and noise considerations. In this case, the system has a relatively large S/N and a relatively low spectral correlation. As the S/N decreases, the spectral overlap becomes greater, or the number of components increases, the prediction errors obtained in the simulations (not shown here) were observed to increase, even when the system was overdetermined.

4.7.
Effects of restricting the number of latent variables

In the preceding discussions, little has been said about the differences between the relative predictive performance of PCR and PLS, even though examining these differences was the central motivation for this study. This is because none of the cases presented so far, nor numerous others not presented but carried out away from benchmark conditions, showed any statistically significant differences in the prediction errors for the two methods. Because of this negative conclusion, the emphasis in the discussions so far has been on demonstrating the characteristics of the complex mixture model as it pertains to multivariate calibration in general and thereby attempting to instill some confidence in its ability to function as a reasonable simulation model. In contrast, for virtually all of the cases studied, there are distinct differences in the number of latent variables used by PCR and PLS, with PLS requiring significantly fewer latent variables than PCR, especially in the useful predictive regions of the comparator plots. This is consistent with the results reported in the literature for real samples. However, it should also be noted that in those systems that are not underdetermined (Figs. 10b and 11f), there is very little difference in the number of latent variables employed by the two methods, that number being equal to the number of mixture components. This is also consistent with literature results for multivariate calibration mixtures with a limited number of components, where most of the time there is little difference in the number of latent variables employed by PCR and PLS. These observations lend some support to the complex mixture model developed here and suggest that the apparently greater parsimony of PLS may arise from the fact that real mixtures with a large number of components are mathematically underdetermined when it comes to multivariate calibration.
Although PLS may appear more parsimonious in many cases, this does not seem to influence its predictive ability, at least not in the cases studied here.

However, there was one set of conditions under which significant differences were observed in the performance of PCR and PLS. It is sometimes the practice in multivariate calibration for the analyst to restrict the number of latent variables retained for the validation step. The rationale for this restriction may be related to the simplicity (parsimony) of the model in the X space, a desire to remove the noise which is often associated with later loading vectors, or simply a desire to improve the computational efficiency of the validation step by reducing the number of models. It was decided to investigate what effect such restrictions on the number of latent variables would have on the prediction results, so simulations were carried out in which the number of latent variables retained for the calibration procedure was constrained to be no more than 80% of the maximum possible. This is referred to as the "constrained" system. Constrained systems for all of the cases discussed so far were examined, but only the results for the constrained benchmark system are reported here, since these were typical of the results in all other cases. Comparator and difference plots for the constrained benchmark system are presented in Fig. 12.

Fig. 12. Comparator (a and b) and difference (c and d) plots for the benchmark system under constrained conditions where the number of latent variables available was restricted to 80% of the maximum possible.

The figure shows that PCR and PLS gave very similar relative prediction errors in the limiting cases of the least and most abundant components (left and right plateau regions), but in the intermediate region, PLS clearly outperforms PCR.
The difference plot shows that the RRMSEP for PLS is as much as 7% lower than for PCR. In other words, under these conditions of an underdetermined system with a skewed distribution of mean concentrations, PLS is better at predicting intermediate components than PCR when the number of latent variables was constrained. This is reflected in the figures of merit as well. For PCR and PLS, the Q10 values are 65% and 46%, respectively, indicating that PLS can reliably determine a larger number of components. In contrast, the E100 values are 1.3% and 0.8%, indicating a relatively small difference in the ability to predict the most dominant components. Similar plots were observed for most of the other cases examined, except where systems were overdetermined, in which case PCR and PLS gave very similar results. The improved performance of PLS under constrained conditions and its apparently more parsimonious models under all conditions are misleading, however, and both of these properties can be traced to the same source. Since the PLS loadings can be shown to be linear combinations of the PCA eigenvectors [1,3,9], the PLS models are, in fact, not more parsimonious, but only appear to be so. Both PLS and PCA use the same eigenvectors, except that PLS combines the information more effectively by employing concentration information so that fewer latent variables are needed to generate the same predictions. The improved performance of PLS under constrained conditions is illusory. The comparison is not fair, since PLS is not constrained to the same extent as PCR. The formation of the PLS scores and loadings utilizes information from all of the eigenvectors and is therefore capable of incorporating them into the model. Constraints are only applied after this step has been carried out, unlike PCR which is restricted from the beginning. This makes PLS less susceptible to restrictions placed on the number of latent variables. 
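The argument that PLS draws on all of the eigenvectors even when the number of latent variables is restricted can be checked numerically with a small sketch. It uses a textbook NIPALS-style PLS1 and an SVD-based PCR on hypothetical random data (6 samples, 15 channels, no mean centering); these are illustrative assumptions and not the implementations or data used in this study. The sketch counts how many PCA eigenvectors contribute to each regression vector when only three latent variables are retained.

```python
import numpy as np

def pcr_coefficients(X, y, k):
    """PCR regression vector built from the first k PCA eigenvectors of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

def pls1(X, y, k):
    """Textbook NIPALS-style PLS1 with k latent variables.

    Returns the regression vector b and the weight matrix W (channels x k).
    """
    Xk, yk = X.astype(float).copy(), y.astype(float).copy()
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yk                 # weight vector uses concentration information
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        qa = (yk @ t) / tt            # y loading
        Xk -= np.outer(t, p)          # deflate X
        yk -= qa * t                  # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, W

# Hypothetical underdetermined data: fewer samples than channels.
rng = np.random.default_rng(1)
n, p = 6, 15
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
_, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows are the PCA eigenvectors

# Unconstrained case: all n latent variables available to both methods.
b_pls_full, W_full = pls1(X, y, n)
b_pcr_full = pcr_coefficients(X, y, n)

# Constrained case: only 3 latent variables retained.
b_pls_3, _ = pls1(X, y, 3)
b_pcr_3 = pcr_coefficients(X, y, 3)

# Number of eigenvectors with a non-negligible contribution to each model.
n_eig_pcr = int(np.count_nonzero(np.abs(Vt @ b_pcr_3) > 1e-8))
n_eig_pls = int(np.count_nonzero(np.abs(Vt @ b_pls_3) > 1e-8))
print("eigenvectors used with 3 LVs: PCR =", n_eig_pcr, ", PLS =", n_eig_pls)
print("same model when unconstrained:", np.allclose(b_pls_full, b_pcr_full, atol=1e-6))
```

In this sketch the PLS weight vectors lie entirely in the space spanned by the PCA eigenvectors, and the two regression vectors coincide when all latent variables are used. When the number of latent variables is capped, the PCR model is confined to the leading eigenvectors, whereas the PLS model still contains contributions from the later ones.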
The tendency to place an emphasis on the earlier eigenvectors in PCR is perhaps derived from simple mixtures where these factors dominate, but it is clear from these simulations that the later eigenvectors can play a critical role in calibration with complex mixtures, since their exclusion leads to a degradation in model performance. The importance of the later eigenvectors has also been recognized in the literature in the development of factor selection methods [44]. It is perhaps the practice of applying restrictions to the number of latent variables used in PCR that has led to the perception that PCR has inferior predictive ability compared to PLS. This perception was not supported by any of the simulations carried out in this work under unconstrained conditions.

4.8. Other factors examined

In addition to the mixture and instrumental parameters discussed in the preceding sections, a number of studies were carried out to examine the effects of instrumental non-idealities, in particular baseline offsets and nonlinearities, on the performance of PCR and PLS. Baseline offsets were introduced by adding a random baseline shift to each of the spectra used for calibration and prediction. Nonlinearities in the instrumental response were introduced by adding negative curvature, modeled on a second-order polynomial, to responses that were above a certain threshold level. Since all of these studies were exploratory in nature and not intended to be comprehensive, the results are summarized only briefly here. In all cases (except where the baseline offsets were used with overdetermined systems), the introduction of these effects led to a general degradation of performance. However, as in all of the other studies carried out under unconstrained conditions, no significant differences in the performance of PCR and PLS were detected. There is some opinion that differences should be observed when nonlinearities are present [45], but to our knowledge, no comprehensive study has confirmed this. In fact, recent work using extensive nonlinear experimental data [46] supports the conclusions presented here; that is, there are no significant differences between PCR and PLS. Until more extensive studies are done, the issue of nonlinearities remains an open question.

In another exploratory study, the effect of errors in the reference (concentration) values was examined. After generation of the spectra, but before calibration, normally distributed errors were added to the calibration and prediction concentrations. Several experiments were done with errors ranging from 1% to 10% of the mean concentration for each component. Prediction errors were calculated using both the error-free reference values and those contaminated with error. As with other studies, no cases were observed where there was a significant difference between PCR and PLS. These results, although based on limited simulations, are consistent with those of Thomas and Haaland [8], who found only very marginal improvements in performance of PCR over PLS when concentration noise was a factor.

4.9. Summary statistics

Although the results presented here have focussed on univariate perturbations from the benchmark conditions, the study as a whole employed a full factorial design. To more thoroughly analyze the differences between PCR and PLS, this broader population base was used for statistical testing. In terms of the prediction ability, the populations consisted of mean RRMSEP values for PCR and PLS for each concentration bin in each trial (about 8000 cases for each method). Both standard and paired t-tests were used to examine the differences between the two methods. Since the paired t-tests gave greater sensitivity to the differences, as would be expected in a study such as this, only these results are reported.
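The greater sensitivity of the paired test can be seen in a short sketch. The data here are synthetic and purely hypothetical (a large case-to-case variation shared by both methods, a small constant offset, and small independent noise); they are not the RRMSEP values of this study, and the function names are illustrative. The sketch compares the paired t-statistic with an unpaired two-sample statistic on the same values.

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic computed from the case-by-case differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def two_sample_t(a, b):
    """Unpaired (Welch-type) two-sample t-statistic."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

# Synthetic, hypothetical errors for two methods over 8000 cases.
rng = np.random.default_rng(0)
n_cases = 8000
base = rng.uniform(0.0, 100.0, n_cases)                   # variation shared by both methods
err_a = base + 0.05 + rng.normal(0.0, 0.1, n_cases)       # method A: small constant offset
err_b = base + rng.normal(0.0, 0.1, n_cases)              # method B

print("paired t     :", paired_t(err_a, err_b))
print("two-sample t :", two_sample_t(err_a, err_b))
```

Because the large variation between cases cancels in the differences, the paired statistic responds to the small offset, while the unpaired statistic is dominated by the spread across cases.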
For the continuous angle spectra, the t-statistic from the paired t-test was 1.16, well below the critical value of 1.96 at the 95% confidence level, and therefore not statistically significant. In fact, closer examination of the distribution of (RRMSEP_PCR − RRMSEP_PLS) reveals that, of the 92% of results that fall within ±0.25% RRMSEP, 55% are on the negative side and 37% are on the positive side, therefore slightly favouring PCR. Again, however, this is not statistically significant. In the case of fixed angle spectra, a t-statistic of 19.41 was obtained from the paired t-test, indicating a significant difference between the two methods in favour of PLS. Although significant, this difference is not very meaningful. Closer analysis shows that most of the cases with differences greater than 0.2% RRMSEP fell in regions of very low analyte concentration, i.e. in the region of the plateau where predictions are meaningless. In the histogram, more than 90% of values fall within ±0.2% RRMSEP and essentially everything falls within ±1%. The population size of this study is such that very small differences in the two methods can be detected, but such differences are not likely to be relevant in real applications. Furthermore, in our opinion, the fixed angle scenario is not likely to accurately reflect physical reality.

In terms of the number of latent variables used in the calibration models, all statistical tests showed that PLS used fewer latent variables than PCR, as expected from the other results reported here. As noted earlier, however, this is somewhat misleading and does not reflect better predictive ability.

5. Conclusions

Results reported in the literature to date have not shown a clear advantage of PLS over PCR in terms of predictive ability in multivariate calibration.
The premise of the present study was that differences between the two methods might become more apparent if they were applied to complex mixtures under a variety of conditions. To this end, simulation models for complex mixtures were developed so that the performance of PCR and PLS could be evaluated under changing conditions of concentration distribution, spectral correlation, noise and other factors. Results from the simulations showed that the predictions from multivariate calibrations fell into three regions as shown in Fig. 13: a plateau corresponding to minor components that could not be reliably predicted, a plateau corresponding to major components that could be predicted reasonably well, and a transition region between the two characterized by components present in intermediate amounts that could be determined with varying reliability. The extent of the plateau regions and the sharpness of the transition region were determined by the characteristics of the mixtures, instrument, and calibration embodied in the simulation parameters. In general, the curves behaved qualitatively as they were expected to when the conditions changed, lending some credence to the simulation models employed. Although the simulation of complex chemical mixtures is a difficult problem due to a lack of prior knowledge of real systems, the model presented in this work is believed to be a good first approximation. It is an important aspect of this study and may be useful for the study of other calibration methods.

Fig. 13. General behavior of multivariate calibration applied to complex mixtures. (Axes: relative prediction error versus concentration percentile. Labeled regions: region of no predictive ability, transition region, region of good predictive ability.)

In all of the simulations carried out in this work, none of the conditions examined showed significant differences between the predictive abilities of PCR and PLS when the number of latent variables available to the two methods was unconstrained.
In a "meta-analysis" of all of the data accumulated, a significant but very small difference was detected for the case of fixed angle spectra, but given the magnitude of this difference and the fact that the fixed angle spectra are not as likely to reflect reality as the continuous angle case, this finding should not be over-interpreted. Substantial differences between the two methods were found when the maximum number of latent variables available was constrained, and these may be the conditions that lead to reports of superior PLS performance in the literature. Comparison under these conditions is unfair, however, since in reality, PLS has access to all of the eigenvectors. Since PLS uses a linear combination of the eigenvectors to construct its loading vectors, it almost always required fewer latent variables in the simulations carried out here. However, in reality, it cannot be regarded as more parsimonious than PCR.

The negative findings of this study with regard to the initial hypothesis are supported by those reports in the literature that did not find substantial differences between PLS and PCR. However, these conclusions do not mean that there cannot be particular cases where one or the other method will perform better. Nor do they mean that there are no circumstances, not examined in this study or reflected in the simulations, where PLS will give better results. Nevertheless, the belief held by many practitioners that PLS has inherently better predictive abilities would appear to be unsubstantiated. Other advantages of PLS, such as the greater interpretability of the loadings, cannot be neglected, however.

Acknowledgements

The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council (NSERC) of Canada.

References

[1] J.H. Kalivas, Anal. Chim. Acta 428 (2001) 31–40.
[2] A. Phatak, S. de Jong, J. Chemom. 11 (1997) 311–338.
[3] A. Lorber, L.E. Wangen, B.R. Kowalski, J. Chemom. 1 (1987) 19–31.
[4] I.S. Helland, Commun. Stat., Simul. Comput. 17 (1988) 581–607.
[5] S. Wold, J. Trygg, A. Berglund, H. Antti, Chemom. Intell. Lab. Syst. 58 (2001) 131–150.
[6] H. Martens, Chemom. Intell. Lab. Syst. 58 (2001) 85–95.
[7] T. Naes, H. Martens, Commun. Stat., Simul. Comput. 14 (1985) 545–576.
[8] E.V. Thomas, D.M. Haaland, Anal. Chem. 62 (1990) 1091–1099.
[9] S. de Jong, J. Chemom. 7 (1993) 551–557.
[10] I. Durán-Merás, A. Muñoz de la Peña, A. Espinosa-Mansilla, F. Salinas, Analyst 118 (1993) 807–813.
[11] H.J. Luinge, E. Hop, E.T.G. Lutz, H.A. van Hemert, E.A.M. de Jong, Anal. Chim. Acta 284 (1993) 419–433.
[12] P. Bhandare, Y. Mendelson, E. Stohr, R.A. Peura, Appl. Spectrosc. 48 (1994) 271–273.
[13] N. Dupuy, L. Duponchel, B. Amram, J.P. Huvenne, P. Legrand, J. Chemom. 8 (1994) 333–347.
[14] T. Almoey, E. Haugland, Appl. Spectrosc. 48 (1994) 327–332.
[15] K.N. Andrew, P.J. Worsfold, Analyst 119 (1994) 1541–1546.
[16] D. Jouan-Rimbaud, B. Walczak, D.L. Massart, I.R. Last, K.A. Prebble, Anal. Chim. Acta 304 (1995) 285–295.
[17] F. Navarro-Villoslada, L.V. Pérez-Arribas, M.E. León-González, L.M. Polo-Díez, Anal. Chim. Acta 313 (1995) 93–101.
[18] J. Saurina, S. Hernández-Cassou, Analyst 120 (1995) 305–312.
[19] J.G. Sun, J. Chemom. 9 (1995) 21–29.
[20] A. Garrido Frenich, J.L. Martínez Vidal, P. Parrilla, M. Martínez Galera, J. Chromatogr. 778 (1997) 183–192.
[21] M. Martínez Galera, J.L. Martínez Vidal, A. Garrido Frenich, M.D. Gil García, J. Chromatogr. 778 (1997) 139–149.
[22] J. Guiteras, J.L. Beltrán, R. Ferrer, Anal. Chim. Acta 361 (1998) 233–240.
[23] F.R. Burden, R.G. Brereton, P.T. Walsh, Analyst 122 (1997) 1015–1022.
[24] Y. Ni, X. Gong, Anal. Chim. Acta 354 (1997) 163–171.
[25] T. Galeano Díaz, A. Guiberteau, J.M. Ortíz Burguillos, F. Salinas, Analyst 122 (1997) 513–517.
[26] A. Donachie, A.D. Walmsley, S.J. Haswell, Anal. Chim. Acta 378 (1999) 235–243.
[27] J. Saurina, S. Hernández-Cassou, E. Fabregas, S. Alegret, Analyst 124 (1999) 733–737.
[28] J.H. Kalivas, J. Chemom. 13 (1999) 111–132.
[29] M. Blanco, J. Coello, H. Iturriaga, S. Maspoch, J. Pages, Anal. Chim. Acta 384 (1999) 207–214.
[30] J. Saurina, S. Hernández-Cassou, Analyst 124 (1999) 745–749.
[31] V. Centner, J. Verdú-Andrés, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D. Massart, O.E. de Noord, Appl. Spectrosc. 54 (2000) 608–623.
[32] M.T. Lohnes, R.D. Guy, P. Wentzell, Anal. Chim. Acta 389 (1999) 95–113.
[33] L.J. Nagels, W.L. Creten, P.M. Vanpeperstraete, Anal. Chem. 55 (1983) 216–220.
[34] F. Dondi, Y.D. Kahie, G. Lodi, M. Remelli, P. Reschiglian, C. Bighi, Anal. Chim. Acta 191 (1986) 261–273.
[35] R.E. Thompson, E.O. Voit, G.I. Scott, Environmetrics 11 (2000) 99–119.
[36] A. Kubala-Kukus, D. Banas, J. Braziewicz, U. Majewska, S. Mrowczynski, M. Pajek, Spectrochim. Acta, Part B: Atom. Spectrosc. 56B (2001) 2037–2044.
[37] Z. Michailidis, F. Vosniakos, N. Koutinas, J. Environ. Prot. Ecol. 2 (2001) 61–67.
[38] P.D. Wentzell, M.T. Lohnes, Chemom. Intell. Lab. Syst. 45 (1999) 65–85.
[39] C.D. Brown, P.D. Wentzell, J. Chemom. 13 (1999) 133–152.
[40] C.D. Brown, L. Vega-Montoto, P.D. Wentzell, Appl. Spectrosc. 54 (2000) 1055–1068.
[41] S. Schreyer, M. Bidinosti, P.D. Wentzell, Appl. Spectrosc. 56 (2002) 789–796.
[42] K.S. Booksh, B.R. Kowalski, Anal. Chem. 66 (1994) 782–791.
[43] L. Vega-Montoto, Study of the Performance of Principal Component Regression and Partial Least Squares Regression Using Simulation of Complex Mixtures, MSc thesis, Dalhousie University, Halifax, Canada, 2001.
[44] S.Z. Fairchild, J.H. Kalivas, J. Chemom. 15 (2001) 615–625.
[45] R. Kramer, Chemometric Techniques for Quantitative Analysis, Marcel Dekker, New York, 1998.
[46] T.K. Karakach, Comparison of Linear and Nonlinear Multivariate Calibration Methods, MSc thesis, Dalhousie University, Halifax, Canada, 2002.