Computational Statistics & Data Analysis 47 (2004) 225–236
www.elsevier.com/locate/csda

Computational aspects of algorithms for variable selection in the context of principal components

Jorge Cadima (a), J. Orestes Cerdeira (a,∗), Manuel Minhoto (b)

(a) Departamento de Matemática, Instituto Superior de Agronomia, Tapada da Ajuda, Lisboa 1349-017, Portugal
(b) Departamento de Matemática, Universidade de Évora, Colegio Luis Antonio Vernay, Rua Romão Ramalho 59, Évora 7000, Portugal
∗ Corresponding author. Tel.: +351-2136-53467; fax: +351-2136-30723.

Received 4 November 2003; received in revised form 4 November 2003; accepted 8 November 2003
0167-9473/$ - see front matter. doi:10.1016/j.csda.2003.11.001

Abstract

Variable selection consists in identifying a k-subset of a set of original variables that is optimal for a given criterion of adequate approximation to the whole data set. Several algorithms for the optimization problems resulting from three different criteria in the context of principal components analysis are considered, and computational results are presented.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Combinatorial optimization; Heuristics; Principal components; Principal variables; Variable selection

1. Introduction

In the analysis of data sets with large numbers of variables a frequent aim is to reduce the dimensionality of the data set. A typical way of reducing the dimension of a data set is through a principal component analysis (PCA). Standard results guarantee that retaining the k principal components (PCs) with the largest associated variance produces the k-subset of linear combinations of the p original variables which, under various criteria, best approximates the original variables (see, for example, Jolliffe, 2002, Section 6.3). PCA is an appropriate tool for deriving low-dimension subspaces which capture most of the information of the whole data set.

However, dimensionality reduction via PCA does not provide a real reduction of dimensionality in terms of the original variables, since all p original variables are required to define even a single PC. As McCabe (1984) says: "interpretation of the results and possible subsequent data collection and analysis still involve all of the variables". Moreover, the PCs do not reliably indicate which variables are the most relevant, in terms of preserving information. It is current practice to use PC loadings in identifying a subset of the original variables which may be strongly related to the first few PCs and could therefore be used to summarize the most relevant information in the whole data set. Cadima and Jolliffe (1995, 2001) have shown that this procedure can be misleading, essentially because it is based on the assumption that the resultant of a linear combination (a PC) is dominated by the vectors (variables) with large magnitude coefficients in that linear combination (high PC loadings). This assumption ignores the influence of the magnitude of each vector (the standard deviation of each variable) and the vectors' relative positions (the pattern of correlations among the variables).

We consider the combinatorial problem of identifying, for an arbitrary integer k (1 ≤ k < p), a k-variable subset which is optimal with respect to a given criterion that measures and quantifies how well each subset approximates the whole data set. The choice of a criterion will obviously depend on the nature and goals of a specific data set analysis. It may also be affected by issues such as the sensitivity of the criterion to perturbations on the data or to the exclusion or inclusion of an observation.
Several criteria have been suggested in different contexts (see Jolliffe, 1972, 1973, 2002; McCabe, 1984, 1986; Bonifas et al., 1984; Krzanowski, 1987; Gonzalez et al., 1990; de Falguerolles and Jmel, 1993; Tanaka and Mori, 1997; Fedorov et al., 1999; Cadima and Jolliffe, 2001). In this paper, we report computational experiments carried out with several algorithms for the optimization problems resulting from three different criteria that can be found in the literature, although the methodology discussed can be extended to any numerical criterion which may be considered of interest. The first criterion was proposed by McCabe (1984) and is a weighted average of the multiple correlations between each PC of the full data set and the k-subset of variables. The second is based on the RV indicator proposed by Robert and Escoufier (1976) to measure similarity between two configurations of points in a linear space. The third criterion is an instance of Yanai's Generalized Coefficient of Determination, discussed by Ramsay et al. (1984), which is intimately related to the concept of distance between subspaces in Golub and Van Loan (1996, Section 2.6.3). A brief description of these criteria is given in Section 2.

The complete search, among all k-variable subsets, of a subset which optimizes a given criterion, is a task which quickly becomes infeasible even for moderately sized data sets (unless k is very small, or very large, when compared with p). It may happen that a criterion has special properties which render enumeration methods possible for some moderate-sized data sets. Furnival and Wilson's (1974) leaps and bounds algorithm did this in the context of subset selection in Linear Regression, and Duarte Silva (2001) has discussed the application of this algorithm for various methods of Multivariate Statistics. In Duarte Silva (2002) this approach was adapted to the three criteria we consider here, and it was pointed out that the practicability of such methods breaks down for data sets with more than 30 or 40 variables. An additional computational burden may occur if, as is often the case, some uncertainty surrounds the precise dimensionality k of the variable subset that is sought. In that case, it may be useful to search for subsets of several different cardinalities and to compare their criterion values before choosing a variable subset size k which is considered appropriate.

Hence, heuristics seem to be the only possibility for many real problems. The heuristics which have been traditionally proposed for variable selection are mainly variations of a general class of algorithms called greedy. Forward and backward stepwise selection algorithms, which are also widely used in Linear Regression, belong to this class. We compare the performance of algorithms of this type with some local search heuristics, namely variants of local improvement, simulated annealing and genetic algorithms. The heuristics used in the computational experiments are described in Section 3, and the main results are reported in Section 4. Local search methods performed significantly better than the more traditional greedy-type algorithms. This and other conclusions are discussed in Section 5.

2. Similarity indices

The three proximity indices discussed can be introduced through the concept of angle between two (non-zero) n × p matrices A and B, which is the arc whose cosine is given by

$$\cos(\mathbf{A}, \mathbf{B}) = \frac{\langle \mathbf{A}, \mathbf{B} \rangle}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|},$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^t \mathbf{B})$ is the usual inner product of two matrices and $\|\cdot\|$ is the norm induced by the inner product, i.e., $\|\mathbf{C}\| = \sqrt{\mathrm{tr}(\mathbf{C}^t \mathbf{C})}$. Ramsay et al. (1984) call this cosine the matrix correlation between matrices A and B.

We shall consider that our underlying data set is an n × p data matrix, X, where p is the number of variables, which have been observed on each of n individuals. As is usual in PCA it is assumed that the columns of X have been column-centered.
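
The matrix correlation defined above can be illustrated with a minimal sketch. The function names `inner` and `matrix_cosine` are hypothetical (they do not come from the paper or its software), and matrices are assumed to be given as plain lists of rows of equal shape.

```python
import math


def inner(A, B):
    """Matrix inner product <A, B> = tr(A^t B).

    For two matrices of the same shape this equals the sum of the
    products of corresponding entries.
    """
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def matrix_cosine(A, B):
    """Cosine of the angle between two non-zero matrices of the same shape."""
    return inner(A, B) / math.sqrt(inner(A, A) * inner(B, B))


# Proportional matrices span the same direction, so their cosine is 1;
# matrices with disjoint non-zero entries are orthogonal, so their cosine is 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
cos_same_direction = matrix_cosine(identity, scaled)
cos_orthogonal = matrix_cosine([[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
```

Writing the trace as a sum of entrywise products avoids forming the product A^t B explicitly, which is all the definition requires.
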
We denote by K an arbitrary subset of k columns of X, by $\mathcal{K}$ the subspace of $\mathbb{R}^n$ spanned by K, and by $\mathbf{S} = (1/n)\mathbf{X}^t\mathbf{X}$ the variance–covariance matrix of the p variables.

2.1. The RM criterion

The first proximity indicator considered here is equivalent to the second of the four criteria proposed by McCabe (1984, 1986) to define principal variables (see also Cadima and Jolliffe, 2001, for a fuller discussion). It is the cosine of the angle between the n × p data matrix X, and the n × p matrix whose columns result from regressing each of the p (centered) observed variables on K (i.e., orthogonally projecting them on $\mathcal{K}$):

$$\mathrm{RM} = \cos(\mathbf{X}, \mathbf{P}_K\mathbf{X}) = \sqrt{\frac{\mathrm{tr}(\mathbf{X}^t\mathbf{P}_K\mathbf{X})}{\mathrm{tr}(\mathbf{X}^t\mathbf{X})}} = \sqrt{\frac{1}{\mathrm{tr}(\mathbf{S})} \cdot \mathrm{tr}\left([\mathbf{S}^2]_{(K)}\,[\mathbf{S}_{(K)}]^{-1}\right)}, \qquad (1)$$

where $\mathbf{P}_K$ is the matrix of orthogonal projections on $\mathcal{K}$, $\mathbf{S}_{(K)}$ is the submatrix which results from retaining those rows and columns of S whose row/column numbers are in K, and $[\mathbf{S}^2]_{(K)}$ is the corresponding submatrix of $\mathbf{S}^2$. The name RM highlights the criterion's nature as (the square root of) a weighted average of the squares of the multiple correlations $(r_m)_i$ between the data set's ith PC and the k variables spanning $\mathcal{K}$, the weights $\lambda_i$ being the PC variances:

$$\mathrm{RM} = \sqrt{\frac{\sum_{i=1}^{p} \lambda_i\,(r_m)_i^2}{\sum_{i=1}^{p} \lambda_i}}.$$

Tanaka and Mori (1997) proposed an indicator which generalizes the above criterion, based on the PCs of a subset of the k original variables, which they call modified principal components. Their generalization is

$$\mathrm{PRM}(r) = \cos^2(\mathbf{X}, \mathbf{P}_r\mathbf{X}), \qquad (2)$$

where $\mathbf{P}_r$ is the matrix of orthogonal projections on the subspace defined by the first r PCs of some k-variable subset of the original variables. The RM criterion is $\sqrt{\mathrm{PRM}(k)}$.

2.2. The RV criterion

Robert and Escoufier (1976) suggest an indicator to measure how similar are two configurations of related points in a linear space, allowing for translations, rigid rotations and global changes of scale.
The RV-coefficient, applied to two matrices A and B, whose (comparable) rows define the location of each point in the configuration, is $\cos(\mathbf{A}\mathbf{A}^t, \mathbf{B}\mathbf{B}^t)$. As with the RM coefficient, two matrices to which it makes sense to apply this notion are the original data matrix X and the matrix of the data fitted to the k-dimensional subspace spanned by K. Thus, we will consider the indicator given by

$$\mathrm{RV} = \cos(\mathbf{X}\mathbf{X}^t, \mathbf{P}_K\mathbf{X}\mathbf{X}^t\mathbf{P}_K) = \sqrt{\frac{1}{\mathrm{tr}(\mathbf{S}^2)} \cdot \mathrm{tr}\left(\left([\mathbf{S}^2]_{(K)}\,[\mathbf{S}_{(K)}]^{-1}\right)^2\right)}, \qquad (3)$$

with notation defined as in Section 2.1 (see also Gonzalez et al., 1990). Tanaka and Mori (1997) also consider a generalization of the RV criterion, based on comparing the original configuration of points with the configuration that results from projecting onto the first r PCs of a given k-variable subset:

$$\mathrm{PRV}(r) = \cos^2(\mathbf{X}\mathbf{X}^t, \mathbf{P}_r\mathbf{X}\mathbf{X}^t\mathbf{P}_r), \qquad (4)$$

where $\mathbf{P}_r$ is defined as in Section 2.1. RV coincides with $\sqrt{\mathrm{PRV}(k)}$.

2.3. The GCD criterion

A third proximity coefficient is Yanai's Generalized Coefficient of Determination (GCD) (Ramsay et al., 1984). This indicator measures the degree of similarity between two subspaces, and is defined as the cosine of the angle between the matrices of orthogonal projections on those subspaces. Golub and Van Loan (1996, Section 2.6.3) define an intimately related concept, the distance between subspaces, as the norm of the difference between those projection matrices.

It is known that the k-dimensional subspace which accounts for the maximum possible variability within the original data set is the subspace $\mathcal{G}$ spanned by the first k PCs of the data set. A plausible criterion to measure the adequacy of the k-variable subset K is to compare the similarity between $\mathcal{G}$ and $\mathcal{K}$.
Hence, we have:

$$\mathrm{GCD} = \cos(\mathbf{P}_G, \mathbf{P}_K) = \frac{1}{k}\,\mathrm{tr}\left([\mathbf{S}_{\{G\}}]_{(K)}\,[\mathbf{S}_{(K)}]^{-1}\right), \qquad (5)$$

where $\mathbf{S}_{(K)}$ and $\mathbf{P}_K$ are defined as above, $\mathbf{P}_G$ denotes the matrix of orthogonal projection onto subspace $\mathcal{G}$, $\mathbf{S}_{\{G\}}$ represents the approximation to matrix S obtained by retaining, in the spectral decomposition of S, only the eigenvectors/values associated with the k largest eigenvalues, and $[\mathbf{S}_{\{G\}}]_{(K)}$ is the k × k submatrix of $\mathbf{S}_{\{G\}}$ obtained by retaining only the rows/columns with row/column number in K. Again, simple algebra (Cadima and Jolliffe, 2001) shows that the GCD can be re-written in terms of the multiple correlations, $(r_m)_i$, between the ith PC of the full data set and the k selected variables:

$$\mathrm{GCD} = \frac{1}{k}\sum_{i=1}^{k} (r_m)_i^2.$$

Thus, the GCD for the subspaces has values between zero (if the subspaces are orthogonal) and one (if all k PCs are in $\mathcal{K}$, i.e., if the subspaces coincide). The GCD is the average of the squared canonical correlations between two sets of variables spanning each of the subspaces (Ramsay et al., 1984). It is also possible to generalize the GCD, along the lines suggested by Tanaka and Mori (1997), and to compare the subspaces defined by a subset of the PCs of the full data set and by (a subset of) the PCs of a given k-variable subset:

$$\mathrm{PGCD}(r) = \cos(\mathbf{P}_G, \mathbf{P}_r), \qquad (6)$$

where $\mathbf{P}_G$ and $\mathbf{P}_r$ are defined above.

3. The algorithms

Determining, for a given value of k, which k-subsets of the set of variables X maximize (1), (3) or (5) is a particular case of the following general combinatorial optimization problem:

$$\text{find } S \text{ which maximizes } \{c(S) : S \in \mathcal{F} \subseteq 2^N\},$$

where N is a finite set, and the objective function $c : 2^N \to \mathbb{R}_0^+$ associates to each subset of N a nonnegative real value. The set $S^* \subseteq N$ is an optimal solution if $S^*$ is in the set $\mathcal{F}$ of feasible solutions and $c(S^*) \geq c(S)$, for all $S \in \mathcal{F}$.

Though it is difficult to classify heuristic algorithms for combinatorial optimization problems, we can identify two different classes: greedy-type heuristics and local search heuristics.
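
The objective functions (1), (3) and (5) depend on the data only through the variance–covariance (or correlation) matrix S and the index set K, so each can be evaluated without the raw data. The following is a minimal NumPy sketch under stated assumptions: the function names `rm_coef`, `rv_coef` and `gcd_coef` are illustrative rather than a reproduction of the authors' Fortran 90 code, K is taken to be a list of 0-based column indices, and the submatrix of S indexed by K is assumed non-singular so that it can be inverted directly.

```python
import numpy as np


def rm_coef(S, K):
    """RM criterion, Eq. (1): sqrt( tr([S^2]_(K) [S_(K)]^-1) / tr(S) )."""
    K = list(K)
    S2 = S @ S
    SK_inv = np.linalg.inv(S[np.ix_(K, K)])
    return float(np.sqrt(np.trace(S2[np.ix_(K, K)] @ SK_inv) / np.trace(S)))


def rv_coef(S, K):
    """RV criterion, Eq. (3): sqrt( tr(([S^2]_(K) [S_(K)]^-1)^2) / tr(S^2) )."""
    K = list(K)
    S2 = S @ S
    M = S2[np.ix_(K, K)] @ np.linalg.inv(S[np.ix_(K, K)])
    return float(np.sqrt(np.trace(M @ M) / np.trace(S2)))


def gcd_coef(S, K):
    """GCD criterion, Eq. (5): (1/k) tr([S_{G}]_(K) [S_(K)]^-1).

    S_{G} keeps only the k largest eigenvalues (and their eigenvectors)
    in the spectral decomposition of S, with k = len(K).
    """
    K = list(K)
    k = len(K)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # positions of the k largest
    S_G = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T
    return float(np.trace(S_G[np.ix_(K, K)] @ np.linalg.inv(S[np.ix_(K, K)])) / k)


# Toy example: three uncorrelated variables with variances 3, 2 and 1.
S_toy = np.diag([3.0, 2.0, 1.0])
rm_value = rm_coef(S_toy, [0, 1])    # sqrt(5/6)
rv_value = rv_coef(S_toy, [0, 1])    # sqrt(13/14)
gcd_value = gcd_coef(S_toy, [0, 1])  # 1: the two variables span the first two PCs
```

In the toy example the first two PCs are the first two variables themselves, which is why the GCD of the subset {0, 1} equals one while RM and RV stay below one (the third variable is not reproduced).
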
3.1. Greedy heuristics

In our computational study we used four variants of greedy-type heuristics: forward selection, backward elimination, and two stepwise algorithms, one with a default forward direction and one with a default backward direction, each of which takes one step in the opposite direction for every two steps in the default direction. Algorithms of this kind are widely used in Linear Regression (Draper and Smith, 1998) and have been used in the present context (Cadima and Jolliffe, 2001).

3.2. Local search heuristics

Local search heuristics always work inside the feasible region $\mathcal{F}$, moving from one feasible solution to another feasible solution, by exchanging some of its components. We considered three heuristics of this type, which are briefly described below. As the evaluation of the objective functions (1), (3) or (5) clearly dominates all other operations in the algorithms, the number of calls to the calculation of the criterion was used as an appropriate measure of time complexity. Complexity and other aspects of the algorithms are summarized in Table 1.

Table 1
Synoptic table of local search algorithms to select a k-subset from p variables

| | Restricted improvement | Simulated annealing | Genetic algorithm |
|---|---|---|---|
| Parameters affecting complexity | nss: number of repetitions | nss: number of repetitions; nit: number of iterations | ssp: size of starting population; g: number of generations |
| Stopping rule | Q empty | nit > itmax | g > gmax |
| Number of calls to evaluate criterion | O(nss × k × p) | nss × nit | ssp + (ssp/2) × g |

In our implementations, criteria values were recomputed from scratch for each new subset. As pointed out by a referee, a more efficient strategy would be to update criteria values for each variable swap, speeding up computations at the expense of memory usage. This possibility will be considered in a future implementation.

3.2.1. Restricted improvement

We used a modified version of improvement heuristics (Tovey, 1997) which we call restricted improvement.
A neighbourhood of the current solution S was defined as the set of k-subsets of X that can be obtained by replacing one element of S by one element from X \ S. The variables which are candidates for membership of set S are arranged in a queue Q. Initially Q consists of all variables which are not included in the starting solution. The algorithm picks an element j from Q, removes it from Q, and determines $i^* \in S$ which maximizes $\{c(S \setminus \{i\} \cup \{j\}) : i \in S\}$. The current set is updated by letting $S := S \setminus \{i^*\} \cup \{j\}$ iff $c(S \setminus \{i^*\} \cup \{j\}) > c(S)$. In case S is updated, variable $i^*$ will be inserted in Q iff $i^*$ has not been in Q before. The algorithm stops when Q is equal to the empty set. Note that determining $i^*$ involves k evaluations of the objective function, and no element will be inserted in Q more than once, giving O(k × p) evaluations of the criterion. Unlike in standard local improvement, there is no guarantee that the final solution S is a local optimum, since no element will be inserted in Q twice.

3.2.2. Simulated annealing

In iteration j of a simulated annealing algorithm (Aarts et al., 1997) a neighbour $S'$ may be chosen over a better current solution S with probability $\exp((c(S') - c(S))/T_j)$, where $T_j > 0$. In our simulated annealing algorithm, neighbourhoods are defined as in the restricted improvement heuristic. As for $T_j$, we start with $T_1 = 1 - c_0$, where $c_0$ is the criterion value of the starting solution, and gradually decrease the current value by 5% every 20 iterations. We also considered the possibility of slightly increasing $T_j$ during the first 20p iterations if the current solution does not change for a long time. The algorithm stops with the best solution encountered at any stage, after a given number nit of iterations.

The RM criterion assesses the performance of the k-variable subset in predicting the remaining variables.
This suggested the following alternative to the standard implementation for this specific criterion. The element of S to be dropped is chosen with probability proportional to the entries of the diagonal of the inverse variance–covariance matrix of the variables corresponding to the elements of S. The rationale for this lies in the fact that the reciprocal of the ith diagonal element of the inverse variance–covariance matrix is the partial variance of the ith variable in the subset, i.e., the variance of its residuals after regression on the remaining variables of the subset (see Whittaker, 1990, Section 5.8). Large diagonal elements in the inverse variance–covariance matrix are therefore associated with variables that have little further information to contribute. The element to be added to S is chosen from $\bar{S} = X \setminus S$ with probability proportional to the entries of the diagonal of

$$\mathbf{S}_{\bar{S}\bar{S} \cdot S} = \mathbf{S}_{\bar{S}\bar{S}} - \mathbf{S}_{\bar{S}S}\,\mathbf{S}_{SS}^{-1}\,\mathbf{S}_{S\bar{S}},$$

the matrix of partial variances–covariances of the variables corresponding to the elements of $\bar{S}$, given the variables in S. The largest such elements are associated with variables that are not well explained by those already in the subset. We call this alternative guided swap to distinguish it from the standard uniform swap. Although the guided version involves some additional calculations, both alternatives require nit evaluations of the criterion values.

3.2.3. Genetic algorithms

Genetic algorithms (Mühlenbein, 1997) start with an initial population of ssp feasible solutions which are mated to produce children (i.e., other feasible solutions) that inherit properties of their parents. The next generation will consist of elements selected among those from the previous generation and their children.

In our implementation, the number of child-bearing couples (father, mother) to be formed is half the size of the population.
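
Returning to the guided swap of Section 3.2.2, the two sets of selection probabilities can be computed directly from the variance–covariance matrix. The sketch below is only an illustration under stated assumptions: the function name `guided_swap_probabilities` is hypothetical, variables are identified by 0-based indices, and the submatrix of the variance–covariance matrix for the current subset is assumed invertible.

```python
import numpy as np


def guided_swap_probabilities(S, subset):
    """Selection probabilities for one guided swap.

    S      : (p, p) variance-covariance matrix of all variables.
    subset : indices of the variables currently in the solution.

    Returns two dicts mapping variable index -> probability:
    the first for the variable to drop, the second for the variable to add.
    """
    p = S.shape[0]
    inside = sorted(subset)
    in_set = set(inside)
    outside = [j for j in range(p) if j not in in_set]

    # Drop weights: diagonal of the inverse covariance matrix of the subset.
    # A large entry means a small partial variance, i.e. a variable that is
    # well explained by the other members of the subset.
    S_in_inv = np.linalg.inv(S[np.ix_(inside, inside)])
    drop_w = np.diag(S_in_inv)

    # Add weights: partial variances of the excluded variables given the subset.
    S_out_in = S[np.ix_(outside, inside)]
    partial = S[np.ix_(outside, outside)] - S_out_in @ S_in_inv @ S_out_in.T
    add_w = np.diag(partial)

    drop_prob = dict(zip(inside, drop_w / drop_w.sum()))
    add_prob = dict(zip(outside, add_w / add_w.sum()))
    return drop_prob, add_prob


# Toy example: variables 0 and 1 are strongly correlated, 2 and 3 are isolated.
S_toy = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.9, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
drop_prob, add_prob = guided_swap_probabilities(S_toy, {0, 1, 2})
```

In the toy example the near-redundant variables 0 and 1 receive a much higher probability of being dropped than variable 2, which matches the rationale given in the text. A swap could then be drawn with, for instance, `numpy.random.default_rng().choice` applied to the keys and probabilities of each dictionary.
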
Each father is selected among the members of the population with probability proportional to his value of the criterion. For each father F, a mother M is selected with equal probability among the members of the population which have at least two variables not belonging to F. The child produced by each pair (F, M) includes all the variables which belong to both parents. The remaining variables are selected with equal probability from the parents' symmetric difference, with the additional restriction that at least one variable from M \ F and one from F \ M will be selected. Each new generation is formed by the ssp individuals, from among the previous generation and their offspring, which have the highest values of the objective function. The number of generations g was used as the stopping criterion.

We noted that after a few generations a large number of clones (replicates) of a fairly small number of individuals tended to appear. In several cases it only took a few iterations for the algorithm to get stuck in a population in which the cardinality of the symmetric difference of every pair is not greater than two. To overcome this problem, every time a new child is generated the algorithm counts the number of existing replicates of this child, among the current generation and among the children already created. If this number exceeds a given value, maxclone, the child is rejected and is replaced by an admissible solution selected at random, with a uniform distribution.

4. Data and results

Nine data sets were considered, which include both real and simulated data. In four cases, only the correlation matrix was available. In the other five cases, the original data set or its covariance matrix were known, and in order to diversify situations (in terms of variance of eigenvalues) both the covariance and the correlation matrix PCAs were used. This gives a total of 14 different situations. Details of the data sets can be obtained from http://home.uevora.pt/~minhoto/optvasa.
Among these 14 situations, nine have a moderate number of variables (between 13 and 20). In these cases, a full search was viable and we were able to find optimal k-variable subsets for the three criteria and for all k = 2, 3, ..., p − 1. The other five situations correspond to a 35-variable data set, in both standardized and unstandardized form, two different (standardized) 50-variable data sets and a 62-variable (standardized) data set. For these five cases, we are not sure of having identified the optimal values.

The computational results were obtained using stand-alone Fortran 90 programs. Running times for the more computationally intensive cases considered were a few seconds, for each cardinality and criterion, on a 400 MHz Pentium II. Similar performance can be achieved using the authors' subselect software module, which can be loaded from within a session of the R statistical software package. R (http://www.R-project.org) is a Free Software implementation of the S Statistical Language (Becker et al., 1988).

Table 2
Results for the nine smaller matrices

| Algorithm | RM % suc. | RM mre | RV % suc. | RV mre | GCD % suc. | GCD mre |
|---|---|---|---|---|---|---|
| Forward | 47.15 | 0.001307 | 46.34 | 0.001290 | 29.27 | 0.036344 |
| Step/Forw. | 74.80 | 0.000471 | 70.73 | 0.000450 | 63.41 | 0.011749 |
| Backward | 58.54 | 0.005976 | 43.64 | 0.008062 | 49.59 | 0.023695 |
| Step/Back. | 69.92 | 0.002160 | 54.47 | 0.005782 | 56.91 | 0.015239 |
| Improvem. | 100.00 | 0 | 100.00 | 0 | 100.00 | 0 |
| Genetic | 100.00 | 0 | 100.00 | 0 | 100.00 | 0 |
| S. An. (unif.) | 100.00 | 0 | 100.00 | 0 | 100.00 | 0 |
| S. An. (guid.) | 100.00 | 0 | - | - | - | - |

Regarding the implementations of the local search methods, some additional specifications should be noted. Both the restricted improvement algorithm and the two versions of simulated annealing work with a given number nss of starting solutions (repetitions). Each starting solution was chosen with equal probability among all k-subsets.
The solution produced by the algorithm is the variable subset with the largest value of the criterion, in the nss repetitions. In choosing the values for the parameters, it was considered desirable to ensure that in all these algorithms the number of evaluations of the criteria was roughly the same. For the moderate size matrices, the following values were assigned: nss = 40 for the restricted improvement; nss = 10 and nit = 1000 for both versions of simulated annealing; and maxclone = 5, ssp = 200, g = 100 for the genetic algorithms.

Two measures were used to assess results. Denoting by $c^*$ and $c_H$, respectively, the optimal value and the value of the heuristic solution produced by the algorithm, these measures are:
• % suc.: the percentage of cases in which $c^*$ was achieved;
• mre: the mean relative error, defined as the mean of the ratios $(c^* - c_H)/c^*$.

The results for these smaller matrices are summed up in Table 2. For the larger matrices, the values assigned to the parameters were: nss = 100 for the restricted improvement; nss = 10 and nit = 4000 for both versions of simulated annealing; and maxclone = 5, ssp = 800, g = 200 for the genetic algorithms. Since we are not sure what the true optimal values are, the value of $c^*$ in the definition of the performance measures % suc. and mre was taken to be the best value obtained for each criterion, with any of the algorithms. Only the more meaningful cardinalities were now considered, taking into account the number of variables in each case, as shown, together with the results for these matrices, in Tables 3–5.

Table 3
Results for the RM criterion
(hemat35.var and hemat35.cor: p = 35, k from 5 to 15; matd.cor and matx.cor: p = 50, k from 5 to 20; mat62.cor: p = 62, k from 5 to 25)

| Algorithm | hemat35.var % suc. | hemat35.var mre | hemat35.cor % suc. | hemat35.cor mre | matd.cor % suc. | matd.cor mre | matx.cor % suc. | matx.cor mre | mat62.cor % suc. | mat62.cor mre |
|---|---|---|---|---|---|---|---|---|---|---|
| Forward | 27.27 | 0.000001 | 18.18 | 0.000754 | 0 | 0.004399 | 0 | 0.012297 | 0 | 0.004754 |
| Step/Forw. | 27.27 | 0.000001 | 18.18 | 0.000754 | 0 | 0.003838 | 6.25 | 0.003736 | 0 | 0.004356 |
| Backward | 27.27 | 0.000003 | 0 | 0.008857 | 0 | 0.027070 | 0 | 0.016869 | 0 | 0.010023 |
| Step/Back. | 27.27 | 0.000002 | 0 | 0.007973 | 6.25 | 0.005726 | 0 | 0.004408 | 0 | 0.007508 |
| Improvem. | 100 | 0 | 100 | 0 | 37.50 | 0.000499 | 12.50 | 0.001514 | 61.90 | 0.000097 |
| Genetic | 100 | 0 | 100 | 0 | 87.50 | 0.000176 | 87.50 | 0.000500 | 57.14 | 0.000239 |
| S.An. (unif.) | 100 | 0 | 100 | 0 | 93.75 | 0.000033 | 75.00 | 0.000641 | 85.71 | 0.000040 |
| S.An. (guid.) | 90.91 | 0.000001 | 100 | 0 | 93.75 | 0.000033 | 87.50 | 0.000467 | 85.71 | 0.000046 |

Table 4
Results for the RV criterion
(hemat35.var and hemat35.cor: p = 35, k from 5 to 15; matd.cor and matx.cor: p = 50, k from 5 to 20; mat62.cor: p = 62, k from 5 to 25)

| Algorithm | hemat35.var % suc. | hemat35.var mre | hemat35.cor % suc. | hemat35.cor mre | matd.cor % suc. | matd.cor mre | matx.cor % suc. | matx.cor mre | mat62.cor % suc. | mat62.cor mre |
|---|---|---|---|---|---|---|---|---|---|---|
| Forward | 81.82 | 0.000000 | 18.18 | 0.001357 | 0 | 0.004586 | 18.75 | 0.002270 | 0 | 0.003850 |
| Step/Forw. | 81.82 | 0.000000 | 18.18 | 0.001357 | 0 | 0.003497 | 18.75 | 0.001837 | 0 | 0.003686 |
| Backward | 81.82 | 0.000000 | 0 | 0.012337 | 0 | 0.023993 | 0 | 0.004375 | 0 | 0.008891 |
| Step/Back. | 81.82 | 0.000000 | 0 | 0.011099 | 0 | 0.007604 | 18.75 | 0.004593 | 0 | 0.004583 |
| Improvem. | 100 | 0 | 100 | 0 | 68.75 | 0.000383 | 25.00 | 0.000703 | 52.38 | 0.0004567 |
| Genetic | 100 | 0 | 100 | 0 | 75.00 | 0.000202 | 18.75 | 0.000886 | 38.10 | 0.000563 |
| S.An. (unif.) | 100 | 0 | 100 | 0 | 81.25 | 0.000096 | 75.00 | 0.000212 | 76.19 | 0.000239 |

Table 5
Results for the GCD criterion
(hemat35.var and hemat35.cor: p = 35, k from 5 to 15; matd.cor and matx.cor: p = 50, k from 5 to 20; mat62.cor: p = 62, k from 5 to 25)

| Algorithm | hemat35.var % suc. | hemat35.var mre | hemat35.cor % suc. | hemat35.cor mre | matd.cor % suc. | matd.cor mre | matx.cor % suc. | matx.cor mre | mat62.cor % suc. | mat62.cor mre |
|---|---|---|---|---|---|---|---|---|---|---|
| Forward | 18.18 | 0.000520 | 0 | 0.022461 | 0 | 0.077515 | 0 | 0.088082 | 0 | 0.045313 |
| Step/Forw. | 18.18 | 0.000520 | 0 | 0.009694 | 0 | 0.024209 | 6.25 | 0.024832 | 0 | 0.030971 |
| Backward | 0 | 0.026299 | 0 | 0.059332 | 0 | 0.172976 | 0 | 0.095381 | 0 | 0.097163 |
| Step/Back. | 0 | 0.013232 | 0 | 0.044351 | 0 | 0.048149 | 0 | 0.037637 | 0 | 0.049069 |
| Improvem. | 100 | 0 | 100 | 0 | 93.75 | 0.000022 | 56.25 | 0.001522 | 80.95 | 0.000726 |
| Genetic | 100 | 0 | 100 | 0 | 93.75 | 0.000167 | 87.50 | 0.000211 | 76.19 | 0.000228 |
| S.An. (unif.) | 100 | 0 | 100 | 0 | 87.50 | 0.000122 | 81.25 | 0.000430 | 90.48 | 0.000068 |

5. Discussion

The combinatorial optimization problems of determining which k-subsets of a given set of variables maximize (1), (3) and (5) were tackled. Specific versions of local search methods, namely of local improvement, simulated annealing and genetic algorithms, were devised and implemented. The results of these algorithms, and those of four greedy-type methods, were compared in 14 different cases.

The local search methods performed significantly better than the greedy-type algorithms. For the nine smaller-sized problems, all local search methods identified the optimal solutions, for every cardinality k and for the three criteria. For the larger matrices, we cannot guarantee that the optimal solutions were found. In the two cases where p = 35, for the values of k considered (with a single exception) and for the three criteria, all local search algorithms produced the same solutions. This was no longer the case for the other three situations considered, and despite the good performance of simulated annealing in most circumstances, it is not possible to state that one given algorithm consistently out-performs the others.

It must be kept in mind that the performance of both simulated annealing and genetic algorithms is affected by different choices involving the parameters, such as $T_j$ in simulated annealing or maxclone in the genetic algorithms. Variants of each method can also be devised, as illustrated by the guided swap variant of simulated annealing with the RM criterion. Although the results as presented do not highlight major differences between both versions of simulated annealing, a closer look at the results of the nss individual repetitions indicates that the guided swap version tends to replicate the best solutions more frequently than its uniform counterpart.
In our experiments we noted that simulated annealing and genetic algorithms usually provide solutions which are local optima in relation to the neighbourhoods defined in the restricted improvement algorithm. But this was not always the case. We believe that a reasonable approach to the optimization of the kind of criteria discussed above consists in using either simulated annealing or genetic algorithms to obtain a tentative solution, which can then be fed to the restricted improvement algorithm for refinement.

An additional aspect relating to the problem under consideration seems to be relevant. We noted that there were a large number of suboptimal solutions, that is, many solutions whose values of the criteria were close to the best value found. As an example, for the matrix of size p = 62, with cardinality k = 12, we found 800 different solutions with values of the RM criterion ranging from 0.8079392 (the best value found) to 0.8059924. But it should be noted that the worst of these 800 values is still better than the best value (0.8042016) produced by any of the greedy-type methods! The existence of a large number of suboptimal solutions suggests that a suitable approach to the variable selection problem may consist, not so much in determining an optimal solution for a given criterion, but rather a solution which is suboptimal with respect to several criteria. This is a topic which we will seek to address in future work.

Acknowledgements

Research was financially supported by the Portuguese Foundation for Science and Technology (FCT).

References

Aarts, E.H.L., Korst, J.H.M., van Laarhoven, P.J.M., 1997. Simulated annealing. In: Aarts, E., Lenstra, J.K. (Eds.), Local Search in Combinatorial Optimization. Wiley, New York, pp. 91–120.
Becker, R., Chambers, J., Wilks, A., 1988. The New S Language. A Programming Environment for Data Analysis and Graphics. Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
Bonifas, I., Escoufier, Y., Gonzalez, P.L., Sabatier, R., 1984. Choix de variables en analyse en composantes principales. Rev. Statist. Appl. 23, 5–15.
Cadima, J., Jolliffe, I.T., 1995. Loadings and correlations in the interpretation of principal components. J. Appl. Statist. 22 (2), 203–214.
Cadima, J., Jolliffe, I.T., 2001. Variable selection and the interpretation of principal subspaces. J. Agricultural, Biol. Environ. Statist. 6 (1), 62–79.
Draper, N., Smith, H., 1998. Applied Regression Analysis, 3rd Edition. Wiley, New York.
Duarte Silva, A.P., 2001. Efficient variable screening for multivariate analysis. J. Multivariate Anal. 76, 35–62.
Duarte Silva, A.P., 2002. Discarding variables in principal component analysis: algorithms for all-subset comparisons. Comput. Statist. 17, 251–271.
de Falguerolles, A., Jmel, S., 1993. Un critère de choix de variables en analyse en composantes principales fondé sur des modèles graphiques gaussiens particuliers. Canad. J. Statist. 21 (3), 239–256.
Fedorov, V., Gruben, D., Leonov, S., 1999. Direct method of selecting informative variables. Smith Kline Beecham Biostatistics and Data Sciences Technical Report 1999-02.
Furnival, G.M., Wilson, R.W., 1974. Regressions by leaps and bounds. Technometrics 16, 499–511 (reprinted in Technometrics 42, 1, 69–79).
Golub, G., Van Loan, C., 1996. Matrix Computations. Johns Hopkins University Press, Baltimore, MD.
Gonzalez, P.L., Evry, R., Cléroux, R., Rioux, B., 1990. Selecting the best subset of variables in principal component analysis. In: Momirovic, K., Mildner, V. (Eds.), Physica-Verlag, Compstat, pp. 115–120.
Jolliffe, I.T., 1972. Discarding variables in a principal component analysis, I: Artificial data. Appl. Statist. 21, 160–173.
Jolliffe, I.T., 1973. Discarding variables in a principal component analysis, II: Real data. Appl. Statist. 22, 21–31.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd Edition. Springer, New York.
Krzanowski, W.J., 1987. Selection of variables to preserve multivariate data structure using principal components. Appl. Statist. 36, 22–33.
McCabe, G.P., 1984. Principal variables. Technometrics 26 (2), 137–144.
McCabe, G.P., 1986. Prediction of principal components by variables subsets. Technical Report 86-19, Department of Statistics, Purdue University.
Mühlenbein, H., 1997. Genetic algorithms. In: Aarts, E., Lenstra, J.K. (Eds.), Local Search in Combinatorial Optimization. Wiley, New York, pp. 137–171.
Ramsay, J.O., ten Berge, J., Styan, G.P.H., 1984. Matrix correlation. Psychometrika 49 (3), 403–423.
Robert, P., Escoufier, Y., 1976. A unifying tool for linear multivariate statistical methods: the RV-coefficient. Appl. Statist. 25 (3), 257–265.
Tanaka, Y., Mori, Y., 1997. Principal Component Analysis based on a subset of variables: variable selection and sensitivity analysis. American J. Math. Management Sci. 17 (1 & 2), 61–89.
Tovey, C.A., 1997. Local improvement on discrete structures. In: Aarts, E., Lenstra, J.K. (Eds.), Local Search in Combinatorial Optimization. Wiley, New York, pp. 57–89.
Whittaker, J., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, New York.