Computational Statistics & Data Analysis 47 (2004) 225 – 236
www.elsevier.com/locate/csda
Computational aspects of algorithms for variable
selection in the context of principal components
Jorge Cadima^a, J. Orestes Cerdeira^{a,∗}, Manuel Minhoto^b
^a Departamento de Matemática, Instituto Superior de Agronomia, Tapada da Ajuda, Lisboa 1349-017, Portugal
^b Departamento de Matemática, Universidade de Évora, Colégio Luís António Vernay, Rua Romão Ramalho 59, Évora 7000, Portugal
Received 4 November 2003; received in revised form 4 November 2003; accepted 8 November 2003
Abstract
Variable selection consists in identifying a k-subset of a set of original variables that is
optimal for a given criterion of adequate approximation to the whole data set. Several algorithms
for the optimization problems resulting from three different criteria in the context of principal
components analysis are considered, and computational results are presented.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Combinatorial optimization; Heuristics; Principal components; Principal variables; Variable
selection
1. Introduction
In the analysis of data sets with large numbers of variables a frequent aim is to
reduce the dimensionality of the data set. A typical way of reducing the dimension of
a data set is through a principal component analysis (PCA). Standard results guarantee
that retaining the k principal components (PCs) with the largest associated variance
produces the k-subset of linear combinations of the p original variables which, under
various criteria, best approximates the original variables (see, for example, Jolliffe,
2002, Section 6.3). PCA is an appropriate tool for deriving low-dimension subspaces
which capture most of the information of the whole data set.
∗ Corresponding author. Tel.: +351-2136-53467; fax: +351-2136-30723.
0167-9473/$ - see front matter doi:10.1016/j.csda.2003.11.001
However, dimensionality reduction via PCA does not provide a real reduction of
dimensionality in terms of the original variables, since all p original variables are
required to define even a single PC. As McCabe (1984) says: “interpretation of the
results and possible subsequent data collection and analysis still involve all of the
variables”.
Moreover, the PCs do not reliably indicate which variables are the most relevant, in
terms of preserving information. It is current practice to use PC loadings in identifying
a subset of the original variables which may be strongly related to the first few PCs
and could therefore be used to summarize the most relevant information in the whole
data set. Cadima and Jolliffe (1995, 2001) have shown that this procedure can be
misleading, essentially because it is based on the assumption that the resultant of a
linear combination (a PC) is dominated by the vectors (variables) with large magnitude
coefficients in that linear combination (high PC loadings). This assumption ignores the
influence of the magnitude of each vector (the standard deviation of each variable) and
the vectors’ relative positions (the pattern of correlations among the variables).
We consider the combinatorial problem of identifying, for an arbitrary integer k
(1 ≤ k < p), a k-variable subset which is optimal with respect to a given criterion
that measures and quantifies how well each subset approximates the whole data set.
The choice of a criterion will obviously depend on the nature and goals of a specific
data set analysis. It may also be affected by issues such as the sensitivity of the
criterion to perturbations on the data or to the exclusion or inclusion of an observation.
Several criteria have been suggested in different contexts (see Jolliffe, 1972, 1973, 2002;
McCabe, 1984, 1986; Bonifas et al., 1984; Krzanowski, 1987; Gonzalez et al., 1990;
de Falguerolles and Jmel, 1993; Tanaka and Mori, 1997; Fedorov et al., 1999; Cadima
and Jolliffe, 2001).
In this paper, we report computational experiments carried out with several algorithms for the optimization problems resulting from three different criteria that can be
found in the literature, although the methodology discussed can be extended to any numerical criterion which may be considered of interest. The first criterion was proposed
by McCabe (1984) and is a weighted average of the multiple correlations between each
PC of the full data set and the k-subset of variables. The second is based on the RV
indicator proposed by Robert and Escoufier (1976) to measure similarity between two
configurations of points in a linear space. The third criterion is an instance of Yanai’s
Generalized Coefficient of Determination, discussed by Ramsay et al. (1984), which
is intimately related to the concept of distance between subspaces in Golub and Van
Loan (1996, Section 2.6.3). A brief description of these criteria is given in Section 2.
The complete search, among all k-variable subsets, of a subset which optimizes a
given criterion, is a task which quickly becomes infeasible even for moderately sized
data sets (unless k is very small, or very large, when compared with p). It may happen that a criterion has special properties which render enumeration methods possible
for some moderate-sized data sets. Furnival and Wilson’s (1974) leaps and bounds
algorithm did this in the context of subset selection in Linear Regression, and Duarte
Silva (2001) has discussed the application of this algorithm for various methods of
Multivariate Statistics. In Duarte Silva (2002) this approach was adapted to the three
criteria we consider here, and it was pointed out that the practicability of such methods
breaks down for data sets with more than 30 or 40 variables. An additional computational burden may occur if, as is often the case, some uncertainty surrounds the
precise dimensionality k of the variable subset that is sought. In that case, it may be
useful to search for subsets of several different cardinalities and to compare their criterion values before choosing a variable subset size k which is considered appropriate.
Hence, heuristics seem to be the only possibility for many real problems. The heuristics
which have been traditionally proposed for variable selection are mainly variations of
a general class of algorithms called greedy. Forward and backward stepwise selection
algorithms, which are also widely used in Linear Regression, belong to this class. We
compare the performance of algorithms of this type with some local search heuristics,
namely variants of local improvement, simulated annealing and genetic algorithms. The
heuristics used in the computational experiments are described in Section 3, and the
main results are reported in Section 4. Local search methods performed significantly
better than the more traditional greedy-type algorithms. This and other conclusions are
discussed in Section 5.
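To give a rough sense of the combinatorial growth mentioned above, the number of k-variable subsets of p variables is the binomial coefficient C(p, k). The short Python sketch below tabulates it for a few values of p in the range of the data sets of Section 4; the function name `n_subsets` and the choice k = p // 2 are ours, for illustration only.

```python
from math import comb


def n_subsets(p: int, k: int) -> int:
    """Number of k-variable subsets of p variables, i.e. C(p, k)."""
    return comb(p, k)


for p in (20, 35, 50, 62):
    k = p // 2  # mid-range cardinalities give the largest counts
    print(f"p = {p:2d}, k = {k:2d}: {n_subsets(p, k):.3e} subsets")
```

Since one criterion evaluation is needed per subset, these counts are also the cost of a complete search at a single cardinality.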
2. Similarity indices
The three proximity indices discussed can be introduced through the concept of angle
between two (non-zero) n × p matrices A and B, which is the arc whose cosine is
given by
$$\cos(A, B) = \frac{\langle A, B \rangle}{\|A\| \cdot \|B\|},$$
where $\langle A, B \rangle = \mathrm{tr}(A^t B)$ is the usual inner product of two matrices and $\|\cdot\|$ is the norm
induced by the inner product, i.e., $\|C\| = \sqrt{\mathrm{tr}(C^t C)}$. Ramsay et al. (1984) call this
cosine the matrix correlation between matrices A and B.
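The matrix cosine can be computed in a few lines. The following is a minimal sketch assuming NumPy; the function name `matrix_cosine` and the two small example matrices are ours.

```python
import numpy as np


def matrix_cosine(A: np.ndarray, B: np.ndarray) -> float:
    """Cosine of the angle between A and B under <A, B> = tr(A^t B)."""
    inner = np.trace(A.T @ B)
    # The Frobenius norm (NumPy's default matrix norm) is the norm
    # induced by this inner product.
    return float(inner / (np.linalg.norm(A) * np.linalg.norm(B)))


A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
B = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 1.0]])
print(matrix_cosine(A, A))  # a matrix with itself
print(matrix_cosine(A, B))  # these two matrices have zero inner product
```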
We shall consider that our underlying data set is an n × p data matrix, X, where
p is the number of variables, which have been observed on each of n individuals. As
is usual in PCA it is assumed that the columns of X have been column-centered. We
denote by K an arbitrary subset of k columns of X, by $\mathcal{K}$ the subspace of $\mathbb{R}^n$ spanned
by K, and by $S = (1/n) X^t X$ the variance–covariance matrix of the p variables.
2.1. The RM criterion
The first proximity indicator considered here is equivalent to the second of the
four criteria proposed by McCabe (1984, 1986) to define principal variables (see also
Cadima and Jolliffe, 2001, for a fuller discussion). It is the cosine of the angle between
the n × p data matrix X, and the n × p matrix whose columns result from regressing
each of the p (centered) observed variables on K (i.e., orthogonally projecting them
on the subspace $\mathcal{K}$ spanned by K):
$$RM = \cos(X, P_K X) = \sqrt{\frac{\mathrm{tr}(X^t P_K X)}{\mathrm{tr}(X^t X)}} = \sqrt{\frac{1}{\mathrm{tr}(S)} \cdot \mathrm{tr}\left([S^2]_{(K)} [S_{(K)}]^{-1}\right)}, \qquad (1)$$
where $P_K$ is the matrix of orthogonal projections on $\mathcal{K}$, $S_{(K)}$ is the submatrix which
results from retaining those rows and columns of S whose row/column numbers are
in K, and $[S^2]_{(K)}$ is the corresponding submatrix of $S^2$. The name RM highlights the
criterion’s nature as a weighted average of the squares of the multiple correlations $(r_m)_i$
between the data set’s ith PC and the k variables spanning $\mathcal{K}$, the weights $\lambda_i$ being the
PC variances:
$$RM = \sqrt{\frac{\sum_{i=1}^{p} \lambda_i (r_m)_i^2}{\sum_{i=1}^{p} \lambda_i}}.$$
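As a numerical illustration of Eq. (1), the sketch below (assuming NumPy; the function names and the simulated data are ours, not the authors' Fortran or R code) evaluates RM from the covariance matrix and cross-checks it against the defining cosine cos(X, P_K X).

```python
import numpy as np


def rm_coefficient(S, subset):
    """RM from the covariance matrix: sqrt(tr([S^2]_(K) [S_(K)]^-1) / tr(S))."""
    S = np.asarray(S, dtype=float)
    idx = np.asarray(subset)
    S_K = S[np.ix_(idx, idx)]          # S_(K)
    S2_K = (S @ S)[np.ix_(idx, idx)]   # [S^2]_(K)
    # solve(S_K, S2_K) equals S_K^-1 [S^2]_(K), which has the same trace
    # as [S^2]_(K) S_K^-1.
    return float(np.sqrt(np.trace(np.linalg.solve(S_K, S2_K)) / np.trace(S)))


def rm_by_projection(X, subset):
    """Same quantity computed directly as cos(X, P_K X), for checking."""
    K = X[:, subset]
    P = K @ np.linalg.pinv(K)          # orthogonal projector onto span(K)
    PX = P @ X
    return float(np.trace(X.T @ PX) / (np.linalg.norm(X) * np.linalg.norm(PX)))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X -= X.mean(axis=0)                    # column-centre, as assumed in the text
S = X.T @ X / X.shape[0]
print(rm_coefficient(S, [0, 2, 5]), rm_by_projection(X, [0, 2, 5]))
```

The covariance-based form only needs a k × k linear solve per subset, which is why it is the natural one to use inside a search algorithm.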
Tanaka and Mori (1997) proposed an indicator which generalizes the above criterion,
based on the PCs of a subset of the k original variables, which they call modified
principal components. Their generalization is
$$PRM(r) = \cos^2(X, P_r X), \qquad (2)$$
where $P_r$ is the matrix of orthogonal projections on the subspace defined by the first r
PCs of some k-variable subset of the original variables. The RM criterion is $\sqrt{PRM(k)}$.
2.2. The RV criterion
Robert and Escoufier (1976) suggest an indicator to measure how similar are two
configurations of related points in a linear space, allowing for translations, rigid rotations and global changes of scale. The RV-coefficient, applied to two matrices A and
B, whose (comparable) rows define the location of each point in the configuration is
$\cos(AA^t, BB^t)$. As with the RM coefficient, two matrices to which it makes sense to
apply this notion are the original data matrix X and the matrix of the data fitted to the
k-dimensional subspace spanned by K. Thus, we will consider the indicator given by
$$RV = \cos(XX^t, P_K XX^t P_K) = \sqrt{\frac{1}{\mathrm{tr}(S^2)} \cdot \mathrm{tr}\left(\left([S^2]_{(K)} [S_{(K)}]^{-1}\right)^2\right)}, \qquad (3)$$
with notation defined as in Section 2.1 (see also Gonzalez et al., 1990).
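Eq. (3) can be evaluated from S alone in the same way as RM. The following is a minimal sketch assuming NumPy; the function name `rv_coefficient` and the simulated data are ours.

```python
import numpy as np


def rv_coefficient(S, subset):
    """RV from the covariance matrix:
    sqrt(tr(([S^2]_(K) [S_(K)]^-1)^2) / tr(S^2))."""
    S = np.asarray(S, dtype=float)
    idx = np.asarray(subset)
    S2 = S @ S
    S_K = S[np.ix_(idx, idx)]
    S2_K = S2[np.ix_(idx, idx)]
    # M = S_K^-1 [S^2]_(K); tr(M M) equals tr(([S^2]_(K) S_K^-1)^2)
    # because the trace is invariant under cyclic permutations.
    M = np.linalg.solve(S_K, S2_K)
    return float(np.sqrt(np.trace(M @ M) / np.trace(S2)))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]
print(rv_coefficient(S, [0, 2, 5]))
```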
Tanaka and Mori (1997) also consider a generalization of the RV criterion, based
on comparing the original configuration of points with the configuration that results
from projecting onto the first r PCs of a given k-variable subset:
$$PRV(r) = \cos^2(XX^t, P_r XX^t P_r), \qquad (4)$$
where $P_r$ is defined as in Section 2.1. RV coincides with $\sqrt{PRV(k)}$.
2.3. The GCD criterion
A third proximity coefficient is Yanai’s Generalized Coefficient of Determination
(GCD) (Ramsay et al., 1984). This indicator measures the degree of similarity between
two subspaces, and is defined as the cosine of the angle between the matrices of
orthogonal projections on those subspaces. Golub and Van Loan (1996, Section 2.6.3)
define an intimately related concept, the distance between subspaces as the norm of
the difference between those projection matrices.
It is known that the k-dimensional subspace which accounts for the maximum possible variability within the original data set is the subspace $\mathcal{G}$ spanned by the first k
PCs of the data set. A plausible criterion to measure the adequacy of the k-variable
subset K is to compare the similarity between $\mathcal{G}$ and the subspace $\mathcal{K}$ spanned by K. Hence, we have:
$$GCD = \cos(P_G, P_K) = \frac{1}{k} \, \mathrm{tr}\left([S_{\{G\}}]_{(K)} [S_{(K)}]^{-1}\right), \qquad (5)$$
where $S_{(K)}$ and $P_K$ are defined as above, $P_G$ denotes the matrix of orthogonal projection
onto subspace $\mathcal{G}$, $S_{\{G\}}$ represents the approximation to matrix S obtained by retaining,
in the spectral decomposition of S, only the eigenvectors/values associated with the k
largest eigenvalues, and $[S_{\{G\}}]_{(K)}$ is the k × k submatrix of $S_{\{G\}}$ obtained by retaining
only the rows/columns with row/column number in K. Again, simple algebra (Cadima
and Jolliffe, 2001) shows that the GCD can be re-written in terms of the multiple
correlations, $(r_m)_i$, between the ith PC of the full data set and the k selected variables:
$$GCD = \frac{1}{k} \sum_{i=1}^{k} (r_m)_i^2.$$
Thus, the GCD for the subspaces has values between zero (if the subspaces are orthogonal) and one (if all k PCs are in $\mathcal{K}$, i.e., if the subspaces coincide). The GCD is the
average of the squared canonical correlations between two sets of variables spanning
each of the subspaces (Ramsay et al., 1984).
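Eq. (5) can be sketched along the same lines as the other two criteria. The code below assumes NumPy and uses its symmetric eigensolver to build the rank-k approximation of S; the function name `gcd_coefficient` and the simulated data are ours.

```python
import numpy as np


def gcd_coefficient(S, subset):
    """GCD between the span of the first k PCs and the span of the k
    selected variables: (1/k) tr([S_{G}]_(K) [S_(K)]^-1)."""
    S = np.asarray(S, dtype=float)
    idx = np.asarray(subset)
    k = idx.size
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    lam, V = eigvals[-k:], eigvecs[:, -k:]    # the k largest eigenpairs
    S_G = (V * lam) @ V.T                     # rank-k spectral approximation of S
    S_K = S[np.ix_(idx, idx)]
    S_G_K = S_G[np.ix_(idx, idx)]
    return float(np.trace(np.linalg.solve(S_K, S_G_K)) / k)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]
print(gcd_coefficient(S, [0, 2, 5]))
```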
It is also possible to generalize the GCD, along the lines suggested by Tanaka and
Mori (1997), and to compare the subspaces defined by a subset of the PCs of the full
data set and by (a subset of) the PCs of a given k-variable subset:
$$PGCD(r) = \cos(P_G, P_r), \qquad (6)$$
where $P_G$ and $P_r$ are defined above.
3. The algorithms
Determining, for a given value of k, which k-subsets of the set of variables X
maximize (1), (3) or (5) is a particular case of the following general combinatorial
optimization problem:
find S which maximizes $\{c(S) : S \in F \subseteq 2^N\}$,
where N is a finite set, and the objective function $c : 2^N \to \mathbb{R}_0^+$ associates to each
subset of N a nonnegative real value. The set $S^* \subseteq N$ is an optimal solution if $S^*$ is
in the set F of feasible solutions and $c(S^*) \geq c(S)$, for all $S \in F$.
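For small problems this optimization can be solved by plain enumeration, taking F to be the set of all k-subsets of N. The following Python sketch shows the idea; the function name `exhaustive_search` and the additive toy criterion are ours and serve only as an illustration (in practice the criterion would be RM, RV or GCD).

```python
from itertools import combinations
from typing import Callable, FrozenSet, Tuple


def exhaustive_search(p: int, k: int,
                      criterion: Callable[[FrozenSet[int]], float]
                      ) -> Tuple[FrozenSet[int], float]:
    """Evaluate the criterion on every k-subset of {0, ..., p-1}."""
    best_subset: FrozenSet[int] = frozenset()
    best_value = float("-inf")
    for combo in combinations(range(p), k):
        subset = frozenset(combo)
        value = criterion(subset)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value


# Toy criterion (hypothetical): each variable contributes a fixed weight.
weights = [0.1, 0.9, 0.3, 0.7, 0.5]
toy = lambda s: sum(weights[v] for v in s)
print(exhaustive_search(5, 2, toy))
```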
Though it is difficult to classify heuristic algorithms for combinatorial optimization
problems, we can identify two different classes: greedy-type heuristics and local search
heuristics.
Table 1
Synoptic table of local search algorithms to select a k-subset from p variables

                        Restricted improvement   Simulated annealing         Genetic algorithm
Parameters affecting    nss: number of           nss: number of              ssp: size of starting
complexity              repetitions              repetitions;                population;
                                                 nit: number of iterations   g: number of generations
Stopping rule           Q empty                  nit > itmax                 g > gmax
Number of calls to      O(nss × k × p)           nss × nit                   ssp + (ssp/2) × g
evaluate criterion
3.1. Greedy heuristics
In our computational study we used four variants of greedy-type heuristics: forward
selection, backward elimination and stepwise algorithms with a default forward or
backward direction, but which take a step in the opposite direction for every two in
the default direction. Algorithms of this kind are widely used in Linear Regression
(Draper and Smith, 1998) and have been used in the present context (Cadima and
Jolliffe, 2001).
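The two basic greedy schemes can be sketched as follows. This is a generic illustration in Python, not the authors' implementation: the function names and the additive toy criterion are ours, and the two stepwise variants with occasional steps in the opposite direction are not shown.

```python
from typing import Callable, FrozenSet

Criterion = Callable[[FrozenSet[int]], float]


def forward_selection(p: int, k: int, criterion: Criterion) -> FrozenSet[int]:
    """Start from the empty set and add, one at a time, the variable
    whose inclusion gives the largest criterion value."""
    selected: FrozenSet[int] = frozenset()
    while len(selected) < k:
        candidates = [v for v in range(p) if v not in selected]
        best = max(candidates, key=lambda v: criterion(selected | {v}))
        selected = selected | {best}
    return selected


def backward_elimination(p: int, k: int, criterion: Criterion) -> FrozenSet[int]:
    """Start from all p variables and drop, one at a time, the variable
    whose removal leaves the largest criterion value."""
    selected = frozenset(range(p))
    while len(selected) > k:
        worst = max(selected, key=lambda v: criterion(selected - {v}))
        selected = selected - {worst}
    return selected


# Toy criterion (hypothetical): each variable contributes a fixed weight.
weights = [0.1, 0.9, 0.3, 0.7, 0.5]
toy = lambda s: sum(weights[v] for v in s)
print(forward_selection(5, 2, toy), backward_elimination(5, 2, toy))
```

With an additive criterion such as this one, greedy choices are optimal; the criteria of Section 2 are not additive, which is why greedy methods can miss the best subset.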
3.2. Local search heuristics
Local search heuristics always work inside the feasible region F, moving from one
feasible solution to another feasible solution, by exchanging some of its components.
We considered three heuristics of this type, which are briefly described below. As the
evaluation of the objective functions (1), (3) or (5) clearly dominates all other operations
in the algorithms, the number of calls to the calculation of the criterion was used
as an appropriate measure of time complexity. Complexity and other aspects of the
algorithms are summarized in Table 1. In our implementations, criteria values were
recomputed from scratch for each new subset. As pointed out by a referee, a more
efficient strategy would be to update criteria values for each variable swap, speeding
up computations at the expense of memory usage. This possibility will be considered
in a future implementation.
3.2.1. Restricted improvement
We used a modified version of improvement heuristics (Tovey, 1997) which we
call restricted improvement. A neighbourhood of the current solution S was defined
as the set of k-subsets of X that can be obtained by replacing one element of S by
one element from X \ S. The variables which are candidates for membership of set S
are arranged in a queue Q. Initially Q consists of all variables which are not included
in the starting solution. The algorithm picks an element j from Q, removes it from Q,
and determines $i^* \in S$ which maximizes $\{c(S \setminus \{i\} \cup \{j\}) : i \in S\}$. The current set is
updated by letting $S := S \setminus \{i^*\} \cup \{j\}$ iff $c(S \setminus \{i^*\} \cup \{j\}) > c(S)$. In case S is updated,
variable $i^*$ will be inserted in Q iff $i^*$ has not been in Q before. The algorithm stops
when Q is equal to the empty set. Note that determining $i^*$ involves k evaluations of
the objective function, and no element will be inserted in Q more than once, giving
O(k × p) evaluations of the criterion. Unlike in standard local improvement, there is
no guarantee that the final solution S is a local optimum, since no element will be
inserted in Q twice.
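The procedure just described can be sketched as follows. This is our reading of the description, not the authors' code: the first-in, first-out order of the queue and the strict inequality in the update test are assumptions, and the function name and toy criterion are ours.

```python
from collections import deque
from typing import Callable, FrozenSet, Iterable, Tuple


def restricted_improvement(start: Iterable[int], p: int,
                           criterion: Callable[[FrozenSet[int]], float]
                           ) -> Tuple[FrozenSet[int], float]:
    current = set(start)
    queue = deque(v for v in range(p) if v not in current)
    been_in_queue = set(queue)
    c_current = criterion(frozenset(current))
    while queue:
        j = queue.popleft()
        # k criterion evaluations: try j in place of each element i of S.
        values = {i: criterion(frozenset((current - {i}) | {j}))
                  for i in current}
        i_star = max(values, key=values.get)
        if values[i_star] > c_current:
            current = (current - {i_star}) | {j}
            c_current = values[i_star]
            # The removed variable re-enters Q only if it was never there.
            if i_star not in been_in_queue:
                queue.append(i_star)
                been_in_queue.add(i_star)
    return frozenset(current), c_current


# Toy criterion (hypothetical): each variable contributes a fixed weight.
weights = [0.1, 0.9, 0.3, 0.7, 0.5]
toy = lambda s: sum(weights[v] for v in s)
print(restricted_improvement([0, 2], 5, toy))
```

In the experiments of Section 4 this kind of routine is restarted from nss random starting subsets and the best final subset is kept; the restart loop is omitted here.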
3.2.2. Simulated annealing
In iteration j of a simulated annealing algorithm (Aarts et al., 1997) a neighbour $S'$ may be chosen over a better current solution S with probability $\exp((c(S') - c(S))/T_j)$,
where $T_j > 0$. In our simulated annealing algorithm, neighbourhoods are defined as in
the restricted improvement heuristic. As for $T_j$, we start with $T_1 = 1 - c_0$, where $c_0$
is the criterion value of the starting solution, and gradually decrease the current value
by 5% every 20 iterations. We also considered the possibility of slightly increasing
$T_j$ during the first 20p iterations if the current solution does not change for a long
time. The algorithm stops with the best solution encountered at any stage, after a given
number nit of iterations.
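A minimal sketch of this scheme with uniform swaps is given below. It follows the acceptance rule and cooling schedule described above, but it is not the authors' implementation: the optional temperature increase is omitted, the small floor on the initial temperature is our assumption, and the function name and toy criterion are ours.

```python
import math
import random
from typing import Callable, FrozenSet, Iterable, Tuple


def simulated_annealing(start: Iterable[int], p: int,
                        criterion: Callable[[FrozenSet[int]], float],
                        n_iter: int, seed: int = 0
                        ) -> Tuple[FrozenSet[int], float]:
    rng = random.Random(seed)
    current = frozenset(start)
    c_current = criterion(current)
    best, c_best = current, c_current
    # T_1 = 1 - c_0; the floor (an assumption) avoids a zero temperature
    # when the starting value is already 1.
    temperature = max(1.0 - c_current, 1e-12)
    for iteration in range(1, n_iter + 1):
        # Uniform swap: one variable leaves the subset, one enters it.
        leaving = rng.choice(sorted(current))
        entering = rng.choice([v for v in range(p) if v not in current])
        neighbour = (current - {leaving}) | {entering}
        c_new = criterion(neighbour)
        # Improvements are always accepted; a worse neighbour is accepted
        # with probability exp((c(S') - c(S)) / T_j).
        if (c_new >= c_current
                or rng.random() < math.exp((c_new - c_current) / temperature)):
            current, c_current = neighbour, c_new
            if c_current > c_best:
                best, c_best = current, c_current
        if iteration % 20 == 0:
            temperature *= 0.95  # 5% decrease every 20 iterations
    return best, c_best


# Toy criterion (hypothetical), scaled to lie in [0, 1].
weights = [0.1, 0.9, 0.3, 0.7, 0.5]
toy = lambda s: sum(weights[v] for v in s) / 2.0
print(simulated_annealing([0, 2], 5, toy, n_iter=500))
```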
The RM criterion assesses the performance of the k-variable subset in predicting
the remaining variables. This suggested the following alternative to the standard implementation for this specific criterion. The element of S to be dropped is chosen
with probability proportional to the entries of the diagonal of the inverse variance–
covariance matrix of the variables corresponding to the elements of S. The rationale
for this lies in the fact that the reciprocal of the ith diagonal element of the inverse
variance–covariance matrix is the partial variance of the ith variable in the subset, i.e.,
the variance of its residuals after regression on the remaining variables of the subset
(see Whittaker, 1990, Section 5.8). Large diagonal elements in the inverse variance–
covariance matrix are therefore associated with variables that have little further information to contribute. The element to be added to S is chosen from $\bar{S} = X \setminus S$ with
probability proportional to the entries of the diagonal of
$S_{\bar{S}\bar{S} \cdot S} = S_{\bar{S}\bar{S}} - S_{\bar{S}S} S_{SS}^{-1} S_{S\bar{S}}$, the
matrix of partial variances–covariances of the variables corresponding to the elements
of $\bar{S}$, given the variables in S. The largest such elements are associated with variables
that are not well explained by those already in the subset. We call this alternative
guided swap to distinguish it from the standard uniform swap. Although the guided
version involves some additional calculations, both alternatives require nit evaluations
of the criterion values.
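The two sets of swap probabilities used by the guided variant can be computed directly from the covariance matrix. The following is a minimal sketch assuming NumPy; the function name `guided_swap_probabilities`, its return layout and the simulated data are ours.

```python
import numpy as np


def guided_swap_probabilities(S, subset):
    """Return (inside, p_drop, outside, p_add) for the guided swap.

    p_drop is proportional to the diagonal of the inverse covariance
    matrix of the selected variables; p_add is proportional to the
    partial variances of the excluded variables given the selected ones.
    """
    S = np.asarray(S, dtype=float)
    chosen = set(subset)
    inside = np.array(sorted(chosen))
    outside = np.array([v for v in range(S.shape[0]) if v not in chosen])
    S_in = S[np.ix_(inside, inside)]
    S_out_in = S[np.ix_(outside, inside)]
    drop_weights = np.diag(np.linalg.inv(S_in))
    partial = (S[np.ix_(outside, outside)]
               - S_out_in @ np.linalg.solve(S_in, S_out_in.T))
    add_weights = np.diag(partial)
    return (inside, drop_weights / drop_weights.sum(),
            outside, add_weights / add_weights.sum())


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]
inside, p_drop, outside, p_add = guided_swap_probabilities(S, [0, 2, 4])
print(dict(zip(inside.tolist(), p_drop.round(3))))
print(dict(zip(outside.tolist(), p_add.round(3))))
```

A swap would then be drawn with, for example, `rng.choice(inside, p=p_drop)` for the variable to drop and `rng.choice(outside, p=p_add)` for the variable to add.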
3.2.3. Genetic algorithms
Genetic algorithms (Mühlenbein, 1997) start with an initial population of ssp feasible
solutions which are mated to produce children (i.e., other feasible solutions) that inherit
properties of their parents. The next generation will consist of elements selected among
those from the previous generation and their children.
In our implementation, the number of child-bearing couples (father, mother) to be
formed is half the size of the population. Each father is selected among the members of
the population with probability proportional to his value of the criterion. For each father
F, a mother M is selected with equal probability among the members of the population
which have at least two variables not belonging to F. The child produced by each pair
(F, M) includes all the variables which belong to both parents. The remaining variables
are selected with equal probability from the parents’ symmetric difference, with the
additional restriction that at least one variable from M \ F and one from F \ M will
be selected. Each new generation is formed by the ssp individuals, from among the
previous generation and their offspring, which have the highest values of the objective
function. The number of generations g was used as the stopping criterion. We noted
that after a few generations a large number of clones (replicates) of a fairly small
number of individuals tended to appear. In several cases it only took a few iterations
for the algorithm to get stuck in a population in which the cardinality of the symmetric
difference of every pair is not greater than two. To overcome this problem, every time
a new child is generated the algorithm counts the number of existing replicants of this
child, among the current generation and among the children already created. If this
number exceeds a given value, maxclone, the child is rejected and is replaced by an
admissible solution selected at random, with a uniform distribution.
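The mating and selection steps can be sketched as follows. This is an illustration under our own assumptions, not the authors' code: the restriction on the child is enforced by rejection sampling, the maxclone check is omitted for brevity, and the function names and toy criterion are ours.

```python
import random
from itertools import combinations
from typing import Callable, FrozenSet, List


def make_child(father: FrozenSet[int], mother: FrozenSet[int],
               rng: random.Random) -> FrozenSet[int]:
    """Keep the variables common to both parents and fill up from the
    symmetric difference, with at least one variable from each side."""
    common = father & mother
    only_f, only_m = father - mother, mother - father
    if len(only_m) < 2:
        raise ValueError("mother needs at least two variables not in father")
    pool = sorted(only_f | only_m)
    while True:  # rejection sampling: redraw if the child copies a parent
        extra = set(rng.sample(pool, len(only_f)))
        if extra & only_f and extra & only_m:
            return frozenset(common | extra)


def next_generation(population: List[FrozenSet[int]],
                    criterion: Callable[[FrozenSet[int]], float],
                    rng: random.Random) -> List[FrozenSet[int]]:
    values = [criterion(ind) for ind in population]
    children = []
    for _ in range(len(population) // 2):
        # Father: probability proportional to his criterion value.
        father = rng.choices(population, weights=values, k=1)[0]
        # Mother: uniform among individuals differing enough from him.
        mothers = [m for m in population if len(m - father) >= 2]
        if mothers:
            children.append(make_child(father, rng.choice(mothers), rng))
    # Keep the best individuals among parents and offspring.
    ranked = sorted(population + children, key=criterion, reverse=True)
    return ranked[:len(population)]


# Toy criterion (hypothetical): each variable contributes a fixed weight.
weights = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2]
toy = lambda s: sum(weights[v] for v in s)
population = [frozenset(c) for c in combinations(range(6), 3)]
new_population = next_generation(population, toy, random.Random(0))
print(max(toy(ind) for ind in new_population))
```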
4. Data and results
Nine data sets were considered, which include both real and simulated data. In four
cases, only the correlation matrix was available. In the other five cases, the original
data set or its covariance matrix were known, and in order to diversify situations (in
terms of variance of eigenvalues) both the covariance and the correlation matrix PCAs
were used. This gives a total of 14 different situations. Details of the data sets can
be obtained from http://home.uevora.pt/~minhoto/optvasa. Among these 14
situations, nine have a moderate number of variables (between 13 and 20). In these
cases, a full search was viable and we were able to find optimal k-variable subsets for
the three criteria and for all k = 2, 3, ..., p − 1. The other five situations correspond
to a 35-variable data set, in both standardized and unstandardized form, two different
(standardized) 50-variable data sets and a 62-variable (standardized) data set. For these
five cases, we are not sure of having identified the optimal values.
The computational results were obtained using stand-alone Fortran 90 programs. Running times for the more computationally intensive cases considered were a few seconds,
for each cardinality and criterion, on a 400 MHz Pentium II. Similar performance can
be achieved using the authors’ subselect software module, which can be loaded from
within a session of the R statistical software package. R (http://www.R-project.org)
is a Free Software implementation of the S Statistical Language (Becker et al.,
1988).
Regarding the implementations of the local search methods, some additional specifications should be noted. Both the restricted improvement algorithm and the two versions
of simulated annealing work with a given number nss of starting solutions (repetitions).
Table 2
Results for the nine smaller matrices

                  RM                    RV                    GCD
Algorithm         % suc.   mre          % suc.   mre          % suc.   mre
Forward           47.15    0.001307     46.34    0.001290     29.27    0.036344
Step/Forw.        74.80    0.000471     70.73    0.000450     63.41    0.011749
Backward          58.54    0.005976     43.64    0.008062     49.59    0.023695
Step/Back.        69.92    0.002160     54.47    0.005782     56.91    0.015239
Improvem.         100.00   0            100.00   0            100.00   0
Genetic           100.00   0            100.00   0            100.00   0
S. An. (unif.)    100.00   0            100.00   0            100.00   0
S. An. (guid.)    100.00   0            —        —            —        —
Each starting solution was chosen with equal probability among all k-subsets. The solution produced by the algorithm is the variable subset with the largest value of the
criterion, in the nss repetitions.
In choosing the values for the parameters, it was considered desirable to ensure that
in all these algorithms the number of evaluations of the criteria was roughly the same.
For the moderate size matrices, the following values were assigned: nss = 40 for
the restricted improvement; nss = 10 and nit = 1000 for both versions of simulated
annealing; and maxclone = 5, ssp = 200, g = 100 for the genetic algorithms.
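The evaluation budgets implied by these settings follow from the formulas in Table 1. The small Python sketch below works them out (function names are ours; the restricted improvement figure is only the order of magnitude given by the O(nss × k × p) bound, evaluated here for an illustrative p = 20, k = 10).

```python
def sa_calls(nss: int, nit: int) -> int:
    """Simulated annealing: nss repetitions of nit iterations each."""
    return nss * nit


def ga_calls(ssp: int, g: int) -> int:
    """Genetic algorithm: the starting population, plus ssp/2 children
    in each of g generations."""
    return ssp + (ssp // 2) * g


def ri_order(nss: int, k: int, p: int) -> int:
    """Restricted improvement: order of magnitude nss * k * p."""
    return nss * k * p


print("simulated annealing:", sa_calls(nss=10, nit=1000))
print("genetic algorithm:  ", ga_calls(ssp=200, g=100))
print("restricted improv.: ", ri_order(nss=40, k=10, p=20))
```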
Two measures were used to assess results. Denoting by $c^*$ and $c_H$, respectively, the
optimal value and the value of the heuristic solution produced by the algorithm, these
measures are:
• % suc.: the percentage of cases in which $c^*$ was achieved;
• mre: the mean relative error, defined as the mean of the ratios $(c^* - c_H)/c^*$.
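Both measures are simple to compute from paired lists of optimal and heuristic criterion values. A minimal Python sketch follows; the function name, the numerical tolerance used to decide that the optimum was achieved, and the example values are our assumptions.

```python
from typing import Sequence, Tuple


def success_and_mre(optimal: Sequence[float], heuristic: Sequence[float],
                    tol: float = 1e-9) -> Tuple[float, float]:
    """Return (% suc., mre) over a set of cases (one per cardinality k)."""
    if len(optimal) != len(heuristic) or not optimal:
        raise ValueError("need two non-empty sequences of equal length")
    rel_errors = [(c_star - c_h) / c_star
                  for c_star, c_h in zip(optimal, heuristic)]
    pct_success = 100.0 * sum(err <= tol for err in rel_errors) / len(rel_errors)
    mre = sum(rel_errors) / len(rel_errors)
    return pct_success, mre


# Hypothetical values for three cardinalities; the heuristic misses one.
print(success_and_mre([1.0, 0.8, 0.5], [1.0, 0.76, 0.5]))
```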
The results for these smaller matrices are summed up in Table 2.
For the larger matrices, the values assigned to the parameters were: nss = 100 for
the restricted improvement; nss = 10 and nit = 4000 for both versions of simulated
annealing; and maxclone = 5, ssp = 800, g = 200 for the genetic algorithms.
Since we are not sure what the true optimal values are, the value of c∗ in the
definition of the performance measures % suc. and mre was taken to be the best value
obtained for each criterion, with any of the algorithms. Only the more meaningful
cardinalities were now considered, taking into account the number of variables in each
case, as shown, together with the results for these matrices, in Tables 3–5.
Table 3
Results for the RM criterion

                hemat35.var         hemat35.cor         matd.cor            matx.cor            mat62.cor
                (p = 35,            (p = 35,            (p = 50,            (p = 50,            (p = 62,
                k from 5 to 15)     k from 5 to 15)     k from 5 to 20)     k from 5 to 20)     k from 5 to 25)
Algorithm       % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre
Forward         27.27   0.000001    18.18   0.000754    0       0.004399    0       0.012297    0       0.004754
Step/Forw.      27.27   0.000001    18.18   0.000754    0       0.003838    6.25    0.003736    0       0.004356
Backward        27.27   0.000003    0       0.008857    0       0.027070    0       0.016869    0       0.010023
Step/Back.      27.27   0.000002    0       0.007973    6.25    0.005726    0       0.004408    0       0.007508
Improvem.       100     0           100     0           37.50   0.000499    12.50   0.001514    61.90   0.000097
Genetic         100     0           100     0           87.50   0.000176    87.50   0.000500    57.14   0.000239
S.An. (unif.)   100     0           100     0           93.75   0.000033    75.00   0.000641    85.71   0.000040
S.An. (guid.)   90.91   0.000001    100     0           93.75   0.000033    87.50   0.000467    85.71   0.000046

Table 4
Results for the RV criterion

                hemat35.var         hemat35.cor         matd.cor            matx.cor            mat62.cor
                (p = 35,            (p = 35,            (p = 50,            (p = 50,            (p = 62,
                k from 5 to 15)     k from 5 to 15)     k from 5 to 20)     k from 5 to 20)     k from 5 to 25)
Algorithm       % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre
Forward         81.82   0.000000    18.18   0.001357    0       0.004586    18.75   0.002270    0       0.003850
Step/Forw.      81.82   0.000000    18.18   0.001357    0       0.003497    18.75   0.001837    0       0.003686
Backward        81.82   0.000000    0       0.012337    0       0.023993    0       0.004375    0       0.008891
Step/Back.      81.82   0.000000    0       0.011099    0       0.007604    18.75   0.004593    0       0.004583
Improvem.       100     0           100     0           68.75   0.000383    25.00   0.000703    52.38   0.0004567
Genetic         100     0           100     0           75.00   0.000202    18.75   0.000886    38.10   0.000563
S.An. (unif.)   100     0           100     0           81.25   0.000096    75.00   0.000212    76.19   0.000239

Table 5
Results for the GCD criterion

                hemat35.var         hemat35.cor         matd.cor            matx.cor            mat62.cor
                (p = 35,            (p = 35,            (p = 50,            (p = 50,            (p = 62,
                k from 5 to 15)     k from 5 to 15)     k from 5 to 20)     k from 5 to 20)     k from 5 to 25)
Algorithm       % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre         % suc.  mre
Forward         18.18   0.000520    0       0.022461    0       0.077515    0       0.088082    0       0.045313
Step/Forw.      18.18   0.000520    0       0.009694    0       0.024209    6.25    0.024832    0       0.030971
Backward        0       0.026299    0       0.059332    0       0.172976    0       0.095381    0       0.097163
Step/Back.      0       0.013232    0       0.044351    0       0.048149    0       0.037637    0       0.049069
Improvem.       100     0           100     0           93.75   0.000022    56.25   0.001522    80.95   0.000726
Genetic         100     0           100     0           93.75   0.000167    87.50   0.000211    76.19   0.000228
S.An. (unif.)   100     0           100     0           87.50   0.000122    81.25   0.000430    90.48   0.000068

5. Discussion
The combinatorial optimization problems of determining which k-subsets of a given
set of variables maximize (1), (3) and (5) were tackled. Specific versions of
local search methods, namely of local improvement, simulated annealing and genetic
algorithms, were devised and implemented. The results of these algorithms, and those
of four greedy-type methods, were compared in 14 different cases.
The local search methods performed significantly better than the greedy-type algorithms. For the nine smaller-sized problems, all local search methods identified the
optimal solutions, for every cardinality k and for the three criteria. For the larger
matrices, we cannot guarantee that the optimal solutions were found. In the two
cases where p = 35, for the values of k considered (with a single exception) and
for the three criteria, all local search algorithms produced the same solutions. This
was no longer the case for the other three situations considered, and despite the good
performance of simulated annealing in most circumstances, it is not possible to state
that one given algorithm consistently out-performs the others.
It must be kept in mind that the performance of both simulated annealing and genetic
algorithms is affected by different choices involving the parameters, such as $T_j$ in
simulated annealing or maxclone in the genetic algorithms. Variants of each method can
also be devised, as illustrated by the guided swap variant of simulated annealing with
the RM criterion. Although the results as presented do not highlight major differences
between both versions of simulated annealing, a closer look at the results of the nss
individual repetitions indicates that the guided swap version tends to replicate the best
solutions more frequently than its uniform counterpart.
In our experiments we noted that simulated annealing and genetic algorithms usually
provide solutions which are local optima in relation to the neighborhoods defined in
the restricted improvement algorithm. But this was not always the case.
We believe that a reasonable approach to the optimization of the kind of criteria
discussed above, consists in using either simulated annealing or genetic algorithms
to obtain a tentative solution, which can then be fed to the restricted improvement
algorithm for refinement.
An additional aspect relating to the problem under consideration seems to be relevant. We noted that there were a large number of suboptimal solutions, that is, many
solutions whose values of the criteria were close to the best value found. As an example, for the matrix of size p = 62, with cardinality k = 12, we found 800 different
solutions with values of the RM criterion, ranging from 0.8079392 (the best value
found) to 0.8059924. But it should be noted that the worst of these 800 values is
still better than the best value (0.8042016) produced by any of the greedy-type methods! The existence of a large number of suboptimal solutions suggests that a suitable
approach to the variable selection problem may consist, not so much in determining an
optimal solution for a given criterion, but rather a solution which is suboptimal with
respect to several criteria. This is a topic which we will seek to address in future work.
Acknowledgements
Research was financially supported by the Portuguese Foundation for Science and
Technology (FCT).
References
Aarts, E.H.L., Korst, J.H.M., van Laarhoven, P.J.M., 1997. Simulated annealing. In: Aarts, E., Lenstra, J.K.
(Eds.), Local Search in Combinatorial Optimization. Wiley, New York, pp. 91–120.
Becker, R., Chambers, J., Wilks, A., 1988. The New S Language. A Programming Environment for Data
Analysis and Graphics. Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
Bonifas, I., Escoufier, Y., Gonzalez, P.L., Sabatier, R., 1984. Choix de variables en analyse en composants
principales. Rev. Statist. Appl. 23, 5–15.
Cadima, J., Jolliffe, I.T., 1995. Loadings and correlations in the interpretation of principal components.
J. Appl. Statist. 22 (2), 203–214.
Cadima, J., Jolliffe, I.T., 2001. Variable selection and the interpretation of principal subspaces. J. Agricultural,
Biol. Environ. Statist. 6 (1), 62–79.
Draper, N., Smith, H., 1998. Applied Regression Analysis, 3rd Edition. Wiley, New York.
Duarte Silva, A.P., 2001. Efficient variable screening for multivariate analysis. J. Multivariate Anal. 76,
35–62.
Duarte Silva, A.P., 2002. Discarding variables in principal component analysis algorithms for all-subset
comparisons. Comput. Statist. 17, 251–271.
de Falguerolles, A., Jmel, S., 1993. Un critère de choix de variables en analyse en composants principales
fondé sur des modèles graphiques gaussiens particuliers. Canad. J. Statist. 21 (3), 239–256.
Fedorov, V., Gruben, D., Leonov, S., 1999. Direct method of selecting informative variables. Smith Kline
Beecham Biostatistics and Data Sciences Technical Report 1999-02.
Furnival, G.M., Wilson, R.W., 1974. Regressions by leaps and bounds. Technometrics 16, 499–511 (reprinted
in Technometrics 42, 1, 69 –79).
Golub, G., Van Loan, C., 1996. Matrix Computations. Johns Hopkins University Press, Baltimore, MD.
Gonzalez, P.L., Evry, R., Cléroux, R., Rioux, B., 1990. Selecting the best subset of variables in principal
component analysis. In: Momirovic, K., Mildner, V. (Eds.), Physica-Verlag, Compstat, pp. 115–120.
Jolliffe, I.T., 1972. Discarding variables in a principal component analysis, I: Artificial data. Appl. Statist.
21, 160–173.
Jolliffe, I.T., 1973. Discarding variables in a principal component analysis, II: Real data. Appl. Statist. 22,
21–31.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd Edition. Springer, New York.
Krzanowski, W.J., 1987. Selection of variables to preserve multivariate data structure using principal
components. Appl. Statist. 36, 22–33.
McCabe, G.P., 1984. Principal variables. Technometrics 26 (2), 137–144.
McCabe, G.P., 1986. Prediction of principal components by variables subsets. Technical Report 86-19,
Department of Statistics, Purdue University.
Mühlenbein, H., 1997. Genetic algorithms. In: Aarts, E., Lenstra, J.K. (Eds.), Local Search in Combinatorial
Optimization. Wiley, New York, pp. 137–171.
Ramsay, J.O., ten Berge, J., Styan, G.P.H., 1984. Matrix correlation. Psychometrika 49 (3), 403–423.
Robert, P., Escoufier, Y., 1976. A unifying tool for linear multivariate statistical methods: the RV-coefficient.
Appl. Statist. 25 (3), 257–265.
Tanaka, Y., Mori, Y., 1997. Principal Component Analysis based on a subset of variables: variable selection
and sensitivity analysis. American J. Math. Management Sci. 17 (1 & 2), 61–89.
Tovey, C.A., 1997. Local improvement on discrete structures. In: Aarts, E., Lenstra, J.K. (Eds.), Local Search
in Combinatorial Optimization. Wiley, New York, pp. 57–89.
Whittaker, J., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, New York.