An optimization criterion for generalized
discriminant analysis on undersampled problems
Jieping Ye, Ravi Janardan, Cheong Hee Park, and Haesun Park
Abstract
An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria
of the classical Linear Discriminant Analysis (LDA) through the use of the pseudo-inverse when the scatter
matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size,
overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the
Generalized Singular Value Decomposition (GSVD) technique. The pseudo-inverse has been suggested and used
for undersampled problems in the past, where the data dimension exceeds the number of data points. The criterion
proposed in this paper provides a theoretical justification for this procedure.
An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational
complexity by finding sub-clusters of each cluster, and uses their centroids to capture the structure of each cluster.
This reduced problem yields much smaller matrices to which the GSVD can be applied efficiently. Experiments
on text data, with up to 7000 dimensions, show that the approximation algorithm produces results that are close
to those produced by the exact algorithm.
Index Terms
Classification, clustering, dimension reduction, generalized singular value decomposition, linear discriminant
analysis, text mining.
J. Ye, R. Janardan, C. Park, and H. Park are with the Department of Computer Science and Engineering, University of Minnesota–Twin
Cities, Minneapolis, MN 55455, U.S.A. Email: {jieping, janardan, chpark, hpark}@cs.umn.edu.
Corresponding author: Jieping Ye; Tel.: 612-626-7504; Fax: 612-625-0572.
I. INTRODUCTION
Many interesting data mining problems involve data sets represented in very high-dimensional spaces.
We consider dimension reduction for high-dimensional, undersampled data, where the dimension of the
data points is higher than the number of data points. Such high-dimensional, undersampled problems
occur frequently in many applications, including information retrieval [25], facial recognition [3], [28]
and microarray analysis [1].
The application area of interest in this paper is vector space-based information retrieval. The dimension
of the document vectors is typically very high, due to the large number of terms that appear in the collection
of the documents. In the vector space-based model, documents are represented as column vectors in a
term-document matrix. For an
term-document matrix
, its -th term, , represents the
weighted frequency of term in document . Several weighting schemes have been developed for encoding
the document collection in a term-document matrix [17]. An advantage of the vector space based-method
is that once the collection of documents is represented as columns of the term-document matrix in a
high-dimensional space, the algebraic structure of the related vector space can then be exploited.
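To make the representation concrete, here is a minimal sketch, in Python with NumPy, of a term-document matrix built from a toy collection; the vocabulary, the documents, and the particular tf-idf variant are illustrative assumptions, not the encoding used in the experiments of Section V.

import numpy as np

# Toy document collection (hypothetical); each document is a list of terms.
docs = [["data", "mining", "text"],
        ["text", "retrieval", "text"],
        ["gene", "expression", "data"]]

vocab = sorted({t for d in docs for t in d})          # term index
m, n = len(vocab), len(docs)                          # m terms, n documents
idx = {t: i for i, t in enumerate(vocab)}

# Raw term-frequency matrix A (m x n): A[i, j] = count of term i in document j.
A = np.zeros((m, n))
for j, d in enumerate(docs):
    for t in d:
        A[idx[t], j] += 1

# One common tf-idf variant: scale each term by log(n / df), where df is the
# number of documents containing the term.
df = np.count_nonzero(A, axis=1)
A = A * np.log(n / df)[:, None]
print(A.shape)   # (m, n): documents are the columns of the term-document matrix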
When the documents are already clustered, we would like to find a dimension-reducing transformation
that preserves the cluster structure of the original full space even after the dimension reduction. Throughout
the paper, the input documents are assumed to have been already clustered before the dimension reduction
step. When the documents are not clustered, then an efficient clustering algorithm such as K-Means [6],
[16] can be applied before the dimension reduction step. We seek a reduced representation of the document
vectors, which best preserves the structure of the original document vectors.
A. Prior related work
Latent Semantic Indexing (LSI) has been widely used for dimension reduction of text data [2], [5]. It is
based on lower rank approximation of the term-document matrix from the Singular Value Decomposition
(SVD) [11]. Although the SVD provides the optimal reduced-rank approximation of the matrix when the difference is measured in the $L_2$ or Frobenius norm, it has limitations in that it does not consider the cluster structure in the data and is expensive to compute. Moreover, the choice of the optimal reduced dimension is difficult to determine theoretically.
The Linear Discriminant Analysis (LDA) method has been applied for decades for dimension reduction
(feature extraction) of clustered data in pattern recognition [10]. It is classically formulated as an optimization problem on scatter matrices. A serious disadvantage of the LDA is that its objective function
requires that the total scatter matrix be nonsingular. In many modern data mining problems such as
information retrieval, facial recognition, and microarray data analysis, the total scatter matrix in question
can be singular since the data items are from a very high-dimensional space, and in general, the dimension
exceeds the number of data points. This is known as the undersampled or singularity problem [10], [18].
In recent years, many approaches have been brought to bear on such high-dimensional, undersampled
problems. These methods can be roughly grouped into three categories. The first approach applies an
intermediate dimension reduction method, such as LSI, Principal Component Analysis (PCA), and Partial
Least Squares (PLS), to extract important components for LDA [3], [14], [18], [28]. The algorithm is
named two-stage LDA. This approach is straightforward but the intermediate dimension reduction stage
may remove some useful information. Some relevant references on this approach can be found in [15],
[18]. The second approach solves the singularity problem by adding a perturbation to the scatter matrix
[4], [9], [18], [33]. The algorithm is known as the regularized LDA, or RLDA for short. A limitation of
RLDA is that the amount of perturbation to be used is difficult to determine [4], [18].
The final approach applies the pseudo-inverse to avoid the singularity problem, which is equivalent to
approximating the solution using a least-squares solution method. The work in [12] showed that its use
led to appropriate calculation of Mahalanobis distance in singular cases. The use of classical LDA with
the pseudo-inverse has been proposed in the past. The Pseudo Fisher Linear Discriminant (PFLDA) in
[8], [10], [23], [26], [27] is based on the pseudo-inverse of the scatter matrices. The generalization error of
PFLDA was studied in [8], [26], when the size and dimension of the training data vary. As observed in [8],
[26], with an increase in the size of the training data, the generalization error of PFLDA first decreases,
reaching a minimum, then increases, reaching a maximum at the point when the size of the training data is
comparable with the data dimension, and afterwards begins to decrease. This peaking behavior of PFLDA
can be resolved by either reducing the size of the training data [8], [26] or by increasing the dimension
through addition of redundant features [27]. Pseudo-inverses of the scatter matrices were also studied in
[18], [20], [29]. Experiments in [18] showed that the pseudo-inverse based method achieved comparable
performance with RLDA and two-stage LDA. It is worthwhile to note that both RLDA and two-stage
LDA involve the estimation of some parameters. The authors in [18] applied cross-validation to estimate
the parameters, which can be very expensive. On the other hand, the pseudo-inverse based LDA does not
involve any parameter and is straightforward. However, its theoretical justification has not been studied
well in the literature.
Recently, a generalization of LDA based on the Generalized Singular Value Decomposition (GSVD)
has been developed [13], [14], which is applicable regardless of the data dimension, and, therefore, can be
used for undersampled problems. The classical LDA solution becomes a special case of this LDA/GSVD
method. In [13], [14], the solution from LDA/GSVD is shown to preserve the cluster structure in the
original full space after dimension reduction. However, no explicit global objective function has been
presented.
B. Contributions of this paper
In this paper, we present a generalized optimization criterion, based on the pseudo-inverse, for discriminant analysis. Our cluster-preserving projections are tailored for extracting the cluster structure of high-dimensional data, and are closely related to classical linear discriminant analysis. The main advantage of
the proposed criterion is that it is applicable to undersampled problems. A detailed mathematical derivation
of the proposed optimization problem is presented. The GSVD technique is the key component for the
derivation. The solution from the LDA/GSVD algorithm is a special case of the solution for the proposed
criterion. We call it the exact algorithm, to distinguish it from an efficient approximation algorithm that we
also present. Interestingly, we show the equivalence between the exact algorithm and the pseudo-inverse-based method, PFLDA. Hence, the criterion proposed in this paper may provide a theoretical justification
for the pseudo-inverse based LDA.
One limitation of the exact algorithm is the high computational complexity of GSVD in handling
large matrices. Typically, large document datasets, such as the ones in [32], contain several thousands of
documents with dimension in the tens of thousands. We propose an approximation algorithm based on
sub-clustering of clusters to reduce the cost of computing the SVD involved in the computation of the
GSVD. Each cluster is further sub-clustered so that the overall structure of each cluster can be represented
by the set of centroids of sub-clusters. As a result, only a few vectors are needed to define the scatter
matrices, thus reducing the computational complexity quite significantly. Experimental results show that
the approximation algorithm produces results close to those produced by the exact one.
We compare our proposed algorithms with other dimension reduction algorithms in terms of classification accuracy, where the K-Nearest neighbors (K-NN) method [7] based on the Euclidean distance is
used for classification. We use 10-fold cross-validation for estimating the misclassification rate. In 10-fold
cross-validation, we divide the data into ten subsets of (approximately) equal size. Then we do the training
and testing ten times, each time leaving out one of the subsets from training, and using only the omitted
subset for testing. The misclassification rate is the average from the ten runs. The misclassification rates
on different data sets for all the dimension reduction algorithms discussed in this paper are reported for
comparison. The results in Section V show that our exact and approximation algorithms are comparable
with the other dimension reduction algorithms discussed in this paper. More interestingly, the results
produced by the approximation algorithm are close to those produced by the exact algorithm, while the
approximation algorithm deals with matrices of much smaller sizes than those in the exact algorithm,
hence has lower computational complexity.
The main contributions of this paper include: (1) A generalization of the classical discriminant analysis
to small sample size data using an optimization criterion, where the non-singularity of the scatter matrices
is not required; (2) Mathematical derivation of the solution for the proposed optimization criterion; (3)
Theoretical justification of previously proposed pseudo-inverse based LDA, by showing the equivalence
between the exact algorithm and the Pseudo Fisher Linear Discriminant Analysis (PFLDA); (4) An efficient
approximation algorithm for the proposed optimization criterion. (5) Detailed experimental results for our
exact and approximation algorithms and comparisons with competing algorithms.
The rest of the paper is organized as follows: Classical discriminant analysis is reviewed in Section II. A
generalization of the classical discriminant analysis using the proposed criterion is presented in Section III.
An efficient approximation algorithm is presented in Section IV. Experimental results are presented in
Section V and concluding discussions are presented in Section VI.
For convenience, we present in Table I the important notations used in the rest of the paper.
TABLE I
IMPORTANT NOTATIONS USED IN THE PAPER

Notation    Description                                   Notation    Description
$n$         number of the training data points            $A$         data matrix
$m$         dimension of the training data                $A_i$       data matrix of the $i$-th cluster
$k$         number of clusters                            $n_i$       size of the $i$-th cluster
$c^{(i)}$   centroid of the $i$-th cluster                $c$         global centroid of the training set
$S_b$       between-cluster scatter matrix                $S_w$       within-cluster scatter matrix
$S_t$       total scatter matrix                          $G$         dimension reduction transformation matrix
$d$         number of sub-clusters in K-Means             $s$         total number of sub-clusters
$\mu$       perturbation in regularized LDA               $p$         reduced dimension by LSI
$K$         number of nearest neighbors in K-NN           $q$         reduced dimension by the exact algorithm
II. CLASSICAL DISCRIMINANT ANALYSIS
Given a data matrix $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$, we consider finding a linear transformation $G^T \in \mathbb{R}^{\ell \times m}$ that maps each column $a_j$, for $1 \le j \le n$, of $A$ in the $m$-dimensional space to a vector $y_j$ in the $\ell$-dimensional space: $y_j = G^T a_j \in \mathbb{R}^{\ell}$. Assume the original data is already clustered. The goal here is to find the transformation $G^T$ such that the cluster structure of the original high-dimensional space is preserved in the reduced-dimensional space. Let the document matrix $A$ be partitioned into $k$ clusters as $A = [A_1, \ldots, A_k]$, where $A_i \in \mathbb{R}^{m \times n_i}$ and $\sum_{i=1}^{k} n_i = n$. Let $N_i$ be the set of column indices that belong to the $i$-th cluster, i.e., $a_j$, for $j \in N_i$, belongs to the $i$-th cluster.
In general, if each cluster is tightly grouped, but well separated from the other clusters, the quality of the clustering is considered to be high. In discriminant analysis [10], three scatter matrices, called the within-cluster, between-cluster, and total scatter matrices, are defined to quantify the quality of the clustering, as follows:
$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c^{(i)})(a_j - c^{(i)})^T, \qquad
S_b = \sum_{i=1}^{k} n_i\, (c^{(i)} - c)(c^{(i)} - c)^T, \qquad
S_t = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T, \qquad (1)$$
where the centroid $c^{(i)}$ of the $i$-th cluster is defined as $c^{(i)} = \frac{1}{n_i} A_i e^{(i)}$, where $e^{(i)} = (1, \ldots, 1)^T \in \mathbb{R}^{n_i}$, and the global centroid is defined as $c = \frac{1}{n} A e$, where $e = (1, \ldots, 1)^T \in \mathbb{R}^{n}$.

Define the matrices
$$H_w = \big[A_1 - c^{(1)} (e^{(1)})^T, \ \ldots, \ A_k - c^{(k)} (e^{(k)})^T\big] \in \mathbb{R}^{m \times n}, \qquad
H_b = \big[\sqrt{n_1}\,(c^{(1)} - c), \ \ldots, \ \sqrt{n_k}\,(c^{(k)} - c)\big] \in \mathbb{R}^{m \times k}. \qquad (2)$$
Then the scatter matrices $S_w$ and $S_b$ can be expressed as $S_w = H_w H_w^T$ and $S_b = H_b H_b^T$, and it is easy to verify that $S_t = S_w + S_b$. The traces of the two scatter matrices can be computed as follows:
$$\mathrm{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in N_i} \|a_j - c^{(i)}\|^2, \qquad
\mathrm{trace}(S_b) = \sum_{i=1}^{k} n_i\, \|c^{(i)} - c\|^2.$$
Hence $\mathrm{trace}(S_w)$ measures the closeness of the vectors within the clusters, while $\mathrm{trace}(S_b)$ measures the separation between clusters.
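A minimal NumPy sketch of these definitions on randomly generated clustered data (the cluster sizes and the shift are arbitrary), forming $H_w$ and $H_b$ and verifying $S_t = S_w + S_b$:

import numpy as np

rng = np.random.default_rng(0)
m, sizes = 50, [8, 6, 10]                 # dimension m, cluster sizes n_i
blocks = [rng.standard_normal((m, ni)) + 3*i for i, ni in enumerate(sizes)]
A = np.hstack(blocks)                     # data matrix, one cluster per column block

c = A.mean(axis=1, keepdims=True)         # global centroid
Hw_cols, Hb_cols = [], []
for Ai, ni in zip(blocks, sizes):
    ci = Ai.mean(axis=1, keepdims=True)   # cluster centroid c^(i)
    Hw_cols.append(Ai - ci)               # A_i - c^(i) e^(i)T
    Hb_cols.append(np.sqrt(ni) * (ci - c))
Hw = np.hstack(Hw_cols)                   # m x n
Hb = np.hstack(Hb_cols)                   # m x k

Sw, Sb = Hw @ Hw.T, Hb @ Hb.T             # within- and between-cluster scatter
St = (A - c) @ (A - c).T                  # total scatter
assert np.allclose(St, Sw + Sb)           # S_t = S_w + S_b
print(np.trace(Sw), np.trace(Sb))         # closeness within / separation between clusters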
In the lower-dimensional space resulting from the linear transformation $G^T$, the within-cluster and between-cluster scatter matrices become
$$S_w^Y = G^T S_w G, \qquad S_b^Y = G^T S_b G. \qquad (3)$$
An optimal transformation $G^T$ would maximize $\mathrm{trace}(S_b^Y)$ and minimize $\mathrm{trace}(S_w^Y)$. Common optimizations in classical discriminant analysis include
$$\max_{G} \ \mathrm{trace}\big((S_w^Y)^{-1} S_b^Y\big) \qquad \text{and} \qquad \min_{G} \ \mathrm{trace}\big((S_b^Y)^{-1} S_w^Y\big). \qquad (4)$$
The optimization problem in (4) is equivalent to finding all the eigenvectors $x$ that satisfy $S_b x = \lambda S_w x$, for $\lambda \neq 0$. The solution can be obtained by solving an eigenvalue problem on the matrix $S_w^{-1} S_b$, if $S_w$ is nonsingular, or on $S_b^{-1} S_w$, if $S_b$ is nonsingular. Equivalently, we can compute the optimal solution by solving an eigenvalue problem on the matrix $S_t^{-1} S_b$ [10]. There are at most $k-1$ eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix $S_b$ is bounded above by $k-1$. Therefore, the reduced dimension by classical LDA is at most $k-1$. A stable way to solve this eigenvalue problem is to apply the SVD on the scatter matrices. Details can be found in [28].
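For the nonsingular case, the classical solution can be computed directly; a minimal sketch on synthetic data with $n > m$, so that the scatter matrices are nonsingular (the eigenvalue problem is solved on $S_w^{-1} S_b$ as described above):

import numpy as np

rng = np.random.default_rng(1)
m, sizes = 5, [40, 30, 50]                       # n >> m, so the scatter matrices are nonsingular
blocks = [rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)]
A = np.hstack(blocks)
c = A.mean(axis=1, keepdims=True)
Sw = sum((Ai - Ai.mean(axis=1, keepdims=True)) @ (Ai - Ai.mean(axis=1, keepdims=True)).T
         for Ai in blocks)
Sb = sum(ni * (Ai.mean(axis=1, keepdims=True) - c) @ (Ai.mean(axis=1, keepdims=True) - c).T
         for Ai, ni in zip(blocks, sizes))

# Classical LDA: eigenvectors of Sw^{-1} Sb with nonzero eigenvalues (at most k-1 of them).
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(-evals.real)
k = len(sizes)
G = evecs[:, order[:k - 1]].real                 # m x (k-1) transformation
Y = G.T @ A                                      # reduced representation of the data
print(G.shape, Y.shape)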
Note that a limitation of classical discriminant analysis in many applications involving small sample
data, including text processing in information retrieval, is that the total scatter matrix $S_t$ must be nonsingular. Several
methods, including two-stage LDA, regularized LDA and pseudo-inverse based LDA, were proposed in
the past to deal with the singularity problem as follows.
A common way to deal with the singularity problem is to apply an intermediate dimension reduction
stage to reduce the dimension of the original data before classical LDA is applied. Several choices of the
intermediate dimension reduction algorithms include LSI [14], PCA [3], [18], [28], Partial Least Squares
(PLS) [18], etc. The application area of this paper is text mining, hence LSI is a natural choice for
the intermediate dimension reduction. In this two-stage LSI+LDA algorithm, the discriminant stage is
preceded by a dimension reduction stage using LSI. A limitation of this approach is that the optimal
value of the reduced dimension for LSI is difficult to determine [14].
A simple way to deal with the singularity of $S_t$ is to add some constant value to its diagonal elements, as $S_t + \mu I_m$, for some $\mu > 0$, where $I_m$ is an identity matrix [9]. It is easy to check that $S_t + \mu I_m$ is positive definite, hence nonsingular. This approach is called regularized Linear Discriminant Analysis (RLDA). A limitation of RLDA is that the optimal value of the parameter $\mu$ is difficult to determine. It is evident that when $\mu$ is large, we lose the information on $S_t$, while very small values of $\mu$ may not be sufficiently effective [26]. Cross-validation is commonly applied for estimating the optimal $\mu$. More recent studies on RLDA can be found in [4], [18], [26], [33].
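A minimal sketch of the regularization just described (the value $\mu = 0.1$ is an arbitrary placeholder; in practice $\mu$ would be chosen by cross-validation as noted above):

import numpy as np

def rlda(A, labels, mu):
    """Regularized LDA sketch: eigenvectors of (S_t + mu*I)^{-1} S_b."""
    m, n = A.shape
    classes = np.unique(labels)
    c = A.mean(axis=1, keepdims=True)
    St = (A - c) @ (A - c).T
    Sb = np.zeros((m, m))
    for cl in classes:
        Ai = A[:, labels == cl]
        ci = Ai.mean(axis=1, keepdims=True)
        Sb += Ai.shape[1] * (ci - c) @ (ci - c).T
    evals, evecs = np.linalg.eig(np.linalg.solve(St + mu * np.eye(m), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:len(classes) - 1]].real   # m x (k-1)

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 30))                    # m = 100 > n = 30: S_t is singular
labels = np.repeat([0, 1, 2], 10)
G = rlda(A, labels, mu=0.1)
print(G.shape)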
Another common way to deal with the singularity problem is to apply the pseudo-inverse of the scatter matrices instead of the inverse. The pseudo-inverse of a matrix is defined as follows.

Definition 2.1: The pseudo-inverse of a matrix $A$, denoted as $A^+$, refers to the unique matrix satisfying the following four conditions: (1) $A A^+ A = A$; (2) $A^+ A A^+ = A^+$; (3) $(A A^+)^T = A A^+$; (4) $(A^+ A)^T = A^+ A$.

The pseudo-inverse $A^+$ is commonly computed by the SVD [11] as follows. Let $A = U \Sigma V^T$ be the SVD of $A$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with positive diagonal entries. Then the pseudo-inverse of $A$, denoted as $A^+$, can be computed as $A^+ = V \Sigma^{-1} U^T$.

The Pseudo Fisher Linear Discriminant Analysis (PFLDA) [8], [10], [23], [26], [27] is equivalent to finding all the eigenvectors $x$ satisfying $S_t^+ S_b x = \lambda x$, for $\lambda \neq 0$. An advantage of PFLDA over the other two methods is that there is no parameter involved in PFLDA.
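The following sketch verifies the four conditions of Definition 2.1 numerically for a singular total scatter matrix and then computes the PFLDA transformation from the eigenvalue problem $S_t^+ S_b x = \lambda x$ (random undersampled data; np.linalg.pinv implements the SVD-based construction described above):

import numpy as np

rng = np.random.default_rng(3)
m, sizes = 80, [10, 10, 10]                        # m > n: undersampled, S_t is singular
blocks = [rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)]
A = np.hstack(blocks)
c = A.mean(axis=1, keepdims=True)
St = (A - c) @ (A - c).T
Sb = sum(ni * (Ai.mean(axis=1, keepdims=True) - c) @ (Ai.mean(axis=1, keepdims=True) - c).T
         for Ai, ni in zip(blocks, sizes))

# np.linalg.pinv computes the pseudo-inverse through the SVD, as described above.
P = np.linalg.pinv(St)
# The four Moore-Penrose conditions of Definition 2.1:
assert np.allclose(St @ P @ St, St, atol=1e-6)
assert np.allclose(P @ St @ P, P, atol=1e-6)
assert np.allclose((St @ P).T, St @ P, atol=1e-6)
assert np.allclose((P @ St).T, P @ St, atol=1e-6)

# PFLDA: eigenvectors of S_t^+ S_b with nonzero eigenvalues give the transformation.
evals, evecs = np.linalg.eig(P @ Sb)
order = np.argsort(-evals.real)
G = evecs[:, order[:len(sizes) - 1]].real
print(G.shape)                                     # m x (k-1)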
III. GENERALIZATION OF DISCRIMINANT ANALYSIS

Classical discriminant analysis expresses its solution by solving an eigenvalue problem when $S_t$ is nonsingular. However, for a general document matrix $A \in \mathbb{R}^{m \times n}$, the number of documents ($n$) may be smaller than its dimension ($m$), so that the total scatter matrix $S_t$ is singular. This implies that both $S_b$ and $S_w$ are singular. In this paper, we propose the criterion $F_1$ below, where the non-singularity of $S_t$ is not required.

The criterion aims to minimize the within-cluster distance, $\mathrm{trace}(S_w^Y)$, and maximize the between-cluster distance, $\mathrm{trace}(S_b^Y)$. The criterion $F_1$ is a natural extension of the classical one in Eq. (4), in that the inverse of a matrix is replaced by the pseudo-inverse [11]. While the inverse of a matrix may not exist, the pseudo-inverse of any matrix is well-defined. Moreover, when the matrix is invertible, its pseudo-inverse is the same as its inverse.

The criterion $F_1$ that we propose is:
$$F_1(G) = \mathrm{trace}\Big(\big(G^T S_b G\big)^+\big(G^T S_w G\big)\Big) = \mathrm{trace}\big((S_b^Y)^+ S_w^Y\big). \qquad (5)$$
Hence the trace optimization is to find an optimal transformation matrix $G^T$ such that $F_1(G)$ is minimized under a certain constraint, defined in more detail in Section III-C.

We show how to solve the above minimization problem in Section III-C. The main technique applied there is the GSVD, briefly introduced in Section III-A below. The constraints on the optimization problem are based on the observations in Section III-B.
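As a concrete reading of Eq. (5), the sketch below evaluates $F_1(G)$ for an arbitrary transformation on random undersampled data; the helper names and the random $G$ are illustrative only (the optimal $G$ is constructed in Section III-C):

import numpy as np

def scatter_matrices(A, sizes):
    """Return (S_b, S_w) for a data matrix A whose clusters are contiguous column blocks."""
    m = A.shape[0]
    c = A.mean(axis=1, keepdims=True)
    Sb, Sw, start = np.zeros((m, m)), np.zeros((m, m)), 0
    for ni in sizes:
        Ai = A[:, start:start + ni]
        ci = Ai.mean(axis=1, keepdims=True)
        Sb += ni * (ci - c) @ (ci - c).T
        Sw += (Ai - ci) @ (Ai - ci).T
        start += ni
    return Sb, Sw

def F1(G, Sb, Sw):
    """Criterion (5): trace( (G^T S_b G)^+ (G^T S_w G) )."""
    SbY, SwY = G.T @ Sb @ G, G.T @ Sw @ G
    return np.trace(np.linalg.pinv(SbY) @ SwY)

rng = np.random.default_rng(4)
m, sizes = 60, [8, 8, 8]
A = np.hstack([rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)])
Sb, Sw = scatter_matrices(A, sizes)
G = rng.standard_normal((m, len(sizes) - 1))      # an arbitrary (not optimal) transformation
print(F1(G, Sb, Sw))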
A. Generalized singular value decomposition
The Generalized Singular Value Decomposition (GSVD) was introduced in [21], [31]. A simple algorithm to compute the GSVD can be found in [13], where the algorithm is based on [21].

The GSVD of the matrix pair $(H_b^T, H_w^T)$, with $H_b \in \mathbb{R}^{m \times k}$ and $H_w \in \mathbb{R}^{m \times n}$, gives orthogonal matrices $U \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{n \times n}$, and a nonsingular matrix $X \in \mathbb{R}^{m \times m}$, such that
$$U^T H_b^T X = (\Sigma_b, \ 0), \qquad V^T H_w^T X = (\Sigma_w, \ 0), \qquad (6)$$
where $\Sigma_b \in \mathbb{R}^{k \times t}$ and $\Sigma_w \in \mathbb{R}^{n \times t}$, with
$$t = \mathrm{rank}\begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}, \qquad
\Sigma_b^T \Sigma_b = \mathrm{diag}(\alpha_1^2, \ldots, \alpha_t^2), \qquad
\Sigma_w^T \Sigma_w = \mathrm{diag}(\beta_1^2, \ldots, \beta_t^2).$$
The pairs $(\alpha_i, \beta_i)$ satisfy
$$\alpha_i^2 + \beta_i^2 = 1, \qquad 1 \ge \alpha_1 \ge \cdots \ge \alpha_t \ge 0, \qquad 0 \le \beta_1 \le \cdots \le \beta_t \le 1.$$
Moreover, $\alpha_i = 1$ and $\beta_i = 0$ for $i = 1, \ldots, r$, where $r = t - \mathrm{rank}(H_w)$, while $\alpha_i = 0$ and $\beta_i = 1$ for $i = \mathrm{rank}(H_b)+1, \ldots, t$, so that $\Sigma_b$ and $\Sigma_w$ consist of identity, diagonal, and zero blocks.

From (6) we have $H_b^T X = U(\Sigma_b, \ 0)$ and $H_w^T X = V(\Sigma_w, \ 0)$, hence
$$X^T S_b X = \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix}, \qquad
X^T S_w X = \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix}. \qquad (7)$$
Therefore
$$S_b^Y = G^T S_b G = \tilde{G}^T \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}, \qquad
S_w^Y = G^T S_w G = \tilde{G}^T \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}, \qquad (8)$$
where
$$\tilde{G} = X^{-1} G. \qquad (9)$$
We use the above representations of $S_b^Y$ and $S_w^Y$ in Section III-C for the minimization of $F_1$.

It follows from Eq. (7) and $S_t = S_b + S_w$ that
$$X^T S_t X = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix}, \qquad (10)$$
which is used in Lemma 3.5 below.
B. Linear subspace spanned by the centroids
Let $\mathcal{C}$ be the subspace in $\mathbb{R}^m$ spanned by the $k$ centroids of the data vectors,
$$\mathcal{C} = \mathrm{span}\{c^{(1)}, \ldots, c^{(k)}\}.$$
In the lower-dimensional space obtained by $G^T$, the linear subspace spanned by the centroids is
$$\mathcal{C}_Y = \mathrm{span}\{c^{(1)}_Y, \ldots, c^{(k)}_Y\},$$
where $c^{(i)}_Y = G^T c^{(i)}$, for $i = 1, \ldots, k$. The following lemma studies the relationship between the dimension of the subspace $\mathcal{C}$ and the rank of the matrix $H_b$, as well as their relationship in the reduced space.

Lemma 3.1: Let $\mathcal{C}$ and $\mathcal{C}_Y$ be defined as above and let $H_b$ be defined as in Eq. (2). Let $\dim(\mathcal{C})$ and $\dim(\mathcal{C}_Y)$ denote the dimensions of the subspaces $\mathcal{C}$ and $\mathcal{C}_Y$, respectively. Then
1. $\dim(\mathcal{C}) = \mathrm{rank}(H_b)+1$, or $\mathrm{rank}(H_b)$, and $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)+1$, or $\mathrm{rank}(G^T H_b)$;
2. If $\dim(\mathcal{C}) = \mathrm{rank}(H_b)$, then $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)$. If $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)+1$, then $\dim(\mathcal{C}) = \mathrm{rank}(H_b)+1$.

Proof: Consider the matrix $C_M = [c^{(1)}, \ldots, c^{(k)}] \in \mathbb{R}^{m \times k}$, which can be rewritten as $C_M = \tilde{H}_b + c\, e_k^T$, where $\tilde{H}_b = [c^{(1)}-c, \ldots, c^{(k)}-c]$ and $e_k = (1, \ldots, 1)^T \in \mathbb{R}^{k}$. It is easy to check that $\mathrm{rank}(C_M) = \dim(\mathcal{C})$ and $\mathrm{rank}(\tilde{H}_b) = \mathrm{rank}(H_b)$. On the other hand, the rank of $C_M$ equals $\mathrm{rank}(H_b)$ if $c$ lies in the space spanned by $c^{(1)}-c, \ldots, c^{(k)}-c$, and equals $\mathrm{rank}(H_b)+1$ otherwise. Hence $\dim(\mathcal{C}) = \mathrm{rank}(H_b)+1$ or $\mathrm{rank}(H_b)$. Similarly, we can prove $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)+1$ or $\mathrm{rank}(G^T H_b)$. This proves part 1.

If $\dim(\mathcal{C}) = \mathrm{rank}(H_b)$, then, from the above argument, $c = \sum_{i=1}^{k} \gamma_i\,(c^{(i)}-c)$ for some coefficients $\gamma_i$. It follows that
$$c_Y = G^T c = \sum_{i=1}^{k} \gamma_i\,(G^T c^{(i)} - G^T c) = \sum_{i=1}^{k} \gamma_i\,(c^{(i)}_Y - c_Y),$$
where $c_Y = G^T c$ is the global centroid in the lower-dimensional space. Again by the above argument, $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)$. Similarly, if $\dim(\mathcal{C}_Y) = \mathrm{rank}(G^T H_b)+1$, we can show that $\dim(\mathcal{C}) = \mathrm{rank}(H_b)+1$. This completes the proof.
C. Generalized discriminant analysis using the $F_1$ measure
We start with a more general optimization problem:
$$\min_{G} \ F_1(G) \qquad \text{subject to} \qquad \mathrm{rank}(G^T H_b) = q, \qquad (11)$$
for some integer $q$. The optimization in Eq. (11) depends on the value of $q$. The optimal transformation $G^T$ we are looking for is a special case of the above formulation, where we set $q = \mathrm{rank}(H_b)$. This choice of $q$ guarantees that the dimension of the linear space spanned by the centroids in the original high-dimensional space and the corresponding one in the transformed lower-dimensional space are kept close to each other, as shown in Proposition 3.1 below.
To solve the minimization problem in Eq. (11), we need the following three lemmas, where the proofs
of the first two are straightforward from the definition of the pseudo-inverse [11].
Lemma 3.2: For any matrix $A \in \mathbb{R}^{m \times n}$, we have $(A A^T)^+ = (A^T)^+ A^+$.

Lemma 3.3: For any matrix $A \in \mathbb{R}^{m \times n}$, we have $\mathrm{trace}(A^+ A) = \mathrm{rank}(A)$.
The following lemma is critical for our main result:
Lemma 3.4: Let $D = \mathrm{diag}(\delta_1, \ldots, \delta_u)$ be any diagonal matrix with $0 \le \delta_1 \le \cdots \le \delta_u$. Then for any matrix $M \in \mathbb{R}^{u \times v}$ with $\mathrm{rank}(M) = q$, we have
$$\mathrm{trace}\big((M^T M)^+ M^T D M\big) \ \ge\ \sum_{i=1}^{q} \delta_i.$$
Furthermore, the equality holds if $M = \begin{pmatrix} \hat{M} \\ 0 \end{pmatrix}$ for some matrix $\hat{M} \in \mathbb{R}^{q \times v}$ with $\mathrm{rank}(\hat{M}) = q$, i.e., if the columns of $M$ lie in the span of the first $q$ coordinate vectors.

Proof: It is easy to check that $\mathrm{rank}(M^T M) = \mathrm{rank}(M) = q$. Let
$$M = \Phi \begin{pmatrix} \Sigma_M & 0 \\ 0 & 0 \end{pmatrix} \Psi^T \qquad (12)$$
be the SVD of $M$, where $\Phi \in \mathbb{R}^{u \times u}$ and $\Psi \in \mathbb{R}^{v \times v}$ are orthogonal and $\Sigma_M \in \mathbb{R}^{q \times q}$ is diagonal with positive diagonal entries. Then, by the definition of the pseudo-inverse, $(M^T M)^+ = \Psi\, \mathrm{diag}(\Sigma_M^{-2}, 0)\, \Psi^T$. Using the fact that $\mathrm{trace}(AB) = \mathrm{trace}(BA)$, together with Eq. (12), we have
$$\mathrm{trace}\big((M^T M)^+ M^T D M\big) = \mathrm{trace}\big(\Sigma_M^{-1} \Phi_q^T D \Phi_q\, \Sigma_M\big) = \mathrm{trace}\big(\Phi_q^T D \Phi_q\big),$$
where $\Phi_q$ contains the first $q$ columns of the orthogonal matrix $\Phi$. Writing $\mathrm{trace}(\Phi_q^T D \Phi_q) = \sum_{i=1}^{u} \delta_i\, \|\Phi_q(i,:)\|^2$, where the squared row norms lie in $[0,1]$ and sum to $q$, and using the fact that the diagonal elements $\delta_i$ of $D$ are nondecreasing, we obtain
$$\mathrm{trace}\big(\Phi_q^T D \Phi_q\big) \ \ge\ \sum_{i=1}^{q} \delta_i.$$
Moreover, the minimum is attained if $\Phi_q = \begin{pmatrix} \Theta \\ 0 \end{pmatrix}$ for some orthogonal matrix $\Theta \in \mathbb{R}^{q \times q}$, in which case
$$M = \Phi_q\, \Sigma_M\, \Psi_q^T = \begin{pmatrix} \Theta\, \Sigma_M\, \Psi_q^T \\ 0 \end{pmatrix},$$
where $\Psi_q$ contains the first $q$ columns of $\Psi$; this is exactly the form stated in the lemma, with $\hat{M} = \Theta\, \Sigma_M\, \Psi_q^T$.
We are now ready to present our main result for this section. To solve the minimization problem in Eq. (11), we first give a lower bound on the objective function $F_1$ in Theorem 3.1. A sufficient condition on $G^T$, under which the lower bound computed in Theorem 3.1 is attained, is presented in Corollary 3.1. A simple solution is then presented in Corollary 3.2. The optimal solution for our generalized discriminant analysis presented in this paper is a special case of the solution in Corollary 3.2, where we set $q = \mathrm{rank}(H_b)$.
Theorem 3.1: Assume the transformation matrix $G^T \in \mathbb{R}^{\ell \times m}$ satisfies $\mathrm{rank}(G^T H_b) = q$, for some integer $q$. Then the following inequality holds:
$$F_1(G) = \mathrm{trace}\big((S_b^Y)^+ S_w^Y\big) \ \ge\ \sum_{i=r+1}^{q} \frac{\beta_i^2}{\alpha_i^2} \quad \text{if } q > r, \qquad \text{and} \qquad F_1(G) \ \ge\ 0 \quad \text{if } q \le r, \qquad (13)$$
where $S_b^Y$ and $S_w^Y$ are defined in Eq. (3), and the $\alpha_i$'s, $\beta_i$'s, and $r$ are specified by the GSVD of the matrix pair $(H_b^T, H_w^T)$ in Eq. (6).

Proof: First consider the easy case when $q \le r$. Since both $(S_b^Y)^+$ and $S_w^Y$ are positive semi-definite, $\mathrm{trace}\big((S_b^Y)^+ S_w^Y\big) \ge 0$, so Eq. (13) holds. Next consider the case when $q > r$.

Recall from Eq. (8) that, by the GSVD,
$$S_b^Y = \tilde{G}^T \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}, \qquad S_w^Y = \tilde{G}^T \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G},$$
where $\tilde{G} = X^{-1} G$ is defined in Eq. (9). Partition $\tilde{G}$ into
$$\tilde{G} = \begin{pmatrix} \tilde{G}_1 \\ \tilde{G}_2 \\ \tilde{G}_3 \end{pmatrix},$$
where $\tilde{G}_1$ consists of the first $\mathrm{rank}(H_b)$ rows of $\tilde{G}$, $\tilde{G}_2$ of the next $t - \mathrm{rank}(H_b)$ rows, and $\tilde{G}_3$ of the remaining $m - t$ rows. Since $\alpha_i > 0$ for $i \le \mathrm{rank}(H_b)$, and $\alpha_i = 0$, $\beta_i = 1$ for $\mathrm{rank}(H_b) < i \le t$, we have
$$S_b^Y = \tilde{G}_1^T \Delta_b \tilde{G}_1, \qquad S_w^Y = \tilde{G}_1^T \Delta_w \tilde{G}_1 + \tilde{G}_2^T \tilde{G}_2, \qquad (14)$$
where $\Delta_b = \mathrm{diag}(\alpha_1^2, \ldots, \alpha_{\mathrm{rank}(H_b)}^2)$ is nonsingular and $\Delta_w = \mathrm{diag}(\beta_1^2, \ldots, \beta_{\mathrm{rank}(H_b)}^2)$. Since $\alpha_i^2 + \beta_i^2 = 1$ for all $i$, we have $\Delta_w = I - \Delta_b$, hence
$$S_w^Y = \tilde{G}_1^T \tilde{G}_1 - S_b^Y + \tilde{G}_2^T \tilde{G}_2. \qquad (15)$$
Note that $\mathrm{rank}(\tilde{G}_1) = \mathrm{rank}(\Delta_b^{1/2}\tilde{G}_1) = \mathrm{rank}(S_b^Y) = \mathrm{rank}(G^T H_b) = q$. Since $S_b^Y = (G^T H_b)(G^T H_b)^T$, Lemmas 3.2 and 3.3 give
$$\mathrm{trace}\big((S_b^Y)^+ S_b^Y\big) = \mathrm{rank}(S_b^Y) = \mathrm{rank}(G^T H_b) = q.$$
It follows from Eq. (15) that
$$\mathrm{trace}\big((S_b^Y)^+ S_w^Y\big) = \mathrm{trace}\big((S_b^Y)^+ \tilde{G}_1^T \tilde{G}_1\big) - q + \mathrm{trace}\big((S_b^Y)^+ \tilde{G}_2^T \tilde{G}_2\big) \ \ge\ \mathrm{trace}\big((S_b^Y)^+ \tilde{G}_1^T \tilde{G}_1\big) - q,$$
where the first inequality holds since both $(S_b^Y)^+$ and $\tilde{G}_2^T \tilde{G}_2$ are positive semi-definite, and it becomes an equality if $\tilde{G}_2 = 0$. Writing $M = \Delta_b^{1/2}\tilde{G}_1$, we have $S_b^Y = M^T M$, $\tilde{G}_1^T \tilde{G}_1 = M^T \Delta_b^{-1} M$, and $\mathrm{rank}(M) = q$, so Lemma 3.4, applied with the diagonal matrix $\Delta_b^{-1}$, whose diagonal entries $1/\alpha_1^2 \le \cdots \le 1/\alpha_{\mathrm{rank}(H_b)}^2$ are nondecreasing, gives
$$\mathrm{trace}\big((S_b^Y)^+ \tilde{G}_1^T \tilde{G}_1\big) = \mathrm{trace}\big((M^T M)^+ M^T \Delta_b^{-1} M\big) \ \ge\ \sum_{i=1}^{q} \frac{1}{\alpha_i^2}.$$
Combining the last two displays,
$$\mathrm{trace}\big((S_b^Y)^+ S_w^Y\big) \ \ge\ \sum_{i=1}^{q} \frac{1}{\alpha_i^2} - q = \sum_{i=1}^{q} \frac{1-\alpha_i^2}{\alpha_i^2} = \sum_{i=r+1}^{q} \frac{\beta_i^2}{\alpha_i^2},$$
since $\beta_i = 0$ for $i \le r$. This proves Eq. (13). Moreover, all the inequalities above become equalities if $\tilde{G}$ has the form
$$\tilde{G} = \begin{pmatrix} W\Lambda & 0 \\ 0 & 0 \\ Z_1 & Z_2 \end{pmatrix}, \qquad (16)$$
for some orthogonal matrix $W \in \mathbb{R}^{q \times q}$, some diagonal matrix $\Lambda \in \mathbb{R}^{q \times q}$ with positive diagonal entries, and arbitrary matrices $Z_1 \in \mathbb{R}^{(m-t) \times q}$ and $Z_2 \in \mathbb{R}^{(m-t) \times (\ell-q)}$, so that $\tilde{G}_1 = \begin{pmatrix} W\Lambda & 0 \\ 0 & 0 \end{pmatrix}$ and $\tilde{G}_2 = 0$. In particular, the equalities hold if both $W$ and $\Lambda$ are identity matrices.

Theorem 3.1 gives a lower bound on $F_1(G)$ when $q$ is fixed. From the arguments above, all the inequalities become equalities if $\tilde{G}$ has the form in Eq. (16). A simple choice is to set $W$ and $\Lambda$ in Eq. (16) to be identity matrices, which is used in our implementation later. The result is summarized in the following corollary.
Corollary 3.1: Let $G^T$, $\tilde{G}$, $\tilde{G}_1$, and $\tilde{G}_2$ be defined as in Theorem 3.1 and let $q > r$. Then the equality in Eq. (13) holds if the partition of $\tilde{G}$ satisfies $\tilde{G}_1 = \begin{pmatrix} W\Lambda & 0 \\ 0 & 0 \end{pmatrix}$ and $\tilde{G}_2 = 0$, where $W$ and $\Lambda$ are as in Eq. (16).
Theorem 3.1 does not mention the connection between $\ell$, the row dimension of the transformation matrix $G^T$, and $q$. We can choose a transformation matrix $G^T$ that has a large row dimension $\ell$ and still satisfies the condition stated in Corollary 3.1. However, we are more interested in a lower-dimensional representation of the original data, while keeping the same information from the original data. The following corollary says that the smallest possible value for $\ell$ is $q$, and, more importantly, that we can find a transformation $G^T$ which has row dimension $q$ and satisfies the condition stated in Corollary 3.1.
Corollary 3.2: For every $q \le \mathrm{rank}(H_b)$, there exists a transformation $G^T \in \mathbb{R}^{q \times m}$ such that the equality in Eq. (13) holds, i.e., the minimum value for $F_1(G)$ is attained. Furthermore, for any transformation $G^T \in \mathbb{R}^{\ell \times m}$ such that the assumption in Theorem 3.1 holds, we have $\ell \ge q$.

Proof: Construct $G = X_q$, where $X_q \in \mathbb{R}^{m \times q}$ contains the first $q$ columns of the matrix $X$. Then
$$\tilde{G} = X^{-1} G = \begin{pmatrix} I_q \\ 0 \end{pmatrix},$$
where $I_q \in \mathbb{R}^{q \times q}$ is an identity matrix. Hence $G^T = X_q^T$ satisfies the condition in Corollary 3.1, with $W$ and $\Lambda$ equal to identity matrices. By Corollary 3.1, the equality in Eq. (13) holds under the above transformation $G^T$. This completes the proof of the first part.

For any transformation $G^T \in \mathbb{R}^{\ell \times m}$ such that $\mathrm{rank}(G^T H_b) = q$, it is clear that $q = \mathrm{rank}(G^T H_b) \le \mathrm{rank}(G^T) \le \ell$. Hence $\ell \ge q$.
Theorem 3.1 shows that the minimum value of the objective function $F_1$ depends on $q$, the rank of the matrix $G^T H_b$. As shown in Lemma 3.1, the rank of the matrix $H_b$ (the number of nonzero $\alpha_i$'s in the GSVD) denotes the degree of linear independence of the centroids in the original data, while the rank, $q$, of the matrix $G^T H_b$ reflects the degree of linear independence of the centroids in the reduced space. By Corollary 3.2, if we fix the degree of linear independence of the centroids in the reduced space, i.e., if $q = \mathrm{rank}(G^T H_b)$ is fixed, then we can always find a transformation matrix $G^T \in \mathbb{R}^{q \times m}$ with row dimension $q$ such that the minimum value of $F_1$ is attained.

Next we consider the case when $q$ varies. From the result in Theorem 3.1, the minimum value of the objective function $F_1$ is smaller for smaller $q$. However, as the value of $q$ decreases (note that $q$ is always no larger than $\mathrm{rank}(H_b)$), the degree of linear independence of the centroids in the reduced space becomes much less than the one in the original high-dimensional space, which may lead to information loss of the original data. Hence, we choose $q = \mathrm{rank}(H_b)$, its maximum possible value, in our implementation. One nice property of this choice is stated in the following proposition:
* rank & '
equals rank
* , then 8 8 * +
8 8 + 8 8.
Proof: The proof follows directly from Lemma 3.1.
Proposition 3.1 implies that choosing
rank
* keeps the degree of linear independence of the
centroids in the reduced space the same as or one less than the one in the original space. With the
above choice, the reduced dimension under the transformation & ' is rank
* . We use rank
* to denote the optimal reduced dimension for our generalized discriminant analysis (also called exact
algorithm) throughout the rest of the paper. In this case, we can choose the transformation matrix & ' as
19
Algorithm 1: Exact algorithm
as in Eq. (2).
2. Compute SVD on , as # .
3. rank , !" rank 1. Form the matrices
and
, where
and
are orthogonal and
is diagonal.
# %$'&)(+*,$'&-. , the sub-matrix consisting of the first ( rows and the first columns of matrix , where 2 and 7 are orthogonal and 385 is diagonal.
from step 2, as #%$/&0(+*,$'&-.12436517
7
5. 9: <;>=
.
A9 BD C , where 9 ? B C contains the first E columns of the matrix 9 .
6. @
4. Compute SVD on
in Corollary 3.2 as follows: & '
' , where contains the first
columns of the matrix
. The
pseudo-code for our main algorithm is shown in Algorithm 1, where Lines 2–5 compute the GSVD on
the matrix pair
* ' * ' to obtain the matrix
as in Eq. (6).
In principle, it is not necessary to choose the reduced dimension to be exactly $q$. Experiments show good performance for values of the reduced dimension close to $q$. However, the results also show that the performance can be very poor for small values of the reduced dimension, probably due to information loss during dimension reduction.
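A NumPy sketch of Algorithm 1, following the SVD-based GSVD computation step by step; the rank tolerance and the block-structured data layout are implementation assumptions, not part of the algorithm statement:

import numpy as np

def exact_lda_gsvd(A, sizes):
    """Exact algorithm: GSVD-based generalized discriminant analysis.

    A      : m x n data matrix, clusters stored as contiguous column blocks
    sizes  : list of cluster sizes n_1, ..., n_k
    returns: G (m x q), with q = rank(H_b)
    """
    m = A.shape[0]
    c = A.mean(axis=1, keepdims=True)
    Hb_cols, Hw_cols, start = [], [], 0
    for ni in sizes:                              # Line 1: form H_b and H_w (Eq. (2))
        Ai = A[:, start:start + ni]
        ci = Ai.mean(axis=1, keepdims=True)
        Hb_cols.append(np.sqrt(ni) * (ci - c))
        Hw_cols.append(Ai - ci)
        start += ni
    Hb, Hw = np.hstack(Hb_cols), np.hstack(Hw_cols)

    K = np.vstack([Hb.T, Hw.T])                   # Line 2: SVD of K = [H_b^T; H_w^T]
    P, svals, Qt = np.linalg.svd(K)               # K = P diag(svals) Q^T
    t = np.sum(svals > 1e-10 * svals[0])          # Line 3: t = rank(K)
    q = np.linalg.matrix_rank(Hb)                 #         q = rank(H_b)

    k = len(sizes)
    P1 = P[:k, :t]                                # Line 4: SVD of the k x t sub-matrix of P
    U, _, Wt = np.linalg.svd(P1)
    R_inv = 1.0 / svals[:t]
    X = np.zeros((m, m))                          # Line 5: X = Q diag(R^{-1} W, I)
    X[:, :t] = Qt.T[:, :t] @ (R_inv[:, None] * Wt.T)
    X[:, t:] = Qt.T[:, t:]
    return X[:, :q]                               # Line 6: G = first q columns of X

rng = np.random.default_rng(5)
m, sizes = 200, [15, 15, 15, 15]                  # m >> n: classical LDA breaks down
A = np.hstack([rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)])
G = exact_lda_gsvd(A, sizes)
print(G.shape)                                    # (m, q); here q = rank(H_b) = k - 1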
D. Relation between the exact algorithm and the pseudo-inverse based LDA
As discussed in the Introduction, the Pseudo Fisher Linear Discriminant (PFLDA) method [8], [10], [23], [26], [27] applies the pseudo-inverse to the scatter matrices. The solution can be obtained by solving the eigenvalue problem on the matrix $S_t^+ S_b$. The exact algorithm proposed in this paper is based on the criterion $F_1$, as defined in Eq. (5), which is an extension of the criterion in classical LDA obtained by replacing the inverse with the pseudo-inverse. Both PFLDA and the exact algorithm apply the pseudo-inverse to deal with the singularity problem on undersampled problems. A natural question is: what is the relation between the two methods? We show in this section that the solution from the exact algorithm also solves the eigenvalue problem on the matrix $S_t^+ S_b$, that is, the two methods are essentially equivalent. The significance of this equivalence result is that the criterion proposed in this paper may provide a theoretical justification for the pseudo-inverse based LDA, which was used in an ad hoc manner in the past.

The following lemma is essential to show the relation between the exact algorithm and PFLDA.

Lemma 3.5: Let $X$ be the matrix obtained by computing the GSVD of the matrix pair $(H_b^T, H_w^T)$ (see Line 5 of Algorithm 1). Denote
$$\Sigma_t = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix} = X^T S_t X,$$
where $S_t$ is the total scatter matrix. Then $S_t^+$ has the form $S_t^+ = X \Sigma_t X^T$.

Proof: By Eq. (10), we have $X^T S_t X = \Sigma_t$, i.e., $S_t = X^{-T} \Sigma_t X^{-1}$. From Line 5 of Algorithm 1,
$$X = Q \begin{pmatrix} R^{-1} W & 0 \\ 0 & I \end{pmatrix},$$
where $Q$ and $W$ are orthogonal and $R$ is diagonal with positive diagonal entries. Then
$$S_t = X^{-T} \Sigma_t X^{-1} = Q \begin{pmatrix} RW & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} W^T R & 0 \\ 0 & I \end{pmatrix} Q^T = Q \begin{pmatrix} R^2 & 0 \\ 0 & 0 \end{pmatrix} Q^T.$$
It follows from the definition of the pseudo-inverse that
$$S_t^+ = Q \begin{pmatrix} R^{-2} & 0 \\ 0 & 0 \end{pmatrix} Q^T.$$
It is easy to check that
$$X \Sigma_t X^T = Q \begin{pmatrix} R^{-1} W & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} W^T R^{-1} & 0 \\ 0 & I \end{pmatrix} Q^T = Q \begin{pmatrix} R^{-2} & 0 \\ 0 & 0 \end{pmatrix} Q^T = S_t^+.$$

Theorem 3.2: Let $X$ be the matrix as in Lemma 3.5. Then the first $q$ columns of $X$ solve the following eigenvalue problem:
$$S_t^+ S_b\, x = \lambda x, \qquad \text{for } \lambda \neq 0.$$

Proof: From Eq. (7), we have $S_b = X^{-T} D_b X^{-1}$, where
$$D_b = \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix}$$
is diagonal as defined in Eq. (7). It follows from Lemma 3.5 that
$$S_t^+ S_b = X \Sigma_t X^T X^{-T} D_b X^{-1} = X \Sigma_t D_b X^{-1} = X D_b X^{-1},$$
where $\Sigma_t D_b = D_b$ since both matrices are diagonal and $D_b$ is nonzero only in its first $t$ diagonal entries. This implies that the columns of $X$ are the eigenvectors of the matrix $S_t^+ S_b$. The result of the theorem follows, since only the first $q$ diagonal entries of the diagonal matrix $D_b$ are nonzero.
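The equivalence can be illustrated numerically: the PFLDA transformation obtained from the eigenvectors of $S_t^+ S_b$ attains (up to an invertible change of basis) the minimum of the criterion $F_1$ of Eq. (5). A self-contained sketch on random undersampled data, comparing it against random transformations of the same rank:

import numpy as np

rng = np.random.default_rng(6)
m, sizes = 150, [12, 12, 12, 12]                  # undersampled: m > n
A = np.hstack([rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)])
c = A.mean(axis=1, keepdims=True)
Sb, Sw, Hb_cols, start = np.zeros((m, m)), np.zeros((m, m)), [], 0
for ni in sizes:
    Ai = A[:, start:start + ni]
    start += ni
    ci = Ai.mean(axis=1, keepdims=True)
    Sb += ni * (ci - c) @ (ci - c).T
    Sw += (Ai - ci) @ (Ai - ci).T
    Hb_cols.append(np.sqrt(ni) * (ci - c))
Hb = np.hstack(Hb_cols)
St = Sb + Sw
q = np.linalg.matrix_rank(Hb)

def F1(G):
    return np.trace(np.linalg.pinv(G.T @ Sb @ G) @ (G.T @ Sw @ G))

# PFLDA transformation: eigenvectors of S_t^+ S_b for the q nonzero eigenvalues.
evals, evecs = np.linalg.eig(np.linalg.pinv(St) @ Sb)
G_pflda = evecs[:, np.argsort(-evals.real)[:q]].real

# The PFLDA solution should not be beaten by random transformations of the same rank.
best_random = min(F1(rng.standard_normal((m, q))) for _ in range(20))
print(F1(G_pflda), best_random)   # the first value is (numerically) the minimum of criterion (5)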
IV. APPROXIMATION ALGORITHM

One of the limitations of the above exact algorithm is the expensive computation of the GSVD of the matrix pair $(H_b^T, H_w^T)$, with $H_w \in \mathbb{R}^{m \times n}$. In this section, an efficient approximation algorithm is presented to overcome this limitation.
The K-Means algorithm [6], [16] is widely used to capture the clustered structure of the data, by
decomposing the whole data set as a disjoint union of small sets, called clusters. The K-Means algorithm
aims to minimize the distance within each cluster, hence the data points in each cluster are close to the
cluster centroid and the centroids themselves are a good approximation of the original data set.
In Section II, we used the matrix $H_w \in \mathbb{R}^{m \times n}$ to capture the closeness of the vectors within each cluster. Although the number of data points $n$ is relatively smaller than the dimension $m$, the absolute value of $n$ can still be large. To simplify the model, we attempt to use centroids only to approximate the structure of each cluster, by applying the K-Means algorithm to each cluster.

More specifically, if $A_1, \ldots, A_k$ are the clusters in the data set, with the size of each cluster $n_i$ and $\sum_{i=1}^{k} n_i = n$, then the K-Means algorithm is applied to each cluster $A_i$ to produce $d_i$ sub-clusters $A_i^{(1)}, \ldots, A_i^{(d_i)}$, with $\sum_{j=1}^{d_i} |A_i^{(j)}| = n_i$. Let $c_i^{(j)}$ be the centroid of the sub-cluster $A_i^{(j)}$ and $n_i^{(j)} = |A_i^{(j)}|$ its size. The within-cluster distance in the $i$-th cluster can then be approximated as
$$\sum_{j \in N_i} \|a_j - c^{(i)}\|^2 \ \approx\ \sum_{j=1}^{d_i} n_i^{(j)}\, \|c_i^{(j)} - c^{(i)}\|^2,$$
by approximating every point in the sub-cluster $A_i^{(j)}$ by its centroid $c_i^{(j)}$.
Hence the matrix $H_w$ can be approximated as
$$\hat{H}_w = \Big[\sqrt{n_1^{(1)}}\,(c_1^{(1)} - c^{(1)}),\ \ldots,\ \sqrt{n_1^{(d_1)}}\,(c_1^{(d_1)} - c^{(1)}),\ \ldots,\ \sqrt{n_k^{(d_k)}}\,(c_k^{(d_k)} - c^{(k)})\Big] \in \mathbb{R}^{m \times s}, \qquad (17)$$
where $s = \sum_{i=1}^{k} d_i$ is the total number of sub-clusters, which is typically much smaller than $n$, the total number of data points, thus reducing the complexity of the GSVD computation dramatically. The main steps of our approximation algorithm are summarized in Algorithm 2. For simplicity, in our implementation we chose all the $d_i$'s to have the same value $d$, where $d$ stands for the number of sub-clusters per cluster. We discuss the choice of $d$ below.

Algorithm 2: Approximation algorithm
1. Form the matrix $H_b$ as defined in Eq. (2).
2. Run the K-Means algorithm on each cluster $A_i$, $i = 1, \ldots, k$, with $d_i$ sub-clusters.
3. Form $\hat{H}_w$ as defined in Eq. (17) to approximate $H_w$.
4. Compute the GSVD of the matrix pair $(H_b^T, \hat{H}_w^T)$ to obtain the matrix $X$ as in Eq. (6); $q \leftarrow \mathrm{rank}(H_b)$.
5. $G \leftarrow X_q$, where $X_q$ contains the first $q$ columns of the matrix $X$.
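A sketch of the sub-clustering step in Eq. (17); a plain Lloyd's iteration stands in for the K-Means routine of [6], [16], and the block-structured data layout is an assumption of the sketch. The resulting matrix would be passed to the GSVD step exactly as $H_w$ is in the exact algorithm.

import numpy as np

def kmeans_centroids(X, d, iters=20, seed=0):
    """Plain Lloyd's K-Means on the columns of X; returns (centroids, counts)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    centers = X[:, rng.choice(n, size=d, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)  # d x n
        labels = dist.argmin(axis=0)
        for j in range(d):
            if np.any(labels == j):
                centers[:, j] = X[:, labels == j].mean(axis=1)
    counts = np.array([(labels == j).sum() for j in range(d)])
    return centers, counts

def approx_Hw(A, sizes, d):
    """Eq. (17): replace each cluster by its d sub-cluster centroids."""
    c_cols, start = [], 0
    for i, ni in enumerate(sizes):
        Ai = A[:, start:start + ni]
        start += ni
        ci = Ai.mean(axis=1, keepdims=True)
        centers, counts = kmeans_centroids(Ai, d, seed=i)
        c_cols.append(np.sqrt(counts)[None, :] * (centers - ci))
    return np.hstack(c_cols)                      # m x (k*d), far fewer columns than H_w

rng = np.random.default_rng(7)
m, sizes, d = 300, [60, 60, 60], 8
A = np.hstack([rng.standard_normal((m, ni)) + 2*i for i, ni in enumerate(sizes)])
Hw_hat = approx_Hw(A, sizes, d)
print(Hw_hat.shape)                               # (300, 24) instead of (300, 180)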
A. The choice of the number of sub-clusters ($d$)

The value of $d$, which is the number of sub-clusters within each cluster, will determine the complexity of the approximation algorithm. If $d$ is too large, the approximation algorithm will produce results close to the one using all the points, while the computation of the GSVD will still be expensive. If $d$ is small, then the sub-clusters will not approximate the original data set well. For our problem, we apply the K-Means algorithm to the data points belonging to the same cluster, which are already close to each other. Indeed, in our experiments, we found that small values of $d$ worked well; in particular, choosing $d$ around 6 to 10 gave good results.
TABLE II
COMPARISON OF DECOMPOSITION AND QUERY TIME: $n$ IS THE SIZE OF THE TRAINING SET, $m$ IS THE DIMENSION OF THE TRAINING DATA, $k$ IS THE NUMBER OF CLUSTERS IN THE TRAINING SET, $K$ IS THE NUMBER OF NEAREST NEIGHBORS USED IN K-NN, AND $p$ IS THE REDUCED DIMENSION BY LSI.
(Methods compared: Full, LSI, RLDA, LSI+LDA, Exact, Approximation; columns: decomposition time, query time.)

B. The time complexity
Under the approximation algorithm, the $i$-th cluster is partitioned into $d$ sub-clusters. Hence the number of rows in the approximate matrix $\hat{H}_w^T$ is reduced from $n$ to $s = kd$, and the complexity of the GSVD computation using the approximate matrix pair $(H_b^T, \hat{H}_w^T)$ is reduced accordingly. Note that the K-Means algorithm used for the sub-clustering is very fast [16]. Within every iteration, the computational complexity is linear in the number of vectors and the number of clusters, while the algorithm converges within a few iterations. Hence the cost of the K-Means step for cluster $i$ is linear in $n_i$, the size of the $i$-th cluster, and the total time for all the clusterings is linear in $n$. Since $d$ is usually chosen to be a small number and $s = kd$ is much smaller than $n$, the overall cost of the approximation algorithm is dominated by the GSVD of the much smaller matrix pair.

Table II lists the complexity of the different methods discussed above. The K-NN algorithm is used for querying the data.
V. EXPERIMENTAL RESULTS

A. Datasets

TABLE III
SUMMARY OF DATASETS USED FOR EVALUATION

Data Set   Source          number of documents   number of terms   number of clusters
1          TREC            210                   7454              7
2          Reuters-21578   320                   2887              4
3          Reuters-21578   490                   3759              5
In the following experiments, we use three different datasets1 , summarized in Table III. For all datasets,
we use a stop-list to remove common words, and the words are stemmed using Porter's suffix-stripping algorithm [22]. Moreover, any term that occurs in fewer than two documents was eliminated, as in [32]. Dataset 1 is derived from the TREC-5, TREC-6, and TREC-7 collections [30]. It consists of 210 documents in a space of dimension 7454, with 7 clusters. Each cluster has 30 documents. Datasets 2 and 3 are
from Reuters-21578 text categorization test collection Distribution 1.0 [19]. Dataset 2 contains 4 clusters,
with each containing 80 elements; the dimension of the document set is 2887. Dataset 3 has 5 clusters,
each with 98 elements; the dimension of the document set is 3759.
For all the examples, we use the tf-idf weighting scheme [24], [32] for encoding the document collection
with a term-document matrix.
B. Experimental methodology
To evaluate the proposed methods in this paper, we compared them with the other three dimension
reduction methods: LSI, RLDA, and two stage LSI+LDA on the three datasets in Table III. The K-Nearest
Neighbor algorithm [7] (for
* *
) was applied to evaluate the quality of different dimension
reduction algorithms as in [13]. For each method, we applied 10-fold cross-validation to compute the
1
These datasets are available at www.cs.umn.edu/ jieping/Research.html .
25
misclassification rate and the standard deviation. For LSI and two-stage LSI+LDA, the results depend on
by LSI. Our experiments showed that
the choice of reduced dimension
*
for LSI and
for LSI+LDA produced good overall results. Similarly, RLDA depends on the choice of the parameter
. We ran RLDA with different choices of
7
and found that
produced good overall results on
the three datasets. We also applied K-NN on the original high dimensional space without the dimension
reduction step, which we named “FULL” in the following experiments.
The clustering using the K-Means in the approximation algorithm is sensitive to the choices of the
initial centroids. To mitigate this, we ran the algorithm ten times, and the initial centroids for each run
were generated randomly. The final result is the average over the ten different runs.
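A sketch of this evaluation protocol with a hand-rolled K-NN classifier and fold split; reduce_fn is a placeholder slot where the exact algorithm, the approximation algorithm, LSI, RLDA, or LSI+LDA would be plugged in (the identity map below corresponds to the "FULL" baseline):

import numpy as np

def knn_predict(train_X, train_y, test_X, K):
    """K-NN with Euclidean distance; columns of the matrices are data points."""
    d2 = ((test_X[:, None, :] - train_X[:, :, None]) ** 2).sum(axis=0)   # n_train x n_test
    preds = []
    for j in range(test_X.shape[1]):
        nearest = np.argsort(d2[:, j])[:K]
        preds.append(np.bincount(train_y[nearest]).argmax())
    return np.array(preds)

def cv10_error(A, y, reduce_fn, K=7, seed=0):
    """10-fold cross-validation of a dimension-reduction function followed by K-NN."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    folds = np.array_split(rng.permutation(n), 10)
    errs = []
    for f in folds:
        test = np.zeros(n, dtype=bool)
        test[f] = True
        G = reduce_fn(A[:, ~test], y[~test])          # fit the reduction on training data only
        Ytr, Yte = G.T @ A[:, ~test], G.T @ A[:, test]
        pred = knn_predict(Ytr, y[~test], Yte, K)
        errs.append(np.mean(pred != y[test]))
    return np.mean(errs)                              # average misclassification rate

rng = np.random.default_rng(8)
m, sizes = 40, [30, 30, 30]
A = np.hstack([rng.standard_normal((m, ni)) + 1.5*i for i, ni in enumerate(sizes)])
y = np.repeat(np.arange(len(sizes)), sizes)
print(cv10_error(A, y, lambda X, yy: np.eye(m), K=7))  # identity reduction = "FULL" baseline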
C. Results
Fig. 1. Effect of the value of $d$ on the approximation algorithm using Datasets 1–3. The x-axis denotes the value of $d$ (from 2 to 30) and the y-axis the misclassification rate; one curve per dataset.
1) Effect of the value of $d$ on the approximation algorithm: In Section IV, we use $d$ to denote the number of sub-clusters for the approximation algorithm. As mentioned in Section IV, our approximation algorithm worked well for small values of $d$. We tested our approximation algorithm on Datasets 1–3 for different values of $d$, ranging from 2 to 30, and computed the misclassification rates, with K-NN used for classification. As seen from Figure 1, the misclassification rates did not fluctuate very much within this range. For efficiency, we would like $d$ to be small; therefore, in our experiments, we chose a small value of $d$, as the approximation algorithm performed very well on all three datasets in the vicinity of such values. Moreover, as shown in Figure 1, the performance of the approximation algorithm is generally insensitive to the choice of $d$.
2) Comparison of misclassification rates: In the following experiment, we evaluated our exact and
approximation algorithms and compared them with competing algorithms (LSI, RLDA, LSI+LDA) based
on the misclassification rates, using the three datasets in Table III.
Fig. 2. Performance of the different dimension reduction methods on Dataset 1: misclassification rate versus the value of K in K-Nearest Neighbors (K = 1, 7, 15), for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
Fig. 3. Performance of the different dimension reduction methods on Dataset 2: misclassification rate versus the value of K in K-Nearest Neighbors (K = 1, 7, 15), for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
The results for Datasets 1–3 are summarized in Figures 2–4, respectively, where the x-axis shows the three different choices of K (K = 1, 7, 15) used in K-NN for classification, and the y-axis shows the misclassification rate. "EXACT" and "APPR" refer to the exact and approximation algorithms, respectively. We also report the results on the original space without dimension reduction, denoted as "FULL" in Figures 2–4. The corresponding standard deviations of the different methods for Datasets 1–3 are summarized in Table IV. For each dataset, the results for the different numbers, K = 1, 7, 15, of neighbors used in K-NN are reported.
Fig. 4. Performance of the different dimension reduction methods on Dataset 3: misclassification rate versus the value of K in K-Nearest Neighbors (K = 1, 7, 15), for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
TABLE IV
STANDARD DEVIATIONS OF DIFFERENT METHODS ON DATASETS 1–3, EXPRESSED AS A PERCENTAGE

              Dataset 1              Dataset 2              Dataset 3
Method     K=1    K=7    K=15     K=1    K=7    K=15     K=1    K=7    K=15
FULL       5.41   5.61   5.96     7.99   7.86   6.04     5.63   5.71   5.40
LSI        5.41   6.27   5.70     7.97   6.43   7.10     3.78   5.20   5.13
RLDA       4.69   2.30   2.30     6.50   6.20   5.40     3.29   4.18   4.35
LSI+LDA    5.52   3.33   2.30     6.56   6.60   5.55     4.09   4.28   4.43
EXACT      2.01   2.01   2.01     7.18   5.80   6.07     3.57   3.57   3.57
APPR       3.21   3.67   2.67     6.40   6.80   5.60     4.59   4.12   3.24
For all three datasets, the number of terms in the term-document matrix ($m$) is larger than the number of documents ($n$). It follows that all scatter matrices are singular and classical discriminant analysis breaks down. However, our proposed generalized discriminant analysis circumvents this problem.

The main observations from these experiments are: (1) Dimension reduction algorithms, like RLDA, LSI+LDA, and our exact and approximation algorithms, do improve the performance for classification, even for very high-dimensional data sets, like those derived from text documents; (2) Algorithms that use the label information, like the proposed exact and approximation algorithms, RLDA, and LSI+LDA, have
better performance than the one that does not use the label information, i.e., LSI; (3) The approximation
algorithm deals with a problem of much smaller size than the one dealt with by the exact algorithm.
However, the results from all the experiments show that they have similar misclassification rates; (4) The
performance of the exact algorithm is close to RLDA and LSI+LDA in all experiments. However, both
RLDA and LSI+LDA suffer from the difficulty of choosing certain parameters, while the exact algorithm
can be applied directly without setting any parameters.
3) Effect of the reduced dimension on the exact and approximation algorithms: In Section III, we
showed that our exact algorithm using generalized discriminant analysis has optimal reduced dimension $q$, which equals the rank of the matrix $H_b$ and is typically very small. We also mentioned in Section III that
our exact algorithm may not work very well if the reduced dimension is chosen to be much smaller than
the optimal one. In this experiment, we study the effect of the reduced dimension on the performance of
the exact and approximation algorithms.
Fig. 5. Effect of the reduced dimension using Dataset 1: misclassification rate of EXACT and APPR versus the reduced dimension (1 to 46; optimal reduced dimension $q$).
Figures 5–7 illustrate the effect of different choices of the reduced dimension on our exact and approximation algorithms, with K-NN used for classification. The x-axis is the value of the reduced dimension, and the y-axis is the misclassification rate. For each reduced dimension, the results for our exact and approximation algorithms are ordered from left to right. Our exact and approximation algorithms achieved the best performance in all cases when the reduced dimension is around $q$. Note that the optimal reduced dimension $q$ may differ between datasets. These results further confirm our theoretical results on the optimal choice of the reduced dimension for our exact algorithm, as discussed in Section III.
Fig. 6. Effect of the reduced dimension using Dataset 2: misclassification rate of EXACT and APPR versus the reduced dimension (1 to 46; optimal reduced dimension $q$).
Fig. 7. Effect of the reduced dimension using Dataset 3: misclassification rate of EXACT and APPR versus the reduced dimension (1 to 46; optimal reduced dimension $q$).
VI. CONCLUSIONS
An optimization criterion for generalized linear discriminant analysis is presented. The criterion is
applicable to undersampled problems, thus overcoming the limitation of classical linear discriminant
analysis. A new formulation for the proposed generalized linear discriminant analysis based on the trace
optimization is discussed. Generalized singular value decomposition is applied to solve the optimization
problem. The solution from the LDA/GSVD algorithm is a special case of the solution for this new
optimization problem. The equivalence between the exact algorithm and the Pseudo Fisher LDA (PFLDA)
is shown theoretically, thus providing a justification for the pseudo-inverse based LDA.
The exact algorithm involves the matrix $H_w \in \mathbb{R}^{m \times n}$ with high column dimension $n$. To reduce the decomposition time for the exact algorithm, an approximation algorithm is presented, which applies the K-Means algorithm to each cluster and replaces the cluster by the centroids of the resulting sub-clusters. The column dimension of the matrix $H_w$ is reduced dramatically, thereby reducing the complexity of the GSVD computation. Experimental results on several real data sets show that the approximation algorithm produces results close to those produced by the exact algorithm.
Acknowledgement
We thank the Associate Editor and the four reviewers for helpful comments that greatly improved the
paper.
Research of J. Ye and R. Janardan is sponsored, in part, by the Army High Performance Computing
Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative
agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or
the policy of the government, and no official endorsement should be inferred. Research of C. Park and
H. Park has been supported in part by the National Science Foundation Grants CCR-0204109 and ACI-0305543. Any opinions, findings and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] P. Baldi and G.W. Hatfield. DNA Microarrays and Gene Expression: From experiments to data analysis and modeling, Cambridge,
2002.
[2] M.W. Berry, S.T. Dumais, and G.W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:573-595,
1995.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[4] D.Q. Dai and P.C. Yuen. Regularized discriminant analysis and its application to face recognition. Pattern Recognition, vol. 36, pp.
845–847, 2003.
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Society for
Information Science, 41, pp. 391–407, 1990.
[6] I.S. Dhillon and D.S. Modha. Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning. 42, pp.
143–175, 2001.
[7] R.O. Duda and P.E. Hart, and D. Stork. Pattern Classification. Wiley, 2000.
[8] R.P.W. Duin. Small sample size generalization. Proc. 9th Scandinavian Conf. on Image Analysis, Vol. 2, 957–964, 1995.
[9] J. H. Friedman. Regularized discriminant analysis. J. American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
[10] K. Fukunaga. Introduction to Statistical Pattern Recognition, second edition. Academic Press, Inc., 1990.
[11] G.H. Golub and C.F. Van Loan. Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.
[12] J.C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, pp. 325–338, 1966.
[13] P. Howland, M. Jeon, and H. Park. Cluster structure preserving dimension reduction based on the generalized singular value
decomposition. SIAM J. Matrix Analysis and Applications, 25(1), pp. 165–179, 2003.
[14] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE TPAMI, 2003,
to appear.
[15] P. Howland and H. Park. Equivalence of Several Two-stage Methods for Linear Discriminant Analysis, Proc. Fourth SIAM International
Conference on Data Mining, 2004, to appear.
[16] A.K. Jain, and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[17] G. Kowalski. Information Retrieval System: Theory and Implementation, Kluwer Academic Publishers, 1997.
[18] W.J. Krzanowski, P. Jonathan, W.V McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods
and applications to spectroscopic data. Applied Statistics, 44, pp. 101–115, 1995.
[19] D.D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.
[20] A. Mehay, C. Cai, and P.B. Harrington. Regularized Linear Discriminant Analysis of Wavelet Compressed Ion Mobility Spectra. Applied
Spectroscopy, Vol. 15, No. 2, pp. 219-227, 2002.
[21] C.C. Paige and M.A. Saunders. Towards a generalized singular value decomposition. SIAM J. Numerical Analysis, 18, pp. 398–405,
1981.
[22] M.F. Porter. An algorithm for suffix stripping. Program, 14(3), pp. 130–137, 1980.
[23] S. Raudys and R.P.W. Duin. On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix,
Pattern Recognition Letters, vol. 19, no. 5-6, pp. 385-392, 1998.
[24] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[25] G. Salton, and M.J. McGill. Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[26] M. Skurichina and R.P.W. Duin. Stabilizing classifiers for very small sample size. Proc. International Conference on Pattern Recognition,
Vienna, Austria, pp. 891–896, 1996.
[27] M. Skurichina and R.P.W. Duin, Regularization of linear classifiers by adding redundant features, Pattern Analysis and Applications,
vol. 2, no. 1, pp. 44-52, 1999.
[28] D. L. Swets, and J. Weng. Using Discriminant Eigenfeatures for Image Retrieval, IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 18, no. 8, pp. 831–836, 1996.
[29] Q. Tian, M. Barbero, Z.H. Gu, and S.H. Lee, Image classification by the Foley-Sammon transform. Optical Engineering, Vol. 25, No.
7, pp. 834-840, 1986.
[30] TREC. Text Retrieval Conference. http://trec.nist.gov, 1999.
[31] C. F. Van Loan. Generalizing the singular value decomposition. SIAM J. Numerical Analysis, 13(1), pp. 76–83, 1976.
[32] Y. Zhao and G. Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report TR01-040. Department of
Computer Science and Engineering, University of Minnesota, Twin Cities, U.S.A., 2001.
[33] X. S. Zhou, and T.S. Huang. Small Sample Learning during Multimedia Retrieval using BiasMap. Proc. IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 11–17, 2001.