Choice of Dimensionality Reduction Methods for Feature and

Choice of Dimensionality Reduction Methods for
Feature and Classifier Fusion with Nearest
Neighbor Classifiers
Sampath Deegalla∗† , Henrik Boström∗ , Keerthi Walgama‡
∗ Department
of Computer and Systems Sciences
Stockholm University, Forum 100, SE-164 40 Kista, Sweden
Email: [email protected], [email protected]
† Department of Computer Engineering
Faculty of Engineering, University of Peradeniya, Peradeniya 20400, Sri Lanka
Email: [email protected]
‡ Department of Engineering Mathematics
Faculty of Engineering, University of Peradeniya, Peradeniya 20400, Sri Lanka
Email: [email protected]
Abstract—Often high dimensional data cause problems for
currently used learning algorithms in terms of efficiency and
effectiveness. One solution for this problem is to apply dimensionality reduction by which the original feature set could be reduced
to a small number of features while gaining improved accuracy
and/or efficiency of the learning algorithm. We have investigated
multiple dimensionality reduction methods for nearest neighbor
classification in high dimensions. In previous studies, we have
demonstrated that fusion of different outputs of dimensionality
reduction methods, either by combining classifiers built on
reduced features, or by combining reduced features and then
applying the classifier, may yield higher accuracies than when using individual reduction methods. However, none of the previous
studies have investigated what dimensionality reduction methods
to choose for fusion, when outputs of multiple dimensionality
reduction methods are available. Therefore, we have empirically
investigated different combinations of the output of four dimensionality reduction methods on 18 medicinal chemistry datasets.
The empirical investigation demonstrates that fusion of nearest
neighbor classifiers obtained from multiple reduction methods
in all cases outperforms the use of individual dimensionality
reduction methods, while fusion of different feature subsets is
quite sensitive to the choice of dimensionality reduction methods.
I. I NTRODUCTION
The nearest neighbor algorithm [1] is a simple learning
algorithm which could be applied to high dimensional datasets
without any modification to the original algorithm. However,
its performance is often poor in high dimensions, as is the
case also for many other learning algorithms, due to its
sensitivity to the input data [2]. This is known as the curse of
dimensionality [3], a well known phenomenon that misleads
learning algorithms when applied to high-dimensional data.
Dimensionality reduction is one potential approach to address
this problem [3], [4]. However, selecting a suitable dimensionality reduction method for a dataset may not be straightforward
since the resulting performance is dependent not only on the
dataset but also the learning algorithm.
Researchers have been working on improving accuracy of
nearest neighbor classifier along a number of directions [5],
[6]. One such direction is to combine several classifiers
together which is known as classifier fusion or ensemble learning [1]. However, common procedures for classifier fusion
such as bagging and boosting [1] are known to be ineffective
for nearest neighbor classifiers [5]. Therefore, combining
the nearest neighbor outputs of different feature subsets has
instead been considered in a number of studies. For example,
Bay [5] considered classifier fusion based on selecting random
subset of features for several nearest neighbor classifiers and
then combining their output using simple majority voting.
However, many studies have considered combining the nearest
neighbor outputs only. In contrast, we consider combining
both classifier outputs, i.e., classifier fusion, and combining
the reduced feature subsets, i.e., feature fusion, in order to
improve the classification accuracy of nearest neighbor.
In a previous study [7], we presented two strategies for fusion of the outputs of dimensionality reduction, i.e., combining
an equal number of features from each reduced subset and
combining the output of classifiers trained from equal numbers
of features. The results of that study led us to investigate
more on feature fusion methods due to that larger gains in
accuracy were observed for this type of method. In another
study [8], we extended the methods by considering different
number of features for fusion since different dimensionality
reduction methods may yield the best performance using
different number of features. The results showed that feature
fusion using equal numbers of dimensions and classifier fusion
using different number of dimensions achieve the highest
accuracy. One limitation of both studies is that only three
dimensionality reduction methods were employed and the
results of all methods were considered for fusion. None of the
studies investigated what dimensionality reduction methods
should be considered for fusion, when outputs of multiple
875
dimensionality reduction methods are available.
In this paper, we address this question, in order to further
improve the classification accuracy of the nearest neighbor
classifier. We consider four dimensionality reduction methods,
namely Principal Component Analysis (PCA) [9], [10], Partial
Least Squares (PLS) [10], Information Gain (IG) [1] and
ReliefF [11], [12].
In the next section, we present these dimensionality reduction methods together with the two fusion strategies; feature
and classifier fusion. In section III, the details of the empirical study including description of the datasets, experimental
setup and results are provided. Finally, conclusions and future
directions of the work are presented in section IV.
II. M ETHODS
In this section, we present four dimensionality reduction
methods and two fusion strategies. The following notions are
used. A dataset is described as an n × o data matrix X,
which contains n instances and o original dimensions. The
reduced dataset is denoted by an n × d matrix Z, which
contains n instances and d (< o) dimensions. X T represents
the transpose matrix of X.
A. Principal Component Analysis
Principal component analysis (PCA) [9], [10] transforms the
original feature set into a new set of orthogonal features which
is known as principle components. A principal component can
be defined as a linear combination of original features with
optimal weights. The first principal component contains most
variation in the data and every subsequent principal component
contains the most of the remaining variation. Therefore, PCA
could be used to reduce a high number of dimensions into
much lower dimension.
Suppose the original dataset X is an n × o matrix, in which
n denotes the number of instances and o denotes the number
of features. The first step is to subtract the mean from X along
all the dimensions. Let C be the covariance matrix of X
1
XT X
n−1
where C is an o × o symmetric matrix which captures the
correlations between all the features.
The reduced dataset Z can be written as
C=
vectors and D is an m × m diagonal matrix which contains
the singular values. Here m refers to the rank of X which is
min(n,o).
Considering the first d highest singular values where d <
min(n, o), the transformation can be rewritten as
T
Xn×o = Un×d Dd×d Vd×o
The principal component scores can be calculated as follows:
Z =XV
Here Z is the reduced dataset after applying PCA on the
original dataset X. We have used the Matlab implementation
of PCA which is based on SVD in empirical studies.
B. Partial Least Squares
The partial least squares method (PLS) [10] was developed
and has been extensively applied in chemometrics, primarily as
a regression method. It seeks a linear combination of original
features whose correlation to the output variable (class values)
is maximized. The resulting components are uncorrelated, i.e.,
there is no correlation between the new features. PLS is better
known for its applicability when the number of features are
much higher than the number of instances. Later, it has been
used as a dimensionality reduction method in the context of
high dimensional datasets [13]–[16]. Since PLS consider the
class labels for the transformation, it is considered as one
of the most successful supervised dimensionality reduction
methods.
The basic PLS regression model can be represented as
Y =XB + E
where Y is an n × p response matrix (class matrix), X is
an n × o data matrix, B is an o × p regression coefficient
matrix, E is the n × p noise matrix and p is the number
of dependent variables (class variables). Here, X and Y are
centered by subtracting their respective means. The regression
model could be developed as follows.
The factor score matrix Z can be found using
Z =XW
Z =XE
where E is a d × o projection matrix which consists of d
eigen vectors corresponding to the d highest eigen values of
the covariance matrix C.
Recent implementations of PCA are based on Singular
Value Decomposition (SVD) which is known as a fast and
numerically stable method [10].
In SVD the original data matrix X can be decomposed as:
with the appropriate weight matrix W . The model is equivalent to
Y = ZQ + E
X = U DVT
where Q is the regression coefficients (loadings) for Z. Once
the loadings Q are computed, the above regression model is
equivalent to Y = X B + E, where B = W Q. Here we are
interested in dimensionality reduction, i.e., generating factor
scores, Z, where,
where U is an n × m orthonormal matrix with left singular
vectors, V is an o × m orthonormal matrix with right singular
Z =XW
876
The SIMPLS algorithm [17] has been used to perform PLS
in our studies since it could be applied for both binary and
multiclass classification problems. Although PLS was originally developed for continuous output (continuous response),
we need to use it for categorical output (categorical response).
But, categorical outputs can be treated as continuous outputs in
binary class problems since the PLS method does not employ
any distributional assumptions [16]. Nevertheless, the class
variables of multiclass problems are transformed using the
following dummy-coding as described in [16] to use with the
SIMPLS algorithm. Here, class variable Y is transformed into
a C dimensional random vector where C is number of classes.
yij
yij
=
=
1 if Yi = j
0 otherwise
(1)
where Yi is the class label for ith instance and yij is the
ith row of the random vector for all i=1,. . . ,n and j=1,. . . ,C.
Therefore, for binary problems Y can be used as it is while the
transformed random vector is used for multiclass problems.
Fayyad & Irani’s Minimum Description Length (MDL) [20]
method is used as the default method of discretization.
D. ReliefF
The ReliefF algorithm [11], [12] is an extension of Relief
algorithm. A weight of a feature is calculated by maximizing
the ratio between the distance of closest instances from the
same class and the closest instances from different classes.
The method may particularly suit for nearest neighbor since
it is inspired by instance-based learning [11]. This is also a
filter based feature selection method where best d features are
considered for the classification based on relevance level and
a threshold [11].
The algorithm chooses m instances at random. For each
instance, the algorithm finds the closest instance from the same
class (near-hit) and the closest instance from the different class
(near-miss). In each iteration, triplet, i.e., the choosen instance,
its near-hit and near-miss, are considered for calculating the
weight vector W .
C. Information Gain
Information Gain (IG) is commonly used in decision tree
induction, e.g., ID3, and is used to decide which feature to
choose from a set of features to use in a node of the tree [1].
This is a filter based feature selection method, where the
features are ranked according to the highest information gain,
from which the best d features are selected. This has been
investigated as a dimensionality reduction method for high
dimensional datasets, such as texts [18] and microarrays [19].
Information gain can be generally stated as follows.
where H(class) = −p log2 p denote the entropy of the
class, H(class|f eature) denote the conditional entropy of
the feature with respect to the class.
Maximizing IG is equivalent to minimizing the conditional
entropy of the feature, i.e., H(class|f eature), which can be
computed as follows:
i=1
N
j=1
−
1 (xi − near-hiti )2 + (xi − near-missi )2
m
for all i = 1, . . . , o
Here Wi is the weight of ith feature in the previous iteration,
x is an instance from m intances, xi is the value of the ith
feature, near-hiti is the value of the ith feature of the nearest
instance from the same class to x, and near-missi is the value
of the ith feature of the nearest instance from the different
class to x.
E. Feature Fusion
IG(class|f eature) = H(class) − H(class|f eature)
C
V
X
ni X
Wi −
nij
nij
log2
ni
ni
Here C is the number of classes, V is the number of values
of the feature, N is the total number of instances, ni is the
number of instances having the ith value of the feature and
nij is the number of instances in the latter group belonging
to the jth class.
All attributes in datasets used for experiments are numerical
attributes where as information gain assumes that all attributes
are nominal. In order to calculate the information gain, one
should discretize the numerical data prior to applying the IG
algorithm. We have used WEKA for IG in our studies where
In feature fusion, each dimensionality reduction method is
applied to the original feature set and the resulting feature
sets are considered for fusion. It was earlier observed that
combining an equal number of features from each reduced
feature set produced the highest classification accuracy for the
nearest neighbor classifier [8]. Therefore, an equal number of
features from each reduced feature set are selected and then
merged into a single feature set also in this study. The nearest
neighbor classifier is then applied to the merged feature set
(See Fig. 1). We investigate the effect of choosing different
subsets by choosing three out of four from PCA, PLS, IG and
ReliefF. Finally, we also investigate the result from combining
all four methods.
High
Dimension
Dimension
Reduction
Low
Dimension
kNN
Classifier
Feature
Fusion
kNN
Classifer
Classifier
Fusion
Fig. 1. The diagram illustrates the two fusion methods which combines the
outputs of dimensionality reduction methods
877
F. Classifier Fusion
Instead of generating classifiers from fused feature sets,
classifiers are separately built from each feature set, and their
predictions are merged [21]. Following [8], we adopt the
approach of setting the number of dimensions that is selected
for each dimensionality reduction method to be the one that
results in the highest accuracy, as estimated by ten-fold cross
validation on the training data. Since we investigate what
methods to choose from multiple methods, three out of four
classifier outputs are selected and then fused using unweighted
majority voting. Finally, all the classifier outputs are also
considered for fusion. In case of ties, these are resolved by
randomly selecting a class label.
III. E MPIRICAL S TUDY
In this study, we consider four dimensionality reduction
methods, i.e., PCA, PLS, IG and ReliefF, and two strategies for
fusing their outputs; classifier and feature fusion, as described
above.
A. Datasets
Table I shows 18 publicly available datasets from the medicinal chemistry domain that are selected for this study. Each
set of compounds is represented by two different chemical
descriptor sets: Scitegic Extended Connectivity Fingerprints
(ECFI) and SELMA [22].
TABLE I
D ESCRIPTION OF DATASETS .
Dataset
AI
AMPH1
ATA
COMT
EDC
HIVPR
HIVRT
HPTP
ace
ache
bzr
caco
cox2
cpd-mouse
cpd-rat
gpb
therm
thr
Attributes
ECFI
SELMA
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
1024
94
Instances
Classes
69
130
94
92
119
113
101
132
114
111
163
100
322
980
1198
66
76
88
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
all the methods and compared these to using each method
individually.
First, the original dimensionality is transformed into a lower
number of dimensions using the individual reduction methods.
To generate PCA and PLS components for the test set, the
weight matrix generated for the training set is used. For
IG and ReliefF, features are ranked in decreasing order of
importance using the training set and the same features are
used in the test set. For feature fusion, an equal number
of features from three methods are combined at a time and
then finally all the methods. The number of features to be
combined is found considering the best performance under
cross validation on the training set. For classifier fusion, the
number of features required to yield the highest performance
for each dimensionality reduction method, i.e., the optimal
number of dimensions, is found using cross validation on
the training set. Then the nearest neighbor classifier outputs
are combined together considering three methods at a time.
Finally, classifier outputs of all the methods are considered
for fusion.
We have used Matlab to transform raw attributes to both
PLS and PCA components. The PCA transformation is performed using the Matlab’s Statistics Toolbox whereas the PLS
transformation is performed using the BDK-SOMPLS toolbox
[23], [24], which uses the SIMPLS algorithm. The WEKA data
mining toolkit [1] is used for the IG and ReliefF methods,
as well as for the actual nearest neighbor classification. For
overall validation, ten-fold cross validation is used and classification accuracy is used as the evaluation metric. The mean
accuracy of the cross validation for each dataset is reported in
Tables II and III.
C. Results
B. Experimental setup
One of the central issues to be addressed when combining
the output of several dimensionality reduction methods is
to find out which methods one should combine to reach
the highest accuracy. To investigate this, we have combined
three dimensionality reduction methods at a time and finally
The focus of the study is to investigate the effect of
the choice of different dimensionality reduction methods for
fusion in order to achieve the highest classification accuracy.
Tables II and III show nearest neighbor classification accuracies on the 18 high dimensional datasets. The first five
columns show nearest neighbor classification accuracy on raw
(unreduced) features as well as reduced features after employing the four individual dimensionality reduction methods. The
next five columns represent accuracies of classifier fusion of
different methods and the remaining columns correspond to
feature fusion in conjunction with different methods.
To investigate whether the observed differences in accuracy
are statistically significant, we applied the Friedman test [25]
which is a non-parametric statistical test. The null hypothesis
is formulated as there is no difference in accuracy among different methods. The Friedman test rejects the null hypothesis
for both descriptor sets, i.e., there is indeed a significant difference between classification accuracies of different methods
employed for both ECFI and SELMA. However, one has to
perform a post-hoc test to identify pairs of classifiers that are
significantly different from each other.
In this study, the Nemenyi test [25] is used as the post-hoc
test. If the average rank of two methods differs by at least
878
TABLE II
C LASSIFICATION ACCURACIES FOR ECFI
81.19
55.38
71.22
78.67
60.68
71.36
77.27
73.46
83.18
62.27
58.97
57.00
65.27
62.45
58.35
69.76
62.86
57.92
11.56
85.24
50.00
69.22
76.33
71.44
71.52
79.27
77.20
78.03
59.62
69.12
69.00
65.52
67.14
62.68
80.48
63.75
72.92
8.78
FF.PLSPCAIGReliefF
76.67
58.46
68.89
78.22
63.94
72.42
75.27
78.74
82.50
63.18
65.77
62.00
67.40
62.14
62.02
68.33
68.57
60.14
10.22
FF.PLSIGReliefF
76.67
52.31
74.33
78.22
73.86
68.79
83.18
83.30
79.17
60.45
74.26
67.00
72.09
65.41
60.10
75.71
78.04
70.42
7.17
FF
FF.PCAIGReliefF
76.67
66.92
72.22
80.44
63.94
70.83
78.18
81.87
86.14
59.62
65.62
61.00
67.08
64.18
64.02
74.52
70.36
66.25
8.72
CF
FF.PLSPCAReliefF
Relieff
FF.PLSPCAIG
IG
CF.PLSPCAIGReliefF
PCA
CF.PLSIGReliefF
PLS
CF.PCAIGReliefF
1 AI
2 AMPH1
3 ATA
4 COMT
5 EDC
6 HIVPR
7 HIVRT
8 HPTP
9 ace
10 ache
11 bzr
12 caco
13 cox2
14 cpd-mouse
15 cpd-rat
16 gpb
17 therm
18 thr
Average Rank
Raw
CF.PLSPCAReliefF
Data set
CF.PLSPCAIG
#
79.76
56.15
75.44
80.44
73.11
73.26
81.18
83.30
84.24
66.82
69.41
65.00
72.08
65.31
62.77
77.14
71.25
68.06
5.56
79.52
55.38
73.22
79.33
72.20
73.26
79.18
81.70
80.76
62.35
70.51
71.00
70.51
67.24
63.94
78.57
70.18
71.53
5.94
82.62
60.77
73.44
80.78
73.94
76.89
79.27
80.22
85.91
61.44
68.09
66.00
68.33
66.84
63.35
78.81
67.5
67.92
5.78
82.38
52.31
73.22
81.89
76.52
73.26
79.27
81.81
82.42
60.61
70.44
68.00
71.78
67.35
63.36
77.38
71.43
69.17
5.42
79.52
53.08
74.22
80.44
73.94
71.44
80.18
83.30
80.76
64.17
71.14
69.00
73.02
66.53
63.36
77.14
72.68
70.42
5.17
82.62
56.15
79.78
77.00
70.61
70.68
81.27
80.27
81.67
53.33
69.89
62.00
71.43
65.61
61.77
75.71
76.43
69.44
7.56
76.43
56.92
73.78
77.22
76.36
73.94
86.09
78.68
80.00
59.62
70.37
69.00
72.98
65.82
61.26
80.71
71.25
69.58
6.42
79.76
53.08
73.44
79.33
68.94
68.79
77.18
78.02
78.41
53.18
63.12
65.00
65.87
60.51
59.43
78.57
67.14
67.22
11.08
75.24
60.77
74.78
74.00
71.36
72.27
87.18
79.51
80.91
56.97
71.65
73.00
69.26
62.65
62.18
77.14
70.00
68.33
7.69
76.67
53.85
77.78
75.00
73.03
61.36
72.00
68.57
74.24
52.73
59.82
57.00
64.16
58.27
54.85
73.57
63.93
59.31
12.94
a critical difference, which is dependent on the number of
datasets, number of compared methods and a chosen level of
significance, the test concludes that the accuracies of the two
methods are significantly different. According to the Nemenyi
test, the critical difference for 15 methods and 18 datasets
at p=0.05 is 5.06. Fig. 2 and Fig. 3 show critical difference
diagrams [25] which are visual illustrations of the results of the
post-hoc test. In these diagrams, the best ranking methods are
to the right and the methods that are not significantly different
from each other are connected.
For the ECFI descriptors, the post-hoc test indicates
that all the classifier fusion methods are significantly better using IG alone; feature fusion using PCA, IG and
ReliefF (FF.PCAIGReliefF); and feature fusion using all
four reduction methods (FF.PLSPCAIGReliefF). For the
SELMA descriptors, classifier fusion using PCA, IG and
ReliefF (CF.PCAIGReliefF) and PLS, IG and ReliefF
(CF.PLSIGReliefF) are significantly better than using PCA
alone, and feature fusion using all four reduction methods
(FF.PLSPCAIGReliefF). In addition, CF.PLSIGReliefF is significantly better than using PLS alone. Feature fusion using
all the methods (FF.PLSPCAIGReliefF) perform poorly compared to most of the methods. For all the other methods,
the observations are not sufficient to conclude any significant
differences among the methods.
In general, classifier fusion achieves a higher accuracy for
all the combinations than using subsets of the individual
reduction methods while the accuracy of feature fusion is
sensitive to what subset of the reduction methods is selected.
For example, combining feature sets from PLS, PCA and ReliefF is the best combination whereas PCA, IG and ReliefF is
the worst combination for both descriptor sets. Combining all
feature sets yields the poorest accuracy among all the methods.
Although the performance of feature fusion is comparatively
low to classifier fusion, combining feature sets from PLS, PCA
and ReliefF yields improved accuracy compared to using all
the individual reduction methods for both ECFI and SELMA.
IV. C ONCLUSION
Fusing the outputs of four different dimensionality reduction
methods (PCA, PLS, IG and ReliefF) for the nearest neighbor
algorithm was investigated on 18 high-dimensional medicinal
chemistry datasets, using two different representations (descriptor sets). The results showed that classifier fusion in all
observed cases obtained a higher accuracy than using individual reduction methods irrespective of the methods chosen for
fusion while feature fusion turned out to be sensitive to the
chosen combination of reduction methods.
There are a number of directions for future research. This
study is concerned with binary class problems while many
datasets involve multi-class problems. Therefore, one could
879
TABLE III
C LASSIFICATION ACCURACIES FOR SELMA
73.57
52.31
62.56
66.00
77.35
77.05
84.18
83.30
86.06
62.27
70.70
66.00
72.35
64.08
59.85
78.81
54.11
65.00
8.11
76.43
51.54
63.89
73.78
78.26
73.56
81.27
81.76
82.58
61.21
69.41
65.00
73.26
63.27
60.10
74.76
67.32
66.25
8.17
FF.PLSPCAIGReliefF
80.95
46.92
56.56
64.00
71.52
76.21
77.18
81.81
82.73
60.38
68.64
61.00
72.34
61.63
58.43
66.90
66.61
64.17
11.64
FF.PLSIGReliefF
74.76
55.38
61.89
61.67
75.00
77.05
78.09
80.27
81.74
62.95
68.79
64.00
69.53
62.86
62.18
71.43
62.50
70.56
9.89
FF
FF.PCAIGReliefF
73.57
54.62
66.89
63.00
77.35
76.06
84.18
80.27
84.39
66.74
70.66
69.00
73.90
65.82
62.02
69.52
64.46
71.81
6.42
CF
FF.PLSPCAReliefF
Relieff
FF.PLSPCAIG
IG
CF.PLSPCAIGReliefF
PCA
CF.PLSIGReliefF
PLS
CF.PCAIGReliefF
1 AI
2 AMPH1
3 ATA
4 COMT
5 EDC
6 HIVPR
7 HIVRT
8 HPTP
9 ace
10 ache
11 bzr
12 caco
13 cox2
14 cpd-mouse
15 cpd-rat
16 gpb
17 therm
18 thr
Average Rank
Raw
CF.PLSPCAReliefF
Data set
CF.PLSPCAIG
#
76.43
54.62
57.67
63.89
76.67
78.03
83.09
82.58
85.23
63.11
69.96
67.00
74.80
64.29
60.68
68.33
65.54
67.36
7.31
77.86
51.54
60.89
68.22
75.83
74.47
80.18
83.35
84.39
68.41
71.21
64.00
74.18
64.90
60.51
66.90
69.64
69.72
7.06
75.00
51.54
62.78
67.11
77.42
76.29
85.18
84.89
87.05
67.50
72.46
68.00
73.58
63.98
60.10
79.29
64.46
70.83
5.47
73.57
57.69
63.78
69.22
78.26
76.21
86.09
85.66
84.39
65.68
72.54
71.00
73.56
64.80
60.77
80.71
63.21
68.47
4.78
76.43
56.15
59.78
67.11
76.67
76.21
82.09
83.35
83.48
64.85
70.62
67.00
74.48
65.00
60.93
68.33
66.96
67.36
6.69
76.43
56.15
59.67
67.56
76.59
71.59
84.00
80.33
84.39
67.58
66.84
68.00
70.79
57.76
60.86
64.52
64.11
72.08
8.61
70.95
53.08
63.11
67.11
80.00
77.20
81.27
79.51
83.64
69.32
74.78
74.00
72.35
59.80
59.60
74.29
61.43
73.06
7.08
73.81
54.62
62.67
67.44
74.09
76.14
81.27
81.87
81.74
61.44
70.55
65.00
72.66
63.37
57.76
80.24
65.71
69.58
9.03
67.86
50.77
61.56
69.33
78.33
77.05
83.27
77.97
83.41
64.70
69.45
72.00
72.96
57.35
59.93
78.81
65.89
69.86
8.14
72.38
48.46
60.78
69.44
74.92
68.86
72.27
78.79
81.74
60.23
68.09
70.00
71.72
58.27
52.30
73.10
57.14
72.78
11.61
CD
15 14 13 12 11 10
9
8
7
6
5
4
3
2
1
FF.PLSPCAIGReliefF
IG
FF.PCAIGReliefF
PCA
ReliefF
Raw
FF.PLSIGReliefF
Fig. 2.
CF.PLSPCAIGReliefF
CF.PLSIGReliefF
CF.PLSPCAIG
CF.PCAIGReliefF
CF.PLSPCAReliefF
FF.PLSPCAReliefF
PLS
FF.PLSPCAIG
Comparison of dimensionality reduction methods and their fusion of the outputs with each other using Nemenyi test for ECFI.
CD
15 14 13 12 11 10
9
8
7
PCA
FF.PLSPCAIGReliefF
PLS
FF.PCAIGReliefF
FF.PLSPCAIG
ReliefF
FF.PLSIGReliefF
Fig. 3.
6
5
4
3
2
1
CF.PLSIGReliefF
CF.PCAIGReliefF
Raw
CF.PLSPCAIGReliefF
CF.PLSPCAReliefF
FF.PLSPCAReliefF
CF.PLSPCAIG
IG
Comparison of dimensionality reduction methods and their fusion of the outputs with each other using Nemenyi test for SELMA.
880
extend the study to also consider for multiclass problems in
order to investigate whether the same conclusions could be
drawn. Further, the study considers only the basic nearest
neighbor classifier. One may choose also to study more
sophisticated nearest neighbor implementations in conjunction
with the fusion strategies. Finally, it would also be interesting
to examine whether the fusion strategies considered in the
study are useful also for other learning algorithms such as
support vector machines.
ACKNOWLEDGMENTS
This work was supported by the Swedish Foundation for
Strategic Research through the project High-Performance Data
Mining for Drug Effect Detection at Stockholm University,
Sweden.
[20] U. M. Fayyad and K. B. Irani, “On the Handling of Continuous-Valued
Attributes in Decision Tree Generation,” Machine Learning, vol. 8, pp.
87–102, 1992.
[21] H. Boström, “Feature vs. classifier fusion for predictive data mining a case study in pesticide classification,” in In Proceedings of the 10th
International Conference on Information Fusion, 2007, pp. 121–126.
[22] “Data Repository of Bren School of Information and
Computer
Science,
University
of
California,
Irvine,”
ftp://ftp.ics.uci.edu/pub/baldig/learning/, accessed: 16/02/2012.
[23] W. Melssen, R. Wehrens, and L. Buydens, “Supervised kohonen networks for classification problems,” Chemometrics and Intelligent Laboratory Systems, vol. 83, pp. 99–113, 2006.
[24] W. Melssen, B. Üstün, and L. Buydens, “Sompls: a supervised selforganising map - partial least squares algorithm,” Chemometrics and
Intelligent Laboratory Systems, vol. 86, no. 1, pp. 102–120, 2006.
[25] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,”
Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
R EFERENCES
[1] I. H. Witten and E. Frank, Data Mining: Practical machine learning
tools and techniques. San Francisco: Morgan Kaufmann, 2005.
[2] N. Lavrac, H. Motoda, and T. Fawcett, “Editorial: Data Mining Lessons
Learned,” in Machine Learning, 2004.
[3] M. A. Carreira-perpiñàn, “A review of dimension reduction techniques,”
Dept. of Computer Science, University of Sheffield, Tech. Rep., 1997.
[4] Y. Pang, L. Zhang, Z. Liu, N. Yu, and H. Li, “Neighborhood preserving
projections (npp): A novel linear dimension reduction method,” in ICIC
(1), 2005, pp. 117–125.
[5] S. D. Bay, “Nearest neighbor classification from multiple feature subsets,” Intelligent Data Analysis, vol. 3, pp. 191–209, 1999.
[6] T. Yamada, N. Ishii, and T. Nakashima, “Text Classification by Combining Different Distance Functions with Weights,” IEEJ Transactions on
Electronics, Information and Systems, vol. 127, pp. 2077–2085, 2007.
[7] S. Deegalla and H. Boström, “Fusion of dimensionality reduction
methods: a case study in microarray classification,” in Proceedings of
the 12th International Conference on Information Fusion. IEEE, 2009,
pp. 460–465.
[8] ——, “Improving Fusion of Dimensionality Reduction Methods for
Nearest Neighbor Classification,” in Proceedings of the 8th International
Conference on Machine Learning and Applications, 2009, pp. 771–775.
[9] J. Shlens, “A Tutorial on Principal Component Analysis,” URL:
http://www.snl.salk.edu/ shlens/pub/notes/pca.pdf.
[10] W. Melssen and R. Wehrens, “Chemometrics I Study Guide.”
[11] K. Kira and L. Rendell, “A practical approach to feature selection,”
in Proceedings of International Conference on Machine Learning
(ICML1992), 1992, pp. 249–256.
[12] I. Kononenko, “Estimating attributes: analysis and extension of relief,” in Proceedings of European Conference on Machine Learning
(ICML1994), 1994, pp. 171–182.
[13] D. V. Nguyen and D. M. Rocke, “Tumor classification by partial least
squares using microarray gene expression data,” Bioinformatics, vol. 18,
no. 1, pp. 39–50, 2002.
[14] ——, “Multi-class cancer classification via partial least squares with
gene expression profiles,” Bioinformatics, vol. 18, no. 9, pp. 1216–1226,
2002.
[15] J. J. Dai, L. Lieu, and D. Rocke, “Dimension Reduction for Classification with Gene Expression Microarray Data,” Statistical Applications in
Genetics and Molecular Biology, vol. 5, no. 1, 2006.
[16] A. L. Boulesteix, “PLS Dimension Reduction for Classification with
Microarray Data,” Statistical Applications in Genetics and Molecular
Biology, 2004.
[17] S. de Jong, “SIMPLS: An alternative approach to partial least squares
regression,” Chemometrics and Intelligent Laboratory Systems, 1993.
[18] G. Forman, “An Extensive Empirical Study of Feature Selection Metrics
for Text Classification,” Journal of Machine Learning Research, vol. 3,
pp. 1289–1305, 2003.
[19] E. P. Xing, M. I. Jordan, and R. M. Karp, “Feature selection for
high-dimensional genomic microarray data,” in In Proceedings of the
Eighteenth International Conference on Machine Learning. Morgan
Kaufmann, 2001, pp. 601–608.
881