Choice of Dimensionality Reduction Methods for Feature and Classifier Fusion with Nearest Neighbor Classifiers

Sampath Deegalla*†, Henrik Boström*, Keerthi Walgama‡
* Department of Computer and Systems Sciences, Stockholm University, Forum 100, SE-164 40 Kista, Sweden. Email: [email protected], [email protected]
† Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Peradeniya 20400, Sri Lanka. Email: [email protected]
‡ Department of Engineering Mathematics, Faculty of Engineering, University of Peradeniya, Peradeniya 20400, Sri Lanka. Email: [email protected]

Abstract—High dimensional data often cause problems for currently used learning algorithms in terms of efficiency and effectiveness. One solution to this problem is to apply dimensionality reduction, by which the original feature set is reduced to a small number of features while gaining improved accuracy and/or efficiency of the learning algorithm. We have investigated multiple dimensionality reduction methods for nearest neighbor classification in high dimensions. In previous studies, we have demonstrated that fusion of the outputs of different dimensionality reduction methods, either by combining classifiers built on reduced features or by combining the reduced features and then applying the classifier, may yield higher accuracy than using the individual reduction methods. However, none of the previous studies investigated which dimensionality reduction methods to choose for fusion when the outputs of multiple dimensionality reduction methods are available. Therefore, we have empirically investigated different combinations of the outputs of four dimensionality reduction methods on 18 medicinal chemistry datasets. The empirical investigation demonstrates that fusion of nearest neighbor classifiers obtained from multiple reduction methods in all cases outperforms the use of individual dimensionality reduction methods, while fusion of different feature subsets is quite sensitive to the choice of dimensionality reduction methods.

I. INTRODUCTION

The nearest neighbor algorithm [1] is a simple learning algorithm that can be applied to high dimensional datasets without any modification of the original algorithm. However, its performance is often poor in high dimensions, as is the case for many other learning algorithms, due to its sensitivity to the input data [2]. This is known as the curse of dimensionality [3], a well known phenomenon that misleads learning algorithms when applied to high-dimensional data. Dimensionality reduction is one potential approach to address this problem [3], [4]. However, selecting a suitable dimensionality reduction method for a dataset may not be straightforward, since the resulting performance depends not only on the dataset but also on the learning algorithm.

Researchers have been working on improving the accuracy of the nearest neighbor classifier along a number of directions [5], [6]. One such direction is to combine several classifiers, which is known as classifier fusion or ensemble learning [1]. However, common procedures for classifier fusion such as bagging and boosting [1] are known to be ineffective for nearest neighbor classifiers [5]. Therefore, combining the nearest neighbor outputs of different feature subsets has instead been considered in a number of studies.
For example, Bay [5] considered classifier fusion based on selecting random subsets of features for several nearest neighbor classifiers and then combining their outputs using simple majority voting. However, many studies have considered combining the nearest neighbor outputs only. In contrast, we consider both combining classifier outputs, i.e., classifier fusion, and combining the reduced feature subsets, i.e., feature fusion, in order to improve the classification accuracy of the nearest neighbor classifier.

In a previous study [7], we presented two strategies for fusion of the outputs of dimensionality reduction, i.e., combining an equal number of features from each reduced subset and combining the outputs of classifiers trained on equal numbers of features. The results of that study led us to investigate feature fusion methods further, since larger gains in accuracy were observed for this type of method. In another study [8], we extended the methods by considering different numbers of features for fusion, since different dimensionality reduction methods may yield their best performance with different numbers of features. The results showed that feature fusion using equal numbers of dimensions and classifier fusion using different numbers of dimensions achieve the highest accuracy. One limitation of both studies is that only three dimensionality reduction methods were employed and the results of all methods were considered for fusion. Neither of the studies investigated which dimensionality reduction methods should be considered for fusion when the outputs of multiple dimensionality reduction methods are available.

In this paper, we address this question in order to further improve the classification accuracy of the nearest neighbor classifier. We consider four dimensionality reduction methods, namely Principal Component Analysis (PCA) [9], [10], Partial Least Squares (PLS) [10], Information Gain (IG) [1] and ReliefF [11], [12]. In the next section, we present these dimensionality reduction methods together with the two fusion strategies, feature and classifier fusion. In Section III, the details of the empirical study, including a description of the datasets, the experimental setup and the results, are provided. Finally, conclusions and future directions of the work are presented in Section IV.

II. METHODS

In this section, we present the four dimensionality reduction methods and the two fusion strategies. The following notation is used. A dataset is described as an n × o data matrix X, which contains n instances and o original dimensions. The reduced dataset is denoted by an n × d matrix Z, which contains n instances and d (< o) dimensions. X^T represents the transpose of X.

A. Principal Component Analysis

Principal component analysis (PCA) [9], [10] transforms the original feature set into a new set of orthogonal features known as principal components. A principal component can be defined as a linear combination of the original features with optimal weights. The first principal component captures most of the variation in the data and each subsequent principal component captures most of the remaining variation. Therefore, PCA can be used to reduce a high number of dimensions to a much lower number. Suppose the original dataset X is an n × o matrix, in which n denotes the number of instances and o denotes the number of features. The first step is to subtract the mean from X along all the dimensions.
Let C be the covariance matrix of X,

C = (1 / (n − 1)) X^T X,

where C is an o × o symmetric matrix which captures the correlations between all the features. The reduced dataset Z can be written as

Z = X E,

where E is an o × d projection matrix whose columns are the d eigenvectors corresponding to the d highest eigenvalues of the covariance matrix C.

Recent implementations of PCA are based on Singular Value Decomposition (SVD), which is known as a fast and numerically stable method [10]. In SVD, the original data matrix X can be decomposed as

X = U D V^T,

where U is an n × m orthonormal matrix with the left singular vectors, V is an o × m orthonormal matrix with the right singular vectors, and D is an m × m diagonal matrix which contains the singular values. Here m refers to the rank of X, which is at most min(n, o). Considering only the first d highest singular values, where d < min(n, o), the decomposition can be rewritten as

X_{n×o} ≈ U_{n×d} D_{d×d} V^T_{d×o}.

The principal component scores can then be calculated as

Z = X V.

Here Z is the reduced dataset after applying PCA to the original dataset X. We have used the Matlab implementation of PCA, which is based on SVD, in the empirical studies.

B. Partial Least Squares

The partial least squares method (PLS) [10] was developed and has been extensively applied in chemometrics, primarily as a regression method. It seeks linear combinations of the original features whose correlation with the output variable (the class values) is maximized. The resulting components are uncorrelated, i.e., there is no correlation between the new features. PLS is particularly known for its applicability when the number of features is much higher than the number of instances. It has later been used as a dimensionality reduction method in the context of high dimensional datasets [13]–[16]. Since PLS considers the class labels in the transformation, it is regarded as one of the most successful supervised dimensionality reduction methods.

The basic PLS regression model can be represented as

Y = X B + E,

where Y is an n × p response matrix (class matrix), X is an n × o data matrix, B is an o × p regression coefficient matrix, E is an n × p noise matrix and p is the number of dependent variables (class variables). Here, X and Y are centered by subtracting their respective means. The regression model can be developed as follows. The factor score matrix Z is found as

Z = X W,

with an appropriate weight matrix W. The model is then equivalent to

Y = Z Q + E,

where Q contains the regression coefficients (loadings) for Z. Once the loadings Q are computed, the above regression model is equivalent to Y = X B + E, where B = W Q. Here we are interested in dimensionality reduction, i.e., in generating the factor scores Z = X W.

The SIMPLS algorithm [17] has been used to perform PLS in our studies since it can be applied to both binary and multiclass classification problems. Although PLS was originally developed for continuous output (continuous response), we need to use it for categorical output (categorical response). Categorical outputs can be treated as continuous outputs in binary class problems, since the PLS method does not employ any distributional assumptions [16].
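For illustration only, the reduction step of Sections II-A and II-B can be sketched as follows. The experiments in this paper used Matlab (PCA via SVD, and SIMPLS via the BDK-SOMPLS toolbox); the sketch below instead assumes scikit-learn, whose PLSRegression implements a NIPALS-style algorithm rather than SIMPLS, and all data and variable names are hypothetical. Before turning to the multiclass encoding, note that the binary class labels are simply used as a 0/1 response.

# Illustrative sketch only, assuming scikit-learn stands in for the Matlab tools
# used in the paper. Variable names and data are hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 1024))   # e.g., 1024-bit fingerprint features
y_train = rng.integers(0, 2, size=100)   # binary class labels used as a 0/1 response
X_test = rng.normal(size=(20, 1024))

d = 10  # number of reduced dimensions (chosen by cross-validation in the paper)

# PCA: fit the projection on the training data only, then apply it to the test data.
pca = PCA(n_components=d)
Z_train_pca = pca.fit_transform(X_train)
Z_test_pca = pca.transform(X_test)

# PLS: supervised reduction; the binary labels are treated as a continuous response.
pls = PLSRegression(n_components=d)
pls.fit(X_train, y_train)
Z_train_pls = pls.transform(X_train)   # factor scores Z = X W
Z_test_pls = pls.transform(X_test)

In both cases the projection is fitted on the training data only and then applied to the test data, as described in the experimental setup below.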
The class variables of multiclass problems are, however, transformed using the following dummy coding, as described in [16], for use with the SIMPLS algorithm. Here, the class variable Y is transformed into a C-dimensional random vector, where C is the number of classes:

y_ij = 1 if Y_i = j, and y_ij = 0 otherwise,   (1)

where Y_i is the class label of the ith instance and y_ij is the jth element of the ith row of the random vector, for all i = 1, ..., n and j = 1, ..., C. Therefore, for binary problems Y can be used as it is, while the transformed random vector is used for multiclass problems.

C. Information Gain

Information Gain (IG) is commonly used in decision tree induction, e.g., in ID3, to decide which feature from a set of features to use in a node of the tree [1]. It is a filter-based feature selection method, in which the features are ranked according to their information gain and the best d features are selected. It has been investigated as a dimensionality reduction method for high dimensional datasets, such as texts [18] and microarrays [19]. Information gain can be stated generally as

IG(class|feature) = H(class) − H(class|feature),

where H(class) = −Σ_j p_j log2 p_j denotes the entropy of the class and H(class|feature) denotes the conditional entropy of the class given the feature. Maximizing IG is equivalent to minimizing the conditional entropy H(class|feature), which can be computed as

H(class|feature) = − Σ_{i=1}^{V} (n_i / N) Σ_{j=1}^{C} (n_ij / n_i) log2 (n_ij / n_i).

Here C is the number of classes, V is the number of values of the feature, N is the total number of instances, n_i is the number of instances having the ith value of the feature and n_ij is the number of instances in the latter group belonging to the jth class.

All attributes in the datasets used for the experiments are numerical, whereas information gain assumes that all attributes are nominal. In order to calculate the information gain, the numerical data must therefore be discretized prior to applying the IG algorithm. We have used WEKA for IG in our studies, where Fayyad & Irani's Minimum Description Length (MDL) [20] method is used as the default method of discretization.

D. ReliefF

The ReliefF algorithm [11], [12] is an extension of the Relief algorithm. The weight of a feature is calculated by maximizing the ratio between the distance to the closest instances from different classes and the distance to the closest instances from the same class. The method may be particularly suitable for nearest neighbor classification since it is inspired by instance-based learning [11]. This is also a filter-based feature selection method, where the best d features are selected for classification based on their relevance level and a threshold [11].

The algorithm chooses m instances at random. For each chosen instance, it finds the closest instance from the same class (the near-hit) and the closest instance from a different class (the near-miss). In each iteration, this triplet, i.e., the chosen instance, its near-hit and its near-miss, is used to update the weight vector W:

W_i = W_i − (x_i − near-hit_i)^2 / m + (x_i − near-miss_i)^2 / m,   for all i = 1, ..., o.

Here W_i is the weight of the ith feature from the previous iteration, x is one of the m chosen instances, x_i is the value of its ith feature, near-hit_i is the value of the ith feature of the nearest instance from the same class as x, and near-miss_i is the value of the ith feature of the nearest instance from a different class.
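The Relief-style weight update above can be written down directly. The following is a minimal sketch under the assumptions that the features are numeric and that both classes are present; ReliefF proper additionally averages over k nearest hits and misses and handles multiple classes, and the experiments below rely on WEKA's implementation rather than this code. All names are hypothetical.

# Minimal sketch of the Relief-style update described above, not WEKA's ReliefF.
import numpy as np

def relief_weights(X, y, m=50, seed=0):
    """Return one weight per feature; X is (n, o) numeric, y holds integer labels."""
    rng = np.random.default_rng(seed)
    n, o = X.shape
    w = np.zeros(o)
    m_eff = min(m, n)
    for idx in rng.choice(n, size=m_eff, replace=False):
        x, label = X[idx], y[idx]
        dists = np.linalg.norm(X - x, axis=1)
        dists[idx] = np.inf                      # never pick the instance itself
        same = np.where(y == label)[0]
        diff = np.where(y != label)[0]
        near_hit = X[same[np.argmin(dists[same])]]
        near_miss = X[diff[np.argmin(dists[diff])]]
        w += (-(x - near_hit) ** 2 + (x - near_miss) ** 2) / m_eff
    return w

# The best d features are then those with the largest weights, e.g.:
# top_d = np.argsort(relief_weights(X_train, y_train))[::-1][:d]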
E. Feature Fusion

In feature fusion, each dimensionality reduction method is applied to the original feature set and the resulting feature sets are considered for fusion. It was earlier observed that combining an equal number of features from each reduced feature set produced the highest classification accuracy for the nearest neighbor classifier [8]. Therefore, an equal number of features from each reduced feature set is selected and then merged into a single feature set also in this study. The nearest neighbor classifier is then applied to the merged feature set (see Fig. 1). We investigate the effect of choosing different subsets by choosing three out of the four methods PCA, PLS, IG and ReliefF. Finally, we also investigate the result of combining all four methods.

Fig. 1. The diagram illustrates the two fusion methods, which combine the outputs of the dimensionality reduction methods (blocks in the figure: high dimension, dimension reduction, low dimension, kNN classifier; feature fusion; classifier fusion).

F. Classifier Fusion

Instead of generating classifiers from fused feature sets, classifiers are built separately from each feature set and their predictions are merged [21]. Following [8], we adopt the approach of setting the number of dimensions selected for each dimensionality reduction method to the one that results in the highest accuracy, as estimated by ten-fold cross validation on the training data. Since we investigate which methods to choose from multiple methods, three out of the four classifier outputs are selected and then fused using unweighted majority voting. Finally, all the classifier outputs are also considered for fusion. Ties are resolved by randomly selecting a class label.

III. EMPIRICAL STUDY

In this study, we consider four dimensionality reduction methods, i.e., PCA, PLS, IG and ReliefF, and two strategies for fusing their outputs, classifier and feature fusion, as described above.

A. Datasets

Table I shows the 18 publicly available datasets from the medicinal chemistry domain that were selected for this study. Each set of compounds is represented by two different chemical descriptor sets: Scitegic Extended Connectivity Fingerprints (ECFI) and SELMA [22].

TABLE I
DESCRIPTION OF DATASETS.

Dataset     Attributes (ECFI)  Attributes (SELMA)  Instances  Classes
AI          1024               94                  69         2
AMPH1       1024               94                  130        2
ATA         1024               94                  94         2
COMT        1024               94                  92         2
EDC         1024               94                  119        2
HIVPR       1024               94                  113        2
HIVRT       1024               94                  101        2
HPTP        1024               94                  132        2
ace         1024               94                  114        2
ache        1024               94                  111        2
bzr         1024               94                  163        2
caco        1024               94                  100        2
cox2        1024               94                  322        2
cpd-mouse   1024               94                  980        2
cpd-rat     1024               94                  1198       2
gpb         1024               94                  66         2
therm       1024               94                  76         2
thr         1024               94                  88         2

B. Experimental setup

One of the central issues to be addressed when combining the outputs of several dimensionality reduction methods is to find out which methods should be combined to reach the highest accuracy. To investigate this, we have combined three dimensionality reduction methods at a time, and finally all the methods, and compared these to using each method individually. First, the original dimensionality is transformed into a lower number of dimensions using the individual reduction methods. To generate the PCA and PLS components for the test set, the weight matrix generated for the training set is used. For IG and ReliefF, the features are ranked in decreasing order of importance using the training set and the same features are used in the test set.

For feature fusion, an equal number of features from three methods at a time, and finally from all the methods, is combined. The number of features to be combined is determined by the best performance under cross validation on the training set. For classifier fusion, the number of features required to yield the highest performance for each dimensionality reduction method, i.e., the optimal number of dimensions, is found using cross validation on the training set. The nearest neighbor classifier outputs are then combined, considering three methods at a time. Finally, the classifier outputs of all the methods are considered for fusion.
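A minimal sketch of the two fusion strategies is given below, assuming that each reduction method has already produced reduced training and test matrices (as in the earlier sketch). The paper uses WEKA's nearest neighbor classifier; scikit-learn's 1-NN stands in here, and all names are hypothetical.

# Illustrative sketch of feature fusion and classifier fusion; not the WEKA-based
# pipeline used in the experiments.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def feature_fusion_predict(reduced_train, reduced_test, y_train, d):
    """Concatenate the first d features from each reduced set, then apply 1-NN."""
    Z_train = np.hstack([Z[:, :d] for Z in reduced_train])
    Z_test = np.hstack([Z[:, :d] for Z in reduced_test])
    return KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train).predict(Z_test)

def classifier_fusion_predict(reduced_train, reduced_test, y_train, best_d):
    """One 1-NN per reduced set (each with its own best d), unweighted majority vote."""
    votes = np.vstack([
        KNeighborsClassifier(n_neighbors=1).fit(Z_tr[:, :d], y_train).predict(Z_te[:, :d])
        for Z_tr, Z_te, d in zip(reduced_train, reduced_test, best_d)
    ])
    # Majority vote, assuming non-negative integer class labels (0/1 here); ties
    # fall to the smallest label, whereas the paper breaks ties at random.
    return np.array([np.bincount(col).argmax() for col in votes.T])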
We have used Matlab to transform the raw attributes into both PLS and PCA components. The PCA transformation is performed using Matlab's Statistics Toolbox, whereas the PLS transformation is performed using the BDK-SOMPLS toolbox [23], [24], which uses the SIMPLS algorithm. The WEKA data mining toolkit [1] is used for the IG and ReliefF methods, as well as for the actual nearest neighbor classification. For overall validation, ten-fold cross validation is used and classification accuracy is used as the evaluation metric. The mean accuracy of the cross validation for each dataset is reported in Tables II and III.

C. Results

The focus of the study is to investigate the effect of the choice of different dimensionality reduction methods for fusion in order to achieve the highest classification accuracy. Tables II and III show the nearest neighbor classification accuracies on the 18 high dimensional datasets. The first five columns show the nearest neighbor classification accuracy on the raw (unreduced) features as well as on the reduced features obtained with the four individual dimensionality reduction methods. The next five columns give the accuracies of classifier fusion of the different methods, and the remaining columns correspond to feature fusion in conjunction with the different methods.

To investigate whether the observed differences in accuracy are statistically significant, we applied the Friedman test [25], which is a non-parametric statistical test. The null hypothesis is that there is no difference in accuracy among the different methods. The Friedman test rejects the null hypothesis for both descriptor sets, i.e., there is indeed a significant difference between the classification accuracies of the different methods for both ECFI and SELMA. However, one has to perform a post-hoc test to identify pairs of classifiers that are significantly different from each other. In this study, the Nemenyi test [25] is used as the post-hoc test.
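As an aside, the Friedman test over the per-dataset accuracies can be reproduced with a few lines; the sketch below assumes scipy and an accuracy matrix with one row per dataset and one column per method (all names hypothetical).

# Sketch of the Friedman test, assuming acc is an (18 datasets x 15 methods) array.
from scipy import stats

def friedman_over_methods(acc):
    # scipy implements the chi-square form of the test; Demsar [25] also describes
    # an F-corrected variant. Either way, the null hypothesis of equal performance
    # is rejected when the p-value falls below the chosen level (0.05 in the paper).
    return stats.friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])

# statistic, p_value = friedman_over_methods(acc)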
TABLE II
CLASSIFICATION ACCURACIES FOR ECFI.

81.19 55.38 71.22 78.67 60.68 71.36 77.27 73.46 83.18 62.27 58.97 57.00 65.27 62.45 58.35 69.76 62.86 57.92 11.56
85.24 50.00 69.22 76.33 71.44 71.52 79.27 77.20 78.03 59.62 69.12 69.00 65.52 67.14 62.68 80.48 63.75 72.92 8.78
FF.PLSPCAIGReliefF
76.67 58.46 68.89 78.22 63.94 72.42 75.27 78.74 82.50 63.18 65.77 62.00 67.40 62.14 62.02 68.33 68.57 60.14 10.22
FF.PLSIGReliefF
76.67 52.31 74.33 78.22 73.86 68.79 83.18 83.30 79.17 60.45 74.26 67.00 72.09 65.41 60.10 75.71 78.04 70.42 7.17
FF FF.PCAIGReliefF
76.67 66.92 72.22 80.44 63.94 70.83 78.18 81.87 86.14 59.62 65.62 61.00 67.08 64.18 64.02 74.52 70.36 66.25 8.72
CF FF.PLSPCAReliefF Relieff FF.PLSPCAIG IG CF.PLSPCAIGReliefF PCA CF.PLSIGReliefF PLS CF.PCAIGReliefF 1 AI 2 AMPH1 3 ATA 4 COMT 5 EDC 6 HIVPR 7 HIVRT 8 HPTP 9 ace 10 ache 11 bzr 12 caco 13 cox2 14 cpd-mouse 15 cpd-rat 16 gpb 17 therm 18 thr Average Rank Raw CF.PLSPCAReliefF Data set CF.PLSPCAIG #
79.76 56.15 75.44 80.44 73.11 73.26 81.18 83.30 84.24 66.82 69.41 65.00 72.08 65.31 62.77 77.14 71.25 68.06 5.56
79.52 55.38 73.22 79.33 72.20 73.26 79.18 81.70 80.76 62.35 70.51 71.00 70.51 67.24 63.94 78.57 70.18 71.53 5.94
82.62 60.77 73.44 80.78 73.94 76.89 79.27 80.22 85.91 61.44 68.09 66.00 68.33 66.84 63.35 78.81 67.5 67.92 5.78
82.38 52.31 73.22 81.89 76.52 73.26 79.27 81.81 82.42 60.61 70.44 68.00 71.78 67.35 63.36 77.38 71.43 69.17 5.42
79.52 53.08 74.22 80.44 73.94 71.44 80.18 83.30 80.76 64.17 71.14 69.00 73.02 66.53 63.36 77.14 72.68 70.42 5.17
82.62 56.15 79.78 77.00 70.61 70.68 81.27 80.27 81.67 53.33 69.89 62.00 71.43 65.61 61.77 75.71 76.43 69.44 7.56
76.43 56.92 73.78 77.22 76.36 73.94 86.09 78.68 80.00 59.62 70.37 69.00 72.98 65.82 61.26 80.71 71.25 69.58 6.42
79.76 53.08 73.44 79.33 68.94 68.79 77.18 78.02 78.41 53.18 63.12 65.00 65.87 60.51 59.43 78.57 67.14 67.22 11.08
75.24 60.77 74.78 74.00 71.36 72.27 87.18 79.51 80.91 56.97 71.65 73.00 69.26 62.65 62.18 77.14 70.00 68.33 7.69
76.67 53.85 77.78 75.00 73.03 61.36 72.00 68.57 74.24 52.73 59.82 57.00 64.16 58.27 54.85 73.57 63.93 59.31 12.94

If the average rank of two methods differs by at least a critical difference, which is dependent on the number of datasets, the number of compared methods and a chosen level of significance, the test concludes that the accuracies of the two methods are significantly different. According to the Nemenyi test, the critical difference for 15 methods and 18 datasets at p = 0.05 is 5.06.
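The critical difference itself follows from the formula in Demšar [25], CD = q_alpha * sqrt(k(k + 1) / (6N)). The sketch below assumes a tabulated q_0.05 of about 3.39 for 15 classifiers, a value not stated in the paper.

# Sketch of the Nemenyi critical difference from Demsar [25].
import math

def nemenyi_cd(k, n_datasets, q_alpha):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N)) for k methods over N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# For k = 15 methods, N = 18 datasets and an assumed q_0.05 of about 3.39,
# this gives a value close to the 5.06 reported above.
cd = nemenyi_cd(15, 18, 3.39)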
Fig. 2 and Fig. 3 show critical difference diagrams [25], which are visual illustrations of the results of the post-hoc test. In these diagrams, the best ranking methods are to the right and methods that are not significantly different from each other are connected. For the ECFI descriptors, the post-hoc test indicates that all the classifier fusion methods are significantly better than using IG alone, than feature fusion using PCA, IG and ReliefF (FF.PCAIGReliefF) and than feature fusion using all four reduction methods (FF.PLSPCAIGReliefF). For the SELMA descriptors, classifier fusion using PCA, IG and ReliefF (CF.PCAIGReliefF) and using PLS, IG and ReliefF (CF.PLSIGReliefF) are significantly better than using PCA alone and than feature fusion using all four reduction methods (FF.PLSPCAIGReliefF). In addition, CF.PLSIGReliefF is significantly better than using PLS alone. Feature fusion using all the methods (FF.PLSPCAIGReliefF) performs poorly compared to most of the methods. For all the other methods, the observations are not sufficient to conclude any significant differences.

In general, classifier fusion achieves a higher accuracy for all the combinations than using the individual reduction methods, while the accuracy of feature fusion is sensitive to which subset of the reduction methods is selected. For example, combining the feature sets from PLS, PCA and ReliefF is the best combination, whereas combining PCA, IG and ReliefF is the worst combination, for both descriptor sets. Combining all feature sets yields the poorest accuracy among all the methods. Although the performance of feature fusion is comparatively low relative to classifier fusion, combining the feature sets from PLS, PCA and ReliefF yields improved accuracy compared to using the individual reduction methods, for both ECFI and SELMA.

TABLE III
CLASSIFICATION ACCURACIES FOR SELMA.

73.57 52.31 62.56 66.00 77.35 77.05 84.18 83.30 86.06 62.27 70.70 66.00 72.35 64.08 59.85 78.81 54.11 65.00 8.11
76.43 51.54 63.89 73.78 78.26 73.56 81.27 81.76 82.58 61.21 69.41 65.00 73.26 63.27 60.10 74.76 67.32 66.25 8.17
FF.PLSPCAIGReliefF
80.95 46.92 56.56 64.00 71.52 76.21 77.18 81.81 82.73 60.38 68.64 61.00 72.34 61.63 58.43 66.90 66.61 64.17 11.64
FF.PLSIGReliefF
74.76 55.38 61.89 61.67 75.00 77.05 78.09 80.27 81.74 62.95 68.79 64.00 69.53 62.86 62.18 71.43 62.50 70.56 9.89
FF FF.PCAIGReliefF
73.57 54.62 66.89 63.00 77.35 76.06 84.18 80.27 84.39 66.74 70.66 69.00 73.90 65.82 62.02 69.52 64.46 71.81 6.42
CF FF.PLSPCAReliefF Relieff FF.PLSPCAIG IG CF.PLSPCAIGReliefF PCA CF.PLSIGReliefF PLS CF.PCAIGReliefF 1 AI 2 AMPH1 3 ATA 4 COMT 5 EDC 6 HIVPR 7 HIVRT 8 HPTP 9 ace 10 ache 11 bzr 12 caco 13 cox2 14 cpd-mouse 15 cpd-rat 16 gpb 17 therm 18 thr Average Rank Raw CF.PLSPCAReliefF Data set CF.PLSPCAIG #
76.43 54.62 57.67 63.89 76.67 78.03 83.09 82.58 85.23 63.11 69.96 67.00 74.80 64.29 60.68 68.33 65.54 67.36 7.31
77.86 51.54 60.89 68.22 75.83 74.47 80.18 83.35 84.39 68.41 71.21 64.00 74.18 64.90 60.51 66.90 69.64 69.72 7.06
75.00 51.54 62.78 67.11 77.42 76.29 85.18 84.89 87.05 67.50 72.46 68.00 73.58 63.98 60.10 79.29 64.46 70.83 5.47
73.57 57.69 63.78 69.22 78.26 76.21 86.09 85.66 84.39 65.68 72.54 71.00 73.56 64.80 60.77 80.71 63.21 68.47 4.78
76.43 56.15 59.78 67.11 76.67 76.21 82.09 83.35 83.48 64.85 70.62 67.00 74.48 65.00 60.93 68.33 66.96 67.36 6.69
76.43 56.15 59.67 67.56 76.59 71.59 84.00 80.33 84.39 67.58 66.84 68.00 70.79 57.76 60.86 64.52 64.11 72.08 8.61
70.95 53.08 63.11 67.11 80.00 77.20 81.27 79.51 83.64 69.32 74.78 74.00 72.35 59.80 59.60 74.29 61.43 73.06 7.08
73.81 54.62 62.67 67.44 74.09 76.14 81.27 81.87 81.74 61.44 70.55 65.00 72.66 63.37 57.76 80.24 65.71 69.58 9.03
67.86 50.77 61.56 69.33 78.33 77.05 83.27 77.97 83.41 64.70 69.45 72.00 72.96 57.35 59.93 78.81 65.89 69.86 8.14
72.38 48.46 60.78 69.44 74.92 68.86 72.27 78.79 81.74 60.23 68.09 70.00 71.72 58.27 52.30 73.10 57.14 72.78 11.61

Fig. 2. Comparison of the dimensionality reduction methods and the fusion of their outputs with each other using the Nemenyi test for ECFI (critical difference diagram).

Fig. 3. Comparison of the dimensionality reduction methods and the fusion of their outputs with each other using the Nemenyi test for SELMA (critical difference diagram).

IV. CONCLUSION

Fusing the outputs of four different dimensionality reduction methods (PCA, PLS, IG and ReliefF) for the nearest neighbor algorithm was investigated on 18 high-dimensional medicinal chemistry datasets, using two different representations (descriptor sets). The results showed that classifier fusion in all observed cases obtained a higher accuracy than using the individual reduction methods, irrespective of the methods chosen for fusion, while feature fusion turned out to be sensitive to the chosen combination of reduction methods.

There are a number of directions for future research. This study is concerned with binary class problems, while many datasets involve multi-class problems. Therefore, one could extend the study to also consider multiclass problems in order to investigate whether the same conclusions can be drawn. Further, the study considers only the basic nearest neighbor classifier. One may also choose to study more sophisticated nearest neighbor implementations in conjunction with the fusion strategies. Finally, it would also be interesting to examine whether the fusion strategies considered in this study are useful for other learning algorithms, such as support vector machines.

ACKNOWLEDGMENTS

This work was supported by the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection at Stockholm University, Sweden.

REFERENCES

[1] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann, 2005.
[2] N. Lavrac, H. Motoda, and T. Fawcett, "Editorial: Data Mining Lessons Learned," Machine Learning, 2004.
[3] M. A. Carreira-Perpiñán, "A review of dimension reduction techniques," Dept. of Computer Science, University of Sheffield, Tech. Rep., 1997.
[4] Y. Pang, L. Zhang, Z. Liu, N. Yu, and H. Li, "Neighborhood preserving projections (NPP): A novel linear dimension reduction method," in ICIC (1), 2005, pp. 117–125.
[5] S. D. Bay, "Nearest neighbor classification from multiple feature subsets," Intelligent Data Analysis, vol. 3, pp. 191–209, 1999.
[6] T. Yamada, N. Ishii, and T. Nakashima, "Text classification by combining different distance functions with weights," IEEJ Transactions on Electronics, Information and Systems, vol. 127, pp. 2077–2085, 2007.
[7] S. Deegalla and H. Boström, "Fusion of dimensionality reduction methods: a case study in microarray classification," in Proceedings of the 12th International Conference on Information Fusion. IEEE, 2009, pp. 460–465.
[8] S. Deegalla and H. Boström, "Improving fusion of dimensionality reduction methods for nearest neighbor classification," in Proceedings of the 8th International Conference on Machine Learning and Applications, 2009, pp. 771–775.
[9] J. Shlens, "A Tutorial on Principal Component Analysis," URL: http://www.snl.salk.edu/ shlens/pub/notes/pca.pdf.
[10] W. Melssen and R. Wehrens, "Chemometrics I Study Guide."
[11] K. Kira and L. Rendell, "A practical approach to feature selection," in Proceedings of the International Conference on Machine Learning (ICML 1992), 1992, pp. 249–256.
[12] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML 1994), 1994, pp. 171–182.
[13] D. V. Nguyen and D. M. Rocke, "Tumor classification by partial least squares using microarray gene expression data," Bioinformatics, vol. 18, no. 1, pp. 39–50, 2002.
[14] D. V. Nguyen and D. M. Rocke, "Multi-class cancer classification via partial least squares with gene expression profiles," Bioinformatics, vol. 18, no. 9, pp. 1216–1226, 2002.
[15] J. J. Dai, L. Lieu, and D. Rocke, "Dimension reduction for classification with gene expression microarray data," Statistical Applications in Genetics and Molecular Biology, vol. 5, no. 1, 2006.
[16] A. L. Boulesteix, "PLS dimension reduction for classification with microarray data," Statistical Applications in Genetics and Molecular Biology, 2004.
[17] S. de Jong, "SIMPLS: An alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, 1993.
[18] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[19] E. P. Xing, M. I. Jordan, and R. M. Karp, "Feature selection for high-dimensional genomic microarray data," in Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, 2001, pp. 601–608.
[20] U. M. Fayyad and K. B. Irani, "On the handling of continuous-valued attributes in decision tree generation," Machine Learning, vol. 8, pp. 87–102, 1992.
[21] H. Boström, "Feature vs. classifier fusion for predictive data mining - a case study in pesticide classification," in Proceedings of the 10th International Conference on Information Fusion, 2007, pp. 121–126.
[22] "Data Repository of Bren School of Information and Computer Science, University of California, Irvine," ftp://ftp.ics.uci.edu/pub/baldig/learning/, accessed: 16/02/2012.
[23] W. Melssen, R. Wehrens, and L. Buydens, "Supervised Kohonen networks for classification problems," Chemometrics and Intelligent Laboratory Systems, vol. 83, pp. 99–113, 2006.
[24] W. Melssen, B. Üstün, and L. Buydens, "SOMPLS: a supervised self-organising map - partial least squares algorithm," Chemometrics and Intelligent Laboratory Systems, vol. 86, no. 1, pp. 102–120, 2006.
[25] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.