Subset Feature Selection Algorithm Based on
Optimal Characterization of Data Set
M Chitra
Assistant Professor
Department of Computer Science and Engineering
Ramco Institute of Technology, Rajapalayam

Abi Nanthana M, Abinaya S L, Abitha P, Kaleeswari T
Department of Computer Science and Engineering
Ramco Institute of Technology, Rajapalayam
Abstract—Most feature selection methods determine a single global subset of features onto which all data instances are projected in order to improve classification accuracy. An attractive alternative is to adaptively find a local subset of features. This paper presents a novel Local Feature Selection (LFS) approach for data classification in the presence of a large number of irrelevant features, in which each region of the sample space has its own distinct optimized feature set that varies in both membership and size across the sample space, allowing the feature set to adapt optimally to local variations in the sample space. In addition, a method for measuring the similarity of a query datum to each of the classes is proposed; it makes no assumption about the underlying structure of the samples and is therefore insensitive to the distribution of the data over the sample space. The method is formulated as a linear programming optimization problem and is robust against the over-fitting problem. The experimental results demonstrate the viability of the formulation and the effectiveness of the proposed algorithm.
Keywords—feature selection, class similarity, distance measure
I. INTRODUCTION
Feature selection has become the focus of much research in application areas for which datasets with tens or hundreds of thousands of variables are available. Selecting only the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. Conversely, a subset of useful variables may exclude many redundant, but relevant, variables. One of the most widely studied issues across scientific disciplines is dimensionality reduction, which can be divided into two approaches: feature extraction and feature selection. Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features, for example in text classification.
Feature selection serves two main purposes. First, it
makes training and applying a classifier more
efficient by decreasing the size of the effective
vocabulary. Second, feature selection often
increases classification accuracy by eliminating
noise features. We can view feature selection as a
method for replacing a complex classifier (using all
features) with a simpler one (using a subset of the
features).
Feature selection is an optimization problem. It involves two steps: 1) searching the space of possible feature subsets, and 2) selecting the subset that is optimal or near-optimal with respect to some objective function. Here, the feature selection process is considered for data classification: given a set of training samples and their classes, feature selection involves finding a subset of relevant features. We introduce an alternative to conventional feature selection called Localized Feature Selection (LFS). Localized feature selection is realized by considering each training sample as a representative point of its neighboring region and selecting an optimal feature set for that region. This method mitigates the problem of over-fitting, which arises when a model has too many parameters relative to the number of observations and therefore generalizes poorly to new data.
II. METHOD DESCRIPTION
Subset selection method
This method selects a subset of features that together have good predictive power, as opposed to ranking features individually. Sequential forward selection and sequential backward selection can be used to add new features to, or remove features from, the existing set. The candidate sets are evaluated using a category distance measurement and the classification error, as sketched below.
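As an illustration only (the paper's own implementation is in MATLAB), the following Python sketch shows a generic sequential forward selection loop; the evaluation function score is assumed to be supplied by the user, for example the distance criterion sketched in the next section.

import numpy as np

# Illustrative sketch of sequential forward selection (not the paper's MATLAB code).
# `score(X_subset, y)` is any user-supplied evaluation function, e.g. a category
# distance measurement or a cross-validated classification accuracy.
def sequential_forward_selection(X, y, score, n_select):
    remaining = list(range(X.shape[1]))   # indices of features not yet chosen
    selected = []
    while remaining and len(selected) < n_select:
        gains = [score(X[:, selected + [f]], y) for f in remaining]
        selected.append(remaining.pop(int(np.argmax(gains))))
    return selected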
Figure 1. Steps in the feature selection method: candidate subsets are generated from the original feature set and evaluated; when the stopping criterion is met, the selected subset of features is validated.
1. Generation:
In the generation step, a candidate subset of features is selected for evaluation. The search can start with no features, all features, or a random feature subset, and at each subsequent step features may be added, removed, or both. The feature space can be examined in three ways: complete, heuristic, and random search.
2. Evaluation:
The evaluation step determines the relevancy of the generated candidate feature subset towards the classification task. Each candidate is scored by an evaluation function J, and the best value found so far is updated accordingly:

Rvalue = J(candidate subset)
if (Rvalue > best_value) then best_value = Rvalue

In this work we use distance as the evaluation function, specifically the Euclidean distance, $d_{ij} = \|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_2$ (in two dimensions, $z^2 = x^2 + y^2$ for coordinate differences $x$ and $y$). The distance measure works as follows:
1) It selects those features that encourage instances of the same class to stay within the same proximity.
2) Instances of the same class should be closer, in terms of distance, than instances from different classes (a sketch of such a criterion follows).
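A minimal sketch of such a distance-based evaluation function, assuming the criterion is the ratio of the mean between-class distance to the mean within-class distance (the paper specifies only that Euclidean distance is used, so this particular ratio is an assumption):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_criterion(X_subset, y):
    # Mean between-class distance divided by mean within-class distance;
    # larger values mean same-class instances stay within the same proximity.
    D = squareform(pdist(X_subset, metric='euclidean'))
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    diff = y[:, None] != y[None, :]
    return D[diff].mean() / (D[same].mean() + 1e-12)

This function can serve directly as the score argument of the forward-selection sketch above.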
III. FEATURE SELECTION
A. Preprocessing:
Preprocessing is a data mining technique that transforms raw data into an understandable format. The data must be preprocessed because they may be inconsistent and contain errors; to extract the required information from a huge, incomplete, noisy, and inconsistent set of data, preprocessing is necessary. Here, a Butterworth filter is used for preprocessing. First, the Butterworth coefficients are computed from the filter order and the cutoff frequency, expressed as a normalized frequency; the data are then filtered with these coefficients.
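A minimal SciPy sketch of this preprocessing step; the filter order and normalized cutoff below are placeholders, since the paper does not report the values it used.

from scipy.signal import butter, filtfilt

def butterworth_smooth(x, order=4, cutoff=0.2):
    # Low-pass Butterworth filter; `cutoff` is normalized to the Nyquist frequency.
    # The order and cutoff are placeholder values, not the paper's settings.
    b, a = butter(order, cutoff, btype='low')
    return filtfilt(b, a, x)   # zero-phase filtering of a 1-D signal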
B. Feature Optimization:
In feature optimization, eigenvalues are calculated to determine how much information each direction carries, which also helps to rank the usefulness of the features. To do this, the covariance matrix of the data is computed. If the eigenvalue associated with a direction is large, that direction is selected; otherwise it is not considered. Finally, the original dataset is projected onto the selected directions.
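A sketch of this eigen-analysis (essentially a principal component step), assuming the data matrix has samples in rows and features in columns; the number of retained directions is an arbitrary placeholder.

import numpy as np

def eigen_rank_and_project(X, n_keep=2):
    # Rank directions by the eigenvalues of the covariance matrix and project
    # the centered data onto the largest ones.
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)            # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    return Xc @ eigvecs[:, order[:n_keep]], eigvals[order]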
C. Graph Construction:
Graph construction builds the graph over the samples; if the graph is already present, only the edge weights need to be assigned. The weights are assigned using one of three options: binary, heat kernel, and cosine (a sketch is given below). The output of this step is then passed to the next stage of processing.
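A hedged sketch of the three weighting options on a k-nearest-neighbour graph; the neighbourhood size k and heat-kernel width t are assumptions, not values from the paper.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def edge_weights(X, mode='heat', k=5, t=1.0):
    # Build a k-NN adjacency matrix and weight its edges with binary,
    # heat-kernel, or cosine weights.
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    if mode == 'binary':
        return A
    if mode == 'heat':
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return A * np.exp(-sq / t)
    if mode == 'cosine':
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return A * (Xn @ Xn.T)
    raise ValueError(mode)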
In this step, the Euclidean distance between data points in the sample is also calculated. The emphasis is on neighboring samples, which are assigned higher weights; initially, the weights are all assigned uniform values. If two samples are close to each other in one subspace, they tend to be close in most of the other subspaces. The distance in the N subspaces is then obtained as follows:
$$\mathbf{w}_j^{(i)} = \frac{1}{N}\sum_{k=1}^{N}\exp\!\left(-\left(d_{ij|k} - d_{ij|k}^{\min}\right)\right)$$

$$d_{ij|k} = \left\|\left(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\right) \otimes \mathbf{f}^{*(k)}\right\|_2$$

$$d_{ij|k}^{\min} = \begin{cases} \displaystyle\min_{v \in y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} = y^{(i)} \\[4pt] \displaystyle\min_{v \notin y^{(i)}} d_{iv|k}, & \text{if } y^{(j)} \neq y^{(i)} \end{cases}$$

where $\mathbf{x}^{(i)}$ denotes the $i$-th sample, $y^{(i)}$ its class, $\mathbf{f}^{*(k)}$ the optimal feature indicator vector of the $k$-th region, and $\otimes$ element-wise multiplication.
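A NumPy sketch of this weighting for one representative point i, assuming X holds the samples in rows, y the class labels, and masks a list of N boolean feature-indicator vectors (one per region); these names are illustrative, not from the paper.

import numpy as np

def region_weights(i, X, y, masks):
    # w_j^(i): average over the N regional feature masks of
    # exp(-(d_ij|k - d_ij|k_min)), following the equations above.
    n = len(X)
    w = np.zeros(n)
    for f in masks:                                   # f: boolean mask of one region
        d = np.linalg.norm((X - X[i]) * f, axis=1)    # d_iv|k for every sample v
        same = (y == y[i])
        d_min_same = d[same & (np.arange(n) != i)].min()
        d_min_diff = d[~same].min()
        d_min = np.where(same, d_min_same, d_min_diff)
        w += np.exp(-(d - d_min))
    return w / len(masks)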
D. Overfitting Problem:
Overfitting occurs when a learning model adapts itself too closely to the relationship between the training data and their labels. It tends to make the model very complex by introducing too many parameters, which leads to poor performance on new data. The main cause of overfitting is learning from a small dataset. An algorithm that automatically decides which features to keep and which to discard reduces the number of features, and this can work well in reducing overfitting. The potential for overfitting depends not only on the number of parameters and the amount of data, but also on the conformability of the model structure with the data shape and on the magnitude of the model error compared with the expected level of noise in the data. The risk of overfitting is never fully eliminated for real data sets. The LFS algorithm inherently tends to select only relevant features and reject irrelevant ones.
IV. EXPERIMENTAL RESULTS
In the proposed method, the data set is distributed in a two-dimensional feature space in which the class Y1 data is split into two clusters. The artificial irrelevant features are independently sampled with zero mean and unit variance. The data sets "Prostate", "Duke breast cancer", "Leukemia", and "Colon" are microarray data sets in which the number of features is significantly larger than the number of samples in each case.
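For reproducing such a setup, a one-function sketch of how zero-mean, unit-variance irrelevant features could be appended to a data matrix (the count of 100 is a placeholder):

import numpy as np

def add_irrelevant_features(X, n_irrelevant=100, seed=0):
    # Append independent N(0, 1) noise columns to the data matrix X.
    rng = np.random.default_rng(seed)
    return np.hstack([X, rng.standard_normal((X.shape[0], n_irrelevant))])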
The proposed algorithm is implemented in
MATLAB and executed on a desktop with an Intel
Core i3 CPU.
B. Laplacian Score:
The Laplacian Score (LS) is a popular feature-ranking-based feature selection method that can be applied in both supervised and unsupervised settings. It seeks features that best reflect the underlying manifold structure: LS constructs a nearest-neighbor graph to model the local structure and then selects the features that best respect this graph structure.
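A hedged sketch of the unsupervised Laplacian Score over a heat-kernel k-NN graph (the neighbourhood size and kernel width are assumptions); smaller scores indicate features that better respect the graph structure.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, t=1.0):
    # Heat-kernel weights on a symmetrized k-NN graph.
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    S = np.exp(-(A ** 2) / t) * (A > 0)
    S = np.maximum(S, S.T)
    D = np.diag(S.sum(axis=1))
    L = D - S
    ones = np.ones(len(X))
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ D @ ones) / (ones @ D @ ones) * ones   # remove the trivial component
        scores.append((f @ L @ f) / (f @ D @ f + 1e-12))
    return np.array(scores)                                  # smaller = better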
C. Class similarity Measurement:
In this approach, there is no common set of features across the sample space, so classifiers that rely on a single global feature space are inappropriate. The proposed structure is instead based on the similarity of a datum to a specific class. It consists of N regions, where each region has a representative point, a class label, and an optimal feature set. Each region also has an impurity level, defined as the ratio of samples with a differing class label to samples with the same class label within the region. The similarity of a query is measured to all regions; after computing the similarity to all classes, the class label that provides the largest similarity is assigned. If the query sample does not fall into any region, its class is assigned as the class label of the nearest sample, where the region's coordinate system is used to determine the nearest neighboring sample. A sketch of this decision rule is given below.
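A minimal sketch of this decision rule, with the class-similarity computation left abstract (the paper obtains it from a linear program, which is not reproduced here); reps and rep_labels are hypothetical arrays of representative points and their class labels.

import numpy as np

def classify(query, reps, rep_labels, class_similarity):
    # Assign the class with the largest similarity; if the query falls in no
    # region (all similarities are zero), fall back to the nearest representative.
    classes = np.unique(rep_labels)
    sims = np.array([class_similarity(query, c) for c in classes])
    if sims.max() > 0:
        return classes[np.argmax(sims)]
    nearest = int(np.argmin(np.linalg.norm(reps - query, axis=1)))
    return rep_labels[nearest]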
A. Overlapping feature set:
We consider whether there is any overlap between the optimal feature sets of the representative points. The height of each bar indicates the percentage of representative points that select the respective feature; this selection frequency can be computed as sketched below. The assumption of a common feature set over the entire sample space is not necessarily optimal in these applications. The commonly selected features can be interpreted as the most informative features, in terms of accuracy, over the whole space, whereas the less frequently selected features are informative only in local regions of the space.
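The selection frequency behind those bar heights can be summarized as in the short sketch below, where masks is the hypothetical list of per-region boolean feature-indicator vectors.

import numpy as np

def selection_frequency(masks):
    # Percentage of representative points whose optimal set includes each feature.
    M = np.asarray(masks, dtype=float)   # shape: (n_regions, n_features)
    return 100.0 * M.mean(axis=0)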
B. CPU time:
The computational complexity of computing a feature set depends mainly on the data dimension. The proposed method performs feature selection for one representative point at a time on a data set with a number of irrelevant features. The feature selection for each representative point is independent of the others and can therefore be performed in parallel. Classifying a query only requires determining the class labels of its nearest representative points and involves no further optimization, so the CPU time is reduced to a fraction of a second.
Figure 2. Option for getting input
Figure 3. Loading of dataset
Figure 4. Preprocessed data
Figure 5. Filtered data
Figure 6. Optimized data
C. Conclusion:
We have presented an effective Local Feature Selection method for the data classification problem. Most feature selection algorithms select a global feature subset; in the proposed method, we instead select local subsets of features that are most informative for the small regions around the data points. The computation for each region is independent of all the others and can be performed in parallel. The proposed algorithm therefore has the advantage of efficient, parallel selection of features.
Acknowledgment
The authors wish to thank their guide for the support provided.