國立雲林科技大學
National Yunlin University of Science and Technology
Unsupervised Feature Selection
Using Feature Similarity
Advisor: Dr. Hsu
Graduate student: Ching-Lung Chen
Author: Pabitra Mitra, Student Member
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline
Motivation
Objective
Introduction
Feature Similarity Measure
Feature Selection method
Feature Evaluation indices
Experimental Results and Comparisons
Conclusions
Personal Opinion
Review
Motivation
Conventional feature selection methods have high computational complexity
when the data set is large in both dimension and size.
Objective
Propose an unsupervised feature selection algorithm suitable for
data sets that are large in both dimension and size.
Introduction 1/3
The sequential floating searches provide better results, though at
the cost of higher computational complexity.
Existing methods can be broadly classified into two categories:
Maximization of clustering performance
  Sequential unsupervised feature selection, maximum entropy,
  neuro-fuzzy approaches, ...
Based on feature dependency and relevance
  Correlation coefficients, measures of statistical redundancy,
  linear dependence
Introduction 2/3
We propose an unsupervised algorithm which uses feature
dependency/similarity for redundancy reduction but requires no search.
A new similarity measure, called the maximal information compression
index, is used in clustering. It is compared with the correlation
coefficient and the least-squares regression error.
Introduction 3/3
The proposed algorithm is geared toward two goals:
Minimizing the information loss.
Minimizing the redundancy present in the reduced feature subset.
Unlike most conventional algorithms, which search for the best subset,
the proposed criterion can be computed in much less time than many
indices used in other supervised and unsupervised feature selection
methods.
Feature Similarity Measure
There are two approaches for measuring similarity between two random
variables:
1. Nonparametrically test the closeness of the probability
   distributions of the variables.
2. Measure the amount of functional dependency between the variables.
We discuss below two existing linear dependency measures:
1. Correlation coefficient (ρ)
2. Least-squares regression error (e)
Feature Similarity Measure
Correlation Coefficient (ρ)
ρ(x, y) = cov(x, y) / sqrt(var(x) var(y))
var(·) is the variance of a variable and cov(·,·) is the covariance
between two variables.
Properties:
1. 0 ≤ |ρ(x, y)| ≤ 1.
2. |ρ(x, y)| = 1 if x and y are linearly related.
3. |ρ(x, y)| = |ρ(y, x)| (symmetric).
4. If u = (x − a)/c and v = (y − b)/d for some constants a, b, c, d,
   then |ρ(x, y)| = |ρ(u, v)|; the measure is invariant to scaling and
   translation of the variables.
5. The measure is sensitive to rotation of the scatter diagram in the
   (x, y) plane.
Feature Similarity Measure
Least Square Regression Error (e)
e(x, y) is the error in predicting y from the linear model y = a + bx,
where a and b are the regression coefficients obtained by minimizing
the mean square error.
The coefficients are given by a = ȳ and b = cov(x, y) / var(x), and
the mean square error is given by
e(x, y) = var(y) (1 − ρ(x, y)²)
Feature Similarity Measure
Least Square Regression Error (e)
Properties:
1. 0 ≤ e(x, y) ≤ var(y).
2. e(x, y) = 0 if x and y are linearly related.
3. e(x, y) ≠ e(y, x) (unsymmetric).
4. If u = x/c and v = y/d for some constants c, d, then
   e(x, y) = d² e(u, v); the measure e is sensitive to scaling of the
   variables.
5. The measure e is sensitive to rotation of the scatter diagram in
   the (x, y) plane.
Feature Similarity Measure
Maximal Information Compression Index (λ₂)
Let Σ be the covariance matrix of random variables x and y.
Define the maximal information compression index λ₂(x, y) as the
smallest eigenvalue of Σ:
2 λ₂(x, y) = (var(x) + var(y)) − sqrt( (var(x) + var(y))² − 4 var(x) var(y) (1 − ρ(x, y)²) )
λ₂ = 0 when the features are linearly dependent, and it increases as
the amount of dependency decreases.
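A minimal numerical sketch (Python with NumPy; the function names are
illustrative, not taken from the paper) of the three measures defined
above, checking that λ₂ equals the smallest eigenvalue of the 2×2
covariance matrix and is close to zero for nearly linearly related
features:

```python
import numpy as np

def correlation(x, y):
    """Correlation coefficient rho(x, y) = cov(x, y) / sqrt(var(x) var(y))."""
    return np.cov(x, y, bias=True)[0, 1] / np.sqrt(np.var(x) * np.var(y))

def regression_error(x, y):
    """Least-squares regression error e(x, y) = var(y) (1 - rho(x, y)^2)."""
    return np.var(y) * (1.0 - correlation(x, y) ** 2)

def lambda2(x, y):
    """Maximal information compression index: the smallest eigenvalue of
    the 2x2 covariance matrix of (x, y), via the closed-form expression."""
    vx, vy, r = np.var(x), np.var(y), correlation(x, y)
    return 0.5 * (vx + vy - np.sqrt((vx + vy) ** 2
                                    - 4 * vx * vy * (1 - r ** 2)))

# Quick check on synthetic features: lambda2 should match the smallest
# eigenvalue of the covariance matrix and be ~0 for linearly related features.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)
cov = np.cov(x, y, bias=True)
print(lambda2(x, y), np.linalg.eigvalsh(cov)[0])  # nearly equal, close to 0
```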
Feature Similarity Measure
The corresponding loss of information in reconstructing the pattern is
equal to the eigenvalue along the direction normal to the principal
component.
Hence, λ₂ is the amount of reconstruction error committed if the data
are projected to a reduced dimension in the best possible way.
Therefore, it is a measure of the minimum amount of information loss,
or the maximum amount of information compression.
Feature Similarity Measure
The significance of λ₂ can also be explained geometrically in terms of
linear regression.
The value of λ₂ is equal to the sum of the squares of the perpendicular
distances of the points (x, y) to the best fit line y = â + b̂x.
The coefficients of such a best fit line are given by â = ȳ − x̄ cot θ
and b̂ = cot θ, where θ depends on var(x), var(y), and cov(x, y).
Feature Similarity Measure
λ₂ has the following properties:
1. 0 ≤ λ₂(x, y) ≤ 0.5 (var(x) + var(y)).
2. λ₂(x, y) = 0 if x and y are linearly related.
3. λ₂(x, y) = λ₂(y, x) (symmetric).
4. If u = x/c and v = y/d for some constants c, d, then in general
   λ₂(x, y) ≠ λ₂(u, v); the measure is sensitive to scaling of the
   variables, but invariant to translation.
5. The measure is invariant to rotation of the variables about the
   origin.
Feature Selection method
The task of feature selection involves two steps:
1. Partition the original feature set into a number of homogeneous
   subsets (clusters).
2. Select a representative feature from each such cluster.
The partitioning of the features is based on the k-NN principle:
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected,
   and its k neighboring features are discarded.
3. The process is repeated for the remaining features until all of
   them are either selected or discarded.
Feature Selection method
To determine the k nearest neighbors of the features, we assign a
constant error threshold (ε), set equal to the distance of the kth
nearest neighbor of the feature selected in the first iteration.
In later iterations, if λ₂ (the kth nearest-neighbor dissimilarity of
the selected feature) is greater than ε, we decrease the value of k.
Feature Selection method
Let D be the original number of features, and let the original feature
set be O = {Fi, i = 1, …, D}.
The dissimilarity between features Fi and Fj is denoted S(Fi, Fj).
Let r_i^k denote the dissimilarity between feature Fi and its kth
nearest-neighbor feature in the reduced set R.
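As a hedged illustration of the clustering-based selection described
above, here is a minimal Python sketch (variable and function names are
my own, not the paper's). It uses λ₂ as the dissimilarity S(Fi, Fj),
greedily keeps the feature with the most compact k-neighborhood,
discards its k neighbors, and shrinks k when the current kth-neighbor
dissimilarity exceeds the threshold ε fixed in the first iteration:

```python
import numpy as np

def feature_dissimilarity(X):
    """Pairwise dissimilarity S(Fi, Fj) between columns of X, here the
    maximal information compression index lambda_2 (smaller = more similar).
    The double loop is the O(D^2) step mentioned in the complexity slide."""
    D = X.shape[1]
    S = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            vx, vy = np.var(X[:, i]), np.var(X[:, j])
            r = np.cov(X[:, i], X[:, j], bias=True)[0, 1] / np.sqrt(vx * vy)
            l2 = 0.5 * (vx + vy - np.sqrt((vx + vy) ** 2
                                          - 4 * vx * vy * (1 - r ** 2)))
            S[i, j] = S[j, i] = l2
    return S

def select_features(X, k):
    """kNN-based feature clustering: repeatedly keep the feature whose
    kth-nearest-neighbor dissimilarity (r_i^k) is smallest, drop its k
    neighbors, and reduce k when the neighborhood becomes too loose."""
    S = feature_dissimilarity(X)
    remaining = list(range(X.shape[1]))
    selected, eps = [], None
    while remaining:
        k = min(k, len(remaining) - 1)
        if k <= 0:                        # nothing left to cluster
            selected.extend(remaining)
            break
        sub = S[np.ix_(remaining, remaining)]
        rk = np.sort(sub, axis=1)[:, k]   # r_i^k for each remaining feature
        best = int(np.argmin(rk))
        if eps is None:
            eps = rk[best]                # threshold fixed in the first iteration
        elif rk[best] > eps:
            k -= 1                        # neighborhood too loose: shrink k, retry
            continue
        neighbors = np.argsort(sub[best])[1:k + 1]
        selected.append(remaining[best])
        drop = {remaining[j] for j in neighbors}
        remaining = [f for f in remaining if f not in drop and f != remaining[best]]
    return selected
```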
Feature Selection method
With respect to the dimension D, the method has complexity O(D²).
Evaluation of the similarity measure for a feature pair has complexity
O(l), where l is the number of data points; thus the feature selection
scheme has overall complexity O(D²l).
k acts as a scale parameter which controls the degree of detail in a
more direct manner.
The algorithm requires only a similarity measure, which need not be a
metric.
Feature Evaluation indices
Now we describe some evaluation indices:
Indices that need class information
1. Class separability
2. k-NN classification accuracy
3. Naïve Bayes classification accuracy
Indices that do not need class information
1. Entropy
2. Fuzzy feature evaluation index
3. Representation entropy
Feature Evaluation indices
Class Separability
S = trace(S_b⁻¹ S_w)
S_w is the within-class scatter matrix and S_b is the between-class
scatter matrix.
π_j is the a priori probability that a pattern belongs to class ω_j,
and μ_j is the sample mean vector of class ω_j.
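A small Python sketch of how such a separability score can be computed
(helper name and details are mine; it follows the S = trace(S_b⁻¹ S_w)
form reconstructed above, with a pseudo-inverse as a guard in case S_b
is singular):

```python
import numpy as np

def class_separability(X, y):
    """S = trace(S_b^{-1} S_w): within-class scatter S_w and between-class
    scatter S_b, each weighted by the class prior probabilities."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        S_w += p * np.cov(Xc, rowvar=False, bias=True)
        diff = (mu - overall_mean).reshape(-1, 1)
        S_b += p * diff @ diff.T
    # pinv guards against a singular S_b (e.g. very few classes)
    return float(np.trace(np.linalg.pinv(S_b) @ S_w))
```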
Feature Evaluation indices
K-NN Classification Accuracy
The k-NN rule is used to evaluate the effectiveness of the reduced
feature set for classification.
We randomly select 10% of the data as the training set and classify
the remaining 90% of the points.
Ten such independent runs are performed, and the average accuracy on
the test set is reported.
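A hedged sketch of this evaluation protocol using scikit-learn (the
10%/90% split and the ten runs follow the slide; the choice of k and
the stratified split are my assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X, y, n_runs=10, k=1):
    """Average k-NN test accuracy over ten random 10% train / 90% test splits."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.1, stratify=y, random_state=seed)
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores))
```

The same split-and-average protocol applies to the Bayes (maximum
likelihood) classifier evaluation described on the next slide.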
Feature Evaluation indices
Naïve Bayes Classification Accuracy
The Bayes maximum likelihood classifier, assuming a normal distribution
for each class, is used to evaluate the classification performance.
The mean and covariance of the classes are estimated from a randomly
selected 10% training sample, and the remaining 90% is used as the
test set.
Feature Evaluation indices
Entropy
x_{p,j} denotes the feature value of pattern p along the jth direction,
and D_pq denotes the distance between patterns p and q.
The similarity between p and q is given by sim(p, q) = e^(−α·D_pq).
α is a positive constant; a possible value is α = −ln 0.5 / D̄, where
D̄ is the average distance between data points computed over the entire
data set.
If the data are uniformly distributed in the feature space, the entropy
is maximum.
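A hedged Python sketch of this entropy index. The similarity and α
follow the slide; the plain Euclidean distance and the exact entropy
summation (pairwise sim·log sim + (1−sim)·log(1−sim) terms) are
assumptions about the underlying measure, not taken from the slide:

```python
import numpy as np
from scipy.spatial.distance import pdist

def data_entropy(X, eps=1e-12):
    """Entropy-based index: low when points form compact clusters,
    maximal when the data are spread uniformly in feature space."""
    d = pdist(X)                      # pairwise distances D_pq (assumed Euclidean)
    alpha = -np.log(0.5) / d.mean()   # a possible value: -ln 0.5 / mean distance
    sim = np.exp(-alpha * d)          # sim(p, q) = exp(-alpha * D_pq)
    sim = np.clip(sim, eps, 1 - eps)  # avoid log(0)
    # assumed form: sum over pairs of binary-entropy-like terms
    return float(-np.sum(sim * np.log2(sim) + (1 - sim) * np.log2(1 - sim)))
```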
Feature Evaluation indices
Fuzzy Feature Evaluation Index (FFEI)
The degrees to which both patterns p and q belong to the same cluster
in the original and the reduced feature spaces, respectively, may be
defined through a suitable membership function.
The value of FFEI decreases as the intercluster distances increase.
Feature Evaluation indices
Representation Entropy
Let the eigenvalues of the d×d covariance matrix of a feature set of
size d be λ_j, j = 1, …, d, and let λ̃_j = λ_j / Σ_i λ_i.
λ̃_j has properties similar to a probability: 0 ≤ λ̃_j ≤ 1 and
Σ_j λ̃_j = 1.
The representation entropy is H_R = −Σ_j λ̃_j log λ̃_j.
This is a measure of the amount of redundancy present in that
particular representation of the data set.
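A minimal Python sketch of representation entropy as reconstructed
above (the natural-log choice is my assumption; a different base only
rescales the index):

```python
import numpy as np

def representation_entropy(X):
    """H_R = -sum_j lam_j~ * log(lam_j~), where lam_j~ are the normalized
    eigenvalues of the covariance matrix of the feature set X. H_R is low
    when variance is concentrated in few directions (high redundancy) and
    high when it is spread evenly across directions."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative values
    lam = eigvals / eigvals.sum()
    lam = lam[lam > 0]                       # treat 0 * log 0 as 0
    return float(-(lam * np.log(lam)).sum())
```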
Experimental Results and Comparisons
Three categories of real-life public domain data sets are used:
low-dimensional (D <= 10)
medium-dimensional (10 < D <= 100)
high-dimensional (D > 100)
Nine data sets from the UCI repository are used:
1. Isolet
2. Multiple Features
3. Arrhythmia
4. Spambase
5. Waveform
6. Ionosphere
7. Forest Cover Type
8. Wisconsin Cancer
9. Iris
Experimental Results and Comparisons
The proposed method is compared with four feature selection algorithms:
1. Branch and Bound (BB)
2. Sequential Forward Search (SFS)
3. Sequential Floating Forward Search (SFFS)
4. Stepwise Clustering (SWC), which uses the correlation coefficient
In our experiments, entropy is mainly used as the feature selection
criterion with the first three search algorithms.
Experimental Results and Comparisons
(Result tables and plots comparing the proposed method with BB, SFS,
SFFS, and SWC on the nine data sets.)
Conclusions
An algorithm for unsupervised feature selection using a feature
similarity measure is described.
The algorithm is based on pairwise feature similarities, which are fast
to compute; unlike other approaches, it does not explicitly optimize
classification or clustering performance.
A feature similarity measure called the maximal information compression
index is defined.
It is also demonstrated through extensive experiments that
representation entropy can be used as an index for quantifying both
redundancy reduction and information loss in a feature selection method.
Personal Opinion
We can apply this method in our own feature selection experiments.
This similarity measure is valid only for numeric features; it is worth
considering how it could be extended to categorical features.
Review
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected,
   and its k neighboring features are discarded.
3. Repeat this process for the remaining features until all of them are
   either selected or discarded.