Non-Uniform Random Feature Selection and
Kernel Density Scoring With SVM Based Ensemble
Classification for Hyperspectral Image Analysis
Sathishkumar Samiappan, Saurabh Prasad, Member, IEEE, and Lori M. Bruce, Senior Member, IEEE
Abstract—Traditional statistical classification approaches often fail to yield adequate results with hyperspectral imagery (HSI) because of the high-dimensional nature of the data, multimodal class distributions, and the limited ground-truth samples available for training. Over the last decade, Support Vector Machines (SVMs) and Multi-Classifier Systems (MCS) have become popular tools for HSI analysis. Random Feature Selection (RFS) for MCS is a popular approach to produce higher classification accuracies. In this study, we present a Non-Uniform Random Feature Selection (NU-RFS) within an MCS framework using SVM as the base classifier. We propose a method to fuse the outputs of individual classifiers using scores derived from kernel density estimation. This study demonstrates the resulting improvement in classification accuracy by comparing the proposed approach to conventional analysis algorithms and by assessing its sensitivity to the number of training samples. These results are compared with those of uniform RFS and regular SVM classifiers. We demonstrate the superiority of the NU-RFS based system with respect to overall accuracy, user accuracies, producer accuracies, and sensitivity to the number of training samples.
Index Terms—Ground cover classification, hyperspectral imagery (HSI), multi-classifier systems (MCSs), random feature
selection (RFS), support vector machines (SVMs).
I. INTRODUCTION

Ground cover classification is a challenging and
important problem in remote sensing applications. Hyperspectral imagery (HSI) provides a detailed description of
ground-cover materials ranging from visible to infrared regions
of the electromagnetic spectrum. Such a wide spectral range of
information has the potential to yield higher classification accuracies. The key to the design of a powerful classification system
lies in extracting pertinent features from the high-dimensional
data and employing classifiers to exploit those features.
Maximum Likelihood (ML), a traditional supervised pattern
classification approach, often fails to classify HSI data accurately because of (a) the high dimensionality of features, (b)
multimodal distributions and (c) limited ground truth availability. In order to solve the problem of high dimensionality,
there are several existing approaches based on the concepts
of dimensionality reduction and feature selection. Principal
Component Analysis (PCA), Linear Discriminant Analysis
(LDA) and Stepwise-LDA (S-LDA) are popular dimensionality reduction techniques [33]. Feature selection can also be
performed using metrics such as the Bhattacharyya Distance (BD), Jeffries-Matusita (JM) distance, entropy, etc. The Gaussian ML classifier assumes that the classes are Gaussian distributed, which is a limitation for the majority of practical HSI datasets. Algorithms based on Gaussian mixture models [1] have been proposed in the past to accommodate multimodal distributions. An alternative approach that has become more popular recently with HSI data is the use of Support Vector Machines (SVMs) [4]. Finally, there are techniques that address limited ground-truth availability, such as sample interpolation and adaptive classifiers [2], [3].
In recent work, to improve the performance of conventional
single classifiers, Multi-Classifier Systems (MCS) have been
developed [5]–[8]. MCS are often referred to as ensemble classifier systems, and they potentially perform better than single classifiers when diversity is established among the classifiers. The diversity among the classifiers can be established in different ways [9]–[11]. Prasad et al. [6], [7] demonstrated that, with an MCS setup using ML as the base classifier, the performance can be improved when compared to single classifiers, and that there was potential to further improve such a system by incorporating nonlinear SVM classifiers.
Recently, an MCS based on Random Feature Selection (RFS) [12] proposed by Waske et al. and a dynamic subspace approach [13] proposed by Min et al. were shown to perform well with HSI data. Techniques such as random forests [14] and RFS perform well because they create diversity among the classifiers by re-sampling the spectral bands at the inputs of the classifiers. As proposed in [10], diversity can also be created in other ways. In [15], Breiman demonstrated diversity creation by re-sampling (bagging); related strategies such as attribute bagging [16] and boosting [17] have also proved effective.
The selection of features in [12] is a uniform random feature
selection (RFS). In our recent work [18], we explored the possibilities of using a non-uniform RFS (NU-RFS) based MCS with
SVMs. We found that a diverse classifier ensemble for a classification problem need not always come from an RFS as proposed in [12], [13]. In [18], we demonstrated that NU-RFS can provide better performance than uniform RFS. In this study, as an extension, we present a fully automated MCS with NU-RFS using
SVM. It is assumed that a diverse set of features leads to higher
classification accuracies. Although diversity can be defined in many ways [10], for the purposes of this study a diverse set of spectral bands is defined as follows:
1) Bands are selected from multiple spectral regions across the entire spectrum of the signature.
2) Cross-correlation between selected bands is minimized.
The approach proposed in this paper combines the following
methods to create diversity within a pool of classifiers and to ensure that strengths and weaknesses of individual classifiers are
incorporated into the final decision making: a) re-sampling features in the data through RFS; b) manipulation of input features
through NU-RFS; c) manipulation of output classes through
scores computed from kernel density estimation. The approach
uses a spectral band grouping [28] to perform NU-RFS and a decision fusion strategy based on kernel density scores. To verify
the effectiveness of this approach, we performed experiments
to compare overall accuracies of SVM, RFS, NU-RFS, SVM
with kernel density fusion, and NU-RFS with kernel density fusion. The sensitivity of the above mentioned approaches to the
number of samples required to train them is also studied in this
work.
This paper is organized as follows. Section II provides a review of SVM, MCS, and RFS for SVM and possible extensions.
Section III describes the proposed kernel density based scoring system for fusion in an MCS, the proposed classification system
based on NU-RFS, and band grouping. Section IV discusses
the experimental setup and provides results. Finally, Section V
summarizes the observations and provides concluding remarks.
II. BACKGROUND
A. Support Vector Machines
The effectiveness of SVMs for HSI data was shown in [4], and they have gained popularity over the last decade. They often provide high classification accuracies compared to other non-parametric and statistical approaches. SVM classifiers are particularly useful for classifying heterogeneous classes with a limited number of training samples. A detailed tutorial on SVMs can be found in [19]. SVMs are intrinsically designed as binary classifiers; however, multi-class SVM classifiers can be constructed by using original SVMs as building blocks. One-vs-all and hierarchical tree based approaches are popular techniques for constructing multi-class SVM classifiers. A more detailed explanation of constructing multi-class SVMs can be found in [20], [21]. In this paper, a non-linear SVM with a Gaussian Radial Basis Function (RBF) kernel is used. There are two parameters for an RBF kernel: the penalty term $C$ and the kernel width $\gamma$. We estimate these parameters using cross-validation and a grid search.
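For illustration, the model selection step can be sketched as follows; Python with scikit-learn is an assumed implementation choice (the paper does not name a toolbox), and the parameter grids are hypothetical placeholders rather than the values used in our experiments.

# Minimal sketch: RBF-SVM model selection via cross-validated grid search.
# scikit-learn is an assumed implementation; the grids below are
# illustrative placeholders, not the values used in this paper.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_rbf_svm(X_train, y_train):
    """Fit an RBF-kernel SVM, choosing C and gamma by grid search."""
    param_grid = {
        "C": [1, 10, 100, 1000],           # penalty term C
        "gamma": [1e-3, 1e-2, 1e-1, 1.0],  # RBF kernel width gamma
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_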
B. Multi-Classifier System
The concept of combining the predictions of multiple classifiers to produce a single classifier has been proposed by various researchers in the past [22], [23]. In the literature, this concept is referred to as ensemble classifiers or MCS. The resulting MCS is generally more accurate than the individual classifiers that form it. An effective MCS is one where the individual classifiers are accurate and make their classification errors on different parts of the input space. Combining the predictions of identical classifiers yields no improvement; combination is useful only when there is disagreement among the individual classifiers. In [24], Krogh et al. proved that the overall classification error can be divided into a quantity that is the average generalization error of the individual classifiers and a quantity proportional to the disagreement among the classifiers. From [25], [26] it can be concluded that an ideal MCS should consist of classifiers that have the highest disagreement possible. Bagging [15] and boosting [17] are very popular methods used to create diversity among classifiers. Bryll et al. [16] introduced an improved approach called attribute bagging, which was followed by the development of many wrapper based MCS approaches: each classifier is trained with an independently and randomly selected feature subset, and the outputs, expected to be diverse, are combined to form a final decision. Breiman [27] introduced a decision tree (DT) based classification approach with Random Forests (RF), Min [13] proposed a dynamic subspace approach, and Waske et al. proposed the construction of SVM ensembles using RFS [12]. These are some of the approaches, inspired by the basic idea of bagging and boosting classifiers, that have been successfully used in hyperspectral applications. In [36], Jacobs et al. proposed a mixture-of-experts approach, followed by [37]. In [38], Kumar et al. demonstrated the effectiveness of this technique on hyperspectral data with binary classifiers for a multiclass problem; in their work, partitioning of groups of classes is achieved by binary classifiers at different levels.
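For reference, the error decomposition of Krogh et al. [24] mentioned above can be stated compactly; the symbols below are our notation, not those of [24]:

$$E = \bar{E} - \bar{A}, \qquad \bar{A} \ge 0$$

where $E$ is the ensemble generalization error, $\bar{E}$ is the (weighted) average generalization error of the individual classifiers, and $\bar{A}$ is their average ambiguity, i.e., the disagreement with the ensemble output; since $\bar{A} \ge 0$, the ensemble is never worse than the average of its members.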
III. PROPOSED APPROACH
In this paper, we propose an SVM-based MCS with unequal numbers of features being selected from different spectral regions, resulting in a non-uniform random feature selection.
A. Preliminaries
The hyperspectral dataset is assumed to have $C$ classes, each represented as $\omega_i$, $i = 1, 2, \ldots, C$, and $N_i$ is the number of samples in $\omega_i$. Samples in $\omega_i$ are denoted as $x_{i,j}$, $j = 1, 2, \ldots, N_i$, where $x_{i,j}$ denotes the $j$-th sample of the class $\omega_i$. Samples from different classes can be grouped together to form a super class, represented as $\Omega$. A feature vector is $d$-dimensional and each feature is represented by $f_k$, i.e., $F = \{f_1, f_2, \ldots, f_d\}$. We define the normalized distance between any two sets of samples with respect to feature $f_k$ as $D(\omega_i, \omega_j \mid f_k)$ and $D(\omega_i, \Omega \mid f_k)$, where $D(\omega_i, \omega_j \mid f_k)$, defined in (3), is the distance between two classes $\omega_i$ and $\omega_j$, and $D(\omega_i, \Omega \mid f_k)$, defined in (4), is the distance between the two sets $\omega_i$ and $\Omega$.
B. Proposed Non-Uniform RFS Strategy
In an RFS based multi-classifier system [12], a subset of features is selected by random sampling from the complete set of features, with indices that follow a uniform distribution. Fig. 1(a) illustrates two examples of equally likely, uniformly distributed spectral band feature selections, where $d_1$ has highly correlated bands as compared to $d_2$. An obvious way to avoid this situation, as shown in Fig. 1(b), is to divide the spectrum uniformly into smaller regions, perform feature selection within each region, and concatenate the selected features. The outcome of this approach depends on the choice of the number of partitions and the partition boundaries.
Fig. 2. NU-RFS feature selection from original data.
Fig. 1. (a) Two examples of equally likely uniformly distributed spectral band feature selection, where d1 has highly correlated bands compared to d2. (b) Example of uniform partitioning of spectral bands (shown in blue). (c) Non-uniform partitioning of spectral bands with uniformly distributed feature selection per partition.
Features can still be correlated with this approach. Adjacent bands in a hyperspectral signature are typically highly correlated, and the rate of change in the correlation of neighboring bands varies along the spectrum. An intelligent way of partitioning the spectrum is to place a partition boundary at a point in the feature set where the correlation of neighboring bands changes drastically. This results in a non-uniform partitioning of the spectral bands, and bands selected from these non-uniform regions are expected to be less correlated. This is shown in Fig. 1(c).
To obtain an optimal set of partition boundaries, we use a band grouping strategy. In [28], the authors proposed an intelligent spectral partitioning technique that groups highly correlated bands into distinct contiguous subspaces and then used those partitions with an MCS. An intelligent (non-random) band-grouping is performed to partition the spectrum into subsets. In this approach, the diversity among the classifiers is gained by breaking up the spectrum into smaller groups. The region boundaries are automatically selected based on a bottom-up band-grouping strategy: starting with the first band, each successive band is added to the current group; when an addition significantly changes the performance metric employed, the growth of that group is stopped and a new group is started, resulting in a contiguous partitioning of the spectrum. The metric employed for band-grouping in this work is the product of the Bhattacharyya distance and correlation. For more details about band grouping, the reader can refer to [28], [30].
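As an illustration, such a bottom-up grouping can be sketched as follows; here metric(start, end) is a hypothetical callable scoring the contiguous band group [start, end) (in this work, the product of the Bhattacharyya distance and correlation), and tol is an assumed threshold on the change of the metric.

# Minimal sketch of bottom-up band grouping. `metric` and `tol` are
# assumptions of this illustration, not fixed by the paper.
def band_groups(n_bands, metric, tol):
    """Return (start, end) index pairs partitioning the spectrum."""
    groups, start = [], 0
    prev = metric(start, start + 1)     # metric of the initial one-band group
    for b in range(1, n_bands):
        cur = metric(start, b + 1)      # metric after adding band b
        if abs(cur - prev) > tol:       # drastic change: close the group
            groups.append((start, b))
            start, prev = b, metric(b, b + 1)
        else:
            prev = cur
    groups.append((start, n_bands))     # close the final group
    return groups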
In the proposed approach, the feature space is divided into $R$ distinct but contiguous regions in such a way that, in each region, the class separation is maximized and statistical dependence is minimized by band grouping. Let $m_r$ be the size of region $r$, let $d$ be the total number of features in the data, and let $n_r$ be the number of features selected from region $r$, which is directly proportional to $m_r$. Then, the total number of features selected for each classifier is

$$n = \sum_{r=1}^{R} n_r \quad (1)$$
Fig. 2 illustrates this setup. Since uniform RFS is performed in each region separately, this approach can be thought of as a piece-wise uniform RFS. Since HSI data exhibit high correlation between consecutive bands, there is a good chance of consecutive bands being assigned to different classifiers when using uniform RFS in an MCS. These highly correlated bands would clearly affect the diversity of the ensemble and thereby reduce the robustness of the MCS approach. In the proposed NU-RFS, however, the features for the individual classifiers in the resulting MCS are drawn in a non-uniform fashion, creating greater diversity among the classifiers compared to a uniform random selection. Experimentally, we observed that this approach can result in better ensembles whose features are less correlated, owing to the fact that the probability of features that are spectrally close to each other being sent to different learners is very low. The above procedure is applied to each classifier in the MCS separately, unlike
[12]. With initial experiments presented in [18], we found that the size of the region $m_r$ plays an important role in the overall classification performance. As recommended in [29], approximately half of the features in each region are selected for each classifier. The proposed MCS system is shown in Fig. 3.

Fig. 3. Proposed NU-RFS based Multi-Classifier System.
The random subspace selection demonstrated in [12], [13], [29] for MCS aims to create diversity among classifiers, and the aforementioned techniques construct ensembles using bagging and boosting variants. We believe that the optimal subset for creating maximum diversity need not come from a uniform RFS, because there is a very good chance that nearby spectral bands get grouped into the same classifier, which clearly yields classifiers with correlated features. Such similar groupings of features can sometimes affect the diversity by forcing the classifiers to commit similar errors. The proposed approach described in Fig. 3 attempts to alleviate this issue. The proposed approach is compared against the regular SVM, RFS, and NU-RFS using band grouping only.
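To make the selection step concrete, a minimal sketch of NU-RFS is given below: a uniform random subset is drawn within each band group, with the subset size proportional to the group size as in (1). The group boundaries, the fraction rho (0.5 here, loosely following [29]), and the ensemble size are illustrative assumptions.

# Minimal sketch of NU-RFS: uniform RFS inside each band group, with the
# per-group sample size proportional to the group size, as in (1).
import numpy as np

def nu_rfs(groups, rho, rng):
    """Sample feature indices: uniform random selection within each group."""
    selected = []
    for start, end in groups:
        k = max(1, int(round(rho * (end - start))))  # features from this region
        selected.extend(rng.choice(np.arange(start, end), size=k, replace=False))
    return np.sort(np.array(selected))

# Usage: one independent draw per classifier in the ensemble.
rng = np.random.default_rng(0)
groups = [(0, 35), (35, 102), (102, 150), (150, 220)]  # hypothetical partitions
subsets = [nu_rfs(groups, rho=0.5, rng=rng) for _ in range(8)]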
C. Uniform Voting and Kernel Density Decision Fusion
NU-RFS produces a group of features to be used for training each classifier. Each of these feature sets has a unique class separation capability, since each forms a different combination of the original feature set. In order to make use of this uniqueness in our system, we estimate, for each classifier, a set of scores proportional to the classifier's ability to separate each class from all the other classes. For example, if there are $K$ classifiers and the data has $C$ classes, then we generate a score matrix of size $K \times C$. These scores are computed by kernel density estimation across all the features. Oh et al. proposed an approach to estimate class separation [31] for handwriting recognition; our approach is inspired by the algorithm proposed in [31].
After performing NU-RFS, we have a set of training data for each classifier. A probability density for a class is estimated for a feature vector $x$. A probability distribution for the class $\omega_i$ can be computed by

$$\hat{p}(x \mid \omega_i) = \frac{1}{N_i h} \sum_{j=1}^{N_i} \kappa\!\left(\frac{x - x_{i,j}}{h}\right) \quad (2)$$

where $\kappa(\cdot)$ is the kernel function and $h$ is the smoothing parameter. We have tested the rectangular, normal, triangular and Epanechnikov kernel functions [34] in the proposed system.
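A minimal sketch of (2) along a single feature is given below, with the four kernel functions tested; the vectorized evaluation is an implementation convenience of this illustration.

# Minimal sketch of the kernel density estimate in (2) for one feature.
import numpy as np

KERNELS = {
    "rectangular":  lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: np.maximum(1 - np.abs(u), 0),
    "normal":       lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
    "epanechnikov": lambda u: 0.75 * np.maximum(1 - u**2, 0),
}

def kde(x, samples, h, kernel="rectangular"):
    """Estimate p(x | class) from the class samples along one feature."""
    u = (x[:, None] - samples[None, :]) / h   # pairwise scaled distances
    return KERNELS[kernel](u).sum(axis=1) / (len(samples) * h)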
Let $D(\omega_i, \omega_j \mid f_k)$ be the distance between any two classes and $D(\omega_i, \Omega \mid f_k)$ be the distance between a given class and all other classes; these can be computed by (3) and (4), respectively:

$$D(\omega_i, \omega_j \mid f_k) = 1 - \int \min\left\{\hat{p}(x \mid \omega_i),\, \hat{p}(x \mid \omega_j)\right\} dx \quad (3)$$

$$D(\omega_i, \Omega \mid f_k) = \frac{1}{C - 1} \sum_{j \neq i} D(\omega_i, \omega_j \mid f_k) \quad (4)$$

where $\hat{p}(x \mid \omega_i)$ and $\hat{p}(x \mid \omega_j)$ are the class distributions of $\omega_i$ and $\omega_j$, respectively, estimated along feature $f_k$. When there is complete overlap between the distributions, (3) gives its minimum, and no overlap gives its maximum; i.e., when there is complete overlap, $f_k$ cannot distinguish the two classes, whereas it distinguishes them best when there is no overlap. Thus (3) quantifies the ability of $f_k$ to differentiate between any two classes $\omega_i$ and $\omega_j$. Equation (4) is computed for every class, and the values are then averaged over all the selected features, resulting in an array of scores of the separability of each class with respect to the selected features.
These scores are sorted in descending order, where a higher score indicates a greater ability to classify a class. This is shown in Fig. 3 as "compute class score." This process is repeated for every classifier in the MCS, resulting in a $K \times C$ score matrix representing the ability of each classifier to distinguish a particular class from all others. We denote these scores as $S$. Each row of this matrix corresponds to the ability of one classifier to distinguish the classes, where the higher the value of $S$, the higher the chance of distinguishing that class from all other classes. Although estimating a class probability density function is a harder problem than classification, the aim of estimating these scores is to get a coarse estimate of separation which can be used during decision fusion.
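Putting (2)-(4) together, the following sketch computes the per-class scores for one classifier's feature subset, using the kde function from the sketch above; the grid resolution, bandwidth, and the simple numerical integration are assumptions of this illustration.

# Minimal sketch of the class-separation scores for one classifier.
import numpy as np

def class_scores(X, y, feat_idx, h=0.1, kernel="rectangular"):
    """Return one separability score per class for a feature subset."""
    classes = np.unique(y)
    scores = np.zeros(len(classes))
    for f in feat_idx:
        grid = np.linspace(X[:, f].min(), X[:, f].max(), 256)
        dx = grid[1] - grid[0]
        dens = [kde(grid, X[y == c, f], h, kernel) for c in classes]  # eq. (2)
        for i in range(len(classes)):
            d = [1.0 - np.minimum(dens[i], dens[j]).sum() * dx        # eq. (3)
                 for j in range(len(classes)) if j != i]
            scores[i] += np.mean(d)                                   # eq. (4)
    return scores / len(feat_idx)  # average over the selected features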
After estimating the score matrix, the actual classification is performed with all the SVM classifiers in the MCS, resulting in class labels for every test sample from each classifier. Let $T$ be the number of test samples, with $K$ classifiers in the MCS; the resulting class labels can then be represented as a $K \times T$ matrix $L$. Each column $L_t$ of this matrix holds the predictions for one test sample from the different classifiers in the MCS. In the hard decision fusion scenario, the final classification decision can be obtained by a majority vote over the classifiers. The final decision $\hat{y}_t$ of test sample $t$ can be obtained from $L_t$ by (5):

$$\hat{y}_t = \operatorname{mode}(L_t) \quad (5)$$

$$q = g(S) \quad (6)$$

where (6) gives the length $q$ of the appended label array as an empirically determined function $g$ of the corresponding score $S$; the appending procedure is described below.
Fig. 4. Sensitivity of various algorithms to the number of training samples with AVIRIS Indian Pines data (error bars correspond to 95% confidence intervals).
Mathematically, the mode gives the most frequently occurring event. By means of the majority vote, we obtain a hard decision fusion; this uses only the predictions of the classifiers in the MCS. The voting scheme described in (5) is uniform voting, i.e., each classifier in the MCS has equal strength in deciding the final class label. We propose a voting mechanism based on the scores, where the strength of each classifier is modified according to its ability to separate a particular class from all other classes. This is achieved by creating a modified class label column for each test sample based on the corresponding score $S$: the column is appended with an array of length $q$ whose elements hold the class label corresponding to the highest $S$, where the length $q$ can vary with $S$ as given in (6). From our experiments with various datasets, we arrived at (6); the modified column is then used to perform the majority vote. The decisions of the MCS are not modified when $q = 0$. These scores bias the majority voting decision based on the strengths and weaknesses of each classifier.
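The following sketch illustrates the biased vote under one plausible reading of the procedure above: each classifier's predicted label is replicated q times before the mode is taken, with q obtained from the empirical mapping g of (6) applied to that classifier's score for the predicted class; g itself is a hypothetical stand-in here, and class labels are assumed to be integer indices.

# Minimal sketch of score-biased majority voting. `labels` is the K x T
# prediction matrix, `scores` the K x C score matrix, and `g` a hypothetical
# stand-in for the empirical score-to-length mapping of (6).
import numpy as np

def fuse(labels, scores, g):
    K, T = labels.shape
    fused = np.empty(T, dtype=labels.dtype)
    for t in range(T):
        column = list(labels[:, t])               # plain majority-vote column
        for k in range(K):
            c = labels[k, t]                      # prediction of classifier k
            column.extend([c] * g(scores[k, c]))  # bias toward high-score calls
        vals, counts = np.unique(column, return_counts=True)
        fused[t] = vals[np.argmax(counts)]        # the mode, as in (5)
    return fused

# Usage with a hypothetical thresholded mapping g:
# fused = fuse(labels, scores, g=lambda s: 2 if s > 0.8 else 0)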
IV. EXPERIMENTAL SETUP AND RESULTS
A. Experimental Dataset and Setup
We have used three hyperspectral datasets representing different analysis tasks—two datasets representing an agricultural
problem, where classes are vegetation cover types, and the third
representing an urban classification problem.
The first experimental HSI dataset employed was acquired
using NASA’s AVIRIS sensor and was collected over northwest
Indiana’s Indian Pine test site in June 1992 [32]. The image
represents a vegetation-classification scenario with 145 × 145 pixels and 220 bands in the 400 to 2450 nm region of the visible and infrared spectrum. This dataset has 16 classes.
The second experimental HSI dataset was acquired over
north Mississippi’s Blackbelt Experiment Station agricultural
test site in June 2008. The dataset has seven classes, each representing a level of chemical stress on a corn crop [35]. The corn crop, grown under controlled conditions, was induced with varying degrees of chemical stress: the crop was sprayed with seven different concentrations of Glufosinate herbicide diluted with water, where the seven classes or concentrations were 0 (control), 1/32, 1/16, 1/8, 1/4, 1/2, and 1 times the labeled rate of the herbicide. This dataset was acquired using a handheld Analytical Spectral Devices (ASD) sensor, resulting in HSI data with 2151 bands. Since all seven classes in this dataset represent the same species under varying degrees of stress, it makes for a very challenging classification problem.
The third dataset has 102 spectral bands and was acquired by the ROSIS sensor over Pavia, northern Italy. This dataset has 9 classes: Water, Trees, Asphalt, Self-Blocking Bricks, Bitumen, Tiles, Shadows, Meadows, and Bare Soil. For this dataset, considering the very high number of samples from each class, model selection is conducted on a subset of training samples rather than on all samples from each class. These model parameters are then used to train the SVM classifiers.
The classification is performed using an SVM with a Gaussian RBF kernel for all experiments [35]. Model selection for the SVM is performed by using cross-validation and a grid search. For all the datasets, the RBF parameters $C$ and $\gamma$ are estimated by selecting 10% of the training samples from each class and performing a grid search using cross-validation, except for the Pavia dataset, where we used 5% of the training data for model selection. We compute a confusion matrix for every classification problem and then estimate the user, producer, and overall accuracies.
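For clarity, the accuracy measures are obtained from the confusion matrix as in the following sketch; taking rows as reference (true) classes and columns as predictions is our assumed orientation.

# Minimal sketch: overall, user, and producer accuracies from a confusion
# matrix with rows = reference classes and columns = predicted classes.
import numpy as np

def accuracies(cm):
    cm = np.asarray(cm, dtype=float)
    overall = np.trace(cm) / cm.sum()         # fraction correctly classified
    producer = np.diag(cm) / cm.sum(axis=1)   # per reference class (omission)
    user = np.diag(cm) / cm.sum(axis=0)       # per predicted class (commission)
    return overall, user, producer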
B. Results With the AVIRIS Indian Pines Dataset
Experimental results demonstrate the superiority of the proposed approach compared to SVM and RFS. The study of overall accuracy for various numbers of training samples is shown in Fig. 4. At 10% training, NU-RFS with kernel density based fusion achieved an overall accuracy of 93.7% with the rectangular kernel, while RFS and NU-RFS achieve 81.5% and 80.3%, respectively. Interestingly, SVM with kernel scoring performs better than RFS and band grouping based NU-RFS. The proposed kernel scoring based NU-RFS outperforms the other approaches by 10%. The maximum overall accuracy achieved is 97.3% with 50% training. For all our experiments, we have used an ensemble size of 8; we observed that increasing the ensemble size does not provide any significant improvement beyond 8. This is similar to an observation made by Waske et al. [12] when using simple RFS.
The performance of the various algorithms with respect to sample size follows a very interesting trend: the proposed approach clearly outperforms the other techniques compared. In Fig. 4, the rectangular kernel is used to compute the density. Fig. 5 shows a comparison of the various kernel functions with respect to the number of samples used for training.
Fig. 5. Sensitivity of proposed approach with different kernel functions to
number of training samples with AVIRIS Indian Pines data.
Fig. 7. Producer accuracies of different classes of Indian Pines data with various algorithms.
Fig. 6. User accuracies of different classes of Indian Pines data with various
algorithms.
In this case, the rectangular and triangular kernels perform almost equally well, and better than the normal and Epanechnikov kernels. The error bars shown correspond to 95% confidence intervals. From experimentation, it is found that the performance of the classifier initially increases with an increasing value of the smoothing parameter $h$ and decreases after reaching a particular value. In order to maintain uniformity among the various experiments, a single value of $h$ is used, as this yields the best performance for all three datasets.
Figs. 6 and 7 illustrate the user accuracies (UA) and producer accuracies (PA) for each class in the Indian Pines dataset. We observe a consistent improvement (2–40%) in both user and producer accuracies across all classes when employing the proposed kernel density based scoring approach. This is expected, as the confusion between classes is reduced via the proposed scoring approach. The standard deviation is shown as error bars for the user and producer accuracies; the deviation is approximately 0.8% for both.
C. Results With the Corn Stress Dataset
The study of the sensitivity of the various classifiers to different training sample sizes reveals an interesting pattern. As observed with Indian Pines, the proposed approach handles the small sample size situation better than the other approaches. Fig. 8 shows a comparison of the overall accuracy versus the number of training samples. Systems based on NU-RFS exhibit a 1.5 to 3% increase in overall performance. It is worth pointing out that the performance of the kernel scoring NU-RFS algorithm is above 99% with a sample size of 10 samples/class, where the single SVM and original RFS algorithms produce accuracies of approximately 93% and 95%, respectively. Fig. 9 shows the performance of the different kernel types used; the rectangular kernel performs better than the other kernels with small training sample sizes.

Fig. 8. Sensitivity of various algorithms to number of training samples with corn stress data.
Figs. 10 and 11 illustrate the user and producer accuracies for each class in the corn stress dataset. A similar increase in the user and producer accuracies is observed as with the Indian Pines data. The standard deviation is shown as error bars for the user and producer accuracies; the deviation is approximately 0.1% for both. Tables I and II show the confusion matrices for classification without feature selection and with kernel scoring NU-RFS (triangular kernel function) with 10% training data, respectively. Both user accuracies (UA) and producer accuracies (PA) are improved with the proposed feature selection. Overall accuracies of the other feature selection approaches and other kernels are shown in Figs. 8 and 9.
TABLE II
SVM CLASSIFICATION WITH KERNEL SCORING NU-RFS
Fig. 9. Sensitivity of proposed approach with different kernel functions to
number of training samples with corn stress data.
Fig. 12. Sensitivity of various algorithms to number of training samples with
Pavia data.
D. Results With the Pavia, Italy Dataset
Fig. 10. User accuracies of different classes of corn stress data with various
algorithms.
Fig. 11. Producer accuracies of different classes of corn stress data with various
algorithms.
TABLE I
SVM CLASSIFICATION WITH NO FEATURE SELECTION
The experimental results with the Pavia, Italy dataset show an improvement in overall classification accuracy compared to the other algorithms. Fig. 12 shows the performance of the proposed approach for various percentages of training samples. Kernel density based NU-RFS achieves a gain of 7% and also performs well under limited training samples. Both kernel density based approaches, combined with SVM and with NU-RFS, show superior performance over all the other approaches, which demonstrates the effectiveness of the proposed decision fusion approach. Fig. 13 illustrates the performance with different kernel functions. Figs. 14 and 15 show the user and producer accuracies for each class of the Pavia dataset. The water, trees, bitumen, tiles, and bare soil classes gained an improvement of 1 to 5%; a similar improvement is seen for the kernel scoring techniques without feature selection. The standard deviation is shown as error bars for the user and producer accuracies; the deviation is approximately 0.2% for both.
V. DISCUSSIONS AND CONCLUSION
An SVM based MCS with NU-RFS is developed in this work for hyperspectral classification. The overall accuracies are significantly higher compared to regular SVM based single classifiers and uniform RFS based MCS. NU-RFS appears to handle the small sample size situation better than the other MCS techniques in our comparison study. SVMs are known to handle small sample size situations better than statistical classifiers such as ML; however, we have observed that a single SVM classifier also suffers from the curse of dimensionality. The number of
features that we selected for each classifier is consistent with previous RFS implementations [12], [29]. We also conducted experiments with different values of $n_r$ and $m_r$, and where we selected more features, the accuracy improved in some regions, though the impact was marginal. The user and producer accuracies with the proposed approach show a consistent improvement when compared to other conventional approaches.

We believe that a study using better decision fusion strategies such as the Linear Opinion Pool (LOP) may yield further improvements, because it would provide a soft fusion by using the distances from the samples to the SVM hyperplane. It is important to note that the proposed NU-RFS with kernel density scoring performs particularly well in small sample size situations, and hence it will be interesting to explore the possibilities of using it with semi-supervised learning for datasets with few ground truth points. We are testing these aspects in ongoing work.

Fig. 13. Sensitivity of proposed approach with different kernel functions to number of training samples with Pavia data.

Fig. 14. User accuracies of different classes of Pavia data with various algorithms.

Fig. 15. Producer accuracies of different classes of Pavia data with various algorithms.

REFERENCES
[1] S. G. Beaven, D. Stein, and L. E. Hoff, “Comparison of Gaussian mixture and linear mixture models for classification of hyperspectral data,”
in Proc. IEEE IGARSS, 2000, vol. 4, pp. 1597–1599.
[2] B. Demir and S. Erturk, “Increasing hyperspectral image classification
accuracy for data sets with limited training samples by sample interpolation,” in Proc. 4th Int. Conf. Recent Advances in Space Technologies,
2009, 2009, pp. 367–369.
[3] Q. Jackson and D. A. Landgrebe, “An adaptive classifier design for
high-dimensional data analysis with a limited training data set,” IEEE
Trans. Geosci. Remote Sens., vol. 39, pp. 2664–2679, Dec. 2001.
[4] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote
sensing images with support vector machines,” IEEE Trans. Geosci.
Remote Sens., vol. 42, pp. 1778–1790, Aug. 2004.
[5] J. A. Benediktsson, C. Garcia, B. Waske, J. Chanussot, J. R. Sveinsson,
and M. Fauvel, “Ensemble methods for classification of hyperspectral
data,” in Proc. IEEE IGARSS 2008, pp. I-62–I-65.
[6] S. Prasad and L. M. Bruce, “A robust multi-classifier decision fusion
framework for hyperspectral, multi-temporal classification,” in Proc.
IEEE IGARSS 2008, pp. II-273–II-276.
[7] S. Prasad and L. M. Bruce, A Divide-and-Conquer Paradigm for
Hyperspectral Classification and Target Recognition Optical Remote
Sensing. Berlin Heidelberg: Springer, 2011, vol. 3, pp. 99–122.
[8] C. Mingmin, K. Qian, J. A. Benediktsson, and R. Feng, “Ensemble
classification algorithm for hyperspectral remote sensing data,” IEEE
Geosci. Remote Sens. Lett., vol. 6, pp. 762–766, Oct. 2009.
[9] M. S. Haghighi, A. Vahedian, and H. S. Yazdi, “Creating and measuring
diversity in multiple classifier systems using support vector data description,” Elsevier Applied Soft Computing, vol. 11, pp. 4941–4942,
Dec. 2011.
[10] R. Ranawana, “Intelligent multi-classifier design methods for the classification of imbalanced data sets—Application to DNA sequence analysis,” Ph.D. dissertation, Univ. of Oxford, Oxford, U.K., 2007.
[11] G. Brown, J. Waytt, R. Harris, and X. Yao, “Diversity creation methods:
A survey and categorisation,” J. Information Fusion, vol. 6, 2005.
[12] B. Waske, S. van der Linden, J. A. Benediktsson, A. Rabe, and P.
Hostert, “Sensitivity of support vector machines to random feature
selection in classification of hyperspectral data,” IEEE Trans. Geosci.
Remote Sens., vol. 48, pp. 2880–2889, Jul. 2010.
[13] Y. Jinn-Min, K. Bor-Chen, Y. Pao-Ta, and C. Chun-Hsiang, “A dynamic subspace method for hyperspectral image classification,” IEEE
Trans. Geosci. Remote Sens., vol. 48, pp. 2840–2853, Jul. 2010.
[14] J. Ham, C. Yangchi, M. M. Crawford, and J. Ghosh, “Investigation of
the random forest framework for classification of hyperspectral data,”
IEEE Trans. Geosci. Remote Sens., vol. 43, pp. 492–501, Mar. 2005.
[15] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp.
123–140, Aug. 1996.
[16] R. Bryll, R. G. Osuna, and F. Quek, “Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets,” Pattern Recogn., vol. 36, pp. 1291–1302, 2003.
[17] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proc. 13th Int. Conf. Machine Learning, Bari, Italy, 1996.
[18] S. Samiappan, S. Prasad, and L. M. Bruce, “Automated hyperspectral imagery analysis via support vector machines based multi-classifier system with non-uniform random feature selection,” in Proc. IEEE
IGARSS, Vancouver, Canada, 2011.
[19] C. J. C. Burges, “A tutorial on support vector machines for pattern
recognition,” Data Mining and Knowledge Discovery, vol. 2, pp.
121–167, 1998.
[20] D. J. Sebald and J. A. Bucklew, “Support vector machines and the multiple hypothesis test problem,” IEEE Trans. Signal Process., vol. 49,
pp. 2865–2872, 2001.
[21] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass
support vector machines,” IEEE Trans. Neural Networks, vol. 13, pp.
415–425, 2002.
[22] E. Alpaydin, “Multiple networks for function learning,” in Proc. IEEE
Int. Conf. Neural Networks, 1993, vol. 1, pp. 9–14.
[23] R. T. Clemen, “Combining forecasts: A review and annotated bibliography,” Int. J. Forecasting, vol. 5, pp. 559–583, 1989.
[24] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and active learning,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1995.
[25] D. W. Opitz, “Generating accurate and diverse members of a neuralnetwork ensemble,” in Advances in Neural Information Processing
Systems. Cambridge, MA: MIT Press, 1996.
[26] D. W. Opitz et al., “Actively searching for an effective neural-network
ensemble,” Connection Science, vol. 8, pp. 337–353, 1996.
[27] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32,
2001.
[28] S. Prasad and L. M. Bruce, “Decision fusion with confidence-based
weight assignment for hyperspectral target recognition,” IEEE Trans.
Geosci. Remote Sens., vol. 46, pp. 1448–1456, 2008.
[29] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp.
832–844, 1998.
[30] C. Simin, R. Zhang, W. Cheng, and H. Yuan, “Band selection of hyperspectral images based on Bhattacharyya distance,” WSEAS Trans.
Inf. Sci. and App., vol. 6, pp. 1165–1175, 2009.
[31] I.-S. Oh, J. S. Lee, and Y. S. Ching, “Analysis of class separation and
combination of class-dependent features for handwriting recognition,”
IEEE Trans. Pattern Anal. Machine Intell., vol. 21, pp. 1089–1094,
1999.
[32] Purdue University, MultiSpec Hyperspectral Data [Online]. Available: https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html (accessed Sep. 29, 2011).
[33] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Chichester, U.K.: Wiley, 2006.
[34] V. A. Epanechnikov, “Non-parametric estimation of a multivariate probability density,” Theory of Probability and Its Applications, vol. 14, pp. 153–158, 1969.
[35] M. A. Lee, S. Prasad, L. M. Bruce, T. R. West, D. Reynolds, T. Irby,
and H. Kalluri, “Sensitivity of hyperspectral classification algorithms
to training sample size,” in Proc. IEEE WHISPERS 2009, Grenoble,
France, 2009.
[36] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–87, 1991.
[37] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural Computation, vol. 6, pp. 181–214, 1994.
[38] S. Kumar, J. Ghosh, and M. M. Crawford, “Hierarchical fusion of multiple classifiers for hyperspectral data analysis,” Pattern Analysis & Applications, vol. 5, pp. 210–220, 2002.
Sathishkumar Samiappan (M’12) received the B.E.
degree in electronics and communication engineering
from Bharathiar University, Coimbatore, India, in
2003, and the M.Tech. degree in computer science,
with a major in computer vision and image processing, from Amrita University, Coimbatore, India,
in 2006. Since 2009, he has been working toward the
Ph.D. degree in electrical and computer engineering
at Mississippi State University, Starkville, MS.
Until 2009, he was a Lecturer in the Department
of Electronics and Communication Engineering,
Amrita University, Coimbatore, India. Since 2009, he has been a Graduate
Research Assistant with Geosystems Research Institute and Graduate Teaching
Assistant with the Department of Electrical and Computer Engineering at
Mississippi State University, Starkville, MS. His research interests include big
data problems, pattern recognition, image processing, machine learning and
hyperspectral image classification.
Saurabh Prasad (S’05–M’09) received the B.S. degree in electrical engineering from Jamia Millia Islamia, India, in 2003, the M.S. degree in electrical
engineering from Old Dominion University, Norfolk,
VA, in 2005, and the Ph.D. degree in electrical engineering from Mississippi State University, Starkville,
MS, in 2008.
He is an Assistant Professor in the Electrical and
Computer Engineering Department at the University
of Houston (UH), and is also affiliated with UH’s
Geosensing Systems Engineering Research Center
and the National Science Foundation (NSF) funded National Center for
Airborne Laser Mapping (NCALM). He is the Principal Investigator/Technical-lead on projects funded by the National Geospatial Intelligence Agency
(NGA), National Aeronautics and Space Administration (NASA), and Department of Homeland Security (DHS). His research interests include statistical
pattern recognition, adaptive signal processing and kernel methods for medical
imaging, optical and SAR remote sensing. In particular, his current research
work involves the use of information fusion techniques for designing robust
statistical pattern classification algorithms for hyperspectral remote sensing
systems operating under low-signal-to-noise-ratio, mixed pixel and small
training sample-size conditions.
Dr. Prasad is an active Reviewer for the IEEE TRANSACTIONS ON
GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE
SENSING LETTERS and the Elsevier Pattern Recognition Letters. He was
awarded the GRI’s Graduate Research Assistant of the Year award in May
2007, and the Office-of-Research Outstanding Graduate Student Research
Award in April 2008 at Mississippi State University. In July 2008, he received
the Best Student Paper Award at IGARSS’2008 held in Boston, MA. In October
2010, he received the State Pride Faculty Award at Mississippi State University
for his academic and research contributions. He was the Lead Editor of the
book entitled Optical Remote Sensing: Advances in Signal Processing and
Exploitation Techniques (2011).
Lori M. Bruce (S’90–M’96–SM’01) received the B.S., M.S., and Ph.D. degrees in electrical and computer engineering from the University of Alabama, Huntsville, and the Georgia Institute of Technology, Atlanta.
She is the Associate Dean for Research and Graduate Studies in the Bagley College of Engineering
at Mississippi State University. Dr. Bruce has been
a Faculty Member for 14 years, during which she
has taught approximately 40 engineering courses at
the undergraduate and graduate level. Her research
in image processing and remote sensing has been funded by the Department of
Homeland Security, the Department of Energy, the Department of Transportation, the National Aeronautics and Space Administration, the National Geospatial-Intelligence Agency, the National Science Foundation, the United States
Geological Survey, and industry, resulting in over 100 peer reviewed publications and the matriculation of more than 75 graduate students (25 as major professor and more than 50 as thesis/dissertation committee member).
Dr. Bruce is an active member of the IEEE Geoscience and Remote Sensing
Society, and she is a member of the Phi Kappa Phi, Eta Kappa Nu, and Tau
Beta Pi honor societies, and prior to becoming a faculty member, she held the
prestigious title of National Science Foundation Research Fellow.