
Talanta 132 (2015) 175–181
Dealing with heterogeneous classification problem in the framework
of multi-instance learning
Zhaozhou Lin a, Shuaiyun Jia a, Gan Luo a, Xingxing Dai a, Bing Xu a, Zhisheng Wu a, Xinyuan Shi a,b,*, Yanjiang Qiao a,b,*
a College of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100102, China
b Research Center of TCM-information Engineering, State Administration of Traditional Chinese Medicine of the People's Republic of China, Beijing 100102, China
* Corresponding authors at: College of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100102, China. Tel.: +86 10 84738621; fax: +86 84738661. E-mail addresses: [email protected] (X. Shi), [email protected] (Y. Qiao).
Article history:
Received 21 April 2014
Received in revised form 28 August 2014
Accepted 3 September 2014
Available online 16 September 2014

Abstract
To deal with the heterogeneous classification problem efficiently, each heterogeneous object was represented by a set of measurements obtained on different parts of it, and the heterogeneous classification problem was reformulated in the framework of multi-instance learning (MIL). Based on a variant of the count-based MIL assumption, a maximum count least squares support vector machine (maxc-LS-SVM) learning algorithm was developed. The algorithm was tested on a set of toy datasets. It was found that maxc-LS-SVM inherits the sound characteristics of both LS-SVM and the MIL framework. A comparison study between the proposed approach and two other MIL approaches (i.e., mi-SVM and MI-SVM) was performed on a real wolfberry fruit spectral dataset. The results demonstrate that by formulating the heterogeneous classification problem as a MIL one, it can be solved effectively by the proposed maxc-LS-SVM algorithm.
© 2014 Elsevier B.V. All rights reserved.
Keywords:
Heterogeneous spectra
Multi-instance learning (MIL)
Error-Correcting Output Codes (ECOC)
Maximum count least square support
vector machine (maxc-LS-SVM)
Geographical origins
1. Introduction
The application of near infrared (NIR) spectroscopy to classification has spread across the analysis of food, agricultural, petroleum, and pharmaceutical products [1–3]. However, spectra obtained on heterogeneous objects, such as corn and pharmaceutical tablets, are often of high variance, and applying common classification methods to such spectra yields weak conclusions. This is one typical category of the heterogeneous classification problem, which has not yet been solved thoroughly [4].
The key to solving the heterogeneous classification problem is to represent each heterogeneous object efficiently. In the literature, pre-treatment is the most successful and widely used technique for this purpose: a measurement protocol is designed to improve the representativeness of the measurements. Spectra measured using an integrating sphere [5–10], rotating the sample during spectral collection, or grinding the samples [11–14] are of this type. Significantly better results were observed when the measurement was taken by a patented measurement method [15]. However, when the spectra need
to be collected in situ, none of the above methods is applicable. Hwang et al. [16] developed a fast and non-destructive analytical method to identify the geographical origins of rice samples via transmission spectra collected through packed grains, but variation in packing thickness prevents the measurement from yielding reproducible spectra.
Instead of being represented by a single measurement per sample, an object can be represented by a set of measurements (instances). The heterogeneous classification problem can therefore be solved by formulating it as a multi-instance learning one. Instead of receiving labeled instances as in traditional supervised learning, the MIL learner receives a set of labeled bags. The majority of MIL studies are concerned with binary classification problems [17]. Most of these studies assume that a bag is labeled positive if at least one instance in it is positive, and negative if all the instances in it are negative.
Based on the classical MIL assumption, numerous MIL methods have been proposed in the literature, and most of them have been reviewed in earlier studies [17–19]. These algorithms mainly use the information of one instance from each positive bag. However, there is commonly more than one positive instance in a positive bag, so much of the information contained in these instances is lost.
The MIL assumption was first used to solve the problem of musk drug activity prediction [20]. Since then, many problems have been formulated as MIL problems, such as image categorization, object detection, and human activity recognition
[18,21,22]. Although all early work in MIL assumed a specific concept class known to be appropriate for drug-activity-prediction-like domains, the classical MIL assumption is not guaranteed to hold in other domains. Recently, a significant amount of research in MIL has been concerned with cases where the classical view of the MIL problem is relaxed and alternative assumptions are considered instead [17]. Although not all of these works clearly state which particular assumption is used and how it relates to other assumptions, the use of alternative assumptions has been clarified by reviewing the studies in this area.
In this work, a variant of the count-based binary MIL assumption was adopted. It assumes that a bag is labeled positive if the product of the positive posteriors of its instances, weighted by the bag prior, is larger than that of the negative ones; otherwise, the bag is labeled negative. Based on this assumption, the maxc-LS-SVM algorithm was proposed to deal with the heterogeneous classification problem. For multi-class cases, the original classification problem was decomposed in the framework of Error-Correcting Output Codes (ECOC) by the one-versus-one design [23]. The maxc-LS-SVM algorithm was modified correspondingly, and its performance was compared with the mi-SVM and MI-SVM [24] algorithms. The results of applying these approaches to toy datasets and a real herbal dataset clearly show the advantage of the maxc-LS-SVM algorithm.
2. Theory and algorithm
2.1. Least squares-support vector machine (LS-SVM)
This study only briefly reviews the basic concepts of LS-SVM because its theory has been described extensively in the literature [25,26]. LS-SVM offers advantages similar to those of SVM but additionally requires solving only a set of linear equations, which is much easier and computationally very cheap.
The objective function defined in LS-SVM is [27]

$$\min_{\omega, b, e} \; J_P(\omega, e) = \frac{1}{2}\omega^T \omega + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2$$

$$\text{subject to (s.t.):} \quad y_i \left[ \omega^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N \qquad (1)$$
where ω denotes the normal vector to the classification hyperplane, γ is the hyperparameter tuning the amount of regularization versus the sum of squared errors, e_i is the error variable, and φ(x) is the nonlinear map from the original space to a high (and possibly infinite) dimensional space.
To solve the optimization problem efficiently, a Lagrange
function is constructed and translated into its dual form
$$L(\omega, b, e, \alpha) = J_P(\omega, e) - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ \omega^T \varphi(x_i) + b \right] - 1 + e_i \right\} \qquad (2)$$
where αi values are the Lagrange multipliers.
The conditions for optimality are

$$\frac{\partial L}{\partial \omega} = 0 \;\rightarrow\; \omega = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$$
$$\frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, N$$
$$\frac{\partial L}{\partial \alpha_i} = 0 \;\rightarrow\; y_i \left[ \omega^T \varphi(x_i) + b \right] - 1 + e_i = 0, \quad i = 1, \ldots, N \qquad (3)$$

From the conditions for optimality, it can be concluded that no α_i values will be exactly zero, meaning that the advantage of automatic sparseness is lost. However, after constructing the Lagrangian, the model can be trained much more efficiently by solving the linear Karush–Kuhn–Tucker (KKT) system, since
it yields a linear system instead of a quadratic programming
problem
"
# " #
0
0
1TN
b
¼
ð4Þ
y
1N Ω þ γ 1 I N
a
where y is a vector containing the reference values, 1N is a [N 1]
vector of ones and I is an [N N] identify matrix. Ω is the kernal
matrix defined by Ωij ¼ φ(xi)Tφ(xj)¼ K(xi,xj).
The classifier in the dual space takes the form

$$y(x) = \mathrm{sign}\left[ \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right] \qquad (5)$$
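As a concrete illustration of Eqs. (4) and (5), the following minimal sketch (not the authors' code, which was based on LS-SVMlab in Matlab; this is plain numpy with hypothetical function names and an RBF kernel chosen for illustration) trains a binary LS-SVM by solving the linear KKT system and classifies new points:

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0):
    # K(a, b) = exp(-||a - b||^2 / sigma2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def lssvm_train(X, y, gamma=10.0, kernel=rbf_kernel):
    # Solve the KKT system of Eq. (4): [[0, 1^T], [1_N, K + I/gamma]] [b; alpha] = [0; y]
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = kernel(X, X) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.asarray(y, float))))
    return sol[1:], sol[0]  # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, kernel=rbf_kernel):
    # Eq. (5); with y on the right-hand side of Eq. (4), the labels y_i
    # are already absorbed into the solved alpha_i
    return np.sign(kernel(X_new, X_train) @ alpha + b)
```

A typical call would be `alpha, b = lssvm_train(X, y)` followed by `lssvm_predict(X, alpha, b, X_new)`; note that, as stated above, none of the α_i is exactly zero, so all training points enter the prediction sum.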
2.2. Maximum margin formulation of MIL
In the traditional supervised learning framework, an object is represented by one single instance, i.e., a measurement vector, and is associated with a class label; the goal is to induce a classifier to label instances. MIL, in contrast, groups instances into bags, and each bag is attached to a class label. More formally, an object is represented by a bag B = {x_1, x_2, …, x_p}, which contains a set of D-dimensional instances. Each bag is associated with a label Y.
Both the mi-SVM and MI-SVM approaches [24] are modified and extended from Support Vector Machines (SVMs). The mi-SVM approach explicitly treats the instance labels as unobserved integer variables subject to constraints defined by the (positive) bag labels. A generalized soft-margin SVM is defined as follows in primal form
$$\min_{\{y_i\}} \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$

$$\text{s.t.} \quad \forall i: \; y_i \left( \langle w, x_i \rangle + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad y_i \in \{-1, 1\} \qquad (6)$$
where ξ_i is a non-negative slack variable, C is the hyperparameter tuning the trade-off between the degree of misclassification and the margin, and y_i is the label of instance i.
The mi-SVM formulation leads to a mixed integer programming problem: one has to maximize a soft-margin criterion jointly over the possible label assignments and the hyperplane. The MI-SVM approach, in contrast, extends the notion of margin to bags and maximizes the bag margin directly. The bag margin with respect to a hyperplane is defined by

$$\gamma_I \equiv Y_I \max_{i \in I} \left( \langle w, x_i \rangle + b \right) \qquad (7)$$
Incorporating the above bag margin, an MIL version of the soft-margin classifier is defined by

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_I \xi_I$$

$$\text{s.t.} \quad \forall I: \; Y_I \max_{i \in I} \left( \langle w, x_i \rangle + b \right) \geq 1 - \xi_I, \quad \xi_I \geq 0 \qquad (8)$$
Unfolding the max operation, by introducing one inequality constraint per instance for the negative bags and a selector variable s(I) ∈ I denoting the instance selected as positive in each positive bag, one obtains the following equivalent formulation
$$\min_{s} \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_I \xi_I$$

$$\text{s.t.} \quad \forall I: \; Y_I = -1 \;\wedge\; -\langle w, x_i \rangle - b \geq 1 - \xi_I \;\; \forall i \in I, \quad \text{or} \quad Y_I = 1 \;\wedge\; \langle w, x_{s(I)} \rangle + b \geq 1 - \xi_I, \quad \xi_I \geq 0 \qquad (9)$$
MI-SVM can also be cast as a mixed-integer program: one has to find both the optimal selectors and the hyperplane. A heuristic optimization scheme has been proposed to solve these mixed-integer programs by alternating the following two steps: (i) for given integer variables, solve the associated Quadratic Programming (QP) problem and find the optimal discriminant function;
(ii) for a given discriminant function, update the label of each
instance in mi-SVM or the single selector variable in MI-SVM.
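The following sketch shows one way to implement this alternating heuristic for MI-SVM, assuming scikit-learn's SVC as the QP solver of step (i) and a hypothetical data layout (a list of per-bag instance arrays plus ±1 bag labels); the initialization of the selector variables is an arbitrary choice here:

```python
import numpy as np
from sklearn.svm import SVC

def mi_svm_train(bags, bag_labels, C=1.0, max_iter=20):
    # bags: list of (p_i, D) arrays; bag_labels: +1/-1 per bag
    pos = [i for i, Y in enumerate(bag_labels) if Y == 1]
    neg_X = np.vstack([bags[i] for i, Y in enumerate(bag_labels) if Y == -1])
    witness = {i: 0 for i in pos}  # arbitrary initial selector s(I) per positive bag
    clf = None
    for _ in range(max_iter):
        pos_X = np.vstack([bags[i][witness[i]] for i in pos])
        X = np.vstack([pos_X, neg_X])
        y = np.r_[np.ones(len(pos)), -np.ones(len(neg_X))]
        clf = SVC(kernel="linear", C=C).fit(X, y)  # step (i): solve the QP
        # step (ii): re-select each positive bag's witness instance
        new = {i: int(np.argmax(clf.decision_function(bags[i]))) for i in pos}
        if new == witness:
            break
        witness = new
    return clf

def mi_svm_bag_predict(clf, bag):
    # Eq. (7): the bag takes the sign of its maximal instance margin
    return 1 if clf.decision_function(bag).max() > 0 else -1
```

The analogous mi-SVM loop would instead relabel every instance of the positive bags in step (ii), forcing at least one positive instance per positive bag.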
2.3. Multi-class classification
The three algorithms above were originally designed for binary classification. Although binary LS-SVM can easily be extended to deal with multi-class problems, it is usually preferable to build classifiers that each distinguish only two classes rather than to consider more than two classes in one model. In this study, the original multi-class problem was decomposed into easier-to-solve binary classifications using the common "one-versus-one" strategy in the framework of ECOC, a simple but powerful framework for multi-class classification based on the embedding of binary classifiers [23].
The "one-versus-one" decomposition scheme divides an m-class problem into M = m(m − 1)/2 binary problems. Each problem is handled by a binary LS-SVM classifier responsible for distinguishing between one pair of classes. Each classifier is trained on the subset of samples belonging to its two classes, whereas samples with other class labels are simply ignored.
In the prediction phase, a sample is presented to each of the binary classifiers. The output of a classifier, r_ij ∈ {0, 1}, denotes the output of the ith sample on the jth binary classifier. These outputs are collected in a vector r_i

$$r_i = [r_{i1}, r_{i2}, \ldots, r_{ij}, \ldots, r_{iM}], \quad j = 1, \ldots, M \qquad (10)$$
The predicted class can be obtained by the Hamming decoding strategy as follows [23]:

$$\mathrm{Class} = \arg\min_{c = 1, \ldots, m} \sum_{j=1}^{M} \left( 1 - \mathrm{sign}(r_{ij} \, y_c^j) \right) / 2 \qquad (11)$$

where y_c is the codeword corresponding to class c.
The classification of individual instances was performed using a binary classification algorithm.
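A sketch of the one-versus-one coding and the Hamming decoding of Eq. (11) (hypothetical helper names; binary outputs are assumed to be ±1, and codeword entries for the classes not involved in a pair are set to 0, so they contribute a constant 1/2 to every distance):

```python
import numpy as np

def ovo_codebook(m):
    # one column per class pair (a, b): class a coded +1, class b coded -1,
    # uninvolved classes coded 0
    pairs = [(a, b) for a in range(m) for b in range(a + 1, m)]
    Y = np.zeros((m, len(pairs)))
    for j, (a, b) in enumerate(pairs):
        Y[a, j], Y[b, j] = 1.0, -1.0
    return Y, pairs

def hamming_decode(r, Y):
    # Eq. (11): class whose codeword is nearest (Hamming-type distance)
    # to the vector r of binary classifier outputs
    d = ((1.0 - np.sign(r[None, :] * Y)) / 2.0).sum(axis=1)
    return int(np.argmin(d))
```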
2.4. Maximum count LS-SVM
For a binary classifier, let Y ∈ {−1, 1}. It is assumed that a bag is labeled positive if more than half of the instances in the bag are drawn from the positive set. Specifically, the sum rule is used to decide the label of the bag: classify B = {x_1, x_2, …, x_p} as positive, i.e., Y_i = 1, if

$$\sum_{j=1}^{p} \frac{y_{ij} + 1}{2} \;\geq\; \sum_{j=1}^{p} \frac{1 - y_{ij}}{2} \qquad (12)$$
or negative, i.e., Y_i = −1, if

$$\sum_{j=1}^{p} \frac{y_{ij} + 1}{2} \;<\; \sum_{j=1}^{p} \frac{1 - y_{ij}}{2} \qquad (13)$$

where i = 1, …, N, N is the number of bags, and p denotes the number of instances in each bag.
For multi-class cases, let Y ∈ {1, 2, 3, …, m}. Every instance of an object is presented to each of the basic ovo-LS-SVM classifiers. The classification result of one instance is represented by a vector c_j with elements c_kj taken from {0, 1}; the value 1 means the instance is assigned to class k, and 0 that it is not. These outputs are then combined into a score matrix C:

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & & \vdots \\ c_{p1} & c_{p2} & \cdots & c_{pm} \end{bmatrix} \qquad (14)$$
From the score matrix, bags can be directly classified with the following decision function:

$$Y_i = \arg\max_{k = 1, \ldots, m} \sum_{j=1}^{p} c_{kj} \qquad (15)$$

where i = 1, 2, …, N, j = 1, 2, …, p, k = 1, 2, …, m, and m denotes the number of categories; ties are broken arbitrarily. The above equation also applies to binary cases.
The main steps of maxc-LS-SVM can be summarized as follows (a code sketch is given after this list):
Training data formalization: assign the bag label to its instances and use all instances for training.
Basic classifier training: estimate the parameters of the basic classifiers.
Individual instance classification: classify each instance using the basic classifiers.
Bag classification: classify an object or bag B according to Eq. (15).
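Under the same assumptions as the earlier sketches (a list of trained pairwise decision functions returning +1 for the first class of a pair and −1 for the second), the bag classification step reduces to a few lines in which the score matrix of Eq. (14) and the decision function of Eq. (15) appear explicitly:

```python
import numpy as np

def maxc_classify_bag(bag, classifiers, pairs, m):
    # bag: (p, D) array of instances measured on one object
    C = np.zeros((len(bag), m))  # score matrix of Eq. (14)
    for clf, (a, b) in zip(classifiers, pairs):
        pred = clf(bag)          # +-1 vote of one ovo classifier per instance
        C[pred > 0, a] += 1
        C[pred <= 0, b] += 1
    # Eq. (15): class with the maximum count; np.argmax breaks ties by first index
    return int(np.argmax(C.sum(axis=0)))
```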
2.5. Performance indicator
For binary classification problems, there are many well-known statistical metrics, such as accuracy, precision, sensitivity, specificity, AUC, Cohen's kappa, etc. However, most of these performance indicators cannot be directly used for multi-class cases. In this work, only metrics whose successful application to multi-class problems has been demonstrated experimentally were adopted.
Accuracy rate, also known as classification rate, is the proportion of correctly classified instances in the population.
Cohen's kappa [28] is an alternative measure to the accuracy rate that compensates for random hits: it evaluates the degree of agreement in classification over that which would be expected by chance. Based on an m-class confusion matrix, Cohen's kappa can be computed as follows:

$$\mathrm{kappa} = \frac{N \sum_{i=1}^{m} h_{ii} - \sum_{i=1}^{m} T_{ri} T_{ci}}{N^2 - \sum_{i=1}^{m} T_{ri} T_{ci}} \qquad (16)$$

where h_ii is the number of true positives for each class, N is the number of bags, m is the number of labels, T_ri is the row total, and T_ci is the column total.
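Eq. (16) reduces to a few array operations on the confusion matrix; a minimal sketch (rows taken as actual classes and columns as predicted ones, an orientation the formula is in fact symmetric in):

```python
import numpy as np

def cohens_kappa(conf):
    # conf: m x m confusion matrix of bag-level predictions
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()                                        # number of bags
    hits = np.trace(conf)                                 # sum of h_ii
    chance = (conf.sum(axis=1) * conf.sum(axis=0)).sum()  # sum of T_ri * T_ci
    return (N * hits - chance) / (N ** 2 - chance)
```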
The theoretical range of the Cohen's kappa statistic is from −1 (total disagreement) through 0 (random classification) to 1 (perfect agreement), which makes it difficult to compare kappa directly with accuracy (0–1). In practice, most classifiers do at least as well as random guessing on most real-world datasets, so their kappa values score higher than zero. It is generally assumed that a kappa score between 0.8 and 1 indicates that the corresponding classifier is in almost perfect agreement with the actual pattern.
The bootstrap cross-validation (BCV) method [29], a bootstrap smoothing version of cross-validation, was used to estimate the generalization error of each classification algorithm because it has less variation and bias. B bootstrap datasets are drawn with replacement, for some large B between 50 and 200. For each dataset, an internal cross-validation error estimate is obtained with a predetermined classification rule. After B repetitions, the averaged error estimate is calculated over all bootstrap bag sets.
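The BCV loop itself is short; the sketch below assumes a user-supplied callable `cv_error(indices)` (hypothetical; it would run the internal cross-validation of the chosen classification rule on the bags selected by `indices` and return the error rate), since ref. [29] specifies the exact internal estimator:

```python
import numpy as np

def bcv_error(n_bags, cv_error, B=100, rng=None):
    # average an internal cross-validation error over B bootstrap bag sets
    rng = np.random.default_rng(rng)
    estimates = [cv_error(rng.integers(0, n_bags, size=n_bags))  # drawn with replacement
                 for _ in range(B)]
    return float(np.mean(estimates))
```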
3. Experimental
3.1. Toy data
The toy data has three classes (A, B, and C). For each bag of one specific class, a fraction ρ of its instances was sampled from the Gaussian located at (0, 0), (4, 1), or (1, 4), respectively, and the rest were sampled from the Gaussian located at (2, 2). Every bag contains 30 instances. To be comparable with common NIR classifications, 50 bags were generated for each class. A sketch map of the two-way datasets with ρ ∈ {0.2, 0.4, 0.6, 0.8} is shown in Fig. 1.
The toy data was used to simulate the heterogeneous classification problem. The feasibility of formulating the heterogeneous classification problem as a multi-instance learning problem, and the influence of bag size on the proposed maxc-LS-SVM algorithm, were investigated on this data.
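For concreteness, a possible generator for these toy bags (the isotropic unit-variance Gaussians are an assumption; the paper specifies only the centers and the mixing fraction ρ):

```python
import numpy as np

def make_toy_bags(rho, n_bags=50, bag_size=30, rng=None):
    # fraction rho of each bag comes from its class Gaussian,
    # the rest from the shared Gaussian at (2, 2)
    rng = np.random.default_rng(rng)
    centers = {"A": (0.0, 0.0), "B": (4.0, 1.0), "C": (1.0, 4.0)}
    bags, labels = [], []
    for label, mu in centers.items():
        for _ in range(n_bags):
            n_own = int(round(rho * bag_size))
            own = rng.normal(mu, 1.0, size=(n_own, 2))
            rest = rng.normal((2.0, 2.0), 1.0, size=(bag_size - n_own, 2))
            bags.append(np.vstack([own, rest]))
            labels.append(label)
    return bags, labels
```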
3.2. Herbal geographical origins data
Generally, the quality of wolfberry fruit varies with geographical origin, which makes its medicinal efficacy vary as well. It is therefore necessary to distinguish the geographical origins of wolfberry fruit accurately to ensure the effectiveness of a traditional Chinese prescription. The wolfberry fruit dataset consists of samples collected from the Inner Mongolia, Ningxia, and Qinghai provinces of China. For each sample, ten spectra (i.e., instances) were measured on ten different parts of the wolfberry fruit surface. A sketch map of the ten instances is shown in Fig. 2(a).
Each instance was measured with a portable near-infrared spectrophotometer (Ocean quest 256-2.5) equipped with an InGaAs detector and a diffuse reflection optical fiber probe. The spectrum covering the range of 870–2530 nm at a resolution of 9.5 nm was recorded. All raw spectra were transformed into logarithm mode because they were initially recorded in reflectance mode. In total, 29, 45, and 45 bags were formed for the three geographical origins, respectively. An overlap plot of the transformed spectra is shown in Fig. 2(b).
The wolfberry fruit data was used to further illustrate the feasibility of formulating the heterogeneous classification problem as a multi-instance learning one. A comparison study between the proposed maxc-LS-SVM and the mi-SVM and MI-SVM algorithms was also performed on this data.
3.3. Single instance pseudo dataset
In the traditional learning framework, each object is represented by a single labeled instance. Thus, one instance from each bag was drawn randomly to form a pseudo dataset, with each instance labeled with its bag's label.
3.4. Software and algorithms
All calculations were performed on a personal computer with an i7 880 processor and 6 GB RAM under the Windows 7 Professional operating system, using Matlab 7.9 (Mathworks, Inc., Natick, MA). The maxc-LS-SVM algorithm was implemented as a modification of functions in LS-SVMlab v1.8 [30]. The implementation of the MI-SVM and mi-SVM algorithms was based on a MIL toolbox publicly available at http://www.cs.cmu.edu/~juny/MILL/.

Fig. 1. The two-way toy data with ρ = 0.2, 0.4, 0.6, and 0.8, respectively.

Fig. 2. (a) A sketch map of the locations of the ten instances for each wolfberry fruit sample; (b) an overlap plot of the transformed spectra. Habt. NM means the samples were originally sampled from Inner Mongolia; similarly, Habt. NX stands for Ningxia and Habt. QH denotes Qinghai.
Table 1
A comparison between the learning effectiveness of linear kernel and RBF kernel LS-SVM on the toy datasets (α = 0.01).

ρ      Accuracy                      Kappa
       Linear    RBF      p-Value    Linear    RBF      p-Value
0.2    0.4355    0.4253   0.0405     0.1453    0.1813   0.0000
0.4    0.5753    0.5655   0.0439     0.3516    0.3357   0.0264
0.6    0.6703    0.6708   0.9318     0.4921    0.4906   0.7947
0.8    0.7981    0.7961   0.7851     0.6832    0.6816   0.8632
4. Results and discussion
A grid search guided by 10-fold cross-validation was adopted to optimize the hyperparameters of MI-SVM and mi-SVM. Each pair (C, γ) in the cross-product of C ranging from 2^−15 to 2^3 and γ ranging from 2^−5 to 2^15, with increments of 2 in the exponent, was used to train every dichotomous classifier. Meanwhile, the optimal hyperparameters of LS-SVM were determined by the coupled simulated annealing (CSA) algorithm and the simplex algorithm.
4.1. The toy data
To be consistent with the traditional classification method, one single-instance pseudo dataset was drawn from every bag dataset, and the Cohen's kappa and accuracy were calculated by the BCV procedure on that pseudo dataset. However, the randomizations in producing the toy bag sets and in drawing the single-instance pseudo datasets introduce variability in the prediction metrics. Therefore, each of the four toy datasets was regenerated 100 times to provide a stable evaluation of the applicability of the traditional classification method to the heterogeneous classification problem.
With respect to ρ = 0.2, the Mann–Whitney U test did not reject the equivalence of the linear and RBF (radial basis function) kernels in terms of accuracy at the significance level α = 0.01 (Table 1). However, the opposite result was obtained when comparing the linear and RBF kernels in terms of Cohen's kappa. In addition, from the decision boundaries illustrated in Fig. 3, it was observed that the decision boundary of the RBF kernel was more efficient at separating one class from another, but was too complex and specific, whereas the decision boundary of the linear kernel separated the pseudo samples well. This means that the ovo-LS-SVM model built with the RBF kernel overfitted the classification problem. Thus, the linear kernel is used in the maxc-LS-SVM model to prevent possible over-fitting.
The results of applying maxc-LS-SVM to the simulated heterogeneous classification data are shown in Figs. 4 and 5. From the
results on the toy data with ρ = 0.2, it was found that the prediction ability of the model built using maxc-LS-SVM improved considerably compared with the model constructed using the base ovo-LS-SVM classifier (Fig. 4(a) and (b)). By varying the bag size from 5 to 30 in increments of 5, the classification accuracy increased gradually. These results indicate that a powerful decision function can be constructed even when the data are too weak to build a usable base classifier. The results presented in Fig. 5 support this view.
Regarding ρ = 0.4, 0.6, and 0.8, the statistical analysis does not reject the null hypothesis that the base classifiers with linear and RBF kernels perform equally (Table 1) in terms of both kappa and accuracy. Further examination of the decision boundaries of the RBF kernel reveals that each pair of classes can be separated linearly in the raw space (Figs. S1–S3 in the supplementary material). Considering the computational burden, the linear kernel is preferred and used for further investigation.
The results presented in Fig. 4(a) clearly show that the effect of random success on the accuracy metric decreases with increasing ρ. In addition, a base LS-SVM classifier that performs better than random guessing can be obtained even though most of the instances in the toy data with ρ equal to 0.2 are shared by the three classes (Fig. 5(a)). From the results summarized in Figs. 4 and 5, it can also be observed that all the classifiers built using maxc-LS-SVM perform significantly better than their corresponding base classifiers on the toy datasets (ρ = 0.4, 0.6, and 0.8, respectively). This means that maxc-LS-SVM inherits the sound characteristics of LS-SVM. With the bag size varying from 5 to 30 in increments of 5, the prediction accuracy gradually improved and even approached perfect agreement. Therefore, it can be concluded that by formulating the heterogeneous classification problem as a multiple-instance
learning one, the heterogeneous classification problem can be solved effectively.

Fig. 3. The LS-SVM decision boundaries of the linear kernel and RBF kernel on the toy data with ρ equal to 0.2.

Fig. 4. The classification accuracy results of applying the maxc-LS-SVM algorithm on the four toy datasets with the number of instances ranging from 5 to 25 at increments of 5.

Fig. 5. A summary of the Cohen's kappa results of applying the maxc-LS-SVM algorithm on the toy datasets with the number of instances ranging from 5 to 25 at increments of 5.

Table 2
A summary of the prediction performance of the base classification method LS-SVM and the three multi-instance learning methods (i.e., mi-SVM, MI-SVM, maxc-LS-SVM) on the wolfberry fruit data.

              Accuracy           Kappa
              Linear     RBF     Linear     RBF
LS-SVM        0.9276     0.7868  0.8870     0.6667
mi-SVM        0.5929     0.5371  0.4427     0.3741
MI-SVM        0.8205     0.7414  0.7106     0.5934
maxc-LS-SVM   0.9898     0.9754  0.9840     0.9616
4.2. The wolfberry fruit data
The results of applying the ovo-LS-SVM method to the bootstrap pseudo datasets are listed in Table 2. It can be observed that the prediction accuracy of the linear kernel ovo-LS-SVM was higher than that of the RBF kernel, which means the linear kernel is more acceptable
than the RBF kernel in classifying the geographical origin of wolfberry fruits. From the results obtained on the toy data, it was concluded that the prediction ability of maxc-LS-SVM improves when more instances are included in a bag, so all the instances measured on one wolfberry fruit were used to investigate the prediction performance of the MIL algorithms. Besides, with the bag size fixed, the comparisons among the MIL algorithms become more objective.
The comparison results between maxc-LS-SVM and the other two MIL algorithms are tabulated in Table 2. For the mi-SVM algorithm, neither the linear nor the RBF kernel could produce a classifier comparable to the LS-SVM algorithm in terms of either classification accuracy or kappa. Two factors contribute to this phenomenon: one is the MIL assumption, and the other is the relationship between a bag and its instances. Since mi-SVM is a presence-based MIL algorithm, MI-SVM is brought into the comparison first.
The MI-SVM algorithm extends the notion of a margin from individual patterns to sets of patterns, and its prediction performance improved considerably compared with the mi-SVM algorithm. However, the classification accuracy and kappa of the MI-SVM algorithm were still worse than those of the LS-SVM algorithm. This means that considering all the instances in a bag as a whole is beneficial for constructing a powerful classifier; adopting the whole-bag learning strategy alone, however, is not enough to solve the heterogeneous classification problem thoroughly.
The maxc-LS-SVM algorithm, whose MIL assumption differs from that of mi-SVM and MI-SVM, performed much better than the base classifier in terms of classification accuracy for both the linear and RBF kernels. The kappa of the linear kernel maxc-LS-SVM reached as high as 0.9840, which indicates that random success has only a limited impact on the prediction results. Compared with the results obtained using the mi-SVM and MI-SVM algorithms, it can be concluded that the maxc-LS-SVM algorithm is more powerful in classifying the geographical origin of wolfberry fruit.
All the results demonstrate that by formulating the heterogeneous classification problem as a multiple-instance learning one, it can be solved effectively. The power of mi-SVM and MI-SVM has been demonstrated experimentally in image classification, a well-studied topic in the field of computer vision. But heterogeneous classification is a rather different application, in which the presence-based MIL assumption is no longer applicable.
Additionally, it may be argued that it is unnecessary to complicate the heterogeneous classification problem, since the accuracy of the base classifier using one instance per object is already higher than 0.90. But a more powerful decision rule is still needed: wolfberry fruit is used as a tonic Chinese herbal medicine, and explicitly distinguishing its geographical origin is necessary to preserve the effect of the prescription. Moreover, the kappa of the linear kernel ovo-LS-SVM is only 0.8870, so a more credible classifier should be constructed.
5. Conclusions
In this study, the heterogeneous classification problem was formulated as a multi-instance learning one. Based on a variant of the count-based MIL assumption, the maxc-LS-SVM algorithm was developed to deal with the multi-instance learning problem. The proposed algorithm inherits the sound characteristics of both the base classifier and the MIL framework. By incorporating the maximum count MIL assumption into the construction of a suitable learning algorithm, maxc-LS-SVM makes full use of the information of each instance in the positive (or class-specific) bags. A real wolfberry fruit spectral dataset, which contains 119 objects from three geographical origins, was used in this experiment. In multi-class classification, the proposed method achieved the best classification accuracy and kappa compared with the other two MIL methods (i.e., mi-SVM and MI-SVM). More importantly, it was concluded that the maximum count based MIL assumption is more applicable in the heterogeneous classification domain. While potentially promising, the maxc-LS-SVM approach needs to be further evaluated on other heterogeneous classification applications. This is currently being pursued.
Acknowledgments
The authors would like to thank the anonymous reviewers for their kind and insightful comments. Financial support from the Innovation Group Projects of Beijing University of Chinese Medicine (no. 2011-CXTD-11) and the National Natural Science Foundation of China (no. 81303218) is gratefully acknowledged. The computation was partly supported by CHEMCLOUDECOMPUTING (Beijing University of Chemical Technology, Beijing, China).
Appendix A. Supplementary materials
Supplementary data associated with this article can be found in
the online version at http://dx.doi.org/10.1016/j.talanta.2014.09.007.
References
[1] S. Naik, V.V. Goud, P.K. Rout, K. Jacobson, A.K. Dalai, Renew. Energy 35 (2010)
1624–1631.
[2] T. De Beer, A. Burggraeve, M. Fonteyne, L. Saerens, J.P. Remon, C. Vervaet, Int. J.
Pharm. 417 (2011) 32–47.
[3] L.J. Xie, X.Q. Ye, D.H. Liu, Y.B. Ying, Food Res. Int. 44 (2011) 2198–2204.
[4] L. Esteve Agelet, C.R. Hurburgh Jr., Talanta 121 (2014) 288–299.
[5] E. Ziémons, J. Mantanus, P. Lebrun, E. Rozet, B. Evrard, P. Hubert, J. Pharm.
Biomed. Anal. 53 (2010) 510–516.
[6] O.Y. Rodionova, A.L. Pomerantsev, Trends Anal. Chem. 29 (2010) 795–803.
[7] J. Märk, M. Andre, M. Karner, C.W. Huck, Eur. J. Pharm. Biopharm. 76 (2010)
320–327.
[8] J. Mantanus, E. Ziémons, P. Lebrun, E. Rozet, R. Klinkenberg, B. Streel, B. Evrard,
P. Hubert, Talanta 80 (2010) 1750–1757.
[9] B. Wang, G. Liu, Y. Dou, L. Liang, H. Zhang, Y. Ren, J. Pharm. Biomed. Anal. 50
(2009) 158–163.
[10] S. Tripathi, H.N. Mishra, Food Control 20 (2009) 840–846.
[11] J. Moros, J.J. Laserna, Anal. Chem. 83 (2011) 6275–6285.
[12] M. Blanco, A. Peguero, Trends Anal. Chem. 29 (2010) 1127–1136.
[13] C.-O. Chan, C.-C. Chu, D.K.-W. Mok, F.-T. Chau, Anal. Chim. Acta 592 (2007)
121–131.
[14] S. Wold, H. Antti, F. Lindgren, J. Öhman, Chemom. Intell. Lab Syst. 44 (1998)
175–185.
[15] J. Janni, B.A. Weinstock, L. Hagen, S. Wright, Appl. Spectrosc. 62 (2008)
423–426.
[16] J. Hwang, S. Kang, K. Lee, H. Chung, Talanta 101 (2012) 488–494.
[17] J. Foulds, E. Frank, Knowl. Eng. Rev. 25 (2010) 1–25.
[18] Y. Li, D.M.J. Tax, R.P.W. Duin, M. Loog, Pattern Recognit. 46 (2013) 865–874.
[19] Z.-H. Zhou, J. Comput. Sci. Technol. 21 (2006) 800–809.
[20] T.G. Dietterich, R.H. Lathrop, T. Lozano-Pérez, Artif. Intell. 89 (1997) 31–71.
[21] Y. Shi, Y. Gao, R. Wang, Y. Zhang, D. Wang, Artif. Intell. 38 (2013) 16–28.
[22] Z. Liang, Z. Bo, G. Yang, Fifth International Conference on Fuzzy Systems and
Knowledge Discovery, FSKD 2008, pp. 487–492.
[23] S. Escalera, O. Pujol, P. Radeva, J. Mach. Learn. Res. 11 (2010) 661–664.
[24] S. Andrews, I. Tsochantaridis, T. Hofmann, Adv. Neural Inf. Process Syst. 15
(2002) 561–568.
[25] F. Chauchard, J. Svensson, J. Axelsson, S. Andersson-Engels, S. Roussel,
Chemom. Intell. Lab Syst. 91 (2008) 34–42.
[26] S. Ren, L. Gao, Analyst 136 (2011) 1252–1261.
[27] J.A. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle,
J. Suykens, T. Van Gestel, Least Squares Support Vector Machines, World
Scientific Publishing, Singapore, 2002.
[28] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Pattern
Recognit. 44 (2011) 1761–1776.
[29] W.J. Fu, R.J. Carroll, S. Wang, Bioinformatics 21 (2005) 1979–1986.
[30] 〈http://www.esat.kuleuven.be/sista/lssvmlab/〉.