Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets
Presented by: Mingkui Tan, Li Wang, Ivor W. Tsang
School of Computer Engineering, Nanyang Technological University
ICML 2010, June 21-24, Haifa, Israel
Background
In many machine learning applications, there is a great demand for sparsity with respect to the input features. In this work, we propose a new Feature Generating Machine (FGM) to learn the sparsity of features. Our main contributions are as follows:
We propose a new sparse model and transform it into a MKL problem with exponentially many linear kernels via a mild convex relaxation.
We propose to solve the relaxed problem by a cutting plane algorithm that incorporates MKL learning.
We prove that the proposed algorithm converges globally within a limited number of iterations.
We show that FGM scales linearly in both the number of dimensions and the number of instances. Empirically, FGM shows great scalability for non-monotonic feature selection on large-scale and very high dimensional problems.
Model
Introduce a 0-1 vector d to control the status of the features ("1" means the feature is selected, "0" means it is not). Suppose we want to select B features; we then obtain a new sparse model:

$$\min_{d \in D}\ \min_{w,\,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^{n}\xi_i^2 \quad \text{s.t.}\quad y_i\, w'(x_i \odot d) \ge 1 - \xi_i,\ \ i = 1,\dots,n,$$

where $D = \{d \in \{0,1\}^m : \|d\|_0 \le B\}$ and $\odot$ denotes the elementwise product.
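As a purely illustrative sketch (ours, not from the poster; the function and variable names are hypothetical), the objective above can be evaluated in NumPy for a fixed mask d and weight vector w, using the smallest feasible slack xi_i = max(0, 1 - y_i w'(x_i .* d)); in practice w and the slacks are optimized, not fixed:

import numpy as np

def sparse_svm_objective(w, d, X, y, C):
    # Slacks at their constraint-tight values: xi_i = max(0, 1 - y_i * w'(x_i .* d)).
    xi = np.maximum(0.0, 1.0 - y * ((X * d) @ w))
    # 0.5*||w||^2 + (C/2) * sum_i xi_i^2, matching the sparse model above.
    return 0.5 * (w @ w) + 0.5 * C * np.sum(xi ** 2)

Here X is the n x m data matrix, y is in {-1, +1}^n, and d is a 0-1 vector with at most B ones.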
The sparse model can be convexly relaxed and converted into a MKL problem with exponentially many linear kernels:

$$\min_{\boldsymbol{\mu}}\ \max_{\boldsymbol{\alpha} \in A}\ -\frac{1}{2}\,(\boldsymbol{\alpha} \odot y)'\Big(\sum_{t} \mu_t\, X_t X_t' + \frac{I}{C}\Big)(\boldsymbol{\alpha} \odot y) \quad \text{s.t.}\ \ \sum_{t} \mu_t = 1,\ \mu_t \ge 0,\ d_t \in D,$$

where $A = \{\boldsymbol{\alpha} \mid \sum_{i=1}^{n} \alpha_i = 1,\ \alpha_i \ge 0\}$.
Here $X_t = [x_1 \odot d_t, \dots, x_n \odot d_t]'$ denotes the data matrix with features masked by $d_t$.
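To make the notation concrete, here is a small NumPy sketch (our own illustration; the data and masks are made up) of forming base kernels K_t = X_t X_t' from feature masks and combining them as in the relaxed problem:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))               # n = 5 instances, m = 8 features
masks = [np.array([1, 1, 0, 0, 0, 0, 0, 0.]),  # d_1: features {0, 1}
         np.array([0, 0, 0, 1, 0, 0, 1, 0.])]  # d_2: features {3, 6}
mu = np.array([0.5, 0.5])                     # kernel weights on the simplex
C = 1.0
# X_t = [x_1 .* d_t, ..., x_n .* d_t]'; K_t = X_t X_t' is a linear kernel
# on the masked features.
K = sum(m_t * (X * d_t) @ (X * d_t).T for m_t, d_t in zip(mu, masks))
K += np.eye(X.shape[0]) / C                   # the I/C term from the relaxation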
Methods
We propose to use a cutting plane algorithm to solve the MKL problem with exponentially many linear kernels [4]. In each iteration, the algorithm selects the most violated feature subset d, adds the corresponding linear kernel, and solves the resulting MKL sub-problem over the kernels generated so far (see the sketch below).
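A minimal sketch of such a cutting-plane loop follows. It is our own simplified illustration, not the authors' implementation: the MKL sub-problem of step 2 is replaced by retraining an L2-regularized squared-hinge linear SVM (scikit-learn's LinearSVC) on the union of generated feature subsets, with approximate duals recovered from the slacks (for the squared hinge, alpha_i is proportional to the slack xi_i at optimality), and the most violated d of step 3 is taken as the top-B features ranked by c_j = (sum_i alpha_i y_i x_ij)^2.

import numpy as np
from sklearn.svm import LinearSVC

def fgm_sketch(X, y, B=10, C=1.0, max_iter=20):
    """Grow a pool of feature subsets (linear kernels) cutting-plane style."""
    n, m = X.shape
    alpha = np.full(n, 1.0 / n)        # uniform initial dual guess
    selected = set()                   # union of generated feature subsets
    for _ in range(max_iter):
        # Step 3 analogue: most violated d = top-B features ranked by
        # c_j = (sum_i alpha_i y_i x_ij)^2.
        c = (X.T @ (alpha * y)) ** 2
        d_t = set(np.argsort(c)[-B:].tolist())
        if d_t <= selected:            # no new violated constraint: stop
            break
        selected |= d_t
        idx = np.array(sorted(selected))
        # Step 2 stand-in: squared-hinge linear SVM on the pooled features.
        clf = LinearSVC(C=C, loss='squared_hinge', fit_intercept=False)
        clf.fit(X[:, idx], y)
        # Approximate duals from the slacks: alpha_i proportional to
        # max(0, 1 - y_i * w'x_i).
        margins = y * (X[:, idx] @ clf.coef_.ravel())
        alpha = np.maximum(0.0, 1.0 - margins)
    return np.array(sorted(selected))

In the actual FGM, step 2 solves the MKL problem over the generated kernels as in [4]; the retrained SVM here is only a stand-in to keep the sketch self-contained.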
Convergence Property
Theorem 1: Assume that the sub-problem of MKL in step 2 and the most violated d selection in step 3 can be exactly solved. Then FGM globally converges after a finite number of steps [1].
Theorem 2: FGM scales linearly in computation with respect to n and m; in other words, FGM takes O(mn) time complexity [3].
Experimental Results on Huge Dimensional Problems
(1) Toy Experiments: The toy features are shown in the first two figures. For non-monotonic feature selection, suppose B = 2; one then needs to select f1 and f2, the two most informative features, to obtain the best prediction accuracy. We gradually increase the number of noise features. SVM(IDEAL) denotes the results obtained by using only f1 and f2. From Fig 1, FGM-B (FGM) shows the best performance in terms of prediction accuracy, sparsity, and training time.
Fig 1. Results on a synthetic dataset with a varying number of noise features.
(2) Large-scale Real Data Experiments: We report experiments on news20.binary (1,355,191 x 9,996), Arxiv astro-ph (99,757 x 62,369), rcv1.binary (47,236 x 20,242), real-sim (20,958 x 32,309), URL0 (3,231,961 x 16,000) and URL1 (3,231,961 x 20,000), where the numbers in brackets are (dimensions x instances). The comparison with SVM-RFE [2] shows the competitiveness of our method in terms of both prediction accuracy and training time.
References
[1] Chen, J. and Ye, J. Training SVM with indefinite kernels. In ICML, 2008.
[2] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
[3] Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[4] Li, Y.-F., Tsang, I.W., Kwok, J.T., and Zhou, Z.-H. Tighter and convex maximum margin clustering. In AISTATS, 2009.
www.ntu.edu.sg