AdaMKL: A Novel Biconvex Multiple Kernel Learning Approach

Ziming Zhang*, Ze-Nian Li, Mark Drew
School of Computing Science
Simon Fraser University
Vancouver, Canada
{zza27, li, mark}@cs.sfu.ca
* This work was done when the author was at SFU.
Outline
• Introduction
• Adaptive Multiple Kernel Learning
• Experiments
• Conclusion
Introduction
• Kernel
  – Given a set of data x and a feature mapping function φ, a kernel matrix can be
    defined as the inner product of each pair of feature vectors of x:
        K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle
• Multiple Kernel Learning (MKL)
  – Aims to learn an optimal kernel, as well as the support vectors, by combining a
    set of base kernels linearly:
        K_{opt} = \sum_{m=1}^{M} \gamma_m K_m
    where γ_1, ..., γ_M are the kernel coefficients.
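To make the linear-combination idea concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that builds a few Gaussian base kernels and mixes them with fixed coefficients; the bandwidths and the uniform coefficients are illustrative assumptions only.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def combined_kernel(kernels, coeffs):
    """Linear combination K_opt = sum_m coeffs[m] * kernels[m]."""
    return sum(c * K for c, K in zip(coeffs, kernels))

# Toy usage: three base kernels with different bandwidths, uniform coefficients.
X = np.random.randn(20, 2)
base = [rbf_kernel(X, X, g) for g in (0.1, 1.0, 10.0)]
gamma_coeffs = np.ones(len(base)) / len(base)   # hypothetical coefficient choice
K_opt = combined_kernel(base, gamma_coeffs)
```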
Introduction
• Multiple Kernel Learning
  – A set of base kernels {K_1, ..., K_M} is combined into
        K_{opt} = \sum_{m=1}^{M} \gamma_m K_m
    subject to \forall m,\; \gamma_m \ge 0 and \|\gamma\|_p \le 1, with p = 1, 2, ...
  [Figure: diagram of the base kernels {K_1...M} being linearly combined into K_opt]
Introduction
• Example: Lp-norm Multiple Kernel Learning [1]

      \min_{\gamma, w, b, \xi} \; \tfrac{1}{2} \sum_m \|w_m\|_2^2 / \gamma_m + C \sum_i \xi_i            (convex objective)
      s.t. \; \forall i,\; y_i \big( \sum_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,     (traditional SVM constraints)
           \xi_i \ge 0,\; C \ge 0,
           \forall m,\; \gamma_m \ge 0,\; \|\gamma\|_p^p \le 1                                            (kernel coefficient constraints)

• Learning (alternating optimization; written out after this slide)
  – Train a traditional SVM with the kernel coefficients fixed;
  – Learn the kernel coefficients with w fixed.

[1] M. Kloft et al. Efficient and accurate Lp-norm multiple kernel learning. In NIPS, 2009.
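For reference, the two alternating subproblems can be written out explicitly. The closed-form coefficient update below is the one usually attributed to [1]; it is reproduced here from memory and should be verified against that paper.

    Step 1 (fix γ): a standard SVM with the combined kernel K_\gamma = \sum_m \gamma_m K_m.
    Step 2 (fix w): \min_{\gamma \ge 0,\; \|\gamma\|_p^p \le 1} \; \sum_m \|w_m\|_2^2 / \gamma_m,
    whose minimizer (as recalled from [1]) is
        \gamma_m = \|w_m\|_2^{2/(p+1)} \Big/ \Big( \sum_{m'} \|w_{m'}\|_2^{2p/(p+1)} \Big)^{1/p}.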
Introduction
• Motivation
  – The Lp-norm constraint on the kernel coefficients makes learning them difficult,
    especially when p > 1.
• Intuition
  – Solve the MKL problem without handling the kernel coefficient constraints
    explicitly.
• Contributions
  – Propose a family of biconvex optimization formulations for MKL;
  – Handle arbitrary norms of the kernel coefficients;
  – Easy and fast to optimize.
Adaptive Multiple Kernel Learning
• Decision function (the θ_m act as per-kernel weights):

      f(x) = \sum_{m=1}^{M} \theta_m \langle w_m, \phi_m(x) \rangle + b

• The resulting objective is a biconvex function of θ and w.
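A minimal sketch of evaluating this decision function, assuming explicit feature maps are available (the feature maps, weights, and bias below are hypothetical values for illustration):

```python
import numpy as np

def decision_function(x, thetas, ws, feature_maps, b):
    """f(x) = sum_m theta_m * <w_m, phi_m(x)> + b, with explicit feature maps.

    thetas: (M,) per-kernel weights; ws: list of M weight vectors;
    feature_maps: list of M callables phi_m; b: bias. All illustrative.
    """
    return sum(
        theta * np.dot(w, phi(x))
        for theta, w, phi in zip(thetas, ws, feature_maps)
    ) + b

# Hypothetical usage with two simple feature maps.
phis = [lambda x: x, lambda x: x**2]
ws = [np.array([1.0, -0.5]), np.array([0.2, 0.3])]
thetas = np.array([0.7, 0.3])
print(decision_function(np.array([1.0, 2.0]), thetas, ws, phis, b=0.1))
```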
Adaptive Multiple Kernel Learning
• Biconvex functions
  – f(x, y) is biconvex if f_y(x) = f(x, y) is convex in x for every fixed y, and
    f_x(y) = f(x, y) is convex in y for every fixed x.
  – Example: f(x, y) = x^2 + y^2 - 3xy
• Biconvex optimization
  – At least one of the objective and constraint functions is biconvex, and the
    others are convex.
  – In general, only local optima can be guaranteed.
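A quick check of the example (my own verification, not from the slides):

    For fixed y, \partial^2 f / \partial x^2 = 2 > 0, so f_y(x) is convex; by symmetry f_x(y) is convex.
    Jointly, the Hessian is \begin{pmatrix} 2 & -3 \\ -3 & 2 \end{pmatrix} with determinant 4 - 9 = -5 < 0,
    so f is biconvex but not jointly convex.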
Adaptive Multiple Kernel Learning
• Adaptive Multiple Kernel Learning (AdaMKL)
  – Aims to simplify the MKL learning process, while keeping discriminative power
    similar to MKL, using biconvex optimization.
• Binary classification formulation:

      \min_{\theta, w, b, \xi} \; \tfrac{1}{2} N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0,

  where N_0(\theta) = \|\theta\|_1 and N_p(\theta) = \|\theta\|_2^p for p \ge 1.
Adaptive Multiple Kernel Learning
• Objective function: from the traditional SVM to AdaMKL

  1. Traditional SVM:
      \min_{w, b, \xi} \; \tfrac{1}{2} \|w\|_2^2 + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0

  2. Substitute w_m \leftarrow \theta_m w_m:
      \min_{\theta, w, b, \xi} \; \tfrac{1}{2} \sum_m \theta_m^2 \|w_m\|_2^2 + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0

  3. Replace the weighted regularizer \sum_m \theta_m^2 \|w_m\|_2^2 with N_p(\theta) \|w\|_2^2
     (e.g. \|\theta\|_2^2 \|w\|_2^2, \|\theta\|_2^p \|w\|_2^2, or \|\theta\|_1 \|w\|_2^2):
      \min_{\theta, w, b, \xi} \; \tfrac{1}{2} N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0

  where N_0(\theta) = \|\theta\|_1 and N_p(\theta) = \|\theta\|_2^p for p \ge 1.
Adaptive Multiple Kernel Learning
• Optimization (alternating)
  – Learn w with θ fixed, using the N_p(θ) regularizer;
  – Learn θ with w fixed, using the L1 or L2 norm of θ;
  – Repeat the two steps until convergence.
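A schematic of this alternating loop (a minimal sketch; `solve_w_step` and `solve_theta_step` are hypothetical placeholders for the two subproblems detailed on the next slides, and the initialization and stopping rule are my own assumptions):

```python
import numpy as np

def adamkl(kernels, y, C, solve_w_step, solve_theta_step, p=2, n=2,
           max_iter=50, tol=1e-4):
    """Alternating-optimization skeleton for AdaMKL (illustrative only).

    solve_w_step(theta, kernels, y, C, p)        -> (w_sol, b, objective)  # convex for fixed theta
    solve_theta_step(w_sol, b, kernels, y, C, n) -> (theta, objective)     # convex for fixed w
    Both callables are hypothetical stand-ins for the subproblems in the slides.
    """
    theta = np.ones(len(kernels))      # simple initialization (an assumption)
    prev_obj = np.inf
    for _ in range(max_iter):
        w_sol, b, _ = solve_w_step(theta, kernels, y, C, p)
        theta, obj = solve_theta_step(w_sol, b, kernels, y, C, n)
        if abs(prev_obj - obj) < tol:  # stop once the objective stops decreasing
            break
        prev_obj = obj
    return theta, w_sol, b
```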
Adaptive Multiple Kernel Learning
• Learning w (dual):

      \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j
                      \Big( \sum_m \frac{\theta_m^2}{N_p(\theta)} K_m(x_i, x_j) \Big)
      s.t. \; \forall i,\; 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0

  – The implied kernel coefficients \gamma_m = \theta_m^2 / N_p(\theta) satisfy
    \gamma_m \ge 0 and \|\gamma\|_p \le 1 for p \ge 1.
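Since the w-step is a standard SVM dual over the combined kernel, one way to sketch it is with an off-the-shelf precomputed-kernel SVM solver (an illustrative choice on my part, not the paper's solver; the function name mirrors the placeholder above):

```python
import numpy as np
from sklearn.svm import SVC

def solve_w_step(theta, kernels, y, C, p=2):
    """w-step: for fixed theta, solve the SVM dual over sum_m (theta_m^2 / N_p(theta)) K_m."""
    Np = np.linalg.norm(theta, 2) ** p if p >= 1 else np.abs(theta).sum()
    weights = theta**2 / Np
    K_eff = sum(w * K for w, K in zip(weights, kernels))
    svc = SVC(C=C, kernel="precomputed").fit(K_eff, y)
    # svc.dual_coef_ holds y_i * alpha_i for the support vectors, svc.intercept_ holds b.
    return svc
```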
Adaptive Multiple Kernel Learning
• Learning θ:

      \min_{\theta, b, \xi} \; L_n(\theta) + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0,\; n \in \{1, 2\}

  where L_1(\theta) = \|\theta\|_1 and L_2(\theta) = \tfrac{1}{2} \|\theta\|_2^2.
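A minimal cvxpy sketch of this θ-step (my own illustrative formulation, not the paper's implementation); it assumes a precomputed matrix F with F[i, m] = ⟨w_m, φ_m(x_i)⟩:

```python
import cvxpy as cp
import numpy as np

def solve_theta_step(F, y, C, n=2):
    """theta-step: min L_n(theta) + C * sum(xi) under the margin constraints.

    F: (N, M) array with F[i, m] = <w_m, phi_m(x_i)> (assumed precomputed);
    y: (N,) labels in {-1, +1}; C: soft-margin parameter; n: 1 or 2.
    """
    N, M = F.shape
    theta = cp.Variable(M)
    xi = cp.Variable(N, nonneg=True)
    b = cp.Variable()
    reg = cp.norm1(theta) if n == 1 else 0.5 * cp.sum_squares(theta)
    margins = cp.multiply(y, F @ theta + b)          # y_i * (sum_m theta_m F[i, m] + b)
    prob = cp.Problem(cp.Minimize(reg + C * cp.sum(xi)),
                      [margins >= 1 - xi])
    prob.solve()
    return theta.value, b.value
```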
Adaptive Multiple Kernel Learning
• Computational complexity
  – Same as quadratic programming.
• Convergence
  – If the hard-margin case (C = +∞) can be solved at the initialization stage, then
    AdaMKL converges to a local minimum.
  – If the objective function converges at either step, then AdaMKL has converged to
    a local minimum.
Adaptive Multiple Kernel Learning
• Lp-norm MKL vs. AdaMKL

  Lp-norm MKL:
      \min_{\gamma, w, b, \xi} \; \tfrac{1}{2} \sum_m \|w_m\|_2^2 / \gamma_m + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0,\; \forall m,\; \gamma_m \ge 0,\; \|\gamma\|_p^p \le 1
  – Convex
  – Explicit kernel coefficient norm condition
  – Solved by gradient search, semi-infinite programming (SIP), etc.

  AdaMKL:
      \min_{\theta, w, b, \xi} \; \tfrac{1}{2} N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
      s.t. \; \forall i,\; y_i \big( \sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b \big) \ge 1 - \xi_i,
           \xi_i \ge 0,\; C \ge 0
      with dual (w-step)
      \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j
                      \Big( \sum_m \frac{\theta_m^2}{N_p(\theta)} K_m(x_i, x_j) \Big)
      s.t. \; \forall i,\; 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0
  – Biconvex
  – Kernel coefficient conditions hidden in the dual
  – Solved by quadratic programming
Experiments
• 4 specific AdaMKL variants: N0L1, N1L1, N1L2, and N2L2, where "N" and "L" denote the
  norms used for learning w and θ, respectively.
• 2 experiments
  1. Toy example: C = 10^5 without tuning, 10 Gaussian kernels, data randomly sampled
     from 2D Gaussian distributions (a data-generation sketch follows below).
     – Positive samples: mean [0 0], covariance [0.3 0; 0 0.3], 100 samples.
     – Negative samples: means [-1 -1] and [1 1], covariances [0.1 0; 0 0.1] and
       [0.2 0; 0 0.2], 100 samples, respectively.
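A sketch of how such toy data could be generated with NumPy (my own illustration of the stated means and covariances; reading "100 samples, respectively" as 100 samples per negative mode is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Positive class: one 2D Gaussian, 100 samples.
X_pos = rng.multivariate_normal(mean=[0, 0], cov=[[0.3, 0], [0, 0.3]], size=100)

# Negative class: two 2D Gaussians (assumed 100 samples from each mode).
X_neg = np.vstack([
    rng.multivariate_normal(mean=[-1, -1], cov=[[0.1, 0], [0, 0.1]], size=100),
    rng.multivariate_normal(mean=[1, 1], cov=[[0.2, 0], [0, 0.2]], size=100),
])

X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
```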
Experiments (2)
  2. 4 benchmark datasets: breast-cancer, heart, thyroid, and titanic (downloaded from
     http://ida.first.fraunhofer.de/projects/bench/)
     – Gaussian kernels + polynomial kernels
     – 100, 140, 60, and 40 kernels for the corresponding datasets, respectively
       (a kernel-bank sketch follows below)
     – Compared with convex-optimization-based MKL: GMKL [2] and SMKL [3]

[2] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, 2007.
[3] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 9:2491–2521, 2008.
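For concreteness, a kernel bank of this flavor could be built as follows (a minimal sketch; the specific bandwidths and polynomial degrees are illustrative assumptions, not the settings used in the experiments):

```python
import numpy as np

def rbf(X, gamma):
    """Gaussian kernel matrix on one dataset X."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def poly(X, degree, coef0=1.0):
    """Polynomial kernel matrix (x'y + coef0)^degree."""
    return (X @ X.T + coef0) ** degree

def kernel_bank(X, gammas=(0.01, 0.1, 1.0, 10.0), degrees=(1, 2, 3)):
    """Bank of Gaussian + polynomial kernel matrices (illustrative parameters)."""
    return [rbf(X, g) for g in gammas] + [poly(X, d) for d in degrees]
```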
Experiments - Toy example
[Figure slides: toy example results]
Experiments - Benchmark datasets
[Figure panels with per-dataset result ranges]
(a) Breast-Cancer: [69.64 ~ 75.23]
(b) Heart: [79.71 ~ 84.05]
(c) Thyroid: [95.20 ~ 95.80]
(d) Titanic: [76.02 ~ 77.58]
Experiments - Benchmark datasets
[Figure panels: (a) Breast-Cancer, (b) Heart, (c) Thyroid, (d) Titanic]
Conclusion
• Biconvex optimization for MKL
  – Hides the kernel coefficient constraints (non-negativity and the Lp norm, p ≥ 1)
    in the dual, without handling them explicitly.
  – Easy to optimize and fast to converge, with lower computational time but
    performance similar to traditional convex-optimization-based MKL.
Thank you !!!