AdaMKL: A Novel Biconvex Multiple
Kernel Learning Approach
Ziming Zhang*, Ze-Nian Li, Mark Drew
School of Computing Science
Simon Fraser University
Vancouver, Canada
{zza27, li, mark}@cs.sfu.ca
* This work was done when the author was at SFU.
Outline
Introduction
Adaptive Multiple Kernel Learning
Experiments
Conclusion
Introduction
Kernel
    Given a set of data x and a feature mapping function \phi, a kernel
    matrix can be defined by the inner product of each pair of mapped
    feature vectors:
        K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle
Multiple Kernel Learning
    Aims to learn an optimal kernel, as well as the support vectors, by
    combining a set of base kernels linearly:
        K_{opt} = \sum_{m=1}^{M} \gamma_m K_m    (\gamma_m: kernel coefficients)
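As a minimal illustration (not part of the slides; Gaussian kernels and NumPy are my own choices here), a kernel matrix can be built directly from a kernel function:

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_kernel(X1, X2, sigma=1.0):
        # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)),
        # an inner product in an implicit feature space.
        return np.exp(-cdist(X1, X2, 'sqeuclidean') / (2.0 * sigma ** 2))

    X = np.random.randn(100, 2)           # toy data
    K = gaussian_kernel(X, X, sigma=0.5)  # 100 x 100 kernel (Gram) matrix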
Introduction
Multiple Kernel Learning (cont'd)
    Given a set of base kernels {K_1, ..., K_M}, learn the kernel
    coefficients \gamma such that
        K_{opt} = \sum_{m=1}^{M} \gamma_m K_m
        s.t. \forall m, \gamma_m \ge 0,   \|\gamma\|_p \le 1,   p = 1, 2, ...
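Continuing the snippet above, a small sketch (the coefficient values are hypothetical) of combining base kernels under non-negative, norm-bounded coefficients:

    # M base kernels, e.g. Gaussian kernels with different bandwidths
    sigmas = [0.1, 0.5, 1.0, 2.0]
    kernels = [gaussian_kernel(X, X, s) for s in sigmas]

    gamma = np.array([0.4, 0.3, 0.2, 0.1])     # hypothetical coefficients
    gamma = np.maximum(gamma, 0.0)             # gamma_m >= 0
    gamma /= np.linalg.norm(gamma, ord=2)      # ||gamma||_p <= 1 (here p = 2)

    K_opt = sum(g * K_m for g, K_m in zip(gamma, kernels))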
Introduction
Example: Lp-norm Multiple Kernel Learning [1]

    min_{\gamma, w, b, \xi}  (1/2) \sum_m \|w_m\|_2^2 / \gamma_m + C \sum_i \xi_i       (convex objective)
    s.t.  \forall i, y_i (\sum_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
          \xi_i \ge 0, C \ge 0                                                          (traditional SVM constraints)
          \forall m, \gamma_m \ge 0,   \|\gamma\|_p^p \le 1                             (kernel coefficient constraints)

Learning (alternating; see the sketch below)
    train a traditional SVM with the kernel coefficients fixed;
    learn the kernel coefficients with w fixed
[1] M. Kloft, et al. Efficient and accurate Lp-norm multiple kernel learning. In NIPS’09, 2009.
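A schematic of this alternation (illustrative only; sklearn is my own choice of SVM solver, and the closed-form coefficient update shown is the standard Lp-norm update reported in the MKL literature rather than the exact procedure of [1]):

    import numpy as np
    from sklearn.svm import SVC

    def lp_mkl_alternate(kernels, y, C=1.0, p=2.0, n_iter=20):
        """Schematic alternating optimization for Lp-norm MKL."""
        M = len(kernels)
        gamma = np.full(M, M ** (-1.0 / p))        # feasible start: ||gamma||_p = 1
        for _ in range(n_iter):
            # Step 1: train a standard SVM with the kernel coefficients fixed.
            K = sum(g * Km for g, Km in zip(gamma, kernels))
            svm = SVC(C=C, kernel='precomputed').fit(K, y)
            alpha = np.zeros(len(y))
            alpha[svm.support_] = np.abs(svm.dual_coef_[0])
            # Step 2: update the coefficients with the SVM solution fixed.
            # ||w_m||^2 = gamma_m^2 * (alpha*y)' K_m (alpha*y); the update below
            # is the usual closed form for the Lp-norm constraint (an assumption).
            ay = alpha * y
            wnorm2 = np.array([g ** 2 * ay @ Km @ ay for g, Km in zip(gamma, kernels)])
            gamma = wnorm2 ** (1.0 / (p + 1.0))
            gamma /= np.linalg.norm(gamma, ord=p)  # enforce ||gamma||_p = 1
        return gamma, svm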
Introduction
Motivation
    The Lp-norm kernel coefficient constraint makes learning the kernel
    coefficients difficult, especially when p > 1
Intuition
    Solve the MKL problem without considering the kernel coefficient
    constraints explicitly
Contributions
    Propose a family of biconvex optimization formulations for MKL
    Can handle arbitrary norms of the kernel coefficients
    Easy and fast to optimize
Adaptive Multiple Kernel Learning
Introduce a weighting \theta_m for each kernel in the decision function:
    f(x) = \sum_{m=1}^{M} \theta_m \langle w_m, \phi_m(x) \rangle + b
which is a biconvex function of (\theta, w).
Adaptive Multiple Kernel Learning
Biconvex functions
    f(x, y) is biconvex if f_y(x) = f(x, y) is convex in x for every fixed y,
    and f_x(y) = f(x, y) is convex in y for every fixed x.
    Example: f(x, y) = x^2 + y^2 - 3xy
Biconvex optimization
    At least one function among the objective and the constraints is
    biconvex, and the others are convex.
    In general, only local optima can be guaranteed.
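A tiny numerical check (not from the slides) that the example f(x, y) = x^2 + y^2 - 3xy is convex in each argument separately but not jointly convex:

    import numpy as np

    f = lambda x, y: x ** 2 + y ** 2 - 3 * x * y

    # Fixing y, f(., y) is a 1-D quadratic with second derivative 2 > 0 (convex);
    # the same holds when x is fixed.  Jointly, the Hessian is indefinite:
    H = np.array([[2.0, -3.0], [-3.0, 2.0]])
    print(np.linalg.eigvalsh(H))              # -> [-1.  5.]

    # Midpoint convexity fails along the direction (x, y) = (t, t):
    a, b = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    mid = 0.5 * (a + b)
    print(f(*mid) <= 0.5 * (f(*a) + f(*b)))   # False -> not jointly convex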
Adaptive Multiple Kernel Learning
Adaptive Multiple Kernel Learning (AdaMKL)
    Aims to simplify the MKL learning process while keeping discriminative
    power similar to MKL, using biconvex optimization.

Binary classification formulation:
    min_{\theta, w, b, \xi}  (1/2) N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
    s.t.  \forall i, y_i (\sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
          \xi_i \ge 0, C \ge 0
    where N_0(\theta) = \|\theta\|_1,   N_p(\theta) = \|\theta\|_2^p   (p \ge 1)
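A minimal sketch (the names and the explicit feature maps are my own, for illustration only) of evaluating this biconvex objective for fixed (theta, w, b):

    import numpy as np

    def adamkl_primal_objective(theta, W, b, feats, y, C, p=2):
        """(1/2) N_p(theta) ||w||^2 + C * sum_i hinge_i.

        feats[m][i] = phi_m(x_i) (explicit feature maps, illustration only);
        W[m] = w_m."""
        Np = np.linalg.norm(theta, 1) if p == 0 else np.linalg.norm(theta, 2) ** p
        w_sqnorm = sum(float(wm @ wm) for wm in W)
        scores = sum(t * (F @ wm) for t, wm, F in zip(theta, W, feats)) + b
        hinge = np.maximum(0.0, 1.0 - y * scores)
        return 0.5 * Np * w_sqnorm + C * hinge.sum()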
Adaptive Multiple Kernel Learning
Objective function: from the SVM primal to AdaMKL

    Traditional SVM:
        min_{w, b, \xi}  (1/2) \|w\|_2^2 + C \sum_i \xi_i
        s.t.  \forall i, y_i (\sum_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
              \xi_i \ge 0, C \ge 0

    Re-weighting each w_m as \theta_m w_m:
        min_{\theta, w, b, \xi}  (1/2) \sum_m \theta_m^2 \|w_m\|_2^2 + C \sum_i \xi_i
        s.t.  \forall i, y_i (\sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
              \xi_i \ge 0, C \ge 0

    Replacing \sum_m \theta_m^2 \|w_m\|_2^2 with N_p(\theta) \|w\|_2^2 gives AdaMKL:
        min_{\theta, w, b, \xi}  (1/2) N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
        s.t.  \forall i, y_i (\sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
              \xi_i \ge 0, C \ge 0
        where N_0(\theta) = \|\theta\|_1,   N_p(\theta) = \|\theta\|_2^p   (p \ge 1),
        noting \sum_m \theta_m^2 \|w_m\|_2^2 \le \|\theta\|_2^2 \|w\|_2^2.
Adaptive Multiple Kernel Learning
Optimization
    Learn w with \theta fixed, using the N_p(\theta) norm
    Learn \theta with w fixed, using the L1 or L2 norm of \theta
    Repeat the two steps until convergence
    (Both steps, and a full training loop, are sketched after the next slides.)
Adaptive Multiple Kernel Learning
Learning w (dual form, with \theta fixed):

    max_\alpha  \sum_i \alpha_i - (1/2) \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m (\theta_m^2 / N_p(\theta)) K_m(x_i, x_j)
    s.t.  \forall i, 0 \le \alpha_i \le C,   \sum_i \alpha_i y_i = 0

    The effective kernel coefficients \theta_m^2 / N_p(\theta) are non-negative
    (p \ge 1), so the kernel coefficient conditions are satisfied implicitly.
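Since the dual above is a standard SVM dual over the fixed combined kernel \sum_m (\theta_m^2 / N_p(\theta)) K_m, an off-the-shelf SVM solver can be reused. A minimal sketch (sklearn is my own choice, not part of the original work):

    import numpy as np
    from sklearn.svm import SVC

    def learn_w_step(kernels, y, theta, C, p=2):
        """Fix theta and solve the SVM dual over the combined kernel."""
        Np = np.linalg.norm(theta, 1) if p == 0 else np.linalg.norm(theta, 2) ** p
        coeff = theta ** 2 / Np
        K_eff = sum(c * Km for c, Km in zip(coeff, kernels))
        svm = SVC(C=C, kernel='precomputed').fit(K_eff, y)
        alpha = np.zeros(len(y))
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])
        return alpha, float(svm.intercept_[0])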
Adaptive Multiple Kernel Learning
Learning \theta (with w fixed):

    min_{\theta, b, \xi}  L_n(\theta) + C \sum_i \xi_i
    s.t.  \forall i, y_i (\sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,
          \xi_i \ge 0, C \ge 0, n \in {1, 2}
    where L_1(\theta) = \|\theta\|_1,   L_2(\theta) = (1/2) \|\theta\|_2^2
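With w fixed, s_{im} = \langle w_m, \phi_m(x_i) \rangle is a constant matrix, so this is a small LP (n = 1) or QP (n = 2) in (\theta, b, \xi). A sketch using cvxpy (purely illustrative, not claimed to be the authors' solver); one way to form S from the kernels and the w-step dual variables is shown in the training-loop sketch below:

    import numpy as np
    import cvxpy as cp

    def learn_theta_step(S, y, C, n=2):
        """S[i, m] = <w_m, phi_m(x_i)>; minimize L_n(theta) + C * sum(xi)."""
        N, M = S.shape
        theta, b = cp.Variable(M), cp.Variable()
        xi = cp.Variable(N, nonneg=True)
        reg = cp.norm1(theta) if n == 1 else 0.5 * cp.sum_squares(theta)
        cons = [cp.multiply(y, S @ theta + b) >= 1 - xi]
        cp.Problem(cp.Minimize(reg + C * cp.sum(xi)), cons).solve()
        return theta.value, float(b.value)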
Adaptive Multiple Kernel Learning
Computational complexity
    The same as quadratic programming.
Convergence
    If the hard-margin case (C = +\infty) can be solved at the initialization
    stage, then AdaMKL will converge to a local minimum.
    If the objective function has converged at either step, then AdaMKL has
    converged to a local minimum.
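Putting the two steps together, a minimal training loop (continuing the learn_w_step / learn_theta_step sketches above; the representer-based construction of S and the stopping test on the theta-step objective are my own illustrative choices):

    import numpy as np

    def adamkl_train(kernels, y, C=1.0, p=2, n=2, n_iter=50, tol=1e-4):
        """Alternating (biconvex) optimization for AdaMKL, illustrative only."""
        theta = np.ones(len(kernels))
        prev = np.inf
        for _ in range(n_iter):
            # Step 1: fix theta, learn w (via its dual variables alpha).
            alpha, _ = learn_w_step(kernels, y, theta, C, p)
            # Step 2: fix w; S[i, m] = <w_m, phi_m(x_i)> via the representer
            # w_m = (theta_m / N_p(theta)) * sum_j alpha_j y_j phi_m(x_j).
            Np = np.linalg.norm(theta, 1) if p == 0 else np.linalg.norm(theta, 2) ** p
            S = np.column_stack([(t / Np) * Km @ (alpha * y)
                                 for t, Km in zip(theta, kernels)])
            theta, b = learn_theta_step(S, y, C, n)
            # Stop once the theta-step objective stops improving.
            hinge = np.maximum(0.0, 1.0 - y * (S @ theta + b)).sum()
            reg = np.linalg.norm(theta, 1) if n == 1 else 0.5 * float(theta @ theta)
            obj = reg + C * hinge
            if abs(prev - obj) < tol:
                break
            prev = obj
        return theta, alpha, b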
Adaptive Multiple Kernel Learning
Lp-norm MKL vs. AdaMKL

    Lp-norm MKL (primal, as above):
        min_{\gamma, w, b, \xi}  (1/2) \sum_m \|w_m\|_2^2 / \gamma_m + C \sum_i \xi_i
        s.t.  \forall i, y_i (\sum_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,  \xi_i \ge 0, C \ge 0
              \forall m, \gamma_m \ge 0,   \|\gamma\|_p^p \le 1
        Convex
        Explicit kernel coefficient norm condition
        Gradient search, semi-infinite programming (SIP), etc.

    AdaMKL (primal and dual, as above):
        min_{\theta, w, b, \xi}  (1/2) N_p(\theta) \|w\|_2^2 + C \sum_i \xi_i
        s.t.  \forall i, y_i (\sum_m \theta_m \langle w_m, \phi_m(x_i) \rangle + b) \ge 1 - \xi_i,  \xi_i \ge 0, C \ge 0
        max_\alpha  \sum_i \alpha_i - (1/2) \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m (\theta_m^2 / N_p(\theta)) K_m(x_i, x_j)
        s.t.  \forall i, 0 \le \alpha_i \le C,   \sum_i \alpha_i y_i = 0
        Biconvex
        Kernel coefficient conditions hidden in the dual
        Quadratic programming
Experiments
Four specific AdaMKL variants: N0L1, N1L1, N1L2, N2L2, where "N" and "L"
denote the norms used for learning w and \theta, respectively.
Two experiments:
1. Toy example: C = 10^5 (no tuning), 10 Gaussian kernels; data randomly
   sampled from 2-D Gaussian distributions (generation sketched below)
   Positive samples: mean [0 0], covariance [0.3 0; 0 0.3], 100 samples
   Negative samples: means [-1 -1] and [1 1], covariances [0.1 0; 0 0.1]
   and [0.2 0; 0 0.2], 100 samples, respectively.
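A small NumPy sketch (illustrative) of generating the toy data described above; reading "100 samples, respectively" as 100 samples per negative component is an assumption:

    import numpy as np

    rng = np.random.default_rng(0)

    # Positive class: one 2-D Gaussian.
    X_pos = rng.multivariate_normal([0, 0], [[0.3, 0], [0, 0.3]], size=100)

    # Negative class: two 2-D Gaussians (100 samples each, per the reading above).
    X_neg = np.vstack([
        rng.multivariate_normal([-1, -1], [[0.1, 0], [0, 0.1]], size=100),
        rng.multivariate_normal([1, 1], [[0.2, 0], [0, 0.2]], size=100),
    ])

    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])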
Experiments (2)
2. Four benchmark datasets: breast-cancer, heart, thyroid, and titanic
   (downloaded from http://ida.first.fraunhofer.de/projects/bench/)
   Gaussian kernels + polynomial kernels
   100, 140, 60, and 40 kernels for the corresponding datasets, respectively
   Compared with convex-optimization-based MKL: GMKL [2] and SMKL [3]
[2] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in
multiple kernel learning. In ICML’07.
[3] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR,
9:2491–2521, 2008.
Experiments - Toy example
    [two figure-only slides]
Experiments - Benchmark datasets
    [figure panels:]
    (a) Breast-Cancer: [69.64 ~ 75.23]
    (b) Heart: [79.71 ~ 84.05]
    (c) Thyroid: [95.20 ~ 95.80]
    (d) Titanic: [76.02 ~ 77.58]
Experiments - Benchmark datasets
    [figure panels: (a) Breast-Cancer, (b) Heart, (c) Thyroid, (d) Titanic]
Conclusion
Biconvex optimization for MKL
    Hides the kernel coefficient constraints (non-negativity and the Lp
    (p >= 1) norm) in the dual, without considering them explicitly.
    Easy to optimize and fast to converge; lower computational time with
    performance similar to traditional convex-optimization-based MKL.
Thank you !!!