Fast Optimization Algorithms for Solving SVM+

Dmitry Pechyony
Akamai Technologies, Cambridge, MA, USA. [email protected]
Vladimir Vapnik
NEC Laboratories, Princeton, NJ, USA. [email protected]
CONTENTS
1.1 Introduction
1.2 Sparse Line Search Algorithms
  1.2.1 SMO Algorithm for Solving SVM
  1.2.2 Alternating SMO (aSMO) for solving SVM+
1.3 Conjugate Sparse Line Search
  1.3.1 Conjugate Alternating SMO (caSMO) for solving SVM+
1.4 Proof of Convergence Properties of aSMO, caSMO
1.5 Experiments
  1.5.1 Computation of b in SVM+
  1.5.2 Description of the Datasets
  1.5.3 Comparison of aSMO and caSMO
  1.5.4 Experiments with SVM+
1.6 Conclusions
In the Learning Using Privileged Information (LUPI) model, along with standard
training data in the primary space, a teacher supplies a student with additional
(privileged) information in the secondary space. The goal of the learner is to
find a classifier with a low generalization error in the primary space. One of
the major algorithmic tools for learning in the LUPI model is SVM+. In this
chapter we present two fast algorithms for solving the optimization problem of
SVM+. To motivate the usage of SVM+, we show how it can be used to learn
from data generated by human computation games.
This work was done when the first author was with NEC Laboratories, Princeton, NJ, USA.
1.1 Introduction
Recently a new learning paradigm, called Learning Using Privileged Information (LUPI), was introduced by Vapnik et al. [13, 14, 15]. In this paradigm, in addition to the standard training data, (x, y) ∈ X × {±1}, a teacher supplies a student with privileged information x* ∈ X*. The privileged information is only available for the training examples and is never available for the test examples. The paradigm requires, given a training set {(xi, x*i, yi)}_{i=1}^n, to find a function f : X → {−1, 1} with a small generalization error on the unknown test data x ∈ X.
The LUPI paradigm can be implemented based on the well-known SVM algorithm [5]. The decision function of SVM is h(z) = w · z + b, where z is a feature map of x, and w and b are the solution of the following optimization problem:
min_{w,b,ξ1,...,ξn}  (1/2)∥w∥² + C ∑_{i=1}^n ξi    (1.1)

s.t. ∀ 1 ≤ i ≤ n, yi(w · zi + b) ≥ 1 − ξi,
     ∀ 1 ≤ i ≤ n, ξi ≥ 0.
Let h′ = (w′, b′) be the best possible decision function (in terms of the generalization error) that SVM can find. Suppose that for each training example xi an oracle gives us the value of the slack ξ′i = 1 − yi(w′ · xi + b′). We substitute these slacks into (1.1), fix them, and optimize (1.1) only over w and b. We denote this variant of SVM as OracleSVM. By Proposition 1 of [14], the generalization error of the hypothesis found by OracleSVM converges to the one of h′ at the rate of 1/n. This rate is much faster than the convergence rate 1/√n of SVM.
In the absence of the optimal values of the slacks we use the privileged information {x*i}_{i=1}^n to estimate them. Let z*i ∈ Z* be a feature map of x*i. Our goal is to find a correcting function ϕ(x*i) = w* · z*i + d that approximates ξ′i. We substitute ξi = ϕ(x*i) into (1.1) and obtain the following modification (1.2) of SVM, called SVM+ [13]:
min_{w,b,w*,d}  (1/2)∥w∥² + (γ/2)∥w*∥² + C ∑_{i=1}^n (w* · z*i + d)    (1.2)

s.t. ∀ 1 ≤ i ≤ n, yi(w · zi + b) ≥ 1 − (w* · z*i + d),
     ∀ 1 ≤ i ≤ n, w* · z*i + d ≥ 0.
The objective function of SVM+ contains two hyperparameters, C > 0
and γ > 0. The term γ∥w*∥²/2 in (1.2) is intended to restrict the capacity
(or VC-dimension) of the function space containing ϕ.
Let Kij = K(xi, xj) and K*ij = K*(x*i, x*j) be kernels in the decision
and the correcting space, respectively. A common approach to solving (1.1) is
to consider its dual problem. We also use this approach to solve (1.2). The
duals of (1.1) and (1.2) are (1.3) and (1.6), respectively.
max_α D(α) = ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj Kij    (1.3)

s.t. ∑_{i=1}^n yi αi = 0,    (1.4)
     ∀ 1 ≤ i ≤ n, 0 ≤ αi ≤ C.    (1.5)
max_{α,β} D(α, β) = ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj Kij
                    − (1/(2γ)) ∑_{i,j=1}^n (αi + βi − C)(αj + βj − C) K*ij    (1.6)

s.t. ∑_{i=1}^n (αi + βi − C) = 0,  ∑_{i=1}^n yi αi = 0,    (1.7)
     ∀ 1 ≤ i ≤ n, αi ≥ 0, βi ≥ 0.    (1.8)
The decision and the correcting functions of SVM+, expressed in the dual
variables, are

h(x) = ∑_{j=1}^n αj yj K(xj, x) + b    (1.9)

and ϕ(x*i) = (1/γ) ∑_{j=1}^n (αj + βj − C) K*ij + d. Note that SVM and SVM+ have
syntactically the same decision function. But semantically they are different,
since the α's found by SVM and SVM+ can differ a lot.
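As a concrete illustration of (1.9) and the correcting function, the following sketch evaluates both on the training set given a dual solution. The function name, the array-based inputs, and the use of precomputed kernel matrices are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def svmplus_predictors(alpha, beta, y, K, K_star, b, d, C, gamma):
    """Evaluate the SVM+ decision values and correcting-function values
    on the training points, given a dual solution (alpha, beta).

    K and K_star are the n x n kernel matrices in the decision and the
    correcting space, respectively (illustrative placeholders).
    """
    # Decision function (1.9): h(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b
    h = K.T @ (alpha * y) + b
    # Correcting function: phi(x*_i) = (1/gamma) sum_j (alpha_j + beta_j - C) K*_ij + d
    phi = K_star @ (alpha + beta - C) / gamma + d
    return h, phi
```

Note that the correcting function depends on both sets of dual variables, which is where SVM and SVM+ solutions diverge even though (1.9) looks the same for both.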
One of the widely used algorithms for solving (1.3) is SMO [12]. At each
iteration SMO optimizes the working set of two dual variables, αs and αt ,
while keeping all other variables fixed. Note that we cannot optimize a
proper subset of such a working set, say αs alone: due to the constraint (1.4), if we
fix n − 1 variables then the last variable is also fixed. Hence the working sets
selected by SMO are irreducible.
Following [3], we present SMO as an instantiation of the framework of
sparse line search algorithms to the optimization problem of SVM. At each
iteration the algorithms in this framework perform a line search in a maximally
sparse direction which is close to the gradient direction. We instantiate the
above framework to the SVM+ optimization problem and obtain the alternating SMO
(aSMO) algorithm. aSMO works with irreducible working sets of two or three
variables.
Sparse line search algorithms in turn belong to the family of approximate
gradient descent algorithms. The latter algorithms optimize by going roughly
in the direction of gradient. Unfortunately (e.g. see [11]) gradient descent
algorithms can have very slow convergence. One of the possible remedies is
to use conjugate direction optimization, where each new optimization step is
conjugate to all previous ones. In this chapter we combine the ideas of sparse line
search and conjugate directions and present a framework of conjugate sparse
line search algorithms. At each iteration the algorithms in this framework
perform line search in a chosen maximally sparse direction which is close
to the gradient direction and is conjugate to k previously chosen ones. We
instantiate the above framework to SVM+ optimization problem and obtain
Conjugate Alternating SMO (caSMO) algorithm. caSMO works with irreducible
working sets of size up to k+2 and k+3 respectively. Our experiments indicate
an order-of-magnitude running time improvement of caSMO over aSMO.
Vapnik et al. [14, 15] showed that the LUPI paradigm emerges in several domains, for example, time series prediction and protein classification. To further motivate the usage of the LUPI paradigm and SVM+, we show how it can be used to learn from the data generated by human computation games. Our experiments indicate that the tags generated by the players in the ESP [16] and Tag a Tune [10] games improve image and music classification.
This chapter has the following structure. In Section 1.2 we present the sparse
line search optimization framework and instantiate it to SVM and SVM+.
In Section 1.3 we present the conjugate sparse line search optimization framework and instantiate it to SVM+. In Section 1.4 we prove the convergence
of aSMO and caSMO. We present our experimental results in Section 1.5. In
Section 1.5.3 we compare aSMO and caSMO with other algorithms for solving
SVM+. Finally, Section 1.5.4 describes our experiments with the data generated by human computation games.
Previous and Related Work
Blum and Mitchell [1] introduced a semi-supervised multiview learning model
that leverages the unlabeled examples in order to boost the performance in
each of the views. The SVM-2K algorithm of Farquhar et al. [7], which, similarly to SVM+, is a modification of SVM, performs fully supervised multiview
learning without relying on unlabeled examples. The conceptual difference
between SVM+ and SVM-2K is that SVM+ looks for a good classifier in a single space, while SVM-2K looks for good classifiers in two spaces. Technically,
the optimization problem of SVM+ involves fewer constraints, dual variables
and hyperparameters than the one of SVM-2K.
Izmailov et al. [9] developed an SMO-style algorithm, named generalized SMO
(gSMO), for solving SVM+. gSMO was used in the experiments in [14, 15]. At
each iteration gSMO chooses two indices, i and j, and optimizes αi, αj, βi,
βj while fixing the other variables. In contrast to the working sets used by aSMO
and caSMO, the one used by gSMO is not irreducible. In Section 1.2 we discuss
the disadvantages of choosing a working set which is not irreducible, and in
Section 1.5.3 we show that aSMO and caSMO are significantly faster than the
algorithm of [9].
1.2 Sparse Line Search Algorithms
We consider the optimization problem maxx∈F f (x), where f is a concave
function and F is a convex compact domain.
Definition 1 A direction u ∈ Rm is feasible at the point x ∈ F if ∃λ > 0
such that x + λu ∈ F .
Definition 2 Let NZ(u) ⊆ {1, 2, . . . , n} be the set of non-zero indices of u ∈
Rn. A feasible direction u is maximally sparse if any u′ ∈ Rn, such that
NZ(u′) ⊂ NZ(u), is not feasible.

Lemma 1 Let u ∈ Rn be a feasible direction vector. The variables in NZ(u)
have a single degree of freedom iff u is maximally sparse feasible.
It follows from this lemma that if we choose a maximally sparse feasible direction u and maximize f along the half-line x + λu, λ > 0, then we find the
optimal values of the variables in NZ(u) while fixing all other variables. As
we will see in the next sections, if u is sparse and has a constant number of
nonzero entries then the exact maximization along the line can be done in
constant time. In the next sections we will also show a fast way of choosing
a good maximally sparse feasible direction. To ensure that the optimal λ ≠ 0 we
require the direction u to have an angle with the gradient strictly smaller than
π/2. The resulting sparse line search optimization framework is formalized
in Algorithm 1. Algorithm 1 has two parameters, τ and κ. The parameter τ
bounds the angle between the chosen direction u and the direction of the
gradient. The parameter κ is the minimal size of the optimization step.
Algorithm 1 stops when there is no maximally sparse feasible direction that is
close to the direction of the gradient and allows an optimization step of significant
length.
Let u be a feasible direction that is not maximally sparse. By Definition 2
and Lemma 1, the set NZ(u) has at least two degrees of freedom. Thus the
line search along the half-line x + λu, λ > 0, will not find the optimal values of
the variables in NZ(u) while fixing all other variables. To achieve the latter,
instead of a 1-dimensional line search we need to do the expensive k-dimensional
hyperplane search, where k > 1 is the number of degrees of freedom in NZ(u).
This motivates the usage of maximally sparse search directions.
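The irreducibility argument above can be checked numerically: a direction with ui = yi, uj = −yj and zeros elsewhere preserves the SVM equality constraint (1.4), while a direction with only one nonzero entry cannot. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def smo_direction(n, i, j, y):
    """Maximally sparse feasible direction for the SVM dual: only
    coordinates i and j move, chosen so that sum_k y_k alpha_k stays 0."""
    u = np.zeros(n)
    u[i] = y[i]
    u[j] = -y[j]
    return u

# Any step alpha + lam * u keeps the equality constraint (1.4):
y = np.array([1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0, 0.0])
u = smo_direction(4, 0, 2, y)
for lam in (0.0, 0.1, 0.3):
    assert abs(np.dot(y, alpha + lam * u)) < 1e-12
```

Dropping either nonzero entry of u leaves a single-coordinate direction, which changes ∑ yk αk and is therefore infeasible, so u is maximally sparse.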
In Section 1.2.1 we instantiate Algorithm 1 to the optimization problem
of SVM and obtain the well-known SMO algorithm. In Section 1.2.2 we instantiate Algorithm 1 to the optimization problem of SVM+.
Algorithm 1 Line search optimization with maximally sparse feasible directions.
Input: Function f : Rn → R, domain F ⊂ Rn, initial point x0 ∈ F, constants τ > 0 and κ > 0.
Output: A point x ∈ F with a 'large' value of f(x).
1: Set x = x0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ a maximally sparse feasible direction u such that ∇f(x)T u > τ and λ(x, u) > κ do
3:   Choose a maximally sparse feasible direction u such that ∇f(x)T u > τ and λ(x, u) > κ.
4:   Let λ* = arg max_{λ: x+λu∈F} f(x + λu).
5:   Set x = x + λ*u.
6: end while
1.2.1 SMO Algorithm for Solving SVM
Following [3], we present SMO as a line search optimization algorithm. Let
I1 (α) = {i : (αi > 0 and yi = −1) or (αi < C and yi = 1)}, I2 (α) = {i :
(αi > 0 and yi = 1) or (αi < C and yi = −1)}. At each iteration, SMO
chooses a direction us ∈ Rn such that s = (i, j), i ∈ I1 , j ∈ I2 , ui = yi ,
uj = −yj and for any r ̸= i, j, ur = 0. If we move from α in the direction us
then (1.4) is satisfied. It follows from the definition of I1 , I2 and us , that ∃λ =
λ(α, s) > 0 such that α + λus ∈ [0, C]n . Thus us is a feasible direction. The
direction us is also maximally sparse. Indeed, any direction u′ with strictly
more zero coordinates than us has a single nonzero coordinate. But if we
move along such a u′ then we will not satisfy (1.4).
Let g(α) be the gradient of (1.3) at the point α. By the Taylor expansion of ψ(λ) ≜ D(αold + λus) at λ = 0, the maximum of ψ(λ) along the half-line λ ≥ 0 is at

λ′(αold, s) = − (∂D(αold + λus)/∂λ |λ=0) / (∂²D(αold + λus)/∂λ² |λ=0) = − g(αold)T us / (uTs H us),    (1.10)

where H is the Hessian of D(αold). The clipped value of λ′(αold, s) is

λ*(αold, s) = min_{i∈s} ( (C − αiold)/ui , max( λ′(αold, s), −αiold/ui ) )    (1.11)

so that the new value α = αold + λ*(αold, s)us is always in the domain [0, C]n.
Let τ > 0 be a small constant and I = {us | s = (i, j), i ∈ I1, j ∈ I2, g(αold)T us > τ}. If I = ∅ then SMO stops. It can be shown that in this case the Karush-Kuhn-Tucker (KKT) optimality conditions are almost satisfied (up to accuracy τ).

Let I ≠ ∅. We now describe a procedure for choosing the next direction. This procedure is currently implemented in LIBSVM and has complexity O(n). Let

s = arg max_{t: ut ∈ I} g(αold)T ut    (1.12)
be the nonzero components of the direction u that is maximally aligned with
the gradient. Earlier versions of LIBSVM chose us as the next direction. As
observed by [6], empirically faster convergence is obtained if instead of s =
(i, j) we choose s′ = (i, j ′ ) such that us′ ∈ I and the move in the direction us′
maximally increases (1.3):

s′ = arg max_{t=(t1,t2): t1=i, ut ∈ I}  D(αold + λ′(αold, t)ut) − D(αold)    (1.13)
   = arg max_{t=(t1,t2): t1=i, ut ∈ I}  (g(αold)T ut)² / (uTt H ut).    (1.14)

The equality (1.14) follows from substituting λ′(αold, t) (given by (1.10)) into (1.13) and using the definition (1.3) of D(α).
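The selection rule (1.12)-(1.14) can be sketched as follows. We assume a precomputed kernel matrix K and use the fact that, for u with ui = yi, uj = −yj and the SVM dual Hessian H, the curvature −uT Hu equals Kii + Kjj − 2Kij; all names here are illustrative, not the authors' implementation.

```python
import numpy as np

def select_working_set(g, y, alpha, K, C, tau=1e-6):
    """Sketch of the working-set choice (1.12)-(1.14): pick i by maximal
    gradient alignment g^T u, then j' by maximal estimated gain
    (g^T u)^2 / (-u^T H u)."""
    n = len(alpha)
    I1 = [t for t in range(n) if (alpha[t] > 0 and y[t] == -1) or (alpha[t] < C and y[t] == 1)]
    I2 = [t for t in range(n) if (alpha[t] > 0 and y[t] == 1) or (alpha[t] < C and y[t] == -1)]
    i = max(I1, key=lambda t: y[t] * g[t])       # (1.12): best gradient alignment
    best_j, best_gain = None, 0.0
    for j in I2:
        if j == i:
            continue
        gTu = y[i] * g[i] - y[j] * g[j]          # g^T u for s = (i, j)
        if gTu <= tau:
            continue                             # u is not in the candidate set I
        curv = K[i, i] + K[j, j] - 2.0 * K[i, j]  # -u^T H u for the SVM dual
        gain = gTu * gTu / max(curv, 1e-12)      # estimated increase, cf. (1.14)
        if gain > best_gain:
            best_j, best_gain = j, gain
    return i, best_j
```

The second-order gain criterion is why this selection typically converges in fewer iterations than choosing the pair most aligned with the gradient alone.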
1.2.2 Alternating SMO (aSMO) for solving SVM+
We start with the identification of the maximally sparse feasible directions.
Since (1.6) has 2n variables, {αi}ni=1 and {βi}ni=1, the search direction is a vector u ∈ R2n.
We designate the first n coordinates of u for the α's and the last n coordinates for
the β's. It follows from the equality constraints (1.7) that (1.6) has three types of
sets of maximally sparse feasible directions, I1, I2, I3 ⊂ R2n:
1. I1 = {us | s = (s1, s2), n + 1 ≤ s1, s2 ≤ 2n, s1 ≠ s2; us1 = 1, us2 =
−1, βs2 > 0, ∀ i ∉ s, ui = 0}. A move in the direction us ∈ I1 increases
βs1 and decreases βs2 > 0 by the same quantity. The rest of the variables
remain fixed.

2. I2 = {us | s = (s1, s2), 1 ≤ s1, s2 ≤ n, s1 ≠ s2, ys1 = ys2; us1 = 1, us2 =
−1, αs2 > 0, ∀ i ∉ s, ui = 0}. A move in the direction us ∈ I2 increases
αs1 and decreases αs2 > 0 by the same quantity. The rest of the variables
remain fixed.

3. I3 = {us | s = (s1, s2, s3), 1 ≤ s1, s2 ≤ n, n+1 ≤ s3 ≤ 2n, s1 ≠ s2; ∀ i ∉
s, ui = 0; ys1 ≠ ys2; us1 = us2 = 1, us3 = −2, βs3 > 0, or us1 = us2 =
−1, us3 = 2, αs1 > 0, αs2 > 0}. A move in the direction us ∈ I3 either
increases αs1, αs2 and decreases βs3 > 0, or decreases αs1 > 0, αs2 > 0
and increases βs3. The absolute value of the change in βs3 is twice the
absolute value of the change in αs1 and in αs2. The rest of the variables
remain fixed.
We abbreviate x = (α, β)T and xold = (αold , β old )T . It can be verified that if
we move from any feasible point x in the direction us ∈ I1 ∪ I2 ∪ I3 then (1.7)
is satisfied. Let D(x) be an objective function defined by (1.6) and λ∗ (xold , s)
be a real number that maximizes ψ(λ) = D(xold + λus ) such that xold + λus
satisfies (1.8). The optimization step in aSMO is x = xold + λ∗ (xold , s) · us ,
where us ∈ I1 ∪ I2 ∪ I3 . We now show how λ∗ (xold , s) is computed.
Let g(xold) and H be respectively the gradient and the Hessian of (1.6) at the point xold. Using the Taylor expansion of ψ(λ) = D(xold + λus) around λ = 0 we obtain that

λ′(xold, s) = arg max_{λ≥0} ψ(λ) = − (∂D(xold + λus)/∂λ |λ=0) / (∂²D(xold + λus)/∂λ² |λ=0) = − g(xold)T us / (uTs H us).    (1.15)

We clip the value of λ′(xold, s) so that the new point x = xold + λ*(xold, s) · us will satisfy (1.8):

λ*(xold, s) = max_{i∈s} ( λ′(xold, s), −xiold/ui ).    (1.16)

Let τ > 0, κ > 0, λ(xold, s) = max{λ | xold + λus satisfies (1.8)} and I = {us | us ∈ I1 ∪ I2 ∪ I3, g(xold)T us > τ and λ(xold, s) > κ}. If I = ∅ then aSMO stops.
Suppose that I ≠ ∅. We choose the next direction in three steps. The first two steps are similar to the ones done by LIBSVM [4]. At the first step, for each 1 ≤ i ≤ 3 we find a vector us(i) ∈ Ii that has a minimal angle with g(xold) among all vectors in Ii: if Ii ≠ ∅ then s(i) = arg max_{s: us∈Ii} g(xold)T us, otherwise s(i) is empty. Since the expression g(xold)T us is linear in us, the complexity of this step is O(n).

For 1 ≤ i ≤ 3, let Ĩi = {us | us ∈ Ii, g(xold)T us > τ and λ(xold, us) > κ}. At the second step, for 1 ≤ i ≤ 2, if the pair s(i) = (s1(i), s2(i)) is not empty then we fix s1(i) and find s′(i) = (s1(i), s2′(i)) such that us′(i) ∈ Ĩi and the move in the direction us′(i) maximally increases (1.6):

s′(i) = arg max_{t=(t1,t2): t1=s1(i), ut∈Ĩi}  D(xold + λ′(xold, t) · ut) − D(xold)  [≜ ∆i(xold, t)]
      = arg max_{t=(t1,t2): t1=s1(i), ut∈Ĩi}  (g(xold)T ut)² / (uTt H ut).    (1.17)

The right-hand equality in (1.17) follows from substituting λ′(xold, t) (given by (1.15)) into ∆i(xold, t) and using the definition (1.6) of D(x). Similarly, for i = 3, if the triple s(3) = (s1(3), s2(3), s3(3)) is not empty then we fix s1(3) and s3(3) and find s′(3) = (s1(3), s2′(3), s3(3)) such that us′(3) ∈ Ĩ3 and the move in the direction us′(3) maximally increases (1.6). Since the direction vectors ut ∈ I1 ∪ I2 ∪ I3 have at most three nonzero components, the computation of ∆i(xold, t) takes constant time. Therefore, since we are doing a one-dimensional search in (1.17), the second step takes linear time.

At the third step, we choose j = arg max_{1≤i≤3, s′(i)≠∅} ∆i(xold, s′(i)). This step takes constant time. The final search direction is us′(j) and the overall complexity of choosing this direction is O(n).
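A single aSMO step, the Newton step (1.15) followed by clipping so that the moved coordinates stay nonnegative per (1.8), can be sketched as follows. The clipping is written here as an explicit bound for each negative direction component; all names are illustrative, not the authors' code.

```python
import numpy as np

def asmo_step(x, g, H, idx, u_vals):
    """One aSMO line-search step along a sparse direction: coordinates
    `idx` move by `u_vals`, the rest stay fixed.  The step length follows
    (1.15); the clipping keeps every moved coordinate nonnegative."""
    k = len(idx)
    gTu = sum(g[i] * u for i, u in zip(idx, u_vals))
    uHu = sum(u_vals[a] * H[idx[a], idx[b]] * u_vals[b]
              for a in range(k) for b in range(k))
    lam = -gTu / uHu                          # unconstrained maximizer (1.15)
    for i, u in zip(idx, u_vals):
        if u < 0:                             # x_i + lam*u >= 0  =>  lam <= -x_i/u
            lam = min(lam, -x[i] / u)
    x_new = x.copy()
    for i, u in zip(idx, u_vals):
        x_new[i] += lam * u
    return x_new, lam
```

Because the direction has at most three nonzero components, both gTu and uHu involve a constant number of terms, which is what makes each aSMO iteration cheap once the gradient is maintained.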
The following theorem, proven in Section 1.4, establishes the asymptotic
convergence of aSMO.
Algorithm 2 Conjugate line search optimization with maximally sparse feasible directions.
Input: Function f(x) = (1/2)xT Qx + cT x, domain F ⊂ Rn, initial point x0 ∈ F, constants τ > 0 and κ > 0, maximal order of conjugacy kmax ≥ 1.
Output: A point x ∈ F with a 'large' value of f(x).
1: Set x = x0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ a maximally sparse feasible direction u such that ∇f(x)T u > τ and λ(x, u) > κ do
3:   Choose an order of conjugacy k, 0 ≤ k ≤ kmax.
4:   Let g0 and gi be the gradients of f at the beginning of the current and the i-th previous iteration.
5:   Choose a maximally sparse feasible direction u such that ∇f(x)T u > τ, λ(x, u) > κ and, for all 1 ≤ i ≤ k, uT (g_{i−1} − g_i) = 0.
6:   Let λ* = arg max_{λ: x+λu∈F} f(x + λu).
7:   Set x = x + λ*u.
8: end while
Theorem 2 aSMO stops after a finite number of iterations. Let x∗ (τ, κ) be the
solution found by aSMO for a given τ and κ and let x∗ be a solution of (1.6).
Then limτ →0,κ→0 x∗ (τ, κ) = x∗ .
1.3 Conjugate Sparse Line Search
Conjugate sparse line search is an extension of Algorithm 1. Similarly to Algorithm 1, we choose a maximally sparse feasible direction. However, in contrast
to Algorithm 1, in conjugate sparse line search we also require that the next
search direction is conjugate to k previously chosen ones. The order of conjugacy, k, is between 0 and an upper bound kmax ≥ 1. We allow k to vary
between iterations. Note that if k = 0 then the optimization step done by
conjugate sparse line search is the same as the one done by (nonconjugate)
sparse line search. In Section 1.3.1 we show how we choose k.
Let Q be an n × n positive semidefinite matrix and c ∈ Rn. In this section we
assume that f is a quadratic function, f (x) = 12 xT Qx+cT x. This assumption
is satisfied by all SVM-like problems. Let u be the next search direction that
we want to find, and ui be a search direction found at the i-th previous
iteration. We would like u to be conjugate to ui , 1 ≤ i ≤ k, w.r.t. the matrix
Q:
∀1 ≤ i ≤ k, uT Qui = 0 .
(1.18)
Let g0 and gi be the gradient of f at the beginning of current and the i-th
previous iteration respectively. Similarly, let x0 and xi be the optimization
point at the beginning of the current and the i-th previous iteration. By the
definition of f, gi = Qxi + c. Since for any 1 ≤ i ≤ k, x_{i−1} = xi + λi ui, where
λi ∈ R, we have that uT Qui = uT Q(x_{i−1} − xi)/λi = uT (g_{i−1} − gi)/λi, and
thus the conjugacy condition (1.18) is equivalent to

∀ 1 ≤ i ≤ k, uT (g_{i−1} − gi) = 0.    (1.19)
The resulting conjugate sparse line search algorithm is formalized in Algorithm 2.
Suppose that the search directions u and ui, 1 ≤ i ≤ k, have r non-zero
components. Implementations of SMO-like algorithms (e.g., LIBSVM and
UniverSVM) commonly compute the gradient vector after each iteration.
Thus the additional complexity of checking the condition (1.19) is O(kr).
However, the additional complexity of checking (1.18) is O(kr²). As we will
see in Section 1.3.1, if k ≥ 1 then r ≥ 3. Hence it is faster to check (1.19) than
(1.18).
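The cheap conjugacy test (1.19) can be sketched directly; under the quadratic assumption gi = Qxi + c it agrees with uT Qui = 0 up to the step length λi. The function name and the stored-gradient-pairs representation are illustrative.

```python
import numpy as np

def is_conjugate_via_gradients(u, grad_pairs, tol=1e-10):
    """Check condition (1.19): u^T (g_{i-1} - g_i) = 0 for every stored
    pair of consecutive gradients.  For quadratic f this equals
    u^T Q u_i = 0 up to the step length, but costs O(k r) per direction
    instead of O(k r^2)."""
    return all(abs(np.dot(u, g_prev - g_cur)) < tol
               for g_prev, g_cur in grad_pairs)
```

Since g_{i−1} − g_i = λi Q ui is already a by-product of the gradient maintenance, no extra multiplication by Q is ever needed.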
In the rest of this section we describe the instantiation of Algorithm 2 to
the optimization problem of SVM+.
1.3.1 Conjugate Alternating SMO (caSMO) for solving SVM+
We use the same notation as in Section 1.2.2. At each iteration we choose a maximally sparse feasible direction us ∈ R2n that solves the following system of k + 2 linear equalities:

uTs 1 = 0,
uTs ȳ = 0,    (1.20)
uTs (g_{i−1} − gi) = 0,  ∀ 1 ≤ i ≤ k,

where 1 ∈ R2n is a column vector with all entries equal to one and ȳ = [y; 0] is a row-wise concatenation of the label vector y with the n-dimensional zero column vector. It can be verified that if we move from any feasible point x in the direction us defined above then (1.7) is satisfied.
We assume that the vectors ȳ, 1, g0 − g1, . . . , g_{k−1} − gk are linearly independent. In all our experiments this assumption was indeed satisfied. It follows from the equality constraints (1.7) that there are four types of maximally sparse feasible directions us that solve (1.20):
1. J1 = {us | us is feasible, s = (s1, . . . , s_{k+2}), n + 1 ≤ s1, . . . , s_{k+2} ≤
2n; ∀ i ∉ s, ui = 0}. A move in the direction us ∈ J1 changes the β's indexed
by s. The rest of the variables remain fixed.

2. J2 = {us | us is feasible, s = (s1, . . . , s_{k+2}), 1 ≤ s1, . . . , s_{k+2} ≤ n, ys1 =
· · · = ys_{k+2}; ∀ i ∉ s, ui = 0}. A move in the direction us ∈ J2 changes the α's
indexed by s, such that the corresponding examples have the same label.
The rest of the variables remain fixed.

3. J3 = {us | us is feasible, s = (s1, . . . , s_{k+3}); ∃ i, j such that si, sj ≤
n and ysi ≠ ysj; ∃ r such that sr > n; ∀ i ∉ s, ui = 0}. A move in
the direction us ∈ J3 changes at least one β and at least two α's such that
the corresponding two examples have different labels. The variables that
are not indexed by s remain fixed.

4. J4 = {us | us is feasible, s = (s1, . . . , s_{k+3}), k ≥ 1, 1 ≤ s1, . . . , s_{k+3} ≤
n; ∃ i ≠ j such that ysi = ysj = 1; ∃ r ≠ t such that ysr = yst =
−1; ∀ i ∉ s, ui = 0}. A move in the direction us ∈ J4 changes the α's indexed
by s. At least two such α's correspond to different positive examples and
at least two such α's correspond to different negative examples. The rest
of the variables remain fixed.
The definitions of J1-J3 here are direct extensions of the corresponding definitions of I1-I3 for aSMO (see Section 1.2.2). However, the definition of J4 does
not have a counterpart in aSMO. By the definition of J4, it contains feasible
direction vectors that change at least two α's of positive examples and at
least two α's of negative examples. But such vectors are not maximally sparse
feasible for the original optimization problem (1.6), since for any us ∈ J4
there exists us′ ∈ I2 such that NZ(us′) ⊂ NZ(us). Hence the vectors from
J4 cannot be used by aSMO. Thus the conjugate direction constraints give additional power to the optimization algorithm by allowing the use of a new type of
directions that were not available previously.
The optimization step in caSMO is x = xold + λ∗ (xold , s) · us , where us
satisfies the above conditions and λ∗ (xold , s) is computed exactly as in aSMO,
by (1.15) and (1.16).
We now describe the procedure for choosing the next direction us that satisfies (1.20). At the first iteration we choose the direction us ∈ I1 ∪ I2 ∪ I3 that is generated by aSMO and set k = 0. At the subsequent iterations we proceed as follows. Let us be the direction chosen at the previous iteration and s = (s1, . . . , sk′) be the current working set. If us ∈ I1 ∪ I2 then k′ = k + 2, otherwise k′ = k + 3. Recall that 0 ≤ k ≤ kmax. If k < kmax then we try to increase the order of conjugacy by setting k = k + 1 and finding an index

s_{k′+1} = arg max_{t: s=(s1,...,sk′,t), us∈J1∪J2∪J3∪J4 and us solves (1.20)}  D(xold + λ′(xold, s) · us) − D(xold)  [≜ ∆(xold, s)].    (1.21)

Similarly to (1.17), it can be shown that ∆(xold, s) = (g(xold)T us)² / (uTs H us).

If k = kmax then we reduce ourselves to the case of k < kmax and proceed as above. The reduction is done by removing from s the least recently added index s1, obtaining s = (s2, . . . , sk′), decreasing k to k − 1 and renumbering the indices so that s = (s1, . . . , s_{k′−1}).
We also use a ‘second opinion’ and consider the search direction us′ ∈
I1 ∪ I2 ∪ I3 that is found by aSMO. If sk′ +1 found by (1.21) is empty, namely
there is no index t such that s = (s1 , . . . , sk′ , t), us ∈ J1 ∪ J2 ∪ J3 ∪ J4 and
us solves (1.20), then we set k = 0 and the next search direction is us′ . Also,
if sk′ +1 is not empty and the functional gain from us′ is larger than the one
from us , namely if ∆(xold , s′ ) > ∆(xold , s), then we set k = 0 and the next
search direction is us′ . Otherwise the next search direction is us .
Given the indices s = (s1 , . . . , sk′ ), in order to find us we need to solve
Algorithm 3 Optimization procedure of [2].
Input: Function f : Rn → R, domain F ⊂ Rn, set of directions U ⊆ Rn, initial point x0 ∈ F, constants τ > 0, κ > 0.
1: Set x = x0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ u ∈ U such that ∇f(x)T u > τ and λ(x, u) > κ do
3:   Choose any u ∈ U such that ∇f(x)T u > τ and λ(x, u) > κ.
4:   Let λ* = arg max_{λ: x+λu∈F} f(x + λu).
5:   Set x = x + λ*u.
6: end while
7: Output x.
(1.20). Since us has O(k) non-zero entries, this takes O(k³) time. It can be shown that the overall complexity of the above procedure of choosing the next search direction is O(k³max · n). Hence it is computationally feasible only for small values of kmax. In our experiments (Section 1.5.3) we observed that the value of kmax that gives a nearly-optimal running time is 3. The following theorem, proven in Section 1.4, establishes the asymptotic convergence of caSMO.
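One way to realize the O(k³) solve of (1.20) restricted to the working-set coordinates is to extract a null-space vector of the small constraint matrix, e.g. via an SVD. The function below is an illustrative sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def conjugate_sparse_direction(idx, y_bar, grads, n2):
    """Solve the system (1.20) restricted to working-set indices `idx`:
    u^T 1 = 0, u^T y_bar = 0, and u^T (g_{i-1} - g_i) = 0 for each stored
    gradient pair.  Returns a full-length direction supported on `idx`,
    or None if the restricted system has only the trivial solution."""
    rows = [np.ones(len(idx)), y_bar[idx]]
    for g_prev, g_cur in grads:
        rows.append((g_prev - g_cur)[idx])
    A = np.array(rows)                      # (k+2) x |idx| constraint matrix
    _, s, vt = np.linalg.svd(A)
    # Null-space vectors correspond to (numerically) zero singular values.
    null = [vt[r] for r in range(vt.shape[0]) if r >= len(s) or s[r] < 1e-10]
    if not null:
        return None
    u = np.zeros(n2)
    u[idx] = null[0]
    return u
```

Because the matrix has only k + 2 rows and at most k + 3 columns, the cost of the decomposition is independent of n, matching the O(k³) bound stated above.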
Theorem 3 caSMO stops after a finite number of iterations. Let x∗ (τ, κ) be
the solution found by caSMO for a given τ and κ and let x∗ be a solution of
(1.6). Then limτ →0,κ→0 x∗ (τ, κ) = x∗ .
1.4 Proof of Convergence Properties of aSMO, caSMO
We use the results of [2] to demonstrate convergence properties of aSMO and
caSMO. Bordes et al. [2] presented a generalization of Algorithm 1. Instead of
maximally sparse vectors u, at each iteration they consider directions u from
a set U ⊂ Rn . The optimization algorithm of [2] is formalized in Algorithm 3.
aSMO is obtained from Algorithm 3 by setting U = I1 ∪ I2 ∪ I3 .
Our results are based on the following definition and theorem of [2].
Definition 3 ([2]) A set of directions U ⊂ Rn is a finite witness family
for a convex set F if U is finite and for any x ∈ F , any feasible direction u
at the point x is a positive linear combination of a finite number of feasible
directions vi ∈ U at the point x.
The following theorem is a straightforward generalization of Proposition 13
in [2].
Theorem 4 ([2]) Let U ⊂ Rn be a set that contains a finite witness family for F. Then Algorithm 3 stops after a finite number of iterations. Let x*(τ, κ) be the solution found by Algorithm 3 for a given τ and κ and let x* = arg max_{x∈F} f(x). Then lim_{τ→0,κ→0} x*(τ, κ) = x*.
Lemma 5 The set I1 ∪ I2 ∪ I3 (defined in Section 1.2.2) is a finite witness
family for the set F defined by (1.7)-(1.8).
Proof  We use the ideas of the proof of Proposition 7 in [2], which shows that
U = {ei − ej, i ≠ j} is a finite witness family for SVM. The set I1 ∪ I2 ∪ I3
is finite. Let x ∈ R2n be an arbitrary point in F and u be a feasible direction
at x. Recall that the first n coordinates of x are designated for the values
of the α's and the last n coordinates of x are designated for the values of the β's.
There exists λ > 0 such that x′ = x + λu ∈ F. We construct a finite path
x = z(0) → z(1) → . . . → z(m − 1) → z(m) = x′ such that z(j + 1) =
z(j) + γ(j)v(j), where z(j + 1) ∈ F and v(j) is a feasible direction at x.
We prove by induction on m that these two conditions are satisfied. As an
auxiliary tool, we also prove by induction on m that our construction satisfies
the following invariant for all 1 ≤ i ≤ 2n and 1 ≤ j ≤ m:

xi > x′i ⇔ zi(j) ≥ x′i.    (1.22)
Let I+ = {i | 1 ≤ i ≤ n, yi = +1} and I− = {i | 1 ≤ i ≤ n, yi = −1}. We
have from (1.7) that

∑_{i∈I+} x′i = ∑_{i∈I−} x′i.    (1.23)
The construction of the path has three steps. At the first step (Cases 1 and 2 below), we move the current overall weight of the β's (which is ∑_{i=n+1}^{2n} zi(j)) towards the overall weight of the β's in x′ (which is ∑_{i=n+1}^{2n} x′i). When the current overall weight of the β's is the same as the overall weight of the β's in x′, we proceed to the second step. At the second step (Case 3 below), we move the values of the β's in z(j) towards their corresponding target values at x′. The second step terminates when the values of the β's in z(j) and x′ are exactly the same. At the third step (Case 4 below), we move the values of the α's in z(j) towards their corresponding target values at x′. The third step terminates when the values of the α's in z(j) and x′ are the same. We now describe this procedure in a formal way.
For each 0 ≤ j ≤ m − 1, we have four cases:

Case 1: ∑_{i=n+1}^{2n} zi(j) > ∑_{i=n+1}^{2n} x′i.
Since z(j), x′ ∈ F, by (1.7) we have that ∑_{i=1}^{n} zi(j) < ∑_{i=1}^{n} x′i. Let T = {i | n + 1 ≤ i ≤ 2n, zi(j) > x′i}. We have from (1.7) that ∑_{i∈I+} zi(j) = ∑_{i∈I−} zi(j). Using (1.23) we obtain that there exist r ∈ I+ and s ∈ I− such that zr(j) < x′r and zs(j) < x′s. Let t be any index from T, and urst ∈ I3 be such that ut = −2 and

λ ≜ min ( x′r − zr(j), x′s − zs(j), (zt(j) − x′t)/2 ).

Note that λ > 0. We set v(j) = urst, γ(j) = λ and z(j + 1) = z(j) + γ(j)v(j).
16
Book title goes here
Since z(j) ∈ F , it follows from the definition of λ and urst that z(j + 1) ∈ F.
Since z(j) satisfies (1.22), we have that xt > x′t ≥ 0. Thus urst is a feasible
direction at x. Finally, it follows from the definition of r, s, t, λ and urst that
zi (j + 1) ≥ x′i ⇔ zi (j) ≥ x′i for any 1 ≤ i ≤ 2n. Hence z(j + 1) satisfies (1.22).
Case 2: Σ_{i=n+1}^{2n} z_i(j) < Σ_{i=n+1}^{2n} x'_i.
The treatment of this case is similar to that of the previous one. Since z(j), x' ∈ F, by (1.7) we have that Σ_{i=1}^{n} z_i(j) > Σ_{i=1}^{n} x'_i. Let T = {i | n + 1 ≤ i ≤ 2n, z_i(j) < x'_i}. We have from (1.7) that Σ_{i∈I+} z_i(j) = Σ_{i∈I−} z_i(j). Using (1.23) we obtain that there exist r ∈ I+ and s ∈ I− such that z_r(j) > x'_r and z_s(j) > x'_s. Let t be an arbitrary index from T and u_rst ∈ I3 be such that u_t = 2, and let

    λ = min (z_r(j) − x'_r, z_s(j) − x'_s, (x'_t − z_t(j))/2).

Note that λ > 0. We set v(j) = u_rst, γ(j) = λ and z(j + 1) = z(j) + γ(j)v(j). Since z(j) ∈ F, it follows from the definition of λ and u_rst that z(j + 1) ∈ F. Since z(j) satisfies (1.22), we have that x_r > x'_r ≥ 0 and x_s > x'_s ≥ 0. Thus u_rst is a feasible direction at x. Finally, it follows from the definition of r, s, t, λ and u_rst that z_i(j + 1) ≥ x'_i ⇔ z_i(j) ≥ x'_i for any 1 ≤ i ≤ 2n. Hence z(j + 1) satisfies (1.22).
Case 3: Σ_{i=n+1}^{2n} z_i(j) = Σ_{i=n+1}^{2n} x'_i, and ∃ n + 1 ≤ i ≤ 2n such that z_i(j) ≠ x'_i.
In this case there exist n + 1 ≤ s, t ≤ 2n such that z_s(j) < x'_s and z_t(j) > x'_t. Let

    λ = min (x'_s − z_s(j), z_t(j) − x'_t).

Note that λ > 0. We set v(j) = u_st ∈ I1, γ(j) = λ and z(j + 1) = z(j) + γ(j)v(j). Since z(j) ∈ F, it follows from the definition of λ and u_st that z(j + 1) ∈ F. Since z(j) satisfies (1.22), we have that x_t > x'_t ≥ 0. Thus u_st is a feasible direction at x. Finally, it follows from the definition of s, t, λ and u_st that z_i(j + 1) ≥ x'_i ⇔ z_i(j) ≥ x'_i for any 1 ≤ i ≤ 2n. Hence z(j + 1) satisfies (1.22).
Case 4: Σ_{i=n+1}^{2n} z_i(j) = Σ_{i=n+1}^{2n} x'_i; ∀ n + 1 ≤ i ≤ 2n, z_i(j) = x'_i; and ∃ 1 ≤ i ≤ n such that z_i(j) ≠ x'_i.
We claim that in this case there exist 1 ≤ s, t ≤ n such that y_s = y_t, z_s(j) < x'_s and z_t(j) > x'_t. Indeed, since Σ_{i=n+1}^{2n} z_i(j) = Σ_{i=n+1}^{2n} x'_i, it follows from (1.7) that Σ_{i=1}^{n} z_i(j) = Σ_{i=1}^{n} x'_i. Combining this with (1.23) we obtain our claim. The rest of the construction of z(j + 1) in this case is the same as in the previous case (starting from the definition of λ), with I1 replaced by I2.
If none of these conditions is satisfied then z(j) = x' and the path terminates. By our choice of λ, at each iteration we strictly decrease the number of coordinates in z(j) that differ from the corresponding coordinates of x'. Hence the constructed path is finite. We have that

    x' = x + λ'u = x + Σ_{i=0}^{m−1} γ(i)v(i).

Hence u = Σ_{i=0}^{m−1} (γ(i)/λ') v(i).
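The second and third steps of this construction amount to repeated pairwise transfers of weight along directions of the form e_s − e_t. The following small numeric sketch (the function name and the simplified all-coordinates setting are our own illustration, not part of the proof) shows why the path is finite: each transfer closes at least one coordinate gap.

```python
import numpy as np

def pairwise_path(z, x_target):
    """Move z to x_target by steps lambda * (e_s - e_t), as in Cases 3-4.

    Assumes z and x_target are nonnegative with equal sums. Each step picks
    s with z_s < x'_s and t with z_t > x'_t and transfers
    lambda = min(x'_s - z_s, z_t - x'_t), which zeroes at least one of the
    two gaps, so at most len(z) - 1 steps are needed.
    """
    z = np.asarray(z, dtype=float).copy()
    x_target = np.asarray(x_target, dtype=float)
    steps = 0
    while not np.allclose(z, x_target):
        s = int(np.argmax(x_target - z))   # coordinate below its target
        t = int(np.argmax(z - x_target))   # coordinate above its target
        lam = min(x_target[s] - z[s], z[t] - x_target[t])
        z[s] += lam                        # transfer weight from t to s
        z[t] -= lam
        steps += 1
    return z, steps
```

Since the step size equals one of the two coordinate gaps, the number of mismatched coordinates strictly decreases, mirroring the termination argument above.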
Proof of Theorem 2 The theorem follows from combining Theorem 4 and
Lemma 5.
Proof of Theorem 3 Recall that at each iteration of the working set selection procedure of caSMO we try the search direction u_s′ ∈ I1 ∪ I2 ∪ I3, and this direction is chosen if it gives a larger gain in the objective function than the best conjugate direction. Hence by Lemma 5 the set of search directions U considered by caSMO contains a finite witness family, and the statement of the theorem follows from Theorem 4.
1.5 Experiments

1.5.1 Computation of b in SVM+

Suppose that α and β are the solution of (1.6). In this section we show how we compute the offset b in the decision function (1.9).
By the Karush-Kuhn-Tucker conditions,

    α_i > 0 ⇒ y_i(w · z_i + b) = 1 − (w* · z*_i + d).    (1.24)

Expressed in terms of the dual variables α and β, condition (1.24) is α_i > 0 ⇒ y_i F_i + y_i b = 1 − (1/γ) f_i − d, where F_i = Σ_{j=1}^{n} α_j y_j K_ij and f_i = Σ_{j=1}^{n} (α_j + β_j − C) K*_ij. The last condition is equivalent to

    α_i > 0, y_i = 1  ⇒ b + d = 1 − (1/γ) f_i − F_i,
    α_i > 0, y_i = −1 ⇒ b − d = −1 + (1/γ) f_i − F_i.
These two conditions motivate the following way of computing b. Let

    s+ = Σ_{i: α_i>0, y_i=1} (1 − f_i/γ − F_i),    n+ = |{i : α_i > 0, y_i = 1}|,
    s− = Σ_{i: α_i>0, y_i=−1} (−1 + f_i/γ − F_i),   n− = |{i : α_i > 0, y_i = −1}|.

We set b = (s+/n+ + s−/n−)/2.
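The averaging rule above is straightforward to implement. Below is a minimal NumPy sketch; the function and variable names (compute_b, K_star, etc.) are ours, and we assume the dual solution α, β of (1.6) together with the kernel matrices is given.

```python
import numpy as np

def compute_b(alpha, beta, y, K, K_star, C, gamma):
    """Offset b via the KKT averaging rule above (a sketch; names are ours)."""
    F = K @ (alpha * y)                  # F_i = sum_j alpha_j y_j K_ij
    f = K_star @ (alpha + beta - C)      # f_i = sum_j (alpha_j + beta_j - C) K*_ij
    pos = (alpha > 0) & (y == 1)         # support vectors with y_i = +1
    neg = (alpha > 0) & (y == -1)        # support vectors with y_i = -1
    s_plus = np.sum(1 - f[pos] / gamma - F[pos])    # estimates of b + d
    s_minus = np.sum(-1 + f[neg] / gamma - F[neg])  # estimates of b - d
    return 0.5 * (s_plus / pos.sum() + s_minus / neg.sum())
```

Averaging s+/n+ (an estimate of b + d) with s−/n− (an estimate of b − d) cancels d and leaves b.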
1.5.2 Description of the Datasets
We conducted our experiments on the datasets generated by the ESP and Tag a Tune human computation games. The ESP game¹ is played between two players. Both players are given the same image, and each is asked to enter tags that characterize it. A player scores points if her tags match those of her opponent. The Tag a Tune game² is played between two players. Each player receives a 30-second audio clip; she knows only her own clip and does not know the clip of her opponent. Each player sends her opponent tags that describe her clip. Upon receiving the opponent's tags, each player decides whether she and her opponent were given the same clip. If both players decide correctly, both score points.

¹ www.gwap.com/gwap/gamesPreview/espgame/
We used the dataset generated by the ESP game [16] to perform image classification. This dataset contains tagged images, with an average of 4.7 tags per image. We used the version [8] of the ESP dataset available at lear.inrialpes.fr/people/guillaumin/data_iccv09.php. In our experiments we used the DenseHue, DenseSift, HarrisHue and HarrisSift features that were supplied with the images. We generated a learning problem in the LUPI setting in the following way. Let t1 and t2 be two tags. The problem is to discriminate images tagged t1 from the ones tagged t2. We do not consider images carrying both the t1 and t2 tags. The view X is an image feature vector, and the privileged information X* is the set of all tags that were associated with the image, except t1 and t2. We removed the privileged information from the test images. The privileged information was represented as a vector with binary entries, indicating whether a particular tag is associated with the image. We considered 3 binary problems, with t1 = "horse"/t2 = "fish", t1 = "horse"/t2 = "bird" and t1 = "fish"/t2 = "bird". We denote these problems as ESP1/ESP2/ESP3, respectively. Each such problem has about 500 examples and is balanced. We used a test set of 100 examples.
We used the Magnatagatune dataset (available at tagatune.org/Magnatagatune.html), generated by the Tag a Tune game [10], to do music classification. This dataset contains tagged audio clips, with an average of 4 tags per clip. Each clip is represented by measurements of its rhythm, pitch and timbre at different intervals. We used only the pitch and timbre features. The privileged information was represented as a vector with binary entries, indicating whether a particular tag is associated with the clip. We generated learning problems in the LUPI setting in the same way as for the ESP dataset. We considered 3 binary problems: "classic music" vs. "rock", "rock" vs. "electronic music" and "classic music" vs. "electronic music". We denote these problems as TAG1/TAG2/TAG3, respectively. Each problem has between 4000 and 6000 examples and is balanced. We used a test set of 2000 examples.
1.5.3 Comparison of aSMO and caSMO
We implemented aSMO and caSMO on top of LIBSVM [4]. As in LIBSVM, we implemented incremental computation of gradients and shrinking, and we used the caching infrastructure provided by the LIBSVM code.
In all our experiments we used RBF kernel in both X and X ∗ spaces. Let σ
and σ ∗ be the hyperparameters of this kernel in space X and X ∗ respectively.
² www.gwap.com/gwap/gamesPreview/tagatune/
 τ    | ESP1  | ESP2  | ESP3  | TAG1 | TAG2  | TAG3
 1.99 | 27.16 | 26.42 | 30.67 | 1.95 | 11.76 | 7.45
 1.9  | 28.42 | 26    | 30.33 | 1.91 | 12.05 | 7.49
 1    | 28.75 | 27.92 | 32.17 | 1.97 | 12.27 | 7.35
 0.1  | 29.42 | 29.42 | 32.25 | 2.18 | 12.34 | 7.32

TABLE 1.1
Comparison of generalization error vs. stopping criterion τ.
      | gSMO  | aSMO | caSMO
 ESP1 | 12.41 | 4.21 | 0.13
 TAG1 | 44.39 | 4.59 | 0.1

TABLE 1.2
Mean running time (in seconds) of gSMO, aSMO and caSMO.
Also, let m and m* be the median distances between training examples in the spaces X and X*, respectively. We used the log10-scaled range [m − 6, m + 6] for σ, [m* − 6, m* + 6] for σ*, [−1, 5] for C and [−1, 5] for γ. In each such range we took 10 equally spaced values of the corresponding parameter, generating a 10 × 10 × 10 × 10 grid. We set κ = 1e−3. In all experiments in this section we report the average running time over all grid points and over 12 random draws of the training set.
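The grid above can be generated as follows. This is a sketch under one plausible reading of the ranges, namely that m is the log10 of the median pairwise distance and that all four parameters are sampled on a log10 scale; the function names are ours.

```python
import itertools
import numpy as np

def median_pairwise_distance(X):
    """Median Euclidean distance between training examples."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)      # distinct unordered pairs only
    return np.median(d[iu])

def hyperparameter_grid(X, X_star, num=10):
    """10 x 10 x 10 x 10 grid of (sigma, sigma*, C, gamma), log10-scaled."""
    m = np.log10(median_pairwise_distance(X))
    m_star = np.log10(median_pairwise_distance(X_star))
    sigmas = 10.0 ** np.linspace(m - 6, m + 6, num)
    sigmas_star = 10.0 ** np.linspace(m_star - 6, m_star + 6, num)
    Cs = 10.0 ** np.linspace(-1, 5, num)
    gammas = 10.0 ** np.linspace(-1, 5, num)
    return list(itertools.product(sigmas, sigmas_star, Cs, gammas))
```

Centering the σ range on the median pairwise distance is the usual heuristic for picking RBF widths on the scale of the data.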
aSMO and caSMO were initialized with α_i = 0 and β_i = C for all 1 ≤ i ≤ n. With such an initialization it can be verified that at the first iteration both aSMO and caSMO will choose a direction u ∈ I3 and that g^T u = 2, where g is the gradient of the objective function before the first iteration. Thus the value of the stopping criterion τ should be smaller than 2. We observed (see Table 1.1) that quite a large value of τ is sufficient to obtain nearly the best generalization error of SVM+. Based on this observation we ran all our experiments with τ = 1.9.
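Under our reading, the solver stops once no candidate sparse direction has a directional derivative above τ. A schematic version of that test (the function name and the exact form of the criterion are our assumptions, not a quote of the chapter's implementation):

```python
import numpy as np

def should_stop(g, directions, tau):
    """Stop when max_u g^T u over the candidate sparse directions falls
    below tau (our schematic reading of the stopping criterion)."""
    return max(float(g @ u) for u in directions) < tau
```

For instance, at the initialization above the chosen direction u ∈ I3 has g^T u = 2, so any τ < 2 keeps the solver running past the first iteration.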
Table 1.2 compares the mean running time of aSMO, of caSMO with kmax = 1, and of the algorithm for solving SVM+, denoted gSMO, that was used in [14]. We report the results for the ESP1 and TAG1 problems, but similar results were also observed for the other four problems. Our results show that aSMO is an order of magnitude faster than gSMO, and caSMO is an order of magnitude faster than aSMO. We also observed (see Figure 1.1) that the running time of caSMO depends on the maximal order of conjugacy kmax. Based on our experiments, we suggest heuristically setting kmax = 3; we used this value in our experiments in Section 1.5.4. Finally, we compared (see Figure 1.2) the running time of caSMO with that of the LOQO off-the-shelf optimizer. We loosened the stopping criterion of LOQO so that it outputs nearly the same value of the objective function as caSMO. For very small training sets, up to 200 examples, LOQO
FIGURE 1.1
Mean running time (in seconds) of caSMO as a function of the order of conjugacy kmax (1–5), for ESP1–ESP3 (left panel) and TAG1–TAG3 (right panel).
FIGURE 1.2
Comparison of the mean running time (in seconds) of caSMO and LOQO as a function of training size (100–500).
is competitive with caSMO, but for larger training sets LOQO is significantly
slower.
1.5.4 Experiments with SVM+
We show two experiments with SVM+ and compare it with SVM and SVM*. SVM* denotes SVM run in the privileged space X*. The experiments use the data generated by the ESP and Tag a Tune games, described in Section 1.5.2. In Figures 1.3 and 1.4 we report mean test errors over 12 random draws of the training and test sets. In almost all experiments SVM+ was better than SVM, with a relative improvement of up to 25%. In the ESP1–ESP3 and TAG3 problems SVM* is significantly better than SVM, and thus the privileged space X* is much more informative than the regular space X. In these problems SVM+ leverages the good privileged information to improve over SVM. In the TAG1 and TAG2 problems the X* space is worse than the X
FIGURE 1.3
Learning curves (mean error vs. training size, 100–400) of SVM, SVM+ and SVM* in image classification (ESP1, ESP2, ESP3).
FIGURE 1.4
Learning curves (mean error vs. training size, 100–2000) of SVM, SVM+ and SVM* in music classification (TAG1, TAG2, TAG3).
space. Nevertheless, the results on these datasets show that SVM+ is robust to bad privileged information. Moreover, on the TAG2 dataset, despite the weak X* space, SVM+ managed to extract useful information from it and outperformed SVM.
1.6 Conclusions
We presented a framework of sparse conjugate direction optimization algorithms. This framework combines the ideas of the SMO algorithm and of optimization along conjugate directions. Our framework is very general and can be efficiently instantiated for any optimization problem with a quadratic objective function and linear constraints. The instantiation for SVM+ gives a two-orders-of-magnitude running time improvement over the existing solver. We plan to instantiate the above framework for the optimization problem of SVM.
Our work leaves plenty of room for future research. It would be interesting to tune the maximal order of conjugacy automatically. Also, it is important to characterize theoretically when sparse conjugate optimization is preferable to its nonconjugate counterpart. Finally, we plan to develop optimization algorithms for SVM+ when the kernels in the X and/or X* spaces are linear.
Acknowledgements
We thank Leon Bottou for the fruitful conversations.
Bibliography
[1] A. Blum and T. Mitchell. Combining labeled and unlabeled examples
with co-training. In COLT, pages 92–100, 1998.
[2] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers
with online and active learning. JMLR, 6:1579–1619, 2005.
[3] L. Bottou and C.-J. Lin. Support Vector Machine solvers. In Large Scale
Kernel Machines, pages 1–27. 2007.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for Support Vector
Machines, 2001. Available at http://www.csie.ntu.edu.tw/~cjlin/
libsvm.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[6] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second
order information for training SVM. JMLR, 6:243–264, 2005.
[7] J. Farquhar, D.R Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmak.
Two view learning: Svm-2k, theory and practice. In NIPS, pages 355–362,
2005.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309–316, 2009.
[9] R. Izmailov, V. Vapnik, and A. Vashist. Private communication. 2009.
[10] E. Law and L. von Ahn. Input-agreement: A new mechanism for data
collection using human computation games. In CHI, pages 1197–1206,
2009.
[11] J. Nocedal and S. Wright. Numerical Optimization. Springer-Verlag, 2nd
edition, 2006.
[12] J. Platt. Fast training of Support Vector Machines using Sequential
Minimal Optimization. In Advances in Kernel Methods - Support Vector
Learning, pages 185–208. MIT Press, 1999.
[13] V. Vapnik. Estimation of dependencies based on empirical data. Springer–
Verlag, 2nd edition, 2006.
[14] V. Vapnik and A. Vashist. A new learning paradigm: Learning using
privileged information. Neural Networks, 22(5-6):544–557, 2009.
[15] V. Vapnik, A. Vashist, and N. Pavlovich. Learning using hidden information: Master class learning. In Proceedings of NATO workshop on Mining
Massive Data Sets for Security, pages 3–14. 2008.
[16] L. von Ahn and L. Dabbish. Labeling images with a computer game. In
CHI, pages 319–326, 2004.