Fast Optimization Algorithms for Solving SVM+

Dmitry Pechyony, Akamai Technologies, Cambridge, MA, USA. [email protected]
Vladimir Vapnik, NEC Laboratories, Princeton, NJ, USA. [email protected]

CONTENTS
1.1 Introduction
1.2 Sparse Line Search Algorithms
    1.2.1 SMO Algorithm for Solving SVM
    1.2.2 Alternating SMO (aSMO) for Solving SVM+
1.3 Conjugate Sparse Line Search
    1.3.1 Conjugate Alternating SMO (caSMO) for Solving SVM+
1.4 Proof of Convergence Properties of aSMO, caSMO
1.5 Experiments
    1.5.1 Computation of b in SVM+
    1.5.2 Description of the Datasets
    1.5.3 Comparison of aSMO and caSMO
    1.5.4 Experiments with SVM+
1.6 Conclusions

In the Learning Using Privileged Information (LUPI) model, along with standard training data in the primary space, a teacher supplies a student with additional (privileged) information in the secondary space.
The goal of the learner is to find a classifier with a low generalization error in the primary space. One of the major algorithmic tools for learning in the LUPI model is SVM+. In this chapter we present two fast algorithms for solving the optimization problem of SVM+. To motivate the usage of SVM+, we show how it can be used to learn from data generated by human computation games. This work was done when the first author was with NEC Laboratories, Princeton, NJ, USA.

1.1 Introduction

Recently a new learning paradigm, called Learning Using Privileged Information (LUPI), was introduced by Vapnik et al. [13, 14, 15]. In this paradigm, in addition to the standard training data, (x, y) ∈ X × {±1}, a teacher supplies a student with privileged information x* ∈ X*. The privileged information is only available for the training examples and is never available for the test examples. The paradigm requires, given a training set {(x_i, x*_i, y_i)}_{i=1}^n, to find a function f : X → {−1, 1} with a small generalization error on the unknown test data x ∈ X.

The LUPI paradigm can be implemented based on the well-known SVM algorithm [5]. The decision function of SVM is h(z) = w · z + b, where z is a feature map of x, and w and b are the solution of the following optimization problem:

    min_{w,b,ξ_1,...,ξ_n}  (1/2)‖w‖₂² + C ∑_{i=1}^n ξ_i      (1.1)
    s.t. ∀ 1 ≤ i ≤ n:  y_i(w · z_i + b) ≥ 1 − ξ_i,
         ∀ 1 ≤ i ≤ n:  ξ_i ≥ 0.

Let h′ = (w′, b′) be the best possible decision function (in terms of the generalization error) that SVM can find. Suppose that for each training example x_i an oracle gives us the value of the slack ξ′_i = 1 − y_i(w′ · x_i + b′). We substitute these slacks into (1.1), fix them, and optimize (1.1) only over w and b. We denote this variant of SVM as OracleSVM. By Proposition 1 of [14], the generalization error of the hypothesis found by OracleSVM converges to the one of h′ with the rate of 1/n. This rate is much faster than the convergence rate 1/√n of SVM.
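For intuition about the primal problem (1.1), the sketch below solves it by subgradient descent on a toy dataset. This is only an illustrative sketch, not the chapter's method; the data, step size, and iteration count are hypothetical choices.

```python
import numpy as np

# Toy, linearly separable data: two Gaussian clouds (hypothetical example).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2.0, rng.randn(20, 2) - 2.0])
y = np.hstack([np.ones(20), -np.ones(20)])

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                       # examples with a positive slack
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

train_acc = np.mean(np.sign(X @ w + b) == y)
```

On well-separated data such as this, the learned hyperplane classifies the training set essentially perfectly; OracleSVM would additionally fix the slacks ξ′_i before this optimization.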
In the absence of the optimal values of the slacks we use the privileged information {x*_i}_{i=1}^n to estimate them. Let z*_i ∈ Z* be a feature map of x*_i. Our goal is to find a correcting function ϕ(x*_i) = w* · z*_i + d that approximates ξ′_i. We substitute ξ_i = ϕ(x*_i) into (1.1) and obtain the following modification (1.2) of SVM, called SVM+ [13]:

    min_{w,b,w*,d}  (1/2)‖w‖₂² + (γ/2)‖w*‖₂² + C ∑_{i=1}^n (w* · z*_i + d)      (1.2)
    s.t. ∀ 1 ≤ i ≤ n:  y_i(w · z_i + b) ≥ 1 − (w* · z*_i + d),
         ∀ 1 ≤ i ≤ n:  w* · z*_i + d ≥ 0.

The objective function of SVM+ contains two hyperparameters, C > 0 and γ > 0. The term γ‖w*‖₂²/2 in (1.2) is intended to restrict the capacity (or VC-dimension) of the function space containing ϕ.

Let K_{ij} = K(x_i, x_j) and K*_{ij} = K*(x*_i, x*_j) be kernels in the decision and the correcting space respectively. A common approach to solving (1.1) is to consider its dual problem. We also use this approach to solve (1.2). The duals of (1.1) and (1.2) are (1.3) and (1.6) respectively.

    max_α  D(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j K_{ij}      (1.3)
    s.t. ∑_{i=1}^n y_i α_i = 0,      (1.4)
         ∀ 1 ≤ i ≤ n:  0 ≤ α_i ≤ C.      (1.5)

    max_{α,β}  D(α, β) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j K_{ij}
                         − (1/(2γ)) ∑_{i,j=1}^n (α_i + β_i − C)(α_j + β_j − C) K*_{ij}      (1.6)
    s.t. ∑_{i=1}^n (α_i + β_i − C) = 0,  ∑_{i=1}^n y_i α_i = 0,      (1.7)
         ∀ 1 ≤ i ≤ n:  α_i ≥ 0,  β_i ≥ 0.      (1.8)

The decision and the correcting functions of SVM+, expressed in the dual variables, are

    h(x) = ∑_{j=1}^n α_j y_j K(x_j, x) + b      (1.9)

and ϕ(x*_i) = (1/γ) ∑_{j=1}^n (α_j + β_j − C) K*_{ij} + d. Note that SVM and SVM+ have syntactically the same decision function, but semantically they are different, since the α's found by SVM and SVM+ can differ a lot.

One of the widely used algorithms for solving (1.3) is SMO [12]. At each iteration SMO optimizes a working set of two dual variables, α_s and α_t, while keeping all other variables fixed.
Note that we cannot optimize a proper subset of such a working set, say α_s alone: due to the constraint (1.4), if we fix n − 1 variables then the last variable is also fixed. Hence the working sets selected by SMO are irreducible. Following [3], we present SMO as an instantiation of the framework of sparse line search algorithms applied to the optimization problem of SVM. At each iteration the algorithms in this framework perform a line search in a maximally sparse direction that is close to the gradient direction. We instantiate this framework to the SVM+ optimization problem and obtain the alternating SMO (aSMO) algorithm. aSMO works with irreducible working sets of two or three variables.

Sparse line search algorithms in turn belong to the family of approximate gradient descent algorithms. The latter algorithms optimize by moving roughly in the direction of the gradient. Unfortunately (see, e.g., [11]), gradient descent algorithms can have very slow convergence. One of the possible remedies is conjugate direction optimization, where each new optimization step is conjugate to all previous ones. In this chapter we combine the ideas of sparse line search and conjugate directions and present a framework of conjugate sparse line search algorithms. At each iteration the algorithms in this framework perform a line search in a chosen maximally sparse direction that is close to the gradient direction and is conjugate to the k previously chosen ones. We instantiate this framework to the SVM+ optimization problem and obtain the Conjugate Alternating SMO (caSMO) algorithm. caSMO works with irreducible working sets of size up to k + 2 or k + 3. Our experiments indicate an order-of-magnitude running time improvement of caSMO over aSMO.

Vapnik et al. [14, 15] showed that the LUPI paradigm emerges in several domains, for example, time series prediction and protein classification.
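For concreteness, the SVM+ dual objective (1.6), which both aSMO and caSMO maximize, can be evaluated directly from the two kernel matrices. A minimal numpy sketch (function name and test data are ours, not from the chapter):

```python
import numpy as np

def svmplus_dual_objective(alpha, beta, y, K, Kstar, C, gamma):
    """Value of the SVM+ dual objective (1.6), given the two kernel matrices."""
    ya = alpha * y                 # alpha_i y_i, for the decision-space term
    r = alpha + beta - C           # (alpha_i + beta_i - C), for the correcting-space term
    return alpha.sum() - 0.5 * ya @ K @ ya - (0.5 / gamma) * r @ Kstar @ r

# Tiny sanity check on random positive semidefinite kernels (data is made up).
rng = np.random.RandomState(1)
n = 5
A, B = rng.randn(n, n), rng.randn(n, n)
K, Kstar = A @ A.T, B @ B.T
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
val = svmplus_dual_objective(np.full(n, 0.5), np.full(n, 0.5), y, K, Kstar, C=1.0, gamma=1.0)
```

Note that at the feasible point α = 0, β = C·1 all three terms of (1.6) vanish, which is a convenient check for an implementation.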
To further motivate the usage of the LUPI paradigm and SVM+, we show how they can be used to learn from the data generated by human computation games. Our experiments indicate that the tags generated by the players of the ESP [16] and Tag a Tune [10] games result in an improvement in image and music classification.

This chapter has the following structure. In Section 1.2 we present the sparse line search optimization framework and instantiate it to SVM and SVM+. In Section 1.3 we present the conjugate sparse line search optimization framework and instantiate it to SVM+. In Section 1.4 we prove the convergence of aSMO and caSMO. We present our experimental results in Section 1.5. In Section 1.5.3 we compare aSMO and caSMO with other algorithms for solving SVM+. Finally, Section 1.5.4 describes our experiments with the data generated by human computation games.

Previous and Related Work

Blum and Mitchell [1] introduced a semi-supervised multiview learning model that leverages unlabeled examples in order to boost the performance in each of the views. The SVM-2K algorithm of Farquhar et al. [7], which, similarly to SVM+, is a modification of SVM, performs fully supervised multiview learning without relying on unlabeled examples. The conceptual difference between SVM+ and SVM-2K is that SVM+ looks for a good classifier in a single space, while SVM-2K looks for good classifiers in two spaces. Technically, the optimization problem of SVM+ involves fewer constraints, dual variables, and hyperparameters than the one of SVM-2K.

Izmailov et al. [9] developed an SMO-style algorithm, named generalized SMO (gSMO), for solving SVM+. gSMO was used in the experiments in [14, 15]. At each iteration gSMO chooses two indices, i and j, and optimizes α_i, α_j, β_i, β_j while fixing the other variables. In contrast to the working sets used by aSMO and caSMO, the ones used by gSMO are not irreducible.
In Section 1.2 we discuss the disadvantages of choosing a working set that is not irreducible, and in Section 1.5.3 we show that aSMO and caSMO are significantly faster than the algorithm of [9].

1.2 Sparse Line Search Algorithms

We consider the optimization problem max_{x∈F} f(x), where f is a concave function and F is a convex compact domain.

Definition 1 A direction u ∈ R^n is feasible at the point x ∈ F if ∃λ > 0 such that x + λu ∈ F.

Definition 2 Let NZ(u) ⊆ {1, 2, . . . , n} be the set of non-zero indices of u ∈ R^n. A feasible direction u is maximally sparse if any u′ ∈ R^n such that NZ(u′) ⊂ NZ(u) is not feasible.

Lemma 1 Let u ∈ R^n be a feasible direction vector. The variables in NZ(u) have a single degree of freedom iff u is maximally sparse feasible.

It follows from this lemma that if we choose a maximally sparse feasible direction u and maximize f along the half-line x + λu, λ > 0, then we find the optimal values of the variables in NZ(u) while fixing all other variables. As we will see in the next sections, if u is sparse and has a constant number of nonzero entries then the exact maximization along the line can be done in constant time. In the next sections we will also show a fast way of choosing a good maximally sparse feasible direction. To ensure that the optimal λ ≠ 0 we require the direction u to have an angle with the gradient strictly smaller than π/2. The resulting sparse line search optimization framework is formalized in Algorithm 1. Algorithm 1 has two parameters, τ and κ. The parameter τ controls the minimal alignment of the chosen direction u with the direction of the gradient. The parameter κ is the minimal size of the optimization step. Algorithm 1 stops when there is no maximally sparse feasible direction that is close to the direction of the gradient and allows an optimization step of significant length.

Let u be a feasible direction that is not maximally sparse.
By Definition 2 and Lemma 1, the set NZ(u) has at least two degrees of freedom. Thus the line search along the half-line x + λu, λ > 0, will not find the optimal values of the variables in NZ(u) while fixing all other variables. To achieve the latter, instead of a one-dimensional line search we would need to do an expensive k-dimensional hyperplane search, where k > 1 is the number of degrees of freedom in NZ(u). This motivates the usage of maximally sparse search directions.

In Section 1.2.1 we instantiate Algorithm 1 to the optimization problem of SVM and obtain the well-known SMO algorithm. In Section 1.2.2 we instantiate Algorithm 1 to the optimization problem of SVM+.

Algorithm 1 Line search optimization with maximally sparse feasible directions.
Input: Function f : R^n → R, domain F ⊂ R^n, initial point x_0 ∈ F, constants τ > 0 and κ > 0.
Output: A point x ∈ F with a 'large' value of f(x).
1: Set x = x_0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ a maximally sparse feasible direction u such that ∇f(x)ᵀu > τ and λ(x, u) > κ do
3:   Choose a maximally sparse feasible direction u such that ∇f(x)ᵀu > τ and λ(x, u) > κ.
4:   Let λ* = arg max_{λ : x+λu ∈ F} f(x + λu).
5:   Set x = x + λ*u.
6: end while

1.2.1 SMO Algorithm for Solving SVM

Following [3], we present SMO as a line search optimization algorithm. Let

    I_1(α) = {i : (α_i > 0 and y_i = −1) or (α_i < C and y_i = 1)},
    I_2(α) = {i : (α_i > 0 and y_i = 1) or (α_i < C and y_i = −1)}.

At each iteration, SMO chooses a direction u_s ∈ R^n such that s = (i, j), i ∈ I_1, j ∈ I_2, u_i = y_i, u_j = −y_j, and u_r = 0 for any r ≠ i, j. If we move from α in the direction u_s then (1.4) remains satisfied. It follows from the definition of I_1, I_2, and u_s that ∃λ = λ(α, s) > 0 such that α + λu_s ∈ [0, C]^n. Thus u_s is a feasible direction. The direction u_s is also maximally sparse. Indeed, any direction u′ with strictly more zero coordinates than u_s has a single nonzero coordinate.
But if we move along such a u′ then we will not satisfy (1.4).

Let g(α) be the gradient of (1.3) at the point α. By the Taylor expansion of ψ(λ) ≜ D(α^{old} + λu_s) at λ = 0, the maximum of ψ(λ) along the half-line λ ≥ 0 is at

    λ′(α^{old}, s) = − (∂D(α^{old} + λu_s)/∂λ |_{λ=0}) / (∂²D(α^{old} + λu_s)/∂λ² |_{λ=0}) = − g(α^{old})ᵀ u_s / (u_sᵀ H u_s),      (1.10)

where H is the Hessian of D(α^{old}). The clipped value of λ′(α^{old}, s) is

    λ*(α^{old}, s) = min( λ′(α^{old}, s),  min_{i∈s : u_i>0} (C − α_i^{old})/u_i,  min_{i∈s : u_i<0} (−α_i^{old}/u_i) ),      (1.11)

so that the new value α = α^{old} + λ*(α^{old}, s)u_s is always in the domain [0, C]^n.

Let τ > 0 be a small constant and I = {u_s | s = (i, j), i ∈ I_1, j ∈ I_2, g(α^{old})ᵀu_s > τ}. If I = ∅ then SMO stops. It can be shown that in this case the Karush–Kuhn–Tucker (KKT) optimality conditions are almost satisfied (up to accuracy τ). Suppose now that I ≠ ∅. We describe a procedure for choosing the next direction. This procedure is currently implemented in LIBSVM and has complexity O(n). Let

    s = arg max_{t : u_t ∈ I} g(α^{old})ᵀ u_t      (1.12)

be the nonzero components of the direction u that is maximally aligned with the gradient. Earlier versions of LIBSVM chose u_s as the next direction. As observed by [6], empirically faster convergence is obtained if instead of s = (i, j) we choose s′ = (i, j′) such that u_{s′} ∈ I and the move in the direction u_{s′} maximally increases (1.3):

    s′ = arg max_{t=(t_1,t_2) : t_1=i, u_t∈I}  D(α^{old} + λ′(α^{old}, t)u_t) − D(α^{old})      (1.13)
       = arg max_{t=(t_1,t_2) : t_1=i, u_t∈I}  (g(α^{old})ᵀ u_t)² / (u_tᵀ H u_t).      (1.14)

The equality (1.14) follows from substituting λ′(α^{old}, t) (given by (1.10)) into (1.13) and using the definition (1.3) of D(α).

1.2.2 Alternating SMO (aSMO) for Solving SVM+

We start with the identification of the maximally sparse feasible directions. Since (1.6) has 2n variables, {α_i}_{i=1}^n and {β_i}_{i=1}^n, the search direction is u ∈ R^{2n}. We designate the first n coordinates of u for the α's and the last n coordinates for the β's.
It follows from the equality constraints (1.7) that (1.6) has three types of maximally sparse feasible directions, I_1, I_2, I_3 ⊂ R^{2n}:

1. I_1 = {u_s | s = (s_1, s_2), n+1 ≤ s_1, s_2 ≤ 2n, s_1 ≠ s_2; u_{s_1} = 1, u_{s_2} = −1, β_{s_2} > 0, u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ I_1 increases β_{s_1} and decreases β_{s_2} > 0 by the same quantity. The rest of the variables remain fixed.

2. I_2 = {u_s | s = (s_1, s_2), 1 ≤ s_1, s_2 ≤ n, s_1 ≠ s_2, y_{s_1} = y_{s_2}; u_{s_1} = 1, u_{s_2} = −1, α_{s_2} > 0, u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ I_2 increases α_{s_1} and decreases α_{s_2} > 0 by the same quantity. The rest of the variables remain fixed.

3. I_3 = {u_s | s = (s_1, s_2, s_3), 1 ≤ s_1, s_2 ≤ n, n+1 ≤ s_3 ≤ 2n, s_1 ≠ s_2; u_i = 0 ∀ i ∉ s; y_{s_1} ≠ y_{s_2}; u_{s_1} = u_{s_2} = 1, u_{s_3} = −2, β_{s_3} > 0, or u_{s_1} = u_{s_2} = −1, u_{s_3} = 2, α_{s_1} > 0, α_{s_2} > 0}. A move in a direction u_s ∈ I_3 either increases α_{s_1}, α_{s_2} and decreases β_{s_3} > 0, or decreases α_{s_1} > 0, α_{s_2} > 0 and increases β_{s_3}. The absolute value of the change in β_{s_3} is twice the absolute value of the change in α_{s_1} and in α_{s_2}. The rest of the variables remain fixed.

We abbreviate x = (α, β)ᵀ and x^{old} = (α^{old}, β^{old})ᵀ. It can be verified that if we move from any feasible point x in a direction u_s ∈ I_1 ∪ I_2 ∪ I_3 then (1.7) remains satisfied. Let D(x) be the objective function defined by (1.6) and λ*(x^{old}, s) be a real number that maximizes ψ(λ) = D(x^{old} + λu_s) such that x^{old} + λu_s satisfies (1.8). The optimization step of aSMO is x = x^{old} + λ*(x^{old}, s) · u_s, where u_s ∈ I_1 ∪ I_2 ∪ I_3. We now show how λ*(x^{old}, s) is computed.

Let g(x^{old}) and H be respectively the gradient and the Hessian of (1.6) at the point x^{old}. Using the Taylor expansion of ψ(λ) around λ = 0 we obtain that

    λ′(x^{old}, s) = arg max_{λ ≥ 0} ψ(λ) = − (∂D(x^{old} + λu_s)/∂λ |_{λ=0}) / (∂²D(x^{old} + λu_s)/∂λ² |_{λ=0}) = − g(x^{old})ᵀ u_s / (u_sᵀ H u_s).      (1.15)

We clip the value of λ′(x^{old}, s) so that the new point x = x^{old} + λ*(x^{old}, s) · u_s satisfies (1.8):

    λ*(x^{old}, s) = min( λ′(x^{old}, s),  min_{i∈s : u_i<0} (−x_i^{old}/u_i) ).      (1.16)

Let τ > 0, κ > 0, λ(x^{old}, s) = max{λ | x^{old} + λu_s satisfies (1.8)}, and I = {u_s | u_s ∈ I_1 ∪ I_2 ∪ I_3, g(x^{old})ᵀu_s > τ and λ(x^{old}, s) > κ}. If I = ∅ then aSMO stops. Suppose that I ≠ ∅. We choose the next direction in three steps. The first two steps are similar to the ones done by LIBSVM [4].

At the first step, for each 1 ≤ i ≤ 3 we find a vector u_{s(i)} ∈ I_i that has a minimal angle with g(x^{old}) among all vectors in I_i: if I_i ≠ ∅ then s(i) = arg max_{s : u_s ∈ I_i} g(x^{old})ᵀ u_s, otherwise s(i) is empty. Since the expression g(x^{old})ᵀu_s is linear in u_s, the complexity of this step is O(n).

For 1 ≤ i ≤ 3, let Ĩ_i = {u_s | u_s ∈ I_i, g(x^{old})ᵀu_s > τ and λ(x^{old}, s) > κ}. At the second step, for 1 ≤ i ≤ 2, if the pair s(i) = (s_1^{(i)}, s_2^{(i)}) is not empty then we fix s_1^{(i)} and find s′(i) = (s_1^{(i)}, s_2′^{(i)}) such that u_{s′(i)} ∈ Ĩ_i and the move in the direction u_{s′(i)} maximally increases (1.6):

    s′(i) = arg max_{t=(t_1,t_2) : t_1 = s_1^{(i)}, u_t ∈ Ĩ_i}  Δ_i(x^{old}, t) = arg max_{t=(t_1,t_2) : t_1 = s_1^{(i)}, u_t ∈ Ĩ_i}  (g(x^{old})ᵀ u_t)² / (u_tᵀ H u_t),      (1.17)

where Δ_i(x^{old}, t) ≜ D(x^{old} + λ′(x^{old}, t) · u_t) − D(x^{old}). The right-hand equality in (1.17) follows from substituting λ′(x^{old}, t) (given by (1.15)) into Δ_i(x^{old}, t) and using the definition (1.6) of D(x). Similarly, for i = 3, if the triple s(3) = (s_1^{(3)}, s_2^{(3)}, s_3^{(3)}) is not empty then we fix s_1^{(3)} and s_3^{(3)} and find s′(3) = (s_1^{(3)}, s_2′^{(3)}, s_3^{(3)}) such that u_{s′(3)} ∈ Ĩ_3 and the move in the direction u_{s′(3)} maximally increases (1.6). Since a direction vector u_t ∈ I_1 ∪ I_2 ∪ I_3 has at most three nonzero components, the computation of Δ_i(x^{old}, t) takes constant time. Therefore, since we are doing a one-dimensional search in (1.17), the second step takes linear time.

At the third step, we choose j = arg max_{1≤i≤3, s′(i)≠∅} Δ_i(x^{old}, s′(i)).
This step takes constant time. The final search direction is u_{s′(j)} and the overall complexity of choosing this direction is O(n). The following theorem, proven in Section 1.4, establishes the asymptotic convergence of aSMO.

Theorem 2 aSMO stops after a finite number of iterations. Let x*(τ, κ) be the solution found by aSMO for given τ and κ, and let x* be a solution of (1.6). Then lim_{τ→0, κ→0} x*(τ, κ) = x*.

1.3 Conjugate Sparse Line Search

Conjugate sparse line search is an extension of Algorithm 1. Similarly to Algorithm 1, we choose a maximally sparse feasible direction. However, in contrast to Algorithm 1, in conjugate sparse line search we also require that the next search direction be conjugate to the k previously chosen ones. The order of conjugacy, k, is between 0 and an upper bound k_max ≥ 1. We allow k to vary between iterations. Note that if k = 0 then the optimization step done by conjugate sparse line search is the same as the one done by (non-conjugate) sparse line search. In Section 1.3.1 we show how we choose k.

Algorithm 2 Conjugate line search optimization with maximally sparse feasible directions.
Input: Function f(x) = (1/2)xᵀQx + cᵀx, domain F ⊂ R^n, initial point x_0 ∈ F, constants τ > 0 and κ > 0, maximal order of conjugacy k_max ≥ 1.
Output: A point x ∈ F with a 'large' value of f(x).
1: Set x = x_0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ a maximally sparse feasible direction u such that ∇f(x)ᵀu > τ and λ(x, u) > κ do
3:   Choose an order of conjugacy k, 0 ≤ k ≤ k_max.
4:   Let g_0 and g_i be the gradient of f at the beginning of the current iteration and of the i-th previous iteration, respectively.
5:   Choose a maximally sparse feasible direction u such that ∇f(x)ᵀu > τ, λ(x, u) > κ, and uᵀ(g_{i−1} − g_i) = 0 for all 1 ≤ i ≤ k.
6:   Let λ* = arg max_{λ : x+λu ∈ F} f(x + λu).
7:   Set x = x + λ*u.
8: end while

Let Q be an n × n positive semidefinite matrix and let c ∈ R^n.
In this section we assume that f is a quadratic function, f(x) = (1/2)xᵀQx + cᵀx. This assumption is satisfied by all SVM-like problems. Let u be the next search direction that we want to find, and let u_i be the search direction found at the i-th previous iteration. We would like u to be conjugate to u_i, 1 ≤ i ≤ k, w.r.t. the matrix Q:

    ∀ 1 ≤ i ≤ k:  uᵀQu_i = 0.      (1.18)

Let g_0 and g_i be the gradient of f at the beginning of the current iteration and of the i-th previous iteration, respectively. Similarly, let x_0 and x_i be the optimization point at the beginning of the current iteration and of the i-th previous iteration. By the definition of f, g_i = Qx_i + c. Since for any 1 ≤ i ≤ k, x_{i−1} = x_i + λ_i u_i, where λ_i ∈ R, we have that uᵀQu_i = uᵀQ(x_{i−1} − x_i)/λ_i = uᵀ(g_{i−1} − g_i)/λ_i, and thus the conjugacy condition (1.18) is equivalent to

    ∀ 1 ≤ i ≤ k:  uᵀ(g_{i−1} − g_i) = 0.      (1.19)

The resulting conjugate sparse line search algorithm is formalized in Algorithm 2. Suppose that the search directions u and u_i, 1 ≤ i ≤ k, have r non-zero components. Commonly, implementations of SMO-like algorithms (e.g., LIBSVM and UniverSVM) compute the gradient vector after each iteration. Thus the additional complexity of checking the condition (1.19) is O(kr), whereas the additional complexity of checking (1.18) is O(kr²). As we will see in Section 1.3.1, if k ≥ 1 then r ≥ 3. Hence it is faster to check (1.19) than (1.18). In the rest of this section we describe the instantiation of Algorithm 2 to the optimization problem of SVM+.

1.3.1 Conjugate Alternating SMO (caSMO) for Solving SVM+

We use the same notation as in Section 1.2.2. At each iteration we choose a maximally sparse feasible direction u_s ∈ R^{2n} that solves the following system of k + 2 linear equalities:

    u_sᵀ 1 = 0,
    u_sᵀ ȳ = 0,      (1.20)
    ∀ 1 ≤ i ≤ k:  u_sᵀ (g_{i−1} − g_i) = 0,

where 1 ∈ R^{2n} is the column vector with all entries equal to one and ȳ = [y; 0] ∈ R^{2n} is the concatenation of the label vector y with an n-dimensional zero vector.
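Finding a direction with a prescribed sparsity pattern that solves (1.20) amounts to computing a null vector of the constraint system restricted to the working-set coordinates. A hedged numpy sketch (the helper name is ours; it returns None when the restricted system admits only the trivial solution):

```python
import numpy as np

def sparse_conjugate_direction(support, n, y, grad_diffs, tol=1e-10):
    """Find u in R^{2n} supported on `support` that solves (1.20):
    u.T 1 = 0, u.T y_bar = 0, and u.T (g_{i-1} - g_i) = 0 for each stored
    gradient difference. Returns None if only the trivial solution exists."""
    y_bar = np.concatenate([y, np.zeros(n)])   # labels on the alpha block, zeros on the beta block
    rows = [np.ones(2 * n), y_bar] + list(grad_diffs)
    A = np.array(rows)[:, support]             # restrict the k+2 constraints to the working set
    _, sv, Vt = np.linalg.svd(A)
    rank = int((sv > tol).sum())
    if rank == Vt.shape[1]:
        return None                            # restricted system has full column rank
    u = np.zeros(2 * n)
    u[support] = Vt[-1]                        # a null vector of the restricted system
    return u
```

For example, with k = 0 and a working set of two β-coordinates the restricted ȳ-row vanishes, so a nontrivial direction (proportional to an I_1 direction) exists; with two α-coordinates of different labels, the restricted system has full rank and the helper returns None.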
It can be verified that if we move from any feasible point x in a direction u_s defined above then (1.7) remains satisfied. We assume that the vectors ȳ, 1, g_0 − g_1, . . . , g_{k−1} − g_k are linearly independent. In all our experiments this assumption was indeed satisfied. It follows from the equality constraints (1.7) that there are four types of maximally sparse feasible directions u_s that solve (1.20):

1. J_1 = {u_s | u_s is feasible, s = (s_1, . . . , s_{k+2}), n+1 ≤ s_1, . . . , s_{k+2} ≤ 2n; u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ J_1 changes the β's indexed by s. The rest of the variables remain fixed.

2. J_2 = {u_s | u_s is feasible, s = (s_1, . . . , s_{k+2}), 1 ≤ s_1, . . . , s_{k+2} ≤ n, y_{s_1} = · · · = y_{s_{k+2}}; u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ J_2 changes the α's indexed by s, such that the corresponding examples all have the same label. The rest of the variables remain fixed.

3. J_3 = {u_s | u_s is feasible, s = (s_1, . . . , s_{k+3}); ∃ i, j such that s_i, s_j ≤ n and y_{s_i} ≠ y_{s_j}; ∃ r such that s_r > n; u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ J_3 changes at least one β and at least two α's such that the corresponding two examples have different labels. The variables that are not indexed by s remain fixed.

4. J_4 = {u_s | u_s is feasible, s = (s_1, . . . , s_{k+3}), k ≥ 1, 1 ≤ s_1, . . . , s_{k+3} ≤ n; ∃ i ≠ j such that y_{s_i} = y_{s_j} = 1; ∃ r ≠ t such that y_{s_r} = y_{s_t} = −1; u_i = 0 ∀ i ∉ s}. A move in a direction u_s ∈ J_4 changes the α's indexed by s. At least two of these α's correspond to different positive examples and at least two correspond to different negative examples. The rest of the variables remain fixed.

The definitions of J_1–J_3 here are direct extensions of the corresponding definitions of I_1–I_3 for aSMO (see Section 1.2.2). However, the definition of J_4 does not have a counterpart in aSMO.
By the definition of J_4, it contains feasible direction vectors that change at least two α's of positive examples and at least two α's of negative examples. Such vectors are not maximally sparse feasible for the original optimization problem (1.6), since for any u_s ∈ J_4 there exists u_{s′} ∈ I_2 such that NZ(u_{s′}) ⊂ NZ(u_s). Hence the vectors from J_4 cannot be used by aSMO. Thus the conjugate direction constraints give additional power to the optimization algorithm by allowing it to use a new type of directions that was not available previously.

The optimization step of caSMO is x = x^{old} + λ*(x^{old}, s) · u_s, where u_s satisfies the above conditions and λ*(x^{old}, s) is computed exactly as in aSMO, by (1.15) and (1.16).

We now describe the procedure for choosing the next direction u_s that satisfies (1.20). At the first iteration we choose the direction u_s ∈ I_1 ∪ I_2 ∪ I_3 that is generated by aSMO and set k = 0. At subsequent iterations we proceed as follows. Let u_s be the direction chosen at the previous iteration and s = (s_1, . . . , s_{k′}) be the current working set. If u_s ∈ I_1 ∪ I_2 then k′ = k + 2, otherwise k′ = k + 3. Recall that 0 ≤ k ≤ k_max. If k < k_max then we try to increase the order of conjugacy by setting k = k + 1 and finding an index

    s_{k′+1} = arg max_{t : s = (s_1, . . . , s_{k′}, t); u_s ∈ J_1 ∪ J_2 ∪ J_3 ∪ J_4; u_s solves (1.20)}  Δ(x^{old}, s),      (1.21)

where Δ(x^{old}, s) ≜ D(x^{old} + λ′(x^{old}, s) · u_s) − D(x^{old}). Similarly to (1.17), it can be shown that Δ(x^{old}, s) = (g(x^{old})ᵀ u_s)² / (u_sᵀ H u_s).

If k = k_max then we reduce ourselves to the case of k < k_max and proceed as above. The reduction is done by removing from s the least recently added index s_1, obtaining s = (s_2, . . . , s_{k′}), decrementing k to k − 1, and renumbering the indices so that s = (s_1, . . . , s_{k′−1}).

We also use a 'second opinion' and consider the search direction u_{s′} ∈ I_1 ∪ I_2 ∪ I_3 that is found by aSMO. If s_{k′+1} found by (1.21) is empty, namely there is no index t such that s = (s_1, . . . , s_{k′}, t), u_s ∈ J_1 ∪ J_2 ∪ J_3 ∪ J_4, and u_s solves (1.20), then we set k = 0 and the next search direction is u_{s′}. Also, if s_{k′+1} is not empty but the functional gain from u_{s′} is larger than the one from u_s, namely if Δ(x^{old}, s′) > Δ(x^{old}, s), then we set k = 0 and the next search direction is u_{s′}. Otherwise the next search direction is u_s.

Given the indices s = (s_1, . . . , s_{k′}), in order to find u_s we need to solve (1.20). Since u_s has O(k) non-zero entries, this takes O(k³) time. It can be shown that the overall complexity of the above procedure for choosing the next search direction is O(k_max³ n). Hence it is computationally feasible only for small values of k_max. In our experiments (Section 1.5.3) we observed that the value of k_max that gives nearly optimal running time is 3. The following theorem, proven in Section 1.4, establishes the asymptotic convergence of caSMO.

Theorem 3 caSMO stops after a finite number of iterations. Let x*(τ, κ) be the solution found by caSMO for given τ and κ, and let x* be a solution of (1.6). Then lim_{τ→0, κ→0} x*(τ, κ) = x*.

1.4 Proof of Convergence Properties of aSMO, caSMO

We use the results of [2] to demonstrate the convergence properties of aSMO and caSMO. Bordes et al. [2] presented a generalization of Algorithm 1. Instead of maximally sparse vectors u, at each iteration they consider directions u from a set U ⊂ R^n. The optimization algorithm of [2] is formalized in Algorithm 3.

Algorithm 3 Optimization procedure of [2].
Input: Function f : R^n → R, domain F ⊂ R^n, set of directions U ⊆ R^n, initial point x_0 ∈ F, constants τ > 0, κ > 0.
1: Set x = x_0. Let λ(x, u) = max{λ : x + λu ∈ F}.
2: while ∃ u ∈ U such that ∇f(x)ᵀu > τ and λ(x, u) > κ do
3:   Choose any u ∈ U such that ∇f(x)ᵀu > τ and λ(x, u) > κ.
4:   Let λ* = arg max_{λ : x+λu ∈ F} f(x + λu).
5:   Set x = x + λ*u.
6: end while
7: Output x.

aSMO is obtained from Algorithm 3 by setting U = I_1 ∪ I_2 ∪ I_3.
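Algorithm 3 is simple enough to render generically. The sketch below is a hypothetical instantiation (not the chapter's implementation): it maximizes a concave quadratic f(x) = −(1/2)xᵀQx + cᵀx over {x ≥ 0, ∑ x_i = ∑ x_i(0)} with the direction family U = {e_i − e_j}, the finite witness family used for SVM.

```python
import numpy as np

def algorithm3_quadratic(Q, c, x0, tau=1e-8, kappa=1e-12, max_iter=10000):
    """Sketch of Algorithm 3 for f(x) = -0.5 x^T Q x + c^T x over
    F = {x >= 0, sum(x) = sum(x0)} with direction family U = {e_i - e_j}."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        g = c - Q @ x                          # gradient of f
        i, j = int(np.argmax(g)), int(np.argmin(g))
        u = np.zeros_like(x)
        u[i] += 1.0
        u[j] -= 1.0
        if g @ u <= tau:                       # no sufficiently ascending direction: stop
            break
        lam_max = x[j]                         # step length before x_j leaves F
        if lam_max <= kappa:                   # a fuller implementation would try other pairs
            break
        curv = u @ Q @ u
        lam = lam_max if curv <= 0 else min((g @ u) / curv, lam_max)
        x += lam * u                           # exact line search along u, clipped to F
    return x
```

The greedy choice of i and j plays the role of line 3 of Algorithm 3; the clipped Newton step along u is the exact one-dimensional maximization of line 4.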
Our results are based on the following definition and theorem of [2].

Definition 3 ([2]) A set of directions U ⊂ R^n is a finite witness family for a convex set F if U is finite and, for any x ∈ F, any feasible direction u at the point x is a positive linear combination of a finite number of feasible directions v_i ∈ U at the point x.

The following theorem is a straightforward generalization of Proposition 13 in [2].

Theorem 4 ([2]) Let U ⊂ R^n be a set that contains a finite witness family for F. Then Algorithm 3 stops after a finite number of iterations. Let x*(τ, κ) be the solution found by Algorithm 3 for given τ and κ, and let x* = arg max_{x∈F} f(x). Then lim_{τ→0, κ→0} x*(τ, κ) = x*.

Lemma 5 The set I_1 ∪ I_2 ∪ I_3 (defined in Section 1.2.2) is a finite witness family for the set F defined by (1.7)–(1.8).

Proof We use the ideas of the proof of Proposition 7 in [2], which shows that U = {e_i − e_j, i ≠ j} is a finite witness family for SVM. The set I_1 ∪ I_2 ∪ I_3 is finite. Let x ∈ R^{2n} be an arbitrary point in F and u be a feasible direction at x. Recall that the first n coordinates of x are designated for the values of the α's and the last n coordinates for the values of the β's. There exists λ′ > 0 such that x′ = x + λ′u ∈ F. We construct a finite path x = z(0) → z(1) → · · · → z(m−1) → z(m) = x′ such that z(j+1) = z(j) + γ(j)v(j), where z(j+1) ∈ F and v(j) ∈ I_1 ∪ I_2 ∪ I_3 is a feasible direction at x. We prove by induction on m that these two conditions are satisfied. As an auxiliary tool, we also prove by induction on m that our construction satisfies the following invariant for all 1 ≤ i ≤ 2n and 1 ≤ j ≤ m:

    x_i > x′_i ⇔ z_i(j) ≥ x′_i.      (1.22)

Let I_+ = {i | 1 ≤ i ≤ n, y_i = +1} and I_− = {i | 1 ≤ i ≤ n, y_i = −1}. We have from (1.7) that

    ∑_{i∈I_+} x′_i = ∑_{i∈I_−} x′_i.      (1.23)

The construction of the path has three steps.
At the first step (Cases 1 and 2 below), we move the current overall weight of the β's (which is ∑_{i=n+1}^{2n} z_i(j)) towards the overall weight of the β's in x′ (which is ∑_{i=n+1}^{2n} x′_i). When the current overall weight of the β's is the same as the overall weight of the β's in x′, we proceed to the second step. At the second step (Case 3 below), we move the values of the β's in z(j) towards their corresponding target values in x′. The second step terminates when the values of the β's in z(j) and in x′ are exactly the same. At the third step (Case 4 below), we move the values of the α's in z(j) towards their corresponding target values in x′. The third step terminates when the values of the α's in z(j) and in x′ are the same. We now describe this procedure in a formal way. For each 0 ≤ j ≤ m − 1, we have four cases:

Case 1: ∑_{i=n+1}^{2n} z_i(j) > ∑_{i=n+1}^{2n} x′_i.
Since z(j), x′ ∈ F, by (1.7) we have that ∑_{i=1}^n z_i(j) < ∑_{i=1}^n x′_i. Let T = {i | n+1 ≤ i ≤ 2n, z_i(j) > x′_i}. We have from (1.7) that ∑_{i∈I_+} z_i(j) = ∑_{i∈I_−} z_i(j). Using (1.23) we obtain that there exist r ∈ I_+ and s ∈ I_− such that z_r(j) < x′_r and z_s(j) < x′_s. Let t be any index from T, u_{rst} ∈ I_3 such that u_t = −2, and

    λ = min( x′_r − z_r(j),  x′_s − z_s(j),  (z_t(j) − x′_t)/2 ).

Note that λ > 0. We set v(j) = u_{rst}, γ(j) = λ, and z(j+1) = z(j) + γ(j)v(j). Since z(j) ∈ F, it follows from the definition of λ and u_{rst} that z(j+1) ∈ F. Since z(j) satisfies (1.22), we have that x_t > x′_t ≥ 0. Thus u_{rst} is a feasible direction at x. Finally, it follows from the definition of r, s, t, λ, and u_{rst} that z_i(j+1) ≥ x′_i ⇔ z_i(j) ≥ x′_i for any 1 ≤ i ≤ 2n. Hence z(j+1) satisfies (1.22).
Using (1.23) we obtain that there exist r ∈ I₊ and s ∈ I₋ such that z_r(j) > x′_r and z_s(j) > x′_s. Let t be an arbitrary index from T and u_{rst} ∈ I3 be such that u_t = 2, and let

λ = min(z_r(j) − x′_r, z_s(j) − x′_s, (x′_t − z_t(j))/2).

Note that λ > 0. We set v(j) = u_{rst}, γ(j) = λ and z(j + 1) = z(j) + γ(j)v(j). Since z(j) ∈ F, it follows from the definition of λ and u_{rst} that z(j + 1) ∈ F. Since z(j) satisfies (1.22), we have that x_r > x′_r ≥ 0 and x_s > x′_s ≥ 0. Thus u_{rst} is a feasible direction at x. Finally, it follows from the definition of r, s, t, λ and u_{rst} that z_i(j + 1) ≥ x′_i ⇔ z_i(j) ≥ x′_i for any 1 ≤ i ≤ 2n. Hence z(j + 1) satisfies (1.22).

Case 3: ∑_{i=n+1}^{2n} z_i(j) = ∑_{i=n+1}^{2n} x′_i, and there exists n + 1 ≤ i ≤ 2n such that z_i(j) ≠ x′_i.
In this case there exist n + 1 ≤ s, t ≤ 2n such that z_s(j) < x′_s and z_t(j) > x′_t. Let

λ = min(x′_s − z_s(j), z_t(j) − x′_t).

Note that λ > 0. We set v(j) = u_{st} ∈ I1, γ(j) = λ and z(j + 1) = z(j) + γ(j)v(j). Since z(j) ∈ F, it follows from the definition of λ and u_{st} that z(j + 1) ∈ F. Since z(j) satisfies (1.22), we have that x_t > x′_t ≥ 0. Thus u_{st} is a feasible direction at x. Finally, it follows from the definition of s, t, λ and u_{st} that z_i(j + 1) ≥ x′_i ⇔ z_i(j) ≥ x′_i for any 1 ≤ i ≤ 2n. Hence z(j + 1) satisfies (1.22).

Case 4: ∑_{i=n+1}^{2n} z_i(j) = ∑_{i=n+1}^{2n} x′_i, z_i(j) = x′_i for all n + 1 ≤ i ≤ 2n, and there exists 1 ≤ i ≤ n such that z_i(j) ≠ x′_i.
We claim that in this case there exist 1 ≤ s, t ≤ n such that y_s = y_t, z_s(j) < x′_s and z_t(j) > x′_t. Indeed, since ∑_{i=n+1}^{2n} z_i(j) = ∑_{i=n+1}^{2n} x′_i, it follows from (1.7) that ∑_{i=1}^{n} z_i(j) = ∑_{i=1}^{n} x′_i. Combining this with (1.23) we obtain our claim. The rest of the construction of z(j + 1) in this case is the same as in the previous case (starting from the definition of λ), with I1 replaced by I2.

If none of these conditions is satisfied then z(j) = x′ and the path terminates.
By our choice of λ, at each iteration we strictly decrease the number of coordinates of z(j) that differ from the corresponding coordinates of x′. Hence the constructed path is finite. We have that x′ = x + λ′u = x + ∑_{i=0}^{m−1} γ(i)v(i). Hence u = ∑_{i=0}^{m−1} (γ(i)/λ′) v(i).

Proof of Theorem 2 The theorem follows from combining Theorem 4 and Lemma 5.

Proof of Theorem 3 Recall that at each iteration of the working set selection procedure of caSMO we try the search direction u_{s′} ∈ I1 ∪ I2 ∪ I3, and this direction is chosen if it gives a larger gain in the objective function than the best conjugate direction. Hence by Lemma 5 the set of search directions U considered by caSMO contains a finite witness family, and the statement of the theorem follows from Theorem 4.

1.5 Experiments

1.5.1 Computation of b in SVM+

Suppose that α and β are the solution of (1.6). In this section we show how we compute the offset b in the decision function (1.9). By the Karush-Kuhn-Tucker conditions,

α_i > 0 ⇒ y_i(w · z_i + b) = 1 − (w* · z*_i + d).  (1.24)

Expressed in terms of the dual variables α and β, the condition (1.24) is α_i > 0 ⇒ y_i F_i + y_i b = 1 − (1/γ) f_i − d, where F_i = ∑_{j=1}^{n} α_j y_j K_ij and f_i = ∑_{j=1}^{n} (α_j + β_j − C) K*_ij. The last condition is also equivalent to

α_i > 0, y_i = 1 ⇒ b + d = 1 − (1/γ) f_i − F_i,
α_i > 0, y_i = −1 ⇒ b − d = −1 + (1/γ) f_i − F_i.

These two conditions motivate the following way of computing b. Let s₊ = ∑_{i: α_i>0, y_i=1} (1 − f_i/γ − F_i) and n₊ = |{i : α_i > 0, y_i = 1}|, and let s₋ = ∑_{i: α_i>0, y_i=−1} (−1 + f_i/γ − F_i) and n₋ = |{i : α_i > 0, y_i = −1}|. We set b = (s₊/n₊ + s₋/n₋)/2.

1.5.2 Description of the Datasets

We conducted our experiments on the datasets generated by the ESP and Tag a Tune human computation games. The ESP game is played between two players. Both players are given the same image and they are asked to enter the tags that characterize it.
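The offset computation of Section 1.5.1 can be sketched in plain Python. The function and variable names here are ours, not from the chapter, and the kernel matrices K and K* are assumed to be given as dense lists of lists:

```python
def svm_plus_offset(alpha, beta, y, K, K_star, C, gamma, tol=1e-12):
    """Compute the offset b of the SVM+ decision function from the
    dual solution (alpha, beta), following Section 1.5.1.
    K and K_star are the kernel matrices in the X and X* spaces."""
    n = len(y)
    # F_i = sum_j alpha_j y_j K_ij,  f_i = sum_j (alpha_j + beta_j - C) K*_ij
    F = [sum(alpha[j] * y[j] * K[i][j] for j in range(n)) for i in range(n)]
    f = [sum((alpha[j] + beta[j] - C) * K_star[i][j] for j in range(n))
         for i in range(n)]
    s_plus = s_minus = 0.0
    n_plus = n_minus = 0
    for i in range(n):
        if alpha[i] > tol:            # support vectors only
            if y[i] > 0:              # each term estimates b + d
                s_plus += 1.0 - f[i] / gamma - F[i]
                n_plus += 1
            else:                     # each term estimates b - d
                s_minus += -1.0 + f[i] / gamma - F[i]
                n_minus += 1
    # averaging the two per-class estimates cancels the correcting-space offset d
    return 0.5 * (s_plus / n_plus + s_minus / n_minus)
```

Averaging over all support vectors, rather than reading b off a single one, makes the estimate more robust to the finite precision of the dual solution.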
The player scores points if her tags match the ones of her opponent. (The ESP game is available at www.gwap.com/gwap/gamesPreview/espgame/.) The Tag a Tune game is also played between two players. Each player receives a 30-second audio clip. The player knows only her own clip and does not know the clip of her opponent. Each player sends her opponent tags that describe the clip. Upon receiving the opponent's tags, the player decides whether she and her opponent were given the same clip. If both players decide correctly, both of them score points.

We used the dataset generated by the ESP game [16] to perform image classification. This dataset has tagged images, with on average 4.7 tags per image. We used the version [8] of the ESP dataset available at lear.inrialpes.fr/people/guillaumin/data_iccv09.php. In our experiments we used the DenseHue, DenseSift, HarrisHue and HarrisSift features that were supplied with the images. We generated a learning problem in the LUPI setting in the following way. Let t1 and t2 be two tags. The problem is to discriminate images tagged t1 from the ones tagged t2. We do not consider images with both the t1 and t2 tags. The view X is an image feature vector, and the privileged information X* is the set of all tags that were associated with the image, except t1 and t2. We removed the privileged information from the test images. The privileged information was represented as a vector with binary entries, indicating whether a particular tag is associated with the image. We considered 3 binary problems with t1=“horse”/t2=“fish”, t1=“horse”/t2=“bird” and t1=“fish”/t2=“bird”. We denote these problems as ESP1/ESP2/ESP3 respectively. Each such problem has about 500 examples and is balanced. We used a test set of 100 examples.

We used the Magnatagatune dataset (available at tagatune.org/Magnatagatune.html), generated by the Tag a Tune game [10], to do music classification. This dataset has tagged audio clips, with on average 4 tags per clip.
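The construction of the binary LUPI problems described above can be sketched as follows. The helper name `build_lupi_problem` is ours; items carrying neither tag are skipped as well, since they belong to neither class:

```python
def build_lupi_problem(items, t1, t2):
    """items: list of (feature_vector, tag_set) pairs.
    Returns (X, X_star, y): primary features, binary privileged tag
    vectors over the vocabulary excluding t1 and t2, and +/-1 labels."""
    vocab = sorted({t for _, tags in items for t in tags} - {t1, t2})
    X, X_star, y = [], [], []
    for feats, tags in items:
        has1, has2 = t1 in tags, t2 in tags
        if has1 == has2:   # tagged with both t1 and t2, or with neither: skip
            continue
        X.append(feats)
        X_star.append([1 if t in tags else 0 for t in vocab])
        y.append(1 if has1 else -1)
    return X, X_star, y
```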
Each clip is represented by measurements of its rhythm, pitch and timbre at different intervals. We used only the pitch and timbre features. The privileged information was represented as a vector with binary entries, indicating whether a particular tag is associated with the clip. We generated learning problems in the LUPI setting in the same way as for the ESP dataset. We considered 3 binary problems, “classic music” vs. “rock”, “rock” vs. “electronic music” and “classic music” vs. “electronic music”. We denote these problems as TAG1/TAG2/TAG3 respectively. Each problem has between 4000 and 6000 examples and is balanced. We used a test set of 2000 examples. (The Tag a Tune game is available at www.gwap.com/gwap/gamesPreview/tagatune/.)

1.5.3 Comparison of aSMO and caSMO

We implemented aSMO and caSMO on top of LIBSVM [4]. Similarly to LIBSVM, we implemented incremental computation of gradients and shrinking. We also used the caching infrastructure provided by the LIBSVM code. In all our experiments we used the RBF kernel in both the X and X* spaces. Let σ and σ* be the hyperparameters of this kernel in the spaces X and X* respectively.

TABLE 1.1 Comparison of generalization error vs. stopping criterion τ.

τ      ESP1   ESP2   ESP3   TAG1   TAG2   TAG3
1.99   27.16  26.42  30.67  1.95   11.76  7.45
1.9    28.42  26.00  30.33  1.91   12.05  7.49
1      28.75  27.92  32.17  1.97   12.27  7.35
0.1    29.42  29.42  32.25  2.18   12.34  7.32

TABLE 1.2 Mean running time (in seconds) of gSMO, aSMO and caSMO.

       gSMO   aSMO   caSMO
ESP1   12.41  4.21   0.13
TAG1   44.39  4.59   0.10

Also, let m and m* be the median distance between training examples in the spaces X and X* respectively. We used the log10-scaled ranges [m − 6, m + 6] for σ, [m* − 6, m* + 6] for σ*, [−1, 5] for C and [−1, 5] for γ. From each range we took 10 equally spaced values of the corresponding parameter and generated a 10 × 10 × 10 × 10 grid. We set κ = 10⁻³.
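The hyperparameter grid described above can be generated as in the following sketch. We read m and m* as the log10 of the median pairwise distance in the respective space; this reading, and all function names, are our assumptions rather than the chapter's code:

```python
import math
from itertools import product

def median_log10_distance(X):
    """log10 of the median pairwise Euclidean distance between examples."""
    dists = sorted(
        math.dist(X[i], X[j])
        for i in range(len(X)) for j in range(i + 1, len(X))
        if X[i] != X[j])                     # skip zero distances
    return math.log10(dists[len(dists) // 2])

def log10_grid(lo, hi, steps=10):
    """`steps` equally spaced values on the log10 scale in [lo, hi]."""
    return [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]

def hyperparameter_grid(X, X_star):
    """The 10 x 10 x 10 x 10 grid over (sigma, sigma*, C, gamma)."""
    m = median_log10_distance(X)
    m_star = median_log10_distance(X_star)
    return list(product(log10_grid(m - 6, m + 6),
                        log10_grid(m_star - 6, m_star + 6),
                        log10_grid(-1, 5),
                        log10_grid(-1, 5)))
```

Each grid point holds log10-scaled values; the actual parameter values are 10 raised to these values.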
In all experiments in this section we report the average running time, over all grid points and over 12 random draws of the training set. aSMO and caSMO were initialized with α_i = 0 and β_i = C for all 1 ≤ i ≤ n. With such initialization it can be verified that at the first iteration both aSMO and caSMO choose a direction u ∈ I3 and that gᵀu = 2, where g is the gradient of the objective function before the first iteration. Thus the value of the stopping criterion τ should be smaller than 2. We observed (see Table 1.1) that quite a large value of τ is sufficient to obtain nearly the best generalization error of SVM+. Based on this observation we ran all our experiments with τ = 1.9.

Table 1.2 compares the mean running time of aSMO, caSMO with kmax = 1, and the algorithm for solving SVM+, denoted gSMO, that was used in [14]. We report the results for the ESP1 and TAG1 problems, but similar results were also observed for the other four problems. Our results show that aSMO is order-of-magnitude faster than gSMO and caSMO is order-of-magnitude faster than aSMO. We also observed (see Figure 1.1) that the running time of caSMO depends on the maximal order of conjugacy kmax. Based on our experiments, we suggest setting kmax = 3 heuristically. We used this value in our experiments in Section 1.5.4. Finally, we compared (see Figure 1.2) the running time of caSMO with that of the off-the-shelf optimizer LOQO. We loosened the stopping criterion of LOQO so that it outputs nearly the same value of the objective function as caSMO. For very small training sets, up to 200 examples, LOQO is competitive with caSMO, but for larger training sets LOQO is significantly slower.

FIGURE 1.1 Running time (in seconds) of caSMO as a function of the order of conjugacy (kmax), on ESP1-ESP3 and TAG1-TAG3.

FIGURE 1.2 Comparison of the running time of caSMO and LOQO as a function of the training set size.

1.5.4 Experiments with SVM+

We show two experiments with SVM+ and compare it with SVM and SVM*. SVM* is SVM run in the privileged space X*. The experiments are on the data generated by the ESP and Tag a Tune games, described in Section 1.5.2. In Figures 1.3 and 1.4 we report mean test errors over 12 random draws of the training and test sets. In almost all experiments SVM+ was better than SVM, with a relative improvement of up to 25%. In the ESP1-3 and TAG3 problems SVM* is significantly better than SVM, and thus the privileged space X* is much better than the regular space X. In these problems SVM+ leverages the good privileged information to improve over SVM. In the TAG1 and TAG2 problems the X* space is worse than the X space. Nevertheless, the results on these datasets show that SVM+ is robust to bad privileged information.

FIGURE 1.3 Learning curves of SVM, SVM+ and SVM* in image classification (ESP1-ESP3).

FIGURE 1.4 Learning curves of SVM, SVM+ and SVM* in music classification (TAG1-TAG3).
Moreover, on the TAG2 dataset, despite the weak X* space, SVM+ managed to extract useful information from it and outperformed SVM.

1.6 Conclusions

We presented a framework of sparse conjugate direction optimization algorithms. This framework combines the ideas of the SMO algorithm and of optimization using conjugate directions. Our framework is very general and can be efficiently instantiated for any optimization problem with a quadratic objective function and linear constraints. The instantiation to SVM+ gives a two-orders-of-magnitude running time improvement over the existing solver. We plan to instantiate the above framework for the optimization problem of SVM. Our work leaves plenty of room for future research. It would be interesting to automatically tune the maximal order of conjugacy. Also, it is important to characterize theoretically when sparse conjugate optimization is preferable over its nonconjugate counterpart. Finally, we plan to develop optimization algorithms for SVM+ when the kernels in the X and/or X* space are linear.

Acknowledgements

We thank Leon Bottou for the fruitful conversations.

Bibliography

[1] A. Blum and T. Mitchell. Combining labeled and unlabeled examples with co-training. In COLT, pages 92-100, 1998.
[2] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. JMLR, 6:1579-1619, 2005.
[3] L. Bottou and C.-J. Lin. Support Vector Machine solvers. In Large Scale Kernel Machines, pages 1-27. 2007.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for Support Vector Machines, 2001. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[6] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. JMLR, 6:243-264, 2005.
[7] J. Farquhar, D. R. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmak. Two view learning: SVM-2K, theory and practice.
In NIPS, pages 355-362, 2005.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309-316, 2009.
[9] R. Izmailov, V. Vapnik, and A. Vashist. Private communication, 2009.
[10] E. Law and L. von Ahn. Input-agreement: A new mechanism for data collection using human computation games. In CHI, pages 1197-1206, 2009.
[11] J. Nocedal and S. Wright. Numerical Optimization. Springer-Verlag, 2nd edition, 2006.
[12] J. Platt. Fast training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1999.
[13] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, 2nd edition, 2006.
[14] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
[15] V. Vapnik, A. Vashist, and N. Pavlovich. Learning using hidden information: Master class learning. In Proceedings of the NATO Workshop on Mining Massive Data Sets for Security, pages 3-14, 2008.
[16] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI, pages 319-326, 2004.