Proof of Convergence for Kernelized M3L

Bharath Hariharan
Lihi Zelnik-Manor
S.V.N. Vishwanathan
Manik Varma

July 20, 2010

We now describe the convergence of the algorithm for kernelized M3L [2]. It closely follows the proof of convergence of SMO [3].
1  Notation

We denote vectors in bold small letters, for example v. If v is a vector of dimension d, then v_k, k ∈ {1, ..., d} is the kth component of v, and v_I, I ⊆ {1, ..., d} denotes the vector with components v_k, k ∈ I (with the v_k's arranged in the same order as in v). Similarly, matrices will be written in bold capital letters, for example A. If A is an m × n matrix, then A_ij represents the ijth entry of A, and A_IJ represents the matrix with entries A_ij, i ∈ I, j ∈ J.

A sequence is denoted as {a_n}, and a_n is the nth element of this sequence. If â is a limit point of the sequence, we write a_n → â.

2  The Optimization Problem

The dual that we are trying to solve is:

    max_α  2 Σ_{l=1}^L α_l^t 1 − 2 Σ_{l=1}^L Σ_{k=1}^L R_lk α_l^t Y_l K Y_k α_k        (1)
    s.t.   0 ≤ α ≤ C1

where α_l = [α_1l, ..., α_Nl], Y_l = diag([y_1l, ..., y_Nl]) and K = φ(X)^t φ(X). Making the substitution θ_il = 2 y_il α_il, we get the following optimization problem:

    max_θ  Σ_{i=1}^N Σ_{l=1}^L θ_il y_il − (1/2) Σ_{i=1}^N Σ_{j=1}^N Σ_{k=1}^L Σ_{l=1}^L θ_ik θ_jl R_kl K_ij        (2)
    s.t.   0 ≤ θ_ik ≤ 2C      ∀i, k s.t. y_ik > 0
           −2C ≤ θ_ik ≤ 0     ∀i, k s.t. y_ik < 0

This can be written as the following optimization problem:

    max_θ  f(θ) = −(1/2) θ^t Qθ + y^t θ        (3)
    s.t.   l ≤ θ ≤ u

Here the vector θ = [θ_11, ..., θ_1L, θ_21, ..., θ_NL]^t, y = [y_1^t, ..., y_N^t]^t and Q = K ⊗ R, where ⊗ is the Kronecker product. l and u are NL-dimensional vectors with entries:

    l_ik = 0       if y_ik > 0
           −2C     if y_ik < 0        (4)

    u_ik = 2C      if y_ik > 0
           0       if y_ik < 0        (5)
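As a concrete illustration of (3)–(5) (this sketch is not part of the original paper), the following NumPy snippet assembles Q, y, l and u from an N × N kernel matrix K, an L × L label-correlation matrix R, a label matrix Y with ±1 entries and the constant C, all of which are assumed to be given, and evaluates the objective f(θ):

```python
import numpy as np

def build_problem(K, R, Y, C):
    """Assemble Q, y, l, u of problem (3) from K (N x N), R (L x L) and Y (N x L, +/-1)."""
    Q = np.kron(K, R)                   # Q = K (x) R for the ordering theta = [theta_11..theta_1L, theta_21, ...]
    y = Y.reshape(-1).astype(float)     # labels stacked in the same order as theta
    l = np.where(y > 0, 0.0, -2.0 * C)  # lower bounds, eq. (4)
    u = np.where(y > 0, 2.0 * C, 0.0)   # upper bounds, eq. (5)
    return Q, y, l, u

def objective(theta, Q, y):
    """f(theta) = -1/2 theta^t Q theta + y^t theta, eq. (3)."""
    return -0.5 * theta @ Q @ theta + y @ theta
```

The row-major reshape of Y matches the ordering of θ given above, which is what makes Q = K ⊗ R (and hence np.kron(K, R)) the pairing of K_ij with R_kl used in (2).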
3  The algorithm

Algorithm 1 Our algorithm
    θ_ik ← 0   ∀i, k
    repeat
        Pick i, k such that |g̃_ik| ≥ τ
        if ∃ j such that K_ij ≠ √(K_ii K_jj) and g̃_jk ≠ 0 then
            Pick one such j
            Optimize w.r.t. θ_ik and θ_jk
        else
            Optimize w.r.t. θ_ik
        end if
        Update g_ik   ∀i, k
    until |g̃_ik| < τ   ∀i, k
We will prove the convergence result for the algorithm given in Algorithm 1. We start with θ^1 = 0, and generate a sequence of vectors θ^n, where θ^n is the vector after the (n − 1)th iteration. The gradient ∇f(θ^n), denoted by g^n, is stored. The projected gradient ∇^P f(θ^n), denoted by g̃^n, is computed from the gradient on the fly. τ > 0 is a user-defined parameter that determines when we stop. The algorithm terminates when all projected gradients are less than τ in magnitude.
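As a minimal sketch of this bookkeeping (not the authors' implementation), the snippet below runs a simplified variant of Algorithm 1 that only uses the single-variable branch: the chosen coordinate is maximized exactly over its box by a clipped Newton step, the stored gradient g is updated, and the projected gradient g̃ of (31) in Appendix A provides the τ stopping test. The two-variable branch would follow the same pattern with a 2 × 2 subproblem.

```python
import numpy as np

def projected_gradient(theta, g, l, u):
    """Projected gradient of eq. (31): zero out components that point outside the box."""
    pg = g.copy()
    pg[(theta <= l) & (g < 0)] = 0.0   # at the lower bound only non-negative components survive
    pg[(theta >= u) & (g > 0)] = 0.0   # at the upper bound only non-positive components survive
    return pg

def coordinate_ascent(Q, y, l, u, tau=1e-3, max_iter=100000):
    """Simplified single-variable variant of Algorithm 1 for problem (3)."""
    theta = np.zeros_like(y, dtype=float)
    g = y - Q @ theta                  # gradient of f at theta (equals y at theta = 0)
    for _ in range(max_iter):
        pg = projected_gradient(theta, g, l, u)
        p = int(np.argmax(np.abs(pg)))
        if abs(pg[p]) < tau:           # tau-optimal: every |projected gradient| < tau
            break
        # exact maximization of f along coordinate p, clipped to [l_p, u_p]
        new_p = np.clip(theta[p] + g[p] / Q[p, p], l[p], u[p])
        g -= (new_p - theta[p]) * Q[:, p]   # keep the stored gradient up to date
        theta[p] = new_p
    return theta
```

Combined with the build_problem sketch above, coordinate_ascent(Q, y, l, u) returns a τ-optimal θ for (3) under this simplification.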
We assume that R and K are both positive definite matrices. The eigenvalues of Q are then λ_i μ_j, where λ_i are the eigenvalues of K and μ_j are the eigenvalues of R. Because all eigenvalues of both R and K are positive, so are the eigenvalues of Q, and thus Q is positive definite. Therefore, using Lemma 8 (refer to Appendix A), we have that θ is optimal iff θ is feasible and ∇^P f(θ) = 0.

Our algorithm will not reach an exact solution, and therefore the KKT conditions will not be satisfied exactly. Call a (feasible) solution θ τ-optimal if the KKT conditions are satisfied to a precision of τ, i.e., |∇^P_ik f(θ)| < τ for each i, k. Then our algorithm terminates when it reaches a τ-optimal solution. Lemma 10 in Appendix A shows that by making τ arbitrarily close to 0, we can make our solution arbitrarily close to the optimum. In the following sections we will show that our algorithm reaches a τ-optimal solution in a finite number of steps.

4  Convergence

In this section we prove that the sequence of vectors θ^n converges. Note the following:

• In each iteration of the algorithm, we optimize over a set of variables, which may either be a single variable θ_ik or a pair of variables {θ_ik, θ_jk}.

• The projected gradient of each of the chosen variables is non-zero at the start of the iteration.

• At least one of the chosen variables has a projected gradient with magnitude at least τ.

Consider the nth iteration. Denote by B the set of indices of the variables chosen: B = {(i, k)} or B = {(i, k), (j, k)}. Without loss of generality, reorder variables so that the variables in B occupy the first few indices. In the nth iteration, we optimize f over the variables in B, keeping the rest of the variables constant. Thus we have to maximize h(∆_B) = f(θ^n + [∆_B^t, 0^t]^t) − f(θ^n). This amounts to solving the optimization problem:
    max_{∆_B}  h(∆_B) = −(1/2) ∆_B^t Q_BB ∆_B − ∆_B^t (Qθ^n)_B + y_B^t ∆_B        (6)
    s.t.       l_B − θ_B ≤ ∆_B ≤ u_B − θ_B

Now note that since

    g_B^n = −(Qθ^n)_B + y_B        (7)

we can use (7) to write

    h(∆_B) = −(1/2) ∆_B^t Q_BB ∆_B + ∆_B^t g_B^n        (8)
    ⇒ ∇h(∆_B) = −Q_BB ∆_B + g_B^n        (9)

Assuming that R_kk > 0 ∀k and K_ii > 0 ∀i, Q_BB is positive definite. This can be seen as follows. B has either one or two elements. In the first case, B = {(i, k)} and Q_BB = R_kk K_ii > 0, and hence Q_BB is positive definite. In the second case, suppose B = {(i, k), (j, k)}. The matrix Q_BB is given by

    [ K_ii R_kk   K_ij R_kk ]
    [ K_ij R_kk   K_jj R_kk ]

Because K_ij^2 ≠ K_ii K_jj (refer Algorithm 1), its determinant R_kk^2 (K_ii K_jj − K_ij^2) is strictly positive, and we have that Q_BB must be positive definite.

Hence, by Lemma 8 in Appendix A, ∆_B^* optimizes (6) iff it is feasible and

    ∇^P h(∆_B^*) = 0        (10)

Then we have that θ^{n+1} = θ^n + ∆^*, where ∆^* = [∆_B^{*t}, 0^t]^t. Now note that, since the gradient at any θ is g_B = −(Qθ)_B + y_B,

    g_B^{n+1} = −(Qθ^{n+1})_B + y_B
              = −(Q(θ^n + [∆_B^{*t}, 0^t]^t))_B + y_B
              = g_B^n − Q_BB ∆_B^* = ∇h(∆_B^*)        (11)

Then (11) means that:

    g̃_B^{n+1} = ∇^P h(∆_B^*)        (12)

This leads us to the following lemma:

Lemma 1. Let θ^n be the solution at the start of the nth iteration. Let B be the set of indices of the variables over which we optimize. Let the updated solution be θ^{n+1}. Then

1. g̃_B^{n+1} = 0
2. θ^{n+1} ≠ θ^n
3. If l_jk < θ_jk^{n+1} < u_jk then g_jk^{n+1} = 0   ∀(j, k) ∈ B

Proof. Since ∆_B^* optimizes (6), combining (10) and (12) gives

    g̃_B^{n+1} = ∇^P h(∆_B^*) = 0        (13)

1. This follows directly from (13).

2. If θ^{n+1} = θ^n, then ∆_B^* = 0 and so, from (11), g_B^{n+1} = ∇h(0) = g_B^n. From (13) this means that g̃_B^n = g̃_B^{n+1} = 0. But this is a contradiction, since we required that all variables in the chosen set have non-zero projected gradient before the start of the iteration.

3. Since the final projected gradients are 0 for all variables in the chosen set (from (13)), if l_jk < θ_jk^{n+1} < u_jk then g_jk^{n+1} = 0 ∀(j, k) ∈ B, because the projected gradient coincides with the gradient at such interior variables.

Lemma 2. In the same setup as the previous lemma, f(θ^{n+1}) − f(θ^n) ≥ σ‖θ^{n+1} − θ^n‖^2, for some fixed σ > 0.
Proof.

    f(θ^{n+1}) − f(θ^n) = h(∆_B^*)
                        = −(1/2) ∆_B^{*t} Q_BB ∆_B^* + ∆_B^{*t} g_B^n        (14)

where ∆_B^* is the optimum solution of Problem (6). Now, note that since ∆_B^* is feasible and 0 is feasible, we have, using Lemma 9 in Appendix A,

    ∆_B^{*t} ∇h(∆_B^*) ≥ 0        (15)
    ⇒ −∆_B^{*t} Q_BB ∆_B^* + ∆_B^{*t} g_B^n ≥ 0        (16)
    ⇒ ∆_B^{*t} Q_BB ∆_B^* ≤ ∆_B^{*t} g_B^n        (17)

This gives us that

    −(1/2) ∆_B^{*t} Q_BB ∆_B^* + g_B^{nt} ∆_B^* ≥ (1/2) ∆_B^{*t} Q_BB ∆_B^*        (18)
    ⇒ f(θ^{n+1}) − f(θ^n) ≥ (1/2) ∆_B^{*t} Q_BB ∆_B^*        (19)
    ⇒ f(θ^{n+1}) − f(θ^n) ≥ (1/2) ν_B ∆_B^{*t} ∆_B^*        (20)

where ν_B is the minimum eigenvalue of the matrix Q_BB. Since Q_BB is always positive definite, this value is always greater than zero, and is bounded below by the minimum eigenvalue among all 1 × 1 and 2 × 2 positive definite principal submatrices of Q. Thus

    f(θ^{n+1}) − f(θ^n) ≥ σ ∆_B^{*t} ∆_B^* = σ ‖θ^{n+1} − θ^n‖^2        (21)

for some fixed σ > 0 (for instance, half of this lower bound).

Theorem 1. The sequence {θ^n} generated by Algorithm 1 converges.

Proof. From Lemma 2, we have that f(θ^{n+1}) − f(θ^n) ≥ 0. Thus the sequence {f(θ^n)} is monotonically increasing. Since it is bounded from above (by the optimum value), it must converge. Since convergent sequences are Cauchy, this sequence is also Cauchy. Thus for every ε > 0, ∃n_0 s.t. f(θ^{n+1}) − f(θ^n) ≤ σε^2 ∀n ≥ n_0. Again using Lemma 2, we get that

    ‖θ^{n+1} − θ^n‖^2 ≤ ε^2        (22)

for every n ≥ n_0. Hence the sequence {θ^n} is Cauchy. The feasible set of θ is closed and compact, so Cauchy sequences are also convergent. Hence {θ^n} converges.

5  Finite termination

We have shown that {θ^n} converges. Let θ̂ be a limit point of {θ^n}. We will start from the assumption that the algorithm runs for an infinite number of iterations and then derive a contradiction.

Call the variable θ_il τ-violating if the magnitude of the projected gradient g̃_il is at least τ. Note that at every iteration, the chosen set of variables contains at least one that is τ-violating. Now suppose the algorithm runs for an infinite number of iterations. Then, across the sequence of iterates θ^k, τ-violating variables are chosen an infinite number of times. Since there are only a finite number of distinct variables, we have that at least one variable figures as a τ-violating variable in the chosen set B an infinite number of times. Suppose that θ_il is one such variable, and let {k_il} be the subsequence of iterations in which this variable is chosen as a τ-violating variable.

Lemma 3. For every ε, ∃k_il^0 s.t. |θ_il^{k_il+1} − θ_il^{k_il}| ≤ ε ∀k_il > k_il^0.
Proof. We have that since θ^k → θ̂, θ^{k_il} → θ̂ and θ^{k_il+1} → θ̂. Thus, for any given ε, ∃ k_il^0 such that

    |θ_il^{k_il} − θ̂_il| ≤ ε/2        ∀k_il > k_il^0        (23)
    |θ_il^{k_il+1} − θ̂_il| ≤ ε/2      ∀k_il + 1 > k_il^0        (24)

This gives, by the triangle inequality,

    |θ_il^{k_il+1} − θ_il^{k_il}| ≤ ε        ∀k_il > k_il^0        (25)

Lemma 4. |ĝ_il| ≥ τ, where ĝ_il is the derivative of f w.r.t. θ_il at θ̂.

Proof. This is simply because |g̃_il^{k_il}| ≥ τ, and hence |g_il^{k_il}| ≥ τ, for every k_il, the absolute value of the derivative w.r.t. θ_il is a continuous function of θ, and θ^{k_il} → θ̂.

We use some notation. If θ_il^{k_il} ∈ (l_il, u_il) and if θ_il^{k_il+1} = l_il or θ_il^{k_il+1} = u_il, then we say that "k_il is int → bd", where "int" stands for interior and "bd" stands for boundary. Similar interpretations are assumed for "bd → bd", "bd → int" and "int → int". Thus each iteration k_il can be of one of only four possible kinds: int → int, int → bd, bd → int and bd → bd. We will prove that each of these kinds of iterations can only occur a finite number of times.

Lemma 5. There can be only a finite number of int → int and bd → int transitions.

Proof. Suppose not. Then we can construct an infinite subsequence {s_il} of the sequence {k_il} that consists of these transitions. Then we have that g_il^{s_il+1} = 0, using Lemma 1. Hence g_il^{s_il+1} → 0. Since the gradient is a continuous function of θ, and since θ^{s_il+1} → θ̂, we have that g_il^{s_il+1} → ĝ_il. But this means ĝ_il = 0, which contradicts Lemma 4.

Lemma 6. There can be only a finite number of int → bd transitions.

Proof. Suppose that we have completed a sufficient number of iterations so that all int → int and bd → int transitions have completed. The next int → bd transition will place θ_il on the boundary. Since there are no bd → int transitions anymore, θ_il will stay on the boundary henceforth. Hence there can be no more int → bd transitions.

Lemma 7. There can only be a finite number of bd → bd transitions.

Proof. Suppose not, i.e., there are an infinite number of bd → bd transitions. Let {t_il} be the subsequence of {k_il} consisting of bd → bd transitions. Now, the sequence θ_il^{t_il} → θ̂_il and is therefore Cauchy. Hence ∃n_1 s.t.

    |θ_il^{t_il} − θ_il^{t_il+1}| < u_il − l_il        ∀t_il ≥ n_1        (26)

Similarly, because the gradient is a continuous function of θ, the sequence {g_il^{t_il}} is convergent and therefore Cauchy. Hence ∃n_2 s.t.

    |g_il^{t_il} − g_il^{t_il+1}| ≤ τ/2        ∀t_il ≥ n_2        (27)

Also, from the previous lemmas, ∃n_3 s.t. t_il is not int → int, bd → int or int → bd ∀t_il ≥ n_3.

Take n_0 = max(n_1, n_2, n_3). Now, consider t_il ≥ n_0. Without loss of generality, assume that θ_il^{t_il} = l_il. Then, since |g̃_il^{t_il}| ≥ τ, we must have that g_il^{t_il} ≥ τ. From (26), and using the fact that this is a bd → bd transition, we must have that

    θ_il^{t_il+1} = l_il        (28)

From (27), we have that

    g_il^{t_il+1} ≥ τ/2        (29)

From (28) and (29), we have that g̃_il^{t_il+1} ≥ τ/2, which contradicts Lemma 1.
But if all int → int, int → bd, bd → int and bd → bd transitions are finite, then θ_il cannot be τ-violating an infinite number of times, and hence we have a contradiction. This gives us the following theorem:

Theorem 2. Algorithm 1 terminates in a finite number of steps.

A  Optimality conditions

This section proves three results on quadratic programs that have been used in the proof above. Consider the optimization problem:

    max_x  f(x) = −(1/2) x^t Hx + p^t x        (30)
    s.t.   l ≤ x ≤ u

The feasible set of the above problem is defined by box constraints and is therefore convex. If the Hessian H is positive definite, then this optimization problem is convex. The gradient of the objective f is denoted as ∇f. The projected gradient of f, denoted as ∇^P f, is given by:

    ∇_i^P f(x) = ∇_i f(x)            if l_i < x_i < u_i
                 max(0, ∇_i f(x))    if x_i = l_i        (31)
                 min(0, ∇_i f(x))    if x_i = u_i

Lemma 8. Consider the optimization problem (30). Suppose H is positive definite. Then, x is optimum for (30) iff

1. x is feasible
2. ∇_i^P f(x) = 0   ∀i

Proof. First let's write it as a minimization problem:

    min_x  h(x) = −f(x)        (32)
    s.t.   l ≤ x ≤ u

The positive definiteness of H and the fact that the constraints are box constraints imply that the problem is convex. Because the problem is convex, and because strong duality holds, a point x is optimal iff the KKT conditions hold at x. The Lagrangian of the above problem is:

    L(x, a, b) = h(x) + a^t (l − x) + b^t (x − u)        (33)

The KKT conditions are then:

    ∇_x L = ∇h(x) − a + b = 0        (34)
    a_i (l_i − x_i) = 0        (35)
    b_i (u_i − x_i) = 0        (36)
    a ≥ 0        (37)
    b ≥ 0        (38)
    l ≤ x ≤ u        (39)

From (34), (35) and (36), we get that:

    ∂h/∂x_i (x) = a_i − b_i        ∀i        (40)

    ⇒ ∂h/∂x_i (x) = 0            if l_i < x_i < u_i
                    a_i ≥ 0      if x_i = l_i        ∀i        (41)
                    −b_i ≤ 0     if x_i = u_i
    ⇒ ∂f/∂x_i (x) = 0            if l_i < x_i < u_i
                    −a_i ≤ 0     if x_i = l_i        ∀i        (42)
                    b_i ≥ 0      if x_i = u_i

    ⇒ ∇_i^P f(x) = 0        ∀i        (43)

Conversely, suppose ∇_i^P f(x) = 0 ∀i and x is feasible. Taking a_i = max(∂h/∂x_i (x), 0) and b_i = −min(∂h/∂x_i (x), 0), we have that

    l_i < x_i < u_i        (44)
    ⇒ ∂h/∂x_i (x) = 0        (45)
    ⇒ a_i = 0 and b_i = 0        (46)

    x_i = l_i        (47)
    ⇒ ∂h/∂x_i (x) = −∂f/∂x_i (x) ≥ 0        (48)
    ⇒ a_i ≥ 0 and b_i = 0        (49)

    x_i = u_i        (50)
    ⇒ ∂h/∂x_i (x) = −∂f/∂x_i (x) ≤ 0        (51)
    ⇒ a_i = 0 and b_i ≥ 0        (52)

Thus (34), (35), (36), (37) and (38) are satisfied by this choice of a and b. Thus the KKT conditions are equivalent to the following conditions, which are therefore necessary and sufficient for x to be optimal:

    ∇_i^P f(x) = 0        ∀i        (53)
    l ≤ x ≤ u        (54)

Lemma 9. Consider the optimization problem (30). Suppose H is positive definite. If x^* is optimal, and x ≠ x^* is feasible, then (x^* − x)^t ∇f(x^*) ≥ 0.

Proof. Because the feasible set is convex and because both x and x^* are feasible, we have that x(λ) = x + λ(x^* − x) is feasible for all λ ∈ [0, 1]. Consider the optimization problem:

    max_λ  f(x(λ)) − f(x)        (55)
    s.t.   0 ≤ λ ≤ 1

From a Taylor series expansion, it can be seen that

    f(x(λ)) − f(x) = −(λ^2/2) (x^* − x)^t H (x^* − x) + λ (x^* − x)^t ∇f(x)        (56)

Since H is positive definite, (x^* − x)^t H (x^* − x) > 0. Hence, the problem (55) satisfies the conditions of Lemma 8. Also, because x^* is optimal for (30), λ = 1 is optimal for (55). Hence, by Lemma 8, the projected gradient of (55) at λ = 1 is 0. This means that

    ∂f(x(λ))/∂λ |_{λ=1} ≥ 0        (57)

This gives, using the chain rule,

    (x^* − x)^t ∇f(x(λ)) |_{λ=1} ≥ 0        (58)
    ⇒ (x^* − x)^t ∇f(x^*) ≥ 0        (59)

where the last step uses the fact that x(1) = x^*.
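As a quick numerical illustration of Lemma 9 (not from the paper), the sketch below builds a random separable instance of (30) with H = diag(h), for which the optimum x^* = clip(H^{-1}p, l, u) is available in closed form, and checks the inequality (x^* − x)^t ∇f(x^*) ≥ 0 at random feasible points:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
h = rng.uniform(0.5, 2.0, d)      # diagonal of H (positive definite)
p = rng.normal(size=d)
l, u = -np.ones(d), np.ones(d)

x_star = np.clip(p / h, l, u)     # closed-form optimum of the separable box QP
grad_star = p - h * x_star        # gradient of f at x*

for _ in range(1000):
    x = rng.uniform(l, u)         # a random feasible point
    assert (x_star - x) @ grad_star >= -1e-12
```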
Lemma 10. Consider the optimization problem (30). Suppose H is positive definite. If x is feasible and |∇_i^P f(x)| ≤ τ ∀i, and if f^* is the optimal value, then

    f^* − f(x) ≤ d^2 τ^2 / (2m)        (60)

where d is the dimensionality of x and m is the least eigenvalue of H.

Proof. This proof is an extension of the proof in [1, p. 459]. Again, take h = −f and define the Lagrangian. Then, for the given x, we can define

    â_i = max(∂h/∂x_i (x), 0)        (61)
    a_i = â_i if â_i ≥ τ, and 0 otherwise        (62)
    b̂_i = −min(∂h/∂x_i (x), 0)        (63)
    b_i = b̂_i if b̂_i ≥ τ, and 0 otherwise        (64)

Then, equations (35), (36), (37) and (38) are satisfied by this choice of a and b. Equations (35) and (36) mean that this choice of a and b satisfies:

    h(x) = L(x, a, b)        (65)

(34) is only satisfied to a precision of τ, which means that

    ‖∇_x L(x, a, b)‖ ≤ dτ        (66)

Also note that:

    h^* = −f^* = max_{a,b} min_x L(x, a, b) ≥ min_x L(x, a, b) = L(x', a, b)        (67)

for some x'. Hence,

    h(x) − h^* ≤ h(x) − L(x', a, b)        (68)
    ⇒ h(x) − h^* ≤ L(x, a, b) − L(x', a, b)        (69)

Using a Taylor series expansion of g(x) = L(x, a, b), we see that:

    g(x') − g(x) = (x' − x)^t ∇g(x) + (1/2) (x' − x)^t ∇^2 g(x) (x' − x)        (70)

The Hessian ∇^2 g(x) is nothing but H. The quantity (x' − x)^t ∇^2 g(x) (x' − x) is lower bounded by m‖x' − x‖^2, m > 0 being the least eigenvalue of H. Hence,

    g(x') − g(x) ≥ (x' − x)^t ∇g(x) + (m/2) ‖x' − x‖^2        (71)

The right hand side is a convex function of x' for fixed x. Setting its gradient to 0, we find that the right hand side attains its minimum value of −(1/(2m)) ‖∇g(x)‖^2 at x' = x − (1/m) ∇g(x). Thus,

    g(x') − g(x) ≥ −(1/(2m)) ‖∇g(x)‖^2        (72)

Combining (66), (69) and (72), and noting that f = −h and g(x) = L(x, a, b), we get the desired inequality.
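To see the bound (60) in action, here is a small numerical sanity check (again on a hypothetical separable instance of (30), not an example from the paper): for feasible points x whose projected gradient has magnitude at most τ, the observed suboptimality f^* − f(x) should never exceed d^2 τ^2 / (2m).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
h = rng.uniform(1.0, 3.0, d)                     # diagonal of H (positive definite), m = h.min()
p = rng.normal(size=d)
l, u = -np.ones(d), np.ones(d)

def f(x):
    return -0.5 * x @ (h * x) + p @ x            # objective of (30) with H = diag(h)

x_star = np.clip(p / h, l, u)                    # closed-form optimum

for _ in range(100):
    x = np.clip(x_star + 0.05 * rng.normal(size=d), l, u)   # feasible point near the optimum
    g = p - h * x                                # gradient of f at x
    pg = g.copy()
    pg[(x == l) & (g < 0)] = 0.0                 # projected gradient, eq. (31)
    pg[(x == u) & (g > 0)] = 0.0
    tau = np.abs(pg).max()                       # smallest tau satisfying the hypothesis of Lemma 10
    assert f(x_star) - f(x) <= d**2 * tau**2 / (2 * h.min()) + 1e-12
```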
References

[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2] B. Hariharan, M. Varma, S. V. N. Vishwanathan, and L. Zelnik-Manor. Max-margin multi-label classification with correlations. In Proceedings of the International Conference on Machine Learning (to appear), 2010.

[3] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 2000.