Appendix for Mixed-norm Regularization for
Brain Decoding
Rémi Flamary, Nisrine Jrad, Ronald Phlypo, Marco Congedo, Alain Rakotomamonjy
LITIS, EA 4108 - INSA / Université de Rouen
Avenue de l’Université - 76801 Saint-Etienne du Rouvray Cedex
[email protected]

This work was partly supported by the FP7-ICT Programme of the European Community, under the PASCAL2 Network of Excellence, ICT-216886, by the French ANR Project ASAP ANR-09-EMER-001, OpenVibe2, GazeEEG, and the INRIA ARC MABI.
APPENDIX
A. Proof of Lipschitz gradient of the squared Hinge loss
Given the training examples $\{x_i, y_i\}_{i=1}^{n}$, the squared Hinge loss is written as:
\[
J = \sum_{i=1}^{n} \max(0, 1 - y_i x_i^\top w)^2
\]
and its gradient is:
\[
\nabla_w J = -2 \sum_{i=1}^{n} x_i y_i \max(0, 1 - y_i x_i^\top w).
\]
The squared Hinge loss is gradient Lipschitz if there exists a constant $L$ such that
\[
\|\nabla J(w_1) - \nabla J(w_2)\|_2 \leq L \, \|w_1 - w_2\|_2 \qquad \forall\, w_1, w_2 \in \mathbb{R}^d .
\]
The proof essentially relies on showing that $x_i y_i \max(0, 1 - y_i x_i^\top w)$ is Lipschitz itself, i.e. that there exists $L' \in \mathbb{R}$ such that
\[
\| x_i y_i \max(0, 1 - y_i x_i^\top w_1) - x_i y_i \max(0, 1 - y_i x_i^\top w_2) \| \leq L' \, \|w_1 - w_2\| .
\]
Now let us consider the different situations. For a given $w_1$ and $w_2$, if $1 - y_i x_i^\top w_1 \leq 0$ and $1 - y_i x_i^\top w_2 \leq 0$, then the left-hand side is equal to $0$ and any $L'$ satisfies the inequality. If $1 - y_i x_i^\top w_1 \leq 0$ and $1 - y_i x_i^\top w_2 \geq 0$, then the left-hand side (lhs) is
\[
\mathrm{lhs} = \|x_i\|_2 \, (1 - y_i x_i^\top w_2) \;\leq\; \|x_i\|_2 \, ( y_i x_i^\top w_1 - y_i x_i^\top w_2 ) \;\leq\; \|x_i\|_2^2 \, \|w_1 - w_2\|_2 .
\]
A similar reasoning yields the same bound in the remaining cases, namely when $1 - y_i x_i^\top w_1 \geq 0$ and $1 - y_i x_i^\top w_2 \leq 0$, and when both quantities are nonnegative. Thus, $x_i y_i \max(0, 1 - y_i x_i^\top w)$ is Lipschitz with constant $\|x_i\|_2^2$. Now, we can conclude the proof by stating that $\nabla_w J$ is Lipschitz as it is a sum of Lipschitz functions, the related constant being $2 \sum_{i=1}^{n} \|x_i\|_2^2$ (the factor $2$ coming from the gradient expression above).
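As an illustration (this sketch is ours and not part of the original appendix; the data, sizes and function names are arbitrary), the following NumPy snippet evaluates the gradient of the squared Hinge loss and checks numerically that its variation is bounded by $2\sum_{i=1}^{n}\|x_i\|_2^2 \, \|w_1 - w_2\|_2$.

\begin{verbatim}
# Numerical check of the Lipschitz bound on the squared Hinge loss gradient
# (illustrative sketch; data and names are not from the paper).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))       # rows are the examples x_i
y = np.sign(rng.standard_normal(n))   # labels y_i in {-1, +1}

def grad_squared_hinge(w):
    # gradient of J(w) = sum_i max(0, 1 - y_i x_i^T w)^2
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return -2.0 * X.T @ (y * margins)

L = 2.0 * np.sum(np.linalg.norm(X, axis=1) ** 2)  # 2 * sum_i ||x_i||_2^2

for _ in range(1000):
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.linalg.norm(grad_squared_hinge(w1) - grad_squared_hinge(w2))
    assert lhs <= L * np.linalg.norm(w1 - w2) + 1e-8
\end{verbatim}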
B. Lipschitz gradient for the multi-task learning problem
For the multi-task learning problem, we want to prove that the function
\[
\sum_{t=1}^{m} \sum_{i=1}^{n} L\big(y_{i,t},\, x_{i,t}^\top w_t + b_t\big) + \lambda_s \sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2
\]
is gradient Lipschitz, $L(\cdot,\cdot)$ being the squared Hinge loss. From the above results, it is easy to show that the first term is gradient Lipschitz as a sum of gradient Lipschitz functions.
Now, we also show that the similarity term
\[
\sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2
\]
is also gradient Lipschitz. This term can be expressed as
\[
\sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2 = \sum_{t=1}^{m} \langle w_t, w_t \rangle - \frac{1}{m} \sum_{i,j=1}^{m} \langle w_i, w_j \rangle = w^\top M w
\]
where $w^\top = [w_1^\top, \ldots, w_m^\top]$ is the vector of all classifier parameters and $M \in \mathbb{R}^{md \times md}$ is the matrix of the quadratic form induced by the similarity regularizer, of the form
\[
M = I - \frac{1}{m} \sum_{t=1}^{m} D_t
\]
with $I$ the identity matrix and $D_t$ a block matrix whose $d \times d$ blocks are identity matrices placed with an appropriate circular shift; $D_t$ is thus a $(t-1)$ block-row-shifted version of $I$.
Once we have this formulation, we can use the fact that a function $f$ is gradient Lipschitz with constant $L$ if the largest eigenvalue of its Hessian is bounded by $L$ on its domain [1]. Hence, since we have
\[
\|M\|_2 \leq \|I\|_2 + \frac{1}{m} \sum_{t=1}^{m} \|D_t\|_2 = 2,
\]
the Hessian matrix of the similarity term, $2M$, has consequently bounded eigenvalues. This concludes the proof that the function $w^\top M w$ is gradient Lipschitz continuous.
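As a sanity check (again our own sketch, not taken from the paper), $M$ can be built explicitly with Kronecker products and its spectral norm compared against the bound of $2$ derived above.

\begin{verbatim}
# Build M = I - (1/m) sum_t D_t and check the spectral-norm bound of 2
# (illustrative sketch; m, d and the variable names are our own choices).
import numpy as np

m, d = 5, 4                              # number of tasks, dimension of each w_t
P = np.roll(np.eye(m), shift=1, axis=0)  # circular shift permutation on m tasks
D = [np.kron(np.linalg.matrix_power(P, t), np.eye(d)) for t in range(m)]
M = np.eye(m * d) - sum(D) / m

print(np.linalg.norm(M, 2))              # spectral norm of M (here 1.0 <= 2)

# sanity check: w^T M w equals sum_t ||w_t - (1/m) sum_j w_j||_2^2
w = np.random.default_rng(1).standard_normal((m, d))
quad = w.reshape(-1) @ M @ w.reshape(-1)
direct = np.sum(np.linalg.norm(w - w.mean(axis=0), axis=1) ** 2)
assert np.isclose(quad, direct)
\end{verbatim}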
C. Proximal operators
1) $\ell_1$ norm: the proximal operator of the $\ell_1$ norm is defined as:
\[
\operatorname{prox}_{\lambda \|x\|_1}(u) = \arg\min_{x} \; \frac{1}{2} \|x - u\|_2^2 + \lambda \|x\|_1
\]
and has the following closed-form solution, whose $i$-th component is
\[
\big[\operatorname{prox}_{\lambda \|x\|_1}(u)\big]_i = \operatorname{sign}(u_i) \, (|u_i| - \lambda)_+ .
\]
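A direct NumPy implementation of this component-wise soft-thresholding could look as follows (our own sketch, not code from the paper).

\begin{verbatim}
# Soft-thresholding: proximal operator of lambda * ||x||_1
import numpy as np

def prox_l1(u, lam):
    # component-wise: sign(u_i) * max(|u_i| - lambda, 0)
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# entries with magnitude below lambda are set exactly to zero
print(prox_l1(np.array([0.3, -1.5, 0.05]), lam=0.5))  # [ 0.  -1.   0. ]
\end{verbatim}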
2) $\ell_1-\ell_2$ norm: the proximal operator of the $\ell_1-\ell_2$ norm is defined as:
\[
\operatorname{prox}_{\lambda \sum_{g \in \mathcal{G}} \|x_g\|_2}(u) = \arg\min_{x} \; \frac{1}{2} \|x - u\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|x_g\|_2 .
\]
Since the groups of indices $g$ are separable, the minimization problem can be decomposed into independent problems, one per group. Hence, we can just focus on the problem
\[
\min_{x} \; \frac{1}{2} \|x - u\|_2^2 + \lambda \|x\|_2
\]
whose minimizer is
\[
\begin{cases}
0 & \text{if } \|u\|_2 \leq \lambda \\[4pt]
\Big(1 - \dfrac{\lambda}{\|u\|_2}\Big) u & \text{otherwise.}
\end{cases}
\]
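Similarly, the resulting group-wise (block) soft-thresholding can be sketched as follows (our own illustration; the partition of the indices into groups below is an arbitrary example).

\begin{verbatim}
# Block soft-thresholding: proximal operator of lambda * sum_g ||x_g||_2
# over a partition of the indices into groups (illustrative sketch).
import numpy as np

def prox_l1l2(u, groups, lam):
    x = np.zeros_like(u)
    for g in groups:                    # each group is treated independently
        norm_g = np.linalg.norm(u[g])
        if norm_g > lam:                # otherwise the whole group stays at 0
            x[g] = (1.0 - lam / norm_g) * u[g]
    return x

u = np.array([0.3, 0.1, -2.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3])]  # G = {{0,1}, {2,3}}
print(prox_l1l2(u, groups, lam=0.5))  # first group is zeroed, second is shrunk
\end{verbatim}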
REFERENCES

[1] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.