Appendix for Mixed-norm Regularization for Brain Decoding

Rémi Flamary, Nisrine Jrad, Ronald Phlypo, Marco Congedo, Alain Rakotomamonjy
LITIS, EA 4108 - INSA / Université de Rouen
Avenue de l'Université - 76801 Saint-Etienne du Rouvray Cedex
[email protected]

This work was partly supported by the FP7-ICT Programme of the European Community, under the PASCAL2 Network of Excellence, ICT-216886, by the French ANR projects ASAP ANR-09-EMER-001, OpenVibe2 and GazeEEG, and by the INRIA ARC MABI.

APPENDIX

A. Proof of Lipschitz gradient of the squared Hinge loss

Given the training examples {x_i, y_i}_{i=1}^n, the squared Hinge loss is written as:
\[
J(w) = \sum_{i=1}^{n} \max(0, 1 - y_i x_i^\top w)^2
\]
and its gradient is:
\[
\nabla_w J = -2 \sum_{i=1}^{n} x_i y_i \max(0, 1 - y_i x_i^\top w).
\]
The squared Hinge loss is gradient Lipschitz if there exists a constant L such that
\[
\|\nabla J(w_1) - \nabla J(w_2)\|_2 \le L \|w_1 - w_2\|_2, \quad \forall w_1, w_2 \in \mathbb{R}^d.
\]
The proof essentially relies on showing that x_i y_i \max(0, 1 - y_i x_i^\top w) is itself Lipschitz, i.e., that there exists L' \in \mathbb{R} such that
\[
\|x_i y_i \max(0, 1 - y_i x_i^\top w_1) - x_i y_i \max(0, 1 - y_i x_i^\top w_2)\| \le L' \|w_1 - w_2\|.
\]
Now let us consider the different situations. For given w_1 and w_2, if 1 - y_i x_i^\top w_1 \le 0 and 1 - y_i x_i^\top w_2 \le 0, then the left-hand side is equal to 0 and any L' satisfies the inequality. If 1 - y_i x_i^\top w_1 \le 0 and 1 - y_i x_i^\top w_2 \ge 0, then the left-hand side (lhs) is
\[
\text{lhs} = \|x_i\|_2 (1 - y_i x_i^\top w_2) \le \|x_i\|_2 (y_i x_i^\top w_1 - y_i x_i^\top w_2) \le \|x_i\|_2^2 \|w_1 - w_2\|_2. \tag{1}
\]
A similar reasoning yields the same bound when 1 - y_i x_i^\top w_1 \ge 0 and 1 - y_i x_i^\top w_2 \le 0, and when both 1 - y_i x_i^\top w_1 \ge 0 and 1 - y_i x_i^\top w_2 \ge 0. Thus, x_i y_i \max(0, 1 - y_i x_i^\top w) is Lipschitz with constant \|x_i\|_2^2. We can now conclude the proof by noting that \nabla_w J is Lipschitz as a sum of Lipschitz functions, the related Lipschitz constant being 2 \sum_{i=1}^{n} \|x_i\|_2^2.

B. Lipschitz gradient for the multi-task learning problem

For the multi-task learning problem, we want to prove that the function
\[
\sum_{t=1}^{m} \sum_{i=1}^{n} L(y_{i,t}, x_{i,t}^\top w_t + b_t) + \lambda_s \sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2
\]
is gradient Lipschitz, L(·,·) being the squared Hinge loss.

From the results above, it is easy to show that the first term is gradient Lipschitz as a sum of gradient Lipschitz functions. Now, we also show that the similarity term
\[
\sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2
\]
is gradient Lipschitz. This term can be expressed as
\[
\sum_{t=1}^{m} \Big\| w_t - \frac{1}{m} \sum_{j=1}^{m} w_j \Big\|_2^2 = \sum_{t=1}^{m} \langle w_t, w_t \rangle - \frac{1}{m} \sum_{i,j=1}^{m} \langle w_i, w_j \rangle = w^\top M w
\]
where w^\top = [w_1^\top, \dots, w_m^\top] is the vector of all classifier parameters and M \in \mathbb{R}^{md \times md} defines the Hessian matrix of the similarity regularizer, with
\[
M = I - \frac{1}{m} \sum_{t=1}^{m} D_t
\]
where I is the identity matrix and D_t is a block matrix in which each block is an identity matrix with the appropriate circular shift; D_t is thus a (t-1) block-row-shifted version of I.

Once we have this formulation, we can use the fact that a function f is gradient Lipschitz with constant L if the largest eigenvalue of its Hessian is bounded by L on its domain [1]. Hence, since we have
\[
\|M\|_2 \le \|I\|_2 + \frac{1}{m} \sum_{t=1}^{m} \|D_t\|_2 = 2,
\]
the Hessian matrix of the similarity term, 2M, has bounded eigenvalues. This concludes the proof that the function w^\top M w is gradient Lipschitz continuous.
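As an illustration, the two bounds above can be checked numerically. The following Python sketch is not part of the original derivation: it uses arbitrary random data, assumes binary labels y_i in {-1, +1}, and the helper names (grad_sq_hinge, etc.) are hypothetical.

```python
# Numerical sanity check of the bounds from Appendices A and B on random data.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 5
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def grad_sq_hinge(w):
    """Gradient of J(w) = sum_i max(0, 1 - y_i x_i^T w)^2."""
    margin = np.maximum(0.0, 1.0 - y * (X @ w))
    return -2.0 * X.T @ (y * margin)

# Appendix A: the gradient is Lipschitz with constant 2 * sum_i ||x_i||_2^2.
L = 2.0 * np.sum(np.linalg.norm(X, axis=1) ** 2)
for _ in range(100):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    lhs = np.linalg.norm(grad_sq_hinge(w1) - grad_sq_hinge(w2))
    assert lhs <= L * np.linalg.norm(w1 - w2) + 1e-9

# Appendix B: M = I - (1/m) sum_t D_t, with D_t a block circular row shift of I.
D = sum(np.kron(np.roll(np.eye(m), t, axis=0), np.eye(d)) for t in range(m))
M = np.eye(m * d) - D / m
# The spectral norm is bounded by 2 (here it equals 1), so w^T M w is gradient Lipschitz.
assert np.linalg.norm(M, 2) <= 2.0 + 1e-9
print("bounds hold")
```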
C. Proximal operators

1) \ell_1 norm: the proximal operator of the \ell_1 norm is defined as
\[
\operatorname{prox}_{\lambda \|\cdot\|_1}(u) = \arg\min_x \; \frac{1}{2} \|x - u\|_2^2 + \lambda \|x\|_1
\]
and has a closed-form solution whose components are
\[
[\operatorname{prox}_{\lambda \|\cdot\|_1}(u)]_i = \operatorname{sign}(u_i)(|u_i| - \lambda)_+.
\]

2) \ell_1-\ell_2 norm: the proximal operator of the \ell_1-\ell_2 norm is defined as
\[
\operatorname{prox}_{\lambda \sum_{g \in G} \|\cdot_g\|_2}(u) = \arg\min_x \; \frac{1}{2} \|x - u\|_2^2 + \lambda \sum_{g \in G} \|x_g\|_2.
\]
Since the index groups g \in G are disjoint, the minimization problem decomposes into separate problems. Hence, we can focus on the problem
\[
\min_x \; \frac{1}{2} \|x - u\|_2^2 + \lambda \|x\|_2,
\]
whose minimizer is
\[
\begin{cases}
0 & \text{if } \|u\|_2 \le \lambda \\
\left(1 - \dfrac{\lambda}{\|u\|_2}\right) u & \text{otherwise.}
\end{cases}
\]
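For illustration, both closed-form solutions can be implemented directly. The following Python sketch is a minimal version, assuming non-overlapping groups; the function names prox_l1 and prox_l1_l2 are hypothetical.

```python
# Closed-form proximal operators: soft thresholding and block soft thresholding.
import numpy as np

def prox_l1(u, lam):
    """Component-wise soft thresholding: sign(u_i) * (|u_i| - lam)_+ ."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_l1_l2(u, lam, groups):
    """Block soft thresholding, applied group by group.
    `groups` is a list of index arrays partitioning the coordinates of u."""
    x = np.zeros_like(u)
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        if norm_g > lam:                      # otherwise the whole group is set to zero
            x[g] = (1.0 - lam / norm_g) * u[g]
    return x

# Example usage on a vector split into two groups of coordinates.
u = np.array([3.0, -0.5, 0.1, 0.2, -4.0])
print(prox_l1(u, 1.0))
print(prox_l1_l2(u, 1.0, groups=[np.arange(2), np.arange(2, 5)]))
```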
REFERENCES

[1] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.