Supplementary Material: Representational Similarity Learning

1. Extending methods to matrices that are not Positive Semi-Definite

For any symmetric $S$ we have an eigendecomposition $U D U^T$, where the eigenvalues may be negative. We can think of $W$ as having the form $B D B^T$ for some $B$, using the $D$ from $S$. So we consider the following optimizations instead:
$$\min_B \|S - X B D B^T X^T\|_F^2, \qquad \text{and} \qquad \min_B \|U - X B\|_F^2.$$
Then, given a $B$, construct $W = B D B^T$. The rest remains the same.
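As a concrete illustration of this construction, the following is a minimal NumPy sketch of the second variant: eigendecompose the symmetric (possibly indefinite) $S$, fit $B$ by least squares so that $XB \approx U$, and rebuild the similarity model as $W = B D B^T$. The function name `fit_indefinite_similarity` and the toy shapes are illustrative rather than taken from the paper, and the regularization term used in the main text is omitted to keep the sketch short.

```python
import numpy as np

def fit_indefinite_similarity(S, X):
    """Sketch of the non-PSD extension: S = U D U^T may have negative
    eigenvalues; keep D fixed, fit B so that X B approximates U, and
    reconstruct W = B D B^T.  Illustrative helper, not from the paper."""
    # Symmetric eigendecomposition; eigenvalues may be negative.
    eigvals, U = np.linalg.eigh(S)
    D = np.diag(eigvals)

    # min_B ||U - X B||_F^2 (plain least squares; a penalized variant
    # would add the GrOWL term here).
    B, *_ = np.linalg.lstsq(X, U, rcond=None)

    # Reconstruct the (possibly indefinite) similarity model.
    W = B @ D @ B.T
    return B, D, W

# Toy usage with arbitrary shapes (rows of X match the dimension of S).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
A = rng.standard_normal((50, 50))
S = (A + A.T) / 2                     # symmetric, generally not PSD
B, D, W = fit_indefinite_similarity(S, X)
print(W.shape)                        # (10, 10)
```

A regularized variant would simply replace the `lstsq` call with the penalized solver used in the main text; the surrounding construction is unchanged.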
2. Clustering properties of GrOWL with absolute error loss function

Proof of Theorem 3.1

Proof. The proof is divided into two steps. First, we show $\|\hat\beta_{j\cdot}\| = \|\hat\beta_{k\cdot}\|$, and then we further show that the rows are equal.

We proceed by contradiction. Assume $\|\hat\beta_{j\cdot}\| \neq \|\hat\beta_{k\cdot}\|$ and, without loss of generality, suppose $\|\hat\beta_{j\cdot}\| > \|\hat\beta_{k\cdot}\|$. We show that there exists a modification of the solution with a smaller GrOWL norm and the same data-fitting term, and thus a smaller overall objective value, which contradicts our assumption that $\hat B$ is the minimizer of $L(B) + G(B)$.

Consider the modification $V = \hat B$ except $\hat v_{j\cdot} = \hat\beta_{j\cdot} - \varepsilon$ and $\hat v_{k\cdot} = \hat\beta_{k\cdot} + \varepsilon$, where $\varepsilon = \delta \hat\beta_{j\cdot}$ and $\delta$ is chosen such that $\|\varepsilon\| \in \left(0, \frac{\|\hat\beta_{j\cdot}\| - \|\hat\beta_{k\cdot}\|}{2}\right)$.

Let $L(B) = \|Y - XB\|_1 = \|Y' - x_{\cdot j}\hat\beta_{j\cdot} - x_{\cdot k}\hat\beta_{k\cdot}\|_1$, where $Y'$ is the residual term given by $Y' = Y - \sum_{i \neq j,k} x_{\cdot i}\hat\beta_{i\cdot}$. Since $x_{\cdot j} = x_{\cdot k}$, $L$ is invariant under this transformation, i.e., $L(V) = L(\hat B)$. The same is true for $L(B) = \|Y - XB\|_F^2$.

Observe that the GrOWL norm of $B$ is equal to the OWL norm of the vector of Euclidean norms of the rows of $B$. Since $\|\hat v_{k\cdot}\| = \|\hat\beta_{k\cdot} + \varepsilon\| \le \|\hat\beta_{k\cdot}\| + \|\varepsilon\|$, this transformation is equivalent to that defined in Lemma 3.1 and we have
$$G(\hat B) - G(V) \ge \Delta\|\varepsilon\|.$$
This contradicts our assumption that $\hat B$ is the minimizer of $L(B) + G(B)$ and completes the proof that $\|\hat\beta_{j\cdot}\| = \|\hat\beta_{k\cdot}\|$.

Now, let $\hat\beta_{j\cdot} + \hat\beta_{k\cdot} = z$. Then the minimizer satisfies
$$\min_{\hat\beta_{j\cdot},\,\hat\beta_{k\cdot}} \; w_j\|\hat\beta_{j\cdot}\| + w_k\|\hat\beta_{k\cdot}\| \quad \text{such that} \quad \hat\beta_{j\cdot} + \hat\beta_{k\cdot} = z \;\text{ and }\; \|\hat\beta_{j\cdot}\| = \|\hat\beta_{k\cdot}\|.$$
It is easy to see that the solution to this optimization is $\hat\beta_{j\cdot} = \hat\beta_{k\cdot} = z/2$, so the rows are equal.

Proof of Theorem 3.2

Proof. The proof is similar to the identical-columns theorem. By contradiction and without loss of generality, suppose $\|\hat\beta_{j\cdot}\| > \|\hat\beta_{k\cdot}\|$. We show that there exists a transformation of $\hat B$ such that the increase in the data-fitting term is smaller than the decrease in the GrOWL norm.

Consider the modification $V$ as defined in the proof of Theorem 3.1. By the triangle inequality, the difference in the loss function $L$ that results from this modification satisfies
$$L(V) - L(\hat B) \le \|x_{\cdot j} - x_{\cdot k}\|_1 \,\|\varepsilon\|_1.$$
Invoking Lemma 3.1 as in the previous theorem and using $\|\varepsilon\|_1 \le \sqrt{r}\,\|\varepsilon\|$, we get
$$L(V) + G(V) - \big(L(\hat B) + G(\hat B)\big) \le \sqrt{r}\left(\|x_{\cdot j} - x_{\cdot k}\|_1 - \frac{\Delta}{\sqrt{r}}\right)\|\varepsilon\| < 0.$$
This contradicts our assumption that $\hat B$ is the minimizer of $L(B) + G(B)$ and completes the proof for the absolute loss. The proof with the squared Frobenius loss can easily be extended using the inequality derived in Appendix B.

Proof of Theorem 3.3

Proof. The proof is similar to the identical-columns theorem. By contradiction, suppose
$$\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\| \;\ge\; \frac{8\phi\,\|\hat\beta_{k\cdot}\|}{4\phi^2 + 1} \;\ge\; \frac{2\,\|\hat\beta_{k\cdot}\|}{\phi}.$$
We show that there exists a transformation of $\hat B$ such that the increase in the data-fitting term is smaller than the decrease in the GrOWL norm.

Consider the modification $V$ as defined in the proof of Theorem 3.1, with $\varepsilon = \frac{\hat\beta_{j\cdot} - \hat\beta_{k\cdot}}{2}$. By the triangle inequality, the difference in the loss function $L$ that results from this modification satisfies
$$L(V) - L(\hat B) \le \|x_{\cdot j} - x_{\cdot k}\|_1 \,\|\varepsilon\|_1.$$
We now bound the decrease in the GrOWL norm. Note that, by the parallelogram law,
$$
\begin{aligned}
\|\hat\beta_{j\cdot} + \hat\beta_{k\cdot}\|^2 &= 2\|\hat\beta_{j\cdot}\|^2 + 2\|\hat\beta_{k\cdot}\|^2 - \|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|^2 \\
&\le 2\|\hat\beta_{j\cdot}\|^2 + 2\|\hat\beta_{k\cdot}\|^2 + \left(\frac{1}{4\phi^2} - \frac{1}{4\phi^2} - 1\right)\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|^2 \\
&\le 4\|\hat\beta_{j\cdot}\|^2 + \left(\frac{\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|}{2\phi}\right)^{\!2} - \frac{1 + 4\phi^2}{4\phi^2}\,\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|^2 \\
&\le 4\|\hat\beta_{j\cdot}\|^2 + \left(\frac{\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|}{2\phi}\right)^{\!2} - 2\,\frac{\|\hat\beta_{j\cdot}\|\,\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|}{\phi} \\
&\le \left(\|\hat\beta_{j\cdot}\| + \|\hat\beta_{k\cdot}\| - \frac{\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|}{2\phi}\right)^{\!2}.
\end{aligned}
$$
Thus, we have
$$G(\hat B) - G(V) \ge \Delta\left(\|\hat\beta_{j\cdot}\| + \|\hat\beta_{k\cdot}\| - \|\hat\beta_{j\cdot} + \hat\beta_{k\cdot}\|\right) \ge \frac{\Delta\,\|\hat\beta_{j\cdot} - \hat\beta_{k\cdot}\|}{2\phi} = \frac{\Delta\,\|\varepsilon\|}{\phi}.$$
Combining this with $\|\varepsilon\|_1 \le \sqrt{r}\,\|\varepsilon\|$, we get
$$L(V) + G(V) - \big(L(\hat B) + G(\hat B)\big) \le \left(\sqrt{r}\,\|x_{\cdot j} - x_{\cdot k}\|_1 - \frac{\Delta}{\phi}\right)\|\varepsilon\| < 0.$$
This contradicts our assumption that $\hat B$ is the minimizer of $L(B) + G(B)$ and completes the proof for the absolute loss. The proof with the squared Frobenius loss can easily be extended using the inequality derived in Appendix B.

3. Clustering properties of GrOWL with squared Frobenius loss function

In this section, we consider the optimization
$$\min_B \|Y - XB\|_F^2 + G(B). \qquad (1)$$
Here we derive an upper bound on the increase in the squared loss term after applying the transformation $V$. We assume that the columns of the matrix $X$ are normalized to a common norm, i.e., $\|x_{\cdot i}\| = c$ for $i = 1, \dots, p$. Define
$$L(B) = \|Y - XB\|_F^2 = \|Y' - x_{\cdot j}\beta_{j\cdot} - x_{\cdot k}\beta_{k\cdot}\|_F^2,$$
where $Y'$ is again the residual term.

Lemma 1. Let $\hat B \in \mathbb{R}^{p \times r}$ and let $V$ be as defined in the respective theorems. Then we have
$$L(V) - L(\hat B) \le \|\varepsilon\|\,\|Y'\|_F\,\|x_{\cdot j} - x_{\cdot k}\|.$$

Proof.
$$L(V) - L(\hat B) = \frac{1}{2}\|Y' - x_{\cdot j}(\hat\beta_{j\cdot} - \varepsilon) - x_{\cdot k}(\hat\beta_{k\cdot} + \varepsilon)\|_F^2 - \frac{1}{2}\|Y' - x_{\cdot j}\hat\beta_{j\cdot} - x_{\cdot k}\hat\beta_{k\cdot}\|_F^2.$$
Expanding the Frobenius norm terms, canceling the common $\frac{1}{2}\|Y'\|_F^2$ terms, and using the common norm of the columns ($\|x_{\cdot i}\| = c$ for $i = 1, \dots, p$), we get
$$
\begin{aligned}
L(V) - L(\hat B) ={}& \frac{c^2}{2}\operatorname{tr}\!\big((\hat\beta_{j\cdot} - \varepsilon)(\hat\beta_{j\cdot} - \varepsilon)^T + (\hat\beta_{k\cdot} + \varepsilon)(\hat\beta_{k\cdot} + \varepsilon)^T - \hat\beta_{j\cdot}\hat\beta_{j\cdot}^T - \hat\beta_{k\cdot}\hat\beta_{k\cdot}^T\big) \\
&+ \operatorname{tr}\!\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) + \operatorname{tr}\!\big((\hat\beta_{j\cdot} - \varepsilon)\,x_{\cdot j}^T x_{\cdot k}\,(\hat\beta_{k\cdot} + \varepsilon)^T - \hat\beta_{j\cdot}\,x_{\cdot j}^T x_{\cdot k}\,\hat\beta_{k\cdot}^T\big).
\end{aligned}
$$
Expanding terms and making further cancellations gives
$$
\begin{aligned}
L(V) - L(\hat B) &= \operatorname{tr}\!\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) - (c^2 - x_{\cdot j}^T x_{\cdot k})\operatorname{tr}\!\big((\hat\beta_{j\cdot} - \hat\beta_{k\cdot} - \varepsilon)\varepsilon^T\big) \\
&\le \operatorname{tr}\!\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) - (c^2 - x_{\cdot j}^T x_{\cdot k})\,\|\varepsilon\|\big(\|\hat\beta_{j\cdot}\| - \|\hat\beta_{k\cdot}\| - \|\varepsilon\|\big) \\
&\le \operatorname{tr}\!\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) \\
&\le \|Y'\|_F\,\|(x_{\cdot j} - x_{\cdot k})\varepsilon\|_F = \|\varepsilon\|\,\|Y'\|_F\,\|x_{\cdot j} - x_{\cdot k}\|,
\end{aligned}
$$
where the first inequality follows from simplification and the Cauchy–Schwarz inequality, the second inequality follows from $c^2 > x_{\cdot j}^T x_{\cdot k}$ and $\|\hat\beta_{j\cdot}\| - \|\hat\beta_{k\cdot}\| - \|\varepsilon\| > 0$ (by assumption), and the third inequality follows, again, from the Cauchy–Schwarz inequality.

Using this lemma, one can easily extend the clustering properties of GrOWL to the optimization in (1).

4. Proximal algorithms for GrOWL

Proof. Outline: the proof proceeds by finding a lower bound for the objective function in (5) and then showing that the proposed solution achieves this lower bound.

First, note that the following is true for any $B$ and $V$:
$$\|B - V\|_F^2 = \sum_{i=1}^{p}\|\beta_{i\cdot} - v_{i\cdot}\|^2 \ge \sum_{i=1}^{p}\big(\|\beta_{i\cdot}\| - \|v_{i\cdot}\|\big)^2 = \|\tilde\beta - \tilde v\|^2,$$
where the inequality follows from the reverse triangle inequality. Combining this with $G(B) = \Omega_w(\tilde\beta)$, we have a lower bound on the objective function in (5): for all $B \in \mathbb{R}^{p \times r}$,
$$\frac{1}{2}\|B - V\|_F^2 + G(B) \ge \frac{1}{2}\|\operatorname{prox}_{\Omega_w}(\tilde v) - \tilde v\|^2 + \Omega_w\!\big(\operatorname{prox}_{\Omega_w}(\tilde v)\big).$$
Finally, we show that $B = \hat V$ achieves this lower bound:
$$\frac{1}{2}\|\hat V - V\|_F^2 + G(\hat V) = \frac{1}{2}\sum_{i=1}^{p}\Big\|\big(\operatorname{prox}_{\Omega_w}(\tilde v)\big)_i \frac{v_{i\cdot}}{\|v_{i\cdot}\|} - v_{i\cdot}\Big\|_2^2 + \Omega_w\!\big(\operatorname{prox}_{\Omega_w}(\tilde v)\big) = \frac{1}{2}\|\operatorname{prox}_{\Omega_w}(\tilde v) - \tilde v\|_2^2 + \Omega_w\!\big(\operatorname{prox}_{\Omega_w}(\tilde v)\big),$$
which matches the lower bound above and completes the proof.
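The computation above says that the proximal operator of the GrOWL penalty is obtained by applying the proximal operator of the OWL norm $\Omega_w$ to the vector of row norms $\tilde v$ and then rescaling each row of $V$ by $(\operatorname{prox}_{\Omega_w}(\tilde v))_i / \|v_{i\cdot}\|$. The NumPy sketch below illustrates that reduction; the inner OWL prox is computed with a standard sort plus pool-adjacent-violators routine, which is an implementation choice assumed here rather than pseudocode from the paper, and the names `prox_owl` / `prox_growl` are illustrative.

```python
import numpy as np

def prox_owl(v, w):
    """Prox of the OWL norm at a nonnegative vector v, with nonincreasing
    nonnegative weights w: sort v, subtract w, project onto the
    nonincreasing cone with pool-adjacent-violators, clip at zero, unsort."""
    order = np.argsort(v)[::-1]          # indices sorting v in decreasing order
    z = v[order] - w                     # elementwise difference in sorted order

    # Pool-adjacent-violators: enforce a nonincreasing sequence by
    # merging adjacent blocks whose means violate monotonicity.
    sums, lens = [], []
    for zi in z:
        sums.append(zi)
        lens.append(1)
        while len(sums) > 1 and sums[-2] / lens[-2] < sums[-1] / lens[-1]:
            sums[-2] += sums[-1]; lens[-2] += lens[-1]
            sums.pop(); lens.pop()
    proj = np.concatenate([np.full(l, s / l) for s, l in zip(sums, lens)])

    theta_sorted = np.maximum(proj, 0.0)  # row norms cannot be negative
    theta = np.empty_like(v)
    theta[order] = theta_sorted           # undo the sort
    return theta

def prox_growl(V, w):
    """Prox of the GrOWL penalty at a matrix V: apply the OWL prox to the
    vector of row norms, then rescale each row of V accordingly."""
    row_norms = np.linalg.norm(V, axis=1)              # this is \tilde v
    theta = prox_owl(row_norms, w)                     # prox_{Omega_w}(\tilde v)
    scale = np.divide(theta, row_norms,
                      out=np.zeros_like(theta), where=row_norms > 0)
    return V * scale[:, None]                          # theta_i * v_i / ||v_i||

# Toy usage: some rows get shrunk, tied, or zeroed depending on the weights.
rng = np.random.default_rng(0)
V = rng.standard_normal((6, 4))
w = np.linspace(1.0, 0.2, 6)                           # nonincreasing OWL weights
print(np.round(np.linalg.norm(prox_growl(V, w), axis=1), 3))
```

With all weights equal, the OWL prox reduces to soft-thresholding of the row norms (the group-lasso case); strictly decreasing weights are what push rows toward equal norms, in line with the clustering results above.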