Supplementary Material: Representational Similarity Learning

1. Extending methods to matrices that are not Positive Semi-Definite
For any symmetric $S$ we have an eigendecomposition $UDU^T$, where the eigenvalues may be negative. We can think of $W$ as having the form $BDB^T$ for some $B$, using the $D$ from $S$. So we consider the following optimizations instead:
\[
\min_{B} \|S - XBDB^TX^T\|_F^2,
\]
and
\[
\min_{B} \|U - XB\|_F^2.
\]
Then, given a $B$, construct $W = BDB^T$. The rest remains the same.
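
The second (least-squares) variant is straightforward to implement. Below is a minimal numpy sketch of that route, assuming $X$ and a symmetric (possibly indefinite) $S$ as in the main text; the function and variable names are ours, and the regularized variants would add the GrOWL penalty on $B$.

```python
import numpy as np

def indefinite_extension(S, X):
    """Handle a symmetric but not necessarily PSD similarity matrix S.

    Eigendecompose S = U D U^T, fit B by least squares so that X B ~ U,
    and return W = B D B^T. This is a sketch of the unregularized
    least-squares route only.
    """
    # Symmetric eigendecomposition; eigenvalues may be negative.
    eigvals, U = np.linalg.eigh(S)
    D = np.diag(eigvals)

    # Solve min_B ||U - X B||_F^2 (ordinary least squares).
    B, *_ = np.linalg.lstsq(X, U, rcond=None)

    # Reconstruct the (possibly indefinite) weight matrix.
    W = B @ D @ B.T
    return W, B, D

# Example usage with random data.
rng = np.random.default_rng(0)
n, p = 20, 8
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
S = (A + A.T) / 2          # symmetric, generally not PSD
W, B, D = indefinite_extension(S, X)
```
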

2. Clustering properties of GrOWL with absolute error loss function
Proof of Theorem 3.1
Proof. The proof is divided into two steps. First, we show $\|\hat{\beta}_{j\cdot}\| = \|\hat{\beta}_{k\cdot}\|$, and then we further show that the rows are equal. We proceed by contradiction. Assume $\|\hat{\beta}_{j\cdot}\| \neq \|\hat{\beta}_{k\cdot}\|$ and, without loss of generality, suppose $\|\hat{\beta}_{j\cdot}\| > \|\hat{\beta}_{k\cdot}\|$. We show that there exists a modification of the solution with a smaller GrOWL norm and the same data-fitting term, and thus a smaller overall objective value, which contradicts our assumption that $\hat{B}$ is the minimizer of $L(B) + G(B)$.
b except v
bj· = βbj· − ε
Consider the modification, V = B
b
b
bk· = βk· + ε where ε = δ βj· and δ is chosen such
and v
b k−kβ
b k
kβ
that kεk ∈ 0, j· 2 k·
Let $L(B) = \|Y - XB\|_1 = \|Y' - x_{\cdot j}\hat{\beta}_{j\cdot} - x_{\cdot k}\hat{\beta}_{k\cdot}\|_1$, where $Y'$ is the residual term given by $Y' = Y - \sum_{i \neq j,k} x_{\cdot i}\hat{\beta}_{i\cdot}$. Since $x_{\cdot j} = x_{\cdot k}$, $L$ is invariant under this transformation, i.e., $L(V) = L(\hat{B})$. The same is true for $L(B) = \|Y - XB\|_F^2$.
Observe that the GrOWL norm of $B$ is equal to the OWL norm of the vector of Euclidean norms of the rows of $B$. Since $\|\hat{v}_{k\cdot}\| = \|\hat{\beta}_{k\cdot} + \varepsilon\| \le \|\hat{\beta}_{k\cdot}\| + \|\varepsilon\|$, this transformation is equivalent to that defined in Lemma 3.1 and we have
\[
G(\hat{B}) - G(V) \ge \Delta\|\varepsilon\|.
\]
This leads to a contradiction to our assumption that $\hat{B}$ is the minimizer of $L(B) + G(B)$ and completes the proof that $\|\hat{\beta}_{j\cdot}\| = \|\hat{\beta}_{k\cdot}\|$. Now, let $\hat{\beta}_{j\cdot} + \hat{\beta}_{k\cdot} = z$; then the minimizer satisfies
\[
\min_{\hat{\beta}_{j\cdot},\,\hat{\beta}_{k\cdot}} \; w_j\|\hat{\beta}_{j\cdot}\| + w_k\|\hat{\beta}_{k\cdot}\| \quad \text{such that } \hat{\beta}_{j\cdot} + \hat{\beta}_{k\cdot} = z \text{ and } \|\hat{\beta}_{j\cdot}\| = \|\hat{\beta}_{k\cdot}\|.
\]
It is easy to see that the solution to this optimization is $\hat{\beta}_{j\cdot} = \hat{\beta}_{k\cdot} = z/2$.
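
As a small numerical illustration (ours, not part of the proof) of the invariance step $L(V) = L(\hat{B})$ used above: when two columns of $X$ are identical, shifting mass between the corresponding rows of $B$ leaves $XB$, and hence any loss of the form $\|Y - XB\|$, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 30, 6, 4
X = rng.standard_normal((n, p))
X[:, 3] = X[:, 2]                 # make columns j = 2 and k = 3 identical

B = rng.standard_normal((p, r))
eps = 0.3 * B[2]                  # eps = delta * beta_j with delta = 0.3

V = B.copy()
V[2] = B[2] - eps                 # v_j = beta_j - eps
V[3] = B[3] + eps                 # v_k = beta_k + eps

# X V == X B because the shifted mass multiplies identical columns.
assert np.allclose(X @ V, X @ B)
```
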
Proof of Theorem 3.2
Proof. The proof is similar to the identical columns theorem. By contradiction and without loss of generality, suppose $\|\hat{\beta}_{j\cdot}\| > \|\hat{\beta}_{k\cdot}\|$. We show that there exists a transformation of $\hat{B}$ such that the increase in the data-fitting term is smaller than the decrease in the GrOWL norm.
Consider the modification, $V$, as defined in the proof of Theorem 3.1. By the triangle inequality, the difference in the loss function $L$ that results from this modification satisfies
\[
L(V) - L(\hat{B}) \le \|x_{\cdot j} - x_{\cdot k}\|_1\,\|\varepsilon\|_1.
\]
Invoking Lemma 3.1 as in the previous theorem and $\|\varepsilon\|_1 \le \sqrt{r}\|\varepsilon\|$, we get
\[
L(V) + G(V) - \big(L(\hat{B}) + G(\hat{B})\big) \le \sqrt{r}\left(\|x_{\cdot j} - x_{\cdot k}\|_1 - \frac{\Delta}{\sqrt{r}}\right)\|\varepsilon\| < 0.
\]
This contradicts our assumption that $\hat{B}$ is the minimizer of $L(B) + G(B)$ and completes the proof for absolute loss. The proof with squared Frobenius loss can easily be extended using the inequality derived in Appendix B.
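
The loss perturbation bound used in this proof (and again in Theorem 3.3) is elementary but easy to check numerically. The snippet below verifies, for random data, that for the entrywise absolute loss $L(B) = \|Y - XB\|_1$ the modification changes the loss by at most $\|x_{\cdot j} - x_{\cdot k}\|_1\|\varepsilon\|_1$; this is only an illustrative sanity check, with names of our choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 40, 7, 5
j, k = 1, 4

X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, r))
B = rng.standard_normal((p, r))
eps = 0.2 * B[j]                      # the row perturbation from the proof

V = B.copy()
V[j] = B[j] - eps
V[k] = B[k] + eps

def abs_loss(Bmat):
    return np.abs(Y - X @ Bmat).sum()  # entrywise l1 loss

lhs = abs_loss(V) - abs_loss(B)
rhs = np.abs(X[:, j] - X[:, k]).sum() * np.abs(eps).sum()
assert lhs <= rhs + 1e-10
```
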
Proof of Theorem 3.3
Proof. The proof is similar to the identical columns theorem. By contradiction, suppose
\[
\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\| \ge \frac{8\phi\|\hat{\beta}_{k\cdot}\|}{4\phi^2 + 1} \ge \frac{2\|\hat{\beta}_{k\cdot}\|}{\phi}.
\]
We show that there exists a transformation of $\hat{B}$ such that the increase in the data-fitting term is smaller than the decrease in the GrOWL norm.
Consider the modification, $V$, as defined in the proof of Theorem 3.1 with $\varepsilon = \frac{\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}}{2}$. By the triangle inequality, the difference in the loss function $L$ that results from this modification satisfies
\[
L(V) - L(\hat{B}) \le \|x_{\cdot j} - x_{\cdot k}\|_1\,\|\varepsilon\|_1.
\]
We now bound the decrease in the GrOWL norm. Note that, by the parallelogram law,
\begin{align*}
\|\hat{\beta}_{j\cdot} + \hat{\beta}_{k\cdot}\|^2
&= 2\|\hat{\beta}_{j\cdot}\|^2 + 2\|\hat{\beta}_{k\cdot}\|^2 - \|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|^2 \\
&\le 2\|\hat{\beta}_{j\cdot}\|^2 + 2\|\hat{\beta}_{k\cdot}\|^2 + \frac{1}{4\phi^2}\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|^2 - \left(1 + \frac{1}{4\phi^2}\right)\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|^2 \\
&\le 4\|\hat{\beta}_{j\cdot}\|^2 + \left(\frac{\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|}{2\phi}\right)^2 - \frac{1 + 4\phi^2}{4\phi^2}\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|^2 \\
&\le 4\|\hat{\beta}_{j\cdot}\|^2 + \left(\frac{\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|}{2\phi}\right)^2 - 2\,\frac{\|\hat{\beta}_{j\cdot}\|\,\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|}{\phi} \\
&\le \left(\|\hat{\beta}_{j\cdot}\| + \|\hat{\beta}_{k\cdot}\| - \frac{\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|}{2\phi}\right)^2.
\end{align*}
Thus, we have
\[
G(\hat{B}) - G(V) \ge \Delta\left(\|\hat{\beta}_{j\cdot}\| + \|\hat{\beta}_{k\cdot}\| - \|\hat{\beta}_{j\cdot} + \hat{\beta}_{k\cdot}\|\right) \ge \frac{\Delta\|\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot}\|}{2\phi} = \frac{\Delta\|\varepsilon\|}{\phi}.
\]
Combining this with $\|\varepsilon\|_1 \le \sqrt{r}\|\varepsilon\|$, we get
\[
L(V) + G(V) - \big(L(\hat{B}) + G(\hat{B})\big) \le \left(\sqrt{r}\,\|x_{\cdot j} - x_{\cdot k}\|_1 - \frac{\Delta}{\phi}\right)\|\varepsilon\| < 0.
\]
This contradicts our assumption that $\hat{B}$ is the minimizer of $L(B) + G(B)$ and completes the proof for absolute loss. The proof with squared Frobenius loss can easily be extended using the inequality derived in Appendix B.

3. Clustering properties of GrOWL with squared Frobenius loss function

In this section, we consider the optimization
\[
\min_{B} \|Y - XB\|_F^2 + G(B). \tag{1}
\]
Here we derive an upper bound on the increase in the squared loss term after applying the transformation, $V$. We assume that the columns of the matrix $X$ are normalized to a common norm, i.e., $\|x_{\cdot i}\| = c$ for $i = 1, \dots, p$. Define $L(B) = \|Y - XB\|_F^2 = \|Y' - x_{\cdot j}\beta_{j\cdot} - x_{\cdot k}\beta_{k\cdot}\|_F^2$, where $Y'$ is again the residual term.

Lemma 1. Let $\hat{B} \in \mathbb{R}^{p \times r}$ and let $V$ be as defined in the respective theorems. Then we have
\[
L(V) - L(\hat{B}) \le \|\varepsilon\|\,\|Y'\|_F\,\|x_{\cdot j} - x_{\cdot k}\|.
\]

Proof.
\[
L(V) - L(\hat{B}) = \frac{1}{2}\|Y' - x_{\cdot j}(\hat{\beta}_{j\cdot} - \varepsilon) - x_{\cdot k}(\hat{\beta}_{k\cdot} + \varepsilon)\|_F^2 - \frac{1}{2}\|Y' - x_{\cdot j}\hat{\beta}_{j\cdot} - x_{\cdot k}\hat{\beta}_{k\cdot}\|_F^2.
\]
Expanding the Frobenius norm terms, canceling the common $\frac{1}{2}\|Y'\|_F^2$ terms and using the common norm of the columns ($\|x_{\cdot i}\| = c$ for $i = 1, \dots, p$), we get
\begin{align*}
L(V) - L(\hat{B}) = {}& \frac{c^2}{2}\,\mathrm{tr}\big((\hat{\beta}_{j\cdot} - \varepsilon)(\hat{\beta}_{j\cdot} - \varepsilon)^T + (\hat{\beta}_{k\cdot} + \varepsilon)(\hat{\beta}_{k\cdot} + \varepsilon)^T - \hat{\beta}_{j\cdot}\hat{\beta}_{j\cdot}^T - \hat{\beta}_{k\cdot}\hat{\beta}_{k\cdot}^T\big) + \mathrm{tr}\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) \\
&+ \mathrm{tr}\big((\hat{\beta}_{j\cdot} - \varepsilon)\,x_{\cdot j}^T x_{\cdot k}\,(\hat{\beta}_{k\cdot} + \varepsilon)^T - \hat{\beta}_{j\cdot}\,x_{\cdot j}^T x_{\cdot k}\,\hat{\beta}_{k\cdot}^T\big).
\end{align*}
Expanding terms and making further cancellations gives
\begin{align*}
L(V) - L(\hat{B}) &= \mathrm{tr}\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) - (c^2 - x_{\cdot j}^T x_{\cdot k})\,\mathrm{tr}\big((\hat{\beta}_{j\cdot} - \hat{\beta}_{k\cdot} - \varepsilon)\varepsilon^T\big) \\
&\le \mathrm{tr}\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) - (c^2 - x_{\cdot j}^T x_{\cdot k})\,\|\varepsilon\|\big(\|\hat{\beta}_{j\cdot}\| - \|\hat{\beta}_{k\cdot}\| - \|\varepsilon\|\big) \\
&\le \mathrm{tr}\big(Y'^T(x_{\cdot j} - x_{\cdot k})\varepsilon\big) \\
&\le \|Y'\|_F\,\|(x_{\cdot j} - x_{\cdot k})\varepsilon\|_F \\
&= \|\varepsilon\|\,\|Y'\|_F\,\|x_{\cdot j} - x_{\cdot k}\|,
\end{align*}
where the first inequality follows from simplification and the Cauchy-Schwarz inequality, the second inequality follows from $c^2 > x_{\cdot j}^T x_{\cdot k}$ and $\|\hat{\beta}_{j\cdot}\|_2 - \|\hat{\beta}_{k\cdot}\|_2 - \|\varepsilon\| > 0$ (by assumption), and the third inequality follows, again, from the Cauchy-Schwarz inequality.

Using this Lemma one can easily extend the clustering properties of GrOWL to the optimization in (1).
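
As an informal numerical sanity check of Lemma 1 (not a substitute for the proof), the snippet below generates data satisfying the assumptions above (equal column norms $c$, $c^2 > x_{\cdot j}^T x_{\cdot k}$, and $\|\hat{\beta}_{j\cdot}\| - \|\hat{\beta}_{k\cdot}\| - \|\varepsilon\| > 0$), uses the scaled loss $\tfrac{1}{2}\|\cdot\|_F^2$ exactly as in the displayed proof, and checks the bound; the construction and names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 50, 6, 4
j, k = 0, 1

# Columns normalized to a common norm c = 1 (so c^2 > x_j^T x_k generically).
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)

B = rng.standard_normal((p, r))
B[j] *= 3.0                            # make row j dominant
eps = 0.1 * B[j]

# Check the assumptions of the lemma on this draw.
assert X[:, j] @ X[:, k] < 1.0
assert np.linalg.norm(B[j]) - np.linalg.norm(B[k]) - np.linalg.norm(eps) > 0

V = B.copy()
V[j] = B[j] - eps
V[k] = B[k] + eps

Y = rng.standard_normal((n, r))
# Residual Y' = Y - sum_{i != j,k} x_i beta_i
Yres = Y - X @ B + np.outer(X[:, j], B[j]) + np.outer(X[:, k], B[k])

def half_sq_loss(Bmat):
    return 0.5 * np.linalg.norm(Yres - np.outer(X[:, j], Bmat[j])
                                     - np.outer(X[:, k], Bmat[k]), 'fro') ** 2

lhs = half_sq_loss(V) - half_sq_loss(B)
rhs = (np.linalg.norm(eps) * np.linalg.norm(Yres, 'fro')
       * np.linalg.norm(X[:, j] - X[:, k]))
assert lhs <= rhs + 1e-10
```
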
4. Proximal algorithms for GrOWL
Proof. Outline: the proof proceeds by finding a lower bound for the objective function in (5) and then showing that the proposed solution achieves this lower bound.
First, note that the following is true for any $B$ and $V$:
\[
\|B - V\|_F^2 = \sum_{i=1}^{p} \|\beta_{i\cdot} - v_{i\cdot}\|^2 \ge \sum_{i=1}^{p} \big(\|\beta_{i\cdot}\| - \|v_{i\cdot}\|\big)^2 = \|\tilde{\beta} - \tilde{v}\|^2,
\]
where the inequality follows from the reverse triangle inequality.
Combining this with $G(B) = \Omega_w(\tilde{\beta})$, we have a lower bound on the objective function in (5): for all $B \in \mathbb{R}^{p \times r}$,
\[
\frac{1}{2}\|B - V\|_F^2 + G(B) \ge \frac{1}{2}\|\mathrm{prox}_{\Omega_w}(\tilde{v}) - \tilde{v}\|^2 + \Omega_w\big(\mathrm{prox}_{\Omega_w}(\tilde{v})\big).
\]
Finally, we show that $B = \hat{V}$ achieves this lower bound:
\begin{align*}
\frac{1}{2}\|\hat{V} - V\|_F^2 + G(\hat{V})
&= \frac{1}{2}\sum_{i=1}^{p} \Big\|\big(\mathrm{prox}_{\Omega_w}(\tilde{v})\big)_i \frac{v_{i\cdot}}{\|v_{i\cdot}\|} - v_{i\cdot}\Big\|_2^2 + \Omega_w\big(\mathrm{prox}_{\Omega_w}(\tilde{v})\big) \\
&= \frac{1}{2}\|\mathrm{prox}_{\Omega_w}(\tilde{v}) - \tilde{v}\|_2^2 + \Omega_w\big(\mathrm{prox}_{\Omega_w}(\tilde{v})\big).
\end{align*}
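
The construction above says that the GrOWL proximal operator reduces to the OWL proximal operator applied to the vector of row norms, followed by rescaling each row. Below is a minimal numpy sketch of that reduction; `prox_owl` is assumed to be supplied by the user, and all names here are ours rather than the paper's. For the special case of equal weights, the OWL prox of a nonnegative vector is plain soft thresholding, which the usage example exploits.

```python
import numpy as np

def prox_growl(V, prox_owl):
    """Proximal operator of the GrOWL penalty at V (rows = groups).

    Following the argument above: apply the OWL prox to the vector of
    row norms, then rescale each row of V to have the prox'ed norm.
    `prox_owl` maps a nonnegative vector of row norms to the prox of
    the OWL norm at that vector (assumed given).
    """
    row_norms = np.linalg.norm(V, axis=1)      # v~ in the proof
    shrunk = prox_owl(row_norms)               # prox_{Omega_w}(v~)
    # Rescale rows; rows with zero norm stay zero.
    scale = np.divide(shrunk, row_norms,
                      out=np.zeros_like(row_norms),
                      where=row_norms > 0)
    return V * scale[:, None]

def soft_threshold(v, lam=0.5):
    # OWL prox of a nonnegative vector when all weights equal lam,
    # i.e. the plain l1 prox; here prox_growl reduces to group
    # soft thresholding.
    return np.maximum(v - lam, 0.0)

# Usage example.
rng = np.random.default_rng(4)
V = rng.standard_normal((6, 3))
V_hat = prox_growl(V, soft_threshold)
```
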