Matrix Calculus - Notes on the Derivative of a Trace

Johannes Traa
This write-up elucidates the rules of matrix calculus for expressions involving the trace of a function of a matrix X:

f = \operatorname{tr}[g(X)].    (1)
We would like to take the derivative of f with respect to X:

\frac{\partial f}{\partial X} = \, ?    (2)
One strategy is to write the trace expression as a scalar using index notation, take the
derivative, and re-write in matrix form. An easier way is to reduce the problem to one or
more smaller problems where the results for simpler derivatives can be applied. It's brute force versus bottom-up.
MATRIX-VALUED DERIVATIVE
The derivative of a scalar f with respect to a matrix X ∈ R^{M×N} can be written as:

\frac{\partial f}{\partial X} =
\begin{bmatrix}
\frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1N}} \\
\frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2N}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial X_{M1}} & \frac{\partial f}{\partial X_{M2}} & \cdots & \frac{\partial f}{\partial X_{MN}}
\end{bmatrix}.    (3)
So, the result is the same size as X.
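Since every identity below reduces to (3), a handy way to sanity-check a result is to compare it against a finite-difference approximation of each entry. Below is a minimal NumPy sketch; the helper name num_grad and the central-difference scheme are our own illustration, not part of the original derivations.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Approximate the matrix derivative (3) of a scalar-valued f
    by perturbing one entry of X at a time (central differences)."""
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X)
        dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G
```

The output has the same shape as X, exactly as (3) requires. Later sketches re-include this helper so that each one runs on its own.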
MATRIX AND INDEX NOTATION
It is useful to be able to convert between matrix notation and index notation. For example, the product AB has elements:

[AB]_{ik} = \sum_j A_{ij} B_{jk},    (4)
and the matrix product ABC^T has elements:

[ABC^T]_{il} = \sum_j A_{ij} [BC^T]_{jl} = \sum_j A_{ij} \sum_k B_{jk} C_{lk} = \sum_j \sum_k A_{ij} B_{jk} C_{lk}.    (5)
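As a quick illustration, index expressions like (5) map directly onto np.einsum, whose subscript string names the indices exactly as written above. A small sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))

# [ABC^T]_{il} = sum_{j,k} A_{ij} B_{jk} C_{lk}, as in (5)
lhs = A @ B @ C.T
rhs = np.einsum('ij,jk,lk->il', A, B, C)
print(np.allclose(lhs, rhs))  # True
```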
FIRST-ORDER DERIVATIVES

EXAMPLE 1
Consider this example:

f = \operatorname{tr}[AXB].    (6)

We can write this using index notation as:

f = \sum_i [AXB]_{ii} = \sum_i \sum_j A_{ij} [XB]_{ji} = \sum_i \sum_j A_{ij} \sum_k X_{jk} B_{ki} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{ki}.    (7)

Taking the derivative with respect to X_{jk}, we get:

\frac{\partial f}{\partial X_{jk}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}.    (8)
The result has to be the same size as X, so we know that the indices of the rows and
columns must be j and k, respectively. This means we have to transpose the result above
to write the derivative in matrix form as:
\frac{\partial \operatorname{tr}[AXB]}{\partial X} = A^T B^T.    (9)
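A numerical sanity check of (9), re-using the num_grad helper sketched after (3); the shapes here are arbitrary choices that make AXB square:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))

f = lambda X: np.trace(A @ X @ B)
print(np.allclose(num_grad(f, X), A.T @ B.T))  # equation (9): True
```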
EXAMPLE 2
Similarly, we have:

f = \operatorname{tr}[AX^T B] = \sum_i \sum_j \sum_k A_{ij} X_{kj} B_{ki},    (10)

so that the derivative is:

\frac{\partial f}{\partial X_{kj}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}.    (11)

The X term appears in (10) with indices kj, so we need to write the derivative in matrix form such that k is the row index and j is the column index. Thus, we have:
\frac{\partial \operatorname{tr}[AX^T B]}{\partial X} = BA.    (12)
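The same check applied to (12); again the helper and shapes are our own illustration:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 4))  # so that X^T is 4x5
B = rng.standard_normal((5, 3))

f = lambda X: np.trace(A @ X.T @ B)
print(np.allclose(num_grad(f, X), B @ A))  # equation (12): True
```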
MULTIPLE-ORDER
Now consider a more complicated example:

f = \operatorname{tr}[AXBXC^T]    (13)
  = \sum_i \sum_j \sum_k \sum_l \sum_m A_{ij} X_{jk} B_{kl} X_{lm} C_{im}.    (14)
The derivative has contributions from both appearances of X.
TAKE 1
In index notation:

\frac{\partial f}{\partial X_{jk}} = \sum_i \sum_l \sum_m A_{ij} B_{kl} X_{lm} C_{im} = [BXC^T A]_{kj},    (15)

\frac{\partial f}{\partial X_{lm}} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{kl} C_{im} = [C^T AXB]_{ml}.    (16)

Transposing appropriately and summing the terms together, we have:

\frac{\partial \operatorname{tr}[AXBXC^T]}{\partial X} = A^T C X^T B^T + B^T X^T A^T C.    (17)
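Checking (17) numerically; X is taken square here so that both of its appearances type-check (helper as before):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((2, 3))

f = lambda X: np.trace(A @ X @ B @ X @ C.T)
analytic = A.T @ C @ X.T @ B.T + B.T @ X.T @ A.T @ C  # equation (17)
print(np.allclose(num_grad(f, X), analytic))  # True
```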
TAKE 2
We can skip this tedious process by applying (9) for each appearance of X:

\frac{\partial \operatorname{tr}[AXBXC^T]}{\partial X} = \frac{\partial \operatorname{tr}[AXD]}{\partial X} + \frac{\partial \operatorname{tr}[EXC^T]}{\partial X} = A^T D^T + E^T C,    (18)

where D = BXC^T and E = AXB. So we just evaluate the matrix derivative for each appearance of X assuming that everything else is a constant (including the other X's). To see why this
rule is useful, consider the following beast:
f = \operatorname{tr}[AXX^T BCX^T XC].    (19)
We can immediately write down the derivative using (9) and (12):

\frac{\partial \operatorname{tr}[AXX^T BCX^T XC]}{\partial X} = A^T (X^T BCX^T XC)^T + (BCX^T XC)(AX) + (XC)(AXX^T BC) + (AXX^T BCX^T)^T C^T    (20)
= A^T C^T X^T X C^T B^T X + BCX^T XCAX + XCAXX^T BC + XC^T B^T XX^T A^T C^T.    (21)
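Even for the beast, the finite-difference check is mechanical. A sketch with all matrices square so that every appearance of X type-checks:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
n = 3
A, B, C, X = (rng.standard_normal((n, n)) for _ in range(4))

f = lambda X: np.trace(A @ X @ X.T @ B @ C @ X.T @ X @ C)
analytic = (A.T @ C.T @ X.T @ X @ C.T @ B.T @ X        # equation (21),
            + B @ C @ X.T @ X @ C @ A @ X              # term by term
            + X @ C @ A @ X @ X.T @ B @ C
            + X @ C.T @ B.T @ X @ X.T @ A.T @ C.T)
print(np.allclose(num_grad(f, X), analytic, atol=1e-6))  # True
```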
FROBENIUS NORM
The Frobenius norm shows up when we have an optimization problem involving a matrix
factorization and we want to minimize a sum-of-squares error criterion:
f = \sum_i \sum_k \left( X_{ik} - \sum_j W_{ij} H_{jk} \right)^2 = \|X - WH\|_F^2 = \operatorname{tr}\left[ (X - WH)(X - WH)^T \right].    (22)
We can work with the expression in index notation, but it’s easier to work directly with
matrices and apply the results derived earlier. Suppose we want to find the derivative with
respect to W. Expanding the matrix outer product, we have:
f = \operatorname{tr}[XX^T] - \operatorname{tr}[XH^T W^T] - \operatorname{tr}[WHX^T] + \operatorname{tr}[WHH^T W^T].    (23)

Applying (9) and (12), we easily deduce that:

\frac{\partial \operatorname{tr}[(X - WH)(X - WH)^T]}{\partial W} = -2XH^T + 2WHH^T.    (24)
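This gradient is the workhorse of least-squares matrix factorization updates. A quick check of (24); the finite-difference helper now perturbs W instead of X:

```python
import numpy as np

def num_grad(f, W, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(W)
    for i in np.ndindex(*W.shape):
        dW = np.zeros_like(W); dW[i] = eps
        G[i] = (f(W + dW) - f(W - dW)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 6))
W = rng.standard_normal((4, 2))
H = rng.standard_normal((2, 6))

f = lambda W: np.linalg.norm(X - W @ H, 'fro') ** 2
analytic = -2 * X @ H.T + 2 * W @ H @ H.T  # equation (24)
print(np.allclose(num_grad(f, W), analytic, atol=1e-6))  # True
```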
LOG
Consider this trace expression:

f = \operatorname{tr}\left[ V^T \log(AXB) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m \sum_n A_{im} X_{mn} B_{nj} \right),    (25)

where the logarithm is applied elementwise.
Taking the derivative with respect to X_{mn}, we get:

\frac{\partial f}{\partial X_{mn}} = \sum_i \sum_j \frac{V_{ij}}{[AXB]_{ij}} A_{im} B_{nj} = \sum_i \sum_j A_{im} \left[ \frac{V}{AXB} \right]_{ij} B_{nj},    (26)

where the division V/(AXB) is elementwise.
Thus:

\frac{\partial \operatorname{tr}[V^T \log(AXB)]}{\partial X} = A^T \left( \frac{V}{AXB} \right) B^T.    (27)
Similarly:

\frac{\partial \operatorname{tr}[V^T \log(AX^T B)]}{\partial X} = B \left( \frac{V}{AX^T B} \right)^T A.    (28)

These bear a spooky resemblance to (9) and (12).
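A check of (27); positive matrices keep the elementwise log well-defined, and V / (A @ X @ B) below is the elementwise division from the notation above:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
A = rng.random((3, 4)) + 1.0  # strictly positive entries
X = rng.random((4, 5)) + 1.0
B = rng.random((5, 2)) + 1.0
V = rng.random((3, 2))

f = lambda X: np.trace(V.T @ np.log(A @ X @ B))
analytic = A.T @ (V / (A @ X @ B)) @ B.T  # equation (27)
print(np.allclose(num_grad(f, X), analytic, atol=1e-6))  # True
```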
DIAG

EXAMPLE 1

Consider the tricky case of a diag(·) operator:

f = \operatorname{tr}[A \operatorname{diag}(x) B] = \sum_i \sum_j A_{ij} x_j B_{ji}.    (29)
Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i A_{ij} B_{ji} = \left[ (A^T \odot B) \mathbf{1} \right]_j.    (30)
So we can write:

\frac{\partial \operatorname{tr}[A \operatorname{diag}(x) B]}{\partial x} = (A^T \odot B) \mathbf{1}.    (31)
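Checking (31), where ⊙ is the elementwise (Hadamard) product, written with NumPy's *:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):  # finite-difference helper from the sketch after (3)
    g = np.zeros_like(x)
    for i in np.ndindex(*x.shape):
        dx = np.zeros_like(x); dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
x = rng.standard_normal(4)

f = lambda x: np.trace(A @ np.diag(x) @ B)
analytic = (A.T * B) @ np.ones(3)  # equation (31): (A^T ⊙ B) 1
print(np.allclose(num_grad(f, x), analytic))  # True
```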
EXAMPLE 2
Consider the following take on the last example:

f = \operatorname{tr}[JA \operatorname{diag}(x) B] = \sum_i \sum_j \sum_k A_{ij} x_j B_{jk},    (32)

where J is the matrix of ones.
TAKE 1
Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i \sum_k A_{ij} B_{jk} = \left( \sum_i A_{ij} \right) \left( \sum_k B_{jk} \right) = \left[ A^T \mathbf{1} \odot B \mathbf{1} \right]_j.    (33)
So we can write:
£
¤
∂ t r J A d i ag (x) B
∂x
= AT 1 ¯ B1 .
(34)
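And a check of (34), with J built explicitly as a matrix of ones:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):  # finite-difference helper from the sketch after (3)
    g = np.zeros_like(x)
    for i in np.ndindex(*x.shape):
        dx = np.zeros_like(x); dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
x = rng.standard_normal(4)
J = np.ones((5, 3))  # matrix of ones that closes the trace

f = lambda x: np.trace(J @ A @ np.diag(x) @ B)
analytic = (A.T @ np.ones(3)) * (B @ np.ones(5))  # equation (34)
print(np.allclose(num_grad(f, x), analytic))  # True
```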
TAKE 2
We could have derived this result from the previous example using the cyclic (rotation) property of the trace operator:
f = \operatorname{tr}[JA \operatorname{diag}(x) B] = \operatorname{tr}[\mathbf{1}\mathbf{1}^T A \operatorname{diag}(x) B] = \operatorname{tr}[\mathbf{1}^T A \operatorname{diag}(x) B \mathbf{1}] = \operatorname{tr}[a^T \operatorname{diag}(x) b],    (35)
where we have defined a = A^T \mathbf{1} and b = B \mathbf{1}. Applying (31), we have:
\frac{\partial \operatorname{tr}[a^T \operatorname{diag}(x) b]}{\partial x} = (a \odot b) \mathbf{1} = A^T \mathbf{1} \odot B \mathbf{1}.    (36)
EXAMPLE 3
Consider a more complicated example:

f = \operatorname{tr}\left[ V^T \log(A \operatorname{diag}(x) B) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} x_m B_{mj} \right).    (37)
Taking the derivative with respect to x_m, we have:

\frac{\partial f}{\partial x_m} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_{m'} A_{im'} x_{m'} B_{m'j}} \right) A_{im} B_{mj}    (38)
= \sum_i \sum_j A_{im} \left[ \frac{V}{A \operatorname{diag}(x) B} \right]_{ij} B_{mj}    (39)
= \left[ \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \odot B \right) \mathbf{1} \right]_m.    (40)

The final result is:

\frac{\partial \operatorname{tr}\left[ V^T \log(A \operatorname{diag}(x) B) \right]}{\partial x} = \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \odot B \right) \mathbf{1}.    (41)
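A check of (41), again with positive entries so the elementwise log is well-defined:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):  # finite-difference helper from the sketch after (3)
    g = np.zeros_like(x)
    for i in np.ndindex(*x.shape):
        dx = np.zeros_like(x); dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

rng = np.random.default_rng(9)
A = rng.random((3, 4)) + 1.0
B = rng.random((4, 2)) + 1.0
V = rng.random((3, 2))
x = rng.random(4) + 1.0

f = lambda x: np.trace(V.T @ np.log(A @ np.diag(x) @ B))
W = V / (A @ np.diag(x) @ B)             # elementwise quotient from (39)
analytic = ((A.T @ W) * B) @ np.ones(2)  # equation (41)
print(np.allclose(num_grad(f, x), analytic, atol=1e-6))  # True
```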
EXAMPLE 4
How about when we have a trace composed of a sum of expressions, each of which depends on which row of a matrix B is chosen:

f = \operatorname{tr}\left[ \sum_k V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right] = \sum_k \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj} \right).    (42)
Taking the derivative, we get:

\frac{\partial f}{\partial X_{nm}} = \sum_k \sum_i \sum_j \left( \frac{V_{ij}}{\sum_{m'} A_{im'} \left( \sum_{n'} B_{kn'} X_{n'm'} \right) C_{m'j}} \right) A_{im} B_{kn} C_{mj}    (43)
= \sum_k B_{kn} \sum_i \sum_j A_{im} \left[ \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right]_{ij} C_{mj}    (44)
= \sum_k B_{kn} \left[ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right)^T A \odot C^T \right) \right]_m,    (45)

so we can write:

\frac{\partial \operatorname{tr}\left[ \sum_k V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right]}{\partial X} = B^T \begin{bmatrix} \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{1:} X) C} \right)^T A \odot C^T \right) \\ \vdots \\ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{K:} X) C} \right)^T A \odot C^T \right) \end{bmatrix}.    (46)
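Even (46) yields to the same mechanical check. In the sketch below, the stacked rows of the bracketed matrix are built one k at a time and then multiplied by B^T; all names and dimensions are our own illustration:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):  # finite-difference helper from the sketch after (3)
    G = np.zeros_like(X)
    for i in np.ndindex(*X.shape):
        dX = np.zeros_like(X); dX[i] = eps
        G[i] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(10)
I, J, K, M, N = 3, 4, 2, 5, 6
A = rng.random((I, M)) + 1.0  # positive, so the log stays well-defined
C = rng.random((M, J)) + 1.0
V = rng.random((I, J))
B = rng.random((K, N))
X = rng.random((N, M)) + 1.0

def f(X):
    return sum(np.trace(V.T @ np.log(A @ np.diag(B[k] @ X) @ C))
               for k in range(K))

rows = []
for k in range(K):  # one row of the stacked matrix in (46) per k
    P = A @ np.diag(B[k] @ X) @ C
    rows.append(np.ones(J) @ (((V / P).T @ A) * C.T))
analytic = B.T @ np.vstack(rows)  # equation (46)
print(np.allclose(num_grad(f, X), analytic, atol=1e-5))  # True
```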
CONCLUSION
We’ve learned two things. First, if we don’t know how to find the derivative of an expression using matrix calculus directly, we can always fall back on index notation and convert
back to matrices at the end. This reduces a potentially unintuitive matrix-valued problem
into one involving scalars, which we are used to. Second, it’s less painful to massage an
expression into a familiar form and apply previously derived identities.