
Matrix Differentiation
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science
School of Computing
National University of Singapore
Linear Fitting Revisited
Linear fitting solves this problem:
Given $n$ data points $\mathbf{p}_i = [x_{i1} \; \cdots \; x_{im}]^\top$, $1 \le i \le n$, and their corresponding values $v_i$, find a linear function $f$ that minimizes the error

$$E = \sum_{i=1}^{n} \big( f(\mathbf{p}_i) - v_i \big)^2. \tag{1}$$

The linear function $f(\mathbf{p})$ has the form

$$f(\mathbf{p}) = f(x_1, \ldots, x_m) = a_1 x_1 + \cdots + a_m x_m + a_{m+1}. \tag{2}$$
The data points are organized into a matrix equation

$$\mathbf{D}\,\mathbf{a} = \mathbf{v}, \tag{3}$$

where

$$\mathbf{D} = \begin{bmatrix}
x_{11} & \cdots & x_{1m} & 1 \\
\vdots & \ddots & \vdots & \vdots \\
x_{n1} & \cdots & x_{nm} & 1
\end{bmatrix},
\quad
\mathbf{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_m \\ a_{m+1} \end{bmatrix},
\quad \text{and} \quad
\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}. \tag{4}$$

The solution of Eq. 3 is

$$\mathbf{a} = (\mathbf{D}^\top \mathbf{D})^{-1} \mathbf{D}^\top \mathbf{v}. \tag{5}$$
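As a concrete illustration (an addition, not part of the original slides), here is a minimal numpy sketch of Eqs. 3 to 5: it builds the design matrix D from arbitrarily chosen sizes and coefficients, solves the normal equations directly, and cross-checks against numpy's built-in least-squares solver.

```python
import numpy as np

# Minimal sketch (assumed sizes and coefficients): fit
# f(x_1, ..., x_m) = a_1 x_1 + ... + a_m x_m + a_{m+1}.
rng = np.random.default_rng(0)
n, m = 50, 3
P = rng.normal(size=(n, m))                 # data points p_i as rows
v = P @ np.array([2.0, -1.0, 0.5]) + 4.0    # values from known coefficients
v += 0.01 * rng.normal(size=n)              # plus a little noise

D = np.hstack([P, np.ones((n, 1))])         # design matrix D of Eq. 4
a = np.linalg.solve(D.T @ D, D.T @ v)       # a = (D^T D)^{-1} D^T v, Eq. 5

a_lstsq, *_ = np.linalg.lstsq(D, v, rcond=None)
assert np.allclose(a, a_lstsq)              # agrees with numpy's solver
print(a)                                    # approximately [2, -1, 0.5, 4]
```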
Denote each row of $\mathbf{D}$ as $\mathbf{d}_i^\top$. Then,

$$E = \sum_{i=1}^{n} (\mathbf{d}_i^\top \mathbf{a} - v_i)^2 = \|\mathbf{D}\,\mathbf{a} - \mathbf{v}\|^2. \tag{6}$$

So, the linear least squares problem can be described very compactly as

$$\min_{\mathbf{a}} \; \|\mathbf{D}\,\mathbf{a} - \mathbf{v}\|^2. \tag{7}$$

To show that the solution in Eq. 5 minimizes the error $E$, we need to differentiate $E$ with respect to $\mathbf{a}$ and set the derivative to zero:

$$\frac{dE}{d\mathbf{a}} = \mathbf{0}. \tag{8}$$

How to do this differentiation?
The obvious (but hard) way:

$$E = \sum_{i=1}^{n} \Bigg( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Bigg)^{\!2}. \tag{9}$$

Expanding the partial derivatives explicitly gives

$$\frac{\partial E}{\partial a_k} =
\begin{cases}
\displaystyle 2 \sum_{i=1}^{n} \Bigg( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Bigg) x_{ik}, & \text{for } k \ne m + 1, \\[2ex]
\displaystyle 2 \sum_{i=1}^{n} \Bigg( \sum_{j=1}^{m} a_j x_{ij} + a_{m+1} - v_i \Bigg), & \text{for } k = m + 1.
\end{cases}$$

Then, set $\partial E / \partial a_k = 0$ and solve for $a_k$.

This is slow, tedious and error-prone!
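The componentwise result can at least be checked numerically. Below is a small sketch (an addition, not from the slides) that appends a column of ones to D so that both cases of the piecewise formula collapse into one expression, then verifies every partial derivative by central differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 3
D = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])  # x_{i,m+1} = 1
v = rng.normal(size=n)
a = rng.normal(size=m + 1)

E = lambda a: np.sum((D @ a - v) ** 2)

# With the constant column x_{i,m+1} = 1 inside D, the two cases of
# dE/da_k become one formula: dE/da_k = 2 * sum_i (d_i^T a - v_i) * x_ik.
grad = 2 * (D @ a - v) @ D

# Central-difference check of each partial derivative.
h = 1e-6
for k in range(m + 1):
    e = np.zeros(m + 1)
    e[k] = h
    assert abs((E(a + e) - E(a - e)) / (2 * h) - grad[k]) < 1e-4
```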
Matrix Derivatives
There are six common types of matrix derivatives, classified by the type of the function (scalar $y$, vector $\mathbf{y}$, matrix $\mathbf{Y}$) and the type of the variable it is differentiated by (scalar $x$, vector $\mathbf{x}$, matrix $\mathbf{X}$):

  Differentiated by    Scalar y     Vector y     Matrix Y
  Scalar x             ∂y/∂x        ∂y/∂x        ∂Y/∂x
  Vector x             ∂y/∂x        ∂y/∂x
  Matrix X             ∂y/∂X

(The three missing combinations are not covered here; see the probing questions at the end.)
Derivatives by Scalar

Numerator layout notation:

$$\frac{\partial \mathbf{y}}{\partial x} =
\begin{bmatrix} \dfrac{\partial y_1}{\partial x} \\ \vdots \\ \dfrac{\partial y_m}{\partial x} \end{bmatrix},
\qquad
\frac{\partial \mathbf{Y}}{\partial x} =
\begin{bmatrix}
\dfrac{\partial y_{11}}{\partial x} & \cdots & \dfrac{\partial y_{1n}}{\partial x} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_{m1}}{\partial x} & \cdots & \dfrac{\partial y_{mn}}{\partial x}
\end{bmatrix}$$

Denominator layout notation:

$$\frac{\partial \mathbf{y}}{\partial x} =
\begin{bmatrix} \dfrac{\partial y_1}{\partial x} & \cdots & \dfrac{\partial y_m}{\partial x} \end{bmatrix}
\equiv \frac{\partial \mathbf{y}^\top}{\partial x}$$
Derivatives by Vector

Numerator layout notation:

$$\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} & \cdots & \dfrac{\partial y}{\partial x_n} \end{bmatrix},
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}
\equiv \frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top}$$

Denominator layout notation:

$$\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} \\ \vdots \\ \dfrac{\partial y}{\partial x_n} \end{bmatrix},
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}
\equiv \frac{\partial \mathbf{y}^\top}{\partial \mathbf{x}}$$
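A numerator-layout Jacobian is easy to approximate numerically, which makes the layout concrete. The sketch below (an illustration, not from the slides) builds the $m \times n$ Jacobian $[\partial y_i / \partial x_j]$ by central differences and checks it on the linear map $f(\mathbf{x}) = \mathbf{A}\mathbf{x}$, whose Jacobian is $\mathbf{A}$ itself (rule (C9) below).

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout:
    entry (i, j) approximates df_i / dx_j."""
    y = np.asarray(f(x))
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# For f(x) = Ax the Jacobian is A (this is rule (C9) below).
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
assert np.allclose(jacobian(lambda x: A @ x, x), A, atol=1e-6)
```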
Derivative by Matrix

Numerator layout notation:

$$\frac{\partial y}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{m1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y}{\partial x_{1n}} & \cdots & \dfrac{\partial y}{\partial x_{mn}}
\end{bmatrix}
\equiv \frac{\partial y}{\partial \mathbf{X}^\top}$$

Denominator layout notation:

$$\frac{\partial y}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}}
\end{bmatrix}$$

In numerator layout the entries follow the shape of $\mathbf{X}^\top$; in denominator layout they follow the shape of $\mathbf{X}$.
Pictorial Representation

[Figure: pictorial comparison of the numerator layout and denominator layout arrangements.]
Caution

◮ Most books and papers don't state which convention they use.
◮ Reference [2] uses both conventions but clearly differentiates them:

$$\frac{\partial y}{\partial \mathbf{x}^\top} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} & \cdots & \dfrac{\partial y}{\partial x_n} \end{bmatrix},
\qquad
\frac{\partial y}{\partial \mathbf{x}} =
\begin{bmatrix} \dfrac{\partial y}{\partial x_1} \\ \vdots \\ \dfrac{\partial y}{\partial x_n} \end{bmatrix}$$

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix},
\qquad
\frac{\partial \mathbf{y}^\top}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_m}{\partial x_n}
\end{bmatrix}$$

◮ It is best not to mix the two conventions in your equations.
◮ We adopt numerator layout notation.
Commonly Used Derivatives

Here, the scalar $a$, vector $\mathbf{a}$ and matrix $\mathbf{A}$ are not functions of $x$ or $\mathbf{x}$.

(C1) $\dfrac{d\mathbf{a}}{dx} = \mathbf{0}$ (column matrix)

(C2) $\dfrac{da}{d\mathbf{x}} = \mathbf{0}^\top$ (row matrix)

(C3) $\dfrac{da}{d\mathbf{X}} = \mathbf{0}^\top$ (matrix)

(C4) $\dfrac{d\mathbf{a}}{d\mathbf{x}} = \mathbf{0}$ (matrix)

(C5) $\dfrac{d\mathbf{x}}{d\mathbf{x}} = \mathbf{I}$
(C6) $\dfrac{d\,\mathbf{x}^\top\mathbf{a}}{d\mathbf{x}} = \dfrac{d\,\mathbf{a}^\top\mathbf{x}}{d\mathbf{x}} = \mathbf{a}^\top$

(C7) $\dfrac{d\,\mathbf{x}^\top\mathbf{x}}{d\mathbf{x}} = 2\,\mathbf{x}^\top$

(C8) $\dfrac{d(\mathbf{x}^\top\mathbf{a})^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top\mathbf{a}\,\mathbf{a}^\top$

(C9) $\dfrac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{A}$

(C10) $\dfrac{d\,\mathbf{x}^\top\mathbf{A}}{d\mathbf{x}} = \mathbf{A}^\top$

(C11) $\dfrac{d\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top(\mathbf{A} + \mathbf{A}^\top)$
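These identities are easy to spot-check numerically. A minimal sketch, assuming numpy and random test data: (C6), (C7), (C8) and (C11) are scalar-valued, so their numerator-layout derivatives are row vectors, represented here as 1-D arrays; the matrix-valued rules (C9) and (C10) can be checked with the Jacobian sketch shown earlier.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.normal(size=n)
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

def row_grad(f, x, h=1e-6):
    """Numerator-layout (row) gradient of a scalar function f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# (C6) d(a^T x)/dx = a^T
assert np.allclose(row_grad(lambda x: a @ x, x), a, atol=1e-5)
# (C7) d(x^T x)/dx = 2 x^T
assert np.allclose(row_grad(lambda x: x @ x, x), 2 * x, atol=1e-5)
# (C8) d(x^T a)^2/dx = 2 x^T a a^T
assert np.allclose(row_grad(lambda x: (x @ a) ** 2, x), 2 * (x @ a) * a, atol=1e-4)
# (C11) d(x^T A x)/dx = x^T (A + A^T)
assert np.allclose(row_grad(lambda x: x @ A @ x, x), x @ (A + A.T), atol=1e-4)
```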
Derivatives of Scalar by Scalar

(SS1) $\dfrac{\partial(u + v)}{\partial x} = \dfrac{\partial u}{\partial x} + \dfrac{\partial v}{\partial x}$

(SS2) $\dfrac{\partial uv}{\partial x} = u\dfrac{\partial v}{\partial x} + v\dfrac{\partial u}{\partial x}$ (product rule)

(SS3) $\dfrac{\partial g(u)}{\partial x} = \dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial x}$ (chain rule)

(SS4) $\dfrac{\partial f(g(u))}{\partial x} = \dfrac{\partial f(g)}{\partial g}\dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial x}$ (chain rule)
Derivatives of Vector by Scalar

(VS1) $\dfrac{\partial a\mathbf{u}}{\partial x} = a\dfrac{\partial \mathbf{u}}{\partial x}$, where $a$ is not a function of $x$.

(VS2) $\dfrac{\partial \mathbf{A}\mathbf{u}}{\partial x} = \mathbf{A}\dfrac{\partial \mathbf{u}}{\partial x}$, where $\mathbf{A}$ is not a function of $x$.

(VS3) $\dfrac{\partial \mathbf{u}^\top}{\partial x} = \left(\dfrac{\partial \mathbf{u}}{\partial x}\right)^{\!\top}$

(VS4) $\dfrac{\partial(\mathbf{u} + \mathbf{v})}{\partial x} = \dfrac{\partial \mathbf{u}}{\partial x} + \dfrac{\partial \mathbf{v}}{\partial x}$
(VS5) $\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial x} = \dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{u}}{\partial x}$ (chain rule), with consistent matrix layout.

(VS6) $\dfrac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial x} = \dfrac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{u}}{\partial x}$ (chain rule), with consistent matrix layout.
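A quick numerical illustration of (VS5) with consistent layout: for $\mathbf{u}: \mathbb{R} \to \mathbb{R}^3$ and a linear $\mathbf{g}$, the column $\partial \mathbf{g}(\mathbf{u})/\partial x$ equals the Jacobian $\partial \mathbf{g}/\partial \mathbf{u}$ times the column $\partial \mathbf{u}/\partial x$. The particular functions below are arbitrary choices for the sketch.

```python
import numpy as np

# u: R -> R^3 with a known derivative du/dx (a column, here a 1-D array).
u = lambda x: np.array([np.sin(x), np.cos(x), x ** 2])
du_dx = lambda x: np.array([np.cos(x), -np.sin(x), 2 * x])

B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # g(u) = B u, so dg/du = B (2x3 Jacobian)
g = lambda u_: B @ u_

x0, h = 0.7, 1e-6
lhs = (g(u(x0 + h)) - g(u(x0 - h))) / (2 * h)   # d g(u(x))/dx, numerically
rhs = B @ du_dx(x0)                             # (VS5): Jacobian times column
assert np.allclose(lhs, rhs, atol=1e-5)
```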
Derivatives of Matrix by Scalar

(MS1) $\dfrac{\partial a\mathbf{U}}{\partial x} = a\dfrac{\partial \mathbf{U}}{\partial x}$, where $a$ is not a function of $x$.

(MS2) $\dfrac{\partial \mathbf{A}\mathbf{U}\mathbf{B}}{\partial x} = \mathbf{A}\dfrac{\partial \mathbf{U}}{\partial x}\mathbf{B}$, where $\mathbf{A}$ and $\mathbf{B}$ are not functions of $x$.

(MS3) $\dfrac{\partial(\mathbf{U} + \mathbf{V})}{\partial x} = \dfrac{\partial \mathbf{U}}{\partial x} + \dfrac{\partial \mathbf{V}}{\partial x}$

(MS4) $\dfrac{\partial \mathbf{U}\mathbf{V}}{\partial x} = \mathbf{U}\dfrac{\partial \mathbf{V}}{\partial x} + \dfrac{\partial \mathbf{U}}{\partial x}\mathbf{V}$ (product rule)
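The order of the factors in (MS4) matters because matrix products do not commute. A small numerical check, with arbitrarily chosen matrix-valued functions U(x) and V(x) (an addition to the slides):

```python
import numpy as np

# U(x), V(x) are matrix-valued functions of a scalar x, with known
# entrywise derivatives dU/dx and dV/dx.
U = lambda x: np.array([[x, x ** 2], [1.0, np.sin(x)]])
dU = lambda x: np.array([[1.0, 2 * x], [0.0, np.cos(x)]])
V = lambda x: np.array([[np.exp(x), 0.0], [x, 2.0]])
dV = lambda x: np.array([[np.exp(x), 0.0], [1.0, 0.0]])

x0, h = 0.3, 1e-6
lhs = (U(x0 + h) @ V(x0 + h) - U(x0 - h) @ V(x0 - h)) / (2 * h)
rhs = U(x0) @ dV(x0) + dU(x0) @ V(x0)   # (MS4), factors in this order
assert np.allclose(lhs, rhs, atol=1e-5)
```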
Derivatives of Scalar by Vector

(SV1) $\dfrac{\partial au}{\partial \mathbf{x}} = a\dfrac{\partial u}{\partial \mathbf{x}}$, where $a$ is not a function of $\mathbf{x}$.

(SV2) $\dfrac{\partial(u + v)}{\partial \mathbf{x}} = \dfrac{\partial u}{\partial \mathbf{x}} + \dfrac{\partial v}{\partial \mathbf{x}}$

(SV3) $\dfrac{\partial uv}{\partial \mathbf{x}} = u\dfrac{\partial v}{\partial \mathbf{x}} + v\dfrac{\partial u}{\partial \mathbf{x}}$ (product rule)

(SV4) $\dfrac{\partial g(u)}{\partial \mathbf{x}} = \dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial \mathbf{x}}$ (chain rule)

(SV5) $\dfrac{\partial f(g(u))}{\partial \mathbf{x}} = \dfrac{\partial f(g)}{\partial g}\dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial \mathbf{x}}$ (chain rule)
(SV6) $\dfrac{\partial \mathbf{u}^\top\mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (product rule), where $\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout.

(SV7) $\dfrac{\partial \mathbf{u}^\top\mathbf{A}\mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top\mathbf{A}\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top\mathbf{A}^\top\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (product rule), where $\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout, and $\mathbf{A}$ is not a function of $\mathbf{x}$.
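A numerical spot-check of (SV6), with u and v chosen as linear functions of x so their numerator-layout Jacobians are constant matrices (an arbitrary choice for this sketch, not from the slides):

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
N = np.array([[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]])
u = lambda x: M @ x          # du/dx = M  (3x2 numerator-layout Jacobian)
v = lambda x: N @ x          # dv/dx = N

x0 = np.array([0.2, -0.7])
s = lambda x: u(x) @ v(x)    # the scalar u^T v

# Row gradient of s by central differences.
h, grad = 1e-6, np.zeros(2)
for k in range(2):
    e = np.zeros(2)
    e[k] = h
    grad[k] = (s(x0 + e) - s(x0 - e)) / (2 * h)

rhs = u(x0) @ N + v(x0) @ M  # (SV6): u^T dv/dx + v^T du/dx (a row vector)
assert np.allclose(grad, rhs, atol=1e-4)
```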
Derivatives of Scalar by Matrix

(SM1) $\dfrac{\partial au}{\partial \mathbf{X}} = a\dfrac{\partial u}{\partial \mathbf{X}}$, where $a$ is not a function of $\mathbf{X}$.

(SM2) $\dfrac{\partial(u + v)}{\partial \mathbf{X}} = \dfrac{\partial u}{\partial \mathbf{X}} + \dfrac{\partial v}{\partial \mathbf{X}}$

(SM3) $\dfrac{\partial uv}{\partial \mathbf{X}} = u\dfrac{\partial v}{\partial \mathbf{X}} + v\dfrac{\partial u}{\partial \mathbf{X}}$ (product rule)

(SM4) $\dfrac{\partial g(u)}{\partial \mathbf{X}} = \dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial \mathbf{X}}$ (chain rule)

(SM5) $\dfrac{\partial f(g(u))}{\partial \mathbf{X}} = \dfrac{\partial f(g)}{\partial g}\dfrac{\partial g(u)}{\partial u}\dfrac{\partial u}{\partial \mathbf{X}}$ (chain rule)
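The rules above do not list a concrete scalar-by-matrix derivative. A standard example (an assumption of this sketch, not stated in the slides) is $y = \mathbf{a}^\top\mathbf{X}\mathbf{b}$, whose partials are $\partial y/\partial x_{ij} = a_i b_j$, giving $\mathbf{a}\mathbf{b}^\top$ in denominator layout and $\mathbf{b}\mathbf{a}^\top$ (the shape of $\mathbf{X}^\top$) in numerator layout. A numerical check:

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=3)
b = rng.normal(size=2)
X = rng.normal(size=(3, 2))

y = lambda X: a @ X @ b
h = 1e-6
G = np.zeros((3, 2))                    # G[i, j] approximates dy/dx_ij
for i in range(3):
    for j in range(2):
        E_ = np.zeros((3, 2))
        E_[i, j] = h
        G[i, j] = (y(X + E_) - y(X - E_)) / (2 * h)

assert np.allclose(G, np.outer(a, b), atol=1e-5)    # denominator layout a b^T
assert np.allclose(G.T, np.outer(b, a), atol=1e-5)  # numerator layout b a^T
```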
Derivatives of Vector by Vector

(VV1) $\dfrac{\partial a\mathbf{u}}{\partial \mathbf{x}} = a\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}} + \mathbf{u}\dfrac{\partial a}{\partial \mathbf{x}}$ (product rule)

(VV2) $\dfrac{\partial \mathbf{A}\mathbf{u}}{\partial \mathbf{x}} = \mathbf{A}\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$, where $\mathbf{A}$ is not a function of $\mathbf{x}$.

(VV3) $\dfrac{\partial(\mathbf{u} + \mathbf{v})}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{u}}{\partial \mathbf{x}} + \dfrac{\partial \mathbf{v}}{\partial \mathbf{x}}$

(VV4) $\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (chain rule)

(VV5) $\dfrac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (chain rule)
Notes on Denominator Layout

In some cases, the results in denominator layout are the transposes of those in numerator layout. Moreover, the chain rule in denominator layout goes from right to left instead of left to right.

Numerator layout notation vs. denominator layout notation:

(C6) $\dfrac{d\,\mathbf{a}^\top\mathbf{x}}{d\mathbf{x}} = \mathbf{a}^\top$ vs. $\dfrac{d\,\mathbf{a}^\top\mathbf{x}}{d\mathbf{x}} = \mathbf{a}$

(C11) $\dfrac{d\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top(\mathbf{A} + \mathbf{A}^\top)$ vs. $\dfrac{d\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{d\mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$

(VV5) $\dfrac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ vs. $\dfrac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}\dfrac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}\dfrac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}$
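The transpose relationship is easy to see numerically for (C11); treating x as an explicit column matrix makes the row-versus-column distinction visible (a small sketch, an addition to the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
x = rng.normal(size=(4, 1))              # x as an explicit column matrix

numerator = x.T @ (A + A.T)              # numerator layout: 1x4 row vector
denominator = (A + A.T) @ x              # denominator layout: 4x1 column
assert np.allclose(numerator, denominator.T)  # one is the other's transpose
```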
Derivations of Derivatives

(C6) $\dfrac{d\,\mathbf{x}^\top\mathbf{a}}{d\mathbf{x}} = \dfrac{d\,\mathbf{a}^\top\mathbf{x}}{d\mathbf{x}} = \mathbf{a}^\top$

(The not-so-hard way) Let $s = \mathbf{a}^\top\mathbf{x} = a_1 x_1 + \cdots + a_n x_n$. Then, $\dfrac{\partial s}{\partial x_i} = a_i$. So, $\dfrac{ds}{d\mathbf{x}} = \mathbf{a}^\top$.

(The easier way) Let $s = \mathbf{a}^\top\mathbf{x} = \sum_i a_i x_i$. Then, $\dfrac{\partial s}{\partial x_i} = a_i$. So, $\dfrac{ds}{d\mathbf{x}} = \mathbf{a}^\top$.

(C7) $\dfrac{d\,\mathbf{x}^\top\mathbf{x}}{d\mathbf{x}} = 2\,\mathbf{x}^\top$

Let $s = \mathbf{x}^\top\mathbf{x} = \sum_i x_i^2$. Then, $\dfrac{\partial s}{\partial x_i} = 2 x_i$. So, $\dfrac{ds}{d\mathbf{x}} = 2\,\mathbf{x}^\top$.
(C8) $\dfrac{d(\mathbf{x}^\top\mathbf{a})^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top\mathbf{a}\,\mathbf{a}^\top$

Let $s = \mathbf{x}^\top\mathbf{a}$. Then, $\dfrac{\partial s^2}{\partial x_i} = 2s\dfrac{\partial s}{\partial x_i} = 2 s a_i$. So, $\dfrac{ds^2}{d\mathbf{x}} = 2\,\mathbf{x}^\top\mathbf{a}\,\mathbf{a}^\top$.

(C9) $\dfrac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{A}$

(The hard way)

$$\mathbf{A}\mathbf{x} =
\begin{bmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix}
a_{11} x_1 + \cdots + a_{1n} x_n \\
\vdots \\
a_{n1} x_1 + \cdots + a_{nn} x_n
\end{bmatrix}$$

(The easy way) Let $\mathbf{s} = \mathbf{A}\mathbf{x}$. Then, $s_i = \sum_j a_{ij} x_j$, and $\dfrac{\partial s_i}{\partial x_j} = a_{ij}$. So, $\dfrac{d\mathbf{s}}{d\mathbf{x}} = \mathbf{A}$.
(C10) $\dfrac{d\,\mathbf{x}^\top\mathbf{A}}{d\mathbf{x}} = \mathbf{A}^\top$

Let $\mathbf{y}^\top = \mathbf{x}^\top\mathbf{A}$, and let $\mathbf{a}_j$ denote the $j$-th column of $\mathbf{A}$. Then, $y_j = \mathbf{x}^\top\mathbf{a}_j$. Applying (C6) yields $\dfrac{dy_j}{d\mathbf{x}} = \mathbf{a}_j^\top$. So, $\dfrac{d\mathbf{y}}{d\mathbf{x}} = \mathbf{A}^\top$.

(C11) $\dfrac{d\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{d\mathbf{x}} = \mathbf{x}^\top(\mathbf{A} + \mathbf{A}^\top)$

Apply (SV6) to $\dfrac{d\,\mathbf{x}^\top\mathbf{A}\mathbf{x}}{d\mathbf{x}}$ and obtain $\mathbf{x}^\top\dfrac{d\,\mathbf{A}\mathbf{x}}{d\mathbf{x}} + (\mathbf{A}\mathbf{x})^\top\dfrac{d\mathbf{x}}{d\mathbf{x}}$. Next, apply (C9) to the first term and (C5) to the second, and obtain $\mathbf{x}^\top\mathbf{A} + (\mathbf{A}\mathbf{x})^\top$, which is $\mathbf{x}^\top(\mathbf{A} + \mathbf{A}^\top)$.

(Need to prove SV6; see Homework.)
Linear Fitting Revisited

Now, let us show that the solution

$$\mathbf{a} = (\mathbf{D}^\top\mathbf{D})^{-1}\mathbf{D}^\top\mathbf{v}$$

minimizes the error

$$E = \sum_{i=1}^{n} (\mathbf{d}_i^\top\mathbf{a} - v_i)^2 = \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2.$$

Proof:

$$\begin{aligned}
E = \|\mathbf{D}\mathbf{a} - \mathbf{v}\|^2
&= (\mathbf{D}\mathbf{a} - \mathbf{v})^\top(\mathbf{D}\mathbf{a} - \mathbf{v}) \\
&= (\mathbf{a}^\top\mathbf{D}^\top - \mathbf{v}^\top)(\mathbf{D}\mathbf{a} - \mathbf{v}) \\
&= \mathbf{a}^\top\mathbf{D}^\top\mathbf{D}\mathbf{a} - \mathbf{a}^\top\mathbf{D}^\top\mathbf{v} - \mathbf{v}^\top\mathbf{D}\mathbf{a} + \mathbf{v}^\top\mathbf{v}.
\end{aligned}$$
Apply (C11), (C6), (C9) and (C2) to the four terms:

$$\frac{dE}{d\mathbf{a}}
= \mathbf{a}^\top(\mathbf{D}^\top\mathbf{D} + \mathbf{D}^\top\mathbf{D}) - (\mathbf{D}^\top\mathbf{v})^\top - \mathbf{v}^\top\mathbf{D} + \mathbf{0}^\top
= 2\,\mathbf{a}^\top\mathbf{D}^\top\mathbf{D} - 2\,\mathbf{v}^\top\mathbf{D}.$$

Set $dE/d\mathbf{a} = \mathbf{0}^\top$ and obtain

$$2\,\mathbf{a}^\top\mathbf{D}^\top\mathbf{D} - 2\,\mathbf{v}^\top\mathbf{D} = \mathbf{0}^\top,
\qquad
\mathbf{a}^\top\mathbf{D}^\top\mathbf{D} = \mathbf{v}^\top\mathbf{D}.$$

Transpose both sides of the equation and get

$$\mathbf{D}^\top\mathbf{D}\,\mathbf{a} = \mathbf{D}^\top\mathbf{v},
\qquad
\mathbf{a} = (\mathbf{D}^\top\mathbf{D})^{-1}\mathbf{D}^\top\mathbf{v}.$$
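As a final sanity check (an addition to the slides), the sketch below verifies the gradient formula dE/da = 2a⊤D⊤D − 2v⊤D by central differences and confirms that the gradient vanishes at a = (D⊤D)⁻¹D⊤v.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 30, 4
D = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])
v = rng.normal(size=n)

E = lambda a: np.sum((D @ a - v) ** 2)
a0 = rng.normal(size=m + 1)

# dE/da = 2 a^T D^T D - 2 v^T D, checked by central differences.
analytic = 2 * a0 @ D.T @ D - 2 * v @ D
h = 1e-6
for k in range(m + 1):
    e = np.zeros(m + 1)
    e[k] = h
    assert abs((E(a0 + e) - E(a0 - e)) / (2 * h) - analytic[k]) < 1e-4

# Setting dE/da = 0 gives the normal equations D^T D a = D^T v.
a_star = np.linalg.solve(D.T @ D, D.T @ v)
assert np.allclose(2 * a_star @ D.T @ D - 2 * v @ D, 0, atol=1e-8)
```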
Summary

◮ Matrix calculus studies the calculus of matrices.
◮ There are six common types of matrix derivatives.
◮ There are two competing notational conventions: numerator layout notation vs. denominator layout notation.
◮ We adopt numerator layout notation.
◮ Do not mix the two conventions in your equations.
◮ Use matrix differentiation to prove that the pseudo-inverse solution minimizes the sum of squared errors.
Use the right tool and become lightning fast!
Probing Questions

◮ Is there a simple way to double-check that a derivative result makes sense?
◮ Why do we use the sum of squared errors for linear fitting? Can we use other forms of errors?
◮ Six common types of matrix derivatives are discussed here; three other types are left out. Can we work out the other derivatives, e.g., derivatives of vector by matrix or matrix by matrix?
Homework

1. What are the key concepts that you have learned?

2. Prove the product rule SV3 using the scalar product rule SS2.

(SV3) $\dfrac{\partial uv}{\partial \mathbf{x}} = u\dfrac{\partial v}{\partial \mathbf{x}} + v\dfrac{\partial u}{\partial \mathbf{x}}$

3. Prove the product rule SV6 using SV3.

(SV6) $\dfrac{\partial \mathbf{u}^\top\mathbf{v}}{\partial \mathbf{x}} = \mathbf{u}^\top\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$, where $\dfrac{\partial \mathbf{u}}{\partial \mathbf{x}}$ and $\dfrac{\partial \mathbf{v}}{\partial \mathbf{x}}$ are in numerator layout.

4. Q2 of AY2015/16 Final Evaluation.
References

1. J. E. Gentle, Matrix Algebra: Theory, Computations, and Applications in Statistics, Springer, 2007.
2. H. Lütkepohl, Handbook of Matrices, John Wiley & Sons, 1996.
3. K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2012. www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
4. Wikipedia, Matrix Calculus. en.wikipedia.org/wiki/Matrix_calculus