Efficient Automatic Differentiation of Matrix
Functions
Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel
IBM T.J. Watson Research Center
Automatic Differentiation 2012
Fort Collins, CO
July 25, 2012
How can we compute the derivative?
Numerical: Methods that do numeric differentiation by formulas like the finite difference
    f′(x) ≈ (f(x + h) − f(x)) / h
These methods are simple to program, but lose half of all significant digits.
Symbolic: Symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement the derivative. In general symbolic derivatives are as accurate as they come.
Automatic: Something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).
Finite Difference:
    f′(x) ≈ (f(x + h) − f(x)) / h
Central Difference:
    f′(x) ≈ (f(x + h) − f(x − h)) / (2h)
For h < εx, where ε is the machine precision, these approximations become zero. Thus h has to be chosen with care.
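As an illustration (not from the slides), in double precision the forward-difference error for f(x) = e^x bottoms out near h ≈ 1e-8 and the central-difference error near h ≈ 1e-5; a small Python sketch:

    import numpy as np

    f, x, fprime = np.exp, 1.0, np.exp(1.0)
    for h in [1e-2, 1e-5, 1e-8, 1e-11]:
        forward = (f(x + h) - f(x)) / h
        central = (f(x + h) - f(x - h)) / (2 * h)
        print(f"h={h:.0e}  forward err={abs(forward - fprime):.1e}  "
              f"central err={abs(central - fprime):.1e}")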
The error in the finite difference is bounded by
    (h ‖f″‖∞) / 2 + (2 ε ‖f‖∞) / h,
whose minimum,
    2 √( ε ‖f‖∞ ‖f″‖∞ ),
is achieved for
    h = 2 √( ε ‖f‖∞ / ‖f″‖∞ ).
Numerical Differentiation Error
The graph shows the error in approximating the derivative of f(x) = e^x as a function of h.
An imaginary derivative
The error stemming from the subtractive cancellation can be completely removed by using complex numbers.
Let f : R → R. Then
    f(x + ih) ≈ f(x) + ih f′(x) + (ih)²/2! f″(x) + (ih)³/3! f⁽³⁾(x)
              = f(x) − h²/2 f″(x) + ih ( f′(x) − h²/6 f⁽³⁾(x) ).
Thus the derivative can be approximated by
    f′(x) ≈ Im( f(x + ih) ) / h
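A minimal complex-step sketch (ours; it assumes f is implemented with complex-safe operations, e.g. numpy ufuncs):

    import numpy as np

    def complex_step(f, x, h=1e-20):
        # f'(x) ~ Im(f(x + i h)) / h: no subtraction, so h can be made tiny
        return np.imag(f(x + 1j * h)) / h

    print(complex_step(np.exp, 1.0) - np.exp(1.0))   # ~0, accurate to machine precision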
Imaginary Differentiation Error
The graph shows the error in approximating the derivative of f(x) = e^x as a function of h.
By using dual numbers a + bE, where E² = 0, we can further reduce the error. For second-order derivatives we can use hyper-dual numbers (J. Fike, 2012).
Complex number implementations usually come with most programming languages. For dual or hyper-dual numbers we need to download or implement code.
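A minimal dual-number sketch (our own illustration; a full implementation such as Fike's hyper-dual numbers overloads many more operations):

    class Dual:
        # a + b*E with E**2 = 0; the E-part carries the exact derivative
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b
        def _lift(self, o):
            return o if isinstance(o, Dual) else Dual(o)
        def __add__(self, o):
            o = self._lift(o)
            return Dual(self.a + o.a, self.b + o.b)
        __radd__ = __add__
        def __mul__(self, o):
            o = self._lift(o)
            return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
        __rmul__ = __mul__

    def f(x):
        return x * x * x + 2 * x          # f'(x) = 3*x**2 + 2

    print(f(Dual(2.0, 1.0)).b)             # 14.0, exact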
What is Automatic Differentiation?
Algorithmic Differentiation (often referred to as
Automatic Differentiation or just AD) uses the software
representation of a function to obtain an efficient method
for calculating its derivatives. These derivatives can be of
arbitrary order and are analytic in nature (do not have
any truncation error).
B. Bell – author of CppAD
Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator-overloading implementations such as CppAD.
The Speelpenning function is:
    f(y) = ∏_{i=1}^n x_i(y)
We consider x_i = (y − t_i), so that f(y) = ∏_{i=1}^n (y − t_i). If we compute the symbolic derivative we get
    f′(y) = Σ_{i=1}^n ∏_{j≠i} (y − t_j),
which requires n(n − 1) multiplications to compute with a straightforward implementation. Rewriting the expression as
    f′(y) = Σ_{i=1}^n f(y) / (y − t_i)
reduces the computational cost, but is not numerically stable if y ≈ t_i.
Reverse Mode Differentiation
Define f_k = ∏_{i=1}^k (y − t_i) and r_k = ∏_{i=k}^n (y − t_i). Then
    f′(y) = Σ_{i=1}^n ∏_{j≠i} (y − t_j) = Σ_{i=1}^n f_{i−1} r_{i+1},
with f_i = (y − t_i) f_{i−1}, f_1 = y − t_1, and r_i = (y − t_i) r_{i+1}, r_n = y − t_n.
Note that f(y) = f_n = r_1.
The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, ..., f_n the derivative can be computed with an additional n − 1 additions and 2n − 1 multiplications.
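A sketch of this prefix/suffix scheme in Python (our own illustration of the idea):

    import numpy as np

    def speelpenning(y, t):
        # f(y) = prod_i (y - t_i) and f'(y) via prefix/suffix products,
        # avoiding the unstable division f(y)/(y - t_i)
        d = y - t
        n = len(d)
        pre = np.ones(n + 1)               # pre[i] = prod_{j < i} d[j]   (f_i)
        suf = np.ones(n + 1)               # suf[i] = prod_{j >= i} d[j]  (r_{i+1})
        for i in range(n):
            pre[i + 1] = pre[i] * d[i]
        for i in range(n - 1, -1, -1):
            suf[i] = suf[i + 1] * d[i]
        f = pre[n]
        fprime = sum(pre[i] * suf[i + 1] for i in range(n))
        return f, fprime

    print(speelpenning(0.3, np.linspace(0.0, 1.0, 6)))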
Reverse Mode Differentiation
Under quite realistic assumptions the evaluation of a
gradient requires never more than five times the effort of
evaluating the underlying function by itself.
Andreas Griewank (1988)
Reverse mode differentiation (1971) is essentially applying the
chain rule on a function in the reverse order that the function is
computed.
This is the same technique as applied in the backpropagation
algorithm (1969) for neural networks and the forward-backward
algorithm for training HMMs (1970).
Reverse mode differentiation applies to very general classes of
functions.
Reverse mode differentiation was independently invented by
    G. M. Ostrovskii (1971)
    S. Linnainmaa (1976)
    John Larson and Ahmed Sameh (1978)
    Bert Speelpenning (1980)
Not surprisingly, the fame went to B. Speelpenning, who admittedly told the full story at AD 2012.
A. Sameh was on B. Speelpenning's thesis committee.
The Speelpenning function was suggested as a motivation for Speelpenning's thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.
Why worry about matrix differentiation?
The differentiation of functions of a matrix argument is still not well understood:
    There is no simple “calculus” for computing matrix derivatives.
    Matrix derivatives can be complicated to compute even for quite simple expressions.
Goal: Build on known results to define a matrix-derivative calculus.
Quiz
The Tikhonov-regularized maximum (log-)likelihood for learning the covariance Σ given an empirical covariance S is
    f(Σ) = −log det(Σ) − trace(SΣ⁻¹) − ‖Σ⁻¹‖²_F.
What are f′(Σ) and f″(Σ)?
Statisticians can compute it. Can you?
Can a calculus be reverse engineered?
Terminology
Classical calculus:
    scalar-scalar functions (f : R → R)
    scalar-vector functions (f : R^d → R).
Matrix calculus:
    scalar-matrix functions (f : R^{m×n} → R)
    matrix-matrix functions (f : R^{m×n} → R^{k×l}).
Note:
    1. Derivatives of scalar-matrix functions require derivatives of matrix-matrix functions (why?)
    2. Matrix-matrix derivatives should be computed implicitly wherever possible (why?)
Optimizing Scalar-Matrix Functions
Useful quantities for optimizing a scalar-matrix function f(X) : R^{d×d} → R:
    The gradient is a matrix–matrix function: f′(X) : R^{d×d} → R^{d×d}.
    The Hessian is also a matrix–matrix function: f″(X) : R^{d×d} → R^{d²×d²}.
Desiderata: The derivative of a matrix–matrix function should be a matrix, so that we can directly use it to compute the Hessian.
Optimizing Scalar-Matrix Functions (continued)
Taking the scalar–matrix derivative of
    f(G(X))
will require the information in the matrix–matrix derivative
    ∂G/∂X.
Desiderata: The derivative of a matrix-matrix function should be a matrix, so that a convenient chain rule can be established.
The Matrix-Matrix Derivative
We define the matrix–matrix derivative to be:

    ∂F/∂X := ∂vec(F^T) / ∂vec^T(X^T) =
        [ ∂f11/∂x11   ∂f11/∂x12   ···   ∂f11/∂xmn ]
        [ ∂f12/∂x11   ∂f12/∂x12   ···   ∂f12/∂xmn ]
        [     ⋮            ⋮       ⋱        ⋮     ]
        [ ∂fkl/∂x11   ∂fkl/∂x12   ···   ∂fkl/∂xmn ]
Note that the matrix–matrix derivative of a scalar–matrix function is not the same as the scalar–matrix derivative:
    ∂ mat(f)/∂X = vec^T( (∂f/∂X)^T ).
Derivatives of Linear Matrix Functions
1. A matrix–matrix derivative is a matrix outer product:
    F(X) = trace(AX)B,      ∂F/∂X = vec(B^T) vec^T(A).
2. A matrix–matrix derivative is a Kronecker product:
    F(X) = AXB,             ∂F/∂X = A ⊗ B^T.
3. A matrix–matrix derivative cannot be expressed with standard operations. We call the new operation the box product:
    F(X) = AX^T B,          ∂F/∂X = A ⊠ B^T.
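These formulas can be spot-checked numerically. The sketch below (helper names are ours, not from the talk) compares the claimed Kronecker form of the AXB derivative against forward differences, using the convention ∂F/∂X = ∂vec(F^T)/∂vec^T(X^T):

    import numpy as np

    def vec_r(M):
        # row-major vectorization, i.e. vec(M^T)
        return M.flatten()

    def num_mat_deriv(F, X, h=1e-6):
        # numerical d vec(F(X)^T) / d vec^T(X^T) by forward differences
        fX = F(X)
        J = np.zeros((fX.size, X.size))
        for idx in range(X.size):
            dX = np.zeros(X.size)
            dX[idx] = h
            J[:, idx] = (vec_r(F(X + dX.reshape(X.shape))) - vec_r(fX)) / h
        return J

    A = np.random.randn(4, 2)              # F(X) = A X B with X 2x3, B 3x5
    B = np.random.randn(3, 5)
    X = np.random.randn(2, 3)
    J_exact = np.kron(A, B.T)              # claimed derivative A ⊗ B^T
    J_num = num_mat_deriv(lambda Z: A @ Z @ B, X)
    print(np.max(np.abs(J_exact - J_num)))  # ~1e-6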
Direct Matrix Products
A Direct Matrix Product, X = A ⊛ B, is a matrix whose elements are
    x_{(i1 i2)(i3 i4)} = a_{i_{σ(1)} i_{σ(2)}} b_{i_{σ(3)} i_{σ(4)}},
where (i1 i2) is shorthand for (i1 − 1)n2 + i2 with i2 ∈ {1, 2, ..., n2}, and σ is a permutation of Z4.
The direct matrix products are central to matrix–matrix differentiation. Eight of them can be expressed using Kronecker products:
    A ⊗ B,  A^T ⊗ B,  A ⊗ B^T,  A^T ⊗ B^T,  B ⊗ A,  B^T ⊗ A,  B ⊗ A^T,  B^T ⊗ A^T.
Another eight can be expressed using matrix outer products, and the final eight using the box product.
Definition
Let A ∈ R^{m1×n1} and B ∈ R^{m2×n2}.
Definition (Kronecker Product)
    A ⊗ B ∈ R^{(m1 m2)×(n1 n2)} is defined by (A ⊗ B)_{(i−1)m2+j, (k−1)n2+l} = a_{ik} b_{jl} = (A ⊗ B)_{(ij)(kl)}.
Definition (Box Product)
    A ⊠ B ∈ R^{(m1 m2)×(n1 n2)} is defined by (A ⊠ B)_{(i−1)m2+j, (k−1)n1+l} = a_{il} b_{jk} = (A ⊠ B)_{(ij)(kl)}.
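To make the indexing concrete, here is a small sketch (ours) that builds A ⊠ B entry by entry from this definition and checks the vector-multiplication identities that appear on the Properties slide below, assuming the standard column-major vec:

    import numpy as np

    def box(A, B):
        # (A box B)[i*m2 + j, k*n1 + l] = A[i, l] * B[j, k]  (0-based indices)
        m1, n1 = A.shape
        m2, n2 = B.shape
        P = np.zeros((m1 * m2, n1 * n2))
        for i in range(m1):
            for j in range(m2):
                for k in range(n2):
                    for l in range(n1):
                        P[i * m2 + j, k * n1 + l] = A[i, l] * B[j, k]
        return P

    vec = lambda M: M.flatten(order='F')   # column-major vec

    A = np.random.randn(3, 4)
    B = np.random.randn(5, 2)
    X1 = np.random.randn(4, 5)             # for A X1 B
    X2 = np.random.randn(5, 4)             # for A X2^T B
    assert np.allclose(np.kron(B.T, A) @ vec(X1), vec(A @ X1 @ B))
    assert np.allclose(box(B.T, A) @ vec(X2), vec(A @ X2.T @ B))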
An example Kronecker and box product
To see the structure consider the Kronecker and box products of two 2 × 2 matrices, A and B:

    A ⊗ B = [ a11 b11   a11 b12   a12 b11   a12 b12 ]
            [ a11 b21   a11 b22   a12 b21   a12 b22 ]
            [ a21 b11   a21 b12   a22 b11   a22 b12 ]
            [ a21 b21   a21 b22   a22 b21   a22 b22 ]

    A ⊠ B = [ a11 b11   a12 b11   a11 b12   a12 b12 ]
            [ a11 b21   a12 b21   a11 b22   a12 b22 ]
            [ a21 b11   a22 b11   a21 b12   a22 b12 ]
            [ a21 b21   a22 b21   a21 b22   a22 b22 ]
Example: take the 10 × 10 matrices

    A = [ 1    1    1   ···   1  ]        B = [ 1   2   3   ···   10 ]
        [ 2    2    2   ···   2  ]            [ 1   2   3   ···   10 ]
        [ 3    3    3   ···   3  ]            [ 1   2   3   ···   10 ]
        [ ⋮    ⋮    ⋮          ⋮  ]            [ ⋮   ⋮   ⋮          ⋮ ]
        [ 10   10   10  ···   10 ]            [ 1   2   3   ···   10 ]
[Figures: A ⊗ B, A ⊠ B, and vec(A) vec^T(B) for the A and B above.]
The box product behaves similarly to the Kronecker product:
1. Vector Multiplication:
    (B^T ⊗ A) vec(X) = vec(AXB)          (B^T ⊠ A) vec(X) = vec(AX^T B)
2. Matrix Multiplication:
    (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)         (A ⊠ B)(C ⊠ D) = (AD) ⊗ (BC)
3. Inverse and Transpose:
    (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}        (A ⊠ B)^{−1} = B^{−1} ⊠ A^{−1}
    (A ⊗ B)^T = A^T ⊗ B^T                 (A ⊠ B)^T = B^T ⊠ A^T
4. Mixed Products:
    (A ⊗ B)(C ⊠ D) = (AC) ⊠ (BD)         (A ⊠ B)(C ⊗ D) = (AD) ⊠ (BC)
Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}. Then
Trace:
    trace(A ⊗ B) = trace(A) trace(B)      trace(A ⊠ B) = trace(AB)
Determinant (here m1 = n1 and m2 = n2 is required):
    det(A ⊗ B) = (det(A))^{m2} (det(B))^{m1}
    det(A ⊠ B) = (−1)^{(m1 choose 2)(m2 choose 2)} (det(A))^{m2} (det(B))^{m1}
Associativity:
    (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)             (A ⊠ B) ⊠ C = A ⊠ (B ⊠ C),
but not for mixed products. In general we have
    (A ⊗ B) ⊠ C ≠ A ⊗ (B ⊠ C)             (A ⊠ B) ⊗ C ≠ A ⊠ (B ⊗ C).
Identity Box Products
The identity box product I_m ⊠ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.
Orthonormal: (I_m ⊠ I_n)^T (I_m ⊠ I_n) = I_{mn}
Transposition: (I_{m1} ⊠ I_{n1}) vec(A) = vec(A^T)
Connector: Converting a Kronecker product to a box product:
    (A ⊗ B)(I_{n1} ⊠ I_{n2}) = A ⊠ B.
Converting a box product to a Kronecker product:
    (A ⊠ B)(I_{n2} ⊠ I_{n1}) = A ⊗ B.
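A small sketch (our own) that builds I_m ⊠ I_n directly from the definition and checks the transposition and orthonormality properties above:

    import numpy as np

    m, n = 3, 4
    T = np.zeros((m * n, m * n))           # I_m box I_n
    for i in range(m):
        for j in range(n):
            T[i * n + j, j * m + i] = 1.0  # entry ((ij),(kl)) = delta_il * delta_jk

    A = np.random.randn(m, n)
    vec = lambda M: M.flatten(order='F')
    assert np.allclose(T @ vec(A), vec(A.T))       # transposition
    assert np.allclose(T.T @ T, np.eye(m * n))     # orthonormal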
Box products are old things in a new wrapping
Although the notation for box products is new, I_m ⊠ I_n has long been known as T_{m,n} by physicists, or as the stride permutation L^{mn}_m by others. These objects are identical, but the box product allows us to express more complex identities more compactly, e.g. let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}; then
    (I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2}) vec((A ⊗ B)^T) = vec(vec(A) vec^T(B)).
The initial permutation matrix would have to be written
    I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2} = (I_{m1} ⊗ T_{n1 m2, m1}) T_{m1, n1 m1 m2},
if the notation of box products wasn't being used.
Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors.
Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by F_n = [e^{−2πikl/n}]_{0≤k,l<n} = [ω_n^{kl}]_{0≤k,l<n}, and therefore

    F_2 = [ 1   1 ]        F_4 = [ 1    1    1    1 ]
          [ 1  −1 ],              [ 1   −i   −1    i ]
                                  [ 1   −1    1   −1 ]
                                  [ 1    i   −1   −i ]
We can factor the matrix F_4 as follows:

    F_4 = [ 1   ·   1   · ] [ 1   ·   ·   · ] [ 1   ·   1   · ]
          [ ·   1   ·   1 ] [ ·   1   ·   · ] [ 1   ·  −1   · ]
          [ 1   ·  −1   · ] [ ·   ·   1   · ] [ ·   1   ·   1 ]
          [ ·   1   ·  −1 ] [ ·   ·   ·  −i ] [ ·   1   ·  −1 ]

or more compactly

    F_4 = (F_2 ⊗ I_2) diag( vec( [ 1   1 ] ) ) (I_2 ⊠ F_2)
                                 [ 1  −i ]
In general if we define the matrix

    V_{N,M}(α) = [ 1      1          1           ···   1               ]
                 [ 1      α          α²          ···   α^{M−1}         ]
                 [ 1      α²         α⁴          ···   α^{2(M−1)}      ]
                 [ ⋮      ⋮          ⋮           ⋱     ⋮               ]
                 [ 1      α^{N−1}    α^{2(N−1)}  ···   α^{(N−1)(M−1)}  ],

then for N = km we have the following factorizations of the DFT matrix:

    F_N = (F_k ⊗ I_m) diag(vec(V_{m,k}(ω_N))) (I_k ⊠ F_m)
    F_N = (F_m ⊠ I_k) diag(vec(V_{m,k}(ω_N))) (F_k ⊗ I_m)
This allows FFT_N(x) = y = F_N x to be computed as
    F_N x = vec( (V_{m,k}(ω_N) ∘ (F_m X^T)) F_k^T ),
where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k²m²), whereas the formula above does the job in O(km(k + m)) operations.
Repeated use of the identity for N = 2^n leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.)
The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n on a given platform!
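A numerical sketch of this factorized step (ours; it assumes x = vec(X) with X ∈ C^{k×m} and column-major vec), compared against numpy's FFT:

    import numpy as np

    def dft(n):
        idx = np.arange(n)
        return np.exp(-2j * np.pi * np.outer(idx, idx) / n)      # F_n

    k, m = 3, 4
    N = k * m
    x = np.random.randn(N) + 1j * np.random.randn(N)

    X = x.reshape((k, m), order='F')                             # x = vec(X)
    V = np.exp(-2j * np.pi / N) ** np.outer(np.arange(m), np.arange(k))  # V_{m,k}(w_N)
    Y = (V * (dft(m) @ X.T)) @ dft(k).T                          # (V o (F_m X^T)) F_k^T
    assert np.allclose(Y.flatten(order='F'), np.fft.fft(x))      # equals F_N x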
Here we show the role of the Kronecker and box products in the matrix–matrix differentiation framework.
Basic Differentiation Identities
Let X ∈ R^{m×n}.
Identity:      F(X) = X,         F′(X) = I_{mn}
Transpose:     F(X) = X^T,       F′(X) = I_m ⊠ I_n
Chain Rule:    F(X) = G(H(X)),   F′(X) = G′(H(X)) H′(X)
Product Rule:  F(X) = G(X)H(X),  F′(X) = (I ⊗ H^T(X)) G′(X) + (G(X) ⊗ I) H′(X)
More Derivative Identities
Assume X ∈ R^{m×m} is a square matrix.
Square:        F(X) = X²,        F′(X) = I_m ⊗ X^T + X ⊗ I_m.
Inverse:       F(X) = X^{−1},    F′(X) = −X^{−1} ⊗ X^{−T}.
+Transpose:    F(X) = X^{−T},    F′(X) = −X^{−T} ⊠ X^{−1}.
Square Root:   F(X) = X^{1/2},   F′(X) = ( I ⊗ (X^{1/2})^T + X^{1/2} ⊗ I )^{−1}.
Integer Power: F(X) = X^k,       F′(X) = Σ_{i=0}^{k−1} X^i ⊗ (X^{k−1−i})^T.
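For instance, the Inverse identity can be spot-checked numerically under the same vec(F^T)/vec^T(X^T) convention (an illustrative sketch, not part of the talk):

    import numpy as np

    d = 4
    X = np.random.randn(d, d) + d * np.eye(d)      # keep X well conditioned
    Xi = np.linalg.inv(X)
    J_exact = -np.kron(Xi, Xi.T)                   # claimed F'(X) for F(X) = X^{-1}

    h = 1e-6
    J_num = np.zeros((d * d, d * d))
    for idx in range(d * d):                       # forward differences, row-major vec
        E = np.zeros(d * d)
        E[idx] = h
        J_num[:, idx] = (np.linalg.inv(X + E.reshape(d, d)) - Xi).flatten() / h
    print(np.max(np.abs(J_exact - J_num)))         # small (~1e-5)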
First some simple identities:
    f(X) = trace(AX)        f′(X) = A^T
    f(X) = trace(AX^T)      f′(X) = A
    f(X) = log det(X)       f′(X) = X^{−T}
Then the chain rule for more general expressions:
    vec^T( (∂/∂X log det(G(X)))^T ) = vec^T( (G(X))^{−1} ) ∂G/∂X
    vec^T( (∂/∂X trace(G(X)))^T )   = vec^T(I) ∂G/∂X
The chain rule to get scalar-matrix derivatives is awkward to use. Instead we have some short-cuts:
    ∂/∂X trace(G(X)H(X)) = ∂/∂X trace( H(Y)G(X) + H(X)G(Y) ) |_{Y=X},
    ∂/∂X trace(A F^{−1}(X)) = − ∂/∂X trace( F^{−1}(Y) A F^{−1}(Y) F(X) ) |_{Y=X},
and
    ∂ log det(F(X)) / ∂X = ∂/∂X trace( F^{−1}(Y) F(X) ) |_{Y=X}.
Use the formulas on the previous page to compute
    ∂/∂X trace( (X − A)(X − B)^{−1}(X − C) ).
The answer is:
    (X − C)^T (X − B)^{−T} + (X − B)^{−T} (X − A)^T
    − (X − B)^{−T} (X − A)^T (X − C)^T (X − B)^{−T}.
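A quick forward-difference spot check of this answer (our own sketch):

    import numpy as np

    d = 4
    rng = np.random.default_rng(0)
    A, B, C = rng.standard_normal((3, d, d))
    X = rng.standard_normal((d, d)) + 3 * np.eye(d)    # keep X - B well conditioned

    f = lambda Z: np.trace((Z - A) @ np.linalg.inv(Z - B) @ (Z - C))
    Bi = np.linalg.inv(X - B)
    G = ((X - C).T @ Bi.T + Bi.T @ (X - A).T
         - Bi.T @ (X - A).T @ (X - C).T @ Bi.T)        # gradient claimed above

    h = 1e-6
    G_num = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            E = np.zeros((d, d))
            E[i, j] = h
            G_num[i, j] = (f(X + E) - f(X)) / h
    print(np.max(np.abs(G - G_num)))                   # small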
Finally, if
    r(x) = q(x)/p(x) = ( Σ_{i=1}^n a_i x^i ) / ( Σ_{j=1}^n b_j x^j )
is a scalar-scalar function, we can form a matrix-matrix function by simply substituting a matrix X for the scalar x. Then
    ∂/∂X trace(r(X)) = ( r′(X) )^T
and if h(x) = log(r(x)) then
    ∂/∂X log det(r(X)) = ( h′(X) )^T.
Compute f(X) = trace(F(X)) = trace( ((I + X)^{−1} X^T) X )
[Expression tree for F(X) with leaves X and I; the nodes are evaluated bottom-up:]

    T1 = I + X
    T2 = T1^{−1}
    T3 = X^T
    T4 = T2 T3
    T5 = T4 X
    f  = trace(T5)
Forward mode computation of f′(X):
Every node of the expression tree carries a matrix–matrix derivative T_i′; each leaf X contributes I ⊗ I and the constant leaf I contributes 0.

    (G + H)′ = G′ + H′                                    T1′ = 0 + I ⊗ I
    Chain rule: (T1^{−1})′ = −(T1^{−1} ⊗ T1^{−T}) T1′      T2′ = −(T2 ⊗ T2^T) T1′
    (X^T)′ = I ⊠ I                                        T3′ = I ⊠ I
    Product rule: (GH)′ = (I ⊗ H^T) G′ + (G ⊗ I) H′        T4′ = (I ⊗ T3^T) T2′ + (T2 ⊗ I) T3′
                                                           T5′ = (I ⊗ X^T) T4′ + (T4 ⊗ I)(I ⊗ I)
    ((trace(F))′)^T = vec^{−1}( vec^T(I) F′ )              f′ = vec^{−1}( vec^T(I) T5′ )
Forward mode differentiation:
    Requires huge intermediate matrices.
    Only the last step reduces the size.
Critical points:
    F′(X) is composed of Kronecker, box (and outer) products for a large class of functions. (wow!)
    These can be “unwound” by multiplication with vectorized scalar-matrix derivatives (gasp!):
        vec^T(C)(A ⊗ B) = vec^T(B^T C A)
Reverse mode differentiation:
    Evaluate the derivative from top to bottom.
    Small scalar-matrix derivatives are propagated down the tree.
Reverse Mode Differentiation: (f′)^T = vec^{−1}( vec^T(I) F′ )
Instead of forming the huge T_i′, small scalar–matrix quantities R_i are propagated from the root of the tree down to the leaves, using identities such as vec^T(C)(A ⊗ B) = vec^T(B^T C A):

    R0 = I                                               (root, trace(F(X)))
    vec^T(R0)(I ⊗ X^T)    = vec^T(X R0 I)        →       R1 = X R0
    vec^T(R0)(T4 ⊗ I)     = vec^T(I R0 T4)       →       R2 = R0 T4
    vec^T(R1)(I ⊗ T3^T)   = vec^T(T3 R1 I)       →       R3 = T3 R1
    vec^T(R1)(T2 ⊗ I)     = vec^T(I R1 T2)       →       R4 = R1 T2
    vec^T(R3)(−T2 ⊗ T2^T) = vec^T(−T2 R3 T2)     →       R5 = −T2 R3 T2
    vec^T(R4)(I ⊠ I)      = vec^T(I R4^T I)      →       R6 = R4^T
    vec^T(R5) · 0         = vec^T(0)             →       R7 = 0    (constant leaf I)
    vec^T(R5)(I ⊗ I)      = vec^T(R5)            →       R8 = R5   (leaf X)

Collecting the contributions at the leaves X:
    f′(X) = R2^T + R6^T + R8^T
There is growing interest in Machine Learning in scalar–matrix
objective functions.
Probabilistic Graphical Models
Covariance Selection
Optimization in Graphs and Networks.
Data-mining in social networks
Can we use this theory to help optimize such functions? We are on
the look-out for interesting problems.
The Anatomy of a Matrix-Matrix Function
Theorem
Let R(X) be a rational matrix–matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)^{−1}) and transposition ((·)^T). Then the derivative of the matrix–matrix function is of the form
    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i ⊠ B_i.
The matrices A_i and B_i are computed as parts of R(X).
Hessian Forms
The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore, the Hessian is of the form
    Σ_{i=1}^{k1} A_i ⊗ B_i + Σ_{i=k1+1}^{K} A_i ⊠ B_i,
where K is the number of times X occurs in the expression for the derivative.
If A_i, B_i are d × d matrices then multiplication by H can be done with O(Kd³) operations.
Generalizations of this result can be found in the paper.
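A sketch (ours) of the resulting matrix-free Hessian–vector product; it assumes the row-major vec convention of the matrix–matrix derivative, under which (A ⊗ B) vec(V) = vec(A V B^T) and (A ⊠ B) vec(V) = vec(A V^T B^T):

    import numpy as np

    def hess_vec(kron_terms, box_terms, V):
        # multiply H = sum_i A_i (x) B_i + sum_j A_j box B_j onto vec(V) in O(K d^3)
        out = np.zeros_like(V)
        for A, B in kron_terms:
            out += A @ V @ B.T             # (A (x) B) vec(V) = vec(A V B^T)
        for A, B in box_terms:
            out += A @ V.T @ B.T           # (A box B) vec(V) = vec(A V^T B^T)
        return out

    d = 5
    A, B, V = (np.random.randn(d, d) for _ in range(3))
    # sanity check of the Kronecker part against the explicit d^2 x d^2 matrix
    assert np.allclose((A @ V @ B.T).flatten(), np.kron(A, B) @ V.flatten())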
Let f : R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).
Derivative: G = f′(X) ∈ R^{d×d}, and Hessian: H = f″(X) ∈ R^{d²×d²}.
The Newton direction vec(V) = H^{−1} vec(G) can be computed efficiently if K ≪ d:

    K      Algorithm                              Complexity       Storage
    1      (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}         O(d³)            O(Kd²)
    2      Bartels-Stewart algorithm              O(d³)            O(Kd²)
    ≥ 3    Conjugate Gradient Algorithm           O(Kd⁵)           O(Kd²)
    ≥ d    General matrix inversion               O(d⁶ + Kd⁴)      O(d⁴)

The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + X^T B = C.
What about optimizing functions of the form
    f(X) + ‖vec(X)‖₁ ?
Common approaches: Newton-Lasso, the orthantwise ℓ1 method, coordinate descent, FISTA, CG, and L-BFGS; most, but not all, of these can exploit the Hessian structure.
It is not obvious how to take advantage of the Hessian structure for all these methods. For orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.
1. We have applied this methodology to the covariance selection problem – a paper is forthcoming.
2. We are digging deeper into the theory of matrix differentiation and the properties of the box products.
3. We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!