
Applied Mathematical Sciences, Vol. 8, 2014, no. 172, 8601 - 8609
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2014.410805
On Hessians of Composite Functions
Čestmír Bárta, Martin Kolář and Jan Šustek
Department of Mathematics, Faculty of Science
The University of Ostrava, 701 03 Ostrava
Czech Republic
Copyright © 2014 Čestmír Bárta, Martin Kolář and Jan Šustek. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
In this paper we study the Hessian matrices of composite functions F of the form F(x) = f(g(x)) with f : R → R and g : R^n → R. We present explicit formulas for the Hessian of F and for the inverse of the Hessian matrix.
Mathematics Subject Classification: 26B05
Keywords: Hessian, composite function
1 Introduction
For a function f : R^n → R the Hessian matrix H_f(x) is defined as the n × n matrix of second-order partial derivatives,
\[
H_f(x) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dots & \dfrac{\partial^2 f}{\partial x_1 \,\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \,\partial x_1} & \dots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.
\]
The determinant of this matrix is called the Hessian. The Hessian matrix plays an important role in many areas of mathematics. It is important in neural computing; for instance, several non-linear optimization algorithms used for training neural networks are based on the second-order properties of the error surface, which are controlled by the Hessian matrix [1].
The inverse of the Hessian matrix has been used to identify the least significant
weights in a network as a part of network ‘pruning’ algorithms, or it can also
be used to assign error bars to the predictions made by a trained network [1].
Hessian matrices are also applied in Morse theory [4].
Throughout this paper we study the Hessian matrices of composite functions F of the form F(x) = f(g(x)) with f : R → R and g : R^n → R. We present explicit formulas for the Hessian of F and for the inverse of the Hessian matrix. In previous years several papers were published with such formulas, but only for special functions F, see e.g. [2] or [6]. These results follow easily from our general results.
We will use the following notation for working with vectors and matrices:
\[
\operatorname{row} a(i) = \begin{pmatrix} a(1) & \dots & a(n) \end{pmatrix}, \qquad
\operatorname{col} a(i) = \begin{pmatrix} a(1) \\ \vdots \\ a(n) \end{pmatrix},
\]
\[
\operatorname{mat} a(i,j) = \begin{pmatrix} a(1,1) & \dots & a(1,n) \\ \vdots & \ddots & \vdots \\ a(n,1) & \dots & a(n,n) \end{pmatrix}, \qquad
\operatorname{diag} a(i) = \begin{pmatrix} a(1) & & 0 \\ & \ddots & \\ 0 & & a(n) \end{pmatrix}.
\]
Using this notation we have
\[
\nabla g(x) = \operatorname{col} \frac{\partial g}{\partial x_i}
\]
and
\[
I = \operatorname{diag} 1 = \operatorname{mat} \delta_{ij}\,,
\]
where \delta_{ij} is the Kronecker symbol, i.e. \delta_{ii} = 1 and \delta_{ij} = 0 for i ≠ j.
2 Results
Our first result concerns the determinant of the Hessian matrix.
Theorem 1 Let f : R → R and g : R^n → R be real functions and let F(x) = f(g(x)) be their composition. Then the Hessian of the function F equals
\[
\det H_F(x) = \left( 1 + \frac{f''(g(x))}{f'(g(x))}\, \nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x) \right) f'(g(x))^n \det H_g(x)\,.
\]
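The statement can be sanity-checked numerically. Below is a minimal sketch for n = 2 with illustrative choices f(t) = e^t and g(x, y) = x²y + y² (these choices are ours, not from the paper); the determinant predicted by Theorem 1 is compared with the determinant of a finite-difference Hessian of F.

```python
import math

# Illustrative choices (not from the paper): f(t) = exp(t), so f' = f'' = exp,
# and g(x, y) = x^2*y + y^2, so F = f(g).
f = math.exp
g = lambda x, y: x*x*y + y*y
F = lambda x, y: f(g(x, y))

x, y = 0.3, 0.7
grad_g = [2*x*y, x*x + 2*y]                  # ∇g, computed by hand
Hg = [[2*y, 2*x], [2*x, 2.0]]                # Hessian of g, computed by hand
det_Hg = Hg[0][0]*Hg[1][1] - Hg[0][1]*Hg[1][0]

# Hg^{-1} and the quadratic form ∇g^T · Hg^{-1} · ∇g
Hg_inv = [[ Hg[1][1]/det_Hg, -Hg[0][1]/det_Hg],
          [-Hg[1][0]/det_Hg,  Hg[0][0]/det_Hg]]
w = [Hg_inv[0][0]*grad_g[0] + Hg_inv[0][1]*grad_g[1],
     Hg_inv[1][0]*grad_g[0] + Hg_inv[1][1]*grad_g[1]]
quad = grad_g[0]*w[0] + grad_g[1]*w[1]

fp = math.exp(g(x, y))                        # f'(g(x)) = f''(g(x)), so f''/f' = 1
det_HF_formula = (1 + quad) * fp**2 * det_Hg  # Theorem 1 with n = 2

# Hessian of F by central finite differences
h = 1e-4
Fxx = (F(x+h, y) - 2*F(x, y) + F(x-h, y)) / h**2
Fyy = (F(x, y+h) - 2*F(x, y) + F(x, y-h)) / h**2
Fxy = (F(x+h, y+h) - F(x+h, y-h) - F(x-h, y+h) + F(x-h, y-h)) / (4*h**2)
det_HF_numeric = Fxx*Fyy - Fxy*Fxy

print(det_HF_formula, det_HF_numeric)
```

The two values agree up to the finite-difference truncation error.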
The following result concerns the inverse of the Hessian matrix.
Theorem 2 Let f : R → R and g : R^n → R be real functions and let F(x) = f(g(x)) be their composition. Then the inverse of the Hessian matrix of the function F equals
\[
H_F(x)^{-1} = \left( I - \frac{H_g(x)^{-1} \cdot \nabla g(x) \cdot \nabla g(x)^T}{\dfrac{f'(g(x))}{f''(g(x))} + \nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x)} \right) \cdot \frac{H_g(x)^{-1}}{f'(g(x))}\,.
\]
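This formula can likewise be checked numerically: the matrix it produces, multiplied by the Hessian of F, should give the identity. A minimal sketch for n = 2 with the same illustrative choices as before (f(t) = e^t, g(x, y) = x²y + y², ours and not from the paper):

```python
import math

# Illustrative choices (not from the paper): f(t) = exp(t), g(x, y) = x^2*y + y^2.
x, y = 0.3, 0.7
g = x*x*y + y*y
fp = math.exp(g)                     # f'(g) = f''(g)
grad = [2*x*y, x*x + 2*y]            # ∇g
Hg = [[2*y, 2*x], [2*x, 2.0]]        # Hessian of g

det_Hg = Hg[0][0]*Hg[1][1] - Hg[0][1]*Hg[1][0]
Hg_inv = [[ Hg[1][1]/det_Hg, -Hg[0][1]/det_Hg],
          [-Hg[1][0]/det_Hg,  Hg[0][0]/det_Hg]]

# Hessian of F from the second-derivative formula: H_F = f'' ∇g ∇g^T + f' H_g
HF = [[fp*grad[i]*grad[j] + fp*Hg[i][j] for j in range(2)] for i in range(2)]

# Theorem 2: H_F^{-1} = (I - (Hg^{-1} ∇g ∇g^T)/(f'/f'' + ∇g^T Hg^{-1} ∇g)) · Hg^{-1}/f'
w = [sum(Hg_inv[i][k]*grad[k] for k in range(2)) for i in range(2)]   # Hg^{-1} ∇g
denom = fp/fp + sum(grad[i]*w[i] for i in range(2))                   # f'/f'' + quad form
M = [[(1.0 if i == j else 0.0) - w[i]*grad[j]/denom for j in range(2)] for i in range(2)]
HF_inv = [[sum(M[i][k]*Hg_inv[k][j] for k in range(2))/fp for j in range(2)] for i in range(2)]

# The product should be the identity matrix
P = [[sum(HF_inv[i][k]*HF[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(P)
```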
Theorem 3 is a consequence of the previous theorems for a particular class of functions g.

Theorem 3 Let \alpha_i \in R for i = 1, \dots, n and let f : R → R be a real function. Put g(x) = \prod_{k=1}^n x_k^{\alpha_k} and F(x) = f(g(x)). Then the Hessian of F equals
\[
\det H_F(x) = \left( 1 + \frac{g(x) f''(g(x))}{f'(g(x))}\, \frac{|\alpha|}{|\alpha|-1} \right) f'(g(x))^n\, (-1)^{n+1} (|\alpha|-1) \prod_{k=1}^n \alpha_k x_k^{n\alpha_k - 2}\,,
\]
where |\alpha| = \sum_{k=1}^n \alpha_k. The inverse of the Hessian matrix is
\[
H_F(x)^{-1} = \frac{1}{g(x) f'(g(x))}\, \operatorname{mat}\left( \left( \frac{1}{|\alpha|-1} - \frac{\delta_{ij}}{\alpha_i} - \frac{1}{\dfrac{f'(g(x))}{g(x) f''(g(x))}\,(|\alpha|-1)^2 + |\alpha|(|\alpha|-1)} \right) x_i x_j \right).
\]
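The determinant formula of Theorem 3 can be sanity-checked numerically. A minimal sketch for n = 2 with illustrative exponents \alpha = (2, 3) and f(t) = e^t (our choices, not from the paper), compared against a finite-difference Hessian:

```python
import math

# Illustrative choices (not from the paper): n = 2, alpha = (2, 3), f(t) = exp(t),
# so f' = f'' = exp and g(x, y) = x^2 * y^3.
a1, a2 = 2, 3
x, y = 0.5, 0.8
g = x**a1 * y**a2
F = lambda u, v: math.exp(u**a1 * v**a2)

absa = a1 + a2                               # |alpha|
n = 2
# Theorem 3 (with g f''/f' = g here):
# (1 + g |alpha|/(|alpha|-1)) f'(g)^n (-1)^(n+1) (|alpha|-1) prod alpha_k x_k^(n alpha_k - 2)
det_formula = (1 + g*absa/(absa - 1)) * math.exp(g)**n * (-1)**(n + 1) \
              * (absa - 1) * (a1 * x**(n*a1 - 2)) * (a2 * y**(n*a2 - 2))

# Determinant of the finite-difference Hessian of F
h = 1e-4
Fxx = (F(x+h, y) - 2*F(x, y) + F(x-h, y)) / h**2
Fyy = (F(x, y+h) - 2*F(x, y) + F(x, y-h)) / h**2
Fxy = (F(x+h, y+h) - F(x+h, y-h) - F(x-h, y+h) + F(x-h, y-h)) / (4*h**2)
det_numeric = Fxx*Fyy - Fxy**2

print(det_formula, det_numeric)
```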
Example 1 Chen [2] considered functions of the form F(x) = f\left( \sum_{k=1}^n h_k(x_k) \right) and computed their Hessian inductively. We obtain the same result directly using Theorem 1. We have g(x) = \sum_{k=1}^n h_k(x_k). Hence
\[
\nabla g(x) = \operatorname{col} h_i'(x_i)\,, \qquad
H_g(x) = \operatorname{diag} h_i''(x_i)\,, \qquad
\det H_g(x) = \prod_{k=1}^n h_k''(x_k)\,, \qquad
H_g(x)^{-1} = \operatorname{diag} \frac{1}{h_i''(x_i)}\,.
\]
Then Theorem 1 implies that
\begin{align*}
\det H_F(x) &= \left( 1 + \frac{f''(g(x))}{f'(g(x))}\, \operatorname{row} h_i'(x_i) \cdot \operatorname{diag} \frac{1}{h_i''(x_i)} \cdot \operatorname{col} h_i'(x_i) \right) f'(g(x))^n \prod_{k=1}^n h_k''(x_k) \\
&= \left( 1 + \frac{f''(g(x))}{f'(g(x))} \sum_{k=1}^n \frac{h_k'(x_k)^2}{h_k''(x_k)} \right) f'(g(x))^n \prod_{k=1}^n h_k''(x_k)\,.
\end{align*}
Moreover, using Theorem 2, we obtain the inverse of the Hessian matrix:
\begin{align*}
H_F(x)^{-1} &= \left( I - \frac{\operatorname{diag} \frac{1}{h_i''(x_i)} \cdot \operatorname{col} h_i'(x_i) \cdot \operatorname{row} h_i'(x_i)}{\dfrac{f'(g(x))}{f''(g(x))} + \operatorname{row} h_i'(x_i) \cdot \operatorname{diag} \frac{1}{h_i''(x_i)} \cdot \operatorname{col} h_i'(x_i)} \right) \cdot \frac{\operatorname{diag} \frac{1}{h_i''(x_i)}}{f'(g(x))} \\
&= \operatorname{mat}\left( \delta_{ij} - \frac{h_i'(x_i) h_j'(x_j)}{h_i''(x_i) \left( \dfrac{f'(g(x))}{f''(g(x))} + \sum_{k=1}^n \dfrac{h_k'(x_k)^2}{h_k''(x_k)} \right)} \right) \cdot \operatorname{diag} \frac{1}{f'(g(x)) h_i''(x_i)} \\
&= \operatorname{mat}\left( \left( \delta_{ij} - \frac{h_i'(x_i) h_j'(x_j)}{h_i''(x_i) \left( \dfrac{f'(g(x))}{f''(g(x))} + \sum_{k=1}^n \dfrac{h_k'(x_k)^2}{h_k''(x_k)} \right)} \right) \frac{1}{f'(g(x)) h_j''(x_j)} \right).
\end{align*}
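Example 1's determinant formula can also be checked numerically. A minimal sketch for n = 2 with illustrative choices f(t) = e^t, h_1(t) = t², h_2(t) = t³ (ours, not from the paper):

```python
import math

# Illustrative choices (not from the paper): f(t) = exp(t), h1(t) = t^2, h2(t) = t^3,
# so g(x, y) = x^2 + y^3 and F = exp(g).
x, y = 0.4, 0.7
g = x**2 + y**3
F = lambda u, v: math.exp(u**2 + v**3)

h1p, h1pp = 2*x, 2.0          # h1', h1''
h2p, h2pp = 3*y**2, 6*y       # h2', h2''

# Example 1's formula with f' = f'' = exp (so f''/f' = 1):
# (1 + sum h_k'^2/h_k'') f'(g)^n prod h_k''
det_formula = (1 + h1p**2/h1pp + h2p**2/h2pp) * math.exp(g)**2 * h1pp * h2pp

# Determinant of the finite-difference Hessian of F
h = 1e-4
Fxx = (F(x+h, y) - 2*F(x, y) + F(x-h, y)) / h**2
Fyy = (F(x, y+h) - 2*F(x, y) + F(x, y-h)) / h**2
Fxy = (F(x+h, y+h) - F(x+h, y-h) - F(x-h, y+h) + F(x-h, y-h)) / (4*h**2)
det_numeric = Fxx*Fyy - Fxy**2

print(det_formula, det_numeric)
```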
Example 2 Trojovský and Hladíková [6] considered the function F(x) = \exp\left( \prod_{k=1}^n x_k \right) and computed its Hessian as a sum of 2^n simpler determinants. We obtain the same result directly using Theorem 3. We have f(t) = e^t and \alpha_i = 1 for every i. Hence f'(t) = f''(t) = e^t. Then Theorem 3 implies that
\begin{align*}
\det H_F(x) &= \left( 1 + g(x) \frac{n}{n-1} \right) e^{n g(x)}\, (-1)^{n+1} (n-1)\, g(x)^{n-2} \\
&= (-1)^{n+1} \left( n - 1 + n \prod_{k=1}^n x_k \right) e^{\,n \prod_{k=1}^n x_k} \prod_{k=1}^n x_k^{n-2}\,.
\end{align*}
Moreover, we obtain the inverse of the Hessian matrix:
\begin{align*}
H_F(x)^{-1} &= \frac{1}{g(x) e^{g(x)}}\, \operatorname{mat}\left( \left( \frac{1}{n-1} - \delta_{ij} - \frac{1}{\frac{(n-1)^2}{g(x)} + n(n-1)} \right) x_i x_j \right) \\
&= \frac{1}{\prod_{k=1}^n x_k\; e^{\prod_{k=1}^n x_k}}\, \operatorname{mat}\left( \left( \frac{1}{n-1} - \delta_{ij} - \frac{\prod_{k=1}^n x_k}{(n-1)^2 + n(n-1) \prod_{k=1}^n x_k} \right) x_i x_j \right).
\end{align*}
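For n = 2 the determinant in Example 2 can be confirmed by a direct hand computation, since the second derivatives of exp(xy) are elementary. A short check (the sample point is ours):

```python
import math

# Example 2 for n = 2: F(x1, x2) = exp(x1*x2).
# The formula gives det H_F = (-1)^(n+1) (n - 1 + n*prod) e^(n*prod) prod^(n-2).
x, y = 0.6, 1.3
n = 2
p = x*y
det_formula = (-1)**(n + 1) * (n - 1 + n*p) * math.exp(n*p) * p**(n - 2)

# Direct computation for F = exp(x*y):
# Fxx = y^2 e^{xy}, Fyy = x^2 e^{xy}, Fxy = (1 + xy) e^{xy}
E = math.exp(p)
det_direct = (y*y*E)*(x*x*E) - ((1 + p)*E)**2
print(det_formula, det_direct)
```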
Example 3 Consider the function F(x) = \sqrt[n]{\prod_{k=1}^n x_k}, i.e. the geometric mean. Using the notation of Theorem 3, we have f(t) = \sqrt[n]{t} and \alpha_i = 1 for every i. Hence f'(t) = \frac{1}{n} t^{\frac{1}{n}-1} and f''(t) = \frac{1}{n}\left( \frac{1}{n} - 1 \right) t^{\frac{1}{n}-2}. Then Theorem 3 implies that
\[
\det H_F(x) = \left( 1 + \frac{g(x)\, \frac{1}{n}\left( \frac{1}{n}-1 \right) g(x)^{\frac{1}{n}-2}}{\frac{1}{n}\, g(x)^{\frac{1}{n}-1}}\, \frac{n}{n-1} \right) \frac{1}{n^n}\, g(x)^{1-n}\, (-1)^{n+1} (n-1)\, g(x)^{n-2} = 0\,.
\]
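The vanishing of the Hessian of the geometric mean can be seen directly for n = 2, where the second derivatives of sqrt(xy) are elementary (the sample point is ours):

```python
import math

# Example 3 for n = 2: F(x, y) = sqrt(x*y), whose Hessian determinant should be 0.
# Fxx = -y^2/(4 (xy)^{3/2}), Fyy = -x^2/(4 (xy)^{3/2}), Fxy = 1/(4 sqrt(xy)).
x, y = 1.7, 0.9
s = math.sqrt(x*y)
Fxx = -y*y / (4*s**3)
Fyy = -x*x / (4*s**3)
Fxy = 1 / (4*s)
det_HF = Fxx*Fyy - Fxy**2    # equals 1/(16xy) - 1/(16xy) = 0 up to rounding
print(det_HF)
```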
3 Proofs
In our proofs we will need the following two lemmas. We also present their short proofs.
Lemma 1 (Matrix determinant lemma, [3], Lemma 1.1) Let A be an invertible n × n matrix and let u, v ∈ R^n be column vectors. Then
\[
\det(A + uv^T) = (1 + v^T A^{-1} u) \det A\,.
\]
Proof. By a direct computation we easily obtain the identity
\[
\begin{pmatrix} A & 0 \\ v^T & 1 \end{pmatrix}
\begin{pmatrix} I + A^{-1} u v^T & A^{-1} u \\ 0^T & 1 \end{pmatrix}
\begin{pmatrix} I & 0 \\ -v^T & 1 \end{pmatrix}
=
\begin{pmatrix} A & u \\ 0^T & 1 + v^T A^{-1} u \end{pmatrix}.
\]
Taking determinants of both sides we immediately obtain
\[
\det(A + uv^T) = \det A \cdot \det(I + A^{-1} u v^T) = \det A \cdot (1 + v^T A^{-1} u)\,.
\]
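The lemma is easy to verify on a small example; a minimal 2 × 2 check with arbitrary sample values (ours, purely illustrative):

```python
# Numerical check of the matrix determinant lemma, det(A + u v^T) = (1 + v^T A^{-1} u) det A,
# for a 2x2 example with arbitrary sample values.
A = [[3.0, 1.0], [2.0, 4.0]]
u = [1.0, -2.0]
v = [0.5, 1.5]

det_A = A[0][0]*A[1][1] - A[0][1]*A[1][0]
A_inv = [[ A[1][1]/det_A, -A[0][1]/det_A],
         [-A[1][0]/det_A,  A[0][0]/det_A]]

# Left side: det(A + u v^T)
B = [[A[i][j] + u[i]*v[j] for j in range(2)] for i in range(2)]
lhs = B[0][0]*B[1][1] - B[0][1]*B[1][0]

# Right side: (1 + v^T A^{-1} u) det A
w = [sum(A_inv[i][k]*u[k] for k in range(2)) for i in range(2)]
rhs = (1 + sum(v[i]*w[i] for i in range(2))) * det_A
print(lhs, rhs)
```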
Lemma 2 (Sherman–Morrison formula, [5], Section 2.7.1) Let A be an invertible n × n matrix and let u, v ∈ R^n be column vectors. Suppose that the matrix A + uv^T is invertible. Then
\[
(A + uv^T)^{-1} = \left( I - \frac{A^{-1} u v^T}{1 + v^T A^{-1} u} \right) A^{-1}\,.
\]
Proof. From the assumption and Lemma 1 it follows that 1 + v^T A^{-1} u ≠ 0. Then
\begin{align*}
I &= I + A^{-1} u v^T - \frac{A^{-1} u v^T + (v^T A^{-1} u) A^{-1} u v^T}{1 + v^T A^{-1} u} \\
&= I + A^{-1} u v^T - \frac{A^{-1} u v^T + A^{-1} u (v^T A^{-1} u) v^T}{1 + v^T A^{-1} u} \\
&= A^{-1}(A + uv^T) - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}\,(A + uv^T) \\
&= \left( A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} \right)(A + uv^T)\,.
\end{align*}
We obtain the result by multiplying by (A + uvT )−1 .
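The Sherman–Morrison formula can be verified on the same kind of small example: the matrix the formula produces, multiplied by A + uv^T, should be the identity. A 2 × 2 sketch with arbitrary sample values (ours, purely illustrative):

```python
# Numerical check of the Sherman-Morrison formula for a 2x2 example.
A = [[3.0, 1.0], [2.0, 4.0]]
u = [1.0, -2.0]
v = [0.5, 1.5]

det_A = A[0][0]*A[1][1] - A[0][1]*A[1][0]
A_inv = [[ A[1][1]/det_A, -A[0][1]/det_A],
         [-A[1][0]/det_A,  A[0][0]/det_A]]

w = [sum(A_inv[i][k]*u[k] for k in range(2)) for i in range(2)]   # A^{-1} u
denom = 1 + sum(v[i]*w[i] for i in range(2))                      # 1 + v^T A^{-1} u

# (A + u v^T)^{-1} = (I - (A^{-1} u v^T)/(1 + v^T A^{-1} u)) A^{-1}
M = [[(1.0 if i == j else 0.0) - w[i]*v[j]/denom for j in range(2)] for i in range(2)]
inv_formula = [[sum(M[i][k]*A_inv[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

# Multiply by A + u v^T; the product should be the identity
B = [[A[i][j] + u[i]*v[j] for j in range(2)] for i in range(2)]
P = [[sum(inv_formula[i][k]*B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(P)
```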
Using these lemmas we now prove our theorems.
Proof of Theorem 1. The second-order partial derivatives of F are
\[
\frac{\partial^2 F}{\partial x_i \,\partial x_j} = f''(g(x))\, \frac{\partial g}{\partial x_i}(x)\, \frac{\partial g}{\partial x_j}(x) + f'(g(x))\, \frac{\partial^2 g}{\partial x_i \,\partial x_j}(x)\,. \tag{1}
\]
Hence the Hessian matrix can be written as H_F(x) = A + uv^T, where
\[
A = f'(g(x)) H_g(x)\,, \qquad u = f''(g(x)) \nabla g(x)\,, \qquad v = \nabla g(x)\,. \tag{2}
\]
From Lemma 1 we obtain
\begin{align*}
\det H_F(x) &= \left( 1 + \nabla g(x)^T \cdot \bigl( f'(g(x)) H_g(x) \bigr)^{-1} \cdot f''(g(x)) \nabla g(x) \right) \det\bigl( f'(g(x)) H_g(x) \bigr) \\
&= \left( 1 + \frac{f''(g(x))}{f'(g(x))}\, \nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x) \right) f'(g(x))^n \det H_g(x)\,.
\end{align*}
Proof of Theorem 2. As in the proof of Theorem 1 we obtain (1) and (2). Then Lemma 2 implies
\begin{align*}
H_F(x)^{-1} &= \left( I - \frac{\bigl( f'(g(x)) H_g(x) \bigr)^{-1} \cdot f''(g(x)) \nabla g(x) \cdot \nabla g(x)^T}{1 + \nabla g(x)^T \cdot \bigl( f'(g(x)) H_g(x) \bigr)^{-1} \cdot f''(g(x)) \nabla g(x)} \right) \cdot \bigl( f'(g(x)) H_g(x) \bigr)^{-1} \\
&= \left( I - \frac{H_g(x)^{-1} \cdot \nabla g(x) \cdot \nabla g(x)^T}{\dfrac{f'(g(x))}{f''(g(x))} + \nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x)} \right) \cdot \frac{H_g(x)^{-1}}{f'(g(x))}\,.
\end{align*}
Proof of Theorem 3. The derivatives of the function g are
\[
\frac{\partial g}{\partial x_i} = \frac{\alpha_i}{x_i}\, g(x)\,, \qquad
\frac{\partial^2 g}{\partial x_i^2} = \frac{\alpha_i(\alpha_i - 1)}{x_i^2}\, g(x)\,, \qquad
\frac{\partial^2 g}{\partial x_i \,\partial x_j} = \frac{\alpha_i \alpha_j}{x_i x_j}\, g(x)\,.
\]
Therefore its gradient is equal to
\[
\nabla g(x) = g(x) \operatorname{col} \frac{\alpha_i}{x_i}
\]
and its Hessian matrix is
\[
H_g(x) = \left( \operatorname{mat} \frac{\alpha_i \alpha_j}{x_i x_j} - \operatorname{diag} \frac{\alpha_i}{x_i^2} \right) g(x) = -g(x)\,(A + uv^T)\,,
\]
where
\[
A = \operatorname{diag} \frac{\alpha_i}{x_i^2}\,, \qquad u = -\operatorname{col} \frac{\alpha_i}{x_i}\,, \qquad v = \operatorname{col} \frac{\alpha_i}{x_i}\,.
\]
Then Lemma 1 implies that
\begin{align*}
\det H_g(x) &= (-g(x))^n \left( 1 - \operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i} \right) \prod_{k=1}^n \frac{\alpha_k}{x_k^2} \\
&= (-g(x))^n (1 - |\alpha|) \prod_{k=1}^n \frac{\alpha_k}{x_k^2} \\
&= (-1)^n (1 - |\alpha|) \prod_{k=1}^n \alpha_k x_k^{n\alpha_k - 2}\,. \tag{3}
\end{align*}
Lemma 2 implies that
\begin{align*}
H_g(x)^{-1} &= \frac{-1}{g(x)} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i} \cdot \operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i}}{1 - \operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i}} \right) \\
&= \frac{-1}{g(x)} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right). \tag{4}
\end{align*}
We will use the following identity:
\[
\operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i}
+ \operatorname{row} \frac{\alpha_i}{x_i} \cdot \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \cdot \operatorname{col} \frac{\alpha_i}{x_i}
= |\alpha| + \frac{|\alpha|^2}{1 - |\alpha|} = \frac{|\alpha|}{1 - |\alpha|}\,. \tag{5}
\]
Then (3), (4), (5) and Theorem 1 imply that
\begin{align*}
\det H_F(x) &= \left( 1 + \frac{f''(g(x))}{f'(g(x))}\, g(x) \operatorname{row} \frac{\alpha_i}{x_i} \cdot \frac{-1}{g(x)} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right) \cdot g(x) \operatorname{col} \frac{\alpha_i}{x_i} \right) \\
&\qquad \times f'(g(x))^n (-1)^n (1 - |\alpha|) \prod_{k=1}^n \alpha_k x_k^{n\alpha_k - 2} \\
&= \left( 1 - \frac{g(x) f''(g(x))}{f'(g(x))} \left( \operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i} + \operatorname{row} \frac{\alpha_i}{x_i} \cdot \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \cdot \operatorname{col} \frac{\alpha_i}{x_i} \right) \right) \\
&\qquad \times f'(g(x))^n (-1)^{n+1} (|\alpha| - 1) \prod_{k=1}^n \alpha_k x_k^{n\alpha_k - 2} \\
&= \left( 1 + \frac{g(x) f''(g(x))}{f'(g(x))}\, \frac{|\alpha|}{|\alpha| - 1} \right) f'(g(x))^n (-1)^{n+1} (|\alpha| - 1) \prod_{k=1}^n \alpha_k x_k^{n\alpha_k - 2}\,.
\end{align*}
By a direct computation, using (5), we obtain
\begin{align*}
\nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x)
&= g(x) \operatorname{row} \frac{\alpha_i}{x_i} \cdot \frac{-1}{g(x)} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right) \cdot g(x) \operatorname{col} \frac{\alpha_i}{x_i} \\
&= -g(x) \left( \operatorname{row} \frac{\alpha_i}{x_i} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i} + \operatorname{row} \frac{\alpha_i}{x_i} \cdot \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \cdot \operatorname{col} \frac{\alpha_i}{x_i} \right) \\
&= g(x)\, \frac{|\alpha|}{|\alpha| - 1} \tag{6}
\end{align*}
and
\begin{align*}
H_g(x)^{-1} \cdot \nabla g(x) \cdot \nabla g(x)^T
&= \frac{-1}{g(x)} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right) \cdot g(x) \operatorname{col} \frac{\alpha_i}{x_i} \cdot g(x) \operatorname{row} \frac{\alpha_i}{x_i} \\
&= -g(x) \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} \cdot \operatorname{col} \frac{\alpha_i}{x_i} \cdot \operatorname{row} \frac{\alpha_i}{x_i} + \frac{1}{1 - |\alpha|} \operatorname{mat} x_i x_j \cdot \operatorname{col} \frac{\alpha_i}{x_i} \cdot \operatorname{row} \frac{\alpha_i}{x_i} \right) \\
&= \frac{g(x)}{|\alpha| - 1} \operatorname{mat} \frac{\alpha_j x_i}{x_j}\,. \tag{7}
\end{align*}
Then (6), (7) and Theorem 2 imply
\begin{align*}
H_F(x)^{-1} &= \left( I - \frac{H_g(x)^{-1} \cdot \nabla g(x) \cdot \nabla g(x)^T}{\dfrac{f'(g(x))}{f''(g(x))} + \nabla g(x)^T \cdot H_g(x)^{-1} \cdot \nabla g(x)} \right) \cdot \frac{H_g(x)^{-1}}{f'(g(x))} \\
&= \left( I - \frac{\frac{g(x)}{|\alpha|-1} \operatorname{mat} \frac{\alpha_j x_i}{x_j}}{\dfrac{f'(g(x))}{f''(g(x))} + g(x) \dfrac{|\alpha|}{|\alpha|-1}} \right) \cdot \frac{-1}{g(x) f'(g(x))} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right) \\
&= \left( \beta \operatorname{mat} \frac{\alpha_j x_i}{x_j} - I \right) \cdot \frac{1}{g(x) f'(g(x))} \left( \operatorname{diag} \frac{x_i^2}{\alpha_i} + \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} \right), \tag{8}
\end{align*}
where
\[
\beta = \frac{\frac{g(x)}{|\alpha|-1}}{\dfrac{f'(g(x))}{f''(g(x))} + g(x) \dfrac{|\alpha|}{|\alpha|-1}}\,.
\]
Now we have
\begin{align*}
\beta \operatorname{mat} \frac{\alpha_j x_i}{x_j} \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} &= \beta \operatorname{mat} x_i x_j\,, \tag{9} \\
\beta \operatorname{mat} \frac{\alpha_j x_i}{x_j} \cdot \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} &= \frac{\beta |\alpha|}{1 - |\alpha|} \operatorname{mat} x_i x_j\,, \tag{10} \\
-I \cdot \operatorname{diag} \frac{x_i^2}{\alpha_i} &= -\operatorname{mat} \frac{\delta_{ij}}{\alpha_i}\, x_i x_j\,, \tag{11} \\
-I \cdot \frac{\operatorname{mat} x_i x_j}{1 - |\alpha|} &= \frac{1}{|\alpha| - 1} \operatorname{mat} x_i x_j\,. \tag{12}
\end{align*}
From the identity \beta + \frac{\beta |\alpha|}{1 - |\alpha|} = \frac{\beta}{1 - |\alpha|} we obtain that
\[
(9) + (10) = \frac{\beta}{1 - |\alpha|} \operatorname{mat} x_i x_j = -\frac{1}{\dfrac{f'(g(x))}{g(x) f''(g(x))}\,(|\alpha| - 1)^2 + |\alpha|(|\alpha| - 1)} \operatorname{mat} x_i x_j\,. \tag{13}
\]
Finally, (8), (11), (12) and (13) imply that
\[
H_F(x)^{-1} = \frac{1}{g(x) f'(g(x))}\, \operatorname{mat}\left( \left( \frac{1}{|\alpha|-1} - \frac{\delta_{ij}}{\alpha_i} - \frac{1}{\dfrac{f'(g(x))}{g(x) f''(g(x))}\,(|\alpha|-1)^2 + |\alpha|(|\alpha|-1)} \right) x_i x_j \right).
\]
Acknowledgement
The authors were supported by grant GAČR P201/12/2351 of the Czech Science Foundation and by grant SGS08/PřF/2014 of the University of Ostrava.
References
[1] Ch.M. Bishop. Neural networks for pattern recognition. Oxford University
Press, 1995.
[2] B.-Y. Chen. An Explicit Formula of Hessian Determinants of Composite
Functions and Its Applications. Kragujevac Journal of Mathematics, 36 (1),
2012, 27–39.
[3] J. Ding, A. Zhou. Applied Mathematics Letters, 20 (12), 2007, 1223–1226.
[4] J. Milnor. Morse theory. Princeton University Press, 1969.
[5] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery. Numerical
Recipes: The Art of Scientific Computing (3rd ed.), Cambridge University
Press, New York, 2007.
[6] P. Trojovský, E. Hladíková. On the Hessian of the Exponential Function with n Variables. International Journal of Pure and Applied Mathematics, 66 (3), 2011, 287–296.
Received: October 19, 2014; Published: December 1, 2014