Statistical Proofs of Some Matrix Theorems
International Statistical Review (2006), 74, 2, 169–185. Printed in Wales by Cambrian Printers.
© International Statistical Institute
C. Radhakrishna Rao
Statistics Department, Joab Thomas Building, Pennsylvania State University, University Park, PA 16802, USA
Summary
Books on linear models and multivariate analysis generally include a chapter on matrix algebra, quite rightly so, as matrix results are used in the discussion of statistical methods in these areas. During recent years a number of papers have appeared where statistical results, derived without the use of matrix theorems, are used to prove some matrix results, which in turn are used to generate other statistical results. This may have some pedagogical value. It is not, however, suggested that prior knowledge of matrix theory is unnecessary for studying statistics. The intention is to show that a judicious use of statistical and matrix results might help in providing elegant proofs of problems both in statistics and in matrix algebra, and make the study of both subjects more interesting. Some basic notions of vector spaces and matrices are, however, necessary, and these are outlined in the introduction to this paper.
Key words: Cauchy–Schwarz inequality; Fisher information; Kronecker product; Milne’s inequality; Parallel
sum of matrices; Schur product.
1 Introduction
Matrix algebra is extensively used in the study of linear models and multivariate analysis in
statistics [see for instance Rao (1973), Rao & Toutenburg (1999) and Rao & Rao (1998)]. It is
generally stated that a knowledge of matrix theory is a prerequisite for the study of statistics.
Recently a number of papers have appeared where statistical results are used to prove some matrix
theorems. [Refs: Dey, Hande & Tiku (1994), Kagan & Landsman (1997), Rao (2000), Kagan &
Smith (2001), Kagan & Rao (2003, 2005), Kagan, Landsman & Rao (2006)]. The object of this
paper is to review some of these results and to give some new results. It may be noted that the statistical results used to prove matrix theorems are derivable without using matrix theory. We consider only matrices and vectors with real elements.
1.1 Some Basic Notions of Vector Spaces and Matrices
A symmetric $p \times p$ matrix $A$ is said to be nnd (nonnegative definite) if $a'Aa \ge 0$ for any $p$-vector $a$, which is indicated by $A \ge 0$. $A$ is said to be pd (positive definite) if $a'Aa > 0$ for all non-null $a$, which is indicated by $A > 0$. We write $A \ge B$ if $A - B$ is nnd and $A > B$ if $A - B$ is pd. The notation $C(p \times k)$ is used to denote a matrix $C$ with $p$ rows and $k$ columns. The transpose of $C$, obtained by interchanging columns and rows, is denoted by $C'(k \times p)$. The rank of any matrix $X$, denoted by $\rho(X)$, is the number of independent columns (equivalently, of independent rows) in $X$.
The paper is based on a keynote presentation at the 14th International Workshop on Matrices and Statistics meeting in
Auckland, New Zealand, March 29–April 1, 2005.
We denote by $\mathcal{S}(X)$ the vector space generated by the column vectors of $X(p \times k)$,
$\mathcal{S}(X) = \{y : y = Xa,\ a \in R^k\},$
where $R^k$ is the vector space of all $k$-vectors.
Given $X(p \times q)$ with $\rho(X) = r$, there exists a matrix $Y(p \times s)$, $s \ge p - r$, such that, with respect to a given pd matrix $V$,
$X'VY = 0(q \times s),$
where $0(q \times s)$ is a matrix with all elements zero. The vector spaces $\mathcal{S}(X)$ and $\mathcal{S}(Y)$ are said to be mutually $V$-orthogonal. [If $V = I$, the identity matrix, $\mathcal{S}(X)$ and $\mathcal{S}(Y)$ are simply said to be orthogonal.] The space $R^p$ of all $p$-vectors is generated by
$\mathcal{S}(X) + \mathcal{S}(Y) = \{Xa + Yb : a \in R^q,\ b \in R^s\}.$
If $z \in R^p$, then
$z = Xa + Yb.$
Multiplying by $X'V$, we have
$X'Vz = X'VXa \Rightarrow a = (X'VX)^- X'Vz \Rightarrow Xa = X(X'VX)^- X'Vz,$
so that the part of $z$ lying in $\mathcal{S}(X)$ can be expressed as $P_X z$, where
$P_X = X(X'VX)^- X'V$   (1.1)
is called the projection operator on the space $\mathcal{S}(X)$. [The explicit expression (1.1) for a projection operator in terms of a generalized inverse defined in (1.3) was first reported in Rao (1967). See also Rao (1973, p. 47).]
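As a quick numerical illustration of (1.1) (not part of the original paper), the following sketch assumes numpy is available and uses `pinv` as one convenient choice of g-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))          # X(5 x 2) of full column rank
M = rng.standard_normal((5, 5))
V = M @ M.T + 5 * np.eye(5)              # an arbitrary pd weight matrix V

# Projection operator (1.1): P_X = X (X'VX)^- X'V
P = X @ np.linalg.pinv(X.T @ V @ X) @ X.T @ V

assert np.allclose(P @ P, P)             # P is idempotent
assert np.allclose(P @ X, X)             # P leaves S(X) fixed
# the residual z - Pz is V-orthogonal to S(X): X'V(I - P) = 0
assert np.allclose(X.T @ V @ (np.eye(5) - P), 0)
```

Any other g-inverse of $X'VX$ would give the same projector on $\mathcal{S}(X)$.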
Note that we can choose the orthogonal complement of $X$ as $Y$, where
$Y = I - X(X'VX)^- X'V.$   (1.2)
Given a matrix $X(m \times n)$, there exists a matrix $X^-(n \times m)$, called the g-inverse (generalized inverse) of $X$, such that
$XX^-X = X.$   (1.3)
$X^-$ need not be unique unless $X$ is a square matrix of full rank. If $C$ is any matrix such that $\mathcal{S}(C') \subset \mathcal{S}(X')$, then $CX^-X = C$, so that $X^-X$ behaves like an identity matrix in such situations. [See Rao (1973), Rao & Mitra (1971, Section 16.5), and Rao & Rao (1998, Chapter 8).]
2 Some Basic Results
Let $x = (x_1, \ldots, x_p)'$ be a $p$-vector rv (random variable) and $y = (y_1, \ldots, y_q)'$ be a $q$-vector rv, all with zero mean values. We denote the covariance matrix of $x$ and $y$ by
$C(x, y) = E(xy') = \begin{pmatrix} E(x_1y_1) & \cdots & E(x_1y_q) \\ \vdots & & \vdots \\ E(x_py_1) & \cdots & E(x_py_q) \end{pmatrix},$
which is a $p \times q$ matrix. The following results hold.
THEOREM 2.1. If $A(p \times p)$ is the covariance matrix of a $p$-vector rv $x$, then:
(a) $A$ is nnd.   (2.1)
(b) If $\rho(A) = s < p$, then there exists a matrix $D(p \times (p-s))$ such that $D'x = 0$ a.s.   (2.2)
Conversely, if $A(p \times p)$ is an nnd matrix, then there exists a rv $x$ such that
$C(x, x) = A.$   (2.3)
To prove (2.1), consider the scalar rv $a'x$, $a \in R^p$. Its variance is
$V(a'x) = E(a'xx'a) = a'Aa \ge 0 \Rightarrow A$ is nnd.
The result (2.2) follows by choosing $D$ as the orthogonal complement of $A$. Then
$C(D'x, D'x) = D'AD = 0 \Rightarrow D'x = 0$ a.s.
Note: The result (2.3) follows if we assume the existence of an nnd matrix $A^{1/2}$ such that $A^{1/2}A^{1/2} = A$ (or a matrix $B$ such that $A = BB'$). Then we need only choose an rv $z$ with $I$ as its covariance matrix and define $x = A^{1/2}z$ (or $x = Bz$). However, we prove the result without using the matrix result on the existence of $A^{1/2}$ (or the factorization $A = BB'$). Later we show that the factorization $A = BB'$ is a consequence of a statistical result.
We assume that there exists a $(p-1)$-vector rv with any given $(p-1) \times (p-1)$ nnd matrix as its covariance matrix and show that there exists a $p$-vector rv associated with any given $p \times p$ nnd matrix. Since an rv exists when $p = 1$, there exists one when $p = 2$ and so on.
Let us consider the nnd matrix $A(p \times p)$ in the partitioned form
$A = \begin{pmatrix} A_{11} & \alpha \\ \alpha' & \alpha_0 \end{pmatrix},$
where $A_{11}$ is $(p-1) \times (p-1)$, $\alpha$ is a $(p-1)$-vector and $\alpha_0$ is a scalar. Then, since $A$ is nnd, the quadratic form
$a'A_{11}a + 2c\,a'\alpha + c^2\alpha_0 \ge 0$ for all $a$ and $c$.   (2.4)
If $\alpha_0 = 0$, then $a'\alpha = 0$ for all $a$, which implies that $\alpha = 0$. If $x_1$ is an rv associated with the nnd matrix $A_{11}$, then $(x_1', 0)'$ is an rv associated with $A$.
If $\alpha_0 > 0$, then, choosing $c = -a'\alpha/\alpha_0$, (2.4) reduces to
$a'(A_{11} - \alpha_0^{-1}\alpha\alpha')a \ge 0$ for all $a$,
i.e., $A_{11} - \alpha_0^{-1}\alpha\alpha'$ is nnd. Let $x_1$ be an rv associated with $(A_{11} - \alpha_0^{-1}\alpha\alpha')((p-1) \times (p-1))$ and $z$ be a univariate rv with variance $\alpha_0$ and independent of $x_1$. Then it is easy to verify that the rv
$u = \begin{pmatrix} x_1 + \alpha_0^{-1}\alpha z \\ z \end{pmatrix}$   (2.5)
has $A$ as its covariance matrix.
THEOREM 2.2. Let $A(p \times p)$ be a symmetric matrix partitioned as
$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$
Then $A$ is nnd if and only if
(a) $A_{11}$ is nnd,
(b) $A_{22} - A_{21}A_{11}^- A_{12}$ (Schur complement of $A_{11}$) is nnd,
(c) $\mathcal{S}(A_{12}) \subset \mathcal{S}(A_{11})$.
The theorem was proved by Albert (1969) using purely matrix methods. Statistical proofs were given by Dey et al. (1994) and Rao (2000).
First we prove sufficiency using Theorem 2.1. (a) implies that there exists a random variable $x$ with mean zero and covariance matrix $A_{11}$. (b) implies that there exists a random variable $z$ with mean zero and covariance matrix $A_{22} - A_{21}A_{11}^- A_{12}$. We can choose $z$ to be independent of $x$. Now consider the random variable $(x', y')'$, where $y = z + A_{21}A_{11}^- x$. Verify using (c) that $C((x', y')', (x', y')') = A$, which implies that $A$ is nnd.
To prove necessity, observe that there exists a random variable $(x', y')'$ with the covariance matrix $A$. (a) follows since $C(x, x) = A_{11}$. If $A_{11}$ is not of full rank, let $a$ be a nonnull vector such that $a'A_{11}a = 0$, which implies that $a'x = 0$ a.s. and $C(a'x, b'y) = a'A_{12}b = 0$ for all $b$, i.e., $a'A_{12} = 0$. Hence $\mathcal{S}(A_{12}) \subset \mathcal{S}(A_{11})$, which proves (c).
Now consider the random variable $z = y - A_{21}A_{11}^- x$, whose covariance matrix is
$C(z, z) = A_{22} - A_{21}A_{11}^- A_{12},$
which implies (b).
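A small numerical check of the three conditions of Theorem 2.2 (an illustration added here, assuming numpy; `pinv` plays the role of $A_{11}^-$):

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal((5, 5))
A = L @ L.T                                   # a 5x5 nnd matrix
A11, A12 = A[:3, :3], A[:3, 3:]
A21, A22 = A[3:, :3], A[3:, 3:]

# (a) A11 is nnd and (b) the Schur complement is nnd
schur = A22 - A21 @ np.linalg.pinv(A11) @ A12
assert np.all(np.linalg.eigvalsh(A11) >= -1e-10)
assert np.all(np.linalg.eigvalsh(schur) >= -1e-10)
# (c) S(A12) in S(A11), i.e. A11 A11^- A12 = A12
assert np.allclose(A11 @ np.linalg.pinv(A11) @ A12, A12)
```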
THEOREM 2.3. Let $A$ and $B$ be nnd matrices partitioned as
$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ and $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}.$
If $\rho(A + B) = \rho(A_{11} + B_{11})$, then $\rho(A) = \rho(A_{11})$ and $\rho(B) = \rho(B_{11})$.
The statistical proof of this theorem given by Dey et al. (1994) and Mitra (1973) is reproduced here.
Let $(U_1', U_2')' \sim N(0, A)$ and $(V_1', V_2')' \sim N(0, B)$ be independent. Then $Z = (U_1' + V_1', U_2' + V_2')' = (Z_1', Z_2')' \sim N(0, A + B)$. Since $\rho(A + B) = \rho(A_{11} + B_{11})$, there exists a matrix $G$ such that
$A_{21} + B_{21} = G(A_{11} + B_{11}),\quad A_{22} + B_{22} = G(A_{12} + B_{12}),$
which implies
$\mathrm{Cov}(Z_2 - GZ_1) = \mathrm{Cov}(U_2 - GU_1) + \mathrm{Cov}(V_2 - GV_1) = 0,$
the two summands being independent. Then $\mathrm{Cov}(U_2 - GU_1) = 0$ and $\mathrm{Cov}(V_2 - GV_1) = 0$, which implies $U_2 = GU_1$ a.s. $\Rightarrow \rho(A) = \rho(A_{11})$ and $V_2 = GV_1$ a.s. $\Rightarrow \rho(B) = \rho(B_{11})$.
THEOREM 2.4. Let $A$ be an nnd matrix. Then it can be factorized as
$A = BB'.$
Let $x = (x_1, \ldots, x_p)'$ be a rv with the nnd matrix $A(p \times p)$ as its covariance matrix. Then we can recursively construct random variables
$z_i = x_i - a_{i,i-1}x_{i-1} - \cdots - a_{i1}x_1,\quad i = 1, \ldots, p,$   (2.6)
such that $z_1, \ldots, z_p$ are mutually uncorrelated. It may be noted that
$a_{i1}x_1 + \cdots + a_{i,i-1}x_{i-1}$
is the regression of $x_i$ on $x_1, \ldots, x_{i-1}$. Denote by $b_i^2$ the variance of $z_i$. We can write the transformation (2.6) as
$z = Fx,$
where $F$ is a triangular matrix of rank $p$. Then
$\Delta^2 = C(z, z) = FC(x, x)F' = FAF',$
where $\Delta^2$ is a diagonal matrix with diagonal elements $b_1^2, \ldots, b_p^2$, some of which may be zeros, and
$A = F^{-1}\Delta^2(F^{-1})' = (F^{-1}\Delta)(F^{-1}\Delta)' = BB',$
where $\Delta$ is the diagonal matrix with diagonal elements $b_1, \ldots, b_p$.
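The regression recursion (2.6) can be carried out numerically. The sketch below (an addition, assuming numpy; `factor_bb` is a name chosen here) builds $F$ row by row from the regression coefficients and returns $B = F^{-1}\Delta$:

```python
import numpy as np

def factor_bb(A):
    """Factor an nnd A as B B' via the regression recursion (2.6)."""
    p = A.shape[0]
    F = np.eye(p)
    for i in range(1, p):
        # regression coefficients of x_i on x_1, ..., x_{i-1}
        coef = np.linalg.pinv(A[:i, :i]) @ A[:i, i]
        F[i, :i] = -coef                 # z_i = x_i - coef'(x_1, ..., x_{i-1})
    D = F @ A @ F.T                      # diagonal: the variances b_i^2 of z_i
    b = np.sqrt(np.clip(np.diag(D), 0, None))
    return np.linalg.inv(F) @ np.diag(b)   # B = F^{-1} Delta

rng = np.random.default_rng(2)
L = rng.standard_normal((4, 4))
A = L @ L.T
B = factor_bb(A)
assert np.allclose(B @ B.T, A)
```

This is essentially a Cholesky-type factorization obtained from successive regressions rather than from elimination.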
3 Cauchy–Schwarz (CS) and Related Inequalities
A simple (statistical) proof of the usual CS inequality is to use the fact that the second raw moment of an rv is nonnegative. Consider a bivariate rv $(x, y)$ such that
$E(x^2) = b_{11},\quad E(y^2) = b_{22},\quad E(xy) = b_{12}.$
Then
$E\left(y - \frac{b_{12}}{b_{11}}x\right)^2 = b_{22} - \frac{b_{12}^2}{b_{11}} \ge 0,$   (3.1)
which, written in the form
$b_{12}^2 \le b_{11}b_{22},$   (3.2)
is the usual CS inequality. If $b_{11} = 0$, then $x = 0$ a.s. and $b_{12} = 0$, and the result (3.2) is true.
In particular, if $(x, y)$ has a discrete bivariate distribution with $(x_i, y_i)$ having probability $p_i$, $i = 1, \ldots, n$, then
$\left(\sum p_ix_iy_i\right)^2 \le \left(\sum p_ix_i^2\right)\left(\sum p_iy_i^2\right),$   (3.3)
since $b_{12} = \sum p_ix_iy_i$, $b_{11} = \sum p_ix_i^2$ and $b_{22} = \sum p_iy_i^2$.
The matrix version of the CS inequality is obtained by considering rv's $x$ and $y$ such that
$\Sigma_{11} = C(x, x),\quad \Sigma_{12} = C(x, y),\quad \Sigma_{22} = C(y, y),$
and noting that
$C(y - \Sigma_{21}\Sigma_{11}^- x,\ y - \Sigma_{21}\Sigma_{11}^- x) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^- \Sigma_{12} \ge 0.$   (3.4)
A special case of (3.4) is obtained by considering rv's $x = A'z$ and $y = B'z$, where $z$ is an $s$-vector rv with uncorrelated components of unit variance, in which case
$\Sigma_{11} = A'A,\quad \Sigma_{12} = A'B,\quad \Sigma_{22} = B'B,$
and (3.4) reduces to
$B'B \ge B'A(A'A)^- A'B.$   (3.5)
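The special case (3.5) is easy to verify numerically; the following sketch (added here, assuming numpy) checks that the difference is nnd by inspecting its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
B = rng.standard_normal((6, 2))

# gap = B'B - B'A(A'A)^- A'B should be nnd, by (3.5)
gap = B.T @ B - B.T @ A @ np.linalg.pinv(A.T @ A) @ A.T @ B
assert np.all(np.linalg.eigvalsh(gap) >= -1e-10)
```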
The statistical proofs of the results in the following lemmas are given in Dey et al. (1994).
LEMMA 3.1. Let $A$ and $B$ be nnd matrices. Then
$a'(A + B)^- a \le \frac{(a'A^-a)(a'B^-a)}{a'A^-a + a'B^-a},$
provided $a \in \mathcal{S}(A) \cap \mathcal{S}(B)$, with equality when either $A$ or $B$ has rank unity.
Let $A = X_1'X_1$ and $B = X_2'X_2$. Consider the linear models
$Y_1 = X_1\beta + \varepsilon_1,\quad Y_2 = X_2\beta + \varepsilon_2,\quad Y_3 = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \varepsilon_3,$
with uncorrelated errors of unit variance. The best linear estimates of $a'\beta$ under the three models are
$T_i = a'(X_i'X_i)^- X_i'Y_i,\ i = 1, 2,\quad T_3 = a'(X_1'X_1 + X_2'X_2)^-(X_1'Y_1 + X_2'Y_2),$
with
$V(T_1) = a'A^-a,\quad V(T_2) = a'B^-a,\quad V(T_3) = a'(A + B)^-a.$
Now
$V(T_3) \le \min_{\lambda_1 + \lambda_2 = 1} V(\lambda_1T_1 + \lambda_2T_2) = \left[\frac{1}{V(T_1)} + \frac{1}{V(T_2)}\right]^{-1} = \frac{(a'A^-a)(a'B^-a)}{a'A^-a + a'B^-a}.$
If either $A$ or $B$ has rank unity, $T_3$ is the same as $\lambda_1T_1 + \lambda_2T_2$ for a suitable choice of $\lambda_1, \lambda_2$. This completes the proof.
By using the same argument, the result of Lemma 3.1 can be generalized as follows.
LEMMA 3.2. If $a \in \mathcal{S}(A + B)$, then
$a'(A + B)^- a \le a_1'A^-a_1 + a_2'B^-a_2,$
where $a = a_1 + a_2$ with $a_1 \in \mathcal{S}(A)$ and $a_2 \in \mathcal{S}(B)$.
LEMMA 3.3. Let $A_i$, $i = 1, \ldots, k$, be nnd matrices and $a \in \mathcal{S}(A_i)$ for all $i$. Then
$a'\left(\sum_{i=1}^k A_i\right)^- a \le \left[\sum_{i=1}^k \left(a'A_i^-a\right)^{-1}\right]^{-1}.$
LEMMA 3.4. Let $A$ be an nnd matrix. Then, for any choice of the g-inverse,
$(a'a)^2 \le (a'Aa)(a'A^-a)\quad \text{if } a \in \mathcal{S}(A).$
4 Schur and Kronecker Product of Matrices
The following theorems are well known.
THEOREM 4.1. Let $A$ and $B$ be nnd matrices of order $p \times p$. Then the Schur product $A \circ B$ is nnd. Here $A \circ B = (a_{ij}b_{ij})$, where $A = (a_{ij})$ and $B = (b_{ij})$.
Proof. Let $X$ be a $p$-vector rv such that $C(X, X) = A$ and $Y$ be an independent $p$-vector rv such that $C(Y, Y) = B$. Then it is seen that
$C(X \circ Y,\ X \circ Y) = A \circ B,$
so that $A \circ B$ is the covariance matrix of a rv, which proves the result.
THEOREM 4.2. Let $A(p \times p)$ be an nnd matrix and $B(q \times q)$ be an nnd matrix. Then the Kronecker product $A \otimes B$ is nnd. Here $A \otimes B = (a_{ij}B)$, where $A = (a_{ij})$.
Proof. Let $X$ be a $p$-vector rv such that $C(X, X) = A$ and $Y$ be an independent $q$-vector rv such that $C(Y, Y) = B$. Then
$C(X \otimes Y,\ X \otimes Y) = A \otimes B,$
so that $A \otimes B$ is the covariance matrix of a rv, which proves the desired result.
Note that Theorem 4.1 is a consequence of Theorem 4.2, as $A \circ B$ is a principal submatrix of $A \otimes B$.
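Both closure properties can be checked numerically; a brief sketch (added here, assuming numpy, where `*` is the elementwise Schur product and `np.kron` the Kronecker product):

```python
import numpy as np

rng = np.random.default_rng(4)
A = (lambda L: L @ L.T)(rng.standard_normal((3, 3)))   # nnd
B = (lambda L: L @ L.T)(rng.standard_normal((3, 3)))   # nnd

assert np.all(np.linalg.eigvalsh(A * B) >= -1e-10)          # Schur product A o B
assert np.all(np.linalg.eigvalsh(np.kron(A, B)) >= -1e-10)  # Kronecker product
```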
5 Results Based on Fisher Information
5.1 Properties of Information Matrix
Let $x$ be a $p$-vector rv with density function $f(x, \theta)$, where $\theta = (\theta_1, \ldots, \theta_q)'$ is a $q$-vector parameter. Define the Fisher score vector
$J_x = \frac{1}{f}\left(\frac{\partial f}{\partial \theta_1}, \ldots, \frac{\partial f}{\partial \theta_q}\right)'$
and the information matrix of order $q \times q$,
$I_x = E(J_xJ_x') = \mathrm{Cov}(J_x, J_x),$
which is nnd.
Monotonicity of information: Let $T(x)$ be a function of $x$. Then, under some regularity conditions (given in Rao (1973, pp. 329–332)),
$I_T \le I_x,$   (5.1)
where $I_T$ is the information matrix derived from the distribution of $T$.
Additivity of information: If $x$ and $y$ are independent random variables whose densities involve the same parameters, then
$I_{(x,y)} = I_x + I_y.$   (5.2)
Information when $x$ has a multivariate normal distribution: Let $x \sim N_p(B\theta, A)$, i.e., distributed as $p$-variate normal with mean $B\theta$ and a pd covariance matrix $A$. Then the information in $x$ on $\theta$ is
$I_x = B'A^{-1}B.$   (5.3)
If $y = Gx$, then
$I_y = B'G'(GAG')^{-1}GB,$   (5.4)
and if $G$ is invertible, $I_y = I_x$, i.e., there is no loss of information in the transformation $y = Gx$.
Suppose that the parameter $\theta$ is partitioned into two subvectors, $\theta = (\theta_1', \theta_2')'$, where $\theta_1$ and $\theta_2$ are of dimensions $p$ and $q$ respectively. The corresponding partition of the score vector is
$J_x = \begin{pmatrix} J_{1x} \\ J_{2x} \end{pmatrix},$   (5.5)
and the information matrix is
$I_x = \begin{pmatrix} E(J_{1x}J_{1x}') & E(J_{1x}J_{2x}') \\ E(J_{2x}J_{1x}') & E(J_{2x}J_{2x}') \end{pmatrix} = \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix}.$   (5.6)
We define the conditional score vector of $\theta_1$, when $\theta_2$ is not of interest and is considered as a nuisance parameter, by
$J^*_{1x} = J_{1x} - I_{12}I_{22}^{-1}J_{2x},$   (5.7)
which is the residual of $J_{1x}$ after subtracting the regression on $J_{2x}$, in which case
$I^*_{1x} = E(J^*_{1x}J^{*\prime}_{1x}) = I_{11} - I_{12}I_{22}^{-1}I_{21}.$   (5.8)
As one sees from the definition, $I^*_{1x} \le I_{11}$, the equality sign holding if and only if the components of $J_{1x}$ and $J_{2x}$ are uncorrelated. If
$I_x^{-1} = \begin{pmatrix} I^{11} & I^{12} \\ I^{21} & I^{22} \end{pmatrix},$
then
$I^*_{1x} = I_{11} - I_{12}I_{22}^{-1}I_{21} = (I^{11})^{-1},$
which can be proved as follows.
Consider a $(p+q)$-vector r.v.
$\begin{pmatrix} x \\ y \end{pmatrix} \sim N_{p+q}\left(\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix}\right),$   (5.9)
treating the given information matrix as a covariance matrix. Then, using (5.3) with $B = I$,
$I_{(x,y)} = \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix}^{-1},$
the $\theta_1$-block of whose inverse is, by definition, $I^{11}$. Further,
$z = x - I_{12}I_{22}^{-1}y$ and $y$
are independent, and the transformation $(x, y) \to (z, y)$ is one-to-one, so that $I_{(x,y)} = I_{(z,y)} = I_z + I_y$. The mean of $y$ does not involve $\theta_1$, and the mean of $z$ involves $\theta_1$ with coefficient matrix $I$; hence the $\theta_1$-block of $I_{(x,y)}$ is
$[C(z, z)]^{-1} = (I_{11} - I_{12}I_{22}^{-1}I_{21})^{-1},$
since $C(z, z) = I_{11} - I_{12}I_{22}^{-1}I_{21}$. Comparing the two expressions for this block gives $(I^{11})^{-1} = I_{11} - I_{12}I_{22}^{-1}I_{21}$, as required.
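The block identity $(I^{11})^{-1} = I_{11} - I_{12}I_{22}^{-1}I_{21}$ is easy to confirm numerically; a short sketch (added here, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.standard_normal((5, 5))
I_full = L @ L.T + np.eye(5)               # a pd "information matrix"
I11, I12 = I_full[:2, :2], I_full[:2, 2:]
I21, I22 = I_full[2:, :2], I_full[2:, 2:]

eff = I11 - I12 @ np.linalg.inv(I22) @ I21      # I*_1 of (5.8)
I_up11 = np.linalg.inv(I_full)[:2, :2]          # the block I^{11}
assert np.allclose(eff, np.linalg.inv(I_up11))
```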
The monotonicity property remains valid for the conditional information. Thus, if $T = T(x)$ is a statistic and $I_T$ is pd, then
$I^*_{1T} \le I^*_{1x}.$   (5.10)
But the additivity property may not hold. If $x$ and $y$ are independent random variables, we can only say that
$I^*_{1(x,y)} \ge I^*_{1x} + I^*_{1y},$   (5.11)
which is described as superadditivity, unless $x$ and $y$ are identically distributed, in which case equality holds. For a proof, reference may be made to Kagan & Rao (2003).
5.2 Some Applications
THEOREM 5.1. If $A$ and $B$ are pd matrices, then
$A \ge B \Rightarrow A^{-1} \le B^{-1}.$   (5.12)
Proof. Consider independent rv's
$x \sim N(\theta, B)\quad \text{and}\quad y \sim N(0, A - B + \delta^2I),$
where $\delta^2I$ is added to $A - B$ to make $C(y, y)$ positive definite. Note that $\delta$ can be chosen as small as we like. Then
$x + y \sim N(\theta, A + \delta^2I),\quad I_x = B^{-1},\quad I_y = 0,$
and
$I_{x+y} = (A + \delta^2I)^{-1}.$
Using the monotonicity (5.1) and additivity (5.2) properties of Fisher information, we have
$(A + \delta^2I)^{-1} = I_{x+y} \le I_{(x,y)} = I_x + I_y = B^{-1}.$   (5.13)
Since (5.13) is true for all $\delta > 0$, taking the limit as $\delta \to 0$, we have the result (5.12).
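A quick numerical check of (5.12) (an addition, assuming numpy): build $A \ge B$ by construction and verify that $B^{-1} - A^{-1}$ is nnd.

```python
import numpy as np

rng = np.random.default_rng(6)
L = rng.standard_normal((4, 4))
B = L @ L.T + np.eye(4)          # pd
M = rng.standard_normal((4, 4))
A = B + M @ M.T                  # A >= B by construction

diff = np.linalg.inv(B) - np.linalg.inv(A)
assert np.all(np.linalg.eigvalsh(diff) >= -1e-10)
```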
The equality given in the following theorem is useful in the distribution theory of quadratic forms
(Rao, 1973, p.77). We give a statistical proof of the equality.
THEOREM 5.2. Let $\Sigma(p \times p)$ be a pd matrix, and let $B(k \times p)$ of rank $k$ and $C(p \times (p - k))$ of rank $p - k$ be matrices such that $BC = 0$. Then
$C(C'\Sigma C)^{-1}C' + \Sigma^{-1}B'(B\Sigma^{-1}B')^{-1}B\Sigma^{-1} = \Sigma^{-1}.$   (5.14)
Proof. Consider the random variable $X \sim N_p(\theta, \Sigma)$ and the transformation
$Y_1 = C'X \sim N_{p-k}(C'\theta, C'\Sigma C),\quad Y_2 = B\Sigma^{-1}X \sim N_k(B\Sigma^{-1}\theta, B\Sigma^{-1}B').$
Here $Y_1$ and $Y_2$ are independent, since $C(Y_1, Y_2) = C'\Sigma\Sigma^{-1}B' = (BC)' = 0$, and $(Y_1, Y_2)$ is a one-to-one transformation of $X$. Using the additivity (5.2) of Fisher information, we have
$\Sigma^{-1} = I_X = I_{(Y_1, Y_2)} = I_{Y_1} + I_{Y_2}.$
Note that $I_{Y_1} = C(C'\Sigma C)^{-1}C'$ is the first term and $I_{Y_2} = \Sigma^{-1}B'(B\Sigma^{-1}B')^{-1}B\Sigma^{-1}$ is the second term on the left-hand side of (5.14), by (5.3). See Rao (1973, p. 77) for an algebraic proof.
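The identity (5.14) can be verified numerically; in the sketch below (added here, assuming numpy), $C$ is built as an orthonormal basis of the null space of $B$ so that $BC = 0$:

```python
import numpy as np

rng = np.random.default_rng(7)
p, k = 5, 2
Sig = (lambda L: L @ L.T + np.eye(p))(rng.standard_normal((p, p)))
B = rng.standard_normal((k, p))
C = np.linalg.svd(B)[2][k:].T        # p x (p-k), columns span the null space of B
assert np.allclose(B @ C, 0)

Si = np.linalg.inv(Sig)
lhs = C @ np.linalg.inv(C.T @ Sig @ C) @ C.T
rhs = Si - Si @ B.T @ np.linalg.inv(B @ Si @ B.T) @ B @ Si
assert np.allclose(lhs, rhs)         # the two sides of (5.14)
```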
6 Milne’s Inequality
Milne (1925) proved the following result.
THEOREM 6.1. For any constants $\beta_i$ with $0 \le \beta_i < 1$, $i = 1, \ldots, n$, and given weights $\lambda_i > 0$, $i = 1, \ldots, n$,
$\left(\sum_i \frac{\lambda_i}{1 - \beta_i^2}\right)^2 \ge \left(\sum_i \frac{\lambda_i}{1 + \beta_i}\right)\left(\sum_i \frac{\lambda_i}{1 - \beta_i}\right) \ge \left(\sum_i \lambda_i\right)\left(\sum_i \frac{\lambda_i}{1 - \beta_i^2}\right).$   (6.1)
Proof. We give a statistical proof of Milne's inequalities. Without loss of generality take $\sum \lambda_i = 1$.
To prove the right-hand side inequality, consider the bivariate discrete distribution of $(X, Y)$ which assigns probability $\lambda_i$ to the pair of values
$X = \frac{1}{1 + \beta_i},\quad Y = \frac{1}{1 - \beta_i},\quad i = 1, \ldots, n.$   (6.2)
We use the well-known formula
$C(X, Y) = \frac{1}{2}E[(X_1 - X_2)(Y_1 - Y_2)],$   (6.3)
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent samples from the distribution (6.2). A typical term on the right-hand side of (6.3) is
$\left(\frac{1}{1 + \beta_i} - \frac{1}{1 + \beta_j}\right)\left(\frac{1}{1 - \beta_i} - \frac{1}{1 - \beta_j}\right) = -\frac{(\beta_i - \beta_j)^2}{(1 - \beta_i^2)(1 - \beta_j^2)} \le 0.$
Hence $C(X, Y) = E(XY) - E(X)E(Y) \le 0$. The right-hand side inequality follows by observing that
$E(XY) = \sum_i \frac{\lambda_i}{1 - \beta_i^2},\quad E(X) = \sum_i \frac{\lambda_i}{1 + \beta_i},\quad E(Y) = \sum_i \frac{\lambda_i}{1 - \beta_i}.$
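Both inequalities of (6.1) can be checked on random data; a minimal sketch (added here, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(8)
lam = rng.random(5) + 0.1          # positive weights
beta = rng.random(5) * 0.9         # 0 <= beta_i < 1

s0 = lam.sum()
s1 = (lam / (1 + beta)).sum()
s2 = (lam / (1 - beta)).sum()
s3 = (lam / (1 - beta**2)).sum()

assert s0 * s3 <= s1 * s2 <= s3**2   # the two inequalities of (6.1)
```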
COROLLARY 1. For any rv $X$ with $\Pr(|X| < A) = 1$,
$E\left(\frac{1}{A + X}\right)E\left(\frac{1}{A - X}\right) \ge E\left(\frac{1}{A^2 - X^2}\right).$   (6.4)
We use the same arguments as in the main theorem exploiting (6.3), choosing $(A + X)^{-1}$ and $(A - X)^{-1}$ as the variables $X$ and $Y$.
COROLLARY 2. (Matrix version of Milne's inequality). Let $A, V_1, \ldots, V_n$ be symmetric commuting matrices such that $-A < V_i < A$, $i = 1, \ldots, n$. Then
$\left[\sum_i \lambda_i(A + V_i)^{-1}\right]\left[\sum_i \lambda_i(A - V_i)^{-1}\right] \ge \sum_i \lambda_i(A^2 - V_i^2)^{-1},$   (6.5)
where $\lambda_i \ge 0$ and $\lambda_1 + \cdots + \lambda_n = 1$.
A purely statistical proof of (6.5) is not known. The proof given in Kagan & Rao (2003) depends on the result that an nnd matrix $A$ has an nnd square root $A = B^2$. Is there a statistical proof of this result?
A different statistical proof of Milne's inequality (6.1), given in Kagan & Rao (2003), is as follows.
Proof. Let $x_1, \ldots, x_n$ be independent bivariate normal vectors such that
$E(x_i) = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix},\quad C(x_i, x_i) = \sigma_i^2\begin{pmatrix} 1 & \beta_i \\ \beta_i & 1 \end{pmatrix},$
in which case the elements of $I_{x_i}$ are
$I_{11x_i} = I_{22x_i} = \frac{1}{\sigma_i^2(1 - \beta_i^2)},\quad I_{12x_i} = I_{21x_i} = \frac{-\beta_i}{\sigma_i^2(1 - \beta_i^2)},$
and the information based on $x$, the whole set $(x_1, \ldots, x_n)$, has elements
$I_{11x} = I_{22x} = \sum_i a_i,\quad I_{12x} = I_{21x} = -\sum_i a_i\beta_i,\quad \text{where } a_i = \frac{1}{\sigma_i^2(1 - \beta_i^2)},$
so that
$I^*_{1x} = \sum_i a_i - \frac{\left(\sum_i a_i\beta_i\right)^2}{\sum_i a_i}.$   (6.6)
Consider now the vector $x^{(1)} = (x_{11}, \ldots, x_{1n})'$ of the first components of $x_1, \ldots, x_n$. The distribution of $x^{(1)}$ depends only on $\theta_1$, with
$I_{x^{(1)}}(\theta_1) = \sum_i \frac{1}{\sigma_i^2},$
and trivially
$I^*_{1x^{(1)}} = I_{x^{(1)}}.$   (6.7)
Since $x^{(1)}$ is a statistic, applying the monotonicity result (5.10) to (6.6) and (6.7), we have
$\sum_i \frac{1}{\sigma_i^2} \le \sum_i a_i - \frac{\left(\sum_i a_i\beta_i\right)^2}{\sum_i a_i}.$   (6.8)
Setting $\lambda_i = \sigma_i^{-2}$, so that $a_i = \lambda_i/(1 - \beta_i^2)$, and multiplying both sides of (6.8) by $\sum_i a_i$ leads, after a simple calculation, to the right-hand side inequality of (6.1).
Note that the classical inequality between arithmetic and geometric means gives
$\left(\sum_i \frac{\lambda_i}{1 + \beta_i}\right)\left(\sum_i \frac{\lambda_i}{1 - \beta_i}\right) \le \left[\frac{1}{2}\left(\sum_i \frac{\lambda_i}{1 + \beta_i} + \sum_i \frac{\lambda_i}{1 - \beta_i}\right)\right]^2 = \left(\sum_i \frac{\lambda_i}{1 - \beta_i^2}\right)^2,$
which establishes the left-hand side inequality in (6.1).
7 Convexity of Some Matrix Functions
Let
$A_i = \begin{pmatrix} A_{11i} & A_{12i} \\ A_{21i} & A_{22i} \end{pmatrix},\quad i = 1, \ldots, n,$
be nnd matrices. Further let $\lambda_i \ge 0$ with $\lambda_1 + \cdots + \lambda_n = 1$, and
$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \sum_{i=1}^n \lambda_iA_i,\quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = A_1 \circ A_2 \circ \cdots \circ A_n.$
Since $A$ and $B$ are nnd matrices (the latter by Theorem 4.1), we have
$A_{22} \ge A_{21}A_{11}^- A_{12},\quad A_{11} \ge A_{12}A_{22}^- A_{21},$   (7.1)
$B_{22} \ge B_{21}B_{11}^- B_{12},\quad B_{11} \ge B_{12}B_{22}^- B_{21}.$   (7.2)
As a consequence of (7.1) and (7.2), we have the following theorems.
THEOREM 7.1. Let $C_i$, $i = 1, \ldots, n$, be symmetric matrices of the same order. Then
$\sum_{i=1}^n \lambda_iC_i^2 \ge \left(\sum_{i=1}^n \lambda_iC_i\right)^2,$   (7.3)
$C_1^2 \circ \cdots \circ C_n^2 \ge (C_1 \circ \cdots \circ C_n)^2.$   (7.4)
Proof. Choose $A_{11i} = C_i^2$, $A_{12i} = A_{21i} = C_i$ and $A_{22i} = I$. Then $A_i$ is nnd by the sufficiency part of Theorem 2.2. Hence, using (7.1) and (7.2), we have (7.3) and (7.4) respectively.
THEOREM 7.2. Let $C_i$, $i = 1, \ldots, n$, be pd matrices. Then
$\sum_{i=1}^n \lambda_iC_i^{-1} \ge \left(\sum_{i=1}^n \lambda_iC_i\right)^{-1},$   (7.5)
$C_1^{-1} \circ \cdots \circ C_n^{-1} \ge (C_1 \circ \cdots \circ C_n)^{-1}.$   (7.6)
Proof. Choose $A_{11i} = C_i^{-1}$, $A_{12i} = A_{21i} = I$ and $A_{22i} = C_i$. Applying the same argument as in Theorem 7.1, the results (7.5) and (7.6) follow from (7.1) and (7.2) respectively. The inequality (7.5) is proved by Olkin & Pratt (1958). See also Marshall & Olkin (1979, Chapter 16). In particular, if $C_1 > 0$ and $C_2 > 0$, then $C_1 \circ C_2 \ge (C_1^{-1} \circ C_2^{-1})^{-1}$.
THEOREM 7.3. Let $C_i(s_i \times s_i)$, $i = 1, \ldots, n$, and $B_i(s_i \times m)$, $i = 1, \ldots, n$, be matrices such that $B_i'B_i = I_m$. Then
$\sum_i \lambda_iB_i'C_i'C_iB_i \ge \left(\sum_i \lambda_iB_i'C_i'B_i\right)\left(\sum_i \lambda_iB_i'C_iB_i\right).$   (7.7)
In particular, when the $C_i$ are symmetric for all $i$, (7.7) reduces to
$\sum_i \lambda_iB_i'C_i^2B_i \ge \left(\sum_i \lambda_iB_i'C_iB_i\right)^2.$   (7.8)
Proof. Since the matrix
$\begin{pmatrix} B_i'C_i'C_iB_i & B_i'C_i'B_i \\ B_i'C_iB_i & B_i'B_i \end{pmatrix} = \begin{pmatrix} B_i'C_i' \\ B_i' \end{pmatrix}\begin{pmatrix} C_iB_i & B_i \end{pmatrix}$
is nnd, we get (7.7) and (7.8) by applying (7.1), noting that $\sum \lambda_iB_i'B_i = I_m$. Some of these results are proved in Kagan & Landsman (1997) and Kagan & Smith (2001) using the information inequality (5.1).
THEOREM 7.4. Let $A$ and $B$ be pd matrices of order $n \times n$, and let $X$ and $Y$ be $m \times n$ matrices. Then
$(X \circ Y)(A \circ B)^{-1}(X \circ Y)' \le (XA^{-1}X') \circ (YB^{-1}Y').$   (7.9)
Proof. The result follows by considering the Schur product of the nnd matrices
$\begin{pmatrix} XA^{-1}X' & X \\ X' & A \end{pmatrix}\quad \text{and}\quad \begin{pmatrix} YB^{-1}Y' & Y \\ Y' & B \end{pmatrix}$
and taking the Schur complement.
THEOREM 7.5. (Fiedler, 1961). Let $A$ be a pd matrix. Then
$A \circ A^{-1} \ge I.$   (7.10)
Proof. The Schur product
$\begin{pmatrix} A & I \\ I & A^{-1} \end{pmatrix} \circ \begin{pmatrix} A^{-1} & I \\ I & A \end{pmatrix} = \begin{pmatrix} A \circ A^{-1} & I \\ I & A^{-1} \circ A \end{pmatrix}$   (7.11)
is nnd, since each matrix on the left-hand side of (7.11) is nnd. Multiplying the right-hand side matrix of (7.11) by the vector $b' = (a', -a')$ on the left and by $b$ on the right, where $a$ is an arbitrary vector, we have
$2a'(A \circ A^{-1})a - 2a'a \ge 0 \Rightarrow A \circ A^{-1} \ge I.$
Purely matrix proofs of some of the results in this section can be found in Ando (1979, 1998).
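Fiedler's inequality (7.10) is also easy to verify on a random pd matrix; a short sketch (added here, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(9)
L = rng.standard_normal((4, 4))
A = L @ L.T + np.eye(4)                     # pd

fied = A * np.linalg.inv(A) - np.eye(4)     # A o A^{-1} - I
assert np.all(np.linalg.eigvalsh(fied) >= -1e-10)
```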
THEOREM 7.6. Let $A_1, \ldots, A_n$ be $s \times s$ pd matrices with the partitions
$A_i = \begin{pmatrix} A_{11i} & A_{12i} \\ A_{21i} & A_{22i} \end{pmatrix},$
where $A_{11i}$ is of order $p \times p$, and partition $\sum_{i=1}^n \lambda_iA_i^{-1}$ conformably as
$\sum_{i=1}^n \lambda_iA_i^{-1} = \begin{pmatrix} \bar A_{11} & \bar A_{12} \\ \bar A_{21} & \bar A_{22} \end{pmatrix}.$
Then
$\bar A_{11} - \bar A_{12}\bar A_{22}^{-1}\bar A_{21} \ge \sum_{i=1}^n \lambda_iA_{11i}^{-1}.$   (7.12)
Proof. Let
$x_i \sim N(\mu_i\theta, A_i),\quad i = 1, \ldots, n,$
be independent $s$-dimensional normal variables, where $\theta = (\theta_1', \theta_2')'$ is an $s$-dimensional parameter with $\theta_1$ of dimension $p$, and the $\mu_i$ are known constants with $\mu_i^2 = \lambda_i$ (so that $\mu_1^2 + \cdots + \mu_n^2 = 1$). Set $x = (x_1', \ldots, x_n')'$. Then
$I_{x_i} = \mu_i^2A_i^{-1} = \lambda_iA_i^{-1},\quad I_x = \sum_i \lambda_iA_i^{-1}.$
From (5.8) and the result $I^*_1 = (I^{11})^{-1}$ applied to $I_x$,
$I^*_{1x} = \bar A_{11} - \bar A_{12}\bar A_{22}^{-1}\bar A_{21},$
while applied to $I_{x_i} = \lambda_iA_i^{-1}$, whose inverse $\lambda_i^{-1}A_i$ has leading block $\lambda_i^{-1}A_{11i}$,
$I^*_{1x_i} = \lambda_iA_{11i}^{-1}.$   (7.13)
Due to the superadditivity (5.11), $I^*_{1x} \ge \sum_i I^*_{1x_i}$, and (7.13) leads to (7.12).
8 Inequalities on Harmonic Mean and Parallel Sum of Matrices
The following theorem is a matrix version of the inequality on harmonic means given in Rao
(1996).
THEOREM 8.1. Let $A_1$ and $A_2$ be random pd matrices of the same order, and let $E$ denote expectation. Then
$E[(A_1^{-1} + A_2^{-1})^{-1}] \le [(EA_1)^{-1} + (EA_2)^{-1}]^{-1}.$   (8.1)
Proof. Let $C$ be any pd matrix. Then
$\begin{pmatrix} A_1C^{-1}A_1 & A_1 \\ A_1 & C \end{pmatrix}$   (8.2)
is nnd by the sufficiency part of Theorem 2.2. Taking the expectation of (8.2), and considering the Schur complement, we have
$E(A_1C^{-1}A_1) \ge E(A_1)[E(C)]^{-1}E(A_1).$   (8.3)
Now choosing $C = A_1 + A_2$ and noting that
$(A_1^{-1} + A_2^{-1})^{-1} = A_1 - A_1(A_1 + A_2)^{-1}A_1,$
the result (8.1) follows by an application of (8.3). The same result was proved by Prakasa Rao (1998) using a different method.
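For a concrete check of (8.1) (an addition, assuming numpy), take a two-point distribution for the random pair $(A_1, A_2)$, so that the expectations are plain averages:

```python
import numpy as np

rng = np.random.default_rng(10)
def rand_pd(n):
    L = rng.standard_normal((n, n))
    return L @ L.T + np.eye(n)

# (A1, A2) equals (P1, P2) or (Q1, Q2), each with probability 1/2
P1, P2, Q1, Q2 = (rand_pd(3) for _ in range(4))
inv = np.linalg.inv
par = lambda X, Y: inv(inv(X) + inv(Y))        # harmonic mean (parallel sum)

lhs = 0.5 * (par(P1, P2) + par(Q1, Q2))        # E (A1^-1 + A2^-1)^-1
rhs = par(0.5 * (P1 + Q1), 0.5 * (P2 + Q2))    # ((E A1)^-1 + (E A2)^-1)^-1
assert np.all(np.linalg.eigvalsh(rhs - lhs) >= -1e-10)
```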
THEOREM 8.2. Let $A_1, \ldots, A_p$ be random pd matrices and $M_1, \ldots, M_p$ be their expectations. Then
$E(\tilde A) \le \tilde M,$   (8.4)
where
$\tilde A = (A_1^{-1} + \cdots + A_p^{-1})^{-1},\quad \tilde M = (M_1^{-1} + \cdots + M_p^{-1})^{-1}.$
Proof. The result (8.4) is proved by repeated applications of (8.1).
THEOREM 8.3. (Parallel Sum Inequality). Let $A_{ij}$, $i = 1, \ldots, p$ and $j = 1, \ldots, q$, be pd matrices. Denote
$A_{\cdot j} = (A_{1j}^{-1} + \cdots + A_{pj}^{-1})^{-1},\quad \bar A_i = \frac{1}{q}(A_{i1} + \cdots + A_{iq}).$
Then
$\frac{1}{q}\sum_{j=1}^q A_{\cdot j} \le (\bar A_1^{-1} + \cdots + \bar A_p^{-1})^{-1}.$   (8.5)
The result follows from (8.4) by considering $(A_{1j}, \ldots, A_{pj})$, $j = 1, \ldots, q$, as the possible values of $(A_1, \ldots, A_p)$, giving uniform probabilities to the index $j$, and taking the expectation over $j$.
We can get a weighted version of the inequality (8.5) by attaching weights $\lambda_j$ to the index $j$, $j = 1, \ldots, q$.
THEOREM 8.4. Let $A_1, \ldots, A_n$ be nnd matrices of the same order $p$, and let $B(p \times k)$ be of rank $k$ with $\mathcal{S}(B) \subset \mathcal{S}(A_i)$ for all $i$. Then
$B'\left(\sum_i A_i\right)^- B \le \left[\sum_i (B'A_i^-B)^{-1}\right]^{-1},$   (8.6)
where $A_i^-$ denotes any g-inverse of $A_i$.
Proof. Let $A_i = X_i'X_i$ and consider the independent linear models
$Y_i = X_i\beta + \varepsilon_i,\quad E(\varepsilon_i) = 0,\quad C(\varepsilon_i, \varepsilon_i) = I,\quad i = 1, \ldots, n.$   (8.7)
The best estimate of $B'\beta$ (in the sense of having minimum dispersion error matrix) from the entire model is $B'\hat\beta$ with
$\hat\beta = \left(\sum_i X_i'X_i\right)^- \sum_i X_i'Y_i,$
with the covariance matrix
$B'\left(\sum_i A_i\right)^- B.$   (8.8)
The estimate of $B'\beta$ from the $i$-th part of the model, $Y_i = X_i\beta + \varepsilon_i$, is $B'\hat\beta_i$ with
$\hat\beta_i = (X_i'X_i)^- X_i'Y_i,$   (8.9)
with the covariance matrix
$B'A_i^-B.$   (8.10)
Combining the estimates (8.9) in an optimum way, we have the estimate of $B'\beta$
$\left[\sum_i (B'A_i^-B)^{-1}\right]^{-1}\sum_i (B'A_i^-B)^{-1}B'\hat\beta_i,$   (8.11)
with the covariance matrix
$\left[\sum_i (B'A_i^-B)^{-1}\right]^{-1}.$
Obviously, the covariance matrix of (8.11) is not smaller than that in (8.8), which yields the desired inequality.
The results of Theorems 8.1 and 8.4 are similar to those of Dey et al. (1994).
9 Inequalities on the Elements of an Inverse Matrix
The following theorem is of interest in estimation theory.
THEOREM 9.1. Let
$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},\quad \Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix},$
where $\Sigma((p+q) \times (p+q))$ is a pd matrix and $\Sigma_{11}$ is of order $p \times p$. Then
$\Sigma^{11} \ge \Sigma_{11}^{-1}.$
Proof. Consider the rv
$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_{p+q}\left(\begin{pmatrix} \theta \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right),$   (9.1)
where $X$ is $p$-variate and $Y$ is $q$-variate. Using the formulae (5.3) and (5.4) on information matrices, we have
$I_{(X,Y)} = \Sigma^{11},\quad I_X = \Sigma_{11}^{-1},$   (9.2)
and the information inequality (5.1),
$I_X \le I_{(X,Y)},$
yields the desired result.
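A numerical check of Theorem 9.1 (added here, assuming numpy): the leading block of $\Sigma^{-1}$ dominates the inverse of the leading block of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(11)
L = rng.standard_normal((5, 5))
S = L @ L.T + np.eye(5)                  # pd Sigma

S11 = S[:2, :2]
S_up11 = np.linalg.inv(S)[:2, :2]        # block Sigma^{11} of Sigma^{-1}
diff = S_up11 - np.linalg.inv(S11)
assert np.all(np.linalg.eigvalsh(diff) >= -1e-10)
```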
10 Carlen’s Superadditivity of Fisher Information
Let $f(u)$ be the probability density of a $(p+q)$-vector variable $u = (u_1, \ldots, u_{p+q})'$ and define
$J_f = (j_{rs}^f) = \begin{pmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{pmatrix},\quad j_{rs}^f = E\left(\frac{1}{f}\frac{\partial f}{\partial u_r}\cdot\frac{1}{f}\frac{\partial f}{\partial u_s}\right),$
where $f_{11}$ and $f_{22}$ are matrices of orders $p$ and $q$ respectively. Let $g(u_1, \ldots, u_p)$ and $h(u_{p+1}, \ldots, u_{p+q})$ be the marginal probability densities of $(u_1, \ldots, u_p)$ and $(u_{p+1}, \ldots, u_{p+q})$ respectively, with the corresponding $J$ matrices $J_g$ of order $p \times p$ and $J_h$ of order $q \times q$. Then Carlen (1991) proved that
$\mathrm{Trace}\,J_f \ge \mathrm{Trace}\,J_g + \mathrm{Trace}\,J_h.$   (10.1)
We prove the result (10.1) using Fisher information inequalities. Let us introduce a $(p+q)$-vector parameter
$\theta = (\theta_1', \theta_2')' = (\theta_1, \ldots, \theta_p, \theta_{p+1}, \ldots, \theta_{p+q})',$
where $\theta_1$ is a $p$-vector and $\theta_2$ is a $q$-vector, and consider the location families with probability densities
$f(u_1 - \theta_1, \ldots, u_{p+q} - \theta_{p+q}),\quad g(u_1 - \theta_1, \ldots, u_p - \theta_p),\quad h(u_{p+1} - \theta_{p+1}, \ldots, u_{p+q} - \theta_{p+q}).$
It is seen that, in terms of information, the $\theta_1$- and $\theta_2$-blocks of $I_f$ are $f_{11}$ and $f_{22}$, while $I_g(\theta_1) = J_g$ and $I_h(\theta_2) = J_h$. Since the first $p$ components of $u$ form a statistic whose marginal density involves $\theta_1$ only, the monotonicity (5.10) of the efficient information, together with $I^*_1 \le I_{11}$ of (5.8), shows that
$J_g \le f_{11},$ and similarly $J_h \le f_{22}.$
Taking traces, we get the desired result (10.1). A similar proof is given by Kagan & Landsman (1997).
11 Inequalities on Principal Submatrices of an nnd Matrix
A $k \times k$ principal submatrix of an nnd matrix $A(n \times n)$ is a matrix $A_k$ obtained by retaining the columns and rows indexed by a sequence $i_1 < i_2 < \cdots < i_k \le n$. It is clear that if $A \ge 0$, then $A_k \ge 0$.
THEOREM 11.1. If $A$ is an nnd matrix, then (i) $(A_k)^2 \le (A^2)_k$; and if $A$ is pd, (ii) $(A_k)^{-1} \le (A^{-1})_k$, and (iii) $X_k'A_k^{-1}X_k \le X'A^{-1}X$, where $X$ has the same number of rows as $A$ and $X_k$ consists of the corresponding $k$ rows of $X$.
Proof. All the results are derived by observing that
$\begin{pmatrix} A & X \\ X' & X'A^{-1}X \end{pmatrix} \ge 0,$   (11.1)
which implies
$\begin{pmatrix} A_k & X_k \\ X_k' & X'A^{-1}X \end{pmatrix} \ge 0,$   (11.2)
and taking the Schur complement, $X'A^{-1}X \ge X_k'A_k^{-1}X_k$, for special choices of $X$. For (i), choose $A^2$ for $A$ and $X = A$, and take the $k \times k$ principal submatrix of both sides of the resulting inequality. For (ii), choose $X = I$. (iii) follows directly from (11.2). Different proofs and additional results are given by Zhang (1998).
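The inequalities of Theorem 11.1 can be confirmed numerically; a brief sketch (added here, assuming numpy, with `np.ix_` extracting the principal submatrix):

```python
import numpy as np

rng = np.random.default_rng(12)
L = rng.standard_normal((5, 5))
A = L @ L.T + np.eye(5)                   # pd
idx = [0, 2, 4]                           # retained rows and columns
sub = lambda M: M[np.ix_(idx, idx)]       # principal submatrix
eig = np.linalg.eigvalsh

# (i) (A^2)_k - (A_k)^2 is nnd
assert np.all(eig(sub(A @ A) - sub(A) @ sub(A)) >= -1e-10)
# (ii) (A^{-1})_k - (A_k)^{-1} is nnd
assert np.all(eig(sub(np.linalg.inv(A)) - np.linalg.inv(sub(A))) >= -1e-10)
```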
References
Albert, A. (1969). Conditions for positive and nonnegative definiteness in terms of pseudo inverses. SIAM J. Appl. Math., 17,
434–440.
Ando, T. (1979). Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear
Algebra and Appl., 26, 203–241.
Ando, T. (1998). Operator-Theoretic Methods for Matrix Inequalities. Hokusei Gakuen University.
Carlen, E.A. (1991). Superadditivity of Fisher’s information and logarithmic Sobolev inequalities. J. Funct. Anal., 101,
194–211.
Dey, A., Hande, S. & Tiku, M.L. (1994). Statistical proofs of some matrix results. Linear and Multilinear Algebra, 38,
109–116.
Fiedler, M. (1961). Über eine Ungleichung für positiv definite Matrizen. Math. Nachr., 23, 197–199.
Kagan, A. & Landsman, Z. (1997). Statistical meaning of Carlen’s superadditivity of Fisher information. Statistics and
Probability Letters, 32, 175–179.
Kagan, A., Landsman, Z. & Rao, C.R. (2006). Sub- and superadditivity à la Carlen of matrices related to Fisher information.
To appear in J. Statistical Planning and Inference.
Kagan, A. & Rao, C.R. (2003). Some properties and applications of the efficient Fisher score. J. Statist. Planning and
Inference, 116, 343–352.
Kagan, A. & Rao, C.R. (2005). On estimation of a location parameter in presence of an ancillary component. Teoriya
Veroyatnostej i ee Primenenie, 50, 171–176.
Kagan, A. & Smith, P.J. (2001). Multivariate normal distributions, Fisher information and matrix inequalities. Internat. J.
Math. Education in Sci. Technology, 32, 91–96.
Marshall, A.W. & Olkin, I. (1979). Inequalities: Theory of Majorization and its Applications. New York: Academic Press.
Milne, E.A. (1925). Note on Rosseland’s integral for absorption coefficient. Monthly Notices of Royal Astronomical Society,
85, 979–984.
Mitra, S.K. (1973). Statistical proofs of some propositions on nonnegative definite matrices. Bull. Inter. Statist. Inst., XLV,
206–211.
Olkin, I. & Pratt, J. (1958). A multivariate Tchebycheff’s inequality. Ann. Math. Stat., 29, 226–234.
Prakasa Rao, B.L.S. (1998). An inequality for the expectation of harmonic mean of random matrices. Tech. Report, Indian
Statistical Institute, Delhi.
Rao, C.R. (1967). Calculus of generalized inverse of matrices: Part 1 – General theory. Sankhyā A, 19, 317–350.
Rao, C.R. (1973). Linear Statistical Inference and its Applications, Second edition. New York: Wiley.
Rao, C.R. (1996). Seven inequalities in estimation theory. Student, 1, 149–158.
Rao, C.R. (2000). Statistical proofs of some matrix inequalities. Linear Algebra and Applic., 321, 307–320.
Rao, C.R. & Mitra, S.K. (1971). Generalized Inverse of Matrices and its Applications. New York: John Wiley.
Rao, C.R. & Rao, M.B. (1998). Matrix Algebra and its Applications to Statistics and Econometrics. Singapore: World
Scientific.
Rao, C.R. & Toutenburg, H. (1999). Linear Models: Least Squares and Alternatives, (Second Edition). Springer.
Zhang, Fuzhen (1998). Some inequalities on principal submatrices of positive semidefinite matrices. Tech. Report, Nova Southeastern University, Fort Lauderdale.
[Received June 2005, accepted November 2005]