
A Geometric Approach to Statistical Estimation
Rudolf Kulhavý1
Institute of Information Theory and Automation, Academy of Sciences
of the Czech Republic, P. O. Box 18, 182 08 Prague, Czech Republic
[email protected]
Abstract
The role of Kerridge inaccuracy, Shannon entropy and Kullback-Leibler distance in statistical estimation is shown for both discrete and continuous observations. The cases of data independence and regression-type dependence are considered in parallel. Pythagorean-like relations valid for probability distributions are presented and their importance for estimation under compressed data is indicated.
1. Introduction
Rules of probability theory provide a fundamental tool for statistical estimation. It is the computational complexity of these rules, however, that often makes estimation algorithms infeasible. Modified rules are perhaps needed that would be easier to implement, yet close enough to what exact probability calculus delivers. An appealing way of approximate inference would be to merge statistical and computational uncertainty. The question is how the two kinds of uncertainty should be translated into a common language.
One possible approach is to use concepts of information
theory, namely, to view estimation as calculation of a certain distance between the empirical and model distributions of data. The approach is far from being new. In
statistics, minimum distance estimation and its consistency for a large class of distances was studied very early
[1]. In robust statistics, D-estimators were studied, including the question of how the choice of a particular distance affects robustness [2], [3]. In system identification,
information-theoretic distances were used in structure determination [4] and approximation [5], [6].
In this paper, we make use of three information measures, namely inaccuracy, entropy and Kullback-Leibler (K-L) distance, and show how they are related to likelihood. Then we consider the case when a sample average of some prespecified functions is known rather than a complete empirical distribution. A distance between the empirical and model distributions is decomposed into a sum of two distances in a Pythagorean-like way. In [7], [8], [9], a Pythagorean relation was shown to hold for K-L distances. Here we make a slight extension, presenting a Pythagorean-like theorem that links inaccuracy and K-L distance. Apart from giving another unified view of parameter estimation, the result provides a tool for possible approximation of likelihood.

1 Supported in part by Grant 102/94/0314 of the Czech Grant Agency and Grant 275109 of the Academy of Sciences of the Czech Republic.
2. Parameter estimation revisited
Independent observations
Consider a sequence of random variables $Y^N = (Y_1, \ldots, Y_N)$ with values in a set $\mathcal{Y}$. Suppose that the $Y_k$ are independent and identically distributed according to a common probability distribution $S_\theta$ parametrized by an unknown parameter $\theta \in \Theta$. To cover the cases of discrete and continuous $Y$ with one notation, we introduce densities $s_\theta = dS_\theta/d\mu$, $\theta \in \Theta$, as Radon-Nikodym densities of $S_\theta$ with respect to a common dominating measure $\mu$ on $\mathcal{Y}$. In particular, when $Y$ is discrete and $\mu$ is a counting measure, the $s_\theta(y)$ are probability mass functions; when $Y$ is continuous and $\mu$ is a Lebesgue measure, the $s_\theta(y)$ are probability density functions.
Owing to the independence assumption, the joint density $p_\theta(y^N)$ with respect to $\mu^N$ is simply
$$p_\theta(y^N) = \prod_{k=1}^{N} s_\theta(y_k). \qquad (1)$$
The product can be expressed in a form more convenient for later approximation. We introduce an empirical density of observed data as
$$r_N(y) = \frac{1}{N} \sum_{k=1}^{N} \delta_{y_k}(y)$$
where $\delta_{y_k}$ is a Radon-Nikodym density with respect to the measure $\mu$ of a point-mass distribution concentrated at the point $\{y_k\}$. When $Y$ is discrete, $\delta_{y_k}(y)$ is 1 for $y = y_k$ and 0 elsewhere. When $Y$ is continuous, $\delta_{y_k}(y)$ is a Dirac function, i.e., $\delta_{y_k}(y) = 0$ for $y \neq y_k$ and $\int_{\mathcal{Y}} \delta_{y_k}(y)\, \mu(dy) = 1$. Next we define the Kerridge inaccuracy of $r$ relative to $s$ [10]
$$K(r:s) = -\int_{\mathcal{Y}} r(y) \log s(y)\, \mu(dy).$$
With the two notions, the joint density can be rewritten as
$$p_\theta(y^N) = \exp\left(-N\, K(r_N : s_\theta)\right). \qquad (2)$$
Remark 2.1 Inaccuracy $K(r:s)$ is closely linked with the Shannon entropy of $r$ [11]
$$H(r) = -\int_{\mathcal{Y}} r(y) \log r(y)\, \mu(dy)$$
and the Kullback-Leibler (K-L) distance of $r$ and $s$ [12]
$$D(r \,\|\, s) = \int_{\mathcal{Y}} r(y) \log \frac{r(y)}{s(y)}\, \mu(dy).$$
Indeed, when $Y$ is discrete, we have
$$K(r_N : s_\theta) = H(r_N) + D(r_N \,\|\, s_\theta). \qquad (3)$$
An analogous formula does not hold for continuous $Y$. A formal evaluation gives $H(r_N) = -\infty$ and $D(r_N \,\|\, s_\theta) = \infty$. Yet, $K(r_N : s_\theta)$ is finite under the weak assumption that $s_\theta(y_k) > 0$ for $k = 1, \ldots, N$.
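For a discrete alphabet, relations (2) and (3) are easy to check numerically. The following minimal Python sketch (the three-symbol alphabet, the model density and the data are arbitrary illustrative choices, not taken from the paper) computes $r_N$, $K$, $H$ and $D$ and verifies both identities.

```python
import numpy as np

# Illustrative three-symbol alphabet and an arbitrary model density s_theta.
alphabet = np.array([0, 1, 2])
s_theta = np.array([0.5, 0.3, 0.2])           # model probability mass function
y = np.array([0, 1, 0, 2, 1, 0, 0, 2, 1, 0])  # observed sequence y^N
N = len(y)

# Empirical density r_N(y): relative frequencies of the symbols.
r_N = np.array([(y == a).mean() for a in alphabet])

# Kerridge inaccuracy, Shannon entropy and K-L distance (natural logarithms).
K = -np.sum(r_N * np.log(s_theta))
H = -np.sum(r_N[r_N > 0] * np.log(r_N[r_N > 0]))
D = np.sum(r_N[r_N > 0] * np.log(r_N[r_N > 0] / s_theta[r_N > 0]))

# Relation (3): K(r_N : s_theta) = H(r_N) + D(r_N || s_theta).
assert np.isclose(K, H + D)

# Relation (2): the joint density equals exp(-N K(r_N : s_theta)).
joint = np.prod(s_theta[y])
assert np.isclose(joint, np.exp(-N * K))

print(f"K = {K:.4f}, H = {H:.4f}, D = {D:.4f}, joint = {joint:.3e}")
```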
Remark 2.2 Given a particular sequence $y^N$, the joint density $p_\theta$ can be regarded as a function of the unknown parameters $\theta$, known as likelihood, $l_N(\theta) = p_\theta(y^N)$. Substituting (2) for $p_\theta$, we have
$$K(r_N : s_\theta) = -\tfrac{1}{N} \log l_N(\theta). \qquad (4)$$
Thus, maximizing likelihood is equivalent to minimizing inaccuracy,
$$\arg\max_{\theta \in \Theta} l_N(\theta) = \arg\min_{\theta \in \Theta} K(r_N : s_\theta),$$
provided the extremum points exist. For discrete $Y$, it follows from (3) that maximizing likelihood is equivalent to minimizing K-L distance,
$$\arg\max_{\theta \in \Theta} l_N(\theta) = \arg\min_{\theta \in \Theta} D(r_N \,\|\, s_\theta),$$
since $H(r_N)$ is independent of $\theta$.
Remark 2.3 Owing to (3), inaccuracy can be regarded as a combined measure of uncertainty of $Y$. While $H(r_N)$ measures the intrinsic uncertainty of $Y$ caused by its stochastic behaviour, $D(r_N \,\|\, s_\theta)$ quantifies the increase of uncertainty due to the use of a wrong distribution to predict $Y$. From the statistical point of view, inaccuracy (4) is a negative normalized log-likelihood. The minimum inaccuracy achievable within a class of densities $s_\theta$ is just a transformed value of the maximum likelihood over the class. In coding theory, inaccuracy gives the average length of a code designed for $s_\theta$ rather than $r_N$. While $H(r_N)$ gives the minimum average code length achievable with $r_N$, $D(r_N \,\|\, s_\theta)$ measures the increase of the average length of a code designed for $s_\theta$ (see Theorem 5.4.3 in [13]). The last interpretation links inaccuracy with the minimum description length principle [14].
Example 2.1 (Bernoulli distribution) Consider a simple model of coin tossing where $\mathcal{Y} = \{\mathrm{Head}, \mathrm{Tail}\}$ and $s_\theta(y)$ is $\theta$ if $y = \mathrm{Head}$ and $1-\theta$ if $y = \mathrm{Tail}$. Let $\hat\theta_N$ be the relative frequency of heads observed in the sequence of trials $y^N$. The inaccuracy of the corresponding probability vectors is then
$$K(r_N : s_\theta) = K([\hat\theta_N, 1-\hat\theta_N] : [\theta, 1-\theta]) = H([\hat\theta_N, 1-\hat\theta_N]) + D([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]).$$
It is not difficult to see that
$$K(r_N : s_\theta) - K(r_N : s_{\hat\theta_N}) = D([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]) \ge 0.$$
Thus, $\theta = \hat\theta_N$ minimizes inaccuracy over $\Theta$.
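A quick numerical illustration of Example 2.1 (a minimal sketch; the coin-toss sequence is an arbitrary choice): the relative frequency of heads minimizes the inaccuracy over a grid of candidate values of $\theta$.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # 1 = Head, 0 = Tail (arbitrary data)
theta_hat = y.mean()                            # relative frequency of heads

def inaccuracy(theta):
    # K(r_N : s_theta) for the Bernoulli model of Example 2.1
    return -(theta_hat * np.log(theta) + (1 - theta_hat) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
theta_star = grid[np.argmin([inaccuracy(t) for t in grid])]
print(theta_hat, theta_star)   # the grid minimizer coincides with theta_hat here
```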
Example 2.2 (Normal distribution) Let $Y$ be normally distributed with an unknown mean $\theta$, $Y \sim N(\theta, \sigma^2)$. Straightforward calculations yield
$$K(r_N : s_\theta) = \tfrac{1}{2} \log 2\pi + \tfrac{1}{2} \log \sigma^2 + \tfrac{1}{2\sigma^2} V_N + \tfrac{1}{2\sigma^2} (\theta - \hat\theta_N)^2$$
with $\hat\theta_N = E_N(Y)$, $V_N = E_N(Y^2) - E_N(Y)^2$, where $E_N(X) = \frac{1}{N}\sum_{k=1}^{N} X_k$ stands for the empirical mean of a random variable $X$. Because of the inequality
$$K(r_N : s_\theta) - K(r_N : s_{\hat\theta_N}) = \tfrac{1}{2\sigma^2} (\theta - \hat\theta_N)^2 \ge 0,$$
$\hat\theta_N$ is the minimum inaccuracy estimate of $\theta$.
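The closed form of Example 2.2 can be checked against the definition directly. The sketch below (arbitrary data and a fixed $\sigma^2$, both purely illustrative) compares the negative normalized Gaussian log-likelihood with the decomposition $\tfrac12\log 2\pi + \tfrac12\log\sigma^2 + V_N/(2\sigma^2) + (\theta-\hat\theta_N)^2/(2\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)   # illustrative sample
N = len(y)

theta_hat = y.mean()          # empirical mean E_N(Y)
V_N = y.var()                 # E_N(Y^2) - E_N(Y)^2

for theta in (0.0, 1.0, theta_hat, 3.0):
    # K(r_N : s_theta) from the definition (negative normalized log-likelihood) ...
    K_def = -np.mean(-0.5 * np.log(2 * np.pi * sigma2) - (y - theta) ** 2 / (2 * sigma2))
    # ... and from the closed form of Example 2.2.
    K_closed = (0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma2)
                + V_N / (2 * sigma2) + (theta - theta_hat) ** 2 / (2 * sigma2))
    assert np.isclose(K_def, K_closed)

print("minimum inaccuracy attained at theta_hat =", theta_hat)
```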
Dependent observations
Consider sequences of random variables
$$Y^N = (Y_1, \ldots, Y_N), \qquad U^N = (U_1, \ldots, U_N)$$
with values in sets $\mathcal{Y}$ and $\mathcal{U}$, respectively. Suppose that the output values $Y_k$ depend on past data $U^k$, $Y^{k-1}$ only through a known vector function $Z_k = z(U^k, Y^{k-1}) \in \mathcal{Z}$. Let the distribution $S_\theta$ of $Y_k$ given $Z_k$ be parametrized by $\theta \in \Theta$. Let the distribution $G$ of $U_k$ given $Y^{k-1}$, $U^{k-1}$ be independent of $\theta$. We introduce densities $s_\theta = dS_\theta/d\mu$, $\theta \in \Theta$, and $g = dG/d\eta$ as Radon-Nikodym densities of $S_\theta$ and $G$ with respect to corresponding dominating measures $\mu$ on $\mathcal{Y}$ and $\eta$ on $\mathcal{U}$, respectively. Let $\nu$ be a common dominating measure for the distributions considered on $\mathcal{Z}$.
By elementary rules of probability theory, the density $p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$ with respect to $\mu^N \eta^N$ is
$$p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m) = \prod_{k=m+1}^{N+m} s_\theta(y_k \mid z_k) \prod_{k=m+1}^{N+m} g(u_k \mid y^{k-1}, u^{k-1}). \qquad (5)$$
Here $m$ denotes the minimum number of samples for which $z_{m+1}$ is defined. Thanks to the product form of (5), the $\theta$-dependent part of it can be rewritten as follows.
We introduce an empirical density of observed data
$$r_N(y,z) = \frac{1}{N} \sum_{k=m+1}^{N+m} \delta_{y_k, z_k}(y,z)$$
where $\delta_{y_k, z_k}$ is a Radon-Nikodym density with respect to $\mu\nu$ of a point mass concentrated at $\{(y_k, z_k)\}$. We define the conditional Kerridge inaccuracy as
$$\bar K(r:s) = -\int_{\mathcal{Y}\times\mathcal{Z}} r(y,z) \log s(y \mid z)\, \mu(dy)\, \nu(dz).$$
With these notions, the product of conditional sampling densities $s_\theta(y_k \mid z_k)$ can be put in the form
$$\prod_{k=m+1}^{N+m} s_\theta(y_k \mid z_k) = \exp\left(-N\, \bar K(r_N : s_\theta)\right). \qquad (6)$$
Remark 2.4 Again, there is a close connection between the conditional inaccuracy $\bar K(r:s)$, the conditional Shannon entropy of $r$
$$\bar H(r) = -\int_{\mathcal{Y}\times\mathcal{Z}} r(y,z) \log \frac{r(y,z)}{r(z)}\, \mu(dy)\, \nu(dz)$$
and the conditional K-L distance of $r$ and $s$
$$\bar D(r \,\|\, s) = \int_{\mathcal{Y}\times\mathcal{Z}} r(y,z) \log \frac{r(y,z)}{s(y \mid z)\, r(z)}\, \mu(dy)\, \nu(dz)$$
where $r(z) = \int_{\mathcal{Y}} r(y,z)\, \mu(dy)$ is a marginal density. When $Y$ is discrete, the three quantities are related by
$$\bar K(r_N : s_\theta) = \bar H(r_N) + \bar D(r_N \,\|\, s_\theta). \qquad (7)$$
An analogous formula does not hold for continuous $Y$ since $\bar H(r_N) = -\infty$ and $\bar D(r_N \,\|\, s_\theta) = \infty$ then. Yet, $\bar K(r_N : s_\theta)$ is finite provided that $s_\theta(y_k \mid z_k) > 0$ for $k = m+1, \ldots, N+m$.
Remark 2.5 Given particular sequences $y^{N+m}$ and $u^{N+m}$, the density $p_\theta$ can be regarded as a function of the unknown parameters $\theta$, i.e., likelihood $l_N(\theta) = p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$. Substituting (6) for $p_\theta$ gives
$$\bar K(r_N : s_\theta) = -\tfrac{1}{N} \log l_N(\theta) + c \qquad (8)$$
where $c$ is a constant independent of $\theta$. Thus, maximizing likelihood is equivalent to minimizing conditional inaccuracy,
$$\arg\max_{\theta \in \Theta} l_N(\theta) = \arg\min_{\theta \in \Theta} \bar K(r_N : s_\theta),$$
provided the extremum points exist. For discrete $Y$, it follows from (7) that maximizing likelihood is equivalent to minimizing conditional K-L distance,
$$\arg\max_{\theta \in \Theta} l_N(\theta) = \arg\min_{\theta \in \Theta} \bar D(r_N \,\|\, s_\theta),$$
since $\bar H(r_N)$ is independent of $\theta$.
Remark 2.6 Most of what was said in Remark 2.3 applies to the case of dependent data straightforwardly. Note, however, that the conditional entropy $\bar H(r_N)$ now depends on the structure of a model defined by the choice of $Z_k$. When estimating simultaneously the structure and parameters of a model, we thus have to pay attention to both the entropy and K-L distance components of inaccuracy. One notable example is Akaike's information criterion [4], which follows essentially by taking the expectation of (approximate) conditional inaccuracy.

Example 2.3 (Markov chain) Consider a simple model of "weather forecast" where $Y_k$ denotes the weather on the $k$-th day, $Z_k = Y_{k-1}$, $\mathcal{Y} = \mathcal{Z} = \{\mathrm{Sunny}, \mathrm{Rainy}\}$, and $s_\theta(y \mid z) = \theta$ if $y = z$ and $1-\theta$ if $y \neq z$. Thus $\theta$ stands for the probability of steady weather. Let $\alpha_N$ be the relative frequency of sunny days followed by another sunny day, $\beta_N$ the relative frequency of rainy days followed by another rainy day, and $\gamma_N$ the relative frequency of sunny days followed by any day. Then
$$\bar K(r_N : s_\theta) = \gamma_N\, K([\alpha_N, 1-\alpha_N] : [\theta, 1-\theta]) + (1-\gamma_N)\, K([\beta_N, 1-\beta_N] : [\theta, 1-\theta])$$
$$= K([\hat\theta_N, 1-\hat\theta_N] : [\theta, 1-\theta]) = H([\hat\theta_N, 1-\hat\theta_N]) + D([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta])$$
where $\hat\theta_N = \gamma_N \alpha_N + (1-\gamma_N)\, \beta_N$. The obvious inequality
$$\bar K(r_N : s_\theta) - \bar K(r_N : s_{\hat\theta_N}) = D([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]) \ge 0$$
implies that $\theta = \hat\theta_N$ minimizes inaccuracy over $\Theta$.
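A numerical check of Example 2.3 (a minimal sketch; the weather record is an arbitrary illustrative sequence): counting the transition frequencies $\alpha_N$, $\beta_N$, $\gamma_N$ reproduces both the estimate $\hat\theta_N = \gamma_N\alpha_N + (1-\gamma_N)\beta_N$ and the fact that it minimizes the conditional inaccuracy.

```python
import numpy as np

# 1 = Sunny, 0 = Rainy; an arbitrary weather record used only for illustration.
w = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1])
z, y = w[:-1], w[1:]                      # Z_k = Y_{k-1}; transitions (z_k, y_k)
N = len(y)

alpha = y[z == 1].mean()                  # Sunny -> Sunny relative frequency
beta = (y[z == 0] == 0).mean()            # Rainy -> Rainy relative frequency
gamma = (z == 1).mean()                   # fraction of transitions starting in Sunny
theta_hat = gamma * alpha + (1 - gamma) * beta

# theta_hat is just the overall relative frequency of steady weather ...
assert np.isclose(theta_hat, (y == z).mean())

def cond_inaccuracy(theta):
    # conditional inaccuracy = -(1/N) sum_k log s_theta(y_k | z_k)
    return -np.mean(np.where(y == z, np.log(theta), np.log(1 - theta)))

# ... and it minimizes the conditional inaccuracy over a fine grid of theta.
grid = np.linspace(0.01, 0.99, 981)
best = grid[np.argmin([cond_inaccuracy(t) for t in grid])]
assert abs(best - theta_hat) < 5e-3
print("theta_hat =", theta_hat)
```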
Example 2.4 (ARX model) Let $Y_k \mid Z_k$ be normally distributed with the conditional mean $\theta^T Z_k$, i.e., $Y_k \sim N(\theta^T Z_k, \sigma^2)$. Provided $E_N(ZZ^T) > 0$, we have
$$\bar K(r_N : s_\theta) = \tfrac{1}{2} \log 2\pi + \tfrac{1}{2} \log \sigma^2 + \tfrac{1}{2\sigma^2} V_N + \tfrac{1}{2\sigma^2} (\theta - \hat\theta_N)^T C_N (\theta - \hat\theta_N)$$
with
$$\hat\theta_N = C_N^{-1} E_N(ZY), \qquad V_N = E_N(Y^2) - E_N(YZ^T)\, C_N^{-1} E_N(ZY), \qquad C_N = E_N(ZZ^T)$$
where $E_N(X) = \frac{1}{N}\sum_{k=m+1}^{N+m} X_k$ stands for the empirical mean of a random variable $X$. Clearly, for every $\theta \in \Theta$,
$$\bar K(r_N : s_\theta) - \bar K(r_N : s_{\hat\theta_N}) = \tfrac{1}{2\sigma^2} (\theta - \hat\theta_N)^T C_N (\theta - \hat\theta_N) \ge 0.$$
For $\theta = \hat\theta_N$ inaccuracy achieves its minimum over $\Theta$.
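The quantities of Example 2.4 are ordinary least-squares statistics. A minimal sketch (with a simulated second-order ARX-type regressor, chosen only for illustration) forms $C_N$, $E_N(ZY)$ and $\hat\theta_N = C_N^{-1}E_N(ZY)$ and confirms the quadratic excess of the conditional inaccuracy over its minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([0.6, -0.2])
sigma = 0.5

# Simulate y_k = theta^T z_k + noise with regressor z_k = (y_{k-1}, u_k).
N = 200
u = rng.normal(size=N + 1)
y = np.zeros(N + 1)
for k in range(1, N + 1):
    y[k] = theta_true @ np.array([y[k - 1], u[k]]) + sigma * rng.normal()

Z = np.column_stack([y[:-1], u[1:]])     # rows z_k, k = 1..N
Y = y[1:]

C_N = Z.T @ Z / N                        # E_N(Z Z^T)
b_N = Z.T @ Y / N                        # E_N(Z Y)
theta_hat = np.linalg.solve(C_N, b_N)    # minimum conditional inaccuracy estimate

def cond_inaccuracy(theta):
    resid = Y - Z @ theta
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + np.mean(resid ** 2) / (2 * sigma ** 2)

# Excess inaccuracy equals the quadratic form of Example 2.4 (checked for one theta).
theta = np.array([0.0, 0.0])
lhs = cond_inaccuracy(theta) - cond_inaccuracy(theta_hat)
rhs = (theta - theta_hat) @ C_N @ (theta - theta_hat) / (2 * sigma ** 2)
assert np.isclose(lhs, rhs)
print("theta_hat =", theta_hat)
```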
3. Pythagorean-like relations
Independent observations
Let the sample average of a given vector function $h: \mathcal{Y} \to \mathbb{R}^n$,
$$\frac{1}{N}\sum_{k=1}^{N} h(y_k) = \hat h,$$
be the only information available from observed data. We denote by $\mathcal{R}_h$ the set of all densities $r(y)$ such that
$$\int_{\mathcal{Y}} r(y)\, h(y)\, \mu(dy) = \hat h. \qquad (9)$$
Obviously, $r_N \in \mathcal{R}_h$.

Consider an exponential family $\mathcal{S}_h$ composed of densities
$$s_\lambda(y) = s_0(y)\, \exp\left(\lambda^T h(y) - \psi(\lambda)\right) \qquad (10)$$
where $s_0(y)$ is a fixed density and $\psi(\lambda)$ is the logarithm of the normalizing constant
$$\psi(\lambda) = \log \int_{\mathcal{Y}} s_0(y)\, \exp\left(\lambda^T h(y)\right) \mu(dy).$$
Let $\Lambda$ be the set of $\lambda \in \mathbb{R}^n$ such that $\psi(\lambda) < \infty$.
We say that $s_{\hat\lambda}(y)$ is a h-projection of $r_N(y)$ onto $\mathcal{S}_h$ if both densities give the same expectation of $h(y)$:
$$\int_{\mathcal{Y}} \left(r_N(y) - s_{\hat\lambda}(y)\right) h(y)\, \mu(dy) = 0. \qquad (11)$$
Clearly, $s_{\hat\lambda}$ lies in the intersection $\mathcal{R}_h \cap \mathcal{S}_h$.

Theorem 3.1 Let $s_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $r \in \mathcal{R}_h$ and every $s_\lambda \in \mathcal{S}_h$
$$K(r : s_\lambda) = K(r : s_{\hat\lambda}) + D(s_{\hat\lambda} \,\|\, s_\lambda). \qquad (12)$$
Proof. Combining (11), (9) and (10), we have
$$\int_{\mathcal{Y}} \left(r(y) - s_{\hat\lambda}(y)\right) \log \frac{s_{\hat\lambda}(y)}{s_\lambda(y)}\, \mu(dy) = 0.$$
The proposition follows by the definitions of inaccuracy and K-L distance. □

A h-projection is a solution to two closely related optimization problems. To show it, we need only the following fundamental property of K-L distance (see, e.g., Theorem 2.6.3 in [13]).

Lemma 3.1 For any two densities $r(y)$ and $s(y)$, it holds that $D(r \,\|\, s) \ge 0$. Equality occurs if and only if $r(y) = s(y)$ almost everywhere with respect to $\mu$.

Corollary 3.1 Let $s_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $r \in \mathcal{R}_h$
$$K(r : s_{\hat\lambda}) = \min_{\lambda \in \Lambda} K(r : s_\lambda). \qquad (13)$$
Proof. Theorem 3.1 and Lemma 3.1 together imply
$$K(r : s_{\hat\lambda}) = K(r : s_\lambda) - D(s_{\hat\lambda} \,\|\, s_\lambda) \le K(r : s_\lambda)$$
for every $\lambda \in \Lambda$, with equality if and only if $s_\lambda(y) = s_{\hat\lambda}(y)$ almost everywhere with respect to $\mu$. □

With regard to Remark 2.2, the minimum inaccuracy estimate $\hat\lambda$ is the maximum likelihood estimate for the family $\mathcal{S}_h$. Note that $K(r : s_{\hat\lambda})$ is independent of $\lambda$.

Corollary 3.2 Let $s_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $s_\lambda \in \mathcal{S}_h$
$$D(s_{\hat\lambda} \,\|\, s_\lambda) = \min_{r \in \mathcal{R}_h} D(r \,\|\, s_\lambda). \qquad (14)$$
Proof. Theorem 3.1 implies
$$D(s_{\hat\lambda} \,\|\, s_\lambda) = K(r : s_\lambda) - K(r : s_{\hat\lambda}).$$
Restricting to $r \in \mathcal{R}_h$ such that $D(r \,\|\, s_\lambda) < \infty$, we have by (3)
$$D(s_{\hat\lambda} \,\|\, s_\lambda) = D(r \,\|\, s_\lambda) - D(r \,\|\, s_{\hat\lambda}).$$
Therefore, by Lemma 3.1,
$$D(s_{\hat\lambda} \,\|\, s_\lambda) \le D(r \,\|\, s_\lambda)$$
with equality if and only if $r(y) = s_{\hat\lambda}(y)$ almost everywhere with respect to $\mu$. □

A h-projection $s_{\hat\lambda}$ is thus also a minimum K-L distance (or generalized maximum entropy) estimate of $r \in \mathcal{R}_h$ relative to $s_\lambda$.

Corollary 3.3 Let $s_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $\lambda \in \Lambda$, the likelihood value satisfies
$$l_N(\lambda) = l_N(\hat\lambda)\, \exp\left(-N\, D(s_{\hat\lambda} \,\|\, s_\lambda)\right). \qquad (15)$$
Proof. The proposition follows by taking together the definition $l_N(\lambda) = p_\lambda(y^N)$, (2) and Theorem 3.1. □
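On a finite alphabet the h-projection reduces to moment matching within an exponential family and can be found with a few Newton steps. The sketch below (the alphabet, the carrier density $s_0$, the statistic $h$ and the data are arbitrary illustrative choices) computes $s_{\hat\lambda}$ and verifies the Pythagorean relation (12) for a density $r \in \mathcal{R}_h$ other than $r_N$, as well as the likelihood restoration (15).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative finite alphabet of five symbols; h maps a symbol into R^2.
A = 5
h = np.column_stack([np.arange(A), (np.arange(A) - 2.0) ** 2]).astype(float)
s0 = np.full(A, 1.0 / A)                     # fixed carrier density s_0

def family(lam):
    # s_lambda(y) = s_0(y) exp(lambda^T h(y) - psi(lambda))
    w = s0 * np.exp(h @ lam)
    return w / w.sum()

# Observed data, empirical density r_N and sample average h_hat.
y = rng.integers(0, A, size=100)
N = len(y)
r_N = np.bincount(y, minlength=A) / N
h_hat = r_N @ h

# h-projection: Newton iteration matching E_{s_lambda}[h] = h_hat.
lam_hat = np.zeros(2)
for _ in range(50):
    s = family(lam_hat)
    grad = s @ h - h_hat                     # gradient of psi(lam) - lam^T h_hat
    cov = h.T @ (s[:, None] * h) - np.outer(s @ h, s @ h)
    lam_hat -= np.linalg.solve(cov, grad)
s_hat = family(lam_hat)

def K(r, s): return -np.sum(r * np.log(s))
def D(r, s): return np.sum(r[r > 0] * np.log(r[r > 0] / s[r > 0]))

# Another density r in R_h: perturb r_N within the null space of the constraints
# (unit mass and the two h-moments), so its h-expectation is still h_hat.
ns = np.linalg.svd(np.vstack([np.ones(A), h.T]))[2][3:]
r = r_N + 0.01 * ns[0]
assert np.allclose(r @ h, h_hat) and (r > 0).all()

lam = np.array([0.3, -0.1])                  # an arbitrary member s_lambda of S_h
s_lam = family(lam)

# Pythagorean relation (12): K(r:s_lambda) = K(r:s_hat) + D(s_hat || s_lambda).
assert np.isclose(K(r, s_lam), K(r, s_hat) + D(s_hat, s_lam))

# Likelihood restoration (15), checked on the log scale.
assert np.isclose(np.sum(np.log(s_lam[y])), np.sum(np.log(s_hat[y])) - N * D(s_hat, s_lam))
print("lambda_hat =", lam_hat)
```

Solving the moment-matching condition by Newton steps on the convex function $\psi(\lambda) - \lambda^T \hat h$ is just one convenient choice here; any convex optimizer that matches the moments would do equally well.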
Dependent observations

Let the sample average of a given vector function $h: \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}^n$,
$$\frac{1}{N}\sum_{k=m+1}^{N+m} h(y_k, z_k) = \hat h,$$
be the only information available from observed data $y^{N+m}$, $u^{N+m}$. Let $\mathcal{R}_h$ be the set of densities $r(y,z)$ such that
$$\int_{\mathcal{Y}\times\mathcal{Z}} r(y,z)\, h(y,z)\, \mu(dy)\, \nu(dz) = \hat h. \qquad (16)$$
Obviously, $r_N \in \mathcal{R}_h$.
Consider an exponential family $\mathcal{Q}_h$ composed of densities
$$q_\lambda(y,z) = s_0(y \mid z)\, w(z)\, \exp\left(\lambda^T h(y,z) - \psi(\lambda)\right) \qquad (17)$$
where $s_0(y \mid z)$ and $w(z)$ are fixed densities and $\psi(\lambda)$ is the logarithm of the normalizing constant
$$\psi(\lambda) = \log \int_{\mathcal{Y}\times\mathcal{Z}} s_0(y \mid z)\, w(z)\, \exp\left(\lambda^T h(y,z)\right) \mu(dy)\, \nu(dz).$$
Let $\Lambda$ be the set of $\lambda \in \mathbb{R}^n$ such that $\psi(\lambda) < \infty$.
One can easily compute the conditional density
$$s_\lambda(y \mid z) = s_0(y \mid z)\, \exp\left(\lambda^T h(y,z) - \varphi(\lambda, z)\right) \qquad (18)$$
and the marginal density
$$w_\lambda(z) = w(z)\, \exp\left(\varphi(\lambda, z) - \psi(\lambda)\right) \qquad (19)$$
where
$$\varphi(\lambda, z) = \log \int_{\mathcal{Y}} s_0(y \mid z)\, \exp\left(\lambda^T h(y,z)\right) \mu(dy).$$
Assume that even the marginal density (19) is exponential, i.e., $\varphi(\lambda, z)$ can be factorized as follows
$$\varphi(\lambda, z) = \kappa(\lambda)^T h^*(z) + \kappa_0(\lambda). \qquad (20)$$
Assume, moreover, that the functions $h^*_i(z)$ are linear combinations of the functions $h_j(y,z)$, i.e.,
$$\operatorname{span}\{h^*_i(z)\} \subseteq \operatorname{span}\{h_j(y,z)\}. \qquad (21)$$
This can be ensured by adding $h^*_i(z)$ to $h_j(y,z)$ if necessary.
We say that $q_{\hat\lambda}(y,z)$ is a h-projection of $r_N(y,z)$ onto $\mathcal{Q}_h$ if $r_N$ and $q_{\hat\lambda}$ give the same expectation of $h(Y,Z)$:
$$\int_{\mathcal{Y}\times\mathcal{Z}} \left(r_N(y,z) - q_{\hat\lambda}(y,z)\right) h(y,z)\, \mu(dy)\, \nu(dz) = 0. \qquad (22)$$
Clearly, $q_{\hat\lambda}$ lies in the intersection $\mathcal{R}_h \cap \mathcal{Q}_h$. Owing to the assumption (21), relation (22) implies also
$$\int_{\mathcal{Z}} \left(r_N(z) - w_{\hat\lambda}(z)\right) h^*(z)\, \nu(dz) = 0. \qquad (23)$$
In other words, $w_{\hat\lambda}(z)$ is a $h^*$-projection of $r_N(z)$ onto $\mathcal{W}_{h^*} = \{w_\lambda(z)\}$.
We define the conditional K-L distance of $s$ and $s_0$ given $r$ as
$$\bar D(s \,\|\, s_0 \mid r) = \int_{\mathcal{Y}\times\mathcal{Z}} s(y \mid z)\, r(z)\, \log \frac{s(y \mid z)}{s_0(y \mid z)}\, \mu(dy)\, \nu(dz).$$
Lemma 3.2 Let $s_\lambda(y \mid z)$, $s_{\lambda'}(y \mid z)$ be any two densities of the form (18), i.e., members of $\bar{\mathcal{S}}_h = \{s_\lambda(y \mid z)\}$. Under the assumptions (20)–(21),
$$\bar D(s_\lambda \,\|\, s_{\lambda'} \mid r) = \bar D(s_\lambda \,\|\, s_{\lambda'} \mid r_N)$$
for every $r \in \mathcal{R}_h$.
Proof. Substituting (18) for $s_\lambda$ gives
$$\bar D(s_\lambda \,\|\, s_{\lambda'} \mid r) = \int_{\mathcal{Z}} r(z) \left[ (\lambda - \lambda')^T \hat h(\lambda, z) - \varphi(\lambda, z) + \varphi(\lambda', z) \right] \nu(dz)$$
where
$$\hat h(\lambda, z) = \int_{\mathcal{Y}} s_\lambda(y \mid z)\, h(y,z)\, \mu(dy).$$
One easily verifies that $\hat h(\lambda, z) = \nabla_\lambda \varphi(\lambda, z)$. The proposition then follows by substituting (20) for $\varphi(\lambda, z)$ and taking (21) into account. □

Theorem 3.2 Let $s_{\hat\lambda}(y \mid z)\, w_{\hat\lambda}(z)$ be a h-projection of $r_N(y,z)$ onto $\mathcal{Q}_h$. Then, for every $r \in \mathcal{R}_h$ and every $s_\lambda \in \bar{\mathcal{S}}_h$, the following holds independently of $w$:
$$\bar K(r : s_\lambda) = \bar K(r : s_{\hat\lambda}) + \bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r). \qquad (24)$$
Proof. By Theorem 3.1, the h-projection (22) satisfies the joint Pythagorean relation
$$K(r : s_\lambda w_\lambda) = K(r : s_{\hat\lambda} w_{\hat\lambda}) + D(s_{\hat\lambda} w_{\hat\lambda} \,\|\, s_\lambda w_\lambda).$$
Also by Theorem 3.1, the $h^*$-projection (23) satisfies the marginal Pythagorean relation
$$K(r : w_\lambda) = K(r : w_{\hat\lambda}) + D(w_{\hat\lambda} \,\|\, w_\lambda)$$
where $r$ stands for the marginal density $r(z)$. Subtracting both equations, we get a conditional Pythagorean relation
$$\bar K(r : s_\lambda) = \bar K(r : s_{\hat\lambda}) + \bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid w_{\hat\lambda}).$$
Finally, by Lemma 3.2,
$$\bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid w_{\hat\lambda}) = \bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r)$$
for every $r \in \mathcal{R}_h$. □

A h-projection solves two dual optimization problems again. To solve them, we need the following modification of Lemma 3.1.

Lemma 3.3 For any $r(y,z)$ and $s(y \mid z)$, $\bar D(r \,\|\, s) \ge 0$ with equality if and only if $r(y,z) = s(y \mid z)\, r(z)$ almost everywhere with respect to $\mu\nu$. Similarly, for any $s(y \mid z)$, $s_0(y \mid z)$ and $r(z)$, $\bar D(s \,\|\, s_0 \mid r) \ge 0$ with equality if and only if $s(y \mid z)\, r(z) = s_0(y \mid z)\, r(z)$ almost everywhere with respect to $\mu\nu$.

Corollary 3.4 Let $s_{\hat\lambda} w_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $r \in \mathcal{R}_h$
$$\bar K(r : s_{\hat\lambda}) = \min_{\lambda \in \Lambda} \bar K(r : s_\lambda). \qquad (25)$$
Proof. Theorem 3.2 and Lemma 3.3 together imply
$$\bar K(r : s_{\hat\lambda}) = \bar K(r : s_\lambda) - \bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r) \le \bar K(r : s_\lambda)$$
for every $\lambda \in \Lambda$, where equality holds if and only if $s_\lambda(y \mid z)\, r(z) = s_{\hat\lambda}(y \mid z)\, r(z)$ almost everywhere with respect to $\mu\nu$. □

With regard to Remark 2.5, the minimum conditional inaccuracy estimate $\hat\lambda$ is the maximum likelihood estimate for the family $\bar{\mathcal{S}}_h = \{s_\lambda(y \mid z)\}$. Note that $\bar K(r : s_{\hat\lambda})$ is independent of $\lambda$.
Corollary 3.5 Let $s_{\hat\lambda} w_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $s_\lambda \in \bar{\mathcal{S}}_h$ and every $r \in \mathcal{R}_h$
$$\bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r) = \min_{\tilde r \in \mathcal{R}_h} \bar D(\tilde r \,\|\, s_\lambda). \qquad (26)$$
Proof. Theorem 3.2 implies
$$\bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r) = \bar K(r : s_\lambda) - \bar K(r : s_{\hat\lambda}).$$
Restricting to $r \in \mathcal{R}_h$ such that $\bar D(r \,\|\, s_\lambda) < \infty$, we have by (7)
$$\bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r) = \bar D(r \,\|\, s_\lambda) - \bar D(r \,\|\, s_{\hat\lambda}).$$
Owing to Lemma 3.3,
$$\bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r) \le \bar D(r \,\|\, s_\lambda)$$
with equality if and only if $r(y,z) = s_{\hat\lambda}(y \mid z)\, r(z)$ almost everywhere with respect to $\mu\nu$. The proposition follows by Lemma 3.2. □
A h-projection $s_{\hat\lambda}$ is thus also a minimum conditional K-L distance (generalized maximum entropy) estimate of $r \in \mathcal{R}_h$ relative to $s_\lambda$.

Corollary 3.6 Let $s_{\hat\lambda} w_{\hat\lambda}$ be a h-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $\lambda \in \Lambda$, the likelihood value satisfies
$$l_N(\lambda) = l_N(\hat\lambda)\, \exp\left(-N\, \bar D(s_{\hat\lambda} \,\|\, s_\lambda \mid r)\right) \qquad (27)$$
where $r$ is an arbitrary density from $\mathcal{R}_h$.
Proof. It follows easily by combining the definition $l_N(\lambda) = p_\lambda(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$, relation (6) and Theorem 3.2. □
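The conditional Pythagorean relation can be checked numerically in the same spirit. In the sketch below, the finite alphabets, the uniform carriers $s_0$ and $w$, the data and the statistic $h(y,z) = (1_{\{z=1\}},\, y,\, y\,1_{\{z=1\}})$ are all illustrative assumptions; this choice satisfies (20)–(21) because $h^*(z) = 1_{\{z=1\}}$ is itself a component of $h$. The code computes the h-projection onto $\mathcal{Q}_h$ and verifies (24) for a density $r \in \mathcal{R}_h$ different from $r_N$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Alphabets y in {0,1,2}, z in {0,1}; the six cells are enumerated as pairs (y, z).
Ys, Zs = np.arange(3), np.arange(2)
cells = np.array([(y, z) for z in Zs for y in Ys])
H = np.column_stack([cells[:, 1],                    # h1(y,z) = 1{z = 1}  (= h*(z))
                     cells[:, 0],                    # h2(y,z) = y
                     cells[:, 0] * cells[:, 1]])     # h3(y,z) = y * 1{z = 1}
s0w = np.full(len(cells), 1.0 / len(cells))          # s0(y|z) and w(z) both uniform

def joint_family(lam):
    # q_lambda(y,z) = s0(y|z) w(z) exp(lambda^T h(y,z) - psi(lambda))
    q = s0w * np.exp(H @ lam)
    return q / q.sum()

def conditionals(q):
    # split a joint density over (y,z) into s(y|z) and the z-marginal
    qz = np.array([q[cells[:, 1] == z].sum() for z in Zs])
    return q / qz[cells[:, 1]], qz

# Synthetic data, its empirical density r_N and the compressed statistic h_hat.
data = rng.integers(0, [3, 2], size=(200, 2))
r_N = np.bincount(data[:, 1] * 3 + data[:, 0], minlength=6) / len(data)
h_hat = r_N @ H

# h-projection of r_N onto Q_h: Newton iteration matching E_{q_lambda}[h] = h_hat.
lam_hat = np.zeros(3)
for _ in range(50):
    q = joint_family(lam_hat)
    grad = q @ H - h_hat
    cov = H.T @ (q[:, None] * H) - np.outer(q @ H, q @ H)
    lam_hat -= np.linalg.solve(cov, grad)
s_hat, w_hat = conditionals(joint_family(lam_hat))

# An arbitrary member s_lambda(y|z) of the conditional family, and another
# density r in R_h obtained by a null-space perturbation of r_N.
s_lam, _ = conditionals(joint_family(np.array([0.4, -0.3, 0.2])))
ns = np.linalg.svd(np.vstack([np.ones(6), H.T]))[2][4:]
r = r_N + 0.01 * ns[0]
rz = np.array([r[cells[:, 1] == z].sum() for z in Zs])

def Kbar(r, s):                 # conditional inaccuracy
    return -np.sum(r * np.log(s))
def Dbar(sa, sb, rz):           # conditional K-L distance of sa and sb given r(z)
    return np.sum(sa * rz[cells[:, 1]] * np.log(sa / sb))

# Relation (24): Kbar(r:s_lambda) = Kbar(r:s_hat) + Dbar(s_hat || s_lambda | r).
assert np.isclose(Kbar(r, s_lam), Kbar(r, s_hat) + Dbar(s_hat, s_lam, rz))
print("lambda_hat =", lam_hat)
```

Including the indicator of $z$ among the features is what makes the assumptions (20)–(21) hold automatically in this toy construction, which is exactly the role the span condition plays in the text.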
4. Estimation with compressed data
Exponential family $\mathcal{S}_h = \{s_\theta\}$

Suppose that the family of sampling distributions is exponential and we are to estimate the parameters $\theta$. What Corollaries 3.3 and 3.6 say is that as soon as we know the minimum inaccuracy (maximum likelihood) estimate $\hat\theta$ of $\theta$, the whole likelihood can be restored by evaluating the (possibly conditional) K-L distance between $s_{\hat\theta}$ and $s_\theta$. Combined with Corollaries 3.2 and 3.5, we see that the target object is the K-L distance
$$D(s_{\hat\theta} \,\|\, s_\theta) = \min_{r \in \mathcal{R}_h} D(r \,\|\, s_\theta)$$
in the case of independent data and the conditional K-L distance
$$\bar D(s_{\hat\theta} \,\|\, s_\theta \mid r) = \min_{\tilde r \in \mathcal{R}_h} \bar D(\tilde r \,\|\, s_\theta), \qquad r \in \mathcal{R}_h,$$
in the case of regression-type dependence. In both cases, the sample average (empirical mean) $\hat h$ carries sufficient information for exact restoration of the above functions.

General family $\mathcal{S} = \{s_\theta\}$

Even if $\mathcal{S}$ is not an exponential family or cannot be imbedded in an exponential family of sufficiently low dimension, Theorems 3.1 and 3.2 can be applied, separately for each particular density $s_\theta$.

For independent data, choosing $s_0(y) = s_\theta(y)$ in (10) defines an exponential family $\mathcal{S}_{\theta,h}$ going through the point $s_\theta$. By Corollaries 3.3 and 3.2, we have
$$l_N(\theta) = l_N(\theta, \hat\lambda)\, \exp\Big(-N \min_{r \in \mathcal{R}_h} D(r \,\|\, s_\theta)\Big). \qquad (28)$$
Analogously, for regression-type dependence, choosing $s_0(y \mid z) = s_\theta(y \mid z)$ in (17) defines an exponential family $\bar{\mathcal{S}}_{\theta,h}$ going through $s_\theta$. By Corollaries 3.6 and 3.5, we have
$$l_N(\theta) = l_N(\theta, \hat\lambda)\, \exp\Big(-N \min_{r \in \mathcal{R}_h} \bar D(r \,\|\, s_\theta)\Big). \qquad (29)$$
Compared with (28), the K-L distance is replaced by a conditional one.

Note that $l_N(\theta, \hat\lambda)$ is the maximum value of likelihood for the family $\mathcal{S}_{\theta,h}$. Its value depends on $\theta$, but with carefully chosen functions $h_i$ its variation can be neglected, $l_N(\theta, \hat\lambda) \approx \mathrm{const}$, $\theta \in \Theta$. This suggests approximating the likelihood by the minimum K-L distance between $\mathcal{R}_h$ and $s_\theta$. Note that another argument for this conclusion can be found in large deviation theory [15].
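For a non-exponential family, relation (28) can be exercised numerically: for each candidate $\theta$ one tilts $s_\theta$ exponentially until its $h$-moments match the compressed statistic $\hat h$, and the resulting K-L distance reproduces the exact normalized log-likelihood up to the term $\log l_N(\theta, \hat\lambda)$. The sketch below (a finite alphabet and an arbitrary one-parameter non-exponential family, both illustrative assumptions) checks (28) on a grid of $\theta$ and prints the correction term so that its variation can be inspected.

```python
import numpy as np

rng = np.random.default_rng(4)
A = 6
support = np.arange(A)
h = np.column_stack([support, support ** 2]).astype(float)   # compressed statistic h(y)

def s_theta(theta):
    # an arbitrary one-parameter family that is not exponential in h
    w = 1.0 + np.abs(np.sin(theta * (support + 1)))
    return w / w.sum()

# Data from one member of the family; only the moments h_hat are retained.
y = rng.choice(support, size=300, p=s_theta(1.3))
N = len(y)
r_N = np.bincount(y, minlength=A) / N
h_hat = r_N @ h

def tilt(p, target):
    # minimizer of D(r || p) subject to E_r[h] = target: an exponential tilt of p,
    # i.e. the h-projection of r_N onto the family through p
    lam = np.zeros(h.shape[1])
    for _ in range(100):
        q = p * np.exp(h @ lam)
        q /= q.sum()
        grad = q @ h - target
        cov = h.T @ (q[:, None] * h) - np.outer(q @ h, q @ h)
        lam -= np.linalg.solve(cov, grad)
    return q

def D(r, s):
    return np.sum(r[r > 0] * np.log(r[r > 0] / s[r > 0]))

for theta in (0.8, 1.0, 1.3, 1.7):
    p = s_theta(theta)
    r_star = tilt(p, h_hat)
    minD = D(r_star, p)                              # min_{r in R_h} D(r || s_theta)
    exact = np.sum(np.log(p[y])) / N                 # (1/N) log l_N(theta)
    correction = np.sum(np.log(r_star[y])) / N       # (1/N) log l_N(theta, lambda_hat)
    assert np.isclose(exact, correction - minD)      # relation (28) on the log scale
    print(f"theta={theta:.1f}  minD={minD:.4f}  correction={correction:+.4f}")
```

If the correction term varies little over the range of $\theta$ of interest, the map $\theta \mapsto \min_{r\in\mathcal{R}_h} D(r \,\|\, s_\theta)$ alone already reproduces the shape of the log-likelihood, which is the approximation suggested in the text.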
5. References
[1] J. Wolfowitz, “The minimum distance method”,
Ann. Math. Statist., vol. 28, pp. 75–88, 1957.
[2] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw,
and W. A. Stahel, Robust Statistics: The Approach Based
on Influence Functions, Wiley, New York, 1986.
[3] I. Vajda, Theory of Statistical Inference and Information, Kluwer, Dordrecht, 1989.
[4] H. Akaike, “A new look at the statistical model
identification”, IEEE Trans. Automat. Control, vol. 19,
pp. 716–723, 1974.
[5] B. Hanzon, “A differential-geometric approach to
approximate nonlinear filtering”, in Geometrization of
Statistical Theory, C. T. J. Dodson, Ed., pp. 219–224.
ULDM Publications, Lancaster, England, 1987.
[6] A. A. Stoorvogel and J. H. van Schuppen, “System
identification with information theoretic criteria”, Report
BS-R9513, CWI, Amsterdam, 1995.
[7] N. N. Čencov, Statistical Decision Rules and Optimal Inference, vol. 53 of Transl. of Math. Monographs,
Amer. Math. Soc., Providence, RI, 1982.
[8] I. Csiszár, “I-divergence geometry of probability
distributions and minimization problems”, Ann. Probab.,
vol. 3, pp. 146–158, 1975.
[9] S. Amari, Differential-Geometrical Methods in Statistics, vol. 28 of Lecture Notes in Statistics, Springer-Verlag, Berlin, second edition, 1990.
[10] D. F. Kerridge, “Inaccuracy and inference”, J. Roy.
Statist. Soc. Ser. B, vol. 23, pp. 284–294, 1961.
[11] C. E. Shannon, “A mathematical theory of communication”, Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[12] S. Kullback and R. A. Leibler, “On information and
sufficiency”, Ann. Math. Statist., vol. 22, pp. 79–86, 1951.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[14] J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
[15] R. Kulhavý, “A Kullback-Leibler distance approach to system identification”, in Preprints of the IFAC
Symposium on Adaptive Systems in Control and Signal
Processing, Budapest, Hungary, 1995, pp. 55–66.