A Geometric Approach to Statistical Estimation

Rudolf Kulhavý¹
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic,
P. O. Box 18, 182 08 Prague, Czech Republic
[email protected]

¹ Supported in part by Grant 102/94/0314 of the Czech Grant Agency and Grant 275109 of the Academy of Sciences of the Czech Republic.

Abstract

The role of Kerridge inaccuracy, Shannon entropy and Kullback-Leibler distance in statistical estimation is shown for both discrete and continuous observations. The cases of data independence and regression-type dependence are considered in parallel. Pythagorean-like relations valid for probability distributions are presented and their importance for estimation under compressed data is indicated.

1. Introduction

The rules of probability theory provide a fundamental tool for statistical estimation. It is the computational complexity of these rules, however, that often makes estimation algorithms infeasible. Modified rules are perhaps needed that would be easier to implement, yet close enough to what probability theory prescribes. An appealing way of approximate inference would be to merge statistical and computational uncertainty. The question is how the two kinds of uncertainty should be translated into one language. One possible approach is to use concepts of information theory, namely, to view estimation as the calculation of a certain distance between the empirical and model distributions of data.

The approach is far from being new. In statistics, minimum distance estimation and its consistency for a large class of distances was studied very early [1]. In robust statistics, D-estimators were studied, including the question of how the choice of a particular distance affects robustness [2], [3]. In system identification, information-theoretic distances were used in structure determination [4] and approximation [5], [6].

In this paper, we make use of three information measures, namely inaccuracy, entropy and Kullback-Leibler (K-L) distance, and show how they are related to likelihood. Then we consider the case when a sample average of some prespecified functions is known rather than a complete empirical distribution. A distance between the empirical and model distributions is decomposed into a sum of two distances in a Pythagorean-like way. In [7], [8], [9], a Pythagorean relation was shown to hold for K-L distances. Here we make a slight extension, presenting a Pythagorean-like theorem that links inaccuracy and K-L distance. Apart from giving another unified view of parameter estimation, the result provides a tool for possible approximation of likelihood.

2. Parameter estimation revisited

Independent observations

Consider a sequence of random variables $Y^N = (Y_1, \ldots, Y_N)$ with values in a set $\mathcal{Y}$. Suppose that the $Y_k$ are independent and identically distributed according to a common probability distribution $S_\theta$ parametrized by an unknown parameter $\theta \in \Theta$. To cover the cases of discrete and continuous $Y$ with one notation, we introduce densities $s_\theta = dS_\theta/d\mu$, $\theta \in \Theta$, as Radon-Nikodym densities of $S_\theta$ with respect to a common dominating measure $\mu$ on $\mathcal{Y}$. In particular, when $Y$ is discrete and $\mu$ is a counting measure, the $s_\theta(y)$ are probability mass functions; when $Y$ is continuous and $\mu$ is a Lebesgue measure, the $s_\theta(y)$ are probability density functions.

Owing to the independence assumption, the joint density $p_\theta(y^N)$ with respect to $\mu^N$ is simply

$$p_\theta(y^N) = \prod_{k=1}^{N} s_\theta(y_k). \qquad (1)$$

The product can be expressed in a form more convenient for later approximation.
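To make the product form (1) concrete, the following short numerical sketch (added for illustration; the model and data are hypothetical) evaluates the joint log-density of i.i.d. discrete observations both directly and from the counts of the observed values. The count-based evaluation is exactly what the empirical-density rewrite below exploits.

```python
# A minimal numerical sketch (not from the paper): the joint density (1) of
# i.i.d. discrete observations computed two ways, as a direct product of
# per-sample probabilities and from the counts of distinct values.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: a three-valued discrete variable with pmf s_theta.
s_theta = np.array([0.2, 0.5, 0.3])
y = rng.choice(3, size=100, p=s_theta)          # observed sequence y^N
N = len(y)

# Direct evaluation of (1): p_theta(y^N) = prod_k s_theta(y_k), in log form.
log_p_direct = np.sum(np.log(s_theta[y]))

# Count-based evaluation: only the empirical frequencies of the values matter.
counts = np.bincount(y, minlength=3)
log_p_counts = np.dot(counts, np.log(s_theta))

assert np.isclose(log_p_direct, log_p_counts)
print(log_p_direct, log_p_counts)
```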
We introduce an empirical density of observed data as

$$r_N(y) = \frac{1}{N}\sum_{k=1}^{N} \delta_{y_k}(y)$$

where $\delta_{y_k}$ is the Radon-Nikodym density with respect to the measure $\mu$ of a point-mass distribution concentrated at the point $\{y_k\}$. When $Y$ is discrete, $\delta_{y_k}(y)$ is 1 for $y = y_k$ and 0 elsewhere. When $Y$ is continuous, $\delta_{y_k}(y)$ is a Dirac function, i.e., $\delta_{y_k}(y) = 0$ for $y \neq y_k$ and $\int_{\mathcal{Y}} \delta_{y_k}(y)\,\mu(dy) = 1$. Next we define the Kerridge inaccuracy of $r$ relative to $s$ [10]

$$K(r:s) = -\int_{\mathcal{Y}} r(y)\log s(y)\,\mu(dy).$$

With the two notions, the joint density can be rewritten as

$$p_\theta(y^N) = \exp\bigl(-N\,K(r_N:s_\theta)\bigr). \qquad (2)$$

Remark 2.1 Inaccuracy $K(r:s)$ is closely linked with the Shannon entropy of $r$ [11]

$$H(r) = -\int_{\mathcal{Y}} r(y)\log r(y)\,\mu(dy)$$

and the Kullback-Leibler (K-L) distance of $r$ and $s$ [12]

$$D(r \,\|\, s) = \int_{\mathcal{Y}} r(y)\log\frac{r(y)}{s(y)}\,\mu(dy).$$

Indeed, when $Y$ is discrete, we have

$$K(r_N:s_\theta) = H(r_N) + D(r_N \,\|\, s_\theta). \qquad (3)$$

An analogous formula does not hold for continuous $Y$. A formal evaluation gives $H(r_N) = -\infty$ and $D(r_N \,\|\, s_\theta) = \infty$. Yet, $K(r_N:s_\theta)$ is finite under the weak assumption that $s_\theta(y_k) > 0$ for $k = 1, \ldots, N$.

Remark 2.2 Given a particular sequence $y^N$, the joint density $p_\theta$ can be regarded as a function of the unknown parameter $\theta$, known as the likelihood, $l_N(\theta) = p_\theta(y^N)$. Substituting (2) for $p_\theta$, we have

$$K(r_N:s_\theta) = -\frac{1}{N}\log l_N(\theta). \qquad (4)$$

Thus, maximizing likelihood is equivalent to minimizing inaccuracy,

$$\arg\max_{\theta\in\Theta} l_N(\theta) = \arg\min_{\theta\in\Theta} K(r_N:s_\theta),$$

provided the extremum points exist. For discrete $Y$, it follows from (3) that maximizing likelihood is equivalent to minimizing K-L distance,

$$\arg\max_{\theta\in\Theta} l_N(\theta) = \arg\min_{\theta\in\Theta} D(r_N \,\|\, s_\theta),$$

since $H(r_N)$ is independent of $\theta$.

Remark 2.3 Owing to (3), inaccuracy can be regarded as a combined measure of uncertainty of $Y$. While $H(r_N)$ measures the intrinsic uncertainty of $Y$ caused by its stochastic behaviour, $D(r_N \,\|\, s_\theta)$ quantifies the increase of uncertainty due to the use of a wrong distribution to predict $Y$. From the statistical point of view, inaccuracy (4) is a negative normalized log-likelihood. The minimum inaccuracy achievable within a class of densities $s_\theta$ is just a transformed value of the maximum likelihood over the class. In coding theory, inaccuracy gives the average length of a code designed for $s_\theta$ rather than $r_N$. While $H(r_N)$ gives the minimum average code length achievable with $r_N$, $D(r_N \,\|\, s_\theta)$ measures the increase of the average length of a code designed for $s_\theta$ (see Theorem 5.4.3 in [13]). The last interpretation links inaccuracy with the minimum description length principle [14].

Example 2.1 (Bernoulli distribution) Consider a simple model of coin tossing where $\mathcal{Y} = \{\mathrm{Head}, \mathrm{Tail}\}$ and $s_\theta(y)$ is $\theta$ if $y = \mathrm{Head}$ and $1-\theta$ if $y = \mathrm{Tail}$. Let $\hat\theta_N$ be the relative frequency of heads observed in the sequence of trials $y^N$. The inaccuracy of the corresponding probability vectors is then

$$K(r_N:s_\theta) = K\bigl([\hat\theta_N, 1-\hat\theta_N] : [\theta, 1-\theta]\bigr) = H\bigl([\hat\theta_N, 1-\hat\theta_N]\bigr) + D\bigl([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]\bigr).$$

It is not difficult to see that

$$K(r_N:s_\theta) - K(r_N:s_{\hat\theta_N}) = D\bigl([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]\bigr) \geq 0.$$

Thus, $\theta = \hat\theta_N$ minimizes inaccuracy over $\Theta$.

Example 2.2 (Normal distribution) Let $Y$ be normally distributed with an unknown mean $\theta$, $Y \sim N(\theta, \sigma^2)$. Straightforward calculations yield

$$K(r_N:s_\theta) = \frac{1}{2}\log 2\pi + \frac{1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\,V_N + \frac{1}{2\sigma^2}\bigl(\theta - \hat\theta_N\bigr)^2$$

with $\hat\theta_N = E_N(Y)$ and $V_N = E_N(Y^2) - E_N(Y)^2$, where $E_N(X) = \frac{1}{N}\sum_{k=1}^{N} X_k$ stands for the empirical mean of a random variable $X$. Because of the inequality

$$K(r_N:s_\theta) - K(r_N:s_{\hat\theta_N}) = \frac{1}{2\sigma^2}\bigl(\theta - \hat\theta_N\bigr)^2 \geq 0,$$

$\hat\theta_N$ is the minimum inaccuracy estimate of $\theta$.
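The following sketch (illustrative only; the sample is simulated) reproduces Example 2.1 numerically: it checks the decomposition (3) for a Bernoulli model and confirms that the inaccuracy $K(r_N:s_\theta)$ is minimized at the relative frequency of heads.

```python
# A small sketch (illustrative, not from the paper) of Example 2.1: for a
# Bernoulli model the inaccuracy K(r_N : s_theta) decomposes as
# H(r_N) + D(r_N || s_theta), and it is minimized at the relative frequency.
import numpy as np

rng = np.random.default_rng(1)
y = rng.random(200) < 0.7                        # heads with probability 0.7
theta_hat = y.mean()                             # relative frequency of heads

def inaccuracy(r, s):
    """Kerridge inaccuracy K(r:s) of two probability vectors."""
    return -np.sum(r * np.log(s))

def entropy(r):
    r = r[r > 0]
    return -np.sum(r * np.log(r))

def kl(r, s):
    mask = r > 0
    return np.sum(r[mask] * np.log(r[mask] / s[mask]))

r_N = np.array([theta_hat, 1.0 - theta_hat])     # empirical distribution

thetas = np.linspace(0.01, 0.99, 99)
K = np.array([inaccuracy(r_N, np.array([t, 1.0 - t])) for t in thetas])

# (3): K(r_N:s_theta) = H(r_N) + D(r_N || s_theta) for discrete Y.
t0 = 0.5
assert np.isclose(inaccuracy(r_N, np.array([t0, 1 - t0])),
                  entropy(r_N) + kl(r_N, np.array([t0, 1 - t0])))

# The grid minimizer of the inaccuracy sits next to the relative frequency.
print(thetas[np.argmin(K)], theta_hat)
```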
Dependent observations

Consider sequences of random variables $Y^N = (Y_1, \ldots, Y_N)$ and $U^N = (U_1, \ldots, U_N)$ with values in sets $\mathcal{Y}$ and $\mathcal{U}$, respectively. Suppose that the output values $Y_k$ depend on the past data $U^k$, $Y^{k-1}$ only through a known vector function $Z_k = z(U^k, Y^{k-1}) \in \mathcal{Z}$. Let the distribution $S_\theta$ of $Y_k$ given $Z_k$ be parametrized by $\theta \in \Theta$. Let the distribution $G$ of $U_k$ given $Y^{k-1}$, $U^{k-1}$ be independent of $\theta$. We introduce densities $s_\theta = dS_\theta/d\mu$, $\theta \in \Theta$, and $g = dG/d\nu$ as Radon-Nikodym densities of $S_\theta$ and $G$ with respect to corresponding dominating measures $\mu$ on $\mathcal{Y}$ and $\nu$ on $\mathcal{U}$, respectively. Let $\zeta$ be a common dominating measure for the distributions considered on $\mathcal{Z}$.

By elementary rules of probability theory, the density $p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$ with respect to $\mu^N \nu^N$ is

$$p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m) = \prod_{k=m+1}^{N+m} s_\theta(y_k \mid z_k) \prod_{k=m+1}^{N+m} g(u_k \mid y^{k-1}, u^{k-1}). \qquad (5)$$

Here $m$ denotes the minimum number of samples for which $z_{m+1}$ is defined. Thanks to the product form of (5), the $\theta$-dependent part of it can be rewritten as follows. We introduce an empirical density of observed data

$$r_N(y, z) = \frac{1}{N}\sum_{k=m+1}^{N+m} \delta_{y_k, z_k}(y, z)$$

where $\delta_{y_k, z_k}$ is the Radon-Nikodym density with respect to $\mu\zeta$ of a point mass concentrated at $\{(y_k, z_k)\}$. We define the conditional Kerridge inaccuracy as

$$K(r:s) = -\int_{\mathcal{Y}}\int_{\mathcal{Z}} r(y, z)\log s(y \mid z)\,\mu(dy)\,\zeta(dz).$$

With these notions, the product of the conditional sampling densities $s_\theta(y_k \mid z_k)$ can be put in the form

$$\prod_{k=m+1}^{N+m} s_\theta(y_k \mid z_k) = \exp\bigl(-N\,K(r_N:s_\theta)\bigr). \qquad (6)$$

Remark 2.4 Again, there is a close connection between the conditional inaccuracy $K(r:s)$, the conditional Shannon entropy of $r$,

$$H(r) = -\int_{\mathcal{Y}}\int_{\mathcal{Z}} r(y, z)\log\frac{r(y, z)}{r(z)}\,\mu(dy)\,\zeta(dz),$$

and the conditional K-L distance of $r$ and $s$,

$$D(r \,\|\, s) = \int_{\mathcal{Y}}\int_{\mathcal{Z}} r(y, z)\log\frac{r(y, z)}{s(y \mid z)\,r(z)}\,\mu(dy)\,\zeta(dz),$$

where $r(z) = \int_{\mathcal{Y}} r(y, z)\,\mu(dy)$ is a marginal density. When $Y$ is discrete, the three quantities are related by

$$K(r_N:s_\theta) = H(r_N) + D(r_N \,\|\, s_\theta). \qquad (7)$$

An analogous formula does not hold for continuous $Y$ since $H(r_N) = -\infty$ and $D(r_N \,\|\, s_\theta) = \infty$ then. Yet, $K(r_N:s_\theta)$ is finite provided that $s_\theta(y_k \mid z_k) > 0$ for $k = m+1, \ldots, N+m$.

Remark 2.5 Given particular sequences $y^{N+m}$ and $u^{N+m}$, the density $p_\theta$ can be regarded as a function of the unknown parameter $\theta$, i.e., the likelihood $l_N(\theta) = p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$. Substituting (6) for $p_\theta$ gives

$$K(r_N:s_\theta) = -\frac{1}{N}\log l_N(\theta) + c \qquad (8)$$

where $c$ is a constant independent of $\theta$. Thus, maximizing likelihood is equivalent to minimizing conditional inaccuracy,

$$\arg\max_{\theta\in\Theta} l_N(\theta) = \arg\min_{\theta\in\Theta} K(r_N:s_\theta),$$

provided the extremum points exist. For discrete $Y$, it follows from (7) that maximizing likelihood is equivalent to minimizing conditional K-L distance,

$$\arg\max_{\theta\in\Theta} l_N(\theta) = \arg\min_{\theta\in\Theta} D(r_N \,\|\, s_\theta),$$

since $H(r_N)$ is independent of $\theta$.

Remark 2.6 Most of what was said in Remark 2.3 applies to the case of dependent data straightforwardly. Note, however, that the conditional entropy $H(r_N)$ now depends on the structure of a model defined by the choice of $Z_k$. When estimating simultaneously the structure and parameters of a model, we thus have to pay attention to both the entropy and the K-L distance components of inaccuracy. One notable example is Akaike's information criterion [4], which follows essentially by taking the expectation of (approximate) conditional inaccuracy.

Example 2.3 (Markov chain) Consider a simple model of "weather forecast" where $Y_k$ denotes the weather on the $k$-th day, $Z_k = Y_{k-1}$, $\mathcal{Y} = \mathcal{Z} = \{\mathrm{Sunny}, \mathrm{Rainy}\}$, and $s_\theta(y \mid z) = \theta$ if $y = z$ and $1-\theta$ if $y \neq z$. Thus $\theta$ stands for the probability of steady weather. Let $\alpha_N$ be the relative frequency of sunny days followed by another sunny day, $\beta_N$ the relative frequency of rainy days followed by another rainy day, and $\gamma_N$ the relative frequency of sunny days followed by any day. Then

$$K(r_N:s_\theta) = \gamma_N\,K\bigl([\alpha_N, 1-\alpha_N] : [\theta, 1-\theta]\bigr) + (1-\gamma_N)\,K\bigl([\beta_N, 1-\beta_N] : [\theta, 1-\theta]\bigr)$$
$$= K\bigl([\hat\theta_N, 1-\hat\theta_N] : [\theta, 1-\theta]\bigr) = H\bigl([\hat\theta_N, 1-\hat\theta_N]\bigr) + D\bigl([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]\bigr)$$

where $\hat\theta_N = \gamma_N\,\alpha_N + (1-\gamma_N)\,\beta_N$. The obvious inequality

$$K(r_N:s_\theta) - K(r_N:s_{\hat\theta_N}) = D\bigl([\hat\theta_N, 1-\hat\theta_N] \,\|\, [\theta, 1-\theta]\bigr) \geq 0$$

implies that $\theta = \hat\theta_N$ minimizes inaccuracy over $\Theta$.

Example 2.4 (ARX model) Let $Y_k \mid Z_k$ be normally distributed with the conditional mean $\theta^T Z_k$, $Y_k \sim N(\theta^T Z_k, \sigma^2)$. Provided $E_N(ZZ^T) > 0$, we have

$$K(r_N:s_\theta) = \frac{1}{2}\log 2\pi + \frac{1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\,V_N + \frac{1}{2\sigma^2}\bigl(\theta - \hat\theta_N\bigr)^T C_N \bigl(\theta - \hat\theta_N\bigr)$$

with

$$\hat\theta_N = C_N^{-1} E_N(ZY), \qquad V_N = E_N(Y^2) - E_N(YZ^T)\,C_N^{-1} E_N(ZY), \qquad C_N = E_N(ZZ^T),$$

where $E_N(X) = \frac{1}{N}\sum_{k=m+1}^{N+m} X_k$ stands for the empirical mean of a random variable $X$. Clearly, for every $\theta \in \Theta$,

$$K(r_N:s_\theta) - K(r_N:s_{\hat\theta_N}) = \frac{1}{2\sigma^2}\bigl(\theta - \hat\theta_N\bigr)^T C_N \bigl(\theta - \hat\theta_N\bigr) \geq 0.$$

For $\theta = \hat\theta_N$, inaccuracy achieves its minimum over $\Theta$.
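A numerical sketch of Example 2.4 follows (the data, regressor dimension and noise level are hypothetical, and the noise variance is assumed known): the least-squares estimate built from the empirical moments minimizes the conditional inaccuracy, and the excess inaccuracy equals the quadratic form above.

```python
# A numerical sketch of Example 2.4 (hypothetical data, known noise variance):
# for a linear-in-parameters model Y_k ~ N(theta^T Z_k, sigma^2), the minimum
# conditional inaccuracy (maximum likelihood) estimate is the least-squares
# estimate theta_hat = C_N^{-1} E_N(Z Y) built from empirical moments.
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 500, 0.5
theta_true = np.array([1.0, -0.3])

Z = rng.normal(size=(N, 2))                      # regressors z_k
Y = Z @ theta_true + sigma * rng.normal(size=N)  # outputs y_k

E_ZZ = Z.T @ Z / N                               # C_N = E_N(Z Z^T)
E_ZY = Z.T @ Y / N                               # E_N(Z Y)
theta_hat = np.linalg.solve(E_ZZ, E_ZY)

def cond_inaccuracy(theta):
    """Empirical conditional inaccuracy, i.e. -1/N times the log-likelihood
    up to the term carried by the input sequence (cf. (6) and (8))."""
    resid = Y - Z @ theta
    return 0.5 * np.log(2 * np.pi * sigma**2) + np.mean(resid**2) / (2 * sigma**2)

# The excess inaccuracy equals the quadratic form of Example 2.4.
theta = np.array([0.7, 0.1])
excess = cond_inaccuracy(theta) - cond_inaccuracy(theta_hat)
quad = (theta - theta_hat) @ E_ZZ @ (theta - theta_hat) / (2 * sigma**2)
print(excess, quad)          # the two numbers agree up to rounding
assert np.isclose(excess, quad)
```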
3. Pythagorean-like relations

Independent observations

Let the sample average of a given vector function $h\colon \mathcal{Y} \to \mathbb{R}^n$,

$$\frac{1}{N}\sum_{k=1}^{N} h(y_k) = \bar h,$$

be the only information available from the observed data. We denote by $\mathcal{R}_h$ the set of all densities $r(y)$ such that

$$\int_{\mathcal{Y}} r(y)\,h(y)\,\mu(dy) = \bar h. \qquad (9)$$

Obviously, $r_N \in \mathcal{R}_h$. Consider an exponential family $\mathcal{S}_h$ composed of densities

$$s_\theta(y) = s_0(y)\exp\bigl(\theta^T h(y) - \psi(\theta)\bigr) \qquad (10)$$

where $s_0(y)$ is a fixed density and $\psi(\theta)$ is the logarithm of the normalizing constant,

$$\psi(\theta) = \log\int_{\mathcal{Y}} s_0(y)\exp\bigl(\theta^T h(y)\bigr)\,\mu(dy).$$

Let $\Theta$ be the set of $\theta \in \mathbb{R}^n$ such that $\psi(\theta) < \infty$. We say that $s_{\hat\theta}(y)$ is an $h$-projection of $r_N(y)$ onto $\mathcal{S}_h$ if both densities give the same expectation of $h(y)$,

$$\int_{\mathcal{Y}} \bigl(r_N(y) - s_{\hat\theta}(y)\bigr)\,h(y)\,\mu(dy) = 0. \qquad (11)$$

Clearly, $s_{\hat\theta}$ lies in the intersection $\mathcal{R}_h \cap \mathcal{S}_h$.

Theorem 3.1 Let $s_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $r \in \mathcal{R}_h$ and every $s_\theta \in \mathcal{S}_h$,

$$K(r:s_\theta) = K(r:s_{\hat\theta}) + D(s_{\hat\theta} \,\|\, s_\theta). \qquad (12)$$

Proof. Combining (11), (9) and (10), we have

$$\int_{\mathcal{Y}} \bigl(r(y) - s_{\hat\theta}(y)\bigr)\log\frac{s_{\hat\theta}(y)}{s_\theta(y)}\,\mu(dy) = 0.$$

The proposition follows by the definitions of inaccuracy and K-L distance. □

An $h$-projection is a solution to two closely related optimization problems. To show it, we need only the following fundamental property of K-L distance (see, e.g., Theorem 2.6.3 in [13]).

Lemma 3.1 For any two densities $r(y)$ and $s(y)$, it holds that $D(r \,\|\, s) \geq 0$. The equality occurs if and only if $r(y) = s(y)$ almost everywhere with respect to $\mu$.

Corollary 3.1 Let $s_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $r \in \mathcal{R}_h$,

$$K(r:s_{\hat\theta}) = \min_{\theta\in\Theta} K(r:s_\theta). \qquad (13)$$

Proof. Theorem 3.1 and Lemma 3.1 together imply

$$K(r:s_{\hat\theta}) = K(r:s_\theta) - D(s_{\hat\theta} \,\|\, s_\theta) \leq K(r:s_\theta)$$

for every $\theta \in \Theta$, with equality if and only if $s_\theta(y) = s_{\hat\theta}(y)$ almost everywhere with respect to $\mu$. □

With regard to Remark 2.2, the minimum inaccuracy estimate $\hat\theta$ is the maximum likelihood estimate for the family $\mathcal{S}_h$. Note that $K(r:s_{\hat\theta})$ is independent of $\theta$.

Corollary 3.2 Let $s_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $s_\theta \in \mathcal{S}_h$,

$$D(s_{\hat\theta} \,\|\, s_\theta) = \min_{r\in\mathcal{R}_h} D(r \,\|\, s_\theta). \qquad (14)$$

Proof. Theorem 3.1 implies

$$D(s_{\hat\theta} \,\|\, s_\theta) = K(r:s_\theta) - K(r:s_{\hat\theta}).$$

Restricting to $r \in \mathcal{R}_h$ such that $D(r \,\|\, s_\theta) < \infty$, we have by (3)

$$D(s_{\hat\theta} \,\|\, s_\theta) = D(r \,\|\, s_\theta) - D(r \,\|\, s_{\hat\theta}).$$

Therefore, by Lemma 3.1,

$$D(s_{\hat\theta} \,\|\, s_\theta) \leq D(r \,\|\, s_\theta)$$

with equality if and only if $r(y) = s_{\hat\theta}(y)$ almost everywhere with respect to $\mu$. □

An $h$-projection $s_{\hat\theta}$ is thus also a minimum K-L distance (or generalized maximum entropy) estimate of $r \in \mathcal{R}_h$ relative to $s_\theta$.

Corollary 3.3 Let $s_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{S}_h$. Then for every $\theta \in \Theta$, the likelihood value satisfies

$$l_N(\theta) = l_N(\hat\theta)\exp\bigl(-N\,D(s_{\hat\theta} \,\|\, s_\theta)\bigr). \qquad (15)$$

Proof. The proposition follows by taking together the definition $l_N(\theta) = p_\theta(y^N)$, relation (2) and Theorem 3.1. □
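The following sketch illustrates the $h$-projection and Theorem 3.1 on a finite set (the support, base density $s_0$ and statistic $h$ are chosen arbitrarily for the example): a short Newton iteration matches the mean of $h$ under the family (10) to the sample average, after which the Pythagorean relation (12) holds to machine precision.

```python
# A sketch (with hypothetical support, base density and statistic) of the
# h-projection of Section 3: on a finite set, match the mean of h(y) under the
# exponential family (10) to the sample average, then check the Pythagorean
# relation (12) numerically for the empirical density r_N.
import numpy as np

rng = np.random.default_rng(3)
support = np.arange(5)                       # Y = {0,...,4}
h = support.astype(float)                    # statistic h(y) = y
s0 = np.full(5, 0.2)                         # fixed base density s_0

y = rng.choice(support, size=300, p=[0.1, 0.3, 0.3, 0.2, 0.1])
r_N = np.bincount(y, minlength=5) / len(y)   # empirical density
h_bar = r_N @ h                              # sample average of h

def s(theta):
    """Density (10) of the exponential family S_h."""
    w = s0 * np.exp(theta * h)
    return w / w.sum()

# Newton iteration for the h-projection: solve E_theta[h] = h_bar,
# using psi'(theta) = E_theta[h] and psi''(theta) = Var_theta[h].
theta = 0.0
for _ in range(50):
    p = s(theta)
    mean, var = p @ h, p @ (h - p @ h) ** 2
    theta -= (mean - h_bar) / var
s_proj = s(theta)                            # h-projection s_theta_hat

def K(r, q):  return -np.sum(r[r > 0] * np.log(q[r > 0]))
def D(r, q):  return np.sum(r[r > 0] * np.log(r[r > 0] / q[r > 0]))

# Relation (12): K(r_N:s_theta) = K(r_N:s_theta_hat) + D(s_theta_hat || s_theta).
theta_other = -0.7
lhs = K(r_N, s(theta_other))
rhs = K(r_N, s_proj) + D(s_proj, s(theta_other))
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```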
Dependent observations

Let the sample average of a given vector function $h\colon \mathcal{Y}\times\mathcal{Z} \to \mathbb{R}^n$,

$$\frac{1}{N}\sum_{k=m+1}^{N+m} h(y_k, z_k) = \bar h,$$

be the only information available from the observed data $y^{N+m}$, $u^{N+m}$. Let $\mathcal{R}_h$ be the set of densities $r(y, z)$ such that

$$\int_{\mathcal{Y}}\int_{\mathcal{Z}} r(y, z)\,h(y, z)\,\mu(dy)\,\zeta(dz) = \bar h. \qquad (16)$$

Obviously, $r_N \in \mathcal{R}_h$. Consider an exponential family $\mathcal{Q}_h$ composed of densities

$$q_\theta(y, z) = s_0(y \mid z)\,w(z)\exp\bigl(\theta^T h(y, z) - \psi(\theta)\bigr) \qquad (17)$$

where $s_0(y \mid z)$ and $w(z)$ are fixed densities and $\psi(\theta)$ is the logarithm of the normalizing constant,

$$\psi(\theta) = \log\int_{\mathcal{Y}}\int_{\mathcal{Z}} s_0(y \mid z)\,w(z)\exp\bigl(\theta^T h(y, z)\bigr)\,\mu(dy)\,\zeta(dz).$$

Let $\Theta$ be the set of $\theta \in \mathbb{R}^n$ such that $\psi(\theta) < \infty$. One can easily compute the conditional density

$$s_\theta(y \mid z) = s_0(y \mid z)\exp\bigl(\theta^T h(y, z) - \psi(\theta, z)\bigr) \qquad (18)$$

and the marginal density

$$w_\theta(z) = w(z)\exp\bigl(\psi(\theta, z) - \psi(\theta)\bigr) \qquad (19)$$

where

$$\psi(\theta, z) = \log\int_{\mathcal{Y}} s_0(y \mid z)\exp\bigl(\theta^T h(y, z)\bigr)\,\mu(dy).$$

Assume that even the marginal density (19) is exponential, i.e., $\psi(\theta, z)$ can be factorized as follows:

$$\psi(\theta, z) = \phi(\theta)^T h^*(z) + \phi_0(\theta). \qquad (20)$$

Assume, moreover, that the functions $h^*_i(z)$ are linear combinations of the functions $h_j(y, z)$, i.e.,

$$\operatorname{span}\{h^*_i(z)\} \subset \operatorname{span}\{h_j(y, z)\}. \qquad (21)$$

This can be ensured by adding $h^*_i(z)$ to $h_j(y, z)$ if necessary. We say that $q_{\hat\theta}(y, z)$ is an $h$-projection of $r_N(y, z)$ onto $\mathcal{Q}_h$ if $r_N$ and $q_{\hat\theta}$ give the same expectation of $h(Y, Z)$,

$$\int_{\mathcal{Y}}\int_{\mathcal{Z}} \bigl(r_N(y, z) - q_{\hat\theta}(y, z)\bigr)\,h(y, z)\,\mu(dy)\,\zeta(dz) = 0. \qquad (22)$$

Clearly, $q_{\hat\theta}$ lies in the intersection $\mathcal{R}_h \cap \mathcal{Q}_h$. Owing to the assumption (21), relation (22) implies also

$$\int_{\mathcal{Z}} \bigl(r_N(z) - w_{\hat\theta}(z)\bigr)\,h^*(z)\,\zeta(dz) = 0. \qquad (23)$$

In other words, $w_{\hat\theta}(z)$ is an $h^*$-projection of $r_N(z)$ onto $\mathcal{W}_{h^*} = \{w_\theta(z)\}$. We define the conditional K-L distance of $s_\theta$ and $s_{\theta'}$ given $r$ as

$$D(s_\theta \,\|\, s_{\theta'} \mid r) = \int_{\mathcal{Y}}\int_{\mathcal{Z}} s_\theta(y \mid z)\,r(z)\log\frac{s_\theta(y \mid z)}{s_{\theta'}(y \mid z)}\,\mu(dy)\,\zeta(dz).$$

Lemma 3.2 Let $s_\theta(y \mid z)$, $s_{\theta'}(y \mid z)$ be any two densities from $\mathcal{S}_h$. Under the assumptions (20)–(21),

$$D(s_\theta \,\|\, s_{\theta'} \mid r) = D(s_\theta \,\|\, s_{\theta'} \mid r_N)$$

for every $r \in \mathcal{R}_h$.

Proof. Substituting (18) for $s_\theta$ gives

$$D(s_\theta \,\|\, s_{\theta'} \mid r) = \int_{\mathcal{Z}} r(z)\Bigl((\theta - \theta')^T \hat h(\theta, z) - \psi(\theta, z) + \psi(\theta', z)\Bigr)\,\zeta(dz)$$

where

$$\hat h(\theta, z) = \int_{\mathcal{Y}} s_\theta(y \mid z)\,h(y, z)\,\mu(dy).$$

One easily verifies that $\hat h(\theta, z) = \nabla_\theta\,\psi(\theta, z)$. The proposition follows then by substituting (20) for $\psi(\theta, z)$ and taking (21) into account. □

Theorem 3.2 Let $s_{\hat\theta}(y \mid z)\,w_{\hat\theta}(z)$ be an $h$-projection of $r_N(y, z)$ onto $\mathcal{Q}_h$. Then, for every $r \in \mathcal{R}_h$ and every $s_\theta \in \mathcal{S}_h$, the following holds independently of $w$:

$$K(r:s_\theta) = K(r:s_{\hat\theta}) + D(s_{\hat\theta} \,\|\, s_\theta \mid r). \qquad (24)$$

Proof. By Theorem 3.1, the $h$-projection (22) satisfies the joint Pythagorean relation

$$K(r:s_\theta w_\theta) = K(r:s_{\hat\theta} w_{\hat\theta}) + D(s_{\hat\theta} w_{\hat\theta} \,\|\, s_\theta w_\theta).$$

Also by Theorem 3.1, the $h^*$-projection (23) satisfies the marginal Pythagorean relation

$$K(r:w_\theta) = K(r:w_{\hat\theta}) + D(w_{\hat\theta} \,\|\, w_\theta)$$

where $r$ stands for the marginal density $r(z)$. Subtracting both equations, we get a conditional Pythagorean relation

$$K(r:s_\theta) = K(r:s_{\hat\theta}) + D(s_{\hat\theta} \,\|\, s_\theta \mid w_{\hat\theta}).$$

Finally, by Lemma 3.2, for every $r \in \mathcal{R}_h$,

$$D(s_{\hat\theta} \,\|\, s_\theta \mid w_{\hat\theta}) = D(s_{\hat\theta} \,\|\, s_\theta \mid r). \qquad \Box$$
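As a check of the conditional Pythagorean relation (24), the following sketch revisits the Markov chain of Example 2.3 with the agreement indicator $h(y,z) = 1\{y = z\}$ as statistic (the chain length and transition probability are hypothetical, and the marginal factorization (20) is trivial in this case): the moment-matched conditional density splits the conditional inaccuracy exactly.

```python
# A minimal check (illustrative, on Example 2.3's "steady weather" chain) of
# the conditional Pythagorean relation (24): with the agreement indicator
# h(y,z) = 1{y=z} as statistic, the moment-matched conditional density
# s_theta_hat(y|z) splits the conditional inaccuracy K(r_N:s_theta) exactly.
import numpy as np

rng = np.random.default_rng(4)

# Simulate a two-state Markov chain with steady-weather probability 0.8.
theta_true, N = 0.8, 1000
y = np.empty(N + 1, dtype=int); y[0] = 0
for k in range(1, N + 1):
    y[k] = y[k - 1] if rng.random() < theta_true else 1 - y[k - 1]
pairs = np.stack([y[1:], y[:-1]], axis=1)        # (y_k, z_k) with z_k = y_{k-1}

# Empirical joint density r_N(y,z) on the four cells.
r_N = np.zeros((2, 2))
for yk, zk in pairs:
    r_N[yk, zk] += 1.0 / N

def s(theta):
    """Conditional density s_theta(y|z): theta if y == z, 1 - theta otherwise."""
    return np.array([[theta, 1 - theta], [1 - theta, theta]])

# h-projection: match the expected agreement, i.e. theta_hat equals the
# relative frequency of steady weather (cf. Example 2.3).
theta_hat = r_N[0, 0] + r_N[1, 1]

def K(r, s_cond):                                # conditional inaccuracy
    return -np.sum(r * np.log(s_cond))

def D_cond(sa, sb, rz):                          # D(sa || sb | r), r entering via r(z)
    return np.sum(rz[None, :] * sa * np.log(sa / sb))

rz = r_N.sum(axis=0)                             # marginal r_N(z)
theta = 0.6
lhs = K(r_N, s(theta))
rhs = K(r_N, s(theta_hat)) + D_cond(s(theta_hat), s(theta), rz)
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```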
An $h$-projection solves two dual optimization problems again. To show it, we need the following modification of Lemma 3.1.

Lemma 3.3 For any $r(y, z)$ and $s(y \mid z)$, $D(r \,\|\, s) \geq 0$ with equality if and only if $r(y, z) = s(y \mid z)\,r(z)$ almost everywhere with respect to $\mu\zeta$. Similarly, for any $s(y \mid z)$, $s'(y \mid z)$ and $r(z)$, $D(s \,\|\, s' \mid r) \geq 0$ with equality if and only if $s(y \mid z)\,r(z) = s'(y \mid z)\,r(z)$ almost everywhere with respect to $\mu\zeta$.

Corollary 3.4 Let $s_{\hat\theta} w_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $r \in \mathcal{R}_h$,

$$K(r:s_{\hat\theta}) = \min_{\theta\in\Theta} K(r:s_\theta). \qquad (25)$$

Proof. Theorem 3.2 and Lemma 3.3 together imply

$$K(r:s_{\hat\theta}) = K(r:s_\theta) - D(s_{\hat\theta} \,\|\, s_\theta \mid r) \leq K(r:s_\theta)$$

for every $\theta \in \Theta$, where equality holds if and only if $s_\theta(y \mid z)\,r(z) = s_{\hat\theta}(y \mid z)\,r(z)$ almost everywhere with respect to $\mu\zeta$. □

With regard to Remark 2.5, the minimum conditional inaccuracy estimate $\hat\theta$ is the maximum likelihood estimate for the family $\mathcal{S}_h = \{s_\theta(y \mid z)\}$. Note that $K(r:s_{\hat\theta})$ is independent of $\theta$.

Corollary 3.5 Let $s_{\hat\theta} w_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $s_\theta \in \mathcal{S}_h$ and every $r \in \mathcal{R}_h$,

$$D(s_{\hat\theta} \,\|\, s_\theta \mid r) = \min_{\tilde r\in\mathcal{R}_h} D(\tilde r \,\|\, s_\theta). \qquad (26)$$

Proof. Theorem 3.2 implies

$$D(s_{\hat\theta} \,\|\, s_\theta \mid r) = K(r:s_\theta) - K(r:s_{\hat\theta}).$$

Restricting to $r \in \mathcal{R}_h$ such that $D(r \,\|\, s_\theta) < \infty$, we have by (7)

$$D(s_{\hat\theta} \,\|\, s_\theta \mid r) = D(r \,\|\, s_\theta) - D(r \,\|\, s_{\hat\theta}).$$

Owing to Lemma 3.3,

$$D(s_{\hat\theta} \,\|\, s_\theta \mid r) \leq D(r \,\|\, s_\theta)$$

with equality if and only if $r(y, z) = s_{\hat\theta}(y \mid z)\,r(z)$ almost everywhere with respect to $\mu\zeta$. The proposition follows by Lemma 3.2. □

An $h$-projection $s_{\hat\theta}$ is thus also a minimum conditional K-L distance (generalized maximum entropy) estimate of $r \in \mathcal{R}_h$ relative to $s_\theta$.

Corollary 3.6 Let $s_{\hat\theta} w_{\hat\theta}$ be an $h$-projection of $r_N$ onto $\mathcal{Q}_h$. Then for every $\theta \in \Theta$, the likelihood value satisfies

$$l_N(\theta) = l_N(\hat\theta)\exp\bigl(-N\,D(s_{\hat\theta} \,\|\, s_\theta \mid r)\bigr) \qquad (27)$$

where $r$ is an arbitrary density from $\mathcal{R}_h$.

Proof. It follows easily by combining the definition $l_N(\theta) = p_\theta(y_{m+1}^{N+m}, u_{m+1}^{N+m} \mid y^m, u^m)$, relation (6) and Theorem 3.2. □

4. Estimation with compressed data

Exponential family $\mathcal{S}_h = \{s_\theta\}$

Suppose that the family of sampling distributions is exponential and we are to estimate the parameters $\theta$. What Corollaries 3.3 and 3.6 say is that as soon as we know the minimum inaccuracy (maximum likelihood) estimate $\hat\theta$, the whole likelihood can be restored by evaluating the (possibly conditional) K-L distance between $s_{\hat\theta}$ and $s_\theta$. Combined with Corollaries 3.2 and 3.5, we see that the target object is the K-L distance

$$D(s_{\hat\theta} \,\|\, s_\theta) = \min_{r\in\mathcal{R}_h} D(r \,\|\, s_\theta)$$

in the case of independent data and the conditional K-L distance

$$D(s_{\hat\theta} \,\|\, s_\theta \mid r) = \min_{\tilde r\in\mathcal{R}_h} D(\tilde r \,\|\, s_\theta), \qquad r \in \mathcal{R}_h,$$

in the case of regression-type dependence. In both cases, the sample average (empirical mean) $\bar h$ carries sufficient information for exact restoration of the above functions.

General family $\mathcal{S} = \{s_\theta\}$

Even if $\mathcal{S}$ is not an exponential family or cannot be imbedded in an exponential family of sufficiently low dimension, Theorems 3.1 and 3.2 can be applied separately for each particular density $s_\theta$. For independent data, choosing $s_0(y) = s_\theta(y)$ in (10) defines an exponential family $\mathcal{S}_{\theta,h}$ going through the point $s_\theta$. By Corollaries 3.3 and 3.2, we have

$$l_N(\theta) = l_N(\theta; \hat\alpha)\exp\Bigl(-N \min_{r\in\mathcal{R}_h} D(r \,\|\, s_\theta)\Bigr). \qquad (28)$$

Analogously, for regression-type dependence, choosing $s_0(y \mid z) = s_\theta(y \mid z)$ in (17) defines an exponential family $\mathcal{S}_{\theta,h}$ going through $s_\theta$. By Corollaries 3.6 and 3.5, we have

$$l_N(\theta) = l_N(\theta; \hat\alpha)\exp\Bigl(-N \min_{r\in\mathcal{R}_h} D(r \,\|\, s_\theta)\Bigr). \qquad (29)$$

Compared with (28), the K-L distance is replaced by a conditional one. Note that $l_N(\theta; \hat\alpha)$ is the maximum value of the likelihood for the family $\mathcal{S}_{\theta,h}$. Its value depends on $\theta$, but with carefully chosen functions $h_i$, its variation can be neglected, $l_N(\theta; \hat\alpha) \approx \mathrm{const}$, $\theta \in \Theta$. This suggests approximating the likelihood by the minimum K-L distance between $\mathcal{R}_h$ and $s_\theta$. Note that another argument for this conclusion can be found in large deviation theory [15].
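The point of this section can be illustrated with a Gaussian location family with known variance (an assumption made only for the example): once the data are compressed to the sample average $\bar h$, the likelihood ratio $l_N(\theta)/l_N(\hat\theta)$ is restored exactly from the K-L distance, cf. (15) and (28).

```python
# A sketch (Gaussian location family, known variance, hypothetical data) of
# Section 4's point: once data are compressed to the sample average h_bar,
# the whole likelihood can be restored from the estimate theta_hat and the
# K-L distance D(s_theta_hat || s_theta), cf. (15) and (28).
import numpy as np

rng = np.random.default_rng(5)
sigma, N = 1.5, 400
y = rng.normal(loc=2.0, scale=sigma, size=N)

h_bar = y.mean()                 # compressed data: sample average of h(y) = y
theta_hat = h_bar                # minimum inaccuracy (ML) estimate

def log_lik(theta):
    """Full-data log-likelihood, for reference only."""
    return (-0.5 * N * np.log(2 * np.pi * sigma**2)
            - np.sum((y - theta)**2) / (2 * sigma**2))

def kl_gauss(a, b):
    """D(N(a, sigma^2) || N(b, sigma^2)) for a common known variance."""
    return (a - b) ** 2 / (2 * sigma**2)

theta = 0.5
# Restored from the compressed data alone (up to the value at theta_hat):
restored = -N * kl_gauss(theta_hat, theta)
# Reference computed from the raw data:
reference = log_lik(theta) - log_lik(theta_hat)
print(restored, reference)
assert np.isclose(restored, reference)
```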
5. References

[1] J. Wolfowitz, "The minimum distance method", Ann. Math. Statist., vol. 28, pp. 75–88, 1957.
[2] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York, 1986.
[3] I. Vajda, Theory of Statistical Inference and Information, Kluwer, Dordrecht, 1989.
[4] H. Akaike, "A new look at the statistical model identification", IEEE Trans. Automat. Control, vol. 19, pp. 716–723, 1974.
[5] B. Hanzon, "A differential-geometric approach to approximate nonlinear filtering", in Geometrization of Statistical Theory, C. T. J. Dodson, Ed., pp. 219–224, ULDM Publications, Lancaster, England, 1987.
[6] A. A. Stoorvogel and J. H. van Schuppen, "System identification with information theoretic criteria", Report BS-R9513, CWI, Amsterdam, 1995.
[7] N. N. Čencov, Statistical Decision Rules and Optimal Inference, vol. 53 of Transl. of Math. Monographs, Amer. Math. Soc., Providence, RI, 1982.
[8] I. Csiszár, "I-divergence geometry of probability distributions and minimization problems", Ann. Probab., vol. 3, pp. 146–158, 1975.
[9] S. Amari, Differential-Geometrical Methods in Statistics, vol. 28 of Lecture Notes in Statistics, Springer-Verlag, Berlin, second edition, 1990.
[10] D. F. Kerridge, "Inaccuracy and inference", J. Roy. Statist. Soc. Ser. B, vol. 23, pp. 284–294, 1961.
[11] C. E. Shannon, "A mathematical theory of communication", Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[12] S. Kullback and R. A. Leibler, "On information and sufficiency", Ann. Math. Statist., vol. 22, pp. 79–86, 1951.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[14] J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
[15] R. Kulhavý, "A Kullback-Leibler distance approach to system identification", in Preprints of the IFAC Symposium on Adaptive Systems in Control and Signal Processing, Budapest, Hungary, 1995, pp. 55–66.