Martingales and Information Divergence

Peter Harremoës
University of Copenhagen

Abstract— A new maximal inequality for non-negative martingales is proved. It strengthens a well-known maximal inequality by Doob, and it is demonstrated that the stated inequality is tight. The inequality emphasizes the relation between martingales and information divergence, and it implies convergence of bounded martingales. Applications to the Shannon-McMillan-Breiman theorem and to Markov chains are mentioned.

Index Terms— Information divergence, martingale, convergence, maximal inequality.

I. INTRODUCTION

Comparing results from probability theory and information theory is not a new idea. Some convergence theorems in probability theory can be reformulated as "the entropy converges to its maximum". A. Rényi [1] used information divergence to prove convergence of Markov chains to equilibrium on a finite state space. Later I. Csiszár [2] and Kendall [3] extended Rényi's method to prove convergence on countable state spaces. Their proofs essentially use the fact that information divergence is a Csiszár $f$-divergence. Later J. Fritz [4] used information-theoretic arguments to establish convergence in total variation of reversible Markov chains. Recently A. Barron [5] improved Fritz' method and proved convergence in information divergence. Other limit theorems have been proved using information-theoretic methods. The Central Limit Theorem was treated by Linnik [6] and A. Barron [7], the Local Central Limit Theorem was treated by S. Takano [8], and Poisson's law was treated by P. Harremoës [9]. There has also been work strengthening weak or strong convergence to convergence in information divergence. All the above mentioned papers have results of this kind, but also work by A. Barron [5] should be mentioned. Some work has also been done where the limit of a sequence is identified as an information projection. The most important paper in this direction is due to I. Csiszár [10].

The law of large numbers exists in different versions. The weak law of large numbers states that an empirical average is close to the theoretical mean with high probability for a large sample size. Using large deviation bounds it is easy to prove the weak law of large numbers. The strong law of large numbers states that the empirical average converges to the theoretical mean with probability one. Large deviation bounds give an exponentially decreasing probability of a large deviation, which implies the strong law of large numbers. These large deviation bounds are closely related to information theory.

There are two important ways of generalizing the law of large numbers. One is in the direction of martingales. The other is in the direction of ergodic processes. For both martingales and ergodic processes the generalization of the law of large numbers exists in a weak and a strong version. The weak versions, convergence of martingales in mean and the mean ergodic theorem, can be proved using information-theoretic methods. The strong versions, pointwise convergence of martingales and the individual ergodic theorem, are usually proved using very different techniques. In this paper we shall see that pointwise convergence of martingales can also be proved using information-theoretic methods.

Let $P$ and $Q$ be probability measures. Then the information divergence from $P$ to $Q$ is defined by $D(P\|Q) = \int \log \frac{dP}{dQ}\, dP$ if $P \ll Q$, and $D(P\|Q) = \infty$ otherwise. This quantity is also called the Kullback-Leibler discrimination or relative entropy.
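To make the definition concrete, here is a small Python sketch (purely illustrative and not part of the original paper; the function name `information_divergence` and the example distributions are my own choices) that computes $D(P\|Q)$ for two distributions on a common finite alphabet, returning infinity when $P$ is not absolutely continuous with respect to $Q$.

```python
import math

def information_divergence(p, q):
    """Information divergence (Kullback-Leibler divergence) D(P||Q) in nats
    for probability vectors p and q over the same finite alphabet.
    Returns math.inf when P is not absolutely continuous with respect to Q."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                 # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf          # P puts mass where Q does not: D = infinity
        d += pi * math.log(pi / qi)
    return d

if __name__ == "__main__":
    print(information_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))   # finite value
    print(information_divergence([1.0, 0.0], [0.0, 1.0]))             # inf
```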
Information divergence does not define a metric, but it is related to total variation via Pinsker's inequality, proved by I. Csiszár [11] and others. If $(P_n)$ is a sequence of probability distributions, we say that $P_n$ converges to $P$ in information if $D(P_n \| P) \to 0$ for $n \to \infty$. Pinsker's inequality shows that convergence in information is a stronger condition than convergence in total variation. See [12] for details about topologies related to information divergence.

Barron used properties of information divergence to prove mean convergence of martingales in [5]. The basic result is that information divergence is continuous under the formation of the intersection of a decreasing sequence of $\sigma$-algebras. The same technique can be used to obtain a classical result of Pinsker [13] about continuity under an increasing sequence of $\sigma$-algebras. In order to prove almost sure convergence of martingales Barron just remarks that mean convergence implies pointwise convergence according to the maximal inequalities. The main result of the present paper is the following maximal inequality.

Theorem 1: Let $(X_n)$ be a positive martingale and put $X^* = \sup_n X_n$. Then the maximal inequality (1) holds.

Using this maximal inequality, pointwise convergence of a martingale follows very easily.

II. SOME NEW MAXIMAL INEQUALITIES

Let $(\Omega, \mathcal{F})$ be a measurable space with a probability measure $P$. All mean values will be calculated with respect to $P$. The following inequalities are well known; they are stated here only for completeness.

Lemma 2: Let $(X_n)$ be a martingale. Put $X^* = \sup_n X_n$ and $X_* = \inf_n X_n$. Then Inequalities (2) and (3) hold.

Proof: Let $\tau$ denote the stopping time associated with the maximum. Then the chain (4)–(7) gives Inequality (2). Let $\sigma$ denote the stopping time associated with the minimum. Then the chain (8)–(11) gives Inequality (3).

Lemma 3: Let $(X_n)$ be a positive martingale and put $X^* = \sup_n X_n$. Then Inequality (12) holds, and from it we get (13)–(15).

The following function $\varphi$ will play an important role in what follows. It is defined for positive arguments, and we remark that it is strictly convex with a unique minimum. Theorem 1 can now be stated as follows.

Theorem 4: Let $(X_n)$ be a positive martingale and put $X^* = \sup_n X_n$. Then Inequality (16) holds.

Proof: Using Inequality (14) we obtain the chain (17)–(26), and Inequality (27) is obtained by reorganizing the terms. Inequality (28), the corresponding bound involving the minimum, is proved in the same way via (29)–(33).

Now we shall see that Inequality (27) is tight. Thus we have to consider a martingale for which the points on the curve are attained. To construct such an example the time parameter has to be continuous. Let the unit interval be equipped with the Lebesgue measure, and for $t \in [0,1]$ let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the Borel measurable sets on $[0,t]$ and the set $(t,1]$. Let $X$ denote the function (random variable) on the unit interval given by (34); remark that it is increasing. The conditional expectation of $X$ given $\mathcal{F}_t$ can be written down explicitly, and the supremum is attained. This implies (35), and thus (43) holds.

[Figure: the example on the unit interval; the conditional expectation is illustrated.]

If in the above example the defining function is instead chosen positive and decreasing, and the supremum is replaced by the infimum, then all the calculations are the same, and therefore also Inequality (28) is tight.

III. CONVERGENCE OF MARTINGALES

In order to prove convergence of martingales we have to reorganize our inequalities somewhat.

Theorem 5: Let $(X_n)$ be a martingale, and let $Q$ be the probability measure given by the associated density. Then Inequality (44) holds.

Proof: For each value of the parameter we have (45). Using convexity of $\varphi$ this leads to (36)–(38) and (46). Again a similar inequality is satisfied for the minimum of a martingale, i.e. (47) together with (39) and (40).
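Before continuing the convergence argument, a brief numerical aside (illustrative only; the multiplicative toy martingale and all names below are my own choices, not constructions from the paper): the sketch checks by Monte Carlo the classical bound $P(\sup_k X_k \ge c) \le E[X_0]/c$ for a non-negative martingale, the kind of well-known maximal inequality that Theorem 1 strengthens. Since the simulation only tracks a finite horizon, the empirical frequency is a lower estimate of the left-hand side, and it should stay below $1/c$.

```python
import random

def running_maximum(n_steps, rng):
    """One path of the non-negative martingale X_k = prod_{i<=k} (2 U_i),
    where the U_i are i.i.d. uniform on (0, 1), so E[X_k] = 1 for all k.
    Returns max_{0<=k<=n_steps} X_k (with X_0 = 1)."""
    x, x_max = 1.0, 1.0
    for _ in range(n_steps):
        x *= 2.0 * rng.random()
        x_max = max(x_max, x)
    return x_max

if __name__ == "__main__":
    rng = random.Random(0)
    c, n_paths, n_steps = 5.0, 20000, 200
    hits = sum(running_maximum(n_steps, rng) >= c for _ in range(n_paths))
    print("empirical P(max_k X_k >= c):", hits / n_paths)
    print("classical bound E[X_0]/c   :", 1.0 / c)
```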
Now, let $(X_n)$ be a martingale. Without loss of generality we will assume that it is suitably normalized. Then (48) holds, and from (41) and (42) we see that the resulting sequence is increasing. Assume that it is bounded. Then the corresponding differences converge to 0 for $n$ tending to infinity. In particular $(X_n)$ is a Cauchy sequence with probability one. Therefore the martingale converges pointwise almost surely.

IV. DISCUSSION

Theorem 1 can be seen as a strengthening of a classical maximal inequality by Doob, which states Inequality (49). A plot comparing the two inequalities shows that Doob's inequality corresponds to a tangent to the function $\varphi$ (Fig. 1). Thus the new inequality is superior to Doob's inequality in a neighborhood of 1, and it is the behavior in this region which implies convergence of the martingale. Only in a neighborhood of the point of tangency is Doob's inequality optimal.

Fig. 1. The function $\varphi$ and Doob's bound corresponding to the tangent.

In this paper upper bounds involving the maximum and lower bounds involving the minimum of a martingale are given in terms of information divergence, and each of the bounds is shown to be tight. In the example the tightness is obtained for a certain value of a parameter, and the upper and the lower bounds are tight for different values of this parameter. Therefore a tighter combined bound is possible. Such tighter bounds would be highly interesting and are an obvious subject for further investigation.

Convergence of martingales is used in the proof of the Shannon-McMillan-Breiman theorem [14]. It is interesting that exactly the finiteness condition appearing here is what is needed in that theorem. One should also remark that convergence of Markov chains can be proved using ideas from the theory of information projections, as pointed out by Barron in [5]. The obvious applications of the theorems presented here to Markov chains have been postponed to a future article.

In this manuscript the focus has been on random variables which form martingales. We have seen that the upper bound can be given in terms of the information divergence from one probability measure to another, but the quantity appearing in the lower bound can also be expressed in terms of measures. In general the relevant random variable is not a probability density (it is one only if all the random variables are equal), but it defines a measure given by (50) and (51). Hence we have (52). One may ask if this inequality also holds in other situations. If we have (53), where the norm is the total variation norm, then Inequality (52) states that (54). This inequality was proved by Volkonskij and Rozanov [15] and was later refined to Pinsker's inequality; see [16] for more details about the history of this problem. Beyond this case, however, Inequality (52) does not hold in general.

Let $(X_n)$ be a martingale with respect to a filtration $(\mathcal{F}_n)$. Then the chain (55)–(60) shows that the two measures have the same restriction to $\mathcal{F}_n$. Actually, the measure constructed above is the information projection of the original measure into the convex set of probability measures having the same restriction to $\mathcal{F}_n$. With these observations we are able to state the following conjecture.

Conjecture 6: Let $P$ be a probability measure and let $(C_n)$ be a decreasing sequence of convex sets of probability measures. Let $P_n$ denote the information projection of $P$ on $C_n$. Then the convergence statement (61) holds.

Similar conjectures can be formulated for a sequence of maximum likelihood estimates and for the minimum of measures instead of the maximum. The conjecture is true in the situation considered above and in a number of other special cases. If the conjecture is true, a number of new convergence theorems will hold. For instance, a sequence of maximum entropy distributions on smaller and smaller sets will converge pointwise almost surely if some weak regularity conditions are fulfilled.
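To illustrate the objects appearing in Conjecture 6, the following Python sketch (again illustrative; the constraint set, the use of scipy.optimize.brentq, and all names are my own assumptions, not constructions from the paper) computes the information projection of a finite distribution onto the convex set of distributions with a prescribed mean. It relies on the standard fact that under a linear constraint the I-projection is an exponential tilting of the original distribution. A decreasing sequence of such constraint sets would give a sequence of projections of the kind considered in the conjecture.

```python
import math
from scipy.optimize import brentq

def i_projection_mean_constraint(q, f, a):
    """Information projection of the distribution q onto the convex set
    C = {p : sum_i p_i * f_i = a}. Under such a linear constraint the
    projection has the form p_i proportional to q_i * exp(t * f_i),
    with t chosen so that the constraint is satisfied."""
    def tilted(t):
        w = [qi * math.exp(t * fi) for qi, fi in zip(q, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean_gap(t):
        return sum(pi * fi for pi, fi in zip(tilted(t), f)) - a

    # assumes a lies strictly between min(f) and max(f), so a root exists
    t_star = brentq(mean_gap, -50.0, 50.0)
    return tilted(t_star)

def divergence(p, q):
    """D(P||Q) in nats for finite distributions with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    q = [0.25, 0.25, 0.25, 0.25]     # reference distribution on {1, 2, 3, 4}
    f = [1.0, 2.0, 3.0, 4.0]         # constraint statistic
    p_star = i_projection_mean_constraint(q, f, a=3.0)
    print("I-projection:", [round(x, 4) for x in p_star])
    print("D(P*||Q)    :", round(divergence(p_star, q), 4))
```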
REFERENCES

[1] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. Math. Statist. and Prob., vol. 1, Berkeley, pp. 547–561, Univ. Calif. Press, 1961.
[2] I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 95–108, 1963.
[3] D. G. Kendall, "Information theory and the limit theorem for Markov chains and processes with a countable infinity of states," Ann. Inst. Stat. Math., vol. 15, pp. 137–143, 1964.
[4] J. Fritz, "An information-theoretical proof of limit theorems for reversible Markov processes," in Trans. Sixth Prague Conf. on Inform. Theory, Statist. Decision Functions, Random Processes (Prague, Sept. 1971), Czech. Acad. Science, Academia Publ., Prague, 1973.
[5] A. R. Barron, "Limits of information, Markov chains, and projections," in Proceedings 2000 International Symposium on Information Theory, p. 25, 2000.
[6] Y. V. Linnik, "An information-theoretic proof of the central limit theorem with Lindeberg condition," Theory Probab. Appl., vol. 4, pp. 288–299, 1959.
[7] A. R. Barron, "Entropy and the Central Limit Theorem," Ann. Probab., vol. 14, no. 1, pp. 336–342, 1986.
[8] S. Takano, "Convergence of entropy in the central limit theorem," Yokohama Mathematical Journal, vol. 35, pp. 143–148, 1987.
[9] P. Harremoës, "Binomial and Poisson distributions as maximum entropy distributions," IEEE Trans. Inform. Theory, vol. 47, pp. 2039–2041, July 2001.
[10] I. Csiszár, "Sanov property, generalized I-projection and a conditional limit theorem," Ann. Probab., vol. 12, pp. 768–793, 1984.
[11] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[12] P. Harremoës, "Information topologies with applications," special volume of the Bolyai Society Mathematical Studies series, Springer, to appear.
[13] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. Moscow: Izv. Akad. Nauk, 1960. In Russian.
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 1991.
[15] V. A. Volkonskij and J. A. Rozanov, "Some limit theorems for random functions I," Theory Probab. Appl., vol. 4, pp. 178–197, 1959.
[16] A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Trans. Inform. Theory, vol. 49, pp. 1491–1498, June 2003.