Martingales and Information Divergence

Peter Harremoës
University of Copenhagen
Abstract— A new maximal inequality for non-negative martingales is proved. It strengthens a well-known maximal inequality by Doob, and it is demonstrated that the stated inequality is tight. The inequality emphasizes the relation between martingales and information divergence, and it implies convergence of bounded martingales. Applications to the Shannon-McMillan-Breiman theorem and to Markov chains are mentioned.
Index Terms— Information divergence, martingale, convergence, maximal inequality.
I. INTRODUCTION
Comparing results from probability theory and information theory is not a new idea. Some convergence theorems in probability theory can be reformulated as "the entropy converges to its maximum". A. Rényi [1] used information divergence to prove convergence of Markov chains to equilibrium on a finite state space. Later I. Csiszár [2] and Kendall [3] extended Rényi's method to provide convergence on countable state spaces. Their proofs basically use the fact that information divergence is a Csiszár $f$-divergence. Later J. Fritz [4] used information-theoretic arguments to establish convergence in total variation of reversible Markov chains. Recently A. Barron [5] improved Fritz' method and proved convergence in information divergence. Other limit theorems have been proved using information-theoretic methods. The Central Limit Theorem was treated by Linnik [6] and A. Barron [7], the Local Central Limit Theorem was treated by S. Takano [8], and Poisson's law was treated by P. Harremoës [9]. There has also been work strengthening weak or strong convergence to convergence in information divergence. All the above-mentioned papers have results of this kind, but also work by A. Barron [5] should be mentioned. Some work has also been done where the limit of a sequence is identified as an information projection. The most important paper in this direction is due to I. Csiszár [10].
The law of large numbers exists in different versions. The
weak law of large numbers states that an empirical average is
close to the theoretical mean with high probability for a large
sample size. Using large deviation bounds it is easy to prove
the weak law of large numbers. The strong law of large numbers states that the empirical average converges to the theoretical mean with probability one. Large deviation bounds give an
exponentially decreasing probability of a large deviation which
implies the strong law of large numbers. These large deviation bounds are closely related to information theory. There are
two important ways of generalizing the law of large numbers.
One is in the direction of martingales. The other is in the direction of ergodic processes. For both martingales and ergodic
processes the generalization of the law of large numbers exists
in a weak and a strong version. The weak versions, convergence
of martingales in mean and the mean ergodic theorem, can be proved using information-theoretic methods. The strong versions, pointwise convergence of martingales and the individual ergodic theorem, can be proved using very different techniques. In this paper we shall see that pointwise convergence of martingales can also be proved using information-theoretic methods.
Let $P$ and $Q$ be probability measures. Then the information divergence from $P$ to $Q$ is defined by
$$D(P\|Q)=\begin{cases}\int \ln\frac{dP}{dQ}\,dP & \text{if } P\ll Q,\\[2pt] \infty & \text{otherwise.}\end{cases}$$
This quantity is also called the Kullback-Leibler discrimination or relative entropy. Information divergence does not define a metric, but it is related to total variation via Pinsker's inequality
$$D(P\|Q)\ge \tfrac{1}{2}\,\|P-Q\|^{2},$$
proved by I. Csiszár [11] and others. If $P_1, P_2, \ldots$ is a sequence of probability distributions, we say that $P_n$ converges to $Q$ in information if $D(P_n\|Q)\to 0$ for $n\to\infty$. Pinsker's inequality shows that convergence in information is a stronger condition than convergence in total variation. See [12] for details about topologies related to information divergence.
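As a concrete numerical illustration of these definitions, the sketch below computes the divergence (in nats) and the total variation norm for two arbitrarily chosen distributions on a three-point alphabet and checks Pinsker's inequality; the choice of distributions is purely illustrative.

```python
import numpy as np

def divergence(p, q):
    """Information divergence D(P||Q) = sum p_i ln(p_i/q_i), in nats.
    Returns infinity if P is not absolutely continuous w.r.t. Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0                      # terms with p_i = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """Total variation norm ||P - Q|| = sum |p_i - q_i| (values in [0, 2])."""
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

# Two arbitrary distributions on a three-point alphabet (illustrative only).
P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

D = divergence(P, Q)
V = total_variation(P, Q)
print(D, V, D >= 0.5 * V**2)   # Pinsker: D(P||Q) >= ||P-Q||^2 / 2
```

Replacing $Q$ by a distribution with a zero where $P$ has positive mass makes the divergence infinite while the total variation norm stays at most 2; this asymmetry is one way to see that convergence in total variation does not imply convergence in information.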
Barron used properties of information divergence to prove mean convergence of martingales in [5]. The basic result is that information divergence is continuous under the formation of the intersection of a decreasing sequence of $\sigma$-algebras. The same technique can be used to obtain a classical result of Pinsker [13] about continuity under an increasing sequence of $\sigma$-algebras. In order to prove almost sure convergence of martingales Barron just remarks that mean convergence implies pointwise convergence according to the maximal inequalities. The main result is the following maximal inequality.
Theorem 1: Let $X_1, X_2, \ldots$ be a positive martingale, and put $M_n = \max_{k \le n} X_k$. If $E[X_n] = 1$, then
(1)
Using this maximal inequality, pointwise convergence of a martingale follows very easily.
II. SOME NEW MAXIMAL INEQUALITIES
Let $(\Omega, \mathcal{F})$ be a measurable space with a probability measure $P$. All mean values will be calculated with respect to $P$. The following inequalities are well known. Here they are just stated for completeness.
Lemma 2: Let $X_1, X_2, \ldots, X_n$ be a martingale. Put $M_n = \max_{k \le n} X_k$ and $m_n = \min_{k \le n} X_k$. Then, for every $t > 0$,
$$t \, P(M_n \ge t) \le E\!\left[X_n 1_{\{M_n \ge t\}}\right] \qquad (2)$$
and
$$t \, P(m_n \le -t) \le E\!\left[X_n 1_{\{m_n > -t\}}\right] - E[X_1]. \qquad (3)$$
Proof: Let $\tau$ denote the stopping time $\tau = \min\{k \le n : X_k \ge t\}$, with $\tau = n$ if no such $k$ exists. Then, by optional stopping,
$$E[X_n] = E[X_\tau] = E\!\left[X_\tau 1_{\{M_n \ge t\}}\right] + E\!\left[X_n 1_{\{M_n < t\}}\right] \ge t \, P(M_n \ge t) + E\!\left[X_n 1_{\{M_n < t\}}\right],$$
which gives Inequality (2). Let $\sigma$ denote the stopping time $\sigma = \min\{k \le n : X_k \le -t\}$, with $\sigma = n$ if no such $k$ exists. Then
$$E[X_1] = E[X_\sigma] = E\!\left[X_\sigma 1_{\{m_n \le -t\}}\right] + E\!\left[X_n 1_{\{m_n > -t\}}\right] \le -t \, P(m_n \le -t) + E\!\left[X_n 1_{\{m_n > -t\}}\right],$$
which gives Inequality (3).
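For a non-negative martingale with $E[X_n] = 1$, the classical Doob bound gives $P(M_n \ge t) \le 1/t$. The following sketch is a Monte Carlo check of this bound on a product-type martingale; the log-normal factors and all parameter values are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive martingale with E[X_k] = 1: partial products of i.i.d.
# log-normal factors Z_j = exp(sigma*G_j - sigma^2/2), which have mean one.
n_steps, n_paths, sigma = 50, 50_000, 0.3
G = rng.standard_normal((n_paths, n_steps))
Z = np.exp(sigma * G - sigma**2 / 2)
X = np.cumprod(Z, axis=1)              # X[:, k-1] = X_k
M = np.maximum.accumulate(X, axis=1)   # running maximum M_k

for t in (1.5, 2.0, 4.0, 8.0):
    tail = np.mean(M[:, -1] >= t)      # empirical P(M_n >= t)
    print(f"t={t}: empirical P(M_n>=t) = {tail:.4f},  Doob bound 1/t = {1/t:.4f}")
```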
Lemma 3: Let $X_1, X_2, \ldots$ be a positive martingale. Put $M_n = \max_{k \le n} X_k$ and $m_n = \min_{k \le n} X_k$. If $E[X_n] = 1$, then
(12)
and
(13)
Proof: By using that $E[X_n] = 1$, Lemma 2 implies that
(14)
Similarly, we get
(15)
The following function will play an important role in what follows. Put
$$\phi(x) = x - 1 - \ln x \qquad \text{for } x > 0.$$
Remark that $\phi$ is strictly convex with a minimum at $x = 1$. Theorem 1 can now be stated as:
Theorem 4: Let $X_1, X_2, \ldots$ be a positive martingale. Put $M_n = \max_{k \le n} X_k$ and $m_n = \min_{k \le n} X_k$. If $E[X_n] = 1$, then
(27)
and
(28)
Proof: Using the inequality $\dots$ we get a chain of estimates. Inequality (27) is obtained by reorganizing the terms. Inequality (28) is proved in the same way.
Now we shall see that Inequality (27) is tight. Thus we have to consider a martingale for which the points on the curve are attained. To construct such an example the time has to be continuous. Let the unit interval $[0,1)$ be equipped with the Lebesgue measure. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the Borel measurable sets on $[0,t)$ and the set $[t,1)$. Let $X_t$ denote the function (random variable) on $[0,1)$ given by
(34)
where $f$ is a fixed integrable function. Remark that $f$ is increasing. The conditional expectation of $f$ given $\mathcal{F}_t$ is $X_t$ for every $t$, and the supremum over $t$ is attained. This implies that
(35)
Put $\dots$ Thus
(43)
(Figure: plot of $f$ and the conditional expectation $X_t = E[f \mid \mathcal{F}_t]$ on the unit interval.)
If in the above example $f$ is positive and decreasing, then the process is decreasing and the maximum shall be replaced by the minimum. All the calculations are the same, and therefore also Inequality (28) is tight.
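To make the construction above concrete, the following sketch discretizes the unit interval and computes the conditional expectations $X_t = E[f \mid \mathcal{F}_t]$ for one particular increasing density (the choice $f(s) = 2s$ is only an illustration), and compares the pointwise supremum over $t$ with the tail average $(1-s)^{-1}\int_s^1 f(u)\,du$, which is what the construction suggests the supremum should be.

```python
import numpy as np

# Discretize [0, 1) into small cells; f is an increasing density with integral 1.
N = 2_000
s = (np.arange(N) + 0.5) / N          # cell midpoints
f = 2.0 * s                            # illustrative choice: f(s) = 2s

def X(t):
    """Conditional expectation E[f | F_t]: equal to f on [0, t) and to the
    average of f over [t, 1) on the remaining interval."""
    x = f.copy()
    tail = s >= t
    x[tail] = f[tail].mean()           # average of f over [t, 1)
    return x

# Supremum of X_t(s) over a grid of times t, compared with the tail-average
# formula (an expression derived here for illustration, not quoted from the paper).
ts = np.linspace(0.0, 0.999, 1_000)
sup_X = np.max([X(t) for t in ts], axis=0)
tail_average = np.array([f[s >= si].mean() for si in s])
print(np.max(np.abs(sup_X - tail_average)))   # small discretization error
```

The agreement of the two arrays, up to discretization error, supports the tail-average expression used in this illustration.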
III. CONVERGENCE OF MARTINGALES
In order to prove convergence of martingales we have to reorganize our inequalities somewhat.
Theorem 5: Let $X_1, X_2, \ldots$ be a martingale and assume that $\dots$ Let $Q$ be the probability measure given by $\dots$ For $\dots$ put $\dots$ Then
(44)
Proof: For each value of $\dots$ we have
(45)
Using convexity of $\phi$ leads to
(46)
Thus the inequality of the theorem follows. Again a similar inequality is satisfied for the minimum of a martingale, i.e.
(47)
Now, let $X_1, X_2, \ldots$ be a martingale. Without loss of generality we will assume that $\dots$ Then
(48)
Thus we see that $\dots$ is increasing. Assume that $\dots$ is bounded. Then $\dots$ converges to 0 for $\dots$ tending to infinity. In particular $\dots$ for $\dots$ Thus, for $\dots$, the sequence $(X_n)$ is a Cauchy sequence with probability one. Therefore the martingale converges pointwise almost surely.
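The pointwise convergence just obtained can be visualized on a standard bounded martingale that is not taken from this paper: the proportion of red balls in a Pólya urn. In the sketch below each simulated path settles down, in agreement with almost sure convergence; the number of steps and the initial composition are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_path(n_steps, red=1, black=1):
    """Simulate the proportion of red balls in a Polya urn; this proportion
    is a bounded martingale and converges almost surely. With one red and
    one black ball initially, the a.s. limit is uniformly distributed."""
    props = np.empty(n_steps)
    for k in range(n_steps):
        if rng.random() < red / (red + black):
            red += 1
        else:
            black += 1
        props[k] = red / (red + black)
    return props

for i in range(5):
    path = polya_path(20_000)
    # Late-stage oscillation is tiny, consistent with pointwise convergence.
    print(f"path {i}: X_10000 = {path[9_999]:.4f}   X_20000 = {path[-1]:.4f}")
```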
Fig. 1. The function $\phi$ and Doob's bound corresponding to the tangent.
IV. DISCUSSION
Theorem 1 can be seen as a strengthening of a classical maximal inequality by Doob, which states that
(49)
A plot comparing the two inequalities (Fig. 1) displays that Doob's inequality corresponds to a tangent to the function $\phi$. Thus the new inequality is superior to Doob's inequality in a neighborhood of 1, and it is the behavior in this region which implies convergence of the martingale. Only in a neighborhood of $\dots$ is Doob's inequality optimal.
In this paper upper bounds for the maximum of a positive martingale and lower bounds for its minimum are given in terms of information divergence, and each of the bounds is shown to be tight. In the example the tightness is obtained for a certain value of a parameter. The upper and the lower bounds are tight for different values of this parameter. Therefore a tighter bound, involving the maximum and the minimum simultaneously, should be possible. Such tighter bounds would be highly interesting and are an obvious subject for further investigation.
Convergence of martingales is used in the proof of the Shannon-McMillan-Breiman theorem [14]. It is interesting that exactly the finiteness of the relevant information quantity is needed in this theorem.
One should also remark that convergence of Markov chains can be proved using ideas from the theory of information projections, as pointed out by Barron in [5]. The obvious applications of the theorems presented here to Markov chains have been postponed to a future article.
In this manuscript the focus has been on random variables which form martingales. We have seen that the upper bound for the maximum can be given in terms of the information divergence from one probability measure to another, but also the corresponding quantity for the minimum can be given in terms of measures. If $\dots$ then the random variable $\dots$ is not a probability density except if all the random variables are equal, but it defines a measure given by
(50)
and
(51)
Hence we have
(52)
One may ask if this inequality also holds in other situations. If $\dots$ we have
(53)
where the norm is the total variation norm. Then Inequality (52) states that
(54)
This inequality was proved by Volkonskij and Rozanov [15] and was later refined to Pinsker's inequality; see [16] for more details about the history of this problem.
For $\dots$ Inequality (52) does not hold in general. Let $\dots$ be a martingale with respect to $\dots$ and assume that $\dots$ is $\dots$-measurable. Then
(55)–(60)
Thus the two measures have the same restriction to the sub-$\sigma$-algebra in question. Actually, the one is the information projection of the other into the convex set of probability measures having the same restriction to this sub-$\sigma$-algebra.
With these observations we are able to state the following
conjecture.
Conjecture 6: Let $P$ be a probability measure and let $C_1 \supseteq C_2 \supseteq \cdots$ be a decreasing sequence of convex sets of probability measures. Let $Q_n$ denote the information projection of $P$ on $C_n$. Then
(61)
Similar conjectures can be formulated for a sequence of maximum likelihood estimates and for the minimum of measures instead of the maximum. The conjecture is true in a number of special cases. If the conjecture is true, a number of new convergence theorems will hold. For instance, a sequence of maximum entropy distributions on smaller and smaller sets will converge pointwise almost surely if some weak regularity conditions are fulfilled.
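The notion of information projection used in the conjecture can be made concrete on a finite alphabet. The sketch below computes the I-projection of a distribution onto a decreasing sequence of convex constraint sets of the form $\{Q : E_Q[f] \le a\}$; the distribution, the function $f$ and the thresholds are arbitrary choices for illustration, and the exponential-tilt form of the projection is the standard characterization for linear constraints, not a result of this paper.

```python
import numpy as np

def i_projection(p, f, a):
    """Information projection of p onto the convex set {q : E_q[f] <= a},
    for a finite alphabet, i.e. the minimizer of D(q||p) over that set.
    If the constraint is inactive the projection is p itself; otherwise it
    is an exponential tilt p(x)*exp(t*f(x))/Z with t < 0 chosen so that
    E_q[f] = a (standard characterization for linear constraints)."""
    p, f = np.asarray(p, float), np.asarray(f, float)
    if np.dot(p, f) <= a:
        return p
    def tilt(t):
        w = p * np.exp(t * f)
        q = w / w.sum()
        return q, np.dot(q, f)
    lo, hi = -50.0, 0.0                      # bisection bracket for the tilt
    for _ in range(200):
        mid = (lo + hi) / 2
        _, m = tilt(mid)
        if m > a:
            hi = mid                          # mean too large: tilt further down
        else:
            lo = mid
    return tilt((lo + hi) / 2)[0]

def divergence(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# A toy decreasing sequence of constraint sets {q : E_q[f] <= a_k}.
p = np.array([0.1, 0.2, 0.3, 0.4])
f = np.array([0.0, 1.0, 2.0, 3.0])
for a in (2.0, 1.5, 1.0, 0.5):
    q = i_projection(p, f, a)
    print(a, np.round(q, 4), round(divergence(q, p), 4))
```

As the constraint sets shrink, the divergence from the projection to $p$ increases, which is the kind of monotone behaviour underlying the conjecture.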
REFERENCES
[1] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. Math. Statist. and Prob., vol. 1, (Berkeley), pp. 547–561, Univ.
Calif. Press, 1961.
[2] I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 95–108, 1963.
[3] D. G. Kendall, “Information theory and the limit theorem for Markov
chains and processes with a countable infinity of states,” Ann. Inst. Stat.
Math., vol. 15, pp. 137–143, 1964.
[4] J. Fritz, “An information-theoretical proof of limit theorems for reversible
Markov processes,” in Trans. Sixth Prague Conf. on Inform. Theory, Statist. Decision Functions, Random Processes, (Prague), Czech. Acad. Science, Academia Publ., 1973. Prague, Sept. 1971.
[5] A. R. Barron, “Limits of information, Markov chains, and projections,” in
Proceedings 2000 International Symposium on Information Theory, p. 25,
2000.
[6] Y. V. Linnik, “An information-theoretic proof of the central limit theorem
with Lindeberg condition,” Theory Probab. Appl., vol. 4, pp. 288–299,
1959.
[7] A. R. Barron, "Entropy and the Central Limit Theorem," Ann. Probab., vol. 14, no. 1, pp. 336–342, 1986.
[8] S. Takano, “Convergence of entropy in the central limit theorem,” Yokohama Mathematical Journal, vol. 35, pp. 143–148, 1987.
[9] P. Harremoës, “Binomial and Poisson distributions as maximum entropy
distributions,” IEEE Trans. Inform. Theory, vol. IT-47, pp. 2039–2041,
July 2001.
[10] I. Csiszár, “Sanov property, generalized I-projection and a conditional
limit theorem,” Ann. Probab., vol. 12, pp. 768–793, 1984.
[11] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. Math. Hungar., vol. 2,
pp. 299–318, 1967.
[12] P. Harremoës, Information Topologies with Applications, Bolyai Society Mathematical Studies. Springer, to appear.
[13] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. Moscow: Izd. Akad. Nauk SSSR, 1960 (in Russian).
[14] T. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 1991.
[15] V. A. Volkonskij and J. A. Rozanov, "Some limit theorems for random functions I," Theory Probab. Appl., vol. 4, pp. 178–197, 1959.
[16] A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Trans. Inform. Theory, vol. 49, pp. 1491–1498, June 2003.