Scandinavian Journal of Statistics, Vol. 31: 235–246, 2004
Conjugate Priors Represent Strong
Pre-Experimental Assumptions
EDUARDO GUTIÉRREZ-PEÑA
National University of Mexico
PIETRO MULIERE
Bocconi University
ABSTRACT. It is well known that Jeffreys’ prior is asymptotically least favorable under the
entropy risk, i.e. it asymptotically maximizes the mutual information between the sample and the
parameter. However, in this paper we show that the prior that minimizes (subject to certain
constraints) the mutual information between the sample and the parameter is natural conjugate
when the model belongs to a natural exponential family. A conjugate prior can thus be regarded as
maximally informative in the sense that it minimizes the weight of the observations on inferences
about the parameter; in other words, the expected relative entropy between prior and posterior is
minimized when a conjugate prior is used.
Key words: Bayes risk, conjugate prior, exponential family, mutual information, rate-distortion
function
1. Introduction
In the context of Bayesian inference, where the parameter is regarded as a random variable
and assigned a probability distribution, it is well known that Jeffreys’ prior is asymptotically
least favorable under entropy risk, i.e., it asymptotically maximizes the mutual information
between the distribution of the sample and that of the parameter (Clarke & Barron, 1994).
Thus, Jeffreys’ prior can be regarded as minimally informative in the sense that it maximizes the
weight of the observations on inferences about the parameter and so its impact on such
inferences is as weak as possible.
By contrast, what priors are maximally informative? The aim of this paper is to address
this question. Conjugate priors for exponential families have been previously shown to be
‘maximally informative’ from at least a couple of viewpoints (see e.g. Consonni & Veronese,
1992; section 3). Our main result is the following: the prior that minimizes (subject to certain
constraints) the mutual information between the sample and the parameter is natural conjugate when the model belongs to a natural exponential family (theorem 1). In this sense, a
conjugate prior can be regarded as maximally informative in that it minimizes the weight of
the observations on inferences about the parameter and so its impact on such inferences is as
strong as possible. One may also rephrase this result to say that, on average, the prior is
closest to the posterior in the Kullback–Leibler sense when the prior is natural conjugate
(corollary 1).
Following Kapur & Kesavan (1992, section 6.1), the two optimizations of the mutual
information described above can be regarded as dual problems. Clarke (1996) uses posterior
densities based on relative entropy reference priors to identify data implicit in the use of
informative priors. He achieves this by minimizing the relative entropy distance between the
corresponding informative posteriors and the posterior derived from the reference prior. Our
main result suggests that conjugate priors should probably not be used for representing prior
knowledge without substantive justification. Nevertheless, at least for exponential families,
one can envisage a range of priors from minimally informative (Jeffreys) to maximally
informative (conjugate). This is in line with Kapur & Kesavan (1992, section 6.6), who argue that one aim of the dual entropy (or relative entropy) optimization problems is to identify the 'worst' possible distribution along with the 'best' possible one, so that the gain obtained by using the desired optimization principle can be assessed.
In the next section, we briefly review basic concepts of information theory. In section 3, we
state and prove our main result and discuss the relationship between mutual information and
the Bayes risk of a specific Bayesian prediction problem. Finally, section 4 contains some
concluding remarks.
2. Mutual information and rate distortion
Here we briefly review basic concepts of information theory such as mutual information
and rate-distortion function. For a comprehensive account of these and related concepts
see, e.g. Cover & Thomas (1991). Throughout the paper we shall use upper case letters to
denote random variables and the corresponding lower case symbols to denote the outcomes.
Let g be a σ-finite positive measure on the Borel sets of R^d (typically Lebesgue measure or a counting measure). The relative entropy (or Kullback–Leibler number) between two densities (with respect to g) p and q is
D_KL(p || q) = ∫ p(x) log {p(x)/q(x)} g(dx),
and is usually interpreted as a measure of the ‘distance’ between two distributions. It is always
non-negative and is zero if and only if p equals q, but it is not symmetric.
Now let X and Y be two random variables with joint density (with respect to the product measure g = g1 × g2) p(x, y). Then the Shannon (1948) mutual information, defined by

I(X; Y) = ∫∫ p(x, y) log {p(x, y)/(p(x) p(y))} g1(dx) g2(dy) = D_KL{p(x, y) || p(x) p(y)},
describes the amount of information that the random variable Y gives about the random
variable X (and vice versa). The quantity I(X; Y) is one of the most fundamental information
quantities and is a measure of the dependence between the two random variables X and Y.
From its definition, it is easy to see that I(X; Y) = I(Y; X) and that I(X; Y) = 0 if and only if
X and Y are independent.
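As a small numerical illustration of this definition (ours, not from the paper; the 2 × 2 joint distribution below is arbitrary), the mutual information of a discrete joint can be computed directly as the relative entropy between the joint and the product of its marginals:

import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])          # an arbitrary 2x2 joint distribution
p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y

mutual_info = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))   # in nats
print(mutual_info)   # zero if and only if X and Y are independent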
Write p(x, y) = p(x) p(y|x). An important property of the mutual information I(X; Y) is
that it is a concave function of p(x) for fixed p(y|x), and a convex function of p(y|x) for fixed
p(x); see, e.g. Cover & Thomas (1991, p. 31).
In an idealized communication system, maximizing I(X; Y) over the marginal distribution
p(x) of a source X, for a fixed conditional distribution p(y|x) (the channel), yields the so-called
capacity of the channel (Shannon, 1948), i.e.,
C = sup_{p(x)} I(X; Y),

where the supremum is taken over all possible input distributions p(x). The capacity C has the significance that information can be sent through the channel with arbitrarily low probability of error at any rate less than C. Thus, the capacity addresses the problem of data transmission.
However, if we fix the marginal distribution p(x) of the source X, and then minimize I(X; Y) over the class of conditional distributions p(y|x) subject to a distortion constraint, the result is the information rate-distortion function, i.e.

R(δ) = inf_{p(y|x) ∈ P_δ} I(X; Y),

where the infimum is taken over the class P_δ of conditional distributions p(y|x) that satisfy

∫∫ p(x) p(y|x) L(x, y) g1(dx) g2(dy) ≤ δ,

where L(x, y) is a given distortion (loss) function (Shannon, 1959; Berger, 1971; Csiszár, 1974). In this case p(y|x) is regarded as a description of how X is represented by Y, and the idea is that it should be represented by as few bits as possible subject to a specified amount of inaccuracy. A communication system achieving an average distortion of at most δ can be designed if and only if the capacity of the channel exceeds R(δ), i.e. R(δ) is the effective rate at which the source produces information subject to the constraint that the receiver can tolerate an average distortion of δ (Berger, 1971). Thus, the rate-distortion function addresses the problem of data compression. As pointed out by Berger (1971, section 2.3), there is an alternative interpretation of the definition of R(δ) that is more appealing intuitively. Specifically, the same curve R(δ) would result if we fixed the mutual information and then chose p(y|x) so as to minimize the average distortion.
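As a concrete, standard illustration of this definition (ours, not from the paper; see Cover & Thomas, 1991): for a Bernoulli(p) source under Hamming distortion, the rate-distortion function has the closed form R(δ) = H(p) − H(δ) for 0 ≤ δ ≤ min{p, 1 − p}, and R(δ) = 0 otherwise, where H(·) denotes the binary entropy (in nats). A minimal Python sketch:

import math

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log(q) - (1 - q) * math.log(1 - q)

def rate_distortion_bernoulli(p, delta):
    # R(delta) = H(p) - H(delta) for delta <= min(p, 1 - p); zero beyond that point
    if delta >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(delta)

print(rate_distortion_bernoulli(0.3, 0.1))   # nats per symbol
print(rate_distortion_bernoulli(0.3, 0.4))   # 0.0: the tolerated distortion is too generous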
In a Bayesian context, letting Θ denote the parameter, with density w(θ), and X represent the data, maximization of (an asymptotic expression for) I(X; Θ) with respect to w(θ) typically yields the reference prior for Θ introduced by Bernardo (1979), of which Jeffreys' prior is a particular case; see Clarke & Barron (1994).
Yuan & Clarke (1999a,b), again in a Bayesian context, used the rate-distortion theory in order to construct likelihood functions p(x|θ) for data X in situations where it is not possible to identify a parametric family on the basis of physical modeling but there is prior information about the parameter of interest Θ, described by the prior w(θ). Such likelihoods minimize the mutual information I(X; Θ) and are thus called 'minimally informative'.
Clearly, we can also write p(x, θ) as m(x)w(θ|x) so, for fixed m(x), we can minimize I(X; Θ) with respect to w(θ|x) in order to construct a posterior, which in turn will produce a prior w(θ) for θ. While this procedure is intimately related to the formulation of the rate-distortion theory, here we focus on w(θ) and not on the minimal mutual information per se, as is the case in rate-distortion theory.
Before we proceed, note that if X^n = {X1, . . . , Xn} is a sample of n i.i.d. observations from p(x|θ) then

I(X^n; Θ) = ∫∫ p(x^n|θ) w(θ) log {p(x^n|θ)/m(x^n)} g(dx^n) dθ = ∫∫ w(θ|x^n) m(x^n) log {w(θ|x^n)/w(θ)} dθ g(dx^n),

where

m(x^n) = ∫ w(θ) p(x^n|θ) dθ   and   w(θ|x^n) = w(θ) p(x^n|θ) / m(x^n).
Note also that if T = T(X^n) is a sufficient statistic for Θ, then

I(T; Θ) = I(X^n; Θ)
(Cover & Thomas, 1991; section 2.10).
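The following Monte Carlo sketch (illustrative only; the Bernoulli model and the values of a0, b0 and n are our own choices, not taken from the paper) checks the identity above numerically: I(X^n; Θ) estimated from the joint density ratio agrees with the expected Kullback–Leibler divergence between posterior and prior.

import numpy as np
from scipy.special import betaln, digamma

rng = np.random.default_rng(0)
a0, b0, n, sims = 2.0, 3.0, 10, 200_000

mu = rng.beta(a0, b0, size=sims)            # Theta ~ prior
s = rng.binomial(n, mu)                     # sufficient statistic of X^n

# estimate 1: I(X^n; Theta) = E[ log p(X^n | Theta) - log m(X^n) ]
log_lik = s * np.log(mu) + (n - s) * np.log(1 - mu)
log_marg = betaln(a0 + s, b0 + n - s) - betaln(a0, b0)
est1 = (log_lik - log_marg).mean()

# estimate 2: E_m[ D_KL( w(.|X^n) || w(.) ) ], using the closed-form
# Kullback-Leibler divergence between two beta densities
a1, b1 = a0 + s, b0 + n - s
kl = (betaln(a0, b0) - betaln(a1, b1)
      + (a1 - a0) * digamma(a1) + (b1 - b0) * digamma(b1)
      - (a1 + b1 - a0 - b0) * digamma(a1 + b1))
est2 = kl.mean()

print(est1, est2)   # the two estimates should agree up to Monte Carlo error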
3. Maximally informative priors for exponential families
In this section we first review basic concepts concerning natural exponential families and
define an ‘intrinsic’ distortion function. We then state and prove the main results of this paper.
3.1. Review of exponential families
We first review some basic results concerning natural exponential families; see Barndorff-Nielsen (1978) for a comprehensive account of the properties of these models. Let g be a σ-finite positive measure on the Borel sets of R^d, and consider the family F of probability measures whose density with respect to g is of the form

p(x|θ) = b1(x) exp{x'θ − M(θ)},   θ ∈ N,    (1)

for some function b1(·), where M(θ) = log ∫ b1(x) exp(x'θ) g(dx) and N = int U, with U = {θ ∈ R^d : M(θ) < +∞}. Let g1(dx) = b1(x) g(dx). We assume that g1 is not concentrated on an affine subspace of R^d, and that N is not empty. The family F is called a natural exponential family and is said to be regular if U is an open subset of R^d. The function M(θ), called the cumulant transform of F, is convex and infinitely differentiable.
For notational convenience, and without loss of generality, we take d = 1 so that F is a univariate natural exponential family. The multivariate case is completely analogous.
The mapping μ = μ(θ) = dM(θ)/dθ is one-to-one and differentiable, with inverse θ = θ(μ), and provides an alternative parameterization of F, called the mean-value parameterization. The set Ω = μ(N) is termed the mean parameter space. Let X denote the interior of the convex hull of the support of g. Then F is said to be steep if Ω = X. In this paper, we shall only be concerned with steep natural exponential families.
The function

V(μ) = d²M{θ(μ)}/dθ²,   μ ∈ Ω,

is called the variance function of F. An important property of the variance function is that the pair (V(·), Ω) characterizes F (see e.g. Morris, 1982).
Consider now a sample X1, . . . , Xn of independent observations from (1) and let S = Σ_i Xi. Then S follows the exponential family model

p(s|θ, n) = b(s, n) exp{sθ − nM(θ)},   θ ∈ N,

where b(·, n) denotes the n-fold convolution of b1(·).
The corresponding natural conjugate family has densities (with respect to the Lebesgue measure on the Borel sets of R) of the form

w(θ) = w(θ|s0, n0) = H(s0, n0) exp{s0 θ − n0 M(θ)}

(Diaconis & Ylvisaker, 1979). Let C_F ⊆ R^+ be the set of all values of ñ such that p(·|θ, ñ) is still a natural exponential family. If C_F = R^+ then F is infinitely divisible (the binomial family provides an example of an exponential family which is not infinitely divisible, as C_F = N^+ in that case).
Now let T = S/n. Then T has a density of the form

p(t|θ, n) = B(nt, n) exp{n[θt − M(θ)]},

where B(s, n) = b(s, n) if F is discrete, and B(s, n) = n b(s, n) if F is continuous.
Finally, the predictive distribution of T is given by

m(t) = B(nt, n) H(n0 t0, n0) / H(nt + n0 t0, n + n0),    (2)

where t0 = s0/n0.
The relative entropy between two members of the same parametric family has been used by
several authors as a loss function in some statistical decision problems, including point estimation (see e.g. Gutiérrez-Peña, 1992) and hypothesis testing (e.g. Rueda, 1992; Bernardo,
1999); see also Robert (1996). It has the appealing property that it gives rise to Bayesian
procedures which are invariant under reparameterizations.
Let us now reparameterize the family F in terms of μ = μ(θ), i.e.

p(t|μ, n) = B(nt, n) exp{n[tθ(μ) − M_θ(μ)]},   μ ∈ Ω,

where M_θ(μ) = M(θ(μ)). The corresponding conjugate prior then takes the form

w(μ) = w(μ|t0, n0) = H(n0 t0, n0) exp{n0[t0 θ(μ) − M_θ(μ)]} V(μ)^{−1}.

Note that T is a minimal sufficient statistic for F. It follows from the remark at the end of section 2 that I(X^n; Θ) = I(T; μ). Therefore, in the remainder of this paper we shall focus on I(T; μ).
Consider the loss L(t, μ) = d(t; μ), where d(μ̃, μ) = D_KL{p(·|μ̃, 1) || p(·|μ, 1)}. For the exponential family F, this loss takes the form

L(t, μ) = M_θ(μ) − M_θ(t) + {θ(t) − θ(μ)}t.    (3)
Remark. The mean-value parameter space Ω is the interior of the smallest interval containing T, the support of T. Thus, Ω = T if T is an open interval, and the closure of Ω always contains T. If T is not open, we can define L(·, μ) by continuity at the boundary of the interval Ω̄ = [a, b] as

L(a, μ) = lim_{t↓a} L(t, μ)   and   L(b, μ) = lim_{t↑b} L(t, μ)

for all μ ∈ Ω. The distortion functions used in the following examples can be derived using this approach.
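As a quick check of (3) (ours, not in the paper, but immediate from the standard cumulant transforms), the Bernoulli and unit-variance normal cases yield exactly the distortion functions used in examples 1 and 2 below. For the Bernoulli family, M(θ) = log(1 + e^θ), θ(μ) = log{μ/(1 − μ)} and M_θ(μ) = −log(1 − μ), so (3) gives

L(t, μ) = −log(1 − μ) + log(1 − t) + [log{t/(1 − t)} − log{μ/(1 − μ)}] t = t log(t/μ) + (1 − t) log{(1 − t)/(1 − μ)}.

For the normal family with unit variance, M(θ) = θ²/2, θ(μ) = μ and M_θ(μ) = μ²/2, so

L(t, μ) = μ²/2 − t²/2 + (t − μ)t = (t − μ)²/2.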
3.2. Choosing the prior: two examples
The choice of a prior determines the strength of the dependence among the elements of an
exchangeable sequence {Xn}. Following de Finetti & Savage (1962), we argue that it is necessary to assess a prior distribution in terms of the actual information available on the
parameter and according to the strength of the dependence between the observations and the
parameter (or, from a predictive viewpoint, between past and future) that we wish to describe.
Our work was motivated by related ideas developed by Cifarelli & Regazzini (1983,1987), who
considered the association between past and future observations in exponential family models
but used a different measure of dependence.
We wish to minimize the mutual information between T and μ over a class of posterior distributions w(μ|t) defined by an expected distortion constraint. The solutions are parametric families that depend on the choice of the loss function as well as on the source distribution m(t). While the rate-distortion function quantifies the amount of data compression that can be optimally achieved, the posterior distribution that attains the rate-distortion function lower bound yields the weakest dependence between the source t and the output μ within the class P_δ of posterior densities over which the mutual information has been minimized.
We now follow Blahut (1972) and state the solution to the rate-distortion function optimization problem; see also Cover & Thomas (1991, section 13.7) and Cyranzki & Tzannes
(1983). Using this, we derive the solution for a source of the form (2) under an intrinsic loss of
the form (3).
1. Fix m(t) and let L(t, μ) be a loss (distortion) function such that it is positive, continuous in its two arguments, and zero when t = μ.
2. The problem lies in minimizing the mutual information I(T; μ), i.e.

R(δ) = inf_{w(μ|t) ∈ P_δ} I(T; μ)

over the class P_δ of conditional distributions w(μ|t) that satisfy

∫∫ m(t) w(μ|t) L(t, μ) g(dt) dμ ≤ δ.    (4)

3. The solution to this minimization problem is given by

w*(μ|t) = w_λ(μ|t) = w*(μ) e^{−λL(t, μ)} / ∫ w*(μ̃) e^{−λL(t, μ̃)} dμ̃,    (5)

where w*(μ) is determined by the equation

∫ [m(t) e^{−λL(t, μ)} / ∫ w*(μ̃) e^{−λL(t, μ̃)} dμ̃] g(dt) = 1,    (6)

for μ such that w*(μ) > 0, and λ, which is the Lagrange multiplier for constraint (4), is determined by

∫∫ m(t) w_λ(μ|t) L(t, μ) dμ g(dt) = δ.    (7)
We refer to the prior w*(μ) as maximally informative as it minimizes the weight of the observations when making inferences about μ. Our aim here is exactly the opposite of that of Bernardo's approach to finding reference (i.e. minimally informative) priors.
In order to illustrate the procedure described above, we now present two simple examples.
Example 1 (binomial case). Let

m(t) = \binom{n}{nt} Γ(a0 + b0) Γ(a0 + nt) Γ(b0 + n(1 − t)) / [Γ(a0) Γ(b0) Γ(a0 + b0 + n)],

with t ∈ T = {0, 1/n, . . . , (n − 1)/n, 1}, a0 > 0 and b0 > 0. Consider the following loss function

L(t, μ) = t log(t/μ) + (1 − t) log{(1 − t)/(1 − μ)},

where we define 0 log 0 = lim_{x→0} x log x = 0. This is the intrinsic loss (3) for the binomial family.
According to (6), w*(μ) must satisfy the following condition

Σ_{t ∈ T} m(t) e^{−λ[t log(t/μ) + (1−t) log{(1−t)/(1−μ)}]} / [∫_0^1 w*(μ̃) e^{−λ[t log(t/μ̃) + (1−t) log{(1−t)/(1−μ̃)}]} dμ̃] = 1.

Set w*(μ) = Be(μ|d, c), i.e. a beta density such that E(μ) = d/(d + c). Then the denominator of the previous expression becomes

t^{−λt} (1 − t)^{−λ(1−t)} [Γ(d + c)/{Γ(d) Γ(c)}] [Γ(d + λt) Γ(c + λ(1 − t))/Γ(d + c + λ)].

Hence we have

Σ_{t ∈ T} \binom{n}{nt} [Γ(a0 + b0) Γ(d) Γ(c) Γ(a0 + nt) Γ(b0 + n(1 − t)) Γ(d + c + λ)] / [Γ(d + c) Γ(a0) Γ(b0) Γ(d + λt) Γ(c + λ(1 − t)) Γ(a0 + b0 + n)] μ^{λt} (1 − μ)^{λ(1−t)} = 1.

When λ = n, this equation is satisfied by setting d = a0 and c = b0. Therefore w*(μ) = Be(μ|a0, b0). Finally, from (5) we see that w*(μ|t) = Be(μ|a0 + nt, b0 + n(1 − t)).
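A brute-force numerical check of this claim (a sketch only; n, a0, b0 and the integration grid are illustrative choices, not from the paper) evaluates the left-hand side of condition (6) with λ = n and w*(μ) = Be(μ|a0, b0) and finds it close to 1 for any μ:

import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def m(t, n, a0, b0):
    # beta-binomial source (2): predictive density of T = S/n
    s = round(n * t)
    return math.comb(n, s) * math.exp(log_beta(a0 + s, b0 + n - s) - log_beta(a0, b0))

def loss(t, mu):
    # intrinsic loss (3) for the binomial family, with 0 log 0 = 0
    def xlog(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return xlog(t, mu) + xlog(1.0 - t, 1.0 - mu)

def beta_pdf(mu, d, c):
    return math.exp((d - 1.0) * math.log(mu) + (c - 1.0) * math.log(1.0 - mu) - log_beta(d, c))

def lhs_condition6(mu, n, a0, b0, lam, grid=20000):
    # sum over t of m(t) exp(-lam L(t, mu)) / integral of w*(mu~) exp(-lam L(t, mu~))
    total = 0.0
    for s in range(n + 1):
        t = s / n
        denom = sum(beta_pdf((i + 0.5) / grid, a0, b0)
                    * math.exp(-lam * loss(t, (i + 0.5) / grid))
                    for i in range(grid)) / grid
        total += m(t, n, a0, b0) * math.exp(-lam * loss(t, mu)) / denom
    return total

n, a0, b0 = 5, 2.0, 3.0
for mu in (0.2, 0.5, 0.8):
    print(mu, lhs_condition6(mu, n, a0, b0, lam=n))   # each value should be close to 1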
Example 2 (normal case). Let m(t) = N(t|μ0, σ0²) (i.e. m(t) is the density of a normal distribution with mean μ0 and variance σ0²) and let L(t, μ) = k(t − μ)², with k a known positive constant. This quadratic form, with k = 1/2, is the intrinsic loss (3) for the normal family.
Now, according to (6), w*(μ) must satisfy

∫_{−∞}^{∞} N(t|μ0, σ0²) e^{−λk(t − μ)²} / [∫_{−∞}^{∞} w*(μ̃) e^{−λk(t − μ̃)²} dμ̃] dt = 1.

Set w*(μ) = N(μ|d, c). Then the denominator of this last expression becomes

(2π)^{1/2} (2λk)^{−1/2} ∫_{−∞}^{∞} N(μ̃|d, c) N(t|μ̃, (2λk)^{−1}) dμ̃ = (2π)^{1/2} (2λk)^{−1/2} N(t|d, c + (2λk)^{−1}).

Hence we have

∫_{−∞}^{∞} N(t|μ0, σ0²) N(t|μ, (2λk)^{−1}) / N(t|d, c + (2λk)^{−1}) dt = 1.    (8)

Thus, if we set d = μ0 and c = σ0² − (2λk)^{−1}, then we have

N(t|μ0, σ0²) / N(t|d, c + (2λk)^{−1}) = 1

for all t, and condition (8) is trivially satisfied. Note that this solution is unique. Therefore w*(μ) = N(μ|μ0, σ0² − (2λk)^{−1}). Finally, from (5) we see that

w*(μ|t) = N(μ|μ0, σ0² − (2λk)^{−1}) N(t|μ, (2λk)^{−1}) / ∫_{−∞}^{∞} N(μ̃|μ0, σ0² − (2λk)^{−1}) N(t|μ̃, (2λk)^{−1}) dμ̃ = N(μ|μ1, σ1²),

where

μ1 = [μ0 + t(2λkσ0² − 1)]/(2λkσ0²)   and   σ1² = (2λk)^{−1} [1 − (2λkσ0²)^{−1}].

Note that, from (7),

δ = δ(λ) = E_T E_{μ|T}{L(T, μ)} = k E_μ E_{T|μ}{(T − μ)²} = 1/(2λ).

To sum up, the maximally informative prior is given by w*(μ) = N(μ|μ0, σ0² − (2λk)^{−1}) with λ > (2kσ0²)^{−1}. This condition, written as δ < kσ0², means that the expected distortion that can be tolerated must be less than k times the variance of the source distribution m(t). For δ ≥ kσ0², the rate-distortion function is zero; see Cover & Thomas (1991).
Recall that, for a normal sampling density N(x|μ, 1) and a normal N(μ|μ0, τ0²) prior, the predictive distribution of T is N(t|μ0, τ0² + 1/n). As τ0² > 0, σ0² has to be greater than 1/n in order for m(t) = N(t|μ0, σ0²) to be eligible as a source distribution. This constraint is consistent with the constraint λ > (2kσ0²)^{−1} mentioned in the previous paragraph as long as k = 1/2 and λ = n.
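A quick Monte Carlo sanity check of this example (a sketch; the values of μ0, σ0², k and λ below are illustrative and satisfy λ > (2kσ0²)^{−1}): drawing μ from w*(μ) and then T given μ from the implied conditional, which here works out to N(μ, (2λk)^{−1}), should reproduce the source N(μ0, σ0²) and an average distortion of 1/(2λ).

import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, k, lam = 1.0, 2.0, 0.5, 3.0           # requires lam > 1/(2*k*sigma0^2)
v_prior = sigma0**2 - 1.0 / (2.0 * lam * k)        # variance of w*(mu)
v_chan = 1.0 / (2.0 * lam * k)                     # variance of T given mu

mu = rng.normal(mu0, np.sqrt(v_prior), size=1_000_000)
t = rng.normal(mu, np.sqrt(v_chan))

print(t.mean(), t.var())                           # close to mu0 and sigma0^2
print((k * (t - mu) ** 2).mean(), 1 / (2 * lam))   # both close to the distortion level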
In both of the examples given above, the prior which minimizes the mutual information
between the observation and the parameter has a conjugate form. This suggests that a similar
result could hold more generally. This is indeed the case. In the next subsection we show that,
for exponential families, the prior that minimizes the mutual information between the
parameter μ and the statistic T when an intrinsic loss (distortion) function is chosen is a
natural conjugate prior.
Remark 1 (computational issues). The approach followed here allows us to verify whether a given w(μ) is a solution of the minimization problem. Generally speaking, however, finding a solution is not an easy problem. The main technique for computing the rate-distortion function is the Blahut–Arimoto algorithm (Arimoto, 1972; Blahut, 1972). A more recent description of this algorithm can be found, for example, in Cover & Thomas (1991).
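A minimal Blahut–Arimoto sketch for a finite alphabet (our illustration, not the authors' implementation; the Bernoulli source, Hamming distortion and the values of λ are arbitrary) alternates between the conditional in (5) and the output marginal, tracing out points on the rate-distortion curve:

import numpy as np

def blahut_arimoto(p_x, loss, lam, n_iter=500):
    """p_x: source probabilities; loss[i, j]: distortion d(x_i, y_j); lam > 0."""
    n_y = loss.shape[1]
    q_y = np.full(n_y, 1.0 / n_y)                  # output marginal, start uniform
    for _ in range(n_iter):
        a = q_y[None, :] * np.exp(-lam * loss)     # unnormalized conditional, cf. (5)
        q_y_given_x = a / a.sum(axis=1, keepdims=True)
        q_y = p_x @ q_y_given_x                    # update output marginal
    distortion = np.sum(p_x[:, None] * q_y_given_x * loss)
    rate = np.sum(p_x[:, None] * q_y_given_x * np.log(q_y_given_x / q_y[None, :]))
    return rate, distortion, q_y

p_x = np.array([0.7, 0.3])                         # Bernoulli(0.3) source
loss = np.array([[0.0, 1.0], [1.0, 0.0]])          # Hamming distortion
for lam in (1.0, 2.0, 4.0):
    r, d, _ = blahut_arimoto(p_x, loss, lam)
    print(f"lam={lam}: rate={r:.4f} nats, distortion={d:.4f}")

The printed pairs can be compared with the closed form R(δ) = H(p) − H(δ) mentioned in section 2.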
Remark 2 (an inverse problem). In order to highlight the role of the prior distribution, and
to show that it is as close as possible to the posterior distribution in a very specific sense, we
consider the marginal distribution m(t) as the source distribution, with the posterior distribution w(μ|t) describing the transmission channel (see Tzannes & Noonan, 1973; Cyranzki & Tzannes, 1983). As mentioned in section 2, however, it is also possible to consider w(μ) as the source distribution and minimize the mutual information over conditional distributions p(t|μ),
as proposed by Yuan & Clarke (1999a,b). Following Kapur & Kesavan (1992), given the
former problem, the latter can be regarded as an inverse problem, and vice versa. These two
approaches arrive at the same result only in certain cases (e.g. when the source distribution is
normal and the loss function is quadratic).
Remark 3 (loss function). It should be clear that an appropriate choice of the loss function
is crucial in order to obtain analytical results. A general procedure to construct suitable
(intrinsic) loss functions for exponential families was described in section 3.1. Such losses are
now widely used in Bayesian statistical decision problems and have an interesting information-theoretic interpretation.
3.3. Main results
Theorem 1
Consider an exponential family F, parameterized in terms of the mean-value parameter μ, and a loss function of the form (3). Then the mutual information I(T; μ) between T = T(X^n) and μ = μ(Θ) is minimized, over the class P_δ of conditional distributions w(μ|t) that satisfy

∫∫ m(t) w(μ|t) L(t, μ) g(dt) dμ ≤ δ,    (9)

by a posterior distribution w*(μ|t) derived from a natural conjugate prior on Θ.
Proof. Let m(t) be as in (2) for suitable functions B(·, ·) and H(·, ·), and consider a loss function of the form (3).
Denote by w*(μ|t) the density in P_δ at which the infimum in the definition of the rate-distortion function is attained. It follows from the discussion in section 3.2 that the solution to the rate-distortion function minimization problem is given by

w*(μ|t) = w_λ(μ|t) = w*(μ) e^{−λL(t, μ)} / ∫ w*(μ̃) e^{−λL(t, μ̃)} dμ̃,

where w*(μ) is determined by the equation

∫ [m(t) e^{−λL(t, μ)} / ∫ w*(μ̃) e^{−λL(t, μ̃)} dμ̃] g(dt) = 1,

for μ such that w*(μ) > 0, and λ is the Lagrange multiplier for constraint (9). Note that, by (3), e^{−λL(t, μ)} = exp{λ[tθ(μ) − M_θ(μ)]} exp{−λ[tθ(t) − M_θ(t)]}.
Now set w*(μ) = H(dc, c) exp{c[dθ(μ) − M_θ(μ)]} V(μ)^{−1}. Then the denominator of the last expression becomes

[H(dc, c)/H(dc + λt, c + λ)] exp{−λ[θ(t)t − M_θ(t)]}.

Therefore we must have

∫ [H(dc + λt, c + λ) H(n0 t0, n0) B(nt, n) / {H(n0 t0 + nt, n0 + n) H(dc, c) B(λt, λ)}] p(t|μ, λ) g(dt) = 1.

When λ = n, this equation is satisfied by setting d = t0 and c = n0. Other values of λ would lead to d and c depending on μ and so are not admissible.
Hence w*(μ) = w(μ|n0 t0, n0) = H(n0 t0, n0) exp{n0[t0 θ(μ) − M_θ(μ)]} V(μ)^{−1}, and w*(μ|t) = w(μ|λt + n0 t0, λ + n0). This solution is unique. To see this, as P_δ is convex, it is enough to show that I(T; μ) is strictly convex on P_δ as a functional of w(·|·). This follows from theorem 2.7.4 of Cover & Thomas (1991).
The above prior, written in terms of the canonical parameter θ, is given by

w_θ(θ) = H(n0 t0, n0) exp{n0[t0 θ − M(θ)]},

which is seen to be natural conjugate for the exponential family F. Therefore, the distribution w_θ(θ|t) which minimizes the mutual information between T and Θ (subject to a constraint on the expected relative entropy loss) belongs to the natural conjugate family.
This result shows that natural conjugate priors achieve optimal data compression and
produce posterior distributions which are as close as possible to their corresponding priors.
Such priors attain the rate-distortion function lower bound, and so can be regarded as
maximally informative, implying that the corresponding likelihoods are minimally informative.
Diaconis & Ylvisaker (1979) characterize the conjugate family of prior distributions for a
general exponential family in terms of the linearity of the posterior expected value of μ with
respect to t. The following corollary suggests that conjugate priors should only be used in
situations where the researcher wishes to minimize the weight of the observations on inferences
about the parameter.
Corollary 1
Consider an exponential family F, parameterized in terms of the mean-value parameter μ. Then the expected relative entropy between a prior w(μ) and its posterior w(μ|t) is minimized (over the class P_δ) by a prior on μ which is induced by a natural conjugate prior on Θ.
Proof. The result follows upon noting that the mutual information between T and μ can be written as

I(T; μ) = ∫ m(t) ∫ w(μ|t) log {w(μ|t)/w(μ)} dμ g(dt).

Thus, given m(t), the prior w_θ(θ) which is closest in the Kullback–Leibler sense (on average) to the corresponding posterior w_θ(θ|t) is a natural conjugate prior.
We now describe a further property of conjugate priors for exponential families that
relates the mutual information between the sample and the parameter to the Bayes risk
associated with a Bayesian estimation problem.
Consider a parametric family

F = {p_θ(x) : θ ∈ N},

and let X^n = {X1, . . . , Xn} be a sample of independent and identically distributed observations from F. Now consider the following situation: for each k = 1, 2, . . . , n, the statistician produces an estimate p̂_k(x_k|x^{k−1}) of p_θ(x_k), whose adequacy is to be judged on the basis of the loss function

l_k(p̂_k, θ) = D_KL(p_θ || p̂_k).

For each k = 1, 2, . . . , n, the corresponding risk is given by

r_k(p̂_k, θ) = ∫ l_k(p̂_k, θ) p_θ(x^{k−1}) dx^{k−1}.

Now let

R_n(p̂, θ) = Σ_{k=1}^{n} r_k(p̂_k, θ)

denote the cumulative risk and note that R_n(p̂, θ) = D_KL(p_θ^n || p̂), where

p_θ^n(x^n) = Π_{k=1}^{n} p_θ(x_k)   and   p̂(x^n) = Π_{k=1}^{n} p̂_k(x_k|x^{k−1}).
The minimax risk is defined by

R_n^{MM} = inf_{p̂} sup_{θ} R_n(p̂, θ),

whereas the Bayes risk is given by

R_{n,w}^{B} = inf_{p̂} ∫ R_n(p̂, θ) w(θ) dθ.
It is well known that Jeffreys’ prior is asymptotically least favorable under the entropy risk,
i.e., it maximizes an asymptotic expression for R_{n,w}^{B} over the class of priors w(·); see Clarke &
Barron (1994). The following theorem shows that, in the case of exponential families, a related
property holds for a conjugate prior.
Theorem 2
Consider an exponential family F. Then the prior that minimizes the Bayes risk R_{n,w}^{B}, over the class P_δ, is a conjugate prior.
Proof. The proof relies on the following equality:

R_{n,w}^{B} = ∫ R_n(m_w, θ) w(θ) dθ,

where

m_w(x^n) = ∫ p_θ^n(x^n) w(θ) dθ

is the (Bayesian) predictive distribution (Aitchison, 1975). Using this, it can then be shown that R_{n,w}^{B} = I(X^n; Θ) (Clarke & Barron, 1994; Haussler & Opper, 1997).
The result now follows from theorem 1 upon recalling that I(X^n; Θ) = I(T; μ).
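The identity R_{n,w}^{B} = I(X^n; Θ) can be checked numerically in simple cases. The sketch below (illustrative only; the Bernoulli–beta model and the values of a0, b0 and n are our own choices, not from the paper) estimates the cumulative relative entropy risk of the Bayesian predictive distribution by Monte Carlo and compares it with a Monte Carlo estimate of the mutual information:

import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(1)
a0, b0, n, sims = 2.0, 3.0, 10, 200_000

mu = rng.beta(a0, b0, size=sims)                           # Theta ~ prior, in (0, 1) a.s.
x = (rng.random((sims, n)) < mu[:, None]).astype(float)    # X^n | Theta ~ Bernoulli(Theta)

# cumulative risk of the Bayes predictive: at step k it predicts Bernoulli(q_k)
# with q_k = (a0 + successes so far) / (a0 + b0 + k - 1)
s_prev = np.cumsum(x, axis=1) - x
q = (a0 + s_prev) / (a0 + b0 + np.arange(n))
p = mu[:, None]
bayes_risk = (p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))).sum(axis=1).mean()

# mutual information I(X^n; Theta) = E[ log p(X^n | Theta) - log m(X^n) ]
s = x.sum(axis=1)
log_lik = s * np.log(mu) + (n - s) * np.log(1 - mu)
log_marg = betaln(a0 + s, b0 + n - s) - betaln(a0, b0)
mutual_info = (log_lik - log_marg).mean()

print(bayes_risk, mutual_info)   # the two estimates should agree up to Monte Carlo error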
4. Concluding remarks
Given a natural exponential family, the priors in the corresponding conjugate family minimize
(subject to certain constraints on the posteriors) the mutual information between the sample
and the parameter. In this sense, conjugate priors can be regarded as maximally informative in
that they minimize the weight of the observations on inferences about the parameter; in other
words, the expected relative entropy between prior and posterior is minimized. This result can also be interpreted as saying that conjugate priors minimize the Bayes risk of the particular predictive problem described in section 3.3. Thus, conjugate priors and Jeffreys' prior (which maximizes the Bayes risk over the class of priors) can be regarded as the two extremes of a particular approach to the problem of eliciting a prior distribution.
The idea described in this paper, according to which one can adopt a prior that minimizes or maximizes the dependence between Θ and X^n depending on the kind of prior information available, represents, in our opinion, a sensible procedure for selecting a prior. Furthermore, these two extremes could be used, in principle, to assess the information content of the data implicitly assumed by using an informative prior, as in Clarke (1996). Any prior which emphasizes the role of the observation by maximizing I(Θ; X^n) may be regarded as a reasonable candidate to represent vague prior information. However, it is tempting to regard as 'conjugate' any prior minimizing I(Θ; X^n), even in more general (not necessarily exponential family) settings.
Acknowledgements
This work was carried out while the second author was visiting the Department of Probability
and Statistics, National University of Mexico, and was supported by CONACyT Grant
32256-E, Mexico, and Bocconi University, Italy. The first author wishes to acknowledge
partial support from the Sistema Nacional de Investigadores, Mexico. The authors are also
very grateful to two anonymous referees for their insightful comments and suggestions which
greatly improved the presentation of the paper.
References
Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62, 547–554.
Arimoto, S. (1972). An algorithm for computing the capacity of arbitrary discrete memoryless channels.
IEEE Trans. Inform. Theory IT-18, 14–20.
Barndorff-Nielsen, O. (1978). Information and exponential families in statistical theory. Wiley, Chichester.
Berger, T. (1971). Rate distortion theory. Prentice-Hall, Englewood Cliffs, NJ.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy.
Statist. Soc. Ser. B 41, 113–147.
Bernardo, J. M. (1999). Nested hypothesis testing: the Bayesian reference criterion (with discussion). In
Bayesian statistics 6 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith), 101–130. Oxford
University Press, Oxford.
Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. Inform.
Theory IT-18, 460–473.
Cifarelli, D. M. & Regazzini, E. (1983). Qualche osservazione sull’uso di distribuzioni iniziali coniugate
alla famiglia esponenziale. Statistica 43, 415–424.
Cifarelli, D. M. & Regazzini, E. (1987). Priors for exponential families which maximize the association
between past and future observations. In Probability and Bayesian statistics (ed. R. Viertl), 83–95.
Plenum Publishing Corporation, New York.
Clarke, B. S. (1996) Implications of reference priors for prior information and for sample size. J. Amer.
Statist. Assoc. 91, 173–184.
Clarke, B. S. & Barron, A. R. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk.
J. Statist. Plann. Inference 41, 37–60.
Consonni, G. & Veronese, P. (1992). Conjugate priors for exponential families having quadratic variance
functions. J. Amer. Statist. Assoc. 87, 1123–1127.
Cover, T. M. & Thomas, J. A. (1991). Elements of information theory. Wiley, New York.
Csiszár, I. (1974). On the computation of rate-distortion functions. IEEE Trans. Inform. Theory IT-20,
122–124.
Cyranzki, J. F. & Tzannes, N. S. (1983). Some closed form solutions to the mutual information principle.
Kybernetes 12, 187–191.
de Finetti, B. & Savage, L. J. (1962). Sul modo di scegliere le probabilità iniziali. Sui fondamenti della
statistica Biblioteca del Metron Series C 1, 81–147 (English summary, 148–151).
Diaconis, P. & Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7, 269–281.
Gutiérrez-Peña, E. (1992). Expected logarithmic divergence for exponential families. In Bayesian statistics
4 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith), 669–674. Oxford University Press,
Oxford.
Haussler, D. & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy
risk. Ann. Statist. 25, 2451–2492.
Kapur, J. N. & Kesavan, H. K. (1992). Entropy optimization principles with applications. Academic Press,
Boston, MA.
Morris, C. N. (1982). Natural exponential families with quadratic variance functions. Ann. Statist. 10,
65–80.
Robert, C. P. (1996). Intrinsic losses. Theory and Decision 40, 191–214.
Rueda, R. (1992). A Bayesian alternative to parametric hypothesis testing. Test 1, 61–67.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423,
623–656.
Shannon, C. E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Convent.
Rec. 4, 142–163.
Tzannes, N. S. & Noonan, J. P. (1973). The mutual information principle and applications. Inform.
Control 22, 1–12.
Yuan, A. & Clarke, B. S. (1999a). An information criterion for likelihood selection. IEEE Trans. Inform.
Theory 45, 562–571.
Yuan, A. & Clarke, B. S. (1999b). A minimally informative likelihood for decision analysis: illustration and
robustness. Canad. J. Statist. 27, 649–665.
Received February 2002, in final form August 2003
Eduardo Gutiérrez-Peña, IIMAS-UNAM, Apartado Postal 20–726, 01000 México D.F., Mexico.
E-mail: [email protected]