PROBABILITY MEASURES WITH GIVEN MARGINALS
AND CONDITIONALS: I-PROJECTIONS AND
CONDITIONAL ITERATIVE PROPORTIONAL FITTING
Erhard Cramer
Abstract: The iterative proportional fitting procedure (IPF-P) is an
algorithm for the approximate computation of probability measures with prescribed
marginals. We propose two extensions of the IPF-P, called conditional iterative proportional fitting procedures (CIPF-P), which additionally take given
conditional distributions into account. This modification is carried
out by using the geometrical interpretation of the IPF-P as a successive application of I-projections. Finally, we establish the convergence of both CIPF-Ps
in the finite discrete case.
1. Introduction
The specification of probability models by marginal and conditional distributions
has many applications in statistics, e.g., in Bayesian statistics, in contingency
tables, and in log-linear models. This kind of modeling involves three principal
problems: the existence, the uniqueness, and the construction (or approximation) of a probability measure with the prescribed distributions. In particular,
the first two problems have been extensively investigated in the literature. Above
all, pure models are considered, i.e., models that are specified exclusively either
by marginals or by conditionals [e.g., [14], [2] and the references therein]. Some
papers address the problem of a unique specification by mixed constraints given
that a measure with these distributions exists and present general criteria for
uniqueness [cf. [18], [3], [4]].
These considerations are immediately connected to the construction or approximation problem. Inter alia the so-called iterative proportional fitting procedure (IPF-P) applies to the approximation problem in the marginal case. The
IPF-P was apparently first published by [26]. Since [16] used it to estimate cell
AMS 1991 subject classifications. 60E05, 62B10.
Key words and phrases. Conditional iterative proportional fitting, iterative proportional fitting, I-projection, Kullback-Leibler distance, distributions with given marginals, conditional
specification.
frequencies in contingency tables under marginal constraints, the finite
discrete case in particular has received great attention. Some relevant papers in this setting
are [23], [17], [15], [19], [21] and [12]. Surveys of the procedure’s properties and its
application are given in [6], [19] and [20]. A detailed description of the historical
development of the IPF-P and related procedures is presented in [8]. Extensions
of the originally discrete version to the continuous case were proposed by [23], [29]
and [21] [see also [37]]. [41] use a version of the IPF-P to calculate the maximum
likelihood estimators in graphical Gaussian models [see [1], [31]].
In the present paper we develop a different kind of extension. Our aim is to
approximately construct a probability measure P∗ that not only has prescribed
marginals but also possesses given conditional distributions. To be more precise,
let m ∈ N be the number of given marginals and conditionals and Mj , 1 ≤ j ≤ m,
be the set of all probability measures that fulfill the jth constraint. We formulate
our extension of the IPF-P using the following geometric observation [cf. [27],
[12]]: The structure of the IPF-P is similar to the cyclic procedure of calculating
the projection on the intersection of non-orthogonal subspaces in Hilbert spaces
[cf. [43]]. Starting with an initial measure P^(0) we calculate the so-called I-projection P^(1) of P^(0) on the set M_1 with respect to (w.r.t.) the Kullback-Leibler distance

$$ I(P\|Q) = \begin{cases} \displaystyle\int_X \log\frac{dP}{dQ}(x)\, dP(x), & P \ll Q, \\[1ex] +\infty, & P \not\ll Q. \end{cases} $$
Then we compute the I-projection P(2) of P(1) on M2 and so on. After projecting
P(m−1) onto Mm we start a new cycle:
$$ P^{(0)} \xrightarrow{\,M_1\,} P^{(1)} \xrightarrow{\,M_2\,} \cdots \xrightarrow{\,M_m\,} P^{(m)} \xrightarrow{\,M_1\,} P^{(m+1)} \xrightarrow{\,M_2\,} \cdots $$
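In the finite discrete case the Kullback-Leibler distance above can be evaluated directly. The following Python fragment is our own illustration, not part of the original development; the dictionary encoding of the measures is an assumption made for the sketch.

```python
import math

def kl(p, q):
    """Kullback-Leibler distance I(P||Q) for finite distributions given as
    dictionaries mapping outcomes to probabilities.  Returns +infinity when
    P is not absolutely continuous w.r.t. Q, matching the definition above."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # 0 * log(0/q) = 0 by the usual convention
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # P is not << Q
        total += px * math.log(px / qx)
    return total
```

The cyclic procedure sketched in the diagram repeatedly applies such projections; the asymmetry of `kl(p, q)` versus `kl(q, p)` is exactly what gives rise to the two optimization problems (I_1) and (I_2) below.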
The I-projection of a probability measure P on a set M is by definition the solution of the minimization problem

$$ (I_1) \qquad I(Q\|P) \longrightarrow \min_{Q \in M}. $$

But since the Kullback-Leibler distance is not symmetric in P and Q, the minimization can be carried out w.r.t. the second component as well. This yields, for a given measure P, the optimization problem

$$ (I_2) \qquad I(P\|Q) \longrightarrow \min_{Q \in M}. $$
In our setting the set M is one of the sets Mj , 1 ≤ j ≤ m. Subsequently, we
consider both optimization problems. We call the solution of the problem (Ij )
the Ij -projection of P on M.
Our first aim is to solve the optimization problems (I1 ) and (I2 ) for one
given constraint, i.e., M is one of the sets Mj , 1 ≤ j ≤ m. After presenting
some notations and preliminaries in Section 2 we give explicit expressions for
the ρ-densities of the I-projections in Section 3. Moreover, we obtain that the
solutions of the minimization problems, the I1 -projections and the I2 -projections,
respectively, are identical if Mj is prescribed by a marginal and different if Mj is
specified by a conditional.
Adopting the method of the IPF-P we formulate in Section 4 two different
procedures to calculate an approximation of the required measure. In case of
exclusively marginal constraints the procedures coincide and correspond to the
usual IPF-P. A discrete version of these procedures was introduced by [33]. Using an expression of [7] we call our algorithms conditional iterative proportional
fitting procedures (CIPF-Ps). Extending the Gaussian IPF-P introduced by [41],
[10] proposes Gaussian CIPF-Ps to compute Gaussian measures prescribed by
Gaussian marginals as well as Gaussian conditionals.
After the introduction of the algorithms we focus on the convergence properties of the CIPF-Ps in the finite discrete case in Section 5. The convergence
of the IPF-P in this setting was proved by several authors [e.g., [23], [17], [15],
[19], [12]]. In the general case it remained an open problem until [37] established
the convergence in the case of two prescribed marginals. It has to be mentioned
that a convergence proof in the particular case of Gaussian distributions has
been provided by [41]. Their result was extended by [1] to complex Gaussian
distributions. [10] proves the convergence of both CIPF-Ps in the case of underlying
Gaussian distributions.
After deriving some general properties of the sequences of probability measures generated by the CIPF-Ps we prove the convergence of both CIPF-Ps in
the finite discrete case.
2. Preliminaries and notations
We consider the following situation: Let I_n = {1, …, n}, n ∈ N, be a finite set and (X_j, B_j, ρ_j) be measure spaces with σ-finite measures ρ_j, j ∈ I_n. For ∅ ≠ K ⊆ I_n we define the product spaces and the product measure

$$ X_K = \bigtimes_{j \in K} X_j, \qquad B_K = \bigotimes_{j \in K} B_j, \qquad \rho_K = \bigotimes_{j \in K} \rho_j. $$

If K = I_n we write X, B and ρ,
respectively. We assume that all considered probability measures on (X, B) are
absolutely continuous w.r.t. the product measure ρ. The set of all these measures is denoted
by P, the set of all probability measures on (X, B) by M1 (X, B). Given x ∈ X
the vector xK ∈ XK is defined by (xk(1) , . . . , xk(r) ) where k(1) < · · · < k(r) are
the ordered elements of the set K ⊆ I_n. For a non-empty subset K of I_n we denote by P_K the marginal measure of P on (X_K, B_K), and for ∅ ≠ K, L ⊆ I_n with K ∩ L = ∅ we define the stochastic kernel P^•_{K|L} : X_L × B_K → [0, 1] by the ρ_K-density

$$ \frac{dP^{x_L}_{K|L}}{d\rho_K}(x_K) = \begin{cases} \dfrac{\frac{dP_{K\cup L}}{d\rho_{K\cup L}}(x_{K\cup L})}{\frac{dP_L}{d\rho_L}(x_L)}, & \text{if } \frac{dP_L}{d\rho_L}(x_L) > 0, \\[2ex] g_K(x_K), & \text{if } \frac{dP_L}{d\rho_L}(x_L) = 0, \end{cases} $$

where g_K is any ρ_K-density on X_K. For the marginal measure P_K we write alternatively P^•_{K|∅}.
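In the finite discrete case the kernel above amounts to an ordinary conditional probability table, completed by an arbitrary density g_K on the P_L-null set. The following Python sketch is our own illustration (the pair encoding (xK, xL) of the joint measure is an assumption of the sketch):

```python
def conditional_kernel(joint, g):
    """Finite-case version of the kernel P_{K|L}: `joint` maps pairs
    (xK, xL) to probabilities; `g` is an arbitrary fixed probability
    density on the K-coordinate, used where the L-marginal vanishes."""
    # L-marginal of the joint measure
    pL = {}
    for (xK, xL), mass in joint.items():
        pL[xL] = pL.get(xL, 0.0) + mass

    def kernel(xL, xK):
        if pL.get(xL, 0.0) > 0.0:
            return joint.get((xK, xL), 0.0) / pL[xL]
        return g[xK]  # arbitrary choice on the null set, as in the text
    return kernel
```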
Suppose that for 1 ≤ j ≤ m disjoint sets L(j), K(j) ⊆ I_n with K(j) ≠ ∅ are given. If L(j) = ∅ let Ψ^•_{K(j)|∅} be a given probability measure on (X_{K(j)}, B_{K(j)}); if L(j) ≠ ∅ let Ψ^•_{K(j)|L(j)} : X_{L(j)} × B_{K(j)} → [0, 1] be a given stochastic kernel. The sets M_j, 1 ≤ j ≤ m, are defined by

$$ M_j = M_{K(j)|L(j)} = \{ Q \in P : Q^{\bullet}_{K(j)|L(j)} = \Psi^{\bullet}_{K(j)|L(j)} \ [Q_{L(j)}] \}. $$

If we focus on one set M_j we write M_{K|L} instead of M_j to emphasize the dependence on the sets K and L.
Since the Kullback-Leibler distance of two probability measures P and Q is
not symmetric we can minimize it for a given measure P w.r.t. the first and the
second component, respectively. This motivates the following definition:
DEFINITION 2.1. Let M ⊆ M^1(X, B) be a non-empty set of probability measures and P be a given probability measure on (X, B).

(i) A probability measure P∗ is called the I_1-projection of P on M iff I(P∗||P) = min{I(Q||P) : Q ∈ M}.

(ii) A probability measure P∗ is called the I_2-projection of P on M iff I(P||P∗) = min{I(P||Q) : Q ∈ M}.
I_1-projections have received great attention in the literature [e.g., [42], [38], [39],
[5]]. They can be understood as a special case of so-called f -projections [see
[32], [36], [42]]. [13] use both types of I-projections to formulate an alternating
minimization procedure. This algorithm can be used to calculate the minimum
distance w.r.t. the Kullback-Leibler distance between two convex sets of probability measures.
The families of probability measures with finite distance towards a given measure P are denoted by

$$ S_1(P, \delta) = \{ Q \in M^1(X, B) : I(Q\|P) < \delta \}, \qquad S_2(P, \delta) = \{ Q \in M^1(X, B) : I(P\|Q) < \delta \}, \qquad 0 < \delta \le \infty. $$
Before we turn to the projections themselves we give some properties of the
sets Mj . Since the results can be easily derived we omit the proofs. Let ||P−Q|| =
sup{|P(A) − Q(A)| : A ∈ B} be the total variation of the probability measures
P and Q.
LEMMA 2.2. The sets Mj are convex, 1 ≤ j ≤ m.
LEMMA 2.3. The sets Mj are variation closed, 1 ≤ j ≤ m.
Since M_{K|L} is variation closed, Theorem 2.1 in [12] yields, under the condition S_1(P, ∞) ∩ M_{K|L} ≠ ∅, the existence of a unique I_1-projection P∗ = T_{1,K|L}P of P on M_{K|L}. Moreover, a probability measure P∗ is the I_1-projection of P on M_{K|L} iff

$$ I(Q\|P) \ge I(Q\|P^*) + I(P^*\|P) \qquad \forall\, Q \in M_{K|L}. \tag{2.1} $$

If Q ∈ M_{K|L} ∩ S_1(P, ∞), inequality (2.1) is equivalent to the inequality

$$ \int_X \log\frac{dP^*}{dP}(x)\, dQ(x) \ge I(P^*\|P). \tag{2.2} $$
The latter characterization of the I1 -projections is applied in the proof of Theorem
3.4 to establish the representation of the I1 -projection. A characterization of the
I2 -projection by a triangle inequality of the type (2.1) does not hold in general.
[9, Theorem 23.3] presents an analogous inequality for some exponential families.
However, in Corollary 3.11 we prove an equation of this kind for the set MK|L and
an arbitrary probability measure P provided that K ∪ L = In . A characterization
of I2 -projections similar to (2.2) is established by [36] who investigates optimality
conditions for f -projections (see Remark 3.10).
3. Representation of I-projections
In this section we establish explicit representations for the ρ-densities of the I-projections of P on one set M_{K|L}. We observe that in the case of a marginal
adjustment the densities of the I1 -projection and the I2 -projection, respectively,
coincide whereas they are different for conditionals.
Before presenting the results we introduce the weighted Kullback-Leibler distance for stochastic kernels ξ and ζ w.r.t. a measure η, i.e.,

$$ I_\eta(\xi\|\zeta) = \int_X I(\xi(x,\cdot)\|\zeta(x,\cdot))\, d\eta(x), $$
and the following decomposition of the Kullback-Leibler distance:
LEMMA 3.1. Let P, Q ∈ M^1(X, B) be probability measures with I(P||Q) < ∞.

(i) If K, L ≠ ∅ is a decomposition of I_n, i.e., K ∪ L = I_n and K ∩ L = ∅, then

$$ I(P\|Q) = I_{P_L}(P^{\bullet}_{K|L}\|Q^{\bullet}_{K|L}) + I(P_L\|Q_L). \tag{3.1} $$

(ii) Let K(1), …, K(r) (r ≥ 2) be a decomposition of I_n with K(i) ≠ ∅, 1 ≤ i ≤ r. Introducing the notation L(s) = ∪_{i=1}^{s−1} K(i), 2 ≤ s ≤ r, we have:

$$ I(P\|Q) = \sum_{s=2}^{r} I_{P_{L(s)}}\bigl(P^{\bullet}_{K(s)|L(s)}\|Q^{\bullet}_{K(s)|L(s)}\bigr) + I\bigl(P_{K(1)}\|Q_{K(1)}\bigr). $$
Proof. From I(P||Q) < ∞ we conclude P ≪ Q. Applying a result of [34, p. 342] [see also [40, p. 561]] we obtain P_L ≪ Q_L and P^{x_L}_{K|L} ≪ Q^{x_L}_{K|L} [P_L] and

$$ \frac{dP}{dQ}(x_{K\cup L}) = \frac{dP^{x_L}_{K|L}}{dQ^{x_L}_{K|L}}(x_K) \cdot \frac{dP_L}{dQ_L}(x_L) \quad [Q]. $$

Applying this factorization of the density, the additive decomposition (3.1) results from a direct evaluation of the Kullback-Leibler distance I(P||Q). Assertion (ii) follows from (i) by induction on r.
REMARK 3.2. In Section 2 it is supposed that the measures P and Q are
absolutely continuous w.r.t. the product measure ρ. It has to be mentioned that
the additive decompositions presented in Lemma 3.1 hold without this restriction,
because the proof makes no use of this assumption.
REMARK 3.3. Under the assumptions of Lemma 3.1 (ii) we obtain for product measures P = ⊗_{i=1}^r P_{K(i)} and Q = ⊗_{i=1}^r Q_{K(i)} a well-known decomposition of the Kullback-Leibler distance [see [42, p. 229]]:

$$ I(P\|Q) = \sum_{i=1}^{r} I\bigl(P_{K(i)}\|Q_{K(i)}\bigr). $$
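The decomposition (3.1) can be checked numerically on a small finite example. The sketch below is our own illustration (the 2×2 tables and the list-based encoding are assumptions); it computes both sides of the chain rule for a joint distribution indexed as p[xK][xL].

```python
import math

def kl_vec(p, q):
    """Kullback-Leibler distance between two finite probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distributions on {0,1} x {0,1}, entries indexed as m[xK][xL].
P = [[0.1, 0.2], [0.3, 0.4]]
Q = [[0.25, 0.25], [0.25, 0.25]]

def marginal_L(m):
    return [m[0][0] + m[1][0], m[0][1] + m[1][1]]

def cond_K_given_L(m):
    mL = marginal_L(m)
    return [[m[k][l] / mL[l] for l in range(2)] for k in range(2)]

PL, QL = marginal_L(P), marginal_L(Q)
PKL, QKL = cond_K_given_L(P), cond_K_given_L(Q)

# left hand side of (3.1): I(P||Q) on the product space
lhs = kl_vec([P[0][0], P[0][1], P[1][0], P[1][1]],
             [Q[0][0], Q[0][1], Q[1][0], Q[1][1]])
# right hand side: weighted conditional divergence plus marginal term
rhs = sum(PL[l] * kl_vec([PKL[0][l], PKL[1][l]],
                         [QKL[0][l], QKL[1][l]]) for l in range(2)) \
      + kl_vec(PL, QL)
```

Both sides agree to machine precision, as the lemma asserts.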
THEOREM 3.4. Let P ∈ P, S_1(P, ∞) ∩ M_{K|L} ≠ ∅ and R = I_n \ (K ∪ L).

(i) If we prescribe a marginal Ψ^•_{K|∅} (i.e., L = ∅) the I_1-projection P∗ = T_{1,K|∅}P of P on M_{K|L} has the ρ-density

$$ \frac{d\,T_{1,K|\emptyset}P}{d\rho}(x) = \frac{dP^{x_K}_{R|K}}{d\rho_R}(x_R)\, \frac{d\Psi_K}{d\rho_K}(x_K). \tag{3.2} $$

(ii) If we prescribe a conditional Ψ^•_{K|L} (i.e., L ≠ ∅) the I_1-projection P∗ = T_{1,K|L}P of P on M_{K|L} has the ρ-density

$$ \frac{d\,T_{1,K|L}P}{d\rho}(x) = c_{K|L}\, \frac{dP^{x_{K\cup L}}_{R|K\cup L}}{d\rho_R}(x_R)\, \frac{d\Psi^{x_L}_{K|L}}{d\rho_K}(x_K)\, \frac{dP_L}{d\rho_L}(x_L)\, \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr) \tag{3.3} $$

with the normalizing constant

$$ c_{K|L} = \Bigl( \int_{X_L} \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr)\, dP_L(x_L) \Bigr)^{-1} \in [1, \infty). \tag{3.4} $$
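In the finite discrete case with R = ∅ (i.e., K ∪ L = I_n), formula (3.3) can be evaluated directly. The following Python sketch is our own illustration; the table encoding p[xK][xL], the kernel encoding psi[xK][xL] = Ψ(xK | xL), and the helper names are assumptions of the sketch, not notation from the paper.

```python
import math

def kl_rows(a, b):
    """KL distance between two finite probability vectors; +inf if a !<< b."""
    out = 0.0
    for ai, bi in zip(a, b):
        if ai > 0:
            if bi == 0:
                return math.inf
            out += ai * math.log(ai / bi)
    return out

def i1_project_conditional(p, psi):
    """I1-projection of the joint table p[xK][xL] onto the set of measures
    with prescribed conditional psi[xK][xL], following (3.3) in the special
    case R = empty; the normalization implements the constant (3.4)."""
    nK, nL = len(p), len(p[0])
    pL = [sum(p[k][l] for k in range(nK)) for l in range(nL)]
    pcond = [[(p[k][l] / pL[l]) if pL[l] > 0 else 0.0 for l in range(nL)]
             for k in range(nK)]
    t = [[0.0] * nL for _ in range(nK)]
    for l in range(nL):
        d = kl_rows([psi[k][l] for k in range(nK)],
                    [pcond[k][l] for k in range(nK)])
        w = math.exp(-d) if d < math.inf else 0.0  # the scaling factor
        for k in range(nK):
            t[k][l] = psi[k][l] * pL[l] * w
    c = sum(map(sum, t))  # 1/c_{K|L}
    return [[t[k][l] / c for l in range(nL)] for k in range(nK)]
```

The result carries the prescribed conditional by construction, while the exponential factor down-weights those x_L where the prescribed kernel is far from the current one.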
Proof. First of all, the uniqueness of the I_1-projection is a consequence of the assumption S_1(P, ∞) ∩ M_{K|L} ≠ ∅ and the strict convexity of the Kullback-Leibler distance [cf. [32, 2.2]]. Hence, it is sufficient to show that the measures defined by the densities (3.2) and (3.3) solve the minimization problem given in Definition 2.1 (i).
Although the I1 -projection in the marginal case has been applied in many
papers concerned with the IPF-Procedure (cf. Remark 3.7), we give the proof
for completeness as well. Let L = ∅ and M be the probability measure defined by the density

$$ \frac{dM}{d\rho}(x) = \frac{dP^{x_K}_{R|K}}{d\rho_R}(x_R)\, \frac{d\Psi_K}{d\rho_K}(x_K). $$

Integration yields M ∈ M_{K|∅}. By assumption we have S_1(P, ∞) ∩ M_{K|∅} ≠ ∅. Consequently, a measure Q_0 ∈ M_{K|∅} with I(Q_0||P) < ∞ exists. Applying Lemma 3.1 we deduce from Q_{0,K} = Ψ^•_{K|∅} = M_K the inequality

$$ I(M_K\|P_K) = I(\Psi_K\|P_K) \le I_{Q_{0,K}}\bigl(Q^{\bullet}_{0,R|K}\|P^{\bullet}_{R|K}\bigr) + I(Q_{0,K}\|P_K) = I(Q_0\|P) < \infty. $$

This implies M_K ≪ P_K and therefore M ≪ P ≪ ρ. A straightforward calculation shows I(M||P) = I(Q_K||P_K), Q ∈ M_{K|∅}. Applying again Lemma 3.1 yields I_{Q_K}(Q^•_{R|K}||P^•_{R|K}) = I_{Q_K}(Q^•_{R|K}||M^•_{R|K}) = I(Q||M). Finally, we obtain

$$ I(Q\|P) = I(Q\|M) + I(M\|P) \ge I(M\|P), \qquad Q \in M_{K|\emptyset}. \tag{3.5} $$
Let L ≠ ∅. The proof of this case is more difficult. It is carried out in five steps. First, let the measure M be defined by the right hand side of (3.3).

(1) In the first step we derive a condition on the constant c_{K|L}. We have to guarantee that M is a probability measure, i.e., M(X) = 1. From the definition of M we get the condition:

$$ 1 = M(X) = \int_X \frac{dM}{d\rho}(x)\, d\rho(x) = c_{K|L} \int_{X_L}\!\int_{X_K}\!\int_{X_R} \frac{dP^{x_{K\cup L}}_{R|K\cup L}}{d\rho_R}(x_R)\, \frac{d\Psi^{x_L}_{K|L}}{d\rho_K}(x_K)\, \frac{dP_L}{d\rho_L}(x_L)\, \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr)\, d\rho_R(x_R)\, d\rho_K(x_K)\, d\rho_L(x_L) $$
$$ = c_{K|L} \int_{X_L} \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr)\, dP_L(x_L) = c_{K|L} \cdot \tilde{c}, \quad \text{say.} \tag{3.6} $$
(2) We show 0 < c̃ ≤ 1. By assumption a measure Q_0 ∈ M_{K|L} ∩ S_1(P, ∞) exists. This implies I_{Q_{0,L}}(Ψ^•_{K|L}||P^•_{K|L}) < ∞ and Q_{0,L} ≪ P_L. Therefore the non-negative function h(x_L) = I(Ψ^{x_L}_{K|L}||P^{x_L}_{K|L}), x_L ∈ X_L, is Q_{0,L}-a.e. finite, ergo exp∘(−h) > 0 [Q_{0,L}]. Let N_L = {x_L ∈ X_L : exp(−h(x_L)) > 0}. We conclude Q_{0,L}(N_L) = 1 and P_L(N_L) > 0, so that

$$ 0 < \tilde{c} = \int_{N_L} \exp(-h(x_L))\, dP_L(x_L) = \int_{X_L} \exp(-h(x_L))\, dP_L(x_L) \le 1. $$

This yields the claimed representation for c_{K|L} given by (3.4).
(3) Now we prove the existence of the Radon-Nikodym derivative dM/dP. From the assumption Q_0 ∈ M_{K|L} ∩ S_1(P, ∞) we conclude as above Q_0 ≪ P. With Q^•_{0,K|L} = Ψ^•_{K|L} ≪ P^•_{K|L} [P_L] and the implication

$$ P_L(A_L) = 0 \;\Longrightarrow\; M_L(A_L) = \int_{A_L} c_{K|L} \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr)\, dP_L(x_L) = 0 $$

we get M_L ≪ P_L and ergo M ≪ P ≪ ρ.
(4) In order to prove the optimality of M we first have to establish that M ∈ M_{K|L} ∩ S_1(P, ∞). Since M ∈ M_{K|L}, Lemma 3.1 (i) and M^•_{R|K∪L} = P^•_{R|K∪L} lead to the equation

$$ I(M\|P) = I_{M_L}\bigl(\Psi^{\bullet}_{K|L}\|P^{\bullet}_{K|L}\bigr) + I(M_L\|P_L). $$

From 0 ≤ u e^{−u} ≤ e^{−1} = max_{0≤z<∞} z e^{−z}, u ≥ 0, we deduce

$$ I_{M_L}\bigl(\Psi^{\bullet}_{K|L}\|P^{\bullet}_{K|L}\bigr) = \int_{X_L} I\bigl(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L}\bigr)\, dM_L(x_L) = c_{K|L} \int_{X_L} h(x_L) \exp(-h(x_L))\, dP_L(x_L) \le c_{K|L} \cdot e^{-1} < \infty $$

and

$$ I(M_L\|P_L) = \int_{X_L} \log\bigl( c_{K|L} \exp(-h(x_L)) \bigr)\, dM_L(x_L) = \log c_{K|L} - c_{K|L} \int_{X_L} h(x_L) \exp(-h(x_L))\, dP_L(x_L) \le \log c_{K|L} < \infty. $$

Summing up we obtain I(M||P) ≤ c_{K|L} · e^{−1} + log c_{K|L} < ∞ and therefore M ∈ M_{K|L} ∩ S_1(P, ∞).
(5) M has the minimal property (2.1). If I(Q||P) = ∞ nothing remains to be shown. We establish (2.2) for Q ∈ M_{K|L} ∩ S_1(P, ∞):

$$ \int_X \log\frac{dM}{dP}(x)\, dQ(x) = \int_X \log\Bigl( c_{K|L}\, \frac{dM^{x_L}_{K|L}}{dP^{x_L}_{K|L}}(x_K)\, \exp\bigl(-I(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L})\bigr) \Bigr)\, dQ(x) $$
$$ = \log c_{K|L} + \int_X \log \frac{dM^{x_L}_{K|L}}{dP^{x_L}_{K|L}}(x_K)\, dQ(x) - \int_X I\bigl(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L}\bigr)\, dQ(x) \tag{3.7} $$
$$ = \log c_{K|L} + \int_{X_L} I\bigl(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L}\bigr)\, dQ_L(x_L) - \int_{X_L} I\bigl(\Psi^{x_L}_{K|L}\|P^{x_L}_{K|L}\bigr)\, dQ_L(x_L) = \log c_{K|L}. $$

In the preceding calculation we have shown that (2.2) holds with equality. This property leads to an interesting identity for I_1-projections. A more detailed discussion is postponed to Corollary 3.6.

Since log c_{K|L} is independent of Q ∈ M_{K|L}, the preceding calculation is valid for M (∈ M_{K|L} ∩ S_1(P, ∞) by (4)) as well. So we have I(M||P) = log c_{K|L}, which yields the desired result.
REMARK 3.5. In view of the solution (3.2) for a prescribed marginal, one might expect that in the conditional case the density of the I_1-projection can be obtained by a similar construction, i.e.,

$$ \frac{d\tilde{P}}{d\rho}(x) = \frac{dP^{x_{K\cup L}}_{R|K\cup L}}{d\rho_R}(x_R)\, \frac{d\Psi^{x_L}_{K|L}}{d\rho_K}(x_K)\, \frac{dP_L}{d\rho_L}(x_L). \tag{3.8} $$

In contrast to this idea, (3.3) points out that this intuitive solution has to be multiplied by the scaling function c_{K|L} exp(−I(Ψ^{x_L}_{K|L}||P^{x_L}_{K|L})). This factor measures the distance between the kernels Ψ^{x_L}_{K|L} and P^{x_L}_{K|L} for a fixed x_L. Hence, it can be seen as a smoothing component, which becomes more significant as the distance between the kernels increases. The density of the projection T_{1,K|L}P is defined to be zero if I(Ψ^{x_L}_{K|L}||P^{x_L}_{K|L}) = ∞.

In addition to this smoothing aspect, the factor guarantees that the measure T_{1,K|L}P is absolutely continuous w.r.t. the given measure P. The absolute continuity has to be fulfilled necessarily by the I_1-projection of P, because we assume that the intersection S_1(P, ∞) ∩ M_{K|L} is nonempty. Unlike this, the measure P̃ given by (3.8) will in general not have this property. This can be seen as follows: Suppose that P̃ ≪ P, which implies Ψ^{x_L}_{K|L} ≪ P^{x_L}_{K|L} except for a set N_L with P_L-measure zero. However, in some cases it is not possible to comply with this condition, although an I_1-projection of P on M_{K|L} exists.
The assumption S_1(P, ∞) ∩ M_{K|L} ≠ ∅ yields an even weaker condition. Namely, Ψ^{x_L}_{K|L} ≪ P^{x_L}_{K|L} has to be valid only except for a set N_L of Q_L-measure zero, where Q_L is an arbitrary probability measure with Q_L ≪ P_L. Hence, the support of Q_L can be smaller than that of P_L. This behaviour is illustrated by the following example: Let X = L × K, L = {0, 1}, K = {2, 3}, and the probability measure P be defined by P({(0, 2)}) = 0, P({(1, 3)}) = 1/2 and P({(0, 3)}) = P({(1, 2)}) = 1/4. Given the kernel Ψ^x_{K|L}({y}) = 1/2 for all x ∈ {0, 1}, y ∈ {2, 3}, we obtain that the probability measure P̃ defined by (3.8) is not absolutely continuous w.r.t. P, since P({(0, 2)}) = 0 and P̃({(0, 2)}) = 1/8. Hence, P̃ ∉ S_1(P, ∞) ∩ M_{K|L}. An element of S_1(P, ∞) ∩ M_{K|L} is given by the measure Q where Q({(0, 2)}) = Q({(0, 3)}) = 0 and Q({(1, 2)}) = Q({(1, 3)}) = 1/2.
By taking into account part (5) of the proof of Theorem 3.4, we find for the probability measure M defined by the right hand side of (3.3) the equation:

$$ I(Q\|P) = \int_X \log\frac{dQ}{dP}(x)\, dQ(x) = \int_X \log\frac{dQ}{dM}(x)\, dQ(x) + \int_X \log\frac{dM}{dP}(x)\, dQ(x) \overset{(3.7)}{=} I(Q\|M) + I(M\|P). $$

Hence, an interesting conclusion from the proof of Theorem 3.4 is given by Corollary 3.6. [12] calls this phenomenon an analogue of the Pythagorean law. In the case of a given marginal it results directly from (3.5).
COROLLARY 3.6. Let S_1(P, ∞) ∩ M_{K|L} ≠ ∅ and T_{1,K|L}P be the I_1-projection of P on M_{K|L}. Then:

$$ I(Q\|P) = I(Q\|T_{1,K|L}P) + I(T_{1,K|L}P\|P) \qquad \forall\, Q \in M_{K|L}. \tag{3.9} $$
REMARK 3.7. The result for the I1 -projection in the marginal case can be
found in many publications [e.g., [23], [29], [15], [12], [38], [37]]. [5] obtain it
by Fenchel duality if the dominating measure ρ is given by the two-dimensional
Lebesgue measure.
THEOREM 3.8. Let P ∈ P, S_2(P, ∞) ∩ M_{K|L} ≠ ∅ and R = I_n \ (K ∪ L).

(i) If we prescribe a marginal Ψ^•_{K|∅} (i.e., L = ∅) the I_2-projection P∗ = T_{2,K|∅}P of P on M_{K|∅} is given by the ρ-density

$$ \frac{d\,T_{2,K|\emptyset}P}{d\rho}(x) = \frac{dP^{x_K}_{R|K}}{d\rho_R}(x_R)\, \frac{d\Psi_K}{d\rho_K}(x_K). \tag{3.10} $$

(ii) If we prescribe a conditional Ψ^•_{K|L} (i.e., L ≠ ∅) the I_2-projection P∗ = T_{2,K|L}P of P on M_{K|L} is given by the ρ-density

$$ \frac{d\,T_{2,K|L}P}{d\rho}(x) = \frac{dP^{x_{K\cup L}}_{R|K\cup L}}{d\rho_R}(x_R)\, \frac{d\Psi^{x_L}_{K|L}}{d\rho_K}(x_K)\, \frac{dP_L}{d\rho_L}(x_L). \tag{3.11} $$
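In the finite discrete case with R = ∅, the I_2-projection (3.11) is particularly simple: the prescribed conditional replaces the current one while the conditioning marginal is kept. The following Python sketch is our own illustration; the encodings p[xK][xL] and psi[xK][xL] = Ψ(xK | xL) are assumptions of the sketch.

```python
def i2_project_conditional(p, psi):
    """I2-projection of the joint table p[xK][xL] following (3.11) in the
    special case R = empty: impose the prescribed conditional psi while
    keeping the current L-marginal of p."""
    nK, nL = len(p), len(p[0])
    pL = [sum(p[k][l] for k in range(nK)) for l in range(nL)]
    return [[psi[k][l] * pL[l] for l in range(nL)] for k in range(nK)]
```

Unlike the I_1-projection (3.3), no exponential scaling factor and no normalizing constant appear: the result is automatically a probability measure.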
Proof. As in the proof of Theorem 3.4, the uniqueness of the I_2-projection is a consequence of the assumption S_2(P, ∞) ∩ M_{K|L} ≠ ∅ and the strict convexity of the Kullback-Leibler distance [cf. [32, 2.2]]. Thus it is sufficient to give a measure which solves the minimization problem given in Definition 2.1 (ii).

Let M — as in the proof of Theorem 3.4 — be defined by the right hand side of (3.10) or (3.11), respectively. We have to show that M solves the minimization problem, i.e., I(P||Q) ≥ I(P||M), Q ∈ M_{K|L}.

First let L = ∅. For Q ∈ M_{K|∅} we obtain from Lemma 3.1 (i)

$$ I(P\|Q) = I(P_K\|Q_K) + I_{P_K}\bigl(P^{\bullet}_{R|K}\|Q^{\bullet}_{R|K}\bigr) = I\bigl(P_K\|\Psi^{\bullet}_{K|\emptyset}\bigr) + I_{P_K}\bigl(P^{\bullet}_{R|K}\|Q^{\bullet}_{R|K}\bigr). \tag{3.12} $$

This shows I(P_K||Ψ^•_{K|∅}) < ∞. By definition of M we get for Q = M the equation I(P||M) = I(P_K||Ψ^•_{K|∅}). This yields the minimal property and therefore T_{2,K|∅}P = M, since the weighted Kullback-Leibler distance is non-negative.

The case L ≠ ∅ is treated in a similar way. Lemma 3.1 (ii) yields for Q ∈ M_{K|L}

$$ I(P\|Q) = I(P_L\|Q_L) + I_{P_L}\bigl(P^{\bullet}_{K|L}\|\Psi^{\bullet}_{K|L}\bigr) + I_{P_{K\cup L}}\bigl(P^{\bullet}_{R|K\cup L}\|Q^{\bullet}_{R|K\cup L}\bigr). $$

From the definition of M we conclude I(P||M) = I_{P_L}(P^•_{K|L}||Ψ^•_{K|L}), which proves T_{2,K|L}P = M.
REMARK 3.9. It turns out in (3.10) and (3.11) that the intuitively expected measure is the I_2-projection of the measure P in the marginal as well as in the conditional case. In contrast to the I_1-projection, the absolute continuity condition in the conditional situation, i.e., P^•_{K|L} ≪ Ψ^•_{K|L} [P_L], is met by the measure defined in (3.11). The condition depends only on the given measure P and on the given kernel Ψ^•_{K|L}. Hence, this situation differs from the one considered in Theorem 3.4, where the absolute continuity condition has to be fulfilled only by an appropriate marginal measure Q_L (cf. Remark 3.5).

It has to be mentioned that a measure with the density given in (3.11) is well defined even if S_j(P, ∞) ∩ M_{K|L} = ∅, j = 1, 2. Hence, if no measure with a finite Kullback-Leibler distance towards P exists, one would call this measure the I-projection of P on M_{K|L} in both situations.
REMARK 3.10. The results of Theorem 3.8 can alternatively be proved by applying Theorem 5 in [36], which provides a characterization of f-projections [see also [42, p. 277]]. Since M_{K|L} is a convex set, this theorem reads in our notation: Let φ_f be a differentiable function on (0, ∞) with derivative φ′_f(z), z ∈ (0, ∞), and φ′_f(0) = lim_{z→0, z>0} φ′_f(z). Then P∗ is the f-projection of P on M_{K|L} iff

$$ \int_X \varphi_f'\Bigl(\frac{dP^*}{dP}(x)\Bigr)\, (dP^* - dQ)(x) \le 0 \qquad \forall\, Q \in S_2(P, \infty) \cap M_{K|L}. $$

The Kullback-Leibler distance can be seen as two particular f-divergences. On the one hand, choosing φ_f(x) = x log(x) yields after some rearrangements the optimality condition (2.2) for the I_1-projection. On the other hand, the choice φ_f(x) = −log(x) leads to a similar characterization of the I_2-projection: P∗ is the I_2-projection of P on M_{K|L} iff

$$ -\int_X \Bigl(\frac{dP^*}{dP}(x)\Bigr)^{-1} (dP^* - dQ)(x) \le 0 \qquad \forall\, Q \in S_2(P, \infty) \cap M_{K|L}. $$

A straightforward calculation shows that this inequality is equivalent to

$$ 1 \le \int_X \Bigl(\frac{dP^*}{dP}(x)\Bigr)^{-1} dQ(x) \qquad \forall\, Q \in S_2(P, \infty) \cap M_{K|L}. \tag{3.13} $$

Replacing dP∗/dP by the densities stated in Theorem 3.8, i.e., (3.10) and (3.11), it is easy to see that (3.13) holds.
An interesting question is whether a triangle inequality similar to (2.1) holds for the I_2-projection. In general this is not true. But before presenting a counterexample we give an analogue of Corollary 3.6 for the special case L ≠ ∅, K ∪ L = I_n. The proof is an immediate consequence of Lemma 3.1. If (3.14) holds for the I_2-projection, [9, p. 320] calls such a measure an orthoprojection.

COROLLARY 3.11. Let L ≠ ∅, K ∪ L = I_n, S_2(P, ∞) ∩ M_{K|L} ≠ ∅ and T_{2,K|L}P be the I_2-projection of P on M_{K|L}. Then the following equation holds:

$$ I(P\|Q) = I(T_{2,K|L}P\|Q) + I(P\|T_{2,K|L}P) \qquad \forall\, Q \in M_{K|L}. \tag{3.14} $$
REMARK 3.12. The condition K ∪ L = I_n in Corollary 3.11 is sufficient for equality in (3.14). If L = ∅, K ≠ I_n (given marginal) or K ∪ L ≠ I_n (given conditional), equality need not hold. We present a counterexample in the case of a given marginal. A similar example can be found in the case of a prescribed conditional. Let X = {0, 1}², ρ be the counting measure and P be the uniform distribution on X. The prescribed marginal Ψ_K on X_K = {0, 1} is represented by Ψ_K({0}) = 1/3, Ψ_K({1}) = 2/3. The application of Theorem 3.8 leads to the I_2-projection T_{2,K|∅}P of P on M_{K|∅}, which is represented by the density tp_{00} = tp_{10} = 1/6, tp_{01} = tp_{11} = 1/3. For q ∈ (0, 1/2) we define the measures Q_q ∈ M_{K|∅} by the probabilities (Q_q)_{00} = q/3, (Q_q)_{01} = 1/3 + (2/3)q, (Q_q)_{10} = 1/3 − (1/3)q and (Q_q)_{11} = 1/3 − (2/3)q, and we obtain that

$$ I(P\|Q_q) - I(T_{2,K|\emptyset}P\|Q_q) - I(P\|T_{2,K|\emptyset}P) \;\begin{cases} > 0, & q \in (0, 1/4), \\ = 0, & q = 1/4, \\ < 0, & q \in (1/4, 1/2). \end{cases} $$
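The sign pattern in Remark 3.12 is easy to verify numerically. The following Python sketch is our own illustration; the ordering of the four atoms as (00, 01, 10, 11) is an assumption of the encoding.

```python
import math

def kl(p, q):
    """KL distance between finite probability vectors (all q-entries > 0 here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.25, 0.25, 0.25, 0.25]      # uniform initial measure on {0,1}^2
T = [1/6, 1/3, 1/6, 1/3]          # I2-projection T_{2,K|0}P from Theorem 3.8

def Q(q):
    # family Q_q in M_{K|0}, parametrized by q in (0, 1/2)
    return [q/3, 1/3 + 2*q/3, 1/3 - q/3, 1/3 - 2*q/3]

def delta(q):
    # left hand side of the displayed sign condition
    return kl(P, Q(q)) - kl(T, Q(q)) - kl(P, T)
```

Evaluating `delta` at q = 0.1, 0.25 and 0.4 reproduces the three cases (positive, zero, negative).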
4. Conditional Iterative Proportional Fitting Procedures (CIPF-Ps)
The family of distributions M = ∩_{j=1}^m M_j includes all measures Q ∈ P that have the given marginals and conditionals. Subsequently we assume that a measure with the given constraints exists, i.e., M ≠ ∅. We look for the I_1-projection and the I_2-projection, respectively, of a given probability measure P on the set M.
Applying Lemmata 2.2 and 2.3 in Section 2 we obtain
LEMMA 4.1. The set M is convex and closed w.r.t. the total variation.
In contrast to the case of one restriction we cannot give an explicit representation of the I-projections. Therefore we formulate two iterative procedures to
approximate them. As mentioned in the introduction we call them conditional
iterative proportional fitting procedures (CIPF-P). The algorithms are defined
by successive application of I-projections, i.e., for j = 1, 2:
$$ P_j^{(0)} = P, \qquad P_j^{(1)} = T_{j,K(1)|L(1)} P_j^{(0)}, \quad \ldots, \quad P_j^{(m)} = T_{j,K(m)|L(m)} P_j^{(m-1)}, \qquad P_j^{(m+1)} = T_{j,K(1)|L(1)} P_j^{(m)}, \quad \ldots $$
For brevity we write CIPF-Pj if the CIPF-P is based on the Ij -projection. These
definitions and Lemma 4.1 yield the following properties of the limiting measure
if it exists.
PROPOSITION 4.2. If a measure P_j ∈ M^1(X, B) with lim_{k→∞} ||P_j^{(k)} − P_j|| = 0 exists, then P_j ∈ M, j = 1, 2.
We now derive some properties of the two sequences (P_j^{(k)})_{k∈N_0}, j = 1, 2,
generated by the CIPF-Ps. In the case of I1 -projections some of these results
are given in [12] [see also [37]]. The proofs are for the most part consequences
of the triangle inequality for the I1 -projection and of the decomposition given in
Lemma 3.1, respectively. Let M1 (P) = M ∩ S1 (P, ∞).
LEMMA 4.3. Let M ≠ ∅.

(i) $ I(Q\|P) \ge I(Q\|P_1^{(k)}) + \sum_{i=1}^{k} I(P_1^{(i)}\|P_1^{(i-1)}) $, k ∈ N, Q ∈ M. If Q ∈ M_1(P), equality holds.

(ii) If M_1(P) ≠ ∅ then

$$ \sum_{i=1}^{\infty} I\bigl(P_1^{(i)}\|P_1^{(i-1)}\bigr) < \infty \quad \text{and} \quad \lim_{k\to\infty} I\bigl(P_1^{(k)}\|P_1^{(k-1)}\bigr) = 0. $$

Moreover, we obtain for l ∈ N

$$ \lim_{k\to\infty} \bigl\| P_1^{(k+l)} - P_1^{(k)} \bigr\| = 0. \tag{4.1} $$
The last assertion follows from the inequality ||Q_1 − Q_2||² ≤ 2 I(Q_1||Q_2), Q_1, Q_2 ∈ M^1(X, B) [cf. [11], [28] and [25]]. [29] deduces from (4.1) the convergence of the IPF-P. But since (4.1) is not the Cauchy condition, this conclusion is false [see the remarks in [12] and [37]].
LEMMA 4.4. (i) The sequence I(Q||P_1^{(k)}), k ∈ N_0, is non-increasing for all Q ∈ M.

(ii) The sequence I(Q||P_2^{(k)}), k ∈ N_0, is non-increasing for all Q ∈ M.

(iii) The limits lim_{k→∞} I(Q||P_1^{(k)}) and lim_{k→∞} I(Q||P_2^{(k)}), Q ∈ M_1(P), exist and are finite.
Proof. The results for the I1 -projection are an immediate consequence of Lemma
4.3. The assertion (ii) for the I2 -projection follows from Lemma 3.1. For instance,
we obtain for Q ∈ M ⊆ MK|L in the conditional case
$$ I(Q\|P) = I_{Q_{K\cup L}}\bigl(Q^{\bullet}_{R|K\cup L}\|P^{\bullet}_{R|K\cup L}\bigr) + I_{Q_L}\bigl(Q^{\bullet}_{K|L}\|P^{\bullet}_{K|L}\bigr) + I(Q_L\|P_L) \ge I_{Q_{K\cup L}}\bigl(Q^{\bullet}_{R|K\cup L}\|P^{\bullet}_{R|K\cup L}\bigr) + I(Q_L\|P_L) = I\bigl(Q\|T_{2,K|L}P\bigr). $$
A successive application of this inequality yields the claimed monotonicity.
5. Convergence results for the CIPF-Ps in the finite discrete case
In this section we derive some convergence properties of the CIPF-Ps in the finite
discrete case. The convergence of the CIPF-P1 in this setting is an application
of Theorem 3.1 in [12] who considers I1 -projections on linear subspaces.
THEOREM 5.1. Let X be a finite set. If for given marginals and conditionals a probability measure Q ∈ M_1(P) exists, the CIPF-P_1 converges to a probability measure P∗_1 ∈ M, i.e., lim_{k→∞} I(P∗_1||P_1^{(k)}) = 0. Moreover, P∗_1 is the I_1-projection of P on M.
The proof for the CIPF-P_2 requires more effort. Suppose that X has N elements and that the considered probability measures are represented by their densities w.r.t. the counting measure. The density is denoted by the corresponding small Latin letter, e.g., p instead of P. Notations like I(q||p) and S_2(q, α) instead of I(Q||P) and S_2(Q, α), respectively, are used as well. We make use of the following lemma.
LEMMA 5.2. The function I(q||·) defined by

$$ I(q\|p) = \sum_{x : q(x) > 0} q(x) \log\bigl(q(x)/p(x)\bigr), \qquad q \ll p, $$

is continuous on the sublevel sets S_2(q, α), α ≥ 0.
With these preliminaries we can formulate the convergence result for the CIPF-P_2.
THEOREM 5.3. Let S_1(P, ∞) ∩ M ≠ ∅. Then the sequence (P_2^{(k)})_k generated by the CIPF-P_2 converges to a probability measure P∗_2 ∈ M, i.e., lim_{k→∞} I(P∗_2||P_2^{(k)}) = 0.
Proof. By assumption we have S_1(P, ∞) ∩ M ≠ ∅. Therefore a probability measure Q_0 ∈ M with finite distance I(Q_0||P) = I(q_0||p) towards P exists. Since the function I(q_0||·) is lower semicontinuous [cf. [42, p. 272]] the sublevel sets S_2(q_0, α) ⊆ R^N, α ≥ 0, are closed [cf. [22, p. 148]]. Furthermore, the function I(q_0||·) is strictly convex [cf. [42, p. 271]]. Hence, the sets S_2(q_0, α), α ≥ 0, are bounded iff a non-negative number α_0 ≥ 0 exists such that S_2(q_0, α_0) is bounded and non-empty [cf. [35, p. 70]]. Since S_2(q_0, 0) = {q_0} by the definiteness of the Kullback-Leibler distance [cf. [30]], we obtain the compactness of all sublevel sets S_2(q_0, α), α ≥ 0.

Since (p_2^{(k)})_k ⊆ S_2(q_0, I(q_0||p)) by Lemma 4.4 (ii), a convergent subsequence (p_2^{(l(k))})_k with limit q∗ ∈ M exists. Moreover, for any x ∈ X with q∗(x) > 0 a number k_0(x) exists such that p_2^{(l(k))}(x) > 0 for all k ≥ k_0(x). Writing s = max_{x : q∗(x) > 0} l(k_0(x)), this leads to I(q∗||p_2^{(s)}) < ∞. Applying again Lemma 4.4 we get I(q∗||p_2^{(k)}) ≤ I(q∗||p_2^{(s)}) < ∞ for all k ≥ s. Considering Proposition 4.2, the continuity of I(q∗||·) on the set S_2(q∗, I(q∗||p_2^{(s)})) yields lim_{k→∞} I(q∗||p_2^{(k)}) = 0, which establishes the assertion.
REMARK 5.4. The limiting measure P∗_2 in Theorem 5.3 is not necessarily the I_2-projection of P on the set M. Consider the case of two given marginals to elucidate this phenomenon. By definition the CIPF-Ps coincide. Let X = {(0, 0), (0, 1), (1, 0), (1, 1)} and ρ be the counting measure. The initial distribution P on (X, Pot(X)) is assumed to be the uniform distribution, i.e., P({x}) = 1/4, x ∈ X. The two marginals Ψ^•_{{j}|∅} on X_j = {0, 1} are supposed to be identical, i.e., Ψ_{{j}|∅}({0}) = 1/3 and Ψ_{{j}|∅}({1}) = 2/3, j = 1, 2. The ρ-density of a distribution in M is of the form q_{00} = q, q_{01} = q_{10} = 1/3 − q and q_{11} = 1/3 + q with q ∈ [0, 1/3]. Minimization of the Kullback-Leibler distance w.r.t. the first and the second component, respectively, yields the results

$$ I_1\text{-projection:} \quad q = 1/9, \qquad\qquad I_2\text{-projection:} \quad q = (-1 + \sqrt{17})/24 \approx 0.130129. $$

Therefore the IPF-P does not converge to the I_2-projection of P on M. [42, p. 278/9] presents another example showing that the I_1-projection and the I_2-projection can be different in a marginal setting.
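The two values of q in Remark 5.4 can be recovered by a direct one-dimensional minimization. The Python sketch below is our own illustration (the ternary search and the vector ordering (q_{00}, q_{01}, q_{10}, q_{11}) are assumptions of the sketch); it exploits the strict convexity of both objectives in q.

```python
import math

def density(q):
    # rho-density of a distribution in M, parametrized by q in [0, 1/3]
    return [q, 1/3 - q, 1/3 - q, 1/3 + q]

def kl(p, r):
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

P = [0.25] * 4  # uniform initial distribution

def argmin(f, lo=1e-9, hi=1/3 - 1e-9, iters=200):
    """Ternary search for the minimizer of a strictly convex function."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

q1 = argmin(lambda q: kl(density(q), P))   # I1-projection: minimize I(Q||P)
q2 = argmin(lambda q: kl(P, density(q)))   # I2-projection: minimize I(P||Q)
```

The minimizers agree with q = 1/9 and q = (−1 + √17)/24 stated in the remark.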
EXAMPLE 5.5. In the case of a conditional specification [24] propose a procedure
based on Markov random fields to construct a probability measure. They present
an example where their method does not lead to a distribution fulfilling the given
constraints. Applying the CIPF-Ps we compute an approximation of the sought
measure.
[24] consider random variables X1 and X2 on spaces X1 = {0, 1} and X2 =
{0, 1, 2}, respectively. The specified conditional probabilities are given in the
following tables:
P(X1 = i | X2 = j)   j = 0   j = 1   j = 2
i = 0                1       0.43    0
i = 1                0       0.57    1

P(X2 = i | X1 = j)   j = 0   j = 1
i = 0                0.7     0
i = 1                0.3     0.4
i = 2                0       0.6
Starting with the uniform distribution on the product space X1 × X2, we obtain
after twenty cycles the following approximations of a measure in M:
CIPF-P1:

P1^(40)({(i, j)})   j = 0      j = 1      j = 2
i = 0               0.351011   0.150433   0
i = 1               0          0.199423   0.299134

CIPF-P2:

P2^(40)({(i, j)})   j = 0      j = 1      j = 2
i = 0               0.351015   0.150435   0
i = 1               0          0.199420   0.299130
An inspection of the conditional distributions shows that the given constraints
are approximately fulfilled. As reported by [19] for the IPF-P, the speed of
convergence is quite slow.
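The iteration in Example 5.5 can be reproduced with a short computation. The sketch below uses one natural reading of a CIPF cycle, assuming that each half-step replaces the current conditional of one coordinate by the prescribed one while keeping the current marginal of the other coordinate fixed; it is an illustration of the idea, not the paper's exact CIPF-P1/CIPF-P2 definitions, yet it reproduces the tabulated values closely.

```python
# Prescribed conditionals from Example 5.5:
# A[i][j] = P(X1 = i | X2 = j), B[i][j] = P(X2 = i | X1 = j)
A = [[1.0, 0.43, 0.0],
     [0.0, 0.57, 1.0]]
B = [[0.7, 0.0],
     [0.3, 0.4],
     [0.0, 0.6]]

# start with the uniform distribution on X1 x X2; p[i][j] = P({(i, j)})
p = [[1 / 6] * 3 for _ in range(2)]

for _ in range(20):  # twenty cycles, as in the example
    # half-step 1: fit P(X1 | X2), i.e. q(i, j) = p_X2(j) * A[i][j]
    m2 = [p[0][j] + p[1][j] for j in range(3)]
    p = [[m2[j] * A[i][j] for j in range(3)] for i in range(2)]
    # half-step 2: fit P(X2 | X1), i.e. q(i, j) = p_X1(i) * B[j][i]
    m1 = [sum(p[i]) for i in range(2)]
    p = [[m1[i] * B[j][i] for j in range(3)] for i in range(2)]

print(p[0][0], p[0][1], p[1][1], p[1][2])
# close to the tabulated values 0.351011, 0.150433, 0.199423, 0.299134
```

After the second half-step of each cycle, P(X2 | X1) holds exactly by construction, while P(X1 | X2) is only approximate; the gap shrinks geometrically over the cycles, consistent with the slow but steady convergence noted above.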
Acknowledgments
The author is grateful to Professor Hans-Hermann Bock and an anonymous referee for their helpful suggestions and remarks.
References
[1] Andersen, H. H., Højbjerre, M., Sørensen, D., and Eriksen, P. S.: Linear and
Graphical Models for the Multivariate Complex Normal Distribution. Lecture
Notes in Statistics 101. Springer, New York (1995).
[2] Arnold, B. C., Castillo, E., and Sarabia, J.-M.: Conditionally Specified Distributions. Lecture Notes in Statistics 73. Springer, New York (1992).
[3] Arnold, B. C., Castillo, E., and Sarabia, J.-M.: General conditional specification models. Comm. Statist. Theory Methods 24, 1–11 (1995).
[4] Arnold, B. C., Castillo, E., and Sarabia, J.-M.: Specification of distributions
by combinations of marginal and conditional distributions. Statist. Probab.
Lett. 26, 153–157 (1996).
[5] Bhattacharya, B. and Dykstra, R. L.: A general duality approach to Iprojections. J. Statist. Plann. Inference 47, 203–216 (1995).
[6] Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W.: Discrete Multivariate
Analysis — Theory and Practice. MIT Press, Cambridge, Massachusetts
(1975).
[7] Bock, H. H.: A conditional iterative proportional fitting (CIPF) algorithm
with applications in the statistical analysis of discrete spatial data. In Bull.
ISI, 47th Session, Contributed papers, vol. 1, pp. 141–142. Paris (1989).
[8] Brown, J. B., Chase, P. J., and Pittenger, A. O.: Order independence and
factor convergence in iterative scaling. Linear Algebra Appl. 190, 1–38 (1993).
[9] Čencov, N. N.: Statistical Decision Rules and Optimal Inference (in Russian). Nauka, Moscow (1972). [Translation American Mathematical Society,
vol. 53, 1982].
[10] Cramer, E.: Conditional iterative proportional fitting for Gaussian distributions. J. Multivariate Anal. 65, 261–276 (1998).
[11] Csiszár, I.: Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299–318
(1967).
[12] Csiszár, I.: I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3, 146–158 (1975).
[13] Csiszár, I. and Tusnády, G.: Information geometry and alternating minimization procedures. Statist. Decisions Suppl. Issue 1, 205–237 (1984).
[14] Dall’Aglio, G., Kotz, S., and Salinetti, G. (eds.): Advances in Probability
Distributions with Given Marginals. Kluwer Academic Publishers, Dordrecht
(1991).
[15] Darroch, J. N. and Ratcliff, D.: Generalized iterative scaling for log-linear
models. Ann. Math. Statist. 43, 1470–1480 (1972).
[16] Deming, W. E. and Stephan, F. F.: On a least squares adjustment of a
sampled frequency table when the expected marginal totals are known. Ann.
Math. Statist. 11, 427–444 (1940).
[17] Fienberg, S. E.: An iterative procedure for estimation in contingency tables.
Ann. Math. Statist. 41, 907–917 (1970).
[18] Gelman, A. and Speed, T. P.: Characterizing a joint probability distribution
by conditionals. J. Roy. Statist. Soc. Ser. B 55, 185–188 (1993).
[19] Haberman, S. J.: The Analysis of Frequency Data. The University of Chicago
Press, Chicago (1974).
[20] Haberman, S. J.: Analysis of Qualitative Data, vol. 2. Academic Press, New
York (1979).
[21] Haberman, S. J.: Adjustment by minimum discrimination information. Ann.
Statist. 12, 971–988 (1984).
[22] Hiriart-Urruty, J.-B. and Lemaréchal, C.: Convex Analysis and Minimization Algorithms I. Springer, Berlin (1993).
[23] Ireland, C. T. and Kullback, S.: Contingency tables with given marginals.
Biometrika 55, 179–188 (1968).
[24] Kaiser, M. S. and Cressie, N.: The construction of multivariate distributions from Markov random fields. Department of Statistics and Statistical
Laboratory, Iowa State University, Preprint (1996).
[25] Kemperman, J. H. B.: On the Optimum Rate of Transmitting Information.
In M. Behara, K. Krickeberg, and J. Wolfowitz (eds.), Probability and Information Theory. Lecture Notes in Mathematics 89, pp. 126–169. Springer,
Berlin (1969).
[26] Kruithof, R.: Telefoonverkeersrekening. De Ingenieur 52, E15–E25 (1937).
[27] Kullback, S.: Information Theory and Statistics. Wiley, New York (1959).
[28] Kullback, S.: A lower bound for discrimination information in terms of variation. IEEE Trans. Information Theory 13, 126–127 (1967).
[29] Kullback, S.: Probability densities with given marginals. Ann. Math. Statist.
39, 1236–1243 (1968).
[30] Kullback, S. and Leibler, R. A.: On information and sufficiency. Ann. Math.
Statist. 22, 79–86 (1951).
[31] Lauritzen, S. L.: Graphical Models. Clarendon Press, Oxford (1996).
[32] Liese, F.: On the existence of f -projections. Colloquia Mathematica Societatis János Bolyai 16, 431–446 (1975).
[33] Mitscherling, J.: Eine Verallgemeinerung des Iterative Proportional Fitting
Algorithmus auf vorgegebene bedingte Wahrscheinlichkeiten. Master’s thesis,
RWTH Aachen (1987).
[34] Rao, M. M.: Measure Theory and Integration. Wiley, New York (1987).
[35] Rockafellar, R. T.: Convex Analysis. Princeton University Press, Princeton,
New Jersey (1970).
[36] Rüschendorf, L.: On the minimum discrimination theorem. Statist. Decisions Suppl. Issue 1, 263–283 (1984).
[37] Rüschendorf, L.: Convergence of the iterative proportional fitting procedure.
Ann. Statist. 23, 1160–1174 (1995).
[38] Rüschendorf, L. and Thomsen, W.: Note on the Schrödinger equation and
I-projections. Statist. Probab. Lett. 17, 369–375 (1993).
[39] Schroeder, C.: I-projections and conditional limit theorems for discrete parameter Markov processes. Ann. Probab. 21, 721–758 (1993).
[40] Skorokhod, A. V.: On admissible translations of measures in Hilbert space.
Theory Probab. Appl. 15, 557–580 (1970).
[41] Speed, T. P. and Kiiveri, H. T.: Gaussian Markov distributions over finite
graphs. Ann. Statist. 14, 138–150 (1986).
[42] Vajda, I.: Theory of Statistical Inference and Information. Kluwer Academic
Publishers, Dordrecht (1989).
[43] von Neumann, J.: Functional Operators, vol. 2. Princeton University Press,
Princeton (1950).
Erhard Cramer
Department of Mathematics
University of Oldenburg
D-26111 Oldenburg, Germany
Email: [email protected]