Mixed Graphical Models via Exponential Families

A Alternative Mixed Graphical Models
It is instructive to compare our class of mixed MRF distributions (5) with the models derived from the marginal distribution P(Z) and the conditional distribution P(Y|Z). Suppose that we model the conditional distribution P(Y|Z) as in the conditional distribution form of (6); this alternative distribution then has the same form of conditional distribution P(Y|Z) as ours. However, instead of assuming that each node-conditional distribution is drawn from an exponential family, which would then lead to our joint mixed MRF distribution in (5) for P(Y,Z), we model the random vector Z separately as following a Markov Random Field (MRF) distribution:
P(Z) = \exp\Big\{ \sum_{r' \in V_Z} \theta^z_{r'} B_Z(Z_{r'}) + \sum_{r' \in V_Z} C_Z(Z_{r'}) + \sum_{(r',t') \in E_Z} \theta^{zz}_{r't'} B_Z(Z_{r'}) B_Z(Z_{t'}) - A_Z(\theta^z, \theta^{zz}) \Big\}.   (10)

Note that the log-partition function A_Z(\cdot) here is defined as

\log \sum_{Z \in \mathcal{Z}^l} \exp\Big\{ \sum_{r' \in V_Z} \theta^z_{r'} B_Z(Z_{r'}) + \sum_{r' \in V_Z} C_Z(Z_{r'}) + \sum_{(r',t') \in E_Z} \theta^{zz}_{r't'} B_Z(Z_{r'}) B_Z(Z_{t'}) \Big\},

which depends only on the parameters \theta^z and \theta^{zz}. Given the specifications of the conditional distribution P(Y|Z) and the marginal distribution P(Z), we can then specify the joint distribution simply as P(Y,Z) = P(Y|Z) P(Z), so that

P(Y,Z; \theta) = \exp\Big\{ \sum_{r \in V_Y} \theta^y_r B_Y(Y_r) + \sum_{(r,t) \in E_Y} \theta^{yy}_{rt} B_Y(Y_r) B_Y(Y_t) + \sum_{(r,r') \in E_{YZ}} \theta^{yz}_{rr'} B_Y(Y_r) B_Z(Z_{r'}) + \sum_{r \in V_Y} C_Y(Y_r) + \sum_{r' \in V_Z} \theta^z_{r'} B_Z(Z_{r'}) + \sum_{(r',t') \in E_Z} \theta^{zz}_{r't'} B_Z(Z_{r'}) B_Z(Z_{t'}) + \sum_{r' \in V_Z} C_Z(Z_{r'}) - A_{Y|Z}\big(\{\theta_r(Z)\}_{r \in V_Y}, \theta^{yy}\big) - A_Z(\theta^z, \theta^{zz}) \Big\}.   (11)

Note that this distribution is distinct from our mixed MRF distribution in (5). In particular, the log-partition function of (11) is not A_{Y|Z}(\cdot) + A_Z(\cdot), since A_{Y|Z} is a function of the random vector Z. The form of P(Y,Z) in (11) is thus much more complicated than that in (5) due to the complicated nonlinear term A_{Y|Z}\big(\{\theta_r(Z)\}_{r \in V_Y}, \theta^{yy}\big). On the other hand, an important benefit of this modeling approach is that the conditions for normalizability of (11) can be characterized simply as those on the marginal P(Z) (10) and those on the conditional P(Y|Z) (6). In other words, so long as (10) and (6) are well-defined, the joint distribution (11) always exists and is well-defined as well.
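Since normalizability of (11) reduces to that of the marginal and the conditional, a small numerical sketch can illustrate the construction: whenever P(Z) and P(Y|Z) are each normalized, the joint P(Y,Z) = P(Y|Z)P(Z) sums to one automatically. The binary toy domain and all parameter values below are illustrative assumptions, not taken from the paper.

```python
import itertools
import math

# Toy binary example: marginal MRF P(Z) over Z in {0,1}^2 and a
# conditional P(Y|Z) over Y in {0,1}^2 (illustrative parameters).
theta_z, theta_zz = [0.3, -0.2], 0.5                  # marginal parameters
theta_y, theta_yy, theta_yz = [0.1, 0.4], -0.3, 0.2   # conditional parameters

states = list(itertools.product([0, 1], repeat=2))

def z_energy(z):
    return sum(t * zi for t, zi in zip(theta_z, z)) + theta_zz * z[0] * z[1]

def y_energy(y, z):
    # the Y node terms are coupled to Z through the edge parameter theta_yz
    return (sum(t * yi for t, yi in zip(theta_y, y))
            + theta_yy * y[0] * y[1]
            + theta_yz * (y[0] * z[0] + y[1] * z[1]))

A_Z = math.log(sum(math.exp(z_energy(z)) for z in states))  # log-partition of P(Z)

def p_z(z):
    return math.exp(z_energy(z) - A_Z)

def p_y_given_z(y, z):
    # conditional log-partition depends on z, as A_{Y|Z} does in the text
    A_YgZ = math.log(sum(math.exp(y_energy(v, z)) for v in states))
    return math.exp(y_energy(y, z) - A_YgZ)

# Joint specified as P(Y,Z) = P(Y|Z) P(Z): automatically normalized.
total = sum(p_y_given_z(y, z) * p_z(z) for y in states for z in states)
print(round(total, 10))  # 1.0
```

The check reflects the point of this section: the joint is well-defined as soon as the marginal (10) and the conditional (6) are, with no extra condition on their combination.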
B Proof of Theorem 1

This theorem can be understood as an extension of Proposition 2 in (Yang et al., 2012); the only difference here is that we allow heterogeneous types of node-conditional distributions. We follow the proof strategy of that paper. Define Q(X) as

Q(X) := \log\big( P(X) / P(0) \big),

for any X = (X_1, \ldots, X_p) \in \mathcal{X}_1 \times \ldots \times \mathcal{X}_p, where 0 denotes a zero vector (whose length varies appropriately with the context below). For any X, also denote \bar{X}_r := (X_1, \ldots, X_{r-1}, 0, X_{r+1}, \ldots, X_p).

Now, consider the following general form for Q(X):

Q(X) = \sum_{t_1 \in V} X_{t_1} G_{t_1}(X_{t_1}) + \ldots + \sum_{t_1,\ldots,t_k \in V} X_{t_1} \cdots X_{t_k} G_{t_1,\ldots,t_k}(X_{t_1},\ldots,X_{t_k}),   (12)

since the joint distribution on X has factors of size at most k. It can then be seen that

\exp\big( Q(X) - Q(\bar{X}_r) \big) = \frac{P(X)}{P(\bar{X}_r)} = \frac{P(X_r \mid X_1,\ldots,X_{r-1},X_{r+1},\ldots,X_p)}{P(0 \mid X_1,\ldots,X_{r-1},X_{r+1},\ldots,X_p)},   (13)

where the first equality follows from the definition of Q(X). Now, consider simplifications of both sides of (13). Given the form of Q(X) in (12), we have (taking r = 1 for concreteness)

Q(X) - Q(\bar{X}_1) = X_1 \Big( G_1(X_1) + \sum_{t=2}^{p} X_t G_{1t}(X_1, X_t) + \ldots + \sum_{t_2,\ldots,t_k \in \{2,\ldots,p\}} X_{t_2} \cdots X_{t_k} G_{1,t_2,\ldots,t_k}(X_1, X_{t_2}, \ldots, X_{t_k}) \Big).   (14)

Also, given the exponential family form of the node-conditional distribution specified in the theorem,

\log \frac{P(X_r \mid X_1,\ldots,X_{r-1},X_{r+1},\ldots,X_p)}{P(0 \mid X_1,\ldots,X_{r-1},X_{r+1},\ldots,X_p)} = E_r(X_{V \setminus r}) \big( B_r(X_r) - B_r(0) \big) + \big( C_r(X_r) - C_r(0) \big).   (15)
Eunho Yang, Yulia Baker, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Setting X_t = 0 for all t \neq r in (13), and using the expressions for the left- and right-hand sides in (14) and (15), we obtain

X_r G_r(X_r) = E_r(0) \big( B_r(X_r) - B_r(0) \big) + \big( C_r(X_r) - C_r(0) \big).

Setting X_u = 0 for all u \notin \{r, t\},

X_r G_r(X_r) + X_r X_t G_{rt}(X_r, X_t) = E_r(0, X_t, 0) \big( B_r(X_r) - B_r(0) \big) + \big( C_r(X_r) - C_r(0) \big).

Combining these two equations yields

X_r X_t G_{rt}(X_r, X_t) = \big( E_r(0, X_t, 0) - E_r(0) \big) \big( B_r(X_r) - B_r(0) \big).   (16)

Similarly, from the same reasoning for node t, we have

X_t G_t(X_t) + X_r X_t G_{rt}(X_r, X_t) = E_t(0, X_r, 0) \big( B_t(X_t) - B_t(0) \big) + \big( C_t(X_t) - C_t(0) \big),

and at the same time,

X_r X_t G_{rt}(X_r, X_t) = \big( E_t(0, X_r, 0) - E_t(0) \big) \big( B_t(X_t) - B_t(0) \big).   (17)

Therefore, from (16) and (17), we obtain

\big( E_r(0, X_t, 0) - E_r(0) \big) \big( B_r(X_r) - B_r(0) \big) = \big( E_t(0, X_r, 0) - E_t(0) \big) \big( B_t(X_t) - B_t(0) \big).   (18)

Since (18) should hold for all possible combinations of X_r and X_t, for any fixed X_t \neq 0,

E_t(0, X_r, 0) - E_t(0) = \frac{E_r(0, X_t, 0) - E_r(0)}{B_t(X_t) - B_t(0)} \big( B_r(X_r) - B_r(0) \big) = \theta_{rt} \big( B_r(X_r) - B_r(0) \big),   (19)

where \theta_{rt} := \big( E_r(0, X_t, 0) - E_r(0) \big) / \big( B_t(X_t) - B_t(0) \big) is a constant, since the left-hand side does not depend on X_t. Plugging (19) back into (17),

X_r X_t G_{rt}(X_r, X_t) = \theta_{rt} \big( B_r(X_r) - B_r(0) \big) \big( B_t(X_t) - B_t(0) \big).

More generally, we can show that

X_{t_1} \cdots X_{t_k} G_{t_1,\ldots,t_k}(X_{t_1},\ldots,X_{t_k}) = \theta_{t_1,\ldots,t_k} \big( B_{t_1}(X_{t_1}) - B_{t_1}(0) \big) \cdots \big( B_{t_k}(X_{t_k}) - B_{t_k}(0) \big).

Thus, the k-th order factors in the joint distribution as specified in (12) are tensor products of the \big( B_r(X_r) - B_r(0) \big), which proves the statement of the theorem.

C Proof of Corollary 1

The conditional distribution P(Y|Z = z) for any particular assignment of the random variables Z is normalizable by assumption. It can then be shown that the log-partition function of the joint distribution is precisely given by \mathbb{E}_{Z'}\big[ \exp A_{Y|Z'}\big( \{\theta_r(Z')\}_{r \in V_Y}, \theta^{yy} \big) \big]. This expression is also finite and well-defined since there are only finitely many configurations of Z.

D Proof of Theorem 2

We can simply start from the definition of the log-partition function of the mixed MRF joint distribution in (5):

A(\theta) = \log \sum_{Y,Z} \exp\Big\{ \sum_{r \in V_Y} \theta^y_r B_Y(Y_r) + \sum_{(r,t) \in E_Y} \theta^{yy}_{rt} B_Y(Y_r) B_Y(Y_t) + \sum_{(r,r') \in E_{YZ}} \theta^{yz}_{rr'} B_Y(Y_r) B_Z(Z_{r'}) + \sum_{r \in V_Y} C_Y(Y_r) + \sum_{r' \in V_Z} \theta^z_{r'} B_Z(Z_{r'}) + \sum_{(r',t') \in E_Z} \theta^{zz}_{r't'} B_Z(Z_{r'}) B_Z(Z_{t'}) + \sum_{r' \in V_Z} C_Z(Z_{r'}) \Big\}.

This can simply be rewritten as

A(\theta) = \log \sum_{Z} \exp\Big\{ \sum_{r' \in V_Z} \theta^z_{r'} B_Z(Z_{r'}) + \sum_{(r',t') \in E_Z} \theta^{zz}_{r't'} B_Z(Z_{r'}) B_Z(Z_{t'}) + \sum_{r' \in V_Z} C_Z(Z_{r'}) \Big\} \Bigg[ \sum_{Y} \exp\Big\{ \sum_{r \in V_Y} \theta^y_r B_Y(Y_r) + \sum_{(r,t) \in E_Y} \theta^{yy}_{rt} B_Y(Y_r) B_Y(Y_t) + \sum_{(r,r') \in E_{YZ}} \theta^{yz}_{rr'} B_Y(Y_r) B_Z(Z_{r'}) + \sum_{r \in V_Y} C_Y(Y_r) \Big\} \Bigg].

Hence, we can conclude as in the statement, since the bracketed term \sum_Y \exp\{\cdot\} is precisely \exp A_{Y|Z}(\bar{\theta}^y(Z), \bar{\theta}^{yy}) by the definition of the conditional log-partition function.
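The factorization step above can be checked numerically on a toy binary model: the double sum defining A(\theta) equals the single sum over Z once the inner sum over Y is pulled out as \exp A_{Y|Z}. All parameter values below are illustrative assumptions, not taken from the paper.

```python
import itertools
import math

# Illustrative binary check of the factorization of A(theta).
th_y, th_yy, th_yz = [0.2, -0.1], 0.4, 0.3   # Y-side parameters (arbitrary)
th_z, th_zz, c_z = [0.5, 0.1], -0.2, 0.05    # Z-side parameters; c_z stands in for C_Z

states = list(itertools.product([0, 1], repeat=2))

def y_part(y, z):
    return (sum(t * yi for t, yi in zip(th_y, y)) + th_yy * y[0] * y[1]
            + th_yz * (y[0] * z[0] + y[1] * z[1]))

def z_part(z):
    return (sum(t * zi for t, zi in zip(th_z, z)) + th_zz * z[0] * z[1]
            + c_z * (z[0] + z[1]))

# Direct definition: sum over all (Y, Z) configurations jointly.
A_direct = math.log(sum(math.exp(y_part(y, z) + z_part(z))
                        for y in states for z in states))

# Factored form: the inner sum over Y is exp(A_{Y|Z}(z)) for each z.
def A_y_given_z(z):
    return math.log(sum(math.exp(y_part(y, z)) for y in states))

A_factored = math.log(sum(math.exp(z_part(z) + A_y_given_z(z)) for z in states))

print(abs(A_direct - A_factored) < 1e-12)  # True
```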
E Proof of Theorem 3

Suppose that neither condition (a) nor condition (b) is satisfied. Then either X_r or X_t can take values approaching both +\infty and -\infty. Also, for some \alpha, \beta \geq 0 such that C_r(X_r) = O(X_r^\alpha) and C_t(X_t) = O(X_t^\beta), we have (\alpha - 1)(\beta - 1) < 1. We will show that, under these conditions, the necessary condition for normalizability detailed in Proposition 1 is violated, that is,

C_r(X_r) + \theta_{rt} X_r X_t + C_t(X_t) > 0,   (20)

for sufficiently large X_r and X_t, from which we can conclude that the joint distribution (5) is not normalizable. Note that we ignore the node-wise terms \theta_r X_r and \theta_t X_t without loss of generality in our asymptotic argument, since they are asymptotically smaller than the quadratic term.

Consider the following sequences of values taken by the random variables X_r and X_t: let X_r = a^\delta and X_t = a^\gamma for arbitrary positive a and some fixed positive constants \delta and \gamma. We then have X_r X_t = a^{\delta + \gamma} and X_r^\alpha + X_t^\beta = a^{\alpha\delta} + a^{\beta\gamma}. As we increase a, X_r and X_t approach infinity; however, if \delta + \gamma > \max\{\alpha\delta, \beta\gamma\}, then C_r(X_r) + \theta_{rt} X_r X_t + C_t(X_t) will eventually not be less than or equal to 0: in other words, the necessary condition for normalizability detailed in Proposition 1 will be violated.

(Case 1: \alpha or \beta is less than or equal to 1.) Consider the case where \alpha \leq 1. If we simply set \delta = \max\{\beta, 1\} and \gamma = 1, then \delta + \gamma > \max\{\alpha\delta, \beta\gamma\}, so that the necessary condition for normalizability detailed in Proposition 1 is violated as discussed above. By symmetry, the same holds when \beta \leq 1. Thus, in this case, (20) always holds.

(Case 2: both \alpha and \beta are larger than 1.) In this case, the condition \delta + \gamma > \max\{\alpha\delta, \beta\gamma\} can be rewritten as \gamma > (\alpha - 1)\delta and \delta / (\beta - 1) > \gamma. Hence, as long as (\alpha - 1) < 1 / (\beta - 1), we can always find \delta and \gamma satisfying \delta + \gamma > \max\{\alpha\delta, \beta\gamma\}, so that the necessary condition for normalizability detailed in Proposition 1 is violated. The earlier Case 1 can also be absorbed into the condition (\alpha - 1) < 1 / (\beta - 1), which is equivalent to (\alpha - 1)(\beta - 1) < 1.

Therefore, if (\alpha - 1)(\beta - 1) < 1, then condition (20) always holds, so that by Proposition 1 the joint distribution in (5) is not normalizable.
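The asymptotic argument can be illustrated numerically. Taking, say, C_r(x) = C_t(x) = -x^{1.5} (so \alpha = \beta = 1.5 and (\alpha - 1)(\beta - 1) = 0.25 < 1), \theta_{rt} = 1, and the sequence X_r = X_t = a (\delta = \gamma = 1, so \delta + \gamma = 2 > \max\{\alpha\delta, \beta\gamma\} = 1.5), the quantity in (20) becomes and stays positive. These concrete values are illustrative choices, not from the paper.

```python
# Sketch of the Theorem 3 blow-up along the sequence X_r = X_t = a.
theta_rt = 1.0  # illustrative positive edge parameter

def lhs(a):
    # C_r(X_r) + theta_rt * X_r * X_t + C_t(X_t) with C(x) = -x**1.5
    return -a**1.5 + theta_rt * a * a - a**1.5

# The quadratic coupling a**2 dominates the a**1.5 base-measure terms,
# so the expression is positive for large a, violating Proposition 1.
print(all(lhs(a) > 0 for a in [10.0, 100.0, 1000.0]))  # True
```

The same sketch fails (the expression stays negative) once the base measures decay fast enough that (\alpha - 1)(\beta - 1) \geq 1, e.g. Gaussian-type C(x) = -x^2 with \alpha = \beta = 2.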