Proceedings Chapter
Information geometry and minimum description length networks
SUN, Ke, et al.
Abstract
We study parametric unsupervised mixture learning. We measure the loss of intrinsic
information from the observations to complex mixture models, and then to simple mixture
models. We present a geometric picture, where all these representations are regarded as free
points in the space of probability distributions. Based on minimum description length, we
derive a simple geometric principle to learn all these models together. We present a new
learning machine with theories, algorithms, and simulations.
Reference
SUN, Ke, et al. Information geometry and minimum description length networks. In:
Proceedings of the 32nd International Conference on Machine Learning. 2015. pp. 49-58.
Available at:
http://archive-ouverte.unige.ch/unige:73193
Disclaimer: layout of this document may differ from the published version.
Proofs for “Information Geometry and Minimum
Description Length Networks”
An Approximation of ln N (B, α)
As the value of $\ln N(B, \alpha)$ does not depend on the choice of the coordinate
system, we abuse notation and let $\eta = (\eta^{(1)}, \ldots, \eta^{(\dim S)})$ denote an
ideal coordinate system in which $g(\eta)$ is everywhere the identity matrix. In theory,
such a coordinate system is only guaranteed to exist locally; for convenience, we assume
that it exists globally. By definition,
\begin{align*}
\ln N(B, \alpha) &= \ln\left(\int_{\eta\in S}\sum_{i=1}^{m}\alpha_i\exp\left(-D(\eta \,\|\, \eta_i)\right)d\eta\right) \\
&= \ln\left(\sum_{i=1}^{m}\alpha_i\int_{\eta\in S}\exp\left(-D(\eta \,\|\, \eta_i)\right)d\eta\right) \\
&\approx \ln\left(\sum_{i=1}^{m}\alpha_i\int_{\eta^{(1)}}\cdots\int_{\eta^{(\dim S)}}\exp\left(-\tfrac{1}{2}(\eta-\eta_i)^{T}g(\eta_i)(\eta-\eta_i)\right)\sqrt{|g(\eta)|}\,d\eta^{(1)}\cdots d\eta^{(\dim S)}\right) \\
&= \ln\left(\sum_{i=1}^{m}\alpha_i\int_{\eta^{(1)}}\cdots\int_{\eta^{(\dim S)}}\exp\left(-\tfrac{1}{2}\|\eta-\eta_i\|^{2}\right)d\eta^{(1)}\cdots d\eta^{(\dim S)}\right) \\
&\approx \ln\left(\sum_{i=1}^{m}\alpha_i\exp\left(\frac{\dim S}{2}\ln(2\pi)\right)\right) = \frac{\dim S}{2}\ln(2\pi). \tag{S.1}
\end{align*}
The first “≈” is by approximating $D$ by a squared distance, which is only accurate
when $\eta$ and $\eta_i$ are close enough. The second “≈” is by relaxing the domain of
the integration from $S$ to $\mathbb{R}^{\dim S}$. This is a rough approximation for a general $S$,
intended to show the order of the term $\ln N(B, \alpha)$ and its weak dependence on
$B$ and $\alpha$. More accurate approximations based on specific choices of $S$ can lead
to better implementations of MDL networks and better criteria in accordance
with MDL.
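As a quick numerical illustration of eq. (S.1) (not part of the original derivation), the following sketch assumes the idealized one-dimensional setting where $D(\eta \,\|\, \eta_i) = \tfrac{1}{2}(\eta-\eta_i)^2$, $g \equiv 1$, and $S = \mathbb{R}$; the component locations and weights are arbitrary illustrative values.
\begin{verbatim}
import numpy as np

# Sketch: check ln N(B, alpha) ~ (dim S / 2) * ln(2*pi) in the idealized 1-D case
# where g is the identity, D(eta || eta_i) = 0.5*(eta - eta_i)^2 and S = R.
rng = np.random.default_rng(0)
centers = rng.normal(size=5)                 # eta_i, m = 5 components
alpha = rng.dirichlet(np.ones(5))            # mixture weights, sum to 1

grid = np.linspace(-30.0, 30.0, 200_001)     # dense grid covering (a truncation of) S
d_eta = grid[1] - grid[0]
mixture = sum(a * np.exp(-0.5 * (grid - c) ** 2) for a, c in zip(alpha, centers))
ln_N = np.log(np.sum(mixture) * d_eta)       # Riemann-sum estimate of the integral

print(ln_N, 0.5 * np.log(2.0 * np.pi))       # both approximately 0.9189
\end{verbatim}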
Proof of E(N, A) ≤ Ê(N, A) (HARDN)
Proof. $\forall l$, $\forall i$,
\begin{align*}
\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right) \ge \max_j \alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right). \tag{S.2}
\end{align*}
As $-\ln(x)$ is monotonically decreasing,
\begin{align*}
E(\mathcal{N}, \mathcal{A}) &= -\sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right)\\
&\le -\sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\ln\max_j\left[\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right]\\
&= -\sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\max_j\left[\ln\alpha_{l+1,j} - D(\eta_{li} \,\|\, \eta_{l+1,j})\right]\\
&= \sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\min_j\left[-\ln\alpha_{l+1,j} + D(\eta_{li} \,\|\, \eta_{l+1,j})\right] = \hat{E}(\mathcal{N}, \mathcal{A}). \tag{S.3}
\end{align*}
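The inequality can also be checked numerically. The sketch below (not from the paper) does so for a single level, with a hypothetical squared-Euclidean divergence standing in for $D$ and random cells and weights.
\begin{verbatim}
import numpy as np

# Sketch: verify E <= E_hat on one level, with a stand-in divergence
# D(eta_i || eta_j) = 0.5 * ||eta_i - eta_j||^2 and random cells/weights.
rng = np.random.default_rng(1)
eta_lo = rng.normal(size=(20, 3))            # cells eta_{l,i} on level l
eta_hi = rng.normal(size=(4, 3))             # cells eta_{l+1,j} on level l+1
alpha = rng.dirichlet(np.ones(4))            # weights alpha_{l+1,j}

D = 0.5 * ((eta_lo[:, None, :] - eta_hi[None, :, :]) ** 2).sum(axis=-1)  # (20, 4)
E = -np.log((alpha * np.exp(-D)).sum(axis=1)).sum()     # soft cost, first line of (S.3)
E_hat = (-np.log(alpha) + D).min(axis=1).sum()          # hard cost, last line of (S.3)

assert E <= E_hat
print(E, E_hat)
\end{verbatim}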
Proof of E(N, A) ≤ Ē(N, A, B) (SOFTN)
Proof. Because of the convexity of − ln(x),
\begin{align*}
E(\mathcal{N}, \mathcal{A}) &= -\sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right)\\
&= -\sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\beta_{li}^{j}\cdot\frac{\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)}{\beta_{li}^{j}}\right)\\
&\le \sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l+1}}\beta_{li}^{j}\left[-\ln\left(\frac{\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)}{\beta_{li}^{j}}\right)\right]\\
&= \sum_{l=0}^{L-1}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l+1}}\beta_{li}^{j}\left(\ln\frac{\beta_{li}^{j}}{\alpha_{l+1,j}} + D(\eta_{li} \,\|\, \eta_{l+1,j})\right) = \bar{E}(\mathcal{N}, \mathcal{A}, \mathcal{B}). \tag{S.4}
\end{align*}
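Again as an illustration outside the original proof, the bound and its tightness at the posterior responsibilities can be checked numerically for one level, with the same hypothetical squared-Euclidean stand-in for $D$ as above.
\begin{verbatim}
import numpy as np

# Sketch: verify E <= E_bar for an arbitrary responsibility matrix beta, and
# equality when beta is the posterior.  All parameters are illustrative.
rng = np.random.default_rng(2)
eta_lo, eta_hi = rng.normal(size=(20, 3)), rng.normal(size=(4, 3))
alpha = rng.dirichlet(np.ones(4))
D = 0.5 * ((eta_lo[:, None, :] - eta_hi[None, :, :]) ** 2).sum(axis=-1)

joint = alpha * np.exp(-D)                                # alpha_{l+1,j} exp(-D_ij)
E = -np.log(joint.sum(axis=1)).sum()

def E_bar(beta):
    # last line of eq. (S.4), restricted to one level
    return (beta * (np.log(beta / alpha) + D)).sum()

beta_any = rng.dirichlet(np.ones(4), size=20)             # arbitrary beta, rows sum to 1
beta_post = joint / joint.sum(axis=1, keepdims=True)      # posterior responsibilities

print(E, E_bar(beta_any), E_bar(beta_post))  # E <= E_bar(any); E == E_bar(posterior)
\end{verbatim}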
Proof of Theorem 3
Proof. Denote the true distribution with the components $\{\eta_i^t\}$ and the weights
$\{\alpha_i^t\}$ by $\mathrm{True}(x)$. By eq. (7), $\forall \mathcal{N}$, $\forall \mathcal{A}$, when $n \to \infty$,
\begin{align*}
E(\mathcal{N}, \mathcal{A}) &= -n\int \mathrm{True}(x)\ln\left(\sum_{j=1}^{n_1}\alpha_{1j}\exp\left(-D(\eta(x) \,\|\, \eta_{1j})\right)\right)dx
- \sum_{l=1}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right)\\
&= -n\int \mathrm{True}(x)\ln\left(\sum_{j=1}^{n_1}\alpha_{1j}\,p(x \,|\, \eta_{1j})\right)dx + \text{constant}
- \sum_{l=1}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right). \tag{S.5}
\end{align*}
We construct an MDL network $\mathcal{N}^t$, where $L_1^t$ is given by $\{\eta_{1i}^t = \eta_i^t\}$ with
the weights $\{\alpha_{1i}^t = \alpha_i^t\}$. The rest of the cells $\{\eta_{li}^t\}$ in higher levels, including
their weights $\{\alpha_{li}^t\}$, are given by the sub-optimal solution which minimizes the
above eq. (S.5) with $L_1^t$ and its weights fixed. Given that $L_0$ is fixed by infinite
samples corresponding to the truth, $\forall \mathcal{N}$, $\forall \mathcal{A}$,
\begin{align*}
E(\mathcal{N}, \mathcal{A}) - E(\mathcal{N}^t, \mathcal{A}^t)
&= n\int \mathrm{True}(x)\ln\frac{\mathrm{True}(x)}{\sum_{j=1}^{n_1}\alpha_{1j}\,p(x \,|\, \eta_{1j})}\,dx\\
&\quad - \sum_{l=1}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}\exp\left(-D(\eta_{li} \,\|\, \eta_{l+1,j})\right)\right)\\
&\quad + \sum_{l=1}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}^{t}\exp\left(-D(\eta_{li}^{t} \,\|\, \eta_{l+1,j}^{t})\right)\right). \tag{S.6}
\end{align*}
If $\{\eta_{1j}\}$ in $\mathcal{N}$ or $\{\alpha_{1j}\}$ in $\mathcal{A}$ does not correspond to $\mathrm{True}(x)$, the first term on
the right-hand side of eq. (S.6) will go to $+\infty$ as $n \to \infty$. The second term
is always non-negative, because of the non-negativity of $D$ (each inner sum is at most $1$,
since $D \ge 0$ and the weights sum to one). Because of the
sub-optimality discussed earlier, the third term is lower-bounded, as in
\begin{align*}
\sum_{l=1}^{L-1}\sum_{i=1}^{n_l}\ln\left(\sum_{j=1}^{n_{l+1}}\alpha_{l+1,j}^{t}\exp\left(-D(\eta_{li}^{t} \,\|\, \eta_{l+1,j}^{t})\right)\right) \ge -\sum_{i=1}^{n_1} D(\eta_i^t \,\|\, \tilde{\eta}), \tag{S.7}
\end{align*}
where $\tilde{\eta}$ can be any distribution, e.g., the right-handed Bregman centroid of $\{\eta_i^t\}$.
The right-hand side of eq. (S.7) is the negative cost of a simple structure (one
cell in $L_2$) to represent $L_1^t$, which is upper-bounded by the sub-optimal negative
cost on the left-hand side. Combining all three terms on the right-hand side of
eq. (S.6), $E(\mathcal{N}, \mathcal{A}) > E(\mathcal{N}^t, \mathcal{A}^t)$ for sufficiently large $n$. Hence, in the optimal solution, $L_1$ must be
exactly $\{\eta_i^t\}$ and the weights must be exactly $\{\alpha_i^t\}$.
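The following sketch (purely illustrative, with hypothetical one-dimensional Gaussian parameters) shows why the first term of eq. (S.6) dominates: it grows linearly in $n$ whenever the first-level mixture differs from $\mathrm{True}(x)$, while the remaining terms do not depend on $n$.
\begin{verbatim}
import numpy as np

# Sketch: the first term of (S.6) is n * KL(True || first-level mixture), which
# diverges linearly in n whenever the mixture does not match True(x).
def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 12.0, 24_001)
dx = x[1] - x[0]
true_pdf = gauss(x, 0.0, 1.0)                                      # True(x)
model_pdf = 0.5 * gauss(x, -1.0, 1.0) + 0.5 * gauss(x, 1.0, 1.0)   # mismatched level 1

kl = np.sum(true_pdf * np.log(true_pdf / model_pdf)) * dx          # > 0
for n in (10, 1_000, 100_000):
    print(n, n * kl)                                 # grows without bound as n -> infinity
\end{verbatim}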
Proof of Theorem 4
Proof. By the definition of $D(\eta_1 \,\|\, \eta_2)$ in section 2.3 as a Bregman divergence,
$\forall \theta(\eta)$, we have
\begin{align*}
\mathrm{gain}(\eta) &= D(\eta_1 \,\|\, \eta_2) - D(\eta_1 \,\|\, \eta) - D(\eta \,\|\, \eta_2)\\
&= +\left[\psi^\star(\eta_1) - \psi^\star(\eta_2) - \theta_2^T(\eta_1 - \eta_2)\right]\\
&\quad - \left[\psi^\star(\eta_1) - \psi^\star(\eta) - \theta^T(\eta_1 - \eta)\right]\\
&\quad - \left[\psi^\star(\eta) - \psi^\star(\eta_2) - \theta_2^T(\eta - \eta_2)\right]\\
&= (\theta_2 - \theta)^T(\eta - \eta_1). \tag{S.8}
\end{align*}
Let $\theta_{lc} = (\theta_1 + \theta_2)/2$ be the left-handed Bregman centroid of $\theta_1$ and $\theta_2$;
then $\theta_2 - \theta_{lc} = \theta_{lc} - \theta_1$. Therefore,
\begin{align*}
\mathrm{gain}(\eta_{lc}) = (\theta_2 - \theta_{lc})^T(\eta_{lc} - \eta_1) = (\theta_{lc} - \theta_1)^T(\eta_{lc} - \eta_1). \tag{S.9}
\end{align*}
On the other hand, $\forall \eta_a, \eta_b \in S$, $\eta_a \neq \eta_b$,
\begin{align*}
D(\eta_a \,\|\, \eta_b) + D(\eta_b \,\|\, \eta_a) &= +\left[\psi^\star(\eta_a) - \psi^\star(\eta_b) - \theta_b^T(\eta_a - \eta_b)\right]\\
&\quad + \left[\psi^\star(\eta_b) - \psi^\star(\eta_a) - \theta_a^T(\eta_b - \eta_a)\right]\\
&= (\theta_a - \theta_b)^T(\eta_a - \eta_b) > 0. \tag{S.10}
\end{align*}
By eqs. (S.9) and (S.10),
\begin{align*}
\mathrm{gain}(\eta_{lc}) = D(\eta_{lc} \,\|\, \eta_1) + D(\eta_1 \,\|\, \eta_{lc}) > 0 \quad \text{(which proves the first claim)}. \tag{S.11}
\end{align*}
Similarly, we let $\eta_{rc} = (\eta_1 + \eta_2)/2$ be the right-handed Bregman centroid;
then
\begin{align*}
\mathrm{gain}(\eta_{rc}) = (\theta_2 - \theta_{rc})^T(\eta_{rc} - \eta_1) = (\theta_2 - \theta_{rc})^T(\eta_2 - \eta_{rc})
= D(\eta_2 \,\|\, \eta_{rc}) + D(\eta_{rc} \,\|\, \eta_2). \tag{S.12}
\end{align*}
By eqs. (S.11) and (S.12), $\exists \eta \in S$ satisfying
\begin{align*}
\mathrm{gain}(\eta) \ge \max\left\{D(\eta_{lc} \,\|\, \eta_1) + D(\eta_1 \,\|\, \eta_{lc}),\; D(\eta_2 \,\|\, \eta_{rc}) + D(\eta_{rc} \,\|\, \eta_2)\right\}. \tag{S.13}
\end{align*}
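These identities can be sanity-checked numerically. The sketch below uses an illustrative asymmetric Bregman divergence, the generalized KL divergence generated by $\psi^\star(\eta) = \sum_k (\eta^{(k)}\ln\eta^{(k)} - \eta^{(k)})$ on positive vectors, so that $\theta(\eta) = \ln\eta$; these specific choices are for illustration only and are not from the paper.
\begin{verbatim}
import numpy as np

# Sketch: numerically check eqs. (S.8), (S.11), (S.12) for an illustrative
# asymmetric Bregman divergence: psi*(eta) = sum(eta*ln(eta) - eta), so
# theta(eta) = ln(eta) and D is the generalized KL divergence.
rng = np.random.default_rng(3)
eta1, eta2 = rng.uniform(0.5, 2.0, size=4), rng.uniform(0.5, 2.0, size=4)

def D(a, b):                                  # Bregman divergence of psi*
    return np.sum(a * np.log(a / b) - a + b)

def gain(eta):
    return D(eta1, eta2) - D(eta1, eta) - D(eta, eta2)

eta_lc = np.sqrt(eta1 * eta2)                 # theta_lc = (theta_1 + theta_2) / 2
eta_rc = 0.5 * (eta1 + eta2)                  # eta_rc  = (eta_1 + eta_2) / 2

eta = rng.uniform(0.5, 2.0, size=4)           # an arbitrary point
print(gain(eta), np.dot(np.log(eta2) - np.log(eta), eta - eta1))  # eq. (S.8)
print(gain(eta_lc), D(eta_lc, eta1) + D(eta1, eta_lc))            # eq. (S.11)
print(gain(eta_rc), D(eta2, eta_rc) + D(eta_rc, eta2))            # eq. (S.12)
\end{verbatim}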