
Course Notes for
Advanced Probabilistic Machine Learning
John Paisley
Department of Electrical Engineering
Columbia University
Fall 2014
Abstract
These are lecture notes for the seminar ELEN E9801 Topics in Signal Processing:
“Advanced Probabilistic Machine Learning” taught at Columbia University in Fall
2014. They are transcribed almost verbatim from the handwritten lecture notes, and
so they preserve the original bulleted structure and are light on the exposition. Some
lectures therefore also have a certain amount of reiterating in them, so some statements
may be repeated a few times throughout the notes.
The purpose of these notes is to (1) have a cleaner version for the next time
it’s taught, and (2) make them public so they may be helpful to others. Since the
exposition comes during class, often via student questions, these lecture notes may
come across as too fragmented depending on the reader’s preference. Still, I hope
they will be useful to those not in the class who want a streamlined way to learn the
material at a fairly rigorous level, but not yet at the hyper-rigorous level of many
textbooks, which also mix the fundamental results with the fine details. I hope these
notes can be a good primer towards that end. As with the handwritten lectures, this
document does not contain any references.
The twelve lectures are split into two parts. The first eight deal with several
stochastic processes fundamental to Bayesian nonparametrics: Poisson, gamma,
Dirichlet and beta processes. The last four lectures deal with some advanced techniques for posterior inference in Bayesian models. Each lecture was between 2 and
2-1/2 hours long.
Contents

Part I   Topics in Bayesian Nonparametrics
  1  Poisson distribution and process, superposition and marking theorems
  2  Completely random measures, Campbell's theorem, gamma process
  3  Beta processes and the Poisson process
  4  Beta processes and size-biased constructions
  5  Dirichlet processes and a size-biased construction
  6  Dirichlet process extensions, count processes
  7  Exchangeability, Dirichlet processes and the Chinese restaurant process
  8  Exchangeability, beta processes and the Indian buffet process

Part II  Topics in Bayesian Inference
  9   EM and variational inference
  10  Stochastic variational inference
  11  Variational inference for non-conjugate models
  12  Scalable MCMC inference
Part I
Topics in Bayesian Nonparametrics
Chapter 1
Poisson distribution and process, superposition and
marking theorems
• The Poisson distribution is perhaps the most fundamental discrete distribution and, along with the Gaussian distribution, one of the two fundamental distributions of probability.
Importance:
  Poisson  → discrete r.v.'s
  Gaussian → continuous r.v.'s

Definition: A random variable X ∈ {0, 1, 2, . . .} is Poisson distributed with parameter λ > 0 if

    P(X = n | λ) = (λ^n / n!) e^{-λ},                                            (1.1)

denoted X ∼ Pois(λ).
Moments of Poisson

    E[X] = Σ_{n=1}^∞ n P(X = n | λ) = Σ_{n=1}^∞ (λ^n / (n-1)!) e^{-λ}
         = λ Σ_{n=1}^∞ (λ^{n-1} / (n-1)!) e^{-λ} = λ        (the last sum equals 1)      (1.2)

    E[X^2] = Σ_{n=1}^∞ n^2 P(X = n | λ) = λ Σ_{n=1}^∞ (n λ^{n-1} / (n-1)!) e^{-λ}
           = λ Σ_{n=0}^∞ ((n+1) λ^n / n!) e^{-λ} = λ(E[X] + 1) = λ^2 + λ                 (1.3)

    V[X] = E[X^2] - E[X]^2 = λ                                                           (1.4)
Sums of Poisson r.v.'s (take 1)

• Sums of Poisson r.v.'s are also Poisson. Let X1 ∼ Pois(λ1) and X2 ∼ Pois(λ2). Then X1 + X2 ∼ Pois(λ1 + λ2).
Important interlude: Laplace transforms and sums of r.v.'s

• Laplace transforms give a very easy way to calculate the distribution of sums of r.v.'s (among other things).

Laplace transform

• Let X ∼ p(X) be a positive random variable and let t > 0. The Laplace transform of X is

    E[e^{-tX}] = ∫_X e^{-tx} p(x) dx      (sums when appropriate)                        (1.5)

Important property

• There is a one-to-one mapping between p(x) and E[e^{-tX}]. That is, if p(x) and q(x) are two distributions and E_p[e^{-tX}] = E_q[e^{-tX}], then p(x) = q(x) for all x (p and q are the same distribution).
Sums of r.v.'s

• Let X1 ∼ p(x) and X2 ∼ q(x) independently, and Y = X1 + X2. What is the distribution of Y?

• Approach: Take the Laplace transform of Y and see what happens.

    E e^{-tY} = E e^{-t(X1 + X2)} = E[e^{-tX1} e^{-tX2}] = E[e^{-tX1}] E[e^{-tX2}]   (by independence of X1 and X2)      (1.6)

• So we can multiply the Laplace transforms of X1 and X2 and see if we recognize the result.
Sums of Poisson r.v.'s (take 2)

• The Laplace transform of a Poisson random variable has a very important form that should be memorized:

    E e^{-tX} = Σ_{n=0}^∞ e^{-tn} (λ^n / n!) e^{-λ} = e^{-λ} Σ_{n=0}^∞ (λ e^{-t})^n / n! = e^{-λ} e^{λ e^{-t}} = e^{λ(e^{-t} - 1)}      (1.7)

• Back to the problem: X1 ∼ Pois(λ1), X2 ∼ Pois(λ2), Y = X1 + X2.

    E e^{-tY} = E[e^{-tX1}] E[e^{-tX2}] = e^{λ1(e^{-t} - 1)} e^{λ2(e^{-t} - 1)} = e^{(λ1 + λ2)(e^{-t} - 1)}      (1.8)

  We recognize that the last term is the Laplace transform of a Pois(λ1 + λ2) random variable. We can therefore conclude that Y ∼ Pois(λ1 + λ2).
• Another way of saying this: if we draw Y1 ∼ Pois(λ1 + λ2), X1 ∼ Pois(λ1) and X2 ∼ Pois(λ2), and define Y2 = X1 + X2, then Y1 is equal to Y2 in distribution. (I.e., they may not be equal, but they have the same distribution. We write this as Y1 =_d Y2.)

• The idea extends to sums of more than two. Let Xi ∼ Pois(λi) independently. Then Σ_i Xi ∼ Pois(Σ_i λi), since

    E e^{-t Σ_i Xi} = Π_i E e^{-tXi} = e^{(Σ_i λi)(e^{-t} - 1)}.                          (1.9)
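• A quick numerical sanity check of (1.9), not part of the original lecture: the Python sketch below compares the empirical distribution of a sum of independent Poisson draws with the Pois(Σ_i λi) pmf. The rates are arbitrary illustrative values.

```python
# Sketch (added illustration): sums of independent Poissons are Poisson
# with the summed rate, as in Eq. (1.9).
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lams = [0.7, 1.3, 2.0]                      # arbitrary rates for illustration
n_samples = 200_000

# Draw X_i ~ Pois(lam_i) independently and sum them.
Y = sum(rng.poisson(lam, n_samples) for lam in lams)

lam_tot = sum(lams)
for k in range(8):
    empirical = np.mean(Y == k)
    exact = exp(-lam_tot) * lam_tot**k / factorial(k)   # Pois(k; sum of rates)
    print(f"k={k}: empirical {empirical:.4f}  vs  Pois({lam_tot}) pmf {exact:.4f}")
```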
A conjugate prior for λ

• What if we have X1, . . . , XN that we believe to be generated by a Pois(λ) distribution, but we don't know λ?

• One answer: Put a prior distribution on λ, p(λ), and calculate the posterior of λ using Bayes' rule.

Bayes' rule (review)

    P(A, B) = P(A|B)P(B) = P(B|A)P(A)                                                    (1.10)
    ⇓
    P(A|B) = P(B|A)P(A) / P(B)                                                           (1.11)

    posterior = (likelihood × prior) / evidence                                          (1.12)
Gamma prior

• Let λ ∼ Gam(a, b), where p(λ|a, b) = (b^a / Γ(a)) λ^{a-1} e^{-bλ} is a gamma distribution. Then the posterior of λ is

    p(λ|X1, . . . , XN) ∝ p(X1, . . . , XN|λ) p(λ) = [ Π_{i=1}^N (λ^{Xi} / Xi!) e^{-λ} ] (b^a / Γ(a)) λ^{a-1} e^{-bλ}
                        ∝ λ^{a + Σ_{i=1}^N Xi - 1} e^{-(b + N)λ}                          (1.13)
    ⇓
    p(λ|X1, . . . , XN) = Gam(a + Σ_{i=1}^N Xi, b + N)                                    (1.14)

  Note that
    E[λ|X1, . . . , XN] = (a + Σ_{i=1}^N Xi) / (b + N) ≈ empirical average of the Xi    (makes sense because E[X|λ] = λ)
    V[λ|X1, . . . , XN] = (a + Σ_{i=1}^N Xi) / (b + N)^2 ≈ empirical average / N        (we get more confident as we see more Xi)

• The gamma distribution is said to be the conjugate prior for the parameter of the Poisson distribution because the posterior is also gamma.
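• The conjugate update above is easy to exercise numerically. The following Python sketch (an added illustration; the hyperparameters a, b and the "true" λ are arbitrary) computes the Gam(a + Σ_i Xi, b + N) posterior from simulated counts.

```python
# Sketch: Poisson likelihood with a Gam(a, b) prior on lambda (rate parameterization).
# The posterior is Gam(a + sum(X), b + N), as derived above.
import numpy as np

rng = np.random.default_rng(1)
true_lam = 3.5                       # hypothetical ground truth
a, b = 1.0, 1.0                      # arbitrary prior hyperparameters
X = rng.poisson(true_lam, size=50)   # observed counts

a_post = a + X.sum()
b_post = b + len(X)
post_mean = a_post / b_post          # approx. the empirical average of X
post_var = a_post / b_post**2        # shrinks as N grows

print("posterior:", f"Gam({a_post:.1f}, {b_post:.1f})",
      " mean", round(post_mean, 3), " var", round(post_var, 4))
```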
Poisson–Multinomial

• A sequence of Poisson r.v.'s is closely related to the multinomial distribution as follows:
  Let Xi ∼ Pois(λi) independently and let Y = Σ_{i=1}^N Xi.
  Then what is the distribution of X = (X1, . . . , XN) given Y?
We can use basic rules of probability. . .

    P(X1, . . . , XN) = P(X1, . . . , XN, Y = Σ_{i=1}^N Xi)    ← (Y is a deterministic function of X_{1:N})
                      = P(X1, . . . , XN | Y = Σ_{i=1}^N Xi) P(Y = Σ_{i=1}^N Xi)          (1.15)

And so

    P(X1, . . . , XN | Y = Σ_{i=1}^N Xi) = P(X1, . . . , XN) / P(Y = Σ_{i=1}^N Xi) = Π_i P(Xi) / P(Y = Σ_{i=1}^N Xi)      (1.16)

• We know that P(Y) = Pois(Y; Σ_i λi), so

    P(X1, . . . , XN | Y = Σ_{i=1}^N Xi) = [ Π_{i=1}^N (λi^{Xi} / Xi!) e^{-λi} ] / [ ((Σ_{i=1}^N λi)^{Σ_{i=1}^N Xi} / (Σ_{i=1}^N Xi)!) e^{-Σ_{i=1}^N λi} ]
                                         = [ Y! / (X1! · · · XN!) ] Π_{i=1}^N ( λi / Σ_{j=1}^N λj )^{Xi}                  (1.17)
    ⇓
    Mult(Y; p1, . . . , pN),   pi = λi / Σ_j λj

• What is this saying?
  1. Given the sum of N independent Poisson r.v.'s, the individual values are distributed as a multinomial using the normalized parameters.
  2. We can sample X1, . . . , XN in two equivalent ways:
     a) Sample Xi ∼ Pois(λi) independently
     b) First sample Y ∼ Pois(Σ_j λj), then (X1, . . . , XN) ∼ Mult(Y; λ1/Σ_j λj, . . . , λN/Σ_j λj)
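• The two sampling routes in 2(a) and 2(b) can be checked by simulation. The sketch below (added for illustration; the rates λi are arbitrary) draws from both routes and compares their first two moments.

```python
# Sketch: the two equivalent sampling routes from the Poisson-multinomial relationship.
import numpy as np

rng = np.random.default_rng(2)
lams = np.array([1.0, 2.0, 4.0])
n_samples = 50_000

# Route (a): sample each X_i ~ Pois(lam_i) independently.
Xa = rng.poisson(lams, size=(n_samples, len(lams)))

# Route (b): sample the total Y ~ Pois(sum lam), then split multinomially.
Y = rng.poisson(lams.sum(), size=n_samples)
p = lams / lams.sum()
Xb = np.array([rng.multinomial(y, p) for y in Y])

# The two routes agree in distribution; compare means and variances (both ~ lams).
print("means (a):", Xa.mean(axis=0).round(3), " (b):", Xb.mean(axis=0).round(3))
print("vars  (a):", Xa.var(axis=0).round(3), " (b):", Xb.var(axis=0).round(3))
```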
Poisson as a limiting case distribution

• The Poisson distribution arises as a limiting case of the sum over many binary events, each having small probability.

Binomial distribution and Bernoulli process

• Imagine we have an array of random variables X_{nm}, where X_{nm} ∼ Bern(λ/n) for m = 1, . . . , n and fixed 0 ≤ λ ≤ n. Let Y_n = Σ_{m=1}^n X_{nm} and Y = lim_{n→∞} Y_n. Then Y_n ∼ Bin(n, λ/n) and Y ∼ Pois(λ).
Picture: Have n coins with bias λ/n evenly spaced between [0, 1]. Go down the line and flip each independently. In the limit n → ∞, the total # of 1's is a Pois(λ) r.v.

Proof:

    lim_{n→∞} P(Y_n = k|λ) = lim_{n→∞} [ n! / ((n-k)! k!) ] (λ/n)^k (1 - λ/n)^{n-k}       (1.18)
                            = lim_{n→∞} [ n(n-1)···(n-k+1) / n^k ] [ (1 - λ/n)^{-k} ] [ λ^k / k! ] [ (1 - λ/n)^n ]
                                            → 1                      → 1                        → e^{-λ}
                            → (λ^k / k!) e^{-λ} = Pois(k; λ)

So lim_{n→∞} Bin(n, λ/n) = Pois(λ).
A more general statement

• Let λ_{nm} be an array of positive numbers such that
  1. Σ_{m=1}^n λ_{nm} = λ < ∞
  2. λ_{nm} < 1 and lim_{n→∞} λ_{nm} = 0 for all m
  Let X_{nm} ∼ Bern(λ_{nm}) for m = 1, . . . , n. Let Y_n = Σ_{m=1}^n X_{nm} and Y = lim_{n→∞} Y_n. Then Y ∼ Pois(λ).

Proof (use the Laplace transform):

1. E[e^{-tY}] = lim_{n→∞} E[e^{-tY_n}] = lim_{n→∞} Π_{m=1}^n E[e^{-tX_{nm}}]

2. E[e^{-tX_{nm}}] = λ_{nm} e^{-t·1} + (1 - λ_{nm}) e^{-t·0} = 1 - λ_{nm}(1 - e^{-t})

3. So E[e^{-tY}] = lim_{n→∞} Π_{m=1}^n (1 - λ_{nm}(1 - e^{-t})) = lim_{n→∞} e^{Σ_{m=1}^n ln(1 - λ_{nm}(1 - e^{-t}))}

4. ln(1 - λ_{nm}(1 - e^{-t})) = -Σ_{s=1}^∞ (1/s) λ_{nm}^s (1 - e^{-t})^s,   because 0 ≤ 1 - λ_{nm}(1 - e^{-t}) < 1

5. Σ_{m=1}^n ln(1 - λ_{nm}(1 - e^{-t})) = -( Σ_{m=1}^n λ_{nm} )(1 - e^{-t}) - Σ_{s=2}^∞ (1/s) Σ_{m=1}^n λ_{nm}^s (1 - e^{-t})^s,
   where the first sum equals λ and the second term → 0 as n → ∞.

   So E[e^{-tY}] = e^{-λ(1 - e^{-t})}. Therefore Y ∼ Pois(λ).
Poisson process

• In many ways, the Poisson process is no more complicated than the previous discussion on the Poisson distribution. In fact, the Poisson process can be thought of as a "structured" Poisson distribution, which should hopefully be more clear below.

Intuitions and notations

• S: a space (think R^d or part of R^d)
• Π: a random countable subset of S (i.e., a random # of points and their locations)
• A ⊂ S: a subset of S
• N(A): a counting measure. Counts how many points in Π fall in A (i.e., N(A) = |Π ∩ A|)
• µ(·): a measure on S
  ↪ µ(A) ≥ 0 for |A| > 0
  µ(·) is non-atomic
  ↪ µ(A) → 0 as A ↓ ∅

Think of µ as a scaled probability distribution that is continuous, so that µ({x}) = 0 for all points x ∈ S.
Poisson processes

• A Poisson process Π is a countable subset of S such that
  1. For A ⊂ S, N(A) ∼ Pois(µ(A))
  2. For disjoint sets A1, . . . , Ak, N(A1), . . . , N(Ak) are independent Poisson random variables.
  N(·) is called a "Poisson random measure". (See above for the mapping from Π to N(·).)

Some basic properties

• The most basic properties follow from the properties of a Poisson distribution.

a) E N(A) = µ(A) → therefore µ is sometimes referred to as a "mean measure"

b) If A1, . . . , Ak are disjoint, then

    N(∪_{i=1}^k Ai) = Σ_{i=1}^k N(Ai) ∼ Pois(Σ_{i=1}^k µ(Ai)) = Pois(µ(∪_{i=1}^k Ai))      (1.19)

  Since this holds for k → ∞ and µ(Ai) ↘ 0, N(·) is "infinitely divisible".
c) Let A1, . . . , Ak be disjoint subsets of S. Then

    P(N(A1) = n1, . . . , N(Ak) = nk | N(∪_{i=1}^k Ai) = n) = P(N(A1) = n1, . . . , N(Ak) = nk) / P(N(∪_{i=1}^k Ai) = n)      (1.20)

  Notice from earlier that N(Ai) ⇔ Xi. Following the same exact calculations,

    P(N(A1) = n1, . . . , N(Ak) = nk | N(∪_{i=1}^k Ai) = n) = [ n! / (n1! · · · nk!) ] Π_{i=1}^k ( µ(Ai) / µ(∪_{j=1}^k Aj) )^{ni}      (1.21)
Drawing from a Poisson process (break in the basic properties)

• Property (c) above gives a very simple way of drawing Π ∼ PP(µ), though some thought is required to see why.
  1. Draw the total number of points N(S) ∼ Pois(µ(S)).
  2. For i = 1, . . . , N(S) draw Xi ∼ µ/µ(S) i.i.d. In other words, normalize µ to get a probability distribution on S.
  3. Define Π = {X1, . . . , X_{N(S)}}.
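• The three-step recipe above is easy to implement. The sketch below (an added illustration, not from the lecture) assumes S = [0, 1]^2 and µ = c × Lebesgue measure, both arbitrary choices, so that µ/µ(S) is the uniform distribution on the unit square.

```python
# Sketch: draw Pi ~ PP(mu) on S = [0,1]^2 with mu = c * Lebesgue (finite mean measure).
import numpy as np

def draw_poisson_process(c=20.0, rng=None):
    rng = rng or np.random.default_rng(3)
    mu_S = c * 1.0                                   # mu(S) = c * area of the unit square
    n = rng.poisson(mu_S)                            # step 1: total number of points
    points = rng.uniform(0.0, 1.0, size=(n, 2))      # step 2: iid draws from mu / mu(S)
    return points                                    # step 3: Pi = {X_1, ..., X_n}

Pi = draw_poisson_process()
print(len(Pi), "points; expected about 20")
```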
d) Final basic property (of these notes):

    E e^{-tN(A)} = e^{µ(A)(e^{-t} - 1)}   → an obvious result since N(A) ∼ Pois(µ(A))      (1.22)
Some more advanced properties

Superposition Theorem: Let Π1, Π2, . . . be a countable collection of independent Poisson processes with Πi ∼ PP(µi). Let Π = ∪_{i=1}^∞ Πi. Then Π ∼ PP(µ) with µ = Σ_{i=1}^∞ µi.

Proof: Remember from the definition of a Poisson process we have to show two things.
  1. Let N(A) be the PRM (Poisson random measure) associated with the PP (Poisson process) Π and Ni(A) with Πi. Clearly N(A) = Σ_{i=1}^∞ Ni(A), and since Ni(A) ∼ Pois(µi(A)) by definition, it follows that N(A) ∼ Pois(Σ_{i=1}^∞ µi(A)).
  2. Let A1, . . . , Ak be disjoint. Then N(A1), . . . , N(Ak) are independent because the Ni(Aj) are independent for all i and j.

Restriction Theorem: If we restrict Π to a subset of S, we still have a Poisson process. Let S1 ⊂ S and Π1 = Π ∩ S1. Then Π1 ∼ PP(µ1), where µ1(A) = µ(S1 ∩ A). This can be thought of as setting µ = 0 outside of S1, or just looking at the subspace S1 and ignoring the rest of S.

Mapping Theorem: This says that one-to-one functions y = f(x) preserve the Poisson process. That is, if Πx ∼ PP(µ) and Πy = f(Πx), then Πy is also a Poisson process with the proper transformation made to µ. (See Kingman for details. We won't use this in this class.)
Example

Let N be a Poisson random measure on R^2 with mean measure µ(A) = c|A|, where |A| is the area of A ("Lebesgue measure").

Intuition: Think trees in a forest.

Question 1: What is the distance R from the origin to the nearest point?

Answer: We know that R > r if N(B_r) = 0, where B_r is the ball of radius r. Since these are equivalent events, we know that

    P(R > r) = P(N(B_r) = 0) = e^{-µ(B_r)} = e^{-cπr^2}.                                  (1.23)

Question 2: Let each atom be the center of a disk of radius a. Take our line of sight as the x-axis. How far can we see?

Answer: The distance V is equivalent to the farthest we can extend a rectangle D_x with y-axis boundaries of [-a, a]. We know that V > x if N(D_x) = 0. Therefore

    P(V > x) = P(N(D_x) = 0) = e^{-µ(D_x)} = e^{-2acx}.                                   (1.24)
Marked Poisson processes

• The other major theorem of these notes relates to "marking" the points of a Poisson process with a random variable.

• Let Π ∼ PP(µ). For each x ∈ Π, associate a r.v. y ∼ p(y|x). We say that y has "marked" x. The result is also a Poisson process.

Theorem: Let µ be a measure on a space S and p(y|x) a probability distribution on a space M. For each x ∈ Π ∼ PP(µ) draw y ∼ p(y|x) and define Π* = {(xi, yi)}. Then Π* is a Poisson process on S × M with mean measure µ* = µ(dx)p(y|x)dy.

Comment: If N*(C) = |Π* ∩ C| for C ⊂ S × M, this says that N*(C) ∼ Pois(µ*(C)), where µ*(C) = ∫_C µ(dx)p(y|x)dy.
Proof: We need to show that E e^{-tN*(C)} = exp{ ∫_C (e^{-t} - 1) µ(dx) p(y|x) dy }.

1. Note that N*(C) = Σ_{i=1}^{N(S)} 1{(xi, yi) ∈ C}, where N(S) is the PRM associated with Π ∼ PP(µ).

2. Recall the tower property: E f(A, B) = E[E[f(A, B)|B]].

3. Therefore, E e^{-tN*(C)} = E[ E[ exp{-t Σ_{i=1}^{N(S)} 1{(xi, yi) ∈ C}} | Π ] ].

4. Manipulating:

    E e^{-tN*(C)} = E[ Π_{i=1}^{N(S)} E[e^{-t 1{(xi, yi) ∈ C}} | Π] ]                                            (1.25)
                  = E[ Π_{i=1}^{N(S)} ∫_M ( e^{-t·1} 1{(xi, yi) ∈ C} + e^{-t·0} 1{(xi, yi) ∉ C} ) p(yi|xi) dyi ]

5. Continuing, using 1{(xi, yi) ∉ C} = 1 - 1{(xi, yi) ∈ C}:

    E e^{-tN*(C)} = E[ Π_{i=1}^{N(S)} ∫_M ( 1 + (e^{-t} - 1) 1{(xi, yi) ∈ C} ) p(yi|xi) dyi ]
                  = E[ E[ Π_{i=1}^n ( 1 + ∫_S ∫_M (e^{-t} - 1) 1{(x, y) ∈ C} p(y|x) dy µ(dx)/µ(S) ) | N(S) = n ] ]
                       ← tower again, using the Poisson–multinomial representation (given N(S) = n, the xi are i.i.d. µ/µ(S))
                  = E[ ( 1 + ∫_C (e^{-t} - 1) p(y|x) dy µ(dx)/µ(S) )^{N(S)} ],    N(S) ∼ Pois(µ(S))              (1.26)

6. Recall that if n ∼ Pois(λ), then

    E z^n = Σ_{n=0}^∞ z^n (λ^n / n!) e^{-λ} = e^{-λ} Σ_{n=0}^∞ (zλ)^n / n! = e^{λ(z - 1)}.                       (1.27)

   Therefore, this last expectation shows that

    E e^{-tN*(C)} = exp{ µ(S) ∫_C (e^{-t} - 1) p(y|x) dy µ(dx)/µ(S) } = exp{ ∫_C (e^{-t} - 1) µ(dx) p(y|x) dy },

   thus N*(C) ∼ Pois(µ*(C)).
Example 1: Coloring

• Let Π ∼ PP(µ) and let each x ∈ Π be randomly colored from among K colors. Denote the color by y with P(y = i) = pi. Then Π* = {(xi, yi)} is a PP on S × {1, . . . , K} with mean measure µ*(dx · {y}) = µ(dx) Π_{i=1}^K pi^{1(y=i)}. If we want to restrict Π* to the ith color, called Π*_i, then we know that Π*_i ∼ PP(pi µ). We can also restrict to two colors, etc.
Example 2: Using the extended space

• Imagine we have a 24-hour store and customers arrive according to a PP with mean measure µ on R (time). In this case, let µ(R) = ∞, but µ([a, b]) < ∞ for finite a, b (which gives the expected number of arrivals between times a and b). Imagine a customer arriving at time x stays for duration y ∼ p(y|x). At time t, what can we say about the customers in the store?

• Π ∼ PP(µ) and Π* = {(xi, yi)} for xi ∈ Π is PP(µ(dx)p(y|x)dy) because it's a marked Poisson process.

• We can construct the marked Poisson process this way. The counting measure N*(C) ∼ Pois(µ*(C)), µ* = µ(dx)p(y|x)dy.

• The points in C_t (the set of (x, y) with x ≤ t and x + y > t) are those that arrive before time t and are still there at time t. It follows that

    N*(C_t) ∼ Pois( ∫_0^t µ(dx) ∫_{t-x}^∞ p(y|x) dy ).

  N*(C_t) is the number of customers in the store at time t.
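• As an added illustration (not from the lecture), the sketch below specializes the store example to a homogeneous arrival rate ρ (so µ(dx) = ρ dx) and Exp(λ) stay durations independent of the arrival time, in which case the integral above gives N*(C_t) ∼ Pois(ρ(1 - e^{-λt})/λ).

```python
# Sketch of the store example under two simplifying assumptions:
# homogeneous arrivals with rate rho and Exp(lam) stay durations.
# Then the number in the store at time t is Pois( rho*(1 - exp(-lam*t))/lam ).
import numpy as np

rng = np.random.default_rng(5)
rho, lam, t, trials = 3.0, 0.5, 10.0, 20_000

counts = np.empty(trials, dtype=int)
for k in range(trials):
    n = rng.poisson(rho * t)                      # arrivals on [0, t]
    x = rng.uniform(0.0, t, size=n)               # arrival times (given the count)
    y = rng.exponential(1.0 / lam, size=n)        # marks: stay durations
    counts[k] = np.sum(x + y > t)                 # still in the store at time t

mean_theory = rho * (1.0 - np.exp(-lam * t)) / lam
print("empirical mean/var:", counts.mean().round(3), counts.var().round(3),
      "  Poisson mean:", round(mean_theory, 3))
```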
Chapter 2
Completely random measures, Campbell’s theorem,
gamma process
Poisson process review

• Recall the definition of a Poisson random measure.

PRM definition: Let S be a space and µ a non-atomic (i.e., diffuse, continuous) measure on it (think a positive function). A random measure N on S is a PRM with mean measure µ if
  a) For every subset A ⊂ S, N(A) ∼ Pois(µ(A)).
  b) For disjoint sets A1, . . . , Ak, N(A1), . . . , N(Ak) are independent r.v.'s.

Poisson process definition: Let X1, . . . , X_{N(S)} be the N(S) points of N (a random number), each of measure equal to one. Then the point process Π = {X1, . . . , X_{N(S)}} is a Poisson process, denoted Π ∼ PP(µ).

Recall that to draw this (when µ(S) < ∞) we can
  a) Draw N(S) ∼ Pois(µ(S))
  b) For i = 1, . . . , N(S), draw Xi ∼ µ/µ(S) i.i.d.  ← normalize µ to get a probability measure.
Functions of Poisson processes

• Often models will take the form of a function of an underlying Poisson process: Σ_{x ∈ Π} f(x).

• Example (see Sec. 5.3 of Kingman): Imagine that star locations are distributed as Π ∼ PP(µ). They're marked independently with a mass m ∼ p(m), giving a marked PP Π* ∼ PP(µ × p dm). The gravitational field at, e.g., the origin is

    Σ_{(x,m) ∈ Π*} f((x, m)) = Σ_{(x,m) ∈ Π*} (G m / ||x||_2^3) x      (G is a constant from physics)
• We can analyze these sorts of problems using PP techniques.

• We will be more interested in the context of "completely random measures" in this class.

Functions: The finite case

• Let Π ∼ PP(µ) and |Π| < ∞ (with probability 1). Let f(x) be a positive function. Let M = Σ_{x ∈ Π} f(x). We calculate its Laplace transform (for t < 0):

    E e^{tM} = E e^{t Σ_{x ∈ Π} f(x)} = E[ Π_{i=1}^{|Π|} e^{tf(xi)} ]      ← recall two things (below)      (2.1)

Recall:
  1. |Π| = N(S) ← Poisson random measure for Π
  2. Tower property: E g(x, y) = E[E[g(x, y)|y]].

So:

    E[ Π_{i=1}^{N(S)} e^{tf(xi)} ] = E[ E[ Π_{i=1}^{N(S)} e^{tf(xi)} | N(S) ] ] = E[ ( E e^{tf(x)} )^{N(S)} ].      (2.2)

Since N(S) ∼ Pois(µ(S)), we use the last term to conclude

    E[ Π_{i=1}^{N(S)} e^{tf(xi)} ] = exp{ µ(S)( E[e^{tf(x)}] - 1 ) } = exp{ ∫_S µ(dx)( e^{tf(x)} - 1 ) },

since E e^{tf(x)} = ∫_S (µ(dx)/µ(S)) e^{tf(x)}. (Recall that E[(e^t)^{N(A)}] = exp{ ∫_A µ(dx)(e^t - 1) }; this is the case f(x) = 1.)

And so for functions M = Σ_{x ∈ Π} f(x) of Poisson processes Π with an almost surely finite number of points,

    E e^{tM} = exp{ ∫_S µ(dx)( e^{tf(x)} - 1 ) }.                                          (2.3)
The infinite case

• In the case where µ(S) = ∞, N(S) = ∞ with probability one. We therefore use a different proof technique that gives the same result.

• Approximate f(x) with simple functions: f_k = Σ_{i=1}^{k2^k} a_i 1_{A_i}, using k2^k equally-spaced levels in the interval (0, k); A_i = [i/2^k, (i+1)/2^k) and a_i = i/2^k. In the limit k → ∞, the width 1/2^k of those levels ↘ 0, and so f_k → f.

• That's not a proof, but consider that f_k(x) = max_{i ≤ k2^k} (i/2^k) 1(i/2^k < x), so f_k(x) ↗ x as k → ∞.

• Important notation change: M = Σ_{x ∈ Π} f(x) ⇔ ∫_S N(dx) f(x).

• Approximate f with f_k. Then with the notation change, M_k = Σ_{x ∈ Π} f_k(x) ⇔ Σ_i a_i N(A_i).

• The Laplace functional is

    E e^{tM} = lim_{k→∞} E e^{tM_k} = lim_{k→∞} E Π_i e^{t a_i N(A_i)}        ← N(A_i) and N(A_j) are independent
             = lim_{k→∞} exp{ Σ_i µ(A_i)( e^{t a_i} - 1 ) }                    ← N(A_i) ∼ Pois(µ(A_i))
             = exp{ ∫_S µ(dx)( e^{tf(x)} - 1 ) }                               ← integral as limit of infinitesimal sums      (2.4)

Mean and variance of M: Using ideas from moment generating functions, it follows that

    E(M) = ∫_S µ(dx) f(x),        V(M) = ∫_S µ(dx) f(x)^2,                                 (2.5)

where E(M) = (d/dt) E e^{tM} |_{t=0} and V(M) = (d^2/dt^2) E e^{tM} |_{t=0} - E(M)^2.
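• A Monte Carlo check of (2.5), added for illustration: take S = [0, 1], µ = c × Lebesgue and f(x) = x^2 (all arbitrary choices), so E(M) = c/3 and V(M) = c/5.

```python
# Sketch: simulate M = sum_{x in Pi} f(x) for Pi ~ PP(c * Lebesgue on [0,1])
# with f(x) = x^2, and compare with Campbell's mean/variance formulas (2.5).
import numpy as np

rng = np.random.default_rng(6)
c, trials = 10.0, 50_000
f = lambda x: x ** 2

M = np.empty(trials)
for k in range(trials):
    n = rng.poisson(c)                     # N(S) ~ Pois(mu(S)) = Pois(c)
    x = rng.uniform(0.0, 1.0, size=n)      # points of the process
    M[k] = f(x).sum()                      # M = sum of f over the points

print("E(M):", M.mean().round(3), "vs", round(c / 3, 3))
print("V(M):", M.var().round(3), "vs", round(c / 5, 3))
```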
Finiteness of ∫_S N(dx) f(x)
• The next obvious question when µ(S) = ∞ (and thus N(S) = ∞) is whether ∫_S N(dx) f(x) < ∞. Campbell's theorem gives the necessary and sufficient conditions for this to be true.

• Campbell's Theorem: Let Π ∼ PP(µ) and N the PRM. Let f(x) be a non-negative function on S. Then with probability one,

    M = ∫_S N(dx) f(x)   is  < ∞  if ∫_S min(f(x), 1) µ(dx) < ∞,   and  = ∞  otherwise.      (2.6)

Proof: For u > 0, e^{-uM} = e^{-uM} 1(M < ∞) ↗ 1(M < ∞) as u ↘ 0. By dominated convergence, as u ↘ 0,

    E e^{-uM} = E[ e^{-uM} 1(M < ∞) ] ↗ E[1(M < ∞)] = P(M < ∞).                              (2.7)

In our case: P(M < ∞) = lim_{u↘0} exp{ ∫_S µ(dx)( e^{-uf(x)} - 1 ) }.

Sufficiency
  1. For P(M < ∞) = 1 as in the theorem, we need lim_{u↘0} ∫_S µ(dx)( e^{-uf(x)} - 1 ) = 0.
  2. For 0 < u < 1 we have 1 - f(x) < 1 - uf(x) < e^{-uf(x)}. (Figure: e^{-uf(x)} is convex in f(x) and 1 - uf(x) is a 1st-order Taylor expansion of e^{-uf(x)} at f(x) = 0.)
  3. Therefore 1 - e^{-uf(x)} < f(x). Also, 1 - e^{-uf(x)} < 1 trivially.
  4. So:

    0 ≤ ∫_S µ(dx)( 1 - e^{-uf(x)} ) ≤ ∫_S µ(dx) min(f(x), 1)                                 (2.8)

  5. If ∫_S µ(dx) min(f(x), 1) < ∞, then by dominated convergence

    lim_{u↘0} ∫_S µ(dx)( 1 - e^{-uf(x)} ) = ∫_S µ(dx)( 1 - exp{ lim_{u↘0} -uf(x) } ) = 0      (2.9)

• This proves sufficiency. For necessity we can show that (see, e.g., Çinlar II.2.13)

    ∫_S min(f(x), 1) µ(dx) = ∞  ⟹  ∫_S µ(dx)( 1 - e^{-uf(x)} ) = ∞  ⟹  E e^{-uM} = 0, ∀u     (2.10)
Completely random measures (CRM)

• Definition of measure: The set function µ is a measure on the space S if
  1. µ(∅) = 0
  2. µ(A) ≥ 0 for all A ⊂ S
  3. µ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ µ(Ai) when Ai ∩ Aj = ∅, i ≠ j

• Definition of completely random measure: The set function M is a completely random measure on the space S if it satisfies #1 to #3 above and
  1. M(A) is a random variable
  2. M(A1), . . . , M(Ak) are independent for disjoint sets Ai

• Example: Let N be the counting measure associated with Π ∼ PP(µ). It's a CRM.

• We will be interested in the following situation: Let Π ∼ PP(µ) and mark each θ ∈ Π with a r.v. π ∼ λ(π), π ∈ R+. Then Π* = {(θ, π)} is a PP on S × R+ with mean measure µ(dθ)λ(π)dπ.

• If N(dθ, dπ) is the counting measure for Π*, then N(C) ∼ Pois(∫_C µ(dθ)λ(π)dπ).

• For A ⊂ S, let M(A) = ∫_A ∫_0^∞ N(dθ, dπ) π. Then M is a CRM on S.

• M is a special case of sums of functions of Poisson processes with f(θ, π) = π. Therefore we know that

    E e^{tM(A)} = exp{ ∫_A ∫_0^∞ (e^{tπ} - 1) µ(dθ) λ(π) dπ }.                               (2.11)

• This works both ways: if we define M and show it has this Laplace transform, then we know there is a marked Poisson process "underneath" it with mean measure equal to µ(dθ)λ(π)dπ.

  [Figure: a marked PP on S × R+ with mean measure µ(dθ) × λ(π)dπ gives a completely random measure on S via M(A) = ∫_A ∫_0^∞ N(dθ, dπ) π.]
Gamma processes

• Definition: Let µ be a non-atomic measure on S. Then G is a gamma process if for all A ⊂ S, G(A) ∼ Gam(µ(A), c) and G(A1), . . . , G(Ak) are independent for disjoint A1, . . . , Ak. We write G ∼ GaP(µ, c). (c > 0)

• Before trying to intuitively understand G, let's calculate its Laplace transform. For t < 0,

    E e^{tG(A)} = ∫_0^∞ ( c^{µ(A)} / Γ(µ(A)) ) G(A)^{µ(A)-1} e^{-G(A)(c - t)} dG(A) = ( c / (c - t) )^{µ(A)}      (2.12)

• Manipulate this term as follows (and watch the magic!)

    ( c / (c - t) )^{µ(A)} = exp{ -µ(A) ln((c - t)/c) }
                           = exp{ -µ(A) ∫_c^{c-t} (1/s) ds }
                           = exp{ -µ(A) ∫_c^{c-t} ds ∫_0^∞ e^{-πs} dπ }
                           = exp{ -µ(A) ∫_0^∞ dπ ∫_c^{c-t} e^{-πs} ds }       (switched integrals)
                           = exp{ µ(A) ∫_0^∞ (e^{tπ} - 1) π^{-1} e^{-cπ} dπ }                                     (2.13)

  Therefore, G has an underlying Poisson random measure on S × R+:

    G(A) = ∫_A ∫_0^∞ N(dθ, dπ) π,      N(dθ, dπ) ∼ Pois(µ(dθ) π^{-1} e^{-cπ} dπ)                                  (2.14)

• The mean measure of N is µ(dθ) π^{-1} e^{-cπ} dπ on S × R+. We can use this to answer questions about G ∼ GaP(µ, c) using the Poisson process perspective.

  1. How many total atoms? ∫_S ∫_0^∞ µ(dθ) π^{-1} e^{-cπ} dπ = ∞ ⇒ infinite # w.p. 1.
     (Tells us that there are an infinite number of points in any subset A ⊂ S that have nonzero mass according to G.)

  2. How many atoms ≥ ε > 0? ∫_S ∫_ε^∞ µ(dθ) π^{-1} e^{-cπ} dπ < ∞ ⇒ finite # w.p. 1.
     (W.r.t. #1, this further tells us that only a finite number have mass greater than ε.)

  3. Campbell's theorem: f(θ, π) = π → ∫_S ∫_0^∞ min(π, 1) µ(dθ) π^{-1} e^{-cπ} dπ < ∞, therefore

        G(A) = ∫_A ∫_0^∞ N(dθ, dπ) π < ∞  w.p. 1.

     (Tells us that if we summed up the infinite number of nonzero masses in any set A, we would get a finite number even though we have an infinite number of nonzero things to add.)

• Aside: We already knew #3 by definition of G, but this isn't always the order in which CRMs are defined. Imagine starting the definition with a mean measure on S × R+.

• #3 shouldn't feel mysterious at all. Consider Σ_{n=1}^∞ 1/n^2. It's finite, but for each n, 1/n^2 > 0 and there are an infinite number of settings for n. "Infinite jump processes" such as the gamma process replace deterministic sequences like (1, 1/2^2, 1/3^2, . . .) with something random.
Gamma process as a limiting case

• Is there a more intuitive way to understand the gamma process?

• Definition: Let µ be a non-atomic measure on S with µ(S) < ∞. Let πi ∼ Gam(µ(S)/K, c) i.i.d. and θi ∼ µ/µ(S) i.i.d. for i = 1, . . . , K. If G_K = Σ_{i=1}^K πi δ_{θi}, then lim_{K→∞} G_K = G ∼ GaP(µ, c).

Picture: In the limit K → ∞, we have more and more atoms with smaller and smaller weights:

    G_K(A) = Σ_{i=1}^K πi δ_{θi}(A) = Σ_{i=1}^K πi 1(θi ∈ A)

Proof: Use the Laplace transform.

    E e^{tG_K(A)} = E e^{t Σ_{i=1}^K πi 1(θi ∈ A)} = E Π_{i=1}^K e^{tπi 1(θi ∈ A)} = E[ e^{tπ1(θ ∈ A)} ]^K
                  = E[ e^{tπ} 1(θ ∈ A) + 1(θ ∉ A) ]^K
                  = ( E[e^{tπ}] P(θ ∈ A) + P(θ ∉ A) )^K
                  = [ ( c/(c - t) )^{µ(S)/K} µ(A)/µ(S) + 1 - µ(A)/µ(S) ]^K
                  = [ 1 + (µ(A)/µ(S)) ( ( c/(c - t) )^{µ(S)/K} - 1 ) ]^K
                  = [ 1 + (µ(A)/µ(S)) Σ_{n=1}^∞ ( ln(c/(c - t)) )^n (µ(S)/K)^n / n! ]^K      ← (exponential power series)
                  = [ 1 + (µ(A)/µ(S)) (µ(S)/K) ln(c/(c - t)) + O(1/K^2) ]^K                   (2.15)

• In the limit K → ∞, this last equation converges to

    exp{ (µ(A)/µ(S)) µ(S) ln(c/(c - t)) } = ( c/(c - t) )^{µ(A)}.

  (Recall that lim_{K→∞} (1 + a/K + O(K^{-2}))^K = e^a.)

• This is the Laplace transform of a Gam(µ(A), c) random variable.

• Therefore, G_K(A) → G(A) ∼ Gam(µ(A), c), which we've already defined and analyzed as a gamma process.
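• The finite approximation is also a convenient way to draw an (approximate) gamma process in code. The sketch below (an added illustration) uses S = [0, 1] and µ = γ × Lebesgue, and checks that G_K(A) has the Gam(µ(A), c) mean and variance for A = [0, 0.3); note that NumPy's gamma sampler is parameterized by scale = 1/rate.

```python
# Sketch: finite approximation G_K of a gamma process on S = [0,1] with
# mu = gamma_tot * Lebesgue.  G_K(A) should be close to Gam(mu(A), c):
# mean mu(A)/c and variance mu(A)/c^2.
import numpy as np

rng = np.random.default_rng(7)
gamma_tot, c, K, trials = 5.0, 2.0, 2000, 5000
a_lo, a_hi = 0.0, 0.3
mu_A = gamma_tot * (a_hi - a_lo)

G_A = np.empty(trials)
for t in range(trials):
    pi = rng.gamma(gamma_tot / K, 1.0 / c, size=K)   # pi_i ~ Gam(mu(S)/K, c); scale = 1/rate
    theta = rng.uniform(0.0, 1.0, size=K)            # theta_i ~ mu / mu(S)
    G_A[t] = pi[(theta >= a_lo) & (theta < a_hi)].sum()

print("mean:", G_A.mean().round(3), "vs", round(mu_A / c, 3))
print("var :", G_A.var().round(3), "vs", round(mu_A / c**2, 3))
```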
Chapter 3
Beta processes and the Poisson process
A sparse coding latent factor model

• We have a d × n matrix Y. We want to factorize it as follows [factorization diagram omitted], where

    θi ∼ p(θ),        i = 1, . . . , K
    wj ∼ p(w),        j = 1, . . . , n
    zj ∈ {0, 1}^K,    j = 1, . . . , n

  "Sparse coding" because each vector Yj only possesses the columns of Θ indicated by zj (we want Σ_i z_{ji} ≪ K).

• Example: Y could be
  a) gene data of n people,
  b) patches extracted from an image for denoising (called "dictionary learning")

• We want to define a "Bayesian nonparametric" prior for this problem. By this we mean that
  1. The prior can allow K → ∞ and remain well defined
  2. As K → ∞, the effective rank is finite (and relatively small)
  3. The model somehow learns this rank from the data (not discussed)
A "beta sieves" prior

• Let θi ∼ µ/µ(S) and wj be drawn as above. Continue this generative model by letting

    z_{ji} ∼ Bern(πi) i.i.d.,   j = 1, . . . , n                                             (3.1)
    πi ∼ Beta(αγ/K, α(1 - γ/K)),   i = 1, . . . , K                                          (3.2)

• The set (θi, πi) are paired. πi gives the probability an observation picks θi. Notice that we expect πi → 0 as K → ∞.

• Construct a completely random measure H_K = Σ_{i=1}^K πi δ_{θi}.

• We want to analyze what happens when K → ∞. We'll see that it converges to a beta process.
Asymptotic analysis of beta sieves

• We have H_K = Σ_{i=1}^K πi δ_{θi}, with πi ∼ Beta(αγ/K, α(1 - γ/K)) i.i.d. and θi ∼ µ/µ(S) i.i.d., where γ = µ(S) < ∞. We want to understand lim_{K→∞} H_K.

• Look at the Laplace transform of H_K(A). Let H(A) = lim_{K→∞} H_K(A). Then

    E e^{tH(A)} = lim_{K→∞} E e^{tH_K(A)} = lim_{K→∞} E e^{t Σ_{i=1}^K πi 1(θi ∈ A)} = lim_{K→∞} E[ e^{tπ1(θ ∈ A)} ]^K      (3.3)
                                                     (sum → product and use the i.i.d. fact)

• Focus on E e^{tπ1(θ ∈ A)} for a particular K-level approximation. We have the following (long) sequence of equalities:

    E e^{tπ1(θ ∈ A)} = E[ e^{tπ} 1(θ ∈ A) + 1(θ ∉ A) ] = P(θ ∈ A) E e^{tπ} + P(θ ∉ A)
      = 1 + (µ(A)/µ(S)) ( E e^{tπ} - 1 )          ← E e^{tπ} = 1 + Σ_{s=1}^∞ (t^s/s!) Π_{r=0}^{s-1} (αγ/K + r)/(α + r)
      = 1 + (µ(A)/µ(S)) Σ_{s=1}^∞ (t^s/s!) Π_{r=0}^{s-1} (αγ/K + r)/(α + r)          ← plugging in
      = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s/s!) Π_{r=1}^{s-1} r/(α + r) + O(1/K^2)           ← separate out r = 0
      = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s/s!) αΓ(α)Γ(s)/Γ(α + s) + O(1/K^2)                ← since Γ(α + 1) = αΓ(α)
      = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s/s!) ∫_0^1 α π^{s-1} (1 - π)^{α-1} dπ + O(1/K^2)  ← normalizer of Beta(s, α)
      = 1 + (µ(A)/K) ∫_0^1 Σ_{s=1}^∞ ((tπ)^s/s!) α π^{-1} (1 - π)^{α-1} dπ + O(1/K^2)
      = 1 + (µ(A)/K) ∫_0^1 (e^{tπ} - 1) α π^{-1} (1 - π)^{α-1} dπ + O(1/K^2)

  Again using lim_{K→∞} (1 + a/K + O(K^{-2}))^K = e^a, it follows that

    lim_{K→∞} E[ e^{tπ1(θ ∈ A)} ]^K = exp{ µ(A) ∫_0^1 (e^{tπ} - 1) α π^{-1} (1 - π)^{α-1} dπ }.      (3.4)

• We therefore know that
  1. H is a completely random measure.
  2. It has an associated underlying Poisson random measure N(dθ, dπ) on S × [0, 1] with mean measure µ(dθ) α π^{-1} (1 - π)^{α-1} dπ.
  3. We can write H(A) as ∫_A ∫_0^1 N(dθ, dπ) π.
Beta process (as a CRM)

• Definition: Let N(dθ, dπ) be a Poisson random measure on S × [0, 1] with mean measure µ(dθ) α π^{-1} (1 - π)^{α-1} dπ, where µ is a non-atomic measure. Define the CRM H(A) as ∫_A ∫_0^1 N(dθ, dπ) π. Then H is called a beta process, H ∼ BP(α, µ), and

    E e^{tH(A)} = exp{ µ(A) ∫_0^1 (e^{tπ} - 1) α π^{-1} (1 - π)^{α-1} dπ }.

• (We just saw how we can think of H as the limit of a finite collection of random variables. This time we're just starting from the definition, which we could proceed to analyze regardless of the beta sieves discussion above.)

• Properties of H: Since H has a Poisson process representation, we can use the mean measure to calculate its properties (and therefore the asymptotic properties of the beta sieves approximation).

• Finiteness: Using Campbell's theorem, H(A) is finite with probability one, since

    ∫_A ∫_0^1 min(π, 1) α π^{-1} (1 - π)^{α-1} dπ µ(dθ) = µ(A) < ∞   (by assumption about µ; here min(π, 1) = π)      (3.5)

• Infinite jump process: H(A) is constructed from an infinite number of jumps, almost all infinitesimally small, since

    N(A × (0, 1]) ∼ Pois( µ(A) ∫_0^1 α π^{-1} (1 - π)^{α-1} dπ ) = Pois(∞)                                            (3.6)

• Finite number of "big" jumps: There are only a finite number of jumps greater than any ε > 0, since

    N(A × [ε, 1]) ∼ Pois( µ(A) ∫_ε^1 α π^{-1} (1 - π)^{α-1} dπ ),   and the integral is < ∞ for ε > 0.                (3.7)

  As ε → 0, the value in the Poisson goes to infinity, so the infinite jump process arises in this limit. Since the integral over the magnitudes is finite, this infinite number of atoms is being introduced in a "controlled" way as a function of ε (i.e., not "too quickly").
Reminder and intuitions: All of these properties are over instantiations of a beta process, and
so all statements are made with probability one.
– It’s not absurd to talk about beta processes that don’t have an infinite number of jumps, or
integrate to something infinite (“not absurd” in the way that it is absurd to talk about a
negative value drawn from a beta distribution).
– The support of the beta process includes these events, but they have probability zero, so
any H ∼ BP(α, µ) is guaranteed to have the properties discussed above.
– It’s easy to think of H as one random variable, but as the beta sieves approximation shows,
H is really a collection of an infinite number of random variables.
– The statements we are making about H above aren’t like asking whether a beta random
variable is greater than 0.5. They are larger scale statements about properties of this
infinite collection of random variables as a whole.
s
Another definition of the beta process links it to the beta distribution and our finite approximation:
s
Definition II: Let µ be a non-atomic measure on S. For all infinitesimal sets dθ ∈ S, let
H(dθ) ∼ Beta{αµ(dθ), α(1 − µ(dθ))},
then H ∼ BP(α, µ).
s
We aren’t going to prove this, but the proof is actually very similar to the beta sieves proof.
s
Note the difference from the gamma process, where G(A) ∼ Gam(µ(A), c) for any A ⊂ S.
The beta distribution only comes in the infinitesimal limit. That is
H(A) 6∼ Beta{αµ(A), α(1 − µ(A))},
when µ(A) > 0. Therefore, we can only write beta distributions on things that equal zero with
probability one. . . Compare this with the limit of the beta sieves prior.
• Observation: While µ({θ}) = µ(dθ) = 0, we have ∫_A µ({θ}) dθ = 0 but ∫_A µ(dθ) = µ(A) > 0.

• This is a major difference between a measure and a function: µ is a measure, not a function. It also seems to me a good example of why these additional concepts and notations are necessary, e.g., why we can't just combine things like µ(A) = ∫_A p(θ)dθ into one single notation, but instead talk about the "measure" µ and its associated density p(θ) such that µ(dθ) = p(θ)dθ.

• This leads to discussions involving the Radon–Nikodym theorem, etc. etc.

• The level of our discussion stops at an appreciation for why these types of theorems exist and are necessary (as overly-obsessive as they may feel the first time they're encountered), but we won't re-derive them.
Bernoulli process

• The Bernoulli process is constructed from the infinite limit of the "z" sequence in the process z_{ji} ∼ Bern(πi) i.i.d., πi ∼ Beta(αγ/K, α(1 - γ/K)), i = 1, . . . , K. The random measure X_j^{(K)} = Σ_{i=1}^K z_{ji} δ_{θi} converges to a "Bernoulli process" as K → ∞.

• Definition: Let H ∼ BP(α, µ). For each atom of H (the θ for which H({θ}) > 0), let Xj({θ}) | H ∼ Bern(H({θ})) independently. Then Xj is a Bernoulli process, denoted Xj | H ∼ BeP(H).

• Observation: We know that H has an infinite number of locations θ where H({θ}) > 0 because it's a discrete measure (the Poisson process proves that). Therefore, X is infinite as well.

Some questions about X
  1. How many 1's are in Xj?
  2. For X1, . . . , Xn | H ∼ BeP(H), how many total locations are there with at least one Xj equaling one there (marginally speaking, with H integrated out)? I.e., what is |{θ : Σ_{j=1}^n Xj({θ}) > 0}|?

• The Poisson process representation of the BP makes calculating this relatively easy. We start by observing that the Xj are marking the atoms, and so we have a marked Poisson process (or "doubly marked", since we can view (θ, π) as a marked PP as well).
Beta process marked by a Bernoulli process

• Definition: Let Π ∼ PP(µ(dθ) α π^{-1} (1 - π)^{α-1} dπ) on S × [0, 1] be a Poisson process underlying a beta process. For each (θ, π) ∈ Π draw a binary vector z ∈ {0, 1}^n where zi | π ∼ Bern(π) i.i.d. for i = 1, . . . , n. Denote the distribution on z as Q(z|π). Then Π* = {(θ, π, z)} is a marked Poisson process with mean measure µ(dθ) α π^{-1} (1 - π)^{α-1} dπ Q(z|π).

• There is therefore a Poisson process underlying the joint distribution of the hierarchical process

    H ∼ BP(α, µ),      Xi | H ∼ BeP(H) i.i.d., i = 1, . . . , n.

• We next answer the two questions about X asked above, starting with the second one.
Question: What is K_n^+ = |{θ : Σ_{j=1}^n Xj({θ}) > 0}| ?

Answer: The transition distribution Q(z|π) gives the probability of a vector z at a particular location (θ, π) ∈ Π (notice Q doesn't depend on θ). All we care about is whether z ∈ C = {0, 1}^n \ 0 (i.e., has a 1 in it).

We make the following observations:
  – The probability Q(C|π) = P(z ∈ C|π) = 1 - P(z ∉ C|π) = 1 - (1 - π)^n.
  – If we restrict the marked PP to C, we get the distribution on the value of K_n^+:

        K_n^+ = N(S, [0, 1], C) ∼ Pois( ∫_S ∫_0^1 µ(dθ) α π^{-1} (1 - π)^{α-1} dπ Q(C|π) ),   with Q(C|π) = 1 - (1 - π)^n.      (3.8)

  – It's worth stopping to remember that N is a counting measure, and to think about what exactly N(S, [0, 1], C) is counting.
    * N(S, [0, 1], C) is asking for the number of times event C happens (an event related to z), not caring about what the corresponding θ or π are (hence the S and [0, 1]).
    * I.e., it's counting the thing we're asking for, K_n^+.
  – We can show that 1 - (1 - π)^n = Σ_{i=0}^{n-1} π(1 - π)^i   ← geometric series
  – It follows that

        ∫_S ∫_0^1 µ(dθ) α π^{-1} (1 - π)^{α-1} dπ (1 - (1 - π)^n) = µ(S) α Σ_{i=0}^{n-1} ∫_0^1 (1 - π)^{α+i-1} dπ
                                                                  = µ(S) Σ_{i=0}^{n-1} αΓ(1)Γ(α + i)/Γ(α + i + 1)
                                                                  = Σ_{i=0}^{n-1} αµ(S)/(α + i)                                (3.9)

    Therefore K_n^+ ∼ Pois( Σ_{i=0}^{n-1} αµ(S)/(α + i) ).

• Notice that as n → ∞, K_n^+ → ∞ with probability one, and that E K_n^+ ≈ αµ(S) ln n.

• Also notice that we get the answer to the first question for free. Since the Xj are i.i.d., we can treat each one marginally as if it were the first one.

• If n = 1, X(S) ∼ Pois(µ(S)). That is, the number of ones in each Bernoulli process is Pois(µ(S))-distributed.
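• Added illustration: K_n^+ can be checked against the beta sieves approximation from the previous lecture (K large but finite, which introduces a small truncation bias). The sketch below compares the empirical mean of K_n^+ with Σ_{i=0}^{n-1} αγ/(α + i); all parameter values are arbitrary.

```python
# Sketch: check E K_n^+ = sum_{i=0}^{n-1} alpha*gamma/(alpha+i) using the
# beta-sieves (finite-K) approximation of the beta process.
import numpy as np

rng = np.random.default_rng(8)
alpha, gamma_tot, n, K, trials = 2.0, 3.0, 20, 20_000, 300

K_plus = np.empty(trials)
for t in range(trials):
    pi = rng.beta(alpha * gamma_tot / K, alpha * (1 - gamma_tot / K), size=K)
    Z = rng.random((n, K)) < pi                    # Z[j, i] = X_j({theta_i}) ~ Bern(pi_i)
    K_plus[t] = np.count_nonzero(Z.any(axis=0))    # atoms used by at least one X_j

mean_theory = np.sum(alpha * gamma_tot / (alpha + np.arange(n)))
print("empirical E K_n^+:", K_plus.mean().round(2), "  theory:", round(mean_theory, 2))
```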
Chapter 4
Beta processes and size-biased constructions
The beta process

• Definition (review): Let α > 0 and µ be a finite non-atomic measure on S. Let C ⊂ S × [0, 1] and N be a Poisson random measure with N(C) ∼ Pois(∫_C µ(dθ) α π^{-1} (1 - π)^{α-1} dπ). For A ⊂ S define H(A) = ∫_A ∫_0^1 N(dθ, dπ) π. Then H is a beta process, denoted H ∼ BP(α, µ).

Intuitive picture (review)

Figure 4.1: (left) Poisson process; (right) CRM constructed from the Poisson process. If (dθ, dπ) is a point in the PP, N(dθ, dπ) = 1 and N(dθ, dπ)π = π. H(A) is adding up the π's in the set A × [0, 1].
Drawing from this prior

• In general, we know that if Π ∼ PP(µ), we can draw N(S) ∼ Pois(µ(S)) and X1, . . . , X_{N(S)} ∼ µ/µ(S) i.i.d. and construct Π from the Xi's.

• Similarly, we have the reverse property that if N ∼ Pois(γ) and X1, . . . , XN ∼ p(X) i.i.d., then the set Π = {X1, . . . , XN} ∼ PP(γ p(X)dX). (This inverse property will be useful later.)
• Since ∫_S ∫_0^1 µ(dθ) α π^{-1} (1 - π)^{α-1} dπ = ∞, this approach obviously won't work for drawing H ∼ BP(α, µ).

• The method of partitioning [0, 1] and drawing N(S × (a, b]) followed by

    θi^{(a)} ∼ µ/µ(S),      πi^{(a)} ∼ [ α π^{-1} (1 - π)^{α-1} / ∫_a^b α π^{-1} (1 - π)^{α-1} dπ ] 1(a < πi^{(a)} ≤ b)      (4.1)

  is possible (using the Restriction theorem from Lecture 1, independence of PP's on disjoint sets, and the first bullet of this section), but not as useful for Bayesian models.

• The goal is to find size-biased representations for H that are more straightforward (i.e., that involve sampling from standard distributions, which will hopefully make inference easier).
Size-biased representation I (a "restricted beta process")

• Definition: Let α = 1 and µ be a non-atomic measure on S with µ(S) = γ < ∞. Generate the following independent set of random variables:

    Vi ∼ Beta(γ, 1) i.i.d.,      θi ∼ µ/µ(S) i.i.d.,      i = 1, 2, . . .                    (4.2)

  Let H = Σ_{i=1}^∞ ( Π_{j=1}^i Vj ) δ_{θi}. Then H ∼ BP(1, µ).

Proof: The proof uses the limiting case of the following finite approximation.

• Let πi ∼ Beta(γ/K, 1), θi ∼ µ/µ(S) for i = 1, . . . , K. Let H_K = Σ_{i=1}^K πi δ_{θi}. Then lim_{K→∞} H_K ∼ BP(1, µ). The proof is similar to the one from last lecture.
Question 1: As K → ∞, what is π_(1) = max{π1, . . . , πK}?

Answer: Look at the CDFs. We want the function P(π_(1) < V1) for a V1 ∈ [0, 1]. Because the πi are independent,

    P(π_(1) < V1) = lim_{K→∞} P(π1 < V1, . . . , πK < V1) = lim_{K→∞} Π_{i=1}^K P(πi < V1)      (4.3)

  – P(πi < V1) = ∫_0^{V1} (γ/K) πi^{γ/K - 1} dπi = V1^{γ/K}
  – lim_{K→∞} Π_{i=1}^K P(πi < V1) = V1^γ
  – Therefore, π_(1) = V1, with V1 ∼ Beta(γ, 1)

Question 2: What is the second largest, denoted π_(2) = lim_{K→∞} max{π1, . . . , πK} \ {π_(1)}?

Answer: This is a little more complicated, but answering how to get π_(2) shows how to get the remaining π_(i).

    P(π_(2) < t | π_(1) = V1) = Π_{πi ≠ π_(1)} P(πi < t | πi < V1)        ← the condition is each πi < π_(1) = V1
                              = Π_{πi ≠ π_(1)} P(πi < t, πi < V1) / P(πi < V1)
                              = Π_{πi ≠ π_(1)} P(πi < t) / P(πi < V1)     ← since t < V1, the first event contains the second
                              = lim_{K→∞} Π_{πi ≠ π_(1)} [ ∫_0^t (γ/K) πi^{γ/K - 1} dπi / ∫_0^{V1} (γ/K) πi^{γ/K - 1} dπi ]
                              = lim_{K→∞} [ (t/V1)^{γ/K} ]^{K-1} = (t/V1)^γ                      (4.4)

  – So the density p(π_(2) | π_(1) = V1) = V1^{-1} γ (π_(2)/V1)^{γ-1}. π_(2) has support [0, V1].
  – Change of variables: V2 := π_(2)/V1, so π_(2) = V1 V2 and dπ_(2) = V1 dV2.
  – Plugging in, P(V2 | π_(1) = V1) = V1^{-1} γ V2^{γ-1} · V1 (the Jacobian) = γ V2^{γ-1} = Beta(γ, 1)
  – The above calculation has shown two things:
    1. V2 is independent of V1 (this is an instance of a "neutral-to-the-right process")
    2. V2 ∼ Beta(γ, 1)

• Since π_(2) | {π_(1) = V1} = V1 V2 and V1, V2 are independent, we can get the value of π_(2) using the previously drawn V1 and then drawing V2 from a Beta(γ, 1) distribution.

• The same exact reasoning follows for π_(3), π_(4), . . .

• For example, for π_(3), we have P(π_(3) < t | π_(2) = V1 V2, π_(1) = V1) = P(π_(3) < t | π_(2) = V1 V2) because conditioning on π_(2) = V1 V2 restricts π_(3) to also satisfy the condition of π_(1).

• In other words, if we force π_(3) < π_(2) by conditioning, we get the additional requirement π_(3) < π_(1) for free, so we can condition on the π_(i) immediately before.

• Think of V1 V2 as a single non-random (i.e., already known) value by the time we get to π_(3). We can exactly follow the above sequence after making the correct substitutions and re-indexing.
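• Added illustration of construction I: the weights are the running products Π_{j≤i} Vj with Vj ∼ Beta(γ, 1), and a simple check is that E H(S) = µ(S) = γ (the ith weight has expectation (γ/(γ+1))^i, and these sum to γ). The truncation level T below is an arbitrary choice.

```python
# Sketch: size-biased construction I (alpha = 1).  Check E H(S) = mu(S) = gamma.
import numpy as np

rng = np.random.default_rng(9)
gamma_tot, T, trials = 2.0, 200, 20_000

totals = np.empty(trials)
for t in range(trials):
    V = rng.beta(gamma_tot, 1.0, size=T)     # V_i ~ Beta(gamma, 1)
    weights = np.cumprod(V)                  # pi_(i) = V_1 * ... * V_i, decreasing
    totals[t] = weights.sum()                # H(S), truncated at T atoms

print("E H(S) empirical:", totals.mean().round(3), "  exact:", gamma_tot)
# The atom locations theta_i ~ mu/mu(S) would be drawn separately; they do not
# affect the total mass H(S).
```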
Size-biased representation II

Definition: Let α > 0 and µ be a non-atomic measure on S with µ(S) < ∞. Generate the following random variables:

    Ci ∼ Pois( αµ(S)/(α + i - 1) ),   i = 1, 2, . . .
    π_{ij} ∼ Beta(1, α + i - 1),   j = 1, . . . , Ci
    θ_{ij} ∼ µ/µ(S),   j = 1, . . . , Ci                                                     (4.5)

Define H = Σ_{i=1}^∞ Σ_{j=1}^{Ci} π_{ij} δ_{θ_{ij}}. Then H ∼ BP(α, µ).

Proof:

• We can use Poisson processes to prove this. This is a good example of how easy a proof can become when we recognize a hidden Poisson process and calculate its mean measure.

  – Let Hi = Σ_{j=1}^{Ci} π_{ij} δ_{θ_{ij}}. Then the set Πi = {(θ_{ij}, π_{ij})} is a Poisson process because it contains a Poisson-distributed number of i.i.d. random variables.
  – As a result, the mean measure of Πi is

        [ αµ(S)/(α + i - 1) ] × [ (α + i - 1)(1 - π)^{α+i-2} dπ ] × [ µ(dθ)/µ(S) ]            (4.6)
          (Poisson # part)        (distribution on π)                (distribution on θ)

    We can simplify this to αµ(dθ)(1 - π)^{α+i-2} dπ. We can justify this with the marking theorem (π marks θ), or by just thinking about the joint distribution of (θ, π).
  – H = Σ_{i=1}^∞ Hi by definition. Equivalently Π = ∪_{i=1}^∞ Πi.
  – By the superposition theorem, we know that Π is a Poisson process with mean measure equal to the sum of the mean measures of each Πi.
  – We can calculate this directly:

        Σ_{i=1}^∞ αµ(dθ)(1 - π)^{α+i-2} dπ = αµ(dθ)(1 - π)^{α-2} Σ_{i=1}^∞ (1 - π)^i dπ,   with Σ_{i=1}^∞ (1 - π)^i = (1 - π)/π      (4.7)

  – Therefore, we've shown that Π is a Poisson process with mean measure απ^{-1}(1 - π)^{α-1} dπ µ(dθ).
  – In other words, this second size-biased construction is the CRM constructed from integrating a PRM with this mean measure against the function f(θ, π) = π along the π dimension. This is the definition of a beta process.
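• Added illustration of construction II: since E Ci = αµ(S)/(α + i - 1) and E π_{ij} = 1/(α + i), we again have E H(S) = µ(S), which the sketch below checks by simulation (truncating at I rounds, an arbitrary choice that leaves a small bias of roughly αµ(S)/I).

```python
# Sketch: size-biased construction II, truncated at I rounds.  Check E H(S) = mu(S).
import numpy as np

rng = np.random.default_rng(10)
alpha, gamma_tot, I, trials = 3.0, 2.0, 500, 2000

totals = np.zeros(trials)
for i in range(1, I + 1):
    beta_b = alpha + i - 1
    C = rng.poisson(alpha * gamma_tot / beta_b, size=trials)   # atoms in round i, per draw
    m = int(C.max())
    if m > 0:
        draws = rng.beta(1.0, beta_b, size=(trials, m))        # candidate weights
        mask = np.arange(m) < C[:, None]                       # keep only the first C[t] per draw
        totals += (draws * mask).sum(axis=1)

print("E H(S) empirical:", totals.mean().round(3), "  exact:", gamma_tot)
```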
Size-biased representation III

• Definition: Let α > 0 and µ be a non-atomic measure on S with µ(S) < ∞. The following is a constructive definition of H ∼ BP(α, µ):

    Ci ∼ Pois(µ(S)),      V_{ij}^{(ℓ)} ∼ Beta(1, α),      θ_{ij} ∼ µ/µ(S)                    (4.8)

    H = Σ_{i=1}^∞ Σ_{j=1}^{Ci} V_{ij}^{(i)} Π_{ℓ=1}^{i-1} (1 - V_{ij}^{(ℓ)}) δ_{θ_{ij}}       (4.9)

• Like the last construction, the weights are decreasing in expectation as a function of i:

    H = Σ_{j=1}^{C1} V_{1j}^{(1)} δ_{θ_{1j}} + Σ_{j=1}^{C2} V_{2j}^{(2)} (1 - V_{2j}^{(1)}) δ_{θ_{2j}} + Σ_{j=1}^{C3} V_{3j}^{(3)} (1 - V_{3j}^{(2)}) (1 - V_{3j}^{(1)}) δ_{θ_{3j}} + · · ·      (4.10)

• The structure is also very similar. We have a Poisson-distributed number of atoms in each group, and they're marked with an independent random variable.

• Therefore, we can write H = Σ_{i=1}^∞ Hi, where each Hi has a corresponding Poisson process Πi with mean measure µ(S) × (µ(dθ)/µ(S)) × λi(π)dπ, where λi(π) is the distribution of

    π = Vi Π_{j=1}^{i-1} (1 - Vj),      Vj ∼ Beta(1, α) i.i.d.

• By the superposition theorem, H has an underlying Poisson process Π with mean measure equal to the sum of each Hi's mean measures: µ(dθ) Σ_{i=1}^∞ λi(π)dπ.

• Therefore, all that remains is to calculate this sum (which is a little complicated).

Proof:

• We focus on λi(π), which is the distribution of π = f(V1, . . . , Vi), where f(V1, . . . , Vi) = Vi Π_{j=1}^{i-1} (1 - Vj), Vj ∼ Beta(1, α).

• Lemma: Let T ∼ Gam(i - 1, α). Then e^{-T} is equal in distribution to Π_{j=1}^{i-1} (1 - Vj).

  Proof: Define ξj = -ln(1 - Vj). We can show by a change of variables that ξj ∼ Exp(α). The function -ln Π_{j=1}^{i-1} (1 - Vj) = Σ_{j=1}^{i-1} ξj. Since the Vj are independent, the ξj are independent. We know that sums of i.i.d. exponential r.v.'s are gamma distributed, so T = Σ_{j=1}^{i-1} ξj is distributed as Gam(i - 1, α). That is, -ln Π_{j=1}^{i-1} (1 - Vj) is equal in distribution to T ∼ Gam(i - 1, α), and the result follows because the same function of two equally distributed r.v.'s is also equally distributed.

• We split the proof into two cases, i = 1 and i > 1.

• Case i = 1: V_{1j} ∼ Beta(1, α), therefore π_{1j} = V_{1j} ∼ λ1(π)dπ = α(1 - π)^{α-1} dπ.
• Case i > 1: V_{ij} ∼ Beta(1, α), T_{ij} ∼ Gam(i - 1, α), π_{ij} = V_{ij} e^{-T_{ij}}. We need to find the density of π_{ij}. Let W_{ij} = e^{-T_{ij}}. Then changing variables (plug T_{ij} = -ln W_{ij} into the gamma density and multiply by the Jacobian),

    p_{Wi}(w|α) = ( α^{i-1} / (i - 2)! ) w^{α-1} (-ln w)^{i-2}.                              (4.11)

• Therefore π_{ij} = V_{ij} W_{ij}, and using the product distribution formula,

    λi(π|α) = ∫_π^1 w^{-1} p_V(π/w|α) p_{Wi}(w|α) dw
            = ( α^i / (i - 2)! ) ∫_π^1 w^{α-2} (-ln w)^{i-2} (1 - π/w)^{α-1} dw
            = ( α^i / (i - 2)! ) ∫_π^1 w^{-1} (-ln w)^{i-2} (w - π)^{α-1} dw                 (4.12)

• This integral doesn't have a closed form solution. However, recall that we only need to calculate µ(dθ) Σ_{i=1}^∞ λi(π)dπ to find the mean measure of the underlying Poisson process:

    µ(dθ) Σ_{i=1}^∞ λi(π)dπ = µ(dθ)λ1(π)dπ + µ(dθ) Σ_{i=2}^∞ λi(π)dπ                         (4.13)

    µ(dθ) Σ_{i=2}^∞ λi(π)dπ = µ(dθ)dπ Σ_{i=2}^∞ ( α^i / (i - 2)! ) ∫_π^1 w^{-1} (-ln w)^{i-2} (w - π)^{α-1} dw
                            = µ(dθ)dπ α^2 ∫_π^1 w^{-1} (w - π)^{α-1} Σ_{i=2}^∞ (-α ln w)^{i-2}/(i - 2)! dw,   where the sum = e^{-α ln w} = w^{-α}
                            = µ(dθ)dπ α^2 ∫_π^1 w^{-(α+1)} (w - π)^{α-1} dw,   where ∫_π^1 w^{-(α+1)} (w - π)^{α-1} dw = (1 - π)^α / (απ)
                            = µ(dθ) ( α(1 - π)^α / π ) dπ                                     (4.14)

• Adding µ(dθ)λ1(π)dπ from Case 1 to this last value,

    µ(dθ) Σ_{i=1}^∞ λi(π)dπ = µ(dθ) α π^{-1} (1 - π)^{α-1} dπ.                                (4.15)

• Therefore, the construction corresponds to a Poisson process with mean measure equal to that of a beta process. It's therefore a beta process.
Chapter 5
Dirichlet processes and a size-biased construction
• We saw how beta processes can be useful as a Bayesian nonparametric prior for latent factor (matrix factorization) models.

• We'll next discuss BNP priors for mixture models.

Quick review

• 2-dimensional data generated from a Gaussian with unknown mean and known variance.

• There is a small set of possible means, and an observation picks one of them using a probability distribution.

• Let G = Σ_{i=1}^K πi δ_{θi} be the mixture distribution on mean parameters – θi: the ith mean, πi: the probability of it.

• For the nth observation,
  1. cn ∼ Disc(π) picks the mean index
  2. xn ∼ N(θ_{cn}, Σ) generates the observation

Priors on G

• Let µ be a non-atomic probability measure on the parameter space.

• Since π is a K-dimensional probability vector, a natural prior is Dirichlet.
Dirichlet distribution: A distribution on probability vectors

• Definition: Let α1, . . . , αK be K positive numbers. The Dirichlet distribution density function is defined as

    Dir(π|α1, . . . , αK) = [ Γ(Σ_i αi) / Π_i Γ(αi) ] Π_{i=1}^K πi^{αi - 1}                  (5.1)
Goals: The goals are very similar to the beta process.
1. We want K → ∞
2. We want the parameters α1 , . . . , αK to be such that, as K → ∞, things are well-defined.
3. It would be nice to link this to the Poisson process somehow.
Dirichlet random vectors and gamma random variables

• Theorem: Let Zi ∼ Gam(αi, b) for i = 1, . . . , K. Define πi = Zi / Σ_{j=1}^K Zj. Then

    (π1, . . . , πK) ∼ Dir(α1, . . . , αK).                                                  (5.2)

  Furthermore, π and Y = Σ_{j=1}^K Zj are independent random variables.

• Proof: This is just a change of variables.

  – p(Z1, . . . , ZK) = Π_{i=1}^K p(Zi) = Π_{i=1}^K ( b^{αi} / Γ(αi) ) Zi^{αi - 1} e^{-bZi}
  – (Z1, . . . , ZK) := f(Y, π) = ( Yπ1, . . . , YπK-1, Y(1 - Σ_{i=1}^{K-1} πi) )
  – P_{Y,π}(Y, π) = P_Z(f(Y, π)) · |J(f)|      ← J(·) = Jacobian
  – J(f) is the K × K matrix with entries ∂fi/∂πj for j < K and ∂fi/∂Y for j = K: it has Y on the diagonal of the first K - 1 columns, the entries -Y along the last row of those columns, and last column (π1, . . . , πK-1, 1 - Σ_{i=1}^{K-1} πi)^T.
  – And so |J(f)| = Y^{K-1}.
  – Therefore

        P_Z(f(Y, π)) |J(f)| = Π_{i=1}^K ( b^{αi} / Γ(αi) ) (Yπi)^{αi - 1} e^{-bYπi} Y^{K-1},      (πK := 1 - Σ_{i=1}^{K-1} πi)
                            = [ ( b^{Σ_i αi} / Γ(Σ_i αi) ) Y^{Σ_i αi - 1} e^{-bY} ] × [ ( Γ(Σ_i αi) / Π_i Γ(αi) ) Π_{i=1}^K πi^{αi - 1} ]      (5.3)
                                   = Gam(Σ_i αi, b)                                 = Dir(α1, . . . , αK)

• We've shown that:
  1. A Dirichlet distributed probability vector is a normalized sequence of independent gamma random variables with a constant scale parameter.
  2. The sum of these gamma random variables is independent of the normalization, because their joint distribution can be written as a product of two distributions.
  3. This works in reverse: if we want to draw an independent sequence of gamma r.v.'s, we can draw a Dirichlet vector and scale it by an independent gamma random variable with first parameter equal to the sum of the Dirichlet parameters (and second parameter set to whatever we want).
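• Added illustration: the sketch below draws independent Gam(αi, b) variables, normalizes them, and compares the first two moments with those of Dir(α1, . . . , αK); it also checks (empirically) that the normalized vector is uncorrelated with the sum Y. The αi and b values are arbitrary, and NumPy's gamma sampler uses scale = 1/b.

```python
# Sketch: normalized independent gammas give a Dirichlet vector that is
# independent of the sum, as in the theorem above.
import numpy as np

rng = np.random.default_rng(11)
alphas = np.array([0.5, 1.0, 3.0])
b, n = 2.0, 200_000

Z = rng.gamma(alphas, 1.0 / b, size=(n, len(alphas)))   # Z_i ~ Gam(alpha_i, b)
Y = Z.sum(axis=1)
pi = Z / Y[:, None]

a0 = alphas.sum()
print("mean:", pi.mean(axis=0).round(4), "vs", (alphas / a0).round(4))
print("var :", pi.var(axis=0).round(4), "vs",
      (alphas * (a0 - alphas) / (a0**2 * (a0 + 1))).round(4))
print("corr(pi_1, Y):", np.corrcoef(pi[:, 0], Y)[0, 1].round(4), "(should be ~0)")
```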
Dirichlet process

• Definition: Let α > 0 and µ a non-atomic probability measure on S. For all partitions A1, . . . , Ak of S, where Ai ∩ Aj = ∅ for i ≠ j and ∪_{i=1}^k Ai = S, define the random measure G on S such that

    (G(A1), . . . , G(Ak)) ∼ Dir(αµ(A1), . . . , αµ(Ak)).                                    (5.4)

  Then G is a Dirichlet process, denoted G ∼ DP(αµ).

Dirichlet processes via the gamma process

• Pick a partition of S, A1, . . . , Ak. We can represent G ∼ DP(αµ) as the normalization of gamma distributed random variables,

    (G(A1), . . . , G(Ak)) = ( G′(A1)/G′(S), . . . , G′(Ak)/G′(S) ),                          (5.5)

    G′(Ai) ∼ Gam(αµ(Ai), b),      G′(S) = G′(∪_{i=1}^k Ai) = Σ_{i=1}^k G′(Ai)                 (5.6)

• Looking at the definition and how G′ is defined, we realize that G′ ∼ GaP(αµ, b). Therefore, a Dirichlet process is simply a normalized gamma process.

• Note that G′(S) ∼ Gam(α, b). So G′(S) < ∞ with probability one, and the normalization G is well-defined.
Gamma processes and the Poisson process

• Recall that the gamma process is constructed from a Poisson process.

• Gamma process: Let N be a Poisson random measure on S × R+ with mean measure αµ(dθ) z^{-1} e^{-bz} dz. Define G′(Ai) = ∫_{Ai} ∫_0^∞ N(dθ, dz) z. Then G′ ∼ GaP(αµ, b).

• Since the DP is a rescaled GaP, it shares the same properties (from Campbell's theorem).
  – For example, it's an infinite jump process.
  – However, the DP is not a CRM like the GaP, since G(Ai) and G(Aj) are not independent for disjoint sets Ai and Aj. This should be clear since G has to integrate to 1.
Dirichlet process as limit of finite approximation

• This is very similar to the previous discussion on limits of finite approximations to the gamma and beta processes.

• Definition: Let α > 0 and µ a non-atomic probability measure on S. Let

    G_K = Σ_{i=1}^K πi δ_{θi},      π ∼ Dir(α/K, . . . , α/K),      θi ∼ µ i.i.d.            (5.7)

  Then lim_{K→∞} G_K = G ∼ DP(αµ).

• Rough proof: We can equivalently write

    G_K = Σ_{i=1}^K ( Zi / Σ_{j=1}^K Zj ) δ_{θi},      Zi ∼ Gam(α/K, b),      θi ∼ µ          (5.8)

  If G′_K = Σ_{i=1}^K Zi δ_{θi}, we've already proven that G′_K → G′ ∼ GaP(αµ, b). G_K is thus the normalization of G′_K. Since lim_{K→∞} G′_K(S) is finite almost surely, we can take the limit of the numerator and denominator of the gamma representation of G_K separately. The numerator converges to a gamma process and the denominator to its normalization. Therefore, G_K converges to a Dirichlet process.
Some comments

• This infinite limit of the finite approximation results in an infinite vector, but the original definition was of a K-dimensional vector, so is a Dirichlet process infinite or finite dimensional? Actually, the finite vector of the definition is constructed from an infinite process:

    G(Aj) = lim_{K→∞} G_K(Aj) = lim_{K→∞} Σ_{i=1}^K πi δ_{θi}(Aj).                           (5.9)

• Since the partition A1, . . . , Ak of S is of a continuous space we have to be able to let K → ∞, so there has to be an infinite-dimensional process underneath G.

• The Dirichlet process gives us a way of defining priors on infinite discrete probability distributions on this continuous space S.

• As an intuitive example, if S is a space corresponding to the mean of a Gaussian, the Dirichlet process gives us a way to assign a probability to every possible value of this mean.

• Of course, by thinking of the DP in terms of the gamma and Poisson processes, we know that an infinite number of means will have probability zero, an infinite number will also have non-zero probability, but only a small handful of points in the space will have substantial probability. The number and locations of these atoms are random and learned during inference.

• Therefore, as with the beta process, size-biased representations of G ∼ DP(αµ) are needed.
A "stick-breaking" construction of G ∼ DP(αµ)

• Definition: Let α > 0 and µ be a non-atomic probability measure on S. Let

    Vi ∼ Beta(1, α),      θi ∼ µ                                                             (5.10)

  independently for i = 1, 2, . . . Define

    G = Σ_{i=1}^∞ Vi Π_{j=1}^{i-1} (1 - Vj) δ_{θi}.                                          (5.11)

  Then G ∼ DP(αµ).

• Intuitive picture: We start with a unit-length stick and break off proportions:

    G = V1 δ_{θ1} + V2 (1 - V1) δ_{θ2} + V3 (1 - V2)(1 - V1) δ_{θ3} + · · ·

  (1 - V2)(1 - V1) is what's left after the first two breaks. We take proportion V3 of that for θ3 and leave (1 - V3)(1 - V2)(1 - V1).
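• Added illustration: a truncated stick-breaking draw of G ∼ DP(αµ) with µ = Uniform[0, 1] (an arbitrary choice). For A = [0, 0.5) the definition says G(A) ∼ Beta(α/2, α/2), so its mean is 1/2 and its variance is 1/(4(α + 1)), which the sketch checks.

```python
# Sketch: truncated stick-breaking draw of G ~ DP(alpha * Uniform[0,1]).
# For A = [0, 0.5), G(A) ~ Beta(alpha/2, alpha/2).
import numpy as np

rng = np.random.default_rng(12)
alpha, T, trials = 4.0, 500, 20_000        # T = truncation level

G_A = np.empty(trials)
for t in range(trials):
    V = rng.beta(1.0, alpha, size=T)
    w = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))   # stick-breaking weights
    theta = rng.uniform(0.0, 1.0, size=T)                       # atom locations ~ mu
    G_A[t] = w[theta < 0.5].sum()

print("mean:", G_A.mean().round(3), "vs 0.5")
print("var :", G_A.var().round(4), "vs", round(0.25 / (alpha + 1), 4))
```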
Getting back to finite Dirichlets

• Recall from the definition that (G(A1), . . . , G(AK)) ∼ Dir(αµ(A1), . . . , αµ(AK)) for all partitions A1, . . . , AK of S.

• Using the stick-breaking construction, we need to show that the vector formed by

    G(Ak) = Σ_{i=1}^∞ Vi Π_{j=1}^{i-1} (1 - Vj) δ_{θi}(Ak)                                   (5.12)

  for k = 1, . . . , K is distributed as Dir(αµ1, . . . , αµK), where µk = µ(Ak).

• Since P(θi ∈ Ak) = µ(Ak), δ_{θi}(Ak) can be equivalently represented by a K-dimensional vector e_{Yi} = (0, . . . , 1, . . . , 0), with the 1 in position Yi, where Yi ∼ Disc(µ1, . . . , µK), and the rest 0.

• Letting πk = G(Ak), we therefore need to show that if

    π = Σ_{i=1}^∞ Vi Π_{j=1}^{i-1} (1 - Vj) e_{Yi},      Vi ∼ Beta(1, α) i.i.d.,      Yi ∼ Disc(µ1, . . . , µK) i.i.d.,      (5.13)

  then π ∼ Dir(αµ1, . . . , αµK).
• Lemma: Let π ∼ Dir(a1 + b1, . . . , aK + bK). We can equivalently represent this as

    π = V Y + (1 - V) W,      V ∼ Beta(Σ_k ak, Σ_k bk),      Y ∼ Dir(a1, . . . , aK),      W ∼ Dir(b1, . . . , bK)      (5.14)

  Proof: Use the normalized gamma representation: πi = Zi / Σ_j Zj, Zi ∼ Gam(ai + bi, c).

  – We can use the equivalence

        Zi^Y ∼ Gam(ai, c), Zi^W ∼ Gam(bi, c)   ⇔   Zi^Y + Zi^W ∼ Gam(ai + bi, c)             (5.15)

  – Splitting into two random variables this way, we have the following normalized gamma representation for π:

        π = ( (Z1^Y + Z1^W)/Σ_i (Zi^Y + Zi^W), . . . , (ZK^Y + ZK^W)/Σ_i (Zi^Y + Zi^W) )     (5.16)
          = [ Σ_i Zi^Y / Σ_i (Zi^Y + Zi^W) ] · ( Z1^Y/Σ_i Zi^Y, . . . , ZK^Y/Σ_i Zi^Y )
            + [ Σ_i Zi^W / Σ_i (Zi^Y + Zi^W) ] · ( Z1^W/Σ_i Zi^W, . . . , ZK^W/Σ_i Zi^W ),

    where the first bracketed ratio is V ∼ Beta(Σ_i ai, Σ_i bi), the first vector is Y ∼ Dir(a1, . . . , aK), the second bracketed ratio is 1 - V, and the second vector is W ∼ Dir(b1, . . . , bK).

  – From the previous proof about normalized gamma r.v.'s, we know that the sums are independent from the normalized values. So V, Y, and W are all independent.
• Proof of stick-breaking construction:

  – Start with π ∼ Dir(αµ1, . . . , αµK) and write αi := αµi.
  – Step 1:

        Dir(π|αµ1, . . . , αµK) = [ Γ(Σ_i αi) / Π_i Γ(αi) ] Π_{i=1}^K πi^{αi - 1} = ( Σ_{j=1}^K πj ) [ Γ(Σ_i αi) / Π_i Γ(αi) ] Π_{i=1}^K πi^{αi - 1}      (5.17)
                                = Σ_{j=1}^K ( αµj / αµj ) [ Γ(Σ_i αi) / Π_i Γ(αi) ] Π_{i=1}^K πi^{αi + ej(i) - 1}
                                = Σ_{j=1}^K µj [ Γ(1 + Σ_i αi) / ( Γ(1 + αµj) Π_{i≠j} Γ(αi) ) ] Π_{i=1}^K πi^{αi + ej(i) - 1},

    where the density in the last line is Dir(αµ + ej).
  – Therefore, a hierarchical representation of Dir(αµ1, . . . , αµK) is

        Y ∼ Discrete(µ1, . . . , µK),      π ∼ Dir(αµ + eY).                                  (5.18)

  – Step 2: From the lemma we have that π ∼ Dir(αµ + eY) can be expanded into the equivalent hierarchical representation π = V Y′ + (1 - V)π′, where

        V ∼ Beta( Σ_i eY(i), Σ_i αµi ) = Beta(1, α),      Y′ ∼ Dir(eY) = eY with probability 1,      π′ ∼ Dir(αµ1, . . . , αµK)      (5.19)

  – Combining Steps 1 & 2: We will use these steps to recursively break down a Dirichlet distributed random vector an infinite number of times. If

        π = V eY + (1 - V)π′,      V ∼ Beta(1, α),      Y ∼ Disc(µ1, . . . , µK),      π′ ∼ Dir(αµ1, . . . , αµK),                   (5.20)

    then from Steps 1 & 2, π ∼ Dir(αµ1, . . . , αµK).
  – Notice that there are independent Dir(αµ1, . . . , αµK) r.v.'s on both sides. We "broke down" the one on the left; we can continue by "breaking down" the one on the right:

        π = V1 e_{Y1} + (1 - V1)( V2 e_{Y2} + (1 - V2)π″ ),      Vi ∼ Beta(1, α) i.i.d.,      Yi ∼ Disc(µ1, . . . , µK) i.i.d.,      π″ ∼ Dir(αµ1, . . . , αµK).      (5.21)

    π is still distributed as Dir(αµ1, . . . , αµK).
  – Continue this an infinite number of times:

        π = Σ_{i=1}^∞ Vi Π_{j=1}^{i-1} (1 - Vj) e_{Yi},      Vi ∼ Beta(1, α) i.i.d.,      Yi ∼ Disc(µ1, . . . , µK) i.i.d.           (5.22)

    Still, following each time the right-hand Dirichlet is expanded we get a Dir(αµ1, . . . , αµK) random variable. Since lim_{T→∞} Π_{j=1}^T (1 - Vj) = 0, the term pre-multiplying this RHS Dirichlet vector equals zero and the limit above results, which completes the proof.
s
Corollary: If G is drawn from DP(αµ) using the stick-breaking construction and β ∼ Gam(α, b) independently, then βG ∼ GaP(αµ, b). Writing this out,
\[ \beta G = \sum_{i=1}^{\infty} \beta\, V_i \prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i}. \tag{5.23} \]
s
We therefore get a method for drawing a gamma process almost for free. Notice that α appears in both the DP and the gamma distribution on β. These parameters must be the same value for βG to be a gamma process.
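A hedged sketch of the corollary: reusing the truncated stick-breaking draw above and multiplying by an independent β ∼ Gam(α, b) gives a (truncated) draw of GaP(αµ, b). The truncation level and the uniform base measure are illustrative choices.

```python
import numpy as np

def gamma_process_via_dp(alpha, b, base_sampler, T=2000, rng=None):
    """Truncated draw of G' ~ GaP(alpha*mu, b) as beta * G,
    with beta ~ Gam(alpha, b) (rate b) and G ~ DP(alpha*mu) via stick-breaking."""
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha, size=T)
    weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    atoms = base_sampler(T, rng)
    beta = rng.gamma(shape=alpha, scale=1.0 / b)   # numpy uses scale = 1/rate
    return atoms, beta * weights

if __name__ == "__main__":
    atoms, w = gamma_process_via_dp(3.0, 1.0, lambda n, r: r.uniform(size=n), rng=1)
    print("G'(S) ~ Gam(alpha, b); this draw has total mass:", w.sum())
```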
Chapter 6
Dirichlet process extensions, count processes
Gamma process to Dirichlet process
s
Gamma process: Let α > 0 and µ a non-atomic probability measure on S. Let N(dθ, dw) be a Poisson random measure on S × R_+ with mean measure αµ(dθ)\,w^{-1}e^{-cw}dw, c > 0. For A ⊂ S, let G′(A) = ∫_A ∫_0^∞ w\,N(dθ, dw). Then G′ is a gamma process, G′ ∼ GaP(αµ, c), and G′(A) ∼ Gam(αµ(A), c).
s
Normalizing a gamma process: Let's take G′ and normalize it. That is, define G(dθ) = G′(dθ)/G′(S). (G′(S) ∼ Gam(α, c), so it's finite w.p. 1.) Then G is called a Dirichlet process, written G ∼ DP(αµ).
Why? Take S and partition it into K disjoint regions (A_1, . . . , A_K), A_i ∩ A_j = ∅ for i ≠ j, ∪_i A_i = S. Construct the vector
\[ (G(A_1), \dots, G(A_K)) = \Big(\frac{G'(A_1)}{G'(S)}, \dots, \frac{G'(A_K)}{G'(S)}\Big). \tag{6.1} \]
Since each G′(A_i) ∼ Gam(αµ(A_i), c) and G′(S) = Σ_{i=1}^K G′(A_i), it follows that
\[ (G(A_1), \dots, G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1), \dots, \alpha\mu(A_K)). \tag{6.2} \]
This is the definition of a Dirichlet process.
s
The Dirichlet process has many extensions to suit the structure of different problems.
s
We’ll look at four, two that are related to the underlying normalized gamma process, and two
from the perspective of the stick-breaking construction.
s
The purpose is to illustrate how the basic framework of Dirichlet process mixture modeling can
be easily built into more complicated models that address problems not perfectly suited to the
basic construction.
s
The goal is to make it clear how to continue these lines of thinking to form new models.
Example 1: Spatially and temporally normalized gamma processes
s
Imagine we want a temporally evolving Dirichlet process. Clusters (i.e., atoms θ) may arise and die out at different times (or exist in geographical regions).
Time-evolving model: Let N(dθ, dw, dt) be a Poisson random measure on S × R_+ × R with mean measure αµ(dθ)\,w^{-1}e^{-cw}dw\,dt. Let G′(dθ, dt) = ∫_0^∞ w\,N(dθ, dw, dt). Then G′ is a gamma process with an added "time" dimension t.
s
(There's nothing new from what we've studied: let θ* = (θ, t) and αµ(dθ*) = αµ(dθ)dt.)
s
For each atom (θ, t) with G′(dθ, dt) > 0, add a marking y_t(θ) ∼ Exp(λ), independently.
s
We can think of y_t(θ) as the lifetime of parameter θ born at time t.
s
By the marking theorem, N*(dθ, dw, dt, dy) ∼ Pois(αµ(dθ)\,w^{-1}e^{-cw}dw\,dt\,λe^{-λy}dy).
s
At time t_0, construct the Dirichlet process G_{t_0} by normalizing over all atoms "alive" at time t_0. (Therefore, ignore atoms already dead or yet to be born.)
Spatial model: Instead of giving each atom θ a time-stamp and "lifetime," we might want to give it a location and "region of influence."
s
Replace t ∈ R with x ∈ R² (e.g., latitude-longitude). Replace dt with dx.
s
Instead of y_t(θ) ∼ Exp(λ) = lifetime, let y_x(θ) ∼ Exp(λ) = radius of a ball at x.
s
G_{x_0} is the DP at location x_0. It is formed by normalizing over all atoms θ for which x_0 ∈ ball of radius y_x(θ) centered at x.
Example 2: Another time-evolving formulation
s
We can think of other formulations. Here's one where time is discrete. (We will build up to this with the following two properties.)
s
Even though the DP doesn't have an underlying PRM, the fact that it's constructed from a PRM means we can still benefit from its properties.
Superposition and the Dirichlet process: Let G′_1 ∼ GaP(α_1µ_1, c) and G′_2 ∼ GaP(α_2µ_2, c). Then G′_{1+2} = G′_1 + G′_2 ∼ GaP(α_1µ_1 + α_2µ_2, c). Therefore,
\[ G_{1+2} = \frac{G'_{1+2}}{G'_{1+2}(S)} \sim \mathrm{DP}(\alpha_1\mu_1 + \alpha_2\mu_2). \tag{6.3} \]
We can equivalently write
\[ G_{1+2} = \underbrace{\frac{G'_1(S)}{G'_{1+2}(S)}}_{\mathrm{Beta}(\alpha_1,\alpha_2)} \times \underbrace{\frac{G'_1}{G'_1(S)}}_{\mathrm{DP}(\alpha_1\mu_1)} + \frac{G'_2(S)}{G'_{1+2}(S)} \times \underbrace{\frac{G'_2}{G'_2(S)}}_{\mathrm{DP}(\alpha_2\mu_2)} \sim \mathrm{DP}(\alpha_1\mu_1 + \alpha_2\mu_2). \tag{6.4} \]
From the lemma last week, these two DP's and the beta r.v. are all independent.
s
Therefore,
\[ G = \pi G_1 + (1-\pi)G_2, \qquad \pi \sim \mathrm{Beta}(\alpha_1,\alpha_2), \quad G_1 \sim \mathrm{DP}(\alpha_1\mu_1), \quad G_2 \sim \mathrm{DP}(\alpha_2\mu_2) \tag{6.5} \]
is equal in distribution to G ∼ DP(α_1µ_1 + α_2µ_2).
Thinning of gamma processes (a special case of the marking theorem)
s
We know that we can construct G′ ∼ GaP(αµ, c) from the Poisson random measure N(dθ, dw) ∼ Pois(αµ(dθ)\,w^{-1}e^{-cw}dw). Mark each point (θ, w) in N with a binary variable z ∼ Bern(p). Then
\[ N(d\theta, dw, z) \sim \mathrm{Pois}\big(p^z(1-p)^{1-z}\,\alpha\mu(d\theta)\,w^{-1}e^{-cw}\,dw\big). \tag{6.6} \]
s
If we view z = 1 as "survival" and z = 0 as "death," then if we only care about the atoms that survive, we have
\[ N_1(d\theta, dw) = N(d\theta, dw, z=1) \sim \mathrm{Pois}\big(p\,\alpha\mu(d\theta)\,w^{-1}e^{-cw}\,dw\big). \tag{6.7} \]
s
This is called "thinning." We see that p ∈ (0, 1) down-weights the mean measure, so we only expect to see a fraction p of what we saw before.
s
Still, a normalized thinned gamma process is a Dirichlet process:
\[ \dot G' \sim \mathrm{GaP}(p\alpha\mu, c), \qquad \dot G = \frac{\dot G'}{\dot G'(S)} \sim \mathrm{DP}(p\alpha\mu). \tag{6.8} \]
s
What happens if we thin twice? We're marking with z ∈ {0, 1}² and restricting to z = [1, 1]:
\[ \ddot G' \sim \mathrm{GaP}(p^2\alpha\mu, c) \quad\rightarrow\quad \ddot G \sim \mathrm{DP}(p^2\alpha\mu). \tag{6.9} \]
s
Back to the example, we again want a time-evolving Dirichlet process where new atoms are
born and old atoms die out.
s
We can easily achieve this by introducing new gamma processes and thinning old ones.
A dynamic Dirichlet process:
At time t:
1. Draw G*_t ∼ GaP(α_tµ_t, c).
2. Construct G′_t = G*_t + Ġ′_{t-1}, where Ġ′_{t-1} is the gamma process at time t − 1 thinned with parameter p.
3. Normalize G_t = G′_t / G′_t(S).
s
Why is G_t still a Dirichlet process? Just look at G′_t:
– Let G′_{t-1} ∼ GaP(α̂_{t-1}µ̂_{t-1}, c).
– Then Ġ′_{t-1} ∼ GaP(pα̂_{t-1}µ̂_{t-1}, c) and G′_t ∼ GaP(α_tµ_t + pα̂_{t-1}µ̂_{t-1}, c).
– So G_t ∼ DP(α_tµ_t + pα̂_{t-1}µ̂_{t-1}).
s
By induction,
\[ G_t \sim \mathrm{DP}\big(\alpha_t\mu_t + p\,\alpha_{t-1}\mu_{t-1} + p^2\alpha_{t-2}\mu_{t-2} + \cdots + p^{t-1}\alpha_1\mu_1\big). \tag{6.10} \]
s
If we consider the special case where α_tµ_t = αµ for all t, we can simplify this Dirichlet process to
\[ G_t \sim \mathrm{DP}\Big(\frac{1-p^t}{1-p}\,\alpha\mu\Big). \tag{6.11} \]
In the limit t → ∞, this has the steady state
\[ G_\infty \sim \mathrm{DP}\Big(\frac{1}{1-p}\,\alpha\mu\Big). \tag{6.12} \]
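To make the recursion concrete, here is a small simulation sketch of my own (assumptions: truncated gamma-process draws via the β·G trick from Chapter 5, a fixed αµ with a standard normal µ at every step, and Bernoulli(p) thinning of the surviving atoms).

```python
import numpy as np

def truncated_gap(alpha, c, T, rng):
    """Truncated GaP(alpha*mu, c) draw as beta * (stick-breaking DP); mu = N(0,1)."""
    V = rng.beta(1.0, alpha, size=T)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    beta = rng.gamma(alpha, 1.0 / c)
    return rng.standard_normal(T), beta * w

def dynamic_dp(alpha, c, p, steps=20, T=500, rng=0):
    """At each step: add a fresh gamma process, thin the old atoms w.p. p, normalize."""
    rng = np.random.default_rng(rng)
    atoms, weights = np.array([]), np.array([])
    for t in range(steps):
        new_atoms, new_w = truncated_gap(alpha, c, T, rng)
        keep = rng.random(weights.size) < p               # Bernoulli(p) thinning
        atoms = np.concatenate([new_atoms, atoms[keep]])
        weights = np.concatenate([new_w, weights[keep]])
        G_t = weights / weights.sum()                     # normalized => a DP draw
    return atoms, G_t

if __name__ == "__main__":
    atoms, G = dynamic_dp(alpha=2.0, c=1.0, p=0.5)
    print("steady-state concentration is roughly alpha/(1-p) =", 2.0 / 0.5)
```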
Stick-breaking construction (review for the next process)
We saw that if α > 0 and µ is any probability measure (atomic, non-atomic or mixed), then we can draw G ∼ DP(αµ) as follows:
\[ V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad \theta_i \overset{iid}{\sim} \mu, \qquad G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i}. \tag{6.13} \]
s
It’s often the case where we have grouped data. For example, groups of documents where each
document is a set of words.
s
We might want to model each group (indexed by d) as a mixture G_d. Then, for observation n in group d, θ_n^{(d)} ∼ G_d and x_n^{(d)} ∼ p(x|θ_n^{(d)}).
s
We might think that each group shares the same set of highly probable atoms, but has different
distributions on them.
s
The result is called a mixed-membership model.
Mixed-membership models and the hierarchical Dirichlet process (HDP)
s
As the stick-breaking construction makes clear, when µ is non-atomic, simply drawing each G_d ∼iid DP(αµ) won't work because it places all probability mass on disjoint sets of atoms.
s
The HDP fixes this by "discretizing the base distribution":
\[ G_d \,|\, G_0 \overset{iid}{\sim} \mathrm{DP}(\beta G_0), \qquad G_0 \sim \mathrm{DP}(\alpha\mu). \tag{6.14} \]
s
Since G_0 is discrete, each G_d has probability on the same subset of atoms. This is very obvious when writing the process with the stick-breaking construction:
\[ G_d = \sum_{i=1}^{\infty} \pi_i^{(d)}\,\delta_{\theta_i}, \qquad (\pi_1^{(d)}, \pi_2^{(d)}, \dots) \sim \mathrm{Dir}(\beta p_1, \beta p_2, \dots), \tag{6.15} \]
\[ p_i = V_i \prod_{j=1}^{i-1}(1-V_j), \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad \theta_i \overset{iid}{\sim} \mu. \]
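A minimal sketch of this two-level construction (my own illustration, not from the notes): draw truncated top-level stick weights p_i, then give each group its own weights from Dir(β·p) over the shared atoms. The truncation K, the concentrations, and the Gaussian base measure are assumptions for the example.

```python
import numpy as np

def hdp_truncated(alpha, beta, D, K=50, rng=0):
    """Truncated HDP: shared atoms theta_1..theta_K with top-level weights p
    from stick-breaking(alpha); group d gets pi^(d) ~ Dir(beta * p)."""
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha, size=K)
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    p = p / p.sum()                                   # renormalize the truncation
    theta = rng.standard_normal(K)                    # shared atoms theta_i ~ mu
    group_weights = rng.dirichlet(beta * p, size=D)   # one weight vector per group
    return theta, p, group_weights

if __name__ == "__main__":
    theta, p, pis = hdp_truncated(alpha=3.0, beta=10.0, D=5)
    print("all groups put mass on the same atoms; group 0's top atom:",
          theta[np.argmax(pis[0])])
```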
s
Nested Dirichlet processes
The stick-breaking construction is totally general: µ can be any distribution. What if µ → DP(αµ)? That is, we define the base distribution to be a Dirichlet process:
\[ G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1-V_j)\,\delta_{G_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad G_i \overset{iid}{\sim} \mathrm{DP}(\alpha\mu). \tag{6.16} \]
(We write G_i to link to the DP, but we could have written θ_i ∼iid DP(αµ), since that's what we've been using.)
s
We now have a mixture model of mixture models. For example:
1. A group selects G^{(d)} ∼ G (it picks mixture G_i with probability V_i ∏_{j<i}(1 − V_j)).
2. It generates all of its data using this mixture: for the nth observation in group d, θ_n^{(d)} ∼ G^{(d)}, X_n^{(d)} ∼ p(X|θ_n^{(d)}).
s
In this case we have all-or-nothing sharing. Two groups either share the atoms and the distribution on them, or they share nothing.
Nested Dirichlet process trees
s
We can nest this further. Why not let µ in the nDP be a Dirichlet process also? Then we would
have a three level tree.
s
We can then pick paths down this tree to a leaf node where we get an atom.
Count Processes
s
We briefly introduce count processes. With the Dirichlet process, we often have the generative structure
\[ G \sim \mathrm{DP}(\alpha\mu), \qquad \theta_j^* \,|\, G \overset{iid}{\sim} G, \qquad G = \sum_{i=1}^{\infty}\pi_i\delta_{\theta_i}, \qquad j = 1,\dots,N. \tag{6.17} \]
s
What can we say about the count process n(θ) = Σ_{j=1}^{N} 1(θ_j^* = θ)?
s
Recall the following equivalent processes:
\[ G' \sim \mathrm{GaP}(\alpha\mu, c) \tag{6.18} \]
\[ n(\theta)\,|\,G' \sim \mathrm{Pois}(G'(\theta)) \tag{6.19} \]
and
\[ G' \sim \mathrm{GaP}(\alpha\mu, c) \tag{6.20} \]
\[ n(S) \sim \mathrm{Pois}(G'(S)) \tag{6.21} \]
\[ \theta^*_{1:n(S)} \overset{iid}{\sim} G'/G'(S). \tag{6.22} \]
We can therefore analyze this using the underlying marked Poisson process. However, notice that we have to let the data size be random and Poisson distributed.
Marking theorem: Let G′ ∼ GaP(αµ, c) and mark each (θ, w) for which G′(θ) = w > 0 with the random variable n | w ∼ Pois(w). Then (θ, w, n) is a marked Poisson process with mean measure αµ(dθ)\,w^{-1}e^{-cw}dw\,\frac{w^n}{n!}e^{-w}.
s
We can restrict this to n by integrating over θ and w.
Theorem: The number of atoms having k counts is
\[ \#_k \sim \mathrm{Pois}\Big( \int_S\int_0^\infty \alpha\mu(d\theta)\,w^{-1}e^{-cw}\,\frac{w^k}{k!}e^{-w}\,dw \Big) = \mathrm{Pois}\Big( \frac{\alpha}{k}\Big(\frac{1}{1+c}\Big)^k \Big). \tag{6.23} \]
Theorem: The total number of uniquely observed atoms is also Poisson:
\[ \#_{\mathrm{unique}} = \sum_{k=1}^{\infty}\#_k \sim \mathrm{Pois}\Big( \sum_{k=1}^{\infty}\frac{\alpha}{k}\Big(\frac{1}{1+c}\Big)^k \Big) = \mathrm{Pois}\big(\alpha\ln(1+\tfrac{1}{c})\big). \tag{6.24} \]
Theorem: The total number of counts is n(S) | G′ ∼ Pois(G′(S)), G′ ∼ GaP(αµ, c). So E[n(S)] = α/c ( = Σ_{k=1}^∞ k E[#_k] ).
Final statement: Let c = α/N. If we expect a dataset of size N to be drawn from G ∼ DP(αµ), we expect that dataset to use α ln(α + N) − α ln α unique atoms from G.
A quick final count process
s
Instead of gamma process → Poisson counts, we could have beta process → negative binomial counts.
s
Let H = Σ_{i=1}^∞ π_i δ_{θ_i} ∼ BP(α, µ).
s
Let n(θ) ∼ NegBin(r, H(θ)), where the negative binomial random variable counts how many "successes" there are, with P(success) = H(θ), until there are r "failures" with P(failure) = 1 − H(θ).
s
This is another count process that can be analyzed using the underlying Poisson process.
Chapter 7
Exchangeability, Dirichlet processes and the Chinese restaurant process
DP's, finite approximations and mixture models
s
DP: We saw how, if α > 0 and µ is a probability measure on S, then for every finite partition (A_1, . . . , A_K) of S the random measure G with
\[ (G(A_1), \dots, G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1), \dots, \alpha\mu(A_K)) \]
defines a Dirichlet process.
s
Finite approximation: We also saw how we can approximate G ∼ DP(αµ) with a finite Dirichlet distribution,
\[ G_K = \sum_{i=1}^{K}\pi_i\delta_{\theta_i}, \qquad \pi \sim \mathrm{Dir}\Big(\frac{\alpha}{K},\dots,\frac{\alpha}{K}\Big), \qquad \theta_i \overset{iid}{\sim} \mu. \]
s
Mixture models: Finally, the most common setting for these priors is in mixture models, where we have the added layers
\[ \theta_j^* \,|\, G \sim G, \qquad X_j \,|\, \theta_j^* \sim p(X|\theta_j^*), \qquad j = 1,\dots,n. \qquad \big(\Pr(\theta_j^* = \theta_i \,|\, G) = \pi_i\big) \]
s
The values of θ1∗ , . . . , θn∗ induce a clustering of the data.
s
If θ_j^* = θ_{j'}^* for j ≠ j′, then X_j and X_{j'} are "clustered" together since they come from the same distribution.
s
We’ve thus far focused on G. We now focus on the clustering of X1 , . . . , Xn induced by G.
Polya’s Urn model (finite Dirichlet)
s
To simplify things, we work in the finite setting and replace the parameter θi with its index i.
We let the indicator variables c1 , . . . , cn represent θ1∗ , . . . , θn∗ such that θj∗ = θcj .
s
Polya's Urn is the following process for generating c_1, . . . , c_n:
1. For the first indicator, c_1 ∼ Σ_{i=1}^{K} \frac{1}{K}\delta_i.
2. For the nth indicator, c_n | c_1, . . . , c_{n-1} ∼ Σ_{j=1}^{n-1} \frac{1}{\alpha+n-1}\delta_{c_j} + \frac{\alpha}{\alpha+n-1}\sum_{i=1}^{K}\frac{1}{K}\delta_i.
s
In words, we start with an urn having α/K balls of color i for each of K colors. We randomly pick a ball, put it back in the urn, and put another ball of the same color in the urn.
s
Another way to write #2 above is to define n_i^{(n-1)} = Σ_{j=1}^{n-1} 1(c_j = i). Then
\[ c_n \,|\, c_1,\dots,c_{n-1} \sim \sum_{i=1}^{K}\frac{\frac{\alpha}{K}+n_i^{(n-1)}}{\alpha+n-1}\,\delta_i. \]
s
To put it most simply, we're just sampling the next color from the empirical distribution of the urn at step n.
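A small simulation sketch of this urn (illustrative; K, α, and n are arbitrary choices): it draws c_1, . . . , c_n from the predictive rule above and can be used to check that the empirical color proportions settle down as n grows.

```python
import numpy as np

def polya_urn(alpha, K, n, rng=0):
    """Generate c_1..c_n from the finite-Dirichlet Polya urn:
    start with alpha/K 'balls' of each of K colors, sample proportionally,
    and add one ball of the sampled color each step."""
    rng = np.random.default_rng(rng)
    counts = np.full(K, alpha / K)         # initial (fractional) ball counts
    c = np.empty(n, dtype=int)
    for j in range(n):
        probs = counts / counts.sum()      # (alpha/K + n_i)/(alpha + j)
        c[j] = rng.choice(K, p=probs)
        counts[c[j]] += 1.0
    return c

if __name__ == "__main__":
    c = polya_urn(alpha=2.0, K=5, n=5000)
    print("empirical proportions (one draw, roughly Dir(alpha/K,...,alpha/K)):",
          np.round(np.bincount(c, minlength=5) / len(c), 3))
```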
s
What can we say about p(c_1 = i_1, . . . , c_n = i_n) (written p(c_1, . . . , c_n)) under this prior?
1. By the chain rule of probability, p(c_1, . . . , c_n) = ∏_{j=1}^{n} p(c_j | c_1, . . . , c_{j-1}).
2. p(c_j = i | c_1, . . . , c_{j-1}) = \frac{\frac{\alpha}{K} + n_i^{(j-1)}}{\alpha+j-1}.
3. Therefore,
\[ p(c_{1:n}) = p(c_1)p(c_2|c_1)p(c_3|c_1,c_2)\cdots = \prod_{j=1}^{n}\frac{\frac{\alpha}{K}+n_{c_j}^{(j-1)}}{\alpha+j-1}. \tag{7.1} \]
s
A few things to notice about p(c_1, . . . , c_n):
1. The denominator is simply ∏_{j=1}^{n}(α + j − 1).
2. n_{c_j}^{(j-1)} increments by one. That is, after c_{1:n} we have the counts (n_1^{(n)}, . . . , n_K^{(n)}). For each n_i^{(n)}, the numerator will contain ∏_{s=1}^{n_i^{(n)}}(\frac{\alpha}{K}+s-1).
3. Therefore,
\[ p(c_1,\dots,c_n) = \frac{\prod_{i=1}^{K}\prod_{s=1}^{n_i^{(n)}}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^{n}(\alpha+j-1)}. \]
s
Key: The key thing to notice is that this does not depend on the order of c_1, . . . , c_n. That is, if we permute c_1, . . . , c_n such that c_j = i_{ρ(j)}, where ρ(·) is a permutation of (1, . . . , n), then
\[ p(c_1=i_1,\dots,c_n=i_n) = p(c_1=i_{\rho(1)},\dots,c_n=i_{\rho(n)}). \tag{7.2} \]
s
The sequence c_1, . . . , c_n is said to be "exchangeable" in this case.
Exchangeability and independent and identically distributed (iid) sequences
s
Independent sequences are exchangeable:
\[ p(c_1,\dots,c_n) = \prod_{i=1}^{n}p(c_i) = \prod_{i=1}^{n}p(c_{\rho(i)}) = p(c_{\rho(1)},\dots,c_{\rho(n)}). \tag{7.3} \]
s
Exchangeable sequences aren't necessarily independent (exchangeability is "weaker"). Think of the urn: c_j is clearly not independent of c_1, . . . , c_{j-1}.
Exchangeability and de Finetti's theorem
s
de Finetti's theorem: A sequence is exchangeable if and only if there is a parameter π with distribution p(π) for which the sequence is independent and identically distributed given π.
s
In other words, for our problem there is a probability vector π such that p(c_{1:n}|π) = ∏_{j=1}^{n} p(c_j|π).
s
The problem is to find p(π).
\[
p(c_1,\dots,c_n) = \int p(c_1,\dots,c_n|\pi)\,p(\pi)\,d\pi
= \int\prod_{j=1}^{n}p(c_j|\pi)\,p(\pi)\,d\pi
= \int\prod_{j=1}^{n}\pi_{c_j}\,p(\pi)\,d\pi
= \int\prod_{i=1}^{K}\pi_i^{n_i^{(n)}}\,p(\pi)\,d\pi,
\]
and therefore
\[ \frac{\prod_{i=1}^{K}\prod_{s=1}^{n_i^{(n)}}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^{n}(\alpha+j-1)} = \mathbb{E}_{p(\pi)}\Big[\prod_{i=1}^{K}\pi_i^{n_i^{(n)}}\Big]. \tag{7.4} \]
s
Above, the first equality is always true. The second one is by de Finetti's theorem, since c_1, . . . , c_n is exchangeable. (We won't prove this theorem, we'll just use it.) In the last equality, the left-hand side was previously shown and the right-hand side is what the second-to-last line is equivalently written as.
s
By de Finetti and the exchangeability of c_1, . . . , c_n, we therefore arrive at an expression for the moments of π according to the still-unknown distribution p(π).
s
Because the moments of a distribution are unique to that distribution (like the Laplace transform), p(π) has to be Dir(α/K, . . . , α/K), since plugging this in for p(π) we get
\[
\mathbb{E}_{p(\pi)}\Big[\prod_{i=1}^{K}\pi_i^{n_i}\Big]
= \int\prod_{i=1}^{K}\pi_i^{n_i}\,\frac{\Gamma(\alpha)}{\Gamma(\frac{\alpha}{K})^K}\prod_{i=1}^{K}\pi_i^{\frac{\alpha}{K}-1}\,d\pi
= \frac{\Gamma(\alpha)}{\Gamma(\frac{\alpha}{K})^K}\,\frac{\prod_{i=1}^{K}\Gamma(\frac{\alpha}{K}+n_i)}{\Gamma(\alpha+n)}\underbrace{\int\frac{\Gamma(\alpha+n)}{\prod_{i=1}^{K}\Gamma(\frac{\alpha}{K}+n_i)}\prod_{i=1}^{K}\pi_i^{n_i+\frac{\alpha}{K}-1}\,d\pi}_{=\,1\ \text{(integral of }\mathrm{Dir}(\frac{\alpha}{K}+n_1,\dots,\frac{\alpha}{K}+n_K)\text{)}}
\]
\[
= \frac{\Gamma(\alpha)\prod_{i=1}^{K}\Gamma(\frac{\alpha}{K})\prod_{s=1}^{n_i}(\frac{\alpha}{K}+s-1)}{\Gamma(\frac{\alpha}{K})^K\,\Gamma(\alpha)\prod_{j=1}^{n}(\alpha+j-1)}
= \frac{\prod_{i=1}^{K}\prod_{s=1}^{n_i^{(n)}}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^{n}(\alpha+j-1)}. \tag{7.5}
\]
s
This holds for all n and (n_1, . . . , n_K). Since a distribution is defined by its moments, the result follows.
s
Notice that we didn't need de Finetti here, since we could just hypothesize the existence of a π for which p(c_{1:n}|π) = ∏_i p(c_i|π). De Finetti is more useful when the distribution is more "non-standard," or to prove that such a π doesn't exist.
s
Final statement: As n → ∞, the distribution
\[ \sum_{i=1}^{K}\frac{n_i^{(n)}+\frac{\alpha}{K}}{\alpha+n}\,\delta_i \;\longrightarrow\; \sum_{i=1}^{K}\pi_i^*\,\delta_i. \]
– This is because there exists a π for which c_1, . . . , c_n are iid, and so by the law of large numbers the point π* exists and π* = π.
– Since π ∼ Dir(α/K, . . . , α/K), it follows that the empirical distribution converges to a random vector that is distributed as Dir(α/K, . . . , α/K).
The infinite limit (Chinese restaurant process)
s
Let's go back to the original notation:
\[ \theta_j^*\,|\,G_K \sim G_K, \qquad G_K = \sum_{i=1}^{K}\pi_i\delta_{\theta_i}, \qquad \pi \sim \mathrm{Dir}\Big(\frac{\alpha}{K},\dots,\frac{\alpha}{K}\Big), \qquad \theta_i \overset{iid}{\sim} \mu. \]
s
Following the exact same ideas (only changing notation), the urn process is
\[ \theta_n^*\,|\,\theta_1^*,\dots,\theta_{n-1}^* \sim \sum_{i=1}^{K}\frac{\frac{\alpha}{K}+n_i^{(n-1)}}{\alpha+n-1}\,\delta_{\theta_i}, \qquad \theta_i \overset{iid}{\sim} \mu. \]
s
We've proven that lim_{K→∞} G_K = G ∼ DP(αµ). We now take the limit of the corresponding urn process.
s
(n−1)
+
Re-indexing: At observation n, re-index the atoms so that nj
> 0 for j = 1, . . . , Kn−1
and
(n−1)
+
+
∗
nj
= 0 for j > Kn−1 . (Kn−1 = # unique values in θ1:n−1 ) Then
∗
θn∗ |θ1∗ , . . . , θn−1
∼
+
Kn−1
X α
K
K
X
+
i=1+Kn−1
1
δθ .
K i
(7.6)
+
Obviously for n K, Kn−1
= K very probably, and just the left term remains. However,
we’re interested in K → ∞ before we let n grow. In this case
1.
2.
α
K
(n−1)
+ ni
α+n−1
K
X
1
+
i=1+Kn−1
s
(n−1)
+ ni
α
δθ i +
α+n−1
α+n−1
i=1
s
49
K
(n−1)
ni
−→
α+n−1
δθi −→ µ.
For #2, if you sample K times from a distribution and create a uniform measure on those
+
samples, then in the infinite limit you get the original distribution back. Removing Kn−1
<∞
of those atoms doesn’t change this (we won’t prove this).
The Chinese restaurant process
s
Let α > 0 and µ a probability measure on S. Sample the sequence θ_1^*, . . . , θ_n^*, θ* ∈ S, as follows:
1. Set θ_1^* ∼ µ.
2. Sample θ_n^* | θ_1^*, . . . , θ_{n-1}^* ∼ Σ_{j=1}^{n-1}\frac{1}{\alpha+n-1}\delta_{\theta_j^*} + \frac{\alpha}{\alpha+n-1}\mu.
Then the sequence θ_1^*, . . . , θ_n^* is a Chinese restaurant process.
s
Equivalently, define n_i^{(n-1)} = Σ_{j=1}^{n-1} 1(θ_j^* = θ_i). Then
\[ \theta_n^*\,|\,\theta_1^*,\dots,\theta_{n-1}^* \sim \sum_{i=1}^{K^+_{n-1}}\frac{n_i^{(n-1)}}{\alpha+n-1}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+n-1}\,\mu. \]
s
As n → ∞, \(\frac{\alpha}{\alpha+n-1}\to 0\) and
\[ \sum_{i=1}^{K^+_{n-1}}\frac{n_i^{(n-1)}}{\alpha+n-1}\,\delta_{\theta_i} \;\longrightarrow\; G = \sum_{i=1}^{\infty}\pi_i\delta_{\theta_i} \sim \mathrm{DP}(\alpha\mu). \tag{7.7} \]
s
Notice with the limits that K first went to infinity and then n went to infinity. The resulting empirical distribution is a Dirichlet process because for finite K the de Finetti mixing measure is Dirichlet, and the infinite limit of this finite Dirichlet is a Dirichlet process (as we've shown).
Chinese restaurant analogy
s
An infinite sequence of tables each have a dish (parameter) placed on it that is drawn iid from
µ. The nth customer sits at an occupied table with probability proportional to the number of
customers seated there, or selects the first unoccupied table with probability proportional to α.
s
“Dishes” are parameters for distributions, a “customer” is a data point that uses its dish to
create its value.
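Here is a minimal CRP sampler (an illustration of mine, with the assumption of a standard normal base measure µ); it returns table assignments and dishes, and the number of occupied tables can be compared with the E[K_n^+] ≈ α ln n growth discussed next.

```python
import numpy as np

def crp(alpha, n, base_sampler=None, rng=0):
    """Sample theta*_1..theta*_n from a Chinese restaurant process with
    concentration alpha and base measure given by base_sampler."""
    rng = np.random.default_rng(rng)
    if base_sampler is None:
        base_sampler = lambda r: r.standard_normal()
    tables, counts = [], []            # dish at each table, customers per table
    assign = np.empty(n, dtype=int)
    for j in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= alpha + j             # existing tables prop. to counts, new prop. to alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):           # open a new table with a fresh dish from mu
            tables.append(base_sampler(rng))
            counts.append(0)
        counts[k] += 1
        assign[j] = k
    return assign, np.array(tables)

if __name__ == "__main__":
    a, dishes = crp(alpha=3.0, n=2000)
    print("occupied tables:", len(dishes), " vs alpha*ln(n) =", round(3.0 * np.log(2000), 1))
```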
Some properties of the CRP
s
Cluster growth: What does the number of unique clusters K_n^+ look like as a function of n?
\[ K_n^+ = \sum_{j=1}^{n} 1(\theta_j^* \neq \theta_{\ell}^*\ \forall\,\ell<j) \quad\leftarrow\text{this event occurs when we select a new table} \]
\[ \mathbb{E}[K_n^+] = \sum_{j=1}^{n}\mathbb{E}[1(\theta_j^*\neq\theta_{\ell<j}^*)] = \sum_{j=1}^{n}P(\theta_j^*\neq\theta_{\ell<j}^*) = \sum_{j=1}^{n}\frac{\alpha}{\alpha+j-1} \approx \alpha\ln n \]
Where does this come from? Each θ_j^* can pick a new table, and does so with probability α/(α + j − 1).
s
Cluster sizes: We saw last lecture that if we let the number of observations n be random, where
\[ n\,|\,y \sim \mathrm{Pois}(y), \qquad y \sim \mathrm{Gam}(\alpha, \alpha/n), \]
then we can analyze cluster size and number using the Poisson process.
– E[n] = E[E[n|y]] = E[y] = n ← expected number of observations
– K_n^+ ∼ Pois(α ln(α + n) − α ln α) ← total # clusters
– Therefore E[K_n^+] = α ln((α + n)/α) (compare with above, where n is not random)
– \(\sum_{i=1}^{K_n^+} 1(n_i^{(n)} = k) \sim \mathrm{Pois}\big(\frac{\alpha}{k}(\frac{n}{\alpha+n})^k\big)\) ← number of clusters with k observations
s
It's important to remember that n is random here. So in #2 and #3 above, we first generate n and then sample this many times from the CRP. For example, E[K_n^+] is slightly different depending on whether n is random or not.
Inference for the CRP
s
Generative process: X_n | θ_n^* ∼ p(X|θ_n^*), θ_n^* | θ_{1:n-1}^* ∼ Σ_{i=1}^{n-1}\frac{1}{\alpha+n-1}\delta_{\theta_i^*} + \frac{\alpha}{\alpha+n-1}\mu. Here α > 0 is the "concentration" parameter and we assume µ is a non-atomic probability measure.
s
Posterior inference: Given the data X_1, . . . , X_N and parameters α and µ, the goal of inference is the inverse problem of finding θ_1^*, . . . , θ_N^*. This gives the unique parameters θ_{1:K_N} = unique(θ^*_{1:N}) and the partition of the data into clusters.
s
Using Bayes rule doesn't get us far (recall that p(B|A) = p(A|B)p(B)/p(A)):
\[ p(\theta_1,\theta_2,\dots,\theta^*_{1:N}\,|\,X_{1:N}) = \Big[\prod_{j=1}^{N}p(X_j|\theta_j^*)\Big]p(\theta^*_{1:N})\prod_{i=1}^{\infty}p(\theta_i)\Big/\ \text{intractable normalizer}. \tag{7.8} \]
Gibbs sampling: We can't calculate the posterior analytically, but we can sample from it: iterate between sampling the atoms given the assignments and then sampling the assignments given the atoms.
Sampling the atoms θ_1, θ_2, . . .
s
This is the easier of the two. For the K_N unique clusters in θ_1^*, . . . , θ_N^* at iteration t, we need to sample θ_1, . . . , θ_{K_N}.
Sample θ_i: Use Bayes rule,
\[ p(\theta_i\,|\,\theta_{-i},\theta^*_{1:N},X_{1:N}) \propto \underbrace{\Big[\prod_{j:\theta_j^*=\theta_i}p(X_j|\theta_i)\Big]}_{\text{likelihood}} \times \underbrace{p(\theta_i)}_{\text{prior }(\mu)}. \tag{7.9} \]
s
In words, the posterior of θ_i depends only on the data assigned to the ith cluster according to θ_1^*, . . . , θ_N^*.
s
We simply select this subset of data and calculate the posterior of θ_i on this subset. When µ and p(X|θ) are conjugate, this is easy.
Sampling θ_j^* (seating assignment for X_j)
s
Use exchangeability of θ_1^*, . . . , θ_N^* to treat X_j as if it were the last observation,
\[ p(\theta_j^*\,|\,X_{1:N},\Theta,\theta_{-j}^*) \propto p(X_j|\theta_j^*,\Theta)\,p(\theta_j^*|\theta_{-j}^*) \quad\leftarrow\text{also conditions on ``future'' }\theta_n^*. \tag{7.10} \]
s
Below is the sampling algorithm, followed by the mathematical derivation:
\[ \text{set } \theta_j^* = \begin{cases} \theta_i & \text{w.p.} \propto p(X_j|\theta_i)\sum_{n\neq j}1(\theta_n^*=\theta_i), \quad \theta_i \in \mathrm{unique}\{\theta_{-j}^*\} \\[4pt] \theta_{\mathrm{new}} \sim p(\theta|X_j) & \text{w.p.} \propto \alpha\int p(X_j|\theta)p(\theta)\,d\theta \end{cases} \tag{7.11} \]
s
The first line should be straightforward from Bayes rule. The second line is trickier because we have to account for the infinitely many remaining parameters. We'll discuss the second line next.
s
First, return to the finite model and then take the limit (and assume the appropriate re-indexing).
s
Define n_i^{-j} = #{n ≠ j : θ_n^* = θ_i} and K_{-j} = #unique{θ_{-j}^*}.
s
Then the prior on θ_j^* is
\[ \theta_j^*\,|\,\theta_{-j}^* \sim \sum_{i=1}^{K_{-j}}\frac{n_i^{-j}}{\alpha+n-1}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+n-1}\sum_{i=1}^{K}\frac{1}{K}\,\delta_{\theta_i}. \tag{7.12} \]
s
The term Σ_{i=1}^{K}\frac{1}{K}\delta_{\theta_i} overlaps with the K_{-j} atoms in the first term, but we observe that K_{-j}/K → 0 as K → ∞.
s
First: What's the probability that a new atom is used in the infinite limit (K → ∞)?
\[ p(\theta_j^* = \theta_{\mathrm{new}}\,|\,X_j,\theta_{-j}^*) \propto \lim_{K\to\infty}\alpha\sum_{i=1}^{K}\frac{1}{K}\,p(X_j|\theta_i). \tag{7.13} \]
Since θ_i ∼iid µ,
\[ \lim_{K\to\infty}\sum_{i=1}^{K}\frac{1}{K}\,p(X_j|\theta_i) = \mathbb{E}_\mu[p(X_j|\theta)] = \int p(X_j|\theta)\,\mu(d\theta). \tag{7.14} \]
Technically, this is the probability that an atom is selected from the second part of (7.12) above. We'll see why this atom is therefore "new" next.
s
Second: Why is θ_new ∼ p(θ|X_j)? (And why is it new to begin with?)
s
Given that θ_j^* = θ_new, we need to find the index i so that θ_new = θ_i from the second half of (7.12):
\[ p(\theta_{\mathrm{new}}=\theta_i\,|\,X_j,\theta_j^*=\theta_{\mathrm{new}}) \propto p(X_j|\theta_{\mathrm{new}}=\theta_i)\,p(\theta_{\mathrm{new}}=\theta_i|\theta_j^*=\theta_{\mathrm{new}}) \propto \lim_{K\to\infty}\frac{1}{K}\,p(X_j|\theta_i) \;\Rightarrow\; p(X_j|\theta)\,\mu(d\theta). \tag{7.15} \]
So p(θ_new|X_j) ∝ p(X_j|θ)µ(dθ).
Therefore, given that the atom associated with X_j is selected from the second half of (7.12), the probability it coincides with an atom in the first half equals zero (and so it's "new" with probability one). Also, the atom itself is distributed according to the posterior given X_j.
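Below is a hedged sketch of the two Gibbs steps for a toy conjugate case that is not in the notes: a Gaussian likelihood x ∼ N(θ, σ²) with known σ² and base measure µ = N(0, τ²), so that both the cluster-parameter posterior in (7.9) and the marginal ∫p(x|θ)µ(dθ) in (7.11) are available in closed form. All the numerical settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

def crp_gibbs(x, alpha, sigma=1.0, tau=3.0, iters=50, rng=0):
    """Gibbs sampler sketch for a DP mixture of N(theta, sigma^2)
    with base measure mu = N(0, tau^2). Alternates:
      (a) resample each seating assignment theta*_j given the atoms, Eq. (7.11)
      (b) resample each atom theta_i from its conjugate posterior, Eq. (7.9)"""
    rng = np.random.default_rng(rng)
    N = len(x)
    z = np.zeros(N, dtype=int)                      # cluster index of each point
    atoms = [float(np.mean(x))]                     # one initial cluster

    def post_draw(xs):
        # conjugate normal-normal posterior of theta given the data xs in its cluster
        v = 1.0 / (1.0 / tau**2 + len(xs) / sigma**2)
        m = v * np.sum(xs) / sigma**2
        return rng.normal(m, np.sqrt(v))

    for _ in range(iters):
        for j in range(N):                          # (a) seating assignments
            counts = np.bincount(np.delete(z, j), minlength=len(atoms)).astype(float)
            lik_existing = counts * norm.pdf(x[j], loc=atoms, scale=sigma)
            lik_new = alpha * norm.pdf(x[j], loc=0.0, scale=np.sqrt(sigma**2 + tau**2))
            probs = np.append(lik_existing, lik_new)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(atoms):                     # new table: theta_new ~ p(theta | x_j)
                atoms.append(post_draw(x[j:j + 1]))
            z[j] = k
        used = np.unique(z)                         # drop empty clusters, relabel
        atoms = [atoms[k] for k in used]
        z = np.searchsorted(used, z)
        atoms = [post_draw(x[z == i]) for i in range(len(atoms))]   # (b) atoms
    return z, np.array(atoms)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-4, 1, 100), rng.normal(4, 1, 100)])
    z, atoms = crp_gibbs(data, alpha=1.0)
    print("clusters found:", len(atoms), "atom locations:", np.round(atoms, 2))
```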
Chapter 8
Exchangeability, beta processes and the Indian buffet process
Marginalizing (integrating out) stochastic processes
s
We saw how the Dirichlet process gives a discrete distribution on model parameters in a clustering setting. When the Dirichlet process is integrated out, the cluster assignments form a Chinese restaurant process:
\[ \underbrace{p(\theta_1^*,\dots,\theta_N^*)}_{\text{Chinese restaurant process}} = \int\prod_{n=1}^{N}\underbrace{p(\theta_n^*|G)}_{\text{i.i.d. from discrete dist.}}\ \underbrace{p(G)}_{\mathrm{DP}}\,dG. \tag{8.1} \]
s
There is a direct parallel between the beta-Bernoulli process and the "Indian buffet process":
\[ \underbrace{p(Z_1,\dots,Z_N)}_{\text{Indian buffet process}} = \int\prod_{n=1}^{N}\underbrace{p(Z_n|H)}_{\text{Bernoulli process}}\ \underbrace{p(H)}_{\mathrm{BP}}\,dH. \tag{8.2} \]
s
As with the DP → CRP transition, the BP → IBP transition can be understood from the limiting case of the finite BP model.
Beta process (finite approximation)
s
Let α > 0, γ > 0 and µ a non-atomic probability measure. Define
\[ H_K = \sum_{i=1}^{K}\pi_i\delta_{\theta_i}, \qquad \pi_i \sim \mathrm{Beta}\Big(\alpha\frac{\gamma}{K}, \alpha\big(1-\frac{\gamma}{K}\big)\Big), \qquad \theta_i \sim \mu. \tag{8.3} \]
Then lim_{K→∞} H_K = H ∼ BP(α, γµ). (See Lecture 3 for proof.)
Bernoulli process using H_K
s
Given H_K, we can draw the Bernoulli process Z_n^K | H_K ∼ BeP(H_K) as follows:
\[ Z_n^K = \sum_{i=1}^{K}b_{in}\,\delta_{\theta_i}, \qquad b_{in} \sim \mathrm{Bernoulli}(\pi_i). \tag{8.4} \]
Notice that b_{in} should also be indexed by K, which we suppress. Again we are particularly interested in lim_{K→∞}(Z_1^K, . . . , Z_N^K).
s
To derive the IBP, we first consider
\[ \lim_{K\to\infty}p(Z_{1:N}^K) = \lim_{K\to\infty}\int\prod_{n=1}^{N}p(Z_n^K|H_K)\,p(H_K)\,dH_K. \tag{8.5} \]
s
We can think of Z_1^K, . . . , Z_N^K in terms of a binary matrix B_K = [b_{in}], where
1. each row corresponds to an atom θ_i and b_{in} ∼iid Bern(π_i),
2. each column corresponds to a Bernoulli process Z_n^K.
Important: The rows of B_K are independent processes.
s
Consider the process b_{in} | π_i ∼iid Bern(π_i), π_i ∼ Beta(αγ/K, α(1 − γ/K)). The marginal process b_{i1}, . . . , b_{iN} follows an urn model with two colors.
Polya's urn (two-color special case)
1. Start with an urn having αγ/K balls of color 1 and α(1 − γ/K) balls of color 2.
2. Pick a ball at random, put it back, and put a second one of the same color in the urn.
Mathematically:
1. \( b_{i1} \sim \frac{\gamma}{K}\delta_1 + \big(1-\frac{\gamma}{K}\big)\delta_0 \)
2. \( b_{i,N+1}\,|\,b_{i1},\dots,b_{iN} \sim \dfrac{\frac{\alpha\gamma}{K}+n_i^{(N)}}{\alpha+N}\,\delta_1 + \dfrac{\alpha(1-\frac{\gamma}{K})+N-n_i^{(N)}}{\alpha+N}\,\delta_0 \)
where n_i^{(N)} = Σ_{j=1}^{N} b_{ij}.
s
Recall from exchangeability and de Finetti that
\[ \lim_{N\to\infty}\frac{n_i^{(N)}}{N} \longrightarrow \pi_i \sim \mathrm{Beta}\Big(\alpha\frac{\gamma}{K}, \alpha\big(1-\frac{\gamma}{K}\big)\Big). \tag{8.6} \]
s
Last week we proved this in the context of the finite symmetric Dirichlet, π ∼ Dir(α/K, . . . , α/K). The beta distribution is the two-dimensional special case of the Dirichlet, and the proof applies to any parameterization, not just the symmetric one.
s
In the Dirichlet → CRP limiting case, K corresponds to the number of colors in the urn, and K → ∞ with the starting number α/K of each color going to 0.
s
In this case, there are always only two colors. However, the number of urns equals K, so the number of urn processes goes to infinity as the first parameter of each beta goes to zero.
An intuitive derivation of the IBP: Work with the urn representation of B_K. Again, b_{in} is entry (i, n) of B_K and the generative process for B_K is
\[ b_{i,N+1}\,|\,b_{i1},\dots,b_{iN} \sim \frac{\frac{\alpha\gamma}{K}+n_i^{(N)}}{\alpha+N}\,\delta_1 + \frac{\alpha(1-\frac{\gamma}{K})+N-n_i^{(N)}}{\alpha+N}\,\delta_0, \tag{8.7} \]
where n_i^{(N)} = Σ_{j=1}^{N} b_{ij}. Each row of B_K is associated with a θ_i ∼iid µ, so we can reconstruct Z_n^K = Σ_{i=1}^{K} b_{in}δ_{θ_i} using what we have.
s
Let's break down lim_{K→∞} B_K into two cases.
Case n = 1: We ask how many ones are in the first column of B_K:
\[ \lim_{K\to\infty}\sum_{i=1}^{K}b_{i1} \sim \lim_{K\to\infty}\mathrm{Bin}(K,\gamma/K) = \mathrm{Pois}(\gamma). \tag{8.8} \]
So Z_1 has Pois(γ) ones. Since the θ associated with these ones are i.i.d., we can "ex post facto" draw them i.i.d. from µ and re-index.
Case n > 1: For the remaining Z_n, we break this into two subcases.
s
Subcase n_i^{(n-1)} > 0: For each such i, in the K → ∞ limit,
\[ b_{in}\,|\,b_{i1},\dots,b_{i,n-1} \sim \frac{n_i^{(n-1)}}{\alpha+n-1}\,\delta_1 + \frac{\alpha+n-1-n_i^{(n-1)}}{\alpha+n-1}\,\delta_0. \]
s
Subcase n_i^{(n-1)} = 0:
\[ b_{in}\,|\,b_{i1},\dots,b_{i,n-1} \sim \lim_{K\to\infty}\frac{\alpha\frac{\gamma}{K}}{\alpha+n-1}\,\delta_1 + \frac{\alpha(1-\frac{\gamma}{K})+n-1}{\alpha+n-1}\,\delta_0. \]
The probability of a one goes to 0, but there are also an infinite number of these indexes i for which n_i^{(n-1)} = 0. Is there a limiting argument we can again make to just ask how many ones there are in total for the indexes with n_i^{(n-1)} = 0?
s
Let K_n = #{i : n_i^{(n-1)} > 0}, which is finite almost surely. Then
\[ \lim_{K\to\infty}\sum_{i=1}^{K}b_{in}\,1(n_i^{(n-1)}=0) \sim \lim_{K\to\infty}\mathrm{Bin}\Big(K-K_n,\ \frac{1}{\alpha+n-1}\frac{\alpha\gamma}{K}\Big) = \mathrm{Pois}\Big(\frac{\alpha\gamma}{\alpha+n-1}\Big). \tag{8.9} \]
s
So there are Pois(αγ/(α+n−1)) new locations for which b_{in} = 1. Again, since the atoms are i.i.d. regardless of the index, we can simply draw them i.i.d. from µ and re-index.
Putting it all together: The Indian buffet process
s
For n = 1: Draw C_1 ∼ Pois(γ), θ_1, . . . , θ_{C_1} ∼iid µ, and set Z_1 = Σ_{i=1}^{C_1} δ_{θ_i}.
s
For n > 1: Let K_{n-1} = Σ_{j=1}^{n-1} C_j. For i = 1, . . . , K_{n-1}, draw
\[ b_{in}\,|\,b_{i,1:n-1} \sim \frac{n_i^{(n-1)}}{\alpha+n-1}\,\delta_1 + \frac{\alpha+n-1-n_i^{(n-1)}}{\alpha+n-1}\,\delta_0. \tag{8.10} \]
Then draw C_n ∼ Pois(αγ/(α+n−1)) and θ_{K_{n-1}+1}, . . . , θ_{K_{n-1}+C_n} ∼iid µ, and set
\[ Z_n = \sum_{i=1}^{K_{n-1}}b_{in}\,\delta_{\theta_i} + \sum_{i'=K_{n-1}+1}^{K_{n-1}+C_n}\delta_{\theta_{i'}}. \tag{8.11} \]
s
By exchangeability and de Finetti, lim_{N→∞} \frac{1}{N}Σ_{i=1}^{N} Z_i → H ∼ BP(α, γµ).
The IBP story: Start with α customers not eating anything.
1. A customer walks into an Indian buffet with an infinite number of dishes and samples Pois(γ) of them.
2. The nth customer arrives, samples each previously sampled dish with probability proportional to the number of previous customers who sampled it, and then samples Pois(αγ/(α+n−1)) new dishes.
s
In modeling scenarios, each dish corresponds to a factor (e.g., a one-dimensional subspace) and a customer samples a subset of factors.
s
Clearly, after n customers there are Σ_{i=1}^{n} C_i ∼ Pois(Σ_{i=1}^{n} αγ/(α+i−1)) dishes that have been sampled. (See Lecture 3 for another derivation of this quantity.)
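A short simulation of the buffet as just described (an illustration of mine; the dishes θ are drawn from a standard normal base measure after the fact).

```python
import numpy as np

def ibp(alpha, gamma, N, rng=0):
    """Simulate Z_1..Z_N from the Indian buffet process:
    customer n re-samples dish i w.p. n_i/(alpha+n-1) and then tries
    Pois(alpha*gamma/(alpha+n-1)) new dishes (which is Pois(gamma) for n=1).
    Returns the binary matrix B (rows = dishes, columns = customers) and dishes."""
    rng = np.random.default_rng(rng)
    columns, n_dishes = [], 0
    for n in range(1, N + 1):
        counts = np.sum(columns, axis=0) if columns else np.zeros(0)   # n_i^(n-1)
        old = (rng.random(n_dishes) < counts / (alpha + n - 1)).astype(int)
        c_new = rng.poisson(alpha * gamma / (alpha + n - 1))
        columns = [np.concatenate([col, np.zeros(c_new, dtype=int)]) for col in columns]
        columns.append(np.concatenate([old, np.ones(c_new, dtype=int)]))
        n_dishes += c_new
    B = np.array(columns).T                       # dishes x customers
    dishes = rng.standard_normal(n_dishes)        # theta_i ~ mu, drawn after the fact
    return B, dishes

if __name__ == "__main__":
    B, dishes = ibp(alpha=2.0, gamma=5.0, N=10)
    print("total dishes sampled:", B.shape[0])
```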
BP constructions and the IBP
s
From Lecture 4: Let α, γ > 0 and µ a non-atomic probability measure. Let
\[ C_i \sim \mathrm{Pois}\Big(\frac{\alpha\gamma}{\alpha+i-1}\Big), \qquad \pi_{ij} \sim \mathrm{Beta}(1,\alpha+i-1), \qquad \theta_{ij} \sim \mu, \tag{8.12} \]
and define H = Σ_{i=1}^{∞} Σ_{j=1}^{C_i} π_{ij}δ_{θ_{ij}}. Then H ∼ BP(α, γµ).
s
We proved this using the Poisson process theory. Now we’ll show how this construction can be
arrived at in the first place.
s
Imagine an urn with β_1 balls of one color and β_2 of another. The distribution on the first draw from this urn is
\[ \frac{\beta_1}{\beta_1+\beta_2}\,\delta_1 + \frac{\beta_2}{\beta_1+\beta_2}\,\delta_0. \]
Let (b_1, b_2, . . . ) be the urn process with this initial configuration. Then, as we have seen, in the limit N → ∞, \frac{1}{N}Σ_{i=1}^{N} b_i → π ∼ Beta(β_1, β_2).
s
Key: We can skip the whole urn procedure if we're only interested in π. That is, if we draw from the urn once, look at it, and see it's color one, then the urn distribution is
\[ \frac{\beta_1+1}{1+\beta_1+\beta_2}\,\delta_1 + \frac{\beta_2}{1+\beta_1+\beta_2}\,\delta_0. \]
The questions to ask are: what will happen if we continue? What is the difference between this "posterior" configuration and another urn where this is defined to be the initial setting? It shouldn't be hard to convince yourself that, in this case, as N → ∞,
\[ \frac{1}{N}\sum_{i=1}^{N}b_i\,\Big|\,b_1=1 \;\to\; \pi \sim \mathrm{Beta}(\beta_1+1,\beta_2). \]
s
This is because the sequence (b_1 = 1, b_2, b_3, . . . ) is equivalent to an urn process (b_2, b_3, . . . ) whose initial configuration is β_1 + 1 balls of color one and β_2 of color two.
s
Furthermore, we know from de Finetti and exchangeability that if Z_1, . . . , Z_N are from an IBP, then lim_{N→∞} \frac{1}{N}Σ_{i=1}^{N} Z_i → H ∼ BP(α, γµ). Right now we're only interested in this limit.
Derivation of the construction via the IBP
s
For each Z_n, there are C_n ∼ Pois(αγ/(α+n−1)) new "urns" introduced, each with initial configuration \(\frac{1}{\alpha+n}\delta_1 + \frac{\alpha+n-1}{\alpha+n}\delta_0\). From this point on, each of these new urns can be treated as an independent process. Notice that this is the case for every instantiation of the IBP.
s
With the IBP, we continue this urn process. Instead, we can ask for the limiting distribution of the urn immediately after instantiating it. We know from above that the new urns created at step n will converge to random variables drawn independently from a Beta(1, α + n − 1) distribution. We can draw this probability directly.
s
The result is the construction written above.
Part II
Topics in Bayesian Inference
Chapter 9
EM and variational inference
s
We’ve been discussing Bayesian nonparametric priors and their relationship to the Poisson
process up to this point (first 2/3 of the course). We now turn to the last 1/3 which will be about
model inference (with tie-ins to Bayesian nonparametrics).
s
In this lecture, we’ll review two deterministic optimization-based approaches to inference: the
EM algorithm for MAP inference and an extension of EM to approximate Bayesian posterior
inference called variational inference.
General setup
1. Prior: We define a model with a set of parameters θ and a prior distribution on them, p(θ).
2. Likelihood: Conditioned on the model parameters, we have a way of generating the set of data X according to the distribution p(X|θ).
3. Posterior: By Bayes' rule, the posterior distribution of the parameters θ is
\[ p(\theta|X) = \frac{p(X|\theta)p(\theta)}{\int p(X|\theta)p(\theta)\,d\theta}. \tag{9.1} \]
s
Of course, the problem is that for most models the denominator is intractable, and so we can't actually calculate p(θ|X) analytically.
s
The first, most obvious solution is to ask for the most probable θ according to p(θ|X). Since
\[ p(\theta|X) = p(X|\theta)p(\theta)\Big/\underbrace{\int p(X|\theta)p(\theta)\,d\theta}_{\text{constant wrt }\theta}, \]
we know that
\[ \arg\max_\theta p(\theta|X) = \arg\max_\theta p(X|\theta)p(\theta) = \arg\max_\theta \underbrace{\ln p(X|\theta) + \ln p(\theta)}_{\text{for computational convenience}}. \tag{9.2} \]
s
Therefore, we can come up with an algorithm for finding a locally optimal solution to the (usually) non-convex problem arg max_θ p(X, θ).
s
This often leads to a problem (that will be made more clear in an example later): the function ln p(X, θ) is sometimes hard to optimize with respect to θ (e.g., it might require messy gradients).
s
The EM algorithm is a very general method for optimizing ln p(X, θ) w.r.t. θ in a "nice" way (e.g., closed-form updates).
s
With EM, we expand the joint likelihood p(X, θ) using some auxiliary variables c, so that we have p(X, θ, c) and p(X, θ) = ∫ p(X, θ, c)dc. The variable c is something picked entirely out of convenience. For now, we keep it abstract and illustrate it with an example later.
Build-up to EM
s
Observe that, by basic probability, p(c|X, θ)p(X, θ) = p(X, θ, c). Therefore,
\[ \underbrace{\ln p(X,\theta)}_{\text{thing we want}} = \underbrace{\ln p(X,\theta,c) - \ln p(c|X,\theta)}_{\text{expanded version equal to thing we want}}. \tag{9.3} \]
s
We then introduce any probability distribution q(c) on c and follow this sequence of steps:
1. ln p(X, θ) = ln p(X, θ, c) − ln p(c|X, θ)
2. q(c) ln p(X, θ) = q(c) ln p(X, θ, c) − q(c) ln p(c|X, θ)
3. ∫ q(c) ln p(X, θ)dc = ∫ q(c) ln p(X, θ, c)dc − ∫ q(c) ln p(c|X, θ)dc
4. ln p(X, θ) = ∫ q(c) ln p(X, θ, c)dc − ∫ q(c) ln q(c)dc − ∫ q(c) ln p(c|X, θ)dc + ∫ q(c) ln q(c)dc   (add and subtract the same thing)
5. \( \ln p(X,\theta) = \underbrace{\int q(c)\ln\frac{p(X,\theta,c)}{q(c)}\,dc}_{\mathcal{L}(q,\theta)} + \underbrace{\int q(c)\ln\frac{q(c)}{p(c|X,\theta)}\,dc}_{\mathrm{KL}(q\|p)} \)
s
We have decomposed the objective function ln p(X, θ) into a sum of two terms:
1. L(q, θ): a function we can (hopefully) calculate.
2. KL(q‖p): the KL-divergence between q and p. Recall that KL ≥ 0 and KL = 0 only when q = p.
s
The practical usefulness isn't immediately obvious. For now, we analyze it as an algorithm for optimizing ln p(X, θ) w.r.t. θ. That is, how can we use the right side to find a locally optimal solution of the left side?
s
We will see that the following algorithm gives a sequence of values for θ that monotonically increase p(X, θ) and can be shown to converge to a local optimum of p(X, θ).
The Expectation-Maximization Algorithm
E-step: Change q(c) by setting q(c) = p(c|X, θ). (Obviously, we're assuming p(c|X, θ) can be calculated in closed form.) What does this do?
1. Sets KL(q‖p) = 0.
2. Since θ is fixed, ln p(X, θ) isn't changed here; therefore, since KL ≥ 0 in general, by changing q so that KL(q‖p) = 0 we increase L(q, θ) so that L(q, θ) = ln p(X, θ).
M-step: L(q, θ) = E_q[ln p(X, θ, c)] − E_q[ln q(c)] can now be maximized as a function of θ. What does this do?
1. Increases L(q, θ) (obvious).
2. Since θ has changed, q(c) ≠ p(c|X, θ) anymore, so KL(q‖p) > 0.
3. Therefore,
\[ \ln p(X,\theta_{\mathrm{old}}) = \mathbb{E}_q\Big[\ln\frac{p(X,\theta_{\mathrm{old}},c)}{q(c)}\Big] \leq \mathbb{E}_q\Big[\ln\frac{p(X,\theta_{\mathrm{new}},c)}{q(c)}\Big] + \mathrm{KL}(q\|p) = \ln p(X,\theta_{\mathrm{new}}). \tag{9.4} \]
Example: Probit classification
s
We're given data pairs D = {(x_i, z_i)} for i = 1, . . . , N, with x_i ∈ R^d and z_i ∈ {−1, +1}. We want to build a linear classifier f(x^Tw − w_0) : x → {−1, +1}, where
\[ x_i^Tw - w_0 \;\begin{cases} > 0 & \text{for all points to the upper right} \\ < 0 & \text{for all points to the lower left} \\ = 0 & \text{for all points on the line} \end{cases} \]
Classify using the probit function:
\[ f(x^Tw - w_0) = \Phi\Big(\frac{x^Tw - w_0}{\sigma}\Big) = p(z=1|w,x), \]
where Φ is the CDF of a standard normal distribution.
Simplification of notation: Let w ← [w_0 \; w^T]^T and x_i ← [1 \; x_i^T]^T.
– The model variable is θ = w in this case. Let w ∼ N(0, cI) be the prior.
– The joint likelihood is p(z, w|x) = p(w) ∏_{i=1}^{N} p(z_i|w, x_i).
s
So we would like to optimize
\[ \ln p(w) + \sum_{i=1}^{N}\ln p(z_i|w,x_i) = -\frac{1}{2c}w^Tw + \sum_{i=1}^{N} z_i\ln\Phi\Big(\frac{x_i^Tw}{\sigma}\Big) + (1-z_i)\ln\Big(1-\Phi\Big(\frac{x_i^Tw}{\sigma}\Big)\Big) \tag{9.5} \]
with respect to w, which is hard because of Φ(·).
Solution via EM
s
Introduce the latent variables ("hidden data") y_1, . . . , y_N such that y_i | x_i, w ∼ N(x_i^Tw, σ²) and let z_i = 1(y_i > 0). Marginalizing this expanded joint likelihood,
\[ p(z_i=1|x_i,w) = \int_{-\infty}^{\infty}p(y_i,z_i=1|x_i,w)\,dy_i = \int_{-\infty}^{\infty}1(y_i>0)\,N(y_i|x_i^Tw,\sigma^2)\,dy_i = \Phi\Big(\frac{x_i^Tw}{\sigma}\Big), \tag{9.6} \]
so the expanded joint likelihood p(z, y, w|x) has the correct marginal distribution.
s
From the EM algorithm we know that
\[ \ln p(z,w|x) = \mathbb{E}_q\Big[\ln\frac{p(z,y,w|x)}{q(y)}\Big] + \mathrm{KL}\big(q(y)\,\|\,p(y|z,w,x)\big) = \sum_{i=1}^{N}\int q(y)\ln\frac{p(z_i,y_i,w|x_i)}{q(y)}\,dy + \int q(y)\ln\frac{q(y)}{p(y|z,w,x)}\,dy. \tag{9.7} \]
Q: Can we simplify q(y) = q(y_1, . . . , y_N)?
A: We need to set q(y) = p(y|z, w, x). Since p(y|z, w, x) = ∏_{i=1}^{N} p(y_i|z_i, w, x_i), we know that we can write q(y) = ∏_{i=1}^{N} q(y_i) (but we still don't know what it is).
s
E-step
1. For each i, set q(y_i) = p(y_i|z_i, w, x_i) ∝ p(z_i|y_i)p(y_i|x_i, w). Since this product equals 1(y_i ∈ R_{z_i})N(y_i|x_i^Tw, σ²), it follows that q(y_i) = TN_{z_i}(x_i^Tw, σ²), a truncated normal on the z_i half of R.
2. Construct the objective Σ_i E_{q(y_i)}[ln p(z_i, y_i, w|x_i)]. We can ignore Σ_i E_{q(y_i)}[ln q(y_i)] because it doesn't depend on w, which is what we want to optimize over next.
\[ \sum_{i=1}^{N}\mathbb{E}_q[\ln p(z_i,y_i,w|x_i)] = \sum_{i=1}^{N}\underbrace{\mathbb{E}_q\big[-\tfrac{1}{2\sigma^2}(y_i-x_i^Tw)^2\big]}_{\text{from }\ln p(y_i|x_i,w)} + \sum_{i=1}^{N}\mathbb{E}_q[\ln p(z_i|y_i)] \underbrace{-\tfrac{1}{2c}w^Tw}_{\leftarrow\,\ln p(w)} + \text{const.} \]
\[ = \sum_{i=1}^{N}-\tfrac{1}{2\sigma^2}\big(\mathbb{E}_q[y_i]-x_i^Tw\big)^2 - \tfrac{1}{2c}w^Tw + \text{const. w.r.t. } w. \tag{9.8} \]
s
M-step
1. Optimize the objective from the E-step with respect to w. Setting the gradient w.r.t. w to zero gives
\[ w = \Big(\frac{1}{c}I + \frac{1}{\sigma^2}\sum_{i=1}^{N}x_ix_i^T\Big)^{-1}\sum_{i=1}^{N}x_i\,\mathbb{E}_q[y_i]/\sigma^2. \tag{9.9} \]
The expectation E_q[y_i] is of the truncated normal r.v. with distribution q(y_i), which can be looked up in a book or on Wikipedia. The expectation looks complicated, but it's simply an equation that can be evaluated using what we have. The final algorithm can be seen in this equation: after updating w, update q. This updates E_q[y_i], and so we can again update w. The code would look like iterations between updating w and updating E_q[y_i] for each i. Notice that this is a nice, closed-form iterative algorithm. The output w (locally) maximizes p(z, w|x).
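A hedged sketch of this EM loop for the probit model on synthetic data (σ = 1 and the prior variance c are illustrative choices of mine). The truncated-normal mean E_q[y_i] is computed with the standard closed form in terms of the normal pdf and cdf.

```python
import numpy as np
from scipy.stats import norm

def probit_em(X, z, c=10.0, sigma=1.0, iters=50):
    """MAP estimate of w for the probit model z_i = 1(y_i > 0),
    y_i ~ N(x_i^T w, sigma^2), w ~ N(0, c I), via EM over the hidden y.
    Labels z are in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    A = np.eye(d) / c + X.T @ X / sigma**2             # fixed matrix in the M-step
    for _ in range(iters):
        m = X @ w                                       # E-step: q(y_i) = TN_{z_i}(m_i, sigma^2)
        Ey = m + z * sigma * norm.pdf(m / sigma) / norm.cdf(z * m / sigma)
        w = np.linalg.solve(A, X.T @ Ey / sigma**2)     # M-step, Eq. (9.9)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.5, -2.0, 0.5])
    X = rng.standard_normal((500, 3))
    z = np.sign(X @ w_true + rng.standard_normal(500))
    print("estimated w:", np.round(probit_em(X, z), 2))
```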
Extension of EM to a BNP modification of the probit model
Model extension: Imagine that x is very high-dimensional. We might think that only a few dimensions of x are relevant for predicting z. We can use a finite approximation to the gamma process to address this.
Gamma process prior: Let c_j ∼iid Gam(α/d, b), w_j ∼ N(0, c_j). From the GaP theory we know that when d ≫ α, most values of c_j ≈ 0, but some will be large. In this context the c_j are variances, so the model says that most dimensions of w should be zero, in which case the corresponding dimension of x is not used in predicting z = f(x^Tw).
Log joint likelihood: The EM equation is
\[ \ln p(z,w|x) = \int q(y,c)\ln\frac{p(z,y,w,c|x)}{q(y,c)}\,dy\,dc + \int q(y,c)\ln\frac{q(y,c)}{p(y,c|z,w,x)}\,dy\,dc. \tag{9.10} \]
We can again think of y as the "hidden data" that doesn't have an interpretation as a model variable. c can be thought of as a model variable, and we want to integrate out the uncertainty in this variable while inferring a point estimate of w.
E-step:
1. Set q(y, c) = p(y, c|z, w, x). In this case y and c are conditionally independent of each other:
\[ p(y,c|z,w,x) = p(y|z,w,x)\,p(c|w). \]
We can calculate these as before:
\[ q(y_i) = p(y_i|z_i,w,x_i) = TN_{z_i}(x_i^Tw,\sigma^2) \tag{9.11} \]
\[ q(c_j) = p(c_j|w_j) = \mathrm{GiG}(w_j^2,\alpha,b) \quad\leftarrow\text{generalized inverse Gaussian} \tag{9.12} \]
2. Construct the objective function
\[ \sum_i\mathbb{E}_q\ln p(z_i,y_i,w,c|x_i) = \sum_i\mathbb{E}_q\ln p(z_i|y_i) + \sum_i\mathbb{E}_q\ln p(y_i|x_i,w) + \mathbb{E}_q\ln p(w|c) + \text{const.} \]
M-step:
1. Optimize the objective from the E-step with respect to w. Taking the gradient with respect to w and setting it to zero, we find
\[ w = \Big(\mathrm{diag}\big(\mathbb{E}_q[c_j^{-1}]\big) + \frac{1}{\sigma^2}\sum_{i=1}^{N}x_ix_i^T\Big)^{-1}\sum_{i=1}^{N}x_i\,\mathbb{E}_q[y_i]/\sigma^2. \tag{9.13} \]
Variational inference
s
We motivate variational inference in the context of the model we've been discussing and then give a general review of it.
s
Generative process:
\[ z_i = \mathrm{sign}(y_i), \qquad y_i \sim N(x_i^Tw,\sigma^2), \qquad w_j \sim N(0,c_j), \qquad c_j \sim \mathrm{Gam}\big(\tfrac{\alpha}{d},b\big). \tag{9.14} \]
s
We saw how we could maximize ln p(z, w|x) easily by including distributions on y and c. Could we do the same for w?
\[ \ln p(z|x) = \int q(y,c,w)\ln\frac{p(z,y,w,c|x)}{q(y,c,w)}\,dy\,dc\,dw + \int q(y,c,w)\ln\frac{q(y,c,w)}{p(y,c,w|z,x)}\,dy\,dc\,dw. \tag{9.15} \]
s
Short answer: Probably not. What has changed?
1. Before, we were maximizing ln p(z, w|x) w.r.t. w. This meant we needed to calculate p(y, c|z, x, w). Fortunately, since y and c are independent conditioned on w, we can calculate p(y, c|z, x, w) = p(y|z, x, w)p(c|w) analytically using the point estimate of w.
2. Therefore, we could set q(y) = p(y|z, x, w) and q(c) = p(c|w) to make KL = 0 and optimize L.
3. This time, we can't factorize p(y, c, w|z, x), so we can't find q(y, c, w) = p(y, c, w|z, x) in closed form to make KL = 0.
s
Some other observations
1. ln p(z|x) is a constant since it depends only on the data z, x and the underlying model.
2. Therefore, it doesn't make sense to talk about maximizing ln p(z|x) like before, when we were using EM to maximize ln p(z, w|x) with respect to w.
3. If we could find q(y, c, w) to make KL(q‖p) = 0, we would have the full posterior of the model! (That's the whole purpose of inference.)
4. What made EM for ln p(z, w|x) easy was that we could write q(y, c) = q(y)q(c) and still find optimal distributions.
s
Variational inference idea
1. We can't write q(y, c, w) = q(y)q(c)q(w) and still have KL(q‖p) = 0, i.e., we can't have q(y)q(c)q(w) = p(y, c, w|z, x).
2. What if we make the approximation p(y, c, w|z, x) ≈ q(y)q(c)q(w)?
3. We still have that
\[ \ln p(z|x) = \underbrace{\int q(y)q(c)q(w)\ln\frac{p(z,y,w,c|x)}{q(y)q(c)q(w)}\,dy\,dc\,dw}_{\mathcal{L}(q)} + \underbrace{\int q(y)q(c)q(w)\ln\frac{q(y)q(c)q(w)}{p(y,c,w|z,x)}\,dy\,dc\,dw}_{\mathrm{KL}(q\|p)}. \tag{9.16} \]
4. However, KL ≠ 0 (ever).
5. Notice, though, that since ln p(z|x) = const. and KL > 0, if we maximize L w.r.t. q we minimize the KL between q and p (a small sketch for this model follows below).
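To make the idea concrete, here is a minimal mean-field sketch for the probit part of this model, with the simplifying assumption (mine, not the notes') that the prior variance c is fixed rather than gamma-process distributed. The factorization is q(y)q(w), each update is in closed form, and the only difference from the EM code earlier is that w now gets a full Gaussian q rather than a point estimate.

```python
import numpy as np
from scipy.stats import norm

def probit_vi(X, z, c=10.0, sigma=1.0, iters=50):
    """Mean-field VI for z_i = 1(y_i>0), y_i ~ N(x_i^T w, sigma^2), w ~ N(0, c I),
    with q(y, w) = [prod_i q(y_i)] q(w). Returns the Gaussian q(w) = N(mu, Sigma)."""
    N, d = X.shape
    Sigma = np.linalg.inv(np.eye(d) / c + X.T @ X / sigma**2)   # fixed across iterations
    mu = np.zeros(d)
    for _ in range(iters):
        m = X @ mu                                   # q(y_i) = TN_{z_i}(x_i^T mu, sigma^2)
        Ey = m + z * sigma * norm.pdf(m / sigma) / norm.cdf(z * m / sigma)
        mu = Sigma @ (X.T @ Ey) / sigma**2           # q(w) = N(mu, Sigma)
    return mu, Sigma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -1.0])
    X = rng.standard_normal((300, 2))
    z = np.sign(X @ w_true + rng.standard_normal(300))
    mu, Sigma = probit_vi(X, z)
    print("q(w) mean:", np.round(mu, 2), " q(w) stds:", np.round(np.sqrt(np.diag(Sigma)), 2))
```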
Variational inference (in general)
s
Have data x and a model with variables θ_1, . . . , θ_m.
s
Choose a probability distribution q(θ_i) for each θ_i.
s
Construct the equality
\[ \underbrace{\ln p(x)}_{\text{log evidence (constant)}} = \underbrace{\int\prod_{i=1}^{m}q(\theta_i)\,\ln\frac{p(x,\theta_1,\dots,\theta_m)}{\prod_{i=1}^{m}q(\theta_i)}\,d\theta_1\cdots d\theta_m}_{\mathcal{L}:\ \text{variational lower bound on log evidence}} + \underbrace{\int\prod_{i=1}^{m}q(\theta_i)\,\ln\frac{\prod_{i=1}^{m}q(\theta_i)}{p(\theta_1,\dots,\theta_m|x)}\,d\theta_1\cdots d\theta_m}_{\text{KL divergence between }\prod_i q(\theta_i)\text{ and }p(\theta_1,\dots,\theta_m|x)} \tag{9.17} \]
Observations
1. As before, ln p(x) is a constant (that we don't know).
2. KL(q‖p) ≥ 0 (as always); therefore L ≤ ln p(x).
3. As with EM, p(x, θ_1, . . . , θ_m) is something we can write down (the joint likelihood).
4. \( \mathcal{L} = \underbrace{\mathbb{E}_q\ln p(x,\theta_1,\dots,\theta_m)}_{\text{expected log joint likelihood}} - \underbrace{\sum_{i=1}^{m}\int q(\theta_i)\ln q(\theta_i)\,d\theta_i}_{\text{individual entropies of each }q} \) is a closed-form objective function (currently by assumption, not always true!).
5. We've defined the distribution q(θ_i) but not its parameters. L is a function of these parameters.
6. By finding parameters of each q(θ_i) to maximize L, we are equivalently finding parameters to minimize KL(q‖p).
Chapter 10
Stochastic variational inference
Bayesian modeling review
s
Recall that we have a set of data X and a set of model parameters Θ = {θ_i} for i = 1, . . . , M.
s
We have defined a model for the data, p(X|Θ), and a prior for the model, p(Θ).
s
The goal is to learn Θ and capture a level of uncertainty, so by Bayes rule we want
\[ p(\Theta|X) = \frac{p(X|\Theta)p(\Theta)}{p(X)}. \tag{10.1} \]
s
We can't calculate p(X), so we need approximate methods.
Variational inference
s
Using the EM algorithm as motivation, we set up the "master equation"
\[ \underbrace{\ln p(X)}_{\text{constant marginal likelihood}} = \underbrace{\int q(\Theta)\ln\frac{p(X,\Theta)}{q(\Theta)}\,d\Theta}_{\text{variational lower bound, }\mathcal{L}} + \underbrace{\int q(\Theta)\ln\frac{q(\Theta)}{p(\Theta|X)}\,d\Theta}_{\text{KL-divergence between }q\text{ and the desired }p(\Theta|X)} \tag{10.2} \]
s
We have freedom to pick q(Θ), but that means we have to define it.
s
There is a trade-off:
1. Pick q(Θ) to minimize the KL (ideally set q = p(Θ|X)!).
2. We don't know p(Θ|X), so pick a q that's easy to optimize.
s
What do we mean by "optimize"?
Answer: We can pick a q(Θ) so that we have a closed-form function L. Since ln p(X) is a constant, by maximizing L w.r.t. the parameters of q, we are minimizing the non-negative KL-divergence. Each increase in L finds a q closer to p(Θ|X) according to KL.
Picking q(θ_i)
s
The variational objective function is
\[ \mathcal{L} = \int\prod_{i=1}^{m}q(\theta_i)\,\ln p(X,\Theta)\,d\Theta - \sum_{i=1}^{m}\int q(\theta_i)\ln q(\theta_i)\,d\theta_i. \tag{10.3} \]
s
Imagine we have q(θ_j) for j ≠ i. Then we can isolate q(θ_i) as follows:
\[ \mathcal{L} = \int q(\theta_i)\,\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\,d\theta_i - \int q(\theta_i)\ln q(\theta_i)\,d\theta_i - \sum_{j\neq i}\int q(\theta_j)\ln q(\theta_j)\,d\theta_j = \int q(\theta_i)\ln\frac{\frac{1}{Z}\exp\{\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\}}{q(\theta_i)}\,d\theta_i + \text{constant wrt }\theta_i. \tag{10.4} \]
s
Z is the normalizing constant: Z = ∫ exp{E_{-q_i}[ln p(X, Θ)]}dθ_i. It's a constant w.r.t. θ_i, so we can cancel it out by including the appropriate term in the constant.
s
Therefore, L = −KL(q(θ_i) ‖ exp{E_{-q_i}[ln p(X, Θ)]}/Z) + const w.r.t. θ_i.
s
To maximize this over all q(θ_i), we need to set the KL to zero (since KL ≥ 0). Therefore,
\[ q(\theta_i) \propto \exp\{\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\}. \tag{10.5} \]
s
In words: we can find q(θ_i) by calculating the expected log joint likelihood using the other q's, exponentiating the result, and normalizing.
s
Notice that we get both the form and the parameters of q(θ_i) this way.
Example: Sparse regression
– Data: labeled pairs (x_n, z_n), x_n ∈ R^d, z_n ∈ {−1, 1}, d large.
– Model: z_n = sign(y_n), y_n ∼ N(x_n^Tw, σ²), w_j ∼ N(0, c_j), c_j ∼ Gam(a/d, b).
– Recall: Last week we did EM MAP for w, meaning we got a point estimate of w and integrated out y and c. In that case, we could set q(c_j) and q(y_n) to their exact posteriors conditioned on w because they were conditionally independent.
Variational inference
Step 1: Pick a factorization, say q(y, c, w) = ∏_n q(y_n) ∏_j q(c_j) q(w).
Step 2: Find the optimal q distributions.
– \( q(y_n) \propto \exp\{ \underbrace{\mathbb{E}_{-q}[\ln 1(z_n=\mathrm{sign}(y_n))]}_{\text{determines support of }y_n\ (\mathbb{R}_+ \text{ or } \mathbb{R}_-)} + \underbrace{\mathbb{E}_{-q}[\ln p(y_n|x_n,w)]}_{\text{determines function and params}} \} = TN_{z_n}(x_n^T\mathbb{E}_q[w],\,\sigma^2) \)
This would have been much harder by taking derivatives of L.
– \( q(c_j) \propto \exp\{\mathbb{E}_{-q}[\ln p(w_j|c_j)] + \ln p(c_j)\} \propto c_j^{-\frac{1}{2}}\exp\{-\tfrac{1}{2c_j}\mathbb{E}_q[w_j^2]\}\, c_j^{\frac{a}{d}-1}\exp\{-bc_j\} = \mathrm{GiG}\big(2b,\ \mathbb{E}_q[w_j^2],\ \tfrac{a}{d}-\tfrac{1}{2}\big) \)
Again, it would have been harder to define GiG(τ_1, τ_2, τ_3) and take derivatives of L.
– \( q(w) \propto \exp\{\mathbb{E}_{-q}[\ln p(\vec y|X,w)] + \mathbb{E}_{-q}[\ln p(w|\vec c)]\} = N(\mu,\Sigma) \) (a standard calculation)
Problem: w ∈ R^d and we're assuming d is large (e.g., 10,000). Therefore, calculating the d × d matrix Σ is impractical.
A solution: Use further factorization. Let q(w) = ∏_j q(w_j). We can again find the optimal form of q(w_j) to be N(µ_j, σ_j²).
More tricks: In this case, updating q(w_j) depends on q(w_{i≠j}). This can make things slow to converge. Instead, it works better to work directly with L and update the vectors µ = (µ_1, . . . , µ_d) and σ² = (σ_1², . . . , σ_d²) by setting, e.g., ∇_µ L = 0.
Conjugate exponential family models (CEF)
s
Variational inference in CEF models has a nice generic form. Here, conjugate means the prior/likelihood pairs of all model variables are conjugate. Exponential means all distributions are in the exponential family.
s
Review and notation: Pick a model variable θ_i. We have a likelihood and prior for it:
– Likelihood: Assume a conditionally independent likelihood
\[ p(w_1,\dots,w_N|\eta(\theta_i)) = \prod_{n=1}^{N}p(w_n|\eta_i) = \Big[\prod_{n=1}^{N}h(w_n)\Big]\exp\Big\{\eta_i^T\sum_{n=1}^{N}t(w_n) - N A(\eta_i)\Big\}, \tag{10.6} \]
where η(θ_i) ≡ η_i is the natural parameter, t(w_n) is the sufficient statistic vector, and A(η_i) is the log-normalizer.
– Prior: A conjugate prior is p(η_i) = f(χ, ν) exp{η_i^Tχ − νA(η_i)}.
– Posterior: p(η_i | \vec w) = f(χ′, ν′) exp{η_i^Tχ′ − ν′A(η_i)}, with χ′ = χ + Σ_n t(w_n) and ν′ = ν + N.
Variational Inference
s
We have variational distributions q(w_n) on each w_n and q(θ_i) on θ_i. Using the log → expectation → exponentiate "trick,"
\[ q(\theta_i) \propto \exp\Big\{\eta(\theta_i)^T\Big(\chi + \sum_{n=1}^{N}\mathbb{E}_q[t(w_n)]\Big) - (\nu+N)A(\eta(\theta_i))\Big\}. \tag{10.7} \]
s
Therefore, we immediately see that the optimal q distribution for a model variable with a prior
conjugate to the likelihood is in the same family as the prior.
s
In fact, we see that with variational inference, we take the expected value of the sufficient
statistics that would be used to calculate the true conditional posterior.
s
Contrast this with Gibbs sampling, where we have samples of t(w1 ), . . . , t(wN ) and we sample
a new value of θi directly using these sufficient statistics (that are themselves changing with
each iteration because they are also sampled).
The more direct calculation for CEF models
s
Imagine instead that we wanted to work directly with the variational objective function, L.
– Generically, L = E_q[ln p(X, θ)] − E_q[ln q(θ)]. With q(θ_i) = f(χ′, ν′) exp{η(θ_i)^Tχ′ − ν′A(η(θ_i))} and keeping only the terms involving θ_i, the CEF form gives
\[ \mathcal{L}_{\theta_i} = \mathbb{E}_q\big[\eta_i^T\big(\chi+\textstyle\sum_n t(w_n)\big)-(\nu+N)A(\eta_i)\big] - \mathbb{E}_q\big[\eta_i^T\chi'-\nu'A(\eta_i)\big] - \ln f(\chi',\nu') \]
\[ = \mathbb{E}_q[\eta_i]^T\Big(\chi+\sum_{n=1}^{N}\mathbb{E}_q[t(w_n)]-\chi'\Big) - (\nu+N-\nu')\,\mathbb{E}_q[A(\eta_i)] - \ln f(\chi',\nu'). \tag{10.8} \]
s
We need some equalities from the conjugate exponential family:
\[ \int q(\theta_i)\,d\theta_i = 1 \;\Rightarrow\; \int\nabla_{\chi'}q(\theta_i)\,d\theta_i = 0 = \int\big(\nabla_{\chi'}\ln f(\chi',\nu') + \eta_i\big)q(\theta_i)\,d\theta_i, \tag{10.9} \]
which means E_q[η_i] = −∇_{χ′} ln f(χ′, ν′), and
\[ \int q(\theta_i)\,d\theta_i = 1 \;\Rightarrow\; \int\frac{d}{d\nu'}q(\theta_i)\,d\theta_i = 0 = \int\Big(\frac{d}{d\nu'}\ln f(\chi',\nu') - A(\eta_i)\Big)q(\theta_i)\,d\theta_i, \tag{10.10} \]
which means E_q[A(η_i)] = \frac{d}{d\nu'}\ln f(\chi',\nu').
s
Therefore, the general form of the objective for θ_i is
\[ \mathcal{L}_{\theta_i} = -\nabla_{\chi'}\ln f(\chi',\nu')^T\Big(\chi+\sum_n\mathbb{E}_q[t(w_n)]-\chi'\Big) - \frac{d}{d\nu'}\ln f(\chi',\nu')\,(\nu+N-\nu') - \ln f(\chi',\nu'). \tag{10.11} \]
s
Our goal is to maximize this with respect to (χ′, ν′). If we take the gradient of L with respect to these parameters and set it to zero,
\[ \nabla_{(\chi',\nu')}\mathcal{L} = -\begin{bmatrix} \frac{\partial^2\ln f}{\partial\chi'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\chi'\partial\nu'} \\ \frac{\partial^2\ln f}{\partial\nu'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\nu'^2} \end{bmatrix}\begin{bmatrix} \chi+\sum_n\mathbb{E}_q[t(w_n)]-\chi' \\ \nu+N-\nu' \end{bmatrix} = 0, \tag{10.12} \]
we see that we get exactly the same updates that we derived before using the "log-expectation-exponentiate" approach. We can simply ignore the matrix and set the right vector equal to zero.
Variational inference from the gradient perspective
s
The most general way to optimize an objective function is via gradient ascent.
s
In the variational inference setting, this means updating q(θ_i|ψ_i) according to ψ_i ← ψ_i + ρM∇_{ψ_i}L, where ρ is a step size and M is a positive definite matrix.
s
Ideally, we want to maximize with respect to ψ_i, meaning set ψ_i such that ∇_{ψ_i}L = 0. We saw how we could do that in closed form for CEF models.
s
Equivalently, if \(\begin{bmatrix}\chi'\\ \nu'\end{bmatrix} \leftarrow \begin{bmatrix}\chi'\\ \nu'\end{bmatrix} + \rho M\,\nabla_{(\chi',\nu')}\mathcal{L}\), we can set ρ = 1 and M = −[the Hessian of ln f in (10.12)]^{-1}. Then this step moves directly to the closed-form solution for ψ_i (i.e., where ∇_{ψ_i}L = 0).
s
We observe that this Hessian of ln f equals the Hessian of ln q(θ_i) with respect to (χ′, ν′), so M is the inverse of the Fisher information matrix. When M is set to this value, we are said to be moving in the direction of the natural gradient.
s
Therefore, in sum, when we update χ′ and ν′ in closed form as previously discussed, we are implicitly taking a full step (since ρ = 1) in the direction of the natural gradient (because of the setting of M) and landing directly on the optimum. The important point being stressed here is that there is an implied gradient algorithm being executed when we calculate our closed-form parameter updates in variational inference for CEF models.
s
This last point motivates the following discussion on scalable extensions of variational inference for CEF models.
s
In many data modeling problems we have two issues:
– N is huge (# documents in a corpus, # users, etc.)
– Calculating q(w_n) requires a non-trivial amount of work
Both of these issues lead to slow algorithms. (How slow? E.g., one day = one iteration.)
s
We’ll next show how stochastic optimization can be applied to this problem in a straightforward
way for CEF models.
Simple notation
s
Let θ be a "global variable," meaning it's something that interacts with each unit of data.
s
Let w_n be a set of "local variables" and associated data. Given θ, w_n and w_{n'} are independent of each other.
Examples:
– Gaussian mixture model: θ = {π (mixing weights), µ_{1:K}, Σ_{1:K} (Gaussian params)}, w_n = {x_n (data), c_n (cluster)}.
– LDA: θ = {β_1, . . . , β_K} (topics), w_n = {θ_n (topic dist.), \vec c_n (word allocations), \vec x_n (word observations)}.
s
The variational objective function is
\[ \mathcal{L} = \sum_{n=1}^{N}\mathbb{E}_q\Big[\ln\frac{p(w_n|\theta)}{q(w_n)}\Big] + \mathbb{E}_q\ln p(\theta) - \mathbb{E}_q\ln q(\theta). \tag{10.13} \]
(10.13)
Imagine if we subsampled St ⊂ {1, . . . , N } and created
Lt =
N X h p(wn |θ) i
Eq ln
+ Eq ln p(θ) − Eq ln q(θ).
|St | n∈S
q(wn )
(10.14)
t
s
There are
s
Each wn appears in
s
Therefore, we can calculate ELt summing over the random subsets as follows
N
|St |
possible subsets of {1, . . . , N } of size |St |, each having probability 1/
N −1
|St |−1
of the
N
|St |
s
.
subsets.
−1 X
N
=
Lt
(10.15)
|St |
St
−1 h
N
N −1
p(wn |θ) i
N X N
Eq ln
=
+ Eq ln p(θ) − Eq ln q(θ).
|St | n=1 |St |
|St | − 1
q(wn )
ELt
N
|St |
Observe that
N −1 N −1
|St |
|St |−1
=
|St |
.
N
Therefore, ELt = L.
Stochastic variational inference (SVI)
s
Idea: What if, for each iteration, we subsampled a random subset of the w_n indexed by S_t, optimized their q(w_n) only, and then took a gradient step on the parameters of q(θ) using L_t instead of L? We would have SVI.
s
Stochastic variational inference: Though the technique can be made more general, we restrict our discussion to CEF models.
s
Method:
1. At iteration t, subsample w_1, . . . , w_N according to a randomly generated index set S_t.
2. Construct \(\mathcal{L}_t = \frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q\big[\ln\frac{p(w_n|\theta)}{q(w_n)}\big] + \mathbb{E}_q\ln p(\theta) - \mathbb{E}_q\ln q(\theta).\)
3. Optimize q(w_n) for n ∈ S_t.
4. Update q(θ|ψ) by setting ψ ← ψ + ρ_t M∇_ψ L_t.
A closer look at Step 4 (the key step)
s
Following the same calculations as before (and letting ψ = [χ′, ν′]),
\[ \nabla_{(\chi',\nu')}\mathcal{L}_t = -\begin{bmatrix} \frac{\partial^2\ln f}{\partial\chi'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\chi'\partial\nu'} \\ \frac{\partial^2\ln f}{\partial\nu'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\nu'^2} \end{bmatrix}\begin{bmatrix} \chi+\frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q[t(w_n)]-\chi' \\ \nu+N-\nu' \end{bmatrix}. \tag{10.16} \]
s
If we set \(M = -\begin{bmatrix} \frac{\partial^2\ln f}{\partial\chi'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\chi'\partial\nu'} \\ \frac{\partial^2\ln f}{\partial\nu'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\nu'^2} \end{bmatrix}^{-1}\), then this simplifies nicely:
\[ \begin{bmatrix}\chi'\\ \nu'\end{bmatrix} \leftarrow (1-\rho_t)\begin{bmatrix}\chi'\\ \nu'\end{bmatrix} + \rho_t\begin{bmatrix}\chi+\frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q[t(w_n)] \\ \nu+N\end{bmatrix}. \tag{10.17} \]
s
What has changed?
– We are now looking at subsets of data.
– Therefore, we don't set ρ_t = 1 like before. In general, stochastic optimization requires Σ_t ρ_t = ∞ and Σ_t ρ_t² < ∞ for convergence.
s
What is this doing?
– Before, we were making a full update using all the data (ρ_t = 1). Now, since we only look at a subset, we average the "full update" (restricted to the subset) with the most recent value of the parameters χ′ and ν′. Since ρ_t → 0, we weight this new information less and less.
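A hedged toy sketch of the step-4 update (10.17), using a model of my own choosing: x_n ∼ Pois(θ) with a conjugate Gam(χ, ν) prior on θ, so t(x_n) = x_n and the exact posterior is Gam(χ + Σ_n x_n, ν + N). This example has no local latent variables, so step 3 is trivial; the point is the averaged natural-gradient update and the step-size schedule.

```python
import numpy as np

def svi_gamma_poisson(x, chi=1.0, nu=1.0, batch=50, passes=20, kappa=0.7, tau=1.0, rng=0):
    """SVI for x_n ~ Pois(theta), theta ~ Gam(chi, nu) (shape-rate).
    (chi', nu') are the variational parameters, updated with Eq. (10.17)."""
    rng = np.random.default_rng(rng)
    N = len(x)
    chi_p, nu_p = chi, nu
    t = 0
    for _ in range(passes):
        for _ in range(N // batch):
            t += 1
            rho = (tau + t) ** (-kappa)               # sum rho = inf, sum rho^2 < inf
            S = rng.choice(N, size=batch, replace=False)
            chi_hat = chi + (N / batch) * x[S].sum()  # "full" update from the subsample
            nu_hat = nu + N
            chi_p = (1 - rho) * chi_p + rho * chi_hat
            nu_p = (1 - rho) * nu_p + rho * nu_hat
    return chi_p, nu_p

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.poisson(4.0, size=10000)
    chi_p, nu_p = svi_gamma_poisson(x)
    print("q(theta) mean:", round(chi_p / nu_p, 3),
          " exact posterior mean:", round((1 + x.sum()) / (1 + len(x)), 3))
```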
Chapter 11
Variational inference for non-conjugate models
s
Variational inference
Have: Data X, model variables Θ = {θ_i}, factorized distribution q(Θ) = ∏_i q(θ_i).

Setup:

    ln p(X) = ∫ (∏_i q(θ_i)) ln [ p(X, Θ) / ∏_i q(θ_i) ] dΘ  +  ∫ (∏_i q(θ_i)) ln [ ∏_i q(θ_i) / p(Θ|X) ] dΘ
    (constant)       (variational objective L)                         (KL divergence ≥ 0)
Variational inference: Find parameters of each q(θ_i) to maximize L, which simultaneously minimizes KL(q ‖ p(Θ|X)).
Potential problems: This assumes that we can actually compute L = Eq ln p(X, Θ) −
Eq ln q(Θ). Even if we use the “trick” from before, where we set q(θi ) ∝ exp{E−q ln p(X, Θ)},
we still need to be able to take these expectations.
s
Examples: We’ll give three examples where this problem arises, followed by possible solutions.
1. Poisson matrix factorization has a generative portion containing (loosely) something like Y ∼ Pois(a_1 b_1 + · · · + a_K b_K). Y is the data and a_i, b_i are r.v.'s we want to learn.
– Log-joint: ln p(Y, a, b) = Y ln Σ_i a_i b_i − Σ_i a_i b_i + Σ_i ln p(a_i)p(b_i) + const.
– Problem: E ln Σ_i a_i b_i is intractable for continuous distribution q(a, b) = ∏_i q(a_i)q(b_i).
2. Logistic normal models look like c_n ∼ Disc(p), with p_i ∝ e^{x_i} and x ∼ N(μ, Σ).
– Log-joint: ln p(c, x) = Σ_i x_i Σ_n 1(c_n = i) − N ln Σ_i e^{x_i} + ln p(x).
– Problem: −E ln Σ_i e^{x_i} is intractable, where q(x) = N(m, S).
3. Logistic regression has the process y_n ∼ σ(x_n^T w)δ_1 + (1 − σ(x_n^T w))δ_{−1}, where σ(a) = 1/(1 + e^{−a}).
– Log-joint: ln p(y, w|X) = −Σ_n ln(1 + e^{−y_n x_n^T w}) + ln p(w).
– Problem: −E ln(1 + e^{−y_n x_n^T w}) is intractable using continuous q(w).
s
Instant solution: Setting q(θ) = δθ corresponds to a point estimate of θ. We don’t need to do
the intractable integral in that case.
s
One solution to intractable objectives is to find a more tractable lower bound of the function
that’s intractable (since we’re in the maximization regime).
– The tighter the lower bound the better the function is being approximated. Therefore, not
all lower bounds are equally good.
– Lower bounds often use additional parameters that control how tight they are. These are
auxiliary variables that are optimized along with q.
s
Solutions to the three problems
1. ln(·) is concave. Therefore, ln E x ≥ E ln x. If we introduce p ∈ Δ_K, then

    E ln Σ_i a_i b_i = E ln Σ_i p_i (a_i b_i / p_i) ≥ E Σ_i p_i ln(a_i b_i / p_i) = Σ_i p_i E[ln a_i + ln b_i] − Σ_i p_i ln p_i     (11.1)

Often E ln x is tractable. After updating q(a_i) and q(b_i), we also need to update the vector p to tighten the bound.
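As an illustration of tightening this bound, here is a small sketch of the optimal auxiliary vector, p_i ∝ exp{E[ln a_i] + E[ln b_i]}. The Gamma variational factors (and their digamma-based log expectations) are an assumption made only for this example, not part of the notes.

```python
import numpy as np
from scipy.special import digamma

def update_p(E_ln_a, E_ln_b):
    """Tighten the Jensen bound in (11.1): the optimal auxiliary vector is
    p_i ∝ exp{E[ln a_i] + E[ln b_i]} (a softmax of the expected log terms)."""
    log_p = E_ln_a + E_ln_b
    log_p -= log_p.max()          # numerical stability
    p = np.exp(log_p)
    return p / p.sum()

# Example with hypothetical Gamma(shape, rate) variational factors,
# for which E[ln a_i] = digamma(shape) - ln(rate).
shape_a, rate_a = np.array([2.0, 1.0, 3.0]), np.array([1.0, 2.0, 1.5])
shape_b, rate_b = np.array([1.0, 1.0, 2.0]), np.array([1.0, 1.0, 1.0])
p = update_p(digamma(shape_a) - np.log(rate_a),
             digamma(shape_b) - np.log(rate_b))
```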
2. −ln(·) is convex, therefore we can use a first-order Taylor expansion:

    −E ln Σ_i e^{x_i} ≥ −ln ξ − E[ (d ln z/dz)|_{z=ξ} (Σ_i e^{x_i} − ξ) ] = −ln ξ − (Σ_i E e^{x_i} − ξ)/ξ     (11.2)

Usually we can calculate E e^{x_i} (an MGF). We then update ξ after q(x).
3. Finding a lower bound for −ln(1 + e^{−y_n x_n^T w}) is more challenging. One bound due to Jaakkola and Jordan in 2000 is

    −ln(1 + e^{−y_n x_n^T w}) ≥ ln σ(ξ_n) + (1/2)(y_n x_n^T w − ξ_n) − λ(ξ_n)(w^T x_n x_n^T w − ξ_n²)     (11.3)

where λ(ξ_n) = (2σ(ξ_n) − 1)/(4ξ_n).
– In this case there is a ξ_n for each x_n since the functions we want to bound are different for each observation.
– Though it's more complicated, we can take all expectations now since the bound is quadratic in w.
– In fact, when q(w) and p(w) are Gaussian, the solution for q is in closed form.
– After solving for q(w), we can get closed form updates of ξ_n.
s
The way we update all auxiliary variables, p, ξ and ξn , is by differentiation and setting to zero.
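As a quick sanity check on the bound in (11.3), the following small sketch evaluates it in the scalar variable z = y_n x_n^T w and numerically spot-checks that it lies below ln σ(z); the grid of z values and the choice ξ = 2 are arbitrary illustration choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    # λ(ξ) = (2σ(ξ) − 1) / (4ξ)
    return (2.0 * sigmoid(xi) - 1.0) / (4.0 * xi)

def jj_bound(z, xi):
    """Jaakkola-Jordan quadratic lower bound on ln σ(z) = −ln(1 + e^{−z}),
    here with z = y_n x_n^T w; it is tight at |z| = ξ."""
    return np.log(sigmoid(xi)) + 0.5 * (z - xi) - lam(xi) * (z**2 - xi**2)

# numerical spot check that the bound holds on a grid
z = np.linspace(-5, 5, 201)
assert np.all(np.log(sigmoid(z)) >= jj_bound(z, xi=2.0) - 1e-12)
```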
s
We next give a more concrete example found in Bayesian nonparametric models.
• A size-biased representation of the beta process (review)
• Definition: Let γ > 0 and μ a non-atomic probability measure. For k = 1, 2, . . . , draw θ_k ~iid~ μ and V_j ~iid~ Beta(γ, 1). If we define H = Σ_{k=1}^∞ (∏_{j=1}^k V_j) δ_{θ_k}, then H ∼ BP(1, γμ).
Proof: See notes from Lecture 4 for the proof.
• Bernoulli process (review)
s
Recall that if we have H, then we can use it to draw a sequence of Bernoulli processes Z⃗_n | H ~iid~ BeP(H), where Z⃗_n = Σ_{k=1}^∞ z_nk δ_{θ_k}, z_nk ∼ Bern(∏_{j=1}^k V_j).
(Z⃗_n is also often treated as latent and part of data generation)
Variational inference for the beta process (using this representation)
s
We simply focus on inference for V_1, V_2, . . . . Inference for θ_k and Z⃗_n are problem-specific (and possibly don't have non-conjugacy issues).
s
Let everything else in the model be represented by X. X contains all the data and any variables besides Z⃗ and Θ. (I'll switch to Z instead of Z⃗ now.)
s
Joint likelihood: The joint likelihood is the first thing we need to write out. This can factorize
as
p(X, Θ, Z, V ) = p(X|Θ, Z)p(Z|V )p(V )p(Θ).
(11.4)
s
Therefore, to update the variational distribution q(V ) given all other q distributions, we only
need to focus on p(Z|V )p(V ).
s
For now, we assume q(Z) is a delta function. That is, we have a 0-or-1 point estimate for the
values in Z.
s
So we focus on p(Z|V)p(V) with Z a point estimate. According to the prior/likelihood definition, this further factorizes as

    p(Z, V) = [ ∏_{n=1}^N ∏_{k=1}^∞ p(z_nk|V_1, . . . , V_k) ] [ ∏_{k=1}^∞ p(V_k) ]   ← truncate at T atoms
            ∝ [ ∏_{n=1}^N ∏_{k=1}^T (∏_{j=1}^k V_j)^{z_nk} (1 − ∏_{j=1}^k V_j)^{1−z_nk} ] [ ∏_{k=1}^T V_k^{γ−1} ]     (11.5)
              (likelihood)                                                                (prior)

s
We pick q(V) = ∏_{k=1}^T q(V_k) and leave the form undefined for now.
The variational objective related to V is

    L_V = Σ_{n=1}^N Σ_{k=1}^T [ z_nk Σ_{j=1}^k E_q ln V_j + (1 − z_nk) E_q ln(1 − ∏_{j=1}^k V_j) ] + Σ_{k=1}^T (γ − 1) E_q ln V_k − Σ_{k=1}^T E_q ln q(V_k)     (11.6)

Problem: The expectation E_q ln(1 − ∏_{j=1}^k V_j) is intractable.
Question: How can we update q(Vk )?
A solution: Try lower bounding ln(1 − ∏_{j=1}^k V_j) in such a way that it works.
Recall: An aside from Lecture 3 stated that 1 − ∏_{j=1}^k V_j = Σ_{i=1}^k (1 − V_i) ∏_{j=1}^{i−1} V_j. Therefore, we can try looking for a lower bound using this instead.
s
Since ln is a concave function, ln E X ≥ E ln X. Therefore, if we introduce a k-dimensional probability vector p, we have

    ln Σ_{i=1}^k p_i [ (1 − V_i) ∏_{j=1}^{i−1} V_j / p_i ] ≥ Σ_{i=1}^k p_i ln [ (1 − V_i) ∏_{j=1}^{i−1} V_j / p_i ]
                                                          = Σ_{i=1}^k p_i ln(1 − V_i) + Σ_{i=1}^k p_i Σ_{j=1}^{i−1} ln V_j − Σ_{i=1}^k p_i ln p_i     (11.7)

Therefore,

    E_q ln(1 − ∏_{j=1}^k V_j) ≥ Σ_{i=1}^k p_i E_q ln(1 − V_i) + Σ_{i=1}^k p_i Σ_{j=1}^{i−1} E_q ln V_j − Σ_{i=1}^k p_i ln p_i     (11.8)
(Aside) This use of p is the variational lower bound idea and is often how variational inference is presented in papers:

    ln p(X) = ln ∫ p(X, Θ) dΘ = ln ∫ q(Θ) [ p(X, Θ)/q(Θ) ] dΘ ≥ ∫ q(Θ) ln [ p(X, Θ)/q(Θ) ] dΘ ≡ L.     (11.9)
s We simply replace E_q ln(1 − ∏_{i=1}^k V_i) with this lower bound and hope it fixes the problem. Since we have this problematic term for all k = 1, . . . , T, we have T auxiliary probability vectors p^(1), . . . , p^(T).
s
We can still use the method of setting q(V_k) ∝ exp{E_{−q} ln p(Z, V)} to find the best q(V_k), but replacing p(Z, V) with its lower bound. Therefore, q(V_k) is the best q for the lower-bound version of L (the function we're now optimizing), not the true L.
In this case, we find that we get a (very complicated) analytical solution for q(V_k):

    q(V_k) ∝ V_k^{ γ + Σ_{n=1}^N Σ_{m=k}^T z_nm + Σ_{n=1}^N Σ_{m=k+1}^T (1 − z_nm)(Σ_{i=k+1}^m p_i^(m)) − 1 } × (1 − V_k)^{ 1 + Σ_{n=1}^N Σ_{m=k}^T (1 − z_nm) p_k^(m) − 1 }     (11.10)

But this simply has the form of a beta distribution with parameters that can be calculated “instantaneously” by a computer.
Un-addressed question: What do we set the probability vector p^(m) equal to?
Answer: We can treat it like a new model parameter and optimize over it along with the variational parameters. In general, if we want to optimize p in Σ_{i=1}^k p_i E y_i − Σ_{i=1}^k p_i ln p_i, we can set

    p_i ∝ exp{E y_i}.     (11.11)

In this case, E y_i = E ln(1 − V_i) + Σ_{j=1}^{i−1} E ln V_j, which is fast to compute.
s
We can iterate between updating p(1) , . . . , p(T ) and q(V1 ), . . . , q(VT ) before moving on to the
other q distribution, or update each once and move to the other q’s before returning.
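s
Putting the pieces together, here is a rough sketch of iterating the p^(m) updates (11.11) and the beta updates implied by (11.10). The array names (Z for the 0/1 point estimates z_nk, gamma_bp for γ) and the exact exponent bookkeeping follow my reading of the equations above; treat this as an illustration under those assumptions rather than a verified implementation.

```python
import numpy as np
from scipy.special import digamma

def update_aux_p(a, b):
    """p^(k)_i ∝ exp{E ln(1 - V_i) + sum_{j<i} E ln V_j}, for i = 1..k,
    with q(V_k) = Beta(a[k], b[k])."""
    E_ln_V = digamma(a) - digamma(a + b)
    E_ln_1mV = digamma(b) - digamma(a + b)
    T = len(a)
    P = []
    for k in range(T):
        log_p = E_ln_1mV[:k+1] + np.concatenate(([0.0], np.cumsum(E_ln_V)[:k]))
        log_p -= log_p.max()
        p = np.exp(log_p)
        P.append(p / p.sum())
    return P  # P[k] has length k+1

def update_qV(Z, P, gamma_bp):
    """Beta parameters of q(V_k) implied by the lower-bounded objective."""
    N, T = Z.shape
    a = np.full(T, gamma_bp)
    b = np.ones(T)
    miss = (1.0 - Z).sum(axis=0)            # number of z_nm = 0 for each m
    for k in range(T):
        a[k] += Z[:, k:].sum()              # z_nm = 1 terms with m >= k
        for m in range(k, T):
            b[k] += miss[m] * P[m][k]       # p_k^(m) weight on E ln(1 - V_k)
            if m > k:
                a[k] += miss[m] * P[m][k+1:m+1].sum()
    return a, b

# One pass of the iteration described above:
# P = update_aux_p(a, b); a, b = update_qV(Z, P, gamma_bp)
```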
Directly optimizing L
s
Rather than introduce approximations, we would like to find other ways to optimize L. Specifically, is there a way to optimize it without taking expectations at all?
Setup: Coordinate ascent
Given q(Θ) = ∏_i q(θ_i), the following is the general way to optimize L.
1. Pick a q(θi |φi ) ← φi are variational parameters
2. Set ∇φi L = 0 w.r.t. φi (if possible) or φi ← φi + ρ∇φi L (if not)
3. After updating each q(θi ) this way, repeat.
Problem: For q(θi |φi ), Eq ln p(X, Θ) isn’t (totally) in closed form.
Detail: Let

    ln p(X, Θ) = f_1(X, θ_i) + f_2(X, Θ)   →   L = E f_1 + E f_2 − E ln q,     (11.12)
                 (problem)    (no problem)             (E f_2 − E ln q ≡ h(X, Ψ))

f_1 isn't necessarily the log of an entire distribution, so f_1 and f_2 could contain θ_i. Ψ contains all the variational parameters of q.
s
What we need is ∇_{ψ_i} L = ∇_{ψ_i} E f_1(X, θ_i) + ∇_{ψ_i} h(X, Ψ), where the first term is hard and the second is easy.
Solution (version 1): Create an unbiased estimate of ∇_{ψ_i} E f_1:

    ∇_{ψ_i} E f_1 = ∇_{ψ_i} ∫ f_1(X, θ_i) q(θ_i|ψ_i) dθ_i = ∫ f_1(X, θ_i) q(θ_i|ψ_i) ∇_{ψ_i} ln q(θ_i|ψ_i) dθ_i     (11.13)

We use the log trick for this: q ∇_ψ ln q = (q/q) ∇_ψ q = ∇_ψ q.
s
Use Monte Carlo integration:

    E_q X ≈ (1/S) Σ_{s=1}^S X_s,   X_s ~iid~ q,   →   lim_{S→∞} (1/S) Σ_{s=1}^S X_s = E_q X     (11.14)
Therefore, we can use the following approximation,

    ∫ f_1(X, θ_i) q(θ_i|ψ_i) ∇_{ψ_i} ln q(θ_i|ψ_i) dθ_i = E_q[ f_1(X, θ_i) ∇_{ψ_i} ln q(θ_i|ψ_i) ]
        ≈ (1/S) Σ_{s=1}^S f_1(X, θ_i^(s)) ∇_{ψ_i} ln q(θ_i^(s)) ≡ ∇̂_{ψ_i} E f_1,   θ_i^(s) ~iid~ q(θ_i|ψ_i).     (11.15)

Therefore, ∇_{ψ_i} L ≈ ∇̂_{ψ_i} E f_1 + ∇_{ψ_i} h(X, Ψ), and the RHS is an unbiased approximation, meaning the expectation of the RHS equals the LHS.
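A minimal sketch of this estimator; f1, sample_q, and grad_log_q are user-supplied placeholders for a particular model (f1 is assumed scalar-valued here).

```python
import numpy as np

def score_function_grad(f1, sample_q, grad_log_q, psi, S=100):
    """Monte Carlo estimate of ∇_ψ E_q[f1(θ)] from (11.15), using the log trick.
    sample_q(psi) draws θ ~ q(θ|ψ); grad_log_q(θ, psi) returns ∇_ψ ln q(θ|ψ)."""
    grads = []
    for _ in range(S):
        theta = sample_q(psi)
        grads.append(f1(theta) * grad_log_q(theta, psi))
    return np.mean(grads, axis=0)   # unbiased, but possibly high-variance
```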
Problem with this
s
Even though the expectation of this approximation equals the truth, the variance might be huge,
meaning S should be large.
Review: If X̂_S = (1/S) Σ_{s=1}^S x_s with x_s ~iid~ q, then E X̂_S = E_q X and Var(X̂_S) = Var(X)/S. If we want Var(X̂_S) < ε, then we need S > Var(X)/ε (possibly huge!).
Need: Variance reduction, which is a way for approximating Ef (θ) that is unbiased, but has
smaller variance. One such method is to use a control variate.
Control variates: A way of approximating Eq f (θ) with less variance than using Monte Carlo
integration as described above.
Introduce a function g(θ) of the same variables such that
– g(θ) and f (θ) are highly correlated
– Eg(θ) can be calculated (unlike Ef (θ))
Procedure:
– Replace f (θ) with fˆ(θ) = f (θ) − a(g(θ) − Eg(θ)) in L, with a ∈ R
– Replace (1/S) Σ_{s=1}^S f(θ_s) ∇ ln q(θ_s) with (1/S) Σ_{s=1}^S f̂(θ_s) ∇ ln q(θ_s)
Why this works better:
– E_q f̂(θ) = E f(θ) − a(E g(θ) − E[E g(θ)]) = E f(θ)   ← unbiased
– Var(f̂(θ)) = E[(f̂(θ) − E f̂(θ))²] = Var(f(θ)) − 2a Cov(f(θ), g(θ)) + a² Var(g(θ))
s
Setting a = Cov(f, g)/Var(g) minimizes Var(f̂(θ)) above. In this case, we can see that Var(f̂(θ))/Var(f(θ)) = 1 − Corr(f, g)².
s
Therefore, when f and g are highly correlated, the variance of f̂ is much less than that of f. This translates to fewer samples to approximate E f̂(θ) within a given accuracy threshold than are needed for E f(θ) at the same threshold.
s
In practice, the optimal value a = Cov(f, g)/Var(g) can be approximated each iteration using samples.
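Continuing the sketch above, a control-variate version might look as follows. Here g and its closed-form mean Eg are assumed to be user-supplied, f1 and g are assumed scalar-valued, and a is estimated from the same samples (a common simplification, although strictly it couples a to the samples).

```python
import numpy as np

def cv_score_function_grad(f1, g, Eg, sample_q, grad_log_q, psi, S=100):
    """Score-function gradient with a control variate: replace f with
    f̂(θ) = f(θ) − a (g(θ) − E g(θ)), with a ≈ Cov(f, g)/Var(g)."""
    thetas = [sample_q(psi) for _ in range(S)]
    f_vals = np.array([f1(t) for t in thetas])
    g_vals = np.array([g(t) for t in thetas])
    a = np.cov(f_vals, g_vals)[0, 1] / np.var(g_vals)
    f_hat = f_vals - a * (g_vals - Eg)
    grads = [f_hat[s] * grad_log_q(thetas[s], psi) for s in range(S)]
    return np.mean(grads, axis=0)
```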
Modified coordinate ascent
1. Pick a q(θi |ψi )
2. If L is in closed form w.r.t. q(θ_i|ψ_i), optimize the normal way. Else, approximate ∇_{ψ_i} L by ∇*_{ψ_i} L = ∇̂_{ψ_i} E f_1 + ∇_{ψ_i} h(X, Ψ) and set ψ_i ← ψ_i + ρ ∇*_{ψ_i} L, possibly taking multiple steps.
3. After each q(θi |ψi ) has been updated, repeat.
Picking control variates
– Lower bounds: By definition, a tight lower bound is highly correlated with the function it
approximates. It’s also picked to give closed form expectations.
– Taylor expansions (2nd order): A second-order Taylor expansion is not a bound, but often it is a better approximation than a bound. In this case,

    g(θ) = f(μ) + (θ − μ)^T ∇f(μ) + (1/2)(θ − μ)^T ∇²f(μ)(θ − μ).     (11.16)
If the second moments of q(θ) can be calculated, then this is a possible control variate. µ
could be the mean of q(θ|ψ) given the current value of ψ.
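For a Gaussian q(θ) = N(m, S), the mean of this Taylor control variate is available in closed form; a small sketch (f and hess_f are user-supplied callables):

```python
import numpy as np

def taylor_cv_mean(f, hess_f, m, S):
    """Closed-form E_q[g(θ)] for the 2nd-order Taylor control variate (11.16),
    expanded at μ = m, when q(θ) = N(m, S): E[θ - m] = 0 kills the linear term
    and E[(θ - m)(θ - m)^T] = S gives (1/2) tr(∇²f(m) S) from the quadratic term."""
    return f(m) + 0.5 * np.trace(hess_f(m) @ S)
```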
Chapter 12
Scalable MCMC inference
Abstract
In the last class, I present two recent papers that address scalability for Markov chain Monte Carlo (MCMC) inference:
1. M. Welling and Y.W. Teh, “Bayesian Learning via Stochastic Gradient Langevin
Dynamics,” International Conference on Machine Learning, 2011.
2. D. Maclaurin and R.P. Adams, “Firefly Monte Carlo: Exact MCMC with Subsets of Data,” Uncertainty in Artificial Intelligence, 2014.
These presentations were given using slides, which I transcribe verbatim.
s
Tons of digitized data that can potentially be modeled:
s
Online “documents” (blogs, reviews, tweets, comments, etc.)
s
Audio, video & images (Pandora, YouTube, Flickr, etc.)
s
Social media, medicine, economics, astronomy, etc.
It used to be that the problem was turning a mathematical model into reality in the most basic
sense:
s
How do we learn it? (computer innovations)
s
Yes, but how do we learn it? (MCMC, optimization methods)
Given the massive amount of data now readily accessible by a computer, all of which we want
to use, the new question is:
s
s
Yes, but how do we learn it in a reasonable amount of time?
Up to (and often including) now, the default solution has been to sub-sample a manageable
amount and throw the rest away.
Justification:
− The large data we have is itself a small subsample of the set of all possible data.
− So if we think that’s a reasonable statistical representation, what’s wrong with a little less
(relatively speaking)?
s
Example 1: Netflix has 17K movies, 480K users, 100M ratings
− There are 8,160M potential ratings (1.2% measured), so if we remove users to learn about
the movies, what can it hurt?
s
Example 2: We have 2M newspaper articles out of roughly 1000× as many existing articles. Why not subsample to learn the topics?
s
We want to (and should) use all data since the more data we have, the more information we can
extract.
s
This translates to more “expressive” (i.e., complicated) probabilistic models in the context of
this class.
s
This hinges on having the computational resources to do it:
s
s
More powerful computers,
s
Recognizing and exploiting parallelization,
s
Using inference tricks to process all the data more efficiently
Recent successes in large scale machine learning have mostly been optimization-based approaches:
s
Stochastic optimization has a huge literature
s
Stochastic variational inference ports this over to the problem of approximate posterior
learning
s
The common denominator of these methods is that they process small subsets of the data at a
time.
s
Example: With SVI we saw how each piece of data had its own local variables and all data
shared global variables.
s
s
Pick a subset of the data,
s
update their local variables,
s
then perform a gradient step on the global variables.
This idea of processing different subsets of data for each iteration is a general one.
s
Can it be extended to MCMC inference?
s
We’ll review two recent ML papers on this:
s
M. Welling and Y.W. Teh, “Bayesian Learning via Stochastic Gradient Langevin Dynamics,” ICML 2011.
s
D. Maclaurin and R.P. Adams, “Firefly Monte Carlo: Exact MCMC with Subsets of Data,”
UAI 2014.
Bayesian Learning via Stochastic Gradient Langevin Dynamics
s
This SGLD algorithm combines two things:
s
Stochastic gradient optimization
s
Langevin dynamics, which incorporates noise to the updates
s
The stochastic gradients and Langevin dynamics are combined in such a way that MCMC
sampling results.
s
Considers a simple modeling framework with approximations
(Alert: possible future research!)
s
Let θ denote model vector and p(θ) its prior distribution.
s
Let p(x|θ) be the probability of data x given model variable θ.
s
The posterior distribution of N observations X = {x_1, . . . , x_N} is assumed to have the form

    p(θ|X) ∝ p(θ) ∏_{i=1}^N p(x_i|θ).
s
s
How do we work with this? Rough outline of following slides:
s
Maximum a posteriori with gradient methods
s
Stochastic gradient methods for MAP (big data extension)
s
Metropolis-Hastings MCMC sampling
s
Langevin diffusions (gradient + MH-type algorithm)
s
Stochastic gradient for Langevin diffusions
A maximum a posteriori estimate of θ is

    θ_MAP = arg max_θ  ln p(θ) + Σ_{i=1}^N ln p(x_i|θ).
This can be (locally) optimized using gradient ascent, θ_{t+1} = θ_t + Δθ_t, where Δθ_t is the weighted gradient of the log joint likelihood

    Δθ_t = (ε_t/2) ( ∇ ln p(θ_t) + Σ_{i=1}^N ∇ ln p(x_i|θ_t) ).
s
Notice that this requires evaluating p(xi |θt ) for every xi .
s
Stochastic gradient ascent optimizes exactly the same objective function (or finds a local
optimal solution).
s
The only difference is that for each step t, a subset of n ≪ N observations is used in the step θ_{t+1} = θ_t + Δθ_t:

    Δθ_t = (ε_t/2) ( ∇ ln p(θ_t) + (N/n) Σ_{i=1}^n ∇ ln p(x_ti|θ_t) ).
s
xti is the ith observation in subset t:
s
s
This set is usually talked of as being randomly generated
s
In practice, contiguous chunks of n are often used
For the stochastic gradient

    Δθ_t = (ε_t/2) ( ∇ ln p(θ_t) + (N/n) Σ_{i=1}^n ∇ ln p(x_ti|θ_t) ),

there are two requirements on ε_t to provably ensure convergence (see Léon Bottou's papers, e.g., one in 1998),

    Σ_{t=1}^∞ ε_t = ∞,   Σ_{t=1}^∞ ε_t² < ∞.
Intuitively:
s
The first one ensures that the parameters can move as far as necessary to find the local optimum.
s
The second one ensures that the “noise” from using n ≪ N does not have “infinite impact” and the algorithm converges.
s
Often used: ε_t = a/(b + t)^γ with γ ∈ (0.5, 1].
s
MAP (and ML) approaches are good tools to have, but they don't capture uncertainty and can lead to overfitting.
s
A typical Bayesian approach is to use MCMC sampling, which is a set of techniques that gives samples from the posterior.
One MCMC approach is “random walk Metropolis-Hastings”:
s
Sample η_t ∼ N(0, I) and set θ*_{t+1} = θ_t + η_t
− i.e., θ*_{t+1} ∼ q(θ*_{t+1}|θ_t) = N(θ_t, I).
s
Set θ_{t+1} = θ*_{t+1} with probability

    [ p(X, θ*_{t+1}) q(θ_t|θ*_{t+1}) ] / [ p(X, θ_t) q(θ*_{t+1}|θ_t) ]  =  p(X, θ*_{t+1}) / p(X, θ_t),

otherwise θ_{t+1} = θ_t.
s
We evaluate the complete joint likelihood to calculate the probability of acceptance.
s
After t > T, we can treat θ_t as samples from the posterior distribution p(θ|X) and use them for empirical averages.
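s
For reference, a minimal random-walk MH sketch; log_joint is a user-supplied ln p(X, θ) = ln p(θ) + Σ_i ln p(x_i|θ), and the per-iteration call to it is exactly the full-data evaluation the scalable methods below try to avoid.

```python
import numpy as np

def random_walk_mh(log_joint, theta0, num_iters=5000, step=1.0):
    """Random walk Metropolis-Hastings sketch; with a symmetric Gaussian
    proposal the q terms cancel in the acceptance ratio."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_joint(theta)
    samples = []
    for t in range(num_iters):
        prop = theta + step * np.random.randn(*theta.shape)
        lp_prop = log_joint(prop)          # touches every x_i -- the bottleneck
        if np.log(np.random.rand()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```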
s
Langevin dynamics combines gradient methods with the idea of random walk MH sampling.
s
Instead of treating t as an index, think of it as a continuous time and let

    θ_{t+Δt} = θ_t + Δθ_t,   Δθ_t = (Δt/2) ∇ ln p(X, θ_t) + η_t,   η_t ∼ N(0, Δt I).
We have:
s
A random Gaussian component like random walk MH
s
A derivative component like gradient ascent.
s
We could then have an accept/reject step similar to the one on the previous slide (but q doesn’t
cancel out).
s
What happens to θ_{t+Δt} − θ_t = Δθ_t when Δt → 0?

    dθ_t = (1/2) ∇ ln p(X, θ_t) dt + dW_t,   where W_t is a standard Brownian motion.
s
We can show that we will always accept the new location.
s
On RHS: The right term randomly explores the space, while the left term pulls in the direction
of higher density.
s
The path that θ maps out mimics the posterior density (beyond the scope of class to prove this).
s
What does this mean?
−If we wait and collect samples of θ every ∆t amount of time (sufficiently large), we can treat
them as samples from p(θ|X).
s
We would naturally want to think of discretizing this.
i.e., set

    θ_{t+Δt} = θ_t + Δθ_t,   Δθ_t = (Δt/2) ∇ ln p(X, θ_t) + η_t,   η_t ∼ N(0, Δt I),

for 0 < Δt ≪ 1, and accept with probability 1.
s
(Notice an equivalent representation is used above.)
s
The authors consider replacing Δt with ε_t such that Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞.
s
This leads to the insight and contribution of the paper:
Replace ∇ ln p(X, θt ) with its stochastic gradient.
s
That is, let θ_{t+1} = θ_t + Δθ_t with

    Δθ_t = (ε_t/2) ( ∇ ln p(θ_t) + (N/n) Σ_{i=1}^n ∇ ln p(x_ti|θ_t) ) + η_t,   η_t ∼ N(0, ε_t I),

again with Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞.
s
Notice that the only difference between this and stochastic gradient for MAP is the addition of
ηt .
s The authors give intuition why we can skip the accept/reject step by setting Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞, and also why the following converges to the Langevin dynamics:

    Δθ_t = (ε_t/2) ( ∇ ln p(θ_t) + (N/n) Σ_{i=1}^n ∇ ln p(x_ti|θ_t) ) + √ε_t η_t,   η_t ∼ N(0, I)   ← notice the equivalent representation
s
It’s not intended to be a proof, so we won’t go over it.
s
However, notice that:
s √ε_t is huge (relatively) compared to ε_t for 0 < ε_t ≪ 1.
s
The sums of both terms individually are infinite, so both provide an “infinite” amount of
information.
s
The sum of the variances of both terms is finite for the first and infinite for the second, so
sub-sampling has finite impact.
The paper considers the toy problem

    θ_1 ∼ N(0, σ_1²),   θ_2 ∼ N(0, σ_2²),   x_i ∼ (1/2) N(θ_1, σ_x²) + (1/2) N(θ_1 + θ_2, σ_x²).
s
Set θ1 = 0, θ2 = 1, sample N = 100 points.
s
Set n = 1 and process the data set 10000 times.
s
(Showed figures and experimental details from the paper.)
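s
Below is a rough SGLD sketch on this toy problem. The variances σ_1², σ_2², σ_x², the step-size constants (a, b, γ), and the number of sweeps are placeholder choices for illustration and are not taken from the paper.

```python
import numpy as np

# Toy model: theta1 ~ N(0, s1^2), theta2 ~ N(0, s2^2),
# x_i ~ 0.5 N(theta1, sx^2) + 0.5 N(theta1 + theta2, sx^2).
np.random.seed(0)
s1_sq, s2_sq, sx_sq = 10.0, 1.0, 2.0      # placeholder variances
theta_true = np.array([0.0, 1.0])
N, n = 100, 1
comp = np.random.rand(N) < 0.5
X = np.where(comp, theta_true[0], theta_true[0] + theta_true[1]) \
    + np.sqrt(sx_sq) * np.random.randn(N)

def grad_log_lik(theta, x):
    """Gradient of ln[0.5 N(x;t1,sx^2) + 0.5 N(x;t1+t2,sx^2)] w.r.t. (t1, t2),
    summed over the minibatch x."""
    t1, t2 = theta
    d1, d2 = x - t1, x - t1 - t2
    # responsibility of the second mixture component (the 0.5's cancel)
    r = 1.0 / (1.0 + np.exp((d2**2 - d1**2) / (2.0 * sx_sq)))
    g1 = ((1 - r) * d1 + r * d2) / sx_sq
    g2 = r * d2 / sx_sq
    return np.array([np.sum(g1), np.sum(g2)])

def grad_log_prior(theta):
    return -theta / np.array([s1_sq, s2_sq])

theta = np.zeros(2)
samples = []
a, b, gamma = 0.01, 1.0, 0.55             # step sizes eps_t = a / (b + t)^gamma
for t in range(1, 100 * N + 1):           # many sweeps with minibatch size n = 1
    eps = a / (b + t) ** gamma
    idx = np.random.choice(N, n, replace=False)
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(theta, X[idx])
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * np.random.randn(2)
    samples.append(theta.copy())
samples = np.array(samples)
```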
Firefly Monte Carlo: Exact MCMC with Subsets of Data
s
We’re again in the setting where we have a lot of data that’s conditionally independent given a
model variable θ.
s
MCMC algorithms such as random walk Metropolis-Hastings are time consuming:
∗
Set θt+1 ← θt+1
∼ N (θt , I) w.p.
Q
∗ )
∗ )
p(θt+1
p(xi |θt+1
Qi
.
p(θt ) i p(xi |θt )
s
This requires many evaluations in the joint likelihood.
s
Evaluating subsets for each t is promising, but previous methods are only approximate or
asymptotically correct.
s
This paper presents a method for performing exact MCMC sampling by working with subsets.
s
High level intuition: Introduce a binary switch for each observation and only process those for
which the switch is on.
s
The switches change randomly each iteration, so they are like fireflies (1 = light, 0 = dark).
Hence “Firefly Monte Carlo”.
s
This does require nice properties, tricks and simplifications which make this a non-trivial
method for each new model.
s
Again, we have a parameter of interest θ with prior p(θ).
s
We have N observations X = {x_1, . . . , x_N} with N huge.
s
The likelihood factorizes, therefore the posterior

    p(θ|X) ∝ p(θ) ∏_{n=1}^N p(x_n|θ).
s
For notational convenience let L_n(θ) = p(x_n|θ), so

    p(θ|X) ∝ p(θ) ∏_{n=1}^N L_n(θ).
s
The initial setup for FlyMC is remarkably straightforward:
s
Introduce a new function B, with Bn (θ) the evaluation on xn with parameter θ, such that
0 < Bn (θ) ≤ Ln (θ) for all θ.
s
With each x_n associate an auxiliary variable z_n ∈ {0, 1} distributed as

    p(z_n|x_n, θ) = [ (L_n(θ) − B_n(θ)) / L_n(θ) ]^{z_n} [ B_n(θ) / L_n(θ) ]^{1−z_n}.
The augmented posterior is

    p(θ, Z|X) ∝ p(θ) ∏_{n=1}^N p(x_n|θ) p(z_n|x_n, θ).
s
Summing out z_n from p(θ) ∏_{n=1}^N p(x_n|θ) p(z_n|x_n, θ) leaves the original joint likelihood intact, as required.
s
Therefore, this is an auxiliary variable sampling method:
s
Before: Sample θ with Metropolis-Hastings
s
Now: Iterate sampling each zn and then θ with MH.
s
As with all MCMC methods, to get samples from the desired p(θ|X), keep the θ samples and
throw away the Z samples.
s
What has been done so far is always possible as a general rule. The question is if it makes
things easier (here, faster).
To this end, we need to look at the augmented joint likelihood. For a particular term,

    p(x_n|θ) p(z_n|x_n, θ) = L_n(θ) [ (L_n(θ) − B_n(θ)) / L_n(θ) ]^{z_n} [ B_n(θ) / L_n(θ) ]^{1−z_n}
                           = L_n(θ) − B_n(θ)   if z_n = 1,
                             B_n(θ)            if z_n = 0.

s
For each iteration, we only need to evaluate the likelihood L_n(θ) at points where z_n = 1.
Alarm bells: What about Bn (θ)?
s
What about all the Bn (θ), which appear regardless of zn ?
s
Look at the augmented joint likelihood:

    p(X, Z, θ) = p̂(θ) ∏_{n: z_n=1} L̂_n(θ),   p̂(θ) = p(θ) ∏_{n=1}^N B_n(θ),   L̂_n(θ) = (L_n(θ) − B_n(θ)) / B_n(θ).

s
When we sample a new θ using MH, we compute p(X, Z, θ*_{t+1}) / p(X, Z, θ_t).
s
We have to evaluate
s p(θ) → fast
s L̂_n(θ) for each n : z_n = 1 → fast if Σ_n z_n ≪ N
s ∏_{n=1}^N B_n(θ) → fast?
This therefore indicates the two requirements we have about B_n(θ) for this to work well:
s B_n(θ) is close to L_n(θ) for many n and θ so that few z_n = 1.
s ∏_{n=1}^N B_n(θ) has to be quickly evaluated.
s
Condition 1 states that B_n(θ) is a tight lower bound of L_n(θ). Recall that p(z_n = 1|x_n, θ) = (L_n(θ) − B_n(θ)) / L_n(θ).
s
Condition 2 essentially means that B allows us to only have to evaluate a function of θ and statistics of X = {x_1, . . . , x_N}.
s
(Showed intuitive figure from paper.)
This leaves two issues:
s
Actually choosing the function Bn (θ).
s
Sampling zn for each n.
s
Notice that to sample z_n ∼ Bern((L_n(θ) − B_n(θ)) / L_n(θ)), we have to evaluate L_n(θ) and B_n(θ) for each n.
s
So our computations have actually increased.
s
The paper presents methods for dealing with this as well.
s
Problem 1: Choosing B_n(θ). Solution: Problem specific.
s
Example: Logistic regression. For x_n ∈ R^d and y_n ∈ {−1, 1},

    L_n(θ) = (1 + e^{−y_n x_n^T θ})^{−1} ≥ e^{ a(ξ)(y_n x_n^T θ)² + (1/2)(y_n x_n^T θ) + c(ξ) } = B_n(θ).

s
a(ξ) and c(ξ) are functions of auxiliary parameter ξ.
s
The product is

    ∏_{n=1}^N B_n(θ) = e^{ a(ξ) θ^T (Σ_n x_n x_n^T) θ + (1/2) θ^T (Σ_n y_n x_n) + N c(ξ) }.

s
We can calculate the necessary statistics from the data in advance. ∏_{n=1}^N B_n(θ) is then extremely fast to compute.
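s
A small sketch of that precomputation for the logistic regression bound: the d×d and d-dimensional statistics are computed once, after which ln ∏_n B_n(θ) costs O(d²) per MH step. The values a_xi and c_xi stand in for the paper's functions a(ξ) and c(ξ), which are not reproduced here.

```python
import numpy as np

def precompute_stats(X, y):
    """X is N x d, y is in {-1, +1}^N; computed once before sampling."""
    S_xx = X.T @ X                      # sum_n x_n x_n^T  (d x d)
    s_yx = X.T @ y                      # sum_n y_n x_n    (d,)
    return S_xx, s_yx

def log_prod_B(theta, S_xx, s_yx, N, a_xi, c_xi):
    """ln prod_n B_n(theta) = a(xi) theta^T S_xx theta
       + (1/2) theta^T s_yx + N c(xi): O(d^2) instead of O(N d)."""
    return a_xi * theta @ S_xx @ theta + 0.5 * theta @ s_yx + N * c_xi

def log_L(theta, x, y):
    """ln L_n(theta) = -ln(1 + exp(-y x^T theta)); needed only when z_n = 1."""
    return -np.log1p(np.exp(-y * (x @ theta)))
```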
s
Problem 2: Sampling zn . Solution: Two general methods.
s
Simplest: Sample new zn for random subset each iteration.
s
Alternative: Introduce MH step to propose switching light→dark and dark→light.
s
Focus on z_n:
s Propose z_n → z_n′ by sampling z_n′ ∼ q(z_n′|z_n).
s Accept with probability [ p(z_n′|x_n, θ) q(z_n|z_n′) ] / [ p(z_n|x_n, θ) q(z_n′|z_n) ].
s Immediate observation: Always accept if z_n′ = z_n.
s
Focus on z_n:
s Propose z_n → z_n′ by sampling z_n′ ∼ q(z_n′|z_n).
s Accept with probability [ p(z_n′|x_n, θ) q(z_n|z_n′) ] / [ p(z_n|x_n, θ) q(z_n′|z_n) ].
s Case z_n′ = 0, z_n = 1: Sample the total number of proposals from Bin(Σ_n z_n, q_{1→0}), then pick which x_n uniformly w.o. replacement.
  Observation: Can re-use L_n(θ) and B_n(θ) if rejected.
s Case z_n′ = 1, z_n = 0: Sample the total from Bin(N − Σ_n z_n, q_{0→1}), then pick which x_n uniformly w.o. replacement.
  Observation: Want q_{0→1} small to reduce evaluations of L, B.
s
Final consideration: We want Bn (θ) tight, which is done by setting its auxiliary parameters.
s
The authors discuss this in their paper for the logistic regression problem.
s
It’s an important consideration since it will control how many zn = 1 in an iteration
Tighter Bn (θ) means fewer zn = 1 means faster algorithm.
s
This procedure may require running a fast algorithm in advance (e.g., stochastic MAP).
s
(Showed figures and experiments from the paper.)