Dirichlet Process

Graphical Models
Lecture 21: Topic Models & Dirichlet Processes
Andrew McCallum
[email protected]
1
Thanks to Yee Whye Teh, Tom Griffiths and Erik Sudderth for some slide materials.
Background
Dirichlet Distribution
2
Dirichlet Distribution
A "dice factory"
• This distribution is defined over a "(k-1)-simplex"
  (k non-negative arguments which sum to one).
• The Dirichlet is the conjugate prior to the multinomial.
  (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)
• The Dirichlet parameter αi can be thought of as a prior count of the ith class.
Question: How likely is multinomial θ?
Answer: What probability would it give to the counts αi?
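As an illustration not on the slide, here is a minimal NumPy sketch of that conjugacy statement, with made-up values: start from a Dirichlet(α) prior over a k-sided die, observe multinomial counts, and the posterior is Dirichlet with the counts added to α.

import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 2.0, 2.0])        # Dirichlet prior: pseudo-counts for a 3-sided "die"
theta = rng.dirichlet(alpha)             # one draw from the "dice factory": a point on the 2-simplex
counts = rng.multinomial(100, theta)     # observe 100 rolls of that die

# Conjugacy: the posterior over theta is again Dirichlet, with the observed counts added in.
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(theta, posterior_mean)             # posterior mean is close to the die that generated the data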
Dirichlet Distribution
• Multivariate equivalent of the Beta distribution
  (a "coin factory")
• Parameters α determine the form of the prior:
  with small α, most mass is concentrated on a few outcomes.
  Important for later!
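A quick illustrative sketch (mine, not the slides') of that sparsity effect: draws from a symmetric Dirichlet with small α put almost all of their mass on a few components, while large α gives near-uniform draws.

import numpy as np

rng = np.random.default_rng(1)
k = 10
for a in (0.01, 0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(k, a))
    # With small a, one or two entries carry almost all of the probability mass.
    print(f"alpha={a:5.2f}  max component={theta.max():.2f}")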
Latent Dirichlet Allocation
A tool for discovering interpretable "topics" from large collections of documents
5
Analysis of PNAS abstracts
• Test topic models with a real database of scientific papers from PNAS
• All 28,154 abstracts from 1991-­‐2001
• All words occurring in at least five abstracts, not on “stop” list (20,551)
• Total of 3,026,970 tokens in corpus
A selection of topics
(top 20 words for each of six discovered topics)
• FORCE, SURFACE, MOLECULES, SOLUTION, SURFACES, MICROSCOPY, WATER, FORCES, PARTICLES, STRENGTH, POLYMER, IONIC, ATOMIC, AQUEOUS, MOLECULAR, PROPERTIES, LIQUID, SOLUTIONS, BEADS, MECHANICAL
• HIV, VIRUS, INFECTED, IMMUNODEFICIENCY, CD4, INFECTION, HUMAN, VIRAL, TAT, GP120, REPLICATION, TYPE, ENVELOPE, AIDS, REV, BLOOD, CCR5, INDIVIDUALS, ENV, PERIPHERAL
• MUSCLE, CARDIAC, HEART, SKELETAL, MYOCYTES, VENTRICULAR, MUSCLES, SMOOTH, HYPERTROPHY, DYSTROPHIN, HEARTS, CONTRACTION, FIBERS, FUNCTION, TISSUE, RAT, MYOCARDIAL, ISOLATED, MYOD, FAILURE
• STRUCTURE, ANGSTROM, CRYSTAL, RESIDUES, STRUCTURES, STRUCTURAL, RESOLUTION, HELIX, THREE, HELICES, DETERMINED, RAY, CONFORMATION, HELICAL, HYDROPHOBIC, SIDE, DIMENSIONAL, INTERACTIONS, MOLECULE, SURFACE
• NEURONS, BRAIN, CORTEX, CORTICAL, OLFACTORY, NUCLEUS, NEURONAL, LAYER, RAT, NUCLEI, CEREBELLUM, CEREBELLAR, LATERAL, CEREBRAL, LAYERS, GRANULE, LABELED, HIPPOCAMPUS, AREAS, THALAMIC
• TUMOR, CANCER, TUMORS, HUMAN, CELLS, BREAST, MELANOMA, GROWTH, CARCINOMA, PROSTATE, NORMAL, CELL, METASTATIC, MALIGNANT, LUNG, CANCERS, MICE, NUDE, PRIMARY, OVARIAN
Hot and cold topics
Hot topics (increasing in prevalence over the decade):
• Topic 2: SPECIES, GLOBAL, CLIMATE, CO2, WATER, ENVIRONMENTAL, YEARS, MARINE, CARBON, DIVERSITY, OCEAN, EXTINCTION, TERRESTRIAL, COMMUNITY, ABUNDANCE
• Topic 134: MICE, DEFICIENT, NORMAL, GENE, NULL, MOUSE, TYPE, HOMOZYGOUS, ROLE, KNOCKOUT, DEVELOPMENT, GENERATED, LACKING, ANIMALS, REDUCED
• Topic 179: APOPTOSIS, DEATH, CELL, INDUCED, BCL, CELLS, APOPTOTIC, CASPASE, FAS, SURVIVAL, PROGRAMMED, MEDIATED, INDUCTION, CERAMIDE, EXPRESSION
Cold topics (decreasing in prevalence over the decade):
• Topic 37: CDNA, AMINO, SEQUENCE, ACID, PROTEIN, ISOLATED, ENCODING, CLONED, ACIDS, IDENTITY, CLONE, EXPRESSED, ENCODES, RAT, HOMOLOGY
• Topic 289: KDA, PROTEIN, PURIFIED, MOLECULAR, MASS, CHROMATOGRAPHY, POLYPEPTIDE, GEL, SDS, BAND, APPARENT, LABELED, IDENTIFIED, FRACTION, DETECTED
• Topic 75: ANTIBODY, ANTIBODIES, MONOCLONAL, ANTIGEN, IGG, MAB, SPECIFIC, EPITOPE, HUMAN, MABS, RECOGNIZED, SERA, EPITOPES, DIRECTED, NEUTRALIZING
The generative view: latent structure (meaning) produces observed data (words) through a probabilistic process.
Statistical inference runs the other way: from observed data (words) back to latent structure (meaning).
Latent Dirichlet allocation
(Blei, Ng, & Jordan, 2001; 2003)
Dirichlet priors:
• α: prior on each document's distribution over topics, θ(d) ∼ Dirichlet(α)
• β: prior on each topic's distribution over words, φ(j) ∼ Dirichlet(β)   (plate of T topics)
Generative process:
• zi ∼ Discrete(θ(d)): topic assignment for each word
• wi ∼ Discrete(φ(zi)): word generated from its assigned topic
(Plates: Nd words per document, D documents.)
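As a sketch of the generative process just listed, written directly from the four sampling statements above (the sizes and seed are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
T, D, V, Nd = 5, 20, 100, 50          # topics, documents, vocabulary size, words per document
alpha, beta = 0.1, 0.01               # Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)      # phi[j]: distribution over words for topic j
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # theta[d]: distribution over topics for document d
    z = rng.choice(T, size=Nd, p=theta)            # topic assignment for each word
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])   # word drawn from its assigned topic
    docs.append(w)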
Inference in LDA
High tree width = intractable exact inference
[Graphical model, unrolled: α feeds each document's θ; each θ generates the topic assignments z1, ..., z12, which together with the topic-word distributions (β) generate the observed words w1, ..., w12.]
Approximate inference:
– Variational Mean Field
– Gibbs Sampling
– Collapsed Gibbs Sampling
The collapsed Gibbs sampler
• Using conjugacy of the Dirichlet and multinomial distributions, integrate out the continuous parameters
• This defines a distribution on discrete ensembles z
The collapsed Gibbs sampler
• Sample each zi conditioned on z−i
• Notation:
  – i indexes over words w and their topic assignments z
  – j indexes over topics
  – n_wi^(zi) is the number of times word type wi occurs in topic zi
  – n_j^(di) is the number of tokens in document di assigned to topic j
  – n_.^(zi) is the total number of tokens in topic zi (the "." is a wildcard)
  – n_.^(di) is the total number of tokens in document di
The collapsed Gibbs sampler
• Sample each zi conditioned on z−i
• This is nicer than your average Gibbs sampler:
  – memory: counts can be cached in two sparse matrices
  – optimization: no special functions, simple arithmetic
  – the distributions on Φ and Θ are analytic given z and w, and can later be recovered for each sample
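The update formula itself does not appear on the slide, so here is a hedged sketch of the standard collapsed Gibbs update (the Griffiths-Steyvers form), written in terms of the cached counts just described; the function and array names (n_wt, n_dt, n_t) are my own. The per-document denominator is constant across topics, so it cancels in the normalization.

import numpy as np

def collapsed_gibbs_update(i, w, d, z, n_wt, n_dt, n_t, alpha, beta, rng):
    """Resample the topic of token i given all other assignments z_{-i}, using cached counts."""
    V, T = n_wt.shape                      # vocabulary size, number of topics
    old = z[i]
    # Remove token i from the cached count matrices.
    n_wt[w[i], old] -= 1; n_dt[d[i], old] -= 1; n_t[old] -= 1
    # p(z_i = j | z_{-i}, w)  proportional to  (n_wi^(j) + beta) / (n_.^(j) + V*beta) * (n_j^(di) + alpha)
    p = (n_wt[w[i]] + beta) / (n_t + V * beta) * (n_dt[d[i]] + alpha)
    new = rng.choice(T, p=p / p.sum())
    # Add token i back under its newly sampled topic.
    n_wt[w[i], new] += 1; n_dt[d[i], new] += 1; n_t[new] += 1
    z[i] = new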
Gibbs sampling in LDA
[Worked example repeated across several slides: the topic assignments z for a small corpus, shown at iteration 1, iteration 2, ..., and after 1000 iterations, as the sampler gradually sorts tokens into coherent topics.]
Latent Dirichlet allocation
(Blei, Ng, & Jordan, 2001; 2003)
Dirichlet priors, with reasonable hyperparameter values:
• α (prior on each document's distribution over topics, θ(d) ∼ Dirichlet(α)): reasonable value 50/T or 0.01
• β (prior on each topic's distribution over words, φ(j) ∼ Dirichlet(β)): reasonable value 0.01 - 1.0
• As before: zi ∼ Discrete(θ(d)), wi ∼ Discrete(φ(zi)); plates of T topics, Nd words per document, D documents.
Varying α: decreasing α increases the sparsity of each document's topic distribution θ(d).
Varying β: decreasing β increases the sparsity of each topic's word distribution φ(j).
Selecting the number of topics
• An example of BN model structure learning
• How to do it?
  – Earlier lecture: data likelihood + a prior preferring smaller structures; then try lots of possible structures
  – Today: bake together parameter estimation and structure considerations; define an infinite structure with a prior enforcing that most of it is rarely used
• Non-parametric models
  – have parameters
  – the number of parameters instantiated grows with the data
Dirichlet Processes
Can be confusing because
there are different ways to see it.
I’ll describe four.
32
Dirichlet Process as Noisy Copier
• Motivation:
  – You have a distribution.
  – You'd like to have a copy that can be a little different from the original.
• G ∼ DP(α, H)
  – G: perturbed copy of the original distribution H
  – α: fidelity parameter (higher = closer copy)
33
Dirichlet Process Definition
• A Dirichlet Process (DP) is a distribution over probability distributions.
  (More correctly, instead of "probability distribution" we should say "probability measure", which works on continuous domains.)
• A DP has two parameters:
  – Base distribution H, which is like the mean of the DP.
  – Strength parameter α, which is like an inverse variance of the DP.
• We write G ∼ DP(α, H) if for any partition (A1, ..., An) of X:
  (G(A1), ..., G(An)) ∼ Dirichlet(αH(A1), ..., αH(An))
  Like a game: the opponent gets to pick the partitioning; the DP generates Gs whose masses on that partition are Dirichlet distributed.
[Figure: a partition A1, ..., A6 of the space X]
Summary: A DP is a distribution over probability measures such that marginals on finite partitions are Dirichlet distributed.
34
Yee Whye Teh (Gatsby), DP and HDP Tutorial, Mar 1, 2007 / CUED
Dirichlet Process Definition
• Sounds magical! Is this even possible? Yes.
• Fact: Samples from a DP are always discrete distributions.
• Intuition:
  – Make an infinite number of samples from the original H, but re-weight them according to a draw from a Dirichlet with uniform mean and concentration related to α.
  – With small α, most mass will be on just a few samples. (We will probably never even see the other samples.)
35
Pòlya’s urn scheme produces a sequence θ1 , θ2 , . . . with the
rawfollowing conditionals:
d
P
D
a
cHng
Constru
�n−1
δθ + αH
θn |θ1:n−1 ∼ i=1 i
n−1+α
Blackwell-­‐MacQueen Urn Scheme
picking
balls
colorscfrom
an urn:
icking balls ooff ddifferent
ifferent olors from • Imagine pImagine
Start with no balls in the urn.
withw
probability
draw θin ∼tH,
anduadd
an urn. Start ith no ∝bα,alls he rn.a ball of
that color into the urn.
probability
∝ n − 1, pick a ball at random from
raw, 1
...∞: • For the nth dWith
the urn, record θ to be its color, return the ball into
n
n
the urn and place a second ball of same color into
– with probability ∝ α, draw θn ∼ H, urn.
and add a ball of that color into the urn. – With probability ∝ n − 1, pick a ball at random from the urn, record θn to be its color, return the ball into the urn and place a second ball of same color into urn.
Yee Whye Teh (Gatsby)
DP and HDP Tutorial
Mar 1, 2007 / CUED
Note: For large α, mostly just draw from H. For small α, ofen copy an old value, perturbing G away from H.
Blackwell-MacQueen urn scheme is like a “representer” for the DP—a finite projection of an infinite object.
36
NEXT: Weʼd like to know G(x) for each different color x. Need to gather all balls of same color and count them...
10 / 53
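As a hedged illustration (mine, not the tutorial's), the urn scheme above can be simulated directly; here the base distribution H is taken to be a standard normal, and the function name is my own.

import numpy as np

def blackwell_macqueen(n_draws, alpha, base_sampler, rng):
    """Generate theta_1..theta_n from DP(alpha, H) via the Blackwell-MacQueen urn scheme."""
    thetas = []
    for n in range(1, n_draws + 1):
        if rng.random() < alpha / (alpha + n - 1):
            thetas.append(base_sampler(rng))            # new color: draw from the base distribution H
        else:
            thetas.append(thetas[rng.integers(n - 1)])  # old color: copy a previous draw uniformly
    return np.array(thetas)

rng = np.random.default_rng(3)
draws = blackwell_macqueen(1000, alpha=2.0, base_sampler=lambda r: r.standard_normal(), rng=rng)
print(len(np.unique(draws)))   # far fewer distinct values than draws: G is discrete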
Alternative view of the same construction:
Chinese Restaurant Process
Use de Finetti's theorem about exchangeability to gather together balls of the same color ... into "restaurant tables".
customer = urn scheme draw
table = ball color = θi
Generating from the CRP:
• The first customer sits at the first table.
• Customer n sits at:
  – Table k with probability nk / (α + n − 1), where nk is the number of customers already at table k.
  – A new table K + 1 with probability α / (α + n − 1).
• Customers ⇔ integers, tables ⇔ clusters.
• The CRP exhibits the clustering property of the DP.
[Figure: customers 1-9 seated at a few tables; the number of possible tables is unbounded (∞ tables).]
The number of customers at a table = the (re-)weighting of that table's value; most mass is focussed on the early tables.
NEXT: We'd like to know G(x) for each different color x without having to simulate an infinite number of customers...
Yee Whye Teh (Gatsby), DP and HDP Tutorial, Mar 1, 2007 / CUED
37
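As with the urn scheme, here is a small illustrative sketch (mine) of the CRP, tracking only the table counts nk, which are exactly the re-weightings mentioned above; names are my own.

import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng):
    """Return the table chosen by each customer; counts[k] = number of customers at table k."""
    tables, counts = [], []
    for n in range(1, n_customers + 1):
        probs = np.array(counts + [alpha], dtype=float) / (alpha + n - 1)  # existing tables, then a new one
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # open a new table
        else:
            counts[k] += 1          # join an existing table
        tables.append(k)
    return tables, counts

rng = np.random.default_rng(4)
seating, counts = chinese_restaurant_process(1000, alpha=2.0, rng=rng)
print(len(counts), sorted(counts, reverse=True)[:5])   # a few tables hold most of the customers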
Alternative view of the same construction:
Stick-breaking Construction
"What are the table weights when there are an infinite number of customers?"
What do draws G ∼ DP(α, H) look like? G is discrete with probability one, so:

  G = Σ_{k=1}^∞ πk δθk*        (δθk* = point mass on θk*)

The stick-breaking construction shows that G ∼ DP(α, H) if:

  βk ∼ Beta(1, α)
  πk = βk ∏_{l=1}^{k−1} (1 − βl)
  θk* ∼ H

[Figure: a unit-length stick broken into pieces π(1), π(2), ..., π(6), ...]
We write π ∼ GEM(α) if π = (π1, π2, ...) is distributed as above.
38
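A hedged sketch (mine) of a truncated stick-breaking draw: keeping only the first K pieces gives an approximation to G ∼ DP(α, H), and with moderate α the leading weights carry almost all of the stick.

import numpy as np

def stick_breaking(alpha, base_sampler, K, rng):
    """Truncated stick-breaking approximation to G ~ DP(alpha, H): returns (weights, atoms)."""
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))  # stick length left before break k
    weights = betas * remaining                                        # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    atoms = np.array([base_sampler(rng) for _ in range(K)])            # theta_k* ~ H
    return weights, atoms

rng = np.random.default_rng(5)
w, a = stick_breaking(alpha=1.0, base_sampler=lambda r: r.standard_normal(), K=100, rng=rng)
print(w[:5], w.sum())   # weights decay quickly; the truncated weights sum to just under 1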
What does all this have to do with non-parametric infinite mixtures?!
39
Finite Mixture Model
E.g. Mixture of Gaussians
• H: prior on Gaussian means
• θ: Gaussian mean, with # mixture components = T
• π: mixture proportions
• zi: mixture indicator
• xi: Gaussian draw from mean θzi
(Plate: N data points.)
40
Finite Mixture Model vs. DP (Infinite) Mixture Model
E.g. Mixture of Gaussians
• Infinite limit of mixture models as the # of mixture components tends to infinity.
• Gaussian mixture model example:
  – Finite (left): H (prior on Gaussian means) → θ, with # mixture components = T; π (mixture proportions, |T|) → zi (mixture indicator); xi is a Gaussian draw from mean θzi. (Plate: N points.)
  – DP / infinite (right): H, α → DP → G; θi ∼ G; xi is a Gaussian draw from mean θi. (Plate: N points.)
41
Alternative Diagram
DP (Infinite) Mixture Model
E.g. Mixture of Gaussians
• Infinite limit of mixture models as the # of mixture components tends to infinity.
• Gaussian mixture model example:
  – Stick-breaking view (left): H (prior on Gaussian means) → θ, with infinitely many components; α → stick breaking → π (mixture proportions, |∞|) → zi (mixture indicator); xi is a Gaussian draw from mean θzi. (Plate: N points.)
  – Equivalent DP view (right): H, α → DP → G; θi ∼ G; xi. (Plate: N points.)
42
Getting back to LDA...
• We want to generate a corpus of documents from a set of shared "topics".
• The DP Mixture Model does not explicitly enforce any sharing. (Alternatively: the DP mixture confounds the mixture values and mixture proportions.)
• We need something more...
[Diagram, for reference: the DP (infinite) mixture model, H, α → DP → G → θi → xi, plate of N points.]
43
Hierarchical DP Mixture Model
• What happens if we put a prior on a Dirichlet Process? Why would we want to?
• We might have a collection of related documents or images, each of which is a mixture (e.g. a mixture of Gaussians).
• Left diagram (for comparison): the DP (infinite) mixture model, H, α → DP → G; θi ∼ G; xi. (Plate: N points.)
• Right diagram (hierarchical): G0 ∼ DP(γ, H); Gj ∼ DP(α, G0) for each group j; θi ∼ Gj; xi drawn from θi. (Plates: J groups, N points per group.)
44
Alternative Diagram
Hierarchical DP Mixture Model (e.g. a mixture of Gaussians per group)
• Left: G0 ∼ DP(γ, H); Gj ∼ DP(α, G0) for each group j; θi ∼ Gj; xi. (Plates: J groups, N points.)
• Right (stick-breaking view): γ → β (global weights); α, β → π (per-group mixture proportions); H → θ (infinitely many component parameters); zi (mixture indicator); xi. (Plates: J groups, N points.)
45
Representations of Hierarchical Dirichlet Processes
Another Picture of the HDP: the hierarchical Pólya urn scheme
Let G0 ∼ DP(γ, H) and G1, G2 | G0 ∼ DP(α, G0).
The hierarchical Pólya urn scheme generates draws from G1 and G2: each child urn either copies one of its own previous balls or requests a color from the shared parent urn for G0, which in turn either copies one of its previous balls or draws a fresh color from H.
[Figure: rows of ball draws for G0, G1, and G2, with each child draw either repeating an earlier ball in its own row or pointing up to a ball in the G0 row.]
46
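Below is a hedged sketch (mine, not from the tutorial) of that hierarchical urn; the Urn class and all names are illustrative only, and H is taken to be a standard normal.

import numpy as np

class Urn:
    """Pólya urn over draws from a parent sampler: DP(concentration, parent)."""
    def __init__(self, concentration, parent_sampler, rng):
        self.c, self.parent, self.rng, self.balls = concentration, parent_sampler, rng, []

    def draw(self):
        n = len(self.balls)
        if self.rng.random() < self.c / (self.c + n):
            ball = self.parent()                     # ask the parent (G0 for a child urn, H for the top urn)
        else:
            ball = self.balls[self.rng.integers(n)]  # copy one of this urn's previous draws
        self.balls.append(ball)
        return ball

rng = np.random.default_rng(6)
g0 = Urn(concentration=1.0, parent_sampler=lambda: rng.standard_normal(), rng=rng)   # G0 ~ DP(gamma, H)
g1 = Urn(concentration=2.0, parent_sampler=g0.draw, rng=rng)                         # G1 | G0 ~ DP(alpha, G0)
g2 = Urn(concentration=2.0, parent_sampler=g0.draw, rng=rng)                         # G2 | G0 ~ DP(alpha, G0)
samples = [g1.draw() for _ in range(200)] + [g2.draw() for _ in range(200)]
print(len(set(samples)))   # G1 and G2 share atoms because both ultimately get their colors from G0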
Inference in the
Dirichlet Process Mixture Model
47
Collapsed Gibbs Recipe for DP Mixture
The big picture:
• For each data point
– pretend it is the last point (by exchangeability)
– the prior is just the Chinese restaurant dynamics
– the likelihood is just the usual mixture likelihood
48
Collapsed Gibbs Recipe for DP Mixture
The slightly more detailed picture, still skipping the evidence (word) likelihood:
• For the nth word:
  – With probability ∝ α, draw zn ∼ G0.
    To draw from G0: with probability ∝ γ, draw zn ∼ H; with probability ∝ n − 1, draw a topic from those already in G0, proportionally to their counts.
  – With probability ∝ nj − 1, draw a topic from those already in Gj, proportionally to their counts.
49
From previously...
The finite collapsed Gibbs sampler: sample each zi conditioned on z−i.
For the DP (infinite) case:
– Very similar, but include the possibility of picking a "new" topic, using Chinese restaurant dynamics.
– When you pick a "new" topic for a document, first go to G0 and consider using an "old new" topic from another document; otherwise create a "new new" topic.
Generic Collapsed Gibbs Sampler for DP Mixture Model [Neal 2000, Alg #2] [Sudderth PhD]

Given the previous concentration parameter α(t−1), cluster assignments z(t−1), and cached statistics for the K current clusters, sequentially sample new assignments as follows:
1. Sample a random permutation τ(·) of the integers {1, ..., N}.
2. Set α = α(t−1) and z = z(t−1). For each i ∈ {τ(1), ..., τ(N)}, resample zi as follows:
   (a) For each of the K existing clusters, determine the predictive likelihood
         fk(xi) = p(xi | {xj | zj = k, j ≠ i}, λ)
       This likelihood can be computed from cached sufficient statistics via Prop. 2.1.4.
       Also determine the likelihood fk̄(xi) of a potential new cluster k̄ via eq. (2.189).
   (b) Sample a new cluster assignment zi from the following (K+1)-dim. multinomial:
         zi ∼ (1/Zi) [ α fk̄(xi) δ(zi, k̄) + Σ_{k=1}^{K} Nk^{−i} fk(xi) δ(zi, k) ],
         Zi = α fk̄(xi) + Σ_{k=1}^{K} Nk^{−i} fk(xi)
       where Nk^{−i} is the number of other observations currently assigned to cluster k.
   (c) Update cached sufficient statistics to reflect the assignment of xi to cluster zi. If zi = k̄, create a new cluster and increment K.
3. Set z(t) = z. Optionally, mixture parameters for the K currently instantiated clusters may be sampled as in step 3 of Alg. 2.1.
4. If any current clusters are empty (Nk = 0), remove them and decrement K accordingly.
5. If α ∼ Gamma(a, b), sample α(t) ∼ p(α | K, N, a, b) via auxiliary variable methods [76].

Algorithm 2.3 (from Sudderth's thesis). Rao-Blackwellized Gibbs sampler for an infinite, Dirichlet process mixture model, as defined in Fig. 2.24. Each iteration sequentially resamples the cluster assignments for all N observations.
51
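Below is a hedged, simplified sketch (my own, not Sudderth's code) of one sweep of this sampler for a DP mixture of unit-variance 1-D Gaussians with a N(0, s0sq) prior on cluster means, so the predictive likelihoods fk(xi) have closed forms; it follows the structure of Algorithm 2.3 but recomputes cluster statistics instead of caching them and skips the resampling of α.

import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gibbs_sweep(x, z, alpha, s0sq=100.0, rng=None):
    """One collapsed Gibbs sweep for a DP mixture of unit-variance 1-D Gaussians, mean prior N(0, s0sq)."""
    rng = rng or np.random.default_rng()
    for i in rng.permutation(len(x)):                  # steps 1-2: visit observations in random order
        others = np.delete(np.arange(len(x)), i)
        labels = sorted(set(z[others].tolist()))
        weights, options = [], []
        for k in labels:                               # (a) predictive likelihood f_k(x_i) of each existing cluster
            xs = x[others[z[others] == k]]
            post_var = 1.0 / (1.0 / s0sq + len(xs))    # posterior over that cluster's mean
            post_mean = post_var * xs.sum()
            weights.append(len(xs) * normal_pdf(x[i], post_mean, post_var + 1.0))
            options.append(k)
        weights.append(alpha * normal_pdf(x[i], 0.0, s0sq + 1.0))   # likelihood of a potential new cluster
        options.append(max(labels) + 1)
        weights = np.array(weights) / np.sum(weights)  # (b) the (K+1)-dimensional multinomial
        z[i] = options[rng.choice(len(options), p=weights)]
    return z

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(4, 1, 100)])
z = np.zeros(len(x), dtype=int)
for _ in range(20):
    z = gibbs_sweep(x, z, alpha=1.0, rng=rng)
print(len(set(z.tolist())))   # typically settles on roughly two occupied clusters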
Application: Document Topic Modelling
HDP Mixture Experimental Results
Compared against latent Dirichlet allocation, a parametric version of the HDP mixture for topic modelling.
[Figure, left: perplexity on test abstracts (roughly 750-1050) of LDA as a function of the number of LDA topics (10-120), compared with the HDP mixture.
Figure, right: posterior over the number of topics in the HDP mixture (histogram of samples), supported on roughly 61-73 topics.]
52
Further Variations
53
Nested Chinese Restaurant Process
[Blei et al 2003]
[Figure 1: (a) The paths of four tourists through the infinite tree of Chinese restaurants (L = 3). The solid lines connect each restaurant to the restaurants referred to by its tables. The collected paths of the four tourists describe a particular subtree of the underlying infinite ... (b) The accompanying graphical model over path variables c1, c2, ..., cL, topic assignments z, and words w (plates of N words and M documents).]
54
Nested Chinese Restaurant Process
[Figure 5: A topic hierarchy estimated from 1717 abstracts from NIPS01 through NIPS12. Each node contains the top eight words from its corresponding topic distribution. Nodes include:
• the, of, a, to, and, in, is, for
• neurons, visual, cells, cortex, synaptic, motion, response, processing
• cell, neuron, circuit, cells, input, i, figure, synapses
• chip, analog, vlsi, synapse, weight, digital, cmos, design
• algorithm, learning, training, method, we, new, problem, on
• recognition, speech, character, word, system, classification, characters, phonetic
• b, x, e, n, p, any, if, training
• hidden, units, layer, input, output, unit, x, vector
• control, reinforcement, learning, policy, state, actions, value, optimal]
55
Infinite Hidden Markov Models
Infinite Hidden Markov Model
Previous issue: there is no sharing of possible next states across different current states.
Implement sharing of next states using an HDP:

  (τ1, τ2, ...) ∼ GEM(γ)
  (π1l, π2l, ...) | τ ∼ DP(α, τ)    for each current state l

[Figure: graphical model of the infinite HMM, with hidden state sequence v0, v1, v2, ..., vT, observations x1, x2, ..., xT, and per-state parameters θ*l drawn from H.]
56
Yee Whye Teh (Gatsby)
DP and HDP Tutorial
Mar 1, 2007 / CUED
Infinite N-gram Model
"A Stochastic Memoizer for Sequence Data"
[Wood, Archambeau, Gasthaus, James, Teh, 2009]
[Figure: (a) a tree of context-dependent distributions G[ ], G[o], G[a], G[oa], G[ac], G[aca], G[oac], G[oaca], G[oacac], ...; (b) the corresponding prefix tree for the string oacac; (c) initialisation: a Chinese restaurant franchise sampler representation of the highlighted subtree.]
57
Readings for More Detail
• Erik Sudderth's PhD thesis
  http://www.cs.brown.edu/~sudderth/papers/sudderthPhD.pdf
• Yee Whye Teh's Dirichlet Process Tutorial
  http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes/mlss2007.pdf
• HDP introduction, LDA with infinite topics
  http://www.cse.buffalo.edu/faculty/mbeal/papers/hdp.pdf
• HDP implementation by Teh
  http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz
58