
COLLEGE OF ENGINEERING
Preliminary Exam Report
Review of Hierarchical Dirichlet Process and Infinite HMMs
Amir Harati
[email protected]
Institute for Signal and Information Processing,
Department of Electrical Engineering,
Temple University
February 2012
EXECUTIVE SUMMARY
In this report, I review three research papers concerning hierarchical Dirichlet processes (HDP), infinite
hidden Markov models (HDP-HMM), and several inference algorithms for these models. I also use
several other sources to support the material discussed in this report.
The three primary references used in this exam are "Hierarchical Bayesian Nonparametric Models with Applications," "On-Line Learning for the Infinite Hidden Markov Model," and "Sticky HDP-HMM with Application to Speaker Diarization." The core material in all of these papers concerns HDPs and HDP-HMMs. The first paper is a more general overview and contains several other models that are not directly related to the main theme of this report. Therefore, this report is not organized around any one of these references but contains most of the relevant material from all of them. In many cases I have corrected errors ranging from typos to simple algorithmic or computational mistakes, and I have also used a unified notation throughout the report that I believe makes the presentation easier to follow.
To make the report more self-contained I have added a background review of Dirichlet processes (DP), but this review is very short and readers may need to consult some background papers before reading this report. After reviewing DPs, I introduce the HDP and the reasons it is needed. Several properties of the HDP are derived (both in the main report and the appendix) and some of them are justified. Two basic inference algorithms for the HDP are presented in detail.
The HDP-HMM is then introduced based on the general framework of hierarchical Dirichlet processes. The differences between this model and the HDP are emphasized, and several of its properties are derived and explained. Three important inference algorithms are reviewed and presented in detail.
In writing this review, one of my primary intentions is to produce a self-sufficient document that can be used as a reference to implement some of the inference algorithms for the HDP-HMM. Moreover, an interested reader can easily start from these and derive more general or application-specific algorithms.
Table of Contents
1  Introduction
2  Background
3  Hierarchical Dirichlet Process
   3.1  Stick-Breaking Construction
   3.2  Different Representations
   3.3  Chinese Restaurant Franchise
        3.3.1  Posterior and Conditional Distributions
   3.4  Inference Algorithms
        3.4.1  Posterior Sampling in CRF
        3.4.2  Augmented Posterior Representation Sampler
   3.5  Applications
4  HDP-HMM
   4.1  CRF with Loyal Customers
   4.2  Inference Algorithms
        4.2.1  Direct Assignment Sampler
        4.2.2  Block Sampler
        4.2.3  Learning Hyper-parameters
        4.2.4  Online Learning
   4.3  Applications
5  Conclusion
6  References
Appendix A: Derivation of HDP Relationships
   A.1. Stick-Breaking Construction
   A.2. Deriving Posterior and Predictive Distributions
Appendix B: Derivation of HDP-HMM Relationships
   B.1. Derivation of the posterior distribution for (z_t, s_t)
1 INTRODUCTION
Nonparametric Bayesian methods provide a consistent framework for inferring model complexity from the data. Moreover, Bayesian methods make hierarchical modeling easier and therefore open the door to more interesting and complex applications. In this report, we review the hierarchical Dirichlet process (HDP) and its application to deriving infinite hidden Markov models (HMMs), also known as HDP-HMMs. We also review three inference algorithms for the HDP-HMM in detail.
This report is organized into five sections and two appendices. Section two is a quick review of the Dirichlet process. Section three is devoted to the HDP and its inference algorithms, and section four is focused on the HDP-HMM and its inference algorithms. For the sake of readability, some of the mathematical details are presented in the appendices.
2 BACKGROUND
A Dirichlet process (DP) is a distribution over distributions, or more precisely over discrete distributions. Formally, a Dirichlet process DP(α, G_0) is "defined to be the distribution of a random probability measure G over Θ such that, for any finite measurable partition (A_1, A_2, ..., A_r) of Θ, the random vector (G(A_1), ..., G(A_r)) is distributed as a finite-dimensional Dirichlet distribution" (Teh, Jordan, Beal, & Blei, 2006):

   (G(A_1), ..., G(A_r)) ~ Dir(αG_0(A_1), ..., αG_0(A_r))    (2.1)

A constructive definition of the Dirichlet process is given by Sethuraman (Sethuraman, 1994) and is known as the stick-breaking construction. This construction explicitly shows that draws from a DP are discrete with probability one:

   v_k | α, G_0 ~ Beta(1, α),    θ_k | α, G_0 ~ G_0
   β_k = v_k ∏_{l=1}^{k-1} (1 - v_l)
   G = Σ_{k=1}^{∞} β_k δ_{θ_k}    (2.2)

β = (β_k)_{k=1}^{∞} can be interpreted as a random probability measure over the positive integers and is denoted by β ~ GEM(α). In both of these definitions G_0, the base distribution, is the mean of the DP, and α is the concentration parameter, which can be understood as an inverse variance.
Another way to look at the DP is through the Polya urn scheme. In this approach, we consider i.i.d. draws from a DP and look at the predictive distribution over these draws (Teh, Jordan, Beal, & Blei, 2006):

   θ_i | θ_1, ..., θ_{i-1}, α, G_0 ~ Σ_{k=1}^{i-1} (1 / (i - 1 + α)) δ_{θ_k} + (α / (i - 1 + α)) G_0    (2.3)

In the urn interpretation of equation (2.3), we have an urn containing several balls of different colors. We draw a ball, put it back in the urn, and add another ball of the same color. With probability proportional to α we instead draw a ball of a new color.
To make the clustering property clearer, we introduce a new set of variables that represent the distinct values of the atoms. Let θ*_1, ..., θ*_K be the distinct values and m_k be the number of the θ_l associated with θ*_k. We then have:

   θ_i | θ_1, ..., θ_{i-1}, α, G_0 ~ Σ_{k=1}^{K} (m_k / (i - 1 + α)) δ_{θ*_k} + (α / (i - 1 + α)) G_0    (2.4)
Another useful interpretation of (2.4) is the Chinese restaurant process (CRP). In the CRP we have a Chinese restaurant with an infinite number of tables. A new customer θ_i enters the restaurant and either sits at one of the occupied tables, with probability proportional to the number of people already sitting there, or starts a new table, with probability proportional to α. In this metaphor, each customer is a data point and each table is a cluster.
3 HIERARCHICAL DIRICHLET PROCESS
A hierarchical Dirichlet process (HDP) is the natural extension of the Dirichlet process to problems with multiple groups of data. Usually, the data are split into J groups a priori. For example, consider a collection of documents: if words are the data points, then each document is a group. We want to model the data inside each group with a mixture model, but we are also interested in tying the groups together, i.e. in sharing clusters across all groups. Let us assume we have an indexed collection of DPs with a common base distribution, G_j ~ DP(α, G_0). Unfortunately this simple model cannot solve the problem, since for a continuous G_0 the different G_j necessarily have no atoms in common. The solution is to use a discrete G_0 with broad support; in other words, G_0 is itself a draw from a Dirichlet process. The HDP is defined by (Teh & Jordan, 2010) in equation (3.1):
   G_0 | γ, H ~ DP(γ, H)
   G_j | α, G_0 ~ DP(α, G_0),    for j = 1, ..., J
   θ_ji | G_j ~ G_j
   x_ji | θ_ji ~ F(θ_ji)    (3.1)

In this definition H provides the prior distribution for the factors θ_ji, γ governs the variability of G_0 around H, and α controls the variability of G_j around G_0. H, γ and α are the hyper-parameters of the HDP.
3.1 Stick-Breaking Construction
Because G_0 is drawn from a Dirichlet process, it has a stick-breaking representation:

   G_0 = Σ_{k=1}^{∞} β_k δ_{θ**_k}    (3.2)

where θ**_k ~ H and β = (β_k)_{k=1}^{∞} ~ GEM(γ). Since the support of G_j is contained within the support of G_0, we can write a similar expansion for G_j:
   G_j = Σ_{k=1}^{∞} π_jk δ_{θ**_k}    (3.3)

   π_j ~ DP(α, β)    (3.4)

Then we have:

   v_jk ~ Beta( αβ_k, α(1 - Σ_{l=1}^{k} β_l) )
   π_jk = v_jk ∏_{l=1}^{k-1} (1 - v_jl),    for k = 1, ..., ∞    (3.5)
We also have:

   E[β_k] = γ^{k-1} / (1 + γ)^k
   E[π_jk] = E[β_k]
   Var[π_jk] = E[β_k(1 - β_k)] / (1 + α) + Var[β_k] ≥ Var[β_k]    (3.6)
The stick-breaking construction for the HDP and (3.6) are derived in A.1. Figure 1 demonstrates the stick-breaking construction and the cluster sharing of the HDP.

Figure 1 - Stick-breaking construction for the HDP: the left panel shows a draw of β, while the right three panels show independent draws conditioned on β (Teh & Jordan, 2010).
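As a rough illustration of how the group-level weights inherit the global weights, the sketch below draws π_j ~ DP(α, β) through the truncated stick-breaking recursion (3.5). The truncation level and the stand-in used for the GEM draw are assumptions of this sketch, not part of the referenced construction.

```python
import numpy as np

def sample_group_weights(beta, alpha, rng=None):
    """Draw group-level weights π_j ~ DP(α, β) via the stick-breaking
    recursion (3.5), truncated to the length of the supplied beta."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.asarray(beta, dtype=float)
    K = len(beta)
    tail = 1.0 - np.cumsum(beta)              # 1 - Σ_{l<=k} β_l
    tail = np.clip(tail, 1e-12, None)         # guard against numerical zeros
    v = rng.beta(alpha * beta, alpha * tail)  # v_jk ~ Beta(αβ_k, α(1 - Σβ_l))
    pi = np.empty(K)
    remaining = 1.0
    for k in range(K):                        # π_jk = v_jk Π_{l<k}(1 - v_jl)
        pi[k] = v[k] * remaining
        remaining *= 1.0 - v[k]
    return pi

# Example: one global draw beta shared by two groups.
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.full(20, 0.5))        # stand-in for a truncated GEM(γ) draw
pi_1 = sample_group_weights(beta, alpha=5.0, rng=rng)
pi_2 = sample_group_weights(beta, alpha=5.0, rng=rng)
```

Because both groups reuse the same β, the large atoms of β tend to be large in every π_j, which is exactly the cluster-sharing behavior shown in Figure 1.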
3.2 Different Representations
Definition (3.1) gives the first representation of the HDP. Another representation can be obtained by introducing an indicator variable, as shown in equation (3.7). Figure 2 shows the graphical models of both representations.

   β | γ ~ GEM(γ)
   π_j | α, β ~ DP(α, β)
   θ_k | H, λ ~ H(λ)
   z_ji | π_j ~ π_j
   x_ji | (θ_k)_{k=1}^{∞}, z_ji ~ F(θ_{z_ji})    (3.7)
3.3 Chinese Restaurant Franchise
The Chinese restaurant franchise (CRF) is the natural extension of the Chinese restaurant process to HDPs. In the CRF, we have a franchise with several restaurants and a franchise-wide menu. The first customer in restaurant j sits at one of the tables and orders an item from the menu. Subsequent customers either sit at one of the occupied tables and eat the food served at that table, or sit at a new table and order their own food from the menu. Moreover, the probability of sitting at a table is proportional to the number of customers already seated at that table. In this metaphor, restaurants correspond to groups, and customer i in restaurant j corresponds to θ_ji (customers are distributed according to G_j). Tables are i.i.d. variables ψ_jt distributed according to G_0, and foods are i.i.d. variables θ**_k distributed according to H. If customer i in restaurant j sits at table t_ji and that table serves dish k_{jt_ji}, then θ_ji = ψ_{jt_ji} = θ**_{k_{jt_ji}}. Put another way, each restaurant represents a simple DP and therefore a clustering of its data points; at the franchise level we have another DP, but this time the clustering is over tables.
Let us now introduce several variables that will be used throughout this report. n_jtk is the number of customers in restaurant j seated at table t who eat dish k, m_jk is the number of tables in restaurant j serving dish k, and K is the number of unique dishes served in the entire franchise. Marginal counts are denoted with dots; for example, n_{j·k} is the number of customers in restaurant j eating dish k.

Figure 2 - (a) HDP representation of (3.1); (b) alternative indicator-variable representation (3.7) (Teh, Jordan, Beal, & Blei, 2004).
3.3.1 Posterior and Conditional Distributions
 
CRF can be characterized by its state which consists of the dish labels **  k**
t 
ji
j 1,..., J
i 1,..., n
 
and dishes k jt ji
j 1,...., J
i 1,..., n
k 1,..., K
, the tables
. As a function of the state of the CRF, we also have the number of
 
customers n  n jtk  , the number of tables m  m jk  , customer labels θ   ji  and table labels    *jt
(Teh & Jordan, 2010). The posterior distribution of G_0 is given by:

   G_0 | γ, H, ψ ~ DP( γ + m_{··}, (γH + Σ_{k=1}^{K} m_{·k} δ_{θ**_k}) / (γ + m_{··}) )    (3.8)

where m_{··} is the total number of tables in the franchise and m_{·k} is the total number of tables serving dish k. Equation (3.9) gives the posterior for G_j, where n_{j··} is the total number of customers in restaurant j and n_{j·k} is the total number of customers in restaurant j eating dish k:

   G_j | α, G_0, θ_j ~ DP( α + n_{j··}, (αG_0 + Σ_{k=1}^{K} n_{j·k} δ_{θ**_k}) / (α + n_{j··}) )    (3.9)
Conditional distributions can be obtained by integrating out G_j and G_0 respectively. Integrating out G_j from (3.9) gives:

   θ_ji | θ_j1, ..., θ_{j,i-1}, α, G_0 ~ Σ_{t=1}^{m_{j·}} (n_{jt·} / (α + n_{j··})) δ_{ψ_jt} + (α / (α + n_{j··})) G_0    (3.10)
and integrating out G_0 from (3.8) gives:

   ψ_jt | ψ_11, ..., ψ_{j,t-1}, γ, H ~ Σ_{k=1}^{K} (m_{·k} / (γ + m_{··})) δ_{θ**_k} + (γ / (γ + m_{··})) H    (3.11)
A draw from (3.8) can be obtained using (3.12), and a draw from (3.9) can be obtained using (3.13):

   (β_0, β_1, ..., β_K) | γ, θ* ~ Dir(γ, m_{·1}, ..., m_{·K})
   G'_0 | γ, H ~ DP(γ, H)
   G_0 = β_0 G'_0 + Σ_{k=1}^{K} β_k δ_{θ**_k}    (3.12)
   (π_j0, π_j1, ..., π_jK) | α, β, θ_j ~ Dir(αβ_0, αβ_1 + n_{j·1}, ..., αβ_K + n_{j·K})
   G'_j | α, β_0, G'_0 ~ DP(αβ_0, G'_0)
   G_j = π_j0 G'_j + Σ_{k=1}^{K} π_jk δ_{θ**_k}    (3.13)

From (3.12) and (3.13) we see that the posterior of G_0 is a mixture of atoms corresponding to the dishes and an independent draw from DP(γ, H), and that G_j is a mixture of atoms at the θ**_k and an independent draw from DP(αβ_0, G'_0) (Teh & Jordan, 2010).
In an HDP each restaurant represents a DP, so m_{j·} = O(α log(n_{j··}/α)), since the number of clusters in a DP scales logarithmically with the number of data points. On the other hand, equation (3.8) shows that G_0 is a DP over the tables, so

   K = O( γ log( Σ_j m_{j·} / γ ) ) = O( γ log( (α/γ) Σ_j log(n_{j··}/α) ) ).

This shows that the HDP encodes a prior belief that the number of clusters grows very slowly (doubly logarithmically) in the number of data points, but faster (logarithmically) in the number of groups (Teh & Jordan, 2010). All of the above relationships are derived in A.2.
3.4 Inference Algorithms
First, let us introduce some notation: x = (x_ji), x_jt = (x_ji : t_ji = t), t = (t_ji), k = (k_jt) and z = (z_ji), where z_ji = k_{jt_ji} denotes the mixture component associated with observation x_ji. To indicate the removal of a variable from a set of variables or from a count, we use the superscript -ji; for example, x^{-ji} = x \ x_ji, and n^{-ji}_{jt} is the number of customers (observations) in restaurant (group) j seated at table t, excluding x_ji. We also assume that F(θ) has density f(·|θ) and that H has density h(·) and is conjugate to F. The conditional density of x_ji under mixture component k, given all data except x_ji, is defined as:

   f_k^{-x_ji}(x_ji) = ∫ f(x_ji|θ) ∏_{j'i' ∈ D_k \ {ji}} f(x_{j'i'}|θ) h(θ) dθ / ∫ ∏_{j'i' ∈ D_k \ {ji}} f(x_{j'i'}|θ) h(θ) dθ,    k = 1, ..., K
   f_{k^new}^{-x_ji}(x_ji) = ∫ f(x_ji|θ) h(θ) dθ,    k = k^new    (3.14)

where D_k = {j'i' : k_{j't_{j'i'}} = k} denotes the set of indices of the data items currently associated with dish k. In the conjugate case we can obtain this likelihood in closed form. In particular, if the emission distributions are Gaussian with unknown mean and covariance, the conjugate prior is a normal-inverse-Wishart distribution (Sudderth, 2006), denoted NIW(ζ, ϑ, ν, Δ), with θ_k = (μ_k, Σ_k).
Given some observations {x^(l)}_{l=1}^{L} assigned to mixture component k (in this case {x^(l)}_{l=1}^{L} = {x_ji : k_{jt_ji} = k, j ∈ 1, ..., J, i ∈ 1, ..., n_j}) drawn from a multivariate Gaussian distribution, the posterior remains in the normal-inverse-Wishart family, and its parameters are updated using:

   ζ_k = ζ + L
   ν_k = ν + L
   ζ_k ϑ_k = ζϑ + Σ_{l=1}^{L} x^(l)
   ν_k Δ_k = νΔ + Σ_{l=1}^{L} x^(l) x^(l)T + ζϑϑ^T - ζ_k ϑ_k ϑ_k^T    (3.15)

In practice there are efficient ways (using the Cholesky decomposition) to update these quantities, which allows fast likelihood evaluation for each data point. Finally, marginalizing θ_k induces a multivariate Student-t distribution with ν_k - d + 1 degrees of freedom (d is the dimension of the data points):

   f_k^{-x_ji}(x_ji) = t_{ν_k - d + 1}( x_ji; ϑ_k, (ζ_k + 1) ν_k Δ_k / (ζ_k (ν_k - d + 1)) ),    k = 1, ..., K
   f_{k^new}^{-x_ji}(x_ji) = t_{ν - d + 1}( x_ji; ϑ, (ζ + 1) ν Δ / (ζ (ν - d + 1)) ),    k = k^new    (3.16)

Assuming ν_k ≫ d + 1, (3.16) can be approximated by a moment-matched Gaussian (Sudderth, 2006).
3.4.1 Posterior Sampling in CRF
Equations (3.10) and (3.11) show how we can produce samples from the priors over θ_ji and ψ_jt; combined with the appropriate likelihood functions, this framework provides the tools needed to sample from the posterior given the observations x. In this first algorithm, we sample the indices t_ji and k_jt using a simple Gibbs sampler.
We assume the emission distributions are Gaussian, but other conjugate pairs can be used in the same way. The following list shows a single iteration of the algorithm; to obtain reliable samples we have to run it many times. Notice that the most computationally costly operation is the calculation of the conditional densities, and that the number of possible events at each step of the algorithm is K + 1.
1. Given the previous state assignment K, k and t:
Sample t:
2. For all j = 1, ..., J, i = 1, ..., n_j, do the following sequentially:
3. Compute the conditional density of x_ji using (3.16) for k = 1, ..., K, k^new.
4. Calculate the likelihood for t_ji = t^new using:

   p(x_ji | t^{-ji}, t_ji = t^new, k) = Σ_{k=1}^{K} (m_{·k} / (m_{··} + γ)) f_k^{-x_ji}(x_ji) + (γ / (m_{··} + γ)) f_{k^new}^{-x_ji}(x_ji)    (3.17)
5. Sample t_ji from the multinomial probability:

   p(t_ji = t | t^{-ji}, k) ∝
       n^{-ji}_{jt·} f_{k_jt}^{-x_ji}(x_ji),    if t is a previously used table
       α p(x_ji | t^{-ji}, t_ji = t^new, k),    if t = t^new    (3.18)
6. If the sampled value of t_ji is t^new, obtain a sample of k_{jt^new} from:

   p(k_{jt^new} = k | t, k^{-jt^new}) ∝
       m_{·k} f_k^{-x_ji}(x_ji),    if k is a previously used dish
       γ f_{k^new}^{-x_ji}(x_ji),    if k = k^new    (3.19)
7. If k = k^new, increment K.
8. Update the cached statistics.
9. If a table t becomes unoccupied, delete the corresponding k_jt; if as a result some mixture component becomes unallocated, delete that mixture component too.
Sample k:
10. For all j = 1, ..., J, t = 1, ..., m_j, do the following sequentially:
11. Compute the conditional density of x_jt using (3.16) for k = 1, ..., K, k^new.
12. Sample k_jt from the multinomial distribution:

   p(k_jt = k | t, k^{-jt}) ∝
       m^{-jt}_{·k} f_k^{-x_jt}(x_jt),    if k is a previously used dish
       γ f_{k^new}^{-x_jt}(x_jt),    if k = k^new    (3.20)
13. If k = k^new, increment K.
14. Update the cached statistics.
15. If a mixture component becomes unallocated, delete it.
Equation (3.17) is obtained from (3.11). From (3.10) we see that the prior probability that t_ji takes a previously used value t is proportional to n^{-ji}_{jt·}, and the prior probability of taking a new value is proportional to α. The likelihood of x_ji given t_ji = t, for some previously used t, is f_{k_jt}^{-x_ji}(x_ji), and the likelihood for t_ji = t^new is given by (3.17). Multiplying these priors and likelihoods gives the posterior distribution (3.18). In the same way, (3.19) and (3.20) are obtained by multiplying the likelihoods by the priors given by (3.11).
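A minimal sketch of this table-assignment step, i.e. of (3.17) and (3.18), is given below. It assumes the per-dish predictive likelihoods have already been evaluated with (3.16); all array names are illustrative.

```python
import numpy as np

def sample_table(n_jt, m_dot_k, like_k, like_new, k_jt, alpha, gamma, rng):
    """One draw of t_ji from (3.18), given precomputed quantities.

    n_jt     : NumPy array of counts n_{jt}^{-ji} for the tables of restaurant j
    m_dot_k  : NumPy array of franchise-wide table counts m_{·k}
    like_k   : f_k^{-x_ji}(x_ji) for each existing dish k
    like_new : f_{k^new}^{-x_ji}(x_ji)
    k_jt     : dish index served at each table of restaurant j
    Returns the sampled table index, or len(n_jt) for a new table.
    """
    m_total = m_dot_k.sum()
    # Likelihood of x_ji under a brand-new table, eq. (3.17): a mixture over dishes.
    like_tnew = (np.dot(m_dot_k, like_k) + gamma * like_new) / (m_total + gamma)
    weights = np.append(n_jt * like_k[k_jt],      # existing tables
                        alpha * like_tnew)        # new table
    weights /= weights.sum()
    return rng.choice(len(weights), p=weights)
```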
3.4.2 Augmented Posterior Representation Sampler
In the previous algorithm the sampling for all groups is coupled, which makes the derivation of the CRF sampler for certain models difficult (Teh, Jordan, Beal, & Blei, 2006). This happens because G_0 was integrated out. An alternative approach is to sample G_0 explicitly. More specifically, we use (3.12) and (3.13) to sample G_0 and the G_j respectively. The algorithm contains two main steps: first we sample the cluster indices (z_ji) (instead of tables and dishes), and then we sample β and (π_j)_{j=1,...,J}. Equation (3.12) shows that in order to sample β we should first sample (m_jk). This completes the second algorithm.
1. Given the previous state assignment for z, β and (π_j):
Sample z:
2. For all j = 1, ..., J, i = 1, ..., n_j, do the following sequentially:
3. Compute the conditional density of x_ji using (3.16) for k = 1, ..., K, k^new.
4. Sample z_ji using:

   p(z_ji = k | z^{-ji}, β, π) ∝
       π_jk f_k^{-x_ji}(x_ji),    if k is a previously used component
       π_j0 f_{k^new}^{-x_ji}(x_ji),    if k = k^new    (3.21)
5. If a new component k^new is chosen, instantiate the corresponding atom using (3.22), and set z_ji = K + 1 and K = K + 1.

   v_0 | γ ~ Beta(γ, 1)
   (β_0^new, β_{K+1}^new) = (β_0 v_0, β_0 (1 - v_0))
   v_j | α, β_0, v_0 ~ Beta(αβ_0 v_0, αβ_0 (1 - v_0))
   (π_j0^new, π_{j,K+1}^new) = (π_j0 v_j, π_j0 (1 - v_j))    (3.22)
6. Update the cached statistics.
Sample m:
7. Sample (m_jk) using (3.23), where s(n, m) denotes the Stirling number of the first kind. Alternatively, we can sample (m_jk) by simulating a CRP, which is more efficient for large n_{j·k}:

   p(m_jk = m | z, m^{-jk}, β) ∝ ( Γ(αβ_k) / Γ(αβ_k + n_{j·k}) ) s(n_{j·k}, m) (αβ_k)^m    (3.23)
Sample β and π:
8. Sample β and π using (3.12) and (3.13) respectively.
In this algorithm we have used the alternative representation given by (3.7); in particular, z_ji ~ π_j. Combining this with (3.13) gives a conditional prior probability for z_ji, and multiplying it with the likelihoods gives (3.21). Equation (3.22) is the stick-breaking step for the new atom and follows the steps described in Section 3.1; in particular, the third line in (3.22) is obtained by replacing the remaining stick β_0 with a unit stick. The validity of this approach is shown in (Pitman, 1996). For a proof of (3.23) see (Antoniak, 1974). It should be noted that evaluating this equation is generally very costly, and we can alternatively simulate a CRP to sample (m_jk).
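The CRP simulation mentioned above as an alternative to evaluating (3.23) can be written in a few lines. This is a sketch under the assumption that the restaurant-level concentration for dish k is αβ_k; the function name is illustrative.

```python
import numpy as np

def sample_num_tables(n_jk, alpha, beta_k, rng=None):
    """Sample m_jk, the number of tables in restaurant j serving dish k,
    by simulating the seating of n_jk customers in a CRP whose
    concentration parameter is αβ_k (an alternative to (3.23))."""
    rng = np.random.default_rng() if rng is None else rng
    m = 0
    for i in range(n_jk):
        # The i-th customer opens a new table with probability αβ_k / (i + αβ_k).
        if rng.random() < alpha * beta_k / (i + alpha * beta_k):
            m += 1
    return m
```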
3.5 Applications
Among the many applications of the HDP, we review only two in this section. It should be noted that the following section, on the HDP-HMM, is itself an application of the general HDP framework.
One of the most cited applications of the HDP is in the field of information retrieval (IR) (Teh & Jordan, 2010). A state-of-the-art but heuristic algorithm that is very popular in IR applications is "term frequency-inverse document frequency" (tf-idf). The intuition behind this algorithm is that the relevance of a term to a document is proportional to the number of times the term occurs in the document, while terms that occur in many documents should be down-weighted. It has been shown that the HDP provides a justification for this intuition (Teh & Jordan, 2010), and an algorithm based on the HDP outperforms state-of-the-art alternatives.
Another application cited extensively in the literature is topic modeling (Teh & Jordan, 2010). In topic modeling, we want to model documents with a mixture model. Topics are defined as probability distributions over a set of words, while documents are defined as probability distributions over topics. At the same time, we want to share topics among the documents within a corpus. So each document is a group with its own mixing proportions, but the components (topics) are shared across all documents using an HDP.
4 HDP-HMM
Hidden Markov models (HMMs) are a class of doubly stochastic processes in which a discrete state sequence is modeled as a Markov chain (Rabiner, 1989). In the following discussion we denote the state of the Markov chain at time t by z_t and the state-specific transition distribution for state j by π_j. The Markovian structure means z_t ~ π_{z_{t-1}}. Observations are conditionally independent given the state of the HMM and are denoted by x_t ~ F(θ_{z_t}).
The HDP-HMM is an extension of the HMM in which the number of states can be infinite. The idea is relatively simple: from each state z_t we should be able to transition to an infinite number of states, so the transition distribution should be a draw from a DP. On the other hand, we want the states reachable from one state to be shared among all states, so these DPs should be linked together. The result is an HDP. In an HDP-HMM each state corresponds to a group (restaurant); therefore, unlike the HDP, in which the association of data to groups is assumed to be known a priori, here we are interested in inferring this association. The major problem with the original HDP-HMM is state persistence: it has a tendency to create many redundant states and to switch rapidly among them (Teh, Jordan, Beal, & Blei, 2006). This problem is solved by introducing a sticky parameter into the definition of the HDP-HMM (Fox, Sudderth, Jordan, & Willsky, 2011). Equation (4.1) shows the definition of a sticky HDP-HMM with unimodal emissions; κ is the sticky hyper-parameter and can generally be learned from the data, and the original HDP-HMM is the special case κ = 0. From this equation we see that each state (group) has a single unimodal emission distribution. This limitation can be addressed using the more general model defined in (4.2), in which a DP is associated with each state and a model with augmented state (z_t, s_t) is obtained. Figure 3 shows a graphical representation.

   β | γ ~ GEM(γ)
   π_j | α, κ, β ~ DP( α + κ, (αβ + κδ_j) / (α + κ) ),    j = 1, 2, ...
   θ**_j | H, λ ~ H(λ),    j = 1, 2, ...
   z_t | z_{t-1}, (π_j)_{j=1}^{∞} ~ π_{z_{t-1}}
   x_t | (θ**_j)_{j=1}^{∞}, z_t ~ F(θ**_{z_t})    (4.1)

   β | γ ~ GEM(γ)
   π_j | α, κ, β ~ DP( α + κ, (αβ + κδ_j) / (α + κ) )
   ψ_j | σ ~ GEM(σ)
   θ**_{kj} | H, λ ~ H(λ)
   z_t | z_{t-1}, (π_j)_{j=1}^{∞} ~ π_{z_{t-1}}
   s_t | (ψ_j)_{j=1}^{∞}, z_t ~ ψ_{z_t}
   x_t | (θ**_{kj})_{k,j=1}^{∞}, z_t, s_t ~ F(θ**_{z_t s_t})    (4.2)
Figure 3 - Graphical model of the HDP-HMM with DP emissions (Fox, Sudderth, Jordan, & Willsky, 2011).
4.1 CRF with Loyal Customers
The Chinese restaurant franchise metaphor for the sticky HDP-HMM is a franchise with loyal customers. In this case each restaurant has a specialty dish, which is also served in the other restaurants. If customer x_t goes to restaurant j, it is likely that he eats the specialty dish there, i.e. z_t = j. His children, x_{t+1}, then go to the same restaurant and tend to eat the same dish. However, if x_t eats another dish (z_t ≠ j), his children go to the restaurant indexed by z_t and are more likely to eat its specialty dish. Customers are thus actually loyal to dishes, and tend to go to restaurants where their favorite dish is the specialty.
4.2 Inference Algorithms
4.2.1 Direct Assignment Sampler
This sampler is adapted from (Fox, Sudderth, Jordan, & Willsky, 2011) and (Fox, Sudderth, Jordan, & Willsky, 2010). In this section we present the sampler for the HDP-HMM with DP emissions. The algorithm is very similar to the second inference algorithm for the HDP presented in 3.4.2, and is divided into two steps: the first step samples the augmented state (z_t, s_t), and the second samples β.
In order to sample (z_t, s_t) we need its posterior. By inspecting Figure 3 and using the chain rule, we can write the following relationship for this posterior:

   p(z_t = k, s_t = j | z_{\t}, s_{\t}, x_{1:T}, β, α, σ, κ, λ)
     = p(s_t = j | z_t = k, z_{\t}, s_{\t}, x_{1:T}, σ, λ) p(z_t = k | z_{\t}, s_{\t}, x_{1:T}, β, α, κ, λ)
     ∝ p(s_t = j | {s_τ : z_τ = k, τ ≠ t}, σ) p(x_t | {x_τ : z_τ = k, s_τ = j, τ ≠ t}, λ)
       × p(z_t = k | z_{\t}, β, α, κ) Σ_{s_t} p(x_t | {x_τ : z_τ = k, s_τ, τ ≠ t}, λ) p(s_t | {s_τ : z_τ = k, τ ≠ t}, σ)    (4.3)
We sum over s_t in the last line because we are interested in the likelihood of each state. This equation also tells us that we should first sample the state z_t and then, conditioned on the current state, sample the mixture component for that state. In B.1 we derive the following relationships for the components of (4.3). Equation (4.6) is written for Gaussian emissions, but the general relationship (3.14) can always be used for an arbitrary emission distribution.

   p(z_t = k | z_{\t}, β, α, κ) ∝
       (αβ_k + n^{-t}_{z_{t-1}k} + κδ(z_{t-1}, k)) (αβ_{z_{t+1}} + n^{-t}_{kz_{t+1}} + κδ(k, z_{t+1}) + δ(z_{t-1}, k)δ(k, z_{t+1})) / (α + n^{-t}_{k·} + κ + δ(z_{t-1}, k)),    k = 1, ..., K
       α² β_k β_{z_{t+1}} / (α + κ),    k = K + 1    (4.4)

   p(s_t = j | {s_τ : z_τ = k, τ ≠ t}, σ) ∝
       n^{-t}_{kj} / (n^{-t}_{k·} + σ),    j = 1, ..., K_k
       σ / (n^{-t}_{k·} + σ),    j = K_k + 1    (4.5)

   p(x_t | {x_τ : z_τ = k, s_τ = j, τ ≠ t}, λ) = t_{ν_kj - d + 1}( x_t; ϑ_kj, (ζ_kj + 1) ν_kj Δ_kj / (ζ_kj (ν_kj - d + 1)) ),    k = 1, ..., K, j = 1, ..., K_k
   p(x_t | {x_τ : z_τ = k, s_τ = j^new, τ ≠ t}, λ) = t_{ν - d + 1}( x_t; ϑ, (ζ + 1) ν Δ / (ζ (ν - d + 1)) ),    k = 1, ..., K, j = j^new
   p(x_t | z_t = k^new, λ) = t_{ν - d + 1}( x_t; ϑ, (ζ + 1) ν Δ / (ζ (ν - d + 1)) ),    k = k^new    (4.6)
After sampling (z_t, s_t) we have to sample β, but from (3.12) we see that we need the distribution of the number of tables that considered dish k, {m̄_{·k}}_{k=1}^K. The approach is to first find the distribution of the number of tables serving dish k, {m_{·k}}_{k=1}^K. In this algorithm, instead of using the approach based on Stirling numbers, we obtain this distribution by simulating the CRF, and we then adjust it to obtain the distribution of the dishes considered by the tables. To see why this adjustment is necessary, note that κ introduces a non-informative bias in each restaurant, so customers are more likely to select the specialty dish of that restaurant. To obtain the considered-dish distribution, we must remove this bias from the distribution of the served dishes. This can be done using an override variable. Suppose m_jk tables are serving dish k in restaurant j. If k ≠ j then m̄_jk = m_jk, since the served dish is not the house specialty; but if k = j then there is a probability that tables were overridden by the house specialty. Letting ω_jt be the override variable with prior p(ω_jt | ρ) ~ Ber(ρ), we can write:
   p(ω_jt | k_jt = j, β, ρ) ∝ p(k_jt = j | ω_jt, β, ρ) p(ω_jt | ρ) =
       β_j (1 - ρ),    ω_jt = 0
       ρ,    ω_jt = 1    (4.7)
The sum of these Bernoulli random variables is a binomial random variable, and we can finally compute the number of tables that considered ordering dish k using (4.17).
1. Given the previous set of samples z_{1:T}^{(n-1)}, s_{1:T}^{(n-1)} and β^{(n-1)}.
2. For all t ∈ {1, 2, ..., T}:
3. For each of the K currently instantiated states compute:
   - the predictive conditional distributions for each of the K_k currently instantiated mixture components of this state, for a new component, and for a new state:

   f_{k,j}(x_t) = (n^{-t}_{kj} / (n^{-t}_{k·} + σ)) p(x_t | {x_τ : z_τ = k, s_τ = j, τ ≠ t}, λ)    (4.8)
   f_{k,K_k+1}(x_t) = (σ / (n^{-t}_{k·} + σ)) p(x_t | {x_τ : z_τ = k, s_τ = j^new, τ ≠ t}, λ)    (4.9)
   f_{k^new,0}(x_t) = p(x_t | z_t = k^new, λ)    (4.10)

   - the predictive conditional distribution of the HDP-HMM state without knowledge of the current mixture component:

   f_k(x_t) = (αβ_k + n^{-t}_{z_{t-1}k} + κδ(z_{t-1}, k)) (αβ_{z_{t+1}} + n^{-t}_{kz_{t+1}} + κδ(k, z_{t+1}) + δ(z_{t-1}, k)δ(k, z_{t+1})) / (α + n^{-t}_{k·} + κ + δ(z_{t-1}, k)) ( Σ_{j=1}^{K_k} f_{k,j}(x_t) + f_{k,K_k+1}(x_t) ),    k = 1, ..., K
   f_{K+1}(x_t) = (α² β_{K+1} β_{z_{t+1}} / (α + κ)) f_{k^new,0}(x_t),    k = K + 1    (4.11)

4. Sample z_t:

   z_t ~ Σ_{k=1}^{K} f_k(x_t) δ(z_t, k) + f_{K+1}(x_t) δ(z_t, K + 1)    (4.12)
5. Sample s_t conditioned on z_t:

   s_t ~ Σ_{j=1}^{K_{z_t}} f_{z_t,j}(x_t) δ(s_t, j) + f_{z_t,K_{z_t}+1}(x_t) δ(s_t, K_{z_t} + 1)    (4.13)

6. If z_t = K + 1, increment K and transform β as:

   v_0 | γ ~ Beta(γ, 1)
   (β_0^new, β_{K+1}^new) = (β_0 v_0, β_0 (1 - v_0))    (4.14)

7. If s_t = K_{z_t} + 1, increment K_{z_t}.
8. Update the cached statistics. If there is a state k with n_{k·} = 0 or n_{·k} = 0, remove k and decrement K. If n_{kj} = 0, remove component j of state k and decrement K_k.
9. Sample the auxiliary variables by simulating a CRF:
10. For each (j, k) ∈ {1, ..., K}², set m_jk = 0 and n = 0. For each customer in restaurant j eating dish k (i = 1, ..., n_jk), sample:

   x ~ Ber( (αβ_k + κδ(j, k)) / (n + αβ_k + κδ(j, k)) )    (4.15)

11. Increment n, and if x = 1 increment m_jk.
12. For each j ∈ {1, ..., K}, sample the override variable of restaurant j:

   ω_j ~ Binomial( m_jj, ρ / (ρ + β_j (1 - ρ)) )    (4.16)

13. Set the number of informative tables in restaurant j (a sketch of steps 10-13 is given after this algorithm):

   m̄_jk = m_jk for j ≠ k,    m̄_jj = m_jj - ω_j    (4.17)

14. Sample β:

   β^(n) ~ Dir(γ, m̄_{·1}, ..., m̄_{·K})    (4.18)
15. Optionally resample the hyper-parameters (α + κ), γ, σ and ρ, as described in Section 4.2.3.
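Steps 10-13 above (equations (4.15)-(4.17)) can be summarized in the following sketch. The function and variable names are illustrative, and the routine assumes the count matrix n and the weights β have already been assembled by the sampler.

```python
import numpy as np

def informative_table_counts(n, alpha, kappa, beta, rho, rng=None):
    """Simulate table counts m_jk with the sticky bias (4.15), sample the
    override variables (4.16), and return the corrected counts of (4.17).

    n    : K x K matrix, n[j, k] = customers in restaurant j eating dish k
    beta : global weights β_1, ..., β_K
    rho  : κ / (α + κ)
    """
    rng = np.random.default_rng() if rng is None else rng
    K = n.shape[0]
    m = np.zeros((K, K), dtype=int)
    for j in range(K):
        for k in range(K):
            a = alpha * beta[k] + kappa * (j == k)
            for i in range(n[j, k]):            # eq. (4.15): new table w.p. a / (i + a)
                if rng.random() < a / (i + a):
                    m[j, k] += 1
    m_bar = m.copy()
    for j in range(K):                          # eqs. (4.16)-(4.17): remove the sticky bias
        p_override = rho / (rho + beta[j] * (1.0 - rho))
        w_j = rng.binomial(m[j, j], p_override)
        m_bar[j, j] = m[j, j] - w_j
    return m, m_bar
```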
4.2.2 Block Sampler
The problem with the direct assignment sampler of the previous section is its slow convergence rate, since we sample the states sequentially. The sampler can also group two temporally separated sets of observations, related to one underlying state, into two separate states. In the previous sampling scheme we did not exploit the Markovian structure to improve performance. In this section a variant of the forward-backward procedure is incorporated into the sampling algorithm, which enables us to sample the whole state sequence z_{1:T} at once. However, to achieve this a fixed truncation level L must be accepted, which in a sense reduces the model to a parametric one (Fox, Sudderth, Jordan, & Willsky, 2011). It should be noted that the result is different from a classical parametric Bayesian HMM, since the truncated HDP priors induce a shared sparse subset of the L possible states (Fox, Sudderth, Jordan, & Willsky, 2011). In short, we obtain an approximation to the nonparametric Bayesian HDP-HMM with the maximum number of possible states set to L; for almost all applications this causes no problem if L is set reasonably high. The approximation used in this algorithm is the degree-L weak limit approximation to the DP (Ishwaran & Zarepour, 2002), defined as:

   GEM_L(γ) ≜ Dir(γ/L, ..., γ/L)    (4.19)

Using (4.19), β is approximated as (Fox, Sudderth, Jordan, & Willsky, 2010):

   β | γ ~ Dir(γ/L, ..., γ/L)    (4.20)

Similarly to (3.4) we can write:

   π_j | α, β, κ ~ Dir(αβ_1, ..., αβ_j + κ, ..., αβ_L)    (4.21)

and the posteriors are (similarly to (3.13)):

   β | m̄, γ ~ Dir(γ/L + m̄_{·1}, ..., γ/L + m̄_{·L})
   π_j | z_{1:T}, α, β, κ ~ Dir(αβ_1 + n_{j1}, ..., αβ_j + κ + n_{jj}, ..., αβ_L + n_{jL})    (4.22)

In (4.22), n_jk is the number of transitions from state j to state k, and m̄_jk is defined as in (4.17). Finally, an order-L' weak limit approximation is used for the DP prior on the emission parameters:

   ψ_k | z_{1:T}, s_{1:T}, σ ~ Dir(σ/L' + n'_{k1}, ..., σ/L' + n'_{kL'})    (4.23)

where L' is the truncation level of the emission DPs and n'_{kj} is the number of observations assigned to component j of state k.
The forward-backward algorithm for jointly sampling z_{1:T} and s_{1:T} given x_{1:T} can be obtained from:

   p(z_t, s_t | x_{1:T}, z_{t-1}, π, ψ, θ) ∝ p(z_t | z_{t-1}, x_{1:T}, π, θ) p(s_t | z_t, ψ) p(x_t | θ_{z_t s_t}) p(x_{1:t-1} | z_t, π, θ, ψ) p(x_{t+1:T} | z_t, π, θ, ψ)    (4.24)

The right side of equation (4.24) has two parts: forward and backward probabilities (Rabiner, 1989). The forward probability includes p(z_t | z_{t-1}, x_{1:T}, π, θ) p(s_t | ψ_{z_t}) f(x_t | θ_{z_t s_t}) p(x_{1:t-1} | z_t, π, θ, ψ), and the backward probability is p(x_{t+1:T} | z_t, π, θ, ψ). It appears that the authors approximate the forward probabilities by p(z_t | z_{t-1}, π) p(s_t | ψ_{z_t}) f(x_t | θ_{z_t s_t}), and for the backward probabilities we define messages:

   m_{t,t-1}(z_{t-1}) ∝ Σ_{z_t} Σ_{s_t} p(z_t | π_{z_{t-1}}) p(s_t | ψ_{z_t}) f(x_t | θ_{z_t s_t}) m_{t+1,t}(z_t),    t ≤ T,    with m_{T+1,T}(z_T) = 1
   m_{t,t-1}(k) = Σ_{i=1}^{L} Σ_{l=1}^{L'} π_{ki} ψ_{il} f(x_t | θ_{il}) m_{t+1,t}(i),    k = 1, ..., L    (4.25)

As a result we have (Fox, Sudderth, Jordan, & Willsky, 2010):

   p(z_t = k, s_t = j | x_{1:T}, z_{t-1}, π, ψ, θ) ∝ π_{z_{t-1}k} ψ_{kj} f(x_t | θ_{kj}) m_{t+1,t}(k)    (4.26)

where for Gaussian emissions the component densities are given by f(x_t | θ_{z_t s_t}) = N(x_t; μ_{kj}, Σ_{kj}).
The algorithm is as follows (Fox, Sudderth, Jordan, & Willsky, 2010):

1. Given the previous π^(n-1), ψ^(n-1), β^(n-1) and θ^(n-1).
2. For k ∈ {1, ..., L}, initialize the messages to m_{T+1,T}(k) = 1.
3. For t ∈ {T, ..., 1} and k ∈ {1, ..., L} compute (a log-space sketch of this recursion is given after the algorithm):

   m_{t,t-1}(k) = Σ_{i=1}^{L} Σ_{l=1}^{L'} π_{ki} ψ_{il} N(x_t; μ_{il}, Σ_{il}) m_{t+1,t}(i)    (4.27)

4. Sample the augmented states (z_t, s_t) sequentially, starting from t = 1. Set n_{ik} = 0, n'_{kj} = 0 and Y_{kj} = ∅ for (i, k) ∈ {1, ..., L}² and (k, j) ∈ {1, ..., L} × {1, ..., L'}. For all (k, j) ∈ {1, ..., L} × {1, ..., L'} compute:

   f_{k,j}(x_t) = π_{z_{t-1}k} ψ_{kj} N(x_t; μ_{kj}, Σ_{kj}) m_{t+1,t}(k)    (4.28)

5. Sample the augmented state (z_t, s_t):
   (z_t, s_t) ~ Σ_{k=1}^{L} Σ_{j=1}^{L'} f_{k,j}(x_t) δ(z_t, k) δ(s_t, j)    (4.29)

6. Increment n_{z_{t-1}z_t} and n'_{z_t s_t} and add x_t to the cached statistics:

   Y_{z_t s_t} ← Y_{z_t s_t} ⊕ x_t    (4.30)

7. Sample m, ω and m̄ as in the previous algorithm.
8. Update β:

   β ~ Dir(γ/L + m̄_{·1}, ..., γ/L + m̄_{·L})    (4.31)

9. For k ∈ {1, ..., L}:
   - Sample π_k and ψ_k:

   π_k ~ Dir(αβ_1 + n_{k1}, ..., αβ_k + κ + n_{kk}, ..., αβ_L + n_{kL})
   ψ_k ~ Dir(σ/L' + n'_{k1}, ..., σ/L' + n'_{kL'})    (4.32)

   - For j ∈ {1, ..., L'} sample:

   θ_{k,j} ~ p(θ | λ, Y_{k,j})    (4.33)

10. Set π^(n) = π, ψ^(n) = ψ, β^(n) = β and θ^(n) = θ.
11. Optionally resample the hyper-parameters α, κ, γ and σ.
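To make the message recursion concrete, the following sketch computes the backward messages of (4.27) in log space and then draws (z_t, s_t) from (4.28)-(4.29). The array layout and function names are assumptions of this sketch, not part of the cited implementation.

```python
import numpy as np
from scipy.special import logsumexp

def backward_messages(logf, log_pi, log_psi):
    """Backward messages m_{t,t-1}(k) of (4.27), computed in log space.

    logf    : T x L x Lp array, logf[t, i, l] = log N(x_t; mu_il, Sigma_il)
    log_pi  : L x L array of log pi_{ki}
    log_psi : L x Lp array of log psi_{il}
    Returns logm with logm[t, k] = log m_{t,t-1}(k) and logm[T, k] = 0.
    """
    T, L, Lp = logf.shape
    logm = np.zeros((T + 1, L))                 # m_{T+1,T}(k) = 1
    for t in range(T - 1, -1, -1):
        # inner[i] = log( sum_l psi_il N(x_t; mu_il, Sigma_il) ) + log m_{t+1,t}(i)
        inner = logsumexp(log_psi + logf[t], axis=1) + logm[t + 1]
        # m_{t,t-1}(k) = sum_i pi_{ki} * exp(inner[i])
        logm[t] = logsumexp(log_pi + inner[None, :], axis=1)
    return logm

def sample_zs(logf, log_pi, log_psi, logm, z_prev, t, rng):
    """Draw (z_t, s_t) from (4.28)-(4.29) given the previous state z_{t-1}."""
    w = log_pi[z_prev][:, None] + log_psi + logf[t] + logm[t + 1][:, None]
    p = np.exp(w - logsumexp(w))
    flat = rng.choice(p.size, p=p.ravel() / p.ravel().sum())
    return np.unravel_index(flat, p.shape)      # (z_t, s_t)
```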
4.2.3 Learning Hyper-parameters
The hyper-parameters, including (α + κ), γ, σ and ρ, can also be inferred like the other parameters of the model (Fox, Sudderth, Jordan, & Willsky, 2010).
4.2.3.1 Posterior for (α + κ)
Consider the probability that data point x_ji sits at table t:

   p(t_ji = t | t^{-ji}, n^{-ji}_{jt}, α, κ) ∝
       n^{-ji}_{jt·},    t ∈ {1, ..., m_{j·}}
       α + κ,    t = t^new    (4.34)
This equation follows from equations (3.10) and (4.1). From it, we can say that the customer-table assignments follow a DP with concentration parameter α + κ. Antoniak (Antoniak, 1974) has shown that if β ~ GEM(α) and z_i ~ β, then the distribution of the number of unique values of z_i resulting from N draws from β has the form:

   p(K | N, α) = s(N, K) α^K Γ(α) / Γ(α + N)    (4.35)

where s(N, K) is the Stirling number of the first kind. Using these two facts, the distribution of the number of tables in restaurant j is:

   p(m_{j·} | α + κ, n_{j··}) = s(n_{j··}, m_{j·}) (α + κ)^{m_{j·}} Γ(α + κ) / Γ(α + κ + n_{j··})    (4.36)
The posterior over (α + κ) is then:

   p(α + κ | m_{1·}, ..., m_{J·}, n_{1··}, ..., n_{J··}) ∝ p(α + κ) p(m_{1·}, ..., m_{J·} | α + κ, n_{1··}, ..., n_{J··})
     = p(α + κ) ∏_{j=1}^{J} p(m_{j·} | α + κ, n_{j··})
     ∝ p(α + κ) ∏_{j=1}^{J} s(n_{j··}, m_{j·}) (α + κ)^{m_{j·}} Γ(α + κ) / Γ(α + κ + n_{j··})
     ∝ p(α + κ) (α + κ)^{m_{··}} ∏_{j=1}^{J} Γ(α + κ) / Γ(α + κ + n_{j··})    (4.37)

The last line holds because ∏_j s(n_{j··}, m_{j·}) is not a function of (α + κ) and can therefore be ignored. Substituting B(x, y) = Γ(x)Γ(y)/Γ(x + y) = ∫_0^1 t^{x-1} (1 - t)^{y-1} dt and using Γ(x + 1) = xΓ(x), we obtain:

   p(α + κ | m_{1·}, ..., m_{J·}, n_{1··}, ..., n_{J··}) ∝ p(α + κ) (α + κ)^{m_{··}} ∏_{j=1}^{J} (1 + n_{j··}/(α + κ)) ∫_0^1 r_j^{α+κ} (1 - r_j)^{n_{j··}-1} dr_j    (4.38)
Finally, placing a Gamma(a, b) prior on (α + κ) and introducing auxiliary variables r = (r_1, ..., r_J) and s = (s_1, ..., s_J), we can write:

   p(α + κ, r, s | m_{1·}, ..., m_{J·}, n_{1··}, ..., n_{J··}) ∝ (α + κ)^{a + m_{··} - 1} e^{-(α+κ)b} ∏_{j=1}^{J} ( n_{j··} / (α + κ) )^{s_j} r_j^{α+κ} (1 - r_j)^{n_{j··}-1}    (4.39)

where each s_j is either zero or one. For the conditional distributions we obtain:
(4.39)
Update: February 29, 2012
Preliminary Exam Report
Page 20 of 34
p    | r , s, m1 ,..., mJ , n1 ,..., nJ      
a  m 1

 j1 s j e   b  j1log rj 
 Gamma   m 



p rj |    , r\ j , s, m1 ,..., mJ , n1 ,..., nJ  rj  1  rj

p s j |    , r , s\ j m1 ,..., mJ , n1 ,..., nJ
4.2.3.2

J
J


n j 1
J
s ,b 
j 1 j

J
j 1
log rj

 Beta     1, n j


nj
 nj  j


  Ber 

  
 nj     


(4.40)
(4.41)
s
(4.42)
4.2.3.2 Posterior of γ
Similarly to the discussion leading to (4.35), the distribution of the number of unique dishes served in the whole franchise is p(K | γ, m̄_{··}) = s(m̄_{··}, K) γ^K Γ(γ)/Γ(γ + m̄_{··}). Therefore, for the posterior distribution of γ we can write:

   p(γ | K, m̄_{··}) ∝ p(γ) p(K | γ, m̄_{··})
     ∝ p(γ) γ^K Γ(γ) / Γ(γ + m̄_{··})
     ∝ p(γ) γ^{K-1} (γ + m̄_{··}) ∫_0^1 η^γ (1 - η)^{m̄_{··}-1} dη    (4.43)
Considering that the prior over γ is Gamma(a, b) and introducing auxiliary variables η and ς, we can finally write:

   p(γ, η, ς | K, m̄_{··}) ∝ γ^{a + K - 1 - ς} ( m̄_{··} )^{ς} e^{-γ(b - log η)} (1 - η)^{m̄_{··}-1}    (4.44)

and for the conditional distributions we have:

   p(γ | η, ς, K, m̄_{··}) ∝ γ^{a + K - ς - 1} e^{-γ(b - log η)} = Gamma(a + K - ς, b - log η)    (4.45)
   p(η | γ, ς, K, m̄_{··}) ∝ η^γ (1 - η)^{m̄_{··}-1} = Beta(γ + 1, m̄_{··})    (4.46)
   p(ς | γ, η, K, m̄_{··}) ∝ ( m̄_{··} / γ )^{ς} = Ber( m̄_{··} / (m̄_{··} + γ) )    (4.47)

4.2.3.3 Posterior of σ
The posterior for σ is obtained in a similar way to that of (α + κ). We use two sets of auxiliary variables, r' and s', and the final conditional distributions are:

   p(σ | r', s', K_1, ..., K_J, n_{1··}, ..., n_{J··}) ∝ σ^{a + K_{·} - 1 - Σ_{j=1}^{J} s'_j} e^{-σ(b - Σ_{j=1}^{J} log r'_j)}    (4.48)

   p(r'_j | σ, r'_{\j}, s', K_1, ..., K_J, n_{1··}, ..., n_{J··}) ∝ r'_j^{σ} (1 - r'_j)^{n_{j··}-1} = Beta(σ + 1, n_{j··})    (4.49)

   p(s'_j | σ, r', s'_{\j}, K_1, ..., K_J, n_{1··}, ..., n_{J··}) ∝ ( n_{j··} / σ )^{s'_j} = Ber( n_{j··} / (n_{j··} + σ) )    (4.50)

Here K_j denotes the number of mixture components instantiated for state j, K_{·} = Σ_j K_j, and n_{j··} is the number of observations assigned to state j.
It should be noted that, when auxiliary variables are used, we prefer to iterate these updates several times before moving on to the next iteration of the main algorithm.
4.2.3.4 Posterior of ρ
By definition ρ = κ/(α + κ). Considering that the prior on ρ is Beta(c, d) and that ω_jt ~ Ber(ρ), we can write:

   p(ρ | ω) ∝ p(ω | ρ) p(ρ) ∝ Binomial( Σ_j ω_j ; m_{··}, ρ ) Beta(ρ; c, d) = Beta( Σ_j ω_j + c, m_{··} - Σ_j ω_j + d )    (4.51)
4.2.4 Online Learning
The last two approaches are based on batch learning. One problem with these methods is the need to rerun the whole algorithm over the whole data set whenever new data points become available. Moreover, for large datasets we might face practical constraints such as memory size. An alternative approach is to use sequential learning techniques, which essentially let us update the model as each new data point arrives. The algorithm described here is adapted from (Rodriguez, 2011), but the main idea for the general case was published in (Carvalho, Johannes, Lopes, & Polson, 2010) and (Carvalho, Lopes, Polson, & Taddy, 2010). For Bayesian problems, different versions of particle filters are used to replace batch MCMC methods; for further information about particle filters refer to (Cappe, Godsill, & Moulines, 2007). It should be noted that this algorithm is developed for the non-sticky (κ = 0) HDP-HMM with one mixture component per state, but the generalization to the sticky HDP-HMM with DP emissions is straightforward.
4.2.4.1 Particle Learning (PL) Framework for Mixtures
PL was proposed in (Carvalho, Johannes, Lopes, & Polson, 2010) and (Carvalho, Lopes, Polson, & Taddy, 2010) and is a special formulation of augmented particle filters. A general mixture model that PL is designed to infer can be represented by:

   y_{t+1} = f(x_{t+1}, θ)
   x_{t+1} = g(x^t, θ)    (4.52)
which, in case of mixtures, indicates which component is assigned to the observations and xt   x1 ,..., xt 
In order to estimate states and parameters we should define an “essential state vector” zt   xt , st ,  where



st  stx , st and st are the state sufficient statistics and st is the parameter sufficient statistics (Carvalho,
x
Johannes, Lopes, & Polson, 2010). After observing yt 1 , particles should be updated based on:

p zt | y

t 1


p  yt 1 | zt  p zt | y t

p yt 1 | y
t
 


(4.53)


p xt 1 | yt 1  p  xt 1 | zt , yt 1  p zt | y t 1 dzt
(4.54)
A particle approximation to (4.53) is:

   p^N(z_t | y^{t+1}) = Σ_{i=1}^{N} ω_t^{(i)} δ_{z_t^{(i)}}(z_t),    ω_t^{(i)} = p(y_{t+1} | z_t^{(i)}) / Σ_{j=1}^{N} p(y_{t+1} | z_t^{(j)})    (4.55)

Using this approximation we can generate propagated samples from the posterior p(x_{t+1} | z_t, y_{t+1}) to approximate p(x_{t+1} | y^{t+1}). The sufficient statistics can then be updated using a deterministic mapping, and finally the parameters are updated using the sufficient statistics. The main condition for this algorithm to be applicable is the tractability of p(y_{t+1} | z_t^{(i)}) and p(x_{t+1} | z_t, y_{t+1}). The PL algorithm is as follows:

1. Resample the particles z_t with weights ω_t^{(i)} ∝ p(y_{t+1} | z_t^{(i)}).
2. Propagate new states x_{t+1} using p(x_{t+1} | z_t, y_{t+1}).
3. Update the state and parameter sufficient statistics deterministically: s_{t+1} = S(s_t, x_{t+1}, y_{t+1}).
4. Sample θ from p(θ | s_{t+1}).

(A structural sketch of one such resample/propagate step is given at the end of this subsection.)

After one sequential run through the data, we can use a smoothing algorithm to obtain p(x^T | y^T). The algorithm is repeated for b = 1, ..., B to generate B sample paths:

1. Sample (x_T^(b), s_T^(b), θ^(b)) from the output of the particle filter, Σ_{i=1}^{N} (1/N) δ_{z_T^{(i)}}(z_T).
For t = T - 1, ..., 1:
2. Sample (x_t^(b), s_t^(b)) ~ Σ_{s=1}^{N} ω̃_t^{(s)} δ_{z_t^{(s)}}, where ω̃_t^{(i)} = q_t^{(i)} / Σ_{s=1}^{N} q_t^{(s)} and q_t^{(i)} = p(x_{t+1}^{(b)}, r_{t+1}^{(b)} | z_t^{(i)}).
4.2.4.2 PL for HDP-HMM
In this algorithm we use the essential state vector Z_t^{(i)} = ( z_t^{(i)}, L_t^{(i)}, {n_{lj,t}^{(i)}}, {S_{l,t}^{(i)}}, β_t^{(i)}, α_t^{(i)}, γ_t^{(i)}, {m_{lj,t}^{(i)}}, η_t^{(i)}, {u_{l,t}^{(i)}}, {ξ_{l,t}^{(i)}} ), where z_t is the state, L_t is the number of states at time t, n_{lj,t} is the number of transitions from state l to state j up to time t, S_{l,t} is the sufficient statistic for state l at time t, and the other variables have the same definitions as in previous sections (the last three are auxiliary variables used to infer the hyper-parameters).
From (4.4), setting κ = 0, we can write:

   p(z_{t+1} = k | z_1, ..., z_t, z_{t+2} = k', β, α) ∝
       (αβ_k + n_{z_t k,t}) (αβ_{k'} + n_{kk',t} + δ(z_t, k)δ(k, k')) / (α + n_{k·,t} + δ(z_t, k)),    k = 1, ..., L_t
       αβ_{L_t+1} β_{k'},    k = L_t + 1    (4.56)

In this equation k' is the next state, z_{t+2}, which we have not seen yet, so we should integrate it out by summing over all its possible values. The numerator terms then sum to the denominator and cancel, and the result is:

   p(z_{t+1} = k | z_1, ..., z_t, β, α) ∝
       αβ_k + n_{z_t k,t},    k = 1, ..., L_t
       αβ_{L_t+1},    k = L_t + 1    (4.57)
p  zt 1 | z1 ,..., zt ,  ,  
Lt



 nzt kt
nzt  
k 1
p  xt 1 | Z t   p  xt 1 | zt 1  p  zt 1 | Z t  dzt 1 
k
Lt

k 1

k

k
 nzt kt
nzt  

f
 Lt 1
nzt  
 xt 1
k
 Lt 1
 xt 1  
 Lt 1
nzt  
(4.58)
f Lt xt11  xt 1  (4.59)
The algorithm is as follow:
1. Compute ω_t^{(i)} = v_t^{(i)} / Σ_{j=1}^{N} v_t^{(j)}, where v_t^{(i)} = Σ_{l=1}^{L_t^{(i)}+1} q_l^{(i)}(Z_t^{(i)}, x_{t+1}) and (a short sketch of this computation is given after the algorithm):

   q_l^{(i)}(Z_t^{(i)}, x_{t+1}) =
       ( (n_{z_t^{(i)} l,t}^{(i)} + α_t^{(i)} β_{l,t}^{(i)}) / (n_{z_t^{(i)} ·,t}^{(i)} + α_t^{(i)}) ) f_l^{x_{t+1}}(x_{t+1}),    l = 1, ..., L_t^{(i)}
       ( α_t^{(i)} β_{L_t^{(i)}+1,t}^{(i)} / (n_{z_t^{(i)} ·,t}^{(i)} + α_t^{(i)}) ) f_{L_t^{(i)}+1}^{x_{t+1}}(x_{t+1}),    l = L_t^{(i)} + 1    (4.60)

   where f_l(x_{t+1}) is computed as in (4.6).
2. Sample Z_t^{(1)}, Z_t^{(2)}, ..., Z_t^{(N)} from

   p^N(Z_t | x_{1:t+1}) = Σ_{i=1}^{N} ω_t^{(i)} δ_{Z_t^{(i)}}(Z_t)    (4.61)

3. Propagate the particles to generate Z_{t+1}^{(i)}:
4. Sample z_{t+1}^{(i)}:

   z_{t+1}^{(i)} ~ p(z_{t+1}^{(i)} | Z_t^{(i)}, x_{t+1}) = Σ_{l=1}^{L_t^{(i)}+1} ( q_l^{(i)}(Z_t^{(i)}, x_{t+1}) / Σ_{j=1}^{L_t^{(i)}+1} q_j^{(i)}(Z_t^{(i)}, x_{t+1}) ) δ_l(z_{t+1}^{(i)})    (4.62)

5. Update the number of states:

   L_{t+1}^{(i)} = L_t^{(i)} + 1 if z_{t+1}^{(i)} = L_t^{(i)} + 1,    L_{t+1}^{(i)} = L_t^{(i)} otherwise    (4.63)

6. Update the sufficient statistics:

   n_{z_t^{(i)} z_{t+1}^{(i)}, t+1}^{(i)} = n_{z_t^{(i)} z_{t+1}^{(i)}, t}^{(i)} + 1,    S_{z_{t+1}^{(i)}, t+1} = S_{z_{t+1}^{(i)}, t} ⊕ s(x_{t+1})    (4.64)
7. If zt(i )1  L(ti ) :
\
Update: February 29, 2012
Preliminary Exam Report
Page 25 of 34
t(i 1)   t(i )


v0 |  ~ Beta  t(i ) ,1

(i )
0,t 1 ,
 
(4.65)
 L((ii)) 1,t   0,(it) 0 ,  0,( it) (1  v0 )
t

8. Hyper-parameters update:
9. Sample the m_{l,j,t+1}^{(i)} from

   p(m_{l,j,t+1}^{(i)} = m) ∝ s(n_{l,j,t+1}^{(i)}, m) (α_t^{(i)} β_{j,t+1}^{(i)})^m    (4.66)

   Alternatively, we can simulate a CRF instead of computing Stirling numbers.
10. Sample γ_{t+1}^{(i)} by first sampling η_{t+1}^{(i)} ~ Beta(γ_t^{(i)} + 1, m_{··,t+1}^{(i)}) and then drawing from the two-component Gamma mixture

   γ_{t+1}^{(i)} ~ ϖ Gamma(a + L_{t+1}^{(i)}, b - log η_{t+1}^{(i)}) + (1 - ϖ) Gamma(a + L_{t+1}^{(i)} - 1, b - log η_{t+1}^{(i)}),
   ϖ / (1 - ϖ) = (a + L_{t+1}^{(i)} - 1) / ( m_{··,t+1}^{(i)} (b - log η_{t+1}^{(i)}) )    (4.67)

11. Sample α_{t+1}^{(i)} using auxiliary variables:

   ξ_{l,t+1}^{(i)} ~ Beta(α_t^{(i)} + 1, n_{l·,t+1}^{(i)})
   u_{l,t+1}^{(i)} ~ Ber( n_{l·,t+1}^{(i)} / (α_t^{(i)} + n_{l·,t+1}^{(i)}) )
   α_{t+1}^{(i)} ~ Gamma( a_α + m_{··,t+1}^{(i)} - Σ_l u_{l,t+1}^{(i)}, b_α - Σ_{l=1}^{L_{t+1}^{(i)}} log ξ_{l,t+1}^{(i)} )    (4.68)

12. Resample β:

   β_{t+1}^{(i)} ~ Dir( m_{·1,t+1}, ..., m_{·L_{t+1}^{(i)},t+1}, γ_{t+1}^{(i)} )    (4.69)
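The per-particle weight computation of (4.58)-(4.60) reduces to a few lines. The sketch below assumes the per-state predictive log-likelihoods have already been evaluated as in (4.6); the function name and array layout are illustrative.

```python
import numpy as np

def particle_weight(counts_row, beta, alpha, loglik):
    """Un-normalized weight v_t^(i) of (4.60): the predictive probability of
    x_{t+1} under one particle, marginalizing over the next state z_{t+1}.

    counts_row : n_{z_t l, t} for l = 1..L_t (transitions out of the current state)
    beta       : global weights (β_1, ..., β_{L_t}, β_{L_t+1}), one entry longer
    alpha      : concentration α_t^(i)
    loglik     : log f_l(x_{t+1}) for l = 1..L_t+1 (e.g. Student-t predictives)
    Returns (v, q), where q holds the per-state terms reused in (4.62).
    """
    counts_row = np.asarray(counts_row, dtype=float)
    prior = np.append(alpha * np.asarray(beta[:-1]) + counts_row, alpha * beta[-1])
    prior /= counts_row.sum() + alpha            # eq. (4.58)
    q = prior * np.exp(np.asarray(loglik))       # eq. (4.60): prior times predictive
    return q.sum(), q
```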
After finishing this inference step, we can optionally smooth the states using an algorithm similar to the one discussed in 4.2.4.1. One drawback of this scheme is that the paths z_1, ..., z_T are coupled together, since we integrated out the π_l in the inference algorithm. To improve particle diversity, we can sample the transition probabilities and emission parameters explicitly. The smoothing algorithm is then as follows:

1. Sample Z_T^{(b)} ~ Σ_{i=1}^{N} (1/N) δ_{Z_T^{(i)}}.
2. Sample π_l^{(b)} ~ Dir( α^{(b)} β_1^{(b)} + n_{l1,T}^{(b)}, ..., α^{(b)} β_{L^{(b)}}^{(b)} + n_{lL^{(b)},T}^{(b)} ).
3. Sample θ_l^{(b)} from the normal-inverse-Wishart posterior θ_l | S_{l,T}, n_{l·,T}, ζ, ϑ, ν, Δ.
4. Sample z_1^{(b)}, ..., z_T^{(b)} using a single run of the forward-backward algorithm, with π_l^{(b)} and θ_l^{(b)} as its parameters.
4.3 Applications
One application of the HDP-HMM, discussed extensively in (Fox, Sudderth, Jordan, & Willsky, 2011), is speaker diarization. In this application, we are interested in segmenting an audio file into time intervals associated with different speakers. If the number of speakers were known a priori, a classic HMM could be used, with each speaker modeled as a different state. However, in real-world applications the number of speakers is not known, and therefore nonparametric models are a natural solution. It has been shown in (Fox, Sudderth, Jordan, & Willsky, 2011) that the HDP-HMM can produce results comparable to other state-of-the-art systems.
Another application cited as an application of the HDP-HMM is word segmentation (Teh & Jordan, 2010). In this problem, we have an utterance and we are interested in segmenting it into words. Each word can be represented as a state in an HDP-HMM, and the transition distributions define a grammar over words.
5 CONCLUSION
In this report, we have investigated hierarchical Dirichlet processes and their application to extending HMMs into infinite HMMs. We also reviewed two inference algorithms for the HDP and three inference algorithms for the HDP-HMM.
The HDP-HMM seems to be a good candidate for many applications that traditionally use HMMs. A nonparametric Bayesian approach can help us learn the complexity of the models automatically from the data instead of relying on heuristic tuning methods. Moreover, the framework provides a generic and simple approach to organizing all models (e.g., the different HMMs in a speech recognizer) in a well-defined hierarchy and to tying the parameters of different models together using Bayesian hierarchical methods. The definition of the HDP-HMM (with DP emissions) can also be altered to include another HDP that links the DP emissions of different states together (i.e., links the components of the mixture models). Another area of work is inference algorithms. We have presented three algorithms based on Gibbs sampling; the block and sequential samplers have some interesting properties that make them reasonable candidates for large datasets. In particular, it seems easy to build a parallel implementation of the sequential sampler, which can be an important factor for large-scale problems. Studying other kinds of inference methods, such as variational methods, or parallel implementations of these algorithms can be a subject of further research.
6 REFERENCES
Antoniak, C. (1974). Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics, 1152-1174.
Cappe, O., Godsill, S. J., & Moulines, E. (2007, May). An Overview of Existing Methods and Recent
Advances in Sequential Monte Carlo. Proceedings of the IEEE, 95(5), 899-924.
Carvalho, C. M., Johannes, M., Lopes, H. F., & Polson, N. (2010). Particle Learning and Smoothing. Statistical Science, 88-106.
Carvalho, C. M., Lopes, H. F., Polson, N. G., & Taddy, M. A. (2010). Particle Learning for General
Mixtures. Bayesian Analysis, 709-740.
Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2010). Supplement to "A Sticky HDP-HMM with Application to Speaker Diarization". doi:10.1214/10-AOAS395SUPP
Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics, 5, 1020-1056.
Ishwaran, H., & Zarepour, M. (2002, June). Exact and approximate sum representations for the Dirichlet
process. Canadian Journal of Statistics, 269-283.
Pitman, J. (1996). Random Discrete Distributions Invariant under Size-Biased Permutation. Advances in Applied Probability, 525-539.
Rabiner, L. R. (1989, February). A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. Proceedings of the IEEE, 77(2), 257-286.
Rodriguez, A. (2011, July). On-Line Learning for the Infinite Hidden Markov. Communications in
Statistics: Simulation and Computation, 40(6), 879-893.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 639-650.
Sudderth, E. B. (2006). chapter 2 of "Graphical Models for Visual Object Recognition and Tracking".
Cambridge, MA: Massachusetts Institute of Technology.
Teh, Y., & Jordan, M. (2010). Hierarchical Bayesian Nonparametric Models with Applications. In N.
Hjort, C. Holmes, P. Mueller, & S. Walker, Bayesian Nonparametrics: Principles and Practice.
Cambridge, UK: Cambridge University Press.
Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet Processes. Technical Report 653
UC Berkeley.
Teh, Y., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet Processes. Journal of the
\
Update: February 29, 2012
Preliminary Exam Report
Page 28 of 34
American Statistical Association, 1566-1581.
APPENDIX A: DERIVATION OF HDP RELATIONSHIPS
A.1. Stick-Breaking Construction (Teh, Jordan, Beal, & Blei, 2006)
Lemma A.1.1
k 1
In this lemma we show  k  vk  1  vl 
l 1
We know:
 K 1 
 k  vk 1   l 
 l 1 
Using this equation we can write:
1
k 1
K
l 1
l 1
 l  1  v1  v2 (1  v1 )  v3 (1  v1 )(1  v2 )....  (1  v1 ) 1  v2  v3 (1  v2 )....   (1  vl )
  k  vk
k 1
 1  v 
l
l 1




We know $G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_k^{**}}$. Let $(A_1, \ldots, A_r)$ be a finite measurable partition of $\Theta$ and define $K_l = \{ k : \theta_k^{**} \in A_l \}$, $l = 1, \ldots, r$. From (3.1) and the definition of the DP we have:
$$\left( G_j(A_1), \ldots, G_j(A_r) \right) \sim \mathrm{Dir}\left( \alpha G_0(A_1), \ldots, \alpha G_0(A_r) \right) \qquad \text{(A1.1)}$$
Using (A1.1), (3.2) and (3.3) we obtain:
$$\left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dir}\left( \alpha \sum_{k \in K_1} \beta_k, \ldots, \alpha \sum_{k \in K_r} \beta_k \right) \qquad \text{(A1.2)}$$
Since this holds for every finite partition of the positive integers, we conclude that $\pi_j \sim \mathrm{DP}(\alpha, \beta)$.
Now, for the partition $\{1, \ldots, k-1\}, \{k\}, \{k+1, k+2, \ldots\}$, the aggregation property of the Dirichlet distribution gives:
$$\left( \sum_{l=1}^{k-1} \pi_{jl}, \; \pi_{jk}, \; \sum_{l=k+1}^{\infty} \pi_{jl} \right) \sim \mathrm{Dir}\left( \alpha \sum_{l=1}^{k-1} \beta_l, \; \alpha \beta_k, \; \alpha \sum_{l=k+1}^{\infty} \beta_l \right) \qquad \text{(A1.3)}$$
Using the neutrality property of the Dirichlet distribution:
$$\frac{1}{1 - \sum_{l=1}^{k-1} \pi_{jl}} \left( \pi_{jk}, \; \sum_{l=k+1}^{\infty} \pi_{jl} \right) \sim \mathrm{Dir}\left( \alpha \beta_k, \; \alpha \sum_{l=k+1}^{\infty} \beta_l \right) \qquad \text{(A1.4)}$$
Notice that $\sum_{l=k+1}^{\infty} \beta_l = 1 - \sum_{l=1}^{k} \beta_l$, and defining $v_{jk} = \dfrac{\pi_{jk}}{1 - \sum_{l=1}^{k-1} \pi_{jl}}$ we have:
$$v_{jk} \sim \mathrm{Beta}\left( \alpha \beta_k, \; \alpha \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right) \qquad \text{(A1.5)}$$
Using Lemma A.1.1:
$$\pi_{jk} = v_{jk} \left( 1 - \sum_{l=1}^{k-1} \pi_{jl} \right) = v_{jk} \prod_{l=1}^{k-1} (1 - v_{jl}) \qquad \text{(A1.6)}$$
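As a numerical illustration of (A1.5) and (A1.6), the following short sketch (my own, truncated to a finite K so it can be computed) first draws the global weights $\beta$ by ordinary stick-breaking and then draws the group-level weights $\pi_j$ from $\beta$; the truncation level and the hyperparameter values are assumptions.

import numpy as np

def stick_break_global(gamma, K, rng):
    # Truncated stick-breaking: v_k ~ Beta(1, gamma), beta_k = v_k * prod_{l<k} (1 - v_l).
    v = rng.beta(1.0, gamma, size=K)
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

def stick_break_group(alpha, beta, rng):
    # Group-level weights following (A1.5)-(A1.6):
    # v_jk ~ Beta(alpha*beta_k, alpha*(1 - sum_{l<=k} beta_l)), pi_jk = v_jk * prod_{l<k} (1 - v_jl).
    tail = np.maximum(1.0 - np.cumsum(beta), 1e-12)   # guard against negative values from rounding
    v = rng.beta(alpha * beta, alpha * tail)
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

rng = np.random.default_rng(0)
beta = stick_break_global(gamma=2.0, K=50, rng=rng)
pi_j = stick_break_group(alpha=5.0, beta=beta, rng=rng)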
A.2. Deriving Posterior and Predictive Distributions
Derivation of equation (3.8):
Since $G_0$ is a random distribution, we can draw from it. Assume $\theta_1^{**}, \ldots, \theta_{m_{\cdot\cdot}}^{**}$ are i.i.d. draws from $G_0$; they take values in $\Theta$ since $G_0$ is a distribution over $\Theta$. Let $A_1, A_2, \ldots, A_K$ be a finite measurable partition of $\Theta$ and $m_{\cdot r} = \#\{ i : \theta_i^{**} \in A_r \}$. Using conjugacy and the definition of the DP we can write:
$$\left( G_0(A_1), \ldots, G_0(A_K) \right) \mid \theta_1^{**}, \ldots, \theta_{m_{\cdot\cdot}}^{**} \sim \mathrm{Dir}\left( \gamma H(A_1) + m_{\cdot 1}, \ldots, \gamma H(A_K) + m_{\cdot K} \right) \qquad \text{(A1.7)}$$
This shows that the posterior of $G_0$ is a DP with concentration parameter $\gamma + \sum_r m_{\cdot r} = \gamma + m_{\cdot\cdot}$ and mean $\dfrac{\gamma H + \sum_i \delta_{\theta_i^{**}}}{\gamma + m_{\cdot\cdot}}$. We also know $\sum_i \delta_{\theta_i^{**}} = \sum_{k=1}^{K} m_{\cdot k} \delta_{\theta_k^{**}}$, so we can write:
$$G_0 \mid \gamma, H, \boldsymbol{\theta}^{**} \sim \mathrm{DP}\left( \gamma + m_{\cdot\cdot}, \; \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k} \delta_{\theta_k^{**}}}{\gamma + m_{\cdot\cdot}} \right) \qquad \text{(A1.8)}$$
Derivation of equation (3.9):
Similar to the previous proof, let $A_1, A_2, \ldots, A_K$ be a finite measurable partition of $\Theta$ and $n_{j \cdot k} = \#\{ i : \theta_{ji} \in A_k \}$:
$$\left( G_j(A_1), \ldots, G_j(A_K) \right) \mid \theta_{j1}, \ldots, \theta_{j n_j} \sim \mathrm{Dir}\left( \alpha G_0(A_1) + n_{j \cdot 1}, \ldots, \alpha G_0(A_K) + n_{j \cdot K} \right) \qquad \text{(A1.9)}$$
So $G_j$ is a DP with concentration parameter $\alpha + \sum_k n_{j \cdot k} = \alpha + n_{j \cdot \cdot}$ and mean $\dfrac{\alpha G_0 + \sum_{i=1}^{n_j} \delta_{\theta_{ji}}}{\alpha + n_{j \cdot \cdot}}$, and finally we can write:
$$G_j \mid \alpha, G_0, \boldsymbol{\theta} \sim \mathrm{DP}\left( \alpha + n_{j \cdot \cdot}, \; \frac{\alpha G_0 + \sum_{k=1}^{K} n_{j \cdot k} \delta_{\theta_k^{**}}}{\alpha + n_{j \cdot \cdot}} \right) \qquad \text{(A1.10)}$$
Derivation of equation (3.10):
For $A \subset \Theta$:
$$p(\theta_{ji} \in A \mid \theta_{j1}, \ldots, \theta_{j,i-1}) = E\left[ G_j(A) \mid \theta_{j1}, \ldots, \theta_{j,i-1} \right] = \frac{1}{\alpha + n_{j \cdot \cdot}} \left( \alpha G_0(A) + \sum_{k=1}^{K} n_{j \cdot k} \delta_{\theta_k^{**}}(A) \right) \qquad \text{(A1.11)}$$
$$\sum_{k=1}^{K} n_{j \cdot k} \delta_{\theta_k^{**}} = \sum_{k=1}^{K} \; \sum_{t : \psi_{jt} = \theta_k^{**}} n_{jt \cdot} \delta_{\psi_{jt}} = \sum_{t=1}^{m_{j \cdot}} n_{jt \cdot} \delta_{\psi_{jt}} \qquad \text{(A1.12)}$$
From (A1.11) and (A1.12):
$$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha, G_0 \sim \frac{1}{\alpha + n_{j \cdot \cdot}} \left( \alpha G_0 + \sum_{t=1}^{m_{j \cdot}} n_{jt \cdot} \delta_{\psi_{jt}} \right) \qquad \text{(A1.13)}$$
The derivation of equation (3.11) is very similar; we only need to calculate the expectation of $G_0$ instead of $G_j$.
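Equation (A1.13) is the familiar Polya-urn form; as a sanity check, here is a minimal sketch (my own) of one draw for a single group: an existing atom $\psi_{jt}$ is reused with probability proportional to $n_{jt\cdot}$, and a fresh draw from $G_0$ is taken with probability proportional to $\alpha$. The function base_draw and the hyperparameter values are illustrative assumptions.

import numpy as np

def polya_urn_draw(atoms, counts, alpha, base_draw, rng):
    # One draw of theta_{ji} given the previous draws in group j, per (A1.13).
    # atoms  : list of distinct values psi_{jt} already drawn in this group
    # counts : list of counts n_{jt.} for those values
    n_j = sum(counts)
    probs = np.array(counts + [alpha], dtype=float) / (alpha + n_j)
    choice = rng.choice(len(probs), p=probs)
    if choice == len(atoms):                 # new atom drawn from the base measure G_0
        atoms.append(base_draw(rng))
        counts.append(1)
    else:                                    # reuse an existing atom
        counts[choice] += 1
    return atoms[choice]

rng = np.random.default_rng(0)
atoms, counts = [], []
for _ in range(20):
    polya_urn_draw(atoms, counts, alpha=1.0,
                   base_draw=lambda r: r.normal(0.0, 1.0), rng=rng)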
Derivation of equation (3.12):
We know that $G_0 \mid \gamma, H, \boldsymbol{\theta}^{**} \sim \mathrm{DP}\left( \gamma + m_{\cdot\cdot}, \; \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k} \delta_{\theta_k^{**}}}{\gamma + m_{\cdot\cdot}} \right)$, and we also know that $G_0$ can be written in two parts: the spikes located at the previously drawn atoms $\theta_k^{**}$ and a draw from a DP over the remainder of $\Theta$:
$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k} = \beta_0 \sum_{j=1}^{\infty} \beta'_j \delta_{\theta'_j} + \sum_{k=1}^{K} \beta_k \delta_{\theta_k^{**}} \qquad \text{(A1.14)}$$
Let $(A_0, A_1, \ldots, A_K)$ be a partition of $\Theta$ where $A_0$ contains all of $\Theta$ except the spikes located at the $\theta_k^{**}$, and $A_k$, $k = 1, \ldots, K$, contains the $k$-th spike. We can write:
$$\left( G_0(A_0), G_0(A_1), \ldots, G_0(A_K) \right) \sim \mathrm{Dir}\left( \alpha_0 F(A_0), \ldots, \alpha_0 F(A_K) \right), \quad \alpha_0 = \gamma + m_{\cdot\cdot}, \quad F = \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k} \delta_{\theta_k^{**}}}{\gamma + m_{\cdot\cdot}} \qquad \text{(A1.15)}$$
And as a result we can write:
$$(\beta_0, \beta_1, \ldots, \beta_K) \sim \mathrm{Dir}\left( \gamma, m_{\cdot 1}, \ldots, m_{\cdot K} \right) \qquad \text{(A1.16)}$$
The derivation of equation (3.13) follows the same lines.
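A result of the form (A1.16) is what one would use when resampling the global weights $\beta$ in a direct-assignment-style sampler, given the table counts $m_{\cdot k}$. A minimal sketch (my own, assuming the counts are already available) is:

import numpy as np

def resample_global_weights(m_dot_k, gamma, rng):
    # Draw (beta_0, beta_1, ..., beta_K) ~ Dir(gamma, m_.1, ..., m_.K), as in (A1.16).
    # m_dot_k : array of table counts m_.k, one entry per currently used component;
    # beta_0 aggregates the mass of all unused components.
    params = np.concatenate(([gamma], np.asarray(m_dot_k, dtype=float)))
    draw = rng.dirichlet(params)
    return draw[0], draw[1:]          # (beta_0, beta_1..K)

rng = np.random.default_rng(0)
beta_0, beta = resample_global_weights(m_dot_k=[4, 2, 7], gamma=1.5, rng=rng)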
APPENDIX B: DERIVATION OF HDP-HMM RELATIONSHIPS
B.1. Derivation of the posterior distribution for $(z_t, s_t)$
Lemma B.1.1
$$\int \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \, d\pi = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left( \sum_{k=1}^{K} \alpha_k \right)} \qquad \text{(B1.1)}$$
This holds because $\mathrm{Dir}(\pi; \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left( \sum_{k=1}^{K} \alpha_k \right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$ integrates to one.
Derivation of $p(z_t = k \mid z_{\backslash t}, \beta, \alpha, \kappa)$:
Using the chain rule and the graphical model of Figure 3 we can write:
$$p(z_t = k \mid z_{\backslash t}, \beta, \alpha, \kappa) = \frac{\int \prod_i p(\pi_i \mid \beta, \alpha, \kappa) \, p(z_{t+1} \mid \pi_k) \, p(z_t = k \mid \pi_{z_{t-1}}) \prod_{\tau \neq t, t+1} p(z_\tau \mid \pi_{z_{\tau-1}}) \, d\pi}{\int \prod_i p(\pi_i \mid \beta, \alpha, \kappa) \prod_{\tau \neq t} p(z_\tau \mid \pi_{z_{\tau-1}}) \, d\pi}$$
$$\propto \int p(z_{t+1} \mid \pi_k) \, p(z_t = k \mid \pi_{z_{t-1}}) \prod_i p\left( \pi_i \mid \{ z_\tau : z_{\tau-1} = i, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) d\pi \qquad \text{(B1.2)}$$
For $z_{t-1} = j$ and $k \neq j$ we can write:
$$p(z_t = k \mid z_{\backslash t}, \beta, \alpha, \kappa) \propto \int p(z_{t+1} \mid \pi_k) \, p\left( \pi_k \mid \{ z_\tau : z_{\tau-1} = k, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) d\pi_k \int p(z_t = k \mid \pi_j) \, p\left( \pi_j \mid \{ z_\tau : z_{\tau-1} = j, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) d\pi_j$$
$$= p\left( z_{t+1} \mid \{ z_\tau : z_{\tau-1} = k, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) p\left( z_t = k \mid \{ z_\tau : z_{\tau-1} = j, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) \qquad \text{(B1.3)}$$
For $k = j$:
$$p(z_t = j \mid z_{\backslash t}, \beta, \alpha, \kappa) \propto \int p(z_{t+1} \mid \pi_j) \, p(z_t = j \mid \pi_j) \, p\left( \pi_j \mid \{ z_\tau : z_{\tau-1} = j, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) d\pi_j = p\left( z_t = j, z_{t+1} \mid \{ z_\tau : z_{\tau-1} = j, \, \tau \neq t, t+1 \}, \beta, \alpha, \kappa \right) \qquad \text{(B1.4)}$$
Using the fact that $z$ has a multinomial distribution, $\pi_i \mid \alpha, \beta, \kappa \sim \mathrm{Dir}(\alpha \beta_1, \ldots, \alpha \beta_i + \kappa, \ldots, \alpha \beta_K, \alpha \beta_{K+1})$, and using Lemma B.1.1:
$$p(z_\tau \mid z_{\tau-1} = i, \beta, \alpha, \kappa) = \int p(\pi_i \mid \beta, \alpha, \kappa) \, p(z_\tau \mid z_{\tau-1} = i, \pi_i) \, d\pi_i$$
$$= \int \frac{\Gamma\left( \sum_k (\alpha \beta_k + \kappa \, \delta(k,i)) \right)}{\prod_k \Gamma(\alpha \beta_k + \kappa \, \delta(k,i))} \prod_k \pi_{ik}^{\alpha \beta_k + \kappa \, \delta(k,i) - 1} \prod_k \pi_{ik}^{n_{ik}} \, d\pi_i$$
$$= \frac{\Gamma\left( \sum_k (\alpha \beta_k + \kappa \, \delta(k,i)) \right)}{\prod_k \Gamma(\alpha \beta_k + \kappa \, \delta(k,i))} \cdot \frac{\prod_k \Gamma(\alpha \beta_k + \kappa \, \delta(k,i) + n_{ik})}{\Gamma\left( \sum_k (\alpha \beta_k + \kappa \, \delta(k,i) + n_{ik}) \right)} = \frac{\Gamma(\alpha + \kappa)}{\Gamma(\alpha + \kappa + n_{i \cdot})} \prod_k \frac{\Gamma(\alpha \beta_k + \kappa \, \delta(k,i) + n_{ik})}{\Gamma(\alpha \beta_k + \kappa \, \delta(k,i))} \qquad \text{(B1.5)}$$
In the equation above, $n_{ik}$ denotes the number of transitions from state $i$ to state $k$. Using (B1.5), (B1.3), and (B1.4), and after some algebra, we obtain:
$$p(z_t = k \mid z_{\backslash t}, \beta, \alpha, \kappa) \propto
\begin{cases}
\left( \alpha \beta_k + n_{z_{t-1} k} + \kappa \, \delta(z_{t-1}, k) \right) \dfrac{\alpha \beta_{z_{t+1}} + n_{k z_{t+1}} + \kappa \, \delta(k, z_{t+1}) + \delta(z_{t-1}, k) \, \delta(k, z_{t+1})}{\alpha + n_{k \cdot} + \kappa + \delta(z_{t-1}, k)}, & k = 1, \ldots, K \\[2ex]
\dfrac{\alpha^2 \beta_k \beta_{z_{t+1}}}{\alpha + \kappa}, & k = K+1
\end{cases} \qquad \text{(B1.6)}$$
Equation (4.5) can be obtained in a way similar to (3.10); note that in this case there is a DP for each state, and therefore the data points for that DP are all the data points associated with that state.
Equation (4.6) can be obtained in a way similar to (3.16); the only difference is that we only consider observations assigned to state $z_t = k$ and mixture component $s_t = j$.
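To make (B1.6) concrete, the following is a minimal sketch (my own) that evaluates the unnormalized prior weights of (B1.6) for $k = 1, \ldots, K$ and for a new state $k = K+1$; in a full sampler these weights would be multiplied by the corresponding emission likelihoods before normalizing. The variable names and the example values are assumptions.

import numpy as np

def transition_prior_weights(z_prev, z_next, n, beta, alpha, kappa):
    # Unnormalized prior weights of (B1.6) for z_t = k, k = 1..K and k = K+1 (a brand-new state).
    # n    : K x K matrix of transition counts n_{ik}, with the counts involving time t removed
    # beta : global weights (beta_1, ..., beta_K, beta_{K+1}), last entry = remaining mass
    K = n.shape[0]
    w = np.empty(K + 1)
    for k in range(K):
        left = alpha * beta[k] + n[z_prev, k] + kappa * (z_prev == k)
        right = (alpha * beta[z_next] + n[k, z_next] + kappa * (k == z_next)
                 + (z_prev == k) * (k == z_next))
        denom = alpha + n[k].sum() + kappa + (z_prev == k)
        w[k] = left * right / denom
    w[K] = alpha ** 2 * beta[K] * beta[z_next] / (alpha + kappa)   # new-state term
    return w

# illustrative usage with K = 3 existing states
n = np.array([[5, 1, 0], [2, 8, 1], [0, 1, 6]], dtype=float)
beta = np.array([0.4, 0.3, 0.2, 0.1])
w = transition_prior_weights(z_prev=0, z_next=1, n=n, beta=beta, alpha=1.0, kappa=5.0)
probs = w / w.sum()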