COLLEGE OF ENGINEERING
Preliminary Exam Report
Review of Hierarchical Dirichlet Processes and Infinite HMMs
Amir Harati, [email protected]
Institute for Signal and Information Processing, Department of Electrical Engineering, Temple University
February 2012

EXECUTIVE SUMMARY

In this report, I review three research papers concerning hierarchical Dirichlet processes (HDP), infinite hidden Markov models (HDP-HMM), and several inference algorithms for these models. I also use several other sources to support the material discussed in this report. The three primary references used in this exam are "Hierarchical Bayesian Nonparametric Models with Applications," "On-Line Learning for the Infinite Hidden Markov Model," and "Sticky HDP-HMM with Application to Speaker Diarization." The basic material in all of these papers involves HDPs and HDP-HMMs. The first paper is a more general overview and contains several other models that are not directly related to the main theme of this report. This report is therefore not organized around any single one of these references, but it contains most of the relevant material from all of them. In many cases I have corrected errors ranging from typos to simple algorithmic and computational mistakes, and I have used a unified notation throughout the report that I believe makes the presentation easier to follow. To make the report more self-contained I have added a background review of Dirichlet processes (DP), but this review is very short and readers may need to consult some background papers before reading this report.

After reviewing DPs, I introduce the HDP and the reasons it is needed. Several properties of the HDP are derived (both in the main report and in the appendices) and justified. Two basic inference algorithms for HDPs are presented in detail. The HDP-HMM is then introduced within the general framework of hierarchical Dirichlet processes. The differences between this model and the HDP are emphasized, and several of its properties are derived and explained. Three important inference algorithms are reviewed and presented in detail. One of my primary intentions in writing this review is to produce a self-sufficient document that can be used as a reference for implementing some of the inference algorithms for the HDP-HMM. Moreover, an interested reader can easily start from these algorithms and derive more general or application-specific variants.

Table of Contents

1 Introduction ................................................................ 1
2 Background .................................................................. 1
3 Hierarchical Dirichlet Process .............................................. 2
   3.1 Stick-Breaking Construction ............................................ 2
   3.2 Different Representations .............................................. 4
   3.3 Chinese Restaurant Franchise ........................................... 4
      3.3.1 Posterior and Conditional Distributions ........................... 5
   3.4 Inference Algorithms ................................................... 6
      3.4.1 Posterior Sampling in the CRF ..................................... 7
      3.4.2 Augmented Posterior Representation Sampler ........................ 9
   3.5 Applications .......................................................... 10
4 HDP-HMM .................................................................... 11
   4.1 CRF with Loyal Customers .............................................. 12
   4.2 Inference Algorithms .................................................. 12
      4.2.1 Direct Assignment Sampler ........................................ 12
      4.2.2 Block Sampler .................................................... 16
      4.2.3 Learning Hyper-parameters ........................................ 18
      4.2.4 Online Learning .................................................. 21
   4.3 Applications .......................................................... 26
5 Conclusion ................................................................. 26
6 References ................................................................. 27
Appendix A: Derivation of HDP Relationships .................................. 29
   A.1. Stick-Breaking Construction .......................................... 29
   A.2. Deriving Posterior and Predictive Distributions ...................... 30
Appendix B: Derivation of HDP-HMM Relationships .............................. 32
   B.1. Derivation of the posterior distribution for z_t, s_t ................ 32

1 INTRODUCTION

Nonparametric Bayesian methods provide a consistent framework for inferring model complexity from the data. Moreover, Bayesian methods make hierarchical modeling easier and therefore open the door to more interesting and complex applications. In this report, we review the hierarchical Dirichlet process (HDP) and its application to deriving infinite hidden Markov models (HMMs), also known as HDP-HMMs. We also review three inference algorithms for the HDP-HMM in detail. The report is organized into five sections and two appendices. Section two is a quick review of the Dirichlet process. Section three is devoted to the HDP and its inference algorithms, and section four focuses on the HDP-HMM and its inference algorithms. For the sake of readability, some of the mathematical details are deferred to the appendices.
2 BACKGROUND

A Dirichlet process (DP) is a distribution over distributions, or more precisely over discrete distributions. Formally, a Dirichlet process $DP(\alpha, G_0)$ is "defined to be the distribution of a random probability measure $G$ over $\Theta$ such that, for any finite measurable partition $(A_1, A_2, \ldots, A_r)$ of $\Theta$, the random vector $(G(A_1), \ldots, G(A_r))$ is distributed as a finite-dimensional Dirichlet distribution" (Teh, Jordan, Beal, & Blei, 2006):

$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_r))$   (2.1)

A constructive definition of the Dirichlet process, known as the stick-breaking construction, is given by Sethuraman (Sethuraman, 1994). This construction shows explicitly that draws from a DP are discrete with probability one:

$v_k \mid \alpha, G_0 \sim \mathrm{Beta}(1, \alpha), \qquad \theta_k \mid \alpha, G_0 \sim G_0$
$\beta_k = v_k \prod_{l=1}^{k-1}(1 - v_l), \qquad G = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}$   (2.2)

The sequence $\beta = (\beta_k)_{k=1}^{\infty}$ can be interpreted as a random probability measure over the positive integers and is denoted by $\beta \sim \mathrm{GEM}(\alpha)$. In both of these definitions $G_0$, the base distribution, is the mean of the DP, and $\alpha$ is the concentration parameter, which can be understood as an inverse variance.

Another way to look at the DP is through the Polya urn scheme. In this approach, we consider i.i.d. draws from a DP and form the predictive distribution over these draws (Teh, Jordan, Beal, & Blei, 2006):

$\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha, G_0 \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha}\,\delta_{\theta_l} + \frac{\alpha}{i-1+\alpha}\, G_0$   (2.3)

In the urn interpretation of equation (2.3), we have an urn containing balls of different colors. We draw a ball, put it back in the urn, and add another ball of the same color. With probability proportional to $\alpha$ we instead draw a ball of a new color. To make the clustering property clearer, we introduce a new set of variables that represent the distinct values of the atoms. Let $\theta_1^{*}, \ldots, \theta_K^{*}$ be the distinct values and let $m_k$ be the number of $\theta_l$ associated with $\theta_k^{*}$. We then have:

$\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha, G_0 \sim \sum_{k=1}^{K} \frac{m_k}{i-1+\alpha}\,\delta_{\theta_k^{*}} + \frac{\alpha}{i-1+\alpha}\, G_0$   (2.4)

Another useful interpretation of (2.4) is the Chinese restaurant process (CRP). In the CRP we have a Chinese restaurant with an infinite number of tables. A new customer $i$ enters the restaurant and either sits at one of the occupied tables, with probability proportional to the number of people already sitting there, or starts a new table, with probability proportional to $\alpha$. In this metaphor, each customer is a data point and each table is a cluster.

3 HIERARCHICAL DIRICHLET PROCESS

A hierarchical Dirichlet process (HDP) is the natural extension of the Dirichlet process to problems with multiple groups of data. Usually, the data is split into $J$ groups a priori. For example, consider a collection of documents: if words are the data points, each document is a group. We want to model the data inside each group with a mixture model, but we are also interested in tying the groups together, i.e. in sharing clusters across all groups. Suppose we have an indexed collection of DPs with a common base distribution, $G_j \sim DP(\alpha, G_0)$. Unfortunately, this simple model cannot solve the problem, since for a continuous $G_0$ the different $G_j$ necessarily have no atoms in common. The solution is to use a discrete $G_0$ with broad support; in other words, $G_0$ is itself a draw from a Dirichlet process. The HDP is defined by (Teh & Jordan, 2010) as in equation (3.1):

$G_0 \mid \gamma, H \sim DP(\gamma, H)$
$G_j \mid \alpha, G_0 \sim DP(\alpha, G_0), \quad j = 1, \ldots, J$
$\theta_{ji} \mid G_j \sim G_j$
$x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$   (3.1)

In this definition, $H$ provides the prior distribution for the factors $\theta_{ji}$, $\gamma$ governs the variability of $G_0$ around $H$, and $\alpha$ controls the variability of $G_j$ around $G_0$.
$H$, $\gamma$, and $\alpha$ are the hyper-parameters of the HDP.

3.1 Stick-Breaking Construction

Because $G_0$ is a draw from a Dirichlet process, it has a stick-breaking representation:

$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k^{**}}$   (3.2)

where $\theta_k^{**} \sim H$ and $\beta = (\beta_k)_{k=1}^{\infty} \sim \mathrm{GEM}(\gamma)$. Since the support of $G_j$ is contained within the support of $G_0$, we can write an equation similar to (3.2) for $G_j$:

$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_k^{**}}$   (3.3)

Then we have:

$\pi_j \sim DP(\alpha, \beta)$   (3.4)

$v_{jk} \sim \mathrm{Beta}\!\left(\alpha \beta_k,\; \alpha\Big(1 - \sum_{l=1}^{k} \beta_l\Big)\right), \qquad \pi_{jk} = v_{jk} \prod_{l=1}^{k-1} (1 - v_{jl}), \qquad k = 1, 2, \ldots$   (3.5)

We also have:

$E[\beta_k] = \frac{\gamma^{k-1}}{(1+\gamma)^k}, \qquad E[\pi_{jk}] = E[\beta_k], \qquad \mathrm{Var}(\pi_{jk}) = \frac{E[\beta_k(1-\beta_k)]}{1+\alpha} + \mathrm{Var}(\beta_k)$   (3.6)

The stick-breaking construction for the HDP and (3.6) are derived in A.1. Figure 1 illustrates the stick-breaking construction and the sharing of clusters in an HDP.

Figure 1 - Stick-breaking construction for the HDP: the left panel shows a draw of β, while the right three panels show independent draws conditioned on β (Teh & Jordan, 2010).

3.2 Different Representations

Definition (3.1) gives the first representation of the HDP. Another representation can be obtained by introducing an indicator variable, as shown in equation (3.7). Figure 2 shows the graphical models of both representations.

$\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$
$\pi_j \mid \alpha, \beta \sim DP(\alpha, \beta)$
$\theta_k \mid H \sim H$
$z_{ji} \mid \pi_j \sim \pi_j$
$x_{ji} \mid \{\theta_k\}_{k=1}^{\infty}, z_{ji} \sim F(\theta_{z_{ji}})$   (3.7)

3.3 Chinese Restaurant Franchise

The Chinese restaurant franchise (CRF) is the natural extension of the Chinese restaurant process to HDPs. In the CRF, we have a franchise with several restaurants and a franchise-wide menu. The first customer in restaurant $j$ sits at one of the tables and orders an item from the menu. Subsequent customers either sit at one of the occupied tables and eat the food served at that table, or sit at a new table and order their own dish from the menu. The probability of sitting at a table is proportional to the number of customers already seated at that table. In this metaphor, restaurants correspond to groups, and customer $i$ in restaurant $j$ corresponds to $\theta_{ji}$ (customers are distributed according to $G_j$). Tables are i.i.d. variables $\theta_{jt}^{*}$ distributed according to $G_0$, and dishes are i.i.d. variables $\theta_k^{**}$ distributed according to $H$. If customer $i$ in restaurant $j$ sits at table $t_{ji}$ and that table serves dish $k_{jt_{ji}}$, then $\theta_{ji} = \theta_{jt_{ji}}^{*} = \theta_{k_{jt_{ji}}}^{**}$. Viewed another way, each restaurant represents a simple DP and therefore a clustering of its data points; at the franchise level we have another DP, but this time the clustering is over tables.

Let us now introduce several variables that will be used throughout this report. $n_{jtk}$ is the number of customers in restaurant $j$ seated at table $t$ and eating dish $k$, $m_{jk}$ is the number of tables in restaurant $j$ serving dish $k$, and $K$ is the number of unique dishes served in the entire franchise. Marginal counts are denoted with dots; for example, $n_{j \cdot k}$ is the number of customers in restaurant $j$ eating dish $k$.

Figure 2 - (a) The HDP representation of (3.1); (b) the alternative indicator-variable representation of (3.7) (Teh, Jordan, Beal, & Blei, 2004).

3.3.1 Posterior and Conditional Distributions

The CRF can be characterized by its state, which consists of the dishes $\theta^{**} = (\theta_k^{**},\, k = 1, \ldots, K)$, the tables $t = (t_{ji},\, j = 1, \ldots, J,\, i = 1, \ldots, n_j)$, and the dish assignments $k = (k_{jt},\, j = 1, \ldots, J,\, t = 1, \ldots, m_j)$. As a function of the state of the CRF, we also have the customer counts $n = (n_{jtk})$, the table counts $m = (m_{jk})$, the customer labels $\theta = (\theta_{ji})$, and the table labels $\theta^{*} = (\theta_{jt}^{*})$ (Teh & Jordan, 2010). A small simulation sketch of this generative bookkeeping is given below; we then turn to the posterior and conditional distributions.
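To make the bookkeeping concrete, the following is a minimal simulation sketch (in Python) of the CRF prior just described: each customer chooses an existing table with probability proportional to its occupancy, or a new table with probability proportional to α, and each new table chooses its dish with probability proportional to the number of tables already serving that dish across the franchise, or a new dish with probability proportional to γ. The function name, group sizes, and random seed are illustrative assumptions, not part of the reviewed papers.

```python
import numpy as np

def simulate_crf(n_customers, alpha, gamma, rng=None):
    """Simulate table/dish assignments from the Chinese restaurant franchise prior.

    n_customers : list with the number of customers n_j in each restaurant j.
    Returns per-restaurant table assignments t[j][i], per-table dish labels k[j][t],
    and the total number of distinct dishes K shared across the franchise.
    """
    rng = np.random.default_rng(rng)
    J = len(n_customers)
    t = [[] for _ in range(J)]      # t[j][i]: table chosen by customer i in restaurant j
    k = [[] for _ in range(J)]      # k[j][t]: dish served at table t of restaurant j
    n_jt = [[] for _ in range(J)]   # n_jt[j][t]: number of customers at table t of restaurant j
    m_k = []                        # m_k[d]: number of tables (franchise-wide) serving dish d

    for j in range(J):
        for _ in range(n_customers[j]):
            # Existing table with prob. proportional to n_jt, new table with prob. proportional to alpha.
            weights = np.array(n_jt[j] + [alpha], dtype=float)
            table = rng.choice(len(weights), p=weights / weights.sum())
            if table == len(n_jt[j]):                      # a new table must order a dish
                n_jt[j].append(1)
                dish_weights = np.array(m_k + [gamma], dtype=float)
                dish = rng.choice(len(dish_weights), p=dish_weights / dish_weights.sum())
                if dish == len(m_k):                       # brand-new dish for the whole franchise
                    m_k.append(1)
                else:
                    m_k[dish] += 1
                k[j].append(dish)
            else:
                n_jt[j][table] += 1
            t[j].append(table)
    return t, k, len(m_k)

# Example: three restaurants (groups); dishes (clusters) are shared across them.
tables, dishes, K = simulate_crf([50, 30, 80], alpha=1.0, gamma=2.0, rng=0)
print("number of shared dishes K =", K)
```

Tables play the role of group-local clusters while dishes are the global clusters shared across groups, which is why the counts $m_{jk}$ and $K$ appear throughout the posterior expressions that follow.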
The posterior distribution of G0 is given by: K H k 1 m k ** k G0 | , H , ~ DP m , m (3.8) Where m is the total number of tables in the franchise and m k is the total number of tables serving dish k . Equation (3.9) shows the posterior for G j . n j is the total number of customers in restaurant j and n j k is the total number of customers in restaurant j eating dish k . K G0 k 1 n j k ** k G j | , G0 , θ j ~ DP n j , nj (3.9) Conditional distributions can be obtained by integrating out G j and G0 respectively. By integrating out G j from (3.9) we obtain: m j. ji | j1 ,..., j ,i 1 , , G0 ~ t 1 n jt nj * jt nj G0 (3.10) And by integrating out G0 from (3.8) we obtain: mk ** H k m k 1 m k K *jt | *j1 ,..., *j ,t 1 , , H ~ (3.11) A draw from (3.8) can be obtained using (3.12) and a draw from (3.9) can be obtained using (3.13). 0 , 1 ,..., K | , G0 , θ* ~ Dir , m 1 ,..., m K G0 | , H ~ DP , H G0 0G0 K k k 1 \ (3.12) ** k Update: February 29, 2012 Preliminary Exam Report Page 6 of 34 j 0 , j1 ,..., jK | , θ j ~ Dir ( 0 , 1 n j 1 ,..., K n j K ) G j | , G0 ~ DP( 0 , G0 ) G j j 0G j K k 1 (3.13) jk k** From (3.12) and (3.13) we see that the posterior of G0 is a mixture of atoms corresponding to dishes and an independent draw from DP( , H ) and G j is a mixture of atoms at k** and an independent draw from DP ( 0 , G0 ) (Teh & Jordan, 2010). nj In an HDP each restaurant represents a DP and so m j O log since the number of clusters scales logarithmically. On the other hand, equation (3.8) shows G0 is a DP over tables and so mj nj K O log j log O log . This shows that HDP represents a prior belief that j the number of clusters grows very slowly (double logarithmically) when increasing the number of data points but faster (logarithmically) in the number of groups (Teh & Jordan, 2010). All of the above relationships are derived in A.2. 3.4 Inference Algorithms First, let introduce several notations. x x ji , x jt x ji | t ji t , t t ji , k k jt and z {z ji } .Where z ji k jt ji denotes the mixture component associated with the observation x ji .To indicate the removal of a variable from a set of variables or a count, we use a superscript ji ; for example x ji x \ x ji or n jt ji is the number of customers (observations) in restaurant (group) j seated at table t excluding x ji . We also assume F has the density f | and H has the density h and is conjugate to F . The conditional density of x ji under mixture component k giving all data excluding x ji is defined as: x f k ji x ji h | f x j i d h | f x j i d j i Dk x ji j iDk \ x ji , k 1,...,, K (3.14) f k newji x ji h | f x ji d , k k new x where Dk j i : k jt ji k denotes the set of indices of the data item currently associated with dish k . For the conjugate case, we could obtain a closed form for this likelihood function. Particularly if emission distributions are Gaussian with unknown mean and covariance, the conjugate prior is a normal-inverse- \ Update: February 29, 2012 Preliminary Exam Report Page 7 of 34 Wishart distribution (Sudderth, 2006) denoted by NIW , , , and k k , k . 
Given some observations x ji for component, k | k jt ji k , j 1,..., J , i 1,..., n of the mixture x (l ) L l 1 (in this case x (l ) L l 1 = ) from a multivariate Gaussian distribution, the posterior still remains in the normal-inverse-Wishart family and its parameters are updated using: k L k L kk l 1 x (l ) (3.15) L k k l 1 x (l ) x (l ) T T L T In practice there are some efficient ways (using Cholesky decomposition) to update these equations and allows fast likelihood evaluation for each data point. Finally, marginalizing k induces a multivariate tstudent distribution with k d 1 degree of freedom ( d is the dimension of data points): x ji fk x x t ji f k newji x ji k d 1 k 1 k x ji ;k , k , k 1,..., K k k d 1 (3.16) 1 t d 1 x ji ; , , k k new d 1 Assuming k d 1 (3.16) can be approximated by moment-match Gaussian (Sudderth, 2006). 3.4.1 Posterior Sampling in CRF Equations (3.10) and (3.11) show how we can produce samples from the prior over ji and *jt , using the proper likelihood function and how this framework can provide us with necessary tools to sample from the posterior given the observations x . In this first algorithm, we sample index t ji and k jt using a simple Gibbs sampler. In this algorithm we assumed that emission distributions are Gaussian but using other conjugate pairs is the same. The following list shows a single iteration of the algorithm, but to obtain reliable samples we have to run this many times. Notice that the most computational costly operation is the calculation of the conditional densities. Moreover, the number of events that could happen for each iteration of the algorithm is K 1 . 1. Given the pervious state assignment K , k and t : Sample t: 2. For all j 1,..., J , i 1,...., n j do the follow sequentially \ Update: February 29, 2012 Preliminary Exam Report Page 8 of 34 3. Compute the conditional density for x ji using (3.16) for k 1,..., K , k new . 4. Calculate the likelihood for t ji t new using: ji k mm f x m p x ji | t ji , t ji t new , k K x ji k ji k 1 x (3.17) f k newji x ji 5. Sample t ji from the multinomial probability p t ji t | t 6. ji njt ji f x ji x ji , k jt ,k p x ji | t ji , t ji t new , k , if t previously used (3.18) if t t new If the sampled value of t ji is t new , obtain a sample of k jt new by: p k jt new k | t, k jt new m kx ji f k x ji x ji , If k previously used x new f k newji x ji , If k =k (3.19) 7. If k k new then increment K . 8. Update the cached statistics. 9. If a table t becomes unoccupied delete the corresponding k jt and, if as a result some mixture component becomes unallocated, delete that mixture component too. Sample K: 10. For all j 1,..., J , t 1,..., m j do the following sequentially: 11. Compute the conditional density for x jt using (3.16) for k 1,..., K , k new . 12. Sample k jt from a multinomial distribution: p k jt | t, k -jt m kx jt f k x jt x jt , If k previously used x new f k newjt x jt , If k =k (3.20) 13. If k k new then increment K . 14. Update the cached statistics. 15. If a mixture component becomes unallocated delete that mixture component \ Update: February 29, 2012 Preliminary Exam Report Page 9 of 34 Equation (3.17) is obtained from (3.11). From (3.10) we see the prior probability that t ji takes a previously used value is proportional to n jt ji and the prior probability of taking a new value is proportional to .The likelihood for x ji given t ji t for some previously used t is f k x ji x ji and the likelihood for t ji t new is given by (3.17). 
By multiplying these priors and likelihoods we can obtain the posterior distribution (3.18). In the same way, (3.19) and (3.20) can be obtained by multiplying the likelihoods and priors given by (3.11). 3.4.2 Augmented Posterior Representation Sampler In the previous algorithm, the sampling for all groups is coupled which makes the derivation of the CRF sampler for certain models difficult (Teh Y. , Jordan, Beal, & Blei, 2006). This happens because G0 was integrated out. An alternative approach is to sample G0 . More specifically, we will use (3.12) and (3.13) to sample from G0 and G j respectively. This algorithm contains two main steps. First we sample cluster indices z ji (instead of tables and dishes) and then we sample and j jJ . Equation (3.12) shows in order to sample from we should first sample m jk . This completes the second algorithm. 1. Given the pervious state assignment for z , and j from previous step Sample Z 2. For all j 1,..., J , i 1,...., n j do the follow sequentially 3. Compute the conditional density for x ji using (3.16) for k 1,..., K , k new . 4. Sample z ji using: p ( z ji | z ji jk f k x ji x ji ,) x j 0 f k newji x ji If k previously used (3.21) If k=k new 5. If a new component k new is chosen, the corresponding atom is initiated using (3.22) and set z ji K 1 and K K 1 . v0 | ~ Beta ,1 new 0 , Knew 1 0 0 , 0 (1 v0 ) v j | , 0 , 0 ~ Beta 0 v0 , 0 (1 v0 ) \ new j0 , (3.22) new jK 1 ~ j 0 v j , j 0 (1 v j ) Update: February 29, 2012 Preliminary Exam Report Page 10 of 34 6. Update the cached statistics. Sampling m 7. Sample m jk using (3.23) where s(n, m) is the Stirling number of the first kind. Alternatively, we can sample m jk by simulating a CRF, which is more efficient for large n j .k . p(m jk m | z, , , n j k ) k k n j k s(n j k , m) k m (3.23) Sampling and 8. Sample and using(3.12) and (3.13) respectively. In this algorithm we have used alternative representation given by (3.7). Particularly, we can see z ji ~ j . By combining this with (3.13) we can obtain a conditional prior probability function for z ji and by multiplying it with the likelihoods we can obtain (3.21). (3.22) is the stick-breaking step for the new atom and follows the steps described in section 3.1. In particular the third line in (3.22) is obtained by replacing the remaining stick 0 with a unit stick. The validity of this approach is shown in (Pitman, 1996).For a proof of (3.23)look at (Antoniak, 1974). It should be noted that computing this equation is generally very costly and we can alternatively simulate a CRF to sample m jk . 3.5 Applications Among the several applications of HDP, we will only review two of them in this section. It should be noted that the following section, HDP-HMM, is by itself an application of the general HDP framework. One of the most cited applications of HDP is in the field of information retrieval (IR) (Teh & Jordan, 2010). A state of the art but heuristic algorithm which is very popular in IR applications is the “termfrequency inverse document frequency” (tf-idf) algorithm. The intuition behind this algorithm is that the relevance of a term to a document is proportional to the number of times that term occurred in the document. However, terms that occur in many documents should be down weighted. It has been shown that HDP provides a justification for this intuition (Teh & Jordan, 2010) and an algorithm based on HDP outperforms all state of the art algorithms. Another application cited extensively in literature is topic modeling (Teh & Jordan, 2010). 
In topic modeling, we want to model documents with a mixture model. Topics are defined as probability distributions over a set of words, while documents are defined as probability distributions over topics. At the same time, we want to share topics among the documents within a corpus. Each document is therefore a group with its own mixing proportions, but the components (topics) are shared across all documents using an HDP model.

4 HDP-HMM

Hidden Markov models (HMMs) are a class of doubly stochastic processes in which a discrete state sequence is modeled as a Markov chain (Rabiner, 1989). In the following discussion we denote the state of the Markov chain at time $t$ by $z_t$ and the state-specific transition distribution for state $j$ by $\pi_j$. The Markovian structure means $z_t \sim \pi_{z_{t-1}}$. Observations are conditionally independent given the state of the HMM and are denoted by $x_t \sim F(\theta_{z_t})$.

The HDP-HMM is an extension of the HMM in which the number of states can be infinite. The idea is relatively simple: from each state $z_t$ we should be able to move to an infinite number of states, so each transition distribution should be a draw from a DP. On the other hand, we want the set of reachable states to be shared among all states, so these DPs should be linked together. The result is an HDP. In an HDP-HMM each state corresponds to a group (restaurant); therefore, unlike the HDP, in which the association of data to groups is assumed to be known a priori, here we are interested in inferring this association.

The major problem with the original HDP-HMM is state persistence: the HDP-HMM has a tendency to create many redundant states and to switch rapidly among them (Teh, Jordan, Beal, & Blei, 2006). This problem is solved by introducing a sticky parameter into the definition of the HDP-HMM (Fox, Sudderth, Jordan, & Willsky, 2011). Equation (4.1) gives the definition of a sticky HDP-HMM with unimodal emissions. $\kappa$ is the sticky hyper-parameter and can generally be learned from the data; the original HDP-HMM is the special case $\kappa = 0$. From this equation we see that each state (group) has a simple unimodal emission distribution. This limitation can be addressed using the more general model defined in (4.2), in which a DP is associated with each state and a model with augmented state $(z_t, s_t)$ is obtained. Figure 3 shows a graphical representation.

$\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$
$\pi_j \mid \alpha, \kappa, \beta \sim DP\!\left(\alpha + \kappa,\; \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right), \quad j = 1, 2, \ldots$
$\theta_j^{**} \mid H, \lambda \sim H(\lambda)$
$z_t \mid z_{t-1}, \{\pi_j\}_{j=1}^{\infty} \sim \pi_{z_{t-1}}$
$x_t \mid \{\theta_j^{**}\}_{j=1}^{\infty}, z_t \sim F(\theta_{z_t}^{**})$   (4.1)

$\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$
$\pi_j \mid \alpha, \kappa, \beta \sim DP\!\left(\alpha + \kappa,\; \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right), \qquad \psi_j \mid \sigma \sim \mathrm{GEM}(\sigma)$
$\theta_{kj}^{**} \mid H, \lambda \sim H(\lambda)$
$z_t \mid z_{t-1}, \{\pi_j\}_{j=1}^{\infty} \sim \pi_{z_{t-1}}, \qquad s_t \mid \{\psi_j\}_{j=1}^{\infty}, z_t \sim \psi_{z_t}$
$x_t \mid \{\theta_{kj}^{**}\}, z_t, s_t \sim F(\theta_{z_t s_t}^{**})$   (4.2)

Figure 3 - Graphical model of the HDP-HMM (Fox, Sudderth, Jordan, & Willsky, 2011).

4.1 CRF with Loyal Customers

The Chinese restaurant franchise metaphor for the sticky HDP-HMM is a franchise with loyal customers. In this case each restaurant has a specialty dish which is also served in the other restaurants. If a customer $x_t$ goes to restaurant $j$, then it is more likely that he eats the specialty dish there ($z_t = j$). His children ($x_{t+1}$) then go to the same restaurant and tend to eat the same dish. However, if $x_t$ eats another dish ($z_t \neq j$), then his children go to the restaurant indexed by $z_t$ and are more likely to eat that restaurant's specialty dish. Thus customers are in fact loyal to dishes and tend to go to restaurants where their favorite dish is the specialty. A small generative sketch of a truncated sticky HDP-HMM is given below, before we turn to the inference algorithms.
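As a concrete illustration, the following is a minimal generative sketch (in Python) of a sticky HDP-HMM with Gaussian emissions under a finite weak-limit truncation of the DPs, the same truncation idea used later by the block sampler in Section 4.2.2. The truncation level L, the Gaussian base measure, the function name, and the parameter values are illustrative assumptions, not part of the reviewed papers.

```python
import numpy as np

def sample_sticky_hdp_hmm(T=200, L=10, gamma=3.0, alpha=4.0, kappa=8.0, rng=None):
    """Generate a state/observation sequence from a weak-limit sticky HDP-HMM.

    L-dimensional Dirichlet distributions approximate the DPs:
      beta ~ Dir(gamma/L, ..., gamma/L)
      pi_j ~ Dir(alpha*beta + kappa*e_j)   # extra mass kappa on the self-transition
    Emissions are 1-D Gaussians with means drawn from a N(0, 4^2) base measure (an assumption).
    """
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))              # global state weights
    pi = np.vstack([
        rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])   # sticky transition rows
        for j in range(L)
    ])
    means = rng.normal(0.0, 4.0, size=L)                     # state-specific emission means
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z[0] = rng.choice(L, p=beta)                             # initial state drawn from beta
    for t in range(T):
        if t > 0:
            z[t] = rng.choice(L, p=pi[z[t - 1]])             # Markov transition
        x[t] = rng.normal(means[z[t]], 1.0)                  # Gaussian emission, unit variance
    return z, x, pi

z, x, pi = sample_sticky_hdp_hmm(rng=0)
print("states used:", np.unique(z))
print("mean self-transition probability:", np.mean(np.diag(pi)))
```

Increasing $\kappa$ relative to $\alpha$ places more prior mass on self-transitions, which is exactly the mechanism the sticky model uses to encourage state persistence; setting $\kappa = 0$ recovers the original HDP-HMM.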
4.2 Inference Algorithms 4.2.1 Direct Assignment Sampler This sampler is adapted from (Fox E. , Sudderth, Jordan, & Willsky, 2011) and (Fox E. , Sudderth, Jordan, & Willsky, 2010). In this section we present the sampler for HDP-HMM with DP emission. This algorithm is very similar to the second inference algorithm for HDP presented in 3.4.2. The algorithm is divided into two steps: the first step is to sample the augmented state zt , st and the second is to sample . In order to sample zt , st we need to have the posterior. By inspecting Figure 3 and using the chain rule we can write the following relationship for this posterior. p zt k , st j | z \ t , s\ t , x1:T , , , , , p st j | zt k , z\ t , s\ t , x1:T , , p zt k | z \ t , s\ t , x1:T , , , , p st j | s | z k , t , p xt | x | z k , st j , t p zt k | z\ t , , , (4.3) p x | x | z k , s , t p s | s | z k , t, t t t st \ Update: February 29, 2012 Preliminary Exam Report Page 13 of 34 The reason that we have summed over st in the last line is because we are interested to calculate the likelihood for each state. This equation also tells us that we should first sample the state z t and then conditioned on the current state, sample the mixture component for that state. In B.1. we will derive the following relationships for the component of (4.3). (4.6) is written for Gaussian emissions but we can always use the general relationship (3.14) for an arbitrary emission distribution. k nzt t1k zt 1 , k p zt k | z\ t , , , 2 k zt 1 , z nkzt k , zt 1 zt 1 , k k , zt 1 t 1 t 1 k 1,..., K t n z , k k t 1 k K 1 (4.4) nkj t , j 1,..., K k nk t p st j | s | z k , t , , j K k 1 n t k (4.5) kj 1 kj p xt | x | z k , st j , t t kj d 1 xt ;kj , kj , k 1,..., K , j 1,..., K k kj kj d 1 p xt | x | z k , st j new , t p xt | x | z k new , t t 1 , k 1,..., K , j d 1 1 , k k new xt ; , d 1 d 1 t k new kj x ; new d 1 t , j new (4.6) After sampling zt , st we have to sample but from (3.12) we see that we need to know the distribution of the number of tables considering dish k ( m k k 1 ). The approach is to first find the distribution of K tables serving dish k ( m k k 1 ). In this algorithm, instead of using the approach based on Stirling K numbers, we can obtain this distribution by a simulation of the CRF, and then adjust this distribution to obtain the real distribution of considered dishes by tables. To review the reason that this adjustment is necessary, we should notice that introduces a non-informative bias to each restaurant so customers are more likely to select the specialty dish of the restaurant. In order to obtain the considered dish distribution we should reduce this bias from the distribution of the served dish. This can be done using an override variable. Suppose m jk tables are serving dish k in restaurant j . If k j then m jk m jk since the served dish is not the house specialty but if k j then there is probability that tables are overridden by the house specialty. Suppose that jt is the override variable with prior p jt | ~ , we can write: \ Update: February 29, 2012 Preliminary Exam Report Page 14 of 34 p jt | k jt j , , p k jt j | jt , , p jt | j 1 jt 0 jt 1 (4.7) The sum of these Bernoulli random variables is a binomial random variable and finally we can calculate the number of tables that considered ordering dish k by (4.17). 1. Given a previous set of z1:( nT1) , s1:( nT1) and ( n1) 2. For all t 1,2,..., T . 3. 
For each of the K currently instantiated states compute: The predictive conditional distributions for each of the K k currently instantiated mixture components for this state, and also for a new component and for a new state. nkjt f k, j ( xt ) nkt f k , Kk 1 ( xt ) nkt p xt | x | z k , st j , t p xt | x | z k , st j new , t f knew ,0 ( xt ) p xt | x | z k new , t t n (4.8) (4.9) (4.10) The predictive conditional distribution of the HDP-HMM state without knowledge of the current mixture component. z nkzt k , zt 1 zt 1 , k k , zt 1 t 1 t 1 f k xt k nzt t1 zt 1 , k t n z , k k t 1 f K 1 xt 4. 2 k zt 1 K k f x f k, j t j 1 k , K k 1 ( xt ) , k 1,..., K (4.11) f knew ,0 ( xt ), k K 1 Sample z t : K zt ~ f x ( z , k ) f k t t ( zt , K 1) K 1 ( xt ) (4.12) k 1 \ Update: February 29, 2012 Preliminary Exam Report Page 15 of 34 5. Sample st conditioned on z t : K k st ~ f ( st , j ) f k , K k 1 ( xt ) ( st , K k 1) k , j ( xt ) j 1 (4.13) 6. If k K 1 increase the K and transform as 0| ~ Beta ,1 new 0 , Knew 1 0 0 , 0 (1 v0 ) (4.14) 7. If st K k 1 increment K k . 8. Update the cache. If there is a state with nk 0 or n k 0 remove k and decrease K . If nkj 0 remove the component j and decrease K k . 9. Sample auxiliary variables by simulating a CRF: 10. For each j, k 1,..., K set m jk 0 and n 0 . For each customer in restaurant j eating dish k ( 2 i 1,..., n jk ), sample: k ( j, k ) x ~ Ber n k ( j, k ) (4.15) 11. Increment n and if x 1 increment m jk . 12. For each j 1,..., K ,sample the override variables in restaurant j : j ~ Binomial m jj , , j 1 (4.16) 13. Set the number of informative tables in restaurant j : m jk m jk m jj j jk jk (4.17) 14. Sample : ( n ) ~ Dir , m 1 ,..., m K \ (4.18) Update: February 29, 2012 Preliminary Exam Report Page 16 of 34 15. Optionally sample hyper-parameters , , and . 4.2.2 Block Sampler The problem with the direct assignment sampler mentioned in the previous section is the slow convergence rate since we sample states sequentially. The sampler can also group two temporal sets of observations related to one underlying state into two separate states. However, in the last sampling scheme we have not used the Markovian structure to improve the performance. In this section a variant of forward-backward procedure is incorporated in the sampling algorithm that enables us to sample the state sequence z1:T at once. However, to achieve this goal, a fixed truncation level L should be accepted which in a sense reduces the model into a parametric model (Fox E. , Sudderth, Jordan, & Willsky, 2011). However, it should be noted that the result is different from a classical parametric Bayesian HMM since the truncated HDP priors induce a shared sparse subset of the L possible states (Fox E. , Sudderth, Jordan, & Willsky, 2011). In short, we obtain an approximation to the nonparametric Bayesian HDP-HMM with maximum number of possible states set to L . However, for almost all applications this should not cause any problem if we set L reasonably high. The approximation used in this algorithm is the degree L weak limit approximation to the DP (Ishwaran & Zarepour, 2002) which is defined as: GEM L Dir / L,..., / L (4.19) Using (4.19) is approximated as (Fox, Sudderth, Jordan, & Willsky, Supplement to " A Sticky HDPHMM with Application to Speaker Diarization", 2010): | ~ Dir / L,..., / L (4.20) j | , , ~ Dir 1 ,..., j ,... 
L (4.21) Similar to (3.4) we can write: And posteriors are (similar to (3.13)): | m, ~ Dir / L m 1 ,..., / L m L j | z1:T , , ~ Dir 1 n j1 ,..., j n jj ,..., L n jL (4.22) In (4.22) n jk is the number of transitions from state j to state k and m jk is the same as (4.17). Finally an order L weak limit approximation is used for the DP prior on the emission parameters: k | z1:T , s1:T , ~ Dir / L nk1 ,..., / L nkL (4.23) The forward-backward algorithm for the joint sample z1:T and s1:T given x1:T can be obtained by: p z t , st | x1:T , z, π,ψ,θ p zt | zt 1 , x1:T , π, θ p st | zt p xt | zt ,st p x1:t 1 | zt , π, θ,ψ p xt 1:T | zt , π, θ,ψ \ (4.24) Update: February 29, 2012 Preliminary Exam Report Page 17 of 34 The right side of equation (4.24) has two parts: forward and backward probabilities (Rabiner, 1989).The forward probability includes p zt | zt 1 , x1:T , π, θ p st | zt f xt | zt , st p x1:t 1 | zt , π, θ, ψ and backward probability includes p xt 1:T | zt , π, θ,ψ . It seems that the authors in this work approximate the forward probabilities with p zt | zt 1 , x1:T , π, θ p st | zt f xt | zt ,st , and for backward probabilities we have: p xt 1:T | zt , π, θ, ψ mt ,t 1 zt 1 p z | p s | f x | m z 1 zt st zt 1 t t zt t t 1,t zt , st t t T t T 1 L mt ,t 1 k i 1 1 (4.25) f x | m z L ki il t zt , st l 1 t 1,t t T t k 1,...L t T 1 As a result we would have (Fox E. , Sudderth, Jordan, & Willsky, 2010) : p z t k , st j | x1:T , z, π,ψ,θ zt 1k kj f xt | zt ,st mt 1,t zt (4.26) where for Gaussian emission for components are given by f xt | zt , st xt ; kj , kj The algorithm is as follows (Fox E. , Sudderth, Jordan, & Willsky, 2010): 1. Given the previous π( n1) , ψ ( n1) , β( n1) and θ ( n1) . 2. For k 1,..., L , initialize messages to mT 1,T k 1 3. For t T 1,...,1 and k 1,..., L compute mt ,t 1 k L L il N xt 1 ; il , il mt 1,t i ki (4.27) i 1 l 1 4. Sample the augmented state zt , st sequentially and start from t 1 : Set nik 0, nkj 0 and kj for i, k 1,..., L and k , j 1,..., L 1,..., L 2 For all k , j 1,..., L 1,..., L compute: f k , j xt zt 1 ,k k , j N xt ; k , j , k , j mt 1,t k (4.28) 5. Sample augmented state zt , st : \ Update: February 29, 2012 Preliminary Exam Report Page 18 of 34 L L z ,s ~ f x z ,k s , j t t k, j t t t k 1 j 1 (4.29) 6. Increase nzt 1zt and nzt st and add xt to the cached statistics. k , j k , j xt (4.30) 7. Sample m,ω,m similar to the previous algorithm 8. Update : ~ Dir / L m 1 ,..., / L m L (4.31) 9. For k 1,..., L : Sample k and k : k ~ Dir 1 nk1 ,..., k nkk ,..., L nkL k ~ Dir / L nk 1 ,..., / L nkL (4.32) For j 1,..., L sample: k , j ~ p | , k , j (4.33) 10. Set π( n) π, ψ ( n) ψ, ( n) and θ( n ) θ 11. Optionally sample hyper-parameters , , and . 4.2.3 Learning Hyper-parameters Hyper-parameters including , , and can also be inferred like other parameters of the model (Fox, Sudderth, Jordan, & Willsky, 2010). 4.2.3.1 Posterior for α + κ Consider the probability of data x ji to sit behind table t : ji t t1 ,..., m j n jt p t ji t | t ji , n jt ji , , new t t \ (4.34) Update: February 29, 2012 Preliminary Exam Report Page 19 of 34 This equation can be written by considering equation (3.10) and (4.1). From this equation we can say customer table assignment follows a DP with concentration parameter . Antoniak (Antoniak, 1974) has shown that if ~ GEM , zi ~ then the distribution of the number of unique values of z i resulting from N draws from has the following form: p K | N , ( ) s N , K K ( N ) (4.35) Where s( N , K ) is the Stirling number of the first kind. 
Using these two equations the distribution of the number of tables in the restaurant j is as follows: p mj | ,nj s nj ,mj mj (4.36) nj The posterior over is as follows: p | m1 ,..., mJ , n1 ,..., nJ p p m1 ,..., mJ | , n1 ,..., nJ p J p m j | ,nj j 1 p J s n ,mj j j 1 p m mj J s n x y (4.37) J n j , m j is not a function of and therefore can be ignored. j j 1 By substitution of x, y nj j 1 The reason for the last line is that 1 t x 1 1 t x y y 1 dt and also by considering that x 1 x x 0 we obtain: p | m1 ,..., mJ , n1 ,..., nJ p m n j 1 1 rj 1 r 0 j j 1 J n j 1 drj (4.38) Finally by considering the fact that we have placed a Gamma a, b prior on we can write: s n j j p , r , s | m1 ,..., mJ , n1 ,..., nJ e 1 rj rj j 1 Where s j can be either one or zero. For marginal probabilities we obtain: a m 1 b \ J n j 1 (4.39) Update: February 29, 2012 Preliminary Exam Report Page 20 of 34 p | r , s, m1 ,..., mJ , n1 ,..., nJ a m 1 j1 s j e b j1log rj Gamma m p rj | , r\ j , s, m1 ,..., mJ , n1 ,..., nJ rj 1 rj p s j | , r , s\ j m1 ,..., mJ , n1 ,..., nJ 4.2.3.2 J J n j 1 J s ,b j 1 j J j 1 log rj Beta 1, n j nj nj j Ber nj (4.40) (4.41) s (4.42) Posterior of γ Similar to the discussion for (4.35) if we want to find the distribution of the unique number of dishes served in the whole franchise we would have p K | , m s m ,K m . Therefore for the posterior distribution of we can write: p | K , m p p K | , m 1, m p K m p K m (4.43) 1 1 m 1 d 0 By considering the fact that that prior over is Gamma a, b we can finally write: p , , | K , m K 1 m b log 1 m e 1 (4.44) And finally for the marginal distributions we have: b log p | , , K , m K 1 e Gamma K , b log p | , , K , m 1 p | , , K , m 4.2.3.3 m 1 Beta 1, m (4.46) m m Ber m (4.45) (4.47) Posterior of σ The posterior for is obtained in a similar way to . We use two auxiliary variables r and s and the final marginalized distributions are: \ Update: February 29, 2012 Preliminary Exam Report Page 21 of 34 p | r , s, K1 ,..., K J , n1 ,..., nJ K 1 j1 sj e b j1log rj J J p rj | , r\j , s, K1 ,..., K J , n1 ,..., nJ rj 1 rj p sj | , r , s\ j , K1 ,..., K J , n1 ,..., nJ nj n j 1 (4.48) (4.49) sj (4.50) It should be noted that in cases where we use auxiliary variables we prefer to iterate several times before moving to the next iteration of the main algorithm. 4.2.3.4 Posterior of ρ By definition and by considering the fact that the prior on is Beta c, d and jt ~ Ber we can write: p | p | p Binomial Beta 4.2.4 j j j j ; m , Beta c, d c, m j j (4.51) d Online learning The last two approaches are based on batch learning methodology. One problem with these methods is the need to run the whole algorithm for the whole data set when new data points become available. More than that, for large datasets we might face some practical constraints such as memory size. Another alternative approach is to use sequential learning techniques which essentially let us update models once a new data point becomes available. The algorithm that we are describing here is adapted from (Rodriguez, 2011), but the main idea for a general case is published in (Carvalho, Johannes, Lopes, & Polson, 2010) and (Carvalho, Lopes, Polson, & Taddy, 2010). For Bayesian problems different versions of particle filters are used to replace batch MCMC methods. For further information about particle filters refer to (Cappe, Godsill, & Moulines, 2007). 
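To fix ideas before the HDP-HMM-specific details, the following is a minimal generic sketch (in Python) of the resample-propagate pattern that particle learning builds on: particles are first resampled with weights proportional to the predictive likelihood of the new observation and then propagated forward. The `predictive_likelihood` and `propagate` callables are placeholders for the model-specific quantities defined in Section 4.2.4.2; they are assumptions for illustration, not part of the reviewed papers.

```python
import numpy as np

def particle_learning_step(particles, y_new, predictive_likelihood, propagate, rng=None):
    """One resample-propagate update for a set of particles.

    particles             : list of model-specific particle states (the 'essential state vectors')
    y_new                 : the newly observed data point
    predictive_likelihood : callable, (particle, y_new) -> p(y_new | particle)
    propagate             : callable, (particle, y_new, rng) -> updated particle; it should
                            return a fresh copy so duplicated particles evolve independently
    """
    rng = np.random.default_rng(rng)
    # 1. Resample particles with weights proportional to the predictive likelihood of y_new.
    weights = np.array([predictive_likelihood(p, y_new) for p in particles], dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    resampled = [particles[i] for i in idx]
    # 2. Propagate each resampled particle: sample the new latent state and update sufficient statistics.
    return [propagate(p, y_new, rng) for p in resampled]
```

In the HDP-HMM case of Section 4.2.4.2, the particle state carries the current state $z_t$, the number of instantiated states $L_t$, the transition and emission sufficient statistics, and the hyper-parameter auxiliary variables, and the resampling weight corresponds to the predictive likelihood in equation (4.59).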
It should be noted that this algorithm is developed for the non-sticky ( 0 ) HDP-HMM with one mixture per state but generalization to sticky HDP-HMM with DP emissions is straightforward. 4.2.4.1 Particle learning (PL) Framework for mixtures PL is proposed in (Carvalho, Johannes, Lopes, & Polson, 2010) and (Carvalho, Lopes, Polson, & Taddy, 2010) and is a special formulation of augmented particle filters. A general mixture model that PL is supposed to infer can be represented by: \ Update: February 29, 2012 Preliminary Exam Report Page 22 of 34 yt 1 f xt 1 , xt 1 g xt , (4.52) In this set of equations, the first line is the observation equation and the second line is the state evolution which, in case of mixtures, indicates which component is assigned to the observations and xt x1 ,..., xt In order to estimate states and parameters we should define an “essential state vector” zt xt , st , where st stx , st and st are the state sufficient statistics and st is the parameter sufficient statistics (Carvalho, x Johannes, Lopes, & Polson, 2010). After observing yt 1 , particles should be updated based on: p zt | y t 1 p yt 1 | zt p zt | y t p yt 1 | y t (4.53) p xt 1 | yt 1 p xt 1 | zt , yt 1 p zt | y t 1 dzt (4.54) A particle approximation to (4.53) is: p N z | y t 1 N t i 1 (i ) t zt( i ) ( zt ), (i ) t p yt 1 | zt(i ) N p yt 1 | zt( j ) j 1 (4.55) Using this approximation we can generate propagated samples from the posterior p xt 1 | zt , yt 1 to approximate p xt 1 | y t 1 . Sufficient statistics can be updated using a deterministic mapping and finally parameters should be updated using sufficient statistics. The main condition for this algorithm to be possible is the tractability of p yt 1 | zt(i ) and p xt 1 | zt yt 1 . The PL algorithm is as follows: 1. Resample particles z t with weights t(i ) p yt 1 | zt(i ) 2. Propagate new states xt 1 using p xt 1 | zt yt 1 3. Update state and parameter sufficient statistics deterministically st 1 S st , xt 1 , yt 1 4. Sample from p | st 1 After one sequential run through the data we can use smoothing algorithms to obtain p xT | y T . The algorithm is repeated b 1,..., B to generate B sample paths. 1. Sample xT(b ) , sT(b ) , (b ) from output of particle filter \ N i 1 1 ( i ) zT N zT Update: February 29, 2012 Preliminary Exam Report Page 23 of 34 For t T 1:1 2. Sample xT(b ) , sT(b) ~ N t(i) z( s ) , t(i) t s 1 qt(i ) , qt(i ) p xt(b1) , rt(b1) | zt( i ) N qt( s ) s 1 4.2.4.2 PL for HDP-HMM where z is In this algorithm we use Zt(i ) zt(i ) , L(ti ) , nljt(i ) , Slt(i ) ,t(i ) , lt(i ) , t(i ) , mljt( i ) ,t( i ) , ult( i) , lt( i) t the state, Lt is the number of states at time t , nljt is the number of transitions from state l to state j at time t , S lt is the sufficient statistics for state l at time t and other variables have the same definitions as previous sections (the last three are auxiliary variables which are used to infer hyper-parameters.). From (4.4) and by setting 0 we can write: p zt 1 k | z1 ,..., zt , zt 2 k nzt kt k , , Lt 1 k k nkk t zt , k k , k , k 1,.., Lt nk t zt , k (4.56) , k Lt 1 In this equation k is the next state zt 2 that we have not seen yet. Because of this we should integrate this out by summing over all possibilities. 
The result is: k nzt k t p zt 1 k | z1 ,..., zt , , Lt 1 nk t zt , k , k 1,.., Lt k t zt , k , k Lt 1 n (4.57) k nz kt , k 1,.., Lt t , k Lt 1 Lt 1 After normalization we can write: p zt 1 | z1 ,..., zt , , Lt nzt kt nzt k 1 p xt 1 | Z t p xt 1 | zt 1 p zt 1 | Z t dzt 1 k Lt k 1 k k nzt kt nzt f Lt 1 nzt xt 1 k Lt 1 xt 1 Lt 1 nzt (4.58) f Lt xt11 xt 1 (4.59) The algorithm is as follow: \ Update: February 29, 2012 Preliminary Exam Report 1. Compute t(i ) Page 24 of 34 vt(i ) N v( j ) j 1 t where vt(i ) ql(i ) Z t(i ) , xt 1 L(t i ) l 1 ql(i ) Z t(i ) , xt 1 and we have: n (i( i)) t(i ) lt(i ) zt lt f l xt 1 xt 1 , l 1,..., L(t i ) nz(i( i)) t t(i ) t (i ) (i ) t L(Ti ) 1,t x f t 1 x , l L(t i ) 1 (i ) ( i ) L(t i ) 1 t 1 nzt( i ) t t (4.60) Where fl xt 1 is similar to (4.6). 2. Sample Zt(1) , Zt(2) ,..., Zt( N ) from p N Z t | y t 1 N i 1 (i ) t Z t( i ) Zt (4.61) 3. Propagate the particles to generate Z t(i )1 : 4. Sample zt(i )1 : zt(i )1 ~p L(t i ) 1 zt(i )1 | Z t(i ) , xt 1 l 1 ql(i ) Z t(i ) , xt 1 L(t i ) 1 ( j ) ql j 1 Z t( j ) , xt 1 l zt(i )1 (4.62) 5. Update the number of states: L(ti)1 L(ti ) 1, zt(i )1 L(ti ) 1 (i ) (i ) Lt 1 Lt otherwise (4.63) 6. Update the sufficient statistics: nz(i( i)) , z ( i ) ,t 1 nz(i( i)) , z ( i ) ,t 1 S z ( i ) ,t 1 S z ( i ) ,t s xt t t 1 t t 1 t 1 t 1 (4.64) 7. If zt(i )1 L(ti ) : \ Update: February 29, 2012 Preliminary Exam Report Page 25 of 34 t(i 1) t(i ) v0 | ~ Beta t(i ) ,1 (i ) 0,t 1 , (4.65) L((ii)) 1,t 0,(it) 0 , 0,( it) (1 v0 ) t 8. Hyper-parameters update: 9. Sample ml(,i )j ,t 1 : p ml(,i )j ,t 1 m s nl(,i )j ,t 1 , m t(i ) (j i,t)1 m (4.66) Alternatively we can simulate a CRF instead of computing Stirling numbers. 10. Sample t(i )1 by first sample t(i 1) ~ Beta t(i ) 1, m(it)1 : t(i )1 ~ Gama a L(ti)1 , b log t(i 1) 1 Gama a L(ti)1 1, b log t(i 1) 1 a L(t i)1 1 (4.67) m(it)1 b log t(i 1) 11. Sampling t(i )1 using auxiliary variables: l(,it)1 ~ Beta t(i ) 1, nl(it)1 n(i ) ul(,it)1 ~ Ber (i ) l t 1(i ) t nl t 1 (i ) t 1 (4.68) L(t i )1 (i ) (i ) ~ Gam aa m t 1 u t 1 , b log l(,it)1 l 1 12. Resample : t(i1) ~ Dir m ,1,t 1 ,..., m , L( i ) ,t 1 , t(i )1 t 1 (4.69) After finishing this inference step, we can optionally smooth the states using an algorithm similar to the one discussed in 4.2.4.1. However one drawback of this algorithm is the fact that paths for z1 ,..., zT are coupled together since we integrate out l for the inference algorithm. To improve particle diversity we can sample transition probability and emission parameters explicitly. The smoothing algorithm would be \ Update: February 29, 2012 Preliminary Exam Report Page 26 of 34 as follows: N 1. Sample ZT( b ) ~ i 1 1 (i ) N ZT 2. Sample l( b ) ~ Dir (b ) l (b ) nl(bT) , 1(b) nl(1bT) ,..., L((bb)) nlL(b()b )T 3. Sample l( b ) from the NIW l | S T lT , n lT , , , , T (4.70) . 4. Sample z1(b) ,..., zT(b) using a single run of Forward-Backward algorithm applications. Use l( b ) and l( b ) as parameters for the Forward-Backward algorithm. 4.3 Applications One of the applications of HDP-HMM, which is extensively discussed in (Fox E. , Sudderth, Jordan, & Willsky, 2011), is speaker diarization. In this application, we are interested to segment an audio file into time intervals associated with different speakers. If the number of speakers is known a priori a classic HMM can be used and each speaker can be modeled as different states of HMM. However, in real world applications the number of speakers is not known and therefore nonparametric models are a natural solution. 
It has been shown in (Fox, Sudderth, Jordan, & Willsky, 2011) that the HDP-HMM can produce results comparable to other state-of-the-art systems. Another application often cited for the HDP-HMM is word segmentation (Teh & Jordan, 2010). In this problem, we have an utterance and we are interested in segmenting it into words. Each word can be represented as a state in an HDP-HMM, and the transition distributions define a grammar over words.

5 CONCLUSION

In this report, we have investigated hierarchical Dirichlet processes and their application to extending HMMs into infinite HMMs. We also reviewed two inference algorithms for the HDP and three inference algorithms for the HDP-HMM. The HDP-HMM seems to be a good candidate for many applications that traditionally use HMMs. A nonparametric Bayesian approach can help us learn the complexity of the models automatically from the data instead of relying on heuristic tuning methods. Moreover, the framework provides a generic and simple way to organize many models (e.g., the different HMMs in a speech recognizer) in a well-defined hierarchy and to tie the parameters of different models together using Bayesian hierarchical methods. The definition of the HDP-HMM (with DP emissions) can also be altered to include another HDP that links the DP emissions of different states together (i.e., that ties the components of the mixture models to each other). Another area of work is inference algorithms. We have presented three algorithms based on Gibbs sampling. The block and sequential samplers have some interesting properties that make them reasonable candidates for large datasets; in particular, it seems relatively easy to build a parallel implementation of the sequential sampler, which can be an important factor for large-scale problems. Studying other kinds of inference methods, such as variational methods, or parallel implementations of these algorithms, could be a subject of further research.

6 REFERENCES

Antoniak, C. (1974). Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics, 1152-1174.

Cappe, O., Godsill, S. J., & Moulines, E. (2007, May). An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo. Proceedings of the IEEE, 95(5), 899-924.

Carvalho, C. M., Johannes, M., Lopes, H. F., & Polson, N. (2010). Particle Learning and Smoothing. Statistical Science, 88-106.

Carvalho, C. M., Lopes, H. F., Polson, N. G., & Taddy, M. A. (2010). Particle Learning for General Mixtures. Bayesian Analysis, 709-740.

Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2010). Supplement to "A Sticky HDP-HMM with Application to Speaker Diarization". doi:10.1214/10-AOAS395SUPP

Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics, 5, 1020-1056.

Ishwaran, H., & Zarepour, M. (2002, June). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 269-283.

Pitman, J. (1996). Random Discrete Distributions Invariant under Size-Biased Permutation. Advances in Applied Probability, 525-539.

Rabiner, L. R. (1989, February). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-286.

Rodriguez, A. (2011, July). On-Line Learning for the Infinite Hidden Markov Model. Communications in Statistics: Simulation and Computation, 40(6), 879-893.

Sethuraman, J. (1994).
A constructive definition of Dirichlet priors. Statistica Sinica, 639-650. Sudderth, E. B. (2006). chapter 2 of "Graphical Models for Visual Object Recognition and Tracking". Cambridge, MA: Massachusetts Institute of Technology. Teh, Y., & Jordan, M. (2010). Hierarchical Bayesian Nonparametric Models with Applications. In N. Hjort, C. Holmes, P. Mueller, & S. Walker, Bayesian Nonparametrics: Principles and Practice. Cambridge, UK: Cambridge University Press. Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet Processes. Technical Report 653 UC Berkeley. Teh, Y., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet Processes. Journal of the \ Update: February 29, 2012 Preliminary Exam Report Page 28 of 34 American Statistical Association, 1566-1581. \ Update: February 29, 2012 Preliminary Exam Report Page 29 of 34 APPENDIX A: DERIVATION OF HDP RELATIONSHIPS A.1. Stick-Breaking Construction (Teh Y. , Jordan, Beal, & Blei, 2006) Lemma A.1.1 k 1 In this lemma we show k vk 1 vl l 1 We know: K 1 k vk 1 l l 1 Using this equation we can write: 1 k 1 K l 1 l 1 l 1 v1 v2 (1 v1 ) v3 (1 v1 )(1 v2 ).... (1 v1 ) 1 v2 v3 (1 v2 ).... (1 vl ) k vk k 1 1 v l l 1 We know G j jk **k .Let A1 ,..., Ar be a random partition on, define Kl k : k** Al , l 1,..., r k 1 from(3.1) and the definition of DP we have: G ( A ),..., G ( A ) ~ Dir G ( A ),...,G ( A ) j 1 j r 0 1 0 (A1.1) r Using (A1.1), (3.2) and(3.3) we obtain: jk ,..., jk ~ Dir k ,..., k kK kK kK r kK r 1 1 (A1.2) Since this is correct for every finite partition of positive integers we conclude that j ~ DP( , ) Now for a partition 1,..., k 1 ,k ,k 1,... and aggregation property of Dirichlet distribution we obtain: \ Update: February 29, 2012 Preliminary Exam Report Page 30 of 34 k 1 k 1 , , ~ Dir , , l jl jk jl l k l k 1 k k 1 l 1 l 1 (A1.3) Using the neutrality property of a Dirichlet distribution: , ~ Dir , l jk k jl k 1 l k 1 l k 1 1 jl 1 (A1.4) l 1 Notice that l 1 l k 1 k l jk and also defining v jk 1 l 1 we have: k 1 jl l 1 v jk ~ Beta k , 1 k l l 1 (A1.5) Using Lemma A.1.1 k 1 k 1 l 1 l 1 jk v jk 1 jl v jk 1 v jl (A1.6) A.2. Deriving Posterior and Predictive Distributions Derivation of equation(3.8): Since G0 is a random distribution, we can draw from it. Let assume θ** are i.i.d. draws from G0 . θ** takes value in since G0 is a distribution over .Let A1 , A2 ,..., Ak be a finite measureable partition of and m r # i : i** Ar . By using the conjugacy and also definition of DP we can write: G0 ( A1 ),..., G0 ( Ak ) | 1** ,...,n** ~ Dir H A1 m 1,..., H Ak m k (A1.7) k This shows the posterior of G0 is a DP with concentration parameter equal to m r m and mean r n H ** of \ i 1 m n i . We also know K m i 1 ** i k 1 k k** so we can write: Update: February 29, 2012 Preliminary Exam Report Page 31 of 34 K H m k ** k k 1 G0 | , H , θ** ~ DP m , m (A1.8) Derivation of equation(3.9): Similar to the last proof, let A1 , A2 ,..., Ak be a finite measureable partition of and n j k #i : ji Ar G j ( A1 ),..., G j ( Ak ) | j1 ,..., jn ~ Dir G0 A1 n j 1 ,..., G0 Ak n j k (A1.9) n K G0 ji k 1 nj So G j is a DP with concentration n j k n j and mean i 1 K G0 n j k ** k k 1 G j | , G0 , θ ~ DP n j , nj and finally we can write: (A1.10) Derivation of equation(3.10): For A : p ji A | j1... ji 1 E G j A | j1... ji 1 nj k t : *jt K n k 1 k** j k k** 1 G0 ( A) nj K n k 1 j k k** (A1.11) n jt * jt K k 1 t : *jt k** n jt * jt (A1.12) mj n jt t 1 * jt From(A1.11) and (A1.12): 1 ji | j1... 
ji 1 , , G0 ~ nj G0 mj n t 1 jt * jt (A1.13) Derivation of Equation(3.11) is very similar to the above lines and we just need to calculate the expectation of G0 instead of G j . \ Update: February 29, 2012 Preliminary Exam Report Page 32 of 34 Deration of Equation(3.12) K H m k ** k k 1 and also we know that we can write G has We know that G0 | , H , θ** ~ DP m , 0 m two parts; one is a draw form a DP and the other is a draw from a multinomial distribution: G0 K k 1 j 1 k 1 kk 0 j j k ** (A1.14) k Let A0 , A1 ,..., Ak be a partition on where A0 contains the whole except spikes located at k** and Aj , j 1,..., K contains spikes. We can write: G0 ( A0 ), G0 ( A1 ),..., G0 ( Ak ) ~ Dir 0 F A0 ,..., 0 F Ak K 0 m , F H m k k** (A1.15) k 1 m And as a result we can write: 0 , 1 ,..., K ~ Dir , m 1,..., m K (A1.16) Derivation of equation (3.13) follows the same lines. APPENDIX B: DERIVATION OF HDP-HMM RELATIONSHIPS B.1. Derivation of the posterior distribution for z t ,st Lemma B.1.1 \ Update: February 29, 2012 Preliminary Exam Report Page 33 of 34 K 1 k k d k K k 1 K k k 1 k 1 K k k 1 k 1 k d K k k 1 (B1.1) Dir k K k k 1 K k k 1 Derivation of p zt k | z\ t , , , : By using the chain rule and graphical model of Figure 3 we can write: p zt k | z \ t , , , p i | , , i p zt 1 | k p zt k | zt 1 p z p i t 1 | k p zt k | zt 1 p i i p z | d z 1 p z | i d t ,t 1 | , , | z 1 i , (B1.2) | z | z 1 i, t , t 1 , , , d i For zt 1 j and k j we can write: p zt k | z\ t , , , p z | p | z | z k , t , t 1 , , , d p z k | p | z | z j, t , t 1 , , , d t 1 k 1 k k k t j 1 j j (B1.3) j p zt 1 | z | z 1 k , t , t 1 , , , p zt k | z | z 1 j , t , t 1 , , , For k j p zt j | z\ t , , , p z t 1 j | j p zt j | j p j | z | z 1 j , t , t 1 , , , d j p zt j , zt 1 | z | z 1 j , t , t 1 , , , (B1.4) Using the fact that z has a multinomial distribution, j | , ~ Dir 1 ,..., j ,..., K , K 1 , and by using Lemma B.1.1: \ Update: February 29, 2012 Preliminary Exam Report Page 34 of 34 p z | z 1 i | , , k , i p i | , , p z | z 1 i | i d i i K 1 K 1 k 1 k 1 ikk k ,i 1 iknik d i k , i k , i k , i k i k k k k k k k k ni k k k k , i nik k k , i nik (B1.5) k k , i nik k k , i In the equation above, nik denotes the number of transitions from state i to state j . Using (B1.5), (B1.3), (B1.4), and after some algebra we can obtain: k nzt t 1 zt 1 , k p zt k | z\ t , , , 2 k z 1 t , z nkzt zt 1 , k k , zt 1 t 1 t 1 k 1,..., K t n k zt 1 , k (B1.6) k K 1 Equation (4.5) can be obtained similar to (3.10). Notice in this case that for each state we have a DP and therefore numbers of data points for DP are all data points associated with that state. Equation (4.6) can be obtained similar to (3.16). The only difference is that we only consider observations assigned to state zt k and st j . \ Update: February 29, 2012