Graphical Models Lecture 21: Topic Models & Dirichlet Processes
Andrew McCallum, [email protected]
Thanks to Yee Whye Teh, Tom Griffiths and Erik Sudderth for some slide materials.

Background: Dirichlet Distribution

Dirichlet Distribution: a "dice factory"
• The distribution is defined over the (k-1)-simplex: k non-negative arguments which sum to one.
• The Dirichlet is the conjugate prior to the multinomial. (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)
• The Dirichlet parameter αi can be thought of as a prior count of the ith class.
• Question: how likely is a given multinomial θ? Answer (intuition): how much probability would θ give to the prior counts αi? Concretely, p(θ | α) ∝ Π_i θi^(αi - 1).

Dirichlet Distribution
• The multivariate equivalent of the Beta distribution (a "coin factory").
• The parameters α determine the form of the prior.
• Small α: most mass is concentrated on a few outcomes. Important for later!

Latent Dirichlet Allocation
A tool for discovering interpretable "topics" from large collections of documents.

Analysis of PNAS abstracts
• Test topic models with a real database of scientific papers from PNAS.
• All 28,154 abstracts from 1991-2001.
• All words occurring in at least five abstracts and not on a "stop" list (20,551 word types).
• Total of 3,026,970 tokens in the corpus.

A selection of topics (top words per topic):
• FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
• HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
• MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
• STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
• NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
• TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Hot and cold topics (topics whose prevalence rises or falls over 1991-2001):
• Hot topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
• Hot topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
• Hot topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
• Cold topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
• Cold topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
• Cold topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING

The generative view of topic models:
• Latent structure → (probabilistic process) → observed data.
• Latent structure (meaning) → (probabilistic process) → observed data (words).
• Statistical inference runs in the reverse direction: from the observed data (words) back to the latent structure (meaning).
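Before turning to LDA itself, here is a small numerical illustration of the Dirichlet background above. It is a sketch added for these notes (numpy, with arbitrary choices of k and α), not something from the original slides.

    # Sketch: symmetric Dirichlet draws ("dice factory") for small vs. large alpha,
    # plus the conjugate posterior update mentioned above.
    import numpy as np

    rng = np.random.default_rng(0)
    k = 10                                    # number of outcomes (die faces)

    for alpha in [0.1, 1.0, 10.0]:
        theta = rng.dirichlet(alpha * np.ones(k))
        # Small alpha: most probability mass lands on a few outcomes (sparse dice).
        # Large alpha: draws are close to uniform.
        print(f"alpha={alpha:5.1f}  theta={np.round(theta, 2)}")

    # Conjugacy: with a Dirichlet(alpha) prior and multinomial counts c,
    # the posterior is simply Dirichlet(alpha + c).
    counts = rng.multinomial(100, rng.dirichlet(0.5 * np.ones(k)))
    posterior_params = 0.5 * np.ones(k) + counts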
Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003)
• Dirichlet priors α and β.
• θ(d) ∼ Dirichlet(α): a distribution over topics for each document d.
• φ(j) ∼ Dirichlet(β): a distribution over words for each topic j = 1, …, T.
• zi ∼ Discrete(θ(d)): a topic assignment for each word.
• wi ∼ Discrete(φ(zi)): each word is generated from its assigned topic.
(Plates: Nd words per document, D documents, T topics.)

Inference in LDA
• High tree width means exact inference is intractable.
[Figure: the model unrolled over topic assignments z1, …, z12 and words w1, …, w12 of several documents, with per-document θ's and shared α and β.]
• Approximate inference:
– Variational mean field
– Gibbs sampling
– Collapsed Gibbs sampling

The collapsed Gibbs sampler
• Using conjugacy of the Dirichlet and multinomial distributions, integrate out the continuous parameters Φ and Θ.
• This defines a distribution on discrete ensembles z.

The collapsed Gibbs sampler
• Sample each zi conditioned on z-i.
• Notation:
– i indexes over words w and their topic assignments z.
– j indexes over topics.
– n_wi^(zi) is the number of times word type wi occurs in topic zi.
– n_j^(di) is the number of tokens in document di assigned to topic j.
– n_.^(zi) is the total number of tokens in topic zi (the "." is a wildcard).
– n_.^(di) is the total number of tokens in document di.
• In this notation the sampled conditional is
  P(zi = j | z-i, w) ∝ (n_wi^(j) + β) / (n_.^(j) + Wβ) × (n_j^(di) + α) / (n_.^(di) + Tα),
  with all counts computed excluding token i (W is the vocabulary size).

The collapsed Gibbs sampler
• Sample each zi conditioned on z-i.
• This is nicer than your average Gibbs sampler:
– memory: the counts can be cached in two sparse matrices;
– optimization: no special functions, just simple arithmetic;
– the distributions on Φ and Θ are analytic given z and w, and can later be recovered for each sample.

Gibbs sampling in LDA
[Slides: a worked example showing the topic assignments of a small corpus being re-sampled over iterations 1, 2, …, 1000.]
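In code, the per-token update above fits in a few lines. The following is a minimal sketch added for these notes (the corpus format, function name, and default settings are my own assumptions, not material from the lecture); it follows the count notation of the previous slides.

    # Minimal collapsed Gibbs sampler for LDA (illustrative sketch).
    # docs: list of documents, each a list of integer word ids in [0, W).
    # alpha, beta: symmetric Dirichlet hyperparameters; T: number of topics.
    import numpy as np

    def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, n_iters=200, seed=0):
        rng = np.random.default_rng(seed)
        D = len(docs)
        n_wt = np.zeros((W, T))                    # n_w^(j): word-topic counts
        n_dt = np.zeros((D, T))                    # n_j^(d): document-topic counts
        n_t = np.zeros(T)                          # n_.^(j): tokens per topic
        z = [rng.integers(T, size=len(doc)) for doc in docs]    # random init
        for d, doc in enumerate(docs):             # seed the count matrices
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
        for _ in range(n_iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]                    # remove token i from the counts
                    n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                    # p(z_i = j | z_-i, w) ∝ (n_w^(j)+beta)/(n_.^(j)+W*beta) * (n_j^(d)+alpha)
                    # (the per-document denominator is constant in j and cancels)
                    p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                    t = rng.choice(T, p=p / p.sum())
                    n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
                    z[d][i] = t
        return z, n_wt, n_dt

Given a sample z, point estimates of Φ and Θ follow directly from the same cached counts, e.g. φ_wj ≈ (n_wt[w, j] + β) / (n_t[j] + Wβ), which is the "analytic given z and w" remark above in code form.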
Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003), with reasonable hyperparameter values
• α (prior on each document's topic distribution): a reasonable value is 50/T or 0.01.
• β (prior on each topic's word distribution): reasonable values are 0.01 to 1.0.
• The generative model is unchanged: φ(j) ∼ Dirichlet(β), θ(d) ∼ Dirichlet(α), zi ∼ Discrete(θ(d)), wi ∼ Discrete(φ(zi)).

Varying α: decreasing α increases sparsity (each document concentrates on fewer topics).
Varying β: decreasing β increases sparsity (each topic concentrates on fewer word types).

Selecting the number of topics?
• An example of Bayesian-network model structure learning.
• How to do it?
– Earlier lecture: data likelihood plus a prior preferring smaller structures; then try lots of possible structures.
– Today: bake parameter estimation and structure considerations together; define an infinite structure with a prior enforcing that most of it is rarely used.
• Non-parametric models:
– do have parameters;
– but the number of parameters actually instantiated grows with the data.

Dirichlet Processes
Dirichlet processes can be confusing because there are different ways to see them. I'll describe four.

Dirichlet Process as a Noisy Copier
• Motivation:
– You have a distribution H.
– You'd like a copy of it that can be a little different from the original.
• G ∼ DP(α, H):
– G is a perturbed copy of the original distribution H;
– α is the copy-fidelity parameter (higher = closer to H).

Dirichlet Process Definition
• A Dirichlet Process (DP) is a distribution over probability distributions. (More correctly, instead of "probability distribution" we should say "probability measure," which works on continuous domains.)
• A DP has two parameters:
– the base distribution H, which is the mean of the DP;
– the strength parameter α, which is like an inverse variance of the DP.
• We write G ∼ DP(α, H) if for any partition (A1, …, An) of X:
  (G(A1), …, G(An)) ∼ Dirichlet(αH(A1), …, αH(An)).
• Like a game: an opponent gets to pick the partitioning, and the masses a draw G places on the chosen cells must come out Dirichlet distributed.
[Figure: a partition A1, …, A6 of the space X.]
• Summary: a DP is a distribution over probability measures such that the marginals on finite partitions are Dirichlet distributed.

Dirichlet Process Definition, continued
• Sounds magical! Is this even possible? Yes.
• Fact: samples from a DP are always discrete distributions.
• Intuition:
– Take an infinite number of samples from the original H, but re-weight them according to a draw from a Dirichlet with uniform mean and concentration related to α.
– With small α, most mass will be on just a few of those samples. (We will probably never even see the others.)

Constructing a DP draw: the Blackwell-MacQueen (Pòlya) urn scheme
Pòlya's urn scheme produces a sequence θ1, θ2, … with the following conditionals:
  θn | θ1:n-1 ∼ ( Σ_{i=1}^{n-1} δ_θi + αH ) / (n - 1 + α)
• Imagine picking balls of different colors from an urn. Start with no balls in the urn.
• For the nth draw, n = 1, …, ∞:
– with probability ∝ α, draw θn ∼ H and add a ball of that color to the urn;
– with probability ∝ n - 1, pick a ball at random from the urn, record θn to be its color, return the ball to the urn and add a second ball of the same color.
• Note: for large α we mostly just draw from H; for small α we often copy an old value, perturbing G away from H.
• The Blackwell-MacQueen urn scheme is like a "representer" for the DP: a finite projection of an infinite object.
• NEXT: we'd like to know G(x) for each different color x. We need to gather all the balls of the same color and count them...

Alternative view of the same construction: Chinese Restaurant Process
• Use de Finetti's theorem about exchangeability to gather together balls of the same color into "restaurant tables":
– customer = a draw in the urn scheme;
– table = a ball color, i.e. a value θ.
• Generating from the CRP:
– The first customer sits at the first table.
– Customer n sits at:
  table k with probability nk / (α + n - 1), where nk is the number of customers at table k;
  a new table K + 1 with probability α / (α + n - 1).
• Customers ⇔ integers, tables ⇔ clusters.
• The CRP exhibits the clustering property of the DP.
[Figure: customers 1-9 seated around the first few of infinitely many tables.]
• The number of customers at a table is the (re-)weighting of that table's value; most mass is focussed on the early tables.
• NEXT: we'd like to know G(x) for each different color x without having to simulate an infinite number of customers...
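Before moving on to that answer, here is a tiny simulation of the CRP dynamics just described, added for these notes as an illustrative sketch (function name and parameter values are my own, not from the lecture).

    # Sketch: seat n customers by the CRP and report the table occupancies.
    import numpy as np

    def crp_tables(n_customers, alpha, seed=0):
        rng = np.random.default_rng(seed)
        counts = []                                    # customers at each table
        for n in range(1, n_customers + 1):
            # existing table k with prob n_k / (alpha + n - 1),
            # a new table with prob alpha / (alpha + n - 1)
            weights = np.array(counts + [alpha], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(counts):
                counts.append(1)                       # open a new table
            else:
                counts[k] += 1
        return counts

    print(sorted(crp_tables(1000, alpha=1.0), reverse=True))

With α = 1 and 1,000 customers the simulation typically opens on the order of α log n ≈ 7 tables, with most customers concentrated at the earliest few, exactly the "most mass on early tables" behaviour noted above.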
Alternative view of the same construction: Stick-Breaking Construction
• "What are the table weights when there are an infinite number of customers?"
• What do draws G ∼ DP(α, H) look like? G is discrete with probability one, so
  G = Σ_{k=1}^∞ πk δ_θ*k      (δ_θ*k is a point mass at θ*k).
• The stick-breaking construction shows that G ∼ DP(α, H) if
  πk = βk Π_{l=1}^{k-1} (1 - βl),   βk ∼ Beta(1, α),   θ*k ∼ H.
[Figure: the weights π(1), π(2), π(3), … obtained by repeatedly breaking pieces off a unit-length stick.]
• We write π ∼ GEM(α) if π = (π1, π2, …) is distributed as above.

What does all this have to do with non-parametric infinite mixtures?!

Finite Mixture Model (e.g. a mixture of Gaussians)
• H: prior on the Gaussian means.
• π: mixture proportions over the components (number of mixture components = T).
• θ: the Gaussian means, one per component.
• zi: mixture indicator for data point i.
• xi: Gaussian observation drawn from mean θ_zi.
(Plate: N data points.)

DP (Infinite) Mixture Model
• The infinite limit of mixture models as the number of mixture components tends to infinity.
• Gaussian mixture model example:
– H: prior (base distribution) on Gaussian means;
– G ∼ DP(α, H);
– θi ∼ G for each data point i;
– xi: Gaussian observation with mean θi.
(Plate: N data points.)

Alternative diagram for the DP mixture (via stick breaking)
• α → stick breaking → π, the infinite vector of mixture proportions.
• H → θ, the infinitely many component means.
• zi: mixture indicator; xi: Gaussian drawn from mean θ_zi.
(Plate: N data points; number of components = ∞.)

Getting back to LDA...
• We want to generate a corpus of documents from a set of shared "topics."
• The DP mixture model does not explicitly enforce any sharing. (Alternatively: the DP mixture confounds the mixture values and the mixture proportions.)
• We need something more...

Hierarchical DP
• What happens if we put a prior on a Dirichlet process? Why would we want to?
• We might have a collection of related documents or images, each of which is a mixture (e.g. of Gaussians).
• Model:
– G0 ∼ DP(γ, H);
– Gj | G0 ∼ DP(α, G0) for each group j = 1, …, J;
– θi ∼ Gj, and xi is drawn from θi.
(Plates: N observations in each of J groups.)

Alternative diagram for the HDP (via stick breaking)
• γ → global stick-breaking weights β; α and β → per-group mixture proportions π.
• H → shared atoms θ (infinitely many); zi: indicator; xi: observation.
(Plates: N observations in each of J groups; ∞ components.)

Representations of Hierarchical Dirichlet Processes: the hierarchical Pòlya urn scheme
Let G0 ∼ DP(γ, H) and G1, G2 | G0 ∼ DP(α, G0). The hierarchical Pòlya urn scheme generates draws from G1 and G2:
[Figure: rows of draws θ01, θ02, θ03, … from G0 and θ11, θ12, … and θ21, θ22, … from G1 and G2; each child-level draw either copies an earlier draw at its own level or falls back to a draw from the G0 urn, which in turn either copies one of its earlier draws or draws a fresh value from H.]
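To tie the stick-breaking construction to the DP mixture of Gaussians above, here is a truncated sampling sketch added for these notes. The truncation level K, the choice H = N(0, 10²), and the unit observation noise are my own illustrative assumptions, not values from the slides.

    # Sketch: truncated stick-breaking draw of G ~ DP(alpha, H) and data from
    # the corresponding DP mixture of 1-D Gaussians.
    import numpy as np

    def sample_dp_mixture(n, alpha=2.0, K=100, seed=0):
        rng = np.random.default_rng(seed)
        beta = rng.beta(1.0, alpha, size=K)                 # beta_k ~ Beta(1, alpha)
        stick_left = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
        pi = beta * stick_left                              # pi_k = beta_k * prod_{l<k}(1 - beta_l)
        pi /= pi.sum()                                      # renormalise after truncation
        theta = rng.normal(0.0, 10.0, size=K)               # atoms theta*_k ~ H = N(0, 10^2)
        z = rng.choice(K, size=n, p=pi)                     # component indicators
        x = rng.normal(theta[z], 1.0)                       # observations around theta_{z_i}
        return x, z

    x, z = sample_dp_mixture(500)
    print("components actually used:", len(np.unique(z)))   # typically far fewer than K

Even with K = 100 available atoms, a small α leaves almost all of the stick on the first few components, which is the same sparsity seen in the urn and restaurant views.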
Inference in the Dirichlet Process Mixture Model

Collapsed Gibbs Recipe for DP Mixture: the big picture
• For each data point:
– pretend it is the last point (justified by exchangeability);
– the prior is just the Chinese-restaurant dynamics;
– the likelihood is just the usual mixture likelihood.

Collapsed Gibbs Recipe for DP Mixture: the slightly more detailed picture, still skipping the evidence (word) likelihood
• For the nth word:
– With probability ∝ α, draw zn ∼ G0. To draw from G0: with probability ∝ γ, draw zn ∼ H; with probability ∝ n - 1, draw a topic from those already in G0, proportionally to their counts.
– With probability ∝ nj - 1, draw a topic from those already in Gj, proportionally to their counts.

From previously: the finite collapsed Gibbs sampler samples each zi conditioned on z-i. For the DP (infinite) case:
– Very similar, but include the possibility of picking a "new" topic, using Chinese-restaurant dynamics.
– When you pick a "new" topic for a document, first go to G0 and consider reusing an "old new" topic from another document; otherwise create a "new new" topic.

Generic Collapsed Gibbs Sampler for the DP Mixture Model [Neal 2000, Alg. 2; Sudderth PhD, Alg. 2.3]
Given the previous concentration parameter α(t-1), cluster assignments z(t-1), and cached statistics for the K current clusters, sequentially sample new assignments as follows:
1. Sample a random permutation τ(·) of the integers {1, …, N}.
2. Set α = α(t-1) and z = z(t-1). For each i ∈ {τ(1), …, τ(N)}, resample zi as follows:
(a) For each of the K existing clusters, determine the predictive likelihood
    fk(xi) = p(xi | {xj | zj = k, j ≠ i}, λ).
    This likelihood can be computed from cached sufficient statistics via Prop. 2.1.4. Also determine the likelihood fk̄(xi) of a potential new cluster k̄ via eq. (2.189).
(b) Sample a new cluster assignment zi from the following (K + 1)-dimensional multinomial:
    zi ∼ (1/Zi) [ α fk̄(xi) δ(zi, k̄) + Σ_{k=1}^K Nk^{-i} fk(xi) δ(zi, k) ],
    Zi = α fk̄(xi) + Σ_{k=1}^K Nk^{-i} fk(xi),
    where Nk^{-i} is the number of other observations currently assigned to cluster k.
(c) Update the cached sufficient statistics to reflect the assignment of xi to cluster zi. If zi = k̄, create a new cluster and increment K.
3. Set z(t) = z. Optionally, mixture parameters for the K currently instantiated clusters may be sampled as in step 3 of Alg. 2.1.
4. If any current clusters are empty (Nk = 0), remove them and decrement K accordingly.
5. If α ∼ Gamma(a, b), sample α(t) ∼ p(α | K, N, a, b) via auxiliary-variable methods [76].
(This is Algorithm 2.3 of Sudderth's thesis: a Rao-Blackwellized Gibbs sampler for an infinite Dirichlet process mixture model as defined in Fig. 2.24; each iteration sequentially resamples the cluster assignments of all N observations.)
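A compact sketch of the per-observation resampling step of the algorithm above, added for these notes. The predictive likelihood is passed in as a hypothetical function pred_lik(x, members), standing in for fk(x) (and for fk̄(x) when members is empty); in a conjugate model it would be computed from cached sufficient statistics rather than from the raw cluster members, as step (a) notes. The names and the simple, inefficient bookkeeping are my own.

    # Sketch of one Rao-Blackwellized Gibbs sweep for a DP mixture (Neal Alg. 2).
    import numpy as np

    def dp_gibbs_sweep(X, z, alpha, pred_lik, rng):
        N = len(X)
        for i in rng.permutation(N):
            z[i] = -1                                     # remove x_i from its cluster
            clusters = sorted(set(z) - {-1})              # currently occupied clusters
            weights, labels = [], []
            for k in clusters:
                members = [X[j] for j in range(N) if z[j] == k]
                weights.append(len(members) * pred_lik(X[i], members))   # N_k^{-i} f_k(x_i)
                labels.append(k)
            weights.append(alpha * pred_lik(X[i], []))    # alpha * f_kbar(x_i): new cluster
            labels.append(max(clusters, default=-1) + 1)
            p = np.asarray(weights, dtype=float)
            z[i] = labels[rng.choice(len(labels), p=p / p.sum())]
        return z

Empty clusters disappear automatically here because only occupied labels are kept; steps 3 to 5 of the algorithm (resampling cluster parameters and α) are omitted from the sketch.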
Application: Document Topic Modelling, HDP mixture experimental results
• Compared against latent Dirichlet allocation (a parametric version of the HDP mixture) for topic modelling.
[Figure: (left) perplexity on test abstracts for LDA with 10-120 topics versus the HDP mixture; (right) the posterior over the number of topics in the HDP mixture, concentrated on roughly 61-73 topics.]

Further Variations

Nested Chinese Restaurant Process [Blei et al., 2003]
[Figure 1: (a) The paths of four tourists through the infinite tree of Chinese restaurants (L = 3). The solid lines connect each restaurant to the restaurants referred to by its tables. The collected paths of the four tourists describe a particular subtree of the underlying infinite tree. (b) The corresponding graphical model over the per-document path variables c1, c2, c3, …, cL, the topic assignments z, and the words w (plates N and M).]

Nested Chinese Restaurant Process: an estimated hierarchy
[Figure 5: A topic hierarchy estimated from 1717 abstracts from NIPS01 through NIPS12. Each node contains the top eight words from its corresponding topic distribution.] The nodes include:
• the, of, a, to, and, in, is, for
• neurons, visual, cells, cortex, synaptic, motion, response, processing
• cell, neuron, circuit, cells, input, i, figure, synapses
• chip, analog, vlsi, synapse, weight, digital, cmos, design
• algorithm, learning, training, method, we, new, problem, on
• recognition, speech, character, word, system, classification, characters, phonetic
• b, x, e, n, p, any, if, training
• hidden, units, layer, input, output, unit, x, vector
• control, reinforcement, learning, policy, state, actions, value, optimal

Infinite Hidden Markov Models
• Issue with the previous construction: there is no sharing of the possible next states across different current states.
• Implement sharing of next states using an HDP:
  (τ1, τ2, …) ∼ GEM(γ)
  (π1l, π2l, …) | τ ∼ DP(α, τ)   (one such transition distribution for each state l)
[Figure: the iHMM graphical model, with hidden state sequence v0, v1, v2, …, vT, observations x1, x2, …, xT, per-state parameters θ*l, and base distribution H.]

Infinite N-gram Model
"A Stochastic Memoizer for Sequence Data" [Wood, Archambeau, Gasthaus, James, Teh, 2009]
• The model defines a predictive distribution G[u] for every context u (G[], G[o], G[a], G[oa], G[ac], G[oac], G[aca], G[oaca], G[oacac], …), tied together hierarchically with H at the root.
[Figure: (a) the hierarchy of context-indexed distributions for the string oacac; (b) the corresponding prefix tree for oacac; (c) initialisation of the Chinese restaurant franchise sampler representation of the highlighted subtree.]

Readings for More Detail
• Erik Sudderth's PhD thesis: http://www.cs.brown.edu/~sudderth/papers/sudderthPhD.pdf
• Yee Whye Teh's Dirichlet Process tutorial: http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes/mlss2007.pdf
• HDP introduction and LDA with infinite topics: http://www.cse.buffalo.edu/faculty/mbeal/papers/hdp.pdf
• HDP implementation by Teh: http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz