Generative Topic Models for Community Analysis
Ramesh Nallapati
10-802 Guest Lecture, 9/18/2007

Objectives
• Provide an overview of topic models and their learning techniques
  – Mixture models, PLSA, LDA
  – EM, variational EM, Gibbs sampling
• Convince you that topic models are an attractive framework for community analysis
  – 5 definitive papers

Outline
• Part I: Introduction to Topic Models
  – Naïve Bayes model
  – Mixture models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic model
  – Author-Topic-Recipient model
  – Modeling influence of citations
  – Mixed-membership Stochastic Block model

Part I: Introduction to Topic Models

• Multinomial naïve Bayes
  – For each document d = 1,…,M:
    • Generate C_d ~ Mult(·|π)
    • For each position n = 1,…,N_d:
      • Generate w_n ~ Mult(·|β_{C_d})
  [Plate diagram: class node C with word nodes W_1, W_2, …, W_N, repeated over M documents; parameters π and β]

• Naïve Bayes model: compact representation
  [Plate diagram: the same model with the word nodes collapsed into a single node W inside a plate over N positions, nested in a plate over M documents]

• Multinomial naïve Bayes: learning
  – Maximize the log-likelihood of the observed variables w.r.t. the parameters
    • Convex function: global optimum
    • Solution (MLE): π_k ∝ #documents with class k;  β_{k,w} ∝ #occurrences of word w in documents of class k

• Mixture model: unsupervised naïve Bayes model
  – Joint probability of words and classes:
      P(w_d, Z_d = k) = π_k ∏_{n=1…N_d} β_{k,w_n}
  – But classes are not visible, so we marginalize:
      P(w_d) = Σ_k π_k ∏_{n=1…N_d} β_{k,w_n}
  [Plate diagram: as naïve Bayes, but the class node Z is latent]

• Mixture model: learning
  – The log-likelihood is not a convex function
    • No global optimum solution
  – Solution: Expectation Maximization
    • Iterative algorithm
    • Finds a local optimum
    • Guaranteed to maximize a lower bound on the log-likelihood of the observed data

• Quick summary of EM
  – Log is a concave function: log(0.5·x₁ + 0.5·x₂) ≥ 0.5·log x₁ + 0.5·log x₂ (Jensen's inequality)
  [Figure: the chord 0.5·log x₁ + 0.5·log x₂ lies below the log curve between x₁ and x₂; the EM bound also includes an entropy term H(q)]
  – The lower bound is convex!
  – Optimize this lower bound w.r.t. each variable instead

• Mixture model: EM solution
  – E-step:  γ_{dk} = P(Z_d = k | w_d) ∝ π_k ∏_{n=1…N_d} β_{k,w_n}
  – M-step:  π_k ∝ Σ_d γ_{dk};   β_{k,w} ∝ Σ_d γ_{dk} · n_{d,w}

• Probabilistic Latent Semantic Analysis (PLSA)
  – Select document d ~ Mult(·|π)
  – For each position n = 1,…,N_d:
    • Generate z_n ~ Mult(·|θ_d)   (θ_d = topic distribution of document d)
    • Generate w_n ~ Mult(·|β_{z_n})
  [Plate diagram: d → z → w inside a plate over N positions, nested in a plate over M documents; parameters θ and β]

• PLSA: learning and properties
  – Learning using EM (see the sketch below)
  – Not a complete generative model
    • Has a distribution over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!
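As a concrete illustration of the EM updates above, here is a minimal NumPy sketch of PLSA training on a document-word count matrix. This is an illustrative sketch, not code from the lecture; all names (plsa_em, n_dw, theta, beta) are mine.

```python
import numpy as np

def plsa_em(n_dw, n_topics, n_iters=50, seed=0):
    """EM for PLSA on a D x V document-word count matrix n_dw."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d), D x K
    beta = rng.dirichlet(np.ones(n_words), size=n_topics)   # P(w|z), K x V
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z)
        q = theta[:, :, None] * beta[None, :, :]             # D x K x V
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: expected counts n(d,w) P(z|d,w), then renormalize
        e_counts = n_dw[:, None, :] * q                      # D x K x V
        theta = e_counts.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = e_counts.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta
```

Each iteration monotonically improves the training log-likelihood, but only up to a local optimum, as noted above.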
• PLSA topics (TDT-1 corpus)
  [Figure: example topics learned by PLSA on the TDT-1 corpus]

• Latent Dirichlet Allocation (LDA)
  – For each document d = 1,…,M:
    • Generate θ_d ~ Dir(·|α)
    • For each position n = 1,…,N_d:
      • Generate z_n ~ Mult(·|θ_d)
      • Generate w_n ~ Mult(·|β_{z_n})
  [Plate diagram: α → θ → z → w with topic-word parameters β; inner plate over N positions, outer plate over M documents]

• LDA: learning
  – Overcomes the issues with PLSA
    • Can generate any random document
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence

• Variational EM for LDA
  – Approximate the posterior by a simpler, factorized distribution:
      q(θ, z | γ, φ) = q(θ|γ) ∏_n q(z_n|φ_n)
  – The resulting lower bound is a convex function in each parameter!

• Gibbs sampling
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution

• LDA topics
  [Figure: example topics learned by LDA]

• LDA's view of a document
  [Figure]

• Perplexity comparison of various models
  [Figure: test-set perplexity of several models, including the unigram model and LDA; lower is better]

• Summary of Part I
  – Generative models for exchangeable data
  – Unsupervised models
  – Automatically discover topics
  – Well-developed approximate techniques available for inference and learning

Part II: Topic Models for Community Analysis

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
  – Select document d ~ Mult(·|π)
  – For each position n = 1,…,N_d:
    • Generate z_n ~ Mult(·|θ_d)
    • Generate w_n ~ Mult(·|β_{z_n})
  – For each citation j = 1,…,L_d:
    • Generate z_j ~ Mult(·|θ_d)
    • Generate c_j ~ Mult(·|γ_{z_j})   (γ_k = topic-specific distribution over cited documents)
  [Plate diagram: document d with two branches, z → w over N positions and z → c over L citations; parameters θ, β, γ]

• Likelihood
  – PLSA likelihood (content only):
      L = Σ_d Σ_w n(d,w) log Σ_z P(z|d) P(w|z)
  – New likelihood adds a citation term:
      L = Σ_d [ Σ_w n(d,w) log Σ_z P(z|d) P(w|z) + Σ_c n(d,c) log Σ_z P(z|d) P(c|z) ]
  – Learning using EM

• Heuristic
  – Weight the two terms: L = α · (content term) + (1−α) · (citation term)
  – 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks (see the sketch below)
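A minimal sketch of the α-weighted objective, assuming fully materialized parameter matrices; the function name and representation are mine, not from the paper:

```python
import numpy as np

def blended_loglik(n_dw, n_dc, theta, beta, gamma, alpha):
    """alpha-weighted content + citation log-likelihood (Cohn-Hofmann heuristic).

    n_dw: D x V doc-word counts      theta: D x K, rows are P(z|d)
    n_dc: D x C doc-citation counts  beta:  K x V, rows are P(w|z)
    alpha: content weight in [0,1]   gamma: K x C, rows are P(c|z)
    """
    p_w = theta @ beta    # P(w|d) = sum_z P(z|d) P(w|z), D x V
    p_c = theta @ gamma   # P(c|d) = sum_z P(z|d) P(c|z), D x C
    ll_content = np.sum(n_dw * np.log(p_w + 1e-12))
    ll_links = np.sum(n_dc * np.log(p_c + 1e-12))
    return alpha * ll_content + (1 - alpha) * ll_links
```

In the paper the weighting enters the EM updates themselves; this sketch only shows the trade-off being made in the objective.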
• Experiments: text classification
  – Datasets:
    • WebKB: 6000 CS department web pages with hyperlinks; 6 classes (faculty, course, student, staff, etc.)
    • Cora: 2000 machine learning abstracts with citations; 7 classes (sub-areas of machine learning)
  – Methodology:
    • Learn the model on the complete data and obtain θ_d for each document
    • Classify each test document with the label of its nearest neighbor in the training set
    • Distance measured as cosine similarity in the θ space
    • Measure performance as a function of α

• Classification performance
  [Figure: accuracy as α varies from hyperlink-only to content-only, for each dataset]

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS 2004]
  – For each document d = 1,…,M:
    • Generate θ_d ~ Dir(·|α)
    • For each position n = 1,…,N_d:
      • Generate z_n ~ Mult(·|θ_d)
      • Generate w_n ~ Mult(·|β_{z_n})
    • For each citation j = 1,…,L_d:
      • Generate z_j ~ Mult(·|θ_d)
      • Generate c_j ~ Mult(·|γ_{z_j})
  – Learning using variational EM
  [Plate diagram: as the Cohn-Hofmann model, but with a Dirichlet prior α on θ]
  [Figure: results from the paper]

Author-Topic model for scientific literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
  – For each author a = 1,…,A:
    • Generate θ_a ~ Dir(·|α)
  – For each topic k = 1,…,K:
    • Generate φ_k ~ Dir(·|β)
  – For each document d = 1,…,M:
    • For each position n = 1,…,N_d:
      • Generate an author x ~ Unif(·|a_d)   (a_d = the set of authors of document d)
      • Generate z_n ~ Mult(·|θ_x)
      • Generate w_n ~ Mult(·|φ_{z_n})
  [Plate diagram: a_d → x → z → w; θ in a plate over A authors, φ in a plate over K topics]
  – Learning: Gibbs sampling

• Perplexity results
  [Figure]

• Topic-Author visualization
  [Figure: top words for each topic together with its most likely authors]

• Application 1: Author similarity
  [Figure]

• Application 2: Author entropy
  [Figure]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
  [Plate diagram: the ATR model, which conditions the topic distribution on author-recipient pairs]
  – Learning: Gibbs sampling (see the sketch below)
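To make the sampler concrete, here is a minimal collapsed Gibbs sampler for plain LDA, the update that the Author-Topic and Author-Topic-Recipient samplers extend with author and recipient variables. This is an illustrative sketch with my own variable names, not the authors' code:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))    # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))   # topic-word counts
    nk = np.zeros(n_topics)                  # total words per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):           # initialize counts
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                  # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # P(z_n = k | rest) prop. to (n_dk + alpha)(n_kw + eta)/(n_k + V*eta)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + vocab_size * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][n] = k                  # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

Point estimates of θ and φ then follow by normalizing the smoothed count matrices.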
• Datasets
  – Enron email data: 23,488 messages between 147 users
  – McCallum's personal email: 23,488(?) messages with 128 authors

• Topic visualization: Enron set
  [Figure]

• Topic visualization: McCallum's data
  [Figure]

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Copycat model
  [Figure: plate diagram]

• Citation influence model
  [Figure: plate diagram]

• Citation influence graph for the LDA paper
  [Figure]

• Words in the LDA paper assigned to citations
  [Figure]

• Performance evaluation
  – Data:
    • 22 seed papers and 132 cited papers
    • Users labeled citations on a scale of 1-4
  – Models considered:
    • Citation influence model
    • Copycat model
    • LDA-JS-divergence: symmetric Jensen-Shannon divergence in topic space (see the sketch below)
    • LDA-post
    • PageRank
    • TF-IDF
  – Evaluation measure:
    • Area under the ROC curve
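For reference, a minimal sketch of the Jensen-Shannon divergence used by the LDA-JS baseline above; the function name and the smoothing constant are mine:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Symmetric Jensen-Shannon divergence between two topic mixtures."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)                        # midpoint distribution
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A citing/cited pair whose inferred θ vectors have a small divergence is ranked as a likely influence.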
• Results
  [Figure: area under the ROC curve for each model]

Mixed-membership Stochastic Block models [work in progress]
• A complete generative model for text and citations
• Can model the topicality of citations
  – Topic-specific PageRank
• Can also predict citations between unseen documents

Summary
• Topic modeling is an interesting new framework for community analysis
  – Sound theoretical basis
  – Completely unsupervised
  – Simultaneous modeling of multiple fields
  – Discovers "soft" communities and clusters in terms of "topic" membership
  – Can also be used for predictive purposes