Probabilistic Models for Community Analysis
Generative Topic Models for Community Analysis
Ramesh Nallapati
Objectives
• Provide an overview of topic models and
their learning techniques
– Mixture models, PLSA, LDA
– EM, variational EM, Gibbs sampling
• Convince you that topic models are an
attractive framework for community
analysis
– 5 definitive papers
Outline
• Part I: Introduction to Topic Models
– Naive Bayes model
– Mixture Models
• Expectation Maximization
– PLSA
– LDA
• Variational EM
• Gibbs Sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic Model
  – Author-Topic-Recipient Model
  – Modeling influence of citations
  – Mixed Membership Stochastic Block Model
Introduction to Topic Models
• Multinomial Naïve Bayes
  • For each document d = 1,…,M
    • Generate Cd ~ Mult(· | π)
    • For each position n = 1,…,Nd
      • Generate wn ~ Mult(· | β, Cd)
[Plate diagram: class node C generating words W1, W2, …, WN, repeated over M documents, with parameters π and β]
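A minimal Python sketch of this generative process (the function name, array shapes, and fixed document length are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def generate_nb_corpus(M, N_d, pi, beta):
    """Sample documents from the multinomial Naive Bayes process.

    pi   : class prior, shape (K,)
    beta : per-class word distributions, shape (K, V)
    """
    K, V = beta.shape
    classes, docs = [], []
    for _ in range(M):
        c = rng.choice(K, p=pi)                    # C_d ~ Mult(. | pi)
        w = rng.choice(V, size=N_d, p=beta[c])     # w_n ~ Mult(. | beta, C_d)
        classes.append(c)
        docs.append(w)
    return classes, docs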
Introduction to Topic Models
• Naïve Bayes Model: Compact representation
[Plate diagrams: the fully unrolled model (C generating W1, W2, …, WN) and the equivalent compact plate notation (C generating W inside a plate of size N), both repeated over M documents with parameters π and β]
Introduction to Topic Models
• Multinomial naïve Bayes: Learning
– Maximize the log-likelihood of observed
variables w.r.t. the parameters:
• Convex function: global optimum
• Solution:
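The objective and its closed-form solution, shown as images on the original slide, reconstructed here under the standard multinomial Naïve Bayes assumptions:

\log p(\mathbf{w}, \mathbf{C} \mid \pi, \beta)
  = \sum_{d=1}^{M} \Big( \log \pi_{C_d} + \sum_{n=1}^{N_d} \log \beta_{C_d, w_n} \Big)

\hat{\pi}_c = \frac{\sum_{d} \mathbb{1}[C_d = c]}{M}
\qquad
\hat{\beta}_{c,v} = \frac{\sum_{d: C_d = c} \sum_{n} \mathbb{1}[w_n = v]}{\sum_{d: C_d = c} N_d}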
Introduction to Topic Models
• Mixture model: unsupervised naïve Bayes model
  • Joint probability of words and classes:
  • But classes are not visible, so the latent class must be summed out:
[Plate diagram: latent class Z generating words W inside a plate of size N, repeated over M documents, with parameters π and β]
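The two quantities referenced above, written out (a standard mixture-of-multinomials reconstruction; the slide shows them as images):

p(\mathbf{w}_d, Z_d = z) = \pi_z \prod_{n=1}^{N_d} \beta_{z, w_n}
\qquad
p(\mathbf{w}_d) = \sum_{z=1}^{K} \pi_z \prod_{n=1}^{N_d} \beta_{z, w_n}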
Introduction to Topic Models
• Mixture model: learning
– Not a convex function
• No global optimum solution
– Solution: Expectation Maximization
• Iterative algorithm
• Finds local optimum
• Guaranteed to maximize a lower-bound on the log-likelihood
of the observed data
Introduction to Topic Models
• Quick summary of EM:
  – Log is a concave function: log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2) (Jensen's inequality)
  – The resulting lower-bound is convex!
  – Optimize this lower-bound w.r.t. each variable instead
[Figure: concavity of the log function, comparing log(0.5·x1 + 0.5·x2) against 0.5·log(x1) + 0.5·log(x2), plus the entropy term H(·) appearing in the bound]
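Applied to the mixture model, the bound being sketched is the standard Jensen's-inequality lower bound, where q is any distribution over the latent class and H(q) is its entropy:

\log p(\mathbf{w} \mid \theta)
  = \log \sum_{z} p(\mathbf{w}, z \mid \theta)
  \ge \sum_{z} q(z) \log p(\mathbf{w}, z \mid \theta) + H(q)

Equality holds when q(z) = p(z | w, θ); EM alternates between setting q to this posterior (E-step) and maximizing the bound over the parameters (M-step).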
Introduction to Topic Models
• Mixture model: EM solution
  – E-step: compute the posterior probability of each latent class for every document
  – M-step: re-estimate π and β using those posteriors as soft counts
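A compact Python sketch of these two steps for the mixture of multinomials (the bag-of-words count matrix, variable names, and smoothing constant are illustrative assumptions):

import numpy as np

def em_mixture_multinomials(counts, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials.

    counts : (M, V) matrix of word counts per document
    K      : number of mixture components
    """
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)              # (K, V)

    for _ in range(n_iter):
        # E-step: gamma[d, z] = P(z | w_d), computed in log space for stability
        log_resp = np.log(pi + 1e-12) + counts @ np.log(beta).T   # (M, K)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        gamma = np.exp(log_resp)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi and beta from expected counts
        pi = gamma.sum(axis=0) / M
        beta = gamma.T @ counts + 1e-12                    # (K, V)
        beta /= beta.sum(axis=1, keepdims=True)

    return pi, beta, gamma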
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis Model
  • Select document d ~ Mult(·)
  • For each position n = 1,…,Nd
    • Generate zn ~ Mult(· | θd)   (θd is the document's topic distribution)
    • Generate wn ~ Mult(· | βzn)
[Plate diagram: document node d with topic distribution θd; topic z generates word w inside a plate of size N, repeated over M documents, with topic-word distributions β]
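For reference, the PLSA log-likelihood that EM maximizes under this model, ignoring the document-selection term (standard form; the slide leaves it implicit):

\mathcal{L}_{\text{PLSA}} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \log \sum_{z=1}^{K} \theta_{d,z}\, \beta_{z, w_n}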
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis Model
  – Learning using EM
  – Not a complete generative model
    • Has a multinomial distribution over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!
Introduction to Topic Models
• PLSA topics (TDT-1 corpus)
Introduction to Topic Models
• Latent Dirichlet Allocation
  • For each document d = 1,…,M
    • Generate θd ~ Dir(· | α)
    • For each position n = 1,…,Nd
      • Generate zn ~ Mult(· | θd)
      • Generate wn ~ Mult(· | βzn)
[Plate diagram: Dirichlet prior α over θd; topic z generates word w inside a plate of size N, repeated over M documents, with topic-word distributions β]
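A minimal Python sketch of this generative process (document length, vocabulary size, hyperparameter values, and the symmetric Dirichlet prior on the topic-word distributions are illustrative assumptions):

import numpy as np

def generate_lda_corpus(M, N_d, K, V, alpha=0.1, eta=0.01, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)            # topic-word distributions (K, V)
    docs, thetas = [], []
    for _ in range(M):
        theta = rng.dirichlet(np.full(K, alpha))              # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=N_d, p=theta)                  # z_n ~ Mult(theta_d)
        w = np.array([rng.choice(V, p=beta[k]) for k in z])   # w_n ~ Mult(beta_{z_n})
        docs.append(w)
        thetas.append(theta)
    return docs, thetas, beta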
Introduction to Topic Models
• Latent Dirichlet Allocation
– Overcomes the issues with PLSA
• Can generate any random document
– Parameter learning:
• Variational EM
– Numerical approximation using lower-bounds
– Results in biased solutions
– Convergence has numerical guarantees
• Gibbs Sampling
– Stochastic simulation
– Unbiased solutions
– Stochastic convergence
Introduction to Topic Models
• Variational EM for LDA
– Approximate the posterior by a simpler
distribution
• A convex function in each parameter!
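The simpler distribution used in standard variational inference for LDA (Blei et al.) is a fully factorized, mean-field family with one Dirichlet factor per document and one multinomial factor per word:

q(\theta_d, \mathbf{z}_d \mid \gamma_d, \phi_d)
  = q(\theta_d \mid \gamma_d) \prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn})

The variational parameters γd and φdn are fit by minimizing the KL divergence from q to the true posterior; the M-step then updates α and β.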
Introduction to Topic Models
• Gibbs sampling
– Applicable when joint distribution is hard to evaluate but
conditional distribution is known
– Sequence of samples comprises a Markov Chain
– Stationary distribution of the chain is the joint distribution
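A sketch of the collapsed Gibbs sampler commonly used for LDA, in the style of Griffiths and Steyvers (a minimal illustration under assumed data structures, not the code behind the slides):

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of word-id arrays."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))          # document-topic counts
    nkv = np.zeros((K, V))                  # topic-word counts
    nk = np.zeros(K)                        # topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):          # initialize counts from random assignments
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                 # remove the current assignment
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                # p(z_n = k | rest) proportional to (ndk + alpha)(nkv + eta)/(nk + V*eta)
                p = (ndk[d] + alpha) * (nkv[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                 # record the new assignment
                ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    return ndk, nkv, z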
Introduction to Topic Models
• LDA topics
Introduction to Topic Models
• LDA’s view of a document
Introduction to Topic Models
• Perplexity comparison of various models (lower is better)
[Figure: perplexity curves ranging from the unigram model to LDA]
Introduction to Topic Models
• Summary
– Generative models for exchangeable data
– Unsupervised models
– Automatically discover topics
– Well developed approximate techniques
available for inference and learning
Outline
• Part I: Introduction to Topic Models
– Naive Bayes model
– Mixture Models
• Expectation Maximization
– PLSA
– LDA
• Variational EM
• Gibbs Sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic Model
  – Author-Topic-Recipient Model
  – Modeling influence of citations
  – Mixed Membership Stochastic Block Model
Hyperlink modeling using PLSA
Hyperlink modeling using PLSA
[Cohn and Hofmann, NIPS, 2001]
  • Select document d ~ Mult(·)
  • For each position n = 1,…,Nd
    • Generate zn ~ Mult(· | θd)
    • Generate wn ~ Mult(· | βzn)
  • For each citation j = 1,…,Ld
    • Generate zj ~ Mult(· | θd)
    • Generate cj ~ Mult(· | γzj)
[Plate diagram: document node d with topic distribution θd; topics generate words w (plate of size N, parameters β) and citations c (plate of size L, parameters γ), repeated over M documents]
Hyperlink modeling using PLSA
[Cohn and Hofmann, NIPS, 2001]
  • PLSA likelihood:
  • New likelihood over both words and citations:
  • Learning using EM
[Plate diagram repeated from the previous slide]
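The two likelihoods, shown as images on the slide, reconstructed in standard form (γ is an assumed symbol for the topic-citation multinomials):

\mathcal{L}_{\text{PLSA}} = \sum_{d} \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z, w_n}

\mathcal{L}_{\text{new}} = \sum_{d} \Big( \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z, w_n}
  + \sum_{j=1}^{L_d} \log \sum_{z} \theta_{d,z}\, \gamma_{z, c_j} \Big)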
Hyperlink modeling using PLSA
[Cohn and Hofmann, NIPS, 2001]
• Heuristic: weight the content term by α and the citation term by (1 − α)
  – 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks
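Written out, the weighted per-document objective consistent with the description above (a reconstruction, using the same symbols as before):

\mathcal{L}_d(\alpha) = \alpha \sum_{n=1}^{N_d} \log \sum_{z} \theta_{d,z}\, \beta_{z, w_n}
  + (1 - \alpha) \sum_{j=1}^{L_d} \log \sum_{z} \theta_{d,z}\, \gamma_{z, c_j}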
Hyperlink modeling using PLSA
[Cohn and Hofmann, NIPS, 2001]
• Experiments: Text Classification
• Datasets:
– Web KB
• 6000 CS dept web pages with hyperlinks
• 6 Classes: faculty, course, student, staff, etc.
– Cora
• 2000 Machine learning abstracts with citations
• 7 classes: sub-areas of machine learning
• Methodology:
  – Learn the model on the complete data and obtain θd for each document
  – Classify each test document with the label of its nearest neighbor in the training set
  – Distance measured as cosine similarity in the θ space
  – Measure performance as a function of α
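A small Python sketch of this evaluation step, assuming the per-document topic vectors (theta) have already been estimated (names and shapes are illustrative):

import numpy as np

def nn_classify(theta_train, labels_train, theta_test):
    """1-nearest-neighbor classification with cosine similarity in topic (theta) space."""
    a = theta_train / np.linalg.norm(theta_train, axis=1, keepdims=True)
    b = theta_test / np.linalg.norm(theta_test, axis=1, keepdims=True)
    sims = b @ a.T                        # (n_test, n_train) cosine similarities
    nearest = sims.argmax(axis=1)         # index of the most similar training document
    return np.asarray(labels_train)[nearest]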
Hyperlink modeling using PLSA
[Cohn and Hofmann, NIPS, 2001]
• Classification performance
[Figure: accuracy as α varies between the hyperlink-only and content-only extremes, for both datasets]
Hyperlink modeling using LDA
Hyperlink modeling using LDA
[Erosheva, Fienberg, Lafferty, PNAS, 2004]
  • For each document d = 1,…,M
    • Generate θd ~ Dir(· | α)
    • For each position n = 1,…,Nd
      • Generate zn ~ Mult(· | θd)
      • Generate wn ~ Mult(· | βzn)
    • For each citation j = 1,…,Ld
      • Generate zj ~ Mult(· | θd)
      • Generate cj ~ Mult(· | γzj)
  • Learning using variational EM
[Plate diagram: Dirichlet prior α over θd; topics generate words w (plate of size N, parameters β) and citations c (plate of size L, parameters γ), repeated over M documents]
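A sketch of one document's generative step in this link-LDA variant, extending the earlier LDA sketch with a per-topic citation distribution (gamma is an assumed name for the topic-citation multinomials):

import numpy as np

def generate_link_lda_doc(N_d, L_d, beta, gamma, alpha, rng=None):
    """Sample one document: N_d words and L_d citations sharing one theta_d.

    beta  : (K, V) topic-word distributions
    gamma : (K, D) topic-citation distributions over candidate documents
    """
    rng = rng or np.random.default_rng()
    K = beta.shape[0]
    theta = rng.dirichlet(np.full(K, alpha))                      # theta_d ~ Dir(alpha)
    zw = rng.choice(K, size=N_d, p=theta)                         # topics for words
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in zw])
    zc = rng.choice(K, size=L_d, p=theta)                         # topics for citations
    cites = np.array([rng.choice(gamma.shape[1], p=gamma[k]) for k in zc])
    return words, cites, theta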
Hyperlink modeling using LDA
[Erosheva, Fienberg, Lafferty, PNAS, 2004]
Author-Topic Model for Scientific Literature
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
  • For each author a = 1,…,A
    • Generate θa ~ Dir(· | α)
  • For each topic k = 1,…,K
    • Generate φk ~ Dir(· | β)
  • For each document d = 1,…,M
    • For each position n = 1,…,Nd
      • Generate author x ~ Unif(· | ad)
      • Generate zn ~ Mult(· | θx)
      • Generate wn ~ Mult(· | φzn)
[Plate diagram: observed author set ad per document; per-author topic distributions θ (plate of size A) with prior α, per-topic word distributions φ (plate of size K) with prior β; x, z, and w inside the word plate of size N, repeated over M documents]
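A sketch of the per-document sampling step under the Author-Topic model (standard Rosen-Zvi et al. formulation; array shapes and names are illustrative):

import numpy as np

def generate_at_doc(authors_d, N_d, theta, phi, rng=None):
    """Sample one document under the Author-Topic model.

    authors_d : list of author ids for this document (a_d)
    theta     : (A, K) per-author topic distributions
    phi       : (K, V) per-topic word distributions
    """
    rng = rng or np.random.default_rng()
    K, V = phi.shape
    x = rng.choice(authors_d, size=N_d)                     # author per token, uniform over a_d
    z = np.array([rng.choice(K, p=theta[a]) for a in x])    # z_n ~ Mult(theta_x)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])      # w_n ~ Mult(phi_{z_n})
    return x, z, w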
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Learning: Gibbs sampling
[Plate diagram repeated from the previous slide]
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Perplexity results
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Topic-Author visualization
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Application 1: Author similarity
Author-Topic Model for Scientific Literature
[Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Application 2: Author entropy
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Learning: Gibbs sampling
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Datasets
– Enron email data
• 23,488 messages between 147 users
– McCallum’s personal email
• 23,488(?) messages with 128 authors
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Topic visualization: Enron dataset
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Topic visualization: McCallum's data
Author-Topic-Recipient model for email data
[McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Modeling Citation Influences
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Copycat model
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence model
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence graph for LDA paper
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Words in the LDA paper assigned to citations
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Performance evaluation
– Data:
• 22 seed papers and 132 cited papers
• Users labeled citations on a scale of 1-4
– Models considered:
  • Citation influence model
  • Copycat model
  • LDA-JS-divergence
    – Symmetric (Jensen-Shannon) divergence in topic space
  • LDA-post
  • PageRank
  • TF-IDF
– Evaluation measure:
  • Area under the ROC curve
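For reference, a small sketch of the Jensen-Shannon divergence used by the LDA-JS baseline to compare two documents' topic distributions (a standard formulation, not taken from the paper's code):

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions p and q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))   # KL divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)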
Modeling Citation Influences
[Dietz, Bickel, Scheffer, ICML 2007]
• Results
Mixed membership Stochastic Block models
[Work In Progress]
• A complete generative model for text and
citations
• Can model the topicality of citations
– Topic Specific PageRank
• Can also predict citations between unseen
documents
Summary
• Topic Modeling is an interesting, new
framework for community analysis
– Sound theoretical basis
– Completely unsupervised
– Simultaneous modeling of multiple fields
– Discovers “soft”-communities and clusters in
terms of “topic” membership
– Can also be used for predictive purposes