
Discovering Coherent Topics Using General Knowledge
Zhiyuan (Brett) Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, Riddhiman Ghosh
http://www.cs.uic.edu/~zchen/
Topic Model
[Diagram: documents 1..M are generated from topics 1..T by the topic model]
Coherent Topics
A coherent topic: {Price, Cheap, Expensive, Cost, Money, Pricey, Dollar}
An incoherent topic: {Price, Family, Cheap, Expensive, Politics, Cost, Size}
Issues of Unsupervised Topic Models
Many topics are not coherent.
Objective functions do not correlate well with human judgments (Chang et al., 2009).
Remedy: Knowledge-based Topic Models
Knowledge-based Topic Models
DF-LDA (Andrzejewski et al., 2009): Must-Link {Picture, Photo}, Cannot-Link {Picture, Price}
Seeded models (Burns et al., 2012; Jagarlamudi et al., 2012; Lu et al., 2011; Mukherjee and Liu, 2012)
Knowledge Assumptions
Knowledge is correct for a domain.
Knowledge is domain dependent.
Existing Model Flow
Our Proposed Model Flow
General Knowledge
Domain independent
May be wrong for a particular domain
Lexical Semantic Relations:
Synonyms {Expensive, Pricey} (WordNet)
Antonyms {Expensive, Cheap} (WordNet)
Adjective-attribute {Expensive, Price} (Fei et al., 2012)
LR-Sets (Lexical Relation Sets)
Example: {Expensive, Pricey, Cheap, Price}
Words in the same LR-set should belong to the same topic.
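A minimal sketch of how such LR-sets could be gathered from WordNet with NLTK; the synonym/antonym traversal below is an illustrative assumption, not the authors' exact extraction code, and it does not cover the adjective-attribute relations of Fei et al. (2012).

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def lr_sets_for(word):
    """Candidate LR-sets for `word` built from WordNet synonyms and antonyms."""
    sets = []
    for synset in wn.synsets(word):
        lemmas = {l.name().replace('_', ' ') for l in synset.lemmas()}
        antonyms = {a.name().replace('_', ' ')
                    for l in synset.lemmas() for a in l.antonyms()}
        candidate = lemmas | antonyms
        if len(candidate) > 1:           # keep only sets relating more than one word
            sets.append(candidate)
    return sets

# e.g. lr_sets_for('expensive') yields a set containing 'expensive' and 'cheap'
```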
Issues of LR-Sets
No correct LR-set for a word: e.g., "Card" appears in both {Card, Menu} and {Card, Bill}, and neither set may be correct for a given domain.
Partially wrong knowledge: e.g., in {Picture, Pic, Flick}, part of the set may not fit the domain.
Addressing Issues
No correct LR-set for a word → relaxing wrong LR-sets for the word
Partially wrong knowledge → word correlation + GPU (Generalized Pólya Urn)
Relaxing Wrong LR-sets
{Card, Menu} and {Card, Bill} are both relaxed to {Card}.
Estimate Knowledge
{Picture, Image}
{Picture, Painting}
Word Distributions From LDA

Word         Prob.
Picture      0.20
Image        0.15
Photo        0.12
Quality      0.10
Resolution   0.05
…            …
Painting     0.0002
Estimate Word Correlation
From the topic-word distribution above, estimate how well each candidate set holds:
{Picture, Image}
{Picture, Painting}
Word Correlation Matrix C
{Picture, Image}: 0.15 / 0.20 = 0.75
{Picture, Painting}: 0.0002 / 0.20 = 0.001
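A minimal sketch of building such a correlation matrix from an LDA topic-word distribution, assuming the correlation of two words under a topic is the ratio of the smaller to the larger topic-word probability (matching the 0.15 / 0.20 example); the function name and the epsilon guard are illustrative.

```python
import numpy as np

def word_correlation_matrix(phi):
    """phi: (n_topics, n_words) topic-word probabilities from LDA.
    Returns C[t, i, j]: correlation of words i and j under topic t,
    taken here as min(p_i, p_j) / max(p_i, p_j)."""
    n_topics, n_words = phi.shape
    C = np.zeros((n_topics, n_words, n_words))
    for t in range(n_topics):
        p = phi[t]
        C[t] = np.minimum.outer(p, p) / (np.maximum.outer(p, p) + 1e-12)
    return C

# e.g. with P(picture) = 0.20 and P(image) = 0.15 under one topic,
# C[t, picture, image] = 0.15 / 0.20 = 0.75
```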
Quality Q(s, w) of LR-set s Towards Word w
Relaxing Wrong LR-sets
{Card, Menu}: Q(s1, "Card") < ε
{Card, Bill}: Q(s2, "Card") < ε
Both sets fall below the threshold, so they are relaxed to {Card} for the word "Card".
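A minimal sketch of this relaxation step, assuming Q(s, w) is the average correlation between w and the other words in s; the helper names and the threshold value are illustrative.

```python
def lr_set_quality(s, w, corr):
    """Q(s, w): average correlation between w and the other words of s
    (an illustrative stand-in for the slides' quality measure)."""
    others = [u for u in s if u != w]
    return sum(corr[w][u] for u in others) / len(others)

def relax_lr_sets(lr_sets, w, corr, epsilon=0.1):
    """Drop every LR-set whose quality toward w is below epsilon;
    if none survives, fall back to the singleton {w}."""
    kept = [s for s in lr_sets if lr_set_quality(s, w, corr) >= epsilon]
    return kept if kept else [{w}]

corr = {"card": {"menu": 0.02, "bill": 0.03}}
relax_lr_sets([{"card", "menu"}, {"card", "bill"}], "card", corr)
# -> [{'card'}]
```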
Addressing Issues
No correct LR-set for a word → relaxing wrong LR-sets for the word
Partially wrong knowledge → word correlation + GPU
Simple Pólya Urn Model (SPU)
When a ball of some color is drawn from the urn, it is put back together with one more ball of the same color: the rich get richer!
Interpreting LDA Under SPU
Drawing the word "picture" from Topic 0 returns it with one more "picture" ball, so only P(picture | Topic 0) grows.
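A minimal sketch of the SPU dynamics described above (the urn is a word-count dictionary; names are illustrative).

```python
import random

def spu_draw(urn):
    """Simple Pólya urn step: draw a word proportionally to its count,
    then put it back together with one extra ball of the same word."""
    words = list(urn)
    weights = [urn[w] for w in words]
    drawn = random.choices(words, weights=weights, k=1)[0]
    urn[drawn] += 1          # the rich get richer
    return drawn

# Topic 0 as an urn over words: drawing "picture" makes "picture" more likely next time.
topic0 = {"picture": 1, "image": 1, "painting": 1}
spu_draw(topic0)
```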
Generalized Pólya Urn Model (GPU)
When a ball of some color is drawn, it is put back together with additional balls of other, correlated colors.
Applying GPU
Drawing "picture" from Topic 0 puts back "picture" plus extra "image" and "painting" balls, in amounts weighted by the word correlation matrix.
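A minimal sketch of the GPU step above, assuming the promoted counts are proportional to the word-correlation values (names and numbers are illustrative).

```python
import random

def gpu_draw(urn, corr):
    """Generalized Pólya urn step: draw a word proportionally to its count,
    put it back, and also promote correlated words by their correlation."""
    words = list(urn)
    weights = [urn[w] for w in words]
    drawn = random.choices(words, weights=weights, k=1)[0]
    urn[drawn] += 1
    for other, c in corr.get(drawn, {}).items():
        if other in urn:
            urn[other] += c   # drawing "picture" also boosts "image" a little
    return drawn

topic0 = {"picture": 1, "image": 1, "painting": 1}
corr = {"picture": {"image": 0.75, "painting": 0.001}}
gpu_draw(topic0, corr)
```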
Addressing Issues
No correct LR-set for a word → relaxing wrong LR-sets for the word
Partially wrong knowledge → word correlation + GPU
Evaluation
Four domains
KL-Divergence
Topic Coherence
Human Evaluation
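A minimal sketch of the topic coherence measure of Mimno et al. (2011), which scores a topic's top words by their document co-occurrence; the argument names are illustrative.

```python
import math

def topic_coherence(top_words, doc_freq, co_doc_freq):
    """Coherence of one topic: sum over ordered pairs of its top words of
    log((D(w_i, w_j) + 1) / D(w_j)), following Mimno et al. (2011).
    doc_freq[w] = number of documents containing w;
    co_doc_freq[(w_i, w_j)] = number of documents containing both."""
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((co_doc_freq.get((wi, wj), 0) + 1) / doc_freq[wj])
    return score
```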
Model Comparison
LDA (Blei et al., 2003)
LDA-GPU (Mimno et al., 2011)
DF-LDA (Andrzejewski et al., 2009)
MDK-LDA (Chen et al., 2013)
GK-LDA (this work)
KL-Divergence
Topic Coherence (#T = 15)
Human Evaluation
Example Topics
Conclusions
Discovering coherent topics using general knowledge:
No correct LR-set for a word → relaxing wrong LR-sets for the word
Partially wrong knowledge → word correlation + GPU
Datasets: http://www.cs.uic.edu/~zchen/