How to do your own topic modeling - YDC2

How to do your own topic modeling
David Newman
University of California, Irvine
TwTT session
Yale University
Tuesday, October 4, 2011

Acknowledgements
● We gratefully acknowledge support from IMLS
● Topic-modeling-tool is based on Mallet (UMass)
● Arun Balagopalan developed the topic-modeling-tool

(from Blei)

(from Blei)

Topic Modeling Tool
● Simple-to-use graphical user interface for doing topic modeling
  – Simple interface
  – csv and html output
  – Designed for non-technical users
● Openly available at Google Code:
  – http://code.google.com/p/topic-modeling-tool/

Why would you want to topic model?
● Collection management
  – Say you are managing a number of collections. Topic model each collection to describe what is in that collection
● Subject metadata enhancement
  – Say you have a collection of heterogeneous text content, with partial and/or variable subject tagging. A topic model could provide uniform subject tags across this collection
● Other reasons? (Let's hear from you!)

Input to Topic Model is Easy
● Input the document collection
  – Folder containing collection of .txt files, or
  – Single .txt file, one document per line
  – (let D denote number of documents in collection)
● Input number of topics
  – (let T denote number of topics)

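If your collection is a folder of .txt files and you would rather supply the single-file form, the conversion can be sketched as below. This is only an illustration: the function name `folder_to_single_file` and its arguments are made up here and are not part of the tool.

```python
from pathlib import Path


def folder_to_single_file(folder, out_path):
    """Write each .txt file in `folder` as one line of `out_path`; return D."""
    out = Path(out_path)
    # List the inputs first, skipping the output file in case it lives in the same folder.
    inputs = sorted(p for p in Path(folder).glob("*.txt") if p.resolve() != out.resolve())
    with out.open("w", encoding="utf-8") as f:
        for path in inputs:
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse newlines and repeated spaces so the document fits on one line.
            f.write(" ".join(text.split()) + "\n")
    return len(inputs)
```

Calling `folder_to_single_file("my_collection", "all_docs.txt")` (both names hypothetical) returns D, the number of documents written.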
Let's try it
● Download TopicModelingTool.jar
● Select text collection for topic modeling
  – (The Google Code site has sample text collections for demonstration purposes)
● Hit the button!

Let's look at the output
● file:///home/newman/topic-modeling-tool/economy/try1_
● Output files
  – List of T topics
  – List of topics in each of D documents
  – List of top-ranked documents in each of T topics
● Output file formats
  – html (view in browser)
  – csv (load in Excel)

What can I control?
● Number of topics, T
● List of stopwords (the, and, ...)
● Whether to preserve case
● Number of words printed per topic (default = 10)
● Threshold for tagging a document with a topic (e.g. 5%, 10%, etc.)
● Number of iterations

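To make the tagging threshold concrete, here is a minimal sketch of how a threshold could turn per-document topic proportions into tags. The function name, the proportions, and the choice of "at least" rather than "strictly greater than" are all assumptions for illustration; the tool may handle the boundary differently.

```python
from typing import Dict, List


def tag_document(topic_proportions: Dict[str, float], threshold: float = 0.05) -> List[str]:
    """Return the topics whose share of the document is at least `threshold`, largest first."""
    kept = [(topic, share) for topic, share in topic_proportions.items() if share >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [topic for topic, _ in kept]


# Hypothetical topic proportions for one document.
doc = {"topic-8": 0.62, "topic-12": 0.07, "topic-3": 0.04}
print(tag_document(doc, threshold=0.05))  # ['topic-8', 'topic-12']
print(tag_document(doc, threshold=0.10))  # ['topic-8']
```

Raising the threshold gives fewer, more confident tags per document; lowering it gives more tags, including weaker ones.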
What can I control?
● Number of topics, T
  – T = 10 topics is default
  – Suggestions
    ● for D=1,000-10,000 ... set T = 10-40
    ● for D=10,000-100,000 ... set T = 50-200
    ● for D=100,000-1,000,000 ... set T = 250-1000
  – first tried T = 20 topics
  – now try T = 40 topics
● file:///home/newman/topic-modeling-tool/economy/try2_

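The suggestions above amount to a small lookup table. A sketch, assuming a hypothetical helper `suggest_num_topics` and assuming that a value of D falling exactly on a shared boundary (e.g. 10,000) takes the smaller range:

```python
# (min D, max D, min T, max T), copied from the suggestions on the slide.
SUGGESTED_T = [
    (1_000, 10_000, 10, 40),
    (10_000, 100_000, 50, 200),
    (100_000, 1_000_000, 250, 1000),
]


def suggest_num_topics(num_docs):
    """Return a (low, high) range for T, or None if D is outside the listed ranges."""
    for d_min, d_max, t_min, t_max in SUGGESTED_T:
        # Boundaries overlap on the slide; the first matching row wins (an assumption).
        if d_min <= num_docs <= d_max:
            return (t_min, t_max)
    return None


print(suggest_num_topics(5_000))   # (10, 40)
print(suggest_num_topics(50_000))  # (50, 200)
```

These are starting points only; trying a couple of values of T and comparing the topics, as on the next slide, is still worthwhile.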
Compare T=20 and T=40 topics
● T=20 topics
  – topic-17: bush japan president states united world china north country korea
● T=40 topics
  – topic-12: bush president china states chinese people united friday day nation
  – topic-8: japan bush korea north world japanese united south states economic

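One rough way to see how the T=20 topic splits into the two T=40 topics is to count the top words they share. The sketch below uses the word lists from the slide; the helper names and the use of Jaccard overlap are illustrative choices, not something the tool reports.

```python
def shared_words(topic_a, topic_b):
    """Words that appear in the top-word lists of both topics."""
    return set(topic_a.split()) & set(topic_b.split())


def jaccard(topic_a, topic_b):
    """Shared words divided by all distinct words across the two topics."""
    words_a, words_b = set(topic_a.split()), set(topic_b.split())
    return len(words_a & words_b) / len(words_a | words_b)


t20_topic17 = "bush japan president states united world china north country korea"
t40_topic12 = "bush president china states chinese people united friday day nation"
t40_topic8 = "japan bush korea north world japanese united south states economic"

for name, topic in [("topic-12", t40_topic12), ("topic-8", t40_topic8)]:
    overlap = sorted(shared_words(t20_topic17, topic))
    print(name, len(overlap), overlap, round(jaccard(t20_topic17, topic), 2))
```

The T=20 topic shares five of its ten words with topic-12 and seven with topic-8, which matches the impression that one broad topic has separated into a China-leaning and a Japan/Korea-leaning topic.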
How does it work?
● Topic model assumes documents exhibit multiple topics
● Topic model learns from patterns of words that tend to co-occur within documents, e.g. say we see many documents mentioning these words:
  – health, insurance, patient, ...
  – health, care, doctor ...
  – prescription, drugs ...
  – medicare, medicaid, benefit ...
● Then we might get a topic that looks like this:
  – health care insurance medicare drug medicaid prescription ...

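The co-occurrence signal itself is easy to inspect. The sketch below is not the topic model; it only counts how many documents contain each pair of words, which is the kind of pattern the model draws on. The toy documents are made up from the words on the slide, and the last one is invented so that some pairs repeat.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for doc in docs:
        # Sort the distinct words so each pair is always stored in the same order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Toy documents (hypothetical).
docs = [
    "health insurance patient",
    "health care doctor",
    "prescription drugs",
    "medicare medicaid benefit",
    "health care insurance medicare",
]

counts = cooccurrence_counts(docs)
print(counts[("care", "health")])       # 2
print(counts[("health", "insurance")])  # 2
print(counts[("drugs", "health")])      # 0
```

Words that keep turning up in the same documents, such as "health" with "care" and "insurance", are the ones a topic model tends to group into a single topic.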
(from Blei)

(from Blei)

What can I control?
● List of stopwords (the, and, ...)

What can I control?
● List of stopwords (the, and, ...)
● Oops ... try again ... mystopwords2.txt

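What a stopword list and the preserve-case option do to the text can be sketched as follows. This is an illustration under assumptions: the stopword file is taken to be plain text with whitespace-separated words, stopwords are matched case-insensitively, and the function names are made up. The tool's own preprocessing may differ.

```python
import re


def load_stopwords(path):
    """Read a stopword list, assuming whitespace-separated words in a plain-text file."""
    with open(path, encoding="utf-8") as f:
        return {word.lower() for word in f.read().split()}


def tokenize(text, stopwords, preserve_case=False):
    """Extract runs of letters, optionally lowercase them, and drop stopwords."""
    if not preserve_case:
        text = text.lower()
    words = re.findall(r"[^\W\d_]+", text)  # letters only: no digits, punctuation, or underscores
    return [w for w in words if w.lower() not in stopwords]


stopwords = {"the", "and"}
print(tokenize("The economy and the Bush plan", stopwords))
# ['economy', 'bush', 'plan']
print(tokenize("The economy and the Bush plan", stopwords, preserve_case=True))
# ['economy', 'Bush', 'plan']
```

If uninformative words still dominate the topics after a run, add them to the stopword file and run again, as the "try again" step on the slide suggests.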
Other things to try
● Preserve case
● Threshold for tagging a document with a topic
● Number of topic words printed

How long does it take?
● Should take minutes, hours, or days, depending on the size of the topic model
● Proportional to:
  – Total number of words in collection (or documents in collection)
  – Number of topics, T
  – Number of iterations

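The proportionality above can be used for back-of-the-envelope comparisons between runs. A sketch, assuming the three factors simply multiply; the function name and all numbers are hypothetical, and the result is a unitless relative cost, not a time.

```python
def relative_cost(total_words, num_topics, num_iterations):
    """Unitless cost estimate: proportional to words x topics x iterations."""
    return total_words * num_topics * num_iterations


baseline = relative_cost(total_words=1_000_000, num_topics=20, num_iterations=200)
more_topics = relative_cost(total_words=1_000_000, num_topics=40, num_iterations=200)
print(more_topics / baseline)  # 2.0
```

Under this assumption, going from T = 20 to T = 40 on the same collection roughly doubles the running time, as would doubling the number of iterations.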
What next?
● What do I do with my topic model?
● html good for viewing, re-running topic model, spot-checking
● csv good for tagging, subject metadata enhancement, subject indexing, etc.

Applications
● National Science Foundation
● National Institutes of Health
● Other

National Science Foundation

National Institutes of Health

FAQs
● Can I topic model a single text document?
● How could I topic model a long book?
● Is learning T topics in fact learning the first T topics (if I learn 10 topics, then later learn 20 topics, are the first 10 of the 20-topic model the same as the 10-topic model)?
● What if my topics are too general?
● What if my topics are indistinguishable?
● Do topics have an order?
● How do I set topic model parameters (e.g. number of iterations)?

Help! It's not working!
● We're here to help; we want to make this a usable tool
● Send help requests to: [email protected]