How to do your own topic modeling
David Newman
University of California, Irvine
TwTT session, Yale University
Tuesday October 4, 2011

Acknowledgements
● We gratefully acknowledge support from IMLS
● Topic-modeling-tool is based off of Mallet (UMass)
● Arun Balagopalan developed the topic-modeling-tool

(Two figure slides, from Blei)

Topic Modeling Tool
● Simple-to-use graphical user interface for doing topic modeling
  – Simple interface
  – csv and html output
  – Designed for non-technical users
● Openly available at Google Code:
  – http://code.google.com/p/topic-modeling-tool/

Why would you want to topic model?
● Collection management
  – Say you are managing a number of collections. Topic model each collection to describe what is in that collection
● Subject metadata enhancement
  – Say you have a collection of heterogeneous text content, with partial and/or variable subject tagging. Topic model could provide uniform subject tags across this collection
● Other reasons? (Let's hear from you!)

Input to Topic Model is Easy
● Input the document collection
  – Folder containing collection of .txt files, or
  – Single .txt file, one document per line
  – (let D denote # of documents in collection)
● Input number of topics
  – (let T denote number of topics)

Let's try it
● Download TopicModelingTool.jar
● Select text collection for topic modeling
  – (The google code site has sample text collections for demonstration purposes)
● Hit the button!

Let's look at the output
● file:///home/newman/topic-modeling-tool/economy/try1_
● Output files
  – List of T topics
  – List of topics in each of D documents
  – List of top-ranked documents in each of T topics
● Output file formats
  – html (view in browser)
  – csv (load in Excel)

What can I control?
● Number of topics, T
● List of stopwords (the, and, ...)
● Whether to preserve case
● Number of words printed per topic (default = 10)
● Threshold for tagging a document with a topic (e.g. 5%, 10%, etc)
● Number of iterations

What can I control?
● Number of topics, T
  – T = 10 topics is default
  – Suggestions
    ● for D=1,000-10,000 ... set T = 10-40
    ● for D=10,000-100,000 ... set T = 50-200
    ● for D=100,000-1,000,000 ... set T = 250-1000
  – first tried T = 20 topics
  – now try T = 40 topics
● file:///home/newman/topic-modeling-tool/economy/try2_

Compare T=20 and T=40 topics
● T=20 topics
  – topic-17: bush japan president states united world china north country korea
● T=40 topics
  – topic-12: bush president china states chinese people united friday day nation
  – topic-8: japan bush korea north world japanese united south states economic

How does it work?
● Topic model assumes documents exhibit multiple topics
● Topic model learns from patterns of words that tend to co-occur within documents, e.g. say we see many documents mentioning these words:
  – health, insurance, patient, ...
  – health, care, doctor ...
  – prescription, drugs ...
  – medicare, medicaid, benefit ...
● Then might get topic looking like this:
  – health care insurance medicare drug medicaid prescription ...

(Two figure slides, from Blei)

What can I control?
● List of stopwords (the, and, ...)
● Oops ... try again ... mystopwords2.txt

Other things to try
● Preserve case
● Threshold for tagging a document with a topic
● Number of topic words printed

How long does it take?
● Should take minutes/hours/days depending on size of topic model
● Proportional to:
  – Total number of words in collection (or documents in collection)
  – Number of topics, T
  – Number of iterations

What next?
● What do I do with my topic model?
● html good for viewing, re-running topic model, spot-checking
● csv good for tagging, subject metadata enhancement, subject indexing, etc.
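
As an illustration of using the csv output for tagging, here is a minimal Python sketch that keeps, for each document, only the topics whose share of the document meets a threshold (10% here, one of the example values mentioned earlier). The column names `doc`, `topic`, and `proportion`, the function name `tag_documents`, and the sample rows are all assumptions made up for this example; the csv files the tool actually writes may be laid out differently, so adjust the column names to match.

```python
import csv
import io

# Hypothetical layout: one row per (document, topic) pair, with the topic's
# share of that document. The tool's real csv columns may differ.
# These rows are invented purely for illustration.
SAMPLE_CSV = """doc,topic,proportion
doc1.txt,topic-17,0.42
doc1.txt,topic-3,0.07
doc1.txt,topic-8,0.12
doc2.txt,topic-3,0.55
doc2.txt,topic-17,0.04
"""


def tag_documents(csv_text, threshold=0.10):
    """Return {doc: [topics]} keeping topics with proportion >= threshold."""
    tags = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Make sure every document appears, even if it ends up with no tags.
        tags.setdefault(row["doc"], [])
        if float(row["proportion"]) >= threshold:
            tags[row["doc"]].append(row["topic"])
    return tags


if __name__ == "__main__":
    for doc, topics in tag_documents(SAMPLE_CSV).items():
        print(doc, topics)
```

Lowering the threshold (say to 5%) tags each document with more topics; raising it gives fewer, more confident tags.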
Applications
● National Science Foundation
● National Institutes of Health
● Other

National Science Foundation (two figure slides)

National Institutes of Health (three figure slides)

FAQs
● Can I topic model a single text document?
● How could I topic model a long book?
● Is learning T topics in fact learning the first-T topics (if I learn 10 topics, then later learn 20 topics, are the first 10 of the 20-topic-model the same as the 10-topic model)?
● What if my topics are too general?
● What if my topics are indistinguishable?
● Do topics have an order?
● How to set topic model parameters (number of iterations)?

Help! It's not working!
● We're here to help, we want to make this a usable tool
● Send help requests to: [email protected]
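
On the FAQ about topic modeling a long book: the slides pose the question without answering it. One common approach, offered here as an assumption and not as the presenter's recommendation, is to split the book into fixed-size chunks and treat each chunk as a document. The Python sketch below writes the chunks in the "single .txt file, one document per line" input format described earlier. The function names and the 500-word chunk size are hypothetical choices for illustration.

```python
def split_into_chunks(text, words_per_chunk=500):
    """Split text into consecutive chunks of at most words_per_chunk words.

    str.split() with no arguments also collapses newlines, so each chunk
    comes back as a single line of space-separated words.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def write_one_document_per_line(chunks, path):
    """Write chunks to a .txt file, one chunk (document) per line."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk + "\n")


if __name__ == "__main__":
    # Tiny made-up text standing in for a book.
    book = "call me ishmael some years ago\nnever mind how long precisely"
    chunks = split_into_chunks(book, words_per_chunk=4)
    write_one_document_per_line(chunks, "book_chunks.txt")
    print(len(chunks), "documents written")
```

With this layout, D is the number of chunks, so the chunk size also affects which range of T is sensible.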