Poster - David Uthus

Social media == a new source of information and a ground for social interaction
Twitter:
Noisy and content-sparse data
Question:
Can we carve out fine-grained topics within these micro-documents, e.g. topics such as “food”, “movies”, “music”, etc.?
Ritter et al. (2010) uncover dialogue acts such as “Status”, “Question to Followers”, “Comment”, and “Reaction” within Twitter posts via unsupervised modeling.
Ramage et al. (2010) cluster a set of Twitter
conversations using supervised LDA.
Main finding: Twitter is 11% substance, 5% status,
16% style, 10% social, and 56% other.
1,119 posts were selected from a Twitter corpus, with the following topic distribution:
Other 321; Food 276; Music 153; Movies 141; Relationships 66; Computers 57; Sickness 39; Sports 34; Travel 32
The topic labels were not included in the data.
Latent Dirichlet Allocation (Blei et al., 2003) was used to model topics
Each document == a mixture of topics
Topic == a probability distribution over words
Intuition behind LDA == words that co-occur across documents tend to be related in meaning and indicate the topic of the document.
Example: if “café” and “food” co-occur in multiple documents, they are likely related to the same topic.
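As a rough illustration of this intuition, here is a minimal sketch using the gensim library (the poster does not name its toolkit); the toy posts below are invented for illustration:

from gensim import corpora, models

# Toy tokenized posts, invented for illustration
posts = [["cafe", "food", "eat"],
         ["pizza", "food", "cafe"],
         ["movie", "watch", "film"],
         ["film", "actor", "movie"]]

dictionary = corpora.Dictionary(posts)            # word <-> id mapping
bow = [dictionary.doc2bow(p) for p in posts]      # bag-of-words vectors

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=50, random_state=1)

# Co-occurring words ("cafe", "food", "pizza") should receive high probability
# under the same topic; likewise for the movie-related words.
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])

# A new post is then a mixture over the learned topics.
print(lda.get_document_topics(dictionary.doc2bow(["food", "cafe"])))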
Individual posts were broken into words, and a random topic between 0 and 8 was assigned to each word.
Only nouns and verbs were kept in the data.
A tweet such as (1) is then represented as in (2):
(1) “hey, whats going on lets get some food in this
café”
(2) [(‘food’, t1), (‘cafe’, t2)]
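A sketch of this preprocessing step, assuming NLTK for tokenization and POS tagging (the poster does not name the tools used); the random assignment mirrors the t0-t8 initialization:

import random
import nltk  # assumes the 'punkt' tokenizer and POS tagger models are installed

NUM_TOPICS = 9  # topics t0..t8

def preprocess(post):
    """Keep only nouns and verbs and pair each word with a random initial topic."""
    tokens = nltk.word_tokenize(post.lower())
    tagged = nltk.pos_tag(tokens)
    content_words = [w for w, tag in tagged
                     if tag.startswith("NN") or tag.startswith("VB")]
    return [(w, random.randrange(NUM_TOPICS)) for w in content_words]

# Output varies with the tagger and the random seed,
# e.g. roughly [('food', 4), ('cafe', 1), ...] for tweet (1).
print(preprocess("hey, whats going on lets get some food in this cafe"))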
Clusters from the unpadded data (per-cluster topic distribution):
t0 music 67%; movies 33%
t1 other 50%; travel 12.5%; sports 12.5%; food 25%
t2 music 50%; relationships 25%; sickness 12.5%; computers 12.5%
t3 food 80%; movies 20%
t4 food 74%; music 11%; sickness 10%; other 5%
t5 music 67%; other 14%; movies 9%; food 5%; sports 5%
t6 food 50%; other 14%; sickness 14%; computers 14%; movies 8%
t7 food 70%; music 20%; relationships 10%
t8 food 45%; travel 28%; relationships 9%; other 9%; computers 9%
Results: tweets on the same topic were scattered across clusters, and clusters contained various unrelated tweets.
Problem: tweets on the same topic had no words in common
Solution:
A ‘thesaurus’ of words related to the topics in the posts was created
The most generic words from the thesaurus were selected as “pads”
Sample thesaurus entry:
(3) food_topic [“pizza”, “café”, “food”, “eat”, ...]
food topic padding set: [“food”, “café”]
Posts shorter than 3 words were padded with 2-3 words from the relevant padding set (see the sketch below)
Padding ensured that posts on the same topic share at least some words
Unpadded posts:
(4) “café eat” (5) “food”
Padded posts:
(4’) “café food eat food” (5’) “café food food”.
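A minimal sketch of the padding step; the padding-set dictionary and the exact choice of 2-3 pad words are assumptions based on example (3) above:

import random

# Padding set for the food topic, following example (3); other topics analogous.
PADDING_SETS = {"food": ["food", "café"]}

def pad_post(tokens, topic, min_len=3):
    """Pad posts shorter than min_len words with 2-3 generic words for their topic."""
    if len(tokens) >= min_len:
        return list(tokens)
    n_pad = random.choice([2, 3])  # the poster says 2-3 pad words
    pads = PADDING_SETS[topic]
    return list(tokens) + [random.choice(pads) for _ in range(n_pad)]

print(pad_post(["café", "eat"], "food"))  # e.g. ['café', 'eat', 'food', 'food'], cf. (4')
print(pad_post(["food"], "food"))         # e.g. ['food', 'café', 'food'], cf. (5')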
Clusters from the padded data (per-cluster topic distribution):
t0 music 77%; food 9%; computers 4.54%; sports 4.54%; travel 4.54%
t1 movies 80%; relationships 3%; sickness 10%; music 5%; food 2%
t2 food 69%; movies 16%; music 5%; sickness 5%; sports 5%
t3 food 93%; sickness 4%; other 3%; movies 1%
t4 other 65%; travel 29%; movies 6%
t5 sickness 100%
t6 music 46%; other 24%; sports 20%; computers 10%
t7 food 49%; computers 22%; other 20%; music 8%; movies 1%
t8 relationships 55%; food 23%; music 10%; sickness 6%; movies 6%
5 out of 9 clusters contained at least one strongly
dominant semantic topic (over 60%)
6 topics emerge from the padded data clustering:
food, movies, music, other, sickness, and
relationships.
Food is distributed across 3 clusters and is the
dominant topic in all of them
The most cohesive clusters are t5, t3, t2, t1, and t0;
they have a dominant semantic topic – one that
occurs in over 60% of the posts within the cluster
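The 60% cohesion criterion can be checked mechanically; a small sketch, assuming each cluster is represented as the list of gold topic labels of its posts:

from collections import Counter

def dominant_topic(cluster_labels, threshold=0.6):
    """Return (topic, share) if a single topic covers more than `threshold` of the cluster."""
    topic, count = Counter(cluster_labels).most_common(1)[0]
    share = count / len(cluster_labels)
    return (topic, share) if share > threshold else None

# Hypothetical cluster where 70% of the posts are labeled "food"
cluster = ["food"] * 7 + ["music"] * 2 + ["other"]
print(dominant_topic(cluster))  # ('food', 0.7)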
Differences across the two data sets:
- Non-padded data has no pure clusters and no clusters where one topic occurs more than 80% of the time
- 3 clusters found in the non-padded data: music, food, and other
- 6 distinct clusters found in the padded data: music, movies, food, relationships, other, and sickness
Similarities across the padded and the non-padded data sets:
- Movie posts were given the same topic ID as food posts
- Food, music, and movies were distributed across several topic assignments
Information sparsity has a strong negative effect on
topic clustering.
Topic clustering is improved when:
a) only nouns and verbs are used in posts
b) posts are padded with similar words
Future investigation:
Can padding help discover topics in large sets of
tweets?
Thank you!

References:
Barzilay, R. and Lee, L. 2004. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proceedings of HLT-NAACL 2004, 113-120. Stroudsburg, PA.
Blei, D., Ng, A., and Jordan, M. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993-1022.
Griffiths, T. and Steyvers, M. 2003. Prediction and Semantic Association. In Advances in Neural Information Processing Systems 15, 11-18.
Griffiths, T. and Steyvers, M. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (Suppl. 1): 5228-5235.
Griffiths, T. and Steyvers, M. 2007. Probabilistic Topic Models. In Landauer, T., McNamara, D., Dennis, S., and Kintsch, W. (eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Hofmann, T. 1999. Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 289-296. Stockholm, Sweden: Morgan Kaufmann.
Hofmann, T. 2001. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1): 177-196.
Ramage, D., Dumais, S., and Liebling, D. 2010. Characterizing Microblogs with Topic Models. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). AAAI.
Ritter, A., Cherry, C., and Dolan, B. 2010. Unsupervised Modeling of Twitter Conversations. In Proceedings of HLT-NAACL 2010. Stroudsburg, PA.
Ritter, A. et al. 2011. Status Messages: A Unique Textual Source of Realtime and Social Information. Presented at the Natural Language Seminar, Information Sciences Institute, USC.
Yao, L., Mimno, D., and McCallum, A. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 937-946.