Using Topic Modeling and Text Embeddings to Predict Deleted Tweets

Peter Potash†*, Eric Bell†, Joshua Harrison†
† Pacific Northwest National Laboratory, Richland, WA
* University of Massachusetts Lowell, Dept. of Computer Science, Lowell, MA
[email protected]
Abstract
Predictive models for tweet deletion have been a relatively unexplored area of Twitter-related computational research. We first approach the deletion of tweets as a spam detection problem, applying a small set of hand-crafted features to improve upon the current state-of-the-art in predicting deleted tweets. Next, we apply our approach to a dataset of deleted tweets that better reflects the current deletion rate. Since tweets are deleted for reasons beyond just the presence of spam, we apply topic modeling and text embeddings in order to capture the semantic content of tweets that can lead to tweet deletion. Our goal is to create an effective model that has a low-dimensional feature space and is also language-independent. A lean model is computationally advantageous for processing high volumes of Twitter data, which can reach 9,885 tweets per second1. Our results show that a small set of spam-related features combined with word topics and character-level text embeddings provides the best f1 score when trained with a random forest model. The highest precision on the deleted tweet class is achieved by a modification of paragraph2vec that captures author identity.
1 Introduction
Despite the relative infancy of Twitter, there have been
numerous computational approaches to spam detection in
Twitter (Miller et al. 2014; Mccord and Chuah 2011; Guo
and Chen 2014; Lin and Huang 2013). According to the
Twitter rules2 , users may not use Twitter for the purposes
of spamming. Accounts that violate this rule may be suspended or terminated, and their tweets removed from Twitter
Search. Consequently, the presence of spam in a tweet is a
strong signal that a tweet could be deleted. However, spamming tweets are only a subset of all deleted tweets. Users
may choose to delete their own tweets for various reasons
(see Section 2.2 for a deeper analysis of motives for deleting
tweets).
As social media technology matures, users become more aware of the broader ramifications of their posts. One important consequence is how posts affect a user's online identity. For example, Facebook could potentially check whether photos uploaded to the site could elicit a negative reaction3. In a similar spirit, predicting tweet deletion will aid users in maintaining a positive online identity.

1 http://www.internetlivestats.com/one-second/#tweets-band
2 https://support.twitter.com/articles/18311
Aside from social factors, there is specific content in a tweet that may make the tweet a candidate for deletion. Therefore, to understand which tweets might be deleted, we need an effective feature representation of tweet content. Furthermore, we desire a model that is language-independent – a single model regardless of tweet language. This makes the use of language-specific tools and lexica unappealing. For practical purposes, we desire to represent tweet content through a compact feature representation, rendering the ubiquitous bag-of-words (BOW) model (bigrams, trigrams, etc.) an unattractive possibility. Another motivation for a compressed representation is that it is better suited for additional features and further development.
Topic modeling, specifically Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) and tailored variations, has been used to understand tweet content in various applications (Hong and Davison 2010; Godin et al. 2013; Zhao et al. 2011). Since LDA represents documents (tweets in our case) as a distribution over all possible topics, the dimensionality of a document with regard to LDA is equal to the number of topics. Thus, we can compress a vocabulary of arbitrary size into a small, fixed-size representation. Another, more recent method for semantic representation in a low-dimensional space is word2vec (Mikolov et al. 2013a; 2013b). The original algorithm creates fixed-length representations at the word level, and it has been extended to model entire documents (Le and Mikolov 2014). In terms of Twitter research, word representations have been used for such problems as hashtag recommendation (Tomar et al. 2014), syntax parsing (Ling et al. 2015), and the popular sentiment prediction task (Tang et al. 2014; Stojanovski et al. 2015). In our experiments, we seek to show that topic modeling and text embeddings can effectively capture the semantic qualities of a tweet that lead to deletion, and do so with only on the order of hundreds of features.
The rest of the paper is laid out as follows: in Section 2 we review related work that focuses on deleted tweets. In Section 3 we conduct initial experiments that test our small set of spam-related features, and compare our system to the current state-of-the-art in predicting deleted tweets. In Section 4 we give relevant background and describe how we apply topic modeling and text embeddings in our model for predicting deleted tweets. In Section 5 we test the use of topic modeling and text embeddings as features for predicting deleted tweets. Finally, in Section 6 we offer our conclusions and look toward future work.

3 http://www.businessinsider.com/facebook-could-check-whether-users-are-uploading-rude-photos-2015-11?r=UK&IR=T
2 Related Work

2.1 Predicting Deleted Tweets
Few works exist that attempt to build predictive models
for tweet deletion. Zhou et al. (2015) focus on a subset of
deleted tweets – regrettable tweets. These are tweets that
the authors believe to contain inappropriate content. Inappropriate content can range from vulgar language to sharing
private content such as a personal email address. Through
manual investigation, the authors identify ten major topics
including negative sentiment, cursing, and relationships that
are prevalent in regrettable tweets. The authors then exploit
WordNet and UrbanDictionary to create keyword lists related to the ten topics. Finally, using a combination of existing lexica and the topic keywords as features, the authors
build classifiers to test the accuracy of their model. The authors complement 700 manually labeled regrettable tweets
with 700 normal tweets to create their evaluation dataset.
The authors’ best performance from 10-fold cross-validation
was an f1 score of 0.85 using a J48 classifier (a Java implementation of the C4.5 algorithm (Quinlan 2014)) on a balanced dataset of regretful and non-regretful tweets.
The work that most closely resembles ours is that of Petrovic et al. (2013). In their work, the authors use bag-of-words, user ID, and various 'social' features (including the number of followers, whether the tweet is a retweet, and whether it is a reply) as features for their predictive model. The authors train a classifier on 68 million tweets and proceed to test on 7.5 million tweets. The balance of deleted versus non-deleted is organic in their dataset – tweets were collected for the month of January and marked as deleted if they were removed by the end of February. In the authors' dataset, only 3.2% of the tweets were deleted4. Due to the types of features and the size of the dataset, the feature space is over 47 million dimensions, and an SVM classifier took 8 hours to train. The SVM recorded an f1 score of 0.27, with user ID seen as the most important feature.
2.2 Analyzing Deleted Tweets
While not focusing specifically on deletion prediction, other works examine the phenomenon of tweet deletion. Almuhimedi et al. (2013) perform a large-scale analysis of tweet deletion. In their dataset of over 67 million tweets, only 2.4% of tweets were deleted; however, 50% of roughly 300,000 users had deleted a tweet. According to the authors, the attributes that most notably differentiate deleted from non-deleted tweets are "...clients used to post tweets, the locations of posts (determined via FourSquare), and the quantities of replies that they generated." The authors did not find a substantial correlation between deleted tweets and the presence of regrettable topics or even sentiment vocabulary. Furthermore, the authors state that "A substantial fraction of deleted tweets are deleted for superficial reasons. Together, typos, rephrasing, and spam account for 18% of deleted tweets." The authors' work illustrates the overall difficulty of finding distinct, generalizable patterns in deleted tweets.

4 The current deletion rate is 23%, a dramatic increase over the course of two years (see Section 5.1).
Sleeper et al. (2013) conducted an Amazon Mechanical Turk task to understand the types of regret users experience regarding content they have tweeted. Of the 474 responses (categorized using the regret categories from Knapp et al. (1986)), the most common cause for regret was revealing too much in the tweet (e.g., personal information or a secret), followed by direct criticism of a specific person. By contrast, participants rarely reported experiencing regret due to lying or 'behavioral edict' (telling someone to behave in a certain way). Of the participants who experienced regret due to a specific tweet, only 52% of the tweets were actually deleted. This further highlights the difficulty of deletion prediction: even if a tweet has cause for deletion, it may very well remain on Twitter.
3 Initial Experiments
The objective of our initial experiments is two-fold: 1) compare against the current state-of-the-art for predicting deleted tweets (Petrovic, Osborne, and Lavrenko 2013); and 2) determine the optimal classifiers for the task of deletion prediction in a limited-dimensional feature space.
3.1 Spam Features
Since spam is an important aspect of deleted tweets, we first
tried to create a small set of spam-related features to compare against the current state-of-the-art in predicting deleted
tweets. The inspiration for these features is taken directly
from the Twitter Rules (see Footnote 2), specifically what
Twitter regards as spam. This resulted in the following 12
features for a given tweet:
• Tweet length (character level)
• Tweet length (token level)
• Percent of tweet in uppercase
• Percent of tweet text that is non-alphanumeric
• Has the token ‘followers’
• Number of links
• Percent of tweet that is links
• Number of user mentions
• Number of hashtags
• Is the tweet a retweet
• Shannon entropy
• Metric entropy
The Shannon entropy of a tweet T containing the set of characters C is defined as follows:

H(T) = -\sum_{c \in C} P(c) \log_2 P(c)    (1)

and P(c) is defined as

P(c) = \frac{freq(c)}{len(T)}    (2)

where freq(c) is the frequency of c in T. Metric entropy is Shannon entropy normalized by character-level tweet length, len(T).
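To make the feature set concrete, the following is a minimal sketch of how these 12 features might be computed; the tokenization, the exact definitions of the percentage features, and all function and variable names are our own illustrative assumptions rather than the paper's implementation.

```python
import math
import re

def spam_features(text, is_retweet=False):
    """Illustrative computation of the 12 spam-related features listed above."""
    tokens = text.split()  # stand-in for the Twokenize tokenizer used in the paper
    links = re.findall(r'https?://\S+', text)
    counts = {}
    for c in text:
        counts[c] = counts.get(c, 0) + 1
    n = len(text)
    # Equation (1): Shannon entropy over the tweet's characters
    shannon = -sum((f / n) * math.log2(f / n) for f in counts.values()) if n else 0.0
    return {
        'char_length': n,
        'token_length': len(tokens),
        'pct_uppercase': sum(c.isupper() for c in text) / n if n else 0.0,
        'pct_non_alnum': sum(not c.isalnum() for c in text) / n if n else 0.0,
        'has_followers_token': int('followers' in (t.lower() for t in tokens)),
        'num_links': len(links),
        'pct_links': sum(len(l) for l in links) / n if n else 0.0,  # assumed definition
        'num_mentions': sum(t.startswith('@') for t in tokens),
        'num_hashtags': sum(t.startswith('#') for t in tokens),
        'is_retweet': int(is_retweet),
        'shannon_entropy': shannon,
        'metric_entropy': shannon / n if n else 0.0,  # Shannon entropy / len(T)
    }
```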
3.2 Experimental Design
In order to create a labeled dataset of deleted tweets, we started with a live sample of 320k tweets from May 2015. To confirm that no tweet deletions were missed during sampling, we queried the Twitter API5 (this was done in late June 2015). As examined by Lin and Huang (2013) (their work covers spam specifically), deleted tweets have a certain lifespan, as long as a year for shrewd spammers. Thus, we are effectively only predicting whether a tweet will be deleted within a month's time, not whether a tweet will ever be deleted. Because the dataset of Petrovic et al. (2013) has a 3.2% deletion rate, only 3.2% of the tweets in our datasets for these experiments are deleted. This number will be adjusted to better reflect current deletion rates in our next experiments (see Section 5).

The five classifiers we report are SVM, Naive Bayes, Bayes Net, Random Forest, and J48 (the classifier that recorded the best results in Zhou et al.'s (2015) experiments). To make the Random Forest model more efficient as a predictive model, we only use 15 trees. In order to test the effect of training set size on classifier performance, we ran the classifiers with three different dataset sizes: 10k, 130k, and 255k. We were unable to use the full 320k dataset of tweets – 255k is the largest dataset we could create that has the desired 3.2% proportion of deleted tweets. The results of our experiment are listed in Table 1. In all our experiments we perform 10-fold cross-validation and report the f1 score on the deleted tweet class. All experiments were conducted using Weka (Hall et al. 2009), except for the SVM experiments, which used the Python package scikit-learn (Pedregosa et al. 2011). Since function-based classifiers have difficulty with a strong class imbalance, which occurs in our dataset, we used the class weighting functionality provided by the library.

5 https://dev.twitter.com/overview/documentation
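As an illustration of the SVM setup with class weighting, a minimal scikit-learn sketch appears below; the specific SVM variant, its parameters, and the synthetic stand-in data are assumptions, not the paper's exact configuration.

```python
# Sketch: SVM with class weighting and 10-fold CV, reporting f1 on the deleted class.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(1000, 12)             # stand-in for the 12 spam features
y = (np.random.rand(1000) < 0.032) * 1   # ~3.2% deleted, matching this section

svm = LinearSVC(class_weight="balanced")  # reweight classes to counter the imbalance
f1_scores = cross_val_score(svm, X, y, cv=10, scoring="f1")
print(f1_scores.mean())
```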
3.3 Results
The results of our initial experiments (see Table 1) show that the Random Forest classifier clearly performs optimally on this constrained problem. Furthermore, as the dataset size grows, the two tree-based classifiers produce the best results. In fact, aside from the Naive Bayes classifier, the performance of all the classifiers increases as the size of the dataset grows. The Random Forest classifier is the only one that surpasses the f1 score of 0.27 reported by Petrovic et al. (2013), which uses an SVM classifier. Conversely, our SVM experiments generate the worst results, despite our best attempts to alleviate the class-imbalance problem. We note that the results for Petrovic et al. are from a training set of 68 million tweets, and since the SVM results improve as the dataset size increases, there is potential for substantial improvement with a much larger training set. Although we are unable to use the authors' exact dataset (Twitter only allows the sharing of tweet IDs6, which makes it impossible to recover the deleted examples), we do reiterate the increased classifier performance with larger datasets in our experiments.
Classifier       10k      130k     255k
SVM              0.062    0.090    0.101
Naive Bayes      0.188    0.180    0.165
Bayes Net        0.121    0.219    –
J48              0.140    0.252    0.255
Random Forest    0.226    0.301    0.307

Table 1: Results from our initial experiment using only spam-related features. We present the f1 score of various classifiers on datasets of varying sizes.
Furthermore, the feature space for Petrovic et al.'s model has 47 million dimensions, while our classifiers use only 12 features. Petrovic et al. use Liblinear (Fan et al. 2008) to implement their SVM, which uses highly optimized C/C++ code. In contrast, we use the standard Java-based Weka distribution. In spite of this, running on the 255k dataset, our Random Forest model took 62 seconds to train, which is 4,121 samples per second. Petrovic et al. took eight hours to train on 68 million samples, which is 2,361 samples per second. This further illustrates the efficiency of having both a greatly reduced feature space and a simpler predictive model. In summary, using a Random Forest classifier with 12 features, we have produced a model that is more accurate and more efficient at predicting deleted tweets than the current state-of-the-art. We will use the same classifier in our next experiments, but modify the feature space to include features that can capture the semantic content of the tweets, which the features described up to this point do not address.
4 Compact Feature Representation
As previously discussed, we would like to represent our tweets in a small feature space, making for efficient training and processing. A sample dataset of 90k tweets has a vocabulary of 313,680 unique tokens. Thus, to represent these 90k tweets using BOW we would need a feature space with 313,680 dimensions. In practice, a model such as ours would be used in a real-time scenario, in which case a BOW representation would force us to either ignore new vocabulary or retrain the entire system. To illustrate, a sample of 10k tweets from the 90k dataset reveals a vocabulary of 47,212 words; growing from 10k to 90k tweets therefore introduces 266,468 new tokens. It is thus imperative that the design of our features allows new vocabulary to be incorporated efficiently.
6 https://dev.twitter.com/overview/terms/agreement-and-policy
In addition, the features need to be language-independent, rendering language-specific tools (e.g., part-of-speech taggers) and lexica (e.g., WordNet) unavailable for our purposes. There is perhaps one caveat to this principle of language-independence: we use a Python port of the Twokenize class from ark-tweet-nlp7. This tool is inherently designed for languages whose words are separated by whitespace. Thus, the notion of word 'tokens' for languages that are not whitespace-separated, such as certain Asian languages, is not as precise. Aside from this one tool, we hold that the remaining aspects of our system adhere to the philosophy of language-independence.
We identify two models for a low-dimensional semantic representation of tweet content: topic modeling and text embeddings. Each of these models has the power to shrink an invariably large vocabulary into a small, user-defined number of dimensions. In this section we give a brief overview of the models and detail the various ways we incorporate them into our system.
4.1 Topic Modeling
Topic Modeling, notably LDA (Blei, Ng, and Jordan 2003),
has become a ubiquitous item in the toolkit of text processing researchers. Using the idea of ‘topics’, which are multinomial distributions over the vocabulary of a corpus, LDA
defines a distribution over topics for each document in the
corpus. We use the implementation of LDA offered by the
Python package gensim8 . The gensim version of LDA uses
the online variation of LDA from Hoffman et al. (2010). This
variation of standard LDA is purposefully designed to handle a continuous input of training documents better than the
original version. Aside from the increased efficiency – faster
convergence on larger datasets – this variation is intuitively
better, as the ‘online’ aspect of the algorithm matches the
streaming nature of Twitter. For our implementation of LDA,
we set the number of topics to be 100.
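As a concrete illustration, the following is a minimal sketch of this setup using gensim's LdaModel (which implements the online algorithm of Hoffman et al. (2010)); the toy corpus and variable names are ours, and the preprocessing shown is only an assumption.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy stand-in data: each tweet is already tokenized.
tokenized_tweets = [["good", "morning", "twitter"],
                    ["free", "followers", "click", "here"]]

dictionary = corpora.Dictionary(tokenized_tweets)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_tweets]

# 100 topics, as in our configuration; update() can later consume new batches,
# matching the streaming nature of Twitter.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=100)

# Document-level feature vector: the tweet's distribution over all 100 topics.
doc_topics = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
```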
In an effort to combine the relative merits of topic modeling and text embeddings, Liu et al. (2015) proposed Topical Word Embeddings (TWE). The authors conceived three different forms of TWE, based on how separately the topic and word embedding models are trained. In their experiments, the only model that outperformed a BOW model is perhaps the simplest, TWE1, where the topic and word embedding models are trained completely separately (using LDA for the topic model) and the resulting per-word vectors are concatenated to create TWE1. Since the general topic model creates a distribution over topics at the document level, the authors use the topics' distributions over words to create the document-specific Word Topics (WT):

P(z_i | w_i, d) \propto P(w_i | z_i) P(z_i | d)    (3)

where z_i is a topic, d is a document, and w_i is a word from d. Since our implementation of LDA has 100 topics, WT will also have 100 topics. Using topic modeling, we are thus able to create a low-dimensional representation of tweets both at the document level and at the word level.
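The per-word assignment of Equation (3) can be sketched as follows, reusing the lda and dictionary objects from the previous sketch; the helper name and the averaging to the message level (described in Section 5.1) are our own illustrative choices.

```python
import numpy as np

topic_word = lda.get_topics()  # shape: (num_topics, vocab_size); rows are P(w|z)

def word_topic_vector(tokens):
    bow = dictionary.doc2bow(tokens)
    # P(z|d) for this tweet, as a dense vector over the 100 topics
    p_z_d = np.zeros(lda.num_topics)
    for z, p in lda.get_document_topics(bow, minimum_probability=0.0):
        p_z_d[z] = p
    vectors = []
    for token in tokens:
        if token not in dictionary.token2id:
            continue
        w = dictionary.token2id[token]
        p_z_wd = topic_word[:, w] * p_z_d      # Equation (3): P(w|z) * P(z|d)
        p_z_wd /= max(p_z_wd.sum(), 1e-12)     # normalize to a distribution
        vectors.append(p_z_wd)
    # Average the per-word topic vectors into one 100-dimensional tweet feature.
    return np.mean(vectors, axis=0) if vectors else np.zeros(lda.num_topics)
```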
In our primary experiments (see Section 5) we will test which is the better representation to use as features for our problem.

7 https://github.com/myleott/ark-twokenize-py
8 https://github.com/piskvorky/gensim/
4.2 Text Embeddings
Text embeddings, specifically word2vec (Mikolov et al. 2013a; 2013b), create a fixed-length vector representation of the words in a training corpus. These vectors capture the semantic nature of a given word – words that appear in similar contexts will be located close together in the n-dimensional vector space. We use the gensim package to implement word2vec as well, leveraging the Skip-gram model to train our word2vec representations. The Skip-gram model uses the current target word to try to predict the other words in its surrounding context window. Future work can test the effectiveness of Negative Sampling and subsampling (Mikolov et al. 2013b) for training word2vec models for the task of deletion prediction. word2vec is an inherently online model, because (in a non-distributed system) each training example – and the words therein – is processed individually and provides a unique update to the word embeddings.
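A minimal sketch of this Skip-gram configuration with gensim follows; the toy corpus is ours, and note that older gensim releases name the dimensionality parameter size rather than vector_size.

```python
from gensim.models import Word2Vec

tokenized_tweets = [["follow", "me", "for", "free", "followers"],
                    ["lovely", "morning", "in", "richland"]]

w2v = Word2Vec(sentences=tokenized_tweets,
               vector_size=100,   # 100-dimensional embeddings, as in Section 4.2
               window=5,          # context window of 5
               sg=1,              # sg=1 selects the Skip-gram architecture
               min_count=1)

vector_for_followers = w2v.wv["followers"]  # 100-dimensional word embedding
```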
The language of Twitter introduces further difficulty for applying a word2vec model. The ubiquitous nature of slang, acronyms, and misspellings creates a wider variety of unique tokens than in standard written text. Researchers such as Godin et al. (2013) address this issue by using a conversion dictionary9 for acronyms and slang. Other researchers (Miura et al. 2014; Boag, Potash, and Rumshisky 2015) incorporate a spell-checking library into their processing pipeline. Instead of these approaches, which would add further processing steps to our system, we echo the method of Malmi et al. (2015)10 and create representations at the character level. We posit that character-level representations will provide a more robust model that can better handle the inventive, disorganized language present in tweets. Another minor factor that may contribute to the effectiveness of a character-level model, aside from the numerous alphabets used by various languages, is the presence of character-based languages on Twitter, such as Japanese. Because these languages express semantic meaning with individual characters, and our tokenizer does not break apart strings of characters, a character-level model will capture a word's meaning in the same way a word-level model does for alphabet-based, whitespace-delineated languages.
There is, however, a deeper need to test the effectiveness
of a character-level model. As mentioned in Section 2.2, the
presence of regretful content in a tweet does not imply future deletion. Thus, we believe that for large-scale deletion
prediction, spam is a larger factor than regretful content.
While a word-level model may invariably be more capable
of capturing a tweet’s true semantic content, a characterlevel model is better able to capture how well-formed the
language of a tweet is, which we believe to be more indicative of spam, and consequently more indicative of deletion.
We will experiment with both character-level (char2vec) and word-level representations, setting the vector size to 100, as we did with topic modeling, and setting the window size to 5 when training the respective models.

9 http://www.noslang.com/dictionary/
10 This work deals with rap lyrics, as opposed to Twitter. However, both rap and Twitter are forms of non-standard language.
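A sketch of the char2vec idea, assuming the same gensim Skip-gram model but with each character treated as a token; the helper function and the averaging step (see Section 5.1) are illustrative assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

tweets = ["FREE followers!!! click here", "おはよう twitter"]

# Each training example is the tweet as a sequence of characters.
char_sequences = [list(tweet) for tweet in tweets]

char2vec = Word2Vec(sentences=char_sequences,
                    vector_size=100, window=5, sg=1, min_count=1)

# Message-level feature: average of the tweet's character embeddings.
def tweet_char_vector(tweet):
    vecs = [char2vec.wv[c] for c in tweet if c in char2vec.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)
```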
5 Primary Experiments

5.1 Experimental Design
Since the goal of our primary experiments is to test the effectiveness of topic modeling and text embedding features for predicting deleted tweets – as opposed to testing the effect of training set size – our dataset for the primary experiments is a sample of 90k tweets from the dataset detailed in Section 3.2. While previous work reports a tweet deletion rate of around 3% (Petrovic, Osborne, and Lavrenko 2013; Almuhimedi et al. 2013), our own analysis of 5.1 million tweets from May 2015 shows that the current deletion rate has increased dramatically, to 23%. We believe this may be due in part to new mass-deletion services such as TweetDelete11. This percentage has been further verified after examining over a billion tweets. Therefore, in our primary experiments we keep the organic distribution of labels in the dataset. We reiterate that these experiments are conducted over a multi-lingual dataset. Table 2 shows the top 10 languages present in our dataset, and the percent of the dataset covered by each language.
Language      % of Tweets
English       31.2
Japanese      21.2
Spanish       9.2
Arabic        6.1
Portuguese    3.6
Russian       2.6
Turkish       2.3
Korean        2.2
Thai          2.1
French        2.1

Table 2: Top 10 languages present in our experimental dataset.
There are three types of features we experiment with: spam-related features (detailed in Section 3.1), topic modeling features, and text embedding features. For topic modeling, we use both the standard LDA distribution over topics as well as WT (see Section 4.1); for text embeddings, we use standard word2vec token representations as well as char2vec representations (see Section 4.2). The topic and text embedding models are trained separately on a dataset of 8.4 million tweets, of which the experimental data is a subset. We believe that by training these models on a larger set of data we will create more robust models and higher-performing features. Future work can seek to validate this hypothesis (see Section 6).

When training these unsupervised models, we remove all user mentions and URLs from the tweet, because we contend that these do not affect the content of the tweet message itself: URLs and user mentions do not carry semantic meaning as individual tokens12, and can therefore be viewed similarly to stopwords, which are typically ignored when training LDA models (Blei, Ng, and Jordan 2003). For practical purposes, removing user mentions and URLs also reduces the size of these models13. For features that are finer-grained than the message level (WT, word2vec, and char2vec), we compute the average vector over all representations in the tweet body in order to create a message-level representation.
Since we argue that topic modeling and text embeddings are effectively dimensionality reductions of the overall tweet content, we also compare against Principal Component Analysis (PCA) features. For each tweet, we create a BOW representation using term frequency. We then apply PCA to the BOW vectors to create a 100-dimensional vector representation of the tweets.
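A sketch of this PCA baseline with scikit-learn; the variable names and toy data are ours, and the component count is capped only so the toy example runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# `tweet_texts` stands in for the preprocessed tweet bodies.
tweet_texts = ["free followers click here", "good morning twitter", "good night"]

bow = CountVectorizer().fit_transform(tweet_texts).toarray()  # term-frequency BOW
# 100 components as in the paper; capped here only for the toy data.
pca = PCA(n_components=min(100, min(bow.shape)))
tweet_vectors = pca.fit_transform(bow)
```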
Drawing from the results of Section 3.3, we use a Random Forest for our supervised classifier, again using 15 Random Trees as the base classifiers. The number of features sampled for the Random Trees is not fixed, as it depends on the size of the overall feature space. We present the precision, recall, and f1 score on the deleted tweet class, showing the results from 10-fold cross-validation. As before, these experiments are conducted in Weka (Hall et al. 2009).
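The classification step itself is run in Weka; purely to make the configuration concrete, an analogous scikit-learn setup might look like the following sketch, with synthetic stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X = np.random.rand(500, 12 + 100 + 100)   # e.g., Spam + WT + char2vec features
y = (np.random.rand(500) < 0.23) * 1      # ~23% deleted, the organic rate

rf = RandomForestClassifier(n_estimators=15)  # 15 trees, as described above
scores = cross_validate(rf, X, y, cv=10,
                        scoring=("precision", "recall", "f1"))
print(scores["test_f1"].mean())
```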
11 http://www.tweetdelete.net/

5.2 Results
Results of our experiments are shown in Table 3. As standalone features, all topic modeling/text embedding features did worse than the spam-related features; therefore, all feature combinations include the spam-related features. The results are not intended to suggest that a Random Forest is the best classifier for all features; rather, by fixing the classifier we are able to draw conclusions about the individual performance of the features. We highlight which system(s) produced the best result for each individual metric.
Features               Precision   Recall   F1
Spam                   0.638       0.325    0.431
Spam+PCA               0.629       0.309    0.415
Spam+word2vec          0.747       0.297    0.425
Spam+char2vec          0.787       0.319    0.454
Spam+LDA               0.755       0.322    0.452
Spam+WT                0.761       0.325    0.456
Spam+char2vec+LDA      0.794       0.320    0.456
Spam+char2vec+WT       0.799       0.321    0.458

Table 3: Results from our primary experiment using spam-related, topic modeling, and text embedding features.
5.3 Representing Author ID
In their predictive model for tweet deletion, Petrovic et al. (2013) state that the most powerful feature was non-textual: author ID. To this point, we have completely ignored non-textual features. Representing author ID presents the same computational problem that BOW does: a large, sparse feature space that our methodology seeks to avoid.
12 We do take into account user mentions and URLs via our spam-related features.
13 Of the 313,680 unique tokens from 90k tweets mentioned in Section 4, 12.7% are either user mentions or URLs.
One possibility for encoding author ID in a low-dimensional feature space is to use an extension of word2vec called paragraph2vec (Le and Mikolov 2014). While word2vec seeks to optimize a word embedding matrix W when predicting word sequences in text, paragraph2vec also incorporates a document embedding matrix D into the predictive task. Each document (paragraph, etc.) is assigned an ID, and the result of the algorithm is a fixed-length vector that represents an entire document. Document IDs can effectively be thought of as another word in a word sequence.

The IDs do not need to be unique to each document. Thus, we create an ID for each author in our training set14, and apply that ID to each of an author's tweets. By doing so, we create a 100-dimensional vector representation unique to each author, which we call author2vec. Table 4 shows the results of our experiments incorporating author2vec. Since Spam, WT, and char2vec were the best-performing features from Section 5.2, we include these features as well in our experiments.
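A sketch of this author2vec construction using gensim's Doc2Vec (paragraph2vec) with author IDs as document tags; the data and variable names are illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets_with_authors = [("free followers click here", "author_42"),
                       ("good morning twitter", "author_7"),
                       ("another tweet from the same user", "author_42")]

# The author ID is used as the document tag, so all of an author's tweets
# share one 100-dimensional vector.
docs = [TaggedDocument(words=text.split(), tags=[author])
        for text, author in tweets_with_authors]

author2vec = Doc2Vec(documents=docs, vector_size=100, window=5, min_count=1)

vec_author_42 = author2vec.dv["author_42"]  # `docvecs` in older gensim releases
```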
Features                       Precision   Recall   F1
Spam+author2vec                0.787       0.216    0.339
Spam+author2vec+WT             0.818       0.244    0.376
Spam+author2vec+char2vec       0.794       0.279    0.413
Spam+author2vec+WT+char2vec    0.800       0.283    0.418
Table 4: Results from our author2vec experiments.
5.4 Discussion
What is immediately apparent in the results is the strong disparity between precision and recall, with precision and recall generally high and low, respectively. For a practical application, one can argue that precision is in fact the most important metric: if a tweet is predicted to be deleted, it is likely to require further action by those generating the tweet in order to remedy the situation. Therefore, predicting that a tweet will be deleted needs to come with high confidence, possibly at the expense of capturing all the tweets that will be deleted. The stark contrast between precision and recall is nowhere more evident than in the results of the author2vec experiments. The combination of Spam, author2vec, and WT produced the highest precision out of any experiment. Conversely, this combination also produced one of the lowest recalls. Interestingly, the spam features on their own recorded the highest recall; consequently, all additional features (aside from WT, which maintained the same recall) lowered the recall. Note that the addition of PCA features deteriorates performance across all metrics compared to using the spam features alone. Finally, the combination of spam, char2vec, and WT features generated the best result in terms of f1 score.
Focusing on the performance of the topic and text embedding models specifically, the results indicate that the finer-grained models provide improved performance. As we hypothesized, using the char2vec features for the task of tweet deletion was more effective than the word2vec features, by a considerable margin. To a lesser extent, the same is true when comparing the WT and LDA features, with WT providing better results across all three metrics. Based on the results of Liu et al. (2015), where TWE1 outperformed LDA in a multi-class text classification task, this result is to be expected. Unfortunately, the positive effect of LDA and WT remains relatively mysterious, as manual inspection of the models' topics did not reveal discernible categories of tokens.

14 We use the same dataset of 8.4 million tweets mentioned in Section 5.1 as our training set. Algorithm implementation is also done with the gensim Python package.
6 Conclusion and Future Work
In this work we have detailed how to create a compact, language-independent feature representation for a supervised model that predicts deleted tweets. Specifically, we exploit topic modeling and text embeddings to create this representation. Furthermore, although the predictive model itself is a fixed, supervised model, the features for the model are generated by online algorithms. This is desirable for a system that processes streaming data such as Twitter. The results of our experiments illustrate four key conclusions for our constrained problem: 1) character-level text embeddings (char2vec) are superior to word-level text embeddings (word2vec); 2) by a small yet noticeable margin, word-level topic modeling (WT) is preferable to document-level topic modeling (LDA); 3) in terms of precision, author2vec provides the best boost in performance; 4) for the tweet deletion task, precision is high but recall is low.
There are numerous avenues for future work that can refine our experiments and potentially improve results. For example, we would like to see whether training the topic and text embedding models on a much larger dataset actually improves overall system performance, as we hypothesize. Also, the features we explore focus almost exclusively on the content of the tweet itself, but it would also be possible to include tweet metadata, such as the time of tweet creation, number of likes, etc., as well as author metadata, such as the date of account creation, number of followers, etc. These features would be computationally efficient to compute, as they are stored in the JSON data structure provided in the tweet object15. Lastly, we are particularly fascinated by the notion of 'unrelated' hashtags, which Twitter warns is an indicator of spam. Thus, given a tweet, we would like to determine how relevant each of its hashtags is and encode this as a feature for our model.

15 https://dev.twitter.com/overview/api/tweets
References
Almuhimedi, H.; Wilson, S.; Liu, B.; Sadeh, N.; and Acquisti, A. 2013. Tweets are forever: a large-scale quantitative analysis of deleted tweets. In Proceedings of the 2013
conference on Computer supported cooperative work, 897–
908. ACM.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent
dirichlet allocation. the Journal of machine Learning research 3:993–1022.
Boag, W.; Potash, P.; and Rumshisky, A. 2015. Twitterhawk: A feature bucket approach to sentiment analysis. In
Proceedings of the 9th International Workshop on Semantic
Evaluation (SemEval 2015), 640–646.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and
Lin, C.-J. 2008. Liblinear: A library for large linear classification. The Journal of Machine Learning Research 9:1871–
1874.
Godin, F.; Slavkovikj, V.; De Neve, W.; Schrauwen, B.; and
Van de Walle, R. 2013. Using topic models for twitter hashtag recommendation. In Proceedings of the 22nd international conference on World Wide Web companion, 593–596.
International World Wide Web Conferences Steering Committee.
Guo, D., and Chen, C. 2014. Detecting non-personal and
spam users on geo-tagged twitter network. Transactions in
GIS 18(3):370–384.
Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann,
P.; and Witten, I. H. 2009. The weka data mining software:
an update. ACM SIGKDD explorations newsletter 11(1):10–
18.
Hoffman, M.; Bach, F. R.; and Blei, D. M. 2010. Online
learning for latent dirichlet allocation. In advances in neural
information processing systems, 856–864.
Hong, L., and Davison, B. D. 2010. Empirical study of topic
modeling in twitter. In Proceedings of the First Workshop on
Social Media Analytics, 80–88. ACM.
Knapp, M. L.; Stafford, L.; and Daly, J. A. 1986. Regrettable
messages: Things people wish they hadn’t said. Journal of
Communication 36(4):40–58.
Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents. arXiv preprint
arXiv:1405.4053.
Lin, P.-C., and Huang, P.-M. 2013. A study of effective features for detecting long-surviving twitter spam accounts. In
Advanced Communication Technology (ICACT), 2013 15th
International Conference on, 841–846. IEEE.
Ling, W.; Dyer, C.; Black, A.; and Trancoso, I. 2015.
Two/too simple adaptations of word2vec for syntax problems. Proceedings of the North American Chapter of the
Association for Computational Linguistics (NAACL), Denver, CO.
Liu, Y.; Liu, Z.; Chua, T.-S.; and Sun, M. 2015. Topical word
embeddings. In Twenty-Ninth AAAI Conference on Artificial
Intelligence.
Malmi, E.; Takala, P.; Toivonen, H.; Raiko, T.; and Gionis,
A. 2015. Dopelearning: A computational approach to rap
lyrics generation. arXiv preprint arXiv:1505.04771.
Mccord, M., and Chuah, M. 2011. Spam detection on twitter using traditional classifiers. In Autonomic and trusted
computing. Springer. 175–186.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a.
Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and
Dean, J. 2013b. Distributed representations of words and
phrases and their compositionality. In Advances in neural
information processing systems, 3111–3119.
Miller, Z.; Dickinson, B.; Deitrick, W.; Hu, W.; and Wang,
A. H. 2014. Twitter spammer detection using data stream
clustering. Information Sciences 260:64–73.
Miura, Y.; Sakaki, S.; Hattori, K.; and Ohkuma, T. 2014.
Teamx: A sentiment analyzer with enhanced lexicon mapping and weighting scheme for unbalanced data. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 628–632.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.;
Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss,
R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.;
Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikitlearn: Machine learning in Python. Journal of Machine
Learning Research 12:2825–2830.
Petrovic, S.; Osborne, M.; and Lavrenko, V. 2013. I wish
i didn’t say that! analyzing and predicting deleted messages
in twitter. arXiv preprint arXiv:1305.3107.
Quinlan, J. R. 2014. C4.5: Programs for machine learning.
Elsevier.
Sleeper, M.; Cranshaw, J.; Kelley, P. G.; Ur, B.; Acquisti, A.;
Cranor, L. F.; and Sadeh, N. 2013. I read my twitter the next
morning and was astonished: A conversational perspective
on twitter regrets. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 3277–3286.
ACM.
Stojanovski, D.; Strezoski, G.; Madjarov, G.; and Dimitrovski, I. 2015. Twitter sentiment analysis using deep convolutional neural network. In Hybrid Artificial Intelligent
Systems. Springer. 726–737.
Tang, D.; Wei, F.; Yang, N.; Zhou, M.; Liu, T.; and Qin,
B. 2014. Learning sentiment-specific word embedding for
twitter sentiment classification. In Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics, volume 1, 1555–1565.
Tomar, A.; Godin, F.; Vandersmissen, B.; De Neve, W.; and
Van de Walle, R. 2014. Towards twitter hashtag recommendation using distributed word representations and a deep
feed forward neural network. In Advances in Computing,
Communications and Informatics (ICACCI), 2014 International Conference on, 362–368. IEEE.
Zhao, W. X.; Jiang, J.; Weng, J.; He, J.; Lim, E.-P.; Yan, H.;
and Li, X. 2011. Comparing twitter and traditional media
using topic models. In Advances in Information Retrieval.
Springer. 338–349.
Zhou, L.; Wang, W.; and Chen, K. 2015. Identifying regrettable messages from tweets. In Proceedings of the 24th
International Conference on World Wide Web Companion,
145–146. International World Wide Web Conferences Steering Committee.