Using Topic Modeling and Text Embeddings to Predict Deleted Tweets

Peter Potash†*, Eric Bell†, Joshua Harrison†
† Pacific Northwest National Laboratory, Richland, WA
* University of Massachusetts Lowell, Dept. of Computer Science, Lowell, MA
[email protected]

Abstract

Predictive models for tweet deletion have been a relatively unexplored area of Twitter-related computational research. We first approach the deletion of tweets as a spam detection problem, applying a small set of handcrafted features to improve upon the current state-of-the-art in predicting deleted tweets. Next, we apply our approach to a dataset of deleted tweets that better reflects the current deletion rate. Since tweets are deleted for reasons beyond just the presence of spam, we apply topic modeling and text embeddings in order to capture the semantic content of tweets that can lead to tweet deletion. Our goal is to create an effective model that has a low-dimensional feature space and is also language-independent. A lean model is computationally advantageous when processing high volumes of Twitter data, which can reach 9,885 tweets per second (http://www.internetlivestats.com/one-second/#tweets-band). Our results show that a small set of spam-related features combined with word topics and character-level text embeddings provides the best f1 score when trained with a random forest model. The highest precision on the deleted tweet class is achieved by a modification of paragraph2vec that captures author identity.

1 Introduction

Despite the relative infancy of Twitter, there have been numerous computational approaches to spam detection on Twitter (Miller et al. 2014; Mccord and Chuah 2011; Guo and Chen 2014; Lin and Huang 2013). According to the Twitter rules (https://support.twitter.com/articles/18311), users may not use Twitter for the purpose of spamming. Accounts that violate this rule may be suspended or terminated, and their tweets removed from Twitter Search. Consequently, the presence of spam in a tweet is a strong signal that the tweet could be deleted. However, spamming tweets are only a subset of all deleted tweets. Users may choose to delete their own tweets for various reasons (see Section 2.2 for a deeper analysis of motives for deleting tweets). As social media technology matures, users become more aware of the broader ramifications of their posts. One important consequence is how posts affect a user's online identity. For example, Facebook could potentially check whether photos uploaded to the site would elicit a negative reaction (http://www.businessinsider.com/facebook-could-check-whether-users-are-uploading-rude-photos-2015-11?r=UK&IR=T). In a similar spirit, predicting tweet deletion will aid users in maintaining a positive online identity.

Aside from social factors, there is specific content in a tweet that may make it a candidate for deletion. Therefore, to understand which tweets might be deleted, we need an effective feature representation of tweet content. Furthermore, we want the model to be language-independent: a single model regardless of tweet language. This makes the use of language-specific tools and lexica unappealing. For practical purposes, we want to represent tweet content through a compact feature representation, rendering the ubiquitous bag-of-words (BOW) model (unigrams, bigrams, trigrams, etc.) an unattractive option. Another motivation for a compressed representation is that it is better suited for additional features and further development.
Topic modeling, specifically Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) and tailored variations, has been used to understand tweet content in various applications (Hong and Davison 2010; Godin et al. 2013; Zhao et al. 2011). Since LDA represents documents (tweets in our case) as a distribution over all possible topics, the dimensionality of a document with regard to LDA is equal to the number of topics. Thus, we can compress a vocabulary of arbitrary size into a small, fixed-size representation. Another, more recent method for semantic representation in a low-dimensional space is word2vec (Mikolov et al. 2013a; 2013b). The original algorithm creates fixed-length representations at the word level, and it has been extended to model entire documents (Le and Mikolov 2014). In terms of Twitter research, word representations have been used for problems such as hashtag recommendation (Tomar et al. 2014), syntactic parsing (Ling et al. 2015), and the popular sentiment prediction task (Tang et al. 2014; Stojanovski et al. 2015). In our experiments, we seek to show that topic modeling and text embeddings can effectively capture the semantic qualities of a tweet that lead to deletion, and do so with only on the order of hundreds of features.

The rest of the paper is laid out as follows: in Section 2 we review related work that focuses on deleted tweets. In Section 3 we conduct initial experiments that test our small set of spam-related features and compare our system to the current state-of-the-art in predicting deleted tweets. In Section 4 we give relevant background and describe how we incorporate topic modeling and text embeddings into our model for predicting deleted tweets. In Section 5 we test the use of topic modeling and text embeddings as features for predicting deleted tweets. Finally, in Section 6 we offer our conclusions and look toward future work.

2 Related Work

2.1 Predicting Deleted Tweets

Few works attempt to build predictive models for tweet deletion. Zhou et al. (2015) focus on a subset of deleted tweets: regrettable tweets. These are tweets that the authors believe to contain inappropriate content, which can range from vulgar language to the sharing of private information such as a personal email address. Through manual investigation, the authors identify ten major topics, including negative sentiment, cursing, and relationships, that are prevalent in regrettable tweets. The authors then exploit WordNet and UrbanDictionary to create keyword lists related to the ten topics. Finally, using a combination of existing lexica and the topic keywords as features, the authors build classifiers to test the accuracy of their model. They complement 700 manually labeled regrettable tweets with 700 normal tweets to create their evaluation dataset. Their best performance from 10-fold cross-validation was an f1 score of 0.85 using a J48 classifier (a Java implementation of the C4.5 algorithm (Quinlan 2014)) on this balanced dataset of regretful and non-regretful tweets.

The work that most closely resembles ours is that of Petrovic et al. (2013). The authors use bag-of-words, user ID, and various 'social' features, including number of followers, whether the tweet is a retweet, and whether it is a reply, as features for their predictive model. They train a classifier on 68 million tweets and test on 7.5 million tweets.
The balance of deleted versus non-deleted tweets is organic in their dataset: tweets were collected for the month of January and marked as deleted if they were removed by the end of February. In the authors' dataset, only 3.2% of the tweets were deleted (the current deletion rate is 23%, a dramatic increase over the course of two years; see Section 5.1). Due to the types of features and the size of the dataset, the feature space is over 47 million dimensions, and an SVM classifier took 8 hours to train. The SVM recorded an f1 score of 0.27, with user ID identified as the most important feature.

2.2 Analyzing Deleted Tweets

While not focusing specifically on deletion prediction, other works examine the phenomenon of tweet deletion. Almuhimedi et al. (2013) perform a large-scale analysis of tweet deletion. In their dataset of over 67 million tweets, only 2.4% of tweets were deleted; however, 50% of roughly 300,000 users had deleted at least one tweet. According to the authors, the attributes that most notably differentiate deleted from non-deleted tweets are "...clients used to post tweets, the locations of posts (determined via FourSquare), and the quantities of replies that they generated." The authors did not find a substantial correlation between deleted tweets and the presence of regrettable topics, or even sentiment vocabulary. Furthermore, the authors state that "A substantial fraction of deleted tweets are deleted for superficial reasons. Together, typos, rephrasing, and spam account for 18% of deleted tweets." The authors' work illustrates the overall difficulty of finding distinct, generalizable patterns in deleted tweets.

Sleeper et al. (2013) conducted an Amazon Mechanical Turk task to understand the types of regret users experience regarding content they have tweeted. Of the 474 responses (categorized using the regret categories from Knapp et al. (1986)), the most common cause for regret was revealing too much in the tweet (e.g., personal information or a secret), followed by direct criticism of a specific person. Contrastingly, the participants rarely reported experiencing regret due to lying or 'behavioral edict' (telling someone to behave in a certain way). Of the participants who experienced regret due to a specific tweet, only 52% of the tweets were actually deleted. This further highlights the difficulty of deletion prediction: even if a tweet has cause for deletion, it may very well remain on Twitter.

3 Initial Experiments

The objective of our initial experiments is two-fold: 1) compare to the current state-of-the-art for predicting deleted tweets (Petrovic, Osborne, and Lavrenko 2013); 2) determine the optimal classifiers for the task of deletion prediction in a limited-dimensional feature space.

3.1 Spam Features

Since spam is an important aspect of deleted tweets, we first create a small set of spam-related features to compare against the current state-of-the-art in predicting deleted tweets. The inspiration for these features is taken directly from the Twitter Rules (https://support.twitter.com/articles/18311), specifically what Twitter regards as spam.
This resulted in the following 12 features for a given tweet:

• Tweet length (character level)
• Tweet length (token level)
• Percent of the tweet in uppercase
• Percent of the tweet text that is non-alphanumeric
• Has the token 'followers'
• Number of links
• Percent of the tweet that is links
• Number of user mentions
• Number of hashtags
• Is the tweet a retweet
• Shannon entropy
• Metric entropy

The Shannon entropy of a tweet T containing the set of characters C is defined as follows:

H(T) = -\sum_{c \in C} P(c) \log_2 P(c)    (1)

and P(c) is defined as:

P(c) = freq(c) / len(T)    (2)

where freq(c) is the frequency of c in T. Metric entropy is Shannon entropy normalized by the character-level tweet length, len(T).
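For concreteness, the sketch below shows how the two entropy features could be computed in Python. It is a minimal illustration of Equations 1 and 2, not the authors' code; the function names and example tweets are ours.

```python
import math
from collections import Counter

def shannon_entropy(tweet_text):
    """Shannon entropy (Equation 1) over the characters of a tweet."""
    if not tweet_text:
        return 0.0
    counts = Counter(tweet_text)
    length = len(tweet_text)
    # P(c) = freq(c) / len(T), summed as -P(c) * log2 P(c) over distinct characters.
    return -sum((freq / length) * math.log2(freq / length) for freq in counts.values())

def metric_entropy(tweet_text):
    """Shannon entropy normalized by character-level tweet length (Equation 2's len(T))."""
    return shannon_entropy(tweet_text) / len(tweet_text) if tweet_text else 0.0

# Repetitive, spam-like text tends to have lower entropy than varied text.
print(shannon_entropy("follow me follow me follow me"))
print(shannon_entropy("Heading to the lake this weekend with friends!"))
```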
3.2 Experimental Design

In order to create a labeled dataset of deleted tweets, we started with a live sample of 320k tweets from May 2015. To confirm that no tweet deletions were missed during sampling, we queried the Twitter API (https://dev.twitter.com/overview/documentation) in late June 2015. As examined by Lin and Huang (2013), whose work covers spam specifically, deleted tweets have a certain lifespan, as long as a year for shrewd spammers. Thus, we are effectively predicting only whether a tweet will be deleted within a month's time, not whether a tweet will ever be deleted. Because the dataset of Petrovic et al. (2013) has a 3.2% deletion rate, our datasets for these experiments also contain 3.2% deleted tweets. This number will be adjusted to better reflect current deletion rates in our later experiments (see Section 5).

The five classifiers we report are SVM, Naive Bayes, Bayes Net, Random Forest, and J48 (the classifier that recorded the best results in Zhou et al.'s (2015) experiments). To make the Random Forest model more efficient as a predictive model, we use only 15 trees. In order to test the effect of training set size on classifier performance, we ran the classifiers with three different dataset sizes: 10k, 130k, and 255k. We were unable to use the full 320k dataset of tweets; 255k is the largest dataset we could create that has the desired 3.2% proportion of deleted tweets. The results of our experiment are listed in Table 1. In all our experiments we perform 10-fold cross-validation and report the f1 score on the deleted tweet class. All experiments were conducted using Weka (Hall et al. 2009), except for the SVM experiments, which used the Python package scikit-learn (Pedregosa et al. 2011). Since function-based classifiers have difficulty with a strong class imbalance, which occurs in our dataset, we used the class weighting functionality provided by the library.

3.3 Results

The results of our initial experiments (see Table 1) show that the Random Forest classifier clearly performs best on this constrained problem. Furthermore, as the dataset size grows, the two tree-based classifiers produce the best results. In fact, aside from the Naive Bayes classifier, the performance of all the classifiers increases as the size of the dataset grows. The Random Forest classifier is the only one that surpasses the f1 score of 0.27 reported by Petrovic et al. (2013), whose model uses an SVM classifier. Conversely, our SVM experiments generate the worst results, despite our best attempts to alleviate the class-imbalance problem. We note that the results of Petrovic et al. are from a training set of 68 million tweets, and since the SVM results improve as the dataset size increases, there is potential for substantial improvement with a much larger training set. Although we are unable to use the authors' exact dataset (Twitter only allows the sharing of tweet IDs (https://dev.twitter.com/overview/terms/agreement-and-policy), which makes it impossible to recover the deleted examples), we do observe the same increase in classifier performance with larger datasets in our experiments.

Classifier       10k     130k    255k
SVM              0.062   0.090   0.101
Naive Bayes      0.188   0.180   0.165
Bayes Net        0.121   0.219   0.255
J48              0.140   0.252   –
Random Forest    0.226   0.301   0.307

Table 1: Results from our initial experiment using only spam-related features. We present the f1 score of various classifiers on datasets of varying sizes.

Furthermore, the feature space of Petrovic et al.'s model has 47 million dimensions; our classifiers use only 12 features. Petrovic et al. use Liblinear (Fan et al. 2008) to implement their SVM, which uses highly optimized C/C++ code, whereas we use the standard Java-based Weka distribution. In spite of this, running on the 255k dataset, our Random Forest model took 62 seconds to train, which is 4,121 samples per second. Petrovic et al. took eight hours to train on 68 million samples, which is 2,361 samples per second. This further illustrates the efficiency of having both a greatly reduced feature space and a simpler predictive model.

In summary, using a Random Forest classifier with 12 features, we have produced a model that is both more accurate and more efficient at predicting deleted tweets than the current state-of-the-art. We will use the same classifier in our next experiments, but modify the feature space to include features that capture the semantic content of the tweets, which the features described up to this point do not address.
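To illustrate the classifier configuration described above, the following sketch shows a comparable setup in scikit-learn: a 15-tree random forest with class weighting to counteract the imbalance, evaluated with 10-fold cross-validation on the f1 score of the deleted class. The paper's tree-based experiments were run in Weka, so this is an assumed scikit-learn equivalent rather than the authors' setup, and the feature matrix below is a random placeholder standing in for the 12 spam-related features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: X is an (n_tweets, 12) matrix of spam-related features,
# y is 1 for deleted and 0 for kept, at roughly the 3.2% deletion rate used here.
rng = np.random.default_rng(0)
X = rng.random((1000, 12))
y = (rng.random(1000) < 0.032).astype(int)

clf = RandomForestClassifier(
    n_estimators=15,          # 15 trees, mirroring the Weka configuration in the paper
    class_weight="balanced",  # counteract the strong class imbalance
    random_state=0,
)

# 10-fold cross-validation, scoring f1 on the deleted (positive) class.
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("mean f1 (deleted class):", scores.mean())
```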
4 Compact Feature Representation

As previously discussed, we would like to represent our tweets in a small feature space, making for efficient training and processing. A sample dataset of 90k tweets has a vocabulary of 313,680 unique tokens. Thus, to represent these 90k tweets using BOW we would need a feature space with 313,680 dimensions. In practice, a model such as ours would be used in a real-time scenario, in which case a BOW representation would force us either to ignore new vocabulary or to retrain the entire system. To further illustrate this, a sample of 10k tweets from the 90k dataset has a vocabulary of 47,212 words, which means that growing from 10k to 90k tweets introduces 266,468 new tokens. Therefore, it is imperative that the design of our features allows new vocabulary to be incorporated in an efficient manner.

In addition, the features need to be language-independent, rendering language-specific tools (e.g., part-of-speech taggers) and lexica (e.g., WordNet) unavailable for our purposes. There is perhaps one caveat to this principle of language independence: we use a Python port of the Twokenize class from ark-tweet-nlp (https://github.com/myleott/ark-twokenize-py). This tool is inherently designed for languages whose words are separated by whitespace. Thus, the notion of word 'tokens' for languages that are not whitespace-separated, such as certain Asian languages, is not as precise. Aside from this one tool, we hold that the remaining aspects of our system adhere to the philosophy of language independence.

We identify two models for a low-dimensional semantic representation of tweet content: topic modeling and text embeddings. Each of these models has the power to shrink an invariably large vocabulary into a small, user-defined number of dimensions. In this section we give a brief overview of the models and detail the various ways we incorporate them into our system.

4.1 Topic Modeling

Topic modeling, notably LDA (Blei, Ng, and Jordan 2003), has become a ubiquitous item in the toolkit of text processing researchers. Using the idea of 'topics', which are multinomial distributions over the vocabulary of a corpus, LDA defines a distribution over topics for each document in the corpus. We use the implementation of LDA offered by the Python package gensim (https://github.com/piskvorky/gensim/). The gensim version of LDA uses the online variant of LDA from Hoffman et al. (2010). This variant of standard LDA is purposefully designed to handle a continuous input of training documents better than the original version. Aside from the increased efficiency (faster convergence on larger datasets), this variant is intuitively better suited to our problem, as the 'online' aspect of the algorithm matches the streaming nature of Twitter. For our implementation of LDA, we set the number of topics to 100.

In an effort to combine the relative merits of topic modeling and text embeddings, Liu et al. (2015) proposed Topical Word Embeddings (TWE). The authors conceived three different forms of TWE, based on how separately the topic and word embedding models are trained. In their experiments, the only model that outperformed a BOW model is perhaps the simplest, TWE-1, in which the topic and word embedding models are trained completely separately (using LDA for the topic model) and the resulting per-word vectors are concatenated. Since the general topic model creates a distribution over topics at the document level, the authors use the topics' distributions over words to create document-specific Word Topics (WT):

P(z_i | w_i, d) \propto P(w_i | z_i) P(z_i | d)    (3)

where z_i is a topic, d is a document, and w_i is a word from d. Since our implementation of LDA has 100 topics, the WT representation also has 100 dimensions. Using topic modeling, we are thus able to create a low-dimensional representation of tweets both at the document level and at the word level. In our primary experiments (see Section 5) we test which is the better feature representation for our problem.
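To make the WT computation concrete, a minimal gensim sketch is given below: it trains an online LDA model on a toy tokenized corpus and then scores P(z|w,d) for each token of a tweet following Equation 3, averaging the resulting per-token vectors into a message-level feature. The helper name word_topics, the toy data, and the reduced topic count are our illustrative assumptions, not the authors' code.

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized tweets; the paper trains on 8.4 million tweets with 100 topics.
tweets = [
    ["free", "followers", "click", "link"],
    ["great", "game", "tonight", "friends"],
    ["click", "link", "win", "prize"],
]
dictionary = corpora.Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

# gensim's LdaModel implements the online variational algorithm of Hoffman et al. (2010).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10, random_state=0)

def word_topics(tokens):
    """Message-level WT feature: P(z|w,d) ∝ P(w|z) P(z|d) per token, averaged over the tweet."""
    bow = dictionary.doc2bow(tokens)
    p_z_d = dict(lda.get_document_topics(bow, minimum_probability=0.0))  # P(z|d)
    p_w_z = lda.get_topics()                                             # rows are P(w|z)
    vectors = []
    for token in tokens:
        word_id = dictionary.token2id.get(token)
        if word_id is None:
            continue  # out-of-vocabulary token
        scores = np.array([p_w_z[z, word_id] * p_z_d.get(z, 0.0)
                           for z in range(lda.num_topics)])
        if scores.sum() > 0:
            vectors.append(scores / scores.sum())  # normalize the proportional scores
    return np.mean(vectors, axis=0) if vectors else np.zeros(lda.num_topics)

print(word_topics(["click", "link", "followers"]))
```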
4.2 Text Embeddings

Text embeddings, specifically word2vec (Mikolov et al. 2013a; 2013b), create fixed-length vector representations of the words in a training corpus. These vectors capture the semantic nature of a given word: words that appear in similar contexts will be located close together in the n-dimensional vector space. We use the gensim package to implement word2vec as well, leveraging the Skip-gram model to train our word2vec representations. The Skip-gram model uses the current target word to predict the other words in its surrounding context window. Future work can test the effectiveness of Negative Sampling and subsampling (Mikolov et al. 2013b) for training word2vec models for the task of deletion prediction. word2vec is an inherently online model because (in a non-distributed system) each training example, and the words therein, is processed individually and provides a unique update to the word embeddings.

The language of Twitter introduces further difficulty for applying a word2vec model. The ubiquitous nature of slang, acronyms, and misspellings creates a wider variety of unique tokens than in standard written text. Researchers such as Godin et al. (2013) address this issue by using a conversion dictionary for acronyms and slang (http://www.noslang.com/dictionary/). Other researchers (Miura et al. 2014; Boag, Potash, and Rumshisky 2015) incorporate a spell-checking library into their processing pipeline. Instead of these approaches, which would add further processing steps to our system, we echo the method of Malmi et al. (2015) and create representations at the character level (their work deals with rap lyrics rather than Twitter, but both are forms of non-standard language). We posit that character-level representations will provide a more robust model that can better handle the inventive, disorganized language present in tweets. Another minor factor that may contribute to the effectiveness of a character-level model, aside from the numerous alphabets used by various languages, is the presence of character-based languages on Twitter, such as Japanese. Because these languages express semantic meaning with individual characters, and our tokenizer does not break apart strings of characters, a character-level model captures a word's meaning in the same way a word-level model does for alphabet-based, whitespace-delineated languages.

There is, however, a deeper reason to test the effectiveness of a character-level model. As mentioned in Section 2.2, the presence of regretful content in a tweet does not imply future deletion. Thus, we believe that for large-scale deletion prediction, spam is a larger factor than regretful content. While a word-level model may be more capable of capturing a tweet's true semantic content, a character-level model is better able to capture how well-formed the language of a tweet is, which we believe to be more indicative of spam, and consequently more indicative of deletion. We experiment with both character-level (char2vec) and word-level representations, setting the vector size to 100, as we did with topic modeling, and the window size to 5 when training the respective models.
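A minimal sketch of this character-level setup with gensim follows: tweets are fed to Word2Vec as character sequences with the skip-gram objective, 100-dimensional vectors, and a window of 5. The tweet_vector helper (averaging character vectors into a message-level representation, as described in Section 5.1), the toy data, and the parameter names, which follow recent gensim releases, are our assumptions rather than the authors' code.

```python
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is the character sequence of a tweet (whitespace included),
# so embeddings are learned for individual characters rather than word tokens.
tweets = [
    "free followers click here!!!",
    "great game tonight with friends",
    "すごい試合だった",  # character-based languages are handled the same way
]
char_sequences = [list(t) for t in tweets]

# Skip-gram (sg=1), 100-dimensional vectors, window of 5, as described in Section 4.2.
char2vec = Word2Vec(sentences=char_sequences, vector_size=100, window=5, sg=1, min_count=1)

def tweet_vector(text):
    """Message-level representation: average of the tweet's character vectors."""
    vecs = [char2vec.wv[c] for c in text if c in char2vec.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(char2vec.vector_size)

print(tweet_vector("click here").shape)  # (100,)
```

Averaging is the simplest way to collapse a variable-length sequence into a fixed-size feature vector; it discards character order but keeps the dimensionality constant regardless of tweet length.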
5 Primary Experiments

5.1 Experimental Design

Since the goal of our primary experiments is to test the effectiveness of topic modeling and text embedding features for predicting deleted tweets, as opposed to testing the effect of training set size, our dataset for the primary experiments is a sample of 90k tweets from the dataset detailed in Section 3.2. While previous work reports a tweet deletion rate of around 3% (Petrovic, Osborne, and Lavrenko 2013; Almuhimedi et al. 2013), our own analysis of 5.1 million tweets from May 2015 shows that the current deletion rate has increased dramatically, to 23%. We believe this may be due in part to new mass-deletion services such as TweetDelete (http://www.tweetdelete.net/). This percentage has been further verified after examining over a billion tweets. Therefore, in our primary experiments we keep the organic distribution of labels in the dataset. We reiterate that these experiments are conducted over a multi-lingual dataset. Table 2 shows the top 10 languages present in our dataset and the percent of the dataset covered by each language.

Language      % of Tweets
English       31.2
Japanese      21.2
Spanish       9.2
Arabic        6.1
Portuguese    3.6
Russian       2.6
Turkish       2.3
Korean        2.2
Thai          2.1
French        2.1

Table 2: Top 10 languages present in our experimental dataset.

There are three types of features we experiment with: spam-related features (detailed in Section 3.1), topic modeling features, and text embedding features. For topic modeling, we use both the standard LDA distribution over topics and WT (see Section 4.1); for text embeddings, we use standard word2vec token representations as well as char2vec representations (see Section 4.2). The topic and text embedding models are trained separately on a dataset of 8.4 million tweets, of which the experimental data is a subset. We believe that training these models on a larger set of data will yield more robust models and higher-performing features; future work can seek to validate this hypothesis (see Section 6).

When training these unsupervised models, we remove all user mentions and URLs from the tweet, because we contend that they do not actually affect the content of the tweet message itself: URLs and user mentions do not carry semantic meaning as individual tokens (we do take user mentions and URLs into account via our spam-related features), and they can therefore be viewed similarly to stopwords, which are typically ignored when training LDA models (Blei, Ng, and Jordan 2003). For practical purposes, removing user mentions and URLs also reduces the size of these models: of the 313,680 unique tokens from the 90k tweets mentioned in Section 4, 12.7% are either user mentions or URLs. For features that are finer-grained than the message level (WT, word2vec, and char2vec), we compute the average vector of all representations in the tweet body in order to create a message-level representation.

Since we argue that topic modeling and text embeddings are effectively dimensionality reductions of the overall tweet content, we also compare against Principal Component Analysis (PCA) features. For each tweet, we create a BOW representation using term frequency. We then apply PCA to the BOW vectors to create a 100-dimensional vector representation of the tweets. Drawing from the results of Section 3.3, we use a Random Forest as our supervised classifier, again using 15 Random Trees as the base classifiers. The number of features sampled for the Random Trees is not fixed, as it depends on the size of the overall feature space. We present the precision, recall, and f1 score on the deleted tweet class, showing the results from 10-fold cross-validation. As before, these experiments are conducted in Weka (Hall et al. 2009).
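The PCA baseline described above can be sketched in a few lines of scikit-learn: term-frequency bag-of-words vectors reduced to a fixed number of components per tweet. The toy corpus and the reduced component count (100 in the paper) are placeholders, and the library choice is our assumption, since the paper does not state how PCA was computed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

tweets = [
    "free followers click here",
    "great game tonight with friends",
    "click here to win a prize",
    "watching the game with friends tonight",
]

# Term-frequency bag-of-words, then PCA down to a fixed-size vector per tweet
# (100 dimensions in the paper; 3 here because the toy corpus is tiny).
bow = CountVectorizer().fit_transform(tweets)
pca_features = PCA(n_components=3).fit_transform(bow.toarray())
print(pca_features.shape)  # (4, 3)
```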
5.2 Results

The results of our experiments are shown in Table 3. As standalone features, all topic modeling and text embedding features did worse than the spam-related features; therefore, all feature combinations include the spam-related features. The results are not intended to suggest that a Random Forest is the best classifier for every feature set. However, by fixing the classifier we are able to draw conclusions about the individual performance of the features. The best result for each individual metric is marked with an asterisk.

Features              Precision   Recall    F1
Spam                  0.638       0.325*    0.431
Spam+PCA              0.629       0.309     0.415
Spam+word2vec         0.747       0.297     0.425
Spam+char2vec         0.787       0.319     0.454
Spam+LDA              0.755       0.322     0.452
Spam+WT               0.761       0.325*    0.456
Spam+char2vec+LDA     0.794       0.320     0.456
Spam+char2vec+WT      0.799*      0.321     0.458*

Table 3: Results from our primary experiment using spam-related, topic modeling, and text embedding features.

5.3 Representing Author ID

In their predictive model for tweet deletion, Petrovic et al. (2013) state that the most powerful feature was non-textual: author ID. To this point, we have completely ignored non-textual features. Representing author ID presents the same computational problem that BOW does: a large, sparse feature space, which our methodology seeks to avoid.

One possibility for encoding author ID in a low-dimensional feature space is to use an extension of word2vec called paragraph2vec (Le and Mikolov 2014). While word2vec seeks to optimize a word embedding matrix W when predicting word sequences in text, paragraph2vec also incorporates a document embedding matrix D into the predictive task. Each document (paragraph, etc.) is assigned an ID, and the result of the algorithm is a fixed-length vector that represents the entire document. Document IDs can effectively be thought of as another word in a word sequence, and the IDs do not need to be unique to each document. Thus, we create an ID for each author in our training set and apply that ID to each of the author's tweets (we use the same dataset of 8.4 million tweets mentioned in Section 5.1 as the training set; the algorithm implementation is again the gensim Python package). By doing so, we create a 100-dimensional vector representation unique to each author, which we call author2vec.
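This idea can be sketched with gensim's Doc2Vec (its paragraph2vec implementation) by tagging every tweet with its author's ID, so that all of an author's tweets update the same vector. The toy data, tag names, and parameter names (which follow recent gensim releases) are illustrative assumptions, not the authors' code.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy data: (author_id, tokenized tweet). Tweets from the same author share the
# same tag, so the learned "document" vector ends up representing the author.
tweets = [
    ("user_42", ["free", "followers", "click", "here"]),
    ("user_42", ["click", "here", "to", "win"]),
    ("user_7",  ["great", "game", "tonight", "with", "friends"]),
]
tagged = [TaggedDocument(words=tokens, tags=[author]) for author, tokens in tweets]

# 100-dimensional author vectors with a window of 5, mirroring the other embedding models.
author2vec = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=20)

# Each tweet is then represented by its author's learned vector.
print(author2vec.dv["user_42"].shape)  # (100,)
```

Reusing one tag per author keeps the feature space at a fixed 100 dimensions no matter how many authors appear, in contrast to a one-hot author ID feature.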
Since Spam, WT, and char2vec were the best-performing features from Section 5.2, we include these features in the author2vec experiments as well. Table 4 shows the results of our experiments incorporating author2vec.

Features                        Precision   Recall   F1
Spam+author2vec                 0.787       0.216    0.339
Spam+author2vec+WT              0.818       0.244    0.376
Spam+author2vec+char2vec        0.794       0.279    0.413
Spam+author2vec+WT+char2vec     0.800       0.283    0.418

Table 4: Results from our author2vec experiments.

5.4 Discussion

What is immediately apparent in the results is the strong disparity between precision and recall, with precision generally high and recall generally low. For a practical application, one can argue that precision is in fact the most important metric: if a tweet is predicted to be deleted, further action by those generating the tweet is likely required to remedy the situation. Therefore, a prediction that a tweet will be deleted needs to come with high confidence, possibly at the expense of capturing all the tweets that will be deleted. The contrast between precision and recall is nowhere more evident than in the results of the author2vec experiments. The combination of Spam, author2vec, and WT produced the highest precision of any experiment; conversely, this combination also produced one of the lowest recalls. Interestingly, the spam features on their own recorded the highest recall. Consequently, all additional features (aside from WT, which maintained the same recall) lowered the recall. Note that adding the PCA features degrades performance across all metrics compared to using the spam features alone. Finally, the combination of spam, char2vec, and WT features generated the best result in terms of f1 score.

Focusing on the performance of the topic and text embedding models specifically, the results indicate that the finer-grained models provide improved performance. As we hypothesized, using the char2vec features for the task of tweet deletion was more effective than using the word2vec features, by a considerable margin. To a lesser extent, the same is true when comparing the WT and LDA features, with WT providing better results across all three metrics. Based on the results of Liu et al. (2015), where TWE-1 outperformed LDA in a multi-class text classification task, this result is to be expected. Unfortunately, the positive effect of LDA and WT remains somewhat mysterious, as manual inspection of the models' topics did not reveal discernible categories of tokens.

6 Conclusion and Future Work

In this work we have detailed how to create a compact, language-independent feature representation for a supervised model that predicts deleted tweets. Specifically, we exploit topic modeling and text embeddings to create this representation. Furthermore, although the predictive model itself is a fixed, supervised model, the features for the model are generated by online algorithms, which is desirable for a system that processes streaming data such as Twitter. The results of our experiments support four key conclusions for our constrained problem: 1) character-level text embeddings (char2vec) are superior to word-level text embeddings (word2vec); 2) by a small yet noticeable margin, word-level topic modeling (WT) is preferable to document-level topic modeling (LDA); 3) in terms of precision, author2vec provides the largest boost in performance; 4) for the tweet deletion task, precision is high but recall is low.

There are numerous avenues for future work that can refine our experiments and potentially improve results. For example, we would like to see whether training the topic and text embedding models on a much larger dataset actually improves overall system performance, as we hypothesize. Also, the features we explore focus almost exclusively on the content of the tweet itself, but it would be possible to include tweet metadata, such as the time of tweet creation and the number of likes, as well as author metadata, such as the date of account creation and the number of followers. These features would be computationally efficient to compute, as they are stored in the JSON data structure provided with the tweet object (https://dev.twitter.com/overview/api/tweets). Lastly, we are particularly interested in the notion of 'unrelated' hashtags, which Twitter warns is an indicator of spam. Thus, given a tweet, we would like to determine how relevant each of its hashtags is and encode this as a feature for our model.

References

Almuhimedi, H.; Wilson, S.; Liu, B.; Sadeh, N.; and Acquisti, A. 2013. Tweets are forever: A large-scale quantitative analysis of deleted tweets. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 897–908. ACM.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.
Boag, W.; Potash, P.; and Rumshisky, A. 2015. TwitterHawk: A feature bucket approach to sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 640–646.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research 9:1871–1874.
Godin, F.; Slavkovikj, V.; De Neve, W.; Schrauwen, B.; and Van de Walle, R. 2013. Using topic models for Twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web Companion, 593–596. International World Wide Web Conferences Steering Committee.
Guo, D., and Chen, C. 2014. Detecting non-personal and spam users on geo-tagged Twitter network. Transactions in GIS 18(3):370–384.
Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; and Witten, I. H. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 11(1):10–18.
Hoffman, M.; Bach, F. R.; and Blei, D. M. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 856–864.
Hong, L., and Davison, B. D. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, 80–88. ACM.
Knapp, M. L.; Stafford, L.; and Daly, J. A. 1986. Regrettable messages: Things people wish they hadn't said. Journal of Communication 36(4):40–58.
Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.
Lin, P.-C., and Huang, P.-M. 2013. A study of effective features for detecting long-surviving Twitter spam accounts. In Advanced Communication Technology (ICACT), 2013 15th International Conference on, 841–846. IEEE.
Ling, W.; Dyer, C.; Black, A.; and Trancoso, I. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Denver, CO.
Liu, Y.; Liu, Z.; Chua, T.-S.; and Sun, M. 2015. Topical word embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Malmi, E.; Takala, P.; Toivonen, H.; Raiko, T.; and Gionis, A. 2015. DopeLearning: A computational approach to rap lyrics generation. arXiv preprint arXiv:1505.04771.
Mccord, M., and Chuah, M. 2011. Spam detection on Twitter using traditional classifiers. In Autonomic and Trusted Computing, 175–186. Springer.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
Miller, Z.; Dickinson, B.; Deitrick, W.; Hu, W.; and Wang, A. H. 2014. Twitter spammer detection using data stream clustering. Information Sciences 260:64–73.
Miura, Y.; Sakaki, S.; Hattori, K.; and Ohkuma, T. 2014. TeamX: A sentiment analyzer with enhanced lexicon mapping and weighting scheme for unbalanced data. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 628–632.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
Petrovic, S.; Osborne, M.; and Lavrenko, V. 2013. I wish I didn't say that! Analyzing and predicting deleted messages in Twitter. arXiv preprint arXiv:1305.3107.
Quinlan, J. R. 2014. C4.5: Programs for Machine Learning. Elsevier.
Sleeper, M.; Cranshaw, J.; Kelley, P. G.; Ur, B.; Acquisti, A.; Cranor, L. F.; and Sadeh, N. 2013. "I read my Twitter the next morning and was astonished": A conversational perspective on Twitter regrets. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 3277–3286. ACM.
Stojanovski, D.; Strezoski, G.; Madjarov, G.; and Dimitrovski, I. 2015. Twitter sentiment analysis using deep convolutional neural network. In Hybrid Artificial Intelligent Systems, 726–737. Springer.
Tang, D.; Wei, F.; Yang, N.; Zhou, M.; Liu, T.; and Qin, B. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, 1555–1565.
Tomar, A.; Godin, F.; Vandersmissen, B.; De Neve, W.; and Van de Walle, R. 2014. Towards Twitter hashtag recommendation using distributed word representations and a deep feed forward neural network. In Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on, 362–368. IEEE.
Zhao, W. X.; Jiang, J.; Weng, J.; He, J.; Lim, E.-P.; Yan, H.; and Li, X. 2011. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval, 338–349. Springer.
Zhou, L.; Wang, W.; and Chen, K. 2015. Identifying regrettable messages from tweets. In Proceedings of the 24th International Conference on World Wide Web Companion, 145–146. International World Wide Web Conferences Steering Committee.