GeoTweet: A Geographical Analysis of Word Frequency in Tweets Across the U.S.

Garrett Griffin and Hanna Nowicki

2015-12-16

Abstract

GeoTweet is a tool for collecting tweets sent by users in the United States, storing them in a database, and performing simple word-frequency linguistic analysis by geography. We implemented a simple caching system on top of the database: when a user searches for a term, the results of the search are stored so that later searches for the same term can be pulled up without the processing time it would normally take to count the frequency of a word across all of the collected tweets. We used MongoDB to store tweets and to act as the cache, Python to process tweets and interact with MongoDB, JavaScript to handle user interaction and display the results, and D3 to render the visualization of word frequency in the tweets.

With GeoTweet, we set out to create a tool for neatly visualizing tweets according to the geographical location of the person who sent them, restricted to tweets sent within the United States. We wanted the project to be both easy to use for any user and easy to modify for any developer. Alongside this, the project would provide tangible artifacts for linguistic analysis: the word-frequency maps it produces are likely to reveal interesting linguistic trends, confirming some of our expectations and rejecting others. Ultimately, we aimed to offer users an efficient, easy-to-use tool that creates simple, elegant visualizations and reveals more information as the user interacts with them.

Goals

Our goal for this project was to create a useful visualization tool built on a large amount of collected data. Neither of us had had a chance to work on a project like this in our time at Middlebury, so this implementation project was a good opportunity to bring our other interests into our computer science work. We also wanted to broaden our understanding of working with databases and of building a simple web application.

Figure 1: A search for the word "beach"

Accomplishments

Ultimately, we produced a system in which a user runs a script to collect and parse tweets, storing them in an instance of MongoDB, and then runs a web server, which offers a box for entering a search term to look for among the collected tweets. The server then visualizes the frequency of the search term as a heat map of the United States, broken down by state. In Figure 1, the user has searched for the term "beach." The map is darkened in states where the term appears more frequently in the collected tweets; here it appears most frequently in Hawaii. If the term does not appear at all in the tweets collected from a given state, that state is displayed as white, as, for example, Alaska and South Dakota are in this figure.

The final feature that needs a brief introduction is the tooltip, which appears near the top of a state whenever the cursor is placed within its boundaries. It shows information about the search term in that state: the percentage of tweets in which the word appears, the total number of tweets collected from the state, the number of those tweets containing the word, and the average follower count of the users who sent tweets containing the term.
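As a rough orientation before the implementation details in the next section, the collection step described above amounts to something like the following sketch. It assumes the tweepy and pymongo libraries; the database and collection names, the environment-variable credentials, the bounding-box values, and the tiny parse_state helper are hypothetical stand-ins rather than our exact code.

```python
import os
import tweepy
from pymongo import MongoClient

# Approximate bounding boxes (sw_lon, sw_lat, ne_lon, ne_lat) for the
# continental U.S., Hawaii, and Alaska, for Twitter's location filter.
US_BOXES = [
    -125.0, 24.4, -66.9, 49.4,    # continental United States
    -160.6, 18.8, -154.5, 22.3,   # Hawaii
    -170.0, 52.0, -129.0, 71.5,   # Alaska
]

# Tiny stand-in for the full state-name-to-abbreviation mapping.
STATE_ABBREVS = {"new york": "NY", "california": "CA", "hawaii": "HI"}

client = MongoClient()                    # local MongoDB instance
tweets = client["geotweet"]["tweets"]     # hypothetical database/collection names


def parse_state(place_name):
    """Map a Twitter place name such as "Portland, OR" to a state abbreviation."""
    if not place_name:
        return None
    region = place_name.rsplit(",", 1)[-1].strip()
    if len(region) == 2 and region.isalpha():
        return region.upper()
    return STATE_ABBREVS.get(place_name.split(",")[0].strip().lower())


class GeoListener(tweepy.StreamListener):
    def on_status(self, status):
        place = status.place
        if place is None or place.country_code != "US":
            return                        # keep only tweets tagged "US"
        state = parse_state(place.full_name)
        if state is None:
            return                        # unrecognizable location: throw it out
        tweets.insert_one({
            "state": state,
            "text": status.text,
            "followers": status.user.followers_count,
        })


auth = tweepy.OAuthHandler(os.environ["TWITTER_CONSUMER_KEY"],
                           os.environ["TWITTER_CONSUMER_SECRET"])
auth.set_access_token(os.environ["TWITTER_ACCESS_TOKEN"],
                      os.environ["TWITTER_ACCESS_SECRET"])
tweepy.Stream(auth, GeoListener()).filter(locations=US_BOXES)
```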
Implementation

In implementing this project, we used Python and JavaScript, various libraries for these languages, and examples of their usage. At the base of our stack is Python, which we used to process tweets, parsing out the parts of each tweet we were interested in, and to store the results in an instance of MongoDB [5]. For our interactions with Twitter from Python, we used tweepy, a wrapper library for the Twitter API [9], which lets the user filter a live stream of tweets while the collection script is running. We used Twitter's coordinate filter to decide which tweets we were interested in, providing tweepy with a list of bounding boxes within which to collect tweets: one for the continental United States, one for Hawaii, and one for Alaska. From there, we only store tweets with the country code "US," and we store each tweet under its state abbreviation, searching the location data for either a state abbreviation or a state name; in the latter case we map the state name to its abbreviation and store the tweet accordingly. If the location data of a given tweet is improperly formatted or incomprehensible (for example, we cannot tell which state a tweet located in "Nueva York" belongs to), we simply throw it out.

Next, we implemented a search bar using jQuery and Bootstrap [2]. The bar listens for the user pressing the enter key or clicking the submit button; when this happens, we send a request to a Python script running a server on port 5000 of localhost using Flask [6], a lightweight framework for Python web servers. Upon receiving a request containing a search term, the server checks our cache to see whether the term has been searched before; if it has, the server returns the cached JSON object with the counts of that term across the states. If the term has not been searched before, the server iterates over every tweet in our database, checking whether each tweet contains the term and, if it does, incrementing the count for that tweet's state while keeping a running average of the follower counts of the users who sent matching tweets. Either way, the client eventually receives a JSON object mapping state abbreviations to the values we display in the tooltip.

The final aspect of our project is the map visualization itself. The map is displayed using the topoJSON JavaScript library, which takes a JSON file containing the coordinates that define state borders and produces an Albers projection of the United States. We then bind the JSON object returned by the search to each state. States are colored according to the relative frequency of the term in that state's tweets, so one matching tweet in Montana counts, in theory, for about as much as one hundred matching tweets in California. Finally, we add a tooltip to each state, so that whenever the cursor is within a state's boundaries, a tooltip pops up displaying the information about tweets containing the search term in that state.
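The server-side search path just described can be condensed into a short sketch. The route, database, collection, and field names below are hypothetical placeholders, and the counting and caching logic is simplified; it is meant only to illustrate the check-cache-then-scan behaviour, not to reproduce our implementation.

```python
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()["geotweet"]            # hypothetical database name
tweets, cache = db["tweets"], db["cache"]


@app.route("/search")
def search():
    term = request.args.get("term", "").lower()

    # If the term has been searched before, return the cached per-state counts.
    cached = cache.find_one({"term": term})
    if cached is not None:
        return jsonify(cached["states"])

    # Otherwise scan every stored tweet, tallying matches and follower counts by state.
    states = {}
    for tweet in tweets.find():
        entry = states.setdefault(tweet["state"],
                                  {"count": 0, "total": 0, "followers": 0})
        entry["total"] += 1
        if term in tweet["text"].lower():
            entry["count"] += 1
            entry["followers"] += tweet["followers"]
    for entry in states.values():
        entry["avg_followers"] = (
            entry["followers"] / float(entry["count"]) if entry["count"] else 0
        )

    cache.insert_one({"term": term, "states": states})   # store the result for next time
    return jsonify(states)


if __name__ == "__main__":
    app.run(port=5000)
```

Because a cache hit skips the full scan entirely, repeated searches for the same term return almost immediately, which is the behaviour noted later in the Future Work section.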
Figure 2: A search for the word "hella"

Results

Our Twitter visualization tool returned several interesting results for different search terms. Displayed here are a few examples, with brief explanations of why we think we got these results.

One of the use cases we found most interesting is the analysis of local slang and dialect. In Figure 2, we can see that the term "hella" is used more and more frequently the further west we go, which matches our expectations, as the term is originally Northern California slang.

As Figure 3 shows, a striking example of local language use in tweets is the frequency of the word "aloha" in Hawaii. This map does not take much explaining: "aloha" is part of everyday language in Hawaii, so residents there use it at a much higher rate than other Americans.

As illustrated by Figure 4, one example of a geographical theme in tweets is the search term "ski," which occurs at an especially high frequency in the American Northeast, particularly in Vermont, where skiing is less a hobby and more a lifestyle.

As these examples show, our application has many different uses in linguistic analysis and can reveal interesting geographical trends.

Figure 3: A search for the word "aloha"

Figure 4: A search for the word "ski"

Difficulties

To accomplish what we set out to do, we needed to use a few technologies that neither of us was very familiar with. We both took software development this semester, which was a great help later on because of its focus on JavaScript and databases. At the start of the semester, however, neither of us had a good understanding of either, and this lack of foundation made it difficult to plan the implementation. The project was introduced very early in the semester, so we had to make long-term decisions based on little knowledge, and we ran into problems later because of this. Looking at our data now and how structured it is, we realize we probably should have used a SQL database instead of MongoDB, but it took us too long to come to this realization.

Another area that presented difficulties was using D3 for the visualization. We found D3 somewhat finicky, and our inexperience with it made programming frustrating at times. We had intended to integrate node.js into the project and tried for a while to accomplish this, but neither of us understood it well enough to get it fully working before the final deadline. Near the end of the semester we moved away from node.js and instead used a Python web server to run our Python code alongside our JavaScript code. Though this ended up working, it caused a lot of difficulty right up to the project due date.

As expected, we also ran into difficulties with time management, which was especially problematic given the project's self-guided nature.
The project demo gave us a goal to work towards in the first half of the semester, but after the demo and demo presentations we had trouble structuring our time in the second half. Because of this, we ended up completing a large portion of the project near the end of the semester, which did not leave us the freedom to experiment and think deeply about the best ways to accomplish things. In general, we found as we went along that our initial goals had been grander than we could achieve under the time constraints, and we had to scale the project back in order to meet the final deadline.

Future Work

In the future, we would like to add a wider range of visualizations to GeoTweet. Additions such as a zooming function that would let users view word-usage statistics for specific towns or cities, or a timeline that would let users analyze tweet data over time, would build on the current map and give users more tools for examining linguistic trends in tweets across the U.S. When we showed our program to another student and asked what they would like to see in addition to the current visualization, they suggested that more statistics about the posters of the tweets, such as their age, would be helpful in analyzing trends in the data.

From the development side, we would like to make the program more efficient so that new queries display much more quickly. Currently, the program displays results immediately for a word that has been searched before, since those results are cached when they are first returned, but a newly searched word takes around 20 seconds to load.

A feature we were not able to incorporate this semester is updating tweet data in real time. Currently, our Twitter stream parser pulls tweets into a Mongo database when we run it, and the program then works from these stored tweets rather than using current tweets to update the statistics it displays. Real-time updating would be a helpful feature because trends in social media change continually; unless we frequently stream more tweets into the database, the information displayed does not reflect this.

Another improvement would be to our parser. As it stands, we take the user's input and collect tweets from the database that match it exactly. In the future, we could change the parser to handle misspellings and punctuation in order to display more useful results; a rough sketch of what such matching might look like appears below. We would also have liked to host our webpage on an external server, since we currently run the program locally on our own computers, but we did not have time to accomplish this during the class.

From a linguistics point of view, there are many more complex analyses we could perform beyond word frequency by state. Though this semester we focused on learning the technologies needed to make the project possible, it would be interesting to focus more on linguistics research to add depth and utility to the project. We read a few papers on linguistic analysis of Twitter, including work on predicting personality traits from Twitter profiles and language usage [8], on using machine learning to classify sarcasm in tweets [4], and on linguistic register [7], that it would be interesting to return to in future work.
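As one possible direction for the misspelling and punctuation handling mentioned above, the matching step could be relaxed along the following lines. This is only a sketch using Python's standard library; the tokenization, the similarity measure, and the 0.85 threshold are illustrative choices, not part of the current GeoTweet code.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip punctuation so 'Ski!' and 'ski' compare equal."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def fuzzy_contains(tweet_text, term, threshold=0.85):
    """True if any token in the tweet is an exact or approximate match for the term."""
    term = term.lower()
    return any(
        token == term or SequenceMatcher(None, token, term).ratio() >= threshold
        for token in normalize(tweet_text)
    )


# e.g. fuzzy_contains("Hit the slopes -- sking all weekend!", "skiing") -> True
```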
Initially, we had thought it would be interesting to analyze word usage on a global scale, incorporating translation services into the project so that a user could search a word in one language and view statistics on its equivalents in other languages around the world. We quickly realized we would have to scale the project back, but since Twitter is widely used around the world, incorporating more countries' data into our visualizations would increase GeoTweet's usefulness and relevance.

Conclusion

In spite of the challenges, we created a useful and easy-to-use tool for analyzing tweets by state across the U.S. The project produced some interesting results and served as a way for us to improve our coding skills and widen our knowledge of technologies like D3, JavaScript, and MongoDB. We are interested in continuing to develop this project in the future.

References

[1] Douglas Biber. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press, 1995.

[2] Bootstrap. Bootstrap Components. http://getbootstrap.com/components/. Accessed: 2015-11-15.

[3] Mike Bostock. Bostock's Blocks. http://bl.ocks.org/mbostock. Accessed: 2015-09-28.

[4] CC Liebrecht, FA Kunneman, and APJ van den Bosch. The perfect solution for detecting sarcasm in tweets #not. 2013.

[5] MongoDB. The MongoDB 3.2 Manual. https://docs.mongodb.org/manual/. Accessed: 2015-10-15.

[6] Armin Ronacher. Flask. http://flask.pocoo.org. Accessed: 2015-11-18.

[7] Harold Schiffman. Linguistic Register. http://ccat.sas.upenn.edu/~haroldfs/messeas/regrep/node2.html. Accessed: 2015-09-30.

[8] Chris Sumner, Alison Byers, Rachel Boochever, and Gregory J Park. Predicting dark triad personality traits from Twitter usage and a linguistic analysis of tweets. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 2, pages 386–393. IEEE, 2012.

[9] Twitter. Twitter API Overview. https://dev.twitter.com/overview/api. Accessed: 2015-09-30.

[10] Nick Qi Zhu. Data Visualization with D3.js Cookbook. Packt Publishing Ltd, 2013.