
GeoTweet: A Geographical Analysis of Word Frequency in Tweets Across the U.S.
Garrett Griffin and Hanna Nowicki
2015-12-16
Abstract
GeoTweet is a convenient tool for collecting tweets sent by users in the United States, storing them in a database, and performing simple word-frequency analysis by geography. We implemented a simple caching system on top of the database: when a user searches for a term, the results are stored so that repeating the search returns them immediately, without the processing time normally needed to count the frequency of a word across all of the collected tweets. We use MongoDB to store tweets and to act as the cache, Python to process tweets and interact with MongoDB, JavaScript to handle user interaction and display results, and D3 to render the visualization of word frequency in the tweets.
With GeoTweet, we set out to create a tool for neatly visualizing tweets according to the geographical location of the person who tweeted them, restricted to tweets sent within the United States. We determined that this project would be both easy to use for any user and easy to modify for any developer. Alongside this, our project would provide some interesting, tangible artifacts for linguistic analysis; the graphs produced by our word-frequency visualization of tweets would be likely to reveal interesting linguistic trends, confirming some of our expectations and rejecting others. Ultimately, what we aimed to offer our users was an efficient, easy-to-use tool for creating simple and elegant visualizations of data that provide more information when the user interacts with the visualization.
Goals
Our goal for this project was to create a useful visualization tool using large
amounts of collected data. We both felt that we hadn’t had a chance to work
on a project like this in our time at Middlebury, so this implementation project
presented a good opportunity to bring our other interests into our computer
science work. Another goal that we had was to broaden our understanding of
working with databases and creating a simple web application.
Figure 1: A search for the word "beach"
Accomplishments
Ultimately, we produced a system that allows a user to run a script to collect and parse tweets, storing them in an instance of MongoDB, and then to run a web server that offers a box for entering a search term to look for among the collected tweets. The server then visualizes the frequency of the search term as a heat map of the United States, broken down by state.
As can be seen in Figure 1, the user has entered and searched for the term 'beach'. The heat map of the United States is darkened in areas where the term appears more frequently in the collected tweets, with the term appearing most frequently in the state of Hawaii. If the term does not appear at all in the tweets collected from a given state, that state is displayed as white, as, for example, Alaska and South Dakota. The final feature that needs a brief introduction is the tooltip, which appears near the top center of a state whenever the cursor is placed within its boundaries; it contains information on the search term: the percentage of tweets in which the word appears, the total number of tweets collected from that state, the number of those tweets containing the word, and the average follower count of the users who sent tweets containing the search term.
Implementation
In implementing this project, we used Python and JavaScript, various libraries for these languages, and examples of their usage.
The technology at the base of our chain was Python, which we used to process tweets, parsing out the parts of each tweet in which we were interested, and then to store these tweets in an instance of MongoDB [5]. For our interactions with the Twitter API [9] from Python, we used the wrapper library tweepy, which allows our user to filter a live stream of tweets whenever the script is running. We used Twitter's coordinate filter to decide which tweets we were interested in, providing tweepy with a list of bounding boxes within which to collect tweets. Specifically, we have bounding boxes for the continental United States, for Hawaii, and for Alaska. From there, we only store tweets with the country code "US," and we store each tweet according to its state abbreviation, searching the location data for either a state abbreviation or a state name; in the latter case we map the state name to its abbreviation and store the tweet accordingly. If the location data of a given user is improperly formatted or incomprehensible (for example, we would not be able to tell which state a tweet from 'Nueva York' belongs to), we simply discard the tweet.
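As a rough illustration, the collection script might look something like the following sketch, written against tweepy 3.x and pymongo; the database and collection names (geotweet, tweets), the extract_state helper, the abbreviated state-name mapping, and the bounding-box coordinates are our own placeholders rather than the exact values used in the project.

import tweepy
from pymongo import MongoClient

# Approximate bounding boxes (lon/lat corners): continental US, Hawaii, Alaska.
LOCATIONS = [-125, 24, -66, 50,     # continental United States
             -161, 18, -154, 23,    # Hawaii
             -170, 52, -130, 72]    # Alaska

# Mapping of full state names to postal abbreviations (truncated for brevity).
STATE_ABBREVS = {"california": "CA", "vermont": "VT", "hawaii": "HI"}

def extract_state(full_name):
    """Pull a state abbreviation out of a Twitter place name such as
    'Portland, OR' or 'Vermont, USA'; return None if we cannot tell."""
    for part in (p.strip() for p in full_name.split(",")):
        if len(part) == 2 and part.isupper():
            return part
        if part.lower() in STATE_ABBREVS:
            return STATE_ABBREVS[part.lower()]
    return None  # improperly formatted or incomprehensible location data

class GeoTweetListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(GeoTweetListener, self).__init__(api)
        self.tweets = MongoClient().geotweet.tweets  # assumed collection name

    def on_status(self, status):
        place = status.place
        if place is None or place.country_code != "US":
            return  # keep only tweets explicitly tagged as sent from the US
        state = extract_state(place.full_name)
        if state is None:
            return  # throw out tweets whose state we cannot determine
        self.tweets.insert_one({
            "state": state,
            "text": status.text,
            "followers": status.user.followers_count,
        })

if __name__ == "__main__":
    # Placeholder credentials; the real script would load them from a config file.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    tweepy.Stream(auth, GeoTweetListener()).filter(locations=LOCATIONS)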
Next, we have a search bar that we implemented using jQuery and Bootstrap [2]. This bar listens for the user hitting the 'enter' key or pressing the submit button; when this event occurs, we send a request to a Python script that runs a server on port 5000 of localhost using Flask [6], a very lightweight framework for running Python servers. This Python server, upon receiving a request that contains a search term, checks our cache to see if the term has been searched before, and if it has, returns the JSON object representing the counts of that search term across the states. If the term has not been searched before, the server iterates over every tweet object in our database instance, checking each tweet to see if it contains the search term and, if it does, incrementing the count of tweets for that state while keeping track of the average follower count of the users who sent tweets containing that word. Either way, we eventually receive a JSON-formatted object that maps state abbreviations to the various values that we display in our tooltip.
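A condensed sketch of what this server-side logic might look like follows; the route name, the database and collection names, and the document fields are assumptions carried over from the previous sketch, and the per-state totals and percentages shown in the tooltip are omitted for brevity.

from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient().geotweet          # assumed database name
tweets, cache = db.tweets, db.cache  # tweet store and search-term cache

@app.route("/search")
def search():
    term = request.args.get("term", "").lower()

    # Cache hit: return the previously computed per-state results immediately.
    cached = cache.find_one({"term": term})
    if cached is not None:
        return jsonify(cached["counts"])

    # Cache miss: scan every stored tweet and tally matches per state.
    counts = {}
    for tweet in tweets.find():
        if term not in tweet["text"].lower():
            continue
        entry = counts.setdefault(tweet["state"], {"matches": 0, "followers": 0})
        entry["matches"] += 1
        entry["followers"] += tweet["followers"]
    for entry in counts.values():
        entry["avg_followers"] = entry["followers"] / entry["matches"]

    cache.insert_one({"term": term, "counts": counts})  # store for next time
    return jsonify(counts)

if __name__ == "__main__":
    app.run(port=5000)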
The final aspect of our project is the map visualization itself. The map is displayed using the TopoJSON JavaScript library, which receives a JSON file containing coordinates defining the borders of the states and creates an Albers projection of the United States from these coordinates. After this, we bind our data to each state using the JSON object returned by our search. We color states according to the relative frequency of the search term in a given state, that is, the fraction of that state's collected tweets that contain the term; so one matching tweet in Montana is, in theory, roughly equivalent to one hundred matching tweets in California. Finally, we add a tooltip on top of each state, so that whenever the mouse cursor is within the boundaries of a given state, the tooltip pops up, displaying the information about tweets containing the search term in that state.
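To make the normalization concrete, the shading value for each state is the search term's relative frequency: the number of matching tweets divided by the total number of tweets collected from that state. The counts in the small sketch below are made up purely for illustration.

def relative_frequency(matching_tweets, total_tweets):
    """Fraction of a state's collected tweets that contain the search term."""
    return matching_tweets / total_tweets if total_tweets else 0.0

# One match among 300 Montana tweets shades the state the same as
# 100 matches among 30,000 California tweets.
assert relative_frequency(1, 300) == relative_frequency(100, 30000)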
Figure 2: A search for the word "hella"
Results
Our Twitter visualization tool returned several interesting results for different search terms. Displayed here are a few examples with brief explanations and proposed reasons for these results.
One of the use cases for our application that we found most interesting is
local slang and dialect analysis. In Figure 2, we can see that the search term
’hella’ is used more and more frequently the further west we go, which confirms
what we would expect, as the term is originally Northern California slang.
As we can see in Figure 3, a striking example of local language usage in tweets is the frequency of the word 'aloha' in the state of Hawaii. This map does not take much explaining: users in Hawaii tweet the word 'aloha' at a much higher rate than other Americans simply because it is part of everyday local language.
As illustrated by Figure 4, one example of geographical themes or trends in
tweets can be found with the search term ’ski,’ which occurs at an extremely
high frequency in the American northeast, particularly in the state of Vermont,
where skiing is less a hobby and more a lifestyle.
As can be seen from these few examples, our application has many different
use cases in linguistic analysis and can display interesting trends in geographical
analysis.
Figure 3: A search for the word "aloha"
Figure 4: A search for the word "ski"
Difficulties
In order to accomplish what we set out to, we needed to use a few technologies
that neither of us was very familiar with. We both took software development
this semester, which was a great help later in the semester because of its focus
on JavaScript and databases. However, at the start of the semester neither of us
had a good understanding of JavaScript or databases, and this lack of foundation
gave us some difficulties in creating a plan for implementation.
The project was introduced very early in the semester, so we had to make
long-term decisions based on little knowledge, and we ran into some problems
later because of this. Looking now at our data and how structured it is, we realize that we perhaps should have used a SQL database instead of MongoDB, but it took us too long to come to this realization.
Another area of the implementation that presented some difficulties was
using D3 for the visualizations. We found D3 to be somewhat finicky and our
inexperience with programming in it made using it frustrating at times.
We had intended to integrate node.js into the project, and had tried for a
while to accomplish this, but neither of us had a good enough understanding
of it to get it completely functional before the final deadline. We ended up
having to move away from the idea of using node.js near the end of the semester
and instead used a Python web server to run our Python code alongside our
JavaScript code. Though this ended up working out, it gave us a lot of difficulty
up until the project due date.
As expected, we ran into difficulties with time management, which was especially problematic because of the project’s self-guided nature. The project demo
gave us a goal to work towards in the first half of the semester, but following
the demo and demo presentations, we had difficulties structuring our time to
keep on a good trajectory in the second half of the semester. Because of this,
we ended up having to complete a large portion of the project near the end of
the semester, which did not leave us with the freedom to experiment and think
deeply about the best ways to accomplish things.
In general, as we went along, we found that our initial goals for the project were grander than we could achieve under the time constraints, and we ended up having to scale the project back in order to meet the final deadline.
Future Work
In the future, we would like to add a range of visualizations to GeoTweet for
users. These added visualizations, such as a zooming function that would allow
users to view statistics of word usage in specific towns or cities in the U.S., or a
timeline that would allow users to analyze tweet data at different times, would
build on the current map visualization and provide more tools to allow users to
effectively examine linguistic trends in tweets across the US.
When we showed our program to another student to ask their opinion about
what they would like to see in addition to the visualization that we have in place,
the student stated that more statistics on the posters of the tweets, such as their
age, would be helpful in analyzing trends in the data. From the development
side, we would like to make our program more efficient, so that new user queries would display much more quickly. Currently, our program displays results immediately when queried for a word it has previously searched, as these results are cached when they are first returned, but a newly searched word takes around 20 seconds to load.
A feature that we weren’t able to incorporate into GeoTweet during this
semester is updating tweet data in real time. Currently, we have a Twitter stream parser that pulls tweets into a Mongo database when we run it, and then we run the program from these stored tweets instead of using current tweets to update the statistics we display. This would be a helpful feature to include in the future because trends in social media are continually changing, and unless we frequently stream more tweets into our database, the information displayed does not reflect these changes well.
Another improvement that we could make in our next steps would be to
improve our parser. As it stands, we take the user's input and collect tweets
from the database that match this input exactly. In the future, we could change
our parser to deal with misspellings and punctuation in order to display more
useful results for users.
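As a first step in that direction, one possible approach (a sketch under our own assumptions, not something currently in GeoTweet) would be to lowercase both the query and the tweet text and strip punctuation before comparing, with misspelling tolerance, for example via edit distance, layered on top later.

import string

def normalize(text):
    """Lowercase and strip punctuation so that 'BEACH!!' matches 'beach'."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tweet_matches(term, tweet_text):
    # Compare the normalized term against whole words of the normalized tweet.
    return normalize(term) in normalize(tweet_text).split()

# Illustrative example:
print(tweet_matches("beach", "Heading to the BEACH!!"))  # True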
We would also have liked to get our webpage hosted on an external server,
since we are currently running the program locally on our computers, but we
did not have time to accomplish this during the class.
From a linguistics point of view, there are many more complex analyses that
we could make in addition to our analysis of word frequency according to state.
Though this semester we focused on learning new technologies to make this
project possible, it would be interesting to focus more on research in linguistics
to add depth and utility to the project. We researched a few papers on linguistic analysis of Twitter, such as determining personality traits from Twitter profiles and language usage [8], using machine learning to classify sarcasm in tweets [4], and work on linguistic register [7], that would be interesting to return to in future work.
Initially, we had thought that it would be interesting to analyze word usage on a global scale, incorporating translation services into our project in the process, so that a user could search for a word in one language and view statistics on the equivalent words in different languages across the globe. We quickly realized that we would have to scale our project back, but as Twitter is widely used around the world, incorporating more countries' data into our visualizations would increase GeoTweet's usefulness and relevance.
Conclusion
In spite of the challenges, through this project we created a useful and easy-to-use tool for analyzing tweets by state across the U.S. The project produced some interesting results and acted as a way for us to improve our coding skills and widen our knowledge of different technologies like D3, JavaScript, and MongoDB. We are interested in continuing to develop this project in the future.
References
[1] Douglas Biber. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press, 1995.

[2] Bootstrap. Bootstrap Components. http://getbootstrap.com/components/. Accessed: 2015-11-15.

[3] Mike Bostock. Bostock's Blocks. http://bl.ocks.org/mbostock. Accessed: 2015-09-28.

[4] CC Liebrecht, FA Kunneman, and APJ van den Bosch. The perfect solution for detecting sarcasm in tweets #not. 2013.

[5] MongoDB. The MongoDB 3.2 Manual. https://docs.mongodb.org/manual/. Accessed: 2015-10-15.

[6] Armin Ronacher. Flask. http://flask.pocoo.org. Accessed: 2015-11-18.

[7] Harold Schiffman. Linguistic Register. http://ccat.sas.upenn.edu/~haroldfs/messeas/regrep/node2.html. Accessed: 2015-09-30.

[8] Chris Sumner, Alison Byers, Rachel Boochever, and Gregory J Park. Predicting dark triad personality traits from Twitter usage and a linguistic analysis of tweets. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 2, pages 386-393. IEEE, 2012.

[9] Twitter. Twitter API Overview. https://dev.twitter.com/overview/api. Accessed: 2015-09-30.

[10] Nick Qi Zhu. Data Visualization with D3.js Cookbook. Packt Publishing Ltd, 2013.