Technical Document

Search Query Database Visualization: A Directed Graph Approach
Brent Arata
Abstract:
To understand what people really look for online when they type something into a search
engine, one must understand how frequently different kinds of things are searched for. I try to accomplish
this by drawing a force-directed graph representation of AOL's search queries for the entire year of 2006.
The space occupied by such a vast quantity of data is a mystery to most people. My goal for this project is to
quantify what people are really interested in by dissecting large chunks of data. Unlike other
information visualization techniques used by other programs, the frequency of a particular word is
conveyed by the size of the node that contains that word. For instance, suppose that cheesecake was
the most popular word searched for during January 2006. Then the largest node on the screen would be
labeled cheesecake. If other large clusters of words used the same phrase as another node, they would
all be connected to that particular node. This technique is also very useful for identifying large clusters
of similar words.
Introduction:
What do people look for online that captivates their attention for so long? Is it something
about the topics we find online that keeps us glued to our computers, or could it be something else?
Visualizing what people look for on the internet would give us good insight into why people are
always glued to their computers. Does their interest change over time? Does it change with the season or
with the time of day? A force-directed graph layout of the topics people look up on the internet can
give us a fairly clear picture of what people are interested in. My goal is to visualize the frequency with
which people look up certain phrases on the internet and to see what people are interested in over
time. The software I used to visualize my results was Adobe Flash's ActionScript 3 with the Flex
compiler. The dataset I used was a whole year's worth of queries released by AOL for 2006. This
dataset contained no censoring of any kind, so the queries range anywhere from exotic pornographic
search requests to topics relating to governmental security agencies; topics that most citizens would
not be able to find easily online without a special plugin. With nothing held back, I visualized the whole
data set of everyone's search queries, and some really surprising results came out of it.
This paper presents some of these surprising results, as well as my own inferences as to what
goes through people's minds when they are browsing the internet. I was inspired to create this project
by d3's force-directed graph representations, which show the relations between particular groups of
objects by connecting them with lines. I originally wanted to create a tree representation of the data,
but unfortunately, constructing a correlation between one data point and all other data points proved
too difficult. Also, as a viewer, it would make no sense if a relationship between two pieces of data
could not be one-to-one. Browser users are always switching back and forth between links. For
instance, the back button in Internet Explorer, Mozilla Firefox, and Google Chrome would not even exist
if users did not want to go back to a previous topic they were interested in. If I were to visualize this
project using a tree structure, the grouping of data would be very rigid and inflexible, unlike the
internet.
Background:
People have tried to visualize what people look for on the internet in many ways in the past.
One of these visualization techniques is probably something you learned in your earlier years:
pie charts. Pie charts, as well as bar charts, were among the very first tools for finding correlations between
one or more datasets that lie on some given domain. Pie charts have been used to classify what people
look for on the internet according to internet culture's various sub-cultures. For example, pie charts have
been used to show correlations between how many people prefer one particular item over another.
However, there is one major problem with this type of visualization. Percentages are always implied.
This means that a bias toward a given set of data during a given period of time is always shown. That
data could have changed significantly from when it was printed until now, so the data could
potentially be false. My visualization solves that problem entirely.
Directed graphs have been a topic of study in graph theory, the study of pairwise relations
between discrete objects. A graph can relate two discrete objects with an edge that connects
them. Some graphs have edges that can only be traversed in one direction. This implies that
there exists a relation between an ordered pair of objects. These types of graphs are called directed
graphs. These types of graphs can change more dynamically than a pie chart because the edges that
connect the nodes (objects) of a graph can expand and contract much more easily than a pie chart. In order
to expand or shrink the frequencies of the values shown in a pie chart, the chart must be re-rendered every
time they are recalculated. If multiple frequencies are changing on the pie chart, the viewer can get
confused about what the data is visualizing. A graph approach to representing data points makes it much
more convenient for the viewer to observe changes within a given dataset. The graph also has the
benefit of chaining relationships together more easily without cluttering the field of view.
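To make the directed-graph structure concrete, the following sketch (written in the same ActionScript 3 I use later for my application) stores a directed graph as an adjacency list. The class and method names are illustrative only and are not taken from my application.

    package
    {
        import flash.utils.Dictionary;

        // Illustrative sketch of a directed graph stored as an adjacency list.
        // Each node label maps to the labels reached by its outgoing edges.
        public class DirectedGraph
        {
            private var edges:Dictionary = new Dictionary();

            // Add a directed edge from "source" to "target".
            public function addEdge(source:String, target:String):void
            {
                if (edges[source] == null)
                {
                    edges[source] = new Array();
                }
                (edges[source] as Array).push(target);
            }

            // Out-degree: the number of edges leaving a node.
            public function outDegree(node:String):int
            {
                var targets:Array = edges[node] as Array;
                return targets == null ? 0 : targets.length;
            }
        }
    }

Expanding or contracting such a graph only touches the adjacency lists of the affected nodes, which is part of why it reacts to changes more gracefully than a fully re-rendered pie chart.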
Method:
My toolkit was pure Adobe ActionScript 3 code, with the Flash development kit as my interface
for writing code. My application only parses a small portion of the AOL search query data due to the sheer
size of the data. If I were to try to feed in all the data at once, it could take days for all of it to be
visualized. This way, one can see the results of my visualization in units, so that I can examine particular
seasons of a given year. However, the user can choose to load all the data into my
application at once, if he or she has the patience to observe it all. I plotted the nodes of my graph at random
points on the view screen. I was influenced to do this by the sheer magnitude of the AOL data, because
the data itself had over three hundred million data points that the application had to plot. The data
points are all colored green, but if you look closely at them, you will see that some points
are drawn in different shades of green. The darker shade represents a source node. A source node
is a node that my visualizer finds to be unique among all the other nodes found. Its uniqueness can range
from something as simple as a unique URL search tag to a unique hyponym that no other query shares. The
lighter shade of green represents a node that shares an attribute with another node. All lighter green
nodes are connected by an edge to their respective source node. All nodes can have their information
decoded with a simple mouse-over. The information displayed for each node contains the
query id of the node, the date on which it was searched for, and the search tag.
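As a rough illustration of how a node can carry its query information and reveal it on mouse-over, the sketch below builds a node as a Flash Sprite. The class name QueryNode, the field names, and the two shades of green are placeholders of my own, not the identifiers or colors actually used in my application.

    package
    {
        import flash.display.Sprite;
        import flash.events.MouseEvent;
        import flash.text.TextField;

        // Illustrative node sprite: a darker green fill marks a source node,
        // a lighter green fill marks a node linked to a source. The query id,
        // date, and search tag are revealed on mouse-over.
        public class QueryNode extends Sprite
        {
            private static const SOURCE_GREEN:uint = 0x006600; // darker shade (assumed)
            private static const LINKED_GREEN:uint = 0x66CC66; // lighter shade (assumed)

            private var tooltip:TextField = new TextField();

            public function QueryNode(queryId:String, queryDate:String,
                                      searchTag:String, isSource:Boolean)
            {
                // Draw the node as a small filled circle.
                graphics.beginFill(isSource ? SOURCE_GREEN : LINKED_GREEN);
                graphics.drawCircle(0, 0, 5);
                graphics.endFill();

                // Hidden text field holding the decoded information.
                tooltip.text = "id: " + queryId + "\ndate: " + queryDate +
                               "\ntag: " + searchTag;
                tooltip.visible = false;
                addChild(tooltip);

                // Show the information on mouse-over, hide it again on mouse-out.
                addEventListener(MouseEvent.MOUSE_OVER, showTooltip);
                addEventListener(MouseEvent.MOUSE_OUT, hideTooltip);
            }

            private function showTooltip(e:MouseEvent):void { tooltip.visible = true; }
            private function hideTooltip(e:MouseEvent):void { tooltip.visible = false; }
        }
    }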
Source nodes increase in size as more relationships are found between the source nodes and
their respective neighbors. The z-axis of a source node is also proportional to its number of outgoing
edges. This way, more popular nodes stand out more than less popular nodes, because they represent the
main queries that people searched for and are most interested in. In other words, this implies that there
is a one-to-one relationship between how big a source node is and the overall online community's
level of interest. Why would people be searching for the same thing in a search engine if they were not
interested in the topic that they were searching for?
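The size rule can be expressed in a couple of lines. The sketch below is a simplified stand-in for whatever scaling the application actually performs, and the growth factor of 0.1 per outgoing edge is an arbitrary assumption for illustration.

    import flash.display.Sprite;

    // Hedged sketch: a source node's on-screen scale grows in proportion
    // to its out-degree, so heavily linked queries stand out.
    // The 0.1 growth factor is an assumed, illustrative constant.
    function scaleSourceNode(node:Sprite, outDegree:int):void
    {
        var factor:Number = 1 + 0.1 * outDegree;
        node.scaleX = factor;
        node.scaleY = factor;
    }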
All nodes were stored in their respective hashmaps, but when it came to figuring out whether or
not a node had been visualized, a red-black tree facilitated the retrieval of the visible queries on
the screen. I chose to use a red-black tree for its efficient search method without the memory cost of a
hashmap.
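ActionScript 3 has no built-in red-black tree, so in the sketch below the RedBlackTree class (with its contains and insert methods) is a hypothetical stand-in for the one I wrote; only Dictionary, Flash's hashmap, comes from the standard API, and QueryNode is the placeholder class from the earlier sketch. This shows the bookkeeping idea rather than the actual implementation.

    import flash.utils.Dictionary;

    // Illustrative bookkeeping: a Dictionary (hashmap) maps each search phrase
    // to its source node, while a balanced tree tracks which query ids are
    // already visible on screen. RedBlackTree is a hypothetical stand-in class.
    var sourcesByPhrase:Dictionary = new Dictionary();
    var visibleQueries:RedBlackTree = new RedBlackTree(); // hypothetical class

    function registerQuery(queryId:int, phrase:String, node:QueryNode):void
    {
        // The first node seen for a phrase becomes that phrase's source node.
        if (sourcesByPhrase[phrase] == null)
        {
            sourcesByPhrase[phrase] = node;
        }

        // The red-black tree gives an O(log n) check for whether this query
        // has already been drawn, without the memory cost of another hashmap.
        if (!visibleQueries.contains(queryId))
        {
            visibleQueries.insert(queryId);
        }
    }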
Analysis:
I was only able to visualize a piece of the information given to me. I noticed that the most
popular phrases that people searched for online pertained either to pop culture or to other search engines,
such as Google. Around holidays such as Christmas, shopping was the number one phrase searched for
by AOL users. This shows a correlation between the holidays and what people look for online.
Therefore, the season does affect what people look for online, which was quite fascinating but not
surprising. I noticed that Jesse McCartney was the most popular phrase searched for online, due to his
fame as a teen actor as well as his first album being released around that time.
Although the directed graph design is great for dealing with rapid changes, it has its faults for
visualizing information from a dataset. As more edges connect to a particular source node, the node
gets larger. The z-axis of a particular node is proportional to the number of outgoing edges that come
from that node. Source nodes can get so big that they completely occlude other, less popular
nodes in the background.
This may seem like a good thing, but less significant yet still important nodes, even ones much
larger than most of their neighbors, can be completely occluded by one very large neighbor. The viewer
is then unable to decode any other relevant information around that large source node. For
instance, the source node for the search tag "Jesse McCartney" occluded the screen so completely that
no other node could be clicked on except Jesse's source.
This issue is a consequence of the random placement of my nodes as they are found and
analyzed from my dataset. I did not implement any sort of node sorter that ordered each tree (source-link
relation) by some metric to reduce clutter. I chose not to do this because I wanted the viewer to
be just as uncertain as I was, so that they would stay engaged with viewing the data. The viewer can get a sense
of the data very quickly by identifying the ever-expanding nodes and figuring out what type of query each one was. Since my
goal was to visualize what people were interested in over time, my model does that perfectly without
having to worry about the details of looking at all popular nodes. If I were to visualize all popular nodes
and scale them to fit in my viewing window, the viewer would not feel any engagement with my
application, because the viewer would have no sense of which nodes had been added to the dataset and
which had not. The size of the non-source nodes in my application gives the viewer a sense of
virtual activity or traffic, which was a message I was trying to convey about the internet.
Related Work:
I have also done some database visualization using a force-directed graph, to visualize whether movie sets that
work well with each other continue to work with each other. A project created a few years
ago called Google TouchGraph is also very similar to what I did for my visualization. TouchGraph uses
a force-directed graph that connects one website with another through an edge link. The user
can hover over a node to see its contents, such as the URL, the name of the webpage, and so on.
Future Work:
If I were to continue working on this project, I would like to implement some sort of camera
system so that the viewer could see the total visualization over time instead of only a small piece of it. I also
wasn't able to get the text to display correctly, because each node serves as the container for its rollover text.
If the node was scaled upward, the image data of the text was distorted as well.
Other work that I would like to implement is a pause button to stop the grabbing of
queries from the dataset. This way, the viewer would be able to view the contents of each node
individually. I would also like to work on the placement of nodes. Instead of them being placed
randomly on the screen, I would like them to be placed in some sort of pseudo-order to reduce
clutter later on.
References:
1. Sherman, Chris. Visualizing the Web with Google.
http://searchenginewatch.com/article/2067764/Visualizing-the-Web-with-Google
2. Thinkmap. http://www.thinkmap.com/
3. Norgard, Barbara, and Youngin Kim. Adding Natural Language Processing Techniques to the Entry
Vocabulary Module Building Process.
http://metadata.sims.berkeley.edu/papers/nlptech.html