Search Query Database Visualization: A Directed Graph Approach
Brent Arata

Abstract:
To understand what people really look for online when they type something into a search engine, one must understand how frequently the different kinds of searches occur. I try to accomplish this by drawing a force-directed graph representation of AOL's search queries for the entire year of 2006. The shape of such a vast quantity of data is a mystery to most people, and my goal for this project is to quantify what people are really interested in by dissecting large chunks of it. Unlike other information visualization techniques used by other programs, the frequency of a particular word is conveyed by the size of the node that contains it. For instance, suppose that cheesecake was the most popular word searched for during January 2006; the largest node on the screen would then be labeled cheesecake. Any other large clusters of words that used the same phrase would all be connected to that node. This technique is also very useful for identifying large clusters of similar words.

Introduction:
What do people look for online that captivates their attention for so long? Is it something about the topics we find online that keeps us glued to our computers, or could it be something else? Does their interest change over time, with the season, or with the time of day? Visualizing what people look for on the internet gives us a good basis for inferring why people are always glued to their screens, and a force-directed graph layout of the topics they look up can give us a fairly clear picture of what they are interested in. My goal is to visualize the frequency with which people look up certain phrases on the internet and to see what they are interested in over time.

The software I used was Adobe Flash's ActionScript 3 with the Flex compiler. The dataset was a full year's worth of queries released by AOL for 2006. The dataset contained no censoring, so queries could range from exotic pornographic search requests to topics relating to government security agencies; topics that most citizens would not be able to find easily online without a special plugin. With nothing held back, I visualized the whole dataset of everyone's search queries, and some really surprising results came out of it. This paper presents some of those results, along with my own inferences about what goes through people's minds as they browse the internet.

I was inspired to create this project by d3's force-directed graph representations, which show the relations between particular groups of objects by connecting them with lines. I originally wanted to create a tree representation of the data, but constructing a correlation from one data point to every other data point proved too difficult, and it would also confuse a viewer if the relationship between two pieces of data could not be followed in both directions. Browser users constantly move back and forth between links; the back button in Internet Explorer, Mozilla Firefox, and Google Chrome would not even exist if users never wanted to return to a previous topic of interest. If I visualized this data with a tree structure, the grouping would be rigid and inflexible, unlike the internet itself.
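Since the whole visualization rests on a force-directed layout, the following ActionScript 3 sketch shows what one layout tick typically does: every pair of nodes repels, every edge pulls its endpoints together, and velocities are damped so the picture settles. This is only a minimal illustration, not the project's actual code; the layoutStep name, the tuning constants, and the assumption that nodes are plain objects carrying x, y, vx, and vy fields (wrappers around the on-screen sprites) are all made up for this example.

    // Illustrative only: nodes are plain Objects {x, y, vx, vy}; each edge is
    // a two-element Array [source, target]. Run once per frame.
    private function layoutStep(nodes:Array, edges:Array):void {
        const REPULSION:Number = 500;   // how hard unrelated nodes push apart
        const SPRING:Number    = 0.01;  // how hard an edge pulls its endpoints
        const DAMPING:Number   = 0.85;  // bleeds off velocity so motion settles

        // Pairwise repulsion keeps separate clusters from piling up.
        for (var i:int = 0; i < nodes.length; i++) {
            for (var j:int = i + 1; j < nodes.length; j++) {
                var dx:Number = nodes[j].x - nodes[i].x;
                var dy:Number = nodes[j].y - nodes[i].y;
                var d2:Number = Math.max(dx * dx + dy * dy, 0.01);
                var d:Number  = Math.sqrt(d2);
                var f:Number  = REPULSION / d2;
                nodes[i].vx -= f * dx / d;   nodes[i].vy -= f * dy / d;
                nodes[j].vx += f * dx / d;   nodes[j].vy += f * dy / d;
            }
        }

        // Spring attraction draws each linked node toward its source node.
        for each (var e:Array in edges) {
            var ax:Number = e[1].x - e[0].x;
            var ay:Number = e[1].y - e[0].y;
            e[0].vx += SPRING * ax;   e[0].vy += SPRING * ay;
            e[1].vx -= SPRING * ax;   e[1].vy -= SPRING * ay;
        }

        // Integrate the damped velocities into new positions.
        for each (var n:Object in nodes) {
            n.vx *= DAMPING;   n.vy *= DAMPING;
            n.x  += n.vx;      n.y  += n.vy;
        }
    }

Run on every frame, a step like this lets queries that share a source drift into visible clusters while unrelated queries spread apart.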
Background:
People have tried to visualize what people look for on the internet in many ways. One of these techniques is probably something you learned in your earlier years: the pie chart. Pie charts, along with bar charts, were among the very first tools for finding correlations between one or more datasets over some given domain. Pie charts have been used to classify what people look for on the internet according to the internet culture's various sub-cultures; for example, to show how many people prefer one particular item over another. However, there is one major problem with this type of visualization: the percentages are always implied. The chart carries the bias of one particular dataset over one particular period of time, and that data could have changed significantly between the time the chart was produced and now, so what it shows could be outdated or simply false. My visualization is designed to avoid that problem.

Directed graphs have long been a topic of study in graph theory, the study of pairwise relations between discrete objects. A graph relates two discrete objects with an edge that connects them. In some graphs an edge can only be traveled in one direction, which implies a relation on an ordered pair of objects; these are called directed graphs. A graph can change far more dynamically than a pie chart, because the edges connecting its nodes (objects) can expand and contract much more easily. To expand or shrink the frequencies shown in a pie, the chart must be re-rendered every time the values are recalculated, and if several frequencies change at once the viewer can lose track of what the data is showing. A graph representation makes it much more convenient for the viewer to observe changes within the dataset, and it has the added benefit of chaining relationships together without cluttering the field of view.

Method:
My tool kit was pure Adobe ActionScript 3 code, with the Flash development kit as my interface for writing it. My application parses only a small portion of the AOL search query data at a time because of the data's sheer size; if I tried to feed in all the data at once, it could take days for everything to be visualized. Working in units this way lets me examine particular seasons of the year, although the user can choose to load all of the data at once if he or she has the patience to watch it. I plotted the nodes of my graph at random points on the view screen; I chose to do this because of the sheer magnitude of the AOL data, which gave the application over three hundred million data points to plot. The data points are all colored green, but if you look closely you will see that some points are drawn in different shades. The darker shade marks a source node, a node that my visualizer finds to be unique among all the nodes seen so far; its uniqueness could be as simple as a unique URL search tag or a hyponym that no other query shares. The lighter shade of green marks a node that shares an attribute with another node, and every lighter green node is connected by an edge to its respective source node.
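To make the shading scheme concrete, here is a small ActionScript 3 sketch of how a node could be created, dropped at a random point, and colored darker or lighter green depending on whether it is a source. The makeNode function, the edgeLayer sprite, and the two particular color values are illustrative assumptions, not the project's actual API, and the code is assumed to live inside the Flash document class once it has been added to the stage.

    import flash.display.Sprite;

    // Edges go into their own layer so the lines sit underneath the node
    // circles; edgeLayer is assumed to be addChild()ed before any nodes.
    private var edgeLayer:Sprite = new Sprite();

    private function makeNode(isSource:Boolean, sourceNode:Sprite = null):Sprite {
        var node:Sprite = new Sprite();

        // Darker green marks a source (unique) query; lighter green marks a
        // query that shares an attribute with an existing source.
        var shade:uint = isSource ? 0x116611 : 0x66CC66;
        node.graphics.beginFill(shade);
        node.graphics.drawCircle(0, 0, 4);
        node.graphics.endFill();

        // Nodes are dropped at random screen positions, as described above.
        node.x = Math.random() * stage.stageWidth;
        node.y = Math.random() * stage.stageHeight;

        // A lighter (linked) node gets an edge back to its source node.
        if (!isSource && sourceNode != null) {
            edgeLayer.graphics.lineStyle(1, 0x999999);
            edgeLayer.graphics.moveTo(sourceNode.x, sourceNode.y);
            edgeLayer.graphics.lineTo(node.x, node.y);
        }

        addChild(node);
        return node;
    }

In the running layout the edge layer would be cleared and redrawn whenever node positions change, so the lines follow their endpoints while staying beneath the circles and leaving the nodes easy to hover and click.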
The information in any node can be decoded with a simple mouse-over, which displays the node's query id, the date on which it was searched, and the tag that was searched for. Source nodes grow in size as more relationships are found between them and their neighbors, and a source node's size along the z-axis is proportional to its number of outgoing edges. More popular nodes therefore stand out more than less popular ones, because they represent the queries people searched for most and are most interested in. In other words, there is a one-to-one relationship between how big a source node is and the online community's overall level of interest: why would people keep searching for the same thing if they were not interested in it? (A sketch of this mouse-over decoding and node growth appears at the end of the Analysis section.) All nodes were stored and sorted within their respective hashmaps, but to determine whether a node had already been visualized, a red-black tree handled retrieval of the queries visible on screen; I chose a red-black tree for its efficient search without the memory cost of a hashmap.

Analysis:
In the end I was only able to visualize a piece of the information given to me. I noticed that the most popular phrases people searched for pertained either to pop culture or to other search engines, such as Google. Around holidays such as Christmas, shopping was the number one phrase searched for by AOL users, which shows a correlation between the holidays and what people look for online; season does affect what people search for, which was fascinating but not surprising. I also noticed that Jesse McCartney was the most popular phrase searched for overall, owing to his fame as a teen actor and the release of his first album around that time.

Although the directed graph design is great for dealing with rapid changes, it has its faults for visualizing a dataset. As more edges connect to a particular source node, the node gets larger; since a node's size is proportional to its number of outgoing edges, source nodes can grow so big that they completely occlude less popular nodes in the background. This may seem like a good thing, but less significant yet still important nodes, ones much larger than most of their neighbors, can be completely hidden behind one very large neighbor, leaving the viewer unable to decode any other relevant information around it. For instance, the node for the search tag "Jesse McCartney" occluded the screen so thoroughly that no other node could be clicked except Jesse's source. This issue is a consequence of the random placement of my nodes as they are found and analyzed in the dataset. I did not implement any kind of sorter that arranges each tree (source-link relation) by some metric to reduce clutter; I chose not to because I wanted the viewer to be just as uncertain as I was, and therefore engaged with the data. The viewer can get a sense of the data very quickly by spotting the ever-expanding nodes and figuring out what type of query each one represents. Since my goal was to visualize what people were interested in over time, my model does that well without having to show every popular node in detail. If I scaled every popular node to fit within the viewing window, the viewer would not feel any engagement with my application, because there would be no sense of which nodes had just been added to the dataset and which had not. The size of the non-source nodes gives the viewer a sense of virtual activity, or traffic, which is the message I was trying to convey about the internet.
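The occlusion problem matters because all of a node's information is decoded through the roll-over interaction described in the Method section, and because source nodes keep swelling as edges attach to them. The sketch below shows one way both behaviors could be wired up in ActionScript 3; wireNode, growSource, the shared infoBox text field, the records Dictionary, and the five percent growth factor are assumptions made for illustration, not the project's code.

    import flash.display.Sprite;
    import flash.events.MouseEvent;
    import flash.text.TextField;
    import flash.text.TextFieldAutoSize;
    import flash.utils.Dictionary;

    // One shared text field is reused for whichever node is hovered, and a
    // Dictionary (Flash's hashmap) maps each node sprite to its query record.
    private var infoBox:TextField = new TextField();
    private var records:Dictionary = new Dictionary();

    private function wireNode(node:Sprite, queryId:String, date:String, tag:String):void {
        infoBox.autoSize = TextFieldAutoSize.LEFT;
        infoBox.background = true;
        records[node] = { id: queryId, date: date, tag: tag };
        node.addEventListener(MouseEvent.ROLL_OVER, showInfo);
        node.addEventListener(MouseEvent.ROLL_OUT, hideInfo);
    }

    // Hovering a node reveals its query id, the date it was searched, and the
    // search tag; rolling out hides the box again.
    private function showInfo(e:MouseEvent):void {
        var node:Sprite = e.currentTarget as Sprite;
        var rec:Object = records[node];
        infoBox.text = "id: " + rec.id + "   date: " + rec.date + "   tag: " + rec.tag;
        infoBox.x = node.x + 10;
        infoBox.y = node.y - 10;
        addChild(infoBox);   // parented to the main view, not the node itself
    }

    private function hideInfo(e:MouseEvent):void {
        if (infoBox.parent != null) removeChild(infoBox);
    }

    // Each new relationship makes the source node a little larger, which is
    // what lets the most-searched tags dominate the screen and, eventually,
    // occlude their smaller neighbours.
    private function growSource(source:Sprite):void {
        source.scaleX *= 1.05;
        source.scaleY *= 1.05;
    }

Parenting the text field to the main view rather than to the node also sidesteps the scaling distortion noted in the Future Work section, since a source node can grow without stretching its label.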
Related Work:
In earlier work I also used a force-directed graph for database visualization, to see whether movie sets that work well together continue to work with each other. The TouchGraph Google Browser, a project from a few years ago, is very similar to what I did for my visualization: it uses a force-directed graph that connects one website to another with an edge link, and the user can hover over a node to see its contents, such as the URL and the name of the webpage.

Future Work:
If I were to continue working on this project, I would implement some sort of camera system so that the viewer could see the whole visualization over time instead of a small piece of it. I was also unable to get the rollover text to display correctly, because each node serves as the container for its own text; when a node is scaled up, the text is distorted along with it. I would also add a pause button to stop the grabbing of queries from the dataset, so that the viewer could inspect the contents of each node individually. Finally, I would like to improve the placement of nodes: instead of placing them randomly on the screen, I would place them in some kind of pseudo-order to reduce clutter later on.

References:
1. Sherman, Chris. Visualizing the Web with Google. http://searchenginewatch.com/article/2067764/Visualizing-the-Web-with-Google
2. Thinkmap. http://www.thinkmap.com/
3. Norgard, Barbara, and Youngin Kim. Adding Natural Language Processing Techniques to the Entry Vocabulary Module Building Process. http://metadata.sims.berkeley.edu/papers/nlptech.html