
Senior Project – Computer Science – 2012
Stephen Santise
Advisor – Prof. Aaron G. Cass
Overview:
Humans are unique creatures. Everything we do is slightly different from
everyone else, even though these differences are often so minute that they
go unnoticed. The same is true of web browsing: the way each person
browses the web is unique to that person. The websites they visit, as well
as the order in which they visit them, are unique. Wouldn't it be nice if
this uniqueness were not just overlooked, but actually used to benefit the
user's browsing experience? In this research we compare different
representations of browsing histories to find which one best captures this
uniqueness. Then, using machine learning algorithms, this research will
attempt to create a fingerprint from which a user could be identified based
on their web-browsing history alone.
Finding out what makes each history unique:
In order to identify someone from their browsing history, there must be
something that makes each browsing history unique. After careful analysis and a
bit of common sense, I have deduced that there are three main features of each
browser's history that make it unique. These features are:
1. The websites that have been visited
2. The number of times each website has been revisited
3. The order in which each website has been visited
Manipulating the dataset to represent the history’s uniqueness:
The task of representing which webpages have been visited and how many times
each website was visited was simple. In the simple data set shown below, you can
see that every website the user visited is already represented.
Representing the number of times each website was visited is also a simple task:
instead of being a binary yes or no, each attribute value could be a number
recording how many times the site that attribute represents was visited.
This leads to another question about what it means to revisit a website, because
the history stores every "webpage" visited. For example:
Website: http://www.union.edu
Webpage: http://cs.union.edu/Poster/posterguidelines.html
To solve this problem, my research counts both webpage and website visits and
creates datasets for both.
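To make the webpage-versus-website distinction concrete, here is a minimal sketch of how both visit counts could be computed from an ordered list of visited URLs. The class and method names are illustrative, and using java.net.URL's host to define a "website" is my assumption, not something the poster specifies.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VisitCounts {
    // Webpage granularity: every distinct URL is its own key, so a revisit
    // only counts when the exact same page is loaded again.
    static Map<String, Integer> webpageCounts(List<String> history) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : history) {
            counts.merge(url, 1, Integer::sum);
        }
        return counts;
    }

    // Website granularity: all URLs sharing a host collapse into one key,
    // so visits to different pages of one site add up.
    static Map<String, Integer> websiteCounts(List<String> history) throws Exception {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : history) {
            String host = new URL(url).getHost();
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }
}
```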
Manipulating the dataset to represent the history’s order:
Creating one data set containing multiple users requires that each user has the
same attributes, which makes it impossible to preserve the visit order for every
user after the first. To solve this problem, I have employed a technique from
natural language processing called n-grams. In NLP, n-grams are used to group
words together and help predict parts of speech.
The "N" in n-grams stands for the number of grams grouped together. A gram can
be any variable that exists in an ordered list; in my research, a gram is the
site visited. For example, in a tri-gram representation of the dataset, the
first instance shows that Bob visited Site2, then Site3, then Site5.
The n-gram technique also has another variable: skips. A skip represents the
number of grams skipped before recording another gram. A dataset for a 2-skip
tri-gram would look exactly the same as the one above, except that consecutive
grams would not have been next to each other in the original history. For
example, in the first instance, Bob would have visited Site2, then two more
sites, then Site3, then two more sites, then Site5.
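As a minimal sketch, assuming the stripped history is an ordered list of visited sites, the k-skip n-grams described above could be generated like this (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Returns every n-gram in which consecutive grams are separated by
    // `skip` unrecorded sites; skip = 0 gives ordinary n-grams.
    static List<List<String>> skipGrams(List<String> history, int n, int skip) {
        List<List<String>> grams = new ArrayList<>();
        int stride = skip + 1;           // distance between recorded grams
        int span = (n - 1) * stride;     // index distance from first to last gram
        for (int start = 0; start + span < history.size(); start++) {
            List<String> gram = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                gram.add(history.get(start + j * stride));
            }
            grams.add(gram);
        }
        return grams;
    }
}
```

With n = 3 and skip = 2, the gram at position i is followed by the grams at positions i + 3 and i + 6, which matches Bob's example above.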
Representing the browsing history to the computer:
Every user's history is stored in a database file whose format depends on the
browser they use. These database files contain a great deal of extra data beyond
just the webpages visited. For the purpose of this research, I have stripped all
of this extra data off to create an ordered list of every webpage visited. The
next step is to turn this list into a data set that can be used by the computer.
The simplest data set that could be created has each column represent a
different "Attribute" of the data. In this case, each Attribute represents a
webpage in the set of all webpages. The last column contains the user names;
this is the column that will contain an empty cell when we don't know who the
browsing history belongs to. Each row contains the browsing history from one
user. These rows are called instances.
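Since the classifiers come from the Weka library, this simple data set could be assembled with Weka's Instances API. The following is a minimal sketch under assumptions: the two webpage attributes, user names, and count values are placeholders, it uses the count-valued variant rather than binary yes/no, and the API shown is the modern List-based one (older Weka versions used FastVector).

```java
import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class SimpleDataset {
    public static void main(String[] args) {
        // One numeric attribute per webpage in the set of all webpages...
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("http://www.union.edu"));
        attrs.add(new Attribute("http://cs.union.edu/Poster/posterguidelines.html"));
        // ...plus a nominal class attribute holding the user names.
        attrs.add(new Attribute("user", new ArrayList<>(Arrays.asList("Bob", "Alice"))));

        Instances data = new Instances("browsing-history", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);   // last column = user name

        // One row (instance) per user's history: visit counts, then the class.
        double[] row = new double[data.numAttributes()];
        row[0] = 3;                                     // visited 3 times
        row[1] = 1;                                     // visited once
        row[2] = data.classAttribute().indexOfValue("Bob");
        data.add(new DenseInstance(1.0, row));
    }
}
```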
Using the created datasets:
After researching different techniques, I found that learning classifiers were
best suited for this identification task, for four main reasons:
1. They use simple datasets that are easily manipulated
2. Classification is a similar task to identification
3. A great deal of classifier algorithms have already been developed and are
readily available through the Weka library (see the sketch below)
4. Tools are readily available to evaluate the correctness of a classifier's
results on a dataset
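As a small illustration of reason 3, training one of Weka's ready-made classifiers takes only a few lines. The ARFF file name is a placeholder, and the choice of the J48 decision tree is mine for illustration:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainExample {
    public static void main(String[] args) throws Exception {
        // Load a dataset saved in Weka's ARFF format (file name is hypothetical).
        Instances data = new DataSource("histories.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class = user name

        J48 tree = new J48();       // one of many classifiers Weka provides
        tree.buildClassifier(data);
    }
}
```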
The Experiment:
1. Collect browsing histories from volunteers.
2. Strip the extra data out of the collected browsing histories.
3. Create data sets, which includes:
   1. A separate dataset for every combination of n-grams and skips, from a
      0-skip bi-gram to a 50-skip 50-gram
   2. One of every previous dataset for both website and webpage specificity
   3. Splitting every dataset into two sets: an 80% training set and a 20%
      testing set
4. For every dataset, train a classifier and test it with its corresponding
   test set (see the sketch after this list).
5. Evaluate the results to find which representation of the data yields the
   highest percentage of correct predictions.
6. Report on the findings.
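Steps 4 and 5 could look like the following sketch: an 80/20 split of one dataset, a classifier trained on the larger part, and Weka's Evaluation class reporting the percentage of correct predictions. The random seed and the J48 classifier are illustrative assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class SplitAndEvaluate {
    // Trains on 80% of the given dataset (class index already set) and
    // reports accuracy on the remaining 20%.
    static double evaluate(Instances data) throws Exception {
        data.randomize(new Random(42));     // shuffle before splitting
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                                       data.numInstances() - trainSize);

        J48 classifier = new J48();         // classifier choice is illustrative
        classifier.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        return eval.pctCorrect();           // percentage of correct predictions
    }
}
```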