
Clickprints on the Web: Are there Signatures in Web Browsing Data?
Balaji Padmanabhan
The Wharton School
University of Pennsylvania
Yinghui (Catherine) Yang
Graduate School of Management
University of California, Davis
1
Signatures in technology-mediated applications

- Unique typing patterns, or “keystroke dynamics”
  - Miller 1994; Monrose and Rubin 1997; Everitt and McOwan 2003
  - In an experiment involving 42 user profiles, Monrose and Rubin (1997) show that, depending on the classifier used, between 80 and 90 percent of users can be automatically recognized using features such as the latency between keystrokes and the length of time different keys are pressed.

Writeprints
- Li, Zheng and Chen (2006)
  - Experiments involving 10 users in two different message boards suggest that “writeprints” could well exist, since the accuracies obtained were between 92 and 99 percent.

Walkie Talkie?


- Mäntyjärvi et al. 2005
  - Individuals may have unique “gait” or walking patterns when they move with mobile devices.
2
Motivating Questions

- Do unique behavioral signatures exist in Web browsing data?
- How can behavioral signatures be learned?
- Why is this useful?
3
How to Decide Whether Signatures Exist

Two General Methods:
- Build features and classify (a minimal sketch follows below)
  - Build features/variables to describe users’ activities
  - Learn a classifier (user ID as the dependent variable)
  - Check its accuracy on unseen data
  - Answer the question
- A patterns-based approach
  - Pick a pattern representation, and search for distinguishing patterns
  - e.g., for user k, “total_time < 5 minutes and number of pages > 50” may be a unique clickprint since there is no other user for whom this is true
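As a rough illustration of the first method, the sketch below builds a session-level data set, fits a decision tree with the user ID as the dependent variable, and checks accuracy on held-out sessions. It is a minimal sketch only: the DataFrame `sessions`, its column names, and the use of scikit-learn with a random split are assumptions for brevity, not the implementation used in this work.

```python
# Minimal sketch of the "build features and classify" method.
# Assumes a hypothetical pandas DataFrame `sessions` with one row per session
# and columns 'user', 'duration', 'num_pages', 'start_time', 'num_sites'.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def signatures_exist(sessions: pd.DataFrame, threshold: float = 0.9) -> bool:
    features = ['duration', 'num_pages', 'start_time', 'num_sites']
    X, y = sessions[features], sessions['user']

    # Hold out unseen sessions to estimate out-of-sample accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

    # Learn a classifier with the user ID as the dependent variable.
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))

    # "Signatures exist" if users can be told apart above a chosen threshold.
    return acc >= threshold
```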
4
The Aggregation Question

- Given a unit of analysis (click/session), how much aggregation is needed before there is enough information in each aggregation to uniquely identify a person?
- For some level of aggregation, agg, we’d like
  {c1, c2, …, c_agg} → user
- Feature construction, F:
  {c1, c2, …, c_k} → <v1, v2, …, v_q, user>
- Building a predictive model:
  {<v1, v2, …, v_q, user>} → user = M(v1, v2, …, v_q)
- Find the smallest level of aggregation agg at which unique clickprints (accuracy > threshold) exist.
- Key elements:
  - How features are constructed for a group of sessions
  - How much aggregation needs to be done
5
An example of aggregations
Assume six user sessions such that:

User 1:
- Visits 10 pages over 30 mins in session 1
- Visits 6 pages over 20 mins in session 2
- Visits 12 pages over 15 mins in session 3

User 2:
- Visits 5 pages over 40 mins in session 1
- Visits 3 pages over 10 mins in session 2
- Visits 7 pages over 15 mins in session 3

At agg = 1, F constructs Dv1 as:

Avg. num pages per sess.   Avg. time spent per sess.   User
10                         30                          1
6                          20                          1
12                         15                          1
5                          40                          2
3                          10                          2
7                          15                          2
6
An example of aggregations
At agg = 2, F constructs Dv2 as:

Avg. num pages per sess.   Avg. time spent per sess.   User   Comment
8                          25                          1      From sessions 1 and 2 of user 1
9                          17.5                        1      From sessions 2 and 3 of user 1
4                          25                          2      From sessions 1 and 2 of user 2
5                          12.5                        2      From sessions 2 and 3 of user 2

At agg = 3, F constructs Dv3 as:

Avg. num pages per sess.   Avg. time spent per sess.   User   Comment
9.33                       21.67                       1      From sessions 1, 2 and 3 of user 1
5                          21.67                       2      From sessions 1, 2 and 3 of user 2
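The tables above can be reproduced with a short pandas sketch. This is illustrative only: the column names are hypothetical, and overlapping windows of agg consecutive sessions per user are assumed, which is what the rows of Dv2 reflect.

```python
import pandas as pd

# The six toy sessions from the slides (column names are hypothetical).
sessions = pd.DataFrame({
    'user':  [1, 1, 1, 2, 2, 2],
    'pages': [10, 6, 12, 5, 3, 7],
    'mins':  [30, 20, 15, 40, 10, 15],
})

def aggregate(df: pd.DataFrame, agg: int) -> pd.DataFrame:
    """Average each window of `agg` consecutive sessions within a user."""
    rolled = (df.groupby('user')[['pages', 'mins']]
                .rolling(window=agg).mean()
                .dropna()
                .reset_index(level='user'))
    return rolled.rename(columns={'pages': 'avg_pages_per_sess',
                                  'mins': 'avg_time_per_sess'})

print(aggregate(sessions, 2))   # reproduces Dv2: (8, 25), (9, 17.5), (4, 25), (5, 12.5)
print(aggregate(sessions, 3))   # reproduces Dv3: (9.33, 21.67), (5, 21.67)
```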
7
Experiments and Design

- comScore Networks, 50,000 users, 1 year
- User-centric data
  - A session is a user’s activities across Web sites
- Created multiple data sets by combining sessions from 2, 3, 4, 5, 10, 15, 20 users (140 data sets in total)
- User selection (sketched below):
  - Users with household size 1
  - Users with enough sessions for adequate out-of-sample testing
    - Pick users with > 300 sessions in a year
    - First 2/3 of sessions as training, last 1/3 of sessions as hold-out
  - Same number of sessions for the selected users in each data set, to guarantee the same class prior before and after aggregation
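A minimal sketch of this selection and temporal split, assuming a hypothetical `sessions` DataFrame with 'user' and 'start_time' columns (the household-size-1 filter is omitted here because it needs demographic data):

```python
import pandas as pd

def select_and_split(sessions: pd.DataFrame, min_sessions: int = 300):
    """Keep users with > min_sessions sessions in the year, then split each
    user's sessions in time order: first 2/3 training, last 1/3 hold-out."""
    counts = sessions['user'].value_counts()
    eligible = counts[counts > min_sessions].index
    kept = sessions[sessions['user'].isin(eligible)].sort_values('start_time')

    train_parts, test_parts = [], []
    for _, g in kept.groupby('user'):
        cut = (2 * len(g)) // 3
        train_parts.append(g.iloc[:cut])
        test_parts.append(g.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)
```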
8
Experiments and Design

The Features
- For a single session:
  (i) the duration
  (ii) the number of pages viewed
  (iii) the starting time (in seconds after 12:00 am)
  (iv) the number of sites visited
  (v) binary variables indicating whether each of the top k (= 5, 10) Web sites is visited
  Note: these top-k Web sites for each user are identified only from the training set
- For sets of sessions (see the sketch after this list):
  - Create variables capturing distributions of these measures
    - Mean, median, variance, max and min for the continuous attributes
    - Frequency counts for the top Web sites
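A rough sketch of how the group-of-sessions features listed above might be assembled; the column names, the per-session 0/1 site indicators, and the helper itself are assumptions based on the list above, not the authors’ code.

```python
import pandas as pd

def group_features(group: pd.DataFrame, top_sites: list[str]) -> pd.Series:
    """Features for one group of sessions (hypothetical column names)."""
    feats = {}
    # Distribution statistics for the continuous session attributes.
    for col in ['duration', 'num_pages', 'start_time', 'num_sites']:
        feats.update({f'{col}_mean':   group[col].mean(),
                      f'{col}_median': group[col].median(),
                      f'{col}_var':    group[col].var(),
                      f'{col}_max':    group[col].max(),
                      f'{col}_min':    group[col].min()})
    # Frequency counts for the user's top-k sites (identified on training data only).
    for site in top_sites:
        feats[f'visited_{site}'] = group[site].sum()   # sum of per-session 0/1 indicators
    return pd.Series(feats)
```

Applied to groups of agg consecutive sessions per user, this yields one labeled row <v1, …, v_q, user> per group, as in the Dv tables shown earlier.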
9
Experiments and Design

- Classifier
  - J4.8 classification tree in Weka
- Model goodness
  - Temporal hold-out samples (last 1/3 for testing)
- Threshold accuracy
  - 90%; also used other levels
- Increase the aggregation level and stop when accuracy is high enough or a stopping condition is reached (see the sketch below)
  - Set agg = 30 as the stopping condition in these experiments
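The stopping rule can be written as a simple search loop. This is a sketch only: `build_dataset` and `holdout_accuracy` are hypothetical callables (the latter would train the classifier on the first 2/3 of the aggregated rows and score the last 1/3), not functions from the paper.

```python
from typing import Callable, Optional
import pandas as pd

def smallest_sufficient_agg(sessions: pd.DataFrame,
                            build_dataset: Callable[[pd.DataFrame, int], pd.DataFrame],
                            holdout_accuracy: Callable[[pd.DataFrame], float],
                            threshold: float = 0.90,
                            max_agg: int = 30) -> Optional[int]:
    """Increase the aggregation level until hold-out accuracy reaches the
    threshold, or stop once agg = max_agg has been tried."""
    for agg in range(1, max_agg + 1):
        if holdout_accuracy(build_dataset(sessions, agg)) >= threshold:
            return agg
    return None  # no level up to max_agg reached the threshold
```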
10
Results for one specific accuracy threshold
The optimal levels of aggregation, averaged across 20 runs, for 90% accuracy (top 10 Web sites):

# of users   Mean agg   % runs with agg < 30
2            1.05       100%
3            1.26       95%
4            1.78       90%
5            2.16       95%
10           4.24       85%
15           5.2        75%
20           8.9        50%
11
Heuristic for Large Problems:
A Monotonicity Assumption

- accuracy(M | agg1) ≥ accuracy(M | agg2) whenever agg1 ≥ agg2
- In words: the goodness of the model when applied to “more aggregated” data is never worse than the goodness of the model applied to “less aggregated” data
- Can then use a binary search procedure to find the optimal agg
  - Perhaps not very useful when useful agg values are much smaller, as in our problems/experiments
  - Continuing to study when this may work and be useful
12
Conclusion
Contribution: significance of the problem and initial results
- Challenges
  - Scale
  - What is a signature?
- On-going/future research
  - Scale
  - Pattern-based signatures
  - Application-driven signature problems (e.g., fraud detection, personalization)
13
Thank you.
14
Related Work
- Learning user profiles online
  - Aggarwal et al. (1998)
  - Adomavicius and Tuzhilin (2001)
  - Mobasher et al. (2002)
- User profiles for fraud detection
  - Fawcett and Provost (1996)
  - Cortes and Pregibon (2001)
- Data preprocessing
  - Cooley et al. (1999), Zheng et al. (2003)
- Online intrusion detection
  - Ellis et al. (2004)
15
Binary search for the optimal aggregation

- Start with N users’ Web sessions mixed together.
- Assume that the range of aggregations we wish to consider is 1, 2, 3, …, K sessions.
- Consider accuracy at agg = K/2 (sketched below):
  - If this accuracy ≥ threshold, then recursively search in the lower half of the sequence
  - If this accuracy < threshold, then recursively search in the higher half of the sequence
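An iterative version of this procedure, assuming a hypothetical callable `accuracy_at(agg)` that builds the aggregated data set, trains the classifier, and returns hold-out accuracy. Under the monotonicity assumption it returns the smallest sufficient agg in 1..max_agg.

```python
from typing import Callable, Optional

def binary_search_agg(accuracy_at: Callable[[int], float],
                      threshold: float, max_agg: int) -> Optional[int]:
    """Smallest agg in 1..max_agg with accuracy_at(agg) >= threshold,
    assuming accuracy never decreases as agg grows (monotonicity)."""
    lo, hi, best = 1, max_agg, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if accuracy_at(mid) >= threshold:
            best, hi = mid, mid - 1   # good enough: search the lower half
        else:
            lo = mid + 1              # not good enough: search the higher half
    return best
```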
16
Histogram of number of sessions
[Figure: histogram of the number of sessions per user. X-axis: number of sessions (bins from 100 to 1900 and “More”); y-axis: frequency (0 to 1800).]
17
Distribution of the agg values
[Figure: histogram of the agg values for the 10-user runs. X-axis: agg (bins from 1 to 28 and “More”); y-axis: frequency (0 to 5).]
18