2013-09-25-kdd13 final

Scalable Supervised Dimensionality
Reduction using Clustering
Troy Raeder, Claudia Perlich, Brian Dalessandro,
Ori Stitelman, Foster Provost
m6d
© 2013 Media6Degrees. All Rights Reserved. Proprietary and Confidential
What we do
[Diagram: the data we work with and the questions we answer]
• 100 million URLs and 100 million browsers (cookies).
• Billions of ad exchange auctions per day.
• Shopping at one of our campaign sites: conversion base rate of 0.0001% to 1%.
• Who should we target for a product?
• What data should we pay for?
• Where should we advertise and at what price?
• Does the ad have an effect?
• Attribution?
Agnostic Data
A consumer's online activity gets recorded like this:
The Non-Branded Web
Browsing History
Hashed URLs:
date1 abkcc
date2 kkllo
date3 88iok
date4 7uiol
…
The Branded Web
Purchases
Encoded
date1 3012L20
date2 4199L30
…
date n 3075L50
Our Model
• Our goal: Identify people who are likely to purchase a
particular product after seeing an ad.
• Our approach: A massive, sparse classification problem.
  • Data points: Individual cookies.
  • Features: Past visited URLs.
  • Class: Have you ever bought from Brand X?
• Our system: Thousands of classification models, with millions of
features per model.
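To make the setup concrete, here is a minimal sketch of one such model. It assumes each cookie is represented as a set of hashed URL tokens and that a per-URL parameter is a smoothed log-likelihood ratio in the style of Naïve Bayes. The function names, the smoothing constant alpha, and the toy data are illustrative assumptions, not the production system.

```python
import math
from collections import Counter

def train_url_weights(cookies, labels, alpha=1.0):
    """Return {url: smoothed log-likelihood ratio} from sparse cookie histories.

    cookies: list of sets of hashed URLs (only visited URLs are stored).
    labels:  list of booleans, True if the cookie bought from the brand.
    alpha:   Laplace smoothing constant (an assumption of this sketch).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for urls, bought in zip(cookies, labels):
        (pos_counts if bought else neg_counts).update(set(urls))
    weights = {}
    for url in set(pos_counts) | set(neg_counts):
        p_pos = (pos_counts[url] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_counts[url] + alpha) / (n_neg + 2 * alpha)
        weights[url] = math.log(p_pos / p_neg)
    return weights

def score(weights, urls):
    """Sum the parameters of the visited URLs; unseen URLs contribute 0."""
    return sum(weights.get(u, 0.0) for u in set(urls))

# Toy data: four cookies, two of which purchased.
cookies = [{"abkcc", "kkllo"}, {"abkcc"}, {"88iok"}, {"7uiol", "88iok"}]
labels = [True, True, False, False]
weights = train_url_weights(cookies, labels)
```

Scoring only the visited URLs keeps the cost proportional to the length of a cookie's history rather than to the millions of possible features, which is the main reason to store the data sparsely.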
Dimensionality Reduction
• Our high-dimensional classification models work well in
most contexts, but in some cases fewer dimensions are better.
• Rare events: Some campaigns get very few positives, making it
hard to estimate meaningful coefficients.
• Cold start: At the very beginning of a campaign, we have
seen few positive examples, which causes the same estimation problem.
• Flexibility: There are some things that large models just can't
do, for example because of speed.
Dimensionality Reduction
• There are a few obvious options for dimensionality reduction.
• Hashing: Run each URL through a hash function that maps it to
one of a specified number of buckets.
• Categorization: We had both free and commercial website
category data. Binary URL space → binary category space.
  Example (www.dmoz.org): www.baseball-reference.com →
  Sports/Baseball/Major_League/Statistics
• SVD: Singular Value Decomposition in Mahout to transform a
large, sparse feature space into a small, dense feature space.
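The hashing option can be sketched in a few lines. This is an illustration only: the choice of MD5, the helper names, and the bucket count of 4,318 (borrowed from the feature count reported in the experiments) are assumptions, not a description of the hash actually used.

```python
import hashlib

def hash_bucket(url, n_buckets):
    """Map a URL to a bucket index in [0, n_buckets) with a stable hash."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hashed_features(urls, n_buckets):
    """Binary URL space -> binary bucket space: the set of active buckets."""
    return {hash_bucket(u, n_buckets) for u in urls}

history = ["www.baseball-reference.com", "www.dmoz.org"]
buckets = hashed_features(history, 4318)
```

Unrelated URLs can collide in the same bucket, so hashing controls the dimension but does nothing to group URLs that behave alike.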
Dimensionality Reduction
• These are all good options, but could we do better?
• Motivation: Guarantee sufficient representation in the data.
• Intuition: Combine similar URLs together.
• How should we measure similarity between URLs?
• Answer: Model parameters!
• Result: Supervised multi-task dimensionality reduction in the
space of model parameters.
• Basic idea: Hierarchical clustering of the URLs themselves.
Setup
Rows are URLs, columns are models.
Table entries are model parameters (Naïve Bayes).

URL                      Model 1   Model 2   Model 3   Model 4   …
www.nd.edu                -0.001     0.912     1.035     0.173   …
www.abc.com                0.631     0.464     0.547    -1.792   …
www.xyz.com               -1.929     0.705     0.146     1.385   …
www.espn.com              -1.151     0.543     0.469     0.310   …
www.yahoo.com/finance     -2.086    -1.096     1.341     1.368   …
www.yahoo.com/sports      -0.514     0.312     0.278     0.356   …
www.123.com                0.370    -0.121    -0.442    -0.497   …
…                              …         …         …         …
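As an illustration of measuring URL similarity in parameter space, the sketch below computes the Pearson correlation between rows of the table. It assumes the values read row-wise (four parameters per URL, one per model); the function and variable names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length parameter vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Parameters for Models 1-4, assuming a row-wise reading of the table.
params = {
    "www.espn.com": [-1.151, 0.543, 0.469, 0.310],
    "www.yahoo.com/sports": [-0.514, 0.312, 0.278, 0.356],
    "www.abc.com": [0.631, 0.464, 0.547, -1.792],
}

sports_pair = pearson(params["www.espn.com"], params["www.yahoo.com/sports"])
mixed_pair = pearson(params["www.espn.com"], params["www.abc.com"])
```

Under that reading, the two sports URLs are strongly positively correlated across models, while www.espn.com and www.abc.com are negatively correlated, which is the kind of signal the clustering relies on.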
Building the Algorithm
• For hierarchical clustering, we need:
  • A feature space and a distance measure: Pearson correlation
  in the space of model parameters.
  • A method for cutting the tree: Popularity based.
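The two ingredients above can be combined in a small sketch. The slides do not spell out the cutting rule, so this example assumes one plausible reading: keep merging the closest clusters (average linkage, distance = 1 - Pearson correlation) until every cluster reaches a minimum total popularity. The popularity counts, the threshold, and all names here are hypothetical, and the greedy loop stands in for building a full tree and cutting it afterwards.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_distance(a, b, params):
    """Average-linkage distance (1 - correlation) between clusters a and b."""
    total = sum(1.0 - pearson(params[u], params[v]) for u in a for v in b)
    return total / (len(a) * len(b))

def cluster_urls(params, popularity, min_popularity):
    """Merge closest clusters until each one is popular enough (assumed rule)."""
    clusters = [frozenset([u]) for u in params]

    def pop(cluster):
        return sum(popularity[u] for u in cluster)

    while len(clusters) > 1 and any(pop(c) < min_popularity for c in clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Clusters that are both popular enough are left alone.
                if min(pop(clusters[i]), pop(clusters[j])) >= min_popularity:
                    continue
                d = average_distance(clusters[i], clusters[j], params)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Parameter rows as read from the Setup table; popularity values are made up.
params = {
    "www.espn.com": [-1.151, 0.543, 0.469, 0.310],
    "www.yahoo.com/sports": [-0.514, 0.312, 0.278, 0.356],
    "www.abc.com": [0.631, 0.464, 0.547, -1.792],
    "www.123.com": [0.370, -0.121, -0.442, -0.497],
}
popularity = {"www.espn.com": 60, "www.yahoo.com/sports": 50,
              "www.abc.com": 70, "www.123.com": 40}
clusters = cluster_urls(params, popularity, min_popularity=100)
```

With these made-up counts, the two sports URLs end up in one cluster and the remaining two URLs in another. A popularity threshold of this kind is one way to meet the stated motivation of guaranteeing sufficient representation in the data.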
Example
[Figure: example URL clusters. Labels shown: Kids, Health, Home, News, Games & Videos, Home]
Experiments
• We built models on data from 28 campaigns.
• Our production cluster definitions have 4,318 features.
• We tried to get each of the "challengers" as close to this
feature count as we possibly could.
• We evaluate on Lift (5%) and AUC.
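For reference, the two evaluation metrics can be computed as in the sketch below. It assumes Lift (5%) means the positive rate among the top 5% of scored cookies divided by the overall positive rate, and that AUC is the usual pairwise ranking probability. The helper names and the toy scores are illustrative.

```python
def lift_at(scores, labels, fraction=0.05):
    """Positive rate in the top `fraction` of scores over the base rate."""
    n = len(scores)
    k = max(1, round(fraction * n))
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / n
    return top_rate / base_rate

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 20 cookies with descending scores and 4 positives.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
lift_5 = lift_at(scores, labels, 0.05)
area = auc(scores, labels)
```

The pairwise AUC here is quadratic in the number of examples, which is fine for illustration; a rank-based formula would be the natural choice at production scale.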
Results
Method         Average     Average          Win   Loss   Tie   Features
               Lift (5%)   Relative Perf.
Cluster        4.024       100%             -     -      -     4,318
SVD            3.539       86.0%            4     20     4     1,000
Hash           3.035       70.0%            1     26     1     4,318
Commercial     3.195       71.3%            2     24     2     1,183
Free Context   3.643       84.4%            1     17     10    5,984
Results
Results (in lab)
Results (in production)
Questions?
• Thanks for coming!