Scalable Supervised Dimensionality Reduction using Clustering
Troy Raeder, Claudia Perlich, Brian Dalessandro, Ori Stitelman, Foster Provost
m6d
© 2013 Media6Degrees. All Rights Reserved. Proprietary and Confidential

What we do
• Who should we target for a product?
• What data should we pay for?
• Does the ad have an effect?
• Attribution?
• Where should we advertise and at what price?
[Diagram labels: 100 Million URLs; cookies; 100 Million Browsers; Shopping at one of our campaign sites; conversion base rate 0.0001% to 1%; Billions of Auctions per day; Ad Exchange]

Agnostic Data
A consumer’s online activity gets recorded like this:

The Non-Branded Web        The Branded Web
Browsing History           Purchases
Hashed URLs:               Encoded
date1  abkcc               date1  3012L20
date2  kkllo               date2  4199L30
date3  88iok               …
date4  7uiol               date n 3075L50
…

Our Model
• Our goal: To identify people who are likely to purchase a particular product after seeing an ad.
• Our approach: A massive, sparse classification problem.
• Data points: Individual cookies. Features: Past visited URLs. Class: Have you ever bought from Brand X?
• Our system: Thousands of classification models, with millions of features per model.

Dimensionality Reduction
• Our high-dimensional classification models work really well in most contexts, but in some cases fewer dimensions are better.
• Rare Events: Some campaigns get very few positives, making it hard to estimate meaningful coefficients.
• Cold Start: At the very beginning of a campaign, we have seen few positive examples. Same problem.
• Flexibility: There are some things that large models just can’t do (speed).

Dimensionality Reduction
• There are a few obvious options for dimensionality reduction.
• Hashing: Run each URL through a hash function that maps it into one of a specified number of buckets.
• Categorization: We had both free and commercial website category data. Binary URL space → binary category space.
  Example: www.baseball-reference.com → Sports/Baseball/Major_League/Statistics
• SVD: Singular Value Decomposition in Mahout to transform a large, sparse feature space into a small, dense feature space.
www.dmoz.org

Dimensionality Reduction
• These are all good options, but could we do better?
• Motivation: Guarantee sufficient representation in the data.
• Intuition: Combine similar URLs together.
• How should we measure similarity between URLs?
• Answer: Model parameters!
• Result: Supervised multi-task dimensionality reduction in the space of model parameters.
• Basic idea: Hierarchical clustering of the URLs themselves.

Setup
Rows are URLs, columns are models. Table entries are model parameters (Naïve Bayes).

URL                      Model 1   Model 2   Model 3   Model 4   …
www.nd.edu                -0.001     0.912     1.035     0.173   …
www.abc.com                0.631     0.464     0.547    -1.792   …
www.xyz.com               -1.929     0.705     0.146     1.385   …
www.espn.com              -1.151     0.543     0.469     0.310   …
www.yahoo.com/finance     -2.086    -1.096     1.341     1.368   …
www.yahoo.com/sports      -0.514     0.312     0.278     0.356   …
www.123.com                0.370    -0.121    -0.442    -0.497   …
…                              …         …         …         …

Building the Algorithm
• For hierarchical clustering, we need:
  • A feature space and a distance measure.
    • Pearson correlation in the space of model parameters.
  • A method for cutting the tree.
    • Popularity based.

Example
[Figure: example clusters, with labels Kids; Health; Home; News; Games & Videos; Home]

Experiments
• We built models off data from 28 campaigns.
• Our production cluster definitions have 4,318 features.
• We tried to get each of the “challengers” as close to this as we possibly could.
• We evaluate on Lift (5%) and AUC.

Results

Method         Average Lift (5%)   Average Relative Perf.   Win   Loss   Tie   Features
Cluster                    4.024                     100%     -      -     -      4,318
SVD                        3.539                    86.0%     4     20     4      1,000
Hash                       3.035                    70.0%     1     26     1      4,318
Commercial                 3.195                    71.3%     2     24     2      1,183
Free Context               3.643                    84.4%     1     17    10      5,984

Results
[Figure not recovered]

Results (in lab)
[Figure not recovered]

Results (in production)
[Figure not recovered]

Questions?
• Thanks for coming!
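
The "Building the Algorithm" slide names the two ingredients: a distance measure (Pearson correlation in the space of model parameters) and a way to cut the tree. A minimal sketch of that idea, using the seven example rows from the "Setup" slide, might look like the following. The average-linkage rule, the function names, and the fixed `max_distance` cut are assumptions made for illustration; in particular, the slides describe a popularity-based cut whose details are not given, so a plain distance threshold stands in for it here.

```python
from itertools import combinations
from math import sqrt

# Example rows from the "Setup" slide: one parameter per model (Model 1..4).
PARAMS = {
    "www.nd.edu": [-0.001, 0.912, 1.035, 0.173],
    "www.abc.com": [0.631, 0.464, 0.547, -1.792],
    "www.xyz.com": [-1.929, 0.705, 0.146, 1.385],
    "www.espn.com": [-1.151, 0.543, 0.469, 0.310],
    "www.yahoo.com/finance": [-2.086, -1.096, 1.341, 1.368],
    "www.yahoo.com/sports": [-0.514, 0.312, 0.278, 0.356],
    "www.123.com": [0.370, -0.121, -0.442, -0.497],
}


def pearson(x, y):
    """Pearson correlation between two equal-length parameter vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance(u, v):
    """Correlation distance in [0, 2]: 0 means perfectly correlated URLs."""
    return 1.0 - pearson(PARAMS[u], PARAMS[v])


def average_linkage(c1, c2):
    """Mean pairwise distance between two clusters (an assumed linkage rule)."""
    return sum(distance(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))


def cluster(urls, max_distance):
    """Agglomerative clustering: merge the closest pair until none is close enough.

    The distance threshold is a stand-in for the popularity-based cut
    mentioned on the slide, which is not specified in the deck.
    """
    clusters = [frozenset([u]) for u in urls]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average_linkage(*p))
        if average_linkage(a, b) > max_distance:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


for group in cluster(list(PARAMS), max_distance=0.5):
    print(sorted(group))
```

Each resulting group would become one feature in the reduced space. This brute-force loop is only meant to show the mechanics on a toy table; it would not scale to millions of URLs.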
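
The "Experiments" slide evaluates on Lift (5%) and AUC without defining them. The sketch below assumes the common definitions: lift at 5% is the positive rate among the top-scoring 5% of examples divided by the overall positive rate, and AUC is the probability that a random positive outscores a random negative (ties count half). The function names `lift_at` and `auc` are hypothetical, and the deck's exact computation may differ.

```python
def lift_at(scores, labels, fraction=0.05):
    """Positive rate in the top `fraction` of scores, relative to the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this definition, a lift of 4 means the top 5% of scored cookies convert at four times the campaign's base rate. The pairwise AUC is quadratic in the number of examples, so it is suitable only for small illustrations.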
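
The hashing baseline from the "Dimensionality Reduction" slide (run each URL through a hash function into a specified number of buckets) can be sketched as follows. The choice of MD5, the helper names, and the default of 4,318 buckets (the feature count listed for Hash in the results table) are assumptions for illustration; the deck does not say which hash function was used.

```python
import hashlib


def url_bucket(url, n_buckets=4318):
    """Map a URL to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_features(visited_urls, n_buckets=4318):
    """Binary feature vector in sparse form: sorted indices of the active buckets."""
    return sorted({url_bucket(u, n_buckets) for u in visited_urls})


print(hashed_features(["www.espn.com", "www.yahoo.com/sports", "www.nd.edu"]))
```

Unlike the clustering approach, the bucket a URL lands in here is unrelated to how the URL behaves in any model, so unrelated URLs share features by chance.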