Deep Learning Adaptive Learning Rate Algorithms
Upsolver – Implementing deep learning in ad-tech
Yoni Iny – CTO
Ori Rafael – CEO

About Upsolver
• Founded in Feb 2014, post-seed, offices in TLV
• On-boarding first customers
• Finished 3 technology validation projects
• 1st to graduate from Yahoo's ad-tech innovation program

THE TEAM
• Ori Rafael – CEO & Co-Founder: 7 years of technology management at 8200; 3 years of B2B sales and biz dev; B.A. in Computer Science, MBA
• Yoni Iny – CTO & Co-Founder: 9 years at 8200; CTO of a major data science group; B.A. in Mathematics & Computer Science

R&D – 8200 dev force, data science, academia and industry experience
• Shani Elharrar – VP R&D
• Bar Vinograd – Senior Data Scientist
• Omer Kushnir – Full stack developer
• Jason Fine – Full stack developer
• Eyal Gruss – Deep learning expert; PhD in theoretical physics; Principal Data Scientist at RSA
• Team backgrounds also include 5–12 years at 8200, chief data science at Adallom, a B.A. in Mathematics, and coding since age 13

Digital advertising – from direct deals to programmatic

Programmatic/RTB Process
[Diagram of the programmatic/RTB bidding process. Source: searchenginewatch.com]

The data
• Offers: Advertiser, Product, Offer, Banner/Rich Media/Video
• Ad placement: App/Domain, Placement, Device, OS
• User: Id, Geo, Gender, Age, Intent
• Context: Content, Timing

The "standard" approach: feature engineering + logistic regression

Why choose a different approach?
• High dimensionality – hard to understand
• Slow process in a rapidly changing domain
• High sparsity and skewness

Applying deep learning in the ad-tech space

DL in advertising vs. computer vision
Computer Vision                        Advertising
Small data                             True big data
Data is static (pixels)                Trending data
Dense data                             Sparse data
~No strong latency requirements        Billions of bids per day, latency < 20 ms

Our product
• Combination of approaches to handle advertising challenges – Hinton, LeCun, Bengio, Ng
• Implemented in CUDA (GPU code) + C#
• Training: 30K samples/sec; serving: 150K samples/sec
• Optimized for sparse data
• Online learning

Stochastic gradient descent
• Learn by computing the first-order gradients of the error function and adjusting the weights
• To get good results, we need a good error function and a good learning rate algorithm
• (Minimal sketches of the update rules discussed below appear after the deck)

Learning rate pitfalls
[Plots of error E vs. weights w: what happens when the learning rate is too big; one plot isn't actually a good representation of the error surface, the other is]

Fixed learning rate + annealing
• Fixed learning rate – set alpha to a constant based on empirical tests
  – Super simple, so it's usually the first thing people try
  – You can usually find an alpha that works pretty well: start with 0.1; too slow => alpha * 3; divergence => alpha / 3
• Annealing decays the learning rate (usually with a half-life)
  – Faster in the beginning, slower at the end (for better or worse)

(Nesterov) Momentum
• Faster learning in consistent directions – effectively a per-parameter learning rate
• The method: first jump, then correct

AdaGrad
• Adaptive learning rate per parameter
• Like a smarter annealing schedule, but still artificially ends learning after a while
• Can be combined with momentum

AdaDelta (Google 2012)
• AdaGrad without the disadvantages
• The learning rate doesn't disappear – a good fit for data that changes over time
• A bit harder to implement

Second order methods (VSGD/CG/L-BFGS)
• What matters isn't the size of the gradient, it's the ratio of the gradient to the curvature
• Second order methods try to approximate the curvature
• Usually significantly slower
• Hard to implement!
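Appendix: illustrative update-rule sketches

The sketch below is a minimal NumPy illustration of plain SGD with a half-life annealing schedule, as described in the "Fixed learning rate + annealing" slide. It is not Upsolver's CUDA/C# implementation; the toy quadratic error function, function names, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, grad, alpha):
    """Plain SGD: move the weights against the gradient, scaled by alpha."""
    return w - alpha * grad

def annealed_alpha(alpha0, step, half_life):
    """Annealing: decay the learning rate with a half-life, so steps are
    faster in the beginning and slower at the end."""
    return alpha0 * 0.5 ** (step / half_life)

# Toy error function E(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([4.0, -2.0])
alpha0, half_life = 0.1, 100.0
for step in range(1000):
    grad = w                                   # gradient of the toy error
    w = sgd_step(w, grad, annealed_alpha(alpha0, step, half_life))
print(w)                                       # close to the minimum at 0
```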
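"First jump, then correct" refers to the Nesterov variant of momentum. Here is a minimal sketch of that idea under the same toy quadratic assumption; the function name and the values of alpha and mu are illustrative, not tuned settings from the deck.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, alpha=0.05, mu=0.9):
    """Nesterov momentum: 'jump' along the current velocity first, then
    'correct' using the gradient at the look-ahead point. Consistent
    gradient directions build up velocity, giving faster learning."""
    lookahead = w + mu * v          # jump
    g = grad_fn(lookahead)          # correct
    v = mu * v - alpha * g
    return w + v, v

# Toy error function E(w) = 0.5 * ||w||^2, gradient = w.
grad_fn = lambda w: w
w, v = np.array([4.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = nesterov_step(w, v, grad_fn)
print(w)                            # close to the minimum at 0
```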
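AdaGrad keeps a per-parameter sum of squared gradients and divides each step by its square root, which is why learning eventually stalls. A minimal sketch, again on an assumed toy quadratic:

```python
import numpy as np

def adagrad_step(w, grad, cache, alpha=0.5, eps=1e-8):
    """AdaGrad: per-parameter learning rate scaled by accumulated squared
    gradients. The accumulator only grows, so the effective step size
    shrinks toward zero -- the 'artificially ends learning' drawback."""
    cache = cache + grad ** 2
    w = w - alpha * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy error function E(w) = 0.5 * ||w||^2, gradient = w.
w = np.array([4.0, -2.0])
cache = np.zeros_like(w)
for _ in range(500):
    grad = w
    w, cache = adagrad_step(w, grad, cache)
print(w)
```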
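AdaDelta (Zeiler, 2012) replaces AdaGrad's ever-growing accumulator with exponentially decaying averages of squared gradients and squared updates, so the effective learning rate never vanishes – the property the deck highlights for data that changes over time. A minimal sketch under the same toy assumptions; rho and eps here follow the paper's suggested defaults:

```python
import numpy as np

def adadelta_step(w, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """AdaDelta: decaying averages instead of AdaGrad's growing sum, so
    the effective learning rate never shrinks all the way to zero."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                   # avg of squared gradients
    dx = -(np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps)) * grad   # unit-corrected update
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                   # avg of squared updates
    return w + dx, eg2, edx2

# Toy error function E(w) = 0.5 * ||w||^2, gradient = w.
w = np.array([4.0, -2.0])
eg2, edx2 = np.zeros_like(w), np.zeros_like(w)
for _ in range(2000):
    grad = w
    w, eg2, edx2 = adadelta_step(w, grad, eg2, edx2)
print(w)
```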
Questions!
Ori Rafael – +972-54-9849666 – [email protected]
Yoni Iny – +972-54-4860360 – [email protected]