Bootstrapping Labels for One-Hundred Million Images
Jimmy Whitaker
4/5/16 GTC 2016

We are drowning in data
[Figure: Data Never Sleeps 2.0, DOMO (2014)]

Ripe Opportunities
• Many problems to solve
• Limitless amounts of image data
• Deep Learning pushing State of the Art everywhere
• GPUs making everything possible

The Problem
• Deep Learning is data-driven
  • ImageNet has 1.2 million training examples
  • Few large, labeled image datasets exist
• It’s expensive to label data
  • Our datasets are +100m images
• Few qualified to label
  • Highly sensitive customer data
  • Necessary subject matter expertise

Ever labeled data?
• Not as easy as it seems:
  • It’s repetitive
  • Less accurate over time
• One day computers will do it all for you?
  • But not yet
• Can some of this effort be automated?

Many Approaches
• Mechanical Turk
  • Costly
  • Time Consuming
• Pre-trained classifiers
  • What if pre-trained classifiers don’t work well on data?
• Active learning
  • Iterative labeling
  • Open problem
• Clustering
  • Expensive
  • How many clusters
  • What features to use?
Can we combine these into something useful?

The Goal
• Inspired by Image Similarity experience and Jeremy Howard TED talk
• Use machines to filter the noise
• Reduce repetitive tasks
• Leverage human labeler
• Understand the data
• Label iteratively
• Allow exploration

Our Approach
[Figure: pipeline diagram built up over several slides; the captions shown along the way were:]
• Compare Image Hashes to filter Duplicate Images
• Prevents overfocusing on one portion of feature space
• Label Images on the boundary of the class
• Improve CNN features for labeled classes

GUI
[Figure: GUI screenshot]

Hardware
• Cirrascale GB5670
  • 56 CPU Cores
  • 8x NVIDIA K-80
  • 512GB DDR4
  • 1 TB SSD

Benefits
• Create Large, Labeled Datasets
  • High quality
  • Allows data exploration
• Dramatic time reduction
  • ~3-5x faster initially
  • Multiplicative efficiency gains
• Flexible framework
  • Perform data science with images
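
Two steps named on the "Our Approach" slides, comparing image hashes to filter duplicates and labeling images on the boundary of the class, can be sketched as below. The slides do not say which hash or which selection rule was used, so this is only an illustration under assumptions: an average-hash over grayscale arrays stands in for the image hash, and margin sampling over classifier probabilities stands in for "boundary of the class". The names `average_hash`, `filter_duplicates`, `select_boundary`, and the parameters `hash_size`, `max_distance`, and `k` are hypothetical.

```python
import numpy as np


def average_hash(gray, hash_size=8):
    """Boolean average-hash of a 2-D grayscale array (assumed hash choice).

    Assumes both image dimensions are at least hash_size.
    """
    h, w = gray.shape
    bh, bw = h // hash_size, w // hash_size
    # Crop so the image divides evenly into hash_size x hash_size blocks.
    cropped = gray[:bh * hash_size, :bw * hash_size].astype(np.float64)
    # Mean intensity of each block, giving a hash_size x hash_size thumbnail.
    blocks = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per block: brighter than the thumbnail's mean or not.
    return (blocks > blocks.mean()).flatten()


def filter_duplicates(images, max_distance=0):
    """Return indices of images kept after dropping near-duplicates.

    Two images count as duplicates when the Hamming distance between
    their hashes is <= max_distance. The pairwise loop is quadratic and
    is only meant for small illustrative inputs.
    """
    kept, kept_hashes = [], []
    for i, img in enumerate(images):
        hsh = average_hash(img)
        if all(np.count_nonzero(hsh != other) > max_distance
               for other in kept_hashes):
            kept.append(i)
            kept_hashes.append(hsh)
    return kept


def select_boundary(probs, k):
    """Indices of the k samples with the smallest top-two probability margin.

    probs is an (n_samples, n_classes) array of classifier probabilities
    with at least two classes (margin sampling, an assumed selection rule).
    """
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margin)[:k]
```

In such a sketch, `filter_duplicates` would run once before any labeling, while `select_boundary` would be called in each labeling round on the current classifier's outputs to pick the next images to show the human labeler. At the scale of the datasets described in the talk, the pairwise hash comparison would need to be replaced by an indexed lookup.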