Bootstrapping Labels for One Hundred Million Images
Jimmy Whitaker
We are drowning in data
Data Never Sleeps 2.0 - DOMO (2014)
4/5/16
GTC 2016
Ripe Opportunities
• Many problems to solve
• Limitless amounts of image data
• Deep Learning is pushing the state of the art everywhere
• GPUs making everything possible
The Problem
• Deep Learning is data-driven
• ImageNet has 1.2 million training examples
• Few large, labeled image datasets exist
• It’s expensive to label data
• Our datasets contain 100M+ images
• Few people are qualified to label
• Highly sensitive customer data
• Necessary subject matter expertise
Ever labeled data?
• Not as easy as it seems:
• It’s repetitive
• Labelers get less accurate over time
• One day computers will do it all for you?
• But not yet
• Can some of this effort be automated?
Many Approaches
• Mechanical Turk
• Costly
• Time Consuming
• Clustering
• How many clusters?
• What features to use?
• Pre-trained classifiers
• What if pre-trained classifiers don’t work well on data?
• Expensive
• Active learning
• Iterative labeling
• Open problem
Can we combine these into something useful?
The Goal
• Inspired by image similarity experience and Jeremy Howard’s TED talk
• Use machines to filter the noise
• Reduce repetitive tasks
• Leverage the human labeler
• Understand the data
• Label iteratively
• Allow exploration
Our Approach
• Compare image hashes to filter duplicate images
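
The slide does not say which hash is used. As a minimal sketch of the idea, the block below assumes a simple average hash (one bit per image block, set when the block is brighter than the image mean) and treats two images as duplicates when their hashes differ in only a few bits. The names `average_hash`, `hamming`, `filter_duplicates` and the `max_distance` threshold are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def average_hash(image, hash_size=8):
    """Boolean hash of a 2-D grayscale array (assumed at least hash_size pixels per side)."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    # Crop so the image divides evenly into hash_size x hash_size blocks.
    bh, bw = h // hash_size, w // hash_size
    cropped = image[:bh * hash_size, :bw * hash_size]
    # Downsample by averaging each block.
    small = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per block: is the block brighter than the overall mean?
    return (small > small.mean()).ravel()

def hamming(a, b):
    """Number of bits that differ between two hashes."""
    return int(np.count_nonzero(a != b))

def filter_duplicates(images, max_distance=4):
    """Return indices of images kept after dropping near-duplicates.

    max_distance is a hypothetical threshold chosen for illustration.
    """
    kept, kept_hashes = [], []
    for i, img in enumerate(images):
        h = average_hash(img)
        # Keep the image only if it is far from every hash kept so far.
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept
```

This pairwise comparison is quadratic in the worst case, so at the scale the talk describes a real system would likely need some form of hash indexing or bucketing instead.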
• Prevents overfocusing on one portion of feature space
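
The slide text does not say how the pipeline avoids overfocusing. One common way to get that effect is to pick the images shown to the labeler so that they spread across feature space. The sketch below uses greedy farthest-point sampling purely as a stand-in for whatever the original system does; `farthest_point_sample` and its parameters are hypothetical.

```python
import numpy as np

def farthest_point_sample(features, k, start=0):
    """Greedily choose up to k row indices that are spread out in feature space.

    features: (n, d) array of feature vectors (for example, CNN features).
    start: index of the first chosen point (an arbitrary choice here).
    """
    features = np.asarray(features, dtype=np.float64)
    k = min(k, features.shape[0])
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    while len(chosen) < k:
        # Next pick: the point farthest from everything chosen so far.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return chosen
```

With three tight groups of points, asking for three samples returns one index from each group rather than three from the same region.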
• Label images on the boundary of the class
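
Labeling on the class boundary resembles uncertainty sampling from active learning. As a rough sketch, assuming a binary classifier that outputs a positive-class probability per image, the images nearest the boundary are those whose probability is closest to 0.5. The function name `boundary_candidates` and the 0.5 decision threshold are assumptions for illustration.

```python
import numpy as np

def boundary_candidates(probs, n):
    """Indices of the n samples whose positive-class probability is closest to 0.5."""
    probs = np.asarray(probs, dtype=np.float64)
    # Small margin means the classifier is least certain about the sample.
    margin = np.abs(probs - 0.5)
    order = np.argsort(margin, kind="stable")
    return order[:n].tolist()

# Example: the two least certain images are sent to the labeler first.
to_label = boundary_candidates([0.98, 0.52, 0.10, 0.45, 0.70], n=2)
```

Images the classifier is already confident about (probabilities near 0 or 1) are left out of the labeling queue, which is where the reduction in repetitive work would come from.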
• Improve CNN features for labeled classes
GUI
Hardware
• Cirrascale GB5670
• 56 CPU Cores
• 8x NVIDIA K-80
• 512GB DDR4
• 1 TB SSD
Benefits
• Create Large, Labeled Datasets
• High quality
• Allows data exploration
• Dramatic time reduction
• ~3-5x faster initially
• Multiplicative efficiency gains
• Flexible framework
• Perform data science with images