Slides (Building High-Level Features with Unsupervised Learning)

Article Review
Todd Hricik
Learning High Level Features
• Previous studies in computer vision have used
labeled data to learn “higher” level features
– Requires a large training set containing features
you wish to recognize
– Difficult to obtain in many cases
• The focus of this paper is to build high-level,
class-specific feature detectors
from unlabeled images
Learning Features From Unlabeled
Data
• RBMs (Hinton et al.,2006)
• Autoencoders (Hinton & Salakhutdinov, 2006;
Bengio et al., 2007)
• Sparse coding (Lee et al., 2007)
• K-means (Coates et al., 2011)
• To date, most approaches have succeeded only in
learning low-level features such as “lines” or “blobs”
• Authors consider the possibility of capturing
more complex features using deep autoencoders
on unlabeled data
Deep Autoencoders
• Made up of symmetrical encoding (blue) and
decoding (red) deep belief networks
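The symmetric encode/decode structure can be sketched with a toy one-sublayer autoencoder. The linear maps, sizes, and random initialization here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric autoencoder: W1 encodes, W2 decodes back to input space.
# Sizes and the purely linear maps are illustrative simplifications.
n_in, n_hidden = 16, 4
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # encoding weights
W2 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # decoding weights

def encode(x):
    return W1 @ x

def decode(h):
    return W2 @ h

x = rng.normal(size=n_in)
x_hat = decode(encode(x))                 # reconstruction of the input
reconstruction_error = np.sum((x_hat - x) ** 2)
```

Stacking several such encode/decode pairs, mirrored around the middle, gives the deep autoencoder shape the slide describes.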
Training Set
• Randomly sampled 200x200 pixel frames from
10 million YouTube videos
• The OpenCV face detector was run on 60x60
patches randomly sampled from the training
set
• 3% of the 100,000 sampled patches contained
faces according to the OpenCV detector
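The patch-sampling step above can be sketched as follows. The frame count is tiny and the frames are random noise here (illustrative stand-ins); the paper ran the OpenCV face detector over 100,000 real 60x60 patches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "video frames": 5 grayscale 200x200 images of random pixels.
frames = rng.integers(0, 256, size=(5, 200, 200), dtype=np.uint8)

def sample_patch(frame, size=60):
    # Draw a random 60x60 crop from a 200x200 frame.
    y = rng.integers(0, frame.shape[0] - size + 1)
    x = rng.integers(0, frame.shape[1] - size + 1)
    return frame[y:y + size, x:x + size]

patches = [sample_patch(frames[rng.integers(len(frames))])
           for _ in range(10)]
```

In the paper each such patch would then be passed to the OpenCV face detector to estimate the fraction containing faces.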
Deep Autoencoder Architecture
• 1 billion trainable parameters
• Still tiny: the human visual cortex is ~10^6 times
larger
• Local Receptive Fields (LRF)
– each feature connects to a small region of the
lower layer
• Local L2 Pooling
– square root of the sum of squares of its inputs
• Local Contrast Normalization (LCN)
[Figure: one sublayer of the architecture, labeling the pooling matrix H and the encoding weights W1]
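The Local L2 Pooling step named above is just the square root of the sum of squares over a local region. The region size and the non-overlapping 1-D layout below are illustrative choices:

```python
import numpy as np

def l2_pool(x, size=2):
    # x: 1-D array of lower-layer activations.
    # Each pooling unit covers a non-overlapping region of `size` inputs
    # and outputs sqrt(sum of squares) over that region.
    x = x.reshape(-1, size)
    return np.sqrt(np.sum(x ** 2, axis=1))

out = l2_pool(np.array([3.0, 4.0, 0.0, 1.0]))
# First region: sqrt(3^2 + 4^2) = 5.0
```

L2 pooling gives the learned features a degree of local invariance, since small shifts within a pooling region barely change the pooled value.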
Learning and Optimization
• Pooling parameters H are fixed to uniform weights
• Encoding weights W1 and decoding weights W2 of the
first sublayers are learned during training
• Lambda: tradeoff parameter between sparsity and
reconstruction
• m, k: number of examples and pooling units in a layer
respectively
• The objective function of the model is the sum of the
individual objectives of the three layers
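Using the symbols defined on this slide, the per-sublayer objective can be reconstructed as reconstruction error plus a sparsity penalty on the pooled outputs (ε is a small constant; this matches the paper's formulation up to notation):

```latex
\min_{W_1, W_2} \; \sum_{i=1}^{m} \left(
  \left\| W_2 W_1^{\top} x^{(i)} - x^{(i)} \right\|_2^2
  + \lambda \sum_{j=1}^{k} \sqrt{\, \epsilon + H_j \left( W_1^{\top} x^{(i)} \right)^2 \,}
\right)
```

Summing this objective over the three layers gives the model's full training objective.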
Validation of Higher Level Features
Learned
• Control experiments used to analyze invariance properties
of the face detector
• The test set consists of 37,000 images (13,026
containing faces) sampled from
– Labeled Faces In the Wild dataset (Huang et al., 2007)
– ImageNet dataset (Deng et al., 2009)
• After training, the test set was used to measure the
performance of each neuron in classifying faces against
distractors
– For each neuron, compute its maximum and minimum
activation values, then pick 20 equally spaced thresholds
in between
– The reported accuracy is the best classification accuracy among
the 20 thresholds
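The per-neuron evaluation above can be sketched directly: sweep 20 equally spaced thresholds between a neuron's minimum and maximum activation and keep the best accuracy. The activations and labels below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: label 1 = face, 0 = distractor; activations are the
# label plus noise, so the "neuron" carries some real signal.
labels = rng.integers(0, 2, size=200)
activations = labels + rng.normal(scale=0.5, size=200)

def best_threshold_accuracy(acts, labels, n_thresholds=20):
    # 20 equally spaced thresholds between the min and max activation.
    thresholds = np.linspace(acts.min(), acts.max(), n_thresholds)
    # Classify "face" when activation exceeds the threshold; report the
    # best accuracy over all thresholds, as on the slide.
    return max(np.mean((acts > t) == labels) for t in thresholds)

acc = best_threshold_accuracy(activations, labels)
```

Note this protocol needs no labeled training data for the neuron itself; the labels are used only to score a fixed, already-learned feature.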
Validation Results
• Best neuron obtained 81.7% accuracy in detecting
faces (serendipity?)
– A random guess achieves 64.8% accuracy
– The best neuron in a one-layer network achieved 71%
accuracy
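The 64.8% baseline is consistent with simply predicting "no face" for every image, given the test-set counts stated earlier:

```python
# Test set from the slides: 37,000 images, 13,026 of which contain faces.
total, faces = 37_000, 13_026
all_negative_accuracy = (total - faces) / total  # always predict "no face"
print(round(all_negative_accuracy * 100, 1))     # 64.8
```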
[Figures: results on a sub-sample of the test set (positive/negative ratio = 1) and on the entire test set]
Validation Results Analysis I
• Removing the LCN layer
reduced the accuracy of the
best-performing neuron to
78.5%
• The face detector is robust
to translation, scaling and
out-of-plane rotation (Fig.
4, 5)
• Removing all images that
contain faces from the training
set and repeating the experiment
yields 72.5% accuracy
Can Other Well-Performing Neurons
Recognize Other High-Level Features?
• Constructed two datasets having positive/negative
ratios similar to face ratios in training data
(1) Human bodies vs. distractors (Keller et al., 2009)
(2) Cat faces vs. distractors (Zhang et al., 2008)
Summary of Results and Comparisons
to State of the Art Methods
Thank You
Questions?