CS598 VISUAL INFORMATION RETRIEVAL
Lecture VIII: LSH Recap and Case Study on Face
Retrieval
LECTURE VIII
Part I: Locality Sensitive Hashing Recap
Slides credit: Y. Gong
THE PROBLEM
Large scale image search:
Given a query image
Want to search a large database to find similar images
E.g., search the internet to find similar images
Need fast turn-around time
Need accurate search results
LARGE SCALE IMAGE SEARCH
Find similar images in a large database
Slide credit: Kristen Grauman et al.
INTERNET LARGE SCALE IMAGE SEARCH
Internet contains billions of images
Search the internet
The Challenge:
– Need a way of measuring similarity between images (distance
metric learning)
– Needs to scale to the Internet (how?)
LARGE SCALE IMAGE SEARCH
Representation must fit in memory (disk too slow)
Facebook has ~10 billion images (10^10)
A PC has ~10 GB of memory (~10^11 bits)
Budget: ~10^1 bits/image
Slide credit: Fergus et al.
REQUIREMENTS FOR IMAGE SEARCH
Search must be fast, accurate, and scalable:
Fast
Kd-trees: tree data structures to improve search speed
Locality Sensitive Hashing: hash tables to improve search speed
Small codes: binary small codes (010101101)
Scalable
Require very little memory
Accurate
Learned distance metric
SUMMARY OF EXISTING ALGORITHMS
Tree-based indexing
Spatial partitions (e.g., the kd-tree) and recursive hyperplane
decomposition provide an efficient means to search
low-dimensional vector data exactly.
Hashing
Locality-sensitive hashing offers sub-linear time search by
hashing highly similar examples together.
Binary small codes
Compact binary codes, with a few hundred bits per image.
TREE BASED STRUCTURE
Kd-tree
The kd-tree is a binary tree in which every node is a
k-dimensional point.
Kd-trees are known to break down in practice for
high-dimensional data, and cannot provide better than a
worst-case linear query-time guarantee.
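For reference, a minimal exact nearest-neighbor query with a kd-tree; SciPy's cKDTree is an illustrative choice here, not the implementation behind these slides:

```python
import numpy as np
from scipy.spatial import cKDTree

# Build a kd-tree over low-dimensional vectors and run an exact NN query.
# Works well in low dimensions; degrades toward a linear scan as d grows.
rng = np.random.default_rng(0)
data = rng.random((100_000, 3))       # 100k points in 3-D
tree = cKDTree(data)

query = rng.random(3)
dists, idxs = tree.query(query, k=5)  # distances and indices of 5 nearest neighbors
```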
LOCALITY SENSITIVE HASHING
Hashing methods to do fast Nearest Neighbor (NN) Search
Sub-linear time search by hashing highly similar examples
together in a hash table
Take random projections of the data
Quantize each projection with a few bits
Strong theoretical guarantees
More detail later
BINARY SMALL CODE
Binary? Use only binary code (0/1), e.g., 1110101010101010
Small? A small number of bits to code each image,
e.g., 32 bits or 256 bits
How could this kind of small code improve image search
speed? More detail later.
DETAILS OF THESE ALGORITHMS
1. Locality sensitive hashing
Basic LSH
LSH for learned metric
2. Small binary code
Basic small code idea
Spectral hashing
1. LOCALITY SENSITIVE HASHING
The basic idea behind LSH is to project the data
into a low-dimensional binary (Hamming) space;
i.e., each data point is mapped to a b-bit vector, called
the hash key.
Each hash function h must satisfy the locality-sensitive
hashing property:
Pr[ h(x_i) = h(x_j) ] = sim(x_i, x_j)
where sim(x_i, x_j) ∈ [0, 1] is the similarity function
of interest
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable
Distributions. In SoCG, 2004.
LSH FUNCTIONS FOR DOT PRODUCTS
The LSH hash function for dot products is a random hyperplane
separating the space: h_r(x) = 1 if r^T x ≥ 0, and 0 otherwise
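A minimal NumPy sketch of this random-hyperplane hash (the dimension and bit count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 128, 32                      # feature dimension, bits per hash key (illustrative)
R = rng.normal(size=(b, d))         # one random hyperplane normal per bit

def hash_key(x):
    """b-bit LSH key: bit i is 1 iff x lies on the positive side of hyperplane i."""
    return (R @ x >= 0).astype(np.uint8)

x = rng.normal(size=d)
print(hash_key(x))                  # e.g. [1 0 1 ...]
```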
1. LOCALITY SENSITIVE HASHING
• Take random projections of data
• Quantize each projection with few bits
[Figure: a feature vector is projected onto random hyperplanes; each projection is quantized to one bit, yielding a binary code such as 101. No learning involved. Slide credit: Fergus et al.]
HOW TO SEARCH FROM HASH TABLE?
[Figure: a set of N data points x_i is hashed by functions h_{r1...rk} into a hash table; a new query Q is hashed with the same functions, and only the small set of images (<< N) in its bucket, e.g. codes 110101, 110111, 111101, is searched for results.]
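A toy version of this bucket lookup under the same random-hyperplane scheme (all names and sizes here are illustrative; practical systems use several tables to boost recall):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
database = rng.normal(size=(10_000, 128))   # toy database of feature vectors
Q = rng.normal(size=128)                    # query vector

R = rng.normal(size=(32, 128))              # the same hyperplanes index data and queries
key = lambda x: (R @ x >= 0).astype(np.uint8).tobytes()

table = defaultdict(list)
for i, x in enumerate(database):            # index: bucket each point by its hash key
    table[key(x)].append(i)

candidates = table[key(Q)]                  # scan only the query's bucket, << N points
```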
COULD WE IMPROVE LSH?
Could we utilize learned metric to improve LSH?
How to improve LSH from learned metric?
Assume we have already learned a distance metric A
from domain knowledge
The quadratic form x^T A x induces a metric of better quality
than simple metrics such as the Euclidean distance
HOW TO LEARN DISTANCE METRIC?
First assume we have a set of domain knowledge
Use the methods described in the last lecture to
learn distance metric A
As discussed before, A is positive semi-definite, so it can be
decomposed as A = G^T G; define G = A^{1/2}.
Thus x ↦ Gx is a linear embedding function that
embeds the data into a lower-dimensional space,
with x^T A y = (Gx)^T (Gy).
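A small sketch of this factorization, assuming A is a learned positive semi-definite matrix (the toy A below is a stand-in):

```python
import numpy as np

def metric_embedding(A):
    """Factor a learned PSD metric A = G^T G, so that x^T A y = (Gx)^T (Gy)."""
    evals, evecs = np.linalg.eigh(A)
    evals = np.clip(evals, 0.0, None)          # guard against tiny negative eigenvalues
    return np.diag(np.sqrt(evals)) @ evecs.T   # one valid choice of G (A^{1/2} up to rotation)

A = np.eye(5) + 0.1                            # toy PSD metric (stand-in for a learned A)
G = metric_embedding(A)
```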
LSH FUNCTIONS FOR LEARNED METRICS
Given a learned metric with A = G^T G,
G can be viewed as a linear parametric function, or a linear
embedding function, for the data x
Thus the LSH function becomes h_r(x) = 1 if r^T Gx ≥ 0,
and 0 otherwise (hash the embedded data Gx)
The key idea is to first embed the data into a lower space by
G and then do LSH in the lower-dimensional space
P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. In CVPR, 2008.
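A hedged sketch combining the two steps, embedding by a factor G of the metric and then hashing (the identity A below is a stand-in, not a learned metric):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 128, 32
A = np.eye(d)                        # stand-in for a learned PSD metric
G = np.linalg.cholesky(A).T          # any factor with A = G^T G works
R = rng.normal(size=(b, d))

def hash_key_learned(x):
    """Embed x by G, then apply the ordinary random-hyperplane hash."""
    return (R @ (G @ x) >= 0).astype(np.uint8)
```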
SOME RESULTS FOR LSH
Caltech-101 data set
Goal: Exemplar-based Object Categorization
Some exemplars
Want to categorize the whole data set
RESULTS: OBJECT CATEGORIZATION
Caltech-101 database
Slide credit: Kristen Grauman et al.
ML = metric learning
QUESTION?
Is hashing fast enough?
Is sub-linear search time fast enough?
Retrieving (1 + ε)-near neighbors is bounded by O(n^{1/(1+ε)})
Is it fast enough?
Is it scalable enough? (Does it fit in the memory of a PC?)
NO!
Small binary codes can do better:
Cast an image to a compact binary code, with a few
hundred bits per image.
Small codes make it possible to perform real-time searches
over millions of Internet images using a single large PC.
Within 1 second! (0.146 sec for 80 million images)
80 million images: ~300 GB of raw data vs. ~120 MB as small codes
BINARY SMALL CODE
First introduced in text search/retrieval
Salakhutdinov and Hinton [3] introduced it for text
document retrieval
Introduced to computer vision by Torralba et al. [4]
[3] R. Salakhutdinov and G. Hinton. "Semantic Hashing." International Journal of Approximate Reasoning,
2009.
[4] A. Torralba, R. Fergus, and Y. Weiss. "Small Codes and Large Databases for Recognition." In CVPR, 2008.
SEMANTIC HASHING
R. Salakhutdinov and G. Hinton. Semantic Hashing.
International Journal of Approximate Reasoning, 2009.
[Figure: a query image is mapped by a semantic hash function to a binary code, used as an address into the database's address space; nearby addresses contain semantically similar images. Quite different from a (conventional) randomizing hash. Slide credit: Fergus et al.]
SEMANTIC HASHING
Similar points are mapped to similar small codes
These codes are then stored in memory, and Hamming distances
are computed (very fast, carried out by hardware)
OVERALL QUERY SCHEME
[Figure: overall query scheme. The query image is converted to a feature vector and then to a small binary code (~1 ms in Matlab); the codes for all database images are stored in memory, Hamming distances are computed in hardware (<10 μs), and similar images are retrieved (<1 ms). Slide credit: Fergus et al.]
SEARCH FRAMEWORK
Produce binary codes (01010011010)
Store these binary codes in memory
Use hardware to compute the Hamming distances
(very fast)
Sort the Hamming distances to get the final
ranking results
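A NumPy sketch of the last two steps, assuming each code is packed into a 64-bit word (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.integers(0, 2**63, size=1_000_000, dtype=np.uint64)  # packed 64-bit codes
q = db[42]                                                    # a query code

xor = db ^ q                                        # differing bits are set to 1
as_bytes = xor.view(np.uint8).reshape(-1, 8)        # view each code as 8 bytes
dist = np.unpackbits(as_bytes, axis=1).sum(axis=1)  # popcount = Hamming distance
ranking = np.argsort(dist)                          # final ranking by distance
```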
HOW TO LEARN SMALL BINARY CODE
Simplest method: threshold at the median
LSH already produces binary codes
Restricted Boltzmann Machines (RBM)
Optimal small binary codes by spectral hashing
1. SIMPLE BINARIZATION STRATEGY
Set a per-dimension threshold (unsupervised),
e.g., use the median: values below it map to 0, values above to 1
[Figure: a 1-D feature axis split at the median into 0 and 1 regions. Slide credit: Fergus et al.]
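The same strategy in a few lines of NumPy (shapes are illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(10_000, 32))  # hypothetical feature vectors
thresholds = np.median(X, axis=0)           # unsupervised per-dimension threshold
codes = (X > thresholds).astype(np.uint8)   # one bit per dimension -> 32-bit codes
```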
2. LOCALITY SENSITIVE HASHING
LSH is ready to generate binary code (unsupervised)
• Take random projections of data
• Quantize each projection with few bits
[Figure: as before, a feature vector is projected onto random hyperplanes and each projection is quantized to one bit, yielding a code such as 101. No learning involved. Slide credit: Fergus et al.]
3. RBM [3] TO GENERATE CODE
Not going into detail; see Salakhutdinov and Hinton for details
Uses a deep neural network to learn small codes
Supervised method
R. R. Salakhutdinov and G. E. Hinton. Learning a Nonlinear Embedding by Preserving Class
Neighbourhood Structure. In AISTATS, 2007.
LABELME RETRIEVAL
LabelMe is a large database of human-annotated
images
The goal of this experiment is to
First generate small codes
Use Hamming distance to search for similar images
Sort the results to produce the final ranking
Ground truth: distances between Gist descriptors
EXAMPLES OF LABELME RETRIEVAL
12 closest neighbors under different distance metrics
TEST SET 2: WEB IMAGES
12.9 million images from the Tiny Images dataset
Collected from the Internet
No labels, so use the Euclidean distance between Gist
vectors as the ground-truth distance
Note: the Gist descriptor is a feature widely used in
computer vision for scene representation
EXAMPLES OF WEB RETRIEVAL
12 neighbors using different distance metrics
Slide credit: Fergus et al.
WEB IMAGES RETRIEVAL
Observation: more bits per code yield better performance
RETRIEVAL TIMINGS
Slide credit: Fergus et al.
SUMMARY
Image search should be
Fast
Accurate
Scalable
Tree based methods
Locality Sensitive Hashing
Binary Small Code (state of the art)
QUESTIONS?
LECTURE VIII
Part II: Face Retrieval
Challenges
[Figure: a query face is matched against many gallery faces.]
• Pose, lighting, and facial expression variations challenge recognition
• Efficiently matching against a large gallery dataset is nontrivial
• The large number of subjects matters
Contextual face recognition
• Design robust face similarity metric
– Leverage local feature context to deal with visual variations
• Enable large-scale face recognition tasks
– Design efficient and scalable matching metric
– Employ image level context and active learning
– Utilize social network context to scope recognition
– Design social network priors to improve recognition
Preprocessing
Input to our algorithm:
• Face detection: boosted cascade [Viola-Jones '01]
• Eye detection: neural network
• Face alignment: similarity transform to a canonical frame
• Illumination normalization: self-quotient image [Wang et al. '04]
Feature extraction
• Gaussian pyramid: dense sampling in scale
• Patches: 8×8, extracted on a regular grid at each scale
• Filtering: convolution with 4 oriented fourth-derivative-of-Gaussian quadrature pairs
• Spatial aggregation: log-polar (DAISY-shaped) arrangement of 25 Gaussian-weighted regions
• One feature descriptor per patch: { f_1 … f_n }, f_i ∈ R^400, n ≈ 500
Face representation & matching
Adjoin spatial information: g_i = [f_i x_i y_i]', turning { f_1 … f_n } into { g_1, g_2 … g_n }
Quantize by a forest of randomized trees T_1, T_2 … T_k in Feature Space × Image Space;
each tree node tests ⟨w, [f x y]'⟩ against a threshold τ
Each feature g_i contributes to k bins of the combined histogram vector h
IDF-weighted L1 norm: w_i = log( #training / #{ training h : h(i) > 0 } )
d(h, h') = Σ_i w_i | h(i) − h'(i) |
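A sketch of the IDF weighting and the weighted L1 distance, assuming H_train is a matrix of training histograms (one row per face):

```python
import numpy as np

def idf_weights(H_train):
    """w_i = log(#training / #{training h : h(i) > 0}) for each histogram bin i."""
    df = (H_train > 0).sum(axis=0)                       # bins' document frequencies
    return np.log(H_train.shape[0] / np.maximum(df, 1))  # avoid division by zero

def weighted_l1(h, h_prime, w):
    """d(h, h') = sum_i w_i * |h(i) - h'(i)|."""
    return float(np.sum(w * np.abs(h - h_prime)))
```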
Randomized projection trees
Linear decision at each node: go left if ⟨w, [f x y]'⟩ < τ, right otherwise
w is a random projection: w ~ N(0, Σ), where Σ normalizes the spatial and feature parts
τ = median of ⟨w, [f x y]'⟩ over the data reaching the node (it can also be randomized)
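A minimal sketch of one node split; identity covariance is used here for simplicity, whereas the slides draw w ~ N(0, Σ) with Σ normalizing the spatial and feature parts:

```python
import numpy as np

def split_node(G, rng):
    """One randomized-projection split over augmented features g = [f, x, y]."""
    w = rng.normal(size=G.shape[1])   # random direction (identity covariance here)
    proj = G @ w                      # <w, g> for every point at this node
    tau = np.median(proj)             # median threshold gives a balanced split
    go_left = proj < tau
    return w, tau, G[go_left], G[~go_left]
```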
Why random projections?
• Simple
• Interact well with high-dimensional sparse data (feature descriptors!) [Dasgupta & Freund, Wakin et al., ...]
• Generalize trees previously used for vision tasks (kd-trees, Extremely Randomized Forests) [Geurts, Lepetit & Fua, ...]
Additional data-dependence can be introduced through multiple trials:
select the (w, τ) pair that minimizes a cost function (e.g., MSE, conditional entropy)
Implicit elastic matching
Even with good descriptors, dense correspondence between { g_1 … g_n } and { g'_1 … g'_n } is difficult
Instead, quantize features such that corresponding features likely map to the same bin (e.g., g_i and g'_j both fall into Bin 1)
The number of "soft matches" is given by comparing the histograms h and h' of joint quantizations:
Similarity(I, I') = 1 − ||h − h'||_{w,1}
This also enables efficient inverted-file indexing!
[Figure: a query face of Ross is matched against gallery faces.]
Exploring the optimal settings
• A subset of PIE for exploration (11,554 faces / 68 subjects)
– 30 faces per person are used for inducing the trees
• Three settings to explore
– Histogram distance metric
– Tree depth
– Number of trees
Distance metric     Reco. rate
L2 un-weighted      86.3%
L2 IDF-weighted     86.7%
L1 un-weighted      89.3%
L1 IDF-weighted     89.4%

Forest size         1        5        10       15
Reco. rate          89.4%    92.4%    93.1%    93.6%
Recognition accuracy (1)
Method           ORL              Ext. Yale B        PIE               Multi-PIE
                 40 subjects,     38 subjects,       68 subjects,      250 subjects; pose,
                 unconstrained    extreme illum.     pose, illum.      illum., expression, time
Baseline (PCA)   88.1%            65.4%              62.1%             32.1%
LDA              93.9%            81.3%              89.1%             37.0%
LPP              93.7%            86.4%              89.2%             21.9%
This work        96.5%            91.4%              94.3%             67.6%

Gallery faces: ORL: 5 faces/subject; PIE: 30 faces/subject; Yale B: 20 faces/subject; Multi-PIE: faces in the 1st session.
Recognition accuracy (2)
Method           PIE->ORL         ORL->PIE         PIE->Multi-PIE
                 (ORL->ORL)       (PIE->PIE)       (Multi-PIE->Multi-PIE)
Baseline (PCA)   85.0% (88.1%)    55.7% (62.1%)    26.5% (32.6%)
LDA              58.5% (93.9%)    72.8% (89.1%)     8.5% (37.0%)
LPP              17.0% (93.7%)    69.1% (89.2%)    17.1% (21.9%)
This work        92.5% (96.5%)    89.7% (94.3%)    67.2% (67.6%)

The first dataset is used for inducing the forest; the forest is then applied to test on the second dataset.
Scalability challenges
• Given N labeled faces, how do we choose at most M < N
faces to form the gallery such that recognition accuracy
is maximized?
• How do we handle recognizing a large number of subjects
in a social network?
An active learning framework
A discriminative model (GP + MRF):
• GP prior over latent scores: t ~ N(0, K), yielding P(Y | X)
• Gaussian likelihood: p(y_i | t_i) ∝ exp( −(y_i − t_i)² / (2σ²) )
• MRF pairwise compatibility ψ(t_i, t_j) between linked faces (with complement 1 − ψ(t_i, t_j))
Active learning: query the most uncertain unlabeled face,
x* = argmax_{i∈U} H(t_i), where H(t_i) = −Σ_{c∈C} q_bp(t_i = c) log q_bp(t_i = c)
and q_bp denotes the belief-propagation marginal.
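A small sketch of the entropy-based selection rule, assuming Q_bp holds belief-propagation marginals q_bp(t_i = c) for the unlabeled faces (the name Q_bp is illustrative):

```python
import numpy as np

def most_uncertain(Q_bp):
    """Q_bp: (num_unlabeled, num_classes) BP marginals; returns argmax_i H(t_i)."""
    H = -np.sum(Q_bp * np.log(np.maximum(Q_bp, 1e-12)), axis=1)  # per-face entropy
    return int(np.argmax(H))                                     # face to query next
```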
Experiments
• 80-minute "Friends" video
• 6 characters
• 1282 tracks, 16,720 faces
• 500 samples are used for testing
Social network scope and priors
• Scope the recognition by the social network
• Build a prior probability of whom Rachel would like to tag
Effects of social priors
[Figure: recognition accuracy compared for perfect recognition, recognition with social priors, and recognition without priors.]