Visual Recognition for Perceptive Interfaces Trevor Darrell Vision

Approximate Correspondences in
High Dimensions
Kristen Grauman*
Trevor Darrell
MIT CSAIL
(*) UT Austin…
MIT CSAIL
Vision interfaces
Key challenges: robustness
Illumination
Occlusions
Object pose
Intra-class appearance
Clutter
Viewpoint
Key challenges: efficiency
• Thousands to millions of pixels in an image
• 3,000-30,000 human recognizable object categories
• Billions of images indexed by Google Image Search
• 18 billion+ prints produced from digital camera
images in 2004
• 295.5 million camera phones sold in 2005
Local representations
Describe component regions or patches separately
Maximally Stable Extremal Regions [Matas et al.]
SIFT [Lowe]
Shape context [Belongie et al.]
Salient regions [Kadir et al.]
Harris-Affine [Schmid et al.]
Superpixels [Ren et al.]
Spin images [Johnson and Hebert]
Geometric Blur [Berg et al.]
How to handle sets of features?
• Each instance is unordered set of vectors
• Varying number of vectors per instance
Partial matching
Compare sets by
computing a partial
matching between
their features.
Pyramid match overview
optimal partial
matching
Computing the partial matching
• Optimal matching
• Greedy matching
• Pyramid match
for sets with m features of dimension d
Pyramid match overview
Pyramid match measures similarity of a
partial matching between two sets:
• Place multi-dimensional, multi-resolution grid over point sets
• Consider points matched at finest resolution where they fall into same grid cell
• Approximate optimal similarity with worst case similarity within pyramid cell
No explicit search for matches!
Pyramid match
Approximate partial match similarity: sum over pyramid levels i of
(number of newly matched pairs at level i) x (measure of difficulty of a match at level i)
[Grauman and Darrell, ICCV 2005]
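For reference, the annotated quantities on this slide can be written out as below. This is a sketch that follows the formulation in the cited ICCV 2005 paper rather than anything recoverable from the slide text; the symbols H_i, I, N_i and w_i, and the specific weight w_i = 1/2^i, are assumptions brought in from that formulation.

```latex
% Histogram intersection between the level-i histograms of X and Y
\mathcal{I}\bigl(H_i(X), H_i(Y)\bigr) = \sum_{j} \min\bigl(H_i(X)_j,\; H_i(Y)_j\bigr)

% Number of newly matched pairs at level i
% (the intersection at level -1 is taken to be zero)
N_i = \mathcal{I}\bigl(H_i(X), H_i(Y)\bigr) - \mathcal{I}\bigl(H_{i-1}(X), H_{i-1}(Y)\bigr)

% Approximate partial match similarity: new matches weighted by
% the difficulty of matching at that level
K_\Delta(X, Y) = \sum_{i=0}^{L} w_i \, N_i, \qquad w_i = \frac{1}{2^i}
```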
Pyramid extraction
Histogram pyramid: level i has bins of size 2^i
Counting matches
Histogram
intersection
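A minimal sketch of counting matches by histogram intersection and accumulating them over a uniform-bin pyramid. It assumes non-negative coordinates in [0, D), cell side 2^i at level i, and weight 1/2^i for matches first found at level i; the function names are hypothetical, and the random grid shifts and score normalization a full implementation would likely use are omitted.

```python
from collections import Counter
from math import ceil, log2

def histogram(points, side):
    """Sparse histogram: number of points in each grid cell of the given side."""
    return Counter(tuple(int(c // side) for c in p) for p in points)

def intersection(h1, h2):
    """Histogram intersection: sum of per-bin minimum counts."""
    return sum(min(n, h2[b]) for b, n in h1.items())

def pyramid_match(X, Y, D):
    """Unnormalized pyramid match score for point sets with coordinates in [0, D)."""
    L = ceil(log2(D))
    score, prev = 0.0, 0
    for i in range(L + 1):
        side = 2 ** i
        cur = intersection(histogram(X, side), histogram(Y, side))
        score += (cur - prev) / side  # newly matched pairs, weighted by 1/2^i
        prev = cur
    return score

# Two small 1-D sets: (5,) matches at the finest level, (0,) and (1,) one level up.
X = [(0,), (5,)]
Y = [(1,), (5,), (7,)]
score = pyramid_match(X, Y, D=8)  # 1 * 1 + 1 * 0.5 = 1.5
```

No pair of points is ever compared directly: each level only needs bin counts, which is why the match is implicit.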
Example pyramid match
pyramid match
optimal match
Approximating the optimal
partial matching
Randomly generated, uniformly distributed point sets with m = 5 to 100, d = 2
PM preserves rank…
and is robust to clutter…
Learning with the pyramid match
• Kernel-based methods
– Embed data into a Euclidean space via a
similarity function (kernel), then seek linear
relationships among embedded data
– Efficient and good generalization
– Include classification, regression,
clustering, dimensionality reduction,…
• Pyramid match forms a Mercer kernel
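Kernel methods only ever see the data through pairwise kernel values, so in practice a set kernel is handed to them as a Gram matrix. The sketch below builds one for a list of feature sets. The `toy_set_kernel` used here is a hypothetical stand-in (a one-level histogram intersection on 1-D values), not the pyramid match itself, and normalizing by each set's self-similarity is an assumption about how one might remove the bias toward larger sets.

```python
from collections import Counter
from math import sqrt

def toy_set_kernel(X, Y):
    """Hypothetical stand-in kernel: intersection of unit-width 1-D histograms."""
    hx = Counter(int(v // 1) for v in X)
    hy = Counter(int(v // 1) for v in Y)
    return float(sum(min(n, hy[b]) for b, n in hx.items()))

def normalized_gram(sets, kernel):
    """Gram matrix with entries K(a, b) / sqrt(K(a, a) * K(b, b))."""
    raw = [[kernel(a, b) for b in sets] for a in sets]
    n = len(sets)
    return [[raw[i][j] / sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
            for i in range(n)]

# Three sets of different sizes; the third shares no bins with the others.
sets = [[0.1, 0.2, 1.5], [0.3, 1.1, 1.7], [5.0]]
G = normalized_gram(sets, toy_set_kernel)
```

A matrix like `G` is what a precomputed-kernel classifier or clustering routine would consume; any valid set kernel can be swapped in for `toy_set_kernel`.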
Category recognition results
ETH-80 data set
[Table: kernel and complexity for Match [Wallraven et al.] and Pyramid match]
[Plots: time (s) and accuracy vs. mean number of features]
Category recognition results
0.002 s / match
5 s / match
Pyramid match kernel over spatial features with quantized appearance
[Plot: results by time of publication (2004, 6/05, 12/05, 3/06, 6/06)]
Vocabulary-guided pyramid match
But a rectangular (uniform-bin) histogram may scale poorly with input dimension…
Build a data-dependent histogram structure…
New Vocabulary-guided PM [NIPS 06]:
• Hierarchical k-means over training set
• Irregular cells; record diameter of each bin
• VG pyramid structure takes O(k^L) storage; stored once
• Individual histograms still stored sparsely
Vocabulary-guided pyramid match
Uniform bins vs. vocabulary-guided bins
• Tune pyramid partitions to the feature distribution
• Accurate for d > 100
• Requires initial corpus of features to determine pyramid structure
• Small cost increase over uniform bins: kL distances against bin centers to insert points
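To illustrate the kL insertion cost, here is a minimal sketch of pushing one point down a vocabulary tree. It assumes the bin centers have already been found (for example by hierarchical k-means on a training corpus); the dict-based tree layout, `make_node` and `insert_point` are hypothetical names for illustration, and the centers are made-up values.

```python
from math import dist

def make_node(center, children=()):
    """A tree node holding a bin center and its child bins."""
    return {"center": center, "children": list(children)}

def insert_point(point, root):
    """Descend the tree, choosing the nearest child center at each level.

    Returns the path of child indices (the point's bin at every level)
    and the number of distance computations performed.
    """
    path, n_dist, node = [], 0, root
    while node["children"]:
        kids = node["children"]
        n_dist += len(kids)
        best = min(range(len(kids)), key=lambda h: dist(point, kids[h]["center"]))
        path.append(best)
        node = kids[best]
    return tuple(path), n_dist

# Hand-built tree with branching factor k = 2 and L = 2 levels below the root.
root = make_node(None, [
    make_node((0.0, 0.0), [make_node((-1.0, -1.0)), make_node((1.0, 1.0))]),
    make_node((10.0, 10.0), [make_node((8.0, 8.0)), make_node((12.0, 12.0))]),
])
path, n_dist = insert_point((9.0, 8.0), root)  # path (1, 0), 4 distances
```

Only the k children of the current node are examined at each of the L levels, so the cost grows as k times L rather than with the total number of bins.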
Vocabulary-guided pyramid match
Match score: W * (# new matches at level i), summed over the pyramid:
$\sum_{i}\sum_{j} w_{ij}\,\bigl(\#\,\text{matches in cell } j \text{ at level } i \;-\; \#\,\text{matches in its children}\bigr)$
n_ij(X): count of histogram X at level i, cell j
c_h(n): child h of node n
w_ij: weight for level i, cell j
(1) w_ij ~= diameter of cell: Mercer kernel
(2) w_ij ~= d_ij(X) + d_ij(Y): upper bound
(d_ij(H) = max distance of H's points in cell i,j to center)
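The weighted count of new matches can be sketched as follows, assuming the two histograms are already available as node-to-count mappings over a shared tree. The node labels, the `children` mapping and the weight values are hypothetical; how the weights are derived from cell diameters or point-to-center distances is left out.

```python
def vg_match(hx, hy, weights, children):
    """Sum over nodes of weight * (matches in the node - matches in its children).

    hx, hy:   dict mapping node -> number of points of each set in that cell
    weights:  dict mapping node -> weight w_ij for that cell
    children: dict mapping node -> list of child nodes (missing for leaves)
    """
    score = 0.0
    for node, w in weights.items():
        here = min(hx.get(node, 0), hy.get(node, 0))
        below = sum(min(hx.get(c, 0), hy.get(c, 0))
                    for c in children.get(node, []))
        score += w * (here - below)  # only matches that are new at this node
    return score

# One root cell "r" with two child cells "a" and "b".
children = {"r": ["a", "b"]}
weights = {"r": 0.25, "a": 1.0, "b": 1.0}  # illustrative values only
hx = {"r": 3, "a": 2, "b": 1}
hy = {"r": 3, "b": 3}
score = vg_match(hx, hy, weights, children)  # 1.0 * 1 + 0.25 * 2 = 1.5
```

Subtracting the children's matches keeps a pair from being counted again at a coarser cell, so each pair contributes once, at the finest cell the two points share.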
Results: Evaluation criteria
• Quality of match scores
How similar are the rankings produced by the
approximate measure to those produced by the
optimal measure?
• Quality of correspondences
How similar is the approximate correspondence field
to the optimal one?
• Object recognition accuracy
Used as a match kernel over feature sets, what is the
recognition output?
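One common way to quantify how closely two rankings agree is the Spearman rank correlation. The slide does not name the statistic used, so treating it as the measure here is an assumption; the sketch also assumes no tied scores, and the score lists are made-up values.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation between two equally long score lists."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for the same four comparisons under two measures.
optimal_scores = [1.0, 2.0, 3.0, 4.0]
approx_scores = [0.9, 2.7, 2.2, 4.1]  # middle two items swapped in rank
rho = spearman(optimal_scores, approx_scores)
```

A value of 1 means the approximate measure orders the comparisons exactly as the optimal one does, and -1 means the order is fully reversed.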
Match score quality
ETH-80 images, sets of SIFT features
[Plots: vocabulary-guided pyramid match (d=8, d=128) and uniform-bin pyramid match (d=8, d=128); dense SIFT (d=128)]
k=10, L=5 for VG PM; PCA for low-dim feats
Bin structure and match counts
Data-dependent bins allow more gradual distance ranges
[Plot panels: d=3, d=8, d=13, d=68, d=113, d=128]
Approximate correspondences
Use pyramid intersections to compute
smaller explicit matchings.
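A minimal sketch of the idea: points are paired only with points that share a pyramid cell, working from the finest cells to the coarsest, so each explicit matching problem stays small. It uses uniform grid cells and a greedy nearest-pair rule inside each cell; both are assumptions made for brevity, and `cell` and `approx_correspondences` are hypothetical names.

```python
from collections import defaultdict
from math import dist, floor

def cell(p, side):
    """Index of the grid cell of the given side length that contains p."""
    return tuple(floor(c / side) for c in p)

def approx_correspondences(X, Y, sides):
    """Pair still-unmatched points that share a cell, finest level first."""
    free_x, free_y = set(range(len(X))), set(range(len(Y)))
    pairs = []
    for side in sorted(sides):
        cells = defaultdict(lambda: ([], []))
        for i in sorted(free_x):
            cells[cell(X[i], side)][0].append(i)
        for j in sorted(free_y):
            cells[cell(Y[j], side)][1].append(j)
        for xs, ys in cells.values():
            # Small explicit matching inside one cell: closest pairs first.
            candidates = sorted((dist(X[i], Y[j]), i, j) for i in xs for j in ys)
            for _, i, j in candidates:
                if i in free_x and j in free_y:
                    pairs.append((i, j))
                    free_x.discard(i)
                    free_y.discard(j)
    return pairs

X = [(0.5, 0.5), (5.2, 5.1)]
Y = [(5.6, 5.4), (0.9, 0.1), (7.5, 2.0)]
pairs = approx_correspondences(X, Y, sides=[1, 2, 4, 8])
```

The result is a partial matching: every point of the smaller set is paired, and the extra point in `Y` is simply left unmatched.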
Correspondence examples
Approximate correspondences
ETH-80 images, sets of SIFT descriptors
Impact on recognition accuracy
• VG-PMK as kernel for SVM
• Caltech-4 data set
• SIFT descriptors extracted at Harris and MSER interest points
Sets of features elsewhere
diseases as sets of gene expressions
methods as sets of instructions
documents as bags of words