Distance Functions and Metric Learning: Part 1

Michael Werman, Ofir Pele
The Hebrew University of Jerusalem

Rough Outline

• Bin-to-Bin Distances.
• Cross-Bin Distances:
  – Quadratic-Form (aka Mahalanobis) / Quadratic-Chi.
  – The Earth Mover's Distance.
• Perceptual Color Differences.
• Hands-On Code Example.

12/1/2011 version. Updated version:
http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/
Distance?

Metric

• Non-negativity: D(P, Q) ≥ 0
• Identity of indiscernibles: D(P, Q) = 0 iff P = Q
• Symmetry: D(P, Q) = D(Q, P)
• Subadditivity (triangle inequality): D(P, Q) ≤ D(P, K) + D(K, Q)
Pseudo-Metric (aka Semi-Metric)

• Non-negativity: D(P, Q) ≥ 0
• Identity of indiscernibles weakened to: D(P, Q) = 0 if P = Q
• Symmetry: D(P, Q) = D(Q, P)
• Subadditivity (triangle inequality): D(P, Q) ≤ D(P, K) + D(K, Q)

Identity of indiscernibles (D(P, Q) = 0 iff P = Q) captures the intuition that an object is most similar to itself.
Minkowski-Form Distances

L_p(P, Q) = \left( \sum_i |P_i - Q_i|^p \right)^{1/p}

Useful for many algorithms.
Minkowski-Form Distances

L_1(P, Q) = \sum_i |P_i - Q_i|

L_2(P, Q) = \sqrt{ \sum_i (P_i - Q_i)^2 }

L_\infty(P, Q) = \max_i |P_i - Q_i|

Kullback-Leibler Divergence

KL(P, Q) = \sum_i P_i \log \frac{P_i}{Q_i}

• Information-theoretic origin.
• Not symmetric.
• Undefined when Q_i = 0 and P_i > 0.
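A minimal MATLAB sketch of the measures above (the sample histograms are our own; P and Q are assumed to be non-negative row vectors of equal length):

% Minkowski-form distances and KL divergence between two histograms.
P = [0.2 0.5 0.3];
Q = [0.1 0.4 0.5];

L1   = sum(abs(P - Q));         % p = 1
L2   = sqrt(sum((P - Q).^2));   % p = 2
Linf = max(abs(P - Q));         % p = infinity

% KL: bins with P_i == 0 contribute 0; Q_i == 0 with P_i > 0 is undefined.
nz = P > 0;
KL = sum(P(nz) .* log(P(nz) ./ Q(nz)));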
Jensen-Shannon Divergence

JS(P, Q) = \frac{1}{2} KL(P, M) + \frac{1}{2} KL(Q, M), \qquad M = \frac{1}{2}(P + Q)

JS(P, Q) = \frac{1}{2} \sum_i P_i \log \frac{2 P_i}{P_i + Q_i} + \frac{1}{2} \sum_i Q_i \log \frac{2 Q_i}{P_i + Q_i}

• Information-theoretic origin.
• Symmetric.
• \sqrt{JS} is a metric.
χ² Histogram Distance

\chi^2(P, Q) = \frac{1}{2} \sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i}

• Statistical origin.
• Reduces the effect of large bins.
• \sqrt{\chi^2} is a metric.
• Experimentally, results are very similar to JS.

Jensen-Shannon Divergence

• Using a Taylor expansion and some algebra:

JS(P, Q) = \sum_{n=1}^{\infty} \frac{1}{2n(2n-1)} \sum_i \frac{(P_i - Q_i)^{2n}}{(P_i + Q_i)^{2n-1}}
         = \frac{1}{2} \sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i} + \frac{1}{12} \sum_i \frac{(P_i - Q_i)^4}{(P_i + Q_i)^3} + \cdots
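A MATLAB sketch of both measures (our own example values), which also lets one check that χ² is exactly the first term of the JS expansion:

% Chi-squared and Jensen-Shannon between two histograms (row vectors).
P = [0.2 0.5 0.3];
Q = [0.1 0.4 0.5];

S  = P + Q;
d  = P - Q;
nz = S > 0;                          % empty bins contribute 0 (0/0 := 0)
chi2 = 0.5 * sum(d(nz).^2 ./ S(nz));

M  = 0.5 * S;
kl = @(X,Y) sum(X(X>0) .* log(X(X>0) ./ Y(X>0)));
JS = 0.5*kl(P,M) + 0.5*kl(Q,M);
% chi2 is the first Taylor term of JS, so JS is close to (and at least) chi2.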
χ² Histogram Distance

[Figure: image pairs for which χ²(·,·) < χ²(·,·) while L1(·,·) > L1(·,·) – χ² is less sensitive to quantization than L1.]
Bin-to-Bin Distances

• Bin-to-bin distances such as L1, L2, and χ² are sensitive to quantization:

[Figure: image pairs for which L1(·,·) > L1(·,·) despite greater perceptual similarity.]

• χ² is experimentally better than L2.
Bin-to-Bin Distances

• More bins: more distinctiveness, less robustness.
• Fewer bins: more robustness, less distinctiveness.

Can we achieve robustness and distinctiveness?

The Quadratic-Form Histogram Distance

QF^A(P, Q) = \sqrt{(P - Q)^T A (P - Q)} = \sqrt{ \sum_{ij} (P_i - Q_i)(P_j - Q_j) A_{ij} }

• A_{ij} is the similarity between bins i and j.
• If A = I:

QF^A(P, Q) = \sqrt{ \sum_i (P_i - Q_i)^2 } = L_2(P, Q)

• If A is the inverse of the covariance matrix, QF is called the Mahalanobis distance.
• Alleviates the quantization problem.
• Does not reduce the effect of large bins.
• If A is positive-semidefinite then QF is a pseudo-metric.
• Linear time computation in the number of non-zero A_{ij} (a code sketch follows).
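A minimal MATLAB sketch of QF, in the style of the QuadraticChi code shown later (the max guard is our addition):

function dist= QuadraticForm(P,Q,A)
% P, Q are row-vector histograms, A is the bin-similarity matrix
D= P-Q;
% max guards against tiny negative values from round-off when A is only PSD
dist= sqrt( max(D*A*D',0) );

With A = eye(n) this reproduces L2; with A = inv(cov(X)) for some data matrix X it is the Mahalanobis distance.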
The Quadratic-Form Histogram Distance

• If A is positive-definite then QF is a metric.
• Writing A = W^T W:

QF^A(P, Q) = \sqrt{(P - Q)^T A (P - Q)} = \sqrt{(P - Q)^T W^T W (P - Q)} = L_2(WP, WQ)

We assume there is a linear transformation that makes bins independent. There are cases where this is not true, e.g. color.
The Quadratic-Form Histogram Distance

• Converting distance to similarity (Hafner et al. 95):

A_{ij} = 1 - \frac{D_{ij}}{\max_{ij}(D_{ij})}

or:

A_{ij} = e^{-\alpha \frac{D_{ij}}{\max_{ij}(D_{ij})}}

If α is large enough, A will be positive-definite.
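A small MATLAB sketch of both conversions (the example D, the value of alpha, and the eigenvalue check are our own):

% Convert a ground-distance matrix D to a similarity matrix A.
D = [0 1 2;
     1 0 1;
     2 1 0];               % example ground-distance matrix
Dn = D / max(D(:));        % normalize distances to [0,1]
A_lin = 1 - Dn;            % linear conversion
alpha = 10;
A_exp = exp(-alpha * Dn);  % exponential conversion
% A is positive-definite iff its smallest eigenvalue is positive:
min(eig(A_exp))            % > 0 for large enough alpha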
The Quadratic-Chi Histogram Distance

Recall QF^A(P, Q) = \sqrt{(P - Q)^T A (P - Q)}. The Quadratic-Chi family generalizes it:

QC^A_m(P, Q) = \sqrt{ \sum_{ij} \frac{(P_i - Q_i)(P_j - Q_j)}{ \left( \sum_c (P_c + Q_c) A_{ci} \right)^m \left( \sum_c (P_c + Q_c) A_{cj} \right)^m } A_{ij} }

• A_{ij} is the similarity between bins i and j. Learning the similarity matrix: part 2 of this tutorial.
• Generalizes QF and χ².
• Reduces the effect of large bins.
• Alleviates the quantization problem.
• Linear time computation in the number of non-zero A_{ij}.
The Quadratic-Chi Histogram Distance

• Non-negative if A is positive-semidefinite.
• Symmetric.
• Triangle inequality unknown.
• If we define 0/0 = 0 and 0 ≤ m < 1, QC is continuous.
• Similarity-Matrix-Quantization-Invariant property: [Figure: D(·,·) = D(·,·).]
• Sparseness-Invariant: empty bins do not change the distance. [Figure: D(·,·) = D(·,·).]
The Quadratic-Chi Histogram Distance Code

function dist= QuadraticChi(P,Q,A,m)
% P and Q are row-vector histograms, A is the bin-similarity matrix
Z= (P+Q)*A;
% 1 can be any number, as Z_i==0 iff D_i==0
Z(Z==0)= 1;
Z= Z.^m;
D= (P-Q)./Z;
% max is redundant if A is
% positive-semidefinite
dist= sqrt( max(D*A*D',0) );
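A usage sketch (the example values are our own): three bins where bins 2 and 3 are similar, so mass moved between them is partly forgiven.

P = [1 0 2];
Q = [0 1 2];
A = [1   0   0;
     0   1   0.5;
     0   0.5 1];          % bin-similarity matrix
m = 0.5;
QuadraticChi(P, Q, A, m)  % cross-bin: smaller than with A = eye(3)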
Time complexity: O(#{A_{ij} ≠ 0}).
The Quadratic-Chi Histogram Distance Code

What about sparse (e.g. BoW) histograms? For example:

P − Q = (0 0 0 −3 0 0 0 4 5 0 0 0 0)

With sparse representations the time complexity is O(SK), where S = #(P ≠ 0) + #(Q ≠ 0) and K is the average number of non-zero entries in each row of A.
The Earth Mover's Distance

• The Earth Mover's Distance is defined as the minimal cost that must be paid to transform one histogram into the other, where there is a "ground distance" between the basic features that are aggregated into the histogram.

EMD_D(P, Q) = \frac{ \min_{F=\{F_{ij}\}} \sum_{i,j} F_{ij} D_{ij} }{ \sum_{i,j} F_{ij} }

s.t.  \sum_j F_{ij} \le P_i, \quad \sum_i F_{ij} \le Q_j, \quad \sum_{i,j} F_{ij} = \min\Big( \sum_i P_i, \sum_j Q_j \Big), \quad F_{ij} \ge 0

For normalized histograms (\sum_i P_i = \sum_j Q_j = 1) this reduces to:

EMD_D(P, Q) = \min_{F=\{F_{ij}\}} \sum_{i,j} F_{ij} D_{ij}

s.t.  \sum_j F_{ij} = P_i, \quad \sum_i F_{ij} = Q_j, \quad F_{ij} \ge 0
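The normalized-histogram case above is a plain linear program, so it can be solved directly; a minimal MATLAB sketch (our own illustration, using linprog from the Optimization Toolbox; it is far slower than the dedicated algorithms discussed below, and emd_lp is a hypothetical helper name):

function dist = emd_lp(P, Q, D)
% EMD between normalized histograms P, Q (1 x n row vectors summing to 1)
% with ground-distance matrix D (n x n), solved as a linear program.
n = numel(P);
f = D(:);                           % flow costs, column-major like F(:)
Aeq = [kron(ones(1,n), speye(n));   % sum_j F(i,j) = P(i)
       kron(speye(n), ones(1,n))];  % sum_i F(i,j) = Q(j)
beq = [P(:); Q(:)];
lb  = zeros(n*n, 1);
F = linprog(f, [], [], Aeq, beq, lb, []);
dist = f' * F;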
• Pele and Werman 08 – \widehat{EMD} (EMD-hat), a new EMD definition:

\widehat{EMD}_C(P, Q) = \min_{F=\{F_{ij}\}} \sum_{i,j} F_{ij} D_{ij} + \left| \sum_i P_i - \sum_j Q_j \right| \cdot C

s.t.  \sum_j F_{ij} \le P_i, \quad \sum_i F_{ij} \le Q_j, \quad \sum_{i,j} F_{ij} = \min\Big( \sum_i P_i, \sum_j Q_j \Big), \quad F_{ij} \ge 0

[Figure: supplier/demander flow-network view of the definition (Σ demander ≤ Σ supplier; the mass difference is charged C per unit).]
When to Use EMD-hat

• When the total mass of the two histograms is important.

[Figure: image pairs illustrating the importance of total mass.]
• When the difference in total mass between histograms is a distinctive cue:

EMD(·,·) = EMD(·,·) = 0, while EMD-hat(·,·) < EMD-hat(·,·).

• If the ground distance is a metric, EMD-hat is a metric for all histograms, provided C ≥ ½∆ (∆ being the maximal ground distance), whereas EMD is a metric only for normalized histograms.
EMD-hat – a Natural Extension to L1

• EMD-hat_C = L1 if C = 1 and:

D_{ij} = \begin{cases} 0 & \text{if } i = j \\ 2 & \text{otherwise} \end{cases}

The Earth Mover's Distance Complexity Zoo

• General ground distance: O(N³ log N) – Orlin 88.
• L1, normalized 1D histograms: O(N) – Werman, Peleg and Rosenfeld 85 (see the sketch below).
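For that 1D case, the EMD with an L1 ground distance is simply the L1 distance between the cumulative histograms; a minimal sketch (example values ours):

% O(N) EMD for normalized 1D histograms with an L1 ground distance.
P = [0.1 0.4 0.3 0.2];
Q = [0.3 0.2 0.2 0.3];
emd1d = sum(abs(cumsum(P) - cumsum(Q)))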
More entries in the complexity zoo:

• L1, normalized 1D cyclic histograms: O(N) – Pele and Werman 08 (Werman, Peleg, Melter, and Kong 86).
• L1, Manhattan grids: O(N² log N (D + log N)) – Ling and Okada 07.
• L1, general histograms: O(N² log^{2D−1} N) – Gudmundsson, Klein, Knauer and Smid 07.
• What about N-dimensional histograms with cyclic dimensions?

[Figure: points p1–p8 with lines l1, l2, l2′ illustrating the construction.]
• min(L1, 2), 1D linear/cyclic {1, 2, ..., ∆} histograms: O(N) – Pele and Werman 08.
• min(L1, 2), general {1, 2, ..., ∆}^D histograms: O(N² K log(N/K)), where K is the number of edges with cost 1 – Pele and Werman 08.
• Any thresholded distance: O(N² log N (K + log N)), where K is the number of edges with cost different from the threshold – Pele and Werman 09. For K = O(log N) this is O(N² log² N).
Thresholded Distances

• EMD with a thresholded ground distance is not an approximation of EMD.
• It has better performance.

The Flow Network Transformation

[Figure: original network vs. simplified network; flowing along the Monge sequence (if the ground distance is a metric, zero-cost edges form a Monge sequence).]
The Flow Network Transformation

[Figure: removing empty bins and their edges.]

We actually finished here...
Combining Algorithms

• EMD algorithms can be combined. For example, thresholded L1.

The Earth Mover's Distance Approximations

• Charikar 02, Indyk and Thaper 03 – approximated EMD on {1, ..., ∆}^d by embedding it into the L1 norm.
  – Time complexity: O(TNd log ∆).
  – Distortion (in expectation): O(d log ∆).
The Earth Mover's Distance Approximations

• Grauman and Darrell 05 – Pyramid Match Kernel (PMK): same as Indyk and Thaper, replacing L1 with histogram intersection.
• PMK approximates EMD with partial matching.
• PMK is a Mercer kernel.
• Time complexity & distortion – same as Indyk and Thaper (proved in Grauman and Darrell 07).
• Lazebnik, Schmid and Ponce 06 – used PMK in the spatial domain (SPM).

[Figure: pyramid match at levels 0, 1, 2, with level weights ×1/8, ×1/4, ×1/2.]
The Earth Mover's Distance Approximations

• Shirdhonkar and Jacobs 08 – approximated EMD using the sum of absolute values of the weighted wavelet coefficients of the difference histogram.

[Figure: P, Q → difference → wavelet transform (levels j = 0, 1) → absolute values → weighted sum ×2^{−2·0} + ×2^{−2·1}.]

• Khot and Naor 06 – any embedding of the EMD over the d-dimensional Hamming cube into L1 must incur a distortion of Ω(d).
• Andoni, Indyk and Krauthgamer 08 – for sets with cardinalities upper bounded by a parameter s, the distortion reduces to O(log s log d).
• Naor and Schechtman 07 – any embedding of the EMD over {0, 1, ..., ∆}² must incur a distortion of Ω(√(log ∆)).
Robust Distances

• Very high distances should all count as the same difference: they are outliers.
• With colors, the natural choice is CIEDE2000 (∆E00).

[Figure: CIEDE2000 distance from blue, ranging 0–120; ∆E00(blue, red) = 56, ∆E00(blue, yellow) = 102.]

Robust Distances – Exponent

• Usually a negative exponent is used, because it is (Ruzon and Tomasi 01): robust, smooth, monotonic, and a metric. (Smoothness is a weak argument: the input is always discrete anyway...)
• Let d(a, b) be a distance measure between two features a, b. The negative exponent distance is:

d_e(a, b) = 1 - e^{-d(a,b)/\sigma}

Robust Distances – Thresholded

• The thresholded distance with a threshold of t > 0 is:

d_t(a, b) = \min(d(a, b), t)
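A tiny MATLAB sketch contrasting the two robust variants (σ, t, and the sample range are our choices):

% Negative-exponent vs. thresholded versions of a raw distance d.
d = 0:120;                    % e.g. DeltaE00 values
sigma = 5;  t = 10;
de = 1 - exp(-d ./ sigma);    % saturates smoothly, but also warps small d
dt = min(d, t);               % keeps small d intact, clips large d
plot(d, 10*de, d, dt);        % scale de by t for a comparable range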
Thresholded Distances

• Thresholded metrics are also metrics (Pele and Werman ICCV 2009).
• Thresholded vs. exponent:
  – Better results.
  – Fast computation of cross-bin distances with a thresholded ground distance: the Pele and Werman ICCV 2009 algorithm computes EMD with thresholded ground distances much faster.
  – A thresholded distance corresponds to a sparse similarity matrix A_{ij} = 1 - D_{ij} / \max_{ij}(D_{ij}), which gives faster QC / QF computation (see the sketch below).
  – The exponent changes small distances, which can be a problem (e.g. color differences).

[Figure: distance from blue under ∆E00, min(∆E00, 10), 10(1 − e^{−∆E00/4}), and 10(1 − e^{−∆E00/5}): the exponent distorts small distances; color distance should be thresholded (robust).]
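A sketch of the thresholded-distance-to-sparse-similarity construction mentioned above (the example D and t are our own):

% A thresholded ground distance yields a sparse similarity matrix.
D = [0 3 9;
     3 0 4;
     9 4 0];                 % example ground-distance matrix
t = 5;
Dt = min(D, t);              % threshold the distances
A  = sparse(1 - Dt ./ t);    % bins farther than t get similarity 0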
A Ground Distance for SIFT

The ground distance between two SIFT bins (x_i, y_i, o_i) and (x_j, y_j, o_j), with M orientation bins:

d_R = \|(x_i, y_i) - (x_j, y_j)\|_2 + \min(|o_i - o_j|, M - |o_i - o_j|)

d_T = \min(d_R, T)

A Ground Distance for Color Images

The ground distance we use between two L*a*b* image bins (x_i, y_i, L_i, a_i, b_i) and (x_j, y_j, L_j, a_j, b_j):

d_{cT} = \min\big( \|(x_i, y_i) - (x_j, y_j)\|_2 + \Delta_{00}((L_i, a_i, b_i), (L_j, a_j, b_j)), \; T \big)
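A MATLAB sketch of the thresholded SIFT ground distance d_T (the function name is our own):

function d = sift_ground_dist(b1, b2, M, T)
% b1, b2 = [x y o] SIFT bins, o in 0..M-1; M orientation bins; threshold T.
spatial = norm(b1(1:2) - b2(1:2));       % ||(x_i,y_i) - (x_j,y_j)||_2
od      = abs(b1(3) - b2(3));
orient  = min(od, M - od);               % cyclic orientation distance
d = min(spatial + orient, T);            % threshold at T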
Perceptual Color Differences

• The Euclidean distance on L*a*b* space is widely considered perceptually uniform.
• ∆E00 on L*a*b* space is better (Luo, Cui and Rigg 01; Sharma, Wu and Dalal 05).

[Figure: distance from blue under ||Lab||_2 vs. ∆E00; ||Lab||_2 ranks purples before blues.]

• But ∆E00 still has major problems: ∆E00(blue, red) = 56, while ∆E00(blue, yellow) = 102.
• Color distance should be saturated / thresholded (robust), e.g. min(∆E00, 10); the exponent alternatives 10(1 − e^{−∆E00/4}) and 10(1 − e^{−∆E00/5}) change small distances.
Perceptual Color Descriptors

• 11 basic color terms (Berlin and Kay 69): white, black, red, green, yellow, blue, brown, purple, pink, orange, grey.

[Image copyright by Eric Rolph. Taken from:
http://upload.wikimedia.org/wikipedia/commons/5/5c/Double-alaskan-rainbow.jpg]
Perceptual Color Descriptors

• How to give each pixel an "11-colors" description?
• Learning Color Names from Real-World Images, J. van de Weijer, C. Schmid, J. Verbeek, CVPR 2007: learns the color names from real-world images rather than chip-based data, and for each color returns a probability distribution over the 11 basic colors.
• Applying Color Names to Image Description, J. van de Weijer, C. Schmid, ICIP 2007: outperformed state-of-the-art color descriptors.
Perceptual Color Descriptors

• This method is still not perfect: using illumination invariants, black, gray and white get the same description.
• The 11-color vector for purple (255, 0, 255) is: 0 0 0 0 0 1 0 0 0 0 0. In real-world images, however, there are no such over-saturated colors.
• "Too much invariance" happens in other cases as well (Local features and kernels for classification of texture and object categories: an in-depth study – Zhang, Marszalek, Lazebnik and Schmid, IJCV 2007; Learning the discriminative power-invariance trade-off – Varma and Ray, ICCV 2007).
• To conclude: don't solve imaginary problems.
Open Questions

• An EMD variant that reduces the effect of large bins.
• Learning the ground distance for EMD.
• Learning the similarity matrix and normalization factor for QC.

Hands-On Code Example

http://www.cs.huji.ac.il/~ofirpele/FastEMD/code/
http://www.cs.huji.ac.il/~ofirpele/QC/code/

Tutorial:
http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/
or search for "Ofir Pele".