Lecture 5: Distance-based methods
Pattern Recognition

Alfons Juan-Císcar
[email protected]
www.dsic.upv.es/∼ajuan
Department of Computer Systems and Computation (DSIC)
Polytechnic University of Valencia
Last update: September 21, 2010

Contents
5.1 Introduction
5.2 Metric spaces and distance functions
5.3 Minimum distance classifier
5.4 Nearest neighbor classifier
5.5 k-nearest neighbor classifier
5.1 Introduction

As its name indicates, distance-based methods require a distance function to be defined so as to measure the proximity between any pair of data points.

Posterior class probabilities are locally estimated from N prototypes (labeled training samples) as:

    p̂(c | x) = k_c(x) / k                                                  (5.1)

where k is a predefined number of nearest neighbors to be considered and k_c(x) is the number of nearest neighbors of x that are labeled with c.

Using (5.1), the Bayes classifier can be approximated as:

    c*(x) = arg max_{c=1,...,C} p(c | x) ≈ arg max_{c=1,...,C} k_c(x)       (5.2)

That is, x is assigned to the most voted class among its k nearest neighbors.

Example and justification

[Figure: three-class training set in the plane with N = 14 prototypes (N_• = 6, N_◦ = 5, N_△ = 3); the k = 5 nearest neighbors of a test point x lie inside a ball of radius 2, of volume V = V(Ball(x, rad = 2)) = 4π, and comprise k_•(x) = 3, k_◦(x) = 1 and k_△(x) = 1 prototypes, so c*(x) ≈ •.]

The rule can be justified as a local, nonparametric estimate of the class-conditional densities and priors,

    p̂(x | c) = k_c / (N_c V)        p̂(c) = N_c / N

so that, by Bayes' rule,

    p̂(c | x) = p̂(c) p̂(x | c) / Σ_{c'} p̂(c') p̂(x | c') = k_c / k

In the example: p̂(•) = 6/14, p̂(x | •) = 3/(6V), p̂(• | x) = 3/5; p̂(◦) = 5/14, p̂(x | ◦) = 1/(5V), p̂(◦ | x) = 1/5; p̂(△) = 3/14, p̂(x | △) = 1/(3V), p̂(△ | x) = 1/5.
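A minimal Python sketch of the estimate (5.1) and the voting rule (5.2); it assumes the prototypes and their labels are plain Python lists, and the names knn_posteriors and knn_classify are illustrative rather than the lecture's:

    from collections import Counter
    import math

    def knn_posteriors(x, prototypes, labels, k):
        """Estimate p(c | x) = k_c(x) / k from N labeled prototypes (eq. 5.1)."""
        order = sorted(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
        nearest = [labels[i] for i in order[:k]]      # classes of the k nearest neighbors
        counts = Counter(nearest)                     # k_c(x) for each class c
        return {c: counts[c] / k for c in set(labels)}

    def knn_classify(x, prototypes, labels, k):
        """Approximate Bayes decision: the most voted class (eq. 5.2)."""
        posteriors = knn_posteriors(x, prototypes, labels, k)
        return max(posteriors, key=posteriors.get)

For the example above (k = 5, k_•(x) = 3, k_◦(x) = 1, k_△(x) = 1) it would return the posteriors 3/5, 1/5 and 1/5 and decide •.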
5.2 Metric spaces and distance functions

A metric space is a pair (E, d) comprising a set of points E and a mapping

    d : E × E → R                                                           (5.3)

called metric or distance (function), satisfying, for all x, y, z ∈ E:

    d(x, x) = 0                                                             (5.4)
    d(x, y) > 0   if x ≠ y                                                  (5.5)
    d(x, y) = d(y, x)                                                       (5.6)
    d(x, y) + d(y, z) ≥ d(x, z)                                             (5.7)

The last property is called the triangle inequality: going directly from x to z is never longer than passing through an intermediate point y.

[Figure: triangle with vertices x, y, z illustrating d(x, y) + d(y, z) ≥ d(x, z).]

5.2.1 Vectorial metrics: the Euclidean distance or L2

The Euclidean distance or L2 in R^D is defined as:

    d_2(x, y) = sqrt( Σ_d (x_d − y_d)² )      for all x, y ∈ R^D            (5.8)

This is the most popular distance in R^D and the one we will use by default. In the plane, the Euclidean ball B_2(a, r) = {x : d_2(x, a) ≤ r} has the form of a circle of radius r centered at a.
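A minimal sketch of the Euclidean distance (5.8), together with a numerical check of the triangle inequality (5.7); the function name d2 is illustrative:

    import math

    def d2(x, y):
        """Euclidean (L2) distance between two vectors of R^D (eq. 5.8)."""
        return math.sqrt(sum((xd - yd) ** 2 for xd, yd in zip(x, y)))

    # Numerical check of the triangle inequality (5.7) on three points of R^2.
    x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
    assert d2(x, y) + d2(y, z) >= d2(x, z)    # 5 + 5 >= 10 (equality: y lies on the segment xz)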
5.2.2 Vectorial metrics: distances L1 and L∞

The Euclidean distance sums squared differences, and thus it might depend too much on the dimensions (features) showing the largest differences.

The L1 metric is the fairest distance since it sums absolute differences:

    d_1(x, y) = Σ_d |x_d − y_d|        for all x, y ∈ R^D                   (5.9)

On the contrary, L∞ takes into account only the maximum difference in absolute value:

    d_∞(x, y) = max_d |x_d − y_d|      for all x, y ∈ R^D                   (5.10)

The balls of L1, L2 and L∞ are tightly related: for the same center and radius, B_1 ⊆ B_2 ⊆ B_∞.

[Figure: balls B_1 (diamond), B_2 (circle) and B_∞ (square) in the plane, nested in that order.]

5.2.3 Non-vectorial metrics: the edit distance

The edit distance computes the minimum number of elementary edit operations (substitution of a symbol by a distinct one, a → b; deletion, a → λ; insertion, λ → b, where λ denotes the empty string) required to transform one string into another. It can be computed in O(|x||y|) time by Dynamic Programming.

Example: d(paTernn, pattern) = 3, as shown by the trace

    _paTernn  --p→p-->  p_aTernn  --a→a-->  pa_Ternn  --λ→t-->  pat_Ternn  --T→t-->  patt_ernn
    patt_ernn --e→e-->  patte_rnn --r→r-->  patter_nn --n→λ-->  patter_n   --n→n-->  pattern_

which uses one insertion (λ→t), one substitution (T→t) and one deletion (n→λ); the remaining operations are matches and do not count.
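A minimal dynamic-programming sketch of the edit distance, filling the usual (|x|+1) × (|y|+1) table in O(|x||y|) time as stated above; the code is illustrative, not the lecture's:

    def edit_distance(x, y):
        """Minimum number of insertions, deletions and substitutions turning x into y."""
        D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(1, len(x) + 1):
            D[i][0] = i                             # delete the first i symbols of x
        for j in range(1, len(y) + 1):
            D[0][j] = j                             # insert the first j symbols of y
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                sub = 0 if x[i - 1] == y[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,          # deletion
                              D[i][j - 1] + 1,          # insertion
                              D[i - 1][j - 1] + sub)    # match / substitution
        return D[len(x)][len(y)]

    print(edit_distance("paTernn", "pattern"))   # 3, matching the trace above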
Example: edit distance between chain codes

[Figure: contours of handwritten digits encoded as 4-directional chain codes, i.e. strings over {0, 1, 2, 3}, for instance 0 3 3 3 2 3 3 2 3 3 2 3 2 1 1 1 0 1 1 0 1 1 0 1 and 0 0 3 0 3 0 3 3 0 3 3 2 3 2 3 2 2 2 2 2 1 1 2 1 1 1 0 1 0 1 0 1. The edit operations of the optimal alignments are marked on the codes; the two pairs of digits compared give d(x, y) = 6 and d(x, y) = 12.]
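The edit_distance sketch above applies unchanged to chain codes, since they are just strings over {0, 1, 2, 3}; the two short codes below are made up for illustration and are not the ones in the figure:

    # Two short, made-up 4-directional chain codes.
    x = "0011"
    y = "0101"
    print(edit_distance(x, y))    # 2: one optimal alignment substitutes the two middle symbols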
5.3 Minimum distance classifier

In the simplest case, k = 1 and each class c is represented by a single prototype p_c. The approximation (5.2) results in the so-called minimum distance classifier:

    c(x) = arg min_c d(x, p_c)         for all x ∈ E                        (5.11)

In the Euclidean case, it is linear in x:

    c(x) = arg min_c ||x − p_c||
         = arg min_c (x − p_c)^t (x − p_c)          (the monotone square root can be dropped)
         = arg min_c x^t x − 2 p_c^t x + p_c^t p_c
         = arg max_c 2 p_c^t x − p_c^t p_c
         = arg max_c g_c(x)

with g_c(x) = w_c^t x + w_c0, where w_c = 2 p_c and w_c0 = −p_c^t p_c.

Learning: each class is usually represented by its sample mean or median.
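A minimal sketch of the minimum distance classifier in its linear form g_c(x) = 2 p_c^t x − p_c^t p_c, assuming one prototype p_c per class (e.g. the class sample mean) has already been learned; names and data are illustrative:

    def min_distance_classifier(x, prototypes):
        """prototypes: dict mapping class label c -> prototype vector p_c."""
        def g(p):
            # g_c(x) = 2 p_c^t x - p_c^t p_c  (maximizing this minimizes ||x - p_c||)
            return 2 * sum(pd * xd for pd, xd in zip(p, x)) - sum(pd * pd for pd in p)
        return max(prototypes, key=lambda c: g(prototypes[c]))

    # Example: prototypes at (0, 0) and (4, 4); the point (1, 1) is closer to the first one.
    print(min_distance_classifier((1.0, 1.0), {"A": (0.0, 0.0), "B": (4.0, 4.0)}))   # "A"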
Minimum distance classifier example

[Figure: two classes represented by single prototypes p_• and p_◦; the linear discriminant functions g_•(x) and g_◦(x) and the resulting linear decision boundary are plotted over x ∈ [0, 8].]

5.4 Nearest neighbor classifier

Consider the case k = 1 with no constraints on the number of prototypes from each class. The approximation (5.2) is then called the nearest neighbor classifier:

    c(x) = arg min_c min_{p ∈ P_c} d(x, p)         for all x ∈ E            (5.12)

where P_c is the set of prototypes that represents class c, c = 1, . . . , C.

In the Euclidean case, it is a piecewise linear function of x:

    c(x) = arg min_c min_{p ∈ P_c} ||x − p||
         = arg min_c min_{p ∈ P_c} (x − p)^t (x − p)
         = arg min_c min_{p ∈ P_c} x^t x − 2 p^t x + p^t p
         = arg min_c min_{p ∈ P_c} −2 p^t x + p^t p
         = arg max_c g_c(x)

with g_c(x) = max_{p ∈ P_c} 2 p^t x − p^t p.

Learning: each class is usually represented by all its available prototypes.
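A minimal sketch of the nearest neighbor rule (5.12), keeping all prototypes of each class and reusing the d2 function from the sketch in 5.2.1; names and data are illustrative:

    def nn_classifier(x, prototypes):
        """prototypes: dict mapping class label c -> list of prototype vectors P_c."""
        return min(prototypes,
                   key=lambda c: min(d2(x, p) for p in prototypes[c]))

    # The class whose closest prototype is nearest to x wins.
    P = {"•": [(1.0, 1.0), (2.0, 5.0)], "◦": [(6.0, 6.0)]}
    print(nn_classifier((2.0, 4.0), P))   # "•", since (2, 5) is the closest prototype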
Nearest neighbor classifier example

[Figure: two classes with several prototypes each; the per-prototype linear functions 2 p^t x − p^t p, their per-class maxima g_•(x) and g_◦(x), and the resulting piecewise linear decision boundary are plotted.]

5.4.1 Boundaries in 2D: Voronoi diagrams

In 2D, each prototype induces a Voronoi cell (the region of points closer to it than to any other prototype), and the NN decision boundary is formed by the cell edges separating prototypes of different classes.

[Figure: synthetic example with 100 6s and 100 9s (1×2 images, 64 grey levels) represented by two features, lower brightness (x1) and upper brightness (x2); the prototypes, their Voronoi cells and the resulting NN decision regions are shown.]
5.4.2 Asymptotic NN probability of error

Let P* be the Bayes error and let P be the NN error when N → ∞. Then:

    P* ≤ P ≤ P* (2 − C/(C−1) P*) ≤ 2 P*                                     (5.13)

That is, the asymptotic NN error is never worse than twice the Bayes error. For C = 2 and P* = 0.1, for instance, the bound gives P ≤ 0.1 · (2 − 2 · 0.1) = 0.18 ≤ 2P* = 0.2.

[Figure: the upper bound P*(2 − C/(C−1) P*) plotted as a function of P* for C = 2, 3, 5, 10, 100, between the lower bound P* and the line 2P*.]

5.5 k-nearest neighbor classifier

In the general case, (5.2) is known as the k-nearest neighbor classifier:

    c(x) = arg max_{c=1,...,C} k_c(x)         for all x ∈ E                 (5.14)

where k_c(x) can be formally described as

    k_c(x) = |k(x) ∩ P_c|                                                   (5.15)

where P_c denotes the prototypes from class c, and k(x) a set of k nearest neighbors of x; i.e.

    k(x) = arg min_{S ⊆ P, |S| = k} max_{p ∈ S} d(x, p)                     (5.16)

Tie-breaking rule: decide among the tied classes by using the NN rule.

Learning: all available prototypes are typically used.

Note: the 2-NN classifier is equivalent to the (1-)NN classifier.
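A minimal sketch of the k-NN rule (5.14)-(5.16) with ties broken by the NN rule, reusing d2 from the sketch in 5.2.1; the function name knn_decide is illustrative:

    from collections import Counter

    def knn_decide(x, prototypes, labels, k):
        """k-NN decision (5.14) with ties broken by the nearest neighbor rule."""
        order = sorted(range(len(prototypes)), key=lambda i: d2(x, prototypes[i]))
        nearest = [labels[i] for i in order[:k]]       # classes of the k nearest neighbors
        counts = Counter(nearest)                      # k_c(x), eq. (5.15)
        best = max(counts.values())
        tied = [c for c, votes in counts.items() if votes == best]
        if len(tied) == 1:
            return tied[0]
        # Tie-breaking: the class of the closest neighbor among the tied classes.
        return next(c for c in nearest if c in tied)

For k = 2 the rule always reduces to the 1-NN decision (either both neighbors agree, or the tie is broken by the nearest one), which is the equivalence noted above.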
k-nearest neighbor classifier example

[Figure: two-class training set in the plane; for a test point x (marked ×), the 3 nearest neighbors are highlighted and the counts |3-NN(x) ∩ P_c| for each class determine the decision.]

k-nearest neighbor classifier example (2)

[Figure: the same kind of example, with the k nearest neighbors of x listed in order of increasing distance.]

    k   class of k-th NN   dist.   k-NN decision
    1          •            √1     1-NN → •
    2          ◦            √2     2-NN → •  (tie broken by the NN rule)
    3          ◦            √4     3-NN → ◦
    4          •            √5     4-NN → •  (tie broken by the NN rule)
    5          ◦            √8     5-NN → ◦

Asymptotic k-NN probability of error

The asymptotic k-NN probability of error tends to the Bayes error when the following conditions are satisfied:

    N → ∞                                                                   (5.17)
    k → ∞                                                                   (5.18)
    k/N → 0                                                                 (5.19)

The last two conditions are satisfied by using, for instance, k = √N.

Excellent result: N → ∞ ⇒ the k-NN error tends to the Bayes error.

Problem 1: convergence might be really slow.
Problem 2: high computational cost.
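As a small, purely illustrative check of the k = √N guideline, one can plug k = ⌊√N⌋ into the knn_decide sketch above (tiny made-up one-dimensional data, reusing d2 as well):

    import math

    protos = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,), (14.0,)]
    labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

    k = max(1, math.isqrt(len(protos)))            # N = 9 prototypes, so k = 3 ≈ √N
    print(knn_decide((2.5,), protos, labels, k))   # the 3 nearest (2.0, 3.0, 1.0) are all "a"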