Zero-error dissimilarity-based classifiers*
Robert P.W. Duin and Elżbieta Pękalska
Delft Pattern Recognition Group
Faculty of Applied Sciences
Lorentzweg 1, 2628 CJ Delft, The Netherlands
[email protected]
Abstract
We consider general non-Euclidean distance measures between real-world objects that need to be classified. It is assumed that objects are represented by distances to other objects only. Conditions for zero-error dissimilarity-based classifiers are derived. Additional conditions are given under which the zero-error decision boundary is a continuous function of the distances to a finite set of training samples. These conditions concern the objects as well as the distance measure used. It is argued that they can be met in practice.
1 Introduction
In pattern recognition, real-world objects like characters or faces are traditionally represented in a feature space. Generalisation is performed by training classifiers in such a space using a set of examples, the training set. In some applications, like shape recognition, good features may be difficult to find, and dissimilarities between objects are used for representation instead [1]. Usually the nearest neighbour rule is then used as a classifier.
In our previous work we showed that, instead of the nearest neighbour rule, other, traditional classifiers like Fisher's Linear Discriminant may also be used for generalisation [2, 3, 4]. For that purpose we studied the use of dissimilarity spaces as well as Euclidean embeddings for representing the dissimilarities between the objects in the training set, as well as between the training objects and the test objects to be classified. We found that such dissimilarity-based classifiers may perform better and may be evaluated faster than the nearest neighbour rule.
A problem that had to be faced in building dissimilarity-based spaces on distance measures used in pattern recognition practice is that such measures are frequently not Euclidean embeddable and sometimes not even metric. Nevertheless, they may perform well, and it remains of practical interest to study their properties fundamentally.
An intriguing aspect of the dissimilarity-based approach to pattern recognition is that the distance measures of interest, although not Euclidean, may be given other interesting properties. One of them is that they may build a so-called Hausdorff space, in which distances are zero if and only if the objects are identical. In case the real-world objects can be unambiguously labelled, which is rather natural for real-world objects, this opens the road to building so-called zero-error classifiers. Such classifiers make use of the non-overlap of the classes and define the classification function in the gap between them.

* The original paper was written in 2003, when the first author was visiting Anil Jain at MSU in East Lansing, USA. The paper was not accepted for publication as it 'tried to proof an obvious truth'. Nevertheless, it was a significant step for the authors, as it made clear to them that there should be, in some way, a continuous relation between the real-world objects and their representation in order to have the prospect of a zero-error classifier defined by a finite training set.
For non-Euclidean embeddable datasets, however, we do not yet have a space in which this may be performed. For such datasets the problem of constructing such classifiers is not yet solved. In this paper we will approach it fundamentally and start by defining a topology based on a neighbourhood system defined on a given dissimilarity measure. We will formulate conditions for such measures and for the class of objects under consideration, and prove two fundamental theorems based on these conditions.
First, it will be proved that under very general conditions for the dissimilarity measure, which may even be non-metric, a zero-error classifier exists.
Second, general conditions with respect to the class of objects will be given under which
such a classifier can be based on a finite set of objects.
Together, these two theorems show that a zero-error classifier may be constructed in
practice using a finite training set.
2 Intuition
Before giving a formal proof we will give some intuition on what we are aiming at. Consider a two-class dissimilarity-based classification problem, e.g. face recognition from 2D images. Let the images of two persons, A and B, one for each class, be sampled from a set in which the parameters describing lighting conditions, position, facial expression, etcetera, vary continuously. Let us consider a dissimilarity measure D(x,y) between face images x and y that is continuous in the parameters generating x and y. Let D(x,y) = 0 iff x = y.
Now, under some conditions on D(•), the infinite set of images of A as well as the infinite set of images of B constitute connected sets AD and BD, respectively, defined on D(•). For every image of A there is a one-to-one relation with an element in AD; similarly for B and BD. Very similar images are close in such sets. Between any two images of the same class there exists a connected path between the two elements representing these images.
For classification problems like the given one it can be assumed that no two images, one from A and one from B, have a dissimilarity D < δ, for some δ > 0. Consequently, the two sets AD and BD do not intersect and are thereby entirely separable.
3 Conditions
We will prove the above formally if the following conditions hold for the dissimilarity measure D(x,y), ∀x,y ∈ X, in which X is the entire set of objects of interest.
Condition 3.1 D(x,y) ∈ ℜ, 0 ≤ D(x,y) < ∞.
Condition 3.2 D(x,y) is continuous in the parameters that generate the objects x and y.
Condition 3.3 D(x,y) = 0 iff x = y.
Condition 3.4 There exists a δ, 0 < δ < ∞, such that D(x,y) > δ whenever (x∈A ∧ y∈B) ∨ (x∈B ∧ y∈A).
Lemma 3.1 For a sufficiently small ε, ∀x∈A ∃y∈A with y≠x and D(x,y) < ε; similarly for x∈B with y∈B.
This follows directly from conditions 3.2, 3.3 and 3.4.
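On a finite sample these conditions can be inspected directly. The following minimal Python sketch, a hypothetical helper that is not part of the paper, checks conditions 3.1 and 3.3 on a sampled dissimilarity matrix and estimates the class gap δ of condition 3.4; the continuity condition 3.2 cannot be verified from a finite sample.

    import numpy as np

    def check_conditions(D, labels, tol=1e-12):
        # Condition 3.1: dissimilarities are real, non-negative and finite.
        D = np.asarray(D, dtype=float)
        assert np.isfinite(D).all() and (D >= 0).all()
        # Condition 3.3, sampled version: D(x,y) = 0 only for x = y,
        # i.e. only on the diagonal of the matrix.
        off_diagonal = ~np.eye(len(D), dtype=bool)
        assert (D[off_diagonal] > tol).all()
        # Condition 3.4: estimate the between-class gap delta as the
        # smallest dissimilarity over all pairs with different labels.
        labels = np.asarray(labels)
        between = labels[:, None] != labels[None, :]
        delta = D[between].min()
        return delta  # positive delta: the sampled classes do not overlap

A positive δ found on a representative sample supports the class-gap assumption of condition 3.4.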
4 The existence of a zero-error classifier
In this section we will prove that there exists a classifier that, under the given conditions, correctly classifies all x. First we define a neighbourhood basis for x and, on top of that, neighbourhoods and neighbourhood systems.
Definition 4.1 The neighbourhood basis of x, NB(x), is the set of all y with D(x,y) < ε, 0 < ε < δ.
Definition 4.2 If P(X) is the power set of X, then N∈P(X) is a neighbourhood of x if NB(x)⊆N.
Definition 4.3 A neighbourhood system N(x) is the collection of all neighbourhoods of x.
As a result, each x in A has in all its neighbourhoods some y in A:
Lemma 4.1 ∀x∈A, ∀N∈N(x), ∃y∈N with y≠x and y∈A.
From definition 4.1 it follows that ∃y∈A with D(x,y) < ε is equivalent to ∃y∈NB(x) with y∈A, which implies ∃y∈N for every N∈N(x), as NB(x)⊆N for all N∈N(x) (definitions 4.2 and 4.3). Substitution in lemma 3.1 shows the result.
The neighbourhood basis of any x in A contains no points of B, and vice versa:
Lemma 4.2 ∀x∈A, ∀y∈B: y∉NB(x) ∧ x∉NB(y).
This follows from condition 3.4 on the dissimilarities between objects in A and B (for 0 < ε < δ we have D(x,y) > δ > ε) and from definition 4.1 of NB(x).
So every x in A has a neighbourhood that contains no points of B, and similarly for B:
Lemma 4.3 ∀x∈A ∃N∈N(x) with N∩B=∅, and ∀x∈B ∃N∈N(x) with N∩A=∅.
This follows from choosing N = NB(x) and lemma 4.2.
This brings us to the central theorem of the paper: the existence of a rule that correctly classifies each x∈A as class_A and each x∈B as class_B.
Theorem 4.1
The following classifier correctly classifies every x∈A∪B:
1. if ∃N∈N(x) with N∩A=∅ ∧ N∩B=∅, reject x;
2. else if ∃N∈N(x) with N∩B=∅, classify x as class_A;
3. else if ∃N∈N(x) with N∩A=∅, classify x as class_B;
4. else reject x.
Proof:
Assume x∈A. Then:
 ∀N∈N(x) ∃y∈N with y≠x and y∈A (lemma 4.1), so N∩A≠∅ for every N∈N(x);
 hence rule 1 does not apply, but ∃N∈N(x) with N∩B=∅ (lemma 4.3);
 hence rule 2 applies and x is classified as class_A.
Assume x∈B: similarly, using rule 3 instead of rule 2, x is classified as class_B.
Rule 4 does not apply to any x∈A∪B, as every such x is already handled by rule 2 or rule 3.
q.e.d.
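In practice the neighbourhood tests of theorem 4.1 can only be approximated on a finite sample. The following Python sketch, an illustration of ours rather than the paper's construction, replaces the neighbourhood system by a single ε-ball around the test object, evaluated on its dissimilarities to the training objects of A and B, with 0 < ε < δ.

    def classify(d_to_A, d_to_B, eps):
        # d_to_A, d_to_B: dissimilarities of the test object to the
        # training objects of class A and class B; 0 < eps < delta.
        near_A = any(d < eps for d in d_to_A)
        near_B = any(d < eps for d in d_to_B)
        if not near_A and not near_B:
            return 'reject'    # rule 1: a neighbourhood misses both classes
        if not near_B:
            return 'class_A'   # rule 2: some neighbourhood contains no B
        if not near_A:
            return 'class_B'   # rule 3: some neighbourhood contains no A
        return 'reject'        # rule 4: close to both classes

For a training set that covers the classes in the sense of theorem 6.1 below, this finite approximation implements rules 1 to 4.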
5 Remarks
Theorem 4.1 just shows that an error free classifier exists. It does not describe how such a
rule may be defined based on a finite set of examples. This will be the topic of the next
section. In order to prepare that we have based the theorem not just on the neighbourhood
basis, which would have been possible, but formulated it more general in terms of
neighbourhood systems.
Rule 1 in theorem 4.1 should take care of that objects not belonging to one of the two
classes (x∉A∪B) are rejected. Some x∉A∪B, however, having a sufficiently small
dissimilarity with objects in A or B may also be classified as A or B and not rejected. This
does not contradict the theorem as this considers only points x∈A∪B.
Rule 4 takes care of objects x∉A∪B that have small dissimilarities to objects in A as well
as B and rejects them.
Further, it can be proven that for each subset S of A, the closure of S, cl(S), will be classified as A (objects on the border of A have neighbourhoods that contain no points of B, and vice versa). So classes are closed sets: they contain their boundary.
6 The decision function
We will now prove that the classification function can be expressed in the dissimilarities to a finite set of points if the dissimilarity measure is bounded (any distance is finite) and the objects can be generated by a finite set of parameters. We need two additional conditions:
Condition 6.1 The number of parameters by which an arbitrary object can be generated is
finite, say n.
Condition 6.2 The values of these object parameters are bounded.
Theorem 6.1 The classifier defined in theorem 4.1 can be based on a finite set of objects.
Sketch of a proof:
1. There exists a one-to-one mapping between the sets of objects, A or B, and subsets of ℜ^n (condition 6.1).
2. The dissimilarities of a fixed, given object x to all possible objects constitute a continuous mapping ℜ^n → ℜ (conditions 3.2 and 6.1).
3. As the object x itself is included in the set of all objects, there is a single point in ℜ^n that maps on the origin (condition 3.3).
4. Because of the continuity of the mapping (condition 3.2), a finite region in ℜ^n around x maps into [0,ε) in ℜ and thereby into the neighbourhood NB(x). Call this region an object patch.
5. Because the object parameters are bounded (condition 6.2), the entire set of possible objects can be mapped on a bounded region in ℜ^n. Call this region, for all objects of class A (B), the class patch of class A (B).
6. From 4 and 5 it follows that just a finite set of objects is needed to cover the class patches with object patches: the class patches are bounded, and closed as classes contain their boundary (section 5), hence compact by the Heine-Borel theorem, so the open cover by object patches has a finite subcover. A greedy construction of such a finite cover on a sampled class is sketched after this list.
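The greedy construction referenced in step 6 can be sketched as follows in Python, assuming a sampled class represented by its within-class dissimilarity matrix (an illustration of ours; the paper only argues the existence of a finite cover):

    import numpy as np

    def greedy_cover(D_class, eps):
        # D_class: square matrix of dissimilarities within one class.
        # Select prototypes until every sampled object lies within eps
        # of a prototype, i.e. the object patches cover the class patch.
        D_class = np.asarray(D_class, dtype=float)
        uncovered = set(range(len(D_class)))
        prototypes = []
        while uncovered:
            # pick the object whose eps-ball covers most uncovered objects
            best = max(uncovered,
                       key=lambda i: np.sum(D_class[i, list(uncovered)] < eps))
            prototypes.append(best)
            uncovered -= {j for j in uncovered if D_class[best, j] < eps}
        return prototypes

The loop always terminates, as the selected object always covers at least itself (D(x,x) = 0 < ε by condition 3.3).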
As a result, the classifier can be written as a continuous function of a finite set of dissimilarities:
Theorem 6.2 The classifier defined in theorem 4.1 can be written as a continuous function of the dissimilarities to a finite set of objects.
Sketch of a proof:
1. Any point can be correctly classified by the nearest neighbour rule applied to the dissimilarities to the set of objects that cover the class patches PA for class A and PB for
class B (theorem 6.1, step 6).
2. The continuous decision function

   if  Σ_{x∈PA} exp(−D(x,y)/s) − Σ_{x∈PB} exp(−D(x,y)/s) > 0  then y → class_A else y → class_B

performs the same classification. It classifies any object correctly if s is sufficiently small, i.e. if 0 < s < (δ−ε)/log(max(|PA|,|PB|)), as for that value of s the term with the nearest neighbour object dominates.
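The decision function of step 2 can be transcribed directly, given the dissimilarities of a test object y to the two prototype sets; a minimal Python sketch (function and argument names are ours):

    import math

    def decide(d_to_PA, d_to_PB, s):
        # d_to_PA, d_to_PB: dissimilarities D(x,y) of the test object y
        # to the prototypes x in PA and PB; s as bounded in the text above.
        f = (sum(math.exp(-d / s) for d in d_to_PA)
             - sum(math.exp(-d / s) for d in d_to_PB))
        return 'class_A' if f > 0 else 'class_B'

For small s each sum is dominated by its smallest dissimilarity, so the sign of f reproduces the nearest neighbour decision of step 1.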
7 Discussion
The given conditions may be met in many practical applications. For instance, if objects can be identified without confusion from a video screen using a finite set of pixels, then for many shape distance measures defined on these pixels the conditions are fulfilled. Consequently, for such applications a zero-error classifier exists (theorem 4.1) and can be based on a finite set of objects (theorem 6.1) by a continuous dissimilarity-based classifier (theorem 6.2). As the conditions on the dissimilarity measure are very mild (it does not even have to be a proper metric), such measures may be studied in great freedom, using all possible knowledge from the application. This largely enables the inclusion of expert knowledge in statistical pattern recognition procedures.
Acknowledgements
This research was supported by the Dutch Organization for Scientific Research (NWO).
References
[1] L. da Fontoura Costa and R. Marcondes Cesar Jr., Shape Analysis and Classification, CRC Press, Boca Raton, 2001.
[2] E. Pekalska and R.P.W. Duin, Automatic pattern recognition by similarity representations - a novel approach, Electronics Letters, vol. 37, no. 3, 2001, 159-160.
[3] E. Pekalska and R.P.W. Duin, Similarity representations allow to build good classifiers, Pattern Recognition Letters, vol. 23, no. 8, 2002, 943-956.
[4] E. Pekalska, P. Paclik, and R.P.W. Duin, A generalized kernel approach to dissimilarity based classification, Journal of Machine Learning Research, vol. 2, no. 2, 2002, 175-211.
[5] E. Pekalska, D.M.J. Tax, and R.P.W. Duin, One-class LP classifier for dissimilarity representations, Advances in Neural Information Processing Systems (NIPS 2002), Vancouver, 2003.
[6] R.P.W. Duin and E. Pekalska, Possibilities of zero-error recognition by dissimilarity representations (invited), in: J.M. Inesta, L. Mico (eds.), Pattern Recognition in Information Systems (Proc. PRIS2002, Alicante, April 2002), ICEIS Press, Setubal, Portugal, 2002, 20-32.
[7] A.K. Jain and D. Zongker, Representation and recognition of handwritten digits using deformable templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 12, 1997, 1386-1391.
[8] M.-P. Dubuisson and A.K. Jain, Modified Hausdorff distance for object matching, Proc. 12th IAPR International Conference on Pattern Recognition (Jerusalem, October 9-13, 1994), vol. 1, IEEE, Piscataway, NJ, 1994, 566-568.
[9] A.K. Jain, R.P.W. Duin, and J. Mao, Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, 2000, 4-37.
[10] J.S. Sanchez, F. Pla, and F.J. Ferri, Prototype selection for the nearest neighbour rule through proximity graphs, Pattern Recognition Letters, vol. 18, no. 6, 1997, 507-513.
[11] U. Lipowezky, Selection of the optimal prototype subset for 1-NN classification, Pattern Recognition Letters, vol. 19, no. 10, 1998, 907-918.