Probability–based distance function for distance–based
classifiers
Cezary Dendek¹ and Jacek Mańdziuk²

¹ Warsaw University of Technology, Faculty of Mathematics and Information Science,
Plac Politechniki 1, 00-661 Warsaw, POLAND,
[email protected]
² Warsaw University of Technology, Faculty of Mathematics and Information Science,
Plac Politechniki 1, 00-661 Warsaw, POLAND,
(phone: (48 22) 621 93 12; fax: (48 22) 625 74 60)
[email protected]
WWW home page: http://www.mini.pw.edu.pl/~mandziuk/
Abstract. In the paper a new measure of distance between events/observations in the pattern space is proposed and experimentally evaluated with the use of a k-NN classifier in the context of binary classification problems. The application of the proposed approach visibly improves the results, in terms of both speed and accuracy, compared to training without the postulated enhancements.
Numerical results are very promising and outperform the reference literature results of k-NN classifiers built with other distance measures.
1 Introduction
The problem of constructing and measuring a distance between observations is frequently encountered in numerous application fields. The usual approach is based on constructing distance functions in the directly observed space (usually R^n).
However, in many real-life problems the dimensions of observations are mappings of a probability space (e.g. biological data often expresses the genome and history of an individual). Exploration of the probability space features and their inclusion in the distance measure, often based on the correlation between dimensions (Mahalanobis distance) and their individual influence on classification accuracy (weighted distances), usually increases accuracy. Effectively, those improvements change the space of observations or the space of their differences using linear and quadratic transformations.
In this paper another direction is explored: measurement of distance in a partially reconstructed and standardized probability space. The model of distance is proposed in section 2. The benchmark data sets and the results of a numerical evaluation of the proposed distance measure's efficacy in the context of a k-NN classifier are presented in sections 3 and 4, respectively. Conclusions and directions for future research are placed in the last section.
The presented distance measure is a direct continuation and generalization of the authors' previous work [1], which introduced a probability-related distance measure, and of works [2, 3] related to properties of the metrical structure of the pattern space.
2 Distance in the training patterns space
2.1 Introduction
The pattern space has a naturally defined metric space structure, obtained by its immersion into R^n. This approach, however, does not preserve the structure of the probability space, which can be used to improve the accuracy of estimators.
An improved immersion can be obtained with the use of Cumulative Distribution Functions (CDF) by a transformation of the pattern space, as described in [1]. Let CDF_i denote the CDF calculated on the i-th dimension of the pattern space. The transformation of a pattern is defined as follows:

(CDF(x))_i := CDF_i(x_i)
Application of the CDF transformation to the pattern space creates a standardized space (denoted CDF–Space). Projection of the training patterns into CDF–Space results in a uniform distribution of patterns in each dimension (the marginal distributions are U[0, 1]).
An estimate of the CDF (denoted ECDF) can be obtained either by parametric estimation (fitting the parameters of an arbitrarily chosen family of distributions) or by the use of the simple non-parametric estimator:

ECDF_i(x) = \frac{|\{z ∈ TrSet : z_i ≤ x_i\}|}{|TrSet|},

where TrSet denotes the training set.
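For illustration, a minimal Python sketch of this non-parametric transformation (the function name and array layout are ours, not taken from the paper):

import numpy as np

def ecdf_transform(train, data):
    # Map patterns into CDF-space with the simple non-parametric estimator:
    # ECDF_i(x) = |{z in TrSet : z_i <= x_i}| / |TrSet|, applied per dimension.
    train = np.asarray(train, dtype=float)
    data = np.asarray(data, dtype=float)
    n = train.shape[0]
    out = np.empty_like(data)
    for i in range(data.shape[1]):
        sorted_col = np.sort(train[:, i])
        out[:, i] = np.searchsorted(sorted_col, data[:, i], side="right") / n
    return out

Applying ecdf_transform(TrSet, TrSet) yields the training patterns in CDF–Space; the same function maps test patterns using the training-set ECDF.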
2.2 Model of distance
Structure of the introduced distance measure consists of two components: univariate
distance measure (discussed in Sect 2.5) providing a measure of distance in a given
dimension and linking function (discussed in Sect 2.6) which combines those measures
and provides a univariate distance. Proposed distance measures are applicable to probabilistic spaces of structure Se , introduced in Sect 2.3.
2.3 Event manifestation error
Let D be the probability distribution of a continuous random variable V over a probability space S. The idea of the proposed probabilistic distance is based on a modification of the standard sampling process: an event E in the space S does not map directly to a value v of V; instead, an error of event manifestation, denoted e, is introduced: E maps to the neighborhood of v according to the distribution of the error e and the distribution D:

v = V(E + e)

The proposed model creates a new probability space S_e.
The process of sampling the space S can be expressed in terms of the CDF of the distribution D by converting S, mapping its events to the U[0, 1] distribution, and applying the inverse transform theorem. The random variable V can be sampled in the following way:

V = CDF^{-1}(U[0, 1])

Let the error of event manifestation be a random variable with distribution Err. The process of sampling the space S_e can then be expressed as:

V = CDF^{-1}(\min(\max(U[0, 1] + Err, 0), 1))
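As a minimal illustration (assuming access to the inverse CDF of D; the helper name is ours), the modified sampling process could be sketched as:

import numpy as np

def sample_Se(inv_cdf, size, rng=None):
    # Sample V in the modified space S_e: the uniform seed of the standard
    # inverse-CDF sampler is perturbed by the manifestation error Err,
    # clipped back to [0, 1] and pushed through CDF^{-1}.
    rng = rng or np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size)        # U[0, 1] seed
    err = rng.uniform(-1.0, 1.0, size)     # Err, here U[-1, 1] as chosen in Sect. 2.4
    return inv_cdf(np.clip(u + err, 0.0, 1.0))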
2.4 Model of probabilistic distance
Let v be an observation of a random variable V and x be a fixed point, x ∈ R. The uniform distribution U[−1, 1] has been chosen as the distribution Err of the event manifestation error, for the sake of simplicity.
The probabilistic distance from the fixed point x to the observation v is the probability measure of the smallest neighborhood of x generated by the manifestation error e and containing v. In terms of the CDF it can be expressed as:
d(x; v) = \int_{CDF^{-1}(x_c - |x_c - v_c|)}^{CDF^{-1}(x_c + |x_c - v_c|)} dCDF(x) = \min(1, x_c + |x_c - v_c|) - \max(0, x_c - |x_c - v_c|),

where x_c = CDF(x) and v_c = CDF(v). As the postulated measure is a probability, d(x; v) ≥ 0.
The contour plot of function d(x; v) is presented in Fig 1.
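A direct transcription of this formula as a sketch (our naming; the arguments are assumed to be the CDF-space coordinates x_c = CDF(x) and v_c = CDF(v)):

def prob_distance(x_c, v_c):
    # d(x; v) = min(1, x_c + |x_c - v_c|) - max(0, x_c - |x_c - v_c|)
    r = abs(x_c - v_c)
    return min(1.0, x_c + r) - max(0.0, x_c - r)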
2.5 Univariate distance measures in CDF–Space
A distance measure on the CDF–Space is required to be symmetrical and to operate on observations rather than fixed values. This goal can be achieved by a combination of d(x; v) values. The contour plots of all proposed variants of the distance are presented in Fig. 2. The following variants have been considered:
Distance based on expected value
D_{ExpVal}(u, v) = \frac{d(\frac{u+v}{2}; v) + d(\frac{u+v}{2}; u)}{2}

This variant expresses the assumption that both observed events, u and v, are manifestations of their average value. The expression can be simplified as:

D_{ExpVal}(u, v) \propto d(\frac{u+v}{2}; v) + d(\frac{u+v}{2}; u) \propto |CDF(u) - CDF(v)|

The obtained simplified form has been evaluated in detail in [1].
Distance based on min function
D_{Min}(u, v) = \min(d(u; v), d(v; u))
Fig. 1. Probabilistic distance from fixed point x to observation v.
Distance based on max function
D_{Max}(u, v) = \max(d(u; v), d(v; u))
Distance based on the distance average
D_{Avg}(u, v) = \frac{d(u; v) + d(v; u)}{2} \propto d(u; v) + d(v; u) = D_{Max}(u, v) + D_{Min}(u, v)
Distance based on the independence assumption
D_{Ind}(u, v) = 1 - (1 - d(u; v))(1 - d(v; u)) = d(v; u) + d(u; v) - d(v; u) d(u; v)
Distance based on the Cartesian sub-linking
D_{Cart}(u, v) = \sqrt{d(u; v)^2 + d(v; u)^2}
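The six variants can be expressed compactly in terms of the prob_distance sketch above (the function names are ours; the arguments are single-dimension CDF-space coordinates of the two observations):

from math import sqrt

def d_pair(u, v):
    # the two directed probabilistic distances between observations u and v
    return prob_distance(u, v), prob_distance(v, u)

def D_expval(u, v):
    m = (u + v) / 2.0
    return (prob_distance(m, v) + prob_distance(m, u)) / 2.0

def D_min(u, v): return min(d_pair(u, v))
def D_max(u, v): return max(d_pair(u, v))
def D_avg(u, v): return sum(d_pair(u, v)) / 2.0

def D_ind(u, v):
    a, b = d_pair(u, v)
    return 1.0 - (1.0 - a) * (1.0 - b)

def D_cart(u, v):
    a, b = d_pair(u, v)
    return sqrt(a * a + b * b)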
2.6 Linking function
In order to provide a unified distance measure for the pattern space in the case of multidimensional data, the distances calculated independently in each dimension have to be combined. The combination, defined as the linking function, can be parametrically dependent on the training set (and usually more computationally demanding) or data-independent. Let D_i(x_i, y_i) denote the distance measure in the i-th dimension and C(x, y) the combined distance measure.
Standard linking This data-independent variant is based on the Cartesian distance definition:

C_{std}(x, y) = \sum_{i=1}^{n} D_i(x_i, y_i)^2
Fig. 2. Contour plots of distances between (u, v) ∈ [0, 1]^2 calculated with the postulated probabilistic distances: (a) D_ExpVal, (b) D_Avg, (c) D_Min, (d) D_Max, (e) D_Ind, (f) D_Cart.
Mahalanobis linking This data-dependent variant is based on the Mahalanobis distance definition, which includes information about the estimated covariances of observations between sub-dimensions. Let Σ denote the covariance matrix of CDF_i(x_i). The distance between events x and y is defined as:

C_{Mah}(x, y) := \sqrt{[D_i(x_i, y_i)]_{i=1}^{n}{}^{T} \, \Sigma^{-1} \, [D_i(x_i, y_i)]_{i=1}^{n}}
Average linking

C_{avg}(x, y) = \frac{1}{n} \sum_{i=1}^{n} D_i(x_i, y_i)
Mahalanobis-Avg-SQRT linking Let Σ denote the covariance matrix of CDF_i(x_i). The distance between events x and y is defined as:

C_{MahAvgSqrt}(x, y) := \frac{1}{n} \sum_{i=1}^{n} \left( \Sigma^{-\frac{1}{2}} \, [D_j(x_j, y_j)]_{j=1}^{n} \right)_i
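A sketch of the four linking functions (d is the vector of per-dimension distances D_i(x_i, y_i); Sigma is assumed to be estimated from the CDF-transformed training patterns; the eigendecomposition-based inverse square root is our implementation choice, not prescribed by the paper):

import numpy as np

def link_std(d):
    return float(np.sum(np.asarray(d, dtype=float) ** 2))

def link_avg(d):
    return float(np.mean(d))

def link_mah(d, Sigma):
    d = np.asarray(d, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

def link_mah_avg_sqrt(d, Sigma):
    d = np.asarray(d, dtype=float)
    w, V = np.linalg.eigh(Sigma)                     # Sigma symmetric positive definite
    Sigma_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return float(np.mean(Sigma_inv_sqrt @ d))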
3 Data sets
In order to provide experimental support for the validity of the presented method and to generate results that can be compared to other sources, data sets available at the UCI Machine Learning Repository [4] were chosen. The benchmark sets were selected according to the following criteria and reasons:
– they represent a binary classification problem
– the number of observations is lower than 1 000 (due to the high computational cost of the leave–one–out method of estimating the accuracy)
– the attributes are of integer, real or binary categorical type, in order to avoid the discrete distance problem
A brief characterization of the selected data sets is presented in Table 1.
Table 1. A brief characterization of the selected data sets

data set                              instance number  attribute number  class proportion
BUPA Liver Disorders                        345                7             200 : 145
Pima Indians Diabetes                       768                9             500 : 268
Wisconsin Diagnostics Breast Cancer         569               31             357 : 212
Sonar                                       208               61             111 : 97
Ionosphere                                  351               34             225 : 126
The data sets used in the following experiments were obtained by the transformation described in Sect. 2.1, with the use of the simple, non-parametric estimation of the ECDF, which resulted in normalization of the data and uniformity of the marginal distributions. In order to make it possible to assess the significance of the experimental results, the raw, untransformed data sets have been used where indicated.
4 Results
All data sets has been evaluated with the use of a k-NN classifier in each combination of
the distance measure and the link function defined in sections 2.5 and 2.6, resp. Results
of misclassification rate estimation (in per cent) have been obtained with use of the
leave–one–out estimator.
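A minimal sketch of such an evaluation loop, assuming the helper functions sketched in Sect. 2 (ecdf_transform, the univariate distances and the linking functions; all names are ours):

import numpy as np
from collections import Counter

def loo_knn_error(X_cdf, y, dist, k=3):
    # Leave-one-out misclassification rate (in per cent) of a k-NN classifier
    # on CDF-transformed patterns, with an arbitrary combined distance 'dist'.
    n = len(y)
    errors = 0
    for i in range(n):
        d = np.array([dist(X_cdf[i], X_cdf[j]) if j != i else np.inf
                      for j in range(n)])
        neighbours = np.argsort(d)[:k]
        predicted = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        errors += int(predicted != y[i])
    return 100.0 * errors / n

# example combined distance: D_Min in each dimension, linked by C_avg
# dist = lambda a, b: link_avg([D_min(ai, bi) for ai, bi in zip(a, b)])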
In order to extend the results described in [1] and to provide context for the current research, the data sets were evaluated both with and without application of the outlier removal procedure proposed in [1].
Numerical results of the evaluation of the proposed distance components are presented in Tables 4 and 2, respectively. The presented results are the misclassification rates of the best (in terms of the highest accuracy of the k-NN classifier) distance function constructed with the use of a given distance component and evaluated in the context of a given data set. The overall ranking column contains the rank of the given component in the ranking of the average ranks of the components. The average ranks are calculated by averaging the ranks of the respective components in the context of the particular data sets. Table 3 provides the comparison with a k-NN classifier based on the standard, Euclidean distance.
Table 2. k-NN 1–CV minimal misclassification rate estimation for the distance built with the given distance element.

distance       Pima          Bupa          WDBC         Sonar          Ionosphere    overall
element        3–NN  5–NN    3–NN  5–NN    3–NN  5–NN   3–NN  5–NN     3–NN  5–NN    ranking
DAvg           26.95 25.78   32.46 30.43   2.81  3.34   12.98 11.54    9.12  9.69       4
DCart          27.21 25.39   32.46 30.14   2.81  3.51   12.5  12.02    8.83  9.69       3
DMin           27.47 25.52   31.3  29.57   2.64  2.99   11.54 12.98    9.4   9.69       1
DInd           25.52 26.04   32.46 29.57   2.99  2.99   13.46 13.46    8.26  8.26       2
DExpVal        27.34 26.43   33.33 29.57   2.28  2.81   12.98 11.06    9.69 11.11       5
DMax           26.69 25.13   33.33 31.01   3.16  3.69   14.42 15.38    9.12 10.26       6
CAvg           26.69 25.13   32.46 30.14   2.28  2.81   12.98 11.06    8.26  8.26       1
CMahAvgSqrt    27.99 26.43   32.46 29.57   2.28  2.81   12.98 11.06    8.55  8.26       2
Cstd           25.52 26.04   33.62 29.57   2.99  2.99   11.54 11.54    9.12  9.12       3
CMah           27.47 27.34   31.3  29.57   2.99  2.99   11.54 11.54    9.12  8.83       3
The results presented in the Tables fully confirm the efficacy of the proposed distance construction: for each evaluated data set the best model obtained, in terms of the misclassification rate, is a model constructed using the proposed components.
Table 3. k-NN 1–CV misclassification rate estimation for the Cartesian distance.

               Pima          Bupa          WDBC         Sonar          Ionosphere
space          3–NN  5–NN    3–NN  5–NN    3–NN  5–NN   3–NN  5–NN     3–NN  5–NN
R^n–Space      30.60 28.52   36.23 33.62   7.38  6.68   18.27 17.31    15.10 15.38
ECDF–Space     27.73 26.56   35.07 31.59   2.99  3.69   12.98 13.94    19.09 22.22
Table 4. k-NN 1–CV minimal misclassification rate estimation for the distance built with the given distance element after the outlier extraction.

distance       Pima          Bupa          WDBC         Sonar          Ionosphere    overall
element        3–NN  5–NN    3–NN  5–NN    3–NN  5–NN   3–NN  5–NN     3–NN  5–NN    ranking
DAvg           21.88 22.27   27.83 28.12   2.64  3.51   12.98 11.54    8.55  9.12       1
DCart          22.27 22.01   28.41 28.12   2.64  3.69   12.5  11.54    8.26  8.83       1
DMin           22.92 22.92   26.67 27.54   2.64  3.16   11.54 12.02    8.83  9.12       3
DInd           22.53 22.92   27.25 28.12   2.99  3.16   13.46 12.5     7.98  7.98       4
DExpVal        23.31 22.92   29.28 28.41   2.28  2.99   12.98 11.06    9.69 10.83       5
DMax           22.66 22.92   29.28 28.41   2.99  3.51   14.42 15.38    8.83  9.97       6
CAvg           21.88 22.27   29.28 28.7    2.28  2.99   12.98 11.06    7.98  7.98       1
CMahAvgSqrt    23.05 22.92   29.86 28.12   2.28  2.99   12.98 11.06    7.98  7.98       2
Cstd           22.53 22.01   28.99 28.12   3.16  3.16   11.54 12.02    8.83  8.55       3
CMah           23.44 22.92   26.67 27.54   3.16  3.16   11.54 12.02    8.55  8.55       4
The relatively high (i.e., poor) rank of the DExpVal component, which can be reduced to a simple difference of CDFs, empirically shows the significance of the other univariate distance measures. The simple linking function CAvg appears to be better than the Cartesian linking for all evaluated probability–based univariate distances.
Finally, the advantage of applying the outlier removal procedure has been clearly shown.
In summary, the use of the probability–based distance measure combined with the outlier removal algorithm allowed the k-NN models to achieve results comparable to or better than the best ones presented in the literature. In particular, the following comparisons with other definitions of metric functions on the considered data sets have been taken into account: weighted distances [7], the adaptive distance measure [5], boosting distance estimation [8] and the cam weighted distance [6]. Table 5 presents a summary of the results accomplished by the k-NN classifiers.
Table 5. k-NN misclassification rate comparison. The results in the first two rows are calculated with the probability-based distance measure, with and without outlier removal, respectively. Results in the following rows are taken from the literature. Non. avail. denotes that the results could not be found in the respective papers.

distance measure                            estimation method  BUPA   Pima         WDBC         Sonar        Ionosp.
probability–based
  with outlier removal (Table 4)            leave-one-out CV   26.67  21.88        2.28         11.06        7.98
probability–based
  without outlier removal (Table 2)         leave-one-out CV   29.57  25.13        2.28         11.06        8.26
adaptive distance measure [5]               leave-one-out CV   30.59  25.13        2.79         12.00        4.29
cam weighted distance [6]                   leave-one-out CV   35.3   Non. avail.  3.5          24.7         6.8
weighted distances [7]                      100 x 5–CV         36.22  27.33        Non. avail.  Non. avail.  Non. avail.
boosting distance estimation [8]            100 x 20/80        33.58  28.91        4.67         25.67        16.27
5 Conclusions
In the paper a new class of measures of distance between events/observations in the pattern space is proposed and experimentally evaluated with the use of a k-NN classifier in the context of binary classification problems. It is shown that the proposed measures produce, on average, better results than training without their use in all of the evaluated cases. Cross-validation estimates of the resulting model quality have been compared with numerical results provided by other researchers concerning k-NN classifiers built with other distance measures. Other possible applications of the presented distance measure (especially in the context of the training sequence construction proposed in [2]), as well as a separate selection of the univariate distance for each dimension, are considered as future research plans.
Acknowledgement
This work was supported by a research grant from the Warsaw University of Technology.
References
1. Dendek, C., Mańdziuk, J.: Improving performance of a binary classifier by training set selection. In Kůrková, V., Neruda, R., Koutník, J., eds.: ICANN (1). Volume 5163 of Lecture Notes in Computer Science., Springer (2008) 128–135
2. Dendek, C., Mańdziuk, J.: Including metric space topology in neural networks training by
ordering patterns. In Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E., eds.: ICANN (2).
Volume 4132 of Lecture Notes in Computer Science., Springer (2006) 644–653
3. Mańdziuk, J., Shastri, L.: Incremental class learning approach and its application to handwritten digit recognition. Inf. Sci. Inf. Comput. Sci. 141(3-4) (2002) 193–217
4. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007)
5. Wang, J., Neskovic, P., Cooper, L.N.: Improving nearest neighbor rule with a simple adaptive
distance measure. Pattern Recogn. Lett. 28(2) (2007) 207–213
6. Zhou, C.Y., Chen, Y.Q.: Improving nearest neighbor classification with cam weighted distance. Pattern Recogn. 39(4) (2006) 635–645
7. Paredes, R., Vidal, E.: Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7) (2006) 1100–1110
8. Amores, J., Sebe, N., Radeva, P.: Boosting the distance estimation. Pattern Recogn. Lett.
27(3) (2006) 201–209