Probability-based distance function for distance-based classifiers

Cezary Dendek¹ and Jacek Mańdziuk²

¹ Warsaw University of Technology, Faculty of Mathematics and Information Science, Plac Politechniki 1, 00-661 Warsaw, POLAND, [email protected]
² Warsaw University of Technology, Faculty of Mathematics and Information Science, Plac Politechniki 1, 00-661 Warsaw, POLAND (phone: (48 22) 621 93 12; fax: (48 22) 625 74 60), [email protected], WWW home page: http://www.mini.pw.edu.pl/~mandziuk/

Abstract. In this paper a new measure of distance between events/observations in the pattern space is proposed and experimentally evaluated with a k-NN classifier in the context of binary classification problems. The application of the proposed approach visibly improves the results, in terms of speed and accuracy, compared to training without the postulated enhancements. The numerical results are very promising and outperform the reference literature results of k-NN classifiers built with other distance measures.

1 Introduction

The problem of constructing and measuring a distance between observations is frequently encountered in numerous application fields. The usual approach is based on constructing distance functions in the directly observed space (usually R^n). However, in many real-life problems the dimensions of observations are mappings of a probability space (e.g. biological data often expresses the genome and the history of an individual). Exploring the features of the probability space and including them in the distance measure, often based on the correlation between dimensions (Mahalanobis distance) or on their individual influence on classification accuracy (weighted distances), usually increases accuracy. Effectively, those improvements change the space of observations, or the space of their differences, using linear and quadratic transformations.

In this paper another direction is explored: measurement of distance in a partially reconstructed and standardized probability space. The model of distance is proposed in Section 2. The benchmark data sets and the results of the numerical evaluation of the proposed distance measure's efficacy in the context of a k-NN classifier are presented in Sections 3 and 4, respectively. Conclusions and directions for future research are placed in the last section.

The presented measure of distance is a direct continuation and generalization of the authors' previous work [1], which introduced a probability-related distance measure, and of the works [2, 3] related to the properties of the metrical structure of the pattern space.

2 Distance in the training patterns space

2.1 Introduction

The pattern space has a naturally defined metric space structure, obtained by its immersion into R^n. This approach, however, does not preserve the structure of the probability space, which can be used to improve the accuracy of estimators. An improved immersion can be obtained with the use of Cumulative Distribution Functions (CDF) by a transformation of the pattern space, as described in [1].

Let CDF_i denote the CDF calculated on the i-th dimension of the pattern space. The transformation of a pattern is defined as follows:

(CDF(x))_i := CDF_i(x_i)

Application of the CDF transformation to the pattern space creates a standardized space (denoted CDF-Space). Projection of the training patterns into CDF-Space results in a uniform distribution of the patterns in each dimension (the marginal distributions are U[0, 1]).
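To make the transformation concrete, the following minimal Python sketch maps patterns into CDF-Space with the simple non-parametric ECDF estimator formalized in the next paragraph; the function name ecdf_transform and the toy data are illustrative assumptions of this sketch, not part of the original paper.

```python
import numpy as np

def ecdf_transform(train: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Map patterns x into CDF-Space using per-dimension ECDFs estimated on the
    training set: ECDF_i(t) = |{z in TrSet : z_i <= t}| / |TrSet|."""
    train = np.asarray(train, dtype=float)
    x = np.asarray(x, dtype=float)
    n, dims = train.shape
    transformed = np.empty_like(x)
    for i in range(dims):
        # Count, for every query value, the training values in dimension i that do not exceed it.
        transformed[:, i] = np.searchsorted(np.sort(train[:, i]), x[:, i], side="right") / n
    return transformed

# Example: the training patterns themselves become (approximately) uniform in every dimension.
rng = np.random.default_rng(0)
patterns = rng.normal(size=(100, 3))
cdf_space = ecdf_transform(patterns, patterns)
print(cdf_space.min(axis=0), cdf_space.max(axis=0))   # all values lie in (0, 1]
```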
Estimation of the CDF (denoted ECDF) can be obtained either by parametric estimation (fitting the parameters of an arbitrarily chosen family of distributions) or by the use of a simple non-parametric estimator:

ECDF_i(x) = \frac{|\{z \in TrSet : z_i \le x_i\}|}{|TrSet|}

where TrSet denotes the training set.

2.2 Model of distance

The structure of the introduced distance measure consists of two components: a univariate distance measure (discussed in Sect. 2.5), providing a measure of distance in a given dimension, and a linking function (discussed in Sect. 2.6), which combines those univariate measures into a single distance value. The proposed distance measures are applicable to probabilistic spaces of structure S_e, introduced in Sect. 2.3.

2.3 Event manifestation error

Let D be the probability distribution of a continuous random variable V over a probability space S. The idea of the proposed probabilistic distance is based on a modification of the standard sampling process: an event E in the space S does not map directly to the value v of V; instead, an error of event manifestation, denoted e, is introduced, and E maps to the neighborhood of v according to the distribution of the error e and the distribution D:

v = V(E + e)

The proposed model creates a new probability space S_e. The process of sampling the space S can be expressed in terms of the CDF of the distribution D by mapping its events to the U[0, 1] distribution and using the inverse transform theorem. The random variable V can be sampled in the following way:

V = CDF^{-1}(U[0, 1])

Let the error of event manifestation be a random variable with distribution Err. The process of sampling the space S_e can then be expressed as:

V = CDF^{-1}(\min(\max(U[0, 1] + Err, 0), 1))

2.4 Model of probabilistic distance

Let v be an observation of a random variable V and let x be a fixed point, x ∈ R. As the distribution Err of the event manifestation error, U[−1, 1] has been chosen for the sake of simplicity. The probabilistic distance from the fixed point x to the observation v is the probability measure of the smallest neighborhood of x generated by the manifestation error e and containing v. In terms of the CDF it can be expressed as:

d(x; v) = \int_{CDF^{-1}(x_c - |x_c - v_c|)}^{CDF^{-1}(x_c + |x_c - v_c|)} dCDF(t) = \min(1, x_c + |x_c - v_c|) - \max(0, x_c - |x_c - v_c|)

where x_c = CDF(x) and v_c = CDF(v). As the postulated measure is a probability, d(x; v) ≥ 0. The contour plot of the function d(x; v) is presented in Fig. 1.

Fig. 1. Probabilistic distance from the fixed point x to the observation v (contour plot; both axes range over [0, 1]).

2.5 Univariate distance measures in CDF-Space

A distance measure on the CDF-Space is required to be symmetric and to operate on observations rather than on fixed values. This goal can be achieved by a combination of d(x; v) values. The contour plots of all proposed variants of the distance are presented in Fig. 2. The following variants have been considered:

Distance based on the expected value

D_ExpVal(u, v) = \frac{d(\frac{u+v}{2}; v) + d(\frac{u+v}{2}; u)}{2}

This variant expresses the assumption that both observed events, u and v, are manifestations of their average value. The expression can be simplified as:

D_ExpVal(u, v) \propto d(\frac{u+v}{2}; v) + d(\frac{u+v}{2}; u) \propto |CDF(u) - CDF(v)|

The obtained simplified form has been evaluated in detail in [1].

Distance based on the min function

D_Min(u, v) = \min(d(u; v), d(v; u))
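The following minimal Python sketch, provided for illustration only, implements the probabilistic distance d(x; v) and the two variants above for arguments that are already mapped into CDF-Space; the helper names are assumptions of this sketch, not the authors' code.

```python
def prob_dist(x_c: float, v_c: float) -> float:
    """Probabilistic distance d(x; v) for points already mapped into CDF-Space
    (x_c = CDF(x), v_c = CDF(v)), with the U[-1, 1] manifestation error:
    d(x; v) = min(1, x_c + |x_c - v_c|) - max(0, x_c - |x_c - v_c|)."""
    r = abs(x_c - v_c)
    return min(1.0, x_c + r) - max(0.0, x_c - r)

def d_expval(u: float, v: float) -> float:
    """D_ExpVal: both observations treated as manifestations of their average value."""
    m = (u + v) / 2.0
    return (prob_dist(m, v) + prob_dist(m, u)) / 2.0

def d_min(u: float, v: float) -> float:
    """D_Min(u, v) = min(d(u; v), d(v; u))."""
    return min(prob_dist(u, v), prob_dist(v, u))

# d(.;.) is asymmetric in its arguments, hence the need for symmetrized variants:
print(prob_dist(0.2, 0.9), prob_dist(0.9, 0.2))   # 0.9 vs. 0.8
print(d_expval(0.2, 0.9), d_min(0.2, 0.9))        # 0.7 and 0.8
```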
Distance based on the max function

D_Max(u, v) = \max(d(u; v), d(v; u))

Distance based on the distance average

D_Avg(u, v) = \frac{d(u; v) + d(v; u)}{2} \propto d(u; v) + d(v; u) = D_Max(u, v) + D_Min(u, v)

Distance based on the independence assumption

D_Ind(u, v) = 1 - (1 - d(u; v))(1 - d(v; u)) = d(u; v) + d(v; u) - d(u; v) d(v; u)

Distance based on the Cartesian sub-linking

D_Cart(u, v) = \sqrt{d(u; v)^2 + d(v; u)^2}

Fig. 2. Contour plots of the distances between (u, v) ∈ [0, 1]^2 calculated with the postulated probabilistic distances: (a) D_ExpVal, (b) D_Avg, (c) D_Min, (d) D_Max, (e) D_Ind, (f) D_Cart.

2.6 Linking function

In order to provide a unified distance measure for the pattern space in the case of multidimensional data, the distances calculated independently in each dimension have to be combined. The combination, defined as the linking function, can be parametrically dependent on the training set (and, usually, more computationally demanding) or data-independent. Let D_i(x_i, y_i) denote the distance measure in the i-th dimension and C(x, y) the combined distance measure.

Standard linking. This data-independent variant is based on the Cartesian distance definition:

C_std(x, y) = \sum_{i=1}^{n} D_i(x_i, y_i)^2

Mahalanobis linking. This data-dependent variant is based on the Mahalanobis distance definition, which includes information about the estimated covariances of the observations between sub-dimensions. Let Σ denote the covariance matrix of CDF_i(x_i). The distance between events x and y is defined as:

C_Mah(x, y) := \sqrt{[D_i(x_i, y_i)]_{i=1}^{n} \, \Sigma^{-1} \, ([D_i(x_i, y_i)]_{i=1}^{n})^T}

Average linking.

C_avg(x, y) = \frac{1}{n} \sum_{i=1}^{n} D_i(x_i, y_i)

Mahalanobis-Avg-SQRT linking. Let Σ denote the covariance matrix of CDF_i(x_i). The distance between events x and y is defined as:

C_MahAvgSqrt(x, y) := \frac{1}{n} \sum_{i=1}^{n} \left( \Sigma^{-1/2} [D_j(x_j, y_j)]_{j=1}^{n} \right)_i
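For illustration, the per-dimension distances can be combined by the linking functions above roughly as in the following Python sketch; the helper names, the choice of D_Min as the univariate measure and the simplified covariance handling are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def d_min(u: float, v: float) -> float:
    """Per-dimension D_Min distance for values already mapped into CDF-Space (Sect. 2.5)."""
    def d(x_c, v_c):
        r = abs(x_c - v_c)
        return min(1.0, x_c + r) - max(0.0, x_c - r)
    return min(d(u, v), d(v, u))

def link_standard(dists: np.ndarray) -> float:
    """C_std: sum of squared per-dimension distances (data-independent)."""
    return float(np.sum(dists ** 2))

def link_average(dists: np.ndarray) -> float:
    """C_avg: average of the per-dimension distances (data-independent)."""
    return float(np.mean(dists))

def link_mahalanobis(dists: np.ndarray, sigma_inv: np.ndarray) -> float:
    """C_Mah: Mahalanobis-style linking using the inverse covariance matrix
    estimated on the CDF-transformed training data (data-dependent)."""
    return float(np.sqrt(dists @ sigma_inv @ dists))

def combined_distance(x, y, link=link_average, **kwargs) -> float:
    """Full distance between two CDF-Space patterns: per-dimension D_Min values
    combined by the chosen linking function."""
    dists = np.array([d_min(xi, yi) for xi, yi in zip(x, y)])
    return link(dists, **kwargs)

# Example usage on toy CDF-Space patterns (uniform in [0, 1] in each dimension).
train_cdf = np.random.default_rng(1).uniform(size=(50, 4))
sigma_inv = np.linalg.inv(np.cov(train_cdf, rowvar=False))
x, y = train_cdf[0], train_cdf[1]
print(combined_distance(x, y))                                               # average linking
print(combined_distance(x, y, link=link_standard))                           # standard linking
print(combined_distance(x, y, link=link_mahalanobis, sigma_inv=sigma_inv))   # Mahalanobis linking
```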
3 Data sets

In order to provide experimental support for the validity of the presented method and to generate results that can be compared with other sources, data sets available at the UCI Machine Learning Repository [4] were chosen. The benchmark sets were selected according to the following criteria and reasons:

– they represent a binary classification problem,
– the number of observations is lower than 1 000 (due to the high computational cost of the leave-one-out method of estimating the accuracy),
– they contain integer, real or binary categorical attributes, in order to avoid the discrete distance problem.

A brief characterization of the selected data sets is presented in Table 1.

Table 1. A brief characterization of the selected data sets

data set                              instances  attributes  class proportion
BUPA Liver Disorders                  345        7           200 : 145
Pima Indians Diabetes                 768        9           500 : 268
Wisconsin Diagnostics Breast Cancer   569        31          357 : 212
Sonar                                 208        61          111 : 97
Ionosphere                            351        34          225 : 126

The data sets used in the following experiments were obtained by the transformation described in Sect. 2.1 with the use of the simple, non-parametric estimation of the ECDF, which resulted in normalization of the data and uniformity of the marginal distributions. In order to make it possible to assess the significance of the experimental results, the raw, untransformed data sets have been used where indicated.

4 Results

All data sets have been evaluated with the use of a k-NN classifier for each combination of the distance measure and the linking function defined in Sections 2.5 and 2.6, respectively. Misclassification rates (in per cent) have been estimated with the leave-one-out estimator (a schematic sketch of this protocol is given after Table 3). In order to extend the results described in [1] and to provide context for the current research, the data sets were evaluated with and without application of the outlier removal procedure proposed in [1]. Numerical results of the evaluation of the proposed distance components are presented in Tables 4 and 2, respectively. The presented results are the misclassification rates of the best (in terms of the highest accuracy of the k-NN classifier) distance function constructed with the use of a given distance component, evaluated on a given data set. The overall ranking column contains the rank of the given component in a ranking of the components' average ranks, where the average rank of a component is obtained by averaging its ranks over the individual data sets. Table 3 provides the comparison with a k-NN classifier based on the standard, Euclidean distance.

Table 2. k-NN 1-CV minimal misclassification rate estimation for the distance built with the given distance element.

                  Pima          Bupa          WDBC          Sonar         Ionosphere    overall
distance element  3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    ranking
D_Avg             26.95 25.78   32.46 30.43   2.81  3.34    12.98 11.54   9.12  9.69    4
D_Cart            27.21 25.39   32.46 30.14   2.81  3.51    12.5  12.02   8.83  9.69    3
D_Min             27.47 25.52   31.3  29.57   2.64  2.99    11.54 12.98   9.4   9.69    1
D_Ind             25.52 26.04   32.46 29.57   2.99  2.99    13.46 13.46   8.26  8.26    2
D_ExpVal          27.34 26.43   33.33 29.57   2.28  2.81    12.98 11.06   9.69  11.11   5
D_Max             26.69 25.13   33.33 31.01   3.16  3.69    14.42 15.38   9.12  10.26   6
C_avg             26.69 25.13   32.46 30.14   2.28  2.81    12.98 11.06   8.26  8.26    1
C_MahAvgSqrt      27.99 26.43   32.46 29.57   2.28  2.81    12.98 11.06   8.55  8.26    2
C_std             25.52 26.04   33.62 29.57   2.99  2.99    11.54 11.54   9.12  9.12    3
C_Mah             27.47 27.34   31.3  29.57   2.99  2.99    11.54 11.54   9.12  8.83    3

Table 3. k-NN 1-CV misclassification rate estimation for the Cartesian distance.

              Pima          Bupa          WDBC          Sonar         Ionosphere
space         3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN
Rn-Space      30.60 28.52   36.23 33.62   7.38  6.68    18.27 17.31   15.10 15.38
ECDF-Space    27.73 26.56   35.07 31.59   2.99  3.69    12.98 13.94   19.09 22.22
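For illustration, the leave-one-out evaluation protocol described at the beginning of this section can be sketched as follows; the function names and the toy data are assumptions of this sketch, and it is not the code used for the reported experiments.

```python
import numpy as np

def loo_misclassification(X: np.ndarray, y: np.ndarray, dist, k: int = 3) -> float:
    """Leave-one-out misclassification rate (in per cent) of a k-NN classifier
    using an arbitrary pairwise distance function dist(a, b)."""
    n = len(y)
    # Precompute the full pairwise distance matrix once.
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    errors = 0
    for i in range(n):
        order = np.argsort(D[i])
        neighbours = [j for j in order if j != i][:k]    # exclude the left-out pattern
        predicted = np.bincount(y[neighbours]).argmax()  # majority vote of the k neighbours
        errors += int(predicted != y[i])
    return 100.0 * errors / n

# Toy usage with the Euclidean distance; in the experiments the probability-based
# univariate distances and linking functions of Sect. 2.5-2.6 are plugged in instead.
rng = np.random.default_rng(2)
X = rng.uniform(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
print(loo_misclassification(X, y, lambda a, b: float(np.linalg.norm(a - b)), k=3))
```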
Table 4. k-NN 1-CV minimal misclassification rate estimation for the distance built with the given distance element after the outlier extraction.

                  Pima          Bupa          WDBC          Sonar         Ionosphere    overall
distance element  3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    3-NN  5-NN    ranking
D_Avg             21.88 22.27   27.83 28.12   2.64  3.51    12.98 11.54   8.55  9.12    1
D_Cart            22.27 22.01   28.41 28.12   2.64  3.69    12.5  11.54   8.26  8.83    1
D_Min             22.92 22.92   26.67 27.54   2.64  3.16    11.54 12.02   8.83  9.12    3
D_Ind             22.53 22.92   27.25 28.12   2.99  3.16    13.46 12.5    7.98  7.98    4
D_ExpVal          23.31 22.92   29.28 28.41   2.28  2.99    12.98 11.06   9.69  10.83   5
D_Max             22.66 22.92   29.28 28.41   2.99  3.51    14.42 15.38   8.83  9.97    6
C_avg             21.88 22.27   29.28 28.7    2.28  2.99    12.98 11.06   7.98  7.98    1
C_MahAvgSqrt      23.05 22.92   29.86 28.12   2.28  2.99    12.98 11.06   7.98  7.98    2
C_std             22.53 22.01   28.99 28.12   3.16  3.16    11.54 12.02   8.83  8.55    3
C_Mah             23.44 22.92   26.67 27.54   3.16  3.16    11.54 12.02   8.55  8.55    4

The results presented in the tables fully confirm the efficacy of the proposed distance construction: for each evaluated data set the best obtained model, in terms of the misclassification rate, is a model constructed using the proposed components. The relatively high (i.e. weak) observed rank of the D_ExpVal component, which can be reduced to a simple difference of CDFs, empirically shows the significance of the other univariate distance measures. The simple linking function C_avg appears to be better than the standard (Cartesian) linking for all evaluated probability-based univariate distances. Finally, the advantage of applying the outlier removal procedure has been clearly shown.

In summary, the use of the probability-based distance measure combined with the outlier removal algorithm allowed the k-NN models to achieve results comparable to or better than the best ones presented in the literature. In particular, the following comparisons with other definitions of the metric functions on the considered data sets have been taken into account: weighted distances [7], adaptive distance measure [5], boosting distance estimation [8] and cam weighted distance [6]. Table 5 presents the summary of the results accomplished by the k-NN classifiers.

Table 5. k-NN misclassification rate comparison. The results in the first two rows are calculated with the probability-based distance measure, with and without outlier removal, respectively. Results in the following rows are taken from the literature. "Non. avail." denotes that the results could not be found in the respective papers.

distance measure                                       estimation method   BUPA   Pima   WDBC         Sonar        Ionosp.
probability-based with outlier removal (Table 4)       leave-one-out CV    26.67  21.88  2.28         11.06        7.98
probability-based without outlier removal (Table 2)    leave-one-out CV    29.57  25.13  2.28         11.06        8.26
adaptive distance measure [5]                          leave-one-out CV    30.59  25.13  2.79         12.00        4.29
cam weighted distance [6]                              leave-one-out CV    35.3   3.5    Non. avail.  6.8          24.7
weighted distances [7]                                 100 x 5-CV          36.22  27.33  Non. avail.  Non. avail.  Non. avail.
boosting distance estimation [8]                       100 x 20/80         33.58  28.91  4.67         25.67        16.27

5 Conclusions

In the paper a new class of measures of distance between events/observations in the pattern space is proposed and experimentally evaluated with the use of a k-NN classifier in the context of binary classification problems. It is shown that the proposed measures produce on average better results than training without their use in all of the evaluated cases. The cross-validation estimate of the resulting model quality has been compared with the numerical results provided by other researchers for k-NN classifiers built with other distance measures. Other possible applications of the presented distance measure (especially in the context of the training sequence construction proposed in [2]), as well as separate selection of a univariate distance for each dimension, are considered as future research plans.
Acknowledgement

This work was supported by a research grant from the Warsaw University of Technology.

References

1. Dendek, C., Mańdziuk, J.: Improving performance of a binary classifier by training set selection. In: Kůrková, V., Neruda, R., Koutník, J. (eds.): ICANN 2008, Part I. Lecture Notes in Computer Science, vol. 5163, Springer (2008) 128–135
2. Dendek, C., Mańdziuk, J.: Including metric space topology in neural networks training by ordering patterns. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.): ICANN 2006, Part II. Lecture Notes in Computer Science, vol. 4132, Springer (2006) 644–653
3. Mańdziuk, J., Shastri, L.: Incremental class learning approach and its application to handwritten digit recognition. Information Sciences 141(3–4) (2002) 193–217
4. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007)
5. Wang, J., Neskovic, P., Cooper, L.N.: Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters 28(2) (2007) 207–213
6. Zhou, C.Y., Chen, Y.Q.: Improving nearest neighbor classification with cam weighted distance. Pattern Recognition 39(4) (2006) 635–645
7. Paredes, R., Vidal, E.: Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7) (2006) 1100–1110
8. Amores, J., Sebe, N., Radeva, P.: Boosting the distance estimation. Pattern Recognition Letters 27(3) (2006) 201–209