All India Seminar on ICT for Rural Development: Access and Applications (ICTRD-2009)
Held under the Computer Engineering Division, The Institution of Engineers (India), Local Chapter, Udaipur, 2009
ISBN: 978-81-7906-197-8

Detecting Local Outliers in High Dimensional Space

Dr. Dharm Singh1, Tarun Kulshreshtha2, Divya Saxena2, Ankur Agrawal3

1 Department of Computer Science, College of Technology and Engineering, Maharana Pratap University of Agriculture and Technology, Udaipur, India, [email protected]
2 Department of Computer Science, G.L.A. Institute of Technology and Management, Mathura, India, [email protected]
3 Department of Computer Science, Indian Institute of Technology Bombay, Powai, Mumbai, Maharashtra, India, [email protected]

Abstract - Outlier detection has received significant attention in many applications, such as detecting credit card fraud or network intrusions. Existing studies in outlier detection mostly focus on detecting outliers in the full feature space. However, most algorithms tend to break down in high-dimensional feature spaces because classes of objects often exist in specific subspaces of the original feature space. As a novel solution to this problem, we propose a preference dimension based outlier detection technique, which uses a unique subspace for calculating the distance between objects. Using this concept we adapt local density based outlier detection to cope with high-dimensional data. A broad experimental evaluation shows that this approach yields results of significantly better quality than existing algorithms.

Keywords: Outlier detection, high-dimensional data

1. INTRODUCTION

Outlier mining is an active research field in data mining and has many practical applications in different domains, such as financial fraud detection [1], network intrusion detection [2], analysis of abnormal medical reactions [3], and signal preprocessing [4]. Hawkins defined an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [5]. Based on this definition, researchers have proposed various schemes for outlier mining. Statistical approaches were the earliest ones used for outlier mining [6]. They assume that the dataset follows a distribution and that an object having low probability with respect to that distribution may be an outlier [7]-[8]. However, this approach requires the underlying distribution of the dataset to be known in advance in order to compute the outlier scores, which is usually not the case. In order to overcome the limitations of the statistical approaches, distance-based [9] and density-based [10] approaches were introduced, which use the k-nearest neighbors (KNN) of a point to compute the similarity between data points.

Many useful outlier detection methods proposed in the last decade compute the outlier score of the data points in the complete feature space, i.e. each dimension is weighted equally when computing the distance between points. These approaches are successful for low-dimensional data sets. However, in higher-dimensional feature spaces their accuracy and efficiency deteriorate significantly.

Fig. 1. Outliers according to local density-based outlier detection (a) and according to preference dimension based outlier detection (b).
The major reason for this behavior is the so-called curse of dimensionality: in high-dimensional feature spaces, a full-dimensional distance is often no longer meaningful, since the nearest neighbor of a point is expected to be almost as far away as its farthest neighbor [11]. In this paper, we use the concept of preference dimension to solve this problem. Our new approach, PDOF, is founded on the concept of the local outlier factor proposed for density-based local outlier detection [10]. In the example given in Fig. 1, a local density based algorithm detects only one outlier, o1, as shown in Fig. 1(a), while the preference dimension based algorithm detects two outliers, o1 and o2, as shown in Fig. 1(b). In order to ensure the quality of outlier detection in high-dimensional datasets, we suggest using only a subset of the full feature space, called the preference dimension, to compute outliers in a subspace instead of using every dimension with equal importance. Thus, we build a weight vector based on the variance of each attribute and use a weighted Euclidean distance measure based on this weight vector. Using this more flexible model, we propose the algorithm PDOF (Preference Dimension Outlier Factor) to efficiently compute exact solutions of the subspace outlier detection problem. The user can select a parameter β indicating the variance threshold for feature selection in the preference dimension of the dataset. Only those features with a variance of no more than β are included in the preference dimension.

The remainder of the paper is organized as follows. In Section 2, we discuss related work and point out our contributions. In Section 3, we formalize our notion of preference dimension to efficiently detect subspace based outliers. In Section 4, we discuss the time complexity of our approach. Section 5 contains an extensive experimental evaluation and Section 6 concludes the paper.

2. RELATED WORK

2.1 LOF Algorithm

LOF (Local Outlier Factor) [10] is the first outlier notion that also quantifies how outlying an object is. The LOF value of an object is based on the average of the ratios of the local reachability density of the area around the object and the local reachability densities of its neighbors. The size of the neighborhood of the object is determined by the area containing a user-supplied minimum number of points (MinPts). The concepts and terms needed to explain the LOF algorithm are defined as follows.

Definition 1: (k-distance of an object p) For any positive integer k, the k-distance of object p, denoted k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that: (i) for at least k objects o′ ∈ D\{p} it holds that d(p, o′) ≤ d(p, o), and (ii) for at most k - 1 objects o′ ∈ D\{p} it holds that d(p, o′) < d(p, o).

Definition 2: (k-distance neighborhood of an object p) Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance.

Definition 3: (reachability distance of an object p w.r.t. object o) Let k be a natural number. The reachability distance of object p with respect to object o is defined as

    reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }

Let MinPts be the only parameter and let the values reach-dist_MinPts(p, o), for o ∈ N_MinPts(p), be a measure of the volume used to determine the density in the neighborhood of an object p.

Definition 4: (local reachability density of an object p) The local reachability density of p is defined as

    lrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )

Definition 5: (local outlier factor of an object p) The local outlier factor of p is defined as

    LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

It is easy to see that the lower lrd_MinPts(p) is, and the higher the lrd_MinPts(o), o ∈ N_MinPts(p), are, the higher the LOF value of p. The outlier factor of object p captures the degree to which we call p an outlier. However, the weakness of the LOF algorithm is that it cannot detect outliers that exist only in a subspace of the full feature space.
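To make these definitions concrete, the following is a minimal sketch of the LOF computation in Java (the language used for the implementations evaluated in Section 5). It follows Definitions 1-5 directly with a brute-force neighbor search; the class and method names are illustrative choices only and are not taken from [10] or from the evaluated implementation, and duplicate points (which make lrd infinite) are not handled.

import java.util.*;

/** Illustrative sketch of the LOF definitions above; brute-force O(n^2) neighbor search. */
public class LofSketch {

    static double dist(double[] p, double[] q) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += (p[i] - q[i]) * (p[i] - q[i]);
        return Math.sqrt(s);
    }

    /** k-distance(p): distance to the k-th nearest other object (Definition 1). */
    static double kDistance(double[][] d, int p, int k) {
        double[] ds = new double[d.length - 1];
        int j = 0;
        for (int o = 0; o < d.length; o++) if (o != p) ds[j++] = dist(d[p], d[o]);
        Arrays.sort(ds);
        return ds[k - 1];
    }

    /** k-distance neighborhood of p (Definition 2). */
    static List<Integer> neighborhood(double[][] d, int p, int k) {
        double kd = kDistance(d, p, k);
        List<Integer> n = new ArrayList<>();
        for (int o = 0; o < d.length; o++)
            if (o != p && dist(d[p], d[o]) <= kd) n.add(o);
        return n;
    }

    /** reach-dist_k(p, o) = max{ k-distance(o), d(p, o) } (Definition 3). */
    static double reachDist(double[][] d, int p, int o, int k) {
        return Math.max(kDistance(d, o, k), dist(d[p], d[o]));
    }

    /** lrd_MinPts(p): local reachability density (Definition 4). */
    static double lrd(double[][] d, int p, int minPts) {
        List<Integer> n = neighborhood(d, p, minPts);
        double sum = 0;
        for (int o : n) sum += reachDist(d, p, o, minPts);
        return 1.0 / (sum / n.size());
    }

    /** LOF_MinPts(p): average ratio of the neighbors' lrd to p's lrd (Definition 5). */
    static double lof(double[][] d, int p, int minPts) {
        List<Integer> n = neighborhood(d, p, minPts);
        double lrdP = lrd(d, p, minPts);
        double sum = 0;
        for (int o : n) sum += lrd(d, o, minPts) / lrdP;
        return sum / n.size();
    }

    public static void main(String[] args) {
        // Four clustered points and one isolated point; the latter gets the largest LOF value.
        double[][] data = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {5, 5} };
        for (int p = 0; p < data.length; p++)
            System.out.printf("LOF(%d) = %.3f%n", p, lof(data, p, 2));
    }
}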
2.2 DSNOF Algorithm

The DSNOF algorithm [12] considers the density similarity between an object and its neighbors and calculates the DSNOF value. In this algorithm each object in the dataset is assigned a density-similarity-neighbor based outlier factor (DSNOF) to indicate its degree of outlier-ness. The algorithm calculates the densities of an object and its neighbors and constructs the similar density series (SDS) in the neighborhood of the object. Based on the SDS, it computes the average series cost (ASC) of the objects. Finally, it calculates the DSNOF of the object based on the ASC of the object and its neighbors.

2.3 Subspace Outlier Detection in Data with Mixture of Variances and Noise

This algorithm [13] introduced a bottom-up approach to discover clusters of outliers in any m-dimensional subspace of an n-dimensional space. It uses an outlier score function based on the Chebyshev (L∞ norm) distance in order to properly rank the outliers. First, it computes an outlier score for every point in each dimension, based on the observation that if a point is an outlier in a subspace, its score must be high in each dimension of that subspace. It then aggregates these scores to compute the final outlier score for the points in the dataset, using a filter threshold to eliminate high-dimensional noise during the aggregation.

2.4 Our Contributions

In this paper, we make the following contributions. In order to enhance the quality of density-based outlier detection algorithms in high-dimensional space, we extend the well-founded notion of density-based local outliers to ensure high quality results even in high-dimensional spaces. We do not use any sampling or approximation techniques, so the result of our outlier mining algorithm is deterministic. We propose an efficient method called PDOF which is able to detect all subspace based outliers of a certain dimensionality in a single scan over the database and is linear in the number of dimensions. Finally, we successfully apply our algorithm PDOF to several real-world data sets, showing its superior performance over existing approaches.

3. THE PROPOSED ALGORITHM (PDOF)

In this section, we formalize the notion of preference dimension based outliers. Let D be a database of d-dimensional points (D ⊆ R^d), where the set of attributes is denoted by A = {A_1, ..., A_d}. The projection of a point p onto an attribute A_i ∈ A is denoted by π_{A_i}(p).
Intuitively, a preference dimension based outlier is an object whose density is low compared to that of its neighbors in a certain subspace of the full feature space. In order to find meaningful attributes to form a preference dimension, we select only those attributes which have low variance over the complete dataset. However, some values of an attribute may lie far from its mean (possibly because of outliers). These values would prevent the variance calculation from capturing the normal behavior of the dataset, so we need to eliminate them. For that we use the standard deviation as a measure.

Definition 1 (controlled variance along an attribute) Let g_i be the mean value and σ_i the standard deviation of an attribute A_i ∈ A. The controlled variance of the dataset along A_i, denoted by Var_{A_i}(D), is defined as

    Var_{A_i}(D) = ( Σ_{p ∈ D : g_i - 3σ_i < π_{A_i}(p) < g_i + 3σ_i} dist(π_{A_i}(p), g_i)² ) / |{p ∈ D : g_i - 3σ_i < π_{A_i}(p) < g_i + 3σ_i}|

The intuition of our formalization is to include in the preference dimension those attributes which have a low variance in the dataset. Therefore, we use a weight vector w_p to calculate the distance between two points in the preference dimension.

Definition 2 (preference dimension similarity measure) Let β ∈ R be a user-defined threshold on the allowed variance. Let w_p = (w_1, w_2, ..., w_d) be the so-called preference dimension weight vector, where

    w_i = 1 if Var_{A_i}(D) ≤ β, and w_i = 0 if Var_{A_i}(D) > β

The preference dimension similarity measure between two points p, q ∈ D is defined as

    dist_pref(p, q) = √( Σ_{i=1}^{d} w_i · (π_{A_i}(p) - π_{A_i}(q))² )

where w_i is the i-th component of w_p. Note that the preference dimension similarity measure dist_pref(p, q) is simply a weighted Euclidean distance. The parameter β specifies the threshold for a low variance. As we are only interested in distinguishing between dimensions with low variance and all other dimensions, weighting the dimensions inversely proportionally to their variance is not useful. Thus, our weight vector has only two possible values. After calculating the preference dimension distance between data objects, we use this distance measure to determine the outlying degree of each object, as in the LOF algorithm with some modifications.
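Before stating the remaining definitions, the following minimal Java sketch illustrates the controlled variance, the preference dimension weight vector, and the preference dimension distance of Definitions 1 and 2. The class and method names are illustrative only and are not part of the formal model or of the evaluated implementation.

import java.util.*;

/** Illustrative sketch of Definitions 1 and 2 (controlled variance, weight vector, dist_pref). */
public class PreferenceDimension {

    /** Controlled variance of attribute i: variance of the values lying within
     *  three standard deviations of the attribute mean (Definition 1). */
    static double controlledVariance(double[][] data, int i) {
        double mean = 0, sd = 0;
        for (double[] p : data) mean += p[i];
        mean /= data.length;
        for (double[] p : data) sd += (p[i] - mean) * (p[i] - mean);
        sd = Math.sqrt(sd / data.length);
        double sum = 0;
        int count = 0;
        for (double[] p : data) {
            if (p[i] > mean - 3 * sd && p[i] < mean + 3 * sd) {   // 3-sigma filter
                sum += (p[i] - mean) * (p[i] - mean);
                count++;
            }
        }
        return sum / count;
    }

    /** Binary weight vector: w_i = 1 iff the controlled variance of attribute i
     *  does not exceed the user threshold beta (Definition 2). */
    static double[] weightVector(double[][] data, double beta) {
        int d = data[0].length;
        double[] w = new double[d];
        for (int i = 0; i < d; i++) w[i] = controlledVariance(data, i) <= beta ? 1 : 0;
        return w;
    }

    /** Preference dimension similarity measure: weighted Euclidean distance (Definition 2). */
    static double distPref(double[] p, double[] q, double[] w) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += w[i] * (p[i] - q[i]) * (p[i] - q[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Attribute 0 has low variance, attribute 1 is noisy; only attribute 0 gets weight 1.
        double[][] data = { {0.0, 10}, {0.1, -20}, {0.2, 30}, {0.1, -40} };
        double[] w = weightVector(data, 0.5);
        System.out.println(Arrays.toString(w));           // [1.0, 0.0]
        System.out.println(distPref(data[0], data[3], w)); // distance measured only along attribute 0
    }
}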
Definition 3 (preference dimension k-distance of an object p) For any positive integer k, the preference dimension k-distance of object p, denoted k-dist_pref(p), is defined as the distance dist_pref(p, o) between p and an object o ∈ D such that: (i) for at least k objects o′ ∈ D\{p} it holds that dist_pref(p, o′) ≤ dist_pref(p, o), and (ii) for at most k - 1 objects o′ ∈ D\{p} it holds that dist_pref(p, o′) < dist_pref(p, o).

Definition 4 (preference dimension k-distance neighborhood of an object p) Given the preference dimension k-distance of p, the preference dimension k-distance neighborhood of p contains every object whose preference dimension distance from p is not greater than the preference dimension k-distance.

Definition 5 (preference dimension reachability distance of an object p w.r.t. object o) Let k be a natural number. The preference dimension reachability distance of object p with respect to object o is defined as

    pref-reach-dist_k(p, o) = max{ k-dist_pref(o), dist_pref(p, o) }

Let MinPts be the only parameter and let the values pref-reach-dist_MinPts(p, o), for o ∈ N_MinPts(p), be a measure of the volume used to determine the density in the neighborhood of an object p.

Definition 6 (preference dimension reachability density of an object p) The preference dimension reachability density of p is defined as

    pdrd_MinPts(p) = 1 / ( ( Σ_{o ∈ N_MinPts(p)} pref-reach-dist_MinPts(p, o) ) / |N_MinPts(p)| )

Definition 7 (preference dimension outlier factor of an object p) The preference dimension outlier factor of p is defined as

    PDOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} pdrd_MinPts(o) / pdrd_MinPts(p) ) / |N_MinPts(p)|

The higher the pdrd_MinPts(o), o ∈ N_MinPts(p), are relative to pdrd_MinPts(p), the higher the PDOF value of p. The outlier factor of object p captures the degree to which we call p an outlier. Thus, we can find n outliers existing in a certain subspace of the full feature space by taking the top n objects ordered by their PDOF values.

4. COMPLEXITY ANALYSIS

The overall worst-case time complexity of our algorithm, based on a sequential scan of the dataset, is O(d · n²). The algorithm first calculates the variance along each attribute, which takes O(d · n) time. Then we assign the preference dimension weight vector, which takes O(d) time. The preference dimension k-distance neighborhoods of all points can be computed in O(d · n²) time. The calculation of the preference dimension reachability distances requires linear time. The pdrd and PDOF values of all points can then be calculated in O(k · n) time. Thus the worst-case complexity of our algorithm is O(d · n²), assuming no index structure is used. Since this complexity is acceptable in practice, our algorithm can be used to detect outliers in high-dimensional databases.
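To make the preceding definitions and the sequential-scan analysis concrete, the following minimal Java sketch computes PDOF scores with a brute-force neighbor search, following Definitions 3-7. It assumes a weight vector built as in the earlier preference dimension sketch; all class and method names are illustrative only and this is not the implementation evaluated in Section 5.

import java.util.*;

/** Illustrative sketch of the PDOF score (Definitions 3-7); the weight vector w is
 *  assumed to have been computed as in the earlier PreferenceDimension sketch. */
public class PdofSketch {

    static double distPref(double[] p, double[] q, double[] w) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += w[i] * (p[i] - q[i]) * (p[i] - q[i]);
        return Math.sqrt(s);
    }

    /** Preference dimension k-distance of point p (Definition 3). */
    static double kDistPref(double[][] d, int p, int k, double[] w) {
        double[] ds = new double[d.length - 1];
        int j = 0;
        for (int o = 0; o < d.length; o++) if (o != p) ds[j++] = distPref(d[p], d[o], w);
        Arrays.sort(ds);
        return ds[k - 1];
    }

    /** Preference dimension k-distance neighborhood of p (Definition 4). */
    static List<Integer> neighborhood(double[][] d, int p, int k, double[] w) {
        double kd = kDistPref(d, p, k, w);
        List<Integer> n = new ArrayList<>();
        for (int o = 0; o < d.length; o++)
            if (o != p && distPref(d[p], d[o], w) <= kd) n.add(o);
        return n;
    }

    /** pref-reach-dist_k(p, o) = max{ k-dist_pref(o), dist_pref(p, o) } (Definition 5). */
    static double prefReachDist(double[][] d, int p, int o, int k, double[] w) {
        return Math.max(kDistPref(d, o, k, w), distPref(d[p], d[o], w));
    }

    /** pdrd_MinPts(p): preference dimension reachability density (Definition 6). */
    static double pdrd(double[][] d, int p, int minPts, double[] w) {
        List<Integer> n = neighborhood(d, p, minPts, w);
        double sum = 0;
        for (int o : n) sum += prefReachDist(d, p, o, minPts, w);
        return 1.0 / (sum / n.size());
    }

    /** PDOF_MinPts(p): average ratio of the neighbors' pdrd to p's pdrd (Definition 7). */
    static double pdof(double[][] d, int p, int minPts, double[] w) {
        List<Integer> n = neighborhood(d, p, minPts, w);
        double pdrdP = pdrd(d, p, minPts, w);
        double sum = 0;
        for (int o : n) sum += pdrd(d, o, minPts, w) / pdrdP;
        return sum / n.size();
    }

    public static void main(String[] args) {
        // Attribute 0 is the low-variance (preference) dimension, attribute 1 is noisy;
        // the last point deviates only in the preference dimension and gets the largest PDOF.
        double[][] data = { {0.0, 10}, {0.1, -20}, {0.2, 30}, {0.1, -40}, {3.0, 0} };
        double[] w = { 1, 0 };   // weight vector assumed precomputed for some beta
        for (int p = 0; p < data.length; p++)
            System.out.printf("PDOF(%d) = %.3f%n", p, pdof(data, p, 2, w));
    }
}

Ranking the objects by the printed scores and taking the top n yields the subspace outliers described above.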
5. EXPERIMENTAL EVALUATION

We conducted all our experiments on a workstation with a Pentium 4 2.6 GHz processor and 1.5 GB of RAM. We implemented all algorithms in Java. In order to test the ability of our algorithm to detect outliers, we used the KDD CUP 99 Network Connections Data Set from the UCI repository [14]. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. The task was to detect the attack connections without any prior knowledge about the properties of the network intrusion attacks. Detection is based solely on the hypothesis that attack connections behave differently from normal network activity, which makes them outliers.

We created a test dataset of 1000 connections from the original KDD dataset. Each record has 38 continuous attributes representing the statistics of a connection and its associated connection type, e.g. normal or buffer overflow attack. A very small number of attack connections were randomly selected: there are 22 types of attacks, with sizes varying from 1 to 4, giving 43 attack connections in total in the test dataset.

We used the LOF algorithm as a baseline to compare our results against, as it is a well-known outlier detection method. We ran both algorithms for different values of the input parameters and found our algorithm performing better in most of the cases. Table 1 summarizes the results of running both the LOF and PDOF algorithms on the test dataset with MinPts = 10 and β = 0.0005 for PDOF.

TABLE 1. RESULTS OF RUNNING LOF AND PDOF ALGORITHMS ON TEST DATASET

K      LOF    PDOF
20     8      7
30     12     13
40     14     15
50     18     20
60     21     23
80     26     28
100    30     30

Fig. 3. Execution time for the LOF and PDOF algorithms.

In Table 1, K indicates the total number of outliers reported (including correct and incorrect detections), and the entries under LOF and PDOF indicate the actual number of intrusion attacks detected by the two algorithms. As the table shows, our PDOF algorithm consistently performed better than the LOF algorithm. In Fig. 3, we plot the execution time of the two algorithms against the number of data points.

6. CONCLUSION

In this paper we proposed an enhancement of the LOF outlier detection algorithm to cope with the curse of dimensionality by introducing the notion of preference dimension, so that outliers existing in a certain subspace of the full feature space are detected. We have shown that our approach is superior to the well-known LOF algorithm for detecting outliers in high-dimensional databases. The proposed algorithm PDOF first detects the meaningful attributes by limiting the allowed variance in the dataset; a parameter β is used to limit this variance. Future work could determine the value of the parameter β automatically to obtain better performance on any given data set.

REFERENCES

[1] Dianmin Yue, Xiaodan Wu, Yunfeng Wang, Yue Li, and Chao-Hsien Chu, "A Review of Data Mining-Based Financial Fraud Detection Research", Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, September 21-25, 2007, pp. 5514-5517.
[2] Jiong Zhang and Mohammad Zulkernine, "Anomaly Based Network Intrusion Detection with Unsupervised Outlier Detection", Proceedings of the 2006 IEEE International Conference on Communications, Istanbul, Turkey, June 11-15, 2006, pp. 2388-2393.
[3] Vili Podgorelec, Marjan Heričko, and Ivan Rozman, "Improving Mining of Medical Data by Outliers Prediction", Proceedings of the 18th IEEE International Symposium on Computer-Based Medical Systems, Trinity College Dublin, Ireland, June 23-24, 2005, pp. 91-96.
[4] Jari Näsi, Aki Sorsa, and Kauko Leiviskä, "Sensor Validation and Outlier Detection Using Fuzzy Limits", Proceedings of the 44th IEEE Conference on Decision and Control and the European Control Conference, Seville, Spain, December 12-15, 2005, pp. 7828-7833.
[5] D. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980.
[6] Victoria J. Hodge and Jim Austin, "A Survey of Outlier Detection Methodologies", Artificial Intelligence Review, Kluwer Academic, Netherlands, 2004, pp. 85-126.
[7] E. Eskin, "Anomaly Detection over Noisy Data Using Learned Probability Distributions", Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, June 29-July 2, 2000, pp. 255-262.
[8] K. Yamanishi and J. Takeuchi, "Discovering Outlier Filtering Rules from Unlabeled Data: Combining a Supervised Learner with an Unsupervised Learner", Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 26-29, 2001, pp. 389-394.
[9] Edwin M. Knorr and Raymond T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proceedings of the 24th International Conference on Very Large Data Bases (VLDB '98), San Francisco, CA, USA, 1998, pp. 392-403, Morgan Kaufmann Publishers Inc.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers", SIGMOD Record, 29(2):93-104, 2000.
[11] A. Hinneburg, C. C. Aggarwal, and D. A. Keim, "What is the Nearest Neighbor in High Dimensional Spaces?", Proceedings of the 26th International Conference on Very Large Databases (VLDB '00), Cairo, Egypt, 2000.
[12] Hui Cao, Gangquan Si, Wenzhi Zhu, and Yanbin Zhang, "Enhancing Effectiveness of Density-based Outlier Mining", Proceedings of the 2008 International Symposiums on Information Processing, 2008.
[13] M. Q. Nguyen, L. Mark, and E. Omiecinski, "Subspace Outlier Detection in Data with Mixture of Variances and Noise", Technical Report GT-CS-08-11, Georgia Institute of Technology, Atlanta, GA 30332, USA, 2008.
[14] C. B. D. Newman and C. Merz, UCI Repository of Machine Learning Databases, 1998.